
Writing a parser for BibTeX bibliography files is more challenging than it
appears. In particular, the BibTeX language cannot be tokenized with a standard
"local" approach, because the interpretation of certain characters is heavily
context-dependent. The best example is a quotation mark encountered while
reading a string literal: it marks the end of the string literal unless it is
at non-zero *brace depth*. Consider the following line:

.. code::

    title = "My {"}wonderful{"} Title"

The internal quotation marks are interpreted as regular characters because they
are at brace depth 1.
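
The brace-depth rule can be sketched in a few lines of Python. This is only an
illustration, not the original implementation: the function name
``read_quoted_literal`` is hypothetical, and treating an unmatched closing
brace as an error is an assumption based on the grammar given later in this
document.

.. code:: python

    def read_quoted_literal(source: str, start: int) -> tuple[str, int]:
        """Return (content, index just past the closing quotation mark).

        `start` must be the index of the opening quotation mark.
        """
        if source[start] != '"':
            raise ValueError("expected an opening quotation mark")
        depth = 0  # current brace depth inside the literal
        pos = start + 1
        while pos < len(source):
            char = source[pos]
            if char == "{":
                depth += 1
            elif char == "}":
                # Assumption: a closing brace at depth 0 is rejected.
                if depth == 0:
                    raise ValueError(f"unbalanced closing brace at index {pos}")
                depth -= 1
            elif char == '"' and depth == 0:
                # Only a quotation mark at brace depth 0 ends the literal.
                return source[start + 1:pos], pos + 1
            pos += 1
        raise ValueError("unterminated string literal")


    line = 'title = "My {"}wonderful{"} Title"'
    content, end = read_quoted_literal(line, line.index('"'))

Here ``content`` holds the full text between the outer quotation marks,
including the two braced quotation marks, and ``end`` is the index just past
the closing one.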

Unfortunately, there is no formal specification of BibTeX's grammar, and many
(if not most) publicly available tools only support a much simpler grammar than
the one supported by the original BibTeX implementation.

An excellent introduction to the more advanced aspects of BibTeX is the
document *Tame the BeaST* [TTB]_. However, it is sometimes necessary to refer
to the original implementation for corner-case behaviors. The original
implementation is available in the ``/orig`` folder. See the ``README`` file in
this folder for instructions on how to generate the documentation attached to
the original implementation.


The following grammar should be very close to the one supported by the original
BibTeX implementation:

.. code::

    number ::= [0-9]+
    key-paren ::= [^\s,)]*
    key-brace ::= [^\s,}]*
    identifier ::= [^0-9{}()=",#%][^{}()=",#%]*
    text ::= [^{}]*
    text-quote ::= [^{}"]*
    comment ::= [^@]*

    bibtex ::= ( comment | command )*
    command ::= comment-command | preamble-command | string-command | entry-command
    comment-command ::= '@' 'comment'
    preamble-command ::= '@' 'preamble' ( '{' literal-list '}' | '(' literal-list ')' )

    string-command ::= '@' 'string' ( '{' field '}' | '(' field ')' )

    entry-command ::= '@' identifier '{' key-brace ( ',' field-list )? '}'
                    | '@' identifier '(' key-paren ( ',' field-list )? ')'

    field-list ::= ( field ( ','  field )* ','? )?
    field ::= identifier '=' literal-list

    literal-list ::= literal ('#' literal)*
    literal ::= number | identifier | quote-literal | brace-literal
    quote-literal ::= '"'~text-quote~( brace-literal~text-quote )*~'"'
    brace-literal ::= '{'~balanced-text~'}'
    balanced-text ::= balanced-text~'{'~balanced-text~'}'~balanced-text
                    | text
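
As an illustration of the ``brace-literal`` and ``balanced-text`` productions,
the following minimal sketch reads one brace-delimited literal by tracking
brace depth. The function name ``read_brace_literal`` is hypothetical and the
code is not taken from the original implementation.

.. code:: python

    def read_brace_literal(source: str, start: int) -> tuple[str, int]:
        """Return (content, index just past the matching closing brace).

        `start` must be the index of the opening brace.
        """
        if source[start] != "{":
            raise ValueError("expected an opening brace")
        depth = 0
        for pos in range(start, len(source)):
            char = source[pos]
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    # The brace opened at `start` has just been closed.
                    return source[start + 1:pos], pos + 1
        raise ValueError("unterminated brace literal")


    content, end = read_brace_literal("{A {nested} title}, year", 0)

Nested braces are kept verbatim in ``content``; only the outermost pair is
stripped.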


A few remarks which do not seem to be common knowledge:

* an identifier can contain many things (including @ signs)
  but cannot start with a digit.

* an entry key can be empty.

* an entry key can contain many things, even braces. The braces can be
  unbalanced; BibTeX won't complain, but this will likely be a problem when
  compiling with LaTeX. If the key contains a closing brace, then the entry
  must use parenthesis delimiters. Similarly, if the key contains a closing
  parenthesis, the entry must use brace delimiters.
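
These remarks can be checked against the terminal rules of the grammar above.
The sketch below transcribes the ``identifier``, ``key-brace`` and
``key-paren`` rules verbatim into Python regular expressions. The helper names
are hypothetical, and the patterns are only as accurate as the grammar itself.

.. code:: python

    import re

    # Transcribed verbatim from the grammar above (illustrative names).
    IDENTIFIER = re.compile(r'[^0-9{}()=",#%][^{}()=",#%]*')
    KEY_BRACE = re.compile(r'[^\s,}]*')
    KEY_PAREN = re.compile(r'[^\s,)]*')


    def is_identifier(text: str) -> bool:
        """Check whether the whole string matches the identifier rule."""
        return IDENTIFIER.fullmatch(text) is not None


    def is_valid_key(text: str, delimiter: str) -> bool:
        """Check an entry key against the rule for the opening delimiter."""
        if delimiter == "{":
            pattern = KEY_BRACE
        elif delimiter == "(":
            pattern = KEY_PAREN
        else:
            raise ValueError("delimiter must be '{' or '('")
        return pattern.fullmatch(text) is not None

With these helpers, ``is_identifier("art@icle")`` holds while
``is_identifier("2cool")`` does not, the empty key is accepted with either
delimiter, and a key such as ``a}b`` is accepted only for an entry opened with
a parenthesis.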

.. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here <http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf>`_.