Parsing
=======

Writing a parser for BibTeX bibliography files is more challenging than it
appears. In particular, the BibTeX language cannot be tokenized with a standard
"local" regexp-based approach. The interpretation of certain characters is
heavily context-dependent. The best example of this is the interpretation of
a quotation mark encountered when reading a string literal: it marks the end of
the string literal unless it is at non-zero *brace depth*. For example, in the
following line:

.. code::

    title = "My {"}wonderful{"} Title"

The internal quotation marks are interpreted as regular characters because they
are at brace depth 1. It seems that the best approach is to skip the lexical
analysis step altogether and to write the parser directly.
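
As an illustration of the brace-depth rule, here is a minimal Python sketch
that scans a quoted string literal while tracking the brace depth. The function
name ``read_quoted`` is hypothetical, and raising an error on a closing brace at
depth 0 is an assumption of this sketch rather than documented behavior.

.. code:: python

    def read_quoted(src, start):
        """Return (content, index just past the closing quotation mark)."""
        if src[start] != '"':
            raise ValueError("expected an opening quotation mark")
        depth = 0  # current brace depth inside the literal
        i = start + 1
        while i < len(src):
            ch = src[i]
            if ch == "{":
                depth += 1
            elif ch == "}":
                if depth == 0:
                    # Assumption: a stray closing brace is reported as an error.
                    raise ValueError("unbalanced closing brace")
                depth -= 1
            elif ch == '"' and depth == 0:
                # Only a quotation mark at brace depth 0 ends the literal.
                return src[start + 1:i], i + 1
            i += 1
        raise ValueError("unterminated string literal")


    line = 'title = "My {"}wonderful{"} Title"'
    content, end = read_quoted(line, line.index('"'))
    print(content)

The scanner has to carry the depth counter from one character to the next,
which is exactly the context that a purely local regexp cannot express.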

Unfortunately, there is no formal specification of BibTeX's grammar, and many
(if not most) publicly available tools only support a much simpler grammar than
the one supported by the original BibTeX implementation.

An excellent introduction to the more advanced aspects of BibTeX is the
document *Tame the BeaST* [TTB]_. However, it is sometimes necessary to refer
to the original implementation for corner-case behaviors. The original
implementation is available in the ``/orig`` folder. See the ``README`` file in
this folder for instructions on how to generate the documentation attached to
the original implementation.

The following grammar should be very close to the one supported by the original
BibTeX implementation. We will use the following notations:

* ``'foo'``: the string ``foo``.
* ``0-9``: character range (everything between 0 and 9).
* ``\s``: any white space character.
* ``[abc]``: any of the characters appearing inside the brackets (here: a or
  b or c). The brackets can also contain character ranges.
* ``[^abc]``: any character which is not in the list of characters following
  the caret. The list can contain character ranges.
* ``A B``: expression ``A`` followed by ``B``. The two expressions can be
  separated by one or more white space characters.
* ``A | B``: expression ``A`` or ``B``.
* ``A?``: expression ``A`` repeated zero or one time.
* ``A*``: expression ``A`` repeated zero or more times.
* ``A+``: expression ``A`` repeated one or more times.
* ``( A )``: expression ``A``. The parentheses are useful to overrule the
  precedence of operators.

Let us first define the terminals:

.. code::

    number := [0-9]+
    key-paren := [^\s,)]*
    key-brace := [^\s,}]*
    identifier := [^\s0-9{}()=",#%][^\s{}()=",#%]*
    text := [^{}]*
    text-quote := [^{}"]*
    comment := [^@]*

Then the derivation rules:

.. code::

    bibtex ::= ( comment | command )*
    command ::= comment-command | preamble-command | string-command | entry-command
    comment-command ::= '@' 'comment'
    preamble-command ::= '@' 'preamble' ( '{' literal-list '}' | '(' literal-list ')' )

    string-command ::= '@' 'string' ( '{' field '}' | '(' field ')' )

    entry-command ::= '@' identifier '{' key-brace ( ',' field-list )? '}'
                    | '@' identifier '(' key-paren ( ',' field-list )? ')'

    field-list ::= ( field ( ','  field )* ','? )?
    field ::= identifier '=' literal-list

    literal-list ::= literal ( '#' literal )*
    literal ::= number | identifier | quote-literal | brace-literal
    quote-literal ::= '"' text-quote ( brace-literal text-quote )* '"'
    brace-literal ::= '{' balanced-text '}'
    balanced-text ::= balanced-text '{' balanced-text '}' balanced-text
                    | text
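
To make the ``literal-list`` rule more concrete, here is a minimal
recursive-descent sketch in Python. All function names are hypothetical, the
result format (a list of ``(kind, text)`` pairs) is an assumption of this
sketch, and it is not meant to reproduce every corner case of the original
implementation.

.. code:: python

    import re

    NUMBER = re.compile(r"[0-9]+")
    IDENTIFIER = re.compile(r'[^\s0-9{}()=",#%][^\s{}()=",#%]*')


    def skip_ws(src, i):
        """Return the index of the first non-white-space character at or after i."""
        while i < len(src) and src[i].isspace():
            i += 1
        return i


    def read_delimited(src, i):
        """Read a quote- or brace-literal opening at src[i]."""
        closing = '"' if src[i] == '"' else "}"
        depth = 0  # depth of nested braces inside the literal
        j = i + 1
        while j < len(src):
            ch = src[j]
            if ch == closing and depth == 0:
                return src[i + 1:j], j + 1
            if ch == "{":
                depth += 1
            elif ch == "}":
                if depth == 0:
                    raise ValueError("unbalanced closing brace")
                depth -= 1
            j += 1
        raise ValueError("unterminated literal")


    def parse_literal(src, i):
        """literal ::= number | identifier | quote-literal | brace-literal"""
        ch = src[i] if i < len(src) else ""
        if ch == '"' or ch == "{":
            text, end = read_delimited(src, i)
            return ("quote" if ch == '"' else "brace", text), end
        for kind, pattern in (("number", NUMBER), ("identifier", IDENTIFIER)):
            match = pattern.match(src, i)
            if match:
                return (kind, match.group()), match.end()
        raise ValueError(f"expected a literal at offset {i}")


    def parse_literal_list(src, i=0):
        """literal-list ::= literal ( '#' literal )*"""
        literals = []
        while True:
            literal, i = parse_literal(src, skip_ws(src, i))
            literals.append(literal)
            i = skip_ws(src, i)
            if i < len(src) and src[i] == "#":
                i += 1  # consume the concatenation operator
            else:
                return literals, i


    source = 'jan # " 1, " # 2024 # {x{y}z}'
    literals, end = parse_literal_list(source)
    print(literals)

Each parsing function returns both its result and the index where parsing
stopped, so the caller can resume from that position without a separate token
stream.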


A couple of remarks which do not seem to be common knowledge:

* an identifier can contain many things (including @ signs) but cannot start
  with a digit. I believe this is to allow simpler parsing of literals: simply
  looking at the first character is sufficient to know which literal type to
  parse.

* an entry key can contain many things, including @ signs and braces. The
  braces can be unbalanced: BibTeX won't complain, but this will likely be
  a problem when compiling with LaTeX. If the key contains a closing brace,
  then the entry must use parenthesis delimiters. Similarly, if the key
  contains a closing parenthesis, the entry must use brace delimiters.
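
The two remarks above can be checked against the terminal definitions with
a short Python sketch. The regular expressions are direct transcriptions of the
``identifier``, ``key-brace`` and ``key-paren`` terminals. The helper names
``literal_kind`` and ``usable_delimiters`` are hypothetical.

.. code:: python

    import re

    IDENTIFIER = re.compile(r'[^\s0-9{}()=",#%][^\s{}()=",#%]*')
    KEY_BRACE = re.compile(r"[^\s,}]*")
    KEY_PAREN = re.compile(r"[^\s,)]*")


    def literal_kind(first_char):
        """Guess the literal type from its first character alone."""
        if first_char == '"':
            return "quote-literal"
        if first_char == "{":
            return "brace-literal"
        if "0" <= first_char <= "9":
            return "number"
        if IDENTIFIER.fullmatch(first_char):
            return "identifier"
        return None  # not the start of any literal


    def usable_delimiters(key):
        """List the entry delimiters with which the given key can be written."""
        delimiters = []
        if KEY_BRACE.fullmatch(key):
            delimiters.append("{}")
        if KEY_PAREN.fullmatch(key):
            delimiters.append("()")
        return delimiters


    print(bool(IDENTIFIER.fullmatch("foo@bar")))  # an @ sign is allowed
    print(bool(IDENTIFIER.fullmatch("2nd")))      # a leading digit is not
    print(usable_delimiters("smith}2020"))
    print(usable_delimiters("smith)2020"))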

.. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here <http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf>`_.