Diffstat (limited to 'doc')
| -rw-r--r-- | doc/parsing.rst | 75 |
1 files changed, 75 insertions, 0 deletions
diff --git a/doc/parsing.rst b/doc/parsing.rst
new file mode 100644
index 0000000..57270ad
--- /dev/null
+++ b/doc/parsing.rst
@@ -0,0 +1,75 @@
+Writing a parser for BibTeX bibliography files is more challenging than it
+appears. In particular, the BibTeX language cannot be tokenized with a standard
+"local" approach. The interpretation of certain characters is heavily
+context-dependent. The best example of this is the interpretation of
+a quotation mark encountered when reading a string literal: it marks the end of
+the string literal unless it is at non-zero *brace depth*. Consider the
+following line:
+
+.. code::
+
+    title = "My {"}wonderful{"} Title"
+
+The internal quotation marks are interpreted as regular characters because they
+are at brace depth 1.
+
+Unfortunately, there is no formal specification of BibTeX's grammar, and many
+(if not most) publicly available tools support only a much simpler grammar than
+the one supported by the original BibTeX implementation.
+
+An excellent introduction to the more advanced aspects of BibTeX is the
+document *Tame the BeaST* [TTB]_. However, it is sometimes necessary to refer
+to the original implementation for corner-case behaviors. The original
+implementation is available in the ``/orig`` folder. See the ``README`` file in
+this folder for instructions on how to generate the documentation attached to
+the original implementation.
+
+
+The following grammar should be very close to the one supported by the original
+BibTeX implementation:
+
+.. code::
+
+    number           ::= [0-9]+
+    key-paren        ::= [^\s,)]*
+    key-brace        ::= [^\s,}]*
+    identifier       ::= [^0-9{}()=",#%][^{}()=",#%]*
+    text             ::= [^{}]*
+    text-quote       ::= [^{}"]*
+    comment          ::= [^@]*
+
+    bibtex           ::= ( comment | command )*
+    command          ::= comment-command | preamble-command | string-command | entry-command
+    comment-command  ::= '@' 'comment'
+    preamble-command ::= '@' 'preamble' ( '{' literal-list '}' | '(' literal-list ')' )
+
+    string-command   ::= '@' 'string' ( '{' field '}' | '(' field ')' )
+
+    entry-command    ::= '@' identifier '{' key-brace ( ',' field-list )? '}'
+                       | '@' identifier '(' key-paren ( ',' field-list )? ')'
+
+    field-list       ::= ( field ( ',' field )* ','? )?
+    field            ::= identifier '=' literal-list
+
+    literal-list     ::= literal ('#' literal)*
+    literal          ::= number | identifier | quote-literal | brace-literal
+    quote-literal    ::= '"'~text-quote~( brace-literal~text-quote )*~'"'
+    brace-literal    ::= '{'~balanced-text~'}'
+    balanced-text    ::= balanced-text~'{'~balanced-text~'}'~balanced-text
+                       | text
+
+
+A few remarks which do not seem to be common knowledge:
+
+* an identifier can contain many things (including @ signs)
+  but cannot start with a digit.
+
+* an entry key can be empty.
+
+* an entry key can contain many things, even braces. The braces can be
+  unbalanced: BibTeX won't complain, but this will likely be a problem when
+  compiling with LaTeX. If the key contains a closing brace, then the entry
+  must use parenthesis delimiters. Similarly, if the key contains a closing
+  parenthesis, the entry must use brace delimiters.
+
+.. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here <http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf>`_.
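
Note on the brace-depth rule described in ``doc/parsing.rst``: the rule can be illustrated with a minimal Python sketch. The function name ``find_closing_quote`` and its error handling are hypothetical and are not taken from the patch or from the original BibTeX implementation. The sketch assumes, following the ``text-quote`` rule of the grammar, that a closing brace at depth 0 is not allowed inside a quoted literal.

```python
def find_closing_quote(source: str, open_index: int) -> int:
    """Return the index of the quotation mark that closes the string
    literal opened at ``open_index`` (hypothetical helper)."""
    if source[open_index] != '"':
        raise ValueError("open_index must point at an opening quotation mark")
    depth = 0  # current brace depth inside the literal
    for index in range(open_index + 1, len(source)):
        char = source[index]
        if char == "{":
            depth += 1
        elif char == "}":
            if depth == 0:
                # Assumption: mirrors the grammar, where text-quote excludes braces.
                raise ValueError(f"unbalanced closing brace at index {index}")
            depth -= 1
        elif char == '"' and depth == 0:
            # Only a quotation mark at brace depth 0 ends the literal.
            return index
    raise ValueError("unterminated string literal")


line = 'title = "My {"}wonderful{"} Title"'
start = line.index('"')
end = find_closing_quote(line, start)
print(line[start + 1:end])
```

A scanner that stops at the first quotation mark it sees would cut this literal off after ``My {``, which is why a purely "local" tokenizer is not enough here.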
