Parsing
=======

Writing a parser for BibTeX bibliography files is more challenging than it
appears. In particular, the BibTeX language cannot be tokenized with a standard
"local" regexp-based approach, because the interpretation of certain characters
is heavily context-dependent. The best example of this is the quotation mark
encountered when reading a string literal: it marks the end of the string
literal unless it is at non-zero *brace depth*. For example, consider the
following line:

.. code::

    title = "My {"}wonderful{"} Title"

The internal quotation marks are interpreted as regular characters because they
are at brace depth 1. It seems that the best approach is to skip the lexical
analysis step altogether and to write the parser directly.

Unfortunately, there is no formal specification of BibTeX's grammar, and many
(if not most) publicly available tools only support a much simpler grammar than
the one supported by the original BibTeX implementation. An excellent
introduction to the more advanced aspects of BibTeX is the document *Tame the
BeaST* [TTB]_. However, it is sometimes necessary to refer to the original
implementation for corner-case behaviors. The original implementation is
available in the ``/orig`` folder. See the ``README`` file in this folder for
instructions on how to generate the documentation attached to the original
implementation.

The following grammar should be very close to the one supported by the original
BibTeX implementation. We will use the following notation:

* ``'foo'``: the string ``foo``.
* ``0-9``: character range (everything between 0 and 9).
* ``\s``: any white space character.
* ``[abc]``: any of the characters appearing inside the brackets (here: a or b
  or c). The brackets can also contain character ranges.
* ``[^abc]``: any character which is not in the list of characters following
  the caret. The list can contain character ranges.
* ``A B``: expression ``A`` followed by ``B``.
  The two expressions can be separated by one or more white spaces.
* ``A | B``: expression ``A`` or ``B``.
* ``A?``: expression ``A`` repeated zero or one time.
* ``A*``: expression ``A`` repeated zero or more times.
* ``A+``: expression ``A`` repeated one or more times.
* ``( A )``: expression ``A``. The parentheses are useful to override the
  precedence of operators.

Let us first define the terminals:

.. code::

    number      := [0-9]+
    key-paren   := [^\s,)]*
    key-brace   := [^\s,}]*
    identifier  := [^\s0-9{}()=",#%][^\s{}()=",#%]*
    text        := [^{}]*
    text-quote  := [^{}"]*
    comment     := [^@]*

Then the derivation rules:

.. code::

    bibtex           ::= ( comment | command )*
    command          ::= comment-command
                       | preamble-command
                       | string-command
                       | entry-command
    comment-command  ::= '@' 'comment'
    preamble-command ::= '@' 'preamble' ( '{' literal-list '}'
                                        | '(' literal-list ')' )
    string-command   ::= '@' 'string' ( '{' field '}'
                                      | '(' field ')' )
    entry-command    ::= '@' identifier '{' key-brace ( ',' field-list )? '}'
                       | '@' identifier '(' key-paren ( ',' field-list )? ')'
    field-list       ::= ( field ( ',' field )* ','? )?
    field            ::= identifier '=' literal-list
    literal-list     ::= literal ( '#' literal )*
    literal          ::= number | identifier | quote-literal | brace-literal
    quote-literal    ::= '"' text-quote ( brace-literal text-quote )* '"'
    brace-literal    ::= '{' balanced-text '}'
    balanced-text    ::= balanced-text '{' balanced-text '}' balanced-text
                       | text

A couple of remarks which do not seem to be common knowledge:

* An identifier can contain many things (including @ signs) but cannot start
  with a digit. I believe this is to allow simpler parsing of literals: looking
  at the first character is sufficient to know which literal type to parse.
* An entry key can contain many things, including @ signs and braces. The
  braces can be unbalanced: BibTeX won't complain, but this will likely be a
  problem when compiling with LaTeX. If the key contains a closing brace, then
  the entry must use parenthesis delimiters.
  Similarly, if the key contains a closing parenthesis, the entry must use
  brace delimiters.

.. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here `_.
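To illustrate the brace-depth rule for quotation marks described at the
beginning of this section, here is a minimal Python sketch of how a
quote-delimited literal might be read without a separate lexer. The function
names ``read_quote_literal`` and ``literal_kind`` are hypothetical and are not
taken from the original BibTeX implementation. Rejecting a closing brace at
depth 0 follows the ``text-quote`` terminal of the grammar in this section and
is an assumption about how strict a parser should be.

```python
def read_quote_literal(text, pos=0):
    """Read a quote-delimited literal starting at text[pos].

    Returns (content, next_pos), where next_pos is the index just after
    the closing quotation mark. Hypothetical helper for illustration only.
    """
    if pos >= len(text) or text[pos] != '"':
        raise ValueError("expected an opening quotation mark")
    depth = 0  # current brace depth inside the literal
    i = pos + 1
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                # Assumption: a stray closing brace is an error, since
                # text-quote excludes braces outside a brace-literal.
                raise ValueError("unbalanced closing brace")
            depth -= 1
        elif ch == '"' and depth == 0:
            # Only a quotation mark at brace depth 0 ends the literal.
            return text[pos + 1:i], i + 1
        i += 1
    raise ValueError("unterminated string literal")


def literal_kind(ch):
    """Pick the literal type from its first character alone."""
    if ch in "0123456789":
        return "number"
    if ch == '"':
        return "quote-literal"
    if ch == "{":
        return "brace-literal"
    # Assumption: anything else is treated as an identifier; a real
    # parser would also validate it against the identifier terminal.
    return "identifier"


src = 'title = "My {"}wonderful{"} Title"'
start = src.index('"')
content, end = read_quote_literal(src, start)
print(content)  # My {"}wonderful{"} Title
```

The quotation marks inside ``{"}`` are skipped because ``depth`` is 1 when they
are read, so the literal only ends at the final quotation mark. The
``literal_kind`` helper reflects the remark that the first character is enough
to decide which literal type to parse.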