Writing a parser for BibTeX bibliography files is more challenging than it
appears. In particular, the BibTeX language cannot be tokenized with a standard
"local" approach, because the interpretation of certain characters is heavily
context-dependent. The best example is a quotation mark encountered while
reading a string literal: it marks the end of the string literal unless it is
at non-zero *brace depth*. For example, in the following line:

.. code::

  title = "My {"}wonderful{"} Title"

the internal quotation marks are interpreted as regular characters because they
are at brace depth 1.

There is unfortunately no formal specification of BibTeX's grammar, and many
(if not most) publicly available tools only support a much simpler grammar than
the one supported by the original BibTeX implementation. An excellent
introduction to the more advanced aspects of BibTeX is the document *Tame the
BeaST* [TTB]_. However, it is sometimes necessary to refer to the original
implementation for corner-case behaviors.

The original implementation is available in the ``/orig`` folder. See the
``README`` file in this folder for instructions on how to generate the
documentation attached to the original implementation.

The following grammar should be very close to the one supported by the original
BibTeX implementation:

.. code::

  number           ::= [0-9]+
  key-paren        ::= [^\s,)]*
  key-brace        ::= [^\s,}]*
  identifier       ::= [^0-9{}()=",#%][^{}()=",#%]*
  text             ::= [^{}]*
  text-quote       ::= [^{}"]*
  comment          ::= [^@]*

  bibtex           ::= ( comment | command )*
  command          ::= comment-command
                     | preamble-command
                     | string-command
                     | entry-command
  comment-command  ::= '@' 'comment'
  preamble-command ::= '@' 'preamble' ( '{' literal-list '}' | '(' literal-list ')' )
  string-command   ::= '@' 'string' ( '{' field '}' | '(' field ')' )
  entry-command    ::= '@' identifier '{' key-brace ( ',' field-list )? '}'
                     | '@' identifier '(' key-paren ( ',' field-list )? ')'
  field-list       ::= ( field ( ',' field )* ','? )?
  field            ::= identifier '=' literal-list
  literal-list     ::= literal ('#' literal)*
  literal          ::= number | identifier | quote-literal | brace-literal
  quote-literal    ::= '"'~text-quote~( brace-literal~text-quote )*~'"'
  brace-literal    ::= '{'~balanced-text~'}'
  balanced-text    ::= balanced-text~'{'~balanced-text~'}'~balanced-text | text

A few remarks which do not seem to be common knowledge:

* An identifier can contain many things (including @ signs) but cannot start
  with a digit.
* An entry key can be empty.
* An entry key can contain many things, even braces. The braces can be
  unbalanced: BibTeX won't complain, but this will likely be a problem when
  compiling with LaTeX. If the key contains a closing brace, then the entry
  must use parenthesis delimiters. Similarly, if the key contains a closing
  parenthesis, the entry must use brace delimiters.

.. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here `_.
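
The brace-depth rule for quotation marks described at the top of this document
can be illustrated with a minimal Python sketch. It is not taken from the
original implementation: the function name ``read_quoted_literal`` is
hypothetical, and raising an error on an unmatched closing brace is an
assumption about how a parser might react.

.. code:: python

  def read_quoted_literal(text, start=0):
      """Scan a quote literal whose opening '"' is at text[start].

      Returns (content, index just past the closing quotation mark).
      """
      if start >= len(text) or text[start] != '"':
          raise ValueError("expected an opening quotation mark")
      depth = 0  # current brace depth inside the literal
      i = start + 1
      while i < len(text):
          ch = text[i]
          if ch == "{":
              depth += 1
          elif ch == "}":
              if depth == 0:
                  # Assumption: treat an unmatched '}' as an error.
                  raise ValueError("unbalanced closing brace")
              depth -= 1
          elif ch == '"' and depth == 0:
              # Only a quotation mark at brace depth 0 ends the literal.
              return text[start + 1:i], i + 1
          i += 1
      raise ValueError("unterminated quote literal")

  line = 'title = "My {"}wonderful{"} Title"'
  content, end = read_quoted_literal(line, line.index('"'))
  print(content)  # My {"}wonderful{"} Title

The two inner quotation marks are skipped because ``depth`` is 1 when they are
read, so only the final quotation mark terminates the literal.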