Diffstat (limited to 'doc')
-rw-r--r-- doc/parsing.rst 75
1 files changed, 75 insertions, 0 deletions
diff --git a/doc/parsing.rst b/doc/parsing.rst
new file mode 100644
index 0000000..57270ad
--- /dev/null
+++ b/doc/parsing.rst
@@ -0,0 +1,75 @@
+Writing a parser for BibTeX bibliography files is more challenging than it
+appears. In particular, the BibTeX language cannot be tokenized with a standard
+"local" approach. The interpretation of certain characters is heavily
+context-dependent. The best example of this is the interpretation of
+a quotation mark encountered when reading a string literal: it marks the end of
+the string literal unless it is at non-zero *brace depth*. For example, in the
+following line:
+
+.. code::
+
+ title = "My {"}wonderful{"} Title"
+
+The internal quotation marks are interpreted as regular characters because they
+are at brace depth 1.
+
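The brace-depth rule can be illustrated with a minimal Python sketch. The function name ``read_quoted_literal`` is hypothetical and not part of this repository, and rejecting a closing brace at depth 0 is an assumption of the sketch rather than a documented rule.

```python
def read_quoted_literal(source: str, start: int) -> tuple[str, int]:
    """Read the quoted literal whose opening quotation mark is at `start`.

    Returns the literal's content (without the outer quotation marks) and
    the index just past the closing quotation mark.
    """
    if source[start] != '"':
        raise ValueError("expected an opening quotation mark")
    depth = 0  # current brace depth inside the literal
    i = start + 1
    while i < len(source):
        ch = source[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            if depth == 0:
                # Assumption of this sketch: treat a stray closing brace as an error.
                raise ValueError("unbalanced closing brace")
            depth -= 1
        elif ch == '"' and depth == 0:
            # Only a quotation mark at brace depth 0 ends the literal.
            return source[start + 1 : i], i + 1
        i += 1
    raise ValueError("unterminated string literal")


line = 'title = "My {"}wonderful{"} Title"'
content, end = read_quoted_literal(line, line.index('"'))
```

Because the decision depends on a running counter rather than on the current character alone, a single regular expression for "string token" is not enough, which is the sense in which a purely local tokenizer fails.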
+There is unfortunately no formal specification of BibTeX's grammar, and many
+(if not most) publicly available tools support only a much simpler grammar than
+the one supported by the original BibTeX implementation.
+
+An excellent introduction to the more advanced aspects of BibTeX is the
+document *Tame the BeaST* [TTB]_. However, it is sometimes necessary to refer
+to the original implementation for corner-case behaviors. The original
+implementation is available in the ``/orig`` folder. See the ``README`` file in
+this folder for instructions on how to generate the documentation attached to
+the original implementation.
+
+
+The following grammar should be very close to the one supported by the original
+BibTeX implementation:
+
+.. code::
+
+ number ::= [0-9]+
+ key-paren ::= [^\s,)]*
+ key-brace ::= [^\s,}]*
+ identifier ::= [^0-9{}()=",#%][^{}()=",#%]*
+ text ::= [^{}]*
+ text-quote ::= [^{}"]*
+ comment ::= [^@]*
+
+ bibtex ::= ( comment | command )*
+ command ::= comment-command | preamble-command | string-command | entry-command
+ comment-command ::= '@' 'comment'
+ preamble-command ::= '@' 'preamble' ( '{' literal-list '}' | '(' literal-list ')' )
+
+ string-command ::= '@' 'string' ( '{' field '}' | '(' field ')' )
+
+ entry-command ::= '@' identifier '{' key-brace ( ',' field-list )? '}'
+ | '@' identifier '(' key-paren ( ',' field-list )? ')'
+
+ field-list ::= ( field ( ',' field )* ','? )?
+ field ::= identifier '=' literal-list
+
+ literal-list ::= literal ('#' literal)*
+ literal ::= number | identifier | quote-literal | brace-literal
+ quote-literal ::= '"'~text-quote~( brace-literal~text-quote )*~'"'
+ brace-literal ::= '{'~balanced-text~'}'
+ balanced-text ::= balanced-text~'{'~balanced-text~'}'~balanced-text
+ | text
+
+
+A few remarks which do not seem to be common knowledge:
+
+* an identifier can contain many things (including @ signs)
+ but cannot start with a digit.
+
+* an entry key can be empty.
+
+* an entry key can contain many things, even braces. The braces can be
+ unbalanced; BibTeX won't complain, but this will likely be a problem when
+ compiling with LaTeX. If the key contains a closing brace, then the entry
+ must use parenthesis delimiters. Similarly, if the key contains a closing
+ parenthesis, the entry must use brace delimiters.
+
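The remarks above can be checked against the character classes of the grammar with a short Python sketch. The regular expressions are copied from the ``identifier``, ``key-paren`` and ``key-brace`` rules. The helper names ``is_identifier`` and ``key_fits`` are hypothetical and exist only for illustration.

```python
import re

# Character classes copied verbatim from the grammar.
KEY_PAREN = re.compile(r"[^\s,)]*")
KEY_BRACE = re.compile(r"[^\s,}]*")
IDENTIFIER = re.compile(r'[^0-9{}()=",#%][^{}()=",#%]*')


def is_identifier(text: str) -> bool:
    """True if `text` matches the grammar's identifier rule in full."""
    return IDENTIFIER.fullmatch(text) is not None


def key_fits(text: str, delimiter: str) -> bool:
    """True if `text` is a valid entry key for an entry opened with `delimiter`.

    `delimiter` is "{" for brace-delimited entries, "(" for parenthesis-delimited ones.
    """
    pattern = KEY_BRACE if delimiter == "{" else KEY_PAREN
    return pattern.fullmatch(text) is not None


examples = {
    "identifier with @": is_identifier("foo@bar"),
    "identifier starting with a digit": is_identifier("2024paper"),
    "empty key": key_fits("", "{"),
    "key with '}' in a brace entry": key_fits("a}b", "{"),
    "key with '}' in a paren entry": key_fits("a}b", "("),
    "key with unbalanced '{'": key_fits("{unbalanced", "{"),
}
```

Note that the ``identifier`` class as written does not exclude whitespace, so a parser built on it presumably needs to skip or trim whitespace separately. That is a reading of the grammar, not a statement about the original implementation.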
+.. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here <http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf>`_.