1 files changed, 75 insertions, 0 deletions
diff --git a/doc/parsing.rst b/doc/parsing.rst
new file mode 100644
index 0000000..57270ad
--- /dev/null
+++ b/doc/parsing.rst
@@ -0,0 +1,75 @@
+Writing a parser for BibTeX bibliography files is more challenging than it
+appears. In particular, the BibTeX language cannot be tokenized with a standard
+"local" approach. The interpretation of certain characters is heavily
+context-dependent. The best example of this is the interpretation of
+a quotation mark encountered when reading a string literal: it marks the end of
+the string literal unless it is at non-zero *brace depth*. For example,
+consider the following line:
+
+.. code::
+
+    title = "My {"}wonderful{"} Title"
+
+The internal quotation marks are interpreted as regular characters because they
+are at brace depth 1.
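+
+The brace-depth rule can be sketched in a few lines of Python. This is only an
+illustration, not code taken from the original implementation: the name
+``find_closing_quote`` is hypothetical, and rejecting a closing brace at depth
+0 is an assumption based on the ``text-quote`` rule of the grammar given later
+in this document.

```python
def find_closing_quote(text: str, start: int) -> int:
    """Return the index of the quotation mark that closes the string
    literal opened at text[start], tracking brace depth on the way."""
    if text[start] != '"':
        raise ValueError("expected an opening quotation mark")
    depth = 0
    for i in range(start + 1, len(text)):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            # Assumption: a closing brace at depth 0 is rejected, since
            # text-quote excludes braces outside a brace-literal.
            if depth == 0:
                raise ValueError("unbalanced closing brace in string literal")
            depth -= 1
        elif ch == '"' and depth == 0:
            # Only a quotation mark at brace depth 0 ends the literal.
            return i
    raise ValueError("unterminated string literal")


line = 'title = "My {"}wonderful{"} Title"'
opening = line.index('"')
closing = find_closing_quote(line, opening)
content = line[opening + 1:closing]
```

+For the example line, ``closing`` is the index of the last character of the
+line, so ``content`` still contains both inner quotation marks.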
+
+Unfortunately, there is no formal specification of BibTeX's grammar, and many
+(if not most) publicly available tools support only a much simpler grammar than
+the one accepted by the original BibTeX implementation.
+
+An excellent introduction to the more advanced aspects of BibTeX is the
+document *Tame the BeaST* [TTB]_. However, it is sometimes necessary to refer
+to the original implementation for corner-case behaviors. The original
+implementation is available in the ``/orig`` folder. See the ``README`` file in
+this folder for instructions on how to generate the documentation attached to
+the original implementation.
+
+
+The following grammar should be very close to the one supported by the original
+BibTeX implementation:
+
+.. code::
+
+    number ::= [0-9]+
+    key-paren ::= [^\s,)]*
+    key-brace ::= [^\s,}]*
+    identifier ::= [^0-9{}()=",#%][^{}()=",#%]*
+    text ::= [^{}]*
+    text-quote ::= [^{}"]*
+    comment ::= [^@]*
+
+    bibtex ::= ( comment | command )*
+    command ::= comment-command | preamble-command | string-command | entry-command
+    comment-command ::= '@' 'comment'
+    preamble-command ::= '@' 'preamble' ( '{' literal-list '}' | '(' literal-list ')' )
+
+    string-command ::= '@' 'string' ( '{' field '}' | '(' field ')' )
+
+    entry-command ::= '@' identifier '{' key-brace ( ',' field-list )? '}'
+                    | '@' identifier '(' key-paren ( ',' field-list )? ')'
+
+    field-list ::= ( field ( ','  field )* ','? )?
+    field ::= identifier '=' literal-list
+
+    literal-list ::= literal ('#' literal)*
+    literal ::= number | identifier | quote-literal | brace-literal
+    quote-literal ::= '"'~text-quote~( brace-literal~text-quote )*~'"'
+    brace-literal ::= '{'~balanced-text~'}'
+    balanced-text ::= balanced-text~'{'~balanced-text~'}'~balanced-text
+                    | text
+
+
+A few remarks which do not seem to be common knowledge:
+
+* an identifier can contain many things (including @ signs)
+  but cannot start with a digit.
+
+* an entry key can be empty.
+
+* an entry key can contain many things, even braces. The braces can be
+  unbalanced: BibTeX won't complain, but this will likely be a problem when
+  compiling with LaTeX. If the key contains a closing brace, then the entry
+  must use parenthesis delimiters. Similarly, if the key contains a closing
+  parenthesis, the entry must use brace delimiters.
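+
+These remarks can be checked against the grammar with a short Python sketch.
+The regular expressions are transcribed directly from the ``identifier``,
+``key-brace`` and ``key-paren`` rules above; the helper names are hypothetical
+and chosen only for illustration.

```python
import re

# Transcribed from the grammar rules of the same names.
IDENTIFIER = re.compile(r'[^0-9{}()=",#%][^{}()=",#%]*')
KEY_BRACE = re.compile(r'[^\s,}]*')   # key of an entry delimited by { }
KEY_PAREN = re.compile(r'[^\s,)]*')   # key of an entry delimited by ( )


def is_identifier(s: str) -> bool:
    return IDENTIFIER.fullmatch(s) is not None


def is_key_brace(s: str) -> bool:
    return KEY_BRACE.fullmatch(s) is not None


def is_key_paren(s: str) -> bool:
    return KEY_PAREN.fullmatch(s) is not None


checks = {
    "identifier with @ sign": is_identifier("foo@bar"),
    "identifier starting with digit": is_identifier("1abc"),
    "empty key": is_key_brace("") and is_key_paren(""),
    "unbalanced opening brace in key": is_key_brace("a{b"),
    "closing brace, brace delimiters": is_key_brace("a}b"),
    "closing brace, paren delimiters": is_key_paren("a}b"),
    "closing paren, brace delimiters": is_key_brace("a)b"),
    "closing paren, paren delimiters": is_key_paren("a)b"),
}
```

+Each entry of ``checks`` records whether the grammar accepts the corresponding
+case, which makes it easy to compare the rules with the remarks one by one.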
+
+.. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here <http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf>`_.