From c488ccf95be73c2e1ed3ed537891d8bc75e64555 Mon Sep 17 00:00:00 2001
From: Thibaut Horel <thibaut.horel@gmail.com>
Date: Sun, 21 Feb 2016 21:28:27 -0500
Subject: Documentation cleanup

---
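Note for reviewers (not part of the commit message): the brace-depth rule
documented in doc/parsing.rst can be illustrated with a minimal Python sketch.
The function name read_quote_literal and its error handling are hypothetical
and are not taken from the original BibTeX implementation. The sketch only
shows why a quotation mark has to be interpreted relative to the current brace
depth, which is what rules out a purely "local" tokenizer.

```python
def read_quote_literal(source, start):
    """Return (content, next_index) for the quote-delimited literal at start.

    Hypothetical helper: a quotation mark closes the literal only when it is
    met at brace depth 0; inside braces it is a regular character.
    """
    if source[start] != '"':
        raise ValueError("expected an opening quotation mark")
    depth = 0  # current brace depth inside the literal
    i = start + 1
    while i < len(source):
        char = source[i]
        if char == "{":
            depth += 1
        elif char == "}":
            if depth == 0:
                raise ValueError("unbalanced closing brace")
            depth -= 1
        elif char == '"' and depth == 0:
            # Closing quotation mark: return the content without delimiters.
            return source[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated string literal")


line = 'title = "My {"}wonderful{"} Title"'
content, next_index = read_quote_literal(line, line.index('"'))
```

On the example line from the documentation, the call returns the content
My {"}wonderful{"} Title: the two inner quotation marks sit at brace depth 1
and are kept as regular characters.
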
 .gitignore      |  2 ++
 doc/parsing.rst | 76 +++++++++++++++++++++++++++++++++++++--------------------
 2 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/.gitignore b/.gitignore
index 3d79e87..bf6c9ce 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,4 @@
 *.blg
 *.html
+orig/*.tex
+orig/*.pdf
diff --git a/doc/parsing.rst b/doc/parsing.rst
index 57270ad..3597f62 100644
--- a/doc/parsing.rst
+++ b/doc/parsing.rst
@@ -1,7 +1,10 @@
+Parsing
+=======
+
 Writing a parser for BibTeX bibliography files is more challenging than it
 appears. In particular, the BibTeX language cannot be tokenized with a standard
-"local" approach. The interpretation of certain characters is heavily
-context-dependent. The best example of this is the interpretation of
+"local" regexp-based approach. The interpretation of certain characters is
+heavily context-dependent. The best example of this is the interpretation of
 a quotation mark encountered when reading a string literal: it marks the end of
 the string literal unless it is at non-zero *brace depth*. For example in the
 following line:
@@ -11,7 +14,8 @@ following line:
     title = "My {"}wonderful{"} Title"
 
 The internal quotation marks are interpreted as regular characters because they
-are at brace depth 1.
+are at brace depth 1. It seems that the best approach is to skip the lexical
+analysis step altogether and to write the parser directly.
 
 There is unfortunately no formal specification of BibTeX's grammar. And many
 (if not most) publicly available tools only support a much simpler grammar than
@@ -24,19 +28,39 @@ implementation is available in the ``/orig`` folder. See the ``README`` file in
 this folder for instructions on how to generate the documentation attached to
 the original implementation.
 
-
 The following grammar should be very close to the one supported by the original
-BibTeX implementation:
+BibTeX implementation. We will use the following notations:
+
+* ``'foo'``: the string ``foo``.
+* ``0-9``: character range (everything between 0 and 9).
+* ``\s``: any white space character.
+* ``[abc]``: any of the characters appearing inside the brackets (here: a or
+  b or c). The brackets can also contain character ranges.
+* ``[^abc]``: any character which is not in the list of characters following
+  the caret. The list can contain character ranges.
+* ``A B``: expression ``A`` followed by ``B``. The two expressions can be
+  separated by one or more white space characters.
+* ``A | B``: expression ``A`` or ``B``.
+* ``A?``: expression ``A`` repeated zero or one time.
+* ``A*``: expression ``A`` repeated zero or more times.
+* ``( A )``: expression ``A``. The parentheses are useful to overrule the
+  precedence of operators.
+
+Let us first define the terminals:
 
 .. code::
 
-    number ::= [0-9]+
-    key-paren ::= [^\s,)]*
-    key-brace ::= [^\s,}]*
-    identifier ::= [^0-9{}()=",#%][^{}()=",#%]*
-    text ::= [^{}]*
-    text-quote ::= [^{}"]*
-    comment ::= [^@]*
+    number := [0-9]+
+    key-paren := [^\s,)]*
+    key-brace := [^\s,}]*
+    identifier := [^\s0-9{}()=",#%][^\s{}()=",#%]*
+    text := [^{}]*
+    text-quote := [^{}"]*
+    comment := [^@]*
+
+Then the derivation rules:
+
+.. code::
 
     bibtex ::= ( comment | command )*
     command ::= comment-command | preamble-command | string-command | entry-command
@@ -51,25 +75,25 @@ BibTeX implementation:
     field-list ::= ( field ( ','  field )* ','? )?
     field ::= identifier '=' literal-list
 
-    literal-list ::= literal ('#' literal)*
+    literal-list ::= literal ( '#' literal )*
     literal ::= number | identifier | quote-literal | brace-literal
-    quote-literal ::= '"'~text-quote~brace-literal?~text-quote~'"'
-    brace-literal ::= '{'~balanced-text~'}'
-    balanced-text ::= balanced-text~'{'~balanced-text~'}'~balanced-text
+    quote-literal ::= '"' text-quote ( brace-literal text-quote )* '"'
+    brace-literal ::= '{' balanced-text '}'
+    balanced-text ::= balanced-text '{' balanced-text '}' balanced-text
                     | text
 
 
-A few remarks which do not seem to be common knowledge:
-
-* an identifier can contain many things (including @ signs)
-  but cannot start with a digit.
+A couple of remarks which do not seem to be common knowledge:
 
-* an entry key can be empty.
+* an identifier can contain many things (including @ signs) but cannot start
+  with a digit. I believe this is to allow simpler parsing of literals: simply
+  looking at the first character is sufficient to know which literal type to
+  parse.
 
-* an entry key can contain many things even braces. The braces can be
-  unbalanced, BibTeX won't complain but this will likely be a problem when
-  compiling with LaTeX. If the key contains a closing brace, then the entry
-  muse use parenthesis delimiters. Similarly, if the key contains a closing
-  parenthesis, the entry must use braces delimiters.
+* an entry key can contain many things, including @ signs and braces. The
+  braces can be unbalanced: BibTeX won't complain, but this will likely be
+  a problem when compiling with LaTeX. If the key contains a closing brace,
+  then the entry must use parenthesis delimiters. Similarly, if the key
+  contains a closing parenthesis, the entry must use brace delimiters.
 
 .. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here <http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf>`_.
-- 
cgit v1.2.3-70-g09d2