author Thibaut Horel <thibaut.horel@gmail.com> 2016-02-21 21:28:27 -0500
committer Thibaut Horel <thibaut.horel@gmail.com> 2016-02-21 21:28:27 -0500
commit c488ccf95be73c2e1ed3ed537891d8bc75e64555
tree 111839cce8e6de86d412e93bd080365a4892edf3 /doc
parent bd74996cf63f60e41df951e281410a1de32cb0b7
Documentation cleanup
Diffstat (limited to 'doc')
-rw-r--r-- doc/parsing.rst 76
1 files changed, 50 insertions, 26 deletions
diff --git a/doc/parsing.rst b/doc/parsing.rst
index 57270ad..3597f62 100644
--- a/doc/parsing.rst
+++ b/doc/parsing.rst
@@ -1,7 +1,10 @@
+Parsing
+=======
+
Writing a parser for BibTeX bibliography files is more challenging than it
appears. In particular, the BibTeX language cannot be tokenized with a standard
-"local" approach. The interpretation of certain characters is heavily
-context-dependent. The best example of this is the interpretation of
+"local" regexp-based approach. The interpretation of certain characters is
+heavily context-dependent. The best example of this is the interpretation of
a quotation mark encountered when reading a string literal: it marks the end of
the string literal unless it is at non-zero *brace depth*. For example in the
following line:
@@ -11,7 +14,8 @@ following line:
title = "My {"}wonderful{"} Title"
The internal quotation marks are interpreted as regular characters because they
-are at brace depth 1.
+are at brace depth 1. It seems that the best approach is to skip the lexical
+analysis step altogether and to write the parser directly.
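The brace-depth rule described above can be sketched in a few lines of Python. This is only an illustration, not the code from this repository: the function name ``read_quoted`` and its return convention are hypothetical, and treating a closing brace at depth 0 as an error is an assumption that follows the ``text-quote`` terminal of the grammar.

```python
from typing import Tuple


def read_quoted(source: str, start: int) -> Tuple[str, int]:
    """Read a quote-delimited literal whose opening quote is at source[start].

    Returns the content between the delimiters and the index just past the
    closing quotation mark. (Hypothetical helper, for illustration only.)
    """
    if source[start] != '"':
        raise ValueError("expected an opening quotation mark")
    depth = 0  # current brace depth inside the literal
    pos = start + 1
    while pos < len(source):
        char = source[pos]
        if char == "{":
            depth += 1
        elif char == "}":
            if depth == 0:
                # Assumption: an unmatched closing brace is rejected.
                raise ValueError("unbalanced closing brace")
            depth -= 1
        elif char == '"' and depth == 0:
            # Only a quotation mark at brace depth 0 ends the literal.
            return source[start + 1:pos], pos + 1
        pos += 1
    raise ValueError("unterminated string literal")


line = 'title = "My {"}wonderful{"} Title"'
content, end = read_quoted(line, line.index('"'))
```

A purely local tokenizer would stop at the second quotation mark of the line; keeping a depth counter is what lets the scan continue to the last one.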
There is unfortunately no formal specification of BibTeX's grammar. And many
(if not most) publicly available tools only support a much simpler grammar than
@@ -24,19 +28,39 @@ implementation is available in the ``/orig`` folder. See the ``README`` file in
this folder for instructions on how to generate the documentation attached to
the original implementation.
-
The following grammar should be very close to the one supported by the original
-BibTeX implementation:
+BibTeX implementation. We will use the following notations:
+
+* ``'foo'``: the string ``foo``.
+* ``0-9``: character range (everything between 0 and 9).
+* ``\s``: any white space character.
+* ``[abc]``: any of the characters appearing inside the brackets (here: a or
+ b or c). The brackets can also contain character ranges.
+* ``[^abc]``: any character which is not in the list of characters following
+ the caret. The list can contain character ranges.
+* ``A B``: expression ``A`` followed by ``B``. The two expressions can be
+ separated by one or more white space characters.
+* ``A | B``: expression ``A`` or ``B``.
+* ``A?``: expression ``A`` repeated zero or one time.
+* ``A*``: expression ``A`` repeated zero, one or more times.
+* ``( A )``: expression ``A``. The parentheses are useful to overrule the
+ precedence of operators.
+
+Let us first define the terminals:
.. code::
- number ::= [0-9]+
- key-paren ::= [^\s,)]*
- key-brace ::= [^\s,}]*
- identifier ::= [^0-9{}()=",#%][^{}()=",#%]*
- text ::= [^{}]*
- text-quote ::= [^{}"]*
- comment ::= [^@]*
+ number := [0-9]+
+ key-paren := [^\s,)]*
+ key-brace := [^\s,}]*
+ identifier := [^\s0-9{}()=",#%][^\s{}()=",#%]*
+ text := [^{}]*
+ text-quote := [^{}"]*
+ comment := [^@]*
+
+Then the derivation rules:
+
+.. code::
bibtex ::= ( comment | command )*
command ::= comment-command | preamble-command | string-command | entry-command
@@ -51,25 +75,25 @@ BibTeX implementation:
field-list ::= ( field ( ',' field )* ','? )?
field ::= identifier '=' literal-list
- literal-list ::= literal ('#' literal)*
+ literal-list ::= literal ( '#' literal )*
literal ::= number | identifier | quote-literal | brace-literal
- quote-literal ::= '"'~text-quote~brace-literal?~text-quote~'"'
- brace-literal ::= '{'~balanced-text~'}'
- balanced-text ::= balanced-text~'{'~balanced-text~'}'~balanced-text
+ quote-literal ::= '"' text-quote brace-literal? text-quote '"'
+ brace-literal ::= '{' balanced-text '}'
+ balanced-text ::= balanced-text '{' balanced-text '}' balanced-text
| text
-A few remarks which do not seem to be common knowledge:
-
-* an identifier can contain many things (including @ signs)
- but cannot start with a digit.
+A couple of remarks which do not seem to be common knowledge:
-* an entry key can be empty.
+* an identifier can contain many things (including @ signs) but cannot start
+ with a digit. I believe this is to allow simpler parsing of literals: simply
+ looking at the first character is sufficient to know which literal type to
+ parse.
-* an entry key can contain many things even braces. The braces can be
- unbalanced, BibTeX won't complain but this will likely be a problem when
- compiling with LaTeX. If the key contains a closing brace, then the entry
- muse use parenthesis delimiters. Similarly, if the key contains a closing
- parenthesis, the entry must use braces delimiters.
+* an entry key can contain many things, including @ signs and braces. The
+ braces can be unbalanced: BibTeX won't complain, but this will likely be
+ a problem when compiling with LaTeX. If the key contains a closing brace,
+ then the entry must use parenthesis delimiters. Similarly, if the key
+ contains a closing parenthesis, the entry must use brace delimiters.
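The remark about identifiers not starting with a digit can be made concrete with a small dispatch function. The sketch below is an assumption about how a parser might use that property, not the implementation in this repository; the name ``literal_kind`` and the returned labels are hypothetical and simply mirror the alternatives of the ``literal`` rule.

```python
def literal_kind(first_char: str) -> str:
    """Guess which alternative of the ``literal`` rule starts with first_char.

    Hypothetical helper: it only inspects one character, which is enough
    because an identifier can start with neither a digit, a quotation mark
    nor an opening brace.
    """
    if len(first_char) != 1:
        raise ValueError("expected exactly one character")
    if first_char in "0123456789":
        return "number"
    if first_char == '"':
        return "quote-literal"
    if first_char == "{":
        return "brace-literal"
    # Remaining characters excluded from the start of an identifier.
    if first_char.isspace() or first_char in "}()=,#%":
        raise ValueError("no literal can start with %r" % first_char)
    return "identifier"


kinds = [literal_kind(c) for c in '2"{j@']
```

Note that ``@`` is classified as the start of an identifier, in line with the first remark above.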
.. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here <http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf>`_.