From c488ccf95be73c2e1ed3ed537891d8bc75e64555 Mon Sep 17 00:00:00 2001
From: Thibaut Horel
Date: Sun, 21 Feb 2016 21:28:27 -0500
Subject: Documentation cleanup

---
 .gitignore      |  2 ++
 doc/parsing.rst | 76 +++++++++++++++++++++++++++++++++++++--------------------
 2 files changed, 52 insertions(+), 26 deletions(-)

diff --git a/.gitignore b/.gitignore
index 3d79e87..bf6c9ce 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,4 @@
 *.blg
 *.html
+orig/*.tex
+orig/*.pdf
diff --git a/doc/parsing.rst b/doc/parsing.rst
index 57270ad..3597f62 100644
--- a/doc/parsing.rst
+++ b/doc/parsing.rst
@@ -1,7 +1,10 @@
+Parsing
+=======
+
 Writing a parser for BibTeX bibliography files is more challenging than it
 appears. In particular, the BibTeX language cannot be tokenized with a standard
-"local" approach. The interpretation of certain characters is heavily
-context-dependent. The best example of this is the interpretation of
+"local" regexp-based approach. The interpretation of certain characters is
+heavily context-dependent. The best example of this is the interpretation of
 a quotation mark encountered when reading a string literal: it marks the end of
 the string literal unless it is at non-zero *brace depth*. For example in the
 following line:
@@ -11,7 +14,8 @@ following line:
     title = "My {"}wonderful{"} Title"
 
 The internal quotation marks are interpreted as regular characters because they
-are at brace depth 1.
+are at brace depth 1. It seems that the best approach is to skip the lexical
+analysis step altogether and to write the parser directly.
 
 There is unfortunately no formal specification of BibTeX's grammar. And many
 (if not most) publicly available tools only support a much simpler grammar than
@@ -24,19 +28,39 @@ implementation is available in the ``/orig`` folder. See the ``README`` file in
 this folder for instructions on how to generate the documentation attached to
 the original implementation.
 
-
 The following grammar should be very close to the one supported by the original
-BibTeX implementation:
+BibTeX implementation. We will use the following notations:
+
+* ``'foo'``: the string ``foo``.
+* ``0-9``: character range (everything between 0 and 9).
+* ``\s``: any white space character.
+* ``[abc]``: any of the characters appearing inside the brackets (here: a or
+  b or c). The brackets can also contain character ranges.
+* ``[^abc]``: any character which is not in the list of characters following
+  the caret. The list can contain character ranges.
+* ``A B``: expression ``A`` followed by ``B``. The two expression can be
+  separated by one or many white spaces.
+* ``A | B``: expression ``A`` or ``B``.
+* ``A?``: expression ``A`` repeated zero or one time.
+* ``A*``: expression ``A`` repeated zero, one or more times.
+* ``( A )``: expression ``A``. The parentheses are useful to overrule the
+  precedence of operators.
+
+Let us first define the terminals:
 
 .. code::
 
-    number ::= [0-9]+
-    key-paren ::= [^\s,)]*
-    key-brace ::= [^\s,}]*
-    identifier ::= [^0-9{}()=",#%][^{}()=",#%]*
-    text ::= [^{}]*
-    text-quote ::= [^{}"]*
-    comment ::= [^@]*
+    number := [0-9]+
+    key-paren := [^\s,)]*
+    key-brace := [^\s,}]*
+    identifier := [^\s0-9{}()=",#%][^\s{}()=",#%]*
+    text := [^{}]*
+    text-quote := [^{}"]*
+    comment := [^@]*
+
+Then the derivation rules:
+
+.. code::
 
     bibtex ::= ( comment | command )*
     command ::= comment-command | preamble-command | string-command | entry-command
@@ -51,25 +75,25 @@ BibTeX implementation:
 
     field-list ::= ( field ( ',' field )* ','? )?
     field ::= identifier '=' literal-list
-    literal-list ::= literal ('#' literal)*
+    literal-list ::= literal ( '#' literal )*
     literal ::= number | identifier | quote-literal | brace-literal
-    quote-literal ::= '"'~text-quote~brace-literal?~text-quote~'"'
-    brace-literal ::= '{'~balanced-text~'}'
-    balanced-text ::= balanced-text~'{'~balanced-text~'}'~balanced-text
+    quote-literal ::= '"' text-quote brace-literal? text-quote '"'
+    brace-literal ::= '{' balanced-text '}'
+    balanced-text ::= balanced-text '{' balanced-text '}' balanced-text
         | text
 
-A few remarks which do not seem to be common knowledge:
-
-* an identifier can contain many things (including @ signs)
-  but cannot start with a digit.
+A couple of remarks which do not seem to be common knowledge:
 
-* an entry key can be empty.
+* an identifier can contain many things (including @ signs) but cannot start
+  with a digit. I believe this is to allow simpler parsing of literals: simply
+  looking at the first character is sufficient to know which literal type to
+  parse.
 
-* an entry key can contain many things even braces. The braces can be
-  unbalanced, BibTeX won't complain but this will likely be a problem when
-  compiling with LaTeX. If the key contains a closing brace, then the entry
-  muse use parenthesis delimiters. Similarly, if the key contains a closing
-  parenthesis, the entry must use braces delimiters.
+* an entry key can contain many things, including @ sign and braces. The braces
+  can be unbalanced, BibTeX won't complain but this will likely be a problem
+  when compiling with LaTeX. If the key contains a closing brace, then the
+  entry muse use parenthesis delimiters. Similarly, if the key contains
+  a closing parenthesis, the entry must use braces delimiters.
 
 .. [TTB] Nicolas Markey, *Tame the BeaST*. Available `here `_.
--
cgit v1.2.3-70-g09d2
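
The brace-depth rule that the patched ``doc/parsing.rst`` describes for quoted string literals can be illustrated with a short Python sketch. This is not code from the commit: the function name ``read_quote_literal``, its return value, and the choice to raise ``ValueError`` on an unbalanced closing brace or an unterminated string are all assumptions made for illustration. It only shows why a quotation mark cannot be classified without tracking how many braces are currently open.

```python
def read_quote_literal(source, start=0):
    """Read the quote-delimited literal that opens at source[start].

    Returns (content, end): the text between the delimiting quotation
    marks and the index just past the closing mark.
    Hypothetical helper for illustration; error handling is an assumption.
    """
    if source[start] != '"':
        raise ValueError("expected an opening quotation mark")
    depth = 0  # number of currently open braces
    pos = start + 1
    while pos < len(source):
        char = source[pos]
        if char == "{":
            depth += 1
        elif char == "}":
            if depth == 0:
                raise ValueError("unbalanced closing brace")
            depth -= 1
        elif char == '"' and depth == 0:
            # Only a quotation mark at brace depth 0 ends the literal.
            return source[start + 1:pos], pos + 1
        pos += 1
    raise ValueError("unterminated string literal")


line = 'title = "My {"}wonderful{"} Title"'
content, end = read_quote_literal(line, line.index('"'))
print(content)  # My {"}wonderful{"} Title
```

On the example line from the documentation, the two inner quotation marks are read at brace depth 1 and are kept as ordinary characters; only the final one, at depth 0, terminates the literal.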