"http://www.w3.org/TR/html4/loose.dtd"> >
The lexer is based on the re module, so TPG benefits from the power of Python regular expressions. This document assumes the reader is familiar with regular expressions.
You can use the regular expression syntax expected by the re module, except for the grouping syntax, since groups are used internally by TPG to decide which token has been recognized.
Tokens can be explicitly defined by the token and separator keywords.
A token is defined by a name (used to identify the token in the grammar rules), a regular expression (describing what the token matches) and an optional action (a Python function applied to the matched text). Token definitions end with a ; .
See figure 6.1 for examples.
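As a rough sketch (following the TPG 3 syntax; token names and actions here are illustrative, refer to figure 6.1 for the authoritative examples), token and separator definitions look like:

```text
token integer: '\d+' int ;      # matched text is transformed by int
token identifier: '\w+' ;       # returns the matched string
separator spaces: '\s+' ;       # wiped out by the lexer
```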
The order of the declaration of the tokens is important: the first token that matches is returned. Keywords get a special treatment: if a regular expression describes a keyword, TPG also looks for a word boundary after it. For instance, to match the keywords if and ifxyz, TPG will internally search for if\b and ifxyz\b. This way, if won't match the beginning of ifxyz and won't interfere with general identifiers (\w+ for example).
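The word-boundary treatment described above can be illustrated directly with the re module (a sketch of the behaviour, not TPG's internal code):

```python
import re

# TPG internally appends \b to a keyword pattern, so the keyword "if"
# cannot match the beginning of the identifier "ifxyz".
keyword_if = re.compile(r'if\b')

assert keyword_if.match('if then')        # the keyword alone matches
assert keyword_if.match('ifxyz') is None  # \b rejects the prefix match
```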
There are two kinds of tokens. Tokens defined by the token keyword are handed to the parser, while tokens defined by the separator keyword are considered separators (white space or comments for example) and are wiped out by the lexer.
Tokens can also be defined on the fly: their definitions are then inlined in the grammar rules. This feature may be useful for keywords or punctuation signs. Unlike predefined tokens, inline tokens cannot be transformed by an action; they always return the matched text as a string.
See figure 6.2 for examples.
Inline tokens have a higher precedence than predefined tokens, to avoid conflicts (an inline if won't be matched as a predefined identifier).
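This precedence rule can be sketched with plain re patterns (a minimal illustration, not TPG's implementation): the inline keyword pattern is simply tried before the predefined identifier pattern.

```python
import re

# Inline tokens are tried first, so the inline keyword "if" wins over
# the generic predefined identifier pattern \w+.
patterns = [
    ('if',    re.compile(r'if\b')),  # inline token: higher precedence
    ('ident', re.compile(r'\w+')),   # predefined token
]

def next_token(text):
    for name, regexp in patterns:
        m = regexp.match(text)
        if m:
            return (name, m.group())
```

Thus `next_token('if x')` recognizes the keyword, while `next_token('ifxyz')` falls through to the identifier.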
TPG works in two stages. The lexer first splits the input string into a list of tokens and then the parser parses this list.
The lexer splits the input string according to the token definitions (see 6.2). When the input string cannot be matched, a tpg.LexerError exception is raised.
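The splitting process can be sketched as follows (a simplified model, not TPG's actual code; TPG raises tpg.LexerError where this sketch raises ValueError):

```python
import re

# Token definitions: (name, regexp, is_separator).
# Separator matches are discarded instead of being handed to the parser.
token_defs = [
    ('number', re.compile(r'\d+'), False),
    ('ident',  re.compile(r'\w+'), False),
    ('spaces', re.compile(r'\s+'), True),   # wiped out by the lexer
]

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        for name, regexp, is_separator in token_defs:
            m = regexp.match(text, pos)
            if m:
                if not is_separator:
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            # no token definition matches the remaining input
            raise ValueError("lexer error at position %d" % pos)
    return tokens
```

For example, `tokenize('x 42')` yields `[('ident', 'x'), ('number', '42')]`; the space is matched by the separator and dropped.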
Beware: the lexer may loop indefinitely if a token can match an empty string, since an empty string can be matched at every position.
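The hazard is easy to reproduce with the re module: an empty match never consumes any input, so the lexer's position never advances. A defensive lexer sketch can detect this instead of looping:

```python
import re

empty_ok = re.compile(r'\d*')  # \d* can match the empty string

m = empty_ok.match('abc')
assert m is not None and m.end() == 0  # matched "" at position 0

def advance(regexp, text, pos):
    # Refuse empty matches: they would leave pos unchanged and make
    # the lexer loop forever on the same position.
    m = regexp.match(text, pos)
    if m and m.end() == pos:
        raise ValueError("token matched an empty string; lexer would loop")
    return m
```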
Tokens are matched as symbols are recognized. Predefined tokens have the same syntax as non-terminal symbols. The token text (or the result of the function associated with the token) can be saved with the infix / operator (see figure 6.3).
Inline tokens have a similar syntax: you just write the regular expression (in a string). Its text can also be saved (see figure 6.4).
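As a rough sketch of both forms (rule and token names here are purely illustrative, and the syntax follows the TPG 3 style; refer to figures 6.3 and 6.4 for the authoritative examples):

```text
# save the result of a predefined token in n:
VALUE/v -> number/n  $ v = n $ ;

# save the text of an inline token in o:
OP/o -> '[-+]'/o ;
```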