Part 1: Tokenizing Input with Lex

The code from example 1 carried out two steps: tokenizing the input, which means looking for the symbols that make up the arithmetic expression, and parsing, which means analysing the extracted tokens and evaluating the result.

This section provides a simple example of how to tokenize user input, and then breaks it down line by line.

import ply.lex as lex

# List of token names. This is always required
tokens = [
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
]

# Regular expression rules for simple tokens
t_PLUS    = r'\+'
t_MINUS   = r'-'
t_TIMES   = r'\*'
t_DIVIDE  = r'/'
t_LPAREN  = r'\('
t_RPAREN  = r'\)'

# A regular expression rule with some action code
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)    
    return t

# Define a rule so we can track line numbers
def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

# A string containing ignored characters (spaces and tabs)
t_ignore  = ' \t'

# Error handling rule
def t_error(t):
    print("Illegal character '%s'" % t.value[0])
    t.lexer.skip(1)

# Build the lexer
lexer = lex.lex()

# Give the lexer some input (a sample arithmetic expression)
data = '3 + 4 * 10 + -20 * 2'
lexer.input(data)

# Tokenize
while True:
    tok = lexer.token()
    if not tok: 
        break      # No more input
    print(tok)
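
With the sample input above, the output should look roughly like the following (each token is printed as LexToken(type, value, lineno, lexpos)):

LexToken(NUMBER,3,1,0)
LexToken(PLUS,'+',1,2)
LexToken(NUMBER,4,1,4)
LexToken(TIMES,'*',1,6)
LexToken(NUMBER,10,1,8)
LexToken(PLUS,'+',1,11)
LexToken(MINUS,'-',1,13)
LexToken(NUMBER,20,1,14)
LexToken(TIMES,'*',1,17)
LexToken(NUMBER,2,1,19)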

Save this file as calclex.py. We’ll be using this when building our Yacc parser.


Breakdown

  1. Import the module using `import ply.lex as lex`.
  2. All lexers must provide a list called `tokens` that defines all of the possible token names that can be produced by the lexer. This list is always required.
tokens = [
   'NUMBER',
   'PLUS',
   'MINUS',
   'TIMES',
   'DIVIDE',
   'LPAREN',
   'RPAREN',
]

tokens could also be a tuple of strings (rather than a list), where each string denotes a token as before.
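
For example, the same declaration written as a tuple:

tokens = ('NUMBER', 'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'LPAREN', 'RPAREN')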

  3. The regex rule for each token may be defined either as a string or as a function. In either case, the name must be prefixed with `t_` to denote that it is a rule for matching tokens.
def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t
Note that the regex rule is specified as a docstring within the function. The function accepts a single argument, an instance of `LexToken`, performs some action, and then returns the token.

If you want to supply the regex from a string held in a variable instead of writing it as a docstring, use the `@TOKEN` decorator (imported from `ply.lex`):

from ply.lex import TOKEN

identifier = r'[a-zA-Z_][a-zA-Z_0-9]*'   # a string holding the regex (example pattern)
@TOKEN(identifier)
def t_ID(t):
    ...      # actions

* An instance of `LexToken` (let's call it `t`) has the following attributes:
1) `t.type`, the token type as a string (e.g. `'NUMBER'`, `'PLUS'`). By default, `t.type` is set to the name following the `t_` prefix.
2) `t.value`, the lexeme (the actual text matched).
3) `t.lineno`, the current line number (this is not updated automatically, since the lexer knows nothing about line numbers). Update it yourself with a rule such as `t_newline`:


def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

4) `t.lexpos`, the position of the token relative to the beginning of the input text.

- To discard a token, such as a comment, simply define a token rule that returns no value:

def t_COMMENT(t):
    r'\#.*'
    pass
    # No return value. Token discarded

This is the same as:

t_ignore_COMMENT = r'\#.*'

(Of course, the string form is only valid if no action needs to be carried out when a comment is seen; if you do need an action, define the rule as a function.)

If you haven't defined a token for some characters but still want to ignore them, use `t_ignore = "<characters to ignore>"` (the `t_ignore` and `t_ignore_` prefixes are required):

t_ignore_COMMENT = r'\#.*'
t_ignore  = ' \t'    # ignores spaces and tabs

- When building the master regular expression, lex adds the individual rules as follows:
1) Tokens defined by functions are added in the same order as they appear in the file.
2) Tokens defined by strings are added in decreasing order of regular expression length (longer expressions first).

If you are matching `==` and `=` in the same file, take advantage of these rules, as in the sketch below.
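
As a minimal sketch (the `EQ` and `ASSIGN` token names and the sample input are purely for illustration), the two string rules below can be written in either order, because lex adds the longer regex first and therefore tries `==` before `=`:

import ply.lex as lex

tokens = ['EQ', 'ASSIGN']

t_ASSIGN = r'='      # shorter regex, added to the master regex second
t_EQ     = r'=='     # longer regex, added to the master regex first

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('== =')
for tok in lexer:
    print(tok.type)   # prints EQ, then ASSIGN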
- Literals are single characters that are returned as-is when encountered by the lexer. Both `t.type` and `t.value` will be set to the character itself.

Define a list of literals as such:

literals = [ '+', '-', '*', '/' ]

or,

literals = "+-*/"
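
Here is a minimal sketch of literals in action (the `NUMBER` rule and the sample input are purely for illustration); note how the `+` token comes back with both its type and value set to `'+'`:

import ply.lex as lex

tokens = ['NUMBER']
literals = "+-*/"

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('1 + 2')
for tok in lexer:
    print(tok.type, tok.value)   # NUMBER 1, then + +, then NUMBER 2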


It is possible to write token functions that perform additional actions when literals are matched. However, you'll need to set the token type appropriately. For example:

literals = [ '{', '}' ]

def t_lbrace(t):
    r'\{'
    t.type = '{'  # Set the token type to the literal character (required when returning a literal)
    return t

- Handle errors with the `t_error` function.

# Error handling rule
def t_error(t):
  print("Illegal character '%s'" % t.value[0])
  t.lexer.skip(1) # skip the illegal token (don't process it)

In general, `t.lexer.skip(n)` skips n characters in the input string.
  4. Final preparations:
Build the lexer using `lexer = lex.lex()`.

You can also put everything inside a class and use an instance of that class to define the lexer. For example:

import ply.lex as lex

class MyLexer(object):
    ...     # everything relating to token rules and error handling comes here as usual

    # Build the lexer
    def build(self, **kwargs):
        self.lexer = lex.lex(module=self, **kwargs)

    def test(self, data):
        self.lexer.input(data)
        for tok in self.lexer:    # iterating over the lexer yields tokens until the input runs out
            print(tok)

# Build the lexer and try it out
m = MyLexer()
m.build()           # Build the lexer
m.test("3 + 4")     # Tokenize a sample string

Provide input using `lexer.input(data)`, where `data` is a string.

To get the tokens, use `lexer.token()`, which returns the next matched token, or `None` when the input is exhausted. You can also iterate over the lexer directly, as in:

for tok in lexer:
    print(tok)
