
DIY lexer

As I said before, splitting syntax analysis into a lexer and a parser lets the latter work with more organized and meaningful material than the raw source text. But at the same time, both the lexer and the parser are syntactic analyzers, just working at different levels: they take a sequence of characters as input and build a higher-level structure from it.

The lexer works with regular languages, i.e. languages that can be recognized in one linear pass through the text, without backtracking or storing complex information about previous characters. This makes the analysis predictable and efficient: at any given moment, the lexer makes its decision by looking only at the current character (or a small fixed buffer of characters). Because of this, lexing can be implemented with a finite state machine, where each state determines what to do when the next symbol arrives, with no need to store additional context.
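To make the idea concrete, here is a toy sketch (my own illustration, not part of the lexer we are about to build): a two-state automaton that decides whether a string is a non-empty run of digits, reading one character at a time and never backtracking:

def is_number(text):
    state = 'start'                # no digits seen yet
    for c in text:
        if not c.isdigit():
            return False           # any non-digit rejects immediately, no backtracking
        state = 'digits'           # we are inside a run of digits
    return state == 'digits'      # accept only a non-empty run of digits

print(is_number('1337'), is_number('13a7'), is_number(''))  # True False False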

Overall, creating a lexer is a surprisingly simple task, so this article will be short :) The only catch is that the code is not very pleasant to read, so I will start with the simplest version of it and expand it gradually.

Recognizing numbers

Let us suppose we have the following source file:

0 // this is a comment to ignore by the lexer
1337

We need to convert it into the following tokens:

[
    INTEGER(0),
    INTEGER(1337)
]
How do we do that? As I said, we turn the source into a stream of characters and then process them one by one. I will make all the decisions with a finite state machine made of three states:

State 0 is the initial state: in it, we are looking for the beginning of a new token. While in state 0, we simply ignore all the spaces, tabs and line breaks coming from the source file. As soon as we see two slashes, we move to state 1, which corresponds to scanning a single-line comment. In this state, we ignore every character except the line break, which returns us to state 0.

If, while in state 0, we encounter a digit, we immediately go to state 2, which corresponds to scanning a numeric token. Any character other than a digit returns us to state 0. I forgot to draw it in the figure, but if, while in state 0, we encounter any incoming character other than digits, whitespace, or a pair of slashes, this causes a lexical error.
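In short, the automaton behaves as follows:

state 0 (start): skip whitespace; on two slashes go to state 1; on a digit go to state 2 and start accumulating it; any other character is a lexical error;
state 1 (comment): skip everything; on a line break return to state 0;
state 2 (number): on a digit keep accumulating; on anything else emit an INTEGER token and re-examine the current character in state 0.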

Here is an implementation of such a lexer:

Comments and integer tokens
class Token:
    def __init__(self, t, v):
        self.type, self.value = t, v

    def __repr__(self):
        return f'{self.type}({self.value})'

def tokenize(text):
    idx, state, accum = 0, 0, ''
    while idx<len(text):
        sym1 = text[idx+0] if idx<len(text)-0 else ' ' # current symbol
        sym2 = text[idx+1] if idx<len(text)-1 else ' ' # next symbol
        if state==0: # start scanning a new token
            if sym1 == '/' and sym2 == '/': # start a comment scan
                state = 1
            elif sym1.isdigit():            # start a number scan
                state = 2
                accum += sym1
            elif sym1 not in ['\r', '\t', ' ', '\n']: # whitespace is skipped; anything else is an error
                raise Exception(f'Lexical error: illegal character "{sym1}"')
        elif state==2:                          # scanning a number
            if sym1.isdigit():                  # is next character a digit?
                accum += sym1                   # if yes, continue
            else:
                yield Token('INTEGER', accum)   # otherwise, emit number token
                idx -= 1                        # step back: re-examine this character in state 0
                state, accum = 0, ''            # start new scan
        if sym1 == '\n' and state==1: # a line break ends the comment, start new scan
            state, accum = 0, ''
        idx += 1

tokens = list(tokenize('''0 // this is a comment to ignore by the lexer
1337
'''))
print(tokens)
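This prints [INTEGER(0), INTEGER(1337)], exactly the token list we wanted.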

Strings

To recognize strings, it is enough to add state 3, which we enter on a quotation mark and leave on the next unescaped quotation mark. If we encounter a line break before the closing quote, it is a lexical error.

String tokens
class Token:
    def __init__(self, t, v):
        self.type, self.value = t, v

    def __repr__(self):
        return f'{self.type}({self.value})'

def tokenize(text):
    idx, state, accum = 0, 0, ''
    while idx<len(text):
        sym1 = text[idx+0] if idx<len(text)-0 else ' ' # current symbol
        sym2 = text[idx+1] if idx<len(text)-1 else ' ' # next symbol
        if state==0: # start scanning a new token
            if sym1 == '/' and sym2 == '/': # start a comment scan
                state = 1
            elif sym1.isdigit():            # start a number scan
                state = 2
                accum += sym1
            elif sym1 == '"':               # start a string scan
                state = 3
            elif sym1 not in ['\r', '\t', ' ', '\n']: # whitespace is skipped; anything else is an error
                raise Exception(f'Lexical error: illegal character "{sym1}"')
        elif state==2:                          # scanning a number
            if sym1.isdigit():                  # is next character a digit?
                accum += sym1                   # if yes, continue
            else:
                yield Token('INTEGER', accum)   # otherwise, emit number token
                idx -= 1                        # step back: re-examine this character in state 0
                state, accum = 0, ''            # start new scan
        elif state==3:                                          # scanning a string, check next character
            if sym1 != '"' or (accum and accum[-1]=='\\'):      # if not a quote mark (or if it is an escaped quote),
                accum += sym1                                   # continue the scan
            else:
                yield Token('STRING', accum)                    # otherwise emit the token
                state, accum = 0, '' # start new scan
        if sym1 == '\n':
            if state==1:   # a line break ends the comment, start new scan
                state, accum = 0, ''
            elif state==3: # a line break inside a string literal is an error
                raise Exception('Lexical error: unterminated string')
        idx += 1
    if state: # the input ended in the middle of a token
        raise Exception(f'Lexical error: unexpected end of file (state {state}, accumulated "{accum}")')

tokens = list(tokenize('''0 // this is a comment to ignore by the lexer
"this is a string with an escaped quotation mark \\" and double slash // that is not considered as a comment"
1337
'''))
print(tokens)
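Between the two integers, the output now contains a STRING token. Note that its value keeps the backslash in front of the escaped quotation mark: translating escape sequences into real characters can be postponed to a later stage of the compiler. Also beware that this naive escape check has a blind spot: a string literal ending in an escaped backslash \\ would be misread, since the closing quote itself would look escaped.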

Identifiers and keywords

Let us add the recognition of variable identifiers and keywords. I want _identifier to be recognized as ID(_identifier), and while to be recognized as a keyword, emitting the token WHILE. In fact, both are recognized by the same state of the automaton; we just check whether the word is in the list of reserved words before emitting the token.

We enter state 4 upon encountering either an underscore or a letter, and then it's business as usual:

Identifiers and reserved words
class Token:
    def __init__(self, t, v):
        self.type, self.value = t, v

    def __repr__(self):
        return f'{self.type}({self.value})'

def tokenize(text):
    keywords    = {'true':'BOOLEAN','false':'BOOLEAN','print':'PRINT','println':'PRINT','int':'TYPE','bool':'TYPE','if':'IF','else':'ELSE','while':'WHILE','return':'RETURN'}
    idx, state, accum = 0, 0, ''
    while idx<len(text):
        sym1 = text[idx+0] if idx<len(text)-0 else ' ' # current symbol
        sym2 = text[idx+1] if idx<len(text)-1 else ' ' # next symbol
        if state==0: # start scanning a new token
            if sym1 == '/' and sym2 == '/': # start a comment scan
                state = 1
            elif sym1.isdigit():            # start a number scan
                state = 2
                accum += sym1
            elif sym1 == '"':               # start a string scan
                state = 3
            elif sym1.isalpha() or sym1=='_': # start a word scan
                state = 4
                accum += sym1
            elif sym1 not in ['\r', '\t', ' ', '\n']: # whitespace is skipped; anything else is an error
                raise Exception(f'Lexical error: illegal character "{sym1}"')
        elif state==2:                          # scanning a number
            if sym1.isdigit():                  # is next character a digit?
                accum += sym1                   # if yes, continue
            else:
                yield Token('INTEGER', accum)   # otherwise, emit number token
                idx -= 1                        # step back: re-examine this character in state 0
                state, accum = 0, ''            # start new scan
        elif state==3:                                          # scanning a string, check next character
            if sym1 != '"' or (accum and accum[-1]=='\\'):      # if not a quote mark (or if it is an escaped quote),
                accum += sym1                                   # continue the scan
            else:
                yield Token('STRING', accum)                    # otherwise emit the token
                state, accum = 0, '' # start new scan
        elif state==4:                                          # scanning a word, check next character
            if sym1.isalpha() or sym1=='_' or sym1.isdigit():   # still a word character?
                accum += sym1                                   # if yes, continue
            else:                                               # otherwise the scan stops, we have a word
                if accum in keywords:                           # is the word reserved?
                    yield Token(keywords[accum], accum)         # if yes, keyword
                else:
                    yield Token('ID', accum)                    # identifier otherwise
                idx -= 1                        # step back: re-examine this character in state 0
                state, accum = 0, '' # start new scan
        if sym1 == '\n':
            if state==1:   # a line break ends the comment, start new scan
                state, accum = 0, ''
            elif state==3: # a line break inside a string literal is an error
                raise Exception('Lexical error: unterminated string')
        idx += 1
    if state: # the input ended in the middle of a token
        raise Exception(f'Lexical error: unexpected end of file (state {state}, accumulated "{accum}")')

tokens = list(tokenize('''0 // this is a comment to ignore by the lexer
"this is a string with an escaped quotation mark \\" and double slash // that is not considered as a comment"
1337
_identifier
while
truefalse
'''))
print(tokens)
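Note that truefalse is emitted as a single ID(truefalse) token rather than as two keywords: state 4 first consumes the longest possible word, and only then consults the keyword table. This longest-match rule is usually called maximal munch.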

One- and two-character lexemes

Finally, we need all the little things like arithmetic operators, comparisons and so on. Since such a lexeme is at most two characters long, I don't even bother to create a separate state for it: I emit the tokens directly from state 0.
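Note that the two-character table must be consulted before the one-character one, otherwise <= would be emitted as COMP(<) followed by ASSIGN(=).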

Here is the full lexer code for wend, which replaced the SLY lexer:

Operators
class Token:
    def __init__(self, t, v):
        self.type, self.value = t, v

    def __repr__(self):
        return f'{self.type}({self.value})'

def tokenize(text):
    keywords    = {'true':'BOOLEAN','false':'BOOLEAN','print':'PRINT','println':'PRINT','int':'TYPE','bool':'TYPE','if':'IF','else':'ELSE','while':'WHILE','return':'RETURN'}
    double_char = {'==':'COMP', '<=':'COMP', '>=':'COMP', '!=':'COMP', '&&':'AND', '||':'OR'}
    single_char = {'=':'ASSIGN','<':'COMP', '>':'COMP', '!':'NOT', '+':'PLUS', '-':'MINUS', '/':'DIVIDE', '*':'TIMES', '%':'MOD','(':'LPAREN',')':'RPAREN', '{':'BEGIN', '}':'END', ';':'SEMICOLON', ',':'COMMA'}
    idx, state, accum = 0, 0, ''
    while idx<len(text):
        sym1 = text[idx+0] if idx<len(text)-0 else ' ' # current symbol
        sym2 = text[idx+1] if idx<len(text)-1 else ' ' # next symbol
        if state==0: # start scanning a new token
            if sym1 == '/' and sym2 == '/': # start a comment scan
                state = 1
            elif sym1.isdigit():            # start a number scan
                state = 2
                accum += sym1
            elif sym1 == '"':               # start a string scan
                state = 3
            elif sym1.isalpha() or sym1=='_': # start a word scan
                state = 4
                accum += sym1
            elif sym1 + sym2 in double_char:  # emit two-character token
                yield Token(double_char[sym1+sym2], sym1+sym2)
                idx += 1                    # skip the second character of the lexeme
            elif sym1 in single_char:         # emit one-character token
                yield Token(single_char[sym1], sym1)
            elif sym1 not in ['\r', '\t', ' ', '\n']: # whitespace is skipped; anything else is an error
                raise Exception(f'Lexical error: illegal character "{sym1}"')
        elif state==2:                          # scanning a number
            if sym1.isdigit():                  # is next character a digit?
                accum += sym1                   # if yes, continue
            else:
                yield Token('INTEGER', accum)   # otherwise, emit number token
                idx -= 1                        # step back: re-examine this character in state 0
                state, accum = 0, ''            # start new scan
        elif state==3:                                          # scanning a string, check next character
            if sym1 != '"' or (accum and accum[-1]=='\\'):      # if not a quote mark (or if it is an escaped quote),
                accum += sym1                                   # continue the scan
            else:
                yield Token('STRING', accum)                    # otherwise emit the token
                state, accum = 0, '' # start new scan
        elif state==4:                                          # scanning a word, check next character
            if sym1.isalpha() or sym1=='_' or sym1.isdigit():   # still a word character?
                accum += sym1                                   # if yes, continue
            else:                                               # otherwise the scan stops, we have a word
                if accum in keywords:                           # is the word reserved?
                    yield Token(keywords[accum], accum)         # if yes, keyword
                else:
                    yield Token('ID', accum)                    # identifier otherwise
                idx -= 1                        # step back: re-examine this character in state 0
                state, accum = 0, '' # start new scan
        if sym1 == '\n':
            if state==1:   # a line break ends the comment, start new scan
                state, accum = 0, ''
            elif state==3: # a line break inside a string literal is an error
                raise Exception('Lexical error: unterminated string')
        idx += 1
    if state: # the input ended in the middle of a token
        raise Exception(f'Lexical error: unexpected end of file (state {state}, accumulated "{accum}")')

tokens = list(tokenize('''0 // this is a comment to ignore by the lexer
"this is a string with an escaped quotation mark \\" and double slash // that is not considered as a comment"
1337
_identifier
while
truefalse
5*3 = -6
2 && 2 == 5;
'''))
print(tokens)
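As before, the lexer happily tokenizes nonsense like 5*3 = -6 (emitting INTEGER(5), TIMES(*), INTEGER(3), ASSIGN(=), MINUS(-), INTEGER(6)) and 2 && 2 == 5;. Checking whether a sequence of tokens makes any sense is the parser's business, not the lexer's.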
