Building a Swift Parser with an Improved Tokenizer

We've created an iOS and Mac framework for all of this code, and open-sourced it. Find out more here.

We've made some changes to the tokenizer described in our earlier post to improve its functionality and API. Although it has been heavily refactored, the internals are not that different, and we'd like to move forward with some richer examples, including a foray into an example parser.

For those who are thinking about using the library, there is a reason it's not on GitHub yet... a few comments and this blog are all the documentation it has had. We are still learning about Swift and trying approaches (like the one described here) to how best to use the language's features, and we reserve the right to refactor wherever we like. The tokenizer is becoming less likely to change, but I'm still not happy with all of the patterns.

The architecture has transitioned to a more generic state machine, with the ability to "push" new contexts. The underlying protocol is very simple:

enum TokenizationStateChange{
    //No state change required
    case None
    //Leave this state
    case Exit
    //Leave this state, and there was an error
    case Error(errorToken:Token.ErrorToken)
    //Move to this new state
    case Transition(newState:TokenizationState)
}
protocol TokenizationState{
    func couldEnterWithCharacter(character:UnicodeScalar, controller:TokenizationController)->Bool
    func consume(character:UnicodeScalar, controller:TokenizationController) -> TokenizationStateChange
    func reset()
    func didEnter()
    func didExit()
    func addBranchTo(newState:TokenizationState)->TokenizationState
    func addBranchesTo(newStates:Array<TokenizationState>)->TokenizationState
    func chain(newStates:Array<TokenizationState>)->TokenizationState
    func createTokensUsing(creationBlock: TokenCreationBlock) -> TokenizationState
    func description()->String
}
typealias   TokenCreationBlock = ((state:TokenizationState,controller:TokenizationController)->Token)

We've split entry testing into couldEnterWithCharacter(); only if that returns true is consume() called, which then follows the same pattern as before. Other than this, and notification of when the state is entered and exited, the majority of the new API is intended to make it easier to build up a parser. We'll see these in action later, but the addBranch/chain methods allow sequences and branching states to be set up very quickly.
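
To make the flow concrete, here is a rough sketch (hypothetical bookkeeping, not the library's controller) of how a controller might drive a state through this protocol; character, controller, currentState and rootState are assumed to already exist.

//A hypothetical fragment, not the library's controller: the entry test runs first,
//and only a state that accepts the character is asked to consume it
if currentState.couldEnterWithCharacter(character, controller: controller){
    switch currentState.consume(character, controller: controller){
    case .None:
        break                            //Stay in this state
    case .Exit:
        currentState = rootState         //Hypothetical: fall back to a root state
    case .Error(let errorToken):
        println("error: \(errorToken)")  //Hypothetical error handling
    case .Transition(let newState):
        currentState = newState          //Move to the new state
    }
}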

There are four key implementations of the protocol:

  • Branch - This is the base state for the other three and has the basic branching behaviour. When asked if it could enter with a character, it checks all of its branches, returning true only if one of them can process it; it will then transition to that state. 
  • SingleCharacter - This very useful state will only enter on a particular character (or one of a set of allowed characters). Once it has consumed its matching character it behaves like a normal Branch. 
  • Repeated - This state is almost a recursive tokenizer: it is supplied with an internal state (which can be a chain of states, just like a primary tokenizer), and you may specify a minimum and/or maximum number of times that internal state should match before the token is generated. Once the criteria have been met (at least once), it behaves like a normal Branch. 
  • Delimited - This special state introduces the stacking behaviour that I had in the previous implementation. When the opening delimiter character is encountered, a new set of root states is pushed onto the tokenisation controller until the closing delimiter is encountered and the state is popped. 

With these four simple states it is possible to build very complex tokenisers. To help, we have created a class called StandardTokenizers which provides states for the most useful character classes as well as many more complex but common sequences. I'd recommend flicking through it to get a sense of how the simpler states are used.
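
For example, a minimal integer-matching state might be composed from these primitives like this; a sketch only, using the initializer labels from the examples below (the character set and token name here are illustrative, not StandardTokenizers' own):

//Illustrative only: a single digit character, repeated to produce one "integer" token
let digit = SingleCharacter(allowedCharacters: "0123456789")
let integer = Repeated(repeatingState: digit, tokenName: "integer")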

Here are some complex examples from the project's main.swift file

//And something to test with
let parsingTest = "Short 10 string"
//----------------------------
// Simple Sentance tokenizer
//----------------------------
let sentanceTokenizer:Tokenizer = Tokenizer().addBranchesTo([
    StandardTokenizers.blanks,
    StandardTokenizers.number,
    StandardTokenizers.word,
    StandardTokenizers.eot
    ])
println("\nSentance:")
sentanceTokenizer.tokenize(parsingTest,newToken:tokenPrinter)

We used this example in the previous post, but hopefully it is much clearer what is happening with the improved API. We create a tokenizer and then call addBranchesTo() to add all the base transitions. Note that the chain/addBranch methods return a TokenizationState, so calls can be chained. Note also the need for an End token (or EOT, end of transmission); you don't need to include it in your string. The output is much as we had before.

Sentance:
    word 'Short'
    blank ' '
    integer '10'
    blank ' '
    word 'string'
    end ''

The word state is just a Repeated SingleCharacter state (for letters)

Repeated(repeatingState: StandardTokenizers.wordCharacter, tokenName: "word")

We have significantly increased the capability of the advanced sentence tokenizer, adding support for Swift-style string interpolation within quoted strings.

//----------------------------
// More complex tokenizer
//----------------------------
let tougherTest = "1.5 Nasty example with -10 or 10.5 maybe even 1.0e-10 \"Great \\(variableName) \\t 🐨 \\\"Nested\\\" quote\"!"
//Prepare a way of handling \", \t, and \(variableName)
let quotedCharacters = [
    SingleCharacter(allowedCharacters: "\\").addBranchesTo([
        SingleCharacter(allowedCharacters:"t\"", createTokensNamed:"character"),
        Delimited(open: "(",close: ")",states: [StandardTokenizers.word],createTokenNamed: "inline")
        ]),
    SingleCharacter(disallowedCharacters:"\"",createTokensNamed:"character")
]
//Configure the tokenizer
let advancedSentanceTokenizer:Tokenizer = Tokenizer().addBranchesTo([
    Delimited(delimiter:"\"",states:quotedCharacters,createTokenNamed:"quoted-string"),
    StandardTokenizers.blanks,
    StandardTokenizers.number,
    StandardTokenizers.word,
    StandardTokenizers.punctuation,
    StandardTokenizers.eot
    ])
println("\nAdvanced Sentance:")
advancedSentanceTokenizer.tokenize(tougherTest,newToken:tokenPrinter)

The hard work is done in the quotedCharacters branch contained in the " Delimited state. It first looks for a \ with a SingleCharacter state, before branching off to either \" or \t from within another SingleCharacter state, or to an embedded Delimited state for \(variableName). One word is then allowed in this new delimited state. Finally, we have a SingleCharacter state that matches everything except the exit delimiter character, ".

This is about as complex a tokenisation challenge as you will encounter. Use the debugger to step through and see how it works, and have a look at the StandardTokenizers class to see how more complex tokenizers are built up.

We will leave it there for tokenisation and move on to parsing. Parsing takes advantage of a change to the Tokenizer API, which can now take a block that receives the generated tokens as they are streamed.

typealias TokenHandler = (token: Token) -> Bool

We don't yet use the return value, as we aren't sure we are happy with the pattern, but if we stick with it, the return value will be used to stop tokenisation early.
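
As a sketch of the pattern, here is a hypothetical handler that counts word tokens instead of printing them; the only Token property it relies on is name, which the parser examples below also use.

//A hypothetical TokenHandler that counts "word" tokens rather than printing them
var wordCount = 0
let wordCounter:TokenHandler = {(token:Token)->Bool in
    if token.name == "word"{
        wordCount += 1
    }
    return true
}
sentanceTokenizer.tokenize(parsingTest, newToken:wordCounter)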

The parser exploits this, and the fact that Swift's functions are just closures, in its definition.

protocol Parser{
    func parse(token:Token)->Bool
    func parseString(string:String, withTokenizer:Tokenizer)
}

Note that the parse function matches the TokenHandler signature; as you might imagine, this means a parser can supply that function directly to a tokenizer, as we can see in the StackParser implementation.

class StackParser:Parser{
  ...
  func parseString(string: String, withTokenizer: Tokenizer) {
     withTokenizer.tokenize(string, newToken: parse)
  }
}

The stack parser is very simple; in addition to having default implementations like the one above, it provides a token stack which can help build up the actual parse tree (or, perhaps more accurately, avoid the need to build one explicitly).
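
The sketch below is not the library's StackParser, but shows roughly how the stack helpers the following parsers rely on (pushToken, popToken and topToken) could be provided, and why calling super.parse() is enough to push a token.

//A rough sketch, not the library's actual implementation, of StackParser-style helpers
class SketchStackParser:Parser{
    var tokenStack = Array<Token>()

    func pushToken(token:Token){
        tokenStack.append(token)
    }
    func popToken()->Token?{
        if tokenStack.isEmpty{
            return nil
        }
        return tokenStack.removeLast()
    }
    func topToken()->Token?{
        if tokenStack.isEmpty{
            return nil
        }
        return tokenStack[tokenStack.count-1]
    }
    //The default parse simply pushes every token onto the stack
    func parse(token:Token)->Bool{
        pushToken(token)
        return true
    }
    func parseString(string:String, withTokenizer:Tokenizer){
        withTokenizer.tokenize(string, newToken:parse)
    }
}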

We can use this to build up a very simple RPN expression parser

class RPNParser : StackParser{
    override func parse(token: Token) -> Bool {
        switch token.name{
        case "operator":
            //Pop the top two numbers, apply the operator, and push the result
            let right = popToken() as NumberToken
            let left = popToken() as NumberToken
            let actualOperator = token as OperatorToken
            pushToken(actualOperator.applyTo(left, right: right))
            return true
        case "integer","float":
            //Wrap the token as a number and let super.parse() push it onto the stack
            return super.parse(NumberToken(usingToken: token))
        default:
            //Everything else is ignored
            return true
        }
    }
    func execute(){
        //The result is the single number left on top of the stack
        let top = popToken() as NumberToken
        println(top.numericValue)
    }
}
let rpnParser = RPNParser()
let rpnExpression = "3 2 * 1 +"
//simpleMath is the maths tokenizer defined elsewhere in the project's main.swift
rpnParser.parseString(rpnExpression, withTokenizer: simpleMath)
print("\nRPN result for '"+rpnExpression+"':\n\t")
rpnParser.execute()

For "3 2 * 1 +" the parser pushes 3 and 2, the * pops both and pushes 6, and the + then combines 6 and 1 to leave 7 on the stack for execute() to print. OK, so reverse Polish isn't the most readable form; let's build on this with a shunting-yard infix expression parser. We have not implemented functions at this stage, but if you have a quick read of the Wikipedia page on the shunting-yard algorithm you should be fine. Again, it's really not much code, and we chain the RPN parser in to do the final calculations.

class SYExpressionParser : StackParser{
    var rpnParser = RPNParser()
    func processOperator(operatorToken:OperatorToken)->Bool{
        //Pop (and emit) any stacked operators of equal or higher precedence
        //before pushing the new operator
        while topToken() is OperatorToken{
            let topOp = topToken() as OperatorToken
            if operatorToken.presidence() <= topOp.presidence(){
                rpnParser.parse(popToken()!)
            } else {
                break
            }
        }
        pushToken(operatorToken)
        return true
    }
    func processCloseBracket()->Bool{
        //Pop (and emit) operators until the matching open bracket is found
        var topTokenName = topToken()?.name
        while topTokenName != nil && topTokenName != "bracket-open" {
            rpnParser.parse(popToken()!)
            topTokenName = topToken()?.name
        }
        if topTokenName == nil{
            println("Missing open bracket")
            return true
        }
        //Discard the open bracket now that the sub-expression is complete
        popToken()
        return true
    }
    override func parse(token: Token) -> Bool {
        switch token.name{
        case "operator":
            return processOperator(token as OperatorToken)
        case "integer","float":
            //Numbers go straight through to the RPN parser
            rpnParser.parse(token)
            return true
        case "bracket-open":
            pushToken(token)
            return true
        case "bracket-close":
            return processCloseBracket()
        case "end":
            //Flush any remaining operators through to the RPN parser
            while topToken() != nil{
                rpnParser.parse(popToken()!)
            }
            return true
        default:
            return true
        }
    }
    func execute(){
        rpnParser.execute()
    }
    override func parseString(string: String, withTokenizer: Tokenizer) {
        rpnParser = RPNParser()
        super.parseString(string, withTokenizer: withTokenizer)
    }
}

Now we just need to extend the RPN tokenizer with some states for open and close brackets (note that we don't need a Delimited state here, because the same tokenisation flow applies both inside and outside the brackets).

The code to use it is simple

let expressionTokenizer:Tokenizer = simpleMath.addBranchesTo([
    SingleCharacter(allowedCharacters:"(",createTokensNamed:"bracket-open"),
    SingleCharacter(allowedCharacters:")",createTokensNamed:"bracket-close"),
    ])
let shuntingYardParser = SYExpressionParser()
let infixExpression = "3*(6--4)"
print("\nInfix result for '"+infixExpression+"':\n\t")
shuntingYardParser.parseString(infixExpression, withTokenizer: expressionTokenizer)
shuntingYardParser.execute()

I have wrapped all this up in a project to make it easy for you to play with; the code will work fine on both iOS and Mac. This will become a framework on GitHub, but until then please play with these pre-alpha builds and keep checking back to see updates and improvements as we go forward.

This is proving a very useful exercise for exploring patterns in Swift, so I want to take it a bit further than we have so far. You can download the project here.