A Swift String Tokenizer

WE'VE CREATED AN IOS AND MAC FRAMEWORK FOR ALL OF THIS CODE, AND OPEN-SOURCED IT. FIND OUT MORE HERE

Brace yourself, this is a long article (now you at least can see why I've been quiet for a week or so). You will also see what I can only describe as some unusual patterns, I have had to work around all manor of Swift compiler bugs to get this running. 

The objective: To write an extensible (but simple) Swift tokeniser. There are great Objective C frameworks like ParseKit but I thought this was both a fun thing to do in Swift, and I've never loved the way you build up your tokens and tokenisers. 

The architecture is simple. There is a Token class (intended to be sub-classed, the base class has just a name and the characters it is made up of), a protocol of something that can create Tokens from Characters. A TokenizingState which can create more complex Tokens, and one final implementation of that, the Tokenizer which manages all of the different states of tokenisation. 

The Token Class

@objc
class Token{
    let name:String
    var characters:String = ""
    
    init(name:String){
        self.name = name
    }

    init(name:String, withCharacters:String){
        self.name = name
        self.characters = withCharacters
    }
    
    func description()->String{
        return "\(name) '\(characters)'"
    }
    
    @objc
    class ErrorToken : Token{
        let problem : String
        init(forCharacter:UnicodeScalar,problemDescription:String){
            problem = problemDescription
            super.init(name: "Error", withCharacters: "\(forCharacter)")
        }
        
        override func description() -> String {
            return super.description()+" - "+problem
        }
    }
}

These are what you'll get back out of the Tokenizer, a very simple implementation. They have a name, and record the characters they capture. Note that it includes a nested concrete sub-class, I've grown to quite like this pattern. It says this sub-class is special. In this case it's a predefined Token type that can be used by others to identify that there was an error, and include a description of the problem. You can follow this pattern with your own Token classes. 

Right, we'll need something to create Tokens!

The Tokenizing Protocol

To create a minimal class that can take a single character and return a token, just implement this protocol. It has just two methods. If it can tokenise a character, it should return true from canTokenize(). If it returns true it will be asked for the token with a call to tokenFor(). Simple. 

@objc
protocol Tokenizing{
    //Returns true if the supplied character can be tokenized by the implementing class
    func canTokenize(character:UnicodeScalar) -> Bool
    //Only called if the implementation returned true for canTokenize
    func tokenFor(character:UnicodeScalar) -> Token
}

We'll see some implementations later, but for now we also will need to deal with the parsing of more complex tokens (we might want to pick out whole words, or deal with quoted strings for example). For that we need something that captures some state. 

TokenizerState

This could have been another protocol, but we bumped up against some Swift bugs, and in all honesty there are some things that every complex tokenizer will want to do. 

class TokenizerState : Tokenizing{
    var tokenizers = Array<Tokenizing>()
    var tokens = Array<Token>()
    
    
    //Fetches the correct tokenizer for the given character
    func tokenizerFor(character:UnicodeScalar) -> Tokenizing?{
        for tokenizer in tokenizers{
            if tokenizer.canTokenize(character){
                return tokenizer
            }
        }
        return nil
    }
    
    //Determines if a token can be created by asking all 
    //the tokenizers it has if they can create a token for the character
    func canTokenize(character:UnicodeScalar) -> Bool{
        return tokenizerFor(character) != nil
    }
    
    //Returns the correct token
    func tokenFor(character:UnicodeScalar) -> Token{
        if let tokenizer = tokenizerFor(character){
            //If it is a state, create a new context token for it
            if tokenizer is TokenizerState{
                return NewContextToken(newContext: tokenizer as TokenizerState)
            } else {
                return tokenizer.tokenFor(character)
            }
        }

        return Token.ErrorToken(forCharacter: character, problemDescription: "Failed to create token")        
    }
    
    //Add a token to the stored state
    func append(token:Token) {
        tokens.append(token)
    }
    
    //Return all the tokens in the stored state, this
    //is the point to over-ride in a sub-class if you wish
    //to consolidate multiple tokens into one
    func flushTokens() -> Array<Token>? {
        let oldTokens = tokens
        
        tokens = Array<Token>()
        
        return oldTokens
    }    
}

State is captured as an array of tokens, which are built up using the array of other Tokenizers. Those might be simple character tokenizers, or TokenizerState sub-classes themselves. Note that essentially this class just delegates the recognition of tokens to its known tokenizers, so you may still get a stream of character tokens for example, but you could consolidate those in append() or flushTokens(). It's critical that after flushing the tokens that any state is cleared.

At Last, The Tokenizer

class Tokenizer : TokenizerState{
    var stateStack = Stack<TokenizerState>()
    
    //Entry point for tokenization of a string, returning an 
    //array of tokens
    func tokenize(string:String) -> Array<Token>{
        stateStack = Stack<TokenizerState>()
        stateStack.push(self)
        
        for character in string.unicodeScalars{
            tokenize(character)
            if stateStack.isEmpty(){
                return tokens
            }
        }
        
        fullyUnwind()
        
        return tokens
    }
    
    
    //Removes the current state from the stack, passing
    //its tokens to the next state on the stack
    func unwindState(){
        let state = stateStack.top()!
        stateStack.pop()
        
        if let newState = stateStack.top(){
            //If there is a new state, pass the tokens on to the state
            if let currentStateTokens = state.flushTokens(){
                for token in currentStateTokens{
                    newState.append(token)
                }
            }
        } else {
            //otherwise add them to mine
            if let currentStateTokens = state.flushTokens(){
                tokens.extend(currentStateTokens)
            }
        }
        
    }
    
    //If we hit an error, unwind all the way
    func fullyUnwind(){
        while !stateStack.isEmpty(){
            unwindState()
        }
    }
    
    func tokenize(character:UnicodeScalar){
        if let state = stateStack.top(){
            if state.canTokenize(character){
                let token = state.tokenFor(character)
                
                if let newContextToken = token as? NewContextToken{
                    stateStack.push(newContextToken.newContext)
                    tokenize(character)
                } else {
                    state.append(token)
                }
            } else {
                unwindState()
                tokenize(character)
            }
        } else {
            //If there's no state, we have fully unwound and could not tokenize
            append(Token.ErrorToken(forCharacter: character, problemDescription: "Unrecognized character"))
        }
    }    
}

As you can see, it's just a stateful tokenizer itself, but it manages a more complex stack of states, enabling token scanning to enter new modes as different characters are encountered. Note there is a special token that a TokenizerState can return, NewContext.

@objc
class NewContextToken : Token{
    let newContext : TokenizerState
    init(newContext:TokenizerState){
        self.newContext = TokenizerState() //Works around a compiler bug that causes a segmentation fault if newContext is assigned directly
        self.newContext = newContext
        super.init(name: "NewContextToken")
    }
}


This special token tells the Tokenizer that rather than adding  it to the stream of tokens, it should push a new TokenizerState onto the stack. 

Speaking of the stack, it's a very vanilla implementation

class Stack<T>{
    var stack = Array<T>()
    
    func push(value: T){
        stack.append(value)
    }
    
    func pop() -> T?{
        if (stack.count > 0){
            return stack.removeLast()
        } else {
            return nil
        }
    }
    
    func isEmpty()->Bool{
        return depth() == 0
    }
    
    func depth()->Int{
        return stack.count
    }
    
    func top()->T?{
        if stack.count == 0 {
            return nil
        }
        return stack[stack.endIndex-1]
    }
    
    func popAll(){
        stack = Array<T>();
    }
}

That's it, it's all we need to start building something useful. 

Example 1: The Useless Tokenizer

//We'll need to easily print out the tokens we get
func printTokens(label:String, tokenizer:Tokenizer){
    println(label+" Tokens:")
    for token in tokenizer.tokens{
        println("\t"+token.description())
    }
    println()
}

//And something to test with
let parsingTest = "Short 10 string"

//If we define nothing else, we just get an invalid token straight away
let uselessTokenizer = Tokenizer()
uselessTokenizer.tokenize(parsingTest)
printTokens("Useless",uselessTokenizer)

If we create an instance of Tokenizer with nothing else, you'll see that it will immediately fail with an Error token

Useless Tokens:
    Error 'S' - Unrecognized character

Not a huge step forward, but we can see that it's trying! We will increase the complexity until we have a meaningful tokenisation of our parsing test string

Recognising Characters

The first thing we'll want to do is recognise certain classes of characters (letters, digits and white space in the case of our test). So let's build a class that makes it easy to do that

class CharacterTokenizer : Tokenizing{
    let matchedCharacters : String
    
    init(character : Character, tokenName : String){
        matchedCharacters = "\(character)"
    }
    
    init(allCharacters:String, tokenName : String){
        matchedCharacters = allCharacters
    }
    
    func canTokenize(character: UnicodeScalar) -> Bool {
        for matches in matchedCharacters.unicodeScalars{
            if matches == character {
                return true;
            }
        }
        
        return false
    }
    
    func tokenFor(character: UnicodeScalar) -> Token {
        return CharacterToken(character: character)
    }
    
    class CharacterToken : Token{
        init(character:UnicodeScalar){
            super.init(name: "Character", withCharacters: "\(character)")
        }
    }
}

It's pretty simple, give it a string of valid characters for the token, and it will return a character token for it. We can now create tokenisers for our different classes of characters

let digitTokenizer = CharacterTokenizer(allCharacters: "0123456789",tokenName: "digit")
let whiteSpaceTokenizer = CharacterTokenizer(allCharacters: " \t", tokenName: "whitespace") //You'll probably want to add /n and /r
let englishLetterTokenizer = CharacterTokenizer(allCharacters: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ", tokenName: "letter")

We can now feed these to the tokenizer and we should get a little further

let characterTokenizer = Tokenizer()
characterTokenizer.tokenizers = [
    digitTokenizer,
    whiteSpaceTokenizer,
    englishLetterTokenizer
    ]
characterTokenizer.tokenize(parsingTest)
printTokens("Character",characterTokenizer)

This yields the rather verbose output

Character Tokens:
    Character 'S'
    Character 'h'
    Character 'o'
    Character 'r'
    Character 't'
    Character ' '
    Character '1'
    Character '0'
    Character ' '
    Character 's'
    Character 't'
    Character 'r'
    Character 'i'
    Character 'n'
    Character 'g'

OK... that's useful, but really the tokens we would probably expect are 'Short' '10' and 'string'. We'd probably also expect that '10' would be an actual integer, not just a string. 

Token Consolidation

What we really need to do is look for sequences of the different character classes. For that we need a sub-class of TokenizerState. Here it is with an implementation for each of the different character classes.

class CharacterSequenceTokenGenerator : TokenizerState{
    //Flushes the token stream, and creates a string with all
    //of the token stream charactes in
    func consolidateCharacters() -> String {
        var string = ""
        
        if let allTokens = super.flushTokens(){
            for token in allTokens{
                string += token.characters
            }
            return string
        } else {
            return ""
        }
    }
    
}

//One for English letters
class WordTokenGenerator : CharacterSequenceTokenGenerator{
    init(){
        super.init()
        tokenizers = [englishLetterTokenizer]
    }
    
    override func flushTokens() -> Array<Token>? {
        var wordString = consolidateCharacters()
        
        return [Token(name: "Word", withCharacters: wordString)]
    }
    
}

//One for Integers
class IntegerTokenGenerator : CharacterSequenceTokenGenerator{
    init(){
        super.init()
        tokenizers = [digitTokenizer]
    }
    
    override func flushTokens() -> Array<Token>? {
        var integerString = consolidateCharacters()
        return [IntegerToken(number: integerString)]
    }
    
    
    class IntegerToken : Token{
        let intValue : Int
        init(number : String){
            intValue = number.toInt()!
            super.init(name: "Integer", withCharacters: number)
        }
    }
}

//One for white space
class WhiteSpaceTokenGenerator : CharacterSequenceTokenGenerator{
    init(){
        super.init()
        tokenizers = [whiteSpaceTokenizer]
    }
    
    override func flushTokens() -> Array<Token>? {
        let spaceString = consolidateCharacters()
        return [Token(name:"Whitespace", withCharacters:spaceString)]
    }
}

Note that the pattern is very simple, for each character tokenizer we wrap it in some state, and then finally, when asked to output our tokens (when we hit a character that can't be tokenised) we consolidate all of the tokens into a single one. 

We're not quite done, we need one final class to wrap all of these different states into one (the root state for the tokenizer if you like). 

class SentanceTokenGenerator : TokenizerState{
    init(){
        super.init()
        tokenizers = [
            WhiteSpaceTokenGenerator(),
            IntegerTokenGenerator(),
            WordTokenGenerator(),
            ]
    }
    
}

It just pulls all of these together. We can now use it. 

let sentanceTokenizer = Tokenizer()
sentanceTokenizer.tokenizers = [SentanceTokenGenerator()]
sentanceTokenizer.tokenize(parsingTest)
printTokens("Sentance", sentanceTokenizer)

And finally we get some output close to what we might want

Sentance Tokens:
    Word 'Short'
    Whitespace ' '
    Integer '10'
    Whitespace ' '
    Word 'string'

Excellent! As you can hopefully see, if you build these up in simple increments starting with the most basic it's very easy to construct a much more advanced parser. Let's finish off with something much tougher!

Complex Token Parsing

We might want to create a tokenizer for syntax highlighting code, and therefore want to capture quoted strings. Sounds easy initially... that's just a state that has to have an open and close quote. However, we'll also need to deal with escaped characters in the string (including the \"). That's quite easy

class EscapedCharacterTokenGenerator : CharacterSequenceTokenGenerator{
    let escapedCharacters = CharacterTokenizer(allCharacters: "\"rt0n\\",tokenName: "control-code")
    
    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func canTokenize(character: UnicodeScalar) -> Bool {
        switch tokens.count {
        case 0:
            return character == "\\"
        case 1:
            return escapedCharacters.canTokenize(character)
        default:
            return false
        }
    }
    
    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func tokenFor(character: UnicodeScalar) -> Token {
        return CharacterTokenizer.CharacterToken(character: character)
    }
    
    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func flushTokens() -> Array<Token>? {
        let escapeCode = consolidateCharacters()
        return [Token(name:"EscapeCode", withCharacters:escapeCode)]
    }
}

If it encounters a \ and returns itself as a new state and looks for valid escaped characters. When flushing it turns these into a single EscapeCode token. 

Right, now we just need something very similar for the quoted string itself

class QuoteTokenGenerator : CharacterSequenceTokenGenerator{
    let QUOTE_DELIMETER = "quote-delimeter"
    
    init(){
        super.init()
        tokenizers = [
            EscapedCharacterTokenGenerator(),
        ]
    }
    
    func openedAndClosed()->Bool{
        var delimeterCount = 0
        for token in tokens{
            if token.name == QUOTE_DELIMETER{
                delimeterCount++
            }
        }
        return delimeterCount > 1
    }
    
    @objc   //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func canTokenize(character: UnicodeScalar) -> Bool {
        if tokens.count == 0 {
            return character == "\""
        }
        
        //You may want to look for newlines etc and reject, but in this case we will keep it simple
        return !openedAndClosed()
    }
    
    @objc //Workaround Swift bug, has to be declared with @objc attribute otherwise first implementation is used
    override func tokenFor(character: UnicodeScalar) -> Token {
        if tokens.count == 0 {
            return Token(name: QUOTE_DELIMETER, withCharacters: "\(character)")
        }
        
        //Let the super class deal with any escaped characters
        if super.canTokenize(character){
            return super.tokenFor(character)
        }
        
        //We can accept any character now, but we want to give a quote a speicial name
        if character == "\"" {
            return Token(name:QUOTE_DELIMETER,withCharacters: "\(character)")
        } else {
            return CharacterTokenizer.CharacterToken(character: character)
        }
    }
    
    override func flushTokens() -> Array<Token>? {
        var quotedString = consolidateCharacters()
        
        return [Token(name: "Quote", withCharacters: quotedString)]
    }

}

Again, it looks for the opening " and then starts accepting every character until it has a closing quote. Note that it uses the EscapedCharacterTokenizer we defined above. 

Again, we want to wrap all our tokenisers up (we could just do this when we create the Tokenizer, but I like to bundle them up in a single higher-level state, it makes re-usability a bit clearer)

class AdvancedSentanceTokenGenerator : TokenizerState{
    init(){
        super.init()
        tokenizers =  [
            QuoteTokenGenerator(),
            WhiteSpaceTokenGenerator(),
            CharacterTokenizer(allCharacters: "!,:'?`./()&%$£@€#;", tokenName: "Punctuation"),
            IntegerTokenGenerator(),
            WordTokenGenerator(),
            ]
    }

}

Note we've added a few more in there too, so we'll need a nasty string to test too!

let tougherTest = "Nasty example with 1 \"Great \\t 🐨 \\\"Nested\\\" quote\"!"
let advancedSentanceTokenizer = Tokenizer()
advancedSentanceTokenizer.tokenizers = [AdvancedSentanceTokenGenerator()]
advancedSentanceTokenizer.tokenize(tougherTest)
printTokens("Advanced Sentance", advancedSentanceTokenizer)

Plenty of escapes and Koala's in there!

Advanced Sentance Tokens:
    Word 'Nasty'
    Whitespace ' '
    Word 'example'
    Whitespace ' '
    Word 'with'
    Whitespace ' '
    Integer '1'
    Whitespace ' '
    Quote '"Great \t 🐨 \"Nested\" quote"'
    Character '!'

And there we have it, some genuinely useful output! We'll be including this in the OysterKit framework when Yosemite and iOS 8 ships, and we'll tidy up the architecture before then. For now you can copy and paste it all into a single playground and experiment. 

We'd love to hear your feedback!