Building a Swift Parser with an Improved Tokenizer
We've made some changes to the tokenizer described in our earlier post to improve its functionality and API. Although it has been heavily refactored, the internals are not that different, and we'd like to move on to some richer examples, ending with a foray into an example parser.
For those who are thinking about using the library, there is a reason it's not on GitHub yet... a few comments and this blog are all the documentation it has had. We are still learning Swift and trying out approaches (like the one described here) to how best to use the language's features, and we reserve the right to refactor wherever we like. The tokenizer is becoming less likely to change, but I'm still not happy with all of the patterns.
Architecturally, the tokenizer has moved to a more generic state machine, with the ability to "push" new contexts. The underlying protocol is very simple:
```swift
enum TokenizationStateChange {
    // No state change required
    case None
    // Leave this state
    case Exit
    // Leave this state, and there was an error
    case Error(errorToken: Token.ErrorToken)
    // Move to this new state
    case Transition(newState: TokenizationState)
}

protocol TokenizationState {
    func couldEnterWithCharacter(character: UnicodeScalar, controller: TokenizationController) -> Bool
    func consume(character: UnicodeScalar, controller: TokenizationController) -> TokenizationStateChange
    func reset()
    func didEnter()
    func didExit()
    func addBranchTo(newState: TokenizationState) -> TokenizationState
    func addBranchesTo(newStates: Array<TokenizationState>) -> TokenizationState
    func chain(newStates: Array<TokenizationState>) -> TokenizationState
    func createTokensUsing(creationBlock: TokenCreationBlock) -> TokenizationState
    func description() -> String
}

typealias TokenCreationBlock = ((state: TokenizationState, controller: TokenizationController) -> Token)
```
We've split entry testing into couldEnterWithCharacter(); if that returns true, consume() is called, which follows the same pattern as before. Other than this, and notification of when a state is entered and exited, the majority of the new API is intended to make it easier to build up a parser. We'll see these in action later, but the addBranch/chain family of methods allows sequences and branching states to be set up very quickly.
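To make the two-phase flow concrete, here is a minimal sketch of how a controller might drive a state: test entry first, then let the state consume the character and act on the returned TokenizationStateChange. This is my own illustration, not code from the library; only the protocol and enum names come from the listing above, and the feed function is hypothetical.

```swift
// A minimal sketch, assuming the protocol above; 'feed' is a hypothetical
// helper, not part of the library's API.
func feed(character: UnicodeScalar, state: TokenizationState, controller: TokenizationController) {
    // Phase 1: ask the state whether it can be entered with this character
    if state.couldEnterWithCharacter(character, controller: controller) {
        // Phase 2: let the state consume it and react to the result
        switch state.consume(character, controller: controller) {
        case .None:
            break                       // Stay in the current state
        case .Exit:
            state.didExit()             // The state has finished its token
        case .Error(let errorToken):
            println("Tokenization error: \(errorToken)")
        case .Transition(let newState):
            state.didExit()             // Leave the old state...
            newState.didEnter()         // ...and enter the new one
        }
    }
}
```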
There are four key implementations of the protocol:
- Branch - This is the base state for the other three, and has the basic branching behaviour. When asked if it could enter with a character, it checks all of its branches, returning true only if one of them can process it. It will then transition to that state.
- SingleCharacter - This very useful state can be used for many things, but principally it will only enter on a particular character (or one of a set of allowed characters). Once it has consumed its matching character, it behaves like a normal Branch.
- Repeated - This state is almost a recursive tokenizer: it is supplied with an internal state (which can be a chain of states, just like a primary tokenizer), and you may specify a minimum and/or maximum number of times a token should be generated by the repeated state. Once the criteria have been matched (at least once), it behaves like a normal Branch.
- Delimited - This special state introduces the stacking behaviour from the previous implementation. When the delimiter character is encountered, a new set of root states is pushed onto the tokenization controller; when the delimiter character is encountered again, the state is popped.
With these four simple states it is possible to build very complex tokenizers; a hypothetical composition is sketched below. To help, we have created a class called StandardTokenizers, which has tokenizers for the most useful character classes as well as many more complex but common sequences. I'd recommend flicking through it to get a sense of how the simpler states are used.
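As a rough illustration of how these states compose, here is a sketch of a tokenizer that handles quoted strings and runs of numbers. The Repeated and Delimited initialisers shown are assumptions on my part (the library's real constructors may differ); only the StandardTokenizers members and the addBranchesTo() call appear elsewhere in this post.

```swift
// Sketch only: the Repeated and Delimited initialiser signatures below
// are assumptions, not the library's documented API.

// One or more numbers, each generating its own token
let numbers = Repeated(state: StandardTokenizers.number, minimum: 1, maximum: nil)

// Between double quotes, push a different set of root states onto the
// controller until the closing quote pops them again
let quoted = Delimited(delimiter: "\"", states: [
    StandardTokenizers.word,
    StandardTokenizers.blanks
])

// Combine them with the ordinary branching API
let listTokenizer: Tokenizer = Tokenizer().addBranchesTo([
    quoted,
    numbers,
    StandardTokenizers.blanks,
    StandardTokenizers.eot
])
```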
Here are some more complex examples from the project's main.swift file:
```swift
// And something to test with
let parsingTest = "Short 10 string"

//----------------------------
// Simple Sentance tokenizer
//----------------------------
let sentanceTokenizer:Tokenizer = Tokenizer().addBranchesTo([
    StandardTokenizers.blanks,
    StandardTokenizers.number,
    StandardTokenizers.word,
    StandardTokenizers.eot
])

println("\nSentance:")
sentanceTokenizer.tokenize(parsingTest, newToken: tokenPrinter)
```
We used this example in the previous post, but hopefully it is much clearer what is happening with the improved API. We create a tokenizer and then call addBranchesTo() to add all the base transitions. Note that the chain/addBranch methods are chainable; note also the need for an End token (or EOT, end of transmission), which you don't need to specify in your string. The output is much as we had before.
```
Sentance:
word 'Short'
blank ' '
integer '10'
blank ' '
word 'string'
end ''
```
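For completeness: tokenPrinter isn't defined in the snippet above. A plausible sketch, assuming Token exposes name and characters properties (names inferred from the printed output, not confirmed by the library), would be something like:

```swift
// A guess at tokenPrinter; 'name' and 'characters' are assumed property
// names on Token, inferred from the output above.
let tokenPrinter = { (token: Token) -> () in
    println("\(token.name) '\(token.characters)'")
}
```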