Embeddable WebAssembly Parsers

How Wasm lets us write portable solutions for parsing

Developer tooling has grown by leaps and bounds. Popular languages have strong support in multiple editors, linters, formatters, code review tools, and more. Even so, tools that support a wide variety of programming languages are hard to write and maintain.

WebAssembly (Wasm) as a compact, portable, and secure format for programs presents a new option for building polyglot tools: parsers written in Wasm-supporting languages that can be leveraged by various tools.

These could be generated by parser generators or written by hand, enable support for every class of language grammar, and can be consumed in any environment where Wasm can be run. So any browser and anywhere wasmtime can be run (hint: that’s almost everywhere) would be able to use them.

Components and Interfaces

WebAssembly has an evolving component model that makes it easier for WASM modules to talk to other modules through high-level interfaces. This allows us to create an interface definition .wit file that specifies what it means to be a parser.

The simplest version of this might take in a string, and output a JSON representation of the parse tree.

parse: function(input: string) -> string

A WIT interface for a parser that inputs and outputs strings

Once we define the output format, even this very simple interface is enough to get us started and enables us to write parsers in Wasm languages that work on any platform.

This is still pretty low level though and doesn’t offer very much structure (and potentially performance). It’s also a bit silly that you have to then parse your parser output as JSON. We can do better.

A Higher Level Interface

Let’s make an interface that is higher level and actually encodes the structure of a parse tree

Tokens

We’ll start by defining a record type to represent tokens.

record token {
    label: string,
    span: span
}

record span {
    offset: u32, // The byte index where the token starts
    length: u32, // The number of bytes long the token is
}

A WIT interface for tokens (label + span) and spans (offset + length)

Each token gets a string label that identifies what kind of token it is (e.g. identifier, keyword) and a span that identifies what part of the input it represents. Span info is crucial to the functioning of formatters, linters, and many other tools.

Parse Tree

Now that we can represent tokens, we need to represent the parse tree itself. Currently, the WIT format does not support recursive type definitions which are the typical way of doing this. Instead, we’ll use indexes to provide a little indirection.

record branch {
    label: string,
    // The nodes that this branch contains
    children: list<node-index>
}

variant node-index {
    token(u32),  // refers to the Nth token in output
    branch(u32), // refers to the Nth branch in output
}

record output {
    tokens: list<token>,
    tree: list<branch>
}

A WIT interface for branches which reference children by index

The Parser

Now we have all the tools we need to describe the parser interface. A parser simply transforms a string into parser output.

parse: function(input: string) -> output

A WIT interface for a parser that returns a structured output

Keen readers might notice that there is no explicit mention of error handling so far. In this simplified example, errors will simply be encoded as tokens and branches with a special label. Future expansions to EWPs could add more explicit error handling.

Implementing the Interface

The interface we just wrote isn’t hypothetical, we can implement it right now!

Using Rust and the wit-bindgen project we can create some scaffolding that looks like this. The wit_bindgen_rust export line tells bindgen that we want to export the interface we defined, which just means that we will provide an implementation for other components to use.

wit_bindgen_rust::export!("../parser1.wit");

struct Parser1 {}

impl parser1::Parser1 for Parser1 {
    fn parse(input: String) -> parser1::Output {
        let tokens = tokenize(&input);
        let tree = parse(&tokens);
        parser1::Output { tokens, tree }
    }
}

pub fn tokenize(input: &str) -> Vec<Token> {
    ...
}

pub fn parse(input: &[Token]) -> Vec<Branch> {
    ...
}

Rust code showing how wit-bindgen can be used for the parser interface

From there we just need to implement the tokenizing and parsing logic for our language of choice. I chose to use JSON since it is simple and practical. The source code for this is available on GitHub but is too long to cover in this article.

Consuming the Interface

Now that we have defined a simple Embeddable WASM Parser interface and implemented it for a language, it’s time to use that implementation to do something useful.

EWP Tool Example

There are a wide variety of things you could do with the ability to parse arbitrary languages, but in the spirit of keeping things simple our example is just going to print the parse tree in a nice format.

The tool will take in the path to a parser’s WASM file and the file to parse and print out the parse tree as an S-expression like so.

$ cargo run -- ../ewp_json.wasm ../test.json
(Object
    (LBrace)
    (Entry
        (String)
        (Colon)
        (Number)
    )
    (RBrace)
)

The tree-ewp tool being run to output a parenthesized parse tree

Using a Runtime

The tool is going to need to execute Embeddable WASM Parsers, which means it needs a runtime. We’ll use wasmtime which is an official project of the Bytecode Alliance, a non-profit foundation building Open Source implementations of the standard.

We’ll create a new crate called tree-ewp with dependencies for running and talking to our WASM runtime.

[package]
name = "tree-ewp"
version = "0.1.0"
edition = "2021"

[dependencies]
wasmtime = "0.33.0"
wit-bindgen-wasmtime = { ... }
...

The Cargo.toml file for the tree-ewp command line tool

In our code, we just instantiate the runtime, load and initialize our WASM file, execute the parse function, and then print the output.

// Set up WASMTIME
let engine = Engine::default();
let mut linker = Linker::new(&engine);
let mut store = Store::new(&engine, interface::Parser1Data {});

// Load and initialize our EWP module
let wat = fs::read(args.ewp_path)
    .expect("Could not read EWP WASM module file");
let module = Module::new(&engine, wat)
    .expect("Could not initialize module");
let result = interface::Parser1::instantiate(
            &mut store, &module,
            &mut linker, get_whole_store
    )
    .expect("Failed to instantiate interface");


// Read input file and parse it
let input = fs::read_to_string(args.input_path)
    .expect("Could not read pares input file");
let output = result.1.parse(&mut store, &input)
    .expect("Failed to run EWP");

printer::pretty_print(output);

The high level implementation of the tree-ewp tool using wasmtime

Printing S-expressions is done with a few recursive functions that keep track of the indentation. You can see how that’s done and the rest of the tree-ewp tool in the GitHub repo.

Wrap Up

WebAssembly is an incredibly promising technology enabling new levels of security and portability for code. Its modularity allows us to share code conforming to specified interfaces as a way to solve specific problems, and it can be used to create portable parsers that we can run anywhere.