Trickery to Tame Big WebAssembly Binaries

by @bwasti

In light of the extremely popular PyScript project (it's great, check it out!), I thought I'd touch on a somewhat glaring issue...

The nearly two-second latency involved in loading over 8MB of assets for a "hello world" example. (You can check that out here.)

Big Binaries and Where To Find Them

It should come as no surprise that compiling a large project into an extremely simple ISA blows up the binary size. CPython is around 350,000 lines of C and getting it compiled to only ~6MB of wasm is nothing short of impressive.

But, sending 6MB over the wire before anything useful can happen isn't really an ideal experience.

Many other useful projects will face this issue, so a solution would be quite nice to have.

What can we do? In this post I go over some techniques I've been playing with. They're largely hacks, so please only read for enjoyment and not edification. :)

Splitting up Files

The first approach to explore is manually splitting up files and only loading the ones you need.

int a_fn(int x) {
  return x + 13;
}

int b_fn(int x) {
  return x * 7;
}

This is by far the easiest, conceptually, but it can be quite a hassle. Not all files can be cleanly split because they may have dependencies. In the worst (yet quite common) case, every file has some function that depends on some functionality in another file.

int a_fn(int x) {
  x = b_fn(x); // damn :(
  return x + 13;
}

our problem case

Inspecting the `wasm`

Let's take a peek under the hood to check out what that dependency looks like.

WebAssembly has a restrictive and easily analyzed ISA, which makes this pretty easy. To inspect it, we'll use wasm-objdump. The disassembled code for the binary looks a bit like this:

$ wasm-objdump -d combined.wasm

func[1] <a_fn(int)>:
 local.get 0
 call 2 <b_fn(int)> <--- Our dependency!
 i32.const 13
 i32.add
 end
func[2] <b_fn(int)>:
 local.get 0
 i32.const 7
 i32.mul
 end

Let's pretend these functions are much bigger and discuss two possible outcomes after we send the binary to the user:

1. The user calls int a_fn(int)
2. The user only calls int b_fn(int)

In case 1, we've done all we can and there's nothing to worry about. The issue is case 2. We just sent all of the int a_fn(int) section over the wire as well, potentially causing many milliseconds of hang time!

Let's avoid that at all costs.

Imports

One way to split up binaries even when there are dependencies is to just force it. The flag --allow-undefined will make this possible (for wasm-ld). Other compilation stacks will likely have similar flags. Note that this doesn't involve changing the source code, just the way we compile it!

// b.cpp
int b_fn(int x) {
  return x * 7;
}

// a.cpp
int b_fn(int x);

int a_fn(int x) {
  x = b_fn(x);
  return x + 13;
}

clang++ --target=wasm32 -nostdlib -O3 -c a.cpp -o /tmp/a.o
clang++ --target=wasm32 -nostdlib -O3 -c b.cpp -o /tmp/b.o
wasm-ld --no-entry --export-all --lto-O3 --allow-undefined --import-memory /tmp/a.o -o a.wasm
wasm-ld --no-entry --export-all --lto-O3 --allow-undefined --import-memory /tmp/b.o -o b.wasm

The a.wasm binary ends up with an import section:

$ wasm-objdump -x -j Import a.wasm

a.wasm:	file format wasm 0x1

Section Details:

Import[2]:
 - memory[0] pages: initial=2 <- env.memory
 - func[0] sig=0 <b_fn(int)> <- env._Z4b_fni

But its code section looks exactly as expected!

$ wasm-objdump -d a.wasm

a.wasm:	file format wasm 0x1

Code Disassembly:

func[2] <a_fn(int)>:
 local.get 0
 call 0 <b_fn(int)>
 i32.const 13
 i32.add
 end

Ok, but wouldn't this just break if we ran it? Yes.
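
To make that "Yes" concrete, here is a minimal sketch of the failure. It assumes a tiny hand-assembled stand-in for a.wasm (the bytes below are written out by hand for illustration, not produced by the clang commands above), with the import named plain b_fn as if the functions were declared extern "C", and with no memory import.

```javascript
// Hypothetical stand-in for a.wasm: imports env.b_fn and exports
// a_fn(x) = b_fn(x) + 13. Hand-assembled for illustration only.
const A_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type 0: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import from "env"...
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,                   // ..."b_fn" as func 0
  0x03, 0x02, 0x01, 0x00,                                     // one local function (func 1)
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01, // export "a_fn" = func 1
  0x0a, 0x0b, 0x01, 0x09, 0x00,                               // code section, one body
  0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b,             // local.get 0; call 0; i32.const 13; i32.add; end
]);

let linkError = null;
try {
  // env exists, but env.b_fn is missing, so linking fails
  new WebAssembly.Instance(new WebAssembly.Module(A_WASM), { env: {} });
} catch (e) {
  linkError = e;
}

console.log(linkError instanceof WebAssembly.LinkError); // true
```

The module compiles fine on its own. It is only instantiation that refuses to proceed without something callable for b_fn.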

Dynamic Linking

We'll need to specify the implementation for int b_fn(int) at load time. Something like this:

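const memory = new WebAssembly.Memory({initial:10, maximum:1000});
const b_imports = {
    env: {
      memory: memory
    }
  };
// instantiateStreaming resolves to { module, instance }
const { instance: b_instance } = await WebAssembly.instantiateStreaming(fetch('b.wasm'), b_imports);

const a_imports = {
    env: {
      memory: memory,
      // the key has to match the import name wasm-objdump reported,
      // which is the mangled C++ name here (declaring the functions
      // extern "C" would keep the plain name b_fn)
      _Z4b_fni: b_instance.exports._Z4b_fni
    }
  };
const { instance: a_instance } = await WebAssembly.instantiateStreaming(fetch('a.wasm'), a_imports);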

And now, if we don't really need int a_fn(int), we can just skip that last bit of code and save a bunch of bandwidth. Woo!

(Note that the memory between these modules is shared! That means more complex heap-based computation is fine.)
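
For a version of this that runs without a server, here is a rough sketch using two hand-assembled modules in place of a.wasm and b.wasm. The byte arrays are assumptions made for illustration: they use the plain names a_fn and b_fn (as if declared extern "C") and skip the shared memory import entirely, so they only show the function linking.

```javascript
// Hypothetical stand-in for b.wasm: exports b_fn(x) = x * 7.
const B_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type 0: (i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                     // one local function (func 0)
  0x07, 0x08, 0x01, 0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00, // export "b_fn" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                               // code section, one body
  0x20, 0x00, 0x41, 0x07, 0x6c, 0x0b,                         // local.get 0; i32.const 7; i32.mul; end
]);

// Hypothetical stand-in for a.wasm: imports env.b_fn, exports a_fn(x) = b_fn(x) + 13.
const A_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type 0: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import from "env"...
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,                   // ..."b_fn" as func 0
  0x03, 0x02, 0x01, 0x00,                                     // one local function (func 1)
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01, // export "a_fn" = func 1
  0x0a, 0x0b, 0x01, 0x09, 0x00,                               // code section, one body
  0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b,             // local.get 0; call 0; i32.const 13; i32.add; end
]);

// Load b first, then hand its export to a as the import.
const bInstance = new WebAssembly.Instance(new WebAssembly.Module(B_WASM), {});
const aInstance = new WebAssembly.Instance(new WebAssembly.Module(A_WASM), {
  env: { b_fn: bInstance.exports.b_fn },
});

const result = aInstance.exports.a_fn(3);
console.log(result); // 34, i.e. (3 * 7) + 13
```

The synchronous constructors are used here only to keep the sketch short. For real files fetched over the network, instantiateStreaming as shown above is the better fit.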

Generating `imports`

Manually tracking call sites and then generating the correct imports as we did above is quite arduous. We can automate this of course.

I ended up creating a data structure to answer three questions:

Given a module, which functions does it need to import?
Given a function, which module does it live in?
Given a function, which functions (in different files) does it depend on?

The first two are easily answered by wasm-objdump but the last requires some care to be taken. I won't go into details in this post, but it's an interesting little problem.

I wrote a Python script to do this and the output is JSON:

{
  "module_imports": {
    "a.wasm": [
      "b_fn"
    ],
    "b.wasm": []
  },
  "func_locations": {
    "a_fn": "a.wasm",
    "b_fn": "b.wasm"
  },
  "func_import_deps": {
    "a_fn": [
      "b_fn"
    ]
  }
}
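
The first two tables can also be approximated straight from JavaScript. The sketch below is not the Python script mentioned above. It is an alternative that leans on the standard WebAssembly.Module.imports and WebAssembly.Module.exports reflection calls, run against hand-assembled stand-ins for a.wasm and b.wasm (assumed plain names, no memory import). The helper name describeModules is made up for this example, and the third table (per-function dependencies) is deliberately left out since it needs the disassembly.

```javascript
// Hypothetical stand-ins for the two binaries (hand-assembled for illustration).
const B_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type 0: (i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                     // one local function
  0x07, 0x08, 0x01, 0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00, // export "b_fn"
  0x0a, 0x09, 0x01, 0x07, 0x00,                               // code section, one body
  0x20, 0x00, 0x41, 0x07, 0x6c, 0x0b,                         // x * 7
]);
const A_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type 0: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import from "env"...
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,                   // ..."b_fn"
  0x03, 0x02, 0x01, 0x00,                                     // one local function
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01, // export "a_fn"
  0x0a, 0x0b, 0x01, 0x09, 0x00,                               // code section, one body
  0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b,             // b_fn(x) + 13
]);

// Builds module_imports and func_locations for a map of { filename: bytes }.
function describeModules(modules) {
  const module_imports = {};
  const func_locations = {};
  for (const [file, bytes] of Object.entries(modules)) {
    const mod = new WebAssembly.Module(bytes);
    // which functions does this module need to import?
    module_imports[file] = WebAssembly.Module.imports(mod)
      .filter((imp) => imp.kind === "function")
      .map((imp) => imp.name);
    // which module does each exported function live in?
    for (const exp of WebAssembly.Module.exports(mod)) {
      if (exp.kind === "function") {
        func_locations[exp.name] = file;
      }
    }
  }
  return { module_imports, func_locations };
}

const tables = describeModules({ "a.wasm": A_WASM, "b.wasm": B_WASM });
console.log(JSON.stringify(tables, null, 2));
```

Note that this only sees exported functions, so it relies on something like --export-all having been used at link time.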

Lazily Loading Dependencies Automatically

Ok great, we have the ability to load b.wasm without loading a.wasm. But no one wants to figure out which modules to load in JavaScript. We want the necessary WebAssembly to be loaded automatically and on-demand without changing the C/C++ that we're compiling.

Here's the trick:

WebAssembly function imports are just JavaScript functions, and a JavaScript function can look up what it calls at call time, so the real target can be filled in later.
When loading a module, every function import that hasn't been loaded yet starts out backed by null.
We're going to wrap every function call with a check to see if it (and its dependencies) have been loaded already. If not, we load them and replace the null with a legitimate function.

This way, users can call whatever functions they want and only the relevant wasm will be pulled over the network. In the worst case, the user calls every function and we end up with the same amount of wasm loaded as when we started.
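
Here is a small sketch of the "fill it in later" part on its own, under a few assumptions: the modules are hand-assembled stand-ins for a.wasm and b.wasm with plain (unmangled) names and no memory import, and the slots object is a made-up name for wherever the late-bound references live. The import is a thin JavaScript wrapper rather than null itself, because instantiation rejects a non-callable function import.

```javascript
// Hypothetical stand-ins for the two binaries (hand-assembled for illustration).
const B_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type 0: (i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                     // one local function
  0x07, 0x08, 0x01, 0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00, // export "b_fn"
  0x0a, 0x09, 0x01, 0x07, 0x00,                               // code section, one body
  0x20, 0x00, 0x41, 0x07, 0x6c, 0x0b,                         // x * 7
]);
const A_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type 0: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import from "env"...
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,                   // ..."b_fn"
  0x03, 0x02, 0x01, 0x00,                                     // one local function
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01, // export "a_fn"
  0x0a, 0x0b, 0x01, 0x09, 0x00,                               // code section, one body
  0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b,             // b_fn(x) + 13
]);

// Late-bound references: b_fn is not loaded yet.
const slots = { b_fn: null };

// Instantiate a.wasm first. The wrapper reads slots.b_fn only when called.
const a = new WebAssembly.Instance(new WebAssembly.Module(A_WASM), {
  env: { b_fn: (x) => slots.b_fn(x) },
});

// Later: b.wasm arrives and we swap the null for the real function.
slots.b_fn = new WebAssembly.Instance(new WebAssembly.Module(B_WASM), {}).exports.b_fn;

const lazyResult = a.exports.a_fn(3);
console.log(lazyResult); // 34
```

The wrapper costs an extra hop through JavaScript on every cross-module call, which is the price paid for being able to load a module before its dependencies exist.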

Code Stuff

(Feel free to skip to results :^})

First we'll create a Loader class that takes in our above generated JSON file.

class Loader {
  constructor(json_file) {
    this.json_file = json_file;
  }
  
  async init() {
    this.json = await (await fetch(this.json_file)).json();
    this.module_imports = this.json.module_imports;
    this.func_locations = this.json.func_locations;
    this.funcs = {};
    for (let func of Object.keys(this.func_locations)) {
      this.funcs[func] = new Func(func, this);
    }
    this.func_import_deps = this.json.func_import_deps;
  }
  
  // TODO
}

It also initializes a bunch of Funcs, which will store references to the actual WebAssembly functions.

class Func {
  constructor(fn, loader) {
    this.loaded = false;
    this.loader = loader;
    this.fn = fn;
    this.func = null;
  }

  async call(...args) {
    if (!this.loaded) {
      await this.loader.load(this.fn);
    }
    return this.func(...args);
  }
}

This sets up a structure that will be used like this:

const loader = new Loader('processed.json');
await loader.init();

const a_fn_output = await loader.funcs["a_fn"].call(3);

Why all the awaits? Well we're converting our WebAssembly functions into lazily loaded functions that may hit the network to resolve dependencies on the first call.

If it does, we'll have to wait for the result to come back. Otherwise, we just call the function immediately.

Finally, we need to implement the load(fn) function that we hit in the worst case. Over in the Loader class we can add these two methods:

// in Loader class, dedented for clarity

async load(fn) {
  // already loaded!
  if (this.funcs[fn].loaded) {
    return;
  }
  // if we have deps, load them first
  if (fn in this.func_import_deps) {
    for (let dep of this.func_import_deps[fn]) {
      if (this.funcs[dep].loaded) {
        continue;
      }
      // recurse :^)
      await this.load(dep);
    }
  }
  
  // bring our module into memory
  await this.load_wasm(this.func_locations[fn]);
}

async load_wasm(wasm_fn) {
  // `memory` is the shared WebAssembly.Memory created earlier
  const imports = { env: { memory: memory } };
  if (wasm_fn in this.module_imports) {
    for (let imp of this.module_imports[wasm_fn]) {
      // the target might actually still be null! that's okay,
      // it'll be updated when needed. we import a thin wrapper
      // because instantiation throws a LinkError if a function
      // import isn't callable.
      const target = this.funcs[imp];
      imports.env[imp] = (...args) => target.func(...args);
    }
  }
  const m = await WebAssembly.instantiateStreaming(fetch(wasm_fn), imports);
  const exports = m.instance.exports;
  for (let e of Object.keys(exports)) {
    if (e in this.funcs) {
      // this is the key bit of magic
      this.funcs[e].func = exports[e];
      this.funcs[e].loaded = true;
    }
  }
}

And that's pretty much it! We've now got a nice way to load chunks of a library at the granularity of individual compilation units without changing any of the source code.

Results

I wasn't happy with just the toy case above so I generated a thousand files with varying dependencies on each other to test out this approach.

The result was an 8.7MB fully merged single binary. Using the methods above, it's 1000 separate files each about 8-9KB in size.

Below is a video of the results. First we use the Loader written above and then we load the entire wasm. For the three functions we call, this ends up being 5x faster! (0.25 seconds vs 1.35 seconds)

The performance of individual functions takes a reasonable hit, but we're still within ~1.2x of the original performance. This isn't bad for a 5x speedup in terms of initial load, though! And of course, this doesn't exclude just loading the full module in the background while providing a snappy first load.