# Trickery to Tame Big WebAssembly Binaries

*by [@bwasti](https://twitter.com/bwasti)*

****

In light of the extremely popular [PyScript](https://pyscript.net)
project (it's great, check it out!), I thought I'd touch on
a somewhat glaring issue...

![](https://i.imgur.com/lcudcUd.png)

The nearly two second latency involved with loading over
8MB of assets for a "hello world" example.
(You can check that out [here](https://pyscript.net/examples/hello_world.html).)


### Big Binaries and Where To Find Them

It should come as no surprise that compiling a large project
into an extremely simple ISA blows up the binary size.
CPython
[is around 350,000 lines of C](https://tenthousandmeters.com/blog/python-behind-the-scenes-3-stepping-through-the-cpython-source-code/)
and getting it compiled to only ~6MB of `wasm` is nothing short
of impressive.

But, sending 6MB over the wire before *anything* useful can happen isn't
really an ideal experience.

<center>
<video playsinline loop muted autoplay controls style="max-width:100%" src="https://i.imgur.com/JB2FWiK.mp4"></video>
</center>


Many other useful projects will face this issue,
so a solution would be quite nice to have.

What can we do?
In this post I go over some techniques I've been playing with.
They're largely hacks, so please only read for enjoyment and not
edification. :)

### Splitting up Files

The first approach to explore is manually splitting up files and
only loading the ones you need.

```cpp
int a_fn(int x) {
  return x + 13;
}
```

```cpp
int b_fn(int x) {
  return x * 7;
}
```

![](https://i.imgur.com/MFNo5FD.png)

This is by far the easiest, conceptually, but it can
be quite a hassle.  Not all files can be cleanly split
because they may have dependencies.
In the worst (yet quite common) case, every file has some function
that depends on some functionality in another file.


```cpp
int a_fn(int x) {
  x = b_fn(x); // damn :(
  return x + 13;
}
```
<center><i>our problem case</i></center>

### Inspecting the `wasm`

Let's take a peek under the hood to check out
what that dependency looks like.

WebAssembly has a restrictive and easily analyzed ISA,
which makes this pretty easy.
To inspect it, we'll use `wasm-objdump`.
The disassembled code for the binary looks 
a bit like this:

```bash
$ wasm-objdump -d combined.wasm
```
```wasm
func[1] <a_fn(int)>:
 local.get 0
 call 2 <b_fn(int)> <--- Our dependency!
 i32.const 13
 i32.add
 end
func[2] <b_fn(int)>:
 local.get 0
 i32.const 7
 i32.mul
 end
```

Let's pretend these functions are *much* bigger and
discuss two possible outcomes after we send the binary to the user:

1. The user calls `int a_fn(int)`
2. The user only calls `int b_fn(int)`

In case 1, we've done all we can and there's nothing to
worry about.  The issue is case 2.
We just sent all of the `int a_fn(int)` section over the wire
as well, potentially causing many milliseconds of hang time!

Let's avoid that *at all costs*.

### Imports

One way to split up binaries even when there are dependencies
is to just force it.
The flag `--allow-undefined` will make this possible (for `wasm-ld`).
Other compilation stacks will likely have similar flags.
Note that this doesn't involve changing the source code,
just the way we compile it!

```cpp
// b.cpp
int b_fn(int x) {
  return x * 7;
}
```
```cpp
// a.cpp
int b_fn(int x);

int a_fn(int x) {
  x = b_fn(x);
  return x + 13;
}
```

```bash
clang++ --target=wasm32 -nostdlib -O3 -c a.cpp -o /tmp/a.o
clang++ --target=wasm32 -nostdlib -O3 -c b.cpp -o /tmp/b.o
wasm-ld --no-entry --export-all --lto-O3 --allow-undefined --import-memory /tmp/a.o -o a.wasm
wasm-ld --no-entry --export-all --lto-O3 --allow-undefined --import-memory /tmp/b.o -o b.wasm
```

The `a.wasm` binary ends up with an [import section](https://webassembly.github.io/spec/core/binary/modules.html#binary-importsec):

```bash
$ wasm-objdump -x -j Import a.wasm
```
```
a.wasm:	file format wasm 0x1

Section Details:

Import[2]:
 - memory[0] pages: initial=2 <- env.memory
 - func[0] sig=0 <b_fn(int)> <- env._Z4b_fni
```
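
The same information can also be read from JavaScript with `WebAssembly.Module.imports()`, which lists what a compiled module expects to be handed. A minimal sketch, assuming a hand-assembled stand-in for `a.wasm` that imports a plain `env.b_fn` (no C++ name mangling and no imported memory, unlike the real build output above):

```javascript
// Hand-assembled stand-in for a.wasm (an assumption for illustration):
// imports env.b_fn (i32 -> i32) and exports a_fn(x) = b_fn(x) + 13.
const aBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm", version 1
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f, // type 0: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,       // import section: module "env"
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,       //   field "b_fn", kind func, type 0
  0x03, 0x02, 0x01, 0x00,                         // one locally defined function, type 0
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01, // export "a_fn" = func 1
  0x0a, 0x0b, 0x01, 0x09, 0x00,                   // code section: one body, no locals
  0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b, // local.get 0; call 0; i32.const 13; i32.add; end
]);

// Compile only (no instantiation), then ask the module what it needs.
const aModule = new WebAssembly.Module(aBytes);
const neededImports = WebAssembly.Module.imports(aModule);
console.log(neededImports);
```

Each entry carries the `module`, `name` and `kind` of one import, which is enough to build the import object programmatically.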

But its code section looks exactly as expected!

```bash
$ wasm-objdump -d a.wasm
```

```
a.wasm:	file format wasm 0x1

Code Disassembly:

func[2] <a_fn(int)>:
 local.get 0
 call 0 <b_fn(int)>
 i32.const 13
 i32.add
 end
```

Ok, but wouldn't this just break if we ran it?  Yes.

### Dynamic Linking

We'll need to specify the implementation for `int b_fn(int)`
*at load time*.
Something like this:

```javascript
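// shared linear memory; both modules were linked with --import-memory
const memory = new WebAssembly.Memory({initial:10, maximum:1000});
const b_imports = {
    env: {
      memory: memory
    }
  };
// instantiateStreaming resolves to { module, instance }
const { instance: b_instance } = await WebAssembly.instantiateStreaming(fetch('b.wasm'), b_imports);

const a_imports = {
    env: {
      memory: memory,
      // C++ mangles names, so the import is env._Z4b_fni (see the Import section above)
      _Z4b_fni: b_instance.exports._Z4b_fni
    }
  };
const { instance: a_instance } = await WebAssembly.instantiateStreaming(fetch('a.wasm'), a_imports);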

```

And now, if we don't really need `int a_fn(int)`,
we can just skip that last bit of code and save a bunch of
bandwidth.  Woo!

(Note that the memory between these modules is shared!
That means more complex heap-based computation is fine.)
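
To watch the linking step run end to end without a server, here is a hedged, self-contained variant. Two tiny hand-assembled modules stand in for `a.wasm` and `b.wasm` (unmangled names and no shared memory, both assumptions), and the synchronous constructors replace `instantiateStreaming`:

```javascript
// Stand-in for b.wasm: exports b_fn(x) = x * 7.
const bWasm = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm", version 1
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f, // type 0: (i32) -> i32
  0x03, 0x02, 0x01, 0x00,                         // one function, type 0
  0x07, 0x08, 0x01, 0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00, // export "b_fn" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                   // code section: one body, no locals
  0x20, 0x00, 0x41, 0x07, 0x6c, 0x0b,             // local.get 0; i32.const 7; i32.mul; end
]);

// Stand-in for a.wasm: imports env.b_fn, exports a_fn(x) = b_fn(x) + 13.
const aWasm = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f, // type 0: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,       // import section: module "env"
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,       //   field "b_fn", kind func, type 0
  0x03, 0x02, 0x01, 0x00,                         // one locally defined function
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01, // export "a_fn" = func 1
  0x0a, 0x0b, 0x01, 0x09, 0x00,                   // code section: one body, no locals
  0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b, // local.get 0; call 0; i32.const 13; i32.add; end
]);

// Load b first, then hand its export to a as the implementation of the import.
const bInstance = new WebAssembly.Instance(new WebAssembly.Module(bWasm), {});
const aInstance = new WebAssembly.Instance(new WebAssembly.Module(aWasm), {
  env: { b_fn: bInstance.exports.b_fn },
});

console.log(aInstance.exports.a_fn(3)); // 3 * 7 + 13 = 34
```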

### Generating `imports`

Manually tracking `call` sites and then generating
the correct `imports` as we did above is quite arduous.
We can automate this of course.

I ended up creating a data structure to answer three questions:

1. Given a module, which functions does it need to import?
2. Given a function, which module does it live in?
3. Given a function, which functions (in different files) does it depend on?

The first two are easily answered by `wasm-objdump`, but the last
requires some care.
I won't go into details in this post, but it's an interesting
little problem.

I wrote a [Python script to do this](https://github.com/bwasti/web_assembly_experiments/blob/main/lazy_load/process.py)
and the output is JSON:
```json
{
  "module_imports": {
    "a.wasm": [
      "b_fn"
    ],
    "b.wasm": []
  },
  "func_locations": {
    "a_fn": "a.wasm",
    "b_fn": "b.wasm"
  },
  "func_import_deps": {
    "a_fn": [
      "b_fn"
    ]
  }
}
```
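
A rough sketch of how such a structure could be assembled, assuming the call graph has already been extracted (for instance from `wasm-objdump -d` output). The `buildManifest` helper and its input shape are hypothetical and are not taken from the linked script:

```javascript
// Input (hypothetical shape): module file -> { function name -> functions it calls }.
const callGraph = {
  "a.wasm": { a_fn: ["b_fn"] },
  "b.wasm": { b_fn: [] },
};

function buildManifest(graph) {
  const manifest = { module_imports: {}, func_locations: {}, func_import_deps: {} };

  // Question 2: which module does each function live in?
  for (const [file, funcs] of Object.entries(graph)) {
    for (const name of Object.keys(funcs)) {
      manifest.func_locations[name] = file;
    }
  }

  for (const [file, funcs] of Object.entries(graph)) {
    const imports = new Set();
    for (const [name, callees] of Object.entries(funcs)) {
      // Question 3: keep only callees that are not defined in this module.
      const external = callees.filter((c) => manifest.func_locations[c] !== file);
      if (external.length > 0) {
        manifest.func_import_deps[name] = [...new Set(external)];
      }
      external.forEach((c) => imports.add(c));
    }
    // Question 1: everything this module has to import.
    manifest.module_imports[file] = [...imports];
  }
  return manifest;
}

console.log(JSON.stringify(buildManifest(callGraph), null, 2));
```

For the two-file example this reproduces the JSON shown above.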

### Lazily Loading Dependencies Automatically

Ok great, we have the ability to load `b.wasm`
without loading `a.wasm`.
But no one wants to figure out which modules to load in JavaScript.
We want the necessary WebAssembly to be loaded **automatically**
and **on-demand** without changing the C/C++ that we're compiling.

Here's the trick:

1. WebAssembly imports take JavaScript functions,
and what those functions end up calling can be updated *later*.

2. When loading a module, we're going to let every function import
that isn't available yet resolve to `null` for the time being.

3. We're going to wrap every function call with a
check to see if it (and its dependencies) have been loaded already.
If it *hasn't* been loaded, we load it and replace the `null`
with a legitimate function.

This way, users can call whatever functions they want
and only the relevant `wasm` will be pulled over the network.
In the worst case, the user calls *every* function and we end up
with the same amount of `wasm` loaded as when we started.
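
A minimal sketch of the late-binding idea, using a hand-assembled stand-in for `a.wasm` (an unmangled `env.b_fn` import, which is an assumption). One detail matters in practice: the JavaScript API checks function imports when the module is instantiated, so a literal `null` is rejected with a `LinkError`. The sketch therefore imports a small forwarding wrapper whose target starts out as `null` and is filled in afterwards. The `slot` object is a hypothetical name:

```javascript
// Stand-in for a.wasm: imports env.b_fn, exports a_fn(x) = b_fn(x) + 13.
const callerBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm", version 1
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f, // type 0: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,       // import section: module "env"
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,       //   field "b_fn", kind func, type 0
  0x03, 0x02, 0x01, 0x00,                         // one locally defined function
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01, // export "a_fn" = func 1
  0x0a, 0x0b, 0x01, 0x09, 0x00,                   // code section: one body, no locals
  0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b, // local.get 0; call 0; i32.const 13; i32.add; end
]);
const callerModule = new WebAssembly.Module(callerBytes);

// A bare null import does not link.
let nullImportRejected = false;
try {
  new WebAssembly.Instance(callerModule, { env: { b_fn: null } });
} catch (err) {
  nullImportRejected = err instanceof WebAssembly.LinkError;
}
console.log("null import rejected:", nullImportRejected);

// Late binding: the import is a wrapper, and its target can be swapped later.
const slot = { target: null };
const caller = new WebAssembly.Instance(callerModule, {
  env: { b_fn: (x) => slot.target(x) },
});

slot.target = (x) => x * 7; // stands in for the real b_fn once it has been loaded
console.log(caller.exports.a_fn(3)); // 3 * 7 + 13 = 34
```

The wrapper costs an extra hop through JavaScript on every cross-module call, which is the price of being able to swap the target after instantiation.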

### Code Stuff

(Feel free to skip to [results](https://jott.live/markdown/wasm_binary_size#results) :^})

First we'll create a `Loader` class that takes in our above
generated JSON file.

```javascript
class Loader {
  constructor(json_file) {
    this.json_file = json_file;
  }
  
  async init() {
    this.json = await (await fetch(this.json_file)).json();
    this.module_imports = this.json.module_imports;
    this.func_locations = this.json.func_locations;
    this.funcs = {};
    for (let func of Object.keys(this.func_locations)) {
      this.funcs[func] = new Func(func, this);
    }
    this.func_import_deps = this.json.func_import_deps;
  }
  
  // TODO
}
```

It also initializes a bunch of `Func`s, which
will store references to the actual WebAssembly functions.

```javascript
class Func {
  constructor(fn, loader) {
    this.loaded = false;
    this.loader = loader;
    this.fn = fn;
    this.func = null;
  }

  async call(...args) {
    if (!this.loaded) {
      await this.loader.load(this.fn);
    }
    return this.func(...args);
  }
}
```
This sets up a structure that will
be used like this:

```javascript
const loader = new Loader('processed.json');
await loader.init();

const a_fn_output = await loader.funcs["a_fn"].call(3);
```

Why all the `await`s?  Well we're converting our
WebAssembly functions into lazily loaded functions
that *may* hit the network to resolve dependencies on the first call.

If it does, we'll have to wait for the result to come back.
Otherwise, we just call the function immediately.
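
One edge case with lazily loaded functions: if the same function is called twice before its first load finishes, both calls will try to load it. A possible variation (not part of the code in this post) keeps the in-flight promise around so concurrent first calls share a single load. `LazyFunc` and the stub loader below are hypothetical names used only for this sketch:

```javascript
class LazyFunc {
  constructor(name, loader) {
    this.name = name;
    this.loader = loader;
    this.func = null;    // filled in by the loader
    this.pending = null; // promise for the load in flight, if any
  }

  async call(...args) {
    if (this.func === null) {
      // Reuse the same promise for every caller that arrives while loading.
      this.pending = this.pending || this.loader.load(this.name);
      await this.pending;
    }
    return this.func(...args);
  }
}

// Stub loader standing in for the network: counts how often it is asked to load.
const stubLoader = {
  loads: 0,
  funcs: {},
  async load(name) {
    this.loads += 1;
    await new Promise((resolve) => setTimeout(resolve, 10)); // pretend latency
    this.funcs[name].func = (x) => x * 7;
  },
};
stubLoader.funcs.b_fn = new LazyFunc("b_fn", stubLoader);

// Two calls race on the first load; only one load should happen.
const demo = Promise.all([
  stubLoader.funcs.b_fn.call(1),
  stubLoader.funcs.b_fn.call(2),
]).then((results) => ({ results, loads: stubLoader.loads }));

demo.then((outcome) => console.log(outcome));
```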

Finally, we need to implement the `load(fn)` function
that we hit in the worst case.  Over in the `Loader` class
we can add these two methods:

```javascript
// in Loader class, dedented for clarity

async load(fn) {
  // already loaded!
  if (this.funcs[fn].loaded) {
    return;
  }
  // if we have deps, load them first
  if (fn in this.func_import_deps) {
    for (let dep of this.func_import_deps[fn]) {
      if (this.funcs[dep].loaded) {
        continue;
      }
      // recurse :^)
      await this.load(dep);
    }
  }
  
  // bring our module into memory
  await this.load_wasm(this.func_locations[fn]);
}

async load_wasm(wasm_fn) {
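  // `memory` is assumed to be a shared WebAssembly.Memory from an enclosing scope
  const imports = { env: { memory: memory } };
  if (wasm_fn in this.module_imports) {
    for (let imp of this.module_imports[wasm_fn]) {
      // the target might actually be null right now! that's okay,
      // it'll be updated when needed. a bare null would fail to link,
      // so we import a wrapper that looks the target up at call time.
      imports.env[imp] = (...args) => this.funcs[imp].func(...args);
    }
  }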
  const m = await WebAssembly.instantiateStreaming(fetch(wasm_fn), imports);
  const exports = m.instance.exports;
  for (let e of Object.keys(exports)) {
    if (e in this.funcs) {
      // this is the key bit of magic
      this.funcs[e].func = exports[e];
      this.funcs[e].loaded = true;
    }   
  }
}

```

And that's pretty much it!
We've now got a nice way to load chunks of a library
at the granularity of individual compilation units
without changing any of the source code.
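
Because the set of modules a call will pull in is fully determined by the JSON, it can also be computed up front, which is handy for debugging or prefetching. A small sketch, assuming the JSON format shown earlier. `modulesFor` is a hypothetical helper, and its `seen` set also guards against circular dependencies, which the recursive `load` above does not handle:

```javascript
const manifest = {
  module_imports: { "a.wasm": ["b_fn"], "b.wasm": [] },
  func_locations: { a_fn: "a.wasm", b_fn: "b.wasm" },
  func_import_deps: { a_fn: ["b_fn"] },
};

// Returns the module files needed to call `fn`, dependencies first.
function modulesFor(manifest, fn, seen = new Set(), order = []) {
  if (seen.has(fn)) {
    return order; // already visited (this also breaks dependency cycles)
  }
  seen.add(fn);
  for (const dep of manifest.func_import_deps[fn] || []) {
    modulesFor(manifest, dep, seen, order);
  }
  const file = manifest.func_locations[fn];
  if (!order.includes(file)) {
    order.push(file);
  }
  return order;
}

console.log(modulesFor(manifest, "a_fn")); // b.wasm first, then a.wasm
console.log(modulesFor(manifest, "b_fn")); // only b.wasm
```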

### Results

I wasn't happy with just the toy case above
so I [generated a thousand files](https://github.com/bwasti/web_assembly_experiments/blob/main/lazy_load/generate.py)
with varying dependencies
on each other to test out this approach.

The result was an 8.7MB fully merged single binary.
Using the methods above, it's 1000 separate files each about 8-9KB in size.

Below is a video of the results.
First we use the `Loader` written above and then
we load the entire `wasm`.
For the three functions we call,
this ends up being 5x faster! (0.25 seconds vs 1.35 seconds)

The performance of individual functions takes a reasonable hit,
but we're still within ~1.2x of the original performance.
This isn't bad for a 5x speedup in terms of initial load, though!
And of course, this doesn't exclude just loading the full
module in the background while providing a snappy first load.

<center>
<video playsinline loop muted autoplay controls style="max-width:100%" src="https://i.imgur.com/g9TLfH2.mp4"></video>
</center>

And here are the network requests over time:

<center>
<video playsinline loop muted autoplay controls style="max-width:100%" src="https://i.imgur.com/djT7hfy.mp4"></video>
</center>

It's cool to see that even though we only called 3 functions,
they depended on a couple others and those all got loaded
automatically for us.

### Thanks for reading!

The full code listing can be found here: https://github.com/bwasti/web_assembly_experiments/tree/main/lazy_load

If you'd like to follow my work on performance (with a recent focus on the web),
please follow me on [twitter!](https://twitter.com/bwasti)