# Trickery to Tame Big WebAssembly Binaries

*by [@bwasti](https://twitter.com/bwasti)*

****

In light of the extremely popular [PyScript](https://pyscript.net) project (it's great, check it out!), I thought I'd touch on a somewhat glaring issue...

![](https://i.imgur.com/lcudcUd.png)

The nearly two-second latency involved in loading over 8MB of assets for a "hello world" example. (You can check that out [here](https://pyscript.net/examples/hello_world.html).)

### Big Binaries and Where To Find Them

It should come as no surprise that compiling a large project into an extremely simple ISA blows up the binary size. CPython [is around 350,000 lines of C](https://tenthousandmeters.com/blog/python-behind-the-scenes-3-stepping-through-the-cpython-source-code/) and getting it compiled to only ~6MB of `wasm` is nothing short of impressive.

But sending 6MB over the wire before *anything* useful can happen isn't really an ideal experience.

<center>
<video playsinline loop muted autoplay controls style="max-width:100%" src="https://i.imgur.com/JB2FWiK.mp4"></video>
</center>

Many other useful projects will face this issue, so a solution would be quite nice to have. What can we do?

In this post I go over some techniques I've been playing with. They're largely hacks, so please only read for enjoyment and not edification. :)

### Splitting up Files

The first approach to explore is manually splitting up files and only loading the ones you need.

```cpp
int a_fn(int x) {
  return x + 13;
}
```

```cpp
int b_fn(int x) {
  return x * 7;
}
```

![](https://i.imgur.com/MFNo5FD.png)

This is by far the easiest approach conceptually, but it can be quite a hassle. Not all files can be cleanly split because they may have dependencies. In the worst (yet quite common) case, every file has some function that depends on some functionality in another file.
```cpp
int a_fn(int x) {
  x = b_fn(x); // damn :(
  return x + 13;
}
```
<center><i>our problem case</i></center>

### Inspecting the `wasm`

Let's take a peek under the hood to check out what that dependency looks like. WebAssembly has a restrictive and easily analyzed ISA, which makes this pretty easy.

To inspect it, we'll use `wasm-objdump`. The disassembled code for the binary looks a bit like this:

```bash
$ wasm-objdump -d combined.wasm
```

```wasm
func[1] <a_fn(int)>:
 local.get 0
 call 2 <b_fn(int)>    <--- Our dependency!
 i32.const 13
 i32.add
 end
func[2] <b_fn(int)>:
 local.get 0
 i32.const 7
 i32.mul
 end
```

Let's pretend these functions are *much* bigger and discuss two possible outcomes after we send the binary to the user:

1. The user calls `int a_fn(int)`
2. The user only calls `int b_fn(int)`

In case 1, we've done all we can and there's nothing to worry about.

The issue is case 2. We just sent all of the `int a_fn(int)` section over the wire as well, potentially causing many milliseconds of hang time! Let's avoid that *at all costs*.

### Imports

One way to split up binaries even when there are dependencies is to just force it. The flag `--allow-undefined` will make this possible (for `wasm-ld`). Other compilation stacks will likely have similar flags.

Note that this doesn't involve changing the source code, just the way we compile it!
```cpp
// b.cpp
int b_fn(int x) {
  return x * 7;
}
```

```cpp
// a.cpp
int b_fn(int x);

int a_fn(int x) {
  x = b_fn(x);
  return x + 13;
}
```

```
clang++ --target=wasm32 -nostdlib -O3 -c a.cpp -o /tmp/a.o
clang++ --target=wasm32 -nostdlib -O3 -c b.cpp -o /tmp/b.o
wasm-ld --no-entry --export-all --lto-O3 --allow-undefined --import-memory /tmp/a.o -o a.wasm
wasm-ld --no-entry --export-all --lto-O3 --allow-undefined --import-memory /tmp/b.o -o b.wasm
```

The `a.wasm` binary ends up with an [import section](https://webassembly.github.io/spec/core/binary/modules.html#binary-importsec):

```bash
$ wasm-objdump -x -j Import a.wasm
```

```
a.wasm: file format wasm 0x1

Section Details:

Import[2]:
 - memory[0] pages: initial=2 <- env.memory
 - func[0] sig=0 <b_fn(int)> <- env._Z4b_fni
```

But its code section looks exactly as expected!

```bash
$ wasm-objdump -d a.wasm
```

```
a.wasm: file format wasm 0x1

Code Disassembly:

func[2] <a_fn(int)>:
 local.get 0
 call 0 <b_fn(int)>
 i32.const 13
 i32.add
 end
```

Ok, but wouldn't this just break if we ran it?

Yes.

### Dynamic Linking

We'll need to specify the implementation for `int b_fn(int)` *at load time*. Something like this:

```javascript
const memory = new WebAssembly.Memory({initial:10, maximum:1000});

const b_imports = {
  env: {
    memory: memory
  }
};
// instantiateStreaming resolves to { module, instance }
const { instance: b_instance } = await WebAssembly.instantiateStreaming(fetch('b.wasm'), b_imports);

const a_imports = {
  env: {
    memory: memory,
    // the key must match the import name shown above; C++ mangles
    // b_fn(int) to _Z4b_fni (extern "C" would keep the plain name)
    _Z4b_fni: b_instance.exports._Z4b_fni
  }
};
const { instance: a_instance } = await WebAssembly.instantiateStreaming(fetch('a.wasm'), a_imports);
```

And now, if we don't really need `int a_fn(int)`, we can just skip that last bit of code and save a bunch of bandwidth. Woo!

(Note that the memory between these modules is shared! That means more complex heap-based computation is fine.)

### Generating `imports`

Manually tracking `call` sites and then generating the correct `imports` as we did above is quite arduous. We can automate this, of course.
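
As a rough illustration of the linking step, here is a minimal sketch that lists a module's function imports from JavaScript with `WebAssembly.Module.imports` and then supplies one by hand. It is not the tooling used in this post. The byte array is a hand-assembled stand-in for `a.wasm` (an assumption for illustration, using the plain name `b_fn` rather than the mangled one), and `functionImports` is a hypothetical helper name.

```javascript
// Hand-assembled stand-in for a.wasm (illustrative assumption):
// imports env.b_fn : (i32) -> i32 and exports a_fn(x) = b_fn(x) + 13.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // "\0asm", version 1
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f, // type 0: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,       // import section, module "env"
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,       // field "b_fn", kind func, type 0
  0x03, 0x02, 0x01, 0x00,                         // one local function of type 0
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, // export "a_fn"...
  0x00, 0x01,                                     // ...as func index 1
  0x0a, 0x0b, 0x01, 0x09, 0x00,                   // code: one body, no locals
  0x20, 0x00, 0x10, 0x00,                         // local.get 0; call 0
  0x41, 0x0d, 0x6a, 0x0b,                         // i32.const 13; i32.add; end
]);

const aModule = new WebAssembly.Module(bytes);

// Hypothetical helper: names of the functions a module expects at load time.
function functionImports(mod) {
  return WebAssembly.Module.imports(mod)
    .filter((imp) => imp.kind === "function")
    .map((imp) => imp.name);
}

const needed = functionImports(aModule);

// Satisfy the import with a plain JS function standing in for b.wasm's b_fn.
const aInstance = new WebAssembly.Instance(aModule, {
  env: { b_fn: (x) => x * 7 },
});
const result = aInstance.exports.a_fn(3); // 3 * 7 + 13
```

Here `needed` comes out as `["b_fn"]` and `result` as `34`. The import list only says what a module needs, not which function inside it makes the call, so it covers just part of the bookkeeping.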

I ended up creating a data structure to answer three questions:

1. Given a module, which functions does it need to import?
2. Given a function, which module does it live in?
3. Given a function, which functions (in different files) does it depend on?

The first two are easily answered by `wasm-objdump`, but the last requires some care. I won't go into details in this post, but it's an interesting little problem.

I wrote a [Python script to do this](https://github.com/bwasti/web_assembly_experiments/blob/main/lazy_load/process.py) and the output is JSON:

```json
{
  "module_imports": {
    "a.wasm": [
      "b_fn"
    ],
    "b.wasm": []
  },
  "func_locations": {
    "a_fn": "a.wasm",
    "b_fn": "b.wasm"
  },
  "func_import_deps": {
    "a_fn": [
      "b_fn"
    ]
  }
}
```

### Lazily Loading Dependencies Automatically

Ok great, we have the ability to load `b.wasm` without loading `a.wasm`. But no one wants to figure out which modules to load in JavaScript.

We want the necessary WebAssembly to be loaded **automatically** and **on-demand** without changing the C/C++ that we're compiling.

Here's the trick:

1. WebAssembly imports take JavaScript references and these references can be updated *later*.
2. When loading a module, we're going to set the import structure to reference `null` for all function imports.
3. We're going to wrap every function call with a check to see if it (and its dependencies) have been loaded already. If it *hasn't* been loaded, we load it and repopulate the `null` with a legitimate function.

This way, users can call whatever functions they want and only the relevant `wasm` will be pulled over the network.

In the worst case, the user calls *every* function and we end up with the same amount of `wasm` loaded as when we started.

### Code Stuff

(Feel free to skip to [results](https://jott.live/markdown/wasm_binary_size#results) :^})

First we'll create a `Loader` class that takes in our above generated JSON file.
```javascript
class Loader {
  constructor(json_file) {
    this.json_file = json_file;
  }

  async init() {
    this.json = await (await fetch(this.json_file)).json();
    this.module_imports = this.json.module_imports;
    this.func_locations = this.json.func_locations;
    this.funcs = {};
    for (let func of Object.keys(this.func_locations)) {
      this.funcs[func] = new Func(func, this);
    }
    this.func_import_deps = this.json.func_import_deps;
  }

  // TODO
}
```

It also initializes a bunch of `Func`s, which will store references to the actual WebAssembly functions.

```javascript
class Func {
  constructor(fn, loader) {
    this.loaded = false;
    this.loader = loader;
    this.fn = fn;
    this.func = null;
  }

  async call(...args) {
    if (!this.loaded) {
      await this.loader.load(this.fn);
    }
    return this.func(...args);
  }
}
```

This sets up a structure that will be used like this:

```javascript
const loader = new Loader('processed.json');
await loader.init();
const a_fn_output = await loader.funcs["a_fn"].call(3);
```

Why all the `await`s? Well, we're converting our WebAssembly functions into lazily loaded functions that *may* hit the network to resolve dependencies on the first call. If it does, we'll have to wait for the result to come back. Otherwise, we just call the function immediately.

Finally, we need to implement the `load(fn)` function that we hit in the worst case. Over in the `Loader` class we can add these two methods:

```javascript
// in Loader class, dedented for clarity

async load(fn) {
  // already loaded!
  if (this.funcs[fn].loaded) {
    return;
  }
  // if we have deps, load them first
  if (fn in this.func_import_deps) {
    for (let dep of this.func_import_deps[fn]) {
      if (this.funcs[dep].loaded) {
        continue;
      }
      // recurse :^)
      await this.load(dep);
    }
  }
  // bring our module into memory
  await this.load_wasm(this.func_locations[fn]);
}

async load_wasm(wasm_fn) {
  // `memory` is the shared WebAssembly.Memory created earlier
  const imports = {
    env: {
      memory: memory
    }
  };
  if (wasm_fn in this.module_imports) {
    for (let imp of this.module_imports[wasm_fn]) {
      // the target might actually still be null! that's okay:
      // a function import has to be callable at instantiation time,
      // so we pass a thin wrapper that looks up the real function
      // when it's called. it'll be updated when needed.
      imports.env[imp] = (...args) => this.funcs[imp].func(...args);
    }
  }
  const m = await WebAssembly.instantiateStreaming(fetch(wasm_fn), imports);
  const exports = m.instance.exports;
  for (let e of Object.keys(exports)) {
    if (e in this.funcs) {
      // this is the key bit of magic
      this.funcs[e].func = exports[e];
      this.funcs[e].loaded = true;
    }
  }
}
```

And that's pretty much it! We've now got a nice way to load chunks of a library at the granularity of individual compilation units without changing any of the source code.

### Results

I wasn't happy with just the toy case above so I [generated a thousand files](https://github.com/bwasti/web_assembly_experiments/blob/main/lazy_load/generate.py) with varying dependencies on each other to test out this approach.

The result was an 8.7MB fully merged single binary. Using the methods above, it's 1000 separate files, each about 8-9KB in size.

Below is a video of the results. First we use the `Loader` written above and then we load the entire `wasm`. For the three functions we call, this ends up being 5x faster! (0.25 seconds vs 1.35 seconds)

The performance of individual functions takes a reasonable hit, but we're still within ~1.2x of the original performance. This isn't bad for a 5x speedup in terms of initial load, though! And of course, this doesn't exclude just loading the full module in the background while providing a snappy first load.

<center>
<video playsinline loop muted autoplay controls style="max-width:100%" src="https://i.imgur.com/g9TLfH2.mp4"></video>
</center>

And here are the network requests over time:

<center>
<video playsinline loop muted autoplay controls style="max-width:100%" src="https://i.imgur.com/djT7hfy.mp4"></video>
</center>

It's cool to see that even though we only called 3 functions, they depended on a couple others and those all got loaded automatically for us.

### Thanks for reading!
The full code listing can be found here: https://github.com/bwasti/web_assembly_experiments/tree/main/lazy_load

If you'd like to keep up with my work on performance (with a recent focus on the web), please follow me on [twitter!](https://twitter.com/bwasti)