by @bwasti
In light of the extremely popular PyScript project (it's great, check it out!), I thought I'd touch on a somewhat glaring issue...
The nearly two second latency involved with loading over 8MB of assets for a "hello world" example. (You can check that out here.)
It should come as no surprise that compiling a large project into an extremely simple ISA blows up the binary size. CPython is roughly 350,000 lines of C, and getting it compiled to only ~6MB of wasm is nothing short of impressive.
But, sending 6MB over the wire before anything useful can happen isn't really an ideal experience.
Many other useful projects will face this issue, so a solution would be quite nice to have.
What can we do? In this post I go over some techniques I've been playing with. They're largely hacks, so please only read for enjoyment and not edification. :)
The first approach to explore is manually splitting up files and only loading the ones you need.
int a_fn(int x) {
  return x + 13;
}
int b_fn(int x) {
  return x * 7;
}
This is by far the easiest, conceptually, but it can be quite a hassle. Not all files can be cleanly split because they may have dependencies. In the worst (yet quite common) case, every file has some function that depends on some functionality in another file.
int a_fn(int x) {
  x = b_fn(x); // damn :(
  return x + 13;
}
wasm
Let's take a peek under the hood to check out what that dependency looks like.
WebAssembly has a restrictive and easily analyzed ISA, which makes this pretty easy. To inspect it, we'll use wasm-objdump. The disassembled code for the binary looks a bit like this:
$ wasm-objdump -d combined.wasm
func[1] <a_fn(int)>:
  local.get 0
  call 2 <b_fn(int)>    <--- Our dependency!
  i32.const 13
  i32.add
  end
func[2] <b_fn(int)>:
  local.get 0
  i32.const 7
  i32.mul
  end
Let's pretend these functions are much bigger and discuss two possible outcomes after we send the binary to the user:
1. The user calls int a_fn(int)
2. The user calls int b_fn(int)
In case 1, we've done all we can and there's nothing to worry about. The issue is case 2. We just sent all of the int a_fn(int) section over the wire as well, potentially causing many milliseconds of hang time! Let's avoid that at all costs.
One way to split up binaries even when there are dependencies is to just force it. The flag --allow-undefined will make this possible (for wasm-ld). Other compilation stacks will likely have similar flags. Note that this doesn't involve changing the source code, just the way we compile it!
// b.cpp
int b_fn(int x) {
  return x * 7;
}
// a.cpp
int b_fn(int x);
int a_fn(int x) {
  x = b_fn(x);
  return x + 13;
}
clang++ --target=wasm32 -nostdlib -O3 -c a.cpp -o /tmp/a.o
clang++ --target=wasm32 -nostdlib -O3 -c b.cpp -o /tmp/b.o
wasm-ld --no-entry --export-all --lto-O3 --allow-undefined --import-memory /tmp/a.o -o a.wasm
wasm-ld --no-entry --export-all --lto-O3 --allow-undefined --import-memory /tmp/b.o -o b.wasm
The a.wasm binary ends up with an import section:
$ wasm-objdump -x -j Import a.wasm
a.wasm: file format wasm 0x1
Section Details:
Import[2]:
- memory[0] pages: initial=2 <- env.memory
- func[0] sig=0 <b_fn(int)> <- env._Z4b_fni
But its code section looks exactly as expected!
$ wasm-objdump -d a.wasm
a.wasm: file format wasm 0x1
Code Disassembly:
func[2] <a_fn(int)>:
  local.get 0
  call 0 <b_fn(int)>
  i32.const 13
  i32.add
  end
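If you'd rather poke at this from JavaScript, the same import information is exposed by WebAssembly.Module.imports. Here's a minimal sketch. The byte array is a hand-assembled stand-in for a.wasm and is purely an assumption to keep the example self-contained: it uses the unmangled name b_fn and skips the memory import, so it is not the binary produced by the commands above.

```javascript
// Hand-assembled stand-in for a.wasm (assumption: unmangled names, no memory import).
// It imports env.b_fn and exports a_fn, which computes b_fn(x) + 13.
const A_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,             // type: (i32) -> i32
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76,                   // import: "env"
  0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,                   //   "b_fn", func, type 0
  0x03, 0x02, 0x01, 0x00,                                     // one local function, type 0
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01, // export "a_fn" = func 1
  0x0a, 0x0b, 0x01, 0x09, 0x00,                               // code section, one body
  0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b,             // local.get 0; call 0; i32.const 13; i32.add; end
]);

// Compile only (no instantiation), then ask the module what it needs and offers.
const compiled = new WebAssembly.Module(A_WASM);
const importList = WebAssembly.Module.imports(compiled);
const exportList = WebAssembly.Module.exports(compiled);

console.log(importList); // entries have the shape { module, name, kind }
console.log(exportList); // entries have the shape { name, kind }
```

Compiling without instantiating is handy here because the module can be inspected before any of its imports exist.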
Ok, but wouldn't this just break if we ran it? Yes. We'll need to specify the implementation for int b_fn(int) at load time. Something like this:
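// one memory shared by both modules (both were linked with --import-memory)
const memory = new WebAssembly.Memory({initial:10, maximum:1000});
const b_imports = {
  env: {
    memory: memory
  }
};
// instantiateStreaming resolves to { module, instance }
const { instance: b_instance } = await WebAssembly.instantiateStreaming(fetch('b.wasm'), b_imports);
const a_imports = {
  env: {
    memory: memory,
    // clang++ mangles names, so the import is env._Z4b_fni (see the import section above)
    _Z4b_fni: b_instance.exports._Z4b_fni
  }
};
const { instance: a_instance } = await WebAssembly.instantiateStreaming(fetch('a.wasm'), a_imports);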
And now, if we don't really need int a_fn(int), we can just skip that last bit of code and save a bunch of bandwidth. Woo!
(Note that the memory between these modules is shared! That means more complex heap-based computation is fine.)
imports
Manually tracking call sites and then generating the correct imports as we did above is quite arduous. We can automate this of course.
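I ended up creating a data structure to answer three questions:
1. Which functions does each module import?
2. Which module is each function located in?
3. Which imported functions does each function depend on?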
The first two are easily answered by wasm-objdump, but the last requires some care. I won't go into details in this post, but it's an interesting little problem.
I wrote a Python script to do this and the output is JSON:
{
  "module_imports": {
    "a.wasm": [
      "b_fn"
    ],
    "b.wasm": []
  },
  "func_locations": {
    "a_fn": "a.wasm",
    "b_fn": "b.wasm"
  },
  "func_import_deps": {
    "a_fn": [
      "b_fn"
    ]
  }
}
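To make the shape of that JSON a bit more concrete, here's a small sketch of how it can be walked to figure out which modules a single function needs, dependencies first. The helper name modulesNeeded is made up for illustration and isn't part of the script mentioned above.

```javascript
// The JSON from above, inlined so the example is self-contained.
const manifest = {
  module_imports: { "a.wasm": ["b_fn"], "b.wasm": [] },
  func_locations: { a_fn: "a.wasm", b_fn: "b.wasm" },
  func_import_deps: { a_fn: ["b_fn"] },
};

// Hypothetical helper: list the modules required to call `fn`,
// ordered so that every dependency comes before its dependent.
function modulesNeeded(manifest, fn, seen = new Set(), order = []) {
  if (!(fn in manifest.func_locations)) {
    throw new Error(`unknown function: ${fn}`);
  }
  if (seen.has(fn)) {
    return order; // already visited (this also guards against cycles)
  }
  seen.add(fn);
  for (const dep of manifest.func_import_deps[fn] ?? []) {
    modulesNeeded(manifest, dep, seen, order);
  }
  const wasmFile = manifest.func_locations[fn];
  if (!order.includes(wasmFile)) {
    order.push(wasmFile);
  }
  return order;
}

console.log(modulesNeeded(manifest, "a_fn")); // [ 'b.wasm', 'a.wasm' ]
console.log(modulesNeeded(manifest, "b_fn")); // [ 'b.wasm' ]
```

Calling b_fn only ever needs b.wasm, which is exactly the saving we're after.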
Ok great, we have the ability to load b.wasm without loading a.wasm. But no one wants to figure out which modules to load in JavaScript. We want the necessary WebAssembly to be loaded automatically and on-demand without changing the C/C++ that we're compiling.
Here's the trick:
WebAssembly imports take JavaScript references, and what sits behind those references can be filled in later.
When loading a module, any function import that hasn't been loaded yet starts out as a null reference.
We're going to wrap every function call with a check to see if it (and its dependencies) have been loaded already. If it hasn't been loaded, we load it and replace the null with a legitimate function.
This way, users can call whatever functions they want and only the relevant wasm will be pulled over the network. In the worst case, the user calls every function and we end up with the same amount of wasm loaded as when we started.
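One caveat worth sketching: the engine requires every function import to be callable at instantiation time, so a literal null in the import object is rejected with a LinkError. A small forwarding function gets around that by looking up the real target on each call. The example below is a sketch under a few assumptions: the two byte arrays are hand-assembled stand-ins for a.wasm and b.wasm (unmangled names, no shared memory), and the slots object is a made-up name for wherever the late-bound references live.

```javascript
// Stand-in for a.wasm: imports env.b_fn, exports a_fn(x) = b_fn(x) + 13.
const A_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,
  0x02, 0x0c, 0x01, 0x03, 0x65, 0x6e, 0x76, 0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,
  0x03, 0x02, 0x01, 0x00,
  0x07, 0x08, 0x01, 0x04, 0x61, 0x5f, 0x66, 0x6e, 0x00, 0x01,
  0x0a, 0x0b, 0x01, 0x09, 0x00, 0x20, 0x00, 0x10, 0x00, 0x41, 0x0d, 0x6a, 0x0b,
]);
// Stand-in for b.wasm: no imports, exports b_fn(x) = x * 7.
const B_WASM = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,
  0x01, 0x06, 0x01, 0x60, 0x01, 0x7f, 0x01, 0x7f,
  0x03, 0x02, 0x01, 0x00,
  0x07, 0x08, 0x01, 0x04, 0x62, 0x5f, 0x66, 0x6e, 0x00, 0x00,
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x41, 0x07, 0x6c, 0x0b,
]);

// A literal null import fails at instantiation time.
let nullImportError = null;
try {
  new WebAssembly.Instance(new WebAssembly.Module(A_WASM), { env: { b_fn: null } });
} catch (e) {
  nullImportError = e;
}

// Late-bound slot (hypothetical name): starts empty, filled in once b is loaded.
const slots = { b_fn: null };
const forwarder = (...args) => {
  if (slots.b_fn === null) {
    throw new Error("b_fn was called before it was loaded");
  }
  return slots.b_fn(...args);
};

// a can be instantiated before b exists...
const a = new WebAssembly.Instance(new WebAssembly.Module(A_WASM), { env: { b_fn: forwarder } });
// ...and the real implementation is plugged in later.
const b = new WebAssembly.Instance(new WebAssembly.Module(B_WASM), {});
slots.b_fn = b.exports.b_fn;

const result = a.exports.a_fn(3);
console.log(result); // 3 * 7 + 13 = 34
```

The forwarder costs an extra hop through JavaScript on every cross-module call, so it makes sense to pass the real export directly whenever it is already available.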
(Feel free to skip to results :^})
First we'll create a Loader class that takes in the JSON file generated above.
class Loader {
  constructor(json_file) {
    this.json_file = json_file;
  }
  async init() {
    this.json = await (await fetch(this.json_file)).json();
    this.module_imports = this.json.module_imports;
    this.func_locations = this.json.func_locations;
    this.funcs = {};
    for (let func of Object.keys(this.func_locations)) {
      this.funcs[func] = new Func(func, this);
    }
    this.func_import_deps = this.json.func_import_deps;
  }
  // TODO
}
It also initializes a bunch of Funcs, which will store references to the actual WebAssembly functions.
class Func {
  constructor(fn, loader) {
    this.loaded = false;
    this.loader = loader;
    this.fn = fn;
    this.func = null;
  }
  async call(...args) {
    if (!this.loaded) {
      await this.loader.load(this.fn);
    }
    return this.func(...args);
  }
}
This sets up a structure that will be used like this:
const loader = new Loader('processed.json');
await loader.init();
const a_fn_output = await loader.funcs["a_fn"].call(3);
Why all the awaits? Well, we're converting our WebAssembly functions into lazily loaded functions that may hit the network to resolve dependencies on the first call. If a call does, we'll have to wait for the result to come back. Otherwise, we just call the function immediately.
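A side effect of awaiting on the first call is that two callers can hit the same unloaded function at the same moment, and both will kick off a load. Here's a minimal sketch of one way to guard against that by sharing the in-flight promise. The names makeLoadOnce and fakeLoad are invented for this example (fakeLoad stands in for the fetch and instantiate step), and none of this is part of the code in this post.

```javascript
// Hypothetical helper: wrap any async loader so that concurrent requests
// for the same key share a single in-flight promise.
function makeLoadOnce(loadFn) {
  const inflight = new Map();
  return function loadOnce(key) {
    if (!inflight.has(key)) {
      inflight.set(key, loadFn(key));
    }
    return inflight.get(key);
  };
}

// Stand-in for a network fetch + instantiate; counts how often it runs.
let loadCount = 0;
async function fakeLoad(key) {
  loadCount += 1;
  await new Promise((resolve) => setTimeout(resolve, 10));
  return key;
}

const loadOnce = makeLoadOnce(fakeLoad);

// Two simultaneous requests for a.wasm and one for b.wasm.
const demo = Promise.all([
  loadOnce("a.wasm"),
  loadOnce("a.wasm"),
  loadOnce("b.wasm"),
]).then(() => loadCount);

demo.then((n) => console.log("loads started:", n)); // loads started: 2
```

Note that a failed load stays cached in this sketch, so a real version would probably want to drop the entry on rejection.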
Finally, we need to implement the load(fn) function that we hit in the worst case. Over in the Loader class we can add these two methods:
// in Loader class, dedented for clarity
async load(fn) {
  // already loaded!
  if (this.funcs[fn].loaded) {
    return;
  }
  // if we have deps, load them first
  if (fn in this.func_import_deps) {
    for (let dep of this.func_import_deps[fn]) {
      if (this.funcs[dep].loaded) {
        continue;
      }
      // recurse :^)
      await this.load(dep);
    }
  }
  // bring our module into memory
  await this.load_wasm(this.func_locations[fn]);
}
async load_wasm(wasm_fn) {
  // `memory` is the shared WebAssembly.Memory created earlier
  const imports = { env: { memory: memory } };
  if (wasm_fn in this.module_imports) {
    for (let imp of this.module_imports[wasm_fn]) {
      // the target might actually be null! that's okay,
      // it'll be updated when needed. A function import has to be
      // callable at instantiation time though, so in that case we
      // pass a forwarder that looks up the target on each call.
      const target = this.funcs[imp];
      imports.env[imp] = target.func ?? ((...args) => target.func(...args));
    }
  }
  const m = await WebAssembly.instantiateStreaming(fetch(wasm_fn), imports);
  const exports = m.instance.exports;
  for (let e of Object.keys(exports)) {
    if (e in this.funcs) {
      // this is the key bit of magic
      this.funcs[e].func = exports[e];
      this.funcs[e].loaded = true;
    }
  }
}
And that's pretty much it! We've now got a nice way to load chunks of a library at the granularity of individual compilation units without changing any of the source code.
I wasn't happy with just the toy case above so I generated a thousand files with varying dependencies on each other to test out this approach.
The result was an 8.7MB fully merged single binary. Using the methods above, it's 1000 separate files, each about 8-9KB in size.
Below is a video of the results.
First we use the Loader written above and then we load the entire wasm. For the three functions we call, this ends up being 5x faster! (0.25 seconds vs 1.35 seconds)
The performance of individual functions takes a reasonable hit, but we're still within ~1.2x of the original performance. This isn't bad for a 5x speedup in terms of initial load, though! And of course, this doesn't exclude just loading the full module in the background while providing a snappy first load.
And here are the network requests over time:
It's cool to see that even though we only called 3 functions, they depended on a couple others and those all got loaded automatically for us.
The full code listing can be found here: https://github.com/bwasti/web_assembly_experiments/tree/main/lazy_load
If you'd like to follow my work on performance (with a recent focus on the web), please follow me on Twitter!