# 15x Faster TypedArrays: Vector Addition in WebAssembly @ 154GB/s

*by [@bwasti](https://twitter.com/bwasti)*

****

Vector addition is the process of adding vectors component by component:

$$
\overrightarrow{A} + \overrightarrow{B} = (a_0 + b_0, a_1 + b_1, \dots)
$$

Over my holiday break,
I tried to optimize this operation in JavaScript.
I got to **~12 million additions per second on
1024-dimensional vectors**.
If you use Chrome or Firefox,
you can try out the [benchmark in your browser](https://bwasti.github.io/wasmblr/).

Here's a chart of the results across a spread of the techniques described below:

![](https://i.imgur.com/SuInbUY.png)

(All run on an M1 MacBook Air with node.js 17,
full code [here](https://github.com/bwasti/wasmblr/tree/main/emscripten_example))




### Starting with pure JavaScript

```javascript
let a = new Array(N).fill(0);
let b = new Array(N).fill(0);
let c = new Array(N).fill(0);

function add() {
  for (let i = 0; i < N; ++i) {
    c[i] = a[i] + b[i];
  }
}
```

This should certainly be slow.

Let's try with a couple sizes: 4, 64, 1024.
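
The throughput numbers below come from a simple warmup-then-measure loop.
Here's a minimal sketch of the harness (an approximation on my part; the real one
is in the repo linked above), which counts three arrays touched per add at 4 bytes
per element:

```javascript
// minimal benchmark sketch (assumed; see the linked repo for the real harness)
function bench(name, fn, N, iters = 1000000) {
  for (let i = 0; i < 1000; ++i) fn(); // warmup, so the JIT can kick in
  const t0 = performance.now();
  for (let i = 0; i < iters; ++i) fn();
  const sec = (performance.now() - t0) / 1000;
  const ips = iters / sec;
  const gbps = (ips * N * 3 * 4) / 1e9; // 3 arrays, 4 bytes per element
  console.log(`  ${name}: ${gbps.toFixed(2)} GB/s (${ips.toFixed(2)} iters/sec)`);
}
```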

<pre>
<u>benchmarking vec add of size 4</u>
  <b>pure javascript:     6.43 GB/s (133952036.48 iters/sec)</b>
<u>benchmarking vec add of size 64</u>
  <b>pure javascript:     13.71 GB/s (17850781.69 iters/sec)</b>
<u>benchmarking vec add of size 1024</u>
  <b>pure javascript:     14.96 GB/s (1217559.02 iters/sec)</b>
</pre>

Ok cool, that's our baseline: ~15 GB/s.


### + `TypedArray`s

Instead of boring old `Array`, let's use `Float32Array`.
A typed version will let V8 really do its thing.
On top of that, `Array` stores numbers as `double`s, and a `float` is half the size.
All we have to do is swap out the initial calls.

```javascript
let a = new Float32Array(N);
let b = new Float32Array(N);
let c = new Float32Array(N);
```
And we're off to the races:


<pre>
<u>benchmarking vec add of size 4</u>
  <b>pure javascript:     6.54 GB/s (136224559.41 iters/sec)</b>
  typed arrays:        5.73 GB/s (119443844.4 iters/sec)
<u>benchmarking vec add of size 64</u>
  <b>pure javascript:     13.86 GB/s (18053198.87 iters/sec)</b>
  typed arrays:        10.81 GB/s (14080397.49 iters/sec)
<u>benchmarking vec add of size 1024</u>
  <b>pure javascript:     14.96 GB/s (1217559.02 iters/sec)</b>
  typed arrays:        10.88 GB/s (885410.91 iters/sec)
</pre>

Huh.  The `TypedArray` calls seem to cap out at ~11 GB/s?

Even with larger sizes...

<pre>
<u>benchmarking vec add of size 4096</u>
  <b>pure javascript:     15.08 GB/s (306770.86 iters/sec)</b>
  typed arrays:        11.03 GB/s (224453.88 iters/sec)
</pre>

It seems the JIT does a really good job with JavaScript arrays.
Disabling warmup in the benchmark helps to demonstrate this:

<pre>
<u>benchmarking vec add of size 4 [no warmup]</u>
  pure javascript:     0.04 GB/s (782572.09 iters/sec)
  <b>typed arrays:        0.13 GB/s (2701225.26 iters/sec)</b>
<u>benchmarking vec add of size 64 [no warmup]</u>
  pure javascript:     0.12 GB/s (154383.82 iters/sec)
  <b>typed arrays:        0.14 GB/s (180306.89 iters/sec)</b>
<u>benchmarking vec add of size 1024 [no warmup]</u>
  <b>pure javascript:     9.15 GB/s (744839.13 iters/sec)</b>
  typed arrays:        6.66 GB/s (542320.53 iters/sec)
</pre>

There's a great blog post by the V8 team on how this is done:
https://v8.dev/blog/elements-kinds.
The element kind of an `Array` is tracked at runtime, which means
the type hint provided by `TypedArray`s doesn't give them
as much of an intrinsic advantage.
`float`s *are* smaller, though, so I'm still a bit perplexed
as to why this is the case.
(**solved!** *[Matheus28](https://news.ycombinator.com/item?id=29771589) on HackerNews
has pointed out that the addition is being done in double precision, requiring
a conversion at each step. Using `Float64Array`s speeds things up to match
pure JavaScript.*)
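
If you want to try that fix yourself, it's the same one-line swap as before
(a sketch; `Float64Array` matches JavaScript's native double-precision numbers,
so no per-element conversion is needed):

```javascript
// Float64Array lanes are plain JS doubles: no float32 round-trip per add
let a = new Float64Array(N);
let b = new Float64Array(N);
let c = new Float64Array(N);
```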

Firefox uses a different JavaScript engine (SpiderMonkey) than node.js and Chromium (V8),
and shows the expected results (`TypedArray`s are faster).
Thanks
[viraptor on lobste.rs](https://lobste.rs/s/7rdpsj/15x_faster_typedarrays_vector_addition#c_o9rcgx)
for pointing this out!


### + [Emscripten](https://emscripten.org)

Emscripten is a framework that helps compile C and C++ projects
directly to WebAssembly.  It does a lot of heavy lifting,
including shimming system calls and the C standard library.

WebAssembly is designed as a simple instruction set that can be
executed at near-native performance.
Here's the spec: https://webassembly.github.io/spec/core/

### + Memory passing

On top of its simplicity, 
WebAssembly is also designed to be very safe.
This makes it a little bit awkward to interact with
when dealing with large amounts of memory.

For example, every WebAssembly instance has its own linear memory.
Out of the box,
the only memory-management primitive you're given is `Memory.grow`.
It behaves a lot more like `mmap` than `malloc`,
which is a bit of a pain.

```javascript
const instance = new WebAssembly.Instance(...);
let mem = instance.exports.memory; // this needs to be manually exported
mem.grow(1); // grow by a single page (64KiB)
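// careful: grow() detaches the old ArrayBuffer, so any existing views of mem.buffer must be recreated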
```

Emscripten comes to the rescue and actually implements `malloc` for us,
but the WebAssembly instance still owns the memory,
and we'll need to do all of our vector addition there.

Commonly, you'll see functions like this written when folks
use Emscripten:
```javascript
function emscripten_array(len) {
  var ptr = Module._malloc(len * 4);
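  // ptr is just an i32 byte offset into the wasm heap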
  return [new Float32Array(Module.HEAPF32.buffer, ptr, len), ptr];
}
```

The above code uses Emscripten's `malloc` and has JavaScript
reinterpret that pointer as an offset into Emscripten's heap.
This gives us two variables that refer to the same memory.
The latter can be passed to Emscripten calls and the former
can be manipulated as a TypedArray.

```javascript
let [a, a_] = emscripten_array(N);
let [b, b_] = emscripten_array(N);
let [c, c_] = emscripten_array(N);
const add = Module._add;
// initialize a, b
for (let i = 0; i < N; ++i) {
  a[i] = Math.random();
  b[i] = Math.random();
}
add(a_, b_, c_, N);
// results are now in c
```

Now we're all set to run our freshly written Emscripten module.
Be sure to compile with `-O3` for performance.
```cpp
// emcc add.cc -O3 -s EXPORTED_FUNCTIONS="['_add']"
extern "C" {

void add(const float* a, const float* b, float* c, int len) {
  for (auto i = 0; i < len; ++i) {
    c[i] = a[i] + b[i];
  }
}

}
```
Unfortunately, we're still quite slow at smaller sizes, but
we see a noticeable speedup at size 1024.

<pre>
<u>benchmarking vec add of size 4</u>
  <b>pure javascript:     6.83 GB/s (142214376.78 iters/sec)</b>
  typed arrays:        5.71 GB/s (119016249.72 iters/sec)
  emscripten:          1.33 GB/s (27608487.76 iters/sec)
<u>benchmarking vec add of size 64</u>
  <b>pure javascript:     13.46 GB/s (17524173.64 iters/sec)</b>
  typed arrays:        10.64 GB/s (13853318.79 iters/sec)
  emscripten:          11.91 GB/s (15505845.36 iters/sec)
<u>benchmarking vec add of size 1024</u>
  pure javascript:     15.06 GB/s (1225731.4 iters/sec)
  typed arrays:        10.92 GB/s (888576.43 iters/sec)
  <b>emscripten:          22.81 GB/s (1856272.37 iters/sec)</b>
</pre>

### + [SIMD](https://en.wikipedia.org/wiki/SIMD)

Can we do any better? Surely `-O3` isn't the only flag available to us.

Modern processors can operate on multiple elements with a single instruction,
and WebAssembly has limited support for that (fixed-width 128-bit vectors).
We can tell Emscripten to use this feature by compiling with `-msimd128`.
This gives us a nice win:

<pre>
<u>benchmarking vec add of size 4</u>
  <b>pure javascript:     6.62 GB/s (137877529.08 iters/sec)</b>
  typed arrays:        5.76 GB/s (120071457.96 iters/sec)
  emscripten (simd):   1.33 GB/s (27729893.7 iters/sec)
<u>benchmarking vec add of size 64</u>
  pure javascript:     13.3 GB/s (17319586.98 iters/sec)
  typed arrays:        10.59 GB/s (13795225.43 iters/sec)
  <b>emscripten (simd):   18.31 GB/s (23845788.63 iters/sec)</b>
<u>benchmarking vec add of size 1024</u>
  pure javascript:     15.07 GB/s (1226252.88 iters/sec)
  typed arrays:        10.96 GB/s (892301.4 iters/sec)
  <b>emscripten (simd):   75.14 GB/s (6114821.99 iters/sec)</b>
</pre>

75 GB/s is really good!
But there are still some inefficiencies here.
For example, we're running a function that can handle *any*
vector size.  Would there be more optimizations available
if we limited the function to the vector size
we care about at the time of benchmarking?


### + [wasmblr](https://github.com/bwasti/wasmblr)

Working in C++ with Emscripten is great,
so for 99% of use cases, I don't want to change that workflow.
However, in this really contrived case, I want to be able
to go a bit deeper and do some dynamic code generation.
Unfortunately, Emscripten can often take seconds (if not minutes for larger projects)
to compile.

So I wrote a header file that lets me
emit pure WebAssembly directly from C++: `wasmblr` (https://github.com/bwasti/wasmblr).
It's fast because it doesn't do much beyond expose
the specification with C++ semantics.
Most of the code generation in this writeup runs in about 1.5 microseconds.
It takes 125 microseconds to instantiate a WebAssembly module,
so we're bounded at around 8,000 recompilations per second.
That's a lot faster than Emscripten for this use case.


### + Unroll

Since we know the size of the loop, we can unroll it.
This is a common technique used to avoid the overhead
of loop control and pointer arithmetic.

In wasmblr, every `cg.*` call emits a WebAssembly instruction.
This gives us a meta-programming-like ability to use
C++ loops to emit a load, add, and store for every
four elements of the vectors.

```cpp
std::vector<uint8_t> gen_add_unroll(int len) {
  assert(len % 4 == 0);
  wasmblr::CodeGenerator cg;
  
  auto pages = (len * 3 * 4) / (1 << 16) + 1;
  cg.memory(pages).export_("mem");
  
  auto add_func = cg.function({cg.i32, cg.i32, cg.i32}, {}, [&]() {
    // no loop emitted at all,
    // C++ is generating len / 4 groups of instructions
    for (auto i = 0; i < len / 4; ++i) {
      cg.local.get(2);

      cg.local.get(0);
      cg.v128.load(0, i * 16);

      cg.local.get(1);
      cg.v128.load(0, i * 16);

      cg.v128.f32x4_add();

      cg.v128.store(0, i * 16);
    }
  });
  cg.export_(add_func, "add");
  return cg.emit();
}
```
If you're not familiar with WebAssembly,
the above code might need some explaining.

First, that variable "`pages`".  Since wasmblr is dirt simple,
there is no `malloc` implementation we can invoke from JavaScript,
so I just hacked in a pre-allocation of the full size of the inputs and outputs
(2 inputs, 1 output, 4-byte floats, all of length `len`; a page is `1 << 16` bytes).

Second, what is that `cg.local.get(2)` doing at the top?
In WebAssembly, function arguments are the first locals, so `cg.local.get(2)`
pushes the third argument (our output pointer) onto the stack.
It has to happen first because the `cg.v128.store` way down at the
bottom pops it back off as its address operand.
WebAssembly is a stack-based instruction set, rather than a register-based one,
so it can get kinda funky to read.

Third, `load` and `store` take immediate arguments (an alignment hint and a
static byte offset).  This is a part of the WebAssembly specification that we're
exploiting: the hardcoded offset is added to the pointer on the stack.
That means instead of incrementing pointers into memory,
we can leave the pointers alone and bake the offsets into the code itself.
This wouldn't be possible with Emscripten, because its generated code has to
handle every possible length.
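
For completeness, here's roughly how the emitted bytes get loaded and called
from the JavaScript side (a sketch: the `mem` and `add` export names come from
the generator above, and packing the three vectors back-to-back is my assumption):

```javascript
// `bytes` holds the std::vector<uint8_t> emitted by gen_add_unroll,
// however it made its way over to JavaScript
const module = new WebAssembly.Module(new Uint8Array(bytes));
const instance = new WebAssembly.Instance(module);
const { mem, add } = instance.exports;

// three float32 views over the instance's linear memory, packed a | b | c
const a = new Float32Array(mem.buffer, 0 * N * 4, N);
const b = new Float32Array(mem.buffer, 1 * N * 4, N);
const c = new Float32Array(mem.buffer, 2 * N * 4, N);

add(0 * N * 4, 1 * N * 4, 2 * N * 4); // the "pointers" are byte offsets
// results are now visible through c
```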

The results are pretty good on the sizes we've chosen, about 2x Emscripten.
We're hitting up to 154 GB/s in this latest benchmark!

<pre>
<u>benchmarking vec add of size 4</u>
  <b>pure javascript:         6.47 GB/s (134745835.5 iters/sec)</b>
  typed arrays:            5.94 GB/s (123764397.54 iters/sec)
  emscripten (simd):       1.34 GB/s (27821574.95 iters/sec)
  wasmblr:                 2.73 GB/s (56838383.47 iters/sec)
<u>benchmarking vec add of size 64</u>
  pure javascript:         13.82 GB/s (17996280.78 iters/sec)
  typed arrays:            10.79 GB/s (14052220.48 iters/sec)
  emscripten (simd):       18.26 GB/s (23773963.85 iters/sec)
  <b>wasmblr:                 37.41 GB/s (48715916.95 iters/sec)</b>
<u>benchmarking vec add of size 1024</u>
  pure javascript:         15.04 GB/s (1223936.07 iters/sec)
  typed arrays:            10.91 GB/s (887874.77 iters/sec)
  emscripten (simd):       75.07 GB/s (6109401.71 iters/sec)
  <b>wasmblr:                 154.89 GB/s (12605111.46 iters/sec)</b>
</pre>

But, what about larger sizes?

<pre>
<u>benchmarking vec add of size 16384</u>
  pure javascript:         15.02 GB/s (76393.04 iters/sec)
  typed arrays:            10.89 GB/s (55389.2 iters/sec)
  <b>emscripten (simd):       90.96 GB/s (462658.42 iters/sec)</b>
  wasmblr:                 88.09 GB/s (448069.81 iters/sec)
</pre>

Damn.  We must have fallen out of the instruction cache.
Unrolling has its limits, and we've gone a bit too far:
at size 16384 that's 4096 unrolled groups at roughly 25 bytes each,
a ~100KB program for a simple for loop.

Guess we'll have to take the hard route and also keep track of pointers:

```cpp
std::vector<uint8_t> gen_add_mix(int len, int unroll) {
  assert(len % (unroll * 4) == 0);
  wasmblr::CodeGenerator cg;
  auto pages = (len * 3 * 4) / (1 << 16) + 1;
  cg.memory(pages).export_("mem");
  auto add_func = cg.function({cg.i32, cg.i32, cg.i32}, {}, [&]() {
    // keep track of the iteration with this variable
    auto iter = cg.local(cg.i32);
    cg.i32.const_(0);
    cg.local.set(iter);

    cg.loop(cg.void_);

    // same as above, but now locals are mutated
    for (auto i = 0; i < unroll; ++i) {
      cg.local.get(2);

      cg.local.get(0);
      cg.v128.load(0, i * 16);

      cg.local.get(1);
      cg.v128.load(0, i * 16);

      cg.v128.f32x4_add();

      cg.v128.store(0, i * 16);
    }

    // pointer/iterator math below
    cg.local.get(0);
    cg.i32.const_(unroll * 16);
    cg.i32.add();
    cg.local.set(0);

    cg.local.get(1);
    cg.i32.const_(unroll * 16);
    cg.i32.add();
    cg.local.set(1);

    cg.local.get(2);
    cg.i32.const_(unroll * 16);
    cg.i32.add();
    cg.local.set(2);

    cg.local.get(iter);
    cg.i32.const_(unroll * 16);
    cg.i32.add();
    cg.local.set(iter);

    // condition to continue the loop: br_if on a loop label
    // branches back to the top, so keep looping while
    // iter < len * 4 (the total byte length of each vector)
    cg.local.get(iter);
    cg.i32.const_(len * 4);
    cg.i32.lt_s();
    cg.br_if(0);

    cg.end(); // need to end the loop! this is important to remember
  });
  cg.export_(add_func, "add");
  return cg.emit();
}
```

Now we can unroll by a more reasonable factor of 16.

<pre>
<u>benchmarking vec add of size 4</u>
  <b>pure javascript:         6.32 GB/s (131599401.31 iters/sec)</b>
  typed arrays:            5.88 GB/s (122460888.5 iters/sec)
  emscripten (simd):       1.33 GB/s (27803398.21 iters/sec)
  wasmblr:                 2.73 GB/s (56812962.22 iters/sec)
<u>benchmarking vec add of size 64</u>
  pure javascript:         13.82 GB/s (18000013.97 iters/sec)
  typed arrays:            10.8 GB/s (14057150.76 iters/sec)
  emscripten (simd):       18.24 GB/s (23755889.12 iters/sec)
  <b>wasmblr:                 36.98 GB/s (48154136.31 iters/sec)</b>
<u>benchmarking vec add of size 1024</u>
  pure javascript:         15.04 GB/s (1224069.46 iters/sec)
  typed arrays:            10.85 GB/s (882788.85 iters/sec)
  emscripten (simd):       73.71 GB/s (5998175.91 iters/sec)
  <b>wasmblr:                 138.9 GB/s (11303927.39 iters/sec)</b>
<u>benchmarking vec add of size 16384</u>
  pure javascript:         15.27 GB/s (77662.85 iters/sec)
  typed arrays:            11.09 GB/s (56398.08 iters/sec)
  <b>emscripten (simd):       90.31 GB/s (459346.92 iters/sec)</b>
  wasmblr:                 86.71 GB/s (441049.92 iters/sec)
</pre>

Well, that was a lot of work for nothing.
And our performance at size 1024 seems to have gone down!
Playing around with the unroll factor, we can
probably get some more performance...

### + Tuning

Why manually play with unroll factors when we can recompile and reload `wasmblr` code
in 1/8000th of a second?
We may as well just try a bunch of unroll factors and save the best one.


```javascript
let best = 0;
let best_time = 1e9;
for (let i = 0; Math.pow(2, i) < Math.min(1024, N / 4 + 2); ++i) {
  let [fn, w_a, w_b, w_c] = gen_wasmblr(N, Math.pow(2, i));
  for (let _ = 0; _ < 100; ++_) {
    fn();
  }
  const t = performance.now();
  for (let _ = 0; _ < 1000; ++_) {
    fn();
  }
  const diff = performance.now() - t;
  if (diff < best_time) {
    best = i;
    best_time = diff;
  }
}
```
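
The loop above only records the winning exponent; to actually use it, we regenerate
one more time with the best factor (a sketch, reusing the same hypothetical
`gen_wasmblr` helper):

```javascript
// rebuild with the winning unroll factor
let [fn_tuned, t_a, t_b, t_c] = gen_wasmblr(N, Math.pow(2, best));
```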
Alright, let's see what happens!
And let's add another benchmark for good measure.
As we try larger vectors, we become bounded more by RAM and less by instruction
execution, so I'd expect things to even out.

<pre>
<u>benchmarking vec add of size 4</u>
  <b>pure javascript:         6.58 GB/s (137027121.14 iters/sec)</b>
  typed arrays:            5.94 GB/s (123815568.64 iters/sec)
  emscripten (simd):       1.33 GB/s (27750315.65 iters/sec)
  wasmblr:                 2.73 GB/s (56932301.96 iters/sec)
  wasmblr (tuned 2):       2.73 GB/s (56956492.7 iters/sec)
<u>benchmarking vec add of size 64</u>
  pure javascript:         13.81 GB/s (17975517.63 iters/sec)
  typed arrays:            10.78 GB/s (14037142.11 iters/sec)
  emscripten (simd):       18.21 GB/s (23713778.4 iters/sec)
  <b>wasmblr:                 36.9 GB/s (48052212.55 iters/sec)</b>
  wasmblr (tuned 4):       36.31 GB/s (47284842.5 iters/sec)
<u>benchmarking vec add of size 1024</u>
  pure javascript:         14.79 GB/s (1203713.72 iters/sec)
  typed arrays:            10.8 GB/s (878615.27 iters/sec)
  emscripten (simd):       73.35 GB/s (5969393.52 iters/sec)
  <b>wasmblr:                 144.1 GB/s (11727035.06 iters/sec)</b>
  wasmblr (tuned 16):      144.05 GB/s (11722845.16 iters/sec)
<u>benchmarking vec add of size 16384</u>
  pure javascript:         14.92 GB/s (75866.72 iters/sec)
  typed arrays:            11.03 GB/s (56121.47 iters/sec)
  emscripten (simd):       89.49 GB/s (455146.48 iters/sec)
  wasmblr:                 85.06 GB/s (432642.22 iters/sec)
  <b>wasmblr (tuned 4):       153.31 GB/s (779787.52 iters/sec)</b>
<u>benchmarking vec add of size 262144</u>
  pure javascript:         14.92 GB/s (4744.22 iters/sec)
  typed arrays:            10.99 GB/s (3493.11 iters/sec)
  emscripten (simd):       93.77 GB/s (29809.79 iters/sec)
  wasmblr:                 87.29 GB/s (27748.19 iters/sec)
  <b>wasmblr (tuned 2):       93.79 GB/s (29814.89 iters/sec)</b>
</pre>

Woo! We've recovered some perf and even improved on certain sizes.
I find it cool that tuning let us climb back up the cliff
for size 16,384 and actually stay in cache.
At the larger size of 262,144, it really doesn't matter much what we do:
the vectors themselves are so big that we're just waiting on memory
most of the time.

### Fin.

In conclusion, if your vectors are small, you're probably fine with pure JavaScript.
If you've got very large vectors (larger than cache),
Emscripten does a great job.
If you've got a very specific size and a lot of time on your hands,
you can check out `wasmblr` (or any other WebAssembly emitter).


### Thanks for reading :)

*If you'd like, you can follow me on Twitter here: [@bwasti](https://twitter.com/bwasti)*