Fast Matrix Multiplication (Animated)
by @bwasti
Start with Scalar
Matrix multiplication involves multiplying the rows of the red matrix with the columns of the blue matrix element by element and summing the products. This is done for every pairwise combination of row and column.
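To make the pairing concrete, call the red matrix A, the blue matrix B, and the output C (the letters are my own shorthand, not from the animation). Each output entry is

    C[i][j] = sum over k of A[i][k] * B[k][j]

with one such sum for every row i of the red matrix and column j of the blue matrix.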
To start, we'll access the memory of two input matrices in a naive
order: row-wise for the red matrix, and column-wise for the blue.
We first load a float from each matrix into registers. Then we ask the CPU to multiply the red and blue values and add the result to the current running sum (yellow). Although this is a single instruction (a fused multiply-add involving 3 inputs), it ends up taking the CPU 4 cycles to complete.
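Here is a minimal sketch of that scalar loop in C (the function name, the square row-major layout, and the C += A*B convention are my own assumptions, not from the animation); the innermost statement is the single multiply-add described above:

```c
#include <stddef.h>

/* Naive scalar matmul: C += A * B, all N x N, row-major.
 * A is the "red" matrix (read row-wise), B is the "blue" matrix
 * (read column-wise). The innermost statement is the one multiply-add
 * per element, with ~4 cycles of latency. */
void matmul_scalar(const float *A, const float *B, float *C, size_t N) {
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j++) {
            float sum = C[i * N + j];                    /* running sum (yellow) */
            for (size_t k = 0; k < N; k++) {
                sum += A[i * N + k] * B[k * N + j];      /* fused multiply-add */
            }
            C[i * N + j] = sum;
        }
    }
}
```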
Venture into Vectorization
The columns of the blue matrix do not depend on each other when calculating the output, so we can safely load 4 values at once from the blue matrix into a single register. The red matrix contributes only one value for all 4 blue values, so we load a single element and broadcast it across the register.
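A rough sketch of that vectorized inner loop using SSE intrinsics (as before, the function name, square row-major layout, and the x86 SSE + FMA target with N divisible by 4 are assumptions on my part):

```c
#include <immintrin.h>
#include <stddef.h>

/* Vectorized matmul: one output row of C is computed 4 columns at a time.
 * 4 adjacent blue values (same row k) are loaded into one register,
 * the single red value A[i][k] is broadcast, and both feed one fused
 * multiply-add. Requires N divisible by 4 and a CPU with SSE + FMA. */
void matmul_vec4(const float *A, const float *B, float *C, size_t N) {
    for (size_t i = 0; i < N; i++) {
        for (size_t j = 0; j < N; j += 4) {
            __m128 sum = _mm_loadu_ps(&C[i * N + j]);    /* 4 running sums (yellow) */
            for (size_t k = 0; k < N; k++) {
                __m128 a = _mm_set1_ps(A[i * N + k]);    /* broadcast one red value */
                __m128 b = _mm_loadu_ps(&B[k * N + j]);  /* 4 blue values */
                sum = _mm_fmadd_ps(a, b, sum);           /* serial dependency on sum */
            }
            _mm_storeu_ps(&C[i * N + j], sum);
        }
    }
}
```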
Our matrix multiplication is now churning through 4x as many elements per cycle. However, because we are iterating row-wise over the red matrix, every fused multiply-add depends on the output of the previous one, so its 4-cycle latency cannot be hidden. Thus, we can only use 3 of our 12 available registers.
Proper Pipelining
To use more registers, we'll need to leverage parallelism in the red matrix. Instead of iterating directly over the rows, we will iterate partially over the columns, broadcasting a value from each row into a different register. This allows us to execute independent summations.
The simple act of creating independent workstreams across different registers allows the CPU to run instructions in a pipelined fashion. Although a single instruction takes 4 cycles, the cost is amortized to 1 cycle per instruction, as 4 instructions will be in flight at all times. Keep in mind that registers can be reused with this scheme, which makes pipelining straightforward to achieve.
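A sketch of what this looks like in code, extending the SSE version above with 4 independent accumulators, one per red row (same assumptions as before: square row-major matrices, N divisible by 4, SSE + FMA):

```c
#include <immintrin.h>
#include <stddef.h>

/* Pipelined matmul: 4 rows of the red matrix are processed together,
 * each with its own accumulator register. The four FMA chains never
 * read each other's results, so the CPU can keep ~4 FMAs in flight
 * and hide the 4-cycle latency. */
void matmul_vec4x4(const float *A, const float *B, float *C, size_t N) {
    for (size_t i = 0; i < N; i += 4) {
        for (size_t j = 0; j < N; j += 4) {
            __m128 sum0 = _mm_loadu_ps(&C[(i + 0) * N + j]);
            __m128 sum1 = _mm_loadu_ps(&C[(i + 1) * N + j]);
            __m128 sum2 = _mm_loadu_ps(&C[(i + 2) * N + j]);
            __m128 sum3 = _mm_loadu_ps(&C[(i + 3) * N + j]);
            for (size_t k = 0; k < N; k++) {
                __m128 b = _mm_loadu_ps(&B[k * N + j]);  /* 4 blue values, reused 4x */
                /* one broadcast red value per row; each chain is independent */
                sum0 = _mm_fmadd_ps(_mm_set1_ps(A[(i + 0) * N + k]), b, sum0);
                sum1 = _mm_fmadd_ps(_mm_set1_ps(A[(i + 1) * N + k]), b, sum1);
                sum2 = _mm_fmadd_ps(_mm_set1_ps(A[(i + 2) * N + k]), b, sum2);
                sum3 = _mm_fmadd_ps(_mm_set1_ps(A[(i + 3) * N + k]), b, sum3);
            }
            _mm_storeu_ps(&C[(i + 0) * N + j], sum0);
            _mm_storeu_ps(&C[(i + 1) * N + j], sum1);
            _mm_storeu_ps(&C[(i + 2) * N + j], sum2);
            _mm_storeu_ps(&C[(i + 3) * N + j], sum3);
        }
    }
}
```

Note that inside the k loop the four sum registers never depend on one another; that independence is exactly what lets the fused multiply-adds overlap.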
And that's it. One concept not covered above is cache utilization (memory highlighted in white should be read as "in cache"). Cache utilization can be tuned by accessing memory in a "tiled" pattern.
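For illustration, one possible loop structure for such a tiled traversal (the tile size is a hypothetical placeholder to tune, and a real implementation would use one of the vectorized micro-kernels above in the innermost loops):

```c
#include <stddef.h>

/* Tiled access sketch: each (ii, kk, jj) block works on sub-matrices
 * small enough to stay in cache, so loaded values are reused before
 * being evicted. TILE is a placeholder; the right value depends on
 * cache sizes and is exactly what the post leaves for you to analyze. */
#define TILE 64
void matmul_tiled(const float *A, const float *B, float *C, size_t N) {
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t kk = 0; kk < N; kk += TILE)
            for (size_t jj = 0; jj < N; jj += TILE)
                /* micro-kernel over one tile of A, B, and C */
                for (size_t i = ii; i < ii + TILE && i < N; i++)
                    for (size_t k = kk; k < kk + TILE && k < N; k++) {
                        float a = A[i * N + k];
                        for (size_t j = jj; j < jj + TILE && j < N; j++)
                            C[i * N + j] += a * B[k * N + j];
                    }
}
```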
The exact analysis of this access pattern is
left as an exercise to the reader.
Thanks for reading!
If you'd like, follow me on Twitter! (@bwasti)