Matrix multiplication involves multiplying and summing over the rows of the red matrix with the columns of the blue matrix. This is done for every pairwise row and column. To start, we'll access the memory of two input matrices in a naive order: row-wise for the red matrix, and column-wise for the blue.

memory
registers
cpu

We first load a float from each matrix to registers. Then, we ask the CPU to do a multiplication of the red and blue values and an addition of the current running sum (yellow). Although this is one instruction (involving 3 inputs), it ends up taking the CPU 4 cycles to complete.

The columns for the blue matrix do not depend on each other to calculate the output, so we can safely load 4 values at once from the blue matrix into a single register. The red matrix contributes only one value for all 4 blue values, so we load a single element and broadcast it across the register.

Our matrix multiplication is now churning through 4x as many elements per cycle. However, because we are iterating row-wise over the red matrix, every instruction depends on the output of a previous instruction. Thus, we can only use 3 of our 12 available registers.

memory
registers
cpu

To use more registers, we'll need to leverage parallelism in the red matrix. Instead of iterating directly over the rows, we will iterate partially over the columns, broadcasting each row into a different register. This allows us to execute independent summations.

The simple act of creating independent workstreams across different registers allows the CPU to run instructions in a pipelined fashion. Although a single instruction takes 4 cycles, the runtime is ammortized to 1 cycle as 4 instructions will be in flight at all times. Keep in mind that registers can be reused with this scheme, which makes pipelining straight-forward to achieve.

memory
registers
cpu

And that's it. One concept not mentioned is the utilization of cache (memory highlighted in white should be seen as "in cache"). Cache utilization can be tuned by accessing memory in a "tiled" pattern. The exact analysis of this access pattern is left as an exercise to the reader.

Thanks for reading! If you'd like, follow me on twitter! (@bwasti)