# `loop_tool`'s Interactive Loop Optimization

*by [@bwasti](https://twitter.com/bwasti) (all thoughts in this writeup are my own)*

****

I'm working on a [loop tool](https://github.com/facebookresearch/loop_tool)
to interactively optimize dense linear algebra workloads.
Below is a demo of a user optimizing a program to be 10x faster than [NumPy](https://numpy.org):

<center>
<video playsinline loop muted autoplay style="max-width:100%" src="https://user-images.githubusercontent.com/4842908/153905962-fd068faa-1438-4327-9c50-c8d1072cbd27.mp4"></video>
</center>

This is done in a split-pane [tmux](https://en.wikipedia.org/wiki/Tmux) session.
The left is [`loop_tool`](https://github.com/facebookresearch/loop_tool),
interactively manipulated and benchmarked (in real time!).
The right is a script continuously running [bat](https://github.com/sharkdp/bat) to
render the underlying generated C file.

To try this out yourself, `pip install loop_tool` (code from the video [here](https://jott.live/code/loop_tool_tui_demo.py)).

The frontend looks a lot like NumPy with symbolic "named" dimensions.
The `lt.ui(tensor)` function lets you edit the schedules
associated with any tensor in-place.  The output is (optionally) a portable C file.
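
For example, a minimal session might look like the sketch below (the shapes and the `mm.c` output path are just for illustration):

```python
import numpy as np
import loop_tool as lt

s = lt.SymbolGenerator()

# wrap NumPy data and give the dimensions symbolic names
a = lt.Tensor(np.random.randn(128, 128)).to(s.m, s.k)
b = lt.Tensor(np.random.randn(128, 128)).to(s.k, s.n)

# a lazily evaluated matrix multiplication
c = (a * b).sum(s.k)

# open the interactive TUI; the optional second argument is a file
# that will be populated with the generated C code
lt.ui(c, "mm.c")
```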

# But, why?

There's been an increasing need for loop compilers due to the spread of
machine learning, data, and scientific workloads, as well as the proliferation of
hardware that can execute them.
Ideally, we'd want a magical compiler to connect these.

![](https://i.imgur.com/4woEtjt.png)

Can this be done by moving the code to new fast languages like Rust or Zig?
Can well known compiler backends like LLVM do this for us automatically?

Unfortunately, no. For two reasons.

### *[Intermediate Representation (IR)](https://en.wikipedia.org/wiki/Intermediate_representation) Mismatch*

Check out these programs (feel free to skim):

```python
set(out, 0)
for a in range(16):
  for b in range(16):
    out[a] += inp[a, b]
```

```python
for a in range(16):
  tmp = 0
  for b in range(16):
    tmp += inp[a, b]
  out[a] = tmp
```

```python
tmp = alloc(8)
for a_ in range(2):
  set(tmp, 0)
  for b in range(16):
    for a in range(8):
      tmp[a] += inp[a_ * 8 + a, b]
  for a in range(8):
    out[a_ * 8 + a] = tmp[a]
```
Mathematically, they all compute the exact same thing.
Some call `set`, some call `alloc`, and some split up loops.
Transformations like these aren't readily expressible in a low-level IR like LLVM IR.

### *Heuristic Mismatch*

Such transformations *could* be added, but then you face another problem:
there is no easy way to decide which transformation is "best."
The resultant performance depends on so many factors related to the underlying
hardware that it's extremely difficult to predict ahead of time.
The heuristics found in GCC and LLVM take a much simpler approach
and wouldn't be adequate for this task.

# Approaches To Keep an Eye On

While these issues make traditional compilers ill-suited to such workloads on their own,
there are many solutions being built! There is a lot of
prior work in the space to read about, but I think it can safely be
boiled down to three approaches.

### 1. *Automatic Tuning*


<center>
<video playsinline loop muted autoplay style="max-width:100%" src="https://thumbs.gfycat.com/SnarlingQuaintKittiwake-mobile.mp4"></video>
</center>

Projects like [TVM](https://tvm.apache.org)
and [Halide](https://halide-lang.org) implement an IR that can handle the transformations
needed for performance.  The workloads are easily benchmarked,
so they offer utilities for automatic tuning (much like profile-guided optimization in conventional compilers).


Since their IRs explode the size of the transformation space, brute-force search
doesn't work very well.  Instead, advanced cost modeling is used to
inform the tuning process and reduce the number of steps needed to find a good result
([Ansor](https://arxiv.org/abs/2006.06762)
and [Autoscheduler](https://halide-lang.org/papers/halide_autoscheduler_2019.pdf) respectively).

For a couple of reasons, these approaches can still take a very long time (hours).
Each step can take a while to compile and benchmark (`-O3` can get expensive)
and the search space, even with pruning and cost modeling, is still very large.

### 2. *Pattern Matching*

<center> <img style="max-width:80%"  src="https://c.tenor.com/UA0CEgMMGAkAAAAC/baby-fun.gif"></center>

Another popular approach is to extend the idea of pattern matching found in conventional compilers.
Projects like [XLA](https://www.tensorflow.org/xla)
find patterns in programs that can be
[transformed into well-known "fast paths."](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/cpu/conv_canonicalization.h#L26-L33)
This approach cleanly separates the kernel writers from the compiler writers and
doesn't require large sweeps through many transformations.
However, if user code doesn't match a known pattern, there can be a steep performance cliff.


### 3. *Polyhedral Modeling*

<center> <img style="max-width:80%" src="https://upload.wikimedia.org/wikipedia/commons/9/9c/Schönhardt_polyhedron_rotation.gif?20190417030508"></center>


LLVM's [Polly](https://polly.llvm.org) and GCC's [Graphite](https://gcc.gnu.org/wiki/Graphite) can be viewed as a mix of the other two approaches.
They identify patterns of loop nests and define an IR that enables
the application of affine transformations to loop bounds.
By modeling the program as a polytope, well-defined mathematical tools
can be used to solve for performance under that model (e.g. [isl](https://en.wikipedia.org/wiki/Integer_set_library)).
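
As a toy illustration (Python-style pseudocode, not Polly or Graphite syntax), one such affine transformation is a loop skew: the inner bounds become an affine function of the outer index, so the model can still reason exactly about which iterations execute:

```python
# original nest
for i in range(N):
  for j in range(M):
    body(i, j)

# after the affine reindexing (i, j) -> (i, i + j)
for i in range(N):
  for jj in range(i, i + M):
    body(i, jj - i)
```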

# Getting Humans in the Loop(`_tool`)

Rather than compete with the above projects, `loop_tool` aims to supplement them
with human-driven exploration.

People (especially hardware experts) are really good at intuitively finding fast programs.
`loop_tool` attempts to leverage this knowledge and make it simple to craft such programs.

Let's revisit the video from the top of the post.


<center>
<video playsinline loop muted autoplay style="max-width:100%" src="https://user-images.githubusercontent.com/4842908/153905962-fd068faa-1438-4327-9c50-c8d1072cbd27.mp4"></video>
</center>

What's actually going on?

The user defines a matrix multiplication that only considers positive products
before reducing over the shared dimension of the matrices.  It isn't a typical workload,
but it looks a bit like [gradient clipping](https://twitter.com/chhillee/status/1485679650340687873?s=21)
(which is common in machine learning).
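
In NumPy terms, it's something like the sketch below (the sizes are assumptions based on the video, and `K` is a guess):

```python
import numpy as np

N, M, K = 48, 33, 128  # M and N taken from the video; K is a guess
A = np.random.randn(N, K)
B = np.random.randn(K, M)

# a "matmul" that only sums the positive products over the shared dimension K
products = A[:, :, None] * B[None, :, :]  # shape (N, K, M)
C = np.maximum(products, 0).sum(axis=1)   # shape (N, M)
```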

This small change unfortunately falls completely off the fast paths chosen by library
and hardware vendors, so we lose a lot of the theoretically attainable performance with NumPy.

Fortunately, humans don't have such cliff-like knowledge, so an expert
would easily be able to write a fast implementation based on well known
matrix multiplication logic.  That's where `loop_tool` aims to step in and make life
easier.

### Tuning

There are a couple techniques being used:

*Tiling*

The user is tiling the computation into sizes that are amenable to automatic vectorization.  The size of `M` is 33, which is hard to
fit cleanly into SIMD registers.  Instead, we can split it into 4 chunks of size 8, with 1 extra iteration left over.
`loop_tool` is happy to unevenly divide loops and will keep track of the "tail" logic for you.  That last iteration won't be
vectorized, but it's only 3% of the computation in this case, so it's not a huge deal.
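
In plain Python terms (this is just an illustration of the split, not `loop_tool`'s API), the transformation looks something like:

```python
M = 33
inner = 8
outer, tail = divmod(M, inner)  # 4 full chunks of 8, plus a tail of 1

for mo in range(outer):
  for mi in range(inner):  # fixed trip count of 8: easy to vectorize
    m = mo * inner + mi
    # ... body ...
for mi in range(tail):     # the "tail" iteration, tracked for you
  m = outer * inner + mi
  # ... body ...
```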

*Minimizing intermediate memory*

When you have a [reduction](https://en.wikipedia.org/wiki/Fold_(higher-order_function)), not every loop can be shared.  In this case, we are adding up elements
over the loop `K` and can't write the result out to memory until the sum is finished.
As a result, any loop nested inside `K` will induce extra memory to store the temporary values (the partial sums).
We start with 1584 elements (all of `N` and `M`) but can quickly squash that down to 256 (only 8 of `M` and 32 of `N`).
There's no need to worry that 256 doesn't evenly divide 1584! `loop_tool` will reuse the 256-element buffer for the tail logic as well.
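
Roughly, in the same pseudocode style as the earlier examples (with sizes taken from the video and the tiling details elided):

```python
# before: all 1584 (n, m) partial sums are live while reducing over K
tmp = alloc(48 * 33)
for k in range(K):
  for n in range(48):
    for m in range(33):
      tmp[n * 33 + m] += ...

# after: only a 32 x 8 tile of partial sums is live inside K
tmp = alloc(32 * 8)  # 256 elements, reused for the tail tiles too
for n_, m_ in tiles(N, M):  # hypothetical iterator over (32, 8)-sized tiles
  set(tmp, 0)
  for k in range(K):
    for n in range(32):
      for m in range(8):
        tmp[n * 8 + m] += ...
  # copy the finished tile into the output
```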

*Loop reordering*

Similar to tiling, the loop order determines which forms of vectorization the C compiler can automatically detect.
In this case we see that `M` innermost gives the best results.  On different machines, or with different C compilers, this
might be different!
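
For instance (pseudocode again; the best order really does vary by machine), keeping `M` innermost makes the inner loop contiguous with a fixed trip count:

```python
for n in range(32):
  for k in range(K):
    for m in range(8):  # innermost: unit stride, trip count of 8
      tmp[n * 8 + m] += ...
```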



# How it works

The tool can be used in a variety of ways, as it's more of a "fully exposed" utility than
a language or framework.  There is no clear distinction between API surface
and internals yet.

### Pure Functional and Symbolic

`loop_tool` comes with a pure functional, lazily evaluated frontend embedded in Python
and C++ (and soon JavaScript).

```python
import loop_tool as lt

s = lt.SymbolGenerator()
a = lt.Tensor(s.x)
b = lt.Tensor(s.x, s.k)
c = (a * b).sum(s.x)
print(c.symbolic_shape) # [Symbol{k}]

# backward size inference (it's symbolic!)
b.set_size(128, 7)
c.unify()
print(a.shape) # [128]
```

It's fully integrated with NumPy:
```python
import numpy as np

a = lt.Tensor(np.random.randn(128, 128))
# numpy doesn't give any symbolic names, we can add those
a = a.to(s.m, s.n)
print(a.symbolic_shape) # [Symbol{m}, Symbol{n}]
print(a.shape) # [128, 128]

b = a.sum(s.n).numpy()
assert np.allclose(b, a.numpy().sum(axis=1), rtol=0.001, atol=0.001)
```

### The IR is just a DAG

The underlying IR is simple because the frontend is pure functional.
It's annotated with loop orders and sizes (including remainders if need be).
This is shown in the bottom row of each node below.
A loop tree can be generated by iterating topologically over the nodes
and emitting loops according to the orders on each node.
If a loop can be [coalesced](https://en.wikipedia.org/wiki/Loop_fission_and_fusion#Fission), it is!

<center> <img style="max-width:80%; width:400px" src="https://i.imgur.com/NFxIdiT.png"></center>

Above is a matrix multiplication IR example rendered with Graphviz.
It can be generated as follows:

```python
import loop_tool as lt
s = lt.SymbolGenerator()

a = lt.Tensor(s.m, s.k).set_size(200, 300)
b = lt.Tensor(s.k, s.n).set_size(300, 400)
c = (a * b).sum(s.k) # matrix multiplication

print(c.ir) # graphviz format - https://dreampuf.github.io/GraphvizOnline
```

### TUI / Transformations

The TUI wraps basic transformations like loop splitting and loop fusion, and automatically
determines static memory usage.  The optional second parameter to `lt.ui` specifies a file that
will be populated with generated C code.  This code can then be used wherever you like,
with the input/output/scratch-space specification commented at the top.

For reference, arrow keys move the cursor and the enter key puts the cursor into "drag" mode.
"S" splits the highlighted loop and "U" undoes the last action.
Try it out!

There's also an API for scripting the transformations provided by the TUI.
This looks a lot like Halide or TVM, but does a bit more work for you in
terms of memory allocation and ensuring correctness.

# Future Development

The core functionality is finished, but there's a lot more to do.

### Arithmetic Operations

There are effectively only two types of nodes in the IR: memory movement
and arithmetic computation.
The latter will need to be extended to cover more operations, as it currently
only handles a dozen or so.


### Information

Total runtime is easily measured, but raw line-by-line performance
is substantially more useful.  Perf-like utilities could be integrated to
provide fine-grained metrics about loop nests and help users determine which
parts of complex workloads need to be optimized.

### Frontends

NumPy is one of many frontends that exist in the space.
Luckily, most frameworks work with NumPy, so compatibility isn't an issue.
However, for performance, native integration is necessary.
Since `loop_tool` is lazily executed, zero-copy support for tensors
in other libraries is theoretically possible.

There's also a work-in-progress [JavaScript frontend](https://github.com/facebookresearch/loop_tool/blob/main/test/test.mjs#L89-L112).

### Backends

C is useful, but GPU code is also appealing.
There is an existing (but disabled by default) CUDA code generator that needs to be refreshed
to work with newer features.
There is also ongoing work to add a new [WebAssembly backend](https://github.com/facebookresearch/loop_tool/blob/main/src/backends/wasm/wasm.cpp).

### Auto-differentiation

The `loop_tool` IR is differentiable (for the most part), so it'd make
sense to add auto-diff.

### ???

The project is new and there are a bunch of unknowns.  If you have any suggestions,
please file an issue in the GitHub repo!

****

Thanks for reading!