# `loop_tool`'s Interactive Loop Optimization

*by [@bwasti](https://twitter.com/bwasti) (all thoughts in this writeup are my own)*

****

I'm working on a [loop tool](https://github.com/facebookresearch/loop_tool)
to interactively optimize dense linear algebra workloads.
Below is a demo of a user optimizing a program to be 10x faster than [NumPy](https://numpy.org):

<center>
<video playsinline loop muted autoplay style="max-width:100%" src="https://user-images.githubusercontent.com/4842908/153905962-fd068faa-1438-4327-9c50-c8d1072cbd27.mp4"></video>
</center>

This is done in a split-pane [tmux](https://en.wikipedia.org/wiki/Tmux) session.
The left is [`loop_tool`](https://github.com/facebookresearch/loop_tool),
interactively manipulated and benchmarked (in real time!).
The right is a script continuously running [bat](https://github.com/sharkdp/bat) to
render the underlying generated C file.

To try this out yourself, `pip install loop_tool` (code from the video [here](https://jott.live/code/loop_tool_tui_demo.py)).

The frontend looks a lot like NumPy with symbolic "named" dimensions.
The `lt.ui(tensor)` function lets you edit the schedules
associated with any tensor in-place.  The output is (optionally) a portable C file.
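
For example, a minimal session might look like the sketch below (the shapes and the `mm.c` output path are just for illustration):

```python
import numpy as np
import loop_tool as lt

s = lt.SymbolGenerator()

# wrap NumPy data and give the dimensions symbolic names
a = lt.Tensor(np.random.randn(128, 128)).to(s.m, s.k)
b = lt.Tensor(np.random.randn(128, 128)).to(s.k, s.n)

# a lazily evaluated matrix multiplication
c = (a * b).sum(s.k)

# open the interactive TUI; the optional second argument is a file
# that will be populated with the generated C code
lt.ui(c, "mm.c")
```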

# But, why?

There's been an increasing need for loop compilers due to the spread of
machine learning, data, and scientific workloads, as well as the proliferation of
hardware that can execute them.
Ideally, we'd want a magical compiler to connect these.

![](https://i.imgur.com/4woEtjt.png)

Can this be done by moving the code to new fast languages like Rust or Zig?
Can well known compiler backends like LLVM do this for us automatically?

Unfortunately, no. For two reasons.

### *[Intermediate Representation (IR)](https://en.wikipedia.org/wiki/Intermediate_representation) Mismatch*

Check out these programs (feel free to skim):

```python
set(out, 0)
for a in range(16):
  for b in range(16):
    out[a] += inp[a, b]
```

```python
for a in range(16):
  tmp = 0
  for b in range(16):
    tmp += inp[a, b]
  out[a] = tmp
```

```python
tmp = alloc(8)
for a_ in range(2):
  set(tmp, 0)
  for b in range(16):
    for a in range(8):
      tmp[a] += inp[a_ * 8 + a, b]
  for a in range(8):
    out[a_ * 8 + a] = tmp[a]
```
Mathematically, they all compute the exact same thing.
Some call `set`, some call `alloc`, and some split up loops.
Transformations like these aren't readily expressible in a low-level IR like LLVM IR.

### *Heuristic Mismatch*

Such transformations *could* be added, but then you face another problem:
there is no easy way to decide which transformation is "best."
The resultant performance depends on so many factors related to the underlying
hardware that it's extremely difficult to predict ahead of time.
The heuristics found in GCC and LLVM take a much simpler approach
and wouldn't be adequate for this task.

# Approaches To Keep an Eye On

While these issues make traditional compilers ill-suited to such workloads on their own,
there are many solutions being built! There is a lot of
prior work in the space to read about, but I think it can safely be
boiled down to three approaches.

### 1. *Automatic Tuning*


<center>
<video playsinline loop muted autoplay style="max-width:100%" src="https://thumbs.gfycat.com/SnarlingQuaintKittiwake-mobile.mp4"></video>
</center>

Projects like [TVM](https://tvm.apache.org)
and [Halide](https://halide-lang.org) implement an IR that can handle the transformations
needed for performance.  The workloads are easily benchmarked,
so they offer utilities for automatic tuning (much like profile-guided optimization in conventional compilers).


Since their IRs explode the size of the transformation space, brute-force search
doesn't work very well.  Instead, advanced cost modeling is used to
inform the tuning process and reduce the number of steps needed to find a good result
([Ansor](https://arxiv.org/abs/2006.06762)
and [Autoscheduler](https://halide-lang.org/papers/halide_autoscheduler_2019.pdf) respectively).

For a couple of reasons, these approaches can still take a very long time (hours).
Each step can take a while to compile and benchmark (`-O3` can get expensive)
and the search space, even with pruning and cost modeling, is still very large.

### 2. *Pattern Matching*

<center> <img style="max-width:80%"  src="https://c.tenor.com/UA0CEgMMGAkAAAAC/baby-fun.gif"></center>

Another popular approach is to extend the idea of pattern matching found in conventional compilers.
Projects like [XLA](https://www.tensorflow.org/xla)
find patterns in programs that can be
[transformed into well-known "fast paths."](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/service/cpu/conv_canonicalization.h#L26-L33)
This approach cleanly separates the kernel writers from the compiler writers and
doesn't require large sweeps through many transformations.
However, if user code doesn't match a known pattern, there can be a steep performance cliff.


### 3. *Polyhedral Modeling*

<center> <img style="max-width:80%" src="https://upload.wikimedia.org/wikipedia/commons/9/9c/Schönhardt_polyhedron_rotation.gif?20190417030508"></center>


LLVM's [Polly](https://polly.llvm.org) and GCC's [Graphite](https://gcc.gnu.org/wiki/Graphite) can be viewed as a mix of the other two approaches.
They identify patterns of loop nests and define an IR that enables
the application of affine transformations to loop bounds.
By modeling the program as a polytope, well-defined mathematical tools
can be used to solve for performance under that model (e.g. [isl](https://en.wikipedia.org/wiki/Integer_set_library)).
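
As a toy illustration (Python-style pseudocode, not Polly or Graphite syntax), one such affine transformation is a loop skew: the inner bounds become an affine function of the outer index, so the model can still reason exactly about which iterations execute:

```python
# original nest
for i in range(N):
  for j in range(M):
    body(i, j)

# after the affine reindexing (i, j) -> (i, i + j)
for i in range(N):
  for jj in range(i, i + M):
    body(i, jj - i)
```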

# Getting Humans in the Loop(`_tool`)

Rather than compete with the above projects, `loop_tool` aims to supplement them
with human-driven exploration.

People (especially hardware experts) are really good at intuitively finding fast programs.
`loop_tool` attempts to leverage this knowledge and make it simple to craft such programs.

Let's revisit the video from the top of the post.


<center>
<video playsinline loop muted autoplay style="max-width:100%" src="https://user-images.githubusercontent.com/4842908/153905962-fd068faa-1438-4327-9c50-c8d1072cbd27.mp4"></video>
</center>

What's actually going on?

The user defines a matrix multiplication that only considers positive products
before reducing over the shared dimension of the matrices.  It isn't a typical workload,
but it looks a bit like [gradient clipping](https://twitter.com/chhillee/status/1485679650340687873?s=21)
(which is common in machine learning).
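
In NumPy terms, it's something like the sketch below (the sizes are assumptions based on the video, and `K` is a guess):

```python
import numpy as np

N, M, K = 48, 33, 128  # M and N taken from the video; K is a guess
A = np.random.randn(N, K)
B = np.random.randn(K, M)

# a "matmul" that only sums the positive products over the shared dimension K
products = A[:, :, None] * B[None, :, :]  # shape (N, K, M)
C = np.maximum(products, 0).sum(axis=1)   # shape (N, M)
```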

This small change unfortunately falls completely off the fast paths chosen by library
and hardware vendors, so we lose a lot of the theoretically attainable performance with NumPy.

Fortunately, humans don't have such cliff-like knowledge, so an expert
would easily be able to write a fast implementation based on well known
matrix multiplication logic.  That's where `loop_tool` aims to step in and make life
easier.

### Tuning

There are a couple techniques being used:

*Tiling*

The user is tiling the computation into sizes that are amenable to automatic vectorization.  The size of `M` is 33, which is hard to
fit cleanly into SIMD registers.  Instead, we can split it into 4 chunks of size 8, with 1 extra iteration left over.
`loop_tool` is happy to unevenly divide loops and will keep track of the "tail" logic for you.  That last iteration won't be
vectorized, but it's only 3% of the computation in this case, so it's not a huge deal.
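
In plain Python terms (this is just an illustration of the split, not `loop_tool`'s API), the transformation looks something like:

```python
M = 33
inner = 8
outer, tail = divmod(M, inner)  # 4 full chunks of 8, plus a tail of 1

for mo in range(outer):
  for mi in range(inner):  # fixed trip count of 8: easy to vectorize
    m = mo * inner + mi
    # ... body ...
for mi in range(tail):     # the "tail" iteration, tracked for you
  m = outer * inner + mi
  # ... body ...
```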

*Minimizing intermediate memory*

When you have a [reduction](https://en.wikipedia.org/wiki/Fold_(higher-order_function)), not every loop can be shared.  In this case, we are adding up elements
over the loop `K` and can't write the result out to memory until the sum is finished.
As a result, any loop nested inside `K` will induce extra memory to store the temporary values (the partial sums).
We start with 1584 elements (all of `N` and `M`) but can quickly squash that down to 256 (only 8 of `M` and 32 of `N`).
There's no need to worry that 256 doesn't evenly divide 1584! `loop_tool` will reuse the 256-element buffer for the tail logic as well.
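
Roughly, in the same pseudocode style as the earlier examples (with sizes taken from the video and the tiling details elided):

```python
# before: all 1584 (n, m) partial sums are live while reducing over K
tmp = alloc(48 * 33)
for k in range(K):
  for n in range(48):
    for m in range(33):
      tmp[n * 33 + m] += ...

# after: only a 32 x 8 tile of partial sums is live inside K
tmp = alloc(32 * 8)  # 256 elements, reused for the tail tiles too
for n_, m_ in tiles(N, M):  # hypothetical iterator over (32, 8)-sized tiles
  set(tmp, 0)
  for k in range(K):
    for n in range(32):
      for m in range(8):
        tmp[n * 8 + m] += ...
  # copy the finished tile into the output
```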

*Loop reordering*

Similar to tiling, the loop order determines which forms of vectorization the C compiler can automatically detect.
In this case we see that `M` innermost gives the best results.  On different machines, or with different C compilers, this
might be different!
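
For instance (pseudocode again; the best order really does vary by machine), keeping `M` innermost makes the inner loop contiguous with a fixed trip count:

```python
for n in range(32):
  for k in range(K):
    for m in range(8):  # innermost: unit stride, trip count of 8
      tmp[n * 8 + m] += ...
```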



# How it works

The tool can be used in a variety of ways, as it's more of a "fully exposed" utility than
a language or framework.  There is no clear distinction between API surface
and internals yet.

### Pure Functional and Symbolic

`loop_tool` comes with a pure functional, lazily evaluated frontend embedded in Python
and C++ (and soon JavaScript).

```python
import loop_tool as lt

s = lt.SymbolGenerator()
a = lt.Tensor(s.x)
b = lt.Tensor(s.x, s.k)
c = (a * b).sum(s.x)
print(c.symbolic_shape) # [Symbol{k}]

# backward size inference (it's symbolic!)
b.set_size(128, 7)
c.unify()
print(a.shape) # [128]
```

It's fully integrated with NumPy:
```python
import numpy as np

a = lt.Tensor(np.random.randn(128, 128))
# numpy doesn't give any symbolic names, we can add those
a = a.to(s.m, s.n)
print(a.symbolic_shape) # [Symbol{m}, Symbol{n}]
print(a.shape) # [128, 128]

b = a.sum(s.n).numpy()
assert np.allclose(b, a.numpy().sum(axis=1), rtol=0.001, atol=0.001)
```

### The IR is just a DAG

The underlying IR is simple because the frontend is pure functional.
It's annotated with loop orders and sizes (including remainders if need be).
This is shown in the bottom row of each node below.
A loop tree can be generated by iterating topologically over the nodes
and emitting loops according to the orders on each node.
If a loop can be [coalesced](https://en.wikipedia.org/wiki/Loop_fission_and_fusion#Fission), it is!

<center> <img style="max-width:80%; width:400px" src="https://i.imgur.com/NFxIdiT.png"></center>

Above is a matrix multiplication IR example rendered with Graphviz.
It can be generated as follows:

```python
import loop_tool as lt
s = lt.SymbolGenerator()

a = lt.Tensor(s.m, s.k).set_size(200, 300)
b = lt.Tensor(s.k, s.n).set_size(300, 400)
c = (a * b).sum(s.k) # matrix multiplication

print(c.ir) # graphviz format - https://dreampuf.github.io/GraphvizOnline
```

### TUI / Transformations

The TUI wraps basic transformations like loop splitting and loop fusion, and automatically
determines static memory usage.  The optional second parameter to `lt.ui` specifies a file that
will be populated with generated C code.  This code can then be used wherever you like,
with the input/output/scratch-space specification commented at the top.

For reference, arrow keys move the cursor and the enter key puts the cursor into "drag" mode.
"S" splits the highlighted loop and "U" undoes the last action.
Try it out!

There's also an API for scripting the transformations provided by the TUI.
This looks a lot like Halide or TVM, but does a bit more work for you in
terms of memory allocation and ensuring correctness.

# Future Development

The core functionality is finished, but there's a lot more to do.

### Arithmetic Operations

There are effectively only two types of nodes in the IR: memory movement
and arithmetic computation.
The latter will need to be extended to cover more operations, as it currently
only handles a dozen or so.


### Information

Total runtime is easily measured, but raw line-by-line performance
is substantially more useful.  Perf-like utilities could be integrated to
provide fine-grained metrics about loop nests and help users determine which
parts of complex workloads need to be optimized.

### Frontends

NumPy is one of many frontends that exist in the space.
Luckily, most frameworks work with NumPy, so compatibility isn't an issue.
However, for performance, native integration is necessary.
Since `loop_tool` is lazily executed, zero-copy support for tensors
in other libraries is theoretically possible.

There's also a work-in-progress [JavaScript frontend](https://github.com/facebookresearch/loop_tool/blob/main/test/test.mjs#L89-L112).

### Backends

C is useful, but GPU code is also appealing.
There is an existing (but disabled by default) CUDA code generator that needs to be refreshed
to work with newer features.
There is also ongoing work to add a new [WebAssembly backend](https://github.com/facebookresearch/loop_tool/blob/main/src/backends/wasm/wasm.cpp).

### Auto-differentiation

The `loop_tool` IR is differentiable (for the most part), so it'd make
sense to add auto-diff.

### ???

The project is new and there are a bunch of unknowns.  If you have any suggestions,
please file an issue in the GitHub repo!

****

Thanks for reading!