Machine Learning with Unix Pipes

by @bwasti


I contribute to an open source programmer-focused machine learning library called Shumai. Recently, we added some basic /dev/stdin handling, which makes it possible to compose standard Unix utilities with machine learning on the command-line.

Pipes | Refresher

Unix pipes are a surprisingly flexible and easy-to-use mechanism for inter-process communication. For example, we can count how many three-letter words start with "a":

% cat /usr/share/dict/words | grep -e '^a..$' | wc -l
90

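90 words! But what's actually happening here?

cat writes the contents of the dictionary file to its standard output. grep reads that stream and only passes along the lines that are an "a" followed by exactly two more characters. wc -l counts the lines it receives and prints the total.

And that's it!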
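If it helps to see the stages spelled out, here's a rough TypeScript sketch with one function per command. The tiny word list is a made-up stand-in for /usr/share/dict/words, and the function names are just for illustration.

```typescript
// Each stage mirrors one command in the pipeline: cat | grep | wc -l.
// The word list is a tiny hypothetical stand-in for /usr/share/dict/words.
const words: string[] = ["a", "act", "add", "apple", "bat", "ark", "Ada"];

// "cat": emit every line unchanged.
function cat(lines: string[]): string[] {
  return lines;
}

// "grep -e '^a..$'": keep lines that are exactly "a" plus two more characters.
function grep(lines: string[], pattern: RegExp): string[] {
  return lines.filter((line) => pattern.test(line));
}

// "wc -l": count the lines that made it through.
function wcLines(lines: string[]): number {
  return lines.length;
}

const count = wcLines(grep(cat(words), /^a..$/));
console.log(count); // prints 3 ("act", "add", "ark")
```

One difference worth keeping in mind: real pipes run all three programs at once and stream data between them, while this sketch finishes each stage before starting the next.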

Another thing to note is that pipes can be extremely quick. They've been around for decades and have been a key tool in the toolbox of many people, so they're pretty damn optimized. A good way to measure the performance of a pipe is to use the pv utility:

% yes "hi" | pv > /dev/null
11.3GiB 0:00:02 [5.69GiB/s] [          <=>         ]

A raw Unix pipe can hit nearly 6GiB/s! Flexibility, performance and decades of documentation are super compelling reasons to add pipe support to Shumai.

Part 1: Multiply add

Now let's use these pipes for some machine learning! To start we'll "learn" something very simple:

f(x)=mx+b

To do so, we first express the function in code. Parameters like m and b are typically randomly initialized, but for now we'll hardcode them to 7 and 3 respectively.

// learn.ts
import * as sm from '@shumai/shumai'

const m = sm.scalar(7).requireGrad().checkpoint()
const b = sm.scalar(3).requireGrad().checkpoint()

export default function f(x) {
  return m.mul(x).add(b)
}

export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error

This is all functional code, but checkpoint() does some fancy things to ensure that repeated invocations of this script will cache the values learned. As a result, you'll see random files like tensor_2304823423.fl generated. Delete them to reset training.

To test this model out, we pipe a value into an invocation of shumai infer:

% echo 7 | shumai infer learn.ts
52:Float32

7×7+3=52. Great! Since we know the values of m and b, this is expected.

What we just did is often called inference (which is an overloaded term, apologies to Bayesians): predict a result based on an input.

We also want to train this model to work for some data we've collected. Let's say we know an input of 4 should be 19 and 6 should be 25. Our current parameters don't capture that at all.

Training involves the creation of input/output pairs and this can be done on the command-line too. All we have to do is place a | separator between them and use shumai train.

% echo '4 | 19' | shumai train learn.ts
143.99899291992188
% echo '6 | 25' | shumai train learn.ts
376.3590393066406

The loss is printed out after each run. Loss is basically the cost of getting things wrong. Since we're using mean squared error to measure loss, huge numbers like this mean we haven't learned much yet.
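To make the loss numbers less mysterious, here's a plain-number sketch of the same model with a squared-error loss and a single gradient descent step. It doesn't use Shumai at all, and the learning rate of 0.001 is my assumption (Shumai's actual default may differ). With that value the second loss comes out at roughly 376.36, close to what was printed above, which is suggestive but not a confirmation.

```typescript
// Plain-number version of f(x) = m*x + b with a squared-error loss.
// ASSUMPTION: learning rate 0.001; Shumai's real default may differ.
const learningRate = 0.001;

interface Params {
  m: number;
  b: number;
}

function predict(p: Params, x: number): number {
  return p.m * x + p.b;
}

function squaredError(prediction: number, target: number): number {
  const diff = prediction - target;
  return diff * diff;
}

// One SGD step on a single (x, target) pair.
// d(loss)/dm = 2 * (prediction - target) * x
// d(loss)/db = 2 * (prediction - target)
function sgdStep(p: Params, x: number, target: number): Params {
  const diff = predict(p, x) - target;
  return {
    m: p.m - learningRate * 2 * diff * x,
    b: p.b - learningRate * 2 * diff,
  };
}

const start: Params = { m: 7, b: 3 };
const lossBefore = squaredError(predict(start, 4), 19); // (31 - 19)^2 = 144
const afterOneStep = sgdStep(start, 4, 19);
const lossOnNext = squaredError(predict(afterOneStep, 6), 25); // roughly 376.36
console.log(lossBefore, lossOnNext);
```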

So let's train it for longer! First we make a dataset:

% echo '4 | 19' >> data.txt
% echo '6 | 25' >> data.txt
% cat data.txt
4 | 19
6 | 25

And then we train with it for 100,000 steps:

% yes "$(cat data.txt)" | head -n 100000 | shumai train learn.ts
0.11650670319795609

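Here yes repeats the contents of data.txt forever and head cuts the stream off after 100,000 lines. So, by piping this into shumai train learn.ts, we've trained the model for 100k steps.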
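For intuition, the whole training run can be imitated in a few lines of plain TypeScript. This is only a sketch under assumptions: the "input | target" parsing is my guess at what the trainer does with each line, and the 0.001 learning rate is made up, so don't expect it to reproduce the exact numbers above.

```typescript
// Mimics `yes "$(cat data.txt)" | head -n 100000` feeding a trainer.
// ASSUMPTIONS: the line parsing and the 0.001 learning rate are guesses,
// not Shumai's actual behavior.
const dataset: string[] = ["4 | 19", "6 | 25"];

function parsePair(line: string): [number, number] {
  const parts = line.split("|").map((part) => Number(part.trim()));
  if (parts.length !== 2 || parts.some((n) => Number.isNaN(n))) {
    throw new Error(`expected "input | target", got: ${line}`);
  }
  return [parts[0], parts[1]];
}

let m = 7;
let b = 3;
const learningRate = 0.001;
const totalLines = 100000;

for (let i = 0; i < totalLines; i++) {
  // `yes` repeats the dataset forever; `head` stops after totalLines lines.
  const [x, target] = parsePair(dataset[i % dataset.length]);
  const diff = m * x + b - target; // prediction error for this line
  m -= learningRate * 2 * diff * x; // gradient of squared error w.r.t. m
  b -= learningRate * 2 * diff; // gradient of squared error w.r.t. b
}

console.log(m, b, m * 4 + b);
```

Both data points sit on the line 3x+7, so the parameters drift from (7, 3) toward m=3 and b=7 as the loop runs.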

To test it out, we use shumai infer yet again:

% echo 4 | shumai infer learn.ts
18.49370574951172:Float32

Woo! Pretty close.

Part 2: Actually Useful Stuff (Heterogeneous Load Balancing)

When would machine learning on the command-line of all places actually be useful? I use the command-line to get things done, not play with images and text generation!

Here's an example: we have a program that uses different amounts of CPU based on its input. We've got two machines, a slow one (cheap) and a fast one (expensive). Our task is to run the program on the fast machine only when it makes sense.

For the sake of this writeup, we'll pretend the input is a single integer. More realistically the input could be a much longer array of bytes and the ideas shown below would still apply. For ingesting arbitrary binary, here's how I might convert it to a Shumai-readable Tensor:

% od -An -v -td1 [file] | tr -ds '\n' ' ' | sed -e 's/^ */[/g' -e 's/ *$/]/g' | tr ' ' ', '
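The same conversion can be sketched in TypeScript: raw bytes in, a bracketed comma-separated list out. The function name bytesToTensorLiteral is made up for illustration, and I'm assuming the bracketed list is the text format the one-liner above is aiming for.

```typescript
// Turns raw bytes into a bracketed, comma-separated list such as "[104,105,10]".
// ASSUMPTION: this is the text format the shell one-liner aims to produce.
function bytesToTensorLiteral(bytes: Uint8Array): string {
  // View the same memory as signed 8-bit values, matching od's -td1
  // (signed one-byte decimal) output.
  const signed = new Int8Array(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return "[" + Array.from(signed).join(",") + "]";
}

// "hi\n" followed by a 0xff byte, which shows up as -1 when read as signed.
console.log(bytesToTensorLiteral(new Uint8Array([104, 105, 10, 255])));
// prints [104,105,10,-1]
```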

Back to our example. Let's say our program ./prog prints both the input and the amount of CPU it used. In a real-world setting we can filter for this information with grep and awk, but let's assume it's the only thing printed for now:

% ./prog 5
5 0.1234

Ideally, before we run the program, we figure out which machine to run ./prog on. It's not too bad if we get it wrong every so often, but it helps if we consistently get it right. This is the typical trade-off you'll face when applying machine learning in the context of programming. It's not magic sauce, it's just a convenient way to avoid spending too much time on heuristics.

So, how do we make this happen? We do what we did with the other example: feed training data into a model of our own design.

Before building our model, we'll collect a number of datapoints and save them into a file.

% seq -f '%1.0f' 10 64 | xargs -L1 ./prog | sort -R > data.txt

Now that we've got our dataset data.txt, we should define a model. In this case, we'll use a multilayer perceptron (MLP):

// learn.ts
import * as sm from '@shumai/shumai'

const l0 = sm.module.linear(1, 64)
const l1 = sm.module.linear(64, 64)
const l2 = sm.module.linear(64, 1)


export default function f(x) {
  x = l0(x).relu()
  x = l1(x).relu()
  return l2(x).sigmoid()
}

export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error

So, what's going on in the code above?

We fan our input out to 64 hidden "neurons", each scaling it by its own learned weight, and then clip out all the values less than zero (setting them to zero). We then do this again with a second layer. This clamping of negative values to zero is extremely important, as it is a non-linear operation that gives the neural network the ability to learn arbitrary functions (given enough neurons). I like to think of it as giving the network an ability to learn if-conditions.

Then, we add up all the values (each scaled by another learned weight) and smoothly squash the result to between 0 and 1 (using sigmoid). We'll be predicting whether the CPU used for a given input is over (1) or under (0) a certain threshold, indicating which machine we should use.
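To see the "if-condition" intuition in action, here's a toy version of the same forward pass in plain TypeScript, shrunk to two hidden neurons. The weights are hand-picked by me rather than learned, and none of this is Shumai code.

```typescript
// relu clamps negatives to zero; sigmoid squashes any number into (0, 1).
const relu = (v: number): number => Math.max(0, v);
const sigmoid = (v: number): number => 1 / (1 + Math.exp(-v));

// A dense layer: out[j] = bias[j] + sum over i of input[i] * weights[i][j].
function linear(input: number[], weights: number[][], bias: number[]): number[] {
  return bias.map((bj, j) =>
    input.reduce((sum, xi, i) => sum + xi * weights[i][j], bj)
  );
}

// Hypothetical hand-picked weights: 1 input -> 2 hidden -> 1 output.
const w0: number[][] = [[1, 1]];
const b0: number[] = [-20, -40];
const w1: number[][] = [[0.5], [-0.5]];
const b1: number[] = [-5];

function forward(x: number): number {
  // hidden[0] switches on once x > 20, hidden[1] once x > 40.
  const hidden = linear([x], w0, b0).map(relu);
  return sigmoid(linear(hidden, w1, b1)[0]);
}

console.log(forward(10), forward(30), forward(64));
```

The output stays near 0 for inputs below 20, ramps up between 20 and 40, and stays near 1 above 40. Each relu acts like a branch that only contributes once its input crosses a threshold.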

And now, with our model and data in hand, we train it:

% yes "$(cat data.txt)" | head -n 10000 | awk '{print $1 "|" $2}'  | shumai train learn.ts

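You might notice that most of this line is already in your bash history, since it's nearly the same command used to train the other model. The only additions are a shorter head count and an awk step that rewrites each "input cpu" line as "input|cpu".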
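Note that the training command feeds the raw CPU measurement in as the target. If you want the 0/1 labels described earlier instead, one option is to threshold each line first. Here's a minimal sketch of that step; the 0.5 cutoff and the name toTrainingLine are made up for illustration and aren't part of Shumai.

```typescript
// ASSUMPTION: a hypothetical cutoff; pick whatever separates runs that
// belong on the slow machine from runs that belong on the fast one.
const cpuThreshold = 0.5;

// Converts a "./prog" output line like "5 0.1234" into "input|label",
// where label is 1 if the CPU used is over the threshold and 0 otherwise.
function toTrainingLine(line: string, threshold: number): string {
  const [input, cpu] = line.trim().split(/\s+/).map(Number);
  const label = cpu > threshold ? 1 : 0;
  return `${input}|${label}`;
}

console.log(toTrainingLine("5 0.1234", cpuThreshold)); // prints 5|0
```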

Querying the model will look similar as well!

% echo 4 | shumai infer learn.ts
0.9044327:Float32

And we're done. 😁 A full model trained on the command-line that can be used anywhere stdin is supported! There are other ways to use (and even train) this model (including HTTP and soon WebSockets), but those are out of scope for this writeup (hint: shumai serve learn.ts and then check out 127.0.0.1:3000/{forward,backward}).
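To close the loop on the load balancing example, the model's output still has to be turned into a decision. Here's a hedged sketch of that last step: it parses output shaped like the "0.9044327:Float32" line above and picks a machine. The 0.5 boundary and the names parseScore and chooseMachine are hypothetical.

```typescript
type Machine = "slow" | "fast";

// Parses output shaped like "0.9044327:Float32" (the format shown above).
function parseScore(output: string): number {
  const score = Number(output.trim().split(":")[0]);
  if (Number.isNaN(score)) {
    throw new Error(`unexpected output: ${output}`);
  }
  return score;
}

// ASSUMPTION: 0.5 as the decision boundary is a hypothetical default.
function chooseMachine(output: string, boundary: number = 0.5): Machine {
  return parseScore(output) > boundary ? "fast" : "slow";
}

console.log(chooseMachine("0.9044327:Float32")); // prints fast
```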

Going Forward

The /dev/stdin API is currently a work in progress. I haven't really documented it much and I'm mostly looking for feedback on the idea itself.

More generally, the Shumai project is about two and a half months old and still experimenting with APIs and ideas. A primary focus of the project is hackability. By ditching conventional Python and using a JIT-compiled language with native async programming (JavaScript/TypeScript), Shumai hopes to open up doors for ideas that can be implemented quickly and directly in the host language rather than built out as a C extension.

If you'd like to learn more or get quick help, please check out our Discord! https://discord.gg/kXxWyMFQ Documentation on the operators and API can be found here: https://facebookresearch.github.io/shumai/