# Machine Learning with Unix Pipes

*by [@bwasti](https://twitter.com/bwasti)*

****

I contribute to an open source programmer-focused machine learning library called [Shumai](https://github.com/facebookresearch/shumai). Recently, we added some basic `/dev/stdin` handling, which makes it possible to compose standard Unix utilities with machine learning on the command-line.

### Pipes `|` Refresher

<center>
<img src="https://i.imgur.com/5KgEoYL.gif" style="display:inline;width:480px;max-width:80%;"/>
</center>

[Unix pipes](https://en.wikipedia.org/wiki/Pipeline_(Unix)) are a surprisingly flexible and easy-to-use mechanism for inter-process communication. For example, we can count how many three-letter words start with "a":

```bash
% cat /usr/share/dict/words | grep -e '^a..$' | wc -l
90
```

90 words! But what's actually happening here?

- The **`|`** operator creates a pipe|ine: chaining the outputs of commands as inputs to the following commands
- **`cat`** (comes from ["con**cat**enate"](https://en.wikipedia.org/wiki/Cat_(Unix))) prints files out, and `/usr/share/dict/words` is a standard wordlist
- **`grep`** filters text line by line, and "`^a..$`" is our [regex](https://en.wikipedia.org/wiki/Regular_expression) for three-letter words starting with "a"
- **`wc`** is word count, and the `-l` flag counts lines instead of words

And that's it!

Another thing to note is that pipes can be extremely quick. They've been around for decades and have been a key tool in the toolbox of *many* people, so they're pretty damn optimized. A good way to measure the performance of a pipe is to use the `pv` utility:

```bash
% yes "hi" | pv > /dev/null
11.3GiB 0:00:02 [5.69GiB/s] [ <=> ]
```

A raw Unix pipe can hit nearly 6GB/s! Flexibility, performance and decades of documentation are super compelling reasons to add support for pipes in Shumai.

### Part 1: Multiply add

Now let's use these pipes for some machine learning! To start we'll "learn" something very simple:

$$
f(x) = m\cdot x + b
$$

To do so, we first express the function in code. Parameters like `m` and `b` are typically randomly initialized, but for now we'll hardcode them to 7 and 3 respectively.

```javascript
// learn.ts
import * as sm from '@shumai/shumai'

const m = sm.scalar(7).requireGrad().checkpoint()
const b = sm.scalar(3).requireGrad().checkpoint()

export default function f(x) {
  return m.mul(x).add(b)
}

export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error
```

This is all functional code, but `checkpoint()` does some fancy things to ensure that repeated invocations of this script will cache the values learned. As a result, you'll see random files like `tensor_2304823423.fl` generated. Delete them to reset training.

To test this model out, we pipe a value into an invocation of `shumai infer`:

```bash
% echo 7 | shumai infer learn.ts
52:Float32
```

$7 \times 7 + 3 = 52$. Great! Since we know the values of $m$ and $b$, this is expected. What we just did is often called *inference* (which is an overloaded term, apologies to Bayesians): predict a result based on an input.

We also want to train this model to work for some data we've collected. Let's say we know an input of `4` should be `19` and `6` should be `25`. Our current parameters don't capture that at all.
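Before we wire that up on the command-line, it's worth demystifying the `backward` and `loss` exports. Here's a rough sketch, in plain JavaScript rather than the Shumai API, of what a single training step on this model boils down to (the learning rate here is invented):

```javascript
// Conceptual sketch only -- plain numbers, not Shumai's actual implementation.
// One gradient-descent step on f(x) = m*x + b with a squared-error loss.
let m = 7
let b = 3
const lr = 0.01 // hypothetical learning rate

function trainStep(x, target) {
  const pred = m * x + b
  const loss = (pred - target) ** 2 // squared error for this one sample
  const grad = 2 * (pred - target)  // d(loss)/d(pred)
  m -= lr * grad * x                // d(pred)/dm = x
  b -= lr * grad                    // d(pred)/db = 1
  return loss
}

console.log(trainStep(4, 19)) // 144: (7*4 + 3 - 19)^2
```

Run that over and over with the pairs we care about and `m` and `b` drift toward values that fit the data. Roughly speaking, that's all the training below does, with Shumai computing the gradients for us (hence `requireGrad()`), and it's why the very first loss printed will be just about 144.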
Training involves the creation of input/output pairs, and this can be done on the command-line too. All we have to do is place a `|` separator between them and use `shumai train`:

```bash
% echo '4 | 19' | shumai train learn.ts
143.99899291992188
% echo '6 | 25' | shumai train learn.ts
376.3590393066406
```

The [*loss*](https://en.wikipedia.org/wiki/Loss_function) is printed out after each run. Loss is basically the cost of getting things wrong. Since we're using mean squared error to measure loss, huge numbers like this mean we haven't learned much yet.

So let's train it for longer! First we make a dataset:

```bash
% echo '4 | 19' >> data.txt
% echo '6 | 25' >> data.txt
% cat data.txt
4 | 19
6 | 25
```

And then we train with it for 100,000 steps:

```bash
% yes "$(cat data.txt)" | head -n 100000 | shumai train learn.ts
0.11650670319795609
```

- The **`yes`** command repeats its input infinitely. It was a cheeky program used to quickly (and dirtily) accept the terms of installation scripts, which often prompted things like `[Y/n]?`
- **`$(...)`** runs the command inside it.
- **`head -n 100000`** chops the infinite output of `yes` down to only 100k lines.

So, by piping this into `shumai train learn.ts`, we've trained the model for 100k steps. To test it out, we use `shumai infer` yet again:

```bash
% echo 4 | shumai infer learn.ts
18.49370574951172:Float32
```

Woo! Pretty close.

### Part 2: Actually Useful Stuff (Heterogeneous Load Balancing)

When would machine learning on the command-line of all places actually be useful? I use the command-line to get things done, not play with images and text generation!

Here's an example: we have a program that uses different amounts of CPU based on its input. We've got two machines, a slow one (cheap) and a fast one (expensive). Our task is to run the program on the fast machine only when it makes sense.

*For the sake of this writeup, we'll pretend the input is a single integer. More realistically the input could be a much longer array of bytes, and the ideas shown below would still apply. For ingesting arbitrary binary, here's how I might convert it to a Shumai-readable Tensor (a JavaScript version of the same conversion is sketched after the breakdown below):*

```bash
% od -td1 -An [file] | tr -ds '\n' ' ' | sed -e 's/^ */[/g' -e 's/ *$/]/g' | tr ' ' ','
```

- **`od -td1`** dumps the file's bytes as base-10 numbers, and `-An` removes the byte offsets
- **`tr -ds '\n' ' '`** deletes the newlines and squashes repeated spaces
- **`sed -e 's/^ */[/g' -e 's/ *$/]/g'`** removes the leading and trailing spaces and wraps the output in `[` and `]` (how Shumai takes in Tensors)
- The final **`tr ' ' ','`** comma-separates everything
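In case you'd rather do that conversion in code than in a shell pipeline, here's a rough JavaScript equivalent (the file name is hypothetical, and it prints unsigned bytes where `od -td1` prints signed ones):

```javascript
// to-tensor.ts (hypothetical name) -- read arbitrary bytes from stdin and print
// them in the bracketed, comma-separated form Shumai reads, e.g. [104,101,108].
import { readFileSync } from 'fs'

const bytes = readFileSync(0) // file descriptor 0 is stdin
console.log('[' + Array.from(bytes).join(',') + ']')
```

Something like `cat somefile | bun to-tensor.ts` then produces text you can pipe wherever a Tensor is expected on `stdin`.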
Back to our example. Let's say our program `./prog` prints both the input and the amount of CPU it used. In a real-world setting we can filter for this information with `grep` and `awk`, but let's assume it's the *only* thing printed for now:

```bash
% ./prog 5
5 0.1234
```

Ideally, *before* we run the program, we figure out which machine to run `./prog` on. It's not too bad if we get it wrong every so often, but it helps if we consistently get it right. This is the typical trade-off you'll face when applying machine learning in the context of programming. It's not magic sauce, it's just a convenient way to avoid spending too much time on heuristics.

So, how do we make this happen? We do what we did with the other example: feed training data into a model of our own design.

Before building our model, we'll collect a number of datapoints and save them into a file:

```bash
% seq -f '%1.0f' 10 64 | xargs -L1 ./prog | sort -R > data.txt
```

- **`seq`** prints out the numbers from 10 to 64, formatted as floating point with no decimals
- **`xargs -L1`** turns every line of the `seq` output into an invocation of `./prog` with that line as the argument. These two commands combined are really useful for collecting data over sweeps of numeric inputs.
- **`sort -R`** randomly shuffles the data

Now that we've got our dataset `data.txt`, we should define a model. In this case, we'll use a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP):

```javascript
// learn.ts
import * as sm from '@shumai/shumai'

const l0 = sm.module.linear(1, 64)
const l1 = sm.module.linear(64, 64)
const l2 = sm.module.linear(64, 1)

export default function f(x) {
  x = l0(x).relu()
  x = l1(x).relu()
  x = l2(x).sigmoid()
  return x
}

export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error
```

So, what's going on in the code above? We multiply our input into 64 hidden "neurons" by a learned weight and then clip out all the values less than zero (setting them to zero). We then do this again. The process of clamping negative values to zero is extremely important, as it is a non-linear operation that gives the neural network the [ability to learn arbitrary functions](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (given the right number of neurons). I like to think of it as giving the network the ability to learn if-conditions.

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/4/42/ReLU_and_GELU.svg/1920px-ReLU_and_GELU.svg.png" style="display:inline;width:480px;max-width:80%;"/>
</center>

Then, we add up all the values (again by a learned weight) and smoothly clamp them between 0 and 1 (using [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function)). We'll be predicting whether the CPU used for a given input is over (1) or under (0) a certain threshold, indicating which machine we should use.

<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/88/Logistic-curve.svg/1200px-Logistic-curve.svg.png" style="display:inline;width:480px;max-width:80%;"/>
</center>

And now, with our model and data in hand, we train it:

```bash
% yes "$(cat data.txt)" | head -n 10000 | awk '{print $1 "|" $2}' | shumai train learn.ts
```

You might notice that this is nearly the same command already sitting in your bash history from training the other model; we've just added an `awk` to turn the `./prog` output into `input|output` pairs. Querying the model will look similar as well!

```bash
% echo 4 | shumai infer learn.ts
0.9044327:Float32
```
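To actually act on that prediction, all we need is a little glue around `shumai infer`. Here's a hypothetical dispatcher, not part of Shumai, with made-up hostnames and an arbitrary 0.5 cutoff; a few lines of shell would work just as well:

```javascript
// dispatch.ts (hypothetical) -- glue around the `shumai infer` call above:
// ask the model how CPU-hungry this input looks, then run ./prog on the
// matching machine. Hostnames and the 0.5 cutoff are made up.
import { execSync } from 'child_process'

const input = process.argv[2]

// `shumai infer` prints something like "0.9044327:Float32"
const out = execSync(`echo ${input} | shumai infer learn.ts`).toString()
const score = parseFloat(out.split(':')[0])

// High predicted usage -> the fast (expensive) machine; otherwise the cheap one.
const host = score > 0.5 ? 'fast-machine' : 'slow-machine'
execSync(`ssh ${host} ./prog ${input}`, { stdio: 'inherit' })
```

Something like `bun dispatch.ts 42` would then route each job to the right machine without us hand-writing a heuristic.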
And we're done. 😁 A full model trained on the command-line that can be used anywhere `stdin` is supported! There are other ways to use (and even train) this model (including HTTP and soon WebSockets), but those are out of scope for this writeup (hint: `shumai serve learn.ts` and then check out `127.0.0.1:3000/{forward,backward}`).

### Going Forward

The `/dev/stdin` API is currently a work in progress. I haven't really documented it much and I'm mostly looking for feedback on the idea itself.

More generally, the [Shumai](https://github.com/facebookresearch/shumai) project is about two and a half months old and still experimenting with APIs and ideas. A primary focus of the project is hackability. By ditching conventional Python and using a JIT-compiled language with native async programming (JavaScript/TypeScript), Shumai hopes to open up doors for ideas that can be implemented quickly and directly in the host language rather than built out as a C extension.

If you'd like to learn more or get quick help, please check out our Discord! https://discord.gg/kXxWyMFQ

Documentation on the operators and API can be found here: https://facebookresearch.github.io/shumai/