# Machine Learning with Unix Pipes

****

I contribute to an open source programmer-focused machine learning library called Shumai.
Recently, we added some basic /dev/stdin handling,
which makes it possible to compose standard Unix
utilities with machine learning on the command-line.

### Pipes | Refresher

<center>
<img src="https://i.imgur.com/5KgEoYL.gif" style="display:inline;width:480px;max-width:80%;"/>
</center>

[Unix pipes](https://en.wikipedia.org/wiki/Pipeline_(Unix)) are a surprisingly flexible and easy to use mechanism for inter-process communication.
For example, we can count how many three letter words start with "a":

```bash
% cat /usr/share/dict/words | grep -e '^a..$' | wc -l
90
```

90 words! But what's actually happening here?

- The **|** operator creates a pipe|ine: chaining the outputs of commands as inputs to the following commands
- **cat** (comes from ["con**cat**enate"](https://en.wikipedia.org/wiki/Cat_(Unix))) prints files out and /usr/share/dict/words is a standard wordlist
- **grep** filters text line by line and "^a..$" is our [regex](https://en.wikipedia.org/wiki/Regular_expression) for getting three letter words starting with "a"
- **wc** is wordcount and the -l flag counts lines instead of words

And that's it!
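If your machine doesn't ship /usr/share/dict/words, the same pipeline can be tried on a tiny inline word list. This is only a sketch: the seven words below are made up for illustration, and `printf` stands in for `cat`.

```bash
# A made-up word list standing in for /usr/share/dict/words.
words='ant
apple
ace
bat
art
a
axe'

# Same filter and count as above; tr strips the padding some wc builds add.
count=$(printf '%s\n' "$words" | grep -e '^a..$' | wc -l | tr -d ' ')
echo "$count"
```

Only the three letter words starting with "a" survive the grep stage, so the count covers ant, ace, art and axe.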

Another thing to note is that pipes can be extremely quick.
They've been around for decades and have been a key tool in the toolbox of *many* people,
so they're pretty damn optimized.
A good way to measure the performance of a pipe is to use the pv utility:

```bash
% yes "hi" | pv > /dev/null
11.3GiB 0:00:02 [5.69GiB/s] [          <=>         ]
```


A raw Unix pipe can hit nearly 6GB/s!
Flexibility, performance and decades of documentation
are super compelling reasons to add
pipe support to Shumai.

Now let's use these pipes for some machine learning!
To start we'll "learn" something very simple:

$$f(x) = m\cdot x + b$$

To do so, we first express the function in code.
Parameters like m and b are typically randomly initialized,
but for now we'll hardcode them to 7 and 3 respectively.

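```javascript
// learn.ts
import * as sm from '@shumai/shumai'

// Assumed reconstruction from the surrounding text: m and b start at 7 and 3,
// and checkpoint() caches the learned values between runs.
const m = sm.scalar(7).requireGrad().checkpoint()
const b = sm.scalar(3).requireGrad().checkpoint()

export default function f(x) {
  return m.mul(x).add(b)
}

export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error
```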


This is all functional code, but checkpoint() does some fancy
things to ensure that repeated invocations of this script
will cache the values learned.
As a result,
you'll see random files like tensor_2304823423.fl generated.
Delete them to reset training.

To test this model out, we pipe a value into an invocation of shumai infer:

```bash
% echo 7 | shumai infer learn.ts
52:Float32
```


$7 \times 7 + 3 = 52$. Great!
Since we know the values of $m$ and $b$, this is expected.
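As a sanity check that doesn't involve Shumai at all, the same arithmetic can be piped through awk. This is just a stand-in for the hardcoded $m = 7$ and $b = 3$, not a description of how `shumai infer` works internally.

```bash
# f(x) = 7 * x + 3, reading x from stdin like the model does.
f() {
  awk '{ print 7 * $1 + 3 }'
}

result=$(echo 7 | f)
echo "$result"
```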

What we just did is often called *inference* (which is an overloaded term,
apologies to Bayesians): predict a result based on an input.

We also want to train this model to work for some data we've collected.
Let's say we know an input of 4 should give 19 and an input of 6 should give 25.
Our current parameters don't capture that at all.

Training involves the creation of input/output pairs
and this can be done on the command-line too.
All we have to do is place a | separator between them
and use shumai train.

```bash
% echo '4 | 19' | shumai train learn.ts
143.99899291992188
% echo '6 | 25' | shumai train learn.ts
376.3590393066406
```


The [*loss*](https://en.wikipedia.org/wiki/Loss_function) is printed out after each run.
Loss is basically the cost of getting things wrong.
Since we're using mean squared error to measure loss,
huge numbers like this mean we haven't learned much yet.
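As a quick check on the first number printed above: with the hardcoded $m = 7$ and $b = 3$ and a single pair $(4, 19)$, the mean squared error reduces to one squared difference.

$$\text{loss} = (f(4) - 19)^2 = (7 \cdot 4 + 3 - 19)^2 = 12^2 = 144$$

That agrees with the printed 143.998... to within about 0.001. The second value depends on how the first training step changed the parameters, so it can't be checked the same way without knowing the learning rate.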

So let's train it for longer!
First we make a dataset:

```bash
% echo '4 | 19' >> data.txt
% echo '6 | 25' >> data.txt
% cat data.txt
4 | 19
6 | 25
```


And then we train with it for 100,000 steps:

```bash
% yes "$(cat data.txt)" | head -n 100000 | shumai train learn.ts
0.11650670319795609
```

- The **yes** command repeats the input infinitely. It was a cheeky program used to quickly (and dirtily) accept the terms of installation scripts, which often prompted things like [Y/n]?
- **$(...)** runs the command inside it.
- **head -n 100000** will chop the infinite output of yes to be only 100k lines long.

So, by piping this into shumai train learn.ts,
we've trained a model for 100k steps.
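To get a feel for what each training step does, here is a rough awk sketch of stochastic gradient descent on the same two datapoints. It is not Shumai's implementation: the function name `sgd_line`, the learning rate of 0.01 and the starting values are assumptions made for illustration, and `seq` plus awk generate the repeated lines in place of `yes` and `head`.

```bash
# sgd_line: fit y = m*x + b to "x | y" lines from stdin, one SGD step per line.
sgd_line() {
  awk -F'|' -v m=7 -v b=3 -v lr=0.01 '
    {
      x = $1 + 0; y = $2 + 0     # force numeric values
      err = (m * x + b) - y      # prediction minus target
      m -= lr * 2 * err * x      # gradient of err^2 with respect to m
      b -= lr * 2 * err          # gradient of err^2 with respect to b
    }
    END { printf "%.4f %.4f\n", m, b }'
}

# 50,000 copies of the two datapoints, i.e. 100k training lines.
fit=$(seq 50000 | awk '{ print "4 | 19"; print "6 | 25" }' | sgd_line)
echo "$fit"
```

With these settings the sketch settles on $m = 3$ and $b = 7$, the line through both points ($3 \cdot 4 + 7 = 19$ and $3 \cdot 6 + 7 = 25$).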

To test it out, we use shumai infer yet again:

```bash
% echo 4 | shumai infer learn.ts
18.49370574951172:Float32
```


Woo!  Pretty close.

### Part 2: Actually Useful Stuff (Heterogeneous Load Balancing)

When would machine learning on the command-line
of all places actually be useful?
I use the command-line to get things done, not play with
images and text generation!

Here's an example: we have a program that uses different amounts of
CPU based on its input.
We've got two machines, a slow one (cheap) and a fast one (expensive).
Our task is to run the program on the
fast machine only when it makes sense.

*For the sake of this writeup, we'll pretend the input is a single integer.
More realistically the input could be a much longer array of bytes and the ideas shown below would still apply.
For ingesting arbitrary binary, here's how I might convert it to a Shumai-readable Tensor:*

```bash
% hexdump -C [file] | od -td1 -An | tr -ds '\n' ' ' | sed -e 's/^ */[/g' -e 's/ *$/]/g' | tr ' ' ', '
```

- **hexdump -C** will print a hex dump of your file
- **od -td1** will convert that to base 10, and -An will remove byte offsets.
- **tr -ds** will delete the \n characters and squash repeated spaces
- **sed -e 's/^ */[/g' -e 's/ *$/]/g'** will remove the leading and trailing spaces as well as wrap up the output with [ and ] (how Shumai takes in Tensors)
- The final **tr ' ' ', '** comma separates everything
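One thing to watch for in the pipeline above: `od` reads the text that `hexdump` prints, not the original file. If the goal is the raw byte values themselves, a possible variant is to let `od` read the file directly. The sketch below assumes unsigned bytes are wanted and that a bracketed, comma-separated list is an acceptable input format; `bytes_to_list` is a hypothetical helper name.

```bash
# bytes_to_list FILE: print the bytes of FILE as a list like [1, 2, 3].
bytes_to_list() {
  od -An -v -tu1 "$1" |
    tr '\n' ' ' |            # join od's output lines
    tr -s ' ' |              # squash runs of spaces
    sed -e 's/^ */[/' -e 's/ *$/]/' -e 's/ /, /g'
}

tmpfile=$(mktemp)
printf 'hi\n' > "$tmpfile"
list=$(bytes_to_list "$tmpfile")
echo "$list"
rm -f "$tmpfile"
```

The three bytes here are "h" (104), "i" (105) and the trailing newline (10).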

Back to our example.
Let's say our program ./prog prints both the input and the amount
of CPU it used.  In a real-world setting we can filter for this information
with grep and awk, but let's assume it's the *only* thing printed
for now:

```bash
% ./prog 5
5 0.1234
```
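For a noisier program, the filtering step mentioned above might look something like this. The log format and the `prog_verbose` function are invented for illustration; a real program's output would need its own pattern.

```bash
# Hypothetical stand-in for a ./prog that logs more than we need.
prog_verbose() {
  echo "starting up"
  echo "result: input=$1 cpu=0.1234"
  echo "done"
}

# Keep only the result line, then strip the key= prefixes from each field.
row=$(prog_verbose 5 | grep '^result:' | awk '{ sub(/^input=/, "", $2); sub(/^cpu=/, "", $3); print $2, $3 }')
echo "$row"
```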


Ideally, *before* we run the program, we figure out which machine to
run ./prog on.  It's not too bad if we get it wrong every so often, but it helps
if we consistently get it right.
This is the typical trade-off you'll face when applying machine learning
in the context of programming.  It's not magic sauce; it's just a convenient
way to avoid spending too much time on heuristics.

So, how do we make this happen?
We do what we did with the other example: feed
training data into a model of our own design.

Before building our model, we'll collect a number of datapoints and save them into a file.

```bash
% seq -f '%1.0f' 10 64 | xargs -L1 ./prog | sort -R > data.txt
```


- **seq** above prints out numbers from 10 to 64,
and it formats them as floating point with no decimal.
- **xargs -L1** converts every line of the seq output into
an invocation of ./prog with that line as the argument.
These commands combined are really useful for collecting data
over sweeps of numeric inputs.
- **sort -R** randomly shuffles the data.
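To try the sweep without a real ./prog, a throwaway script can stand in for it. Everything about the stand-in below is made up (the file names and the quadratic CPU cost in particular); only the `seq | xargs | sort -R` shape comes from the command above.

```bash
workdir=$(mktemp -d)

# Hypothetical stand-in for ./prog: prints the input and a made-up CPU cost.
cat > "$workdir/prog.sh" <<'EOF'
awk -v x="$1" 'BEGIN { printf "%d %.4f\n", x, x * x / 4096 }'
EOF

# Sweep 10..64, run the stand-in once per line, then shuffle the rows.
seq -f '%1.0f' 10 64 | xargs -L1 sh "$workdir/prog.sh" | sort -R > "$workdir/data.txt"

rows=$(wc -l < "$workdir/data.txt" | tr -d ' ')
echo "$rows"
```

The sweep covers 55 inputs, so data.txt ends up with 55 shuffled rows of the form `input cpu`.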

Now that we've got our dataset data.txt, we should define a model.
In this case, we'll use a [multilayer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) (MLP):

```javascript
// learn.ts
import * as sm from '@shumai/shumai'

const l0 = sm.module.linear(1, 64)
const l1 = sm.module.linear(64, 64)
const l2 = sm.module.linear(64, 1)

export default function f(x) {
  x = l0(x).relu()
  x = l1(x).relu()
  x = l2(x).sigmoid()
  return x
}

export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error
```

So, what's going on in the code above?

We multiply our input by learned weights to produce 64 hidden "neurons"
and then clip out all the values less than zero (setting them to zero).
We then do this again.
The process of clamping negative values to zero is extremely
important, as it is a non-linear operation
that gives the neural network the [ability to learn
arbitrary functions](https://en.wikipedia.org/wiki/Universal_approximation_theorem) (given the right number of neurons).
I like to think of it as giving the network
an ability to learn if-conditions.

Then, we add up all the values (again scaled by learned weights)
and smoothly clamp them between 0 and 1 (using [sigmoid](https://en.wikipedia.org/wiki/Sigmoid_function)).
We'll be predicting if the CPU used for a given input is
over (1) or under (0) a certain threshold, indicating which machine
we should use.
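The two non-linearities are simple enough to compute by hand. As a small illustration outside of Shumai, the awk snippet below applies both to a few arbitrary sample inputs; it mirrors the math, not the library's implementation.

```bash
# For each input print: value, relu(value), sigmoid(value).
activations=$(printf '%s\n' -2 0 3 | awk '{
  relu = ($1 < 0) ? 0 : $1
  sigmoid = 1 / (1 + exp(-$1))
  printf "%d %d %.4f\n", $1, relu, sigmoid
}')
echo "$activations"
```

Negative inputs come out of relu as 0 and positive ones pass through unchanged, while sigmoid maps 0 to exactly 0.5 and pushes larger inputs towards 1.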

And now, with our model and data in hand, we train it:

```bash
% yes "$(cat data.txt)" | head -n 10000 | awk '{print $1 " | " $2}' | shumai train learn.ts
```

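You might notice that most of this line is already in your bash history, since it's nearly the same command
used to train the other model. The only changes are a smaller step count and an awk stage that rewrites each row into the input | output format.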
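Since the model ends in a sigmoid and is meant to predict over (1) or under (0) a threshold, the raw CPU numbers in data.txt would presumably need to be turned into 0/1 labels at some point. One way to do that in the same pipe, assuming a made-up threshold of 0.5 and a hypothetical helper called `label_rows`:

```bash
# label_rows THRESHOLD: turn "input cpu" rows into "input | label" rows.
label_rows() {
  awk -v t="$1" '{ print $1 " | " (($2 + 0 > t + 0) ? 1 : 0) }'
}

labeled=$(printf '10 0.02\n40 0.39\n64 1.00\n' | label_rows 0.5)
echo "$labeled"
```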

Querying the model will look similar as well!

```bash
% echo 4 | shumai infer learn.ts
0.9044327:Float32
```
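To close the loop on the load-balancing example, the prediction can be turned into a decision with a little more plumbing. This sketch assumes the output always has the `value:Float32` shape shown above and uses a cutoff of 0.5; `pick_machine` and the labels `fast` and `slow` are placeholders.

```bash
# pick_machine: read a "value:Float32" prediction from stdin, print a machine label.
pick_machine() {
  cut -d: -f1 | awk '{ if ($1 + 0 > 0.5) print "fast"; else print "slow" }'
}

choice=$(echo '0.9044327:Float32' | pick_machine)
echo "$choice"
```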

And we're done. 😁
A full model trained on the command-line that can be used anywhere stdin is supported!
There are other ways to use (and even train) this model (including HTTP and soon WebSockets), but those
are out of scope for this writeup (hint: shumai serve learn.ts and then check out 127.0.0.1:3000/{forward,backward}).

### Going Forward

The /dev/stdin API is currently a work in progress.  I haven't really documented it much and I'm mostly
looking for feedback on the idea itself.