by @bwasti
I contribute to an open source programmer-focused machine learning library
called Shumai.
Recently, we added some basic /dev/stdin handling, which makes it possible to compose standard Unix utilities with machine learning on the command-line.
Refresher: Unix pipes are a surprisingly flexible and easy-to-use mechanism for inter-process communication. For example, we can count how many three-letter words start with "a":
% cat /usr/share/dict/words | grep -e '^a..$' | wc -l
90
90 words! But what's actually happening here?
- The | operator creates a pipe|ine, chaining the outputs of commands as inputs to the following commands.
- cat (comes from "concatenate") prints files out, and /usr/share/dict/words is a standard wordlist.
- grep filters text line by line, and "^a..$" is our regex for getting three-letter words starting with "a".
- wc is word count, and the -l flag counts lines instead of words.

And that's it!
Another thing to note is that pipes can be extremely quick.
They've been around for decades and have been a key tool in the toolbox of many people,
so they're pretty damn optimized.
A good way to measure the performance of a pipe is to use the pv utility:
% yes "hi" | pv > /dev/null
11.3GiB 0:00:02 [5.69GiB/s] [ <=> ]
A raw Unix pipe can hit nearly 6GB/s! Flexibility, performance and decades of documentation are super compelling reasons to add support in Shumai.
Now let's use these pipes for some machine learning! To start, we'll "learn" something very simple:
f(x)=m⋅x+b
To do so, we first express the function in code. Parameters like m and b are typically randomly initialized, but for now we'll hardcode them to 7 and 3, respectively.
// learn.ts
import * as sm from '@shumai/shumai'
const m = sm.scalar(7).requireGrad().checkpoint()
const b = sm.scalar(3).requireGrad().checkpoint()
export default function f(x) {
  return m.mul(x).add(b)
}
export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error
This is all functional code, but checkpoint() does some fancy things to ensure that repeated invocations of this script will cache the learned values. As a result, you'll see randomly named files like tensor_2304823423.fl generated. Delete them to reset training.
To test this model out, we pipe a value into an invocation of shumai infer:
% echo 7 | shumai infer learn.ts
52:Float32
7×7+3=52. Great! Since we know the values of m and b, this is expected.
What we just did is often called inference (which is an overloaded term, apologies to Bayesians): predict a result based on an input.
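Since learn.ts just exports a plain function, you can also skip the CLI and call the model from another script. Here's a minimal sketch (the file name infer_direct.ts is made up, and it assumes the Tensor's toFloat32() accessor for reading the result back as a plain number):
// infer_direct.ts
import * as sm from '@shumai/shumai'
import f from './learn'

// the same computation the pipe performed: f(7) = 7×7+3
const y = f(sm.scalar(7))
console.log(y.toFloat32()) // 52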
We also want to train this model to work for some data we've collected. Let's say we know an input of 4 should be 19 and an input of 6 should be 25. Our current parameters don't capture that at all: right now, f(4) = 31 and f(6) = 45.
Training involves the creation of input/output pairs, and this can be done on the command-line too. All we have to do is place a | separator between them and use shumai train.
% echo '4 | 19' | shumai train learn.ts
143.99899291992188
% echo '6 | 25' | shumai train learn.ts
376.3590393066406
The loss is printed out after each run. Loss is basically the cost of getting things wrong. Since we're using mean squared error to measure loss, huge numbers like this mean we haven't learned much yet.
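As a sanity check on that first number: with the hardcoded parameters, f(4) = 7×4+3 = 31, and the squared error against the target is (31−19)² = 144, which is the ~143.999 printed above. Stochastic gradient descent then nudges m and b against the gradient of that loss (for a single pair (x, y), ∂L/∂m = 2⋅(f(x)−y)⋅x and ∂L/∂b = 2⋅(f(x)−y)), and checkpoint() persists the updated values between runs. That's why the second run prints 376.36 rather than the untrained (45−25)² = 400.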
So let's train it for longer! First we make a dataset:
% echo '4 | 19' >> data.txt
% echo '6 | 25' >> data.txt
% cat data.txt
4 | 19
6 | 25
And then we train with it for 100,000 steps:
% yes "$(cat data.txt)" | head -n 100000 | shumai train learn.ts
0.11650670319795609
- The yes command repeats the input infinitely. It was a cheeky program used to quickly (and dirtily) accept the terms of installation scripts, which often prompted things like [Y/n]?
- $(...) runs the command inside it.
- head -n 100000 will chop the infinite output of yes to be only 100k lines long.

So, by piping this into shumai train learn.ts, we've trained a model for 100k steps.
To test it out, we use shumai infer yet again:
% echo 4 | shumai infer learn.ts
18.49370574951172:Float32
Woo! Pretty close.
When would machine learning on the command-line of all places actually be useful? I use the command-line to get things done, not play with images and text generation!
Here's an example: we have a program that uses different amounts of CPU based on its input. We've got two machines, a slow one (cheap) and a fast one (expensive). Our task is to run the program on the fast machine only when it makes sense.
For the sake of this writeup, we'll pretend the input is a single integer. More realistically, the input could be a much longer array of bytes, and the ideas shown below would still apply. For ingesting arbitrary binary, here's how I might convert it to a Shumai-readable Tensor:
% od -td1 -An [file] | tr -ds '\n' ' ' | sed -e 's/^ */[/g' -e 's/ *$/]/g' | tr ' ' ', '

- od -td1 prints the file's bytes in base 10, and -An removes the byte offsets.
- tr -ds '\n' ' ' deletes the newlines, squashing repeated spaces.
- sed -e 's/^ */[/g' -e 's/ *$/]/g' removes the leading and trailing spaces as well as wrapping the output with [ and ] (how Shumai takes in Tensors).
- tr ' ' ', ' comma-separates everything.
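If you'd rather not memorize that incantation, the same conversion is a few lines of TypeScript. A minimal sketch (bytes_to_tensor.ts is a hypothetical helper; it just builds the bracketed, comma-separated string shown above):
// bytes_to_tensor.ts
import { readFileSync } from 'fs'

const bytes = readFileSync(process.argv[2])
// Int8Array mirrors od's signed one-byte decimal (-td1) view of the bytes
const signed = Array.from(new Int8Array(bytes))
console.log('[' + signed.join(', ') + ']')

Running bun bytes_to_tensor.ts [file] prints the same bracketed list the pipeline produces.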
Back to our example. Let's say our program ./prog prints both the input and the amount of CPU it used. In a real-world setting we can filter for this information with grep and awk, but let's assume it's the only thing printed for now:
% ./prog 5
5 0.1234
Ideally, before we run the program, we figure out which machine to run ./prog on. It's not too bad if we get it wrong every so often, but it helps if we consistently get it right.
This is the typical trade-off you'll face when applying machine learning
in the context of programming. It's not magic sauce, it's just a convenient
way to avoid spending too much time on heuristics.
So, how do we make this happen? We do what we did with the other example: feed training data into a model of our own design.
Before building our model, we'll collect a number of datapoints and save them into a file.
% seq -f '%1.0f' 10 64 | xargs -L1 ./prog | sort -R > data.txt
- seq prints out numbers from 10 to 64, formatting them as floating point with no decimals.
- xargs -L1 converts every line of the seq output into an invocation of ./prog with that line as the argument. These commands combined are really useful for collecting data over sweeps of numeric inputs.
- sort -R randomly shuffles the data.

Now that we've got our dataset data.txt, we should define a model.
In this case, we'll use a multilayer perceptron (MLP):
// learn.ts
import * as sm from '@shumai/shumai'
const l0 = sm.module.linear(1, 64)
const l1 = sm.module.linear(64, 64)
const l2 = sm.module.linear(64, 1)
export default function f(x) {
  x = l0(x).relu()
  x = l1(x).relu()
  x = l2(x).sigmoid()
  return x
}
export const backward = sm.optim.sgd // stochastic gradient descent
export const loss = sm.loss.mse // mean squared error
So, what's going on in the code above?
We multiply our input into 64 hidden "neurons" by a learned weight and then clip out all the values less than zero (setting them to zero). We then do this again. The process of clamping negative values to zero is extremely important, as it is a non-linear operation that gives the neural network the ability to learn arbitrary functions (given the right number of neurons). I like to think of it as giving the network an ability to learn if-conditions.
Then, we add up all the values (again by a learned weight) and smoothly clamp them between 0 and 1 (using sigmoid). We'll be predicting if the CPU used for a given input is over (1) or under (0) a certain threshold, indicating which machine we should use.
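To make the "if-condition" intuition concrete, here's what the two non-linearities do to a handful of values. A minimal sketch (it assumes sm.tensor accepts a Float32Array and that toFloat32Array() reads the values back out, per the Shumai docs):
// activations.ts
import * as sm from '@shumai/shumai'

const t = sm.tensor(new Float32Array([-2, -0.5, 0, 1, 3]))
console.log(t.relu().toFloat32Array())    // negatives clamped to 0: [0, 0, 0, 1, 3]
console.log(t.sigmoid().toFloat32Array()) // everything squashed smoothly into (0, 1)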
And now, with our model and data in hand, we train it:
% yes "$(cat data.txt)" | head -n 10000 | awk '{print $1 "|" $2}' | shumai train learn.ts
You might notice that most of this line is already in your bash history: it's nearly the same command used to train the other model, with an awk stage added to rewrite each space-separated line of data.txt into the "input | output" format shumai train expects.
Querying the model will look similar as well!
% echo 4 | shumai infer learn.ts
0.9044327:Float32
And we're done. 😁 A score near 1 suggests this input is over our CPU threshold, so we'd route it to the fast machine. We now have a full model, trained on the command-line, that can be used anywhere stdin is supported!
There are other ways to use (and even train) this model (including HTTP and soon WebSockets), but those are out of scope for this writeup (hint: shumai serve learn.ts and then check out 127.0.0.1:3000/{forward,backward}).
The /dev/stdin API is currently a work in progress. I haven't really documented it much, and I'm mostly looking for feedback on the idea itself.
More generally, the Shumai project is about two and a half months old and still experimenting with APIs and ideas. A primary focus of the project is hackability. By ditching conventional Python and using a JIT-compiled language with native async programming (JavaScript/TypeScript), Shumai hopes to open up doors for ideas that can be implemented quickly and directly in the host language rather than built out as a C extension.
If you'd like to learn more or get quick help, please check out our Discord: https://discord.gg/kXxWyMFQ Documentation on the operators and API can be found here: https://facebookresearch.github.io/shumai/