<center>

# ~ julia ~

</center>

[Julia](https://docs.julialang.org/en/v1/) is a high-level dynamic language designed for high performance
numerical computation with Lisp-like meta-programming facilities.  It’s great.

The path from implementing high-level ideas in Julia down to inspecting and optimizing
the generated code is extremely pleasant.
The language is expressive, the tools are simple,
the documentation and online support is comprehensive,
and the ecosystem of projects built with Julia is well developed for such a young language.

To get a taste, let’s walk through using Julia to implement a simple idea:
element-wise multiplication of quantized arrays.


```julia
struct QArray{Prim}
    array::Array{Prim, 1}
    scale::Float32
end
```


Those four lines define a parametric struct in Julia.
The constructor is implicit in the order of the members (`array` and `scale`)
and we’re using two built-in types (`Array` and `Float32`).
The Julia type system is smart enough to let us parameterize the
type stored within the array (arbitrarily named `Prim`)
yet give the array a fixed rank of 1.
I doubt the above code is going to win over any hearts and minds,
but it’s painless enough to express some pretty complex ideas.
We can inspect the type a bit by jumping into a Julia repl with `julia -i quant.jl`
and calling the `dump` function.


```
julia> dump(QArray)
UnionAll
  var: TypeVar
    name: Symbol Prim
    lb: Core.TypeofBottom Union{}
    ub: Any
  body: QArray{Prim} <: Any
    array::Array{Prim,1}
    scale::Float32
```
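As a quick aside (a hypothetical snippet, assuming the `QArray` definition above is loaded): the implicit constructor takes the fields in declaration order, and `Prim` is inferred from the array argument, so we never have to spell out the type parameter ourselves.

```julia
# Prim is inferred as Int8 from the element type of the first argument.
q = QArray(Int8[1, 2, 3], 0.5f0)

typeof(q)  # QArray{Int8}
q.scale    # 0.5f0
```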


Although we can already see that we’re playing with a pretty clever system,
let’s extract some useful functionality out of it.


```julia
function mul(a::QArray{Int8}, b::QArray{Int8})::QArray{Int16}
    c = x -> convert(Int16, x)
    QArray{Int16}(map(c, a.array) .* map(c, b.array), a.scale * b.scale)
end
```


Above is an implementation of element-wise multiplication,
taking inputs of type `QArray{Int8}` and outputting `QArray{Int16}`.
We map an anonymous function `c`
(which just converts each `Int8` into an `Int16` for the sake of accumulation)
onto the elements in the two inputs and then multiply
(the `.` in `.*` makes this element-wise).
The last value in the function,
in this case a freshly constructed `QArray{Int16}`, is returned.
All we’ve used are built-ins and the code is still pretty damn clean.
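To make the widening concrete, here’s a hypothetical spot check (same repl session, with `quant.jl` loaded); note that `100 * 100` would overflow an `Int8` but fits comfortably in an `Int16`:

```julia
a = QArray(Int8[100, -3], 0.5f0)
b = QArray(Int8[100, 5], 2.0f0)
c = mul(a, b)

c.array  # Int16[10000, -15]: the products accumulate in Int16
c.scale  # 1.0f0 = 0.5f0 * 2.0f0
```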

But wait, there’s more!
Julia isn’t just one of those new-fangled languages living in performance la-la-land --
there’s actually some nice stuff going on under the hood and Julia is eager to show that off.
Let’s run the code and use built-in tools to inspect what the JIT does with it.
(Note: the following can be done in a file or in the repl.)


```julia
size = 1000
A = QArray(rand(Int8, size), rand(Float32, 1)[1])
B = QArray(rand(Int8, size), rand(Float32, 1)[1])

using InteractiveUtils
@code_native mul(A,B)
```


We’ve instantiated random `QArray{Int8}`s and imported the `InteractiveUtils` namespace,
which gives us the `@code_native` macro (all macros start with `@` and take space-separated arguments).
Running the above dumps the generated code for our program.
I’ve included a snippet below:

```asm
; ┌ @ quant.jl:27 within mul'
; │┌ @ broadcast.jl:753 within materialize'
; ││┌ @ broadcast.jl:773 within copy'
; ││││┌ @ simdloop.jl:73 within macro expansion' @ broadcast.jl:843
; │││││┌ @ broadcast.jl:511 within getindex'
; ││││││┌ @ broadcast.jl:550 within _broadcast_getindex'
; ││││││││┌ @ broadcast.jl:544 within _broadcast_getindex'
; │││││││││┌ @ array.jl:729 within getindex'
L1712:
        vmovdqu (%r10,%rdi,2), %ymm0
        vmovdqu 32(%r10,%rdi,2), %ymm1
        vmovdqu 64(%r10,%rdi,2), %ymm2
        vmovdqu 96(%r10,%rdi,2), %ymm3
; ││││││└└└└
; ││││││┌ @ int.jl:54 within _broadcast_getindex'
        vpmullw (%rcx,%rdi,2), %ymm0, %ymm0
        vpmullw 32(%rcx,%rdi,2), %ymm1, %ymm1
        vpmullw 64(%rcx,%rdi,2), %ymm2, %ymm2
        vpmullw 96(%rcx,%rdi,2), %ymm3, %ymm3
; │││││└└
; │││││ @ simdloop.jl:73 within macro expansion' @ array.jl:767
        vmovdqu %ymm0, (%rdx,%rdi,2)
        vmovdqu %ymm1, 32(%rdx,%rdi,2)
        vmovdqu %ymm2, 64(%rdx,%rdi,2)
        vmovdqu %ymm3, 96(%rdx,%rdi,2)
; │││││ @ simdloop.jl:74 within macro expansion'
; │││││┌ @ int.jl:53 within +'
        cmpq    %rdi, %rsi
        jne     L1712
; └└└└└└
```

We can see that Julia is no dummy when it comes to execution.
The code isn’t the best in the world, but smart intrinsics like
`vpmullw` are pipelined with wide vector registers.
For a language that looks and feels like Python meets Lisp,
being able to rely on reasonable code generation is incredibly powerful.
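The native dump is just one view, too; the same workflow applies at every stage of the compilation pipeline (a sketch, run in the same repl session as above):

```julia
using InteractiveUtils

# Each macro shows the same call at a different stage of compilation.
@code_lowered mul(A, B)   # lowered Julia IR
@code_typed mul(A, B)     # type-inferred IR
@code_llvm mul(A, B)      # LLVM IR, one step before the native code above
```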

The best part is that everything above is what you get out of the box.
A nice external tool to try is BenchmarkTools.
Perhaps not surprisingly, it helps us benchmark our code.


```
julia> using BenchmarkTools
ERROR: ArgumentError: Package BenchmarkTools not found in current path:
- Run `import Pkg; Pkg.add("BenchmarkTools")` to install the BenchmarkTools package.

Stacktrace:
```



I included the error above because isn’t that just so nice?


```
julia> using BenchmarkTools

julia> @benchmark mul(A,B)
BenchmarkTools.Trial:
  memory estimate:  6.44 KiB
  allocs estimate:  6
  --------------
  minimum time:     1.068 μs (0.00% GC)
  median time:      1.659 μs (0.00% GC)
  mean time:        2.681 μs (32.11% GC)
  maximum time:     5.106 ms (99.88% GC)
  --------------
  samples:          10000
  evals/sample:     10
```


Pretty neat!  There are some allocations we probably don’t need and 1.7 microseconds feels a bit slow,
but I’m too new to the language to know the cleanest way to improve that.
I’d probably start by going through Julia’s
[performance tips documentation](https://docs.julialang.org/en/v1/manual/performance-tips/index.html).
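One likely first step (a sketch based on those tips, not benchmarked here): the two `map` calls each allocate an intermediate `Int16` array, while nested dotted calls fuse into a single loop, so a fused variant (the name `mul_fused` is mine) should only allocate the output. The BenchmarkTools manual also suggests interpolating global variables with `$` to avoid measuring global-lookup overhead.

```julia
# Int16.(...) .* Int16.(...) fuses into one broadcast loop, so no
# intermediate converted copies of a.array and b.array are materialized.
function mul_fused(a::QArray{Int8}, b::QArray{Int8})::QArray{Int16}
    QArray{Int16}(Int16.(a.array) .* Int16.(b.array), a.scale * b.scale)
end

# @benchmark mul_fused($A, $B)  # $ interpolates the globals into the benchmark
```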

Although the above only shows a really simple external tool,
it should be made clear that many projects have been built on top of or integrated with Julia.
The [metaprogrammability](https://docs.julialang.org/en/v1/manual/metaprogramming/)
of the language is phenomenal,
making older languages like C++ look like dinosaurs.
For example, [Zygote.jl](https://github.com/FluxML/Zygote.jl),
the auto-diff engine driving FluxML,
uses the powerful metaprogramming of Julia to make the *whole language* differentiable,
meaning nearly any Julia function can be trivially added as a differentiable layer to your deep reinforcement learning program.
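A taste of what that looks like (a sketch, assuming Zygote has been installed with `Pkg.add("Zygote")`):

```julia
using Zygote

# Differentiate an ordinary Julia function: d/dx (3x^2 + 2x) = 6x + 2.
f(x) = 3x^2 + 2x
gradient(f, 4.0)  # (26.0,)
```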
You can even define entire DSLs within the language,
and packages like
[TensorOperations.jl](https://github.com/Jutho/TensorOperations.jl)
provide the simplicity of defining complex operations entirely
in Einstein notated tensor expressions.
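For instance, a matrix multiply in that notation might look like this (a sketch, assuming the package is installed; `@tensor` is its core macro):

```julia
using TensorOperations

A = rand(3, 4)
B = rand(4, 5)

# The repeated index k is summed over, Einstein-style:
# C[i,j] = Σ_k A[i,k] * B[k,j], i.e. a plain matrix product.
@tensor C[i, j] := A[i, k] * B[k, j]
```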

Cool! So what?

Well, Julia doesn’t have a Python level of support for tools that make life easier.
There aren’t dirt-simple website crawlers or libraries for every type of audio codec.
What it does have is a good mix of clean tooling and smart decisions that, I feel,
can supplement the machine learning and numerical computing ecosystems
rather than compete with Python.

Ultimately, I’d like to see Julia grow in its interoperability with established
languages and tools.
Its novelty lies in language level support for numerical analysis,
but a lot of its utility lies in catering to the needs of hackers across the entire stack.
Seems like a nice approach. 😊