<center>

# ~ julia ~

</center>

[Julia](https://docs.julialang.org/en/v1/) is a high-level, dynamic language designed for high-performance
numerical computing with Lisp-like metaprogramming facilities.  It’s great.

The path from implementing high-level ideas in Julia down to inspecting and optimizing 
the generated code is extremely pleasant.
The language is expressive, the tools are simple,
the documentation and online support are comprehensive,
and the ecosystem of projects built with Julia is well developed for such a young language.

To get a taste, let’s walk through using Julia to implement a simple idea:
element-wise multiplication of quantized arrays.
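(Quantization here just means storing small integers plus a single shared scale factor, so each real value is recovered as `scale * int` — that framing is mine, not spelled out below. A quick sketch:)

```julia
# Illustrative sketch of the quantized representation: the real values
# are approximated by a shared Float32 scale times small integers.
ints  = Int8[12, -3, 7]
scale = 0.05f0
real_values = scale .* ints   # broadcasts the scale over the array, ≈ [0.6, -0.15, 0.35]
```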

```julia
struct QArray{Prim}
  array::Array{Prim, 1}
  scale::Float32
end
```

Those four lines define a parametric struct in Julia.
The default constructor is implicit in the order of the fields (`array`, then `scale`)
and we’re using two built-in types (`Array{}` and `Float32`).
The Julia type system is smart enough to let us parameterize the
element type stored within the array (arbitrarily named `Prim`)
yet give the array a fixed rank of 1.
I doubt the above code is going to win over any hearts and minds,
but it’s painless enough to express some pretty complex ideas.
We can inspect the type a bit by jumping into a Julia REPL with `julia -i quant.jl`
and calling the `dump` function.

```
julia> dump(QArray)
UnionAll
  var: TypeVar
    name: Symbol Prim
    lb: Core.TypeofBottom Union{}
    ub: Any
  body: QArray{Prim} <: Any
    array::Array{Prim,1}
    scale::Float32
```

Although we can already see that we’re playing with a pretty clever system,
let’s extract some useful functionality out of it.

```julia
function mul(a::QArray{Int8}, b::QArray{Int8})::QArray{Int16}
  c = x -> convert(Int16, x)
  QArray{Int16}(map(c, a.array) .* map(c, b.array), a.scale * b.scale)
end
```

Above is an implementation of element-wise multiplication,
taking inputs of type `QArray{Int8}` and outputting `QArray{Int16}`. 
We map an anonymous function `c`
(which just converts the `Int8` into `Int16` for the sake of accumulation)
onto the elements in the two inputs and then multiply
(the `.` in `.*` makes this element-wise).
The last value in the function,
in this case a freshly constructed `QArray{Int16}`, is returned.
All we’ve used are built-ins and the code is still pretty damn clean.
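To sanity-check the arithmetic, here’s a hypothetical round trip on tiny inputs (assuming the `QArray` and `mul` definitions above); dequantizing the result is just `c.scale .* c.array`:

```julia
# Round trip on tiny inputs, using the QArray and mul defined above.
a = QArray(Int8[10, -2], 0.1f0)   # represents ≈ [1.0, -0.2]
b = QArray(Int8[3, 4], 0.5f0)     # represents ≈ [1.5,  2.0]
c = mul(a, b)
c.array             # Int16[30, -8]
c.scale             # 0.05f0
c.scale .* c.array  # dequantized product, ≈ [1.5, -0.4]
```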

But wait, there’s more!
Julia isn’t just one of those new-fangled languages living in performance la-la-land --
there’s actually some nice stuff going on under the hood and Julia is eager to show that off.
Let’s run the code and use built-in tools to inspect what the JIT does with it.
(Note: the following can be done in a file or in the repl.)

```julia
size = 1000
A = QArray(rand(Int8, size), rand(Float32, 1)[1])
B = QArray(rand(Int8, size), rand(Float32, 1)[1])

using InteractiveUtils
@code_native mul(A,B)
```

We’ve instantiated random `QArray{Int8}`'s and imported the `InteractiveUtils` module,
which gives us the `@code_native` macro (all macros start with `@` and take space-separated arguments).
Running the above dumps the native code generated for our program.
I’ve included a snippet below:

```asm
; ┌ @ quant.jl:27 within `mul'
; │┌ @ broadcast.jl:753 within `materialize'
; ││┌ @ broadcast.jl:773 within `copy'
; │││┌ @ broadcast.jl:797 within `copyto!' @ broadcast.jl:842
; ││││┌ @ simdloop.jl:73 within `macro expansion' @ broadcast.jl:843
; │││││┌ @ broadcast.jl:511 within `getindex'
; ││││││┌ @ broadcast.jl:550 within `_broadcast_getindex'
; │││││││┌ @ broadcast.jl:574 within `_getindex' @ broadcast.jl:575
; ││││││││┌ @ broadcast.jl:544 within `_broadcast_getindex'
; │││││││││┌ @ array.jl:729 within `getindex'
L1712:
        vmovdqu (%r10,%rdi,2), %ymm0
        vmovdqu 32(%r10,%rdi,2), %ymm1
        vmovdqu 64(%r10,%rdi,2), %ymm2
        vmovdqu 96(%r10,%rdi,2), %ymm3
; ││││││└└└└
; ││││││┌ @ int.jl:54 within `_broadcast_getindex'
        vpmullw (%rcx,%rdi,2), %ymm0, %ymm0
        vpmullw 32(%rcx,%rdi,2), %ymm1, %ymm1
        vpmullw 64(%rcx,%rdi,2), %ymm2, %ymm2
        vpmullw 96(%rcx,%rdi,2), %ymm3, %ymm3
; │││││└└
; │││││ @ simdloop.jl:73 within `macro expansion' @ array.jl:767
        vmovdqu %ymm0, (%rdx,%rdi,2)
        vmovdqu %ymm1, 32(%rdx,%rdi,2)
        vmovdqu %ymm2, 64(%rdx,%rdi,2)
        vmovdqu %ymm3, 96(%rdx,%rdi,2)
; │││││ @ simdloop.jl:74 within `macro expansion'
; │││││┌ @ int.jl:53 within `+'
        addq    $64, %rdi
        cmpq    %rdi, %rsi
        jne     L1712
; └└└└└└
```
We can see that Julia is no dummy when it comes to execution.
The code isn’t the best in the world, but the loop has been unrolled and vectorized:
wide-register instructions like `vpmullw` chew through 16 `Int16` multiplies at a time.
For a language that looks and feels like Python meets Lisp,
being able to rely on reasonable code generation is incredibly powerful.
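`@code_native` is only the last stop, too. `InteractiveUtils` exposes the earlier stages of the compilation pipeline as well, which is handy for spotting where type inference goes wrong (shown here with the `A` and `B` from above):

```julia
using InteractiveUtils
# Same call, different stages of the pipeline:
@code_lowered mul(A, B)   # desugared Julia IR
@code_typed   mul(A, B)   # after type inference
@code_llvm    mul(A, B)   # LLVM IR, before machine-code emission
```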

The best part is that everything above is what you get out of the box.
A nice external tool to try is `BenchmarkTools`.
Perhaps not surprisingly, it helps us benchmark our code.

```
julia> using BenchmarkTools
ERROR: ArgumentError: Package BenchmarkTools not found in current path:
- Run `import Pkg; Pkg.add("BenchmarkTools")` to install the BenchmarkTools package.

Stacktrace:
 [1] require(::Module, ::Symbol) at ./loading.jl:823

julia> import Pkg; Pkg.add("BenchmarkTools")
```

I included the error above because isn’t that just so nice?

```
julia> using BenchmarkTools

julia> @benchmark mul(A,B)
BenchmarkTools.Trial:
  memory estimate:  6.44 KiB
  allocs estimate:  6
  --------------
  minimum time:     1.068 μs (0.00% GC)
  median time:      1.659 μs (0.00% GC)
  mean time:        2.681 μs (32.11% GC)
  maximum time:     5.106 ms (99.88% GC)
  --------------
  samples:          10000
  evals/sample:     10
```

Pretty neat!  There are some allocations we probably don’t need and 1.7 microseconds feels a bit slow,
but I’m too new to the language to know the cleanest way to improve that.
I’d probably start by going through Julia’s
[performance tips documentation](https://docs.julialang.org/en/v1/manual/performance-tips/index.html).
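One likely suspect is the pair of `map` calls, which each allocate a temporary `Int16` array before the multiply. A hedged sketch of an in-place variant (the `mul!` name and the preallocated-output pattern are my guesses at the idiomatic fix, not something from this walkthrough): `widemul` promotes each `Int8` pair to an `Int16` product without temporaries, and broadcasting with `.=` writes straight into the output.

```julia
# Sketch: reuse a preallocated output to avoid the map() temporaries.
# widemul(::Int8, ::Int8) multiplies into Int16, element by element.
function mul!(out::QArray{Int16}, a::QArray{Int8}, b::QArray{Int8})
  out.array .= widemul.(a.array, b.array)
  QArray{Int16}(out.array, a.scale * b.scale)  # fresh wrapper, shared storage
end
```

When benchmarking, BenchmarkTools also recommends interpolating global variables with `$` (i.e. `@benchmark mul($A, $B)`) so that global-variable overhead doesn’t inflate the timings.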


Although the above only shows a really simple external tool,
it should be made clear that many projects have been built on top of or integrated with Julia.
The [metaprogrammability](https://docs.julialang.org/en/v1/manual/metaprogramming/)
of the language is phenomenal,
making older languages like C++ look like dinosaurs.
For example, [Zygote.jl](https://github.com/FluxML/Zygote.jl),
the auto-diff engine driving FluxML,
uses the powerful metaprogramming of Julia to make the *whole language* differentiable.
That means your complicated 
[trebuchet simulation](https://www.youtube.com/watch?v=LjWzgTPFu14)
can be trivially added as a differentiable layer to your deep reinforcement learning program.
You can even define entire DSLs within the language,
and packages like 
[TensorOperations.jl](https://github.com/Jutho/TensorOperations.jl)
provide the simplicity of defining complex operations entirely
as Einstein-notation tensor expressions.
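For a taste of that, here’s a matrix-vector contraction written with TensorOperations’ `@tensor` macro (a sketch based on the project’s documented syntax; treat it as illustrative):

```julia
using TensorOperations

A = rand(3, 4)
x = rand(4)
# Einstein notation: the repeated index j is summed over,
# and := allocates the new output array y.
@tensor y[i] := A[i, j] * x[j]   # equivalent to A * x
```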

Cool! So what?

Well, Julia doesn’t have a Python-level ecosystem of tools that make life easier.
There aren’t dirt-simple website crawlers or libraries for every type of audio codec.
What it does have is a good mix of clean tooling and smart decisions that, I feel,
can supplement the machine learning and numerical computing ecosystems
rather than compete with Python.

Ultimately, I’d like to see Julia grow in its interoperability with established 
languages and tools.
Its novelty lies in language level support for numerical analysis,
but a lot of its utility lies in catering to the needs of hackers across the entire stack.
Seems like a nice approach. 😊