Interpreter, Compiler, JIT

ndesaulniers · on May 25, 2015

Hey all, happy to take questions/feedback/criticism.

Funny anecdote: while developing this post, once I got the JIT working I was very excited. I showed a few people in the office. Our CTO walked by and came to take a look. He's worked on numerous VMs in the past. SpiderMonkey's first JIT, TraceMonkey, being his PhD thesis. He took one look and asked "Is it self hosted." I replied, "well...not yet." To which his response was "pffft!" and walked off. I found that pretty funny. Maybe in the next blog post!

vidarh · on May 25, 2015

Dr. Gal studied under Prof. Michael Franz as well, who in turn did his PhD studies under Niklaus Wirth.

If you haven't yet, you should read Franz' PhD thesis [1]. It was on load time code generation from what's effective a compressed representation of an ast that was designed for re-using generated code-segment by compying and template instantiation. So a JIT, but not generating from bytecode.

Consider it somewhat like inserting a serialization/deserialization step right before your code-generation step, sort of, with some added optimizations to cache re-usable fragments.

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1...

wvenable · on May 25, 2015

You could try using C2BF[1] in an attempt to make it self hosting. :)

[1] http://esolangs.org/wiki/C2BF

ndesaulniers · on May 25, 2015

Hey! I like the way you think, my friend!

nickpsecurity · on May 26, 2015

wvenable's idea is good but a cheat. You referenced Wirth, who simplified assembly (eg P-code). Do something similar: extend BF with a macro language that simplifies it, build macros for BF-like programming constructs, use them to simplify programming the interpreter/compiler, and then host with that. If you want to do real BF, the very act of implementing those macros might help you mentally understand how to... one piece at a time lol... implement the real thing in vanilla BF.

Which you can hand-implement in machine code if you want. Choose your level of masochism carefully. :)

ndesaulniers · on May 26, 2015

The part I'm most curious about would how to make the JIT itself self hosting. If brainfuck's only method of I/O is getchar/putchar, the only way I can think of calling mmap is doing a stack buffer overflow attack based on the misrepresentation of tape.

I remember hacking on Rust being a mind bending experience (the Rust compiler has been self hosting for a long time). Thinking in another level of abstraction is hurting my head.

I think it would have to be able to output an macho64 executable. Oh man.

nickpsecurity · on May 26, 2015

That does sound difficult. Maybe do it bare-metal on a simple, non-MMAP architecture* with basic routines coded in assembler. Both main interpreter and JIT'd code might reference those core functions. I know it's not directly solving the problem but I sneak around the impossible parts wherever possible.

* Come to think of it, one of those microcontrollers or Java processors might come in handy here as they supply basic libs or RTOS's with primitive I/O functionality.

vardump · on May 25, 2015

Maybe one day someone will write a native JIT for x86[-64]. Native x86 code in, optimized native x86 code out.

It should be possible to JIT native code and run it faster than running the native code directly!

It could do:

1) Peephole optimizations (and known sequence/function replacement). Utilize target instruction set extensions.

2) "Constant" folding. Replace code that processes values that can be proven to be immutable with a constant. Also removes unnecessary branches. Guards at loads, indirect stores, by marking page read only, etc.

3) Vectorization. This includes widening existing vectorization if target instruction set supports it.

4) Loadable library call inlining. This could even mean inlining system calls, but of course in that case the JIT would need to be running in kernel...

5) Profile guided optimization.

Of course there are some very hard problems. For example, there'd need to be guards for the case the assumed constant value does change. Or that unexpected call target occurs. Etc.

Maybe this system could learn possible call targets and unexpectedly changing "constants".

pag · on May 26, 2015

What you're talking about is binary translation. Here are a few pointers:

  - DynamoRIO: http://www.dynamorio.org/
  - Intel PIN: https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
  - Rosetta: https://www.apple.com/asia/rosetta/
  - Transmeta Crusoe: http://en.wikipedia.org/wiki/Transmeta_Crusoe
  - Binary Translation using Peephole Superoptimization: http://theory.stanford.edu/~aiken/publications/papers/osdi08.pdf

You might also find program specialization interesting and related:

   - SynthesisOS: http://valerieaurora.org/synthesis/SynthesisOS/

I've personally created three binary translators, and worked on DynamoRIO as well as a Linux kernel variant of DynamoRIO. It's hard to do well / right, but at the same time it's fun. If you're interested, it might be worth starting with an existing system to give yourself some good ideas of how to structure the system, and perhaps even how not to do it. It might also give you a sense of a niche worth targeting. Your list is interesting, but tackling all that is non-trivial, or at least, tackling all that in a general way is non-trivial. However, if you restrict yourself to some subset of programs / use cases, then you can make an impact.

Creating faster code is incredibly tricky. That was the original motivation for things like DynamoRIO (and HP's Dynamo before it), and it's mostly not worth it. You can sometimes do it for specific workloads in a way that is similar to feedback-directed optimization. You can also sometimes do with with inlining / outlining (e.g. moving cold code off the critical path of the icache) / specializing (creating specialized versions of code for specific inputs, use sites, etc.).

jevinskie · on May 26, 2015

Look at things like PIN tool [0], DynamoRIO [1], and Valgrind [2] to do this at runtime or at any of the LLVM based decompilers like dagger [3], libcpu [4], or fracture [5] to do AoT compilation. I'm not sure if QEMU's IR would be a good fit. DynamoRIO and Valgrind IR may also be able to be AoTed if you work hard enough. Of course, stuff like indirect branches and self-modifying code can really throw a wrench in your JITter's gears.

[0]: https://software.intel.com/en-us/articles/pin-a-dynamic-bina...

[1]: http://www.dynamorio.org/

[2]: http://valgrind.org/

[3]: http://dagger.repzret.org/

[4]: https://github.com/libcpu/libcpu

[5]: https://github.com/draperlaboratory/fracture

ndesaulniers · on May 25, 2015

Doesn't LTO do some of these things?

vardump · on May 25, 2015

I was talking about runtime executable optimization. In general, I don't think this is possible at link time. If the binary is linked in 2015, it's not possible to know about CPUs available in 2035 or to utilize the new features, even if they're binary compatible.

Compiling (incl. linking) bakes in assumptions about target CPU. If these assumptions are incorrect, the performance will be sub-optimal. Even pathologic cases are possible, like order of magnitude lower performance in AMD Jaguar AVX stores.

CPUs have wildly different combinations of instruction set extensions. Instruction throughput and latency varies significantly. Cache subsystems are different.

So code that runs well on one CPU can run sub-optimally on another binary-compatible CPU.

jevinskie · on May 26, 2015

The LTOs I know of always operate on an IR like LLVM or GIMPLE, not binary code.

lmz · on May 26, 2015

Wasn't that what some x86 emulators did before hardware virtualization? (maybe not to that extent).

dfox · on May 26, 2015

Translating code for one architecture into code for host (possibly same) architecture is what Qemu does. But Qemu does not do any significant optimizations in this process as it attempts to do this without extensive knowledge of host architecture (it does not even directly generate separate instructions).

Other thing that comes to mind is HP Dynamo (http://www.hpl.hp.com/techreports/1999/HPL-1999-78.html), which essentially is PA-RISC on PA-RISC optimizing JIT VM (that also does trace-driven optimization).

pron · on May 25, 2015

JITs don't have to repeat all their work every time they run. They can cache their output (this feature is planned for Java 9, I think). And while, as the article says, JITs are pretty much a necessity for languages with dynamic dispatch, which are nearly impossible to optimize ahead-of-time, they can be great for statically-typed languages, too:

1. Their ability to speculatively optimize (and then de-optimize when the assumption proves false and recompile) makes it possible for them to implement zero-cost abstractions, such as inlining polymorphic virtual calls.

2. They make it possible to optimize across shared libraries, even those that are loaded dynamically.

To those interested in the future of JITs, I very much recommend watching one of the talks about Graal[1], HotSpot's (the OpenJDK JVM) next-gen JIT. Like HotSpot's current optimizing JIT compiler, it does speculative, profile-guided optimization, but exposes an API that lets the language designer (or even the programmer) to control optimization and code generation. It is also self-hosted (i.e. written in Java).

It's still under heavy development but early results are promising. Even though it supports multithreading (which complicates things), it performs better (often much better) than PyPy when running Python[2] and on par with V8 when running JavaScript[3].

[1]: https://wiki.openjdk.java.net/display/Graal/Publications+and...

[2]: https://docs.google.com/spreadsheets/d/1fFMWcRIuPKt7wSAM5Ox9...

[3]: http://www.slideshare.net/ThomasWuerthinger/jazoon2014-slide...

vidarh · on May 25, 2015

> JITs are pretty much a necessity for languages with dynamic dispatch, which are nearly impossible to optimize ahead-of-time,

Depends what you consider "nearly impossible". A lot of compilers for dynamic languages are just awful for no particularly good reason so it's often hard to assess what is slow because it is hard and what is slow because the implementation disregards the last 30 years of experience with compiling dynamic languages.

E.g. I'm working on an ahead-of-time Ruby compiler. While it will need a JIT component for cases where people call eval, and while Ruby is particularly nasty to compile for a variety of reasons, the method dispatch is easy to reduce to one indirection via a vtable (you just need to propagate method overrides down the class hierarchy and update the vtables, but updates to the methods is much rarer than cals so it's ok for it to be more expensive), equivalent to C++ virtual member functions (though C++ compilers have better hope of being able to optimize away the virtual call).

Despite the pathological cases possible because of the singly rooted object hierarchy, for most typical applications the wasted space in vtables (for method slots for methods that are unimplemented in a specific branch of the class hiearchy) is easily compensated for by e.g. needing less bookkeeping for methods that are known statically at compile time (which is the vast majority for most applications).

If you're willing to compile a fallback path and suitable guards, you can sometimes optimize away many indirections entirely and even inline code ahead of time even for languages like Ruby (incidentally for example of that you can look at the work Chris Seaton has done on a JRuby Truffle/Graal backend - while that does these optimizations at runtime, many of them are applicable ahead of time too, though the JIT can get the advantage of not having to generate code for fallback cases unless they're actually needed at runtime)

Note that I agree with you that JIT's have many advantages. At the same time, while I love the flexibility of dynamic languages, I prefer the smallest number of moving wheels possible in production environment. Spent too long doing devops.. Makes me lament the lack of attention to AOT/"as static as possible" compilers for dynamic languages.

jerf · on May 25, 2015

"E.g. I'm working on an ahead-of-time Ruby compiler.... easy..."

If you haven't already... it's 2015. You might want to go look at the sun-bleached bones of all your predecessors who confidently proclaimed that all the dynamic scripting languages were slow for no good reason and a JIT could totally just JIT all the slowness away.

If it's a journey you wish to take, by all means, be my guest, but, please, be sure you know what you're getting into.

(The aforementioned sun-bleached bones are the bulk of the reason why I no longer believe in the old adage that "languages don't have performance characteristics, only implementations do", at least in the full sense intended. Languages may not have performance characteristics but the evidence rather strongly suggests they can put upper bounds on performance.)

WalterGR · on May 26, 2015

    You might want to go look at the sun-bleached bones of all your predecessors
    who confidently proclaimed that all the dynamic scripting languages were slow
    for no good reason and a JIT could totally just JIT all the slowness away.

I ask this with the best of intentions: where can I learn more about this?

(FWIW, I'm curious specifically about the ways in which Common Lisp (or Lisps in general) does or does not avoid the performance pitfalls of - say - Ruby or Python. But I'm definitely interested in the general topic as well...)

jerf · on May 26, 2015

Study PyPy, particularly w.r.t. to the definition of RPython and why it had to have the restrictions it had.

There's a lot of issues that Python/Ruby/Perl/JS have at runtime that Lisp takes care of at compile time, where everything is technically a massive set of method resolutions. Lisp may allow macros to do the same thing, but careful construction will do more of the resolution at compile time.

chrisseaton · on May 26, 2015

If you are coming from the perspective of a Ruby programmer, there are lots of papers on optimising Ruby in the Bibliography http://rubybib.org.

But what you want to do is take a look at a few of the VM papers there, and then look at their references and see what people have tried previously. The related work sections often talk about previous limitations.

Lisps in general are much more static than a language like Ruby, so it wouldn't be very instructive to compare against them. The trickiest Ruby features are things like Proc#binding and Kernel#set_trace_func, as these require you to modify running code and data structures which may have already been very aggressively modified.

I don't think (but am not an expert) that most Lisps have features like those. Lisps have code as data of course, which sounds like a nightmare to compile, but it's really not - the code trees are immutable (I believe) when generated, so just compile them whenever you see a new tree root and cache.

vidarh · on May 26, 2015

With just the vtables, Kernel#set_trace_func is reasonably easy - you can do it just by introducing thunks that proxy to the real method after calling the trace function. Wasteful, though, but easy to patch in/out.

The moment you inline, I think it's still possible to do reasonably ok using ptrace as long as sufficient debug info is kept to identify call sites. In both cases I think that for my project, if I ever get to looking at set_trace_func, I'll probably require a compiler option to enable debug info, as it's not exactly frequently used. Alternatively use the approach we've discussed and fall back to a "slow path".

Binding is basically a closure that captures all local variables and arguments of its context, so it's not that tricky in most cases as long as you can identify it and alias / copy variables into a separate environment. This is already how I handle block references to surrounding variables - by hoisting the creation of the environment for the closure up to the start of the relevant scope and aliasing all mentioned variables.

eval() of course complicates this massively, but then eval() complicates everything massively - and doubly so for ahead-of-time compilation.

tokenrove · on May 26, 2015

Features like generic dispatch in Common Lisp result in similar amounts of dynamism; methods and classes can be defined and redefined at any time so a lot of work goes into making that fast; see for example http://metamodular.com/generic-dispatch.pdf

Also similar to Kernel#set_trace_func, CL has TRACE (http://www.lispworks.com/documentation/lw61/CLHS/Body/m_trac...) but unlike generic dispatch, that's not expected to be fast.

_19qg · on May 26, 2015

> Lisps in general are much more static than a language like Ruby

What?

You mean running code directly off of data structures like in Lisp interpreters, using late binding, dynamic CLOS dispatch with method combinations, a full blown meta-object protocol, etc. is more static than Ruby?

Okay...

chrisseaton · on May 26, 2015

> running code directly off of data structures

I specifically addressed this. You may need to re-compile if you see a new code data structure instance, but when you've compiled the fact it came from a data structure isn't interesting. All JIT compilers have code as data - they have an IR.

> late binding

This is no more troublesome than it is in Ruby.

> dynamic CLOS dispatch with method combination

But these are optional features that you sometimes use. In Ruby all operations, method calls and most control structures use the strongest form of dynamic dispatch available in the language.

> full blown MOP

So does Ruby.

So I can't say Lisp is any more dynamic for the compiler than Ruby. A sibling of yours mentioned a trace extension but said people don't expect it to be fast so nobody is bothering. In the implementation of Ruby I work on, we've even made set_trace_func as fast as we can.

_19qg · on May 26, 2015

> All JIT compilers have code as data - they have an IR.

Lisp interpreters run mutable source code, not IR.

> > dynamic CLOS dispatch with method combination

> But these are optional features that you sometimes use. In Ruby all operations, method calls and most control structures

If you look at actual Lisp systems it's not optional. IO, Error handling, tools, ... all use CLOS.

> use the strongest form of dynamic dispatch available in the language.

CLOS uses a more dynamic form of dispatch.

> So I can't say Lisp is any more dynamic for the compiler than Ruby.

Using a compiler or interpreter makes no difference for CLOS.

chrisseaton · on May 26, 2015

Well I'm not an expert in compiling Lisp and I guess you're not an expert in compiling Ruby, so we're probably not going to come to an agreement on this.

tokenrove · on May 27, 2015

> A sibling of yours mentioned a trace extension but said people don't expect it to be fast so nobody is bothering.

That's a bit of a misrepresentation of what I said. I don't know about TRACE in particular (which is not an extension but a core part of the language), but the sbcl developers in particular have been very attentive to the performance of development tools like that. (And, probably, the commercial vendors, as well.)

pron · on May 26, 2015

He means, I think, that method dispatch (i.e. function resolution) in Lisps is less dynamic than in Ruby/JS, where you resolve the call target by searching for the method name in a map -- at least as an abstraction -- belonging to the target object at runtime (or Ruby's method_missing). I am not familiar with all Lisps, but their method polymorphism is usually much more structured.

_19qg · on May 26, 2015

Actually CLOS (Common Lisp Object System) is more dynamic than Ruby.

Every CLOS method is part of a generic function. Each generic function is a collection of methods. At runtime the applicable methods are collected, based on the runtime class graph of the arguments to dispatch over, and an effective method is assembled by a per generic function method combination.

Method selection is also depending on more than one argument (the target object), but on multiple arguments.

Various parts of that are implemented by a meta-object protocol in CLOS implementations, which make it possible to change the behavior of the object system at runtime in various ways.

vidarh · on May 26, 2015

Your contraction massively misrepresents what I wrote. Compiling Ruby is by no means easy. And optimizing it to similar speed as e.g. C is massively hard.

But optimizing the dynamic dispatch down to the cost of a vtable call is trivially easy.

jerf · on May 26, 2015

"Your contraction massively misrepresents what I wrote."

Yeah, sort of, but it represents what I was trying to say accurately enough. You will find, as you go through your project, that nothing is "trivially easy" when it comes to JIT'ing Ruby, and, alas, "just compile it down to a vtable call" is likely to not be that easy either. Dynamic language JITs have a terrible recursive 90/10 problem. In fact I doubt that it actually will be that simple because if it were, PyPy would be a great deal faster than it is.

(From a JIT perspective, Ruby and Python are all but identical.)

The point I'm making is not that you shouldn't do this. The point is that A LOT of people have ALREADY done this, and it has proved to be a lot harder than any of them thought, and you really ought to take some time to go study PyPy or various PHP JIT projects before getting too excited about thinking that you are going to speed Ruby up as much as you think you are. Particularly have a look at RPython. If you are going to tackle this project, this notorious Zeno tarpit [1], I recommend everyone do it with eyes open. No, it isn't going to be as easy as "just do X", or it would already have been done.

[1]: http://esr.ibiblio.org/?p=6772

vidarh · on May 26, 2015

> nothing is "trivially easy" when it comes to JIT'ing Ruby

First of all, I'm not JIT'ing Ruby. Secondly, a lot of things are "trivially easy" when it comes to compiling Ruby - even ahead of time. I know, because while there's lots of outstanding bits, I am compiling a reasonable subset, and I understand the remaining tradeoffs well enough to know where the pitfalls are - it's by no means my first compiler

> The point is that A LOT of people have ALREADY done this

A lot of people have done this, but despite decades of history of optimisation of dynamic languages, most of them appears to not have bothered with even a cursory look at the literature.

In terms of JIT, the Truffle backend for JRuby is a good example of the difference actually applying modern techniques makes.

vardump · on May 25, 2015

When you call/jump through a vtable, you've already lost inlining opportunities. Indirect branches themselves aren't free either. This is an order of magnitude performance loss for simple functions. Additional order of magnitude worse, if the simple function behind vtable indirect jump/call was a vectorization opportunity.

vidarh · on May 26, 2015

Agreed, which is why I also pointed out that with suitable guards and fallback path you can often inline as well, as large proportions of callsites in dynamic code is monomorphic or can be transformed to monomorphic by relatively small numbers of guards and split paths that can often be hoisted quite far.

But a call/jump via vtable is already a massive improvement over most dynamic language implementations. It's not unusual still to find e.g. hash-table lookups and hierarchical searches up through the inheritance chain for no good reason.

ndesaulniers · on May 25, 2015

IIRC, the Oracle guys came to Mozilla to give a talk about this. When Mozilla folks asked if it was open source, which at the time, the answer was "no." That response was met with a bunch of blank looks from both parties. Heh. At least, I think it was Graal. They were showing some VM running Ruby.

pron · on May 25, 2015

Graal is open source now: https://github.com/OracleLabs/GraalVM (I think it's now officially part of OpenJDK)

And yes, it was probably Graal. There's an accompanying language JIT construction toolkit called Truffle, and they already have implementations for Java, Python, JavaScript, Ruby and R.

ndesaulniers · on May 25, 2015

Over in the comments in proggit the inventor of brainfuck, Urban Müller, showed up and gave the Chuck Norris thumbs up: https://www.reddit.com/r/programming/comments/377ov9/interpr...