Worth pointing out: I thought it was just the SIMD that made it fast when I first got involved. It turns out that while it helps, it's just a tool that helps to achieve the real gain: eliminating branches (if statements) that make the processor stumble and fall in its attempts to sprint ahead (speculative execution). simdjson's first stage pushes this capability really far towards its limit, achieving 4 instructions per cycle by not branching. And yes, 1 cycle is the smallest amount of time a single instruction can take. Turns out a single thread is running multiple instructions in parallel at any given time, as long as you don't trip it up!
Parsing is notoriously serial and branchy, which is what makes simdjson so out of the ordinary. It's using a sort of "microparallel algorithm," running a small parallel parse on a SIMD-sized chunk of JSON (16-32 bytes depending on architecture), and then moving to the next.
And yeah, you have to go back over a decade to find CPUs that don't have SIMD. simdjson runs on those too, just obviously doesn't use SIMD instructions :)
An interesting point about the design: simdjson loses its branchlessness in "stage 2". I originally had a bunch of very elaborate plans to try to stay branchless far further into the parsing process. It proved just too hard to make it work. There are some promising things that ultra-modern Intel chips - meaning Icelake - and future iterations of ARM (SVE/SVE2) are adding to their SIMD abilities, so it might be worth revisiting this in a few years (there aren't too many Icelake boxes out there yet, and SVE barely exists).

Yep. Most of stage 2's branchiness essentially comes from "is the next thing an array? object? string? number? Handle it differently if so."
Making it so you can handle all the brackets at once, all the strings at once, all the numbers at once, would make a big difference, and we're thinking about that. Another thing that could help is making the if statement more predictable using type information from the user. get<int>() could mean "I expect this next thing to be an integer, so parse it that way and just yell if it's not, please."
It's difficult. But it's why I'm still so fascinated! Solving JSON thoroughly and completely will give us a lot of information on how to quickly parse XML, YAML, and other file formats.
We've clearly been collectively doing parsing wrong (including me) if there's this much of a gap. It's exciting to see something innovative and new in this domain and even being able to contribute to it :) @lemire deserves a ton of credit for making an actual project out of his work and promoting it; I likely wouldn't have heard of it otherwise.
That's right. There's a problem I call the toothpaste problem - you can squeeze the needed branchiness around, but you can't make it go away entirely (at least, I couldn't).
There used to be 4 stages (stage 1 was the marks, stage 2 was bits-to-indexes, stage 3 was the tape construction and stage 4 was the 'clean everything up and do all the hard stuff'). It's possible - though awkward - to do tape construction branchlessly, but the gyrations required were expensive and weird and it just delayed the reckoning.
I built a prototype of the 'gather all the X at once and handle it in one go' and the logic to gather that stuff was more expensive than just handling everything.
In my wacky world of branch-free coding (which I've been doing for a while), there are worse things than a missed branch. The idea that you can accumulate an array branchlessly (i.e. always put the thing you have in a location somewhere and bump a pointer when you need to) seems pretty cool, but a branch miss is not the only hazard. This technique of branchlessly writing a log is an anti-pattern I've tried over and over again; a stream of unpredictable writes is just as big a pain as a bunch of unpredictable branches - it causes the pipeline to come to a screaming halt. If you can get it going somehow (new SIMD tricks? better algorithm?) I'd be impressed.
Get my email from Daniel or hit me up via Twitter DM if you want to discuss further.
... and yes, agreed that Daniel has done a great job making this into a real thing. This would have stayed a bunch of code sketches if I had been the only one working on it. In terms of quality, the project is now well on the way to being commercial-quality code rather than the research prototype that I did. I understand I have you to thank for that as well!