Worth pointing out: I thought it was just the SIMD that made it fast when I first got involved. It turns out that while it helps, it's just a tool that helps to achieve the real gain: eliminating branches (if statements) that make the processor stumble and fall in its attempts to sprint ahead (speculative execution). simdjson's first stage pushes this capability really far towards its limit, achieving 4 instructions per cycle by not branching. And yes, 1 cycle is the smallest amount of time a single instruction can take. Turns out a single thread is running multiple instructions in parallel at any given time, as long as you don't trip it up!
Parsing is notoriously serial and branchy, which is what makes simdjson so out of the ordinary. It's using a sort of "microparallel algorithm," running a small parallel parse on a SIMD-sized chunk of JSON (16-32 bytes depending on architecture), and then moving to the next.
And yeah, you have to go back over a decade to find CPUs that don't have SIMD. simdjson runs on those too, just obviously doesn't use SIMD instructions :)
An interesting point about the design: simdjson loses its branchlessness in "stage 2". I originally had a bunch of very elaborate plans to try to stay branchless far further into the parsing process. It proved just too hard to make it work. There are some promising things that ultra-modern Intel chips - meaning Icelake - and future iterations of ARM (SVE/SVE2) are adding to their SIMD abilities, so it might be worth revisiting this in a few years (there aren't too many Icelake boxes out there yet, and SVE barely exists).

Yep. Most of stage 2's branchiness essentially comes from "is the next thing an array? object? string? number? Handle it differently if so."
Making it so you can handle all the brackets at once, all the strings at once, all the numbers at once, would make a big difference, and we're thinking about that. Another thing that could help is making the if statement more predictable using type information from the user. get<int>() could mean "I expect this next thing to be an integer, so parse it that way and just yell if it's not, please."
It's difficult. But it's why I'm still so fascinated! Solving JSON thoroughly and completely will give us a lot of information on how to quickly parse XML, YAML, and other file formats.
We've clearly been collectively doing parsing wrong (including me) if there's this much of a gap. It's exciting to see something innovative and new in this domain and even being able to contribute to it :) @lemire deserves a ton of credit for making an actual project out of his work and promoting it; I likely wouldn't have heard of it otherwise.
That's right. There's a problem I call the toothpaste problem - you can squeeze the needed branchiness around, but you can't make it go away entirely (at least, I couldn't).
There used to be 4 stages (stage 1 was the marks, stage 2 was bits-to-indexes, stage 3 was the tape construction and stage 4 was the 'clean everything up and do all the hard stuff'). It's possible - though awkward - to do tape construction branchlessly, but the gyrations required were expensive and weird and it just delayed the reckoning.
I built a prototype of the 'gather all the X at once and handle it in one go' and the logic to gather that stuff was more expensive than just handling everything.
In my wacky world of branch-free coding (which I've been doing for a while), there are worse things than a missed branch. The idea that you can accumulate an array branchlessly (i.e. always put the thing you have in a location somewhere and bump a pointer when you need to) seems pretty cool, but a branch miss is not the only hazard. This technique of branchlessly writing a log is an anti-pattern I've tried over and over again; a stream of unpredictable writes is just as big a pain as a bunch of unpredictable branches - it causes the pipeline to come to a screaming halt. If you can get it going somehow (new SIMD tricks? better algorithm?) I'd be impressed.
Get my email from Daniel or hit me up via Twitter DM if you want to discuss further.
... and yes, agreed that Daniel has done a great job making this into a real thing. This would have stayed a bunch of code sketches if I had been the only one working on it. In terms of quality, the project is now well on the way to being commercial-quality code rather than the research prototype that I did. I understand I have you to thank for that as well!