That bit about fixed point is extremely interesting. I found your blog post abou...

jwr · on April 4, 2014

It all depends on how precise your fixed-point values need to be. If you can squeeze them into 8 bits (I could), you can use the 128-bit SSE registers to operate on 16 values at a time. It gets even better with AVX (strictly AVX2 for integer operations, which has 256-bit registers and so handles 32 values at a time), although that wasn't available to me at the time.

So the speedup is not just from going to fixed point, but from managing to use the vector instructions.
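
To make the "16 values at a time" point concrete, here is a minimal sketch, not the code from the original project. It assumes unsigned 8-bit fixed-point values and uses a saturating add as the example operation. The function name `add16_u8_sat` is made up for illustration. The intrinsics (`_mm_loadu_si128`, `_mm_adds_epu8`, `_mm_storeu_si128`) are standard SSE2 ones, and there is a scalar fallback so the snippet still compiles on targets without SSE2.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 intrinsics */
#endif

/* Hypothetical helper (not from the original post): adds 16 unsigned
 * 8-bit fixed-point values element-wise, clamping each result at 255
 * instead of letting it wrap around. */
static void add16_u8_sat(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if defined(__SSE2__)
    /* One 128-bit register holds all 16 bytes; unaligned loads are used
     * so the caller does not have to align the arrays. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* A single instruction performs 16 saturating 8-bit additions. */
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
#else
    /* Scalar fallback with the same semantics, one value per iteration. */
    for (int i = 0; i < 16; i++) {
        unsigned sum = (unsigned)a[i] + (unsigned)b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
#endif
}
```

The SSE2 path replaces the 16 loop iterations of the fallback with two loads, one add, and one store, which is where the extra speedup beyond plain fixed-point arithmetic comes from.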