All this usually proves is that compilers are smarter than you think.
You can get some gains from hand-written SIMD code, since autovectorization is hard: for example, you've laid out data deliberately but the compiler doesn't quite see it. But modern CPUs are so complicated, even when executing scalar instructions, that I wouldn't bother. I think optimizing memory access is more productive half the time anyway; most programs don't spend that much time number crunching.
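To make the "compiler doesn't quite see it" case concrete, here's a minimal sketch in C. The function name `scale_add` is made up for illustration. Without `restrict`, the compiler has to assume `out` might overlap `a` or `b`, so it either emits a runtime overlap check or falls back to scalar code. With it, GCC and Clang will typically vectorize this loop at `-O3`, though that depends on the compiler version, target, and flags.

```c
#include <stddef.h>

/* Hypothetical kernel: out[i] = a[i] * k + b[i].
 * `restrict` promises the compiler that the three arrays do not
 * overlap, which removes the aliasing concern that can otherwise
 * block or complicate autovectorization. */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * k + b[i];
}
```

Whether this actually beats the non-`restrict` version is something you'd have to measure or check in the generated assembly. I'm not claiming a speedup here.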
This program is a special case because the whole state fits in registers. Unless the compiler detects this, a human should be able to beat it. Even unrolled, the code would fit in the L1 instruction cache.