AVX-512 is wide enough to process eight 64-bit floats at once. Getting a 10x speedup out of an 8-wide SIMD unit is a little difficult to explain; some of it is presumably coming from eliminating branches, in addition to the vector width itself. It's extremely impressive. Also, it has taken Intel a surprisingly long time!
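(To make the "fewer branches" point concrete, here's a rough sketch of how an AVX-512 partition step can replace a per-element branch with a compare mask plus a compress-store. This is purely illustrative, not the actual library code; the name partition_lt and the n % 8 == 0 assumption are mine.)

    #include <immintrin.h>
    #include <cstddef>

    // Hypothetical sketch: copy every double in src[0..n) that is < pivot
    // into dst, with no per-element branch. Requires AVX-512F; assumes n % 8 == 0.
    std::size_t partition_lt(const double* src, std::size_t n, double pivot, double* dst) {
        const __m512d pv = _mm512_set1_pd(pivot);
        std::size_t written = 0;
        for (std::size_t i = 0; i < n; i += 8) {
            __m512d v = _mm512_loadu_pd(src + i);
            // One compare produces an 8-bit mask instead of 8 branch decisions.
            __mmask8 lt = _mm512_cmp_pd_mask(v, pv, _CMP_LT_OQ);
            // Compress-store packs the selected lanes contiguously into dst.
            _mm512_mask_compressstoreu_pd(dst + written, lt, v);
            written += static_cast<std::size_t>(__builtin_popcount(lt));
        }
        return written;
    }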
That sounds suspiciously as if an implementation tailored to that cache line size might see a considerable part of the speedup even running on non-SIMD operations? (or on less wide SIMD)
Wrong side of the L1 cache. Cache lines are how the L1 cache talks to the L2/L3 caches.
I'm talking about the load/store units in the CPU core, or the core <--> L1 cache communication. This side is less commonly discussed online, but I'm pretty sure it's important in this AVX-512 discussion. (To be fair, I probably should have said "load/store unit" instead of L1 cache in my previous post, which would have been clearer.)
-------------
Modern CPU cores only have a limited number of load/store units. It's superscalar of course, something like 4 loads/stores per clock tick, but still limited. By "batching" your loads/stores into 512 bits at a time instead of 256 or 64, your CPU doesn't have to do as much work to talk to the L1 cache.
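(Rough illustration of the batching point, assuming AVX-512F and a made-up sum function: the scalar loop issues one load per element, i.e. eight loads per 64 bytes, while the vector loop covers the same 64 bytes with a single 512-bit load.)

    #include <immintrin.h>
    #include <cstddef>

    // Scalar: 8 separate loads per 64-byte chunk.
    double sum_scalar(const double* p, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }

    // AVX-512: one 512-bit load covers the same 8 doubles. Assumes n % 8 == 0.
    double sum_avx512(const double* p, std::size_t n) {
        __m512d acc = _mm512_setzero_pd();
        for (std::size_t i = 0; i < n; i += 8)
            acc = _mm512_add_pd(acc, _mm512_loadu_pd(p + i));
        return _mm512_reduce_add_pd(acc);  // horizontal sum of the 8 lanes
    }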
Ah, so it's not so much about adjacency (which would even benefit an implementation that insisted on loading bytes individually) but about the number of operations required to bucket-brigade those zeroes and ones into registerland once the cache side is solved (which I'd have very much expected to be the case in the baseline of the comparison, just like the other replies assumed).
I was close to dismissing your reply as merely a nomenclature nitpick, but I think I have learned something interesting, thanks!
People already do optimizations like this all the time when they're working on low-level code that can benefit from it. Sorting is actually a good example: all of the major sort implementations typically use quicksort when the array size is large enough, and then at some level of the recursion the subarrays get small enough that insertion sort (or even a sorting network) is faster. So sorting a large array will use at least two different sorting methods depending on what level of recursion is happening.
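(A minimal sketch of that hybrid pattern. The cutoff of 16 is made up; real implementations tune it, and introsort-style ones also cap the recursion depth and fall back to heapsort, which is omitted here.)

    #include <algorithm>
    #include <vector>

    using Iter = std::vector<int>::iterator;
    constexpr long CUTOFF = 16;  // made-up threshold; real libraries tune this value

    void hybrid_sort(Iter first, Iter last) {
        if (last - first <= CUTOFF) {
            // Small subarray: plain insertion sort (rotate each element into place).
            for (Iter it = first; it != last; ++it)
                std::rotate(std::upper_bound(first, it, *it), it, it + 1);
            return;
        }
        // Large subarray: three-way quicksort partition around a middle pivot.
        int pivot = *(first + (last - first) / 2);
        Iter mid1 = std::partition(first, last, [&](int x) { return x < pivot; });
        Iter mid2 = std::partition(mid1, last, [&](int x) { return x == pivot; });
        hybrid_sort(first, mid1);
        hybrid_sort(mid2, last);
    }

Called as hybrid_sort(v.begin(), v.end()).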
You can get information about the cache line sizes and cache hierarchy at runtime from sysfs/sysconf, but I don't think many people actually do this. Instead they just pick sizes that work well on the common architectures they expect to run on, since these cache sizes don't change frequently. If you really want to optimize things, when you compile with -march=native (or some specific target architecture) GCC will implicitly add a bunch of extra flags to the compiler invocation (--param l1-cache-line-size=... and friends, visible with -###) that describe the cache sizes/hierarchy of the target architecture.
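(For example, on Linux with glibc the runtime query looks roughly like this; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may report 0 on some systems, so reading the sysfs file is a common fallback.)

    #include <unistd.h>
    #include <cstdio>

    int main() {
        // glibc extension: L1 data cache line size in bytes (0 or -1 if unknown).
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (line <= 0) {
            // Fallback: read it from sysfs.
            FILE* f = std::fopen(
                "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");
            if (f) {
                if (std::fscanf(f, "%ld", &line) != 1) line = 0;
                std::fclose(f);
            }
        }
        std::printf("L1 cache line size: %ld bytes\n", line);
    }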
It's not "8-wide", it's "512-bits wide". The basic "foundation" profile supports splitting up those bits into 8 qword, 16 dword, etc. while other profiles support finer granularity up to 64 bytes. Plus you get more registers, new instructions, and so on.