AVX-512 is wide enough to process eight 64-bit floats at once. Getting a 10x speedup out of an 8-wide SIMD unit is a little difficult to explain; some of it is presumably coming from eliminating branches, in addition to the vector width itself. It's extremely impressive. Also, it has taken Intel a surprisingly long time!
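(To make the "fewer branches" point concrete, here's a rough sketch of how an AVX-512 partition step can replace a per-element branch with a compare mask plus a compress-store. This is purely illustrative, not the actual library code; the name partition_lt and the n % 8 == 0 assumption are mine.)

    #include <immintrin.h>
    #include <cstddef>

    // Hypothetical sketch: copy every double in src[0..n) that is < pivot
    // into dst, with no per-element branch. Requires AVX-512F; assumes n % 8 == 0.
    std::size_t partition_lt(const double* src, std::size_t n, double pivot, double* dst) {
        const __m512d pv = _mm512_set1_pd(pivot);
        std::size_t written = 0;
        for (std::size_t i = 0; i < n; i += 8) {
            __m512d v = _mm512_loadu_pd(src + i);
            // One compare produces an 8-bit mask instead of 8 branch decisions.
            __mmask8 lt = _mm512_cmp_pd_mask(v, pv, _CMP_LT_OQ);
            // Compress-store packs the selected lanes contiguously into dst.
            _mm512_mask_compressstoreu_pd(dst + written, lt, v);
            written += static_cast<std::size_t>(__builtin_popcount(lt));
        }
        return written;
    }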
That sounds suspiciously as if an implementation tailored to that cache line size might see a considerable part of the speedup even running on non-SIMD operations? (or on less wide SIMD)
Wrong side of the L1 cache. Cache lines are how the L1 cache talks to the L2/L3 caches.
I'm talking about the load/store units in the CPU core, or the core <--> L1 cache communication. This side is less commonly discussed online, but I'm pretty sure it's important in this AVX-512 discussion. (To be fair, I probably should have said "load/store unit" instead of L1 cache in my previous post, which would have been clearer.)
-------------
Modern CPU cores only have a limited number of load/store units. It's superscalar of course, something like 4 loads/stores per clock tick, but still limited. By "batching" your loads/stores into 512 bits at a time instead of 256 or 64, your CPU doesn't have to do as much work to talk to the L1 cache.
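(Rough illustration of the batching point, assuming AVX-512F and a made-up sum function: the scalar loop issues one load per element, i.e. eight loads per 64 bytes, while the vector loop covers the same 64 bytes with a single 512-bit load.)

    #include <immintrin.h>
    #include <cstddef>

    // Scalar: 8 separate loads per 64-byte chunk.
    double sum_scalar(const double* p, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += p[i];
        return s;
    }

    // AVX-512: one 512-bit load covers the same 8 doubles. Assumes n % 8 == 0.
    double sum_avx512(const double* p, std::size_t n) {
        __m512d acc = _mm512_setzero_pd();
        for (std::size_t i = 0; i < n; i += 8)
            acc = _mm512_add_pd(acc, _mm512_loadu_pd(p + i));
        return _mm512_reduce_add_pd(acc);  // horizontal sum of the 8 lanes
    }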
Ah, so it's not so much about adjacency (which would even benefit an implementation that insisted on loading bytes individually) but about the number of operations required to bucket-brigade those zeroes and ones into registerland once the cache side is solved (which I'd have very much expected to be the case in the baseline of the comparison, just like the other replies assumed).
I was close to dismissing your reply as merely a nomenclature nitpick, but I think I have learned something interesting, thanks!
People already do optimizations like this all the time when they're working on low-level code that can benefit from it. Sorting is actually a good example: all of the major sort implementations typically use quicksort when the array size is large enough, and then at some level of the recursion the subarrays get small enough that insertion sort (or even a sorting network) is faster. So sorting a large array will use at least two different sorting methods depending on what level of recursion is happening.
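(A minimal sketch of that hybrid pattern. The cutoff of 16 is made up; real implementations tune it, and introsort-style ones also cap the recursion depth and fall back to heapsort, which is omitted here.)

    #include <algorithm>
    #include <vector>

    using Iter = std::vector<int>::iterator;
    constexpr long CUTOFF = 16;  // made-up threshold; real libraries tune this value

    void hybrid_sort(Iter first, Iter last) {
        if (last - first <= CUTOFF) {
            // Small subarray: plain insertion sort (rotate each element into place).
            for (Iter it = first; it != last; ++it)
                std::rotate(std::upper_bound(first, it, *it), it, it + 1);
            return;
        }
        // Large subarray: three-way quicksort partition around a middle pivot.
        int pivot = *(first + (last - first) / 2);
        Iter mid1 = std::partition(first, last, [&](int x) { return x < pivot; });
        Iter mid2 = std::partition(mid1, last, [&](int x) { return x == pivot; });
        hybrid_sort(first, mid1);
        hybrid_sort(mid2, last);
    }

Called as hybrid_sort(v.begin(), v.end()).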
You can get information about the cache line sizes and cache hierarchy at runtime from sysfs/sysconf, but I don't think many people actually do this. Instead they just pick sizes that work well on the common architectures they expect to run on, since these cache sizes don't change frequently. If you really want to optimize things, when you compile with -march=native (or some specific target architecture) GCC will implicitly add a bunch of extra flags to the compiler invocation (--param l1-cache-line-size=... and friends, visible with -###) that describe the cache sizes/hierarchy of the target architecture.
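(For example, on Linux with glibc the runtime query looks roughly like this; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may report 0 on some systems, so reading the sysfs file is a common fallback.)

    #include <unistd.h>
    #include <cstdio>

    int main() {
        // glibc extension: L1 data cache line size in bytes (0 or -1 if unknown).
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
        if (line <= 0) {
            // Fallback: read it from sysfs.
            FILE* f = std::fopen(
                "/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");
            if (f) {
                if (std::fscanf(f, "%ld", &line) != 1) line = 0;
                std::fclose(f);
            }
        }
        std::printf("L1 cache line size: %ld bytes\n", line);
    }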
It's not "8-wide", it's "512-bits wide". The basic "foundation" profile supports splitting up those bits into 8 qword, 16 dword, etc. while other profiles support finer granularity up to 64 bytes. Plus you get more registers, new instructions, and so on.