
More channels (and the fact that each DDR5 DIMM is basically two channels) is a major win that doesn't depend on prefetching.

It is perhaps underappreciated that CPU cores have gotten a lot more "thready" even in the absence of SMT, since each core can have hundreds of instructions in flight. You know, kind of like how GPU cores (CUs, SMs) do...
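One way to see those hundreds of in-flight instructions at work is pointer chasing: a single dependent chain serializes the cache misses, while a few independent chains let the out-of-order core overlap them. A minimal sketch of that effect (my own illustration, not from the thread; the sizes are arbitrary):

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        // A permutation far bigger than the last-level cache, so every
        // step of a chase is (roughly) a DRAM access.
        constexpr std::size_t N = std::size_t{1} << 24;
        std::vector<std::size_t> next(N);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's algorithm: a random single-cycle permutation, so a
        // chase visits every slot instead of looping in a short cycle.
        std::mt19937_64 rng{42};
        for (std::size_t i = N - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        constexpr std::size_t steps = 1 << 22;
        using clk = std::chrono::steady_clock;

        // One dependent chain: each load waits for the previous miss.
        std::size_t p = 0;
        auto t0 = clk::now();
        for (std::size_t s = 0; s < steps; ++s) p = next[p];
        auto t1 = clk::now();

        // Four independent chains: 4x the loads, but the out-of-order
        // core keeps several misses in flight at once.
        std::size_t a = 0, b = 1, c = 2, d = 3;
        auto t2 = clk::now();
        for (std::size_t s = 0; s < steps; ++s) {
            a = next[a]; b = next[b]; c = next[c]; d = next[d];
        }
        auto t3 = clk::now();

        auto ms = [](auto dt) {
            return std::chrono::duration<double, std::milli>(dt).count();
        };
        // Print the chased values so the loops cannot be optimized away.
        std::printf("1 chain : %8.1f ms (sink %zu)\n", ms(t1 - t0), p);
        std::printf("4 chains: %8.1f ms (sink %zu)\n", ms(t3 - t2),
                     a ^ b ^ c ^ d);
        return 0;
    }

On typical hardware the four-chain loop does 4x the loads in roughly the same wall time, because the misses overlap instead of queueing behind each other.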



It depends on prefetching very much.

I have a similar simulation code, which can saturate the cores pretty efficiently (it was running at >1.7M integrations/sec/core and scaling linearly on last decade's hardware).

But when you access square submatrices inside a 3000x3000 matrix, the CPU prefetcher brings in the whole rows, and you discard most of each row to get to another part of the matrix, so that bandwidth is wasted.

Instead, you can rearrange your matrices to make the prefetcher happy and stop throwing away what it brings in, because you make it fetch exactly what is going to be used next.
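Roughly this kind of rearrangement (an illustrative sketch, not the commenter's actual code; the tile edge B is a made-up value): store the big matrix as contiguous tiles instead of row-major, so pulling one submatrix is a single linear run of memory that matches what a stream prefetcher fetches.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    constexpr std::size_t N = 3000;  // big matrix edge, as in the comment
    constexpr std::size_t B = 60;    // tile edge (hypothetical; divides N)

    // Naive row-major layout: rows of a B x B submatrix are short strided
    // reads, and the stream prefetcher keeps pulling the rest of each
    // 3000-wide row, which you then discard.
    inline double& at_rowmajor(std::vector<double>& m,
                               std::size_t r, std::size_t c) {
        return m[r * N + c];
    }

    // Tiled layout: tiles ordered row-major, each B x B tile contiguous.
    // Reading one tile is now B*B consecutive doubles, exactly what a
    // linear prefetcher will bring in next anyway.
    inline double& at_tiled(std::vector<double>& m,
                            std::size_t r, std::size_t c) {
        const std::size_t tile = (r / B) * (N / B) + (c / B);  // which tile
        const std::size_t off  = (r % B) * B + (c % B);        // inside it
        return m[tile * B * B + off];
    }

    int main() {
        std::vector<double> m(N * N, 0.0);
        // Walk one B x B submatrix. Via at_tiled the touched addresses
        // are B*B consecutive doubles; via at_rowmajor they would be B
        // widely separated strided chunks.
        double sum = 0.0;
        for (std::size_t r = 0; r < B; ++r)
            for (std::size_t c = 0; c < B; ++c)
                sum += at_tiled(m, r, c);
        std::printf("%f\n", sum);
        return 0;
    }

The catch is that every other access pattern in the codebase has to go through the tiled indexing too, which is presumably the refactoring cost mentioned below.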

When I was doing the last analysis runs, the code had ~20% cache waste, meaning I was throwing away 20% of the data I pulled in. That's a huge loss, considering I was saturating memory bandwidth before the cores.


That's an interesting aside, but unless I'm missing something, the amount of waste due to the prefetcher isn't at all proportional to the number of available memory channels, so it's a non sequitur.


No, the wastage happens because I'm not accessing data in a "linear" fashion when I lay the matrices out naively (I'm actually pulling n by n square submatrices from the big matrix). The prefetcher brings in data according to its baked-in locality rules, but its assumptions don't fit my access pattern, so 20 percent of the data is thrown away.

If I change how I fill and lay out my matrices, I can align my data with what the prefetcher assumes, and far less data will be wasted in the process.

I didn't do that because it would have required a codebase-wide refactoring, and my code was already 30x faster than the reference implementation with better accuracy, so we didn't pursue that avenue.


Interesting, but is it even possible to optimize access to very sparse and unstructured matrices from finite elements?


My matrices are dense; I'm working with boundary elements. But there are libraries optimized for efficient storage of and access to sparse matrices (e.g., Eigen has both sparse and dense sub-libraries), and there's a whole literature on sparse matrices and the operations on them.
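For reference, the usual Eigen sparse workflow looks something like this (a generic sketch, not the parent's code; the tridiagonal stencil is just a toy stand-in for real element contributions):

    // Assemble a sparse matrix from (row, col, value) triplets, the
    // common pattern for finite-element style assembly, then use it.
    #include <Eigen/Dense>
    #include <Eigen/Sparse>
    #include <vector>

    int main() {
        const int n = 1000;
        std::vector<Eigen::Triplet<double>> entries;
        // Toy tridiagonal stencil; a real FEM assembly would push one
        // triplet per nonzero element contribution.
        for (int i = 0; i < n; ++i) {
            entries.emplace_back(i, i, 2.0);
            if (i + 1 < n) {
                entries.emplace_back(i, i + 1, -1.0);
                entries.emplace_back(i + 1, i, -1.0);
            }
        }
        Eigen::SparseMatrix<double> A(n, n);
        A.setFromTriplets(entries.begin(), entries.end());  // compress

        Eigen::VectorXd x = Eigen::VectorXd::Ones(n);
        Eigen::VectorXd y = A * x;  // sparse matrix-vector product
        return y.size() == n ? 0 : 1;
    }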

However, my knowledge is limited on the sparse side. If you want, I can try to dig into it a bit.



