
More channels (and the fact that each DDR5 DIMM is basically two channels) is a major win that doesn't depend on prefetching.

It is perhaps underappreciated that CPU cores have gotten a lot more "thready" even in the absence of SMT, since each core can have hundreds of instructions in flight. You know, kind of like how GPU cores (CUs, SMs) do...
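One way to see those hundreds of in-flight instructions at work is pointer chasing: a single dependent chain serializes the cache misses, while a few independent chains let the out-of-order core overlap them. A minimal sketch of that effect (my own illustration, not from the thread; the sizes are arbitrary):

    #include <chrono>
    #include <cstdio>
    #include <numeric>
    #include <random>
    #include <utility>
    #include <vector>

    int main() {
        // A permutation far bigger than the last-level cache, so every
        // step of a chase is (roughly) a DRAM access.
        constexpr std::size_t N = std::size_t{1} << 24;
        std::vector<std::size_t> next(N);
        std::iota(next.begin(), next.end(), std::size_t{0});

        // Sattolo's algorithm: a random single-cycle permutation, so a
        // chase visits every slot instead of looping in a short cycle.
        std::mt19937_64 rng{42};
        for (std::size_t i = N - 1; i > 0; --i) {
            std::uniform_int_distribution<std::size_t> pick(0, i - 1);
            std::swap(next[i], next[pick(rng)]);
        }

        constexpr std::size_t steps = 1 << 22;
        using clk = std::chrono::steady_clock;

        // One dependent chain: each load waits for the previous miss.
        std::size_t p = 0;
        auto t0 = clk::now();
        for (std::size_t s = 0; s < steps; ++s) p = next[p];
        auto t1 = clk::now();

        // Four independent chains: 4x the loads, but the out-of-order
        // core keeps several misses in flight at once.
        std::size_t a = 0, b = 1, c = 2, d = 3;
        auto t2 = clk::now();
        for (std::size_t s = 0; s < steps; ++s) {
            a = next[a]; b = next[b]; c = next[c]; d = next[d];
        }
        auto t3 = clk::now();

        auto ms = [](auto dt) {
            return std::chrono::duration<double, std::milli>(dt).count();
        };
        // Print the chased values so the loops cannot be optimized away.
        std::printf("1 chain : %8.1f ms (sink %zu)\n", ms(t1 - t0), p);
        std::printf("4 chains: %8.1f ms (sink %zu)\n", ms(t3 - t2),
                     a ^ b ^ c ^ d);
        return 0;
    }

On typical hardware the four-chain loop does 4x the loads in roughly the same wall time, because the misses overlap instead of queueing behind each other.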



It depends on prefetching very much.

I have a similar simulation code, which can saturate the cores pretty efficiently (it was running at >1.7M integrations/sec/core and scaling linearly on last decade's hardware).

But when you access square submatrices inside a 3000x3000 matrix, the CPU prefetcher brings in the whole rows, and you discard most of each row to get to another part of the matrix, so that bandwidth is wasted.

Instead, you can rearrange your matrices to make the prefetcher happy and stop throwing away what it brings in, because you make it fetch exactly what is going to be used next.
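Roughly this kind of rearrangement (an illustrative sketch, not the commenter's actual code; the tile edge B is a made-up value): store the big matrix as contiguous tiles instead of row-major, so pulling one submatrix is a single linear run of memory that matches what a stream prefetcher fetches.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    constexpr std::size_t N = 3000;  // big matrix edge, as in the comment
    constexpr std::size_t B = 60;    // tile edge (hypothetical; divides N)

    // Naive row-major layout: rows of a B x B submatrix are short strided
    // reads, and the stream prefetcher keeps pulling the rest of each
    // 3000-wide row, which you then discard.
    inline double& at_rowmajor(std::vector<double>& m,
                               std::size_t r, std::size_t c) {
        return m[r * N + c];
    }

    // Tiled layout: tiles ordered row-major, each B x B tile contiguous.
    // Reading one tile is now B*B consecutive doubles, exactly what a
    // linear prefetcher will bring in next anyway.
    inline double& at_tiled(std::vector<double>& m,
                            std::size_t r, std::size_t c) {
        const std::size_t tile = (r / B) * (N / B) + (c / B);  // which tile
        const std::size_t off  = (r % B) * B + (c % B);        // inside it
        return m[tile * B * B + off];
    }

    int main() {
        std::vector<double> m(N * N, 0.0);
        // Walk one B x B submatrix. Via at_tiled the touched addresses
        // are B*B consecutive doubles; via at_rowmajor they would be B
        // widely separated strided chunks.
        double sum = 0.0;
        for (std::size_t r = 0; r < B; ++r)
            for (std::size_t c = 0; c < B; ++c)
                sum += at_tiled(m, r, c);
        std::printf("%f\n", sum);
        return 0;
    }

The catch is that every other access pattern in the codebase has to go through the tiled indexing too, which is presumably the refactoring cost mentioned below.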

When I was doing the last analysis runs, the code had ~20% cache waste, meaning I was throwing away 20% of the data I pulled in. That's a huge loss, considering I was saturating memory bandwidth before the cores.


That's an interesting aside, but unless I'm missing something, the amount of waste due to the prefetcher isn't at all proportional to the number of available memory channels, so it's a non sequitur.


No, the wastage happens because I'm not accessing data in a "linear" fashion when I lay the matrices out naively (I'm actually pulling n by n square submatrices from the big matrix). The prefetcher brings in data according to its baked-in locality rules, but its assumptions don't fit my access pattern, so 20 percent of the data is thrown away.

If I change how I fill and lay out my matrices, I can align my data with what the prefetcher assumes, and far less data will be wasted in the process.

I didn't do that because it would have required a codebase-wide refactoring, and my code was already 30x faster than the reference implementation with better accuracy, so we didn't pursue that avenue.


Interesting, but is it even possible to optimize access to very sparse and unstructured matrices from finite elements?


My matrices are dense; I'm working with boundary elements. But there are libraries optimized for efficient storage of and access to sparse matrices (e.g., Eigen has both sparse and dense sub-libraries), and there's a whole literature on sparse matrices and the operations on them.
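For reference, the usual Eigen sparse workflow looks something like this (a generic sketch, not the parent's code; the tridiagonal stencil is just a toy stand-in for real element contributions):

    // Assemble a sparse matrix from (row, col, value) triplets, the
    // common pattern for finite-element style assembly, then use it.
    #include <Eigen/Dense>
    #include <Eigen/Sparse>
    #include <vector>

    int main() {
        const int n = 1000;
        std::vector<Eigen::Triplet<double>> entries;
        // Toy tridiagonal stencil; a real FEM assembly would push one
        // triplet per nonzero element contribution.
        for (int i = 0; i < n; ++i) {
            entries.emplace_back(i, i, 2.0);
            if (i + 1 < n) {
                entries.emplace_back(i, i + 1, -1.0);
                entries.emplace_back(i + 1, i, -1.0);
            }
        }
        Eigen::SparseMatrix<double> A(n, n);
        A.setFromTriplets(entries.begin(), entries.end());  // compress

        Eigen::VectorXd x = Eigen::VectorXd::Ones(n);
        Eigen::VectorXd y = A * x;  // sparse matrix-vector product
        return y.size() == n ? 0 : 1;
    }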

However, my knowledge is limited on the sparse side. If you want, I can try to dig into it a bit.



