I agree with you, which is why I don't really understand the point of improving FP64 perf by 4x if FP64 compute is not the bottleneck for many apps.
A node with 4x MI250X has more or less the same memory bandwidth as a DGX-A100 (8x A100).
It has 2x more FP64 compute, but most science and engineering apps are memory bound, so 2x more FP64 compute does not make them any faster.
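
To make the memory-bound argument concrete, here is a minimal roofline-style sketch. The two nodes are hypothetical and use normalized units (same bandwidth, one with 2x the peak FP64), not real hardware specs, and the function names are made up for illustration.

```python
def attainable_flops(peak_flops, mem_bw, arithmetic_intensity):
    """Roofline bound: performance is capped by either peak compute
    or by (FLOPs per byte moved) * memory bandwidth, whichever is lower."""
    return min(peak_flops, arithmetic_intensity * mem_bw)

# Hypothetical nodes in normalized units (assumption, not vendor numbers):
# same memory bandwidth, node_b has 2x the peak FP64 compute.
node_a = {"peak_flops": 1.0, "mem_bw": 1.0}
node_b = {"peak_flops": 2.0, "mem_bw": 1.0}

def speedup(arithmetic_intensity):
    """Speedup of node_b over node_a for a kernel with the given intensity."""
    a = attainable_flops(**node_a, arithmetic_intensity=arithmetic_intensity)
    b = attainable_flops(**node_b, arithmetic_intensity=arithmetic_intensity)
    return b / a

if __name__ == "__main__":
    for ai in (0.25, 1.5, 4.0):
        print(f"intensity={ai}: speedup={speedup(ai)}")
```

In this toy model a low-intensity (memory-bound) kernel sees no speedup from the extra FP64, and only kernels with intensity past the second node's ridge point get the full 2x.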