
That's exactly what Nvidia is doing with tensor cores.


Except the native width of Tensor Cores is about 8-32 (depending on scalar type), whereas the width of TPUs is up to 256. The difference in scale is massive.
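
For concreteness, the instruction-level shape on the Nvidia side is small: the portable CUDA WMMA API exposes a 16x16x16 fp16 tile, i.e. one warp multiplies a pair of 16x16 operand tiles per mma_sync. A minimal sketch (one warp, device pointers assumed, launch and error handling omitted):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes one 16x16 output tile: the m16n16k16 fp16 WMMA shape.
    // a, b are 16x16 row-major half; c is 16x16 row-major float.
    __global__ void wmma_16x16(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> af;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bf;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> cf;

        wmma::fill_fragment(cf, 0.0f);
        wmma::load_matrix_sync(af, a, 16);   // leading dimension 16
        wmma::load_matrix_sync(bf, b, 16);
        wmma::mma_sync(cf, af, bf, cf);      // cf += af * bf
        wmma::store_matrix_sync(c, cf, 16, wmma::mem_row_major);
    }
    // Launch with a single warp: wmma_16x16<<<1, 32>>>(a, b, c);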


I think Hopper's native matmul tile is 64x64, and Blackwell's is 128x128.

See this blog post for a reference on Blackwell:

https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwe...


If it turns out to be useful, can't Nvidia just tweak a parameter in their Verilog and declare victory?

If not, what's fundamentally difficult about going from 32 to 256 here?
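
One way to frame it: anything wider than the native instruction is already composed in software by tiling, so the 32-vs-256 question is about how much operand reuse a single hardware unit captures (an NxN array does ~N^2 MACs per cycle while consuming only ~2N operands per cycle), not about what a kernel can express. A rough sketch of composing a 64x64 tile out of the 16x16x16 primitive (hypothetical kernel, single warp, row-major fp16 inputs, K a multiple of 16):

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp builds a 64x64 output tile by iterating 4x4 sub-tiles of the
    // native 16x16x16 MMA. a is 64xK, b is Kx64, c is 64x64, all row-major.
    __global__ void tile_64x64(const half *a, const half *b, float *c, int K) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> af;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> bf;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> cf;

        for (int i = 0; i < 4; ++i) {        // 4 * 16 = 64 output rows
            for (int j = 0; j < 4; ++j) {    // 4 * 16 = 64 output cols
                wmma::fill_fragment(cf, 0.0f);
                for (int k = 0; k < K; k += 16) {
                    wmma::load_matrix_sync(af, a + i * 16 * K + k, K);
                    wmma::load_matrix_sync(bf, b + k * 64 + j * 16, 64);
                    wmma::mma_sync(cf, af, bf, cf);
                }
                wmma::store_matrix_sync(c + i * 16 * 64 + j * 16, cf, 64,
                                        wmma::mem_row_major);
            }
        }
    }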


Nobody cares about width; they care about TFLOPs.
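
To make that concrete: peak matmul throughput is 2*M*N*K FLOPs per MMA, times issue rate, times unit count, times clock, so a narrow tile issued often and a wide tile issued rarely can land on the same TFLOPS number. A toy calculation where every constant is a made-up placeholder, not a vendor spec:

    #include <cstdio>

    // Peak TFLOPS = 2*M*N*K FLOPs/MMA * MMAs/clock * units * clock(GHz) / 1000.
    static double peak_tflops(double m, double n, double k,
                              double mmas_per_clock, double units, double ghz) {
        return 2 * m * n * k * mmas_per_clock * units * ghz / 1e3;
    }

    int main() {
        // A 16-wide tile issued every 4 clocks vs. a 64-wide tile issued
        // 64x less often: identical peak (both print ~307 TFLOPS here).
        printf("16-wide: %.1f TFLOPS\n",
               peak_tflops(16, 16, 16, 0.25, 100, 1.5));
        printf("64-wide: %.1f TFLOPS\n",
               peak_tflops(64, 64, 64, 0.25 / 64, 100, 1.5));
        return 0;
    }

Width does still matter indirectly, through how much operand movement each instruction amortizes; it just doesn't show up on the spec sheet by itself.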



