Strassen actually doesn't require more memory for intermediate results (see the "Strassen's Algorithm Reloaded" paper). The real problem is that it's a ton of work to implement well (even compared to a regular gemm, which is already hard), and the benefits only start showing up at pretty large (~4000x4000) matrices.
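For anyone who hasn't seen it written out, here's a rough sketch of the textbook recursion in plain Python. To be clear about what's assumed: the function names and the `cutoff` parameter are mine, it only handles square matrices stored as lists of lists, and it falls back to the naive product for small or odd sizes. It is the naive formulation, so it does allocate temporaries for every block sum and for each of the seven products. As I understand it, that is exactly the overhead the paper avoids by folding the additions into the packing and micro-kernel of a regular gemm, which this sketch makes no attempt to do.

```python
def matmul_naive(A, B):
    """Plain O(n^3) product of two matrices given as lists of lists."""
    Bt = list(zip(*B))  # columns of B, so each entry is a row-by-column dot product
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def add(X, Y):
    """Elementwise X + Y (allocates a new matrix)."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def sub(X, Y):
    """Elementwise X - Y (allocates a new matrix)."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Cut an even-sized square matrix into its four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B, cutoff=64):
    """Textbook Strassen for square matrices.

    Below `cutoff` (a hypothetical tuning knob), or when the size is odd,
    it hands off to the naive product.
    """
    n = len(A)
    if n <= cutoff or n % 2:
        return matmul_naive(A, B)

    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)

    # Seven recursive products instead of eight.
    M1 = strassen(add(A11, A22), add(B11, B22), cutoff)
    M2 = strassen(add(A21, A22), B11, cutoff)
    M3 = strassen(A11, sub(B12, B22), cutoff)
    M4 = strassen(A22, sub(B21, B11), cutoff)
    M5 = strassen(add(A11, A12), B22, cutoff)
    M6 = strassen(sub(A21, A11), add(B11, B12), cutoff)
    M7 = strassen(sub(A12, A22), add(B21, B22), cutoff)

    # Recombine the products into the four quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)

    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Even in this toy form you can see where the effort goes: all the extra additions and copies have to be paid for by the one multiplication you save per level, which is why the cutoff matters so much and why the payoff only shows up once the matrices get big.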