The floating-point units (FPUs) in most CPUs are 64 bits wide (some are even wider internally), but this does not matter here. We don't care how many bits our floats take inside the CPU; we care about how much space they take in our server's RAM. From our point of view, we supply the weights and inputs to the CPU, both in FP16. The CPU multiplies them using 64-bit multipliers and produces a result, which is cast back to FP16, and that's what gets stored in memory.
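A rough way to see this in code, as a sketch rather than a description of how any particular inference stack works: the snippet below uses Python's standard `struct` module, whose `e` format is IEEE 754 half precision, to stand in for values stored in RAM. The helper names and the sample weights and inputs are made up for illustration. Python's own `float` is a 64-bit double, so it plays the role of the wider arithmetic inside the CPU.

```python
import struct


def to_fp16_bytes(values):
    """Pack Python floats (64-bit doubles) into half-precision bytes."""
    return struct.pack(f"<{len(values)}e", *values)


def from_fp16_bytes(buf):
    """Unpack half-precision bytes into Python floats (64-bit doubles)."""
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))


# Hypothetical weights and inputs, held "in RAM" as FP16: 2 bytes per value.
weights_mem = to_fp16_bytes([0.5, 0.25, 1.5, 2.0])
inputs_mem = to_fp16_bytes([1.0, 2.0, 3.0, 4.0])

# Load the values and multiply them at 64-bit precision.
weights = from_fp16_bytes(weights_mem)
inputs = from_fp16_bytes(inputs_mem)
products_wide = [w * x for w, x in zip(weights, inputs)]

# Cast the results back to FP16 before they go back to memory.
products_mem = to_fp16_bytes(products_wide)

# Storage cost is 2 bytes per value on both sides of the multiply.
print(len(weights_mem), len(inputs_mem), len(products_mem))  # 8 8 8
```

However wide the intermediate arithmetic is, the buffers that live in memory stay at 2 bytes per value, which is the only number that affects how much RAM the model needs.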