I think when they bring up differential amplifiers, they're referring more to the DSP technique behind headphone noise cancelling, but the actual electrical behavior of a differential amplifier muddies the message a bit.
It sort of feels closer to heterodyning and "demodulating" the signal encoded in the softmax. Those tiny errors we're trying to denoise with this technique are almost closer to carrier waves (once encoded into the softmax) than to noise, imo. This wouldn't get rid of noise in the training data or noise in the dimensionality of the key/value space. It's really only removing noise introduced by the process itself.
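To make the "noise introduced by the process itself" point a bit more concrete, here's a rough sketch. It assumes the technique under discussion boils down to subtracting one softmax attention map from another, softmax(A) - lam * softmax(B). The logits and the fixed `lam` are made-up numbers for illustration (as far as I understand, the real weighting would be learned), and `differential_attention` is just a hypothetical helper name.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def differential_attention(scores_a, scores_b, lam):
    """Subtract a second softmax map, scaled by lam, from the first."""
    a = softmax(scores_a)
    b = softmax(scores_b)
    return [x - lam * y for x, y in zip(a, b)]


# Hypothetical logits for one query over 5 keys. Key 2 is the "relevant" one.
# Both maps share the same small background logits on the other keys.
scores_a = [0.1, 0.2, 2.0, 0.15, 0.1]  # relevant key plus background
scores_b = [0.1, 0.2, 0.0, 0.15, 0.1]  # background only
lam = 0.5  # assumed fixed value, chosen by hand for this example

plain = softmax(scores_a)
diff = differential_attention(scores_a, scores_b, lam)

for i, (p, d) in enumerate(zip(plain, diff)):
    print(f"key {i}: plain={p:.4f}  differential={d:+.4f}")
```

With these numbers the background keys drop from roughly 0.09 to 0.10 each under plain softmax to under 0.01 in magnitude after the subtraction, while key 2 stays dominant. What gets cancelled is the attention mass that softmax spreads over keys both maps treat as background. If the logits themselves were wrong because of bad training data, the subtraction has nothing to cancel against.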