What's the state of the art in quantization methods these days that one might apply to a model like Llama 3? Any particular literature to read?
Of course, priorities differ across methods. Rather than saving space or speeding up computation, I'm simply interested in static quantization where integer weights multiply integer activations (e.g. 8-bit integers).
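To make concrete what I mean, here's a minimal NumPy sketch of symmetric per-tensor int8 quantization, where the matmul itself runs entirely in integer arithmetic. The function name, the seed, and the clip range of 127 are my own illustrative choices, not taken from any particular method:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x ~= scale * q, with q in int8."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)  # stand-in weights
a = rng.standard_normal(8).astype(np.float32)       # stand-in activations

qW, sW = quantize_int8(W)
qa, sa = quantize_int8(a)

# The multiply-accumulate is pure integer arithmetic (int8 inputs,
# int32 accumulator); the two float scales are applied once at the end.
acc = qW.astype(np.int32) @ qa.astype(np.int32)
y = acc.astype(np.float32) * (sW * sa)

print(np.max(np.abs(y - W @ a)))  # small quantization error
```

The point for the ZK setting is that the inner loop (`acc`) involves only integers, so the float scales can be pulled outside the part of the computation being proved.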
(As for motivation: such quantization enables proving correct execution of inference in sublinear time, at least asymptotically. I'm talking about ZK tech.)
Where are f32 and f16 used? I see a lot of `.float()` and `.type_as()` in the model file, and nothing explicit about f16. Are the weights and all the activations in f32?
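One quick way to check, independent of what the code declares: inspect the parameter dtypes directly and see what `.float()`/`.type_as()` actually do. This is a generic PyTorch sketch; the toy two-layer model is a hypothetical stand-in for the real loaded checkpoint:

```python
import torch
import torch.nn as nn
from collections import Counter

# Toy stand-in; with the actual Llama code you'd load the real checkpoint
# and inspect it the same way.
model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))

# Tally parameter dtypes to see what the weights actually are.
print(Counter(p.dtype for p in model.parameters()))

# .float() always upcasts to fp32; .type_as(t) casts to t's dtype.
# So a .float() call forces fp32 (e.g. around norms), and .type_as()
# casts the result back to whatever dtype the surrounding tensors use.
x = torch.randn(2, 16, dtype=torch.float16)
print(x.float().dtype)                   # torch.float32
print(x.type_as(model[0].weight).dtype)  # matches the weight dtype
```

So the `.float()`/`.type_as()` pairing is dtype-agnostic by design: if the checkpoint is loaded in f16 (or bf16), `.type_as()` returns activations to that dtype, and nothing needs to mention f16 explicitly.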