What's the state of the art in quantization methods these days that one might apply to a model like Llama 3? Any particular literature to read?
Of course, priorities differ across methods. Rather than saving space or speeding up computation, I'm simply interested in static quantization where integer weights multiply integer activations (e.g. 8-bit integers).
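To make concrete what I mean, here's a minimal NumPy sketch of symmetric per-tensor int8 quantization, where the matmul itself runs entirely in integer arithmetic. The function name, the seed, and the clip range of 127 are my own illustrative choices, not taken from any particular method:

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: x ~= scale * q, with q in int8."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)  # stand-in weights
a = rng.standard_normal(8).astype(np.float32)       # stand-in activations

qW, sW = quantize_int8(W)
qa, sa = quantize_int8(a)

# The multiply-accumulate is pure integer arithmetic (int8 inputs,
# int32 accumulator); the two float scales are applied once at the end.
acc = qW.astype(np.int32) @ qa.astype(np.int32)
y = acc.astype(np.float32) * (sW * sa)

print(np.max(np.abs(y - W @ a)))  # small quantization error
```

The point for the ZK setting is that the inner loop (`acc`) involves only integers, so the float scales can be pulled outside the part of the computation being proved.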
(As for motivation: such quantization enables proving correct execution of inference in sublinear time, at least asymptotically. I'm talking about ZK tech.)
Where are f32 and f16 used? I see a lot of `.float()` and `.type_as()` in the model file, and nothing explicit about f16. Are the weights and all the activations in f32?
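One quick way to check, independent of what the code declares: inspect the parameter dtypes directly and see what `.float()`/`.type_as()` actually do. This is a generic PyTorch sketch; the toy two-layer model is a hypothetical stand-in for the real loaded checkpoint:

```python
import torch
import torch.nn as nn
from collections import Counter

# Toy stand-in; with the actual Llama code you'd load the real checkpoint
# and inspect it the same way.
model = nn.Sequential(nn.Linear(16, 16), nn.LayerNorm(16))

# Tally parameter dtypes to see what the weights actually are.
print(Counter(p.dtype for p in model.parameters()))

# .float() always upcasts to fp32; .type_as(t) casts to t's dtype.
# So a .float() call forces fp32 (e.g. around norms), and .type_as()
# casts the result back to whatever dtype the surrounding tensors use.
x = torch.randn(2, 16, dtype=torch.float16)
print(x.float().dtype)                   # torch.float32
print(x.type_as(model[0].weight).dtype)  # matches the weight dtype
```

So the `.float()`/`.type_as()` pairing is dtype-agnostic by design: if the checkpoint is loaded in f16 (or bf16), `.type_as()` returns activations to that dtype, and nothing needs to mention f16 explicitly.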