
If they don't quantize the model, how do they achieve these speeds? Groq also says they don't quantize models (and I want to believe them) but we literally have no way to prove they're right.

This is important because their premium $50 plan (as opposed to $20 on Claude Pro or ChatGPT Plus) has to be justified by the speed. GLM 4.6 is fine, but I still don't think it's at the GPT-5/Claude Sonnet 4.5 level, so if I'm paying $50 for it on Cerebras it should be mainly for the speed.

What kind of workflow justifies this? I'm genuinely curious.



So apparently they have custom hardware built around absolutely gigantic chips, at the scale of a whole wafer at a time. Presumably they keep the entire model right on chip, in what's effectively L3 cache. So the memory bandwidth is absurdly fast, allowing very fast inference.

It's more expensive to get the same raw compute as a cluster of Nvidia chips, but an Nvidia cluster can't match this kind of peak per-request throughput.
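Back-of-envelope, the bandwidth story looks like this (a rough sketch; the ~3.35 TB/s HBM and ~21 PB/s wafer-scale SRAM figures are approximate public numbers used for illustration, not vendor-verified specs):

    # Back-of-envelope only: decode speed for a dense model is roughly
    # memory bandwidth / bytes read per token, since every weight is
    # streamed once per generated token. Figures below are rough public
    # numbers for illustration, not vendor-verified specs.

    def tokens_per_sec(bandwidth_gb_per_s: float, model_gb: float) -> float:
        """Upper-bound single-request decode rate for a dense model."""
        return bandwidth_gb_per_s / model_gb

    MODEL_GB = 120  # e.g. a ~120B-parameter model at 8-bit weights

    print(f"HBM GPU    (~3.35 TB/s): ~{tokens_per_sec(3_350, MODEL_GB):,.0f} tok/s")
    print(f"Wafer SRAM (~21 PB/s):   ~{tokens_per_sec(21_000_000, MODEL_GB):,.0f} tok/s")

Real numbers come out lower (MoE routing, batching, interconnect overhead), but the orders-of-magnitude gap is the point.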

As for the price, as a coder I'm giving the $50 plan a shot for a month. I haven't figured out how to adapt my workflow to the faster speeds yet (also still learning and setting up opencode).


For $50/month, it's a non-starter. I hope they can find a way to use all this excess bandwidth to put out a $10 equivalent to Claude Code instead of a 1000 tok/s party trick I can't use properly.


I feel the same and it's also why I can't understand all these people using small local models.

Every local model I've used, and even most open-source ones, are just not good.


The only good-enough models I still use are gpt-oss-120b-mxfp4 (not 20b) and glm-4.6 at q8 (not q4).

Quantization ruins models, and some models aren't that smart to begin with.


GLM-4.6 is on par with Sonnet 4.5. Sometimes it is better, sometimes it is worse. Give it a shot. It's the only model that made me (almost) ditch Claude. The only problem is, Claude Code is still the best agentic program in town and search doesn't function without a proper subscription.


Have you tried Claude Code Router with GLM 4.6?

https://github.com/musistudio/claude-code-router


z.ai hosted GLM 4.6 works great with claude code, drops right in


Have you tried opencode?


Cerebras offers pay-per-token. What are you asking for? Claude Code starts at $100, or $15/mtok. Cerebras is already much cheaper, but you want it to be even cheaper at $10?


$600 per year is a trivial cost for a professional tool


$600 per anything is Herman Miller territory, pal. I'm not paying that for a SaaS.


> but we literally have no way to prove they're right

Of course we do. Just run a benchmark with Cerebras/Groq and compare to the results produced in a trusted environment. If the scores are equal, the model is either unquantized, or quantized so well that we cannot tell, in which case it doesn't matter.

For example, here is a comparison of different providers for gpt-oss-120b, with a difference of over 10% between the best and worst providers.

https://artificialanalysis.ai/models/gpt-oss-120b/providers#...
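As a minimal sketch of that idea (assuming both providers expose an OpenAI-compatible chat/completions endpoint; the URLs, keys, and model id below are placeholders):

    # Send identical prompts at temperature 0 to two providers hosting the
    # same open-weight model and diff the answers. Endpoints are placeholders.
    import requests

    PROMPTS = ["What is 17 * 23?", "Name the capital of Australia."]

    def ask(base_url, api_key, model, prompt):
        r = requests.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model,
                "temperature": 0,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    for p in PROMPTS:
        a = ask("https://provider-a.example/v1", "KEY_A", "gpt-oss-120b", p)
        b = ask("https://trusted-host.example/v1", "KEY_B", "gpt-oss-120b", p)
        print(p, "| match:", a.strip() == b.strip())

In practice even an unquantized deployment won't match token-for-token (kernels and batching are nondeterministic), so you'd compare aggregate benchmark scores over many prompts rather than exact strings.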


Groq does quantise. Look at this benchmark from moonshotai for K2, where they compare their official implementation to third-party providers.

https://github.com/MoonshotAI/K2-Vendor-Verifier

It's one of the lowest-rated providers in that table.


> What kind of workflow justifies this?

Think about waiting for compilation to complete: the difference between 5 minutes and 15 seconds is dramatic.

Same applies to AI-based code-wrangling tasks. The preserved concentration may be well worth the $50, especially when paid by your employer.


They should offer a free trial so we can build confidence in the model quality (e.g., to make sure it's not nerfed/quantized/context-limited/etc.).


A trial is literally front and center on their website.


You can usually use them with things like OpenRouter. Load some credits there and use the API in your preferred IDE like you'd use any provider. For some quick tests it'll probably be <$5 for a few coding sessions, so you can check out the capabilities and see if it's worth it for you.
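For example (a sketch assuming the openai Python client pointed at OpenRouter's OpenAI-compatible endpoint; the model slug is illustrative, check the site for the exact id):

    # Quick capability check through OpenRouter before committing to a plan.
    # Assumes `pip install openai` and an OpenRouter API key.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible API
        api_key="YOUR_OPENROUTER_KEY",
    )

    resp = client.chat.completions.create(
        model="z-ai/glm-4.6",  # illustrative slug; verify the exact id on openrouter.ai
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
    )
    print(resp.choices[0].message.content)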


openrouter charges me $12 on a $100 credit...


> What kind of workflow justifies this? I'm genuinely curious.

Any workflow where verification is faster/cheaper than generation. If you have a well-tested piece of code and want to "refactor it to use such-and-such paradigm", you can run n fast model queries and pick the best result.

My colleagues who do frontend use faster models (not this one specifically, but they did try fast-code-1) to build components. Someone worked out a workflow with worktrees where the model generates n variants of a component and displays them next to each other. A human can choose at a glance which one they like, and sometimes pick and choose from multiple variants (something like passing it to Claude and saying "keep the styling of component A but the data management of component B"). At the end of the day it's faster/cheaper than having cc do all that work.
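The "n variants" part of a workflow like that could be as simple as this sketch (placeholder endpoint and model name, using the openai async client):

    # Generate n variants of the same component in parallel, write them to
    # files, and review them side by side. Endpoint/model are placeholders.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="https://fast-provider.example/v1", api_key="KEY")

    async def variant(prompt: str, seed: int) -> str:
        resp = await client.chat.completions.create(
            model="some-fast-model",
            temperature=0.8,  # keep some diversity between variants
            seed=seed,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    async def main() -> None:
        prompt = "Build a pricing-card React component with Tailwind styling."
        variants = await asyncio.gather(*(variant(prompt, s) for s in range(4)))
        for i, text in enumerate(variants):
            with open(f"variant_{i}.tsx", "w") as f:
                f.write(text)

    asyncio.run(main())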


It’s because the model weights and KV cache are stored in SRAM. It’s extremely expensive per token.



