
If they don't quantize the model, how do they achieve these speeds? Groq also says they don't quantize models (and I want to believe them) but we literally have no way to prove they're right.

This is important because their premium $50 plan (as opposed to $20 on Claude Pro or ChatGPT Plus) has to be justified by the speed. GLM 4.6 is fine, but I still don't think it's at the GPT-5/Claude Sonnet 4.5 level, so if I'm paying $50 for it on Cerebras it should be mainly for the speed.

What kind of workflow justifies this? I'm genuinely curious.



So apparently they have custom hardware built around absolutely gigantic chips, at the scale of a whole wafer at a time. Presumably they keep the entire model right on chip, in what's effectively L3 cache. So the memory bandwidth is absurdly fast, allowing very fast inference.

It's more expensive to get the same raw compute as a cluster of Nvidia chips, but an Nvidia cluster can't match this kind of peak per-request throughput.
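Back-of-envelope, the bandwidth story looks like this (a rough sketch; the ~3.35 TB/s HBM and ~21 PB/s wafer-scale SRAM figures are approximate public numbers used for illustration, not vendor-verified specs):

    # Back-of-envelope only: decode speed for a dense model is roughly
    # memory bandwidth / bytes read per token, since every weight is
    # streamed once per generated token. Figures below are rough public
    # numbers for illustration, not vendor-verified specs.

    def tokens_per_sec(bandwidth_gb_per_s: float, model_gb: float) -> float:
        """Upper-bound single-request decode rate for a dense model."""
        return bandwidth_gb_per_s / model_gb

    MODEL_GB = 120  # e.g. a ~120B-parameter model at 8-bit weights

    print(f"HBM GPU    (~3.35 TB/s): ~{tokens_per_sec(3_350, MODEL_GB):,.0f} tok/s")
    print(f"Wafer SRAM (~21 PB/s):   ~{tokens_per_sec(21_000_000, MODEL_GB):,.0f} tok/s")

Real numbers come out lower (MoE routing, batching, interconnect overhead), but the orders-of-magnitude gap is the point.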

As for the price, as a coder I'm giving the $50 plan a shot for a month. I haven't figured out how to adapt my workflow to the faster speeds yet (also still learning and setting up opencode).


For $50/month, it's a non-starter. I hope they can find a way to use all this excess bandwidth to put out a $10 equivalent to Claude Code instead of a 1000 tok/s party trick I can't use properly.


I feel the same and it's also why I can't understand all these people using small local models.

Every local model I've used, and even most open-source ones, are just not good.


The only good-enough models I still use are gpt-oss-120b-mxfp4 (not 20b) and glm-4.6 at q8 (not q4).

Quantization ruins models, and some models aren't that smart to begin with.


GLM-4.6 is on par with Sonnet 4.5. Sometimes it is better, sometimes it is worse. Give it a shot. It's the only model that made me (almost) ditch Claude. The only problem is, Claude Code is still the best agentic program in town and search doesn't function without a proper subscription.


Have you tried Claude Code Router with GLM 4.6?

https://github.com/musistudio/claude-code-router


z.ai hosted GLM 4.6 works great with claude code, drops right in


Have you tried opencode?


Cerebras offers pay-per-token. What are you asking for? Claude Code starts at $100, or $15/mtok. Cerebras is already much cheaper, but you want it to be even cheaper at $10?


$600 per year is a trivial cost for a professional tool


$600 per anything is Herman Miller territory, pal. I'm not paying that for a SaaS.


> but we literally have no way to prove they're right

Of course we do. Just run a benchmark with Cerebras/Groq and compare to the results produced in a trusted environment. If the scores are equal, the model is either unquantized, or quantized so well that we cannot tell, in which case it doesn't matter.

For example, here is a comparison of different providers for gpt-oss-120b, with a difference of over 10% between the best and worst providers.

https://artificialanalysis.ai/models/gpt-oss-120b/providers#...
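As a minimal sketch of that idea (assuming both providers expose an OpenAI-compatible chat/completions endpoint; the URLs, keys, and model id below are placeholders):

    # Send identical prompts at temperature 0 to two providers hosting the
    # same open-weight model and diff the answers. Endpoints are placeholders.
    import requests

    PROMPTS = ["What is 17 * 23?", "Name the capital of Australia."]

    def ask(base_url, api_key, model, prompt):
        r = requests.post(
            f"{base_url}/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model,
                "temperature": 0,
                "messages": [{"role": "user", "content": prompt}],
            },
            timeout=60,
        )
        r.raise_for_status()
        return r.json()["choices"][0]["message"]["content"]

    for p in PROMPTS:
        a = ask("https://provider-a.example/v1", "KEY_A", "gpt-oss-120b", p)
        b = ask("https://trusted-host.example/v1", "KEY_B", "gpt-oss-120b", p)
        print(p, "| match:", a.strip() == b.strip())

In practice even an unquantized deployment won't match token-for-token (kernels and batching are nondeterministic), so you'd compare aggregate benchmark scores over many prompts rather than exact strings.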


Groq does quantise. Look at this benchmark from moonshotai for K2, where they compare their official implementation to third-party providers.

https://github.com/MoonshotAI/K2-Vendor-Verifier

It's one of the lowest-rated providers in that table.


> What kind of workflow justifies this?

Think about waiting for compilation to complete: the difference between 5 minutes and 15 seconds is dramatic.

Same applies to AI-based code-wrangling tasks. The preserved concentration may be well worth the $50, especially when paid by your employer.


They should offer a free trial so we can build confidence in the model quality (e.g., to make sure it's not nerfed/quantized/context-limited/etc.).


A trial is literally front and center on their website.


You can usually use them with things like OpenRouter. Load some credits there and use the API in your preferred IDE like you'd use any provider. For some quick tests it'll probably be <$5 for a few coding sessions, so you can check out the capabilities and see if it's worth it for you.
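For example (a sketch assuming the openai Python client pointed at OpenRouter's OpenAI-compatible endpoint; the model slug is illustrative, check the site for the exact id):

    # Quick capability check through OpenRouter before committing to a plan.
    # Assumes `pip install openai` and an OpenRouter API key.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible API
        api_key="YOUR_OPENROUTER_KEY",
    )

    resp = client.chat.completions.create(
        model="z-ai/glm-4.6",  # illustrative slug; verify the exact id on openrouter.ai
        messages=[{"role": "user", "content": "Write a binary search in Python."}],
    )
    print(resp.choices[0].message.content)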


openrouter charges me $12 on a $100 credit...


> What kind of workflow justifies this? I'm genuinely curious.

Any workflow where verification is faster/cheaper than generation. If you have a well-tested piece of code and want to "refactor it to use such-and-such paradigm", you can run n fast model queries and pick the best result.

My colleagues who do frontend use faster models (not this one specifically, but they did try fast-code-1) to build components. Someone worked out a workflow with worktrees where the model generates n variants of a component and displays them next to each other. A human can choose at a glance which one they like, and sometimes pick and choose from multiple variants (something like passing it to Claude and saying "keep the styling of component A but the data management of component B"). At the end of the day it's faster/cheaper than having cc do all that work.
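The "n variants" part of a workflow like that could be as simple as this sketch (placeholder endpoint and model name, using the openai async client):

    # Generate n variants of the same component in parallel, write them to
    # files, and review them side by side. Endpoint/model are placeholders.
    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="https://fast-provider.example/v1", api_key="KEY")

    async def variant(prompt: str, seed: int) -> str:
        resp = await client.chat.completions.create(
            model="some-fast-model",
            temperature=0.8,  # keep some diversity between variants
            seed=seed,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    async def main() -> None:
        prompt = "Build a pricing-card React component with Tailwind styling."
        variants = await asyncio.gather(*(variant(prompt, s) for s in range(4)))
        for i, text in enumerate(variants):
            with open(f"variant_{i}.tsx", "w") as f:
                f.write(text)

    asyncio.run(main())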


It’s because the model weights and KV cache are stored in SRAM. It’s extremely expensive per token.



