
Absolutely, especially the part about just rolling your own alternative to Claude Code - build your own lightsaber. Having your coding agent improve itself is a pretty magical experience. And then you can trivially swap in whatever model you want (Cerebras is crazy fast, for example, which makes a big difference for these many-turn tool-call conversations with big lumps of context, though gpt-oss 120b is obviously not as good as one of the frontier models). Add note-taking/memory, and ask it to save key facts there. Add voice transcription so that you can reply much faster (LLMs are amazing at taking in imperfect transcriptions and understanding what you meant). Each of these things takes on the order of a few minutes, and it's super fun.
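
For anyone curious how little code the core takes, here's a minimal sketch of the loop, assuming the openai Python client pointed at any OpenAI-compatible endpoint (Cerebras, OpenRouter, a local server, ...); BASE_URL, API_KEY, MODEL, and the tool schemas plus run_tool dispatcher are placeholders you'd fill in yourself:

    import json
    from openai import OpenAI

    client = OpenAI(base_url=BASE_URL, api_key=API_KEY)  # swap providers by changing these

    def agent(messages, tools, run_tool, model=MODEL):
        while True:
            resp = client.chat.completions.create(
                model=model, messages=messages, tools=tools
            )
            msg = resp.choices[0].message
            messages.append(msg)
            if not msg.tool_calls:
                return msg.content  # no more tools requested; final answer
            for call in msg.tool_calls:
                result = run_tool(call.function.name, json.loads(call.function.arguments))
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": str(result),
                })

Everything else (memory, transcription, better editing tools) hangs off that loop.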


I agree with you mostly.

On the other hand, I think "show it or it didn't happen" is essential.

Dumping a bit of code into an LLM doesn’t make it a code agent.

And what magic? It sounds like you never hit the conceptual and structural problems. Context window? History? Good or bad? Large-scale changes, or small refactorings here and there? Sample size of one, or several teams? What app? How many components? Greenfield or not? Which programming language?

I bet you will see Claude, and especially GitHub Copilot, in a different light, given that you can kill any self-made code agent quite easily once you put it under a bit of pressure.

Code agents are incredibly hard to build and use. Vibe coding is dead for a reason. I vividly remember the flood of todo apps and JS frameworks (Ember, Backbone, Knockout are the survivors) years ago.

The more you know about agents, and especially code agents, the more you understand why engineers won't be replaced so fast - at least senior engineers who hone their craft.

I enjoy fiddling with experimental agent implementations, but I value certain frameworks. They solve, in an opinionated way, the problems you will run into once you dig deeper and others depend on you.


To be clear, no one in this thread said this is replacing all senior engineers. But it is still amazing to see it work, and it’s very clear why the hype is so strong. But you’re right that you can quickly run into problems as it gets bigger.

Caching helps a lot, but yeah, there are some growing pains as the agent gets larger. Anthropic's caching strategy (four blocks you designate) is a bit annoying compared to OpenAI's cache-everything-recent. And you start running into the need to summarize old turns, or toss them outright, and to decide what's still relevant. Large tool-call results can be a killer.
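
For reference, designating those blocks with the Anthropic client looks roughly like this (a hedged sketch; LONG_SYSTEM_PROMPT and conversation_so_far are placeholders, and the prompt-caching docs have the current rules on breakpoints and minimum sizes):

    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",   # any cache-capable Claude model
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,              # big, stable prefix
                "cache_control": {"type": "ephemeral"},  # one of the four designated breakpoints
            }
        ],
        messages=conversation_so_far,
    )

OpenAI's side needs nothing special: sufficiently long repeated prefixes get cached automatically.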

I think at least for educational purposes it's worth doing, even if people end up going back to Claude Code, or away from agentic coding altogether, for their day to day.


>build your own lightsaber

I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.

The other takeaway is that agents are fantastically simple.


Agreed, and it's actually how I've been thinking about it, but it's also straight from the article, so I can't claim credit. It was fun to see it put into words by someone else, though.

And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.


I also started building my own, it's fun and you get far quickly.

I'm now experimenting with letting the agent generate its own source code from a specification (currently generating 9K lines of Python code - 3K of implementation, 6K of tests - from 1.5K lines of specification: https://alejo.ch/3hi).


Just reading through your docs, and feeling inspired. What are you spending, token-wise? Order of magnitude.


What are you using for transcription?

I tried Whisper, but it's slow and not great.

I tried the gpt audio models, but they're trained to refuse to transcribe things.

I tried Google's models and they were terrible.

I ended up using one of Mistral's models, which is alright and very fast, except that sometimes it will respond to the text instead of transcribing it.

So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!


I recently bought a mint-condition Alf phone, in the shape of Gordon Shumway of TV's "Alf", out of the back of an old auto shop in the south suburbs of Chicago, and naturally did the most obvious thing, which was to make a Gordon Shumway phone that has conversations in the voice of Gordon Shumway (sampled from YouTube and synthesized with ElevenLabs). I use https://github.com/etalab-ia/faster-whisper-server (I think?) as the Whisper backend. It's fine! Asterisk feeds me WAV files, an AGI program feeds them to Whisper (running locally as a server) and does audio synthesis with the ElevenLabs API. Took like 2 hours.


Been meaning to build something very similar! What hardware did you use? I'm assuming that a Pi or similar won't cut it


Just a cheap VOIP gateway and a NUC I use for a bunch of other stuff too.


Whisper.cpp/faster-whisper are a good bit faster than OpenAI's implementation. I've found the larger Whisper models to be surprisingly good in terms of transcription quality, even with our young children, though I'm sure it varies depending on the speaker; no idea how well it handles heavy accents.

I'm mostly running this on an M4 Max, so pretty good, but not an exotic GPU or anything. But with that setup, multiple sentences usually transcribe quickly enough that it doesn't really feel like much of a delay.

If you want something polished for system-wide use rather than rolling your own, I've been liking MacWhisper on the Mac side, currently hunting for something on Arch.
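
For the roll-your-own route, the Python side is only a few lines. A sketch using the faster-whisper library; model size and compute_type are just reasonable defaults to tweak for your hardware:

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", compute_type="int8")  # "medium"/"small" for weaker machines
    segments, info = model.transcribe("recording.wav")
    print("".join(segment.text for segment in segments))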


The new Qwen model is supposed to be very good.

Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.
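
Concretely, the cleanup-and-clipboard part is about this much code (a sketch: transcribe() is whatever STT helper you use, client is any OpenAI-compatible client, CHEAP_MODEL is your pick, and pyperclip handles the clipboard - all of those names are placeholders):

    import pyperclip

    raw = transcribe("dictation.wav")  # hypothetical STT helper
    cleaned = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[
            {"role": "system", "content": "Clean up this dictated text for a coding context. "
                                          "Fix obvious mis-hearings; do not add or remove content."},
            {"role": "user", "content": raw},
        ],
    ).choices[0].message.content
    pyperclip.copy(cleaned)  # paste anywhere without touching the keyboard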


Thanks. Could you share more? I'm about to reinvent this wheel right now. (Add a bunch of manual find-replace strings to my setup...)

Here's my current setup:

vt.py (mine) - voice type - uses PyQt for a status icon and global hotkeys for start/stop/cancel recording. Formerly used 3rd-party APIs, now uses parakeet_py (patent pending).

parakeet_py (mine): A Python binding for transcribe-rs, which is what Handy (see below) uses internally (just a wrapper for Parakeet V3). Claude Code made this one.

(Previously I was using voxtral-small-latest (Mistral API), which is very good except that sometimes it will output its own answer to my question instead of transcribing it.)

In other words, I'm running Parakeet V3 on my CPU, on a ten year old laptop, and it works great. I just have it set up in a slightly convoluted way...

I didn't expect the "generate me some rust bindings" thing to work, or I would have probably gone with a simpler option! (Unexpected downside of Claude is really smart: you end up with a Rube Goldberg machine to maintain!)

For the record, Handy - https://github.com/cjpais/Handy/issues - does 80% of what I want. Gives a nice UI for Parakeet. But I didn't like the hotkey design, didn't like the lack of flexibility for autocorrect etc... already had the muscle memory from my vt.py ;)


My use case is pretty specific - I have a 6-week-old baby. So I've been walking on my walking pad with her in the carrier. Typing in that situation is really not pleasant for anyone, especially the baby. Speed isn't my concern; I just want to keep my momentum in these moments.

My setup is as follows:

- Simple hotkey to kick off a shell script that records audio

- Simple Python script that uses inotify to watch the directory where audio is saved. Uses Whisper. This same script runs the transcription through Haiku 4.5 to clean it up. I tell it not to modify the contents, but it's Haiku, so sometimes it just does it anyway. The original transcript and the AI-cleaned versions are dumped into a directory.

- The cleaned-up version is run through another script to decide whether it's code, a project brief, or an email. I usually start the recording with "this is code" or "this is a project brief" to make it easy. Then, depending on what it is, the original, the transcription, and the context get run through different prompts with different output formats.

It's not fancy, but it works really well. I could probably vibe-code this into a more robust workflow system, all using inotify, and do some more advanced things. Integrating more sophisticated tool calling could be really neat.
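
The watcher piece of a setup like this is small. Here's a rough sketch using the watchdog library (a portable wrapper over inotify), with handle_recording() standing in for the Whisper + Haiku + routing steps described above:

    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    AUDIO_DIR = "/home/me/recordings"  # wherever the hotkey script drops audio

    class NewAudio(FileSystemEventHandler):
        def on_created(self, event):
            if not event.is_directory and event.src_path.endswith(".wav"):
                handle_recording(event.src_path)  # transcribe, clean up, classify

    observer = Observer()
    observer.schedule(NewAudio(), AUDIO_DIR, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()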


Parakeet is SOTA.


Agreed. I just launched https://voice-ai.knowii.net and am really a fan of Parakeet now. What it manages to achieve locally, without hogging too many resources, is awesome.



Handy is free, open-source and local model only. Supports Parakeet: https://github.com/cjpais/Handy


Speechmatics - it is on the expensive side, but it supports a bunch of languages and the accuracy is phenomenal in all of them, even with multiple speakers.


I use Willow AI, which I think is pretty good


The reason a lot of people don't do this is that Claude Code lets you use a Claude Max subscription to get virtually unlimited tokens. If you're using this stuff for your job, Claude Max ends up being like 10x the value of paying by the token; it's basically mandatory. And you can't use your Claude Max subscription for tools other than Claude Code (for TOS reasons, and they'll likely catch you eventually if you try to extract and reuse access tokens).


Is using CC outside of the CC binary even needed? CC has an SDK - could you not just use the proper binary? I've debated using it as the backend for internal chat bots and whatnot unrelated to "coding". Though maybe that's against the TOS, since I'm not using CC in the spirit of its design?


That's very much in the spirit of Claude Code these days. They renamed the Claude Code SDK to the Claude Agent SDK precisely to support this kind of usage of it: https://www.anthropic.com/engineering/building-agents-with-t...
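
The lowest-effort version of that is just shelling out to the binary in non-interactive print mode; the Agent SDK layers richer streaming and tool control on top. A hedged sketch (flags vary between versions, so check claude --help on your install):

    import subprocess

    def ask_claude(prompt: str) -> str:
        result = subprocess.run(
            ["claude", "-p", prompt],  # -p / --print: answer once and exit
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    print(ask_claude("Summarize the open TODOs in this repo."))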


> catch you eventually if you try to extract and reuse access tokens

What does that mean?


I'm saying that if you try to use Wireshark or something to grab the session token Claude Code is using and pass it to another tool so that tool can use the same session token, they'll probably eventually find out. All it would take is having Claude Code start passing an extra header that your other tool doesn't know about yet, suspending any accounts whose session token shows up in requests without that header, and manually dealing with any false positives. (If you're thinking of replying with a workaround: that was just one example; there are a bajillion ways they can figure people out if they want to.)


How do they know your requests come from Claude Code?


I imagine they can spot it pretty quickly, using machine learning to flag unlikely API access patterns. They're an AI research company, after all; spotting patterns is very much in their wheelhouse.


A million ways, but e.g.: once in a while, add a "challenge" header; the next request should contain a "challenge-reply" header for said challenge. If you're just reusing the access token, you won't get it right.

Or: just have a convention/algorithm for deciding how quickly Claude Code should refresh the access token. If the server knows the token should be refreshed after 1000 requests and notices a refresh only after 2000 requests, well, probably half of those requests were not made by Claude Code.


When comparing, are you using the normal token cost, or cached? I find that the vast majority of my token usage is in the 90% off cached bucket, and the costs aren’t terrible.


Kimi is noticeably better at tool calling than gpt-oss-120b.

I made a fun toy agent where the two models shoulder-surf each other and swap turns (either voluntarily, during a summarization phase, or forcefully if a tool-calling mistake is made), and Kimi ends up running the show much, much more often than gpt-oss.
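
In sketch form, the swap logic is something like this (simplified, not the real thing; call_model and run_tool stand in for the actual model and tool plumbing, with call_model returning a message dict that may contain "tool_calls"):

    MODELS = ["kimi-k2", "gpt-oss-120b"]

    def duel(messages, call_model, run_tool, max_turns=50):
        active = 0  # index of the model currently driving
        for turn in range(1, max_turns + 1):
            reply = call_model(MODELS[active], messages)
            messages.append(reply)
            for call in reply.get("tool_calls", []):
                try:
                    result = run_tool(call)
                except Exception as err:
                    active = 1 - active           # forced handoff after a botched tool call
                    result = f"tool error: {err}"
                messages.append({"role": "tool", "content": str(result)})
            if turn % 10 == 0:                    # voluntary handoff at a summarization point
                active = 1 - active
        return messages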

And yes - it is very much fun to build those!


Cerebras now has glm 4.6. Still obscenely fast, and now obscenely smart, too.


Aren't there cheaper providers of GLM 4.6 on Openrouter? What are the advantages of using Cerebras? Is it much faster?


You know how sometimes when you send a prompt to Claude, you just know it’s gonna take a while, so you go grab a coffee, come back, and it’s still working? With Cerebras it’s not even worth switching tabs, because it’ll finish the same task in like three seconds.


Cerebras offers a $50/mo and $200/mo "Cerebras Code" subscription for token limits way above what you could get for the same price in PAYG API credits. https://www.cerebras.ai/code

Up until recently, this plan only offered Qwen3-Coder-480B, which was decent for the price and speed you got tokens at, but doesn't hold a candle to GLM 4.6.

So while they're not the cheapest PAYG GLM 4.6 provider, they are the fastest, and if you make heavy use of their monthly subscription plan, they're also the cheapest per token.

Note: I am neither affiliated with nor sponsored by Cerebras, I'm just a huge nerd who loves their commercial offerings so much that I can't help but gush about them.


It's astonishingly fast.


Ooh thanks for the heads up!


What's a good starting point for getting into this? I don't even know what Cerebras is. I just use GitHub Copilot in VS Code. Are these local models?


A lot of it is just from HN osmosis, but /r/LocalLLaMA/ is a good place to hear about the latest open weight models, if that's interesting.

gpt-oss 120b is an open-weight model that OpenAI released a while back, and Cerebras (a startup making massive wafer-scale chips that keep models in SRAM) runs it as one of the models they provide. They're a small-scale contender against Nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.

In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for, e.g., running ls on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.
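
Those two starter tools are a few lines each; here's a sketch in OpenAI function-calling format (the names and schemas are just one reasonable way to shape them):

    import os

    def list_dir(path: str) -> str:
        return "\n".join(sorted(os.listdir(path)))

    def edit_file(path: str, content: str) -> str:
        with open(path, "w") as f:
            f.write(content)
        return f"wrote {len(content)} chars to {path}"

    TOOLS = [
        {"type": "function", "function": {
            "name": "list_dir",
            "description": "List the entries of a directory.",
            "parameters": {"type": "object",
                           "properties": {"path": {"type": "string"}},
                           "required": ["path"]}}},
        {"type": "function", "function": {
            "name": "edit_file",
            "description": "Overwrite a file with new content.",
            "parameters": {"type": "object",
                           "properties": {"path": {"type": "string"},
                                          "content": {"type": "string"}},
                           "required": ["path", "content"]}}},
    ]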


Here is ChatGPT in 50 lines of Python:

https://gist.github.com/avelican/4fa1baaac403bc0af04f3a7f007...

No dependencies, and very easy to swap out for OpenRouter, Groq or any other API. (Except Anthropic and Google, they are special ;)

This also works on the frontend: pro tip, you don't need a server for this stuff; you can make the requests directly from an HTML file. (Patent pending.)
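
The same no-dependency idea, with nothing but the standard library, looks roughly like this (a sketch in the spirit of the gist, not the gist itself; assumes an OpenAI-compatible endpoint and an OPENAI_API_KEY environment variable):

    import json, os, urllib.request

    def chat(messages, model="gpt-4o-mini"):
        req = urllib.request.Request(
            "https://api.openai.com/v1/chat/completions",
            data=json.dumps({"model": model, "messages": messages}).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            },
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]

    print(chat([{"role": "user", "content": "Hello!"}]))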


But it's way more expensive since most providers won't give you prompt caching?



