
Absolutely, especially the part about just rolling your own alternative to Claude Code - build your own lightsaber. Having your coding agent improve itself is a pretty magical experience. And then you can trivially swap in whatever model you want (Cerebras is crazy fast, for example, which makes a big difference for these many-turn tool-call conversations with big lumps of context, though gpt-oss 120b is obviously not as good as one of the frontier models). Add note-taking/memory, and ask it to save key facts there. Add voice transcription so that you can reply much faster (LLMs are amazing at taking in imperfect transcriptions and understanding what you meant). Each of these things takes on the order of a few minutes, and it's super fun.
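
For anyone curious how little code the core takes, here's a minimal sketch of the loop, assuming the openai Python client pointed at any OpenAI-compatible endpoint (Cerebras, OpenRouter, a local server, ...); BASE_URL, API_KEY, MODEL, and the tool schemas plus run_tool dispatcher are placeholders you'd fill in yourself:

    import json
    from openai import OpenAI

    client = OpenAI(base_url=BASE_URL, api_key=API_KEY)  # swap providers by changing these

    def agent(messages, tools, run_tool, model=MODEL):
        while True:
            resp = client.chat.completions.create(
                model=model, messages=messages, tools=tools
            )
            msg = resp.choices[0].message
            messages.append(msg)
            if not msg.tool_calls:
                return msg.content  # no more tools requested; final answer
            for call in msg.tool_calls:
                result = run_tool(call.function.name, json.loads(call.function.arguments))
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": str(result),
                })

Everything else (memory, transcription, better editing tools) hangs off that loop.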


I agree with you mostly.

On the other hand, I think "show it or it didn't happen" is essential.

Dumping a bit of code into an LLM doesn’t make it a code agent.

And what magic? It sounds like you never hit the conceptual and structural problems. Context window? History? Good or bad? Large-scale changes, or small refactorings here and there? Sample size of one, or several teams? What app? How many components? Greenfield or not? Which programming language?

I bet you will see Claude, and especially GitHub Copilot, in a different light, given that you can kill any self-made code agent quite easily once you put it under a bit of pressure.

Code agents are incredibly hard to build and use. Vibe coding is dead for a reason. I vividly remember the flood of todo apps and JS frameworks (Ember, Backbone, Knockout are the survivors) years ago.

The more you know about agents, and especially code agents, the more you understand why engineers won't be replaced so fast - at least senior engineers who hone their craft.

I enjoy fiddling with experimental agent implementations, but I value certain frameworks. They solve, in an opinionated way, the problems you will run into once you dig deeper and others depend on you.


To be clear, no one in this thread said this is replacing all senior engineers. But it is still amazing to see it work, and it’s very clear why the hype is so strong. But you’re right that you can quickly run into problems as it gets bigger.

Caching helps a lot, but yeah, there are some growing pains as the agent gets larger. Anthropic's caching strategy (four blocks you designate) is a bit annoying compared to OpenAI's cache-everything-recent. And you start running into the need to summarize old turns, or toss them outright, and to decide what's still relevant. Large tool-call results can be a killer.
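
For reference, designating those blocks with the Anthropic client looks roughly like this (a hedged sketch; LONG_SYSTEM_PROMPT and conversation_so_far are placeholders, and the prompt-caching docs have the current rules on breakpoints and minimum sizes):

    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5",   # any cache-capable Claude model
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,              # big, stable prefix
                "cache_control": {"type": "ephemeral"},  # one of the four designated breakpoints
            }
        ],
        messages=conversation_so_far,
    )

OpenAI's side needs nothing special: sufficiently long repeated prefixes get cached automatically.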

I think at least for educational purposes it's worth doing, even if people end up going back to Claude Code, or away from agentic coding altogether, for their day to day.


>build your own lightsaber

I think this is the best way of putting it I've heard to date. I started building one just to know what's happening under the hood when I use an off-the-shelf one, but it's actually so straightforward that now I'm adding features I want. I can add them faster than a whole team of developers on a "real" product can add them - because they have a bigger audience.

The other takeaway is that agents are fantastically simple.


Agreed, and it's actually how I've been thinking about it, but it's also straight from the article, so I can't claim credit. It was fun to see it put into words by someone else, though.

And yeah, the LLM does so much of the lifting that the agent part is really surprisingly simple. It was really a revelation when I started working on mine.


I also started building my own, it's fun and you get far quickly.

I'm now experimenting with letting the agent generate its own source code from a specification (currently generating 9K lines of Python code - 3K of implementation, 6K of tests - from 1.5K lines of specification: https://alejo.ch/3hi).


Just reading through your docs, and feeling inspired. What are you spending, token-wise? Order of magnitude.


What are you using for transcription?

I tried Whisper, but it's slow and not great.

I tried the gpt audio models, but they're trained to refuse to transcribe things.

I tried Google's models and they were terrible.

I ended up using one of Mistral's models, which is alright and very fast, except that sometimes it will respond to the text instead of transcribing it.

So I'll occasionally end up with pages of LLM rambling pasted instead of the words I said!


I recently bought a mint-condition Alf phone, in the shape of Gordon Shumway of TV's "Alf", out of the back of an old auto shop in the south suburbs of Chicago, and naturally did the most obvious thing, which was to make a Gordon Shumway phone that has conversations in the voice of Gordon Shumway (sampled from YouTube and synthesized with ElevenLabs). I use https://github.com/etalab-ia/faster-whisper-server (I think?) as the Whisper backend. It's fine! Asterisk feeds me WAV files, an AGI program feeds them to Whisper (running locally as a server) and does audio synthesis with the ElevenLabs API. Took like 2 hours.


Been meaning to build something very similar! What hardware did you use? I'm assuming that a Pi or similar won't cut it


Just a cheap VOIP gateway and a NUC I use for a bunch of other stuff too.


Whisper.cpp/faster-whisper are a good bit faster than OpenAI's implementation. I've found the larger Whisper models to be surprisingly good in terms of transcription quality, even with our young children, though I'm sure it varies depending on the speaker; no idea how well it handles heavy accents.

I'm mostly running this on an M4 Max, so pretty good, but not an exotic GPU or anything. But with that setup, multiple sentences usually transcribe quickly enough that it doesn't really feel like much of a delay.

If you want something polished for system-wide use rather than rolling your own, I've been liking MacWhisper on the Mac side, currently hunting for something on Arch.
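
For the roll-your-own route, the Python side is only a few lines. A sketch using the faster-whisper library; model size and compute_type are just reasonable defaults to tweak for your hardware:

    from faster_whisper import WhisperModel

    model = WhisperModel("large-v3", compute_type="int8")  # "medium"/"small" for weaker machines
    segments, info = model.transcribe("recording.wav")
    print("".join(segment.text for segment in segments))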


The new Qwen model is supposed to be very good.

Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.
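
Concretely, the cleanup-and-clipboard part is about this much code (a sketch: transcribe() is whatever STT helper you use, client is any OpenAI-compatible client, CHEAP_MODEL is your pick, and pyperclip handles the clipboard - all of those names are placeholders):

    import pyperclip

    raw = transcribe("dictation.wav")  # hypothetical STT helper
    cleaned = client.chat.completions.create(
        model=CHEAP_MODEL,
        messages=[
            {"role": "system", "content": "Clean up this dictated text for a coding context. "
                                          "Fix obvious mis-hearings; do not add or remove content."},
            {"role": "user", "content": raw},
        ],
    ).choices[0].message.content
    pyperclip.copy(cleaned)  # paste anywhere without touching the keyboard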


Thanks. Could you share more? I'm about to reinvent this wheel right now. (Add a bunch of manual find-replace strings to my setup...)

Here's my current setup:

vt.py (mine) - voice type - uses PyQt for a status icon and global hotkeys for start/stop/cancel recording. Formerly used 3rd-party APIs, now uses parakeet_py (patent pending).

parakeet_py (mine): A Python binding for transcribe-rs, which is what Handy (see below) uses internally (just a wrapper for Parakeet V3). Claude Code made this one.

(Previously I was using voxtral-small-latest (Mistral API), which is very good except that sometimes it will output its own answer to my question instead of transcribing it.)

In other words, I'm running Parakeet V3 on my CPU, on a ten year old laptop, and it works great. I just have it set up in a slightly convoluted way...

I didn't expect the "generate me some rust bindings" thing to work, or I would have probably gone with a simpler option! (Unexpected downside of Claude is really smart: you end up with a Rube Goldberg machine to maintain!)

For the record, Handy - https://github.com/cjpais/Handy/issues - does 80% of what I want. Gives a nice UI for Parakeet. But I didn't like the hotkey design, didn't like the lack of flexibility for autocorrect etc... already had the muscle memory from my vt.py ;)


My use case is pretty specific - I have a 6-week-old baby. So I've been walking on my walking pad with her in the carrier. Typing in that situation is really not pleasant for anyone, especially the baby. Speed isn't my concern; I just want to keep my momentum in these moments.

My setup is as follows:

- Simple hotkey to kick off a shell script that records audio

- Simple Python script that uses inotify to watch the directory where audio is saved. Uses Whisper. This same script runs the transcription through Haiku 4.5 to clean it up. I tell it not to modify the contents, but it's Haiku, so sometimes it just does it anyway. The original transcript and the AI-cleaned versions are dumped into a directory.

- The cleaned-up version is run through another script to decide whether it's code, a project brief, or an email. I usually start the recording with "this is code" or "this is a project brief" to make it easy. Then, depending on what it is, the original, the transcription, and the context get run through different prompts with different output formats.

It's not fancy, but it works really well. I could probably vibe-code this into a more robust workflow system, all using inotify, and do some more advanced things. Integrating more sophisticated tool calling could be really neat.
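
The watcher piece of a setup like this is small. Here's a rough sketch using the watchdog library (a portable wrapper over inotify), with handle_recording() standing in for the Whisper + Haiku + routing steps described above:

    import time
    from watchdog.observers import Observer
    from watchdog.events import FileSystemEventHandler

    AUDIO_DIR = "/home/me/recordings"  # wherever the hotkey script drops audio

    class NewAudio(FileSystemEventHandler):
        def on_created(self, event):
            if not event.is_directory and event.src_path.endswith(".wav"):
                handle_recording(event.src_path)  # transcribe, clean up, classify

    observer = Observer()
    observer.schedule(NewAudio(), AUDIO_DIR, recursive=False)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()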


Parakeet is SOTA.


Agreed. I just launched https://voice-ai.knowii.net and am really a fan of Parakeet now. What it manages to achieve locally, without hogging too many resources, is awesome.



Handy is free, open-source and local model only. Supports Parakeet: https://github.com/cjpais/Handy


Speechmatics - it is on the expensive side, but it supports a bunch of languages and the accuracy is phenomenal in all of them, even with multiple speakers.


I use Willow AI, which I think is pretty good


The reason a lot of people don't do this is that Claude Code lets you use a Claude Max subscription to get virtually unlimited tokens. If you're using this stuff for your job, Claude Max ends up being like 10x the value of paying by the token; it's basically mandatory. And you can't use your Claude Max subscription for tools other than Claude Code (for TOS reasons, and they'll likely catch you eventually if you try to extract and reuse access tokens).


Is using CC outside of the CC binary even needed? CC has an SDK - could you not just use the proper binary? I've debated using it as the backend for internal chat bots and whatnot unrelated to "coding". Though maybe that's against the TOS, since I'm not using CC in the spirit of its design?


That's very much in the spirit of Claude Code these days. They renamed the Claude Code SDK to the Claude Agent SDK precisely to support this kind of usage of it: https://www.anthropic.com/engineering/building-agents-with-t...
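
The lowest-effort version of that is just shelling out to the binary in non-interactive print mode; the Agent SDK layers richer streaming and tool control on top. A hedged sketch (flags vary between versions, so check claude --help on your install):

    import subprocess

    def ask_claude(prompt: str) -> str:
        result = subprocess.run(
            ["claude", "-p", prompt],  # -p / --print: answer once and exit
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    print(ask_claude("Summarize the open TODOs in this repo."))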


> catch you eventually if you try to extract and reuse access tokens

What does that mean?


I'm saying that if you try to use Wireshark or something to grab the session token Claude Code is using and pass it to another tool so that tool can use the same session token, they'll probably eventually find out. All it would take is having Claude Code start passing an extra header that your other tool doesn't know about yet, suspending any accounts whose session token shows up in requests without that header, and manually dealing with any false positives. (If you're thinking of replying with a workaround: that was just one example; there are a bajillion ways they can figure people out if they want to.)


How do they know your requests come from Claude Code?


I imagine they can spot it pretty quickly, using machine learning to flag unlikely API access patterns. They're an AI research company, after all; spotting patterns is very much in their wheelhouse.


A million ways, but e.g.: once in a while, add a "challenge" header; the next request should contain a "challenge-reply" header for said challenge. If you're just reusing the access token, you won't get it right.

Or: just have a convention/algorithm for deciding how quickly Claude Code should refresh the access token. If the server knows the token should be refreshed after 1000 requests and notices a refresh only after 2000 requests, well, probably half of those requests were not made by Claude Code.


When comparing, are you using the normal token cost, or cached? I find that the vast majority of my token usage is in the 90% off cached bucket, and the costs aren’t terrible.


Kimi is noticeably better at tool calling than gpt-oss-120b.

I made a fun toy agent where the two models shoulder-surf each other and swap turns (either voluntarily, during a summarization phase, or forcefully if a tool-calling mistake is made), and Kimi ends up running the show much, much more often than gpt-oss.
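
In sketch form, the swap logic is something like this (simplified, not the real thing; call_model and run_tool stand in for the actual model and tool plumbing, with call_model returning a message dict that may contain "tool_calls"):

    MODELS = ["kimi-k2", "gpt-oss-120b"]

    def duel(messages, call_model, run_tool, max_turns=50):
        active = 0  # index of the model currently driving
        for turn in range(1, max_turns + 1):
            reply = call_model(MODELS[active], messages)
            messages.append(reply)
            for call in reply.get("tool_calls", []):
                try:
                    result = run_tool(call)
                except Exception as err:
                    active = 1 - active           # forced handoff after a botched tool call
                    result = f"tool error: {err}"
                messages.append({"role": "tool", "content": str(result)})
            if turn % 10 == 0:                    # voluntary handoff at a summarization point
                active = 1 - active
        return messages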

And yes - it is very much fun to build those!


Cerebras now has glm 4.6. Still obscenely fast, and now obscenely smart, too.


Aren't there cheaper providers of GLM 4.6 on Openrouter? What are the advantages of using Cerebras? Is it much faster?


You know how sometimes when you send a prompt to Claude, you just know it’s gonna take a while, so you go grab a coffee, come back, and it’s still working? With Cerebras it’s not even worth switching tabs, because it’ll finish the same task in like three seconds.


Cerebras offers a $50/mo and $200/mo "Cerebras Code" subscription for token limits way above what you could get for the same price in PAYG API credits. https://www.cerebras.ai/code

Up until recently, this plan only offered Qwen3-Coder-480B, which was decent for the price and speed you got tokens at, but doesn't hold a candle to GLM 4.6.

So while they're not the cheapest PAYG GLM 4.6 provider, they are the fastest, and if you make heavy use of their monthly subscription plan, they're also the cheapest per token.

Note: I am neither affiliated with nor sponsored by Cerebras, I'm just a huge nerd who loves their commercial offerings so much that I can't help but gush about them.


It's astonishingly fast.


Ooh thanks for the heads up!


What's a good starting point for getting into this? I don't even know what Cerebras is. I just use GitHub Copilot in VS Code. Are these local models?


A lot of it is just from HN osmosis, but /r/LocalLLaMA/ is a good place to hear about the latest open weight models, if that's interesting.

gpt-oss 120b is an open-weight model that OpenAI released a while back, and Cerebras (a startup making massive wafer-scale chips that keep models in SRAM) runs it as one of the models they provide. They're a small-scale contender against Nvidia, but by keeping the model weights in SRAM, they get pretty crazy token throughput at low latency.

In terms of making your own agent, this one's pretty good as a starting point, and you can ask the models to help you make tools for, e.g., running ls on a subdirectory, or editing a file. Once you have those two, you can ask it to edit itself, and you're off to the races.
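
Those two starter tools are a few lines each; here's a sketch in OpenAI function-calling format (the names and schemas are just one reasonable way to shape them):

    import os

    def list_dir(path: str) -> str:
        return "\n".join(sorted(os.listdir(path)))

    def edit_file(path: str, content: str) -> str:
        with open(path, "w") as f:
            f.write(content)
        return f"wrote {len(content)} chars to {path}"

    TOOLS = [
        {"type": "function", "function": {
            "name": "list_dir",
            "description": "List the entries of a directory.",
            "parameters": {"type": "object",
                           "properties": {"path": {"type": "string"}},
                           "required": ["path"]}}},
        {"type": "function", "function": {
            "name": "edit_file",
            "description": "Overwrite a file with new content.",
            "parameters": {"type": "object",
                           "properties": {"path": {"type": "string"},
                                          "content": {"type": "string"}},
                           "required": ["path", "content"]}}},
    ]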


Here is ChatGPT in 50 lines of Python:

https://gist.github.com/avelican/4fa1baaac403bc0af04f3a7f007...

No dependencies, and very easy to swap out for OpenRouter, Groq or any other API. (Except Anthropic and Google, they are special ;)

This also works on the frontend: pro tip, you don't need a server for this stuff; you can make the requests directly from an HTML file. (Patent pending.)
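
The same no-dependency idea, with nothing but the standard library, looks roughly like this (a sketch in the spirit of the gist, not the gist itself; assumes an OpenAI-compatible endpoint and an OPENAI_API_KEY environment variable):

    import json, os, urllib.request

    def chat(messages, model="gpt-4o-mini"):
        req = urllib.request.Request(
            "https://api.openai.com/v1/chat/completions",
            data=json.dumps({"model": model, "messages": messages}).encode(),
            headers={
                "Content-Type": "application/json",
                "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
            },
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]

    print(chat([{"role": "user", "content": "Hello!"}]))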


But it's way more expensive since most providers won't give you prompt caching?



