Honestly, I've gotten really far simply by transcribing audio with whisper, having a cheap model clean up the output to make it make sense (especially in a coding context), and copying the result to the clipboard. My goal is less about speed and more about not touching the keyboard, though.
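(For reference, the whole loop is only a handful of lines. A rough sketch, not my exact script: it assumes the openai-whisper package, some cheap chat model via the OpenAI SDK as a stand-in, and pyperclip for the clipboard step; the model name, prompt wording, and path handling are all placeholders.)

```python
# Rough sketch: transcribe -> cheap-model cleanup -> clipboard.
# Model name, prompt wording, and file path are placeholders.
import sys

import pyperclip
import whisper
from openai import OpenAI

PROMPT = ("Clean up this voice transcription for use in a coding context. "
          "Fix punctuation and obvious mis-hearings; keep the meaning:\n\n")

audio_path = sys.argv[1]
raw = whisper.load_model("base").transcribe(audio_path)["text"]

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # any cheap model works here
    messages=[{"role": "user", "content": PROMPT + raw}],
)
pyperclip.copy(resp.choices[0].message.content)
```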
Thanks. Could you share more? I'm about to reinvent this wheel right now. (Add a bunch of manual find-replace strings to my setup...)
Here's my current setup:
vt.py (mine) - voice type - uses PyQt to show a status icon and register global hotkeys for start/stop/cancel recording (rough sketch below, after these two items). Formerly used 3rd-party APIs, now uses parakeet_py (patent pending).
parakeet_py (mine): A Python binding for transcribe-rs, which is what Handy (see below) uses internally (just a wrapper for Parakeet V3). Claude Code made this one.
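Roughly, vt.py has this shape. A minimal sketch, not the real thing: it assumes pynput for the global hotkeys and sounddevice for audio capture, and the parakeet_py.transcribe() call is a made-up stand-in for the actual binding, so it's left as a comment.

```python
# Minimal sketch of a tray icon + global hotkeys recorder.
# pynput and sounddevice are assumptions; parakeet_py.transcribe() is a
# placeholder for the real binding.
import sys

import numpy as np
import sounddevice as sd
from PyQt5.QtWidgets import QApplication, QSystemTrayIcon, QMenu
from PyQt5.QtGui import QIcon
from pynput import keyboard

SAMPLE_RATE = 16000
chunks = []

def on_audio(indata, frames, time, status):
    chunks.append(indata.copy().flatten())

def start_recording():
    chunks.clear()
    stream.start()

def stop_and_transcribe():
    stream.stop()
    audio = np.concatenate(chunks) if chunks else np.zeros(0, dtype="float32")
    # Hypothetical call -- the real parakeet_py API may look different:
    # text = parakeet_py.transcribe(audio, SAMPLE_RATE)
    # QApplication.clipboard().setText(text)

def cancel_recording():
    stream.stop()
    chunks.clear()

app = QApplication(sys.argv)
stream = sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        dtype="float32", callback=on_audio)

tray = QSystemTrayIcon(QIcon.fromTheme("audio-input-microphone"))
menu = QMenu()
menu.addAction("Quit", app.quit)
tray.setContextMenu(menu)
tray.show()

# Global hotkeys fire even when no window has focus.
hotkeys = keyboard.GlobalHotKeys({
    "<ctrl>+<alt>+r": start_recording,
    "<ctrl>+<alt>+s": stop_and_transcribe,
    "<ctrl>+<alt>+c": cancel_recording,
})
hotkeys.start()

sys.exit(app.exec_())
```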
(Previously I was using voxtral-small-latest (Mistral API), which is very good except that sometimes it will output its own answer to my question instead of transcribing it.)
In other words, I'm running Parakeet V3 on my CPU, on a ten-year-old laptop, and it works great. I just have it set up in a slightly convoluted way...
I didn't expect the "generate me some Rust bindings" thing to work, or I would probably have gone with a simpler option! (Unexpected downside of Claude being really smart: you end up with a Rube Goldberg machine to maintain!)
For the record, Handy - https://github.com/cjpais/Handy/issues - does 80% of what I want and gives a nice UI for Parakeet. But I didn't like the hotkey design or the lack of flexibility for autocorrect etc., and I already had the muscle memory from my vt.py ;)
My use case is pretty specific - I have a 6-week-old baby. So I've been walking on my walking pad with her in the carrier. Typing in that situation is really not pleasant for anyone, especially the baby. Speed isn't my concern; I just want to keep my momentum in these moments.
My setup is as follows:
- Simple hotkey to kick off shell script to record
- Simple Python script that uses inotify to watch the directory where the audio is saved and transcribes new files with Whisper. The same script runs the transcription through Haiku 4.5 to clean it up. I tell it not to modify the contents, but it's Haiku, so sometimes it just does it anyway. The original transcript and the AI-cleaned version are both dumped into a directory (rough sketch of this watcher below, after the list).
- The cleaned-up version is run through another script to decide whether it's code, a project brief, or an email. I usually start the recording with "this is code" or "this is a project brief" to make it easy. Then, depending on what it is, the original, the cleaned transcription, and the context get run through different prompts with different output formats.
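Roughly, the watcher looks like this. A minimal sketch rather than my actual script: it assumes inotify_simple for the directory watch, the openai-whisper package for transcription, and the Anthropic SDK for the cleanup pass; the model ID, paths, and prompt are placeholders.

```python
# Sketch of the watcher: new audio file -> transcribe -> LLM cleanup ->
# dump both versions. inotify_simple, openai-whisper, and the Anthropic SDK
# are assumptions; model ID, paths, and prompt are placeholders.
from pathlib import Path

import whisper
from anthropic import Anthropic
from inotify_simple import INotify, flags

WATCH_DIR = Path("~/voice/incoming").expanduser()
OUT_DIR = Path("~/voice/transcripts").expanduser()
CLEANUP_PROMPT = (
    "Clean up this voice transcription (punctuation, obvious mis-hearings). "
    "Do not change the meaning or add anything:\n\n"
)

model = whisper.load_model("base")
client = Anthropic()

OUT_DIR.mkdir(parents=True, exist_ok=True)
inotify = INotify()
inotify.add_watch(str(WATCH_DIR), flags.CLOSE_WRITE)

while True:
    for event in inotify.read():  # blocks until a file finishes writing
        audio_path = WATCH_DIR / event.name
        raw = model.transcribe(str(audio_path))["text"].strip()

        msg = client.messages.create(
            model="claude-haiku-4-5",  # placeholder model ID
            max_tokens=2000,
            messages=[{"role": "user", "content": CLEANUP_PROMPT + raw}],
        )
        cleaned = msg.content[0].text

        (OUT_DIR / f"{audio_path.stem}.raw.txt").write_text(raw)
        (OUT_DIR / f"{audio_path.stem}.cleaned.txt").write_text(cleaned)
```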
It's not fancy, but it works really well. I could probably vibe code this into a more robust workflow system all using inotify and do some more advanced things. Integrating more sophisticated tool calling could be really neat.
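The routing step is basically just "look at how the recording starts, pick a prompt". A toy sketch, with the categories and prompt text as placeholders:

```python
# Toy sketch of the router: spoken prefix -> prompt/output format.
# Categories and prompt wording are placeholders.
PROMPTS = {
    "this is code": "Turn this dictated note into code with comments:\n\n",
    "this is a project brief": "Rewrite this dictation as a structured project brief:\n\n",
    "this is an email": "Rewrite this dictation as a concise email:\n\n",
}
DEFAULT_PROMPT = "Lightly edit this dictation for clarity:\n\n"

def route(cleaned: str) -> str:
    text = cleaned.strip()
    lowered = text.lower()
    for prefix, prompt in PROMPTS.items():
        if lowered.startswith(prefix):
            # Drop the spoken prefix so it doesn't leak into the output.
            return prompt + text[len(prefix):].lstrip(" ,.")
    return DEFAULT_PROMPT + text
```

The resulting prompt then goes to whatever model handles that category, each with its own output format.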