Hey! Thank you for your comment! You can actually use an MCP on this basis, but I haven't tested it yet. I'll look into it as soon as possible. Your feedback is valuable.
I daresay we're going to see a burgeoning number of situations where the software (code) is open-sourced under a permissive or copyleft license, while the associated data, content, or assets (e.g., datasets, models, or databases) are handled under separate, often more restrictive licenses.
The prevalence of this "personal vibecoded app" spirit makes me start to wonder if an "App" is the right level of abstraction for packaging capabilities. Perhaps we need something more "granular".
> Would instead of the RL step a constrained decoding say via something like xgrammar fix syntax generation issue ?
It can, but you have to consider two things here:
a) constrained decoding ensures adherence to syntax, not semantics. Say you're adding a field to an enum in Rust. You can write syntactically correct Rust code that doesn't handle the new field elsewhere in the code (say in a match). You'd get syntactically valid code, but the compiler will scream at you. RL works on both.
b) if your goal is to further train the model, so it works on many tasks, RL helps with exploring new paths and training the model further. Constrained grammars help with inference, but the model doesn't "learn" anything. With RL you can also have many reward functions at the same time. Say one that rewards good syntax, one that rewards "closing" all the functions so tree-sitter doesn't complain, and one that rewards 0 errors from the compiler. The model gets to train on all 3 at the same time.
c) constrained decoding only works on context-free grammars (simpler grammars like JSON schemas), since only those can be compiled into the pushdown automata used for constrained decoding. Programming languages like Python and C++ aren't fully context-free, so it doesn't work there.
Also, constrained decoding generally worsens model quality, since the model ends up generating off-policy. RL helps push the corrected syntax back on-policy.
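To make the multiple-reward idea in b) concrete, here's a minimal sketch in Python. The reward functions, weights, and the use of `ast.parse` as a cheap stand-in for a compiler/tree-sitter check are all illustrative assumptions, not the API of any particular RL framework; a trainer like PPO/GRPO would consume the combined scalar per sampled completion.

```python
import ast

def syntax_reward(code: str) -> float:
    # Reward 1: does the snippet parse at all?
    try:
        ast.parse(code)
        return 1.0
    except SyntaxError:
        return 0.0

def balance_reward(code: str) -> float:
    # Reward 2: crude proxy for "closing" everything
    # (what a tree-sitter completeness check would catch):
    # all brackets must be balanced.
    pairs = {')': '(', ']': '[', '}': '{'}
    stack = []
    for ch in code:
        if ch in '([{':
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return 0.0
    return 1.0 if not stack else 0.0

def combined_reward(code: str, weights=(0.5, 0.5)) -> float:
    # The model trains on all rewards at once: they are
    # weighted and summed into one scalar per completion.
    return (weights[0] * syntax_reward(code)
            + weights[1] * balance_reward(code))

good = "def f(x):\n    return x + 1\n"
bad = "def f(x:\n    return x + 1\n"
```

Here `combined_reward(good)` is 1.0 and `combined_reward(bad)` is 0.0; in practice you'd add a third reward from actually invoking the compiler, exactly as described above.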
Not a stranger but strangers
I was returning home from an event in the early evening, absorbed in my thoughts, when I got both my front tires spinning freely without traction in a ditch.
Although this was in Nigeria, where we have a certain camaraderie through hardship, it was still extremely surprising to see a group of 6 men come out of nowhere, passersby with nothing to do with each other, join hands and exert sweaty effort to get my car out of a ditch by 8pm.
Reminds me of when my wife would drop her loaded-for-touring motorcycle in a parking lot. People would crawl out of the woodwork to run over and help her pick it up.
I’ve dropped mine on rare occasions, and nary a soul even looked my direction. :-) (But thankfully I’m a grown boy who can pick it up myself.)
I high-centered my car on a drift coming out of the Taco Bell drive thru - not a minute passed before ten or so people appeared out of nowhere and pushed me over and out.
Literally, the moment before there hadn’t been anything around but me and that Taco Bell.
We had a blizzard that dumped 3 feet of snow in a weekend, my car got stuck, and about 6 people came out of their warm cozy houses to help push it. On a separate occasion, years later someone driving by stopped their car to help when they saw me stuck on the side of the road.
We grew up very poor, and I can't count the number of times someone helped us through a difficult situation. There are plenty of times we were on our own and there wasn't any help, but also times when someone noticed and helped. The help was always so appreciated; it lessened the suffering considerably compared to the times when we were on our own with whatever problem.
The main issue I'm facing with realtime responses (speech output) is how to separate non-diegetic outputs (e.g. thinking tokens, structured outputs) from outputs meant to be heard by the end user.
A simple way is to split the model’s output stream before TTS.
Reasoning/structured tokens go into one bucket, actual user-facing text into another, and only the second bucket is synthesized. Most "thinking out loud" issues come from feeding the whole stream directly into audio.
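A minimal sketch of that splitter in Python, assuming the model wraps its reasoning in `<think>`/`</think>` delimiters (your model's actual markers may differ):

```python
def split_stream(tokens):
    """Route a token stream into (user_facing, hidden) buckets.

    Assumes reasoning is wrapped in <think>...</think> markers;
    swap in whatever delimiters your model actually emits.
    """
    user, hidden = [], []
    in_think = False
    for tok in tokens:
        if tok == "<think>":
            in_think = True
        elif tok == "</think>":
            in_think = False
        elif in_think:
            hidden.append(tok)   # reasoning: never synthesized
        else:
            user.append(tok)     # user-facing: goes to TTS

    return user, hidden

stream = ["<think>", "plan", "answer", "</think>", "Hello", "there"]
user, hidden = split_stream(stream)
# only `user` is handed to the TTS stage
```

In a real pipeline you'd do this incrementally on the token stream rather than on a finished list, but the routing logic is the same.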
There is no TTS here. It's a native audio output model which outputs audio tokens directly. (At least, that's how the other real-time models work. Maybe I've misunderstood the Qwen-Omni architecture.)
True, but even with native audio-token models you still need to split the model's output channels. Reasoning/internal tokens shouldn't go into the audio stream; only user-facing content should be emitted as audio. The principle is the same whether the last step is TTS or audio token generation.
There's an assumption there that the audio stream contains an equivalent of the <think>/</think> tokens. Every reason to think it should, but without seeing the tokeniser config it's a bit of a guess.
09h is the keyboard interrupt, the utterly basic interface [1] that only gives you scancodes and nothing more; 16h is the extended interface [2] that you need to deal with if you want to read/set shift and other special keys [3].