> where it processes the incoming speech in real time and responds when it's confident it has heard enough to understand the meaning.
I'm not an expert on LLMs, but that feels completely counter to how LLMs work (again, _not_ an expert). I don't know how we can "stream" the input and have the generation update/change in real time, at least not in one model. Then again, what is a "model"? Maybe your model fires off multiple generations internally and starts generating after every word, or at least keeps asking sub-LLMs "Do I have enough to reply?", and once it does, it generates a reply and interrupts.
I'm not sure how most apps handle the user interrupting, with regard to the conversation context. Do they stop generation but keep what they have generated so far in the context? Do they cut off where the LLM got interrupted? Something like "LLM: ..and then the horse walked... -USER INTERRUPTED-. User: ....". It's not a purely voice-LLM issue, but it comes up far more often there, since you're rarely stopping generation (in the demo, generation finished well before he interrupts), just the TTS.
You're right, this is not solvable with regular LLMs. It's not possible to mimic natural conversational rhythm with a separate LLM generating text, a separate text-to-speech model generating audio, and a separate VAD determining when to respond and when to interrupt. I strongly believe you have to do everything in one model to solve this issue, letting the model decide when to speak, and even when to interrupt the user.
The only model that has attempted this (as far as I know) is Moshi from Kyutai. It solves it by having a fully duplex architecture: the model processes audio from the user while generating output audio. Both can be active at the same time, talking over each other, like real conversations. It's still in the research phase and the model isn't very smart yet, both in what it says and in when it decides to speak. It just needs more data and more training.
Whoah, how odd. It asked me what I was doing, I said I just ate a burger. It then got really upset about how hungry it is but is unable to eat and was unable to focus on other tasks because it was “too hungry”. Wtf weirdest LLM interaction I’ve had.
Damn they trained a model that so deeply embeds human experience it actually feels hunger, yet self aware enough it knows it’s not capable of actually eating!
>It's not possible to mimic natural conversational rhythm with a separate LLM generating text, a separate text-to-speech generating audio, and a separate VAD determining when to respond and when to interrupt.
If you load the system prompt with enough assumptions (that it's producing a speech-impaired subtitle transcription following a dialogue), you might pull it off, but you would likely need to fine-tune your model to play nicely with the TTS and the rest of the setup.
Think of it as generating a constantly streaming infinite list of latents. These latents are basically decoded to a tuple [time_until_my_turn(latent_t), audio(latent_t)]. You can train it to minimize the error of its time_until_my_turn predictions from ground truth of training samples, as well as the quality of the audio generated. Basically a change-point prediction model. Ilya Sutskever (among others) worked on something like this long ago, it might have inspired OpenAI's advanced voice models:
> Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.
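The change-point idea above can be sketched as a decoding loop. This is a toy illustration, not any real model's API: `decode` stands in for a learned decoder, and the latents are just hand-made tuples so the control flow is visible.

```python
# Hypothetical sketch of a full-duplex decoding loop where each latent
# decodes to (time_until_my_turn, audio_frame). All names are illustrative.

def decode(latent):
    # Stand-in for a learned decoder: latent -> (seconds until it's the
    # model's turn to speak, one frame of output audio).
    time_until_my_turn, audio_frame = latent
    return time_until_my_turn, audio_frame

def duplex_loop(latent_stream):
    """Consume a stream of latents; start emitting audio once the model
    predicts its turn has arrived (time_until_my_turn <= 0)."""
    spoken = []
    for latent in latent_stream:
        t_turn, frame = decode(latent)
        if t_turn <= 0.0:          # change-point: model's turn to talk
            spoken.append(frame)   # emit audio while still listening
    return spoken

# Toy stream: the predicted turn boundary arrives at the third step.
stream = [(1.0, "f0"), (0.4, "f1"), (0.0, "f2"), (-0.2, "f3")]
print(duplex_loop(stream))  # -> ['f2', 'f3']
```

Training would then penalize both the `time_until_my_turn` error against ground-truth turn boundaries and the audio reconstruction loss, as described above.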
If your model is fast enough, you can definitely do it. That's literally how "streaming Whisper" works: just rerun the model on the accumulated audio every few hundred milliseconds. LLMs could definitely work the same way. Technically they're less complex than Whisper (which is an encoder/decoder architecture, while LLMs are decoder-only) but of course much larger (hence slower), so maybe rerun just a part of it? etc.
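The rerun-on-accumulated-audio pattern is simple enough to sketch. `transcribe` here is a stub standing in for any model call (e.g. a real Whisper inference); the point is the loop structure, not the model.

```python
# Minimal sketch of the "rerun on the accumulated audio" pattern.
# `transcribe` is a stand-in for a real model call (e.g. Whisper).

def transcribe(audio_so_far):
    # Stub: pretend each accumulated chunk decodes to one word.
    return " ".join(f"w{i}" for i in range(len(audio_so_far)))

def streaming_transcripts(chunks):
    """After each new audio chunk, re-run the model on the full buffer,
    yielding a fresh (possibly revised) transcript every time."""
    buffer = []
    for chunk in chunks:
        buffer.append(chunk)
        yield transcribe(buffer)  # full rerun every few hundred ms

for text in streaming_transcripts([b"a", b"b", b"c"]):
    print(text)  # transcripts grow (and may revise) as audio accumulates
```

Note that each rerun may revise earlier output, which is exactly why earlier transcripts can't be treated as final until the turn ends.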
Better solutions are possible, but even tiny models can be given a partial sentence and reply with a probability that the user is done talking.
The linked repo does this, it should work fine.
More advanced solutions are possible (you can train a model that maps speech directly to a turn-detection probability, without an intermediate text step), but what the repo does will work well enough for many scenarios.
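The text-based turn-detection pattern looks roughly like this. A real setup would put a small classifier or LLM behind `end_of_turn_probability`; this stub uses a crude punctuation-and-length heuristic purely to show the control flow, and the threshold value is an arbitrary illustration.

```python
# Hedged sketch: given a partial transcript, estimate P(user is done
# talking) and compare to a threshold. The heuristic below is a stand-in
# for a small model; the numbers are illustrative, not tuned.

def end_of_turn_probability(partial_text: str) -> float:
    text = partial_text.strip()
    if not text:
        return 0.0
    p = 0.2
    if text[-1] in ".?!":
        p += 0.6  # sentence-final punctuation is a strong signal
    if len(text.split()) >= 4:
        p += 0.1  # long enough to be a complete thought
    return min(p, 1.0)

def should_respond(partial_text: str, threshold: float = 0.7) -> bool:
    """Decide whether the assistant should start replying now."""
    return end_of_turn_probability(partial_text) >= threshold

print(should_respond("so what I was"))                   # False: mid-sentence
print(should_respond("what's the weather like today?"))  # True
```

The same loop runs on every new chunk of transcript, so the decision gets revisited continuously as the user speaks.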