
You're right, this is not solvable with regular LLMs. It's not possible to mimic natural conversational rhythm with a separate LLM generating text, a separate text-to-speech model generating audio, and a separate VAD determining when to respond and when to interrupt. I strongly believe you have to do everything in one model to solve this issue, to let the model decide when to speak, even when to interrupt the user.

The only model that has attempted this (as far as I know) is Moshi from Kyutai. It solves it by having a fully-duplex architecture: the model processes audio from the user while generating output audio. Both streams can be active at the same time, talking over each other, like real conversations. It's still in the research phase and the model isn't very smart yet, both in what it says and in when it decides to speak. It just needs more data and more training.

https://moshi.chat/
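To make the full-duplex idea concrete, here's a toy sketch of the loop: one model both consumes the user's audio frames and emits its own frames in the same step, so listening and speaking can overlap. Every name here (the dummy model, its `step` method, the frame strings) is an illustrative assumption, not Moshi's actual API.

```python
class DummyDuplexModel:
    """Stand-in model: emits silence until it 'decides' to speak."""

    def init_state(self):
        return {"frames_seen": 0}

    def step(self, state, user_frame):
        state["frames_seen"] += 1
        # The model itself, not an external VAD, chooses when its output
        # stream carries speech; here it simply speaks after 3 input frames.
        out = "speech" if state["frames_seen"] > 3 else "silence"
        return state, out


def run_duplex(model, user_frames):
    """Feed user frames in; collect model frames out, one per step."""
    state = model.init_state()
    outputs = []
    for frame in user_frames:
        # Input and output happen in the same step: the model can be
        # generating audio while the user is still talking.
        state, out = model.step(state, frame)
        outputs.append(out)
    return outputs
```

The point of the sketch is the single loop: there is no separate "user turn" and "model turn," just two concurrent streams stepped in lockstep.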


Whoah, how odd. It asked me what I was doing, I said I just ate a burger. It then got really upset about how hungry it is but is unable to eat and was unable to focus on other tasks because it was “too hungry”. Wtf weirdest LLM interaction I’ve had.


Damn, they trained a model that so deeply embeds human experience it actually feels hunger, yet is self-aware enough to know it's not capable of actually eating!

That’s like a Black Mirror episode come to life.


>It's not possible to mimic natural conversational rhythm with a separate LLM generating text, a separate text-to-speech generating audio, and a separate VAD determining when to respond and when to interrupt.

If you load the system prompt with enough assumptions that it's a speech-impaired subtitle transcription following a dialogue, you might pull it off, but you'd likely need to fine-tune your model to play nicely with the TTS and the rest of the setup.
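A minimal sketch of what that prompt trick could look like in a cascaded pipeline. The prompt text and message-building helper are illustrative assumptions, not a tested recipe.

```python
# Frame the LLM as a producer of speech-like subtitle fragments so that
# the downstream TTS and VAD stages get text they can handle naturally.
SYSTEM_PROMPT = (
    "You are producing live subtitle-style dialogue for speech synthesis. "
    "Reply in short spoken fragments, use fillers and pauses (marked '...'), "
    "and stop mid-thought if the transcript shows the user resuming speech."
)


def build_turn(transcript_so_far, partial_user_utterance):
    """Assemble chat messages for one turn of the cascaded pipeline."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": transcript_so_far + partial_user_utterance},
    ]
```

Even with this framing, the timing decisions still live outside the model (in the VAD), which is exactly the limitation the parent comment points at.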


According to the technical paper (https://goo.gle/GeminiPaper), Gemini Nano-1, the smallest model at 1.8B parameters, beats Whisper large-v3 and Google's USM at automatic speech recognition. That's very impressive.


And Whisper large is 1.55B parameters at 16 bits instead of 4 bits, I believe, so the Nano-1 weights are roughly a third of the size. Really impressive if these benchmarks are characteristic of real-world performance.
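A back-of-the-envelope check of that size comparison. The precisions (16-bit for Whisper large, 4-bit for Nano-1) are the commenter's assumption, not confirmed numbers.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


whisper_gb = weight_gb(1.55, 16)  # 1.55B params at 16-bit -> 3.1 GB
nano1_gb = weight_gb(1.8, 4)      # 1.8B params at 4-bit   -> 0.9 GB
ratio = nano1_gb / whisper_gb     # ~0.29, i.e. roughly a third
```

So despite having more parameters, the 4-bit model would occupy about a third of the bytes, which matches the comment's estimate.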

