I totally agree, but how, though? All these architectures work with an input-output model. What you describe would need something more akin to living organisms: some sort of AI that is actually coupled to its environment (however that is defined for it) rather than receiving inputs and giving outputs. A complex, allostatic kind of multimodality rather than a simplistic sequential one. I don't think anything like that exists, at least not on timescales that make sense for any practical use. And my belief is that the computational demands would be too high to approach with current methods.
In autoregressive models we can "feed forward" the model by injecting additional tokens: we compute the KV cache entries for those tokens (called "prefill") and then resume decoding. If we can do this quickly, on the same node that holds a hot KV cache (or otherwise has low-latency access to a shared KV cache), we are quite a ways closer to offering a full-duplex, or at least near-zero-latency, language model API. This does require a full-duplex connection (e.g. a WebSocket).
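To make the prefill-then-resume idea concrete, here is a minimal sketch with a toy single-head attention "model" in NumPy. Everything here (the random projections, the `KVCache` class, the function names) is illustrative, not any real serving stack's API; the point is only that injected tokens extend the cache without producing outputs, after which decoding continues against the updated cache.

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB = 8, 16  # toy model width and vocabulary size

# Stand-in for trained weights: fixed random projections.
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
embed = rng.standard_normal((VOCAB, D)) * 0.1

class KVCache:
    """Per-sequence key/value cache, kept 'hot' on the serving node."""
    def __init__(self):
        self.keys, self.values = [], []
    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
    def __len__(self):
        return len(self.keys)

def prefill(cache, token_ids):
    """Compute K/V entries for injected tokens; no sampling happens here."""
    for t in token_ids:
        x = embed[t]
        cache.append(x @ Wk, x @ Wv)

def decode_step(cache, token_id):
    """One autoregressive step: attend over the whole cache, extend it."""
    x = embed[token_id]
    q = x @ Wq
    cache.append(x @ Wk, x @ Wv)
    K, V = np.stack(cache.keys), np.stack(cache.values)
    scores = K @ q / np.sqrt(D)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    h = attn @ V
    return int(np.argmax(h @ embed.T))  # greedy next token

cache = KVCache()
prefill(cache, [3, 1, 4])         # context decoded so far
prefill(cache, [7, 7])            # user interjects mid-stream: prefill only
next_tok = decode_step(cache, 5)  # resume decoding against the updated cache
```

If prefill is fast and the cache stays resident on one node, the interjection costs only the prefill latency, which is what gets you toward the near-zero-latency behavior described above.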
True full-duplex communication, including interruption, will be more challenging but should be possible with current model architectures. The model may need to be able to emit no-op or "pause" tokens, or itself serve as the VAD (voice activity detector), and the positional encoding of tokens might need to be replaced or augmented with time and participant.
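One way to read "augmented with time and participant" is sketched below: replace the token-index sinusoid with one driven by wall-clock time, and add a per-speaker vector. The function names and the fixed participant vectors are hypothetical (in a real model the participant embeddings would be learned); this only shows that two tokens spoken by different parties at different moments get distinct representations even if the token itself is identical.

```python
import numpy as np

D = 8  # toy embedding width; must be even for the sin/cos pairs

def time_encoding(t_seconds, d=D):
    """Sinusoidal encoding over wall-clock time rather than token index."""
    freqs = 1.0 / (10000 ** (np.arange(d // 2) * 2 / d))
    angles = t_seconds * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Hypothetical participant embeddings (learned in a real model).
participants = {
    "assistant": np.full(D, 0.1),
    "user": np.full(D, -0.1),
}

def encode(token_vec, t_seconds, who):
    """Token representation = content + when it was said + who said it."""
    return token_vec + time_encoding(t_seconds) + participants[who]

# The same token, stamped with different times and speakers:
tok = np.zeros(D)
x1 = encode(tok, 0.00, "assistant")
x2 = encode(tok, 0.35, "user")  # overlapping speech from the other party
```

With time in the position signal, a stretch of silence (or a run of "pause" tokens) advances the encoding even though no content tokens were produced, which is roughly what interleaving two real-time audio streams into one sequence would require.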
I imagine the first language model with "awkward pauses" is only a year or so away.