I totally agree, but how, though? All these architectures work with an input-output model. What you describe would need something more akin to living organisms: some sort of AI that is actually coupled to its environment (however that is defined for it) rather than receiving inputs and giving outputs. A complex, allostatic kind of multimodality rather than a simplistic sequential one. I don't think anything like that exists, at least not on timescales that make sense for any practical use. And my belief is that the computational demands would be too high to approach with current methods.
In autoregressive models we can "feed forward" the model by injecting additional tokens: we compute the KV cache entries for those tokens (called "prefill") and then resume decoding. If we can do this quickly, on the same node that holds a hot KV cache (or otherwise has low-latency access to a shared KV cache), we are quite a ways closer to offering a full-duplex, or at least near-zero-latency, language model API. This does require a full-duplex connection (e.g. a WebSocket).
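To make the prefill-then-resume idea concrete, here is a minimal sketch with a toy single-head attention "model" in NumPy. Everything here (the random projections, the `KVCache` class, the function names) is illustrative, not any real serving stack's API; the point is only that injected tokens extend the cache without producing outputs, after which decoding continues against the updated cache.

```python
import numpy as np

rng = np.random.default_rng(0)
D, VOCAB = 8, 16  # toy model width and vocabulary size

# Stand-in for trained weights: fixed random projections.
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
embed = rng.standard_normal((VOCAB, D)) * 0.1

class KVCache:
    """Per-sequence key/value cache, kept 'hot' on the serving node."""
    def __init__(self):
        self.keys, self.values = [], []
    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)
    def __len__(self):
        return len(self.keys)

def prefill(cache, token_ids):
    """Compute K/V entries for injected tokens; no sampling happens here."""
    for t in token_ids:
        x = embed[t]
        cache.append(x @ Wk, x @ Wv)

def decode_step(cache, token_id):
    """One autoregressive step: attend over the whole cache, extend it."""
    x = embed[token_id]
    q = x @ Wq
    cache.append(x @ Wk, x @ Wv)
    K, V = np.stack(cache.keys), np.stack(cache.values)
    scores = K @ q / np.sqrt(D)
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()
    h = attn @ V
    return int(np.argmax(h @ embed.T))  # greedy next token

cache = KVCache()
prefill(cache, [3, 1, 4])         # context decoded so far
prefill(cache, [7, 7])            # user interjects mid-stream: prefill only
next_tok = decode_step(cache, 5)  # resume decoding against the updated cache
```

If prefill is fast and the cache stays resident on one node, the interjection costs only the prefill latency, which is what gets you toward the near-zero-latency behavior described above.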
True full-duplex communication, including interruption, will be more challenging but should be possible with current model architectures. The model may need to be able to emit no-op or "pause" tokens, or itself serve as the VAD (voice activity detector), and the positional encoding of tokens might need to be replaced or augmented with time and participant.
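One way to read "augmented with time and participant" is sketched below: replace the token-index sinusoid with one driven by wall-clock time, and add a per-speaker vector. The function names and the fixed participant vectors are hypothetical (in a real model the participant embeddings would be learned); this only shows that two tokens spoken by different parties at different moments get distinct representations even if the token itself is identical.

```python
import numpy as np

D = 8  # toy embedding width; must be even for the sin/cos pairs

def time_encoding(t_seconds, d=D):
    """Sinusoidal encoding over wall-clock time rather than token index."""
    freqs = 1.0 / (10000 ** (np.arange(d // 2) * 2 / d))
    angles = t_seconds * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

# Hypothetical participant embeddings (learned in a real model).
participants = {
    "assistant": np.full(D, 0.1),
    "user": np.full(D, -0.1),
}

def encode(token_vec, t_seconds, who):
    """Token representation = content + when it was said + who said it."""
    return token_vec + time_encoding(t_seconds) + participants[who]

# The same token, stamped with different times and speakers:
tok = np.zeros(D)
x1 = encode(tok, 0.00, "assistant")
x2 = encode(tok, 0.35, "user")  # overlapping speech from the other party
```

With time in the position signal, a stretch of silence (or a run of "pause" tokens) advances the encoding even though no content tokens were produced, which is roughly what interleaving two real-time audio streams into one sequence would require.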
I imagine the first language model with "awkward pauses" is only a year or so away.