The existing vision LLMs all work like this, and that covers most of the major models these days.
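Roughly, that pattern looks like the toy sketch below (PyTorch, not any particular model's code: the layer sizes, the single linear layer standing in for a real ViT encoder, and the class name are all made up for illustration). The point is just that image patches get projected into the same embedding space as the text tokens and then go through the transformer as one sequence.

```python
# Toy sketch of the common vision-LLM layout: vision encoder -> projection
# into the LLM's token-embedding space -> one combined sequence through the
# transformer. All dimensions and names here are invented for illustration.
import torch
import torch.nn as nn

class ToyVisionLLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, d_vision=768, n_patches=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)       # text tokens -> embeddings
        self.vision_encoder = nn.Linear(3 * 16 * 16, d_vision)   # stand-in for a ViT patch encoder
        self.projector = nn.Linear(d_vision, d_model)            # vision space -> LLM embedding space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.n_patches = n_patches

    def forward(self, image_patches, text_ids):
        # image_patches: (batch, n_patches, 3*16*16) flattened pixel patches
        # text_ids:      (batch, seq_len) token ids
        vis = self.projector(self.vision_encoder(image_patches))  # (batch, n_patches, d_model)
        txt = self.token_emb(text_ids)                            # (batch, seq_len, d_model)
        seq = torch.cat([vis, txt], dim=1)                        # one interleaved sequence
        hidden = self.transformer(seq)
        return self.lm_head(hidden[:, self.n_patches:])           # logits over the text positions

model = ToyVisionLLM()
patches = torch.randn(1, 64, 3 * 16 * 16)
tokens = torch.randint(0, 32000, (1, 10))
print(model(patches, tokens).shape)  # torch.Size([1, 10, 32000])
```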
Multi-modal audio models are a lot less common. GPT-4o was meant to be able to do this natively from the start, but they ended up shipping separate custom models based on it for their audio features. As far as I can tell, GPT-5 doesn't have audio input/output at all - the OpenAI features for that still use GPT-4o-audio.
I don't know if Gemini 2.5 (which is multi-modal for vision and audio) shares the same embedding space across all three modalities (text, vision and audio), but I expect it probably does.
There are many more weird and complex architectures in models for video understanding.
For example, beyond video -> text -> LLM and video -> embedding in the LLM, you can also have an LLM controlling/guiding a separate video extractor (there's a rough sketch of that pattern below the reference).
See this paper for a pretty thorough overview.
Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., & Xu, C. (2025). Video Understanding with Large Language Models: A Survey (No. arXiv:2312.17432). arXiv. https://doi.org/10.48550/arXiv.2312.17432
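To make that third pattern a bit more concrete, here's a hand-wavy sketch of an LLM driving a separate video extractor as a tool. Everything in it (the `Clip` structure, the fake planner, the function names, the time windows) is invented for illustration and isn't taken from the survey; real systems would call an actual LLM and an actual captioner/detector here.

```python
# Sketch of "LLM guiding a separate video extractor": the LLM never sees raw
# video, it repeatedly asks an extractor for evidence until it can answer.
# All names and logic below are placeholders invented for illustration.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Clip:
    start_s: float
    end_s: float
    caption: str

def video_extractor(question: str, window: tuple[float, float]) -> list[Clip]:
    """Stand-in for a captioner/detector the LLM can query on demand."""
    start, end = window
    return [Clip(start, end, f"objects and actions detected between {start}s and {end}s")]

def llm_plan(question: str, evidence: list[Clip]) -> str | tuple[float, float]:
    """Stand-in for the LLM deciding whether to answer or request more frames."""
    if len(evidence) < 2:
        # Not enough evidence yet: ask the extractor to look at the next window.
        return (30.0 * len(evidence), 30.0 * (len(evidence) + 1))
    return f"Answer to '{question}' based on {len(evidence)} extracted clips."

def answer(question: str) -> str:
    evidence: list[Clip] = []
    while True:
        step = llm_plan(question, evidence)
        if isinstance(step, str):                    # the LLM decided it can answer
            return step
        evidence += video_extractor(question, step)  # otherwise fetch the requested window

print(answer("What happens after the goal is scored?"))
```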
Yeah, and that's my understanding. Nothing goes video -> text, or audio -> text, or even text -> text, without first passing through that shared embedding space - that's where the core of the transformer architecture operates.