The existing vision LLMs all work like this, and that covers most of the major models these days.
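Roughly, that pattern looks like the toy sketch below (PyTorch, not any particular model's code: the layer sizes, the single linear layer standing in for a real ViT encoder, and the class name are all made up for illustration). The point is just that image patches get projected into the same embedding space as the text tokens and then go through the transformer as one sequence.

```python
# Toy sketch of the common vision-LLM layout: vision encoder -> projection
# into the LLM's token-embedding space -> one combined sequence through the
# transformer. All dimensions and names here are invented for illustration.
import torch
import torch.nn as nn

class ToyVisionLLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, d_vision=768, n_patches=64):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)       # text tokens -> embeddings
        self.vision_encoder = nn.Linear(3 * 16 * 16, d_vision)   # stand-in for a ViT patch encoder
        self.projector = nn.Linear(d_vision, d_model)            # vision space -> LLM embedding space
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.n_patches = n_patches

    def forward(self, image_patches, text_ids):
        # image_patches: (batch, n_patches, 3*16*16) flattened pixel patches
        # text_ids:      (batch, seq_len) token ids
        vis = self.projector(self.vision_encoder(image_patches))  # (batch, n_patches, d_model)
        txt = self.token_emb(text_ids)                            # (batch, seq_len, d_model)
        seq = torch.cat([vis, txt], dim=1)                        # one interleaved sequence
        hidden = self.transformer(seq)
        return self.lm_head(hidden[:, self.n_patches:])           # logits over the text positions

model = ToyVisionLLM()
patches = torch.randn(1, 64, 3 * 16 * 16)
tokens = torch.randint(0, 32000, (1, 10))
print(model(patches, tokens).shape)  # torch.Size([1, 10, 32000])
```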
Multi-modal audio models are a lot less common. GPT-4o was meant to be able to do this natively from the start, but they ended up shipping separate custom models based on it for their audio features. As far as I can tell, GPT-5 doesn't have audio input/output at all - the OpenAI features for that still use GPT-4o-audio.
I don't know if Gemini 2.5 (which is multi-modal for vision and audio) shares the same embedding space across all three modalities (text, vision and audio), but I expect it probably does.
There are many more weird and complex architectures in models for video understanding.
For example, beyond video -> text -> LLM and video -> embedding in the LLM, you can also have an LLM controlling/guiding a separate video extractor (there's a rough sketch of that pattern below the reference).
See this paper for a pretty thorough overview.
Tang, Y., Bi, J., Xu, S., Song, L., Liang, S., Wang, T., Zhang, D., An, J., Lin, J., Zhu, R., Vosoughi, A., Huang, C., Zhang, Z., Liu, P., Feng, M., Zheng, F., Zhang, J., Luo, P., Luo, J., & Xu, C. (2025). Video Understanding with Large Language Models: A Survey (No. arXiv:2312.17432). arXiv. https://doi.org/10.48550/arXiv.2312.17432
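To make that third pattern a bit more concrete, here's a hand-wavy sketch of an LLM driving a separate video extractor as a tool. Everything in it (the `Clip` structure, the fake planner, the function names, the time windows) is invented for illustration and isn't taken from the survey; real systems would call an actual LLM and an actual captioner/detector here.

```python
# Sketch of "LLM guiding a separate video extractor": the LLM never sees raw
# video, it repeatedly asks an extractor for evidence until it can answer.
# All names and logic below are placeholders invented for illustration.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Clip:
    start_s: float
    end_s: float
    caption: str

def video_extractor(question: str, window: tuple[float, float]) -> list[Clip]:
    """Stand-in for a captioner/detector the LLM can query on demand."""
    start, end = window
    return [Clip(start, end, f"objects and actions detected between {start}s and {end}s")]

def llm_plan(question: str, evidence: list[Clip]) -> str | tuple[float, float]:
    """Stand-in for the LLM deciding whether to answer or request more frames."""
    if len(evidence) < 2:
        # Not enough evidence yet: ask the extractor to look at the next window.
        return (30.0 * len(evidence), 30.0 * (len(evidence) + 1))
    return f"Answer to '{question}' based on {len(evidence)} extracted clips."

def answer(question: str) -> str:
    evidence: list[Clip] = []
    while True:
        step = llm_plan(question, evidence)
        if isinstance(step, str):                    # the LLM decided it can answer
            return step
        evidence += video_extractor(question, step)  # otherwise fetch the requested window

print(answer("What happens after the goal is scored?"))
```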
Yeah, and that's my understanding. Nothing goes video -> text, or audio -> text, or even text -> text, without first passing through that shared embedding space - that's where the core of the transformer architecture operates.