What do they mean by this? Isn't this roughly what GPT-4 Vision and LLaVA do?
The closest reading I can come up with is something like LLaVA being a language-to-vision model, but I can't steelman the idea into anything that makes sense.
Maybe they're just lying?