>There's some progress in video generation which takes a short clip and extrapolates what happens next. That's a promising line of development. The key to "common sense" is being able to predict what happens next well enough to avoid big mistakes in the short term, a few seconds. How's that coming along? And what's the internal world model, assuming we even know?
That GTA demo isn't about control. The user, not the net, is driving.
That's more like the demos where someone trains on a scene and the neural net can make plausible extensions to the scene as you move the viewpoint. It's more spatial imagination, like the tool in Photoshop that fills in plausible but imaginary backgrounds.
It does handle collisions with the edge of the road. Collisions with other cars don't really work; the other cars mostly just disappear. One car splits in half in confusion.
The spatial part is making progress, but the temporal part, not so much.
https://www.youtube.com/watch?v=udPY5rQVoW0
This has been a thing for a while. It's actually a funny way to demonstrate model-based control: keep the learned model and replace the controller with a human.
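To make the "swap the controller for a human" point concrete, here is a rough toy sketch, not the demo's actual network. Everything in it is an assumption for illustration: `world_model` is a hand-written stand-in for a learned next-state predictor, the state is a single number (offset from lane centre), `planner_controller` is a plain random-shooting planner, and `scripted_human` is a placeholder for a person at the keyboard.

```python
import random
from typing import Callable, List

State = float   # toy 1-D state: lateral offset from lane centre
Action = float  # toy steering input in [-1, 1]


def world_model(state: State, action: Action) -> State:
    """Hypothetical stand-in for a learned next-state predictor."""
    return state + 0.5 * action


def planner_controller(state: State, rng: random.Random,
                       horizon: int = 3, samples: int = 64) -> Action:
    """Random-shooting planner: sample action sequences, roll each one
    out in the model, and return the first action of the cheapest one."""
    best_cost, best_first = float("inf"), 0.0
    for _ in range(samples):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, cost = state, 0.0
        for a in seq:
            s = world_model(s, a)
            cost += s * s  # penalise distance from lane centre
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first


def scripted_human(state: State) -> Action:
    """Placeholder for a human driver: steer back toward the centre."""
    return max(-1.0, min(1.0, -state))


def rollout(controller: Callable[[State], Action],
            start: State, steps: int) -> List[State]:
    """Run any controller against the same model; the model never
    knows or cares who is choosing the actions."""
    states = [start]
    for _ in range(steps):
        states.append(world_model(states[-1], controller(states[-1])))
    return states


if __name__ == "__main__":
    rng = random.Random(0)
    planned = rollout(lambda s: planner_controller(s, rng), 2.0, 20)
    human = rollout(scripted_human, 2.0, 10)
    print("planner final offset:", planned[-1])
    print("human final offset:", human[-1])
```

The only thing that changes between the two runs is the function passed to `rollout`. In the GTA-style demo the model is a video-predicting net rather than a one-line function, but the division of labour is the same: the model predicts, and whoever holds the controls decides.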