The point is to see whether an LLM's wide general knowledge can also be an advantage in something like sensory-data + action learning. Current self-driving models don't have that.
I don't understand this logic; it's a stretch. Accuracy depends entirely on the type of problem and how well the model is trained on it, so there is no way you can extrapolate like this.
You can ask them to solve a math equation that takes multiple steps, and if they are trained on that kind of problem, they are accurate nearly 100 percent of the time.
Like ask gpt-4o to solve different variations of
"""What is the answer to 2x + 7 = 31?"""
If the numbers are of similar magnitude and simplicity, it will follow the same steps and be right 99%+ of the time. The only reason I'm not saying 100% is that I haven't tried it enough times, but I don't see it getting these wrong.
For example """What is the answer to 2x + 4 = -6?"""
Just run a test yourself with random integers between 0 and 20; it will definitely not be wrong 5%-10% of the time. It will be correct 99%+ of the time. A rough sketch of such a test is below.
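Here's a minimal sketch of what I mean, assuming the openai Python package (v1+) with OPENAI_API_KEY set in the environment; the exact prompt wording and the regex for pulling out the final answer are just my guesses at a reasonable setup, not anything special about the model:

```python
# Rough accuracy test: ask gpt-4o to solve random ax + b = c equations
# and check its final answer. Assumes `pip install openai` (v1+) and
# OPENAI_API_KEY in the environment.
import random
import re
from openai import OpenAI

client = OpenAI()

def run_trial():
    # Pick small integers so the answer x = (c - b) / a is a whole number.
    a = random.randint(1, 20)
    x = random.randint(0, 20)
    b = random.randint(0, 20)
    c = a * x + b
    prompt = f"What is the answer to {a}x + {b} = {c}? End your reply with 'x = <number>'."
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content
    # Grab the last "x = <integer>" in the reply and compare to the true answer.
    matches = re.findall(r"x\s*=\s*(-?\d+)", text)
    return bool(matches) and int(matches[-1]) == x

trials = 50
correct = sum(run_trial() for _ in range(trials))
print(f"{correct}/{trials} correct ({100 * correct / trials:.0f}%)")
```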
Where is this 5%-10% figure even coming from? You could also keep asking it "What is the capital of France?" and it's going to be right 99%+ of the time.