This was true until they began training models with Reinforcement Learning from Verifier Feedback (starting with o1). By putting a calculator in the training loop, they seem to have escaped the arithmetic-error regime. That said, ChatGPT's default model is 4o, which is still susceptible to these issues.
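To make the "calculator in the training loop" idea concrete, here is a minimal sketch of a verifier-based reward, assuming the setup checks a model's arithmetic answer against a trusted calculator and rewards exact matches. Every name here is hypothetical; this is not the actual training code.

```python
import ast
import operator

# Supported binary operators for the toy calculator.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> float:
    """Safely evaluate a basic arithmetic expression (the 'verifier')."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant):
            return float(node.value)
        if isinstance(node, ast.BinOp):
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval"))

def reward(problem: str, model_answer: float, tol: float = 1e-6) -> float:
    """Binary reward: 1.0 if the model's answer matches the calculator."""
    return 1.0 if abs(calculator(problem) - model_answer) <= tol else 0.0
```

Because the reward is computed by a deterministic tool rather than a learned judge, the model cannot be rewarded for confidently wrong arithmetic, which is the usual explanation for why this training regime fixed the error pattern.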
We built AccountingBench, a test where LLMs must "close the books" for a real SaaS business using 1 year of Stripe/Ramp/Rippling/Mercury data.
Claude 4 and Grok 4 start strong: within 1% of human CPA baselines in month 1.
But as the months progress, every model accumulates compounding errors and exhibits erratic behavior, drifting far from the correct figures.
That said, the early accuracy here is promising. With targeted post-training, models may be able to replace humans for this kind of work.