Yeah, I've had similar results. Even with GPT-o1, I find that almost all errors at this point come from the web search functionality and the model treating some random source as authoritative. Interestingly, my human intelligence in the process is most useful for hand-collecting the sources and data to analyze -- and, of course, for directing the process across multiple LLM queries.
