(Ryadh from ClickHouse here)
Your comment is spot-on.
This is the main challenge with agentic analytics, and there are known limitations. It is also where we are orienting our investments at the moment.
Our own experience running internal agents taught us that the best remediation comes from providing the LLMs with the maximum and most accurate context possible. Robust evaluations are also critical to measure accuracy, detect regressions, and improve. But there is no silver bullet.
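To make "robust evaluations" concrete, here's a minimal sketch of what such a harness can look like (every name is hypothetical; the key design choice is grading on result sets rather than SQL text, since many different queries can produce the same correct answer):

    # Minimal text-to-SQL eval sketch; all names here are hypothetical.
    # Each case pairs a question with a human-verified "golden" query.
    GOLDEN_CASES = [
        {"question": "How many users signed up yesterday?",
         "golden_sql": "SELECT count() FROM users WHERE signup_date = yesterday()"},
        # ... more cases covering joins, time windows, edge cases ...
    ]

    def evaluate(generate_sql, run_sql):
        passed = 0
        for case in GOLDEN_CASES:
            candidate = generate_sql(case["question"])  # model under test
            try:
                # Grade on result equality, not SQL string equality.
                same = run_sql(candidate) == run_sql(case["golden_sql"])
            except Exception:
                same = False  # invalid SQL counts as a failure
            passed += same
        return passed / len(GOLDEN_CASES)

Running this against each model or prompt revision gives you an accuracy number to track, which is what lets you detect regressions instead of guessing.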
SOTA LLMs are getting increasingly good at generating SQL but remain notoriously bad at math and numbers in general. Combining them with powerful querying capabilities bridges that gap and makes the overall experience a useful one.
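The division of labor is what matters: the model only translates questions into SQL, and the database does all the arithmetic deterministically. A rough sketch of that pattern using the clickhouse-connect Python client (the ask_llm_for_sql helper is a made-up placeholder for whatever model call you use):

    import clickhouse_connect

    client = clickhouse_connect.get_client(host="localhost")

    def ask_llm_for_sql(question: str) -> str:
        # Hypothetical stand-in for the model call; in practice this
        # sends the schema plus the question to an LLM and returns SQL.
        return "SELECT count() FROM events WHERE event_date = today()"

    def answer(question: str):
        sql = ask_llm_for_sql(question)       # the model only translates
        return client.query(sql).result_rows  # the database does the math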
IMO, we'll always have to deal with the stochastic nature of these models and with hallucinations, which calls for caution and for raising awareness within the user base. What I found watching our users internally is that, while it's not magical, it lets users request data more often, which compounds into better data-driven decision-making, assuming the users are trained to interpret the results.
I'll freely admit you have more data (experience) to work with on this than I did in the tests I ran almost a year ago. I spent a lot of time documenting my schemas, feeding the LLM sample rows, and so on, and the final results were not useful enough even as a starting point for a static query that a developer would improve on and "hard code" into a UI. I approached it as both:
- Wouldn't it be cool to let my users chat with their data? ("How many new users signed up today/this event/this month/etc?" or "How much did we make yesterday?")
- An internal tool to use as a starting point for analytics dashboards
I still use LLMs to help write queries when it's something I know can be done but can't remember the syntax for, but I scrapped the project aimed at both of the above goals because of too many mistakes. Maybe my data is just too "dirty" (though honestly, I've never _not_ seen dirty data), and/or I should have cleaned up the deprecated columns in my tables that confused the models (even with strict instructions to ignore them, I should have filtered them out completely). But I spent way too much time repeating myself, talking in all caps, and generally fighting with the SOTA models to get them to understand my data well enough to generate queries that actually worked (worked as in returned valid data, not just valid SQL).

I wasn't doing any training/fine-tuning (which may be the magic needed), but it felt like a dead end given the models of the time. I'll also stress that I haven't re-tested those theories on newer models and my results are at least a year out of date (a lifetime in LLMs/AI), but the fundamental issues I ran into didn't seem to be on the cusp of being solved or anything like that.
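In hindsight, the deprecated-column fix is mechanical: strip those columns from the context entirely instead of asking the model to ignore them. Roughly something like this (the DEPRECATED set and the table layout are invented for illustration):

    # Build the schema context the LLM sees, dropping deprecated columns
    # entirely rather than instructing the model to ignore them.
    DEPRECATED = {"users.legacy_id", "users.old_email"}  # hypothetical

    def schema_context(tables: dict[str, list[str]]) -> str:
        lines = []
        for table, columns in tables.items():
            kept = [c for c in columns if f"{table}.{c}" not in DEPRECATED]
            lines.append(f"{table}({', '.join(kept)})")
        return "\n".join(lines)

    print(schema_context({"users": ["id", "legacy_id", "email", "old_email"]}))
    # -> users(id, email)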
I wish you all the best of luck in improving on this kind of thing.
Keep in mind that it's not entirely "fair", since these public datasets are often documented on the internet and so are already present in the pre-training data of the underlying model (Claude Sonnet 4.5 in this case).