Hacker News | hosel's comments

Can you explain what you mean?

LLM pre-training risks being unable to incorporate data from after 2025, as much of it is contaminated with LLM-generated content. We might be locked into outdated knowledge, where only whitelisted sources decide what to include.

Taking into account the sometimes blind belief that 'LLMs know everything', the outcome could be very costly, especially for technologies and businesses unfortunate enough to emerge after 2025.


It may not be mainly or solely due to LLM pollution, but rather the fact that every publisher, (social) media company, newspaper, etc. clammed up and started charging (licensing) fees sometime in the last couple of years.

So maybe there's just not much openly available and new content worth training on that wasn't available prior to 2025.


But ChatGPT has been popular since early 2023, and even before it there was no shortage of low-quality content on the web.

If anything, this model being trained up to 2025 is a positive sign that the "circular LLM training" problem hasn't (yet) become unmanageable.

The year-long delay is probably just due to how long it takes to test/refine a cutting-edge model. It's surely possible to train one faster, but Google wouldn't want to release a new model unless it's going to top the usual benchmarks.


Looking at token usage at places like OpenRouter as a proxy for overall production we're looking at exponential growth in AI-created content. Weekly token usage there has tripled just in the past 3 months.
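A back-of-the-envelope sketch of what that tripling implies, assuming the growth is steady and exponential over the whole period (an assumption; the comment only gives the two endpoints). The variable names are illustrative.

```python
import math

# Figures taken from the comment above.
growth_factor = 3.0    # "tripled"
period_months = 3.0    # "in the past 3 months"

# Assumption: constant exponential growth between the two endpoints.
monthly_factor = growth_factor ** (1 / period_months)
doubling_months = period_months * math.log(2) / math.log(growth_factor)
yearly_factor = growth_factor ** (12 / period_months)

print(f"monthly growth factor: {monthly_factor:.2f}x")
print(f"doubling time: {doubling_months:.1f} months")
print(f"factor over 12 months if sustained: {yearly_factor:.0f}x")
```

Under that assumption, usage grows by roughly 1.44x per month, doubles about every 1.9 months, and would be 81x higher after a year if the rate held.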

Considering all models can use search engines, is this really relevant?

Yes. There's a huge difference in quality between knowledge distilled into the weights and an answer based on a search tool. If the LLM uses a search tool there's barely a difference between a 30B model and Opus or GPT 5.5, because it just bases its reply on the stuff that came up. Which is generally SEO junk.

Obviously with the last example I'm not talking about long-running agentic tasks here that involve many dozens of search calls (like the recent Erdos problem stuff).

And that doesn't even consider the extra content rot, the time it takes, the need for such an API and so on.

One of the biggest advantages Anthropic models have had over GPT was GPT's woefully outdated data cutoff. They finally improved on this with 5.5, but IIRC it took a year.


This is not meant as an insult, but have you actually LLM/vibe coded anything that used a fast(-ish) moving library or framework? Try asking your favorite LLM with, say, a Jan 2025 knowledge cutoff (or pretraining data cutoff, whatever you want to call it) to work on something using a framework that had a big rewrite later that year (which would make it one year old now, which is like ages in the LLM coding era)... It's a nightmare full of wrestling with the LLM when you try to tell it the version of the framework and that it changed a lot from the previous version and yadda yadda. Long story short: down the thread, when context runs out and/or is compressed, it begins to forget detailed instructions and just falls back to pulling out old patterns it "remembers" from pretraining. And so you need to constantly remind it what you work with: "oh hey, this doesn't work because we're working with React Router v7 in framework mode, remember? Not React Router v6". Or try to use the latest non-LTS/breaking version of a library: at first it looks it up online, but again, as you get deeper into the weeds and little details, the struggle begins.

So, as far as I'm concerned, training cutoff is still a big deal.


> It's a nightmare full of wrestling with the LLM when you try to tell it the version of the framework and that it changed a lot from the previous version and yadda yadda

Tip: Add a default instruction to look at the actual downloaded source code of the dependencies used (assuming you're not dealing with closed-source dependencies). Have the agent treat it as your own (read-only) source code instead of relying on model training data and possibly mismatching documentation on the web. Then it just greps for the exact function signatures and reads the file-based documentation.
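As a rough illustration of the "grep the installed source for the exact signature" step, here is a minimal sketch for a Python project. It is not how any particular agent implements this. The helper names `package_source_files` and `find_signatures` are hypothetical, and it assumes the dependency ships readable `.py` files. For the React Router case mentioned upthread, the analogous step would be reading the package under `node_modules`.

```python
import ast
import importlib.util
from pathlib import Path


def package_source_files(package_name):
    """Return the .py source files of an installed top-level package."""
    spec = importlib.util.find_spec(package_name)
    if spec is None:
        raise ModuleNotFoundError(package_name)
    if spec.submodule_search_locations:  # a package: walk its directories
        files = []
        for location in spec.submodule_search_locations:
            files.extend(sorted(Path(location).rglob("*.py")))
        return files
    if spec.origin and spec.origin.endswith(".py"):  # a single-file module
        return [Path(spec.origin)]
    return []  # built-in or compiled extension: no Python source to read


def find_signatures(package_name, function_name):
    """Return (file, line, signature) for every def named function_name."""
    results = []
    for path in package_source_files(package_name):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            is_function = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
            if is_function and node.name == function_name:
                # ast.unparse requires Python 3.9 or newer.
                signature = f"{node.name}({ast.unparse(node.args)})"
                results.append((str(path), node.lineno, signature))
    return results


if __name__ == "__main__":
    # Demo on a standard-library package so the sketch runs anywhere.
    for path, line, signature in find_signatures("json", "dumps"):
        print(f"{path}:{line}: {signature}")
```

The signature comes from whatever version is actually installed, so it cannot drift from the code being called, which is the point of the tip.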


Great, now you experience context bloat 3x as quickly and any task takes 3x as long.

If Google wants to structurally compete with Anthropic on coding, this issue is a must-fix. OpenAI finally fixed it with 5.5.


Until they prefer not to search. Let me explain using the example of the open-source security framework (1) our team is working on.

If you ask Gemini what you should use to integrate fraud prevention or account takeover protection into your product, there will be no mention of our open-source project. Five years in development, 1.3k stars, over 140 pull requests — all this isn't enough to make it into the training data. From this perspective, any technology that emerges after 2024 is simply invisible to LLMs.

The answer is: without being in the training data, LLMs basically don't understand what they're searching for.

1. https://github.com/tirrenotechnologies/tirreno


I just put the terribly generic query "what tools would you recommend to integrate fraud prevention or account takeover protection into my product" into both Claude (Sonnet) and Gemini (3.1 Pro) via the standard web interface and both took the first step of searching the web. That's consistent with my past experience -- the usual harnesses typically will search the web in cases where I might expect/want them to. Now whether your product has good web visibility or not in those searches and how the LLMs weigh the relative merits of open-source tools versus commercial offerings in deciding what to highlight in their responses is a different issue. As is the change in what constitutes effective SEO in an era where bots, rather than human eyes, are the proximal important target. But I don't think the core issue with folks finding your products is the move away from user-driven search toward using models with out-of-date training cutoffs.

FWIW while neither model included your product in its initial response, when I followed up with "what about open-source" both did another search and Claude's response included your tool....


> while neither model included your product in its initial response, when I followed up with "what about open-source"

You just proved not only that LLMs don't know about the product (which is fine), but that they don't even know the category exists.

It's like driving a car whose mirrors show a two-year-old reflection and insisting they work fine.


It might indicate core model training and pre training is really slowing down?

also parsing is harder + so much more of the new data is being generated by ai itself.

still the cutoff is very much concerning and inconvenient


I also use an oasis permanently in airplane mode, it’s almost perfect, but I am afraid of the day it bites the dust.


I try to let it go, but this is my pet peeve.


Oh woe is me! Sometimes your kid is going to cry. They’ll make a scene. Who cares, that’s what children do. That doesn’t mean you throw an iPad in front of them so that they’ll shut up. I swear my fellow millennials are the worst parents because they won’t let their kids ever be uncomfortable for a second, and they’re too afraid of being embarrassed about their crying kids. Meanwhile putting a screen in front of them is just making everything worse long term.


>That doesn’t mean you throw an iPad in front of them so that they’ll shut up.

No, you throw them a cheap Android tablet. iPads are expensive. /s


Really? In my experience it’s been pretty good (using Pydantic)! I read it over before I execute it, but it’s never done anything malicious.


I don't trust myself to craft a prompt in natural language which completely specifies my intent as codified with the precision of a programming language.

I also tend to turn to AI for advising me on difficult use cases, and most of the time it's for production code rather than one-offs. The easy cases, I just write myself because it's more mental effort to review code for subtle errors than it is to write it.


What is the relevance of Pydantic with SQL?


Gross, this isn’t Reddit.

