If you as a human are well behaved, that is absolutely fine.
If you as a human spam the shit out of my website and waste my resources, I will block you.
If you as a human use an agent (or browser or extension or external program) that modifies network requests on your behalf, but doesn't act as a massive leech, you're still welcome.
If you as a human use an agent (or browser or extension or external program) that wrecks my website, I will block you and the agent you rode in on.
Nobody would mind if you had an LLM that intelligently knew what pages contain what (because it had a web-crawler-backed index that refreshed at a respectful rate, identified itself accurately as a robot, and followed robots.txt), and even if it needed to make an instantaneous request for you at the time of a pertinent query, so long as it still identified itself as a bot and was still respectful... there would be no problem.
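None of that is hard to do, either. Here's a minimal sketch of what a respectful on-demand fetch could look like; the bot name, contact URL, and example site are made-up placeholders, and a real implementation would also honour any Crawl-delay and cache what it fetched:

```python
# Sketch of a respectful on-demand fetch: honest bot User-Agent, robots.txt
# consulted before anything else. Bot name and contact URL are placeholders.
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleAssistantBot/1.0 (+https://example.com/bot-info)"

def polite_fetch(url: str) -> bytes | None:
    parsed = urllib.parse.urlparse(url)
    # Ask the site's robots.txt before touching anything else on the host.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site owner said no; respect that

    # Identify honestly and fetch once.
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read()
```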
The problem is that LLMs are run by stupid, greedy, evil people who don't give the slightest shit what resources they use up on the hosts they're sucking data from. They don't care what the URLs are, or what the site owner wants to keep them away from. They download massive static files hundreds or thousands of times a day, without even doing a HEAD request to see that the file hasn't changed in 12 years. They straight up ignore robots.txt, and in fact use it as a template of what to go for first. It's like hearing an old man say "I need time to stand up because of this problem with my kneecaps" and thinking "right, I best go for his kneecaps because he's weak there".
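The "hasn't changed in 12 years" check is one conditional request. A sketch, assuming you cached the ETag / Last-Modified from the previous fetch (a HEAD request comparing Last-Modified would do the same job):

```python
# Sketch of the change check they skip: a conditional GET using cached
# ETag / Last-Modified validators, so an unchanged file costs the host a
# 304 response and a few headers instead of the whole download.
import urllib.error
import urllib.request

def fetch_if_changed(url: str, cache: dict) -> bytes | None:
    headers = {"User-Agent": "ExampleAssistantBot/1.0 (+https://example.com/bot-info)"}
    if cache.get("etag"):
        headers["If-None-Match"] = cache["etag"]
    if cache.get("last_modified"):
        headers["If-Modified-Since"] = cache["last_modified"]

    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            # File actually changed (or first fetch): remember the validators.
            cache["etag"] = resp.headers.get("ETag")
            cache["last_modified"] = resp.headers.get("Last-Modified")
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # 304 Not Modified: nothing to re-download
        raise
```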
There are plenty of open crawler datasets they should be using... but they don't, because they think that doesn't differentiate them enough from others using "fresher" data, so they crawl even the smallest sites dozens of times a day in case those small sites got updated. Their badly written software is wrecking sites, and they don't care about the wreckage. Not their problem.
The people who run these agents, LLMs, whatever, have broken every rule of decency in crawling, and now they're deliberately evading checks to try and run away from the repercussions of their actions. They are bad actors and need to be stopped. It's like the fuckwads who scorch the planet mining bitcoin: there's so much money flowing into the AI market that they feel they have to fuck over everyone else, as soon as possible, otherwise they won't get that big flow of money. They have zero ethics. They have to be stopped before their human behaviour destroys the entire internet.