to be absolutely nerve-wracking. Not hard to do, but it's just batshit crazy and breaks the whole idea of how web crawlers are supposed to work. On the other hand, we had trouble with people (who we know want to crawl us specifically) crawling a site where you visit
http://example.com/item/448828
and it loads an SPA which in turn fetches a well-structured JSON document.
They crawl with no cache, so every visit downloads megabytes of HTML, JavaScript, images, and who knows what else -- and if they want to deal with the content in a structured way and put it in a database, it's already in the exact format they want. But I guess it's easier to stand up a Rube Goldberg machine and write parsers when you could look at our site in the developer tools, figure out how it works in five minutes, and just load those JSON documents into a document database and be querying right out of the gate.
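For what it's worth, the "just hit the JSON" version is about a dozen lines. A rough sketch in TypeScript on Node -- the /api path and the MongoDB setup are assumptions; the real endpoint is whatever shows up in the Network tab:

    import { MongoClient } from "mongodb";

    // Hypothetical endpoint -- the real path is whatever the SPA fetches,
    // visible in the browser's developer tools (Network tab).
    const res = await fetch("http://example.com/api/item/448828");
    if (!res.ok) throw new Error(`fetch failed: ${res.status}`);
    const item = await res.json();

    // The payload is already structured, so it can go straight into a
    // document database (MongoDB here) and be queried immediately.
    const client = new MongoClient("mongodb://localhost:27017");
    await client.connect();
    await client.db("crawl").collection("items").insertOne(item);
    await client.close();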
What I would want is to GET http://example.com/item/448828 with an Accept header of 'application/s-expression,application/json;q=0.1' instead of retrieving the HTML representation of the resource. HTTP is the API.
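On the server side this is plain old content negotiation, which frameworks have supported forever. A sketch of what it could look like with Express (hypothetical handler, stand-in data, not anyone's actual code):

    import express from "express";

    const app = express();

    // One URL, several representations; the client's Accept header picks one.
    app.get("/item/:id", (req, res) => {
      const item = { id: req.params.id, name: "example item" }; // stand-in lookup
      res.format({
        // res.format sets the matching Content-Type before calling the handler
        "application/s-expression": () =>
          res.send(`(item (id ${item.id}) (name "${item.name}"))`),
        "application/json": () => res.json(item),
        "text/html": () => res.send(`<h1>${item.name}</h1>`),
        default: () => res.status(406).send("Not Acceptable"),
      });
    });

    app.listen(3000);

With that in place, curl -H "Accept: application/json" http://localhost:3000/item/448828 comes back as JSON, the Accept header above would get the s-expression, and a browser (which asks for text/html) still gets the page.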
It felt like this was an opportunity for the AI craze to build on top of existing standards; instead they all invented their own stuff like llms.txt and MCP *sigh*