> Does anyone know what's the deal with these scrapers, or why they're attributed to AI?
You don't really need to guess; it's obvious from the access logs. I realize not everyone runs their own server, so here are a couple of excerpts from mine to illustrate:
And to give a sense of scale, my cgit instance received 37 212 377 requests over the last 60 days, >99% of which are bots. The access.log from nginx grew to 12 GiB in those 60 days. They scrape everything they can find, indiscriminately, including endpoints that have to do quite a bit of work, leading to a baseline 30-50% CPU utilization on that server right now.
Oh, and of course, almost nothing of what they are scraping actually changed in the last 60 days; it's literally just a pointless waste of compute and bandwidth. I'm actually surprised that the hosting companies haven't blocked all of them yet, this has to increase their energy bills substantially.
Some bots also seem better behaved than others: OpenAI alone accounts for 26 million of those 37 million requests.
> ChatGPT-User is not used for crawling the web in an automatic fashion. Because these actions are initiated by a user, robots.txt rules may not apply.
So, not AI training in this case, nor any other large-batch scraping, but rather inference-time Retrieval Augmented Generation, with the "retrieval" happening over the web?
Likely, at least for some. I've caught various chatbots/CLI harnesses more than once inspecting a GitHub repo file by file (often multiple times, because of context rot).
But the sheer volume makes it unlikely that's the only reason. It's not like everybody constantly has questions about the same tiny website.
Those would have the user agent "ChatGPT-User" though, and I barely see those. The majority comes from "GPTBot" like in my excerpt above, which makes it pretty clear that it's used for some sort of training:
"GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models."
You can probably already get that if you order a somewhat significant amount of chips directly from Raspberry Pi. They seem to already have everything required for it, it's literally just setting a bit differently during factory programming.
But I'm assuming you're talking about consumer use, in which case my question is: why? There is absolutely no way you're ever benefiting from them spinning up an extra SKU with significantly less volume (most people want the ARM cores).
Even if they decide to eat the costs for the benefit of consumers, at most the chip would be what, 15 cents cheaper? I really struggle to see how that's a meaningful difference for hobbyist use.
> You can probably already get that if you order a somewhat significant amount of chips directly from Raspberry Pi.
It's not currently possible. As of the A3 stepping, the ARM_DISABLE OTP bit is ignored as a security mitigation - changing that would require a new mask revision.
The TPM itself can actually be discrete, as long as you have a root of trust inside the CPU with a unique secret. Derive a secret from the unique secret and the hash of the initial boot code the CPU is running, e.g. HMAC(UDS, hash(program)), and derive a public/private key pair from that. Now you can just do normal Diffie-Hellman to negotiate encryption keys with the TPM, and you're safe from any future interposers.
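A minimal sketch of that derivation, assuming OpenSSL for the primitives (the UDS and boot code buffers here are placeholders; a real root of trust would do this in ROM before handing control to the measured code):

    #include <openssl/evp.h>
    #include <openssl/hmac.h>
    #include <openssl/sha.h>
    #include <stddef.h>
    #include <stdint.h>

    /* DICE-style derivation as described above: the derived secret changes
       whenever the measured boot code changes, so tampered firmware can
       never impersonate the genuine image. */
    void derive_cdi(const uint8_t uds[32],           /* unique device secret */
                    const uint8_t *bootcode, size_t bootcode_len,
                    uint8_t cdi[32])                 /* derived secret */
    {
        uint8_t measurement[SHA256_DIGEST_LENGTH];
        unsigned int cdi_len = 0;

        SHA256(bootcode, bootcode_len, measurement);   /* hash(program) */
        HMAC(EVP_sha256(), uds, 32,                    /* HMAC(UDS, ...) */
             measurement, sizeof measurement, cdi, &cdi_len);
        /* cdi then seeds the key pair used for the Diffie-Hellman
           exchange with the discrete TPM. */
    }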
This matters because for some functionality you really want tamper-resistant persistent storage, for example "delete the disk encryption keys if I enter the wrong password 10 times". Fairly easy to do on a TPM that can be made on a process node that supports flash vs a general CPU where that just isn't an option.
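In rough C, the lockout logic looks something like this; every nv_* call below is a hypothetical stand-in for the TPM's tamper-resistant NV storage, the point being that the counter and the sealed key live where an attacker can't reset or roll them back:

    #include <stdint.h>

    /* Hypothetical stand-ins, not a real TPM API: */
    uint32_t nv_read_counter(void);
    void     nv_write_counter(uint32_t v);
    void     nv_destroy_key(void);
    int      password_matches(const char *pw);

    int try_unlock(const char *password) {
        uint32_t failures = nv_read_counter();
        if (failures >= 10) {
            nv_destroy_key();              /* wipe the disk encryption key */
            return -1;
        }
        nv_write_counter(failures + 1);    /* count BEFORE checking, so
                                              cutting power mid-attempt
                                              doesn't grant free guesses */
        if (!password_matches(password))
            return -1;
        nv_write_counter(0);               /* success: reset the counter */
        return 0;
    }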
If that's a concern, you can lock the OTP either permanently or with a password before you hand them out. Or just use the older RP2040.
But I don't think that "targeting the education market" is accurate in the first place. They certainly make sure to serve that market with their very nicely priced Pico boards but it hardly seems to be their only goal. You don't go through the effort of spinning up a new revision to fix security holes if there aren't at least some industry customers.
It's worth noting that strcpy() isn't just bad from a security perspective, on any CPU that's not completely ancient it's bad from a performance perspective as well.
Take the best-case scenario: copying a string where the precise length is unknown, but we know it will always fit in, say, 64 bytes.
In earlier days, I would always have used strcpy() for this task, avoiding the "wasteful" extra copies memcpy() would make. It felt efficient; after all, you only replace an "i < len" check with a "buf[i] != '\0'" check inside your loop, right?
But of course it doesn't actually work that way: copying one byte at a time is inefficient, so instead we copy as many as possible at once, which is easy to do with just a length check but not so easy if you need to find the null byte. And on top of that, you're asking the CPU to predict a branch that depends entirely on input data.
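To make that concrete, here's roughly what the two approaches look like (an illustrative sketch, not any particular libc):

    #include <string.h>

    /* strcpy-style: one byte per iteration, and the loop exit depends on
       the data itself, so the branch predictor can only guess. */
    void copy_until_nul(char *dst, const char *src) {
        size_t i = 0;
        do {
            dst[i] = src[i];
        } while (src[i++] != '\0');
    }

    /* memcpy-style: the trip count is known up front, so the compiler or
       libc is free to move 16/32/64 bytes per iteration with SIMD. */
    void copy_known_len(char *dst, const char *src, size_t len) {
        memcpy(dst, src, len);   /* "wasteful" extra bytes, but in bulk */
    }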
Given that the C ABI is basically the standard for how arbitrary languages interact, I wouldn't characterize all of the headaches this can cause as just when other languages interact with C; arguably it can come up when any two languages interact at all, even if neither is C.
Arguably the C ABI was one of those Worse is Better problems, like the C language itself. Better languages already existed, but C was basically free and easy to implement, so now there's C everywhere. It seems likely that, if not for this ABI, we might today have an ABI where all languages that want to offer FFI can agree on how to represent, say, an immutable slice reference type (Rust's &[T], C++'s std::span).
Just an agreed ABI for slices would be enough that language A's growable array type (Rust's Vec, C++'s std::vector, but equally ArrayList; some languages even call this just "list") of, say, 32-bit signed integers can give a (read-only) reference to language B to look at all those 32-bit signed integers, without languages A and B having to agree on how growable arrays work at all. In C today you have to go wrestle with the ABI pig for much less.
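As a sketch, an agreed slice ABI could be as small as this (hypothetical, of course; no such cross-language standard exists, which is rather the point):

    #include <stddef.h>
    #include <stdint.h>

    /* A hypothetical cross-language slice: pointer plus element count.
       Rust's &[i32] and C++'s std::span<const int32_t> both carry exactly
       this information; what's missing is an agreed, guaranteed layout. */
    typedef struct {
        const int32_t *ptr;
        size_t len;
    } i32_slice;

    /* Language B only ever sees this view; it never needs to know how
       language A's growable array manages capacity or reallocation. */
    int64_t sum_slice(i32_slice s) {
        int64_t total = 0;
        for (size_t i = 0; i < s.len; i++)
            total += s.ptr[i];
        return total;
    }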
From a historical perspective, my guess is that C interop in some fashion has basically been table stakes for any language of the past few decades, and when you want to plug two arbitrary languages together, if that's the one common API they both speak, it's the most obvious way to do it. I'm not sure I'd consider this "worse is better" as much as just self-reinforcing emergent behavior. I'm not even sure I can come up with any example of an explicitly designed format for arbitrary language interop other than maybe WASM (which of course is a lot more than just that, but it does try to tackle the problem of letting languages interact in an agnostic way).
Ideally, the standard would include a type that packages a string with its length, and functions that use that type and/or take the length as an argument. But even without that, it is possible to avoid using null-terminated strings in a lot of places.
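Something as small as this would already go a long way (a sketch of such a counted-string type, not anything in the actual standard):

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical counted string: the length travels with the data. */
    typedef struct {
        const char *data;   /* need not be null-terminated */
        size_t len;
    } str;

    /* Wrapping a C string costs one strlen at the boundary... */
    static str str_from_cstr(const char *s) {
        return (str){ s, strlen(s) };
    }

    /* ...after which every operation knows the length up front. */
    static int str_eq(str a, str b) {
        return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
    }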
Since C++11 it is required to be null-terminated; you can access the terminator with (e.g.) operator[], and the string can contain non-terminator null characters.
It has NUL termination for compatibility with C, so you can call c_str() and get a C string. The caveat is that a std::string can have NULs anywhere, which breaks C semantics. But C++ itself does not rely on the terminator.
>Which of course causes issues when languages with more proper strings interact with C but there you go.
Is it an issue of "more proper strings", or just languages trying to have their cake and eat it too: have their own sense of a string and C interoperability? I think this is where we see the strength of Zig; its strings are designed around and extend the C idea of a string, instead of just saying "our way is better" and blaming C for any friction.
My standard disclaimer comes into play here, I am not a programmer and very much a humanities sort, I could be completely missing what is obvious. Just trying to understand better.
Edit: That was not quite right, Zig has its string literal for C compatibility. There is something I am missing here in my understanding of strings in the broader sense.
Modern x86 CPUs have actual instructions for strcpy that work fairly well. There were several false starts along the way, but the performance is fine now.
They have instructions for memcpy/memmove (i.e. rep movs), not for strcpy.
They also have instructions for strlen (i.e. rep scasb), so you could implement strcpy with very few instructions by finding the length and then copying the string.
Executing strlen first, then validating the sizes, and then copying with memcpy if possible is actually the recommended way of implementing a replacement for strcpy, including in the parent article.
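In code, that recommendation comes out to just a few lines (names made up for illustration; the parent article's version may differ in details):

    #include <string.h>

    /* strlen-then-memcpy: one data-dependent pass to find the length,
       then one bulk copy with a known size. */
    int copy_string(char *dst, size_t dst_size, const char *src) {
        size_t len = strlen(src);
        if (len >= dst_size)
            return -1;              /* won't fit, terminator included */
        memcpy(dst, src, len + 1);  /* bulk copy, SIMD/rep-movs friendly */
        return 0;
    }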
On modern Intel/AMD CPUs, "rep movs" is usually the optimal way to implement memcpy above some threshold of data size, e.g. on older AMD Zen 3 CPUs the threshold was 2 kB. I have not tested more recent CPUs to see if the threshold has diminished.
On the old AMD Zen 3 there was also a certain size range above 2 kB, at sizes comparable with the L3 cache, where the implementation interacted badly with the cache and "non-temporal" vector register transfers outperformed "rep movs". Despite that performance bug for certain string lengths, using "rep movs" for any size above 2 kB gave good enough performance.
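For reference, the "rep movs" flavor of memcpy is tiny; a GCC/Clang inline-asm sketch for x86-64, not a drop-in replacement:

    #include <stddef.h>

    /* memcpy via "rep movsb": on CPUs with ERMS/FSRM the microcode picks
       the wide copy strategy internally, which is why it wins above some
       size threshold. */
    static void *memcpy_repmovs(void *dst, const void *src, size_t n) {
        void *ret = dst;
        __asm__ volatile("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(n)
                         :
                         : "memory");
        return ret;
    }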
No, it's an instruction for memcpy. You still need to compute the string length first, which means touching every byte individually because you can't use SIMD due to alignment assumptions (or lack thereof) and the potential to touch uninitialized or unmapped memory (when the string crosses a page boundary).
The spec and some sanitizers use a scalar loop (because they need to avoid mistakenly detecting UB), but a real-world libc is unlikely to use one.
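The trick a real implementation leans on is that an aligned 16-byte load can never cross a page boundary, so reading a little past the terminator is safe in practice even though it's UB on paper. A sketch for GCC/Clang with SSE2:

    #include <emmintrin.h>   /* SSE2 */
    #include <stddef.h>
    #include <stdint.h>

    size_t strlen_sse2(const char *s) {
        /* Round down to a 16-byte boundary: the aligned load stays within
           one page, so it cannot fault past the terminator. */
        const char *p = (const char *)((uintptr_t)s & ~(uintptr_t)15);
        const __m128i zero = _mm_setzero_si128();

        unsigned mask = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        mask &= ~0u << (unsigned)(s - p);   /* ignore bytes before s */

        while (mask == 0) {
            p += 16;
            mask = (unsigned)_mm_movemask_epi8(
                _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        }
        return (size_t)((p + __builtin_ctz(mask)) - s);
    }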
Android kernels 4.19 and higher should all have support included for WireGuard unless the OEM specifically disables it: [0]. The Pixel 8 ships with the Android 14 6.1 kernel, so it most definitely should have WireGuard kernel support. BTW, you can check this in the WireGuard app: if you go to settings, it will show the backend that's in use.
Kernel support should have no bearing as the apps are purely userspace apps. You can use the kernel mode if you root the phone, but that's not a typical scenario.
Well, the issue isn't kernel vs user space, but you are correct that you still need a custom ROM and/or root unfortunately. I had assumed Android had also allowed netlink sockets for WireGuard but alas they did not. So the app can't communicate with the kernel module, bummer.
You're looking at the wrong thing: WireGuard doesn't use AES, it uses ChaCha20. AES is really, really painful to implement securely in software alone, and the result performs poorly. But ChaCha only uses addition, rotation, and XOR on 32-bit numbers, which makes it pretty performant even on fairly computationally limited devices.
For reference, I have an implementation of ChaCha20 running on the RP2350 at 100 Mbit/s on a single core at 150 MHz (910/64 = ~14.22 cycles per byte). That's a lot for a cheap microcontroller costing around 1.5 bucks total. And that's not even taking into account the second core the RP2350 has, or overclocking (it runs fine at 300 MHz too, at double the speed).
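The core of it, the quarter-round from RFC 8439, really is nothing but 32-bit add/rotate/xor:

    #include <stdint.h>

    static inline uint32_t rotl32(uint32_t x, int n) {
        return (x << n) | (x >> (32 - n));
    }

    /* ChaCha quarter-round per RFC 8439; no table lookups, no
       multiplications, which is why it runs well on small MCUs. */
    static void quarter_round(uint32_t *a, uint32_t *b,
                              uint32_t *c, uint32_t *d) {
        *a += *b; *d ^= *a; *d = rotl32(*d, 16);
        *c += *d; *b ^= *c; *b = rotl32(*b, 12);
        *a += *b; *d ^= *a; *d = rotl32(*d, 8);
        *c += *d; *b ^= *c; *b = rotl32(*b, 7);
    }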
You’re totally right; I got myself spun around thinking AES instead of ChaCha, because the product I work on (ZeroTier) started with the latter initially and moved to AES later. I honestly just plain forgot that WireGuard hadn’t followed the same path.
An embarrassing slip, TBH. I’m gonna blame pre-holiday brain fog.
So yes, they are definitely running scrapers that are this badly written.
Also, old scraper bots trying to disguise themselves as GPTBot seems wholly unproductive; they try to imitate users, not bots.
[0] https://openai.com/gptbot.json