Browsertrix (and Webrecorder tools in general) also violate the standard by modifying response data. A WARC response record is supposed to hold the raw bytes as they were sent over the network (minus TLS).
This isn't really true: our tools do not modify response data for no reason!
Our tools do the best they can with an old format that is in use by many institutions. The WARC format does not account for HTTP/2 or HTTP/3 (H2/H3), which most sites use nowadays.
The goal of our (Webrecorder) tools is to preserve interactive web content with as much fidelity as possible and make it accessible/viewable in the browser. That means stripping TLS and H2/H3 framing, sometimes forcing a certain video resolution, etc., while preserving the authenticity and interactivity of the site. It can be a tricky balance.
If the goal is to preserve 'raw bytes sent over the network' you can use Wireshark / packet capture, but your archive won't necessarily be useful to a human.
imo the Webrecorder stuff is truly state of the art, if they're pushing the limits of WARC standards it's for good reason, and I trust their judgement. They pioneered the newer WACZ standard and are really pushing the whole field forward.
Unfortunately the Raspberry Pi version isn't supported anymore because their version of Wget isn't well-tested on ARM. I'm surprised it's still there in the repo.
That doesn't really confirm that, IMHO. The line before claims the opposite, in fact: "The Internet Archive has been around since 1996, and while that does not guarantee anything, it shows continuity of support and strong commitment to digital preservation with as much access as possible."
Yeah, the format really needs an update. For starters, WARC only officially supports HTTP/1.1. Webrecorder has started faking HTTP/1.1 data in WARC files in order to save other versions, but I don't think faking data is great for an archival format, especially if it isn't standardized.
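To make the "faking HTTP/1.1" point concrete, here's a rough sketch of what that kind of rewrite looks like. This is not Webrecorder's actual code, just an illustration under my own assumptions: the function names `to_http11_block` and `warc_response_record` are made up, and the input is assumed to be an already-decoded HTTP/2-style response (status code, header list including `:status`-style pseudo-headers, body). The original H2 frames and HPACK-compressed bytes are gone by this point, which is exactly the objection.

```python
import uuid
from http import HTTPStatus


def to_http11_block(status, headers, body):
    """Render a decoded HTTP/2-style response as an HTTP/1.1-style message.

    Hypothetical helper: HTTP/2 has no status line or reason phrase, so
    both are synthesized here rather than taken from the wire.
    """
    try:
        reason = HTTPStatus(status).phrase
    except ValueError:
        reason = ""  # unknown status code: leave the reason phrase empty
    lines = [f"HTTP/1.1 {status} {reason}"]
    for name, value in headers:
        if name.startswith(":"):
            continue  # pseudo-headers such as :status have no HTTP/1.1 form
        lines.append(f"{name}: {value}")
    head = "\r\n".join(lines).encode("utf-8") + b"\r\n\r\n"
    return head + body


def warc_response_record(target_uri, http_block, date):
    """Wrap an HTTP message in a minimal WARC/1.1 response record."""
    fields = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", date),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_block))),
    ]
    head = "WARC/1.1\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    # A record is: header block, blank line, content block, two CRLFs.
    return head.encode("utf-8") + http_block + b"\r\n\r\n"


if __name__ == "__main__":
    h2_headers = [(":status", "200"), ("content-type", "text/html")]
    block = to_http11_block(200, h2_headers, b"<p>hi</p>")
    record = warc_response_record(
        "https://example.com/", block, "2023-03-21T00:00:00Z"
    )
    print(record.decode("utf-8"))
```

The resulting record is readable by tools that expect HTTP/1.1, but nothing in it says the status line was synthesized, so a reader can't tell it apart from a genuine HTTP/1.1 capture.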
ArchiveTeam's general idea is to archive it first, and make it usable later. Wayback Machine is generally how stuff is "made usable", but in cases like this, a searchable index might need to be made. (That's been done before, though - it's not like it's never happened.)
Archive Team hasn't really done much yet - all they've done is an ArchiveBot job, which isn't getting anywhere because of DPReview's rate limits, and some discovery scrapes. This is probably a different issue.
You select the project you run at any given moment. You can also select ArchiveTeam's Choice, which is usually the most urgent project or the one that needs the most IPs, but if you'd rather choose yourself, you can pick any project.
The entire WARC ecosystem is kind of a mess.