TheTechRobo's comments

TheTechRobo · on Oct 17, 2024

Browsertrix (and Webrecorder tools in general) also violate the standard by modifying response data. It's supposed to be the raw bytes as they are sent over the network (minus TLS).

The entire WARC ecosystem is kind of a mess.

ikreymer · on Oct 17, 2024

This isn't really true; our tools do not just modify response data for no reason!

Our tools do the best that we can with an old format that is in use by many institutions. The WARC format does not account for H2/H3 data, which is used by most sites nowadays.

The goal of our (Webrecorder) tools is to preserve interactive web content with as much fidelity as possible and make it accessible/viewable in the browser. That means stripping TLS, H2/H3, sometimes forcing a certain video resolution, etc., while preserving the authenticity and interactivity of the site. It can be a tricky balance.

If the goal is to preserve 'raw bytes sent over the network' you can use Wireshark / packet capture, but your archive won't necessarily be useful to a human.

CorentinB · on Oct 17, 2024

He didn't say you modify the data for no reason, he said you violate the standard. Which is true. You could respect it, but you don't.

nikisweeting · on Oct 17, 2024

imo the Webrecorder stuff is truly state of the art, if they're pushing the limits of WARC standards it's for good reason, and I trust their judgement. They pioneered the newer WACZ standard and are really pushing the whole field forward.

TheTechRobo · on Aug 4, 2024

Unfortunately the Raspberry Pi version isn't supported anymore because their version of Wget isn't well-tested on ARM. I'm surprised it's still there in the repo.

You can get Docker to emulate x86, though.

TheTechRobo · on April 4, 2024

That doesn't really confirm that, IMHO. The line before claims the opposite, in fact: "The Internet Archive has been around since 1996, and while that does not guarantee anything, it shows continuity of support and strong commitment to digital preservation with as much access as possible."

TheTechRobo · on Jan 31, 2024

Everything transmitted over HTTP. That includes cookies, passwords, etc. You need to be careful when writing WARCs.
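
As a rough illustration of what "being careful" could look like, here is a minimal Python sketch that flags credential-bearing headers in a raw HTTP/1.1 message before it gets written out. The function name `find_sensitive_headers` and the header list are my own assumptions for this example, not part of any WARC tool, and it only looks at headers (a password in a POST body would not be caught).

```python
from typing import List

# Hypothetical, non-exhaustive list of headers that commonly carry credentials.
SENSITIVE_HEADERS = {"cookie", "set-cookie", "authorization", "proxy-authorization"}


def find_sensitive_headers(http_message: bytes) -> List[str]:
    """Return the names of sensitive headers found in a raw HTTP/1.1 message."""
    # Headers end at the first blank line; the body (if any) is ignored here.
    head, _, _body = http_message.partition(b"\r\n\r\n")
    header_lines = head.split(b"\r\n")[1:]  # skip the request/status line
    found = []
    for line in header_lines:
        name, sep, _value = line.partition(b":")
        if sep and name.strip().lower().decode("latin-1") in SENSITIVE_HEADERS:
            found.append(name.strip().decode("latin-1"))
    return found


# Example request as it might appear in a WARC request record's block.
request = (
    b"GET /account HTTP/1.1\r\n"
    b"Host: example.com\r\n"
    b"Cookie: session=abc123\r\n"
    b"Authorization: Basic dXNlcjpwYXNz\r\n"
    b"\r\n"
)

print(find_sensitive_headers(request))  # ['Cookie', 'Authorization']
```

Whether you then drop the record, redact it, or just keep the WARC private is a separate decision; redacting changes the captured bytes, so it has its own trade-offs.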

TheTechRobo · on Jan 31, 2024

Yeah, the format really needs an update. For starters, WARC only officially supports HTTP/1.1. Webrecorder has started faking HTTP/1.1 data in WARC files in order to save other versions, but I don't think faking data is great for an archival format, especially if it isn't standardized.
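
To make the HTTP/1.1 assumption concrete, here is a minimal Python sketch of how a WARC `response` record wraps a captured HTTP message. The helper `build_warc_response_record` is hypothetical and written only with the standard library; it is not how any particular tool does it, and it omits optional fields such as digests.

```python
import uuid
from datetime import datetime, timezone


def build_warc_response_record(target_uri: str, http_response: bytes) -> bytes:
    """Wrap a raw HTTP/1.1 response (status line, headers, body) in a WARC record."""
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    date = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Record-ID: {record_id}",
        f"WARC-Date: {date}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http;msgtype=response",
        # Length of the record block, i.e. the HTTP message stored unmodified.
        f"Content-Length: {len(http_response)}",
    ]
    header_block = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # A record ends with two CRLFs after the block.
    return header_block + http_response + b"\r\n\r\n"


raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 2\r\n"
    b"\r\n"
    b"ok"
)
record = build_warc_response_record("http://example.com/", raw_response)
print(record.decode("utf-8"))
```

The record block starts with a textual status line like `HTTP/1.1 200 OK`. HTTP/2 and HTTP/3 use binary framing and have no such line on the wire, so a tool capturing them has to synthesize one to fit this shape.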

TheTechRobo · on April 12, 2023

ArchiveTeam's general idea is to archive it first, and make it usable later. Wayback Machine is generally how stuff is "made usable", but in cases like this, a searchable index might need to be made. (That's been done before, though - it's not like it's never happened.)

TheTechRobo · on March 22, 2023

Archive Team hasn't really done much yet - all they've done is an ArchiveBot job, which isn't getting anywhere because of DPReview's rate limits, and some discovery scrapes. This is probably a different issue.

1vuio0pswjnm7 · on March 22, 2023

https://wiki.archiveteam.org/index.php/DPReview

TheTechRobo · on March 23, 2023

Yes, what is your point?

TheTechRobo · on March 22, 2023

You select the project you run at any given moment. You can also select Archiveteam's Choice, which is usually the most urgent or the one with the most IPs needed, but if you do want to select it yourself, you pick the project.

TheTechRobo · on March 2, 2022

Similar: github.com/InternetArchive/warcprox

TheTechRobo · on March 2, 2022

There's github.com/ArchiveTeam/grab-site, but unfortunately it's not maintained very well.