I would like to know why the ZIP/HTML polyglot format produced by SingleFile [1] and mentioned in the article achieves "static" and "single", but not "efficiency". What's inefficient about it compared to the gwtar format?
'efficiency' is downloading only the assets needed to render the current view. How does it implement range requests and avoid downloading the entire SingleFileZ when a web browser requests the URL?
I haven't looked closely, but I get the impression that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call and rely on range requests (zip.js supports them) to unzip and display the page. This could also be transparent for the user, depending on whether the file is served via HTTP or not. However, I admit that I haven't implemented this mechanism yet.
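For what it's worth, the bootstrap being discussed could look something like the sketch below. This is purely an illustration of the idea, not code from SingleFileZ or gwtar; `tailRange` and `bootstrapFromRanges` are made-up names. The one concrete fact it relies on is that the zip end-of-central-directory record is at most 22 bytes plus a 64 KB comment, so fetching the last 64 KB + 22 bytes is always enough to locate the central directory.

```javascript
// Hypothetical sketch of the "window.stop() + range request" bootstrap
// discussed above. Illustrative only; not actual SingleFileZ code.

// The EOCD record is at most 22 bytes plus a 64 KB comment, so the last
// 65536 + 22 bytes of the file always contain it.
function tailRange(fileSize, maxTail = 65536 + 22) {
  const start = Math.max(0, fileSize - maxTail);
  return `bytes=${start}-${fileSize - 1}`;
}

async function bootstrapFromRanges(url, fileSize) {
  // Stop the browser from downloading/parsing the rest of the document...
  window.stop();
  // ...then fetch only the tail containing the zip central directory.
  const response = await fetch(url, {
    headers: { Range: tailRange(fileSize) }
  });
  const tail = new Uint8Array(await response.arrayBuffer());
  // From here, parse the central directory and issue further range
  // requests for each entry actually needed (zip.js can do this part).
  return tail;
}

// Only meaningful in a browser, for a file served over HTTP(S).
if (typeof window !== "undefined" && location.protocol.startsWith("http")) {
  // bootstrapFromRanges(location.href, fileSizeFromContentLength);
}
```

The whole thing degrades gracefully: when the file is opened from the filesystem instead, range requests are unavailable and the page has to be parsed in full, which matches the `file:` caveat mentioned elsewhere in the thread.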
> that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call...However, I admit that I haven't implemented this mechanism yet.
Well, yes. That's why we created Gwtar and I didn't just use SingleFileZ. We would have preferred to not go to all this trouble and use someone else's maintained tool, but if it's not implemented, then I can't use it.
(Also, if it had been obvious to you how to do this window.stop+range-request trick beforehand, and you just hadn't gotten around to implementing it, it would have been nice if you had written it up somewhere more prominent; I was unable to find any prior art or discussion.)
The reason I did not implement the innovative mechanism you describe is that, in my case, all the technical effort was (and is) focused on reading the archive from the filesystem. No one had suggested it either.
The call to window.stop() stops HTML parsing/rendering, which is unnecessary since the script downloads the page via HTTP and decompresses it as-is as a binary file (zip.js supports concatenated payloads before and after the zip data). However, in my case, the call to window.stop() is executed asynchronously once the binary has been downloaded, and may therefore be too late. This is probably less effective than in your case with gwtar.
I implemented this in the simplest way possible because if the zip file is read from the filesystem, window.stop() must not be called immediately because the file must be parsed entirely. In my case, it would require slightly more complex logic to call window.stop() as early as possible.
Edit: Maybe it's totally useless though, as documented here [1]: "Because of how scripts are executed, this method cannot interrupt its parent document's loading, but it will stop its images, new windows, and other still-loading objects." (you mentioned it in the article)
Edit #2: Since I didn't know that window.stop() was most likely useless in my case, I understand your approach much better now. Thank you very much for clarifying that with your question!
Well, it seems easy enough to test if you think you are getting efficiency 'for free'. Dump a 10GB binary into a SingleFileZ, and see if your browser freezes.
I just ran a test on a 10GB HTML page and called window.stop() via a 100ms setTimeout, which, in my opinion, simulates what would happen in a better implementation in SingleFile if the call to window.stop() were made as soon as the HTTP headers of the fetch request are received (i.e. an easy fix). And it actually works: it interrupts the loading at approx. 15MB of data and the rendering of the page, which is partially and smoothly displayed (no freeze). So it's not totally useless, but it at least deserves to be optimized in SingleFile, as I indicated. In the end, the MDN documentation is not very clear...
Edit #2: I've just understood that "parent" in "this method cannot interrupt its *parent* document's loading" from the MDN doc probably means the "parent" of the frame (when the script is running inside it).
OK, so assuming you clean that up a bit and this becomes officially supported in SingleFile/SingleFileZ, what is missing compared to Gwtar? Anything important or just optional features like image recompression and PAR2?
If we were to compare this to the JS world, it seems Python’s async is closer to Babel-style generator-based coroutines [1] than to JavaScript’s async/await execution model.
Anecdotally (not to diminish any bug the parent had), SingleFile is one of my favorite extensions. Been using it for years and it's saved my ass multiple times. Thank you!
Edit: What's the best way to support the project? I see there's an option through the Mozilla store and through GitHub. Is there a preference?
Great idea. It would seem some people have already implemented it for the same type of need (see the list of user agents in the source code). The implementation looks simple.
For implementation in a library, you can use HttpRangeReader [1][2] in zip.js [3] (disclaimer: I am the author). It's a solid feature that has been in the library for about 10 years.
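Under the hood, a range-based reader boils down to translating read(offset, length) calls into HTTP `Range` headers and expecting `206 Partial Content` responses. A minimal sketch of that mechanism (`rangeHeader` and `readRange` are illustrative names I made up, not zip.js's actual API):

```javascript
// Sketch of the core of an HTTP range reader. Illustrative only;
// zip.js's HttpRangeReader is the real, battle-tested implementation.

// HTTP byte ranges are inclusive on both ends.
function rangeHeader(offset, length) {
  return `bytes=${offset}-${offset + length - 1}`;
}

async function readRange(url, offset, length) {
  const response = await fetch(url, {
    headers: { Range: rangeHeader(offset, length) }
  });
  // A server that ignores Range replies 200 with the full body,
  // which defeats the whole point.
  if (response.status !== 206) throw new Error("server ignored Range header");
  return new Uint8Array(await response.arrayBuffer());
}
```

With zip.js itself, the usage is roughly `new ZipReader(new HttpRangeReader(url))` followed by `getEntries()`, which then only fetches the byte ranges it needs.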
Based on your experience, is zip the optimal archive format for long term digital archival in object storage if the use case calls for reading archives via http for scanning and cherry picking? Or is there a more optimal archive format?
Unfortunately, I will have difficulty answering your question because my knowledge is limited to the zip format. In the use case presented in the article, I find that the zip format meets the need well. Generally speaking, in the context of long-term archiving, its big advantage is also that there are thousands of implementations for reading/writing zip files.
ZIP isn't a terrible format, but it has a couple of flaws and limitations which make it a less than ideal format for long-term archiving. The biggest ones I'd call out are:
1) The format has limited and archaic support for file metadata - e.g. file modification times are stored as a MS-DOS timestamp with a 2-second (!) resolution, and there's no standard system for representing other metadata.
2) The single-level central directory can be awkward to work with for archives containing a very large number of members.
3) Support for 64-bit file sizes exists but is a messy hack.
4) Compression operates on each file as a separate stream, reducing its effectiveness for archives containing many small files. The format does support pluggable compression methods, but there's no straightforward way to support "solid" compression.
5) There is technically no way to reliably identify a ZIP file, as the end of central directory record can appear at any location near the end of the file, and the file can contain arbitrary data at its start. Most tools recognize ZIP files by the presence of a local file header signature at the start ("PK\x03\x04"), but that's not reliable.
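The backward scan that a strictly correct reader has to do looks roughly like this (a hedged sketch, not code from any particular tool): search the last 22 + 65535 bytes for the EOCD signature "PK\x05\x06", since the trailing comment can be up to 64 KB long.

```javascript
// Sketch of the backward scan needed to find the zip end-of-central-
// directory (EOCD) record. Illustrative only.
const EOCD_SIGNATURE = [0x50, 0x4b, 0x05, 0x06]; // "PK\x05\x06"
const EOCD_MIN_SIZE = 22;                        // fixed fields, no comment

function findEocdOffset(bytes) {
  // The comment length field is 16 bits, so the EOCD starts at most
  // 22 + 65535 bytes from the end of the file.
  const lowest = Math.max(0, bytes.length - EOCD_MIN_SIZE - 65535);
  for (let i = bytes.length - EOCD_MIN_SIZE; i >= lowest; i--) {
    if (EOCD_SIGNATURE.every((b, j) => bytes[i + j] === b)) return i;
  }
  return -1; // not a zip file (or a corrupted comment)
}

// Tiny example: arbitrary leading data (as in an HTML/ZIP polyglot),
// then a 22-byte EOCD record for an empty archive.
const prefix = new TextEncoder().encode("<!doctype html>...");
const eocd = new Uint8Array(EOCD_MIN_SIZE);
eocd.set(EOCD_SIGNATURE);
const file = new Uint8Array(prefix.length + eocd.length);
file.set(prefix);
file.set(eocd, prefix.length);
```

This is exactly why polyglots work (the prefix is ignored), and also why "is this a zip?" has no cheap, reliable answer.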
> there's no straightforward way to support "solid" compression.
I do it by ignoring ZIP's native compression entirely, using store-only ZIP files and then compressing the whole thing at the filesystem level instead.
Here's an example comparison of the same WWW site rip as a DEFLATE ZIP, as a store-only ZIP with zstd filesystem compression, as a tar with the same zstd filesystem compression (identical size, but less useful for seeking due to the lack of a trailing directory compared to ZIP), and finally the raw size pre-zipping:
982M preserve.mactech.com.deflate.zip
408M preserve.mactech.com.store.zip
410M preserve.mactech.com.tar
3.8G preserve.mactech.com
[Lammy@popola] zfs get compression spinthedisc/Backups/WWW
NAME PROPERTY VALUE SOURCE
spinthedisc/Backups/WWW compression zstd local
This probably wouldn't help GP with their need for HTTP seeking since their HTTP server would incur a decompress+recompress at the filesystem boundary.
It's for when you have a very large number of mostly-identical files, like web pages with consistent header and footer. If 408MiB versus 3.8GiB is a meaningless difference to you then sure don't bother with compression, but why I want it should be very obvious to most people here.
The last example in my list of four file sizes is the raw files in a folder. Filesystem compression works at the file level, so you have to turn many almost-identical files into one file in order to benefit from it. ZFS does have block-level deduplication, but that's its own can of worms that shouldn't be turned on flippantly, due to the resource requirements and `recordsize` tuning needed to really benefit from it.
FYI, zip.js has no issues with 1 (it can be fixed with standard extra fields), 3 (zip64 support), or 5 (the search for the end of central directory record is bounded, since you cannot have more than 64K of comment data at the end of the file).
With regard for the first two - that's good for zip.js, but the problem is that support for those features isn't universal. There's been a lot of fragmentation over the last 36 years (!).
As far as the last point (file type detection) goes, the generally agreed-upon standard is that file formats should be "sniffable" by looking for a signature in the file's header, ideally within the first few bytes. Having to search through the last 64 KB of the file for a signature is a major departure from that pattern.
The warning message you mentioned simply means that the extension can inject "content scripts" into the web pages you visit. This feature is necessary, for example, to remove ads that cannot be blocked via HTTP.
Seems like a reasonable case for disregarding the client preference. If you're able to speak TLS then you're able to load up a public domain (de)compression library.
One doesn't preclude the other, but I have serious doubts about the free speech concerns. There are moderate and extreme movements in Europe; they all express themselves freely within the law.
Of course, we’re probably seeing the normal amount of movement when administrations shift. I work for a global company and have only witnessed the EU to US movement. I’m sure both are happening.
What's normal? You just offended a whole group of people and probably had no intention of doing it. You're proving the point on why the whole thing doesn't work. You should get your comment removed if you seriously believe this to be true and want to avoid being hypocritical.
I'm sorry if you feel offended. My comments will be removed by HN moderators if they think it's necessary. You can flag and downvote them meanwhile. You can also contact the moderators or your local police if you think it is necessary.
EDIT: It might not be possible to flag or downvote comments. So, I recommend you contact the moderators or your local police. You can find my name and address on my GitHub profile.
I have no interest in pressing charges against you or getting you in trouble. That would be hypocritical of me and contrary to my beliefs. I was simply pointing out how easy it is to offend people on the internet and why it simply doesn't work at scale. If no one has the right to say something potentially offensive on the internet, then the whole thing needs to be shut down.
Thank you for explaining your point of view on the value of this conversation. Here's mine: I consider offense to be quite subjective at times. In some cases, it's possible to offend someone without meaning to. The solution to this problem is to apologize and offer to talk about it. If that doesn't work, and the offense is in some way "forbidden", then the offended person can simply defend themselves by going to the “authorities”.
I sincerely apologize if you found my response offensive. It was not my intention and I try to respect the moderation rules on HN. I just wanted to give my point of view, as a person living in the EU.
[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG