A quick search doesn't reveal any hits for "compression"; I wonder what formats the National Archives uses themselves. Some compression formats include extra metadata for fancy error correction, but I'm not qualified to evaluate how useful it is or which advanced settings to employ.
The data types are grouped into categories, each of which has its own document with recommended file encodings (including compression).
For example, the document for digital audio[0] lists FLAC as a "preferred" format and MP3 as "acceptable".
I didn't see any discussion of selecting formats based on recovering the original data from corrupted storage, and given the context I doubt that was given much consideration. For data archival it's better to have replication and checksums, so that if any file is corrupted it can be restored from a replica rather than attempting a partial recovery.
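For the sake of illustration, here's a minimal sketch of that checksum-plus-replica approach in Python; the directory layout and manifest format are invented for the example, not anything the National Archives prescribes:

```python
import hashlib
import shutil
from pathlib import Path

PRIMARY = Path("/archive/primary")        # hypothetical locations
REPLICA = Path("/archive/replica")
MANIFEST = PRIMARY / "MANIFEST.sha256"    # lines of "<hexdigest>  <relative path>"

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_and_repair() -> None:
    for line in MANIFEST.read_text().splitlines():
        digest, rel = line.split(maxsplit=1)
        primary_file = PRIMARY / rel
        if primary_file.exists() and sha256(primary_file) == digest:
            continue  # primary copy is intact
        replica_file = REPLICA / rel
        if replica_file.exists() and sha256(replica_file) == digest:
            # whole-file recovery from the replica, no partial repair attempted
            shutil.copy2(replica_file, primary_file)
            print(f"repaired {rel} from replica")
        else:
            print(f"UNRECOVERABLE: {rel} corrupt in both copies")

if __name__ == "__main__":
    verify_and_repair()
```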
Any compressor should, at a minimum, have freely accessible source code, and restores should be tested against a specific versioned artifact (binary) of that codebase.
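A rough sketch of what such a restore test could look like, assuming a pinned decompressor binary at a made-up path and a checksum of the original uncompressed stream recorded at ingest:

```python
import hashlib
import subprocess
from pathlib import Path

# Hypothetical pinned decompressor artifact and test archive; the point is that
# the exact binary you intend to rely on for restores is the one exercised.
PINNED_ZSTD = Path("/opt/archival-tools/zstd-1.5.5/bin/zstd")
ARCHIVE = Path("sample.tar.zst")
EXPECTED_SHA256 = "<sha256 of the original uncompressed stream, recorded at ingest>"

def restore_test() -> bool:
    # Decompress to stdout with the pinned binary and hash the restored stream.
    h = hashlib.sha256()
    with subprocess.Popen(
        [str(PINNED_ZSTD), "-d", "-c", str(ARCHIVE)], stdout=subprocess.PIPE
    ) as proc:
        for chunk in iter(lambda: proc.stdout.read(1 << 20), b""):
            h.update(chunk)
    if proc.returncode != 0:
        return False
    return h.hexdigest() == EXPECTED_SHA256

if __name__ == "__main__":
    print("restore OK" if restore_test() else "restore FAILED")
```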
I'm personally comfortable with Zstandard, 7-Zip/xz, or even gzip, depending on a mix of how conservative the use case is and how effectively each compresses that particular data.
I like zstd as a format, but it's too new for archival use -- it's only been available for five years or so. gzip is easily the winner; while it isn't the best compressor available, it's been around for the last 30 years, and is likely to still be in use another 30 years from now.
There's also the option of compressing with one format now and, during routine storage maintenance, migrating to newer, better storage layers, including compression formats. What matters most are integrity checks, redundancy/duplication, and the ability to restore the original source data stream.
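As a sketch of that kind of maintenance pass, assume a gzip-to-Zstandard migration with the original stream's checksum recorded at ingest; the file names are hypothetical and the third-party `zstandard` package is assumed to be available:

```python
import gzip
import hashlib
from pathlib import Path

import zstandard  # third-party package, assumed installed ("pip install zstandard")

def migrate(gz_path: Path, source_sha256: str) -> Path:
    """Recompress a gzip archive as zstd, but only after proving the
    decompressed stream still matches the checksum recorded at ingest."""
    data = gzip.decompress(gz_path.read_bytes())
    if hashlib.sha256(data).hexdigest() != source_sha256:
        raise ValueError(f"{gz_path}: stream no longer matches ingest checksum")

    zst_path = gz_path.with_suffix(".zst")
    zst_path.write_bytes(zstandard.ZstdCompressor(level=19).compress(data))

    # Round-trip check before the old copy is ever considered for removal.
    restored = zstandard.ZstdDecompressor().decompress(zst_path.read_bytes())
    assert hashlib.sha256(restored).hexdigest() == source_sha256
    return zst_path
```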
I imagine there are lots of conference slides and internal decks contributing to government decisions that are worth preserving. Decisions are rarely made on the back of a 1,600-page report.
It's interesting how "moving image" is split into "digital cinema" and "digital video"; I would have expected one set of formats to cover both.
Apparently "digital cinema" prefers DPX while "digital video" lists AVI/MOV/WMV/MP4/MP2/MXF as acceptable, so maybe "digital cinema" are intended to capture raw data while "digital video" are meant for playback. But audio and still image categories did not have that same split.
Maybe it reflects the very different tools and workflows used in professional cinematography versus amateur work, as opposed to still photography, where there isn't such a gap in either gear or formats (many professional photographers happily shoot JPEG, and certainly many publish in it).
I wonder if it's because DPX is often what you have when working in digital cinema until you package it up as a (potentially encrypted) DCP for distribution.
There's no known issue with file formats being lost. (I think this is what it's about)
That's not to say no one should look at it. The US National Archives is an obvious choice.
But files should be kept in their original format.
You might convert a copy to another format so it can be easily viewed, e.g. in a web browser.
Changing format is the best way to lose information, for zero preservation gain.
The fact that file formats don't get lost is probably not intuitive; why that is would make an interesting blog post or research project. Part of why intuition fails is surely that people conflate physical media and its hardware, which do degrade, with algorithms, which don't.
> There's no known issue with file formats being lost. (I think this is what it's about)
Esoteric, proprietary formats have a high risk factor. Long-term preservation goes beyond administrative archives and includes niche software used to make e.g. interactive objects, which may need to be preserved for legal or other reasons.
Long-term preservation also extends beyond the 2-5 year horizon over which one could make a reasonably accurate risk assessment. The legal framework may demand that some archives (or parts of them) be preserved for the next 30, 50, or 100 years, or even for eternity (no end date specified). Just consider documents related to nuclear installations, machinery, tooling, ..., which go well beyond ubiquitous office file formats.
> But files should be kept in their original format.
Substitution is a choice that needs to be considered carefully, which is why emulation is a big topic in digital archiving.
> Changing format is the best way to lose information, for zero preservation gain.
It's not that black and white. Ultimately, bitstreams need to be translated to an output device. A big question is "Am I seeing the message in the same digital context as one would have seen it 30 years ago?" There's a can-of-worms discussion to be had about the "essence" of digital information, its representation on screen, and the bits stored on disk.
Changing formats is usually a question of judicious, pragmatic decision-making based on the valorisation of the material as well as the context from which it originated.
What's more important? Being able to consume the information? Or being able to consume it in its original form? What do we lose and what do we gain? There are no absolute answers to be had in that regard: it's always a per-case assessment.
The problem I see is faulty or simplistic assumptions made by the translators, and the subsequent loss of information in the archive when the originals are discarded.
A new format may allow faithful reproduction of the primary intention but disregard other aspects. An obvious example is metadata: with photo or video, losing information such as exposure, lens type, or camera make/model can be devastating to some uses of the file (the sketch after this comment makes that concrete). Esoterica that "can't possibly be useful" can actually be critical to some users.
Of course, it is fine to apply some destructive translation in a transient manner, such as your example of consumption by an output device. The problem is assuming there is exactly one type of output "device".
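To make the metadata point concrete, here is a small sketch using Pillow (assumed installed) that compares the EXIF tags of an original camera JPEG with whatever survives in a converted copy; the file names are hypothetical:

```python
from PIL import Image
from PIL.ExifTags import TAGS  # Pillow, assumed installed ("pip install Pillow")

def exif_tags(path: str) -> dict:
    """Return the named EXIF tags present in an image file."""
    with Image.open(path) as img:
        return {TAGS.get(tag_id, tag_id): value for tag_id, value in img.getexif().items()}

# Hypothetical file names: an original camera JPEG and a converted copy of it.
original = exif_tags("scan_original.jpg")
converted = exif_tags("scan_converted.png")

lost = set(original) - set(converted)
print(f"{len(lost)} EXIF tags lost in conversion:")
for tag in sorted(str(t) for t in lost):
    print("  ", tag)
```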
> There's no known issue with file formats being lost.
Cloud products silently change format all the time, and AFAIK the formats aren’t documented.
This list has Keynote, for example, but claims it's an XML-based format. It hasn't been for years. I also wouldn't know how to read a Google Sheets file without Google's software (but maybe I didn't search hard enough).
Older formats are also getting lost. For example, try opening a MacWrite file or a Word 1.0 file. MacWrite has the additional problem that part of the file (a 16-character etaoin shrdlu string used to compress the text) is in the resource fork. Preserving that is easy, but also easily forgotten.
And even if you can open a file today, today's tools may do a better job converting it to PDF or .mp4 than the tools we'll have a hundred years from now, when converters for rare formats will see little testing.
There's a certain Y2K effect with file formats. Everyone recognized the possibility of losing access to proprietary formats, so countermeasures were taken (shifting to open formats, building export and conversion tools, running original software in VMs), and now nobody loses access to proprietary formats.