A quick search doesn't reveal any hits for "compression"; I wonder what formats the National Archives uses themselves. Some compression formats include extra metadata for fancy error correction, but I'm not qualified to evaluate how useful it is or which advanced settings to employ.
The data types are grouped into categories, each of which has its own document with recommended file encodings (including compression).
For example, the document for digital audio[0] lists FLAC as a "preferred" format and MP3 as "acceptable".
I didn't see any discussion of selecting formats based on recovering the original data from corrupted storage, and given the context I doubt that was given much consideration. For data archival it's better to have replication and checksums, so that if any file is corrupted it can be restored from a replica rather than attempting a partial recovery.
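For the sake of illustration, here's a minimal sketch of that checksum-plus-replica approach in Python; the directory layout and manifest format are invented for the example, not anything the National Archives prescribes:

```python
import hashlib
import shutil
from pathlib import Path

PRIMARY = Path("/archive/primary")        # hypothetical locations
REPLICA = Path("/archive/replica")
MANIFEST = PRIMARY / "MANIFEST.sha256"    # lines of "<hexdigest>  <relative path>"

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_and_repair() -> None:
    for line in MANIFEST.read_text().splitlines():
        digest, rel = line.split(maxsplit=1)
        primary_file = PRIMARY / rel
        if primary_file.exists() and sha256(primary_file) == digest:
            continue  # primary copy is intact
        replica_file = REPLICA / rel
        if replica_file.exists() and sha256(replica_file) == digest:
            # whole-file recovery from the replica, no partial repair attempted
            shutil.copy2(replica_file, primary_file)
            print(f"repaired {rel} from replica")
        else:
            print(f"UNRECOVERABLE: {rel} corrupt in both copies")

if __name__ == "__main__":
    verify_and_repair()
```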
Any compressor should, at a minimum, have freely accessible source code, and restores should be tested against a specific versioned artifact (binary) of that codebase.
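A rough sketch of what such a restore test could look like, assuming a pinned decompressor binary at a made-up path and a checksum of the original uncompressed stream recorded at ingest:

```python
import hashlib
import subprocess
from pathlib import Path

# Hypothetical pinned decompressor artifact and test archive; the point is that
# the exact binary you intend to rely on for restores is the one exercised.
PINNED_ZSTD = Path("/opt/archival-tools/zstd-1.5.5/bin/zstd")
ARCHIVE = Path("sample.tar.zst")
EXPECTED_SHA256 = "<sha256 of the original uncompressed stream, recorded at ingest>"

def restore_test() -> bool:
    # Decompress to stdout with the pinned binary and hash the restored stream.
    h = hashlib.sha256()
    with subprocess.Popen(
        [str(PINNED_ZSTD), "-d", "-c", str(ARCHIVE)], stdout=subprocess.PIPE
    ) as proc:
        for chunk in iter(lambda: proc.stdout.read(1 << 20), b""):
            h.update(chunk)
    if proc.returncode != 0:
        return False
    return h.hexdigest() == EXPECTED_SHA256

if __name__ == "__main__":
    print("restore OK" if restore_test() else "restore FAILED")
```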
I'm personally comfortable with Zstandard, 7-Zip/xz, or even gzip, depending on a mix of how conservative the use case is and how effectively each compresses that particular data.
I like zstd as a format, but it's too new for archival use -- it's only been available for five years or so. gzip is easily the winner; while it isn't the best compressor available, it's been around for the last 30 years, and is likely to still be in use another 30 years from now.
There's also the option of compressing with one format now and, during routine storage maintenance, migrating to newer, better storage layers, including compression formats. What matters most are integrity checks, redundancy/duplication, and the ability to restore the original source data stream.
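As a sketch of that kind of maintenance pass, assume a gzip-to-Zstandard migration with the original stream's checksum recorded at ingest; the file names are hypothetical and the third-party `zstandard` package is assumed to be available:

```python
import gzip
import hashlib
from pathlib import Path

import zstandard  # third-party package, assumed installed ("pip install zstandard")

def migrate(gz_path: Path, source_sha256: str) -> Path:
    """Recompress a gzip archive as zstd, but only after proving the
    decompressed stream still matches the checksum recorded at ingest."""
    data = gzip.decompress(gz_path.read_bytes())
    if hashlib.sha256(data).hexdigest() != source_sha256:
        raise ValueError(f"{gz_path}: stream no longer matches ingest checksum")

    zst_path = gz_path.with_suffix(".zst")
    zst_path.write_bytes(zstandard.ZstdCompressor(level=19).compress(data))

    # Round-trip check before the old copy is ever considered for removal.
    restored = zstandard.ZstdDecompressor().decompress(zst_path.read_bytes())
    assert hashlib.sha256(restored).hexdigest() == source_sha256
    return zst_path
```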
I imagine there are lots of conference slides and internal decks contributing to government decisions that are worth preserving. Decisions are rarely made on the back of a 1,600-page report.
It's interesting how "moving image" is split into "digital cinema" and "digital video"; I would have expected one set of formats to cover both.
Apparently "digital cinema" prefers DPX while "digital video" lists AVI/MOV/WMV/MP4/MP2/MXF as acceptable, so maybe "digital cinema" are intended to capture raw data while "digital video" are meant for playback. But audio and still image categories did not have that same split.
Maybe it reflects the very different tools and workflows used in professional cinematography versus amateur work, as opposed to still photography, where there isn't such a gap in either gear or formats (many professional photographers happily shoot JPEG, and certainly many publish in it).
I wonder if it's because DPX is often what you have when working in digital cinema until you package it up as a (potentially encrypted) DCP for distribution.
There's no known issue with file formats being lost. (I think this is what it's about)
That's not to say no one should look at it. The US National Archives is an obvious choice.
But files should be kept in their original format.
You might convert a copy to another format so it can be easily viewed, e.g. in a web browser.
Changing format is the best way to lose information, for zero preservation gain.
The fact that file formats don't get lost is probably not intuitive; why that is would make an interesting blog post or research project. Part of why intuition fails is surely that people conflate physical media and its hardware, which do degrade, with algorithms, which don't.
> There's no known issue with file formats being lost. (I think this is what it's about)
Esoteric, proprietary formats have a high risk factor. Long-term preservation goes beyond administrative archives and includes niche software used to make e.g. interactive objects, which may need to be preserved for legal or other reasons.
Long-term preservation also extends beyond the 2-5 year horizon over which one could make a reasonably accurate risk assessment. The legal framework may demand that some archives (or parts of them) be preserved for the next 30, 50, or 100 years, or even for eternity (no end date specified). Just consider documents related to nuclear installations, machinery, tooling, ..., which go well beyond ubiquitous office file formats.
> But files should be kept in their original format.
Substitution is a choice that needs to be considered carefully, which is why emulation is a big topic in digital archiving.
> Changing format is the best way to lose information, for zero preservation gain.
It's not that black and white. Ultimately, bitstreams need to be translated to an output device. A big question is "Am I seeing the message in the same digital context as one would have seen it 30 years ago?" There's a can-of-worms discussion to be had about the "essence" of digital information, its representation on screen, and the bits stored on disk.
Changing formats is usually a question of judicious, pragmatic decision-making based on the valorisation of the material as well as the context from which it originated.
What's more important? Being able to consume the information? Or being able to consume it in its original form? What do we lose and what do we gain? There are no absolute answers to be had in that regard: it's always a per-case assessment.
The problem I see is faulty or simplistic assumptions made by the translators, and the subsequent loss of information in the archive when the originals are discarded.
A new format may allow faithful reproduction of the primary intention but disregard other aspects. An obvious example is metadata: with photo or video, losing information such as exposure, lens type, or camera make/model can be devastating to some uses of the file (the sketch after this comment makes that concrete). Esoterica that "can't possibly be useful" can actually be critical to some users.
Of course, it is fine to apply some destructive translation in a transient manner, such as your example of consumption by an output device. The problem is assuming there is exactly one type of output "device".
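To make the metadata point concrete, here is a small sketch using Pillow (assumed installed) that compares the EXIF tags of an original camera JPEG with whatever survives in a converted copy; the file names are hypothetical:

```python
from PIL import Image
from PIL.ExifTags import TAGS  # Pillow, assumed installed ("pip install Pillow")

def exif_tags(path: str) -> dict:
    """Return the named EXIF tags present in an image file."""
    with Image.open(path) as img:
        return {TAGS.get(tag_id, tag_id): value for tag_id, value in img.getexif().items()}

# Hypothetical file names: an original camera JPEG and a converted copy of it.
original = exif_tags("scan_original.jpg")
converted = exif_tags("scan_converted.png")

lost = set(original) - set(converted)
print(f"{len(lost)} EXIF tags lost in conversion:")
for tag in sorted(str(t) for t in lost):
    print("  ", tag)
```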
> There's no known issue with file formats being lost.
Cloud products silently change format all the time, and AFAIK the formats aren’t documented.
This list has Keynote, for example, but claims it's an XML-based format. It hasn't been for years. I also wouldn't know how to read a Google Sheets file without Google's software (but maybe I didn't search hard enough).
Older formats are also getting lost. For example, try opening a MacWrite file or a Word 1.0 file. MacWrite has the additional problem that part of the file (a 16-character etaoin shrdlu string used to compress the text) is in the resource fork. Preserving that is easy, but also easily forgotten.
And even if you can open a file today, today's tools may do a better job converting it to PDF or .mp4 than the tools we'll have a hundred years from now, when converters for rare formats will see little testing.
There's a certain Y2K effect with file formats. Everyone recognized the possibility of losing access to proprietary formats, so countermeasures were taken (shifting to open formats, building export and conversion tools, running original software in VMs), and now nobody loses access to proprietary formats.