> we need to implement a caching scheme to at least amortize the overhead of this sort for pathological cases. Still, we will need to trigger the sort even on a single read as you say. So there's a pathological case there -- listing one item from a ginormous directory -- that would run much slower than the native/nondeterministic version.
Actually, I just thought that if you decide to go down this path because of this specific pathological issue -- the cache that you're mentioning could simply be a file that Hermit maintains to represent a sorted view of a directory (and which could be deleted after Hermit finishes running if you so desire, although you could keep it in-between runs with some care!).
So you'd only have to pay the cost of reading the entire directory once per directory that is read -- specifically, on the first time that some program reads it.
You could even pre-generate such cached views for all directories in the container once Hermit starts, if you'd prefer to pay this cost at start-up (although I don't see why).
All subsequent operations relating to a directory could use the corresponding file as a cached and sorted view of it (and such operations that modify the directory would have to keep the file in sync, of course).
So even if a program reads a directory, then closes it, and some other program opens it and reads it, this second read would use the cache file instead, so it would not have to read the whole directory at once.
This would be even better for directories that are created by programs running inside Hermit, since the cache file would be created along with the directory when it is initially created, so there would be no need to read the entire directory at once at any time.
It would also mean that Hermit wouldn't have to potentially use large amounts of memory just because of this cache, or "forget" entries from the cache while it is running (to limit memory usage), since it would be stored persistently on-disk (and would still benefit from in-memory caching from the underlying filesystem / kernel).
Yeah, it's interesting to think about persisting the state we would need to make the file system more sympatico with Hermit. If we were willing to have a daemon.... Meta develops this "watchman" tool that our build infrastructure uses. I think for existing file systems we could imagine a daemon that watches the directory and caches what we need.
But if we could dream of new file systems, then I want one that is basically btrfs but with O(1) file-level parallel-hashes (rather than block level). Heck maybe it could even enforce a sorted order on directories for us too. The hashes would be very useful in record-and-replay scenarios, where you could know immediately whether the input files are in your content-addressible blob storage without having to hash them every time.
P.S. about Reproducible Builds -- we pitched this to the community in 2019 (at the Reproducible Builds Summit) and with that ASPLOS'20 paper, but we would be eager to see anyone wanting to pick it up and take it further.
Actually, I just thought that if you decide to go down this path because of this specific pathological issue -- the cache that you're mentioning could simply be a file that Hermit maintains to represent a sorted view of a directory (and which could be deleted after Hermit finishes running if you so desire, although you could keep it in-between runs with some care!).
So you'd only have to pay the cost of reading the entire directory once per directory that is read -- specifically, on the first time that some program reads it.
You could even pre-generate such cached views for all directories in the container once Hermit starts, if you'd prefer to pay this cost at start-up (although I don't see why).
All subsequent operations relating to a directory could use the corresponding file as a cached and sorted view of it (and such operations that modify the directory would have to keep the file in sync, of course).
So even if a program reads a directory, then closes it, and some other program opens it and reads it, this second read would use the cache file instead, so it would not have to read the whole directory at once.
This would be even better for directories that are created by programs running inside Hermit, since the cache file would be created along with the directory when it is initially created, so there would be no need to read the entire directory at once at any time.
It would also mean that Hermit wouldn't have to potentially use large amounts of memory just because of this cache, or "forget" entries from the cache while it is running (to limit memory usage), since it would be stored persistently on-disk (and would still benefit from in-memory caching from the underlying filesystem / kernel).
Just an idea.