
I keep coming back to point #4 - whoever thought that 16k objects in a single directory would be a good idea? That's been a bad idea ever since FAT, and while modern FSes will handle it without completely melting down, it's still going to slow down any operation that touches that directory.

Even Finder or `ls` will have trouble with that, and anything with * is almost certainly going to fail. Is the use-case for this something that refers to each library directly, such that nobody ever lists or searches all 16k entries?



I do think that your last sentence is the answer: if you're using a package manager instead of working with the directory heavily, this isn't a visible problem which is going to motivate people to work on it.

The other side to consider: “one directory per package” is a very simple policy and it feels right in many ways to people (e.g. Homebrew has a similar structure because it's a natural fit for the domain). If the filesystem and basic tools like ls work just fine (which is certainly the case on OS X, where even "ls -l" or the Finder take less than a second on a directory of that size), isn't there a valid argument that the answer should be some combination of fixing tools which don't handle that well or encouraging people to learn about things like `find` instead of using wildcards which match huge numbers of files?
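To make the `find`-instead-of-wildcards point concrete: a shell wildcard expands every match onto a single command line, which can run into the argument-length limit, while `find` streams entries one at a time. Here is a rough Python sketch of the streaming approach. The helper name `iter_matching` is made up for illustration, and this is only one way to do it:

```python
import fnmatch
import os
from typing import Iterator


def iter_matching(directory: str, pattern: str) -> Iterator[str]:
    """Yield names in `directory` matching a shell-style pattern, one at a time.

    Hypothetical helper: entries are read lazily via os.scandir, so no
    16k-item argument list is ever built.
    """
    with os.scandir(directory) as entries:
        for entry in entries:
            if fnmatch.fnmatch(entry.name, pattern):
                yield entry.name
```

Something like `for name in iter_matching("Specs", "AF*"): ...` would then walk the matches without ever holding them all at once (the directory name here is just an example).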


One directory per package is completely sensible, just not all in one bunch. It's even fine if the mapping is to a flat namespace at something like the HTTP level - I can mod_rewrite /abcdefg to /a/b/c/abcdefg no problem. My only objection is to file- or directory-level structures that are this flat. I might be mentally deficient, but I can't even process anything that's structured this way.

As loath as I am to admit anything about Perl is good, CPAN got this right. 161k packages by 12k authors, grouped by A/AU/AUTHOR/Module. That even gives you the added bonus of authorship attribution. Debian splits in a similar way as well, /pool/BRANCH/M/Module/ and even /pool/BRANCH/libM/Module/ as a special case.
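For what it's worth, both layouts are easy to sketch. A minimal Python illustration, assuming the CPAN author directory is the uppercased author ID nested under its first letter and first two letters, and that Debian's special case keys `lib*` packages by their first four characters. The function names are made up for illustration and are not taken from either project's tooling:

```python
from pathlib import PurePosixPath


def cpan_author_dir(author: str) -> PurePosixPath:
    """A/AU/AUTHOR layout: first letter, first two letters, full ID."""
    a = author.upper()
    return PurePosixPath(a[:1], a[:2], a)


def debian_pool_prefix(package: str) -> str:
    """Shard key for a pool directory: 'libX' for lib* packages, else one letter."""
    if package.startswith("lib") and len(package) > 3:
        return package[:4]
    return package[:1]
```

So `cpan_author_dir("author")` gives `A/AU/AUTHOR`, and `debian_pool_prefix` puts every `lib*` package under a four-character bucket so that the single `l` directory doesn't balloon.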

Tooling can be considered part of the problem in this case. Because the tooling hides the implementation, nobody (in the project) noticed just how bad it was. I hadn't seen modern FS performance on something of this scale; apparently everything I've worked with has been either much smaller or much larger. Ext4 (and I assume HFS+) is crazy-fast for either `ls -l` or `find` on that repo.

It seems like tooling is part of the solution as well, but from the `git` side. Having "weird" behavior for a tool that's so integral to so many projects scares me a little, but it's awesome that GitHub has (and uses) enough resources to identify and address such weirdness.


My (perhaps naive) thoughts on this are - suppose a 16k-packages-in-one-directory solution were just as fast as a 16k-packages-sharded-by-prefix one (the CPAN solution); then the former is conceptually simpler and so should be preferred. And the fact that you can mechanically transform one structure to the other means that the filesystem (or git) should be able to transparently do it for you (e.g. use the sharded approach as a hidden implementation, while the end user sees a flat directory). This seems to be similar to what ext4 does (https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Hash...).
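By "mechanically transform" I mean something like the following Python sketch. It assumes prefix-based sharding where each level adds one more leading character. `shard` and `unshard` are hypothetical names, and this is not how git or ext4 actually do it:

```python
from pathlib import PurePosixPath


def shard(name: str, depth: int = 2) -> PurePosixPath:
    """Map a flat name to a prefix-sharded path, e.g. abcdefg -> a/ab/abcdefg."""
    # Short names simply get fewer prefix levels.
    prefixes = [name[: i + 1] for i in range(min(depth, len(name)))]
    return PurePosixPath(*prefixes, name)


def unshard(path: PurePosixPath) -> str:
    """Recover the flat name: it is just the final path component."""
    return path.name
```

Since `unshard(shard(name)) == name` for any package name, a tool could in principle keep the sharded layout internally and still present a flat listing.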


The obvious question is how you would implement that. You might argue (as you should) that git has closer semantics to a filesystem than to version control. But actually implementing this sharding would require git to be a kernel module. Hardlinks and softlinks won't save you because they are both still dentries and thus have the same performance pathology. Maybe you could do it with FUSE, but what have you gained by making your version control system even more annoying to use?


It's one of those shouldn't-be-arcane-but-somehow-is pieces of knowledge. At almost every job I've had, I've ended up speaking to developers about more efficient file storage after finding yet another "shove everything in a single directory" implementation.



