In my similar project (S3-compatible single-node storage) https://github.com/uroni/hs5 I do use proper fsync for data and metadata durability, but it can be turned off via a switch. It is a pet peeve of mine that the default should always be to fsync. I have a section on this in the project's README.
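To make that concrete, here is a minimal Go sketch (not hs5's actual code) of what "proper fsync" has to cover: the file contents, and also the parent directory, since otherwise the new directory entry itself can still be lost on power failure.

```go
// Minimal sketch (not hs5's actual code): write a file durably by syncing
// both the file data and the parent directory entry.
package storage

import (
	"os"
	"path/filepath"
)

func writeDurably(path string, data []byte) error {
	f, err := os.CreateTemp(filepath.Dir(path), ".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(f.Name()) // cleanup if we fail before the rename

	if _, err := f.Write(data); err != nil {
		f.Close()
		return err
	}
	if err := f.Sync(); err != nil { // fsync: file data + inode metadata
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.Rename(f.Name(), path); err != nil {
		return err
	}

	// Without this, the rename (i.e. the directory entry) can still be lost
	// on power failure, even though the file contents were fsynced.
	dir, err := os.Open(filepath.Dir(path))
	if err != nil {
		return err
	}
	defer dir.Close()
	return dir.Sync()
}
```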
I also have an optional WAL. Maybe I should add an additional mode that disables fsync only for the WAL. I don't think it would be a good idea. My WAL does use checksums and sequence numbers etc., so it won't commit wrong data on replay.
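As a rough illustration of what those checksums and sequence numbers buy you (a hypothetical record format, not hs5's actual on-disk layout): replay stops at the first truncated, corrupted or out-of-sequence record, so a torn write at the tail only loses the newest entries instead of committing garbage.

```go
// Hypothetical WAL record format (not hs5's actual layout):
// [seq:8][payload len:4][payload][crc32 over everything before it:4]
package wal

import (
	"encoding/binary"
	"hash/crc32"
)

type Record struct {
	Seq     uint64
	Payload []byte
}

// Encode serializes one record with its checksum.
func Encode(r Record) []byte {
	buf := make([]byte, 12+len(r.Payload)+4)
	binary.LittleEndian.PutUint64(buf[0:8], r.Seq)
	binary.LittleEndian.PutUint32(buf[8:12], uint32(len(r.Payload)))
	copy(buf[12:], r.Payload)
	crc := crc32.ChecksumIEEE(buf[:12+len(r.Payload)])
	binary.LittleEndian.PutUint32(buf[12+len(r.Payload):], crc)
	return buf
}

// Replay applies records in order and stops at the first truncated,
// corrupted or out-of-sequence record.
func Replay(log []byte, lastSeq uint64, apply func(Record)) {
	for len(log) >= 16 {
		seq := binary.LittleEndian.Uint64(log[0:8])
		n := int(binary.LittleEndian.Uint32(log[8:12]))
		end := 12 + n
		if end+4 > len(log) {
			return // truncated record
		}
		crc := binary.LittleEndian.Uint32(log[end : end+4])
		if crc32.ChecksumIEEE(log[:end]) != crc || seq != lastSeq+1 {
			return // corrupted or out-of-order record
		}
		apply(Record{Seq: seq, Payload: log[12:end]})
		lastSeq = seq
		log = log[end+4:]
	}
}
```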
I built https://github.com/uroni/hs5 as a replacement for single-node use with a focus on high performance. I list other alternatives in the README there. A short version:
Ceph: Robust, widely used for multi-node deployments. Would recommend for serious use.
RustFS: An in-place replacement using the same storage format. To me it is a bit suspect, though, e.g. whether it actually uses fsync for durability.
seaweedfs: Multi-node alternative that keeps the object mapping in memory (so more RAM usage and startup cost compared to alternatives).
Garage: Multi-node alternative, web interface available separately. To me it seems unsuitable for single-node use at this point.
VersityGW: Single-node alternative that uses the same object=file-in-a-filesystem-tree storage as MinIO (with the same disadvantages/advantages). It uses extended attributes for S3 metadata, so at least less filesystem overhead than MinIO/RustFS. I cannot find any Sync() or fsync calls in the code, though.
I'd worry about file create, write, then fsync performance with btrfs, but not about reliability or data loss.
But a quick grep across versitygw tells me they don't use Sync()/fsync, so that's not a problem... Any data loss resulting from that is obviously not btrfs's fault.
I've used it in a product for a couple of thousand repos. The big problem is architectural: each branch is serialized into a single bundle file, and there is no structural sharing of git objects, so each and every branch downloads the full history from scratch. Switching branches is therefore as expensive as a fresh clone. Combine this with a real user's desire for images/diagrams of any kind, and boom, massive slowness.
There are also two concurrency bugs which the maintainer refuses to acknowledge.
For me it went into the multi-node direction, where I'd use Ceph anyway (or build on top of an existing solid distributed database) if I needed it.
I also think there is an abstraction mismatch with object stores that map objects 1:1 into the file system. Obvious issues are that you only get good listing performance with '/' as the delimiter, and things like "keys with length up to 1024 bytes" break.
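As a small, hypothetical illustration of the second point (using a throwaway temp directory): a perfectly legal S3 key that contains no '/' at all simply cannot be stored as one file under a 1:1 key-to-path mapping on common Linux filesystems.

```go
// Hypothetical illustration: a valid S3 key breaks a 1:1 key-to-path mapping.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	dir, _ := os.MkdirTemp("", "bucket")
	defer os.RemoveAll(dir)

	// S3 allows object keys up to 1024 bytes, with no requirement to
	// contain '/'. Mapped 1:1 to a path, this becomes a single 300-byte
	// file name, which exceeds NAME_MAX (255) on ext4/XFS/btrfs.
	key := strings.Repeat("a", 300)

	err := os.WriteFile(filepath.Join(dir, key), []byte("value"), 0o644)
	fmt.Println(err) // "file name too long"
}
```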
I think my point doesn't really land. I was trying to express the idea that S3 is not a standard with AWS as the reference implementation; it is a successful commercial product with many, many copycats.
Their only real inherent commitment here is to whatever backwards-compatibility expectations are being set for their first-party SDKs. If they fulfill that but other vendors can't or won't follow suit, the outcome is gonna be different than it would be for an actual standard rather than an assumed one. There is no meaningful leverage for the third parties to exert to force a community-favored outcome if Amazon decides otherwise.
AGPL is "a plague" by design (it is viral). It has the explicit goal that any improvements flow back to the community project, and the virality is a necessary building block for that. It is an elegant solution to a tragedy-of-the-commons problem.
Companies like MinIO extending the virality beyond the single software/work, even though the license doesn't intend that, is what gives it a bad reputation. They have fixed https://min.io/compliance now, but I guess it doesn't matter anymore.
Intentions of the license aside, how far the license actually extends under that aggressive reading is an unsettled matter in the US court system. That means when a project wants to assert that any software which talks to a MinIO instance over S3 falls under the license, it's on you to decide whether you want to go the distance defending yourself. And even then, they can just drop the suit at any point in that long process and continue the status quo of ambiguity.
I never understood why one would use MinIO over Ceph for serious (multi-node) use. Sure, it might be easier to set up initially, but Ceph would be more likely to work.
For the single-node use case, I'm working on https://github.com/uroni/hs5 . The S3 API surface is large, but at this point it covers the basics. By limiting the goals I hope to keep it maintainable.
I’ve used this technique in the past, and the problem is that the way some file systems perform the file‑offset‑to‑disk‑location mapping is not scalable. It might always be fine with 512 MB files, but I worked with large files and millions of extents, and it ran into issues, including out‑of‑memory errors on Linux with XFS.
The XFS issue has since been fixed (though you often have no control over which Linux version your program runs on), but in general I’d say it’s better to do such mapping in user space. In this case, there is a RocksDB present anyway, so this would come at no performance cost.
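Roughly what I mean by doing the mapping in user space (a hypothetical schema, not hs5's actual one): key each extent by (object ID, logical offset) in an ordered KV store such as RocksDB, so finding the extent that covers a read offset is a single iterator seek.

```go
// Hypothetical user-space extent index (not hs5's actual schema).
package extents

import "encoding/binary"

// Extent records where a logical range of an object lives on disk.
type Extent struct {
	LogicalOff  uint64 // start offset within the object
	PhysicalOff uint64 // offset in the backing data file
	Len         uint64
}

// Key encodes (objectID, logicalOff) big-endian so that keys in an ordered
// KV store (e.g. RocksDB) sort by object first, then by offset.
func Key(objectID, logicalOff uint64) []byte {
	k := make([]byte, 16)
	binary.BigEndian.PutUint64(k[0:8], objectID)
	binary.BigEndian.PutUint64(k[8:16], logicalOff)
	return k
}

// Lookup finds the extent covering off in an offset-sorted slice. With
// RocksDB the equivalent is a single iterator SeekForPrev(Key(id, off)).
func Lookup(extents []Extent, off uint64) (Extent, bool) {
	for i := len(extents) - 1; i >= 0; i-- {
		e := extents[i]
		if e.LogicalOff <= off && off < e.LogicalOff+e.Len {
			return e, true
		}
	}
	return Extent{}, false
}
```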