Their architecture description starts with a strawman:
> Usually distributed file systems split each file into chunks, a central master keeps a mapping of filenames, chunk indices to chunk handles, and also which chunks each chunk server has.
> The main drawback is that the central master can't handle many small files efficiently, and since all read requests need to go through the chunk master, it might not scale well for many concurrent users.
The chunk server architecture was first put into production with the Google File System, AFAIK. And it was designed specifically for large files (what search needed at that time). So no surprise.
But that's only one architecture for a DFS. There are also block-based DFS (like GPFS),
object-based DFS (like Lustre), cluster file systems (like OCFS), and other architectures. They exhibit different characteristics.
Judging from the architecture and the wiki, it does not seem to be a file system at its core, but an object store with a file translation layer. One of the core problems of this approach is that in-place updates usually mean read-modify-write (if the object store has immutable objects, as most have, with Ceph being a notable exception).
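A minimal sketch of why immutable objects force read-modify-write for in-place updates (all names here are illustrative, not SeaweedFS's actual code):

```python
# Sketch: an object store that only supports whole-object puts.
# A small in-place edit must read the whole object, patch it in
# memory, and write the whole object back.

class ImmutableObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = bytes(data)  # whole-object writes only

    def get(self, key):
        return self._objects[key]

def write_at(store, key, offset, patch):
    """A few-byte patch still rewrites the entire object."""
    old = store.get(key)                                     # read
    new = old[:offset] + patch + old[offset + len(patch):]   # modify
    store.put(key, new)                                      # write

store = ImmutableObjectStore()
store.put("f", b"hello world")
write_at(store, "f", 6, b"WORLD")
print(store.get("f"))  # b'hello WORLD'
```

The cost is proportional to the object size, not the patch size, which is why object stores discourage random writes.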
From the replication page:
> If one replica is missing, there is no automatic repair right away. This is to prevent over-replication due to transient volume server failures or disconnections. Instead, the volume will just become read-only. For any new writes, just assign a different file id to a different volume.
This sounds like the architecture and implementation are still pretty basic. Distributed storage without redundancy (working redundancy!) is not that interesting.
Sorry to be that critical (great that someone writes a distributed file system!), but I think it is important to add some context. And the seaweed author doesn't seem to shy away from bold statements either...
Disclaimer: I also work on a distributed file system (with unified access via S3 ;)
There is actually no limit on blob size. The clients decide the blob size; it is limited by available memory and concurrency.
By default, the filer client uses 8MB.
Each identifier is a tuple <volume_id, file_key, cookie>. With the volume_id, you can locate the volume server. So the volumes are portable and can be moved around.
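For illustration, a hedged sketch of splitting a SeaweedFS-style file id such as "3,01637037d6" into those three parts (format assumption: the part before the comma is the volume id, the last 8 hex digits are the 32-bit cookie, and the remaining hex digits are the file key):

```python
# Parse a SeaweedFS-style file id of the form "<volume_id>,<hex>".
# Assumption (not verified against the source): the trailing 8 hex
# digits of <hex> are the cookie, the rest is the file key.

def parse_fid(fid):
    volume_part, rest = fid.split(",")
    cookie = int(rest[-8:], 16)       # last 8 hex digits = 32-bit cookie
    file_key = int(rest[:-8], 16)     # everything before = file key
    return int(volume_part), file_key, cookie

volume_id, file_key, cookie = parse_fid("3,01637037d6")
print(volume_id, file_key, hex(cookie))  # 3 1 0x637037d6
```

Because the volume id is embedded in every file id, a client only needs to ask the master where volume 3 currently lives; the rest of the lookup happens on the volume server.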
For a distributed filesystem's redundancy to be working in my book, it has to actually be able to reconstruct itself after unavailability/failure, which Seaweed seemingly doesn't do.
Additionally, something more complex than N-way replication (e.g. M+N erasure coding) is also a strong 'redundancy working' requirement for me.
a) I wonder in which sense HDFS was ever the most popular DFS. You could argue that it is not even a DFS, because it is not a general-purpose / POSIX file system. In 2012 there were Lustre, GlusterFS, GPFS, ... Not many DFS have been built since 2012.
b) It is completely unclear how the replication works, and which properties it has (split-brain safe?). From what it states (see citation), the "repair" is not automatic.
a) You are right. I was not in the storage field.
b) During writes, strong consistency is required. The repair is done by admin scripts, not by the data nodes themselves. This keeps data nodes simple and scalable. You can also run the repair scripts at a "safe" time.
Evercam has used Seaweed for a few years. We've 1344TB of mostly jpegs and use the filer for folder structure. It's worked well for us, especially with low cost Hetzner SX boxes. I'd echo other people's positive comments about the maintainer's responsiveness & support. Happy to (try and) answer questions.
We don't use any replication. Also, in almost 5 years we only had one server crash, which was due to file-system corruption, and we overcame that as well: a few leveldb files got corrupted, which took the whole XFS file system down, but we recovered it. Just one drawback: we never used the same filer for saving files, and GET speed was also quite slow on that one, but over time, with volume compaction and vacuum, everything works fine on GET requests.
If Oracle wins the Supreme Court case against Google, aren't all these "like S3" or S3 API compatible solutions (whether block storage competitors or file systems) at risk?
My suggestion: don’t rely too much on the S3 API after this ruling; it's better to have the seaweedfs API as the primary interface, until Amazon releases the S3 API as true open source with the right to modify it for one’s own use.
This is one of the biggest problems with Java: it’s free but not open source (Eric Raymond has been arguing for a while now that free does not mean open source). Oracle can come after anyone using Java if there is substantial money to be made.
So Java is only free as long as money keeps coming in for Oracle; once it dries up, as it did for SCO, they will go after anyone using the Java API to extract money.
Also, this shows the subtle difference between open source and free. Under a legitimate open source license, copyright and the right to change are granted, and the company cannot sue others just on the basis of copyright the way Oracle can sue anyone using the Java API.
Amazon wants to open up the S3 API for now. I guess my point is that launching products with S3 API compatibility feels like a legal liability if Oracle wins the case.
I was specifically implying block storage solutions that offer S3 API compatibility (i.e. "use our AWS S3 competitor with a matching S3 API").
We've been running SeaweedFS in production serving images and other small files. We're not using Filer functionality just the underlying volume storage. We wrote our own asynchronous replication on top of the volume servers since we couldn't rely on synchronous replication across datacenters. The maintainer is super responsive and is quick to review our PRs. Happy to answer any questions.
Whenever you introduce a new solution into a problem space that already has plenty of options, you are obligated to state why your (new) solution is needed in the first place, IMO.
They did it well:
> Most other distributed file systems seem more complicated than necessary.
> SeaweedFS is meant to be fast and simple, in both setup and operation. If you do not understand how it works when you reach here, we've failed! Please raise an issue with any questions or update this file with clarifications.
However, since I never had to touch HDFS after installing it in the first place, I wonder what the difficulties in operation are that they tried to overcome here?
HDFS famously suffers from a major "small files" issue: if you have lots of files smaller than the block size (e.g. 64MB), you will have production instabilities due to excessive namenode memory usage, poor read/write performance, and low disk efficiency.
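A rough back-of-the-envelope for the namenode memory pressure, using the commonly cited rule of thumb of roughly 150 bytes of heap per in-memory object (an approximation, not an exact figure):

```python
# HDFS namenode keeps every file and block as an in-memory object.
# ~150 bytes/object is the commonly cited rule of thumb.
BYTES_PER_OBJECT = 150

def namenode_memory_gb(num_files, blocks_per_file=1):
    objects = num_files * (1 + blocks_per_file)  # file entry + its blocks
    return objects * BYTES_PER_OBJECT / 1e9

# 100 million small files, each fitting in one 64MB block:
print(round(namenode_memory_gb(100_000_000), 1))  # → 30.0 (GB of heap)
```

With small files the per-file metadata cost stays constant while the payload shrinks, so the namenode heap fills up long before the disks do.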
Also, when you lose datanodes you need to rebalance the data, which SeaweedFS apparently does not need to do.
>Whenever you introduce a new solution into a problem space that already has plenty of options, you are obligated to state why your (new) solution is needed in the first place, IMO.
Agreed! I'm just very glad they didn't use the word "lightweight"; it's a pet peeve of mine when people "advertise" their software as "lightweight"!
The "first few" versions are always "lightweight" once they included all the bug fixes and corner cases of the other solutions over the next few releases, they seldom stay "lightweight" :/
This looks almost exactly like the kind of data store I need for an application. I have previously considered using minio (too inflexible wrt adding more shards / replicas), a homebrew system based on something like ScyllaDB (needs code on top to act like a blob store), or S3/B2 (too slow and/or expensive wrt transfer costs). Is anyone using this in production and can share a story of how stable it is and how hard it is to run?
Like you, we also started with minio, but it was not fully S3 compatible (listing folders and a few more related S3 APIs) and does not have FUSE or NFS support either. So we started moving our infrastructure to seaweedfs. We have not completed the move yet; once done, we might share some benchmarks. Some benchmarks you can also see on their GitHub project.
Scylla wouldn't even be good as a blob store if you're dealing with megabytes. If you're dealing with the kB range, Cassandra is stabler and pretty much what you want. For big blobs (MB+), Seaweed is up there among the best options.
Yes, this is another reason that kept me from doing it. I have one class of files (images) that go from 1kB to ~1MB in size, but never above. But then I have another set of files that go from 1B to several MB in the worst case (with a normal distribution around 13kB). Ideally I'd like to have one store that can handle all of these so I don't have to worry about worst-case performance too much.
I really wish this project, or other object storage systems modelled after haystack, would get more traction. I think it is reasonable to expect that your object storage system should support both small objects (< 10k) and large objects (> 1MB) transparently, but in my experience none of the heavily used open-source object stores (ceph, swift) can actually support small objects adequately.
Some differentiators that aren't immediately obvious in the comparison:
> SeaweedFS Filer metadata store can be any well-known and proven data store, e.g., Cassandra, Mongodb, Redis, Elastic Search, MySql, Postgres, MemSql, TiDB, CockroachDB, Etcd etc, and is easy to customize.
I'm not very familiar with other DFSs, but at the very least glusterfs stores metadata as xattrs on an underlying filesystem and so has no need of an external data store.
Also, SeaweedFS has a "master" server (single centralized with failover to secondary) and "volume servers" (responsible for data).
Sure. I'm not really saying one is better than the other; the SeaweedFS approach gives you flexibility with these decisions and offloading, while GlusterFS has no dependencies on additional services to reach its full potential.
Currently using Minio, which has been a relative success. However, as part of quality checks we're using a lot of 'listObjectsV2' S3 calls to check for gaps in our data, which is pretty slow. When using SeaweedFS plus, say, a Filer metadata store based on Redis, this would presumably be faster?
Of course you can go with faster filer stores. The default filer store would be enough for millions of files. It is the same speed as a common file listing.
Actually I do not understand why MinIO is slow here.
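The gap check mentioned a couple of comments up can be sketched like this (the key naming scheme is hypothetical; in practice the keys would come from paginated listObjectsV2 responses or a filer metadata query):

```python
# Given object keys that should form a dense numeric sequence,
# find the missing ids ("gaps"). Key scheme "frame-<n>" is made up
# for illustration.

def find_gaps(keys, prefix="frame-"):
    ids = sorted(int(k[len(prefix):]) for k in keys if k.startswith(prefix))
    expected = set(range(ids[0], ids[-1] + 1))
    return sorted(expected - set(ids))

keys = ["frame-1", "frame-2", "frame-5", "frame-6"]
print(find_gaps(keys))  # [3, 4]
```

The expensive part in practice is enumerating the keys, not the set arithmetic, which is why a fast metadata store (e.g. Redis behind the Filer) helps here.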
This is interesting. I've been looking for a file system for a non-RAID disk array I want to set up at home, and this seems to have some of the characteristics I'm looking for. The primary downfall for my particular use case seems to be that I want to use parity-based error correction rather than (or in addition to) replication, because I want the array to be able to survive a failure of any N disks in the array.
Is there anything like that out there (other than Unraid, which I kinda don't like)?
I am not a large user whatsoever but I've been using SeaweedFS for a few years now.
It is archiving and serving more than 40,000 images on a webapp I built for the small team I work with.
I run SeaweedFS on two machines and it serves all images I host.
I wanted to kick the tires because I was always fascinated by Facebook's Haystack.
It has been simple, reliable, and robust. I really like it, and I hope that if one of my side projects ever takes off at some point, I get to test it with a much bigger load.
This is really cool. The killer feature I see is being able to have a cloud storage tier for warm data that goes off to S3, while keeping the hot storage local. Does anyone know of another option that allows this kind of hybrid local / S3 storage and also has a filesystem interface?
We have been running SeaweedFS successfully in production for a few years. We are serving and storing mostly user-uploaded images (around 100TB). It has been surprisingly stable, and the maintainer is usually responsive when we encounter issues.
Are you running it on multiple nodes? If so, how confidently can you handle typical "cluster" operations like adding nodes, removing broken nodes, etc? Happy to hear via DM as well.
Yes, we scale regularly, though we usually only add nodes. We are slowly approaching 100 seaweed nodes. We are running in k8s on local SSD storage, managing failures is easy this way.
We're running across multiple nodes. Removing and adding volume servers is pretty simple. You can manually fix replication via a cli command after adding/removing a node.
Evercam.io has been using it for the past few years, i.e. almost 5 years, and I have been following seaweedfs for almost the same number of years. I think we are the largest user, though that's just my impression, as I have not seen anyone else going back and forth on GH issues :)
Geez, what's with those weird project names? For a second I expected/hoped this would be some cool hack storing data in actual seaweed. (You know, like pingfs.) No, it's not! It's some S3 k8s ... thing. Nothing wrong with that, but come on, choose a better name!
And no, I'm not particularly fond of the name CockroachDB either.
I'd rather have weird names that are unique than the incredible name overload from generic / cool product names that are used for everything from flagship products all the way down to 5-line npm packages.
On the other side of this, SeaweedFS appears to be largely implemented in Go, a useless keyword to search on.
"Golang" works much better, but sometimes results in a pedantic correction "The language is Go not Golang", to which I can only reply that the project is hosted on golang.org. Weird a.k.a. unique names are useful.
CockroachDB: Cockroaches are hard to kill. The vendor wants to imply this is also the case for their platform.
haha. Actually SeaweedFS started with a worse name. But some users warned me of how bad the name was in Japanese language. So I added "Sea" in front of the old name.
Thanks for the good work with seaweedfs. I shared it with the community as I was working on using it personally in our startup to replace minio.
It would be nice if you could integrate it with pyfilesystem [1]. It’s a small library providing a unified file interface in Python for local or cloud storage, without FUSE. It can already support seaweedfs via the S3 API; it would be nicer to have a direct driver for seaweedfs, removing the S3 translation.
I like dried crispy seaweed wafers; they do produce umami flavours, so it is a good name, it reminds me of them. The quality of software isn’t determined by its name but by what it does, so don’t worry about people talking about the name.
I am not good at Python at all and would not be able to add Python-specific support. However, SeaweedFS has gRPC APIs that you can use to do all file operations.