SeaweedFS – A simple and highly scalable distributed file system with S3 API (github.com/chrislusf)
106 points by dragonsh on Oct 9, 2020 | hide | past | favorite | 75 comments


Their architecture description starts with a strawman:

> Usually distributed file systems split each file into chunks, a central master keeps a mapping of filenames, chunk indices to chunk handles, and also which chunks each chunk server has.

> The main drawback is that the central master can't handle many small files efficiently, and since all read requests need to go through the chunk master, so it might not scale well for many concurrent users.

The chunk server architecture was first put into production with the Google File System, AFAIK. And it was designed specifically for large files (what search needed at the time), so no surprise there.

But that's only one architecture for a DFS. There are also block-based DFS (like GPFS), object-based DFS (Lustre), cluster file systems (OCFS), and other architectures. They exhibit different characteristics.

Judging from the architecture and the wiki, it does not seem to be a file system at its core, but an object store with a file translation layer. One of the core problems of this approach is that in-place updates usually mean read-modify-write (if the object store has immutable objects, as most do, with Ceph being a notable exception).

From the replication page:

> If one replica is missing, there is no automatic repair right away. This is to prevent over-replication due to transient volume server failures or disconnections. Instead, the volume will just become read-only. For any new writes, just assign a different file id to a different volume.

This sounds like the architecture and implementation is still pretty basic. Distributed storage without redundancy (working redundancy!) is not that interesting.

Sorry to be so critical (it's great that someone is writing a distributed file system!), but I think it is important to add some context. And the Seaweed author doesn't seem to shy away from bold statements either...

Disclaimer: I also work on a distributed file system (with unified access via S3 ;)


Thanks for the educational comments! I need to update those old statements. It was originally started as an object store only.

The blobs are read-modify-write, but a file can have many blobs, so an update does not need to read-modify-write the whole file, unless it is a small file.

The redundancy is managed via scriptable admin commands.


Great to see you here!

I understand your smallest unit of write is a blob. How large are blobs? What does their identifier look like?


There is actually no limit on blob size. The clients decide the blob size; it is limited by available memory and concurrency.

By default, the filer client uses 8MB.

Each identifier is a <volume_id, file_key, cookie> tuple. With the volume_id, you can locate the volume server, so the volumes are portable and can be moved around.
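For illustration, here is a tiny sketch of splitting such an identifier. This is my reading of the documented fid layout ("<volume_id>,<file_key_hex><cookie_hex>", with the cookie being the trailing 8 hex digits), so treat the exact field widths as an assumption:

```python
# Sketch (not authoritative): a SeaweedFS file id looks like
# "<volume_id>,<file_key_hex><cookie_hex>", e.g. "3,01637037d6",
# where the cookie is assumed to be the trailing 8 hex digits.
def parse_fid(fid: str) -> tuple[int, str, str]:
    volume_id, rest = fid.split(",", 1)
    return int(volume_id), rest[:-8], rest[-8:]

print(parse_fid("3,01637037d6"))  # -> (3, '01', '637037d6')
```

The volume_id is all a client needs to ask the master where the volume currently lives, which is what makes moving volumes between servers cheap.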


a) The project was created in 2012, so these statements may have been made around the time when the most popular DFS was HDFS and GCP was brand new.

b) It does have redundancy through data replication, unless I am missing something?


For a distributed filesystem's redundancy to be working in my book, it has to actually be able to reconstruct itself after unavailability/failure, which Seaweed seemingly doesn't do.

Additionally, something more complex than N-way replication (e.g. M+N erasure coding) is also a strong 'redundancy working' requirement for me.


In SeaweedFS, replication is for hot data and erasure coding is for warm data.


You can automate it with a cron job. For redundancy, a cron job in each master server.
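As a sketch of what such a cron-driven repair could look like, here is a small wrapper that pipes an admin command into "weed shell". The command name volume.fix.replication and the -master flag are from my memory of the docs, so double-check them before relying on this:

```python
import subprocess

def repair_cmd(master: str = "localhost:9333") -> list[str]:
    # "weed shell" connects to the master and reads admin commands on stdin
    return ["weed", "shell", f"-master={master}"]

def run_repair(master: str = "localhost:9333") -> None:
    # volume.fix.replication re-replicates under-replicated volumes; running
    # it from cron at a quiet time is the "scriptable repair" approach the
    # comments above describe, instead of nodes healing automatically.
    subprocess.run(repair_cmd(master), input=b"volume.fix.replication\n", check=True)
```

A crontab entry invoking run_repair() nightly would then be the "automatic" part.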


a) I wonder in what sense HDFS was ever the most popular DFS. You could argue it is not even a DFS, because it is not a general-purpose / POSIX file system. In 2012 there were Lustre, GlusterFS, GPFS, ... Not many DFS have been built since 2012.

b) It is completely unclear how the replication works and which properties it has (is it split-brain safe?). From what it states (see the citation), the "repair" is not automatic.


a) You are right; I was not in the storage field. b) During writes, strong consistency is required. The repair is done by admin scripts, not by the data nodes themselves. This is to keep data nodes simple and scalable. You can also run the repair scripts at a "safe" time.


Evercam has used Seaweed for a few years. We have 1344TB of mostly jpegs and use the filer for folder structure. It's worked well for us, especially with low-cost Hetzner SX boxes. I'd echo other people's positive comments about the maintainer's responsiveness & support. Happy to (try and) answer questions.


What replication are you using? What about your recovery times when a server fails on a 1Gb/s port?


We don't use any replication. In almost 5 years we had only one server crash, which was due to file-system corruption: a few leveldb files got corrupted, which took down the whole XFS file system, but we recovered it. The one drawback: we never used the same filer for saving files, and GET speed was also quite slow on that one, but over time, with volume compaction and vacuuming, GET requests work fine.


If Oracle wins the Supreme Court case against Google, aren't all these "like S3" or S3 API compatible solutions (whether block storage competitors or file systems) at risk?


Oracle wants to lock Java, while Amazon wants to open up S3 API, so more people can use S3.

On the other hand, SeaweedFS has no API access fees and is faster than S3 on your own hardware. Not sure what Amazon may do.


Oracle is fine with others using Java, provided everyone plays ball.

https://adoptopenjdk.net/sponsors.html

https://en.wikipedia.org/wiki/List_of_Java_virtual_machines#...

It is Google's own version of J++ that needs to be taken care of.


My suggestion: don't rely too much on the S3 API after this ruling; better to have the seaweedfs API as the primary interface, until Amazon releases their S3 API interface as true open source with the right to modify it for one's own use.
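For what it's worth, the native write path is documented as two HTTP calls: ask the master's /dir/assign for a file id, then upload straight to the returned volume server. A minimal sketch, assuming a local master on port 9333 (the address is a placeholder, and the actual upload is a multipart form POST, elided here):

```python
import json
import urllib.request

MASTER = "http://localhost:9333"  # assumed address of a local SeaweedFS master

def assign_url(master: str = MASTER) -> str:
    # the master hands out a fresh fid plus the volume server URL to write to
    return master.rstrip("/") + "/dir/assign"

def assign(master: str = MASTER) -> tuple[str, str]:
    with urllib.request.urlopen(assign_url(master)) as resp:
        a = json.load(resp)
    # the upload then goes directly to http://{a['url']}/{a['fid']}
    # (as a multipart form POST), with no S3 translation layer involved
    return a["fid"], a["url"]
```

Using this as the primary interface keeps you independent of any S3 API copyright question.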

This is one of the biggest problems with Java: it's free but not open source (Eric Raymond has been arguing for a while now that free does not mean open source). Oracle can come after anyone using Java if there is substantial money to be made.

So Java is only free as long as money keeps coming in for Oracle; once that dries up, like it did for SCO, they will go after anyone using the Java API to extract money.

This also shows the subtle difference between open source and free. With legitimately open-source-licensed software, copyright permissions and the right to change are granted, and the company cannot sue others just based on copyright, the way Oracle can sue anyone using the Java API.


Amazon wants to open up the S3 API for now. I guess my point is that launching products with S3 API compatibility feels like a legal liability if Oracle wins the case.

I was specifically referring to block storage solutions that offer S3 API compatibility (i.e. "use our AWS S3 competitor with a matching S3 API").

[1] https://www.digitalocean.com/products/spaces/ [2] https://cloud.ibm.com/docs/cloud-object-storage?topic=cloud-...


Oracle Cloud have an S3 API compatible service, don't they?


We've been running SeaweedFS in production serving images and other small files. We're not using Filer functionality just the underlying volume storage. We wrote our own asynchronous replication on top of the volume servers since we couldn't rely on synchronous replication across datacenters. The maintainer is super responsive and is quick to review our PRs. Happy to answer any questions.


Whenever you introduce a new solution into a problem space that already has plenty of options, you are obligated to state why your (new) solution is needed in the first place, IMO.

They did it well:

> Most other distributed file systems seem more complicated than necessary.

> SeaweedFS is meant to be fast and simple, in both setup and operation. If you do not understand how it works when you reach here, we've failed! Please raise an issue with any questions or update this file with clarifications.

https://github.com/chrislusf/seaweedfs#compared-to-other-fil...

However, since I never had to touch hdfs after installing it in the first place, I wonder what the difficulties in operation are that they tried to overcome here?


HDFS famously suffers from a major "small files" issue: if you have lots of files smaller than the block size (e.g. 64MB), you will have production instability due to excessive namenode memory use, poor read/write performance, and disk inefficiency.
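A back-of-envelope illustration of why namenode memory is the bottleneck (the ~150 bytes per heap object figure is the commonly cited rough estimate, not an exact number):

```python
# Each HDFS file costs at least one file object and one block object in the
# namenode's heap, each commonly estimated at roughly 150 bytes.
files = 100_000_000                     # 100M small files, one block apiece
heap_gb = files * 2 * 150 / 1e9         # file object + block object
print(f"~{heap_gb:.0f} GB of namenode heap just for metadata")
```

That is ~30 GB of heap for metadata alone before serving a single read, which is why packing small files into larger volumes (the haystack/Seaweed approach) helps so much.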

Also, when you lose datanodes you will need to rebalance the data, which SeaweedFS apparently does not need to do.


In SeaweedFS, rebalancing data is done via admin commands. The data nodes act just as storage devices.


>Whenever you introduce a new solution into a problem space that already has plenty of options, you are obligated to state why your (new) solution is needed in the first place, IMO.

Agreed! I'm just very glad they didn't use the word "lightweight"; it's a pet peeve of mine when people "advertise" their software as "lightweight"!

The "first few" versions are always "lightweight" once they included all the bug fixes and corner cases of the other solutions over the next few releases, they seldom stay "lightweight" :/


Agree!

Yet I do try very hard to keep each layer separate from each other. So each layer can be "lightweight". :)


Nooooo ! :D


This looks almost exactly like the kind of data store I need for an application. I have previously considered using minio (too inflexible wrt adding more shards / replicas), a homebrew system based on something like ScyllaDB (needs code on top to act like a blob store), or S3/B2 (too slow and/or expensive wrt transfer costs). Is anyone using this in production who can share a story of how stable it is and how hard it is to run?


Like you, we also started with minio, but as it was not fully S3 compatible (listing folders and a few more related S3 APIs) and does not have FUSE or NFS support either, we started moving our infrastructure to seaweedfs. We have not completed the move yet; once done, we might share some benchmarks. Some benchmarks you can also see on their GitHub project.


Scylla wouldn't even be good as a blob store if you're dealing with megabytes. If you're dealing with the kB range, Cassandra is stabler and pretty much what you want. For big objects (MB+), Seaweed is up there as the best option.


Yes, this is another reason that kept me from doing it. I have one class of files (images) that go from 1kB to ~1MB in size, but never above. But then I have another set of files that go from 1B to several MB in the worst case (with a normal distribution around 13kB). Ideally I'd like to have one store that can handle all of these so I don't have to worry about worst-case performance too much.


Have you looked at Ceph?


Fun fact: Ceph is being rebuilt with the core engine of Scylla, Seastar, at its heart:

https://docs.ceph.com/en/latest/dev/seastore/


The architecture reminds me of 'mogilefs', which has a similar filename-to-file-storage mapping mechanism.

https://github.com/mogilefs/mogilefs-docs/blob/master/HighLe...

It's an old system from the folks @ Danga, but the mailing list still sees random activity now and then...


I really wish this project, or other object storage systems modelled after haystack, would get more traction. I think it is reasonable to expect that your object storage system should support both small objects (< 10k) and large objects (> 1MB) transparently, but in my experience none of the heavily used open-source object stores (ceph, swift) can actually support small objects adequately.


Thanks!

There are many problems with large numbers of small files. It's more efficient to batch them together.


Yep. Not even minio, or aistore(nvidia), ceph etc

Only ambry (linkedin) AFAIK, but it has no erasure coding.


I wish all these file systems told me clearly if and how they guarantee file integrity over time.


Some differentiators that aren't immediately obvious in the comparison:

> SeaweedFS Filer metadata store can be any well-known and proven data stores, e.g., Cassandra, Mongodb, Redis, Elastic Search, MySql, Postgres, MemSql, TiDB, CockroachDB, Etcd etc, and is easy to customized.

I'm not very familiar with other DFS's but at the very least glusterfs stores metadata as xattrs on an underlying filesystem and so has no need of an external data store.

Also, SeaweedFS has a "master" server (single centralized with failover to secondary) and "volume servers" (responsible for data).


In SeaweedFS both metadata and data are portable.

You can move metadata to any faster scalable data store. And/or move the data to any cheaper/faster/larger storage, in cloud or in your garage.


Sure. Not really saying one is better than the other; the SeaweedFS approach gives you flexibility with these decisions and offloading, while GlusterFS has no dependencies on additional services to reach its full potential.


Currently using Minio, which is a relative success. However, as part of quality checks we're making a lot of 'listObjectsV2' S3 calls to check for gaps in our data, which is pretty slow. When using SeaweedFS plus, say, a filer metadata store based on Redis, would this presumably be faster?
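For context, the check in question is roughly: page through ListObjectsV2 and diff the listed keys against the expected ones. A sketch against any S3-compatible endpoint (the bucket and endpoint names here are placeholders):

```python
def missing_keys(listed, expected):
    # diff the keys actually present against the keys we expect to exist
    present = set(listed)
    return [k for k in expected if k not in present]

def list_all_keys(bucket: str, endpoint: str):
    # page through ListObjectsV2 on any S3-compatible endpoint
    import boto3  # third-party; imported locally so the pure check above runs anywhere
    s3 = boto3.client("s3", endpoint_url=endpoint)
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            yield obj["Key"]

print(missing_keys(["img/1.jpg", "img/3.jpg"],
                   ["img/1.jpg", "img/2.jpg", "img/3.jpg"]))  # -> ['img/2.jpg']
```

With a filer store like Redis backing the listing, each page should be a cheap metadata scan rather than touching the volume servers.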


Of course you can go with faster filer stores, but the default filer store should be enough for millions of files. It is the same speed as a common file listing.

Actually I do not understand why MinIO is slow here.


This is interesting. I've been looking for a file system for a non-RAID disk array I want to set up at home, and this seems to have some of the characteristics I'm looking for. The primary downfall for my particular use case seems to be that I want to use parity-based error correction rather than (or in addition to) replication, because I want the array to be able to survive the failure of any N disks in the array.

Is there anything like that out there (other than Unraid, which I kinda don't like)?


SeaweedFS does support erasure coding for warm data.


Are there any docs on this feature? I couldn't find anything in the readme other than the blanket statement that it existed.



Good to know. Thanks!


I am not a large user whatsoever but I've been using SeaweedFS for a few years now.

It is archiving and serving more than 40,000 images on a webapp I built for the small team I work with.

I run SeaweedFS on two machines and it serves all images I host.

I wanted to kick the tires because I was always fascinated by Facebook's Haystack.

It has been simple, reliable, and robust. I really like it and hope that if one of my side projects ever takes off, I get to test it with a much bigger load.


This is really cool. The killer feature I see is to be able to have a cloud storage tier for warm data that goes off to s3, while keeping the hot storage local. Does anyone know of another option that allows this kind of hybrid local / s3 storage that also has a filesystem interface?


Alluxio (https://github.com/Alluxio/alluxio) does this.

But note that it doesn't work well on Kubernetes, and the POSIX FS interface isn't perfect; then again, I haven't found any that are.


There is also aistore which you can use as a distributed cache in front of different systems (s3, https, seaweedfs).


We have been running SeaweedFS successfully in production for a few years. We serve and store mostly user-uploaded images (around 100TB). It is surprisingly stable, and the maintainer is usually responsive when we encounter issues.


Are you running it on multiple nodes? If so, how confidently can you handle typical "cluster" operations like adding nodes, removing broken nodes, etc? Happy to hear via DM as well.


Yes, we scale regularly, though we usually only add nodes. We are slowly approaching 100 seaweed nodes. We are running in k8s on local SSD storage, managing failures is easy this way.


We're running across multiple nodes. Removing and adding volume servers is pretty simple. You can manually fix replication via a cli command after adding/removing a node.


If you want something similar that also supports NFS, then there's leofs: https://github.com/leo-project/leofs


I have been following seaweedFS since forever. Played with it on my own homelab.

But I don't know if there's a major shop that uses it. Does anyone know?


From a user who worked at Apple as a contractor: they used SeaweedFS object storage to store build artifacts, grown to about 200 nodes.

There are a few other companies, but not as famous as Apple. :)


Evercam.io has been using it for the past few years, i.e. almost 5 years, and I have been following seaweedfs for about as long. I think we are the largest user; at least, I have not seen anyone else coming back and forth on GH issues :)


I like it. What advantage does this have over something like Storj? (aside from the obvious operational differences)


Predictable latency? Hadoop Support? Too different to compare.


Geez, what's with these weird project names? For a second I expected/hoped this would be some cool hack storing data in actual seaweed (you know, like pingfs). No, it's not! It's some S3 k8s ... thing. Nothing wrong with that, but come on, choose a better name!

And no, I'm not particularly fond of the name CockroachDB either.


I'd rather have weird names that are unique than the incredible name overload from generic / cool product names that are used for everything from flagship products all the way down to 5-line npm packages.


I'm on the other side of this. Do a Google search before naming your thing!

Make it as silly and unique as possible, I'll be able to search for solutions later


>Make it as silly and unique as possible, I'll be able to search for solutions later

Silly? Not my preference. Unique? Absolutely. Something made up or completely out of context is likely to be a whole lot easier to search for.


On the other side of this, SeaweedFS appears to be largely implemented in Go, a useless keyword to search on.

"Golang" works much better, but sometimes results in a pedantic correction "The language is Go not Golang", to which I can only reply that the project is hosted on golang.org. Weird a.k.a. unique names are useful.

CockroachDB: Cockroaches are hard to kill. The vendor wants to imply this is also the case for their platform.


Haha. Actually SeaweedFS started with a worse name, but some users warned me how bad the name was in Japanese. So I added "Sea" in front of the old name.


Thanks for the good work on seaweedfs. I shared it with the community as I was working on using it in our startup to replace minio.

It would be nice if you could integrate it with pyfilesystem [1]. It's a small library providing a unified file interface in Python for local or cloud storage, without FUSE. It can already support seaweedfs via the S3 API, but it would be nicer to have a direct driver for seaweedfs, removing the S3 translation.

I like dried crispy seaweed wafers; they produce umami flavours, so it is a good name, reminding me of them. The quality of software isn't based on its name but on what it does, so don't worry about people talking about the name.

[1] https://docs.pyfilesystem.org/en/latest/


Thanks!

I am not good at Python at all and would not be able to add Python-specific support. However, SeaweedFS has gRPC APIs that you can use to do all file operations.


Is it possible to mount a SeaweedFS?


Yes, just run "weed mount". It's actually the same binary used to start "weed server".


As of two years ago, at DiDi, it had been storing and serving 10 billion files.


So weird...I was just thinking about this yesterday.


Looking good!

Take a look at Gasper (https://talhof8.github.com/gasper).


Your link doesn't work. Are you spamming? At least write something.



