Jepsen: Scylla 4.2-rc3 (jepsen.io)
172 points by aphyr on Dec 23, 2020 | hide | past | favorite | 31 comments


TL;DR: Pay close attention before choosing C* or Scylla. Only use the most basic features, and don't build near-realtime systems on them. There are more pitfalls than the number of seconds this software has been out.

Every time I see people talking about cassandra or scylla I am reminded of the huge pains and pitfalls of what feels like hacked-together solutions that were built on a very shaky theory.

I worked for only 2 years on systems that used Cassandra and Scylla (on both the dev and ops side), with about 7 TB of data in total.

I admit they are very fast, provided you do proper continuous maintenance and your application goes out of its way to conform to the database (absolutely not the other way around).

The only safe use of these 2 systems is basically: 1) add new unique rows (no updates); 2) much later (only after repairs, which take a very long time), you can read, but only up to a certain time in the past; 3) almost entirely avoid deleting.
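That append-only pattern can be sketched in CQL (table and names are illustrative, not from the original comment):

```cql
-- Append-only pattern: every write is a brand-new row under a unique
-- clustering key; no UPDATE, no DELETE.
CREATE TABLE events (
    source  text,
    ts      timeuuid,
    payload text,
    PRIMARY KEY (source, ts)
);

INSERT INTO events (source, ts, payload) VALUES ('sensor-1', now(), '...');

-- Read only well-settled history, i.e. rows older than some safety
-- window that repairs have had time to cover:
SELECT payload FROM events
WHERE source = 'sensor-1' AND ts < maxTimeuuid('2020-12-01');
```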

Anything else, and you are in the realm of cases and special cases.

You can't trust errors: your operation might have been applied anyway. LWTs are terribly slow and limited, and you should not run more than a very small number of them.

It's full of features that get added, then the Cassandra developers realize they really don't work, in both theory and practice, and deprecate them a few releases later.

Compared to SQL, it feels worse than navigating blind through a minefield of special cases.

The CQL language does not help. Only use the very basics of the language; otherwise it's a constant game of: this is the rule / with this exception / unless you also have this / except that / but it works again with X

and on and on.

It's like they had a very basic, specific use case, and went to build a ton of features that don't stack on top of it.

Like adding full SQL on top of redis, but worse.


LWT on Scylla is much faster than its C* implementation. As one of the co-founders, I agree that SQL and the relational model are superior to CQL. However, either such systems do not scale, or even when they do (NewSQL), they are 10x slower, handle less volume, etc.

The right way to use NoSQL is when you know your scaling model. LWT helps you to serialize important tables.

Scylla consistently makes improvements in operations, consistency, and functionality, so there is more to come; see what we'll announce at the summit next month.


> However, either they do not scale or even when they do (newsql), they are 10x slower and handle less volume, etc.

There's something to be said about CQL specifically here, because its transactional model is an odd duck, and not always for scalability reasons. Scylla and Cassandra could offer richer transactions without significant performance penalties--indeed, richer transactions would significantly speed up some types of transactional workloads--but instead there are, as the grandparent poster notes, strange edge cases.

For instance, you can't select multiple CQL rows where the select would cover a column backed by a CQL collection. That's been fixed in Cassandra, but is still present in Scylla, and limited the kinds of tests we could write for Scylla.

You also can't perform an LWT write without a guard clause: linearizable upserts in the Jepsen tests require the presence of an always-null field in every row whose only purpose is to convince Scylla that yes, we really would like to use Paxos for this write.
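A minimal sketch of that workaround (the table and the dummy column name are illustrative):

```cql
CREATE TABLE kv (k int PRIMARY KEY, v int, lwt_dummy int);

-- There is no unconditional LWT write, so the condition tests a column
-- that is never written: it is null for every row, so the IF always
-- passes, and its only effect is to route the write through Paxos.
UPDATE kv SET v = 42 WHERE k = 1 IF lwt_dummy = null;
```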

Likewise, there's no logical notion of a batch combining reads and writes, or, for that matter, a batch of reads. If you want those sorts of things in C* or Scylla, I imagine one winds up stringing together reads and CaS statements in retry loops, or something to that effect. We were trying to hack together a batch read via a no-op write, except that it's actually impossible to do that robustly--you can only "read" values involving guard clauses, and there's no guard which will pass on all values. It's just... awkward.
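For contrast, CQL's BATCH construct (tables illustrative) only accepts writes:

```cql
-- A batch of writes is fine (conditional writes in a batch must all
-- target the same partition):
BEGIN BATCH
    UPDATE accounts SET balance = 10 WHERE owner = 1 AND acct = 'a';
    UPDATE accounts SET balance = 20 WHERE owner = 1 AND acct = 'b';
APPLY BATCH;

-- But there is no read analogue: SELECT is not permitted inside a
-- batch, and nothing atomically combines reads with writes.
```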

This isn't a scalability thing--all three of these cases could be executed in a single Paxos round, scoped to a single partition. It's just... weird API design.


It's possible to improve CQL into a CQL++. If you start down this path, it will lead to SQL, which isn't bad at all ;)

It's not 'fair' to compare CQL to SQL. If you compare CQL to DynamoDB's HTTP API, which Scylla implements as well, you'll see that CQL is better: https://www.scylladb.com/2020/05/12/comparing-cql-and-the-dy...


> It's not 'fair' to compare CQL to SQL

He didn't do that. He pointed out issues with the CQL syntax, and specifically with Scylla's implementation of it. He is also uniquely qualified to make such a statement, as his Jepsen reports are high-quality analyses of various databases.

But to address your other claim: they solve the same problem of providing data to your application. They should be compared if that's the decision you have to make.


You should offer an additional service - “how to stop your DB from being weird” - in addition to Jepsen. Thanks!


> The only safe use of these 2 systems is basically: 1) add new unique rows (no updates); 2) much later (only after repairs, which take a very long time), you can read, but only up to a certain time in the past; 3) almost entirely avoid deleting.

1000x this

That is exactly what Cassandra was originally designed to do. Unlike most distributed databases, Cassandra doesn't implement the complex distributed mechanisms that would make it harder to operate and implement, nor does it impose the cost on application developers of understanding the tradeoffs they will have to grapple with. Cassandra is neither AP with less-than-quorum reads/writes nor CP with quorum reads/writes. It only fulfills those guarantees in the way you might expect if your cluster topology never changes and your timestamps are carefully synchronized.

Absent these guarantees, modifying existing data is essentially an undefined operation. This isn't a shot at Cassandra developers. Cassandra isn't magic. There is no possible way it can use a simple last-write-wins strategy to guarantee that data updates won't be lost in the absence of outside guarantees.
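Last-write-wins means that cell timestamps, not arrival order, decide conflicts, which is why unsynchronized clocks silently lose updates. A sketch (table illustrative):

```cql
-- The write with the higher timestamp wins, regardless of which
-- one reaches the replica last:
UPDATE t USING TIMESTAMP 1000 SET v = 'first' WHERE k = 1;
UPDATE t USING TIMESTAMP  999 SET v = 'second' WHERE k = 1;

-- A subsequent read returns 'first': the later-arriving write was
-- silently discarded because its timestamp was lower.
SELECT v FROM t WHERE k = 1;
```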

Cassandra is essentially a prototype of a class of NoSQL databases that try to be incredibly easy to manage for write-heavy workloads. Look, I get it, it was a collection of appealing technologies; I found the original white paper on Cassandra interesting too: columnar storage, read-repair, eventual consistency, merkle trees, column families, super columns, cluster gossip, the phi accrual failure detector.

Of course, the 'columnar' attribute was originally intended so that applications designed around Cassandra could reconcile multiple values associated with a single key:attribute path without having to use a complex API. Whether this is actually viable is another story entirely.

CQL's only advantage is that it vaguely resembles SQL. CQL is not a query language. It is a collection of Cassandra operators mapped onto a vaguely SQL-ish syntax that bears little relation to the underlying semantics it is derived from. Cassandra just shoehorns a bunch of its idiosyncrasies onto an SQL look-alike. INSERT and UPDATE do the same thing (which is not what SQL does for either, and probably not what you expect). The entire behavior of how a 'query' is actually executed can change drastically. In order to use Cassandra correctly you have to understand how the database physically implements every operation, so the query language doesn't provide a meaningful abstraction at all.
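The INSERT/UPDATE point can be shown directly (table illustrative): both statements are upserts, writing cells whether or not the row already exists.

```cql
CREATE TABLE users (id int PRIMARY KEY, name text);

-- 'INSERT' silently overwrites an existing row rather than failing:
INSERT INTO users (id, name) VALUES (1, 'alice');
INSERT INTO users (id, name) VALUES (1, 'bob');  -- no error; row 1 is now 'bob'

-- 'UPDATE' silently creates a missing row rather than matching zero rows:
UPDATE users SET name = 'carol' WHERE id = 2;    -- row 2 now exists
```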

CQL would have never made sense in the original implementation of cassandra because cassandra was never designed to do the things that people expect from a database that implements a query language.


When I was doing trainings for customers, I always said that CQL’s syntactic similarity to SQL did more harm than good, because beginners assume they can do SQL-like operations.


We at Scylla are perhaps better than anyone else aware of the Cassandra legacy we inherited. However, that is not what this analysis demonstrates. It shows we're working on making the great ideas of Cassandra work well, and the bad ideas become irrelevant. Should anyone judge Scylla based on their prior negative experience with Cassandra? I think not.


Worked with both, only on the opensource version.

I agree that Scylla shows a lot more promise than C*, but we hit (different) instabilities and gotchas on both.

Everything has its place. I am not saying C* or Scylla should not be used. I was just pointing out that the basic ideas, and the CQL language itself, are designed in a way that tricks you into doing a lot of things that turn out to be really wrong for the data model these systems use.

It's not an implementation thing. Considering only implementation, Scylla would be really good (provided you have the correct hardware) and much better than C*. Still, the issue (IMHO) is with the theory, and with the fact that the range of genuinely suitable applications is much, much narrower than the first impression leads you to believe (especially due to CQL).

--edit: Kudos on Scylla btw, but I'm doubtful how much it's possible to overcome that legacy without breaking compatibility even at the API level.


I'm likely to bring this up every time I notice a Jepsen post on HN.. I still miss the flair from the old days Kyle :)

I guess I'm nostalgic for it. Your posts were during a time when I was still on a steep trajectory in my career growth and discovery. Just soaking everything up, and it was FUN. And others in my "generation" seemed to also be having FUN, and having FUN with it. Things are just a bit more bland these days :|


Call-out to those fun days: https://circleci.com/blog/its-the-future/



> Insert is a destructive operation by design. In fact, writing any collection literal (e.g. {} or (1, 2)) is internally implemented by writing a deletion tombstone followed by the new values.

These range tombstones are also evil. If you keep doing it to the same partition, even at a rather low QPS, you can easily bring down some nodes by keeping their CPU at 100%. Only tested this with Cassandra though, not Scylla.
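The tombstone-per-write behavior is specific to writing a whole collection literal; element-wise additions avoid it. A sketch (table illustrative):

```cql
CREATE TABLE t (k int PRIMARY KEY, tags set<text>);

-- Overwriting the whole collection writes a range tombstone first,
-- then the new elements; doing this repeatedly piles up tombstones:
UPDATE t SET tags = {'a', 'b'} WHERE k = 1;

-- Adding elements writes only the new cells, with no tombstone:
UPDATE t SET tags = tags + {'c'} WHERE k = 1;
```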


Correct, but Jepsen focuses on consistency and correctness, not on destructive operations. Scylla is exposed to the range tombstone issue like many other LSMs. We have recently been improving this - check the most recent issues: https://github.com/scylladb/scylla/issues?q=is%3Aissue+is%3A...


I love Kyle's work with Jepsen. Has anyone done anything similar with filesystems? I'm imagining a filesystem mounted on a virtual device, and having the virtual device randomly suffer write faults and various other failures. And then a fuzzer asserting that all consistency requirements are still met. (All operations before a successful fsync should be stored, no lost writes or files, no unexpected garbage data, etc.)


You can imagine that Jepsen could run against filesystems itself. Basic JVM IO would be fairly easy to get going, and the POSIX APIs could be tested by having Jepsen call (maybe via IPC?) a small C/whatever binary. I don't know what fault injection would look like though: presumably you'd want to do something down in the kernel/device drivers/hw, and that land is largely a mystery to me.

Tangentially, I would love to have a FUSE filesystem with a.) minimal build dependencies, b.) some sort of CLI interface, and c.) the ability to, say, forget to flush un-fsynced data to "disk", allowing us to simulate a power failure. There have been a bunch of research projects on this front, and they find bugs spectacularly. I bet this approach would also find errors in distributed systems, but I've yet to find one that really has the right shape for use with Jepsen.


This is probably not relevant to Jepsen, and I don't know how similar distributed metadata in parallel filesystems looks to a distributed database. Anyway, some of the tests for the Lustre PFS use Linux's fault injection system. Also, Linux documents specific support for NFS fault injection, which I know nothing about, and which may or may not apply to pNFS. (It probably doesn't make it any more relevant or database-like, but the OrangeFS PFS uses bdb or lmdb instances for the arbitrarily distributed metadata.)


Still a layer above the filesystem, but the closest example I'm aware of is the IO error testing in SQLite's test suite:

> I/O errors are simulated in both the TCL and TH3 test harnesses by inserting a new Virtual File System object that is specially rigged to simulate an I/O error after a set number of I/O operations. As with OOM error testing, the I/O error simulators can be set to fail just once, or to fail continuously after the first failure. Tests are run in a loop, slowly increasing the point of failure until the test case runs to completion without error. The loop is run twice, once with the I/O error simulator set to simulate only a single failure and a second time with it set to fail all I/O operations after the first failure.

https://www.sqlite.org/testing.html#i_o_error_testing


Yes, filesystems are another big, important part of the topic of data consistency. I'm not aware of a project similar to Jepsen for filesystems, but there have been research papers, like the one presented on this page, that explore the guarantees of popular local filesystems: https://danluu.com/file-consistency/

Of course distributed filesystems are even harder, but we can see that even writing to a local file is surprisingly complex, and many (most?) apps probably don't really guarantee proper data consistency.

IMHO the whole POSIX filesystem API is flawed to begin with, and it would be great if a modern replacement emerged.


We did something in the neighborhood, check it out: https://www.scylladb.com/2016/02/16/fault-injection-filesyst...


> I'm not aware of a project similar to Jepsen for filesystems

Not as elaborate as Jepsen, but there has been some work:

- Kirk McKusick's papers and work on BSD fs log semantics

https://www.researchgate.net/scientific-contributions/Marsha...

I believe the term to look for is "soft updates."

- the BSD/NeXT file test program "fstest.c", used on local and NFS (Samba), which found many bugs in popular fs using simple operations. The ZFS team also has a version.

You can Bing versions of that by using quotes "fstest.c".

- the Lustre/Gluster maintainers/consulting team used to just untar emacs on their distributed fs buildouts and see how many nodes left the cluster. (They lived off DARPA funding, basically, and were paid to configure and install distributed filesystems for US govt/military supercomputer installations, and fix the underlying bugs as found.)

- Ironically, the Ceph team did not own any commercial storage devices, so they just tested on regular linux machines.

- Reiserfs 3 was the first GA log fs on linux (default on SUSE), so I was one of the earliest US users in production.

SUSE's rep called me a liar at a trade show, saying "nobody uses our distro in the US." :) It worked well on email-server loads, and could delete 1 million files in a directory in under 1 second. I followed the development of v4, but the "wandering logs" and "dancing trees", etc., kind of wigged me out.

https://en.wikipedia.org/wiki/Dancing_tree

Source: DBA and storage engineer.



Relatedly, at the Midlands Grad School 2019 conference, John Hughes did an introductory course to QuickCheck in which one of the final exercises (meant to demonstrate the use of Haskell and QuickCheck to black-box test arbitrary binaries) was to model `fopen`, `fprintf`, `fclose` and friends in C. I never actually did it, but I guess that's one of the most comprehensive ways to find out how something behaves!


The wolf also shall dwell with the lamb, The leopard shall lie down with the young goat, The calf and the young lion and the fatling together; And a little child shall lead them.

All popular databases shall routinely undergo Jepsen reviews; And none shall prohibit publication of test or benchmark results.

Isaiah 11:6-7 (New Database Translation)


We worked closely with Kyle (@Aphyr) as he prepared this report. He was nothing but gracious with his time, brutal but fair with his feedback, and unrelenting in his commitment to discerning precisely where and how we strayed from the one true path of rectitudinous database behavior. Hats off to him.


Hell will freeze over before Jepsen runs on Oracle


I would be surprised if it did poorly, however.


http://www.solidrockbaptist.net/satans-niv-bible.html

> Just a few examples should be enough to show you the problem. In the New International Version (NIV) it says in 2 Samuel 21:19 that someone other than David killed the giant Goliath. Anyone knows this is not true, except the 'scholars' who translated it. According to Hebrews 3:16 in the NIV all the children of Israel rebelled in the wilderness and none made it into the promised land; however, we know that Joshua and Caleb did not rebel, and God promised them each a place in the promised land (just read Numbers 13 & 14). Another mistake is found in Mark 1:2, 3 where according to the NIV the writer of Malachi was Isaiah! Just check out Isaiah 40:3 and Malachi 3:1 - the KJV has it right and the NIV has it wrong.

For comparison of the verse this comment parodies, https://biblehub.com/isaiah/11-6.htm


So are you saying the Bible suffers from a stale read problem? Or are you asserting it be eventually consistent?


The bible is most certainly AP.



