Immutability, MVCC, and garbage collection (xaprb.com)
135 points by mattrjacobs on Dec 28, 2013 | 46 comments


Pretty disappointing critique.

There are manifold differences between CouchDB, Datomic, RethinkDB, and more traditional SQL databases, but the author can't see past his pet issue. He doesn't seem to understand the use cases or the infrastructure differences. In terms of use cases, there are plenty of analytic datasets that are strictly monotonic. There is no opportunity to reclaim overwritten storage in that case.

But what about CoW index write amplification, you say? Well, now let's talk about infrastructure differences. Datomic can use S3 as a storage layer. Cold data storage is then $10/month/terabyte. And compaction is trivially non-blocking in a CoW dataset: it can be done by EC2 spot instances whenever it's convenient, with zero impact on the production system. Don't generalize the experiences of a few folks running CouchDB on a couple of in-house servers to people working with a fundamentally different approach on fundamentally different infrastructure.

Back to the cheap cost of cold storage: for most businesses that's well into "who cares" cost territory, but having the data also allows them to preserve full lineage in the event they discover a data-loss bug in their application. That could potentially be quite valuable. One way to think of these systems is that they have integrated backup data. MySQL folks generally don't count the size of backup sets as part of the database footprint, and common backup schedules (e.g. grandfather-father-son rotation) provide only a limited ability to dig back into the past.

By way of simple argument: would anyone advise using a source control system that only kept the last few hundred commits? Why do we treat our data differently? The answer is largely one of implementation constraints: text is cheap, VCS doesn't index the way databases do, etc. But as storage and processing densities continue to grow, more datasets shift into the trivial category. As much as many HN posters dream of being in a 'big data' shop, the vast majority of startups have maybe a couple of GB in MySQL or Postgres. Our status-quo toolset and approach is discarding value for these folks. Maybe you think that's fine, but some of us are going to keep pushing to enable new capabilities and advantages. Increasingly these implementation constraints will disappear as people attack them. It'd be nice not to suffer scornful surface criticisms from people who base their career around the status quo.


I think you are making a mistake by reading it as a critique of immutability in general. Rather, I think the author is criticizing an overly simplistic approach to using immutability.

When he describes traditional databases:

"What’s happening in such a database is a combination of short-term immutability, read and write optimizations to save and/or coalesce redundant work, and continuous “compaction” and reuse of disk space to stabilize disk usage and avoid infinite growth."

He's saying that's what people arrive at after working out the issues with immutability. When you say: "Increasingly these implementation constraints will disappear as people attack them." the author would probably respond: "Yes, and people will attack them with a lot of the same strategies that SQL databases are already using today".


I strongly disagree with the idea that temporal databases will be forced to converge on the implementation designs of Postgres, MySQL, etc. In fact, I'd make the firm prediction that the paged-window CoW strategy of LMDB will become dominant for these more traditional database designs, and that a new class of temporal databases will improve upon that architecture.


It's not so much that everything will look exactly like postgres. It's that the biggest difference right now between the designs is complexity (the kind that comes with maturity).

Every mature general-purpose database system makes use of both CoW/immutable design principles as well as locking and update-in-place in various contexts. It remains to be seen how "temporal databases" (as you call them) will adapt to real-world demands, but it might not look as radically different as you think.

Also, for anyone who thinks that CoW is the be-all-end-all design principle, I have my doubts. The main reason is that it doesn't compose very cleanly -- a design that involves CoW over CoW over CoW starts to look wasteful after a while.

That being said, I would not be surprised if some of these new databases found some new trade-offs that worked very nicely and got adopted by more traditional systems. Storage is not nearly as much of a concern for many kinds of problems, and it seems natural that such a shift would lead to some design changes.


"By way of simple argument: would anyone advise using a source control system that only kept the last few hundreds of commits? Why do we treat our data differently?"

Well said!


What? Come on, really? My database, which holds only the latest version of data that gets updated frequently, has terabytes of data. My whole git repo with full history is about 500 MB, and that's excluding large assets. It's a completely different ball game. Sure, if I could buy 100 TB disks cheaply and set things up so I could access data across hundreds of them in real time, then there wouldn't be a problem. Obviously, I can't do that.

As the author of the blog post points out: the little thing called reality gets in the way of such nice and perfect arguments and ideas.


When talking about architecture at this level, you need to take a longer view. Once upon a time, a gigabyte database was mammoth beyond thought. Also, you're taking it as given that your data has its present storage footprint even though alternative architectures might have very different properties. It's worth noting that a purely additive database can be very aggressive about compression.

But also, petabyte datasets are downright cheap to deal with these days. They'll look more like that half-gigabyte source code repo sooner than you seem to be considering.


We're obviously coming from different backgrounds here. Do you work at Google or similar?

'Downright cheap' is a Heroku instance or two. 'Affordable' is 20 or so servers with a few TB of drive space each in a cluster, with a few backups. Petabytes of data need a huge number of servers and dedicated personnel to look after them. I don't know how that can be regarded as 'downright cheap'.

If you're in a position to handle petabytes of data you're probably going to roll your own data center with custom software à la Google's Bigtable. This definitely isn't the market that these little startups are targeting, as far as I can see. The OP's critique is very well founded.


> petabyte datasets are downright cheap to deal with these days

Are you from the future? In no universe is a petabyte database remotely close to being cheap, even for large organizations. At this point I am seriously questioning your industry knowledge.


Sure it is, a few thousand entries pointing to 10-100GB pirated HD video blobs. What's the problem?

/s


You're right, but the vast majority of databases aren't that big. You are very explicitly recommended against using something like Datomic if you think you will need to purge data at regular intervals.


"but the vast majority of databases aren't that big" probably because they don't keep all copies of everything forever. Even as basic blog system would grow huge just based on spam comments alone.


But you don't have to keep redundant copies around forever. Immutability means nothing in the past can change, which means it's perfectly fine for future entries to reference old values. New rows can essentially be deltas, much like version control systems.
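
To make the "rows as deltas" idea concrete, here is a toy Python sketch. The record layout (an append-only list of {entity, changes, prev} maps) is invented purely for illustration; it is not how Datomic or any particular database actually lays things out.

```python
# Minimal sketch (not Datomic's actual storage format): an append-only log
# where each new entity version records only the changed fields ("deltas")
# plus a reference to the previous version, so old data is shared, not copied.

log = []  # append-only list of immutable records


def append_version(entity_id, changes, prev_index=None):
    """Append a new version of an entity; return its position in the log."""
    log.append({"entity": entity_id, "changes": dict(changes), "prev": prev_index})
    return len(log) - 1


def materialize(index):
    """Rebuild the full value of an entity version by walking its delta chain."""
    chain = []
    while index is not None:
        chain.append(log[index])
        index = log[index]["prev"]
    fields = {}
    for record in reversed(chain):     # apply the oldest delta first
        fields.update(record["changes"])
    return fields


v1 = append_version("user-1", {"name": "Ada", "email": "ada@example.com"})
v2 = append_version("user-1", {"email": "ada@newhost.org"}, prev_index=v1)

print(materialize(v2))  # {'name': 'Ada', 'email': 'ada@newhost.org'}
print(materialize(v1))  # the old version is still fully readable
```

The point is just that old versions stay readable while new versions only pay for what changed.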

I disagree that Datomic necessarily grows huge. I'm just saying don't use it if you know you're going to be storing terabytes and you don't have enough disk space to handle it without needing to truncate data at regular intervals.


Git is not append-only, though. It has a stop-the-world GC which compacts data and removes objects not referenced by any current branch, tag, or similar object. So even if you want to keep the history, you may not want to keep everything.


It's important to make a distinction between semantic and physical datasets. Git is append-only (or more properly, strictly accumulative in an information sense) from a semantic perspective. From a user perspective we don't really care what the packfiles etc. look like. Git will also forget timelines (the transitive closure of a reference) you don't care about if you tell it to.

So it's important to distinguish purely persistent data structures in the sense of Okasaki et al. from append-only datasets in the semantic sense. Git still qualifies as the latter even though its implementation uses mutability where appropriate.

This is a good example of where the OP's critique is painting with too broad a brush. If we think from first principles: indexes are derived state, not primary state. The value log is the ultimate authority; the index can be regenerated at will. There is a huge design space of indexing over append-only logs that remains largely unexplored.
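
To make the "indexes are derived state" point concrete, here's a toy Python sketch, not modeled on any real engine: the append-only fact log is the only authority, and a lookup index is derived from it and can be thrown away and rebuilt at will.

```python
# Toy illustration (no particular database's design): the append-only log of
# facts is the authority; the index is derived state that can be discarded
# and regenerated from the log at any time.

facts = []  # append-only: (entity, attribute, value, transaction)


def assert_fact(entity, attribute, value):
    facts.append((entity, attribute, value, len(facts)))


def rebuild_index():
    """Derive an (entity, attribute) -> latest value index from the log."""
    index = {}
    for entity, attribute, value, _tx in facts:   # later facts win
        index[(entity, attribute)] = value
    return index


assert_fact("order-7", "status", "placed")
assert_fact("order-7", "status", "shipped")
assert_fact("order-7", "carrier", "UPS")

index = rebuild_index()
print(index[("order-7", "status")])   # shipped

# Drop the index entirely and regenerate it: no primary data is lost.
del index
index = rebuild_index()
```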

I personally strongly believe databases will move emphatically in this direction because the benefits are so great. Given the typical tiered web application, wouldn't you like it if, when you get an exception notification, instead of just a text blob of a stack trace you could resume a continuation of the state of the system as the user saw it at the time of the error? Navigate forward and backward in time at will, even with nonlinear jumps? Edit the source code of the past and then replay toward the future to see if it resolves or reproduces the exception? There are huge opportunities here.

I will admit it's not immediately clear how to implement a database that reflects these ideas. It's going to take some experimentation and learning. That'll involve some failed experiments. What bugged me most about the OP was the attitude of "we're done, shut up with this other stuff." We are emphatically not done. It's ironic to note that at the moment MySQL was birthed it faced the same FUD-style criticism.


Your comment triggered a recurring thought of mine: what if we had a perfect data store? What would it look like? I think it would have these properties:

- instant access to the most current data from anywhere in the world (no need for replication)

- infinite storage space (no need for gc)

- instant access to any value in the dataset (any joins are ok, indexing not needed)

Obviously we are not there, but let's suppose we had such a tool. I think going the append-only way would make sense, and data deletion would only happen proactively (e.g. for legal or privacy reasons).

Also, in fact, a good database is trying to mimic this perfect storage, and some of its users are hitting the boundaries. Anyway, when you feel that a database is not perfect in this sense and need to adjust your usage, it means you've met a boundary, one which may not be there anymore in a few years.

Another angle is that maybe there is a physical limit to data transfer that will stop the evolution of data stores, a bit like the speed of light. Or would the speed of light already put noticeable boundaries on this hypothetical perfect data store?


> wouldn't you like it if, when you get an exception notification, instead of just a text blob of a stack trace you could resume a continuation of the state of the system as the user saw it at the time of the error?

This, along with capturing headers (perhaps via webserver logs) as well as POST data, would actually be incredibly powerful. There is a certain difficult class of bugs that this would greatly help mitigate.


> What bugged me most about the OP was the attitude of "we're done, shut up with this other stuff."

The OP didn't say that at all, but instead observed (in a manner that you have quite overwhelmingly failed to refute, relying instead upon emotives and non sequiturs, and betraying a bizarre and completely unsupported defensiveness that seems to be some variation of "but it's new, man! It's new!") that everything old is new again, over and over again.

People have been trying these things for literally decades, and many of the current products that lead categories have some elements of those designs. If someone is pitching a very old idea as (r)evolutionary, it is worthy of discussion, whether that offends your sensibilities or not.

And it's odd that you mentioned MySQL, given that yes, it did face "criticisms", because it made the same old mistakes that so many products before it made. FUD? Do you think that since then Oracle became more like MySQL, or the opposite? I'll answer that for you -- MySQL abandoned all of the supposed "advantages" that were birthed largely in ignorance and adopted the designs of the products that came before.


> By way of simple argument: would anyone advise using a source control system that only kept the last few hundred commits? Why do we treat our data differently?

All mainstream database products allow you to retain the entire history of every change ever made, in the form of transaction logs. With those logs you can recover your database to any specific point in time, and unwind specific transactions.

Of course, most discard transaction logs because the volumes tend to be huge in many mainstream operational databases. I worked on one system where the main database barely pushed 10 GB, but there were hundreds of GBs of transaction logs generated daily.

However, the core reason that mainstream databases discard historical data is performance, not size: if your current users overwhelmingly care only about the current state of the data, and it is rare that you need to go back into the past (which is what transaction logs and snapshots provide), paying an extremely high performance penalty to retain all of that historical data does not pay off. And there are no immutable products that offer performance similar to mainstream products under most usage scenarios; instead they force you into extremely confined uses.
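
For readers who haven't worked with log-based recovery, here is a simplified Python sketch of the principle; the snapshot and log formats are invented for the example and don't correspond to any vendor's implementation. You restore the last snapshot and replay logged changes in commit order up to a chosen recovery point.

```python
# Simplified illustration of point-in-time recovery (not any vendor's actual
# log format): restore a base snapshot, then replay logged changes in order,
# stopping at the recovery target time.

from datetime import datetime

snapshot = {"balance:alice": 100, "balance:bob": 50}   # last full backup

transaction_log = [  # (commit_time, key, new_value)
    (datetime(2013, 12, 28, 9, 0),  "balance:alice", 90),
    (datetime(2013, 12, 28, 9, 5),  "balance:bob",   60),
    (datetime(2013, 12, 28, 9, 30), "balance:alice", 0),   # the unwanted transaction
]


def recover_to(target_time):
    state = dict(snapshot)
    for commit_time, key, value in transaction_log:
        if commit_time > target_time:
            break                      # stop before the changes we want to unwind
        state[key] = value
    return state


# Recover to just before the 9:30 transaction.
print(recover_to(datetime(2013, 12, 28, 9, 29)))
# {'balance:alice': 90, 'balance:bob': 60}
```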

> It'd be nice not to suffer scornful surface criticisms from people who base their career around the status quo.

This is a garbage non sequitur, as an aside. It is a disgraceful attempt -- as so commonly happens in such discussions -- to attach an agenda to an opinion that one doesn't like.


Well, I have worked in a few companies with big SQL databases. You say they only care about the current state, but really every big database I have seen actually implements some kind of backup scheme -- mostly handcrafted stuff.

The difference between Datomic and just storing your transaction log is that Datomic gives you first-class access to your whole history; you can work with it as easily as with the current data. I have at least not seen that in any other database.

So for me at least it seems that a database which just stores everything and gives access to it is what most people should use for most problems. For me at least, Datomic is the new standard, and only if I have a special use case would I go to something else.


> you can work with it as easily as with the current data. I have at least not seen that in any other database.

At the cost that working with your current data is as hard as working with the historical data. This does not come for free, and the only way to have such a versioned history is at a significant cost to generalized query performance -- sure, certain things (like a single scalar lookup) can be fast, but generalized queries will be terrible. I'm sure there will be some claimed map/reduce solution.

> For me at least, Datomic is the new standard, and only if I have a special use case would I go to something else.

This is an incredible and rather ridiculous statement.


"The only way to have such a versioned history is ..." is incorrect, and trivially so.

Datomic provides one existence proof of this: Datomic's history data is kept in distinct data structures, so it is "pay as you go" -- querying history is more expensive, because there is more stuff. Querying the present is cheaper.

Datomic queries are datalog, and do not require writing map/reduce jobs.


The author ends with "Also, if I’ve gone too far, missed something important, gotten anything wrong, or otherwise need some education myself, please let me know so I can a) learn and b) correct my error." so allow me to offer the following:

When you spend a substantial portion of your post sniping at the competition you come across as an arrogant arse. For example, the second sentence in the post is "My overall impressions of Datomic were pretty negative, but this blog post isn’t about that." Then remove this line! It adds nothing to the point you claim you want to make. Same thing in the next section regarding append-only B-trees, and then we have snark directed at RethinkDB, and so on. I got about halfway through the post before I just skimmed to the end, as I was sick of the arrogance and insults, and I was beginning to doubt your authority to comment on this area as a result -- if you have substantial points to make, you don't need to cover them up with this crap.


Well, they are all tools with trade-offs. It is good to have something like Datomic. The trick with Datomic (from a 10-minute summary I listened to last year) is that because of immutability, interesting things can happen with respect to caching. You can cache things locally and have smart clients. It simplifies things and adds interesting performance properties.

Sometimes immutability is not right for you, sometimes it is.

He criticizes CouchDB as well. Yes, you need to have double the disk space. But CouchDB now has auto-compaction triggers based on a fragmentation threshold or on times of day. Immutability allows completely lock-less reading. Crash-only stopping (you can just kill the power or the service at any time without corrupting the database). Master-to-master replication lets you build custom cluster topologies. Well-defined conflict models and explicit conflict handling are invaluable (as opposed to other databases that hide and paper over that, sometimes losing data in the process -- read Aphyr's blog http://aphyr.com/ for a thorough study of that). Do you need those features? Maybe you don't, maybe you do. It is good to be aware of them and have them in your toolbox if you need them one day.


I'd ignored this, trying to have a nice holiday with my family (without having to contend with wrongness on the internet :), but would like to introduce some facts. Datomic is not an append-only b-tree, and never has been.

There's a presumption in the article that "immutable" implies "append-only b-tree"; specifically, an argument against append-only b-trees is used to disparage immutability. That's pretty naive. One need only look at e.g. the BigTable paper for an example of how non-append immutability can be used in a DB. Such architectures are now pervasive. Datomic works similarly.

One can't make a technical argument against keeping information. It's obviously desirable (git) and often legally necessary. Our customers want and need it. Yeah, it's also hard, but update-in-place systems (including MVCC ones) that discard information can't possibly be considered to have "solved" the same problem.

Prefixing a polemic with "apparently" doesn't get one off the hook for spreading a bunch of misinformation.


Append-only b-tree/skiplist/whatever immutability can be combined with "checkpoint hashes" (think git) using, for example, SHA-512, which can be transmitted elsewhere at low cost and used later to validate system integrity. Any data-store tampering would be detectable. In a mutable database you'd need to transmit the whole transaction log for the same integrity guarantee.

Thus an immutable store makes it trivial to implement systems that need to have a full history and audit trail.
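
Here's a rough Python sketch of that checkpoint-hash idea, using the standard hashlib module; the record format and chaining scheme are made up for illustration (in the spirit of git's commit hashing, not any real database's audit feature).

```python
# Rough sketch of chained "checkpoint hashes" over an append-only store.
# Each checkpoint hashes the previous checkpoint plus the records appended
# since, so shipping just the latest 64-byte digest off-site is enough to
# detect any later tampering with the history.

import hashlib

GENESIS = b"\x00" * 64


def checkpoint(prev_digest, new_records):
    """Hash the previous checkpoint together with the records appended since."""
    h = hashlib.sha512()
    h.update(prev_digest)
    for record in new_records:
        h.update(record.encode("utf-8"))
    return h.digest()


store = ["txn1: alice +100", "txn2: bob +50"]
cp1 = checkpoint(GENESIS, store)          # transmit cp1 somewhere safe (64 bytes)

store += ["txn3: alice -10"]
cp2 = checkpoint(cp1, store[2:])          # only hashes records added since cp1

# Later, an auditor replays the log and recomputes the chain. If any historical
# record was altered, the recomputed digest won't match the transmitted one.
assert checkpoint(checkpoint(GENESIS, store[:2]), store[2:]) == cp2
```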

When there's only a trivial amount of data involved, immutable state can solve a lot of headaches. No programming error can lose or corrupt data, for instance; every case can be traced back after the fact.

I hope some immutable-store databases take off that cover those use cases. Make it a graph database; bonus points if the underlying engine implements a directed hypergraph. And an artificial intelligence reasoning engine. :)


"If entities are made of related facts, and some facts are updated but others aren’t, then as the database grows and new facts are recorded, an entity ends up being scattered widely over a lot of storage."

As far as I understand, this isn't so. All data in Datomic is stored in the indices, so the facts associated with an entity should be stored together.


I'm going to go out on a limb here and use a logical fallacy, but I have a feeling that Rich Hickey probably has some understanding of the history of database theory and didn't just build Datomic out of ignorance.


But he also does not have infinite engineering resources at his disposal. I expect that a lot of the designs are probably somewhat simplistic compared to the battle-hardened implementations in common use today.

It doesn't mean Rich Hickey was wrong; merely that we are seeing only the ultra-clean design now. After Datomic takes off and has a lot of production users in demanding environments for 5-10 years, I doubt the design will be quite so clean.


I thought he co-designed it with a company called Relevance.


Well, there used to be a firm called Datomic, and one called Relevance, but they merged into Cognitect.


If we are allowing the appeal to authority fallacy then neither would Baron Schwartz write this article out of ignorance. We have two different authorities disagreeing here.


To me, what was most striking about the article (once I got past the pot-shots), was how much the descriptions of Datomic and RethinkDB reminded me of the ZODB, the object database built into the Zope framework: http://www.zodb.org

ZODB is an append-only object database, and one of the data structures commonly used for persistence in it is the BTree: http://www.zodb.org/en/latest/documentation/guide/modules.ht...

Several disadvantages noted in the OP (infinitely growing data, needing to pack the DB to discard old object versions, needing 2x disk space to do the packing) definitely apply to ZODB, but others do not, as it is ACID-compliant and implements MVCC.


The author needs to read up way more on Datomic... and Clojure, for that matter... then try again.


A very rare example of old-school thinking and reasoning.

Indeed, industrial-strength databases are hard, and as an Informix DBA, I can only agree that there is no silver bullet, and that a plain "checkpoint early, checkpoint often" strategy has its trade-offs in terms of concurrent query performance. And, of course, unpredictable stop-the-world GC pauses are unacceptable. Informix has polished its Dynamic Server for a decade and it is still one of the best (if not the best) solutions available.

The more subtle notion is that one can never avoid compaction/GC pauses, so the strategy should be to partition the data and perform "separate" compaction/GC, blocking/freezing only some "snapshot" of the data, without stopping the whole world, as some modern filesystems do.

Of course, implementation is hard, and here the simplicity of persistent, append-only data structures could be beneficial for the implementer of "partitioned/partial GC".
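
As a toy illustration of that partitioned approach, here's a Python sketch with an invented segment layout (nothing like Informix's or any real engine's internals): the store is split into segments and each is compacted independently, so only a small slice is ever being rewritten at once.

```python
# Toy sketch of partitioned compaction: the log is split into segments, and
# each segment is compacted (dropping superseded versions) independently,
# so readers are never blocked on a whole-store "stop the world" pass.

segments = [
    [("k1", 1), ("k2", 1), ("k1", 2)],      # older segment: k1 rewritten within it
    [("k3", 1), ("k2", 2)],                 # newer segment: supersedes k2
]


def superseded_in_later_segments(segment_index):
    """Keys written again in any later segment are dead in this segment."""
    return {key for seg in segments[segment_index + 1:] for key, _ in seg}


def compact_segment(segment_index):
    dead = superseded_in_later_segments(segment_index)
    latest = {}
    for key, value in segments[segment_index]:
        latest[key] = value                  # keep only the last write per key
    compacted = [(k, v) for k, v in latest.items() if k not in dead]
    segments[segment_index] = compacted      # swap in; other segments untouched


compact_segment(0)
print(segments[0])   # [('k1', 2)]  -- k2 dropped: a newer segment holds it
```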

There must be some literature, because these ideas were well researched in the old days, when the design of Lisp GCs or CPU caches were hot topics.


I'd like to keep immutability and databases separate in this comment. Immutability, the supposed "big advantage" of functional languages, is a discussion that only goes one way.

No one ever discusses the implications of a system which is constantly, needlessly, insanely doing nothing but MAKING COPIES OF DATA.

This is not how systems are or should be designed, and I'm sure as hell not going to use a functional language until functional programming understands basic common sense. A system that is constantly copying data for no reason will do nothing else.


> Immutability the supposed "big advantage" of functional languages

No, immutability is just the means. The objective is writing programs that are easier to reason about.

> No one ever discusses the implications of a system which is constantly, needlessly, insanely doing nothing but MAKING COPIES OF DATA.

Just because copy assignment requires, well, copying in languages like C++, it does not mean things work the same way in functional languages. Precisely because data structures are immutable, you can internally just pass around a pointer to the data structure and pretend you have copied it (e.g., when you pass it as a parameter to another function).


You are correct, that's easy to reason about unless you are writing the garbage collector, in which case it's pointless to write a garbage collector because there will never be anything to collect and memory will expand infinitely.

As for the persistent data structure comments, that has nothing to do with the fact that every function in a functional program has to copy data at such a ridiculously fine-grained level that all you are doing is writing a garbage creator.

This data that every function creates in the name of immutability is meaningless, void of any purpose and is a bug.


> constantly, needlessly, insanely doing nothing but MAKING COPIES OF DATA.

That's simply not the case. With persistent data structures (used in languages such as Clojure, upon which Datomic is built), copying is kept to a minimum. If you've got a million-element hash map and you want to associate a new key/value pair with it, you copy the path from the root down to the affected leaf of the tree -- an O(log32 N) operation -- and you get a new map with 1,000,001 key/value pairs. Just for reference, log32 of 1,000,000 is ~4. This growth in complexity is so negligible that it ought to be considered constant for most practical purposes.
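
To put a number behind "copying is kept to a minimum", here's a toy Python sketch of path copying in a wide trie. It is a drastically simplified cousin of Clojure's persistent hash maps (no HAMT node compression, no collision buckets), just enough to show that "copying" a large map means allocating a handful of nodes along one path.

```python
# Toy sketch of path copying in a 32-way trie. An "update" rewrites only the
# nodes on one root-to-leaf path; every untouched subtree is shared as-is.

BITS, WIDTH, DEPTH = 5, 32, 7     # 7 levels of 5 hash bits each

EMPTY = None


def assoc(node, key, value, level=0):
    """Return a new trie with key -> value added; the old trie is untouched."""
    if level == DEPTH:
        return (key, value)                       # leaf (collisions ignored here)
    idx = (hash(key) >> (level * BITS)) & (WIDTH - 1)
    children = list(node) if node is not None else [None] * WIDTH
    children[idx] = assoc(children[idx], key, value, level + 1)
    return tuple(children)                        # only this path is newly allocated


def lookup(node, key, level=0):
    if node is None:
        return None
    if level == DEPTH:
        return node[1] if node[0] == key else None
    idx = (hash(key) >> (level * BITS)) & (WIDTH - 1)
    return lookup(node[idx], key, level + 1)


m1 = EMPTY
for i in range(1000):
    m1 = assoc(m1, "key%d" % i, i)

m2 = assoc(m1, "key42", "changed")    # the "copy" touches ~7 nodes out of thousands
print(lookup(m1, "key42"), lookup(m2, "key42"))   # 42 changed  -- m1 is unchanged
```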


> No one ever discusses the implications of a system which is constantly, needlessly, insanely doing nothing but MAKING COPIES OF DATA.

see persistent data structure


I don't believe in this data structure. It's ill-informed Rich Hickey nonsense; it doesn't scale and has no purpose but to seem clever.


> It's ill-informed Rich Hickey nonsense

How so?

> it doesn't scale

I take it that if many of Clojure's primitives are built using this data structure, they don't scale either? Yet someone else mentioned how performant it actually is, given that it doesn't copy everything, but merely shares common structure.

Given your past comments, this isn't the first time you've said something without knowing much about it.


Immutability doesn't mean large amounts of copying. You can exploit the fact that things will never change in clever ways.

http://hypirion.com/musings/understanding-persistent-vector-...


Maybe you make more copies. But that can be done in parallel. Immutability makes it much easier to do fully concurrent garbage collection. If I make my program use 3x more (parallelized) CPU time but avoid 500ms pauses, that's a good tradeoff - and in my actual job solving real problems I've seen Haskell beat even carefully written C++ by a factor of 5.


Copying is in many cases faster than mutating data in place from several threads, which requires blocking and cache synchronization. Therefore, it is easier to scale.



