Anticipating the Future with LMDB (2016)

vlovich123 · on Nov 14, 2022

My hunch is the analysis was fatally flawed.

> the cost of storage has been decreasing at 38%/year

This assumes that the growth rate in bytes used is staying constant. I don't think that's accurate in the aggregate which is why there's compression. Additionally, the space overhead isn't just about space usage. Flash suffers horrible wear which means that your space amplification is also write amplification. The combination in a growth rate in the amount of data and a plateau in write cycles means that without compression your cost of storage is probably costing you much more than it should.

Additionally, it assumes the growth rate for storage won't hit it's own plateau. Indeed, solid state prices seem to have leveled out already only 6 years after this article because flash is really hard to make cheaper (& even spinning disk has plateaued).

I think the analysis is right that you can't just assume you can use as much CPU as you want, but I think it's wrong to eschew compression, especially since the DB gets to choose where to spend compression time on (& decompression is usually free because you cache decompressed results so on average you're not doing much).

hyc_symas · on Nov 18, 2022

Hm, in March 2016 I got a 64GB USB flash drive from amazon.co.uk for £24.99 and today a similar 64GB drive is only £6.99.

A 500GB Samsung 850 EVO SSD was £110.99. Now a 500GB Samsung 870 EVO SSD is £68.47. Samsung has just started mass production of 1 Tbit TLC NAND: densities & performance up, cost down. https://www.techtarget.com/searchstorage/news/252527119/Sams...

I admit I was counting on Optane to have a large part to play by now, and that has completely tanked. But something like MRAM is bound to come along soon and take its place, and the wear factor will be a thing of the past.

cryptonector · on Nov 14, 2022

My take is that a B-tree-like database should really have: a) optional compression, b) variable-size blocks, c) a write-ahead log (possibly indexed) to amortize the cost of B-tree transactions. That way, if compression pays, you use it, and if not, not. I.e., it should be like ZFS.

ComputerGuru · on Nov 14, 2022

Or match your block size to the fs block size and use ZFS compression :)

(It’s still a win to use compression at the db level because the data is typically also compressed in memory.)

hyc_symas · on Nov 18, 2022

Filesystem-level compression isn't magic, you still have to write data in sector-sized chunks to the underlying device. And while individual flash chips still use 512 byte sectors (pages), modern SSDs flash controllers gang them up in parallel, so the smallest "sector" you can write from the host side is still 4KB and up. Which means, when you're doing single-page writes on the host, compression doesn't save you anything at the device level. For large data items, you can obviously save space if the uncompressed size is larger than a couple of pages. But then you lose CPU time on the access. Typically, with very large data items, when you retrieve them, you may only need a couple small pieces, not the entire item. If you've compressed the entire value though, you must decompress the entire thing before you can use any of it.

If you do piecewise page-level compression, then you're back to the problem of not saving any space unless you can guarantee better than 50% compression per page, to fit at least 2 compressed pages per physical page. And nobody gets better than 50% compression on arbitrary binary data without expending a lot of CPU cycles.

To get really good compression ratios you need to be able to analyze a larger volume of data (and of course there has to be low enough entropy in the data itself). So if you're restricted to page-sized chunks to compress, you'll never get enough of a window to achieve great compression ratios.

(PS: I've done plenty of work on data compression. https://en.wikipedia.org/wiki/ARC_%28file_format%29 )

vlovich123 · on Nov 16, 2022

Why b-tree instead of LSM? LSM seems like a more natural fit for flash drives, no?

cryptonector · on Nov 16, 2022

LSMs need to be squashed once in a while, and it's less work to fold an LSM into a B-tree than it is to write the whole DB.

I think LSMs work great as indexed write-ahead logs. Like, the ZIL is essentially a write-ahead log, but it's not indexed. The ZIL should have been indexed so that if you crash you don't have to process the ZIL into the ZFS tree, instead you just run with it (this might also reduce the amount of RAM ZFS needs to track open transactions).