
I talk a little about the history of this approach in my early LMDB design papers. The mmap approach was actually quite popular up until the mid-to-late 1980s. Until then, 32-bit address spaces were still orders of magnitude larger than the physical memory of common machines. But when RAM sizes started approaching 1GB and database sizes grew beyond 4GB, the notion of mmap'ing an entire DB into a process became impractical.

It took the introduction of SPARC64 in 1995 to make it really worth considering again, and the Athlon 64 in 2003 to make it feasible for mainstream computing.

Die-hard DB authors will tell you that having a user-space page cache buys you control: the DB knows more about the workload than the kernel does, and so can make better decisions about which pages to evict and which to keep.

Of course, with greater power comes greater responsibility. Managing a cache, and eviction/replacement strategies in particular, is one of the most difficult problems in computing. Maintaining all of that knowledge about the workload can be much more compute-intensive than actually executing the original workload; indeed, LMDB demonstrates that it's always more costly. Kernel code can program the MMU and handle page faults much more efficiently than any user-level code can, and it has global visibility into all of the hardware resources in a system. It's already the kernel's job to balance memory demands among the various processes in a system. A user-level app knows only about its own consumption; it might be able to obtain information about other workloads on the machine, but only at great cost.

In today's age of virtual-machine (over)use, app-level memory management is an even worse proposition. (In the OpenLDAP Project we had an effort, led by an IBM team, looking into a self-tuning cache for OpenLDAP. They particularly wanted it to "play nice" and give up cache memory under memory pressure from other processes in the system, and even from the host system when running inside a VM. That effort was IMO futile, and the work never reached fruition. But the mmap approach achieves all of their goals, without any of the massive amount of load-monitoring and timing code they needed.)

On modern-day hardware with MMUs, there's no good reason to have user-space cache management.
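
For concreteness, here's a minimal sketch (mine, not LMDB's actual code) of what the mmap approach looks like in practice; the file name is made up, and error handling is trimmed:

    /* Map the whole DB file read-only and let the kernel's page
     * cache handle all eviction and readahead. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.mdb", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* One mmap covers the whole DB; the kernel pages data in on
         * demand and evicts under memory pressure, with the global
         * visibility that no user-space cache has. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* Reads are plain pointer dereferences: no read() syscalls,
         * no user-space buffer copies. */
        printf("first byte: %d\n", ((unsigned char *)map)[0]);

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }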



With Berkeley DB it's even worse, since, by default, BDB uses a memory-mapped file as its cache. So the cache gets effectively double-managed, first by BDB, then by the OS itself. This wreaks havoc on performance.

The way around this design flaw is to either tell BDB to use SysV shared memory (which really should be the default), or to locate the cache on a ramdisk (e.g. tmpfs on Linux).
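
For example, here's a rough sketch of opening a BDB environment with its cache in SysV shared memory; DB_SYSTEM_MEM and set_shm_key() are real BDB APIs, but the home directory, key, and cache size are made-up values, and error handling is trimmed:

    #include <db.h>
    #include <stdio.h>

    int main(void)
    {
        DB_ENV *env;
        if (db_env_create(&env, 0) != 0)
            return 1;

        /* Base SysV IPC key; arbitrary, but must be set before open()
         * when DB_SYSTEM_MEM is used, and unique per environment. */
        env->set_shm_key(env, 42);

        /* 64MB cache in one contiguous region (illustrative sizing). */
        env->set_cachesize(env, 0, 64 * 1024 * 1024, 1);

        /* DB_SYSTEM_MEM puts the env regions in SysV shared memory
         * instead of the default memory-mapped files. */
        int ret = env->open(env, "/var/db/home",
                            DB_CREATE | DB_INIT_MPOOL | DB_SYSTEM_MEM, 0);
        if (ret != 0) {
            fprintf(stderr, "env open: %s\n", db_strerror(ret));
            return 1;
        }
        env->close(env, 0);
        return 0;
    }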


I'm surprised it's not the default. I agree it should be. I wonder why it's not... there must be some good reason? This page confirms that what you're saying is correct: http://pybsddb.sourceforge.net/ref/env/region.html

Update: Architecture dependence and having to release shared memory after a crash are two possible reasons BDB defaults to mmap rather than shared memory.
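
To illustrate that cleanup burden: after a crash, someone has to explicitly remove the stale segment, either with ipcrm or with a few lines of C like this sketch (the key value is made up, matching nothing in particular):

    /* Remove a SysV shm segment left behind by a crashed process. */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        int id = shmget(42, 0, 0);               /* look up segment by key */
        if (id == -1) { perror("shmget"); return 1; }

        if (shmctl(id, IPC_RMID, NULL) == -1) {  /* mark it for removal */
            perror("shmctl");
            return 1;
        }
        printf("removed stale segment %d\n", id);
        return 0;
    }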


The BDB mmap cache actually worked fine for many years, on older OSes. They didn't keep accurate track of which memory pages were dirtied or when, so they were pretty lazy about flushing changes back to disk. This made an mmap'd file functionally equivalent to a SysV shared memory region. I remember when Linux changed in the 2.6 series, and suddenly using an mmap'd file got incredibly slow, because the kernel was proactively flushing dirtied pages back to disk. We started recommending SysV shm for all our deployments after that. You had to be careful in configuring it, since SysV shm wasn't swappable or pageable.


Incredibly insightful. Thank you!



