
I talk a little about the history of this approach in my early LMDB design papers. The mmap approach was actually quite popular up until the mid-to-late 1980s. Until then, 32-bit address spaces were still orders of magnitude larger than the physical memory of common machines. But when RAM sizes started approaching 1GB and database sizes grew beyond 4GB, the notion of mmap'ing an entire DB into a process became impractical.

It took the introduction of SPARC64 in 1995 to make it really worth considering again, and the Athlon 64 in 2003 to make it feasible for mainstream computing.

Die-hard DB authors will tell you that having a user-space page cache buys you control: the DB knows more about the workload than the kernel does, and so can make better decisions about which pages to evict and which to keep.

Of course, with greater power comes greater responsibility. Managing a cache, and eviction/replacement strategies in particular, is one of the most difficult problems in computing. Maintaining all of that knowledge about the workload can be much more compute-intensive than actually executing the original workload; indeed, LMDB demonstrates that it's always more costly. Kernel code can program the MMU and handle page faults much more efficiently than any user-level code can, and it has global visibility into all of the hardware resources in a system. It's already the kernel's job to balance memory demands among the various processes in a system. A user-level app knows only about its own consumption; it might be able to obtain information about other workloads on the machine, but only at great cost.

In today's age of virtual-machine (over)use, app-level memory management is an even worse proposition. (In the OpenLDAP Project we had an effort, led by an IBM team, looking into a self-tuning cache for OpenLDAP. They particularly wanted it to "play nice" and give up cache memory under memory pressure from other processes in the system, and even from the host system when running inside a VM. That effort was IMO futile, and the work never reached fruition. But the mmap approach achieves all of their goals, without any of the massive amount of load-monitoring and timing code they needed.)

On modern-day hardware with MMUs, there's no good reason to have user-space cache management.
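
For concreteness, here's a minimal sketch (mine, not LMDB's actual code) of what the mmap approach looks like in practice; the file name is made up, and error handling is trimmed:

    /* Map the whole DB file read-only and let the kernel's page
     * cache handle all eviction and readahead. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.mdb", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        /* One mmap covers the whole DB; the kernel pages data in on
         * demand and evicts under memory pressure, with the global
         * visibility that no user-space cache has. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* Reads are plain pointer dereferences: no read() syscalls,
         * no user-space buffer copies. */
        printf("first byte: %d\n", ((unsigned char *)map)[0]);

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }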



With Berkeley DB it's even worse, since, by default, BDB uses a memory-mapped file as its cache. So the cache gets effectively double-managed, first by BDB, then by the OS itself. This wreaks havoc on performance.

The way around this design flaw is to either tell BDB to use SysV shared memory (which really should be the default), or to locate the cache on a ramdisk (e.g. tmpfs on Linux).
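
For example, here's a rough sketch of opening a BDB environment with its cache in SysV shared memory; DB_SYSTEM_MEM and set_shm_key() are real BDB APIs, but the home directory, key, and cache size are made-up values, and error handling is trimmed:

    #include <db.h>
    #include <stdio.h>

    int main(void)
    {
        DB_ENV *env;
        if (db_env_create(&env, 0) != 0)
            return 1;

        /* Base SysV IPC key; arbitrary, but must be set before open()
         * when DB_SYSTEM_MEM is used, and unique per environment. */
        env->set_shm_key(env, 42);

        /* 64MB cache in one contiguous region (illustrative sizing). */
        env->set_cachesize(env, 0, 64 * 1024 * 1024, 1);

        /* DB_SYSTEM_MEM puts the env regions in SysV shared memory
         * instead of the default memory-mapped files. */
        int ret = env->open(env, "/var/db/home",
                            DB_CREATE | DB_INIT_MPOOL | DB_SYSTEM_MEM, 0);
        if (ret != 0) {
            fprintf(stderr, "env open: %s\n", db_strerror(ret));
            return 1;
        }
        env->close(env, 0);
        return 0;
    }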


I'm surprised it's not the default. I agree it should be. I wonder why it's not... there must be some good reason? This page confirms that what you're saying is correct: http://pybsddb.sourceforge.net/ref/env/region.html

Update: Architecture dependence and having to release shared memory after a crash are two possible reasons BDB defaults to mmap rather than shared memory.
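
To illustrate that cleanup burden: after a crash, someone has to explicitly remove the stale segment, either with ipcrm or with a few lines of C like this sketch (the key value is made up, matching nothing in particular):

    /* Remove a SysV shm segment left behind by a crashed process. */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        int id = shmget(42, 0, 0);               /* look up segment by key */
        if (id == -1) { perror("shmget"); return 1; }

        if (shmctl(id, IPC_RMID, NULL) == -1) {  /* mark it for removal */
            perror("shmctl");
            return 1;
        }
        printf("removed stale segment %d\n", id);
        return 0;
    }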


The BDB mmap cache actually worked fine for many years, on older OSes. They didn't keep accurate track of which memory pages were dirtied or when, so they were pretty lazy about flushing changes back to disk. This made an mmap'd file functionally equivalent to a SysV shared memory region. I remember when Linux changed in the 2.6 series, and suddenly using an mmap'd file got incredibly slow, because the kernel was proactively flushing dirtied pages back to disk. We started recommending SysV shm for all our deployments after that. You had to be careful in configuring it, since SysV shm wasn't swappable or pageable.


Incredibly insightful. Thank you!



