Fun footnote: SQLite only got on board with mmap after I demonstrated how slow their code was without it - i.e., I got a 22x speedup by replacing SQLite's btree code with LMDB: https://github.com/LMDB/sqlightning
A lot of potential treatments are too easily available and can't be patented. If a big pharma company can't make massive profit from it, they won't bother bringing it to market. Consider that a not-good reason.
Other treatments may eventually prove to have too many serious negative side effects. That's a good reason to abandon them.
> A lot of potential treatments are too easily available and can't be patented.
This isn’t really an obstacle, at least not as much as it’s made out to be.
There are numerous examples of drugs being brought to market at high prices despite having been generic compounds. Even old drugs can be brought back at $1000/month or more at different doses or delivery mechanisms.
One example: Doxepin is an old, extremely cheap antidepressant. It was recently re-certified for sleep at lower doses and reintroduced at a much higher price, despite being “off patent”.
This happens all the time. The drug companies aren’t actually abandoning usable treatments due to patent issues as much as journalists have claimed. If they couldn’t, for some reason, find a way to charge for a compound, they could still use it as a basis for finding an improved related compound with more targeted effects, better pharmacokinetics, etc.
They’re not just dropping promising treatments if there’s a market for them.
About Doxepin: like many seniors, I suffer from an extreme inability to stay asleep at night. I have trialed all the known prescription and non-prescription possibilities; only eszopiclone and baclofen seemed to show some promise. However, eszopiclone is DEA listed, requires higher and higher doses, and if I take it for more than about 2 weeks it has rather serious side effects on withdrawal - it's addictive, with serious anxiety when trying to wean oneself off it. Doxepin is prescribed as an antidepressant in large doses, and is one of the most potent H1 histamine antagonists known. The H1 system in our bodies promotes wakefulness; in very low doses, doxepin acts against H1 to promote sleep. To avoid the upcharges of low-dose doxepin, I am prescribed the high-dose version, which I have to break open, placing about 5 to 10 mg in an empty gelatin capsule (it's bitter). It really works well; however, you are fairly tired and useless the next day.
Why would a China or India care if it were a viable treatment? Unless a country wants to use their population as lab rats, it takes money and scientists to actually confirm a treatment is safe and effective.
This made me think of the "Institute for OneWorld Health". It came out as a non-profit pharmaceutical company in the mid-2000s. Victoria Hale was the founder; it earned her a MacArthur fellowship. It focuses (focused?) on global health and populations underserved by for-profit models. I think they successfully developed a treatment for leishmaniasis. It's an admirable model and should be pushed, but as usual it seems like the philanthropy money is limiting.
Wrong, LMDB fully supports multiprocess concurrency as well as DBs multiple orders of magnitude larger than RAM. Wherever you got your info from is dead wrong.
Among embedded key/value stores, only LMDB and BerkeleyDB support multiprocess access. RocksDB, LevelDB, etc. are all single process.
Also, even if LMDB supports databases larger than RAM, that doesn’t mean it’s a good idea to have a working set that exceeds that size. Unless you’re claiming it’s scan resistant?
It has a single writer transaction mutex, yes. But it's a process-shared mutex, so it will serialize write transactions across an arbitrary number of processes. And of course, read transactions are completely lockfree/waitfree across arbitrarily many processes.
As for working set size, that is always merely the height of the B+tree. Scans won't change that. It will always be far more efficient than any other DB under the same conditions.
> As for working set size, that is always merely the height of the B+tree.
This statement makes no sense to me. Are you using a different definition of "working set" than the rest of us? A working set size is application and access pattern dependent.
> It will always be far more efficient than any other DB under the same conditions
That depends on how broadly or narrowly one defines "same conditions" :-)
That’s a bold claim. Are you saying that LMDB outperforms every other database on the same hardware, regardless of access pattern? And if so, is there proof of this?
Since the first question of my two-part inquiry was not explicitly answered in the affirmative: to be absolutely clear, are you claiming, in writing, that LMDB outperforms every other database there is, regardless of access pattern, on the same hardware?
LMDB is optimized for read-heavy workloads. I make no particular claims about write-heavy workloads.
Because it's so efficient, it can retain more useful data in-memory than other DBs for a given RAM size. For DBs much larger than RAM it will get more useful work done with the available RAM than other DBs. You can examine the benchmark reports linked above, they provide not just the data but also the analysis of why the results are as they are.
> So much C library code is documented in ad-hoc ways - often through doxygen, which is a disaster. Eg here's the documentation for LMDB. LMDB is one of the most thoroughly documented C APIs I've seen, but I find this almost totally unusable. I often find myself reading the source instead. There's not even any links to the source from here:
Why do we need links to the source code? Doxygen is already embedded in the source, you should already be reading the source code on your local machine. It makes no sense to go searching across the web for information that's already stored on your local machine. Especially since you have no idea if the version you find on the web matches the version you're using locally.
Hah! Do you have a running search giving you alerts when lmdb is mentioned or something?
> How is doxygen a disaster?
I've just never read any good, user-friendly documentation produced by doxygen. Even when I used it myself, it always came out looking like a pig's breakfast.
Like, take the lmdb docs. And I'm sorry for picking on it. But it's a good example, because you've clearly put effort into using doxygen to document lmdb. I think the lmdb docs are about the best that doxygen-generated documentation gets.
But none of those concepts - environments, databases, transactions, cursors - are defined or explained. They're simply referenced without explanation in a big jumble of function names and descriptions, leaving me to figure out how they're supposed to work together. Maybe if those data types were defined up the top of the page? No. Doxygen tries to put some data structures at the top of the page - but for some reason the only documented types are MDB_val, MDB_stat and MDB_envinfo. All terrible places to start reading if you want to understand how to use lmdb.
Good documentation would lead with some front matter like:
- What does the library do
- How the library sees the world. Here, explain the above concepts and how they relate. (Eg an environment represents a set of files on disk. Each environment contains a numbered set of databases, and each database contains a set of records. You can read and write within a txn. You can use a cursor to iterate. ... Etc)
- Code examples of all of the above. Ideally a hello world, and more complex / specific examples showing each feature.
Doxygen does not help with any of that. From this documentation I don't know how to use lmdb to make a correct program. I can kinda guess how to use various features, but not what the features actually are or how to use them correctly.
For example, I can obliquely tell from reading the function descriptions that an environment can be opened multiple times at the same time. But I have no idea why I'd want to do that, or how, or what performance implications there are, or if there are any gotchas I need to be aware of if I do that. I see there's a bunch of functions for mdb_env_copy. Does that copy in memory or on disk? Does it do it atomically, like at a snapshot? Is it synchronous? Does it fsync? fdatasync? What errors can happen? The documentation isn't helpful.
So I'm not just banging on about rust, here's equivalent - but much better - reference documentation for prisma:
They explain the concepts and present examples for how to use all the features. When I read those docs, I come away knowing how to use the library to solve my problems. I don't have that experience reading the lmdb documentation.
Maybe it's possible to produce good documentation using doxygen. But I've never seen it done. Not even once.
-----
Side points:
> you should already be reading the source code on your local machine.
I'd rather not read the source code of all my dependencies to understand how to use them. Reading the source code of your dependencies should be a last resort. Eg, I don't go reading my compiler's source code if I want to understand my programming language. I don't want to read the source code of my web browser, or postgres, or linux, or any of this stuff.
> It makes no sense to go searching across the web for information that's already stored on your local machine. Especially since you have no idea if the version you find on the web matches the version you're using locally.
I hear you, but honestly I don't really care where documentation lives. Just so long as I can find it and read it. But with rust in particular, if you want local documentation you can run cargo doc to generate & open the documentation of your project and all your dependencies, which is nice.
And re: versions, rust docs hosted online also have a little 'version' field up the top showing which version of the library you're looking at the documentation of. Eg if you open https://docs.rs/rand/ I see "rand-0.9.2". If you change versions, the URL changes. It'd be nice if doxygen had that too.
That's such an obvious error in their benchmark code. In my benchmark code I make sure to touch the data so at least the 1st page is actually paged in from disk.
That was my plan, but I haven't yet--the fix was in my renamed code and I haven't put in the work to make it correspond to the original code. But here's the commit message, maybe you can see it easily:
Fix: Use correct stack index when adjusting cursors in mdb_cursor_del0
In `mdb_cursor_del0`, the second cursor adjustment loop, which runs after
`mdb_rebalance`, contained a latent bug that could lead to memory corruption or
crashes.
The Problem: The loop iterates through all active cursors to update their
positions after a deletion. The logic correctly checks if another cursor
(`cursor_to_update`) points to the same page as the deleting cursor (`cursor`)
at the same stack level: `if
(cursor_to_update->mc_page_stack[cursor->mc_stack_top_idx] == page_ptr)`
However, inside this block, when retrieving the `MDB_node` pointer to update a
sub-cursor, the code incorrectly used the other cursor's own stack top as the
index:
`PAGE_GET_NODE_PTR(cursor_to_update->mc_page_stack[cursor_to_update->mc_stack_top_idx], ...)`
If `cursor_to_update` had a deeper stack than `cursor` (e.g., it was a cursor
on a sub-database), `cursor_to_update->mc_stack_top_idx` would be greater than
`cursor->mc_stack_top_idx`. This caused the code to access a page pointer from
a completely different (and deeper) level of `cursor_to_update`'s stack than
the level that was just validated in the parent `if` condition. Accessing this
out-of-context page pointer could lead to memory corruption, segmentation
faults, or other unpredictable behavior.
The Solution: This commit corrects the inconsistency by using the deleting
cursor's stack index (`cursor->mc_stack_top_idx`) for all accesses to
`cursor_to_update`'s stacks within this logic block. This ensures that the node
pointer is retrieved from the same B-tree level that the surrounding code is
operating on, resolving the data corruption risk and making the logic
internally consistent.
And here's the function with renames (and the fix):
struct MDB_cursor {
    /** Next cursor on this DB in this txn */
    MDB_cursor *mc_next_cursor_ptr;
    /** Backup of the original cursor if this cursor is a shadow */
    MDB_cursor *mc_backup_ptr;
    /** Context used for databases with #MDB_DUPSORT, otherwise NULL */
    struct MDB_xcursor *mc_sub_cursor_ctx_ptr;
    /** The transaction that owns this cursor */
    MDB_txn *mc_txn_ptr;
    /** The database handle this cursor operates on */
    MDB_dbi mc_dbi;
    /** The database record for this cursor */
    MDB_db *mc_db_ptr;
    /** The database auxiliary record for this cursor */
    MDB_dbx *mc_dbx_ptr;
    /** The @ref mt_dbflag for this database */
    unsigned char *mc_dbi_flags_ptr;
    unsigned short mc_stack_depth;   /**< number of pushed pages */
    unsigned short mc_stack_top_idx; /**< index of top page, normally mc_stack_depth-1 */
/** @defgroup mdb_cursor Cursor Flags
 * @ingroup internal
 * Cursor state flags.
 * @{
 */
#define CURSOR_IS_INITIALIZED 0x01 /**< cursor has been initialized and is valid */
#define CURSOR_AT_EOF 0x02 /**< No more data */
#define CURSOR_IS_SUB_CURSOR 0x04 /**< Cursor is a sub-cursor */
#define CURSOR_JUST_DELETED 0x08 /**< last op was a cursor_del */
#define CURSOR_IS_IN_WRITE_TXN_TRACKING_LIST 0x40 /**< Un-track cursor when closing */
#define CURSOR_IN_WRITE_MAP_TXN TXN_WRITE_MAP /**< Copy of txn flag */
/** Read-only cursor into the txn's original snapshot in the map.
 * Set for read-only txns, and in #mdb_page_alloc() for #FREE_DBI when
 * #MDB_DEVEL & 2. Only implements code which is necessary for this.
 */
#define CURSOR_IS_READ_ONLY_SNAPSHOT TXN_READ_ONLY
/** @} */
    unsigned int mc_flags; /**< @ref mdb_cursor */
    MDB_page *mc_page_stack[CURSOR_STACK]; /**< stack of pushed pages */
    indx_t mc_index_stack[CURSOR_STACK];   /**< stack of page indices */
#ifdef MDB_VL32
    MDB_page *mc_vl32_overflow_page_ptr; /**< a referenced overflow page */
# define CURSOR_OVERFLOW_PAGE_PTR(cursor) ((cursor)->mc_vl32_overflow_page_ptr)
# define CURSOR_SET_OVERFLOW_PAGE_PTR(cursor, page_ptr) ((cursor)->mc_vl32_overflow_page_ptr = (page_ptr))
#else
# define CURSOR_OVERFLOW_PAGE_PTR(cursor) ((MDB_page *)0)
# define CURSOR_SET_OVERFLOW_PAGE_PTR(cursor, page_ptr) ((void)0)
#endif
};
/** @brief Complete a delete operation by removing a node and rebalancing.
*
* This function is called after the preliminary checks in _mdb_cursor_del().
* It performs the physical node deletion, decrements the entry count, adjusts
* all other cursors affected by the deletion, and then calls mdb_rebalance()
* to ensure B-tree invariants are maintained.
*
* @param[in,out] cursor The cursor positioned at the item to delete.
* @return 0 on success, or a non-zero error code on failure.
*/
static int
mdb_cursor_del0(MDB_cursor *cursor)
{
    int rc;
    MDB_page *page_ptr;
    indx_t node_idx_to_delete;
    unsigned int num_keys_after_delete;
    MDB_cursor *cursor_iter, *cursor_to_update;
    MDB_dbi dbi = cursor->mc_dbi;

    node_idx_to_delete = cursor->mc_index_stack[cursor->mc_stack_top_idx];
    page_ptr = cursor->mc_page_stack[cursor->mc_stack_top_idx];

    // 1. Physically delete the node from the page.
    mdb_node_del(cursor, cursor->mc_db_ptr->md_leaf2_key_size);
    cursor->mc_db_ptr->md_entry_count--;

    // 2. Adjust other cursors pointing to the same page.
    for (cursor_iter = cursor->mc_txn_ptr->mt_cursors_array_ptr[dbi]; cursor_iter; cursor_iter = cursor_iter->mc_next_cursor_ptr) {
        cursor_to_update = (cursor->mc_flags & CURSOR_IS_SUB_CURSOR) ? &cursor_iter->mc_sub_cursor_ctx_ptr->mx_cursor : cursor_iter;
        if (!(cursor_iter->mc_flags & cursor_to_update->mc_flags & CURSOR_IS_INITIALIZED)) continue;
        if (cursor_to_update == cursor || cursor_to_update->mc_stack_depth < cursor->mc_stack_depth) continue;
        if (cursor_to_update->mc_page_stack[cursor->mc_stack_top_idx] == page_ptr) {
            if (cursor_to_update->mc_index_stack[cursor->mc_stack_top_idx] == node_idx_to_delete) {
                // This cursor pointed to the exact node we deleted.
                cursor_to_update->mc_flags |= CURSOR_JUST_DELETED;
                if (cursor->mc_db_ptr->md_flags & MDB_DUPSORT) {
                    cursor_to_update->mc_sub_cursor_ctx_ptr->mx_cursor.mc_flags &= ~(CURSOR_IS_INITIALIZED | CURSOR_AT_EOF);
                }
                continue;
            } else if (cursor_to_update->mc_index_stack[cursor->mc_stack_top_idx] > node_idx_to_delete) {
                // This cursor pointed after the deleted node; shift its index down.
                cursor_to_update->mc_index_stack[cursor->mc_stack_top_idx]--;
            }
            XCURSOR_REFRESH(cursor_to_update, cursor->mc_stack_top_idx, page_ptr);
        }
    }

    // 3. Rebalance the tree, which may merge or borrow from sibling pages.
    rc = mdb_rebalance(cursor);
    if (rc) goto fail;

    if (!cursor->mc_stack_depth) { // Tree is now empty.
        cursor->mc_flags |= CURSOR_AT_EOF;
        return rc;
    }

    // 4. Perform a second cursor adjustment pass. This is needed because rebalancing
    // (specifically page merges) can further change cursor positions.
    page_ptr = cursor->mc_page_stack[cursor->mc_stack_top_idx];
    num_keys_after_delete = NUMKEYS(page_ptr);

    for (cursor_iter = cursor->mc_txn_ptr->mt_cursors_array_ptr[dbi]; !rc && cursor_iter; cursor_iter = cursor_iter->mc_next_cursor_ptr) {
        cursor_to_update = (cursor->mc_flags & CURSOR_IS_SUB_CURSOR) ? &cursor_iter->mc_sub_cursor_ctx_ptr->mx_cursor : cursor_iter;
        if (!(cursor_iter->mc_flags & cursor_to_update->mc_flags & CURSOR_IS_INITIALIZED)) continue;
        if (cursor_to_update->mc_stack_depth < cursor->mc_stack_depth) continue;
        if (cursor_to_update->mc_page_stack[cursor->mc_stack_top_idx] == page_ptr) {
            if (cursor_to_update->mc_index_stack[cursor->mc_stack_top_idx] >= cursor->mc_index_stack[cursor->mc_stack_top_idx]) {
                // If cursor is now positioned past the end of the page, move it to the next sibling.
                if (cursor_to_update->mc_index_stack[cursor->mc_stack_top_idx] >= num_keys_after_delete) {
                    rc = mdb_cursor_sibling(cursor_to_update, 1);
                    if (rc == MDB_NOTFOUND) {
                        cursor_to_update->mc_flags |= CURSOR_AT_EOF;
                        rc = MDB_SUCCESS;
                        continue;
                    }
                    if (rc) goto fail;
                }
                if (cursor_to_update->mc_sub_cursor_ctx_ptr && !(cursor_to_update->mc_flags & CURSOR_AT_EOF)) {
                    // BUG FIX: Use the main cursor's stack index to access the other cursor's stacks.
                    // This ensures we are retrieving the node from the same B-tree level
                    // that the parent `if` condition already checked. The previous code used
                    // `cursor_to_update->mc_stack_top_idx`, which could be incorrect if its
                    // stack was deeper than the main cursor's.
                    MDB_node *node_ptr = PAGE_GET_NODE_PTR(cursor_to_update->mc_page_stack[cursor->mc_stack_top_idx], cursor_to_update->mc_index_stack[cursor->mc_stack_top_idx]);
                    if (node_ptr->mn_flags & NODE_DUPLICATE_DATA) {
                        if (cursor_to_update->mc_sub_cursor_ctx_ptr->mx_cursor.mc_flags & CURSOR_IS_INITIALIZED) {
                            if (!(node_ptr->mn_flags & NODE_SUB_DATABASE))
                                cursor_to_update->mc_sub_cursor_ctx_ptr->mx_cursor.mc_page_stack[0] = NODE_GET_DATA_PTR(node_ptr);
                        } else {
                            mdb_xcursor_init1(cursor_to_update, node_ptr);
                            rc = mdb_cursor_first(&cursor_to_update->mc_sub_cursor_ctx_ptr->mx_cursor, NULL, NULL);
                            if (rc) goto fail;
                        }
                    }
                    cursor_to_update->mc_sub_cursor_ctx_ptr->mx_cursor.mc_flags |= CURSOR_JUST_DELETED;
                }
            }
        }
    }
    cursor->mc_flags |= CURSOR_JUST_DELETED;

fail:
    if (rc)
        cursor->mc_txn_ptr->mt_flags |= TXN_HAS_ERROR;
    return rc;
}
I have confirmed the fixed logic seems correct, but I haven't written a test for it (I moved on to another project immediately after and haven't returned to this one). That said, I'm almost certain I have run into this bug in production on a large (1TB) database that used DUPSORT heavily. It's kinda hard to trigger.
Also, thanks for a great library! LMDB is fantastic.
Thanks for that. It doesn't look like there is an open bug report for this yet.
I understand your commit description but I'm still a bit puzzled by it. cursor_to_update shouldn't have a deeper stack than cursor, even if it was a cursor on a subdatabase. All the references to the subdatabase are in the subcursor, and that has no bearing on the main cursor's stack.
The original code with the bug was added for ITS#8406, commit 37081325f7356587c5e6ce4c1f36c3b303fa718c on 2016-04-18. Definitely a rare occurrence since it's uncommon to write from multiple cursors on the same DB. Fixed now in ITS#10396.
LMDB is for read-heavy workloads. The opposite of RocksDB.
RocksDB can use thousands of file descriptors at once on larger DBs, which makes it unsuitable for servers that may also need to manage thousands of client connections at once.
LMDB uses 2 file descriptors at most; just 1 if you don't use its lock management, or if you're serving static data from a readonly filesystem.
RocksDB requires extensive configuration to tune properly. LMDB doesn't require any tuning.
> LMDB has MDB_PANIC, documented as "Update of meta page failed or environment had fatal error".
Yes. That doesn't mean there was anything bad in the program logic. It most likely means your storage device had a fatal I/O error - i.e., something physically wrong with your system, not a bug in any code.