I'd like to see a benchmark showing a B-tree beating an array with the correct layout. I think you only want a tree if you also need to update the data.
You can implement your B-tree as an array if that's what you like. The advantage of B-tree-like layouts is that you need fewer indirections: with binary search you do one node jump per comparison, while with a B-tree you do one jump per N comparisons, where N is the number of keys in a node.
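For illustration, a minimal sketch of what that search can look like, assuming the keys have already been permuted into implicit B-tree order and padded with INT_MAX (the layout and names here are my own shorthand, not anyone's library):

```cpp
#include <limits>
#include <vector>

// Search over an implicit static B-tree: nodes of B keys are stored
// contiguously in one flat array, and node k's i-th child is node
// k * (B + 1) + i + 1, so no child pointers are needed at all.
constexpr int B = 16; // keys per node; tune to the cache-line size

int lower_bound_implicit(const std::vector<int>& keys, int num_nodes, int x) {
    int res = std::numeric_limits<int>::max(); // padding value = "not found"
    int k = 0;
    while (k < num_nodes) {
        int i = 0;
        while (i < B && keys[k * B + i] < x) // B comparisons, no jumps
            i++;
        if (i < B)
            res = keys[k * B + i]; // smallest key >= x seen so far
        k = k * (B + 1) + i + 1;   // a single jump down to the child
    }
    return res;
}
```

With B = 16 that is one jump per 16 comparisons, and a whole node fits in a single 64-byte cache line.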
Of course, the proof is in trying it out; intuition does not always capture hardware behavior. I remember looking at this many years ago, and B-trees were considerably faster, though the code was far from optimal. Maybe if one pulls all the tricks in the book, the additional complexity of managing multiple comparisons will outweigh the savings from reducing loads.
For a perfect drop-in replacement for std::lower_bound, the best you can do without breaking anything is to make the search branchless and maybe add some prefetching [1]. Some compilers actually try to implement std::lower_bound this way, but these things can break from version to version because there are no semantics to reliably make the compiler use predication instead of branching.
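For reference, the branchless-plus-prefetching variant from [1] looks roughly like this (a sketch; `__builtin_prefetch` is GCC/Clang-specific, and the prefetch targets are approximate hints at where the next midpoint can land):

```cpp
#include <cstddef>

// Binary search where the branch is replaced by arithmetic the compiler
// can turn into a conditional move, plus prefetches of both places the
// next midpoint can be, so the two possible loads overlap in flight.
const int* lower_bound_branchless(const int* base, size_t n, int x) {
    size_t len = n;              // assumes n >= 1
    while (len > 1) {
        size_t half = len / 2;
        __builtin_prefetch(&base[half / 2]);        // if we stay left
        __builtin_prefetch(&base[half + half / 2]); // if we go right
        base += (base[half - 1] < x) * half; // predication, not a branch
        len -= half;
    }
    return base + (*base < x); // one past the end if x exceeds all keys
}
```

Whether that multiply-by-bool actually compiles to a cmov is exactly the fragile, version-dependent part mentioned above.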
Your intuitions are very much correct: you can get rid of pointers in a B-tree and make it static, implicit, and fast, especially if you use SIMD to search for the lower bound within a node [2], but it would technically not be a replacement for std::lower_bound, as we need to build an additional structure (even though it's very hard to imagine a scenario where you obtain a sorted array but can't afford to spend linear time on preprocessing). C++23 has since added std::flat_set, which seems to be an appropriate place to implement it (in the article I compared against std::lower_bound because neither I nor the vast majority of readers knew what std::flat_set was).
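To give a flavor of the within-node SIMD step from [2], here is an AVX2 sketch (the 16-key node size and the function name are my own choices, and the x - 1 trick misbehaves at INT_MIN):

```cpp
#include <immintrin.h>
#include <cstdint>

// Rank of x within one 16-key node: compare x against all 16 keys at
// once and take the index of the first key >= x from the bitmask.
int node_rank(const int* node_keys, int x) {
    __m256i pivot = _mm256_set1_epi32(x - 1); // key >= x  <=>  key > x - 1
    __m256i lo = _mm256_loadu_si256((const __m256i*) &node_keys[0]);
    __m256i hi = _mm256_loadu_si256((const __m256i*) &node_keys[8]);
    uint32_t mask =
        (uint32_t) _mm256_movemask_ps(_mm256_castsi256_ps(
            _mm256_cmpgt_epi32(lo, pivot))) |
        ((uint32_t) _mm256_movemask_ps(_mm256_castsi256_ps(
            _mm256_cmpgt_epi32(hi, pivot))) << 8);
    return __builtin_ctz(mask | (1u << 16)); // 16 if every key is < x
}
```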
You can also add pointers back to support insertion and deletion with a moderate penalty to performance [3], but this dynamic B-tree is also technically not a replacement for std::set because of the extra pointer invalidations when a node merges or splits (even though in most cases you don't need pointer stability). You can fix it by, e.g., storing a separate pair of pointers per key so that each iterator knows where its key is in the tree and vice versa. That would add some overhead (especially in terms of memory) but make it compliant with the standard and still quite a bit faster and lighter than std::set.
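For concreteness, that fix could look something like this (a purely hypothetical sketch; none of these names come from the articles):

```cpp
#include <cstdint>
#include <vector>

// Iterators point into a handle table rather than into nodes, and each
// key slot stores its handle index back, so a split or merge can
// repoint exactly the handles of the keys it moves.
struct Node; // a B-tree node; each key slot i also stores its handle index

struct Handle { Node* node; int slot; };   // current location of one key

struct StableIterator {
    std::vector<Handle>* handles;
    uint32_t h;                            // index into the handle table
    // Dereferencing goes through the table, so the iterator survives
    // node splits and merges as long as the key itself is alive.
};

// Called for every key that a split or merge moves to (dst, dst_slot):
inline void on_key_moved(std::vector<Handle>& handles,
                         uint32_t h, Node* dst, int dst_slot) {
    handles[h] = {dst, dst_slot};          // O(1) per moved key
}
```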
The three articles combined are some 50 pages long, so for a tl;dr version you might be interested in a talk I gave at CppCon [4]. You can also extend the trick to heaps, ropes, segment trees, and other tree-like structures. There is a lot of work to be done here.
First, what a great resource you've put together! You're presenting a lot of useful material clearly and concisely, together with the reasoning behind it. I wish I had this when I started doing performance work.
> For a perfect drop-in replacement for std::lower_bound, the best you can do without breaking anything is to make the search branchless and maybe add some prefetching [1]. Some compilers actually try to implement std::lower_bound this way, but these things can break from version to version because there are no semantics to reliably make the compiler use predication instead of branching.
What's imperfect about a radix-4 (or higher) search instead of a binary search for replacing std::lower_bound? For a radix-k search, you'll have k-1 fetches in flight before you decide which subarray to recurse on, and your dep chain will be log_2(k) times shorter.
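Concretely, I'm imagining a step like this (an untested sketch with a hypothetical name, just to pin down the idea):

```cpp
#include <cstddef>

// One radix-4 step: issue all three pivot loads before any result is
// needed, then shrink to one of the four quarters. The dependent chain
// is half the length of binary search's for the same array size.
const int* lower_bound_radix4(const int* base, size_t n, int x) {
    while (n > 3) {
        size_t q = n / 4;
        // Three independent fetches, all in flight at once:
        int p1 = base[q - 1], p2 = base[2 * q - 1], p3 = base[3 * q - 1];
        size_t s = (size_t)(p1 < x) + (p2 < x) + (p3 < x); // pivots below x
        base += s * q;                 // jump to the chosen quarter
        n = (s < 3) ? q : n - 3 * q;   // last quarter may be larger
    }
    while (n > 0 && *base < x) { base++; n--; } // finish the small tail
    return base; // first element >= x, or one past the end
}
```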
When you add prefetching (that is, compare against the middle element and fetch both the middle of the left half and the middle of the right half ahead of time), you are essentially doing a radix-4 search, just with fewer actual comparisons.
(You can prefetch k layers ahead for radix-2^k search, but searching in small arrays will become more expensive this way.)
I didn't benchmark it, but I'd guess that on mid-to-large arrays it would actually be somewhat slower than prefetching, because it is more instruction-heavy and, more importantly, because prefetches are "cancelable": if the memory bus is too busy with actual requests, they are simply skipped, whereas in an explicit radix-k search you have to wait for all k-1 elements, even if the middle element happened to be cached already and you already know which elements you need next.
That said, it could probably work on small arrays, where caching is not a concern, especially if you optimize for latency rather than throughput. You can also try doing the comparisons with SIMD (Wojciech Mula tried something similar and got a small boost: http://0x80.pl/articles/simd-search.html).
Thanks, this is great! I think this conversation is a great illustration for the idea that data structures can be seen as more fundamental than algorithms.