Memory Layouts for Binary Search – V2 (cglab.ca)
83 points by nkurz on Oct 19, 2015 | 10 comments


For those who want a bit more than "we trade bandwidth for latency" and a bit less than a 43-page paper (not to discourage anyone, it's very well-written):

In a binary-heap ("Eytzinger") layout, the next node to consider is at either index 2i + 1 or 2i + 2. Since those are adjacent, you can perfectly predict the cache line the next load will come from, and this remains true up to four iterations in advance: for 4-byte data with 64-byte cache lines (16 values each) and proper alignment, all sixteen possible descendants four levels down share a single line. Speculative execution will begin these loads early, whether or not the branch prediction turns out to be correct. A branch-free algorithm can do the same with explicit prefetch instructions, avoiding the pipeline stalls of misprediction. The downside: in each cache line, you only look at one value (1/16th of the data transferred).
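
A minimal sketch of what such a branch-free, prefetching Eytzinger search looks like (my own rendering of the idea, not the paper's exact code; it assumes 4-byte keys, 64-byte lines, and the alignment described above):

    #include <cstddef>
    #include <cstdint>

    // Branch-free descent over an array in Eytzinger (breadth-first heap) order,
    // prefetching the cache line that holds the descendants four levels down.
    std::size_t eytzinger_descend(const std::int32_t *a, std::size_t n, std::int32_t x) {
        std::size_t i = 0;
        while (i < n) {
            // With 0-based indexing, the 16 descendants four levels below i sit
            // at indices 16*i + 15 .. 16*i + 30; with the right alignment they
            // share one 64-byte line, so it can be requested right away. (This
            // can prefetch past the end of the array; see the note on broken
            // CPUs further down.)
            __builtin_prefetch(a + 16 * i + 15);
            i = (a[i] < x) ? 2 * i + 2 : 2 * i + 1;  // typically a cmov, no branch
        }
        // i now encodes the search path taken; recovering the index of the
        // successor of x from it takes a few extra bit operations (see the paper).
        return i;
    }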

You can attempt to pack the data for several iterations into a single cache line to increase utilization. However, because a complete binary tree has 2^n - 1 values (i.e., because it is odd), there are some small inefficiencies. For example, four iterations of a binary search require (1 + 2 + 4 + 8) = 15 possible values, of which you will look at four (1/4 of the data transferred, given the same 16-element cache line size). But either you include an extra element and thus sometimes need 5 comparisons instead of 4 (B-tree algorithm), or you mis-align the subtrees with cache line boundaries (van Emde Boas algorithm), or you waste space by only putting 15 values in a cache line (not tested). And you can't issue the load for the next cache line until the current subtree search completes.
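
For concreteness, here is a rough sketch of the "B-tree" option with 16 keys per node and 17-way branching (node numbering and names are my own choices, not the paper's):

    #include <cstddef>
    #include <cstdint>

    // Each node is 16 sorted keys filling one 64-byte line; nodes are numbered
    // breadth-first, and child c of node k is node k*17 + c + 1. Returns the
    // smallest key >= x, or INT32_MAX if there is none.
    constexpr std::size_t B = 16;  // keys per node = 64-byte line / 4-byte key

    std::int32_t btree16_search(const std::int32_t *keys, std::size_t nnodes,
                                std::int32_t x) {
        std::int32_t best = INT32_MAX;
        std::size_t k = 0;
        while (k < nnodes) {
            const std::int32_t *node = keys + k * B;
            std::size_t c = 0;
            while (c < B && node[c] < x) ++c;  // shown as a linear scan; a binary
                                               // search over 17 children needs up
                                               // to 5 comparisons, as noted above
            if (c < B) best = node[c];         // candidate successor
            k = k * (B + 1) + c + 1;           // descend into child c
            // The next line to load depends on c, so unlike the Eytzinger case
            // it can't be prefetched before this node's search completes.
        }
        return best;
    }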

Even though the latter approaches have up to four times better utilization of the data they load, as long as the bandwidth-delay product is at least 4 cache lines (i.e., you can issue 4 requests in parallel without slowing any of them down), the Eytzinger approach will win. They measured the bandwidth-delay product at 4.7 on their machine, and overall algorithm comparisons seem to be consistent across over 100 different CPUs.

If the data fits in L2 cache (2^16 values), a branchless prefetching algorithm on a simple sorted array is still slightly faster than using the Eytzinger ordering, just because of constant factors. The difference is small (15%ish on average, estimating from the plots).
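
For reference, a sketch of such a branchless sorted-array search with prefetching (the exact prefetch targets are my guess at the technique, not the paper's code):

    #include <cstddef>
    #include <cstdint>

    // Branch-free binary search over a plain sorted array, prefetching both
    // possible midpoints of the following iteration. Returns the index of the
    // first element >= x (or n if all are smaller). Assumes n >= 1.
    std::size_t branchfree_lower_bound(const std::int32_t *a, std::size_t n,
                                       std::int32_t x) {
        const std::int32_t *base = a;
        std::size_t len = n;
        while (len > 1) {
            std::size_t half = len / 2;
            // Whichever way the comparison goes, the next probe lands near one
            // of these two positions, so both lines can be requested up front.
            __builtin_prefetch(base + half / 2);
            __builtin_prefetch(base + half + half / 2);
            base = (base[half] < x) ? base + half : base;  // cmov, no branch
            len -= half;
        }
        return (base - a) + (*base < x);
    }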

Some broken CPUs (Phenom II, Celeron J1900, E3-1230) seem to cause large slowdowns when prefetching invalid addresses. Masking the address fixes the problem with no measurable performance impact on non-broken CPUs.
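
Presumably something along these lines (my guess at the shape of the fix, not the paper's exact code):

    #include <cstddef>
    #include <cstdint>

    // Round the array's allocation up to a power of two, then AND the
    // speculative index with (capacity - 1) so the prefetched address always
    // stays inside the allocation, even when the computed index has run past
    // the end of the logical array.
    static inline void prefetch_descendants(const std::int32_t *a, std::size_t i,
                                            std::size_t capacity_pow2) {
        std::size_t j = 16 * i + 15;  // may exceed n near the leaves
        __builtin_prefetch(a + (j & (capacity_pow2 - 1)));
    }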


The graphs are extremely interesting - there's a significant qualitative difference in what works best: in-cache, "least branches, least work" wins; out-of-cache, "least fetches" wins. Not at all surprising, of course, but a reminder that there's no point discussing these issues independent of the cache size, and no point optimizing without knowing the approximate cache size and memory footprint of your real data.


> Spoiler: It has to do with hiding latency by overusing bandwidth

Not a perfect win, but probably a good trade-off: improving hardware bandwidth has been much cheaper than improving hardware latency, and that will likely remain true for the foreseeable future.


See also "Optimal Hierarchical Layouts for Cache-Oblivious Search Trees", by Peter Lindstrom and Deepak Rajan: http://arxiv.org/abs/1307.5899.


The layout is definitely interesting, but the experiments in that paper are of questionable usefulness. The numbers are for simulated caches (cachegrind) and don't take TLBs into account. They also only consider complete binary trees, which is a worst case for binary search. I pointed out that issue to the authors: using ternary search cuts L1 and L2 misses by about 40-60% (vs. binary search), without any change in layout.
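
To make "ternary search" concrete, here it is sketched on a plain sorted array rather than the paper's layouts (my reading of the idea, not the code that was actually benchmarked): split the range at two pivots per step, so the search takes log3(n) levels instead of log2(n), at the cost of up to two comparisons per level.

    #include <cstddef>
    #include <cstdint>

    // Two-pivot lower_bound on a sorted array: each step discards two thirds
    // of the remaining range. Returns the first index with a[i] >= x, or n.
    std::size_t ternary_lower_bound(const std::int32_t *a, std::size_t n,
                                    std::int32_t x) {
        std::size_t lo = 0, hi = n;               // answer lies in [lo, hi]
        while (hi - lo > 2) {
            std::size_t third = (hi - lo) / 3;
            std::size_t m1 = lo + third, m2 = lo + 2 * third;
            if (a[m1] >= x)       hi = m1;
            else if (a[m2] >= x) { lo = m1 + 1; hi = m2; }
            else                  lo = m2 + 1;
        }
        while (lo < hi && a[lo] < x) ++lo;        // finish the last couple linearly
        return lo;
    }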

If I want to evaluate the Min-WEP layout for a real application, I have to run my own experiment to weigh the impact of reducing cache misses against the constant factors, knowing that even the cache-miss numbers are way off for the simplest layout (a sorted array). The performance information in the paper can't help me make even an educated guess.

That's why I really liked Pat Morin's approach of measuring real time on a variety of microarchitectures… which helped us realise that cache misses aren't that good a metric these days.


The OP asks for people to try the code, but the provided C++ code requires a fairly recent g++. I could have tried a couple of targets, but I don't want to waste my time porting it. What about this requires C++?


Convenience stuff in the test harness.


Does anyone know if there are any similar studies for hash tables (e.g. algorithm, bucket layouts, probe sequences)?


Not quite what you are asking for, but I think this is a similarly smart paper about Cuckoo Hashes that takes the right approach for modern processors, and includes some background on how other approaches fare: http://www.cs.columbia.edu/~orestis/sigmod15.pdf


Interesting results. Thanks, I'll have to take a deeper look through the paper.

Cuckoo hashing seems to be the standard, but I still haven't seen concrete numbers comparing different variations, or other schemes like hopscotch, particularly for latency and throughput. Hopscotch seems like it might give better latency. For simple stuff, I've found that a good hash and linear probing generally outperform other simple schemes up to roughly 50% load, but I honestly haven't had a need to dig further into other schemes so far.
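
A minimal sketch of that "good hash plus linear probing" baseline (my own illustration, not code from any of the papers): open addressing in a power-of-two table, probing consecutive slots so a lookup usually touches only one or two cache lines. Key 0 is reserved as the empty marker for brevity.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct LinearProbeSet {
        std::vector<std::uint64_t> slots;
        std::uint64_t mask;

        explicit LinearProbeSet(std::size_t pow2_capacity)
            : slots(pow2_capacity, 0), mask(pow2_capacity - 1) {}

        // A strong 64-bit mixer (MurmurHash3's finalizer); any good hash works.
        static std::uint64_t hash(std::uint64_t k) {
            k ^= k >> 33; k *= 0xff51afd7ed558ccdULL;
            k ^= k >> 33; k *= 0xc4ceb9fe1a85ec53ULL;
            return k ^ (k >> 33);
        }

        void insert(std::uint64_t key) {          // caller keeps load under ~50%
            std::uint64_t i = hash(key) & mask;
            while (slots[i] != 0 && slots[i] != key) i = (i + 1) & mask;
            slots[i] = key;
        }

        bool contains(std::uint64_t key) const {
            std::uint64_t i = hash(key) & mask;
            while (slots[i] != 0) {
                if (slots[i] == key) return true;
                i = (i + 1) & mask;
            }
            return false;
        }
    };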



