This is a great post and so spot on. At some point in my career, my 'review prep' (the time I spent working on my own evaluation of my year at a company) became answering the question, "Do I still want to work here?" I organize my 'review' into four sections, each rated at one of five levels: needs improvement, sometimes meets expectations, meets expectations, sometimes exceeds expectations, or consistently exceeds expectations.
I start by reviewing how I'm being managed. I expect someone managing me to be clear in their expectations of my work product, to provide resources when I have identified what I need to complete a job, to clearly articulate the problem I am expected to be solving, and to clearly articulate the criteria by which the solution will be evaluated.
Second, I review my co-workers along three axes: can I trust what they say to be accurate and honest, can I count on them to meet their commitments, and are they willing to teach me when I don't understand something and, conversely, to learn when there is something they do not know?
Third, I review the level of support I get to do my job. Am I provided with a workspace where I can get work done? Do I have the equipment I need to do what is being asked? Is my commute conducive to the hours required? And finally, and most important, does this job allow me to balance work obligations and non-work obligations?
Fourth, I review whether the company's mission, ethics, and culture are still ones I wish to be a part of. Am I proud of the company's mission? Do I believe that leadership will make ethical calls even if doing so means a smaller profit margin? Can I relate to, and am I compatible with, the values my co-workers espouse and the actions they take? (This is the "company culture" theme: is it still a company that fits me culturally?)
A company that receives lower than a 3.0 rating goes on a 90-day "company improvement plan" (CIP). I bring the issues to the leaders who are in a position to address the situations I've found wanting and try to secure their commitment to change. If after 90 days they haven't been able to improve (if they choose not to, they're done right away), then I "fire" the company and work to process my exit as expeditiously as possible.
For those who want a bit more than "we trade bandwidth for latency" and a bit less than a 43-page paper (not to discourage anyone, it's very well-written):
In a binary-heap ("Eytzinger") layout, the next node to consider is at either index 2i + 1 or 2i + 2; since those are adjacent, you can perfectly predict the next cache line to load. This stays true up to four iterations in advance: with 4-byte keys and 64-byte cache lines (16 values each), all 16 possible descendants four levels down fit in a single properly aligned line. Speculative execution will begin these loads early, regardless of whether the branch prediction is ultimately successful, and a branch-free algorithm can do the same with explicit prefetch instructions to avoid the pipeline stalls of mis-prediction. The catch: in each cache line, you only ever look at one value (1/16th of the data transferred).
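To make that concrete, here's a sketch of the branch-free Eytzinger descent with prefetching (my own illustrative C++, not the paper's actual code; it assumes a 1-indexed layout where node k's children sit at 2k and 2k+1, and an array padded so the prefetch address is always valid):

```cpp
#include <cassert>
#include <cstddef>

// Fill t[1..n] (Eytzinger order) from a sorted array via in-order
// traversal of the implicit tree. Returns the next source index consumed.
static std::size_t build(const int *sorted, int *t, std::size_t i,
                         std::size_t k, std::size_t n) {
    if (k <= n) {
        i = build(sorted, t, i, 2 * k, n);
        t[k] = sorted[i++];
        i = build(sorted, t, i, 2 * k + 1, n);
    }
    return i;
}

// Branch-free lower bound: index of the smallest element >= x, 0 if none.
// Assumes t is allocated with enough padding that &t[16 * k] is a valid
// address for every node k <= n.
static std::size_t eytzinger_lower_bound(const int *t, std::size_t n, int x) {
    std::size_t k = 1;
    while (k <= n) {
        // 16k is the first of node k's 16 great-great-grandchildren; for
        // 4-byte keys they fill exactly one 64-byte line, so this starts
        // the load needed four iterations from now.
        __builtin_prefetch(&t[16 * k]);
        k = 2 * k + (t[k] < x);  // go right iff t[k] < x, no branch
    }
    // k now encodes the descent path in its bits; shifting off the trailing
    // 1-bits (and the 0 above them) recovers the last node where the search
    // went left, which holds the answer.
    k >>= __builtin_ffsll(~k);
    return k;
}
```

The in-order `build` is the standard way to number a sorted array into this layout; the bit trick at the end replaces the usual "remember the last candidate" bookkeeping.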
You can attempt to pack the data for several iterations into a single cache line to increase utilization. However, because a complete binary tree has 2^n - 1 values (i.e., because it is odd), there are some small inefficiencies. For example, four iterations of a binary search require (1 + 2 + 4 + 8) = 15 possible values, of which you will look at four (1/4 of the data transferred, given the same 16-element cache line size). But either you include an extra element and thus sometimes need 5 comparisons instead of 4 (B-tree algorithm), or you mis-align the subtrees with cache line boundaries (van Emde Boas algorithm), or you waste space by only putting 15 values in a cache line (not tested). And you can't issue the load for the next cache line until the current subtree search completes.
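For the B-tree flavor, the search within one cache-line-sized node can itself be made branch-free. A hypothetical sketch with B = 16 keys per node (names mine, not the paper's):

```cpp
// Count how many of the node's 16 keys are < x. The count is both the
// rank of x within the node and the index of the child to descend into.
// The loop has no data-dependent branches and is easy for a compiler to
// vectorize, but the next node's load can't start until it finishes.
static int node_rank(const int *node, int x) {
    int r = 0;
    for (int i = 0; i < 16; ++i)
        r += (node[i] < x);
    return r;  // 0..16
}
```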
Even though the latter approaches have up to four times better utilization of the data they load, as long as the bandwidth-delay product is at least 4 cache lines (i.e., you can issue 4 requests in parallel without slowing any of them down), the Eytzinger approach will win. They measured the bandwidth-delay product at 4.7 on their machine, and overall algorithm comparisons seem to be consistent across over 100 different CPUs.
If the data fits in L2 cache (2^16 values), a branchless prefetching algorithm on a simple sorted array is still slightly faster than using the Eytzinger ordering, just because of constant factors. The difference is small (15%ish on average, estimating from the plots).
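That branchless search over a plain sorted array looks roughly like this (again a sketch under my own naming, not the paper's exact code; assumes n >= 1):

```cpp
#include <cstddef>

// Branchless lower bound on a sorted array: returns the index of the
// first element >= x (n if none). The comparison feeds a conditional
// add instead of a branch, and both candidate probe positions for the
// next iteration are prefetched so a useful load is always in flight.
static std::size_t branchless_lower_bound(const int *a, std::size_t n, int x) {
    const int *base = a;
    std::size_t len = n;
    while (len > 1) {
        std::size_t half = len / 2;
        __builtin_prefetch(&base[half / 2]);         // next probe if we keep the left half
        __builtin_prefetch(&base[half + half / 2]);  // next probe if we keep the right half
        base += (base[half - 1] < x) * half;         // typically compiles to a cmov/add
        len -= half;
    }
    return (base - a) + (*base < x);
}
```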
Some broken CPUs (Phenom II, Celeron J1900, E3-1230) seem to cause large slowdowns when prefetching invalid addresses. Masking the address fixes the problem with no measurable performance impact on non-broken CPUs.
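The masking fix is tiny. Something like this (illustrative only; it assumes the array was allocated with a power-of-two capacity of mask + 1):

```cpp
#include <cstddef>

// Clamp the prefetch index with a power-of-two mask so speculative
// prefetches can never reach an unmapped address, even when the computed
// index (e.g. 16 * k in the Eytzinger descent) runs past the end.
static std::size_t safe_prefetch(const int *a, std::size_t i, std::size_t mask) {
    std::size_t j = i & mask;      // j <= mask, so always within the allocation
    __builtin_prefetch(&a[j]);
    return j;                      // returned only so the index math is checkable
}
```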