Saturday, September 12, 2015

Trying to speed up Binary Search

It is well known that binary search is not particular fast. For point queries hash tables are much faster, ideally accessing in O(1),  And even when we need range queries n-ary search structures like B-Trees are much faster than binary search or binary search trees.

Still, there is a certain charm to binary search. First, it is easy to implement, And second, it needs no extra memory but can directly operate on sorted data. And sometimes the data is sorted anyway, so we get the binary search more or less for free. And for something we get for free O(log n) lookup time is not that bad. The only question is, how can we implement it efficiently.

Of course there is the text-book way of implementing binary search, as explained for example in Wikipedia. The resulting code looks like this:

``````
while (lower!=upper) {
unsigned middle=lower+((upper-lower)/2);
unsigned v=data[middle];
if (v==needle) {
return middle;
} else if (v<needle) {
lower=middle+1;
} else {
upper=middle;
}
}
return notFound;
``````

Not particular difficult, but how fast it is? We ran experiments where we performed 1,000,000 random lookups for 32bit integers in data sets of various sizes and measured the performance on a Broadwell i7-5500U (all experiments using -O3):

 set size 103 105 107 109 classic 65 ms 119 ms 362 ms 702 ms

When we look carefully we notice that the runtime grows a bit faster than the asymptotic complexity of O(log n) would suggest. Or, in other words, the constants for searching in the 4GB data set seem to be about a factor 4 higher than when searching in the 4KB data set. That is not surprising, as the 4KB of the smallest set easily fit into the CPU cache, while the 4GB of the largest set are clearly beyond cache size. What is surprising is that it is only a factor of 4!

The performance of binary search is largely affected by two effects: First, there are cache misses, as every lookup accesses O(log n) elements, which are often not in cache for the large data sets. And second, there are branch mispredictions, as the comparison v<needle has about a 50% chance of being true, which leads to inefficient execution. For small data sets the branch misses dominate, while for the large data sets the branch misses are somewhat hidden by the memory latency. Therefore a slowdown of only 4.

Now the question is, can we do better? Usually people try to speed up binary search by reducing branch misses. There is an interesting talk about binary search, less wrong that described how to implement binary search without the hard to predict branch. Somewhat ironic the example code in that talk is wrong, contrary to its title, but the idea is valid. The corrected code looks like this:

``````
while (auto_t half=n/2) {
auto middle=lower+half;
lower=((*middle)<=needle)?middle:lower;
n-=half;
}
return ((*lower)==needle)?lower:notFound;
``````

The code is not only quite short, the loop body is compiled by gcc into completely branch-free assembler code by using a conditional move. The resulting performance is `shown below`:

 set size 103 105 107 109 branch free 23 ms 53 ms 320 ms 853 ms

This is a somewhat surprising result. For small data sets the branch-free version is indeed faster, but for the largest data set the search is significantly slower. And that is after manual tuning of the generated code, out of the box it needed 920 ms for the largest set.

The speed-up for the small data sizes is easy to explain (no branch mis-predictions), but where does the slow-down for the largest data set come from? We get a clue by compiling the same code with clang, which is uses a branch instead of a conditional move inside the loop. With that we get

 set size 103 105 107 109 compile with clang 67 ms 115 ms 359 ms 694 ms

which is virtual identical to the performance of the text-book implementation. Apparently the conditional move is good to avoid branch mispredictions, but cannot hide memory latency as much as the regular implementation.

So while we found a faster implementation for small data sets, the question is still open if we can do better for large sets (which might be common in databases). We tried using ternary search instead of binary search, with the argument that while ternary search accesses more memory locations than binary search, the memory latency can be hidden by performing the accesses in parallel, and also the number of iterations is smaller. There are many ways to implement ternary search, all with different trade-offs. One alternative that issues parallel memory accesses is shown below:

``````
while (uintptr_t step=n/3) {
auto t1=lower+step,t2=lower+(step<<1);
unsigned cmp=((*t1)<=needle)+((*t2)<=needle);
if (cmp==2) {
n-=t2-lower; lower=t2;
} else if (cmp) {
n-=t1-lower; lower=t1;
} else {
n=t1-lower;
}
}
``````

Unfortunately the results were disappointing:

 set size 103 105 107 109 ternary search 81 ms 143 ms 400 ms 781 ms

The extra memory accesses are just not worth it. In worst case we have now two cache misses per iteration (even if they are done in parallel), and the benefit of the increased fanout is just too small.

Of course there are ways to speed up search in large data sets further. For example by storing the data in a cache-optimized B-Tree rather than a sorted vector. But that forces us to abandon our original mission, namely exploiting sortedness that is available in the data anyway. For small and medium sets we have shown improvements over the text-book approach here, if someone has a good idea for large sets I would be happy to hear about it in the comments.

Update: As Marcin pointed out in the comments, it is possible to use a SIMD-version of binary search to perform 4 searches in parallel. I have used the code from section 5.5.4 of the thesis, it is a bit long to show here. And indeed we get good performance (best of 10 runs):

 set size 103 105 107 109 SIMD version 24 ms 63 ms 478 ms 706 ms

Note that for many data sizes the performance is in the middle between the text-book version and the conditional-move version, regardless of which of the two is actually faster. Puzzling. What is also puzzling is that the variance of the runtime seems to be much higher than for the non-SIMD versions. For the largest data set I have usually seen runtimes around 780ms, with peaks up to 950ms. I have no explanation for that.

If, as Marcin also suggested in the comment below, we replace the manual gather code
``````
int cmpval0 = data_shifted[result_vector[i + 0]];
int cmpval1 = data_shifted[result_vector[i + 1]];
int cmpval2 = data_shifted[result_vector[i + 2]];
int cmpval3 = data_shifted[result_vector[i + 3]];
__m128i xm_cmpvalvec = _mm_set_epi32(cmpval3, cmpval2, cmpval1, cmpval0);
``````

with the AVX2 gather instruction
``````