Database Architects: Trying to speed up Binary Search

It is well known that binary search is not particular fast. For point queries hash tables are much faster, ideally accessing in O(1), And even when we need range queries n-ary search structures like B-Trees are much faster than binary search or binary search trees.

Still, there is a certain charm to binary search. First, it is easy to implement, And second, it needs no extra memory but can directly operate on sorted data. And sometimes the data is sorted anyway, so we get the binary search more or less for free. And for something we get for free O(log n) lookup time is not that bad. The only question is, how can we implement it efficiently.

Of course there is the text-book way of implementing binary search, as explained for example in Wikipedia. The resulting code looks like this:


while (lower!=upper) {
  unsigned middle=lower+((upper-lower)/2);
  unsigned v=data[middle];
  if (v==needle) {
      return middle;
  } else if (v<needle) {
      lower=middle+1;
  } else {
      upper=middle;
  }
}
return notFound;

Not particular difficult, but how fast it is? We ran experiments where we performed 1,000,000 random lookups for 32bit integers in data sets of various sizes and measured the performance on a Broadwell i7-5500U (all experiments using -O3):

set size	10³	10⁵	10⁷	10⁹
classic	65 ms	119 ms	362 ms	702 ms

When we look carefully we notice that the runtime grows a bit faster than the asymptotic complexity of O(log n) would suggest. Or, in other words, the constants for searching in the 4GB data set seem to be about a factor 4 higher than when searching in the 4KB data set. That is not surprising, as the 4KB of the smallest set easily fit into the CPU cache, while the 4GB of the largest set are clearly beyond cache size. What is surprising is that it is only a factor of 4!

The performance of binary search is largely affected by two effects: First, there are cache misses, as every lookup accesses O(log n) elements, which are often not in cache for the large data sets. And second, there are branch mispredictions, as the comparison v<needle has about a 50% chance of being true, which leads to inefficient execution. For small data sets the branch misses dominate, while for the large data sets the branch misses are somewhat hidden by the memory latency. Therefore a slowdown of only 4.

Now the question is, can we do better? Usually people try to speed up binary search by reducing branch misses. There is an interesting talk about binary search, less wrong that described how to implement binary search without the hard to predict branch. Somewhat ironic the example code in that talk is wrong, contrary to its title, but the idea is valid. The corrected code looks like this:


while (auto_t half=n/2) {
   auto middle=lower+half;
   lower=((*middle)<=needle)?middle:lower;
   n-=half;
}
return ((*lower)==needle)?lower:notFound;

The code is not only quite short, the loop body is compiled by gcc into completely branch-free assembler code by using a conditional move. The resulting performance is shown below:

set size	10³	10⁵	10⁷	10⁹
branch free	23 ms	53 ms	320 ms	853 ms

This is a somewhat surprising result. For small data sets the branch-free version is indeed faster, but for the largest data set the search is significantly slower. And that is after manual tuning of the generated code, out of the box it needed 920 ms for the largest set.

The speed-up for the small data sizes is easy to explain (no branch mis-predictions), but where does the slow-down for the largest data set come from? We get a clue by compiling the same code with clang, which is uses a branch instead of a conditional move inside the loop. With that we get

set size	10³	10⁵	10⁷	10⁹
compile with clang	67 ms	115 ms	359 ms	694 ms

which is virtual identical to the performance of the text-book implementation. Apparently the conditional move is good to avoid branch mispredictions, but cannot hide memory latency as much as the regular implementation.

So while we found a faster implementation for small data sets, the question is still open if we can do better for large sets (which might be common in databases). We tried using ternary search instead of binary search, with the argument that while ternary search accesses more memory locations than binary search, the memory latency can be hidden by performing the accesses in parallel, and also the number of iterations is smaller. There are many ways to implement ternary search, all with different trade-offs. One alternative that issues parallel memory accesses is shown below:


while (uintptr_t step=n/3) {
  auto t1=lower+step,t2=lower+(step<<1);
  unsigned cmp=((*t1)<=needle)+((*t2)<=needle);
  if (cmp==2) {
      n-=t2-lower; lower=t2;
  } else if (cmp) {
      n-=t1-lower; lower=t1;
  } else {
      n=t1-lower;
  }
}

Unfortunately the results were disappointing:

set size	10³	10⁵	10⁷	10⁹
ternary search	81 ms	143 ms	400 ms	781 ms

The extra memory accesses are just not worth it. In worst case we have now two cache misses per iteration (even if they are done in parallel), and the benefit of the increased fanout is just too small.

Of course there are ways to speed up search in large data sets further. For example by storing the data in a cache-optimized B-Tree rather than a sorted vector. But that forces us to abandon our original mission, namely exploiting sortedness that is available in the data anyway. For small and medium sets we have shown improvements over the text-book approach here, if someone has a good idea for large sets I would be happy to hear about it in the comments.

Update: As Marcin pointed out in the comments, it is possible to use a SIMD-version of binary search to perform 4 searches in parallel. I have used the code from section 5.5.4 of the thesis, it is a bit long to show here. And indeed we get good performance (best of 10 runs):

set size	10³	10⁵	10⁷	10⁹
SIMD version	24 ms	63 ms	478 ms	706 ms

Note that for many data sizes the performance is in the middle between the text-book version and the conditional-move version, regardless of which of the two is actually faster. Puzzling. What is also puzzling is that the variance of the runtime seems to be much higher than for the non-SIMD versions. For the largest data set I have usually seen runtimes around 780ms, with peaks up to 950ms. I have no explanation for that.

If, as Marcin also suggested in the comment below, we replace the manual gather code


int cmpval0 = data_shifted[result_vector[i + 0]];
int cmpval1 = data_shifted[result_vector[i + 1]];
int cmpval2 = data_shifted[result_vector[i + 2]];
int cmpval3 = data_shifted[result_vector[i + 3]];
__m128i xm_cmpvalvec = _mm_set_epi32(cmpval3, cmpval2, cmpval1, cmpval0);

with the AVX2 gather instruction


__m128i xm_cmpvalvec = _mm_i32gather_epi32(data_shifted,_mm_load_si128((__m128i*)(result_vector)),4);

we get as runtime performance

set size	10³	10⁵	10⁷	10⁹
AVX2 gather	26 ms	68 ms	487 ms	708 ms

At a first glance the performance of the AVX2-gather is identical to the manual gather. And indeed I had got exactly that result in previous gather experiments, too. However the variance problem seems to have gone away when using AVX2 gather, the runtime is now relative stable across runs.

12 comments:

MarcinSeptember 12, 2015 at 8:42 AM
Nice!
I once did some work on speeding up the binary search for vectorized execution. I used the trick of always working with powers of two.
What was missing then were "gather" SIMD instructions, would love to repeat experiments on the new CPUs... do you have any students that have time to play? :D
I refer to section 5.5.4 in homepages.cwi.nl/~boncz/phd-zukowski.pdf

Btw, the algorithm there is the result of a bet after a few caipirinhas that I can implement binary search in 2 hours (ask Peter Boncz ;)).
UnknownDecember 16, 2015 at 11:48 PM
You can try cache-friendly binary search: http://bannalia.blogspot.ru/2015/06/cache-friendly-binary-search.html
Sergei LebedevDecember 17, 2015 at 12:24 AM
We got a nice speedup by using the lookup-table technique described here: https://geidav.wordpress.com/2013/12/29/optimizing-binary-search. I'm sure you can just as well apply SIMD to this idea.
Philipp UnterbrunnerDecember 17, 2015 at 1:20 AM
Since there is no dominant solution across all input sizes, how about a hybrid version that starts out as a textbook implementation and switches to a branch-free implementation as soon as the step size goes below a certain threshold? That's an easy check to add.

The threshold seems to be in the megabyte range. I suspect roughly around the size of the L3 cache. That would make sense assuming that the conditional move code suffers from unnecessary loads.

Since the threshold is only in one direction, the branch-free version would remain the same. Looks like we can even recycle the variables and goto straight into it.

The textbook version would of course need an extra branch to check for the threshold. So the open question is whether that extra conditional jump is worth it. The branch will repeatedly miss until it eventually hits and we jump to the branch-less code, so the branch predictor should do a good job of absorbing that extra branch.
MarcinDecember 17, 2015 at 1:26 AM
Wow, I never noticed your reply, but apparently you made HackerNews today, and someone pointed me to it :)

Erik van OostenDecember 19, 2015 at 9:11 PM
I have always wondered what would happen if you assume values are well distributed. In that case you can estimate the 'middle' from the start and end values.

Database Architects

Saturday, September 12, 2015

Trying to speed up Binary Search

12 comments:

Blog Archive