Still, there is a certain charm to binary search. First, it is easy to implement, And second, it needs no extra memory but can directly operate on sorted data. And sometimes the data is sorted anyway, so we get the binary search more or less for free. And for something we get for free O(log n) lookup time is not that bad. The only question is, how can we implement it efficiently.

Of course there is the text-book way of implementing binary search, as explained for example in Wikipedia. The resulting code looks like this:

```
while (lower!=upper) {
unsigned middle=lower+((upper-lower)/2);
unsigned v=data[middle];
if (v==needle) {
return middle;
} else if (v<needle) {
lower=middle+1;
} else {
upper=middle;
}
}
return notFound;
```

Not particular difficult, but how fast it is? We ran experiments where we performed 1,000,000 random lookups for 32bit integers in data sets of various sizes and measured the performance on a Broadwell i7-5500U (all experiments using -O3):

set size | 10^{3} | 10^{5} | 10^{7} | 10^{9} |

classic | 65 ms | 119 ms | 362 ms | 702 ms |

When we look carefully we notice that the runtime grows a bit faster than the asymptotic complexity of O(log n) would suggest. Or, in other words, the constants for searching in the 4GB data set seem to be about a factor 4 higher than when searching in the 4KB data set. That is not surprising, as the 4KB of the smallest set easily fit into the CPU cache, while the 4GB of the largest set are clearly beyond cache size. What is surprising is that it is only a factor of 4!

The performance of binary search is largely affected by two effects: First, there are cache misses, as every lookup accesses O(log n) elements, which are often not in cache for the large data sets. And second, there are branch mispredictions, as the comparison v<needle has about a 50% chance of being true, which leads to inefficient execution. For small data sets the branch misses dominate, while for the large data sets the branch misses are somewhat hidden by the memory latency. Therefore a slowdown of only 4.

Now the question is, can we do better? Usually people try to speed up binary search by reducing branch misses. There is an interesting talk about binary search, less wrong that described how to implement binary search without the hard to predict branch. Somewhat ironic the example code in that talk is wrong, contrary to its title, but the idea is valid. The corrected code looks like this:

```
while (auto_t half=n/2) {
auto middle=lower+half;
lower=((*middle)<=needle)?middle:lower;
n-=half;
}
return ((*lower)==needle)?lower:notFound;
```

The code is not only quite short, the loop body is compiled by gcc into completely branch-free assembler code by using a conditional move. The resulting performance is

`shown below`

:set size | 10^{3} | 10^{5} | 10^{7} | 10^{9} |

branch free | 23 ms | 53 ms | 320 ms | 853 ms |

This is a somewhat surprising result. For small data sets the branch-free version is indeed faster, but for the largest data set the search is significantly slower. And that is after manual tuning of the generated code, out of the box it needed 920 ms for the largest set.

The speed-up for the small data sizes is easy to explain (no branch mis-predictions), but where does the slow-down for the largest data set come from? We get a clue by compiling the same code with clang, which is uses a branch instead of a conditional move inside the loop. With that we get

set size | 10^{3} | 10^{5} | 10^{7} | 10^{9} |

compile with clang | 67 ms | 115 ms | 359 ms | 694 ms |

which is virtual identical to the performance of the text-book implementation. Apparently the conditional move is good to avoid branch mispredictions, but cannot hide memory latency as much as the regular implementation.

So while we found a faster implementation for small data sets, the question is still open if we can do better for large sets (which might be common in databases). We tried using ternary search instead of binary search, with the argument that while ternary search accesses more memory locations than binary search, the memory latency can be hidden by performing the accesses in parallel, and also the number of iterations is smaller. There are many ways to implement ternary search, all with different trade-offs. One alternative that issues parallel memory accesses is shown below:

```
while (uintptr_t step=n/3) {
auto t1=lower+step,t2=lower+(step<<1);
unsigned cmp=((*t1)<=needle)+((*t2)<=needle);
if (cmp==2) {
n-=t2-lower; lower=t2;
} else if (cmp) {
n-=t1-lower; lower=t1;
} else {
n=t1-lower;
}
}
```

Unfortunately the results were disappointing:

set size | 10^{3} | 10^{5} | 10^{7} | 10^{9} |

ternary search | 81 ms | 143 ms | 400 ms | 781 ms |

The extra memory accesses are just not worth it. In worst case we have now two cache misses per iteration (even if they are done in parallel), and the benefit of the increased fanout is just too small.

Of course there are ways to speed up search in large data sets further. For example by storing the data in a cache-optimized B-Tree rather than a sorted vector. But that forces us to abandon our original mission, namely exploiting sortedness that is available in the data anyway. For small and medium sets we have shown improvements over the text-book approach here, if someone has a good idea for large sets I would be happy to hear about it in the comments.

**Update:**As Marcin pointed out in the comments, it is possible to use a SIMD-version of binary search to perform 4 searches in parallel. I have used the code from section 5.5.4 of the thesis, it is a bit long to show here. And indeed we get good performance (best of 10 runs):

set size | 10^{3} | 10^{5} | 10^{7} | 10^{9} |

SIMD version | 24 ms | 63 ms | 478 ms | 706 ms |

Note that for many data sizes the performance is in the middle between the text-book version and the conditional-move version, regardless of which of the two is actually faster. Puzzling. What is also puzzling is that the variance of the runtime seems to be much higher than for the non-SIMD versions. For the largest data set I have usually seen runtimes around 780ms, with peaks up to 950ms. I have no explanation for that.

If, as Marcin also suggested in the comment below, we replace the manual gather code

```
int cmpval0 = data_shifted[result_vector[i + 0]];
int cmpval1 = data_shifted[result_vector[i + 1]];
int cmpval2 = data_shifted[result_vector[i + 2]];
int cmpval3 = data_shifted[result_vector[i + 3]];
__m128i xm_cmpvalvec = _mm_set_epi32(cmpval3, cmpval2, cmpval1, cmpval0);
```

with the AVX2 gather instruction

```
__m128i xm_cmpvalvec = _mm_i32gather_epi32(data_shifted,_mm_load_si128((__m128i*)(result_vector)),4);
```

we get as runtime performance

set size | 10^{3} | 10^{5} | 10^{7} | 10^{9} |

AVX2 gather | 26 ms | 68 ms | 487 ms | 708 ms |

At a first glance the performance of the AVX2-gather is identical to the manual gather. And indeed I had got exactly that result in previous gather experiments, too. However the variance problem seems to have gone away when using AVX2 gather, the runtime is now relative stable across runs.

Nice!

ReplyDeleteI once did some work on speeding up the binary search for vectorized execution. I used the trick of always working with powers of two.

What was missing then were "gather" SIMD instructions, would love to repeat experiments on the new CPUs... do you have any students that have time to play? :D

I refer to section 5.5.4 in homepages.cwi.nl/~boncz/phd-zukowski.pdf

Btw, the algorithm there is the result of a bet after a few caipirinhas that I can implement binary search in 2 hours (ask Peter Boncz ;)).

Thanks for the pointer, I have now included SIMD results at the end. I believe gather is still implemented in microcode on Haswell/Broadwell CPUs, the performance is nearly identical to a manual gather. Except perhaps for the variance effects in the results, but I have no good explanation for that.

DeleteYou can try cache-friendly binary search: http://bannalia.blogspot.ru/2015/06/cache-friendly-binary-search.html

ReplyDeleteYes, changing the memory layout can improve performance. The downside is that now range access (i.e., reading the elements in sorted order) becomes more expensive. Which is unfortunately also useful for quite a few operations.

DeleteIf you only want fast lookup, use a hash table. But binary search is attractive if the data is sorted anyway for some other reason.

Sometimes you have an array like this:

Delete[1, 5, 8, 10, 20]

And you want to search the lower number when searching e.g. 13, which in this case would result in 10.

For this case, the cache-friendly layout is perfect.

We got a nice speedup by using the lookup-table technique described here: https://geidav.wordpress.com/2013/12/29/optimizing-binary-search. I'm sure you can just as well apply SIMD to this idea.

ReplyDeleteThat is an interesting idea, you put a small lookup table in front of the binary search. I did not try this out yet, but this could certainly be useful.

DeleteSince there is no dominant solution across all input sizes, how about a hybrid version that starts out as a textbook implementation and switches to a branch-free implementation as soon as the step size goes below a certain threshold? That's an easy check to add.

ReplyDeleteThe threshold seems to be in the megabyte range. I suspect roughly around the size of the L3 cache. That would make sense assuming that the conditional move code suffers from unnecessary loads.

Since the threshold is only in one direction, the branch-free version would remain the same. Looks like we can even recycle the variables and goto straight into it.

The textbook version would of course need an extra branch to check for the threshold. So the open question is whether that extra conditional jump is worth it. The branch will repeatedly miss until it eventually hits and we jump to the branch-less code, so the branch predictor should do a good job of absorbing that extra branch.

Checking before you start the binary search is certainly a good idea. Large range => classic implementation, small range => branch free.

DeleteSwitching implementations inside the hot loop might not be that useful. First it costs instructions inside the hot loop. And second, if my guess is correct and the problem stems from cache misses, then switching to the branch-free variant will not help you, because even the small range at the end will probably be outside the cache.

But I can try it out at some point.

Wow, I never noticed your reply, but apparently you made HackerNews today, and someone pointed me to it :)

ReplyDeleteI have always wondered what would happen if you assume values are well distributed. In that case you can estimate the 'middle' from the start and end values.

ReplyDeleteWhat you propose is basically interpolation search, which indeed works well if the data is distributed nicely. In the comments above Sergei proposed a more general solution, namely using a small lookup table to improve the guess of the range. I should definitively try that out at some point.

Delete