Sunday, November 22, 2020

Taming Deep Recursion

 When operating on hierarchical data structures, it is often convenient to formulate that using pairwise recursive functions. For example, our semantic analysis walks that parse tree recursively and transforms it into an expression tree. This corresponding code looks roughly like this:

 unique_ptr<Expression> analyzeExpression(AST* astNode) {  
   switch (astNode->getType()) {  
    case AST::BinaryExpression: return analyzeBinaryExpression(astNode->as<BinaryExpAST>());  
    case AST::CaseExpression: return analyzeCaseExpression(astNode->as<CaseExpAST>());  
 unique_ptr<Expression> analyzeBinaryExpression(BinaryExpAST* astNode) {  
   auto left = analyzeExpression(astNode->left);  
   auto right = analyzeExpression(astNode->right);  
   auto type = inferBinaryType(astNode->getOp(), left, right);  
   return make_unique<BinaryExpression>(astNode->getOp(), move(left), move(right), type);  

It recursively walks the tree, collects input expressions, infers types, and constructs new expressions. This works beautifully until you encounter a (generated) query with 300,000 expressions, which we did. At that point our program crashed due to stack overflow. Oops.

Our first mitigation was using __builtin_frame_address(0) at the beginning of analyzeExpression to detect excessive stack usage, and to throw an exception if that happens. This prevented the crash, but is not very satisfying. First, it means we refuse a perfectly valid SQL query "just" because it uses 300,000 terms in one expression. And second, we cannot be sure that this is enough. There are several places in the code that recursively walk the algebra tree, and it is hard to predict their stack usage. Even worse, the depth of the tree can change due to optimizations. For example, when a query has 100,000 entries in the from clause, the initial tree is extremely wide but flat. Later, after we have stopped checking for stack overflows, the optimizer might transform that into a tree with 100,000 levels, again leading to stack overflow. Basically, all recursive operations on the algebra tree are dangerous.

Now common wisdom is to avoid recursion if we cannot bound the maximum depth, and use iteration with an explicit stack instead. We spend quite some time thinking about that approach, but the main problem with that is that it makes the code extremely ugly. The code snippet above is greatly simplified, but even there an explicit stack would be unwieldy and ugly if we have to cover both binaryExpression and caseExpression using one stack. And the code gets cut into tiny pieces due to the control flow inversion required for manual stacks. And all that to defend against something that nearly never happens. We were unhappy with that solution, we wanted something that is minimal invasive and created overhead only in the unlikely case that a user gives us an extremely deep input.

One mechanism that promises to solve this problem is -fsplit-stack. There, the compiler checks for stack overflows in the function prolog and creates a new stack segment if needed. Great, exactly what we wanted! We can handle deep trees, no code change, and we only create a new stack if we indeed encounter deep recursion. Except that it is not really usable in practice. First, -fsplit-stack is quite slow. We measured 20% overhead in our optimizer when enabling split stacks, and that in cases where we did not create any new stacks at all. When -fsplit-stack does create new stacks it is even worse. This is most likely a deficit of the implementation, one could implement -fsplit-stack much more efficiently, but the current implementation is not encouraging. Even worse, clang produces an internal compiler error when compiling some functions with -fsplit-stack. Nobody seems to use this mechanism in production, and after disappointing results we stopped considering -fsplit-stack.

But the idea of split stacks is clearly good. When encountering deep recursion we will have to switch stacks at some point. After contemplating this problem for some time we realized that boost.context offers the perfect mechanism for switching stacks: It can start a fiber with a given stack, and switching between fibers costs just 19 cycles. By caching additional stacks and their fibers in thread-local data structures we can provide our own implementation of split stacks that is fast and supports arbitrary code. Without compiler support the split stack mechanism is visible, of course, but that is fine in our code. We have only a few entry points like analyzeExpression that will be called over and over again during recursion, and checking there is enough. Code wise the mechanism is not too ugly, it needs two lines of code per recursion head and looks like

 unique_ptr<Expression> analyzeExpression(AST* astNode) {  
   if (StackGuard::needsNewStack())  
    return StackGuard::newStack([=]() { return analyzeExpression(astNode); });  
   ... // unchanged  

Note that the lambda argument for newStack will be called from within the new stack, avoiding the stack overflow. When the lambda returns the mechanism will use boost.context to switch back to the previous stack. The performance impact of that mechanism is negligible, as 1) we do not check for overflows all the time but only in the few places that we know are central to the recursive invocations, like analyzeExpression here, 2) stack overflows are extremely rare in practice, and we only pay with one if per invocation is no overflow happens, and 3) even if they do happen, the mechanism is reasonably cheap. We cache the child stacks, and switching to a cached stack costs something like 30 cycles. And we never recurse in a hot loop.

It took us a while to get there, but now we can handle really large queries without crashing. Just for fun we tried running a query with 100,000 relations in the from clause. Fortunately our optimizer could already handle that, and now the rest of the system can handle it, too. And that with nice, intuitive, recursive code, at the small price of two lines of code per recursion head.

Friday, October 30, 2020

C++ Concurrency Model on x86 for Dummies

Since C++11, multi-threaded C++ code has been governed by a rigorous memory model. The model allows implementing concurrent code such as low-level synchronization primitives or lock-free data structures in a portable fashion. To use the memory model, programmers need to do two things: First, they have to use the std::atomic type for concurrently-accessed memory locations. Second, each atomic operation requires a memory order argument with six options determining the concurrency semantics in terms of which re-orderings are allowed. (Some operations even allow specifying two memory orders!)

While there are a number of attempts to describe the model, I always found the full semantics very hard to understand and consequently concurrent code hard to write and reason about. And since we are talking about low-level concurrent code here, making a mistake (like picking the wrong memory order) can lead to disastrous consequences.

Luckily, at least on x86, a small subset of the full C++11 memory model is sufficient. In this post, I'll present such a subset that is sufficient to write high-performance concurrent code on x86. This simplification has the advantage that the resulting code is much more likely to be correct, without leaving any performance on the table. (On non-x86 platforms like ARM, code written based on this simplified model will still be correct, but might potentially be slightly slower than necessary.) 

There are only six things one needs to know to write high-performance concurrent code on x86.

1. Data races are undefined

If a data race occurs in C++, the behavior of the program is undefined. Let's unpack that statement. A data race can be defined as two or more threads accessing the same memory location with at least one of the accesses being a write. By default (i.e., without using std::atomic), the compiler may assume that no other thread is concurrently modifying memory. This allows the compiler to optimize the code, for example by reordering or optimizing away memory accesses. Consider the following example:

void wait(bool* flag) {
    while (*flag);

Because data races are undefined, the compiler may assume that the value behind the pointer flag is not concurrently modified by another thread. Using this assumption, gcc translates the code to a return statement while clang translates it to an infinite loop if flag is initially true. Both translations are likely not what the code intends. To avoid undefined code, it is necessary use std::atomic for variables where race conditions may happen:

void wait(std::atomic<bool>* flag) {
    while (*flag); // same as while(flag->load(std::memory_order_seq_cst));

*flag is equivalent to flag.load(std::memory_order_seq_cst), i.e., the default memory order is sequential consistency. Sequential consistency is the strongest memory order guaranteeing that atomic operations are executed in program order. The compiler is not allowed to reorder memory operations or optimize them away.

2. Sequentially-consistent loads are fast

Making the flag atomic may seem expensive, but luckily atomic loads are cheap on x86. Indeed, our wait function is translated to a simple loop with a simple MOV instruction, without any barrier/fence instruction. This is great as it means that on x86 an atomic, sequentially-consistent load can be just as fast as a normal load. It also means that on x86 there is no performance benefit of using any weaker memory order for atomic loads. For loads all memory orders are simply translated to MOV.

3. Sequentially-consistent stores are slow

While sequentially-consistent atomic loads are as cheap as normal loads, this is not the case for stores, as can be observed from the following example:

void unwait(std::atomic<bool>* flag) {
    *flag = false; // same as flag->store(false, std::memory_order_seq_cst);

As with atomic loads, atomic stores are sequentially consistent if no explicit memory order is specified. In clang and gcc 10, the store translates to an XCHG instruction rather than a MOV instruction (older gcc versions translate it to a MOV plus MFENCE.). XCHG and MFENCE are fairly expensive instructions but are required for sequentially consistent stores on x86. (The CPU's store buffer must be flushed to L1 cache to make the write visible to other threads through cache coherency.)

4. Stores that can be delayed can benefit from the release memory order

Because sequentially-consistent stores are fairly expensive, there are situations where a weaker memory order can improve performance. A common case is when the effect of a store can be delayed. The classical example is unlocking a mutex. The unlocking thread does not have to synchronously wait for the unlocking to become visible, but can continue executing other instructions. Another way of saying this is that it is correct to move instructions into the critical section, but not out of it. In C++, this weaker form of store consistency is available through the release memory order.

In our unwait example, the store can indeed be delayed, which is why we can use the release memory order for the store:

void unwait(std::atomic<bool>* flag) {
    flag->store(false, std::memory_order_release);

This code is translated to a simple MOV instruction, which means it can be as efficient as a non-atomic store.

5. Some atomic operations are always sequentially consistent

Besides loads and stores, std::atomic also offers the high-level atomic operations compare_exchange, exchange, add, sub, and, or, xor. On x86, these are always directly translated to sequentially-consistent CPU instructions. This means that there is no performance benefit from specifying weaker memory orders on any of these operations. (An atomic increment with release semantics would be useful in many situations, but alas is not available.)

6. Scalable code should avoid cache invalidations

I mentioned above that sequentially-consistent loads and release stores may be as cheap as non-atomic loads/store. Unfortunately, this is not always the case. Because CPUs have per-core caches, the performance of concurrent programs depends on the dynamic access pattern. Every store has to invalidate any cached copies of that cache line on other cores. This can cause parallel implementations to be slower than single-threaded code. Therefore, to write scalable code, it is important to minimize the number of writes to shared memory locations. A positive way of saying this is that as long as the program does not frequently write to memory locations that are frequently being read or written, the program will scale very well. (Optimistic lock coupling, for example, is a general-purpose concurrency scheme for synchronizing data structures that exploits this observation.)


The full C++ memory model is notoriously hard to understand. x86, on the other hand, has a fairly strong memory model (x86-TSO) that is quite intuitive: basically everything is in-order, except for writes, which are delayed by the write buffer. Exploiting x86's memory model, I presented a simplified subset of the C++ memory model that is sufficient to write scalable, high-performance concurrent programs on x86.

Tuesday, April 28, 2020

Linear Time Liveness Analysis

Standard compiler are usually used with hand-written programs. These programs tend to have reasonably small functions, and can be processed in (nearly) linear time. Generated programs however can be quite large, and compilers sometimes struggle to compile them at all. This can be seen with the following (silly) demonstration script:

 import subprocess  
 from timeit import default_timer as timer  
 def doTest(size):  
   with open("foo.cpp", "w") as out:  
     print("int foo(int x) {", file=out)  
     for s in range(size):  
       p="x" if s==0 else f'l{s-1}'  
       print (f'int l{s}; if (__builtin_sadd_overflow({p},1,&l{s})) goto error;', file=out)  
     print(f'return l{size-1};error: throw;}}', file=out);  
   start = timer()["gcc", "-c", "foo.cpp"])   
   stop = timer()  
   print(size, ": ", (stop-start))  
 for size in [10,100,1000,10000,100000]:  

It generates one function with n statements of the form "int lX; if (__builtin_sadd_overflow(lY,1,&lX)) goto error;" which are basically just n additions with overflow checks, and then measures the compile time. The generated code is conceptually a very simple, but it contains a lot of basic blocks due to the large number of ifs. When compiling with gcc we get the following compile times:

compilation [s]> 1h

The compile time is dramatically super linear, gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. This is disastrous when processing generated code, where we cannot easily limit the size of individual functions. In our own system we use neither gcc nor clang for query compilation, but we have same problem, namely compiling large generated code. And super-linear runtime quickly becomes an issue when the input is large.

One particular important problem in this context is liveness analysis, i.e, figuring out which value is alive at which part of the program. The textbook solution for that problem involves propagating liveness information for each variable across the blocks, but that is clearly super linear and does not scale to large program sizes. We therefore developed a different approach that we recently refined even more and the I want to present here:

Instead of propagating liveness sets or bitmasks, we study the control flow graph of the program. For a simple queries with a for loop and an if within the loop it might look like this:

Now if we define a variable x in block 0, and use it in block 2, the variable has to be alive on every path between definition and usage. Obviously that includes the blocks 0-2, but that is not enough. We see that there is a loop involving all loops between 1 and 4, and we can take that before coming to 2. Thus, the lifetime has to be extended to include the full loop, and is there range 0-4. If, however, we define a variable in 1 and use it in 2, the range is indeed 1-2, as we do not have to wrap around.

Algorithmically, we identify all loops in the program, which we can do in almost linear time, and remember how the loops are nested within each other. Then, we examine each occurrence of a variable (in SSA form). In principle the lifetime is the span from the first to the last occurrences. If, however, two occurrences are in different loops, we walk "up" from the lower loop level until we occur in the same loop (or top level), extending the lifetime to cover the full loop while doing so. In this example x in block 0 is top level, while x in block 2 is in loop level 1. Thus, we leave the loop, expand 2 into 1-4, and find the lifetime to be 0-4.

This assumes, of course, that the block numbers are meaningful. We can guarantee that by first labeling all blocks in reverse postorder. This guarantees that all dominating blocks will have a lower number than their successors. We can further improve the labeling by re-arranging the blocks such that all blocks within a certain loop are next to each other, keeping the original reverse post order within the loop. This leads to nice, tight liveness intervals. Using just an interval instead of a bitmap is of course less precise, but the block reordering makes sure that the intervals primarily contain the blocks that are indeed involved in the execution.

Asymptotically such an interval based approach is far superior to a classical liveness propagation. All algorithms involved are linear or almost linear, and we only to have to store two numbers per variable. When handling large, generated code such an approach is mandatory. And even for classical, smaller programs it is quite attractive. I looked around a bit at lectures about compiler construction, and I am somewhat surprised that nobody seems to teach similar techniques to handle large programs. When you cannot control the input size, super linear runtime is not an option.

Thursday, January 30, 2020

All hash table sizes you will ever need

When picking a hash table size we usually have two choices: Either, we pick a prime number or a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise and, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, where looking at just a subset of bits is ok.

Prime numbers are more forgiving concerning the hash function, and we have more choices concerning the size, which leads to less overhead. But using a prime number requires a modulo computation, which is expensive. And we have to find a suitable prime number at runtime, which is not that simple either.

Fortunately we can solve both problems simultaneously, which is what this blog post is about. We can tackle the problem of finding prime numbers by pre-computing suitable numbers with a given maximum distance. For example when when only considering prime numbers that are at least 5% away from each other we can cover the whole space from 0 to 2^64 with just 841 prime numbers. We can solve the performance problem by pre-computing the magic numbers from Hacker's Delight for each prime number in our list, which allows us to use multiplications instead of expensive modulo computations. And we can skip prime numbers with unpleasant magic numbers (i.e., the ones that require an additional add fixup), preferring the next cheap prime number instead.

The resulting code can be found here. It contains every prime number you will ever need for hash tables, covering the whole 64bit address space. Usage is very simple, we just ask for a prime number and then perform modulo operations as needed:

class HashTable {
   primes::Prime prime;
   vector table;
   HashTable(uint64_t size) { 
      prime = prime::Prime::pick(size);
   Entry* getEntry(uint64_t hash) { return table[prime.mod(hash)]; }

The performance is quite good. On an AMD 1950X, computing the modulo for 10M values (and computing the sum of the results) takes about 4.7ns per value when using a plain (x%p), but only 0.63ns per value when using p.mod(x).

Getting this into unordered_map would be useful, it would probably improve the performance quite significantly when we have few cache misses.

Wednesday, July 24, 2019

Cuckoo Filters with arbitrarily sized tables

Cuckoo Filters are an interesting alternative to Bloom filters. Instead of maintaining a filter bitmap, they maintain a small (cuckoo-)hash table of key signatures, which has several good properties. For example is stores just the signature of a key instead of the key itself, but is nevertheless able to move an element to a different position in the case of conflicts.

This conflict resolution mechanism is quite interesting: Just like regular cuckoo hash tables each element has two potential positions where is be placed, a primary position i1 and a secondary position i2. These can be computed as follows:

i1 = hash(x)
i2 = i1 xor hash(signature(x))

Remember that the cuckoo filter stores only the (small) signature(x), not x itself. Thus, when we encounter a value, we cannot know if it is at its position i1 or position i2. However, we can nevertheless alternate between positions because the following holds

i1 = i2 xor hash(signature(x))

and we have the signature stored in the table. Thus, we can just use the self-inverse xor hash(signature(x)) to switch between i1 and i2, regardless of whether are currently at i1 or i2. Which is a neat little trick. This allows is to switch between positions, which is used in the cuckoo filter conflict resolution logic.

However all this hold only because the original cuckoo filters use power-of-two hash tables. If our hash table size is not a power of 2, the xor can place the alternative position beyond the size of the hash table, which breaks the filter. Thus cuckoo filter tables always had to be powers of two, even if that wasted a lot of memory.

In more recent work Lang et al. proposed using cuckoo filters with size C, where C did not have to be a power of two, offering much better space utilization. They achieved this by using a different self-inverse function:

i1 = hash(x) mod C
i2 = -(i1 + hash(signature(x)) mod C

Note that the modulo computation can be made reasonable efficient by using magic numbers, which can be precomputed when allocating the filter.

A slightly different way to formulate this is to introduce a switch function f, which switches between positions:

f(i,sig,C) = -(i + hash(sig)) mod C
i1 = hash(x) mod C
i2 = f(i1, signature(x), C)
i1 = f(i2, signature(x), C)
All this works because f is self-inverse, i.e.,

i = f(f(i, signature(x), C), signature(x), C)

for all C>0, i between 0 and C-1, and signature(x)>0.

The only problem is: Is this true? In a purely mathematical sense it is, as you can convince yourself by expanding the formula, but the cuckoo filters are not executed on abstract machines but on real CPUs. And there something unpleasant happens: We can get numerical overflows of our integer registers, which implicitly introduces a modulo 2^32 into our computation. Which breaks the self-inverseness of f in some cases, except when C is power of two itself.

Andreas Kipf noticed this problem when using the cuckoo filters with real world data. Which teaches us not to trust in formulas without additional extensive empirical validation... Fortunately we can repair the function f by using proper modular arithmetic. In pseudo-code this looks like this

    x=(C-1)-(hash(sig) mod C)
    if (x>=i)
       return (x-i);
    // The natural formula would be C-(i-x), but we prefer this one...
    return C+(x-i);

This computes the correct wrap-around module C, at the cost of one additional if. We can avoid the if by using predication, as shown below

    x=(C-1)-(hash(sig) mod C)
    m = (x<i)*(~0)
    return (m&C)+(x-i);

which can be attractive for SSE implementations where the comparison produces a bit mask anyway.

We have validated that this new f function is now self-inverse for all possible values of i, sig, and C. And we did this by not just looking at the formula, but by trying out all values programmatically. Which is a good way to get confidence in your approach; there is only a finite number of combinations, and we can test them all.
With this small fix, we can now enjoy Cuckoo Filters with arbitrarily sized tables.

Edit:  The original post did not mirror the hash space correctly (using C-... instead of (C-1)-...), thanks to Andreas Kipf for pointing this out.

Thursday, May 16, 2019

Why use learning when you can fit?

We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes. As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? A simple function fit should work just as well.

Thus, I tried the following:
1) we sort all data and keep it in an array, just like with learned indexes
2) we build the CDF
3) we fit a linear spline to the CDF minimizing the Chebyshev norm
4) we fit a polynomial function to the spline nodes
5) now we can lookup a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step.

As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.

Now admittedly there was some weasel wording in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find the good fit. But just the same is true for the learned indexes, of course.

Now do we find a good fit? In theory we know how to construct the optimal fit in O(n log n), but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in O(n log n). Using that algorithm, we can construct a linear spline that maximum error efficiently, and we know what the maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we then can perform an efficient local search on a small, known, window of the data.

The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we find these with binary search again. Note that the spline is much smaller than the original data, but still we want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline nodes, leading to the before mentioned time bound.

How well does this work in practice? On the map data set from the learned indexes paper and a log normal data set we get the following. (The learned indexes numbers are from the paper, the b-tree numbers are from here, and the spline numbers from this experiments. I still do not really know what the averages mean for the learned indexes, but probably the average errors averaged over all models).

Map datasize (MB)avg error
Learned Index (10,000)0.158 ± 45
Learned Index (100,000)1.532 ± 36
B-tree (10,000)0.15225
B-tree (100,000)1.5322
Spline (10,000)0.15193
Spline (100,000)1.5322

Log normal datasize (MB)avg error
Learned Index (10,000)0.1517,060 ± 61,072
Learned Index (100,000)1.5317,005 ± 60,959
B-tree (10,000)0.151330
B-tree (100,000)1.533
Spline (10,000)0.15153
Spline (100,000)1.531

Somewhat surprising the accuracy the accuracy of the spline is nearly identical to the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contains many millions of data points in both cases, thus the prediction accuracy is really high.

For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.
But if one wants to go the read-only/read-mostly route, the function functions could be attractive alternative the machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.

Friday, February 1, 2019

Honest asymptotic complexity for search trees and tries

Fun fact that was pointed out to me by Viktor: All traditional books on algorithms and data structures that we could find gave the lookup costs of balanced search trees as O(log n) (i.e., the depth of the search tree), and the lookup costs of tries as O(k) (i.e., the length of the key). At a first glance this is a logarithmic time lookup against a linear time lookup, which makes people nervous when thinking about long keys.

But that argument is very unfair: Either we consider a key comparison a O(1) operation, then a tree lookup is indeed in O(log n), but then a trie lookup is in O(1)! As the length of the key has to be bounded by a constant to get O(1) comparisons, the depth of the trie is bounded, too. Or the length of the key matters, then a trie lookup is indeed in O(k), but then a tree lookup is in O(k log n). We have to compare with the key on every level, and if we are unlucky we have to look at the whole key, which gives the factor k.

Which of course makes tries much more attractive asymptotic wise. Note that we ignore wall clock times here, which are much more difficult to predict, but in many if not most cases tries are indeed much faster than search trees.

I believe the reason why text books get away with this unfair comparison is that they all present balanced search trees with integer keys:

image/svg+xml 8 10 4 1 6
While tries are traditionally introduced with string examples. If they had used string keys for balanced search trees instead it would have been clear that the key length matters:

image/svg+xml ABCD8 ABCD10 ABCD4 ABCD1 ABCD6
The trie examines every key byte at most once, while the search tree can examine every key byte log n times. Thus, the asymptotic complexity of tries is actually better than that of balanced search tries.