Monday, November 10, 2025

Comparing Integers and Doubles

During automated testing we stumbled upon a problem that boiled down to the transitivity of comparisons: If a=b and a=c, then we assumed that b=c. Unfortunately that is not always the case, at least not in all systems. Consider the following SQL query:

select a=b, a=c, b=c
from (values(
   1234567890123456789.0::double precision,
   1234567890123456788::bigint,
   1234567890123456789::bigint)) s(a,b,c)

If you execute that in Postgres (or DuckDB, or SQL Server, or ...) the answer is (true, true, false). That is, the comparison is not transitive! Why does that happen? When these systems compare a bigint and a double, they promote the bigint to double and then compare. But a double has only 52 bits of mantissa, which means it will lose precision when promoting large integers to double, producing false positives in the comparison.
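To see the same effect outside of SQL, here is a minimal C++ snippet (my own illustration, not taken from any of these systems) showing that both bigints collapse to the same double during promotion:

#include <cstdint>
#include <iostream>

int main() {
   int64_t b = 1234567890123456788;
   int64_t c = 1234567890123456789;
   double a = 1234567890123456789.0;
   // at this magnitude adjacent doubles are 256 apart, so b and c
   // round to the same double and both "equal" comparisons succeed
   std::cout << (a == static_cast<double>(b)) << " "
             << (a == static_cast<double>(c)) << " "
             << (b == c) << std::endl;   // prints: 1 1 0
   return 0;
}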

This behavior is highly undesirable, first because it confuses the optimizer, and second because (at least in our system) joins work very differently: Hash joins promote to the most restrictive type and discard all values that cannot be represented, as they can never find a join partner. For double/bigint joins that leads to observable differences between joins and plain comparisons, which is very bad.

How should we compare correctly? Conceptually the situation is clear: an IEEE 754 floating point number with sign s, mantissa m, and exponent e represents the value (-1)^s*m*2^e, and we just have to compare the integer with that value. But there is no easy way to do that; if we write an int/double comparison in, e.g., C++, the compiler performs the same promotion to double, messing up the comparison.

We can get the logic right by doing two conversions: We first convert the int to double and compare the doubles. If the values are not equal, the order is clear and we can use that. Otherwise, we convert the double back to an integer, check whether the conversion rounded up or down, and handle the result accordingly. Plus some extra checks to avoid undefined behavior (the round trip INT64_MAX -> double -> int64 is not defined) and to handle non-finite values, and we get:

#include <cstdint>

// returns -1 if a<b, 0 if a==b, and 1 if a>b (NaN is treated as greater)
int cmpDoubleInt64(double a, int64_t b) {
   // handle intmax and nan: 0x1.fffffffffffffp+62 is the largest double
   // below 2^63; anything above it exceeds every int64_t, and casting it
   // to int64_t below would be undefined behavior
   if (!(a<=0x1.fffffffffffffp+62)) return 1;

   // fast path comparison after promoting b to double
   double bd = b;
   if (a!=bd) return (a<bd)?-1:1;

   // the doubles compare equal, but the promotion may have lost precision;
   // convert back to int64 and compare exactly
   int64_t ai = a;
   if (ai!=b) return (ai<b)?-1:1;
   return 0;
}

This is the logic that we now use. Who else does it correctly? Perhaps somewhat surprisingly, Python and SQLite. Other database systems (and programming languages) that we tested all lost precision during the comparison, leading to tons of problems. IMHO a proper int/double comparison should be available in every programming language, at least as a library function. But in most languages (and DBMSes) it isn't. You can use the code above if you ever have this problem.
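As a quick sanity check, here is a small test driver (my own, not part of the original text) that applies cmpDoubleInt64 to the values from the SQL query above. The double literal 1234567890123456789.0 actually stores 1234567890123456768, so it correctly compares as smaller than both bigints instead of "equal" to both:

#include <cassert>
#include <cstdint>

int cmpDoubleInt64(double a, int64_t b);   // the function shown above

int main() {
   double a = 1234567890123456789.0;       // stored as 1234567890123456768
   assert(cmpDoubleInt64(a, 1234567890123456788) == -1);
   assert(cmpDoubleInt64(a, 1234567890123456789) == -1);
   assert(cmpDoubleInt64(a, 1234567890123456768) == 0);
   return 0;
}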

Thursday, November 6, 2025

Query Compilation Isn't as Hard as You Think

Query compilation based on the produce/consume model has a reputation for delivering high query performance, but also for being more difficult to implement than interpretation-based engines using vectorized execution. While it is true that building a production-grade query compiler with low compilation latency requires substantial infrastructure and tooling effort, what is sometimes overlooked is that the vectorization paradigm is not easy to implement either, though for different reasons.

Vectorization relies on vector-at-a-time rather than tuple-at-a-time processing. High-level operations (e.g., inserting tuples into a relation) must be decomposed into per-attribute vector kernels that are then executed successively. This requires a specific way of thinking and can be quite challenging, depending on the operator. In practice, the consequence is that the range of algorithms that can be realistically implemented in a vectorized engine is limited.
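To illustrate what such a decomposition looks like, here is a hypothetical selection kernel in the usual column-at-a-time style (a generic sketch, not taken from any particular engine): a predicate like a > constant becomes one pass over a single column, and kernels communicate through a selection vector of qualifying row ids.

#include <cstddef>
#include <cstdint>

// evaluate "col[row] > constant" for all rows in selIn (or for all n rows
// if selIn is null) and write the qualifying row ids to selOut
size_t selGreaterThanInt64(const int64_t* col, int64_t constant,
                           const uint32_t* selIn, size_t n, uint32_t* selOut) {
   size_t found = 0;
   for (size_t i = 0; i < n; ++i) {
      uint32_t row = selIn ? selIn[i] : static_cast<uint32_t>(i);
      if (col[row] > constant)
         selOut[found++] = row;
   }
   return found;
}

A conjunction of predicates becomes a chain of such kernels, each shrinking the selection vector, and more stateful operations such as aggregation or hash table insertion have to be broken apart in the same per-attribute fashion.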

In contrast, compilation is fundamentally simpler and more flexible in terms of the resulting code. It supports tuple-at-a-time processing, which makes it easier to implement complex algorithmic ideas. Developing a query processing algorithm typically involves three steps: First, implement the algorithm manually as a standalone program, i.e., outside the compiling query engine. Second, explore different algorithms, benchmark, and optimize this standalone implementation. Finally, once a good implementation has been found, modify the query compiler to generate code that closely resembles the manually developed version.

At TUM, we use a simple pedagogical query compilation framework called p2c ("plan to C") for teaching query processing. It implements a query compiler that, given a relational operator tree as input, generates the C++ code that computes the query result. The core of the compiler consists of about 600 lines of C++ code and supports the TableScan, Selection, Map, Sort, Aggregation, and Join operators.

The generated code is roughly as fast as single-threaded DuckDB and can be easily inspected and debugged, since it is simply straightforward C++ code. This makes it easy for students to optimize (e.g., by adding faster hash tables or multi-core parallelization) and to extend (e.g., with window functions).
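To make this concrete, here is an invented sketch (not actual p2c output) of roughly what such generated code looks like for a query like select sum(b) from r where a > 10: the whole pipeline collapses into a single tuple-at-a-time loop.

#include <cstddef>
#include <cstdint>
#include <vector>

struct R { std::vector<int64_t> a, b; };

int64_t query(const R& r) {
   int64_t sum = 0;                           // Aggregation state
   for (size_t i = 0; i < r.a.size(); ++i) {  // TableScan produces tuples
      int64_t a = r.a[i], b = r.b[i];
      if (!(a > 10)) continue;                // Selection consumes each tuple
      sum += b;                               // Aggregation consumes it
   }
   return sum;
}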

To keep p2c simple, it does not support NULL values or variable-size data types (strings have a fixed maximum length). Compiling to template-heavy C++ results in very high compilation times, making it impractical for production use. At the same time, p2c employs the same core concepts as Hyper and Umbra, making it an excellent starting point for learning about query compilation. Note that this type of compilation to C++ can also serve as an effective prototyping platform for research that explores new query processing ideas.

Find the code for p2c on GitHub: https://github.com/viktorleis/p2c