Database Architects

Friday, December 27, 2024

Advent of Code 2024 in pure SQL

On a whim I decided to do this years advent of code in pure SQL. That was an interesting experience that I can recommend to everybody because it forces you to think differently about the problems. And I can report that it was possible to solve every problem in pure SQL.

In many cases SQL was actually surprisingly pleasant to use. The full solution for day 11 (including the puzzle input) is shown below:

with recursive aoc10_input(i) as (select '
89010123
78121874
87430965
96549874
45678903
32019012
01329801
10456732
'),
lines(y,line) as (
   select 0, substr(i,1,position(E'\n' in i)-1), substr(i,position(E'\n' in i)+1)
   from aoc10_input
   union all
   select y+1,substr(r,1,position(E'\n' in r)-1), substr(r,position(E'\n' in r)+1)
   from lines l(y,l,r) where position(E'\n' in r)>0
),
field(x,y,v) as (
   select x,y,ascii(substr(line,x::integer,1))-48
   from (select * from lines l where line<>'') s, lateral generate_series(1,length(line)) g(x)
),
paths(x,y,v,sx,sy) as (
   select x,y,9,x,y from field where v = 9
   union all
   select f.x,f.y,f.v,p.sx,p.sy
   from field f, paths p
   where f.v=p.v-1 and ((f.x=p.x and abs(f.y-p.y)=1) or (f.y=p.y and abs(f.x-p.x)=1)) and p.v>0),
results as (select * from paths where v=0),
part1 as (select distinct * from results)
select (select count(*) from part1)  as part1, (select count(*) from results) as part2

Parsing the input is a bit painful in SQL, but it is not too bad. Lines 1-10 are simply the puzzle input, lines 11-17 split the input into individual lines, and lines 18-21 construct a 2D array from the input. The algorithm itself is pretty short, lines 22-27 perform a recursive traversal of the field, and lines 28-39 extract the puzzle answer from the traversal results. For this kind of small scale traversals SQL works just fine.

Other days were more painful. Day 16 for example does conceptually a very similar traversal of a field, and it computes the minimal traversal distance for each visited. Expressing that in SQL in easy, but evaluation is wasteful. When replacing the reference input with a real puzzle input the field is quite large, and the recursive query generates and preserves a lot of state, even though we only care about the last iteration of the recursive query. As a consequence you need a machine with over 200GB memory to execute that query, even though most of the computed tuples are irrelevant. We could fix that excessive memory consumption by using iteration semantic during recursion, but that is not widely supported by DBMSes. Umbra could do it, but Postgres and DuckDB cannot, thus I have not used it in my solutions.

And sometimes the programming model of recursive SQL clashes with what we want to do. On day 23 we had to find the maximum clique in sparse graph. This can be computed reasonably well with the Bron-Kerbosch algorithm, but expressing that in recursive SQL is quite convoluted because the algorithm wants to maintain multiple sets, but recursive SQL only passes a single set along. It can be done, but the result does not look pretty.

This experiment has shown two things to me 1) it is possible to code quite complex algorithms in SQL, and often the SQL code is surprisingly pleasant, and 2) recursive SQL would be much more efficient and more pleasant to use if we had mechanisms to update state. There is ongoing work on supporting more complex control flow in recursion via a trampoline mechanisms, which is very useful, too, but we should definitively look into more complex state manipulation mechanisms. With just a bit extra functionality SQL would be quite a solid choice for running complex algorithms directly inside a database.

Friday, December 13, 2024

What are important data systems problems, ignored by research?

In November, I had the pleasure of attending the Dutch-Belgian DataBase Day, where I moderated a panel on practical challenges often overlooked in database research. Our distinguished panelists included Allison Lee (founding engineer at Snowflake), Andy Pavlo (professor at CMU), and Hannes Mühleisen (co-creator of DuckDB and researcher at CWI), with attendees contributing to the discussion and sharing their perspectives. In this post, I'll attempt to summarize the discussion in the hope that it inspires young (and young-at-heart) researchers to tackle these challenges. Additionally, I'll link to some paper that can serve as motivation and starting points for research in these areas.

One significant yet understudied problem raised by multiple panellists is the handling of variable-length strings. Any analysis of real-world analytical queries reveals that strings are ubiquitous. For instance, Amazon Redshift recently reported that around 50% of all columns are strings. Since strings are typically larger than numeric data, this implies that strings are a substantial majority of real-world data. Dealing with strings presents two major challenges. First, query processing is often slow due to the variable size of strings and the (time and space) overhead of dynamic allocation. Second, surprisingly little research has been dedicated to efficient database-specific string compression. Given the importance of strings on real-world query performance and storage consumption, it is surprising how little research there is on the topic (there are some exceptions).

Allison highlighted a related issue: standard benchmarks, like TPC-H, are overly simplistic, which may partly explain why string processing is understudied. TPC-H queries involve little complex string processing and don't use strings as join or aggregation keys. Moreover, TPC-H strings have static upper bounds, allowing them to be treated as fixed-size objects. This sidesteps the real challenges of variable-size strings and their complex operations. More broadly, standard benchmarks fall short of reflecting real-world workloads, as they lack advanced relational operators (e.g., window functions, CTEs) and complex expressions. To drive meaningful progress, we likely need new, more realistic benchmarks. While the participants agreed on most points, one particularly interesting topic of discussion was distributed query processing. Allison pointed out that many query processing papers overlook distributed processing, making them hard to adopt in industrial systems. Hannes, however, argued that most user workloads can be handled on a single machine, which should be the primary focus of publicly funded research. My personal view is that both single-node and distributed processing are important, and there is ample room to address both challenges.

While database researchers often focus on database engine architectures, Andy argued that surrounding topics, such as network connection handling (e.g., database proxies), receive little attention despite their practical importance. Surprisingly, there is also limited research on scheduling database workloads and optimizing the network stack, even though communication bottlenecks frequently constrain efficient OLTP systems. Multi-statement stored procedures, though a potential solution, are not widely adopted and fail to address this issue in practice. I believe there are significant research opportunities in exploring how to better structure the interface between applications and database systems.

One striking fact about major database conferences, such as SIGMOD and VLDB, is how few papers address practical database system problems. From personal experience, I believe this presents a significant opportunity for researchers seeking both academic and real-world impact. Solutions to the problems discussed above (and many others) are likely to gain industry attention and be adopted by real database systems. Moreover, with the availability of open-source systems like DuckDB, DataFusion, LeanStore, and PostgreSQL, conducting systems research has become easier than ever.

Tuesday, December 10, 2024

C++ exception performance three years later

About three years ago we noticed serious performance problems in C++ exception unwinding. Due to contention on the unwinding path these became more and more severe the more cores a system had, and unwinding could slow down by orders of magnitude. Due to the constraints of backwards compatibility this contention was not easy to eliminate, and P2544 discussed ways to fix this problem via language changes in C++.

But fortunately people found less invasive solutions. First, Florian Weimer changed the glibc to provide a lock-free mechanism to find the (static) unwind tables for a given shared object. Which eliminates the most serious contention for "simple" C++ programs. For example in a micro-benchmark that calls a function with some computations (100 calls to sqrt per function invocation), and which throws with a certain probability, we previously had very poor scalability with increasing core count. With his patch we now see with gcc 14.2 on a dual-socket EPYC 7713 the following performance development (runtime in ms):

	1	2	4	8	16	32	64	128
0% failure	29	29	29	29	29	29	29	42
0.1% failure	29	29	29	29	29	29	29	32
1% failure	29	30	30	30	30	30	32	34
10% failure	36	36	37	37	37	37	47	65

Which is more or less perfect. 128 threads are a bit slower, but that is to be expected as one EPYC only has 64 cores. With higher failure rates unwinding itself becomes slower but that is still acceptable here. Thus most C++ programs are just fine.

For our use case that is not enough, though. We dynamically generate machine code at runtime, and we want to be able to pass exceptions through generated code. The _dl_find_object mechanism of glibc is not used for JITed code, instead libgcc maintains its own lookup structure. Historically this was a simple list with a global lock, which of course had terrible performance. But through a series of patches we managed to change libgcc into using a lock-free b-tree for maintaining the dynamic unwinding frames. Using a similar experiment to the one above, but now with JIT-generated code (using LLVM 19), we get the following:

	1	2	4	8	16	32	64	128
0% failure	32	38	48	64	48	36	59	62
0.1% failure	32	32	32	32	32	48	62	68
1% failure	41	40	40	40	53	69	80	83
10% failure	123	113	103	116	128	127	131	214

The numbers have more noise than for statically generated code, but overall observation is the same: Unwinding now scales with the number of cores, and we can safely use C++ exceptions even on machines with large core counts.

So is everything perfect now? No. First, only gcc has a fully scalable frame lookup mechanism. clang has its own implementation, and as far as I know it still does not scale properly due to a global lock in DwarfFDECache. Note that at least on many Linux distributions clang uses libgcc by default, thus the problem is not immediately obvious there, but a pure llvm/clang build will have scalability problems. And second unwinding through JIT-ed code is a quite a bit slower, which is unfortunate. But admittedly the problem is less severe than shown here, the benchmark with JITed code simply has a stack frame more to unwind due to the way static code and JITed code interact. And it makes sense to prioritize static unwinding over dynamic unwinding frames, as most people never JIT-generate code.

Overall we are now quite happy with the unwinding mechanism. The bottlenecks are gone, and performance is fine even with high core counts. It is still not appropriate for high failure rates, something like P709 would be better for that, but we can live with the status quo.

Thursday, June 6, 2024

B-trees Require Fewer Comparisons Than Balanced Binary Search Trees

Due to better access locality, B-trees are faster than binary search trees in practice -- but are they also better in theory? To answer this question, let's look at the number of comparisons required for a search operation. Assuming we store n elements in a binary search tree, the lower bound for the number of comparisons is log₂ n in the worst case. However, this is only achievable for a perfectly balanced tree. Maintaining such a tree's perfect balance during insert/delete operations requires O(n) time in the worst case.

Balanced binary search trees, therefore, leave some slack in terms of how balanced they are and have slightly worse bounds. For example, it is well known that an AVL tree guarantees at most 1.44 log₂ n comparisons, and a Red-Black tree guarantees 2 log₂ n comparisons. In other words, AVL trees require at most 1.44 times the minimum number of comparisons, and Red-Black trees require up to twice the minimum.

How many comparisons does a B-tree need? In B-trees with degree k, each node (except the root) has between k and 2k children. For k=2, a B-tree is essentially the same data structure as a Red-Black tree and therefore provides the same guarantee of 2 log₂ n comparisons. So how about larger, more realistic values of k?

To analyze the general case, we start with a B-tree that has the highest possible height for n elements. The height is maximal when each node has only k children (for simplicity, this analysis ignores the special case of underfull root nodes). This implies that the worst-case height of a B-tree is log_k n. During a lookup, one has to perform a binary search that takes log₂ k comparisons in each of the log_k n nodes. So in total, we have log₂ k * log_k n = log₂ n comparisons.

This actually matches the best case, and to construct the worst case, we have to modify the tree somewhat. On one (and only one) arbitrary path from the root to a single leaf node, we increase the number of children from k to 2k. In this situation, the tree height is still less than or equal to log_k n, but we now have one worst-case path where we need log₂ 2k (instead of log₂ k) comparisons. On this worst-case path, we have log₂ 2k * log_k n = (log₂ 2k) / (log₂ k) * log₂ n comparisons.

Using this formula, we get the following bounds:
k=2: 2 log₂ n
k=4: 1.5 log₂ n
k=8: 1.33 log₂ n
k=16: 1.25 log₂ n
...
k=512: 1.11 log₂ n

We see that as k grows, B-trees get closer to the lower bound. For k>=8, B-trees are guaranteed to perform fewer comparisons than AVL trees in the worst case. As k increases, B-trees become more balanced. One intuition for this result is that for larger k values, B-trees become increasingly similar to sorted arrays which achieve the log₂ n lower bound. Practical B-trees often use fairly large values of k (e.g., 100) and therefore offer tight bounds -- in addition to being more cache-friendly than binary search trees.

(Caveat: For simplicity, the analysis assumes that log₂ n and log₂ 2k are integers, and that the root has either k or 2k entries. Nevertheless, the observation that larger k values lead to tighter bounds should hold in general.)

Monday, February 19, 2024

SSDs Have Become Ridiculously Fast, Except in the Cloud

In recent years, flash-based SSDs have largely replaced disks for most storage use cases. Internally, each SSD consists of many independent flash chips, each of which can be accessed in parallel. Assuming the SSD controller keeps up, the throughput of an SSD therefore primarily depends on the interface speed to the host. In the past six years, we have seen a rapid transition from SATA to PCIe 3.0 to PCIe 4.0 to PCIe 5.0. As a result, there was an explosion in SSD throughput:

At the same time, we saw not just better performance, but also more capacity per dollar:

The two plots illustrate the power of a commodity market. The combination of open standards (NVMe and PCIe), huge demand, and competing vendors led to great benefits for customers. Today, top PCIe 5.0 data center SSDs such as the Kioxia CM7-R or Samsung PM1743 achieve up to 13 GB/s read throughput and 2.7M+ random read IOPS. Modern servers have around 100 PCIe lanes, making it possible to have a dozen of SSDs (each usually using 4 lanes) in a single server at full bandwidth. For example, in our lab we have a single-socket Zen 4 server with 8 Kioxia CM7-R SSDs, which achieves 100GB/s (!) I/O bandwidth:

AWS EC2 was an early NVMe pioneer, launching the i3 instance with 8 physically-attached NVMe SSDs in early 2017. At that time, NVMe SSDs were still expensive, and having 8 in a single server was quite remarkable. The per-SSD read (2 GB/s) and write (1 GB/s) performance was considered state of the art as well. Another step forward occurred in 2019 with the launch of i3en instances, which doubled storage capacity per dollar.

Since then, several NVMe instance types, including i4i and im4gn, have been launched. Surprisingly, however, the performance has not increased; seven years after the i3 launch, we are still stuck with 2 GB/s per SSD. Indeed, the venerable i3 and i3en instances basically remain the best EC2 has to offer in terms of IO-bandwidth/$ and SSD-capacity/$, respectively. Personally, I find this very surprising given the SSD bandwidth explosion and cost reductions we have seen on the commodity market. At this point, the performance gap between state-of-the-art SSDs and those offered by major cloud vendors, especially in read throughput, write throughput, and IOPS, is nearing an order of magnitude. (Azure's top NVMe instances are only slightly faster than AWS's.)

What makes this stagnation in the cloud even more surprising is that we have seen great advances in other areas. For example, during the same 2017 to 2023 time frame, EC2 network bandwidth exploded, increasing from 10 Gbit/s (c4) to 200 Gbit/s (c7gn). Now, I can only speculate why the cloud vendors have not caught up on the storage side:

One theory is that EC2 intentionally caps the write speed at 1 GB/s to avoid frequent device failure, given the total number of writes per SSD is limited. However, this does not explain why the read bandwidth is stuck at 2 GB/s.
A second possibility is that there is no demand for faster storage because very few storage systems can actually exploit tens of GB/s of I/O bandwidth. See our recent VLDB paper. On the other hand, as long as fast storage devices are not widely available, there is also little incentive to optimize existing systems.
A third theory is that if EC2 were to launch fast and cheap NVMe instance storage, it would disrupt the cost structure of its other storage service (in particular EBS). This is, of course, the classic innovator's dilemma, but one would hope that one of the smaller cloud vendors would make this step to gain a competitive edge.

Overall, I'm not fully convinced by any of these three arguments. Actually, I hope that we'll soon see cloud instances with 10 GB/s SSDs, making this post obsolete.

Sunday, April 9, 2023

The Great CPU Stagnation

For at least five decades, Moore's law consistently delivered increasing numbers of transistors. Equally significant, Dennard scaling led to each transistor using less energy, enabling higher clock frequencies. This was great, as higher clock frequencies enhanced existing software performance automatically, without necessitating any code rewrite. However, around 2005, Dennard scaling began to falter, and clock frequencies have largely plateaued since then.

Despite this, Moore's law continued to advance, with the additional available transistors being channeled into creating more cores per chip. The following graph displays the number of cores for the largest available x86 CPU at the time:
Notice the logarithmic scale: this represents the exponential trend we had become accustomed to, with core counts doubling roughly every three years. Regrettably, when considering cost per core, this impressive trend appears to have stalled, ushering in an era of CPU stagnation.

To demonstrate this stagnation, I gathered data from wikichip.org on AMD's Epyc single-socket CPU lineup, introduced in 2017 and now in its fourth generation (Naples, Rome, Milan, Genoa):

Model	Gen	Launch	Cores	GHz	IPC	Price
7351P	Naples	06/2017	16	2.4	1.00	$750
7401P	Naples	06/2017	24	2.0	1.00	$1,075
7551P	Naples	06/2017	32	2.0	1.00	$2,100
7302P	Rome	08/2019	16	3.0	1.15	$825
7402P	Rome	08/2019	24	2.8	1.15	$1,250
7502P	Rome	08/2019	32	2.5	1.15	$2,300
7702P	Rome	08/2019	64	2.0	1.15	$4,425
7313P	Milan	03/2021	16	3.0	1.37	$913
7443P	Milan	03/2021	24	2.9	1.37	$1,337
7543P	Milan	03/2021	32	2.8	1.37	$2,730
7713P	Milan	03/2021	64	2.0	1.37	$5,010
9354P	Genoa	11/2022	32	3.3	1.57	$2,730
9454P	Genoa	11/2022	48	2.8	1.57	$4,598
9554P	Genoa	11/2022	64	3.1	1.57	$7,104
9654P	Genoa	11/2022	96	2.4	1.57	$10,625

Over these past six years, AMD has emerged as the x86 performance per dollar leader. Examining these numbers should provide insight into the state of server CPUs. Let's first observe CPU cores per dollar:

This deviates significantly from the expected exponential improvement graphs. In fact, CPU cores are becoming slightly more expensive over time! Admittedly, newer cores outperform their predecessors. When accounting for both clock frequency and higher IPC, we obtain the following image:

This isn't much better. The performance improvement over a 6-year period is underwhelming when normalized for cost. Similar results can also be observed for Intel CPUs in EC2.

Lastly, let's examine transistor counts, only taking into account the logic transistors. Despite improved production nodes from 14nm (Naples) over 7nm (Rome/Milan) to 5nm (Genoa), cost-adjusted figures reveal stagnation:

In conclusion, the results are disheartening. Rapid and exponential improvements in CPU speed seem to be relics of the past. We now find ourselves in a markedly different landscape compared to the historical norm in computing. The implications could be far-reaching. For example, most software is extremely inefficient when compared to what hardware can theoretically achieve, and maybe this needs to change. Furthermore, historically specialized chips enjoyed only limited success due to the rapid advancement of commodity CPUs. Perhaps, custom chips will have a much bigger role in the future.

P.S. Due to popular demand, here's how the last graph looks like after adjusting for inflation:

Tuesday, February 7, 2023

Five Decades of Database Research

Since 1975, over 24 thousand articles have have been published in major database venues (SIGMOD, VLDB/PVLDB, ICDE, EDBT, CIDR, TODS, VLDB Journal, TKDE). The number of papers per year is rising:

Over time, the topics change. Looking at the percentage of keywords appearing in paper titles (in that particular year), we can see interesting trends: