Sunday, June 29, 2014

Main-memory vs. disk-based

Given today's RAM sizes, the working set of a database is in main-memory most of the time, even for disk-based systems. However, it makes a big difference whether the system knows that the data will be in main-memory or whether it has to expect disk I/O.

We can illustrate that nicely in the HyPer system, which is a pure main-memory database system. Due to the way it was developed, it is nearly 100% source-compatible with an older project that implemented a System R-style disk-based OLAP engine aimed at very large (and thus cold-cache) OLAP workloads. The disk-based engine includes a regular buffer manager, locking/latching, a column store using compressed pages, etc. This high degree of compatibility allows for an interesting experiment, namely replacing the data access of a main-memory system with that of a disk-based system (thanks to Alexander Böhm for the idea).
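A minimal sketch of what such a swap can look like (hypothetical interfaces, not the actual HyPer or OLAP-engine code): as long as both engines expose the same scan interface, everything above the scan stays untouched.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical common interface both engines implement.
struct TableScan {
   virtual ~TableScan() = default;
   // Fill the next batch of values of one column; return false when exhausted.
   virtual bool nextBatch(std::vector<int64_t>& out) = 0;
};

// Main-memory variant: reads directly from a resident, uncompressed column.
class InMemoryScan : public TableScan {
   const std::vector<int64_t>& column;
   std::size_t pos = 0;
public:
   explicit InMemoryScan(const std::vector<int64_t>& c) : column(c) {}
   bool nextBatch(std::vector<int64_t>& out) override {
      if (pos >= column.size()) return false;
      std::size_t end = std::min(pos + 1024, column.size());
      out.assign(column.begin() + pos, column.begin() + end);
      pos = end;
      return true;
   }
};

// Disk-based variant: in a real engine this would pin pages in the buffer
// manager, latch them, and decompress the stored values. Stubbed out here;
// the point is only that it hides behind the same interface.
class BufferedScan : public TableScan {
public:
   bool nextBatch(std::vector<int64_t>& out) override {
      (void)out; // pin page, latch, decompress, copy values ...
      return false;
   }
};

// Joins, aggregations, etc. consume a TableScan&, so swapping InMemoryScan for
// BufferedScan changes the data access path without touching the query plan.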

In the following experiment we replaced the table scan operator of HyPer with the table scan of the disk-based system, but left all other operators like joins untouched. We then executed all TPC-H queries on SF1 repeatedly, using identical execution plans, and compared the performance of the disk-based scan to the original HyPer system. After the first pass all data is in main-memory, so both systems are effectively main-memory systems, but of course the disk-based system does not know this and assumes data is coming from disk. In the experiments we disabled parallelism and index nested loop joins, as these were not supported by the older project. All runtimes are in milliseconds on SF1 (single-threaded).



Query            1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22
main-memory     50    4   20   72   11   52   43   17  132   44    5   42  117   12   14   41   34  225  133   20  115   17
disk-based     374   40  263  208  240  185  277  271  316  202   47  275  170  182  182   53  250  461  200  204  556   39

It is evident that the main-memory system is much faster, by approximately a factor of 5 in the geometric mean. The interesting question is why. Profiling showed that a main culprit was compression. The disk-based system compressed data quite aggressively, which is a good idea if the data is indeed coming from disk, as it increases effective throughput, but a very bad idea if the data is already in memory to begin with. For scan-heavy queries like Q1, up to nearly 70% of the time was spent on decompressing the data. Or, to phrase it differently, we could have improved the performance of these queries by nearly a factor of 4 by avoiding that expensive decompression (other queries are less affected, but pay for decompression, too).

Note that even though that aggressive compression looks like a bad idea today (and it probably is, given today's systems), it was plausible when the system was designed. Aggressive compression saves disk I/O, and if the system is waiting for disk the only thing that matters is that decompression is faster than the disk drive, which is the case here. Today we would prefer a more light-weight compression scheme that does not add such a high CPU overhead when the data is already in main-memory, but which of course offers worse performance if the data should indeed come directly from disk...
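A back-of-envelope sketch of that trade-off (all bandwidth and ratio numbers below are illustrative assumptions, not measurements from the systems above): when scanning from disk, decompression only has to keep up with the drive, but when scanning from memory it competes with plain memory bandwidth.

#include <algorithm>
#include <cstdio>

int main() {
   double diskBandwidth = 150;   // MB/s of compressed data from a disk (assumed)
   double memBandwidth = 10000;  // MB/s for an uncompressed in-memory scan (assumed)
   double heavyRatio = 4, heavyDecomp = 500;   // aggressive compression: 4x, 500 MB/s decompression
   double lightRatio = 2, lightDecomp = 4000;  // light-weight compression: 2x, 4 GB/s decompression

   // Effective scan rate in logical (uncompressed) MB/s, assuming disk read and
   // decompression are pipelined: the slower of the two stages dominates.
   auto fromDisk = [&](double ratio, double decomp) {
      return std::min(ratio * diskBandwidth, decomp);
   };

   std::printf("from disk  : heavy %.0f MB/s, light %.0f MB/s\n",
               fromDisk(heavyRatio, heavyDecomp), fromDisk(lightRatio, lightDecomp));
   // In memory the data is already resident, so only decompression (or nothing) matters.
   std::printf("from memory: heavy %.0f MB/s, light %.0f MB/s, uncompressed %.0f MB/s\n",
               heavyDecomp, lightDecomp, memBandwidth);
   return 0;
}

With these assumed numbers the aggressive scheme wins when reading from disk (500 vs. 300 MB/s effective), but loses badly once the data is resident (500 vs. 4000 MB/s, and 10000 MB/s without any compression), which is exactly the effect the profile showed.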

The compression explains roughly a factor of 3 in performance difference, so where does the rest come from? There is no obvious hot spot; performance is lost all over the place. Buffering, latching, memory management, tuple reconstruction, data passing: all components add some overhead here and there, with a quite significant overhead in total. All of that was largely irrelevant as long as we waited for disk I/O, but in the main-memory case the overhead is quite painful.
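As a rough illustration (hypothetical structures, not the actual code of either system), the disk-based path pays for a buffer-pool lookup, a latch, and tuple reconstruction on every page, while the main-memory path is a plain loop over a contiguous column:

#include <cstdint>
#include <shared_mutex>
#include <unordered_map>
#include <vector>

struct Page { std::shared_mutex latch; std::vector<int64_t> slots; };

// Disk-based style scan: hash lookup in the buffer pool, shared latch against
// concurrent eviction, then reading the slots. Each step is cheap on its own,
// but the per-page (and per-tuple) overhead adds up.
int64_t sumBuffered(std::unordered_map<uint64_t, Page>& bufferPool,
                    const std::vector<uint64_t>& pageIds) {
   int64_t sum = 0;
   for (uint64_t pid : pageIds) {
      Page& page = bufferPool.at(pid);     // buffer pool lookup
      std::shared_lock guard(page.latch);  // latch the page
      for (int64_t v : page.slots) sum += v;
   }
   return sum;
}

// Main-memory style scan: the column is a plain array, there is nothing else to do.
int64_t sumInMemory(const std::vector<int64_t>& column) {
   int64_t sum = 0;
   for (int64_t v : column) sum += v;
   return sum;
}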

So what can we learn from that little experiment? First, systems should be designed with the main-memory case in mind. Tuning originally disk-based systems for the in-memory case is difficult, as it requires removing layers and overhead all over the system. It is not impossible; with sufficient determination we could probably get the disk-based system within a small factor of the main-memory system, but it would be a lot of work.
And second, comparisons between disk-based systems and main-memory systems are unfair. Here, it looks like the main-memory system wins by a large margin, and of course it does, in most settings. If, however, the data really came from disk, without any caching, the aggressive compression of the disk-based system would have paid off, as it would fetch less data from the slow disk. These are really two different use cases, and even though main-memory is becoming the norm, there is still some use for disk-based systems.
