The LDBC Social Network Benchmark
The Linked Data Benchmark Council is a new benchmarking organization for RDF and graph data management technology (from Neo4j to Giraph to OWLIM), and the Social Network Benchmark (SNB) is one of its first initiatives. The SNB grew out of the LDBC EU project, in which both Thomas and I are active, and was already used by this year's ACM SIGMOD Programming Contest, which was about graph analytics.
SNB is intended to provide the following value to different stakeholders:
- For end users facing graph processing tasks, SNB provides a recognizable scenario against which it is possible to compare merits of different products and technologies. By covering a wide variety of scales and price points, SNB can serve as an aid to technology selection.
- For vendors of graph database technology, SNB provides a checklist of features and performance characteristics that helps in product positioning and can serve to guide new development.
- For researchers, both industrial and academic, the SNB dataset and workload provide interesting challenges in multiple technical areas, such as query optimization, (distributed) graph analysis and transactional throughput, and offer a way to objectively compare the effectiveness and efficiency of new and existing technology in these areas.
The benchmark scenario was designed to meet the following requirements:
- The benchmark scenario should be understandable to a large audience, and this audience should also understand the relevance of managing such data.
- The scenario in the benchmark should cover the complete range of challenges relevant for graph data management, according to the benchmark scope.
- The query challenges in it should be realistic in the sense that, though synthetic, similar data and workloads are encountered in practice.
The SNB is in fact three distinct benchmarks with a common dataset, since there are three different workloads. Each workload produces a single performance metric at a given scale and a price/performance metric at that scale. The full disclosure further breaks down the metric into its constituent parts, e.g. the execution times of individual queries.
- Interactive Workload. The Interactive SNB workload is the first one we are releasing. It is defined in plain text, and we have example implementations in Neo4j's Cypher, SPARQL and SQL. The interactive workload tests a system's throughput on relatively simple queries with concurrent updates. The system under test (SUT) is expected to run in a steady state, providing durable storage with smooth response times. Inserts are typically small, affecting a few nodes at a time, e.g. uploading a post and its tags. Transactions may require serializability, e.g. when verifying that something does not exist before committing the transaction. Reads do not typically require more than read committed isolation. One could call the Interactive Workload an OLTP workload, but while queries typically touch a small fraction of the database, this can still be up to hundreds of thousands of values (often the two-step neighborhood of a person in the social graph). Note that there is a lot of liberty to create indexing structures or materialized views in order to support the read queries; however, such structures need to be maintained under the continuous inserts that are also part of the workload. This workload is now in draft stage, which means that the data generator and driver software stack are ready and the purpose is to obtain user feedback, as well as to develop good system implementations. The first implementations of this workload are now running on OpenLink Virtuoso, Neo4j and Sparsity Sparksee, and we are eager to see people try these, and optimize and evolve them.
- Business Intelligence Workload. There is a first stab at this workload formulated in SPARQL, tested against OpenLink Virtuoso. The BI workload consists of complex structured queries for analyzing the online behavior of users for marketing purposes. The workload stresses query execution and optimization. Queries typically touch a large fraction of the data and do not require repeatable read. The queries will run concurrently with a trickle load (not yet available). Unlike in the interactive workload, the queries touch more data as the database grows.
- Graph Analytics Workload. This workload is not yet available. It will test the functionality and scalability of the SUT for graph analytics that typically cannot be expressed in a query language. As such, it is the natural domain for graph programming frameworks like Giraph. The workload is still under development, but will consist of algorithms like PageRank, clustering and breadth-first search. The analysis runs over most of the data in the graph as a single operation and produces large intermediate results. It is not expected to be transactional or to be isolated from possible concurrent updates.
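To make the contrast between the workloads a bit more tangible, here is a minimal Python sketch on a toy graph. It is not taken from the SNB specification or driver: the `KNOWS` data and the function names `two_step_neighborhood` and `bfs_levels` are made up for illustration. The first function mimics the kind of local lookup an interactive query performs (the two-step neighborhood of one person), while the second is a plain breadth-first search of the kind the analytics workload is expected to include, which may visit most of the graph in one run.

```python
from collections import deque

# Hypothetical toy "knows" graph (an assumption, not SNB data):
# person id -> set of friend ids, stored symmetrically.
KNOWS = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4, 5},
    4: {2, 3, 6},
    5: {3},
    6: {4},
}


def two_step_neighborhood(graph, person):
    """Return friends and friends-of-friends of `person`, excluding the person."""
    friends = graph.get(person, set())
    result = set(friends)
    for friend in friends:
        result |= graph.get(friend, set())
    result.discard(person)
    return result


def bfs_levels(graph, start):
    """Breadth-first search: hop distance from `start` to every reachable person."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


if __name__ == "__main__":
    print(sorted(two_step_neighborhood(KNOWS, 1)))
    print(bfs_levels(KNOWS, 1))
```

Even on this tiny graph the two-step neighborhood of person 1 covers four of the six people, which hints at why "small" interactive queries can still touch many values on a real social graph.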
As mentioned, we hope SNB will be used on a broad variety of systems, from graph databases (Neo4j, Sparksee) to graph programming frameworks (Giraph, GraphLab) and RDF databases (Virtuoso, OWLIM), and even relational and NoSQL systems. The workloads are quite different, and not every combination of system and workload is likely. The table below of SNB workloads versus systems shows our current thinking:
Please take a look at ldbcouncil.org/developer/snb to find all relevant technical information on SNB. We are eager to hear your comments.