Tuesday, July 29, 2014

The LDBC Social Network Benchmark

The Linked Data Benchmark Council is a new benchmarking organization for RDF and graph data management technology (from neo4j to Giraph to Owlim) and the Social Network Benchmark (SNB) is one of its first initiatives. The SNB was created from the LDBC EU project, in which both Thomas and I are active, and was already used by this year's ACM SIGMOD Programming Contest, which was about graph analytics.
SNB is intended to provide the following value to different stakeholders:
  • For end users facing graph processing tasks, SNB provides a recognizable scenario against which it is possible to compare merits of different products and technologies.  By covering a wide variety of scales and price points, SNB can serve as an aid to technology selection.
  • For vendors of graph database technology, SNB provides a checklist of features and performance characteristics that helps in product positioning and can serve to guide new development.
  • For researchers, both industrial and academic, the SNB dataset and workload provide interesting challenges in multiple technical areas, such as query optimization, (distributed) graph analysis, transactional throughput, and provides a way to objectively compare the effectiveness and efficiency of new and existing technology in these areas.
We should clarify that even though the data model of SNB resembles Facebook (and we're extending it to also look more like Twitter), the goal of SNB is not to advise Facebook or Twitter what systems to use, they don't need LDBC for that. Rather, we take social network data as a model for the much more broader graph data management problems that IT practitioners face. The particular characteristic of a graph data management problem is that the queries and analysis is not just about finding data by value, but about learning about the connection patterns between data. The scenario of the SNB, a social network, was chosen with the following goals in mind:
  • the benchmark scenario should be understandable to a large audience, and this audience should also understand the relevance of managing such data.
  • the scenario in the benchmark should cover the complete range of challenges relevant for graph data management, according to the benchmark scope.
  • the query challenges in it should be realistic in the sense that, though synthetic, similar data and workloads are encountered in practice.
The SNB is in fact three distinct benchmarks with a common dataset, since there are three different workloads. Each workload produces a single metric for performance at the given scale and a price/performance metric at the scale.  The full disclosure further breaks down the composition of the metric into its constituent parts, e.g. single query execution times.
  • Interactive Workload.  The Interactive SNB workload is the first one we are releasing. It is defined in plain text, yet we have example implementations in neo4j's Cypher, SPARQL and SQL. The interactive workloads tests a system's throughput with relatively simple queries with concurrent updates.  The system under test (SUT) is expected to run in a steady state, providing durable storage with smooth response times.  Inserts are typically small, affecting a few nodes at a time, e.g. uploading of a post and its tags.  Transactions may require serializability, e.g. verifying that something does not exist before committing the transaction.   Reads do not typically require more than read committed isolation. One could call the Interactive Workload an OLTP workload, but while queries typically touch a small fraction of the database, this can still be up to hundreds of thousands of values (the two-step neighborhood of a person in the social graph, often). Note that in order to support the read-queries, there is a lot of liberty to create indexing structures or materialized views, however such structures need to be maintained with regards to the continues inserts that also part of the workload. This workload is now in draft stage, which means that the data generator and  driver software stack is is ready and the purpose is to obtain user feedback, as well as develop good system implementations.  The first implementations of this workload are now running on Openlink Virtuoso, Neo4j and Sparsity Sparksee, and we are eager to see people try these, and optimize and involve these.
  • Business Intelligence Workload. There is a first stab at this workload formulated in SPARQL, tested against Openlink Virtuoso. The BI workload consists of complex structured queries for analyzing online behavior of users for marketing purposes.  The workload stresses query execution and optimization. Queries typically touch a large fraction of the data and do not require repeatable read.  The queries will be concurrent with trickle load (not out yet). Unlike the interactive workload, the queries touch more data as the database grows.
  • Graph Analytics Workload. This workload is not yet available. It will test the functionality and scalability of the SUT for graph analytics that typically cannot be expressed in a query language. As such it is the natural domain for graph programming frameworks like Giraph. The workload is still under development, but will consist of algorithms like PageRank, Clustering and Breadth First Search. The analytics is done on most of the data in the graph as a single operation.  The analysis itself produces large intermediate results.  The analysis is not expected to be transactional or to have isolation from possible concurrent updates.
All the SNB scenarios share a common scalable synthetic data set, generated by a state-of-the art data generator. We strongly believe in a single dataset that makes sense for all workloads, that is, the interactive and BI workloads will traverse data that has sensible PageRank outcomes, and graph clustering structure, etc. This is in contrast to LinkBench, released by the team of Facebook that manages the OLTP workload on the Facebook Graph, which closely tunes to the low-level MySQL query patterns Facebook sees,but whose graph structure does not attempt to be realistic beyond average out degree of the nodes (so, it makes not attempt to create realistic community patterns or correlations) . The authors of LinkBench may be right that  the graph structure does not make a difference for simple insert/update/delete/lookup actions which LinkBench itself tests, but for the SNB queries in the Interactive and BI workloads this is not true. Note that Facebook's IT infrastructure does not store all user data in MySQL and its modified memcached ("TAO"), some of it ends up in separate subsystems (using HDFS and HBase), which is outside of the scope of LinkBench. However, for queries like in the SNB Interactive and BI workloads it does matter how people are connected, and how the attribute values  of connected people correlate. In fact, the SNB data generator is unique in that it generates a huge graph with correlations, where people who live together, have the same interests or work for the same company have greater chance to be connected, and people from Germany have mostly German names, etc. Correlations frequently occur in practice and can strongly influence the quality of query optimization and execution, therefore LDBC wants to test their effects on graph data management systems (the impact of correlation among values and structure on query optimization and execution are a "choke point" for graph data management system where LDBC wants to stimulate innovation). 

As mentioned, we hope SNB will be used on a broad variety of systems, from graph databases (neo4j, Sparksee) to graph programming frameworks (Giraph,GraphLab), RDF databases (Virtuoso,OWLIM), but even relational systems as well as NoSQL systems. The workloads are quite different and not all combinations of systems and workloads are even likely. The below table of SNB workloads versus systems shows our current thinking:

Please take a look at ldbcouncil.org/developer/snb to find all relevant technical information on SNB. We are eager to hear your comments.


  1. I am not suggesting the LDBC effort isn't needed. Nor am I suggesting that LinkBench is _the_ graph benchmark.

    By using HBase for "posts on user's wall" it would be more clear to state that the Messages product (chat, etc) was documented to use HBase -- https://www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919

    Much more than friend lists use Tao+MySQL. You can call it the social graph -- https://www.facebook.com/notes/facebook-engineering/tao-the-power-of-the-graph/10151525983993920

    I run linkbench frequently but I didn't write it and I haven't looked at the graph and node generation code. But I don't understand your statement about LinkBench using a single supernode that connects to all others. I don't see any reference to that in the paper. I wasn't aware this was done. Can you elaborate?

  2. Hi Mark,

    Thanks for your comment. Indeed this is not mentioned in the LinkBench paper, but after the presentation last year at SIGMOD2013 I approached the primary author, Timothy Armstrong, and asked what the structure of the graph in Linkbench really looks like. He said that they primarily focused on making the accessed volume by MySQL as realistic as possible but that the graph structure as a whole was not accounted for, and indeed that there were one (or a few) nodes that connected to very very many nodes, mostly to get some global average right. This is what I remember. It is a bit vague but Timothy certainly said that the structure of the LinkBench graph itself is by no means realistic.

    Peter Boncz

  3. I don't think "realistic" is black & white for benchmarks. It has nuance -- things that matter less are made less realistic (synthetic data generators for example). Things that matter more get more effort.

    One of the things that mattered more was to understand the structure of the social graph and then model it using (synthetic) distributions. There is a lot more to it then making the average number of links per node match production and the paper has a lot of details on that -- http://people.cs.uchicago.edu/~tga/pubs/sigmod-linkbench-2013.pdf

    In real life there are some nodes with many links. The World Cup page might have been a recent example of that. There are other nodes with not so many links (my FB page). So a few nodes with many links isn't unrealistic.

  4. The LinkBench sources in the relevant places have not been modified since publication, so I think that what I learned about the LinkBench graph structure both from speaking to Timothy and examining its source code a year ago, is still valid. The details are indeed in the paper you cite, though the relevant part on the graph structure (sec 3.3.3) is just half a page long, so does not contain very many details as you write (but the code is open-source, so no problem), and let me quote from that half page:

    "Further higher-order properties (than out-degree - PB) could be examined, such as clustering coefficient or community structure. However, for the purpose of modeling our database workload, we believe that higher-order graph structure in communities or correlation between properties of neighboring nodes will have only a small effect on database performance."

    Well, the belief that correlations do not matter for the LinkBench SQL workload may very well be true, as it consists of relatively small lookup and update queries, that do not navigate the graph structure much. However, in the LDBC SNB task force we have different workloads. For these workloads, representing the properties attached to Persons in the network by simple bytes of payload does not suffice even to formulate queries. In fact, we see that the complex and correlated structure of the network does play an important performance role, even in queries that access the single or two-step neighborhoods, such as in the interactive workload (in the BI and graph analytics workload, the effects are likely stronger, but we have to find out about that). For the interactive workload, we even had to adopt specific measures to keep the effects under control and assure understandable benchmark results, mining the dataset for benchmark parameters that produce similar results ("Parameter Curation").

    See the forthcoming TPCTC paper:

    Thanks again for the excellent remarks. I think LinkBench is a great effort, but I hope that the difference in graph structure between it and the SNB data generator are becoming clearer in this discussion. And that was one of the goals of this blog post ;-)

  5. Getting correlation into a synthetic graph generator will be very useful. Perhaps one day we can get another ambitious PhD intern to work on it at FB to document and implement the structure that occurs in our graph.

  6. Assuming LDBC includes workloads with complex queries then you want correlation in the graph to make optimizers work harder. LinkBench queries don't need an optimizer. We know the plans for each and they are fixed independent of the values of literals in the WHERE clause.

  7. "posts on walls" fall into the social-graph/timeline which is MySQL at Facebook (comments are nodes too with edges pointing at posts, etc). We call "social graph" not just friend-to-friend relationships, but also all objects that people interact with in social contexts (sharing, liking, etc). Though Linkbench may not represent some patterns that emerged with newer products, it still has much more than friend relationships.

  8. Hi Domas,

    Thanks for your comment! My understanding came from a SIGMOD 2013 presentation given by Dhruba Borthakur (http://borthakur.com/ftp/sigmod2013.pdf), which on slide 5 says that "Facebook Messages" are stored in HBase and the Friends Graph is stored in MySQL/TAO.

    But given that I assume you work on the Facebook MySQL/TAO part, you definitely know best.

    Maybe it is the case that Dhruba meant to say that all Facebook messages get archived in HBase, i.e. that after some time limit certain data is migrated off MySQL/TAO?



  9. Note: I just edited my description of the use of MySQL at Facebook. Thanks for the feedback!