Sun 8 Jan 2006
In building applications that let end users manage information from multiple sources in an intuitive manner, we, the Haystack and Simile projects, have over time moved to using the W3C’s Semantic Web framework and the RDF standard for sharing data and as our applications’ backends. Using RDF requires us to either build or find an easily deployable, high-performance data store, and we have therefore planned an in-depth ‘RDF DB Shootout’. Preliminary results indicate that the Sesame Native store has the best performance.
Our preliminary evaluation ran tests through the Java APIs of the available RDF stores. We considered: the Jena 2.2 API with Apache Derby and Berkeley DB backends, the Sesame 1.2.2 Native Store, and Haystack’s Cholesterol 4 DB. We also tried the Redland APIs; however, we had difficulty getting them built with the Java API on Windows.
Results: Sesame’s Native store added data roughly 20x faster and performed lookups roughly 2.5x faster than either of the Jena backends we tried. Compared to Sesame, Cholesterol ran roughly 50% slower in data addition but roughly 50% faster in lookups.
Performance Suite: The performance suite is available at http://simile.mit.edu/repository/shootout/trunk/ and currently consists of two parts: data adds and lookups. The add portion of the test iterates through 10^5 resources and adds 10 RDF statements for each, every statement having one of 10 predicates (picked randomly) and one of the 10^5 resources (picked randomly) as its object. The lookup portion of the test iterates 1000 times: it picks a random node (one of the 10^5 resources), looks up all 10 of its statements to find the set of child nodes, looks up the children’s statements to go a second level, and then repeats to a third level, performing 1000 x (1 + 10 x (1 + 10 x (1 + 10))) lookups = 1,111,000 lookups. Because some of the randomly generated statements can be identical, slightly fewer statements are actually traversed.
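To make the shape of the two phases concrete, here is a minimal sketch in Python against a plain in-memory dictionary. It is not the actual suite (which is written against the Java APIs), and the names `build_store` and `traverse`, the integer resource IDs, and the seeds are all assumptions made for illustration. It only mirrors the description above: random statements per resource, then a three-level walk from random start nodes.

```python
import random


def build_store(num_resources, statements_per_resource=10, num_predicates=10, seed=0):
    """Add phase: give every resource up to 10 random (predicate, object) pairs."""
    rng = random.Random(seed)
    store = {}
    for subject in range(num_resources):
        statements = set()
        for _ in range(statements_per_resource):
            predicate = rng.randrange(num_predicates)
            obj = rng.randrange(num_resources)
            statements.add((predicate, obj))  # identical statements collapse
        store[subject] = statements
    return store


def traverse(store, num_resources, iterations=1000, depth=3, seed=1):
    """Lookup phase: walk `depth` levels out from random start nodes.

    Returns the total number of statements traversed.
    """
    rng = random.Random(seed)
    traversed = 0
    for _ in range(iterations):
        frontier = [rng.randrange(num_resources)]
        for _ in range(depth):
            next_frontier = []
            for node in frontier:
                statements = store[node]  # "given a subject, find all statements"
                traversed += len(statements)
                # The set of distinct objects becomes this node's children.
                next_frontier.extend({obj for _, obj in statements})
            frontier = next_frontier
    return traversed
```

With the default 1000 iterations, `traverse` can return at most 1000 x (10 + 100 + 1000) = 1,110,000 statements. It returns less whenever statements or children coincide, which appears to be why the 'statements traversed' figures in the results fall slightly below that bound.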
Details: Tests were run on a Windows XP machine with a 2.53 GHz Pentium 4 and 1 GB RAM.
During the evaluation Cholesterol did hit a scalability bug. Also, quick runs showed Jena with a MySQL backend to be more performant than the Jena backends we tried; however, the benefits were not large enough to overcome the deployment obstacles (and performance was still significantly slower than the Sesame-Native backend).
The table below shows the test results:
| RDF DB API | Add run-time (s) | Lookup run-time (s) | Statements traversed |
| --- | --- | --- | --- |
| Sesame-Native | 96 | 23 | 1,109,340 |
| Cholesterol [1] | 126 | 14 | 1,100,570 |
| Jena-BDB [2] | 1,750 | 54 | 1,093,727 |
| Jena-Derby [2] | 1,710 | 53 | 1,103,176 |
1: Cholesterol ran fast, but seems to have scalability problems (it was getting heap errors at around 1/2 the data set size). Its results are therefore from 1/6th of the data, with the add time multiplied by 6. Given the exponential increase in time towards the end (as the heap limit was being reached), we expect this to be an under-estimate.
2: The Jena backends have a very slow add run-time. We therefore added just 1/10th of the data and multiplied the add time by 10. We expect this number to be a fairly good estimate. We used the jena-derby driver available here. There is another driver built by IBM, available here, but it did not work.
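For readers who want to recompute the relative figures, the short sketch below derives run-time ratios against Sesame-Native from the numbers in the table. The `RESULTS` dictionary and the `relative_to` helper are illustrative names, not part of the suite. Keep in mind that the Cholesterol and Jena add times are extrapolations, as the footnotes explain, so the add ratios are rough.

```python
# Run times in seconds, copied from the table above: (add, lookup).
RESULTS = {
    "Sesame-Native": (96, 23),
    "Cholesterol": (126, 14),  # add time extrapolated from 1/6 of the data
    "Jena-BDB": (1750, 54),    # add time extrapolated from 1/10 of the data
    "Jena-Derby": (1710, 53),  # add time extrapolated from 1/10 of the data
}


def relative_to(baseline, results=RESULTS):
    """Return each store's (add, lookup) run time divided by the baseline's."""
    base_add, base_lookup = results[baseline]
    return {
        name: (add / base_add, lookup / base_lookup)
        for name, (add, lookup) in results.items()
    }


if __name__ == "__main__":
    for name, (add_ratio, lookup_ratio) in relative_to("Sesame-Native").items():
        print(f"{name:14s} add x{add_ratio:.2f}  lookup x{lookup_ratio:.2f}")
```

A ratio above 1 means the store took longer than Sesame-Native on that phase; a ratio below 1 means it finished sooner.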
Limitations: The performance suite needs to be extended to provide more comprehensive results, and we would appreciate any help in this direction. The most notable limitation of the suite is that it currently only tests adds and lookups (given a subject, find all statements). While evaluation with complex queries is important, most of the performance bottlenecks in our experience have been in the simpler lookups, and we have therefore focused on only those parts.
Biases: I have spent 1 week fixing bugs in the Cholesterol DB, 1 week of engineering on building Jena-Derby drivers, and 2 days adding a feature to Sesame. Relo currently uses Sesame (but due to knowledge of these results).
May 18th, 2006 at 9:08 am
[...] This post has looked mostly at the question of storage, but another unfortunate truth is that existing semantic Web data systems also suck in the times it takes for initial ingest and subsequent repository querying and retrieval. Some general sources for these observations come from combining metrics from OntoWorld and an interesting post on LargeTripleStores on the ESW (W3C) Wiki. Here are some additional useful links from the ESW wiki, TripleStoreScalability regarding scalability: Vineet Sinha, 2006, RDF DB Shootout: Preliminary Results. See also the follow-up discussion on simile-general [...]