Tracker RDF database performance

I have spent time on and off over the past week looking at the performance of the Tracker RDF database. Tracker, I believe, started out as a desktop search tool for GNOME. I never used it in that incarnation, and it has only come to my attention since version 0.7, when the developers implemented a general-purpose RDF storage engine at its core. I wanted to know how this newly implemented RDF database compared to a widely used RDF database in terms of query performance. In essence, I was interested in whether the Tracker project had spawned something that could compete with Virtuoso and 4Store.

You can find my complete results and analysis at the Tracker mailing list archive, but the headline statement is that Tracker had roughly 9 times the query performance of Virtuoso. The graph here shows the breakdown by query.
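As a rough illustration of how an overall figure relates to a per-query breakdown, the sketch below computes how many complete query mixes finish per second from per-query rates. The rates are made up for the example and are not my measured results; the function name `mixes_per_second` and the assumption that each query runs once per mix, one after another, are simplifications of my own.

```python
def mixes_per_second(queries_per_second):
    """Rate of complete query mixes, assuming each query in the mix
    runs exactly once and the queries run one after another."""
    # Time for one mix is the sum of the time each query takes.
    total_seconds = sum(1.0 / qps for qps in queries_per_second.values())
    return 1.0 / total_seconds


# Hypothetical rates (queries per second), NOT benchmark results.
store_a = {"Q1": 100.0, "Q2": 100.0, "Q3": 100.0, "Q4": 100.0}
# store_b is four times faster on three queries but very slow on one.
store_b = {"Q1": 400.0, "Q2": 400.0, "Q3": 400.0, "Q4": 2.0}

print("store_a mixes/s:", mixes_per_second(store_a))
print("store_b mixes/s:", mixes_per_second(store_b))
```

In this toy case the store that wins on most individual queries still completes far fewer mixes per second, because the single slow query dominates the time for each mix.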

This is a drastic difference in performance that greatly favours the home-grown database used by Tracker. However, this stellar performance comes at the cost of flexibility. It's obvious that the database has been tailored very much to the needs of Tracker itself. Unlike Virtuoso, it is not ‘schema-free’. A description of the data (in the form of something called an RDF ontology) is required for storage. In addition to this, the data formats are more restrictive, and some common elements of RDF are missing.

My general impression was that Tracker has great query performance, especially considering its tiny memory footprint. Unfortunately it is not suited to storage of pre-existing RDF data sets, such as those generated for semantic-web applications. This could well change in the future. Tracker and its RDF database are in heavy development. They already have speed and, seemingly, stability in the code base. It might soon be time to add the new features that would make it more generally applicable.

I should add that when I started this work I was heavily sceptical. Codethink have been highly involved in RDF, but I have not joined in. I have learned a lot in the past few weeks, and this has made me more positive. I still believe that RDF might be too flexible for its own good, and I’ve found that the ontologies are onerous, complicated, and not very well specified. I did, however, come across a great post which explains some of the advantages of RDF over other data models; SPARQL is far more intuitive than its SQL cousin. If used to its potential, with highly interlinked data, I think it may be possible for the benefits of RDF to outweigh the tough learning curve.


4 Responses to “Tracker RDF database performance”

  1. Robin says:

    The graph seems to be wrong. More queries per second is better, isn’t it? In the graph it looks like Virtuoso is faster.

  2. Mark Doffman says:

    No, it's not wrong. Perhaps a little misleading. The variation in Virtuoso's performance over different queries was much greater than Tracker's. In the mix of queries performed by the BSBM benchmark, a few of the queries performed by Virtuoso were incredibly slow, namely Q2, Q7, Q8 and Q10. This greatly dragged down the Virtuoso performance across the test. Also not mentioned is the fact that Queries 9 and 12 were left out of the query mixes performed by Tracker. It was not capable, feature-wise, of completing these queries.

  3. pvanhoof says:

    Wow, thanks for the benchmark and work you’ve put in it. So we’re even performing better than I hoped for. And this is without the performance enhancements that we are planning to do next week on DBus marshalling. I hope the benchmark run can easily be repeated, then perhaps we can use this work to test performance regressions while we continue to develop the product.

  4. Richard Dale says:

    I read the paper you refer to “advantages of RDF over other data models”, and it is certainly interesting, but it seems to leave out some fairly major features of the RDF approach. It doesn’t mention that there are standard ontologies for different subject areas, like the Nepomuk ones that cover the kind of things you find in a personal computer, or FOAF for describing people, or Geonames for geospatial data. These standard ontologies mean that you can easily combine the results of more than one data source, by making ‘federated queries’. You can’t do that with normal relational databases because there is no standard way of representing the same sort of data in the same way with the same tables, and the data isn’t self describing. Hence, the term ‘Open Linked Data’ for RDF stores which is all about turning the web into one big data warehouse.

    So to me it is less important that Tracker can support custom ontologies than it is to be able to make federated queries that combine data in the local Tracker store with databases out on the web, such as DBpedia. SKOS provides a means to describe how one ontology relates to another. Maybe instead of importing data from the web into custom ontologies in a local store, you need something like SKOS definitions to help describe equivalent terms in Nepomuk to ones in web-based ontologies, to act as a bridge between the local data and the web-based data.
