Enhancing Wikidata Performance with RDFox: How to dissect the world's leading RDF database faster

Improving speeds by orders of magnitude...

This article will introduce one of the world’s most popular RDF data sources and how RDFox can use it to offer a rich and responsive experience for web-based applications by reducing load times to minutes or seconds, and querying in a matter of milliseconds — orders of magnitude faster than the dedicated query service.

Introduction to Wikidata

Since 2012, the Wikimedia Foundation has been supporting and hosting an open data initiative called Wikidata, a vast database containing information about everything, from objects to people to abstract concepts. We’re talking about it today because unlike its human-centred kin (Wikipedia among many others), it is available for download in RDF format.

The data model of a single item can be represented the following way:

By Michael F. Schönitzer — Own work based on: Rdf mapping.svg, CC BY 4.0

Every entity is represented by a code whose scheme, while formulaic, is too detailed for us to cover here. It is, however, important to understand that these codes can represent anything, from ideas like ‘an instance of’ (wdt:P31) to physical beings like ‘a cat’ (wd:Q146). For a more in-depth breakdown of these codes, see this MediaWiki page.

The Wikidata model of each object can be identified on its Wikipedia page.

Downloading and importing Wikidata

Due to its richness and open access, Wikidata is one of the most popular RDF sources in the world. However, because of its immense size, users don’t usually download the entire 15-billion-fact database but instead focus on the individual sections they are interested in. This is far from a perfect solution but is often seen as a necessary evil, required to load the data and compute results in a reasonable time frame. RDFox changes all of this. Instead of taking a day or more, the initial load can now be completed in less than 3 hours, while queries return results in milliseconds (we’ll get to that later).

SPARQL (SPARQL Protocol and RDF Query Language) queries can be used to extract data in these situations. Designed and maintained by the W3C, SPARQL is considered the standard query language for RDF triplestores. For more information, read our SPARQL fact file, or head to Stack Overflow where SPARQL is a running theme among Wikidata questions.

Formulating queries

Wikidata has a live interface (the Wikidata Query Service) that can be used to query the data and view the results. It also provides some example queries to try; here we imagine one being used by an online cat store. The following query selects all the instances of cats (?item wdt:P31 wd:Q146), as well as their labels in English.

A query in the Wikidata Query Service that selects all ‘instances of’ ‘cats’.
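As a minimal sketch, the cat query described above can be written out and sent to the public endpoint from Python. The endpoint URL and the label-service clause follow the Wikidata Query Service's own published examples; the function name `run_select` and the User-Agent string are placeholders of ours, not part of any official client.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WQS_ENDPOINT = "https://query.wikidata.org/sparql"

# The wd:, wdt:, wikibase: and bd: prefixes are predefined by the service.
CAT_QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .   # ?item is an instance of (P31) cat (Q146)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def run_select(query, endpoint=WQS_ENDPOINT):
    """Send a SELECT query over HTTP GET and return the result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={
        "Accept": "application/sparql-results+json",
        # Placeholder identification string; replace with your own.
        "User-Agent": "cat-store-example/0.1 (placeholder contact)",
    })
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_select(CAT_QUERY)` should return one binding per cat, with the English label available under `row["itemLabel"]["value"]`.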

Wikidata also offers a SPARQL tutorial which you can find here.

Why use Wikidata?

Wikidata is a fantastic resource to connect to your existing datasets. Take the cat store for example; they could connect their cat inventory to Wikidata and help customers make decisions by providing more contextual information about the cats as they are browsing.

We can use a CONSTRUCT query to fetch all of Wikidata’s information on cats:

CONSTRUCT queries return RDF triples built from the data that already exists in the database, so the results can be imported without any further processing.
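The query itself is not reproduced in the text, so the following is only one plausible form, assuming that ‘all information on cats’ means every direct statement whose subject is an instance of cat. The names `CAT_CONSTRUCT` and `construct_url` are illustrative.

```python
import urllib.parse

# Explicit prefixes so the query also works outside the Wikidata Query Service.
PREFIXES = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
"""

# Assumed shape of the query: copy every statement about every cat.
CAT_CONSTRUCT = PREFIXES + """
CONSTRUCT { ?cat ?property ?value }
WHERE {
  ?cat wdt:P31 wd:Q146 ;      # every instance of cat...
       ?property ?value .     # ...together with all its outgoing statements
}
"""


def construct_url(query, endpoint="https://query.wikidata.org/sparql"):
    """Build the HTTP GET URL that submits `query` to a SPARQL endpoint."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})
```

Requesting `construct_url(CAT_CONSTRUCT)` with an RDF media type in the Accept header should return triples rather than a table of bindings, which is what makes the result directly importable.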

However, the Wikidata Query Service can take a long time to fetch the data, depending on the complexity of the query and the amount of data it needs to search over. Moreover, it throttles processing power, so slow queries will often time out. In the trivial cat example, the CONSTRUCT query took 0.9 seconds to retrieve only 706 results. Queries that return larger subsets of data, particularly those that are more complex and span several ‘layers’ (for example, cat > breed > colour > name), will almost always time out.

Improving Wikidata performance with RDFox

A more efficient approach can be to download Wikidata and import it into a faster triplestore: RDFox. When preparing for this article, the initial load took us only 2 hours and 50 minutes for the entire 15 billion triples, a remarkable figure compared to the tens of hours that are commonplace for an operation of this magnitude.

RDFox is a high-performance knowledge graph and semantic reasoner and, most importantly, an in-memory solution. This is the source of its extreme and unmatched querying speed, which is crucial: the faster you can query the dataset, the closer to ‘real-time’ the results are fetched. RDFox also supports SPARQL querying, just like the Wikidata Query Service, which makes it a natural fit for this data.

To showcase this power, we took a look at ‘OST Music’ — a hypothetical music streaming service with RDFox at its core that we dreamed up a few months ago. Read our article on OST Music to learn about its superior recommendation system, or using Wikidata in applications more generally.

This time, instead of creating a complex system, we were simply interested in retrieving information in the musical space. We ran three queries of increasing complexity, once over the entirety of Wikidata, and once over a subgraph of the triples containing an instance of a musical group, as we did for the streaming service. We also used the Wikidata Query Service as a baseline to provide context for the results.

The three queries were as follows:

  1. Musical groups
  2. Musical groups and their country of origin
  3. Musical groups, their country of origin, and the inspiration for their name

From left to right, queries 1, 2, and 3 in RDFox. wdt:P31 denotes ‘an instance of’; wdt:P495, ‘country of origin’; and wdt:P138, ‘named after’. wd:Q215380 describes a ‘musical group’.
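The three queries are not reproduced in the text, so here is a hedged reconstruction from the property and class codes listed above. The variable names (`?group`, `?country`, `?namedAfter`) and the helper `build_query` are our own; the original queries may differ in detail.

```python
PREFIXES = """PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
"""

# (selected variable, triple pattern), in the order the three queries add them.
PATTERNS = [
    ("?group", "?group wdt:P31 wd:Q215380 ."),         # instance of musical group
    ("?country", "?group wdt:P495 ?country ."),        # country of origin
    ("?namedAfter", "?group wdt:P138 ?namedAfter ."),  # named after
]


def build_query(level):
    """Return query 1, 2 or 3 by using the first `level` patterns."""
    if not 1 <= level <= len(PATTERNS):
        raise ValueError("level must be 1, 2 or 3")
    chosen = PATTERNS[:level]
    variables = " ".join(variable for variable, _ in chosen)
    body = "\n".join("  " + pattern for _, pattern in chosen)
    return f"{PREFIXES}SELECT {variables}\nWHERE {{\n{body}\n}}"
```

Each level adds one required triple pattern to the previous query, so `build_query(3)` only matches groups that have both a country of origin and a ‘named after’ value.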

A result is returned only if the database holds a value for every property the query asks for. In query 3, for example, a musical group with no recorded ‘named after’ value is left out.

Wikidata Query Service is given as WQS. RDFox used in the music group subgraph is given as RDFox*. All times are given in milliseconds.

Even for the simplest query, RDFox is four times faster than the Wikidata Query Service. When run over the subgraph, this increases to a factor of almost 20. As complexity increases, the advantage of RDFox becomes even clearer. For the third and most ‘layered’ query, RDFox returned the results faster than Wikidata’s own service by an order of magnitude, and nearly 800 times faster when querying over the subgraph. Such a large speed-up cannot be ignored, especially when you consider how effortlessly it is achieved with RDFox.
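The article does not describe how its timings were taken. As a rough sketch of how comparable millisecond figures and speed-up factors could be collected, assuming you already have a callable that runs one query against a given endpoint, the helpers below (names are ours) take the median of a few repeated runs.

```python
import statistics
import time


def median_ms(run_query, repeats=5):
    """Median wall-clock time of `run_query()` in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms
```

For example, `speedup(median_ms(run_on_wqs), median_ms(run_on_rdfox))` would give the factor by which one system outperforms the other, where `run_on_wqs` and `run_on_rdfox` are hypothetical callables that each execute the same query.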

As we can see, RDFox offers an attractive option for integrating Wikidata with your existing data sources to enrich the user experience. Not only can you effortlessly gain new insights and increase your efficiency by an order of magnitude, but you can slash operating costs while doing so as a direct result of the diminished operating time.

Don’t just take our word for it: try RDFox for free and see the difference it makes for yourself, whether that’s with Wikidata or your own projects.

For more information on RDFox, head to the OST website or our blog.

The Team and Resources

The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible and high-performance reasoning was a possibility for data-intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin-out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Innovation (OSI) and Oxford University’s investment arm (OUI). The author is proud to be a member of this team.