Hello and welcome to another episode of the foundational series Into RDFox.
In this episode, we're going to be covering the foundations of a SPARQL query, walking step by step through the key components that you're going to need to understand.
We're going to be working entirely in the RDFox console so if you're not sure how to get here or how to start RDFox, check out our previous episodes where we go into that in great detail.
It would also be very helpful for you to understand how and why SPARQL queries work over a graph database so if you're new to SPARQL or new to querying, then I would again check out the fundamentals videos that came before this one.
Let's dive into our first query. This query, as you can see here, is the most fundamental query that you can write in SPARQL. What this one does is it finds our entire database and returns it all to us. Now we're going to have a look step by step into the key components that allow us to do that.
The first thing you'll notice is the SELECT keyword itself. This describes to RDFox what the query is about to do, and crucially, what it's going to do with any results that it finds. All that SELECT does is find a pattern within your data and return those results to you, either the entire pattern or part of it.
Below it we can see the WHERE clause, and this is where we get to describe that pattern I was talking about. In this case, we're just looking at a general pattern of a triple because we're looking to match any of our data, any and all of our data.
The way that we do this is by putting a variable in each position of our triple: the subject position, the predicate position, and the object position. We can see that each variable is given by a question mark followed by a series of characters. We've called ours ?S, ?P and ?O, but that's just to remind us that these variables stand for subject, predicate, and object. What's important is not the name of these variables but our own self-consistency. So, if we would rather call this ‘S’ variable ‘something’, that's absolutely fine so long as we use this name in the same place everywhere.
You'll also notice we have these variables next to the select keyword itself. These are the output variables and this is, from our pattern, what we would like to return. Here we've got all three variables S, P and O because we want to return our entire data set.
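The query itself is shown on screen rather than in this transcript. As a minimal sketch, assuming only the variable names ?S, ?P and ?O mentioned above, the first query would look something like this:

```sparql
# Query 1 (reconstructed sketch): match every triple and return all three positions.
SELECT ?S ?P ?O
WHERE {
    ?S ?P ?O
}
```

Renaming ?S to ?something would give exactly the same results, provided the new name is used both next to SELECT and inside the WHERE clause.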
So if we click ‘Run SPARQL’, we will see our entire database returned to us in the result set. Here we've just seen a preview of the first 200, but that is purely visual. RDFox has returned the entire data set, and in fact we could download those results by clicking the ‘Download’ button here.
If we have a look at our second query, we can progress our pattern or our query one step further.
Here you will see we've introduced this DISTINCT keyword, and now the only variable on our output is ?P. That's because we're just looking to return the variable in the predicate position of our triple pattern. Why are we doing this? Well, we're just looking for the relationships, the predicates, that exist in our data. However, as I'm sure you can imagine, there are going to be lots and lots of repeated relationships in our data store, and so to get around this we use the DISTINCT keyword, which throws away any repeated results, giving just the distinct ones. So if we click ‘Run SPARQL’, we can see a list of all of the relationships that exist in our data store.
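As a sketch of what this second query would look like (the on-screen query is not reproduced in the transcript, so the exact layout is an assumption):

```sparql
# Query 2 (reconstructed sketch): list each predicate once.
# Without DISTINCT, ?P would be returned once for every matching triple.
SELECT DISTINCT ?P
WHERE {
    ?S ?P ?O
}
```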
If we head to query 3, this time, instead of looking at relationships, we can look at classes, again using the DISTINCT keyword to remove any repetition. You'll see first of all that we've renamed our variable in the object position of our triple. This time we've called it ‘type’. What's important to remember here is that this variable name is not semantically meaningful as far as RDFox is concerned, so this is just to remind us that we want this to be a type. RDFox will not interpret the name and it will not understand what we're talking about; it is simply considered a variable.
The important piece of information here, though, is in the predicate position of our pattern, where we have used this ‘a’ keyword. The ‘a’ keyword is a special keyword because it has been universally accepted to mean ‘of type’ or ‘of class’. This is because classes are so incredibly important in everything that we do in data, so we as a community have agreed that we will follow this standard. So, when we describe the pattern ‘?S a ?type’, it matches anything that is an instance of a class. When we return the distinct types on our output, we will see, if we click ‘Run’, a list of the classes in our data.
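A minimal sketch of this third query, assuming the variable names ?S and ?type used above. In SPARQL, ‘a’ in the predicate position is shorthand for the standard rdf:type property:

```sparql
# Query 3 (reconstructed sketch): list each class that has at least one instance.
# 'a' is shorthand for <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>.
SELECT DISTINCT ?type
WHERE {
    ?S a ?type
}
```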
Now, if we head to query four, we will combine those last two queries into one, getting a little piece from each, this time looking at a specific class. In this case, we'll be looking at the class driver. Moreover, we'll be looking for all of the relationships that a member of this class could potentially have but the way that we do this is a little bit interesting.
First, let's dive in on our constant, driver. This is a class that is actually in our database, and in fact, if we head back to query 3, we can see it in the result set here. The reason why it looks like this, though, is very important. You'll see we have ‘:’ then the word ‘driver’. This is because within RDF, all data points are uniquely identifiable on the Semantic Web. Now what that means in practical terms is that each piece of data actually has a long URI, and that URI looks a little something like a web address.
Now if we open up our prefix drawer, we can see what these look like. You can see here we have our first prefix, which is just a colon by itself, and it represents this long address. So really when we're writing ‘:driver’, this actually stands for this long address here, ‘http://rdfox.com/examples/f1/driver’. But to save ourselves having to write this out every single time, where we might make mistakes or simply spend a lot of time doing it, we can instead just use a colon followed by the data point itself.
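To make the expansion concrete, here is a small sketch. The prefix address is the one quoted above; the query around it is only an illustration and not one of the numbered queries from the episode:

```sparql
# In the console this prefix is already defined in the prefix list,
# so the PREFIX line is only needed when running the query elsewhere.
PREFIX : <http://rdfox.com/examples/f1/>

SELECT ?S
WHERE {
    ?S a :driver
    # Written out in full, the same pattern would be:
    # ?S a <http://rdfox.com/examples/f1/driver>
}
```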
You'll also notice that we have several other prefixes ready for you there. These are all standard prefixes that are used widely amongst the semantic community, and while you may not need each of these in your project, it's quite likely that you will. If you decide that you do not need them, you can simply click ‘Manage Prefixes’, delete one and click ‘Update’, although I'm not going to do that for now.
There are a few more interesting things about this pattern. The first is that it covers two lines, something that we haven't done before, so already our pattern is starting to get a little more complex. We've found that our variable S is a driver, so all of the subjects here, we know, are going to belong to the class driver. But by using a semicolon at the end of this line, we can continue to use this subject on the next line.
By using a semicolon we are effectively saying this, where the two variables S here take the same value, all because we've used this semicolon. Here you'll notice that this pattern is just the generic triple pattern that we've seen before. The reason for this is that once we've found our variable S, we want to find all of the other triples where this is the subject. With that, we can find all of the relationships that a driver could have. So once we've found all of these other triples, we can simply return distinct P to show all of our relationships for this class. By clicking ‘Run’ again, we can see exactly that. We can see, for example, that drivers have a full name, an ID, and a forename and surname.
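As a sketch of query 4 under the description above (the on-screen query is not in the transcript, so the layout is an assumption), together with the longhand form that the semicolon stands for:

```sparql
PREFIX : <http://rdfox.com/examples/f1/>

# Query 4 (reconstructed sketch): every predicate used by an instance of :driver.
# The semicolon carries the subject ?S over to the next line.
SELECT DISTINCT ?P
WHERE {
    ?S a :driver ;
       ?P ?O
    # Equivalent pattern written out in full, with ?S repeated:
    #     ?S a :driver .
    #     ?S ?P ?O
}
```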
This is one of the most important queries that we'll cover in this session because it really helps you to dive deeper into your data. It exposes the kind of questions that you can ask of your data and shows you where you might be lacking, where you'll need to write rules in order to support the queries, questions that you want to ask, or where you need to go and find additional data in order to answer the questions that you really need answering.
Query 5 takes this further again, now looking at all drivers and finding their forenames and surnames. We're beginning to ‘build out’ a more functional query that you might put behind a web page or an application.
In this case, we're turning a driver node into a human-friendly forename and surname. Here we're making heavy use of semicolons and, of course, appropriate variable names: we have our variable driver, which is a driver, and then, continuing from this subject with the semicolons, we're finding two specific properties, driver forename and driver surname.
You'll notice that we have a full stop at the end here. This is not strictly necessary in a small query like this, because what the full stop actually does is say: stop using this subject, stop continuing to borrow the subject after we've used it with the semicolons. This isn't very important here, and it's not particularly meaningful when we just have a simple pattern like this. But as we start to increase the size of our pattern, and particularly as we increase its complexity, we are going to want several blocks, each of which uses its own subject, and so this becomes absolutely critical. So again, if we click ‘Run’, we can see a list of our drivers alongside their forename and surname.
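A sketch of what query 5 might look like. The property names :driver_forename and :driver_surname are assumptions based on the spoken "driver forename" and "driver surname"; the exact spelling in the data may differ, and the variable names are illustrative:

```sparql
PREFIX : <http://rdfox.com/examples/f1/>

# Query 5 (reconstructed sketch, property names assumed).
SELECT ?driver ?forename ?surname
WHERE {
    ?driver a :driver ;                      # ?driver belongs to the class :driver
            :driver_forename ?forename ;     # same subject, first property
            :driver_surname ?surname .       # full stop: finished with this subject
}
```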
And finally, if we head to query 6, we can have a look at all of the properties, the relationships and their actual values, for a specific driver. This time we'll be looking at Lewis Hamilton. So once again, we're looking at any subject, driver, that belongs to the class driver. This time again we're looking at two additional properties of this subject, forename and surname, but rather than leaving them variable, we want to specify that this should be the string Lewis and this should be the string Hamilton.
From there, once again, we're effectively using our SPO pattern from before, though of course any three variables will do, to find all the additional information about him, and we're returning that on the output. So we have the driver, the predicate and the object. This time when we run the SPARQL, we can see a list of our driver Hamilton, but also his properties and the values of each of these properties.
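A sketch of query 6 under the same assumptions as before: the property names :driver_forename and :driver_surname are guesses at the spelling used in the data, and the names are assumed to be stored as plain string literals without language tags:

```sparql
PREFIX : <http://rdfox.com/examples/f1/>

# Query 6 (reconstructed sketch, property names assumed).
SELECT ?driver ?P ?O
WHERE {
    ?driver a :driver ;
            :driver_forename "Lewis" ;       # constant instead of a variable
            :driver_surname "Hamilton" ;
            ?P ?O .                          # every other triple with this subject
}
```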
From here, we can explore our results. We can do this because our result set contains nodes for us to display. So it's very important to remember that the result set must contain nodes in order for us to explore them visually. But so long as we have that, we can click ‘Explore Results’ and we will see our driver Hamilton in the middle, surrounded by his various properties.
If you'd like to learn a little bit more about SPARQL and have a look at some of the more functional queries, continue on to the rest of this series, where we will dive deeper into several different categories of queries, showing you how to do things like count, negation, or inner selects.
The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible and high-performance reasoning was a possibility for data-intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin-out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Enterprises (OSE) and Oxford University Innovation (OUI).