How to use count, sum, and other aggregates in semantic reasoning for a Knowledge Graph

Thomas Vout

Hello and welcome to another episode of the RDFox introductory series.  

In this episode, we're going to be looking at aggregation in reasoning and rules, so if you haven't already, please do check out our previous videos on the foundations of reasoning, and of course, Datalog.  

Aggregation is a powerful and versatile tool that is relevant to almost every use case. It is particularly relevant where you have very large data sets or where query performance is critical, so things like financial analysis come to mind.

Its power and versatility come from the functions it enables, such as SUM, COUNT, AVERAGE, MIN, MAX, and even some additional, more specific functions that are available in RDFox. In today's example, though, we're going to look at a count of the races that each of our drivers has raced in. We'll import this rule into RDFox and see how it impacts query performance.

First though, how do we actually do that? Well, in the body of the rule, we first have to declare that we are performing an aggregate with the ‘AGGREGATE’ keyword. Within that, we then have to provide all of the triple patterns that are relevant. In this case, that's just a single triple pattern about the drivers we have and the races they've raced in, so we can just provide the pattern ‘driver has raced in race’.

Of course, this pattern was created by a previous rule, which highlights how you can stack rules, with one referring to another, to derive complex insights.

Once we've described our triple patterns, we then need to tell RDFox which variable we want to aggregate on. In this case, that is the variable ‘?driver’, because we'd like to group our race counts per driver. This is equivalent to the GROUP BY keyword in SPARQL, so if you're familiar with that, there's nothing different here.

Finally, we need to choose the function we're actually going to use. In this case, we're using the COUNT function and we're counting the instances of the variable ‘?race’.

Once we've got the result of that, we can use ‘BIND’ to bind it to a new variable called ‘?raceCount’. With this set up, we can now take these variables and use them in the head of our rule, inferring a new relationship ‘hasRaceCount’ between each of our drivers and their aggregated race count.
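The rule file shown on screen is not reproduced in this transcript, so the following is only a sketch of what the rule computes, written in Python rather than Datalog. The comment at the top gives the rough shape such a rule might take. The predicate name `hasRacedIn` and the sample drivers and races are hypothetical; only `hasRaceCount`, `?driver`, `?race` and `?raceCount` come from the description above.

```python
from collections import Counter

# Rough shape of the rule described above in RDFox Datalog (an assumption,
# not copied from the video; `:hasRacedIn` is a hypothetical predicate name):
#
#   [?driver, :hasRaceCount, ?raceCount] :-
#       AGGREGATE(
#           [?driver, :hasRacedIn, ?race]
#           ON ?driver
#           BIND COUNT(?race) AS ?raceCount
#       ) .


def infer_race_counts(triples):
    """Mimic the rule: group body matches by driver and count the races."""
    # Body: match every (driver, hasRacedIn, race) triple.
    matches = [(s, o) for (s, p, o) in triples if p == "hasRacedIn"]
    # ON ?driver with COUNT(?race): one count per driver.
    counts = Counter(driver for driver, _race in matches)
    # Head: emit one new (driver, hasRaceCount, n) triple per group.
    return {(driver, "hasRaceCount", n) for driver, n in counts.items()}


# Hypothetical sample data; a set, so each triple appears only once.
facts = {
    ("driverA", "hasRacedIn", "race1"),
    ("driverA", "hasRacedIn", "race2"),
    ("driverB", "hasRacedIn", "race1"),
}

inferred = infer_race_counts(facts)
```

Here `inferred` holds one `hasRaceCount` triple per driver, which is the kind of fact the head of the rule adds to the graph.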

So let's have a look at how we use this in RDFox and how it impacts query performance.  

If we head to the terminal, we can import our rule, which you can see in the file here. We can do this with the ‘import’ command and the relative file path, which is going to be ‘rules/r3.dlog’. It’s important to note here that I have already imported my data and my previous two rules in the workshop because, again, this rule relies on work that they've already done.

Once we’ve imported that rule, we can run our query. On the right is a query that simply relies on our ‘hasRaceCount’ predicate. Once again, we can evaluate this query with the ‘evaluate’ command and, again, the relevant file path.

You can see here that we have 850 results and, if we scroll up, we can see that we have a list of our drivers, each with their race count. Of course, at the bottom, because we've ordered by the race count, we see only the drivers with a race count of one.

What is important to note here is firstly the 850 results, but secondly the time taken for this to evaluate. In this case it was 0.011 seconds. The reason I'm highlighting these features is that we have actually covered a similar query before. I say similar because the result of that query and the result of this one are exactly the same, but that query achieves the result by performing the count at query time instead of relying on reasoning.
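As a simplified analogy for the two approaches (not a model of how RDFox works internally), the sketch below contrasts counting on every query with reading a count that was computed once and stored. All names and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: (driver, race) pairs standing in for the graph.
raced_in = [("driverA", "race1"), ("driverA", "race2"), ("driverB", "race1")]


def counts_at_query_time(pairs):
    """Query-time approach: scan and group all pairs on every call."""
    totals = defaultdict(int)
    for driver, _race in pairs:
        totals[driver] += 1
    return dict(totals)


# Reasoning approach: do the grouping once, up front, and store the result.
materialised = counts_at_query_time(raced_in)


def counts_from_materialised():
    """Later queries only read the stored counts; no grouping is repeated."""
    return dict(materialised)


# Both approaches give the same answer; only the work done per query differs.
same_answer = counts_at_query_time(raced_in) == counts_from_materialised()
```

The answers are identical, which matches the observation above: the difference is where the counting work happens, not what is returned.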

Now let's run query 8 once again, and this time we will see that our result set is the same. We still have 850 query results. This time the evaluation time was 0.069 seconds, a little over 6 times slower than our query that relies on reasoning.
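The ratio between the two timings quoted above can be checked directly. This is just arithmetic on the two numbers from the episode; the variable names are illustrative.

```python
reasoning_seconds = 0.011   # query over the inferred hasRaceCount triples
query_time_seconds = 0.069  # query 8, which counts at query time

# How many times slower the query-time count was in this single run.
slowdown = query_time_seconds / reasoning_seconds
print(f"{slowdown:.1f}x")  # prints "6.3x"
```

These are single measurements on a small data set, so the ratio is indicative rather than a benchmark.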

Now this is a very simple example of an aggregate on a very small data set. It took a single triple pattern and just counted a simple variable. Even here we are getting a time saving of more than 6 times, so hopefully you can envision that when we scale this up to larger data sets or more complex aggregates, you can potentially see time savings of several orders of magnitude just from a simple rule.

If you'd like to see anything else RDFox can do with reasoning, or learn more about Datalog and the functions it has, please check out our other videos in the series.

Take your first steps towards a solution.

Start with a free RDFox demo!

Get started with RDFox for free!

Team and Resources

The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible and high-performance reasoning was a possibility for data-intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin-out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Enterprises (OSE) and Oxford University Innovation (OUI).