The Do’s and Don’ts of Rule and Query Writing

RDFox, Datalog and SPARQL

Photo by Joanna Kosinska on Unsplash

Rules offer an expressive way to process and manipulate a knowledge graph. Rules help bring the intelligence layer closer to the data and can also help with query writing. This article aims to provide a list of best practices for getting the most out of your rules and queries. You can also read our introduction to rules here and download our rule guide for more concrete examples.

What is a rule?

A rule is an ‘if-then’ statement. For example:

?x has uncle ?z if ?x has parent ?y and ?y has brother ?z.

We express rules using Datalog. Datalog is a declarative, logic-based programming language that is syntactically a subset of Prolog. In RDFox, the rule above is written as follows:

[?x, :hasUncle, ?z] :-
   [?x, :hasParent, ?y],
   [?y, :hasBrother, ?z].

The formula to the left of the :- operator is the rule head (the ‘then’ part) and the formula to the right is the rule body (the ‘if’ part).

Intuitively, the rule says “if [?x, :hasParent, ?y] and [?y, :hasBrother, ?z] both hold, then [?x, :hasUncle, ?z] holds as well”.
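To make this if-then reading concrete, here is a minimal Python sketch that applies the uncle rule once to a handful of triples. It is only an illustration of the rule's meaning, not a description of how RDFox evaluates rules; the function name apply_uncle_rule and the sample individuals are made up for this example.

```python
# Toy illustration of the uncle rule over (subject, predicate, object) triples.
# This mirrors the if-then reading only; it is NOT how RDFox works internally.

# Hypothetical sample data.
facts = {
    (":alice", ":hasParent", ":carol"),
    (":bob", ":hasParent", ":carol"),
    (":carol", ":hasBrother", ":dave"),
    (":erin", ":hasParent", ":frank"),  # :frank has no brother, so no uncle is derived
}


def apply_uncle_rule(triples):
    """Return the :hasUncle triples derived by one pass of the rule."""
    derived = set()
    for (x, p1, y) in triples:
        if p1 != ":hasParent":
            continue
        for (y2, p2, z) in triples:
            # The shared variable ?y forces both body atoms to agree on one node.
            if p2 == ":hasBrother" and y2 == y:
                derived.add((x, ":hasUncle", z))
    return derived


new_facts = apply_uncle_rule(facts)
print(sorted(new_facts))
```

Each match of the body (the two nested loops) contributes one instance of the head, which is why the number of body matches matters so much for performance.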

What is a query?

A query is a request for data or information from a database. Queries retrieve data stored within RDFox. The query language used by RDFox is SPARQL, the standard query language for RDF.

Queries can be typed or pasted directly into the shell. For example:

SELECT ?s ?p ?o WHERE { ?s ?p ?o }

This query would select all subject-predicate-object triples within RDFox.

Below are some tips and best practices for writing rules and queries with RDFox.

Do’s

Add all rules before the facts, or add rules and facts in any order but grouped in a single transaction. This will usually improve the performance of the first reasoning operation.

Make rule bodies as selective as possible to improve the performance of a rule. This helps reduce the number of matches in the body that then propagate to the head.

Use rules to materialise costly and/or frequently used sub-queries. We recommend experimenting with the trade-off between reasoning and query answering time to make queries simpler to write, maintain and answer using rules.

Store your rules in separate files by purpose and import them incrementally when they are required. Rules in RDFox materialise as soon as they are imported, which is why we recommend importing only the rules you need at a given time.

Restrict variables within your rule body. If types exist within your data, use them to do so.

Start small and build on what you have. In the case of rules, incremental retraction/addition means that you don’t have to worry about rebooting the whole system every time you change a rule.

Write queries before you write rules. The query will let you know how much data will be affected by the rule.

Test your rules. Test whether the query you wrote before the rule (see above point) and the rule return the same result.

This can be done in a query like:

SELECT ?person WHERE {
   {
       SELECT ?person (COUNT(?child) AS ?children) WHERE {
           ?person :hasChild ?child
       }
       GROUP BY ?person
   }
   ?person :materialisedNumberOfChildren ?number
   FILTER(?number != ?children)
}

This will return the set of people where the numbers don’t match, so the query should return 0 results.
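The same check can be sketched outside the database. The Python snippet below is a rough analogue of the query above, assuming the :hasChild pairs and the materialised counts have already been fetched into plain Python structures; the names has_child, materialised and mismatches are hypothetical.

```python
from collections import Counter

# Hypothetical data: (parent, child) pairs and the counts produced by the rule.
has_child = [(":alice", ":carol"), (":alice", ":dave"), (":bob", ":erin")]
materialised = {":alice": 2, ":bob": 1}


def mismatches(has_child, materialised):
    """Return people whose materialised count differs from a fresh count."""
    counted = Counter(parent for parent, _child in has_child)
    # Like the query, only people with both a count and a materialised value are compared.
    return sorted(
        person
        for person, number in counted.items()
        if person in materialised and materialised[person] != number
    )


result = mismatches(has_child, materialised)
print(result)
```

An empty result corresponds to the query returning 0 rows. Feeding in a deliberately wrong value, for example {":alice": 3}, is a quick way to confirm that the check itself can fail.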

Don’ts

Don’t forget to define the type of a variable used in the rules. This won’t be an issue in most cases but can slow down performance if the total number of possible relations to evaluate is large. Example:

[?customer, :referral, "true"] :-
   [?customer, :has, ?referralLink].

The rule does not constrain the types of ?customer and ?referralLink, so every :has relation has to be checked. Defining the types helps reduce the number of matches in the body that then propagate to the head. A more appropriate solution would be:

[?customer, :referral, "true"] :-
   [?customer, :has, ?referralLink],
   [?customer, a, :CustomerType],
   [?referralLink, a, :ReferralLinkType].
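The effect of the extra type atoms can be illustrated with a small Python sketch that counts body matches with and without them. The data and the helper has_type are invented for this example, and the list comprehensions only imitate the matching; they say nothing about how RDFox plans or indexes the join.

```python
# Hypothetical triples: several things "have" something, but only one
# customer has a referral link.
triples = [
    (":alice", ":has", ":link1"),
    (":alice", ":has", ":car1"),
    (":bob", ":has", ":house1"),
    (":acme", ":has", ":office1"),
    (":alice", "a", ":CustomerType"),
    (":bob", "a", ":CustomerType"),
    (":link1", "a", ":ReferralLinkType"),
]
triple_set = set(triples)


def has_type(node, cls):
    """True if the data contains the triple (node, a, cls)."""
    return (node, "a", cls) in triple_set


# Untyped body: every :has triple is a match.
untyped_matches = [(s, o) for (s, p, o) in triples if p == ":has"]

# Typed body: the two type atoms discard most candidates.
typed_matches = [
    (s, o)
    for (s, o) in untyped_matches
    if has_type(s, ":CustomerType") and has_type(o, ":ReferralLinkType")
]

print(len(untyped_matches), len(typed_matches))
```

In this made-up data the typed body also derives fewer facts, since not everything a customer :has is a referral link, so the type atoms help correctness as well as performance.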

Don’t perform joins in filters; use regular joins instead. Consider the following rule, which marks as similar all pairs of entities whose labels are identical when case is ignored.

[?first, :similarTo, ?second] :-
   [?first, rdfs:label, ?first_label],
   [?second, rdfs:label, ?second_label],
   FILTER(LCASE(?first_label) = LCASE(?second_label)).

While correct, the above rule is very inefficient because it forces RDFox to compare the labels of all pairs of entities, which becomes infeasible for even a moderately large number of entities (e.g. on a dataset with 1M entities this will result in 1T comparisons). The issue is that, for every triple that matches the first atom, RDFox must iterate through all triples that match the second atom and apply the filter condition to each. RDFox has no way of reducing the number of compared pairs of entities.

An alternative solution is to precompute the values on which the join is performed in a separate rule and use a regular join in a second rule as illustrated next.

[?entity, :lcase_label, ?lcase_label] :-
   [?entity, rdfs:label, ?label],
   BIND(LCASE(?label) as ?lcase_label).

[?first, :similarTo, ?second] :-
   [?first, :lcase_label, ?label],
   [?second, :lcase_label, ?label].

While more verbose, RDFox will evaluate the above program in time proportional to the number of similar pairs (as opposed to all pairs), which depending on the data could be the difference between terminating and not. The first rule simply computes the lower-case labels of entities, while the second rule uses a direct join on the computed labels to identify the similar entities. Because RDFox uses full indexing, it can efficiently identify, for every triple that matches the first atom, only the compatible triples that match the second atom. Note that this solution uses additional memory to store the relation :lcase_label, which could be a significant but also necessary overhead for solving the problem.
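The difference between the two formulations can be seen in a small Python sketch. It contrasts an all-pairs comparison, which is what the FILTER version forces, with grouping on a precomputed lower-case key, which plays the role of the :lcase_label relation. The dictionary grouping is only a stand-in for an indexed join, not a model of RDFox's internals, and the function names and sample labels are invented.

```python
from collections import defaultdict
from itertools import product

# Hypothetical entities and their labels.
labels = {
    ":e1": "Oxford",
    ":e2": "OXFORD",
    ":e3": "oxford",
    ":e4": "Cambridge",
    ":e5": "London",
}


def similar_by_filter(labels):
    """Compare every ordered pair of entities, as the FILTER-based rule does."""
    comparisons = 0
    pairs = set()
    for (a, label_a), (b, label_b) in product(labels.items(), repeat=2):
        comparisons += 1
        if label_a.lower() == label_b.lower():
            pairs.add((a, b))
    return pairs, comparisons


def similar_by_join(labels):
    """Precompute the lower-case label once per entity, then join on it."""
    by_lcase = defaultdict(list)
    for entity, label in labels.items():
        by_lcase[label.lower()].append(entity)
    pairs = set()
    for group in by_lcase.values():
        # Work here is proportional to the number of similar pairs only.
        for a in group:
            for b in group:
                pairs.add((a, b))
    return pairs


filter_pairs, comparisons = similar_by_filter(labels)
join_pairs = similar_by_join(labels)
print(comparisons, len(join_pairs), filter_pairs == join_pairs)
```

With five entities the all-pairs version performs 25 comparisons to find 11 similar pairs (each entity is also similar to itself, as in the rule); the grouped version produces the same 11 pairs without touching the pairs that cannot match.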

Avoid unnecessary FILTER statements (in both rules and queries): you are effectively throwing away answers that you already spent (a lot of) time computing.

If you want to say that two variables of the same type should be equal, just call them the same thing.

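Avoid cross-product blow-ups. This is a special case of making rule bodies selective, but it is worth mentioning specifically: when two atoms in a rule body or query share no variables, every match of one is combined with every match of the other.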
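As a rough illustration of how quickly a cross product grows, the Python sketch below counts the matches of a body with two disconnected atoms and of the same body once a linking atom is added. The customer and item data, the :bought relation and all variable names are hypothetical.

```python
from itertools import product

# Hypothetical data: 200 customers, 300 items, one purchase per customer.
customers = [f":customer{i}" for i in range(200)]
items = [f":item{i}" for i in range(300)]
bought = {(c, items[i % len(items)]) for i, c in enumerate(customers)}

customer_set = set(customers)
item_set = set(items)

# Disconnected body, e.g. [?c, a, :Customer], [?i, a, :Item]:
# no shared variable, so every customer is paired with every item.
disconnected_matches = sum(1 for _pair in product(customers, items))

# Connected body: adding [?c, :bought, ?i] links the two variables,
# so only pairs that actually occur in the data are matched.
connected_matches = sum(
    1 for (c, i) in bought if c in customer_set and i in item_set
)

print(disconnected_matches, connected_matches)
```

Here the disconnected body yields 60,000 matches against 200 for the connected one, and the gap widens multiplicatively as either side grows.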

Do you have any best practices to add to our list? Or situations to avoid? Feel free to get in touch! We will continue to update this article.

For more information head to our website or contact us at info@oxfordsemantic.tech

To request an evaluation license click here.

...

The Team and Resources

The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible and high-performance reasoning was a possibility for data-intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin-out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Innovation (OSI) and Oxford University’s investment arm (OUI). The author is proud to be a member of this team.