This article aims to provide a basic introduction to Datalog with RDFox. It will explain what Datalog is, why RDFox uses Datalog, how to write Datalog rules, how rules can enhance SPARQL query performance, and touch upon RDFox’s Datalog extensions.
Datalog is a rule language for knowledge representation. Rule languages have been in use since the 1980s in the fields of data management and artificial intelligence.
A Datalog rule is a logical implication, where both the “if” part of the implication (the rule body) and the “then” part of the implication (the rule head) consist of a conjunction of conditions. In the context of RDF, a Datalog rule conveys the idea that, from certain combinations of triples in the input RDF graph, we can logically deduce that some other triples must also be part of the graph.
Datalog has various applications including knowledge representation, ontological reasoning, data integration, networking, information extraction, cloud computing, program analysis and security.
RDFox is a high-performance knowledge graph and semantic reasoning engine. Reasoning is the ability to calculate the logical consequences of applying a set of rules to a set of facts. RDFox uses the Datalog rule language to express rules. Rules offer an expressive way to process and manipulate a knowledge graph. Rules bring the intelligence layer closer to the data and can also simplify query formulation and data management.
As a declarative, logic-based language renowned for its simplicity and its use in knowledge representation, Datalog complements RDFox's own declarative approach. Declarative tools are useful across a broad range of use cases, for example applying business logic or business rules, deploying machine learning models, building in-house devops or compliance tools, and streaming attribution.
Datalog is a widely understood language that lies at the heart of many rule formalisms and supports a wide range of extensions. You can read our introduction to rules here, or read the Do’s and Don’ts of Rule and Query Writing article.
As mentioned above, Datalog rules are logical implications: from certain combinations of triples in the input RDF graph, we can deduce that other triples must also be part of the graph. Put simply, a rule is an ‘if-then’ statement.
For example, we can say that:
?x has uncle ?z if ?x has parent ?y and ?y has brother ?z.
(e.g. Lucy has uncle Peter if Lucy has parent Alex and Alex has brother Peter)
This is expressed in Datalog in the following format:
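In RDFox Datalog syntax, with an illustrative `:` prefix for the family-relation IRIs, the rule can be sketched as:

```datalog
[?x, :hasUncle, ?z] :- [?x, :hasParent, ?y], [?y, :hasBrother, ?z] .
```

The head (left of `:-`) is what is derived; the body (right of `:-`) lists the conditions that must hold.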
Intuitively, the rule says “if [?x, :hasParent, ?y] and [?y, :hasBrother, ?z] both hold, then [?x, :hasUncle, ?z] holds as well”, and this results in new information being added to the graph.
It is important to note that the set of logical consequences obtained is completely independent of the order in which rules are applied, as well as of the order in which the conditions in a rule body are written. For example, the following rule would also hold:
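With the body conditions swapped (using the same illustrative `:` prefix), the rule derives exactly the same triples:

```datalog
[?x, :hasUncle, ?z] :- [?y, :hasBrother, ?z], [?x, :hasParent, ?y] .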
We can also define a rule to determine if a sibling is a brother or sister:
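One way to sketch this, assuming each person carries a hypothetical `:gender` property (the IRIs here are illustrative, not a fixed vocabulary), is a symmetric sibling relation plus two classification rules:

```datalog
[?x, :hasSibling, ?y] :- [?x, :hasBrother, ?y] .
[?y, :hasSibling, ?x] :- [?x, :hasSibling, ?y] .
[?x, :hasBrother, ?y] :- [?x, :hasSibling, ?y], [?y, :gender, "male"] .
[?x, :hasSister, ?y]  :- [?x, :hasSibling, ?y], [?y, :gender, "female"] .
```

Because the sibling relation is made symmetric, a brother or sister recorded in one direction is derived in the other direction too.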
The following graph illustrates the relationships between Luke, Peter and Meg, which are only partially described in the data. Using rules, we can ensure that the data is complete, maximising our ability to make correct business decisions efficiently and accurately.
The data in this example is incomplete: Peter is identified as Luke’s brother, but Luke isn’t identified as Peter’s brother. Although humans can logically deduce that if Peter is Luke’s brother then Luke must also be Peter’s, databases are not aware of such relationships between data points unless they are stated explicitly. We can therefore use rules to inform the graph that these relationships exist.
When we import the rules for siblings and uncles into the graph, the implied relationships are materialised. This occurs because each rule ranges over all possible nodes in the RDF graph. Whenever a rule’s body is satisfied, i.e. an ‘?x has brother ?y’ relationship is found, the rule propagates this information as a new triple within the graph, enriching the data.
In this example, the sibling rule helped to establish the link between Peter and Luke and the uncle rule established that Luke was Meg’s uncle. This makes the graph more complete, which has benefits for data analysis, and when the process is scaled up it can result in significant efficiency improvements compared to using other methods for inserting data into a graph (see below).
Rules can be simple like the one demonstrated above, or layered on top of one another to provide a complex set of instructions. Check out our resource library or blog for more resources on the power of semantic reasoning with Datalog rules.
Using a SPARQL INSERT query, one can add triples to the RDF graph directly. With reasoning (i.e. Datalog rules), however, this process is enhanced.
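As a point of comparison, the uncle relationship could be asserted with a SPARQL update (the prefix IRI and property names here are illustrative):

```sparql
PREFIX : <https://example.com/family#>
INSERT { ?x :hasUncle ?z }
WHERE  { ?x :hasParent ?y .
         ?y :hasBrother ?z }
```

Unlike a rule, this update runs once: triples added to the graph afterwards will not trigger it again unless it is re-executed.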
Fundamentally, the two methods for inserting data into the RDF graph differ because Datalog rules are applied recursively. In this way, the logical consequences of a set of Datalog rules on a graph are captured by the iterative application of the rules until no new information can be added to the graph.
The new triples are added when the rules are imported. With RDFox, this process is even more powerful: triples are materialised incrementally as new data is added to the RDF graph, prior to query time, dramatically speeding up SPARQL queries.
The rule language supported by RDFox extends the standard Datalog language with stratified negation, stratified aggregation, built-in functions, and more, so as to provide additional data analysis capabilities.
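As a sketch of what these extensions look like, the following rule uses RDFox’s aggregation syntax to count each person’s siblings (the IRIs are illustrative; see the RDFox documentation for the full syntax):

```datalog
[?x, :numberOfSiblings, ?n] :-
    AGGREGATE([?x, :hasSibling, ?y] ON ?x BIND COUNT(?y) AS ?n) .
```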
For more information on stratified negation, stratified aggregation, built-in functions, and other extensions, see the RDFox documentation.
If you, or anyone within your organisation, are interested in participating in a Datalog tutorial with Oxford University Professors and Oxford Semantic Technologies’ founders, take a look at our events page for upcoming workshops.