The Semantic Web represents knowledge as a semantic graph structure. Domain knowledge is represented using controlled vocabularies and ontologies that can be applied to semantically describe data. The combination of these vocabularies, ontologies and the instance data of a domain typically constitutes the so-called Knowledge Graph (KG). This form of knowledge representation allows considerable flexibility in modeling as well as in combining data sets by linking graphs, which has the potential to solve enterprise data heterogeneity problems by providing a bottom-up and flexible approach.
Integrating various heterogeneous data sources is one of the main use cases of Semantic Web standards. For implementing such integrations, we can apply ETL (Extract-Transform-Load) processes, which result in a KG that semantically represents the original data, linked together and described using common vocabularies and ontologies.
A joint work, this article contains contributions from both Oxford Semantic Technologies and Semantic Web Company, based upon their respective products RDFox and PoolParty. The primary authors were Albin Ahmeti (Knowledge Engineer, SWC), Yavor Nenov (CSO, OST), and Robert David (CTO, SWC).
Inconsistencies and incompleteness can arise for various reasons during data modeling and processing. The resulting data may not be aligned with the original data source, may not conform to the describing ontologies, or may even introduce contradictions. As a consequence, these consistency violations can lead to quality problems in applications that run on top of the KG, such as Semantic Search and Recommender Systems.
One of the main standards used in such ETL processes for schema-based integration is OWL, the Web Ontology Language, which is used to semantically describe a knowledge domain and which allows reasoning on top of graph data. However, OWL cannot be used to ensure data consistency in such data integration processes. OWL reasoning is interpreted under the open world assumption, which means it can infer additional information, but it cannot validate instance data in the sense of constraint checks.
To do so, the Semantic Web provides the Shapes Constraint Language, SHACL. It allows validation of KGs against sets of constraints that describe what conformant data has to look like. When the KG fails to satisfy these constraints, the SHACL processor returns an RDF-based validation report describing the problems, which can then be processed further automatically. Because SHACL is purely RDF-based, it integrates easily into KG ETL processes and can provide data consistency along the linked data life cycle.
Some KGs are curated, for instance Wikidata, while others are created as part of an ETL (Extract-Transform-Load) process, for instance the widely known DBpedia KG. Since DBpedia data is generated from Wikipedia infoboxes and DBpedia mappings, and is described using the DBpedia OWL ontology, inconsistencies and incompleteness in the data occur frequently and have to be considered in an integration scenario. While the DBpedia KG is generated from one data source, in practice KGs often consist of multiple heterogeneous data sources that are integrated using Semantic Web technologies. This heterogeneity is a main reason for inconsistencies that occur as part of the integration. For enterprises to take advantage of KGs, we need to manage these inconsistencies as part of the linked data life cycle. We have to select the appropriate strategies based on the scenario and we need the technical architecture to process these strategies in an enterprise environment.
In a scenario where inconsistent data can occur, we have to decide on a strategy for dealing with reported inconsistencies. There are two main techniques in the realm of Consistent Data Processing.
The first one is called Consistent Query Answering (CQA). With this technique the queries are rewritten so that, even when evaluated on top of inconsistent data, they still yield consistent answers, filtering out inconsistent answers in the process. This approach has the advantage that the data source does not need to be modified to resolve data problems, and additional constraints can easily be introduced into the queries. However, CQA has several drawbacks. First and foremost, the quality problems are still in the data. To avoid the inconsistencies, we have to restrict integrators to accessing the data only through CQA queries. Also, queries with additional constraints are likely to perform worse than the original queries. CQA is an appropriate approach when modifications to the original data source are not possible. As DBpedia is generated using an ETL process, CQA is the preferred technique for an integration scenario. As an alternative, we could repair the data in a local copy, but would have to keep this copy aligned with new releases. To avoid this, CQA provides us with an independent quality control for the DBpedia integration.
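To illustrate the idea behind CQA, here is a minimal Python sketch. It is not how a triple store rewrites SPARQL queries; it only shows the principle that the constraint moves into the query while the data stays untouched. The rows, the property `ex:ratedPower` and the expected datatype are assumptions made up for this example.

```python
# Hypothetical rows: (subject, property, lexical value, datatype).
DATA = [
    ("ex:pump1", "ex:ratedPower", "75", "xsd:integer"),
    ("ex:pump2", "ex:ratedPower", "90.5", "xsd:float"),  # violates the constraint
    ("ex:pump3", "ex:ratedPower", "110", "xsd:integer"),
]

# Constraint assumed for this sketch: the datatype each property must have.
EXPECTED_DATATYPE = {"ex:ratedPower": "xsd:integer"}


def original_query(rows, prop):
    """Return every (subject, value) pair for the given property."""
    return [(s, value) for s, p, value, _ in rows if p == prop]


def rewritten_query(rows, prop):
    """Same query with the constraint folded in as an extra filter."""
    expected = EXPECTED_DATATYPE[prop]
    return [(s, value) for s, p, value, dt in rows if p == prop and dt == expected]


all_answers = original_query(DATA, "ex:ratedPower")
consistent_answers = rewritten_query(DATA, "ex:ratedPower")
print(len(all_answers), len(consistent_answers))  # 3 2
```

The inconsistent row for `ex:pump2` is still present in `DATA` afterwards, which is exactly the drawback described above: anyone who runs the original query still sees it.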
The second approach is called Repairing. With this technique the data is permanently changed, i.e., repaired, such that it is consistent with the constraints. It can then be used with any application or any other (consistent) integration with the original queries, and there is no further need to rewrite them as in the case of CQA. Of course, this approach is only possible when data modifications and propagation back to the data source are possible. Also, the strategies to repair the data might not be straightforward and often require a human-in-the-loop approach. Still, it is the approach that not only makes the most of the data, but also makes it reusable for any further processing and integration.
In this blog post, we will present a customer use case that qualifies for repairs. The original data is mapped and transformed to RDF using an ETL approach and there is no need to propagate updates back. We will show how to validate and repair inconsistencies using an automated approach that can be integrated into the ETL process.
The Semantic Web provides various technologies to represent and describe data. OWL, the Web Ontology Language, can be used for semantic descriptions of knowledge domains. These descriptions can then be used for reasoning to infer new data in the KG. OWL reasoning uses an open world approach. This means that knowledge will only be added and never retracted. You cannot draw conclusions from missing information, because it might still exist. This approach to reasoning was chosen because the World Wide Web is an open system where you can always encounter new information, since graph data can be added and linked together freely.
However, this approach is not suitable when it comes to constraints. OWL reasoning can detect inconsistencies in an ontology itself, but when it comes to reasoning over instance data that is described using the ontology elements, OWL reasoners will not return errors, just infer triples. In an open world, there are no constraint violations, but only inferences. For example, we cannot evaluate cardinality restrictions as constraints, because there might be additional unknown data. In an enterprise environment, we need to ensure the data quality along the linked data life cycle so that inconsistencies can be managed and will not result in wrong results or misunderstandings. In this closed scenario, we need technologies for data validation. We need to define constraints that the KG has to conform to.
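The difference becomes tangible with a small closed-world check. The following Python sketch is not tied to any OWL or SHACL library, and the triples, the class `ex:Equipment` and the property `ex:serialNumber` are invented for illustration. It treats "every piece of equipment has a serial number" as a constraint, so a missing value is reported. Under the open world assumption the same missing value would not be an error, since the serial number might simply not be known yet.

```python
# Toy triples (subject, predicate, object); all names are hypothetical.
triples = {
    ("ex:pump1", "rdf:type", "ex:Equipment"),
    ("ex:pump1", "ex:serialNumber", "SN-001"),
    ("ex:pump2", "rdf:type", "ex:Equipment"),  # no serial number recorded
}


def missing_property(triples, cls, prop):
    """Closed-world check: instances of cls that have no value for prop."""
    instances = {s for s, p, o in triples if p == "rdf:type" and o == cls}
    with_value = {s for s, p, _ in triples if p == prop}
    return sorted(instances - with_value)


violations = missing_property(triples, "ex:Equipment", "ex:serialNumber")
print(violations)  # ['ex:pump2']
```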
In order to do this, we can use SHACL, the Shapes Constraint Language. We can represent (groups of) constraints as shapes to validate the KG and report back identified violations. SHACL provides language elements similar to those of OWL, but its evaluation is fundamentally different because it uses a closed world approach. Here we can also check for the non-existence of data, which effectively brings negation to validation. Evaluating shapes in the context of large KGs is challenging because of the size and complexity of the data and/or the shapes. Next, we will present an approach to tackle such a setting and provide a technical architecture to implement it. We evaluate how scalable data validation with SHACL can be applied in practice to improve data quality as part of an ETL integration scenario with PoolParty Semantic Suite.
In our approach we use the combination of PoolParty with RDFox, a high-performance, scalable in-memory triple store and rules engine based on Datalog. It features a SPARQL 1.1 compliant endpoint, an in-store SHACL processor that can be triggered by SPARQL operations and a Datalog engine adapted for the RDF data model.
PoolParty is integrated with RDFox via the SPARQL endpoint, which means that all operations from PoolParty components are transformed to SPARQL queries or updates. Data validation and repair are mainly processed by UnifiedViews, PoolParty’s data orchestration component, integrated with RDFox.
Using this architecture, we can provide several functions to implement use cases:
In this blog post we are going to present an approach to repairing data that is deemed to be inconsistent. Typically in a data integration scenario, more precisely in the setting of ETL (Extract-Transform-Load), the data is extracted, transformed (mapped) and loaded into a database, in our case a triple store. In each of the three steps different activities are performed, in which data is extracted, harmonized and mapped using a knowledge graph consisting of taxonomies and ontologies. As a consequence of integrating different sources, we may end up with data that is inconsistent, a notion which we define next.
We distinguish between two types of inconsistencies:
1 We have encountered inconsistencies of a similar type in DBpedia when using inference, e.g., http://dbpedia.org/resource/Daniel_Edvardsen__CareerStation__5 that is inferred to be of type http://dbpedia.org/ontology/Person and http://dbpedia.org/ontology/TimePeriod, which are defined to be mutually disjoint in the DBpedia ontology.
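A check for this kind of disjointness inconsistency can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the types are taken as already inferred and listed per resource, `dbr:` and `dbo:` abbreviate the DBpedia resource and ontology namespaces, and the second resource is hypothetical.

```python
from itertools import combinations

# Pairs of classes declared mutually disjoint in the ontology.
DISJOINT = {frozenset({"dbo:Person", "dbo:TimePeriod"})}

# Types per resource after inference (taken as given in this sketch).
TYPES = {
    "dbr:Daniel_Edvardsen__CareerStation__5": {"dbo:Person", "dbo:TimePeriod"},
    "ex:someOtherResource": {"dbo:Person"},  # hypothetical, consistent
}


def disjointness_violations(types, disjoint):
    """Return (resource, class_a, class_b) for every pair of disjoint types."""
    violations = []
    for resource, classes in sorted(types.items()):
        for pair in combinations(sorted(classes), 2):
            if frozenset(pair) in disjoint:
                violations.append((resource, pair[0], pair[1]))
    return violations


for violation in disjointness_violations(TYPES, DISJOINT):
    print(violation)
```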
In our use case we will consider a data integration scenario in the domain of manufacturing. Data about different equipment is integrated from different sources, harmonized and mapped to RDF.
As a consequence, the created data does not match the data type asserted by the ontology attributes, as they are called in PoolParty (in OWL these are datatype properties, owl:DatatypeProperty). For instance, in the following the range is defined as the datatype integer, whereas the data is of type float (see the example above).
This mismatch between the generated data and the ontology definitions is addressed via repairs, where such erroneous data is repaired to the correct data type as defined in the ontology. PoolParty requires consistency, since it is meant to manage consistent data, which is why we use repairs rather than consistent query answering.
The following figure shows such inconsistent data in PoolParty (the inconsistency warning is shown with the ⚠ symbol next to the attributes in the right pane):
Fig. 2: Inconsistent data in PoolParty: Data types of the ingested data do not conform to the data types defined in the ISO14224 ontology used in this project.
As a first step, we create SHACL shapes automatically via a Datalog rule. The rule is given in the following:
The rule rule.dlog is used to create the shapes in SHACL from ISO14224 ontology definitions.
The rule is written in the Datalog syntax head :- body. Each instantiation of the body variables against the data produces triples in the head which, together with the named graph, form quadruples. The body matches every ontology definition of type owl:DatatypeProperty along with its data type, and the head creates a new shape (note: the SKOLEM function is used to create a new identifier (constant) for the shape) together with the properties needed to check the path of the concepts and the data type.
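For readers who do not work with Datalog, the logic of such a rule can be approximated in Python. This is a sketch of the idea and not the rule from rule.dlog: the ontology fragment, the `skolem` helper (a hash-based stand-in for a Skolem function) and the exact set of generated shape triples, including the use of `sh:targetSubjectsOf`, are assumptions for illustration.

```python
import hashlib

SHAPES_GRAPH = "urn:shapes"

# Hypothetical ontology fragment as (subject, predicate, object) triples.
ONTOLOGY = [
    ("ex:ratedPower", "rdf:type", "owl:DatatypeProperty"),
    ("ex:ratedPower", "rdfs:range", "xsd:integer"),
    ("ex:installedAt", "rdf:type", "owl:ObjectProperty"),
    ("ex:installedAt", "rdfs:range", "ex:Site"),
]


def skolem(*parts):
    """Deterministic identifier built from the arguments (Skolem stand-in)."""
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:12]
    return "urn:skolem:" + digest


def shapes_for(ontology):
    """Body: datatype properties with a range. Head: shape quads."""
    datatype_props = {
        s for s, p, o in ontology
        if p == "rdf:type" and o == "owl:DatatypeProperty"
    }
    quads = []
    for s, p, o in ontology:
        if p == "rdfs:range" and s in datatype_props:
            node = skolem("node", s)
            prop = skolem("property", s)
            quads.extend([
                (node, "rdf:type", "sh:NodeShape", SHAPES_GRAPH),
                (node, "sh:targetSubjectsOf", s, SHAPES_GRAPH),
                (node, "sh:property", prop, SHAPES_GRAPH),
                (prop, "sh:path", s, SHAPES_GRAPH),
                (prop, "sh:datatype", o, SHAPES_GRAPH),
            ])
    return quads


quads = shapes_for(ONTOLOGY)
for quad in quads:
    print(quad)
```

Because the identifiers are derived from the property name, running the generation twice yields the same shapes rather than duplicates, which is the property a Skolem function provides in the rule.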
We import this rule by using the import command in RDFox (can also be done using curl):
All shapes corresponding to the datatype properties defined in the ISO14224 ontology are created in the graph called <urn:shapes>. Here’s an example:
This shape can be used to check the conformance of the data. This can be triggered using this command in RDFox by also specifying the named graph of the data to be validated (in our case: <https://el-capitan.poolparty.biz/Semantics_2021/thesaurus>):
After this query is triggered, the violation report is stored in the graph <urn:report>.
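What the validation step produces can be sketched in Python as follows. This is a simplified stand-in for a SHACL processor and not the RDFox command: it only compares the declared datatype of each literal with the one required by the shape (a real `sh:datatype` check also verifies that the lexical form is well-formed), and the data and the property `ex:ratedPower` are hypothetical. The keys of the result entries reuse terms from the SHACL validation report vocabulary.

```python
# Shape summary assumed for this sketch: property path -> required datatype.
SHAPES = {"ex:ratedPower": "xsd:integer"}

# Hypothetical data: (subject, property, (lexical value, datatype)).
DATA = [
    ("ex:pump1", "ex:ratedPower", ("75.0", "xsd:float")),
    ("ex:pump2", "ex:ratedPower", ("90", "xsd:integer")),
    ("ex:pump3", "rdfs:label", ("Pump 3", "xsd:string")),
]


def validate(data, shapes):
    """Return a report in the spirit of a SHACL validation report."""
    results = []
    for s, p, (lexical, datatype) in data:
        expected = shapes.get(p)
        if expected is not None and datatype != expected:
            results.append({
                "sh:focusNode": s,
                "sh:resultPath": p,
                "sh:value": (lexical, datatype),
                "sh:sourceConstraintComponent": "sh:DatatypeConstraintComponent",
            })
    return {"sh:conforms": not results, "sh:result": results}


report = validate(DATA, SHAPES)
print(report["sh:conforms"], len(report["sh:result"]))  # False 1
```

Each result names the focus node, the path and the offending value, which is the information a subsequent repair step needs.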
Finally, after the validation has been performed on the data, we carry out the last step, repairing. There is no standard semantics for performing repairs, but a few are discussed in the research literature. What is essential is that repairs always have to satisfy some criteria, such as the minimality principle (change, i.e. delete or insert, as few triples as possible), as well as other postulates that are discussed in the area of belief revision. As an example of a postulate we can mention: “deleting a triple triggers deletion of all the inferred triples that do not have alternative derivations from other triples in the triple store”. This postulate is satisfied in RDFox, given that RDFox implements a variant of the DRed (Delete and Rederive) algorithm that does incremental updates.
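The quoted postulate can be illustrated with a small Python sketch. Note that this is not DRed: for simplicity it recomputes all inferences from the remaining explicit facts, whereas an incremental algorithm arrives at the same end state without starting from scratch. The class hierarchy and the facts are hypothetical.

```python
# Hypothetical subclass hierarchy: class -> direct superclass.
SUBCLASS_OF = {
    "ex:FirePump": "ex:Pump",
    "ex:Pump": "ex:Equipment",
    "ex:SafetyDevice": "ex:Equipment",
}


def materialise(explicit):
    """Explicit type facts plus everything inferred via the hierarchy."""
    facts = set(explicit)
    frontier = list(explicit)
    while frontier:
        instance, cls = frontier.pop()
        parent = SUBCLASS_OF.get(cls)
        if parent is not None and (instance, parent) not in facts:
            facts.add((instance, parent))
            frontier.append((instance, parent))
    return facts


explicit = {
    ("ex:p1", "ex:FirePump"),
    ("ex:p2", "ex:FirePump"),
    ("ex:p2", "ex:SafetyDevice"),
}
before = materialise(explicit)

# Delete both FirePump assertions, then rederive from what is left.
remaining = explicit - {("ex:p1", "ex:FirePump"), ("ex:p2", "ex:FirePump")}
after = materialise(remaining)
print(sorted(after))
```

After the deletion, `ex:p1` loses all of its inferred types, while `ex:p2` is still inferred to be `ex:Equipment` because that triple has an alternative derivation via `ex:SafetyDevice`.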
In our case, we do repairs based on a SPARQL/Update that satisfies the minimality principle, where we replace the old value with the new, consistent one that conforms with the data type of the ontology definition.
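The effect of such a repair can be sketched in Python. This is an illustration and not the SPARQL update used in the pipeline: the data and the property names are hypothetical, and the decision to convert only floats with an integral value while leaving all other violations for manual review is an assumption of this sketch.

```python
# Ranges taken from the ontology (hypothetical): property -> datatype.
RANGES = {"ex:ratedPower": "xsd:integer"}

# Hypothetical data: (subject, property, (lexical value, datatype)).
DATA = [
    ("ex:pump1", "ex:ratedPower", ("75.0", "xsd:float")),
    ("ex:pump2", "ex:ratedPower", ("90", "xsd:integer")),
    ("ex:pump3", "ex:ratedPower", ("12.5", "xsd:float")),
]


def repair(data, ranges):
    """Replace float literals that should be integers; touch nothing else."""
    repaired, needs_review = [], []
    for s, p, (lexical, datatype) in data:
        expected = ranges.get(p)
        if expected == "xsd:integer" and datatype != expected:
            try:
                number = float(lexical)
            except ValueError:
                number = None
            if number is not None and number.is_integer():
                # One triple deleted, one inserted: the minimal change.
                repaired.append((s, p, (str(int(number)), expected)))
                continue
            # Not safely convertible: keep as is and flag for a human.
            needs_review.append((s, p, (lexical, datatype)))
        repaired.append((s, p, (lexical, datatype)))
    return repaired, needs_review


repaired, needs_review = repair(DATA, RANGES)
print(repaired[0])    # ('ex:pump1', 'ex:ratedPower', ('75', 'xsd:integer'))
print(needs_review)   # [('ex:pump3', 'ex:ratedPower', ('12.5', 'xsd:float'))]
```

Only the violating value of `ex:pump1` is replaced and conforming triples are left alone, which is what the minimality principle asks for. The value `12.5` cannot be turned into an integer without losing information, so the sketch leaves that decision to a human.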
In order to automate the steps of validation and repair, we have created a pipeline in UnifiedViews that runs these steps in sequence, one after the other, as required by the dependencies between the steps. The “SPARQL Endpoint Loader” Data Processing Unit (DPU) issues the respective SPARQL/Update to the RDFox SPARQL endpoint.
After the pipeline has run and validation and repair have been performed, the data managed in PoolParty is consistent, as illustrated in the following figure. Note that the symbol that stands for inconsistency no longer appears in the data.
In this blog post we presented a practical use case where we solve a data consistency problem in ETL processing by integrating PoolParty Semantic Suite with the RDFox database. RDFox provides high-performance data processing accessible via the SPARQL and SHACL standards and implements an RDF-based Datalog engine that can be used for in-store deductions. This integration architecture enables us to automate data validation and repair as far as possible in order to improve the quality of KGs in enterprise scenarios.
The team behind Oxford Semantic Technologies started working on RDFox in 2011 at the Computer Science Department of the University of Oxford with the conviction that flexible and high-performance reasoning was a possibility for data-intensive applications without jeopardising the correctness of the results. RDFox is the first market-ready knowledge graph designed from the ground up with reasoning in mind. Oxford Semantic Technologies is a spin-out of the University of Oxford and is backed by leading investors including Samsung Venture Investment Corporation (SVIC), Oxford Sciences Enterprises (OSE) and Oxford University Innovation (OUI).