In 2012, the BBC famously used linked data to support coverage of the London Olympics on its website, app, and interactive video player. They have continued to champion the benefits of semantic technologies to this day.
Coincidentally, as I started to write this, the Tokyo 2020 Olympics were planned to be in full swing. Therefore only one topic can so aptly be chosen for this article: The Olympics.
I usually write linked data tutorials, but in this article I cover the full development of a small project, from a few disparate data sources to a complete dynamic dashboard. To do this smoothly, I have selected RDFox by Oxford Semantic Technologies as my triplestore and reasoning engine for this project. Wallscope’s platform is of course triplestore agnostic, so any triplestore can be used, but the reasons for this decision are made clear throughout the article.
Data management and integration has always been a huge problem with floods of new documents (health records, contracts, spreadsheets with columns titled ‘CY3’ and a Word document naturally stored somewhere else telling you what ‘CY3’ means), financial transactions, social media mentions, staff payroll updates, website traffic tracking data, and the list goes on…
Handling all of this data as it comes in live and assimilating it with existing information is a challenge for businesses of every size. This challenge cannot be ignored either - especially in the current climate. Efficiently utilising this unified data in real-time could help the public sector control budget cuts, large businesses retain their staff, and small businesses survive.
I really hope I have conveyed how important it is to tackle this challenge but with the ‘sales pitch’ over - let’s move on and show you what we can do.
To add a little context to the following sections, I thought I’d share my initial designs of this project’s final output. The plan was to create an interface that lets users seamlessly compare Olympic athletes (spoiler alert: we succeeded). We run through the data sources, reasoning, queries, and final result below but these drawings should clarify why we make certain decisions.
First we want an “Athlete View”, to allow the user to single out an individual athlete. In this view we have an infobox, news column, and dynamic charts to compare the selected athlete with their competitors.
Next we aggregate up to a “Sport View”. We again have an infobox and news column but our charts are centred around the selected sport. For example, we could investigate whether skiers have become lighter or heavier over the years.
Finally, we have the “Continent View” and there are once again some dynamic charts, an infobox, and a news column containing unstructured text that mentions that continent. You could check whether there is a recent rise in African footballers for example.
With the three planned views in mind, we need the data to populate our dashboard.
To imitate the disparity in data sources within organisations, we have brought together a few different data sources. These are:
The initial knowledge graph we are using was originally created by me for another tutorial. Running queries on an athlete’s age, for example, was not a concern at the time, as I was explaining how to create a knowledge graph from a tabular dataset using OpenRefine. I downloaded the “120 Years of Olympic History” CSV file from Kaggle and made the RDF version available on GitHub.
This knowledge graph contains athletes with their attributes (height, weight, age, sex, team) and medals they won (including the sport, games and year they won them). Each team was also attached to their National Olympic Committee (NOC) code. For example, let’s take a quick look at the gold medal Jessica Ennis-Hill won at London 2012:
Our small tabular dataset just contained a list of NOCs next to their relevant continents. For transcontinental countries (those with land in more than one continent), the continent in which the majority of their land belongs was chosen. This was transformed into triples using a small script. To give an example, here is Portugal (NOC is “POR”) in turtle format:
You may notice that I used dbo:continent to link each NOC to its appropriate continent entity. I then used schema:Continent as each continent entity’s type, indicating that the entity represents a continent. To find these, I used Wallscope’s Pronto tool which is free and open-source (developed by Francesco Belvedere). I will explain how to do this in just two screenshots:
In order to represent a major challenge that businesses face, we needed to include some unstructured text. To do this, I downloaded dumps of Reddit during the last few Olympic games (namely: London 2012, Sochi 2014, Rio 2016, and PyeongChang 2018). I then filtered this data to the 30k submissions in r/olympics for relevance.
This was then processed using Wallscope’s Data Foundry to extract relevant entities and map them to their counterparts in the initial knowledge graph (described above). Essentially, Data Foundry reads through each Reddit submission and extracts any information that it deems relevant. This information is then transformed into a knowledge graph which, as mentioned, is then linked to entities within existing knowledge graphs.
This enhanced graph can then be queried for Reddit submissions relevant to a specific athlete, sport, country, continent, etc…
At this point, our various data sources have been linked together for analysis but we want to do this quickly as new data comes in. This is where RDFox’s incremental reasoning comes in.
The knowledge graphs that Wallscope create and utilise are stored in RDF triplestores (a type of database), and RDFox is the one we are using in this project. One reason for this is its fantastic reasoning engine, which runs as new data comes in; we have therefore used it here to improve the performance of our final demo. There are other pros to using RDFox of course, such as its impressive speed.
When working with a client, Wallscope’s team work closely with you to design an easy-to-use and intuitive interface to explore and present your data in the most suitable manner for your use case. We therefore know what calculations and aggregations will likely be requested through this interface and can optimise for this.
There are countless examples of heavy queries over large graphs. For example, if a business had years of financial transactions of varying types (materials, payroll, insurance, etc…) to process and the staff often require a condensed overview of this information - they may request a summary of all material purchases.
This query has to run through all material transactions and run several aggregations to return a full and accurate report, taking a significant amount of time. Instead (since we know this is a critical summary), we could run these calculations as material transactions take place and integrate this new information into the knowledge graph. As a result, the team can output this report in an instant, make decisions faster, and continue with their day.
An example in a health-related context could be live reporting of how many patients are in the cardiovascular ward. What capacity remains, and how does that compare across the rest of the country?
To illustrate this in practice, we will apply some reasoning to our Olympics data using RDFox rules.
Starting off very simply, we can restructure an existing graph if needed. For example, we can link an athlete to the games they participated in with the following rule:
The head of the rule (before “:-”) is created and stored in RDFox if the conditions in the body of the rule (after “:-”) hold. In this example, when an instance links to both an athlete and an Olympic games (in the original graph), we can deduce that the athlete in question took part in those games.
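To make the head/body idea concrete, here is a minimal Python sketch of what such a rule derives. It is not RDFox syntax and not the rule used in this project: the rows and identifiers (`instance1`, `athleteA`, and so on) are hypothetical, and only the link name `athleteInGames` comes from the article.

```python
# Toy stand-in for the original graph: each "instance" row may link to an
# athlete and to a games. All identifiers here are hypothetical.
instances = [
    {"id": "instance1", "athlete": "athleteA", "games": "London2012"},
    {"id": "instance2", "athlete": "athleteA", "games": "Beijing2008"},
    {"id": "instance3", "athlete": "athleteB", "games": "London2012"},
    {"id": "instance4", "athlete": "athleteC"},  # no games, so the body fails
]


def derive_athlete_in_games(rows):
    """Return the (athlete, 'athleteInGames', games) triples implied by rows."""
    derived = set()
    for row in rows:
        # Body of the rule: the instance links to both an athlete and a games.
        if "athlete" in row and "games" in row:
            # Head of the rule: the new, direct athlete-to-games link.
            derived.add((row["athlete"], "athleteInGames", row["games"]))
    return derived


athlete_in_games = derive_athlete_in_games(instances)
```

In a reasoning engine the derived triples are stored alongside the original data, so later rules and queries can use the direct link without repeating the join.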
As our original Olympic knowledge graph was made for a short tutorial, we really need to refactor the athlete ages. Essentially, each athlete is linked to the age at which they won a medal - resulting in athletes with multiple medals also having multiple ages (whoops, my fault). This is not a problem with access to RDFox’s rules, however, as we can grab the year that an athlete won their first medal, grab the youngest age linked to that athlete, and calculate their birth year. Problem solved!
To begin refactoring, let’s attach each athlete to the youngest age at which they won an Olympic medal:
In this example, all of an athlete’s ages are grabbed and the MIN found. Then, this minimum age is linked to the athlete with wso:minAge.
Similarly, we need the earliest year that an athlete won a medal:
Notably here, we are using wso:athleteInGames which we created two rules above - starting to create a small hierarchy of rules.
Finally, using the earliest year that an athlete won an Olympic medal and their age at the time, we can calculate each athlete’s birth year:
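The three refactoring steps (youngest age, earliest year, birth year) can be sketched in plain Python. This illustrates the arithmetic only and is not the RDFox rule itself; the records and the function name `birth_years` are assumptions made for the example.

```python
# Hypothetical medal records: (athlete, age when the medal was won, year won).
medal_records = [
    ("athleteA", 22, 2008),
    ("athleteA", 26, 2012),
    ("athleteB", 31, 2016),
]


def birth_years(records):
    """Estimate each athlete's birth year as earliest year minus youngest age."""
    min_age = {}
    first_year = {}
    for athlete, age, year in records:
        # Step 1: youngest age linked to the athlete (the MIN aggregate).
        min_age[athlete] = min(age, min_age.get(athlete, age))
        # Step 2: earliest year in which the athlete won a medal.
        first_year[athlete] = min(year, first_year.get(athlete, year))
    # Step 3: birth year = earliest medal year - age at that time.
    return {athlete: first_year[athlete] - min_age[athlete] for athlete in min_age}


estimated = birth_years(medal_records)
```

Because only whole-number ages are available, an estimate like this can be off by one year depending on when in the year the athlete’s birthday falls.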
Hopefully these short examples are relatively clear, so let’s move on to designing whole new entities as the foundation for later rules:
Essentially, this rule creates Participation (?part) entities which are similar to instances in the original graph but less convoluted for further extensions and rule building. To illustrate an extension to this participation entity, we can link the number of medals an athlete wins to their participation at an Olympic games:
To illustrate the development of further rules using participation entities, we can create a rule to link an athlete to the total number of medals they have ever won at the Olympics:
As you can see here, athletes are either connected to their wso:totalMedalCount (if they have won a medal), or to a wso:totalMedalCount of zero (if they have not ever won a medal).
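The “count, or zero if there is nothing to count” behaviour described here can be sketched as follows. This is a Python illustration with hypothetical data, not the project’s rule; only the idea of a total medal count per athlete is taken from the article.

```python
from collections import Counter

# Hypothetical data: every known athlete, plus one row per medal won.
athletes = ["athleteA", "athleteB", "athleteC"]
medal_rows = [
    ("athleteA", "Gold"),
    ("athleteA", "Silver"),
    ("athleteB", "Bronze"),
]


def total_medal_counts(all_athletes, rows):
    """Map every athlete to their medal total, using 0 when they have none."""
    counts = Counter(athlete for athlete, _medal in rows)
    # Counter returns 0 for missing keys, which covers medal-less athletes.
    return {athlete: counts[athlete] for athlete in all_athletes}


totals = total_medal_counts(athletes, medal_rows)
```

Giving medal-less athletes an explicit zero matters for later averages: without it, they would simply be missing and the averages would be inflated.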
Developing this even further, we can link the birth years we calculated earlier to the average wso:totalMedalCount of all athletes born in that year:
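As a rough illustration of this final aggregation step, the Python sketch below groups athletes by birth year and averages their medal totals. The input dictionaries are hypothetical stand-ins for values that earlier rules would have derived, and the function name is an assumption.

```python
from collections import defaultdict

# Hypothetical outputs of earlier rules: a birth year and a medal total
# for each athlete.
birth_year = {"athleteA": 1986, "athleteB": 1986, "athleteC": 1990}
total_medals = {"athleteA": 2, "athleteB": 1, "athleteC": 0}


def year_average_medals(years, totals):
    """Link each birth year to the mean medal total of athletes born then."""
    grouped = defaultdict(list)
    for athlete, year in years.items():
        grouped[year].append(totals.get(athlete, 0))
    return {year: sum(counts) / len(counts) for year, counts in grouped.items()}


averages = year_average_medals(birth_year, total_medals)
```

The point of the hierarchy is visible here: this step never looks at individual medals, only at values that supporting rules have already worked out.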
There are many other rules used in this project but I really wanted to highlight this hierarchy of rules dependent on supporting rules (which in turn depend on supporting rules, etc…). This may sound minor but is very valuable in practice.
Returning to our healthcare example: How many patients are in the cardiovascular ward and what capacity remains? Hospital management may want to know this for their specific hospital, regional directors may want to know this on a healthboard level, and an MP may want to know this at a national level. In addition to the levels of aggregation, they all want to know this information is accurate, updating live as patients are admitted and discharged.
With our hierarchy of rules, information is efficiently updated at all relevant levels of aggregation as data is received.
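The healthcare example can be sketched in Python to show why updating aggregates as data arrives is cheap. This is only an illustration of the idea, not how RDFox implements incremental reasoning; the class, hospital names, and health board names are all hypothetical.

```python
from collections import Counter


class WardOccupancy:
    """Keeps patient counts current at hospital, health board and national level."""

    def __init__(self, hospital_to_board):
        self.hospital_to_board = hospital_to_board
        self.by_hospital = Counter()
        self.by_board = Counter()
        self.national = 0

    def _apply(self, hospital, delta):
        # Each event touches only the aggregates it affects; nothing is
        # recomputed from scratch.
        self.by_hospital[hospital] += delta
        self.by_board[self.hospital_to_board[hospital]] += delta
        self.national += delta

    def admit(self, hospital):
        self._apply(hospital, 1)

    def discharge(self, hospital):
        self._apply(hospital, -1)


occupancy = WardOccupancy(
    {"hospital1": "boardX", "hospital2": "boardX", "hospital3": "boardY"}
)
occupancy.admit("hospital1")
occupancy.admit("hospital2")
occupancy.admit("hospital3")
occupancy.discharge("hospital2")
```

Reading any level of the hierarchy is then a simple lookup, regardless of how many admissions and discharges have been processed.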
So far we have designed an interface, enhanced a variety of data sources, and developed some incremental reasoning to support our dashboard’s live performance. To finally populate our charts, we just need to write a few SPARQL queries to return the appropriate results.
I am going to assume that you would be comfortable writing queries to populate the infoboxes (see note above if not). I will therefore run through some of the queries that populate the charts and the news columns.
Starting with the “athlete view”, we have a histogram that displays the average number of medals athletes have won, bucketed by age. In the last rule we described earlier, we combined each athlete’s birthYear and totalMedalCount to link each year to exactly this metric. Therefore, our query is much simpler and can be written like so:
Note the use of the link wso:yearHasAverageMedals - which is the link we created in the last rule described.
For the parallel coordinates plot we need a slightly larger query, using some of the rules that I didn’t detail above (they can all be found on GitHub). Essentially, we have another hierarchy of rules to aggregate athlete stats by sex, sport, continent, and year. This design allows us to populate our more dynamic charts very quickly as the dropdown options are used.
Using this extra information we deduced through our reasoning, we can send the following query:
This calculates and returns the global average weight, height, and age of all Olympic athletes ever. By switching the ?continent or ?sport variables to a fixed entity (see in query comments), we can return more specific aggregates to the user. We do not aggregate by sex in our rules as this can be done very easily within the query. To clarify, we could theoretically have one billion athletes in this graph but the query would still only be finding the mean of two returned values.
To allow public data access without compromising security, Wallscope’s Platform provides a data access management layer (called HiCCUP) that allows fine grained control over the data that can be accessed within the knowledge graph, and makes it available as an API that can be consumed from different applications.
In the “sport” view, we want to report the top athletes, ordered by total medal count. Luckily, we made a rule above to attach athletes directly to the number of Olympic medals they have won in their career. Let’s find the top five male swimmers:
Again, this query would be significantly more complicated if we had not used wso:totalMedalCount. As a preview, here are the results of this specific example:
With these examples, I hope you can see how our ‘on the fly’ reasoning and aggregation removes a lot of the pressure from the queries themselves. This results in a more responsive application without compromising accuracy.
To populate the ‘News’ section of the interface we do something a little different. As mentioned above, we used Wallscope’s platform to process the 30k Reddit texts that we downloaded. This process outputs a knowledge graph that represents the submissions and their content. Finally, we then map this graph to our core entities for retrieval of related “news”.
The first query returns all of the platform’s output entities (Reddit submissions) that match a given text. For this example, we are looking for Michael Phelps.
This query outputs all Reddit submissions that mention Michael Phelps and the output should look like this:
With these matching submissions, we now want to retrieve all the other entities that are linked to the same submissions. This query will return entities related to Michael Phelps and provide us with navigation hooks.
We now have submissions linked to Michael Phelps and other related entities. Again, here is the example output:
In this case, “cc00aafe-a54c-41ae-8ea0-a10570e493c9” represents “Michael Phelps” and “baede3ee-9471-494d-8d11-3544b53e1067” represents “Swimming”.
Finally, we need to map the related entities to our core knowledge graph. Like the previous query, we filter results for entities that have types which interest us. In this case we want to find athletes, sports, and continents for navigation hooks.
This final query in the chain links Michael Phelps to swimming through a Reddit submission that mentions them both. Once again, here is the example output:
In a usual project, we wouldn’t store any of these intermediary entities - this is for demonstration purposes only. We would output the following:
This allows us to open the page for Michael Phelps and quickly display the submission relating him to swimming in the interface.
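The three-step chain described above (find matching submissions, collect the other entities they mention, keep only entities of interesting types) can be sketched in Python. The real project does this with SPARQL over the platform’s output graph; the data, type labels, and function name below are hypothetical and only illustrate the logic.

```python
# Hypothetical extraction output: which entities each submission mentions.
submission_entities = {
    "submission1": {"Michael Phelps", "Swimming"},
    "submission2": {"Michael Phelps"},
    "submission3": {"Swimming", "Europe"},
}

# Hypothetical mapping from entity to its type in the core knowledge graph.
core_types = {
    "Michael Phelps": "Athlete",
    "Swimming": "Sport",
    "Europe": "Continent",
}


def related_entities(target, mentions, types,
                     wanted=("Athlete", "Sport", "Continent")):
    """Map each related core entity to the submissions it shares with target."""
    # Step 1: submissions that mention the target entity.
    matches = [sub for sub, entities in mentions.items() if target in entities]
    related = {}
    for sub in matches:
        # Step 2: other entities mentioned in the same submission.
        for entity in mentions[sub] - {target}:
            # Step 3: keep only entities whose core type is of interest.
            if types.get(entity) in wanted:
                related.setdefault(entity, []).append(sub)
    return related


hooks = related_entities("Michael Phelps", submission_entities, core_types)
```

Each key in the result is a navigation hook, and its list of submissions is what the “news” column would display next to it.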
How can we present all of the above work? Well, we developed a dashboard with three views: the athlete view, sport view, and continent view. Now I know that my drawings at the start of the article were incredible (sarcasm of course) but Johnny, Dorota, and Antero really brought this to life.
When you open the dashboard you land on a completely random athlete’s page - for example, I began with Allyson Felix.
Allyson Felix is an outstanding sprinter, competing in the Olympics from 2004 to 2016. The top of her page is shown below.
You can see information that we know about an athlete, personalised charts, and related Reddit posts. Every section contains an info button which explains what is being shown and what we are doing behind the scenes.
Further down in the athlete view there is a chart with some filters. The parallel coordinates plot titled “Statistics Comparison” compares the current athlete to the average Olympian. Using the filters, I have compared Allyson Felix to the average male basketball player from a country in Oceania (a very tall bunch it seems):
From the sport in Allyson’s infobox, we can navigate through to the sport view by clicking “Athletics”:
In this view, we can see that Allyson Felix is not just a great sprinter, but the top female athlete who competes in athletics events (by medal count). The sport view contains many charts and I noticed something interesting while examining the “Medals Per Continent” chart:
You may have noticed that I selected 1972 on the “Medals Per Continent” slider. This chart displays two groups of bars. On the left we can see the number of medals that were won by continents (well, athletes representing countries within that continent). On the right we can see the number of athletes that competed for countries within each continent.
In 1972 the chart looks as expected, but let’s look at four sequential summer Olympics (1972 to 1984):
I was sliding through the years and noticed that Africa all but disappears in 1976 - but only for the 1976 games? Similarly, North America follows the same pattern but at the 1980 games?
To investigate this further, I will type “Africa” into the search bar and head to the continent view:
The summer and winter Olympics occur every four years but staggered by two years (summer in 2012, winter in 2014, summer in 2016, winter in 2018, etc…). African athletes rarely compete at the winter Olympics (only 5 at Sochi 2014), hence the vastly different numbers of athletes every two years.
Investigating the earlier question, we can filter the chart for “Athletics”:
Now the drop in athlete numbers is very obvious! The number of African athletes dropped to almost zero in 1976.
I genuinely noticed this using our interface so had to dig for an answer. It turns out that African countries boycotted the Olympics in 1976 because New Zealand were not banned from the games. South Africa had been banned from the Olympics since 1964 because they refused to condemn apartheid. New Zealand’s rugby team were touring South Africa at the time, which sparked the boycott. Source here.
It turns out that the drop in North American athletes in 1980 was also the result of a boycott. As every continent view has the same layout, we can check the same filtered chart as above on the “North America” page:
The 1980 Olympics were held in Moscow, then part of the Soviet Union, and the United States boycotted the games in protest against the Soviet invasion of Afghanistan (source).
Finally, to show off the “news” sections in each view - let’s look at a couple of athletes.
I have decided to choose one historical athlete that still gets talked about on Reddit for obvious reasons - Jesse Owens:
As you can see, we display Reddit posts that are related to the entity of interest (Jesse Owens in this case). In addition, if other entities are also mentioned, a related second tag appears as a navigation tool.
In other very unrelated news, bobsledder Johnny Quinn got stuck twice at Sochi 2014:
Johnny first got stuck in a bathroom while taking a shower. After calling for help, he had to use his bobsleigh skills to bash through the door!
He then got stuck in a lift… I don’t imagine he managed to barge out of that one however.
From a few disparate data sources, we noticed data anomalies through an interface and learned something new. This was all built using the combined power of RDFox and Wallscope’s platform, so I hope I have conveyed the power of applications built like this one (not just for the Olympics use-case, but more generally).
If you want to discuss how you could benefit from anything discussed in this article, please feel free to get in touch with us here or here.
If you are a developer, you can find the application GitHub repo here.
We have also created a quick video to show off the demo:
Finally, when the next Olympics go ahead, we are planning to update this project. I will tweet about this at the time - I really hope we have the time!