Summer School AI for Industry: Exercises Knowledge Graphs

Observing existing data

In this part, we will examine data from a well-known Linked Data provider, DBpedia. We will do most of the work in a Web browser. Do the following:

Start a Web browser.

Using the address bar, go to http://dbpedia.org/page/Tim_Berners-Lee. What is this page describing?
Observe the data available there. The Web page is an HTML document, but it shows RDF triples from the RDF database DBpedia in an almost human-readable form. Try to figure out which triples are shown there. Which URIs are used as the subject, predicate and object of these triples?
The Web page shows a table with two columns. The first column (with header Property) has values that are hyperlinks. Click on some of those links to get a feel for the structure of the data. Look specifically at what is shown for dbo:birthDate. What kind of information does this property provide?
Go back to the previous page. Can you find the description of the entity in English? What is the property used to provide the description?
Now, look at the second column in the table, with header Value. Some values are hyperlinks, some are not. What does it mean when the value is a hyperlink?
Consider the line where the Property is dbo:birthPlace. Hover your mouse over the second link in the Value column. At the bottom left of the browser window, you should see the URL to which this link points. Write down this URL or memorise it well.
Click on the link, then take a look at the address bar in your browser. Compare it to the link you saw just before. Why are they different? What does the address in the link represent, compared with the address to which you are redirected?
On that page (that is, http://dbpedia.org/page/England) consider the Property dbp:areaKm. What is the number in the Value column? What does the text between brackets represent? Take a look at dbp:gdpNominalPerCapita. What does the value formally represent? What is its type?
In the header of the page, you can see “Formats”. Select the Turtle format and look at its content. You can also look at other RDF formats, in particular RDF/XML and JSON-LD.
Tim Berners-Lee is also described in other RDF data sets on the Web. Find the property owl:sameAs (not to be confused with schema:sameAs) and look at the values there. You can see URIs that point to other domains. All of them contain RDF data.
As in DBpedia, the data is usually displayed in HTML, but there are links to the RDF data. Find RDF files that describe Tim Berners-Lee at the Deutsche Nationalbibliothek, at the BBC, and finally in DBLP.
Challenge (difficult): Starting from Tim Berners-Lee’s URI on DBpedia, find a minimal subset of the Web of Data (a set of triples published on the Web) that contains both Tim Berners-Lee’s DBpedia URI and Antoine Zimmermann’s Wikidata URI.

Using `curl` to consume Linked Data

In this part, you will use cURL for getting data from the Web. If you are already familiar with cURL, you can jump to the next section.

If you do not have it already, download cURL and put it in a folder you will remember. On Linux or macOS, cURL is available as a package; use your package manager (such as apt or brew) to get it.

You may need to update the PATH variable in your system environment configuration. On MS Windows, you can press Windows key + R, then type SystemPropertiesAdvanced. Then click Environment variables.... Then find the variable Path or PATH in the user or system variables. Edit it and add the path to the folder where you put cURL.

We will first learn the basics of cURL, then use it to understand how Linked Data principles and best practices are implemented.

Open a web browser and type http://mines-stetienne.fr in the address bar. Notice what is happening. We are going to compare this to what cURL does.
Open a command line window.
Type curl -V to check that cURL is working. If not, go back to the previous steps.
Type curl http://mines-stetienne.fr and look at the result. cURL displays the payload (that is, the “body”) of the HTTP response. In this case, it is an HTML document saying that the document was moved to https://mines-stetienne.fr.
Type curl https://mines-stetienne.fr and look at the result. It should be empty. We need to figure out what is happening.
Type curl -I https://mines-stetienne.fr and look at the result. -I makes cURL send a HEAD request and display only the headers of the HTTP response, not the payload. We see that the resource with URI https://mines-stetienne.fr was found at another location, and we see the location where we can find it.
Type curl https://www.mines-stetienne.fr/ and look at the result. This time, you get a web page. This is the HTML code of the page you see in your browser.

The HTTP response codes 301 Moved Permanently and 302 Found are commonly called “redirects”. Your browser directly displays the Web page because it is “following” the redirects. You can check that the URL in the address bar of your browser is https://www.mines-stetienne.fr/. The browser stops following redirects when it receives a 200 OK: this means that the resource you requested (namely, https://www.mines-stetienne.fr/) has been found and that the response contains it. Such a resource, which can be transmitted as a file, is an information resource.

You can follow redirects with cURL, using the option -L. Check this: curl -L http://mines-stetienne.fr. You can also see what the server is responding at each step of the negotiation by adding -I. You can get even more details about the requests and responses by further adding -v or --verbose.

Use cURL on Linked Data

We will use cURL and DBpedia to see how Linked Data can be accessed via HTTP.

We want to get a representation of the resource identified by http://dbpedia.org/resource/Tim_Berners-Lee. Use cURL and see what URIs must be requested, in order, to reach a final representation. Depending on your system configuration, you may have to use the option -k (which skips TLS certificate verification) when requesting https URIs.
What is the format of the final response, after following the redirects? What is the response code?
You can request a different format by changing the headers of your HTTP request. Type curl -H "Accept: text/turtle" http://dbpedia.org/resource/Tim_Berners-Lee. Use -H "Accept: text/turtle" on all necessary requests to reach a 200 OK and get some data.
Write a single command that gets the RDF/XML representation of the resource http://dbpedia.org/resource/Tim_Berners-Lee.

Follow the links

If the fourth Linked Data principle is applied, there should be links from one data set to another, so that we can follow links to discover more data.

A tool that can help you navigate through RDF data is RDF Browser, a Firefox extension that shows RDF in the browser whenever it is available by content negotiation. If you have Firefox, you can install and try this extension.

As an alternative, you can also use Postman. Postman is comparable to cURL, except that it has a graphical interface that facilitates navigation (among other things). All links that appear in a response body will be clickable. You will have to manually add Accept headers for content negotiation, though.

In DBpedia, Mines Saint-Étienne’s data are linked to many other data sets. Using the RDF Browser, Postman or cURL (in this order of preference), find a path starting from https://dbpedia.org/resource/%C3%89cole_nationale_sup%C3%A9rieure_des_mines_de_Saint-%C3%89tienne and leading to the University of São Paulo, then to the Technical University of Berlin, then to the University of St. Gallen. This can take a while if you go in the wrong direction, so do not spend too much time on it!

Draw a knowledge graph

In terms of knowledge graph modelling, we start with a simple exercise that you can do on paper. Use the following facts:

myProductionSystem is a System
myProductionSystem has subsystem roboticArm1
myProductionSystem has subsystem conveyorBelt2
roboticArm1 is a System
roboticArm1 is a RoboticArm
roboticArm1 has manufacturer ABB
conveyorBelt2 is a System
conveyorBelt2 has speed 0.1

Identify connections, things, and values. Depict things in circles. Depict values in rectangles. Depict connections using arrows. Draw the graph on a piece of paper.

Quick Turtle tutorial

This is a mini tutorial on how to model knowledge in RDF, with a few tips on good and bad practices. You should be able to go through this section quickly. The practical work is given in the next section.

Basic relations

An RDF graph contains a set of node–arc–node relations. A simple graph like this one:

Figure: A basic *node–arc–node* relation (Daniel works for Google).

can be encoded in Turtle as follows:

<Daniel> <worksFor> <Google> .

This forms a triple where we will call the first element of the triple its subject, the second element its predicate, and the third element its object. Note that there is a dot at the end. For a more complex graph like:

Figure: Multiple *node–arc–node* relations (Daniel works for Google and Google has parent company Alphabet).

we can simply add more triples, separated by dots:

<Daniel> <worksFor> <Google> .
<Google> <hasParentCompany> <Alphabet> .

Note again that the dot separates the triples. When there are multiple arcs coming out of the same node, we can simplify the notation. The following graph:

Figure: Multiple *node–arc–node* relations with the same subject (Google has parent company Alphabet and has headquarter at the Googleplex).

can be written like this:

<Google> <hasParentCompany> <Alphabet> .
<Google> <hasHeadquarter> <Googleplex> .

or, more concisely, like this:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> .

When the subject is the same, we can avoid repeating it by adding a semicolon between predicate–object pairs. When the series of predicate–object pairs is finished, we must add a dot. We can further simplify the notation when the subject and the predicate are the same. The following graph:

Figure: Multiple *node–arc–node* relations with the same subject, or the same subject and predicate (Google has parent company Alphabet, was founded by Sergey Brin and Larry Page, and has headquarter at the Googleplex).

can be written:

<Google> <hasParentCompany> <Alphabet> .
<Google> <hasHeadquarter> <Googleplex> .
<Google> <hasFounder> <LarryPage> .
<Google> <hasFounder> <SergeyBrin> .

or more concisely:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> ;
    <hasFounder> <LarryPage> ;
    <hasFounder> <SergeyBrin> .

or even more concisely:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> ;
    <hasFounder> <LarryPage>, <SergeyBrin> .

Note the comma that separates two objects for the same subject and predicate.

Remarks: (1) the order of the triples is not important; (2) there is no shortcut when the object is repeated with different subjects or predicates; (3) a given triple cannot appear multiple times in a graph (i.e., if the subject, the predicate, and the object are the same, then the triple is the same, and writing it twice changes nothing); (4) in Turtle, spaces cannot be used in node names or predicate labels.
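As a small illustration of remarks (1) and (2), here is a sketch that reuses nodes from the examples above. The predicate <hasFounded> is hypothetical and only serves the illustration:

```
# Remark (1): these two triples could be written in either order.
# Remark (2): both triples share the object <Google>, but Turtle has no
# abbreviation for a shared object, so each triple is written in full.
<Daniel> <worksFor> <Google> .
<LarryPage> <hasFounded> <Google> .
```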

Literals and datatypes

In general, nodes in RDF graphs represent things in the real world that we want to describe. These things can be concrete, physical entities (people, objects, etc.), or abstract things (concepts, ideas, legal entities, etc.). Most of these things cannot be fully encoded in a computer: only their (partial) description can be encoded. However, there are entities that can be fully represented and stored as data, such as integers, decimal numbers, character strings, and dates. In this case, we use a different type of node to represent them, which we call “literals” because what they represent is literally what’s written. In graphical notation, they are often drawn as rectangles:

Figure: Larry Page’s name is, literally, “Lawrence Edward Page”.

In Turtle, this is written:

<LarryPage> <name> "Lawrence Edward Page" .

A literal can have spaces in it. A literal can be of different types (number, string, date, etc.) and the set of literal types may be open, or even infinite in some applications. To make sure we interpret the value of a literal correctly, we must associate a datatype with it, as in the following example:

Figure: Larry Page’s name is “Lawrence Edward Page” and he was born on the 26th of March, 1973.

In Turtle, this can be written as:

<LarryPage> <name> "Lawrence Edward Page" ;
    <birthdate> "1973-03-26"^^xsd:date .

The datatype xsd:date determines how we interpret the string 1973-03-26. There exist standard datatypes that can be used more concisely in Turtle, for strings, integers, decimal numbers, and floating point binary numbers. A standard for dates exists but there is no short notation for it in Turtle. The following example shows how integers and decimal numbers can be written, and also displays comments in Turtle notation:

# This is a comment, starting with '#' and ending at the end of the line
<LarryPage> <name> "Lawrence Edward Page" ; # Character string
    <numberOfChildren> 2 ; # Integer: just a sequence of digits
    <height> 1.7 . # Decimal: 2 sequences of digits separated by '.'
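Floating point binary numbers also have a short notation: in Turtle, a number written with an exponent is read as an xsd:double, whereas the same digits without an exponent are read as an xsd:decimal. A minimal sketch, using property names from this tutorial:

```
<LarryPage> <height> 1.7 ;        # Decimal (xsd:decimal): digits and '.', no exponent
    <wealthInDollars> 107.9e9 .   # Double (xsd:double): the exponent 'e9' makes it a double
```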

Other features of Turtle

In order to use IRIs stemming from different places, we define prefixes. In this session, you do not need to be much concerned about IRI namespaces, but you may want to use at least the standard XML Schema Datatypes (XSDs). For this, write this line at the beginning of your Turtle file:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

Then you can use the XSDs like this:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<LarryPage> <name> "Lawrence Edward Page"^^xsd:string ; # Equivalent to "Lawrence Edward Page" without datatype
    <numberOfChildren> "2"^^xsd:integer ; # Equivalent to 2 without quotes and datatype
    <height> "1.7"^^xsd:decimal ; # Equivalent to 1.7 without quotes and datatype
    <birthdate> "1973-03-26"^^xsd:date ; # Uses ISO 8601 format
    <wikipediaPage> "https://en.wikipedia.org/wiki/Larry_Page"^^xsd:anyURI ;
    <wealthInDollars> "107.9e9"^^xsd:double . # Binary floating point double precision
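Prefixes work the same way for any namespace, not only for XSD. As a sketch, assuming a made-up namespace http://example.org/ (it is not used elsewhere in these exercises), the prefixed name ex:LarryPage stands for the full IRI <http://example.org/LarryPage>:

```
PREFIX ex: <http://example.org/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Written with prefixed names
ex:LarryPage ex:birthdate "1973-03-26"^^xsd:date .

# The same triple, written with full IRIs
<http://example.org/LarryPage> <http://example.org/birthdate> "1973-03-26"^^<http://www.w3.org/2001/XMLSchema#date> .
```

Both statements denote one and the same triple, so the resulting graph contains a single triple.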

You can give a type to an entity by using the special predicate a, which is more or less equivalent to the phrase “is a”:

<LarryPage> a <Person> .
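The keyword a is Turtle shorthand for the property rdf:type. The following sketch writes the same triple both ways, after declaring the standard rdf prefix:

```
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

<LarryPage> a <Person> .          # Short form
<LarryPage> rdf:type <Person> .   # Same triple, written with the full property
```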

Basic tips and good practices

Here is a set of tips that you need to have in mind when making a knowledge graph:

Nodes are entirely identified by their name. There cannot be two distinct nodes that have the same name. So, the following code:
```
<Google> <hasParentCompany> <Alphabet> .
<Google> <hasHeadquarter> <Googleplex> .
```
is a graph with 3 nodes, not 4.
Arc labels can be reused on multiple arcs, but there cannot be the same arc label twice with the same source and same destination. That is:
```
<Google> <hasParentCompany> <Alphabet>, <Alphabet> ;
    <hasParentCompany> <Alphabet> .
<Google> <hasParentCompany> <Alphabet> .
```
is a graph with only 1 triple.
Use standard datatypes for literals, when available, and prefer decimal notations (implicit xsd:decimal) over xsd:double or xsd:float.
Be careful how you name an entity. The same name always identifies the same entity. Avoid generic terms like house to describe a single house. Use, for instance, house1, house2, etc. to distinguish the entities, or use a slash like this:
```
<house/1> a <House> .
```
Some things may appear to be instances when in fact they are categories. For instance, “Samsung Galaxy S22 Ultra” may seem to be an instance of phone, but in fact, my Samsung Galaxy S22 Ultra that has been damaged is of the same category as your Samsung Galaxy S22 Ultra. Do not confuse a product model and a single product.
With RDF graphs, you can only represent binary relations. To represent arbitrary n-ary relations, you may have to introduce intermediary nodes that denote the relation, and connect them to the components of the relation. For instance, a sale connects a seller, a buyer, a product or service, a date, and a price. It could be represented as a graph like this:
```
<sale/152196> a <Sale> ;
    <soldBy> <JohnDoe> ;
    <boughtBy> <JaneDoe> ;
    <objectOfTransaction> <samsung/gs22ultra/sn456-997> ;
    <dateOfTransaction> "2022-03-11"^^xsd:date ;
    <priceInEuros> 499.95 .
```
Naming entities (nodes and arcs) is one of the most important tasks in knowledge representation. Bad naming conventions can make a knowledge model more ambiguous, more error-prone, more difficult to understand, and ultimately, not used at all. Be sure you establish naming conventions that you and your collaborators follow through all portions of your knowledge model. E.g., use CamelCase with capital initial for classes, lower case initial for instances and relations, etc. Use verbs for relations everywhere, or use nouns everywhere; or define a convention for where you use verbs, and where you use nouns. A noun for a relation like headquarter could mean “is the headquarter of” or “has headquarter”. Be consistent!
Name things and relations in a descriptive and unambiguous way. Avoid single generic words like has for a relation. “X has Y” could mean that X owns Y, or that X has the characteristic Y, or that X has the disease Y, etc. Be precise!
Make your description as independent of the context as possible. A number that denotes a price in euros must be distinguished from a number that is a price in dollars, or that represents an amount of objects, or that represents a measurement, etc. A word that has 2 senses in different contexts may not be sufficient. Be specific!
Avoid describing things that vary all the time. For instance, prefer the birthdate over the age of a person. Think about the potential need to keep the history of states of affairs during the whole life cycle of your RDF database.
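Several of the tips above can be sketched together in a few triples. This is only an illustration: the class <SamsungGalaxyS22Ultra>, the predicate <owns> and the second serial number are hypothetical names chosen for the example.

```
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Product model versus single product: two distinct phones of the same model.
# The second serial number is made up for this illustration.
<samsung/gs22ultra/sn456-997> a <SamsungGalaxyS22Ultra> .
<samsung/gs22ultra/sn456-998> a <SamsungGalaxyS22Ultra> .

# Precise relation name: <owns> (hypothetical) rather than a generic <has>.
<JaneDoe> <owns> <samsung/gs22ultra/sn456-997> .

# Context-independent value: the unit is part of the predicate name.
<sale/152196> <priceInEuros> 499.95 .

# Stable fact: a birthdate stays true, whereas an age would need yearly updates.
<LarryPage> <birthdate> "1973-03-26"^^xsd:date .
```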

Authoring data in RDF

You will be writing some RDF in the Turtle format. You can use the Turtle Editor that is available online, but text editors and IDEs often have syntax highlighting for Turtle, so you can also use your tool of choice.

Production line description

Write, in an RDF file, the mini-description of the production system that you drew earlier.

You can save the code to a file with file extension .ttl.

Describe the AI4Industry summer school

Define an IRI for the summer school itself. Add a human-readable label to it by relating this IRI to a character string with the predicate rdfs:label. It is a 35-hour training that started on the 24th of July, 2023 and ends on the 28th of July, 2023. The Web page of the summer school is "https://ai4industry2023.sciencesconf.org/"^^xsd:anyURI.
Create an IRI for yourself (and add a little description if you want). Indicate that you are a student attending the summer school.
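To use rdfs:label in Turtle, the rdfs prefix has to be declared first. A minimal sketch on an unrelated entity from the tutorial (the label text is only illustrative):

```
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

<Google> rdfs:label "Google" .
```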

Describe the summer school

Take a look at the programme of the summer school.
Start by describing the first session of the school. Define an IRI for the session and relate it to the IRI of the summer school. The session started on the 24th of July at 10:30 (CEST) and ended the same day at 11:00. It was presented by Gustavo Nardin. It took place in a room of the Novotel in Saint-Étienne.
Describe the current session. This session is taught by Antoine Zimmermann (https://w3id.org/people/az/me).
You should be able to quickly build an RDF graph with all the sessions of the summer school.
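Session times combine a date, a time of day and a time zone. One option, offered here as a hint rather than as the expected solution, is the XSD datatype xsd:dateTime, whose lexical form follows ISO 8601 and can carry a UTC offset (CEST corresponds to +02:00). The IRI, the predicate names and the times in this sketch are hypothetical and describe an unrelated event:

```
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<meeting/42> <startsAt> "2022-03-11T14:00:00+01:00"^^xsd:dateTime ;
    <endsAt> "2022-03-11T15:30:00+01:00"^^xsd:dateTime .
```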

Ensure the set of sessions is complete

On the Web, we cannot be sure that any piece of information is complete unless we have explicit information that says so. We want to make explicit at which session the school starts, that no session is missing between two timestamped sessions, and when the school ends.
Try to figure out a way of connecting the sessions together, and the sessions with the school itself, such that we know exactly the list of sessions for the entire summer school.

Distinguish different types of entities

In our description, we have entities for school sessions, the school itself, people etc. Provide explicit types for these things, using the property rdf:type (equivalently, using the Turtle keyword a).
We may want to distinguish between the summer school series (that was conducted in 2020, 2021, 2023, and will continue for a few years) and the AI4Industry 2023 that spans from 24 to 28 July 2023. Create an identifier for the general series and relate it to the school occurrence you described. Add a type to it.
Similarly, today’s session happens at a time and a place but could be repeated elsewhere and at a different time. We may want to describe a general, atemporal entity for this lecture. Add this to your graph.
In fact, the lecture today, and the summer school this year, could be considered as instances of a class of lectures and summer schools. These considerations touch on the semantic aspects that will be discussed further today.
