KRR: Lab work session on graphs

Here we start with the basics of graph-based knowledge representation. You should have read the introduction of the book Knowledge Graphs. What we do in this session is related to Section 2.1.1 on directed edge-labelled graphs. In this session, it may also be useful to consider Section 3.2 on identity. You may read Section 2.1.1, which is quite short (~3.5 pages), or you can start straight away with this page.

Preliminaries

This is a mini tutorial on how to model knowledge as graphs, with a few tips on good and bad practices. To allow you to edit knowledge graphs in a text editor, we will use a data format called Turtle, which is a Web standard from the W3C. It can be used to read, write, and store directed edge-labelled graphs. We will only see the most basic features of Turtle in this session. In the Knowledge Graphs book, every example given in a figure has a link to a GitHub repository where the directed edge-labelled graph is available in Turtle format. For instance, see Figure 2.1.

Basic relations

A directed edge-labelled graph essentially contains a set of node–arc–node relations. A simple graph like this one:

Figure: Daniel works for Google. A basic *node–arc–node* relation.

can be encoded in Turtle as follows:

<Daniel> <worksFor> <Google> .

This forms a triple where we will call the first element of the triple its subject, the second element its predicate, and the third element its object. Note that there is a dot at the end. For a more complex graph like:

Figure: Daniel works for Google and Google has parent company Alphabet. Multiple *node–arc–node* relations.

we can simply add more triples, separated by dots:

<Daniel> <worksFor> <Google> .
<Google> <hasParentCompany> <Alphabet> .

Note again that the dot separates the triples. When there are multiple arcs coming out of the same node, we can simplify the notation. The following graph:

Figure: Google has parent company Alphabet and has headquarter at the Googleplex. Multiple *node–arc–node* relations with the same subject.

can be written like this:

<Google> <hasParentCompany> <Alphabet> .
<Google> <hasHeadquarter> <Googleplex> .

or, more concisely, like this:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> .

When the subject is the same, we can avoid repeating it by adding a semicolon between predicate–object pairs. When the series of predicate–object pairs is finished, we must add a dot. We can further simplify the notation when the subject and the predicate are the same. The following graph:

Figure: Google has parent company Alphabet, was founded by Sergey Brin and Larry Page, and has headquarter at the Googleplex. Multiple *node–arc–node* relations with the same subject, or the same subject and predicate.

can be written:

<Google> <hasParentCompany> <Alphabet> .
<Google> <hasHeadquarter> <Googleplex> .
<Google> <hasFounder> <LarryPage> .
<Google> <hasFounder> <SergeyBrin> .

or more concisely:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> ;
    <hasFounder> <LarryPage> ;
    <hasFounder> <SergeyBrin> .

or even more concisely:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> ;
    <hasFounder> <LarryPage>, <SergeyBrin> .

Note the comma that separates two objects for the same subject and predicate.

Remarks: (1) the order of the triples is not important; (2) there is no shortcut when the object is repeated with different subjects or predicates; (3) a graph cannot contain the same triple more than once: if the subject, the predicate, and the object are the same, then the triple is the same, so writing it twice in a Turtle file still yields a single triple; (4) in Turtle, spaces cannot be used in node names or predicate labels.
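
Remarks (1) to (3) can be illustrated with a small sketch. The predicate hasSubsidiary is a hypothetical name introduced only for this illustration:

```turtle
# Remark (2): <Google> is the object of both triples, but no abbreviation
# exists for a shared object, so each triple is written in full.
<Daniel> <worksFor> <Google> .
<Alphabet> <hasSubsidiary> <Google> .   # <hasSubsidiary> is a hypothetical predicate

# Remarks (1) and (3): the same triples again, in another order,
# with one of them written twice. The file still describes a graph
# with exactly two triples.
<Alphabet> <hasSubsidiary> <Google> .
<Daniel> <worksFor> <Google> .
<Daniel> <worksFor> <Google> .
```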

Literals and datatypes

In general, nodes in knowledge graphs represent things in the real world that we want to describe. These things can be concrete, physical entities (people, objects, etc.), or abstract things (concepts, ideas, legal entities, etc.). Most of these things cannot be fully encoded in a computer: only their (partial) description can be encoded. However, there are entities that can be fully represented and stored as data, such as integers, decimal numbers, character strings, and dates. In this case, we use a different type of node to represent them, which we call “literals” because what they represent is literally what’s written. In graphical notation, they are often drawn as rectangles:

Figure: Larry Page’s name is, literally, “Lawrence Edward Page”.

In Turtle, this is written:

<LarryPage> <name> "Lawrence Edward Page" .

A literal can have spaces in it. A literal can be of different types (number, string, date, etc.) and the set of literal types may be open, or even infinite in some applications. To make sure we interpret the value of a literal correctly, we must associate a datatype with it, as in the following example:

Figure: Larry Page’s name is “Lawrence Edward Page” and he was born on the 26th of March, 1973.

In Turtle, this can be written as:

<LarryPage> <name> "Lawrence Edward Page" ;
    <birthdate> "26/03/1973"^^<date> .

The datatype date determines how we interpret the string 26/03/1973. There exist standard datatypes that can be written more concisely in Turtle, for strings, integers, decimal numbers, and binary floating-point numbers. A standard datatype for dates exists, but there is no short notation for it in Turtle. The following example shows how integers and decimal numbers can be written, and also displays comments in Turtle notation:

# This is a comment, starting with '#' and ending at the end of the line
<LarryPage> <name> "Lawrence Edward Page" ; # Character string
    <numberOfChildren> 2 ; # Integer: just a sequence of digits
    <height> 1.7 . # Decimal: 2 sequences of digits separated by '.'
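
The short notation for binary floating-point numbers is not shown in the example above. As a minimal sketch, Turtle reads a number written with an exponent as an xsd:double, whereas a number with a decimal point and no exponent is an xsd:decimal:

```turtle
<LarryPage> <height> 1.7 ;          # Decimal: digits, '.', digits, no exponent
    <wealthInDollars> 107.9e9 .     # Double: the exponent 'e9' makes it a floating-point number
```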

Other features of Turtle

Turtle is a relatively simple data format. You can learn more about it from the standard specification, or by reading Section 3.2 and checking the code for the examples in the graph figures. All examples of directed edge-labelled graphs in the Knowledge Graphs book are provided as Turtle in a GitHub repository. Click on the GitHub logo in the figure captions to get to the code.

The examples use Internationalized Resource Identifiers (IRIs) and define prefixes, but for the purpose of this session, you do not need to worry about this. Nonetheless, you may want to use at least the standard XML Schema Datatypes (XSDs). For this, write this line at the beginning of your Turtle file:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

Then you can use the XSDs like this:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<LarryPage> <name> "Lawrence Edward Page"^^xsd:string ; # Equivalent to "Lawrence Edward Page" without datatype
    <numberOfChildren> "2"^^xsd:integer ; # Equivalent to 2 without quotes and datatype
    <height> "1.7"^^xsd:decimal ; # Equivalent to 1.7 without quotes and datatype
    <birthdate> "1973-03-26"^^xsd:date ; # Uses ISO 8601 format
    <wikipediaPage> "https://en.wikipedia.org/wiki/Larry_Page"^^xsd:anyURI ;
    <wealthInDollars> "107.9e9"^^xsd:double . # Binary floating point double precision

Finally, you can give a type to an entity by using the special predicate a, which is more or less equivalent to the phrase “is a”:

<LarryPage> a <Person> .

In examples of the Knowledge Graphs book, the “is a” relation is written type. See for instance Figure 2.1.
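
For reference, the keyword a is Turtle shorthand for the standard RDF predicate rdf:type. The sketch below, which declares the standard rdf prefix, writes the same triple in both forms; for this session the short form is enough:

```turtle
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

<LarryPage> a <Person> .          # short form
<LarryPage> rdf:type <Person> .   # same triple, with the predicate written out
```

Since both lines denote the same triple, this file describes a graph with a single triple.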

Basic tips and good practices

Here is a set of tips that you need to have in mind when making a knowledge graph:

Nodes are entirely identified by their name. There cannot be two distinct nodes that have the same name. So, the following code:
```
<Google> <hasParentCompany> <Alphabet> .
<Google> <hasHeadquarter> <Googleplex> .
```
is a graph with 3 nodes, not 4.
Arc labels can be reused on multiple arcs, but there cannot be the same arc label twice with the same source and same destination. That is:
```
<Google> <hasParentCompany> <Alphabet>, <Alphabet> ;
    <hasParentCompany> <Alphabet> .
<Google> <hasParentCompany> <Alphabet> .
```
is a graph with only 1 triple.

Use standard datatypes for literals, when available, and prefer decimal notations over xsd:double or xsd:float. That is, write this:

<Pi> <approximateValue> 3.1415926535 .
# This is a short notation for a literal with datatype xsd:decimal

and not this:

<Pi> <approximateValue> "3.1415926535"^^xsd:double .

Be careful how you name an entity. The same name always identifies the same entity. Avoid generic terms like house to describe a single house. Use, for instance, house1, house2, etc. to distinguish the entities, or use a slash like this:
```
<house/1> a <House> .
```
Some things may appear to be instances when in fact they are categories. For instance, “Samsung Galaxy S22 Ultra” may seem to be an instance of phone, but in fact, my Samsung Galaxy S22 Ultra that has been damaged is of the same category as your Samsung Galaxy S22 Ultra. Do not confuse a product model and a single product.
With directed edge-labelled graphs, you can only represent binary relations. To represent arbitrary n-ary relations, you may have to introduce intermediary nodes that denote the relation, and connect them to the components of the relation. For instance, a sale connects a seller, a buyer, a product or service, a date, and a price. It could be represented like this as a graph:
```
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<sale/152196> a <Sale> ;
    <soldBy> <JohnDoe> ;
    <boughtBy> <JaneDoe> ;
    <objectOfTransaction> <samsung/gs22ultra/sn456-997> ;
    <dateOfTransaction> "2022-03-11"^^xsd:date ;
    <priceInEuros> 499.95 .
```
Graphs are just syntactic structures. While their meaning can often be guessed intuitively, there must be a formal semantics to derive valid conclusions from them. We will see different ways of interpreting graphs, but for the moment, you can assume that a triple <sub> <pred> <obj> . is equivalent to a first-order logic (FOL) atom pred(sub,obj), where pred is a predicate, and sub and obj are constants.
Naming entities (nodes and arcs) is one of the most important tasks in knowledge representation. Bad naming conventions can make a knowledge model more ambiguous, more error prone, more difficult to understand, and eventually, not used at all. Be sure you establish naming conventions that you and your collaborators follow through all portions of your knowledge model. E.g., use CamelCase with capital initial for classes, lower case initial for instances and relations, etc. Use verbs for relations everywhere, or use nouns everywhere; or define a rationale for where you use verbs, and where you use nouns. A noun for a relation like headquarter could mean “is the headquarter of” or “has headquarter”. Be consistent!
Name things and relations in a descriptive and unambiguous way. Avoid single generic words like has for a relation. “X has Y” could mean that X owns Y, or that X has the characteristic Y, or that X has the disease Y, etc. Be precise!
Make your descriptions as independent of the context as possible. A number that denotes a price in euros must be distinguished from a number that is a price in dollars, or that represents an amount of objects, or that represents a measurement, etc. A word that has two senses in different contexts may not be precise enough. Be specific!
Avoid describing things that vary all the time. For instance, prefer the birthdate over the age of a person. Think about the potential need to keep the history of states of affairs during the whole lifecycle of your knowledge base.
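
To make the tip on formal semantics more concrete, here is a small sketch that reuses triples from earlier in this page and gives, in a comment, the first-order logic atom each one corresponds to under the simple reading pred(sub,obj):

```turtle
<Daniel> <worksFor> <Google> .              # worksFor(Daniel, Google)
<Google> <hasParentCompany> <Alphabet> ;    # hasParentCompany(Google, Alphabet)
    <hasFounder> <LarryPage>, <SergeyBrin> . # hasFounder(Google, LarryPage)
                                            # hasFounder(Google, SergeyBrin)
```

The abbreviations with semicolons and commas are purely syntactic: each abbreviated line still stands for one atom per triple.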

Practical work

The following exercises ask you to model a situation or state of affairs as a directed edge-labelled graph, using Turtle.

Describe Mines Saint-Étienne

Write a Turtle file that models this:

Mines Saint-Étienne (officially “École nationale supérieure des mines de Saint-Étienne”) was founded on the 2nd of August, 1816. Its institutional address is 158 cours Fauriel, 42023 Saint-Étienne cedex 2. Its web site is https://www.mines-stetienne.fr/. It is part of Institut Mines-Télécom.

Save the code to a file with name YourFirstName-YourLastName-ex1.ttl.

People associated with Mines Saint-Étienne

Extend the graph from the previous exercise to describe the following:

Between the 1st of September, 2008 and the 14th of July, 2014, the director of Mines Saint-Étienne was Philippe Jamet, who then became director of Institut Mines-Télécom from the 15th of July, 2014 to the 2nd of September, 2019. Pascal Ray was director of Mines Saint-Étienne from the 15th of July, 2014 to the 30th of November, 2021. Between the 1st of December, 2021 and the 30th of April, 2022, the director was David Delafosse. Since the 1st of May, 2022, Jacques Fayolle has been the director of Mines Saint-Étienne.

Save the code to a file with name YourFirstName-YourLastName-ex2.ttl.

Product ownership

Write a Turtle file that describes the following situation:

Alicia and Kyoko both own a Toyota Prius. Alicia bought a new one in 2022, while Kyoko’s was a second-hand vehicle from 2013 that she bought in 2018.

Save the code to a file with name YourFirstName-YourLastName-ex3.ttl.

Send your 3 files to antoine.zimmermann@emse.fr.

Describe a social-networking platform

When doing this, it is best if you can already form groups of 4 people that you will keep for the rest of the course. If you do not have your complete group yet, you can work in pairs with someone who will be a member of your group until the end of the course. If your group is already formed, send me an email with members of your group in CC.

Take a look at a website that offers a social-networking platform. Choose a particular platform and observe what is possible to do with it as a user. Describe the platform in a graph, connecting it to what actions can be performed on it, and also to its content. For instance, a platform may allow one to register, to post messages, to connect with other members of the platform, to update one’s own profile, etc. A platform also allows users to read messages, read someone’s profile details, find responses to a thread, etc. You can discuss the best model with your team. For this part of the work, it is best to work on paper first.
