Representing knowledge as graphs

Here we start with the basics of graph-based knowledge representation. You should have read the introduction of the book Knowledge Graphs. What we do in this session is related to Section 2.1.1 on directed edge-labelled graphs. In this session, it may also be useful to consider Section 3.2 on identity. You may read Section 2.1.1, which is quite short (~3.5 pages), or you can start straight away with this page.

Preliminaries

This is a mini tutorial on how to model knowledge as graphs, with a few tips on good and bad practices. To allow you to edit knowledge graphs in a text editor, we will use a data format called Turtle, which is a Web standard from the W3C. It can be used to read, write, and store directed edge-labelled graphs. We will only see the most basic features of Turtle in this session. In the Knowledge Graphs book, all examples given in figures have a link to a GitHub repository where the directed edge-labelled graphs are available in Turtle format. For instance, see Figure 2.1.

Basic relations

A directed edge-labelled graph essentially contains a set of node–arc–node relations. A simple graph like this one:

Daniel works for Google
A basic node–arc–node relation

can be encoded in Turtle as follows:

<Daniel> <worksFor> <Google> .

This forms a triple where we will call the first element of the triple its subject, the second element its predicate, and the third element its object. Note that there is a dot at the end. For a more complex graph like:

Daniel works for Google and Google has parent company Alphabet
Multiple node–arc–node relations

we can simply add more triples, separated by dots:

<Daniel> <worksFor> <Google> .
<Google> <hasParentCompany> <Alphabet> .

Note again that the dot separates the triples. When there are multiple arcs coming out of the same node, we can simplify the notation. The following graph:

Google has parent company Alphabet and has headquarter at the Googleplex
Multiple node-arc-node relations with the same subject

can be written like this:

<Google> <hasParentCompany> <Alphabet> .
<Google> <hasHeadquarter> <Googleplex> .

or, more concisely, like this:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> .

When the subject is the same, we can avoid repeating it by adding a semicolon between predicate–object pairs. When the series of predicate–object pairs is finished, we must add a dot. We can further simplify the notation when the subject and the predicate are the same:

Google has parent company Alphabet, was founded by Sergey Brin and Larry Page, and has headquarter at the Googleplex
Multiple node-arc-node relations with the same subject or same subject and predicate

can be written:

<Google> <hasParentCompany> <Alphabet> .
<Google> <hasHeadquarter> <Googleplex> .
<Google> <hasFounder> <LarryPage> .
<Google> <hasFounder> <SergeyBrin> .

or more concisely:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> ;
    <hasFounder> <LarryPage> ;
    <hasFounder> <SergeyBrin> .

or even more concisely:

<Google> <hasParentCompany> <Alphabet> ;
    <hasHeadquarter> <Googleplex> ;
    <hasFounder> <LarryPage>, <SergeyBrin> .

Note the comma that separates two objects for the same subject and predicate.

Remarks: (1) the order of the triples is not important; (2) there is no shortcut when the object is repeated with different subjects or predicates; (3) a graph is a set of triples, so writing the same triple several times (same subject, same predicate, and same object) still yields a single edge; (4) in Turtle, spaces cannot be used in node names or predicate labels.
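
As a small sketch of remarks (1) and (3), the following reuses the triples of the previous example, written in a different order and with one triple written twice. A Turtle parser should read it as the same four-edge graph as before:

```turtle
# Same triples as in the previous example, in a different order.
<Google> <hasFounder> <SergeyBrin> .
<Google> <hasHeadquarter> <Googleplex> .
<Google> <hasFounder> <LarryPage> .
<Google> <hasParentCompany> <Alphabet> .
<Google> <hasFounder> <SergeyBrin> . # Written a second time: still a single edge in the graph
```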

Literals and datatypes

In general, nodes in knowledge graphs represent things in the real world that we want to describe. These things can be concrete, physical entities (people, objects, etc.), or abstract things (concepts, ideas, legal entities, etc.). Most of these things cannot be fully encoded in a computer: only their (partial) description can be encoded. However, there are entities that can fully be represented and stored as data, such as integers, decimal numbers, character strings, dates. In this case, we use a different type of nodes to represent them, that we call “literals” because what they represent is literally what’s written. In graphical notation, they are often drawn as rectangles:

Larry Page’s name is “Lawrence Edward Page”
Larry Page’s name is, literally, “Lawrence Edward Page”

In Turtle, this is written:

<LarryPage> <name> "Lawrence Edward Page" .

A literal can have spaces in it. A literal can be of different types (number, string, date, etc.) and the set of literal types may be open, or even infinite in some applications. To make sure we interpret the value of a literal correctly, we must associate a datatype with it, as in the following example:

Larry Page’s name is “Lawrence Edward Page” and he was born on the 26th of March, 1973

In Turtle, this can be written as:

<LarryPage> <name> "Lawrence Edward Page" ;
    <birthdate> "26/03/1973"^^<date> .

The datatype date determines how we can interpret the string 26/03/1973. There exist standard datatypes that can be used more concisely in Turtle, for strings, integers, decimal numbers, and binary floating-point numbers. A standard for dates exists but there is no short notation for it in Turtle. The following example shows how integers and decimal numbers can be written, and also displays comments in Turtle notation:

# This is a comment, starting with '#' and ending at the end of the line
<LarryPage> <name> "Lawrence Edward Page" ; # Character string
    <numberOfChildren> 2 ; # Integer: just a sequence of digits
    <height> 1.7 . # Decimal: 2 sequences of digits separated by '.'
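
The short notation for binary floating-point numbers can be sketched in the same way. In Turtle, a number written with an exponent is read as a double rather than as a decimal. The predicate <wealthInDollars> and its value are taken from an example further down this page; writing the height as a double is only for illustration:

```turtle
<LarryPage> <wealthInDollars> 107.9e9 ; # Double: the exponent ('e9') makes it a floating-point number
    <height> 1.7e0 . # Also a double, unlike 1.7 (without exponent), which is a decimal
```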

Other features of Turtle

Turtle is a relatively simple data format and you can learn more about it with the standard specification or by reading Section 3.2 and checking the code for the examples in the graph figures. All examples of directed edge-labelled graphs in the Knowledge Graphs book are provided as Turtle in a GitHub repository. Click on the GitHub logo in the figure captions to get to the code.

The examples are using the notion of Internationalized Resource Identifiers and define prefixes, but for the purpose of this session, you do not need to worry about this. Nonetheless, you may want to use at least the standard XML Schema Datatypes (XSDs). For this, write this line at the beginning of your Turtle file:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

Then you can use the XSDs like this:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<LarryPage>  <name> "Lawrence Edward Page"^^xsd:string ; # Equivalent to "Lawrence Edward Page" without datatype
    <numberOfChildren> "2"^^xsd:integer ; # Equivalent to 2 without quotes and datatype
    <height> "1.7"^^xsd:decimal ; # Equivalent to 1.7 without quotes and datatype
    <birthdate> "1973-03-26"^^xsd:date ; # Uses ISO 8601 format
    <wikipediaPage> "https://en.wikipedia.org/wiki/Larry_Page"^^xsd:anyURI ;
    <wealthInDollars> "107.9e9"^^xsd:double . # Binary floating point double precision

Finally, you can give a type to an entity by using the special predicate a, which is more or less equivalent to the phrase “is a”:

<LarryPage> a <Person> .

In examples of the Knowledge Graphs book, the “is a” relation is written type. See for instance Figure 2.1.
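
For reference, the Turtle specification defines a as an abbreviation of the standard predicate rdf:type. The sketch below writes the same triple both ways. It uses a prefix declaration, in the same way as the xsd prefix above, and the two lines describe a single edge:

```turtle
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

<LarryPage> a <Person> . # Shorthand
<LarryPage> rdf:type <Person> . # Same triple, with the predicate written in full
```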

Basic tips and good practices

Here is a set of tips that you need to have in mind when making a knowledge graph:

Practical work

The following exercises ask you to model a situation or state of affairs as a directed edge-labelled graph, using Turtle.

Describe Mines Saint-Étienne

Write a Turtle file that models this:

Mines Saint-Étienne (officially “École nationale supérieure des mines de Saint-Étienne”) was founded on the 2nd of August, 1816. Its institutional address is 158 cours Fauriel, 42023 Saint-Étienne cedex 2. Its web site is https://www.mines-stetienne.fr/. It is part of Institut Mines-Télécom.

Save the code to a file with name YourFirstName-YourLastName-ex1.ttl.

People associated with Mines Saint-Étienne

Extend the graph from the previous exercise to describe the following:

Between the 1st of September, 2008 and the 14th of July, 2014, the director of Mines Saint-Étienne was Philippe Jamet, who then became director of Institut Mines-Télécom from the 15th of July, 2014 to the 2nd of September, 2019. Pascal Ray was director of Mines Saint-Étienne from the 15th of July, 2014 to the 30th of November, 2021. Between the 1st of December, 2021 and the 30th of April, 2022, the director was David Delafosse. Since the 1st of May, 2022, Jacques Fayolle has been the director of Mines Saint-Étienne.

Save the code to a file with name YourFirstName-YourLastName-ex2.ttl.

Product ownership

Write a Turtle file that describes the following situation:

Alicia and Kyoko both own a Toyota Prius. Alicia bought a new one in 2022, while Kyoko’s was a second-hand vehicle from 2013 that she bought in 2018.

Save the code to a file with name YourFirstName-YourLastName-ex3.ttl.

Send your 3 files to antoine.zimmermann@emse.fr.

Describe a social-networking platform

When doing this, it is best if you can already form groups of 4 people that you will keep for the rest of the course. If you do not have your complete group yet, you can work in pairs with someone who will be a member of your group until the end of the course. If your group is already formed, send me an email with members of your group in CC.

Take a look at a website that offers a social-networking platform. Choose a particular platform and observe what is possible to do with it as a user. Describe the platform in a graph, connecting it to what actions can be performed on it, and also to its content. For instance, a platform may allow one to register, to post messages, to connect with other members of the platform, to update one’s own profile, etc. A platform also allows users to read messages, read someone’s profile details, find responses to a thread, etc. You can discuss the best model with your team. For this part of the work, it is best to work on paper first.

The End

last modified 2024/03/14 16:10:42 by Antoine Zimmermann.