Knowledge Graphs

In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.

Introduction

Though the phrase “knowledge graph” has been used in the literature since at least 1972 [1], the modern incarnation of the phrase stems from the 2012 announcement of the Google Knowledge Graph [2], followed by further announcements of knowledge graphs by Airbnb [3], Amazon [4], eBay [5], Facebook [6], IBM [7], LinkedIn [8], Microsoft [9], Uber [10], and more besides. The growing industrial uptake of the concept proved difficult for academia to ignore: more and more scientific literature is being published on knowledge graphs, which includes books (e.g., [11]), as well as papers outlining definitions (e.g., [12]), novel techniques (e.g., [13, 14, 15]), and surveys of specific aspects of knowledge graphs (e.g., [16, 17]).

Underlying all such developments is the core idea of using graphs to represent data, often enhanced with some way to explicitly represent knowledge [6]. The result is most often used in application scenarios that involve integrating, managing and extracting value from diverse sources of data at large scale [6]. Employing a graph-based abstraction of knowledge has numerous benefits in such settings when compared with, for example, a relational model or NoSQL alternatives. Graphs provide a concise and intuitive abstraction for a variety of domains, where edges capture the (potentially cyclical) relations between the entities inherent in social data, biological interactions, bibliographical citations and co-authorships, transport networks, and so forth [18]. Graphs allow maintainers to postpone the definition of a schema, allowing the data – and its scope – to evolve in a more flexible manner than typically possible in a relational setting, particularly for capturing incomplete knowledge [19]. Unlike (other) NoSQL models, specialised graph query languages support not only standard relational operators (joins, unions, projections, etc.), but also navigational operators for recursively finding entities connected through arbitrary-length paths [20]. Standard knowledge representation formalisms – such as ontologies [21, 22, 23] and rules [24, 25] – can be employed to define and reason about the semantics of the terms used to label and describe the nodes and edges in the graph. Scalable frameworks for graph analytics [26, 27, 28] can be leveraged for computing centrality, clustering, summarisation, etc., in order to gain insights about the domain being described. Various representations have also been developed that support applying machine learning techniques directly over graphs [17, 29].

In summary, the decision to build and use a knowledge graph opens up a range of techniques that can be brought to bear for integrating and extracting value from diverse sources of data. However, we have yet to see a general unifying summary that describes how knowledge graphs are being used, what techniques they employ, and how they relate to existing data management topics.

The goal of this tutorial paper is to motivate and give a comprehensive introduction to knowledge graphs: to describe their foundational data models and how they can be queried; to discuss representations relating to schema, identity, and context; to discuss deductive and inductive ways to make knowledge explicit; to present a variety of techniques that can be used for the creation and enrichment of graph-structured data; to describe how the quality of knowledge graphs can be discerned and how they can be refined; to discuss standards and best practices by which knowledge graphs can be published; and to provide an overview of existing knowledge graphs found in practice. Our intended audience includes researchers and practitioners who are new to knowledge graphs. As such, we do not assume that readers have specific expertise on knowledge graphs.

Knowledge graph. The definition of a “knowledge graph” remains contentious [12, 30, 31], where a number of (sometimes conflicting) definitions have emerged, varying from specific technical proposals to more inclusive general proposals; we address these prior definitions in Appendix A. Herein we adopt an inclusive definition, where we view a knowledge graph as a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities. The graph of data (aka data graph) conforms to a graph-based data model, which may be a directed edge-labelled graph, a property graph, etc. (we discuss concrete alternatives in Section 2. By knowledge, we refer to something that is known¹^{note 1} A number of specific definitions for knowledge have been proposed in the literature on epistemology.. Such knowledge may be accumulated from external sources, or extracted from the knowledge graph itself. Knowledge may be composed of simple statements, such as “Santiago is the capital of Chile”, or quantified statements, such as “all capitals are cities”. Simple statements can be accumulated as edges in the data graph. If the knowledge graph intends to accumulate quantified statements, a more expressive way to represent knowledge – such as ontologies or rules – is required. Deductive methods can then be used to entail and accumulate further knowledge (e.g., “Santiago is a city”). Additional knowledge – based on simple or quantified statements – can also be extracted from and accumulated by the knowledge graph using inductive methods.

Knowledge graphs are often assembled from numerous sources, and as a result, can be highly diverse in terms of structure and granularity. To address this diversity, representations of schema, identity, and context often play a key role, where a schema defines a high-level structure for the knowledge graph, identity denotes which nodes in the graph (or in external sources) refer to the same real-world entity, while context may indicate a specific setting in which some unit of knowledge is held true. As aforementioned, effective methods for extraction, enrichment, quality assessment, and refinement are required for a knowledge graph to grow and improve over time.

In practice. Knowledge graphs aim to serve as an ever-evolving shared substrate of knowledge within an organisation or community [6]. We distinguish two types of knowledge graphs in practice: open knowledge graphs and enterprise knowledge graphs. Open knowledge graphs are published online, making their content accessible for the public good. The most prominent examples – DBpedia [32], Freebase [33], Wikidata [34], YAGO [35], etc. – cover many domains and are either extracted from Wikipedia [32, 35], or built by communities of volunteers [33, 34]. Open knowledge graphs have also been published within specific domains, such as media [36], government [37, 38], geography [39], tourism [40, 41, 42, 43], life sciences [44], and more besides. Enterprise knowledge graphs are typically internal to a company and applied for commercial use-cases [6]. Prominent industries using enterprise knowledge graphs include Web search (e.g., Bing [9], Google [2]), commerce (e.g., Airbnb [3], Amazon [4, 45], eBay [5], Uber [10]), social networks (e.g., Facebook [6], LinkedIn [8]), finance (e.g., Accenture [46], Banca d’Italia [47], Bloomberg [48], Capital One [49], Wells Fargo [50]), among others. Applications include search [9, 2], recommendations [3, 10, 8, 6], personal agents [5], advertising [8], business analytics [8], risk assessment [51, 52], automation [53], and more besides. We will provide more details on the use of knowledge graphs in practice in Section 10.

Running example. To keep the discussion accessible, throughout the paper, we present concrete examples in the context of a hypothetical knowledge graph relating to tourism in Chile (loosely inspired by, e.g., [41, 40]). The knowledge graph is managed by a tourism board that aims to increase tourism in the country and promote new attractions in strategic areas. The knowledge graph itself will eventually describe tourist attractions, cultural events, services, and businesses, as well as cities and inter-city travel routes. Some applications the organisation envisages are to:

create a tourism portal that allows visitors to search for attractions, upcoming events, and other related services (in multiple languages);
gain insights into tourism demographics in terms of season, nationalities, etc.;
analyse sentiment about various attractions and events, including positive reviews, summaries of complaints about events and services, reports of crime, etc.;
track reports of crime involving tourists and their location;
understand tourism trajectories: the sequence of attractions, events, etc., that tourists visit;
cross-reference trajectories with available flights/buses to suggest new strategic routes;
offer personalised recommendations of places to visit;
and so forth.

Related Literature. A number of related surveys, books, etc., have been published relating to knowledge graphs. In Table 1, we provide an overview of the tertiary literature – surveys, books, tutorials, etc. – relating to knowledge graphs, comparing the topics covered to those covered herein. We see that the existing literature tends to focus on specific aspects of knowledge graphs. Unlike the various surveys that have been published, our goal as a tutorial paper is to provide a broad and accessible introduction to knowledge graphs. Some of the surveys (in particular) provide more in-depth technical details on their chosen topic than this paper; throughout the discussion, where appropriate, we will refer to these surveys for further reading.

Related tertiary literature on knowledge graphs; ☑ denotes in-depth discussion, ☐ denotes brief discussion, * denotes informal publication (arXiv)
Publication	Year	Type	Models	Querying	Shapes	Identity	Context	Ontologies	Entailment	Rules	DLs	Analytics	Embeddings	GNNs	Symbolic Learning	Construction	Quality	Refinement	Publication	Enterprise KGs	Open KGs	Applications	History	Prior Definitions
Jeff Z. Pan, et al. [54]	2017	Book	☐	☑		☑		☐	☐			☐				☑				☑		☑	☑
Heiko Paulheim [16]	2017	Survey											☐		☐	☐	☐	☑		☐	☐
Quan Wang, et al. [17]	2017	Survey											☑		☐	☐	☐	☐			☐	☐
Jihong Yan, et al. [55]	2018	Survey	☐	☐					☐	☑		☐	☑		☐	☑				☑	☑
Genet Asefa Gesese, et al. [56]	2019	Survey				☐							☑									☐
Seyed Mehran Kazemi, et al. [57]	2019	Survey*					☐			☐			☑	☑	☑	☐		☐
Mayank Kejriwal [58]	2019	Book														☑
Guohui Xiao, et al. [59]	2019	Survey	☐	☐			☐	☐	☐							☐						☑
Xiu-Qing Wang, et al. [60]	2019	Survey	☐	☐				☐	☐	☐			☑	☐	☐	☑						☐
Tareq Al-Moslmi, et al. [61]	2020	Survey														☑
Dieter Fensel, et al. [62]	2020	Book														☑				☐	☐	☑		☐
Nicolas Heist, et al. [63]	2020	Survey*						☐								☐	☐				☑
Shaoxiong Ji, et al. [64]	2020	Survey*								☑			☑	☐		☑					☐	☑	☐	☐
Hogan et al.	2021	Tutorial	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑	☑

Structure. The remainder of the paper is structured as follows:

Section 2: outlines graph data models and the languages that can be used to query them.
Section 3: describes representations of schema, identity, and context in knowledge graphs.
Section 4: presents deductive formalisms by which knowledge can be represented and entailed.
Section 5: describes inductive techniques by which additional knowledge can be extracted.
Section 6: discusses the creation and enrichment of knowledge graphs from external sources.
Section 7: enumerates quality dimensions by which a knowledge graph can be assessed.
Section 8: discusses various techniques for knowledge graph refinement.
Section 9: discusses principles and protocols for publishing knowledge graphs.
Section 10: surveys some prominent knowledge graphs and their applications.
Section 11: concludes with a summary and future research directions for knowledge graphs.
Appendix A: provides historical background and previous definitions for knowledge graphs.
Appendix B: enumerates formal definitions that will be referred to from the body of the paper.

Data Graphs

At the foundation of any knowledge graph is the principle of first applying a graph abstraction to data, resulting in an initial data graph. We now discuss a selection of graph-structured data models that are commonly used in practice to represent data graphs. We then discuss the primitives that form the basis of graph query languages used to interrogate such data graphs.

Models

Leaving aside graphs, let us assume that the tourism board from our running example has not yet decided how to model relevant data about attractions, events, services, etc. The board first considers using a tabular structure – in particular, relational databases – to represent the required data, and though they do not know precisely what data they will need to capture, they start to design an initial relational schema. They begin with an Event table with five columns:

Event(name, venue, type, start, end)

where name and start together form the primary key of the table in order to uniquely identify recurring events. But as they start to populate the data, they encounter various issues: events may have multiple names (e.g., in different languages), events may have multiple venues, they may not yet know the start and end date-times for future events, events may have multiple types, and so forth. Incrementally addressing these modelling issues as the data become more diverse, they generate internal identifiers for events and adapt their relational schema until they have:

EventName(id,name), EventStart(id,start), EventEnd(id,end), EventVenue(id,venue), EventType(id,type)(1)

With the above schema, the organisation can now model events with \(0{-}n\) names, venues, and types, and \(0{-}1\) start dates and end dates (without needing relational nulls/blank cells in tables).

Along the way, the board has to incrementally change the schema several times in order to support new sources of data. Each such change requires a costly remodelling, reloading, and reindexing of data; here we only considered one table. The tourism board struggles with the relational model because they do not know, a priori, what data will need to be modelled or what sources they will use. But once they reach the latter relational schema, the board finds that they can integrate further sources without more changes: with minimal assumptions on multiplicities (\(1{-}1\), \(1{-}n\), etc.) this schema offers a lot of flexibility for integrating incomplete and diverse data.

In fact, the refined, flexible schema that the board ends up with – shown in (1) – is modelling a set of binary relations between entities, which indeed can be viewed as modelling a graph. By instead adopting a graph data model from the outset, the board could forgo the need for an upfront schema, and could define any (binary) relation between any pair of entities at any time.

We now introduce graph data models commonly used in practice [20].

Directed edge-labelled graphs

A directed edge-labelled graph (also known as a multi-relational graph [65, 66, 67]) is defined as a set of nodes – like Santiago, Arica, EID16, 2018-03-22 12:00 – and a set of directed labelled edges between those nodes, like Santa Lucía arrow source city arrow tip rightward Santiago. In the case of knowledge graphs, nodes are used to represent entities and edges are used to represent (binary) relations between those entities. Figure 1 provides an example of how the tourism board could model some relevant event data as a directed edge-labelled graph (for a formal definition of a directed edge-labelled graph see Definition 1 in Appendix B). The graph includes data about the names, types, start and end date-times, and venues for events.²^{note 2} We represent bidirectional edges as Viña del Mar arrow tip leftward bus arrow tip rightward Arica, which more concisely depicts two directed edges: Viña del Mar arrow source busArica and Viña del MarbusArica. Also while some naming conventions recommend more complete edge labels that include a verb, such as has venue or is valid from, in this paper, for presentation purposes, we will omit the “has” and “is” verbs from such labels, using simply venue or valid from. Adding information to such a graph typically involves adding new nodes and edges (with some exceptions discussed later). Representing incomplete information requires simply omitting a particular edge; for example, the graph does not yet define a start/end date-time for the Food Truck festival.

Directed edge-labelled graph describing events and their venues

Modelling data as a graph in this way offers more flexibility for integrating new sources of data, compared to the standard relational model, where a schema must be defined upfront and followed at each step. While other structured data models such as trees (XML, JSON, etc.) would offer similar flexibility, graphs do not require organising the data hierarchically (should venue be a parent, child, or sibling of type for example?). They also allow cycles to be represented and queried (e.g., note the directed cycle in the routes between Santiago, Arica, and Viña del Mar).

A standardised data model based on directed edge-labelled graphs is the Resource Description Framework (RDF) [68], which has been recommended by the W3C. The RDF model defines different types of nodes, including Internationalized Resource Identifiers (IRIs) [69] which allow for global identification of entities on the Web; literals, which allow for representing strings (with or without language tags) and other datatype values (integers, dates, etc.); and blank nodes, which are anonymous nodes that are not assigned an identifier (for example, rather than create internal identifiers like EID15, EID16, in RDF, we have the option to use blank nodes). We will discuss these different types of nodes further in Section 3.2 when we speak about issues relating to identity.

Heterogeneous graphs

A heterogeneous graph [70, 71, 72] (or heterogeneous information network [73, 74]) is a graph where each node and edge is assigned one type. Heterogeneous graphs are thus akin to del graphs – with edge labels corresponding to edge types – but where the type of node forms part of the graph model itself, rather than being expressed as a special relation, as illustrated in Figure 2. An edge is called homogeneous if it is between two nodes of the same type (e.g., borders); otherwise it is called heterogeneous (e.g., capital). A benefit of heterogeneous graphs is that they allow for partitioning nodes according to their type, for example, for the purposes of machine learning tasks [70, 71, 72]. Conversely, they typically only support a one-to-one relation between nodes and types, which is not the case for del graphs (see, for example, the node Santiago with zero types and EID15 with multiple types in Figure 1.

Property graphs

Property graphs were introduced to provide additional flexibility when modelling more complex relations. Consider integrating incoming data that provides information on which companies offer fares on which flights, allowing the board to better understand available routes between cities (for example, on national airlines). In the case of directed-edge labelled graphs, we cannot directly annotate an edge like Santiago arrow source flight arrow tip rightward Arica with the company (or companies) offering that route. But we could add a new node denoting a flight, connect it with the source, destination, companies, and mode, as shown in Figure 3. Applying this modelling to all routes in Figure 1 would, however, involve a significant change to the graph. Another option might be to put the flights of different companies in different named graphs, but if named graphs are already being used to track the source of graphs (for example), this could become cumbersome.

Directed edge-labelled graph with companies offering flights between Santiago and Arica

The property graph model was thus proposed to offer additional flexibility when modelling data as a graph [75, 20]. A property graph allows a set of property–value pairs and a label to be associated with both nodes and edges. Figure 4 shows a concise example of a property graph with data analogous to Figure 3 (for a formal definition of a property graph, we refer to Definition 4 in Appendix B. This time we use property–value pairs on edges to model the companies³^{note 3} In practical implementations of property graphs, properties with multiple values may be expressed, for example, as a single array value. Such issues do not, however, affect expressivity, nor our discussion.. The type of relation is captured by the label flight}. We further use node labels to indicate the types of the two nodes, and use property–value pairs to indicate their latitude and longitude.

Property graph with companies offering flights between Santiago and Arica

Property graphs are most prominently used in popular graph databases, such as Neo4j [75, 20]. In choosing between graph models, it is important to note that property graphs can be translated to/from directed edge-labelled graphs without loss of information [76, 77] (per, e.g., Figure 4). In summary, directed-edge labelled graphs offer a more minimal model, while property graphs offer a more flexible one. Often the choice of model will be secondary to other practical factors, such as the implementations available for different models, etc.

Graph dataset

Although multiple directed edge-labelled graphs can be merged by taking their union, it is often desirable to manage several graphs rather than one monolithic graph; for example, it may be beneficial to manage multiple graphs from different sources, making it possible to update or refine data from one source, to distinguish untrustworthy sources from more trustworthy ones, and so forth. A graph dataset then consists of a set of named graphs and a default graph. Each named graph is a pair of a graph ID and a graph. The default graph is a graph without an ID, and is referenced “by default” if a graph ID is not specified. Figure 5 provides an example where events and routes are stored in two named graphs, and the default graph manages meta-data about the named graphs (for a formal definition of a graph dataset, see Definition 4 in Appendix B). Graph names can also be used as nodes in a graph. Furthermore, nodes and edges can be repeated across graphs, where the same node in different graphs will typically refer to the same entity, allowing data on that entity to be integrated when merging graphs. Though the example illustrates a dataset of directed edge-labelled graphs, the concept generalises straightforwardly to other types of graphs.

A prominent use-case for graph datasets is to manage and query Linked Data composed of interlinked documents of RDF graphs spanning the Web. When dealing with Web data, tracking the source of data becomes of key importance [78, 79, 80]. We will discuss Linked Data later in Section 3.2 and further discuss provenance in Section 3.3.

Other graph data models

The previous models are popular examples of graph representations. Other graph data models exist with complex nodes that may contain individual edges [18, 81] or nested graphs [18, 82] (sometimes called hypernodes [83]. Likewise the mathematical notion of a hypergraph defines complex edges that connect sets rather than pairs of nodes. In our view, a knowledge graph can adopt any such graph data model based on nodes and edges: often data can be converted from one model to another (see Figure 3 vs. Figure 4). In the rest of the paper, we prefer discussing directed-edge labelled graphs given their relative succinctness, but most discussion extends naturally to other models.

Graph stores

A variety of techniques have been proposed for storing and indexing graphs, facilitating the efficient evaluation of queries (as discussed next). Directed-edge labelled graphs can be stored in relational databases either as a single relation of arity three (triple table), as a binary relation for each property (vertical partitioning), or as \(n\)-ary relations for entities of a given type (property tables) [84]. Custom storage techniques have also been developed for a variety of graph models, providing efficient access for finding nodes, edges and their adjacent elements [18, 75, 84]. A number of systems further allow for distributing graphs over multiple machines based on popular NoSQL stores or custom partitioning schemes [84, 85]. For further details we refer to the book chapter by Daniel Janke, et al. [85] and the survey by Marcin Wylot, et al. [84] dedicated to this topic.

Querying

A number of practical languages have been proposed for querying graphs [20], including the SPARQL query language for RDF graphs [86]; and Cypher [87], Gremlin [88], and G-CORE [89] for querying property graphs.⁴^{note 4} The popularity of these languages is investigated by Philipp Seifer, et al. [90].. Underlying these query languages are some common primitives, including (basic) graph patterns, relational operators, path expressions, and more besides [20]. We now describe these core features for querying graphs in turn, starting with graph patterns.

Graph patterns

At the core of every structured query language for graphs are (basic) graph patterns [91, 20], which follow the same model as the data graph being queried (see Section 2.1), additionally allowing variables as terms.⁵^{note 5} The terms of a directed edge-labelled graph are its nodes and edge-labels. The terms of a property graph are its ids, labels, properties, and values (as used on either edges or nodes). Terms in graph patterns are thus divided into constants, such as Arica or venue, and variables, which we prefix with question marks, such as ?event or ?rel. A graph pattern is then evaluated against the data graph by generating mappings from the variables of the graph pattern to constants in the data graph such that the image of the graph pattern under the mapping (replacing variables with the assigned constants) is contained within the data graph.

In Figure 6, we provide an example of a graph pattern looking for the venues of Food Festivals, along with the possible mappings generated by the graph pattern against the data graph of Figure 1. In some of the presented mappings (the last two listed), multiple variables are mapped to the same term, which may or may not be desirable depending on the application. Hence a number of semantics have been proposed for evaluating graph patterns [20], amongst which the most important are: homomorphism-based semantics, which allows multiple variables to be mapped to the same term such that all mappings shown in Figure 6 would be considered results; and isomorphism-based semantics, which requires variables on nodes and/or edges to be mapped to unique terms, thus excluding the latter three mappings of Figure 6 from the results. Different practical languages adopt different semantics for evaluating graph patterns where, for example, SPARQL adopts a homomorphism-based semantics, while Cypher adopts an isomorphism-based semantics on edges.

?ev	?vn1	?vn2
`EID16`	`Piscina Olímpica`	`Sotomayor`
`EID16`	`Sotomayor`	`Piscina Olímpica`
`EID16`	`Piscina Olímpica`	`Piscina Olímpica`
`EID16`	`Sotomayor`	`Sotomayor`
`EID15`	`Santa Lucía`	`Santa Lucía`

As we will see in later examples (particularly Figure 8, graph patterns may also form cycles (be they directed or undirected), and may replace edge labels with variables. Graph patterns in the context of other models – such as property graphs – can be defined analogously by allowing variables to replace terms in any position of the model. We provide a formalisation of graph patterns and their evaluation for both directed edge-labelled graphs and property graphs in Appendix B.2.1.

Complex graph patterns

A graph pattern transforms an input graph into a table of results (as shown in Figure 6). We may then consider using the relational algebra to combine and/or transform such tables, thus forming more complex queries from one or more graph patterns. Recall that the relational algebra consists of unary operators that accept one input table, and binary operators that accept two input tables. Unary operators include projection (\(\pi\)) to output a subset of columns, selection (\(\sigma\)) to output a subset of rows matching a given condition, and renaming of columns (\(\rho\)). Binary operators include union (\(\cup\)) to merge the rows of two tables into one table, difference (\(-\)) to remove the rows from the first table present in the second table, and joins (\(\Join\)) to extend the rows of one table with rows from the other table that satisfy a join condition. Selection and join conditions typically include equalities (\(=\)), inequalities (\(\leq\)), negation (\(\neg\)), disjunction (\(\vee\)), etc. From these operators, we can further define other (syntactic) operators, such as intersection (\(\cap\)) to output rows in both tables, anti-join (\(\rhd\), aka not exists) to output rows from the first table for which there are no join-compatible rows in the second table, left-join (\(\mathbin{\rule[0ex]{0.3em}{.5pt}\llap{\rule[1ex]{0.3em}{.5pt}}\mkern-6mu\Join}\)), aka optional) to perform a join but keeping rows from the first table without a compatible row in the second table, etc.

Graph patterns can then be expressed in a subset of relational algebra (namely \(\pi\), \(\sigma\), \(\rho\), \(\Join\)). Assuming, for example, a single ternary relation \(G(s,p,o)\) representing a graph – i.e., a table \(G\) with three columns \(s\), \(p\), \(o\) – the query of Figure 6 can be expressed in relational algebra as:

\(\pi_{ev,vn_1,vn_2}(\sigma_{p=\texttt{type} \wedge o=\texttt{Food Festival} \wedge p_1=p_2=\texttt{venue}}(\rho_{s/ev}(G \bowtie \rho_{p/p_1,o/vn_1}(G) \bowtie \rho_{p/p_2,o/vn_2}(G))))\)

where \(\Join\) denotes a natural join, meaning that equality is checked across pairs of columns with the same name in both tables (here, the join is thus performed on the subject column \(s\)). The result of this query is a table with a column for each variable: \(ev,vn1,vn2\). However, not all queries using \(\pi, \sigma, \rho\) and \(\Join\) on \(G\) can be expressed as graph patterns; for example, we cannot choose which variables to project in a graph pattern, but rather must project all variables not fixed to a constant.

Graph query languages such as SPARQL [86] and Cypher [87] allow the full use of relational operators over the results of graph patterns, giving rise to complex graph patterns [20]. Figure 7 presents an example of a complex graph pattern with projected variables in bold, choosing particular variables to appear in the final results. In terms of expressivity, graph patterns with (unrestricted) projection of this form equate to conjunctive queries on graphs. In Figure 8, we give another example of a complex graph pattern looking for food festivals or drinks festivals not held in Santiago, optionally returning their start date and name (where available). Such queries – allowing the full use of relational operators on top of graph patterns – equate to first-order queries on graphs. In Appendix B.2.2, we formalise complex graph patterns and their evaluation over data graphs.

Conjunctive query (left) with mappings generated over the graph of Figure 1 (right)

?name1	?con	?name2
`Food Truck`	`bus`	`Food Truck`
`Food Truck`	`bus`	`Food Truck`
`Food Truck`	`bus`	`Ñam`
`Food Truck`	`flight`	`Ñam`
`Food Truck`	`flight`	`Ñam`
`Ñam`	`bus`	`Food Truck`
`Ñam`	`flight`	`Food Truck`
`Ñam`	`flight`	`Food Truck`

Complex graph pattern 1 — Complex graph pattern (\(Q\)) with mappings generated (\(Q(G)\)) over the graph of Figure 1 (\(G\))

Complex graph pattern 2 — Complex graph pattern (\(Q\)) with mappings generated (\(Q(G)\)) over the graph of Figure 1 (\(G\))

?event	?start	?name
`EID16`		`Food Truck`

Complex graph patterns can give rise to duplicate results; for example, the first result in Figure 7 appears twice since ?city1 matches Arica and ?city2 matches Viña del Mar in one result, and vice-versa in the other. Query languages then offer two semantics: bag semantics preserves duplicates according to the multiplicity of the underlying mappings, while set semantics (typically invoked with a DISTINCT keyword) removes duplicates from the results.

Navigational graph patterns

A key feature that distinguishes graph query languages is the ability to include path expressions in queries. A path expression \(r\) is a regular expression that allows matching arbitrary-length paths between two nodes, which is expressed as a regular path query \((x,r,y)\), where \(x\) and \(y\) can be variables or constants (or even the same term). The base path expression is where \(r\) is a constant (an edge label). Furthermore if \(r\) is a path expression, then \(r^-\) (inverse)⁶^{note 6} Some authors distinguish 2-way regular path queries from regular path queries, where only the former supports inverses. and \(r^*\) (Kleene star: zero-or-more) are also path expressions. Finally, if \(r_1\) and \(r_2\) are path expressions, then \(r_1 \mid r_2\) (disjunction) and \(r_1 \cdot r_2\) (concatenation) are also path expressions.

Regular path queries can then be evaluated under a number of different semantics. For example, \((\)Arica, bus*, ?city\()\) evaluated against the graph of Figure 1 may match the paths in Figure 9. In fact, since a cycle is present, an infinite number of paths are potentially matched. For this reason, restricted semantics are often applied, returning only the shortest paths, or paths without repeated nodes or edges (as in the case of Cypher).⁷^{note 7} Mapping variables to paths requires special treatment [20]. Cypher [87] returns a string that encodes a path, upon which certain functions such as length(·) can be applied. G-CORE [89], on the other hand, allows for returning paths, and supports additional operators on them, including projecting them as graphs, applying cost functions, and more besides. Rather than returning paths, another option is to instead return the (finite) set of pairs of nodes connected by a matching path (as in the case of SPARQL 1.1).

Path matching 1 — Some possible paths matching \((\)`Arica`, `bus*`, `?city`\()\) over the graph of Figure 1

Path matching 2 — Some possible paths matching \((\)`Arica`, `bus*`, `?city`\()\) over the graph of Figure 1

Regular path queries can then be used in graph patterns to express navigational graph patterns [20], as shown in Figure 10, which illustrates a query searching for food festivals in cities reachable (recursively) from Arica by bus or flight. Furthermore, when regular path queries and graph patterns are combined with operators such as projection, selection, union, difference, and optional, the result is known as complex navigational graph patterns [20]. Appendix B.2.3 provides definitions for (complex) navigational graph patterns and their evaluation.

?event	?name	?city
`EID15`	`Ñam`	`Santiago`
`EID16`	`Food Truck`	`Arica`
`EID16`	`Food Truck`	`Viña del Mar`

Other features

Thus far we have discussed features that form the practical and theoretical foundation of any query language for graphs [20]. However, specific query languages for graphs may support other practical features, such as aggregation (GROUP BY, COUNT, etc.), more complex filters and datatype operators (e.g., range queries on years extracted from a date), federation for querying remotely hosted graphs over the Web, languages for updating graphs, support for semantic entailment regimes, etc. For more information, we refer to the documentation of the respective query languages (e.g., [86, 89]) and to the survey by Renzo Angles, et al. [20].

Schema, Identity, Context

In this section we describe various enhancements and extensions of the data graph – relating to schema, identity and context – that provide additional structures for accumulating knowledge. Henceforth, we refer to a data graph as a collection of data represented as nodes and edges using one of the models discussed in Section 2. We refer to a knowledge graph as a data graph potentially enhanced with representations of schema, identity, context, ontologies and/or rules. These additional representations may be embedded in the data graph, or layered above it. Representations for schema, identity and context are discussed herein, while ontologies and rules will be discussed in Section 4.

Schema

One of the benefits of modelling data as graphs – versus, for example, the relational model – is the option to forgo or postpone the definition of a schema. However, when modelling data as graphs, schemata can be used to prescribe a high-level structure and/or semantics that the graph follows or should follow. We discuss three types of graph schemata: semantic, validating, and emergent.

Semantic schema

A semantic schema allows for defining the meaning of high-level terms (aka vocabulary or terminology) used in the graph, which facilitates reasoning over graphs using those terms. Looking at Figure 1, for example, we may notice some natural groupings of nodes based on the types of entities to which they refer. We may thus decide to define classes to denote these groupings, such as Event, City, etc. In fact, Figure 1 already illustrates three low-level classes – Open Market, Food Market, Drinks Festival – grouping similar entities with an edge labelled type. We may subsequently observe some natural relations between some of these classes that we would like to capture. In Figure 11, we present a class hierarchy for events where children are defined to be subclasses of their parents such that if we find an edge EID15 arrow source type arrow tip rightward Food Festival in our graph, we may also infer that EID15typeFestival and EID15typeEvent.

Example class hierarchy for Event — Example class hierarchy for `Event`

Aside from classes, we may also wish to define the semantics of edge labels, aka properties. Returning to Figure 1, we may consider that the properties city and venue are sub-properties of a more general property location, such that given an edge Santa Lucía arrow source city arrow tip rightward Santiago, for example, we may also infer that Santa LucíalocationSantiago. We may also consider, for example, that bus and flight are both sub-properties of a more general property connects to. As such, properties may also form a hierarchy. We may further define the domain of properties, indicating the class(es) of entities for nodes from which edges with that property extend; for example, we may define that the domain of connects to is a class Place, such that given the previous sub-property relations, we could conclude that Arica arrow source type arrow tip rightward Place. Conversely, we may define the range of properties, indicating the class(es) of entities for nodes to which edges with that property extend; for example, we may define that the range of city is a class City, inferring that AricatypeCity.

A prominent standard for defining a semantic schema for (RDF) graphs is the RDF Schema (RDFS) standard [22], which allows for defining subclasses, subproperties, domains, and ranges amongst the classes and properties used in an RDF graph, where such definitions can be serialised as a graph. We illustrate the semantics of these features in Table 2 and provide a concrete example of definitions in Figure 12 for a sample of terms used in the running example. These definitions can then be embedded into a data graph. More generally, the semantics of terms used in a graph can be defined in much more depth than seen here, as is supported by the Web Ontology Language (OWL) standard [21] for RDF graphs. We will return to such semantics later in Section 4.

Definitions for sub-class, sub-property, domain and range features in semantic schemata
Feature	Definition	Condition	Example
Subclass	\(c\)subc. of\(d\)	\(x\)type\(c\) implies \(x\)type\(d\)	Citysubc. ofPlace
Subproperty	\(p\)subp. of\(q\)	\(x\)\(p\)\(y\) implies \(x\)\(q\)\(y\)	venuesubp. oflocation
Domain	\(p\)domain\(c\)	\(x\)\(p\)\(y\) implies \(x\)type\(c\)	venuedomainEvent
Range	\(p\)range\(c\)	\(x\)\(p\)\(y\) implies \(y\)type\(c\)	venuerangeVenue

Example schema graph describing sub-classes, sub-properties, domains, and ranges

Semantic schema are typically defined for incomplete graph data, where the absence of an edge between two nodes, such as Viña del Mar arrow source flight arrow tip rightward Arica, does not mean that the relation does not hold in the real world. Therefore, from the graph of Figure 1, we cannot assume that there is no flight between Viña del Mar and Arica. In contrast, if the Closed World Assumption (CWA) were adopted – as is the case in many classical database systems – it would be assumed that the data graph is a complete description of the world, thus allowing to assert with certainty that no flight exists between the two cities. Systems that do not adopt the CWA are said to adopt the Open World Assumption (OWA). A consequence of CWA is that the addition of an edge to the data graph may contradict what was previously assumed to be false (due to missing information), whereas with OWA, a statement that is proven false continues to be false with the addition of more edges.

Considering our running example, it would be unreasonable to assume that the tourism organisation has complete knowledge of everything describable in its knowledge graph. However, it is inconvenient if a system is unable to definitely answer “yes” or “no” to questions such as “is there a flight between Arica and Viña del Mar?”, especially when the organisation is certain that it has complete knowledge of the flights. A compromise between OWA and CWA is the Local Closed World Assumption (LCWA), where portions of the data graph are assumed to be complete.

Validating schema

When graphs are used to represent diverse, incomplete data at large-scale, the OWA is the most appropriate choice for a default semantics. But in some scenarios, we may wish to guarantee that our data graph – or specific parts thereof – are in some sense “complete”. Returning to Figure 1, for example, we may wish to ensure that all events have at least a name, a venue, a start date, and an end date, such that applications using the data – e.g., one that sends event notifications to users – can ensure that they have the minimal information required. Furthermore, we may wish to ensure that the city of an event is stated to be a city (rather than inferring that it is a city). We can define such constraints in a validating schema and validate the data graph with respect to the resulting schema, listing constraint violations (if any). Thus while semantic schemata allow for inferring new graph data, validating schemata allow for validating existing graph data.

A standard way to define a validating schema for graphs is using shapes [92, 93, 94]. A shape targets a set of nodes in a data graph and specifies constraints on those nodes. The shape’s target can be defined in many ways, such as targetting all instances of a class, the domain or range of a property, the result of a query, nodes connected to the target of another shape by a given property, etc. Constraints can then be defined on the targetted nodes, such as to restrict the number or types of values taken on a given property. A shapes graph is formed from a set of interrelated shapes.

Shapes graphs can be depicted as UML-like class diagrams, where Figure 13 illustrates an example of a shapes graph based on Figure 1, defining constraints on four interrelated shapes. Each shape – denoted with a box like Place, Event, etc. – is associated with a set of constraints. Nodes conform to a shape if and only if they satisfy all constraints defined on the shape. Inside each shape box are placed constraints on the number (e.g., [1..*] denotes one-to-many, [1..1] denotes precisely one, etc.) and types (e.g., string, dateTime, etc.) of nodes that conforming nodes can relate to with a property (e.g., name, start, etc.). Another option is to place constraints on the number of nodes conforming to a particular shape that the conforming node can relate to with a property (thus generating edges between shapes); for example, Event arrow source venue
1..* arrow tip rightward Venue denotes that conforming nodes for Event must relate to at least one node with the property venue that conforms to the Venue shape. Shapes can inherit the constraints of parent shapes – denoted with an \(\triangle\) connector – as in the case of City and Venue, whose conforming nodes must also conform to the Place shape.

Example shapes graph depicted as a UML-like diagram

Given a shape and a targetted node, it is possible to check if the node conforms to that shape or not, which may require checking conformance of other nodes; for example, the node EID15 conforms to the Event shape not only based on its local properties, but also based on conformance of Santa Lucía to Venue and Santiago to City. Conformance dependencies may also be recursive, where the conformance of Santiago to City requires that it conforms to Place, which requires that Viña del Mar and Arica conform to Place, and so on. Conversely, EID16 does not conform to Event, as it does not have the start and end properties required by the example shapes graph.

When declaring shapes, the data modeller may not know in advance the entire set of properties that some nodes can have. An open shape allows the node to have additional properties not specified by the shape, while a closed shape does not. For example, if we add the edge Santiago arrow source founder arrow tip rightward Pedro de Valdivia to the graph represented in Figure 1, then Santiago only conforms to the City shape if that shape is defined as open (since the shape does not mention founder).

Practical languages for shapes often support additional boolean features, such as conjunction (and), disjunction (or), and negation (not) of shapes; for example, we may say that all the values of venue should conform to the shape Venue and (not City), making explicit that venues in the data graph should not be directly given as cities. However, shapes languages that freely combine recursion and negation may lead to semantic problems, depending on how their semantics are defined. To illustrate, consider the following case inspired by the barber paradox [94], involving a shape Barber whose conforming nodes shave at least one node conforming to Person and (not Barber). Now, given (only) Bob arrow source shave arrow tip rightward Bob with Bob conforming to Person, does Bob conform to Barber? If yes – if Bob conforms to Barber – then Bob violates the constraint by not shaving at least one node conforming to Person and (not Barber). If no – if Bob does not conform to Barber – then Bob satisfies the Barber constraint by shaving such a node. Semantics to avoid such paradoxical situations have been proposed based on stratification [95], partial assignments [96], and stable models [97].

Although validating schemata and semantic schemata serve different purposes, they can complement each other. In particular, a validating schema can take into consideration a semantic schema, such that, for example, validation is applied on the data graph including inferences. Taking the class hierarchy of Figure 11 and the shapes graph of Figure 13, for example, we may define the target of the Event shape as the nodes that are of type Event (the class). If we first apply inferencing with respect to the class hierarchy of the semantic schema, the Event shape would now target EID15 and EID16. The presence of a semantic schema may, however, require adapting the validating schema. Taking into account, for example, the aforementioned class hierarchy would require defining a relaxed cardinality on the type property. Open shapes may also be preferred in such cases rather than enumerating constraints on all possible properties that may be inferred on a node.

We provide high-level definitions for shapes and related concepts in Appendix B.3.2. Two shapes languages have recently emerged for RDF graphs: Shape Expressions (ShEx), published as a W3C Community Group Report [93]; and SHACL (Shapes Constraint Language), published as a W3C Recommendation [92]. These languages support the discussed features (and more) and have been adopted for validating graphs in a number of domains relating to health-care [98], scientific literature [99], spatial data [100], amongst others. More details about ShEx and SHACL can be found in the book by Jose Emilio Labra Gayo, et al. [94]. A recently proposed language that can be used as a common basis for both ShEx and SHACL reveals their similarities and differences [101]. A similar notion of schema has been proposed by Renzo Angles [102] for property graphs.

Emergent schema

Both semantic and validating schemata require a domain expert to explicitly specify definitions and constraints. However, a data graph will often exhibit latent structures that can be automatically extracted as an emergent schema [103] (aka graph summary [104, 105, 106]).

A framework often used for defining emergent schema is that of quotient graphs, which partition groups of nodes in the data graph according to some equivalence relation while preserving some structural properties of the graph. Taking Figure 1, we can intuitively distinguish different types of nodes based on their context, such as event nodes, which link to venue nodes, which in turn link to city nodes, and so forth. In order to describe the structure of the graph, we could consider six partitions of nodes: event, name, venue, class, date-time, city. In practice, these partitions may be computed based on the class or shape of the node. Merging the nodes of each partition into one node while preserving edges leads to the quotient graph shown in Figure 14: the nodes of this quotient graph are the partitions of nodes from the data graph and the edge \(X\) arrow source \(y\) arrow tip rightward \(Z\) is in the quotient graph if and only if there exists \(x \in X\) and \(z \in Z\) such that \(x\)\(y\)\(z\) is in the data graph.

Example quotient graph simulating the data graph in Figure 1

There are many ways in which quotient graphs may be defined, depending not only on how nodes are partitioned, but also how the edges are defined. Different quotient graphs may provide different guarantees with respect to the structure they preserve. Formally, we can say that every quotient graph simulates its input graph (based on the simulation relation of set membership between data nodes and quotient nodes), meaning that for all \(x \in X\) with \(x\) an input node and \(X\) a quotient node, if \(x\) arrow source \(y\) arrow tip rightward \(z\) is an edge in the data graph, then there must exist an edge \(X\)\(y\)\(Z\) in the quotient graph such that \(z \in Z\); for example, the quotient graph of Figure 14 simulates the data graph of Figure 1. However, this quotient graph seems to suggest (for instance) that EID16 would have a start and end date in the data graph when this is not the case. A stronger notion of structural preservation is given by bisimilarity, which in this case would further require that if \(X\) arrow source \(y\) arrow tip rightward \(Z\) is an edge in the quotient graph, then for all \(x \in X\), there must exist a \(z \in Z\) such that \(x\)\(y\)\(z\) is in the data graph; this is not satisfied by EID16 in the quotient graph of Figure 14, which does not have an outgoing edge labelled start or end in the original data graph. Figure 15 illustrates a bisimilar version of the quotient graph, splitting the event partition into two nodes reflecting their different outgoing edges. An interesting property of bisimilarity is that it preserves forward-directed paths: given a path expression \(r\) without inverses and two bisimilar graphs, \(r\) will match a path in one graph if and only if it matches a corresponding path in the other bisimilar graph. One can verify, for example, that a path matches \(x\) arrow source city\(\cdot\)(flight|bus)* arrow tip rightward \(z\) in Figure 1 if and only if there is a path matching \(X\)city\(\cdot\)(flight|bus)*\(Z\) in Figure 15 such that \(x \in X\) and \(z \in Z\).

Example quotient graph bisimilar with the data graph in Figure 1

There are many ways in which quotient graphs may be defined, depending on the equivalence relation that partitions nodes. Furthermore, there are many ways in which other similar or bisimilar graphs can be defined, depending on the (bi)simulation relation that preserves the data graph’s structure [105]. We provide formal definitions for the notions of quotient graphs, simulation and bisimulation in Appendix B.3.3. Such techniques aim to summarise the data graph into a higher-level topology. In order to reduce the memory overhead of the quotient graph, in practice, nodes may rather be labelled with the cardinality of the partition and/or a high-level label (e.g., event, city) for the partition rather than storing the labels of all nodes in the partition.

Various other forms of emergent schema not based on a quotient graph framework have also been proposed; examples include emergent schemata based on relational tables [103], formal concept analysis [107], and so forth. Emergent schemata may be used to provide a human-understandable overview of the data graph, to aid with the definition of a semantic or validating schema, to optimise the indexing and querying of the graph, to guide the integration of data graphs, and so forth. We refer to the survey by Šejla Čebirić, et al. [105] for further details.

Identity

In Figure 1, we use nodes like Santiago, but to which Santiago does this node refer? Do we refer to Santiago de Chile, Santiago de Cuba, Santiago de Compostela, or do we perhaps refer to the indie rock band Santiago? Based on edges such as Santa Lucía arrow source city arrow tip rightward Santiago, we may deduce that it is one of the three cities mentioned (not the rock band), and based on the fact that the graph describes tourist attractions in Chile, we may further deduce that it refers to Santiago de Chile. Without further details, however, disambiguating nodes of this form may rely on heuristics prone to error in more difficult cases. To help avoid such ambiguity, first we may use globally-unique identifiers to avoid naming clashes when the knowledge graph is extended with external data, and second we may add external identity links to disambiguate a node with respect to an external source.

Persistent identifiers

Assume we wished to compare tourism in Chile and Cuba, and we have acquired an appropriate knowledge graph for Cuba. Part of the benefit of using graphs to model data is that we can merge two graphs by taking their union. However, as shown in Figure 16, using an ambiguous node like Santiago may result in a naming clash: the node is referring to two different real-world cities in both graphs, where the merged graph indicates that Santiago is a city in both Chile and Cuba (rather than two different cities).⁸^{note 8} Such a naming clash is not unique to graphs, but could also occur if merging tables, trees, etc. To avoid such clashes, long-lasting persistent identifiers (PIDs) [108] can be created in order to uniquely identify an entity. Prominent examples of PID schemes include Digital Object Identifiers (DOIs) for papers, ORCID iDs for authors, International Standard Book Numbers (ISBNs) for books, Alpha-2 codes for counties, and more besides.

Result of merging two graphs with ambiguous local identifiers

In the context of the Semantic Web, the RDF data model goes one step further and recommends that global Web identifiers be used for nodes and edge labels. However, rather than adopt the Uniform Resource Locators (URLs) used to identify the location of information resources such as webpages, RDF 1.1 proposes to use Internationalised Resource Identifiers (IRIs) to identify non-information resources such as cities or events.⁹^{note 9} Uniform Resource Identifiers (URIs) can be Uniform Resource Locators (URLs), used to locate information resources, and Uniform Resource Names (URNs), used to name non-information resources. Internationalised Resource Identifiers (IRIs) are URIs that allow Unicode. For example, http://example.com/Ñam is an IRI, but not a URI, due to the use of “Ñ”. Percentage encoding – http://example.com/%C3%91am – can encode an IRI as a URI (but reduces readability). Hence, for example, in the RDF representation of the Wikidata [34] – a knowledge graph proposed to complement Wikipedia, discussed in more detail in Section 10 – while the URL https://www.wikidata.org/wiki/Q2887 refers to a webpage that can be loaded in a browser providing human-readable meta-data about Santiago, the IRI http://www.wikidata.org/entity/Q2887 refers to the city itself. Distinguishing the identifiers for both resources (the webpage and the city itself) avoids naming clashes; for example, if we use the URL to identify both the webpage and the city, we may end up with an edge in our graph, such as (with readable labels below the edge):

https://www.wikidata.org/wiki/Q2887 arrow source https://www.wikidata.org/wiki/Property:P112 arrow tip rightward https://www.wikidata.org/wiki/Q203534
[Santiago (URL)][founded by (URL)] [Pedro de Valdivia (URL)]

Such an edge leaves ambiguity: was Pedro de Valdivia the founder of the webpage, or the city? Using IRIs for entities distinct from the URLs for the webpages that describe them avoids such ambiguous cases, where Wikidata thus rather defines the previous edge as follows:

https://www.wikidata.org/entity/Q2887 arrow source https://www.wikidata.org/prop/direct/P112 arrow tip rightward https://www.wikidata.org/entity/Q203534
[Santiago (IRI)][founded by (IRI)] [Pedro de Valdivia (IRI)]

using IRIs for the city, person, and founder of, distinct from the webpages describing them. These Wikidata identifiers use the prefix http://www.wikidata.org/entity/ for entities and the prefix http://www.wikidata.org/prop/direct/ for relations. Such prefixes are known as namespaces, and are often abbreviated with prefix strings, such as wd: or wdt:, where the latter triple can then be written more concisely using such abbreviations as wd:Q2887 arrow source wdt:P112 arrow tip rightward wd:Q203534.

If HTTP IRIs are used to identify the graph’s entities, when the IRI is looked up (via HTTP), the web-server can return (or redirect to) a description of that entity in formats such as RDF. This further enables RDF graphs to link to related entities described in external RDF graphs over the Web, giving rise to Linked Data [109, 110] (discussed in Section 9). Though HTTP IRIs offer a flexible and powerful mechanism for issuing global identifiers on the Web, they are not necessarily persistent: websites may go offline, the resources described at a given location may change, etc. In order to enhance the persistence of such identifiers, Persistent URL (PURL) services offer redirects from a central server to a particular location, where the PURL can be redirected to a new location if necessary, changing the address of a document without changing its identifier. The persistence of HTTP IRIs can then be improved by using namespaces defined through PURL services.

External identity links

Assume that the tourist board opts to define the chile: namespace with an IRI such as http://turismo.cl/entity/ on a web-server that they control, allowing nodes such as chile:Santiago – a shortcut for the IRI http://turismo.cl/entity/Santiago – to be looked up over the Web. While using such a naming scheme helps to avoid naming clashes, the use of IRIs does not necessarily help ground the identity of a resource. For example, an external geographic knowledge graph may assign the same city the IRI geo:SantiagoDeChile in their own namespace, where we have no direct way of knowing that the two identifiers refer to the same city. If we merge the two knowledge graphs, we will end up with two distinct nodes for the same city.

There are a number of ways to ground the identity of an entity. The first is to associate the entity with uniquely-identifying information in the graph, such as its geo-coordinates, its postal code, the year it was founded, etc. Each additional piece of information removes ambiguity as to which city is being referred, providing (for example) more options for matching the city with its analogue in external sources. A second option is to use identity links to state that a local entity has the same identity as another coreferent entity found in an external source; an instantiation of this concept can be found in the OWL standard, which defines the owl:sameAs property relating coreferent entities. Using this property, we could state the edge chile:Santiago arrow source owl:sameAs arrow tip rightward geo:SantiagoDeChile in our RDF graph, thus establishing an identity link between the corresponding nodes in both graphs. The semantics of owl:sameAs defined by the OWL standard then allow us to combine the data for both nodes. Such semantics will be discussed later in Section 4. Ways in which identity links can be computed will also be discussed later in Section 8.

Datatypes

Consider the two date-times on the left of Figure 1: how should we assign these nodes persistent/global identifiers? Intuitively it would not make sense, for example, to assign IRIs to these nodes since their syntactic form tells us what they refer to: specific dates and times in March 2020. This syntactic form is further recognisable by machine, meaning that with appropriate software, we could order such values in ascending or descending order, extract the year, etc.

Most practical data models for graphs allow for defining nodes that are datatype values. RDF utilises XML Schema Datatypes (XSD) [111], amongst others, where a datatype node is given as a pair \((l,d)\) where \(l\) is a lexical string, such as "2020-03-29T20:00:00", and \(d\) is an IRI denoting the datatype, such as xsd:dateTime. The node is then denoted "2020-03-29T20:00:00"^^xsd:dateTime. Datatype nodes in RDF are called literals and are not allowed to have outgoing edges. Other datatypes commonly used in RDF data include xsd:string, xsd:integer, xsd:decimal, xsd:boolean, etc. In case that the datatype is omitted, the value is assumed to be of type xsd:string. Applications built on top of RDF can then recognise these datatypes, parse them into datatype objects, and apply equality checks, normalisation, ordering, transformations, casting, according to their standard definition. In the context of property graphs, Neo4j [75] also defines a set of internal datatypes on property values that includes numbers, strings, booleans, spatial points, and temporal values.

Lexicalisation

Global identifiers for entities will sometimes have a human-interpretable form, such as chile:Santiago, but the identifier strings themselves do not carry any formal semantic significance. In other cases, the identifiers used may not be human-interpretable by design. In Wikidata, for instance, Santiago de Chile is identified as wd:Q2887, where such a scheme has the advantage of providing better persistence and of not being biased to a particular human language. For example, the Wikidata identifier for Eswatini (wd:Q1050) was not affected when the country changed its name from Swaziland, and does not necessitate choosing between languages for creating (more readable) IRIs such as wd:Eswatini (English), wd:eSwatini (Swazi), wd:Esuatini (Spanish), etc.

Since identifiers can be arbitrary, it is common to add edges that provide a human-interpretable label for nodes, such as wd:Q2887 arrow source rdfs:label arrow tip rightward “Santiago”, indicating how people may refer to the subject node linguistically. Linguistic information of this form plays an important role in grounding knowledge such that users can more clearly identify which real-world entity a particular node in a knowledge graph actually references [112]; it further permits cross-referencing entity labels with text corpora to find, for example, documents that potentially speak of a given entity [113]. Labels can be complemented with aliases (e.g., wd:Q2887 arrow source skos:altLabel arrow tip rightward “Santiago de Chile”) or comments (e.g. wd:Q2887rdfs:comment“Santiago is the capital of Chile”) to further help ground the node’s identity.

Nodes such as “Santiago” denote string literals, rather than an identifier. Depending on the specific graph model, such literal nodes may also be defined as a pair \((s,l)\), where \(s\) denotes the string and \(l\) a language code; in RDF, for example we may state chile:City arrow source rdfs:label arrow tip rightward "City"@en, chile:Cityrdfs:label"Ciudad"@es, etc., indicating labels for the node in different languages. In other models, the pertinent language can rather be specified, e.g., via metadata on the edge. Knowledge graphs with human-interpretable labels, aliases, comments, etc., (in various languages) are sometimes called (multilingual) lexicalised knowledge graphs [30]".

Existential nodes

When modelling incomplete information, we may in some cases know that there must exist a particular node in the graph with particular relationships to other nodes, but without being able to identify the node in question. For example, we may have two co-located events chile:EID42 and chile:EID43 whose venue has yet to be announced. One option is to simply omit the venue edges, in which case we lose the information that these events have a venue and that both events have the same venue. Another option might be to create a fresh IRI representing the venue, but semantically this becomes indistinguishable from there being a known venue. Hence some graph models permit the use of existential nodes, represented here as a blank circle:

chile:EID42 arrow source chile:venue arrow tip rightward arrow tip leftward chile:venuechile:EID43

These edges denote that there exists a common venue for chile:EID42 and chile:EID42 without identifying it. Existential nodes are supported in RDF as blank nodes [68], which are also commonly used to support modelling complex elements in graphs, such as RDF lists [68, 114]. Figure 17 exemplifies an RDF list, which uses blank nodes in a linked-list structure to encode order. Though existential nodes can be convenient, their presence can complicate operations on graphs, such as deciding if two data graphs have the same structure modulo existential nodes [68, 115]. Hence methods for skolemising existential nodes in graphs – replacing them with canonical labels – have been proposed [116, 115]. Other authors rather call to minimise the use of such nodes in graph data [110].

RDF list representing the three largest peaks of Chile, in order

Context

Many (arguably all) facts presented in the data graph of Figure 1 can be considered true with respect to a certain context. With respect to temporal context, Santiago has only existed as a city since 1541, flights from Arica to Santiago began in 1956, etc. With respect to geographic context, the graph describes events in Chile. With respect to provenance, data relating to EID15 were taken from – and are thus said to be true with respect to – the Ñam webpage on January 4^th, 2020. Other forms of context may also be used. We may further combine contexts, such as to indicate that Arica is a Chilean city (geographic) since 1883 (temporal) according to the Treaty of Ancón (provenance).

By context we herein refer to the scope of truth, and thus talk about the context in which some data are held to be true [117, 118]. The graph of Figure 1 leaves much of its context implicit. However, making context explicit can allow for interpreting the data from different perspectives, such as to understand what held true in 2016, what holds true excluding webpages later found to have spurious data, etc. As seen in the previous examples, context for graph data may be considered at different levels: on individual nodes, individual edges, or sets of edges (sub-graphs). We now discuss various representations by which context can be made explicit at different levels.

Direct representation

The first way to represent context is to consider it as data no different from other data. For example, the dates for the event EID15 in Figure 1 can be seen as representing a form of temporal context, indicating the temporal scope within which edges such as EID15 arrow source venue arrow tip rightward Santa Lucía are held true. Another option is to change a relation represented as an edge, such as SantiagoflightArica, into a node, such as seen in Figure 3, allowing to assign additional context to the relation. While in these examples context is represented in an ad hoc manner, a number of specifications have been proposed to represent context as data in a more standard way. One example is the Time Ontology [119], which specifies how temporal entities, intervals, time instants, etc. – and relations between them such as before, overlaps, etc. – can be described in RDF graphs in an interoperable manner. Another example is the PROV Data Model [120], which specifies how provenance can be described in RDF graphs, where entities (e.g., graphs, nodes, physical document) are derived from other entities, are generated and/or used by activities (e.g., extraction, authorship), and are attributed to agents (e.g., people, software, organisations).

Reification

Often we may wish to directly define the context of edges themselves; for example, we may wish to state that the edge Santiago arrow source flight arrow tip rightward Arica is valid from 1956. While we could use the pattern of turning the edge into a node – as illustrated in Figure 3 – to directly represent such context, another option is to use reification, which allows for making statements about statements in a generic manner (or in the case of a graph, for defining edges about edges). In Figure 18 we present three forms of reification that can be used for modelling temporal context on the aforementioned edge within a directed edge-labelled graph [76]. We use \(e\) to denote an arbitrary identifier representing the edge itself to which the contextual information can be associated. Unlike in a direct representation, \(e\) represents an edge, not a flight. RDF reification [68] (Figure 18a) defines a new node \(e\) to represent the edge and connects it to the source node (via subject), target node (via object), and edge label (via predicate) of the edge. In contrast, \(n\)-ary relations [68] (Figure 18b) connect the source node of the edge directly to the edge node \(e\) with the label of the edge; the target node of the edge is then connected to \(e\) (via value). Finally, singleton properties [121] (Figure 18c) rather use \(e\) as an edge label, connecting it to a node indicating the original edge label (via singleton). Other forms of reification have been proposed in the literature, including, for example, NdFluents [122]. In general, a reified edge does not assert the edge it reifies; for example, we may reify an edge to state that it is no longer valid. We refer to the work of Daniel Hernández, et al. [76] for further comparison of reification alternatives and their relative strengths and weaknesses.

Higher-arity representation

As an alternative to reification, we can rather use higher-arity representations for modelling context. Taking again the edge Santiago arrow source flight arrow tip rightward Arica, Figure 19 illustrates three higher-arity representations of temporal context. First, we can use a named graph (Figure 19a) to contain the edge and then define the temporal context on the graph name. Second, we can use a property graph (Figure 19b) where the temporal context is defined as an attribute on the edge. Third, we can use RDF* [123] (Figure 19c): an extension of RDF that allows edges to be defined as nodes. Amongst these options, the most flexible is the named graph representation, where we can assign context to multiple edges at once by placing them in one named graph; for example, we can add more edges to the named graph of Figure 19a that are also valid from 1956. The least flexible option is RDF*, which, in the absence of an edge id, does not permit different groups of contextual values to be assigned to an edge; for example, considering the edge Chile arrow source president arrow tip rightward M. Bachelet, if we add four contextual values to this edge to state that it was valid from 2006 until 2010 and valid from 2014 until 2018, we cannot pair the values, but may rather have to create a node to represent different presidencies (in the other models, we could have used two named graphs or edge ids).

Annotations

Thus far we have discussed representing context in a graph, but we have not spoken about automated mechanisms for reasoning about context; for example, if there are only seasonal summer flights from Santiago to Arica, we may wish to find other routes from Santiago for winter events taking place in Arica. While the dates for buses, flights, etc., can be represented directly in the graph, or using reification, writing a query to manually intersect the corresponding temporal contexts will be tedious – or may not even be possible at all. Another alternative is to consider annotations that provide mathematical definitions of a contextual domain and key operations possible within that domain that can then be applied automatically.

Some annotations model a particular contextual domain; for example, Temporal RDF [124] allows for annotating edges with time intervals, such as Chile arrow source president
[2006,2010] arrow tip rightward M. Bachelet, while Fuzzy RDF [125] allows for annotating edges with a degree of truth such as Santiagoclimate
0.8Semi-Arid, indicating that it is more-or-less true – with a degree of \(0.8\) – that Santiago has a semi-arid climate.

Other forms of annotation are domain-independent; for example, Annotated RDF [78, 126, 80] allows for representing various forms of context modelled as semi-rings: algebraic structures consisting of domain values (e.g., temporal intervals, fuzzy values, etc.) and two main operators to combine domain values: meet and join.¹⁰^{note 10} The join operator for annotations is different from the join operator for relational algebra. We provide an example in Figure 20, where \(G\) is annotated with values from a simplified temporal domain consisting of sets of integers (\(1{-}365\) representing days of the year. For brevity we use an interval notation, where, for example, \(\{[150,152]\}\) indicates the set \(\{150,151,152\}\). Query \(Q\) then asks for flights from Santiago to cities with events; this query will check and return an annotation reflecting the temporal validity of each answer. To derive these answers, we first require applying a conjunction of annotations on compatible flight and city edges, applying the meet operator to compute the annotation for which both edges hold. The natural way to define meet in our scenario is as the intersection of sets of days, where, for example, applying meet on the event annotation \(\color{blue}\{[150,152]\}\) and the flight annotation \(\color{blue}\{[1,120],[220,365]\}\) for Punta Arenas leads to the empty time interval \(\color{blue}\{\}\), which may thus lead to the city being filtered from the results (depending on the query evaluation semantics). However, for Arica, we find two different non-empty intersections: \(\color{blue}\{[123,125]\}\) for EID16 and \(\color{blue}\{[276,279]\}\) for EID17. Given that we are interested in the city (a projected variable), rather than the event, we can thus combine these two annotations for Arica using the join operator, returning the annotation in which either result holds true. In our scenario, the natural way to define join is as the union of the sets of days, giving \(\color{blue}\{[123,125],[276,279]\}.\) We provide formal definitions in Appendix B.4.1 based on the general framework proposed by Antoine Zimmermann, et al. [80] for annotations on graphs.

Temporally annotated graph — Example query on a temporally annotated graph

Example query on a temporally annotated graph

?city	context
`Arica`	\(\color{blue}\{[123,125],[276,279]\}\)

Other contextual frameworks

Other frameworks have been proposed for modelling and reasoning about context in graphs. A notable example is that of contextual knowledge repositories [127], which allow for assigning individual (sub-)graphs to their own context. Unlike in the case of named graphs, context is explicitly modelled along one or more dimensions, where each (sub-)graph must take a value for each dimension. Each dimension is further associated with a partial order over its values – e.g., 2020-03-22 \(\preceq\) 2020-03 \(\preceq\) 2020 – allowing to select and combine sub-graphs that are valid within contexts at different levels of granularity. Christoph Schuetz, et al. [128] similarly propose a form of contextual OnLine Analytic Processing (OLAP), based on a data cube formed by dimensions where individual cells contain knowledge graphs. Operations such as “slice-and-dice” (selecting knowledge according to given dimensions), as well as “roll-up” (aggregating knowledge at a higher level) can then be supported. We refer the reader to the respective papers for more details [127, 128].

Deductive Knowledge

As humans, we can deduce more from the data graph of Figure 1 than what the edges explicitly indicate. We may deduce, for example, that the Ñam festival (EID15) will be located in Santiago, even though the graph does not contain an edge EID15 arrow source location arrow tip rightward Santiago. We may further deduce that the cities connected by flights must have some airport nearby, even though the graph does not contain nodes referring to these airports. In these cases, given the data as premises, and some general rules about the world that we may know a priori, we can use a deductive process to derive new data, allowing us to know more than what is explicitly given by the data. These types of general premises and rules, when shared by many people, form part of “commonsense knowledge” [129]; conversely, when rather shared by a few experts in an area, they form part of “domain knowledge”, where, for example, an expert in biology may know that hemocyanin is a protein containing copper that carries oxygen in the blood of some species of Mollusca and Arthropoda.

Machines, in contrast, do not have a priori access to such deductive faculties; rather they need to be given formal instructions, in terms of premises and entailment regimes, in order to make similar deductions to what a human can make. These entailment regimes formalise the conclusions that logically follow as a consequence of a given set of premises. Once instructed in this manner, machines can (often) apply deductions with a precision, efficiency, and scale beyond human performance. These deductions may serve a range of applications, such as improving query answering, (deductive) classification, finding inconsistencies, etc. As a concrete example involving query answering, assume we are interested in knowing the festivals located in Santiago; we may straightforwardly express such a query as per the graph pattern shown in Figure 21. This query returns no results for the graph in Figure 1: there is no node named Festival, and nothing has (directly) the location Santiago. However, an answer (Ñam) could be automatically entailed were we to state that \(x\) being a Food Festival entails that \(x\) is a Festival, or that \(x\) having venue \(y\) in city \(z\) entails that \(x\) has location \(z\). How, then, should such entailments be captured? In Section 3.1.1 we already discussed how the former entailment can be captured with sub-class relations in a semantic schema; the second entailment, however, requires a more expressive entailment regime than seen thus far.

Graph pattern querying for names of festivals in Santiago

In this section, we discuss ways in which more complex entailments can be expressed and automated. Though we could leverage a number of logical frameworks for these purposes – such as First-Order Logic, Datalog, Prolog, Answer Set Programming, etc. – we focus on ontologies, which constitute a formal representation of knowledge that, importantly for us, can be represented as a graph. We then discuss how these ontologies can be formally defined, how they relate to existing logical frameworks, and how reasoning can be conducted with respect to such ontologies.

Ontologies

To enable entailment, we must be precise about what the terms we use mean. Returning to Figure 1, for example, and examining the node EID16 more closely, we may begin to question how it is modelled, particularly in comparison with EID15. Both nodes – according to the class hierarchy of Figure 11 – are considered to be events. But what if, for example, we wish to define two pairs of start and end dates for EID16 corresponding to the different venues? Should we rather consider what takes place in each venue as a different event? What then if an event has various start and end dates in a single venue: would these also be considered as one (recurring) event, or many events? These questions are facets of a more general question: what precisely do we mean by an “event”? Does it happen in one contiguous time interval or can it happen many times? Does it happen in one place or can it happen in multiple? There are no “correct” answers to such questions – we may understand the term “event” in a variety of ways, and thus the answers are a matter of convention.

In the context of computing, an ontology¹¹^{note 11} The term stems from the philosophical study of ontology, concerned with the different kinds of entities that exist, the nature of their existence, what kinds of properties they have, and how they may be identified and categorised. is then a concrete, formal representation of what terms mean within the scope in which they are used (e.g., a given domain). For example, one event ontology may formally define that if an entity is an “event”, then it has precisely one venue and precisely one time instant in which it begins. Conversely, a different event ontology may define that an “event” can have multiple venues and multiple start times, etc. Each such ontology formally captures a particular perspective – a particular convention. Under the first ontology, for example, we could not call the Olympics an “event”, while under the second ontology we could. Likewise ontologies can guide how graph data are modelled. Under the first ontology we may split EID16 into two events. Under the second, we may elect to keep EID16 as one event with two venues. Ultimately, given that ontologies are formal representations, they can be used to automate entailment.

Like all conventions, the usefulness of an ontology depends on the level of agreement on what that ontology defines, how detailed it is, and how broadly and consistently it is adopted. Adoption of an ontology by the parties involved in one knowledge graph may lead to a consistent use of terms and consistent modelling in that knowledge graph. Agreement over multiple knowledge graphs will, in turn, enhance the interoperability of those knowledge graphs.

Amongst the most popular ontology languages used in practice are the Web Ontology Language (OWL) [21]¹²^{note 12} We could include RDF Schema (RDFS) in this list, but it is largely subsumed by OWL, which builds upon its core., recommended by the W3C and compatible with RDF graphs; and the Open Biomedical Ontologies Format (OBOF) [23], used mostly in the biomedical domain. Since OWL is the more widely adopted, we focus on its features, though many similar features are found in both [23]. Before introducing such features, however, we must discuss how graphs are to be interpreted.

Interpretations

We as humans may interpret the node Santiago in the data graph of Figure 1 as referring to the real-world city that is the capital of Chile. We may further interpret an edge Arica arrow source flight arrow tip rightward Santiago as stating that there are flights from the city of Arica to this city. We thus interpret the data graph as another graph – what we here call the domain graph – composed of real-world entities connected by real-world relations. The process of interpretation, here, involves mapping the nodes and edges in the data graph to nodes and edges of the domain graph.

Along these lines, we can abstractly define an interpretation of a data graph as being composed of two elements: a domain graph, and a mapping from the terms (nodes and edge-labels) of the data graph to those of the domain graph. The domain graph follows the same model as the data graph; for example, if the data graph is a directed edge-labelled graph, then so too will be the domain graph. For simplicity, we will speak of directed edge-labelled graphs and refer to the nodes of the domain graph as entities, and the edges of the domain graph as relations. Given a data graph and an interpretation, while we denote nodes in the data graph by Santiago, we will denote the entity it refers to in the domain graph by Santiago (per the mapping of the given interpretation). Likewise, while we denote an edge by Arica arrow source flight arrow tip rightward Santiago, we will denote the relation by AricaflightSantiago (again, per the mapping of the given interpretation). In this abstract notion of an interpretation, we do not require that Santiago nor Arica be the real-world cities, nor even that the domain graph contain real-world entities and relations: an interpretation can have any domain graph and mapping.

Why is such an abstract notion of interpretation useful? The distinction between nodes/edges and entities/relations becomes important when we define the meaning of ontology features and entailment. To illustrate this distinction, if we ask whether there is an edge labelled flight between Arica and Viña del Mar for the data graph in Figure 1, the answer is no. However, if we ask if the entities Arica and Viña del Mar are connected by the relation flight, then the answer depends on what assumptions we make when interpreting the graph. Under the Closed World Assumption (CWA), if we do not have additional knowledge, then the answer is a definite no – since what is not known is assumed to be false. Conversely, under the Open World Assumption (OWA), we cannot be certain that this relation does not exist as this could be part of some knowledge not (yet) described by the graph. Likewise under the Unique Name Assumption (UNA), the data graph describes at least two flights to Santiago (since Viña del Mar and Arica are assumed to be different entities and therefore, Arica arrow source flight arrow tip rightward Santiago and Viña del MarflightSantiago must be different edges). Conversely, under No Unique Name Assumption (NUNA), we can only say that there is at least one such flight since Viña del Mar and Arica may be the same entity with two “names”.

These assumptions (or lack thereof) define which interpretations are valid, and which interpretations satisfy which data graphs. The UNA forbids interpretations that map two data terms to the same domain term. The NUNA allows such interpretations. Under CWA, an interpretation that contains an edge x arrow source p arrow tip rightward y in its domain graph can only satisfy a data graph from which we can entail xpy. Under OWA, an interpretation containing the edge xpy can satisfy a data graph not entailing xpy so long it does not contradict that edge.¹³^{note 13} Variations of the CWA can provide a middle ground between a completely open world that makes no assumption about completeness, falsehood of unknown statements, or unicity of names. One example of such variation is Local Closed World Assumption, already mentioned in Section 3.1.1. In the case of OWL, the NUNA and OWA are adopted, thus representing the most general case, whereby multiple nodes/edge-labels in the graph may refer to the same entity/relation-type (NUNA), and where anything not entailed by the data graph is not assumed to be false as a consequence (OWA).

Beyond our base assumptions, we can associate certain patterns in the data graph with semantic conditions that define which interpretations satisfy it; for example, we can add a semantic condition to enforce that if our data graph contains the edge p arrow source subp. of arrow tip rightward q, then any edge xpy in the domain graph of the interpretation must also have a corresponding edge xqy to satisfy the data graph. These semantic conditions then form the features of an ontology language. In what follows, to aid readability, we will introduce the features of OWL using an abstract graphical notation with abbreviated terms. For details of concrete syntaxes, we rather refer to the OWL and OBOF standards [21, 23]. Likewise we present semantic conditions for interpretations associated with each feature in the same graphical format;¹⁴^{note 14} We use “iff” as an abbreviation for “if and only if” whereby “\(\phi\) iff \(\psi\)” can be read as “if \(\phi\) then \(\psi\)” and “if \(\psi\) then \(\phi\)”. further details of these conditions will be described later in Section 4.2, with formal definitions rather provided in Appendix B.5.

Individuals

In Table 3, we list the main features supported by OWL for describing individuals (e.g., Santiago, EID16), sometimes distinguished from classes and properties. First, we can assert (binary) relations between individuals using edges such as Santa Lucía arrow source city arrow tip rightward Santiago. In the condition column, when we write \(x\)\(y\)\(z\), for example, we refer to the condition that the given relation holds in the interpretation; if so, the interpretation satisfies the axiom. OWL further allows for defining relations to explicitly state that two terms refer to the same entity, where, e.g., Región V arrow source same as arrow tip rightward Región de Valparaíso states that both refer to the same region (per Section 3.2); or that two terms refer to different entities, where, e.g., Valparaísodiff. fromRegión de Valparaíso distinguishes the city from the region of the same name. We may also state that a relation does not hold using negation, which can be serialised as a graph using a form of reification (see Figure 18a).

Ontology features for individuals
Feature	Axiom	Condition	Example
Assertion	\(x\)\(y\)\(z\)	\(x\)\(y\)\(z\)	CitycapitalSantiago
Negation		not \(x\)\(y\)\(z\)
Same As	\(x_1\)same as\(x_2\)	\(x_1\) = \(x_2\)	Región Vsame asRegión de Valparaíso
Different From	\(x_1\)diff. from\(x_2\)	\(x_1\) ≠ \(x_2\)	Valparaísodiff. fromRegión de Valparaíso

Properties

In Section 3.1.1, we already discussed how subproperties, domains and ranges may be defined for properties. OWL allows such definitions, and further includes other features, as listed in Table 4. We may define a pair of properties to be equivalent, inverses, or disjoint. We can further define a particular property to denote a transitive, symmetric, asymmetric, reflexive, or irreflexive relation. We can also define the multiplicity of the relation denoted by properties, based on being functional (many-to-one) or inverse-functional (one-to-many). We may further define a key for a class, denoting the set of properties whose values uniquely identify the entities of that class. Without adopting a Unique Name Assumption (UNA), from these latter three features we may conclude that two or more terms refer to the same entity. Finally, we can relate a property to a chain (a path expression only allowing concatenation of properties) such that pairs of entities related by the chain are also related by the given property. Note that for the latter two features in Table 4 we require representing a list, denoted with a vertical notation ⋮; while such a list may be serialised as a graph in a number of concrete ways, OWL uses RDF lists (see Figure 17).

Ontology features for property axioms
Feature	Axiom	Condition (for all \(x_\), \(y_\), \(z_*\))	Example
Subproperty	\(p\)subp. of\(q\)	\(x\)\(p\)\(y\) implies \(x\)\(q\)\(y\)	venuesubp. oflocation
Domain	\(p\)domain\(c\)	\(x\)\(p\)\(y\) implies \(x\)type\(c\)	venuedomainEvent
Range	\(p\)range\(c\)	\(x\)\(p\)\(y\) implies \(y\)type\(c\)	venuerangeVenue
Equivalence	\(p\)equiv. p.\(q\)	\(x\)\(p\)\(y\) iff \(x\)\(q\)\(y\)	startequiv. p.begins
Inverse	\(p\)inv. of\(q\)	\(x\)\(p\)\(y\) iff \(y\)\(q\)\(x\)	venueinv. ofhosts
Disjoint	\(p\)disj. p.\(q\)	not	venuedisj. p.hosts
Transitive	\(p\)typeTransitive	\(x\)\(p\)\(y\)\(p\)\(z\) implies \(x\)\(p\)\(z\)	part oftypeTransitive
Symmetric	\(p\)typeSymmetric	\(x\)\(p\)\(y\) iff \(y\)\(p\)\(x\)	nearbytypeSymmetric
Asymmetric	\(p\)typeAsymmetric	not	capitaltypeAsymmetric
Reflexive	\(p\)typeReflexive		part oftypeReflexive
Irreflexive	\(p\)typeIrreflexive	not	flighttypeIrreflexive
Functional	\(p\)typeFunctional	\(y_1\)\(p\)\(x\)\(p\)\(y_2\) implies \(y_1\) = \(y_2\)	populationtypeFunctional
Inv. Functional	\(p\)typeInv. Functional	\(x_1\)\(p\)\(y\)\(p\)\(x_2\) implies \(x_1\) = \(x_2\)	capitaltypeInv. Functional
Key	\(c\)key\(p_1\) ⋮ \(p_n\)	implies \(x_1\)=\(x_2\)	Citykeylat long
Chain	\(p\)chain\(q_1\) ⋮ \(q_n\)	\(x\)\(q_1\)\(y_1\)…\(y_{n-1}\)\(q_n\)\(z\) implies \(x\)\(p\)\(z\)	locationchainlocation part of

Classes

In Section 3.1.1, we discussed how class hierarchies can be modelled using a sub-class relation. OWL supports sub-classes, and many additional features, for defining and making claims about classes; these additional features are summarised in Table 5. Given a pair of classes, OWL allows for defining that they are equivalent, or disjoint. Thereafter, OWL provides a variety of features for defining novel classes by applying set operators on other classes, or based on conditions that the properties of its instances satisfy. First, using set operators, one can define a novel class as the complement of another class, the union or intersection of a list (of arbitrary length) of other classes, or as an enumeration of all of its instances. Second, by placing restrictions on a particular property \(p\), one can define classes whose instances are all of the entities that have: some value from a given class on \(p\); all values from a given class on \(p\);¹⁵^{note 15} While something like flight arrow tip leftward prop arrow source DomesticAirportall arrow tip rightward NationalFlight might appear to be a more natural example for All Values, this would be a modelling mistake, as the corresponding for all condition is satisfied when no such node exists. In other words, with this example definition, we could infer anything known not to have any flights to be a domestic airport. (We could, however, define the intersection of this class and airport as being a domestic airport.) have a specific individual as a value on \(p\) (has value); have themselves as a reflexive value on \(p\) (has self); have at least, at most or exactly some number of values on \(p\) (cardinality); and have at least, at most or exactly some number of values on \(p\) from a given class (qualified cardinality). For the latter two cases, in Table 5, we use the notation “\(\#\{\)a\(\mid \phi \}\)” to count distinct entities satisfying \(\phi\) in the interpretation. These features can then be combined to create more complex classes, where combining the examples for Intersection and Has Self in Table 5 gives the definition: self-driving taxis are taxis having themselves as a driver.

Ontology features for class axioms and definitions
Feature	Axiom	Condition (for all \(x_\), \(y_\), \(z_*\))	Example
Subclass	\(c\)subc. of\(d\)	\(x\)type\(c\) implies \(x\)type\(d\)	Citysubc. ofPlace
Equivalence	\(c\)equiv. c.\(d\)	\(x\)type\(c\) iff \(x\)type\(d\)	Humansuc. ofPerson
Disjoint	\(c\)disj. c.\(d\)	not \(c\)type\(x\)type\(d\)	Citydisj. c.Region
Complement	\(c\)comp.\(d\)	\(x\)type\(c\) iff not \(x\)type\(d\)	Deadcom.Alive
Union	\(c\)union\(d_1\) ⋮ \(d_n\)	\(x\)type\(c\) iff \(x\)type\(d_1\) or \(x\)type… or \(x\)type\(d_n\)	FlightunionDomesticFlight InternationalFlight
Intersection	\(c\)inter.\(d_1\) ⋮ \(d_n\)	\(x\)type\(c\) iff	SelfDrivingTaxiinter.Taxi SelfDriving
Enumeration	\(c\)one of\(x_1\) ⋮ \(x_n\)	\(x\)type\(c\) iff \(x\) \(\in \{\)\(x_1\)\(,\dots,\)\(x_n\)\(\}\)	EUStateone ofAustria ⋮ Sweden
Some Values		\(x\)type\(c\) iff there exists \(a\) such that \(x\)\(p\)\(a\)type\(d\)
All Values		\(x\)type\(c\) iff for all \(a\) with \(x\)\(p\)\(a\) it holds that \(a\)type\(d\)
Has Value		\(x\)type\(c\) iff \(x\)\(p\)\(y\)
Has Self		\(x\)type\(c\) iff \(x\)\(p\)\(x\)
Cardinality \(\star \in \{ =, \leq, \geq \}\)		\(x\)type\(c\) iff \(\#\{\)a \(\mid\) \(x\)\(p\)\(a\)\(\} \star n\)
Qualified Cardinality \(\star \in \{ =, \leq, \geq \}\)		\(x\)type\(c\) iff \(\#\{\)a \(\mid\) \(x\)\(p\)\(a\)type\(d\)\(\} \star n\)

Other features

OWL supports other language features not previously discussed, including: annotation properties, which provide metadata about ontologies, such as versioning info; datatype vs. object properties, which distinguish properties that take datatype values from those that do not; and datatype facets, which allow for defining new datatypes by applying restrictions to existing datatypes, such as to define that places in Chile must have a float between \(-66.0\) and \(-110.0\) as their value for the (datatype) property latitude. For more details we refer to the OWL 2 standard [21]. We will further discuss methodologies for the creation of ontologies in Section 6.5.

Semantics and Entailment

The conditions listed in the previous tables indicate how each feature should be interpreted. These conditions give rise to entailments, where, for example, in reference to the Symmetric feature of Table 4, the definition nearby arrow source type arrow tip rightward Symmetric and edge SantiagonearbySantiago Airport entail the edge Santiago AirportnearbySantiago according to the condition given for that feature. We now describe how these conditions lead to entailments.

Model-theoretic semantics

Each axiom described by the previous tables, when added to a graph, enforces some condition(s) on the interpretations that satisfy the graph. The interpretations that satisfy a graph are called models of the graph. Were we to consider only the base condition of the Assertion feature in Table 3, for example, then the models of a graph would be any interpretation such that for every edge x arrow source y arrow tip rightward z in the graph, there exists a relation xyz in the model. Given that there may be other relations in the model (under the OWA), the number of models of any such graph is infinite. Furthermore, given that we can map multiple nodes in the graph to one entity in the model (under the NUNA), any interpretation with (for example) the relation a arrow source a arrow tip rightward a is a model of any graph so long as for every edge xyz in the graph, it holds that x = y = z = a in the interpretation (in other words, the interpretation maps everything to a). As we add axioms with their associated conditions to the graph, we restrict models for the graph; for example, considering a graph with two edges – x arrow source y arrow tip rightward z and ytypeIrreflexive – the interpretation with aaa, x = y = … = a is no longer a model as it breaks the condition for the irreflexive axiom.

Entailment

We say that one graph entails another if and only if any model of the former graph is also a model of the latter graph. Intuitively this means that the latter graph says nothing new over the former graph and thus holds as a logical consequence of the former graph. For example, consider the graph Santiago arrow source type arrow tip rightward Citysubc. ofPlace and the graph SantiagotypePlace. All models of the latter must have that SantiagotypePlace, but so must all models of the former, which must have SantiagotypeCitysubc. ofPlace and further must satisfy the condition for Subclass, which requires that Santiago arrow source type arrow tip rightward Place also hold. Hence we conclude that any model of the former graph must be a model of the latter graph, or, in other words, the former graph entails the latter graph.

If–then vs. if-and-only-if semantics

Consider the graph nearby arrow source type arrow tip rightward Symmetric and the graph nearbyinv. ofnearby. They result in the same semantic conditions being applied in the domain graph, but does one entail the other? The answer depends on the semantics applied. Considering the axioms and conditions of Tables 3, we can consider two semantics. Under if–then semantics – if Axiom matches data graph then Condition holds in domain graph – the graphs do not entail each other: though both graphs give rise to the same condition, this condition is not translated back into the axioms that describe it.¹⁶^{note 16} Observe that nearby arrow source type arrow tip rightward Symmetric is a model of the first graph but not the second, while nearbyinv. ofnearby is a model of the second graph but not the first. Hence neither graph entails the other. Conversely, under if-and-only-if semantics – Axiom matches data graph if-and-only-if Condition holds in domain graph – the graphs entail each other: both graphs give rise to the same condition, which is translated back into all possible axioms that describe it. Hence if-and-only-if semantics allows for entailing more axioms in the ontology language than if–then semantics. OWL generally applies an if-and-only-if semantics [21].

Reasoning

Unfortunately, given two graphs, deciding if the first entails the second – per the notion of entailment we have defined and for all of the ontological features listed in Tables 3–5 – is undecidable: no (finite) algorithm for such entailment can exist that halts on all inputs with the correct true/false answer [130]. However, we can provide practical reasoning algorithms for ontologies that (1) halt on any input ontology but may miss entailments, returning false instead of true, (2) always halt with the correct answer but only accept input ontologies with restricted features, or (3) only return correct answers for any input ontology but may never halt on certain inputs. Though option (3) has been explored using, e.g., theorem provers for First Order Logic [131], options (1) and (2) are more commonly pursued using rules and/or Description Logics. Option (1) generally allows for more efficient and scalable reasoning algorithms and is useful where data are incomplete and having some entailments is valuable. Option (2) may be a better choice in domains – such as medical ontologies – where missing entailments may have undesirable outcomes.

Rules

One of the most straightforward ways to provide automated access to deductive knowledge is through inference rules (or simply rules) encoding if–then-style consequences. A rule is composed of a body (if) and a head (then). Both the body and head are given as graph patterns. A rule indicates that if we can replace the variables of the body with terms from the data graph and form a subgraph of a given data graph, then using the same replacement of variables in the head will yield a valid entailment. The head must typically use a subset of the variables appearing in the body to ensure that the conclusion leaves no variables unreplaced. Rules of this form correspond to (positive) Datalog [132] in databases, Horn clauses [133] in logic programming, etc.

Rules can be used to capture entailments under ontological conditions. In Table 6, we list some example rules for sub-class, sub-property, domain and range features [134]; these rules may be considered incomplete, not capturing, for example, that every class is a sub-class of itself, that every property is a sub-property of itself, etc. A more comprehensive set of rules for the OWL features of Tables 3–5 have been defined as OWL 2 RL/RDF [135]; these rules are likewise incomplete as such rules cannot fully capture negation (e.g., Complement), existentials (e.g., Some Values), universals (e.g., All Values), or counting (e.g., Cardinality and Qualified Cardinality). Other rule languages have, however, been proposed to support additional such features, including existentials (see, e.g., Datalog\(^\pm\) [136]), disjunction (see, e.g., Disjunctive Datalog [137]), etc.

Example rules for sub-class, sub-property, domain, and range features
Feature	Body	\(\Rightarrow\)	Head
Subclass (I)	?xtype?csubc. of?d	\(\Rightarrow\)	?xtype?d
Subclass (II)	?dsubc. of?dsubc. of?e	\(\Rightarrow\)	?dsubc. of?e
Subproperty (I)		\(\Rightarrow\)	?x?q?y
Subproperty (II)	?psubp. of?qsubp. of?r	\(\Rightarrow\)	?psubp. of?r
Domain		\(\Rightarrow\)	?xtype?c
Range		\(\Rightarrow\)	?ytype?c

Rules can be leveraged for reasoning in a number of ways. Materialisation refers to the idea of applying rules recursively to a graph, adding the conclusions generated back to the graph until a fixpoint is reached and nothing more can be added. The materialised graph can then be treated as any other graph. Although the efficiency and scalability of materialisation can be enhanced through optimisations like Rete networks [138], or using distributed frameworks like MapReduce [139], depending on the rules and the data, the materialised graph may become unfeasibly large to manage. Another strategy is to use rules for query rewriting, which given a query, will automatically extend the query in order to find solutions entailed by a set of rules; for example, taking the schema graph in Figure 12 and the rules in Table 6, the (sub-)pattern ?x arrow source type arrow tip rightward Event in a given input query would be rewritten to the following disjunctive pattern evaluated on the original graph:

?x arrow source type arrow tip rightward Event \(\cup\) ?xtypeType \(\cup\) ?xtypePeriodic Market \(\cup\) ?xvenue?y

Figure 22 provides a more complete example of an ontology that is used to rewrite the query of Figure 21; if evaluated over the graph of Figure 1, Ñam will be returned as a solution. However, not all of the aforementioned features of OWL can be supported in this manner. The OWL 2 QL profile [135] is a subset of OWL designed specifically for query rewriting of this form [140].

type Festival — Query rewriting example for the query \(Q\) of Figure 21

While rules can be used to (partially) capture ontological entailments, they can also be defined independently of an ontology language, capturing entailments for a given domain. In fact, some rules – such as the following – cannot be captured by the ontology features previously seen, as they do not support ways to infer relations from cyclical graph patterns (for computability reasons):

dom flight rule premise \(\Rightarrow\) ?x arrow source domestic flight arrow tip rightward ?y

Various languages allow for expressing rules over graphs – independently or alongside of an ontology language – including: Notation3 (N3) [82], Rule Interchange Format (RIF) [25], Semantic Web Rule Language (SWRL) [24], and SPARQL Inferencing Notation (SPIN) [141].

Description Logics

Description Logics (DLs) were initially introduced as a way to formalise the meaning of frames and semantic networks . Considering that semantic networks are an early version of knowledge graphs, and the fact that DLs have heavily influenced the Web Ontology Language, DLs thus hold an important place in the logical formalisation of knowledge graphs. DLs form a family of logics rather than a particular logic. Initially, DLs were restricted fragments of First Order Logic (FOL) that permit decidable reasoning tasks, such as entailment checking [144]. Different DLs strike different balances between expressive power and computational complexity of reasoning. DLs would later be extended with features that go beyond FOL but are useful in the context of modelling graph data, such as transitive closure, datatypes, etc.

Description Logics are based on three types of elements: individuals, such as Santiago; classes (aka concepts) such as City; and properties (aka roles) such as flight. DLs then allow for making claims, known as axioms, about these elements. Assertional axioms can be either unary class relations on individuals, such as City(Santiago), or binary property relations on individuals, such as flight(Santiago,Arica). Such axioms form the Assertional Box (A-Box). DLs further introduce logical symbols to allow for defining class axioms (forming the Terminology Box, or T-Box for short), and property axioms (forming the Role Box, R-Box); for example, the class axiom City \(\sqsubseteq\) Place states that the former class is a subclass of the latter one, while the property axiom flight \(\sqsubseteq\) connectsTo states that the former property is a subproperty of the latter one. DLs may then introduce a rich set of logical symbols, not only for defining class and property axioms, but also defining new classes based on existing terms; as an example of the latter, we can define a class \(\exists\)nearby.Airport as the class of individuals that have some airport nearby. Noting that the symbol \(\top\) is used in DLs to denote the class of all individuals, we can then add a class axiom \(\exists\)flight.\(\top \sqsubseteq \exists\)nearby.Airport to state that individuals with an outgoing flight must have some airport nearby. Noting that the symbol \(\sqcup\) can be used in DL to define that a class is the union of other classes, we can further define that Airport \(\sqsubseteq\) DomesticAirport \(\sqcup\) InternationalAirport, i.e., that an airport is either a domestic airport or an international airport (or both).

The similarities between these DL features and the OWL features previously outlined in Tables 3–5 are not coincidental: the OWL standard was heavily influenced by DLs, where, for example, the OWL 2 DL language is a fragment of OWL restricted so that entailment becomes decidable. As an example of a restriction, with DomesticAirport \(\sqsubseteq ~=1\) destination \(\circ\) country.\(\top\), we can define in DL syntax that domestic airports have flights destined to precisely one country (where p \(\circ\) q denotes a chain of properties). However, counting chains is often disallowed in DLs to ensure decidability. In Appendix B.5.3, we present formal definitions for DL syntax and semantics, as well as notions of entailment. For further reading, we also refer to the textbook by Franz Baader, et al. [144].

Expressive DLs support complex entailments involving existentials, universals, counting, etc. A common strategy for deciding such entailments is to reduce entailment to satisfiability, which decides if an ontology is consistent or not [145].¹⁷^{note 17} \(G\) entails \(G'\) if and only if \(G \cup \text{not}(G')\) is not satisfiable. Thereafter methods such as tableau can be used to check satisfiability, cautiously constructing models by completing them along similar lines to the materialisation strategy previously described, but additionally branching models in the case of disjunction, introducing new elements to represent existentials, etc. If any model is successfully “completed”, the process concludes that the original definitions are satisfiable (see, e.g., [146]). Due to their prohibitive computational complexity [135] – where for example, disjunction may lead to an exponential number of branching possibilities – such reasoning strategies are not typically applied in the case of large-scale data, though they may be useful when modelling complex domains.

Inductive Knowledge

While deductive knowledge is characterised by precise logical consequences, inductively acquiring knowledge involves generalising patterns from a given set of input observations, which can then be used to generate novel but potentially imprecise predictions. For example, from a large data graph with geographical and flight information, we may observe the pattern that almost all capital cities of countries have international airports serving them, and hence predict that if Santiago is a capital city, it likely has an international airport serving it; however, the predictions drawn from this pattern do not hold for certain, where (e.g.) Vaduz, the capital city of Liechtenstein, has no (international) airport serving it. Hence predictions will often be associated with a level of confidence; for example, we may say that a capital has an international airport in \(\frac{187}{195}\) of cases, offering a confidence of \(0.959\) for predictions made with that pattern. We then refer to knowledge acquired inductively as inductive knowledge, which includes both the models used to encode patterns, as well as the predictions made by those models. Though fallible, inductive knowledge can be highly valuable.

Conceptual overview of popular inductive techniques for knowledge graphs in terms of type of representation generated (Numeric/Symbolic) and type of paradigm used (Unsupervised/Self-supervised/Supervised)

In Figure 23 we provide an overview of the inductive techniques typically applied to knowledge graphs. In the case of unsupervised methods, there is a rich body of work on graph analytics, which uses well-known functions/algorithms to detect communities or clusters, find central nodes and edges, etc., in a graph. Alternatively, knowledge graph embeddings can use self-supervision to learn a low-dimensional numeric model of a knowledge graph that (typically) maps input edges to an output plausibility score indicating the likelihood of the edge being true. The structure of graphs can also be directly leveraged for supervised learning, as explored in the context of graph neural networks. Finally, while the aforementioned techniques learn numerical models, symbolic learning can learn symbolic models – i.e., logical formulae in the form of rules or axioms – from a graph in a self-supervised manner. We now discuss each of the aforementioned techniques in turn.

Graph Analytics

Analytics is the process of discovering, interpreting, and communicating meaningful patterns inherent to (typically large) data collections. Graph analytics is then the application of analytical processes to (typically large) graph data. The nature of graphs naturally lends itself to certain types of analytics that derive conclusions about nodes and edges based on the topology of the graph, i.e., how the nodes of the graph are connected. Graph analytics hence draws many of its techniques from related areas such as graph theory and network analysis, which have been used to study graphs that represent social networks, the Web, internet routing, transportation networks, ecosystems, protein–protein interactions, linguistic cooccurrences, and more besides [147].

Returning to the domain of our running example, the tourism board could use graph analytics to extract knowledge about, for instance: key transport hubs that serve many tourist attractions (centrality); groupings of attractions visited by the same tourists (community detection); attractions that may become unreachable in the event of strikes or other route failures (connectivity), or pairs of attractions that are similar to each other (node similarity). Given that such analytics will require a complex, large-scale graph, for the purposes of illustration, in Figure 24 we present a more concise example of some transportation connections in Chile directed towards popular touristic destinations. We first introduce a selection of key techniques that can be applied for graph analytics. We then discuss frameworks and languages that can be used to compute such analytics in practice. Given that many traditional graph algorithms are defined for unlabelled graphs, we then describe ways in which analytics can be applied over directed edge-labelled graphs. Finally we discuss the potential connections between graph analytics and querying and reasoning.

Data graph representing transport routes in Chile

Techniques

A wide variety of techniques can be applied for graph analytics. In the following we will enumerate some of the main techniques – as recognised, for example, by the survey of Alexandru Iosup, et al. [148] – that can be invoked in this setting.

Centrality: aims to identify the most important (aka central) nodes or edges of a graph. Specific node centrality measures include degree, betweenness, closeness, Eigenvector, PageRank, HITS, Katz, among others. Betweenness centrality can also be applied to edges. A node centrality measure would allow, e.g., to predict the transport hubs in Figure 24, while edge centrality would allow us to find the edges on which many shortest routes depend for predicting traffic.
Community detection: aims to identify communities in a graph, i.e., sub-graphs that are more densely connected internally than to the rest of the graph. Community detection algorithms, such as minimum-cut algorithms, label propagation, Louvain modularity, etc. enable discovering such communities. Community detection applied to Figure 24 may, for example, detect a community to the left (referring to the north of Chile), to the right (referring to the south of Chile), and perhaps also the centre (referring to cities with airports).
Connectivity: aims to estimate how well-connected the graph is, revealing, for instance, the resilience and (un)reachability of elements of the graph. Specific techniques include measuring graph density or \(k\)-connectivity, detecting strongly connected components and weakly connected components, computing spanning trees or minimum cuts, etc. In the context of Figure 24, such analysis may tell us that routes to Grey Glacier, Osorno Volcano and Piedras Rojas are the most “brittle”, becoming disconnected if one of two bus routes fail.
Node similarity: aims to find nodes that are similar to other nodes by virtue of how they are connected within their neighbourhood. Node similarity metrics may be computed using structural equivalence, random walks, diffusion kernels, etc. These methods provide an understanding of what connects nodes, and, thereafter, in what ways they are similar. In the context of Figure 24, such analysis may tell us that Calama and Arica are similar nodes based on both having return flights to Santiago and return buses to San Pedro.

While the previous techniques accept a graph alone as input,¹⁸^{note 18} Node similarity can be run over an entire graph to find the \(k\) most similar nodes for each node, or can also be run for a specific node to find its most similar nodes. There are also measures for graph similarity (based on, e.g., frequent itemsets [149]) that accept multiple graphs as input. other forms of graph analytics may further accept a node, a pair of nodes, etc., along with the graph.

Path finding: aims to find paths in a graph, typically between pairs of nodes given as input. Various technical definitions exist that restrict the set of valid paths between such nodes, including simple paths that do not visit the same node twice, shortest paths that visit the fewest number of edges, or – as previously discussed in Section 2.2 – regular path queries that restrict the labels of edges that can be traversed by the path [20]. We could use such algorithms to find, for example, the shortest path(s) in Figure 24 from Torres del Paine to Moon Valley.

Most such techniques have been proposed and studied for simple graphs or directed graphs without edge labels. We will discuss their application to more complex graph models – and how they can be combined with other techniques such as reasoning and querying – later in Section 5.1.3.

Frameworks

Various frameworks have been proposed for large-scale graph analytics, often in a distributed (cluster) setting. Amongst these we can mention Apache Spark (GraphX) [27, 150], GraphLab [151], Pregel [26], Signal–Collect [28], Shark [152], etc. These graph parallel frameworks apply a systolic abstraction [153] based on a directed graph, where nodes are processors that can send messages to other nodes along edges. Computation is then iterative, where in each iteration, each node reads messages received through inward edges (and possibly its own previous state), performs a computation, and then sends messages through outward edges based on the result. These frameworks then define the systolic computational abstraction on top of the data graph being processed: nodes and edges in the data graph become nodes and edges in the systolic graph. We refer to Appendix B.6.1 for more formal details on graph parallel frameworks.

To take an example, assume we wish to compute the places that are most (or least) easily reached by the routes shown in the graph of Figure 24. A good way to measure this is using centrality, where we choose PageRank [154], which computes the probability of a tourist randomly following the routes shown in the graph being at a particular place after a given number of “hops”. We can implement PageRank on large graphs using a graph parallel framework. In Figure 25, we provide an example of an iteration of PageRank for an illustrative sub-graph of Figure 24. The nodes are initialised with a score of \(\frac{1}{|V|} = \frac{1}{6}\), where we assume the tourist to have an equal chance of starting at any point. In the message phase (Msg), each node \(v\) passes a score of \(\frac{d \textrm{R}_i(v)}{|E(v)|}\) on each of its outgoing edges, where we denote by \(d\) a constant damping factor used to ensure convergence (typically \(d = 0.85\), indicating the probability that a tourist randomly “jumps” to any place), by \(\textrm{R}_i(v)\) the score of node \(v\) in iteration \(i\) (the probability of the tourist being at node \(v\) after \(i\) hops), and by \(|E(v)|\) the number of outgoing edges of \(v\). The aggregation phase (Agg) for \(v\) then sums all incoming messages received along with its constant share of the damping factor (\(\frac{1-d}{|V|}\)) to compute \(\textrm{R}_{i+1}(v)\). We then proceed to the message phase of the next iteration, continuing until some termination criterion is reached (e.g., iteration count or residual threshold, etc.) and final scores are output.

Example of a systolic iteration of PageRank for a sample sub-graph of Figure 24

While the given example is for PageRank, the systolic abstraction is general enough to support a wide variety of graph analytics, including those previously mentioned. An algorithm in this framework consists of the functions to compute message values in the message phase (Msg), and to accumulate the messages in the aggregation phase (Agg). The framework will take care of distribution, message passing, fault tolerance, etc. However, such frameworks – based on message passing between neighbours – have limitations: not all types of analytics can be expressed in such frameworks [155].¹⁹^{note 19} Formally, Keyulu Xu, et al. [155] have shown that such frameworks are as powerful as the (incomplete) Weisfeiler–Lehman (WL) graph isomorphism test – based on recursively hashing neighbouring hashes – for distinguishing graph structures. Hence frameworks may allow additional features, such as a global step that performs a global computation on all nodes, making the result available to each node [26]; or a mutation step that allows for adding or removing nodes and edges during processing [26].

Analytics on data graphs

As aforementioned, most analytics presented thus far are, in their “native” form, applicable for undirected or directed graphs without the edge meta-data – i.e., edge labels or property–value pairs – typical of graph data models.²⁰^{note 20} We remark that in the case of property graphs, property–value pairs on nodes can be converted by mapping values to nodes and properties to edges with the corresponding label. A number of strategies can be applied to make data graphs subject to analytics of this form:

Projection involves simply “projecting” an undirected or directed graph by optionally selecting a sub-graph from the data graph from which all edge meta-data are dropped; for example, Figure 24 may be the result of extracting the sub-graph induced by the edge labels bus and flight from a larger data graph, where the labels are then dropped to create a directed graph.
Weighting involves converting edge meta-data into numerical values according to some function. Many of the aforementioned techniques are easily adapted to the case of weighted (directed) graphs; for example, we could consider weights on the graph of Figure 24 denoting trip duration (or price, traffic, etc.), and then compute the shortest paths adding the duration of each leg.²¹^{note 21} Other forms of analytics are possible if we assume the graph is weighted; for example, if we annotated the graph of Figure 24 with probabilities of tourists moving from one place to the next, we could leverage Markov processes to understand features such as reducibility, periodicity, transience, recurrence, ergodicity, steady-states, etc., of the routes [156]. In the absence of external weights, we may rather map edge labels to weights, assigning the same weight to all flight edges, to all bus edges, etc., based on some criteria.
Transformation involves transforming the graph to a lower arity model. A transformation may be lossy, meaning that the original graph cannot be recovered; or lossless, meaning that the original graph can be recovered. Figure 26 provides an example of a lossy and lossless transformation from a directed edge-labelled graph to directed graphs. In the lossy transformation, we cannot tell, for example, if the original graph contained the edge IquiqueflightSantiago or IquiqueflightArica, etc. The lossless transformation must introduce new nodes (similar to reification) to maintain information about directed labelled edges. Both transformed graphs further attempt to preserve the directionality of the original graph.
Customisation involves changing the analytical procedure to incorporate edge meta-data, such as was the case for path finding based on path expressions. Other examples might include structural measures for node similarity that not only consider common neighbours, but also common neighbours connected by edges with the same label, or aggregate centrality measures that capture the importance of edges grouped by label, etc.

The results of an analytical process may change drastically depending on which of the previous strategies are chosen to prepare the data for analysis. This choice may be a non-trivial one to make a priori and may require empirical validation. More study is required to more generally understand the effects of such strategies on the results of different analytical techniques.

Analytics with queries

As discussed in Section 2.2, various languages for querying graphs have been proposed [20]. One may consider a variety of ways in which query languages and analytics can complement each other. First, we may consider using query languages to project or transform a graph suitable for a particular analytical task, such as to extract the graph of Figure 24 from a larger data graph. Query languages such as SPARQL [86], Cypher [87], and G-CORE [89] allow for outputting graphs, where such queries can be used to select sub-graphs for analysis. These languages can also express some limited (non-recursive) analytics, where aggregations can be used to compute degree centrality, for example; they may also have some built-in analytical support, where, for example, Cypher [87] allows for finding shortest paths. In the other direction, analytics can contribute to the querying process in terms of optimisations, where, for example, analysis of connectivity may suggest how to better distribute a large data graph over multiple machines for querying using, e.g., minimum cuts [157, 158]. Analytics have also been used to rank query results over large graphs [159, 160], selecting the most important results for presentation to the user.

In some use-cases we may further wish to interleave querying and analytical processes. For example, from the full data graph collected by the tourist board, consider an upcoming airline strike where the board wishes to find the events during the strike with venues in cities unreachable from Santiago by public transport due to the strike. Hypothetically, we could use a query to extract the transport network excluding the airline’s routes (assuming, per Figure 3 that the airline information is available), use analytics to extract the strongly connected component containing Santiago, and finally use a query to find events in cities not in the Santiago component on the given dates.²²^{note 22} Such a task could not be solved in a single query using regular path queries as such expressions would not be capable of filtering edges representing flights of a particular airline. While one could solve this task using an imperative language such as Gremlin [88], GraphX [27], or R [161], more declarative languages are also being explored to more easily express such tasks, with proposals including the extension of graph query languages with recursive capabilities [162, 163],²³^{note 23} Recursive query languages become Turing complete assuming one can also express operations on binary arrays. combining linear algebra with relational (query) algebra [164], and so forth.

Analytics with entailment

Knowledge graphs are often associated with a semantic schema or ontology that defines the semantics of domain terms, giving rise to entailments (per Section 4). Applying analytics with or without such entailments – e.g., before or after materialisation – may yield radically different results. For example, observe that an edge Santa Lucía arrow source hosts arrow tip rightward EID15 is semantically equivalent to an edge EID15venueSanta Lucía once the inverse axiom hostsinv. ofvenue is invoked; however, these edges are far from equivalent from the perspective of analytical techniques that consider edge direction, for which including one type of edge, or the other, or both, may have a major bearing on the final results. To the best of our knowledge, the combination of analytics and entailment has not been well-explored, leaving open interesting research questions. Along these lines, it may be of interest to explore semantically-invariant analytics that yield the same results over semantically-equivalent graphs (i.e., graphs that entail one another), thus analysing the semantic content of the knowledge graph rather than simply the topological features of the data graph; for example, semantically-invariant analytics would yield the same results over a graph containing the inverse axiom hosts arrow source inv. of arrow tip rightward venue and a number of hosts edges, the same graph but where every hosts edge is replaced by an inverse venue edge, and the union of both graphs.

Knowledge Graph Embeddings

Methods for machine learning have gained significant attention in recent years. In the context of knowledge graphs, machine learning can either be used for directly refining a knowledge graph [16] (discussed further in Section 8); or for downstream tasks using the knowledge graph, such as recommendation [165], information extraction [166], question answering [167], query relaxation [168], query approximation [169], etc. (discussed further in Section 10). However, many traditional machine learning techniques assume dense numeric input representations in the form of vectors, which is quite distinct from how graphs are usually expressed. So how can graphs – or nodes, edges, etc., thereof – be encoded as numeric vectors?

A first attempt to represent a graph using vectors would be to use a one-hot encoding, generating a vector for each node of length \(|L| \cdot |V|\) – with \(|V|\) the number of nodes in the input graph and \(|L|\) the number of edge labels – placing a one at the corresponding index to indicate the existence of the respective edge in the graph, or zero otherwise. Such a representation will, however, typically result in large and sparse vectors, which will be detrimental for most machine learning models.

The main goal of knowledge graph embedding techniques is to create a dense representation of the graph (i.e., embed the graph) in a continuous, low-dimensional vector space that can then be used for machine learning tasks. The dimensionality \(d\) of the embedding is fixed and typically low (often, e.g., \(50 \geq d \geq 1000\)). Typically the graph embedding is composed of an entity embedding for each node: a vector with \(d\) dimensions that we denote by \(\mathbf{e}\); and a relation embedding for each edge label: (typically) a vector with \(d\) dimensions that we denote by \(\mathbf{r}\). The overall goal of these vectors is to abstract and preserve latent structures in the graph. There are many ways in which this notion of an embedding can be instantiated. Most commonly, given an edge s arrow source p arrow tip rightward o, a specific embedding approach defines a scoring function that accepts \(\mathbf{e}\)_s (the entity embedding of node s), \(\mathbf{r}\)_p (the entity embedding of edge label p) and \(\mathbf{e}\)_o (the entity embedding of node o) and computes the plausibility of the edge: how likely it is to be true. Given a data graph, the goal is then to compute the embeddings of dimension \(d\) that maximise the plausibility of positive edges (typically edges in the graph) and minimise the plausibility of negative examples (typically edges in the graph with a node or edge label changed such that they are no longer in the graph) according to the given scoring function. The resulting embeddings can then be seen as models learnt through self-supervision that encode (latent) features of the graph, mapping input edges to output plausibility scores.

Embeddings can then be used for a number of low-level tasks involving the nodes and edge-labels of the graph from which they were computed. First, we can use the plausibility scoring function to assign a confidence to edges that may, for example, have been extracted from an external source (discussed later in Section 6). Second, the plausibility scoring function can be used to complete edges with missing nodes/edge labels for the purposes of link prediction (discussed later in Section 8); for example, in Figure 24, we might ask which nodes in the graph are likely to complete the edge Grey Glacier arrow source bus arrow tip rightward ?, where – aside from Punta Arenas, which is already given – we might intuitively expect Torres del Paine to be a plausible candidate. Third, embedding models will typically assign similar vectors to similar nodes and similar edge-labels, and thus they can be used as the basis of similarity measures, which may be useful for finding duplicate nodes that refer to the same entity, or for the purposes of providing recommendations (discussed later in Section 10).

A wide range of knowledge graph embedding techniques have been proposed [17], where our goal here is to provide a high-level introduction to some of the most popular techniques proposed thus far. First we discuss translational models that adopt a geometric perspective whereby relation embeddings translate subject entities to object entities in the low-dimensional space. We then describe tensor decomposition models that extract latent factors approximating the graph’s structure. Thereafter we discuss neural models that use neural networks to train embeddings that provide accurate plausibility scores. Finally, we discuss language models that leverage existing word embedding techniques, proposing ways of generating graph-like analogues for their expected (textual) inputs. A more formal treatment of these models is provided in Appendix B.6.2.

Translational models

Translational models interpret edge labels as transformations from subject nodes (aka the source or head) to object nodes (aka the target or tail); for example, in the edge San Pedro arrow source bus arrow tip rightward Moon Valley, the edge label bus is seen as transforming San Pedro to Moon Valley, and likewise for other bus edges. The most elementary approach in this family is TransE [66]. Over all positive edges spo, TransE learns vectors \(\mathbf{e}\)_s, \(\mathbf{r}\)_p, and \(\mathbf{e}\)_os aiming to make \(\mathbf{e}\)_s + \(\mathbf{r}\)_p as close as possible to \(\mathbf{e}\)_o. Conversely, if the edge is a negative example, TransE attempts to learn a representation that keeps \(\mathbf{e}\)_s + \(\mathbf{r}\)_p away from \(\mathbf{e}\)_o. To illustrate, Figure 27 provides a toy example of two-dimensional (\(d = 2\)) entity and relation embeddings computed by TransE. We keep the orientation of the vectors similar to the original graph for clarity. For any edge s arrow source p arrow tip rightward o in the original graph, adding the vectors \(\mathbf{e}\)_s + \(\mathbf{r}\)_p should approximate \(\mathbf{e}\)_o. In this toy example, the vectors correspond precisely where, for instance, adding the vectors for Licantén (\(\mathbf{e}\)_L.) and west of (\(\mathbf{r}\)_wo.) gives a vector corresponding to Curico (\(\mathbf{e}\)_C.). We can use these embeddings to predict edges (among other tasks); for example, in order to predict which node in the graph is most likely to be west of Antofagasta (A.), by computing \(\mathbf{e}\)_A. + \(\mathbf{r}\)_wo. we find that the resulting vector (dotted in Figure ?) is closest to \(\mathbf{e}\)_T., thus predicting Toconao (T.) to be the most plausible such node.

Aside from this toy example, TransE can be too simplistic; for example, in Figure 24, bus not only transforms San Pedro to Moon Valley, but also to Arica, Calama, and so forth. TransE will, in this case, aim to give similar vectors to all such target locations, which may not be feasible given other edges. TransE will also tend to assign cyclical relations a zero vector, as the directional components will tend to cancel each other out. To resolve such issues, many variants of TransE have been investigated. Amongst these, for example, TransH [14] represents different relations using distinct hyperplanes, where for the edge s arrow source p arrow tip rightward o, s is first projected onto the hyperplane of p before the translation to o is learnt (uninfluenced by edges with other labels for s and for o). TransR [15] generalises this approach by projecting s and o into a vector space specific to p, which involves multiplying the entity embeddings for s and o by a projection matrix specific to p. TransD [170] simplifies TransR by associating entities and relations with a second vector, where these secondary vectors are used to project the entity into a relation-specific vector space. Recently, RotatE [171] proposes translational embeddings in complex space, which allows to capture more characteristics of relations, such as direction, symmetry, inversion, antisymmetry, and composition. Embeddings have also been proposed in non-Euclidean space, e.g., MuRP [67] uses relation embeddings that transform entity embeddings in the hyperbolic space of the Poincaré ball mode, whose curvature provides more “space” to separate entities with respect to the dimensionality. For discussion of other translational models, we refer to the survey by Quan Wang, et al. [17].

Tensor decomposition models

A second approach to derive graph embeddings is to apply methods based on tensor decomposition. A tensor is a multidimensional numeric field that generalises scalars (\(0\)-order tensors), vectors (\(1\)-order tensors) and matrices (\(2\)-order tensors) towards arbitrary dimension/order. Tensors have become a widely used abstraction for machine learning [172]. Tensor decomposition involves decomposing a tensor into more “elemental” tensors (e.g., of lower order) from which the original tensor can be recomposed (or approximated) by a fixed sequence of basic operations. These elemental tensors can be viewed as capturing latent factors underlying the information contained in the original tensor. There are many approaches to tensor decomposition, where we will now briefly introduce the main ideas behind rank decompositions [172].

Leaving aside graphs momentarily, consider an \((a,b)\)-matrix (i.e., a \(2\)-order tensor) \(\mathbf{C}\), where \(a\) is the number of cities in Chile, \(b\) is the number of months in a year, and each element \((\mathbf{C})_{ij}\) denotes the average temperature of the \(i\)^th city in the \(j\)^th month. Noting that Chile is a long, thin country – ranging from subpolar climates in the south, to a desert climate in the north – we may find a decomposition of \(\mathbf{C}\) into two vectors representing latent factors – specifically \(\mathbf{x}\) (with \(a\) elements) giving lower values for cities with lower latitude, and \(\mathbf{y}\) (with \(b\) elements), giving lower values for months with lower temperatures – such that computing the outer product²⁴^{note 24} The outer product of two (column) vectors \(\mathbf{x}\) of length \(a\) and \(\mathbf{y}\) of length \(b\), denoted \(\mathbf{x} \otimes \mathbf{y}\), is defined as \(\mathbf{x}\mathbf{y}^{\mathrm{T}}\), yielding an \((a,b)\)-matrix \(\mathbf{M}\) such that \((\mathbf{M})_{ij} = (\mathbf{x})_i \cdot (\mathbf{y})_j\). Analogously, the outer product of \(k\) vectors is a \(k\)-order tensor. of the two vectors approximates \(\mathbf{C}\) reasonably well: \(\mathbf{x} \otimes \mathbf{y} \approx \mathbf{C}\). In the (unlikely) case that there exist vectors \(\mathbf{x}\) and \(\mathbf{y}\) such that \(\mathbf{C}\) is precisely the outer product of two vectors (\(\mathbf{x} \otimes \mathbf{y} = \mathbf{C}\)) we call \(\mathbf{C}\) a rank-\(1\) matrix; we can then precisely encode \(\mathbf{C}\) using \(a + b\) values rather than \(a \times b\) values. Most times, however, to get precisely \(\mathbf{C}\), we will need to sum multiple rank-\(1\) matrices, where the rank \(r\) of \(\mathbf{C}\) is the minimum number of rank-\(1\) matrices that need to be summed to derive precisely \(\mathbf{C}\), such that \(\mathbf{x}_1 \otimes \mathbf{y}_1 + \ldots \mathbf{x}_r \otimes \mathbf{y}_r = \mathbf{C}\). In the temperature example, \(\mathbf{x}_2 \otimes \mathbf{y}_2\) might correspond to a correction for altitude, \(\mathbf{x}_3 \otimes \mathbf{y}_3\) for higher temperature variance further south, etc. A (low) rank decomposition of a matrix then sets a limit \(d\) on the rank and computes the vectors \((\mathbf{x}_1,\mathbf{y}_1,\ldots,\mathbf{x}_{d},\mathbf{y}_{d})\) such that \(\mathbf{x}_1 \otimes \mathbf{y}_1 + \ldots + \mathbf{x}_{d} \otimes \mathbf{y}_{d}\) gives the best \(d\)-rank approximation of \(\mathbf{C}\). Noting that to generate \(n\)-order tensors we need to compute the outer product of \(n\) vectors, we can generalise this idea towards low rank decomposition of tensors; this method is called Canonical Polyadic (CP) decomposition [173]. For example, we might have a \(3\)-order tensor \(\mathcal{C}\) containing monthly temperatures for Chilean cities at four different times of day, which could be approximated with \(\mathbf{x}_1 \otimes \mathbf{y}_1 \otimes \mathbf{z}_1 + \ldots \mathbf{x}_{d} \otimes \mathbf{y}_{d} \otimes \mathbf{z}_{d}\) (e.g., \(\mathbf{x}_1\) might be a latitude factor, \(\mathbf{y}_1\) a monthly variation factor, and \(\mathbf{z}_1\) a daily variation factor, and so on). Various algorithms then exist to compute (approximate) CP decompositions, including Alternating Least Squares, Jennrich’s Algorithm, and the Tensor Power method [172].

Returning to graphs, similar principles can be used to decompose a graph into vectors, thus yielding embeddings. In particular, a graph can be encoded as a one-hot \(3\)-order tensor \(\mathcal{G}\) with \(|V| \times |L| \times |V|\) elements, where the element \((\mathcal{G})_{ijk}\) is set to one if the \(i\)^th node links to the \(k\)^th node with an edge having the \(j\)^th label, or zero otherwise. As previously mentioned, such a tensor will typically be very large and sparse, where rank decompositions are thus applicable. A CP decomposition [173] would compute a sequence of vectors \((\mathbf{x}_1,\mathbf{y}_1,\mathbf{z}_1,\ldots,\mathbf{x}_d,\mathbf{y}_d,\mathbf{z}_d)\) such that \(\mathbf{x}_1 \otimes \mathbf{y}_1 \otimes \mathbf{z}_1 + \ldots + \mathbf{x}_d \otimes \mathbf{y}_d \otimes \mathbf{z}_d \approx \mathcal{G}\). We illustrate this scheme in Figure 28. Letting \(\mathbf{X}, \mathbf{Y}, \mathbf{Z}\) denote the matrices formed by \(\begin{bmatrix} \mathbf{x}_1\,\cdots\,\mathbf{x}_d \end{bmatrix}\), \(\begin{bmatrix} \mathbf{y}_1\,\cdots\,\mathbf{y}_d \end{bmatrix}\), \(\begin{bmatrix} \mathbf{z}_1\,\cdots\,\mathbf{z}_d \end{bmatrix}\), respectively, with each vector forming a column of the corresponding matrix, we could then extract the \(i\)^th row of \(\mathbf{Y}\) as an embedding for the \(i\)^th relation, and the \(j\)^th rows of \(\mathbf{X}\) and \(\mathbf{Z}\) as two embeddings for the \(j\)^th entity. However, knowledge graph embeddings typically aim to assign one vector to each entity.

Abstract illustration of a CP \(d\)-rank decomposition of a tensor representing the graph of Figure 27a

DistMult [174] is a seminal method for computing knowledge graph embeddings based on rank decompositions, where each entity and relation is associated with a vector of dimension \(d\), such that for an edge s arrow source p arrow tip rightward o, a plausibility scoring function \(\sum_{i=1}^d (\mathbf{e}\)_s\()_i (\mathbf{r}\)_p\()_i (\mathbf{e}\)_o\()_i\) is defined, where \((\mathbf{e}\)_s\()_i\), \((\mathbf{r}\)_p\()_i\) and \((\mathbf{e}\)_o\()_i\) denote the \(i\)^th elements of vectors \(\mathbf{e}\)_s, \(\mathbf{r}\)_p, \(\mathbf{e}\)_o, respectively. The goal, then, is to learn vectors for each node and edge label that maximise the plausibility of positive edges and minimise the plausibility of negative edges. This approach equates to a CP decomposition of the graph tensor \(\mathcal{G}\), but where entities have one vector that is used twice: \(\mathbf{x}_1 \otimes \mathbf{y}_1 \otimes \mathbf{x}_1 + \ldots + \mathbf{x}_d \otimes \mathbf{y}_d \otimes \mathbf{x}_d \approx \mathcal{G}\). A weakness of this approach is that per the scoring function, the plausibility of s arrow source p arrow tip rightward o will always be equal to that of ops; in other words, DistMult does not consider edge direction.

Rather than use a vector as a relation embedding, RESCAL [65] uses a matrix, which allows for combining values from \(\mathbf{e}\)_s and \(\mathbf{e}\)_o across all dimensions, and thus can capture (e.g.) edge direction. However, RESCAL incurs a higher cost in terms of space and time than DistMult. HolE [175] uses vectors for relation and entity embeddings, but proposes to use the circular correlation operator – which takes sums along the diagonals of the outer product of two vectors – to combine them. This operator is not commutative, and can thus consider edge direction. ComplEx [176], on the other hand, uses a complex vector (i.e., a vector containing complex numbers) as a relational embedding, which similarly allows for breaking the aforementioned symmetry of DistMult’s scoring function while keeping the number of parameters low. SimplE [177] rather proposes to compute a standard CP decomposition computing two initial vectors for entities from \(\mathbf{X}\) and \(\mathbf{Z}\) and then averaging terms across \(\mathbf{X}\), \(\mathbf{Y}\), \(\mathbf{Z}\) to compute the final plausibility scores. TuckER [178] employs a different type of decomposition – called a Tucker Decomposition [179], which computes a smaller “core” tensor \(\mathcal{T}\) and a sequence of three matrices \(\mathbf{A}\), \(\mathbf{B}\) and \(\mathbf{C}\), such that \(\mathcal{G} \approx \mathcal{T} \otimes \mathbf{A} \otimes \mathbf{B} \otimes \mathbf{C}\) – where entity embeddings are taken from \(\mathbf{A}\) and \(\mathbf{C}\), while relation embeddings are taken from \(\mathbf{B}\). Of these approaches, TuckER [178] currently provides state-of-the-art results on standard benchmarks.

Neural models

A limitation of the previously discussed approaches is that they assume either linear (preserving addition and scalar multiplication) or bilinear (e.g., matrix multiplication) operations over embeddings to compute plausibility scores. A number of approaches rather use neural networks to learn embeddings with non-linear scoring functions for plausibility.

One of the earliest proposals of a neural model was Semantic Matching Energy (SME) [180], which learns parameters (aka weights: \(\mathbf{w}\), \(\mathbf{w}'\)) for two functions – \(f_{\mathbf{w}}(\mathbf{e}\)_s\(,\mathbf{r}\)_p\()\) and \(g_{\mathbf{w}'}(\mathbf{e}\)_o\(,\mathbf{r}\)_p\()\) – such that the dot product of the result of both functions – \(f_{\mathbf{w}}(\mathbf{e}\)_s\(,\mathbf{r}\)_p\() \cdot g_{\mathbf{w}'}(\mathbf{e}\)_o\(,\mathbf{r}\)_p\()\) – gives the plausibility score. Both linear and bilinear variants of \(f_{\mathbf{w}}\) and \(g_{\mathbf{w}'}\) are proposed. Another early proposal was Neural Tensor Networks (NTN) [181], which rather proposes to maintain a tensor \(\mathcal{W}\) of internal weights, such that the plausibility score is computed by a complex function that combines the outer product \(\mathbf{e}\)_s\( \otimes \mathcal{W} \otimes \mathbf{e}\)_o with a standard neural layer over \(\mathbf{e}\)_s and \(\mathbf{e}\)_o, which in turn is combined with \(\mathbf{r}\)_p, to produce a plausibility score. The use of the tensor \(\mathcal{W}\) results in a high number of parameters, which limits scalability [17]. Multi Layer Perceptron (MLP) [182] is a simpler model, where \(\mathbf{e}\)_s, \(\mathbf{r}\)_p and \(\mathbf{e}\)_o are concatenated and fed into a hidden layer to compute the plausibility score.

A number of more recent approaches have proposed using convolutional kernels in their models. ConvE [183] proposes to generate a matrix from \(\mathbf{e}\)_s and \(\mathbf{r}\)_p by “wrapping” each vector over several rows and concatenating both matrices. The concatenated matrix serves as the input for a set of (2D) convolutional layers, which returns a feature map tensor. The feature map tensor is vectorised and projected into \(d\) dimensions using a parametrised linear transformation. The plausibility score is then computed based on the dot product of this vector and \(\mathbf{e}\)_o. A disadvantage of ConvE is that by wrapping vectors into matrices, it imposes an artificial two-dimensional structure on the embeddings. HypER [184] is a similar model using convolutions, but avoids the need to wrap vectors into matrices. Instead, a fully connected layer (called the “hypernetwork”) is applied to \(\mathbf{r}\)_p and used to generate a matrix of relation-specific convolutional filters. These filters are applied directly to \(\mathbf{e}\)_s to give a feature map, which is vectorised. The same process is then applied as in ConvE: the resulting vector is projected into \(d\) dimensions, and a dot product applied with \(\mathbf{e}\)_o to produce the plausibility score. The resulting model is shown to outperform ConvE on standard benchmarks [184].

The presented approaches strike different balances in terms of expressivity and the number of parameters than need to be trained. While more expressive models, such as NTN, may better fit more complex plausibility functions over lower dimensional embeddings by using more hidden parameters, simpler models, such as that proposed by Dong et al. [182], and convolutional networks [183, 184] that enable parameter sharing by applying the same (typically small) kernels over different regions of a matrix, require handling fewer parameters overall and are more scalable.

Language models

Embedding techniques were first explored as a way to represent natural language within machine learning frameworks, with word2vec [185] and GloVe [186] being two seminal approaches. Both approaches compute embeddings for words based on large corpora of text such that words used in similar contexts (e.g., “frog”, “toad”) have similar vectors. Word2vec uses neural networks trained either to predict the current word from surrounding words (continuous bag of words), or to predict the surrounding words given the current word (continuous skip-gram). GloVe rather applies a regression model over a matrix of co-occurrence probabilities of word pairs. Embeddings generated by both approaches are widely used in natural language processing tasks.

Another approach for graph embeddings is thus to leverage proven approaches for language embeddings. However, while a graph consists of an unordered set of sequences of three terms (i.e., a set of edges), text in natural language consists of arbitrary-length sequences of terms (i.e., sentences of words). Along these lines, RDF2Vec [187] performs (biased [188]) random walks on the graph and records the paths (the sequence of nodes and edge labels traversed) as “sentences”, which are then fed as input into the word2vec [185] model. An example of such a path extracted from Figure 24 might be, for example, San Pedro arrow source bus arrow tip rightward CalamaflightIquiqueflightSantiago, where the paper experiments with \(500\) paths of length \(8\) per entity. RDF2Vec also proposes a second mode where sequences are generated for nodes from canonically-labelled sub-trees of which they are a root node, where the paper experiments with sub-trees of depth \(1\) and \(2\). Conversely, KGloVe [189] is based on the GloVe model. Much like how the original GloVe model [186] considers words that co-occur frequently in windows of text to be more related, KGloVe uses personalised PageRank²⁵^{note 25} Intuitively speaking, personalised PageRank starts at a given node and then determines the probability of a random walk being at a particular node after a given number of steps. A higher number of steps converges towards standard PageRank emphasising global node centrality, while a lower number emphasises proximity/relatedness to the starting node. to determine the most related nodes to a given node, whose results are then fed into the GloVe model.

Entailment-aware models

The embeddings thus far consider the data graph alone. But what if an ontology or set of rules is provided? Such deductive knowledge could be used to improve the embeddings. One approach is to use constraint rules to refine the predictions made by embeddings; for example, Quan Wang, et al. [190] use functional and inverse-functional definitions as constraints (under UNA) such that, for example, if we define that an event can have at most one value for venue, this is used to lower the plausibility of edges that would assign multiple venues to an event.

More recent approaches rather propose joint embeddings that consider both the data graph and rules when computing embeddings. KALE [191] computes entity and relation embeddings using a translational model (specifically TransE) that is adapted to further consider rules using t-norm fuzzy logics. With reference to Figure 24, consider a simple rule ?x arrow source bus arrow tip rightward ?y \(\Rightarrow\) ?xconnects to?y. We can use embeddings to assign plausibility scores to new edges, such as \(e_1\): Piedras RojasbusMoon Valley. We can further apply the previous rule to generate a new edge \(e_2\): Piedras Rojasconnects toMoon Valley from the predicted edge \(e_1\). But what plausibility should we assign to this second edge? Letting \(p_1\) and \(p_2\) be the current plausibility scores of \(e_1\) and \(e_2\) (initialised using the standard embedding), then t-norm fuzzy logics suggests that the plausibility be updated as \(p_1p_2 - p_1 + 1\). Embeddings are then trained to jointly assign larger plausibility scores to positive examples versus negative examples of both edges and ground rules. An example of a positive ground rule based on Figure 24 would be Arica arrow source bus arrow tip rightward San Pedro \(\Rightarrow\) Aricaconnects toSan Pedro. Negative ground rules randomly replace the relation in the head of the rule; for example, AricabusSan Pedro \(\not\Rightarrow\) AricaflightSan Pedro. Shu Guo, et al. [192] later propose RUGE, which uses a joint model over ground rules (possibly soft rules with confidence scores) and plausibility scores to align both forms of scoring for unseen edges.

Generating ground rules can be costly. An alternative approach, called FSL [193], observes that in the case of a simple rule, such as ?x arrow source bus arrow tip rightward ?y \(\Rightarrow\) ?xconnects to?y, the relation embedding bus should always return a lower plausibility than connects to. Thus, for all such rules, FSL proposes to train relation embeddings while avoiding violations of such inequalities. While relatively straightforward, FSL only supports simple rules, while KALE also supports more complex rules.

These works are interesting examples of how deductive and inductive forms of knowledge – in this case rules and embeddings – can interplay and complement each other.

Graph Neural Networks

While embeddings aim to provide a dense numerical representation of graphs suitable for use within existing machine learning models, another approach is to build custom machine learning models adapted for graph-structured data. Most custom learning models for graphs are based on (artificial) neural networks [29], exploiting a natural correspondence between both: a neural network already corresponds to a weighted, directed graph, where nodes serve as artificial neurons, and edges serve as weighted connections (axons). However, the typical topology of a traditional neural network – more specifically, a fully-connected feed-forward neural network – is quite homogeneous, being defined in terms of sequential layers of nodes where each node in one layer is connected to all nodes in the next layer. Conversely, the topology of a data graph is quite heterogeneous, being determined by the relations between entities that its edges represent.

A graph neural network (GNN) [194] builds a neural network based on the topology of the data graph; i.e., nodes are connected to their neighbours per the data graph. Typically a model is then learnt to map input features for nodes to output features in a supervised manner; output features for example nodes may be manually labelled, or may be taken from the knowledge graph. Unlike knowledge graphs embeddings, GNNs support end-to-end supervised learning for specific tasks: given a set of labelled examples, GNNs can be used to classify elements of the graph or the graph itself. GNNs have been used to perform classification over graphs encoding compounds, objects in images, documents, etc.; as well as to predict traffic, build recommender systems, verify software, etc. [29]. Given labelled examples, GNNs can even replace graph algorithms; for example, GNNs have been used to find central nodes in knowledge graphs in a supervised manner [194, 195, 196].

We now discuss the ideas underlying two flavours of GNN – recursive GNNs and convolutional GNNs – where we refer to Appendix B.6.3 for more formal definitions relating to GNNs.

Recursive graph neural networks

Recursive graph neural networks (RecGNNs) are the seminal approach to graph neural networks [197, 194]. The approach is conceptually similar to the systolic abstraction illustrated in Figure 25, where messages are passed between neighbours towards recursively computing some result. However, rather than define the functions used to decide the messages to pass, we rather label the output of a training set of nodes and let the framework learn the functions that generate the expected output, thereafter applying them to label other examples.

In a seminal paper, Franco Scarselli, et al. [194] proposed what they generically call a graph neural network (GNN), which takes as input a directed graph where nodes and edges are associated with feature vectors that can capture node and edge labels, weights, etc. These feature vectors remain fixed throughout the process. Each node in the graph is also associated with a state vector, which is recursively updated based on information from the node’s neighbours – i.e., the feature and state vectors of the neighbouring nodes and the feature vectors of the edges extending to/from them – using a parametric function, called the transition function. A second parametric function, called the output function, is used to compute the final output for a node based on its own feature and state vector. These functions are applied recursively up to a fixpoint. Both parametric functions can be implemented using neural networks where, given a partial set of supervised nodes in the graph – i.e., nodes labelled with their desired output – parameters for the transition and output functions can be learnt that best approximate the supervised outputs. The result can thus be seen as a recursive neural network architecture.²⁶^{note 26} Some authors refer to such architectures as recurrent graph neural networks, observing that the internal state maintained for nodes can be viewed as a form of recurrence over a sequence of transitions. To ensure convergence up to a fixpoint, certain restrictions are applied, namely that the transition function be a contractor, meaning that upon each application of the function, points in the numeric space are brought closer together (intuitively, in this case, the numeric space “shrinks” upon each application, ensuring a unique fixpoint).

To illustrate, consider, for example, that we wish to find priority locations for creating new tourist information offices. A good strategy would be to install them in hubs from which many tourists visit popular destinations. Along these lines, in Figure 29 we illustrate the GNN architecture proposed by Franco Scarselli, et al. [194] for a sub-graph of Figure 24, where we highlight the neighbourhood of Punta Arenas. In this graph, nodes are annotated with feature vectors (\(\mathbf{n}_x\)) and hidden states at step \(t\) (\(\mathbf{h}_x^{(t)}\)), while edges are annotated with feature vectors (\(\mathbf{a}_{xy}\)). Feature vectors for nodes may, for example, one-hot encode the type of node (City, Attraction, etc.), directly encode statistics such as the number of tourists visiting per year, etc. Feature vectors for edges may, for example, one-hot encode the edge label (the type of transport), directly encode statistics such as the distance or number of tickets sold per year, etc. Hidden states can be randomly initialised. The right-hand side of Figure 29 provides the GNN transition and output functions, where \(\mathrm{N}(x)\) denotes the neighbouring nodes of \(x\), \(f_{\mathbf{w}}(\cdot)\) denotes the transition function with parameters \(\mathbf{w}\), and \(g_{\mathbf{w}'}(\cdot)\) denotes the output function with parameters \(\mathbf{w'}\). An example is also provided for Punta Arenas (\(x = 1\)). These functions will be recursively applied until a fixpoint is reached. To train the network, we can label examples of places that already have (or should have) tourist offices and places that do (or should) not have tourist offices. These labels may be taken from the knowledge graph, or may be added manually. The GNN can then learn parameters \(\mathbf{w}\) and \(\mathbf{w'}\) that give the expected output for the labelled examples, which can subsequently be used to label other nodes.

On the left a sub-graph of Figure 24 highlighting the neighbourhood of Punta Arenas, where nodes are annotated with feature vectors (n_x) and hidden states at step t (h_x^{(t)}), and edges are annotated with feature vectors (a_{xy}); on the right, the GNN transition and output functions proposed by Scarselli et al. and an example for Punta Arenas (x = 1), where N(x) denotes the neighbouring nodes of x, f_w(·)\) denotes the transition function with parameters w and g_{w'}(·) denotes the output function with parameters w' — On the left a sub-graph of Figure 24 highlighting the neighbourhood of Punta Arenas, where nodes are annotated with feature vectors (\(\mathbf{n}_x\)) and hidden states at step \(t\) (\(\mathbf{h}_x^{(t)}\)), and edges are annotated with feature vectors (\(\mathbf{a}_{xy}\)); on the right, the GNN transition and output functions proposed by Franco Scarselli, et al. [194] and an example for Punta Arenas (\(x = 1\)), where \(N(x)\) denotes the neighbouring nodes of \(x\), \(f_{\mathbf{w}}(\cdot)\) denotes the transition function with parameters \(\mathbf{w}\) and \(g_{\mathbf{w}'}(\cdot)\) denotes the output function with parameters \(\mathbf{w'}\)

This GNN model is flexible and can be adapted in various ways [194]: we may define neighbouring nodes differently, for example to include nodes for outgoing edges, or nodes one or two hops away; we may allow pairs of nodes to be connected by multiple edges with different vectors; we may consider transition and output functions with distinct parameters for each node; we may add states and outputs for edges; we may change the sum to another aggregation function; etc.

Convolutional graph neural networks

Convolutional neural networks (CNNs) have gained a lot of attention, in particular, for machine learning tasks involving images [198]. The core idea in the image setting is to apply small kernels (aka filters) over localised regions of an image using a convolution operator to extract features from that local region. When applied to all local regions, the convolution outputs a feature map of the image. Typically multiple kernels are applied, forming multiple convolutional layers. These kernels can be learnt, given sufficient labelled examples.

One may note an analogy between GNNs as previously discussed, and CNNs as applied to images: in both cases, operators are applied over local regions of the input data. In the case of GNNs, the transition function is applied over a node and its neighbours in the graph. In the case of CNNs, the convolution is applied on a pixel and its neighbours in the image. Following this intuition, a number of convolutional graph neural networks (ConvGNNs) [199, 200, 29] have been proposed, where the transition function is implemented by means of convolutions. A key consideration for ConvGNNs is how regions of a graph are defined. Unlike the pixels of an image, nodes in a graph may have varying numbers of neighbours. This creates a challenge: a benefit of CNNs is that the same kernel can be applied over all the regions of an image, but this requires more careful consideration in the case of ConvGNNs since neighbourhoods of different nodes can be diverse. Approaches to address these challenges involve working with spectral (e.g. [199, 200]) or spatial (e.g., [201]) representations of graphs that induce a more regular structure from the graph. An alternative is to use an attention mechanism [202] to learn the nodes whose features are most important to the current node.

Aside from architectural considerations, there are two main differences between RecGNNs and ConvGNNs. First, RecGNNs aggregate information from neighbours recursively up to a fixpoint, whereas ConvGNNs typically apply a fixed number of convolutional layers. Second, RecGNNs typically use the same function/parameters in uniform steps, while different convolutional layers of a ConvGNN can apply different kernels/weights at each distinct step.

Symbolic Learning

The supervised techniques discussed thus far – namely knowledge graph embeddings and graph neural networks – learn numerical models over graphs. However, such models are often difficult to explain or understand. For example, taking the graph of Figure 30, knowledge graph embeddings might predict the edge SCL arrow source flight arrow tip rightward ARI as being highly plausible, but they will not provide an interpretable model to help understand why this is the case: the reason for the result may lie in a matrix of parameters learnt to fit a plausibility score on training data. Such approaches also suffer from the out-of-vocabulary problem, where they are unable to provide results for edges involving previously unseen nodes or edges; for example, if we add an edge SCL arrow source flight arrow tip rightward CDG, where CDG is new to the graph, a knowledge graph embedding will not have the entity embedding for CDG and would need to be retrained in order to estimate the plausibility of an edge CDGflightSCL.

An incomplete directed edge-labelled graph describing flights between airports

An alternative (sometimes complementary) approach is to adopt symbolic learning in order to learn hypotheses in a symbolic (logical) language that “explain” a given set of positive and negative edges. These edges are typically generated from the knowledge graph in an automatic manner (similar to the case of knowledge graph embeddings). The hypotheses then serve as interpretable models that can be used for further deductive reasoning. Given the graph of Figure 30, we may, for example, learn the rule ?x arrow source flight arrow tip rightward ?y \(\Rightarrow\) ?yflight?x from observing that flight routes tend to be return routes. Alternatively, we might learn a DL axiom stating that airports are either domestic, international, or both: Airport \(\sqsubseteq\) DomesticAirport \(\sqcup\) InternationalAirport. Such rules and axioms can then be used for deductive reasoning, and offer an interpretable model for new knowledge that is entailed/predicted; for example, from the aforementioned rule for return flights, one can interpret why a novel edge SCL arrow source flight arrow tip rightward ARI is predicted. This further offers domain experts the opportunity to verify the models – e.g., the rules and axioms – derived by such processes. Finally, rules/axioms are quantified (all flights have a return flight, all airports are domestic or international, etc.), so they can be applied to unseen examples (e.g., with the aforementioned rule, we can derive CDG arrow source flight arrow tip rightward SCL from a new edge SCLflightCDG with the unseen node CDG).

In this section, we discuss two forms of symbolic learning: rule mining, which learns rules from a knowledge graph, and axiom mining, which learns other forms of logical axioms. We refer to Appendix B.6.4 for a more formal treatment of these two tasks.

Rule mining

Rule mining, in the general sense, refers to discovering meaningful patterns in the form of rules from large collections of background knowledge. In the context of knowledge graphs, we assume a set of positive and negative edges as given. Typically positive edges are observed edges (i.e., those given or entailed by a knowledge graph) while negative edges are defined according to a given assumption of completeness (discussed later). The goal of rule mining is to identify new rules that entail a high ratio of positive edges from other positive edges, but entail a low ratio of negative edges from positive edges. The types of rules considered may vary from more simple cases, such as ?x arrow source flight arrow tip rightward ?y \(\Rightarrow\) ?yflight?x mentioned previously, to more complex rules, such as ?xcapital?ynearby?ztypeAirport \(\Rightarrow\) ?ztypeInternational Airport, indicating that airports near capitals tend to be international airports; or dom flight rule premise \(\Rightarrow\) ?x arrow source domestic flight arrow tip rightward ?y, indicating that flights within the same country denote domestic flights (as seen in Section 4.3.1).

Per the international airport example, rules are not assumed to hold in all cases, but rather are associated with measures of how well they conform to the positive and negative edges. In more detail, we call the edges entailed by a rule and the set of positive edges (not including the entailed edge itself), the positive entailments of that rule. The number of entailments that are positive is called the support for the rule, while the ratio of a rules’ entailments that are positive is called the confidence for the rule [203]. As such, support and confidence indicate, respectively, the number and ratio of entailments “confirmed” to be true for the rule, where the goal is to identify rules that have both high support and high confidence. In fact, techniques for rule mining in relational settings have long been explored in the context of Inductive Logic Programming (ILP) [204]. However, knowledge graphs present novel challenges due to the scale of the data and the frequent assumption of incomplete data (OWA), where dedicated techniques have been proposed to address these issues [205].

When dealing with an incomplete knowledge graph, it is not immediately clear how to define negative edges. A common heuristic – also used for knowledge graph embeddings – is to adopt a Partial Completeness Assumption (PCA) [205], which considers the set of positive edges to be those contained in the data graph, and the set of negative examples to be the set of all edges \(x\) arrow source \(p\) arrow tip rightward \(y'\) not in the graph but where there exists a node \(y\) such that \(x\)\(p\)\(y'\) is in the graph. Taking Figure 30, an example of a negative edge under PCA would be SCLflightARI (given the presence of SCLflightLIM); conversely, SCLdomestic flightARI is neither positive nor negative. The PCA confidence measure is then the ratio of the support to all entailments in the positive or negative set [205]. For example, the support for the rule ?x arrow source domestic flight arrow tip rightward ?y \(\Rightarrow\) ?ydomestic flight?x is \(2\) (since it entails IQQdomestic flightARI and ARIdomestic flightIQQ in the graph), while the confidence is \(\frac{2}{2} = 1\) (noting that SCLdomestic flightARI, though entailed, is neither positive nor negative, and is thus ignored by the measure). The support for the rule ?x arrow source flight arrow tip rightward ?y \(\Rightarrow\) ?yflight?x is analogously 4, while the confidence is \(\frac{4}{5} = 0.8\) (noting that SCLflightARI is negative).

The goal then, is to find rules satisfying given support and confidence thresholds. An influential rule-mining system for graphs is AMIE [205, 206], which adopts the PCA measure of confidence, and builds rules in a top-down fashion [203] starting with rule heads like \(\Rightarrow\) ?x arrow source country arrow tip rightward ?y. For each rule head of this form (one for each edge label), three types of refinements are considered, each of which adds a new edge to the body of the rule. This new edge takes an edge label from the graph and may otherwise use fresh variables not appearing previously in the rule, existing variables that already appear in the rule, or nodes from the graph. The three refinements may then:

add an edge with one existing variable and one fresh variable; for example, refining the aforementioned rule head might give: ?zflight?x \(\Rightarrow\) ?xcountry?y;
add an edge with an existing variable and a node from the graph; for example, refining the above rule might give: Domestic Airporttype?zflight?x \(\Rightarrow\) ?xcountry?y;
add an edge with two existing variables; for example, refining the above rule might give: \(\Rightarrow\) ?xcountry?y.

These refinements can be combined arbitrarily, which gives rise to a potentially exponential search space, where rules meeting given thresholds for support and confidence are maintained. To improve efficiency, the search space can be pruned; for example, these three refinements always decrease support, so if a rule does not meet the support threshold, there is no need to explore its refinements. Further restrictions are imposed on the types of rules generated. First, only rules up to a certain fixed size are considered. Second, a rule must be closed, meaning that each variable appears in at least two edges of the rule, which ensures that rules are safe, meaning that each variable in the head appears in the body; for example, the rules produced previously by the first and second refinements are neither closed (variable y appears once) nor safe (variable y appears only in the head).²⁷^{note 27} Safe rules like ?x arrow source capital arrow tip rightward ?ynearby?ztypeAirport \(\Rightarrow\) ?ztypeInternational Airport are not closed as ?x appears only in one edge. Hence the condition that rules are closed is strictly stronger than the condition that they are safe. To ensure closed rules, the third refinement is applied until a rule is closed. For further discussion of possible optimisations based on pruning and indexing, we refer to the paper on AMIE+ [206].

Later works have built on these techniques for mining rules from knowledge graphs. Mohamed H. Gad-Elrab, et al. [207] propose a method to learn non-monotonic rules – rules with negated edges in the body – in order to capture exceptions to base rules; for example, the approach may learn a rule not international rule premise \(\Rightarrow\) ?x arrow source country arrow tip rightward ?y, indicating that flights are within the same country except when the (departure) airport is international, where the exception is shown dotted and we use \(\neg\) to negate an edge. The RuLES system [208] – which is also capable of learning non-monotonic rules – proposes to mitigate the limitations of the PCA heuristic by extending the confidence measure to consider the plausibility scores of knowledge graph embeddings for entailed edges not appearing in the graph. Where available, explicit statements about the completeness of the knowledge graph (such as expressed in shapes; see Section 3.1.2) can be used in lieu of PCA for identifying negative edges. Along these lines, CARL [209] exploits additional knowledge about the cardinalities of relations to refine the set of negative examples and the confidence measure for candidate rules. Alternatively, where available, ontologies can be used to derive logically-certain negative edges under OWA through, for example, disjointness axioms. The system proposed by d’Amato et al. [210, 211] leverages ontologically-entailed negative edges for determining the confidence of rules generated through an evolutionary algorithm.

While the previous works involve discrete expansions of candidate rules for which a fixed confidence scoring function is applied, another line of research is on a technique called differentiable rule mining [212, 213, 214], which allows end-to-end learning of rules. The core idea is that the joins in rule bodies can be represented as matrix multiplication. More specifically, we can represent the relations of an edge label \(p\) by the adjacency matrix \(\mathbf{A}_p\) (of size \(|V| \times |V|\)) such that the value on the \(i\)^th row of the \(j\)^th column is \(1\) if there is an edge labelled \(p\) from the \(i\)^th entity to the \(j\)^th entity; otherwise the value is \(0\). Now we can represent a join in a rule body as matrix multiplication; for example, given ?x arrow source domestic flight arrow tip rightward ?y arrow tip leftward country?z \(\Rightarrow\) ?xcountry?z, we can denote the body by the matrix multiplication \(\mathbf{A}\)_df.\(\mathbf{A}\)_c., which gives an adjacency matrix representing entailed country edges, where we should expect the \(1\)’s in \(\mathbf{A}\)_df.\(\mathbf{A}\)_c. to be covered by the head’s adjacency matrix \(\mathbf{A}\)_c.. Since we are given adjacency matrices for all edge labels, we are left to learn confidence scores for individual rules, and to learn rules (of varying length) with a threshold confidence. Along these lines, NeuralLP [213] uses an attention mechanism to select a variable-length sequence of edge labels for path-like rules of the form ?x arrow source p_\(1\) arrow tip rightward y_\(1\)p_\(2\)…p_\(n\)y_\(n\)p_\(n+1\)?z \(\Rightarrow\) ?xp?z, for which confidences are likewise learnt. DRUM [214] also learns path-like rules, where, observing that some edge labels are more/less likely to follow others in the rules – for example, flight will not be followed by capital in the graph of Figure 24 as the join will be empty – the system uses bidirectional recurrent neural networks (a popular technique for learning over sequential data) to learn sequences of relations for rules, and their confidences. These differentiable rule mining techniques are, however, currently limited to learning path-like rules.

Axiom mining

Aside from rules, more general forms of axioms – expressed in logical languages such as DLs (see Section 4.3.2) – can be mined from a knowledge graph. We can divide these approaches into two categories: those mining specific axioms and more general axioms.

Among systems mining specific types of axioms, disjointness axioms are a popular target; for example, the disjointness axiom DomesticAirport \(\sqcap\) InternationalAirport \(\equiv \bot\) states that the intersection of the two classes is equivalent to the empty class, or in simpler terms, no node can be simultaneously of type Domestic Airport and International Airport. The system proposed by Johanna Völker, et al. [215] extracts disjointness axioms based on (negative) association rule mining [216], which finds pairs of classes where each has many instances in the knowledge graph but there are relatively few (or no) instances of both classes. Gerald Töpper, et al. [217] rather extract disjointness for pairs of classes that have a cosine similarity below a fixed threshold. For computing this cosine similarity, class vectors are computed using a TF–IDF analogy, where the “document” of each class is constructed from all of its instances, and the “terms” of this document are the properties used on the class instances (preserving multiplicities). While the previous two approaches find disjointness constraints between named classes (e.g., city is disjoint with airport), Giuseppe Rizzo, et al. [218] propose an approach that can capture disjointness constraints between class descriptions (e.g., city without an airport nearby is disjoint with city that is the capital of a country). The approach first clusters similar nodes of the knowledge base. Next, a terminological cluster tree is extracted, where each leaf node indicates a cluster extracted previously, and each internal (non-leaf) node is a class definition (e.g., cities) where the left child is either a cluster having all nodes in that class or a sub-class description (e.g., cities without airports) and the right child is either a cluster having no nodes in that class or a disjoint-class description (e.g., non-cities with events). Finally, candidate disjointness axioms are proposed for pairs of class descriptions in the tree that are not entailed to have a subclass relation.

Other systems propose methods to learn more general axioms. A prominent such system is DL-Learner [219], which is based on algorithms for class learning (aka concept learning), whereby given a set of positive nodes and negative nodes, the goal is to find a logical class description that divides the positive and negative sets. For example, given \(\{\)Iquique, Arica\(\}\) as the positive set and \(\{\)Santiago\(\}\) as the negative set, we may learn a (DL) class description \(\exists\)nearby.Airport \(\sqcap \neg(\exists\) capital\(^-.\top)\), denoting entities near to an airport that are not capitals, of which all positive nodes are instances and no negative nodes are instances. Such class descriptions are learnt in an analogous manner to how aforementioned systems like AMIE learn rules, with a refinement operator used to move from more general classes to more specific classes (and vice-versa), a confidence scoring function, and a search strategy. The system further supports learning more general axioms through a scoring function that uses count queries to determine what ratio of expected edges – edges that would be entailed were the axiom true – are indeed found in the graph; for example, to score the axiom \(\exists\)flight\(^{-}\).DomesticAirport \(\sqsubseteq\) InternationalAirport over Figure 30, we can use a graph query to count how many nodes have incoming flights from a domestic airport (there are \(3\), and how many nodes have incoming flights from a domestic airport and are international airports (there is \(1\), where the greater the difference between both counts, the weaker the evidence for the axiom.

Creation and Enrichment

In this section, we discuss the principal techniques by which knowledge graphs can be created and subsequently enriched from diverse sources of legacy data that may range from plain text to structured formats (and anything in between). The appropriate methodology to follow when creating a knowledge graph depends on the actors involved, the domain, the envisaged applications, the available data sources, etc. Generally speaking, however, the flexibility of knowledge graphs lends itself to starting with an initial core that can be incrementally enriched from other sources as required (typically following an Agile [220] or “pay-as-you-go” [221] methodology). For our running example, we assume that the tourism board decides to build a knowledge graph from scratch, aiming to initially describe the main tourist attractions – places, events, etc. – in Chile in order to help visiting tourists identify those that most interest them. The board decides to postpone adding further data, like transport routes, reports of crime, etc., for a later date.

Human Collaboration

One approach for creating and enriching knowledge graphs is to solicit direct contributions from human editors. Such editors may be found in-house (e.g., employees of the tourist board), using crowd-sourcing platforms, through feedback mechanisms (e.g., tourists adding comments on attractions), through collaborative-editing platforms (e.g., an attractions wiki open to public edits), etc. Though human involvement incurs high costs [222], some prominent knowledge graphs have been primarily based on direct contributions from human editors [34, 8]. Depending on how the contributions are solicited, however, the approach has a number of key drawbacks, due primarily to human error [223], disagreement [224], bias [225], vandalism [226], etc. Successful collaborative creation further raises challenges concerning licensing, tooling, and culture [223]. Humans are sometimes rather employed to verify and curate additions to a knowledge graph extracted by other means [223] (through, e.g., video games with a purpose [227]), to define high-quality mappings from other sources [228], to define appropriate high-level schema [229, 94], and so forth.

Text Sources

Text corpora – such as sourced from newspapers, books, scientific articles, social media, emails, web crawls, etc. – are an abundant source of rich information [230, 231]. However, extracting such information with high precision and recall for the purposes of creating or enriching a knowledge graph is a non-trivial challenge. To address this, techniques from Natural Language Processing (NLP) [232, 233] and Information Extraction (IE) [234, 235, 113] can be applied. Though processes vary considerably across text extraction frameworks, in Figure 31 we illustrate four core tasks for text extraction on a sample sentence. We will discuss these tasks in turn.

Text extraction example; nodes new to the knowledge graph are shown dashed

Pre-processing

The pre-processing task may involve applying various techniques to the input text, where Figure 31 illustrates Tokenisation, which parses the text into atomic terms and symbols. Other pre-processing tasks applied to a text corpus may include: Part-of-Speech (POS) tagging [232, 233] to identify terms representing verbs, nouns, adjectives, etc.; Dependency Parsing, which extracts a grammatical tree structure for a sentence where leaf nodes indicate individual words that together form phrases (e.g., noun phrases, verb phrases) and eventually clauses and sentences [232, 233]; and Word Sense Disambiguation (WSD) [236] to identify the meaning (aka sense) in which a word is used, linking words with a lexicon of senses (e.g., WordNet [237] or BabelNet [238]), where, for instance, the term flights may be linked with the WordNet sense “an instance of travelling by air” rather than “a stairway between one floor and the next”. The appropriate type of pre-processing to apply often depends on the requirements of later tasks in the pipeline.

Named Entity Recognition (NER)

The NER task identifies mentions of named entities in a text [239, 240], typically targetting mentions of people, organisations, locations, and potentially other types [241, 242, 243]. A variety of NER techniques exist, with many modern approaches based on learning frameworks that leverage lexical features (e.g., POS tags, dependency parse trees, etc.) and gazetteers (e.g., lists of common first names, last names, countries, prominent businesses, etc.). Supervised methods [244, 245, 246] require manually labelling all entity mentions in a training corpus, whereas bootstrapping-based approaches [247, 248, 242, 249] rather require a small set of seed examples of entity mentions from which patterns can be learnt and applied to unlabelled text. Distant supervision [241, 250, 243] uses known entities in a knowledge graph as seed examples through which similar entities can be detected. Aside from learning-based frameworks, manually-crafted rules [251, 252] are still sometimes used due to their more controllable and predictable behaviour [253]. The named entities identified by NER may be used to generate new candidate nodes for the knowledge graph (known as emerging entities, shown dashed in Figure 31), or may be linked to existing nodes per the Entity Linking task described in the following.

Entity Linking (EL)

The EL task associates mentions of entities in a text with the existing nodes of a target knowledge graph, which may be the nucleus of a knowledge graph under creation, or an external knowledge graph [254]. In Figure 31, we assume that the nodes Santiago and Easter Island already exist in the knowledge graph (possibly extracted from other sources). EL may then link the given mentions to these nodes. The EL task presents two main challenges. First, there may be multiple ways to mention the same entity, as in the case of Rapa Nui and Easter Island; if we created a node Rapa Nui to represent that mention, we would split the information available under both mentions across different nodes, where it is thus important for the target knowledge graph to capture the various aliases and multilingual labels by which one can refer to an entity [255]. Secondly, the same mention in different contexts can refer to distinct entities; for instance, Santiago can refer to cities in Chile, Cuba, Spain, among others. The EL task thus considers a disambiguation phase wherein mentions are associated to candidate nodes in the knowledge graph, the candidates are ranked, and the most likely node being mentioned is chosen [254]. Context can be used in this phase; for example, if Easter Island is a likely candidate for the corresponding mention alongside Santiago, we may boost the probability that this mention refers to the Chilean capital as both candidates are located in Chile. Other heuristics for disambiguation consider a prior probability, where for example, Santiago most often refers to the Chilean capital (being, e.g., the largest city with that name); centrality measures on the knowledge graph can be used for such purposes [254].

Relation Extraction (RE)

The RE task extracts relations between entities in the text [256, 257]. The simplest case is that of extracting binary relations in a closed setting wherein a fixed set of relation types are considered. While traditional approaches often used manually-crafted patterns [258], modern approaches rather tend to use learning-based frameworks [259], including supervised methods over manually-labelled examples [260, 256]. Other learning-based approaches again use bootstrapping [248, 261] and distant supervision [262, 263, 264, 265, 266, 267] to forgo the need for manual labelling; the former requires a subset of manually-labelled seed examples, while the latter finds sentences in a large corpus of text mentioning pairs of entities with a known relation/edge, which are used to learn patterns for that relation. Binary RE can also be applied using unsupervised methods in an open setting – often referred to as Open Information Extraction (OIE) [268, 269, 270, 271, 272, 273] – whereby the set of target relations is not pre-defined but rather extracted from text based on, for example, dependency parse trees from which relations are taken.

A variety of RE methods have been proposed to extract \(n\)-ary relations that capture further context for how entities are related. In Figure 31, we see how an \(n\)-ary relation captures additional temporal context, denoting when Rapa Nui was named a World Heritage site; in this case, an anonymous node is created to represent the higher-arity relation in the directed-labelled graph. Various methods for \(n\)-ary RE are based on frame semantics [274], which, for a given verb (e.g., “named”), captures the entities involved and how they may be interrelated. Resources such as FrameNet [275] then define frames for words, such as to identify that the semantic frame for “named” includes a speaker (the person naming something), an entity (the thing named) and a name. Optional frame elements are an explanation, a purpose, a place, a time, etc., that may add context to the relation. Other RE methods are rather based on Discourse Representation Theory (DRT) [276], which considers a logical representation of text based on existential events. Under this theory, for example, the naming of Easter Island as a World Heritage Site is considered to be an (existential) event where Easter Island is the patient (the entity affected), leading to the logical (neo-Davidsonian) formula:

\( \exists e: \big(\)naming\((e),\) patient\((e,\) Easter Island\(),\) name\((e,\) World Heritage Site\()\big) \)

Such a formula is analogous to the idea of reification, as discussed previously in Section 3.3.

Finally, while relations extracted in a closed setting are typically mapped directly to a knowledge graph, relations that are extracted in an open setting may need to be aligned with the knowledge graph; for example, if an OIE process extracts a binary relation Santiago arrow source has flights to arrow tip rightward Easter Island, it may be the case that the knowledge graph does not have other edges labelled has flights to, where alignment may rather map such a relation to the edge SantiagoflightEaster Island assuming flight is used in the knowledge graph. A variety of methods have been applied for such purposes, including mappings [277, 278] and rules [279] for aligning \(n\)-ary relations; distributional and dependency-based similarities [280], association rule mining [281], Markov clustering [282] and linguistic techniques [283] for aligning OIE relations; amongst others.

Joint tasks

Having presented the four main tasks for building knowledge graphs from text, it is important to note that frameworks do not always follow this particular sequence of tasks. A common trend, for example, is to combine interdependent tasks, jointly performing WSD and EL [255], or NER and EL [284, 285], or NER and RE [286, 287], etc., in order to mutually improve the performance of multiple tasks. For further details on extracting knowledge from text we refer to the book by Diana Maynard, et al. [232] and the recent survey by Jose L. Martínez-Rodríguez, et al. [113].

Markup Sources

The Web was founded on interlinking markup documents wherein markers (aka tags) are used to separate elements of the document (typically for formatting purposes). Most documents on the Web use the HyperText Markup Language (HTML). Figure 32 presents an example HTML webpage about World Heritage Sites in Chile. Other formats of markup include Wikitext used by Wikipedia, TeX for typesetting, Markdown used by Content Management Systems, etc. One approach for extracting information from markup documents – in order to create and/or enrich a knowledge graph – is to strip the markers (e.g., HTML tags), leaving only plain text upon which the techniques from the previous section can be applied. However, markup can be useful for extraction purposes, where variations of the aforementioned tasks for text extraction have been adapted to exploit such markup [288, 289, 113]. We can divide extraction techniques for markup documents into three main categories: general approaches that work independently of the markup used in a particular format, often based on wrappers that map elements of the document to the output; focussed approaches that target specific forms of markup in a document, most typically web tables (but sometimes also lists, links, etc.); and form-based approaches that extract the data underlying a webpage, per the notion of the Deep Web. These approaches can often benefit from the regularities shared by webpages of a given website, be it due to informal conventions on how information is published across webpages, or due to the re-use of templates to automatically generate content across webpages; for example, intuitively speaking, while the webpage of Figure 32 is about Chile, we will likely find pages for other countries following the same structure on the same website.

<html>
  <head><title>UNESCO World Heritage Sites</title></head>
  <body>
    <h1>World Heritage Sites</h1>
	<h2>Chile</h2>
	<p>Chile has 6 UNESCO World Heritage Sites.</p>
	<table border="1">
	  <tr><th>Place</th><th>Year</th><th>Criteria</th></tr>
	  <tr><td>Rapa Nui</td><td>1995</td>
		<td rowspan="6">Cultural</td></tr>
	  <tr><td>Churches of Chiloé</td><td>2000</td></tr>
	  <tr><td>Historical Valparaíso</td><td>2003</td></tr>
	  <tr><td>Saltpeter Works</td><td>2005</td></tr>
	  <tr><td>Sewell Mining Town</td><td>2006</td></tr>
	  <tr><td>Qhapaq Ñan</td><td>2014</td></tr>
	</table>
  </body>
</html>

UNESCO World Heritage Sites

World Heritage Sites

Chile

Chile has 6 UNESCO World Heritage Sites.

Place	Year	Criteria
Rapa Nui	1995	Cultural
Churches of Chiloé	2000
Historical Valparaíso	2003
Saltpeter Works	2005
Sewell Mining Town	2006
Qhapaq Ñan	2014

Example markup document (HTML) with source-code (left) and formatted document (right)

Wrapper-based extraction

Many general approaches are based on wrappers that locate and extract the useful information directly from the markup document. While the traditional approach was to define such wrappers manually – a task for which a variety of declarative languages and tools have been defined – such approaches are brittle to changes in a websites layout [290]. Hence other approaches allow for (semi-)automatically inducing wrappers [291]. A modern such approach – used to enrich knowledge graphs in systems such as LODIE [292] – is to apply distant supervision, whereby EL is used to identify and link entities in the webpage to nodes in the knowledge graph such that paths in the markup that connect pairs of nodes for known edges can be extracted, ranked, and applied to other examples. Taking Figure 32, for example, distant supervision may link Rapa Nui and World Heritage Sites to the nodes Easter Island and World Heritage Site in the knowledge graph using EL, and given the edge Easter Island arrow source named arrow tip rightward World Heritage Site in the knowledge graph (extracted per Figure 31), identify the candidate path \((x,\)td\([1]^{-} \cdot \) tr\(^{-} \cdot \) table\(^- \cdot \) h1\(,y)\) as reflecting edges of the form \(x\)named\(y\), where \(t[n]\) indicates the \(n\)^th child of tag \(t\), \(t^-\) its inverse, and \(t_1 \cdot t_2\) concatenation. Finally, paths with high confidence (e.g., ones “witnessed” by many known edges in the knowledge graph) can then be used to extract novel edges, such as Qhapaq Ñan arrow source named arrow tip rightward World Heritage Site, both on this page and on related pages of the website with similar structure (e.g., for other countries).

Web table extraction

Other approaches target specific types of markup, most commonly web tables, i.e., tables embedded in HTML webpages. However, web tables are designed to enhance human readability, which often conflicts with machine readability. Many web tables are used for layout and page structure (e.g., navigation bars), while those that do contain data may follow different formats such as relational tables, listings, attribute-value tables, matrices, etc. [293, 294]. Hence a first step is to classify tables to find ones appropriate for the given extraction mechanism(s) [294, 295]. Next, web tables may contain column spans, row spans, inner tables, or may be split vertically to improve human aesthetics. Hence a table normalisation phase is required to identify headers, merge split tables, un-nest tables, transpose tables, etc. [296, 293, 294, 297, 298, 299]. Subsequently, approaches may need to identify the protagonist [294, 300] – the main entity that the table describes – which is rather found elsewhere in the webpages; for example, though World Heritage Sites is the protagonist of the table of Figure 31, it is not mentioned by the table. Finally, extraction processes can be applied, potentially associating cells with entities [301, 302], columns with types [297, 301, 302], and column pairs with relations [301, 300]. For the purposes of enriching knowledge graphs, more recent approaches again apply distant supervision, first linking table cells to knowledge graph nodes, which are used to generate candidates for type and relation extraction [301, 302, 300]. Statistical distributions can also aid in linking numerical columns [303]. Specialised extraction frameworks have also been designed for tables on specific websites, where prominent knowledge graphs, such as DBpedia [32] and YAGO [304] focus on extraction from info-box tables in Wikipedia.

Deep Web crawling

The Deep Web presents a rich source of information accessible only through searches on web forms, thus requiring Deep Web crawling techniques to access [305]. Systems have been proposed to extract knowledge graphs from Deep Web sources [306, 307, 308]. Approaches typically attempt to generate sensible form inputs – which may be based on a user query or generated from reference knowledge – and then extract data from the generated responses (markup documents) using the aforementioned techniques [306, 307, 308].

Structured Sources

Much of the legacy data available within organisations and on the Web is represented in structured formats, primarily tables – in the form of relational databases, CSV files, etc. – but also tree-structured formats such as JSON, XML etc. Unlike text and markup documents, structured sources can often be mapped to knowledge graphs whereby the structure is (precisely) transformed according to a mapping rather than (imprecisely) extracted. The mapping process involves two steps: 1) create a mapping from the source to a graph, and 2) use the mapping in order to materialise the source data as a graph or to virtualise the source (creating a graph view over the legacy data).

Mapping from tables

Tabular sources of data are prevalent, where, for example, the structured content underlying many organisations, websites, etc., are housed in relational databases. In Figure 33 we present an example of a relational database instance that we would like to integrate into our knowledge graph under construction. There are then two approaches for mapping content from tables to knowledge graphs: a direct mapping, and a custom mapping.

Report

crime	claimant	station	date
Pickpocketing	XY12SDA	Viña del Mar	2019-04-12
Assault	AB9123N	Arica	2019-04-12
Pickpocketing	XY12SDA	Rapa Nui	2019-04-12
Fraud	FI92HAS	Arica	2019-04-13

Claimant

id	name	country
XY12SDA	John Smith	U.S.
AB9123N	Joan Dubois	France
XI92HAS	Jorge Hernández	Chile

Example relational database instance with two tables describing crime data

Possible result of applying a direct mapping to the first rows of both tables in Figure 33

A direct mapping automatically generates a graph from a table. We present in Figure 34 the result of a standard direct mapping [309], which creates an edge x arrow source y arrow tip rightward z for each (non-header, non-empty, non-null) cell of the table, such that x represents the row of the cell, y the column name of the cell, and z the value of the cell. In particular, x typically encodes the values of the primary key for a row (e.g., Claimant.id); otherwise, if no primary key is defined (e.g., per the Report table), x can be an anonymous node or a node based on the row number. The node x and edge label y further encode the name of the table to avoid clashes across tables that have the same column names used with different meanings. For each row x, we may add a type edge based on the name of its table. The value z may be mapped to datatype values in the corresponding graph model based on the source domain (e.g., a value in an SQL column of type Date can be mapped to xsd:date in the RDF data model). If the value is null (or empty), typically the corresponding edge will be omitted.²⁸^{note 28} One might consider representing nulls with anonymous nodes. However, nulls in SQL can be used to mean that there is no such value, which conflicts with the existential semantics of anonymous nodes in models such as RDF (i.e., blank nodes). With respect to Figure 34, we highlight the difference between the nodes Claimant-XY12SDA and XY12SDA, where the former denotes the row (or entity) identified by the latter primary key value. In case of a foreign key between two tables – such as Report.claimant referencing Claimant.id – we can link, for example, to Claimant-XY12SDA rather than XY12SDA, where the former node also has the name and country of the claimant. A direct mapping along these lines has been standardised for mapping relational databases to RDF [309], where Radu Stoica, et al. [310] have recently proposed an analogous direct mapping for property graphs. Another direct mapping has been defined for CSV and other tabular data [311] that further allows for specifying column names, primary/foreign keys, and data types – which are often missing in such data formats – as part of the mapping itself.

Although a direct mapping can be applied automatically on tabular sources of data and preserve the information of the original source – i.e., allowing a deterministic inverse mapping that reconstructs the tabular source from the output graph [312] – in many cases it is desirable to customise a mapping, such as to align edge labels or nodes with a knowledge graph under enrichment, etc. Along these lines, declarative mapping languages allow for manually defining custom mappings from tabular sources to graphs. A standard language along these lines is the RDB2RDF Mapping Language (R2RML) [228], which allows for mapping from individual rows of a table to one or more custom edges, with nodes and edges defined either as constants, as individual cell values, or using templates that concatenate multiple cell values from a row and static substrings into a single term; for example, a template {id}-{country} may produce nodes such as XY12SDA-U.S. from the Claimant table. In case that the desired output edges cannot be defined from a single row, R2RML allows for (SQL) queries to generate tables from which edges can be extracted where, for example, edges such as U.S. arrow source crimes arrow tip rightward 2 can be generated by defining the mapping with respect to a query that joins the Report and Claimant tables on claimant=id, grouping by country, and applying a count. A mapping can then be defined on the results table such that the source node denotes the value of country, the edge label is the constant crimes, and the target node is the count value. An analogous standard also exists for mapping CSV and other tabular data to RDF graphs, again allowing keys, column names, and datatypes to be chosen as part of the mapping [313].

Once the mappings have been defined, one option is to use them to materialise graph data following an Extract-Transform-Load (ETL) approach, whereby the tabular data are transformed and explicitly serialised as graph data using the mapping. A second option is to use virtualisation through a Query Rewriting (QR) approach, whereby queries on the graph (using, e.g., SPARQL, Cypher, etc.) are translated to queries over the tabular data (typically using SQL). Comparing these two options, ETL allows the graph data to be used as if they were any other data in the knowledge graph. However, ETL requires updates to the underlying tabular data to be explicitly propagated to the knowledge graph, whereas a QR approach only maintains one copy of data to be updated. The area of Ontology-Based Data Access (OBDA) [314] is then concerned with QR approaches that support ontological entailments as discussed in Section 4. Although most QR approaches only support non-recursive entailments expressible as a single (non-recursive) query, some QR approaches support recursive entailments through rewritings to recursive queries [315].

Mapping from trees

A number of popular data formats are based on trees, including XML and JSON. While one could imagine – leaving aside issues such as the ordering of children in a tree – a trivial direct mapping from trees to graphs by simply creating edges of the form \(x\) arrow source child arrow tip rightward \(y\) for each node \(y\) that is a child of \(x\) in the source tree, such an approach is not typically used, as it represents the literal structure of the source data. Instead, the content of tree-structured data can be more naturally represented as a graph using a custom mapping. Along these lines, the GRDLL standard [316] allows for mapping from XML to (RDF) graphs, while the JSON-LD standard [317] allows for mapping from JSON to (RDF) graphs. In contrast, hybrid query languages such as XSPARQL [162] allow for querying XML and RDF in an integrated fashion, thus supporting both materialisation and virtualisation of graphs over tree-structured sources of legacy data.

Mapping from other knowledge graphs

Another route to construct or enrich knowledge graphs is to leverage existing knowledge graphs as a source. In our scenario, for instance, a large number of points of interest for the Chilean tourist board may be available in existing knowledge graphs such as DBpedia [32], LinkedGeoData [39], Wikidata [34], YAGO [35], BabelNet [238], etc. However, depending on the knowledge graph under construction, not all entities and/or relations may be of interest. A standard option to extract a relevant sub-graph of data is to use SPARQL construct-queries that generate graphs as output [318]. Entity and schema alignment between the knowledge graphs may be further necessary to better integrate (parts of) external knowledge graphs; this may be done using linking tools for graphs [319, 320], based on the use of external identifiers [223], or indeed may be done manually [223]. For instance, Wikidata [34] uses Freebase [33, 223] as a source; Simon Gottschalk, et al. [321] extract an event-centric knowledge graph from Wikidata, DBpedia and YAGO; while Sebastian Neumaier, et al. [318] construct a spatio-temporal knowledge graph from Geonames, Wikidata, and PeriodO [322] (as well as tabular data).

Schema/Ontology Creation

The discussion thus far has focussed on extracting data from external sources in order to create and enrich a knowledge graph. In this section, we discuss some of the principal methods for generating a schema based on external sources of data, including human knowledge. For discussion on extracting a schema from the knowledge graph itself, we refer back to Section 3.1.3. In general, much of the work in this area has focussed on the creation of ontologies using either ontology engineering methodologies, and/or ontology learning. We discuss these two approaches in turn.

Ontology engineering

Ontology engineering refers to the development and application of methodologies for building ontologies, proposing principled processes by which better quality ontologies can be constructed and maintained with less effort. Early methodologies [323, 324, 325] were often based on a waterfall-like process, where requirements and conceptualisation were fixed before starting to implement the ontology in a logical language, using, for example, an ontology engineering tool [326, 229, 327]. However, for situations involving large or ever-evolving ontologies, more iterative and agile ways of building and maintaining ontologies have been proposed.

DILIGENT [328] was an early example of an agile methodology, proposing a complete process for ontology life-cycle management and knowledge evolution, as well as separating local changes (local views on knowledge) from global updates of the core part of the ontology, using a review process to authorise the propagation of changes from the local to the global level. This methodology is similar to how, for instance, the large clinical reference terminology SNOMED CT [329] (also available as an ontology) is maintained and evolved, where the (international) core terminology is maintained based on global requirements, while national or local extensions to SNOMED CT are maintained based on local requirements. A group of authors then decides which national or local extensions to propagate to the core terminology. More modern agile methodologies include eXtreme Design (XD) [330, 331], Modular Ontology Modelling (MOM) [332, 333], Simplified Agile Methodology for Ontology Development (SAMOD) [334], etc. Such methodologies typically include two key elements: ontology requirements and (more recently) ontology design patterns.

Ontology requirements specify the intended task of the resulting ontology – or indeed the knowledge graph itself – based on the ontology as its schema. A common way to express ontology requirements is through Competency Questions (CQ) [335], which are natural language questions illustrating the typical knowledge that one would require the ontology (or the knowledge graph) to provide. Such CQs can then be complemented with additional restrictions, and reasoning requirements, in case that the ontology should also contain restrictions and general axioms for inferring new knowledge or checking data consistency. A common way of testing ontologies (or knowledge graphs based on them) is then to formalise the CQs as queries over some test set of data, and make sure the expected results are entailed [336, 337]. We may, for example, consider the CQ “What are all the events happening in Santiago?”, which can be represented as a graph query Event arrow tip leftward type arrow source ?eventlocation arrow tip rightward Santiago. Taking the data graph of Figure 1 and the axioms of Figure 12, we can check to see if the expected result EID15 is entailed by the ontology and the data, and since it is not, we may consider expanding the axioms to assert that locationtypeTransitive.

Ontology Design Patterns (ODPs) are another common feature of modern methodologies [338, 339], specifying generalisable ontology modelling patterns that can be used as inspiration for modelling similar patterns, as modelling templates [340, 341], or as directly reusable components [342, 343]. Several pattern libraries have been made available online, ranging from carefully curated ones [344, 343] to open and community moderated ones [342]. As an example, in modelling an ontology for our scenario, we may decide to follow the Core Event ontology pattern proposed by Adila Krisnadhi, et al. [345], which specifies a spatio-temporal extent, sub-events, and participants of an event, further suggesting competency questions, formal definitions, etc., to support this pattern.

Ontology learning

The previous methodologies outline methods by which ontologies can be built and maintained manually. Ontology learning, in contrast, can be used to (semi-)automatically extract information from text that is useful for the ontology engineering process [346, 347]. Early methods focussed on extracting terminology from text that may represent the relevant domain’s classes; for example, from a collection of text documents about tourism, a terminology extraction tool – using measures of unithood that determine how cohesive an \(n\)-gram is as a unitary phrase, and termhood that determine how relevant the phrase is to a domain [283] – may identify \(n\)-grams such as “visitor visa”, “World Heritage Site”, “off-peak rate”, etc., as terminology of particular importance to the tourist domain, and that thus may merit inclusion in such an ontology. Axioms may also be extracted from text, where subclass axioms are commonly targetted, based on modifying nouns and adjectives that incrementally specialise concepts (e.g., extracting Visitor Visa arrow source subc. of arrow tip rightward Visa from the noun phrase “visitor visa” and isolated appearances of “visa” elsewhere), or using Hearst patterns [258] (e.g., extracting Off-Peak Ratesubc. ofDiscount from “many discounts, such as off-peak rates, are available” based on the pattern “X, such as Y”). Textual definitions can also be harvested from large texts to extract hypernym relations and induce a taxonomy from scratch [348]. More recent works aim to extract more expressive axioms from text, including disjointness axioms [215]; and axioms involving the union and intersection of classes, along with existential, universal, and qualified-cardinality restrictions [349]. The results of an ontology learning process can then serve as input to a more general ontology engineering methodology, allowing us to validate the terminological coverage of an ontology, to identify new classes and axioms, etc.

Quality Assessment

Independently of the (kinds of) source(s) from which a knowledge graph is created, data extracted for the initial knowledge graph will usually be incomplete, and will contain duplicate, contradictory or even incorrect statements – especially when taken from multiple sources. After the initial creation and enrichment of a knowledge graph from external sources, a crucial step is thus to assess the quality of the resulting knowledge graph. By quality, we here refer to fitness for purpose. Quality assessment then helps to ascertain for which purposes a knowledge graph can be reliability used.

In the following we discuss quality dimensions that capture aspects of multifaceted data quality which evolves from the traditional domain of databases to the domain of knowledge graphs [350], some of which are general, others of which are more particular to knowledge graphs [351]. While quality dimensions aim to capture qualitative aspects of the data, we also discuss quality metrics that provide ways to measure quantitative aspects of these dimensions. We discuss groupings of dimensions and metrics as inspired by Carlo Batini, et al. [352].

Accuracy

Accuracy refers to the extent to which entities and relations – encoded by nodes and edges in the graph – correctly represent real-life phenomena. Accuracy can be further sub-divided into three dimensions: syntactic accuracy, semantic accuracy, and timeliness.

Syntactic accuracy

Syntactic accuracy is the degree to which the data are accurate with respect to the grammatical rules defined for the domain and/or data model. A prevalent example of syntactic inaccuracies occurs with datatype nodes, which may be incompatible with a defined range or be malformed. For example, assuming that a property start is defined with the range xsd:dateTime, taking a value such as "March 29, 2019, 20:00"^^xsd:string would be incompatible with the defined range, while a value "March 29, 2019, 20:00"^^xsd:dateTime would be malformed (a value such as "2019-11-12T20:00:00"^^xsd:dateTime is rather expected). A corresponding metric for syntactic accuracy is the ratio between the number of incorrect values of a given property and the total number of values for the same property [351]. Such forms of syntactic accuracy can typically be assessed using validation tools [353, 354].

Semantic accuracy

Semantic accuracy is the degree to which data values correctly represent real world phenomena, which may be affected by imprecise extraction results, imprecise entailments, vandalism, etc. For instance, given that the National Congress of Chile is located in Valparaíso, this may give rise to the edge Chile arrow source capital arrow tip rightward Valparaiso (through entailment, extraction, completion, etc.), which is in fact semantically inaccurate: the Chilean capital is Santiago. Assessing the level of semantic inaccuracies is challenging. While one option is to apply manual verification, an automatic option may be to check the stated relation against several sources [355, 356]. Another option is to rather validate the quality of individual processes used to generate the knowledge graph, based on measures such as precision, possibly with the help of human experts or gold standards [113].

Timeliness

Timeliness is the degree to which the knowledge graph is currently up-to-date with the real world state [357]; in other words, a knowledge graph may be semantically accurate now, but may quickly become inaccurate (outdated) if no procedures are in place to keep it up-to-date in a timely manner. For example, consider a user checking the tourist knowledge graph for flights from one city to another. Suppose that the flight timetable is updated every minute with current flight statuses, but the knowledge graph is only updated every hour. In this case, we see that there is a quality issue regarding timeliness in the knowledge graph. Timeliness can be assessed based on how frequently the knowledge graph is updated with respect to underlying sources [357, 358], which can be done using temporal annotations of changes in the knowledge graph [359, 360], as well as contextual representations that capture the temporal validity of data (see Section 3.3).

Coverage

Coverage refers to avoiding the omission of domain-relevant elements, which otherwise may yield incomplete query results or entailments, biased models, etc.

Completeness

Completeness refers to the degree to which all required information is present in a particular dataset. Completeness comprises the following aspects: (i) schema completeness refers to the degree to which the classes and properties of a schema are represented in the data graph, (ii) property completeness refers to the ratio of missing values for a specific property, (iii) population completeness provides the percentage of all real-world entities of a particular type that are represented in the datasets, and (iv) linkability completeness refers to the degree to which instances in the data set are interlinked. Measuring completeness directly is non-trivial as it requires knowledge of a hypothetical ideal knowledge graph [361] that contains all the elements that the knowledge graph in question should “ideally” represent. Concrete strategies involve comparison with gold standards that provide samples of the ideal knowledge graph (possibly based on completeness statements [361]), or measuring the recall of extraction methods from complete sources [113], and so forth.

Representativeness

Representativeness is a related dimension that, instead of focusing on the ratio of domain-relevant elements that are missing, rather focuses on assessing high-level biases in what is included/excluded from the knowledge graph [362]. As such, this dimension assumes that the knowledge graph is incomplete – i.e., that it is a sample of the ideal knowledge graph – and asks how biased this sample is. Biases may occur in the data, in the schema, or during reasoning [225]. Examples of data biases include geographic biases that under-represent entities/relations from certain parts of the world [225], linguistic biases that under-represent multilingual resources (e.g., labels and descriptions) for certain languages [363], social biases that under-represent people of particular genders or races [364], and so forth. In contrast, schema biases may result from high-level definitions extracted from biased data [225], semantic definitions that do not cover uncommon cases, etc. Unrecognised biases may lead to adverse effects; for example, if our tourism knowledge graph has a geographic bias towards events and attractions close to Santiago city – due perhaps to the sources used for creation, the employment of curators from the city, etc. – then this may lead to tourism in and around Santiago being disproportionally promoted (potentially compounding future biases). Measures of representativeness involve comparison of known statistical distributions with those of the knowledge graph, for example, comparing geolocated entities with known population densities [225], linguistic distributions with known distributions of speakers [363], etc. Another option is to compare the knowledge graph with general statistical laws, where Arnaud Soulet, et al. [365] use (non-)conformance with Benford’s law²⁹^{note 29} Benford’s law states that the leading significant digit in many collections of numbers is more likely to be small. to measure representativeness in knowledge graphs.

Coherency

Coherency refers to how well the knowledge graph conforms to – or is coherent with – the formal semantics and constraints defined at the schema-level.

Consistency

Consistency means that a knowledge graph is free of (logical/formal) contradictions with respect to the particular logical entailment considered. For example, in the ontology of our knowledge graph, we may define that flight arrow source range arrow tip rightward Airportdisj. c.City, which when combined with the edges AricaflightSantiagotypeCity, gives rise to an inconsistency, entailing that Santiago is a member of the disjoint classes City and Airport. More generally, any semantic feature in Tables 3–5 with a “not” condition can give rise to inconsistencies if the negated condition is entailed. A measure of consistency can be the number of inconsistencies found in a knowledge graph, possibly sub-divided into the number of such inconsistencies identified by each semantic feature [79].

Validity

Validity means that the knowledge graph is free of constraint violations, such as captured by shape expressions [98] (see Section 3.1.2. We may, for example, specify a shape City whose target nodes have at most one country. Then, given the edges Chile arrow tip leftward country arrow source Santiagocountry arrow tip rightward Cuba, and assuming that Santiago becomes a target of City, we have a constraint violation. Conversely, even if we defined analogous cardinality restrictions in an ontology, this would not necessarily cause an inconsistency since, without UNA, we would first infer that Chile and Cuba refer to the same entity. A straightforward measure of validity is to count the number of violations per constraint.

Succinctness

Succinctness refers to the inclusion only of relevant content (avoiding “information overload”) that is represented in a concise and intelligible manner.

Conciseness

Conciseness refers to avoiding the inclusion of schema and data elements that are irrelevant to the domain. Pablo N. Mendes, et al. [366] distinguish intensional conciseness (schema level), which refers to the case when the data does not contain redundant schema elements (properties, classes, shapes, etc.), and extensional conciseness (data level), when the data does not contain redundant entities and relations. For example, including events in Santiago de Cuba in our knowledge graph dedicated to tourism in Chile may affect the extensional conciseness of the knowledge graph, potentially returning irrelevant results for the given domain. In general, conciseness can be measured in terms of the ratio of properties, classes, shapes, entities, relations, etc., of relevance to the domain, which may in turn require a gold standard, or techniques to assess domain-relevance.

Representational-conciseness

Representational-consieness refers to the extent to which content is compactly represented in the knowledge graph, which may again be intensional or extensional [351]. For example, having two properties flight and flies to serving the same purpose would negatively affect the intensional form of representational conciseness, while having two nodes Santiago and Santiago de Chile representing the capital of Chile (with neither linked to the other) would affect the extensional form of representational conciseness. Another example of representational conciseness is the unnecessary use of complex modelling constructs, such as using reification unnecessarily, or using linked lists when the order of elements is not important [367]. Though representational conciseness is challenging to assess, measures such as the number of redundant nodes can be used [353].

Understandability

Understandability refers to the ease with which data can be interpreted without ambiguity by human users, which involves – at least – the provision of human-readable labels and descriptions (preferably in different languages [363]) that allow them to understand what is being spoken about [367]. Referring back to Figure 1, though the nodes EID15 and EID16 are used to ensure unique identifiers for events, they should also be associated with labels such as Ñam and Food Truck. Ideally the human readable information is sufficient to disambiguate a particular node, such as associating a description "Santiago, the capital of Chile"@en with Santiago to disambiguate the city from synonymous ones. Measures of understandability may include the ratio of nodes with human-readable labels and descriptions, the uniqueness of such labels and descriptions, the languages supported, etc.

Other Quality Dimensions

We have discussed some key quality dimensions that have been discussed for – and apply generally to – knowledge graphs. Further dimensions may be pertinent in the context of specific domains, specific applications, or specific graph data models. For further details, we refer to the survey by Amrapali Zaveri, et al. [351] and to the book by Carlo Batini, et al. [352].

Refinement

Beyond assessing the quality of a knowledge graph, there exist techniques to refine the knowledge graph, in particular to (semi-)automatically complete and correct the knowledge graph [16], aka knowledge graph completion and knowledge graph correction, respectively. As distinguished from the creation and enrichment tasks outlined in Section 6, refinement typically does not involve applying extraction or mapping techniques over external sources in order to ingest their content into the local knowledge graph. Instead, refinement typically targets improvement of the local knowledge graph as given (but potentially using external sources to verify local content [16]).

Completion

Knowledge graphs are characterised by incompleteness [368]. As such, knowledge graph completion aims at filling in the missing edges (aka missing links) of a knowledge graph, i.e., edges that are deemed correct but are neither given nor entailed by the knowledge graph. This task is often addressed with link prediction techniques proposed in the area of Statistical Relational Learning [369], which predict the existence – or sometimes more generally, predict the probability of correctness – of missing edges. For instance, one might predict that the edge Moon Valley arrow source bus arrow tip rightward San Pedro is a probable missing edge for the graph of Figure 24, given that most bus routes observed are return services (i.e., bus is typically symmetric). Link prediction may target three settings: general links involving edges with arbitrary labels, e.g., bus, flight, type, etc.; type links involving edges with label type, indicating the type of an entity; and identity links involving edges with label same as, indicating that two nodes refer to the same entity (cf. Section 3.2.2). While type and identity links can be addressed using general link prediction techniques, the particular semantics of type and identity links can be addressed with custom techniques. (The related task of generating links across knowledge graphs – referred to as link discovery [370] – will be discussed later in Section 9.1.)

General link prediction

Link prediction, in the general case, is often addressed with inductive techniques as discussed in Section 5, and in particular, knowledge graph embeddings and rule/axiom mining. For example, given Figure 24, using knowledge graph embeddings, we may detect that given an edge of the form \(x\) arrow source bus arrow tip rightward \(y\), a (missing) edge \(y\)bus\(x\) has high plausibility, while using symbol-based approaches, we may learn the high-level rule ?xbus?y \(\Rightarrow\) ?ybus?x. Either such approach would help us to predict the missing link Moon ValleybusSan Pedro.

Type-link prediction

Type links are of particular importance to a knowledge graph, where dedicated techniques can be leveraged taking into account the specific semantics of such links. In the case of type prediction, there is only one edge label (type) and typically fewer distinct values (classes) than in other cases, such that the task can be reduced to a traditional classification task [16], training models to identify each semantic class based on features such as outgoing and/or incoming edge labels on their instances in the knowledge graph [371, 372]. For example, assume that in Figure 24 we also know that Arica, Calama, Puerto Montt, Punta Arenas and Santiago are of type City. We may then predict that Iquique and Easter Island are also of type City based on the presence of edges labelled flight to/from these nodes, which (we assume) are learnt to be a good feature for prediction of that class (the former prediction is correct, while the latter is incorrect). Graph neural networks (see Section 5.3) can also be used for node classification/type prediction.

Identity-link prediction

Predicting identity links involves searching for nodes that refer to the same entity; this is analogous to the task of entity matching (aka record linkage, deduplication, etc.) considered in more general data integration settings [373]. Such techniques are generally based on two types of matchers: value matchers determine how similar the values of two entities on a given property are, which may involve similarity metrics on strings, numbers, dates, etc.; while context matchers consider the similarity of entities based on various nodes and edges [373]. An illustrative example is given in Figure 35, where value matchers will compute similarity between values such as 7400 and 7500, while context matchers will compute similarity between Easter Island and Rapa Ñui based on their surrounding information, such as their having similar latitudes, longitudes, populations, and the same seat (by way of comparison, a value matcher on this pair of nodes would measure string similarity between “Easter Island” and “Rapa Ñui”).

Identity linking example, where Rapa Ñui and Easter Island refer to the same island

A major challenge in this setting is efficiency, where a pairwise matching would require \(O(n^2)\) comparisons for \(n\) the number of nodes. To address this issue, blocking can be used to group similar entities into (possibly overlapping, possibly disjoint) “blocks” based on similarity-preserving keys, with matching performed within each block [374, 373, 375]; for example, if matching places based on latitude/longitude, blocks may represent geographic regions. An alternative to discrete blocking is to use windowing over entities in a similarity-preserving ordering [375], or to consider searching for similar entities within multi-dimensional spaces (e.g., spacetime [376], spaces with Minkowski distances [377], orthodromic spaces [378], etc. [379]). The results can either be pairs of nodes with a computed confidence of them referring to the same entity, or crisp identity links extracted based on a fixed threshold, binary classification, etc. [373]. For confident identity links, the nodes’ edges may then be consolidated [380]; for example, we may select Easter Island as the canonical node and merge the edges of Rapa Ñui onto it, enabling us to find, e.g., World Heritage Sites in the Pacific Ocean from Figure 35 based on the (consolidated) sub-graph World Heritage Site arrow tip leftward named arrow source Easter Islandocean arrow tip rightward Pacific.

Correction

As opposed to completion – which finds new edges in a knowledge graph – correction identifies and removes existing incorrect edges in the knowledge graph. We here divide the principal approaches for knowledge graph correction into two main lines: fact validation, which assigns a plausibility score to a given edge, typically in reference to external sources; and inconsistency repairs, which aim to resolve inconsistencies found in the knowledge graph through ontological axioms.

Fact validation

The task of fact validation (aka fact checking) [381, 382, 383, 384, 356, 385, 386, 181, 66] involves assigning plausibility or veracity scores to facts/edges, typically between \(0\) and \(1\). An ideal fact-checking function assumes a hypothetical reference universe (an ideal knowledge graph) and would return \(1\) for the fact Santa Lucía arrow source city arrow tip rightward Santiago (being true) while returning \(0\) for SotomayorcitySantiago (being false). There is a clear relation between fact validation and link prediction – with both relying on assessing the plausibility of edges/facts/links – and indeed the same numeric- and symbol-based techniques can be applied for both cases. However, fact validation often considers online assessment of edges given as input, whereas link prediction is often an offline task that generates novel candidate edges to be assessed from the knowledge graph. Furthermore, works on fact validation are characterised by their consideration of external reference sources, which may be unstructured sources [381, 382, 387, 383] or structured sources [384, 385, 386, 181, 66].

Approaches based on unstructured sources assume that they are given a verbalisation function – using, for example, rule-based approaches [388, 389], encoder–decoder architectures [390], etc. – that is able to translate edges into natural language. Thereafter, approaches for computing the plausibility of facts in natural language – called fact finders [391, 392] – can be directly employed. Many fact finding algorithms construct an \(n\)-partite (often bipartite) graph whose nodes are facts and sources, where a source is connected to a fact if the source “evidences” the fact, i.e., if it contains a text snippet that matches – with sufficient confidence – the verbalisation of the input edge. Two mutually-dependent scores, namely the trustworthiness of sources and the plausibility of facts, are then calculated based on this graph, where fact finders differ on how they compute these scores [392]. Here we mention three scores proposed by Jeff Pasternack, et al. [391]

Sums [391] adapts the HITS algorithm [393] by defining sources as hubs (with 0 authority score) and facts as authorities (with 0 hub score).
Average Log [391] extends HITS with a normalisation factor that prevents a single source from receiving a high trustworthiness score by evidencing many facts (that may be false).
Investment [391] lets the scores of facts grow with a non-linear function based on “investments” coming from the connected sources. The score a source receives from a fact is based on the individual facts in this particular source compared to the other connected sources.

Jeff Pasternack, et al. [392] then show that these three algorithms can be generalised into a single multi-layered graph-based framework within which (1) a source can support a fact with a weight expressing uncertainty, (2) similar facts can support each other, and (3) sources can be grouped together leading to an implicit support between sources of the same group. Other approaches for fact checking of knowledge graphs later extended this framework [394, 387]. Alternative approaches based on classifiers have also emerged, where commonly-used features include trust scores for information sources, co-occurrences of facts in sources, and so forth [381, 382].

Approaches for fact validation based on structured data typically assume external knowledge graphs as reference sources and are based on finding paths that evidence the input edge being validated. Unsupervised approaches search for undirected [385, 395] or directed [384] paths up to a given threshold length that evidence the input edge. The relatedness between input edges and paths is computed using a mutual information function, such as normalized pointwise mutual information [396]. Supervised approaches rather extract features for input edges from external knowledge graphs [73, 397, 398] and use these features to train a classification model to label the edges as true or false. An important set of features are metapaths, which encode sequences of predicates that correlate positively with the edge label of the input edge. Amongst such works, PredPath [386] automatically extracts metapaths based on type information. Several approaches rather encode the reference nodes and edges using graph embeddings (see Section 5.2), which are then used to estimate the plausibility of the input edge being validated.

Inconsistency repairs

Ontologies can contain axioms – such as disjointness – that lead to inconsistencies. While such axioms can be provided by experts, they can can also be derived through symbolic learning, as discussed in Section 5.4. Such axioms can then be used to detect inconsistencies. With respect to correcting a knowledge graph, however, detecting inconsistencies is not enough: techniques are also required to repair such inconsistencies, which itself is not a trivial task. In the simplest case, we may have an instance of two disjoint classes, such as that Santiago is of type City and Airport, which are stated or found to be disjoint. To repair the inconsistency, it would be preferable to remove only the “incorrect” class, but which should we remove? This is not a trivial question, particularly if we consider that one edge can be involved in many inconsistencies, and one inconsistency can involve many edges. The issue of computing repairs becomes more complex when entailment is considered, where we not only need to remove the stated type, but also all of the ways in which it might be entailed; for example, removing the edge Santiago arrow source type arrow tip rightward Airport is insufficient if we further have an edge AricaflightSantiago combined with an axiom flightrangeAirport. Gerald Töpper, et al. [217] suggest potential repairs for such violations – remove a domain/range constraint, remove a disjointness constraint, remove a type edge, remove an edge with a domain/range constraint – where one is chosen manually. In contrast, Piero A. Bonatti, et al. [79] propose an automated method to repair inconsistencies based on minimal hitting sets [399], where each set is a minimal explanation for an inconsistency. The edges to remove are chosen based on scores of the trustworthiness of their sources and how many minimal hitting sets they are either elements of or help to entail an element of, where the knowledge graph is revised to avoid re-entailment of the removed edges. Rather than repairing the data, another option is to evaluate queries under inconsistency-aware semantics, such as returning consistent answers valid under every possible repair [400].

Publication

While it may not be desirable to publish, for example, enterprise knowledge graphs that offer a competitive advantage to a company [6], it may be desirable – or even required – to publish other knowledge graphs, such as those produced by volunteers [34, 401, 32], by publicly-funded research [44, 402, 403], by governmental organisations [37, 38], etc. Publishing refers to making the knowledge graph (or part thereof) accessible to the public, often over the Web. Knowledge graphs published as open data are then called open knowledge graphs (discussed in Section 10.1).

In the following, we first discuss two sets of principles that have been proposed to guide the publication of data on the Web. We next discuss access protocols that constitute the interfaces by which the public can interact with the content of a knowledge graph. Finally, we consider techniques to restrict the access or usage of (parts of) a knowledge graph, as appropriate.

Best Practices

We now discuss two key sets of principles for publishing data, namely the FAIR Principles proposed by Mark D. Wilkinson, et al. [404], and the Linked Data Principles proposed by Tim Berners-Lee [109].

FAIR Principles

The FAIR Principles were originally proposed in the context of publishing scientific data [404] – particularly motivated by maximising the impact of publicly-funded research – but the principles generally apply to other situations where data are to be published in a manner that facilitates their re-use by external agents, with particular emphasis on machine-readability.

FAIR itself is an acronym for four foundational principles, each with particular goals [404], that may apply to data, metadata, or both – the latter being denoted (meta)data.³⁰^{note 30} Metadata are data about data. The distinction is often important in observational sciences, where in astronomy, for example, data may include raw image data, while metadata may include the celestial coordinates and time of the image. We now describe the FAIR principles (slightly rephrasing the original wording in some cases for brevity [404]).

Findability refers to the ease with which external agents who might benefit from the dataset can initially locate the dataset. Four sub-goals should be met:
- F1: (meta)data are assigned a globally unique and persistent identifier.
- F2: data are described with rich metadata (see R1).
- F3: metadata clearly and explicitly include the identifier of the data they describe.
- F4: (meta)data are registered or indexed in a searchable resource.
Accessibility refers to the ease with which external agents (once they have located the dataset) can access the dataset. Two goals are defined, the first with two sub-goals:
- A1: (meta)data are retrievable by their identifier using a standard protocol.
  - A1.1: the protocol is open, free, and universally implementable.
  - A1.2: the protocol allows for authentication and authorisation, where necessary.
- A2. metadata are accessible, even when the data are no longer available.
Interoperability refers to the ease with which the dataset can be exploited (in conjunction with other datasets) using standard tools. Three goals are defined:
- I1: (meta)data use an accessible, shared, and general knowledge representation formalism.
- I2: (meta)data use vocabularies that follow FAIR principles.
- I3: (meta)data include qualified references to other (meta)data.
Reusability refers to the ease with which the dataset can be re-used in conjunction with other datasets. One goal is defined (with three sub-goals):
- R1: meta(data) are richly described with a plurality of accurate and relevant attributes.
  - R1.1. (meta)data are released with a clear and accessible data usage license.
  - R1.2. (meta)data are associated with detailed provenance.
  - R1.3. (meta)data meet domain-relevant community standards.

In the context of knowledge graphs, a variety of vocabularies, tools, and services have been proposed that both directly and indirectly help to satisfy the FAIR principles. In terms of Findability, as discussed in Section 2, IRIs are built into the RDF model, providing a general schema for global identifiers. In addition, resources such as the Vocabulary of Interlinked Datasets (VoID) [405] allow for representing meta-data about graphs, while services such as DataHub [406] provide a central repository of such dataset descriptions. Access protocols that enable Accessibility will be discussed in Section ?, while mechanisms for authorisation will be discussed in Section 9.3. With respect to Interoperability, as discussed in Section 4, ontologies serve as a general knowledge representation formalism, and can in turn be used to describe vocabularies that follow FAIR principles. Finally, regarding Reusability, licensing will be discussed in Section 9.3, while the PROV Data Model [120] discussed in Section 3 allows for capturing detailed provenance.

A number of knowledge graphs have been published using FAIR principles, where Mark D. Wilkinson, et al. [404] explicitly mention Open PHACTS [402], a data integration platform for drug discovery, and UniProt [403], a large collection of protein sequence and annotation data, as conforming to FAIR principles. Both datasets offer graph views of their content through the RDF data model.

Linked Data Principles

Mark D. Wilkinson, et al. [404] state that FAIR Principles “precede implementation choices”, meaning that the principles do not cover how they can or should be achieved. Preceding the FAIR Principles by almost a decade are the Linked Data Principles, proposed by Tim Berners-Lee [109], which provide a technical basis for one possible way in which these FAIR Principles can be achieved. Specifically the Linked Data Principles are as follows:

Use IRIs as names for things.
Use HTTP IRIs so those names can be looked up.
When a HTTP IRI is looked up, provide useful content about the entity that the IRI names using standard data formats.
Include links to the IRIs of related entities in the content returned.

These principles were proposed in a Semantic Web setting, where for principle (3), the standards based on RDF (including RDFS, OWL, etc.) are currently recommended for use, particularly because they allow for naming entities using HTTP IRIs, which further paves the way for satisfying all four principles. As such, these principles outline a way in which (RDF) graph-structured data can be published on the Web such that these graphs are interlinked to form what Tim Berners-Lee [109] calls a “Web of Data”, whose goal is to increase automation on the Web by making content available not only in (HTML) documents intended for human consumption, but also as (RDF) structured data that machines can locate, retrieve, combine, validate, reason over, query over, etc., towards solving tasks automatically. Conceptually, the Web of Data is then composed of graphs of data published on individual web-pages, where one can click on a node or edge-label – or more precisely perform a HTTP lookup on an IRI of the graph – to be transported to another graph elsewhere on the Web with relevant content on that node or edge-label, and so on recursively.

In Figure 36, we show a simple example with two Linked Data documents published on the Web, with each containing an RDF graph. As discussed in Section 3.2, terms such as clv:Concert, wd:Q142701, rdfs:label, etc., are abbreviations for IRIs, where, for example, wd:Q142701 expands to http://www.wikidata.org/entity/Q142701. Prefixes beginning with cl are fictitious prefixes we assume to have been created by the Chilean tourist board. The IRIs prefixed with \(\hookrightarrow\) Earth indicate the document returned if the node is looked up. The leftmost document is published by the tourist board and describes Lollapalooza 2018 (identified by the node cle:LP2018), which links to the headlining act Pearl Jam (wd:Q142701) described by an external knowledge graph, namely Wikidata. By looking up the node wd:Q142701 in the leftmost graph, the IRI dereferences (i.e., returns via HTTP) the document with the RDF graph on the right describing that entity in more detail. From the rightmost document, the node wd:Q221535 can be looked up, in turn, to find a graph about Eddie Vedder (not shown in the example). The IRIs for entities and documents are distinguished to ensure that we do not confuse data about the entity and the document; for example, while wd:Q221535 refers to Eddie Vedder, the IRI wdd:Q221535 refers to the document about Eddie Vedder; if we were to assign a last-modified date to the document, we should use wdd:Q221535 not wd:Q221535. In Figure 36, we can further observe that edge labels (which are also IRIs) and nodes representing classes (e.g., clv:Concert) can also be dereferenced, typically returning semantic definitions of the respective terms.

Two example Linked Data documents from two websites, each containing an RDF graph, where wd:Q142701 refers to Pearl Jam in Wikidata while wdd:Q142701 to the RDF graph about Pearl Jam, and where wd:Q221535 refers to Eddie Vedder while wdd:Q221535 refers to the RDF graph about Eddie Vedder; the edge-label wdt:571 refers to “inception” in Wikidata, while wdt:527 refers to “has part” — Two example Linked Data documents from two websites, each containing an RDF graph, where `wd:Q142701` refers to Pearl Jam in Wikidata while `wdd:Q142701` to the RDF graph about Pearl Jam, and where `wd:Q221535` refers to Eddie Vedder while `wdd:Q221535` refers to the RDF graph about Eddie Vedder; the edge-label wdt:571 refers to “inception” in Wikidata, while wdt:527 refers to “has part”

A key challenge is posed by the fourth principle – include links to related entities – as illustrated in Figure 36, where wd:Q221535 in the leftmost graph constitutes a link to related content about Pearl Jam in an external knowledge graph. Specifically, the link discovery task considers adding such links from one knowledge graph to another, which may involve inclusion of IRIs that dereference to external graphs (per Figure 36), or links with special semantics such as identity links. In comparison with the link prediction task discussed in Section 8.1, which is used to complete links within a knowledge graph, link discovery aims to discover links across knowledge graphs, which involves unique aspects: first, link discovery typically considers disjoint sets of source (local) nodes and target (remote) nodes; second, the knowledge graphs may often use different vocabularies; third, while in link prediction there already exist local examples of the links to predict, in link discovery, there are often no existing links between knowledge graphs to learn from. A common technique is to define manually-crafted linkage rules (aka link specifications) that apply heuristics for defining links that potentially incorporate similarity measures [320, 319]. Link discovery is greatly expedited by the provision of standard identifier schemes within knowledge graphs, such as ISBNs for books, alpha-2 and alpha-3 codes for countries (e.g., cl, clp), or even links to common knowledge graphs such as DBpedia [32] or Wikidata [34] (that themselves include standard identifiers). We refer to the survey on link discovery by Markus Nentwig, et al. [370] for more details.

Further guidelines have been proposed that provide finer-grained recommendations for publishing Linked Data, relating to how best to implement dereferencing, what kinds of links to include, how to publish and interlink vocabularies, amongst other considerations [110, 407]. We refer to the book by Tom Heath, et al. [110] for more discussion on how to publish Linked Data on the Web.

Access Protocols

Publishing involves allowing the public to interact with the knowledge graph, which implies the provision of access protocols that define the requests that agents can make and the response that they can expect as a result. Per the Accessibility principle of FAIR (specifically A1.1), this protocol should be open, free, and universally implementable. In the context of knowledge graphs, as shown in Figure 37, there are a number of access protocols to choose from, varying from simple protocols that allow users to simply download all content, towards protocols that accept and evaluate increasingly complex requests. While simpler protocols require less computation on the server that publishes the data, more complex protocols allow agents to request more specific data, thus reducing bandwidth. A knowledge graph may also offer a variety of access protocols catering to different agents with different requirements [408]. We now discuss such access protocols.

Dumps

A dump is a file or collection of files containing the content of the knowledge graph available for download. The request in this case is for the file(s) and the response is the content of the file(s). In order to publish dumps, first of all, concrete – and ideally standard – syntaxes are required to serialise the graph. While for RDF graphs there are various standard syntaxes available based on XML [409], JSON [317], custom syntaxes [410], and more besides, currently there are only non-standard syntaxes available for property graphs [411]. Second, to reduce bandwidth, compression methods can be applied. While standard compression such as GZIP or BZip2 can be straightforwardly applied on any file, custom compression methods have been proposed for graphs that not only offer better compression ratios than these standard methods, but also offer additional functionalities, such as compact indexes for performing efficient lookups once the file is downloaded [412]. Finally, to further reduce bandwidth, when the knowledge graph is updated, “diffs” can be computed and published to obviate the need for agents to download all data from scratch (see [413, 414, 415]). Still, however, dumps are only suited to certain use-cases, in particular for agents that wish to maintain a full local copy of a knowledge graph. If an agent were rather only interested in, for example, all food festivals in Santiago, downloading the entire dump would require transferring and processing a lot of irrelevant data.

Node lookups

Protocols for performing node lookups accept a node (id) request (e.g., cle:LP2018 in Figure 36) and return a (sub-)graph describing that node (e.g., the document cld:LP2018). Such a protocol is the basis for the Linked Data principles outlined previously, where node lookups are implemented through HTTP dereferencing, which further allows nodes in remote graphs to be referenced from across the Web. Although there are varying definitions on what content should be returned for a node [416], a common convention is to return a sub-graph containing either all outgoing edges for that node or all incident edges (both outgoing and incoming) for that node [367]. Though simple, mechanisms for answering graph patterns can be implemented on top of a node lookup interface by traversing from node to node according to the particular graph pattern [417]; for example, to find all food festivals in Santiago – represented by the graph pattern Food Festival arrow tip leftward type arrow source ?fflocation arrow tip rightward Santiago – we may perform a node lookup for Santiago, subsequently performing a node lookup for each node connected by a location edge to Santiago, returning those nodes declared to be of type Food Festival. However, such an approach may not be feasible if no starting node is declared (e.g., if all nodes are variables), if the node lookup service does not return incoming edges, etc. Furthermore, the client agent may need to request more data than necessary, where the document returned for Santiago may return a lot of irrelevant data, and where nodes with a location in Santiago that do not represent instances of Food Festival still need to be looked up to check their type. On the plus side, node lookups are relatively inexpensive for servers to support.

Edge patterns

Edge patterns – also known as triple patterns in the case of directed, edge-labelled graphs – are singleton graph patterns, i.e., graph patterns with a single edge. Examples of edge patterns are ?ff arrow source type arrow tip rightward Food Festival or ?fflocationSantiago, etc., where any term can be a variable or a constant. A protocol for edge patterns accepts such a pattern and returns all solutions for the pattern. Edge patterns provide more flexibility than node lookups, where graph patterns are more readily decomposed into edge patterns than node lookups. With respect to the agent interested in food festivals in Santiago, they can first, for example, request solutions for the edge pattern ?ff arrow source location arrow tip rightward Santiago and locally join/intersect these solutions with those of ?fftypeFood Festival. Given that some edge patterns (e.g., ?x?y?z) can return many solutions, protocols for edge patterns may offer additional practical features such as iteration or pagination over results [418]. Much like node lookups, the server cost of responding to a request is relatively low and easy to predict. However, the server may often need to transfer irrelevant intermediate results to the client, which in the previous example may involve returning nodes located in Santiago that are not food festivals. This issue is further aggravated if the client does not have access to statistics about the knowledge graph in order to plan how to best perform the join; for example, if there are relatively few food festivals but many things located in Santiago, rather than intersecting the solutions of the two aforementioned edge patterns, it should be more efficient to send a request for each food festival to see if it is in Santiago, but deciding this requires statistics about the knowledge graph. Extensions to the edge-pattern protocol have thus been proposed to allow for more efficient joins [419], such as allowing batches of solutions to be sent alongside the edge pattern, returning only solutions compatible with the solutions in the request [420] (e.g., sending a batch of solutions for ?ff arrow source type arrow tip rightward Food Festival to join with the solutions for the request ?fflocationSantiago).

(Complex) graph patterns

Another alternative is to let client agents make requests based on (complex) graph patterns (see Section 2.2), with the server returning (only) the final solutions. In our running example, this involves the client issuing a request for Food Festival arrow tip leftward type arrow source ?fflocation arrow tip rightward Santiago and directly receiving the relevant results. Compared with the previous protocols, this protocol is much more efficient in terms of bandwidth: it allows clients to make more specific requests and the server to return more specific responses. However, this reduction in bandwidth use comes at the cost of the server having to evaluate much more complex requests, where, furthermore, the costs of a single request are much more difficult to anticipate. While a variety of optimised engines exist for evaluating (complex) graph patterns (e.g., [421, 75, 422] amongst many others), the problem of evaluating such queries is known to be intractable [20]. Perhaps for this reason, public services offering such a protocol (most often supporting SPARQL queries [86]) have been found to often exhibit downtimes, timeouts, partial results, slow performance, etc. [423]. Even considering such issues, however, popular services continue to receive – and successfully evaluate – millions of requests/queries per day [424, 425], with difficult (worst-case) instances being rare in practice [426].

Other protocols

While Figure 37 makes explicit reference to some of the most commonly-encountered access protocols found for knowledge graphs in practice, one may of course imagine other protocols lying almost anywhere on the spectrum from more simple to more complex interfaces. To the right of (Complex) Graph Patterns, one could consider supporting even more complex requests, such as queries with entailments [427], queries that allow recursion [163], federated queries that can join results from remote services [428], or even (hypothetically) supporting Turing-complete requests that allow running arbitrary procedural code on a knowledge graph. As mentioned at the outset, a server may also choose to support multiple, complementary protocols [408].

Usage Control

Considering our hypothetical tourism knowledge graph, at first glance, one might assume that the knowledge required to deliver the envisaged services is public and thus can be used both by the tourism board and the tourists. On closer inspection, however, we may see the need for usage control in various forms:

both the tourist board and its partners should associate an appropriate license with knowledge that they contribute to the knowledge graph, such that the terms of use are clear to all parties;
a tourist might opt to install an app on their mobile phone that could be used to recommend tourist attractions based on their location, bringing with it potential privacy concerns;
the tourist board may be required to report criminal activities to the police services and thus may need to encrypt personal information; and
the tourist board could potentially share information relating to tourism demographics in an anonymous format to allow for improving transport infrastructure on strategic routes.

Thus in this section, we examine the state of the art in terms of knowledge graph licensing, usage policies, encryption, and anonymisation.

Licensing

When it comes to associating machine readable licenses with knowledge graphs, the W3C Open Digital Rights Language (ODRL) [429] provides an information model and related vocabularies that can be used to specify permissions, duties, and prohibitions with respect to actions relating to assets. ODRL supports fine-grained descriptions of digital rights that are represented as – and thus can be embedded within – graphs. Figure 38 illustrates a license granting the assignee the permission to Modify, Distribute, and Derive work from the EventGraph (e.g., Figure 1); however the assignee is obliged to Attribute the copyright holder. From a modelling perspective, ODRL can be used to model several well known license families, for instance Apache, Creative Commons (CC), and Berkeley Software Distribution (BSD), to name but a few [430, 431]. Additionally, Elena Cabrio, et al. [430] propose methods to automatically extract machine-readable licenses from unstructured text. From a reasoning perspective, license compatibility validation and composition techniques [432, 433, 434] can be used to combine knowledge graphs that are governed by different licenses. Such techniques are employed by the the Data Licenses Clearance Center (DALICC), which includes a library of standard machine readable licenses, and tools that enable users both to compose arbitrary custom licenses and also to verify the compatibility of different licenses [435].

Associating licenses with event data, along with permissions, actions, and obligations

Usage policies

Access control policies based on edge patterns can be used to restrict access to parts of a knowledge graph [436, 437, 438]. WebAccessControl (WAC)³¹^{note 31} WAC, http://www.w3.org/wiki/WebAccessControl is an access control framework for graphs that uses WebID for authentication and provides a vocabulary for specifying access control policies. Extensions of this WAC vocabulary have been proposed to capture privacy preferences [439] and to cater for contextual constraints [440, 441]. Although ODRL is primarily used to specify licenses, profiles to specify access policies [442] and regulatory obligations [443, 444] have also been proposed in recent years, as discussed in the survey by Sabrina Kirrane, et al. [445].

As a generalisation of access policies, usage policies specify how data can be used: what kinds of processing can be applied, by whom, for what purpose, etc. The example usage policy presented in Figure 39 states that the process Analyse of LocationGraph can be performed on InternalServers by members of CompanyStaff in order to provide EventRecommendations. Vocabularies for usage policies have been proposed by the SPECIAL H2020 project [446] and the W3C Data Privacy Vocabularies and Controls Community Group (DPVCG) [447, 448]. Once specified, usage policies can then be used to verify that data processing conforms to legal norms and the consent provided by subjects [449, 448].

A policy for the usage of a sub-graph of location data in the knowledge graph

Encryption

Rather than internally controlling usage, the tourist board could use encryption mechanisms on parts of the published knowledge graph, for example relating to reports of crimes, and provide keys to partners who should have access to the plaintext. While a straightforward approach is to encrypt the entire graph (or sub-graphs) with one key, more fine-grained encryption can be performed for individual nodes or edge-labels in a graph, potentially providing different clients access to different information through different keys [450]. The CryptOntology [451] can further be used to embed details about the encryption mechanism used within the knowledge graph. Figure 40 illustrates how this could be used to encrypt the names of claimants from Figure 34, storing the ciphertext zhk…kjg, as well as the key-length and encryption algorithm used. In order to grant access to the plaintext, one approach is to encrypt individual edges with symmetric keys so as to allow specific types of edge patterns to only be executed by clients with the appropriate key [452]. This approach can be used, for example, to allow clients who know a claimant ID (e.g., Claimant-XY12SDA) and have the appropriate key to find (only) the name of the claimant through an edge pattern Claimant-XY12SDA arrow source Claimant-name arrow tip rightward ?name. A key limitation of this approach, however, is that it requires attempting to decrypt all edges to find all possible solutions. A more efficient alternative is to combine functional encryption and specialised indexing to retrieve solutions from the encrypted graph without attempting to decrypt all edges [453].

Directed edge-labelled graph with the name of the claimant encrypted; plaintext elements are dashed and may be omitted from published data (possibly along with encryption details)

Anonymisation

Consider that the tourist board acquires information on transport taken by individuals within the country, which can be used to understand trajectories taken by tourists. However, from a data-protection perspective, it would be advisable to remove any personal data from the knowledge graph to avoid leaks of information about each individual’s travel.

A first approach to anonymisation is to suppress and generalise knowledge in a graph such that individuals cannot be identified, based on \(k\)-anonymity [454]³²^{note 32} \(k\)-anonymity guarantees that the data of an individual is indistinguishable from at least \(k-1\) other individuals., \(l\)-diversity [455]³³^{note 33} \(l\)-diversity guarantees that sensitive data fields have at least \(l\) diverse values within each group of individuals; this avoids leaks such as that all tourists from Austria (a group of individuals) in the data have been pick-pocketed (a sensitive attribute)., etc. Approaches to apply \(k\)-anonymity on graphs identify and suppress “quasi-identifiers” that would allow a given individual to be distinguished from fewer than \(k-1\) other individuals [456, 457]. Figure 41 illustrates a possible result of \(k\)-anonymisation for a sub-graph describing a flight passenger, where quasi-identifiers (passport, plane ticket) have been converted into blank nodes, ensuring that the passenger (the dashed blank node) cannot be distinguished from \(k-1\) other individuals. In the context of a graph, however, neighbourhood attacks [458] – using information about neighbours – can also break \(k\)-anonymity, where we also suppress the day and time of the flight, which, though not sensitive information per se, could otherwise break \(k\)-anonymity for passengers (if, for example, a particular flight had fewer than \(k\) males from the U.S. onboard).

Anonymised sample of a directed edge-labelled graph describing a passenger (dashed) of a flight

More complex neighbourhood attacks may rely on more abstract graph patterns, observing that individuals can be deanonymised purely from knowledge of the graph structure, even if all nodes and edge labels are left blank; for example, if we know that a team of \(k-1\) players take flights together for a particular number of away games, we could use this information for a neighbourhood attack that reveals the set of players in the graph. Hence a number of guarantees specific to graphs have been proposed, including \(k\)-degree anonymity [459], which ensures that individuals cannot be deanonymised by attackers with knowledge of the degree of particular individuals. The approach is based on minimally modifying the graph to ensure that each node has at least \(k-1\) other nodes with the same degree. A stronger guarantee, called \(k\)-isomorphic neighbour anonymity [460], avoids neighbourhood attacks where an attacker knows how an individual is connected to nodes in their neighbourhood; this is done by modifying the graph to ensure that for each node, there exist at least \(k-1\) nodes with isomorphic (i.e., identically structured) neighbourhoods elsewhere in the graph. Both approaches only protect against attackers with knowledge of bounded neighbourhoods. An even stronger notion is that of \(k\)-automorphism [461], which ensures that for every node, it is structurally indistinguishable from \(k-1\) other nodes, thus avoiding any attack based on structural information (as a trivial example, a \(k\)-clique or a \(k\)-cycle satisfy \(k\)-automorphism). Many of these techniques for anonymisation of graph data were originally motivated by social networks [462], though they can also be applied to knowledge graphs, per the work of Zhiyuan Lin, et al. [463], who adapt \(k\)-automorphism for directed edge-labelled graphs (specifically RDF graphs).

While the aforementioned approaches anonymise data, a second approach is to apply anonymisation when answering queries, such as adding noise to the solutions in a way that preserves privacy. One approach is to apply \(\varepsilon\)-differential privacy [464]³⁴^{note 34} \(\varepsilon\)-differential privacy ensures that the probability of achieving a given result from some process (e.g., query) applied to data, to which random noise is added, differs no more than \(e^\varepsilon\) when the data includes or excludes any individual. for querying graphs [465]. Such mechanisms are typically used for aggregate (e.g., count) queries, where noise is added to avoid leaks about individuals. To illustrate, differential privacy may allow for counting the number of passengers of specified nationalities taking specified flights, adding (just enough) random noise to the count to ensure that we cannot tell, within a certain probability (controlled by \(\varepsilon\)), whether or not a particular individual took a flight, where we would require (proportionally) less noise for common nationalities, but more noise to “hide” individuals from more uncommon nationalities.

These approaches require information loss for stronger guarantees of privacy; which to choose is thus heavily application dependent. If the anonymised data are to be published in their entirety “dump”, then an approach based on \(k\)-anonymity can be used to protect individuals, while \(l\)-diversity can be used to protect groups. On the other hand, if the data are to be made available, in part, through a query interface, then \(\varepsilon\)-differential privacy is a more suitable framework.

Knowledge Graphs in Practice

In this section, we discuss some of the most prominent knowledge graphs that have emerged in the past years. We begin by discussing open knowledge graphs, which have been published on the Web per the guidelines and protocols described in Section 9. We later discuss enterprise knowledge graphs that have been created by companies for a diverse range of applications.

Open Knowledge Graphs

By open knowledge graphs, we specifically refer to knowledge graphs published under the Open Data philosophy, namely that “open means anyone can freely access, use, modify, and share for any purpose (subject, at most, to requirements that preserve provenance and openness)”.³⁵^{note 35} See http://opendefinition.org/ Many open knowledge graphs have been published in the form of Linked Open Datasets [110], which are (RDF) graphs published under the Linked Data principles (see Section 9.1.2) following the Open Data philosophy. Many of the most prominent open knowledge graphs – including DBpedia [32], YAGO [466], Freebase [33], and Wikidata [34] – cover multiple domains, representing a broad diversity of entities and relationships; we first discuss these in turn. Later we discuss some of the other (specific) domains for which open knowledge graphs are currently available. Most of the open knowledge graphs we discuss in this section are modelled in RDF, published following Linked Data principles, and offer access to their data through dumps (RDF), node lookups (Linked Data), graph patterns (SPARQL) and, in some cases, edge patterns (Triple Pattern Fragments).

DBpedia

The DBpedia project was developed to extract a graph-structured representation of the semi-structured data embedded in Wikipedia articles [467], enabling the integration, processing, and querying of these data in a unified manner. The resulting knowledge graph is further enriched by linking to external open resources, including images, webpages, and external datasets such as DailyMed, DrugBank, GeoNames, MusicBrainz, New York Times, and WordNet [32]. The DBpedia extraction framework consists of several components, corresponding to abstractions of Wikipedia article sources, graph storage and serialisation destinations, wiki-markup extractors, parsers, and extraction managers [468]. Specific extractors are designed to process labels, abstracts, interlanguage links, images, redirects, disambiguation pages, external links, internal pagelinks, homepages, categories, and geocoordinates. The content in the DBpedia knowledge graph is not only multidomain, but also multilingual: as of 2012, DBpedia contained labels and abstracts in up to 97 different languages [469]. Entities within DBpedia are classified using four different schemata in order to address varying application requirements [468]. These schemata include a Simple Knowledge Organization System (SKOS) representation of Wikipedia categories, a Yet Another Great Ontology (YAGO) classification schema (discussed in the following), an Upper Mapping and Binding Exchange Layer (UMBEL) ontology categorisation schema, and a custom schema called the DBpedia ontology with classes such as Person, Place, Organisation, and Work [32]. DBpedia also supports live synchronisation in order to remain consistent with dynamic Wikipedia article [32].

Yet Another Great Ontology

YAGO likewise extracts graph-structured data from Wikipedia, which are then unified with the hierarchical structure of WordNet to create a “light-weight and extensible ontology with high quality and coverage” [466]. This knowledge graph aims to be applied for various information technology tasks, such as machine translation, word sense disambiguation, query expansion, document classification, data cleaning, information integration, etc. While earlier approaches automatically extracted structured knowledge from text using pattern matching, natural language processing (NLP), and statistical learning, the resulting content tended to lack in quality when compared with what was possible through manual construction [466]. However, manual construction is costly, making it challenging to achieve broad coverage and keep the data up-to-date. In order to extract data with high coverage and quality, YAGO (like DBpedia) mostly extracts data from Wikipedia infoboxes and category pages, which contain basic entity information and lists of articles for a specific category, respectively; these, in turn are unified with hierarchical concepts from WordNet [304]. A schema – called the YAGO model – provides a vocabulary defined in RDFS; this model allows for representing words as entities, capturing synonymy and ambiguity [466]. The model further supports reification, \(n\)-ary relations, and data types [304]. Refinement mechanisms employed within YAGO include canonicalisation, where each edge and node is mapped to a unique identifier and duplicate elements are removed, and type checking, where nodes that cannot be assigned to a class by deductive or inductive methods are eliminated [304]. YAGO would be extended in later years to support spatio-temporal context [35] and multilingual Wikipedias [401].

Freebase

Freebase was a general collection of human knowledge that aimed to address some of the large scale information integration problems associated with the decentralised nature of the Semantic Web, such as uneven adoption, implementation challenges, and distributed query performance limitations [470]. Unlike DBpedia and YAGO – which are mostly extracted from Wikipedia/WordNet – Freebase solicited contributions directly from human editors. Included in the Freebase platform were a scalable data store with versioning mechanisms; a large data object store (LOB) for the storage of text, image, and media files; an API that could be queried using the Metaweb Query Language (MQL); a Web user interface; and a lightweight typing system [470]. The latter typing system was designed to support collaborative processes. Rather than forcing ontological correctness or logical consistency, the system was implemented as a loose collection of structuring mechanisms – based on datatypes, semantic classes, properties, schema definitions, etc. – that allowed for incompatible types and properties to coexist simultaneously [470]. Content could be added to Freebase interactively through the Web user interface or in an automated way by leveraging the API’s write functionality. Freebase had been acquired by Google in 2010, where the content of Freebase formed an important part of the Google Knowledge Graph announced in 2012 [2]. When Freebase became read-only as of March 2015, the knowledge graph contained over three billion edges. Much of this content was subsequently migrated to Wikidata [223].

Wikidata

As exploited by DBpedia and YAGO, Wikipedia contains a wealth of semi-structured data embedded in info-boxes, lists, tables, etc. However, these data have traditionally been curated and updated manually across different articles and languages; for example, a goal scored by a Chilean football player may require manual updates in the player’s article, the tournament article, the team article, lists of top scorers, and so forth, across hundreds of language versions. Manual curation has led to a variety of data quality issues, including contradictory data in different articles, languages, etc. The Wikimedia Foundation thus proposed Wikidata as a centralised, collaboratively-edited knowledge graph to supply Wikipedia – and arbitrary other clients – with data. Under this vision, a fact could be added to Wikidata once, triggering the automatic update of potentially multitudinous articles in Wikipedia across different languages [34]. Like Wikipedia, Wikidata is also considered a secondary source containing claims that should reference primary sources, though claims can also be initially added without reference [471]. Wikidata further allows for different viewpoints in terms of potentially contradictory (referenced) claims [34]. Wikidata is multilingual, where nodes and edges are assigned language-agnostic Qxx and Pxx codes (see Figure 36) and are subsequently associated with labels, aliases, and descriptions in various languages [363], allowing claims to be surfaced in these languages. Collaborative editing is not only permitted on the data level, but also on the schema level, allowing users to add or modify lightweight semantic axioms [472] – including sub-classes, sub-properties, inverse properties, etc. – as well as shapes [473]. Wikidata offers various access protocols [424] and has received broad adoption, being used by Wikipedia to generate infoboxes in certain domains [474], being supported by Google [223], and having been used as a data source for prominent applications such as Apple’s Siri, amongst others [424].

Other open cross-domain knowledge graphs

A number of other cross-domain knowledge graphs have been developed down through the years. BabelNet [238] – in a similar fashion to YAGO – is based on unifying WordNet and Wikipedia, but with the integration of additional knowledge graphs such as Wikidata, and a focus on creating a knowledge graph of multilingual lexical forms (organized into multilingual synsets) by transforming lexicographic resources such as Wiktionary and OmegaWiki into knowledge graphs. Compared to other knowledge graphs, lexicalized knowledge graphs such as BabelNet bring together the encyclopedic information found in Wikipedia with the lexicographic information usually found in monolingual and bilingual dictionaries. The Cyc project [475] aims to encode common-sense knowledge in a machine-readable way, where over 900 person-years of effort [476] have, since 1986, gone into the creation of 2.2 million facts and rules. Though Cyc is proprietary, an open subset called OpenCyc has been published, where we refer to the comparison by Michael Färber, et al. [477] of DBpedia, Freebase, OpenCyc, and YAGO for further details. The Never Ending Language Learning (NELL) project [273] has, since 2010, extracted a graph of 120 million edges from the text of web pages using OIE methods (see Section 6). Each such open knowledge graph applies different combinations of the languages and techniques discussed in this paper over different sources with differing results.

Domain-specific open knowledge graphs

Open knowledge graphs have been published in a variety of specific domains. Max Schmachtenberg, et al. [478] identify the most prominent domains in the context of Linked Data as follows: media, relating to news, television, radio, etc. (e.g., the BBC World Service Archive [36]); government, relating to the publication of data for transparency and development (e.g., by the U.S. [37] and U.K. [38] governments); publications, relating to academic literature in various disciplines (e.g., OpenCitations [479], SciGraph [480], Microsoft Academic Knowledge Graph [481]); geographic, relating to places and regions of interest (e.g., LinkedGeoData [39]); life sciences, relating to proteins, genes, drugs, diseases, etc. (e.g., Bio2RDF [44]); and user-generated content, relating to reviews, open source projects, etc. (e.g., Revyu [482]). Open knowledge graphs have also been published in other domains, including cultural heritage [483], music [484], law [485], theology [486], and even tourism [40, 41, 42, 43]. The envisaged applications for such knowledge graphs are as varied as the domains from which they emanate, but often relate to integration [484, 44], recommendation [484, 40], transparency [37, 38], archiving [483, 36], decentralisation [482], multilingual support [486], regulatory compliance [485], etc.

Enterprise Knowledge Graphs

A variety of companies have announced the creation of proprietary “enterprise knowledge graphs” with a variety of goals in mind, which include: improving search capabilities [2, 9, 4, 3, 10], providing user recommendations [3, 10], implementing conversational/personal agents [5], enhancing targetted advertising [8], empowering business analytics [8], connecting users [8, 6], extending multilingual support [8], facilitating research and discovery [487], assessing and mitigating risk [51, 52], tracking news events [48], and increasing transport automation [53], amongst (many) others. Though highly diverse, these enterprise knowledge graphs do follow some high-level trends, as reflected in the discussion by Natasha Noy, et al. [6]: (1) data are typically integrated into the knowledge graph from a variety of both external and internal sources (often involving text); (2) the enterprise knowledge graph is often very large, with millions or even billions of nodes and edges, posing challenges in terms of scalability; (3) refinement of the initial knowledge graph – adding new links, consolidating duplicate entities, etc. – is important to improve quality; (4) techniques to keep the knowledge graph up-to-date with the domain are often crucial; (5) a mix of ontological and machine learning representations are often combined or used in different situations in order to draw conclusions from the enterprise knowledge graph; (6) the ontologies used tend to be lightweight, often simple taxonomies representing a hierarchy of classes or concepts.

We now discuss the main industries in which enterprise knowledge graphs have been deployed.

Web search

Web search engines have traditionally focused on matching a query string with sub-strings in web documents. The Google Knowledge Graph [2, 6] rather promoted a paradigm of “things not strings” – analogous to semantic search [488] – where the search engine would now try to identify the entities that a particular search may be expressing interest in. The knowledge graph itself describes these entities and how they interrelate. One of the main user-facing applications of the Google Knowledge Graph is the “Knowledge Panel”, which presents a pane on the right-hand side of (some) search results describing the principal entity that the search appears to be seeking, including some images, attribute–value pairs, and a list of related entities that users also search for. The Google Knowledge Graph was key to popularising the modern usage of the phrase “knowledge graph” (see Appendix A). Other major search engines, such as Microsoft Bing³⁶^{note 36} Microsoft’s Knowledge Graph was previously called “Satori” (meaning understanding in Japanese). [9], would later announce knowledge graphs along similar lines.

Commerce

Enterprise knowledge graphs have also been announced by companies that are principally concerned with selling or renting goods and services. A prominent example of such a knowledge graph is that used by Amazon [4, 45], which describes the products on sale in their online marketplace. One of the main stated goals of this knowledge graph is to enable more advanced (semantic) search features for products, as well as to improve product recommendations to users of its online marketplace. Another knowledge graph for commerce was announced by eBay [5], which encodes product descriptions and shopping behaviour patterns, and is used to power conversational agents that aid users to find relevant products through a natural language interface. Airbnb [3] have also described a knowledge graph that encodes accommodation for rent, places, events, experiences, neighbourhoods, users, tags, etc., on top of which a taxonomic schema is defined. This knowledge graph is used to offer potential clients recommendations of attractions, events, and activities available in the neighbourhood of a particular home for rent. Uber [10] have similarly announced a knowledge graph focused on food and restaurants for their “Uber Eats” delivery service. The goals are again to offer semantic search features and recommendations to users who are uncertain precisely what kind of food they are looking for.

Enterprise knowledge graphs have also emerged in the context of social networking services. Facebook [6] have gathered together a knowledge graph describing not only social data about users, but also the entities they are interested in, including celebrities, places, movies, music, etc., in order to connect people, understand their interests, and provide recommendations. LinkedIn [8] announced a knowledge graph containing users, jobs, skills, companies, places, schools, etc., on top of which a taxonomic schema is defined. The knowledge graph is used to provide multilingual translations of important concepts, to improve targetted advertising, to provide advanced features for job search and people search, and likewise to provide recommendations matching jobs to people (and vice versa). Another knowledge graph has been created by Pinterest [489], describing users and their interests, the latter being organised into a taxonomy. The main use-cases for the knowledge graph are to aid users to more easily find content of interest to them, as well as to enhance revenue through targetted advertisements.

Finance

The financial sector has also seen deployment of enterprise knowledge graphs. Amongst these, Bloomberg [48] has proposed a knowledge graph that powers financial data analytics, including sentiment analysis for companies based on current news reports and tweets, a question answering service, as well as detecting emerging events that may affect stock values. Thompson Reuters (Refinitiv) [51] have likewise announced a knowledge graph encoding “the financial ecosystem” of people, organisations, equity instruments, industry classifications, joint ventures and alliances, supply chains, etc., using a taxonomic schema to organise these entities. Some of the applications they mention for the knowledge graph include supply chain monitoring, risk assessment, and investment research. Knowledge graphs have also been explored in academic settings with Banca d’Italia [47], using rule-based reasoning to determine, for example, the percentage of ownership of a company by various stakeholders. Other companies exploring financial knowledge graphs include Accenture [46], Capital One [49], Wells Fargo [50], amongst others.

Other industries

Enterprises have also been actively developing knowledge graphs to enable novel applications in a variety of other industries, including: health-care, where IBM are exploring use-cases for drug discovery [6] and information extraction from package inserts [490], while AstraZeneca [487] are using a knowledge graph to advance genomics research and disease understanding; transport, where Bosch are exploring a knowledge graph of scenes and locations for driving automation [53]; oil & gas, where Maana [52] are using knowledge graphs to perform data integration for risk mitigation regarding oil wells and drilling; and more besides.

Summary and Conclusion

We have provided a comprehensive introduction to knowledge graphs, which have been receiving more and more attention in recent years. Under the definition of a knowledge graph as a graph of data intended to accumulate and convey knowledge of the real world, whose nodes represent entities of interest and whose edges represent relations between these entities, we have discussed models by which data can be structured as graphs; representations of schema, identity and context; techniques for leveraging deductive and inductive knowledge; methods for the creation, enrichment, quality assessment and refinement of knowledge graphs; principles and standards for publishing knowledge graphs; and finally, the adoption of knowledge graphs in the real world.

Future directions. Research on knowledge graphs can become a confluence of techniques arising from different areas with the common objective of maximising the knowledge – and thus value – that can be distilled from diverse sources at large scale using a graph-based data abstraction [491]. Pursuing this objective will benefit from expertise on graph databases, knowledge representation, logic, machine learning, graph algorithms and theory, ontology engineering, data quality, natural language processing, information extraction, privacy and security, and more besides.

While advances in these individual disciplines are sure to continue and to generate further impact, particularly interesting topics arise also from their intersections. In the intersection of data graphs and deductive knowledge, we emphasise emerging topics such as formal semantics for property graphs, with languages that can take into account the meaning of labels and property–value pairs on nodes and edges [492]; and reasoning and querying over contextual data, in order to derive conclusions and results valid in a particular setting [127, 80, 128]. In the intersection of data graphs and inductive knowledge, we highlight topics such as similarity-based query relaxation, allowing to find approximate answers to exact queries based on numerical representations (e.g., embeddings) [168]; shape induction, in order to extract and formalise inherent patterns in the knowledge graph as constraints [493]; and contextual knowledge graph embeddings that provide numeric representations of nodes and edges that vary with time, place, etc. [57]. Finally, in the intersection of deductive and inductive knowledge, we mention the topics of entailment-aware knowledge graph embeddings [191, 193], that incorporate rules and/or ontologies when computing plausibility; expressive graph neural networks proven capable of complex classification analogous to expressive ontology languages [494]; as well as further advances on rule and axiom mining, allowing to extract symbolic, deductive representations from the knowledge graphs [206, 219].

Aside from specific topics, more general challenges for knowledge graphs include scalability, particularly for deductive and inductive reasoning; quality, not only in terms of data, but also the models induced from knowledge graphs; diversity, such as managing contextual or multi-modal data; dynamicity, considering temporal or streaming data; and finally usability, which is key to increasing adoption. Though techniques are continuously being proposed to address precisely these challenges, they are unlikely to ever be completely “solved”; rather they serve as dimensions along which knowledge graphs, and their techniques, tools, etc., will continue to mature.

Given the availability of open knowledge graphs whose quality continue to improve, as well as the growing adoption of enterprise knowledge graphs in various industries, future research on knowledge graphs has the potential to foster key advancements in broad aspects of society. Here we have highlighted just some examples of future research directions of importance to this pursuit.

Acknowledgements

We thank the attendees of the Dagstuhl Seminar on “Knowledge Graphs” for discussions that inspired and influenced this paper, and all those that make such seminars possible. We would also like to thank Matteo Palmonari for feedback on Figures 3 and 4, as well as Stefan Decker and Carlos Bobed who provided suggestions for the paper. Hogan was supported by Fondecyt Grant No. 1181896. Hogan and Gutierrez were funded by ANID – Millennium Science Initiative Program – Code ICN17_002. Cochez did part of the work while employed at Fraunhofer FIT, Germany and was later partially funded by Elsevier’s Discovery Lab. Kirrane, Ngonga Ngomo, Polleres and Staab received funding through the project “KnowGraphs” from the European Union’s Horizon programme under the Marie Skłodowska-Curie grant agreement No. 860801. Kirrane and Polleres were supported by the European Union’s Horizon 2020 research and innovation programme under grant 731601. Labra was supported by the Spanish Ministry of Economy and Competitiveness (Society challenges: TIN2017-88877-R). Navigli was supported by the MOUSSE ERC Grant No. 726487 under the European Union’s Horizon 2020 research and innovation programme. Rashid was supported by IBM Research AI through the AI Horizons Network. Schmelzeisen was supported by the German Research Foundation (DFG) grant STA 572/18-1.

\(\mathbf{h}_x^{(t)} \coloneqq\)	\(\sum_{y \in \textrm{N}(x)} f_\mathbf{w}(\mathbf{n}_{x},\mathbf{n}_{y},\mathbf{a}_{yx},\mathbf{h}_{y}^{(t-1)})\)
\(\mathbf{o}_x^{(t)} \coloneqq\)	\(g_{\mathbf{w}'}(\mathbf{h}_x^{(t)},\mathbf{n}_x)\)
\(\mathbf{h}_1^{(t)} \coloneqq\)	\(f_\mathbf{w}(\mathbf{n}_{1},\mathbf{n}_{3},\mathbf{a}_{31},\mathbf{h}_{3}^{(t-1)})\)
	\(+ f_\mathbf{w}(\mathbf{n}_{1},\mathbf{n}_{4},\mathbf{a}_{41},\mathbf{h}_{4}^{(t-1)})\)
\(\mathbf{o}_1^{(t)} \coloneqq\)	\(g_{\mathbf{w}'}(\mathbf{h}_1^{(t)},\mathbf{n}_1)\)
\(\ldots\)

Abstract

Introduction

Data Graphs

Models

Directed edge-labelled graphs

Heterogeneous graphs

Property graphs

Graph dataset

Other graph data models

Graph stores

Querying

Graph patterns

Complex graph patterns

Navigational graph patterns

Other features

Schema, Identity, Context

Schema

Semantic schema

Validating schema

Emergent schema

Identity

Persistent identifiers

External identity links

Datatypes

Lexicalisation

Existential nodes

Context

Direct representation

Reification

Higher-arity representation

Annotations

Other contextual frameworks

Deductive Knowledge

Ontologies

Interpretations

Individuals

Properties

Classes

Other features

Semantics and Entailment

Model-theoretic semantics

Entailment

If–then vs. if-and-only-if semantics

Reasoning

Rules

Description Logics

Inductive Knowledge

Graph Analytics

Techniques

Frameworks

Analytics on data graphs

Analytics with queries

Analytics with entailment

Knowledge Graph Embeddings

Translational models

Tensor decomposition models

Neural models

Language models

Entailment-aware models

Graph Neural Networks

Recursive graph neural networks

Convolutional graph neural networks

Symbolic Learning

Rule mining

Axiom mining

Creation and Enrichment

Human Collaboration

Text Sources

Pre-processing

Named Entity Recognition (NER)

Entity Linking (EL)

Relation Extraction (RE)

Joint tasks

Markup Sources

Wrapper-based extraction

Web table extraction

Deep Web crawling

Structured Sources

Mapping from tables

Mapping from trees