Knowledge Graphs

Inductive Knowledge

While deductive knowledge is characterised by precise logical consequences, inductively acquiring knowledge involves generalising patterns from a given set of input observations, which can then be used to generate novel but potentially imprecise predictions. For example, from a large data graph with geographical and flight information, we may observe the pattern that almost all capital cities of countries have international airports serving them, and hence predict that if Santiago is a capital city, it likely has an international airport serving it. However, the predictions drawn from this pattern do not hold with certainty: Vaduz, for example, the capital city of Liechtenstein, has no (international) airport serving it. Hence predictions will often be associated with a level of confidence; for example, we may say that a capital has an international airport in \(\frac{187}{195}\) of cases, offering a confidence of \(0.959\) for predictions made with that pattern. We then refer to knowledge acquired inductively as inductive knowledge, which includes both the models used to encode patterns, as well as the predictions made by those models. Though fallible, inductive knowledge can be highly valuable.

Conceptual overview of popular inductive techniques for knowledge graphs in terms of type of representation generated (Numeric/Symbolic) and type of paradigm used (Unsupervised/Self-supervised/Supervised)

In Figure 5.1 we provide an overview of the inductive techniques typically applied to knowledge graphs. In the case of unsupervised methods, there is a rich body of work on graph analytics, which uses well-known functions/algorithms to detect communities or clusters, find central nodes and edges, etc., in a graph. Alternatively, knowledge graph embeddings can use self-supervision to learn a low-dimensional numeric model of a knowledge graph that (typically) maps input edges to an output plausibility score indicating the likelihood of the edge being true. The structure of graphs can also be directly leveraged for supervised learning, as explored in the context of graph neural networks. Finally, while the aforementioned techniques learn numerical models, symbolic learning can learn symbolic models – i.e., logical formulae in the form of rules or axioms – from a graph in a self-supervised manner. We now discuss each of the aforementioned techniques in turn.

Graph Analytics

Analytics is the process of discovering, interpreting, and communicating meaningful patterns inherent to (typically large) data collections. Graph analytics is then the application of analytical processes to (typically large) graph data. The nature of graphs naturally lends itself to certain types of analytics that derive conclusions about nodes and edges based on the topology of the graph, i.e., how the nodes of the graph are connected. Graph analytics draws upon techniques from related areas, such as graph theory and network analysis, which have been used to study graphs representing social networks, the Web, internet routing, transport networks, ecosystems, protein–protein interactions, linguistic cooccurrences, and more besides [Estrada, 2011].

Returning to the domain of our running example, the tourism board could use graph analytics to extract knowledge about, for instance: key transport hubs that serve many tourist attractions (centrality); groupings of attractions visited by the same tourists (community detection); attractions that may become unreachable in the event of strikes or other route failures (connectivity); or pairs of attractions that are similar to each other (node similarity). Given that such analytics will require a complex, large-scale graph, for the purposes of illustration, in Figure 5.2 we present a more concise example of some transportation connections in Chile directed towards popular tourist destinations. We first introduce a selection of key techniques that can be applied for graph analytics. We then discuss frameworks and languages that can be used to compute such analytics in practice. Given that many traditional graph algorithms are defined for unlabelled graphs, we then describe ways in which analytics can be applied over directed edge-labelled graphs. Finally we discuss the potential connections between graph analytics and querying and reasoning.

Data graph representing transport routes in Chile

Techniques

A wide variety of techniques can be applied for graph analytics. In the following we will enumerate some of the main techniques – as recognised, for example, by the survey of Iosup et al. [2016] – that can be invoked in this setting.

While the previous techniques accept a graph alone as input, other forms of graph analytics may further accept a node, a pair of nodes, etc., along with the graph. (Note 17: Node similarity can be run over an entire graph to find the \(k\) most similar nodes for each node, or can also be run for a specific node to find its most similar nodes. There are also measures for graph similarity – based on, e.g., frequent itemsets [Maillot and Bobed, 2018] – that accept multiple graphs as input.)

Most of the aforementioned techniques for graph analytics were originally proposed and studied for simple graphs or directed graphs without edge labels. We will discuss their application to more complex graph models – and how they can be combined with other techniques such as reasoning and querying – later in Section 5.1.3.

Frameworks

Various frameworks have been proposed for large-scale graph analytics, often in a distributed (cluster) setting. Amongst these we can mention Apache Spark (GraphX) [Xin et al., 2013a, Dave et al., 2016], GraphLab [Low et al., 2012], Pregel [Malewicz et al., 2010], Signal/Collect [Stutz et al., 2016], Shark [Xin et al., 2013b], etc. These graph parallel frameworks apply a systolic abstraction [Kung, 1982] based on a directed graph, where nodes are seen as processors that can send messages to other nodes along edges. Computation is then iterative, where in each iteration, each node reads messages received through inward edges (and possibly its own previous state), performs a computation, and then sends messages through outward edges based on the result. These frameworks then define the systolic computational abstraction on top of the data graph being processed: nodes and edges in the data graph become nodes and edges in the systolic graph.

To take an example, assume we wish to compute the places that are most (or least) easily reached by the routes shown in the graph of Figure 5.2. A good way to measure this is using centrality, where we choose PageRank [Page et al., 1999], which computes the probability of a tourist randomly following the routes shown in the graph being at a particular place after a given number of “hops”. We can implement PageRank on large graphs using a graph parallel framework. In Figure 5.3, we provide an example of an iteration of PageRank for an illustrative sub-graph of Figure 5.2. The nodes are initialised with a score of \(\frac{1}{|V|} = \frac{1}{6}\), where we assume the tourist to have an equal chance of starting at any point. In the message phase (Msg), each node \(v\) passes a score of \(\frac{d \textrm{R}_i(v)}{|E(v)|}\) on each of its outgoing edges, where we denote by \(d\) a constant damping factor used to ensure convergence (typically \(d = 0.85\), indicating the probability that a tourist randomly “jumps” to any place), by \(\textrm{R}_i(v)\) the score of node \(v\) in iteration \(i\) (the probability of the tourist being at node \(v\) after \(i\) hops), and by \(|E(v)|\) the number of outgoing edges of \(v\). The aggregation phase (Agg) for \(v\) then sums all incoming messages received along with its constant share of the damping factor (\(\frac{1-d}{|V|}\)) to compute \(\textrm{R}_{i+1}(v)\). We then proceed to the message phase of the next iteration, continuing until some termination criterion is reached (e.g., iteration count or residual threshold, etc.) and final scores are output.
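The message and aggregation phases just described can be sketched in a few lines of Python. The route graph below is a small hypothetical stand-in for Figure 5.2 (place names and parameters are illustrative, not taken from the figure):

```python
d = 0.85  # damping factor: probability of following a route rather than "jumping"

def pagerank(edges, num_iters=20):
    """edges: set of (source, target) pairs; returns a dict of node scores."""
    nodes = {u for edge in edges for u in edge}
    out = {u: [t for (s, t) in edges if s == u] for u in nodes}
    rank = {u: 1.0 / len(nodes) for u in nodes}  # equal chance of starting anywhere
    for _ in range(num_iters):
        # message phase (Msg): each node passes d * rank / |E(v)| on outgoing edges
        msgs = {u: [] for u in nodes}
        for u in nodes:
            for v in out[u]:
                msgs[v].append(d * rank[u] / len(out[u]))
        # aggregation phase (Agg): sum incoming messages plus the damping share
        rank = {u: (1 - d) / len(nodes) + sum(msgs[u]) for u in nodes}
    return rank

# Hypothetical routes; every place here has at least one outgoing route
routes = {("Arica", "Santiago"), ("Santiago", "Arica"),
          ("Easter Island", "Santiago"), ("Santiago", "Easter Island")}
scores = pagerank(routes)
```

Since every node above has an outgoing edge, the scores sum to one across nodes; Santiago, reachable from both other places, receives the highest score.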

Example of a systolic iteration of PageRank on a sub-graph of Figure 5.2

While the given example is for PageRank, the systolic abstraction is general enough to support a wide variety of graph analytics, including those previously mentioned. An algorithm in this framework consists of the functions to compute message values in the message phase (Msg), and to accumulate the messages in the aggregation phase (Agg). The framework will take care of distribution, message passing, fault tolerance, etc. However, such frameworks – based on message passing between neighbours – have limitations: not all types of analytics can be expressed in such frameworks [Xu et al., 2019]. Hence frameworks may allow additional features, such as a global step that performs a global computation on all nodes, making the result available to each node [Malewicz et al., 2010]; or a mutation step that allows for adding or removing nodes and edges during processing [Malewicz et al., 2010]. (Note 18: Formally, Xu et al. [2019] have shown that such frameworks are as powerful as the (incomplete) Weisfeiler–Lehman (WL) graph isomorphism test for distinguishing graphs. This test involves nodes recursively hashing together hashes of local information received from neighbours, and passing these hashes to neighbours.)

Before defining a graph parallel framework, in the interest of generality, we first define a directed graph labelled with feature vectors, which captures the type of input that such a framework can accept, with vectors on both nodes and edges.

Directed vector-labelled graph
We define a directed vector-labelled graph \(G = (V,E,F,\lambda)\), where \(V\) is a set of nodes, \(E \subseteq V \times V\) is a set of edges, \(F\) is a set of feature vectors, and \(\lambda : V \cup E \rightarrow F\) labels each node and edge with a feature vector.

A directed edge-labelled graph or property graph may be encoded as a directed vector-labelled graph in a number of ways. The type of node and/or a selection of its attributes may be encoded in the node feature vectors, while the label of an edge and/or a selection of its attributes may be encoded in the edge feature vector (including, for example, weights applied to edges). Typically node feature vectors will all have the same dimensionality, as will edge feature vectors.

We define a directed vector-labelled graph in preparation for later computing PageRank using a graph parallel framework. Let \(G = (V,E,L)\) denote a directed edge-labelled graph. We then initialise a directed vector-labelled graph \(G' = (V,E',F,\lambda)\) such that \(E' = \{ (x,z) \mid \exists y : (x,y,z)\in E \}\). Letting \(|E'(u)|\) denote the outdegree of node \(u\) in \(G'\), for all \(u \in V\) we define \(\lambda(u) \coloneqq \begin{bmatrix} \frac{1}{|V|} \\ |E'(u)| \\ |V| \end{bmatrix}\) and \(\lambda(u,v) \coloneqq \begin{bmatrix} \, \end{bmatrix}\), with \(F \coloneqq \{ \lambda(u) \mid u \in V \} \cup \{\lambda(u,v) \mid (u,v) \in E' \}\), thus assigning each node a vector containing its initial PageRank score, the outdegree of the node, and the number of nodes in the graph. Edge vectors are not used in this case.

We now define a graph parallel framework, where we use \(\{\!\!\{ \cdot \}\!\!\}\) to denote a multiset, \(2^{S \rightarrow \mathbb{N}}\) to denote the set of all multisets containing (only) elements from the set \(S\), and \(\mathbb{R}^a\) to denote the set of all vectors of dimension \(a\) (i.e., the set of all vectors containing \(a\) real-valued elements).

Graph parallel framework
A graph parallel framework (GPF) is a triple of functions \(\mathfrak{G} \coloneqq (\)Msg, Agg, End\()\) such that (with \(a, b, c \in \mathbb{N}\)):
  • Msg\(: \mathbb{R}^a \times \mathbb{R}^b \rightarrow \mathbb{R}^c\)
  • Agg\(: \mathbb{R}^a \times 2^{\mathbb{R}^c \rightarrow \mathbb{N}} \rightarrow \mathbb{R}^a\)
  • End\(: 2^{\mathbb{R}^a \rightarrow \mathbb{N}} \rightarrow \{ \mathrm{true}, \mathrm{false} \}\)

The function Msg defines what message (i.e., vector) must be passed from a node to a neighbouring node along a particular edge, given the current feature vectors of the node and the edge; the function Agg is used to compute a new feature vector for a node, given its previous feature vector and incoming messages; the function End defines a condition for termination of vector computation. The integers \(a\), \(b\) and \(c\) denote the dimensions of node feature vectors, edge feature vectors, and message vectors, respectively; we assume that \(a\) and \(b\) correspond with the dimensions of input feature vectors for nodes and edges. Given a GPF \(\mathfrak{G} = (\)Msg, Agg, End\()\), a directed vector-labelled graph \(G = (V, E, F, \lambda)\), and a node \(u \in V\), we define the output vector assigned to node \(u\) in \(G\) by \(\mathfrak{G}\) (written \(\mathfrak{G}(G, u)\)) as follows. First let \(\mathbf{n}_u^{(0)} \coloneqq \lambda(u)\). For all \(i\geq 1\), let:

\begin{align*} M_u^{(i)} & \coloneqq \left\{\!\!\!\left\{ {\rm\small M{\scriptsize SG}}\left(\mathbf{n}_v^{(i-1)},\lambda(v,u)\right) \bigl\lvert\, (v,u) \in E \right\}\!\!\!\right\} \\ \mathbf{n}_{u}^{(i)} & \coloneqq {\rm\small A{\scriptsize GG}}\left(\mathbf{n}_u^{(i-1)},M_u^{(i)}\right) \end{align*}

where \(M_u^{(i)}\) is the multiset of messages received by node \(u\) during iteration \(i\), and \(\mathbf{n}_{u}^{(i)}\) is the state (vector) of node \(u\) at the end of iteration \(i\). If \(j\) is the smallest integer for which End\((\{\!\!\{ \mathbf{n}_u^{(j)} \mid u \in V \}\!\!\})\) is true, then \(\mathfrak{G}(G, u) \coloneqq \mathbf{n}_u^{(j)}\).

This particular definition assumes that vectors are dynamically computed for nodes, and that messages are passed only to outgoing neighbours, but the definitions can be readily adapted to consider dynamic vectors for edges, or messages being passed to incoming neighbours, etc. We now provide an example instantiating a GPF to compute PageRank over a directed graph.

We take as input the directed vector-labelled graph \(G' = (V,E',F,\lambda)\) from Example 5.1 for a PageRank GPF. First we define the messages passed from \(v\) to \(u\):

Msg\(\left(\mathbf{n}_v,\lambda(v,u)\right) \coloneqq \begin{bmatrix} \frac{d(\mathbf{n}_{v})_1}{(\mathbf{n}_{v})_2}\\ \end{bmatrix}\)

where \(d\) denotes PageRank’s constant damping factor (typically \(d \coloneqq 0.85\)) and \((\mathbf{n}_{v})_k\) denotes the \(k\)th element of the vector \(\mathbf{n}_{v}\). In other words, \(v\) will pass to \(u\) its PageRank score multiplied by the damping factor and divided by its out-degree (we do not require \(\lambda(v,u)\) in this particular example). Next we define the function for \(u\) to aggregate the messages it receives from other nodes:

Agg\(\left(\mathbf{n}_u,M_u\right) \coloneqq \begin{bmatrix} \frac{1 - d}{(\mathbf{n}_{u})_3} + \sum_{\mathbf{m} \in M_u}(\mathbf{m})_1 \\ (\mathbf{n}_{u})_2 \\ (\mathbf{n}_{u})_3 \\ \end{bmatrix}\)

Here, we sum the scores received from other nodes along with the node’s share of rank from the damping factor, copying over the node’s out-degree and the total number of nodes for future use. Finally, there are a number of ways that we could define the termination condition; here we simply define:

End\((\{\!\!\{ \mathbf{n}_u^{(i)} \mid u \in V \}\!\!\}) \coloneqq (i \geq \textsf{z}) \)

where \(\textsf{z}\) is a fixed number of iterations, at which point the process stops.
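As a sketch, the three functions above can be instantiated directly in Python, with node states as triples (score, out-degree, \(|V|\)) per Example 5.1. The driver loop and the explicit iteration counter passed to End are implementation conveniences, not part of the formal definition; the graph is illustrative:

```python
d = 0.85  # damping factor
z = 20    # fixed number of iterations for the termination condition

def Msg(n_v, edge_vec):
    # pass the sender's damped score divided by its out-degree
    return (d * n_v[0] / n_v[1],)

def Agg(n_u, msgs):
    # sum incoming scores plus the damping share; carry degree and |V| forward
    return ((1 - d) / n_u[2] + sum(m[0] for m in msgs), n_u[1], n_u[2])

def End(states, i):
    # terminate after z iterations (we pass the counter i explicitly)
    return i >= z

def run_gpf(V, E):
    """V: set of nodes; E: set of (source, target) edges with empty edge vectors."""
    n = {u: (1.0 / len(V), sum(1 for (s, _) in E if s == u), len(V)) for u in V}
    i = 0
    while not End(list(n.values()), i):
        msgs = {u: [] for u in V}
        for (v, u) in E:                 # messages flow along outgoing edges
            msgs[u].append(Msg(n[v], ()))
        n = {u: Agg(n[u], msgs[u]) for u in V}
        i += 1
    return n

V = {"Arica", "Santiago", "Easter Island"}
E = {("Arica", "Santiago"), ("Santiago", "Arica"),
     ("Easter Island", "Santiago"), ("Santiago", "Easter Island")}
states = run_gpf(V, E)
```

After termination, the first component of each node's state holds its PageRank score.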

We may note in this example that the total number of nodes is duplicated in the vector for each node of the graph. Part of the benefit of GPFs is that only local information in the neighbourhood of the node is required for each computation step. In practice, such frameworks may allow additional features, such as global computation steps whose results are made available to all nodes [Malewicz et al., 2010], operations that dynamically modify the graph [Malewicz et al., 2010], etc.

Analytics on data graphs

As aforementioned, most analytics presented thus far are, in their “native” form, applicable for undirected or directed graphs without the edge metadata – i.e., edge labels or property–value pairs – typical of graph data models. (Note 19: We remark that in the case of property graphs, property–value pairs on nodes can be converted by mapping values to nodes and properties to edges with the corresponding label.) A number of strategies can be applied to make data graphs subject to analytics of this form:

Original graph
Lossy transformation
Lossless transformation
Transformations from a directed edge-labelled graph to a directed graph

The results of an analytical process may change drastically depending on which of the previous strategies are chosen to prepare the graph for analysis. The choice of strategy may be a non-trivial one to make a priori and may require empirical validation. More study is required to more generally understand the effects of such strategies on the results of different analytical techniques over different graph models.

Analytics with queries

As discussed in Section 2.2, various languages for querying graphs have been proposed down through the years [Angles et al., 2017]. One may consider a variety of ways in which query languages and analytics can complement each other. First, we may consider using query languages to project or transform a graph suitable for a particular analytical task, such as to extract the graph of Figure 5.2 from a larger data graph. Query languages such as SPARQL [Harris et al., 2013], Cypher [Francis et al., 2018], and G-CORE [Angles et al., 2018] allow for outputting graphs, where such queries can be used to select sub-graphs for analysis. These languages can also express some limited (non-recursive) analytics, where aggregations can be used to compute degree centrality, for example; they may also have some built-in analytical support, where, for example, Cypher [Francis et al., 2018] allows for finding shortest paths. In the other direction, analytics can contribute to the querying process in terms of optimisations, where, for example, analysis of connectivity may suggest how to better distribute a large data graph over multiple machines for querying using, e.g., minimum cuts [Akhter et al., 2018, Janke et al., 2018]. Analytics have also been used to rank query results over large graphs [Wagner et al., 2012, Fan et al., 2013], selecting the most important results for presentation to the user.

In some use-cases we may further wish to interleave querying and analytical processes. For example, from the full data graph collected by the tourist board, consider an upcoming airline strike where the board wishes to find the events during the strike with venues in cities unreachable from Santiago by public transport due to the strike. Hypothetically, we could use a query to extract the transport network excluding the airline’s routes (assuming, per Figure 2.3a, that the airline information is available), use analytics to extract the strongly connected component containing Santiago, and finally use a query to find events in cities not in the Santiago component on the given dates. (Note 21: Such a task could not be solved in a single query using regular path queries, as such expressions would not be capable of filtering edges representing flights of a particular airline.) While one could solve this task using an imperative language such as Gremlin [Rodriguez, 2015], GraphX [Xin et al., 2013a], or R [The R Foundation, 1992], more declarative languages are also being explored to express such tasks, with proposals including the extension of graph query languages with recursive capabilities [Bischof et al., 2012, Reutter et al., 2015, Hogan et al., 2020] (Note 22: Recursive query languages become Turing complete assuming one can also express operations on binary arrays.), combining linear algebra with relational (query) algebra [Hutchison et al., 2017], and so forth.

Analytics with entailment

Knowledge graphs are often associated with a semantic schema or ontology that defines the semantics of domain terms, giving rise to entailments (per Chapter 4). Applying analytics with or without such entailments – e.g., before or after materialisation – may yield radically different results. For example, observe that an edge Santa Lucía –hosts→ EID15 is semantically equivalent to an edge EID15 –venue→ Santa Lucía once the inverse axiom hosts –inv. of→ venue is invoked; however, these edges are far from equivalent from the perspective of analytical techniques that consider edge direction, for which including one type of edge, or the other, or both, may have a major bearing on the final results. To the best of our knowledge, the combination of analytics and entailment has not been well-explored, leaving open interesting research questions. Along these lines, it may be of interest to explore semantically-invariant analytics that yield the same results over semantically-equivalent graphs (i.e., graphs that entail one another), thus analysing the semantic content of the knowledge graph rather than simply the topological features of the data graph; for example, semantically-invariant analytics would yield the same results over a graph containing the inverse axiom hosts –inv. of→ venue and a number of hosts edges, the same graph but where every hosts edge is replaced by an inverse venue edge, and the union of both graphs.

Knowledge Graph Embeddings

Methods for machine learning have gained significant attention in recent years. In the context of knowledge graphs, machine learning can either be used for directly refining a knowledge graph [Paulheim, 2017] (discussed further in Chapter 8); or for downstream tasks using the knowledge graph, such as recommendation [Zhang et al., 2016], information extraction [Vashishth et al., 2018], question answering [Huang et al., 2019], query relaxation [Wang et al., 2018], query approximation [Hamilton et al., 2018], etc. (discussed further in Chapter 10). However, many traditional machine learning techniques assume dense numeric input representations in the form of vectors, which is quite distinct from how graphs are usually expressed. So how can graphs – or nodes, edges, etc., thereof – be encoded as numeric vectors?

A first attempt to represent a graph using vectors would be to use a one-hot encoding, generating a vector for each node of length \(|L| \cdot |V|\) – with \(|V|\) the number of nodes in the input graph and \(|L|\) the number of edge labels – placing a one at the index corresponding to each edge label–target node pair for which an outgoing edge exists from the given node, or a zero otherwise. Such a representation will, however, typically result in large and sparse vectors, which will be detrimental for most machine learning models.
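A minimal sketch of this one-hot encoding (the function name and toy graph are illustrative):

```python
def one_hot_nodes(nodes, labels, edges):
    """nodes, labels: lists fixing an index order; edges: set of (s, p, o) triples.
    Returns, per node, a 0/1 vector of length |L| * |V| marking its outgoing edges."""
    node_idx = {v: i for i, v in enumerate(nodes)}
    label_idx = {l: i for i, l in enumerate(labels)}
    vectors = {v: [0] * (len(labels) * len(nodes)) for v in nodes}
    for (s, p, o) in edges:
        # one index per (edge label, target node) pair
        vectors[s][label_idx[p] * len(nodes) + node_idx[o]] = 1
    return vectors

# Toy graph: a single flight edge between two nodes
vecs = one_hot_nodes(["Santiago", "Arica"], ["flight", "bus"],
                     {("Santiago", "flight", "Arica")})
```

Each vector here has \(|L| \cdot |V| = 4\) entries, of which at most a handful are ones, illustrating the sparsity problem at small scale.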

The main goal of knowledge graph embedding techniques is to create a dense representation of the graph (i.e., embed the graph) in a continuous, low-dimensional vector space that can then be used for machine learning tasks. The dimensionality \(d\) of the embedding is fixed and usually low (often, e.g., \(50 \leq d \leq 1000\)). Typically the graph embedding is composed of an entity embedding for each node: a vector with \(d\) dimensions that we denote by \(\mathbf{e}\); and a relation embedding for each edge label: (typically) a vector with \(d\) dimensions that we denote by \(\mathbf{r}\). The overall goal of these vectors is to abstract and preserve latent structures in the graph. There are many ways in which this notion of an embedding can be instantiated. Most commonly, given an edge s –p→ o, a specific embedding approach defines a scoring function that accepts \(\mathbf{e}_s\) (the entity embedding of node s), \(\mathbf{r}_p\) (the relation embedding of edge label p) and \(\mathbf{e}_o\) (the entity embedding of node o) and computes the plausibility of the edge, which estimates how likely it is to be true. Given a data graph, the goal is then to compute the embeddings of dimension \(d\) that maximise the plausibility of positive edges (typically edges in the graph) and minimise the plausibility of negative examples (typically edges in the graph with a node or edge label changed such that they are no longer in the graph) according to the given scoring function. The resulting embeddings can then be seen as models learnt through self-supervision that encode (latent) features of the graph, mapping input edges to output plausibility scores.

Embeddings can then be used for a number of low-level tasks involving the nodes and edge-labels of the graph from which they were computed. First, we can use the plausibility scoring function to assign a confidence to edges that may, for example, have been extracted from an external source (discussed later in Chapter 6). Second, the plausibility scoring function can be used to complete edges with missing nodes/edge labels for the purposes of link prediction (discussed later in Chapter 8); for example, in Figure 5.2, we might ask which nodes in the graph are likely to complete the edge Grey Glacier –bus→ ?, where – aside from Punta Arenas, which is already given – we might intuitively expect Torres del Paine to be a plausible candidate. Third, embedding models will typically assign similar vectors to similar nodes and similar edge-labels, and thus they can be used as the basis of similarity measures, which may be useful for finding duplicate nodes that refer to the same entity, or for the purposes of providing recommendations (discussed later in Chapter 10).

A wide range of knowledge graph embedding techniques have been proposed [Wang et al., 2017]. Our goal here is to provide a high-level introduction to some of the most popular techniques proposed thus far. We first discuss tensor-based approaches that include three different sub-approaches using linear/tensor algebra to compute embeddings. We then discuss language models that leverage existing word embedding techniques, proposing ways of generating graph-like analogues for their expected (textual) inputs. Finally we discuss entailment-aware models that can take into account the semantics of the graph, when available.

Tensor-based models

We first discuss tensor-based models, which we sub-divide into three categories: translational models that adopt a geometric perspective whereby relation embeddings translate subject entities to object entities, tensor decomposition models that extract latent factors approximating the graph’s structure, and neural models that use neural networks to train embeddings that provide accurate plausibility scores.

Translational models

Translational models interpret edge labels as transformations from subject nodes (aka the source or head) to object nodes (aka the target or tail); for example, in the edge San Pedro –bus→ Moon Valley, the edge label bus is seen as transforming San Pedro to Moon Valley, and likewise for other bus edges. The most elementary approach in this family is TransE [Bordes et al., 2013]. Over all positive edges s –p→ o, TransE learns vectors \(\mathbf{e}_s\), \(\mathbf{r}_p\), and \(\mathbf{e}_o\) aiming to make \(\mathbf{e}_s + \mathbf{r}_p\) as close as possible to \(\mathbf{e}_o\). Conversely, if the edge is a negative example, TransE attempts to learn a representation that keeps \(\mathbf{e}_s + \mathbf{r}_p\) away from \(\mathbf{e}_o\). To illustrate, Figure 5.5 provides a toy example of two-dimensional (\(d = 2\)) entity and relation embeddings computed by TransE. We keep the orientation of the vectors similar to the original graph for clarity. For any edge s –p→ o in the original graph, adding the vectors \(\mathbf{e}_s + \mathbf{r}_p\) should approximate \(\mathbf{e}_o\). In this toy example, the vectors correspond precisely where, for instance, adding the vectors for Licantén (\(\mathbf{e}_{L.}\)) and west of (\(\mathbf{r}_{wo.}\)) gives a vector corresponding to Curicó (\(\mathbf{e}_{C.}\)). We can use these embeddings to predict edges (amongst other tasks); for example, in order to predict which node in the graph is most likely to be west of Antofagasta (A.), by computing \(\mathbf{e}_{A.} + \mathbf{r}_{wo.}\) we find that the resulting vector (dotted in Figure 5.5c) is closest to \(\mathbf{e}_{T.}\), thus predicting Toconao (T.) to be the most plausible such node.

Original graph
Relation embeddings
Entity embeddings
Toy example of two-dimensional relation and entity embeddings learnt by TransE; the entity embeddings use abbreviations and include an example of vector addition to predict what is west of Antofagasta

Aside from this toy example, TransE can be too simplistic; for example, in Figure 5.2, bus not only transforms San Pedro to Moon Valley, but also to Arica, Calama, and so forth. TransE will, in this case, aim to give similar vectors to all such target locations, which may not be feasible given other edges. TransE will also tend to assign cyclical relations a zero vector, as the directional components will tend to cancel each other out. To resolve such issues, many variants of TransE have been investigated. Amongst these, for example, TransH [Wang et al., 2014] represents different relations using distinct hyperplanes, where for the edge s –p→ o, s is first projected onto the hyperplane of p before the translation to o is learnt (uninfluenced by edges with other labels for s and for o). TransR [Lin et al., 2015] generalises this approach by projecting s and o into a vector space specific to p, which involves multiplying the entity embeddings for s and o by a projection matrix specific to p. TransD [Ji et al., 2015] simplifies TransR by associating entities and relations with a second vector, where these secondary vectors are used to project the entity into a relation-specific vector space. More recently, RotatE [Sun et al., 2019] proposes translational embeddings in complex space, which allows it to capture more characteristics of relations, such as direction, symmetry, inversion, antisymmetry, and composition. Embeddings have also been proposed in non-Euclidean space; for example, MuRP [Balazevic et al., 2019a] uses relation embeddings that transform entity embeddings in the hyperbolic space of the Poincaré ball model, whose curvature provides more “space” to separate entities with respect to the dimensionality. For discussion of other translational models, we refer to the surveys by Cai et al. [2018] and Wang et al. [2017].
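To make the translational idea concrete, the following is a minimal sketch of TransE-style training in plain Python, using a hinge-style update with corrupted objects as negative examples. The toy triples, learning rate, and margin are all illustrative, and real implementations add refinements (vector normalisation, mini-batching, etc.) not shown here:

```python
import random

random.seed(0)
dim, lr, margin = 10, 0.1, 1.0

def dist2(e_s, r_p, e_o):
    # squared distance ||e_s + r_p - e_o||^2; lower means more plausible
    return sum((s + r - o) ** 2 for s, r, o in zip(e_s, r_p, e_o))

def train(triples, entities, relations, epochs=200):
    E = {e: [random.uniform(-0.5, 0.5) for _ in range(dim)] for e in entities}
    R = {r: [random.uniform(-0.5, 0.5) for _ in range(dim)] for r in relations}
    pos = set(triples)
    for _ in range(epochs):
        for (s, p, o) in triples:
            # corrupt the object, avoiding known positive edges
            o_neg = random.choice(
                [e for e in entities if e != o and (s, p, e) not in pos])
            # hinge: update only while the negative is not clearly less plausible
            if dist2(E[s], R[p], E[o]) + margin > dist2(E[s], R[p], E[o_neg]):
                for k in range(dim):
                    g_pos = E[s][k] + R[p][k] - E[o][k]
                    g_neg = E[s][k] + R[p][k] - E[o_neg][k]
                    # pull the positive triple together, push the negative apart
                    E[s][k] -= lr * (g_pos - g_neg)
                    R[p][k] -= lr * (g_pos - g_neg)
                    E[o][k] += lr * g_pos
                    E[o_neg][k] -= lr * g_neg
    return E, R

entities = ["San Pedro", "Moon Valley", "Calama", "Santiago"]
relations = ["bus", "flight"]
triples = [("San Pedro", "bus", "Moon Valley"), ("San Pedro", "bus", "Calama"),
           ("Calama", "flight", "Santiago")]
E, R = train(triples, entities, relations)
```

After training, a positive edge such as Calama –flight→ Santiago should be scored as more plausible (smaller distance) than a corrupted edge such as Calama –flight→ Moon Valley.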

Tensor decomposition models

A second approach to derive graph embeddings is to apply methods based on tensor decomposition. A tensor is a multidimensional numeric field that generalises scalars (\(0\)-order tensors), vectors (\(1\)-order tensors) and matrices (\(2\)-order tensors) towards arbitrary dimension/order. Tensors have become a widely used abstraction for machine learning [Rabanser et al., 2017]. Tensor decomposition involves decomposing a tensor into more “elemental” tensors (e.g., of lower order) from which the original tensor can be recomposed (or approximated) by a fixed sequence of basic operations over the output tensors. These elemental tensors can be viewed as capturing latent factors underlying the information contained in the original tensor. There are many approaches to tensor decomposition, where we will now briefly introduce the main ideas behind rank decompositions [Rabanser et al., 2017].

Leaving aside graphs momentarily, consider an \((a,b)\)-matrix (i.e., a \(2\)-order tensor) \(\mathbf{C}\), where \(a\) is the number of cities in Chile, \(b\) is the number of months in a year, and each element \((\mathbf{C})_{ij}\) denotes the average temperature of the \(i\)th city in the \(j\)th month. Noting that Chile is a long, thin country – ranging from subpolar climates in the south, to a desert climate in the north – we may find a decomposition of \(\mathbf{C}\) into two vectors representing latent factors – specifically \(\mathbf{x}\) (with \(a\) elements) giving lower values for cities with lower latitude, and \(\mathbf{y}\) (with \(b\) elements), giving lower values for months with lower temperatures – such that computing the outer product23note 23 The outer product of two (column) vectors \(\mathbf{x}\) of length \(a\) and \(\mathbf{y}\) of length \(b\), denoted \(\mathbf{x} \otimes \mathbf{y}\), is defined as \(\mathbf{x}\mathbf{y}^{\mathrm{T}}\), yielding an \((a,b)\)-matrix \(\mathbf{M}\) such that \((\mathbf{M})_{ij} = (\mathbf{x})_i \cdot (\mathbf{y})_j\). Analogously, the outer product of \(k\) vectors is a \(k\)-order tensor. of the two vectors approximates \(\mathbf{C}\) reasonably well: \(\mathbf{x} \otimes \mathbf{y} \approx \mathbf{C}\). In the (unlikely) case that there exist vectors \(\mathbf{x}\) and \(\mathbf{y}\) such that \(\mathbf{C}\) is precisely the outer product of two vectors (\(\mathbf{x} \otimes \mathbf{y} = \mathbf{C}\)) we call \(\mathbf{C}\) a rank-\(1\) matrix; we can then precisely encode \(\mathbf{C}\) using \(a + b\) values rather than \(a \times b\) values. Most times, however, to get precisely \(\mathbf{C}\), we need to sum multiple rank-\(1\) matrices, where the rank \(r\) of \(\mathbf{C}\) is the minimum number of rank-\(1\) matrices that need to be summed to derive precisely \(\mathbf{C}\), such that \(\mathbf{x}_1 \otimes \mathbf{y}_1 + \ldots \mathbf{x}_r \otimes \mathbf{y}_r = \mathbf{C}\). 
In the temperature example, \(\mathbf{x}_2 \otimes \mathbf{y}_2\) might correspond to a correction for altitude, \(\mathbf{x}_3 \otimes \mathbf{y}_3\) for higher temperature variance further south, etc. A (low) rank decomposition of a matrix then sets a limit \(d\) on the rank and computes the vectors \((\mathbf{x}_1,\mathbf{y}_1,\ldots,\mathbf{x}_{d},\mathbf{y}_{d})\) such that \(\mathbf{x}_1 \otimes \mathbf{y}_1 + \ldots + \mathbf{x}_{d} \otimes \mathbf{y}_{d}\) gives the best \(d\)-rank approximation of \(\mathbf{C}\). Noting that to generate \(n\)-order tensors we need to compute the outer product of \(n\) vectors, we can generalise this idea towards low-rank decomposition of tensors; this method is called Canonical Polyadic (CP) decomposition [Hitchcock, 1927]. For example, a \(3\)-order tensor \(\mathcal{C}\) containing monthly temperatures for Chilean cities at four different times of day could be approximated with \(\mathbf{x}_1 \otimes \mathbf{y}_1 \otimes \mathbf{z}_1 + \ldots + \mathbf{x}_{d} \otimes \mathbf{y}_{d} \otimes \mathbf{z}_{d}\) (e.g., \(\mathbf{x}_1\) might be a latitude factor, \(\mathbf{y}_1\) a monthly variation factor, and \(\mathbf{z}_1\) a daily variation factor, and so on). Various algorithms exist to compute (approximate) CP decompositions, including Alternating Least Squares, Jennrich’s Algorithm, and the Tensor Power method [Rabanser et al., 2017].
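As a concrete (if simplified) illustration of low-rank decomposition in the matrix case, the following NumPy sketch builds a matrix from two latent factors and computes its best rank-\(d\) approximations via truncated singular value decomposition (for matrices, the best rank-\(d\) approximation under the Frobenius norm can be obtained this way); the data and dimensions are purely illustrative.

```python
import numpy as np

# A hypothetical temperature matrix for 5 cities x 12 months, built from two
# latent factors (e.g., a latitude trend plus a weaker altitude correction).
rng = np.random.default_rng(0)
x1, y1 = rng.normal(size=5), rng.normal(size=12)
x2, y2 = rng.normal(size=5), rng.normal(size=12)
C = np.outer(x1, y1) + 0.1 * np.outer(x2, y2)   # a rank-2 matrix

def best_rank_d(C, d):
    """Best rank-d approximation of a matrix, via truncated SVD."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(d))

err1 = np.linalg.norm(C - best_rank_d(C, 1))   # Frobenius-norm errors
err2 = np.linalg.norm(C - best_rank_d(C, 2))
assert err2 <= err1                        # summing more rank-1 terms helps
assert np.allclose(best_rank_d(C, 2), C)   # C has rank 2 by construction
```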

Returning to graphs, similar principles can be used to decompose a graph into vectors, thus yielding embeddings. In particular, a graph can be encoded as a one-hot \(3\)-order tensor \(\mathcal{G}\) with \(|V| \times |L| \times |V|\) elements, where the element \((\mathcal{G})_{ijk}\) is set to one if the \(i\)th node links to the \(k\)th node with an edge having the \(j\)th label, or zero otherwise. As previously mentioned, such a tensor will typically be very large and sparse, where rank decompositions are thus applicable. A CP decomposition [Hitchcock, 1927] would compute a sequence of vectors \((\mathbf{x}_1,\mathbf{y}_1,\mathbf{z}_1,\ldots,\mathbf{x}_d,\mathbf{y}_d,\mathbf{z}_d)\) such that \(\mathbf{x}_1 \otimes \mathbf{y}_1 \otimes \mathbf{z}_1 + \ldots + \mathbf{x}_d \otimes \mathbf{y}_d \otimes \mathbf{z}_d \approx \mathcal{G}\). We illustrate this scheme in Figure 5.6. Letting \(\mathbf{X}, \mathbf{Y}, \mathbf{Z}\) denote the matrices formed by \(\begin{bmatrix} \mathbf{x}_1\,\cdots\,\mathbf{x}_d \end{bmatrix}\), \(\begin{bmatrix} \mathbf{y}_1\,\cdots\,\mathbf{y}_d \end{bmatrix}\), \(\begin{bmatrix} \mathbf{z}_1\,\cdots\,\mathbf{z}_d \end{bmatrix}\), respectively, with each vector forming a column of the corresponding matrix, we could then extract the \(i\)th row of \(\mathbf{Y}\) as an embedding for the \(i\)th relation, and the \(j\)th rows of \(\mathbf{X}\) and \(\mathbf{Z}\) as two embeddings for the \(j\)th entity. However, knowledge graph embeddings typically aim to assign one vector to each entity.
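The one-hot encoding of a graph as a \(3\)-order tensor can be sketched in a few lines; the toy nodes, labels and edges below are purely illustrative.

```python
import numpy as np

# A toy directed edge-labelled graph (hypothetical nodes, labels and edges).
V = ["Santiago", "Chile", "Vaduz"]
L = ["capital", "flight"]
E = [("Santiago", "capital", "Chile"), ("Santiago", "flight", "Vaduz")]

# One-hot 3-order tensor G with |V| x |L| x |V| elements, where (G)_ijk = 1
# iff the i-th node links to the k-th node with an edge having the j-th label.
v = {node: i for i, node in enumerate(V)}
l = {label: j for j, label in enumerate(L)}
G = np.zeros((len(V), len(L), len(V)))
for s, p, o in E:
    G[v[s], l[p], v[o]] = 1.0

assert G.shape == (3, 2, 3)
assert G.sum() == len(E)    # exactly one non-zero entry per edge
```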

Abstract illustration of a CP \(d\)-rank decomposition of a tensor representing the graph of Figure 5.5a

DistMult [Yang et al., 2015] is a seminal method for computing knowledge graph embeddings based on rank decompositions, where each entity and relation is associated with a vector of dimension \(d\), such that for an edge spo, a plausibility scoring function \(\sum_{i=1}^d (\mathbf{e}_s)_i (\mathbf{r}_p)_i (\mathbf{e}_o)_i\) is defined, where \((\mathbf{e}_s)_i\), \((\mathbf{r}_p)_i\) and \((\mathbf{e}_o)_i\) denote the \(i\)th elements of vectors \(\mathbf{e}_s\), \(\mathbf{r}_p\), \(\mathbf{e}_o\), respectively. The goal, then, is to learn vectors for each node and edge label that maximise the plausibility of positive edges and minimise the plausibility of negative edges. This approach equates to a CP decomposition of the graph tensor \(\mathcal{G}\), but where entities have one vector that is used twice: \(\mathbf{x}_1 \otimes \mathbf{y}_1 \otimes \mathbf{x}_1 + \ldots + \mathbf{x}_d \otimes \mathbf{y}_d \otimes \mathbf{x}_d \approx \mathcal{G}\). A weakness of this approach is that, per the scoring function, the plausibility of spo will always be equal to that of ops; in other words, DistMult does not consider edge direction.
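The DistMult scoring function, and its insensitivity to edge direction, can be illustrated with a short NumPy sketch (random vectors stand in for learnt embeddings):

```python
import numpy as np

def distmult_score(e_s, r_p, e_o):
    """DistMult plausibility: sum_i (e_s)_i (r_p)_i (e_o)_i."""
    return float(np.sum(e_s * r_p * e_o))

rng = np.random.default_rng(1)
e_s, r_p, e_o = (rng.normal(size=4) for _ in range(3))

# The score is invariant under swapping subject and object:
assert np.isclose(distmult_score(e_s, r_p, e_o), distmult_score(e_o, r_p, e_s))
```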

Rather than use a vector as a relation embedding, RESCAL [Nickel and Tresp, 2013] uses a matrix, which allows for combining values from \(\mathbf{e}_s\) and \(\mathbf{e}_o\) across all dimensions, and thus can capture (e.g.) edge direction. However, RESCAL incurs a higher cost in terms of space and time than DistMult. HolE [Nickel et al., 2016b] uses vectors for relation and entity embeddings, but proposes to use the circular correlation operator – which takes sums along the diagonals of the outer product of two vectors – to combine them. This operator is not commutative, and can thus consider edge direction. ComplEx [Trouillon et al., 2016], on the other hand, uses a complex vector (i.e., a vector containing complex numbers) as a relational embedding, which similarly allows for breaking the aforementioned symmetry of DistMult’s scoring function while keeping the number of parameters low. SimplE [Kazemi and Poole, 2018] rather proposes to compute a standard CP decomposition, computing two initial vectors for entities from \(\mathbf{X}\) and \(\mathbf{Z}\), and then averaging terms across \(\mathbf{X}\), \(\mathbf{Y}\), \(\mathbf{Z}\) to compute the final plausibility scores. TuckER [Balazevic et al., 2019b] employs a different type of decomposition – called a Tucker Decomposition [Tucker, 1964], which computes a smaller “core” tensor \(\mathcal{T}\) and a sequence of three matrices \(\mathbf{A}\), \(\mathbf{B}\) and \(\mathbf{C}\), such that \(\mathcal{G} \approx \mathcal{T} \otimes_1 \mathbf{A} \otimes_2 \mathbf{B} \otimes_3 \mathbf{C}\) – where entity embeddings are taken from \(\mathbf{A}\) and \(\mathbf{C}\), while relation embeddings are taken from \(\mathbf{B}\). Of these approaches, TuckER [Balazevic et al., 2019b] currently provides state-of-the-art results on standard benchmarks.
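To illustrate how a complex relational embedding breaks the symmetry of DistMult, the following sketch computes the ComplEx score \(\mathrm{Re}(\sum_{i=1}^d (\mathbf{e}_s)_i (\mathbf{r}_p)_i \overline{(\mathbf{e}_o)_i})\) for hand-picked one-dimensional embeddings; the values are purely illustrative.

```python
import numpy as np

def complex_score(e_s, r_p, e_o):
    """ComplEx plausibility: Re(sum_i (e_s)_i (r_p)_i conj((e_o)_i))."""
    return float(np.real(np.sum(e_s * r_p * np.conj(e_o))))

# Hand-picked one-dimensional complex embeddings (illustrative values only):
e_s, r_p, e_o = np.array([1 + 1j]), np.array([1j]), np.array([1 + 0j])

# Unlike DistMult, swapping subject and object can change the score:
assert complex_score(e_s, r_p, e_o) == -1.0
assert complex_score(e_o, r_p, e_s) == 1.0
```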

Neural models

A limitation of the aforementioned approaches is that they assume either linear (preserving addition and scalar multiplication) or bilinear (e.g., matrix multiplication) operations over embeddings to compute plausibility scores. Other approaches rather use neural networks to learn embeddings with non-linear scoring functions for plausibility.

One of the earliest proposals of a neural model was Semantic Matching Energy (SME) [Glorot et al., 2013], which learns parameters (aka weights: \(\mathbf{w}\), \(\mathbf{w}'\)) for two functions – \(f_{\mathbf{w}}(\mathbf{e}_s,\mathbf{r}_p)\) and \(g_{\mathbf{w}'}(\mathbf{e}_o,\mathbf{r}_p)\) – such that the dot product of the results of both functions – \(f_{\mathbf{w}}(\mathbf{e}_s,\mathbf{r}_p) \cdot g_{\mathbf{w}'}(\mathbf{e}_o,\mathbf{r}_p)\) – gives the plausibility score. Both linear and bilinear variants of \(f_{\mathbf{w}}\) and \(g_{\mathbf{w}'}\) are proposed. Another early proposal was Neural Tensor Networks (NTN) [Socher et al., 2013], which proposes to maintain a tensor \(\mathcal{W}\) of internal weights, such that the plausibility score is computed by a complex function that combines the outer product \(\mathbf{e}_s \otimes \mathcal{W} \otimes \mathbf{e}_o\) with a standard neural layer over \(\mathbf{e}_s\) and \(\mathbf{e}_o\), which in turn is combined with \(\mathbf{r}_p\), to produce a plausibility score. The tensor \(\mathcal{W}\) results in a high number of parameters, limiting scalability [Wang et al., 2017]. Multi-Layer Perceptron (MLP) [Dong et al., 2014] is a simpler model, where \(\mathbf{e}_s\), \(\mathbf{r}_p\) and \(\mathbf{e}_o\) are concatenated and fed into a hidden layer to compute plausibility scores.
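A minimal sketch of an MLP-style scorer follows; the single ReLU hidden layer and the dimensions are illustrative assumptions, not the exact architecture of Dong et al. [2014].

```python
import numpy as np

def mlp_score(e_s, r_p, e_o, W, w):
    """MLP-style plausibility: concatenate the three embeddings, apply a
    hidden layer with a non-linearity, then a linear output layer."""
    x = np.concatenate([e_s, r_p, e_o])   # input: 3d-dimensional vector
    h = np.maximum(0.0, W @ x)            # hidden layer (ReLU, illustrative)
    return float(w @ h)                   # scalar plausibility score

rng = np.random.default_rng(2)
d, hidden = 4, 8                          # illustrative dimensions
W = rng.normal(size=(hidden, 3 * d))      # hidden-layer weights
w = rng.normal(size=hidden)               # output-layer weights
score = mlp_score(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d), W, w)
assert isinstance(score, float)
```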

A number of more recent approaches have proposed using convolutional kernels in their models. ConvE [Dettmers et al., 2018] proposes to generate a matrix from \(\mathbf{e}_s\) and \(\mathbf{r}_p\) by “wrapping” each vector over several rows and concatenating both matrices. The concatenated matrix serves as the input for a set of (2D) convolutional layers, which returns a feature map tensor. The feature map tensor is vectorised and projected into \(d\) dimensions using a parameterised linear transformation. The plausibility score is then computed based on the dot product of this vector and \(\mathbf{e}_o\). A disadvantage of ConvE is that by wrapping vectors into matrices, it imposes an artificial two-dimensional structure on the embeddings. HypER [Balazevic et al., 2019c] is a similar model using convolutions, but avoids the need to wrap vectors into matrices. Instead, a fully connected layer (called the “hypernetwork”) is applied to \(\mathbf{r}_p\) and used to generate a matrix of relation-specific convolutional filters. These filters are applied directly to \(\mathbf{e}_s\) to give a feature map, which is vectorised. The same process is then applied as in ConvE: the resulting vector is projected into \(d\) dimensions, and a dot product applied with \(\mathbf{e}_o\) to produce the plausibility score. The resulting model is shown to outperform ConvE on standard benchmarks [Balazevic et al., 2019c].

The presented approaches strike different balances between expressivity and the number of parameters that need to be trained. While more expressive models, such as NTN, may better fit more complex plausibility functions over lower-dimensional embeddings by using more hidden parameters, simpler models, such as that proposed by Dong et al. [Dong et al., 2014], and convolutional networks [Dettmers et al., 2018, Balazevic et al., 2019c] that enable parameter sharing by applying the same (typically small) kernels over different regions of a matrix, require handling fewer parameters overall and are more scalable.

Survey and definition of tensor-based approaches

We now formally define and survey the aforementioned tensor-based approaches. For simplicity, we will consider directed edge-labelled graphs.

Before defining embeddings, we first introduce tensors.

Vector, matrix, tensor, order, mode
For any positive integer \(a\), a vector of dimension \(a\) is a family of real numbers indexed by integers in \(\{1, \ldots, a\}\). For \(a\) and \(b\) positive integers, an \((a,b)\)-matrix is a family of real numbers indexed by pairs of integers in \(\{1, \ldots, a\} \times \{1, \ldots, b\}\). A tensor is a family of real numbers indexed by a finite sequence of integers such that there exist positive integers \(a_1, \ldots, a_n\) such that the indices are all the tuples of numbers in \(\{1, \ldots, a_1\} \times \ldots \times \{1, \ldots, a_n\}\). The number \(n\) is called the order of the tensor, the subindices \(i\in \{1, \ldots, n\}\) indicate the mode of a tensor, and each \(a_i\) defines the dimension of the \(i\)th mode. A \(1\)-order tensor is a vector and a \(2\)-order tensor is a matrix. We denote the set of all tensors as \(\mathbb{T}\).

For specific dimensions \(a_1,\ldots,a_n\) of modes, a tensor is an element of \((\cdots(\mathbb{R}^{a_1})^{\ldots})^{a_n}\) but we write \(\mathbb{R}^{a_1,\ldots,a_n}\) to simplify the notation. We use lower-case bold font to denote vectors (\(\mathbf{x} \in \mathbb{R}^a\)), upper-case bold font to denote matrices (\(\mathbf{X} \in \mathbb{R}^{a,b}\)) and calligraphic font to denote tensors (\(\mathcal{X} \in \mathbb{R}^{a_1,\ldots,a_n}\)).

Now we are ready to abstractly define knowledge graph embeddings.

Knowledge graph embedding
Given a directed edge-labelled graph \(G = (V,E,L)\), a knowledge graph embedding of \(G\) is a pair of mappings \((\varepsilon,\rho)\) such that \(\varepsilon : V \rightarrow \mathbb{T}\) and \(\rho : L \rightarrow \mathbb{T}\).

In the most typical case, \(\varepsilon\) and \(\rho\) map nodes and edge-labels, respectively, to vectors of fixed dimension. In some cases, however, they may map to matrices. Given this abstract notion of a knowledge graph embedding, we can then define a plausibility scoring function.

Plausibility scores
A plausibility scoring function is a partial function \(\phi : \mathbb{T} \times \mathbb{T} \times \mathbb{T} \rightarrow \mathbb{R}\). Given a directed edge-labelled graph \(G = (V,E,L)\), an edge \((s,p,o) \in V \times L \times V\), and a knowledge graph embedding \((\varepsilon,\rho)\) of \(G\), the plausibility of \((s,p,o)\) is given as \(\phi(\varepsilon(s),\rho(p),\varepsilon(o))\).

Edges with higher scores are considered more plausible. Given a graph \(G = (V,E,L)\), we assume a set of positive edges \(E^+\) and a set of negative edges \(E^{-}\). Positive edges are often simply the edges in the graph: \(E^+ \coloneqq E\). Negative edges use the vocabulary of \(G\) (i.e., \(E^- \subseteq V \times L \times V\)) and are typically defined by taking edges \((s,p,o)\) from \(E\) and changing one term of each edge – often one of the nodes – such that the edge is no longer in \(E\). Given sets of positive and negative edges, and a plausibility scoring function, the objective is then to find the embedding that maximises the plausibility of edges in \(E^+\) while minimising the plausibility of edges in \(E^{-}\). Specific knowledge graph embeddings then instantiate the type of embedding considered and the plausibility scoring function in various ways.
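The generation of negative edges by corrupting positive ones can be sketched as follows; the vocabulary and edges are toy examples, and the (hypothetical) corrupt function simply resamples one node until the result is no longer in \(E\).

```python
import random

# Toy positive edges over a small vocabulary (hypothetical data).
V = ["Santiago", "Arica", "Chile"]
E = {("Santiago", "capital", "Chile"), ("Santiago", "flight", "Arica")}

def corrupt(edge, V, E, rng):
    """Derive a negative edge from a positive one by replacing the subject
    or the object with a random node, retrying until the result is not in E."""
    s, p, o = edge
    while True:
        cand = (rng.choice(V), p, o) if rng.random() < 0.5 else (s, p, rng.choice(V))
        if cand not in E:
            return cand

rng = random.Random(3)
negatives = [corrupt(e, V, E, rng) for e in E for _ in range(2)]
assert all(n not in E for n in negatives)
```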

In Table 5.1, we define the plausibility scoring function and types of embeddings used by different knowledge graph embeddings. To simplify the definitions, we use \(\mathbf{e}_x\) to denote \(\varepsilon(x)\) when it is a vector, \(\mathbf{r}_y\) to denote \(\rho(y)\) when it is a vector, and \(\mathbf{R}_y\) to denote \(\rho(y)\) when it is a matrix. Some models involve learnt parameters (aka weights) for computing plausibility. We denote these as \(\mathbf{v}\), \(\mathbf{V}\), \(\mathcal{V}\), \(\mathbf{w}\), \(\mathbf{W}\), \(\mathcal{W}\) (for vectors, matrices or tensors). We use \(d_e\) and \(d_r\) to denote the dimensionality chosen for entity embeddings and relation embeddings, respectively. Often it is assumed that \(d_e = d_r\), in which case we will write \(d\). Weights may have their own dimensionality, which we denote \(w\). The embeddings in Table 5.1 use a variety of operators on vectors, matrices and tensors, which will be defined later.

The embeddings defined in Table 5.1 vary in complexity, where a trade-off exists between the number of parameters used, and the expressiveness of the model in terms of its capability to capture latent features of the graph. To increase expressivity, many of the models in Table 5.1 use additional parameters beyond the embeddings themselves. A possible formal guarantee of such models is full expressiveness, which, given any disjoint sets of positive edges \(E^+\) and negative edges \(E^{-}\), asserts that the model can always correctly partition those edges. On the one hand, for example, DistMult [Yang et al., 2015] cannot distinguish an edge spo from its inverse ops, so by adding an inverse of an edge in \(E^+\) to \(E^{-}\), we can show that it is not fully expressive. On the other hand, models such as ComplEx [Trouillon et al., 2016], SimplE [Kazemi and Poole, 2018], and TuckER [Balazevic et al., 2019b] have been proven to be fully expressive given sufficient dimensionality; for example, TuckER [Balazevic et al., 2019b] with dimensions \(d_r = |L|\) and \(d_e = |V|\) trivially satisfies full expressivity since its core tensor \(\mathcal{W}\) then has sufficient capacity to store the full one-hot encoding of any graph. This formal property is useful to show that the model does not have built-in limitations for numerically representing a graph, though of course in practice the dimensions needed to reach full expressivity are often impractical/undesirable.

We continue by first defining the conventions used in Table 5.1.

  • We use \((\mathbf{x})_{i}\), \((\mathbf{X})_{ij}\), and \((\mathcal{X})_{{i_1}\ldots{i_n}}\) to denote elements of vectors, matrices, and tensors, respectively. If a vector \(\mathbf{x} \in \mathbb{R}^a\) is used in a context that requires a matrix, the vector is interpreted as an \((a, 1)\)-matrix (i.e., a column vector) and can be turned into a row vector (i.e., a \((1,a)\)-matrix) using the transpose operation \(\mathbf{x}^T\). We use \(\mathbf{x}^\mathrm{D} \in \mathbb{R}^{a,a}\) to denote the diagonal matrix with the values of the vector \(\mathbf{x} \in \mathbb{R}^{a}\) on its diagonal. We denote the identity matrix by \(\mathbf{I}\) such that if \(j=k\), then \((\mathbf{I})_{jk} = 1\); otherwise \((\mathbf{I})_{jk} = 0\).
  • We denote by \(\begin{bmatrix}\mathbf{X}_1\\[-0.5ex]\vdots\\\mathbf{X}_n\end{bmatrix}\) the vertical stacking of matrices \(\mathbf{X}_1, \ldots, \mathbf{X}_n\) with the same number of columns. Given a vector \(\mathbf{x} \in \mathbb{R}^{ab}\), we denote by \(\mathbf{x}^{[a,b]} \in \mathbb{R}^{a,b}\) the “reshaping” of \(\mathbf{x}\) into an \((a,b)\)-matrix such that \((\mathbf{x}^{[a,b]})_{ij} = (\mathbf{x})_{(i + a(j-1))}\). Conversely, given a matrix \(\mathbf{X} \in \mathbb{R}^{a,b}\), we denote by \(\mathrm{vec}(\mathbf{X}) \in \mathbb{R}^{ab}\) the vectorisation of \(\mathbf{X}\) such that \(\mathrm{vec}(\mathbf{X})_k = (\mathbf{X})_{ij}\) where \(i = ((k-1)\,\mathrm{mod}\,a) + 1\) and \(j = \frac{k - i}{a} + 1\) (observe that \(\mathrm{vec}(\mathbf{x}^{[a,b]}) = \mathbf{x}\)).
  • Given a tensor \(\mathcal{X} \in \mathbb{R}^{a,b,c}\), we denote by \(\mathcal{X}^{[i:\cdot:\cdot]} \in \mathbb{R}^{b,c}\), the \(i\)th slice of tensor \(\mathcal{X}\) along the first mode; for example, given \(\mathcal{X} \in \mathbb{R}^{5,2,3}\), then \(\mathcal{X}^{[4:\cdot:\cdot]}\) returns the \((2,3)\)-matrix consisting of the elements \(\begin{bmatrix} (\mathcal{X})_{411} & (\mathcal{X})_{412} & (\mathcal{X})_{413} \\ (\mathcal{X})_{421} & (\mathcal{X})_{422} & (\mathcal{X})_{423} \end{bmatrix}\). Analogously, we use \(\mathcal{X}^{[\cdot : i : \cdot]} \in \mathbb{R}^{a,c}\) and \(\mathcal{X}^{[\cdot:\cdot:i]} \in \mathbb{R}^{a,b}\) to indicate the \(i\)th slice along the second and third modes of \(\mathcal{X}\), respectively.
  • We denote by \(\psi(\mathcal{X})\) the element-wise application of a function \(\psi\) to the tensor \(\mathcal{X}\), such that \((\psi(\mathcal{X}))_{i_1\ldots i_n} = \psi((\mathcal{X})_{i_1\ldots i_n})\). Common choices for \(\psi\) include a sigmoid function (e.g., the logistic function \(\psi(x) = \frac{1}{1 + e^{-x}}\) or the hyperbolic tangent function \(\psi(x) = \mathrm{tanh}\,x = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)), the rectifier (\(\psi(x) = \mathrm{max}(0,x)\)), softplus (\(\psi(x) = \mathrm{ln}(1 + e^x)\)), etc.
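The reshaping and vectorisation conventions above correspond to column-major order, which can be checked with NumPy’s order="F" option:

```python
import numpy as np

a, b = 3, 4
x = np.arange(1.0, a * b + 1)            # a vector with ab elements: 1..12

# x^{[a,b]} places elements column by column, i.e., column-major
# ("Fortran") order in NumPy terms; vec(.) is its inverse.
X = x.reshape((a, b), order="F")
assert X[1, 2] == x[1 + a * 2]           # (x^{[a,b]})_{ij} = (x)_{i+a(j-1)}, 0-indexed
assert np.array_equal(X.flatten(order="F"), x)   # vec(x^{[a,b]}) = x
```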

We now define the operators used in Table 5.1, where the first and most elemental operation we consider is that of matrix multiplication.

Matrix multiplication
The multiplication of matrices \(\mathbf{X} \in \mathbb{R}^{a,b}\) and \(\mathbf{Y} \in \mathbb{R}^{b,c}\) is a matrix \(\mathbf{XY} \in \mathbb{R}^{a,c}\) such that \((\mathbf{XY})_{ij} = \sum_{k=1}^b (\mathbf{X})_{ik}(\mathbf{Y})_{kj}\). The matrix multiplication of two tensors \(\mathcal{X} \in \mathbb{R}^{a_1,\ldots,a_m,c}\) and \(\mathcal{Y} \in \mathbb{R}^{c,b_1,\ldots,b_n}\) is a tensor \(\mathcal{XY} \in \mathbb{R}^{a_1,\ldots,a_{m},b_{1},\ldots,b_{n}}\) such that \((\mathcal{XY})_{i_1\ldots i_m i_{m+1}\ldots i_{m+n}} = \sum_{k=1}^c (\mathcal{X})_{i_1\ldots i_m k}(\mathcal{Y})_{k i_{m+1}\ldots i_{m+n}}\).

For convenience, we may implicitly add or remove modes with dimension 1 for the purposes of matrix multiplication and other operators; for example, given two vectors \(\mathbf{x} \in \mathbb{R}^{a}\) and \(\mathbf{y} \in \mathbb{R}^{a}\), we denote by \(\mathbf{x}^{\mathrm{T}}\mathbf{y}\) (aka the dot or inner product) the multiplication of matrix \(\mathbf{x}^{\mathrm{T}} \in \mathbb{R}^{1,a}\) with \(\mathbf{y} \in \mathbb{R}^{a,1}\) such that \(\mathbf{x}^{\mathrm{T}}\mathbf{y} \in \mathbb{R}^{1,1}\) (i.e., a scalar in \(\mathbb{R}\)); conversely, \(\mathbf{x}\mathbf{y}^{\mathrm{T}} \in \mathbb{R}^{a,a}\) (the outer product).

Constraints on embeddings are sometimes given as norms, defined next.

\(L^p\)-norm, \(L^{p,q}\)-norm
For \(p\in \mathbb{R}\), the \(L^p\)-norm of a vector \(\mathbf{x}\in \mathbb{R}^a\) is the scalar \(\|\mathbf{x}\|_p \coloneqq (|(\mathbf{x})_1|^p + \ldots + |(\mathbf{x})_a|^p)^{\frac{1}{p}}\), where \(|(\mathbf{x})_i|\) denotes the absolute value of the \(i\)th element of \(\mathbf{x}\). For \(p,q\in \mathbb{R}\), the \(L^{p,q}\)-norm of a matrix \(\mathbf{X}\in\mathbb{R}^{a,b}\) is the scalar \(\|\mathbf{X}\|_{p,q} \coloneqq \left( \sum_{j=1}^b \left( \sum_{i=1}^a |(\mathbf{X})_{ij}|^p \right)^{\frac{q}{p}} \right)^\frac{1}{q}\).

The \(L^1\) norm (i.e., \(\|\mathbf{x}\|_1\)) is thus simply the sum of the absolute values of \(\mathbf{x}\), while the \(L^2\) norm (i.e., \(\|\mathbf{x}\|_2\)) is the (Euclidean) length of the vector. The Frobenius norm of the matrix \(\mathbf{X}\) then equates to \(\|\mathbf{X}\|_{2,2} = \left( \sum_{j=1}^b \left( \sum_{i=1}^a |(\mathbf{X})_{ij}|^2 \right) \right)^\frac{1}{2}\); i.e., the square root of the sum of the squares of all elements.
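These norms can be checked on small worked examples:

```python
import numpy as np

x = np.array([3.0, -4.0])
assert np.linalg.norm(x, 1) == 7.0       # L1 norm: sum of absolute values
assert np.linalg.norm(x, 2) == 5.0       # L2 norm: Euclidean length

X = np.array([[1.0, 2.0], [2.0, 4.0]])
# The Frobenius norm equals the L^{2,2} norm: the square root of the sum
# of the squares of all elements, here sqrt(1 + 4 + 4 + 16) = 5.
assert np.linalg.norm(X, "fro") == 5.0
```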

Another type of product used by embedding techniques is the Hadamard product, which multiplies tensors of the same dimension and computes their product in an element-wise manner.

Hadamard product
Given two tensors \(\mathcal{X} \in \mathbb{R}^{a_1,\ldots,a_n}\) and \(\mathcal{Y} \in \mathbb{R}^{a_1,\ldots,a_n}\), the Hadamard product \(\mathcal{X} \odot \mathcal{Y}\) is defined as a tensor in \(\mathbb{R}^{a_1,\ldots,a_n}\), with each element computed as \((\mathcal{X} \odot \mathcal{Y})_{i_1\ldots i_{n}} \coloneqq (\mathcal{X})_{i_1\ldots i_{n}} (\mathcal{Y})_{i_1\ldots i_{n}}\).

Other embedding techniques – namely RotatE [Sun et al., 2019] and ComplEx [Trouillon et al., 2016] – use complex space based on complex numbers. With a slight abuse of notation, the definitions of vectors, matrices and tensors can be modified by replacing the set of real numbers \(\mathbb{R}\) by the set of complex numbers \(\mathbb{C}\), giving rise to complex vectors, complex matrices, and complex tensors. In this case, we denote by \(\mathrm{Re}(\cdot)\) the real part of a complex number. Given a complex vector \(\mathbf{x} \in \mathbb{C}^a\), we denote by \(\overline{\mathbf{x}}\) its complex conjugate (swapping the sign of the imaginary part of each element). Complex analogues of the aforementioned operators can then be defined by replacing the multiplication and addition of real numbers with the analogous operators for complex numbers, where RotatE [Sun et al., 2019] uses the complex Hadamard product, and ComplEx [Trouillon et al., 2016] uses complex matrix multiplication.

One embedding technique – MuRP [Balazevic et al., 2019a] – uses hyperbolic space, specifically based on the Poincaré ball. As this is the only embedding we cover that uses this space, and the formalisms are lengthy (covering the Poincaré ball, Möbius addition, Möbius matrix–vector multiplication, logarithmic maps, exponential maps, etc.), we rather refer the reader to the paper for further details [Balazevic et al., 2019a].

As discussed in Section 5.2, tensor decompositions are used for many embeddings, and at the heart of such decompositions is the tensor product, which is often used to reconstruct (an approximation of) the original tensor.

Tensor product
Given two tensors \(\mathcal{X} \in \mathbb{R}^{a_1,\ldots,a_m}\) and \(\mathcal{Y} \in \mathbb{R}^{b_1,\ldots,b_n}\), the tensor product \(\mathcal{X} \otimes \mathcal{Y}\) is defined as a tensor in \(\mathbb{R}^{a_1,\ldots,a_m,b_1,\ldots,b_n}\), with each element computed as \((\mathcal{X} \otimes \mathcal{Y})_{i_1\ldots i_{m}j_1\ldots j_n} \coloneqq (\mathcal{X})_{i_1 \ldots i_m} (\mathcal{Y})_{j_1 \ldots j_n}\).24note 24 Please note that “\(\otimes\)” is used here in an unrelated sense to its use in Definition 3.10.

Assume that \(\mathcal{X} \in \mathbb{R}^{2,3}\) and \(\mathcal{Y} \in \mathbb{R}^{3,4,5}\). Then \(\mathcal{X} \otimes \mathcal{Y}\) will be a tensor in \(\mathbb{R}^{2,3,3,4,5}\). Element \((\mathcal{X} \otimes \mathcal{Y})_{12345}\) will be the product of \((\mathcal{X})_{12}\) and \((\mathcal{Y})_{345}\).
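This example can be checked directly, where NumPy’s multiply.outer computes precisely this tensor product on random data:

```python
import numpy as np

rng = np.random.default_rng(4)
X, Y = rng.normal(size=(2, 3)), rng.normal(size=(3, 4, 5))

# The tensor product concatenates the modes of its two arguments:
T = np.multiply.outer(X, Y)
assert T.shape == (2, 3, 3, 4, 5)
# (X tensor Y)_{12345} = (X)_{12} (Y)_{345} (1-indexed; 0-indexed below):
assert np.isclose(T[0, 1, 2, 3, 4], X[0, 1] * Y[2, 3, 4])
```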

An \(n\)-mode product is used by other embeddings to transform elements along a given mode of a tensor by computing a product with a given matrix along that particular mode of the tensor.

\(n\)-mode product
For a positive integer \(n\), a tensor \(\mathcal{X} \in \mathbb{R}^{a_1,\ldots,a_{n-1},a_n,a_{n+1},\ldots,a_m}\) and matrix \(\mathbf{Y} \in \mathbb{R}^{b,a_n}\), the \(n\)-mode product of \(\mathcal{X}\) and \(\mathbf{Y}\) is the tensor \(\mathcal{X} \otimes_n \mathbf{Y} \in \mathbb{R}^{a_1,\ldots,a_{n-1},b,a_{n+1},\ldots,a_m}\) such that \((\mathcal{X} \otimes_n \mathbf{Y})_{i_1\ldots i_{n-1}ji_{n+1}\ldots i_m} \coloneqq \sum_{k=1}^{a_n} (\mathcal{X})_{i_1 \ldots i_{n-1}ki_{n+1} \ldots i_m} (\mathbf{Y})_{jk}\).

Let us assume that \(\mathcal{X} \in \mathbb{R}^{2,3,4}\) and \(\mathbf{Y} \in \mathbb{R}^{5,3}\). The result of \(\mathcal{X} \otimes_2 \mathbf{Y}\) will be a tensor in \(\mathbb{R}^{2,5,4}\), where, for example, \((\mathcal{X} \otimes_2 \mathbf{Y})_{142}\) will be given as \((\mathcal{X})_{112}(\mathbf{Y})_{41} + (\mathcal{X})_{122}(\mathbf{Y})_{42} + (\mathcal{X})_{132}(\mathbf{Y})_{43}\). Observe that if \(\mathbf{y} \in \mathbb{R}^{a_n}\) – i.e., if \(\mathbf{y}\) is a (column) vector – then the \(n\)-mode tensor product \(\mathcal{X} \otimes_n \mathbf{y}^{\mathrm{T}}\) “flattens” the \(n\)th mode of \(\mathcal{X}\) to one dimension, effectively reducing the order of \(\mathcal{X}\) by one.
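The \(n\)-mode product is a tensor contraction; the following sketch checks the element formula of the example above on random data using einsum (with 0-indexed arrays):

```python
import numpy as np

rng = np.random.default_rng(5)
X, Y = rng.normal(size=(2, 3, 4)), rng.normal(size=(5, 3))

# 2-mode product: contract the second mode of X with the columns of Y.
Z = np.einsum("ikl,jk->ijl", X, Y)
assert Z.shape == (2, 5, 4)
# (X o_2 Y)_{142} (1-indexed) = sum_k (X)_{1k2} (Y)_{4k}:
assert np.isclose(Z[0, 3, 1], sum(X[0, k, 1] * Y[3, k] for k in range(3)))
```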

One embedding technique – HolE [Nickel et al., 2016b] – uses the circular correlation operator \(\mathbf{x} \star \mathbf{y}\), where each element is the sum of the elements along a diagonal of the outer product \(\mathbf{x} \otimes \mathbf{y}\), “wrapping” around the matrix for diagonals other than the principal one.

Circular correlation
The circular correlation of vector \(\mathbf{x} \in \mathbb{R}^a\) with \(\mathbf{y} \in \mathbb{R}^a\) is the vector \(\mathbf{x} \star \mathbf{y} \in \mathbb{R}^{a}\) such that \((\mathbf{x} \star \mathbf{y})_k \coloneqq \sum_{i=1}^a (\mathbf{x})_i (\mathbf{y})_{(((k+i-2) \,\mathrm{mod}\,a)+1)}\).

Assuming \(a = 5\), then \((\mathbf{x} \star \mathbf{y})_1 = (\mathbf{x})_1(\mathbf{y})_1 + (\mathbf{x})_2(\mathbf{y})_2 + (\mathbf{x})_3(\mathbf{y})_3 + (\mathbf{x})_4(\mathbf{y})_4 + (\mathbf{x})_5(\mathbf{y})_5\), or a case that wraps: \((\mathbf{x} \star \mathbf{y})_4 = (\mathbf{x})_1(\mathbf{y})_4 + (\mathbf{x})_2(\mathbf{y})_5 + (\mathbf{x})_3(\mathbf{y})_1 + (\mathbf{x})_4(\mathbf{y})_2 + (\mathbf{x})_5(\mathbf{y})_3\).
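A direct implementation of circular correlation follows (0-indexed); as a sanity check, it agrees with the \(O(a \log a)\) computation via the fast Fourier transform, which HolE uses in practice:

```python
import numpy as np

def circ_corr(x, y):
    """Circular correlation (0-indexed): (x star y)_k = sum_i x_i y_((k+i) mod a),
    matching the 1-indexed definition above."""
    a = len(x)
    return np.array([sum(x[i] * y[(k + i) % a] for i in range(a))
                     for k in range(a)])

x, y = np.arange(1.0, 6.0), np.arange(6.0, 11.0)
z = circ_corr(x, y)
# k = 1 (1-indexed) sums the main diagonal of the outer product:
assert np.isclose(z[0], np.sum(x * y))
# The same result in O(a log a) time via the fast Fourier transform:
assert np.allclose(z, np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(y))))
```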

Finally, a couple of neural models that we include – namely ConvE [Dettmers et al., 2018] and HypER [Balazevic et al., 2019c] – are based on convolutional architectures using the convolution operator.

Convolution
Given two matrices \(\mathbf{X} \in \mathbb{R}^{a,b}\) and \(\mathbf{Y} \in \mathbb{R}^{e,f}\), the convolution of \(\mathbf{X}\) and \(\mathbf{Y}\) is the matrix \(\mathbf{X} * \mathbf{Y} \in \mathbb{R}^{(a + e - 1),(b + f - 1)}\) such that \((\mathbf{X} * \mathbf{Y})_{ij} = \sum_{k=1}^a \sum_{l=1}^b (\mathbf{X})_{kl} (\mathbf{Y})_{(i+k-a)(j+l-b)}\). In cases where \((i+k-a) < 1\), \((j+l-b) < 1\), \((i+k-a) > e\) or \((j+l-b) > f\) (i.e., where \((\mathbf{Y})_{(i+k-a)(j+l-b)}\) lies outside the bounds of \(\mathbf{Y}\)), we say that \((\mathbf{Y})_{(i+k-a)(j+l-b)} = 0\).25note 25 We define the convolution operator per the widely-used convention for convolutional neural networks. Strictly speaking, the operator should be called cross-correlation, where traditional convolution requires the matrix \(\mathbf{X}\) to be initially “rotated” by 180°. Since in our settings the matrix \(\mathbf{X}\) is learnt, rather than given, the rotation is redundant, and hence the distinction is not important.

Intuitively speaking, the convolution operator overlays \(\mathbf{X}\) in every possible way over \(\mathbf{Y}\) such that at least one pair of elements \((\mathbf{X})_{ij},(\mathbf{Y})_{lk}\) overlaps, summing the products of pairs of overlapping elements to generate an element of the result. Elements of \(\mathbf{X}\) extending beyond \(\mathbf{Y}\) are ignored (equivalently we can consider \(\mathbf{Y}\) to be “zero-padded” outside its borders).

Given \(\mathbf{X} \in \mathbb{R}^{3,3}\) and \(\mathbf{Y} \in \mathbb{R}^{4,5}\), then \(\mathbf{X} * \mathbf{Y} \in \mathbb{R}^{6,7}\), where, for example, \((\mathbf{X} * \mathbf{Y})_{11} = (\mathbf{X})_{33}(\mathbf{Y})_{11}\) (with the bottom right corner of \(\mathbf{X}\) overlapping the top left corner of \(\mathbf{Y}\)), while \((\mathbf{X} * \mathbf{Y})_{34} = (\mathbf{X})_{11}(\mathbf{Y})_{12} + \)\( (\mathbf{X})_{12}(\mathbf{Y})_{13} + \)\( (\mathbf{X})_{13}(\mathbf{Y})_{14} + \)\( (\mathbf{X})_{21}(\mathbf{Y})_{22} + \)\( (\mathbf{X})_{22}(\mathbf{Y})_{23} + \)\( (\mathbf{X})_{23}(\mathbf{Y})_{24} + \)\( (\mathbf{X})_{31}(\mathbf{Y})_{32} + \)\( (\mathbf{X})_{32}(\mathbf{Y})_{33} + \)\( (\mathbf{X})_{33}(\mathbf{Y})_{34}\) (with \((\mathbf{X})_{22}\) – the centre of \(\mathbf{X}\) – overlapping \((\mathbf{Y})_{23}\)).26note 26 Models applying convolutions may differ regarding how edge cases are handled, or on the “stride” of the convolution applied, where, for example, a stride of 3 for \((\mathbf{X} * \mathbf{Y})\) would see the kernel \(\mathbf{X}\) centred only on elements \((\mathbf{Y})_{ij}\) such that \(i\,\mathrm{mod}\,3 = 0\) and \(j\,\mathrm{mod}\,3 = 0\), reducing the number of output elements by a factor of 9. We do not consider such details here.
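The definition can be implemented directly as a deliberately naive quadruple loop (zero-padding \(\mathbf{Y}\) outside its borders); optimised routines are used in practice:

```python
import numpy as np

def conv(X, Y):
    """Convolution per the definition above, 0-indexed, zero-padding Y
    outside its borders."""
    a, b = X.shape
    e, f = Y.shape
    Z = np.zeros((a + e - 1, b + f - 1))
    for i in range(a + e - 1):
        for j in range(b + f - 1):
            for k in range(a):
                for l in range(b):
                    r, c = i + k - a + 1, j + l - b + 1   # indices into Y
                    if 0 <= r < e and 0 <= c < f:
                        Z[i, j] += X[k, l] * Y[r, c]
    return Z

rng = np.random.default_rng(6)
X, Y = rng.normal(size=(3, 3)), rng.normal(size=(4, 5))
Z = conv(X, Y)
assert Z.shape == (6, 7)
assert np.isclose(Z[0, 0], X[2, 2] * Y[0, 0])   # (X*Y)_{11} = (X)_{33}(Y)_{11}
```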

In a convolution \(\mathbf{X} * \mathbf{Y}\), the matrix \(\mathbf{X}\) is often called the “kernel” (or “filter”). Often several kernels are used in order to apply multiple convolutions. Given a tensor \(\mathcal{X} \in \mathbb{R}^{c,a,b}\) (representing \(c\) \((a,b)\)-kernels) and a matrix \(\mathbf{Y} \in \mathbb{R}^{e,f}\), we denote by \(\mathcal{X} * \mathbf{Y} \in \mathbb{R}^{c,(a + e - 1),(b + f - 1)}\) the result of the convolutions of the \(c\) first-mode slices of \(\mathcal{X}\) over \(\mathbf{Y}\) such that \((\mathcal{X} * \mathbf{Y})^{[i:\cdot:\cdot]} = \mathcal{X}^{[i:\cdot:\cdot]} * \mathbf{Y}\) for \(1 \leq i \leq c\), yielding a tensor of results for \(c\) convolutions.

Table 5.1: Details for selected knowledge graph embeddings, including the plausibility scoring function \(\phi(\varepsilon(s),\rho(p),\varepsilon(o))\) for an edge \(s\)–\(p\)–\(o\), and other conditions (for all \(x \in V\), \(y \in L\)):

  • TransE: \(\phi = - \|\mathbf{e}_s + \mathbf{r}_p - \mathbf{e}_o\|_q\), where \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(q \in \{1,2\}\), \(\|\mathbf{e}_x\|_2 = 1\).
  • TransH: \(\phi = -\|(\mathbf{e}_s - (\mathbf{e}_s^{\mathrm{T}}\mathbf{w}_p)\mathbf{w}_p) + \mathbf{r}_p - (\mathbf{e}_o - (\mathbf{e}_o^{\mathrm{T}} \mathbf{w}_p)\mathbf{w}_p)\|^{2}_{2}\), where \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(\mathbf{w}_y \in \mathbb{R}^d\), \(\|\mathbf{w}_y\|_2 = 1\), \(\frac{\mathbf{w}_y^{\mathrm{T}} \mathbf{r}_y}{\|\mathbf{r}_y\|_2} \approx 0\), \(\|\mathbf{e}_x\|_2 \leq 1\).
  • TransR: \(\phi = -\|\mathbf{W}_p\mathbf{e}_s + \mathbf{r}_p - \mathbf{W}_p\mathbf{e}_o\|^{2}_{2}\), where \(\mathbf{e}_x \in \mathbb{R}^{d_e}\), \(\mathbf{r}_y \in \mathbb{R}^{d_r}\), \(\mathbf{W}_y \in \mathbb{R}^{d_r,d_e}\), \(\|\mathbf{e}_x\|_2 \leq 1\), \(\|\mathbf{r}_y\|_2 \leq 1\), \(\|\mathbf{W}_y\mathbf{e}_x\|_2 \leq 1\).
  • TransD: \(\phi = -\|(\mathbf{w}_p\otimes\mathbf{w}_s + \mathbf{I})\mathbf{e}_s + \mathbf{r}_p - (\mathbf{w}_p\otimes\mathbf{w}_o + \mathbf{I})\mathbf{e}_o\|^{2}_{2}\), where \(\mathbf{e}_x \in \mathbb{R}^{d_e}\), \(\mathbf{r}_y \in \mathbb{R}^{d_r}\), \(\mathbf{w}_x \in \mathbb{R}^{d_e}\), \(\mathbf{w}_y \in \mathbb{R}^{d_r}\), \(\|\mathbf{e}_x\|_2 \leq 1\), \(\|\mathbf{r}_y\|_2 \leq 1\), \(\|(\mathbf{w}_y\otimes\mathbf{w}_x + \mathbf{I})\mathbf{e}_x\|_2 \leq 1\).
  • RotatE: \(\phi = - \|\mathbf{e}_s \odot \mathbf{r}_p - \mathbf{e}_o\|_2\), where \(\mathbf{e}_x \in \mathbb{C}^{d}\), \(\mathbf{r}_y \in \mathbb{C}^{d}\), \(\|\mathbf{r}_y\|_2 = 1\).
  • RESCAL: \(\phi = \mathbf{e}_s^{\mathrm{T}} \mathbf{R}_p \mathbf{e}_o\), where \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{R}_y \in \mathbb{R}^{d,d}\), \(\|\mathbf{e}_x\|_2 \leq 1\), \(\|\mathbf{R}_y\|_{2,2} \leq 1\).
  • DistMult: \(\phi = \mathbf{e}_s^{\mathrm{T}} \mathbf{r}_p^{\mathrm{D}} \mathbf{e}_o\), where \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(\|\mathbf{e}_x\|_2 = 1\), \(\|\mathbf{r}_y\|_2 \leq 1\).
  • HolE: \(\phi = \mathbf{r}_p^{\mathrm{T}} (\mathbf{e}_s \star \mathbf{e}_o)\), where \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(\|\mathbf{e}_x\|_2 \leq 1\), \(\|\mathbf{r}_y\|_2 \leq 1\).
  • ComplEx: \(\phi = \mathrm{Re}(\mathbf{e}_s^{\mathrm{T}} \mathbf{r}_p^{\mathrm{D}} \overline{\mathbf{e}}_o)\), where \(\mathbf{e}_x \in \mathbb{C}^{d}\), \(\mathbf{r}_y \in \mathbb{C}^{d}\), \(\|\mathbf{e}_x\|_2 \leq 1\), \(\|\mathbf{r}_y\|_2 \leq 1\).
  • SimplE: \(\phi = \frac{\mathbf{e}_s^{\mathrm{T}} \mathbf{r}_p^{\mathrm{D}} \mathbf{w}_o + \mathbf{e}_o^{\mathrm{T}} \mathbf{w}_p^{\mathrm{D}} \mathbf{w}_s}{2}\), where \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(\mathbf{w}_x \in \mathbb{R}^{d}\), \(\mathbf{w}_y \in \mathbb{R}^{d}\), \(\|\mathbf{e}_x\|_2 \leq 1\), \(\|\mathbf{w}_x\|_2 \leq 1\), \(\|\mathbf{r}_y\|_2 \leq 1\), \(\|\mathbf{w}_y\|_2 \leq 1\).
  • TuckER: \(\phi = \mathcal{W} \otimes_1 \mathbf{e}_s^{\mathrm{T}} \otimes_2 \mathbf{r}_p^{\mathrm{T}} \otimes_3 \mathbf{e}_o^{\mathrm{T}}\), where \(\mathbf{e}_x \in \mathbb{R}^{d_e}\), \(\mathbf{r}_y \in \mathbb{R}^{d_r}\), \(\mathcal{W} \in \mathbb{R}^{d_e,d_r,d_e}\).
  • SME (linear): \(\phi = (\mathbf{V}\mathbf{e}_s + \mathbf{V}'\mathbf{r}_p + \mathbf{v})^{\mathrm{T}} (\mathbf{W}\mathbf{e}_o + \mathbf{W}'\mathbf{r}_p + \mathbf{w})\), where \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(\mathbf{v} \in \mathbb{R}^w\), \(\mathbf{w} \in \mathbb{R}^w\), \(\|\mathbf{e}_x\|_2 = 1\), \(\mathbf{V} \in \mathbb{R}^{w,d}\), \(\mathbf{V}' \in \mathbb{R}^{w,d}\), \(\mathbf{W} \in \mathbb{R}^{w,d}\), \(\mathbf{W}' \in \mathbb{R}^{w,d}\).
SME Bi. \(\T{((\mathcal{V} \otimes_3 \T{\mathbf{r}_p}) \mathbf{e}_s + \mathbf{v})}((\mathcal{W} \otimes_3 \T{\mathbf{r}_p}) \mathbf{e}_o + \mathbf{w})\) \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(\mathbf{v} \in \mathbb{R}^w\), \(\mathbf{w} \in \mathbb{R}^w\), \(\|\mathbf{e}_x\|_2 = 1\),
\(\mathcal{V} \in \mathbb{R}^{w,d,d}\), \(\mathcal{W} \in \mathbb{R}^{w,d,d}\)
NTN \(\T{\mathbf{r}_p} \psi\left(\T{\mathbf{e}_s} \mathcal{W} \mathbf{e}_o + \mathbf{W} \begin{bmatrix}\mathbf{e}_s\\\mathbf{e}_o\end{bmatrix} + \mathbf{w}\right) \) \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(\mathbf{w} \in \mathbb{R}^{w}\), \(\mathbf{W} \in \mathbb{R}^{w , 2d}\),
\(\mathcal{W} \in \mathbb{R}^{d,w,d}\), \(\|\mathbf{e}_x\|_2 \leq 1\), \(\|\mathbf{r}_y\|_2 \leq 1\),
\(\|\mathbf{w}\|_2 \leq 1\), \(\|\mathbf{W}\|_{2,2} \leq 1\), \(\|\mathcal{W}^{[\cdot:i:\cdot]}_{1\leq i \leq w}\|_{2,2} \leq 1\)
MLP \(\T{\mathbf{v}} \psi\left(\mathbf{W} \begin{bmatrix}\mathbf{e}_s\\\mathbf{r}_p\\\mathbf{e}_o\end{bmatrix} + \mathbf{w}\right) \) \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(\mathbf{v} \in \mathbb{R}^{w}\), \(\mathbf{w} \in \mathbb{R}^{w}\), \(\mathbf{W} \in \mathbb{R}^{w , 3d}\)
\(\|\mathbf{e}_x\|_2 \leq 1\), \(\|\mathbf{r}_y\|_2 \leq 1\)
ConvE \(\psi\left(\T{\mathrm{vec}\left(\psi\left( \mathcal{W} * \begin{bmatrix}\mathbf{e}_s^{[a, b]}\\\mathbf{r}_p^{[a, b]}\end{bmatrix} \right)\right)} \mathbf{W}\right) \mathbf{e}_o \) \(\mathbf{e}_x \in \mathbb{R}^{d}\), \(\mathbf{r}_y \in \mathbb{R}^{d}\), \(d = ab\),
\(\mathbf{W} \in \mathbb{R}^{w_1(w_2 + 2a - 1)(w_3 + b - 1) , d}\), \(\mathcal{W} \in \mathbb{R}^{w_1 , w_2 , w_3}\)
HypER \(\psi\left(\T{\mathrm{vec}\left( \T{\mathbf{r}_p} \mathcal{W} * \mathbf{e}_s \right)} \mathbf{W} \right) \mathbf{e}_o\) \(\mathbf{e}_x \in \mathbb{R}^{d_e}\), \(\mathbf{r}_y \in \mathbb{R}^{d_r}\), \(\mathbf{W} \in \mathbb{R}^{w_2(w_1 + d_e - 1) , d_e}\),
\(\mathcal{W} \in \mathbb{R}^{d_r , w_1 , w_2}\)

Language models

Embedding techniques were first explored as a way to represent natural language within machine learning frameworks, with word2vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014] being two seminal approaches. Both approaches compute embeddings for words based on large corpora of text such that words used in similar contexts (e.g., “frog”, “toad”) have similar vectors. Word2vec uses neural networks trained either to predict the current word from surrounding words (continuous bag of words), or to predict the surrounding words given the current word (continuous skip-gram). GloVe rather applies a regression model over a matrix of co-occurrence probabilities of word pairs. Embeddings generated by both approaches have become widely used in natural language processing tasks.
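To give a flavour of such count-based methods, the following toy sketch (a simplified SVD factorisation of a co-occurrence matrix, not GloVe's actual weighted regression model) shows how words sharing contexts obtain similar vectors; the corpus and dimensions are illustrative:

```python
import numpy as np

def cooccurrence_embeddings(sentences, window=2, dim=2):
    """Count word co-occurrences within a window and factorise the
    (log-dampened) count matrix with an SVD to get word vectors."""
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    C = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if i != j:
                    C[idx[w], idx[s[j]]] += 1
    U, S, _ = np.linalg.svd(np.log1p(C))  # dampen raw counts
    return vocab, U[:, :dim] * S[:dim]

corpus = [["the", "frog", "swims"], ["the", "toad", "swims"],
          ["the", "dog", "barks"]]
vocab, emb = cooccurrence_embeddings(corpus)
```

Here "frog" and "toad" have identical co-occurrence rows, so their resulting vectors coincide, illustrating the "similar contexts, similar vectors" principle.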

Another approach for graph embeddings is thus to leverage proven approaches for language embeddings. However, while a graph consists of an unordered set of sequences of three terms (i.e., a set of edges), text in natural language consists of arbitrary-length sequences of terms (i.e., sentences of words). RDF2Vec [Ristoski and Paulheim, 2016] thus performs (biased [Cochez et al., 2017a]) random walks on the graph and records the paths (the sequence of nodes and edge labels traversed) as “sentences”, which are then fed as input into the word2vec [Mikolov et al., 2013] model. An example of such a path extracted from Figure 5.2 might be San Pedro \(\xrightarrow{\text{bus}}\) Calama \(\xrightarrow{\text{flight}}\) Iquique \(\xrightarrow{\text{flight}}\) Santiago, where the paper experiments with \(500\) paths of length \(8\) per entity. RDF2Vec also proposes a second mode where sequences are generated for nodes from canonically-labelled sub-trees of which they are a root node, where sub-trees of depth \(1\) and \(2\) are used for experiments. KGloVe [Cochez et al., 2017b] is rather based on GloVe. Given that the original GloVe model [Pennington et al., 2014] considers words that co-occur frequently in windows of text to be more related, KGloVe uses personalised PageRank to determine the nodes most related to a given node, which are then fed into the GloVe model. (Intuitively speaking, personalised PageRank starts at a given node and then determines the probability of a random walk being at a particular node after a given number of steps; a higher number of steps converges towards standard PageRank, emphasising global node centrality in the graph, while a lower number emphasises proximity/relatedness to the starting node.)
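The walk-extraction step can be sketched as follows, over a hypothetical fragment of the transport graph (the function name, seed and edges are illustrative); the resulting “sentences” would then be passed to a word2vec implementation:

```python
import random

def extract_walks(edges, start, num_walks, depth, seed=42):
    """Perform uniform random walks over a directed edge-labelled graph,
    recording the alternating node/edge-label sequence of each walk."""
    out = {}
    for s, p, o in edges:
        out.setdefault(s, []).append((p, o))
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        walk, node = [start], start
        for _ in range(depth):
            if node not in out:   # dead end: stop this walk early
                break
            label, node = rng.choice(out[node])
            walk += [label, node]
        walks.append(walk)
    return walks

# Hypothetical fragment in the spirit of the transport graph:
edges = {("San Pedro", "bus", "Calama"),
         ("Calama", "flight", "Iquique"),
         ("Iquique", "flight", "Santiago")}
sentences = extract_walks(edges, "San Pedro", num_walks=1, depth=3)
```

Each walk alternates nodes and edge labels, so a walk of depth \(3\) yields a seven-token “sentence”.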

Entailment-aware models

The embeddings thus far consider the data graph alone. But what if an ontology or set of rules is provided? Such deductive knowledge could be used to improve the embeddings. One approach is to use constraint rules to refine the predictions made by embeddings; for example, Wang et al. [2015] use functional and inverse-functional definitions as constraints (under UNA) such that, for example, if we define that an event can have at most one value for venue, this is used to lower the plausibility of edges that would assign multiple venues to an event.

More recent approaches rather propose joint embeddings that consider both the data graph and rules when computing embeddings. KALE [Guo et al., 2016] computes entity and relation embeddings using a translational model (specifically TransE) that is adapted to further consider rules using t-norm fuzzy logics. With reference to Figure 5.2, consider a simple rule ?x \(\xrightarrow{\text{bus}}\) ?y \(\Rightarrow\) ?x \(\xrightarrow{\text{connects to}}\) ?y. We can use embeddings to assign plausibility scores to new edges, such as \(e_1\): Piedras Rojas \(\xrightarrow{\text{bus}}\) Moon Valley. We can further apply the previous rule to generate a new edge \(e_2\): Piedras Rojas \(\xrightarrow{\text{connects to}}\) Moon Valley from the predicted edge \(e_1\). But what plausibility should we assign to this second edge? Letting \(p_1\) and \(p_2\) be the current plausibility scores of \(e_1\) and \(e_2\) (initialised using the standard embedding), t-norm fuzzy logics suggests that the plausibility be updated as \(p_1p_2 - p_1 + 1\). Embeddings are then trained to jointly assign larger plausibility scores to positive examples than to negative examples of both edges and ground rules. An example of a positive ground rule based on Figure 5.2 would be Arica \(\xrightarrow{\text{bus}}\) San Pedro \(\Rightarrow\) Arica \(\xrightarrow{\text{connects to}}\) San Pedro. Negative ground rules randomly replace the relation in the head of the rule; for example, Arica \(\xrightarrow{\text{bus}}\) San Pedro \(\not\Rightarrow\) Arica \(\xrightarrow{\text{flight}}\) San Pedro. Guo et al. [2018] later propose RUGE, which uses a joint model over ground rules (possibly soft rules with confidence scores) and plausibility scores to align both forms of scoring for unseen edges.
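This update is a direct transcription of the formula \(p_1p_2 - p_1 + 1\) (the function name and inputs are illustrative):

```python
def rule_plausibility(p_body, p_head):
    """Product t-norm score I(p1, p2) = p1*p2 - p1 + 1 for a ground rule
    body => head: equal to 1 - p1*(1 - p2), i.e., high when the body is
    implausible or the head is plausible, and low only when a plausible
    body predicts an implausible head."""
    return p_body * p_head - p_body + 1.0
```

The score is \(1\) whenever the body is fully implausible or the head fully plausible, and drops towards \(0\) only when a plausible body predicts an implausible head; for example, scores \(p_1 = 0.9\) and \(p_2 = 0.8\) yield \(0.82\).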

Generating ground rules can be costly. An alternative approach, called FSL [Demeester et al., 2016], observes that in the case of a simple rule, such as ?x \(\xrightarrow{\text{bus}}\) ?y \(\Rightarrow\) ?x \(\xrightarrow{\text{connects to}}\) ?y, the relation embedding for bus should always return a lower plausibility than that for connects to. Thus, for all such rules, FSL proposes to train relation embeddings while avoiding violations of the corresponding inequalities. While relatively straightforward, FSL only supports simple rules, whereas KALE also supports more complex rules.

These works exemplify how deductive and inductive forms of knowledge – in this case rules and embeddings – can interplay and complement each other.

Graph Neural Networks

While embeddings aim to provide a dense numerical representation of graphs suitable for use within existing machine learning models, another approach is to build custom machine learning models adapted for graph-structured data. Most custom learning models for graphs are based on (artificial) neural networks [Wu et al., 2019], exploiting a natural correspondence between both: a neural network already corresponds to a weighted, directed graph, where nodes serve as artificial neurons, and edges serve as weighted connections (axons). However, the typical topology of a traditional neural network – more specifically, a fully-connected feed-forward neural network – is quite homogeneous, being defined in terms of sequential layers of nodes where each node in one layer is connected to all nodes in the next layer. Conversely, the topology of a data graph is quite heterogeneous, being determined by the relations between entities that its edges represent.

A graph neural network (GNN) [Scarselli et al., 2009] builds a neural network based on the topology of the data graph; i.e., nodes are connected to their neighbours per the data graph. Typically a model is then learnt to map input features for nodes to output features in a supervised manner; output features of the example nodes used for training may be manually labelled, or may be taken from the knowledge graph. Unlike knowledge graph embeddings, GNNs support end-to-end supervised learning for specific tasks: given a set of labelled examples, GNNs can be used to classify elements of the graph or the graph itself. GNNs have been used to perform classification over graphs encoding compounds, objects in images, documents, etc.; as well as to predict traffic, build recommender systems, verify software, etc. [Wu et al., 2019]. Given labelled examples, GNNs can even replace graph algorithms; for example, GNNs have been used to find central nodes in knowledge graphs in a supervised manner [Scarselli et al., 2009, Park et al., 2019, Park et al., 2020].

We now discuss the ideas underlying two main flavours of GNN, specifically, recursive GNNs and non-recursive GNNs.

Recursive graph neural networks

Recursive graph neural networks (RecGNNs) are the seminal approach to graph neural networks [Sperduti and Starita, 1997, Scarselli et al., 2009]. The approach is conceptually similar to the systolic abstraction illustrated in Figure 5.3, where messages are passed between neighbours towards recursively computing some result. However, rather than define the functions used to decide the messages to pass, we rather label the output of a training set of nodes and let the framework learn the functions that generate the expected output, thereafter applying them to label other examples.

In a seminal paper, Scarselli et al. [2009] proposed what they generically call a graph neural network (GNN), which takes as input a directed graph where nodes and edges are associated with feature vectors that can capture node and edge labels, weights, etc. These feature vectors remain fixed throughout the process. Each node in the graph is also associated with a state vector, which is recursively updated based on information from the node’s neighbours – i.e., the feature and state vectors of the neighbouring nodes and the feature vectors of the edges extending to/from them – using a parametric function, called the transition function. A second parametric function, called the output function, is used to compute the final output for a node based on its own feature and state vector. These functions are applied recursively up to a fixpoint. Both parametric functions can be implemented using neural networks where, given a partial set of supervised nodes in the graph – i.e., nodes labelled with their desired output – parameters for the transition and output functions can be learnt that best approximate the supervised outputs. The result can thus be seen as a recursive neural network architecture. (Some authors refer to such architectures as recurrent graph neural networks, observing that the internal state maintained for nodes can be viewed as a form of recurrence over a sequence of transitions.) To ensure convergence up to a fixpoint, certain restrictions are applied, namely that the transition function be a contractor, meaning that upon each application of the function, points in the numeric space are brought closer together (intuitively, in this case, the numeric space “shrinks” upon each application, ensuring convergence to a unique fixpoint).

To illustrate, consider, for example, that we wish to find priority locations for creating new tourist information offices. A good strategy would be to install them in hubs from which many tourists visit popular destinations. Along these lines, in Figure 5.7 we illustrate the GNN architecture proposed by Scarselli et al. [2009] for a sub-graph of Figure 5.2, where we highlight the neighbourhood of Punta Arenas. In this graph, nodes are annotated with feature vectors (\(\mathbf{n}_x\)) and hidden states at step \(t\) (\(\mathbf{h}_x^{(t)}\)), while edges are annotated with feature vectors (\(\mathbf{a}_{xy}\)). Feature vectors for nodes may, for example, one-hot encode the type of node (City, Attraction, etc.), directly encode statistics such as the number of tourists visiting per year, etc. Feature vectors for edges may, for example, one-hot encode the edge label (the type of transport), directly encode statistics such as the distance or number of tickets sold per year, etc. Hidden states can be randomly initialised. The right-hand side of Figure 5.7 provides the GNN transition and output functions, where \(\mathrm{N}(x)\) denotes the neighbouring nodes of \(x\), \(f_{\mathbf{w}}(\cdot)\) denotes the transition function with parameters \(\mathbf{w}\), and \(g_{\mathbf{w}'}(\cdot)\) denotes the output function with parameters \(\mathbf{w'}\). An example is also provided for Punta Arenas (\(x = 1\)). These functions will be recursively applied until a fixpoint is reached. To train the network, we can label examples of places that already have (or should have) tourist offices and places that do (or should) not have tourist offices. These labels may be taken from the knowledge graph, or may be added manually. The GNN can then learn parameters \(\mathbf{w}\) and \(\mathbf{w'}\) that give the expected output for the labelled examples, which can subsequently be used to label other nodes.

This GNN model is flexible and can be adapted in various ways [Scarselli et al., 2009]: we may define neighbouring nodes differently, for example to include nodes for outgoing edges, or nodes one or two hops away; we may allow pairs of nodes to be connected by multiple edges with different vectors; we may consider transition and output functions with distinct parameters for each node; we may add states and outputs for edges; we may change the sum to another aggregation function; etc.

\(\mathbf{h}_x^{(t)} \coloneqq\) \(\sum_{y \in \textrm{N}(x)} f_\mathbf{w}(\mathbf{n}_{x},\mathbf{n}_{y},\mathbf{a}_{yx},\mathbf{h}_{y}^{(t-1)})\)
\(\mathbf{o}_x^{(t)} \coloneqq\) \(g_{\mathbf{w}'}(\mathbf{h}_x^{(t)},\mathbf{n}_x)\)
\(\mathbf{h}_1^{(t)} \coloneqq\) \(f_\mathbf{w}(\mathbf{n}_{1},\mathbf{n}_{3},\mathbf{a}_{31},\mathbf{h}_{3}^{(t-1)})\)
\(+ f_\mathbf{w}(\mathbf{n}_{1},\mathbf{n}_{4},\mathbf{a}_{41},\mathbf{h}_{4}^{(t-1)})\)
\(\mathbf{o}_1^{(t)} \coloneqq\) \(g_{\mathbf{w}'}(\mathbf{h}_1^{(t)},\mathbf{n}_1)\)
\(\ldots\)
 
On the left a sub-graph of Figure 5.2 highlighting the neighbourhood of Punta Arenas, where nodes are annotated with feature vectors (\(\mathbf{n}_x\)) and hidden states at step \(t\) (\(\mathbf{h}_x^{(t)}\)), and edges are annotated with feature vectors (\(\mathbf{a}_{xy}\)); on the right, the GNN transition and output functions proposed by Scarselli et al. [2009] and an example for Punta Arenas (\(x = 1\)), where \(\mathrm{N}(x)\) denotes the neighbouring nodes of \(x\), \(f_{\mathbf{w}}(\cdot)\) denotes the transition function with parameters \(\mathbf{w}\) and \(g_{\mathbf{w}'}(\cdot)\) denotes the output function with parameters \(\mathbf{w'}\)

We now define a recursive graph neural network. We assume that the GNN accepts a directed vector-labelled graph as input (see Definition 5.1).

Recursive graph neural network
A recursive graph neural network (RecGNN) is a pair of functions \(\mathfrak{R} \coloneqq (\)Agg, Out\()\), such that (with \(a, b, c \in \mathbb{N}\)):
  • Agg\(: \mathbb{R}^a \times 2^{(\mathbb{R}^a \times \mathbb{R}^b) \rightarrow \mathbb{N}} \rightarrow \mathbb{R}^a\)
  • Out\(: \mathbb{R}^a \rightarrow \mathbb{R}^c\)

The function Agg computes a new feature vector for a node, given its previous feature vector and the feature vectors of the nodes and edges forming its neighbourhood; the function Out transforms the final feature vector computed by Agg for a node to the output vector for that node. We assume that \(a\) and \(b\) correspond to the dimensions of the input node and edge vectors, respectively, while \(c\) denotes the dimension of the output vector for each node. Given a RecGNN \(\mathfrak{R} = (\)Agg, Out\()\), a directed vector-labelled graph \(G = (V,E,F,\lambda)\), and a node \(u \in V\), we define the output vector assigned to node \(u\) in \(G\) by \(\mathfrak{R}\) (written \(\mathfrak{R}(G,u)\)) as follows. First let \(\mathbf{n}_u^{(0)} \coloneqq \lambda(u)\). For all \(i \geq 1\), let:

\(\mathbf{n}_u^{(i)} \coloneqq\) Agg \(\left( \mathbf{n}_u^{(i-1)}, \{\!\!\{ (\mathbf{n}_v^{(i-1)},\lambda(v,u)) \mid (v,u) \in E \}\!\!\} \right) \)

If \(j \geq 1\) is an integer such that \(\mathbf{n}_u^{(j)} = \mathbf{n}_u^{(j-1)}\) for all \(u \in V\), then \(\mathfrak{R}(G,u) \coloneqq\) Out\((\mathbf{n}_u^{(j)})\).

In a RecGNN, the same aggregation function (Agg) is applied recursively until a fixpoint is reached, at which point an output function (Out) creates the final output vector for each node. While in practice RecGNNs will often consider a static feature vector and a dynamic state vector [Scarselli et al., 2009], we can more concisely encode this as one vector, where one part may remain static throughout the aggregation process, representing input features, and the other part may be dynamically computed, representing the state. In practice, Agg and Out are often based on parametric combinations of vectors, with the parameters learnt based on a sample of output vectors for labelled nodes.

The aggregation function for the GNN of Scarselli et al. [2009] is given as:

Agg\((\mathbf{n}_u,N) \coloneqq \sum_{(\mathbf{n}_v,\mathbf{a}_{vu})\in N}f_{\mathbf{w}}(\mathbf{n}_u,\mathbf{n}_v,\mathbf{a}_{vu})\)

where \(f_{\mathbf{w}}(\cdot)\) is a contraction function with parameters \(\mathbf{w}\). The output function is defined as:

Out\(\left( \mathbf{n}_u \right) \coloneqq g_{\textbf{w}'}(\mathbf{n}_u)\)

where again \(g_{\mathbf{w}'}(\cdot)\) is a function with parameters \(\mathbf{w'}\). Given a set of nodes labelled with their expected output vectors, the parameters \(\mathbf{w}\) and \(\mathbf{w}'\) are learnt.
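To make the recursion concrete, the following toy sketch iterates such an architecture to a fixpoint, with a hand-fixed contractive update in place of a trained \(f_{\mathbf{w}}\), and edge vectors omitted for brevity (the graph, names and parameters are all illustrative, not a trained model):

```python
import numpy as np

def agg(n_u, N):
    # n_u = [feature, state]: keep the static feature and recompute the
    # state from the feature plus damped in-neighbour states; the 0.5
    # damping keeps the update a contraction on this toy graph.
    state = n_u[0] + sum(0.5 * n_v[1] for n_v, _ in N)
    return np.array([n_u[0], state])

def out(n_u):
    return n_u[1]   # output the converged state component

def run_recgnn(V, E, feats, agg, out, tol=1e-10, max_iter=100):
    """Apply Agg over all nodes until a fixpoint, then apply Out."""
    n = {u: np.array([feats[u], 0.0]) for u in V}
    for _ in range(max_iter):
        new = {u: agg(n[u], [(n[v], None) for (v, w) in E if w == u])
               for u in V}
        done = max(np.abs(new[u] - n[u]).max() for u in V) < tol
        n = new
        if done:
            break
    return {u: out(n[u]) for u in V}

V = ["a", "b", "c"]
E = [("a", "b"), ("b", "c")]   # chain: a -> b -> c
res = run_recgnn(V, E, {"a": 1.0, "b": 1.0, "c": 1.0}, agg, out)
```

On the chain a → b → c with unit features, the states converge to \(1\), \(1.5\) and \(1.75\) respectively; in a real RecGNN the parameters of the transition and output functions would instead be learnt from labelled nodes.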

There are notable similarities between graph parallel frameworks (GPFs; see Definition 5.2) and RecGNNs. While we defined GPFs using separate Msg and Agg functions, this is not essential: conceptually they could be defined in a similar way to RecGNN, with a single Agg function that “pulls” information from its neighbours (we maintain Msg to more closely reflect how GPFs are defined/implemented in practice). The key difference between GPFs and GNNs is that in the former, the functions are defined by the user, while in the latter, the functions are generally learnt from labelled examples. Another difference arises from the termination condition present in GPFs, though often the GPF’s termination condition will – like in RecGNNs – reflect convergence to a fixpoint.

Non-recursive graph neural networks

GNNs can also be defined in a non-recursive manner, where a fixed number of layers are applied over the input in order to generate the output. A benefit of this approach is that we do not need to worry about convergence since the process is non-recursive. Also, each layer will often have independent parameters, representing different transformation steps. Naively, a downside is that adding many layers could give rise to a high number of parameters. Addressing this problem, a popular approach for non-recursive GNNs is to use convolutional neural networks.

Convolutional neural networks (CNNs) have gained a lot of attention, in particular, for machine learning tasks involving images [Krizhevsky et al., 2017]. The core idea in the image setting is to train and apply small kernels (aka filters) over localised regions of an image using a convolution operator to extract features from that local region. When applied to all local regions, the convolution outputs a feature map of the image. Since the kernels are small, and are applied multiple times to different regions of the input, the number of parameters to train is reduced. Typically multiple kernels can thus be applied, forming multiple convolutional layers.

One may note that in GNNs and CNNs, operators are applied over local regions of the input data. In the case of GNNs, the transition function is applied over a node and its neighbours in the graph. In the case of CNNs, the convolution is applied on a pixel and its neighbours in the image. Following this intuition, a number of convolutional graph neural networks (ConvGNNs) [Bruna et al., 2014, Kipf and Welling, 2017, Wu et al., 2019] have been proposed, where the transition function is implemented by means of convolutions. A key consideration for ConvGNNs is how regions of a graph are defined. Unlike the pixels of an image, nodes in a graph may have varying numbers of neighbours. This creates a challenge: a benefit of CNNs is that the same kernel can be applied over all the regions of an image, but this requires more careful consideration in the case of ConvGNNs since neighbourhoods of different nodes can be diverse. Approaches to address these challenges involve working with spectral (e.g. [Bruna et al., 2014, Kipf and Welling, 2017]) or spatial (e.g., [Monti et al., 2017]) representations of graphs that induce a more regular structure from the graph. An alternative is to use an attention mechanism [Velickovic et al., 2018] to learn the nodes whose features are most important to the current node.

Next we abstractly define a non-recursive graph neural network.

Non-recursive graph neural network
A non-recursive graph neural network (NRecGNN) with \(l\) layers is an \(l\)-tuple of functions \(\mathfrak{N} \coloneqq (\)Agg\(^{(1)},\ldots,\) Agg\(^{(l)} )\), such that, for \(1 \leq k \leq l\) (with \(a_0, \ldots, a_l, b \in \mathbb{N}\)), Agg\(^{(k)}: \mathbb{R}^{a_{k-1}} \times 2^{(\mathbb{R}^{a_{k-1}} \times \mathbb{R}^b) \rightarrow \mathbb{N}} \rightarrow \mathbb{R}^{a_{k}}\).

Each function Agg\(^{(k)}\) (as before) computes a new feature vector for a node, given its previous feature vector and the feature vectors of the nodes and edges forming its neighbourhood. We assume that \(a_0\) and \(b\) correspond to the dimensions of the input node and edge vectors, respectively, where each function Agg\(^{(k)}\) for \(2 \leq k \leq l\) accepts as input node vectors of the same dimension as the output of the function Agg\(^{(k-1)}\). Given an NRecGNN \(\mathfrak{N} = (\) Agg\(^{(1)},\ldots,\) Agg\(^{(l)} )\), a directed vector-labelled graph \(G = (V,E,F,\lambda)\), and a node \(u \in V\), we define the output vector assigned to node \(u\) in \(G\) by \(\mathfrak{N}\) (written \(\mathfrak{N}(G,u)\)) as follows. First let \(\mathbf{n}_u^{(0)} \coloneqq \lambda(u)\). For all \(i \geq 1\), let:

\(\mathbf{n}_u^{(i)} \coloneqq\) Agg\(^{(i)} \left( \mathbf{n}_u^{(i-1)}, \{\!\!\{ (\mathbf{n}_v^{(i-1)},\lambda(v,u)) \mid (v,u) \in E \}\!\!\} \right) \)

Then \(\mathfrak{N}(G,u) \coloneqq \mathbf{n}_u^{(l)}\).

In an \(l\)-layer NRecGNN, a different aggregation function can be applied at each step (i.e., in each layer), up to a fixed number of steps \(l\). We do not consider a separate Out function as it can be combined with the final aggregation function Agg\(^{(l)}\). When the aggregation functions use a convolutional operator based on kernels learned from labelled examples, we call the result a convolutional graph neural network (ConvGNN). We refer to the survey by Wu et al. [2019] for discussion of ConvGNNs proposed in the literature.
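A minimal sketch of an \(l\)-layer NRecGNN with a simple parameter-free mean aggregator per layer (a real ConvGNN layer would instead apply learnt kernels; the graph, names and edge handling are illustrative):

```python
import numpy as np

def mean_agg(n_u, N):
    # Average the node's previous vector with its in-neighbours'
    # vectors (edge vectors ignored for brevity).
    return np.mean([n_u] + [n_v for n_v, _ in N], axis=0)

def nrecgnn(layers, V, E, feats):
    """Apply each layer's aggregation function once, in order."""
    n = {u: np.asarray(feats[u], dtype=float) for u in V}
    for agg in layers:
        n = {u: agg(n[u], [(n[v], None) for (v, w) in E if w == u])
             for u in V}
    return n

V = ["a", "b", "c"]
E = [("a", "b"), ("b", "c")]   # chain: a -> b -> c
res = nrecgnn([mean_agg, mean_agg], V, E,
              {"a": [0.0], "b": [1.0], "c": [2.0]})
```

After two layers, each node's vector mixes information from nodes up to two hops away along incoming edges, illustrating how the number of layers bounds the receptive field.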

We have considered GNNs that define the neighbourhood of a node based on its incoming edges. These definitions can be adapted to also consider outgoing neighbours by either adding inverse edges to the directed vector-labelled graph in pre-processing, or by adding outgoing neighbours as arguments to the Agg\((\cdot)\) function. More generally, GNNs (and indeed GPFs) relying solely on the neighbourhood of each node have limited expressivity in terms of their ability to distinguish nodes and graphs [Xu et al., 2019]; for example, Barceló et al. [2020] show that such NRecGNNs have a similar expressiveness for classifying nodes as the \(\mathcal{ALCQ}\) Description Logic discussed in Section 4.3.2. More expressive GNN variants have been proposed that allow the aggregation functions to access and update a globally shared vector [Barceló et al., 2020]. We refer to the papers by Xu et al. [2019] and Barceló et al. [2020] for further discussion.

Symbolic Learning

The supervised techniques discussed thus far – namely knowledge graph embeddings and graph neural networks – learn numerical models over graphs. However, such models are often difficult to explain or understand. For example, taking the graph of Figure 5.8, knowledge graph embeddings might predict the edge SCL \(\xrightarrow{\text{flight}}\) ARI as being highly plausible, but they will not provide an interpretable model to help understand why this is the case: the reason for the result may lie in a matrix of parameters learnt to fit a plausibility score on training data. Such approaches also suffer from the out-of-vocabulary problem, where they are unable to provide results for edges involving previously unseen nodes or edges; for example, if we add an edge SCL \(\xrightarrow{\text{flight}}\) CDG, where CDG is new to the graph, a knowledge graph embedding will not have an entity embedding for CDG and would need to be retrained in order to estimate the plausibility of an edge CDG \(\xrightarrow{\text{flight}}\) SCL.

An incomplete directed edge-labelled graph describing flights between airports

An alternative (sometimes complementary) approach is to adopt symbolic learning in order to learn hypotheses in a symbolic (logical) language that “explain” a given set of positive and negative edges. These edges are typically generated from the knowledge graph in an automatic manner (similar to the case of knowledge graph embeddings). The hypotheses then serve as interpretable models that can be used for further deductive reasoning. Given the graph of Figure 5.8, we may, for example, learn the rule ?x \(\xrightarrow{\text{flight}}\) ?y \(\Rightarrow\) ?y \(\xrightarrow{\text{flight}}\) ?x from observing that flight routes tend to be return routes. Alternatively, rather than learn rules, we might learn a DL axiom from the graph stating that airports are either domestic, international, or both: Airport \(\sqsubseteq\) DomesticAirport \(\sqcup\) InternationalAirport. Such rules and axioms can then be used for deductive reasoning, and offer an interpretable model for new knowledge that is entailed/predicted; for example, from the aforementioned rule for return flights, one can interpret why a novel edge SCL \(\xrightarrow{\text{flight}}\) ARI is predicted. This further offers domain experts the opportunity to verify the models – e.g., the rules and axioms – derived by such processes. Finally, rules/axioms are quantified (all flights have a return flight, all airports are domestic or international, etc.), so they can be applied to unseen examples (e.g., with the aforementioned rule, we can derive CDG \(\xrightarrow{\text{flight}}\) SCL from a new edge SCL \(\xrightarrow{\text{flight}}\) CDG with the unseen node CDG).

In this section, we discuss two forms of symbolic learning: rule mining, which learns rules, and axiom mining, which learns other forms of logical axioms.

Rule mining

Rule mining, in the general sense, refers to discovering meaningful patterns in the form of rules from large collections of background knowledge. In the context of knowledge graphs, we assume a set of positive and negative edges as given. Typically positive edges are observed edges (i.e., those given or entailed by a knowledge graph) while negative edges are defined according to a given assumption of completeness (discussed later). The goal of rule mining is to identify new rules that entail a high ratio of positive edges from other positive edges, but entail a low ratio of negative edges from positive edges. The types of rules considered may vary from simpler cases, such as ?x \(\xrightarrow{\text{flight}}\) ?y \(\Rightarrow\) ?y \(\xrightarrow{\text{flight}}\) ?x mentioned previously, to more complex rules, such as ?x \(\xrightarrow{\text{capital}}\) ?y \(\xrightarrow{\text{nearby}}\) ?z \(\xrightarrow{\text{type}}\) Airport \(\Rightarrow\) ?z \(\xrightarrow{\text{type}}\) International Airport, based on observing in the graph that airports near capitals tend to be international airports; or ?x \(\xrightarrow{\text{flight}}\) ?y, ?x \(\xrightarrow{\text{country}}\) ?z, ?y \(\xrightarrow{\text{country}}\) ?z \(\Rightarrow\) ?x \(\xrightarrow{\text{domestic flight}}\) ?y, indicating that flights within the same country denote domestic flights (as seen previously in Section 4.3.1).

Per the example inferring that airports near capital cities are international airports, rules are not assumed to hold in all cases, but rather are associated with measures of how well they conform to the positive and negative edges. In more detail, we call the edges entailed by a rule from the set of positive edges (not including the entailed edge itself) the entailments of that rule. The number of entailments that are positive is called the support for the rule, while the ratio of a rule’s entailments that are positive is called the confidence for the rule [Suchanek et al., 2019]. Support and confidence indicate, respectively, the number and ratio of entailments “confirmed” to be true for the rule, where the goal is to identify rules that have both high support and high confidence. Techniques for rule mining in relational settings have long been explored in the context of Inductive Logic Programming (ILP) [De Raedt, 2008]. However, knowledge graphs present novel challenges due to the scale of the data and the frequent assumption of incomplete data (OWA), where dedicated techniques have been proposed to address these issues [Galárraga et al., 2013].

When dealing with an incomplete knowledge graph, it is not immediately clear how to define negative edges. A common heuristic – also used for knowledge graph embeddings – is to adopt a Partial Completeness Assumption (PCA) [Galárraga et al., 2013], which considers the set of positive edges to be those contained in the data graph, and the set of negative edges to be the set of all edges \(x\) –\(p\)→ \(y\) not in the graph for which there exists a node \(y'\) such that \(x\) –\(p\)→ \(y'\) is in the graph. Taking Figure 5.8, an example of a negative edge under PCA would be SCL –flight→ ARI (given the presence of SCL –flight→ LIM); conversely, SCL –domestic flight→ ARI is neither positive nor negative. The PCA confidence measure is then the ratio of the support to the number of entailments in the positive or negative set [Galárraga et al., 2013]. For example, the support for the rule ?x –domestic flight→ ?y \(\Rightarrow\) ?y –domestic flight→ ?x is \(2\) (since it entails IQQ –domestic flight→ ARI and ARI –domestic flight→ IQQ in the graph, which are thus positive edges), while the confidence is \(\frac{2}{2} = 1\) (noting that SCL –domestic flight→ ARI, though entailed, is neither positive nor negative, and is thus ignored by the measure). The support for the rule ?x –flight→ ?y \(\Rightarrow\) ?y –flight→ ?x is analogously \(4\), while the confidence is \(\frac{4}{5} = 0.8\) (noting that SCL –flight→ ARI is a negative edge).
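To make the PCA scoring concrete, the following sketch computes the support and PCA confidence of the symmetry rule for flight over a small toy graph. The edges below are illustrative, chosen to reproduce the counts above; they are not an exact copy of Figure 5.8.

```python
# Toy directed edge-labelled graph, loosely based on the flight example
# (node and edge names are illustrative, not the book's exact Figure 5.8).
edges = {
    ("SCL", "flight", "LIM"), ("LIM", "flight", "SCL"),
    ("ARI", "flight", "IQQ"), ("IQQ", "flight", "ARI"),
    ("ARI", "flight", "SCL"),  # no SCL -flight-> ARI edge in the graph
}

def pca_support_confidence(edges, label):
    """Score the symmetry rule  ?x -label-> ?y  =>  ?y -label-> ?x  under PCA."""
    support, negatives = 0, 0
    for (s, p, o) in edges:
        if p != label:
            continue
        entailed = (o, p, s)               # edge predicted by the rule
        if entailed in edges:
            support += 1                   # positive entailment
        elif any(s2 == o and p2 == p for (s2, p2, o2) in edges):
            negatives += 1                 # PCA: o has *some* outgoing edge with this label
        # otherwise: neither positive nor negative -- ignored under PCA
    return support, support / (support + negatives)

support, confidence = pca_support_confidence(edges, "flight")
print(support, confidence)  # prints: 4 0.8
```

The reciprocal pairs SCL↔LIM and ARI↔IQQ yield four positive entailments, while the unreciprocated ARI –flight→ SCL yields one negative entailment under PCA, giving confidence \(\frac{4}{5} = 0.8\).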

The goal, then, is to find rules satisfying given support and confidence thresholds. An influential rule-mining system for graphs is AMIE [Galárraga et al., 2013, Galárraga et al., 2015], which adopts the PCA measure of confidence, and builds rules in a top-down fashion [Suchanek et al., 2019], starting with rule heads of the form \(\Rightarrow\) ?x –country→ ?y. For each such rule head (one for each edge label), three types of refinements are considered, each of which adds a new edge to the body of the rule. This new edge takes an edge label from the graph and may otherwise use fresh variables not appearing previously in the rule, existing variables that already appear in the rule, or nodes from the graph. The three refinements may then:

  1. add an edge with one existing variable and one fresh variable; for example, refining the aforementioned rule head might give: ?z –flight→ ?x \(\Rightarrow\) ?x –country→ ?y;
  2. add an edge with an existing variable and a node from the graph; for example, refining the above rule might give: Domestic Airport ←type– ?z –flight→ ?x \(\Rightarrow\) ?x –country→ ?y;
  3. add an edge with two existing variables; for example, refining the above rule might give: Domestic Airport ←type– ?z –flight→ ?x ∧ ?z –country→ ?y \(\Rightarrow\) ?x –country→ ?y.

These refinements can be combined arbitrarily, which gives rise to a potentially exponential search space, where rules meeting the given thresholds for support and confidence are maintained. To improve efficiency, the search space can be pruned; for example, these three refinements always decrease support, so if a rule does not meet the support threshold, there is no need to explore its refinements. Further restrictions are imposed on the types of rules generated. First, only rules up to a certain fixed size are considered. Second, a rule must be closed, meaning that each variable appears in at least two edges of the rule; this in turn ensures that rules are safe, meaning that each variable in the head appears in the body. For example, the rules produced by the first and second refinements above are neither closed (the variable ?y appears in only one edge) nor safe (?y appears only in the head). (Note that closedness is strictly stronger than safety: safe rules like ?x –capital→ ?y –nearby→ ?z –type→ Airport \(\Rightarrow\) ?z –type→ International Airport are not closed, as ?x appears in only one edge.) The third refinement is thus applied until a rule is closed. For further discussion of possible optimisations based on pruning and indexing, we refer to the paper [Galárraga et al., 2015].
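To make the refinement process concrete, the following sketch (our own toy representation, not AMIE's actual implementation) enumerates the three refinement types for a given rule head and filters for closed rules; support-based pruning and confidence scoring would be layered on top of this enumeration.

```python
from itertools import count

# A rule is a pair (head, body) of atoms; an atom is an (s, p, o) triple and
# strings starting with '?' are variables. Names here are illustrative.

def variables(atoms):
    return {t for atom in atoms for t in atom if t.startswith("?")}

def is_safe(head, body):
    """Safe: every variable in the head also appears in the body."""
    return variables([head]) <= variables(body)

def is_closed(head, body):
    """Closed: every variable appears in at least two edges of the rule."""
    atoms = [head] + body
    return all(sum(v in atom for atom in atoms) >= 2
               for v in variables(atoms))

def refine(head, body, labels, nodes, fresh):
    """Yield the bodies produced by AMIE's three refinement types."""
    existing = sorted(variables([head] + body))
    for p in labels:
        for v in existing:
            f = next(fresh)
            yield body + [(v, p, f)]          # 1. existing + fresh variable
            yield body + [(f, p, v)]
            for n in nodes:
                yield body + [(v, p, n)]      # 2. existing variable + node
                yield body + [(n, p, v)]
        for v in existing:
            for w in existing:
                if v != w:
                    yield body + [(v, p, w)]  # 3. two existing variables

head = ("?x", "country", "?y")
fresh = ("?v%d" % i for i in count())
bodies = list(refine(head, [], ["country", "flight"], [], fresh))
closed = [b for b in bodies if is_closed(head, b)]
```

Starting from the empty body, one round of refinement over two edge labels produces twelve candidate bodies, of which only the single-edge bodies over the two existing variables (e.g., ?y –flight→ ?x) are closed.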

Later works have built on these techniques for mining rules from knowledge graphs. Gad-Elrab et al. [2016] propose a method to learn non-monotonic rules – rules with negated edges in the body – in order to capture exceptions to base rules; for example, the rule ?y ←country– ?z –flight→ ?x ∧ \(\neg\)(?z –type→ International Airport) \(\Rightarrow\) ?x –country→ ?y may be learnt, where \(\neg\) negates an edge (representing the exception), indicating that flights are within the same country except when the departure airport is international. The RuLES system [Ho et al., 2018] – which is also capable of learning non-monotonic rules – proposes to mitigate the limitations of the PCA heuristic by extending the confidence measure to consider the plausibility scores of knowledge graph embeddings for entailed edges not appearing in the graph. Where available, explicit statements about the completeness of the knowledge graph (such as expressed in shapes; see Section 3.1.2) can be used in lieu of PCA for identifying negative edges. Along these lines, CARL [Pellissier Tanon et al., 2017] exploits additional knowledge about the cardinalities of relations to refine the set of negative examples and the confidence measure for candidate rules. Alternatively, where available, ontologies can be used to derive logically-certain negative edges under OWA through, for example, disjointness axioms. The system proposed by d’Amato et al. [2016a, 2016b] leverages ontologically-entailed negative edges for determining the confidence of rules generated through an evolutionary algorithm.

While the previous works involve discrete expansions of candidate rules to which a fixed confidence scoring function is applied, another line of research concerns differentiable rule mining [Rocktäschel and Riedel, 2017, Yang et al., 2017, Sadeghian et al., 2019], which allows end-to-end learning of rules. The core idea is that the joins in rule bodies can be represented as matrix multiplication. More specifically, we can represent the relations of an edge label \(p\) by the adjacency matrix \(\mathbf{A}_p\) (of size \(|V| \times |V|\)) such that the value on the \(i\)th row of the \(j\)th column is \(1\) if there is an edge labelled \(p\) from the \(i\)th entity to the \(j\)th entity; otherwise the value is \(0\). Now we can represent a join in a rule body as matrix multiplication; for example, given ?x –domestic flight→ ?y –country→ ?z \(\Rightarrow\) ?x –country→ ?z, we can denote the body by the matrix multiplication \(\mathbf{A}_{\text{df.}}\mathbf{A}_{\text{c.}}\), which gives an adjacency matrix representing entailed country edges, where we should expect the \(1\)’s in \(\mathbf{A}_{\text{df.}}\mathbf{A}_{\text{c.}}\) to be covered by the head’s adjacency matrix \(\mathbf{A}_{\text{c.}}\). Since we are given adjacency matrices for all edge labels, we are left to learn confidence scores for individual rules, and to learn rules (of varying length) with a threshold confidence. Along these lines, NeuralLP [Yang et al., 2017] uses an attention mechanism to select a variable-length sequence of edge labels for path-like rules of the form ?x –p\(_1\)→ ?y\(_1\) –p\(_2\)→ \(\cdots\) –p\(_n\)→ ?y\(_n\) –p\(_{n+1}\)→ ?z \(\Rightarrow\) ?x –p→ ?z, for which confidences are likewise learnt.
DRUM [Sadeghian et al., 2019] also learns path-like rules, where, observing that some edge labels are more or less likely to follow others in such rules – for example, flight will not be followed by capital in the graph of Figure 5.2, as the join will be empty – the system uses bidirectional recurrent neural networks (a popular technique for learning over sequential data) to learn sequences of relations for rules, along with their confidences. These differentiable rule mining techniques are, however, currently limited to learning path-like rules.
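The matrix formulation of rule bodies can be illustrated in a few lines of NumPy. The graph below is a minimal illustrative fragment (two airports and their country), not the full figure from the text.

```python
import numpy as np

# Minimal illustrative fragment: two airports and their country
nodes = ["ARI", "IQQ", "Chile"]
idx = {n: i for i, n in enumerate(nodes)}

def adjacency(pairs):
    """Adjacency matrix A_p for one edge label, over the fixed node order."""
    A = np.zeros((len(nodes), len(nodes)))
    for s, o in pairs:
        A[idx[s], idx[o]] = 1.0
    return A

A_df = adjacency([("ARI", "IQQ"), ("IQQ", "ARI")])      # domestic flight
A_c = adjacency([("ARI", "Chile"), ("IQQ", "Chile")])   # country

# Body of  ?x -df-> ?y -country-> ?z  =>  ?x -country-> ?z  as a product
body = A_df @ A_c

# The rule is confirmed on this graph iff the head matrix covers the body
covered = bool(np.all(A_c[body > 0] == 1.0))
print(covered)  # prints: True
```

Differentiable systems exploit the fact that this product is differentiable: rather than fixing the sequence of labels, they learn a (soft) weighting over candidate label sequences by gradient descent.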

Axiom mining

More general forms of axioms beyond rules – expressed in logical languages such as DLs (see Section 4.3.2) – can also be mined from knowledge graphs. We can divide these approaches into two groups: those mining specific types of axioms, and those mining more general axioms.

Among systems mining specific types of axioms, disjointness axioms are a popular target; for example, the axiom DomesticAirport \(\sqcap\) InternationalAirport \(\equiv \bot\) states that the two classes are disjoint by stating that their intersection is equivalent to the empty class; in simpler terms, no node can be simultaneously of type Domestic Airport and International Airport. The system proposed by Völker et al. [2015] extracts disjointness axioms based on (negative) association rule mining [Agrawal et al., 1993], which finds pairs of classes where each has many instances in the knowledge graph but there are relatively few (or no) instances of both classes. Töpper et al. [2012] instead extract disjointness for pairs of classes that have a cosine similarity below a fixed threshold. For computing this cosine similarity, class vectors are computed using a TF–IDF analogy, where the “document” of each class is constructed from all of its instances, and the “terms” of this document are the properties used on the class instances (preserving multiplicities). While the previous two approaches find disjointness constraints between named classes (e.g., city is disjoint with airport), Rizzo et al. [2017, 2021] propose an approach that can capture disjointness constraints between class descriptions (e.g., city without an airport nearby is disjoint with city that is the capital of a country). The approach first clusters similar nodes of the knowledge base. Next, a terminological cluster tree is extracted, where each leaf node indicates a cluster extracted previously, and each internal (non-leaf) node is a class definition (e.g., cities). The left child of an internal node is either a cluster having all of its nodes in that class or a sub-class description (e.g., cities without airports), while the right child is either a cluster having no nodes in that class or a disjoint-class description (e.g., non-cities with events).
Finally, candidate disjointness axioms are proposed for pairs of class descriptions in the tree that are not entailed to have a sub-class relation.
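The TF–IDF-based approach of Töpper et al. can be sketched as follows, where each class's "document" is its multiset of instance properties. The class names, properties, and counts below are invented for illustration; the threshold value is likewise an assumption.

```python
import math
from collections import Counter

# Illustrative instance data: each class maps to the multiset of properties
# used on its instances (the class's "document" of "terms").
class_docs = {
    "City":    Counter({"population": 40, "mayor": 35, "nearby": 20}),
    "Capital": Counter({"population": 12, "mayor": 10, "nearby": 8}),
    "Airport": Counter({"iata": 30, "serves": 28, "nearby": 25}),
}

def tf_idf_vectors(docs):
    """Weight each term frequency by the inverse document frequency."""
    n = len(docs)
    df = Counter(t for doc in docs.values() for t in doc)
    return {c: {t: tf * math.log(n / df[t]) for t, tf in doc.items()}
            for c, doc in docs.items()}

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

vecs = tf_idf_vectors(class_docs)
threshold = 0.5  # illustrative threshold
for a, b in [("City", "Capital"), ("City", "Airport")]:
    if cosine(vecs[a], vecs[b]) < threshold:
        print(f"candidate disjointness: {a} and {b}")
```

Here the shared property nearby is down-weighted to zero by IDF (it appears in every class), so City and Airport end up with orthogonal vectors and are proposed as disjoint, while City and Capital remain highly similar.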

Other systems propose methods to learn more general axioms. One of the first proposals in this direction is the DL-FOIL system [Fanizzi et al., 2008, Rizzo et al., 2020], which is based on algorithms for class learning (aka concept learning), whereby, given a set of positive nodes and a set of negative nodes, the goal is to find a logical class description that divides the positive and negative sets. For example, given \(\{\)Iquique, Arica\(\}\) as the positive set and \(\{\)Santiago\(\}\) as the negative set, we may learn the (DL) class description \(\exists\)nearby.Airport \(\sqcap \neg(\exists\)capital\(^-.\top)\), denoting entities near to an airport that are not capitals, of which all positive nodes are instances and no negative nodes are instances. Such class descriptions are learnt in an analogous manner to how aforementioned systems like AMIE learn rules, with a refinement operator used to move from more general classes to more specific classes (and vice versa), a confidence scoring function, and a search strategy. Another prominent such system is DL-Learner [Bühmann et al., 2016], which further supports learning more general axioms through a scoring function that uses count queries to determine what ratio of expected edges – edges that would be entailed were the axiom true – are indeed found in the graph. For example, to score the axiom \(\exists\)flight\(^{-}\).DomesticAirport \(\sqsubseteq\) InternationalAirport over Figure 5.8, we can use a graph query to count how many nodes have incoming flights from a domestic airport (there are \(3\)), and how many of those are also international airports (there is \(1\)), where the greater the difference between the two counts, the weaker the evidence for the axiom.
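The count-query scoring just described can be sketched directly. The toy data below is chosen to reproduce the counts in the example (3 nodes with incoming flights from domestic airports, 1 of which is international); the node names and edges are illustrative, not the exact data of Figure 5.8.

```python
# Toy instance data, chosen to reproduce the counts in the example above
# (node names and edges are illustrative, not the exact Figure 5.8 data).
type_of = {
    "ARI": {"DomesticAirport"},
    "IQQ": {"DomesticAirport"},
    "SCL": {"DomesticAirport", "InternationalAirport"},
    "LIM": {"InternationalAirport"},
}
flights = [("ARI", "IQQ"), ("IQQ", "ARI"), ("SCL", "LIM")]

# Count query 1: nodes with an incoming flight from a domestic airport,
# i.e., the instances of the sub-class expression in the axiom
sub = {o for s, o in flights if "DomesticAirport" in type_of[s]}
# Count query 2: of those, the nodes that are international airports,
# i.e., the expected entailments that are confirmed by the graph
confirmed = {o for o in sub if "InternationalAirport" in type_of[o]}

score = len(confirmed) / len(sub)
print(len(sub), len(confirmed), score)  # prints: 3 1 0.3333333333333333
```

A low ratio such as \(\frac{1}{3}\) indicates weak evidence for the axiom: most of the edges it would entail are not found in the graph.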

Hypothesis mining

We now provide some abstract formal definitions for the tasks of rule mining and axiom mining over graphs, which we generically refer to as hypothesis mining.

First we introduce hypothesis induction: a task that captures a more abstract (ideal) case for hypothesis mining. For simplicity, we focus on directed edge-labelled graphs. With a slight abuse of notation, we may interpret a set of edges \(E\) as the graph with precisely those edges, containing no nodes or labels that do not appear in some edge. We may also interpret an edge \(e\) as the graph formed by \(\{ e \}\).

Hypothesis induction
The task of hypothesis induction assumes a particular graph entailment relation \(\models_\Phi\) (see Definition 4.4; hereafter simply \(\models\)). Given background knowledge in the form of a knowledge graph \(G\) (a directed edge-labelled graph, possibly extended with rules or ontologies), a set of positive edges \(E^{+}\) such that \(G\) does not entail any edge in \(E^{+}\) (i.e., for all \(e^{+} \in E^{+}\), \(G \not\models e^{+}\)) and \(E^{+}\) does not contradict \(G\) (i.e., there is a model of \(G \cup E^{+}\)), and a set of negative edges \(E^{-}\) such that \(G\) does not entail any edge in \(E^-\) (i.e., for all \(e^{-} \in E^{-}\), \(G \not\models e^{-}\)), the task is to find a set of hypotheses (i.e., a set of directed edge-labelled graphs) \(\Psi\) such that:
  • \(G \not\models \psi\) for all \(\psi \in \Psi\) (the background knowledge does not entail any hypothesis directly);
  • \(G \cup \Psi^* \models E^{+}\) (the background knowledge and hypotheses together entail all positive edges);
  • for all \(e^{-} \in E^{-}\), \(G \cup \Psi^* \not\models e^{-}\) (the background knowledge and hypotheses together do not entail any negative edge);
  • \(G \cup \Psi^* \cup E^{+}\) has a model (the background knowledge, hypotheses and positive edges taken together do not contain a contradiction);
  • for all \(e^{+} \in E^{+}\), \(\Psi^* \not\models e^{+}\) (the hypotheses alone do not entail a positive edge).
where by \(\Psi^* \coloneqq \cup_{\psi \in \Psi} \psi\) we denote the union of all graphs in \(\Psi\).

Let us assume ontological entailment \(\models\) with the semantic conditions \(\Phi\) as defined in Tables 4.1–4.3. Given the graph of Figure 5.8 as the background knowledge \(G\), along with:

  • a set of positive edges \(E^{+} = \{\)SCL –flight→ ARI, SCL –domestic flight→ ARI\(\}\), and
  • a set of negative edges \(E^{-} = \{\)ARI –flight→ LIM, SCL –domestic flight→ LIM\(\}\),

then the set of hypotheses \(\Psi = \{\)flight –type→ Symmetric, domestic flight –type→ Symmetric\(\}\) satisfies the conditions for hypothesis induction: its hypotheses are not entailed by \(G\); combined with \(G\), they entail all positive edges in \(E^{+}\) and no negative edges in \(E^{-}\); they do not contradict \(G \cup E^{+}\); and they do not entail a positive edge without \(G\).
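The conditions of the definition can be checked mechanically for a toy entailment relation. In the sketch below – loosely following the worked example, with illustrative node names – the only hypotheses expressible declare a property symmetric, and entailment is computed by closing the graph under such declarations (a stand-in for \(\models\); the consistency condition is trivial here, since this fragment cannot express a contradiction).

```python
SYM = ("type", "Symmetric")

def entails(graph):
    """Close a graph under its  p -type-> Symmetric  declarations (toy |=)."""
    closed = set(graph)
    while True:
        syms = {s for (s, p, o) in closed if (p, o) == SYM}
        extra = {(o, p, s) for (s, p, o) in closed if p in syms}
        if extra <= closed:
            return closed
        closed |= extra

def induces(G, E_pos, E_neg, Psi):
    """Check the hypothesis-induction conditions for this toy entailment."""
    union = entails(set(G) | set(Psi))
    return (all(psi not in entails(set(G)) for psi in Psi)      # G does not entail Psi
            and all(e in union for e in E_pos)                  # all positives entailed
            and all(e not in union for e in E_neg)              # no negative entailed
            and all(e not in entails(set(Psi)) for e in E_pos)) # Psi alone insufficient

G = {("ARI", "flight", "SCL"), ("SCL", "flight", "LIM"),
     ("LIM", "flight", "SCL"),
     ("ARI", "domestic flight", "SCL"),
     ("ARI", "domestic flight", "IQQ"), ("IQQ", "domestic flight", "ARI")}
Psi = {("flight",) + SYM, ("domestic flight",) + SYM}
E_pos = {("SCL", "flight", "ARI"), ("SCL", "domestic flight", "ARI")}
E_neg = {("ARI", "flight", "LIM"), ("SCL", "domestic flight", "LIM")}

print(induces(G, E_pos, E_neg, Psi))  # prints: True
```

Declaring both properties symmetric entails the two positive edges from their reversals in \(G\), while no closure of the toy graph produces either negative edge.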

This task represents a somewhat idealised case. Often there is no set of positive edges distinct from the background knowledge itself. Furthermore, hypotheses not entailing a few positive edges, or entailing a few negative edges, may still be useful. The task of hypothesis mining rather accepts as input the background knowledge \(G\) and a set of negative edges \(E^{-}\) (such that for all \(e^{-} \in E^{-}\), \(G \not\models e^{-}\)), and attempts to score individual hypotheses \(\psi\) (such that \(G \not\models \psi\)) per their ability to “explain” \(G\) while minimising the number of elements of \(E^{-}\) entailed by \(G\) and \(\psi\). We can now abstractly define the task of hypothesis mining.

Hypothesis mining
Given a knowledge graph \(G\), a set of negative edges \(E^{-}\), a scoring function \(\sigma\), and a threshold \(\textsf{min}_{\sigma}\), the goal of hypothesis mining is to identify a set of hypotheses \(\{ \psi \mid G \not\models \psi\text{ and }\sigma(\psi,G,E^{-}) \geq \textsf{min}_{\sigma} \}\).

There are two scoring functions that are frequently used for \(\sigma\) in the literature: support and confidence.

Hypothesis support and confidence
Given a knowledge graph \(G = (V,E,L)\) and a hypothesis \(\psi\), the positive support of \(\psi\) is defined as: \[ \sigma^{+}(\psi,G) \coloneqq |\{ e \in E \mid G' \not\models e \text{ and }G' \cup \psi \models e \}| \] where \(G'\) denotes \(G\) with the edge \(e\) removed. Further given a set of negative edges \(E^{-}\), the negative support of \(\psi\) is defined as: \[ \sigma^{-}(\psi,G,E^{-}) \coloneqq |\{ e^{-} \in E^{-} \mid G \cup \psi \models e^{-} \}| \] Finally, the confidence of \(\psi\) is defined as \(\sigma^\pm(\psi,G,E^{-}) \coloneqq \frac{\sigma^{+}(\psi,G)}{\sigma^{+}(\psi,G) + \sigma^{-}(\psi,G,E^{-})}\).
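These measures can be computed directly from the definitions. The sketch below does so for a symmetry hypothesis over a toy graph, using a simple closure under symmetry declarations in place of a full entailment relation; all names and edges are illustrative.

```python
SYM = ("type", "Symmetric")

def entails(graph):
    """Close a graph under its  p -type-> Symmetric  declarations (toy |=)."""
    closed = set(graph)
    syms = {s for (s, p, o) in closed if (p, o) == SYM}
    closed |= {(o, p, s) for (s, p, o) in graph if p in syms}
    return closed

def sigma_plus(psi, G):
    """Positive support: edges of G recovered by psi under leave-one-out."""
    return sum(1 for e in G
               if e not in entails(G - {e})
               and e in entails((G - {e}) | {psi}))

def sigma_minus(psi, G, E_neg):
    """Negative support: negative edges entailed once psi is added."""
    return sum(1 for e in E_neg if e in entails(G | {psi}))

def confidence(psi, G, E_neg):
    sp = sigma_plus(psi, G)
    return sp / (sp + sigma_minus(psi, G, E_neg))

G = {("SCL", "flight", "LIM"), ("LIM", "flight", "SCL"),
     ("ARI", "flight", "SCL")}
psi = ("flight", "type", "Symmetric")
E_neg = {("SCL", "flight", "ARI")}   # e.g., obtained under PCA

print(sigma_plus(psi, G), sigma_minus(psi, G, E_neg),
      confidence(psi, G, E_neg))  # prints: 2 1 0.6666666666666666
```

The symmetry hypothesis recovers the two reciprocal flight edges when each is removed in turn (\(\sigma^{+} = 2\)), but also entails the negative edge SCL –flight→ ARI (\(\sigma^{-} = 1\)), giving confidence \(\frac{2}{3}\).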

We have yet to specify how the set of negative edges is defined, which, in the context of a knowledge graph \(G\), depends on which of the following assumptions is applied:

  • Closed world assumption (CWA): For any (positive) edge \(e\), \(G \not\models e\) if and only if \(G \models \neg e\). Under CWA, any edge \(e\) not entailed by \(G\) can be considered a negative edge.
  • Open world assumption (OWA): For a (positive) edge \(e\), \(G \not\models e\) does not necessarily imply \(G \models \neg e\). Under OWA, the negation of an edge must be entailed by \(G\) for it to be considered negative.
  • Partial completeness assumption (PCA): If there exists an edge \((s,p,o)\) such that \(G \models (s,p,o)\), then for all \(o'\) such that \(G \not\models (s,p,o')\), it is assumed that \(G \models \neg(s,p,o')\). Under PCA, if \(G\) entails some outgoing edge(s) labelled \(p\) from a node \(s\), then such edges are assumed to be complete, and any edge \((s,p,o')\) not entailed by \(G\) can be considered a negative edge.

Knowledge graphs are generally incomplete – in fact, one of the main applications of hypothesis mining is to improve the completeness of the knowledge graph – and thus it would appear unwise to assume that any edge not currently entailed is false/negative. We can thus rule out CWA. Conversely, under OWA, potentially few (or no) negative edges might be entailed by the given ontologies/rules, and thus hypotheses may end up having low negative support despite entailing many edges that do not make sense in practice. Hence PCA can be adopted as a heuristic to increase the number of negative edges and to apply more sensible scoring of hypotheses. We remark that one can adapt PCA to define negative edges by varying the subject or predicate instead of the object.
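The difference between the assumptions can be seen by generating candidate negatives for a toy graph, where entailment is simply edge membership; the nodes and edges are illustrative.

```python
# Toy graph; entailment here is simply edge membership (names illustrative)
nodes = ["SCL", "ARI", "LIM"]
G = {("SCL", "flight", "LIM"), ("ARI", "flight", "SCL")}

candidates = {(s, "flight", o) for s in nodes for o in nodes if s != o}

# CWA: every candidate edge not entailed by G is negative
cwa_neg = candidates - G

# PCA: (s, p, o') is negative only if s already has some outgoing p-edge
has_out = {s for (s, p, o) in G}
pca_neg = {(s, p, o) for (s, p, o) in candidates - G if s in has_out}

# Under OWA, no negation is entailed in this fragment: no negative edges

print(sorted(cwa_neg - pca_neg))  # negatives under CWA but not under PCA
```

Here LIM has no outgoing flight edges, so PCA treats its missing flights as unknown rather than negative, while CWA counts them all as negative.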

Different implementations of hypothesis mining may consider different logical languages. Rule mining, for example, mines hypotheses expressed either as monotonic rules (with only positive edges) or non-monotonic rules (possibly with negated edges). On the other hand, axiom mining considers hypotheses expressed in a logical language such as Description Logics. Particular implementations may, for practical reasons, impose further syntactic restrictions on the hypotheses generated, such as to impose thresholds on their length, on the symbols they use, or on other structural properties (such as “closed rules” in the case of the AMIE rule mining system [Galárraga et al., 2013]; see Section 5.4). Systems may further implement different search strategies for hypotheses. Systems such as DL-FOIL [Fanizzi et al., 2008, Rizzo et al., 2020], AMIE [Galárraga et al., 2013], RuLES [Ho et al., 2018], CARL [Pellissier Tanon et al., 2017], DL-Learner [Bühmann et al., 2016], etc., propose discrete mining that recursively generates candidate formulae through refinement/genetic operators that are then scored and checked for threshold criteria. On the other hand, systems such as NeuralLP [Yang et al., 2017] and DRUM [Sadeghian et al., 2019] apply differentiable mining that allows for learning (path-like) rules and their scores in a more continuous fashion (e.g., using gradient descent). We refer to Section 5.4 for further discussion and examples of such techniques for mining hypotheses.