Knowledge Graphs

Creation and Enrichment

In this chapter, we discuss the principal techniques by which knowledge graphs can be created and subsequently enriched from diverse sources of legacy data that may range from plain text to structured formats (and anything in between). The appropriate methodology to follow when creating a knowledge graph depends on the actors involved, the domain, the envisaged applications, the available data sources, etc. Generally speaking, however, the flexibility of knowledge graphs lends itself to starting with an initial core that can be incrementally enriched from other sources as required (typically following an Agile [Hunt and Thomas, 2003] or “pay-as-you-go” [Sequeda et al., 2019] methodology). For our running example, we assume that the tourism board decides to build a knowledge graph from scratch, aiming to initially describe the main tourist attractions – places, events, etc. – in Chile in order to help visiting tourists identify those that most interest them. The board decides to postpone adding further data, like transport routes, reports of crime, etc., for a later date.

Human Collaboration

One approach for creating and enriching knowledge graphs is to solicit direct contributions from human editors. Such editors may be found in-house (e.g., employees of the tourist board), using crowd-sourcing platforms, through feedback mechanisms (e.g., tourists adding comments on attractions), through collaborative-editing platforms (e.g., an attractions wiki open to public edits), etc. Though human involvement incurs high costs [Paulheim, 2018], some prominent knowledge graphs have been primarily based on direct contributions from human editors [Vrandečić and Krötzsch, 2014, He et al., 2016]. Depending on how the contributions are solicited, however, the approach has a number of key drawbacks, due primarily to human error [Pellissier Tanon et al., 2016], disagreement [Yasseri et al., 2012], bias [Janowicz et al., 2018], vandalism [Heindorf et al., 2016], etc. Successful collaborative creation further raises challenges concerning licensing, tooling, and culture [Pellissier Tanon et al., 2016]. Humans are sometimes rather employed to verify and curate additions to a knowledge graph extracted by other means [Pellissier Tanon et al., 2016] (through, e.g., video games with a purpose [Jurgens and Navigli, 2014]), to define high-quality mappings from other sources [Das et al., 2012], to define appropriate high-level schema [Keet, 2018, Labra Gayo et al., 2018], and so forth.

Text Sources

Text corpora – such as those sourced from newspapers, books, scientific articles, social media, emails, web crawls, etc. – are an abundant source of rich information [Hellmann et al., 2013, Rospocher et al., 2016]. However, extracting such information with high precision and recall for the purposes of creating or enriching a knowledge graph is a non-trivial challenge. To address this, techniques from Natural Language Processing (NLP) [Maynard et al., 2016, Jurafsky and Martin, 2019] and Information Extraction (IE) [Weikum and Theobald, 2010, Grishman, 2012, Martínez-Rodríguez et al., 2020] can be applied. Though processes vary considerably across text extraction frameworks, in Figure 6.1 we illustrate four core tasks for text extraction on a sample sentence. We will discuss these tasks in turn.

Figure 6.1: Text extraction example; dashed nodes are new to the knowledge graph

Pre-processing

The pre-processing task may involve applying various techniques to the input text, where Figure 6.1 illustrates Tokenisation, which parses the text into atomic terms and symbols. Other pre-processing tasks applied to a text corpus may include: Part-of-Speech (POS) tagging [Maynard et al., 2016, Jurafsky and Martin, 2019] to identify terms representing verbs, nouns, adjectives, etc.; Dependency Parsing, which extracts a grammatical tree structure for a sentence where leaf nodes indicate individual words that together form phrases (e.g., noun phrases, verb phrases) and eventually clauses and sentences [Maynard et al., 2016, Jurafsky and Martin, 2019]; and Word Sense Disambiguation (WSD) [Navigli, 2009] to identify the meaning (aka sense) in which a word is used, linking words with a lexicon of senses (e.g., WordNet [Miller and Fellbaum, 2007] or BabelNet [Navigli and Ponzetto, 2012]), where, for instance, the term flights may be linked with the WordNet sense “an instance of travelling by air” rather than “a stairway between one floor and the next”. The appropriate type of pre-processing to apply often depends on the requirements of later tasks in the pipeline.
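
To make these pre-processing steps concrete, the following is a minimal sketch (not part of any framework described above) that applies tokenisation, POS tagging and dependency parsing with the spaCy library, and a rough Lesk-based WSD step with NLTK; the model name en_core_web_sm, the example sentence, and the choice of libraries are assumptions for illustration, and the spaCy model and NLTK WordNet corpus must be downloaded beforehand.

# Minimal pre-processing sketch: tokenisation, POS tags and dependency parse
# with spaCy, plus a crude Lesk-based word-sense disambiguation step with NLTK.
import spacy
from nltk.wsd import lesk  # requires the NLTK 'wordnet' corpus to be installed

nlp = spacy.load("en_core_web_sm")  # assumed small English pipeline
sentence = "Santiago has flights to Easter Island, named a World Heritage Site in 1995."

doc = nlp(sentence)
for token in doc:
    # token text, part-of-speech tag, dependency label, and head in the parse tree
    print(token.text, token.pos_, token.dep_, token.head.text)

# Lesk-based WSD: pick a WordNet sense for "flights" given the sentence context
tokens = [t.text for t in doc]
sense = lesk(tokens, "flights", pos="n")
print(sense, sense.definition() if sense else None)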

Named Entity Recognition (NER)

The NER task identifies mentions of named entities in a text [Nadeau and Sekine, 2007, Ratinov and Roth, 2009], typically targeting mentions of people, organisations, locations, and potentially other types [Ling and Weld, 2012, Nakashole et al., 2013, Yogatama et al., 2015]. A variety of NER techniques exist, with many modern approaches based on learning frameworks that leverage lexical features (e.g., POS tags, dependency parse trees, etc.) and gazetteers (e.g., lists of common first names, last names, countries, prominent businesses, etc.). Supervised methods [Bikel et al., 1999, Finkel et al., 2005, Lample et al., 2016] require manually labelling all entity mentions in a training corpus, whereas bootstrapping-based approaches [Collins and Singer, 1999, Etzioni et al., 2004, Nakashole et al., 2013, Gupta and Manning, 2014] rather require a small set of seed examples of entity mentions from which patterns can be learnt and applied to unlabelled text. Distant supervision [Ling and Weld, 2012, Ren et al., 2015, Yogatama et al., 2015] uses known entities in a knowledge graph as seed examples through which similar entities can be detected. Aside from learning-based frameworks, traditional approaches based on manually-crafted rules [Kluegl et al., 2009, Chiticariu et al., 2018] are still sometimes used due to their more controllable and predictable behaviour [Chiticariu et al., 2013]. The named entities identified by NER may be used to generate new candidate nodes for the knowledge graph (known as emerging entities, shown dashed in Figure 6.1), or may be linked to existing nodes per the Entity Linking task described in the following.
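
As a brief illustration of an off-the-shelf supervised approach, the sketch below runs spaCy's pre-trained NER model over the sample sentence; the recognised spans could then seed emerging entities or feed the Entity Linking task. The model name and sentence are assumptions for illustration.

# Minimal NER sketch with spaCy's pre-trained (supervised) pipeline; recognised
# spans can become candidate (emerging) nodes or inputs to entity linking.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed pre-trained English pipeline
doc = nlp("Santiago has flights to Easter Island, named a World Heritage Site in 1995.")

for ent in doc.ents:
    # surface form and coarse-grained entity type (e.g., GPE, LOC, DATE)
    print(ent.text, ent.label_)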

Entity Linking (EL)

The EL task associates mentions of entities in a text with the existing nodes of a target knowledge graph, which may be the nucleus of a knowledge graph under creation, or an external knowledge graph [Wu et al., 2018]. In Figure 6.1, we assume that the nodes Santiago and Easter Island already exist in the knowledge graph (possibly extracted from other sources). EL may then link the given mentions to these nodes. The EL task presents two main challenges. First, there may be multiple ways to mention the same entity, as in the case of Rapa Nui and Easter Island; if we created a node Rapa Nui to represent that mention, we would split the information available under both mentions across different nodes, where it is thus important for the target knowledge graph to capture the various aliases and multilingual labels by which one can refer to an entity [Moro et al., 2014]. Second, the same mention in different contexts can refer to distinct entities; for instance, Santiago can refer to cities in Chile, Cuba, Spain, amongst others. The EL task thus considers a disambiguation phase wherein mentions are associated to candidate nodes in the knowledge graph, the candidates are ranked, and the most likely node being mentioned is chosen [Wu et al., 2018]. Context can be used in this phase; for example, if Easter Island is a likely candidate for the corresponding mention alongside Santiago, we may boost the probability that this mention refers to the Chilean capital as both candidates are located in Chile. Other heuristics for disambiguation consider a prior probability, where for example, Santiago most often refers to the Chilean capital (being, e.g., the largest city with that name); centrality measures on the knowledge graph can be used for such purposes [Wu et al., 2018].
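
The disambiguation phase can be illustrated with the following toy sketch, which combines an assumed prior probability for each candidate node with a simple context-coherence bonus; the candidate lists, priors, and relatedness scores are illustrative assumptions, not values taken from any real knowledge graph.

# Toy disambiguation sketch: rank candidate nodes for a mention by combining an
# (assumed) prior probability with a bonus for coherence with other mentions.

# Hypothetical candidates and priors, e.g., derived from alias counts.
candidates = {
    "Santiago": {"Santiago (Chile)": 0.6, "Santiago de Cuba": 0.2, "Santiago de Compostela": 0.2},
    "Rapa Nui": {"Easter Island": 0.9, "Rapa Nui (language)": 0.1},
}

# Hypothetical relatedness between nodes (e.g., from graph distance or centrality).
related = {("Santiago (Chile)", "Easter Island"): 0.8}

def disambiguate(mention, context_nodes):
    def score(node):
        prior = candidates[mention][node]
        coherence = sum(related.get((node, c), related.get((c, node), 0.0))
                        for c in context_nodes)
        return prior + coherence
    return max(candidates[mention], key=score)

# "Easter Island" as a context node boosts the Chilean capital for "Santiago".
print(disambiguate("Santiago", context_nodes=["Easter Island"]))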

Relation Extraction (RE)

The RE task extracts relations between entities in the text [Zhou et al., 2005, Bach and Badaskar, 2007]. The simplest case is that of extracting binary relations in a closed setting wherein a fixed set of relation types are considered. While traditional approaches often relied on manually-crafted patterns [Hearst, 1992], modern approaches rather tend to use learning-based frameworks [Roller et al., 2018], including supervised methods over manually-labelled examples [Bunescu and Mooney, 2005, Zhou et al., 2005]. Other learning-based approaches again use bootstrapping [Etzioni et al., 2004, Bunescu and Mooney, 2007] and distant supervision [Mintz et al., 2009, Riedel et al., 2010, Hoffmann et al., 2011, Surdeanu et al., 2012, Xu et al., 2013, Smirnova and Cudré-Mauroux, 2019] to forgo the need for manual labelling; the former requires a subset of manually-labelled seed examples, while the latter finds sentences in a large corpus of text mentioning pairs of entities with a known relation/edge, which are used to learn patterns for that relation. Binary RE can also be applied using unsupervised methods in an open setting – often referred to as Open Information Extraction (OIE) [Banko et al., 2007, Etzioni et al., 2011, Fader et al., 2011, Mausam et al., 2012, Mausam, 2016, Mitchell et al., 2018] – whereby the set of target relations is not pre-defined but rather extracted from text based on, for example, dependency parse trees from which relations are taken.
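
To give a flavour of distant supervision for binary RE, the following toy sketch harvests a lexical pattern from sentences that mention an entity pair connected by a known edge; the known edge, the corpus, and the crude "text between mentions" pattern are all assumptions kept deliberately simple.

# Toy distant-supervision sketch for relation extraction: sentences that mention
# a pair of entities connected by a known edge are used to harvest candidate
# lexical patterns for that relation.
import re

known_edges = [("Santiago", "flight", "Easter Island")]   # assumed known edges
corpus = [
    "Santiago has flights to Easter Island throughout the year.",
    "Arica has flights to Santiago every day.",
]

patterns = {}
for s, rel, o in known_edges:
    for sentence in corpus:
        if s in sentence and o in sentence:
            # take the text between the two mentions as a crude candidate pattern
            m = re.search(re.escape(s) + r"\s+(.*?)\s+" + re.escape(o), sentence)
            if m:
                patterns.setdefault(rel, set()).add(m.group(1))

print(patterns)  # {'flight': {'has flights to'}}
# High-confidence patterns can then be matched against new sentences (with NER/EL
# delimiting the mentions) to propose novel edges such as Arica --flight--> Santiago.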

A variety of RE methods have been proposed to extract \(n\)-ary relations that capture further context for how entities are related. In Figure 6.1, we see how an \(n\)-ary relation captures additional temporal context, denoting when Rapa Nui was named a World Heritage site; in this case, an anonymous node is created to represent the higher-arity relation in the directed-labelled graph. Various methods for \(n\)-ary RE are based on frame semantics [Fillmore, 1976], which, for a given verb (e.g., “named”), captures the entities involved and how they may be interrelated. Resources such as FrameNet [Baker et al., 1998] then define frames for words, which, for example, may identify that the semantic frame for “named” includes a speaker (the person naming something), an entity (the thing named) and a name. Optional frame elements are an explanation, a purpose, a place, a time, etc., that may add context to the relation. Other RE methods are rather based on Discourse Representation Theory (DRT) [Kamp, 1981], which considers a logical representation of text based on existential events. Under this theory, for example, the naming of Easter Island as a World Heritage Site is considered to be an (existential) event where Easter Island is the patient (the entity affected), leading to the logical (neo-Davidsonian) formula:

\( \exists e : \big(\textrm{naming}(e) \wedge \textrm{patient}(e, \textrm{Easter Island}) \wedge \textrm{name}(e, \textrm{World Heritage Site})\big) \)

Such a formula is analogous to reification, as discussed previously in Section 3.3, where \(e\) is an existential term that refers to the \(n\)-ary relation being extracted.
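
As a sketch of how such an n-ary relation could be materialised in a graph, the following uses the rdflib library (version 6+) to introduce a fresh node for the naming event, analogous to the reification of Section 3.3; the namespace and term names are illustrative assumptions rather than a prescribed vocabulary.

# Sketch: representing the n-ary "naming" relation as an event node in an RDF
# graph with rdflib; the namespace and IRIs are illustrative assumptions.
from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/")    # assumed namespace
g = Graph()

event = BNode()                          # existential term e for the naming event
g.add((event, RDF.type, EX.Naming))
g.add((event, EX.patient, EX.EasterIsland))
g.add((event, EX["name"], EX.WorldHeritageSite))
g.add((event, EX.time, Literal("1995", datatype=XSD.gYear)))

print(g.serialize(format="turtle"))      # serialize() returns a string in rdflib 6+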

Finally, while relations extracted in a closed setting are typically mapped directly to a knowledge graph, relations that are extracted in an open setting may need to be aligned with the knowledge graph; for example, if an OIE process extracts a binary relation Santiago –has flights to→ Easter Island, it may be the case that the knowledge graph does not have other edges labelled has flights to, where alignment may rather map such a relation to the edge Santiago –flight→ Easter Island, assuming flight is used in the knowledge graph. A variety of methods have been applied for performing such alignments, including mappings [Corcoglioniti et al., 2016, Gangemi et al., 2017] and rules [Rouces et al., 2015] for aligning \(n\)-ary relations; distributional and dependency-based similarities [Moro and Navigli, 2013], association rule mining [Dutta et al., 2014], Markov clustering [Dutta et al., 2015] and linguistic techniques [Martínez-Rodríguez et al., 2018] for aligning OIE relations; amongst others.

Joint tasks

Having presented the four main tasks for building knowledge graphs from text, it is important to note that frameworks do not always follow this particular sequence of tasks. A common trend, for example, is to combine interdependent tasks, jointly performing WSD and EL [Moro et al., 2014], or NER and EL [Luo et al., 2015, Nguyen et al., 2016], or NER and RE [Ren et al., 2017, Zheng et al., 2017], etc., in order to mutually improve the performance of multiple tasks. For further details on extracting knowledge graphs from text we refer to the book by Maynard et al. [2016] and the recent survey by Martínez-Rodríguez et al. [2020].

Markup Sources

The Web was founded on interlinking markup documents wherein markers (aka tags) are used to separate elements of the document (typically for formatting purposes). Most documents on the Web use the HyperText Markup Language (HTML). Figure 6.2 presents an example HTML webpage about World Heritage Sites in Chile. Other formats of markup include Wikitext used by Wikipedia, TeX for typesetting, Markdown used by Content Management Systems, etc. One approach for extracting information from markup documents – in order to create and/or enrich a knowledge graph – is to strip the markers (e.g., HTML tags), leaving only plain text upon which the techniques from the previous section can be applied. However, markup can be useful for extraction purposes, where variations of the aforementioned tasks for text extraction have been adapted to exploit such markup [Lu et al., 2013, Lockard et al., 2018, Martínez-Rodríguez et al., 2020]. We can divide extraction techniques for markup documents into three main categories: general approaches that work independently of the markup used in a particular format, often based on wrappers that map elements of the document to the output; focussed approaches that target specific forms of markup in a document, most typically web tables (but sometimes also lists, links, etc.); and form-based approaches that extract the data underlying a webpage, per the notion of the Deep Web. These approaches can often benefit from the regularities shared by webpages of a given website; for example, intuitively speaking, while the webpage of Figure 6.2 is about Chile, we will likely find pages for other countries following the same structure on the same website.

<html>
  <head><title>UNESCO World Heritage Sites</title></head>
  <body>
    <h1>World Heritage Sites</h1>
	<h2>Chile</h2>
	<p>Chile has 6 UNESCO World Heritage Sites.</p>
	<table border="1">
	  <tr><th>Place</th><th>Year</th><th>Criteria</th></tr>
	  <tr><td>Rapa Nui</td><td>1995</td>
		<td rowspan="6">Cultural</td></tr>
	  <tr><td>Churches of Chiloé</td><td>2000</td></tr>
	  <tr><td>Historical Valparaíso</td><td>2003</td></tr>
	  <tr><td>Saltpeter Works</td><td>2005</td></tr>
	  <tr><td>Sewell Mining Town</td><td>2006</td></tr>
	  <tr><td>Qhapaq Ñan</td><td>2014</td></tr>
	</table>
  </body>
</html>
 

Figure 6.2: Example markup document (HTML) with source code (left) and formatted document (right)

Wrapper-based extraction

Many general approaches are based on wrappers that locate and extract the useful information directly from the markup document. While the traditional approach was to define such wrappers manually – a task for which a variety of declarative languages and tools have been defined – such approaches are brittle to changes in a website’s layout [Ferrara et al., 2014]. Hence other approaches allow for (semi-)automatically inducing wrappers [Flesca et al., 2004]. A modern such approach – used to enrich knowledge graphs in systems such as LODIE [Gentile et al., 2014] – is to apply distant supervision, whereby EL is used to identify and link entities in the webpage to nodes in the knowledge graph such that paths in the markup that connect pairs of nodes for known edges can be extracted, ranked, and applied to other examples. Taking Figure 6.2, for example, distant supervision may link Rapa Nui and World Heritage Sites to the nodes Easter Island and World Heritage Site in the knowledge graph using EL, and given the edge Easter Island –named→ World Heritage Site in the knowledge graph (extracted per Figure 6.1), identify the candidate path \((x,\ \mathtt{td}[1]^{-} \cdot \mathtt{tr}^{-} \cdot \mathtt{table}^{-} \cdot \mathtt{h1},\ y)\) as reflecting edges of the form \(x\) –named→ \(y\), where \(t[n]\) indicates the \(n\)th child of tag \(t\), \(t^-\) its inverse, and \(t_1 \cdot t_2\) concatenation. Finally, paths with high confidence (e.g., ones “witnessed” by many known edges in the knowledge graph) can then be used to extract novel edges, such as Qhapaq Ñan –named→ World Heritage Site, both on this page and on related pages of the website with similar structure (e.g., for other countries).
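
The following sketch applies such an induced path over the HTML of Figure 6.2 using the BeautifulSoup library, pairing each first-column cell with the page's h1 heading to yield candidate named edges; the file name, the hard-coded path, and the use of BeautifulSoup are assumptions for illustration rather than part of any particular wrapper-induction system.

# Sketch: applying an induced wrapper path over the HTML of Figure 6.2 with
# BeautifulSoup, pairing each first-column cell with the page's h1 heading.
from bs4 import BeautifulSoup

html = open("unesco_chile.html", encoding="utf-8").read()  # assumed local copy of the page
soup = BeautifulSoup(html, "html.parser")

target = soup.find("h1").get_text(strip=True)         # "World Heritage Sites"
for row in soup.find("table").find_all("tr")[1:]:     # skip the header row
    first_cell = row.find("td")                       # td[1] in the induced path
    if first_cell:
        # candidate edge: <place> --named--> <target>
        print(first_cell.get_text(strip=True), "named", target)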

Web table extraction

Other approaches target specific types of markup, most commonly web tables embedded in HTML webpages. However, web tables are designed to enhance human rather than machine readability. Many web tables are used for layout and page structure (e.g., navigation bars). Those that contain data may follow different formats, such as relational tables, listings, attribute-value tables, and matrices [Cafarella et al., 2008, Crestan and Pantel, 2011]. A first step is to classify tables to find ones appropriate for the given extraction mechanism(s) [Crestan and Pantel, 2011, Eberius et al., 2015]. Next, web tables may contain column spans, row spans, inner tables, or may be split vertically to improve human aesthetics. Table normalisation merges split tables, un-nests tables, transposes tables, etc. [Pivk et al., 2007, Cafarella et al., 2008, Crestan and Pantel, 2011, Deng et al., 2013, Ermilov and Ngonga Ngomo, 2016, Lehmberg et al., 2016]. Some approaches then identify the table protagonist [Crestan and Pantel, 2011, Muñoz et al., 2014] – the main entity that the table describes – which is often found elsewhere in the webpage; for example, though not mentioned by the table of Figure 6.2, World Heritage Sites is its protagonist. Finally, extraction processes may associate cells with entities [Limaye et al., 2010, Mulwad et al., 2013], columns with types [Deng et al., 2013, Limaye et al., 2010, Mulwad et al., 2013], and column pairs with relations [Limaye et al., 2010, Muñoz et al., 2014]. When enriching knowledge graphs, recent approaches apply distant supervision, linking cells to knowledge graph nodes in order to generate candidates for type and relation extraction [Limaye et al., 2010, Mulwad et al., 2013, Muñoz et al., 2014]. Statistical distributions can also help to link numerical columns [Neumaier et al., 2016]. Specialised table extraction frameworks have also been proposed for specific websites, where prominent knowledge graphs, such as DBpedia [Lehmann et al., 2015] and YAGO [Suchanek et al., 2008], focus on extraction from info-box tables in Wikipedia.
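
As a small sketch of the parsing and normalisation steps, the following uses pandas.read_html (which requires an HTML parser such as lxml) to load the table of Figure 6.2 and forward-fills the spanned Criteria column; the file name is an assumption, and the linking of cells and columns to knowledge-graph nodes, types, and relations is left out.

# Sketch: parsing the web table of Figure 6.2 into a data frame; spanned cells
# such as the "Criteria" column are normalised, after which cells could be linked
# to knowledge-graph nodes for type and relation extraction.
import pandas as pd

tables = pd.read_html(open("unesco_chile.html", encoding="utf-8"))  # assumed local copy
df = tables[0]
df["Criteria"] = df["Criteria"].ffill()   # forward-fill in case spanned cells are left empty

for _, row in df.iterrows():
    # candidate edges, e.g.: <place> --year--> <year>, with the protagonist
    # ("World Heritage Sites") taken from the surrounding page
    print(row["Place"], "year", row["Year"])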

Deep Web crawling

The Deep Web presents a rich source of information accessible only through searches on web forms, thus requiring Deep Web crawling techniques to access [Madhavan et al., 2008]. Systems have been proposed to extract knowledge graphs from Deep Web sources [Geller et al., 2008, Lehmann et al., 2012, Collarana et al., 2016]. Approaches typically attempt to generate sensible form inputs – which may be based on a user query or generated from reference knowledge – and then extract data from the generated responses (markup documents) using the aforementioned techniques [Geller et al., 2008, Lehmann et al., 2012, Collarana et al., 2016].
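
A minimal sketch of this idea is shown below: generated form inputs are posted to a (hypothetical) search form and the returned markup is parsed with the techniques above; the URL, the field name country, and the seed inputs are all assumptions for illustration.

# Sketch: filling a hypothetical search form with generated inputs and parsing
# the returned markup; the URL and field names are assumptions for illustration.
import requests
from bs4 import BeautifulSoup

FORM_URL = "https://example.org/heritage/search"   # hypothetical Deep Web form
seed_inputs = ["Chile", "Peru", "Argentina"]        # e.g., taken from the knowledge graph

for country in seed_inputs:
    response = requests.post(FORM_URL, data={"country": country}, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    for row in soup.select("table tr")[1:]:
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if cells:
            print(country, cells)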

Structured Sources

Much of the legacy data available within organisations and on the Web is represented in structured formats, primarily tables – in the form of relational databases, CSV files, etc. – but also tree-structured formats such as JSON, XML, etc. Unlike text and markup documents, structured sources can often be mapped to knowledge graphs whereby the structure is (precisely) transformed according to a mapping rather than (imprecisely) extracted. The mapping process involves two steps: 1) create a mapping from the source to a graph, and 2) use the mapping in order to materialise the source data as a graph or to virtualise the source (creating a graph view over the legacy data).

Mapping from tables

Tabular sources of data are prevalent; for example, the structured content underlying many organisations and websites is housed in relational databases. In Figure 6.3 we present an example of a relational database instance that we wish to integrate into our knowledge graph. There are then two approaches for mapping content from tables to knowledge graphs: a direct mapping, and a custom mapping.

Report

crime          claimant   station        date
Pickpocketing  XY12SDA    Viña del Mar   2019-04-12
Assault        AB9123N    Arica          2019-04-12
Pickpocketing  XY12SDA    Rapa Nui       2019-04-12
Fraud          FI92HAS    Arica          2019-04-13

Claimant

id       name             country
XY12SDA  John Smith       U.S.
AB9123N  Jeanne Dubois    France
XI92HAS  Jorge Hernández  Chile
Figure 6.3: Relational database instance with two tables describing crime data
Figure 6.4: Direct mapping result for the first rows of both tables in Figure 6.3

A direct mapping automatically generates a graph from a table. We present in Figure 6.4 the result of a standard direct mapping [Arenas et al., 2012], which creates an edge x –y→ z for each (non-header, non-empty, non-null) cell of the table, such that x represents the row of the cell, y the column name of the cell, and z the value of the cell. In particular, x typically encodes the values of the primary key for a row (e.g., Claimant.id); otherwise, if no primary key is defined (e.g., per the Report table), x can be an anonymous node or a node based on the row number. The node x and edge label y further encode the name of the table to avoid clashes across tables that have the same column names used with different meanings. For each row x, we may add a type edge based on the name of its table. The value z may be mapped to datatype values in the corresponding graph model based on the source domain (e.g., a value in an SQL column of type Date can be mapped to xsd:date in the RDF data model). If the value is null (or empty), typically the corresponding edge will be omitted (one might consider representing nulls with anonymous/blank nodes, but nulls in SQL can be used to mean that there is no such value, which conflicts with the existential semantics of such nodes, e.g., in RDF). With respect to Figure 6.4, we highlight the difference between the nodes Claimant-XY12SDA and XY12SDA, where the former denotes the row (or entity) identified by the latter primary key value. In the case of a foreign key between two tables – such as Report.claimant referencing Claimant.id – we can link, for example, to Claimant-XY12SDA rather than XY12SDA, where the former node also has the name and country of the claimant. A direct mapping along these lines has been standardised for mapping relational databases to RDF [Arenas et al., 2012], where Stoica et al. [2019] have recently proposed an analogous direct mapping for property graphs. Another direct mapping has been defined for CSV and other tabular data [Tandy et al., 2015] that further allows for specifying column names, primary/foreign keys, and data types – which are often missing in such data formats – as part of the mapping itself.
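
The following sketch illustrates the direct-mapping idea in Python over the Claimant table of Figure 6.3, emitting one edge per non-null cell with row nodes that encode the table name and primary key; the exact node and edge names are illustrative assumptions, whereas the standardised direct mapping [Arenas et al., 2012] prescribes a precise IRI scheme.

# Sketch of the direct-mapping idea: one edge per (non-null) cell, with the row
# node encoding the table name and primary-key value (naming scheme is illustrative).
claimant_rows = [
    {"id": "XY12SDA", "name": "John Smith", "country": "U.S."},
    {"id": "AB9123N", "name": "Jeanne Dubois", "country": "France"},
]

def direct_map(table_name, rows, primary_key=None):
    edges = []
    for i, row in enumerate(rows):
        key = row.get(primary_key) if primary_key else str(i)
        subject = f"{table_name}-{key}"                  # e.g., Claimant-XY12SDA
        edges.append((subject, "type", table_name))      # type edge per row
        for column, value in row.items():
            if value is not None:                        # omit null/empty cells
                edges.append((subject, f"{table_name}#{column}", value))
    return edges

for edge in direct_map("Claimant", claimant_rows, primary_key="id"):
    print(edge)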

Although a direct mapping can be applied automatically on tabular sources of data and preserve the information of the original source – i.e., allowing a deterministic inverse mapping that reconstructs the tabular source from the output graph [Sequeda et al., 2012] – in many cases it is desirable to customise a mapping, such as to align edge labels or nodes with a knowledge graph under enrichment, etc. Along these lines, declarative mapping languages allow for manually defining custom mappings from tabular sources to graphs. A standard language along these lines is the RDB2RDF Mapping Language (R2RML) [Das et al., 2012], which allows for mapping from individual rows of a table to one or more custom edges, with nodes and edges defined either as constants, as individual cell values, or using templates that concatenate multiple cell values from a row and static substrings into a single term; for example, a template {id}-{country} may produce nodes such as XY12SDA-U.S. from the Claimant table. In case that the desired output edges cannot be defined from a single row, R2RML allows for (SQL) queries to generate tables from which edges can be extracted, where, for example, edges such as U.S. –crimes→ 2 can be generated by defining the mapping with respect to a query that joins the Report and Claimant tables on claimant=id, grouping by country, and applying a count for each country group. A mapping can then be defined on the results table such that the source node denotes the value of country, the edge label is the constant crimes, and the target node is the count value. An analogous standard also exists for mapping CSV and other tabular data to RDF graphs, again allowing keys, column names, and datatypes to be chosen as part of the mapping [Tennison and Kellogg, 2015].
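
The query-backed part of such a custom mapping can be sketched as follows, using an in-memory SQLite copy of Figure 6.3 to compute crimes per country and emitting one edge per result row; note that R2RML itself is declarative (expressed in RDF), so this Python version only illustrates the underlying idea, and the table contents are copied from the figure for the sketch.

# Sketch of a query-backed custom mapping (in the spirit of R2RML): a SQL query
# joins Report and Claimant, and each result row is mapped to an edge such as
# U.S. --crimes--> 2.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE Claimant(id TEXT PRIMARY KEY, name TEXT, country TEXT);
  CREATE TABLE Report(crime TEXT, claimant TEXT, station TEXT, date TEXT);
  INSERT INTO Claimant VALUES ('XY12SDA','John Smith','U.S.'),
                              ('AB9123N','Jeanne Dubois','France');
  INSERT INTO Report VALUES ('Pickpocketing','XY12SDA','Viña del Mar','2019-04-12'),
                            ('Assault','AB9123N','Arica','2019-04-12'),
                            ('Pickpocketing','XY12SDA','Rapa Nui','2019-04-12');
""")

query = """
  SELECT c.country AS country, COUNT(*) AS crimes
  FROM Report r JOIN Claimant c ON r.claimant = c.id
  GROUP BY c.country
"""
for country, crimes in conn.execute(query):
    print((country, "crimes", crimes))   # e.g., ('U.S.', 'crimes', 2)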

Once the mappings have been defined, one option is to use them to materialise graph data following an Extract-Transform-Load (ETL) approach, whereby the tabular data are transformed and explicitly serialised as graph data using the mapping. A second option is to use virtualisation through a Query Rewriting (QR) approach, whereby queries on the graph (using, e.g., SPARQL, Cypher, etc.) are translated to queries over the tabular data (typically using SQL). Comparing these two options, ETL allows the graph data to be used as if they were any other data in the knowledge graph. However, ETL requires updates to the underlying tabular data to be explicitly propagated to the knowledge graph, whereas a QR approach only maintains one copy of data to be updated. The area of Ontology-Based Data Access (OBDA) [Xiao et al., 2018] is concerned with QR approaches that support ontological entailments as seen in Chapter 4. Although most QR approaches only support non-recursive entailments expressible as a single (non-recursive) query, some QR approaches support recursive entailments through rewritings to recursive queries [Sequeda et al., 2014].

Mapping from trees

A number of popular data formats are based on trees, including XML and JSON. While one could imagine – leaving aside issues such as the ordering of children in a tree – a trivial direct mapping from trees to graphs by simply creating edges of the form \(x\) –child→ \(y\) for each node \(y\) that is a child of \(x\) in the source tree, such an approach is not typically used, as it represents the literal structure of the source data. Instead, the content of tree-structured data can be more naturally represented as a graph using a custom mapping. Along these lines, the GRDDL standard [Connolly, 2007] allows for mapping from XML to (RDF) graphs, while languages such as RML allow for mapping from a variety of formats, including XML and JSON, to (RDF) graphs [Dimou et al., 2014]. In contrast, hybrid query languages such as XSPARQL [Bischof et al., 2012] allow for querying XML and RDF in unison, thus supporting both materialisation and virtualisation of graphs over tree-structured sources of legacy data.
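
The contrast between the literal tree structure and a custom mapping can be sketched as follows, where a small (assumed) JSON document about heritage sites is mapped to domain-level edges rather than child edges; GRDDL and RML would declare such a mapping declaratively, so the Python below only illustrates the idea.

# Sketch of a custom (non-direct) mapping from a JSON document to graph edges;
# the document shape and edge labels are assumptions for illustration.
import json

doc = json.loads("""
{
  "country": "Chile",
  "sites": [
    {"name": "Rapa Nui", "year": 1995},
    {"name": "Churches of Chiloé", "year": 2000}
  ]
}
""")

edges = []
for site in doc["sites"]:
    edges.append((site["name"], "named", "World Heritage Site"))
    edges.append((site["name"], "year", site["year"]))
    edges.append((site["name"], "country", doc["country"]))

for edge in edges:
    print(edge)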

Mapping from other knowledge graphs

We may also leverage existing knowledge graphs in order to construct or enrich another knowledge graph. For example, a large number of points of interest for the Chilean tourist board may be available in existing knowledge graphs such as BabelNet [Navigli and Ponzetto, 2012], DBpedia [Lehmann et al., 2015], LinkedGeoData [Stadler et al., 2012], Wikidata [Vrandečić and Krötzsch, 2014], YAGO [Hoffart et al., 2011], etc. However, not all entities and/or relations may be of interest. A standard option to extract a relevant sub-graph of data is to use construct queries that generate graphs as output [Neumaier and Polleres, 2019]. Entity and schema alignment between the knowledge graphs may be further necessary to better integrate (parts of) external knowledge graphs; such alignment may use linking tools for graphs, external identifiers [Pellissier Tanon et al., 2016], or indeed may be done manually [Pellissier Tanon et al., 2016]. For instance, Wikidata [Vrandečić and Krötzsch, 2014] uses Freebase [Bollacker et al., 2007b, Pellissier Tanon et al., 2016] as a source; Gottschalk and Demidova [2018] extract an event-centric knowledge graph from Wikidata, DBpedia and YAGO; while Neumaier and Polleres [2019] construct a spatio-temporal knowledge graph from Geonames, Wikidata, and PeriodO [Golden and Shaw, 2016] (as well as tabular data).
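
As a sketch of extracting a relevant sub-graph via a construct query, the following uses the SPARQLWrapper library against the public Wikidata endpoint; the property and entity identifiers used (P17 for "country", Q298 for "Chile"), the user-agent string, and the filter are assumptions for illustration and should be verified before use.

# Sketch: extracting a relevant sub-graph from Wikidata with a SPARQL CONSTRUCT
# query via SPARQLWrapper; identifiers below are assumptions to be verified.
from SPARQLWrapper import SPARQLWrapper, TURTLE

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="kg-enrichment-sketch/0.1")   # assumed user agent
sparql.setQuery("""
  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  CONSTRUCT { ?site wdt:P17 wd:Q298 ; rdfs:label ?label }
  WHERE {
    ?site wdt:P17 wd:Q298 ; rdfs:label ?label .
    FILTER(lang(?label) = "en")
  }
  LIMIT 100
""")
sparql.setReturnFormat(TURTLE)
result = sparql.query().convert()   # serialised RDF (Turtle) for the sub-graph
print(result)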

Schema/Ontology Creation

The discussion thus far has focussed on extracting data from external sources in order to create and enrich a knowledge graph. In this section, we discuss some of the principal methods for generating a schema based on external sources of data, including human knowledge. For discussion on extracting a schema from the knowledge graph itself, we refer back to Section 3.1.3. In general, much of the work in this area has focussed on the creation of ontologies using ontology engineering methodologies and/or ontology learning. We discuss these two approaches in turn.

Ontology engineering

Ontology engineering refers to the development and application of methodologies for building ontologies, proposing principled processes by which better quality ontologies can be constructed and maintained with less effort. Early methodologies [Grüninger and Fox, 1995a, Fernández et al., 1997, Noy and McGuinness, 2001] were often based on a waterfall-like process, where requirements and conceptualisation were fixed before starting to define the ontology, using, for example, an ontology engineering tool [Gómez-Pérez et al., 2006, Keet, 2018, Kendall and McGuinness, 2019]. However, for situations involving large or ever-evolving ontologies, more iterative and agile ways of building and maintaining ontologies have been proposed.

DILIGENT [Pinto et al., 2009] was an early example of an agile methodology, proposing a complete process for ontology life-cycle management and knowledge evolution, as well as separating local changes (local views on knowledge) from global updates of the core part of the ontology, using a review process to authorise the propagation of changes from the local to the global level. This methodology is similar to how, for instance, the large clinical reference terminology SNOMED CT [IHTSDO, 2019] (also available as an ontology) is maintained and evolved, where the (international) core terminology is maintained based on global requirements, while national or local extensions to SNOMED CT are maintained based on local requirements. A group of authors then decides which national or local extensions to propagate to the core terminology. More modern agile methodologies include eXtreme Design (XD) [Presutti et al., 2009, Blomqvist et al., 2016], Modular Ontology Modelling (MOM) [Krisnadhi and Hitzler, 2016b, Hitzler and Krisnadhi, 2018], Simplified Agile Methodology for Ontology Development (SAMOD) [Peroni, 2016], and more besides. Such methodologies typically include two key elements: ontology requirements and (more recently) ontology design patterns.

Ontology requirements specify the intended task of the resulting ontology, or of the knowledge graph itself in conjunction with the new ontology. A common way to express ontology requirements is through Competency Questions (CQ) [Grüninger and Fox, 1995b], which are natural language questions illustrating the typical information needs that one would require the ontology (or the knowledge graph) to respond to. Such CQs can then be complemented with additional restrictions and reasoning requirements, in case the ontology should also contain restrictions and general axioms for inferring new knowledge or checking data consistency. A common way of testing ontologies (or knowledge graphs based on them) is then to formalise the CQs as queries over some test set of data, and make sure the expected results are entailed [Blomqvist et al., 2012, Keet and Ławrynowicz, 2016]. We may, for example, consider the CQ “What are all the events happening in Santiago?”, which can be represented as a graph query with the patterns ?event –type→ Event and ?event –location→ Santiago. Taking the data graph of Figure 2.1 and the axioms of Figure 3.2, we can check to see if the expected result EID15 is entailed by the ontology and the data, and since it is not, we may consider expanding the axioms to assert that location –type→ Transitive.
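
To illustrate how such a CQ-based test can be automated, the sketch below formalises the CQ as a SPARQL query over a small toy graph with rdflib; the IRIs and the toy data are illustrative assumptions and only stand in for the data graph and axioms of Figures 2.1 and 3.2.

# Sketch: testing a competency question over a toy data graph with rdflib; the CQ
# "What are all the events happening in Santiago?" becomes a SPARQL query.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")   # assumed namespace
g = Graph()
g.add((EX.EID15, RDF.type, EX.Event))
g.add((EX.EID15, EX.location, EX.SantaLucia))     # event located at a venue...
g.add((EX.SantaLucia, EX.location, EX.Santiago))  # ...which is located in Santiago

results = g.query("""
  PREFIX ex: <http://example.org/>
  SELECT ?event WHERE { ?event a ex:Event ; ex:location ex:Santiago . }
""")
print(list(results))  # empty unless location is entailed to be transitive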

Ontology Design Patterns (ODPs) are another common feature of modern methodologies [Gangemi, 2005, Blomqvist and Sandkuhl, 2005], specifying generalisable ontology modelling patterns that can be used as inspiration for modelling similar patterns, as modelling templates [Egaña et al., 2008, Skjæveland et al., 2018], or as directly reusable components [Daga et al., 2008, Shimizu et al., 2019]. Several pattern libraries have been made available online, ranging from carefully curated ones [Aranguren et al., 2008, Shimizu et al., 2019] to open and community moderated ones [Daga et al., 2008]. As an example, to model events in our scenario, we may adopt the Core Event ontology pattern proposed by Krisnadhi and Hitzler [2016a], which specifies a spatio-temporal extent, sub-events, and participants of an event, along with competency questions, formal definitions, etc., to support this pattern.

Ontology learning

The previous methodologies outline methods by which ontologies can be built and maintained manually. Ontology learning, in contrast, can be used to (semi-)automatically extract information from text that is useful for the ontology engineering process [Buitelaar et al., 2005, Cimiano, 2006]. Early methods focussed on extracting terminology from text that may represent the relevant domain’s classes; for example, from a collection of text documents about tourism, a terminology extraction tool – using measures of unithood that determine how cohesive an \(n\)-gram is as a unitary phrase, and termhood that determine how relevant the phrase is to a domain [Martínez-Rodríguez et al., 2018] – may identify \(n\)-grams such as “visitor visa”, “World Heritage Site”, “off-peak rate”, etc., as terminology of particular importance to the tourist domain that thus may merit inclusion in such an ontology. Ontological axioms may also be extracted from text. A common target is to extract sub-class axioms from text, leveraging patterns based on modifying nouns and adjectives that incrementally specialise concepts (e.g., extracting Visitor Visa –subc. of→ Visa from the noun phrase “visitor visa” and isolated appearances of “visa” elsewhere), or using Hearst patterns [Hearst, 1992] (e.g., extracting Off-Peak Rate –subc. of→ Discount from “many discounts, such as off-peak rates, are available” based on the pattern “X, such as Y”). Textual definitions can also be harvested from large texts to extract hypernym relations and induce a taxonomy from scratch [Velardi et al., 2013]. More recent works aim to extract more expressive axioms from text, including disjointness axioms [Völker et al., 2015]; and axioms involving the union and intersection of classes, along with existential, universal, and qualified-cardinality restrictions [Petrucci et al., 2016]. The results of an ontology learning process can then serve as input to a more general ontology engineering methodology, allowing us to validate the terminological coverage of an ontology, to identify new classes and axioms, etc.
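
As a final sketch, the following extracts sub-class candidates with the “X, such as Y” Hearst pattern using a crude regular expression; a practical system would instead match over POS-tagged noun phrases, and the example sentence and head-noun heuristic are assumptions for illustration.

# Sketch: harvesting sub-class candidates with the Hearst pattern "X, such as Y"
# via a crude regular expression over raw text.
import re

text = "Many discounts, such as off-peak rates, are available to visitors."

pattern = re.compile(r"([\w-]+(?:\s[\w-]+)*),?\s+such as\s+([\w-]+(?:\s[\w-]+)*)")
for m in pattern.finditer(text):
    hypernym = m.group(1).split()[-1]    # crude head-noun heuristic: "discounts"
    hyponym = m.group(2)                 # "off-peak rates"
    print(f"{hyponym}  subc. of  {hypernym}")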