KGTK: A Toolkit for Large Knowledge Graph Manipulation and Analysis

05/29/2020 ∙ by Filip Ilievski, et al. ∙ USC Information Sciences Institute ∙ PUC-Rio

Knowledge graphs (KGs) have become the preferred technology for representing, sharing and adding knowledge to modern AI applications. While KGs have become a mainstream technology, the RDF/SPARQL-centric toolset for operating with them at scale is heterogeneous, difficult to integrate and only covers a subset of the operations that are commonly needed in data science applications. In this paper, we present KGTK, a data science-centric toolkit to represent, create, transform, enhance and analyze KGs. KGTK represents graphs in tables and leverages popular libraries developed for data science applications, enabling a wide audience of developers to easily construct knowledge graph pipelines for their applications. We illustrate KGTK with real-world scenarios in which we have used KGTK to integrate and manipulate large KGs, such as Wikidata, DBpedia and ConceptNet, in our own work.




1 Introduction

Knowledge graphs (KGs) have become the preferred technology for representing, sharing and using knowledge in applications. A typical use case is building a new knowledge graph for a domain or application by extracting subsets of several existing knowledge graphs, combining these subsets in application-specific ways, augmenting them with information from structured or unstructured sources, and computing analytics or inferred representations to support downstream applications. For example, during the COVID-19 pandemic, several efforts focused on building KGs about scholarly articles related to the pandemic starting from the CORD-19 dataset provided by the Allen Institute for AI [23]. Enhancing these data with KGs such as DBpedia and Wikidata to incorporate gene, chemical, disease and taxonomic information, and computing network analytics on the resulting graphs, requires the ability to operate on these KGs at scale.

Many tools exist to query, transform and analyze KGs. Notable examples include graph databases such as RDF triple stores and Neo4J; tools for operating on RDF such as graphy and RDFlib; entity linking tools such as WAT [16] or BLINK [24]; entity resolution tools such as MinHash-LSH [12] or MFIBlocks [10]; libraries to compute graph embeddings such as PyTorch-BigGraph [11]; and libraries for graph analytics, such as graph-tool and NetworkX.

There are three main challenges when using these tools together. First, tools may be challenging to set up with large KGs (e.g., the Wikidata RDF dump takes more than a week to load into a triplestore) and often need custom configuration settings that require significant expertise. Second, interoperating between different tools requires developing data transformation scripts, since some of the tools may not be adapted to use the same input/output representation. Third, composing two or more tools together (e.g., to filter, search and analyze a KG) requires writing the intermediate results to disk, which is time- and memory-consuming for large KGs.

In this paper we introduce the Knowledge Graph Toolkit (KGTK), a framework for manipulating, validating and analyzing large-scale KGs. Our work is inspired by Scikit-learn [15] and SpaCy, two popular toolkits for machine learning and natural language processing that have had an enormous impact by making these technologies accessible to data scientists and software developers. The objective of KGTK is to build a comprehensive library of tools and methods to enable easy composition of knowledge graph operations (e.g., validation, filtering, merging, centrality analysis, text embeddings, etc.) to build knowledge-based AI applications. The contributions of KGTK are:

  • The KGTK file format, which allows representing KGs as hypergraphs. This format unifies the Wikidata data model [22] based on items, claims, qualifiers and references, property graphs that support arbitrary attributes on nodes and edges, RDF-Schema-based graphs such as DBpedia [1], and general purpose RDF graphs with various forms of reification. The KGTK format uses tab-separated values (TSV) to represent edge lists, making it easy to process with many off-the-shelf tools.

  • A comprehensive validator and data cleaning module to verify compliance with the KGTK format and normalize literal values such as strings, numbers, misaligned values, etc.

  • Import modules to transform different formats into KGTK, including N-Triples [19], Wikidata qualified terms and ConceptNet [20].

  • Graph manipulation modules for bulk operations on graphs to validate, clean, filter, join, sort and merge KGs. Several of these tools are implemented as wrappers of common, streaming Unix tools such as awk, sort, and join, as well as miller, a package with a comprehensive set of tools to manipulate text-delimited files.

  • Graph querying and analytics modules to compute centrality measures, connected components, and text-based graph embeddings using state-of-the-art language models: RoBERTa [13], BERT [5], and DistilBERT [17]. Common queries, such as computing the set of nodes reachable from other nodes, are also supported.

  • Export modules to transform KGTK format into diverse standard and commonly used formats, such as RDF (N-Triples), property graphs in Neo4J format, and GML to invoke tools such as graph-tool or Gephi.

  • A framework for composing multiple KG operations, based on Unix pipes. The framework uses the KGTK file format on the standard input and output to combine tools written in different programming languages.

KGTK provides an implementation that integrates all these methods relying on widely used tools and standards, thus allowing their composition in pipelines to operate with large KGs (e.g., Wikidata) on an average laptop.

The rest of the paper is structured as follows. Section 2 describes a motivating scenario and lists the requirements for a graph manipulation toolkit. Section 3 describes KGTK by providing an overview of its file format, supported operations, and examples on how to compose them together. Next, Section 4 showcases how we have used KGTK on three different real-world use cases, together with the current limitations of our approach. We then review relevant related work in Section 5, and we conclude the paper in Section 6.

2 Motivating Scenario

The 2020 coronavirus pandemic led to a series of community efforts to publish and share common knowledge about COVID-19 using KGs. Many of these efforts use the COVID-19 Open Research Dataset (CORD-19) [23], compiled by the Allen Institute for AI. CORD-19 is a free resource containing over 44,000 scholarly articles, including over 29,000 with full text, about COVID-19 and the coronavirus family of viruses. Having an integrated KG would allow easy access to information published in scientific papers, as well as to general medical knowledge on genes, proteins, drugs, and diseases mentioned in these papers, and their interactions.

In our work, we integrated the CORD-19 corpus with gene, chemical, disease and taxonomic information from Wikidata and CTD databases, as well as entity extractions from Professor Heng Ji’s BLENDER lab at UIUC. We first extracted all the items and statements for the 30,000 articles in the CORD-19 corpus [23] that were present in Wikidata at the time of extraction. We then extracted all Wikidata articles, authors, and entities mentioned in the BLENDER corpus; homogenized the data to fix inconsistencies (e.g., empty values); created nodes and statements for entities that were not present in Wikidata; added analytic outputs such as PageRank to all nodes in the KG; and exported the output in both RDF and Neo4J formats.

This use case exhibited several of the challenges that KGTK is designed to address. For example, extracting a subgraph from Wikidata articles is not feasible using SPARQL queries as it would have required over 100,000 SPARQL queries; using RDF tools on the Wikidata RDF dump (107 GB compressed) is difficult because the Wikidata RDF model uses small graphs to represent each Wikidata statement; using the Wikidata JSON dump is possible, but requires writing custom code as the schema is specific to Wikidata (hence not reusable for other KGs). In addition, while graph-tool allowed us to compute graph centrality metrics, its input format is incompatible with RDF, requiring a transformation.

Other efforts employed a similar set of processing steps [23]. These range from mapping the CORD-19 data to RDF, to adding annotations to the articles in the dataset pointing to entities extracted from the text, obtained from various sources [8]. A common thread among these efforts involves leveraging existing KGs such as Wikidata and Microsoft Academic Graph to, for example, build a citation network of the papers, authors, affiliations, etc. Other efforts focused on extraction of relevant entities (genes, proteins, cells, chemicals, diseases), relations (causes, upregulates, treats, binds), and linking them to KGs such as Wikidata and DBpedia. Graph analytics operations followed, such as computing centrality measures in order to support identification of key articles, people or substances, or generation of various embeddings to recommend relevant literature associated with an entity. The resulting graphs were deployed as SPARQL endpoints, or exported as RDF dumps, CSV, or JSON files.

These examples illustrate the need for composing sequences of integrated KG operations that extract, modify, augment and analyze knowledge from existing KGs, combining it with non-KG datasets to produce new KGs. Existing KG tooling does not allow users to seamlessly run such sequences of graph manipulation tasks in a pipeline. We propose that an effective toolkit that supports the construction of modular KG pipelines has to meet the following criteria:

  1. A simple knowledge representation format that all modules in the toolkit operate on (the equivalent of datasets in Scikit-learn and document model in SpaCy), to facilitate tool integration without the need of additional data transformations.

  2. Ability to incorporate mature existing tools, wrapping them to support a common API and input/output format. The scientific community has worked for many years on efficient techniques for manipulation of graph and structured data. The toolkit should be able to accommodate them without the need for a new implementation.

  3. A comprehensive set of modules that include import and export modules for a wide variety of KG formats, modules to select, transform, combine, link and merge KGs, modules to improve the quality of KGs and infer new knowledge, and modules to compute embeddings and graph statistics. Such a rich palette of functionalities would largely support use cases such as the ones presented in this section.

  4. A pipeline mechanism to allow composing modules in arbitrary ways to process large public KGs such as Wikidata, DBpedia, or ConceptNet.

3 KGTK: The Knowledge Graph Toolkit

We developed KGTK to help manipulate, curate and analyze large real-world KGs, in which statements may have multiple qualifiers, such as the source of a statement or the units in which an observation is made. Figure 1 shows an overview of the different capabilities of KGTK. Given an input file with triples (either as tab-separated values, Wikidata JSON, or N-Triples), we convert it to an internal representation (the KGTK file format, Section 3.1) that we then use as main input/output format for the rest of the features in the toolkit. Once data is in KGTK format, we can perform operations for curating (data validation and cleaning), transforming (sort, filter or join) and analyzing (computing embeddings, statistics, node centrality) the contents of a knowledge graph. KGTK also provides export operations to commonly used formats, such as N-Triples, Neo4J and JSON. The different features of KGTK are described in Section 3.2, whereas their composition into command line pipes is illustrated in Section 3.3.

Figure 1: Overview of the usage workflow and features included in KGTK.

3.1 KGTK file format

KGTK uses a tab-separated column-based text format to describe any attributed, labeled or unlabeled hypergraph. We chose this format instead of an RDF serialization for two reasons. First, tabular formats are easy to generate and parse by standard tools, and second, this format is self-describing, easy to read and provides a simple mechanism to define edge qualifiers.

KGTK defines KGs as a set of nodes and a set of edges between those nodes. All concepts of meaning are represented via an edge, including edges themselves, allowing KGTK to represent generalized hypergraphs (while supporting the representation of RDF graphs). The snippet below shows a simple example of a KG in KGTK format with three people (Moe, Larry and Curly), the creator of the statements (Hans) and the original source of the statements (Wikipedia):

node1    label      node2   creator  source     id
"Moe"    rdf:type   Person  "Hans"   Wikipedia  E1
"Larry"  rdf:type   Person  "Hans"   Wikipedia  E2
"Curly"  rdf:type   Person           Wikipedia
"Curly"  hasFriend  "Moe"            Wikipedia

The first line of a KGTK file declares the headers to be used in the document. The reserved words node1, label and node2 are used to describe the subject, property and object being described, while creator and source are optional qualifiers for each statement that provide additional provenance information about the creator of a statement and the original source. Note that the example is not using namespace URIs for any nodes and properties, as they are not needed for local knowledge graph manipulation. Nodes and edges may have namespace prefixes (such as rdf in the example) to enable mapping back to RDF after finishing KG manipulations with KGTK. Nodes and edges have unique IDs (when IDs are not present, KGTK generates them automatically).

The snippet below illustrates the representation of qualifiers for individual edges, and shows how the additional columns in the previous example are represented as edges about edges:

node1    label     node2      id
"Moe"    rdf:type  Person     E1
E1       source    Wikipedia  E3
E1       creator   "Hans"     E4
"Larry"  rdf:type  Person     E2
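The relationship between the two representations above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not KGTK's own implementation: the helper name `expand_qualifiers` and the ID scheme (continuing the `E<n>` numbering after the highest ID already present) are hypothetical, so the generated IDs need not match the ones shown in the snippet.

```python
import csv
import io

CORE_COLUMNS = ("node1", "label", "node2", "id")


def expand_qualifiers(tsv_text):
    """Rewrite extra columns of a KGTK-style edge table as edges about edges.

    Hypothetical helper: new edge IDs are generated as E<n>, continuing
    after the highest E<n> found in the input.
    """
    # QUOTE_NONE keeps the double quotes that mark string literals.
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                               quoting=csv.QUOTE_NONE))
    used = [int((row.get("id") or "")[1:]) for row in rows
            if (row.get("id") or "").startswith("E")
            and (row.get("id") or "")[1:].isdigit()]
    next_id = max(used, default=0) + 1

    def fresh_id():
        nonlocal next_id
        new_id = f"E{next_id}"
        next_id += 1
        return new_id

    edges = []
    for row in rows:
        edge_id = row.get("id") or fresh_id()
        edges.append((row["node1"], row["label"], row["node2"], edge_id))
        for column, value in row.items():
            # Every non-empty extra column becomes an edge whose node1
            # is the ID of the edge it qualifies.
            if column is not None and column not in CORE_COLUMNS and value:
                edges.append((edge_id, column, value, fresh_id()))
    return edges


EXAMPLE = (
    'node1\tlabel\tnode2\tcreator\tsource\tid\n'
    '"Moe"\trdf:type\tPerson\t"Hans"\tWikipedia\tE1\n'
    '"Curly"\thasFriend\t"Moe"\t\tWikipedia\t\n'
)

for edge in expand_qualifiers(EXAMPLE):
    print("\t".join(edge))
```

Because qualifiers become ordinary rows, downstream tools only need to understand the four core columns.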

KGTK is designed to support commonly-used typed literals:

  • Language tags: represented following a subset of the RDF convention, language tags are two- or three-letter ISO 639-3 codes, optionally followed by a dialect or location subtag. Example: ‘Sprechen sie deutsch?’@de.

  • Quantities: represented using a variant of the Wikidata format amount toleranceUxxxx. A quantity starts with an amount (number), followed by an optional tolerance interval, and then followed by either a combination of standard (SI) units or a Wikidata node defining the unit (e.g., Q11573 indicates “meter”). Examples include 10m, -1.2e+2[-1.0,+1.0]kg.m/s2 or +17.2Q494083

  • Coordinates: represented by using the Wikidata format @LAT/LON, for example: @043.26193/010.92708

  • Time literals: represented with a ^ character (indicating the tip of a clock hand), followed by an ISO 8601 date and an optional precision designator, for example: ^1839-00-00T00:00:00Z/9

The full KGTK file format specification is available online.
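To make the literal conventions concrete, the sketch below classifies the example values from the list above with regular expressions. The patterns are deliberately simplified assumptions written for this illustration (for instance, they use straight single quotes for language-tagged strings and accept only a small unit grammar); the KGTK specification remains the authority on the exact syntax, and the function name `classify` is hypothetical.

```python
import re

NUMBER = r"[-+]?\d+(?:\.\d+)?"

# Simplified, illustrative patterns; not the official KGTK grammar.
PATTERNS = [
    ("language_string", re.compile(
        r"^'(?P<text>.*)'@(?P<lang>[a-z]{2,3})(?:-(?P<subtag>[A-Za-z0-9]+))?$")),
    ("coordinate", re.compile(
        rf"^@(?P<lat>{NUMBER})/(?P<lon>{NUMBER})$")),
    ("time", re.compile(
        r"^\^(?P<date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z?)"
        r"(?:/(?P<precision>\d+))?$")),
    ("quantity", re.compile(
        rf"^(?P<amount>{NUMBER}(?:[eE][-+]?\d+)?)"
        rf"(?:\[(?P<low>{NUMBER}),(?P<high>{NUMBER})\])?"
        r"(?P<unit>Q\d+|[A-Za-z][A-Za-z0-9./]*)?$")),
]


def classify(value):
    """Return (kind, fields) for a literal, or ("symbol", {}) otherwise."""
    for kind, pattern in PATTERNS:
        match = pattern.match(value)
        if match:
            return kind, match.groupdict()
    return "symbol", {}


for literal in ["'Sprechen sie deutsch?'@de", "10m",
                "-1.2e+2[-1.0,+1.0]kg.m/s2", "+17.2Q494083",
                "@043.26193/010.92708", "^1839-00-00T00:00:00Z/9", "Q5"]:
    print(literal, "->", classify(literal)[0])
```

The leading sigil (`'`, `@`, `^`, or a digit or sign) is what lets a single pass over a column decide which parser to apply.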

3.2 KGTK Operations

KGTK currently supports 13 operations (depicted in Figure 1), grouped into four modules: importing modules, graph manipulation modules, graph analytics modules, and exporting modules. We describe each of these modules below.

3.2.1 Importing and exporting from KGTK

1. The import operation transforms an external graph format into KGTK TSV format. KGTK supports importing N-Triples, ConceptNet and Wikidata (including qualifiers) formats.

2. The export operation transforms a KGTK-formatted graph to a wide palette of formats: TSV (by default), N-Triples, Neo4J Property Graphs, graph-tool and the Wikidata JSON format.

3.2.2 Graph curation and transformation

3. The validate operation ensures that a node or edge file satisfies the KGTK file format specification, detecting errors such as nodes with empty values, values of unexpected length (either too long or too short), potential errors in strings (quotation errors, incorrect use of language tags, etc.), incorrect values in dates, etc.

4. The clean operation fixes a substantial number of errors detected by validate, by fixing encoding errors in strings, replacing invalid dates (e.g., if a minimum valid date is set up), normalizing values for dates, quantities, languages and coordinates using the KGTK convention for literals, and so on.

5. sort efficiently reorders any KGTK file according to one or multiple columns. sort is useful to organize edge files so that, for example, all edges for node1 are contiguous, enabling efficient processing in streaming operations.

6. The remove_columns operation removes a subset of the columns in an input KGTK file (node1 (source), node2 (object), and label (property) cannot be removed). Removing columns is useful in cases where columns have lengthy values and are not relevant to the use case pursued by a user, e.g., removing edge and graph identifiers when users aim to compute node centrality or calculate embeddings.

7. The filter operation selects edges from a KGTK file, by specifying constraints (“patterns”) on the values for node1, label and node2. The pattern language, inspired by graphy.js, has the following form: “subject-pattern ; predicate-pattern ; object-pattern”. For each of the three columns, the filtering pattern can consist of a list of symbols separated using commas. Empty patterns indicate that no filter should be performed for a column. For instance, to select all edges that have property P154 or P279, we can use the pattern “ ; P154,P279 ; ”. Alternatively, a common query of retrieving edges for all humans from Wikidata corresponds to the filter “ ; P31 ; Q5”.

8. The join operation will join two KGTK files. Inner join, left outer join, right outer join, and full outer join are all supported. When a join takes place, the columns from two files are merged into the set of columns for the output file. By default, KGTK will join based on the node1 column, although it can be configured to join by edge id. KGTK also allows the label and node2 columns to be added to the join. Alternatively, the user may supply a list of join columns for each file giving them full control over the semantics of the result.

9. The cat operation concatenates any number of files into a single, KGTK-compliant graph file.
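The semantics of the filter pattern language described in operation 7 can be sketched as follows. This is an illustrative reading of the description above rather than KGTK's implementation; the function names are hypothetical, and the toy edges (including identifiers such as `Q_club`) are made up for the example.

```python
def parse_pattern(pattern):
    """Split 'subject ; predicate ; object' into three sets of symbols.

    An empty set means that the corresponding column is unconstrained.
    """
    parts = pattern.split(";")
    if len(parts) != 3:
        raise ValueError("a pattern needs exactly two ';' separators")
    return [
        {symbol.strip() for symbol in part.split(",") if symbol.strip()}
        for part in parts
    ]


def matches(edge, constraints):
    """True if every constrained column holds one of the allowed symbols."""
    return all(not allowed or value in allowed
               for value, allowed in zip(edge, constraints))


def filter_edges(edges, pattern):
    constraints = parse_pattern(pattern)
    return [edge for edge in edges if matches(edge, constraints)]


# Toy (node1, label, node2) edges; the pairings are invented.
EDGES = [
    ("Q8023", "P31", "Q5"),
    ("Q1426", "P31", "Q5"),
    ("Q1426", "P463", "Q_club"),
    ("Q_a", "P279", "Q_b"),
    ("Q_a", "P154", "Q_logo"),
]

print(filter_edges(EDGES, " ; P31 ; Q5"))
print(filter_edges(EDGES, " ; P154,P279 ; "))
```

Symbols within a column are alternatives, while the three columns are combined conjunctively, which is why an empty column acts as a wildcard.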

3.2.3 Graph querying and analytics

10. reachable_nodes: given a set of nodes N and a set of properties P, this operation computes, for each node in N, the set of nodes that can be reached from it via paths containing any of the properties in P. This operation can be seen as a (joint) closure computation over one or multiple properties for a predefined set of nodes. A common application of this operation is to compute a closure over the subClassOf property, which benefits downstream tasks such as entity linking or table understanding.

11. The connected_components operation finds all connected components (communities) in a graph (e.g., return all the communities connected with an owl:sameAs in a KGTK file).

12. The text_embeddings operation computes embeddings for all nodes in a graph by computing a sentence embedding over a lexicalization of the neighborhood of each node. The lexicalized sentence is created based on a template whose simplified version is:

{label-properties}, {description-properties} is a {isa-properties},
has {has-properties}, and {properties:values}.

The labels (properties) to be used for label-properties, description-properties, isa-properties, has-properties, and property-values pairs, are specified as input arguments to the operation. An example sentence is “Saint David, patron saint of Wales is a human, Catholic priest, Catholic bishop, and has date of death, religion, canonization status, and has place of birth Pembrokeshire”. The sentence for each node is encoded into an embedding using one of 16 currently supported variants of three state-of-the-art language models: BERT, DistilBERT, and RoBERTa. Computing similarity between such entity embeddings is a standard component of modern decision making systems such as entity linking, question answering, or table understanding.

13. The graph_statistics operation computes various graph statistics and centrality metrics. It computes a graph summary, containing its number of nodes, edges, and most common relations. In addition, it can compute graph degrees, HITS centrality and PageRank values. Aggregated statistics (minimum, maximum, average, and top nodes) for these connectivity/centrality metrics are included in the summary, whereas the individual values for each node are represented as edges in the resulting graph. The graph is assumed to be directed, unless indicated differently.
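The lexicalization step used by text_embeddings (operation 12) can be illustrated with a small sketch that fills the simplified template from a node's outgoing edges. The joining rules, the function name `lexicalize` and the toy node identifiers (`Q_david`, `Q_pembrokeshire`) are assumptions made for this example; the actual operation may order and punctuate the parts differently, and the encoding of the sentence with a language model is omitted here.

```python
def lexicalize(node, edges, names, label_props, description_props,
               isa_props, has_props, value_props):
    """Build one sentence for `node` following the simplified template:

    {label}, {description} is a {isa}, has {has}, and {property value}.
    """
    def name(symbol):
        # Fall back to the raw symbol when no readable name is known.
        return names.get(symbol, symbol)

    outgoing = [(label, node2) for node1, label, node2 in edges
                if node1 == node]

    def objects(props):
        return ", ".join(name(obj) for prop, obj in outgoing if prop in props)

    # has-properties contribute only the property name, once each.
    has = ", ".join(dict.fromkeys(
        name(prop) for prop, _ in outgoing if prop in has_props))
    # property-value pairs contribute both the name and the value.
    pairs = ", ".join(f"{name(prop)} {name(obj)}"
                      for prop, obj in outgoing if prop in value_props)

    return (f"{objects(label_props)}, {objects(description_props)} "
            f"is a {objects(isa_props)}, has {has}, and {pairs}.")


EDGES = [
    ("Q_david", "label", "Saint David"),
    ("Q_david", "description", "patron saint of Wales"),
    ("Q_david", "P31", "Q5"),
    ("Q_david", "P570", "some_date"),
    ("Q_david", "P140", "some_religion"),
    ("Q_david", "P19", "Q_pembrokeshire"),
]
NAMES = {"Q5": "human", "P570": "date of death", "P140": "religion",
         "P19": "place of birth", "Q_pembrokeshire": "Pembrokeshire"}

print(lexicalize("Q_david", EDGES, NAMES, {"label"}, {"description"},
                 {"P31"}, {"P570", "P140"}, {"P19"}))
```

The resulting sentence would then be passed to a sentence encoder to obtain the node embedding.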

3.3 Composing operations into pipelines

KGTK has a pipelining architecture based on Unix pipes that allows chaining most operations introduced in the previous section by using the standard input/output and the KGTK file format. Pipelining increases efficiency by avoiding the need to write files to disk and by supporting parallelism, allowing downstream commands to process data before upstream commands complete. We illustrate the chaining of operations in KGTK with three examples from our own work. Note that we have implemented a shortcut pipe operator “/”, which allows users to avoid repeating kgtk in each of their operations. For readability, command arguments are slightly simplified in the paper. Three Jupyter Notebooks that implement these examples can be found online.
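The streaming idea behind this architecture can be sketched with Python generators, where each stage consumes rows as soon as the previous stage produces them and nothing is written to disk in between. This is an analogy under stated assumptions, not how KGTK is implemented: the stage names are hypothetical, and in-memory streams stand in for standard input and output.

```python
import io


def read_edges(stream):
    """Yield one list of fields per line; the first row is the header."""
    for line in stream:
        yield line.rstrip("\n").split("\t")


def keep_labels(rows, wanted):
    """Pass the header through, then only rows whose label is wanted."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return
    yield header
    column = header.index("label")
    for row in rows:
        if row[column] in wanted:
            yield row


def drop_columns(rows, names):
    """Remove the named columns from the header and from every row."""
    rows = iter(rows)
    header = next(rows, None)
    if header is None:
        return
    keep = [i for i, column in enumerate(header) if column not in names]
    yield [header[i] for i in keep]
    for row in rows:
        yield [row[i] for i in keep]


def write_edges(rows, stream):
    for row in rows:
        stream.write("\t".join(row) + "\n")


source = io.StringIO(
    "node1\tlabel\tnode2\tid\n"
    "Q1\tP463\tQ2\tE1\n"
    "Q1\tP31\tQ5\tE2\n"
)
sink = io.StringIO()
# Stages are nested the way commands are chained with pipes.
write_edges(drop_columns(keep_labels(read_edges(source), {"P463"}), {"id"}),
            sink)
print(sink.getvalue(), end="")
```

Replacing the two `StringIO` objects with `sys.stdin` and `sys.stdout` would turn the same chain into a stage that can sit inside a shell pipeline.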

Example 1: Alice wants to import the English subset of ConceptNet [20] in KGTK format to extract a filtered subset where two concepts are connected with a more precise semantic relation such as /r/Causes or /r/UsedFor (as opposed to weaker relations such as /r/RelatedTo). For all nodes in this subset, she wants to compute text embeddings and store them in a file called emb.txt.

To extract the desired subset, the sequence of KGTK commands is as follows:

kgtk import_conceptnet --english_only conceptnet.csv / \
  filter -p "; /r/Causes,/r/UsedFor,/r/Synonym,/r/DefinedAs,/r/IsA ;" / \
  sort -c 1,2,3 > sorted.tsv

To compute embeddings for this subset, she would use text_embedding:

kgtk text_embedding --label-properties "/r/Synonym" \
  --isa-properties "/r/IsA" --description-properties "/r/DefinedAs" \
  --property-value "/r/Causes" "/r/UsedFor" \
  --model bert-large-nli-cls-token -i sorted.tsv \
  > emb.txt

Example 2: Bob wants to extract a subset of Wikidata that contains only edges of the ‘member of’ (P463) property, and strips a set of columns that are not relevant for his use case ($ignore_cols), such as id and rank. While doing so, Bob would also like to clean any erroneous edges. On the clean subset, he would compute graph statistics, including PageRank values and node degrees. Here is how to perform this functionality in KGTK (after Wikidata is already converted to a KGTK file called wikidata.tsv by import_wikidata):

kgtk filter -p ' ; P463 ; ' / clean_data / \
    remove_columns -c "$ignore_cols" wikidata.tsv > graph.tsv
kgtk graph_statistics --directed --degrees --pagerank graph.tsv

Example 3: Carol would like to concatenate two subsets of Wikidata: one containing occupations for several notable people: Sting, Roger Federer, and Nelson Mandela; and the other containing all ‘subclass of’ (P279) relations in Wikidata. The concatenated file needs to be sorted by subject, after which she would compute the set of reachable nodes for these people via the properties ‘occupation’ (P106) or ‘subclass of’ (P279). To achieve this in KGTK, Carol first needs to extract the two subsets with the filter operation:

kgtk filter -p 'Q8023,Q483203,Q1426;P106;' wikidata.tsv > occupation.tsv
kgtk filter -p ' ; P279 ; ' wikidata.tsv > subclass.tsv

Then, she can merge the two files into one, sort it, and compute reachability:

kgtk cat occupation.tsv subclass.tsv / \
     sort -c node1 > sorted.tsv
kgtk reachable_nodes --props P106,P279 --root "Q8023,Q483203,Q1426" \
     sorted.tsv > reachable.tsv
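What the reachability step computes can be sketched as a breadth-first traversal restricted to the chosen properties. The sketch below is a plain-Python illustration and makes no claim about how reachable_nodes is implemented; the function signature and the toy class names (`tennis_player`, `athlete`, and so on) are invented for the example.

```python
from collections import defaultdict, deque


def reachable_nodes(edges, roots, props):
    """Map each root to the nodes reachable via edges labeled in `props`."""
    successors = defaultdict(set)
    for node1, label, node2 in edges:
        if label in props:
            successors[node1].add(node2)

    result = {}
    for root in roots:
        seen = {root}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for neighbor in successors.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        # The root itself is not reported as reachable.
        result[root] = seen - {root}
    return result


# Toy edges; class names stand in for Wikidata identifiers.
EDGES = [
    ("Q1426", "P106", "tennis_player"),
    ("tennis_player", "P279", "athlete"),
    ("athlete", "P279", "person"),
    ("Q1426", "P31", "Q5"),
    ("singer", "P279", "musician"),
]

print(reachable_nodes(EDGES, ["Q1426"], {"P106", "P279"}))
```

Edges with other labels, such as the P31 edge here, are ignored, so the result is the joint closure over the selected properties only.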

4 Discussion

Validating, merging, transforming and analyzing KGs at scale is an open challenge for knowledge engineers, and even more so for data scientists. Complex SPARQL queries often time out on online endpoints, while working with RDF dumps locally takes time and expertise. In addition, popular graph analysis tools do not operate with RDF, making analysis complex for data scientists.

The KGTK format intentionally does not distinguish attributes or qualifiers of nodes and edges from full-fledged edges. Tools operating on KGTK graphs can instead interpret edges differently when desired. In the KGTK file format, everything can be a node, and every node can have any type of edge to any other node. To do so in RDF requires adopting more complex mechanisms such as reification, typically leading to efficiency issues. This generality allows KGTK files to be mapped to most existing DBMSs, and to be used in powerful data transformation and analysis tools such as Pandas.
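As an illustration of the mapping to a relational DBMS, the sketch below loads the four core columns into an in-memory SQLite table and retrieves the qualifiers of each edge with a self-join. The table name, schema and query are assumptions chosen for this example and are not part of KGTK.

```python
import sqlite3

# Edge rows (node1, label, node2, id) from the qualifier example in
# Section 3.1; qualifiers are ordinary rows whose node1 is an edge ID.
ROWS = [
    ('"Moe"', "rdf:type", "Person", "E1"),
    ("E1", "source", "Wikipedia", "E3"),
    ("E1", "creator", '"Hans"', "E4"),
    ('"Larry"', "rdf:type", "Person", "E2"),
]

QUALIFIER_QUERY = """
SELECT e.node1, e.label, e.node2, q.label, q.node2
FROM edges AS e
JOIN edges AS q ON q.node1 = e.id
ORDER BY q.id
"""


def load_edges(rows):
    """Create an in-memory database holding one table of edges."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE edges ("
        "node1 TEXT, label TEXT, node2 TEXT, id TEXT PRIMARY KEY)"
    )
    connection.executemany("INSERT INTO edges VALUES (?, ?, ?, ?)", rows)
    return connection


connection = load_edges(ROWS)
for row in connection.execute(QUALIFIER_QUERY):
    print(row)
```

Since edges and qualifiers share one table, a single join on the edge ID is enough to recover statement-level provenance.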

We believe KGTK will have a significant impact within and beyond the Semantic Web community by helping users to easily perform typical data science operations on large KGs. To give an idea, we downloaded Wikidata (truthy statements distribution, 23.2 GB) and performed a test of filtering out all Qnodes (entities) which have the P31 property (instance of) in Wikidata. Doing this kind of filter in Apache Jena and RDFlib took more than 20 hours. In graphy, the time was reduced to 4h and 15min. Performing the same operation in KGTK took less than 1h and 30min.

Figure 2: SPARQL query and visualization of the CORD-19 use case, illustrating the use of the Wikidata infrastructure using our KG that includes a subset of Wikidata augmented with new properties such as “mentions gene” and “pagerank”.

We have been using the framework in our own work to help us integrate and analyze several KGs:

  • CORD-19: As described in Section 2, we used KGTK to combine extracted information from the papers in the CORD-19 dataset (such as entities of interest) with metadata about them, and general medical and biology knowledge, all found in Wikidata, CTD and the BLENDER datasets. A notebook illustrating the operations used in this use case is available online. Figure 2 shows the CORD-19 KGTK KG loaded in the Wikidata SPARQL query interface. The KGTK tools exported the CORD-19 KG to RDF triples in a format compatible with Wikidata.

  • Commonsense Knowledge Graph (CSKG): Commonsense knowledge is dispersed across a number of (structured) knowledge sources, such as ConceptNet and ATOMIC [18]. After consolidating these knowledge sources into a single commonsense knowledge graph, we used KGTK to compute graph statistics (e.g., number of edges or most frequent relations), HITS, PageRank, and node degrees, in order to measure the impact of the consolidation on the graph connectivity and centrality. We also created RoBERTa-based embeddings of the CSKG nodes, which we are currently using for downstream question answering applications. A notebook illustrating the operations used in this use case is available online.

  • Integrating and exporting Ethiopian quantity data: We are using KGTK to create a custom extension of Wikidata with data about Ethiopia, by integrating quantity indicators like crime, GDP, population, etc.

The heterogeneity of these cases shows how KGTK can be adopted for multi-purpose data-science operations over KGs, independently of the domain. The challenges described in these examples are common in data integration and data science. Given the rate at which KGs are gaining popularity, we expect KGTK to fill an important gap faced by many practitioners wanting to use KGs in their applications.

The primary limitation of KGTK lies in its functionality coverage. The main focus so far has been on supporting basic operations for manipulating KGs, and therefore KGTK does not yet incorporate powerful browsing and visualization tools, or advanced tools for KG identification tasks such as link prediction, entity resolution and ontology mapping.

KGTK is proposed as a new resource, and therefore we don’t have usage metrics at the time of writing this paper.

5 Related Work

Many of the functionalities in KGTK for manipulating and transforming KGs (i.e., join operations, filtering entities, general statistics and node reachability) can be translated into queries in SPARQL. However, the cost of these queries over large endpoints is often too high, and they will time out or take too long to produce a response. In fact, many SPARQL endpoints have been known to have limited availability and slow response times for many queries [4], leaving no choice but to download their data locally for any major KG manipulation. Additionally, it is unclear how to extend SPARQL to support functionalities such as computing embeddings or node centrality.

A scalable alternative to SPARQL is Linked Data Fragments (LDF) [21]. The list of natively supported operations in LDF boils down to triple pattern matching, resembling our proposed filter operation. However, operations like merging and joining are not trivial in LDF, while more complex analytics and querying, like embedding computation, are not supported.

Other works have proposed offline querying. LOD Lab [3] and LOD-a-lot [6] combine LDF with an efficient RDF compression format, called Header Dictionary Triples (HDT) [14, 7], in order to store a LOD dump of 30-40B statements. Although the LOD Lab project also employed mature tooling, such as Elastic Search and bash operations, to provide querying over the data, the set of available operations is restricted by employing LDF as a server, as native LDF only supports pattern matching queries. The HDT compression format has also been employed by other efforts, such as [2], which performs closure and clustering operations over half a billion identity (same-as) statements. However, HDT cannot be easily used by existing tools (e.g., graph-tool or pandas), and it does not describe mechanisms for supporting qualifiers (except for using reification on statements, which complicates the data model).

The recent developments towards supporting triple annotations with RDF* [9] provide support for qualifiers; however, this format is still in its infancy and we expect it to inherit the challenges of RDF, as described before.

Several RDF libraries exist for different programming languages, such as RDFLib in Python, graphy in JavaScript and Jena or RDF4J in Java. The scope of these libraries is different from KGTK, as they focus on providing the building blocks for creating RDF triples, rather than a set of operators to manipulate and analyze large KGs (validate, merge, sort, statistics, etc.).

Outside of the Semantic Web community, there are several prominent efforts to perform operations on graphs. Most notably, graph databases like Neo4j or libraries like graph-tool allow quick and intuitive traversal over KGs. The limitation of these tools is that they need to be integrated with other tooling to compute embeddings or related graph operations, requiring additional expertise.

Finally, the KGX toolkit has a similar objective to KGTK, but it is scoped to process KGs aligned with the Biolink Model, a data model describing biological entities using property graphs. Its set of operations can be regarded as a subset of the operations supported by KGTK. To the best of our knowledge, there is no existing toolkit with a comprehensive set of operations for validating, manipulating, merging, and analyzing knowledge graphs comparable to KGTK.

6 Conclusions and Future Work

Performing common graph operations on large KGs is challenging for data scientists and knowledge engineers. Recognizing this gap, in this paper we presented the Knowledge Graph ToolKit (KGTK): a data science-centric toolkit to represent, create, transform, enhance, and analyze KGs. KGTK represents graphs in tabular format, and leverages popular libraries developed for data science applications, enabling a wide audience of researchers and developers to easily construct KG pipelines for their applications. KGTK currently supports thirteen common operations, including import/export, filter, join, merge, computation of centrality, and generation of text embeddings. We are using KGTK in our own work for three real-world scenarios which benefit from integration and manipulation of large KGs, such as Wikidata and ConceptNet.
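As a rough illustration of why a tabular representation makes such operations easy to compose, the sketch below filters and joins two small edge tables using only the Python standard library. It does not use the KGTK commands themselves; the column names (`node1`, `label`, `node2`), the helper functions, and the toy data are assumptions made for this example.

```python
import csv
import io

# Hypothetical edge tables in tab-separated form (toy data, assumed column names).
EDGES_TSV = (
    "node1\tlabel\tnode2\n"
    "alice\tknows\tbob\n"
    "alice\tworks_at\tacme\n"
    "bob\tknows\tcarol\n"
)
NAMES_TSV = (
    "node1\tlabel\tnode2\n"
    "alice\tname\tAlice A.\n"
    "bob\tname\tBob B.\n"
)


def read_edges(text):
    """Parse a tab-separated edge table into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def filter_edges(edges, label):
    """Keep only the edges whose label column equals the given label."""
    return [e for e in edges if e["label"] == label]


def join_on_node1(left, right):
    """Inner-join two edge tables on their node1 column."""
    by_subject = {}
    for r in right:
        by_subject.setdefault(r["node1"], []).append(r)
    return [
        (l["node1"], l["label"], l["node2"], r["node2"])
        for l in left
        for r in by_subject.get(l["node1"], [])
    ]


edges = read_edges(EDGES_TSV)
names = read_edges(NAMES_TSV)
knows = filter_edges(edges, "knows")
joined = join_on_node1(knows, names)
```

Because every intermediate result is again a table of edges, the output of one step can be fed directly into the next, which is the property that pipeline construction relies on.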

KGTK is under active development, and we are expanding it with new operations. Our CORD-19 use case indicated the need for a tool to create new edges, which will also be beneficial in other domains with emerging information and many long-tail or newly emerging entities. Our commonsense KG use case, which combines a number of initially disconnected graphs, requires new operations that will perform de-duplication of edges in flexible ways. Additional import options are needed to support knowledge sources in custom formats, while new export formats will allow us to leverage a wider span of libraries; e.g., the GraphViz format enables the use of existing visualization tooling. We are also looking at converting other existing KGs to the KGTK format, both to enhance existing KGTK KGs, and to identify the need for additional functionality. In the longer term, we plan to extend the toolkit to support more complex KG operations, such as entity resolution, link prediction, and entity linking.
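Two of the planned operations, flexible de-duplication of edges and export to the GraphViz format, can be sketched as follows. This is a minimal illustration under stated assumptions, not the planned KGTK implementation: the function names, the `key` parameter used to make de-duplication configurable, and the toy edges are all hypothetical.

```python
def dedupe_edges(edges, key=lambda e: e):
    """Drop edges whose key was already seen, preserving first-occurrence order."""
    seen = set()
    unique = []
    for edge in edges:
        k = key(edge)
        if k not in seen:
            seen.add(k)
            unique.append(edge)
    return unique


def quote(value):
    """Wrap a value in double quotes for DOT, escaping embedded double quotes."""
    return '"' + str(value).replace('"', '\\"') + '"'


def to_dot(edges, graph_name="kg"):
    """Serialize (node1, label, node2) triples as a directed graph in DOT syntax."""
    lines = [f"digraph {graph_name} {{"]
    for node1, label, node2 in edges:
        lines.append(f"  {quote(node1)} -> {quote(node2)} [label={quote(label)}];")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edges (toy data) with an exact duplicate and a case variant.
edges = [
    ("paper1", "mentions", "geneA"),
    ("paper1", "mentions", "geneA"),
    ("paper1", "Mentions", "geneA"),
    ("geneA", "associated_with", "diseaseB"),
]
exact = dedupe_edges(edges)
loose = dedupe_edges(edges, key=lambda e: (e[0], e[1].lower(), e[2]))
dot = to_dot(loose)
```

Passing a different `key` function is one simple way to express what counts as a duplicate, for example ignoring label case as above; the resulting DOT text can be handed to existing GraphViz tooling for rendering.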

Looking at a complementary set of functionalities, we are also working on further enhancing the user experience with KGTK. We are currently working to adapt and integrate the SQID KG browser (as shown in Figure 3), which is part of the Wikidata tool ecosystem. To this end, we are using the KGTK export operations to convert any KGTK KG to Wikidata format (JSON and RDF as required by SQID), and are modifying SQID to remove its dependencies on Wikidata. The current prototype can browse arbitrary KGTK files. Remaining work includes computing the KG statistics that SQID requires, and automating deployment of the Wikidata infrastructure for use with KGTK KGs.

Figure 3: SQID visualization of local KGTK data (using the CORD-19 example).


This material is based on research sponsored by Air Force Research Laboratory under agreement number FA8750-20-2-10002. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory or the U.S. Government.


  • [1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. G. Ives (2007) DBpedia: A nucleus for a web of open data. In The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007, Lecture Notes in Computer Science, Vol. 4825, pp. 722–735. Cited by: 1st item.
  • [2] W. Beek, J. Raad, J. Wielemaker, and F. van Harmelen (2018) sameAs.cc: the closure of 500M owl:sameAs statements. In European Semantic Web Conference, pp. 65–80. Cited by: §5.
  • [3] W. Beek, L. Rietveld, F. Ilievski, and S. Schlobach (2016) LOD Lab: scalable linked data processing. In Reasoning Web International Summer School, pp. 124–155. Cited by: §5.
  • [4] C. Buil-Aranda, A. Hogan, J. Umbrich, and P. Vandenbussche (2013) SPARQL web-querying infrastructure: ready for action?. In International Semantic Web Conference, pp. 277–293. Cited by: §5.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: 5th item.
  • [6] J. D. Fernández, W. Beek, M. A. Martínez-Prieto, and M. Arias (2017) LOD-a-lot. In International Semantic Web Conference, pp. 75–83. Cited by: §5.
  • [7] J. D. Fernández, M. A. Martínez-Prieto, A. Polleres, and J. Reindorf (2018) HDTQ: managing RDF datasets in compressed space. In European Semantic Web Conference, pp. 191–208. Cited by: §5.
  • [8] R. Gazzotti, F. Michel, and F. Gandon (2020) CORD-19 named entities knowledge graph (CORD19-NEKG). University Côte d’Azur, Inria, CNRS. Cited by: §2.
  • [9] O. Hartig (2017) RDF* and SPARQL*: an alternative approach to annotate statements in RDF. In International Semantic Web Conference (Posters, Demos & Industry Tracks), Cited by: §5.
  • [10] B. Kenig and A. Gal (2013) MFIBlocks: an effective blocking algorithm for entity resolution. Information Systems 38 (6), pp. 908–926. Cited by: §1.
  • [11] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich (2019) PyTorch-BigGraph: a large-scale graph embedding system. arXiv preprint arXiv:1903.12287. Cited by: §1.
  • [12] J. Leskovec, A. Rajaraman, and J. D. Ullman (2020) Mining of massive data sets. Cambridge university press. Cited by: §1.
  • [13] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: 5th item.
  • [14] M. A. Martínez-Prieto, M. A. Gallego, and J. D. Fernández (2012) Exchange and consumption of huge RDF data. In Extended Semantic Web Conference, pp. 437–452. Cited by: §5.
  • [15] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §1.
  • [16] F. Piccinno and P. Ferragina (2014) From TagME to WAT: a new entity annotator. In Proceedings of the First International Workshop on Entity Recognition & Disambiguation, pp. 55–62. Cited by: §1.
  • [17] V. Sanh, L. Debut, J. Chaumond, and T. Wolf (2019) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: 5th item.
  • [18] M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi (2019) ATOMIC: an atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3027–3035. Cited by: 2nd item.
  • [19] A. Seaborne and G. Carothers (2014-02) RDF 1.1 N-Triples. W3C Recommendation, W3C. Cited by: 3rd item.
  • [20] R. Speer, J. Chin, and C. Havasi (2016) ConceptNet 5.5: an open multilingual graph of general knowledge. arXiv preprint arXiv:1612.03975. Cited by: 3rd item, §3.3.
  • [21] R. Verborgh, M. Vander Sande, P. Colpaert, S. Coppens, E. Mannens, and R. Van de Walle (2014) Web-scale querying through linked data fragments.. In LDOW, Cited by: §5.
  • [22] D. Vrandečić and M. Krötzsch (2014-09) Wikidata: a free collaborative knowledgebase. Communications of the ACM 57 (10), pp. 78–85. ISSN 0001-0782. Cited by: 1st item.
  • [23] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. A. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. M. Raymond, D. S. Weld, O. Etzioni, and S. Kohlmeier (2020) CORD-19: the COVID-19 open research dataset. arXiv preprint arXiv:2004.10706. Cited by: §1, §2, §2, §2.
  • [24] L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer (2019) Zero-shot entity linking with dense entity retrieval. arXiv preprint arXiv:1911.03814. Cited by: §1.