
KnowGL: Knowledge Generation and Linking from Text

10/25/2022
by Gaetano Rossiello, et al.

We propose KnowGL, a tool that allows converting text into structured relational data represented as a set of ABox assertions compliant with the TBox of a given Knowledge Graph (KG), such as Wikidata. We address this problem as a sequence generation task by leveraging pre-trained sequence-to-sequence language models, e.g. BART. Given a sentence, we fine-tune such models to detect pairs of entity mentions and jointly generate a set of facts consisting of the full set of semantic annotations for a KG, such as entity labels, entity types, and their relationships. To showcase the capabilities of our tool, we build a web application consisting of a set of UI widgets that help users to navigate through the semantic data extracted from a given input text. We make the KnowGL model available at https://huggingface.co/ibm/knowgl-large.


Introduction and Related Work

A Knowledge Graph (KG) is defined as a semantic network where entities, such as objects, events, or concepts, are connected to one another through relationships or properties. KGs are organized as multi-graph data structures and stored as sets of triples (or facts), i.e. (Subject, Relation, Object), grounded in a given well-defined ontology Hogan et al. (2021). The use of formal languages to represent KGs enables unambiguous access to data and facilitates automatic reasoning capabilities that enhance downstream applications, such as analytics, knowledge discovery, or recommendations Mihindukulasooriya et al. (2022).

Figure 1: KnowGL Parser Framework

However, building and curating KGs, such as Wikidata Vrandečić and Krötzsch (2014), requires considerable human effort. Systems such as NELL Carlson et al. (2010), DeepDive Niu et al. (2012), Knowledge Vault Dong et al. (2014), and DiffBot de Sá Mesquita et al. (2019) implement Information Extraction (IE) methods for automatic knowledge base population. A standard IE pipeline consists of several steps, such as co-reference resolution Dobrovolskii (2021), named entity recognition Wang et al. (2021), relation extraction Zhong and Chen (2021), and entity linking Wu et al. (2020), each of which is commonly addressed as a separate task. A pipeline approach presents several limitations, e.g. error propagation among the different IE components and complex deployment procedures. Moreover, each component of the pipeline is trained independently, using different architectures and training sets.

The ability to generate structured data from text makes sequence-to-sequence Pre-trained Language Models (PLMs), such as BART Lewis et al. (2020) or T5 Raffel et al. (2020), a valuable alternative for successfully addressing IE Glass et al. (2021, 2022); Ni et al. (2022); Wu et al. (2022); Cabot and Navigli (2021); Josifoski et al. (2022), entity/relation linking Cao et al. (2021); Rossiello et al. (2021), and semantic parsing Zhou et al. (2021); Rongali et al. (2020); Dognin et al. (2021) tasks. In this work, we further explore this direction by asking whether PLMs can be fine-tuned to read a sentence and generate the corresponding full set of semantic annotations compliant with the terminology of a KG. For this purpose, we propose a framework able to convert text into a set of Wikidata statements. As shown in Mihindukulasooriya et al. (2022), the KnowGL parser can automatically extract KGs from collections of documents, helping users explore semantic content, create trend analyses, extract entity infoboxes from text, or enhance content-based recommendation systems with semantic features.

KnowGL Parser

Figure 1 shows an overview of the Knowledge Generation and Linking (KnowGL) tool. Given a sentence as input, KnowGL returns a list of triples (subject, relation, object) in JSON format. (Figure 1 shows a special case where the output consists of only one triple; KnowGL can identify multiple pairs of mentions in the input sentence and, for each of them, generate the corresponding triple with its semantic annotations in a single pass.) In the example, KnowGL identifies the entity mentions semantic web and inference rules in the sentence and generates the relation label use between them. For each mention, the output provides the corresponding entity label and its type. If the entity labels, entity types, and relation labels are found in Wikidata, KnowGL also provides their associated Wikidata IDs. As shown in Figure 1, the KnowGL parser consists of three main components: generation, ranking, and linking, described below.
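The paper does not publish the exact JSON schema, but based on the description above, a single linked fact might be shaped as in the sketch below; the field names, Wikidata ID placeholders, and score value are illustrative assumptions, not KnowGL's actual output format.

```python
# Plausible shape of one linked fact in the returned JSON; field names and
# ID placeholders are assumptions, not the tool's published schema.
example_fact = {
    "subject": {"mention": "semantic web",
                "label": "Semantic Web",
                "label_id": "Q...",   # filled in by the linking component
                "type": "concept",
                "type_id": "Q..."},
    "relation": {"label": "use", "label_id": "P..."},
    "object": {"mention": "inference rules",
               "label": "inference rule",
               "label_id": None,     # null ID: label not found in Wikidata
               "type": "concept",
               "type_id": "Q..."},
    "score": 0.87,                   # score assigned by the fact-ranking step
}
```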

Knowledge Generation

We address fact extraction as an autoregressive generation problem: given a natural language input, the knowledge generation model produces a linearized sequence representation of the facts expressed in the text. We adopt the following schema to represent the semantic annotations of a triple in the target sequence: [(subject mention # subject label # subject type) | relation label | (object mention # object label # object type)]. If the input text contains multiple mention pairs, the linearized target representations are concatenated using $ as a separator, and the facts are sorted by the order of appearance of the head entity in the input text. Unlike Cabot and Navigli (2021); Josifoski et al. (2022), we generate the surface forms, entity labels, and type information for both head and tail entities in the target representation. This constitutes a full set of semantic annotations, i.e. ABox and TBox, with which to construct and populate a KG with new facts. Our hypothesis is that such a self-contained fact representation also acts as an implicit constraint during decoding. We use BART-large Lewis et al. (2020) as the base model and cast the problem as a translation task: at training time, the encoder receives a sentence and the decoder generates the target sequence representation described above. To train the generation model, we extend the REBEL dataset Cabot and Navigli (2021) by adding, for each entity surface form in the text, the entity label and type from Wikidata. REBEL is an updated and cleaner version of T-REx ElSahar et al. (2018), a distantly supervised dataset for relation extraction built by aligning Wikipedia abstracts with Wikidata triples. We train with the standard cross-entropy loss used in machine translation: under teacher forcing, the model treats the problem as a one-to-one mapping and maximizes the log-likelihood of generating the linearized facts given the input text. As reported in Mihindukulasooriya et al. (2022), our KnowGL model (F1 = 70.74) outperforms both a standard IE pipeline system (F1 = 42.50) and the current state-of-the-art generative IE model (F1 = 68.93) Josifoski et al. (2022). For the evaluation, we use the test set released with the REBEL dataset.
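Since the fine-tuned model is released on the Hugging Face Hub, generation can be reproduced with the standard transformers seq2seq API. The sketch below is a minimal example; the input sentence and the beam-search settings are illustrative choices, not values fixed by the paper.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ibm/knowgl-large")
model = AutoModelForSeq2SeqLM.from_pretrained("ibm/knowgl-large")

sentence = "The semantic web uses inference rules to derive new facts."
inputs = tokenizer(sentence, return_tensors="pt")

# Beam-search settings are illustrative; the paper does not fix these values.
out = model.generate(
    **inputs,
    num_beams=10,
    num_return_sequences=10,
    max_length=256,
    return_dict_in_generate=True,
    output_scores=True,  # exposes per-sequence scores used for fact ranking
)

decoded = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)
for seq, score in zip(decoded, out.sequences_scores.tolist()):
    # Each sequence follows the linearization schema described above, e.g.
    # [(mention # label # type) | relation label | (mention # label # type)]
    print(score, seq)
```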

Fact Ranking

This component parses the target sequences generated by the knowledge generation model using a regular expression. The goal is to produce a ranked list of distinct facts with their scores. We extract facts from all sequences returned by beam search (the number of beams is a hyper-parameter), taking the negative log-likelihood of the entire generated sequence as the score for each extracted fact. Since the same fact can appear in several returned sequences, we sum the scores of every sequence in which the fact occurs; the idea is to promote facts that occur multiple times in different beams. Finally, the facts are sorted by their scores.
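A minimal sketch of this step is shown below, taking the beam outputs and scores from the generation sketch above; the regular expression is a simplifying reconstruction of the linearization schema, since the paper does not publish the exact expression it uses.

```python
import re
from collections import defaultdict

# Reconstruction of the linearization schema; an assumption, not the exact
# regular expression used by KnowGL.
FACT_RE = re.compile(
    r"\[\(\s*(.*?)\s*#\s*(.*?)\s*#\s*(.*?)\s*\)\s*\|"
    r"\s*(.*?)\s*\|"
    r"\s*\(\s*(.*?)\s*#\s*(.*?)\s*#\s*(.*?)\s*\)\s*\]"
)

def rank_facts(beam_outputs):
    """beam_outputs: list of (decoded_sequence, sequence_score) pairs,
    one per returned beam. Returns the distinct facts sorted by score;
    a fact found in several beams accumulates the score of every sequence
    it occurs in, which promotes facts agreed upon by multiple beams."""
    scores = defaultdict(float)
    for sequence, seq_score in beam_outputs:
        for chunk in sequence.split("$"):  # "$" separates multiple facts
            match = FACT_RE.search(chunk)
            if match:
                scores[match.groups()] += seq_score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```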

Linking to Wikidata

The linking component retrieves the Wikidata IDs associated with the generated entity, type, and relation labels. For efficiency, we create label-to-ID maps from Wikidata and store them in a key-value data store, avoiding the bottleneck of running multiple SPARQL queries against a Wikidata triple store. Note that the model can generate new entity, type, or relation labels that are not in Wikidata; in this case, the linking component returns a null ID, and the triple can serve as a candidate for adding new facts to Wikidata.
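As a minimal sketch of the lookup, assuming label-to-ID dictionaries precomputed from a Wikidata dump (plain dicts stand in here for the key-value store), linking one ranked fact could look like this:

```python
def link_fact(fact, entity_ids, type_ids, relation_ids):
    """Attach Wikidata IDs to one generated fact.

    `fact` is the 7-tuple produced by the ranking sketch above; the three
    maps are label-to-ID dictionaries built offline from Wikidata.
    dict.get() returns None for labels not in Wikidata, matching the
    null-ID behaviour described above for candidate new facts."""
    s_mention, s_label, s_type, relation, o_mention, o_label, o_type = fact
    return {
        "subject": {"mention": s_mention, "label": s_label,
                    "label_id": entity_ids.get(s_label),
                    "type": s_type, "type_id": type_ids.get(s_type)},
        "relation": {"label": relation,
                     "label_id": relation_ids.get(relation)},
        "object": {"mention": o_mention, "label": o_label,
                   "label_id": entity_ids.get(o_label),
                   "type": o_type, "type_id": type_ids.get(o_type)},
    }
```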

Demonstration

KnowGL Parser is implemented in Python and deployed as a REST API using the Flask framework. The input is a sentence and the output is JSON structured as shown in Figure 1. The user interface described in our video demonstration is implemented as a separate web application using Node.js and React. The UI lets users enter textual content in a text box; the returned JSON is then parsed by the UI to enable different types of visualization. For instance, the facts can be organized into a directed multi-graph whose nodes are the entities and whose edges represent the relations between them. Users can navigate and interact with the nodes and edges to easily locate the textual evidence associated with each triple.
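A minimal sketch of such a Flask service follows; the route name, payload field, and stub pipeline function are assumptions standing in for the actual generate-rank-link chain described above.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def knowgl_pipeline(sentence: str) -> list:
    """Stub for the generate -> rank -> link chain sketched in the
    previous sections; wire the model, ranking, and linking calls here."""
    return []

@app.route("/knowgl", methods=["POST"])  # route name is an assumption
def parse_sentence():
    sentence = request.get_json(force=True)["sentence"]
    return jsonify({"facts": knowgl_pipeline(sentence)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```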

References

  • P. H. Cabot and R. Navigli (2021) REBEL: relation extraction by end-to-end language generation. In EMNLP (Findings), pp. 2370–2381. Cited by: Introduction and Related Work, Knowledge Generation.
  • N. D. Cao, G. Izacard, S. Riedel, and F. Petroni (2021) Autoregressive entity retrieval. In ICLR, Cited by: Introduction and Related Work.
  • A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell (2010) Toward an architecture for never-ending language learning. In AAAI, Cited by: Introduction and Related Work.
  • F. de Sá Mesquita, M. Cannaviccio, J. Schmidek, P. Mirza, and D. Barbosa (2019) KnowledgeNet: A benchmark dataset for knowledge base population. In EMNLP/IJCNLP (1), pp. 749–758. Cited by: Introduction and Related Work.
  • V. Dobrovolskii (2021) Word-level coreference resolution. In EMNLP (1), pp. 7670–7675. Cited by: Introduction and Related Work.
  • P. L. Dognin, I. Padhi, I. Melnyk, and P. Das (2021) ReGen: reinforcement learning for text and knowledge base generation using pretrained language models. In EMNLP (1), pp. 1084–1099. Cited by: Introduction and Related Work.
  • X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang (2014) Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In KDD, pp. 601–610. Cited by: Introduction and Related Work.
  • H. ElSahar, P. Vougiouklis, A. Remaci, C. Gravier, J. S. Hare, F. Laforest, and E. Simperl (2018) T-REx: a large scale alignment of natural language with knowledge base triples. In LREC, Cited by: Knowledge Generation.
  • M. R. Glass, G. Rossiello, Md. F. M. Chowdhury, and A. Gliozzo (2021) Robust retrieval augmented generation for zero-shot slot filling. In EMNLP (1), pp. 1939–1949. Cited by: Introduction and Related Work.
  • M. R. Glass, G. Rossiello, Md. F. M. Chowdhury, A. Naik, P. Cai, and A. Gliozzo (2022) Re2G: retrieve, rerank, generate. In NAACL-HLT, pp. 2701–2715. Cited by: Introduction and Related Work.
  • A. Hogan, E. Blomqvist, M. Cochez, C. d’Amato, G. de Melo, C. Gutiérrez, S. Kirrane, J. E. L. Gayo, R. Navigli, S. Neumaier, A. N. Ngomo, A. Polleres, S. M. Rashid, A. Rula, L. Schmelzeisen, J. F. Sequeda, S. Staab, and A. Zimmermann (2021) Knowledge graphs. ACM Comput. Surv. 54 (4). Cited by: Introduction and Related Work.
  • M. Josifoski, N. D. Cao, M. Peyrard, F. Petroni, and R. West (2022) GenIE: generative information extraction. In NAACL-HLT, pp. 4626–4643. Cited by: Introduction and Related Work, Knowledge Generation.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL, pp. 7871–7880. Cited by: Introduction and Related Work, Knowledge Generation.
  • N. Mihindukulasooriya, M. Sava, G. Rossiello, Md. F. M. Chowdhury, I. Yachbes, A. Gidh, J. Duckwitz, K. Nisar, M. Santos, and A. Gliozzo (2022) Knowledge graph induction enabling recommending and trend analysis: A corporate research community use case. CoRR abs/2207.05188. Cited by: Introduction and Related Work, Introduction and Related Work, Knowledge Generation.
  • J. Ni, G. Rossiello, A. Gliozzo, and R. Florian (2022) A generative model for relation extraction and classification. CoRR abs/2202.13229. Cited by: Introduction and Related Work.
  • F. Niu, C. Zhang, C. Ré, and J. W. Shavlik (2012) DeepDive: web-scale knowledge-base construction using statistical learning and inference. In VLDS, CEUR Workshop Proceedings, Vol. 884, pp. 25–28. Cited by: Introduction and Related Work.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, pp. 140:1–140:67. Cited by: Introduction and Related Work.
  • S. Rongali, L. Soldaini, E. Monti, and W. Hamza (2020) Don’t parse, generate! A sequence to sequence architecture for task-oriented semantic parsing. In WWW, pp. 2962–2968. Cited by: Introduction and Related Work.
  • G. Rossiello, N. Mihindukulasooriya, I. Abdelaziz, M. A. Bornea, A. Gliozzo, T. Naseem, and P. Kapanipathi (2021) Generative relation linking for question answering over knowledge bases. In ISWC, Lecture Notes in Computer Science, Vol. 12922, pp. 321–337. Cited by: Introduction and Related Work.
  • D. Vrandečić and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Commun. ACM 57 (10), pp. 78–85. Cited by: Introduction and Related Work.
  • X. Wang, Y. Jiang, N. Bach, T. Wang, Z. Huang, F. Huang, and K. Tu (2021) Improving named entity recognition by external context retrieving and cooperative learning. In ACL/IJCNLP (1), pp. 1800–1812. Cited by: Introduction and Related Work.
  • L. Wu, F. Petroni, M. Josifoski, S. Riedel, and L. Zettlemoyer (2020) Scalable zero-shot entity linking with dense entity retrieval. In EMNLP (1), pp. 6397–6407. Cited by: Introduction and Related Work.
  • X. Wu, J. Zhang, and H. Li (2022) Text-to-table: A new way of information extraction. In ACL (1), pp. 2518–2533. Cited by: Introduction and Related Work.
  • Z. Zhong and D. Chen (2021) A frustratingly easy approach for entity and relation extraction. In NAACL-HLT, pp. 50–61. Cited by: Introduction and Related Work.
  • J. Zhou, T. Naseem, R. F. Astudillo, Y. Lee, R. Florian, and S. Roukos (2021) Structure-aware fine-tuning of sequence-to-sequence transformers for transition-based AMR parsing. In EMNLP (1), pp. 6279–6290. Cited by: Introduction and Related Work.