Pynsett: A programmable relation extractor

by Alberto Cetoli, et al.

This paper proposes a programmable relation extraction method for the English language by parsing texts into semantic graphs. A person can define rules in plain English that act as matching patterns onto the graph representation. These rules are designed to capture the semantic content of the documents, allowing for flexibility and ad-hoc entities. Relation extraction is a complex task that typically requires sizeable training corpora. The method proposed here is ideal for extracting specialized ontologies in a limited collection of documents.





1 Introduction

The goal of relation extraction is to identify relations among entities in the text. It is an integral part of knowledge base population [Heng2011], question answering [Xu2016], and spoken user interfaces [Yoshino:2011:SDS:2132890.2132898]. Extracting relations reliably is still a challenging task [bunescu-mooney-2005-shortest, Guo2019AttentionGG, Luan2018], with most existing solutions relying on training data that contains a limited set of relations. These approaches cannot match patterns outside the ones specified in the training set.

In many useful cases the relations need to be customised to a specific ontology relevant only in a small collection of documents, making it very difficult to label enough examples. Zero-shot learning has been used to overcome this limit: for example, one can frame relation extraction as a question answering problem [Levy2017ZeroShotRE]. This approach can be quite successful, leveraging recent progress in reading comprehension: it trains a system to extract semantic content first, then applies the learned generalization to create flexible rules for relation extraction.

While impressive, question answering is not completely solved: most reading comprehension corpora present only queries that can be answered using a single phrase [welbl-etal-2018-constructing], so the generalization stemming from a question answering problem limits the type of rules that can be written for relation extraction. Moreover, while a question answering approach improves the recall of the extractor, it can also lower the precision of the matches due to mistaken reading comprehension.

For limited sets of documents the relations to extract can often be pinned down to a few useful sentences. For example, the relation WORKED_AT might be satisfactorily represented by using only a few variations around a PERSON worked for a COMPANY. For relations of this type the generalization needed is limited. Linguistic theories make it possible to generate a semantic representation that offers a useful generalisation of the sentence content, while at the same time providing a framework for precise rule matching over the represented text. By using Discourse Representation Theory [Kamp1993FromDT] or a neo-Davidsonian semantics [Parsons1990] it is possible to describe a collection of sentences as a set of predicates. In these frameworks relation extraction becomes a pattern matching exercise over graphs. The works of Reddy et al. [reddy-etal-2014-large, TACL807] as well as Tiktinsky et al. [tiktinsky2020pybart] are an inspiration for this paper.

Further flexibility comes from representing words using word embeddings [MikolovSCCD13]. In this paper each lemma is associated with an entry in the GloVe dataset [pennington2014glove]. In addition, specialised entities are written as a list of embeddings.

Writing a discourse as a collection of predicates is isomorphic to a graph representation of the text. The main idea of this paper is to discover relations in the discourse by matching specific sub-graphs. Each pattern match is effectively a graph query where the data is the discourse. The main contribution of this work is two-fold: First, it suggests a way to semantically encode sentences. Secondly, it defines a method for creating a set of flexible rules for low-resource relation extraction.

2 Implementation

2.1 Semantic representation

Sentences are transformed into graphs following a method similar to that of [TACL807]. We start with a dependency parser [spacy2] and apply a series of transformations to obtain a neo-Davidsonian representation of the sentence (the full list of transformations from the dependency tree to the neo-Davidsonian form can be found in the code repository). In this form active and passive voices are represented with the same expression, all words are lemmatized, and co-reference is added to the representation.

For example, the text "Jane is working at ACME Inc as a woodworker. She is quite taller than the average" becomes, in predicate form,

Jane(r1), work(e1), ACME_Inc(r2), woodworker(r3),
AGENT(e1, r1), at(e1, r2), as(e1, r3),
Jane(r4), be(e2), taller(r5), average(r6), quite(r7),
AGENT(e2, r4), ADJECTIVE(r4, r5), than(r5, r6), ADJECTIVE(r5, r7),
REFERS_TO(r1, r4), REFERS_TO(r4, r5)

In this representation the sentence is a graph (Fig. 1), where the nodes are nouns, verbs, adverbs, and adjectives, and the edges are the semantic relations among them.

An additional level of semantics is added by linking together two nouns that co-refer, using the REFERS_TO edge.
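The correspondence between the predicate form and the graph can be illustrated with a minimal sketch. It assumes that one-argument predicates are nodes and two-argument predicates are labelled edges; the function name `predicates_to_graph` and the tuple layout are illustrative choices, not the library's internal format.

```python
import re

# The predicate form from the example above, copied verbatim.
PREDICATES = """
Jane(r1), work(e1), ACME_Inc(r2), woodworker(r3),
AGENT(e1, r1), at(e1, r2), as(e1, r3),
Jane(r4), be(e2), taller(r5), average(r6), quite(r7),
AGENT(e2, r4), ADJECTIVE(r4, r5), than(r5, r6), ADJECTIVE(r5, r7),
REFERS_TO(r1, r4), REFERS_TO(r4, r5)
"""


def predicates_to_graph(text):
    """Split a predicate string into nodes and labelled edges.

    Assumption: unary predicates name a node, binary predicates
    name an edge from the first argument to the second.
    """
    nodes = {}   # node id -> lemma
    edges = []   # (source id, label, target id)
    for name, args in re.findall(r"(\w+)\(([^)]*)\)", text):
        ids = [a.strip() for a in args.split(",")]
        if len(ids) == 1:
            nodes[ids[0]] = name
        elif len(ids) == 2:
            edges.append((ids[0], name, ids[1]))
        else:
            raise ValueError(f"unexpected arity for predicate {name}")
    return nodes, edges


nodes, edges = predicates_to_graph(PREDICATES)
coreference_links = [(s, t) for s, label, t in edges if label == "REFERS_TO"]
```

Filtering the edges by label, as in the last line, is enough to recover the co-reference links separately from the other semantic relations.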

2.2 Matching of words

Words are represented using the Glove word embeddings of their lemma and a few different tags:

  • Negated: A True/False value that indicates whether a word is associated with a negation: if a verb is negated the adverb does not appear as a new node, rather the verb is flagged using this tag. In this way "work" can never match "does not work".

  • Named Entity Type: A label indicating the entity type of the node as per Ontonotes 5.0 notation [Hovy:2006:O9S:1614049.1614064].

  • Node type: Whether it is a verb, a noun, an adjective, or an adverb.

For example, the noun Jane is represented internally as

    lemma: "Jane",
    negated: False,
    entity_type: PERSON,
    node_type: noun

Two words match if the dot product between their lemmas' embeddings is greater than a specific threshold, and all the other tags coincide. For example, the words carpenter and woodworker match within the used threshold. This solution can in principle be augmented with an external ontology, where synonyms and hypernyms would trigger a match as well.
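The matching criterion can be sketched as follows. This is only an illustration: the three-dimensional vectors are made-up stand-ins for GloVe embeddings, the threshold value of 0.9 is arbitrary, and the vectors are normalised before the dot product, which is an assumption not stated in the text.

```python
import math
from dataclasses import dataclass

# Made-up 3-d vectors standing in for real GloVe embeddings (assumption).
TOY_EMBEDDINGS = {
    "carpenter": (0.9, 0.4, 0.1),
    "woodworker": (0.8, 0.5, 0.2),
    "banana": (0.0, 0.1, 0.9),
}


@dataclass(frozen=True)
class Word:
    lemma: str
    negated: bool = False
    entity_type: str = ""
    node_type: str = "noun"


def unit(vector):
    """Scale a vector to length 1 (normalisation is an assumption here)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return tuple(x / norm for x in vector)


def similarity(a, b):
    """Dot product of the two normalised vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def words_match(a, b, threshold=0.9, embeddings=TOY_EMBEDDINGS):
    """Two words match if all tags coincide and the lemmas are close enough."""
    tags_a = (a.negated, a.entity_type, a.node_type)
    tags_b = (b.negated, b.entity_type, b.node_type)
    if tags_a != tags_b:
        return False
    return similarity(embeddings[a.lemma], embeddings[b.lemma]) > threshold
```

With these toy vectors, `words_match(Word("carpenter"), Word("woodworker"))` holds, while flipping the `negated` tag on either word blocks the match regardless of the embeddings.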

In addition, the system allows a set of words to be clustered under the same definition.

DEFINE TEAM AS [team, group, club];
DEFINE UNIVERSITY AS [university, academy, polytechnic];
DEFINE LITERATURE AS [book, story, article, series];

All words within the threshold distance would trigger a match. For example the word tome would match the word book, thus falling into the LITERATURE category.
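As a rough sketch, the DEFINE statements can be read into a dictionary and a word assigned to every category that has at least one sufficiently similar member. The helper names `parse_definitions` and `categories_for` are hypothetical, and the similarity test is passed in as a function; the toy version below hard-codes the tome/book pair instead of comparing embeddings.

```python
import re

DEFINITIONS_TEXT = """
DEFINE TEAM AS [team, group, club];
DEFINE UNIVERSITY AS [university, academy, polytechnic];
DEFINE LITERATURE AS [book, story, article, series];
"""


def parse_definitions(text):
    """Map each defined name to its list of member words."""
    definitions = {}
    pattern = r"DEFINE\s+(\w+)\s+AS\s+\[([^\]]*)\];"
    for name, body in re.findall(pattern, text):
        definitions[name] = [w.strip() for w in body.split(",") if w.strip()]
    return definitions


def categories_for(word, definitions, is_similar):
    """Return every category with at least one member similar to `word`."""
    return [
        name
        for name, members in definitions.items()
        if any(is_similar(word, member) for member in members)
    ]


# Toy similarity: identical words, plus one hand-picked close pair.
# A real system would compare embeddings against a threshold instead.
CLOSE_PAIRS = {("tome", "book")}


def toy_is_similar(a, b):
    return a == b or (a, b) in CLOSE_PAIRS or (b, a) in CLOSE_PAIRS


definitions = parse_definitions(DEFINITIONS_TEXT)
tome_categories = categories_for("tome", definitions, toy_is_similar)
```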

2.3 Matching of sentences

Figure 1: (a) The text in Sec. 2.1 becomes a semantic graph with co-reference links. (b) Representing the rule with a Prolog-like syntax: The MATCH clause defines the sentence/graph at the right that triggers the rule; the rule then creates an edge between the entities. (c) The resulting relations graph from the two rules in Sec. 2.3.

As described before, a text becomes a set of graphs (the discourse graph). Rules have two components: a MATCH clause, which defines the trigger for the rule, and a CREATE clause, which creates the relation edge. Relations must connect two entities. I have chosen the symbol # to mark the items that need a relation among them. For example the sentence Jane#1 works at Acme#2 tags Jane and Acme for an edge to connect them.

The matching sentence can contain Named Entities (PERSON, ORG, DATE, etc) as well as an internally-defined entity (Sec. 2.2).

The following is an example of two rules:

DEFINE ROLE AS [carpenter, painter];
MATCH "PERSON#1 works as a ROLE#2."
CREATE (works_as 1 2);
MATCH "PERSON#1 works at ORG#2 as a ROLE. PERSON is tall."
CREATE (tall_worker_at 1 2);

Please note that a MATCH clause is written as a sentence but it is internally parsed into a graph. A rule is triggered if this semantic representation is a sub-graph of the discourse graph. Two nodes are considered equal if they match according to the method in Sec. 2.2. The rules are represented as simple pattern matching rules as in Fig 1.

Notice also that for the second rule in the above example more than one sentence is specified. This is because the MATCH clause can be a text as complex and free-flowing as the documents that are being parsed. The trigger sentences also solve co-reference: In the second rule, the person that works is the same person that is tall.
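To make the rule syntax concrete, the sketch below reads the MATCH and CREATE clauses of the example with regular expressions and recovers which tagged items the relation connects. It is not the parser used by the system; the function `parse_rules` and the dictionary layout are assumptions made for illustration.

```python
import re

RULE_TEXT = '''
DEFINE ROLE AS [carpenter, painter];
MATCH "PERSON#1 works as a ROLE#2."
CREATE (works_as 1 2);
MATCH "PERSON#1 works at ORG#2 as a ROLE. PERSON is tall."
CREATE (tall_worker_at 1 2);
'''

# A MATCH clause followed by a CREATE clause naming the relation
# and the two tag numbers it connects.
RULE_RE = re.compile(r'MATCH\s+"([^"]*)"\s+CREATE\s+\((\w+)\s+(\d+)\s+(\d+)\);')
# An item marked with the # symbol, e.g. PERSON#1.
TAG_RE = re.compile(r"(\w+)#(\d+)")


def parse_rules(text):
    """Return one dictionary per MATCH/CREATE pair found in `text`."""
    rules = []
    for sentence, relation, source, target in RULE_RE.findall(text):
        tagged = {int(number): word for word, number in TAG_RE.findall(sentence)}
        rules.append({
            "match": TAG_RE.sub(r"\1", sentence),  # sentence without the tags
            "relation": relation,
            "source": tagged[int(source)],
            "target": tagged[int(target)],
        })
    return rules


rules = parse_rules(RULE_TEXT)
```

The untagged sentence stored under `"match"` is the text that would then be parsed into a graph, exactly like a sentence from the documents.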

This is an advantage of using an internal semantic representation: One could add more complex mangling of the sentences where simple logical constraints are added (and/or), or information is extracted from mathematical formulas.

Each rule behaves according to the method defined above. When a graph triggers a rule, an edge is created in the relations graph, as shown in Fig 1. In this final representation, all of the discourse structures have disappeared and knowledge is condensed onto the pre-defined relations.
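The triggering step can be sketched as a small backtracking sub-graph search. This is a simplified illustration under several assumptions: the graphs are written by hand rather than produced by a parser, node comparison uses exact tag and lemma checks instead of embeddings, and "woodworker" is added to the ROLE set by hand to stand in for its embedding match with "carpenter".

```python
def node_match(pattern_node, graph_node):
    """Toy node comparison: every constraint on the pattern node must hold."""
    if "entity_type" in pattern_node:
        if pattern_node["entity_type"] != graph_node.get("entity_type"):
            return False
    if "lemma_in" in pattern_node:
        if graph_node["lemma"] not in pattern_node["lemma_in"]:
            return False
    return True


def find_matches(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    """Yield mappings from pattern node ids to discourse node ids."""
    graph_edge_set = set(graph_edges)
    pattern_ids = list(pattern_nodes)

    def consistent(assignment):
        # Every pattern edge with both ends assigned must exist in the graph.
        for src, label, dst in pattern_edges:
            if src in assignment and dst in assignment:
                if (assignment[src], label, assignment[dst]) not in graph_edge_set:
                    return False
        return True

    def extend(index, assignment):
        if index == len(pattern_ids):
            yield dict(assignment)
            return
        pid = pattern_ids[index]
        for gid, gnode in graph_nodes.items():
            if gid in assignment.values():
                continue  # each discourse node is used at most once
            if not node_match(pattern_nodes[pid], gnode):
                continue
            assignment[pid] = gid
            if consistent(assignment):
                yield from extend(index + 1, assignment)
            del assignment[pid]

    yield from extend(0, {})


# Hand-written discourse graph for "Jane is working at ACME Inc as a woodworker."
graph_nodes = {
    "r1": {"lemma": "Jane", "entity_type": "PERSON"},
    "e1": {"lemma": "work", "entity_type": ""},
    "r2": {"lemma": "ACME_Inc", "entity_type": "ORG"},
    "r3": {"lemma": "woodworker", "entity_type": ""},
}
graph_edges = [("e1", "AGENT", "r1"), ("e1", "at", "r2"), ("e1", "as", "r3")]

# Hand-written graph for the rule: MATCH "PERSON#1 works as a ROLE#2."
ROLE = {"carpenter", "painter", "woodworker"}  # woodworker added by hand
pattern_nodes = {
    "1": {"entity_type": "PERSON"},
    "v": {"lemma_in": {"work"}},
    "2": {"lemma_in": ROLE},
}
pattern_edges = [("v", "AGENT", "1"), ("v", "as", "2")]

# CREATE (works_as 1 2): one relation edge per match.
relations = [
    ("works_as", graph_nodes[m["1"]]["lemma"], graph_nodes[m["2"]]["lemma"])
    for m in find_matches(pattern_nodes, pattern_edges, graph_nodes, graph_edges)
]
```

Only the two tagged nodes survive into `relations`; the verb node and the AGENT and `as` edges are used for matching and then discarded, which mirrors how the discourse structure disappears from the final relations graph.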

3 Limitations and future work

I have presented a flexible rule-based relation extractor for limited resource sets. Flexible rules can be created, thus allowing for a quick relation extractor using specialized ontologies. The main advantage of this approach is control over the rules and precision in the extracted content. An extension of the system should allow customized ontologies to be used for word matching. Moreover, more Named Entities should be included, possibly allowing for specialized NER systems within the internal pipeline.

As a final limitation, the system does not assign a temporal dimension to events yet. This information should be extracted from verb tenses and added to the discourse graph.