Neural Open Relation Extraction via an Overlap-aware Sequence Tagging Scheme

07/26/2019 · by Shengbin Jia, et al.

Solving the open relation extraction (ORE) task with supervised neural networks, especially neural sequence learning (NSL) models, is a highly promising approach. However, there are three main challenges: (1) the lack of labeled training corpora; (2) only one label can be assigned to each word, which makes it difficult to extract multiple, overlapping relations; and (3) confusion about which of the various neural architectures to select for ORE. In this paper, we overcome these challenges by designing a novel tagging scheme that assists in building a large-scale, high-quality training dataset automatically. The scheme improves model performance by assigning multiple, overlapping labels to each word and by helping models learn segment-level information from pre-identified arguments. In addition, we empirically pick a winning model from various alternative neural structures. The model achieves state-of-the-art performance on four kinds of test sets, and the experimental results show that the scheme is effective.


I Introduction

Open relation extraction (ORE) is an important NLP task that has been used in a wide variety of applications [1, 2]. Unlike traditional relation extraction, which identifies pre-defined relation types [3, 4], ORE aims to obtain a semantic representation consisting of an expressive relational phrase and two argument phrases. For example, when tracking a fierce sports competition, news-reporting systems need to find fresh relation descriptions such as (Gatlin, again won, the Championships). ORE satisfies such scenarios by discovering arbitrary relations without pre-defined taxonomies.

Conventional methods are usually based on pattern matching, where patterns are handcrafted by linguists [5, 6, 7, 8] or bootstrapped with the help of syntactic analysis tools [9, 2, 10]. The manual cost is heavy, and the error cascade effect caused by parsers is severe. Recently, a few works have tried to solve the ORE task with supervised neural network models [11, 12]. Among them, a promising research direction transforms the extraction task into a sequence tagging problem [13, 12].

Nevertheless, several challenges remain. (1) There are few public large-scale labeled corpora for training supervised ORE learning models, and manually labeling a training set with a large number of relations is extremely costly. (2) As the example in Figure 1 shows, conventional sequence tagging schemes cannot tag multiple, overlapping triples in a sentence simultaneously: only one label is assigned to each word, so they cannot handle the case where a relation is involved in multiple triples or where an argument is related to multiple relations. (3) Producing supervised systems for open tasks is always challenging, and which neural architectures are most beneficial for open relation sequence learning urgently needs to be explored.

Fig. 1: Stanovsky et al. [12] annotate the relation sequence as shown in line 2; a sentence produces only one tag sequence. Tagging based on our scheme is shown in lines 3-5: multiple, overlapping triples can be represented, where “E1”, “R”, and “E2” denote Argument1, Relation, and Argument2, respectively. Each tag sequence is independent of the others.

We overcome the above challenges as follows. For challenge (1), we build a large-scale, high-quality corpus in a fully automated way and verify that the dataset is diverse and able to drive models to promising performance. For challenge (2), we design a novel tagging scheme (multiTS) that assigns multiple labels to each word: each triple corresponds to its own unique tag sequence, so overlapping triples in a sentence can be represented simultaneously and separately. For challenge (3), we discuss alternative popular neural structures and experimentally adapt them to the open relation tagging task; furthermore, we select the winning model (NST), a hybrid neural sequence tagging network, and use it for a series of performance analyses.

In summary, the main contributions of this work are listed as follows:

  • We design a novel overlap-aware tagging scheme that can represent multiple, overlapping triples in each sentence simultaneously. To our knowledge, it is the first ORE scheme able to handle not only the case where a relation is involved in multiple triples but also the case where an argument is related to different relations.

  • We systematically explore the performance of a variety of neural models in the context of ORE. This is valuable for understanding the effectiveness of different neural architectures for the ORE task and for helping readers reproduce these experiments. We also propose a hybrid neural sequence tagging model that achieves state-of-the-art performance compared with existing methods and other neural models.

  • We construct an automatically labeled corpus that supports applying supervised approaches to the ORE task. Not only is it simple to build, but it is also larger in scale and higher in quality than other existing ORE corpora.

II Related Work

For open relation extraction, most existing methods make use of linguistic analysis. Some extractors, such as TextRunner [1], WOE [14], and Reverb [5], focus on efficiency by restricting shallow syntactic parsing to part-of-speech tagging and chunking. Meanwhile, many approaches design complex patterns over full syntactic processing, especially dependency parses, such as WOE [14], PATTY [15], OLLIE [9], Open IE-4.x [2], and MinIE [16]. In general, the patterns are generalized by handcrafting [7, 6, 8] or by semi-supervised learning such as bootstrapping [9, 2, 10]. These extractors achieve significantly better results than extractors based on shallow syntax, but they rely heavily on syntactic parsers. Many papers [9, 6] analyze the errors made by their extractors and find that parser errors account for a large, even the largest, share of the total. Parsing errors restrain extraction performance and produce a serious error cascade effect. Therefore, adopting a strategy to refine the extraction results should be an effective way to gather high-quality relation triples.

Sequence tagging tasks [17], such as Chinese word segmentation (CWS), part-of-speech (POS) tagging, chunking, and named entity recognition (NER), require assigning a representative label to each word in a sentence. Conventional models were linear statistical models [18, 19, 20]. Neural methods map input sequences to large fixed-dimensional vector representations with various neural networks [21, 22, 23] and then predict the target sequences from those vectors using a layer with a Softmax activation function [24, 22] or a special conditional random field (CRF) layer [25].

In addition, a few works apply neural models to open information extraction. Following the machine translation paradigm, the extraction process is converted into text generation. Zhang et al. [26] extract predicate-argument structure phrases using a sequence-to-sequence model. Cui et al. [11] propose a multi-layered encoder-decoder framework that generates relation-tuple sequences with special placeholders as markers. Zheng et al. [13] creatively design a model that transforms relation extraction into relation sequence tagging, and Stanovsky et al. [12] formulate the ORE task as a sequence tagging problem, applying a BiLSTM with a softmax layer to tag each word. In contrast, we aim to design more effective semantic learning frameworks for annotating relations. In particular, the above schemes can only handle the case where a relation is involved in multiple triples, not the case where an argument is related to different relations.

Model              | Past temporal | Future temporal | Local spatial | Neighborhood tags | Tag Decoder | References
LSTM-CRF           | ✓             |                 |               | ✓                 | CRF         | [22]
BiLSTM-CRF         | ✓             | ✓               |               | ✓                 | CRF         | [22, 27, 28, 24]
LSTM-LSTM          | ✓             |                 |               | ✓                 | LSTM        | [21, 13, 29]
BiLSTM-CNN-Softmax | ✓             | ✓               | ✓             |                   | Softmax     | [17]
CNN-CRF            |               |                 | ✓             | ✓                 | CRF         | [23, 30, 31]
NST                | ✓             | ✓               | ✓             | ✓                 | CRF         | this work
TABLE I: Various neural sequence labeling networks. Check marks indicate the information each network exploits: past/future temporal and local spatial information in the input encoder, and neighborhood tag information in the tag decoder.

III Tagging Scheme and Training Corpus

There are few public large-scale labeled datasets for the ORE tagging task. Stanovsky and Dagan [32] created an evaluation corpus by an automatic translation from QA-SRL annotations [33]; it contains only 10,359 tuples over 3,200 sentences. Stanovsky et al. [12] later expanded it with 17,163 labeled open tuples from the QAMR corpus [34]; however, accuracy declined.

Therefore, we adopt a bootstrapping mechanism that uses multiple existing open relation extractors, without resorting to expensive annotation efforts. The training set contains 477,701 triples, and we design a novel tagging scheme to annotate them automatically.

III-A Collecting Correct Relation Triples

We use three existing excellent and popular extractors (OLLIE [9], ClausIE [6], and Open IE-4.x [2]) to extract relation triples from raw text (the original text comes from the WMT 2011 News Crawl data, available at http://www.statmt.org/lm-benchmark/). These extractors have different areas of expertise, which ensures the diversity of the extraction results. If a triple is obtained simultaneously by all three extractors, we regard it as correct and add it to our corpus.
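To make the procedure concrete, the following is a minimal sketch of this intersection-based bootstrapping, assuming hypothetical wrapper functions (run_ollie, run_clausie, run_openie4) that return normalized (arg1, relation, arg2) string triples for a sentence; in practice, matching would also need to normalize phrase boundaries across extractors.

```python
# A minimal sketch of the corpus bootstrapping step: keep only triples
# that all three extractors produce for the same sentence. The extractor
# wrappers are hypothetical stand-ins, not real library calls.
def collect_correct_triples(sentences, extractors):
    corpus = []
    for sentence in sentences:
        outputs = [set(extract(sentence)) for extract in extractors]
        agreed = set.intersection(*outputs)  # triples all extractors agree on
        corpus.extend((sentence, triple) for triple in agreed)
    return corpus

# corpus = collect_correct_triples(news_sentences,
#                                  [run_ollie, run_clausie, run_openie4])
```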

We randomly sample 100 triples from the corpus to test its accuracy, which reaches 0.95. This result verifies the validity of the above procedure and indicates that extraction noise caused by syntactic analysis errors is also well filtered out.

To build a high-quality corpus for model training, we pursue high-accuracy extractions at the expense of recall. The constructed dataset is imperfect, but it still has acceptable scale, and the experimental sections demonstrate its effectiveness.

III-B Automatic Sequence Tagging

Tagging assigns a special label to each word in a sentence [13]. We use the BIOES annotation (Begin, Inside, Outside, End, Single), which indicates the position of each token in an argument or relational phrase; this annotation has been reported to be more expressive than alternatives such as BIO [35]. In a relation triple, the arguments and the relational phrase may each span several tokens within a sentence, so they need to be segmented and tagged separately.

Figure 1 shows an example of a sentence tagged by our scheme. A sentence may contain multiple, overlapping triples, and each triple corresponds to its own unique tag sequence. An argument can have multiple labels when it belongs to different triples.
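As an illustration, the following sketch builds one multiTS-style tag sequence for a single triple, assuming whitespace tokenization and known token spans for the two arguments and the relational phrase; it is a toy example under those assumptions, not the exact annotation pipeline.

```python
# Build one BIOES tag sequence for one triple. Each triple in a sentence
# gets its own independent sequence, which is how overlap is represented.
def bioes(span_len, role):
    # Single-token spans get S-; longer spans get B-, I-..., E-.
    if span_len == 1:
        return [f"S-{role}"]
    return [f"B-{role}"] + [f"I-{role}"] * (span_len - 2) + [f"E-{role}"]

def tag_triple(tokens, spans):
    # spans: {"E1": (start, end), "R": (start, end), "E2": (start, end)},
    # end exclusive; every token outside the three segments stays "O".
    tags = ["O"] * len(tokens)
    for role, (s, e) in spans.items():
        tags[s:e] = bioes(e - s, role)
    return tags

tokens = "Trump will visit the Apple".split()
print(tag_triple(tokens, {"E1": (0, 1), "R": (1, 3), "E2": (3, 5)}))
# ['S-E1', 'B-R', 'E-R', 'B-E2', 'E-E2']
```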

Given an argument pair, there can be only one relation between the pair in a sentence (the coordination of verbs in a sentence is considered one complete relational phrase). Therefore, once the argument tagging information is used as model input (argument recognition is easy to accomplish with many existing methods, so we can add it to the pre-processing step), the sequence of relational tags is uniquely determined.

In addition, another important reason inspires this design. Unlike output segments such as named entities, chunks, or parts of speech, which are relatively independent, relation triples have strict semantic associations and formal constraints: the head argument segment is the agent of the relation segment, and the tail argument segment is its object. However, normal sequence labeling models cannot fully encode segment-level information; they encode word-level context information by assigning each word a tag that indicates the boundaries of adjacent segments [36, 37]. By pre-identifying the arguments and feeding them into the model, the model can directly exploit argument segment-level information when tagging relation segments.

When using the models for prediction, we extract candidate argument pairs in advance and transform them into argument embeddings as model inputs to identify the relationship between them. If the arguments are not related, all words outside the scope of the arguments are labeled “O”.
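A sketch of this prediction-time flow might look as follows, where tagger is a hypothetical stand-in for the trained model applied to a sentence together with one candidate argument pair; the pairwise enumeration is an assumption for illustration.

```python
# Enumerate candidate argument pairs, run the tagger once per pair, and
# keep only tag sequences that predict some relation (not all "O").
from itertools import permutations

def predict_tag_sequences(tokens, candidate_args, tagger):
    sequences = []
    for a1, a2 in permutations(candidate_args, 2):
        tags = tagger(tokens, arg_pair=(a1, a2))  # one tag sequence per pair
        if any(t != "O" for t in tags):           # all-"O" means unrelated pair
            sequences.append(tags)
    return sequences
```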

III-C From Tag Sequence to Extracted Results

From tag sequence 1 in Figure 1, “The America” and “Trump” can be combined into a triple whose relation is “President”: because the relation role of “The America” is “1” and that of “Trump” is “2”, the final result is (The America, President, Trump). The same applies to (Trump, will visit, the Apple) and (the Apple, founded by, Steven Paul Jobs).
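The reverse step, reading a triple back out of a single tag sequence, can be sketched as follows, using the BIOES-tagged roles E1, R, and E2 from the scheme above.

```python
# Decode one tag sequence into an (arg1, relation, arg2) triple.
def decode(tokens, tags):
    phrases = {}
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            continue
        role = tag.split("-")[1]            # "B-E1" -> "E1", "S-R" -> "R", ...
        phrases.setdefault(role, []).append(tok)
    if all(r in phrases for r in ("E1", "R", "E2")):
        return tuple(" ".join(phrases[r]) for r in ("E1", "R", "E2"))
    return None                             # incomplete sequence: no triple

tokens = "Trump will visit the Apple".split()
tags = ["S-E1", "B-R", "E-R", "B-E2", "E-E2"]
print(decode(tokens, tags))                 # ('Trump', 'will visit', 'the Apple')
```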

IV Models

We describe various popular sequence tagging models; a brief comparison is shown in Table I.

IV-A Basic Networks

LSTM Network. The long short-term memory (LSTM) network is a variant of RNNs [38]. It is good at grasping long-range temporal semantics of a sequence. An LSTM unit is composed of three multiplicative gates, which control the proportions of information to forget and to pass on to the next time step [27].

BiLSTM Network. The LSTM sees the left context of the sentence at every word but knows nothing about the right context. A bidirectional LSTM (BiLSTM) processes each sequence forward and backward into two separate hidden states to capture past (left) and future (right) information, respectively [39, 28].

CNN Network. The convolutional neural network (CNN) has good local perception ability; it is a natural means of capturing salient local features from the whole sequence [23, 40].

Softmax Layer. A layer with the Softmax activation function that makes independent tagging decisions directly.

CRF Layer. In sequence labeling tasks, there are strong dependencies across output labels. A conditional random field (CRF) layer can efficiently use neighboring tags to predict the current tag; therefore, jointly modeling the label sequence with a CRF layer is common practice [28, 27].

IV-B Hybrid Network

According to our experiments, the hybrid neural sequence tagging network (NST) achieves better performance than the others, so we introduce this model in more detail. Figure 2 illustrates the network architecture.

Fig. 2: An illustration of the model NST.

Let $X = (x_1, x_2, \dots, x_n)$ denote the sequence of information embeddings for a sentence. $X$ is given as input to a BiLSTM, and the BiLSTM layer returns, for each word, a concatenated representation $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$ of the forward and backward context semantics.

The words involved in a relation phrase tend to appear in the neighborhoods of the arguments. We therefore expect different sub-sequence segments, especially the argument-adjacent segments, to provide feature information of different strengths to the model. We use a CNN to learn local feature information of the input sequence. As shown in Figure 2 (a), the input sequence is given to a convolutional layer with filters of width 3 and then fed to a max-pooling layer, yielding an output vector $c$ of fixed length.

We concatenate the bidirectional temporal features $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ with the local spatial features $c$ into a single vector $z_t = [\overrightarrow{h_t}; \overleftarrow{h_t}; c]$. This vector is fed into the CRF layer to jointly yield the final predictions for every word.
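As a rough sketch of how this architecture can be assembled, the following uses Keras with the CRF layer from the keras-contrib package (the paper reports using Keras); the LSTM units and filter counts follow the hyperparameters given below, while the sequence length, embedding dimension, and tag count are illustrative assumptions.

```python
# A minimal sketch of the NST architecture: BiLSTM + CNN encoder, CRF decoder.
from keras.models import Model
from keras.layers import (Input, LSTM, Bidirectional, Conv1D,
                          GlobalMaxPooling1D, RepeatVector, Concatenate)
from keras_contrib.layers import CRF

MAX_LEN = 50    # illustrative sequence length
EMB_DIM = 360   # word (300) + POS + argument (10) embeddings; POS dim assumed
N_TAGS = 13     # 4 BIOES tags x 3 roles (E1/R/E2) + "O"

inputs = Input(shape=(MAX_LEN, EMB_DIM))                     # pre-built embedding sequence
h = Bidirectional(LSTM(200, return_sequences=True))(inputs)  # forward/backward temporal features
c = Conv1D(200, kernel_size=3, padding="same", activation="relu")(inputs)
c = GlobalMaxPooling1D()(c)                                  # fixed-length local spatial vector c
c = RepeatVector(MAX_LEN)(c)                                 # broadcast c to every time step
z = Concatenate(axis=-1)([h, c])                             # z_t = [h_fwd; h_bwd; c]
crf = CRF(N_TAGS)
outputs = crf(z)

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss=crf.loss_function, metrics=[crf.accuracy])
model.summary()
```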

V Experiments and Discussions

In this section, we present the experiments in detail. We evaluate various models with Precision (P), Recall (R) and F-measure (F1).

Model              | Wikipedia dataset   | NYT dataset         | Reverb dataset      | Average
                   | P     R     F       | P     R     F       | P     R     F       | P     R     F
CRF                | 0.548 0.264 0.357   | 0.460 0.130 0.202   | 0.425 0.198 0.270   | 0.458 0.198 0.277
CNN-Softmax        | 0.886 0.527 0.661   | 0.931 0.493 0.645   | 0.915 0.504 0.650   | 0.912 0.506 0.651
LSTM-Softmax       | 0.559 0.463 0.506   | 0.607 0.481 0.537   | 0.632 0.530 0.576   | 0.612 0.506 0.554
BiLSTM-Softmax     | 0.766 0.683 0.722   | 0.807 0.637 0.712   | 0.784 0.716 0.748   | 0.784 0.693 0.736
LSTM-LSTM          | 0.552 0.426 0.481   | 0.623 0.497 0.552   | 0.639 0.535 0.582   | 0.619 0.505 0.556
CNN-CRF            | 0.878 0.574 0.694   | 0.854 0.578 0.689   | 0.892 0.568 0.694   | 0.882 0.571 0.693
LSTM-CRF           | 0.720 0.590 0.649   | 0.721 0.593 0.651   | 0.709 0.604 0.652   | 0.714 0.599 0.651
BiLSTM-CRF         | 0.838 0.734 0.782   | 0.817 0.678 0.741   | 0.830 0.743 0.784   | 0.829 0.728 0.775
BiLSTM-CNN-Softmax | 0.821 0.729 0.772   | 0.837 0.702 0.764   | 0.831 0.760 0.794   | 0.830 0.743 0.784
NST                | 0.876 0.736 0.800   | 0.864 0.701 0.774   | 0.868 0.746 0.802   | 0.869 0.735 0.796
Reverb             | 0.770 0.210 0.330   | 0.557 0.144 0.228   | 0.595 0.133 0.217   | 0.641 0.162 0.259
OLLIE              | 0.994 0.279 0.436   | 0.986 0.249 0.398   | 0.975 0.198 0.329   | 0.985 0.242 0.389
ClausIE            | 0.795 0.526 0.633   | 0.656 0.481 0.555   | 0.953 0.585 0.725   | 0.801 0.531 0.638
Open IE-4.x        | 0.766 0.340 0.471   | 0.801 0.341 0.478   | 0.810 0.312 0.451   | 0.792 0.331 0.467
TABLE II: The results of different models on the Wikipedia, NYT, and Reverb datasets. The upper block lists sequence tagging models; the lower block lists traditional ORE extractors.

V-A Experimental Setting

Test sets. To ensure the openness and validity of the experiments, we gather four high-quality test sets from previously published works. They should be natural and independent of the training set. First, the Reverb dataset comes from [5] and consists of 500 sentences from Yahoo with 1,765 manually labeled extractions. Next, the Wikipedia dataset includes 200 random sentences extracted from Wikipedia, for which we collect 605 extractions manually labeled by Del Corro et al. [6]. The NYT dataset, also created by Del Corro et al., contains 578 triples extracted from 200 random sentences in the New York Times collection. In addition, Stanovsky and Dagan [32] present the OIE2016 dataset, automatically translated from QA-SRL annotations [33], which contains 10,359 tuples over 3,200 sentences.

Hyperparameters. We implement the neural networks using the Keras library (https://github.com/keras-team/keras). The training set and validation set contain 395,715 and 81,986 records, respectively. The batch size is fixed at 50. We use early stopping [39] based on performance on the validation set. The number of LSTM units is 200, and the number of feature maps for each convolutional filter is 200. Parameter optimization is performed with the Adam optimizer [41]; the initial learning rate is 0.001, reduced by a factor of 0.1 if no improvement of the loss function is seen for some epochs. To mitigate over-fitting, we apply dropout [42] to regularize the models. We use three types of embeddings as inputs. Word embeddings are pre-trained with word2vec [43] on the corpora; their dimension is 300. Part-of-speech (POS) embeddings are also included, since POS information plays an important role in relation extraction; we use the widely adopted TreeTagger [44] to annotate POS categories, yielding 59 different tags. Finally, we represent the tag information of arguments as argument embeddings, using 10-dimensional one-hot vectors.
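Under these settings, the training loop might be wired up as follows, using the standard Keras callbacks; the patience values are illustrative assumptions, since the paper does not report them.

```python
# Early stopping on the validation set, plus learning-rate reduction by a
# factor of 0.1 when the loss plateaus, matching the schedule described above.
from keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=3),
]
# model.fit(X_train, y_train, batch_size=50, epochs=100,
#           validation_data=(X_val, y_val), callbacks=callbacks)
```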

V-B Experimental Results

We report the results of various models on the first three datasets in Table II. (To avoid distortion of the final performance by argument recognition errors, we only assess the correctness of the relational phrase in a triple. For a relation extracted by Reverb, OLLIE, ClausIE, or Open IE-4.x, it is judged correct only when its confidence is greater than 0.5.) The conclusions from the table are as follows.

First, the NST model outperforms all other methods in F1, which shows the effectiveness of the hybrid neural network architecture. It also achieves better recall and F1 when the three extractors used to construct the corpus (OLLIE, ClausIE, and Open IE-4.x) are regarded as baselines; in particular, it achieves a 15.8% improvement in F1 over the best baseline, ClausIE.

Second, the NST achieves better results than the model structure proposed by Stanovsky et al. [12], which is the same as a BiLSTM with a Softmax output layer (BiLSTM-Softmax). Beyond capturing the temporal semantics of sentences with recurrent networks, the NST benefits from extracting spatial semantics with convolutional layers and fusing sentence-level tag information with the CRF layer. In particular, the method of Stanovsky et al. can only handle the case where a relation is involved in multiple triples, not the case where an argument is related to different relations.

Third, many of the neural sequence learning methods outperform conventional syntax-based methods. The conventional methods may achieve high precision, but they cannot learn rich enough patterns, resulting in lower recall. Moreover, the syntactic structures of sentences are ever-changing; pattern matching is rigid and inflexible, whereas neural models can learn deep sentence semantics and syntactic information and thus achieve better precision and recall.

Next, we analyze the effects of the various networks. BiLSTM-Softmax is clearly superior to LSTM-Softmax, by about 18.2% in F1 on average, since it captures richer temporal semantic information. BiLSTM-Softmax beats CNN-Softmax in recall and F1, although CNN-Softmax attains better precision; overall, CNN-Softmax outperforms LSTM-Softmax. In addition, from the comparisons between LSTM-Softmax and LSTM-CRF, BiLSTM-Softmax and BiLSTM-CRF, CNN-Softmax and CNN-CRF, BiLSTM-CNN-Softmax and NST (BiLSTM-CNN-CRF), and LSTM-LSTM and LSTM-CRF, we draw the unanimous conclusion that a CRF layer improves model performance far more than a Softmax layer or an LSTM decoding layer.

Finally, the performance of the models is stable across the three datasets, indicating that the neural sequence tagging methods have good robustness and scalability.

Fig. 3: The precision-recall curves of the different ORE systems on the OIE2016 dataset. The En-Decoder model comes from [11].
Fig. 4: The area under the precision-recall curve (AUC) for the systems shown in Figure 3.

We use the OIE2016 dataset to evaluate the precision and recall of the different systems. The precision-recall curves are shown in Figure 3, and the area under the precision-recall curve (AUC) for each system is shown in Figure 4. (When running the NST, we use the methods in ClausIE and Open IE-4.x to pre-identify the arguments in sentences. The model proposed by Stanovsky et al. [12] is not shown here, because it uses the OIE2016 dataset as its training set; we evaluated it in Table II.) The NST achieves better precision than the three baselines over most of the recall range. Although the precision of the neural En-Decoder model [11] is better than that of the NST, its recall remains in a lower range. The NST achieves the best AUC score of 0.487, significantly better than the other systems; in particular, its AUC score is more than twice that of the En-Decoder model.

VI Performance Analysis and Discussions

Dataset   | Proportion | P     | R     | F
Wikipedia | 38.7%      | 0.849 | 0.650 | 0.736
NYT       | 31.1%      | 0.779 | 0.606 | 0.681
Reverb    | 44.3%      | 0.794 | 0.649 | 0.714
TABLE III: Performance of the NST model on the sub-dataset of each test set that contains only overlapping triples. The second column gives the proportion of overlapping triples in the full dataset.
Model                    | P     | R     | F
BiLSTM-Softmax           | 0.683 | 0.574 | 0.624
BiLSTM-Softmax (multiTS) | 0.807 | 0.637 | 0.712
NST                      | 0.772 | 0.581 | 0.663
NST (multiTS)            | 0.864 | 0.701 | 0.774
TABLE IV: Results of evaluating the effect of the tagging scheme on the NYT test set. Rows marked “(multiTS)” are trained with our overlap-aware scheme; the other rows use a scheme that cannot tag overlapping triples.

VI-A Analysis of the Effect of Overlapping Triples

As shown in Table III, overlapping triples account for a large proportion of the various datasets, above 30% in all of them. We run the NST model to identify overlapping triples and obtain good results on all three datasets.

We evaluate the effect of our overlap-aware tagging scheme, as shown in Table IV. We run the BiLSTM-Softmax model (as used by Stanovsky et al. [12]) and the NST model under the two tagging schemes. With our tagging scheme (multiTS), models can identify many multiple, overlapping triples and therefore achieve better recall. In addition, explicitly providing argument segment information to the models greatly improves their tagging ability and thus their precision.

Model                                   | P     | R     | F
NST (word embeddings)                   | 0.783 | 0.708 | 0.744
NST (word + POS embeddings)             | 0.822 | 0.751 | 0.785
NST (word + POS embeddings, fine-tuned) | 0.869 | 0.735 | 0.796
TABLE V: Results of evaluating the influence of the input embeddings on the NST model, averaged over the Wikipedia, NYT, and Reverb datasets. In the last row, the embeddings are updated along with model training.

VI-B Analysis of the Effect of Embeddings

We perform an experiment to evaluate the effect of the various embeddings, as shown in Table V. When the NST model uses only word embeddings, the results are worse than with both word and POS embeddings; the POS features substantially promote model performance, improving F1 by 4.1%. In addition, when the input embeddings are fine-tuned along with model training, performance improves further, with F1 increasing by 1.1%.

Fig. 5: The performance of the NST model on each part of the test sets. The black bar indicates the number of correct identifications by the NST; the gray bar represents the total number of instances in that part.

VI-C Analysis of Dataset Diversification and Model Generalization Ability

According to whether the three extractors used to construct the training set correctly identify each test instance, we classify the examples from the test sets into four parts: all three extractors identify correctly (ATE), exactly two extractors identify correctly (ETE), only one extractor identifies correctly (OOE), and no extractor identifies correctly (NOE).
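The partition itself can be sketched as follows, assuming each test instance carries the number of extractors (0-3) that identified it correctly; the input format is an assumption for illustration.

```python
# Bucket test instances by how many of the three bootstrap extractors
# (OLLIE, ClausIE, Open IE-4.x) identified them correctly.
def partition_by_agreement(instances_with_counts):
    names = {3: "ATE", 2: "ETE", 1: "OOE", 0: "NOE"}
    parts = {name: [] for name in names.values()}
    for instance, n_correct in instances_with_counts:
        parts[names[n_correct]].append(instance)
    return parts
```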

As shown in Figure 5, the NST model identifies the instances in the ATE part with an accuracy close to 1, implying that the model learns the features of the training data well. Moreover, from the perspective of a single extractor, the NST acquires that extractor’s extraction ability after training: the correct results of that extractor appear not only in the ATE but also in the ETE and OOE, and these instances outside the ATE share the same regularity (data distribution) as those inside it. Therefore, the NST shows recognition ability in all three parts (ATE, ETE, and OOE), but according to the degree of similarity of the data distribution, its accuracy is highest on the ATE, next on the ETE, and lower on the OOE (yet still above 0.7).

The model integrates the three kinds of extractors to obtain more powerful recognition capability than any single extractor. In particular, the model produces correct results even in the NOE part, whose instances appear rarely or not at all in the training set. This shows that the model has strong generalization ability and can learn to extract triples beyond the capability of the three extraction tools.

To maintain the high quality of the training set, we only select triples from the ATE. Although the ATE occupies a small proportion of the extractors’ output, the instances in the training set are numerous and extracted from large-scale text. This ensures that the training set has good diversity and acceptable quality, and that the model generalizes well after training on it. An important additional reason, we believe, is that the neural approach learns across a large number of high-confidence training instances.

Fig. 6: Error analysis of the NST model. There are four major types of errors.

VI-D Error Analysis

To find the factors that affect the performance of ORE tagging models, we analyze the tagging errors of the NST model, as shown in Figure 6. 30.3% of relations are missed by the model, and 22.1% of the extractions are discarded because their corresponding tag sequences violate the tagging scheme; these two error types mainly limit recall. In addition, the model may wrongly determine the start or end position of a relational phrase, causing the relation to be recognized incorrectly; such errors reduce precision.

VII Conclusion

In this paper, we are committed to extracting open relations via neural sequence tagging models. We review three challenges and overcome them: we automatically build a high-quality, large-scale corpus to address the lack of data; we propose a novel tagging scheme to solve the multiple-label assignment problem; and we pick a hybrid neural sequence tagging network (NST) from a variety of models to address the challenge of model selection. In future work, we will further develop the training sets, and we will study a more efficient annotation scheme to handle n-ary relational tuples.

References