Supervised Neural Models Revitalize the Open Relation Extraction

09/25/2018 ∙ by Shengbin Jia, et al.

Open relation extraction (ORE) aims to obtain a semantic representation by discovering arbitrary relation tuples in unstructured text, and it remains challenging. Perhaps because labeled data are limited, previous extractors rely on unsupervised or semi-supervised pattern matching, which depends heavily on manual work or syntactic parsers and is therefore inefficient or prone to cascading errors. Their development has run into bottlenecks. Although a few recent studies have used neural network based models to improve ORE performance, building supervised ORE systems on top of various neural architectures has remained difficult. We analyze and review neural ORE methods. Further, we construct a large-scale, automatically tagged training set and design a tagging scheme that frames ORE as a supervised sequence tagging task. We propose a hybrid neural sequence tagging model (NST) that combines BiLSTM, CNN, and CRF to capture the contextual temporal information, local spatial information, and sentence-level tag information of a sequence, using word and part-of-speech embeddings. Experiments on multiple datasets show that our method outperforms most existing pattern-based methods and other neural network based models.


1 Introduction

Relation extraction is an important NLP task with two branches: traditional relation extraction and open relation extraction. Traditional relation extraction (TRE) Doddington et al. (2004); Culotta et al. (2006); Mintz et al. (2009) can be regarded as a classification task that identifies relations from a pre-defined taxonomy between two arguments; relations outside the taxonomy are not found. Open relation extraction (ORE) Banko et al. (2007); Fader et al. (2011); Jia et al. (2018) aims to obtain a semantic representation that comprises a relational phrase and two argument phrases, e.g., (Michael Jordan, born in, New York).

Question answering Fader et al. (2014), text comprehension Lin et al. (2017), information retrieval Manning and Raghavan (2009), and other downstream applications Mausam (2016) can use relation extraction results. However, tasks that analyze news, such as following the Olympic Games or a US election, often need to find important relations that are not pre-defined, and traditional relation extraction struggles to meet these needs. Open relation extraction systems handle such scenarios by discovering arbitrary relations without a pre-defined schema. Since its inception, ORE has gained consistent attention.

Banko et al. (2007) first introduced the concept of open relation extraction, and many extraction systems were subsequently proposed Fader et al. (2011); Del Corro and Gemulla (2013); Mausam et al. (2012). However, the task is constrained by the lack of labeled corpora and by inefficient extraction models. Most proposed methods rely on pattern matching with unsupervised or weakly supervised learning. In some, linguists handcraft elaborate shallow or full syntactic patterns; the manual cost of such methods is high and their scalability is poor Del Corro and Gemulla (2013); Angeli et al. (2015); Jia et al. (2018). Others extract patterns automatically through bootstrapping Mausam et al. (2012); Mausam (2016); Bhutani et al. (2016). These depend heavily on syntactic analysis tools, and the cascading effect of parsing errors is serious.

Neural network based methods are now popular and have achieved good results in the TRE task Xu et al. (2015); Wang et al. (2016); Zheng et al. (2017) and in other information extraction tasks Sutskever et al. (2014); Huang et al. (2015). A few studies have therefore applied conventional neural models to ORE and obtained good results Cui et al. (2018); Stanovsky et al. (2018). They had to collect annotated corpora to train neural ORE models, but manually labeling a training set with a large number of relations is very costly. In this paper, we build a large-scale dataset with good generalization capabilities in a fully automated way, to overcome the lack of training data for supervised models. We also design a tagging scheme that, together with this dataset, transforms the extraction task into a sequence tagging problem.

Producing supervised systems has always been challenging for open information extraction, and it is important to understand which neural architectures are useful for the task. We therefore build a range of neural structures and run extensive experiments to adapt them to ORE. On the one hand, we introduce alternative neural approaches to open relation tagging: the long short-term memory network (LSTM), bidirectional LSTM network (BiLSTM), convolutional neural network (CNN), LSTM network with a conditional random fields layer (LSTM-CRF), CNN network with a CRF layer (CNN-CRF), BiLSTM network with a CRF layer (BiLSTM-CRF), and so on. These are popular and effective Huang et al. (2015); Ma and Hovy (2016); Lample et al. (2016). On the other hand, to exploit the temporal and spatial characteristics of natural language, we present a hybrid neural sequence tagging network (NST). It uses a BiLSTM to capture both the forward and backward context semantics of each word, while a CNN extracts salient local features. A CRF layer then exploits sentence-level tag information to output open relational labels.

In summary, the main contributions of this work are listed as follows:

  • We design a tagging scheme that solves the ORE task with supervised sequence tagging. The method automatically learns the semantic and syntactic information of a sentence and uses semantic vectors to predict relations. It does not depend on dependency parsers, which are both the scaffold and the shackles of traditional pattern-based methods.

  • We propose a hybrid neural sequence tagging model (NST). It achieves state-of-the-art, or close to state-of-the-art, performance compared with traditional methods and other neural models.

  • We systematically compare the performance of a variety of neural models in the context of ORE. This helps to understand the effectiveness of different neural architectures for the ORE task and helps readers reproduce these experiments.

  • We construct a large-scale dataset with good generalization capabilities for the ORE task. It is a strong argument in favor of adopting supervised approaches. The code and dataset can be obtained from https://github.com/TJUNLP/NSL4OIE.

The rest of this article is organized as follows. In Section 2, we review related work. We detail the preparation of the training set in Section 3. In Section 4, we describe various neural ORE methods and propose a novel model, NST. In Section 5, we conduct and analyze experiments on multiple datasets. Section 6 analyzes and discusses several aspects of our model's performance. Section 7 concludes this work and discusses future research directions.

2 Related Works

Traditional relation extraction (TRE) can be regarded as a classification task that pre-defines a relational taxonomy and determines the relation category of known arguments Mintz et al. (2009); Culotta et al. (2006). The current mainstream methods are based on neural classifiers that recognize relations with distant supervision provided by large-scale databases Xu et al. (2015); Wang et al. (2016); Zheng et al. (2017).

As for open relation extraction, virtually all existing techniques make use of pattern matching. Some extractors, such as TextRunner Banko et al. (2007), WOEpos Wu and Weld (2010), and Reverb Fader et al. (2011), focus on efficiency by restricting themselves to shallow syntactic analysis, namely part-of-speech tagging and chunking. Other approaches design complex patterns on top of full syntactic processing, especially dependency parsing, such as WOEparse Wu and Weld (2010), PATTY Nakashole et al. (2012), OLLIE Mausam et al. (2012), and MinIE Gashteovski et al. (2017). These extractors obtain significantly better results but are generally more expensive than extractors based on shallow syntax. In addition, they rely heavily on syntactic parsers. Several papers Mausam et al. (2012); Del Corro and Gemulla (2013) analyzed the errors made by their extractors and found that parser errors account for a large part, sometimes the largest part, of all errors. Parsing errors restrain the extractors' performance and produce a serious cascade effect.

In general, the patterns are either handcrafted or learned in a semi-supervised way by bootstrapping. Manual patterns are accurate but costly and inefficient to produce Angeli et al. (2015); Del Corro and Gemulla (2013). Bootstrapping approaches first bootstrap data from seeds and then learn lexical or syntactic patterns to create an extractor based on shallow or full syntactic analysis of a sentence, respectively; examples are OLLIE Mausam et al. (2012), Open IE-4.x Mausam (2016), and NestIE Bhutani et al. (2016).

Many NLP tasks, such as Chinese word segmentation (CWS), part-of-speech (POS) tagging, chunking, and named entity recognition (NER), require assigning representative labels to the words in a sentence. Most current methods treat these tasks as sequence tagging problems Zhai et al. (2017). Traditional sequence labeling models are linear statistical models, including hidden Markov models Baum and Petrie (1966), maximum entropy Markov models Mccallum et al. (2002), and conditional random fields (CRF) Lafferty et al. (2001). Neural network based sequence tagging methods map the input sequence to large fixed-dimensional vector representations with an LSTM Sutskever et al. (2014); Huang et al. (2015), a BiLSTM Huang et al. (2015), a CNN Collobert et al. (2011), or a combination of these Chiu and Nichols (2015), and then predict the target sequence from the vectors using a layer with softmax activation Chiu and Nichols (2015); Huang et al. (2015) or a CRF layer Raganato et al. (2017).

In addition, there are a few examples that apply neural models to open information extraction. Some borrow the machine translation mechanism and convert the extraction process into text generation: Zhang et al. (2017) extracted predicate-argument structure phrases with a sequence-to-sequence model, and Cui et al. (2018) proposed a multilayered encoder-decoder framework that generates sequences of relation tuples with special placeholders as markers. Stanovsky et al. (2018) also formulated open relation extraction as a sequence tagging problem, applying a BiLSTM and a softmax layer to tag each word. In contrast, we design more effective semantic learning frameworks to annotate relations. Moreover, our training set contains several hundred thousand instances, whereas theirs has only a few thousand.

3 Training Set Preparation

There are few publicly available large-scale labeled corpora for the ORE tagging task. Stanovsky and Dagan (2016) created a labeled corpus for evaluating ORE by automatically translating QA-SRL annotations He et al. (2015); it contains only 10,359 tuples over 3,200 sentences. Stanovsky et al. (2018) then added 17,163 labeled open tuples from the QAMR corpus Michael et al. (2018), but the accuracy declined. We therefore bootstrap a training set from multiple existing open relation extractors, without resorting to expensive annotation efforts. The training set contains 477,701 triples, and we design a tagging scheme to annotate these data automatically.

To build a high-quality corpus for model training, we favor high extraction precision at the expense of recall. The constructed dataset is imperfect, but it is still acceptably scalable. The experimental sections demonstrate the effectiveness of the dataset.

Figure 1: Standard annotation for an example sentence based on our tagging scheme, where “E1”, “R”, and “E2” represent Argument1, Relation, and Argument2, respectively. The tag sequences are independent of each other.

3.1 Collecting Correct Relation Triples

We use three existing extractors, all well regarded and widely used, to extract relation triples from raw text. If a triple is obtained by all three extractors, we consider it correct and add it to the corpus. The three extractors are OLLIE Mausam et al. (2012), ClausIE Del Corro and Gemulla (2013), and Open IE-4.x Mausam (2016). They use different methods to extract relations. The raw text comes from the WMT 2011 News Crawl data (footnote 1).

Footnote 1: http://www.statmt.org/lm-benchmark/
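The agreement filter described above can be sketched as a set intersection. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the normalization step (lowercasing and collapsing whitespace) is an assumption, since the text does not specify how triples from different tools are compared.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (argument1, relation, argument2)


def normalize(triple: Triple) -> Triple:
    """Lowercase and collapse whitespace (assumed matching criterion)."""
    arg1, rel, arg2 = (" ".join(part.lower().split()) for part in triple)
    return (arg1, rel, arg2)


def consensus_triples(*extractor_outputs: Iterable[Triple]) -> Set[Triple]:
    """Keep only the triples that every extractor produced."""
    normalized = [{normalize(t) for t in output} for output in extractor_outputs]
    if not normalized:
        return set()
    return set.intersection(*normalized)
```

In practice the comparison would probably need to be more tolerant (for example, of differences in argument boundaries); exact matching after normalization is simply the strictest reading of "obtained by all three extractors".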

All the extractors are based on dependency parse trees, so an extracted relational phrase may be a combination: words that are adjacent in the phrase may in fact be far apart in the original sentence. Moreover, the order of the words in a triple may differ from their order in the sentence. For example, from the sentence “He thought the current would take him out, then he could bring help to rescue me.” we can get the triple (the current, would take out, him). We therefore define word order constraints. The elements of a triple (Argument1, Relation, Argument2) are ordered: all words in Argument1 must appear before all words in Relation, and all words in Argument2 must appear after the Relation words. The order of the words in Relation must be consistent with their order in the original sentence. In addition, each relational word must appear in the original sentence; it must not be a modified or added word.
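A minimal sketch of how such word order constraints could be checked is given below. It rests on assumptions that the text does not state: arguments are matched as contiguous token spans, tokens are compared exactly, and the function names are hypothetical.

```python
from typing import List, Optional


def find_span(tokens: List[str], phrase: List[str], start: int = 0) -> Optional[int]:
    """Index of the first contiguous occurrence of phrase at or after start."""
    width = len(phrase)
    if width == 0:
        return None
    for i in range(start, len(tokens) - width + 1):
        if tokens[i:i + width] == phrase:
            return i
    return None


def satisfies_word_order(tokens: List[str], arg1: List[str],
                         relation: List[str], arg2: List[str]) -> bool:
    """Check Argument1 < Relation < Argument2 with in-order relation words."""
    if not relation:
        return False
    arg1_start = find_span(tokens, arg1)
    if arg1_start is None:
        return False
    pos = arg1_start + len(arg1)
    # Relation words must occur in sentence order after Argument1;
    # they may be separated by other words.
    for word in relation:
        try:
            pos = tokens.index(word, pos) + 1
        except ValueError:
            return False  # word missing from the sentence, or out of order
    # Argument2 must start after the last relation word.
    return find_span(tokens, arg2, pos) is not None
```

Under these assumptions the example triple (the current, would take out, him) is rejected, because “him” precedes “out” in the sentence, whereas (the current, would take, him) would pass.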

We randomly sample 100 triples from the corpus to test its accuracy, which reaches 0.95. This result supports the validity of the procedure above. It also shows that extraction errors caused by syntactic analysis errors are largely filtered out.

3.2 Automatic Sequence Tagging

The task of tagging is to assign a label to every word in a sentence. In a relation triple, the arguments and the relational phrase may each span several tokens within a sentence. Thus, the arguments and the relational phrase need to be segmented and tagged separately. They are annotated with the BIOES scheme (Begin, Inside, Outside, End, Single), which indicates the position of a token in an argument or in the relational phrase. Figure 1 shows an example of the tagging result. This scheme has been reported to be more expressive than others such as BIO Ratinov and Roth (2009).
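The BIOES scheme can be illustrated with a short sketch. The concrete tag strings (for example `B-E1` or `S-R`), the use of contiguous half-open token spans, and the function names are assumptions made for this illustration; the exact label format of Figure 1 is not recoverable from the text.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open token offsets [start, end)


def bioes(length: int, role: str) -> List[str]:
    """BIOES labels for a span of the given length and role (E1, R, or E2)."""
    if length <= 0:
        return []
    if length == 1:
        return [f"S-{role}"]
    return [f"B-{role}"] + [f"I-{role}"] * (length - 2) + [f"E-{role}"]


def tag_sentence(n_tokens: int, spans: Dict[str, Span]) -> List[str]:
    """Label every token: role tags inside the spans, 'O' elsewhere."""
    tags = ["O"] * n_tokens
    for role, (start, end) in spans.items():
        tags[start:end] = bioes(end - start, role)
    return tags
```

For a 15-token sentence beginning “The United States President Trump ...”, calling `tag_sentence(15, {"E1": (0, 3), "R": (3, 4), "E2": (4, 5)})` labels the first five tokens `B-E1, I-E1, E-E1, S-R, S-E2` and the rest `O`.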

A sentence may contain two or more triples, and each triple corresponds to its own tag sequence. The tag information of the arguments is used as model input; an argument can have multiple labels because it may belong to different triples. At prediction time, we extract candidate argument pairs in advance and then encode them as a tag sequence that is fed to the model, which identifies the relation between them. If the arguments are not related, all words outside the arguments are labeled “O”.

3.3 From Tag Sequence to Extracted Results

From Tags Sequence 1 in Figure 1, “The United States” and “Trump” can be combined into a triple whose relation is “President”. Because “The United States” has argument role “1” and “Trump” has argument role “2”, the final result is (The United States, President, Trump). The same applies to (Trump, will visit, the Apple Inc) and (the Apple Inc, founded by, Steven Paul Jobs).
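Decoding a tag sequence back into a triple can be sketched as follows. The tag strings (`B-E1`, `S-R`, and so on) and the example sentence are assumptions assembled from the triples listed above, not the exact content of Figure 1, and `decode_triple` is a hypothetical helper.

```python
from typing import Dict, List, Optional, Tuple


def decode_triple(tokens: List[str], tags: List[str]) -> Optional[Tuple[str, str, str]]:
    """Collect the tokens of each role and return (Argument1, Relation, Argument2)."""
    parts: Dict[str, List[str]] = {"E1": [], "R": [], "E2": []}
    for token, tag in zip(tokens, tags):
        if tag == "O":
            continue
        _, _, role = tag.partition("-")
        if role in parts:
            parts[role].append(token)
    if not all(parts.values()):
        return None  # an argument or the relation is missing
    return (" ".join(parts["E1"]), " ".join(parts["R"]), " ".join(parts["E2"]))
```

This sketch only gathers tokens by role; it does not verify that the BIOES prefixes form a well-formed sequence, which would be a separate check.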

4 Methods

We briefly describe the long short-term memory network (LSTM), the convolutional neural network (CNN), and conditional random fields (CRF). We then present a hybrid sequence tagging network (NST) and introduce several popular neural models, which are compared in the experiment section.

4.1 LSTM Network

Recurrent neural networks (RNNs) are good at grasping the temporal semantics of a sequence. They are useful for all kinds of tagging applications Raganato et al. (2017); Zhou et al. (2017). LSTM is a variant of RNNs Hochreiter and Schmidhuber (1997). LSTM is better at capturing long-range dependencies than basic RNN.

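For a given sentence $X = (x_1, x_2, \dots, x_n)$, the LSTM returns another representation $(h_1, h_2, \dots, h_n)$ of the input sequence. An LSTM unit is composed of three multiplicative gates which control the proportions of information to forget and to pass on to the next time step Ma and Hovy (2016). A single LSTM memory unit is implemented as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
\tag{1}
$$

where $i_t$, $f_t$, $o_t$, and $c_t$ are the input gate, forget gate, output gate and cell vectors, respectively, $b$ is the bias, $\sigma$ is the logistic sigmoid function, and $W$ are the trainable parameter matrices.

The hidden state $\overrightarrow{h_t}$ captures the left context of the sentence at every word $x_t$, but knows nothing about the right context Lample et al. (2016). A BiLSTM, in contrast, presents each sequence forwards and backwards to two separate hidden states to capture past (left) and future (right) information, respectively Graves et al. (2013a).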
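As an illustration of how the gates interact, the following sketch computes one LSTM time step in plain Python. It assumes the common formulation without peephole connections, which may differ in detail from the authors' implementation; the parameter layout (`params[gate] = (W_x, W_h, b)`) and all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x: Matrix, x: Vector, W_h: Matrix, h: Vector, b: Vector) -> Vector:
    """Compute W_x x + W_h h + b, one output row at a time."""
    return [
        sum(w * v for w, v in zip(row_x, x)) + sum(w * v for w, v in zip(row_h, h)) + bias
        for row_x, row_h, bias in zip(W_x, W_h, b)
    ]


def lstm_step(x: Vector, h_prev: Vector, c_prev: Vector,
              params: Dict[str, Tuple[Matrix, Matrix, Vector]]) -> Tuple[Vector, Vector]:
    """One time step; params maps 'i', 'f', 'o', 'c' to (W_x, W_h, b)."""
    def pre(gate: str) -> Vector:
        W_x, W_h, b = params[gate]
        return affine(W_x, x, W_h, h_prev, b)

    i = [sigmoid(z) for z in pre("i")]        # input gate
    f = [sigmoid(z) for z in pre("f")]        # forget gate
    o = [sigmoid(z) for z in pre("o")]        # output gate
    cand = [math.tanh(z) for z in pre("c")]   # candidate cell values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, cand)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

A BiLSTM would run this step once left to right and once right to left with separate parameters, and pair the two hidden states at each position.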

4.2 CNN Network

Relation extraction only marks the relation that corresponds to the target arguments, rather than labeling each word independently. It may therefore be necessary to use all local features and perform the labeling globally. A CNN is a natural means of capturing salient local features from the whole sequence Collobert et al. (2011).

Convolution is an operation between a vector of weights and a vector of the input sequence. The weight matrix is regarded as the filter of the convolution Zeng et al. (2014). The input sequence consists of $n$ words, and each word is mapped to a $d$-dimensional embedding. An illustration of a CNN architecture with three convolutional filters is given in Figure 2 (a). The embedded sequence is given to the convolutional layer, and its output is immediately fed to a max-pooling layer, which yields an output vector of fixed length.
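The convolution and max-pooling step can be sketched in plain Python as follows. The ReLU activation, the absence of padding, and the function names are assumptions for illustration; the text does not specify them.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def conv_feature_map(embeddings: Matrix, filt: Matrix, bias: float = 0.0) -> Vector:
    """Slide one filter (width x d weights) over the sequence of embeddings."""
    width = len(filt)
    feature_map = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        z = bias + sum(
            w * x
            for filt_row, emb_row in zip(filt, window)
            for w, x in zip(filt_row, emb_row)
        )
        feature_map.append(max(0.0, z))  # ReLU (assumed activation)
    return feature_map


def conv_max_pool(embeddings: Matrix, filters: List[Matrix]) -> Vector:
    """Max-pool each feature map over time: one value per filter."""
    # Assumes the sentence is at least as long as the widest filter.
    return [max(conv_feature_map(embeddings, f)) for f in filters]
```

Because pooling takes the maximum over all positions, the output length equals the number of filters regardless of sentence length, which is the fixed-length property mentioned above.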

4.3 CRF

For sequence labeling tasks, there are strong dependencies across output labels. It is beneficial to consider the correlations between labels in neighborhoods. CRF can efficiently use past and future tags to predict the current tag. Therefore, it is a common way to model label sequence jointly using a CRF layer, instead of making independent tagging decisions directly Lample et al. (2016); Ma and Hovy (2016).

For an input sentence $X = (x_1, x_2, \dots, x_n)$ and an output label sequence $y = (y_1, y_2, \dots, y_n)$, we define its score as

$$
s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}
\tag{2}
$$

where $A$ is a matrix of transition scores such that $A_{i,j}$ represents the score of a transition from tag $i$ to tag $j$, and $y_0$ and $y_{n+1}$ are the start and end symbols of a sentence. We regard $P$ as the matrix of scores output by the upper layer; $P_{i,j}$ corresponds to the score of the $j$-th tag of the $i$-th word in a sentence.

We predict the output sequence that obtains the maximum score:

$$
y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})
\tag{3}
$$

where $Y_X$ represents all possible tag sequences, including those that do not obey the BIOES format constraints.
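The score of a tag sequence and the search for the highest-scoring sequence can be sketched as below. The decoder uses the standard Viterbi algorithm, which is a common choice for this maximization but is not named in the text. The index convention (tags `0..k-1`, start symbol `k`, end symbol `k+1` in the transition matrix) and the function names are assumptions.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def sequence_score(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """Transition scores (including start and end) plus per-word tag scores."""
    k = len(P[0])
    path = [k] + list(y) + [k + 1]  # add start and end symbols
    transitions = sum(A[a][b] for a, b in zip(path, path[1:]))
    emissions = sum(P[i][tag] for i, tag in enumerate(y))
    return transitions + emissions


def viterbi_decode(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the highest-scoring tag sequence and its score."""
    if not P:
        raise ValueError("empty sentence")
    k = len(P[0])
    start, end = k, k + 1
    best = [A[start][t] + P[0][t] for t in range(k)]
    backpointers: List[List[int]] = []
    for row in P[1:]:
        prev = best
        pointers, best = [], []
        for t in range(k):
            s = max(range(k), key=lambda cand: prev[cand] + A[cand][t])
            pointers.append(s)
            best.append(prev[s] + A[s][t] + row[t])
        backpointers.append(pointers)
    last = max(range(k), key=lambda t: best[t] + A[t][end])
    score = best[last] + A[last][end]
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score
```

Dynamic programming avoids enumerating every candidate sequence: the cost grows linearly with sentence length and quadratically with the number of tags.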

4.4 BiLSTM-CNN-CRF Network

In this section, we propose a hybrid neural network, named NST, which combines BiLSTM and CNN to learn a continuous representation of each word in a sentence. This representation is used to predict the open relational tags by CRF layer. Figure 2 illustrates the architecture of the network in detail.

Figure 2: An illustration of our model.

Let $X = (x_1, x_2, \dots, x_n)$ denote the sequence of input embeddings for the words in a sentence. $X$ is given as input to a BiLSTM, which returns a representation of the forward and backward context for each word, as explained in Section 4.1.

During relation tagging, relational words tend to appear in the neighborhood of the arguments. In other words, different sub-sequence chunks differ in how much feature information they provide to the model. A CNN has good local perception ability, so we use it to learn local features of the input sequence. This differs from previous methods Ma and Hovy (2016); Chiu and Nichols (2015), which usually used a CNN to learn character-level embeddings.

For each word, we concatenate the bidirectional temporal features (the forward and backward BiLSTM hidden states) and the local spatial features, which are learned by the CNN sub-network using convolutional filters of width 3, into a single vector. This vector is fed into the CRF layer to jointly yield the final predictions for every word. The parameters are trained to maximize Equation 3 for the observed sequences of relation tags in the prepared training set.

Furthermore, to verify the effectiveness of the modules of this model, we divide it into independent sub-models:

BiLSTM-CNN Network. It uses a BiLSTM and a CNN to learn semantic information, and then uses a softmax layer to output the predicted tags. The sub-model is similar to Zhai et al. (2017).

CNN-CRF Network. We only use convolutional filters of width 3 to capture local semantics. After max-pooling, the local feature vectors are fed into the CRF layer to classify relational tags.

4.5 More Networks

LSTM-LSTM Network. The idea is to use one LSTM to read the input sequence, one time step at a time, to obtain a large fixed-dimensional vector representation, and then to use another LSTM to extract the output sequence from that vector through a softmax. The model was first proposed by Sutskever et al. (2014). The models used in Zheng et al. (2017); Ramachandran et al. (2016) are variants of it.

LSTM-CRF Network. We combine an LSTM network and a CRF layer to form an LSTM-CRF model. This network can efficiently use past input features via the LSTM layer and sentence-level tag information via the CRF layer. The architecture was used by Huang et al. (2015) for several sequence tagging tasks.

BiLSTM-CRF Network. Similar to the LSTM-CRF network, we combine a bidirectional LSTM network and a CRF layer to form a BiLSTM-CRF network. In addition to the past input features and sentence-level tag information used in an LSTM-CRF model, a BiLSTM-CRF model can also use future input features. This network is widely used, for example in Ma and Hovy (2016); Lample et al. (2016); Chiu and Nichols (2015); Huang et al. (2015). It can also be regarded as a module of our proposed model.

5 Experiments

In this section, we present our experiments in detail. We evaluate various models with Precision (P), Recall (R) and F-measure (F1).
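For reference, the three measures can be computed as in the sketch below. Treating a prediction as correct only when it exactly matches a gold item is an assumption made for simplicity, and the function names are hypothetical.

```python
from typing import Hashable, Set, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(predicted: Set[Hashable], gold: Set[Hashable]) -> Tuple[float, float, float]:
    """P, R and F1 of a set of predictions against a set of gold items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall, f1_score(precision, recall)
```

As a sanity check, `f1_score(0.869, 0.735)` rounds to 0.796, which agrees with the average row reported for NST in Table 2.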

5.1 Experimental Setting

| Id | Dataset | Source | Sent. | Triples |
|---|---|---|---|---|
| 1 | Reverb dataset | Yahoo | 500 | 1,765 |
| 2 | Wikipedia dataset | Wikipedia | 200 | 605 |
| 3 | NYT dataset | New York Times | 200 | 578 |
| 4 | OIE2016 dataset | QA-SRL annotations | 3,200 | 10,359 |

Table 1: The datasets used for testing in this work.

Test set. To keep the experiments open and effective, we gather as many high-quality test sets as we can from previously published work. They should be close to natural text and independent of the training set.

We use several datasets in our experiments. First, the Reverb dataset comes from Fader et al. (2011) and consists of 500 sentences with 1,765 manually labeled extractions. The sentences were obtained via the random link service of Yahoo. Next, the Wikipedia dataset includes 200 random sentences extracted from Wikipedia pages, with 605 extractions manually labeled by Del Corro and Gemulla (2013). The NYT dataset contains 578 triples extracted from 200 random sentences in the New York Times collection; it was also created by Del Corro and Gemulla (2013). In addition, Stanovsky and Dagan (2016) presented the labeled OIE2016 corpus for evaluation, which was automatically translated from QA-SRL annotations He et al. (2015). Table 1 presents the details of the four datasets.

Hyperparameters. We implement the neural networks using the Keras library (footnote 2). The training set and the validation set contain 395,715 and 81,986 records, respectively. The batch size is fixed to 50. We use early stopping Graves et al. (2013b) based on performance on the validation set. The number of LSTM units is 200, and the number of feature maps for each convolutional filter is 200. Parameters are optimized with the Adam optimizer Kingma and Ba (2014). The initial learning rate is 0.001, and it is reduced by a factor of 0.1 if the loss does not improve for several epochs. To mitigate over-fitting, we apply dropout Srivastava et al. (2014) to regularize our model.

Footnote 2: https://github.com/keras-team/keras
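The learning rate schedule described above can be sketched as a small plateau-based scheduler. The initial rate (0.001) and the factor (0.1) come from the text; the `patience` value is an assumption, since the text only says "some epochs", and the class name is hypothetical.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr: float = 0.001, factor: float = 0.1, patience: int = 3) -> None:
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumed value; not given in the text
        self.best = float("inf")
        self.wait = 0

    def step(self, loss: float) -> float:
        """Record the loss of one epoch and return the learning rate to use next."""
        if loss < self.best:
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

Calling `step` once per epoch with the monitored loss keeps the rate unchanged while the loss keeps improving and lowers it tenfold once the loss has stalled for `patience` consecutive epochs.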

We use three types of embeddings as inputs. Word embeddings are pre-trained with word2vec Mikolov et al. (2013) on our constructed dataset and have 300 dimensions. Part-of-speech (POS) embeddings are also used, since POS information plays an important role in relation extraction. We annotate POS categories with TreeTagger Schmid (1994), which is widely adopted for POS tagging and distinguishes 59 different tags. In addition, we encode argument information as 10-dimensional one-hot vectors.

5.2 Experimental Results

We report the results of the various models on the first three datasets in Table 2 (footnotes 3 and 4). We draw the following conclusions from the table.

Footnote 3: To keep argument recognition errors from distorting the final performance, we only judge the correctness of the relational phrase in a triple.

Footnote 4: A relation extracted by Reverb, OLLIE, ClausIE, or Open IE-4.x is counted as correct only if its confidence is greater than 0.5.

| Model | Wikipedia P | Wikipedia R | Wikipedia F | NYT P | NYT R | NYT F | Reverb P | Reverb R | Reverb F | Average P | Average R | Average F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CRF | 0.548 | 0.264 | 0.357 | 0.460 | 0.130 | 0.202 | 0.425 | 0.198 | 0.270 | 0.458 | 0.198 | 0.277 |
| CNN | 0.886 | 0.527 | 0.661 | 0.931 | 0.493 | 0.645 | 0.915 | 0.504 | 0.650 | 0.912 | 0.506 | 0.651 |
| LSTM | 0.559 | 0.463 | 0.506 | 0.607 | 0.481 | 0.537 | 0.632 | 0.530 | 0.576 | 0.612 | 0.506 | 0.554 |
| BiLSTM | 0.766 | 0.683 | 0.722 | 0.807 | 0.637 | 0.712 | 0.784 | 0.716 | 0.748 | 0.784 | 0.693 | 0.736 |
| LSTM-LSTM | 0.552 | 0.426 | 0.481 | 0.623 | 0.497 | 0.552 | 0.639 | 0.535 | 0.582 | 0.619 | 0.505 | 0.556 |
| CNN-CRF | 0.878 | 0.574 | 0.694 | 0.854 | 0.578 | 0.689 | 0.892 | 0.568 | 0.694 | 0.882 | 0.571 | 0.693 |
| LSTM-CRF | 0.720 | 0.590 | 0.649 | 0.721 | 0.593 | 0.651 | 0.709 | 0.604 | 0.652 | 0.714 | 0.599 | 0.651 |
| BiLSTM-CRF | 0.838 | 0.734 | 0.782 | 0.817 | 0.678 | 0.741 | 0.830 | 0.743 | 0.784 | 0.829 | 0.728 | 0.775 |
| BiLSTM-CNN | 0.821 | 0.729 | 0.772 | 0.837 | 0.702 | 0.764 | 0.831 | 0.760 | 0.794 | 0.830 | 0.743 | 0.784 |
| NST | 0.876 | 0.736 | 0.800 | 0.864 | 0.701 | 0.774 | 0.868 | 0.746 | 0.802 | 0.869 | 0.735 | 0.796 |
| Reverb | 0.770 | 0.210 | 0.330 | 0.557 | 0.144 | 0.228 | 0.595 | 0.133 | 0.217 | 0.641 | 0.162 | 0.259 |
| OLLIE | 0.994 | 0.279 | 0.436 | 0.986 | 0.249 | 0.398 | 0.975 | 0.198 | 0.329 | 0.985 | 0.242 | 0.389 |
| ClausIE | 0.795 | 0.526 | 0.633 | 0.656 | 0.481 | 0.555 | 0.953 | 0.585 | 0.725 | 0.801 | 0.531 | 0.638 |
| Open IE-4.x | 0.766 | 0.340 | 0.471 | 0.801 | 0.341 | 0.478 | 0.810 | 0.312 | 0.451 | 0.792 | 0.331 | 0.467 |

Table 2: The predicted results of different models on the Reverb dataset, Wikipedia dataset, and NYT dataset. Bold indicates the best value when our model is compared with the other sequence tagging models. Underline indicates the best value in the comparison with the traditional ORE extractors.

First, our model NST outperforms all other methods in F1. It improves the average F1 by 15.8 points over the best traditional method, ClausIE Del Corro and Gemulla (2013) (0.796 vs. 0.638), which shows the effectiveness of the proposed method. Many of the other neural sequence learning methods also outperform the traditional pattern-based methods. The traditional methods may achieve high precision, but they cannot learn sufficiently rich patterns, which results in lower recall. In addition, sentence syntax structures vary endlessly, and template matching is rigid and inflexible. Neural models can learn deep semantic and syntactic information from sentences and thus achieve good precision and recall.

In particular, the method proposed by Stanovsky et al. (2018) is the same as the BiLSTM network model, and our model achieves better results than it. Besides capturing the temporal semantics of sentences with recurrent networks, our model benefits from extracting spatial semantics with the convolutional layers and from fusing sentence-level tag information with the CRF layer.

Next, our model achieves clearly better recall than the traditional extractors on the three manually annotated test sets, even though it may be limited by having been trained on an incomplete training set. The experimental results demonstrate the generalization ability of the automatic dataset construction method.

Furthermore, we analyze the effects of the various network layers. Compared with LSTM, BiLSTM captures richer temporal semantic information and is clearly superior, by about 18.2 points in average F1 (0.736 vs. 0.554). BiLSTM is better than CNN in recall and F1, whereas CNN achieves better precision; overall, CNN outperforms LSTM. In addition, comparing LSTM with LSTM-CRF, BiLSTM with BiLSTM-CRF, and CNN with CNN-CRF leads to the same conclusion: the CRF layer improves the average F1 in every case. The same is true for NST (BiLSTM-CNN-CRF) compared with BiLSTM-CNN. Moreover, using a CRF layer to output the target tags works better than using an LSTM layer.

Finally, the performance of the models is stable across the three datasets. This indicates that the neural sequence tagging methods are robust and scalable.

We use the OIE2016 dataset to evaluate the precision and recall of different systems. The precision-recall curves are shown in Figure 3 (footnote 5), and the area under the precision-recall curve (AUC) for each system is shown in Figure 4. Our neural model (NST) has better precision than the three traditional models over most of the recall range. Our model is learned from the bootstrapped outputs of the three traditional systems, yet its AUC score is better than theirs. This shows that our model has a certain generalization ability after training on the training set. Although the precision of the neural En-Decoder model Cui et al. (2018) is better than that of our model, its recall stays in a lower range than ours. Our model achieves the best AUC score, 0.487, which is significantly better than the other systems. In particular, the AUC score of our model is more than twice that of the En-Decoder model.

Footnote 5: When our model (NST) is run, we use the methods in ClausIE Del Corro and Gemulla (2013) and Open IE-4.x Mausam (2016) to pre-identify the arguments in sentences. The model proposed in Stanovsky et al. (2018) is not shown here, because it used the OIE2016 dataset as its training set. We evaluate it in Table 2.
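The area under a precision-recall curve can be approximated from a list of (recall, precision) points. The sketch below uses the trapezoidal rule, which is an assumption: the text does not say how the AUC values were computed, and the function name is hypothetical.

```python
from typing import Iterable, Tuple


def pr_auc(points: Iterable[Tuple[float, float]]) -> float:
    """Trapezoidal area under a curve given as (recall, precision) points."""
    pts = sorted(points)  # order by recall
    return sum(
        (r1 - r0) * (p0 + p1) / 2.0
        for (r0, p0), (r1, p1) in zip(pts, pts[1:])
    )
```

A curve that never reaches high recall covers only a narrow recall interval, so its area stays small even when its precision is high.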

Figure 3: The Precision-recall curves of the different ORE systems on the OIE2016 dataset. The model En-Decoder comes from  Cui et al. (2018)
Figure 4: The Area under precision-recall curve (AUC) shown in Figure 3

6 Analysis and Discussion

| Model | P | R | F1 |
|---|---|---|---|
| NST (word embedding) | 0.783 | 0.708 | 0.744 |
| NST (word embedding, POS embedding) | 0.822 | 0.751 | 0.785 |
| NST (word embedding, POS embedding; embeddings adjusted during training) | 0.869 | 0.735 | 0.796 |

Table 3: The results of evaluating the influence of embeddings used in the NST model.

6.1 Analysis of Effect of Embedding

We perform an experiment to evaluate the importance of the various embeddings, as shown in Table 3. The results are averaged over the Reverb, Wikipedia, and NYT datasets. The model trained with word embeddings only performs worse than the model trained with both word and POS embeddings: the POS features play an important role in improving the performance of our model and increase F1 by 4.1 points (0.744 to 0.785). In addition, if the input embeddings are adjusted along with model training, the final output is better: F1 increases by a further 1.1 points (0.785 to 0.796).

6.2 Analysis of Model Generalization Ability

We separate the test set into four parts according to whether the three extractors used to construct the training set can correctly identify each test instance: all three extractors identify it correctly (ATE), two extractors identify it correctly (ETE), only one extractor identifies it correctly (OOE), and no extractor identifies it correctly (NOE). As shown in Figure 5, our model identifies the instances in ATE with an accuracy close to 1, which implies that our model (NST) learns the features of the training data well. For the other three parts, our model also shows good recognition capability, especially for the instances in NOE, which may appear rarely or not at all in the training set. This analysis shows that our model has strong generalization ability and can learn to extract triples beyond the capability of the three extraction tools. We believe this is because the neural approach learns relation tagging across a large number of highly confident training instances. It also indirectly indicates that our constructed training set is of acceptable quality.

Figure 5: The recognition performance of the NST model on each part of the test sets. The black bar indicates the number of instances correctly identified by NST, and the gray bar represents the total number of instances in that part.
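The partition into ATE, ETE, OOE, and NOE described above can be sketched as a simple count of how many of the three extractors are correct on each instance. This is a minimal illustration under assumed data structures: the function names, the instance identifiers, and the boolean flags below are hypothetical and do not come from the paper's code or data.

```python
from collections import defaultdict

# Number of extractors that are correct -> name of the test-set part.
BUCKETS = {3: "ATE", 2: "ETE", 1: "OOE", 0: "NOE"}


def partition_by_agreement(extractor_correct):
    """Group test instances by how many of the three extractors got them right.

    extractor_correct: dict mapping instance id -> tuple of three booleans,
    one per extractor (hypothetical input format).
    """
    parts = defaultdict(list)
    for inst_id, flags in extractor_correct.items():
        if len(flags) != 3:
            raise ValueError("expected one flag per extractor (three in total)")
        parts[BUCKETS[sum(flags)]].append(inst_id)
    return dict(parts)


def per_part_accuracy(parts, model_correct):
    """Fraction of instances in each part that the model identifies correctly."""
    return {
        name: sum(model_correct[i] for i in ids) / len(ids)
        for name, ids in parts.items()
    }


# Toy data for illustration only.
extractor_correct = {
    "s1": (True, True, True),
    "s2": (True, False, True),
    "s3": (False, False, True),
    "s4": (False, False, False),
    "s5": (False, False, False),
}
model_correct = {"s1": True, "s2": True, "s3": False, "s4": True, "s5": False}

parts = partition_by_agreement(extractor_correct)
print(per_part_accuracy(parts, model_correct))
```

In this toy run the NOE part contains two instances, one of which the model gets right, which corresponds to the kind of per-part comparison plotted in Figure 5.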

6.3 Error Analysis

To find out which factors affect the results of our model, we analyze its performance in tagging open relations, as shown in Figure 6. There are four major types of errors. 30.3% of the relations are missed by the model, and 22.1% of the extracted relations are abandoned because their tag sequences violate the tagging scheme. These two error types mainly limit recall. In addition, the model may wrongly determine the start or end position of a relational phrase; such relations are then counted as incorrect, which lowers precision.

Figure 6: Error analysis results.
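To illustrate how a predicted tag sequence can violate a tagging scheme and be abandoned, the sketch below checks well-formedness under an assumed generic scheme with the tags B (begin), I (inside), E (end), S (single token), and O (outside). This is only a stand-in: the tag inventory and the function name `is_well_formed` are hypothetical, and the paper's own tagging scheme may differ.

```python
def is_well_formed(tags):
    """Check a tag sequence against an assumed B/I/E/S/O scheme.

    A phrase must be either a single S tag or B, zero or more I, then E.
    Assumption: this generic scheme is for illustration, not the paper's.
    """
    inside = False  # True while a B has been opened and not yet closed by E
    for tag in tags:
        if tag == "B":
            if inside:
                return False  # a new phrase starts before the previous ended
            inside = True
        elif tag == "I":
            if not inside:
                return False  # continuation without a beginning
        elif tag == "E":
            if not inside:
                return False  # end without a beginning
            inside = False
        elif tag in ("S", "O"):
            if inside:
                return False  # phrase interrupted before its E tag
        else:
            return False  # unknown tag
    return not inside  # an unclosed phrase at the end is invalid


print(is_well_formed(["O", "B", "I", "E", "O"]))
print(is_well_formed(["O", "I", "E", "O"]))
```

Under this assumed scheme, the first sequence is accepted and the second is rejected because it continues a phrase that never began; rejected sequences are the kind that would be discarded and therefore reduce recall.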

7 Conclusion

In this paper, we construct a training set automatically and use sequence tagging methods based on neural networks to extract open relations. We also introduce a hybrid neural sequence tagging model. It incorporates both a bidirectional LSTM and convolutional neural networks to capture temporal and structural semantic information from contexts, and adds a CRF layer to predict relation tags. The experimental results show the effectiveness of our proposed method. Compared with traditional or neural open relation extraction models, our approach achieves state-of-the-art performance on multiple test sets. In future work, we will consider further improving the quality of the training set. We will also study a more efficient annotation scheme and use it to deal with n-ary relational tuples.

References

  • Angeli et al. (2015) Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D Manning. 2015. Leveraging Linguistic Structure For Open Domain Information Extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
  • Banko et al. (2007) Michele Banko, Mj Cafarella, and Stephen Soderland. 2007. Open information extraction for the web. In the 16th International Joint Conference on Artificial Intelligence (IJCAI 2007), pages 2670–2676.
  • Baum and Petrie (1966) Leonard E. Baum and Ted Petrie. 1966. Statistical Inference for Probabilistic Functions of Finite State Markov Chains. The Annals of Mathematical Statistics, 37(6):1554–1563.
  • Bhutani et al. (2016) Nikita Bhutani, H V Jagadish, and Dragomir Radev. 2016. Nested propositions in open information extraction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 55–64.
  • Chiu and Nichols (2015) Jason P. C. Chiu and Eric Nichols. 2015. Named Entity Recognition with Bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4:357–370.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(1):2493–2537.
  • Cui et al. (2018) Lei Cui, Furu Wei, and Ming Zhou. 2018. Neural Open Information Extraction. arXiv preprint arXiv:1805.04270.
  • Culotta et al. (2006) Aron Culotta, Andrew Mccallum, and Jonathan Betz. 2006. Integrating probabilistic extraction models and data mining to discover relations and patterns in text. In Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 296–303.
  • Del Corro and Gemulla (2013) Luciano Del Corro and Rainer Gemulla. 2013. Clausie: clause-based open information extraction. In Proceedings of the 22nd international conference on World Wide Web, pages 355–366.
  • Doddington et al. (2004) George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie Strassel, and Ralph M Weischedel. 2004. The Automatic Content Extraction (ACE) Program-Tasks, Data, and Evaluation. In The fourth international conference on Language Resources and Evaluation, volume 2, page 1.
  • Fader et al. (2011) Anthony Fader, Stephen Soderland, and Oren Etzioni. 2011. Identifying relations for open information extraction. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP 2011), pages 1535–1545.
  • Fader et al. (2014) Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD ’14, pages 1156–1165.
  • Gashteovski et al. (2017) Kiril Gashteovski, Rainer Gemulla, and Luciano Del Corro. 2017. MinIE: Minimizing Facts in Open Information Extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2620–2630.
  • Graves et al. (2013a) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013a. Speech Recognition with Deep Recurrent Neural Networks. In arXiv preprint arXiv:1303.5778, pages 6645–6649. IEEE.
  • Graves et al. (2013b) Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013b. Speech Recognition with Deep Recurrent Neural Networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pages 6645–6649. IEEE.
  • He et al. (2015) Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-answer driven semantic role labeling: Using natural language to annotate natural language. In EMNLP.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv preprint arXiv:1508.01991.
  • Jia et al. (2018) Shengbin Jia, Shijia E, Maozhen Li, and Yang Xiang. 2018. Chinese Open Relation Extraction and Knowledge Base Establishment. ACM Transactions on Asian and Low-Resource Language Information Processing, 17:15:1–22.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, pages 1–13.
  • Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando C N Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML ’01 Proceedings of the Eighteenth International Conference on Machine Learning, 8(June):282–289.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. arXiv preprint arXiv:1603.01360.
  • Lin et al. (2017) Hongyu Lin, Le Sun, and Xianpei Han. 2017. Reasoning with Heterogeneous Knowledge for Commonsense Machine Comprehension. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2022–2033.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv preprint arXiv:1603.01354.
  • Manning and Raghavan (2009) Christopher D. Manning and Prabhakar Raghavan. 2009. An Introduction to Information Retrieval. Cambridge: Cambridge university press, 61(4):852–853.
  • Mausam (2016) Mausam. 2016. Open Information Extraction Systems and Downstream Applications. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 4074–4077. AAAI Press.
  • Mausam et al. (2012) Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni. 2012. Open Language Learning for Information Extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, July, pages 523–534.
  • Mccallum et al. (2002) Andrew Mccallum, Dayne Freitag, and Fernando Pereira. 2002. Maximum Entropy Markov Models for Information Extraction and Segmentation. In Machine Learning, volume 17, pages 1–26.
  • Michael et al. (2018) Julian Michael, Gabriel Stanovsky, Luheng He, Ido Dagan, and Luke S. Zettlemoyer. 2018. Crowdsourcing question-answer meaning representations. In NAACL-HLT.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Advances in neural information processing systems, pages 3111–3119.
  • Mintz et al. (2009) Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - ACL-IJCNLP ’09, volume 2, page 1003. Association for Computational Linguistics.
  • Nakashole et al. (2012) Ndapandula Nakashole, Gerhard Weikum, and Fabian M. Suchanek. 2012. PATTY: A Taxonomy of Relational Patterns with Semantic Types. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL 2012), July, pages 1135–1145.
  • Raganato et al. (2017) Alessandro Raganato, Claudio Delli Bovi, and Roberto Navigli. 2017. Neural Sequence Learning Models for Word Sense Disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1167–1178.
  • Ramachandran et al. (2016) Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. 2016. Unsupervised Pretraining for Sequence to Sequence Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391.
  • Ratinov and Roth (2009) Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pages 147–155.
  • Schmid (1994) Helmut Schmid. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. Proceedings of International Conference on New Methods in Language Processing.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
  • Stanovsky and Dagan (2016) Gabriel Stanovsky and Ido Dagan. 2016. Creating a large benchmark for open information extraction. In Conference on Empirical Methods in Natural Language Processing, pages 2300–2305.
  • Stanovsky et al. (2018) Gabriel Stanovsky, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised open information extraction. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume, pages 885–895.
  • Sutskever et al. (2014) Sutskever, Vinyals, and Le. 2014. Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112.
  • Wang et al. (2016) Linlin Wang, Zhu Cao, Gerard De Melo, and Zhiyuan Liu. 2016. Relation Classification via Multi-Level Attention CNNs. Acl.
  • Wu and Weld (2010) Fei Wu and Daniel S Weld. 2010. Open Information Extraction using Wikipedia. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), July, pages 118–127.
  • Xu et al. (2015) Yan Xu, Lili Mou, Ge Li, and Yunchuan Chen. 2015. Classifying Relations via Long Short Term Memory Networks along Shortest Dependency Paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), September, pages 1785–1794.
  • Zeng et al. (2014) Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation Classification via Convolutional Deep Neural Network. In the 25th International Conference on Computational Linguistics (COLING 2014), 2011, pages 2335–2344.
  • Zhai et al. (2017) Feifei Zhai, Saloni Potdar, Bing Xiang, and Bowen Zhou. 2017. Neural Models for Sequence Chunking. In The 31 AAAI Conference on Artificial Intelligence, pages 3365–3371.
  • Zhang et al. (2017) Sheng Zhang, Kevin Duh, and Benjamin Van Durme. 2017. Mt/ie: Cross-lingual open information extraction with neural sequence-to-sequence models. In Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 64–70.
  • Zheng et al. (2017) Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint Extraction of Entities and Relations Based on a Novel Tagging Scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1227–1236.
  • Zhou et al. (2017) Hao Zhou, Zhenting Yu, Yue Zhang, Shujian Huang, Xinyu Dai, and Jiajun Chen. 2017. Word-Context Character Embeddings for Chinese Word Segmentation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 771–777.