Multiple Range-Restricted Bidirectional Gated Recurrent Units with Attention for Relation Classification

07/05/2017
by Jonggu Kim, et al.
POSTECH

Most neural approaches to relation classification have focused on finding short patterns that represent the semantic relation using Convolutional Neural Networks (CNNs), and these approaches have generally outperformed those based on Recurrent Neural Networks (RNNs). Following a similar intuition to the CNN models, we propose a novel RNN-based model that strongly focuses on only the important parts of a sentence, using multiple range-restricted bidirectional layers and attention for relation classification. Experimental results on the SemEval-2010 relation classification task show that our model is comparable to state-of-the-art CNN-based and RNN-based models that use additional linguistic information.


1 Introduction

Relation classification is the task of selecting the relation class that holds between two nominals (e1, e2) in a given text. For instance, given the sentence "The <e1>phone</e1> went into the <e2>washer</e2>.", where <e1>, </e1>, <e2>, </e2> are position indicators that mark the start and end of each nominal, the goal is to identify the actual relation Entity-Destination between phone and washer. The task is important because its results can be utilized in other Natural Language Processing (NLP) applications such as question answering and information retrieval.
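As a concrete illustration, here is a minimal Python sketch of reading such a tagged sentence; the tokenizer is an illustrative assumption (the paper only specifies that the indicators are treated as single words, see Section 4.1):

```python
import re

def tokenize_with_indicators(sentence):
    # Keep the position indicators <e1>, </e1>, <e2>, </e2> as single tokens
    # (Section 4.1); punctuation handling is deliberately simplified.
    spaced = re.sub(r"(</?e[12]>|[.,!?])", r" \1 ", sentence)
    return spaced.split()

tokens = tokenize_with_indicators("The <e1>phone</e1> went into the <e2>washer</e2>.")
p_e1 = tokens.index("<e1>") + 1   # token position of the e1 nominal ("phone")
p_e2 = tokens.index("<e2>") + 1   # token position of the e2 nominal ("washer")
```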

Recently, Neural Network (NN) approaches to relation classification have drawn attention because they require no handcrafted features and yet obtain better performance than traditional models. These NNs can be broadly divided into CNN-based and RNN-based models, and the two families capture slightly different features for predicting a relation class.

In general, CNN-based models can only capture local features, while RNN-based models are expected to capture global features as well; nevertheless, CNN-based models have performed better than RNN-based ones. A likely explanation is that most relation-related terms are not scattered across a sentence but concentrated in short expressions, and even though RNNs are expected to learn such information automatically, in practice they do not do so easily. To overcome this limitation of RNNs, most recent RNN-based work has used additional linguistic information such as the Shortest Dependency Path (SDP), which reduces the effect of noise words when predicting a relation.

In this paper, we propose a simple RNN-based model that strongly attends to nominal-related and relation-related parts of a sentence using multiple range-restricted bidirectional Gated Recurrent Units (GRUs) Cho et al. (2014) and attention. On the SemEval-2010 Task 8 dataset Hendrickx et al. (2009), our model with only pretrained word embeddings achieves an F1 score of 84.3%, which is comparable with state-of-the-art CNN-based and RNN-based models that use additional linguistic resources such as Part-Of-Speech (POS) tags, WordNet and SDP. Our contributions are summarized as follows:

  • For relation classification, without any additional linguistic information, we propose modeling the nominals and the relation in a sentence with explicit range-restriction criteria and attention using RNNs.

  • We show how effective separately abstracting the nominal parts, the relation part, and both together under these restrictions is for relation classification.

2 Related Work

Traditional approaches to relation classification find important features of relations with various linguistic processors and use them to train classifiers. For instance, Rink and Harabagiu (2010) use NLP tools to extract linguistic features and train an SVM model on them.

Recently, many deep learning approaches have been proposed. Zeng et al. (2014) propose a CNN-based model that automatically learns important N-gram features. dos Santos et al. (2015) propose a ranking loss function to better distinguish the real classes from the Other class. To capture long-distance patterns, RNN-based approaches, usually using Long Short-Term Memory (LSTM), have also appeared; one of them is Zhang and Wang (2015), whose model simply feeds on all words in a sentence and then captures the important ones through a max-pooling operation. Xu et al. (2015b) and Miwa and Bansal (2016) propose other RNN models that use the SDP to ignore noise words in a sentence. In addition, Liu et al. (2015) and Cai et al. (2016) propose hybrid RNN-CNN models.

The work most closely related to ours is the attention-based bidirectional LSTM (att-BLSTM) Zhou et al. (2016). That model uses bidirectional LSTMs and attention to abstract important parts, but it does not distinguish the roles of the different parts of a sentence, so its attention cannot be as targeted. Another closely related work is Zheng et al. (2016), which tries to capture nominal-related and relation-related patterns with CNNs but uses neither range restrictions nor an attention mechanism.

3 The Proposed Model

Figure 1 shows the architecture of the proposed model, which is described in the following subsections.

3.1 Word Embeddings

Our model first takes word embeddings to represent a sentence at the word level. Given a sentence consisting of $n$ words, it can be represented as a sequence of one-hot vectors $\{w_1, w_2, \dots, w_n\}$. We convert each one-hot vector $w_i$ into a word embedding $x_i$ by multiplying it with the word embedding matrix $W^{emb}$:

$x_i = W^{emb} w_i$    (1)

Then, the sentence can be represented as $X = \{x_1, x_2, \dots, x_n\}$.
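A minimal NumPy sketch of this embedding step; the random matrix stands in for the pretrained 100-dimensional GloVe vectors used in Section 4.1, and all names and shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim = 10000, 100                           # 100-d embeddings as in Section 4.1
W_emb = rng.normal(scale=0.1, size=(vocab_size, emb_dim))  # word embedding matrix

def embed(token_ids):
    # Row lookup is the efficient equivalent of multiplying each
    # one-hot vector w_i with the embedding matrix (Eq. 1).
    return W_emb[token_ids]                 # shape: (sentence_length, emb_dim)

X = embed(np.array([12, 7, 431, 5]))        # toy token ids for a 4-word sentence
```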

Figure 1: Multiple Range-Restricted Bidirectional GRUs with Attention

3.2 Range-Restricted Bidirectional GRUs

To capture information about the two nominals and the one relation, our model consists of three bidirectional GRU layers with range restrictions. A GRU is an RNN variant that, like the LSTM, alleviates the gradient-vanishing problem, but it has fewer weights than the LSTM. In a GRU, the $t$-th hidden state $h_t$ with reset gate $r_t$ and update gate $z_t$ is computed as:

$r_t = \sigma(W_r x_t + U_r h_{t-1})$    (2)
$z_t = \sigma(W_z x_t + U_z h_{t-1})$    (3)
$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}))$    (4)
$h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t$    (5)

where $\sigma$ is the logistic sigmoid function and $\odot$ denotes element-wise multiplication.
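A single GRU step written out in NumPy, following Eqs. (2)-(5) as reconstructed above (biases are omitted and weight shapes are illustrative; this is a sketch, not the authors' Theano code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, U_r, W_z, U_z, W_h, U_h):
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)               # reset gate, Eq. (2)
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)               # update gate, Eq. (3)
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev))   # candidate state, Eq. (4)
    return z_t * h_prev + (1.0 - z_t) * h_tilde           # new hidden state, Eq. (5)
```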

The range restrictions are implemented with masking techniques that restrict the input range of the three bidirectional GRUs. They are conducted under three separate criteria, but because the criteria for the two nominals are identical, we introduce two kinds of criteria. First, to capture the information of each nominal, only the words positioned in $[p - w, p + w]$ are given as input to the corresponding bidirectional GRU layer, where $p$ is the position of nominal e1 or e2 and $w$ is a hyperparameter controlling the window size. Second, for the relation GRU layer, the input range is set to $[p_{e1}, p_{e2}]$ or $[p_{e2}, p_{e1}]$ according to the relative order of the nominals in the sentence; that is, the range spans from the formerly-appearing nominal to the latterly-appearing nominal.

After the word-level sentence representation is fed into the six GRU layers (three GRU layers, each in two directions) under these restrictions, various hidden units are generated from the layers. For convenience in the next subsection, we denote the hidden units of each directional GRU layer by $\overrightarrow{H}^{e1}$, $\overleftarrow{H}^{e1}$, $\overrightarrow{H}^{e2}$, $\overleftarrow{H}^{e2}$, $\overrightarrow{H}^{rel}$ and $\overleftarrow{H}^{rel}$.
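The following sketch shows one way the input masks could be built; the inclusive windows and boundary handling are assumptions, since the paper does not spell them out:

```python
import numpy as np

def range_masks(n, p_e1, p_e2, w):
    # Binary masks over the n token positions for the three bidirectional GRU
    # layers: a window of size w around each nominal, and the span between the
    # two nominals for the relation layer.
    mask_e1 = np.zeros(n)
    mask_e1[max(0, p_e1 - w): p_e1 + w + 1] = 1.0
    mask_e2 = np.zeros(n)
    mask_e2[max(0, p_e2 - w): p_e2 + w + 1] = 1.0
    lo, hi = min(p_e1, p_e2), max(p_e1, p_e2)   # formerly- to latterly-appearing nominal
    mask_rel = np.zeros(n)
    mask_rel[lo: hi + 1] = 1.0
    return mask_e1, mask_e2, mask_rel
```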

3.3 Sentence-level Representation

Among the hidden units of the six range-restricted GRUs, the model selects important parts by using direct selection from hidden layers and the attention mechanism.

To extract e1 and e2 information, we propose to directly select the hidden units at each nominal position in the e1 and e2 bidirectional GRU layers and to sum them to construct $v_{e1}$ and $v_{e2}$, respectively:

$v_{e1} = \overrightarrow{h}^{e1}_{p_{e1}} + \overleftarrow{h}^{e1}_{p_{e1}}$    (6)
$v_{e2} = \overrightarrow{h}^{e2}_{p_{e2}} + \overleftarrow{h}^{e2}_{p_{e2}}$    (7)

where each $\overrightarrow{h}^{e}_{p}$ (resp. $\overleftarrow{h}^{e}_{p}$) denotes the hidden unit at the nominal position $p$ in the forward (resp. backward) GRU layer of that nominal.
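In code, Eqs. (6)-(7) amount to indexing the hidden-state matrices at the nominal position and summing the two directions (shapes are illustrative):

```python
def nominal_vector(H_fwd, H_bwd, p):
    # H_fwd and H_bwd are arrays of shape (sentence_length, hidden_dim); p is
    # the token position of the nominal. Eqs. (6)-(7): pick the hidden units
    # at p from both directions of that nominal's GRU layer and sum them.
    return H_fwd[p] + H_bwd[p]

# v_e1 = nominal_vector(H_e1_fwd, H_e1_bwd, p_e1)
# v_e2 = nominal_vector(H_e2_fwd, H_e2_bwd, p_e2)
```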

To abstract relation information, we adopt the attention mechanism that has been widely used in many areas Bahdanau et al. (2014); Hermann et al. (2015); Chorowski et al. (2015); Xu et al. (2015a). We use the attention mechanism of Zhou et al. (2016), but we apply it to each directional GRU layer independently to capture more informative parts with greater flexibility. The forward-directional relation-abstracted vector $\overrightarrow{v}_{rel}$ is computed as follows ($\overleftarrow{v}_{rel}$ is computed in the same way):

$M = \tanh(\overrightarrow{H}^{rel})$    (8)
$\alpha = \mathrm{softmax}(u^\top M)$    (9)
$\overrightarrow{v}_{rel} = \overrightarrow{H}^{rel} \alpha^\top$    (10)

where $u$ is a trained attention vector for the forward layer and the columns of $\overrightarrow{H}^{rel}$ are the hidden units of the forward relation GRU layer.
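A NumPy sketch of this per-direction attention pooling, following the Zhou et al. (2016) formulation as reconstructed in Eqs. (8)-(10); the hidden-state matrix is stored with one column per time step:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, u):
    # H: (hidden_dim, seq_len) hidden units of one directional relation GRU;
    # u: (hidden_dim,) trained attention vector for that direction.
    M = np.tanh(H)              # Eq. (8)
    alpha = softmax(u @ M)      # Eq. (9): one attention weight per time step
    return H @ alpha            # Eq. (10): attention-weighted sum of hidden units
```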

Then, we sum $\overrightarrow{v}_{rel}$ and $\overleftarrow{v}_{rel}$ to obtain the relation-abstracted vector $v_{rel}$:

$v_{rel} = \overrightarrow{v}_{rel} + \overleftarrow{v}_{rel}$    (11)

Lastly, the final representation $v$ is constructed by concatenating the three vectors:

$v = v_{e1} \oplus v_{rel} \oplus v_{e2}$    (12)

where $\oplus$ is the concatenation operator.
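Putting the pieces together, the final representation can be assembled as below (toy vectors; the concatenation order is an assumption, as the paper only states that the three vectors are concatenated):

```python
import numpy as np

hidden_dim = 100
rng = np.random.default_rng(1)
# Toy stand-ins for the vectors produced by the preceding sketches.
v_e1 = rng.normal(size=hidden_dim)
v_e2 = rng.normal(size=hidden_dim)
v_rel_fwd = rng.normal(size=hidden_dim)
v_rel_bwd = rng.normal(size=hidden_dim)

v_rel = v_rel_fwd + v_rel_bwd               # Eq. (11): sum of the two directions
v = np.concatenate([v_e1, v_rel, v_e2])     # Eq. (12): 300-d final representation
```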

Model                                 Additional Features (Except Word Embeddings)       F1
SDP-LSTM, Xu et al. (2015b)           POS, WordNet, dependency parse, grammar relation   83.7
DepNN, Liu et al. (2015)              NER, dependency parse                              83.6
SPTree, Miwa and Bansal (2016)        POS, dependency parse                              84.4
MixCNN+CNN, Zheng et al. (2016)       None                                               84.8
att-BLSTM, Zhou et al. (2016)         None                                               84.0
Our Model (att-BGRU)                  None                                               82.9
Our Model (Relation only)             None                                               83.0
Our Model (Nominals only)             None                                               81.4
Our Model (Nominals and Relation)     None                                               84.3

Table 1: Comparison with the results of the state-of-the-art models

3.4 Classification

Our model predicts the actual relation using scores of how similar $v$ is to each class embedding dos Santos et al. (2015). Concretely, we propose a feed-forward layer in which the weight matrix $W^{classes}$ and the bias vector $b$ can be regarded as a set of class embeddings. In other words, the inner product of each row vector of $W^{classes}$ with $v$ represents the similarity between them in vector space, so the class score vector $s$ is computed as:

$s = W^{classes} v + b$    (13)

Then, the model chooses the index with the maximum value in $s$ as the most probable class label, except when every value in $s$ is negative. In that exceptional case, the predicted class is Other dos Santos et al. (2015).
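A sketch of this scoring and decision rule (class names and shapes are illustrative; Other has no class embedding, following dos Santos et al. (2015)):

```python
import numpy as np

def predict(v, W_classes, b, labels):
    # W_classes: (num_classes, len(v)), one row (class embedding) per
    # non-Other class; b: (num_classes,); labels: the non-Other class names.
    s = W_classes @ v + b          # class score vector, Eq. (13)
    if np.all(s < 0):              # every score negative -> fall back to Other
        return "Other"
    return labels[int(np.argmax(s))]
```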

3.5 Training Objectives

We adopt the ranking loss function of dos Santos et al. (2015) to train the networks. Let $s_{y^+}$ be the score of the correct class $y^+$, and $s_{c^-}$ the competitive score, i.e., the best score among all classes except $y^+$. Then, the loss is computed as:

$L = \log(1 + \exp(\gamma(m^+ - s_{y^+}))) + \log(1 + \exp(\gamma(m^- + s_{c^-})))$    (14)

where $m^+$ and $m^-$ are margins and $\gamma$ is a factor that magnifies the gap between the score and the margin.
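A direct NumPy transcription of Eq. (14), using the margin and scaling values reported in Section 4.1 (any special handling of the Other class during training is not shown):

```python
import numpy as np

def ranking_loss(s, y_pos, m_pos=2.5, m_neg=0.5, gamma=2.0):
    # s: class score vector (Eq. 13); y_pos: index of the correct class.
    s_correct = s[y_pos]                        # score of the correct class
    s_compet = np.max(np.delete(s, y_pos))      # best score among the other classes
    return (np.log(1.0 + np.exp(gamma * (m_pos - s_correct)))
            + np.log(1.0 + np.exp(gamma * (m_neg + s_compet))))
```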

4 Experiments

For the experiments, we implement our model in Python using Theano Theano Development Team (2016), with the settings described below.

4.1 Datasets and Settings

We conduct the experiments on the SemEval-2010 Task 8 dataset Hendrickx et al. (2009), which contains 8,000 training sentences and 2,717 test sentences. Each sentence contains two nominals (e1, e2) and a relation between them. Ten relation types are considered: nine specific types (Cause-Effect, Component-Whole, Content-Container, Entity-Destination, Entity-Origin, Instrument-Agency, Member-Collection, Message-Topic and Product-Producer) and the Other class. The specific types have directionality, so a total of 19 relation classes exist.

We use 10-fold cross-validation to tune the hyperparameters. We adopt the 100-dimensional word vectors trained by Pennington et al. (2014) as initial word embeddings and select a hidden layer dimension of 100, a learning rate of 1.0 and a batch size of 10. AdaDelta Zeiler (2012) is used as the optimizer. We also apply dropout Hinton et al. (2012) to the word embeddings, the GRU hidden units and the feed-forward layer with dropout rates of 0.3, 0.3 and 0.7, respectively, and use a window size $w$ of 3. We adopt the position-indicator scheme that regards <e1>, </e1>, <e2> and </e2> as single words Zhang and Wang (2015). We set $m^+$, $m^-$ and $\gamma$ to 2.5, 0.5 and 2.0, respectively dos Santos et al. (2015), and adopt L2 regularization. The official scorer is used to evaluate our model with the macro-averaged F1 (excluding Other).
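For reference, the reported settings collected as a single config sketch (the dictionary and its keys are illustrative, not the authors' code; the window-size entry is our reading of the text):

```python
config = {
    "word_emb_dim": 100,       # GloVe vectors, Pennington et al. (2014)
    "hidden_dim": 100,
    "learning_rate": 1.0,      # with AdaDelta
    "batch_size": 10,
    "dropout": {"embeddings": 0.3, "gru_hidden": 0.3, "feedforward": 0.7},
    "window_size_w": 3,        # assumed: "the ... of 3" read as the window hyperparameter w
    "m_pos": 2.5,              # ranking-loss margins and scaling factor
    "m_neg": 0.5,
    "gamma": 2.0,
}
```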

4.2 Results

In Table 1, our results are compared with those of the other state-of-the-art models. Our model with only pretrained word embeddings achieves an F1 score of 84.3%, which is comparable to the state-of-the-art models.

Furthermore, we investigated the effects of abstracting the relation, the nominals, and both. Attention-based bidirectional GRUs with no restriction (att-BGRU) were also tested as a reimplementation of the att-BLSTM. Our finding is that the restricted version of the att-BGRU (the relation-only model) is not significantly better, but by abstracting the nominals as well, the model achieves a higher F1 score. This indicates that even though the ranges partly overlap, the layers capture distinct features and improve performance.

5 Conclusion

This paper proposed a novel model based on multiple range-restricted RNNs with attention. The proposed model achieved performance comparable to the state-of-the-art models without any additional linguistic information.

References