Improving Aspect Term Extraction with Bidirectional Dependency Tree Representation

05/21/2018 ∙ Huaishao Luo et al. ∙ Southwest Jiaotong University and University of Illinois at Chicago

Aspect term extraction is one of the important subtasks in aspect-based sentiment analysis. Previous studies have shown that dependency tree structure representation is promising for this task. In this paper, we propose a novel bidirectional dependency tree network to extract dependency structure features from the given sentences. The key idea is to explicitly incorporate both representations gained separately from the bottom-up and top-down propagation on the given dependency syntactic tree. An end-to-end framework is proposed to integrate the embedded representations and BiLSTM plus CRF to learn both tree-structured and sequential features to solve the aspect term extraction problem. Experimental results demonstrate that the proposed model outperforms state-of-the-art baseline models on four benchmark SemEval datasets.

1 Introduction

Aspect term extraction (ATE) is the task of extracting the attributes (or aspects) of an entity upon which people have expressed opinions. It is one of the most important subtasks in aspect-based sentiment analysis Liu (2012). For example, in the laptop review sentence “Speaking of the browser, it too has problems.”, we aim to extract the term browser.

Existing methods for ATE can be divided into unsupervised and supervised approaches. The unsupervised approach is mainly based on topic modeling Lin and He (2009); Brody and Elhadad (2010); Moghaddam and Ester (2011); Chen et al. (2013); Chen and Liu (2014); Chen et al. (2014), syntactic rules Wang and Wang (2008); Zhang et al. (2010); Wu et al. (2009); Qiu et al. (2011); Liu et al. (2013), and lifelong learning Chen et al. (2014); Wang et al. (2016a); Liu et al. (2016); Shu et al. (2017). The supervised approach is mainly based on Conditional Random Fields (CRF) Lafferty et al. (2001) Jakob and Gurevych (2010); Choi and Cardie (2010); Li et al. (2010); Mitchell et al. (2013); Giannakopoulos et al. (2017).

This paper focuses on CRF-based models, which regard ATE as a sequence labeling task. There are three main types of features that have been used in previous CRF-based models for ATE. The first type is traditional natural language features, such as syntactic structures and lexical features Toh and Su (2016); Hamdan et al. (2015); Toh and Su (2015); Balage Filho and Pardo (2014); Jakob and Gurevych (2010); Shu et al. (2017). The second type is cross-domain knowledge based features, which are useful because, although each entity/product is different, there are plenty of shared aspects across domains Jakob and Gurevych (2010); Mitchell et al. (2013); Shu et al. (2017). The final type is deep learning features learned by deep learning models, which have proven very useful for the ATE task in recent years Liu et al. (2015a); Wang et al. (2016b); Yin et al. (2016); Ye et al. (2017); Li and Lam (2017); Wang et al. (2017b, a); Giannakopoulos et al. (2017). The deep learning features generally fall into sequential representation and tree-structured representation features. However, the tree-structured representations in previous work only considered single-direction propagation trained on the parse trees with shared weights. Moreover, to the best of our knowledge, there is no research on how to effectively fuse tree-structured and sequential information to solve the ATE task. Our work in this paper mainly addresses these two issues.

We make progress in two main ways related to tree-structured and sequential representations. First, we enhance the tree-structured representation using a bidirectional gate control mechanism which originates from the bidirectional LSTM (BiLSTM) Hochreiter and Schmidhuber (1997); Gers et al. (1999); second, we fuse tree-structured and sequential information to perform the ATE task. These two perspectives are combined into a novel and unified model, namely, bidirectional dependency tree conditional random fields (BiDTreeCRF). Specifically, BiDTreeCRF consists of three main components. The first component is a bidirectional dependency tree network (BiDTree), which is an extension of the recursive neural network in Socher et al. (2011). Its goal is to extract the tree-structured representation from the dependency tree of the given sentence. The second component is the BiLSTM, whose input is the output of BiDTree. The tree-structured and sequential information is fused in this layer. The last component is the CRF, which is used to generate labels. This new model results in major improvements for ATE over the existing baseline models.

The proposed BiDTree is built on dependency trees such as the one in Figure 1.

Figure 1: An example of dependency relations.

Compared with many other methods based on constituency trees Irsoy and Cardie (2013); Tai et al. (2015); Teng and Zhang (2016); Chen et al. (2017a), BiDTree focuses more directly on the dependency relationships between words, because all nodes in a dependency tree are input words themselves, whereas in a constituency tree they are not.

This paper makes three main contributions. (1) BiDTree enhances the tree-structured representation by constructing a bidirectional propagation mechanism on the dependency tree. (2) The tree information and the sequential information are extracted by BiDTreeCRF simultaneously, and they are both fed to the CRF model for aspect term extraction. The integrated model can be effectively trained in an end-to-end fashion. (3) Extensive experiments are conducted on four SemEval datasets to evaluate the effectiveness of BiDTree and to verify the superiority of BiDTreeCRF over state-of-the-art baseline methods.

2 Related Work

There are several main approaches to solving the ATE problem. The first approach is based on frequent pattern mining to mine aspect terms that are frequently occurring nouns and noun phrases Zhu et al. (2009); Popescu and Etzioni (2005); Hu and Liu (2004). The second approach is the rule-based approach, which uses either hand-crafted or automatically generated rules about some syntactic relationships Liu et al. (2015b); Poria et al. (2014); Qiu et al. (2011); Zhang et al. (2010); Wu et al. (2009); Wang and Wang (2008); Zhuang et al. (2006). The third approach is topic modeling, which employs probabilistic graphical models based on Latent Dirichlet Allocation (LDA) Blei et al. (2003) and its variants. The fourth approach is supervised learning, which is mainly based on sequential labeling methods such as hidden Markov models Jin et al. (2009) or CRF.

Recent work has shown that neural networks can achieve competitive performance on the ATE task. Irsoy and Cardie (2013) applied a deep Elman-type Recurrent Neural Network (RNN) to extract opinion expressions and showed that the deep RNN outperforms CRF, semi-CRF and a shallow RNN. Liu et al. (2015a) further experimented with more advanced RNN variants with fine-tuned embeddings. Moreover, they concluded that employing other linguistic features (e.g., POS) can yield better results. Different from these works, Poria et al. (2016) used a 7-layer deep convolutional neural network (CNN) to tag each word in opinionated sentences with an aspect or non-aspect label. Some linguistic patterns were also used to improve labeling accuracy. Attention mechanisms and memory interaction are also effective methods for ATE He et al. (2017); Chen et al. (2017b); Li and Lam (2017); Wang et al. (2017b). However, RNNs and CNNs built on the sequence structure of a sentence cannot effectively and directly capture tree-based syntactic information, which better reflects the syntactic properties of natural language and hence is very important to the ATE task.

Several tree-based neural networks have been proposed.

Figure 2: An illustration of the BiDTree and BiDTreeCRF architecture. BiDTreeCRF has three modules: BiDTree, BiLSTM and CRF.

For example, Yin et al. (2016) designed a word embedding method that considers not only the linear context but also the dependency context information. The resulting embeddings are used in a CRF to extract aspect terms. This model shows that syntactic information among words yields better performance than other representative models on the ATE task. However, it cannot be trained in an end-to-end fashion. Wang et al. (2016b) further integrated a dependency tree and CRF into a unified framework for explicit aspect and opinion term co-extraction. However, single-direction propagation using full connections is not enough to represent complete tree-structured syntactic information. Ye et al. (2017) proposed a tree-based convolution to capture syntactic features of sentences, which makes it hard to keep sequential information.

In summary, our framework has the following characteristics: (1) BiDTreeCRF is an end-to-end deep learning model and does not need any hand-crafted features. (2) Instead of full connections on each layer of the dependency tree, we use a bidirectional propagation mechanism to extract information, which is shown to be effective in our experiments. (3) A fusion of tree-structured and sequential information, rather than a single representation, is introduced to address the ATE task. At the same time, this paper is also related to several other models which are constructed on constituency trees and are used to accomplish other NLP tasks, such as translation Chen et al. (2017a), relation extraction Miwa and Bansal (2016), relation classification Liu et al. (2015c) and syntactic language modeling Tai et al. (2015); Teng and Zhang (2016); Zhang et al. (2016). However, we have different models and also different applications.

3 Model Description

The architecture of the proposed framework is shown in Figure 2. Its sample input is the dependency relations presented in Figure 1. As described in Section 1, BiDTreeCRF consists of three modules (or components): BiDTree, BiLSTM and CRF. These modules are described in detail in Sections 3.2, 3.3 and 3.4.

3.1 Problem Statement

We are given a review sentence from a particular domain, denoted by $s = \{w_1, w_2, \ldots, w_n\}$, where $n$ is the sentence length. For any word $w_i$, the task of ATE is to find a label $y_i$ corresponding to it, where $y_i \in \{$B-AP, I-AP, O$\}$. “B-AP”, “I-AP”, and “O” stand for the beginning of an aspect term, the inside of an aspect term, and other words, respectively. For example, “The/O picture/B-AP quality/I-AP is/O very/O great/O ./O” is a sentence with labels (or tags), where the aspect term is picture quality. This BIO encoding scheme is widely used in NLP tasks, and such tasks are often solved using CRF-based methods Wang et al. (2016b); Irsoy and Cardie (2013, 2014); Liu et al. (2015a).
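
As an illustration of this labeling scheme, the following minimal sketch converts a tokenized sentence and its annotated aspect terms into BIO tags; the helper name and the exact-match strategy are ours, for illustration only.

```python
# A minimal sketch of the BIO labeling scheme described above; not the authors' code.
def bio_tags(tokens, aspect_terms):
    """Assign B-AP/I-AP/O tags to tokens given a list of aspect-term token lists."""
    tags = ["O"] * len(tokens)
    for term in aspect_terms:                         # e.g. [["picture", "quality"]]
        for i in range(len(tokens) - len(term) + 1):
            if tokens[i:i + len(term)] == term:       # match the aspect-term span
                tags[i] = "B-AP"                      # first token of the aspect term
                for j in range(i + 1, i + len(term)):
                    tags[j] = "I-AP"                  # remaining tokens of the term
    return tags

tokens = ["The", "picture", "quality", "is", "very", "great", "."]
print(list(zip(tokens, bio_tags(tokens, [["picture", "quality"]]))))
# [('The', 'O'), ('picture', 'B-AP'), ('quality', 'I-AP'), ('is', 'O'), ...]
```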

3.2 Bidirectional Dependency Tree Network

Since BiDTree is built on the dependency tree, a sentence should first be converted to a dependency-based parse tree. As the bottom part of Figure 2 shows, each node in the dependency tree represents a word and connects to at least one other node/word. Each node has one and only one head word, e.g., Speaking is the head of browser, has is the head of Speaking, and the head word of has is ROOT (hidden in the figure for simplicity). The edge between each node and its head word is a syntactic dependency relation, e.g., nmod between browser and Speaking is used for nominal modifiers of nouns or clausal predicates. Syntactic relations in Figure 2 are shown as dotted black lines.

After generating a dependency tree, each word $w_i$ is initialized with a feature vector $x_i \in \mathbb{R}^{d}$, which corresponds to a column of a pre-trained word embedding matrix $E \in \mathbb{R}^{d \times |V|}$, where $d$ is the dimension of the word vector and $|V|$ is the size of the vocabulary. As described above, each relation of a dependency tree starts from a head word and points to its dependent words. This can be formulated as follows: the parent node $w_i$ and its children nodes $\{w_{i1}, \ldots, w_{iC_i}\}$ are connected by the relations $\{r_{i1}, \ldots, r_{iC_i}\}$, where $C_i$ is the number of children nodes belonging to $w_i$, and $r_{ik} \in \mathcal{R}$, where $\mathcal{R}$ is the set of syntactic relations such as nmod, case, det, nsubj, and so on. The syntactic relation information not only serves as features encoded in the network, but also guides the selection of weights during training.

BiDTree works in two directions: a bottom-up LSTM and a top-down LSTM. The bottom-up LSTM is shown with solid black arrows and the top-down LSTM with dotted black arrows in the lower portion of Figure 2. It should be noted that they differ not only in direction but also in the parent and children nodes. Specifically, each node in the top-down LSTM has only one child node, while a node in the bottom-up LSTM generally has more than one. As shown in formula (1), we concatenate the output $h_i^{\uparrow}$ of the bottom-up LSTM and the output $h_i^{\downarrow}$ of the top-down LSTM into $h_i$ as the output of BiDTree for word $w_i$,

$h_i = [h_i^{\uparrow}; h_i^{\downarrow}]$ (1)

This allows BiDTree to capture the global syntactic context.

Let $C(i) = \{w_{i1}, \ldots, w_{iC_i}\}$, which is the set of children nodes of node $w_i$ described above. With this notation, the bottom-up LSTM of BiDTree first encodes the parent word and the related syntactic relations:

(2)
(3)
(4)
(5)

Then, the bottom-up LSTM transition equations of BiDTree are as follows:

(6)
(7)
(8)
(9)
(10)
(11)

where the input gate, output gate, and forget gates are extended from the standard LSTM Hochreiter and Schmidhuber (1997); Gers et al. (1999), together with the corresponding memory cell states and hidden states; $\sigma$ denotes the logistic function, $\odot$ denotes element-wise multiplication, the remaining matrices are weight matrices, the $b$ terms are bias vectors, and $m(\cdot)$ is a mapping function that maps a syntactic relation type to its corresponding parameter matrix. Specifically, each syntactic relation is encoded into the network like a word vector but is initialized randomly. Its size is the same as the word vector dimension $d$ in our experiments.

The top-down LSTM has the same transition equations as the bottom-up LSTM, except for the direction and the number of children nodes. In particular, the syntactic relation type of the top-down LSTM is the opposite of that of the bottom-up LSTM, and we distinguish them by adding a prefix “I-”, e.g., the reverse of nmod is denoted I-nmod. This leads to different relation embeddings and parameter matrices for the two directions. In this paper, the weight matrices and bias vectors of BiDTree are sized so that each direction produces a hidden state of the same dimension; the output $h_i$ in Eq. (1) is thus a vector of twice that dimension.

The formulation of BiDTree is similar to the dependency layer in Miwa and Bansal (2016); the main difference is the design of the parameters of the forget gate. Their work defines a parameterization of the $k$-th child's forget gate with parameter matrices $U^{(f)}_{m(k), m(l)}$ (the same symbols are used here for easy comparison). The whole equation corresponding to Eq. (8) is as follows:

$f_{tk} = \sigma\big(W^{(f)} x_t + \sum_{l \in C(t)} U^{(f)}_{m(k), m(l)} h_{tl} + b^{(f)}\big)$ (12)

As Tai et al. (2015) mentioned, for a large number of children nodes, using additional parameters for flexible control of information propagation from child to parent is impractical. Considering that the proposed framework has a variable number of typed children, we use Eq. (8) instead of Eq. (12) to reduce the computational cost. Another difference between their formulas and ours is that we encode the syntactic relation into our network, namely, the second terms of Eqs. (2-5), which is shown to be effective in this paper.
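
The exact gate computations are those of Eqs. (2)-(11) above. Purely as a rough illustration of the kind of computation involved, the following is a minimal NumPy sketch of one bottom-up step of a relation-aware, child-sum tree LSTM with a single shared forget-gate parameterization (in the spirit of Eq. (8) rather than Eq. (12)). The symbol layout, the way the relation embeddings enter each gate, and all names are our own simplifying assumptions, not the authors' exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bottom_up_step(x, children, params, d_h):
    """One bottom-up step for a node with word vector `x` (shape (d,)) and
    `children` = list of (relation, h_child, c_child); returns (h, c), each (d_h,).
    Relation-typed matrices/embeddings are looked up by relation name, playing
    the role of the mapping m(.) in the text."""
    W, U, V, b, rel = params["W"], params["U"], params["V"], params["b"], params["rel"]
    h_sum = sum(h for _, h, _ in children) if children else np.zeros(d_h)
    r_sum = sum(V[r] @ rel[r] for r, _, _ in children) if children else np.zeros(d_h)

    i = sigmoid(W["i"] @ x + U["i"] @ h_sum + r_sum + b["i"])   # input gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum + r_sum + b["o"])   # output gate
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum + r_sum + b["u"])   # candidate cell

    c = i * u
    for r, h_k, c_k in children:
        # one shared forget-gate parameterization for all children (cf. Eq. (8)),
        # instead of one matrix per child-type pair as in Eq. (12)
        f_k = sigmoid(W["f"] @ x + U["f"] @ h_k + V[r] @ rel[r] + b["f"])
        c = c + f_k * c_k
    return o * np.tanh(c), c

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, d_h, gates = 4, 5, ["i", "o", "u", "f"]
    params = {
        "W": {g: rng.standard_normal((d_h, d)) for g in gates},
        "U": {g: rng.standard_normal((d_h, d_h)) for g in gates},
        "V": {"det": rng.standard_normal((d_h, d))},
        "b": {g: rng.standard_normal(d_h) for g in gates},
        "rel": {"det": rng.standard_normal(d)},
    }
    h_leaf, c_leaf = bottom_up_step(rng.standard_normal(d), [], params, d_h)  # leaf node
    h, c = bottom_up_step(rng.standard_normal(d), [("det", h_leaf, c_leaf)], params, d_h)
    print(h.shape)  # (5,)
```

The top-down direction would reuse the same kind of cell, feeding each word its single head as the only child and using the I-prefixed relation types described above.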

3.3 Integration with Bidirectional LSTM

As the second module, BiLSTM Graves and Schmidhuber (2005) keeps the sequential context of the dependency information between words. The LSTM unit at the $t$-th word receives the output of BiDTree $h_t$, the previous hidden state $s_{t-1}$, and the previous memory cell $c_{t-1}$ to calculate the new hidden state $s_t$ and the new memory cell $c_t$ using the following equations:

$i_t = \sigma(W^{(i)} h_t + U^{(i)} s_{t-1} + b^{(i)})$ (13)
$f_t = \sigma(W^{(f)} h_t + U^{(f)} s_{t-1} + b^{(f)})$ (14)
$o_t = \sigma(W^{(o)} h_t + U^{(o)} s_{t-1} + b^{(o)})$ (15)
$u_t = \tanh(W^{(u)} h_t + U^{(u)} s_{t-1} + b^{(u)})$ (16)
$c_t = i_t \odot u_t + f_t \odot c_{t-1}$ (17)
$s_t = o_t \odot \tanh(c_t)$ (18)

where $i_t$, $f_t$, $o_t$ are gates having the same meanings as their counterparts in BiDTree, the $W$ and $U$ terms are weight matrices of compatible sizes, and the $b$ terms are bias vectors. We also concatenate the hidden states generated by the LSTM cells in both directions belonging to the same word as the output vector, which is expressed as $z_t = [\overrightarrow{s}_t; \overleftarrow{s}_t]$. In our implementation, each $z_t$ is also reduced in dimension by a fully connected layer so as to be passed to the subsequent layers.
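
The authors implement the model in TensorFlow; purely to make the shape flow of this layer concrete (BiDTree outputs in, reduced sequential features out), here is a minimal PyTorch-style sketch. The class name, the 600-dimensional tree output, the 300-dimensional hidden size and the 300-dimensional projection are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """BiLSTM over BiDTree outputs, followed by a fully connected reduction layer."""
    def __init__(self, tree_dim=600, hidden_dim=300, out_dim=300):
        super().__init__()
        self.bilstm = nn.LSTM(tree_dim, hidden_dim, batch_first=True,
                              bidirectional=True)          # sequential context
        self.project = nn.Linear(2 * hidden_dim, out_dim)  # dimension reduction

    def forward(self, tree_out):                 # (batch, seq_len, tree_dim)
        states, _ = self.bilstm(tree_out)        # (batch, seq_len, 2*hidden_dim)
        return self.project(states)              # features z_t fed to the CRF layer

encoder = SequenceEncoder()
z = encoder(torch.randn(2, 7, 600))              # two 7-word sentences
print(z.shape)                                   # torch.Size([2, 7, 300])
```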

3.4 Integration with CRF

The learned features are actually hybrid features containing both tree-structured and sequential information. All these features are fed into the last CRF layer to predict the label of each word. A linear-chain CRF is adopted here. Formally, let $z = \{z_1, z_2, \ldots, z_n\}$ represent the output features extracted by the BiDTree and BiLSTM layers. The goal of the CRF is to decode the best chain of labels $y = \{y_1, y_2, \ldots, y_n\}$, where $y_i$ has been described in Section 3.1. As a discriminative graphical model, the CRF benefits from considering the correlations between labels/tags in the neighborhood, which is widely used in sequence labeling or tagging tasks Huang et al. (2015); Ma and Hovy (2016). Let $\mathcal{Y}$ denote the set of all possible labels and $\mathcal{Y}(z)$ the set of all possible label sequences for $z$. The probability $p(y \mid z; W, b)$ of the CRF is computed as follows:

$p(y \mid z; W, b) = \dfrac{\prod_{t=1}^{n} \psi_t(y_{t-1}, y_t, z_t)}{\sum_{y' \in \mathcal{Y}(z)} \prod_{t=1}^{n} \psi_t(y'_{t-1}, y'_t, z_t)}$ (19)

where $\psi_t(y_{t-1}, y_t, z_t) = \exp(W_{y_{t-1}, y_t}^{\top} z_t + b_{y_{t-1}, y_t})$ is the potential of the pair $(y_{t-1}, y_t)$. $W_{y_{t-1}, y_t}$ and $b_{y_{t-1}, y_t}$ are the weight and bias, respectively.

Conventionally, the training process uses maximum conditional likelihood estimation. For a training set $\{(z^{(j)}, y^{(j)})\}$, the log-likelihood is computed as follows:

$L(W, b) = \sum_{j} \log p(y^{(j)} \mid z^{(j)}; W, b)$ (20)

The final labeling result is the label sequence with the highest conditional probability:

$y^{*} = \arg\max_{y \in \mathcal{Y}(z)} p(y \mid z; W, b)$ (21)

This maximization is usually solved efficiently by the Viterbi algorithm.
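
To make the decoding step concrete, below is a minimal NumPy sketch of Viterbi decoding for a linear-chain CRF. For simplicity the potential is split into an emission score per label at each position and a transition score per label pair; the function name and the random scores are illustrative only, not the paper's trained parameters.

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (n, K) scores for each word/label; transitions: (K, K) scores
    for moving from label y_{t-1} to label y_t.  Returns the best label sequence."""
    n, K = emissions.shape
    score = emissions[0].copy()                  # best score ending in each label at t=0
    back = np.zeros((n, K), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions + emissions[t][None, :]   # (prev, curr)
        back[t] = cand.argmax(axis=0)            # best previous label for each current label
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n - 1, 0, -1):                # follow the back-pointers
        best.append(int(back[t, best[-1]]))
    return best[::-1]

labels = ["B-AP", "I-AP", "O"]
emis = np.random.randn(7, 3)                     # a 7-word sentence, 3 labels
trans = np.random.randn(3, 3)
print([labels[i] for i in viterbi(emis, trans)])
```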

3.5 Model Training

We update all parameters of BiDTreeCRF from top to bottom by propagating the errors through the CRF to the hidden layers of BiLSTM and then to BiDTree via backpropagation through time (BPTT) Goller and Kuchler (1996). Finally, we use Adam Kingma and Ba (2014) for optimization with gradient clipping and L2-regularization. The mini-batch size is 20 and the initial learning rate is 0.001. We also employ dropout Srivastava et al. (2014) on the outputs of the BiDTree and BiLSTM layers with a dropout rate of 0.5. All weight matrices and bias terms are trainable parameters. Early stopping Caruana et al. (2000) is used based on performance on the validation sets; its patience is 5 epochs in our experiments. At the same time, the initial embeddings are fine-tuned during the training process, i.e., the word embeddings are modified by back-propagating gradients. We implement BiDTreeCRF using the TensorFlow library Abadi et al. (2016), and all computations are done on an NVIDIA Tesla K80 GPU.
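
As a concrete picture of these optimization settings (Adam with learning rate 0.001, gradient clipping, L2 regularization, mini-batches of 20), here is a minimal sketch of one training step. PyTorch is used for brevity rather than the authors' TensorFlow code; the clipping threshold and L2 coefficient are assumed values, and a simple linear model with cross-entropy loss stands in for BiDTreeCRF and its CRF negative log-likelihood.

```python
import torch

model = torch.nn.Linear(300, 3)                      # stands in for BiDTreeCRF
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)  # L2 via weight decay

def train_step(batch_x, batch_y):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch_x), batch_y)  # placeholder loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)   # gradient clipping
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(20, 300), torch.randint(0, 3, (20,))))    # mini-batch of 20
```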

4 Experiments

In this section, we conduct experiments to evaluate the effectiveness of the proposed framework.

4.1 Datasets and Experiment Setup

We conduct experiments using four benchmark SemEval datasets. The detailed statistics of the datasets are summarized in Table 1.

Datasets Train Val Test Total
#S L-14 2,945 100 800 3,845
#S R-14 2,941 100 800 3,841
#S R-15 1,315 48 685 2,048
#S R-16 2,000 48 676 2,724
#A L-14 2,304 54 654 3,012
#A R-14 3,595 98 1,134 4,827
#A R-15 1,654 57 845 2,556
#A R-16 2,507 66 859 3,432
Table 1: Datasets from SemEval; #S denotes the number of sentences and #A denotes the number of aspect terms; L-14, R-14, R-15 and R-16 are short for Laptops 2014, Restaurants 2014, Restaurants 2015 and Restaurants 2016, respectively.

L-14 and R-14 are from SemEval 2014 (http://alt.qcri.org/semeval2014/task4/) Pontiki et al. (2014), R-15 is from SemEval 2015 (http://alt.qcri.org/semeval2015/task12/) Pontiki et al. (2015), and R-16 is from SemEval 2016 (http://alt.qcri.org/semeval2016/task5/) Pontiki et al. (2016). L-14 contains laptop reviews, and R-14, R-15 and R-16 all contain restaurant reviews. These raw datasets have been officially divided into three parts: a training set, a validation set, and a test set. These divisions are kept for a fair comparison. All these datasets contain annotated aspect terms, which are used to generate sequence labels in the experiments. We use the Stanford Parser (https://nlp.stanford.edu/software/lex-parser.html) to generate dependency trees. The evaluation metric is the F1 score, the same as that used by the baseline methods.

To initialize the word vectors, we train word embeddings with the continuous bag-of-words model (CBOW) Mikolov et al. (2013) on Amazon reviews (http://jmcauley.ucsd.edu/data/amazon/) and Yelp reviews (https://www.yelp.com/academic_dataset), which are in-domain corpora for the laptop and restaurant domains, respectively. The Amazon review dataset contains 142.8M reviews, and the Yelp review dataset contains 2.2M restaurant reviews. All these embeddings are trained with gensim (https://radimrehurek.com/gensim/models/word2vec.html), which contains an implementation of CBOW. The parameter min_count is 10 and iter is 200 in our experiments. We set the dimension of word vectors to 300 based on the conclusion drawn in Wang et al. (2016b). Our experimental results on dimension settings for the proposed model also show that 300 is a suitable choice, providing a good trade-off between effectiveness and efficiency.
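
The embedding pre-training described above can be reproduced with a few lines of gensim. The snippet below is a sketch using the gensim 3.x parameter names (size, iter); the toy corpus stands in for the tokenized Amazon or Yelp reviews, and the output file name is an assumption.

```python
from gensim.models import Word2Vec

# Placeholder corpus: in practice, the tokenized Amazon or Yelp review sentences.
corpus = [["speaking", "of", "the", "browser", ",", "it", "too", "has", "problems", "."]] * 10

model = Word2Vec(sentences=corpus,
                 sg=0,            # CBOW (sg=1 would be skip-gram)
                 size=300,        # word vector dimension used in the paper
                 min_count=10,    # as stated: min_count = 10
                 iter=200)        # as stated: iter = 200
model.wv.save_word2vec_format("in_domain_embeddings.txt")   # assumed file name
```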

4.2 Baseline Methods and Results

To validate the performance of our proposed model on aspect term extraction, we compare it against a number of baselines:

  • IHS_RD, DLIREC(U), EliXa(U), and NLANGP(U): The top system for L-14 in SemEval Challenge 2014 Chernyshevich (2014), the top system for R-14 in SemEval Challenge 2014 Toh and Wang (2014), the top system for R-15 in SemEval Challenge 2015 Vicente et al. (2015), and the top system for R-16 in SemEval Challenge 2016 Toh and Su (2016), respectively. U means using additional resources without any constraint, such as lexicons or additional training data.

  • WDEmb: It uses word embedding, linear context embedding and dependency path embedding to enhance CRF Yin et al. (2016).

  • RNCRF-O, RNCRF+F: They both extract tree-structured features using a recursive neural network as the CRF input. RNCRF-O is a model trained without opinion labels. RNCRF+F is trained using not only opinion labels but also some hand-crafted features Wang et al. (2016b).

  • DTBCSNN+F: A convolution stacked neural network built on dependency trees to capture syntactic features. Its results are produced by the inference layer Ye et al. (2017).

  • MIN: MIN is an LSTM-based deep multi-task learning framework, which jointly handles the extraction tasks of aspects and opinions via memory interactions Li and Lam (2017).

  • CMLA, MTCA: CMLA is a multilayer attention network, which exploits relations between aspect terms and opinion terms without any parsers or linguistic resources for preprocessing Wang et al. (2017b). MTCA is a multi-task attention model, which learns shared information among different tasks Wang et al. (2017a).

  • LSTM+CRF, BiLSTM+CRF: They are proposed by Huang et al. (2015) and produce state-of-the-art (or close to it) accuracy on POS, chunking and NER datasets. We borrow them for the ATE task as baselines.

  • BiLSTM+CNN: BiLSTM+CNN (we use this abbreviation for the sake of typesetting) is the bi-directional LSTM-CNNs-CRF model from Ma and Hovy (2016). Compared with BiLSTM+CRF above, BiLSTM+CNN encodes character embeddings with a CNN and obtained state-of-the-art performance on POS tagging and named entity recognition (NER). We borrow this method for the ATE task as a baseline. The window size of the CNN is 3, the number of filters is 30, and the dimension of the character embedding is 100.

For our proposed model, there are three variants depending on whether the weight matrices of Eqs. (2-9) are shared or not. BiDTreeCRF#1 shares all weight matrices, which means the mapping function $m(\cdot)$ has no effect. BiDTreeCRF#2 shares the weight matrices of Eqs. (2-3,5) and Eqs. (6-7,9) while excluding Eqs. (4,8). BiDTreeCRF#3 keeps all the weight matrices of Eqs. (2-9) and does not share any of them.

Models L-14 R-14 R-15 R-16
IHS_RD 74.55 79.62 - -
DLIREC(U) 73.78 84.01 - -
EliXa(U) - - 70.05 -
NLANGP(U) - - 67.12 72.34
WDEmb 75.16 84.97 69.73 -
RNCRF-O 74.52 82.73 - -
RNCRF+F 78.42 84.93 - -
DTBCSNN+F 75.66 83.97 - -
MIN 77.58 - - 73.44
CMLA 77.80 85.29 70.73 -
MTCA 69.14 - 71.31 73.26
LSTM+CRF 73.43 81.80 66.03 70.31
BiLSTM+CRF 76.10 82.38 65.96 70.11
BiLSTM+CNN 78.97 83.87 69.64 73.36
BiDTreeCRF#1 80.36 85.08 69.44 73.74
BiDTreeCRF#2 80.22 85.31 68.61 74.01
BiDTreeCRF#3 80.57 84.83 70.83 74.49
Table 2: Comparison in F1 scores on SemEval 2014, 2015 and 2016 datasets.

The comparison results are given in Table 2. In this table, the F1 score of the proposed model is the average of 20 runs with the same hyper-parameters, which have been described in Section 3.5 and are used throughout our experiments. We report the results of L-14 initialized with the Amazon embeddings; for the other datasets, we initialize with the Yelp embeddings since they are all restaurant reviews. We also show the embedding comparison below.

Figure 3: Percentage bar graph of F1 scores for different embeddings with (or without) the syntactic relation. E-Amazon denotes the Amazon embedding, and E-Yelp refers to the Yelp embedding. With-Rel (No-Rel) means trained with (without) the syntactic relation. The winner crosses the 50% line.

Compared to the best systems in the 2014, 2015 and 2016 SemEval ABSA challenges, BiDTreeCRF#3 achieves 6.02%, 0.82%, 0.78% and 2.15% F1 score gains over IHS_RD, DLIREC(U), EliXa(U) and NLANGP(U) on L-14, R-14, R-15, and R-16, respectively. Specifically, BiDTreeCRF#3 outperforms WDEmb by 5.41% on L-14 and 1.10% on R-15, and outperforms RNCRF-O by 6.05% and 2.10% on L-14 and R-14, respectively. Even compared with RNCRF+F and DTBCSNN+F, which exploit additional hand-crafted features, BiDTreeCRF#3 on L-14 and BiDTreeCRF#2 on R-14 without other linguistic features (e.g., POS) still achieve 2.15%, 4.91% and 0.38%, 1.34% improvements, respectively. MIN is trained via memory interactions, CMLA and MTCA are designed as multi-task models, and all three of these methods use more labels and share information among different tasks. Compared with them, BiDTreeCRF#3 still gives the best score on L-14 and R-16 and a competitive score on R-15, and BiDTreeCRF#2 achieves the state-of-the-art score on R-14, although our model is designed as a single-task model. Moreover, BiDTreeCRF#3 outperforms LSTM+CRF and BiLSTM+CRF on all datasets by 7.14%, 3.03%, 4.80%, and 4.18%, and by 4.47%, 2.45%, 4.87%, and 4.38%, respectively. Considering the fact that BiLSTM+CRF can be seen as BiDTreeCRF#3 without the BiDTree layer, all these results support that BiDTree can extract syntactic information effectively.

Models L-14 R-14 R-15 R-16
BiLSTM+CRF 76.10 82.38 65.96 70.11
BiDTree+CRF 71.29 81.09 64.09 67.87
DTree-up 78.96 84.47 68.69 72.42
DTree-down 78.46 84.41 68.75 72.91
BiDTreeCRF#3 80.57 84.83 70.83 74.49
Table 3: F1-scores of ablation experiments on BiDTreeCRF.

As we can see, different variants of the proposed model have different performance on the four datasets. In particular, BiDTreeCRF#3 is more powerful than the other variants on L-14, R-15, and R-16, and BiDTreeCRF#2 is more effective on R-14. We believe that R-15 is a small dataset and training it with more weight parameters causes over-fitting. Besides, BiDTreeCRF#3 outperforms BiLSTM+CNN even without character embeddings.

To test the effect of each component of BiDTreeCRF, the following ablation experiments on different layers of BiDTreeCRF#3 are performed: (1) DTree-up: only the bottom-up propagation of BiDTree is connected to the BiLSTM and CRF layers. (2) DTree-down: only the top-down propagation of BiDTree is connected to the BiLSTM and CRF layers. (3) BiDTree+CRF: the BiLSTM layer is removed from BiDTreeCRF. The initial word embeddings are the same as before. The comparison results are shown in Table 3. Comparing BiDTreeCRF with DTree-up and DTree-down, it is obvious that BiDTree is more competitive than either single-direction dependency network. The fact that BiDTreeCRF outperforms BiDTree+CRF indicates that the BiLSTM layer is effective in extracting sequential information on top of BiDTree. On the other hand, the fact that BiDTreeCRF outperforms BiLSTM+CRF shows that the dependency syntactic information extracted by BiDTree is extremely useful for the aspect term extraction task.

Figure 4: Left: F1 score of BiDTreeCRF#3 with different word vector dimensions using the Electronics Amazon embedding. Right: F1 score of BiDTreeCRF#3 with different word vector dimensions using the Yelp embedding.

Since word embeddings are an important contributing factor for learning with less data, we also conduct comparative experiments on word embeddings. Additionally, the syntactic relation (the second terms of Eqs. (2-5)) is adopted as a comparison criterion. The results are shown in Figure 3. Two conclusions follow: (1) in-domain embeddings are more effective than out-of-domain embeddings, and (2) the syntactic relation information is useful in most cases. We also conduct a sensitivity test on the dimension of the word embeddings of BiDTreeCRF#3. Dimensions ranging from 50 to 450, in increments of 50, are involved. The sensitivity plots on the four datasets are given in Figure 4 using the Amazon embedding and the Yelp embedding, respectively. It is worth mentioning that the Amazon embedding here is trained only on reviews of electronics products, considering the time cost. Although its score is a little lower than that of the embedding trained on the whole Amazon review corpus, the conclusion still holds. The figure shows that 300 is a suitable dimension size for the proposed model.

5 Conclusion

In this paper, an end-to-end framework, BiDTreeCRF, was introduced. The framework can efficiently extract dependency syntactic information through bottom-up and top-down propagation in dependency trees. By combining the dependency syntactic information with the advantages of BiLSTM and CRF, we achieve state-of-the-art performance on four benchmark datasets without using any other linguistic features. Three variants of the proposed model have been evaluated and shown to be more effective than the existing state-of-the-art baseline methods; these variants differ in whether they share weights during training. Our results suggest that the dependency syntactic information may also be useful for aspect and opinion term co-extraction and other sequence labeling tasks. Additional linguistic features (e.g., POS) and character embeddings may further boost the performance of the proposed model.

References

  • Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
  • Balage Filho and Pardo (2014) Pedro Paulo Balage Filho and Thiago Alexandre Salgueiro Pardo. 2014. NIL_CUSP: Aspect extraction using semantic labels. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 433–436.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research, 3(1):993–1022.
  • Brody and Elhadad (2010) Samuel Brody and Noemie Elhadad. 2010. An unsupervised aspect-sentiment model for online reviews. In Proceeding of the 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 804–812. Association for Computational Linguistics.
  • Caruana et al. (2000) Rich Caruana, Steve Lawrence, and C. Lee Giles. 2000. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pages 402–408.
  • Chen et al. (2017a) Huadong Chen, Shujian Huang, David Chiang, and Jiajun Chen. 2017a. Improved neural machine translation with a syntax-aware encoder and decoder. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 1936–1945.
  • Chen et al. (2017b) Peng Chen, Zhongqian Sun, Lidong Bing, and Wei Yang. 2017b. Recurrent attention network on memory for aspect sentiment analysis. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 463–472. Association for Computational Linguistics.
  • Chen and Liu (2014) Zhiyuan Chen and Bing Liu. 2014. Topic modeling using topics from many domains, lifelong learning and big data. In International Conference on Machine Learning, pages 703–711.
  • Chen et al. (2014) Zhiyuan Chen, Arjun Mukherjee, and Bing Liu. 2014. Aspect extraction with automated prior knowledge learning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 347–358.
  • Chen et al. (2013) Zhiyuan Chen, Arjun Mukherjee, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Exploiting domain knowledge in aspect extraction. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1655–1667. Association for Computational Linguistics.
  • Chernyshevich (2014) Maryna Chernyshevich. 2014. IHS R&D Belarus: Cross-domain extraction of product features using CRF. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, pages 309–313. The Association for Computer Linguistics.
  • Choi and Cardie (2010) Yejin Choi and Claire Cardie. 2010. Hierarchical sequential learning for extracting opinions and their attributes. In Proceedings of the ACL 2010 Conference Short Papers, pages 269–274. Association for Computational Linguistics.
  • Gers et al. (1999) Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. 1999. Learning to forget: Continual prediction with lstm. Neural Computation, 12(10):2451–2471.
  • Giannakopoulos et al. (2017) Athanasios Giannakopoulos, Claudiu Musat, Andreea Hossmann, and Michael Baeriswyl. 2017. Unsupervised aspect term extraction with b-lstm & crf using automatically labelled datasets. In Proceedings of the 8th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 180–188.
  • Goller and Kuchler (1996) Christoph Goller and Andreas Kuchler. 1996. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of IEEE International Conference on Neural Networks, pages 347–352.
  • Graves and Schmidhuber (2005) Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610.
  • Hamdan et al. (2015) Hussam Hamdan, Patrice Bellot, and Frédéric Béchet. 2015. Lsislif: CRF and logistic regression for opinion target extraction and sentiment polarity analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2015, pages 753–758. Association for Computer Linguistics.
  • He et al. (2017) Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2017. An unsupervised neural attention model for aspect extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, pages 388–397.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
  • Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177. ACM.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • Irsoy and Cardie (2013) Ozan Irsoy and Claire Cardie. 2013. Bidirectional recursive neural networks for token-level labeling with structure. arXiv preprint arXiv:1312.0493.
  • Irsoy and Cardie (2014) Ozan Irsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 720–728. Association for Computer Linguistics.
  • Jakob and Gurevych (2010) Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single-and cross-domain setting with conditional random fields. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 1035–1045. Association for Computational Linguistics.
  • Jin et al. (2009) Wei Jin, Hung Hay Ho, and Rohini K Srihari. 2009. A novel lexicalized hmm-based learning framework for web opinion mining. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 465–472.
  • Kingma and Ba (2014) Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML 2001), pages 282–289.
  • Li et al. (2010) Fangtao Li, Chao Han, Minlie Huang, Xiaoyan Zhu, Ying-Ju Xia, Shu Zhang, and Hao Yu. 2010. Structure-aware review mining and summarization. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 653–661. Association for Computational Linguistics.
  • Li and Lam (2017) Xin Li and Wai Lam. 2017. Deep multi-task learning for aspect term extraction with memory interaction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2876–2882. Association for Computational Linguistics.
  • Lin and He (2009) Chenghua Lin and Yulan He. 2009. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 375–384. ACM.
  • Liu (2012) Bing Liu. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, 5(1):1–167.
  • Liu et al. (2015a) Pengfei Liu, Shafiq Joty, and Helen Meng. 2015a. Fine-grained opinion mining with recurrent neural networks and word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1433–1443. Association for Computational Linguistics.
  • Liu et al. (2013) Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2013. A logic programming approach to aspect extraction in opinion mining. In Proceedings of 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, pages 276–283. IEEE.
  • Liu et al. (2015b) Qian Liu, Zhiqiang Gao, Bing Liu, and Yuanlin Zhang. 2015b. Automated rule selection for aspect extraction in opinion mining. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 1291–1297. AAAI Press.
  • Liu et al. (2016) Qian Liu, Bing Liu, Yuanlin Zhang, Doo Soon Kim, and Zhiqiang Gao. 2016. Improving opinion aspect extraction using semantic similarity and aspect associations. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 2986–2992.
  • Liu et al. (2015c) Yang Liu, Furu Wei, Sujian Li, Heng Ji, Ming Zhou, and Houfeng Wang. 2015c. A dependency-based neural network for relation classification. arXiv preprint arXiv:1507.04646.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard H. Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1064–1074. Association for Computer Linguistics.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • Mitchell et al. (2013) Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, and Benjamin Van Durme. 2013. Open domain targeted sentiment. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1643–1654.
  • Miwa and Bansal (2016) Makoto Miwa and Mohit Bansal. 2016. End-to-end relation extraction using lstms on sequences and tree structures. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1105–1116. Association for Computer Linguistics.
  • Moghaddam and Ester (2011) Samaneh Moghaddam and Martin Ester. 2011. ILDA: interdependent lda model for learning latent aspects and their ratings from online product reviews. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 665–674. ACM.
  • Pontiki et al. (2015) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Suresh Manandhar, and Ion Androutsopoulos. 2015. Semeval-2015 task 12: Aspect based sentiment analysis. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2015, pages 486–495. Association for Computer Linguistics.
  • Pontiki et al. (2016) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, et al. 2016. Semeval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, pages 19–30. Association for Computer Linguistics.
  • Pontiki et al. (2014) Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation, SemEval@COLING 2014, pages 27–35. Association for Computer Linguistics.
  • Popescu and Etzioni (2005) Ana-Maria Popescu and Oren Etzioni. 2005. Extracting product features and opinions from reviews. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 339–346. Association for Computational Linguistics.
  • Poria et al. (2016) Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2016. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems, 108:42–49.
  • Poria et al. (2014) Soujanya Poria, Erik Cambria, Lun-Wei Ku, Chen Gui, and Alexander Gelbukh. 2014. A rule-based approach to aspect extraction from product reviews. In Proceedings of the 2nd Workshop on Natural Language Processing for Social Media (SocialNLP), pages 28–37.
  • Qiu et al. (2011) Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics, 37(1):9–27.
  • Shu et al. (2017) Lei Shu, Hu Xu, and Bing Liu. 2017. Lifelong learning crf for supervised aspect extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Short Papers), pages 148–154.
  • Socher et al. (2011) Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 129–136.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.
  • Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 1556–1566. Association for Computer Linguistics.
  • Teng and Zhang (2016) Zhiyang Teng and Yue Zhang. 2016. Bidirectional tree-structured lstm with head lexicalization. arXiv preprint arXiv:1611.06788.
  • Toh and Su (2015) Zhiqiang Toh and Jian Su. 2015. NLANGP: supervised machine learning system for aspect category classification and opinion target extraction. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2015, pages 496–501. Association for Computer Linguistics.
  • Toh and Su (2016) Zhiqiang Toh and Jian Su. 2016. NLANGP at semeval-2016 task 5: Improving aspect based sentiment analysis using neural network features. In Proceedings of the 10th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2016, pages 282–288. Association for Computer Linguistics.
  • Toh and Wang (2014) Zhiqiang Toh and Wenting Wang. 2014. Dlirec: Aspect term extraction and term polarity classification system. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 235–240.
  • Vicente et al. (2015) Iñaki San Vicente, Xabier Saralegi, and Rodrigo Agerri. 2015. EliXa: A modular and flexible ABSA platform. In Proceedings of the 9th International Workshop on Semantic Evaluation, SemEval@NAACL-HLT 2015, pages 748–752. Association for Computer Linguistics.
  • Wang and Wang (2008) Bo Wang and Houfeng Wang. 2008. Bootstrapping both product features and opinion words from chinese customer reviews with cross-inducing. In Proceedings of The Third International Joint Conference on Natural Language, pages 289–295.
  • Wang et al. (2016a) Shuai Wang, Zhiyuan Chen, and Bing Liu. 2016a. Mining aspect-specific opinion using a holistic lifelong topic model. In Proceedings of the 25th International Conference on World Wide Web, pages 167–176.
  • Wang et al. (2017a) Wenya Wang, Sinno Jialin Pan, and Daniel Dahlmeier. 2017a. Multi-task coupled attentions for category-specific aspect and opinion terms co-extraction. arXiv preprint arXiv:1702.01776.
  • Wang et al. (2016b) Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. 2016b. Recursive neural conditional random fields for aspect-based sentiment analysis. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 616–626. Association for Computational Linguistics.
  • Wang et al. (2017b) Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. 2017b. Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In Proceedings of the 31st AAAI Conference on Artificial Intelligence, pages 3316–3322. AAAI Press.
  • Wu et al. (2009) Yuanbin Wu, Qi Zhang, Xuanjing Huang, and Lide Wu. 2009. Phrase dependency parsing for opinion mining. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 1533–1541. Association for Computational Linguistics.
  • Ye et al. (2017) Hai Ye, Zichao Yan, Zhunchen Luo, and Wenhan Chao. 2017. Dependency-tree based convolutional neural networks for aspect term extraction. In Advances in Knowledge Discovery and Data Mining. PAKDD 2017, pages 350–362. Springer.
  • Yin et al. (2016) Yichun Yin, Furu Wei, Li Dong, Kaimeng Xu, Ming Zhang, and Ming Zhou. 2016. Unsupervised word and dependency path embeddings for aspect term extraction. In Proceedings of the 25th International Joint Conference on Artificial Intelligence, pages 2979–2985. IJCAI/AAAI Press.
  • Zhang et al. (2010) Lei Zhang, Bing Liu, Suk Hwan Lim, and Eamonn O’Brien-Strain. 2010. Extracting and ranking product features in opinion documents. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pages 1462–1470. Association for Computational Linguistics.
  • Zhang et al. (2016) Xingxing Zhang, Liang Lu, and Mirella Lapata. 2016. Top-down tree long short-term memory networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 310–320. Association for Computational Linguistics.
  • Zhu et al. (2009) Jingbo Zhu, Huizhen Wang, Benjamin K. Tsou, and Muhua Zhu. 2009. Multi-aspect opinion polling from textual reviews. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 1799–1802. ACM.
  • Zhuang et al. (2006) Li Zhuang, Feng Jing, and Xiaoyan Zhu. 2006. Movie review mining and summarization. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 43–50. ACM.