Aspect extraction is an important task in sentiment analysis (Hu and Liu, 2004) and has many applications (Pang and Lee, 2008; Liu, 2012; Cambria and Hussain, 2012). It aims to extract opinion targets (or aspects) from opinion text. In reviews, aspects are attributes or features of opinion targets. For example, from “The screen is great” in a laptop review, it aims to extract “screen”.
Aspect extraction has been performed using supervised and unsupervised approaches. Since this work focuses on supervised learning, we only list the existing unsupervised approaches: frequent pattern mining (Hu and Liu, 2004; Popescu and Etzioni, 2005), syntactic rule-based extraction (Zhuang et al., 2006; Wang and Wang, 2008; Qiu et al., 2011; Shu et al., 2016), topic modeling (Mei et al., 2007; Titov and McDonald, 2008; Lin and He, 2009; Moghaddam and Ester, 2011), word alignment (Liu et al., 2013) and label propagation (Zhou et al., 2013). Traditionally, supervised approaches (Jakob and Gurevych, 2010; Mitchell et al., 2013) use Conditional Random Fields (CRF) (Lafferty et al., 2001). Recently, deep networks have also been applied, for example, using LSTM (Williams and Zipser, 1989; Hochreiter and Schmidhuber, 1997; Liu et al., 2015) and attention mechanisms (Wang et al., 2017; He et al., 2017) together with manual features (Poria et al., 2016; Wang et al., 2016). Further, Wang et al. (2016, 2017), Li and Lam (2017) and Li et al. (2018) proposed co-extraction of aspect and opinion terms via a deep network. More recently, a simple CNN model called DE-CNN (Xu et al., 2018a) achieved state-of-the-art performance on aspect extraction by leveraging a double embedding mechanism. Besides general-purpose embeddings (e.g., GloVe embeddings), DE-CNN also uses domain-specific embeddings (Xu et al., 2018b) to boost its performance without using any manual features.
In this paper, we use DE-CNN as the base model. We notice that in the traditional CNN training process, all CNN layers are updated together (synchronously) through back-propagation. The layers easily over-fit the training data, even though a validation set is used to select the best parameter values. Inspired by a recent work called Deep Adaptation Network (DAN) (Rosenfeld and Tsotsos, 2017), we design two kinds of control modules to adjust the input of each CNN layer. Although DAN is designed for incremental learning (continually adapting a model to new tasks without losing performance on previous tasks), we observe that asynchronously updating control modules and CNN layers can also boost the performance on a single task. The critical point is that we do not train all parameters at the same time. Instead, we first optimize the CNN layers while the control modules' parameters are fixed. The fixed control modules act like noise added to each CNN layer's input. This makes the training a little harder and ensures that the whole model does not fully fit the training data. After that, we optimize the control modules while the CNN layers' parameters are fixed. Since the CNN layers' parameters were optimized on noisy input, the whole model does not easily over-fit the training data in this step either. In every step (fixing control modules or fixing CNN layers), we track the best validation model and start the next step's training from this best validation model. Once the best validation score has not changed for several steps, the whole asynchronous-updating training process completes.
To achieve better efficiency, we propose two kinds of control modules: the Embedding Control Module and the CNN Control Module. The former is applied after the embedding layer, and the latter is applied between two adjacent CNN layers. Using these control modules and updating them asynchronously with the CNN layers prevents over-fitting. The experimental results show that this idea is promising. To the best of our knowledge, this is the first paper that combines control modules with asynchronous updating.
2 Related Work
DAN (Rosenfeld and Tsotsos, 2017) solves the incremental learning problem by (1) training a base CNN network on the initial task and (2) when a new task is encountered, training square linear transformations of the base CNN layers, so that the base CNN network is reused for the new task while its performance on the initial task is maintained. The residual network (He et al., 2016) solves the vanishing gradient problem in very deep neural networks by providing highway bridges between CNN layers. We address neither incremental/transfer learning nor the vanishing gradient problem. Instead, we update parameters asynchronously to prevent over-fitting and to improve performance on a single task.
Table 1: Parameter shapes of the control modules, CNN layers and final linear layer.
|Layer||Module||Parameter shapes|
|1||Emb Ctrl||(400, 400) (400,)|
|2||CNN||(128, 400, 3) (128,) (128, 400, 5) (128,)|
|2||CNN Ctrl||(256, 128) (128,) (128, 256) (256,)|
|3||CNN||(256, 256, 5) (256,)|
|3||CNN Ctrl||(256, 128) (128,) (128, 256) (256,)|
|4||CNN||(256, 256, 5) (256,)|
|4||CNN Ctrl||(256, 128) (128,) (128, 256) (256,)|
|5||CNN||(256, 256, 5) (256,)|
|6||Linear||(256, 3) (3,)|
The model consists of a double embedding layer (we later use "embedding layer" for simplicity), multiple CNN layers, multiple control modules, and a fully-connected+softmax layer. Note that we keep the architecture of Xu et al. (2018a) and only add control modules. We apply control modules after the embedding layer and after each CNN layer except the last one.
We propose two kinds of control modules.
Embedding Control Module. As shown in Figure 2, the embedding control module adds the input to a copy of the input transformed by a square matrix. The purpose of this control module is to keep the original embedding while slightly adjusting its representation.
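Assume the input is a sequence of word indexes $x = (x_1, \dots, x_n)$. Let $x^{(1)}$ denote the output from the embedding layer. The controlled output from the embedding layer is:
$$x_c^{(1)} = x^{(1)} + W^{(1)} x^{(1)} + b^{(1)},$$
where $W^{(1)}$ and $b^{(1)}$ are trainable weights.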
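The embedding control module described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `embedding_control` and `init_control`, the small random initialization, and the 4-dimensional toy vector (instead of the 400 dimensions listed in Table 1) are all assumptions made for the example.

```python
import random


def embedding_control(x, W, b):
    """Return x + (W x + b) for one embedding vector x (a list of floats)."""
    d = len(x)
    # Square linear transformation of the input.
    transformed = [sum(W[i][j] * x[j] for j in range(d)) + b[i] for i in range(d)]
    # Residual sum keeps the original embedding and adds a small adjustment.
    return [x[i] + transformed[i] for i in range(d)]


def init_control(d, scale, seed=0):
    """Hypothetical initializer: W uniform in [-scale, scale], b all zeros."""
    rng = random.Random(seed)
    W = [[rng.uniform(-scale, scale) for _ in range(d)] for _ in range(d)]
    b = [0.0] * d
    return W, b


x = [0.5, -1.0, 2.0, 0.0]

# With zero weights the module returns the embedding unchanged.
W0, b0 = init_control(len(x), scale=0.0)
identity_out = embedding_control(x, W0, b0)

# With small weights the embedding is only slightly adjusted.
W1, b1 = init_control(len(x), scale=0.01, seed=1)
adjusted = embedding_control(x, W1, b1)
```

Because the transformed input is added to the original one, small weights leave the embedding almost intact, which matches the stated purpose of keeping the original embedding while slightly adjusting it.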
CNN Control Module. As shown in Figure 2, the CNN control module has a bow-tie structure: the hidden dimension is first reduced and then expanded. An intermediate activation function is applied after the reduction and, to avoid over-fitting, dropout is applied after this activation function. The bow-tie structure helps to strengthen important information in each CNN layer's output. The expanded output is added to the CNN output, which keeps the original representation with a slight adjustment. Finally, ReLU activation is applied to ensure the output is greater than or equal to 0.
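Specifically, let $x^{(l)}$ denote the output of the $l$-th CNN layer (the first layer is the embedding layer). The output of the CNN control module is computed as:
$$x_c^{(l)} = \mathrm{ReLU}\Big(x^{(l)} + W_2^{(l)}\,\mathrm{Dropout}\big(\sigma(W_1^{(l)} x^{(l)} + b_1^{(l)})\big) + b_2^{(l)}\Big),$$
where $\sigma$ is the intermediate activation function and $W_1^{(l)}$, $b_1^{(l)}$, $W_2^{(l)}$ and $b_2^{(l)}$ are trainable weights.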
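The bow-tie computation can be sketched as follows for a single position. This is an illustration under stated assumptions rather than the authors' code: the names `linear` and `cnn_control` are hypothetical, `tanh` merely stands in for the intermediate activation (which this text does not name), dropout is written as inverted dropout, and the toy sizes 4 and 2 replace the 256 and 128 of Table 1.

```python
import math
import random


def linear(x, W, b):
    """Affine map with W of shape (in_dim, out_dim), as in the Table 1 shapes."""
    return [sum(x[i] * W[i][k] for i in range(len(x))) + b[k] for k in range(len(b))]


def cnn_control(x, W1, b1, W2, b2, drop_rate=0.0, rng=None, act=math.tanh):
    """Reduce, activate, (optionally) drop, expand, add residual, apply ReLU."""
    # Reduction to the narrow hidden size; `act` is an assumed placeholder.
    h = [act(v) for v in linear(x, W1, b1)]
    if drop_rate > 0.0:
        # Inverted dropout, intended for training time only.
        rng = rng or random.Random(0)
        keep = 1.0 - drop_rate
        h = [v / keep if rng.random() < keep else 0.0 for v in h]
    # Expansion back to the input size.
    expanded = linear(h, W2, b2)
    # Residual sum followed by ReLU, so every output is >= 0.
    return [max(0.0, xi + ei) for xi, ei in zip(x, expanded)]


x = [1.0, -2.0, 0.5, 3.0]
W1 = [[0.0, 0.0] for _ in range(4)]  # shape (4, 2): reduce
b1 = [0.0, 0.0]
W2 = [[0.0] * 4 for _ in range(2)]   # shape (2, 4): expand
b2 = [0.0] * 4

# With zero weights the module reduces to ReLU applied to the CNN output.
out = cnn_control(x, W1, b1, W2, b2)
out_train = cnn_control(x, W1, b1, W2, b2, drop_rate=0.5)
```

With all control weights at zero the module passes the CNN output through ReLU unchanged, so the control path only ever adds an adjustment on top of the original representation.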
Further, we let $\theta_{cnn}$, $\theta_{ctrl}$ and $\theta_{fc}$ denote the trainable parameters in the CNN layers, the control layers and the final fully connected layer, respectively. We define the asynchronous training as follows. Every step initializes the model to the previous step's best validation model and saves the best validation model found during training.
Step (1) fix $\theta_{ctrl}$, $\theta_{fc}$, and tune on $\theta_{cnn}$.
Step (2) fix $\theta_{cnn}$, and tune on $\theta_{ctrl}$, $\theta_{fc}$.
Repeat step (1) and step (2) until the best validation score does not change after several steps. In this way, CNN layers are trained when control modules are frozen, and the control modules are trained when the CNN layers are frozen.
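The alternating schedule above can be sketched as a small driver loop. This is a hedged sketch, not the authors' training code: `asynchronous_training`, `patience` (standing in for "several steps"), and the toy objective are assumptions. Each step function is assumed to start from the best validation model so far, train only its own parameter group, and return the best model it found together with its validation score.

```python
def asynchronous_training(step1, step2, init_state, init_score, patience=2):
    """Alternate step1/step2 until the best validation score stops improving."""
    best_state, best_score = init_state, init_score
    steps = (step1, step2)
    stale = 0  # consecutive steps without a better validation score
    k = 0      # total number of steps run
    while stale < patience:
        # Every step starts from the best validation model found so far.
        state, score = steps[k % 2](best_state)
        if score > best_score:
            best_state, best_score, stale = state, score, 0
        else:
            stale += 1
        k += 1
    return best_state, best_score, k


# Toy stand-in for a validation score with one "CNN" and one "control" parameter.
def validation_score(state):
    return -((state["cnn"] - 1.0) ** 2) - ((state["ctrl"] - 2.0) ** 2)


def train_cnn(state):
    # Step (1): control parameter frozen, only the CNN parameter changes.
    new_state = {"cnn": 1.0, "ctrl": state["ctrl"]}
    return new_state, validation_score(new_state)


def train_ctrl(state):
    # Step (2): CNN parameter frozen, only the control parameter changes.
    new_state = {"cnn": state["cnn"], "ctrl": 2.0}
    return new_state, validation_score(new_state)


start = {"cnn": 0.0, "ctrl": 0.0}
best, score, n_steps = asynchronous_training(
    train_cnn, train_ctrl, start, validation_score(start)
)
```

In this toy run the first two steps each improve the score and the next two do not, so the loop stops after four steps. In a real setting each step function would run a full optimizer over the unfrozen parameter group.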
To compare fairly with the state-of-the-art method DE-CNN (Xu et al., 2018a), we keep the embedding layer and all CNN layers the same as in DE-CNN. DE-CNN has a double embedding layer, 4 CNN layers, a fully-connected layer shared across all word positions, and a softmax layer over the labeling space for each input position. The first CNN layer employs two different filter sizes; the remaining 3 CNN layers use only one filter size. We apply dropout after the embedding layer and after each ReLU activation. As explained by Xu et al. (2018a), the double embedding layer is frozen because the training data for aspect extraction is usually small. The embedding control module lies between the embedding layer and the first CNN layer. One CNN control module lies between each pair of adjacent CNN layers, three in total. Details are also given in Table 1.
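To illustrate the per-position labeling, the following sketch decodes a predicted label sequence into aspect terms. It assumes that the three output classes of the final layer follow the common B/I/O tagging scheme, which this text does not state explicitly, and the function name `extract_aspects` is hypothetical.

```python
def extract_aspects(tokens, labels):
    """Collect aspect terms from B (begin), I (inside) and O (outside) labels."""
    aspects, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B":
            # A new aspect starts; close any aspect that is still open.
            if current:
                aspects.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            # Continue the open aspect.
            current.append(token)
        else:
            # "O", or an "I" with no open aspect, ends the current aspect.
            if current:
                aspects.append(" ".join(current))
            current = []
    if current:
        aspects.append(" ".join(current))
    return aspects


single = extract_aspects(["The", "screen", "is", "great"], ["O", "B", "O", "O"])
multi = extract_aspects(
    ["The", "battery", "life", "is", "short"], ["O", "B", "I", "O", "O"]
)
```

For the laptop review sentence "The screen is great", the labels O B O O yield the single aspect "screen"; a multi-word aspect such as "battery life" is recovered from a B label followed by an I label.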
We conduct experiments on two benchmark datasets from SemEval challenges (Pontiki et al., 2014, 2016), as shown in Table 2. The first dataset is in the laptop domain from SemEval-2014 Task 4. The second dataset is in the restaurant domain from SemEval-2016 Task 5. We use NLTK (http://www.nltk.org/) to tokenize each sentence. For double embedding, the general-purpose embeddings are the glove.840B.300d embeddings (Pennington et al., 2014). The domain-specific embeddings are obtained from DE-CNN (Xu et al., 2018a; https://github.com/howardhsu/). We hold out 150 examples from the training data as validation data to decide the hyper-parameters. The dropout rate is 0.55. For asynchronous updating, we use the Adam optimizer (Kingma and Ba, 2014). The learning rate of Step (1) is 0.00005, and that of Step (2) is 0.0001. Step (1) uses the smaller learning rate because CNN training tends to be unstable and Step (1) trains the CNN layers, which contain the majority of the network's parameters.
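The statement that the CNN layers hold the majority of the parameters can be checked against the shapes in Table 1. The short calculation below is only a sanity check under one assumption: every shape listed in Table 1 is a trainable weight or bias tensor (the frozen embedding layer is not listed and is not counted). The dictionary `TABLE_1` and the helper `count` are names introduced for this example.

```python
from math import prod

# Shapes copied from Table 1, grouped by parameter type.
TABLE_1 = {
    "cnn": [
        (128, 400, 3), (128,), (128, 400, 5), (128,),  # first CNN layer, two filter sizes
        (256, 256, 5), (256,),
        (256, 256, 5), (256,),
        (256, 256, 5), (256,),
    ],
    "ctrl": [
        (400, 400), (400,),                            # embedding control module
        (256, 128), (128,), (128, 256), (256,),        # CNN control modules
        (256, 128), (128,), (128, 256), (256,),
        (256, 128), (128,), (128, 256), (256,),
    ],
    "fc": [(256, 3), (3,)],
}


def count(shapes):
    """Total number of scalar parameters in a list of tensor shapes."""
    return sum(prod(shape) for shape in shapes)


counts = {name: count(shapes) for name, shapes in TABLE_1.items()}
total = sum(counts.values())
cnn_share = counts["cnn"] / total
```

Under this assumption the CNN layers hold 1,393,664 of the 1,752,595 listed parameters (about 80%), the control modules hold 358,160, and the final linear layer holds 771.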
|Datasets||Training Set||Testing Set|
|Model||Laptop Dataset||Restaurant Dataset|
|THA & STN||79.52||73.61|
4.1 Compared Methods
We compare Ctrl with two groups of baselines. The first group consists of non-CNN-based methods. CRF is conditional random fields. IHS_RD (Chernyshevich, 2014) and NLANGP (Toh and Su, 2016) are the best systems from the original challenges (Pontiki et al., 2014, 2016). WDEmb (Yin et al., 2016) is an enhanced CRF with multiple embeddings. LSTM (Liu et al., 2015; Li et al., 2018) is a BiLSTM implementation. BiLSTM-CNN-CRF (Reimers and Gurevych, 2017) is the state-of-the-art named entity recognition system. BERT (Devlin et al., 2018) fine-tunes a pre-trained language model on the aspect extraction task. The following methods use multi-task learning, with an opinion lexicon or human annotation adopted for opinion supervision: RNCRF (Wang et al., 2016) is a joint recursive neural network and CRF model for co-extraction of aspect and opinion terms. CMLA (Wang et al., 2017) solves the co-extraction through a multi-layer coupled-attention network. MIN (Li and Lam, 2017) solves co-extraction and discriminates sentimental from non-sentimental sentences. THA & STN (Li et al., 2018) uses opinion summary and aspect history to improve prediction.
The second group consists of CNN-based methods. DE-CNN (Xu et al., 2018a) is a pure CNN-based sequence labeling model that utilizes double embedding. This is the base model that Ctrl is adapted from, and we use this baseline to show the improvements from Ctrl. The remaining baselines use DE-CNN as the basic network and add an extra intermediate layer between layers in the basic network. DAN (Rosenfeld and Tsotsos, 2017) adopts linear transformations as control modules in an incremental learning method for image classification. DAN- - tunes all fully connected layers given frozen random-value CNN layers. DAN- optimizes all parameters in fully connected layers and CNN layers together. DAN uses an asynchronous training process between all fully connected layers and CNN layers. The following are variations of our model. Ctrl- - uses random-value (un-trainable) CNN layers and tunes the control modules and fully connected layers. Ctrl- synchronously updates the control modules, CNN layers, and fully connected layer. Ctrl asynchronously updates the parameters.
4.2 Results and Analysis
From Table 3, we can see that our model Ctrl performs the best. The variations of Ctrl always outperform those of DAN. This suggests that a purely linear transformation is unable to produce noise and prevent over-fitting. Ctrl - -’s result shows the adaptive ability of the control modules. Ctrl - updates all parameters in the overall network synchronously, but under-performs DE-CNN even though it has control modules. The reason is that under synchronous updating, the control modules just make the overall network deeper. In Figure 3, the first plot shows that Ctrl- and Ctrl reach a similar training loss level and that Ctrl- gets there faster. Since both use the same learning rate, this means that fixed control modules make the training harder. In the second plot, Ctrl- ’s validation loss decreases and then increases, which is a clear sign of over-fitting. In contrast, Ctrl’s validation loss stays flat even after several steps of training. From the last plot (test score), we can see that Ctrl has a testing performance similar to that of Ctrl- in the first training step. In Ctrl’s second training step (between the first and second green lines), the test score continues to improve. The results and plots show that through asynchronous updating, control modules can prevent over-fitting and improve CNN performance.
We propose adding two kinds of control modules to a CNN-based aspect extraction model. Through asynchronous updating, our model Ctrl outperforms state-of-the-art methods significantly.
- Cambria and Hussain (2012) Erik Cambria and Amir Hussain. 2012. Sentic Computing Techniques, Tools, and Applications 2nd Edition. Springer.
- Chen et al. (2017) Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2017. Improving sentiment analysis via sentence type classification using bilstm-crf and cnn. Expert Systems with Applications 72:221–230.
- Chernyshevich (2014) Maryna Chernyshevich. 2014. Ihs r&d belarus: Cross-domain extraction of product features using crf. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). pages 309–313.
- Chiu and Nichols (2015) Jason PC Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional lstm-cnns. arXiv preprint arXiv:1511.08308 .
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 .
- Du et al. (2017) Jiachen Du, Lin Gui, Ruifeng Xu, and Yulan He. 2017. A convolutional attention model for text classification. In National CCF Conference on Natural Language Processing and Chinese Computing. Springer, pages 183–195.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122 .
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In . pages 770–778.
- He et al. (2017) Ruidan He, Wee Sun Lee, Hwee Tou Ng, and Daniel Dahlmeier. 2017. An unsupervised neural attention model for aspect extraction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). volume 1, pages 388–397.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Hu and Liu (2004) Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In KDD ’04. pages 168–177.
- Jakob and Gurevych (2010) Niklas Jakob and Iryna Gurevych. 2010. Extracting opinion targets in a single- and cross-domain setting with conditional random fields. In EMNLP ’10. pages 1035–1045.
- Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 .
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 .
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
- Lafferty et al. (2001) John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML ’01. pages 282–289.
- LeCun et al. (1995) Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10):1995.
- Li et al. (2018) Xin Li, Lidong Bing, Piji Li, Wai Lam, and Zhimou Yang. 2018. Aspect term extraction with history attention and selective transformation. arXiv preprint arXiv:1805.00760 .
- Li and Lam (2017) Xin Li and Wai Lam. 2017. Deep multi-task learning for aspect term extraction with memory interaction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2886–2892.
- Lin and He (2009) Chenghua Lin and Yulan He. 2009. Joint sentiment/topic model for sentiment analysis. In CIKM ’09. pages 375–384.
- Liu (2012) Bing Liu. 2012. Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
- Liu et al. (2013) Kang Liu, Liheng Xu, Yang Liu, and Jun Zhao. 2013. Opinion target extraction using partially-supervised word alignment model. In IJCAI ’13. pages 2134–2140.
- Liu et al. (2015) Pengfei Liu, Shafiq Joty, and Helen Meng. 2015. Fine-grained opinion mining with recurrent neural networks and word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pages 1433–1443.
- Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354 .
- Mei et al. (2007) Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, and ChengXiang Zhai. 2007. Topic sentiment mixture: Modeling facets and opinions in weblogs. In WWW ’07. pages 171–180.
- Mitchell et al. (2013) Margaret Mitchell, Jacqui Aguilar, Theresa Wilson, and Benjamin Van Durme. 2013. Open domain targeted sentiment. In ACL ’13. pages 1643–1654.
- Moghaddam and Ester (2011) Samaneh Moghaddam and Martin Ester. 2011. ILDA: interdependent lda model for learning latent aspects and their ratings from online product reviews. In SIGIR ’11. pages 665–674.
- Pang and Lee (2008) Bo Pang and Lillian Lee. 2008. Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2:1–135.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pages 1532–1543.
- Pontiki et al. (2016) Maria Pontiki, Dimitris Galanis, Haris Papageorgiou, Ion Androutsopoulos, Suresh Manandhar, AL-Smadi Mohammad, Mahmoud Al-Ayyoub, Yanyan Zhao, Bing Qin, Orphée De Clercq, et al. 2016. Semeval-2016 task 5: Aspect based sentiment analysis. In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016). pages 19–30.
- Pontiki et al. (2014) Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. Semeval-2014 task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014). Association for Computational Linguistics and Dublin City University, Dublin, Ireland, pages 27–35. http://www.aclweb.org/anthology/S14-2004.
- Popescu and Etzioni (2005) Ana-Maria Popescu and Oren Etzioni. 2005. Extracting product features and opinions from reviews. In HLT-EMNLP ’05. pages 339–346.
- Poria et al. (2016) Soujanya Poria, Erik Cambria, and Alexander Gelbukh. 2016. Aspect extraction for opinion mining with a deep convolutional neural network. Knowledge-Based Systems 108:42–49.
- Qiu et al. (2011) Guang Qiu, Bing Liu, Jiajun Bu, and Chun Chen. 2011. Opinion word expansion and target extraction through double propagation. Computational Linguistics 37(1):9–27.
- Reimers and Gurevych (2017) Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of lstm-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 338–348.
- Rosenfeld and Tsotsos (2017) Amir Rosenfeld and John K Tsotsos. 2017. Incremental learning through deep adaptation. arXiv preprint arXiv:1705.04228 .
- Shu et al. (2016) Lei Shu, Bing Liu, Hu Xu, and Annice Kim. 2016. Lifelong-rl: Lifelong relaxation labeling for separating entities and aspects in opinion targets. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pages 225–235.
- Strubell et al. (2017) Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pages 2670–2680.
- Titov and McDonald (2008) Ivan Titov and Ryan McDonald. 2008. A joint model of text and aspect ratings for sentiment summarization. In ACL ’08: HLT. pages 308–316.
- Toh and Su (2016) Zhiqiang Toh and Jian Su. 2016. Nlangp at semeval-2016 task 5: Improving aspect based sentiment analysis using neural network features. In Proceedings of the 10th international workshop on semantic evaluation (SemEval-2016). pages 282–288.
- Wang and Wang (2008) Bo Wang and Houfeng Wang. 2008. Bootstrapping both product features and opinion words from chinese customer reviews with cross-inducing. In IJCNLP ’08. pages 289–295.
- Wang et al. (2016) Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. 2016. Recursive neural conditional random fields for aspect-based sentiment analysis. arXiv preprint arXiv:1603.06679 .
- Wang et al. (2017) Wenya Wang, Sinno Jialin Pan, Daniel Dahlmeier, and Xiaokui Xiao. 2017. Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In AAAI. pages 3316–3322.
- Williams and Zipser (1989) Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation 1(2):270–280.
- Xu et al. (2018a) Hu Xu, Bing Liu, Lei Shu, and Philip S. Yu. 2018a. Double embeddings and cnn-based sequence labeling for aspect extraction. In ACL.
- Xu et al. (2018b) Hu Xu, Bing Liu, Lei Shu, and Philip S Yu. 2018b. Lifelong domain word embedding via meta-learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, pages 4510–4516.
- Yin et al. (2016) Yichun Yin, Furu Wei, Li Dong, Kaimeng Xu, Ming Zhang, and Ming Zhou. 2016. Unsupervised word and dependency path embeddings for aspect term extraction. arXiv preprint arXiv:1605.07843 .
- Zhou et al. (2013) Xinjie Zhou, Xiaojun Wan, and Jianguo Xiao. 2013. Collective opinion target extraction in Chinese microblogs. In EMNLP ’13. pages 1840–1850.
- Zhuang et al. (2006) Li Zhuang, Feng Jing, and Xiao-Yan Zhu. 2006. Movie review mining and summarization. In CIKM ’06. pages 43–50.