Most current sentiment analysis approaches focus on detecting the overall polarity of a sentence or text span. However, the overall polarity refers to a broader context and does not identify specific targets. In addition, many sentences or paragraphs contain both positive and negative polarities, which complicates the assignment of a single correct overall polarity.
Aspect based sentiment analysis (ABSA) is a more detailed and in-depth approach compared to traditional sentiment analysis and aims at tackling the problems of the latter. ABSA can be decomposed into two tasks:
Aspect term extraction (ATE), where the goal is to identify all aspect terms (e.g. battery, screen) of the target entity (e.g. laptop) in a sentence (or text span).
Sentiment Polarity (SP), where the goal is to identify the polarity (e.g. positive) attached to each aspect term.
We focus on ATE (rather than SP), because it is a harder and more interesting problem. The existing learning techniques for ATE can be categorized into supervised and unsupervised. Each category comes with numerous benefits and drawbacks. Supervised ATE leads to high performance on unseen data. However, the available human annotated datasets are restricted to only a few domains (e.g. restaurants and laptops) and are very small. Even the biggest available human annotated datasets — provided by the SemEval ABSA contest — contain only a few thousand sentences. Unsupervised ATE overcomes the aforementioned problems by exploiting large and publicly available opinion texts, such as review corpora. These corpora cover a larger domain spectrum (e.g. books, food, electronic devices, etc.) and allow us to perform ATE in a domain-independent fashion. However, unsupervised systems for ATE come at the cost of lower performance compared to supervised ones.
We propose a third option, namely ATE with distant supervision. To the best of our knowledge, there is no prior work in this area. To this end, we first propose a novel attention-based method to construct datasets for ATE, starting from review corpora, which are naturally rich in opinionated aspect terms. Using the constructed dataset, we introduce a new model for ATE, which performs feature extraction and aspect term detection simultaneously while training.
Our pipeline for ATE with distant supervision is depicted in Fig. 1. We start from raw review corpora (i.e. opinion texts) in order to construct a new dataset for ATE. However, we identify the problem that such corpora contain a lot of noisy sentences, e.g. "I bought this laptop for my kids". Such sentences do not contain opinionated aspect terms and therefore introduce noise to the dataset we wish to build. To overcome this problem, we employ our attention-based method to extract non-noisy sentences from the review corpora and show that sentence selection matters when it comes to constructing a new dataset for ATE.
The review ratings serve as distant training labels, with the ultimate goal of labelling the tokens of sentences as aspect terms. We exploit the review ratings in a way that assigns a sentiment score to each sentence of a review. To achieve that, we leverage the attention mechanism of a model trained to predict the rating (e.g. 1-5 stars) of a review text. Then, we select sentences with a high sentiment score, since we consider them likely to contain domain-related opinionated aspect terms. Finally, we automatically label the tokens of the selected sentences by specifying which tokens are aspect terms.
We use the automatically labelled dataset that we construct in order to train a classifier for ATE. We highlight that the resulting token-labelled dataset carries information from the distant training labels (i.e. review ratings) and can therefore be used for ATE with distant supervision.
We perform experiments to show that
ATE with distant supervision can outperform unsupervised systems and rule-based ATE methods and
sentence selection matters while building new ATE datasets. To this end, we evaluate our ATE classifier on the human annotated datasets of the SemEval-2014 ABSA contest. Our results show that distantly-supervised ATE achieves higher performance than unsupervised systems and rule-based baselines. Last but not least, our experiments reveal that training a classifier on a set of selected sentences leads to higher ATE performance compared to using the whole sentence set, i.e. sentence selection matters.
The rest of this paper is organized as follows. Section 2 presents the related work for ATE and models with attention mechanism. Our sentence selection method is described in Section 3. Section 4 analyzes our automated data labelling process. We conduct experiments and present results for ATE with distant supervision in Section 5. Finally, our work is concluded in Section 6.
2 Related Work
Most supervised systems for ATE, such as those of the SemEval ABSA contest participants, use the provided human annotated datasets in order to extract features. These features are very similar to those used in traditional Named Entity Recognition (NER) systems. Moreover, participants exploit external sources, such as WordNet, word clusters (e.g. Brown clusters), gazetteers and word embeddings. The extracted features are used to train a classifier such as Conditional Random Fields (CRF) or a Support Vector Machine (SVM).
Most unsupervised systems for ATE follow rule-based approaches. Some use relational and syntactic rules to automatically detect aspect terms. Others present a graph-based approach where nouns and adjectives of large corpora are used as nodes in a graph; they then create a list of top-ranking nouns and use it to annotate unseen data by performing exact or lemma matching.
Systems similar to [12, 13, 14] perform semi-supervised ATE. They start by creating features (e.g. new word embeddings) using large corpora (e.g. available reviews). These features are later used in order to enrich the feature space of human annotated datasets.
Although there are many publications on sentence selection, they mainly focus on summarization tasks and not on ATE. Two related works investigate sentence clustering in order to acquire new, similar sentences to improve their models. The former focuses on aspect identification while the latter on multi-aspect review summarization. Another work explores a community-leader detection problem combined with sentence selection in order to build better opinion summarization, where communities consist of clusters of sentences about the same aspect of an entity. No existing work seems to have investigated the use of attention for sentence selection in combination with ATE.
To the best of our knowledge, we are the first to perform ATE using distant supervision. We start with an attention-based model which performs sentence selection. Regarding the attention mechanism, our work is similar to prior work on self-attentive sentence embeddings. Nevertheless, we shift the focus to the sentence level instead of the word level. We argue that modelling sentence representations with sentence embeddings gives better results than using a sequence of word hidden states from a bidirectional long short-term memory (B-LSTM) network. Then, we use the selected sentences in order to create an automatically labelled dataset. Finally, we train a model on the automatically labelled dataset and perform ATE with distant supervision.
3 Dataset Construction via Sentence Selection
Given a product review, it often happens that only a subset of its sentences is non-noisy and expresses an opinion about the item. For example, reviews usually start with sentences which do not contain any useful information about the product, e.g. "I bought this laptop a few weeks ago". Such sentences are unlikely to contain opinionated aspect terms and are therefore not suitable candidates for constructing datasets for ATE.
We hypothesize that sentence selection matters when it comes to constructing datasets for ATE. In other words, we would like to show that training a model on an automatically filtered dataset (i.e. with few noisy sentences) leads to better ATE performance compared to using the whole one.
In order to perform sentence selection, we build a model to predict the rating of a review (e.g. 1-5 stars). During training, the model assigns weights to all sentences in a review. The higher the value of the weight for a particular sentence, the more important this sentence is for the review classification. Based on these weights, we then devise a method to keep the important sentences and filter out the noisy sentences.
3.1 Rating Prediction for Sentence Selection
The architecture of the rating prediction model is depicted in Fig. 2 and is inspired by the structured self-attentive sentence embedding model. However, we differ in that we force the attention to focus on the sentence level rather than on the word level. To this end, we remove the B-LSTM layer used in the original architecture and directly feed sentence embeddings (instead of word embeddings). The latter are derived using an extension of the Continuous Bag-of-Words model (CBOW).
We use the publicly available review dataset from Amazon (http://jmcauley.ucsd.edu/data/amazon/). In order to leverage the attention mechanism, we only keep reviews whose number of sentences lies between a minimum and a maximum value. We pad shorter reviews with a special tag so that all reviews have the same length. In that way, we construct a reduced review dataset, which we split into a train, validation and test set. Furthermore, to avoid having most of the attention given to the first sentences, the sentences of each review are shuffled during training. This can be seen as a form of regularization.
We convert every sentence of each review to its sentence embedding representation. Then, we concatenate all the representations into a matrix (blue components in Fig. 2). In our case, we use a publicly available pre-trained model to compute 600-dimensional sentence embeddings.
The attention mechanism is a 2-layer Multilayer Perceptron (MLP) without bias. The hidden layer uses the tanh activation function (as recommended by [20, 21]) and the output layer uses a softmax. The shapes of the MLP weight matrices are governed by a hyperparameter that corresponds to the size of the internal representation.
Similar to the self-attentive model, we compute weighted sums by multiplying the annotation matrix A with the matrix S of concatenated sentence embeddings. The resulting representation of the review is given by M = AS.
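A minimal numpy sketch of this attention step, assuming a 2-layer MLP without bias whose output is normalized with a softmax over sentences; the matrix names are our own:

```python
import numpy as np

def attention_scores(S, W1, W2):
    """Compute the annotation matrix A = softmax(W2 @ tanh(W1 @ S.T)).
    S: (n_sentences, d) sentence embeddings; W1: (d_a, d); W2: (hops, d_a).
    Returns A of shape (hops, n_sentences), each row summing to 1."""
    H = np.tanh(W1 @ S.T)                       # hidden layer, (d_a, n)
    logits = W2 @ H                             # (hops, n)
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # softmax over sentences

def review_representation(S, A):
    """Weighted sums M = A @ S: one d-dimensional vector per attention hop."""
    return A @ S
```

Each attention hop yields one weighted combination of the sentence embeddings, so the review representation M has one row per hop.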
Finally, the representation of the review is fed into another neural network (green components in Fig. 2). This is necessary because we learn the attention mechanism by minimizing the objective function with respect to a specific task, i.e. review rating prediction in our case. However, we emphasize that the overall model is task-independent.
We innovate by leveraging the attention mechanism of Fig. 2 with the goals of performing sentence selection and constructing a new dataset for ATE. To this end, we modify the original architecture in order to force the attention mechanism to focus on the sentence rather than the token level. We highlight that our goal is neither to perform nor to improve review classification and therefore our work is not comparable to review classification systems.
3.2 Attention-based Sentence Filtering
We intend to use the learned weights of the attention mechanism in order to perform sentence selection, i.e. to filter out noisy sentences. Once the model is trained, we feed all reviews once again, this time without any shuffling. The annotation matrix contains multiple attention hops for each sentence in a review. Our goal is to end up with a scalar weight per sentence which indicates its importance during the review classification process.
We sum the annotation matrix over all hops and obtain a vector with one entry per sentence. Then, we normalize this vector so that it has a minimum value of 0 and a maximum value of 1. Each element of the normalized vector maps to a weight for the corresponding sentence in the review. In turn, this weight reflects the importance of each sentence for the review rating prediction task: the less important the sentence, the lower its weight. Hence, the normalized vector gives a general view of the importance level of all sentences.
We perform sentence selection by exploiting the elements of the normalized vector. We consider that sentences with low attention scores do not carry important information and are therefore unlikely to contain opinionated aspect terms. Hence, we remove sentences with an attention score lower than a threshold. The sentence selection method is visualized in Fig. 3 and described in Algorithm 1.
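The selection step can be sketched as follows; a minimal Python version, assuming the per-hop annotations are summed and then min-max normalized as described:

```python
import numpy as np

def select_sentences(sentences, A, threshold):
    """Sum attention over hops, min-max normalize to [0, 1],
    and keep only sentences whose score meets the threshold.
    A: (hops, n_sentences) annotation matrix for one review."""
    scores = A.sum(axis=0)                       # one scalar per sentence
    lo, hi = scores.min(), scores.max()
    norm = (scores - lo) / (hi - lo) if hi > lo else np.ones_like(scores)
    return [s for s, w in zip(sentences, norm) if w >= threshold]
```

A threshold of 0 keeps every sentence, while higher thresholds progressively discard the sentences the rating model attended to least.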
4 Automated Data Labelling
Now that we have a cleaner dataset, we need to label it in order to use it to train a classifier for ATE. First, we explain our automated labelling method. Then, we give a visual example of the sentence selection and the labelling for better illustration.
4.1 Automated Labelling Method
We use the important review sentences obtained in Section 3.2 in order to construct an automatically labelled dataset for ATE. To this end, we label each token of the unlabelled opinion texts in an automated way using the IOB format (short for Inside, Outside, Begin). Tokens that are aspect terms are labelled with B. In case an aspect term consists of multiple tokens, the first token receives the B label and the rest receive the I label. Tokens that are not aspect terms are labelled with O. The automated data labelling is done using the method described below.
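IOB labelling can be sketched as follows, assuming aspect terms are given as token-index spans; the function name and span representation are illustrative:

```python
def iob_labels(tokens, aspect_spans):
    """Assign IOB labels given aspect term spans as (start, end) token-index
    pairs with end exclusive: B on the first token of each aspect term,
    I on its continuation tokens, O everywhere else."""
    labels = ["O"] * len(tokens)
    for start, end in aspect_spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels
```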
We use the following assumptions and tools in order to construct the automatically labelled dataset for ATE.
We only consider nouns and noun phrases as candidate aspect terms. Moreover, we only use nouns and noun phrases that appear at least 30 times (i.e. a minimum support of 30) in the clean dataset. This reduces the noise introduced into the aspect term labelling by infrequent nouns and noun phrases.
We exploit a set of 12 syntactic rules that are able to capture aspect terms. These rules check if there are syntactic dependencies between opinionated adjectives or verbs and nouns or noun phrases. A subset of these rules is tabulated in Table I. For the syntactic rules, we adopt a notation similar to prior work.
Example sentences and the aspect terms extracted by the rules:
- "I love this laptop" → laptop
- "The GPU is perfect" → GPU
- "Keyboard and sound are awful" → Keyboard, sound
- "The retina display is superb" → retina display
The functions of Table I can be interpreted as follows:
- The dependency predicate is true if the syntactic dependency between two tokens is of the specified type.
- The lexicon predicate is true if a token is in the sentiment lexicon.
- The marking function marks a token as an aspect term.
- The marked predicate is true if a token is already marked as an aspect term.
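One such rule can be sketched as follows: a toy illustration of an amod-style rule with a hypothetical lexicon and dependency representation, not the paper's exact implementation:

```python
# Toy dependency parse: a list of (head_index, dependent_index, relation)
# triples. The rule below marks a noun as an aspect term if it is linked
# via an `amod` relation to an adjective found in the sentiment lexicon.
SENTIMENT_LEXICON = {"perfect", "awful", "superb"}  # illustrative subset

def apply_amod_rule(tokens, pos_tags, deps):
    """Return the set of token indices marked as aspect terms by the rule."""
    marked = set()
    for head, dep, rel in deps:
        if (rel == "amod" and pos_tags[head] == "NOUN"
                and tokens[dep].lower() in SENTIMENT_LEXICON):
            marked.add(head)
    return marked
```

The full method chains several such predicates (different relation types, verb-mediated links, conjunction propagation) before emitting IOB labels.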
Algorithm 2 describes the automated method we use in order to annotate the tokens of the non-noisy sentences with the IOB format. This algorithm is similar to prior work, which focuses on achieving high precision values for ATE. Since we are interested in the F-score, we apply some modifications: we use a bigger set of syntactic rules, a bigger sentiment lexicon, and we remove the list of quality phrases.
It is obvious that the resulting dataset carries some information from training signals (i.e. review ratings) not directly related to the token-based labelling. Therefore, we use the automatically labelled dataset to perform ATE with distant supervision.
4.2 Data Labelling Visualization
Figure 4 depicts the results of our automated data labelling process applied to a 5-star review. As we can see, sentences with strong opinions are highlighted more by the attention mechanism, while those which do not carry important information receive less attention. However, the latter do not necessarily receive zero attention, because they are still relevant for the review rating prediction task. Moreover, we observe that our simple regularization method (sentence shuffling) to avoid concentrating all the attention on the first sentences works, i.e. attention can be given to any sentence in the review.
For the example of Fig. 4, we filter out sentences with an attention score below the threshold, as described in Section 3. For the remaining sentences (highlighted in red in Fig. 4), we apply the automated data labelling process. Tokens that are nouns and obey at least one of the syntactic rules are marked as aspect terms (depicted in green in Fig. 4).
With our automated dataset construction we try to optimize the quality of the extracted aspect terms. We achieve that by annotating selected sentences using a set of syntactic rules and a sentiment lexicon. The sentence selection method removes irrelevant or noisy aspect terms (e.g. "laptop" in the first sentence of Fig. 4). However, some remaining sentences might still contain noisy aspects, as depicted in orange in Fig. 4.
The data labelling visualization verifies that our assumptions (Section 4.1) are correct. More concretely, we see that our automated data labelling method manages to successfully detect and label aspect terms in the sentences by exploiting nouns and noun phrases, syntactic rules and a sentiment lexicon.
5 Experiments
We perform ATE with distant supervision by training a model using the automatically labelled dataset (hereafter denoted as ALD) as training data. We aim at measuring the effectiveness of our proposed sentence selection mechanism (Section 3) and at validating our hypothesis that sentence selection matters when it comes to constructing a new dataset for ATE. We evaluate our classifier using the human labelled test dataset (hereafter denoted as HLD) of the SemEval-2014 ABSA contest
for the laptop domain. As evaluation metric, we use the CoNLL F-score (http://www.cnts.ua.ac.be/conll2003/), F = 2PR / (P + R), where P and R stand for precision and recall. With S and G denoting the sets of retrieved and correct aspect terms of our system respectively, P = |S ∩ G| / |S| and R = |S ∩ G| / |G|.
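The metric computation can be sketched as:

```python
def conll_f1(retrieved, correct):
    """CoNLL-style precision, recall and F-score over sets of aspect terms.
    `retrieved` are the system's predictions, `correct` the gold terms."""
    tp = len(set(retrieved) & set(correct))          # true positives
    p = tp / len(retrieved) if retrieved else 0.0    # precision
    r = tp / len(correct) if correct else 0.0        # recall
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0    # harmonic mean
    return p, r, f
```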
We start by building a series of baseline models in order to prove that the syntactic rules we use are capable of extracting aspect terms. Then, we use the ALD to train a model and evaluate it on the HLD. The results show that the trained classifier outperforms the baseline models and validates our hypothesis, i.e. sentence selection matters.
5.1 Rule-based Baselines
Several rule-based systems use syntactic rules in order to identify aspect terms. A simple rule-based model may identify as aspects those nouns with any syntactic dependency to any word with positive or negative polarity (e.g. "good", "bad"). More sophisticated rule-based systems capture aspect terms by linking nouns to modifiers and adjectival complements. We wish to prove that the syntactic rules we use, combined with the SenticNet lexicon, are capable of extracting aspect terms.
We create 4 different rule-based baseline models that do not use any machine learning algorithm. Each baseline is more advanced than the previous one, since it exploits a bigger set of syntactic rules. During the prediction process, a token of the HLD is labelled as a target if
it is a noun and
satisfies at least one of the syntactic rules of each baseline.
Baseline B-1: It considers as aspect terms all nouns of the test set with any syntactic relation to any word from the SenticNet lexicon.
Baseline B-2: It is similar to B-1, but nouns are considered as aspect terms only if they have a syntactic relation to adjectives from the SenticNet lexicon.
Baseline B-3: It extends B-2 by labelling nouns as aspect terms only if the syntactic relation to a word from the SenticNet lexicon is of type amod, advmod or acomp. Moreover, B-3 labels as aspect terms nouns that are in conjunction with other aspect terms.
Baseline B-4: It includes the full set of the 12 syntactic rules we introduce. Once again, only nouns related to words from the SenticNet lexicon are considered as aspect terms.
Results for the baseline models are tabulated in Table II. We see that B-4 performs the best among all baselines. This fact proves that our set of syntactic rules, combined with the SenticNet lexicon, is capable of identifying aspect terms. We also highlight that these baselines are completely unsupervised and domain-independent, since they use only syntactic rules combined with a sentiment lexicon in order to identify aspect terms, i.e. there is no use of training data or attention.
5.2 SVM for Aspect Term Extraction
We wish to beat the baseline models by using machine learning. We start by defining 6 different threshold values and perform sentence selection using the method of Section 3 with fixed hyperparameters. This results in 6 different ALDs. Then, we use these datasets — one at a time — in order to train an SVM classifier. The classifier is evaluated on the HLD using the CoNLL F-score.
We construct baseline features in order to train the SVM classifier (we use the implementation of LIBLINEAR). More concretely, we build one-hot vectors using the sentence structure. For each token in a sentence, features are created using the identities (string representations) of the token itself and of its neighbouring tokens within a small context window. In case a token is at the beginning or at the end of a sentence, special characters are used to indicate the start and the end of the sentence respectively. In addition, we build features using word morphology. For each token in a sentence, we create extra features by taking the prefixes and suffixes (up to a length of 4) of the token. Moreover, the morphological features are enriched with surface properties of the token, such as capitalization.
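A sketch of such a feature template; the window size, affix length and feature-name format are assumptions for illustration:

```python
def token_features(tokens, i):
    """Context-window and morphology features for token i: the identities of
    the token and its neighbours (with start/end sentinels), plus prefixes
    and suffixes up to length 4."""
    padded = ["<s>", "<s>"] + tokens + ["</s>", "</s>"]
    j = i + 2  # index of token i inside the padded sequence
    feats = {f"w[{k}]={padded[j + k]}" for k in range(-2, 3)}
    w = tokens[i]
    for n in range(1, min(4, len(w)) + 1):
        feats.add(f"pre={w[:n]}")   # prefix of length n
        feats.add(f"suf={w[-n:]}")  # suffix of length n
    return feats
```

Each feature string would then be mapped to one dimension of a one-hot vector before training the linear classifier.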
The performance of the SVM classifier is tabulated in Table II. The SVM columns show that the SVM classifier beats the baseline models; the subscript in each column name indicates the value of the sentence selection threshold. These columns also validate our hypothesis that sentence selection matters when it comes to constructing a new dataset for ATE. The classifier performs worst when we use all sentences (threshold 0) for the ALD construction. All evaluation metrics increase as the sentence selection threshold increases, apart from a small fluctuation (0.4%) at one intermediate threshold. We believe that this increase is due to the fact that sentence selection removes noise from the ALD, which leads to improved classifier performance. We also notice a decrease in performance for the largest sentence selection thresholds. In these cases, we believe that the sentence selection is too harsh and removes useful information from the ALD, which results in performance deterioration.
We wish to further boost the F-score of the SVM classifier. To this end, we keep the threshold that gives the best performance so far and build a new ALD by considering both nouns and noun phrases (np) as candidate aspect terms. In that way, we further improve the performance of the classifier.
5.3 B-LSTM & CRF for Aspect Term Extraction
We exploit an architecture that employs a B-LSTM followed by a CRF classifier in order to further boost the performance for ATE using distant supervision. To this end, we choose the ALD that gives the best performance, i.e. the one constructed with the best-performing threshold and with noun phrases, and train a B-LSTM & CRF classifier (Fig. 5). Then, we evaluate our model on the HLD, i.e. the human annotated test set of the SemEval-2014 ABSA task. We also use the training set of the SemEval-2014 ABSA task as a validation set.
The B-LSTM layer of the B-LSTM & CRF classifier exploits the structure of the sentence (i.e. previous and next words of each token) in order to extract new features (depicted in orange in Fig. 5). These features are given as input to the CRF classifier, which performs sequential labelling.
In order to train the model, we use the 300-dimensional pre-trained word embeddings of fastText (https://github.com/facebookresearch/fastText). Moreover, we use 100 hidden states for each LSTM cell. The classifier is trained for a maximum of 20 epochs and uses a patience value of 5, i.e. the training terminates if there is no improvement on the validation set for more than 5 consecutive epochs. Finally, we use the Adam optimizer and a batch size of 32.
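The patience-based early stopping rule can be sketched as follows (a minimal version; the function name is illustrative):

```python
def should_stop(val_scores, patience=5):
    """Early stopping: return True when the best validation score is more
    than `patience` epochs old, i.e. no improvement for `patience` epochs."""
    if not val_scores:
        return False
    # index of the first occurrence of the best score so far
    best_epoch = max(range(len(val_scores)), key=val_scores.__getitem__)
    return len(val_scores) - 1 - best_epoch >= patience
```

Called after each epoch with the history of validation scores, this terminates training well before the 20-epoch maximum if the model stops improving.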
We perform 25 experiments in order to report mean values for the precision, recall and F-score, and we construct 95% confidence intervals for these metrics. The experimental results validate that the B-LSTM & CRF classifier outperforms all aforementioned models on all 3 evaluation metrics. The mean values are tabulated in Table II.
Last but not least, our experimental results for ATE with distant supervision reveal that our method outperforms the supervised baseline method of the SemEval-2014 ABSA contest in terms of F-score.
6 Conclusion
In this paper, we first show that sentence selection matters when it comes to building a corpus for ATE. We start from publicly available review corpora and exploit a multi-hop and task-independent attention mechanism. We force this mechanism to focus on the sentence level, i.e. to give an attention score to each sentence of a review. We then perform sentence selection by varying the attention threshold from 0 to 1.
Secondly, we annotate the tokens of the selected sentences and construct new datasets for ATE — one for each attention threshold. To this end, we employ our automated data labelling method. We train multiple classifiers using the automatically labelled datasets and evaluate them on the human labelled dataset of the SemEval-2014 ABSA contest. We observe that all evaluation metrics behave similarly to an inverted U-shaped curve as the sentence selection threshold increases.
Our experiments validate our hypothesis that sentence selection matters when it comes to constructing a new dataset for ATE. Moreover, we show that ATE with distant supervision outperforms all our unsupervised rule-based models, as well as the supervised baseline of SemEval-2014 ABSA task.
-  M. Pontiki, H. Papageorgiou, D. Galanis, I. Androutsopoulos, J. Pavlopoulos, and S. Manandhar, “Semeval-2014 task 4: Aspect based sentiment analysis,” in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
-  Z. Toh and W. Wang, “Dlirec: Aspect term extraction and term polarity classification system,” in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
-  Z. Toh and J. Su, “Nlang: Supervised machine learning system for aspect category classification and opinion target extraction,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015.
-  M. Chernyshevich, “Ihs r&d belarus: Cross-domain extraction of product features using crf,” in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), 2014.
-  M. Tkachenko and A. Simanovsky, “Named entity recognition: Exploring features,” 2012.
-  G. A. Miller, “Wordnet: A lexical database for english,” in Communications of the ACM, 1995.
-  J. Turian, L. Ratinov, and Y. Bengio, “Word representations: A simple and general method for semi-supervised learning,” 2010.
-  J. Kazama and K. Torisawa, “Inducing gazetteers for named entity recognition by large-scale clustering of dependency relations,” Proceedings of ACL-08: HLT, 2008.
-  T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in NIPS Proceedings, 2013.
-  Q. Liu, Z. Gao, B. Liu, and Y. Zhang, “Automated rule selection for opinion target extraction,” 2015.
-  A. Garcia-Pablos, M. Cuadros, and G. Rigau, “V3: Unsupervised aspect based sentiment analysis for semeval-2015 task 12,” 2015.
-  T. Hercig, T. Brychcín, L. Svoboda, M. Konkol, and J. Steinberger, “Unsupervised methods to improve aspect-based sentiment analysis in Czech,” Computación y Sistemas, 2016.
-  Y. Yin, F. Wei, L. Dong, K. Xu, M. Zhang, and M. Zhou, “Unsupervised word and dependency path embeddings for aspect term extraction,” 2016.
-  “Aspect extraction for opinion mining with a deep convolutional neural network,” Knowledge-Based Systems, 2016.
-  M. Hadano, K. Shimada, and T. Endo, “Aspect identification of sentiment sentences using a clustering algorithm,” Procedia-Social and Behavioral Sciences, vol. 27, pp. 22–31, 2011.
-  K. Shimada, R. Tadano, and T. Endo, “Multi-aspects review summarization with objective information,” Procedia-Social and Behavioral Sciences, vol. 27, pp. 140–149, 2011.
-  L. Zhu, S. Gao, S. J. Pan, H. Li, D. Deng, and C. Shahabi, “Graph-based informative-sentence selection for opinion summarization,” in Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. ACM, 2013, pp. 408–412.
-  Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” CoRR, vol. abs/1703.03130, 2017.
-  X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
-  Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient backprop,” in Neural networks: Tricks of the trade. Springer, 2012, pp. 9–48.
-  A. Giannakopoulos, C. Musat, A. Hossmann, and M. Baeriswyl, “Unsupervised aspect term extraction with b-lstm & crf using automatically labelled datasets,” in 8th Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA), 2017, in press.
-  E. Cambria, S. Poria, R. Bajpai, and B. W. Schuller, “Senticnet 4: A semantic resource for sentiment analysis based on conceptual primitives,” 2016.
-  A. Garcia-Pablos and G. Rigau, “Unsupervised acquisition of domain aspect terms for aspect based opinion mining,” 2014.
-  K. Stratos and M. Collins, “Simple semi-supervised pos tagging,” 2015.
-  R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, “LIBLINEAR: A library for large linear classification,” Journal of Machine Learning Research, 2008.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
-  T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” 2017.