Molding CNNs for text (http://arxiv.org/abs/1508.04112)
The success of deep learning often derives from well-chosen operational building blocks. In this work, we revise the temporal convolution operation in CNNs to better adapt it to text processing. Instead of concatenating word representations, we appeal to tensor algebra and use low-rank n-gram tensors to directly exploit interactions between words already at the convolution stage. Moreover, we extend the n-gram convolution to non-consecutive words to recognize patterns with intervening words. Through a combination of low-rank tensors and pattern weighting, we can efficiently evaluate the resulting convolution operation via dynamic programming. We test the resulting architecture on standard sentiment classification and news categorization tasks. Our model achieves state-of-the-art performance both in terms of accuracy and training speed. For instance, we obtain 51.2% accuracy on the fine-grained sentiment classification task.
Deep learning methods and convolutional neural networks (CNNs) among them have become de facto top performing techniques across a range of NLP tasks such as sentiment classification, question-answering, and semantic parsing. As methods, they require only limited domain knowledge to reach respectable performance with increasing data and computation, yet permit easy architectural and operational variations so as to fine tune them to specific applications to reach top performance. Indeed, their success is often contingent on specific architectural and operational choices.
CNNs for text applications make use of temporal convolution operators or filters. Similar to image processing, they are applied at multiple resolutions, interspersed with non-linearities and pooling. The convolution operation itself is a linear mapping over “n-gram vectors” obtained by concatenating consecutive word (or character) representations. We argue that this basic building block can be improved in two important respects. First, the power of n-grams derives precisely from multi-way interactions and these are clearly missed (initially) with linear operations on stacked n-gram vectors. Non-linear interactions within a local context have been shown to improve empirical performance in various tasks [Mitchell and Lapata2008, Kartsaklis et al.2012, Socher et al.2013]. Second, many useful patterns are expressed as non-consecutive phrases, such as semantically close multi-word expressions (e.g., “not that good”, “not nearly as good”). In typical CNNs, such expressions would have to come together and emerge as useful patterns after several layers of processing.
We propose to use a feature mapping operation based on tensor products instead of linear operations on stacked vectors. This enables us to directly tap into non-linear interactions between adjacent word feature vectors [Socher et al.2013, Lei et al.2014]. To offset the accompanying parametric explosion we maintain a low-rank representation of the tensor parameters. Moreover, we show that this feature mapping can be applied to all possible non-consecutive n-grams in the sequence with an exponentially decaying weight depending on the length of the span. Owing to the low-rank representation of the tensor, this operation can be performed efficiently in linear time with respect to the sequence length via dynamic programming. Similar to traditional convolution operations, our non-linear feature mapping can be applied successively at multiple levels.
We evaluate the proposed architecture in the context of sentence sentiment classification and news categorization. On the Stanford Sentiment Treebank dataset, our model obtains state-of-the-art performance among a variety of neural networks in terms of both accuracy and training cost. Our model achieves 51.2% accuracy on fine-grained classification and 88.6% on binary classification, outperforming the best published numbers obtained by a deep recursive model [Tai et al.2015] and a convolutional model [Kim2014]. On the Chinese news categorization task, our model achieves 80.0% accuracy, while the closest baseline achieves 79.2%.
Deep neural networks have recently brought about significant advancements in various natural language processing tasks, such as language modeling [Bengio et al.2003, Mikolov et al.2010], sentiment analysis [Socher et al.2013, Iyyer et al.2015, Le and Zuidema2015], syntactic parsing [Collobert and Weston2008, Socher et al.2011a, Chen and Manning2014] and machine translation [Bahdanau et al.2014, Devlin et al.2014, Sutskever et al.2014]. Models applied in these tasks exhibit significant architectural differences, ranging from recurrent neural networks [Mikolov et al.2010, Kalchbrenner and Blunsom2013] to recursive models [Pollack1990, Küchler and Goller1996], and including convolutional neural nets [Collobert and Weston2008, Collobert et al.2011, Yih et al.2014, Shen et al.2014, Kalchbrenner et al.2014, Zhang and LeCun2015].
Our model most closely relates to the latter. Since these models were originally developed for computer vision [LeCun et al.1998], their application to NLP tasks introduced a number of modifications. For instance, Collobert et al. (2011) use the max-over-time pooling operation to aggregate the features over the input sequence. This variant has been successfully applied to semantic parsing [Yih et al.2014] and information retrieval [Shen et al.2014, Gao et al.2014]. Kalchbrenner et al. (2014) instead propose a (dynamic) k-max pooling operation for modeling sentences. In addition, Kim (2014) combines CNNs of different filter widths and either static or fine-tuned word vectors. In contrast to the traditional CNN models, our method considers non-consecutive n-grams, thereby expanding the representation capacity of the model. Moreover, our model captures non-linear interactions within n-gram snippets through the use of tensors, moving beyond the direct linear projection operator used in standard CNNs. As our experiments demonstrate, these advancements result in improved performance.
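Let $\mathbf{x} \in \mathbb{R}^{L \times d}$ be the input sequence such as a document or sentence. Here $L$ is the length of the sequence and each $\mathbf{x}_i \in \mathbb{R}^d$ is a vector representing the $i$-th word. The (consecutive) n-gram vector ending at position $j$ is obtained by simply concatenating the corresponding word vectors
$$\mathbf{v}_j = [\mathbf{x}_{j-n+1}; \mathbf{x}_{j-n+2}; \cdots; \mathbf{x}_j].$$
Out-of-index words are simply set to all zeros.
The traditional convolution operator is parameterized by a filter matrix $\mathbf{m} \in \mathbb{R}^{nd \times h}$, which can be thought of as $n$ smaller filter matrices applied to each $\mathbf{x}_i$ in vector $\mathbf{v}_j$. The operator maps each n-gram vector $\mathbf{v}_j$ in the input sequence to $\mathbf{m}^\top \mathbf{v}_j \in \mathbb{R}^h$ so that the input sequence $\mathbf{x}$ is transformed into a sequence of feature representations,
$$\mathbf{m}^\top \mathbf{x} = [\mathbf{m}^\top \mathbf{v}_1, \cdots, \mathbf{m}^\top \mathbf{v}_L] \in \mathbb{R}^{L \times h}.$$
The resulting feature values are often passed through non-linearities such as the hyperbolic tangent (element-wise) as well as aggregated or reduced by “sum-over” or “max-pooling” operations for later (similar) stages of processing.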
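As a point of reference, the traditional operator described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy sizes, and the choice of tanh followed by max-pooling are assumptions made for the example.

```python
import numpy as np

def ngram_vectors(x, n):
    """Concatenate the n word vectors ending at each position (zero-padded)."""
    L, d = x.shape
    padded = np.vstack([np.zeros((n - 1, d)), x])  # out-of-index words are zeros
    return np.stack([padded[j:j + n].reshape(-1) for j in range(L)])  # (L, n*d)

def linear_convolution(x, m, n):
    """Apply the linear filter matrix m (n*d, h) to every n-gram vector, then tanh."""
    return np.tanh(ngram_vectors(x, n) @ m)  # (L, h)

rng = np.random.default_rng(0)
L, d, h, n = 5, 4, 3, 3            # toy sizes, chosen only for illustration
x = rng.normal(size=(L, d))        # input sequence of word vectors
m = rng.normal(size=(n * d, h))    # filter matrix
features = linear_convolution(x, m, n)
pooled = features.max(axis=0)      # max-pooling over positions
```

Every feature here is a linear function of the stacked word coordinates before the non-linearity, which is the limitation the tensor-based mapping is meant to address.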
The overall architecture can be easily modified by replacing the basic n-gram vectors and the convolution operation with other feature mappings. Indeed, we appeal to tensor algebra to introduce a non-linear feature mapping that operates on non-consecutive n-grams.
Typical n-gram feature mappings, where concatenated word vectors are mapped linearly to feature coordinates, may be insufficient to directly capture relevant information in the n-gram. As a remedy, we replace concatenation with a tensor product. Consider a 3-gram $(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$ and the corresponding tensor product $\mathbf{x}_1 \otimes \mathbf{x}_2 \otimes \mathbf{x}_3$. The tensor product is a 3-way array of coordinate interactions such that each entry of the tensor is given by the product of the corresponding coordinates of the word vectors
$$(\mathbf{x}_1 \otimes \mathbf{x}_2 \otimes \mathbf{x}_3)_{ijk} = x_{1i} \cdot x_{2j} \cdot x_{3k}.$$
Here $\otimes$ denotes the tensor product operator. The tensor product of a 2-gram analogously gives a two-way array or matrix $\mathbf{x}_1 \otimes \mathbf{x}_2 \in \mathbb{R}^{d \times d}$. The n-gram tensor can be seen as a direct generalization of the typical concatenated vector. (Footnote 2: To see this, consider word vectors with a “bias” term, $\mathbf{x}' = [\mathbf{x}; 1]$. The tensor product of $n$ such vectors includes the concatenated vector as a subset of tensor entries but, in addition, contains all interaction terms up to order $n$.)
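The tensor product and the bias-term argument in the footnote can be checked numerically. The sketch below is illustrative only; the variable names and the toy dimension are assumptions, and `np.einsum` is used simply as a convenient way to form the outer product.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy word-vector dimension (assumed)
x1, x2, x3 = rng.normal(size=(3, d))    # three toy word vectors

# 3-way tensor product: entry (i, j, k) is x1[i] * x2[j] * x3[k].
tensor = np.einsum("i,j,k->ijk", x1, x2, x3)

def with_bias(v):
    """Append a constant 1 so that lower-order terms survive in the product."""
    return np.append(v, 1.0)

biased = np.einsum("i,j,k->ijk", with_bias(x1), with_bias(x2), with_bias(x3))

# Slices where the other two vectors contribute only their bias entry
# recover the original word vectors, i.e. the concatenated 3-gram vector.
recovered = np.concatenate([biased[:d, d, d], biased[d, :d, d], biased[d, d, :d]])
```

The remaining entries of `biased` hold the second- and third-order interaction terms that a concatenated vector does not expose.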
Since each n-gram in the sequence is now expanded into a high-dimensional tensor using tensor products, the set of filters are analogously maintained as high-order tensors. In other words, our filters are linear mappings over the higher dimensional interaction terms rather than the original word coordinates.
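Consider again mapping a 3-gram $(\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3)$ into a feature representation. Each filter is a 3-way tensor with dimensions $d \times d \times d$. The set of $h$ filters, denoted as $\mathbf{T}$, is a 4-way tensor of dimension $d \times d \times d \times h$, where each $d^3$ slice of $\mathbf{T}$ represents a single filter and $h$ is the number of such filters, i.e., the feature dimension. The resulting $h$-dimensional feature representation $\mathbf{z} \in \mathbb{R}^h$ for the 3-gram is obtained by multiplying the filter $\mathbf{T}$ and the 3-gram tensor as follows. The $i$-th coordinate of $\mathbf{z}$ is given by
$$z_i = \sum_{jkl} T_{jkli} \cdot (\mathbf{x}_1 \otimes \mathbf{x}_2 \otimes \mathbf{x}_3)_{jkl} = \sum_{jkl} T_{jkli} \cdot x_{1j} \cdot x_{2k} \cdot x_{3l}. \tag{1}$$
The formula is equivalent to summing over all the third-order polynomial interaction terms, where tensor $\mathbf{T}$ stores the coefficients.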
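Directly maintaining the filters as full tensors leads to parametric explosion. Indeed, the size of the tensor $\mathbf{T}$ (i.e. $h \times d^3$) would be too large even for typical low-dimensional word vectors (e.g., $d = 300$ for the GloVe vectors used in our experiments). To this end, we assume a low-rank factorization of the tensor $\mathbf{T}$, represented in the Kruskal form. Specifically, $\mathbf{T}$ is decomposed into a sum of $h$ rank-1 tensors
$$\mathbf{T} = \sum_{i=1}^{h} \mathbf{P}_i \otimes \mathbf{Q}_i \otimes \mathbf{R}_i \otimes \mathbf{O}_i,$$
where $\mathbf{P}, \mathbf{Q}, \mathbf{R} \in \mathbb{R}^{h \times d}$ and $\mathbf{O} \in \mathbb{R}^{h \times h}$ are four smaller parameter matrices. $\mathbf{P}_i$ (similarly $\mathbf{Q}_i$, $\mathbf{R}_i$ and $\mathbf{O}_i$) denotes the $i$-th row of the matrix. Note that, for simplicity, we have assumed that the number of rank-1 components in the decomposition is equal to the feature dimension $h$. Plugging the low-rank factorization into Eq.(1), the feature-mapping can be rewritten in a vector form as
$$\mathbf{z} = \mathbf{O}^\top (\mathbf{P}\mathbf{x}_1 \odot \mathbf{Q}\mathbf{x}_2 \odot \mathbf{R}\mathbf{x}_3), \tag{2}$$
where $\odot$ is the element-wise product such that, e.g., $(\mathbf{a} \odot \mathbf{b})_k = a_k \times b_k$ for $\mathbf{a}, \mathbf{b} \in \mathbb{R}^m$. Note that while $\mathbf{P}\mathbf{x}_1$ (similarly $\mathbf{Q}\mathbf{x}_2$ and $\mathbf{R}\mathbf{x}_3$) is a linear mapping from each word $\mathbf{x}_1$ (similarly $\mathbf{x}_2$ and $\mathbf{x}_3$) into an $h$-dimensional feature space, higher order terms arise from the element-wise products.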
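The step from Eq.(1) to Eq.(2) can be verified numerically on a small example. The sketch below assumes the Kruskal-form factorization with four factor matrices, here called `P`, `Q`, `R` and `O` (the names and toy sizes are choices made for this illustration). It builds the full 4-way tensor explicitly, which is feasible only because the dimensions are tiny.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 5, 3                                   # toy sizes (assumed)
P, Q, R = rng.normal(size=(3, h, d))          # three (h, d) factor matrices
O = rng.normal(size=(h, h))                   # (h, h) output factor matrix
x1, x2, x3 = rng.normal(size=(3, d))          # a toy 3-gram

# Full filter tensor assembled from its rank-1 components:
# T[j, k, l, i] = sum_r P[r, j] * Q[r, k] * R[r, l] * O[r, i]
T = np.einsum("rj,rk,rl,ri->jkli", P, Q, R, O)

# Eq.(1): contract the full tensor with the 3-gram tensor product.
z_full = np.einsum("jkli,j,k,l->i", T, x1, x2, x3)

# Eq.(2): the same features from three projections and element-wise products.
z_low_rank = O.T @ ((P @ x1) * (Q @ x2) * (R @ x3))

full_params = T.size                                  # d**3 * h
low_rank_params = P.size + Q.size + R.size + O.size   # 3*h*d + h*h
```

The two feature vectors agree up to floating-point error, while the factored form stores `3*h*d + h*h` numbers instead of `h*d**3`.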
Traditional convolution uses consecutive n-grams in the feature map. Non-consecutive n-grams may nevertheless be helpful since phrases such as “not good”, “not so good” and “not nearly as good” express similar sentiments but involve variable spacings between the key words. Variable spacings are not effectively captured by fixed n-grams.
We apply the feature-mapping in a weighted manner to all n-grams thereby gaining access to patterns such as “not … good”. Let $\mathbf{z}[i,j,k] \in \mathbb{R}^h$ denote the feature representation corresponding to a 3-gram $(\mathbf{x}_i, \mathbf{x}_j, \mathbf{x}_k)$ of words in positions $i$, $j$, and $k$ along the sequence. This vector is calculated analogously to Eq.(2),
$$\mathbf{z}[i,j,k] = \mathbf{O}^\top (\mathbf{P}\mathbf{x}_i \odot \mathbf{Q}\mathbf{x}_j \odot \mathbf{R}\mathbf{x}_k).$$
We will aggregate these vectors into an $h$-dimensional feature representation at each position in the sequence. The idea is similar to neural bag-of-words models where the feature representation for a document or sentence is obtained by averaging (or summing) all the word vectors. In our case, we define the aggregate representation $\mathbf{z}_3[k]$ in position $k$ as the weighted sum of all 3-gram feature representations ending at position $k$, i.e.,
$$\mathbf{z}_3[k] = \sum_{i<j<k} \mathbf{z}[i,j,k] \cdot \lambda^{(k-j-1)+(j-i-1)} = \sum_{i<j<k} \mathbf{z}[i,j,k] \cdot \lambda^{k-i-2}, \tag{3}$$
where $\lambda \in [0,1)$ is a decay factor that down-weights 3-grams with longer spans (i.e., 3-grams that skip more in-between words). As $\lambda \to 0$ all non-consecutive 3-grams are omitted, $\mathbf{z}_3[k] = \mathbf{z}[k-2,k-1,k]$, and the model acts like a traditional model with only consecutive n-grams. When $\lambda > 0$, however, $\mathbf{z}_3[k]$ is a weighted sum of many 3-grams with variable spans.
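To make the weighting concrete, the following sketch enumerates the 3-grams ending at a given position together with their decay weights. It assumes the weight is the decay factor raised to the number of skipped words, `lam ** (k - i - 2)`; the function name and the example sentence are illustrative only.

```python
from itertools import combinations

def weighted_trigrams(words, k, lam):
    """All 3-grams (i < j < k) ending at index k, each with weight lam**(k - i - 2)."""
    result = []
    for i, j in combinations(range(k), 2):       # every pair i < j before position k
        skipped = (k - j - 1) + (j - i - 1)      # words skipped inside the span
        result.append(((words[i], words[j], words[k]), lam ** skipped))
    return result

words = ["not", "nearly", "as", "good"]
for trigram, weight in weighted_trigrams(words, k=3, lam=0.5):
    print(trigram, weight)
```

With `lam=0.5` the consecutive 3-gram ("nearly", "as", "good") keeps weight 1.0, while the two 3-grams that skip one word get weight 0.5. With `lam=0.0` only the consecutive 3-gram has non-zero weight, matching the traditional setting.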
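Directly calculating $\mathbf{z}_3[\cdot]$ according to Eq.(3) by enumerating all 3-grams would require $O(L^3)$ feature-mapping operations. We can, however, evaluate the features more efficiently by relying on the associative and distributive properties of the feature operation in Eq.(2).
Let $\mathbf{f}_3[k]$ be a dynamic programming table representing the sum of 3-gram feature representations before multiplying with matrix $\mathbf{O}$. That is, $\mathbf{z}_3[k] = \mathbf{O}^\top \mathbf{f}_3[k]$ or, equivalently, with decay factor $\lambda$,
$$\mathbf{f}_3[k] = \sum_{i<j<k} \lambda^{k-i-2} \cdot (\mathbf{P}\mathbf{x}_i \odot \mathbf{Q}\mathbf{x}_j \odot \mathbf{R}\mathbf{x}_k).$$
We can analogously define $\mathbf{f}_1[i]$ and $\mathbf{f}_2[j]$ for 1-grams and 2-grams,
$$\mathbf{f}_1[i] = \mathbf{P}\mathbf{x}_i, \qquad \mathbf{f}_2[j] = \sum_{i<j} \lambda^{j-i-1} \cdot (\mathbf{P}\mathbf{x}_i \odot \mathbf{Q}\mathbf{x}_j).$$
These dynamic programming tables can be calculated recursively according to the following formulas:
$$\begin{aligned} \mathbf{f}_1[i] &= \mathbf{P}\mathbf{x}_i, \\ \mathbf{s}_1[i] &= \lambda \cdot \mathbf{s}_1[i-1] + \mathbf{f}_1[i], \\ \mathbf{f}_2[j] &= \mathbf{s}_1[j-1] \odot \mathbf{Q}\mathbf{x}_j, \\ \mathbf{s}_2[j] &= \lambda \cdot \mathbf{s}_2[j-1] + \mathbf{f}_2[j], \\ \mathbf{f}_3[k] &= \mathbf{s}_2[k-1] \odot \mathbf{R}\mathbf{x}_k, \\ \mathbf{z}[k] &= \mathbf{O}^\top (\mathbf{f}_1[k] + \mathbf{f}_2[k] + \mathbf{f}_3[k]), \end{aligned}$$
where $\mathbf{s}_1[\cdot]$ and $\mathbf{s}_2[\cdot]$ are two auxiliary tables. The resulting $\mathbf{z}[\cdot]$ is the sum of 1, 2, and 3-gram features. We found that aggregating the 1, 2 and 3-gram features in this manner works better than using 3-gram features alone. Overall, the n-gram feature aggregation can be performed in $O(Ln)$ matrix multiplication/addition operations, and remains linear in the sequence length.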
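The dynamic program can be checked against a direct enumeration on a short random sequence. The sketch below is an assumption-laden illustration rather than the released code: the matrix names `P`, `Q`, `R`, `O`, the decay factor `lam`, and the two running sums `s1` and `s2` (standing in for the auxiliary tables) are all choices made for this example.

```python
import numpy as np

def dp_features(x, P, Q, R, O, lam):
    """Sum of weighted 1-, 2- and 3-gram features at every position, in O(L) steps."""
    h = P.shape[0]
    s1, s2 = np.zeros(h), np.zeros(h)     # decayed running sums over earlier positions
    out = []
    for x_k in x:
        f1 = P @ x_k                      # 1-gram features
        f2 = s1 * (Q @ x_k)               # all 2-grams ending here
        f3 = s2 * (R @ x_k)               # all 3-grams ending here
        out.append(O.T @ (f1 + f2 + f3))
        s1 = lam * s1 + f1                # decay older terms, then add the new one
        s2 = lam * s2 + f2
    return np.stack(out)

def brute_force(x, P, Q, R, O, lam):
    """Same quantity by explicit enumeration of all (non-consecutive) n-grams."""
    out = []
    for k in range(x.shape[0]):
        f = P @ x[k]
        for j in range(k):
            f = f + lam ** (k - j - 1) * (P @ x[j]) * (Q @ x[k])
        for i in range(k):
            for j in range(i + 1, k):
                f = f + lam ** (k - i - 2) * (P @ x[i]) * (Q @ x[j]) * (R @ x[k])
        out.append(O.T @ f)
    return np.stack(out)

rng = np.random.default_rng(0)
L, d, h, lam = 7, 4, 3, 0.5               # toy sizes and decay (assumed)
x = rng.normal(size=(L, d))
P, Q, R = rng.normal(size=(3, h, d))
O = rng.normal(size=(h, h))
z_dp = dp_features(x, P, Q, R, O, lam)
z_bf = brute_force(x, P, Q, R, O, lam)
```

Both functions return an `(L, h)` array and agree up to floating-point error. The enumeration costs on the order of `L**3` feature mappings, whereas the recursion visits each position once.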
The dynamic programming algorithm described above maps the original input sequence to a sequence of feature representations $\mathbf{z} = \mathbf{z}[1{:}L] \in \mathbb{R}^{L \times h}$. As in standard convolutional architectures, the resulting sequence can be used in multiple ways. One can directly aggregate it and feed it to a classifier, or expose it to non-linear element-wise transformations and use it as an input to another sequence-to-sequence feature mapping.
The simplest strategy (adopted in neural bag-of-words models) would be to average the feature representations and pass the resulting averaged vector directly to a softmax output unit
$$\tilde{\mathbf{y}} = \mathrm{softmax}\left( \mathbf{W}^\top \frac{1}{L} \sum_{i=1}^{L} \mathbf{z}[i] \right).$$
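A minimal sketch of this averaging strategy follows. The output matrix name `W`, its `(h, m)` shape and the toy sizes are assumptions made for the example.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(a - a.max())
    return e / e.sum()

def predict(z, W):
    """Average the (L, h) feature sequence over positions, then apply softmax."""
    return softmax(W.T @ z.mean(axis=0))

rng = np.random.default_rng(0)
L, h, m = 6, 3, 5                 # sequence length, feature dim, number of labels
z = rng.normal(size=(L, h))       # stand-in for the extracted feature sequence
W = rng.normal(size=(h, m))       # softmax output matrix
y_tilde = predict(z, W)           # probability distribution over the m labels
```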
Our architecture, as illustrated in Figure 1, includes two additional refinements. First, we add a non-linear activation function after each feature representation, i.e. $\mathbf{z}' = \mathrm{ReLU}(\mathbf{z} + \mathbf{b})$, where $\mathbf{b}$ is a bias vector and ReLU is the rectified linear unit function. Second, we stack multiple tensor-based feature mapping layers. That is, the input sequence $\mathbf{x}$ is first processed into a feature sequence and passed through the non-linear transformation to obtain $\mathbf{z}^{(1)}$. The resulting feature sequence is then analogously processed by another layer, parameterized by a different set of feature-mapping matrices $\mathbf{P}, \cdots, \mathbf{O}$, to obtain a higher-level feature sequence $\mathbf{z}^{(2)}$, and so on. The output feature representations of all these layers are averaged within each layer and concatenated as shown in Figure 1. The final prediction is therefore obtained on the basis of features across the levels.
Following standard practices, we train our model by minimizing the cross-entropy error on a given training set. For a single training sequence $\mathbf{x}$ and the corresponding gold label $\mathbf{y} \in [0,1]^m$, the error is defined as
$$\mathrm{loss}(\mathbf{x}, \mathbf{y}) = -\sum_{l=1}^{m} y_l \log(\tilde{y}_l),$$
where $\tilde{\mathbf{y}}$ is the softmax output and $m$ is the number of possible output labels.
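The stacked architecture and the training loss can be sketched end to end as follows. This is a rough illustration under stated assumptions: each layer is taken to compute the decayed sum of 1-, 2- and 3-gram features followed by ReLU, and the names (`tensor_layer`, `forward`, `cross_entropy`), the toy sizes and the bias value are all hypothetical.

```python
import numpy as np

def tensor_layer(x, P, Q, R, O, b, lam):
    """One layer: decayed sum of 1/2/3-gram features per position, then ReLU(z + b)."""
    s1 = np.zeros(P.shape[0])
    s2 = np.zeros(P.shape[0])
    out = []
    for x_k in x:
        f1 = P @ x_k
        f2 = s1 * (Q @ x_k)
        f3 = s2 * (R @ x_k)
        out.append(np.maximum(O.T @ (f1 + f2 + f3) + b, 0.0))
        s1, s2 = lam * s1 + f1, lam * s2 + f2
    return np.stack(out)

def forward(x, layers, W, lam):
    """Stack layers, average each layer's output over positions, concatenate, softmax."""
    pooled = []
    for params in layers:
        x = tensor_layer(x, *params, lam)     # output of one layer feeds the next
        pooled.append(x.mean(axis=0))
    logits = W.T @ np.concatenate(pooled)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(y_pred, y_gold):
    """Cross-entropy between a predicted distribution and a gold label vector."""
    return -float(np.sum(y_gold * np.log(y_pred)))

rng = np.random.default_rng(0)
L, d, h, m, lam = 6, 4, 3, 5, 0.5             # toy sizes (assumed)
x = rng.normal(size=(L, d))

def make_layer(in_dim):
    P, Q, R = rng.normal(size=(3, h, in_dim))
    return P, Q, R, rng.normal(size=(h, h)), np.full(h, 0.01)  # bias value is illustrative

layers = [make_layer(d), make_layer(h)]       # second layer reads h-dim features
W = np.zeros((len(layers) * h, m))            # softmax matrix starts at zero
y_gold = np.eye(m)[2]                         # one-hot gold label
loss = cross_entropy(forward(x, layers, W, lam), y_gold)
```

Because `W` starts at zero, the initial prediction is uniform and the loss equals `log(m)`.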
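The set of model parameters (e.g. $\mathbf{P}, \cdots, \mathbf{O}$ in each layer) are updated via stochastic gradient descent using the AdaGrad algorithm [Duchi et al.2011].
We initialize matrices $\mathbf{P}, \mathbf{Q}, \mathbf{R}$ from the uniform distribution $\left[-\sqrt{3/d}, \sqrt{3/d}\right]$ and similarly $\mathbf{O} \sim U\left[-\sqrt{3/h}, \sqrt{3/h}\right]$. In this way, each row of the matrices is a unit vector in expectation, and each rank-1 filter slice has unit variance as well,
$$\mathbb{E}\left[ \left\| \mathbf{P}_i \otimes \mathbf{Q}_i \otimes \mathbf{R}_i \otimes \mathbf{O}_i \right\|^2 \right] = 1.$$
In addition, the parameter matrix $\mathbf{W}$ in the softmax output layer is initialized as zeros, and the bias vectors $\mathbf{b}$ for ReLU activation units are initialized to a small positive constant.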
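The claim that each row is a unit vector in expectation can be checked empirically. The sketch below assumes a symmetric uniform interval with bound `sqrt(3 / cols)`, which gives each entry variance `1 / cols`; this bound is inferred from the unit-norm requirement rather than quoted. The row count of 1000 is chosen only to make the empirical average stable.

```python
import numpy as np

def init_uniform(rows, cols, rng):
    """Uniform init on [-sqrt(3/cols), sqrt(3/cols)], so E[||row||^2] = 1."""
    bound = np.sqrt(3.0 / cols)
    return rng.uniform(-bound, bound, size=(rows, cols))

rng = np.random.default_rng(0)
d, rows = 300, 1000                       # 300-dim inputs; many rows for a stable estimate
P = init_uniform(rows, d, rng)
mean_sq_norm = float(np.mean(np.sum(P ** 2, axis=1)))   # should be close to 1
```

A uniform variable on `[-a, a]` has variance `a**2 / 3`, so with `a = sqrt(3 / d)` the `d` entries of a row contribute `1 / d` each to the expected squared norm.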
We apply two common techniques to avoid overfitting during training. First, we add L2 regularization to all parameter values with the same regularization weight. In addition, we apply dropout [Hinton et al.2012], randomly dropping units of the output feature representations at each level.
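A compact sketch of one AdaGrad step with L2 regularization, together with a dropout mask, is given below. It is not taken from the released code: the function names, the `l2` and `eps` values, and the use of inverted dropout scaling are assumptions. Only the 0.01 learning rate matches the value reported for sentiment classification.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, l2=1e-5, eps=1e-8):
    """One AdaGrad update; the L2 penalty adds l2 * theta to the gradient."""
    g = grad + l2 * theta
    accum = accum + g * g                          # running sum of squared gradients
    theta = theta - lr * g / (np.sqrt(accum) + eps)
    return theta, accum

def dropout(z, p, rng):
    """Zero each unit with probability p and rescale the rest (inverted dropout)."""
    if p == 0.0:
        return z
    mask = rng.random(z.shape) >= p
    return z * mask / (1.0 - p)

theta0 = np.zeros(3)
grad = np.array([1.0, -2.0, 0.5])
theta1, accum1 = adagrad_step(theta0, grad, np.zeros(3))
theta2, accum2 = adagrad_step(theta1, grad, accum1)
```

The first step moves every coordinate by roughly the learning rate regardless of gradient scale, and later steps shrink as squared gradients accumulate.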
Model | Fine-grained Dev | Fine-grained Test | Binary Dev | Binary Test | Time per epoch (s) | Time per 10k samples (s) |
---|---|---|---|---|---|---|
RNN | | 43.2 | | 82.4 | - | - |
RNTN | | 45.7 | | 85.4 | 1657 | 1939 |
DRNN | | 49.8 | | 86.8 | 431 | 504 |
RLSTM | | 51.0 | | 88.0 | 140 | 164 |
DCNN | | 48.5 | | 86.9 | - | - |
CNN-MC | | 47.4 | | 88.1 | 2452 | 156 |
CNN | 48.8 | 47.2 | 85.7 | 86.2 | 32 | 37 |
PVEC | | 48.7 | | 87.8 | - | - |
DAN | | 48.2 | | 86.8 | 73 | 5 |
SVM | 40.1 | 38.3 | 78.6 | 81.3 | - | - |
NBoW | 45.1 | 44.5 | 80.7 | 82.0 | 1 | 1 |
Ours | 49.5 | 50.6 | 87.0 | 87.0 | 28 | 33 |
Ours + phrase labels | 53.4 | 51.2 | 88.9 | 88.6 | 445 | 28 |
We evaluate our model on a sentence sentiment classification task and a news categorization task. For sentiment classification, we use the Stanford Sentiment Treebank benchmark [Socher et al.2013]. The dataset consists of 11855 parsed English sentences annotated at both the root (i.e. sentence) level and the phrase level using 5-class fine-grained labels. We use the standard 8544/1101/2210 split for training, development and testing respectively. Following previous work, we also evaluate our model on the binary classification variant of this benchmark, ignoring all neutral sentences. The binary version has 6920/872/1821 sentences for training, development and testing.
For the news categorization task, we evaluate on the Sogou Chinese news corpora (http://www.sogou.com/labs/dl/c.html). The dataset contains 10 different news categories in total, including Finance, Sports, Technology and Automobile. We use 79520 documents for training, 9940 for development and 9940 for testing. To obtain Chinese word boundaries, we use LTP-Cloud (http://www.ltp-cloud.com/intro/en/, https://github.com/HIT-SCIR/ltp), an open-source Chinese NLP platform.
We implement the standard SVM method and the neural bag-of-words model NBoW as baseline methods in both tasks. To assess the proposed tensor-based feature map, we also implement a convolutional neural network model CNN by replacing our filter with traditional linear filter. The rest of the framework (such as feature averaging and concatenation) remains the same.
In addition, we compare our model with a wide range of top-performing models on the sentence sentiment classification task. Most of these models fall into either the category of recursive neural networks (RNNs) or the category of convolutional neural networks (CNNs). The recursive neural network baselines include standard RNN [Socher et al.2011b], RNTN with a small core tensor in the composition function [Socher et al.2013], the deep recursive model DRNN [Irsoy and Cardie2014] and the most recent recursive model using long-short-term-memory units RLSTM [Tai et al.2015]. These recursive models assume the input sentences are represented as parse trees. As a benefit, they can readily utilize annotations at the phrase level. In contrast, convolutional neural networks are trained at the sequence level, taking the original sequence and its label as training input. Such convolutional baselines include the dynamic CNN with k-max pooling DCNN [Kalchbrenner et al.2014] and the multi-channel convolutional model CNN-MC by Kim (2014). To leverage the phrase-level annotations in the Stanford Sentiment Treebank, all phrases and the corresponding labels are added as separate instances when training the sequence models. We follow this strategy and report results with and without phrase annotations.
The word vectors are pre-trained on much larger unannotated corpora to achieve better generalization given the limited amount of training data [Turian et al.2010]. In particular, for the English sentiment classification task, we use the publicly available 300-dimensional GloVe word vectors trained on the Common Crawl with 840B tokens [Pennington et al.2014]. This choice of word vectors follows most recent work, such as DAN [Iyyer et al.2015] and RLSTM [Tai et al.2015]. For Chinese news categorization, there are no widely-used publicly available word vectors. Therefore, we run word2vec [Mikolov et al.2013] to train 200-dimensional word vectors on the 1.6 million Chinese news articles. Both sets of word vectors are normalized to unit norm (i.e. $\|\mathbf{w}\|_2 = 1$) and are fixed in the experiments without fine-tuning.
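The unit-norm preprocessing of the word vectors amounts to a row-wise rescaling of the embedding matrix. The short sketch below uses a hypothetical function name and a random matrix in place of real GloVe or word2vec vectors.

```python
import numpy as np

def normalize_rows(E, eps=1e-12):
    """Scale every row (word vector) of E to unit Euclidean norm."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, eps)      # eps guards against all-zero rows

rng = np.random.default_rng(0)
E = rng.normal(size=(10, 300))             # stand-in for a small embedding table
E_unit = normalize_rows(E)
```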
We perform an extensive search on the hyperparameters of our full model, our implementation of the CNN model (with linear filters), and the SVM baseline. For our model and the CNN model, the initial learning rate of AdaGrad is fixed to 0.01 for sentiment classification and 0.1 for news categorization, and the L2 regularization weight is likewise fixed per task based on preliminary runs. The rest of the hyperparameters are randomly chosen: the number of feature-mapping layers, the n-gram order $n$, the hidden feature dimension, the dropout probability, and the length decay $\lambda$. We run each configuration 3 times to explore different random initializations. For the SVM baseline, we tune the L2 regularization weight, the word cut-off frequency (i.e. pruning words appearing fewer than this many times) and the n-gram feature order.
The source code is implemented in Python using the Theano library [Bergstra et al.2010], a flexible linear algebra compiler that can optimize user-specified computations (models) with efficient automatic low-level implementations, including (back-propagated) gradient calculation.
Table 1 presents the performance of our model and other baseline methods on the Stanford Sentiment Treebank benchmark. Our full model obtains the highest accuracy on both the development and test sets. Specifically, it achieves 51.2% and 88.6% test accuracies on fine-grained and binary tasks respectively. (Footnote 5: The best hyperparameter configuration based on dev accuracy uses 3 layers and 3-gram tensors (n=3).) As shown in Table 2, our model performance is relatively stable: it maintains high accuracy with around 0.5% standard deviation under different initializations and dropout rates.
Our full model is also several times faster than other top-performing models. For example, the convolutional model with multi-channel (CNN-MC) runs over 2400 seconds per training epoch. In contrast, our full model (with 3 feature layers) runs on average 28 seconds with only root labels and on average 445 seconds with all labels.
Our results also show that the CNN model, where our feature map is replaced with traditional linear map, performs worse than our full model. This observation confirms the importance of the proposed non-linear, tensor-based feature mapping. The CNN model also lags behind the DCNN and CNN-MC baselines, since the latter two propose several advancements over standard CNN.
Dataset | Split | Accuracy |
---|---|---|
Fine-grained | Dev | 52.5 (0.5) % |
Fine-grained | Test | 51.4 (0.6) % |
Binary | Dev | 88.4 (0.3) % |
Binary | Test | 88.4 (0.5) % |
Table 3 reports the results of SVM, NBoW and our model on the news categorization task. Since the dataset is much larger compared to the sentiment dataset (80K documents vs. 8.5K sentences), the SVM method is a competitive baseline. It achieves 78.5% test accuracy compared to 74.4% and 79.2% obtained by the neural bag-of-words model and the CNN model. In contrast, our model obtains 80.0% accuracy on both the development and test sets, outperforming the three baselines by at least a 0.8% absolute margin on the test set. The best hyperparameter configuration in this task uses fewer feature layers (specifically, 2 layers) and a lower n-gram order than in the sentiment classification task. We hypothesize that the difference is due to the nature of the two tasks: document classification requires handling fewer compositions or context interactions than sentiment analysis.
Model | Dev Acc. | Test Acc. |
---|---|---|
SVM (1-gram) | 77.5 | 77.4 |
SVM (2-gram) | 78.2 | 78.0 |
SVM (3-gram) | 78.2 | 78.5 |
NBoW | 74.4 | 74.4 |
CNN | 79.5 | 79.2 |
Ours | 80.0 | 80.0 |
We next investigate the impact of hyperparameters in our model performance. We use the models trained on fine-grained sentiment classification task with only root labels.
[Figure panels: (1) positive prediction; (2) negative prediction; (3) negative prediction; (4) positive prediction; (5) negative prediction; (6) negative prediction (ground truth: negative); (7) positive prediction (ground truth: positive)]
We plot the fine-grained sentiment classification accuracies obtained during hyperparameter grid search. Figure 2 illustrates how the number of feature layers impacts the model performance. As shown in the figure, adding higher-level features clearly improves the classification accuracy across various hyperparameter settings and initializations.
We also analyze the effect of modeling non-consecutive n-grams. Figure 3 splits the model accuracies according to the choice of span decaying factor $\lambda$. Note that when $\lambda = 0$, the model applies feature extraction to consecutive n-grams only. As shown in Figure 3, this setting leads to a consistent performance drop. This result confirms the importance of handling non-consecutive n-gram patterns.
Finally, we verify the effectiveness of the rectified linear unit activation function (ReLU) by comparing it with no activation (or identity activation $f(x) = x$). As shown in Figure 4, our model with ReLU activation generally outperforms its variant without ReLU. The observation is consistent with previous work on convolutional neural networks and other neural network models.
Figure 5 gives examples of input sentences and the corresponding predictions of our model in fine-grained sentiment classification. To see how our model captures the sentiment in different local contexts, we apply the learned softmax activation to the extracted features at each position without taking the average. That is, for each index $i$, we obtain the local sentiment $\tilde{\mathbf{y}}[i] = \mathrm{softmax}(\mathbf{W}^\top \mathbf{z}[i])$. We plot the expected sentiment scores $\sum_{s=-2}^{2} s \cdot \tilde{y}[i](s)$, where a score of 2 means “very positive”, 0 means “neutral” and -2 means “very negative”. As shown in the figure, our model successfully learns negation and double negation. The model also identifies positive and negative segments appearing in the sentence.
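The per-position expected sentiment score can be computed as in the sketch below. It assumes the five classes are ordered from “very negative” to “very positive” and mapped to the scores -2 to 2; the class ordering, the function name and the toy inputs are assumptions made for the illustration.

```python
import numpy as np

SCORES = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # assumed class order: very neg. .. very pos.

def local_expected_sentiment(z, W):
    """Softmax at every position (no averaging), then the expected score per position."""
    logits = z @ W                                   # (L, 5)
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs @ SCORES                            # (L,) values in [-2, 2]

# Toy example: one feature whose sign drives the sentiment.
z = np.array([[10.0], [0.0], [-10.0]])               # (L=3, h=1) local features
W = np.array([[-2.0, -1.0, 0.0, 1.0, 2.0]])          # (h=1, 5) softmax weights
scores = local_expected_sentiment(z, W)
```

In this toy setup the three positions score close to 2, exactly 0, and close to -2 respectively.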
We proposed a feature mapping operator for convolutional neural networks by modeling n-gram interactions based on tensor product and evaluating all non-consecutive n-gram vectors. The associated parameters are maintained as a low-rank tensor, which leads to efficient feature extraction via dynamic programming. The model achieves top performance on standard sentiment classification and document categorization tasks.
We thank Kai Sheng Tai, Mohit Iyyer and Jordan Boyd-Graber for answering questions about their paper. We also thank Yoon Kim, the MIT NLP group and the reviewers for their comments. We acknowledge the support of the U.S. Army Research Office under grant number W911NF-10-1-0533. The work is developed in collaboration with the Arabic Language Technologies (ALT) group at Qatar Computing Research Institute (QCRI). Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.
[Bengio et al.2003] A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.
[Mikolov et al.2013] Efficient estimation of word representations in vector space. CoRR.
[Socher et al.2011b] Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 151–161. Association for Computational Linguistics.
[Turian et al.2010] Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL ’10. Association for Computational Linguistics.