Text classification is one of the most basic and important tasks in the field of machine learning. Traditionally, the use of term frequency inverse document frequency (tf-idf) as a representation of documents, and general classifiers such as support vector machines (SVM) or logistic regression have been utilized for statistical classification.
In recent years, deep learning methods have been introduced, which further led to higher accuracies for text classification. The major deep learning models utilized in text classification are largely based on convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
Meanwhile, in the image classification domain, capsule networks Hinton et al. (2011); Sabour et al. (2017) proved to be effective at capturing the spatial relationships of high-level features by employing a whole vector of instantiation parameters. We apply this network structure to the classification of text and argue that it also has advantages in this field.
The main contributions of this work are three-fold. First, we apply capsule networks with dynamic routing to text classification and achieve comparable results to previous methods. Second, we propose an alternative routing method that achieves higher accuracy compared to dynamic routing. Third, we propose the use of an ELU-gate Dauphin et al. (2016) to propagate relevant information.
2 Related Work
2.1 Text classification
As deep learning architectures have become more popular, they have also been applied to text classification. CNN models were originally popularized for text classification by Kim (2014), who applied convolutions directly to sentences. CNNs were further explored at the character level by Zhang et al. (2015). Dynamic convolutional neural networks (DCNNs) Kalchbrenner et al. (2014) introduce a unique method of pooling by dynamically incorporating the length of a sentence when determining the pooling parameter.
While it is straightforward to utilize RNNs for text classification because of the sequential nature of text, naive RNNs have not been as successful as anticipated. However, with long short-term memory (LSTM) and initializations based on sequence autoencoders Dai and Le (2015) or small perturbations added to LSTM word embeddings Miyato et al. (2017), RNNs have also achieved strong results.
Additionally, self-attention networks, models without any convolutions or recurrence, have also been successfully applied to text classification Shen et al. (2018).
2.2 Capsule networks
Because the convolution operator in a CNN is represented by a weighted sum of lower-layer activations, it is difficult to express the features of a complex object as information moves into the upper layers. This has the disadvantage of not considering the hierarchical relationships between local features. CNNs utilize pooling to overcome these shortcomings. Pooling reduces the computational complexity of convolution operations and captures the invariance of local features. However, pooling operations lose information regarding spatial relationships and are likely to misclassify objects based on their orientation or proportion.
The capsule network is a structured model that solves many of the problems inherent to CNNs. Capsules in capsule networks are locally invariant groups that learn to recognize the existence of visual entities and encode their properties into vectors. While neurons operate independently in a CNN, capsule networks utilize a nonlinear function called squashing because capsules (groups of neurons) are represented as a vector.
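The squashing nonlinearity can be sketched as follows (a minimal NumPy version; the function name and the small epsilon for numerical stability are our own):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule squashing: preserves the direction of s while scaling its
    norm to ||s||^2 / (1 + ||s||^2), so short vectors shrink toward zero
    and long vectors approach (but never reach) unit length."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

v = squash(np.array([3.0, 4.0]))  # input norm 5 -> output norm 25/26
```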
Capsules consider the spatial relationships between entities and learn these relationships via dynamic routing Sabour et al. (2017). Dynamic routing determines the connection strength between lower-level and upper-level capsules through repetitive routing based on a coupling coefficient. This coupling coefficient is utilized to measure the similarity between the vectors that predict the upper capsule and lower capsule, and learns which lower-level capsule must be directed to which upper-level capsule. Through this process, capsules learn to represent the properties of a given entity.
Our goal is to apply capsule networks to text classification and to modify them according to our purpose. Capsules have the ability to represent the attributes of partial entities and express semantic meaning in a wider space by representing entities with vectors rather than scalars. In this regard, capsules are suitable for expressing a sentence or document as a vector. Figure 1 depicts the general structure of the proposed model. The input of the network is a document X ∈ ℝ^{L×d}, where L is the length of the document and d is the embedding size.
The second layer is a feature map obtained with convolutions, where the kernel size is k and the number of filters is n. To propagate relevant information, we apply an ELU-gate unit Dauphin et al. (2016), defined as

F = ELU(X ∗ W₁ + b₁) ⊙ (X ∗ W₂ + b₂),

where W₁ and W₂ are weights, b₁ and b₂ are bias terms, and ⊙ is the element-wise multiplication operator. This ELU-gate unit acts as a control tower by selecting which features are activated. Unlike pooling, the ELU-gate unit does not lose spatial information.
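The gating computation itself can be sketched as follows (a minimal NumPy version with our own function names; the two convolution outputs are assumed to be precomputed):

```python
import numpy as np

def elu(x, alpha=1.0):
    """Exponential linear unit: identity for x > 0, alpha*(exp(x)-1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_gate(conv_a, conv_b):
    """ELU-gate: one convolution output, passed through ELU, gates the
    other element-wise. Every position is kept, so unlike pooling no
    spatial information is discarded."""
    return elu(conv_a) * conv_b
```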
The next layer is a convolutional capsule layer with C channels of d-dimensional capsules, where the kernel size is k. Because the classifier is connected locally to the feature map, it is difficult for the classifier to handle variations in transformation. Some studies have shown that utilizing a large kernel size allows a network to gather information from a much larger region of the receptive field Peng et al. (2017). Because we do not utilize pooling, we instead increased the kernel size to enlarge the viewpoint of the network. We then applied the nonlinear squashing function Sabour et al. (2017) in the convolutional capsule layer.
The final layer is the text capsule layer. We utilized two different routing methods from the convolutional capsule layer to the text capsule layer, as described in the subsections below.
3.1.1 Capsule network with dynamic routing
In Sabour et al. (2017), the capsule network updates the weights of the coupling coefficients through an iterative routing process and determines the degree to which lower capsules are directed to upper capsules. Each coupling coefficient is determined by the degree of similarity between an upper capsule and the prediction vector for it from a lower capsule.
û_{j|i} = W_{ij} u_i,   s_j = Σ_i c_{ij} û_{j|i},   c_{ij} = exp(b_{ij}) / Σ_k exp(b_{ik}),   v_j = squash(s_j),

where j = 1, …, J and J is the number of classes. c_{ij} is the coupling coefficient; as the softmax output of the logit b_{ij}, it is updated in every routing iteration. b_{ij} is determined by the degree of similarity between the lower capsule and the upper capsule it predicts. The predicted vector û_{j|i} is expressed by a matrix operation between the weight matrix W_{ij} and the lower capsule u_i.
The routing procedure is defined as follows:
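The iterative procedure can be sketched as follows (a NumPy sketch following Sabour et al. (2017); shapes and variable names are our own, not the authors' code):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def softmax(b, axis):
    e = np.exp(b - b.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iters=3):
    """u_hat: prediction vectors from lower capsules,
    shape (num_lower, num_upper, dim). Returns the upper
    capsules v with shape (num_upper, dim)."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))           # routing logits, start uniform
    for _ in range(num_iters):
        c = softmax(b, axis=1)                     # coupling coefficients c_ij
        s = np.sum(c[..., None] * u_hat, axis=0)   # weighted sum per upper capsule
        v = squash(s)
        b = b + np.sum(u_hat * v[None], axis=-1)   # increase logits by agreement
    return v
```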
3.1.2 Capsule network with static routing
For the image domain, it is reasonable to consider the spatial hierarchies of lower-level entities, and routing can recognize objects similarly to the manner in which we recognize objects. However, in the language domain, there is a great deal of freedom in the way that documents and emotions can be expressed. For example, in the original capsule network, learning to correctly represent the positional characteristics of the eyes, nose, and mouth when categorizing faces in images was a major challenge. In the case of documents, however, it is difficult to say that two documents are absolutely different simply because the order of their sentences differs. From this perspective, it becomes natural to suggest a static routing scheme as follows:
v_j = squash(Σ_i W_{ij} u_i),

where W_{ij} is a weight matrix and i indexes the N capsules in the convolutional capsule layer. Each lower capsule u_i is multiplied by W_{ij} to express the upper entity as a capsule of d-dimensional vectors, and v_j, the result of applying the squashing function, represents the text capsule layer. This differs from fully connected scalar operations and has the advantage of representing documents as vectors.
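With no coupling coefficients to iterate, static routing reduces to a single transform-sum-squash step; a sketch (NumPy; shapes and variable names are our own):

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def static_routing(u, W):
    """u: lower capsules, shape (num_lower, d_in).
    W: learned weights, shape (num_lower, num_upper, d_in, d_out).
    Prediction vectors are summed directly (no coupling coefficients)
    and squashed into the text capsules, shape (num_upper, d_out)."""
    u_hat = np.einsum('id,ijde->ije', u, W)  # per-pair prediction vectors
    return squash(u_hat.sum(axis=0))
```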
4 Experimental Settings
We tested our model on seven different benchmark datasets, as shown in Table 1. The details for each dataset are as follows:
- 20news
This dataset is a collection of 20,000 news documents partitioned among 20 different newsgroups.
- Reuters10
We utilize the Reuters corpus provided by the Python natural language toolkit NLTK, where documents are initially tagged with 90 categories. In order to limit the number of classes, we selected the 10 most common categories (earn, acq, money-fx, grain, crude, trade, interest, wheat, ship, corn) and the corresponding documents.
- MR (2004)
Pang and Lee (2004) A corpus containing 1,000 positive and 1,000 negative preprocessed movie reviews.
- MR (2005)
Pang and Lee (2005) (http://www.cs.cornell.edu/people/pabo/movie-review-data/) A larger movie review dataset, which contains 5,331 positive sentences and 5,331 negative sentences.
4.2 Hyperparameters and training
We trained our models using the Adam optimizer Kingma and Ba (2014) with exponentially decaying learning rates. We monotonically decreased the learning rate by decaying it by a factor of 0.99 in every epoch. We utilized a dropout rate of 0.5 and an embedding size of 300.
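The schedule above can be written as a one-line function (a sketch; the base rate is whichever value Table 2 specifies):

```python
def decayed_lr(base_lr, epoch, decay=0.99):
    """Exponentially decaying rate: base_lr * decay**epoch,
    i.e. multiply by 0.99 once per epoch."""
    return base_lr * decay ** epoch
```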
In particular, the number of capsules is set to 6, according to experiments on a held-out dataset. This is a very low number compared to Sabour et al. (2017), which employed 1,152 capsules for image classification. Our conjecture for this large difference is that the complexity of the generated feature map is lower in our benchmark tasks. If the complexity of a generated feature map is low, a capsule is expected to provide an appropriate representation of the entity, even without dynamic routing.
Our model was trained on a GPU utilizing TensorFlow Abadi et al. (2016), with the hyperparameter settings shown in Table 2.
The CNN classification model from Kim (2014) was utilized as the baseline model for experimental comparisons. We performed appropriate parameter tuning for each dataset; the settings are listed in Table 3.
5 Results and analysis
5.1 Classification accuracies
| Model | 20news | Reuters10 | MR (2004) | MR (2005) | TREC-QA | MPQA | IMDb |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN-non-static Kim (2014) | - | - | - | 81.4 | 92.7 | 89.4 | - |
| DCNN Kalchbrenner et al. (2014) | - | - | - | - | 93.0 | - | - |
| SA-LSTM Dai and Le (2015) | 84.4 | - | - | 80.7 | - | - | 92.76 |
| Virtual adversarial LSTM Miyato et al. (2017) | - | - | - | 83.4 | - | - | 94.1 |
| Bi-BloSAN Shen et al. (2018) | - | - | - | - | 94.8 | 90.4 | - |
Our experimental results indicate that the accuracy of the static-routing model is higher than that of the dynamic-routing model, as shown in Table 4. We believe this is due to the higher complexity of the second layer, which is a feature map utilizing convolutions.
5.2 Capsule networks over CNNs
Static routing does not employ all of the theoretical machinery of the capsule network. However, learning in vector units still differs from a conventional CNN, so we experimented with how vector-based learning affects the performance of the model. Figure 3 presents performance results as the capsule dimension is varied while the number of trainable parameters is kept constant. The experimental results show higher accuracy as the dimension increases. Therefore, when training with vectors, the capacity to represent the information of entities increases and it becomes possible to express their various attributes. Using static routing does not lose the characteristics of a capsule, so we also examined the ability of capsules to represent properties under static routing. We used MNIST because there are limitations to visualizing minute changes in words. We performed a perturbation test after adding an ELU-gate to the original capsule network structure and changing dynamic routing to static routing. The experimental method is the same as in Sabour et al. (2017).
Figure 3 shows that each row captures various properties such as rotation, thickness, and scale. Therefore, the use of static routing does not lose the essential characteristics of a capsule. This differs from a CNN, which computes with independent neurons.
We measured word similarities to see how our model differs from the basic CNN; Table 5 presents the similarity measurements. When pre-trained word vectors were utilized, both the CNN and our model were fine-tuned on the dataset. However, a difference can be seen when utilizing the static-routing method. In a CNN, max-pooling cannot update all words because only the context with the highest activation receives gradient during backpropagation. Because our model does not utilize max-pooling, the static-routing model learns the syntactic representations of words without losing positional context.
5.3 Static-routing over dynamic-routing
| Pretrained word accuracy | - |
| Randomly initialized word accuracy | - |
It is a general practice to utilize max-pooling in order to extract data features when using a CNN. However, max-pooling often produces poor results in text classification due to loss of information. More specifically, max-pooling only maintains the feature with the highest activation, which means it discards all other features even though they may still be useful.
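The loss is easy to see concretely: max-over-time pooling keeps a single activation per filter, so only that one position survives the forward pass (and receives a gradient in the backward pass):

```python
import numpy as np

# Activations of a single filter over a five-word sentence (illustrative values).
acts = np.array([0.2, 0.9, 0.85, 0.1, 0.4])

pooled = float(acts.max())    # only 0.9 survives the pooling step
kept = int(np.argmax(acts))   # position 1; the near-tie at position 2 is discarded
```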
To remedy this issue, capsule networks with dynamic routing choose to preserve not only one, but all features that are useful, as long as they are "agreed" upon among layers. However, we assert that this strategy is not necessarily optimal for document classification as opposed to image classification, due to the high variability in text. Specifically, the model should be flexible and robust enough to handle slight modifications in the text, such as word order shuffling or the insertion of an untrained word vector. We conjecture that removing the coupling coefficient would smooth out the underlying signals and therefore make the model more robust in this regard. We further perform experiments involving word order shuffling and noise injections in Section 5.3 to support this claim.
In order to test the above hypothesis and argue for the effectiveness of static routing, we evaluated the classification results after changing the sequences of words in a sentence. For this, we utilized 50 samples from each of class 2 (ENTITY) and class 3 (HUMAN) from the TREC-QA test dataset. As can be seen in Table 6, static routing achieved much higher accuracy than dynamic routing when the word vectors were pretrained.
We further identified the effects of word changes on the predictions of the model utilizing LIME Ribeiro et al. (2016). LIME generates new samples with similar values in the vicinity of a given instance and determines how the predictions of the model differ based on the input values. In the results presented in Figure 4, both routing models tend to produce incorrect decisions because of the changed words.
When the original example is "what is the state flower of michigan" (the third example in Figure 4), the reconstructed data is "what is the color of michigan's state flower". The dynamic-routing method assigns a negative effect to the newly added word "color." It also reacts negatively when a combination that does not appear in the existing TREC-QA data is added, such as "can you tell me" (second-to-last example in Figure 4).
For these reasons, we did not utilize the coupling coefficient. As a result, computational complexity is reduced and generalization is improved compared to dynamic routing.
5.4 Justifying the ELU-gate
In Dauphin et al. (2016), gating mechanisms had mainly been studied in recurrent models such as LSTMs and GRUs, but the authors showed that gating is also effective with convolutional layers. The gradient of the LSTM-style gate is

∇[tanh(X) ⊗ σ(X)] = tanh′(X)∇X ⊗ σ(X) + σ′(X)∇X ⊗ tanh(X).

In the case of the LSTM-style gate, the effect of the gradient is reduced because downscaling occurs in both tanh′(X) and σ′(X). In contrast, writing the ELU-gate as ELU(A) ⊗ B with A = X ∗ W₁ + b₁ and B = X ∗ W₂ + b₂, its gradient

∇[ELU(A) ⊗ B] = ELU′(A)∇A ⊗ B + ELU(A) ⊗ ∇B     (7)

contains paths whose scaling factor ELU′(A) equals 1 for positive activations, so the effect of downscaling is small. Unlike max-pooling, fine-tuning also works well because input words are updated globally.
Table 7 shows the results of comparing the accuracy of the ELU-gate with that of other structures. The multiple filter layer is a convolution layer with multiple filter sizes, as in the CNN structure of Kim (2014); the number of filters in the multiple filter layers was 100 per filter size, followed by max-over-time pooling. The convolutional layer is the same layer with the ELU-gate removed.
5.5 Text transformation
In image classification, capsules represent the various properties of a particular entity that is present in an image, such as tilt, orientation, and hue. In order to apply this analogy to text, we experimented with documents to see how capsules learn the innate characteristics of a document being transformed. To test this reconstruction ability, we added three fully connected layers to the capsule network with static routing.
We added an MSE loss between the input and the output of the reconstruction layers, and downscaled the MSE loss by a factor of 0.03. Pretrained word vectors were not utilized. We examined the decoder outputs after adding random noise between -0.3 and 0.3 to each dimension of the activated capsule. To map each reconstructed row back to text, we selected the word with the highest cosine similarity between that row and the vocabulary.
The first row in Table 8 is an original sentence from TREC-QA with no added noise. When noise is added, the meaning of the question does not change, but some words do. Moreover, the changed sentence is newly created and does not appear verbatim in the dataset. In the case of words, we could not visualize detailed changes as with images, because we only measured similarity against the words included in the vocabulary.
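The vocabulary lookup used above can be sketched as follows (toy vocabulary and two-dimensional embeddings, purely illustrative):

```python
import numpy as np

def nearest_word(vec, vocab, emb):
    """Return the vocabulary word whose embedding has the highest
    cosine similarity with the reconstructed row vector vec."""
    emb_n = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    v_n = vec / np.linalg.norm(vec)
    return vocab[int(np.argmax(emb_n @ v_n))]

vocab = ["what", "state", "flower", "color"]
emb = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.7, 0.7],
                [0.9, 0.1]])
```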
In this paper, we proposed the application of capsule networks to the text classification domain and suggested the utilization of a static routing variant. We compared the proposed model to CNNs, and demonstrated that capsule networks are indeed useful for text classification based on seven popular benchmark datasets. We additionally proposed static routing, an alternative to dynamic routing, that results in higher classification accuracies with less computation.
- Abadi et al. (2016) Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation (OSDI), volume 16, pages 265–283.
- Dai and Le (2015) Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems (NIPS), pages 3079–3087.
- Dauphin et al. (2016) Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. 2016. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083.
- Hinton et al. (2011) Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. 2011. Transforming auto-encoders. In International Conference on Artificial Neural Networks, pages 44–51. Springer.
- Kalchbrenner et al. (2014) Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Le and Mikolov (2014) Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of International Conference on Machine Learning (ICML), pages 1188–1196.
- Li and Roth (2002) Xin Li and Dan Roth. 2002. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pages 1–7. Association for Computational Linguistics.
- Maas et al. (2011) Andrew L Maas, Raymond E Daly, Peter T Pham, Dan Huang, Andrew Y Ng, and Christopher Potts. 2011. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 142–150. Association for Computational Linguistics.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (NIPS), pages 3111–3119.
- Miyato et al. (2017) Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2017. Adversarial training methods for semi-supervised text classification. In Proceedings of the International Conference on Learning Representations (ICLR).
- Pang and Lee (2004) Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd annual meeting on Association for Computational Linguistics (ACL), page 271. Association for Computational Linguistics.
- Pang and Lee (2005) Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd annual meeting on association for computational linguistics (ACL), pages 115–124. Association for Computational Linguistics.
- Peng et al. (2017) Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. 2017. Large kernel matters–improve semantic segmentation by global convolutional network. arXiv preprint arXiv:1703.02719.
- Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 1135–1144.
- Sabour et al. (2017) Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems (NIPS), pages 3859–3869.
- Shen et al. (2018) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018. Bi-directional block self-attention for fast and memory-efficient sequence modeling. In Proceedings of the International Conference on Learning Representations (ICLR).
- Wiebe et al. (2005) Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language resources and evaluation, 39(2-3):165–210.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in neural information processing systems (NIPS), pages 649–657.