Towards Scalable and Reliable Capsule Networks for Challenging NLP Applications
Obstacles hindering the development of capsule networks for challenging NLP applications include poor scalability to large output spaces and less reliable routing processes. In this paper, we introduce: 1) an agreement score to evaluate the performance of routing processes at instance level; 2) an adaptive optimizer to enhance the reliability of routing; 3) capsule compression and partial routing to improve the scalability of capsule networks. We validate our approach on two NLP tasks, namely multi-label text classification and question answering. Experimental results show that our approach considerably improves over strong competitors on both tasks. In addition, we achieve the best results in low-resource settings with few training instances.
In recent years, deep neural networks have achieved outstanding success in natural language processing (NLP), computer vision and speech recognition. However, these deep models are data-hungry and generalize poorly from small datasets, very much unlike humans Lake et al. (2015).
This is an important issue in NLP since sentences with different surface forms can convey the same meaning (paraphrases) and not all of them can be enumerated in the training set. For example, Peter did not accept the offer and Peter turned down the offer are semantically equivalent, but use different surface realizations.
In image classification, progress on the generalization ability of deep networks has been made by capsule networks Sabour et al. (2017); Hinton et al. (2018). They are capable of generalizing to the same object in different 3D images with various viewpoints.
Such generalization capability can be learned from examples with few viewpoints by extrapolation Hinton et al. (2011). This suggests that capsule networks can similarly abstract away from different surface realizations in NLP applications.
Figure 1 illustrates this idea of how observed sentences in the training set are generalized to unseen sentences by extrapolation. In contrast, traditional neural networks require massive amounts of training samples for generalization. This is especially true in the case of convolutional neural networks (CNNs), where pooling operations wrongly discard positional information and do not consider hierarchical relationships between local features (Sabour et al., 2017).
Capsule networks, instead, have the potential for learning hierarchical relationships between consecutive layers by using routing processes without parameters, which are clustering-like methods Sabour et al. (2017) and additionally improve the generalization capability. We contrast such routing processes with pooling and fully connected layers in Figure 2.
Despite some recent success in NLP tasks Wang et al. (2018); Xia et al. (2018); Xiao et al. (2018); Zhang et al. (2018a); Zhao et al. (2018), a few important obstacles still hinder the development of capsule networks for mature NLP applications.
For example, selecting the number of iterations is crucial for routing processes, because they iteratively route low-level capsules to high-level capsules in order to learn hierarchical relationships between layers. However, existing routing algorithms use the same number of iterations for all examples, which is not a reliable way to judge the convergence of routing. As shown in Figure 3, a routing process with five iterations on all examples converges to a lower training loss at system level, but at instance level, for one example, convergence has still not been reached.
Additionally, training capsule networks is more difficult than training traditional neural networks like CNNs and long short-term memory (LSTM) networks due to the large number of capsules and potentially large output spaces, which require extensive computational resources in the routing process.
In this work, we address these issues via the following contributions:
We formulate routing processes as a proxy problem minimizing a total negative agreement score in order to evaluate how routing processes perform at instance level, which will be discussed more in depth later.
We introduce an adaptive optimizer to self-adjust the number of iterations for each example in order to improve instance-level convergence and enhance the reliability of routing processes.
We present capsule compression and partial routing to achieve better scalability of capsule networks on datasets with large output spaces.
Our framework outperforms strong baselines on multi-label text classification and question answering. We also demonstrate its superior generalization capability in low-resource settings.
We have motivated the need for better capsule networks being capable of scaling to large output spaces and higher reliability for routing processes at instance level. We now build a unified capsule framework, which we call NLP-Capsule. It is shown in Figure 4 and described below.
We use a convolutional operation to extract features from documents by taking a sliding window over document embeddings.
Let $\mathbf{X} \in \mathbb{R}^{l \times v}$ be a matrix of stacked $v$-dimensional word embeddings for an input document with $l$ tokens. Furthermore, let $\mathbf{W}^a$ be a convolutional filter with a width $k$. We apply this filter to a local region $\mathbf{X}_{i:i+k-1}$ to generate one feature:
$$m_i = g\left(\mathbf{W}^a \circ \mathbf{X}_{i:i+k-1}\right)$$
where $\circ$ denotes element-wise multiplication, and $g$ is a non-linear activation function.
Then, we can collect all $m_i$ into one feature map $\mathbf{m} = (m_1, m_2, \ldots)$
after sliding the filter over the current document. To increase the diversity of the extracted features, we concatenate multiple feature maps extracted by three filters with different window sizes (2,4,8) and pass them to the primary capsule layer.
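The sliding-window feature extraction described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the non-linearity is assumed to be ReLU, the stride is assumed to be 1, and the element-wise product of filter and window is summed to obtain one scalar feature.

```python
def relu(x):
    # Assumed non-linearity; the text only says a non-linear function is applied.
    return max(0.0, x)


def conv_feature_map(embeddings, filt):
    """Slide one filter (k rows of v weights) over a document (l rows of v values).

    Assumes stride 1 and sums the element-wise product of filter and window.
    """
    k = len(filt)
    features = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        total = sum(w * x
                    for f_row, x_row in zip(filt, window)
                    for w, x in zip(f_row, x_row))
        features.append(relu(total))
    return features


def multi_width_features(embeddings, filters):
    """Concatenate the feature maps of several filters with different widths."""
    maps = []
    for filt in filters:
        maps.extend(conv_feature_map(embeddings, filt))
    return maps
```

With window sizes (2,4,8), `filters` would hold three weight matrices with 2, 4 and 8 rows; wider filters yield shorter feature maps for the same document.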
In this layer, we use a group-convolution operation to transform feature maps into primary capsules. As opposed to using a scalar for each element in the feature maps, capsules use a group of neurons to represent each element in the current layer, which has the potential for preserving more information.
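Using filters $\mathbf{W}^b = \{w_1, \ldots, w_d\}$, in total $d$ groups are used to transform each scalar $m_i$ in feature maps to one capsule $\mathbf{p}_i$, a $d$-dimensional vector, denoted as:
$$\mathbf{p}_i = g\left(p_{i1} \oplus p_{i2} \oplus \cdots \oplus p_{id}\right) \in \mathbb{R}^d$$
where $p_{ij} = m_i \cdot w_j \in \mathbb{R}$ and $\oplus$ is the concatenation operator. Furthermore, $g$ is a non-linear function (i.e., squashing function). The length $\|\mathbf{p}_i\|$ of each capsule $\mathbf{p}_i$ indicates the probability of it being useful for the task at hand. Hence, a capsule's length has to be constrained into the unit interval $[0, 1]$ by the squashing function $g$:
$$g(\mathbf{x}) = \frac{\|\mathbf{x}\|^2}{1 + \|\mathbf{x}\|^2} \cdot \frac{\mathbf{x}}{\|\mathbf{x}\|}$$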
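As a small worked illustration, the sketch below turns one scalar feature into a primary capsule and squashes it. It assumes the squashing form of Sabour et al. (2017); the function names and the small `eps` guard against division by zero are illustrative choices, not part of the paper.

```python
import math


def capsule_length(vec):
    """Euclidean length of a capsule vector."""
    return math.sqrt(sum(x * x for x in vec))


def squash(vec, eps=1e-9):
    """Shrink a vector so its length lies in [0, 1) while keeping its direction.

    Assumes the squashing form of Sabour et al. (2017); eps avoids division by zero.
    """
    norm_sq = sum(x * x for x in vec)
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in vec]


def primary_capsule(feature, group_weights):
    """Expand one scalar feature into a d-dimensional capsule, then squash it."""
    return squash([feature * w for w in group_weights])
```

For example, a raw vector of length 5 is squashed to a length of about 25/26, so long vectors map to probabilities close to 1 and short vectors to probabilities close to 0.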
One major issue in this layer is that the number of primary capsules becomes large in proportion to the size of the input documents, which requires extensive computational resources in routing processes (see Section 2.3). To mitigate this issue, we condense the large number of primary capsules into a smaller number. In this way, we can merge similar capsules and remove outliers. Each condensed capsule $\mathbf{u}_i$ is calculated by using a weighted sum over all primary capsules, denoted as:
$$\mathbf{u}_i = \sum_j b_{ij} \mathbf{p}_j \in \mathbb{R}^d$$
where the parameter $b_{ij}$ is learned by supervision.
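Capsule compression, as described here, amounts to a learned linear mixing of primary capsules. The sketch below shows only the forward computation; the function name is hypothetical and the weights are passed in as plain numbers, whereas in the paper they are learned by supervision.

```python
def compress_capsules(primary, weights):
    """Condense n primary capsules into m capsules via weighted sums.

    primary: list of n capsules, each a list of d floats.
    weights: list of m rows, each holding n mixing coefficients
             (assumed given here; learned during training in practice).
    """
    dim = len(primary[0])
    condensed = []
    for row in weights:
        condensed.append([
            sum(b * capsule[t] for b, capsule in zip(row, primary))
            for t in range(dim)
        ])
    return condensed
```

Because the number of rows in `weights` is fixed (for instance 128 or 256), the cost of the subsequent routing step no longer grows with document length.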
Pooling is the simplest aggregation function routing condensed capsules into the subsequent layer, but it loses almost all information during aggregation. Alternatively, routing processes are introduced to iteratively route condensed capsules into the next layer for learning hierarchical relationships between two consecutive layers. We now describe this iterative routing algorithm. Let $\{\mathbf{u}_1, \ldots, \mathbf{u}_m\}$ and $\{\mathbf{v}_1, \ldots, \mathbf{v}_n\}$ be a set of condensed capsules in layer $\ell$ and a set of high-level capsules in layer $\ell+1$, respectively. The basic idea of routing is two-fold.
First, we transform the condensed capsules into a collection of candidates $\{\hat{\mathbf{u}}_{j|1}, \ldots, \hat{\mathbf{u}}_{j|m}\}$ for the $j$-th high-level capsule in layer $\ell+1$. Following Sabour et al. (2017), each element $\hat{\mathbf{u}}_{j|i}$ is calculated by:
$$\hat{\mathbf{u}}_{j|i} = \mathbf{W}^c_j \mathbf{u}_i$$
where $\mathbf{W}^c_j$ is a linear transformation matrix.
Then, we represent a high-level capsule $\mathbf{v}_j$ by a weighted sum over those candidates, denoted as:
$$\mathbf{v}_j = \sum_{i=1}^{m} c_{ij} \hat{\mathbf{u}}_{j|i}$$
where $c_{ij}$ is a coupling coefficient iteratively updated by a clustering-like method.
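The two ingredients of routing, a linear transformation into candidates and a coupling-weighted sum, can be sketched as follows. The helper names are hypothetical, and one transformation matrix per high-level capsule is an assumption made for the example.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def candidates_for(transform, condensed):
    """Transform every condensed capsule into a candidate for one high-level capsule.

    Assumes a single transformation matrix per high-level capsule.
    """
    return [matvec(transform, u) for u in condensed]


def high_level_capsule(candidates, coupling):
    """Weighted sum of candidates using one coupling coefficient per candidate."""
    dim = len(candidates[0])
    return [
        sum(c * cand[t] for c, cand in zip(coupling, candidates))
        for t in range(dim)
    ]
```

The coupling coefficients are what the routing iterations adjust; the transformation matrices are ordinary trainable parameters.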
As discussed earlier, routing algorithms like dynamic routing (Sabour et al., 2017) and EM routing (Hinton et al., 2018), which use the same number of iterations for all samples, perform well according to training loss at system level, but on instance level for individual examples, convergence has still not been reached. This increases the risk of unreliability for routing processes (see Figure 3).
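To evaluate the performance of routing processes at instance level, we formulate them as a proxy problem minimizing the negative agreement score (NAS) function:
$$\min_{c, v} f(\mathbf{u}) = -\sum_{i,j} c_{ij} \langle \mathbf{v}_j, \hat{\mathbf{u}}_{j|i} \rangle \quad \text{s.t.} \quad \forall i, j: c_{ij} > 0, \; \sum_j c_{ij} = 1 \qquad (1)$$
The basic intuition behind this is to assign higher weights $c_{ij}$ to one agreeable pair $\langle \mathbf{v}_j, \hat{\mathbf{u}}_{j|i} \rangle$ if the capsules $\mathbf{v}_j$ and $\hat{\mathbf{u}}_{j|i}$ are close to each other such that the total agreement score is maximized. However, the choice of NAS functions remains an open problem. Hinton et al. (2018)
hypothesize that the agreeable pairs in NAS functions are from Gaussian distributions. Instead, we study NAS functions by introducing Kernel Density Estimation (KDE) since this yields a non-parametric density estimator requiring no assumptions that the agreeable pairs are drawn from parametric distributions. Here, we formulate the NAS function in a KDE form:
$$\min_{c, v} f(\mathbf{u}) = -\sum_{i,j} c_{ij} \, k\big(d(\mathbf{v}_j, \hat{\mathbf{u}}_{j|i})\big) \qquad (2)$$
where $d$ is a distance metric with $\ell_2$ norm, and $k$ is an Epanechnikov kernel function Wand and Jones (1994) with:
$$k(x) = \begin{cases} 1 - x & x \in [0, 1) \\ 0 & x \geq 1 \end{cases}$$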
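The solution we used for KDE is taking Mean Shift Comaniciu and Meer (2002) to minimize the NAS function $f(\mathbf{u})$:
First, $\mathbf{v}_j^{\tau+1}$ can be updated while $c_{ij}^{\tau}$ is fixed:
$$\mathbf{v}_j^{\tau+1} = \frac{\sum_i c_{ij}^{\tau} \, k'\big(d(\mathbf{v}_j^{\tau}, \hat{\mathbf{u}}_{j|i})\big) \, \hat{\mathbf{u}}_{j|i}}{\sum_i c_{ij}^{\tau} \, k'\big(d(\mathbf{v}_j^{\tau}, \hat{\mathbf{u}}_{j|i})\big)}$$
Then, $c_{ij}^{\tau+1}$ can be updated using standard gradient descent:
$$c_{ij}^{\tau+1} = c_{ij}^{\tau} + \alpha \cdot k\big(d(\mathbf{v}_j^{\tau+1}, \hat{\mathbf{u}}_{j|i})\big)$$
where $\alpha$ is the hyper-parameter to control step size.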
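One alternating update of this kind can be sketched as below. This is an illustrative reading rather than the authors' code: it assumes the kernel profile k(x) = 1 - x on [0, 1) applied to the squared Euclidean distance, a mean-shift step that averages the candidates inside the kernel window weighted by their coupling coefficients, and a coupling update that adds the step size times the kernel value. Normalization of the coupling coefficients (for instance by a softmax) is deliberately left out, and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (assumed input to the kernel profile)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kernel(x):
    """Epanechnikov-style profile: 1 - x inside the unit window, 0 outside."""
    return 1.0 - x if 0.0 <= x < 1.0 else 0.0


def negative_agreement(capsules, candidates, coupling):
    """NAS in KDE form: minus the coupling-weighted kernel agreement."""
    return -sum(
        coupling[j][i] * kernel(sq_dist(capsules[j], candidates[j][i]))
        for j in range(len(capsules))
        for i in range(len(candidates[j]))
    )


def mean_shift_step(capsules, candidates, coupling, alpha=0.1):
    """Update capsules with couplings fixed, then take a gradient step on couplings."""
    new_capsules, new_coupling = [], []
    for v, cands, coefs in zip(capsules, candidates, coupling):
        # Only candidates inside the kernel window contribute to the shift.
        inside = [(c, u) for c, u in zip(coefs, cands) if sq_dist(v, u) < 1.0]
        mass = sum(c for c, _ in inside)
        if mass > 0.0:
            v = [sum(c * u[t] for c, u in inside) / mass for t in range(len(v))]
        new_capsules.append(v)
        # Gradient of the NAS with respect to a coupling is minus the kernel value.
        new_coupling.append(
            [c + alpha * kernel(sq_dist(v, u)) for c, u in zip(coefs, cands)]
        )
    return new_capsules, new_coupling
```

In this toy reading, candidates far from a capsule (kernel value 0) neither pull the capsule nor gain coupling weight, which is the clustering-like behavior the text refers to.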
To address the issue of convergence not being reached at instance level, we present an adaptive optimizer to self-adjust the number of iterations for individual examples according to their negative agreement scores (see Algorithm 1). Following Zhao et al. (2018), we replace standard softmax with leaky-softmax, which decreases the strength of noisy capsules.
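The idea of adapting the number of iterations per example can be sketched with a generic stopping rule. The criterion used here (stop once the change in negative agreement score falls below a tolerance, or after a maximum number of iterations) is an assumption for illustration and may differ from the paper's Algorithm 1; `adaptive_routing`, `tol` and `max_iters` are hypothetical names.

```python
def adaptive_routing(state, step, score, tol=1e-3, max_iters=50):
    """Run routing updates on one example until its score stops changing.

    state: routing state of a single example (e.g., capsules and couplings).
    step:  function mapping a state to the next state (one routing iteration).
    score: function returning the negative agreement score of a state.
    Returns the final state and the number of iterations actually used.
    """
    previous = score(state)
    for iteration in range(1, max_iters + 1):
        state = step(state)
        current = score(state)
        if abs(previous - current) < tol:
            return state, iteration
        previous = current
    return state, max_iters
```

Called once per example, such a loop lets easy instances stop after one or two iterations while harder ones continue, instead of fixing the same count for the whole dataset.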
This is the top-level layer containing final capsules calculated by iteratively minimizing the NAS function (see Eq. 1), where the number of final capsules corresponds to the entire output space. Therefore, once the output space grows large (thousands of labels), computing this function becomes extremely expensive, which is the scalability bottleneck of capsule networks.
As opposed to the entire output space of a dataset, the sub-output space corresponding to an individual example is rather small; in text classification, for example, only a few labels are assigned to one document. As a consequence, it is redundant to route low-level capsules to the entire output space for each example in the training stage, which motivated us to present a partial routing algorithm with constrained output spaces, such that our NAS function is described as:
$$\min_{c, v} -\sum_i \Big( \sum_{j \in D_+} c_{ij} \langle \mathbf{v}_j, \hat{\mathbf{u}}_{j|i} \rangle + \sum_{k \in D_-} c_{ik} \langle \mathbf{v}_k, \hat{\mathbf{u}}_{k|i} \rangle \Big)$$
where $D_+$ and $D_-$ denote the sets of real (positive) and randomly selected (negative) outputs for each example, respectively. Both sets are far smaller than the entire output space.
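Building the constrained output space for one training example can be sketched as follows. Uniform sampling without replacement and the function and parameter names are assumptions made for the example; the text only states that the negatives are randomly selected.

```python
import random


def constrained_output_space(positive_labels, num_labels, num_negatives, rng=None):
    """Return the positive labels of an example plus randomly sampled negatives.

    Assumes labels are integers in [0, num_labels) and that negatives are
    drawn uniformly without replacement from the remaining labels.
    """
    rng = rng or random.Random()
    positives = set(positive_labels)
    pool = [label for label in range(num_labels) if label not in positives]
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    return sorted(positives), sorted(negatives)
```

Routing is then run only over the capsules of the returned labels, so its cost depends on the size of the two sets rather than on the total number of labels.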
The major focus of this work is to investigate the scalability of our approach on datasets with a large output space, and generalizability in low-resource settings with few training examples. Therefore, we validate our capsule-based approach on two specific NLP tasks: (i) multi-label text classification with a large label scale; (ii) question answering with a data imbalance issue.
The multi-label text classification task refers to assigning multiple relevant labels to each input document, while the entire label set might be extremely large. We use our approach to encode an input document and generate the final capsules corresponding to the number of labels in the representation layer. The length of the final capsule for each label indicates the probability that the document has this label.
We conduct our experiments on two datasets selected from the extreme classification repository (https://manikvarma.github.io): a regular label scale dataset (RCV1), with 103 labels Lewis et al. (2004), and a large label scale dataset (EUR-Lex), with 3,956 labels Mencia and Fürnkranz (2008), described in Table 1. The intuition behind our dataset selection is that EUR-Lex, with 3,956 labels and 15.59 examples per label, fits well with our goal of investigating the scalability and generalizability of our approach. We contrast EUR-Lex with RCV1, a dataset with a regular label scale, and leave the study of datasets with extremely large label sets, e.g., Wikipedia-500K with 501,069 labels, to future work.
We compare our approach to the following baselines: non-deep learning approaches using TF-IDF features of documents as inputs: FastXML Prabhu and Varma (2014) and PD-Sparse Yen et al. (2016); deep learning approaches using raw text of documents as inputs: FastText Joulin et al. (2016), Bow-CNN Johnson and Zhang (2014), CNN-Kim Kim (2014) and XML-CNN Liu et al. (2017); and a capsule-based approach, Cap-Zhao Zhao et al. (2018). For evaluation, we use standard rank-based measures (Liu et al., 2017) such as Precision@k and Normalized Discounted Cumulative Gain (NDCG@k).
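For reference, the two rank-based measures can be computed as in the sketch below. It follows the usual definitions for binary relevance in the extreme classification literature; the function names are hypothetical, and the exact logarithm base does not affect NDCG because it cancels in the ratio.

```python
import math


def top_k(scores, k):
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def precision_at_k(scores, relevant, k):
    """Fraction of the top-k predicted labels that are true labels."""
    return sum(1 for i in top_k(scores, k) if i in relevant) / k


def ndcg_at_k(scores, relevant, k):
    """DCG of the top-k labels divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, i in enumerate(top_k(scores, k))
        if i in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

Under these definitions NDCG@1 always equals Precision@1, since both reduce to whether the top-ranked label is correct.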
The word embeddings are initialized as 300-dimensional GloVe vectors Pennington et al. (2014). In the convolutional layer, we use a convolution operation with three different window sizes (2,4,8) to extract features from input documents. Each feature is transformed into a primary capsule with 16 dimensions by a group-convolution operation. All capsules in the primary capsule layer are condensed into 256 capsules for RCV1 and 128 capsules for EUR-Lex by a capsule compression operation.
To avoid routing low-level capsules to the entire label space in the inference stage, we use a CNN baseline (Kim, 2014), trained on the same dataset as our approach, to generate 200 candidate labels and take these labels as a constrained output space for each example.
In Table 2, we can see a noticeable margin brought by our capsule-based approach over the strong baselines on EUR-Lex, and competitive results on RCV1. These results appear to indicate that our approach has superior generalization ability on datasets with fewer training examples, i.e., RCV1 has 729.67 examples per label while EUR-Lex has 15.59 examples.
In contrast to the strongest baseline XML-CNN with 22.52M parameters and 0.08 seconds per batch, our approach has 14.06M parameters, and takes 0.25 seconds in an acceleration setting with capsule compression and partial routing, and 1.7 seconds without acceleration. This demonstrates that our approach provides competitive computational speed with fewer parameters compared to the competitors.
To further study the generalization capability of our approach, we vary the percentage of training examples from 100% to 50% on the entire training set, leading to the number of training examples per label decreasing from 15.59 to 7.77. Figure 5 shows that our approach outperforms the strongest baseline XML-CNN with different fractions of the training examples.
This finding agrees with our speculation on generalization: the distance between our approach and XML-CNN increases as fewer training data samples are available. In Table 3, we also find that our approach with 70% of training examples achieves about 5% improvement over XML-CNN with 100% of examples on 4 out of 6 metrics.
We compare our routing with Sabour et al. (2017) and Zhang et al. (2018b) on the EUR-Lex dataset and observe that it performs best on all metrics (Table 4). We speculate that the improvement comes from enhanced reliability of routing processes at instance level.
|Method||PREC@1||PREC@3||PREC@5|
|NLP-Capsule + Sabour's Routing||79.14||64.33||51.85|
|NLP-Capsule + Zhang's Routing||80.20||65.48||52.83|
|NLP-Capsule + Our Routing||80.62||65.61||53.66|
|Method||NDCG@1||NDCG@3||NDCG@5|
|NLP-Capsule + Sabour's Routing||79.14||70.13||67.02|
|NLP-Capsule + Zhang's Routing||80.20||71.11||68.80|
|NLP-Capsule + Our Routing||80.62||71.34||69.57|
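The question answering (QA) selection task refers to selecting the best answer from a set of candidates for each question. For a question-answer pair $(q, a)$, we use our capsule-based approach to generate two final capsules $\mathbf{v}_q$ and $\mathbf{v}_a$
corresponding to the respective question and answer. The relevance score of a question-answer pair can be defined as their cosine similarity:
$$s(q, a) = \cos(\mathbf{v}_q, \mathbf{v}_a) = \frac{\mathbf{v}_q^\top \mathbf{v}_a}{\|\mathbf{v}_q\| \cdot \|\mathbf{v}_a\|}$$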
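Scoring and ranking candidate answers by cosine similarity can be sketched as follows. The capsule vectors are taken as given, and the function names, as well as returning 0 for a zero-length vector, are illustrative choices.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two capsule vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_answers(question_capsule, answer_capsules):
    """Return answer indices ordered from most to least relevant."""
    scored = [
        (cosine_similarity(question_capsule, answer), index)
        for index, answer in enumerate(answer_capsules)
    ]
    return [index for _, index in sorted(scored, key=lambda p: p[0], reverse=True)]
```

Because cosine similarity ignores vector length, the ranking depends only on the direction of the final capsules, not on their magnitude.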
We conduct our experiments on the TREC QA dataset (Table 5), collected from TREC QA track 8-13 data Wang et al. (2007). The intuition behind this dataset selection is that the cost of hiring human annotators to collect positive answers for individual questions can be prohibitive, since positive answers can be conveyed in multiple different surface forms. This issue arises particularly in TREC QA, where only 12% of the answers are positive. Therefore, we use this dataset to investigate the generalizability of our approach.
We compare our approach to the following baselines: CNN + LR Yu et al. (2014b) using unigrams and bigrams, CNN Severyn and Moschitti (2015) using additional bilinear similarity features, CNTN Qiu and Huang (2015)
using a neural tensor network, LSTM Tay et al. (2017) using single and multiple layers, MV-LSTM Wan et al. (2016), NTN-LSTM and HD-LSTM Tay et al. (2017) using holographic dual LSTM, and Capsule-Zhao Zhao et al. (2018) using capsule networks. For evaluation, we use standard measures Wang et al. (2007) such as Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
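Both measures operate on, for each question, the list of candidate answers sorted by predicted relevance. A minimal sketch using the standard definitions, with hypothetical function names and binary correctness labels:

```python
def average_precision(ranked_labels):
    """Average of the precision values at each rank holding a correct answer.

    ranked_labels: 0/1 correctness labels, ordered by descending model score.
    """
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels):
    """Inverse rank of the first correct answer (0.0 if there is none)."""
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def mean_metric(metric, ranked_lists):
    """Average a per-question metric over all questions (gives MAP or MRR)."""
    return sum(metric(labels) for labels in ranked_lists) / len(ranked_lists)
```

MRR only rewards the position of the first correct answer, whereas MAP takes every correct answer of a question into account.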
|Method||MAP||MRR|
|CNN + LR (unigram)||54.70||63.29|
|CNN + LR (bigram)||56.93||66.13|
|LSTM (1 layer)||62.04||66.85|
The word embeddings used for question answering pairs are initialized as 300-dimensional GloVe vectors. In the convolutional layer, we use a convolution operation with three different window sizes (3,4,5). All 16-dimensional capsules in the primary capsule layer are condensed into 256 capsules by the capsule compression operation.
In Table 6, the best performance on the MAP metric is achieved by our approach, which verifies the effectiveness of our model. We also observe that our approach exceeds traditional neural models like CNN, LSTM and NTN-LSTM by a noticeable margin.
This finding also agrees with our observation in multi-label classification: our approach has superior generalization capability in low-resource settings with few training examples.
In contrast to the strongest baseline HD-LSTM with 34.51M parameters and 0.03 seconds per batch, our approach has 17.84M parameters and takes 0.06 seconds in an acceleration setting, and 0.12 seconds without acceleration.
Multi-label text classification aims at assigning a document to a subset of labels whose label set might be extremely large. With increasing numbers of labels, issues of data sparsity and scalability arise. Several methods have been proposed for the large multi-label classification case.
Tree-based models Agrawal et al. (2013); Weston et al. (2013) induce a tree structure that recursively partitions the feature space with non-leaf nodes. Then, the restricted label spaces at leaf nodes are used for classification. Such a solution entails higher robustness because of a dynamic hyper-plane design and its computational efficiency. FastXML Prabhu and Varma (2014) is one such tree-based model, which learns a hierarchy of training instances and optimizes an NDCG-based objective function for nodes in the tree structure.
Label embedding models Balasubramanian and Lebanon (2012); Chen and Lin (2012); Cisse et al. (2013); Bi and Kwok (2013); Ferng and Lin (2011); Hsu et al. (2009); Ji et al. (2008); Kapoor et al. (2012); Lewis et al. (2004); Yu et al. (2014a)
address the data sparsity issue with two steps: compression and decompression. The compression step learns a low-dimensional label embedding that is projected from the original, high-dimensional label space. When data instances are classified to these label embeddings, they are projected back to the high-dimensional label space, which is the decompression step. Recent work has proposed different compression or decompression techniques, e.g., SLEEC Bhatia et al. (2015).
Deep learning models: FastText Joulin et al. (2016) uses averaged word embeddings to classify documents, which is computationally efficient but ignores word order. Various CNNs inspired by Kim (2014) explored multi-label text classification with dynamic pooling, such as Bow-CNN Johnson and Zhang (2014) and XML-CNN Liu et al. (2017).
Linear classifiers: PD-Sparse Yen et al. (2016) introduces a Fully-Corrective Block-Coordinate Frank-Wolfe algorithm to address data sparsity.
State-of-the-art approaches to QA fall into two categories: IR-based and knowledge-based QA.
IR-based QA first preprocesses the question and employs information retrieval techniques to retrieve a list of passages relevant to the question. Next, reading comprehension techniques are adopted to extract answers within the span of retrieved text. For answer extraction, early methods relied on manually designed patterns Pasca. A recent popular trend is neural answer extraction. Various neural network models are employed to represent questions Severyn and Moschitti (2015); Wang and Nyberg (2015). Since the attention mechanism naturally explores relevancy, it has been widely used in QA models to relate the question to candidate answers Tan et al. (2016); Santos et al. (2016); Sha et al. (2018). Moreover, some researchers leveraged external large-scale knowledge bases to assist answer selection Savenkov and Agichtein (2017); Shen et al. (2018); Deng et al. (2018).
Knowledge-based QA conducts semantic parsing on questions and transforms parsing results into logical forms. Those forms are adopted to match answers from structured knowledge bases Yao and Van Durme (2014); Yih et al. (2015); Bordes et al. (2015); Yin et al. (2016); Hao et al. (2017). Recent developments focused on modeling the interaction between question and answer pairs: Tensor layers Qiu and Huang (2015); Wan et al. (2016) and holographic composition Tay et al. (2017) have pushed the state-of-the-art.
Sabour et al. (2017) replaced the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement.
Hinton et al. (2018) then proposed a new iterative routing procedure between capsule layers based on the EM algorithm, which achieves better accuracy on the smallNORB dataset. Zhang et al. (2018a) applied capsule networks to relation extraction in a multi-instance multi-label learning framework. Xiao et al. (2018) explored capsule networks for multi-task learning.
Xia et al. (2018) studied the zero-shot intent detection problem with capsule networks, which aims to detect emerging user intents in an unsupervised manner. Zhao et al. (2018) investigated capsule networks with dynamic routing for text classification, and transferred knowledge from the single-label to multi-label cases. Cho et al. (2019)
studied capsule networks with determinantal point processes for extractive multi-document summarization.
Our work is different from our predecessors in the following aspects: (i) we evaluate the performance of routing processes at instance level, and introduce an adaptive optimizer to enhance the reliability of routing processes; (ii) we present capsule compression and partial routing to achieve better scalability of capsule networks on datasets with a large output space.
Making computers perform more like humans is a major issue in NLP and machine learning. This not only includes making them perform on similar levels Hassan et al. (2018), but also requires them to be robust to adversarial examples Eger et al. (2019) and to generalize from few data points Rücklé et al. (2019). In this work, we have addressed the latter issue.
In particular, we extended existing capsule networks into a new framework with advantages concerning scalability, reliability and generalizability. Our experimental results have demonstrated its effectiveness on two NLP tasks: multi-label text classification and question answering.
Through our modifications and enhancements, we hope to have made capsule networks more suitable to large-scale problems and, hence, more mature for real-world applications. In the future, we plan to apply capsule networks to even more challenging NLP problems such as language modeling and text generation.
We thank the anonymous reviewers for their comments, which greatly improved the final version of the paper. This work has been supported by the German Research Foundation as part of the Research Training Group Adaptive Preparation of Information from Heterogeneous Sources (AIPHES) at the Technische Universität Darmstadt under grant No. GRK 1994/1.