The attention mechanism and its variants have quickly become a standard component of neural networks for tasks such as document classification (Yang et al., 2016) and speech recognition (Chorowski et al., 2015), as well as many other natural language processing (NLP) applications, helping achieve promising performance compared to previous work. However, most early work implemented the attention mechanism only on recurrent neural network (RNN) architectures, e.g., Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) and Gated Recurrent Unit (GRU) (Cho et al., 2014), which lack support for parallel computation, making it impractical to build deep networks. To address this problem, Vaswani et al. (2017) proposed a novel self-attention network (SAN) architecture empowered by multi-head self-attention, which uses different heads to capture partial sentence information by projecting the input sequence into multiple distinct subspaces in parallel. Although only simple linear transformations are employed in the projection step, the Transformer network still achieves impressive performance.
Most existing work on improving the multi-head attention mechanism tries to extract a more informative partial representation on each independent head (Lin et al., 2017). Li et al. (2019) proposed aggregating the output representations of multi-head attention, and Dou et al. (2019) dynamically aggregated information between the output representations of different encoder layers. All of this work concentrates on the parts either “before” or “after” the multi-head SAN itself, which, as the core of the whole Transformer model, deserves more attention. To further empower the Transformer, we thus propose constructing a more general and context-aware SAN, so that the model can learn deeper contextualized information of the input sequence, which should eventually improve the model's final performance.
In this paper, we propose the novel capsule-Transformer, which implements a generalized SAN called the Capsule Routing Self-Attention Network. It extends the linear transformation into a more general capsule routing algorithm (Sabour et al., 2017) by taking the SAN as a special case of a capsule network. One of the biggest changes introduced by the capsule mechanism is altering the processing unit from a scalar (a single neuron) to a capsule (a group of neurons, i.e., a vector). Inspired by this idea, we first organize groups of attention weights calculated through self-attention into capsules containing preliminary linguistic features, then apply the routing algorithm on these capsules to obtain an output containing deeper contextualized information of the sequence. By re-organizing the SAN in a capsule way, we extend the model to a more general form than the original SAN.
The attention mechanism was first introduced into machine translation models (Bahdanau et al., 2015; Luong et al., 2015) and has since been developed and applied broadly for its ability to effectively model dependencies regardless of the distance between the input and output sequences. The Transformer (Vaswani et al., 2017) is empowered by a multi-head attention mechanism which leverages multiple distinct transformation matrices, as different heads, to capture context information from various subspaces; in self-attention, the input and output sequences are the same.
Formally, the multi-head self-attention mechanism models the attention calculation as a query operation with some specific keys. Given the input hidden states of query, key and value $Q, K, V \in \mathbb{R}^{n \times d}$, where $d$ denotes the word embedding dimension and $n$ is the length of the input sequence, multi-head attention with $H$ heads projects $Q$, $K$ and $V$ into $H$ different subspaces. The transformation functions are all trainable linear matrices:
$$Q_h = Q W_h^Q, \qquad K_h = K W_h^K, \qquad V_h = V W_h^V,$$
where $Q_h$, $K_h$ and $V_h$ are the projections of the original $Q$, $K$ and $V$ on the $h$-th subspace (head), and each transformation matrix has size $d \times d/H$.
An attention function is applied on each head over the projected $Q_h$ and $K_h$. The output on each head is computed by combining the attention results with the value matrix:
$$O_h = \mathrm{ATT}(Q_h, K_h)\, V_h, \qquad O = \mathrm{Concat}(O_1, \ldots, O_H),$$
where $O \in \mathbb{R}^{n \times d}$ is the concatenation of the partial outputs from all the heads.
In this paper, we adopt the scaled dot-product attention function (Luong et al., 2015; Vaswani et al., 2017), which is faster and more suitable for parallel computation than additive attention:
$$\mathrm{ATT}(Q_h, K_h) = \mathrm{softmax}\!\left(\frac{Q_h K_h^{\top}}{\sqrt{d/H}}\right) = \left[\alpha_1^h, \ldots, \alpha_n^h\right]^{\top},$$
where $\alpha_i^h \in \mathbb{R}^{n}$ is the computed attention vector of the $i$-th token of the input sequence on the $h$-th head.
From the equation above, we see that the attention vector $\alpha_i^h$, which contains the crucial attentive clues, is important in composing the final output $O$. We therefore view $\alpha_i^h$ as an entity basis on which to conduct a more general self-attention mechanism.
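The multi-head scaled dot-product attention described above can be sketched in NumPy. The function and variable names, and the random weight initialization, are illustrative assumptions of this sketch, not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, H):
    """Project Q, K, V into H subspaces, apply scaled dot-product
    attention per head, and concatenate the partial outputs."""
    n, d = Q.shape
    dh = d // H
    rng = np.random.default_rng(0)
    # Per-head trainable projection matrices (randomly initialized here).
    Wq, Wk, Wv = (rng.normal(size=(H, d, dh)) for _ in range(3))
    heads = []
    for h in range(H):
        Qh, Kh, Vh = Q @ Wq[h], K @ Wk[h], V @ Wv[h]
        # alpha has shape (n, n): one attention vector per token.
        alpha = softmax(Qh @ Kh.T / np.sqrt(dh))
        heads.append(alpha @ Vh)
    return np.concatenate(heads, axis=-1)  # shape (n, d)
```

For self-attention, the same hidden states serve as `Q`, `K` and `V`.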
Instead of applying operations to individual neurons as in a common neural network, a capsule network takes the capsule, a group of neurons, as its basic processing unit. The connection between two capsule layers also differs from that of conventional neuron-based networks. Formally, capsule $i$ in layer $l$ generates a vote vector $u_{j|i}$ to determine to what extent it belongs to capsule $j$ in layer $l+1$. As in a fully connected layer, the input to capsule $j$ is thus composed of all the vote vectors generated from layer $l$.
3 Model Architecture
3.1 Vertical and Horizontal Capsules
Capsule network was proposed by Sabour et al. (2017) in the field of computer vision, changing the conventional way data flows in neural networks. Concretely, a capsule network views a group of neurons (scalars) capturing the parameters of some specific feature as a capsule entity. In computer vision, such a feature could be the detection of eyes, nose or mouth in a face recognition task. Through the capsule routing algorithm proposed in their work, low-level feature capsules can be aggregated to form high-level capsules representing more abstract features, such as a human face or a left arm.
When it comes to multi-head SAN in NLP tasks, the calculated attention weights are coincidentally already organized into multiple separate groups. Since these groups of weights represent partial information of the input sequence from different perspectives or subspaces, they can naturally be treated as capsules. It is then intuitive to extend the original linear-transformation aggregation into a capsule routing algorithm over these capsules, so that a better attention distribution representation can be obtained.
In this paper, we organize the attention weights into two types of capsule: 1) capsules that contain all the attention weights on one of the heads; 2) capsules that contain the attention weights of one of the tokens on all the heads. As shown in Figure 2, we call the head-wise capsules vertical capsules and the token-wise capsules horizontal capsules, according to their placement in the attention weight cube.
A simple architecture of our capsule routing SAN in the capsule-Transformer is shown in Figure 3; the dashed part is the main difference between our model and the vanilla one. Our generalized SAN inspired by capsule routing is composed of two components: a vertical routing part and a horizontal routing part. Each part performs capsule routing on the attention weight matrices calculated through scaled dot-product attention and obtains the corresponding output capsule: the vertical output capsule and the horizontal output capsule, both of which have the same size as the input attention cube. Before the softmax, we add these two output capsules to the original attention matrices so that every token in the input sequence obtains a better contextualized attention distribution representation on each head.
3.3 Routing Algorithm
Formally, the dynamic routing algorithm is applied between two capsule layers, called the input and output capsule layers. Suppose there are $m$ and $n$ capsules in the input and output capsule layers respectively; each input capsule then generates $n$ vote vectors, one for each output capsule. The vote vector $u_{j|i}$, generated from input capsule $i$ and associated with output capsule $j$, measures the belonging relationship between those two capsules.
For each vote vector $u_{j|i}$, a weight value $c_{ij}$ is dynamically assigned. Output capsule $j$ is computed by a scaled weighted sum of all its associated vote vectors:
$$v_j = \mathrm{squash}\!\left(\sum_i c_{ij}\, u_{j|i}\right),$$
where the squash function scales the aggregated capsule so that its length measures the existence probability of the feature it represents. The weight value $c_{ij}$ is dynamically updated by applying the softmax function to an accumulated sum $b_{ij}$ in each iteration, and the increment to $b_{ij}$ is determined by the scalar product of $u_{j|i}$ and $v_j$.
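The dynamic routing procedure above can be sketched as follows, under the standard routing-by-agreement formulation of Sabour et al. (2017); the array layout and function names are assumptions of this sketch:

```python
import numpy as np

def squash(s, axis=-1):
    # Scale vector length into (0, 1) so it can represent an existence probability.
    norm2 = (s ** 2).sum(axis=axis, keepdims=True)
    return norm2 / (1.0 + norm2) * s / np.sqrt(norm2 + 1e-9)

def dynamic_routing(votes, n_iter=3):
    """votes: array of shape (m, n, d) -- vote vectors from m input
    capsules to n output capsules. Returns the n output capsules (n, d)."""
    m, n, d = votes.shape
    b = np.zeros((m, n))  # accumulated agreement logits b_ij
    for _ in range(n_iter):
        # Softmax over output capsules gives the routing weights c_ij.
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)
        s = (c[:, :, None] * votes).sum(axis=0)   # weighted sum, shape (n, d)
        v = squash(s)
        # Update logits by the scalar product of each vote with its output.
        b += (votes * v[None, :, :]).sum(axis=-1)
    return v
```

The squash nonlinearity guarantees every output capsule has length below 1, so its length can be read as a probability.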
3.4 Capsule Routing Self-Attention Network
As mentioned above, in our capsule routing SAN we reorganize the attention weight matrices into head-wise vertical capsules and token-wise horizontal capsules. The remaining problem is how to generate vote vectors from these two types of capsules.
Most existing work simply applies multiple linear transformations to each capsule to generate vote vectors (Sabour et al., 2017; Li et al., 2019; Dou et al., 2019). However, such linear processing cannot be used here because the capsule sizes are not fixed. We therefore do not apply trainable parameterized transformation functions to the capsules; instead, we let the original attention vectors themselves serve as vote vectors. Since dynamic routing is also a nonparametric algorithm, we introduce almost no new parameters into our capsule-Transformer, so this way of generating vote vectors is the least likely to hurt model performance.
3.4.1 Vertical Routing
In vertical routing, we view the attention weight cube head-wise, so that in an $H$-head Transformer model we obtain $H$ vertical capsules $V_h \in \mathbb{R}^{n \times l}$, where $l$ is the length of the attention vector; in self-attention, $l = n$. We can split each vertical capsule into $n$ vote vectors:
By applying the dynamic routing algorithm, we can calculate the vertical output capsule:
Simply adding the output capsule to all the heads would ignore the fact that each head contributes to the formation of the output capsules to a different degree, so each head should also “absorb” the output capsule differently. Meanwhile, a deep layered Transformer has been found to exhibit a hierarchical pattern of information capturing (Raganato and Tiedemann, 2018; Peters et al., 2018). Taking both issues into consideration, we use a trainable linear matrix and bias in each layer to measure the extent of output capsule acceptance:
where the weights are the vote weights calculated in the last iteration of the routing.
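One plausible reading of the vertical routing step can be sketched as follows, assuming each head's attention matrix serves as its own vote (the per-layer trainable scale and bias, and all names here, are illustrative assumptions):

```python
import numpy as np

def vertical_routing(A, n_iter=3):
    """A: attention cube of shape (H, n, n). Each head is a vertical
    capsule; routing-by-agreement aggregates the H heads into one (n, n)
    output, which is then broadcast back so the result matches the cube."""
    H, n, _ = A.shape
    b = np.zeros(H)                            # one agreement logit per head
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum()        # routing weights over heads
        s = (c[:, None, None] * A).sum(axis=0) # aggregated (n, n) capsule
        v = s / (np.linalg.norm(s) + 1e-9)     # normalization in place of squash
        b += (A * v[None]).sum(axis=(1, 2))    # agreement of each head with v
    # In the paper a per-layer trainable matrix and bias measure how much
    # each head accepts the output capsule; here we simply tile it.
    return np.repeat(v[None], H, axis=0)
```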
3.4.2 Horizontal Routing
Different from vertical routing, in the horizontal routing part we split the attention weight cube token-wise. For an $n$-word sequence we thus have $n$ horizontal capsules in total, each of which generates $H$ vote vectors:
It is worth noting one fundamental difference between the vertical and horizontal capsules: the former are order-independent while the latter are not. This essential difference indicates that simply applying the routing algorithm to all the horizontal capsules would fail to capture the positional relationships among tokens. Meanwhile, the number of capsules in the input layer varies with the sequence length, which rules out the linear transformation method mentioned above. We therefore design a novel positional routing method to leverage this salient information.
Rather than applying the routing algorithm all at once, we perform a partial routing for each horizontal capsule, implicitly encoding the sequential information into the output capsules.
As shown in Figure 4, for an $n$-word input sequence, $n$ partial routings are needed in total. In the $t$-th partial routing, only the top $t$ horizontal capsules of the attention cube are involved in the aggregation, meaning that only the information of the first $t$ tokens is used. More concretely, for the $t$-th horizontal capsule, its output capsule is computed by routing all the capsules with index no greater than $t$:
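The prefix-only partial routing can be sketched as follows; `mean_route` is a deliberately simple placeholder standing in for the dynamic routing step, and all names are assumptions of this sketch:

```python
import numpy as np

def positional_routing(A, route):
    """A: attention cube (H, n, n) viewed token-wise: A[:, t, :] is the
    t-th horizontal capsule. The t-th partial routing only sees the first
    t+1 capsules, so the output implicitly encodes token order.
    `route` maps an array of capsules (num_capsules, H, n) to (H, n)."""
    H, n, _ = A.shape
    out = np.empty_like(A)
    for t in range(n):
        prefix = A[:, : t + 1, :].transpose(1, 0, 2)  # (t+1, H, n) capsules
        out[:, t, :] = route(prefix)
    return out

def mean_route(caps):
    # Placeholder aggregation standing in for dynamic routing.
    return caps.mean(axis=0)
```

Because the first token's partial routing sees only its own capsule, its output equals its input, and later tokens aggregate progressively longer prefixes.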
3.4.3 Masked Routing in Decoder
In the vanilla Transformer model, every encoder and decoder layer applies a multi-head SAN sub-layer (Vaswani et al., 2017). One small modification made in the decoder stack is a forward mask, which prevents the model from extracting information from not-yet-predicted tokens. Similarly, we apply a forward mask to the attention weight cube on each head before the routing step. We also remove the vertical routing part in the decoder, since information would otherwise still be exchanged among different tokens in the softmax step of each routing iteration.
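The forward masking applied before routing can be sketched as a lower-triangular mask on the attention cube (the function name and zero-masking convention are assumptions of this sketch; in practice masking is often done with large negative values before a softmax):

```python
import numpy as np

def forward_mask(A):
    """Zero out attention weights pointing at not-yet-predicted tokens
    before routing, so no future information leaks through the softmax
    steps inside the routing iterations."""
    H, n, _ = A.shape
    mask = np.tril(np.ones((n, n)))   # token i may attend only to j <= i
    return A * mask[None, :, :]
```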
| System | Architecture | Zh-En | En-De |
|---|---|---|---|
| *Existing NMT Systems* | | | |
| Wu et al. (2016) | RNN with 8 layers | - | 26.30 |
| Gehring et al. (2017) | CNN with 15 layers | - | 26.36 |
| Vaswani et al. (2017) | Transformer-Base | - | 27.30 |
| Hassan et al. (2018) | Transformer-Big | 24.20 | - |
| Li et al. (2019) | Transformer-Base + Effective Aggregation | 24.68 | 27.98 |
| | Transformer-Big + Effective Aggregation | 25.00 | 28.96 |
| *Our NMT Systems* | | | |
Our proposed model is evaluated on the widely used WMT17 Chinese-to-English (Zh-En) and WMT14 English-to-German (En-De) datasets, which consist of 20.6M and 4.6M sentence pairs in total, respectively. For the Zh-En task, we found that keeping only sentence pairs shorter than 50 tokens during training and validation reduces the vocabulary size without hurting model performance. We use newsdev2017 as the validation set and newstest2017 as the test set. For the En-De task, we use all sentence pairs for training, with newstest2013 as the validation set and newstest2014 as the test set. To further decrease the vocabulary size, we apply byte-pair encoding (BPE) (Sennrich et al., 2016) to the training data with 32K merge operations for both the WMT17 and WMT14 corpora.
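The length filtering described above can be sketched as follows; the function name and whitespace tokenization are illustrative assumptions, not the paper's preprocessing script:

```python
def filter_pairs(pairs, max_len=50):
    """Keep only sentence pairs whose source and target sides are both
    shorter than max_len tokens, as in the Zh-En preprocessing above."""
    return [(src, tgt) for src, tgt in pairs
            if len(src.split()) < max_len and len(tgt.split()) < max_len]
```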
Our proposed capsule-Transformer is implemented on top of the Transformer architecture (Vaswani et al., 2017). For the hyper-parameter configuration of both the Base and Big models, we follow their setup to train our baseline models on the Zh-En and En-De tasks. The Transformer-Base and Big models differ in word embedding size (512 vs. 1024), number of attention heads (8 vs. 16) and feed-forward network dimensionality (2048 vs. 4096). For the Big model, we set the dropout rate to 0.3 (vs. 0.1 for the Base model) to prevent over-fitting. For the Base model, we cap the batch size at 2048 tokens and accumulate gradients for 12 steps before back-propagation; for the Big model, these are set to 1024 tokens per batch and 24 accumulation steps. Both the baseline and our capsule-Transformer are implemented with OpenNMT-py (Klein et al., 2017). We use the case-sensitive 4-gram BLEU score (Papineni et al., 2002) to evaluate our models and compare them with existing ones. We train our Base and Big models on 2 and 3 NVIDIA GeForce GTX 1080Ti GPUs, respectively.
4.2 Main Results
Table 1 lists the main results on both the WMT17 Chinese-to-English (Zh-En) and WMT14 English-to-German (En-De) datasets. Our capsule-Transformer consistently improves performance across both language pairs and model variants, showing the effectiveness and generalization ability of our approach. On WMT17 Zh-En, our model outperforms all the listed models; notably, even our capsule-Transformer-Base achieves a higher score than the competing Big model. On WMT14 En-De, our model outperforms the corresponding baseline but falls short of the Big model of Li et al. (2019). Considering that their model introduces over 33M new parameters (vs. 1.6K for our Big model) and uses a much larger training batch size than ours (4096 vs. 1024), we believe our model would achieve a more promising score under equal conditions.
We conduct extensive analysis experiments on our capsule-Transformer to better evaluate the effects of each model component. All the results below are produced with the Transformer-Base model setup on WMT17 Zh-En task.
Effect on Transformer Components
To evaluate the effect of the capsule routing SAN in the encoder and decoder, we perform an ablation study. As shown in Table 4, both encoder and decoder benefit from our capsule routing SAN. Notably, the modified decoder still outperforms the baseline even though its vertical routing part is removed, which demonstrates the effectiveness of our model. Row 4 shows the complementarity of the encoder and decoder with capsule routing SAN.
Effect of Different Routing Parts
To compare the importance of the vertical and horizontal routing parts in the capsule routing SAN, we evaluate models with either part removed from the encoder. As shown in Table 4, both vertical and horizontal routing enhance the model. The two routing parts achieve nearly the same score, showing that it is meaningful to re-organize the attention cube from these two separate perspectives.
Effect on Different Layers
Since the deep layered Transformer has been found to exhibit a hierarchical pattern of captured information (Raganato and Tiedemann, 2018; Peters et al., 2018), it is necessary to explore how our capsule routing SAN works on different layers. As shown in Table 4, although our approach improves performance on both higher and lower layers, it works better on the lower layers.
To better understand our model's ability to obtain deep contextualized information, we randomly sample a sentence and visualize the attention weights of each head from the top encoder layer. As shown in Figure 5, compared to the attention distribution of the vanilla Transformer, our model clearly captures more contextualized features on each head.
5 Related Work
The attention mechanism has become a standard component of modern neural machine translation models since it was first introduced, as additive attention, by Bahdanau et al. (2015). Later, Luong et al. (2015) applied a dot-product attention method, which was inherited by Vaswani et al. (2017) in their influential work on the Transformer. Although the Transformer has advanced the state of the art on various NLP tasks, its deliberately simple structure suggests that its potential may not yet be fully exploited.
Adding extra information to the Transformer is the most intuitive way to enhance model performance. To alleviate the lack of positional information, Shaw et al. (2018) extended the SAN with relative positional information. Shen et al. (2018) applied a directional mask method to encode “forward”, “backward” and “local” information into the SAN, as did Cui et al. (2019). To model the localness of sentences, Yang et al. (2018) added a Gaussian bias to the attention weight vectors. Xiao et al. (2019) presented a lattice-based Transformer which integrates flexible segmentations into the encoder and the SAN.
However, nearly all this enhancement work can be viewed as adding hand-crafted features, implicitly or explicitly, to the SAN, which to some extent risks losing generalization ability. Another choice is to encode additional information through extra networks connected to the Transformer, letting the model learn deeper features. To better model recurrence, Hao et al. (2019) added an RNN encoder whose output embedding is incorporated with the original one. Wang et al. (2019) added a constraint from an extra bidirectional Transformer encoder to lead the attention heads to follow tree structures.
However, networks with different structures may have unpredictable effects when combined. Some researchers therefore strengthen the Transformer by modifying its original structure. Yang et al. (2019a) leveraged deep and global context information to compute better attention between tokens. Guo et al. (2019) proposed a novel SAN design which simplifies the connections among attention heads. Yang et al. (2019b) used a CNN-like structure in the SAN to model localness by restricting the context size. All these studies focus on improving the SAN due to its central role in the Transformer. However, they still view the elements in the SAN as single scalars rather than in a more contextualized way. Different from all existing work, we consider a generalized SAN design that inherits the idea of the capsule network and takes the SAN as its special case, thereby extending the vanilla Transformer into a more general form.
Capsule Networks for NLP
The capsule network was introduced to NLP for classification tasks (Yang et al., 2018; Chen and Qian, 2019), since its capsule clustering mechanism can be used there with little modification. Its information aggregation mechanism has also been used to encode input sequences of fixed size (Gong et al., 2018) and to obtain better encoder hidden states by aggregating information from different layers (Dou et al., 2019). For the SAN, Li et al. (2019) applied the routing algorithm to the concatenated output representation without changing the SAN structure. Recently, Liu et al. (2019) combined the Transformer and the capsule network for stock movement prediction; however, they only stacked Transformer encoders in a capsule way rather than deeply integrating the two networks.
Different from all these improvements over the original Transformer, we adopt a deep architectural revision by generalizing the self-attention mechanism, which empowers the Transformer most, into a form of capsule routing. Notably, we introduce almost no new parameters for this design improvement, preserving the simplicity of the original Transformer.
In this paper, we propose the capsule-Transformer, which extends the linear transformation of self-attention in the vanilla Transformer into a more general capsule routing algorithm by taking the SAN as a special case of a capsule network. The resulting capsule-Transformer can obtain a better attention distribution representation of the input sequence via information aggregation among different heads and words. Although we verify the capsule-Transformer only on neural machine translation, where it already shows superiority over the strong Transformer baseline, the proposed architecture has broad application prospects across NLP tasks, as the Transformer itself does.
- Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015.
- Transfer capsule network for aspect level sentiment classification. In Proceedings of ACL 2019, pp. 547–556.
- Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of EMNLP 2014, pp. 1724–1734.
- Attention-based models for speech recognition. In Proceedings of NIPS 2015, pp. 577–585.
- Mixed multi-head self-attention for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 206–214.
- Dynamic layer aggregation for neural machine translation with routing-by-agreement. In Proceedings of AAAI 2019, pp. 86–93.
- Convolutional sequence to sequence learning. In Proceedings of ICML 2017, pp. 1243–1252.
- Information aggregation via dynamic routing for sequence encoding. arXiv preprint arXiv:1806.01501.
- Star-Transformer. In Proceedings of NAACL 2019, pp. 1315–1325.
- Modeling recurrence for Transformer. In Proceedings of NAACL 2019, pp. 1198–1207.
- Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567.
- Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
- OpenNMT: open-source toolkit for neural machine translation. In Proceedings of ACL 2017, System Demonstrations, pp. 67–72.
- Information aggregation for multi-head attention with routing-by-agreement. In Proceedings of NAACL 2019, pp. 3566–3575.
- A structured self-attentive sentence embedding. In Proceedings of ICLR 2017.
- Transformer-based capsule network for stock movement prediction. In Proceedings of the First Workshop on Financial Technology and Natural Language Processing, pp. 66–73.
- Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015, pp. 1412–1421.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL 2002, pp. 311–318.
- Deep contextualized word representations. In Proceedings of NAACL 2018, pp. 2227–2237.
- An analysis of encoder representations in Transformer-based machine translation. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP, pp. 287–297.
- Dynamic routing between capsules. In Proceedings of NIPS 2017, pp. 3856–3866.
- Neural machine translation of rare words with subword units. In Proceedings of ACL 2016, pp. 1715–1725.
- Self-attention with relative position representations. In Proceedings of NAACL 2018, pp. 464–468.
- DiSAN: directional self-attention network for RNN/CNN-free language understanding. In Proceedings of AAAI 2018, pp. 5446–5455.
- Attention is all you need. In Proceedings of NIPS 2017, pp. 5998–6008.
- Tree Transformer: integrating tree structures into self-attention. In Proceedings of EMNLP-IJCNLP 2019, pp. 1061–1070.
- Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Lattice-based Transformer encoder for neural machine translation. In Proceedings of ACL 2019, pp. 3090–3097.
- Context-aware self-attention networks. In Proceedings of AAAI 2019, pp. 387–394.
- Modeling localness for self-attention networks. In Proceedings of EMNLP 2018, pp. 4449–4458.
- Convolutional self-attention networks. In Proceedings of NAACL 2019, pp. 4040–4045.
- Investigating capsule networks with dynamic routing for text classification. In Proceedings of EMNLP 2018, pp. 3110–3119.
- Hierarchical attention networks for document classification. In Proceedings of NAACL 2016, pp. 1480–1489.