Self-attention network (SAN) Lin et al. (2017) has shown promising empirical results in various natural language processing (NLP) tasks, such as machine translation Vaswani et al. (2017); Shaw et al. (2018), natural language inference Shen et al. (2018a), and acoustic modeling Sperber et al. (2018). One strength of SAN is its ability to capture long-range dependencies by explicitly attending to all the signals, which allows the model to build a direct relation with a long-distance representation. The performance of SAN can be further improved by performing several attention functions in parallel (multi-head attention), which allows the model to jointly attend to information from different representation subspaces at different positions Vaswani et al. (2017).
Although SAN has achieved significant improvements, it has two major limitations. First, SAN takes all signals into account with a weighted sum operation, which disperses the attention distribution and may cause the model to overlook the relations among neighboring signals Yang et al. (2018). Second, multi-head attention performs the attention heads independently, missing the opportunity to exploit useful interactions across attention heads.
To address these problems, we propose a novel convolutional self-attention network (CSAN), leveraging the power of CNN at modeling localness for SAN. Specifically, the attention function can serve as a filter with dynamic weights conditioned on the context. Focusing on local information, we first present a 1-dimensional CSAN model (1D-CSAN) by restricting the attentive scope to a window of neighboring elements. We then break the boundaries of attention heads by introducing a 2D-CSAN model to exploit knowledge across different semantic subspaces. A further advantage of the proposed approach is that it models locality without introducing any new parameters.
We conducted experiments on the WMT14 English⇒German translation task. Experimental results show that our approach improves translation performance over the strong Transformer Vaswani et al. (2017) baseline and other related models. In addition, extensive analysis demonstrates that the two modifications are complementary: the approach simultaneously captures local information and models dependencies across heads.
2.1 Self-Attention Model
Aiming at building direct pairwise relevance, SAN calculates attention weights between each pair of elements in the whole input. Formally, given an input layer $X = \{x_1, \dots, x_n\}$, the output layer $O = \{o_1, \dots, o_n\}$ is constructed by attending to the states of $X$ according to their relevance. Besides, Vaswani et al. (2017) pointed out that performing the attention function multiple times in parallel (the multi-head mechanism) better captures features from different semantic subspaces. In each head, the weighted sum of the linearly transformed input elements is computed individually. Formally, for an $H$-head model, the $i$-th output in the $h$-th head ($h \in [1, H]$) is computed as:

$$o_i^h = \sum_{j=1}^{n} \alpha_{ij}^h \,(x_j W_V^h) \qquad (1)$$
where $W_V^h$ denotes a trainable parameter matrix, distinct in different heads, and $n$ indicates the number of input elements. $\alpha^h$ is the attention distribution, in which the attention weight $\alpha_{ij}^h$ represents the relevance between the $i$-th and $j$-th elements. It is calculated by normalizing the pairwise energy $e_{ij}^h$:

$$\alpha_{ij}^h = \frac{\exp(e_{ij}^h)}{\sum_{j'=1}^{n} \exp(e_{ij'}^h)} \qquad (2)$$

$$e_{ij}^h = \frac{(x_i W_Q^h)(x_j W_K^h)^{\top}}{\sqrt{d_k}} \qquad (3)$$

in which $W_Q^h$ and $W_K^h$ are parameter matrices and $\sqrt{d_k}$ denotes the scaling factor. Finally, the output states of all heads are simply concatenated to produce the final state:

$$o_i = [\,o_i^1; \dots; o_i^H\,]$$
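For concreteness, the multi-head computation of Equations (1)-(3) can be sketched in NumPy as follows; all sizes and the random weight initialization are illustrative, not the Transformer configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V):
    """Vanilla multi-head self-attention, following Equations (1)-(3).

    X:             input states, shape (n, d_model)
    W_Q, W_K, W_V: per-head projections, each of shape (H, d_model, d_k)
    Returns the concatenation of all head outputs, shape (n, H * d_k).
    """
    H, _, d_k = W_Q.shape
    outputs = []
    for h in range(H):                       # each head runs in its own subspace
        Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]
        e = Q @ K.T / np.sqrt(d_k)           # energy, Equation (3)
        alpha = softmax(e, axis=-1)          # attention distribution, Equation (2)
        outputs.append(alpha @ V)            # weighted sum, Equation (1)
    return np.concatenate(outputs, axis=-1)  # simple concatenation of heads

# Toy usage (n=5 elements, H=4 heads, d_model=16, d_k=8).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))
W_Q, W_K, W_V = (0.1 * rng.standard_normal((4, 16, 8)) for _ in range(3))
out = multi_head_self_attention(X, W_Q, W_K, W_V)  # shape (5, 32)
```

Note that the final concatenation is the only place where the heads meet, which is exactly the limitation discussed below.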
As seen, each head is performed individually in its own subspace. We argue that the simple concatenation still misses the opportunity to exploit different features across heads.
As SAN attends to all the elements in an input, the normalization in Equation (2) may inhibit attention to neighboring information Yang et al. (2018). In contrast, CNN has proven to be of profound value for local feature fusion on various NLP tasks Kim (2014); Gehring et al. (2017); Yu et al. (2018). Therefore, we enhance the ability of SAN to capture useful neighboring information by borrowing the convolution concept from CNN. Moreover, as illustrated in Figure 1(a), each attention head is performed individually, without considering its relation to the other heads. Wu and He (2018) noted that features can be better captured by modeling dependencies across different channels. Motivated by their promising results, we also exploit interactions among multiple attention heads. We hypothesize that these two modifications to the vanilla SAN can cumulatively improve the performance of sequence modeling.
3 Convolutional Self-Attention Networks
Regarding the attention function as a CNN-like filter, we resize (reduce or expand) the attentive scope to a window surrounding each input element. To maintain the flexibility and parallelism of the original SAN model, we propose two approaches: 1) the 1D-CSAN model, as shown in Figure 1(b), in which the window is assigned a width $M$ ($1 \le M \le n$) while the height is fixed to one (analogous to a 1-dimensional window sliding within a single head); and 2) the 2D-CSAN model, as shown in Figure 1(c), in which the window is additionally assigned an unrestricted height $N$ ($1 \le N \le H$, the number of heads), analogous to a 2-dimensional rectangle sliding across multiple heads.
3.1 1D Convolutional Self-Attention
Given a window of width $M$, the relevance between the $i$-th and $j$-th elements can be normalized over the neighboring elements of $x_i$ instead of the whole input. Thus, the calculation of the attention weight in Equation (2) is updated as follows:

$$\alpha_{ij}^h = \frac{\exp(e_{ij}^h)}{\sum_{j'=i-\lfloor (M-1)/2 \rfloor}^{\,i+\lfloor (M-1)/2 \rfloor} \exp(e_{ij'}^h)} \qquad (4)$$
and 0 is padded when the index is out of range. Accordingly, the output of the attention operation in Equation (1) is revised as:

$$o_i^h = \sum_{j=i-\lfloor (M-1)/2 \rfloor}^{\,i+\lfloor (M-1)/2 \rfloor} \alpha_{ij}^h \,(x_j W_V^h) \qquad (5)$$
As seen, the presented approach focuses on summarizing local context without additional components or parameters. Although the inspiration comes from CNN, a CNN fuses local features with fixed parameters shared across all positions, ignoring the richness of semantic and syntactic information. In contrast, CSAN dynamically models the weights (as shown in Equation (4)) for different pairs of elements. Thus, the CSAN model is superior to its CNN counterpart in terms of flexibility.
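A minimal sketch of a single 1D-CSAN head, implementing the windowed normalization of Equations (4)-(5) by masking out-of-window energies before the softmax (sizes and weights are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def csan_1d_head(X, W_Q, W_K, W_V, M):
    """One 1D-CSAN head: attention restricted to a window of M neighboring
    elements centered on each query position (Equations 4-5)."""
    n, d_k = X.shape[0], W_Q.shape[-1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    e = Q @ K.T / np.sqrt(d_k)                          # full energy matrix
    half = (M - 1) // 2
    idx = np.arange(n)
    in_window = np.abs(idx[None, :] - idx[:, None]) <= half
    e = np.where(in_window, e, -np.inf)                 # zero weight outside the
                                                        # window (zero-padding)
    alpha = softmax(e, axis=-1)                         # normalized over the window
    return alpha @ V

rng = np.random.default_rng(1)
X = rng.standard_normal((6, 16))
W_Q, W_K, W_V = (0.1 * rng.standard_normal((16, 8)) for _ in range(3))
out = csan_1d_head(X, W_Q, W_K, W_V, M=3)  # shape (6, 8)
```

Masking with $-\infty$ reproduces the window-restricted normalization exactly; a production implementation could instead gather only the $M$ in-window keys per query to realize the reduced computational cost.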
3.2 2D Convolutional Self-Attention
Furthermore, we propose a 2D-CSAN model to simultaneously model dependencies among local elements and neighboring subspaces. The 1-dimensional attentive area ($1 \times M$) is expanded to a 2-dimensional area ($N \times M$), which covers both a number of neighboring elements and a number of neighboring heads. Consequently, the proposed model is allowed to calculate the energy between the $i$-th element in the $h$-th head and the $j$-th element in the $s$-th head. Thus, the energy calculation in Equation (3) is updated as:

$$e_{ij}^{h,s} = \frac{(x_i W_Q^h)(x_j W_K^s)^{\top}}{\sqrt{d_k}} \qquad (6)$$
where $s \in [\,h - \lfloor (N-1)/2 \rfloor,\ h + \lfloor (N-1)/2 \rfloor\,]$. Thus, the attention distribution represents the dependencies among heads, and the output of each head covers a different group of features. Note that 2D-CSAN is equal to 1D-CSAN when $N = 1$.
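The 2D extension can be sketched by jointly normalizing one softmax over the $N \times M$ area of (head, position) pairs; clipping at head boundaries and all sizes are our own illustrative choices, not prescribed by the model description:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def csan_2d(X, W_Q, W_K, W_V, M, N):
    """2D-CSAN sketch: each query in head h attends to M neighboring elements
    in each of N neighboring heads, with one softmax over the whole area."""
    H, _, d_k = W_Q.shape
    n = X.shape[0]
    Q = np.einsum('nd,hde->hne', X, W_Q)   # (H, n, d_k)
    K = np.einsum('nd,hde->hne', X, W_K)
    V = np.einsum('nd,hde->hne', X, W_V)
    half_m, half_n = (M - 1) // 2, (N - 1) // 2
    idx = np.arange(n)
    in_window = np.abs(idx[None, :] - idx[:, None]) <= half_m  # element window
    outputs = []
    for h in range(H):
        # heads inside the window, clipped at the boundaries
        heads = [s for s in range(h - half_n, h + half_n + 1) if 0 <= s < H]
        # energies between queries of head h and keys of each in-window head
        e = np.stack([Q[h] @ K[s].T for s in heads]) / np.sqrt(d_k)  # (|S|, n, n)
        e = np.where(in_window[None], e, -np.inf)
        # one softmax over the flattened (head, position) area
        alpha = softmax(e.transpose(1, 0, 2).reshape(n, -1), axis=-1)
        V_area = np.concatenate([V[s] for s in heads], axis=0)       # (|S|*n, d_k)
        outputs.append(alpha @ V_area)
    return np.concatenate(outputs, axis=-1)  # (n, H * d_k)

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 16))
W_Q, W_K, W_V = (0.1 * rng.standard_normal((4, 16, 8)) for _ in range(3))
out = csan_2d(X, W_Q, W_K, W_V, M=3, N=3)  # shape (6, 32)
```

With $N = 1$ the head loop degenerates to the 1D model; the output dimensionality is unchanged, since each head still emits a $d_k$-dimensional state.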
| Model | # Para. | Train | BLEU | Δ |
| --- | --- | --- | --- | --- |
| Transformer-Base Vaswani et al. (2017) | 88.0M | 1.20 | 27.64 | - |
| + Rel_Pos Shaw et al. (2018) | +0.1M | -0.11 | 27.94 | +0.30 |
| + Neighbor Sperber et al. (2018) | +0.4M | -0.06 | 27.91 | +0.27 |
| + Local_Hard Luong et al. (2015) | +0.4M | -0.06 | 28.04 | +0.40 |
| + Local_Soft Yang et al. (2018) | +0.8M | -0.09 | 28.11 | +0.47 |
| + Block Shen et al. (2018a) | +6.0M | -0.33 | 27.91 | +0.27 |
| + CNNs Yu et al. (2018) | +42.6M | -0.54 | 28.02 | +0.38 |
| + 1D-CSAN (ours) | +0.0M | - | 28.16 | +0.52 |
| + 2D-CSAN (ours) | +0.0M | - | 28.48 | +0.84 |

Table 1: Comparison with existing locality-modeling approaches on the WMT14 En⇒De test set. "# Para." denotes parameters added over the baseline; "Train" denotes the change in training speed relative to the baseline.
To verify the effectiveness of our approach, we conducted experiments on the English⇒German translation task. Following Vaswani et al. (2017) and Shaw et al. (2018), we incorporate the proposed model into the encoder of the state-of-the-art NMT architecture, Transformer. For a fair comparison, we also re-implemented the other related models within the same framework. Prior studies have shown that modeling locality in the lower layers achieves better performance Shen et al. (2018b); Yu et al. (2018); Yang et al. (2018). Therefore, we apply our approach to the first three layers of the encoder, which is a stack of 6 SAN-based layers. We use the same configurations for the Base model as Vaswani et al. (2017). All the models are implemented on top of an open-source toolkit, THUMT Zhang et al. (2017).
The models are trained on the widely-used WMT14 English⇒German (En⇒De) training corpus, which consists of 4M sentence pairs. We select the same validation and test sets as used in previous work. To alleviate the out-of-vocabulary problem, we employ byte-pair encoding (BPE) Sennrich et al. (2016) to pre-process the data with 32K merge operations. We use the case-sensitive 4-gram NIST BLEU score Papineni et al. (2002) as the evaluation metric.
4.2 Results and Discussion
In this section, we evaluate the effectiveness of the proposed models on the En⇒De test set. The window size of 1D-CSAN is set to 11 elements, and the attentive area of 2D-CSAN additionally covers neighboring heads; both settings are selected on the development set.
The Effectiveness of CSAN
To make the evaluation convincing, we reproduced the results reported in Vaswani et al. (2017) on the same data. As shown in Table 1, both proposed models outperform the strong baseline without additional parameters. Specifically, 1D-CSAN significantly improves translation performance over the baseline by +0.52 BLEU points, demonstrating the effectiveness of modeling localness for SAN. 2D-CSAN achieves the best performance overall (+0.84 BLEU). This confirms our assumption that simple concatenation is insufficient to fully exploit interactions among the features in different subspaces: modeling dependencies across multiple heads is beneficial to the performance of SAN.
Comparison to Other Work
We further compared the proposed model with several existing locality models. We divided the prior explorations on modeling localness into three categories:
From the embedding perspective, Shaw et al. (2018) introduced relative position encoding ("Rel_Pos"), which embeds the relative distance into the representations.
From the attention distribution perspective, "Neighbor" Sperber et al. (2018) and "Local_Hard" Luong et al. (2015) use local Gaussian biases to revise the attention distribution by predicting a window size and a central position, respectively. Yang et al. (2018) combined the two approaches, corresponding to "Local_Soft".
Concerning the hard fashion, Shen et al. (2018a) and Yu et al. (2018) suggested allocating hard local scopes by dividing the input sequence into blocks ("Block") and by stacking CNN and SAN layers ("CNNs"), respectively.
To our knowledge, this is the first time these approaches have been compared within the same framework. As seen, the proposed 1D-CSAN even slightly outperforms the best of the existing approaches, i.e., "Local_Soft". In addition, the proposed 1D-CSAN is superior to its CNN counterpart ("CNNs"), verifying our hypothesis that flexible weights conditioned on the context benefit feature extraction. By further considering the interaction across heads, 2D-CSAN gains a total of 0.84 BLEU points over the vanilla SAN model.
Moreover, the proposed approach requires no additional parameters and only marginally reduces the training speed. Note that, although our implementation does not speed up the model, the computational complexity of each head can theoretically be reduced from $O(n^2)$ to $O(nM)$, which also inhibits the rapid growth of the memory requirement in SAN Shen et al. (2018b).
In this group of experiments, we evaluate the effect of varying the attentive scope size of CSAN, as well as its ability to capture phrasal patterns. The experiments are conducted on the development set.
As shown in Figure 2, for 1D-CSAN, a local scope covering 11 elements is superior to the other settings. This tendency is consistent with Luong et al. (2015) and Yang et al. (2018), who found that 10 is the best window size in their experiments. We then fixed the number of neighboring elements and varied the number of heads to be considered. As seen, by considering the features across heads (i.e., $N > 1$), 2D-CSAN further improves translation quality. However, as $N$ increases further, translation quality drops. A possible reason is that with a smaller $N$, the model still has the flexibility to learn a different distribution for each head, while a larger $N$ assumes that more heads make "similar contributions" Ba et al. (2016); Wu and He (2018).
Effect of N-gram
One intuition behind our approach is to capture useful phrase patterns. To evaluate the accuracy of phrase translations, we calculate the improvement of the proposed approaches over the baseline on multiple N-grams, as shown in Figure 3. All three compared models consistently outperform the baseline on larger granularities, indicating that modeling locality raises the ability of the self-attention model to capture phrasal information. Furthermore, the dependencies across heads are complementary to localness modeling, which reveals the necessity of interaction among features in different subspaces.
In this paper, we proposed a parameter-free convolutional self-attention model that enhances feature extraction over neighboring elements and across multiple heads. Experimental results on the WMT14 En⇒De translation task demonstrate the effectiveness of the proposed methods. The extensive analyses verified that: 1) modeling locality is necessary for SAN; 2) modeling dependencies across heads can further improve performance; and 3) to some extent, dynamic weights are superior to their fixed counterpart (i.e., CSAN vs. CNN) for feature extraction. The method is not restricted to the translation task and could potentially be applied to other sequence modeling tasks such as reading comprehension, language inference, and sentence classification.
- Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer Normalization. arXiv:1607.06450.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional Sequence to Sequence Learning. In ICML.
- Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP.
- Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In ICLR.
- Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In ACL.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL.
- Shen et al. (2018a) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018a. DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding. In AAAI.
- Shen et al. (2018b) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018b. Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling. In ICLR.
- Sperber et al. (2018) Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel. 2018. Self-Attentional Acoustic Models. arXiv:1803.09519.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In NIPS.
- Wu and He (2018) Yuxin Wu and Kaiming He. 2018. Group Normalization. arXiv:1803.08494.
- Yang et al. (2018) Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling Localness for Self-Attention Networks. In EMNLP.
- Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining Local Convolution with Global Self-attention for Reading Comprehension. In ICLR.
- Zhang et al. (2017) Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huanbo Luan, and Yang Liu. 2017. THUMT: An Open Source Toolkit for Neural Machine Translation. arXiv:1706.06415.