1 Introduction
The self-attention network (SAN) Lin et al. (2017)
has shown promising empirical results in various natural language processing (NLP) tasks, such as machine translation
Vaswani et al. (2017); Shaw et al. (2018), natural language inference Shen et al. (2018a), and acoustic modeling Sperber et al. (2018). One strength of SAN is its ability to capture long-range dependencies by explicitly attending to all the signals, which allows the model to build a direct relation with any long-distance representation. The performance of SAN can be further improved by performing multiple attention functions in parallel (multi-headed attention), which allows the model to jointly attend to information from different representation subspaces at different positions Vaswani et al. (2017). Although SAN has achieved significant improvements, it has two major limitations. First, SAN fully takes into account all the signals with a weighted sum operation, which disperses the attention distribution and may result in overlooking the relations among neighboring signals Yang et al. (2018). Second, multi-headed attention performs the attention heads independently, which misses the opportunity to exploit useful interactions across attention heads.
To address these problems, we propose a novel convolutional self-attention network (CSAN), leveraging the power of CNNs at modeling localness for SAN. Specifically, the attention function can be viewed as a filter with dynamic weights conditioned on the context. Focusing on local information, we first present a 1-dimensional CSAN model (1D-CSAN) by restricting the attentive scope to a window of neighboring elements. We then break the boundaries of attention heads by introducing a 2D-CSAN model to exploit knowledge across different semantic subspaces. Another advantage of the proposed approach is that it models locality without introducing any new parameters.
We conducted experiments on the WMT14 English-German translation task. Experimental results show that our approach improves translation performance over the strong Transformer Vaswani et al. (2017) baseline and other related models. In addition, extensive analysis demonstrates that this is a win-win approach, simultaneously capturing local information and modeling dependencies across heads.
2 Background
2.1 Self-Attention Model
Aiming at building direct pairwise relevance, SAN calculates attention weights between each pair of elements in the whole input. Formally, given an input layer X = {x_1, ..., x_I}, the output layer O = {o_1, ..., o_I} can be constructed by attending to the states of X according to their relevance. Besides, Vaswani et al. (2017) pointed out that performing the attention functions in parallel (multi-head mechanism) better captures features from different semantic subspaces. In each head, the weighted sum of the linearly transformed input elements is computed individually. Formally, for an H-head model, the i-th output in the h-th head, o_i^h (h \in [1, H]), can be computed as:

    o_i^h = \sum_{j=1}^{I} \alpha_{i,j}^h (x_j W_V^h)    (1)

where W_V^h denotes a trainable parameter matrix, which is distinct in different heads, and I indicates the number of input elements. A^h = {\alpha_{i,j}^h} is the attention distribution, in which the attention weight \alpha_{i,j}^h represents the relevance between the i-th and j-th elements. It can be calculated as:

    \alpha_{i,j}^h = \frac{\exp(e_{i,j}^h)}{\sum_{k=1}^{I} \exp(e_{i,k}^h)}    (2)

    e_{i,j}^h = \frac{(x_i W_Q^h)(x_j W_K^h)^T}{\sqrt{d_h}}    (3)

in which W_Q^h and W_K^h are parameter matrices and \sqrt{d_h} denotes the scaling factor, with d_h the dimensionality of each head. Finally, the output states of all heads are simply concatenated to produce the final state: o_i = [o_i^1; ...; o_i^H].
As seen, each head is individually performed in its own subspace. We argue that the simple concatenation still misses the opportunity to exploit different features across heads.
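As an illustration, Equations (1)-(3) can be sketched in NumPy as follows. This is a minimal reference implementation, not the paper's code; the per-head projection shapes (H, d, d_h) and function names are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V):
    """Vanilla multi-head self-attention (Equations 1-3).

    X:              (I, d)       input sequence
    W_Q, W_K, W_V:  (H, d, d_h)  per-head projection matrices (assumed shapes)
    Returns (I, H * d_h): concatenation of all head outputs.
    """
    H, d, d_h = W_Q.shape
    outputs = []
    for h in range(H):
        Q = X @ W_Q[h]                 # (I, d_h)
        K = X @ W_K[h]
        V = X @ W_V[h]
        E = Q @ K.T / np.sqrt(d_h)     # energies e_{i,j}^h, Eq. (3)
        A = softmax(E, axis=-1)        # attention weights, Eq. (2)
        outputs.append(A @ V)          # weighted sum over all positions, Eq. (1)
    return np.concatenate(outputs, axis=-1)
```

Note that every query position attends to every key position; the windowed variants below change only the normalization scope, not this overall structure.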
2.2 Motivation
As SAN calculates over all the elements in an input, the normalization in Equation (2) may inhibit attention to neighboring information Yang et al. (2018). In contrast, CNNs have proven to be of profound value for local feature fusion in various NLP tasks Kim (2014); Gehring et al. (2017); Yu et al. (2018). Therefore, we enhance SAN's ability to capture useful neighboring information by borrowing the convolution concept from CNNs. Moreover, as illustrated in Figure 1(a), each attention head is performed individually without considering its relation to other heads. Wu and He (2018) noted that features can be better captured by modeling dependencies across different channels. Motivated by their promising results, we also attempt to extract attention features across multiple heads. We hypothesize that these two modifications to vanilla SAN can cumulatively improve the performance of sequence modeling.
3 Convolutional Self-Attention Networks
Regarding the attention function as a CNN-like filter, we resize (reduce or expand) the attentive scope to a window surrounding each input element. To maintain the flexibility and parallelism of the original SAN model, we propose two approaches: 1) the 1D-CSAN model, as shown in Figure 1(b), in which the window is assigned a width M (1 ≤ M ≤ I) while the height is consistently fixed to one (analogous to a 1-dimensional window sliding within a single head); and 2) the 2D-CSAN model, as shown in Figure 1(c), where the window is additionally assigned a height N (1 ≤ N ≤ H) (analogous to a 2-dimensional rectangle sliding across multiple heads).
3.1 1D Convolutional Self-Attention
Given a window width M, the relevance between the i-th and j-th elements can be normalized over the neighboring elements of x_i, instead of the whole input. Thus, the calculation of the attention weight in Equation (2) can be updated as follows:

    \hat{\alpha}_{i,j}^h = \frac{\exp(e_{i,j}^h)}{\sum_{k=i-\lfloor M/2 \rfloor}^{i+\lfloor M/2 \rfloor} \exp(e_{i,k}^h)}    (4)

where j \in [i - \lfloor M/2 \rfloor, i + \lfloor M/2 \rfloor], and 0 is padded when the index is out of range. Accordingly, the output of the attention operation in Equation (1) can be revised as:

    o_i^h = \sum_{j=i-\lfloor M/2 \rfloor}^{i+\lfloor M/2 \rfloor} \hat{\alpha}_{i,j}^h (x_j W_V^h)    (5)
As seen, the presented approach focuses on summarizing local context without any additional components or parameters. Although the inspiration comes from CNNs, a CNN fuses local features using globally fixed parameters, ignoring the richness of semantic and syntactic information. By contrast, CSAN is able to dynamically model the weights (as shown in Equation (4)) for different pairs of elements. Thus, CSAN is superior to its CNN counterpart in terms of flexibility.
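A minimal NumPy sketch of the windowed normalization in Equations (4)-(5) for a single head. Out-of-window energies are masked with -inf before the softmax, which is an equivalent way to realize the restricted normalization scope; the code is illustrative, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) contributes exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv_self_attention_1d(X, W_Q, W_K, W_V, M):
    """1D-CSAN for one head: attention restricted to M neighboring elements.

    X:              (I, d)    input sequence
    W_Q, W_K, W_V:  (d, d_h)  single-head projections (assumed shapes)
    M:              window width (odd)
    """
    I = X.shape[0]
    d_h = W_Q.shape[1]
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(d_h)            # full energy matrix, Eq. (3)
    # Band mask: position j is visible from i iff |i - j| <= floor(M/2).
    idx = np.arange(I)
    mask = np.abs(idx[:, None] - idx[None, :]) <= (M - 1) // 2
    E = np.where(mask, E, -np.inf)        # out-of-window energies vanish
    A = softmax(E, axis=-1)               # normalized over the window only, Eq. (4)
    return A @ V                          # windowed weighted sum, Eq. (5)
```

When M covers the whole sequence, this reduces exactly to vanilla self-attention, which matches the paper's claim that the approach adds no parameters: only the normalization scope changes.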
3.2 2D Convolutional Self-Attention
Furthermore, we propose the 2D-CSAN model to simultaneously model dependencies among local elements and neighboring subspaces. The 1-dimensional attentive area (1 × M) is expanded to a 2-dimensional rectangle (N × M), which covers both a number of elements and a number of heads. Consequently, the proposed model is allowed to calculate the energy between the i-th element in the h-th head and the j-th element in the s-th head. Thus, the energy calculation in Equation (3) can be updated as:

    e_{i,j}^{h,s} = \frac{(x_i W_Q^h)(x_j W_K^s)^T}{\sqrt{d_h}}    (6)

Accordingly, the energy normalization in Equation (4) and the weighted sum of the elements in Equation (5) can be respectively revised as:

    \hat{\alpha}_{i,j}^{h,s} = \frac{\exp(e_{i,j}^{h,s})}{\sum_{s'=h-\lfloor N/2 \rfloor}^{h+\lfloor N/2 \rfloor} \sum_{k=i-\lfloor M/2 \rfloor}^{i+\lfloor M/2 \rfloor} \exp(e_{i,k}^{h,s'})}    (7)

    o_i^h = \sum_{s=h-\lfloor N/2 \rfloor}^{h+\lfloor N/2 \rfloor} \sum_{j=i-\lfloor M/2 \rfloor}^{i+\lfloor M/2 \rfloor} \hat{\alpha}_{i,j}^{h,s} (x_j W_V^s)    (8)

where s \in [h - \lfloor N/2 \rfloor, h + \lfloor N/2 \rfloor] and j \in [i - \lfloor M/2 \rfloor, i + \lfloor M/2 \rfloor]. Thus, the attention distribution represents the dependencies among heads, and the output of each head covers a different group of features. Note that 2D-CSAN equals 1D-CSAN when N = 1.
Table 1: Translation performance on the WMT14 En-De test set.

Model                                    Param.   Speed  BLEU   Δ
Transformer-Base Vaswani et al. (2017)   88.0M    1.20   27.64  –
 + Rel_Pos Shaw et al. (2018)            +0.1M    0.11   27.94  +0.30
 + Neighbor Sperber et al. (2018)        +0.4M    0.06   27.91  +0.27
 + Local_Hard Luong et al. (2015)        +0.4M    0.06   28.04  +0.40
 + Local_Soft Yang et al. (2018)         +0.8M    0.09   28.11  +0.47
 + Block Shen et al. (2018a)             +6.0M    0.33   27.91  +0.27
 + CNNs Yu et al. (2018)                 +42.6M   0.54   28.02  +0.38
1D-CSAN                                  +0.0M    0.00   28.16  +0.52
2D-CSAN                                  +0.0M    0.06   28.48  +0.84
4 Experiments
4.1 Setup
To verify the effectiveness of our approach, we conducted experiments on the English-German translation task. Following Vaswani et al. (2017) and Shaw et al. (2018), we incorporate the proposed model into the encoder of the state-of-the-art NMT architecture, Transformer. For fair comparison, we also re-implemented other related models in the same framework. Prior studies have shown that modeling locality in lower layers achieves better performance Shen et al. (2018b); Yu et al. (2018); Yang et al. (2018). Therefore, we apply our approach to the first three layers of the Transformer encoder, which stacks 6 SAN-based layers. We use the same configurations for the Base model as used in Vaswani et al. (2017). All the models are implemented on top of an open-source toolkit, THUMT Zhang et al. (2017).
The models are trained on the widely-used WMT2014 English-German (En-De) training corpus, which consists of 4M sentence pairs. We select the same validation and test sets as used in previous work. To alleviate the out-of-vocabulary problem, we employ byte-pair encoding (BPE) Sennrich et al. (2016) to preprocess the data with 32K merge operations. We use the case-sensitive 4-gram NIST BLEU score Papineni et al. (2002) as the evaluation metric.
4.2 Results and Discussion
In this section, we evaluate the effectiveness of the proposed models on the En-De test set. The window sizes of 1D-CSAN and 2D-CSAN are set to the best-performing configurations identified in the ablation study (Section 4.3).
The Effectiveness of CSAN
To make the evaluation convincing, we reproduced the reported results of Vaswani et al. (2017) on the same data. As shown in Table 1, both proposed models outperform the strong baseline without additional parameters. Specifically, 1D-CSAN significantly improves translation performance over the baseline by +0.52 BLEU points, demonstrating the effectiveness of modeling localness for SAN. 2D-CSAN achieves the best performance overall (+0.84 BLEU). This confirms our assumption that simple concatenation is insufficient to fully interact features in different subspaces, and that modeling dependencies across multiple heads is beneficial to the performance of SAN.
Comparison to Other Work
We further compared the proposed model with several existing locality models. We divide the prior explorations on modeling localness into three categories:

From the embedding perspective, Shaw et al. (2018) introduced relative position encoding (“Rel_Pos”), which embeds the relative distance between elements into their representations.

From the attention distribution perspective, “Neighbor” Sperber et al. (2018) and “Local_Hard” Luong et al. (2015) use local Gaussian biases to revise the attention distribution by predicting a window size and a central position, respectively. Yang et al. (2018) combined the two approaches, which corresponds to “Local_Soft”.

Concerning the hard fashion, Shen et al. (2018a) and Yu et al. (2018) suggested allocating hard local scopes, by dividing the input sequence into blocks (“Block”) and by stacking CNN and SAN layers (“CNNs”), respectively.
To the best of our knowledge, this is the first time these approaches have been compared within a single framework. As seen, the proposed 1D-CSAN slightly outperforms the best of the existing approaches, i.e., “Local_Soft”. In addition, 1D-CSAN is superior to its CNN counterpart (“CNNs”), verifying our hypothesis that flexible weights conditioned on the context benefit feature extraction. By further considering the interaction across heads, 2D-CSAN gains a total of +0.84 BLEU points over the vanilla SAN model.
Moreover, the proposed approach requires no additional parameters and only marginally reduces training speed. Note that, although our implementation does not speed up the model, the computational complexity of each head can theoretically be reduced from O(I^2) to O(M·I), which also inhibits the rapid growth of the memory requirement in SAN Shen et al. (2018b).
4.3 Analysis
In this group of experiments, we evaluate the effect of varying the attentive scope size of CSAN, as well as its ability to capture phrasal patterns. The experiments are conducted on the development set.
Ablation Study
As shown in Figure 2, for 1D-CSAN, a local scope covering 11 elements is superior to other settings. This tendency is consistent with Luong et al. (2015) and Yang et al. (2018), who found that 10 is the best window size in their experiments. We then fixed the number of neighboring elements and varied the number of heads N considered by 2D-CSAN. As seen, by considering the features across heads (i.e., N > 1), 2D-CSAN further improves translation quality. However, as N increases further, translation quality drops. A possible reason is that with a smaller N, the model still has the flexibility to learn a different distribution for each head, while a larger N assumes that more heads make “similar contributions” Ba et al. (2016); Wu and He (2018).
Effect of N-gram
One intuition behind our approach is to capture useful phrasal patterns. To evaluate the accuracy of phrase translation, we calculate the improvement of the proposed approaches over multiple N-grams, as shown in Figure 3. All three compared models consistently outperform the baseline on larger granularities, indicating that modeling locality raises the ability of the self-attention model to capture phrasal information. Furthermore, the dependencies across heads are complementary to localness modeling, which reveals the necessity of interacting features from different subspaces.
5 Conclusion
In this paper, we propose a parameter-free convolutional self-attention model to enhance the feature extraction of neighboring elements across multiple heads. Experimental results on the WMT14 En-De translation task demonstrate the effectiveness of the proposed methods. The extensive analyses verify that: 1) modeling locality is necessary for SAN; 2) modeling dependencies across heads can further improve the performance; and 3) to some extent, dynamic weights are superior to their fixed counterparts (i.e., CSAN vs. CNN) for feature extraction. The method is not restricted to the translation task and could potentially be applied to other sequence modeling tasks such as reading comprehension, language inference, and sentence classification.
References
 Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer Normalization. arXiv:1607.06450.
 Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional Sequence to Sequence Learning. In ICML.
 Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP.
 Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A Structured Self-Attentive Sentence Embedding. In ICLR.
 Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In ACL.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.
 Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL.
 Shen et al. (2018a) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018a. DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding. In AAAI.
 Shen et al. (2018b) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018b. Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling. In ICLR.
 Sperber et al. (2018) Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel. 2018. Self-Attentional Acoustic Models. arXiv:1803.09519.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In NIPS.
 Wu and He (2018) Yuxin Wu and Kaiming He. 2018. Group Normalization. arXiv:1803.08494.
 Yang et al. (2018) Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling Localness for SelfAttention Networks. In EMNLP.
 Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. QANet: Combining Local Convolution with Global Self-Attention for Reading Comprehension. In ICLR.
 Zhang et al. (2017) Jiacheng Zhang, Yanzhuo Ding, Shiqi Shen, Yong Cheng, Maosong Sun, Huanbo Luan, and Yang Liu. 2017. THUMT: An Open Source Toolkit for Neural Machine Translation. arXiv:1706.06415.