Convolutional Self-Attention Networks

by   Baosong Yang, et al.
University of Macau

Self-attention networks (SANs) have drawn increasing interest due to their high parallelization in computation and flexibility in modeling dependencies. SANs can be further enhanced with multi-head attention by allowing the model to attend to information from different representation subspaces. In this work, we propose novel convolutional self-attention networks, which give SANs the ability to 1) strengthen dependencies among neighboring elements, and 2) model the interaction between features extracted by multiple attention heads. Experimental results of machine translation on different language pairs and model settings show that our approach outperforms both the strong Transformer baseline and other existing models that enhance the locality of SANs. Compared with prior studies, the proposed model is parameter-free: it introduces no additional parameters.



1 Introduction

Self-attention networks (SANs) Parikh et al. (2016); Lin et al. (2017) have shown promising empirical results in various natural language processing (NLP) tasks, such as machine translation Vaswani et al. (2017), natural language inference Shen et al. (2018a), and acoustic modeling Sperber et al. (2018). One appealing strength of SANs lies in their ability to capture dependencies regardless of distance by explicitly attending to all the elements. In addition, the performance of SANs can be improved by multi-head attention Vaswani et al. (2017), which projects the input sequence into multiple subspaces and applies attention to the representation in each subspace.

Despite their success, SANs have two major limitations. First, the model takes all the elements fully into account, which disperses the attention distribution and thus overlooks the relation of neighboring elements and phrasal patterns Yang et al. (2018); Wu et al. (2018); Guo et al. (2019). Second, multi-head attention extracts distinct linguistic properties from each subspace in a parallel fashion Raganato and Tiedemann (2018), which fails to exploit useful interactions across different heads. Recent work shows that better features can be learned if different sets of representations are present at feature learning time Ngiam et al. (2011); Lin et al. (2014).

To this end, we propose novel convolutional self-attention networks (CSans), which model locality for the self-attention model and interactions between features learned by different attention heads in a unified framework. Specifically, in order to pay more attention to a local part of the input sequence, we restrict the attention scope to a window of neighboring elements. The localness is therefore enhanced via a parameter-free 1-dimensional convolution. Moreover, we extend the convolution to a 2-dimensional area with the additional axis being the attention head. Thus, the proposed model allows each head to exchange local features with its adjacent subspaces at attention time. We expect that the interaction across different subspaces can further improve the performance of SANs.

(a) Vanilla SANs
(b) 1D-Convolutional SANs
(c) 2D-Convolutional SANs
Figure 1: Illustration of (a) vanilla SANs; (b) 1-dimensional convolution with the window size being ; and (c) 2-dimensional convolution with the area being . Different colors represent different subspaces modeled by multi-head attention, and transparent colors denote masked tokens that are invisible to SANs.

We evaluate the effectiveness of the proposed model on three widely-used translation tasks: WMT14 English-to-German, WMT17 Chinese-to-English, and WAT17 Japanese-to-English. Experimental results demonstrate that our approach consistently improves performance over the strong Transformer model Vaswani et al. (2017) across language pairs. Compared with previous work on modeling locality for SANs (e.g. Shaw et al., 2018; Yang et al., 2018; Sperber et al., 2018), our model improves both translation quality and training efficiency.

2 Multi-Head Self-Attention Networks

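SANs produce representations by applying attention to each pair of tokens from the input sequence, regardless of their distance. Vaswani et al. (2017) found it beneficial to capture different contextual features with multiple individual attention functions. Given an input sequence $X = \{x_1, \dots, x_I\} \in \mathbb{R}^{I \times d}$, the model first transforms it into queries $Q$, keys $K$, and values $V$:

$$Q, K, V = X W_Q,\; X W_K,\; X W_V \in \mathbb{R}^{I \times d} \tag{1}$$

where $\{W_Q, W_K, W_V\} \in \mathbb{R}^{d \times d}$ are trainable parameters and $d$ indicates the hidden size. The three types of representations are split into $H$ different subspaces, e.g., $[Q^1, \dots, Q^H] = Q$ with $Q^h \in \mathbb{R}^{I \times \frac{d}{H}}$. In each subspace $h$, the element $o_i^h$ in the output sequence $O^h = \{o_1^h, \dots, o_I^h\}$ is calculated by

$$o_i^h = \mathrm{ATT}(q_i^h, K^h)\, V^h \in \mathbb{R}^{\frac{d}{H}} \tag{2}$$

where $\mathrm{ATT}(\cdot)$ is an attention model Bahdanau et al. (2015); Vaswani et al. (2017) that retrieves the keys $K^h$ with the query $q_i^h$. The final output representation $O$ is the concatenation of outputs generated by multiple attention models:

$$O = [O^1, \dots, O^H] \in \mathbb{R}^{I \times d} \tag{3}$$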
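The multi-head computation described in this section can be illustrated with a minimal pure-Python sketch. It assumes scaled dot-product scoring for the attention model, which is one common choice and not necessarily the exact function used in the experiments; the helper names (`matmul`, `split_heads`, `multi_head_self_attention`) and the random toy inputs are hypothetical and exist only for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_heads(m, num_heads):
    """Split an I x d matrix into num_heads matrices of shape I x (d / num_heads)."""
    width = len(m[0]) // num_heads
    return [[row[h * width:(h + 1) * width] for row in m] for h in range(num_heads)]


def multi_head_self_attention(x, w_q, w_k, w_v, num_heads):
    """Vanilla multi-head self-attention (scaled dot-product scoring is an assumption)."""
    # Project the input into queries, keys and values.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    heads = zip(split_heads(q, num_heads), split_heads(k, num_heads), split_heads(v, num_heads))
    outputs = []
    for q_h, k_h, v_h in heads:
        scale = math.sqrt(len(q_h[0]))
        out_h = []
        for q_i in q_h:
            # Each query attends to every position of the sequence.
            weights = softmax([sum(a * b for a, b in zip(q_i, k_j)) / scale for k_j in k_h])
            out_h.append([sum(w * v_j[c] for w, v_j in zip(weights, v_h))
                          for c in range(len(v_h[0]))])
        outputs.append(out_h)
    # Concatenate the per-head outputs along the feature dimension.
    return [sum((outputs[h][i] for h in range(num_heads)), []) for i in range(len(x))]


# Toy usage with random inputs: 5 tokens, hidden size 8, 4 heads.
random.seed(0)
length, hidden, num_heads = 5, 8, 4
x = [[random.gauss(0, 1) for _ in range(hidden)] for _ in range(length)]
w_q, w_k, w_v = ([[random.gauss(0, 0.1) for _ in range(hidden)] for _ in range(hidden)]
                 for _ in range(3))
out = multi_head_self_attention(x, w_q, w_k, w_v, num_heads)
print(len(out), len(out[0]))  # 5 8
```

Note that each head is computed in isolation and the heads only meet at the final concatenation, which is the property the 2D extension in Section 3.2 targets.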


3 Approach

As shown in Figure 1(a), the vanilla SANs use the query to compute a categorical distribution over all elements of the keys (Equation 2). This may inhibit the attention paid to neighboring information Yu et al. (2018); Yang et al. (2018); Guo et al. (2019). In this work, we propose to model locality for SANs by restricting the model to attend to a local region via convolution operations (1D-CSans, Figure 1(b)). Accordingly, it provides distance-aware information (e.g. phrasal patterns), which is complementary to the distance-agnostic dependencies modeled by the standard SANs (Section 3.1).

Moreover, the calculation of each output is restricted to a single subspace, overlooking the richness of contexts and the dependencies among groups of features, which have proven beneficial to feature learning Ngiam et al. (2011); Wu and He (2018). We thus propose to convolve the items in adjacent heads (2D-CSans, Figure 1(c)). The proposed model is expected to improve performance by letting linguistic properties interact across heads (Section 3.2).

3.1 Locality Modeling via 1D Convolution

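For each query $q_i^h$, we restrict its attention region (e.g., $K^h = \{k_1^h, \dots, k_i^h, \dots, k_I^h\}$) to a local scope with a fixed size $M+1$ ($M \le I$) centered at the position $i$:

$$\hat{K}^h = \{k^h_{i-\frac{M}{2}}, \dots, k^h_i, \dots, k^h_{i+\frac{M}{2}}\} \tag{4}$$

$$\hat{V}^h = \{v^h_{i-\frac{M}{2}}, \dots, v^h_i, \dots, v^h_{i+\frac{M}{2}}\} \tag{5}$$

Accordingly, the calculation of the corresponding output in Equation (2) is modified as:

$$o_i^h = \mathrm{ATT}(q_i^h, \hat{K}^h)\, \hat{V}^h \tag{6}$$

As seen, SANs are only allowed to attend to the neighboring tokens (e.g., $\hat{K}^h$, $\hat{V}^h$), instead of all the tokens in the sequence (e.g., $K^h$, $V^h$).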
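The window restriction can be sketched for a single head as follows. This is an illustration under stated assumptions rather than the authors' implementation: the parameter `m` (so that the window covers `m + 1` positions), the scaled dot-product scoring, and the choice to clip the window at the sequence boundaries are all assumptions made here, and the function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_window(i, length, m):
    """Positions of the (m + 1)-wide window centered at i, clipped to the sequence."""
    half = m // 2
    return range(max(0, i - half), min(length, i + half + 1))


def local_attention_1d(q_h, k_h, v_h, m):
    """One head of window-restricted self-attention.

    q_h, k_h, v_h are I x d_h matrices (lists of rows). No parameters are
    introduced: only the set of visible keys and values changes.
    """
    scale = math.sqrt(len(q_h[0]))
    out = []
    for i, q_i in enumerate(q_h):
        idx = list(local_window(i, len(k_h), m))
        # Normalize only over the keys inside the window.
        weights = softmax([dot(q_i, k_h[j]) / scale for j in idx])
        out.append([sum(w * v_h[j][c] for w, j in zip(weights, idx))
                    for c in range(len(v_h[0]))])
    return out


# Toy usage: 4 tokens, head size 2, window of 3 positions (m = 2).
q = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5], [0.4, 0.4]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
local_out = local_attention_1d(q, q, v, m=2)
print(len(local_out), len(local_out[0]))  # 4 2
```

Setting `m` to at least twice the sequence length recovers ordinary attention over all positions, while `m = 0` makes every token attend only to itself.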

The SAN-based models are generally implemented as multiple layers, in which higher layers tend to learn semantic information while lower layers capture surface and lexical information Peters et al. (2018); Raganato and Tiedemann (2018). Therefore, we apply locality modeling only to the lower layers, which is the same configuration as in Yu et al. (2018) and Yang et al. (2018). In this way, the representations are learned in a hierarchical fashion Yang et al. (2017). That is, the distance-aware and local information extracted by the lower SAN layers is expected to complement the distance-agnostic and global information captured by the higher SAN layers.

3.2 Attention Interaction via 2D Convolution

The multi-head mechanism allows different heads to capture distinct linguistic properties Raganato and Tiedemann (2018); Li et al. (2018), especially in diverse local contexts Sperber et al. (2018); Yang et al. (2018). We hypothesize that exploiting local properties across heads can further improve the performance of SANs. To this end, we expand the 1-dimensional window to a 2-dimensional area with the new dimension being the index of the attention head. Suppose that the area size is $(N+1) \times (M+1)$ ($N \le H$), the keys and values in the area are:

$$\tilde{K}^h = \bigcup\, [\hat{K}^{h-\frac{N}{2}}, \dots, \hat{K}^h, \dots, \hat{K}^{h+\frac{N}{2}}] \tag{7}$$

$$\tilde{V}^h = \bigcup\, [\hat{V}^{h-\frac{N}{2}}, \dots, \hat{V}^h, \dots, \hat{V}^{h+\frac{N}{2}}] \tag{8}$$

where $\hat{K}^h$, $\hat{V}^h$ are elements in the $h$-th subspace, which are calculated by Equations 4 and 5 respectively. The union operation $\bigcup$ means combining the keys and values in different subspaces. The corresponding output is calculated as:

$$o_i^h = \mathrm{ATT}(q_i^h, \tilde{K}^h)\, \tilde{V}^h \tag{9}$$

The 2D convolution allows SANs to build relevance between elements across adjacent heads, and thus to flexibly extract local features from different subspaces rather than merely from a single head.
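A minimal sketch of the 2-dimensional variant is given below. It assumes that the keys and values gathered from the neighboring heads are scored by one query and normalized with a single softmax over the whole area, that scoring is scaled dot-product, and that the area is clipped at the first and last head and at the sequence boundaries. These choices, the parameter names `m` and `n`, and the function names are illustrative assumptions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def local_attention_2d(q, k, v, m, n):
    """Area-restricted attention across tokens and heads.

    q, k, v are indexed as [head][position][feature]. Each query sees
    (m + 1) neighboring positions in each of (n + 1) neighboring heads.
    """
    num_heads, length = len(q), len(q[0])
    scale = math.sqrt(len(q[0][0]))
    out = []
    for h in range(num_heads):
        head_idx = range(max(0, h - n // 2), min(num_heads, h + n // 2 + 1))
        out_h = []
        for i in range(length):
            tok_idx = range(max(0, i - m // 2), min(length, i + m // 2 + 1))
            # Combine the local keys and values of the adjacent heads.
            area = [(g, j) for g in head_idx for j in tok_idx]
            weights = softmax([dot(q[h][i], k[g][j]) / scale for g, j in area])
            out_h.append([sum(w * v[g][j][c] for w, (g, j) in zip(weights, area))
                          for c in range(len(v[0][0]))])
        out.append(out_h)
    return out


# Toy usage: 2 heads, 2 tokens, head size 2, area of 3 heads x 3 positions.
q = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
v = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
area_out = local_attention_2d(q, q, v, m=2, n=2)
print(len(area_out), len(area_out[0]), len(area_out[0][0]))  # 2 2 2
```

With `n = 0` the sketch reduces to the per-head window attention of Section 3.1, which makes the head axis the only difference between the two variants.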

The vanilla SAN models linearly aggregate features from different heads, and this procedure limits the extent of abstraction Fukui et al. (2016); Li et al. (2019). Multiple sets of representations presented at feature learning time can further improve the expressivity of the learned features Ngiam et al. (2011); Wu and He (2018).

| Model | Parameter | Speed | BLEU | Δ |
|---|---|---|---|---|
| Transformer-Base Vaswani et al. (2017) | 88.0M | 1.28 | 27.31 | - |
| + Bi_Direct Shen et al. (2018a) | +0.0M | -0.00 | 27.58 | +0.27 |
| + Rel_Pos Shaw et al. (2018) | +0.1M | -0.11 | 27.63 | +0.32 |
| + Neighbor Sperber et al. (2018) | +0.4M | -0.06 | 27.60 | +0.29 |
| + Local_Hard Luong et al. (2015) | +0.4M | -0.06 | 27.73 | +0.42 |
| + Local_Soft Yang et al. (2018) | +0.8M | -0.09 | 27.81 | +0.50 |
| + Block Shen et al. (2018b) | +6.0M | -0.33 | 27.59 | +0.28 |
| + Cnns Yu et al. (2018) | +42.6M | -0.54 | 27.70 | +0.39 |
| + 1D-CSans | +0.0M | -0.00 | 27.86 | +0.55 |
| + 2D-CSans | +0.0M | -0.06 | 28.18 | +0.87 |

Table 1: Comparison with existing approaches on the WMT14 En⇒De translation task. For a fair comparison, we re-implemented the existing locality approaches under the same framework. “Parameter” denotes the number of model parameters (M = million) and “Speed” denotes the training speed (steps/second). The “Δ” column denotes performance improvements over the Transformer baseline.

4 Related Work

Self-Attention Networks

Recent studies have shown that SANs can be further improved by capturing complementary information. For example, Chen et al. (2018) and Hao et al. (2019) complemented SANs with recurrence modeling, while Yang et al. (2019) modeled contextual information for SANs.

Concerning modeling locality for SANs, Yu et al. (2018) injected several CNN layers Kim (2014) to fuse local information, the output of which is fed to the subsequent SAN layer. Several studies proposed to revise the attention distribution with a parametric localness bias, and succeeded on machine translation Yang et al. (2018) and natural language inference Guo et al. (2019). While both models introduce additional parameters, our approach is a more lightweight solution that introduces no new parameters. Closely related to this work, Shen et al. (2018a) applied a positional mask to encode temporal order, which only allows SANs to attend to the previous or following tokens in the sequence. In contrast, we employ a positional mask (i.e. the tokens outside the local window are masked as ) to encode the distance-aware local information.

In the context of distance-aware SANs, Shaw et al. (2018) introduced relative position encoding to consider the relative distances between sequence elements. While they modeled locality through position embeddings, we improve locality modeling by revising the attention scope. To make a fair comparison, we re-implemented the above approaches under the same framework. Empirical results on machine translation tasks show the superiority of our approach in both translation quality and training efficiency.

Multi-Head Attention

The multi-head attention mechanism Vaswani et al. (2017) employs different attention heads to capture distinct features Raganato and Tiedemann (2018). Along this direction, Shen et al. (2018a) explicitly used multiple attention heads to model different dependencies of the same word pair, and Strubell et al. (2018) employed different attention heads to capture different linguistic features. Li et al. (2018) introduced disagreement regularizations to encourage diversity among attention heads. Inspired by recent successes in fusing information across layers Dou et al. (2018, 2019), Li et al. (2019) proposed to aggregate information captured by different attention heads. Based on these findings, we model interactions among attention heads to exploit the richness of local properties distributed in different heads.

5 Experiments

We conducted experiments with the Transformer model Vaswani et al. (2017) on English⇒German (En⇒De), Chinese⇒English (Zh⇒En) and Japanese⇒English (Ja⇒En) translation tasks. For the En⇒De and Zh⇒En tasks, the models were trained on the widely-used WMT14 and WMT17 corpora, consisting of around and million sentence pairs, respectively. Concerning Ja⇒En, we followed Morishita et al. (2017) and used the first two sections of the WAT17 corpus as the training data, which consists of 2M sentence pairs. To reduce the vocabulary size, all the data were tokenized and segmented into subword symbols using byte-pair encoding Sennrich et al. (2016) with 32K merge operations. Following Shaw et al. (2018), we incorporated the proposed model into the encoder, which is a stack of 6 SAN layers. Since prior studies revealed that modeling locality in lower layers can achieve better performance Shen et al. (2018b); Yu et al. (2018); Yang et al. (2018), we applied our approach to the lowest three layers of the encoder. For the NMT model configurations, we used the same Base and Big settings as Vaswani et al. (2017), and all models were trained on 8 NVIDIA P40 GPUs with a batch of 4096 tokens.

(a) 1D-CSans
(b) 2D-CSans
Figure 2: Effects of (a) window size on 1D-CSans, and (b) attended head numbers on 2D-CSans. For 2D-CSans, the window size dimension is fixed to be 11.

5.1 Effects of Window/Area Size

| Model | WMT14 En⇒De Speed | WMT14 En⇒De BLEU | WMT17 Zh⇒En Speed | WMT17 Zh⇒En BLEU | WAT17 Ja⇒En Speed | WAT17 Ja⇒En BLEU |
|---|---|---|---|---|---|---|
| Transformer-Base | 1.28 | 27.31 | 1.21 | 24.13 | 1.33 | 28.10 |
| + CSans | 1.22 | 28.18 | 1.16 | 24.80 | 1.28 | 28.50 |
| Transformer-Big | 0.61 | 28.58 | 0.58 | 24.56 | 0.65 | 28.41 |
| + CSans | 0.50 | 28.74 | 0.48 | 25.01 | 0.55 | 28.73 |

Table 2: Experimental results on WMT14 En⇒De, WMT17 Zh⇒En and WAT17 Ja⇒En test sets. “Speed” denotes the training speed (steps/second). “” indicates statistically significant difference from the vanilla self-attention counterpart (), tested by bootstrap resampling Koehn (2004).

We first investigated the effects of window size (1D-CSans) and area size (2D-CSans) on the En⇒De validation set, as plotted in Figure 2. For 1D-CSans, a window size of 11 is superior to other settings. This is consistent with Luong et al. (2015), who found that 10 is the best window size in their local attention experiments. Then, we fixed the number of neighboring tokens at 11 and varied the number of heads. As seen, by considering the features across heads (i.e. ), 2D-CSans further improve the translation quality. However, as the number of attended heads grows further, the translation quality drops. One possible reason is that, with few interactions, the model still has the flexibility to learn a different distribution for each head, whereas a large number of interactions assumes that more heads make “similar contributions” Wu and He (2018).

5.2 Comparison to Related Work

We re-implemented and compared several existing approaches (Section 4) within the same framework. Table 1 lists the results on the En⇒De translation task. As seen, all the models improve translation quality, reconfirming the necessity of modeling locality and distance information. Besides, our models outperform all the existing approaches, indicating the superiority of the proposed method. In particular, CSans achieve better performance than Cnns, revealing that extracting local features with dynamic weights (CSans) is superior to assigning fixed parameters (Cnns). Moreover, while most of the existing approaches (except for Shen et al. (2018a)) introduce new parameters, our methods are parameter-free and thus only marginally affect training efficiency.

5.3 Universality of The Proposed Model

To validate the universality of our approach on MT tasks, we evaluated the proposed approach on different language pairs and model settings. Table 2 lists the results on En⇒De, Zh⇒En and Ja⇒En translation tasks. As seen, our model consistently improves translation performance across language pairs, which demonstrates the effectiveness and universality of the proposed approach. It is encouraging to see that 2D-Convolution with the Base setting yields performance comparable to Transformer-Big.

5.4 Accuracy of Phrase Translation

One intuition behind our approach is to capture useful phrasal patterns via modeling locality. To evaluate the accuracy of phrase translations, we calculate the improvement of the proposed approaches over multiple granularities of n-grams, as shown in Figure 3. Both model variants consistently outperform the baseline on larger granularities, indicating that modeling locality can improve the ability of the self-attention model to capture phrasal information. Furthermore, the dependencies among heads can be complementary to the localness modeling, which reveals the necessity of the interaction of features in different subspaces.
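One plausible way to measure improvement at a given n-gram granularity is to compare clipped n-gram precision, the per-order quantity underlying BLEU, between a baseline and a proposed system. The sketch below illustrates that idea only; the exact metric behind the reported gaps is not specified here, so the function names and the toy sentences (invented purely for illustration, not experimental data) are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypotheses, references, n):
    """Corpus-level clipped n-gram precision (one reference per hypothesis)."""
    matched = total = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # A hypothesis n-gram is credited at most as often as it occurs in the reference.
        matched += sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total += sum(hyp_counts.values())
    return matched / total if total else 0.0


def precision_gap(baseline, proposed, references, max_n=4):
    """Per-order difference in clipped precision: proposed minus baseline."""
    return {n: clipped_precision(proposed, references, n)
               - clipped_precision(baseline, references, n)
            for n in range(1, max_n + 1)}


# Toy sentences invented for illustration only.
references = [["the", "cat", "sat", "on", "the", "mat"]]
baseline = [["the", "cat", "on", "mat", "sat", "the"]]   # right words, wrong order
proposed = [["the", "cat", "sat", "on", "the", "mat"]]
gaps = precision_gap(baseline, proposed, references)
for n in sorted(gaps):
    print(n, round(gaps[n], 3))
```

In this toy case the unigram gap is zero because both outputs contain the right words, while the gap opens up at higher orders, which is the kind of pattern a phrase-level analysis looks for.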

Figure 3: Performance improvement on different n-grams. “Gap of BLEU” denotes the improvement achieved by the proposed models over the baseline.

6 Conclusion

In this paper, we propose a parameter-free convolutional self-attention model to enhance the feature extraction of neighboring elements across multiple heads. Empirical results on machine translation tasks across a variety of language pairs demonstrate the effectiveness and universality of the proposed methods. The extensive analyses suggest that: 1) modeling locality is beneficial to SANs; 2) interacting features across multiple heads at attention time can further improve the performance; and 3) to some extent, the dynamic weights are superior to their fixed counterpart (i.e. CSans vs. Cnns) on local feature extraction.

As our approach is not limited to the task of machine translation, it would be interesting to validate the proposed model in other sequence modeling tasks, such as reading comprehension, language inference, semantic role labeling, sentiment analysis, and sentence classification.


The work was partly supported by the National Natural Science Foundation of China (Grant No. 61672555), the Joint Project of Macao Science and Technology Development Fund and National Natural Science Foundation of China (Grant No. 045/2017/AFJ) and the Multiyear Research Grant from the University of Macau (Grant No. MYRG2017-00087-FST). We thank the anonymous reviewers for their insightful comments.


  • Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In ICLR.
  • Chen et al. (2018) Mia Xu Chen, Orhan Firat, Ankur Bapna, Melvin Johnson, Wolfgang Macherey, George Foster, Llion Jones, Mike Schuster, Noam Shazeer, Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Zhifeng Chen, Yonghui Wu, and Macduff Hughes. 2018. The best of both worlds: Combining recent advances in neural machine translation. In ACL.
  • Dou et al. (2018) Ziyi Dou, Zhaopeng Tu, Xing Wang, Shuming Shi, and Tong Zhang. 2018. Exploiting Deep Representations for Neural Machine Translation. In EMNLP.
  • Dou et al. (2019) Ziyi Dou, Zhaopeng Tu, Xing Wang, Longyue Wang, Shuming Shi, and Tong Zhang. 2019. Dynamic Layer Aggregation for Neural Machine Translation. In AAAI.
  • Fukui et al. (2016) Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, and Marcus Rohrbach. 2016. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In EMNLP.
  • Guo et al. (2019) Maosheng Guo, Yu Zhang, and Ting Liu. 2019. Gaussian Transformer: A Lightweight Approach for Natural Language Inference. In AAAI.
  • Hao et al. (2019) Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, and Zhaopeng Tu. 2019. Modeling Recurrence for Transformer. In NAACL.
  • Kim (2014) Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. In EMNLP.
  • Koehn (2004) Philipp Koehn. 2004. Statistical Significance Tests for Machine Translation Evaluation. In EMNLP.
  • Li et al. (2018) Jian Li, Zhaopeng Tu, Baosong Yang, Michael R. Lyu, and Tong Zhang. 2018. Multi-Head Attention with Disagreement Regularization. In EMNLP.
  • Li et al. (2019) Jian Li, Baosong Yang, Zi-Yi Dou, Xing Wang, Michael R. Lyu, and Zhaopeng Tu. 2019. Information Aggregation for Multi-Head Attention with Routing-by-Agreement. In NAACL.
  • Lin et al. (2014) Min Lin, Qiang Chen, and Shuicheng Yan. 2014. Network in Network. In ICLR.
  • Lin et al. (2017) Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A Structured Self-Attentive Sentence Embedding. In ICLR.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP.
  • Morishita et al. (2017) Makoto Morishita, Jun Suzuki, and Masaaki Nagata. 2017. NTT Neural Machine Translation Systems at WAT 2017. In WAT.
  • Ngiam et al. (2011) Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. 2011. Multimodal Deep Learning. In ICML.
  • Parikh et al. (2016) Ankur Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. In EMNLP.
  • Peters et al. (2018) Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL.
  • Raganato and Tiedemann (2018) Alessandro Raganato and Jörg Tiedemann. 2018. An Analysis of Encoder Representations in Transformer-Based Machine Translation. In EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL.
  • Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-Attention with Relative Position Representations. In NAACL.
  • Shen et al. (2018a) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018a. DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding. In AAAI.
  • Shen et al. (2018b) Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, and Chengqi Zhang. 2018b. Bi-Directional Block Self-Attention for Fast and Memory-Efficient Sequence Modeling. In ICLR.
  • Sperber et al. (2018) Matthias Sperber, Jan Niehues, Graham Neubig, Sebastian Stüker, and Alex Waibel. 2018. Self-Attentional Acoustic Models. In Interspeech.
  • Strubell et al. (2018) Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. In EMNLP.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All You Need. In NIPS.
  • Wu et al. (2018) Wei Wu, Houfeng Wang, Tianyu Liu, and Shuming Ma. 2018. Phrase-level Self-Attention Networks for Universal Sentence Encoding. In EMNLP.
  • Wu and He (2018) Yuxin Wu and Kaiming He. 2018. Group Normalization. arXiv:1803.08494.
  • Yang et al. (2019) Baosong Yang, Jian Li, Derek F. Wong, Lidia S. Chao, Xing Wang, and Zhaopeng Tu. 2019. Context-Aware Self-Attention Networks. In AAAI.
  • Yang et al. (2018) Baosong Yang, Zhaopeng Tu, Derek F. Wong, Fandong Meng, Lidia S. Chao, and Tong Zhang. 2018. Modeling Localness for Self-Attention Networks. In EMNLP.
  • Yang et al. (2017) Baosong Yang, Derek F Wong, Tong Xiao, Lidia S Chao, and Jingbo Zhu. 2017. Towards Bidirectional Hierarchical Representations for Attention-based Neural Machine Translation. In EMNLP.
  • Yu et al. (2018) Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. QANet: Combining Local Convolution with Global Self-attention for Reading Comprehension. In ICLR.