Convolutional Self-Attention Network

10/31/2018 ∙ by Baosong Yang, et al. ∙ Tencent ∙ University of Macau

Self-attention network (SAN) has recently attracted increasing interest due to its fully parallelized computation and flexibility in modeling dependencies. It can be further enhanced with a multi-headed attention mechanism, which allows the model to jointly attend to information from different representation subspaces at different positions (Vaswani et al., 2017). In this work, we propose a novel convolutional self-attention network (CSAN), which offers SAN the abilities to 1) capture neighboring dependencies, and 2) model the interaction between multiple attention heads. Experimental results on the WMT14 English-to-German translation task demonstrate that the proposed approach outperforms both the strong Transformer baseline and other existing work on enhancing the locality of SAN. Compared with previous work, our model does not introduce any new parameters.




1 Introduction

Self-attention network (SAN) Lin et al. (2017) has shown promising empirical results in various natural language processing (NLP) tasks, such as machine translation Vaswani et al. (2017); Shaw et al. (2018), natural language inference Shen et al. (2018a), and acoustic modeling Sperber et al. (2018). One strength of SAN is its ability to capture long-range dependencies by explicitly attending to all the signals, which allows the model to build a direct relation with any other representation, however distant. The performance of SAN can be further improved by running several attention functions in parallel (multi-headed attention), which allows the model to jointly attend to information from different representation subspaces at different positions Vaswani et al. (2017).

Although SAN has achieved significant improvements, it has two major limitations. First, SAN takes all the signals into account with a weighted sum operation. This disperses the attention distribution and may cause the model to overlook the relation between neighboring signals Yang et al. (2018). Second, multi-headed attention computes each attention head independently, which misses the opportunity to exploit useful interactions across attention heads.

To address these problems, we propose a novel convolutional self-attention network (CSAN), which brings the power of CNNs in modeling localness to SAN. Specifically, the attention function can serve as a filter with dynamic weights conditioned on the context. Focusing on local information, we first present a 1-dimensional CSAN model (1D-CSAN) that restricts the attentive scope to a window of neighboring elements. We then break the boundaries between attention heads by introducing a 2D-CSAN model that exploits knowledge across different semantic subspaces. Another advantage of the proposed approach is that it models locality without any new parameters.

We conducted experiments on the WMT14 English-to-German translation task. Experimental results show that our approach improves translation performance over the strong Transformer Vaswani et al. (2017) baseline and other related models. In addition, extensive analysis demonstrates that this is a win-win approach that simultaneously captures local information and models dependencies across heads.

(a) Vanilla SAN
(b) 1D-Convolutional SAN
(c) 2D-Convolutional SAN
Figure 1: Illustration of the proposed models, where the white and grey matrices represent the input and output respectively. For simplicity, the feature axis is omitted. As seen, the vanilla self-attention network (a) calculates the weighted sum of all the elements in a sequence and ignores the dependencies among heads. The proposed 1D- (b) and 2D- (c) Convolutional SAN are each assigned a limited attentive scope. To take features in other heads into account, the 2D variant expands the attentive scope beyond the boundary of a single head.

2 Background

2.1 Self-Attention Model

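Aiming to build direct pairwise relevance, SAN calculates attention weights between each pair of elements in the whole input. Formally, given an input layer $\mathbf{X} = \{\mathbf{x}_1, \dots, \mathbf{x}_I\}$, the output layer $\mathbf{O} = \{\mathbf{o}_1, \dots, \mathbf{o}_I\}$ can be constructed by attending to the states of $\mathbf{X}$ (considering the relevance), where the elements $\mathbf{x}_i, \mathbf{o}_i \in \mathbb{R}^d$. In addition, Vaswani et al. (2017) pointed out that performing the attention functions in parallel (the multi-head mechanism) better captures features from different semantic subspaces. In each head, the weighted sum of the linearly transformed input elements is computed individually. Formally, for a model with $H$ heads, the $i$-th output in the $h$-th head ($\mathbf{o}_i^h$) can be computed as:

$$\mathbf{o}_i^h = \sum_{j=1}^{I} \alpha_{ij}^h \, (\mathbf{x}_j \mathbf{W}_V^h) \qquad (1)$$

where $\mathbf{W}_V^h$ denotes a trainable parameter matrix, which is distinct in each head. $I$ indicates the number of input elements. $\alpha_i^h$ is the attention distribution, in which the attention weight $\alpha_{ij}^h$ represents the relevance between the $i$-th and $j$-th elements. It can be calculated as:

$$\alpha_{ij}^h = \frac{\exp(e_{ij}^h)}{\sum_{t=1}^{I} \exp(e_{it}^h)} \qquad (2)$$

$$e_{ij}^h = \lambda \, (\mathbf{x}_i \mathbf{W}_Q^h)(\mathbf{x}_j \mathbf{W}_K^h)^{\top} \qquad (3)$$

in which $\mathbf{W}_Q^h$ and $\mathbf{W}_K^h$ are parameter matrices and $\lambda$ denotes the scaling factor. Finally, the output states of all heads are simply concatenated to produce the final state:

$$\mathbf{o}_i = [\mathbf{o}_i^1, \dots, \mathbf{o}_i^H]$$

As seen, each head is computed individually in its own subspace. We argue that simple concatenation still misses the opportunity to exploit different features across heads.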
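The computation in Equations (1) to (3) can be illustrated with a minimal pure-Python sketch. The function names (`matvec`, `softmax`, `attention_head`, `multi_head`) are hypothetical, and the scaling factor is assumed to be the inverse square root of the key dimension, following the usual Transformer convention; the text above only calls it a scaling factor.

```python
import math

def matvec(x, W):
    """Row vector x (length d) times matrix W (d x k); returns a length-k list."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def softmax(scores):
    """Numerically stable softmax over a list of floats (Equation 2)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_q, W_k, W_v):
    """One head of vanilla self-attention over the whole input X."""
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    V = [matvec(x, W_v) for x in X]
    lam = 1.0 / math.sqrt(len(Q[0]))  # assumed scaling factor
    out = []
    for q in Q:
        # Energy between this query and every key (Equation 3).
        energies = [lam * sum(a * b for a, b in zip(q, k)) for k in K]
        alpha = softmax(energies)
        # Weighted sum of all value vectors (Equation 1).
        out.append([sum(alpha[j] * V[j][c] for j in range(len(V)))
                    for c in range(len(V[0]))])
    return out

def multi_head(X, heads):
    """heads: list of (W_q, W_k, W_v); head outputs are concatenated per position."""
    per_head = [attention_head(X, *h) for h in heads]
    return [sum((ph[i] for ph in per_head), []) for i in range(len(X))]
```

Note that `multi_head` never lets one head see another head's keys or values: the heads only meet at the final concatenation, which is the limitation discussed above.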

2.2 Motivation

As SAN attends to all the elements in an input, the normalization in Equation (2) may inhibit attention to neighboring information Yang et al. (2018). In contrast, CNN has proven to be of profound value for local feature fusion on various NLP tasks Kim (2014); Gehring et al. (2017); Yu et al. (2018). Therefore, we enhance the ability of SAN to capture useful neighboring information by borrowing the convolution concept from CNN. Moreover, as illustrated in Figure 1(a), each attention head is computed individually without considering its relation to other heads. Wu and He (2018) noted that features can be better captured by modeling dependencies across different channels. Motivated by their promising results, we also extract attention features across multiple heads. We hypothesize that these two modifications to vanilla SAN can cumulatively improve the performance of sequence modeling.

3 Convolutional Self-Attention Networks

Regarding the attention function as a CNN-like filter, we resize (reduce or expand) the attentive scope to a window surrounding each input element. To maintain the flexibility and parallelism of the original SAN model, we propose two approaches: 1) the 1D-CSAN model, shown in Figure 1(b), in which the window is assigned a width $M$ where $1 \le M \le I$, while the height is fixed to one (analogous to a 1-dimensional window sliding within a single head); and 2) the 2D-CSAN model, shown in Figure 1(c), in which the window is additionally assigned an unrestricted height $N$ where $1 \le N \le H$ (analogous to a 2-dimensional rectangle sliding across multiple heads).

3.1 1D Convolutional Self-Attention

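Given a window width $M$, the relevance between the $i$-th and $j$-th elements can be normalized according to the neighboring elements of $i$, instead of the whole input. Thus, the calculation of the attention weight in Equation (2) can be updated as follows:

$$\alpha_{ij}^h = \frac{\exp(e_{ij}^h)}{\sum_{t=i-\lfloor M/2 \rfloor}^{i+\lfloor M/2 \rfloor} \exp(e_{it}^h)} \qquad (4)$$

where $i - \lfloor M/2 \rfloor \le j \le i + \lfloor M/2 \rfloor$, and 0 is padded when an index is out of range. Accordingly, the output of the attention operation in Equation (1) can be revised as:

$$\mathbf{o}_i^h = \sum_{j=i-\lfloor M/2 \rfloor}^{i+\lfloor M/2 \rfloor} \alpha_{ij}^h \, (\mathbf{x}_j \mathbf{W}_V^h) \qquad (5)$$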
As seen, the presented approach focuses on summarizing local context without additional components or parameters. Although the inspiration comes from CNN, a CNN combines local features using globally fixed parameters, ignoring the richness of semantic and syntactic information. In contrast, CSAN dynamically models the weights (as shown in Equation (4)) for different pairs of elements. Thus, the CSAN model is superior to its CNN counterpart in terms of flexibility.
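A minimal sketch of the windowed attention described in this subsection is given below. It assumes the queries, keys and values of one head have already been projected, writes the window width as `M`, and uses a hypothetical function name. Two simplifications are assumptions of this sketch rather than details from the text: the scaling factor is taken as the inverse square root of the key dimension, and positions outside the sequence are left out of the window instead of being zero-padded.

```python
import math

def local_attention_1d(Q, K, V, M):
    """Attention for one head, restricted to a window of M neighbors per query.

    Q, K, V: lists of I vectors (already projected for this head).
    M: window width (odd values give a window centered on the query).
    """
    I = len(Q)
    half = M // 2
    lam = 1.0 / math.sqrt(len(Q[0]))  # assumed scaling factor
    out = []
    for i in range(I):
        # Clip the window at the sequence boundaries (sketch assumption).
        window = range(max(0, i - half), min(I - 1, i + half) + 1)
        energies = [lam * sum(a * b for a, b in zip(Q[i], K[j])) for j in window]
        # Softmax over the window only, not over the whole input.
        m = max(energies)
        exps = [math.exp(e - m) for e in energies]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Weighted sum of the values inside the window.
        out.append([sum(a * V[j][c] for a, j in zip(alpha, window))
                    for c in range(len(V[0]))])
    return out
```

With `M = 1` every position attends only to itself and the output equals `V`; with a window at least as wide as twice the sequence length the sketch falls back to ordinary global attention. No weights are introduced beyond those of the head itself, which is the parameter-free property claimed above.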

3.2 2D Convolutional Self-Attention

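Furthermore, we propose a 2D-CSAN model to simultaneously model dependencies among local elements and neighboring subspaces. The 1-dimensional attentive area ($1 \times M$) can be expanded to a 2-dimensional rectangle ($N \times M$), which spans both a number of elements and a number of heads. Consequently, the proposed model can calculate the energy between the $i$-th element in the $h$-th head and the $j$-th element in the $s$-th head. Thus, the energy calculation in Equation (3) can be updated as:

$$e_{ij}^{hs} = \lambda \, (\mathbf{x}_i \mathbf{W}_Q^h)(\mathbf{x}_j \mathbf{W}_K^s)^{\top} \qquad (6)$$

Accordingly, the energy normalization in Equation (4) and the weighted sum of the elements in Equation (5) can be respectively revised as:

$$\alpha_{ij}^{hs} = \frac{\exp(e_{ij}^{hs})}{\sum_{t=h-\lfloor N/2 \rfloor}^{h+\lfloor N/2 \rfloor} \sum_{k=i-\lfloor M/2 \rfloor}^{i+\lfloor M/2 \rfloor} \exp(e_{ik}^{ht})} \qquad (7)$$

$$\mathbf{o}_i^h = \sum_{s=h-\lfloor N/2 \rfloor}^{h+\lfloor N/2 \rfloor} \sum_{j=i-\lfloor M/2 \rfloor}^{i+\lfloor M/2 \rfloor} \alpha_{ij}^{hs} \, (\mathbf{x}_j \mathbf{W}_V^s) \qquad (8)$$

where $h - \lfloor N/2 \rfloor \le s \le h + \lfloor N/2 \rfloor$. Thus, the attention distribution represents the dependencies among heads, and the output of each head covers a different group of features. Note that 2D-CSAN is equivalent to 1D-CSAN when $N = 1$.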
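The two-dimensional window can be sketched as follows, under stated assumptions: `Q[h][i]`, `K[h][i]` and `V[h][i]` hold the already projected vectors of head `h` at position `i`, all heads share the same dimension, `M` is the number of neighboring elements and `N` the number of neighboring heads in the window. The function name, the scaling factor and the clipping of the window at the borders (instead of zero padding) are choices of this sketch, not details confirmed by the text.

```python
import math

def csan_2d(Q, K, V, M, N):
    """Attention over an N x M rectangle of (head, position) cells per query."""
    H, I = len(Q), len(Q[0])
    lam = 1.0 / math.sqrt(len(Q[0][0]))  # assumed scaling factor
    out = [[None] * I for _ in range(H)]
    for h in range(H):
        # Neighboring heads around head h, clipped at the first and last head.
        heads = range(max(0, h - N // 2), min(H - 1, h + N // 2) + 1)
        for i in range(I):
            # Neighboring positions around element i, clipped at the borders.
            positions = range(max(0, i - M // 2), min(I - 1, i + M // 2) + 1)
            cells = [(s, j) for s in heads for j in positions]
            # Energy between the query of head h and keys of nearby heads.
            energies = [lam * sum(a * b for a, b in zip(Q[h][i], K[s][j]))
                        for s, j in cells]
            # One softmax jointly over all cells of the rectangle.
            m = max(energies)
            weights = [math.exp(e - m) for e in energies]
            z = sum(weights)
            # Weighted sum of values drawn from nearby heads and positions.
            out[h][i] = [sum(w / z * V[s][j][c] for w, (s, j) in zip(weights, cells))
                         for c in range(len(V[0][0]))]
    return out
```

Setting `N = 1` keeps every query inside its own head, so the sketch reduces to the one-dimensional case, matching the remark at the end of this subsection.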

Model                                     Param.    Speed    BLEU     Δ
---------------------------------------------------------------------------
Transformer-Base Vaswani et al. (2017)    88.0M     1.20     27.64    -
  + Rel_Pos Shaw et al. (2018)            +0.1M     -0.11    27.94    +0.30
  + Neighbor Sperber et al. (2018)        +0.4M     -0.06    27.91    +0.27
  + Local_Hard Luong et al. (2015)        +0.4M     -0.06    28.04    +0.40
  + Local_Soft Yang et al. (2018)         +0.8M     -0.09    28.11    +0.47
  + Block Shen et al. (2018a)             +6.0M     -0.33    27.91    +0.27
  + CNNs Yu et al. (2018)                 +42.6M    -0.54    28.02    +0.38
1D-CSAN                                   +0.0M     -0.00    28.16    +0.52
2D-CSAN                                   +0.0M     -0.06    28.48    +0.84

Table 1: Experimental results on the WMT14 En-De translation task. For a fair comparison, we re-implemented the existing locality approaches on the same framework, i.e. Transformer-Base Vaswani et al. (2017). “Param.” denotes the trainable parameter size of each model (M = million). “Speed” represents the training speed (steps/second). “Δ” is the BLEU difference from the Transformer-Base baseline.

4 Experiments

4.1 Setup

To verify the effectiveness of our approach, we conducted experiments on the English-to-German translation task. Following Vaswani et al. (2017) and Shaw et al. (2018), we incorporate the proposed model into the encoder of the state-of-the-art NMT architecture, Transformer. For a fair comparison, we also re-implemented other related models in the same framework. Prior studies have shown that modeling locality in lower layers can achieve better performance Shen et al. (2018b); Yu et al. (2018); Yang et al. (2018). Therefore, we apply our approach in the first three layers of the Transformer encoder, which consists of 6 stacked SAN-based layers. We use the same configuration for the Base model as in Vaswani et al. (2017). All the models are implemented on top of the open-source toolkit THUMT Zhang et al. (2017).

The models are trained on the widely used WMT2014 English-to-German (En-De) training corpus, which consists of 4M sentence pairs. We also select the same validation and test sets as used in previous work. To alleviate the out-of-vocabulary problem, we employ byte-pair encoding (BPE) Sennrich et al. (2016) to pre-process the data with 32K merge operations. We use the case-sensitive 4-gram NIST BLEU score Papineni et al. (2002) as the evaluation metric.

4.2 Results and Discussion

In this section, we evaluate the effectiveness of the proposed model on the En-De test set. The window sizes of 1D-CSAN and 2D-CSAN are set to and , respectively.

The Effectiveness of CSAN

To make the evaluation convincing, we reproduced the results reported in Vaswani et al. (2017) on the same data. As shown in Table 1, both proposed models outperform the strong baseline without additional parameters. Specifically, 1D-CSAN improves translation performance over the baseline by +0.52 BLEU points, demonstrating the effectiveness of modeling localness for SAN. 2D-CSAN achieves the best performance overall (+0.84 BLEU). This confirms our assumption that simple concatenation is insufficient for features in different subspaces to fully interact, and that modeling dependencies across multiple heads benefits the performance of SAN.

Comparison to Other Work

We further compared the proposed model with several existing locality models. We divided the prior explorations on modeling localness into three categories:

  • From the embedding perspective, Shaw et al. (2018) introduced relative position encoding (“Rel_Pos”), which embeds the relative distance into the representations.

  • From the attention distribution perspective, “Neighbor” Sperber et al. (2018) and “Local_Hard” Luong et al. (2015) use local Gaussian biases to revise the attention distribution by predicting a window size and a central position respectively. Yang et al. (2018) combined the two approaches, which corresponds to “Local_Soft”.

  • Concerning the hard fashion, Shen et al. (2018a) and Yu et al. (2018) suggested allocating hard local scopes by dividing the input sequence into blocks (“Block”) and by stacking CNN and SAN layers (“CNNs”), respectively.

As far as we know, this is the first time these approaches have been compared on the same framework. As seen, the proposed 1D-CSAN slightly outperforms even the best of the existing approaches, i.e. “Local_Soft”. In addition, the proposed 1D-CSAN is superior to its CNN counterpart (“CNNs”), verifying our hypothesis that flexible weights conditioned on the context benefit feature extraction. By further considering the interaction across heads, 2D-CSAN gains 0.84 BLEU in total over the vanilla SAN model.

Moreover, the proposed approach requires no additional parameters and only marginally reduces the training speed. Note that, although our implementation does not speed up the model, the computational complexity of each head can theoretically be reduced from $O(I^2)$ to $O(I \times M)$, which also inhibits the rapid growth of the memory requirement in SAN Shen et al. (2018b).
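As a rough worked example of this complexity argument, the sketch below counts how many query-key energies one head has to compute with global attention and with a window of neighboring elements. The function names and the example sizes (100 input elements, a window of 11) are illustrative assumptions, and the window is clipped at the sequence boundaries rather than padded.

```python
def energies_global(num_elements):
    """Every query is scored against every key."""
    return num_elements * num_elements

def energies_windowed(num_elements, window):
    """Each query is scored only against keys inside its clipped window."""
    half = window // 2
    return sum(
        min(num_elements - 1, i + half) - max(0, i - half) + 1
        for i in range(num_elements)
    )

full = energies_global(100)          # 10000 energies
local = energies_windowed(100, 11)   # 1070 energies, at most 100 * 11
```

The windowed count grows linearly with the input length for a fixed window, whereas the global count grows quadratically, which is where the potential saving in computation and memory comes from.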

4.3 Analysis

Figure 2: Experimental results on varying the attentive scopes for 1D-CSAN (left) and 2D-CSAN (right). The variants are evaluated on the validation set. The grey dashed line denotes the baseline. The X-axes of the left and right figures represent the number of elements and heads in the local scope, respectively. For 2D-CSAN, the number of local elements is fixed to 11.

In this group of experiments, we evaluate the effect of varying the scope size for CSAN and its effect on phrasal patterns. The experiments are conducted on the development set.

Ablation Study

As shown in Figure 2, for 1D-CSAN a local scope covering 11 elements is superior to the other settings. This tendency is consistent with Luong et al. (2015) and Yang et al. (2018), who found that 10 is the best window size in their experiments. We then fixed the number of neighboring elements and varied the number of heads to be considered. As seen, by considering features across heads (i.e. $N > 1$), 2D-CSAN further improves translation quality. However, as $N$ increases further, translation quality drops. A possible reason is that with a smaller $N$ the model still has the flexibility to learn a different distribution for each head, while a larger $N$ assumes that more heads make “similar contributions” Ba et al. (2016); Wu and He (2018).

Effect of N-gram

One intuition behind our approach is to capture useful phrase patterns. To evaluate the accuracy of phrase translations, we calculate the improvement of the proposed approaches over multiple N-grams, as shown in Figure 3. All three compared models consistently outperform the baseline on larger granularities, indicating that modeling locality improves the ability of the self-attention model to capture phrasal information. Furthermore, the dependencies across heads are complementary to localness modeling, which reveals the necessity of letting features in different subspaces interact.

Figure 3: Performance improvement according to N-gram. Y-axis denotes the gap of BLEU score between locality models and baseline.

5 Conclusion

In this paper, we propose a parameter-free convolutional self-attention model to enhance the feature extraction of neighboring elements across multiple heads. Experimental results on the WMT14 En-De translation task demonstrate the effectiveness of the proposed methods. Extensive analyses verify that: 1) modeling locality is necessary for SAN; 2) modeling dependencies across heads can further improve performance; and 3) to some extent, dynamic weights are superior to their fixed counterparts (i.e. CSAN vs. CNN) for feature extraction. The method is not restricted to the translation task and could potentially be applied to other sequence modeling tasks such as reading comprehension, language inference, and sentence classification.