Fast Directional Self-Attention Mechanism

05/02/2018 ∙ by Tao Shen, et al. ∙ University of Technology Sydney University of Washington 0

In this paper, we propose a self-attention mechanism, dubbed "fast directional self-attention (Fast-DiSA)", which is a fast and light extension of "directional self-attention (DiSA)". The proposed Fast-DiSA is as expressive as the original DiSA but uses much less computation time and memory: 1) both token2token and source2token dependencies are modeled by a joint compatibility function designed as a hybrid of the dot-product and multi-dim approaches; 2) multi-head and multi-dim attention are combined with bi-directional temporal information, captured by multiple positional masks, without the heavy time and memory consumption of DiSA. The experiment results show that the proposed Fast-DiSA achieves state-of-the-art performance while being as fast and memory-friendly as CNNs. The code for Fast-DiSA is released at <>.




1 Introduction

Recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been broadly used as context fusion modules for natural language processing (NLP) tasks. Recently, RNNs/CNNs in conjunction with an attention mechanism have proven effective for contextual feature modeling in a wide range of NLP tasks, including sentiment classification Li et al. (2018), machine translation Bahdanau et al. (2015), and reading comprehension Seo et al. (2017); Yu et al. (2018). More recently, self-attention mechanisms have been developed for context fusion and syntactic dependency modeling with the advantages of fewer parameters, more parallelizable computation, and better empirical performance Hu et al. (2017); Vaswani et al. (2017); Shen et al. (2018a). In addition, neural networks based solely on self-attention mechanisms have achieved state-of-the-art quality on many NLP tasks, e.g., machine translation Vaswani et al. (2017), sentence embedding Shen et al. (2018a) and semantic role labeling Tan et al. (2017).

Self-attention mechanisms can be categorized into two classes according to the type of dependency each aims to model. The first category is token2token self-attention Hu et al. (2017); Vaswani et al. (2017); Shen et al. (2018a), which captures the syntactic dependency between every two tokens in a sequence. An efficient dot-product compatibility function is usually deployed to measure this pairwise dependency Vaswani et al. (2017). In contrast, an additive compatibility function captures the dependency with a multi-layer perceptron (MLP), and can usually achieve better performance Britz et al. (2017). Its expressive power can be further improved if expanded to multiple dimensions Shen et al. (2018a). This multi-dim self-attention empirically surpasses the dot-product one, but suffers from expensive computation and memory, which grow linearly with the number of features and quadratically with the sequence length. Hence, it is not scalable to long sequences in practice.

Figure 1: (a) Memory consumption and (b) time cost vs. sequence length on synthetic data; (c) memory load (x-axis), inference time on dev set (y-axis) and test accuracy on the SNLI dataset.

The second category is source2token self-attention Liu et al. (2016); Lin et al. (2017); Shen et al. (2018a) aiming to capture global dependency, i.e., the importance of each token to the entire sequence for a specific task. Its time and space complexities grow linearly, rather than quadratically, with the sequence length. Hence, it is empirically efficient in terms of memory and computation even if expanded to multiple dimensions, i.e., using a vector of feature-wise scores instead of a scalar for the global dependency. But, it is hard to reach state-of-the-art performance on NLP tasks due to the lack of pairwise and local dependencies.

In this paper, we propose a novel attention mechanism called multi-mask tensorized self-attention (MTSA) for context fusion. [Footnote 1: More details about training setups, related work, discussion, and visualization are provided in the Appendix.] In MTSA, 1) the pairwise dependency is captured by an efficient dot-product based token2token self-attention, while the global dependency is modeled by a feature-wise multi-dim source2token self-attention, so they can work jointly to encode rich contextual features; 2) self-attention alignment scores are tensorized for more expressive power in that each pair of tokens has one score for each feature, but no tensor computation is required other than simple and efficient matrix multiplications when implemented; 3) the tensors above are computed in multiple subspaces (i.e., in a multi-head fashion) rather than in the original input space, so the required memory and computation can be distributed to multiple subspaces; and 4) a distinct positional mask is applied to each head in order to encode rich structural information such as the sequential order and relative position of tokens.

In the experiments, we build CNN/RNN-free neural networks based on MTSA for sentence embedding and sequence tagging tasks, including natural language inference, semantic role labeling, sentiment analysis, question-type classification, and machine translation. The results demonstrate that MTSA achieves state-of-the-art or competitive performance on nine benchmark datasets. To summarize the comparison of MTSA with recently popular models, we show the memory consumption and time cost vs. sequence length in Figures 1(a) and 1(b), respectively, on synthetic data (batch size of 64 and feature channels of 300). On SNLI Bowman et al. (2015), a public dataset for language inference, as shown in Figure 1(c), MTSA achieves the best result while being as fast and as memory-efficient as the CNNs (all baselines and the benchmark are detailed in Section 4).

Notations: 1) lowercase denotes a vector; 2) bold lowercase denotes a sequence of vectors (stored as a matrix); and 3) uppercase denotes a matrix or tensor.

2 Background

2.1 Attention Mechanism

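Given an input sequence of token embeddings or memory slots $\boldsymbol{x} = [x_1, \dots, x_n] \in \mathbb{R}^{d_e \times n}$, and a vector representation of a query $q \in \mathbb{R}^{d_q}$, an attention mechanism Bahdanau et al. (2015); Luong et al. (2015) computes an alignment score between each token $x_i$ and $q$ by a compatibility function $f(x_i, q)$, which aims to measure the dependency/relevance between $x_i$ and $q$, or the attention of $q$ to $x_i$, w.r.t. a given task. The scores are transformed into probabilities through a softmax function. These probabilities are then used as weights to sum all the tokens and generate a contextual embedding for $q$, i.e.,

$$a = [f(x_i, q)]_{i=1}^{n}, \qquad p(z \mid \boldsymbol{x}, q) = \mathrm{softmax}(a), \qquad s = \sum_{i=1}^{n} p(z = i \mid \boldsymbol{x}, q) \cdot x_i = \mathbb{E}_{i \sim p(z \mid \boldsymbol{x}, q)}[x_i], \tag{1}$$

where $a \in \mathbb{R}^{n}$ denotes the vector of alignment scores, $p(z \mid \boldsymbol{x}, q)$ is the categorical distribution for attention probabilities, which is derived from applying the softmax function to $a$, and $s \in \mathbb{R}^{d_e}$ is the output vector for the query $q$.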
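The attention computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names `softmax` and `attend`, the row-per-token layout, and the toy dot-product compatibility function are all assumptions made for the example.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def attend(x, q, f):
    """x is (n, d_e) with one token per row, q is the query vector,
    and f(x_i, q) is any scalar compatibility function."""
    a = np.array([f(x_i, q) for x_i in x])  # alignment scores, shape (n,)
    p = softmax(a)                          # attention probabilities
    s = p @ x                               # expectation of x_i under p, shape (d_e,)
    return s, p

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))   # toy sequence: n = 5 tokens, d_e = 4
q = rng.normal(size=4)        # toy query
s, p = attend(x, q, lambda x_i, q_: float(x_i @ q_))
```

Because `f` is passed in as an argument, the same routine covers any compatibility function that returns one scalar per token.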

There are two major types of compatibility functions, leading to the two most frequently used attention mechanisms. The first one is the dot-product or multiplicative compatibility function (Eq.(2)), which composes the dot-product attention mechanism Luong et al. (2015) using cosine similarity to model the dependencies. The other one is the additive or multi-layer perceptron (MLP) compatibility function (Eq.(3)), which results in the additive attention mechanism Bahdanau et al. (2015) using an MLP to model the dependencies.

$$f(x_i, q) = \langle W^{(d1)} x_i, W^{(d2)} q \rangle, \tag{2}$$

$$f(x_i, q) = w^{T} \sigma_a(W^{(a)} [x_i; q] + b^{(a)}) + b, \tag{3}$$

where $W^{(d1)}$, $W^{(d2)}$, $W^{(a)}$, $w$, $b^{(a)}$ and $b$ are learnable parameters, and $\langle \cdot, \cdot \rangle$ denotes the inner product. Empirically, networks with additive attention usually outperform those with dot-product attention, but require more computation time and memory Britz et al. (2017).

Multi-dim attention mechanism Shen et al. (2018a) expands the alignment score in previous attention mechanisms to a vector of feature-wise scores, each computed on one feature dimension. It has greater capacity to model complex dependencies, and can handle the context variation and polysemy problems that trouble many NLP tasks. In particular, it replaces the vector $w^{T}$ in the additive compatibility function (Eq.(3)) with a matrix $W \in \mathbb{R}^{d_e \times d_a}$, where $d_a$ is the hidden dimension of the MLP, and thus produces $d_e$ scores to describe the attention of $q$ to $x_i$.
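To make the difference between the three compatibility functions concrete, here is a small NumPy sketch. All parameter names, the toy dimensions, the random initialisation, and the choice of `tanh` as the MLP activation are assumptions for illustration; the source does not fix them.

```python
import numpy as np

rng = np.random.default_rng(0)
d_e, d_q, d_i, d_a = 4, 3, 5, 6   # toy sizes, chosen only for illustration

# Hypothetical randomly initialised parameters standing in for learned weights.
W_d1, W_d2 = rng.normal(size=(d_i, d_e)), rng.normal(size=(d_i, d_q))
W_a, b_a = rng.normal(size=(d_a, d_e + d_q)), rng.normal(size=d_a)
w, b = rng.normal(size=d_a), 0.0
W_md, b_md = rng.normal(size=(d_e, d_a)), np.zeros(d_e)

def f_dot(x_i, q):
    """Dot-product compatibility: one scalar per (token, query) pair."""
    return float((W_d1 @ x_i) @ (W_d2 @ q))

def hidden(x_i, q):
    """Shared MLP hidden layer; tanh is an assumed activation."""
    return np.tanh(W_a @ np.concatenate([x_i, q]) + b_a)

def f_additive(x_i, q):
    """Additive compatibility: a vector w maps the hidden layer to a scalar."""
    return float(w @ hidden(x_i, q) + b)

def f_multi_dim(x_i, q):
    """Multi-dim variant: a matrix replaces w, giving one score per feature."""
    return W_md @ hidden(x_i, q) + b_md

x_i, q = rng.normal(size=d_e), rng.normal(size=d_q)
scores = f_multi_dim(x_i, q)   # shape (d_e,)
```

The only change between `f_additive` and `f_multi_dim` is the shape of the output layer, which is why the multi-dim variant costs a factor of `d_e` more memory per token pair.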

2.2 Self-Attention Mechanism

Self-attention mechanism is a special case of attention mechanisms, where the query stems from the input sequence itself. Self-attention mechanisms can be classified into token2token or source2token self-attention mechanisms according to the type of dependency each aims to model.

A) Token2token self-attention mechanism Vaswani et al. (2017); Shen et al. (2018a) aims at producing a context-aware representation for each token in light of its syntactic dependencies on other tokens from the same sequence. Two examples of token2token self-attention are 1) scaled dot-product self-attention which composes the multi-head self-attention Vaswani et al. (2017), and 2) masked self-attention used in directional self-attention Shen et al. (2018a).

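A.1) Scaled dot-product attention mechanism Vaswani et al. (2017) in general form has three arguments: query tokens $\boldsymbol{q} \in \mathbb{R}^{d_q \times m}$, key tokens $\boldsymbol{k} \in \mathbb{R}^{d_q \times n}$ and value tokens $\boldsymbol{v} \in \mathbb{R}^{d_h \times n}$ associated with the key tokens. It uses a scaled dot-product function to model the relationship between each query and key, and finally outputs a sequence $\boldsymbol{s} = [s_1, \dots, s_m] \in \mathbb{R}^{d_h \times m}$ such that

$$\boldsymbol{s} = \mathrm{sdpAttn}(\boldsymbol{q}, \boldsymbol{k}, \boldsymbol{v}) \triangleq \boldsymbol{v} \, \mathrm{softmax}\!\left(\frac{\boldsymbol{k}^{T} \boldsymbol{q}}{\sqrt{d_q}}\right), \tag{4}$$

where the softmax normalizes each column, i.e., over the key tokens.

A special case of this mechanism is that the three input arguments are derived from the same source, i.e., $\boldsymbol{q} = \boldsymbol{k} = \boldsymbol{v}$, which can be referred to as a token2token self-attention, namely scaled dot-product self-attention. As for the multi-head attention mechanism, the input is projected into multiple subspaces, then parameter-untied scaled dot-product attention is applied to the embeddings in each subspace. The results for multiple subspaces are concatenated to form the final output $\boldsymbol{s}$, i.e.,

$$\boldsymbol{s} = [H_1; \dots; H_h], \quad \text{where } H_c = \mathrm{sdpAttn}(W_c^{q} \boldsymbol{q}, W_c^{k} \boldsymbol{k}, W_c^{v} \boldsymbol{v}). \tag{5}$$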
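A compact NumPy sketch of scaled dot-product attention and its multi-head extension follows. It assumes a row-per-token layout (the transpose of the column convention used in the text), hypothetical random projection matrices, and no output projection after concatenation; it is meant as an illustration, not as the reference implementation.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def sdp_attn(Q, K, V):
    """Q: (m, d), K: (n, d), V: (n, d_v), one token per row."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (m, n) scaled dot products
    return softmax(scores, axis=-1) @ V       # (m, d_v)

def multi_head(X, heads):
    """heads: list of (W_q, W_k, W_v) tuples, one untied set per subspace."""
    outs = [sdp_attn(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outs, axis=-1)      # concatenate along features

rng = np.random.default_rng(0)
n, d_e, h = 6, 8, 2
d_h = d_e // h
X = rng.normal(size=(n, d_e))
heads = [tuple(rng.normal(size=(d_e, d_h)) for _ in range(3)) for _ in range(h)]
out = multi_head(X, heads)                    # shape (n, h * d_h)
```

A quick sanity check on `sdp_attn`: if all keys are identical, every query sees uniform attention weights and the output is simply the mean of the value rows.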


A.2) Masked self-attention mechanism Shen et al. (2018a) uses a multi-dim compatibility function to model the dependency between every two tokens in a sequence, and uses a positional mask to encode sequential information. It overcomes the lack of sequential information that is inherent in self-attention compared to RNNs. Its compatibility function is defined as

$$f^{m}(x_i, x_j) = c \cdot \tanh\!\left([W^{(m)} [x_i; x_j] + b^{(m)}] / c\right) + M_{i,j}, \tag{6}$$

where $c$ is a constant scalar, $W^{(m)}$ is a learnable weight matrix, and $M$ is a positional mask with each entry $M_{i,j} \in \{-\infty, 0\}$. When $M_{i,j} = -\infty$, applying the softmax function to the alignment scores results in a zero attention probability, which cuts off the attention of $x_j$ to $x_i$. Hence, masked self-attention with an asymmetric mask, where $M_{i,j} \neq M_{j,i}$, can encode sequential or other structural information Shen et al. (2018a); Im and Cho (2017). To this end, two positional masks have been proposed to encode the forward and backward order information respectively, i.e.,

$$M^{fw}_{i,j} = \begin{cases} 0, & i < j \\ -\infty, & \text{otherwise} \end{cases} \qquad M^{bw}_{i,j} = \begin{cases} 0, & i > j \\ -\infty, & \text{otherwise} \end{cases}$$

Furthermore, directional self-attention (DiSA) Shen et al. (2018a) concatenates the features produced by masked self-attention mechanisms with the forward and backward positional masks (i.e., $M^{fw}$ and $M^{bw}$), leading to context-aware representations with bi-directional information encoded.
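As an illustration of the forward and backward positional masks, the sketch below builds both as NumPy arrays. It assumes the indexing convention that entry `[i, j]` gates the attention of token `j` to token `i`, and that the diagonal is disabled in both masks; the helper name `directional_masks` is hypothetical.

```python
import numpy as np

def directional_masks(n):
    """Return (fw, bw), each of shape (n, n).
    Entry [i, j] is 0 where token j may attend to token i, and -inf elsewhere."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    fw = np.where(i < j, 0.0, -np.inf)   # token j sees only earlier tokens
    bw = np.where(i > j, 0.0, -np.inf)   # token j sees only later tokens
    return fw, bw

fw, bw = directional_masks(4)
```

Adding such a mask to the alignment scores before the softmax zeroes out the disallowed probabilities. Under this strict definition the first token has no admissible key in the forward mask (and the last token none in the backward mask), so an implementation has to decide how to handle that fully masked case.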

B) Source2token self-attention mechanism Liu et al. (2016); Lin et al. (2017); Shen et al. (2018a) is designed for sentence embedding or sequence compression, and is based on the importance of each token to the entire source sequence for a specific task. Specifically, it removes the query $q$ from the compatibility function when computing the alignment score. For example, the compatibility function of the additive source2token self-attention mechanism is obtained by simply removing $q$ from Eq.(3).
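The following sketch shows a multi-dim source2token self-attention that compresses a sequence into one vector. The two-layer scoring network, the `tanh` activation, and all parameter names are assumptions for illustration; the key point is that no query appears in the score computation and that normalisation runs over tokens separately for each feature.

```python
import numpy as np

def source2token(X, W1, b1, W2, b2):
    """X: (n, d) with one token per row. Returns a (d,) sequence embedding."""
    scores = np.tanh(X @ W1 + b1) @ W2 + b2        # (n, d): no query term appears
    scores = scores - scores.max(axis=0, keepdims=True)
    P = np.exp(scores)
    P = P / P.sum(axis=0, keepdims=True)           # normalise over tokens, per feature
    return (P * X).sum(axis=0)                     # feature-wise weighted sum

rng = np.random.default_rng(0)
n, d, d_a = 5, 4, 6
X = rng.normal(size=(n, d))
emb = source2token(X, rng.normal(size=(d, d_a)), np.zeros(d_a),
                   rng.normal(size=(d_a, d)), np.zeros(d))
```

The cost is linear in the sequence length because each token is scored once, independently of the others.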

3 Proposed Models

In this section, we first elaborate on tensorized self-attention (TSA) in Section 3.1, which captures both pairwise and global dependencies by combining the two types of self-attention mechanisms introduced in Section 2.2. Then, we extend TSA to multi-mask tensorized self-attention (MTSA) in Section 3.2 by applying different positional masks to TSA in multiple subspaces (multi-head fashion). Lastly, in Section 3.3, we present an efficient computation scheme for MTSA that involves no high-rank tensor computation even though tensorized alignment scores are used.

3.1 Tensorized Self-Attention (TSA)

Figure 2: Tensorized self-attention (TSA) Mechanism.

Tensorized self-attention (TSA), whose structure is illustrated in Figure 2, is a neural mechanism that can be trained to model both pairwise and global dependencies, whereas previous self-attention mechanisms each focus on only one type of dependency. TSA models both types by combining the aforementioned token2token and source2token self-attention mechanisms. This generates an $n \times n \times d_h$ tensor containing the alignment scores between every two tokens on each feature dimension. These scores are then normalized and transformed into probability weights, which are used to sum all dependent tokens and then generate the contextual embedding for each input token. We will demonstrate later in Section 3.3 that only matrix rather than tensor operations are required when executing the procedures above.

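To facilitate the elaboration of the proposed models and keep the notation consistent with prior attention mechanisms, TSA first projects the input embeddings $\boldsymbol{x}$ into three spaces to represent the query, key and value tokens, respectively.

$$\boldsymbol{q} = W^{(t1)} \boldsymbol{x}, \qquad \boldsymbol{k} = W^{(t2)} \boldsymbol{x}, \qquad \boldsymbol{v} = W^{(t3)} \boldsymbol{x}, \tag{7}$$

where $W^{(t1)}, W^{(t2)} \in \mathbb{R}^{d_i \times d_e}$ and $W^{(t3)} \in \mathbb{R}^{d_h \times d_e}$ are learnable weights for projections.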

TSA then integrates two kinds of compatibility functions from two self-attention mechanisms respectively. Firstly, the scaled dot-product self-attention is used to capture the dependency between every two tokens. Dot-product operations are fast, and sufficient to model the pairwise dependency in most tasks. Its compatibility function is

$$f^{t}(k_i, q_j) = \langle k_i, q_j \rangle / \sqrt{d_i}, \tag{8}$$

where $\langle \cdot, \cdot \rangle$ is the inner-product operation. Then, a multi-dim source2token self-attention mechanism is used to estimate the contribution of each token to the given task on each feature dimension. It aims at capturing the importance of each token to the entire input sequence w.r.t. the task, i.e., the global dependency. The multi-dim extension only linearly increases the memory and computation of source2token self-attention by a multiplicative factor $d_h$, but substantially improves expressive capability, in line with prior work Shen et al. (2018a). Its compatibility function is

$$f^{s}(k_i) = W^{(s2)} \sigma_m(W^{(s1)} k_i + b^{(s1)}) + b^{(s2)}, \tag{9}$$

where $W^{(s1)} \in \mathbb{R}^{d_a \times d_i}$ and $W^{(s2)} \in \mathbb{R}^{d_h \times d_a}$ are the learnable weights, and $\sigma_m(\cdot)$ is an activation function. The compatibility function used in TSA broadcasts the scalar alignment score $f^{t}(k_i, q_j)$ computed by the token2token self-attention to all $d_h$ feature dimensions, and then adds it to the feature-wise score vector $f^{s}(k_i)$ computed by the source2token self-attention. In addition, the positional masks from masked self-attention (in Section 2.2) are also integrated to encode sequential and structural information. These yield the following compatibility function of TSA.

$$[f^{tsa}(k_i, q_j)]_l = F_t(f^{t}(k_i, q_j)) + F_s([f^{s}(k_i)]_l) + M_{i,j}, \tag{10}$$

where $l \in [d_h]$. $F_t(\cdot)$ and $F_s(\cdot)$ are two scale functions. They control the way to combine the two kinds of scores and their weights, more details of which are elaborated in Appendix A.1. We also show heatmaps of the token2token and source2token alignment scores in Appendix E.

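For each query token $q_j$, a softmax function is applied to the alignment scores $[f^{tsa}(k_i, q_j)]_{i=1}^{n}$ on each feature dimension, resulting in a categorical distribution over all value tokens $[v_i]_{i=1}^{n}$ based on the corresponding key tokens $[k_i]_{i=1}^{n}$. The probability of token $q_j$ attending to $v_i$ on the $l$-th feature dimension (i.e., $z^{l} = i$) is

$$p^{l}_{i,j} \triangleq p(z^{l} = i \mid \boldsymbol{k}, q_j) = \frac{\exp([f^{tsa}(k_i, q_j)]_l)}{\sum_{g=1}^{n} \exp([f^{tsa}(k_g, q_j)]_l)}, \tag{11}$$

where $i, j \in [n]$ and $l \in [d_h]$. TSA outputs a contextual embedding for each input token on every feature dimension as the weighted sum of all the value token embeddings on that dimension, where the weights are provided by the probabilities in Eq.(11). It is the expectation of sampling a value token embedding on each feature dimension according to the feature-wise probability distribution, i.e.,

$$s_j = \left[\sum_{i=1}^{n} p^{l}_{i,j} \, [v_i]_l\right]_{l=1}^{d_h} = \left[\mathbb{E}_{i \sim p^{l}_{\cdot,j}} [v_i]_l\right]_{l=1}^{d_h}. \tag{12}$$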
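The tensorized scoring and feature-wise normalisation of TSA can be written as a deliberately naive loop, one query at a time. The sketch assumes identity scale functions, a row-per-token layout, and precomputed source2token scores `S`; these are simplifications for illustration, and the function name `tsa_naive` is hypothetical.

```python
import numpy as np

def tsa_naive(Q, K, V, S, M):
    """Q, K: (n, d_i); V, S: (n, d_h), one token per row; M: (n, n) mask with
    M[i, j] added to the score of query j attending to key i.
    S[i] holds the source2token score vector of key i."""
    n, d_h = V.shape
    out = np.zeros((n, d_h))
    for j in range(n):
        t2t = K @ Q[j] / np.sqrt(Q.shape[1])          # (n,) one scalar per key
        scores = t2t[:, None] + S + M[:, j][:, None]  # (n, d_h) broadcast over features
        scores = scores - scores.max(axis=0, keepdims=True)
        P = np.exp(scores)
        P = P / P.sum(axis=0, keepdims=True)          # softmax over keys, per feature
        out[j] = (P * V).sum(axis=0)                  # feature-wise weighted sum
    return out

rng = np.random.default_rng(0)
n, d_i, d_h = 5, 4, 3
Q, K = rng.normal(size=(n, d_i)), rng.normal(size=(n, d_i))
V, S = rng.normal(size=(n, d_h)), rng.normal(size=(n, d_h))
M = np.zeros((n, n))   # no masking in this toy run
out = tsa_naive(Q, K, V, S, M)
```

When `S` is all zeros the feature-wise scores coincide and the routine reduces to ordinary scaled dot-product self-attention, which is a convenient way to check an implementation.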


3.2 Multi-Mask Tensorized Self-Attention (MTSA) Mechanism

Rather than computing attention in the original input space, multi-head attention Vaswani et al. (2017) projects the input sequence to multiple subspaces, applies attention to the projected embedding in each subspace, and finally concatenates their outputs. The computations associated with multiple heads can be completed in parallel. By using an adequate number of heads, each with a low-dimensional subspace (i.e., the representation dimension for each head is updated by $d_h \leftarrow d_h / h$, where $h$ is the number of heads), it reduces parameters and memory/computation cost and increases the diversity of the attention. In addition, to encode different kinds of sequential or structural information, multiple different positional masks (e.g., forward, backward and multi-length window) can be further applied to the multiple heads.

The memory-/time-efficiency and expressive power of TSA can be improved by combining the multi-head and multi-mask techniques introduced above. By writing the TSA mechanism as a function $\mathrm{TSA}(\boldsymbol{x}, M)$ with input sequence $\boldsymbol{x}$ and a positional mask $M$, and the output given by Eq.(12), multi-mask tensorized self-attention (MTSA) produces

$$\boldsymbol{s} = [\boldsymbol{s}^{1}; \dots; \boldsymbol{s}^{h}], \quad \text{where } \boldsymbol{s}^{c} = \mathrm{TSA}^{c}(\boldsymbol{x}, M^{c}), \tag{13}$$

where $c \in [h]$, $h$ is the number of heads, $\mathrm{TSA}^{c}$ denotes the $c$-th parameter-independent TSA block that produces a $d_h$-dim representation in the $c$-th subspace, $M^{c}$ represents the positional mask applied to attention in the $c$-th subspace, $[\cdot; \dots; \cdot]$ denotes a vertical concatenation operation, and $\boldsymbol{s}$ is the output of MTSA. In our experiments, we apply the forward mask to half of the heads and the backward mask to the other half to encode bi-directional order information of the input sequence.

3.3 Computation-Optimized MTSA

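Input: input sequence $\boldsymbol{x}$, head number $h$, subspace dimension $d_h$, positional masks $\{M^{c}\}_{c=1}^{h}$, and weights/biases of each head

Output: contextual embeddings $\boldsymbol{s} = [s_1, \dots, s_n]$

1: for all $c = 1, \dots, h$ do    ▷ Computing $h$ heads in parallel
2:     $\boldsymbol{q}^{c}, \boldsymbol{k}^{c}, \boldsymbol{v}^{c} \leftarrow W^{(t1),c} \boldsymbol{x}, \; W^{(t2),c} \boldsymbol{x}, \; W^{(t3),c} \boldsymbol{x}$    ▷ Projecting to the $c$-th subspace
3:     $R^{c} \leftarrow (\boldsymbol{k}^{c})^{T} \boldsymbol{q}^{c} / \sqrt{d_h}$    ▷ token2token attention scores
4:     $S^{c} \leftarrow W^{(s2),c} \sigma_m(W^{(s1),c} \boldsymbol{k}^{c} + b^{(s1),c}) + b^{(s2),c}$    ▷ scores of source2token attention
5:     $E_R^{c} \leftarrow \exp(F_t(R^{c}) + M^{c})$    ▷ Applying mask to token2token weights
6:     $E_S^{c} \leftarrow \exp(F_s(S^{c}))$, $E_X^{c} \leftarrow \boldsymbol{v}^{c} \odot E_S^{c}$    ▷ Applying source2token weights to $\boldsymbol{v}^{c}$
7:     $\boldsymbol{s}^{c} \leftarrow (E_X^{c} E_R^{c}) \oslash (E_S^{c} E_R^{c})$    ▷ Applying masked token2token weights and normalizing
8: end for
9: Return $\boldsymbol{s} \leftarrow [\boldsymbol{s}^{1}; \dots; \boldsymbol{s}^{h}]$    ▷ Vertical concatenation of the outputs from all heads
Algorithm 1 Multi-Mask Tensorized Self-Attention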

As shown in Eq.(10) and Eq.(11), TSA or each head of MTSA needs to compute the attention scores and probabilities as $n \times n \times d_h$ tensors. In accordance with multi-dim self-attention Shen et al. (2018a), this makes TSA more expressive and improves the final performance for sequence modeling, but leads to memory explosion and a computational bottleneck on long sequences with large $n$ and $d_h$. Fortunately, in MTSA, it is possible to reduce the computation to matrix-only operations by exploiting the computational structure.

A memory-optimized and highly parallelizable computation scheme for MTSA is given in Algorithm 1. For each head, the score matrices of token2token and source2token are computed in steps 3 and 4 respectively. Then, we combine the token2token scores with the positional mask to form a new mask in step 5, and compute the output embedding with the weights from the multi-dim source2token self-attention in step 6. Finally, in step 7, we apply the new mask from step 5 to the weighted embedding from step 6 and complete the normalization. This procedure generates exactly the same output as Eq.(13) (as rigorously proved in Appendix A.2) but incurs no tensor operation. More details about memory-efficiency, time-efficiency and attention dropout are elaborated in Appendix A.3.
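The reason matrix operations suffice is that the exponential of a sum factorises: the token2token part depends only on the (key, query) pair and the source2token part only on the (key, feature) pair. The NumPy sketch below checks this numerically against an explicit tensor computation. It is a sketch under several assumptions that are not taken from the source: identity scale functions, a row-per-token layout, 0/1 masks that keep the diagonal (so that no query is fully masked), `tanh` as the source2token activation with biases omitted, and hypothetical function and parameter names.

```python
import numpy as np

def tsa_head_matrix(Q, K, V, S, mask01):
    """One head using matrix products only.
    Q, K, V, S: (n, d), one token per row; mask01[i, j] = 1 if query j
    may attend to key i, else 0."""
    R = K @ Q.T / np.sqrt(Q.shape[1])             # token2token scores, (keys, queries)
    E_R = np.exp(R - R.max(axis=0)) * mask01      # masked token2token weights
    E_S = np.exp(S - S.max(axis=0))               # source2token weights, (n, d)
    num = E_R.T @ (E_S * V)                       # unnormalised outputs, (queries, d)
    den = E_R.T @ E_S                             # normalisers, (queries, d)
    return num / np.where(den > 0, den, 1.0)      # fully masked queries yield zeros

def tsa_head_tensor(Q, K, V, S, mask01):
    """Reference: builds the full (keys, queries, d) score tensor."""
    R = K @ Q.T / np.sqrt(Q.shape[1])
    with np.errstate(divide="ignore"):
        M = np.log(mask01)                        # 1 -> 0, 0 -> -inf
    T = R[:, :, None] + S[:, None, :] + M[:, :, None]
    T = T - T.max(axis=0, keepdims=True)
    P = np.exp(T)
    P = P / P.sum(axis=0, keepdims=True)          # softmax over keys
    return np.einsum("ijl,il->jl", P, V)

def mtsa(X, head_params, masks):
    """Run every head in its own subspace and concatenate the outputs."""
    outs = []
    for (W_q, W_k, W_v, W_s1, W_s2), mask01 in zip(head_params, masks):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        S = np.tanh(K @ W_s1) @ W_s2              # source2token scores (biases omitted)
        outs.append(tsa_head_matrix(Q, K, V, S, mask01))
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(0)
n, d_e, h = 6, 8, 2
d_h = d_e // h
X = rng.normal(size=(n, d_e))
fw = np.triu(np.ones((n, n)))   # key i <= query j; diagonal kept (assumption)
bw = fw.T                       # key i >= query j
head_params = [(rng.normal(size=(d_e, d_h)), rng.normal(size=(d_e, d_h)),
                rng.normal(size=(d_e, d_h)), rng.normal(size=(d_h, d_h)),
                rng.normal(size=(d_h, d_h))) for _ in range(h)]
out = mtsa(X, head_params, [fw, bw])            # shape (n, h * d_h)
```

The matrix version never materialises more than `(n, n)` and `(n, d)` arrays, whereas the reference holds an `(n, n, d)` tensor; this is the memory saving the section describes.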

4 Experiments

We compare MTSA with commonly-used context fusion baselines on several NLP tasks. [Footnote 2: Codes for the experiments are publicly released.] When addressing a sentence embedding problem, a multi-dim source2token self-attention is applied on top of the context fusion module to produce the sequence embedding. Codes are implemented in Python with TensorFlow and executed on a single NVIDIA GTX 1080Ti graphics card. In addition, data for both time cost and memory consumption are collected under TensorFlow 1.7 with CUDA 9 and cuDNN 7. The fair and reliable experiment setups are elaborated in Appendix B.



Model | Params | Time/Epoch | Inf. Time | Memory | Train Acc. | Test Acc.
300D SPINN-PI encoders Bowman et al. (2016) | 3.7m | - | - | - | 89.2 | 83.2
600D Bi-LSTM encoders Liu et al. (2016) | 2.0m | - | - | - | 86.4 | 83.3
600D Bi-LSTM enc.+intra-attn Liu et al. (2016) | 2.8m | - | - | - | 84.5 | 84.2
600D Deep Gated Attn. Chen et al. (2017) | 11.6m | - | - | - | 90.5 | 85.5
600D Gumbel TreeLSTM enc. Choi et al. (2018) | 10.0m | - | - | - | 93.1 | 86.0
600D Residual stacked enc. Nie and Bansal (2017) | 29.0m | - | - | - | 91.0 | 86.0
300D Reinforced SAN Shen et al. (2018b) | 3.1m | 404s | - | - | 92.6 | 86.3
Distance-based SAN Im and Cho (2017) | 4.7m | 416s | - | - | 89.6 | 86.3
Bi-LSTM Graves et al. (2013) | 2.9m | 854s | 9.1s | 942MB | 90.4 | 85.0
Bi-GRU Chung et al. (2014) | 2.5m | 850s | 9.4s | 810MB | 91.9 | 84.9
Multi-CNN Kim (2014) | 1.4m | 137s | 1.4s | 208MB | 89.3 | 83.2
Hrchy-CNN Gehring et al. (2017) | 3.4m | 195s | 1.8s | 309MB | 91.3 | 83.9
Multi-head Vaswani et al. (2017) | 2.0m | 179s | 1.5s | 466MB | 89.6 | 84.2
DiSA Shen et al. (2018a) | 2.3m | 390s | 5.2s | 6682MB | 91.1 | 85.6
Bi-BloSA Shen et al. (2018c) | 4.1m | 303s | 3.2s | 1600MB | 91.6 | 85.8
MTSA | 2.9m | 180s | 1.6s | 558MB | 91.8 | 86.3
Table 1: Experimental results for different methods with comparable numbers of parameters on SNLI. Params: the number of parameters (excluding the word embedding part); Time/Epoch: averaged training time per epoch with batch size 128; Inf. Time: averaged dev inference time with batch size 128; Memory: memory load on synthetic data of sequence length 64 and batch size 64 with back-propagation considered; Train Acc. and Test Acc.: the accuracies on the training/test sets. All state-of-the-art methods on the leaderboard are listed in Tables 1 and 2 up to Sep. 2018.

The context fusion baselines include 1) Bi-LSTM Graves et al. (2013): 600D bi-directional LSTM consisting of 300D forward plus 300D backward LSTMs, 2) Bi-GRU Chung et al. (2014): 600D bi-directional GRU, 3) Multi-CNN Kim (2014): three CNNs with 200D kernels to model 3/4/5-grams respectively, 4) Hrchy-CNN Gehring et al. (2017): 3-layer 300D stacked CNN with kernel size 5, gated linear units Dauphin et al. (2016) and residual connections He et al. (2016), 5) Multi-head Vaswani et al. (2017): 600D multi-head self-attention with 8 heads (75-dim subspace per head) and the positional embedding used by Vaswani et al. (2017), 6) DiSA Shen et al. (2018a): 600D directional self-attention mechanism consisting of 300D forward and 300D backward masked self-attentions, and 7) Bi-BloSA Shen et al. (2018c): 600D bi-directional block self-attention with intra-/inter-block self-attention, aiming to reduce the time and space complexities of multi-dim self-attention by using a hierarchical structure.

4.1 Natural Language Inference

Natural language inference (NLI) aims at reasoning about the relationship between a premise and a corresponding hypothesis, where the relationship could be entailment, neutral or contradiction. In experiments, we first compare MTSA with other baselines on the Stanford Natural Language Inference (SNLI) dataset Bowman et al. (2015).

Following the method of applying a sentence-encoding model to NLI given by Bowman et al. (2016), two parameter-tied sentence-encoding models are used to generate embeddings for the premise and the hypothesis, resulting in $s^{p}$ and $s^{h}$ respectively. The concatenation of $s^{p}$, $s^{h}$, $s^{p} - s^{h}$ and $s^{p} \odot s^{h}$ representing the relationship is passed into a 3-way neural classifier for final prediction.
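As a small illustration of how the two sentence embeddings can be combined before classification, the sketch below assumes the commonly used four-part feature vector (both embeddings, their difference, and their element-wise product). The helper name `pair_features` and the toy vectors are hypothetical.

```python
import numpy as np

def pair_features(s_p, s_h):
    """Concatenate both embeddings with their difference and element-wise product."""
    return np.concatenate([s_p, s_h, s_p - s_h, s_p * s_h])

s_p = np.array([1.0, 2.0])   # toy premise embedding
s_h = np.array([0.5, 4.0])   # toy hypothesis embedding
feats = pair_features(s_p, s_h)   # length 4 * d, fed to the 3-way classifier
```

Because the encoders share parameters, the difference and product terms give the classifier direct access to how the two sentences relate feature by feature.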

The experimental results of the models from the official leaderboard, the baselines, and MTSA are shown in Table 1. MTSA achieves state-of-the-art performance with less time and memory cost. Compared to the methods from the leaderboard, MTSA outperforms RNN-based encoders (e.g., Residual stacked enc.), RNN+attention encoders (e.g., Deep Gated Attn.) and even parse-tree-based encoders (e.g., Gumbel TreeLSTM enc.). Compared to the two competitive self-attention networks with complicated and expensive training computations, MTSA trained in an end-to-end manner achieves the same state-of-the-art performance while using fewer parameters and less computation time.

Compared to the baselines, MTSA is faster than RNN-based models and outperforms CNN-based models given a similar number of parameters and computation time. Moreover, compared to the dot-product self-attention (Multi-head), MTSA costs similar time and memory but performs a more expressive self-attention, and thus achieves better performance. Furthermore, compared to the multi-dim self-attention (DiSA and Bi-BloSA), MTSA uses much less memory and time yet produces better prediction quality.

Model | SNLI Dev | SNLI Test | MultiNLI Match | MultiNLI Mismatch
BiLSTM w/ Shortcut | - | 86.0 | 74.6 | 73.6
BiLSTM w/ Gen-Pooling | - | 86.6 | 73.8 | 74.0
HBMP | - | 86.6 | 73.7 | 73.0
Transfer + Multi-Head | 86.9 | 86.6 | 76.3 | 75.7
Transfer + MTSA | 87.2 | 86.9 | 76.7 | 76.4
Table 2: Experimental results on sentence-encoding based SNLI and MultiNLI benchmark tasks. "Transfer" denotes a language model pretrained on a large corpus for transfer learning, which is detailed by Radford et al. (2018). References: Nie and Bansal (2017), Chen et al. (2018), Talman et al. (2018).

In addition, to further improve the state-of-the-art performance, in contrast to training from scratch, a language model built on the Transformer Vaswani et al. (2017) and pretrained without supervision on a large English corpus (detailed by Radford et al. (2018)) is transferred to the baseline and proposed models for sentence-encoding based NLI tasks. As shown in Table 2, MTSA integrated with the pretrained language model achieves new state-of-the-art accuracy on both SNLI and Multi-Genre Natural Language Inference (MultiNLI) Williams et al. (2017) among all sentence-encoding models. [Footnote 3: All test results are evaluated on the official Kaggle websites.]

An ablation study of MTSA is shown in Table 3 to verify the contribution of each of its parts to context fusion. The results show that token2token (modeling pairwise dependency), source2token (modeling global dependency), and positional masks (encoding sequential information) all contribute important information to sequence modeling, and the contributions are complementary.

Model | Params | Inf. Time (s) | Test Acc.
MTSA | 2.9m | 1.6 | 86.3
MTSA w/o fw&bw masks | 2.9m | 1.6 | 85.3 (-1.0)
MTSA w/o token2token | 2.5m | 1.5 | 85.8 (-0.5)
MTSA w/o source2token | 2.5m | 1.4 | 84.9 (-1.4)
MTSA w/o proposed modules | 1.8m | 1.1 | 84.3 (-2.0)
Table 3: An ablation study of MTSA on SNLI.

Models | Training Time | Dev P | Dev R | Dev F1 | Dev Comp. | WSJ P | WSJ R | WSJ F1 | WSJ Comp. | Brown P | Brown R | Brown F1 | Brown Comp.
Täckström et al. (2015) | - | 81.2 | 76.2 | 78.6 | 54.4 | 82.3 | 77.6 | 79.9 | 56.0 | 74.3 | 68.6 | 71.3 | 39.8
Zhou and Xu (2015) | - | 79.7 | 79.4 | 79.6 | - | 82.9 | 82.8 | 82.8 | - | 70.7 | 68.2 | 69.4 | -
He et al. (2017) | - | 81.6 | 81.6 | 81.6 | 62.3 | 83.1 | 83.0 | 83.1 | 64.3 | 72.8 | 71.4 | 72.1 | 44.8
He et al. (2018) | - | - | - | - | - | - | - | 83.9 | - | - | - | 73.7 | -
Strubell et al. (2018) | - | - | - | - | - | 84.7 | 84.2 | 84.5 | - | 73.9 | 72.4 | 73.1 | -
Bi-LSTM Graves et al. (2013) | 72h | 81.8 | 83.4 | 82.6 | 63.3 | 83.0 | 84.0 | 83.5 | 64.6 | 72.3 | 72.8 | 72.5 | 46.8
Multi-CNN Kim (2014) | 19h | 75.2 | 79.6 | 77.3 | 53.6 | 77.3 | 80.9 | 79.0 | 55.5 | 68.3 | 70.3 | 69.3 | 41.9
Multi-head Tan et al. (2017) | 20h | 82.6 | 83.6 | 83.1 | 65.2 | 84.5 | 85.2 | 84.8 | 66.4 | 73.5 | 74.6 | 74.1 | 48.4
MTSA | 20h | 82.8 | 84.4 | 83.6 | 65.4 | 84.2 | 85.3 | 84.8 | 67.0 | 74.3 | 74.6 | 74.5 | 49.1
Table 4: Experimental results of SRL for single models on CoNLL-05 with gold predicates (Dev: development set; WSJ and Brown: test sets). The Multi-head baseline is equivalent to the model in Tan et al. (2017). For fair comparisons, first, we use the hyper-parameters provided by Tan et al. (2017) instead of tuning them; second, all listed models are independent of external linguistic information, e.g., PoS and dependency parsing.

4.2 Semantic Role Labeling

To verify the capability of MTSA in generating a context-aware representation of each token, we compare it with baselines on the semantic role labeling (SRL) task, which aims to tag each token of an input sequence with a label for its semantic role. In particular, given a sentence, the goal of SRL is to identify the arguments of each target verb and classify them into semantic roles, which can benefit many downstream NLP tasks. SRL has two steps: 1) assigning either a semantic argument or non-argument to a given predicate and 2) labeling a specific semantic role for the identified argument.

We follow the experimental setup in Tan et al. (2017), where the SRL task is treated as a BIO tagging problem. Tan et al. (2017) designed a deep attentive neural net by stacking multi-head self-attention, named deepatt, to perform context fusion, whose output is then passed to a neural classifier to make the final decision. The results achieved by previous methods, baselines, and MTSA are shown in Table 4, which demonstrates that MTSA achieves new state-of-the-art performance on the CoNLL-05 dataset with training time similar to that of the CNN and multi-head self-attention baselines.
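For readers unfamiliar with the BIO formulation, the following sketch converts labelled argument spans into per-token BIO tags. The example sentence, the role names, and the helper `spans_to_bio` are hypothetical and serve only to show the encoding; they are not taken from the CoNLL-05 data.

```python
def spans_to_bio(n_tokens, spans):
    """spans: list of (start, end, role) with `end` exclusive and no overlaps."""
    tags = ["O"] * n_tokens
    for start, end, role in spans:
        tags[start] = "B-" + role            # first token of the span
        for t in range(start + 1, end):
            tags[t] = "I-" + role            # remaining tokens inside the span
    return tags

tokens = ["The", "cat", "chased", "a", "mouse", "."]
spans = [(0, 2, "ARG0"), (2, 3, "V"), (3, 5, "ARG1")]
tags = spans_to_bio(len(tokens), spans)
```

With this encoding, SRL becomes a per-token classification problem, so the context fusion module only has to supply one context-aware vector per token.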

4.3 Sentence Classifications

Model | CR | MPQA | SUBJ | TREC | SST-5
cBoW | 79.9 | 86.4 | 91.3 | 87.3 | /
Skip-thought | 81.3 | 87.5 | 93.6 | 92.2 | /
DCNN | / | / | / | 93.0 | 48.5
SRU | 84.8(1.3) | 89.7(1.1) | 93.4(0.8) | 93.9(0.6) | /
CNNs | 82.2(.2) | 88.8(1.2) | 92.9(0.7) | 93.2(0.5) | /
Bi-LSTM | 84.6(1.6) | 90.2(0.9) | 94.7(0.7) | 94.4(0.3) | 49.9(0.8)
Multi-head | 82.6(1.9) | 89.8(1.2) | 94.0(0.8) | 93.4(0.4) | 48.2(0.6)
DiSA | 84.8(2.0) | 90.1(0.4) | 94.2(0.6) | 94.2(0.1) | 51.0(0.7)
Bi-BloSA | 84.8(0.9) | 90.4(0.8) | 94.5(0.5) | 94.8(0.2) | 50.6(0.5)
MTSA | 84.9(2.4) | 90.5(0.6) | 94.5(0.6) | 95.3(0.3) | 51.3(0.7)
Table 5: Experimental results on five sentence classification benchmarks. References: Mikolov et al. (2013), Kiros et al. (2015), Kalchbrenner et al. (2014), Lei and Zhang (2017).

The goal of sentence classification is to predict the correct label for a sentence in various scenarios. We evaluate the models on five sentence classification benchmarks for different NLP tasks, which include 1) CR Hu and Liu (2004): customer reviews of various products, where the task is to predict whether a review is positive or negative, 2) MPQA Wiebe et al. (2005): an opinion polarity detection subtask of the MPQA dataset, 3) SUBJ Pang and Lee (2004): a subjectivity dataset where a label indicates whether a sentence is subjective or objective, 4) TREC Li and Roth (2002): a question-type classification dataset which classifies question sentences into six classes, 5) SST-5 Socher et al. (2013): the Stanford Sentiment Treebank dataset with five sentiment labels. The reported accuracies for CR, MPQA, and SUBJ are the mean of 10-fold cross validation. The accuracies for TREC are the mean of five runs on the dev set, and the accuracies for SST-5 are the mean of five runs on the test set. All standard deviations are shown in parentheses.

The prediction accuracies achieved on these five benchmarks are shown in Table 5. MTSA achieves the best prediction accuracy on CR, MPQA, TREC and SST-5 benchmarks with better time efficiency and a lower memory load.

4.4 Machine Translation

We also evaluate the proposed model on the WMT 2014 English-German translation task for exhaustive comparisons with multi-head attention. We replace the multi-head self-attention modules in the encoder of the official Transformer implementation with MTSA modules and do not tune the hyperparameters. Although our computation resources are limited, we use two training setups and also introduce a t-test to check that MTSA consistently outperforms multi-head self-attention in the Transformer.

For Setup1, we use default hyperparameter set of transformer_base_single_gpu provided by official implementation with , batch size of 2048 and training step of 250K, and report BLEU value for the last checkpoint. For Setup2, we use the hyperparameter set of transformer_base with the modification of 1) using instead of , 2) increasing batch size from 4096 to 6144 per GPU, and 3) using training step of 133K. More details of the training setups for translation task are described in Appendix B.1.

Model | Multi-head (Transformer) | MTSA | p-value
Param# | 61.38M | 61.58M |
Setup1 | 23.64 | 24.09 | 0.001 (6 runs)
Setup2 | 26.98 | 27.21 | 0.080 (3 runs)
Table 6: Results for the Transformer with either multi-head self-attention or proposed MTSA. The reported BLEU values for Setup 1 and 2 are the mean of 5 and 3 runs respectively.

As shown in Table 6, with small p-values for both training setups 1 and 2, the encoder with MTSA significantly outperforms that with multi-head self-attention, which demonstrates that the multi-dim based MTSA, modeling both pairwise and global dependencies, is more expressive than dot-product based multi-head self-attention. Although the results do not improve the state-of-the-art BLEU value on the machine translation task, this experiment fulfils its purpose of verifying the effectiveness of MTSA in contrast to dot-product based multi-head self-attention.

5 Conclusion

In conclusion, MTSA is highly parallelizable and more expressive, since it efficiently captures the pairwise dependency at the token level, delicately models the global dependency at the feature level, and distributes computations to multiple heads, each equipped with a distinct positional mask. These properties lead to a sweet spot in the trade-off between performance and efficiency, making MTSA as memory-efficient as a CNN and scalable to long sequences while outperforming previous (even multi-dim) self-attention mechanisms in terms of prediction quality. The experiments conducted on nine NLP tasks verify that MTSA can reach state-of-the-art performance with appealing efficiency.

6 Acknowledgments

This research was funded by the Australian Government through the Australian Research Council (ARC) under grants 1) LP160100630 partnership with Australia Government Department of Health and 2) LP150100671 partnership with Australia Research Alliance for Children and Youth (ARACY) and Global Business College Australia (GBCA). We also acknowledge the support of NVIDIA Corporation and MakeMagic Australia with the donation of GPUs.


Appendix A Supplemental Contents for MTSA

A.1 Scale Functions

The and in Eq.(10) are either parameterized or non-parameterized scale functions, which are hyperparameters of the proposed model. They can adjust the manner and weights of the combination of the two alignment score entries.

In this work, we simply focus on non-parameterized scale functions, switching between and , which control the way to combine the two kinds of alignment scores in the attention mechanism. In particular, since the summed alignment score will be passed into a function with exponential operation for attention probabilities, function will provide a -scaled multiplicative item for the combination during the normalization of , in contrast to the additive item provided by . In addition, there are two other reasons to leverage the scale functions: 1) as stated in Vaswani et al. (2017), and similar to with temperature, avoids large alignment scores, which as the inputs to the softmax function will result in extremely small gradients; 2) without a scale function, the alignment score can be very large and may cause numerical problems during backpropagation.
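The first reason above can be illustrated numerically. The following is a minimal sketch, not the scale functions of this appendix: it assumes a Transformer-style division of the scores by a constant, and the score values and the constant `scale` are hypothetical. It compares the largest diagonal entry of the softmax Jacobian, p_i(1 - p_i), with and without scaling.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_grad_diag(probs):
    """Diagonal of the softmax Jacobian: d p_i / d s_i = p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in probs]


raw_scores = [40.0, 20.0, 0.0]   # hypothetical large alignment scores
scale = 8.0                      # hypothetical constant (e.g. sqrt(64))
scaled_scores = [s / scale for s in raw_scores]

grad_raw = max(softmax_grad_diag(softmax(raw_scores)))
grad_scaled = max(softmax_grad_diag(softmax(scaled_scores)))

print(f"largest gradient without scaling: {grad_raw:.3e}")
print(f"largest gradient with scaling:    {grad_scaled:.3e}")
```

With the unscaled scores the softmax is almost one-hot, so every gradient entry is close to zero; after scaling, the distribution is softer and the gradient is several orders of magnitude larger.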

A.2 Equivalence of MTSA and Its Memory-Optimized Computation Scheme

In this section, we rigorously prove that Algorithm 1 outputs the same results as Eq.(13). In the following analysis, we remove the subscript as the index of heads in Algorithm 1 Step 2-7 for simplicity, and use to indicate the indices of key/value tokens, query tokens and feature channels, respectively. We have that,

Hence, the computation scheme in Algorithm 1 produces exactly the same output as the original MTSA but does not require any high-rank tensor operation. Therefore, it produces the expressively powerful tensorized alignment scores but is computed as fast and as memory-efficiently as a CNN.
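The kind of equivalence claimed here can be checked numerically on a small example. The sketch below is an illustration under stated assumptions rather than a transcription of Algorithm 1 or Eq.(13): it assumes the tensorized score for key/value token i, query token j and feature channel k is the sum of a token2token score `t2t[i][j]` and a source2token score `s2t[i][k]`, and it omits heads, masks and scale functions. All function and variable names are hypothetical.

```python
import math
import random


def naive_attention(t2t, s2t, values):
    """Softmax over key index i of an explicit rank-3 score tensor."""
    n, d = len(values), len(values[0])
    # scores[i][j][k] = t2t[i][j] + s2t[i][k]  (n x n x d entries)
    scores = [[[t2t[i][j] + s2t[i][k] for k in range(d)]
               for j in range(n)] for i in range(n)]
    out = [[0.0] * d for _ in range(n)]
    for j in range(n):
        for k in range(d):
            weights = [math.exp(scores[i][j][k]) for i in range(n)]
            total = sum(weights)
            out[j][k] = sum(weights[i] * values[i][k] for i in range(n)) / total
    return out


def transpose_matmul(a, b):
    """Return a^T @ b for nested-list matrices."""
    inner, rows, cols = len(a), len(a[0]), len(b[0])
    return [[sum(a[i][r] * b[i][c] for i in range(inner)) for c in range(cols)]
            for r in range(rows)]


def factored_attention(t2t, s2t, values):
    """Same result using only matrices, since exp(a + b) = exp(a) * exp(b)."""
    n, d = len(values), len(values[0])
    exp_t = [[math.exp(x) for x in row] for row in t2t]   # n x n
    exp_s = [[math.exp(x) for x in row] for row in s2t]   # n x d
    exp_s_times_v = [[exp_s[i][k] * values[i][k] for k in range(d)]
                     for i in range(n)]
    numerator = transpose_matmul(exp_t, exp_s_times_v)    # n x d
    denominator = transpose_matmul(exp_t, exp_s)          # n x d
    return [[numerator[j][k] / denominator[j][k] for k in range(d)]
            for j in range(n)]


rng = random.Random(0)
n, d = 5, 3
t2t = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
s2t = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

reference = naive_attention(t2t, s2t, values)
optimized = factored_attention(t2t, s2t, values)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(reference, optimized)
               for a, b in zip(row_a, row_b))
print(f"max abs difference: {max_diff:.2e}")
```

The factored version never materializes the n x n x d tensor: it needs one n x n matrix, two n x d matrices, two matrix products and one element-wise division.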

A.3 More Details about Algorithm 1

Memory-Efficiency (illustrated in Figure 1(a)): Compared to multi-dim token2token self-attention Shen et al. (2018a) that inherently requires 4-D tensors (with the shape of [batch size, sequence length, sequence length, feature channels]) to store the alignment scores during the training phase, MTSA does not use any high-rank tensor operation but only matrix multiplications to avoid memory explosion.

Time-Efficiency (illustrated in Figure 1(b)): MTSA is highly parallelizable because its computations can be distributed to multiple subspaces, and only a few matrix multiplications (which are also highly parallelizable) occur in each subspace. Compared to dot-product based multi-head attention, multi-dim based MTSA only uses two extra fully-connected layers and two element-wise matrix operations.

Attention Dropout: Similar to Vaswani et al. Vaswani et al. (2017), the dropout Srivastava et al. (2014) with the keep probability of can be applied to both the token2token and the source2token attention probabilities in MTSA. In particular, the can be applied to each of two matrices composing the dividend in Algorithm 1 Step 7, i.e., replacing “” with “”, each with the keep probability of .
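The memory comparison in the Memory-Efficiency paragraph above can be made concrete with a back-of-the-envelope calculation. This is a rough sketch: the batch size, sequence length, channel count and head count are hypothetical, 32-bit floats are assumed, and the matrix-only scheme is assumed to store one [sequence length, sequence length] score matrix per head.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor of the given shape (float32 by default)."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Hypothetical sizes, chosen only for illustration.
batch, seq_len, channels, heads = 32, 256, 300, 4

# 4-D alignment scores of multi-dim token2token self-attention.
multi_dim_bytes = tensor_bytes([batch, seq_len, seq_len, channels])
# One seq_len x seq_len score matrix per head (assumption stated above).
per_head_bytes = tensor_bytes([batch, heads, seq_len, seq_len])

ratio = multi_dim_bytes / per_head_bytes
print(f"4-D score tensor:    {multi_dim_bytes / 2**20:.0f} MiB")
print(f"per-head matrices:   {per_head_bytes / 2**20:.0f} MiB")
print(f"ratio:               {ratio:.0f}x")
```

Under these assumptions the ratio is simply the number of feature channels divided by the number of heads, and it is independent of the batch size and the sequence length.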

Appendix B Training Setups

The optimization objectives for classification and regression problems are cross-entropy loss and mean square error respectively, which we minimize by using the Adadelta Zeiler (2012) or Adam Kingma and Ba (2015) optimizer. All trainable weight matrices are initialized by the Glorot Initializer Glorot and Bengio (2010), and all the biases are initialized as zeros. We use 300D (except 200D for the SRL task) GloVe 6B pre-trained vectors Pennington et al. (2014) to initialize the word embeddings. The embedding for a word in the training set but not in GloVe is randomly initialized by sampling from a uniform distribution between . The word embeddings will be fine-tuned during the training phase. We also apply Dropout with keep probability , and L2 regularization with weight decay factor to all the models to avoid overfitting. The unspecified activation functions for all fully-connected layers appearing in the models are set to ReLU Glorot et al. (2011). The activation function applied to token2token alignment scores is set to unless otherwise specified.
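The weight and bias initialization described above can be sketched as follows. The text does not say whether the uniform or the normal variant of the Glorot initializer is used. This sketch assumes the uniform variant, whose limit is sqrt(6 / (fan_in + fan_out)), and the layer size is hypothetical.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-limit, limit),
    with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]


rng = random.Random(0)
fan_in, fan_out = 300, 300          # hypothetical layer size
weights = glorot_uniform(fan_in, fan_out, rng)
biases = [0.0] * fan_out            # biases start at zero

limit = math.sqrt(6.0 / (fan_in + fan_out))
largest = max(abs(w) for row in weights for w in row)
print(f"limit = {limit:.3f}, largest sampled |w| = {largest:.3f}")
```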

For fair and reliable comparisons with baselines and prior works, on SNLI and sentence classification tasks, we follow the training setup and hyperparameters used in the corresponding prior works, and only tune the dropout probability for different baseline or ablation models, without any other trick (e.g., learning rate schedule, batch/layer normalization, etc.); on SRL, we directly employ the training setup and the well-tuned hyperparameters used in the prior state-of-the-art work Tan et al. (2017) based on the multi-head self-attention mechanism, without tuning them specifically for our proposed model. Besides, for the language model based transfer learning for SNLI and MultiNLI tasks, we use the pretrained model provided by Radford et al. (2018). We use the language model as the auxiliary task for the models' universality with the coefficient of , and use other hyper-parameters (e.g., initial learning rate, optimizer, learning rate schedule, epoch number) given by Radford et al. (2018).

Finally, we give the details of the hyper-parameter selection that leads the proposed model to achieve the optimal performance on each NLP benchmark in the following.

For SNLI dataset (natural language inference), we set and , and use Adadelta as the optimizer with mini batch size of . We do not use the attention dropout for this benchmark. Besides, the activation function for the fully-connected layer is set to ELU Clevert et al. (2016). The training procedure is completed within 600K steps, taking approximately 12 hours.

For CoNLL-05 dataset (semantic role labeling), we use the same hyper-parameters provided in Tan et al. (2017) rather than tune them for a fair comparison. The keep probabilities of dropout for fully-connected layer and residual connection He et al. (2016) are set to and respectively. The attention dropout with keep probability of is applied to both source2token and token2token alignment scores, which equals to setting the probability to in MTSA. And, the activation function applied to the token2token alignment scores is set to . Besides, different from using fixed positional embedding in Tan et al. (2017), we remove it and only use the forward and backward masks in MTSA to encode bi-directional order information. The training will finish within about 20 hours by using Adadelta optimizer.

For CR, MPQA and SUBJ datasets, we set and for these three benchmarks. And we apply attention dropout with keep probability of to CR and MPQA. Different from the other experiments in this paper, we here use Adam as the optimizer to train the models, and do not use any learning rate decay trick. The training procedure is completed within 1000 batch steps.

For TREC dataset (question-type classification), we set and and do not apply the attention dropout. The training procedure is completed within 80K steps by using Adadelta optimizer.

For SST-5 dataset (sentiment analysis), we set and and do not apply the attention dropout. The training procedure is completed within 120K steps by using Adadelta optimizer.

B.1 Evaluation on Machine Translation

For machine translation, due to the limited computation resources, we build two training and decoding setups which require fewer GPUs to fairly and reliably compare the Transformer with either multi-head self-attention or the proposed MTSA.

According to the reproducibility experiments at issue#317, in which the transformer_base model from the official implementation tensor2tensor needs , batch size of 4096 and training step of 250K to achieve the BLEU value of 27.76, our reproducibility experiment of the Transformer with , batch size of 6144 and training step of 133K, which achieves a BLEU value of 27, is reasonable and accurate. The issue#444 of tensor2tensor also demonstrates that the Transformer trained on hurts BLEU point compared to that trained on , and the Transformer trained on fewer GPUs cannot achieve state-of-the-art decoding performance even if using more GPU hours.

Appendix C Related Work

The self-attention mechanism was first applied to NLP tasks to implicitly model the syntactic dependency by using a pairwise compatibility function. Kim et al. (2017) proposed a syntactic attention mechanism to simulate syntactic tree selection, which can be regarded as a self-attention mechanism making soft-selection based on the learned syntactic dependency model. Hu et al. (2017) presented a self aligning layer to align information between context words, allowing crucial clues to be fused into the context-aware representation, which mitigates a limitation of the capability of an RNN in long-term dependency modeling. Vaswani et al. (2017) proposed a scaled dot-product attention mechanism where a scaled dot-product compatibility function is leveraged to model the syntactic pairwise dependency. They then presented a multi-head attention mechanism based on the dot-product attention, which employs multiple subspaces to capture diverse dependencies and save memory/computation/parameters. An attention-only model, dubbed “Transformer”, based solely on multi-head attention was finally proposed for sequence to sequence tasks. Shen et al. (2018a) proposed a multi-dimensional compatibility function to capture feature-level dependency or relevance for attention mechanism. They then introduced a directional (masked) self-attention mechanism, in which the multi-dim compatibility function is used to model the pairwise dependency, and forward and backward positional masks are leveraged to capture bi-directional order information.

Furthermore, there was another type of self-attention mechanism capturing the contribution and dependency of each token to the entire source sequence for a specific task, which can be used on sentence encoding or sequence compression task. Liu et al. (2016) proposed an intra-sentence attention mechanism where the pooling result of the input sequence is used as the query attending to each token from the same sequence. They applied it to sentence embedding tasks. Lin et al. (2017) proposed a self-attentive model using a matrix to represent the sentence embedding, with each row of the matrix attending to a different part of the sentence. It shares a similar idea with the multi-head attention Vaswani et al. (2017). Shen et al. (2018a) proposed a source2token self-attention mechanism that removes the query from the multi-dim compatibility function, for the purpose of directly modeling the feature-wise contribution of each token to the entire input source on a specific task.

Self-attention mechanisms introduced above have been implemented on a wide range of practical tasks and achieved state-of-the-art performance. Lin et al. (2017) applied the self-attention model in conjunction with Bi-LSTM to sentence embedding tasks. Hu et al. (2017) integrated the self aligning layer with a general query-context mutual-attention framework (i.e., BiDAF Seo et al. (2017)) to model long-term dependency for the machine comprehension task. Vaswani et al. (2017) applied the attention-only sequence to sequence model, “Transformer”, to neural machine translation. Shen et al. (2018a) employed the directional and source2token self-attention mechanisms respectively as context fusion and sequence compression modules to build a sentence embedding model. Tan et al. (2017) applied the stacked multi-head self-attention mechanism jointly with a fully-connected layer (similar to the encoder in Transformer) to the semantic role labeling task. Im and Cho (2017) proposed distance-aware masks (sharing a similar idea with directional self-attention) to model the distance information between every two tokens in a sequence, and applied them to the sentence-encoding based natural language inference task. Liu et al. (2018) cast the passage summarization problem as a language modeling problem, and used the decoder from Transformer to solve this problem. Yu et al. (2018) employed stacked CNN and self-attention mechanism to model local and long-term dependencies respectively, and achieved new state-of-the-art performance on the machine comprehension task. Veličković et al. (2017) implemented a stacked multi-head attention on a graph to perform transductive and inductive graph tasks, where a node is used as the query attending to its neighboring nodes.

Figure 3: Heatmaps for normalized token2token and source2token alignment scores with forward and backward masks. Each row shows three types of scores associated with the tokens from a same sentence: token2token alignment scores in TSA with forward mask (left), token2token alignment scores in TSA with backward masks (middle), source2token alignment scores at token level for the two heads with forward and backward masks (right). The tokens in -axis and -axis are the dependent and governor tokens, respectively.

Appendix D Discussion

In this paper, we propose a novel multi-dim self-attention mechanism, called multi-mask tensorized self-attention (MTSA), for context fusion purpose. It is equipped with an expressive but previously inefficient multi-dim compatibility function to compute tensorized alignment scores that can capture both pairwise and global dependencies. However, it does not suffer from any time or memory explosion problem that precludes previous multi-dim attention mechanisms from being applied to large-scale datasets or long sequences. Meanwhile, multiple distinct positional masks are applied to multiple heads (subspaces) to model different types of sequential and structural information of input sequence. The experimental results show that MTSA empirically achieves state-of-the-art performance on a wide range of NLP tasks, and is as fast and as memory-efficient as CNN baselines. This indicates that a stacked version of MTSA might improve the performance on more NLP tasks.

There are various intriguing works that are worth exploring based on the proposed model, such as 1) integrating MTSA with Transformer Vaswani et al. (2017) for more complicated and high-level NLP tasks (e.g., neural machine translation and summarization), 2) applying more types of positional masks or distance-aware masks Im and Cho (2017) to different heads, and thus taking into account richer structure information, and 3) integrating the light-weight and time-efficient MTSA with a hierarchical self-attention structure Shen et al. (2018c) for context fusion on long text (e.g., passage and document), rather than using single self-attention mechanism or the non-parallelizable RNN-based models.

Appendix E Visualization

In this section, we use heatmaps to visualize the token2token and source2token alignment scores computed by MTSA mechanism with forward and backward positional masks. The sentences used for visualization are randomly selected from the test set of SNLI dataset. For clarity, the visualized token-level alignment score of multi-dim source2token self-attention is computed by averaging the corresponding vector of feature-wise alignment scores.

As shown in Figure 3, the heatmaps of the alignment scores computed by the MTSA mechanism demonstrate that 1) a token pair with strong syntactic relevance is assigned a high alignment score by the pairwise compatibility function; 2) the important tokens (e.g., verbs and nouns) usually achieve high source2token alignment scores, whereas the trivial tokens (e.g., stop words) obtain relatively low alignment scores; and 3) the MTSA mechanism with backward and forward masks focuses on different positions of the input sentence in different heads (subspaces), which makes the final attention results more versatile and diverse.
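The effect of the forward and backward masks on where each head can focus may be sketched as below. This is an illustration under assumptions, not the exact mask definition used by MTSA: "forward" is taken here to mean that a query token may attend to itself and to earlier tokens, "backward" to itself and to later tokens, and the diagonal is kept unmasked so that every row has at least one valid position. The function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def directional_mask(n, direction):
    """mask[q][k] is 0.0 where query q may attend to key k, else -inf."""
    if direction == "forward":
        allowed = lambda q, k: k <= q    # self and earlier tokens (assumption)
    else:
        allowed = lambda q, k: k >= q    # self and later tokens (assumption)
    return [[0.0 if allowed(q, k) else NEG_INF for k in range(n)]
            for q in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax of scores + mask; masked entries get probability 0."""
    result = []
    for score_row, mask_row in zip(scores, mask):
        logits = [s + m for s, m in zip(score_row, mask_row)]
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


n = 3
uniform_scores = [[0.0] * n for _ in range(n)]   # hypothetical equal scores
forward_probs = masked_softmax(uniform_scores, directional_mask(n, "forward"))
backward_probs = masked_softmax(uniform_scores, directional_mask(n, "backward"))

for name, probs in (("forward", forward_probs), ("backward", backward_probs)):
    print(name)
    for row in probs:
        print("  " + " ".join(f"{p:.2f}" for p in row))
```

With identical raw scores, the two heads still produce different attention patterns (lower-triangular versus upper-triangular), which is one way the masks inject order information without positional embeddings.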