DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding

09/14/2017 ∙ by Tao Shen, et al. ∙ University of Technology Sydney ∙ University of Washington

Recurrent neural nets (RNN) and convolutional neural nets (CNN) are widely used on NLP tasks to capture the long-term and local dependencies, respectively. Attention mechanisms have recently attracted enormous interest due to their highly parallelizable computation, significantly less training time, and flexibility in modeling dependencies. We propose a novel attention mechanism in which the attention between elements from input sequence(s) is directional and multi-dimensional (i.e., feature-wise). A light-weight neural net, "Directional Self-Attention Network (DiSAN)", is then proposed to learn sentence embedding, based solely on the proposed attention without any RNN/CNN structure. DiSAN is only composed of a directional self-attention with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation. Despite its simple form, DiSAN outperforms complicated RNN models on both prediction quality and time efficiency. It achieves the best test accuracy among all sentence encoding methods and improves the most recent best result by 1.02% on the Stanford Natural Language Inference (SNLI) dataset, and shows state-of-the-art test accuracy on the Stanford Sentiment Treebank (SST), Multi-Genre natural language inference (MultiNLI), Sentences Involving Compositional Knowledge (SICK), Customer Review, MPQA, TREC question-type classification and Subjectivity (SUBJ) datasets.




1 Introduction

Context dependency plays a significant role in language understanding and provides critical information to natural language processing (NLP) tasks. For different tasks and data, researchers often switch between two types of deep neural network (DNN): recurrent neural network (RNN) with sequential architecture capturing long-range dependencies (e.g., long short-term memory (LSTM) [Hochreiter and Schmidhuber1997] and gated recurrent unit (GRU) [Chung et al.2014]), and convolutional neural network (CNN) [Kim2014] whose hierarchical structure is good at extracting local or position-invariant features. However, which network to choose in practice is an open question, and the choice relies largely on empirical knowledge.

Recent works have found that equipping RNN or CNN with an attention mechanism can achieve state-of-the-art performance on a large number of NLP tasks, including neural machine translation [Bahdanau, Cho, and Bengio2015, Luong, Pham, and Manning2015], natural language inference [Liu et al.2016], conversation generation [Shang, Lu, and Li2015], question answering [Hermann et al.2015, Sukhbaatar et al.2015], machine reading comprehension [Seo et al.2017], and sentiment analysis [Kokkinos and Potamianos2017]. The attention uses a hidden layer to compute a categorical distribution over elements from the input sequence to reflect their importance weights. It allows RNN/CNN to maintain a variable-length memory, so that elements from the input sequence can be selected by their importance/relevance and merged into the output. In contrast to RNN and CNN, the attention mechanism is trained to capture the dependencies that make significant contributions to the task, regardless of the distance between the elements in the sequence. It can thus provide complementary information to the distance-aware dependencies modeled by RNN/CNN. In addition, computing attention only requires matrix multiplication, which is highly parallelizable compared to the sequential computation of RNN.

In a very recent work [Vaswani et al.2017], an attention mechanism is solely used to construct a sequence to sequence (seq2seq) model that achieves a state-of-the-art quality score on the neural machine translation (NMT) task. The seq2seq model, “Transformer”, has an encoder-decoder structure that is only composed of stacked attention networks, without using either recurrence or convolution. The proposed attention, “multi-head attention”, projects the input sequence to multiple subspaces, then applies scaled dot-product attention to its representation in each subspace, and lastly concatenates their output. By doing this, it can combine different attentions from multiple subspaces. This mechanism is used in Transformer to compute both the context-aware features inside the encoder/decoder and the bottleneck features between them.
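The multi-head mechanism described above (project into several subspaces, apply scaled dot-product attention in each, concatenate) can be sketched in NumPy as follows. This is only an illustrative sketch: the weight names `Wq`, `Wk`, `Wv`, the toy sizes, and the row-per-token layout are assumptions made here, and the final output projection used in Transformer is omitted.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along one axis."""
    a = a - np.max(a, axis=axis, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(x, Wq, Wk, Wv):
    """x: (n, d_model); Wq, Wk, Wv: (heads, d_model, d_head) projection weights."""
    heads = []
    for Wq_h, Wk_h, Wv_h in zip(Wq, Wk, Wv):
        q, k, v = x @ Wq_h, x @ Wk_h, x @ Wv_h       # project into one subspace
        scores = q @ k.T / np.sqrt(k.shape[-1])       # scaled dot product, (n, n)
        heads.append(softmax(scores, axis=-1) @ v)    # attention within the subspace
    return np.concatenate(heads, axis=-1)             # (n, heads * d_head)

rng = np.random.default_rng(0)
n, d_model, n_heads, d_head = 4, 8, 2, 4              # toy sizes (assumptions)
x = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_head)) for _ in range(3))
out = multi_head_self_attention(x, Wq, Wk, Wv)
```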

The attention mechanism has more flexibility in sequence length than RNN/CNN, and is more task/data-driven when modeling dependencies. Unlike sequential models, its computation can be easily and significantly accelerated by existing distributed/parallel computing schemes. However, to the best of our knowledge, a neural net entirely based on attention has not been designed for other NLP tasks except NMT, especially those that cannot be cast into a seq2seq problem. Compared to RNN, a disadvantage of most attention mechanisms is that the temporal order information is lost, which however might be important to the task. This explains why positional encoding is applied to the sequence before being processed by the attention in Transformer. How to model order information within an attention is still an open problem.

The goal of this paper is to develop a unified and RNN/CNN-free attention network that can be generally utilized to learn the sentence encoding model for different NLP tasks, such as natural language inference, sentiment analysis, sentence classification and semantic relatedness. We focus on the sentence encoding model because it is a basic module of most DNNs used in the NLP literature.

We propose a novel attention mechanism that differs from previous ones in that it is 1) multi-dimensional: the attention w.r.t. each pair of elements from the source(s) is a vector, where each entry is the attention computed on each feature; and 2) directional: it uses one or multiple positional masks to model the asymmetric attention between two elements. We compute feature-wise attention since each element in a sequence is usually represented by a vector, e.g., word/character embedding [Kim et al.2016], and attention on different features can contain different information about dependency, which helps handle the variation of contexts around the same word. We apply positional masks to attention distribution since they can easily encode prior structure knowledge such as temporal order and dependency parsing. This design mitigates the weakness of attention in modeling order information, and takes full advantage of parallel computing.

We then build a light-weight and RNN/CNN-free neural network, “Directional Self-Attention Network (DiSAN)”, for sentence encoding. This network relies entirely on the proposed attentions and does not use any RNN/CNN structure. In DiSAN, the input sequence is processed by directional (forward and backward) self-attentions to model context dependency and produce context-aware representations for all tokens. Then, a multi-dimensional attention computes a vector representation of the entire sequence, which can be passed into a classification/regression module to compute the final prediction for a particular task. Unlike Transformer, neither stacking of attention blocks nor an encoder-decoder structure is required. The simple architecture of DiSAN leads to fewer parameters, less computation and easier parallelization.

In experiments (code and pre-trained models can be found at https://github.com/taoshen58/DiSAN), we compare DiSAN with the currently popular methods on various NLP tasks, e.g., natural language inference, sentiment analysis, sentence classification, etc. DiSAN achieves the highest test accuracy on the Stanford Natural Language Inference (SNLI) dataset among sentence-encoding models and improves the currently best result by 1.02%. It also shows state-of-the-art performance on the Stanford Sentiment Treebank (SST), Multi-Genre natural language inference (MultiNLI), SICK, Customer Review, MPQA, SUBJ and TREC question-type classification datasets. Meanwhile, it has fewer parameters and exhibits much higher computation efficiency than the models it outperforms, e.g., LSTM and tree-based models.


Notation: 1) Lowercase denotes a vector; 2) bold lowercase denotes a sequence of vectors (stored as a matrix); and 3) uppercase denotes a matrix or a tensor.

2 Background

2.1 Sentence Encoding

In the pipeline of NLP tasks, a sentence is denoted by a sequence of discrete tokens (e.g., words or characters) $v = [v_1, v_2, \dots, v_n]$, where $v_i$ could be a one-hot vector whose dimension length equals the number of distinct tokens $N$. A pre-trained token embedding (e.g., word2vec [Mikolov et al.2013b] or GloVe [Pennington, Socher, and Manning2014]) is applied to $v$ and transforms all discrete tokens to a sequence of low-dimensional dense vector representations $x = [x_1, x_2, \dots, x_n]$ with $x_i \in \mathbb{R}^{d_e}$. This pre-process can be written as $x = W^{(e)} v$, where word embedding weight matrix $W^{(e)} \in \mathbb{R}^{d_e \times N}$ and $x \in \mathbb{R}^{d_e \times n}$.

Most DNN sentence-encoding models for NLP tasks take $x$ as the input and further generate a vector representation $u_i$ for each $x_i$ by context fusion. Then a sentence encoding is obtained by mapping the sequence $u = [u_1, u_2, \dots, u_n]$ to a single vector $s \in \mathbb{R}^{d}$, which is used as a compact encoding of the entire sentence in NLP problems.
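As a small illustration of the embedding step in this subsection, multiplying the embedding matrix by one-hot token vectors gives the same result as a column lookup. The vocabulary size, embedding width, and token ids below are arbitrary toy values chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_e = 6, 3                          # toy vocabulary size and embedding width (assumptions)
W_e = rng.normal(size=(d_e, N))        # embedding matrix, one column per distinct token
token_ids = np.array([2, 0, 5, 2])     # a toy sentence of n = 4 token ids

v = np.eye(N)[:, token_ids]            # (N, n): one-hot column per token
x = W_e @ v                            # (d_e, n): embedding as a matrix product
x_lookup = W_e[:, token_ids]           # equivalent column lookup, as done in practice
```

The lookup form avoids materialising the one-hot matrix, which is why embedding layers are usually implemented that way.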

2.2 Attention

Attention is proposed to compute an alignment score between elements from two sources. In particular, given the token embeddings of a source sequence $x = [x_1, x_2, \dots, x_n]$ and the vector representation of a query $q$, attention computes the alignment score between $q$ and each token $x_i$ by a compatibility function $f(x_i, q)$, which measures the dependency between $x_i$ and $q$, or the attention of $q$ to $x_i$. A softmax function then transforms the scores $[f(x_i, q)]_{i=1}^{n}$ to a probability distribution $p(z|x, q)$ by normalizing over all the $n$ tokens of $x$. Here $z$ is an indicator of which token in $x$ is important to $q$ on a specific task. That is, large $p(z = i|x, q)$ means $x_i$ contributes important information to $q$. The above process can be summarized by the following equations.

$$a = [f(x_i, q)]_{i=1}^{n} \tag{1}$$

$$p(z|x, q) = \mathrm{softmax}(a) \tag{2}$$

$$p(z = i|x, q) = \frac{\exp(f(x_i, q))}{\sum_{i'=1}^{n} \exp(f(x_{i'}, q))} \tag{3}$$

The output of this attention mechanism is a weighted sum of the embeddings for all tokens in $x$, where the weights are given by $p(z|x, q)$. It places large weights on the tokens important to $q$, and can be written as the expectation of a token sampled according to its importance, i.e.,

$$s = \sum_{i=1}^{n} p(z = i|x, q)\, x_i = \mathbb{E}_{i \sim p(z|x, q)}(x_i) \tag{4}$$

where $s \in \mathbb{R}^{d_e}$ can be used as the sentence encoding of $x$.
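The procedure of Eq.(1)-(4) can be sketched in a few lines of NumPy. The sketch assumes tokens are stored as rows (shape `(n, d_e)`), and it plugs in a plain dot product as the compatibility function purely for illustration; any scalar-valued `f` would do.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along one axis."""
    a = a - np.max(a, axis=axis, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(x, q, f):
    """x: (n, d_e) token embeddings, q: query vector, f(x_i, q) -> scalar score."""
    a = np.array([f(x_i, q) for x_i in x])   # alignment scores, Eq.(1)
    p = softmax(a)                            # p(z = i | x, q), Eq.(2)-(3)
    s = p @ x                                 # weighted sum of tokens, Eq.(4)
    return s, p

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
q = rng.normal(size=4)
s, p = attention(x, q, f=lambda x_i, q: float(x_i @ q))   # dot product as a stand-in f
```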

Figure 1: (a) Traditional (additive/multiplicative) attention and (b) multi-dimensional attention. $z_i$ denotes alignment score $f(x_i, q)$, which is a scalar in (a) but a vector in (b).

Additive attention (or multi-layer perceptron attention) [Bahdanau, Cho, and Bengio2015, Shang, Lu, and Li2015] and multiplicative attention (or dot-product attention) [Vaswani et al.2017, Sukhbaatar et al.2015, Rush, Chopra, and Weston2015] are the two most commonly used attention mechanisms. They share the same unified form of attention introduced above, but differ in the compatibility function $f(x_i, q)$. Additive attention is associated with

$$f(x_i, q) = w^T \sigma(W^{(1)} x_i + W^{(2)} q), \tag{5}$$

where $\sigma(\cdot)$ is an activation function and $w \in \mathbb{R}^{d_e}$ is a weight vector. Multiplicative attention uses inner product or cosine similarity for $f(x_i, q)$, i.e.,

$$f(x_i, q) = \langle W^{(1)} x_i, W^{(2)} q \rangle. \tag{6}$$

In practice, additive attention often outperforms the multiplicative one in prediction quality, but the latter is faster and more memory-efficient due to optimized matrix multiplication.
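The two compatibility functions can be compared side by side in a short NumPy sketch. The choice of `tanh` for the activation and the toy dimensions are assumptions made for this illustration. The last line shows why the multiplicative form is cheap: the scores for all tokens reduce to a single matrix-vector product.

```python
import numpy as np

def additive_score(x_i, q, W1, W2, w, sigma=np.tanh):
    """Eq.(5): w^T sigma(W1 x_i + W2 q), a scalar."""
    return float(w @ sigma(W1 @ x_i + W2 @ q))

def multiplicative_score(x_i, q, W1, W2):
    """Eq.(6): inner product <W1 x_i, W2 q>, a scalar."""
    return float((W1 @ x_i) @ (W2 @ q))

rng = np.random.default_rng(0)
n, d = 5, 4                                   # toy sizes (assumptions)
x = rng.normal(size=(n, d))                   # tokens stored as rows
q = rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w = rng.normal(size=d)

add_scores = np.array([additive_score(x_i, q, W1, W2, w) for x_i in x])
mul_scores = np.array([multiplicative_score(x_i, q, W1, W2) for x_i in x])
mul_scores_batched = (x @ W1.T) @ (W2 @ q)    # all n scores at once
```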

2.3 Self-Attention

Self-Attention is a special case of the attention mechanism introduced above. It replaces $q$ with a token embedding $x_j$ from the source input itself. It relates elements at different positions from a single sequence by computing the attention between each pair of tokens, $x_i$ and $x_j$. It is very expressive and flexible for both long-range and local dependencies, which used to be respectively modeled by RNN and CNN. Moreover, it has much faster computation speed and fewer parameters than RNN. In recent works, we have already witnessed its success across a variety of NLP tasks, such as reading comprehension [Hu, Peng, and Qiu2017] and neural machine translation [Vaswani et al.2017].

3 Two Proposed Attention Mechanisms

In this section, we introduce two novel attention mechanisms, multi-dimensional attention in Section 3.1 (with two extensions to self-attention in Section 3.2) and directional self-attention in Section 3.3. They are the main components of DiSAN and may be of independent interest to other neural nets for other NLP problems in which an attention is needed.

3.1 Multi-dimensional Attention

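Multi-dimensional attention is a natural extension of additive attention (or MLP attention) at the feature level. Instead of computing a single scalar score $f(x_i, q)$ for each token $x_i$ as shown in Eq.(5), multi-dimensional attention computes a feature-wise score vector for $x_i$ by replacing weight vector $w$ in Eq.(5) with a matrix $W$, i.e.,

$$f(x_i, q) = W^T \sigma(W^{(1)} x_i + W^{(2)} q), \tag{7}$$

where $f(x_i, q) \in \mathbb{R}^{d_e}$ is a vector with the same length as $x_i$, and all the weight matrices $W, W^{(1)}, W^{(2)} \in \mathbb{R}^{d_e \times d_e}$. We further add two bias terms to the parts inside and outside the activation $\sigma(\cdot)$, i.e.,

$$f(x_i, q) = W^T \sigma(W^{(1)} x_i + W^{(2)} q + b^{(1)}) + b. \tag{8}$$

We then compute a categorical distribution $p(z_k|x, q)$ over all the $n$ tokens for each feature $k \in [d_e]$. A large $p(z_k = i|x, q)$ means that feature $k$ of token $i$ is important to $q$.

We apply the same procedure Eq.(1)-(3) in traditional attention to the $k$-th dimension of $f(x_i, q)$. In particular, for each feature $k \in [d_e]$, we replace $f(x_i, q)$ with $[f(x_i, q)]_k$, and change $z$ to $z_k$ in Eq.(1)-(3). Now each feature $k$ in each token $i$ has an importance weight $P_{ki} \triangleq p(z_k = i|x, q)$. The output $s$ can be written as

$$s = \left[\sum_{i=1}^{n} P_{ki}\, x_{ki}\right]_{k=1}^{d_e} = \left[\mathbb{E}_{i \sim p(z_k|x, q)}(x_{ki})\right]_{k=1}^{d_e}. \tag{9}$$

We give an illustration of traditional attention and multi-dimensional attention in Figure 1. In the rest of this paper, we will ignore the subscript $k$ which indexes feature dimension for simplification if no confusion is possible. Hence, the output $s$ can be written as an element-wise product $s = \sum_{i=1}^{n} P_{\cdot i} \odot x_i$.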

Remark: The word embedding usually suffers from the polysemy in natural language. Since traditional attention computes a single importance score for each word based on the word embedding, it cannot distinguish the meanings of the same word in different contexts. Multi-dimensional attention, however, computes a score for each feature of each word, so it can select the features that can best describe the word’s specific meaning in any given context, and include this information in the sentence encoding output $s$.
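A minimal NumPy sketch of multi-dimensional attention follows, assuming tokens are stored as rows so that the weight matrix `P` here is the transpose of the one in the text (`P[i, k]` is the weight of feature `k` in token `i`). The `tanh` activation and toy sizes are illustrative assumptions. The key difference from traditional attention is that the softmax runs over tokens separately for every feature.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along one axis."""
    a = a - np.max(a, axis=axis, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_dim_attention(x, q, W, W1, W2, b1, b, sigma=np.tanh):
    """x: (n, d_e) tokens as rows, q: (d_e,). Returns the encoding s and weights P."""
    hidden = sigma(x @ W1.T + W2 @ q + b1)   # (n, d_e)
    scores = hidden @ W + b                   # row i equals W^T sigma(...) + b, Eq.(8)
    P = softmax(scores, axis=0)               # normalize over tokens, per feature
    s = np.sum(P * x, axis=0)                 # feature-wise weighted sum, Eq.(9)
    return s, P

rng = np.random.default_rng(0)
n, d = 5, 4                                   # toy sizes (assumptions)
x, q = rng.normal(size=(n, d)), rng.normal(size=d)
W, W1, W2 = (rng.normal(size=(d, d)) for _ in range(3))
b1, b = rng.normal(size=d), rng.normal(size=d)

s, P = multi_dim_attention(x, q, W, W1, W2, b1, b)
# With W = 0 every token gets the same score, so the output is the plain mean.
s_uniform, _ = multi_dim_attention(x, q, np.zeros((d, d)), W1, W2, b1, b)
```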

3.2 Two types of Multi-dimensional Self-attention

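When extending multi-dimension to self-attentions, we have two variants of multi-dimensional attention. The first one, called multi-dimensional “token2token” self-attention, explores the dependency between $x_i$ and $x_j$ from the same source $x$, and generates context-aware coding for each element. It replaces $q$ with $x_j$ in Eq.(8), i.e.,

$$f(x_i, x_j) = W^T \sigma(W^{(1)} x_i + W^{(2)} x_j + b^{(1)}) + b. \tag{10}$$

Similar to $P$ in vanilla multi-dimensional attention, we compute a probability matrix $P^j \in \mathbb{R}^{d_e \times n}$ for each $x_j$ such that $P^j_{ki} \triangleq p(z_k = i|x, x_j)$. The output for $x_j$ is

$$s_j = \sum_{i=1}^{n} P^j_{\cdot i} \odot x_i. \tag{11}$$

The output of token2token self-attention for all elements from $x$ is $s = [s_1, s_2, \dots, s_n] \in \mathbb{R}^{d_e \times n}$.

The second one, multi-dimensional “source2token” self-attention, explores the dependency between $x_i$ and the entire sequence $x$, and compresses the sequence $x$ into a vector. It removes $q$ from Eq.(8), i.e.,

$$f(x_i) = W^T \sigma(W^{(1)} x_i + b^{(1)}) + b. \tag{12}$$

The probability matrix is defined as $P_{ki} \triangleq p(z_k = i|x)$ and is computed in the same way as $P$ in vanilla multi-dimensional attention. The output $s$ is also the same, i.e.,

$$s = \sum_{i=1}^{n} P_{\cdot i} \odot x_i. \tag{13}$$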

We will use these two types (i.e., token2token and source2token) of multi-dimensional self-attention in different parts of our sentence encoding model, DiSAN.
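The two variants can be sketched with NumPy broadcasting, again assuming a row-per-token layout, a `tanh` activation, and toy sizes (all choices made for this illustration only). In `token2token`, the score tensor has one feature vector per ordered pair of tokens; in `source2token`, there is one feature vector per token.

```python
import numpy as np

def softmax(a, axis):
    """Numerically stable softmax along one axis."""
    a = a - np.max(a, axis=axis, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=axis, keepdims=True)

def token2token(x, W, W1, W2, b1, b, sigma=np.tanh):
    """Returns s of shape (n, d_e); row j is the context-aware vector for token j."""
    a, c = x @ W1.T, x @ W2.T                              # W1 x_i and W2 x_j for all tokens
    hidden = sigma(a[:, None, :] + c[None, :, :] + b1)     # (n, n, d_e), indexed [i, j, k]
    scores = hidden @ W + b                                # f(x_i, x_j) of Eq.(10)
    P = softmax(scores, axis=0)                            # normalize over i for each (j, k)
    return np.einsum("ijk,ik->jk", P, x)                   # Eq.(11)

def source2token(x, W, W1, b1, b, sigma=np.tanh):
    """Returns a single vector of shape (d_e,) that compresses the sequence."""
    scores = sigma(x @ W1.T + b1) @ W + b                  # f(x_i) of Eq.(12)
    P = softmax(scores, axis=0)
    return np.sum(P * x, axis=0)                           # Eq.(13)

rng = np.random.default_rng(0)
n, d = 5, 4                                                # toy sizes (assumptions)
x = rng.normal(size=(n, d))
W, W1, W2 = (rng.normal(size=(d, d)) for _ in range(3))
b1, b = rng.normal(size=d), rng.normal(size=d)

s_t2t = token2token(x, W, W1, W2, b1, b)
s_s2t = source2token(x, W, W1, b1, b)
s_t2t_uniform = token2token(x, np.zeros((d, d)), W1, W2, b1, b)
```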

3.3 Directional Self-Attention

Directional self-attention (DiSA) is composed of a fully connected layer whose input is the token embeddings $x$, a “masked” multi-dimensional token2token self-attention block to explore the dependency and temporal order, and a fusion gate to combine the output and input of the attention block. Its structure is shown in Figure 2. It can be used as either a neural net or a module to compose a large network.

Figure 2: Directional self-attention (DiSA) mechanism. Here, we use to denote in Eq. (3.3).

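In DiSA, we first transform the input sequence $x = [x_1, x_2, \dots, x_n]$ to a sequence of hidden states $h = [h_1, h_2, \dots, h_n]$ by a fully connected layer, i.e.,

$$h = \sigma_h(W^{(h)} x + b^{(h)}), \tag{14}$$

where $x \in \mathbb{R}^{d_e \times n}$, $h \in \mathbb{R}^{d_h \times n}$, $W^{(h)} \in \mathbb{R}^{d_h \times d_e}$ and $b^{(h)} \in \mathbb{R}^{d_h}$ are the learnable parameters, and $\sigma_h(\cdot)$ is an activation function.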

We then apply multi-dimensional token2token self-attention to , and generate context-aware vector representations for all elements from the input sequence. We make two modifications to Eq.(10) to reduce the number of parameters and make the attention directional.

First, we set $W$ in Eq.(10) to a scalar $c$ and divide the part in $\sigma(\cdot)$ by $c$, and we use $\tanh(\cdot)$ for $\sigma(\cdot)$, which reduces the number of parameters. In experiments, we always set $c = 5$, and obtain stable output.

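Second, we apply a positional mask to Eq.(10), so the attention between two elements can be asymmetric. Given a mask $M \in \{0, -\infty\}^{n \times n}$, we set bias $b$ to a constant vector $M_{ij} \mathbf{1}$ in Eq.(10), where $\mathbf{1}$ is an all-one vector. Hence, Eq.(10) is modified to

$$f(h_i, h_j) = c \cdot \tanh\left(\left[W^{(1)} h_i + W^{(2)} h_j + b^{(1)}\right] / c\right) + M_{ij} \mathbf{1}. \tag{15}$$

To see why a mask can encode directional information, let us consider a case in which $M_{ij} = -\infty$ and $M_{ji} = 0$, which results in $[f(h_i, h_j)]_k = -\infty$ and unchanged $[f(h_j, h_i)]_k$. Since the probability $p(z_k = i|x, x_j)$ is computed by softmax, $[f(h_i, h_j)]_k = -\infty$ leads to $p(z_k = i|x, x_j) = 0$. This means that there is no attention of $x_j$ to $x_i$ on feature $k$. On the contrary, we have $p(z_k = j|x, x_i) > 0$, which means that attention of $x_i$ to $x_j$ exists on feature $k$. Therefore, prior structure knowledge such as temporal order and dependency parsing can be easily encoded by the mask, and explored in generating sentence encoding. This is an important feature of DiSA that previous attention mechanisms do not have.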

Figure 3: Three positional masks: (a) is the diag-disabled mask $M^{diag}$; (b) and (c) are forward mask $M^{fw}$ and backward mask $M^{bw}$, respectively.

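For self-attention, we usually need to disable the attention of each token to itself [Hu, Peng, and Qiu2017]. This is the same as applying a diagonal-disabled (i.e., diag-disabled) mask such that

$$M^{diag}_{ij} = \begin{cases} 0, & i \neq j \\ -\infty, & i = j \end{cases} \tag{16}$$

Moreover, we can use masks to encode temporal order information into attention output. In this paper, we use two masks, i.e., forward mask $M^{fw}$ and backward mask $M^{bw}$,

$$M^{fw}_{ij} = \begin{cases} 0, & i < j \\ -\infty, & \text{otherwise} \end{cases} \tag{17}$$

$$M^{bw}_{ij} = \begin{cases} 0, & i > j \\ -\infty, & \text{otherwise} \end{cases} \tag{18}$$

In forward mask $M^{fw}$, there is only attention of later token $j$ to early token $i$, and vice versa in the backward mask. We show these three positional masks in Figure 3.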
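The three masks are straightforward to build from index grids. The sketch below uses a hypothetical helper `positional_mask` (a name introduced here, not taken from the paper's code) and then shows, for one token, how a softmax over the masked scores removes attention to disallowed positions. Note that under the forward mask the first token has no allowed target at all (and likewise the last token under the backward mask), so a practical implementation has to decide how to handle an all-masked column.

```python
import numpy as np

def positional_mask(n, kind):
    """M[i, j] = 0 if token j may attend to token i, else -inf (Eq.(16)-(18))."""
    i, j = np.indices((n, n))
    allowed = {"diag": i != j, "fw": i < j, "bw": i > j}[kind]
    return np.where(allowed, 0.0, -np.inf)

M_fw = positional_mask(4, "fw")
scores = np.zeros((4, 4)) + M_fw      # equal raw scores plus the forward mask

j = 2                                 # look at what token j attends to
col = scores[:, j]
p = np.exp(col - col.max())           # softmax over i for this fixed j
p /= p.sum()                          # only the earlier tokens i = 0, 1 keep weight
```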

Given input sequence $x$ and a mask $M$, we compute $f(x_i, x_j)$ according to Eq.(15), and follow the standard procedure of multi-dimensional token2token self-attention to compute the probability matrix $P^j$ for each $j$. Each output $s_j$ in $s$ is computed as in Eq.(11).

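The final output $u \in \mathbb{R}^{d_h \times n}$ of DiSA is obtained by combining the output $s$ and the input $h$ of the masked multi-dimensional token2token self-attention block. This yields a temporal order encoded and context-aware vector representation for each element/token. The combination is accomplished by a dimension-wise fusion gate, i.e.,

$$F = \mathrm{sigmoid}\left(W^{(f1)} s + W^{(f2)} h + b^{(f)}\right) \tag{19}$$

$$u = F \odot h + (1 - F) \odot s \tag{20}$$

where $W^{(f1)}, W^{(f2)} \in \mathbb{R}^{d_h \times d_h}$ and $b^{(f)} \in \mathbb{R}^{d_h}$ are the learnable parameters of the fusion gate.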
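Putting the pieces of this section together, a DiSA block can be sketched in NumPy as below. This is a hedged sketch rather than the authors' implementation: tokens are stored as rows, the parameter dictionary and `init_params` helper are hypothetical, ELU is assumed for the fully connected layer, and a large finite constant `NEG` stands in for $-\infty$ so that a token with no allowed target (the first token under the forward mask) falls back to uniform weights instead of producing NaN.

```python
import numpy as np

NEG = -1e30  # finite stand-in for -inf (an assumption of this sketch)

def softmax(a, axis):
    a = a - np.max(a, axis=axis, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=axis, keepdims=True)

def elu(z):
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def disa(x, M, p, c=5.0):
    """x: (n, d_e) tokens as rows; M: (n, n) mask added to f(h_i, h_j); p: weights."""
    h = elu(x @ p["Wh"].T + p["bh"])                                 # fully connected layer
    a, b = h @ p["W1"].T, h @ p["W2"].T
    f = c * np.tanh((a[:, None, :] + b[None, :, :] + p["b1"]) / c)   # (n, n, d_h)
    P = softmax(f + M[:, :, None], axis=0)                           # masked, over i
    s = np.einsum("ijk,ik->jk", P, h)                                # attention output
    F = sigmoid(s @ p["Wf1"].T + h @ p["Wf2"].T + p["bf"])           # fusion gate
    return F * h + (1.0 - F) * s

def init_params(rng, d_e, d_h):
    """Hypothetical small random initialisation, for demonstration only."""
    p = {"Wh": rng.normal(size=(d_h, d_e)) * 0.1}
    for name in ("W1", "W2", "Wf1", "Wf2"):
        p[name] = rng.normal(size=(d_h, d_h)) * 0.1
    for name in ("bh", "b1", "bf"):
        p[name] = np.zeros(d_h)
    return p

rng = np.random.default_rng(0)
n, d_e, d_h = 5, 4, 6
x = rng.normal(size=(n, d_e))
i, j = np.indices((n, n))
M_fw = np.where(i < j, 0.0, NEG)          # forward mask
params = init_params(rng, d_e, d_h)
u = disa(x, M_fw, params)

# Changing the last token must not affect tokens that only look backwards.
x_changed = x.copy()
x_changed[-1] += 1.0
u_changed = disa(x_changed, M_fw, params)
```

In this run, rows 1 to 3 of `u` and `u_changed` coincide, which is the directional behaviour the forward mask is meant to enforce; row 0 differs only because of the uniform fallback described above.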

4 Directional Self-Attention Network

We propose a light-weight network, “Directional Self-Attention Network (DiSAN)”, for sentence encoding. Its architecture is shown in Figure 4.

Figure 4: Directional self-attention network (DiSAN)

Given an input sequence of token embedding $x$, DiSAN firstly applies two parameter-untied DiSA blocks with forward mask $M^{fw}$ of Eq.(17) and backward mask $M^{bw}$ of Eq.(18), respectively. The feed-forward procedure is given in Eq.(14)-(15) and Eq.(19)-(20). Their outputs are denoted by $u^{fw}, u^{bw} \in \mathbb{R}^{d_h \times n}$. We concatenate them vertically as $[u^{fw}; u^{bw}] \in \mathbb{R}^{2d_h \times n}$, and use this concatenated output as input to a multi-dimensional source2token self-attention block, whose output $s^{disan} \in \mathbb{R}^{2d_h}$ computed by Eq.(12)-(13) is the final sentence encoding result of DiSAN.

Remark: In DiSAN, forward/backward DiSA blocks work as context fusion layers, and the multi-dimensional source2token self-attention compresses the sequence into a single vector. The idea of using both forward and backward attentions is inspired by Bi-directional LSTM (Bi-LSTM) [Graves, Jaitly, and Mohamed2013], in which forward and backward LSTMs are used to encode long-range dependency from different directions. In Bi-LSTM, LSTM combines the context-aware output with the input by multi-gate. The fusion gate used in DiSA shares a similar motivation. However, DiSAN has fewer parameters, simpler structure and better efficiency.
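The whole encoder is then two DiSA blocks plus one source2token block. The sketch below is self-contained and makes the same assumptions as before: row-per-token layout, ELU activations, a finite `NEG` in place of $-\infty$, and hypothetical helper names and random weights. It is meant to show the data flow and output shape, not to reproduce the reported results.

```python
import numpy as np

NEG = -1e30  # finite stand-in for -inf (an assumption of this sketch)

def softmax(a, axis):
    a = a - np.max(a, axis=axis, keepdims=True)
    e = np.exp(a)
    return e / np.sum(e, axis=axis, keepdims=True)

def elu(z):
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def disa(x, M, p, c=5.0):
    """One directional self-attention block; x is (n, d_e), output is (n, d_h)."""
    h = elu(x @ p["Wh"].T + p["bh"])
    a, b = h @ p["W1"].T, h @ p["W2"].T
    f = c * np.tanh((a[:, None, :] + b[None, :, :] + p["b1"]) / c)
    s = np.einsum("ijk,ik->jk", softmax(f + M[:, :, None], axis=0), h)
    F = 1.0 / (1.0 + np.exp(-(s @ p["Wf1"].T + h @ p["Wf2"].T + p["bf"])))
    return F * h + (1.0 - F) * s

def source2token(u, p):
    """Compress (n, d) into (d,) with feature-wise attention."""
    scores = elu(u @ p["W1"].T + p["b1"]) @ p["W"] + p["b"]
    return np.sum(softmax(scores, axis=0) * u, axis=0)

def disan(x, p_fw, p_bw, p_s2t):
    n = x.shape[0]
    i, j = np.indices((n, n))
    u_fw = disa(x, np.where(i < j, 0.0, NEG), p_fw)    # forward block
    u_bw = disa(x, np.where(i > j, 0.0, NEG), p_bw)    # backward block, untied weights
    u = np.concatenate([u_fw, u_bw], axis=1)           # (n, 2 * d_h)
    return source2token(u, p_s2t)                      # (2 * d_h,)

def disa_params(rng, d_e, d_h):
    p = {"Wh": rng.normal(size=(d_h, d_e)) * 0.1}
    p.update({k: rng.normal(size=(d_h, d_h)) * 0.1 for k in ("W1", "W2", "Wf1", "Wf2")})
    p.update({k: np.zeros(d_h) for k in ("bh", "b1", "bf")})
    return p

def s2t_params(rng, d):
    return {"W": rng.normal(size=(d, d)) * 0.1, "W1": rng.normal(size=(d, d)) * 0.1,
            "b1": np.zeros(d), "b": np.zeros(d)}

rng = np.random.default_rng(0)
d_e, d_h = 4, 6                                        # toy sizes (assumptions)
p_fw, p_bw = disa_params(rng, d_e, d_h), disa_params(rng, d_e, d_h)
p_s2t = s2t_params(rng, 2 * d_h)

enc_short = disan(rng.normal(size=(3, d_e)), p_fw, p_bw, p_s2t)
enc_long = disan(rng.normal(size=(9, d_e)), p_fw, p_bw, p_s2t)
```

Sentences of different lengths map to vectors of the same size, which is what lets a fixed classification or regression module sit on top.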

5 Experiments

Model Name                                                  |θ|      T(s)/epoch   Train Accu(%)   Test Accu(%)
Unlexicalized features [Bowman et al.2015]                  /        /            49.4            50.4
+ Unigram and bigram features [Bowman et al.2015]           /        /            99.7            78.2
100D LSTM encoders [Bowman et al.2015]                      0.2m     /            84.8            77.6
300D LSTM encoders [Bowman et al.2016]                      3.0m     /            83.9            80.6
1024D GRU encoders [Vendrov et al.2016]                     15m      /            98.8            81.4
300D Tree-based CNN encoders [Mou et al.2016]               3.5m     /            83.3            82.1
300D SPINN-PI encoders [Bowman et al.2016]                  3.7m     /            89.2            83.2
600D Bi-LSTM encoders [Liu et al.2016]                      2.0m     /            86.4            83.3
300D NTI-SLSTM-LSTM encoders [Munkhdalai and Yu2017b]       4.0m     /            82.5            83.4
600D Bi-LSTM encoders+intra-attention [Liu et al.2016]      2.8m     /            84.5            84.2
300D NSE encoders [Munkhdalai and Yu2017a]                  3.0m     /            86.2            84.6
Word Embedding with additive attention                      0.45m    216          82.39           79.81
Word Embedding with s2t self-attention                      0.54m    261          86.22           83.12
Multi-head with s2t self-attention                          1.98m    345          89.58           84.17
Bi-LSTM with s2t self-attention                             2.88m    2080         90.39           84.98
DiSAN without directions                                    2.35m    592          90.18           84.66
Directional self-attention network (DiSAN)                  2.35m    587          91.08           85.62
Table 1: Experimental results for different methods on SNLI. |θ|: the number of parameters (excluding word embedding part). T(s)/epoch: average time (second) per epoch. Train Accu(%) and Test Accu(%): the accuracy on training and test set. "/" marks a value that is not reported.

In this section, we first apply DiSAN to natural language inference and sentiment analysis tasks. DiSAN achieves the state-of-the-art performance and significantly better efficiency than other baseline methods on benchmark datasets for both tasks. We also conduct experiments on other NLP tasks and DiSAN also achieves state-of-the-art performance.

Training Setup: We use cross-entropy loss plus L2 regularization penalty as optimization objective. We minimize it by Adadelta [Zeiler2012] (an optimizer of mini-batch SGD) with batch size of . We use Adadelta rather than Adam [Kingma and Ba2015] because in our experiments, DiSAN optimized by Adadelta achieves more stable performance than the Adam-optimized one. Initial learning rate is set to . All weight matrices are initialized by Glorot Initialization [Glorot and Bengio2010], and the biases are initialized with . We initialize the word embedding in $x$ by 300D GloVe 6B pre-trained vectors [Pennington, Socher, and Manning2014]. The Out-of-Vocabulary words in training set are randomly initialized by uniform distribution between . The word embeddings are fine-tuned during the training phase. We use Dropout [Srivastava et al.2014] with keep probability for language inference and for sentiment analysis. The L2 regularization decay factors are and for language inference and sentiment analysis, respectively. Note that the dropout keep probability and the L2 decay factor vary with the scale of the corresponding dataset. Hidden units number is set to . Activation functions are ELU (exponential linear unit) [Clevert, Unterthiner, and Hochreiter2016] if not specified. All models are implemented with TensorFlow (https://www.tensorflow.org) and run on a single Nvidia GTX 1080Ti graphics card.

5.1 Natural Language Inference

The goal of Natural Language Inference (NLI) is to reason about the semantic relationship between a premise sentence and a corresponding hypothesis sentence. The possible relationship could be entailment, neutral or contradiction. We compare different models on a widely used benchmark, the Stanford Natural Language Inference (SNLI) dataset (https://nlp.stanford.edu/projects/snli/) [Bowman et al.2015], which consists of 549,367/9,842/9,824 (train/dev/test) premise-hypothesis pairs with labels.

Following the standard procedure in [Bowman et al.2016], we launch two sentence encoding models (e.g., DiSAN) with tied parameters for the premise sentence and hypothesis sentence, respectively. Given the output encoding $s^p$ for the premise and $s^h$ for the hypothesis, the representation of relationship is the concatenation of $s^p$, $s^h$, $s^p - s^h$ and $s^p \odot s^h$, which is fed into a 300D fully connected layer and then a 3-unit output layer with softmax to compute a probability distribution over the three types of relationship.
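The classification head described in this paragraph can be sketched as follows, assuming the four concatenated features are the two encodings, their difference, and their element-wise product. The encoding width, the random weights, and the use of ELU in the fully connected layer are illustrative assumptions; the encodings themselves would come from the tied sentence encoder.

```python
import numpy as np

def softmax(a):
    a = a - np.max(a)
    e = np.exp(a)
    return e / e.sum()

def elu(z):
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def nli_head(s_p, s_h, W_fc, b_fc, W_out, b_out):
    """Map premise/hypothesis encodings to probabilities over three relations."""
    feats = np.concatenate([s_p, s_h, s_p - s_h, s_p * s_h])   # (4 * d,)
    hidden = elu(W_fc @ feats + b_fc)                          # fully connected layer
    return softmax(W_out @ hidden + b_out)                     # 3-way distribution

rng = np.random.default_rng(0)
d, d_fc = 8, 300                        # toy encoding width; 300D layer as in the text
s_p, s_h = rng.normal(size=d), rng.normal(size=d)
W_fc, b_fc = rng.normal(size=(d_fc, 4 * d)) * 0.05, np.zeros(d_fc)
W_out, b_out = rng.normal(size=(3, d_fc)) * 0.05, np.zeros(3)

probs = nli_head(s_p, s_h, W_fc, b_fc, W_out, b_out)
```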

For thorough comparison, besides the neural nets proposed in previous works of NLI, we implement five extra neural net baselines to compare with DiSAN. They help us to analyze the improvement contributed by each part of DiSAN and to verify that the two attention mechanisms proposed in Section 3 can improve other networks.

  • Word Embedding with additive attention.

  • Word Embedding with s2t self-attention: DiSAN with DiSA blocks removed.

  • Multi-head with s2t self-attention: Multi-head attention [Vaswani et al.2017] ( heads, each has hidden units) with source2token self-attention. The positional encoding method used in [Vaswani et al.2017] is applied to the input sequence to encode temporal information. Our experiments show that multi-head attention is sensitive to hyperparameters, so we adjust keep probability of dropout from to with step and report the best result.

  • Bi-LSTM with s2t self-attention: a multi-dimensional source2token self-attention block is applied to the output of Bi-LSTM (300D forward + 300D backward LSTMs).

  • DiSAN without directions: DiSAN with the forward/backward masks $M^{fw}$ and $M^{bw}$ replaced with two diag-disabled masks $M^{diag}$, i.e., DiSAN without forward/backward order information.

Compared to the results from the official leaderboard of SNLI in Table 1, DiSAN outperforms previous works and improves the best latest test accuracy (achieved by a memory-based NSE encoder network) by a remarkable margin of 1.02%. DiSAN surpasses the RNN/CNN based models with more complicated architecture and more parameters by large margins, e.g., +2.32% to Bi-LSTM, +1.42% to Bi-LSTM with additive attention. It even outperforms models with the assistance of a semantic parsing tree, e.g., +3.52% to Tree-based CNN, +2.42% to SPINN-PI.

In the results of the five baseline methods and DiSAN at the bottom of Table 1, we demonstrate that making attention multi-dimensional (feature-wise) or directional brings substantial improvement to different neural nets. First, a comparison between the first two models shows that changing token-wise attention to multi-dimensional/feature-wise attention leads to a 3.31% improvement on a word embedding based model. Also, a comparison between the third baseline and DiSAN shows that DiSAN can substantially outperform multi-head attention by 1.45%. Moreover, a comparison between the fourth baseline and DiSAN shows that the DiSA block can even outperform the Bi-LSTM layer in context encoding, improving test accuracy by 0.64%. A comparison between the fifth baseline and DiSAN shows that directional self-attention with forward and backward masks (with temporal order encoded) can bring a 0.96% improvement.

Additional advantages of DiSAN shown in Table 1 are its fewer parameters and compelling time efficiency. It is roughly 3.5 times faster per epoch than the widely used Bi-LSTM model (587 s vs. 2080 s in Table 1). Compared to other models with competitive performance, e.g., 600D Bi-LSTM encoders with intra-attention (2.8M), 300D NSE encoders (3.0M) and 600D Bi-LSTM encoders with multi-dimensional attention (2.88M), DiSAN only has 2.35M parameters.

5.2 Sentiment Analysis

Model Test Accu
MV-RNN [Socher et al.2013] 44.4
RNTN [Socher et al.2013] 45.7
Bi-LSTM [Li et al.2015] 49.8
Tree-LSTM [Tai, Socher, and Manning2015] 51.0
CNN-non-static [Kim2014] 48.0
CNN-Tensor [Lei, Barzilay, and Jaakkola2015] 51.2
NCSL [Teng, Vo, and Zhang2016] 51.1
LR-Bi-LSTM [Qian, Huang, and Zhu2017] 50.6
Word Embedding with additive attention 47.47
Word Embedding with s2t self-attention 48.87
Multi-head with s2t self-attention 49.14
Bi-LSTM with s2t self-attention 49.95
DiSAN without directions 49.41
DiSAN 51.72
Table 2: Test accuracy of fine-grained sentiment analysis on Stanford Sentiment Treebank (SST) dataset.

Sentiment analysis aims to analyze the sentiment of a sentence or a paragraph, e.g., a movie or a product review. We use Stanford Sentiment Treebank (SST) (https://nlp.stanford.edu/sentiment/) [Socher et al.2013] for the experiments, and only focus on the fine-grained movie review sentiment classification over five classes, i.e., very negative, negative, neutral, positive and very positive. We use the standard train/dev/test sets split with 8,544/1,101/2,210 samples. Similar to Section 5.1, we employ a single sentence encoding model to obtain a sentence representation of a movie review, then pass it into a 300D fully connected layer. Finally, a 5-unit output layer with softmax is used to calculate a probability distribution over the five classes.

In Table 2, we compare previous works with DiSAN on test accuracy. To the best of our knowledge, DiSAN improves the last best accuracy (given by CNN-Tensor) by 0.52%. Compared to tree-based models with heavy use of the prior structure, e.g., MV-RNN, RNTN and Tree-LSTM, DiSAN outperforms them by 7.32%, 6.02% and 0.72%, respectively. Additionally, DiSAN achieves better performance than CNN-based models. More recent works tend to focus on lexicon-based sentiment analysis, by exploring sentiment lexicons, negation words and intensity words. Nonetheless, DiSAN still outperforms these more elaborate models, such as NCSL (+0.62%) and LR-Bi-LSTM (+1.12%).

Figure 5: Fine-grained sentiment analysis accuracy vs. sentence length. The results of LSTM, Bi-LSTM and Tree-LSTM are from [Tai, Socher, and Manning2015] and the result of DiSAN is the average over five random trials.

It is also interesting to see the performance of different models on the sentences with different lengths. In Figure 5, we compare LSTM, Bi-LSTM, Tree-LSTM and DiSAN on different sentence lengths. In the range of , the length range for most movie review sentences, DiSAN significantly outperforms others. Meanwhile, DiSAN also shows impressive performance for slightly longer sentences or paragraphs in the range of . DiSAN performs poorly when the sentence length , in which however only of total movie review sentences lie.

5.3 Experiments on Other NLP Tasks

Multi-Genre Natural Language Inference

The Multi-Genre Natural Language Inference (MultiNLI) dataset (https://www.nyu.edu/projects/bowman/multinli/) [Williams, Nangia, and Bowman2017] consists of 433k sentence pairs annotated with textual entailment information. This dataset is similar to SNLI, but it covers more genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. However, MultiNLI is a quite new dataset, and its leaderboard does not include a section for sentence-encoding only models. Hence, we only compare DiSAN with the baselines provided at the official website. The results of DiSAN and two sentence-encoding models on the leaderboard are shown in Table 3. Note that the prediction accuracies of Matched and Mismatched test datasets are obtained by submitting our test results to the Kaggle open evaluation platforms, MultiNLI Matched Open Evaluation (https://inclass.kaggle.com/c/multinli-matched-open-evaluation) and MultiNLI Mismatched Open Evaluation (https://inclass.kaggle.com/c/multinli-mismatched-open-evaluation).

Method Matched Mismatched
cBoW 0.65200 0.64759
Bi-LSTM 0.67507 0.67248
DiSAN 0.70977 0.71402
Table 3: Experimental results of prediction accuracy for different methods on MultiNLI.

Semantic Relatedness

The task of semantic relatedness aims to predict a similarity degree of a given pair of sentences. We show an experimental comparison of different methods on the Sentences Involving Compositional Knowledge (SICK) dataset (http://clic.cimec.unitn.it/composes/sick.html) [Marelli et al.2014]. SICK is composed of 9,927 sentence pairs with 4,500/500/4,927 instances for train/dev/test. The regression module on the top of DiSAN is the one introduced by [Tai, Socher, and Manning2015]. The results in Table 4 show that DiSAN outperforms the models from previous works in terms of Pearson’s $r$ and Spearman’s $\rho$ indexes.
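For readers unfamiliar with the three metrics reported for this task, the sketch below computes them with NumPy only. The function names and the toy score arrays are made up for illustration and are not taken from SICK; Spearman's coefficient is computed as Pearson's coefficient on average ranks.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a_c, b_c = a - a.mean(), b - b.mean()
    return float(a_c @ b_c / np.sqrt((a_c @ a_c) * (b_c @ b_c)))

def average_ranks(a):
    """Ranks starting at 1, with tied values sharing their average rank."""
    a = np.asarray(a, float)
    order = np.argsort(a, kind="stable")
    ranks = np.empty(len(a))
    ranks[order] = np.arange(1, len(a) + 1)
    for value in np.unique(a):
        tied = a == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(a), average_ranks(b))

def mse(a, b):
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))

gold = np.array([1.0, 2.5, 3.0, 4.2, 5.0])   # toy relatedness scores (made up)
pred = gold ** 2                              # monotone but non-linear predictions

r, rho, err = pearson_r(gold, pred), spearman_rho(gold, pred), mse(gold, gold + 0.5)
```

Here `rho` is 1 because the predictions preserve the ranking, while `r` is below 1 because the relation is not linear; this is why both correlations are reported alongside MSE.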

Model Pearson’s Spearman’s MSE
Meaning Factory .8268 .7721 .3224
ECNU .8414 / /
DT-RNN .7923 (.0070) .7319 (.0071) .3822 (.0137)
SDT-RNN .7900 (.0042) .7304 (.0042) .3848 (.0042)
Cons. Tree-LSTM .8582 (.0038) .7966 (.0053) .2734 (.0108)
Dep. Tree-LSTM .8676 (.0030) .8083 (.0042) .2532 (.0052)
DiSAN .8695 (.0012) .8139 (.0012) .2879 (.0036)
Table 4: Experimental results for different methods on SICK sentence relatedness dataset. The reported accuracies are the mean of five runs (standard deviations in parentheses). Cons. and Dep. represent Constituency and Dependency, respectively. [Bjerva et al.2014], [Zhao, Zhu, and Lan2014], [Socher et al.2014], [Tai, Socher, and Manning2015].
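For reference, the three metrics used on SICK can be computed as in the following minimal sketch. It is not the evaluation script used for the paper; the function names and the toy scores are hypothetical, and Spearman's ρ is computed as Pearson's r on average ranks.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two 1-D score arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def average_ranks(v):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="mergesort")
    ranks = np.empty(len(v), dtype=float)
    ranks[order] = np.arange(1, len(v) + 1)
    for value in np.unique(v):
        tied = v == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson_r(average_ranks(x), average_ranks(y))

def mse(pred, gold):
    """Mean squared error between predicted and gold relatedness scores."""
    pred = np.asarray(pred, dtype=float)
    gold = np.asarray(gold, dtype=float)
    return float(np.mean((pred - gold) ** 2))

# Hypothetical gold and predicted relatedness scores (not taken from SICK).
gold = [1.0, 2.5, 3.0, 4.2, 5.0]
pred = [1.2, 2.0, 3.5, 4.0, 4.8]
print(pearson_r(pred, gold), spearman_rho(pred, gold), mse(pred, gold))
```

Because Pearson's r and Spearman's ρ measure correlation while MSE measures absolute error, a model can rank first on the former two without having the lowest MSE.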

Sentence Classifications

The goal of sentence classification is to correctly predict the class label of a given sentence in various scenarios. We evaluate the models on four sentence classification benchmarks of various NLP tasks, such as sentiment analysis and question-type classification. They are listed as follows. 1) CR: Customer review [Hu and Liu2004] of various products (cameras, etc.), which is to predict whether the review is positive or negative; 2) MPQA: Opinion polarity detection subtask of the MPQA dataset [Wiebe, Wilson, and Cardie2005]; 3) SUBJ: Subjectivity dataset [Pang and Lee2004] whose labels indicate whether each sentence is subjective or objective; 4) TREC: TREC question-type classification dataset [Li and Roth2002]. The experimental results of DiSAN and existing methods are shown in Table 5.

Model CR MPQA SUBJ TREC
cBoW 79.9 86.4 91.3 87.3
Skip-thought 81.3 87.5 93.6 92.2
DCNN / / / 93.0
AdaSent 83.6 (1.6) 90.4 (0.7) 92.2 (1.2) 91.1 (1.0)
SRU 84.8 (1.3) 89.7 (1.1) 93.4 (0.8) 93.9 (0.6)
Wide CNNs 82.2 (2.2) 88.8 (1.2) 92.9 (0.7) 93.2 (0.5)
DiSAN 84.8 (2.0) 90.1 (0.4) 94.2 (0.6) 94.2 (0.1)
Table 5: Experimental results for different methods on various sentence classification benchmarks. The reported accuracies on CR, MPQA and SUBJ are the mean of 10-fold cross validation, the accuracies on TREC are the mean of dev accuracies of five runs. All standard deviations are in parentheses. [Mikolov et al.2013a], [Kiros et al.2015], [Kalchbrenner, Grefenstette, and Blunsom2014], [Zhao, Lu, and Poupart2015], [Lei and Zhang2017].

5.4 Case Study

To gain a closer view of what dependencies in a sentence can be captured by DiSAN, we visualize the attention probability or alignment score by heatmaps. In particular, we focus primarily on the probability in forward/backward DiSA blocks (Figure 6), forward/backward fusion gates in Eq.(19) (Figure 7), and the probability in the multi-dimensional source2token self-attention block (Figure 8). For the first two, we want to show the dependency at the token level, but attention probability in DiSAN is defined on each feature, so we average the probabilities along the feature dimension.
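The averaging step used for the heatmaps can be sketched as follows. This is an illustration under stated assumptions rather than the plotting code behind the figures: the tensor layout `probs[i, j, k]` (weight that token j puts on token i for feature k), the sizes, and the random values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4  # hypothetical sizes: 5 tokens, 4 features

# Stand-in for the per-feature attention probabilities of one DiSA block.
# For every (j, k) the weights are normalized over the attended token i.
logits = rng.normal(size=(n, n, d))
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# Average along the feature dimension to get one value per token pair,
# which is what a token-level heatmap can display.
token_level = probs.mean(axis=2)

print(token_level.shape)        # one row/column per token
print(token_level.sum(axis=0))  # each column still sums to 1
```

Since every feature's distribution sums to one over the attended tokens, their average is again a valid distribution, so the token-level heatmap remains interpretable as attention probability.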

We select two sentences from the SNLI test set as examples for this case study. Sentence 1 is "Families have some dogs in front of a carousel" and sentence 2 is "volleyball match is in progress between ladies".

(a) Sentence 1, forward
(b) Sentence 1, backward
(c) Sentence 2, forward
(d) Sentence 2, backward
Figure 6: Attention probability in forward/backward DiSA blocks for the two example sentences.

Figure 6 shows that 1) semantically important words such as nouns and verbs usually get large attention, but stop words (am, is, are, etc.) do not; 2) globally important words, e.g., volleyball, match, ladies in sentence 2 and dogs, front, carousel in sentence 1, get large attention from all other words; 3) if a word is important to only some of the other words (e.g., to constitute a phrase or sense-group), it gets large attention only from these words, e.g., attention between progress and between in sentence 2, and attention between families and have in sentence 1.

This also shows that directional information can help to generate context-aware word representations with temporal order encoded. For instance, for the word match in sentence 2, its forward DiSA focuses more on the word volleyball, while its backward DiSA focuses more on progress and ladies, so the representation of the word match contains the essential information of the entire sentence, and simultaneously includes the positional order information.

In addition, forward and backward DiSAs can focus on different parts of a sentence. For example, the forward one in sentence 1 pays attention to the word families, whereas the backward one focuses on the word carousel. Since forward and backward attentions are computed separately, significant words in different directions are not normalized together, which would otherwise weaken their weights. Note that such joint normalization is a weakness of traditional attention compared to RNN, especially for long sentences.
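The effect of separate normalization can be illustrated with a toy example. The sketch below is an assumption-laden simplification: it uses a single scalar score per token rather than per-feature scores, the score values are hypothetical, and "forward" is taken to mean attending only to earlier tokens, as in the example of match attending to volleyball.

```python
import numpy as np

def masked_softmax(scores, allowed):
    """Softmax over positions where `allowed` is True; others get weight 0."""
    masked = np.where(allowed, scores, -np.inf)
    masked = masked - masked.max()  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum()

# Hypothetical alignment scores of 5 tokens with respect to the token at
# position 2: one important word before it and one important word after it.
scores = np.array([4.0, 0.0, 0.0, 0.0, 4.0])
query = 2
positions = np.arange(len(scores))

undirected = masked_softmax(scores, positions != query)  # one shared softmax
forward = masked_softmax(scores, positions < query)      # earlier tokens only
backward = masked_softmax(scores, positions > query)     # later tokens only

print(undirected.round(3))
print(forward.round(3))
print(backward.round(3))
```

With one shared softmax the two important words must split the probability mass, so each gets slightly less than one half. With directional masks each of them dominates the distribution of its own direction.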

(a) Sentence 1, forward
(b) Sentence 1, backward
(c) Sentence 2, forward
(d) Sentence 2, backward
Figure 7: Fusion Gate in forward/backward DiSA blocks.

In Figure 7, we show the gate value in Eq.(19). The gate combines the input and output of masked self-attention. It tends to select the input representation instead of the output if the corresponding gate value is large. The figure shows that the gate values for meaningless words, especially stop words, are small. The stop words themselves cannot contribute important information, so only their semantic relations to other words might help to understand the sentence. Hence, the gate tends to use their context features given by masked self-attention.
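A gate of this kind can be sketched as below. Eq.(19) is not reproduced in this section, so the exact form used here (a sigmoid over linear maps of the attention output `s` and the input `h`, followed by a convex combination) is an assumption, and the weight names `W_s`, `W_h`, `b` and all values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fusion_gate(h, s, W_s, W_h, b):
    """Feature-wise gate mixing the input h with the attention output s.

    A gate value near 1 keeps the input feature; a value near 0 keeps the
    context feature produced by masked self-attention.
    """
    gate = sigmoid(s @ W_s + h @ W_h + b)
    fused = gate * h + (1.0 - gate) * s
    return fused, gate

rng = np.random.default_rng(0)
n, d = 3, 4                      # hypothetical: 3 tokens, 4 features
h = rng.normal(size=(n, d))      # input token representations
s = rng.normal(size=(n, d))      # output of masked self-attention
W_s = 0.1 * rng.normal(size=(d, d))
W_h = 0.1 * rng.normal(size=(d, d))

# A large positive bias pushes the gate towards the input,
# a large negative bias towards the attention output.
fused_open, gate_open = fusion_gate(h, s, W_s, W_h, b=20.0)
fused_closed, gate_closed = fusion_gate(h, s, W_s, W_h, b=-20.0)
print(np.abs(fused_open - h).max(), np.abs(fused_closed - s).max())
```

Under this form, the small gate values observed for stop words correspond to the second case: their fused representation is dominated by the context features.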

(a) glass in pair 1
(b) close in pair 2
Figure 8: Two pairs of attention probability comparisons of the same word in different sentence contexts.

In Figure 8, we show the multi-dimensional source2token self-attention score vectors of the same word in two sentences, by their heatmaps. The first pair has two sentences: one is "The glass bottle is big", and the other is "A man is pouring a glass of tea". They share the word glass, used with different meanings. The second pair has two sentences: one is "The restaurant is about to close" and the other is "A biker is close to the fountain". It can be seen that the two attention vectors for the same word are very different due to its different meanings in different contexts. This indicates that the multi-dimensional attention vector is not redundant: it can encode more information than the single score used in traditional attention, and it is able to capture subtle differences of the same word in different contexts or sentences. Additionally, it can alleviate the weakness of attention over long sequences, because it normalizes separately for each feature rather than only once over the entire sequence as in traditional attention.
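The contrast between a per-feature score vector and a single score per token can be sketched as follows. This is a simplified illustration, not the model code: the scores are random stand-ins for what a scoring network would produce, and reducing them to one scalar per token by summation is only a hypothetical way to obtain a traditional-attention baseline.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 4, 3                       # hypothetical: 4 tokens, 3 features
x = rng.normal(size=(n, d))       # token representations
scores = rng.normal(size=(n, d))  # hypothetical per-feature alignment scores

# Multi-dimensional attention: one softmax over the tokens for every feature.
multi_dim = softmax(scores, axis=0)             # shape (n, d)
sentence_multi = (multi_dim * x).sum(axis=0)    # shape (d,)

# Traditional attention: one scalar per token, a single softmax in total.
scalar = softmax(scores.sum(axis=1), axis=0)    # shape (n,)
sentence_scalar = scalar @ x                    # shape (d,)

print(multi_dim.sum(axis=0))  # every feature column sums to 1
print(sentence_multi, sentence_scalar)
```

In the multi-dimensional case a token can receive a high weight for one feature and a low weight for another, which is what allows the attention vector of the same word to differ between contexts.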

6 Conclusion

In this paper, we propose two novel attention mechanisms, multi-dimensional attention and directional self-attention. The multi-dimensional attention performs a feature-wise selection over the input sequence for a specific task, and the directional self-attention uses positional masks to produce context-aware representations with temporal information encoded. Based on these attentions, Directional Self-Attention Network (DiSAN) is proposed for sentence encoding without any recurrent or convolutional structure. The experimental results show that DiSAN can achieve state-of-the-art inference quality and outperform existing works (LSTM, etc.) on a wide range of NLP tasks with fewer parameters and higher time efficiency.

In future work, we will explore approaches to using the proposed attention mechanisms on more sophisticated tasks, e.g., question answering and reading comprehension, to achieve better performance on various benchmarks.

7 Acknowledgments

This research was funded by the Australian Government through the Australian Research Council (ARC) under grant 1) LP160100630 partnership with Australia Government Department of Health, and 2) LP150100671 partnership with Australia Research Alliance for Children and Youth (ARACY) and Global Business College Australia (GBCA).

References


  • [Bahdanau, Cho, and Bengio2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
  • [Bjerva et al.2014] Bjerva, J.; Bos, J.; Van der Goot, R.; and Nissim, M. 2014. The meaning factory: Formal semantics for recognizing textual entailment and determining semantic similarity. In SemEval@ COLING, 642–646.
  • [Bowman et al.2015] Bowman, S. R.; Angeli, G.; Potts, C.; and Manning, C. D. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
  • [Bowman et al.2016] Bowman, S. R.; Gauthier, J.; Rastogi, A.; Gupta, R.; Manning, C. D.; and Potts, C. 2016. A fast unified model for parsing and sentence understanding. In ACL.
  • [Chung et al.2014] Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS.
  • [Clevert, Unterthiner, and Hochreiter2016] Clevert, D.-A.; Unterthiner, T.; and Hochreiter, S. 2016. Fast and accurate deep network learning by exponential linear units (elus). In ICLR.
  • [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.
  • [Graves, Jaitly, and Mohamed2013] Graves, A.; Jaitly, N.; and Mohamed, A.-r. 2013. Hybrid speech recognition with deep bidirectional lstm. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on, 273–278. IEEE.
  • [Hermann et al.2015] Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In NIPS.
  • [Hochreiter and Schmidhuber1997] Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
  • [Hu and Liu2004] Hu, M., and Liu, B. 2004. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 168–177. ACM.
  • [Hu, Peng, and Qiu2017] Hu, M.; Peng, Y.; and Qiu, X. 2017. Reinforced mnemonic reader for machine comprehension. arXiv preprint arXiv:1705.02798.
  • [Kalchbrenner, Grefenstette, and Blunsom2014] Kalchbrenner, N.; Grefenstette, E.; and Blunsom, P. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
  • [Kim et al.2016] Kim, Y.; Jernite, Y.; Sontag, D.; and Rush, A. M. 2016. Character-aware neural language models. In AAAI.
  • [Kim2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In EMNLP.
  • [Kingma and Ba2015] Kingma, D., and Ba, J. 2015. Adam: A method for stochastic optimization. In ICLR.
  • [Kiros et al.2015] Kiros, R.; Zhu, Y.; Salakhutdinov, R. R.; Zemel, R.; Urtasun, R.; Torralba, A.; and Fidler, S. 2015. Skip-thought vectors. In NIPS.
  • [Kokkinos and Potamianos2017] Kokkinos, F., and Potamianos, A. 2017. Structural attention neural networks for improved sentiment analysis. arXiv preprint arXiv:1701.01811.
  • [Lei and Zhang2017] Lei, T., and Zhang, Y. 2017. Training rnns as fast as cnns. arXiv preprint arXiv:1709.02755.
  • [Lei, Barzilay, and Jaakkola2015] Lei, T.; Barzilay, R.; and Jaakkola, T. 2015. Molding cnns for text: non-linear, non-consecutive convolutions. In EMNLP.
  • [Li and Roth2002] Li, X., and Roth, D. 2002. Learning question classifiers. In ACL.
  • [Li et al.2015] Li, J.; Luong, M.-T.; Jurafsky, D.; and Hovy, E. 2015. When are tree structures necessary for deep learning of representations? arXiv preprint arXiv:1503.00185.
  • [Liu et al.2016] Liu, Y.; Sun, C.; Lin, L.; and Wang, X. 2016. Learning natural language inference using bidirectional lstm model and inner-attention. arXiv preprint arXiv:1605.09090.
  • [Luong, Pham, and Manning2015] Luong, M.-T.; Pham, H.; and Manning, C. D. 2015. Effective approaches to attention-based neural machine translation. In EMNLP.
  • [Marelli et al.2014] Marelli, M.; Menini, S.; Baroni, M.; Bentivogli, L.; Bernardi, R.; and Zamparelli, R. 2014. A sick cure for the evaluation of compositional distributional semantic models. In LREC.
  • [Mikolov et al.2013a] Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013a. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
  • [Mikolov et al.2013b] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013b. Distributed representations of words and phrases and their compositionality. In NIPS.
  • [Mou et al.2016] Mou, L.; Men, R.; Li, G.; Xu, Y.; Zhang, L.; Yan, R.; and Jin, Z. 2016. Natural language inference by tree-based convolution and heuristic matching. In ACL.
  • [Munkhdalai and Yu2017a] Munkhdalai, T., and Yu, H. 2017a. Neural semantic encoders. In EACL.
  • [Munkhdalai and Yu2017b] Munkhdalai, T., and Yu, H. 2017b. Neural tree indexers for text understanding. In EACL.
  • [Pang and Lee2004] Pang, B., and Lee, L. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In ACL.
  • [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In EMNLP.
  • [Qian, Huang, and Zhu2017] Qian, Q.; Huang, M.; and Zhu, X. 2017. Linguistically regularized lstms for sentiment classification. In ACL.
  • [Rush, Chopra, and Weston2015] Rush, A. M.; Chopra, S.; and Weston, J. 2015. A neural attention model for abstractive sentence summarization. In EMNLP.
  • [Seo et al.2017] Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional attention flow for machine comprehension. In ICLR.
  • [Shang, Lu, and Li2015] Shang, L.; Lu, Z.; and Li, H. 2015. Neural responding machine for short-text conversation. In ACL.
  • [Socher et al.2013] Socher, R.; Perelygin, A.; Wu, J. Y.; Chuang, J.; Manning, C. D.; Ng, A. Y.; Potts, C.; et al. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP.
  • [Socher et al.2014] Socher, R.; Karpathy, A.; Le, Q. V.; Manning, C. D.; and Ng, A. Y. 2014. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics 2:207–218.
  • [Srivastava et al.2014] Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research.
  • [Sukhbaatar et al.2015] Sukhbaatar, S.; Weston, J.; Fergus, R.; et al. 2015. End-to-end memory networks. In NIPS.
  • [Tai, Socher, and Manning2015] Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved semantic representations from tree-structured long short-term memory networks. In ACL.
  • [Teng, Vo, and Zhang2016] Teng, Z.; Vo, D.-T.; and Zhang, Y. 2016. Context-sensitive lexicon features for neural sentiment analysis. In EMNLP.
  • [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; and Polosukhin, I. 2017. Attention is all you need. In NIPS.
  • [Vendrov et al.2016] Vendrov, I.; Kiros, R.; Fidler, S.; and Urtasun, R. 2016. Order-embeddings of images and language. In ICLR.
  • [Wiebe, Wilson, and Cardie2005] Wiebe, J.; Wilson, T.; and Cardie, C. 2005. Annotating expressions of opinions and emotions in language. Language resources and evaluation 39(2):165–210.
  • [Williams, Nangia, and Bowman2017] Williams, A.; Nangia, N.; and Bowman, S. R. 2017. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426.
  • [Zeiler2012] Zeiler, M. D. 2012. Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.
  • [Zhao, Lu, and Poupart2015] Zhao, H.; Lu, Z.; and Poupart, P. 2015. Self-adaptive hierarchical sentence model. In IJCAI.
  • [Zhao, Zhu, and Lan2014] Zhao, J.; Zhu, T.; and Lan, M. 2014. Ecnu: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. In SemEval@ COLING, 271–277.