1 Introduction
Learning effective language representations is important for a variety of text analysis tasks, including sentiment analysis, news classification, natural language inference and question answering. Language representations can be learned via supervised or unsupervised approaches. Supervised learning using neural networks commonly entails learning intermediate sentence representations, usually with an attention mechanism, followed by a task-specific layer. For text classification tasks, this is usually a fully connected layer followed by an N-way softmax layer, where N is the number of classes.
Learning unsupervised language representations has made substantial progress in recent years with the introduction of new techniques for language modeling combined with deep models like ELMo [peters2018deep], ULMFit [howard2018universal], BERT [devlin2018bert] and GPT2 [radford2019language]. These methods have enabled transfer of learned representations via pretraining to downstream tasks. Although these models work well on a variety of tasks, they have two major limitations: 1) they are computationally expensive to train and 2) they usually have a large number of parameters, which greatly increases model size and memory requirements. For instance, the multilingual BERT-base Cased model has 110M parameters, the small GPT2 model has 117M parameters [radford2019language] and the RoBERTa model was trained on 160GB of data [liu2019roberta]. Task-specific training (ELMo) or finetuning (BERT, GPT2) can therefore be limiting when training data and computational resources are scarce. Further, running inference on and storing such models can also be difficult in low-resource scenarios such as IoT devices or low-latency use cases. Hence, task-specific architectures that are trained from scratch with supervision are useful, especially where domain-specific training data is available. They are lightweight and easy to deploy. In this work we focus on learning compact attention-based supervised language representations with text classification as the downstream task.
As outlined above, modern neural networks have a large number of parameters, and computation at the attention layers in these networks can become prohibitive. In particular, multihead attention mechanisms [lin2017structured, guo2017end, vaswani2017attention] (multiple attention distributions over a given sentence), which form an integral part of many state-of-the-art architectures for NLP tasks [vaswani2017attention, devlin2018bert], can be expensive to compute. We argue that the attention layer giving rise to multiple attentions in these methods is overparameterized.
In this work, we propose a novel low-rank factorization based multihead attention mechanism (LAMA), which is computationally cheaper than prior approaches and sometimes exceeds the performance of state-of-the-art baselines. Contrary to previous approaches [lin2017structured, guo2017end] that are based on the additive attention mechanism [neuralAlign], LAMA is based on multiplicative attention [luong2015effective], which replaces additive attention with dot-product attention for faster computation. We further introduce a bilinear projection while computing the dot product to capture similarities between a global context vector and each word in the sentence. The function of the bilinear projection is to capture nuanced context-dependent word importance, as corroborated by previous works [chen2017reading]. Next, we use a low-rank formulation of the bilinear projection matrix based on the Hadamard product [kim2016hadamard, yu2015combining] to compress the attention layer and speed up the computation of multiple attention distributions for each word. We leverage this approach to decompose a single bilinear matrix to produce multiple attentions between the global context vector and each word, as opposed to having a different learned vector [lin2017structured] or matrix [yu2017multi] for each attention. We evaluate our model by performing experiments on multiple datasets spanning two different tasks, namely sentiment analysis and text classification. We find that our model matches or in some cases outperforms state-of-the-art approaches with fewer parameters for efficient computation.
2 Related Work
Spurred by their success in neural machine translation [neuralAlign, luong2015effective], attention mechanisms are now ubiquitously used in problems such as question answering [teachQA, seo2016bidirectional, dhingra2016gated][paulus2017deep, chen2017reading] and training large language models [devlin2018bert, radford2019language]. In sequence modeling, attention mechanisms allow the decoder to learn which part of the sequence it should “attend” to based on the input sequence and the output it has generated so far [neuralAlign]. A special case of attention known as self-attention [lin2017structured] or intra-attention [han] is used for text classification tasks such as sentiment analysis and natural language inference.
Models have been proposed that compute multiple attention distributions over a single sequence of words. Multi-view networks [guo2017end] use a different set of parameters for each view, which leads to an increase in the number of parameters. Lin et al. [lin2017structured] use the additive attention mechanism and modify it to produce multiple attentions to obtain a matrix sentence embedding. The recently proposed multihead attention (also known as scaled dot-product attention) has been shown to be very effective in machine translation [vaswani2017attention] and pretraining [devlin2018bert]. In this work we propose a more parameter-efficient way to compute multiple attentions. The score between the context vector and the word hidden representation is computed using a bilinear projection matrix, which is then factorized into two low-rank matrices, following an approach inspired by multimodal low-rank bilinear pooling [kim2016hadamard], to compute multiple attention distributions over words. Contrary to Guo et al. [guo2017end], we use matrix factorization to alleviate the problem of increasing parameters with increasing views. Our approach also uses fewer parameters than [lin2017structured] to compute multiple attentions and outperforms their approach. Low-rank factorization has been a popular approach to reduce the size of hidden layers [chen2018adaptive, tai2015convolutional]. In this work we use the Hadamard product formulation of the product of two low-rank matrices to compact the attention layer.
3 Proposed Model
A document (a review or a news article) is first tokenized and each token is converted to a word embedding via a lookup into a pretrained embedding matrix. The embedding of each token is encoded via a biGRU sentence encoder to get a contextual annotation of each word in that sentence. The LAMA attention mechanism then obtains multiple attention distributions over those words by computing an alignment score of their hidden representation with a word-level context vector. The sum of the word representations weighted by the scores from the multiple attention distributions then forms a matrix sentence embedding. The matrix embedding is then flattened and passed on to downstream layers (either a classifier or another encoder, depending on the task). Since we model all tokens in the text together without using any hierarchical structure, the terms sentence and document are used interchangeably in the rest of the paper without loss of generality. Capital bold letters indicate matrices, small bold letters indicate vectors and small letters indicate scalars.
3.1 Sequence Encoder
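We use the GRU [neuralAlign] RNN as the sequence encoder. The GRU uses a gating mechanism to track the state of sequences. There are two types of gates: the reset gate $\mathbf{r}_t$ and the update gate $\mathbf{z}_t$. The update gate decides how much past information is kept and how much new information is added. At time $t$, the GRU computes its new state as:
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t \tag{1}$$
and the update gate is updated as:
$$\mathbf{z}_t = \sigma(\mathbf{W}_z \mathbf{x}_t + \mathbf{U}_z \mathbf{h}_{t-1} + \mathbf{b}_z) \tag{2}$$
The RNN candidate state $\tilde{\mathbf{h}}_t$ is computed as:
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h \mathbf{x}_t + \mathbf{r}_t \odot (\mathbf{U}_h \mathbf{h}_{t-1}) + \mathbf{b}_h) \tag{3}$$
Here $\mathbf{r}_t$ is the reset gate, which controls how much the past state contributes to the candidate state. If $\mathbf{r}_t$ is zero, then it forgets the previous state. The reset gate is updated as follows:
$$\mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{x}_t + \mathbf{U}_r \mathbf{h}_{t-1} + \mathbf{b}_r) \tag{4}$$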
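As an illustration of the update described above, the following is a minimal pure-Python sketch of a single GRU step. It assumes the standard GRU gate equations; the helper names and the layout of the `params` tuple are hypothetical and chosen only for this example.

```python
import math

def sigmoid(v):
    # Element-wise logistic function.
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def matvec(M, v):
    # Matrix-vector product; M is a list of rows.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(*vecs):
    # Element-wise sum of any number of equal-length vectors.
    return [sum(vals) for vals in zip(*vecs)]

def gru_step(x, h_prev, params):
    """One GRU update (assumed standard form): gates, candidate, new state."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(add(matvec(Wz, x), matvec(Uz, h_prev), bz))  # update gate
    r = sigmoid(add(matvec(Wr, x), matvec(Ur, h_prev), br))  # reset gate
    gated = [ri * ui for ri, ui in zip(r, matvec(Uh, h_prev))]
    h_tilde = [math.tanh(a) for a in add(matvec(Wh, x), gated, bh)]  # candidate state
    # Interpolate between the previous state and the candidate state.
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]

# Toy usage: with all-zero parameters both gates equal 0.5 and the candidate
# state is zero, so the new state is half of the previous state.
zeros = [[0.0, 0.0], [0.0, 0.0]]
toy_params = (zeros, zeros, [0.0, 0.0]) * 3
h_next = gru_step([1.0, 1.0], [1.0, -2.0], toy_params)
```

With zero parameters the update gate is exactly 0.5, which makes the interpolation between the old state and the candidate state easy to check by hand.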
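Consider a document $D_i$ containing $T$ words, $D_i = \{w_1, \ldots, w_T\}$. Let each word be denoted by $w_t$, where every word is converted to a real-valued word vector $\mathbf{x}_t$ using the pretrained embedding matrix $\mathbf{W}_e \in \mathbb{R}^{d \times |V|}$, where $d$ is the embedding dimension and $V$ is the vocabulary. The embedding matrix is finetuned during training. Note that we have dropped the subscript $i$, as all the derivations are for the $i$-th document, and it is assumed implicit in the following sections.
We encode the document using a bidirectional GRU (biGRU) that summarizes information in both directions along the text to get a contextual annotation of a word. In a biGRU the hidden state at time step $t$ is represented as a concatenation of the hidden states in the forward and backward directions. The forward GRU, denoted by $\overrightarrow{\mathrm{GRU}}$, processes the sentence from $w_1$ to $w_T$, whereas the backward GRU, denoted by $\overleftarrow{\mathrm{GRU}}$, processes it from $w_T$ to $w_1$.
$$\mathbf{x}_t = \mathbf{W}_e w_t, \quad t \in [1, T] \tag{5}$$
$$\overrightarrow{\mathbf{h}}_t = \overrightarrow{\mathrm{GRU}}(\mathbf{x}_t) \tag{6a}$$
$$\overleftarrow{\mathbf{h}}_t = \overleftarrow{\mathrm{GRU}}(\mathbf{x}_t) \tag{6b}$$
Here the word annotation $\mathbf{h}_t = [\overrightarrow{\mathbf{h}}_t, \overleftarrow{\mathbf{h}}_t]$ is obtained by concatenating the forward hidden state $\overrightarrow{\mathbf{h}}_t$ and the backward hidden state $\overleftarrow{\mathbf{h}}_t$.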
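The bidirectional encoding can be sketched independently of the recurrent cell. The snippet below is a minimal illustration, not the authors' implementation: `step_fwd` and `step_bwd` are hypothetical stand-ins for the forward and backward GRU updates, and a toy running-sum cell is used so that the result is easy to verify.

```python
def bidirectional_annotations(xs, step_fwd, step_bwd, h0):
    """Run one recurrent cell left-to-right and another right-to-left,
    then concatenate the two hidden states at every position."""
    fwd, h = [], h0
    for x in xs:                      # first word to last word
        h = step_fwd(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):            # last word to first word
        h = step_bwd(x, h)
        bwd.append(h)
    bwd.reverse()                     # re-align backward states with word order
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation per word

# Toy cell (an assumption for illustration): the state is a running sum.
def running_sum(x, h):
    return [h[0] + x[0]]

annotations = bidirectional_annotations(
    [[1.0], [2.0], [3.0]], running_sum, running_sum, [0.0]
)
```

Each annotation has twice the dimension of a single-direction hidden state, since it holds one forward and one backward state.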
3.2 SingleHead Attention
To alleviate the burden of remembering long-term dependencies from GRUs, we use the global attention mechanism [luong2015effective], in which the sentence representation is computed by attending to all words in the sentence. Let $\mathbf{h}_t$ be the annotation corresponding to the word $w_t$. First we transform $\mathbf{h}_t$ using a one-layer Multi-Layer Perceptron (MLP) to obtain its hidden representation $\mathbf{e}_t$. We assume Gaussian priors with mean and standard deviation on $\mathbf{W}_w$ and $\mathbf{b}_w$.
$$\mathbf{e}_t = \tanh(\mathbf{W}_w \mathbf{h}_t + \mathbf{b}_w) \tag{7}$$
Next, to compute the importance of the word $w_t$ in the current context, we calculate its relevance to a global context vector $\mathbf{c}$.
$$f_t = \mathbf{c}^{\top} \mathbf{W} \mathbf{e}_t \tag{8}$$
Here, $\mathbf{W} \in \mathbb{R}^{2u \times 2u}$ is a bilinear projection matrix which is randomly initialized and jointly learned with the other parameters during training. $u$ is the dimension of the GRU hidden state, and $\mathbf{e}_t$ and $\mathbf{c}$ are both of dimension $2u$ since we are using a biGRU. The mean of the word embeddings provides a good initial approximation of the global context of the sentence. We initialize $\mathbf{c}$ with this mean; it is then updated during training. We use a bilinear model because such models are more effective in learning pairwise interactions. The attention weight $\alpha_t$ for the word $w_t$ is then computed using a softmax function, where the summation is taken over all the words in the document.
$$\alpha_t = \frac{\exp(f_t)}{\sum_{t'=1}^{T} \exp(f_{t'})} \tag{9}$$
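The single-head computation described above (a bilinear score between the context vector and each word representation, followed by a softmax over words) can be sketched as follows. This is a minimal pure-Python illustration; the function names and the toy values are assumptions made for the example.

```python
import math

def bilinear_score(c, W, e):
    # Bilinear form c^T W e between the context vector and one word vector.
    We = [sum(w * x for w, x in zip(row, e)) for row in W]
    return sum(ci * v for ci, v in zip(c, We))

def softmax(scores):
    # Numerically stable softmax over a list of scalars.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def single_head_attention(c, W, E):
    # One attention weight per word; E holds one word representation per row.
    return softmax([bilinear_score(c, W, e) for e in E])

# Toy usage: with an identity projection the score reduces to a dot product,
# so the first word (aligned with c) receives the larger weight.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = single_head_attention([1.0, 0.0], identity, [[2.0, 0.0], [0.0, 1.0]])
```

The weights form a single distribution over the words of the sentence, which is why this variant can only emphasise one aspect at a time.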
Notation  Meaning
$N$  Corpus size
$T$  # of word tokens in a sample
$m$  # of aspects
$f$  alignment score
$\alpha$  attention weight
$\mathbf{e}$  word hidden representation
$\mathbf{c}$  global context vector
$u$  GRU hidden state dimension
3.3 LAMA
The attention distribution above usually focuses on a specific component of the document, like a special set of trigger words. So it is expected to reflect an aspect, or component of the semantics in a document. This type of attention is useful for smaller pieces of texts such as tweets or short reviews. For larger reviews there can be multiple aspects that describe that review. For this we introduce a novel way of computing multiple heads of attention that capture different aspects.
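Suppose $m$ aspects are to be extracted from a sentence; we then need $m$ alignment scores between each word hidden representation $\mathbf{e}_t$ and the context vector $\mathbf{c}$. To obtain an $m$-dimensional output $\mathbf{f}_t$, we need to learn $m$ weight matrices, given by $\mathbf{W} = [\mathbf{W}_1, \ldots, \mathbf{W}_m]$, as demonstrated in previous works. Although this strategy might be effective in capturing pairwise interactions for each aspect, it also introduces a huge number of parameters that may lead to overfitting and incur a high computational cost, especially for a large $m$ or a large $u$. To address this, the rank of the matrix $\mathbf{W}$ can be reduced by using a low-rank bilinear method to have fewer parameters [kim2016hadamard, yu2017multi]. Consider one aspect; the bilinear projection matrix in Eq. 8 is factorized into two low-rank matrices $\mathbf{P}$ and $\mathbf{Q}$.
$$f_t = \mathbf{c}^{\top} \mathbf{W} \mathbf{e}_t = \mathbf{c}^{\top} \mathbf{P} \mathbf{Q}^{\top} \mathbf{e}_t = \mathbb{1}^{\top} (\mathbf{P}^{\top} \mathbf{c} \circ \mathbf{Q}^{\top} \mathbf{e}_t) \tag{10}$$
where $\mathbf{P} \in \mathbb{R}^{2u \times k}$ and $\mathbf{Q} \in \mathbb{R}^{2u \times k}$ are two low-rank matrices, $\circ$ is the Hadamard product or the element-wise multiplication of two vectors, $\mathbb{1} \in \mathbb{R}^{k}$ is an all-one vector and $k$ is the latent dimensionality of the factorized matrices.
To obtain $m$ scores by Eq. 10, the weights to be learned are two third-order tensors $\mathcal{P} \in \mathbb{R}^{2u \times k \times m}$ and $\mathcal{Q} \in \mathbb{R}^{2u \times k \times m}$ accordingly. Without loss of generality, $\mathcal{P}$ and $\mathcal{Q}$ can be reformulated as 2D matrices $\mathbf{P} \in \mathbb{R}^{2u \times km}$ and $\mathbf{Q} \in \mathbb{R}^{2u \times km}$ respectively with simple reshape operations. Setting $k = 1$, which corresponds to rank-1 factorization, Eq. 10 can be written as:
$$\mathbf{f}_t = \mathbf{P}^{\top} \mathbf{c} \circ \mathbf{Q}^{\top} \mathbf{e}_t \tag{11}$$
This brings the two feature vectors, $\mathbf{e}_t$, the word hidden representation, and $\mathbf{c}$, the global context vector, into a common subspace, where they are given by $\mathbf{Q}^{\top} \mathbf{e}_t$ and $\mathbf{P}^{\top} \mathbf{c}$ respectively. $\mathbf{f}_t$ is now a multihead alignment vector for the word $w_t$. For computing attention for one head, this is equivalent to replacing the projection matrix $\mathbf{W}$ in Eq. 8 by the outer product of vectors $\mathbf{p}$ and $\mathbf{q}$, rows of the matrices $\mathbf{P}^{\top}$ and $\mathbf{Q}^{\top}$ respectively, and rewriting it as the Hadamard product. As a result, each row of the matrices $\mathbf{P}^{\top}$ and $\mathbf{Q}^{\top}$ represents the vectors for computing the score for a different head.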
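The factorization above rests on the identity that a bilinear form with a low-rank matrix equals the sum of a Hadamard product of two projected vectors. The following pure-Python sketch checks that identity on a small example and shows how dropping the final sum yields one score per head in the rank-1 setting. Function names and toy values are assumptions made for illustration.

```python
def project(M, v):
    # Computes M^T v, where M is a list of rows with shape (dim, k).
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]

def full_bilinear_score(c, P, Q, e):
    # Builds W = P Q^T explicitly and evaluates c^T W e.
    W = [[sum(pk * qk for pk, qk in zip(p_row, q_row)) for q_row in Q] for p_row in P]
    We = [sum(w * x for w, x in zip(row, e)) for row in W]
    return sum(ci * v for ci, v in zip(c, We))

def factorized_score(c, P, Q, e):
    # Sum of the Hadamard product of the two projections: 1^T (P^T c * Q^T e).
    return sum(a * b for a, b in zip(project(P, c), project(Q, e)))

def multihead_scores(c, P, Q, e):
    # Rank-1 setting: keep the Hadamard product, one entry per head (column).
    return [a * b for a, b in zip(project(P, c), project(Q, e))]

# Toy values, chosen only so the arithmetic is easy to follow.
P = [[1, 2], [3, 4]]
Q = [[1, 0], [0, 1]]
c = [1, 1]
e = [2, 3]
score_full = full_bilinear_score(c, P, Q, e)
score_fact = factorized_score(c, P, Q, e)
head_scores = multihead_scores(c, P, Q, e)
```

The factorized form never materialises the full projection matrix, which is where the parameter and computation savings come from.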
The multihead attention vector $\boldsymbol{\alpha}_t$ is obtained by computing a softmax function along the sentence length:
$$\alpha_t^{(j)} = \frac{\exp(f_t^{(j)})}{\sum_{t'=1}^{T} \exp(f_{t'}^{(j)})}, \quad j = 1, \ldots, m \tag{12}$$
Before computing the softmax, similar to [kim2016hadamard, yu2017multi], we apply the tanh nonlinearity to $\mathbf{f}_t$ to further increase the model capacity. Since element-wise multiplication is introduced, the values of neurons may vary a lot, so we apply an $\ell_2$ normalization layer across the $m$ dimension. Although $\ell_2$ normalization is not strictly necessary since both $\mathbf{c}$ and $\mathbf{e}_t$ are in the same modality, empirically we do see improvement after applying it. Each component of $\boldsymbol{\alpha}_t$ is the contribution of the word $w_t$ to the corresponding aspect.
Let be a matrix of all word annotations in the sentence; . The attention matrix for the sentence can be computed as:
(13) 
where, is c repeated times, once for each word, and softmax is applied rowwise. is the attention matrix between the sentence and the global context with each row representing attention for one aspect.
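A minimal sketch of how such an attention matrix might be computed is given below. It assumes rank-1 factors with one column per head, a tanh nonlinearity, and an L2 normalization across heads before the softmax over words; the choice of L2 normalization and all names here are assumptions for illustration rather than a statement of the exact layer used.

```python
import math

def project(M, v):
    # Computes M^T v, where M is a list of rows with shape (dim, heads).
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(len(M[0]))]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

def multihead_attention_matrix(c, P, Q, E):
    """Returns a matrix with one row per head and one column per word."""
    pc = project(P, c)                       # context projection, shared by all words
    per_word = []
    for e in E:
        f = [math.tanh(a * b) for a, b in zip(pc, project(Q, e))]
        norm = math.sqrt(sum(v * v for v in f)) or 1.0   # guard against a zero vector
        per_word.append([v / norm for v in f])           # normalize across heads
    heads = len(pc)
    # Softmax along the sentence for each head separately.
    return [softmax([scores[j] for scores in per_word]) for j in range(heads)]

# Toy usage with 2-dimensional word vectors, 3 heads and 4 words.
c = [1.0, 0.5]
P = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
Q = [[0.5, 1.0, 0.0], [1.0, 0.0, 0.5]]
E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
A = multihead_attention_matrix(c, P, Q, E)
```

Every row of the result is a separate distribution over the words, so different heads are free to emphasise different parts of the sentence.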
Given , the multihead attention matrix for the sentence; . The sentence representation for an aspect given by can be computed by taking a weighted sum of all word annotations.
(14) 
Similarly, sentence representation can be computed for all heads and is given in a compact form by:
(15) 
Here is a matrix sentence embedding and contains as many rows as the number of heads. Each row contains an attention distribution for a new aspect. It is flattened by concatenating all rows to obtain the document representation d. From the document representation, the class probabilities are obtained as follows.
(16) 
Loss is computed using cross entropy.
(17) 
where is the number of classes and is the probability of the class . The final training loss is given by:
(18) 
The summation is taken over all the documents in a minibatch. We use minibatch stochastic gradient descent [kiefer1952stochastic] with momentum and weight decay to optimize the loss function, and the backpropagation algorithm is used to compute the gradients. Fig. 1 illustrates a schematic of the model architecture. A single document and its flow through the various model components is shown. The middle block illustrates the proposed attention mechanism for one word of the document.
3.3.1 Hyperparameters
We use a word embedding size of 100. The embedding matrix is pretrained on the corpus using word2vec. All words appearing less than 5 times are discarded. The GRU hidden state is set to , MLP hidden state to and apply a dropout to the hidden layer. We use a batch size of for training and an initial learning rate of . For early stopping we use .
3.3.2 Computational Efficiency for Attention
Other variables such as input sentence length and dimension of the hidden state representations held constant, the computational complexity depends on the attention layer. Here, we show how the proposed model LAMA compares to the self attentive network (SAN) [lin2017structured] with respect to the number of parameters needed to compute an attention matrix for a sentence. In SAN the attention matrix A can be computed as:
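$$\mathbf{A} = \mathrm{softmax}\left(\mathbf{W}_{s2} \tanh\left(\mathbf{W}_{s1} \mathbf{H}^{\top}\right)\right) \tag{19}$$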
where, H is a matrix of word annotations with the shape by, is by and is by where is number of aspects. So the total number of parameters needed to be learned are . For the proposed model, to compute the attention matrix given by Eq. 13 the parameters to be learned are matrices , and the context vector c. So the total number of parameters required are . Comparing the above terms the reduction factor is . Even though, both are the parameter savings come from the constant factor. For , reduction factor is . For the Transformer model [vaswani2017attention], the multihead attention in one layer is given by;
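$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \quad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \tag{20}$$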
where the projections are parameter matrices & , and . In the selfattention case, all of Q, K, and V are the input representations from the previous layer. The number of parameters required to compute the multihead attention in Eq. 20 are . It should be noted that even though the attention computation in Transformer is an order of magnitude higher, it does not contain any GRU layers. So the overall complexity depends on the choice of the input layer such as convolutional or recurrent layers.
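To make the comparison concrete, the sketch below counts attention-layer parameters under explicitly stated assumptions. For SAN it assumes the two weight matrices of Lin et al., of shapes d_a-by-2u and r-by-d_a. For LAMA it assumes two rank-1 factor matrices of shape 2u-by-m plus a context vector of size 2u. For the Transformer it assumes per-head key and value sizes of d_model divided by the number of heads and ignores biases, which gives four d_model-by-d_model blocks regardless of the head count. The numeric settings in the usage lines are hypothetical.

```python
def san_attention_params(u, d_a, r):
    # First projection (d_a x 2u) plus second projection (r x d_a).
    return d_a * 2 * u + r * d_a

def lama_attention_params(u, m):
    # Two factor matrices (2u x m each) plus the context vector (2u).
    return 2 * (2 * u * m) + 2 * u

def transformer_attention_params(d_model):
    # Query, key, value and output projections, biases ignored.
    return 4 * d_model * d_model

# Hypothetical setting: u = 50, 30 heads, SAN hidden size 512, d_model = 100.
san = san_attention_params(u=50, d_a=512, r=30)
lama = lama_attention_params(u=50, m=30)
te = transformer_attention_params(d_model=100)
reduction_vs_san = san / lama
```

Under these assumptions the LAMA count grows linearly in the number of heads with a small constant, while the SAN count is dominated by the first projection.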
4 Experiments
We evaluate the performance of the proposed model on two tasks with five different datasets. Table 2 gives an overview of the datasets and their statistics.
Dataset  # Train  # Test  Avg. # words
Yelp  499,976  4,000  118
YelpL  175,844  1,378  226
IMDB  24,000  25,000  221
Reuters  4,484  2,189  102
News  151,328  32,428  352
4.1 Sentiment Analysis
For the sentiment analysis task we pick two datasets: the Yelp ratings dataset and the IMDB movie sentiment dataset.
4.1.1 Yelp
The Yelp dataset (https://www.yelp.com/dataset_challenge) consists of 2.7M Yelp reviews and user ratings from 1 to 5. Given a review as input, the goal is to predict the number of stars that the user who wrote the review assigned to the corresponding business. We treat the task as 5-way text classification where each class indicates the user rating. We randomly selected 500K review-star pairs as the training set, 4,000 for the dev set, and 4,000 for the test set. Reviews were tokenized using the spaCy tokenizer (https://spacy.io/). 100-dimensional word embeddings were trained from scratch on the training set using the gensim software package (https://radimrehurek.com/gensim/).
Multihead attention capturing multiple aspects is more useful for classifying ratings that are more subjective, i.e. longer reviews where people express their experiences in detail. We create a subset of the Yelp dataset containing all longer reviews, i.e. reviews containing more than 118 tokens, which we found to be the mean length of the reviews in the dataset. The training set consists of 175,844 reviews, the dev set consists of 1,416 reviews and the test set consists of 1,378 reviews. The goal is to predict the ratings from this subset of the Yelp dataset. We refer to this dataset as YelpL (YelpLong) in the rest of the paper since it consists of all longer reviews. We hypothesize that multihead attention is beneficial in this setting, where more intricate foraging of information from different parts of the text is required to make a prediction. The model hyperparameters and training settings remain the same as above.
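The construction of the longer-review subset can be sketched as a simple filter on token counts. The snippet below is an illustration only: it uses whitespace tokenization as a stand-in for the spaCy tokenizer, and the function name and toy reviews are hypothetical.

```python
def long_review_subset(reviews, tokenize=str.split):
    """Keep the reviews whose token count exceeds the mean token count."""
    lengths = [len(tokenize(review)) for review in reviews]
    mean_length = sum(lengths) / len(lengths)
    subset = [review for review, n in zip(reviews, lengths) if n > mean_length]
    return subset, mean_length

# Toy usage with three made-up reviews of 1, 3 and 9 tokens.
toy_reviews = [
    "good",
    "very good food",
    "the service was slow but the food was great",
]
long_reviews, mean_len = long_review_subset(toy_reviews)
```

Swapping in a different tokenizer only requires passing another callable as `tokenize`.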
4.1.2 Movie Reviews
The Large Movie Review dataset [maasEtAl:2011:ACLHLT2011] contains movie reviews along with their associated binary sentiment polarity labels. It contains 50,000 highly polar reviews (a score of at most 4 out of 10 for negative reviews and at least 7 out of 10 for positive reviews) split evenly into 25K train and 25K test sets. The overall distribution of labels is balanced (25K positive and 25K negative). In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain disjoint sets of movies, so no significant performance gain can be obtained by memorizing movie-unique terms and their association with observed labels. From the training set we randomly set aside 1,000 reviews for validation. We refer to this dataset as IMDB in the rest of the paper.
4.2 News Classification
For this task we selected two datasets.
4.2.1 News Aggregator
This dataset [Dua:2017] contains headlines, URLs, and categories for news stories collected by a web aggregator between March 10th, 2014 and August 10th, 2014. The news categories in this dataset are business; science and technology; entertainment; and health. Different news articles that refer to the same news item (e.g., several articles about recently released employment statistics) are also categorized together. Given a news article, the task is to classify it into one of the four categories. The training dataset consists of 151,328 articles and the test dataset consists of 32,428 articles. The average token length is 352.
4.2.2 Reuters
This dataset (https://www.cs.umb.edu/~smimarog/textmining/datasets/) is taken from the Reuters-21578 Text Categorization Collection, a collection of documents that appeared on the Reuters newswire in 1987. The documents were assembled and indexed with categories. We evaluate on the Reuters-8 dataset, consisting of news articles about 8 topics: acq, crude, earn, grain, interest, money-fx, ship and trade.
4.3 Comparative Methods
We use supervised and unsupervised baselines for comparison. For unsupervised methods we use a simple average of word embeddings (AVG) [shen2018baseline] and the recently proposed BERT model [devlin2018bert] as our baselines. We use a pretrained BERT implementation (https://gluonnlp.mxnet.io) and finetune it with a task-specific classifier. This finetuning is performed for 10 epochs using the Adam optimizer [kingma2014adam] with a learning rate of 5e-6.
Among supervised methods we use a variety of models with and without attention as our baselines. We use a biGRU [Chung2014EmpiricalEO] model with max-pooling, referred to as BiGRU, as a baseline. We use a convolutional neural network with max-over-time pooling [kim2014convolutional] as another baseline. We refer to this as CNN in the paper.
Among attention-based multihead models we use the Self Attention Network proposed in [lin2017structured]. We refer to this baseline as SAN. Following the original paper, we use 30 attention heads and an MLP hidden size of 512 for this baseline. The encoder of the Transformer model (TE) [vaswani2017attention] is another multihead attention mechanism that is used as a baseline. For this baseline we use one encoder layer and 16 attention heads. We use such that as used in their paper. For our attention-based models we performed a grid search to identify the number of attention heads that gives the best performance.
5 Results
Methods—Corpus  News  Reuters  Yelp  IMDB  YelpL 
Supervised  
SAN  0.876  0.942  0.68  0.831  0.638 
BiGRU  0.905  0.867  0.663  0.876  0.608 
CNN  0.914  0.96  0.693  0.874  0.672 
TE  0.899  0.973  0.655  0.817  0.569 
LAMA  0.922  0.965  0.697  0.895  0.653 
LAMA + Ctx  0.923  0.973  0.716  0.9  0.665 
Unsupervised  
BERT  0.922  0.978  0.715  0.894  0.672 
AVG  0.91  0.795  0.653  0.874  0.652 
Table 3 shows the accuracy of the baselines and the proposed model. Numbers highlighted in bold represent the best performing models in the supervised and unsupervised categories respectively. The proposed model with context (LAMA+Ctx) outperforms the SAN model [lin2017structured] on all tasks, with relative improvements ranging from 3.3% (Reuters) to 8.2% (IMDB).
Extrapolating the attention over larger chunks of text, we get uniform attention to all words, which is equivalent to no attention or equal preference for all words. This is what a simple BiGRU model does in a contextual setting, and what an average of word embeddings does in a non-contextual setting. We note that our model outperforms the BiGRU model by 2.0% (News), 12.2% (Reuters), 7.9% (Yelp), 9.4% (YelpL) and 2.7% (IMDB).
Our models outperform the Transformer Encoder on all tasks except the Reuters dataset, where both models perform on par. When compared to finetuned pretrained language models such as BERT, we find that the performance of both models is similar, except on the YelpL dataset where BERT outperforms LAMA.
If we consider the non-contextual baseline of the average of word embeddings, we see an improvement of 1.4% (News), 22.4% (Reuters), 9.8% (Yelp), 11.4% (YelpL) and 3.0% (IMDB), which indicates that the contextual dependencies captured by recurrent or CNN models are indeed important for these tasks.
Compared to CNN models we see an improvement of 1% (News), 1.4% (Reuters), 3.3% (Yelp) and 3.0% (IMDB). On the YelpL dataset our model and CNN perform similarly. This may be because of the ability of CNNs to capture local context using multiple fixed-size kernels and our model's ability to capture phrase-level context.
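The percentages quoted above appear to be relative improvements in accuracy. Under that assumption, the sketch below recomputes a few of them from the accuracies reported in Table 3; the dictionary layout and function name are chosen for this example, and results are rounded to one decimal place, so small differences from the quoted figures can arise from rounding of the table entries.

```python
# Accuracies copied from Table 3 for the rows used below.
ACCURACY = {
    "SAN":        {"News": 0.876, "Reuters": 0.942, "Yelp": 0.68,  "IMDB": 0.831, "YelpL": 0.638},
    "BiGRU":      {"News": 0.905, "Reuters": 0.867, "Yelp": 0.663, "IMDB": 0.876, "YelpL": 0.608},
    "CNN":        {"News": 0.914, "Reuters": 0.96,  "Yelp": 0.693, "IMDB": 0.874, "YelpL": 0.672},
    "LAMA + Ctx": {"News": 0.923, "Reuters": 0.973, "Yelp": 0.716, "IMDB": 0.9,   "YelpL": 0.665},
}

def relative_gain(ours, baseline):
    # Relative improvement of `ours` over `baseline`, in percent.
    return round((ours / baseline - 1.0) * 100.0, 1)

def gains_over(baseline_name, ours_name="LAMA + Ctx"):
    # Per-dataset relative improvement of one model over a baseline.
    ours = ACCURACY[ours_name]
    base = ACCURACY[baseline_name]
    return {dataset: relative_gain(ours[dataset], base[dataset]) for dataset in ours}

gains_vs_san = gains_over("SAN")
gains_vs_bigru = gains_over("BiGRU")
```

A negative value for a dataset means the baseline is ahead there, as for CNN on YelpL.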
5.1 Contextual Attention Weights
To verify that our model captures context-dependent word importance, we plot the distribution of the attention weights of the positive words ‘amazing’, ‘happy’ and ‘recommended’ and the negative words ‘poor’, ‘terrible’ and ‘worst’ from the test split of the Yelp dataset, as shown in Figure 2. We plot the distribution conditioned on the rating of the review. It can be seen from Figure 2 that the weight of positive words concentrates on the low end in reviews with rating 1 (blue line). As the rating increases, the weight distribution shifts to the right. This indicates that positive words play a more important role in reviews with higher ratings. The trend is the opposite for the negative words, where words with a negative connotation have lower attention weights in reviews with rating 5 (purple line). However, there are a few exceptions. For example, it is intuitive that ‘amazing’ gets a high weight for reviews with high ratings, but it also gets a high weight for reviews with rating 2 (orange line). Inspecting the Yelp dataset, we find that ‘amazing’ occurs quite frequently with the word ‘not’ in reviews with rating 2, as in ‘above average but not amazing’ and ‘was ok but not amazing’. Our model captures this phrase-level context and assigns similar weights to ‘not’ and ‘amazing’. ‘not’, being a negative word, gets a high weight for lower ratings and hence so does ‘amazing’. Similarly, other exceptions such as ‘terrible’ for rating 4 can be explained by the fact that customers might dislike one aspect of a business, such as its service, but like another aspect, such as its food.
[Table 4: top attended keywords; columns: Yelp(1), Yelp(5), r8(ship), r8(money)]
To further illustrate context-dependent word importance, Table 4 lists the top attended keywords for the Yelp and Reuters datasets. It can be seen that certain words that often occur in pairs, such as ‘would recommend’, ‘very professional’ and ‘very delicious’, appear in the list for Yelp, and ‘exchange rate’ and ‘trade deficit’ for Reuters. This is because the model assigns close attention weights to the words in these phrases, as they frequently occur together. Note also that superlatives such as ‘100 stars’, which are strong indicators of the sentiment of a review, appear in the list.
5.2 Why Multiple Heads
Having a structured embedding with multiple rows, provides for more contextual representations. To verify this we evaluate the model performance as we vary the number of attention heads from 1 to 25. Specifically, we plot the validation accuracy vs. epochs for different values of , for the YelpL and IMDB datasets. We vary from 1 to 20 to get 5 models with , , , and . The plots are shown in Figure 3. From the figure we can see that for the YelpL dataset model performance peaks for and then starts falling for . We can clearly see a significant difference between and , showing that having a multiaspect attention mechanism helps. For the IMDB dataset model with performs the best whereas model with performs the worst although performances for are similar to each other for this task.
5.3 Attentional Unit Efficiency
In this section we compare the number of trainable parameters in the attention layer of three multihead attention mechanisms (§3.3.2): the proposed model (LAMA), the Self-Attentive Network (SAN) and the Transformer Encoder (TE). Fig. 4 shows the increase in the number of parameters (y-axis) as the number of attention heads is increased from 1 to 40.
5.4 Convergence Analysis
We plot the training loss per epoch for the three attention-based models (SAN, TE and LAMA). As can be seen from Fig. 5, on average LAMA converges faster than SAN and TE, and to a lower training loss.
6 Conclusion
In this paper we presented a novel compact multihead attention mechanism and illustrated its effectiveness on text classification benchmarks. The proposed method computes multiple attention distributions over words, which leads to contextual sentence representations. The results showed that this mechanism, with fewer parameters, performed better than several other approaches, including non-contextual unsupervised baselines such as the average of word embeddings, contextual baselines such as LSTM-based methods and CNNs, and other attention mechanisms. We further demonstrated the computational advantage of our approach over prior multi-aspect mechanisms in computing attentions, which makes it more amenable to low-resource scenarios. An important open research question concerns the discernibility of different attention heads with respect to each other for better interpretability. One of the obstacles in learning attention in an unsupervised way is that there is no implicit mechanism to impose structure on different rows; this merits further research.