Sentiment Analysis with Contextual Embeddings and Self-Attention

03/12/2020 ∙ by Katarzyna Biesialska, et al. ∙ Politechnika Warszawska

In natural language the intended meaning of a word or phrase is often implicit and depends on the context. In this work, we propose a simple yet effective method for sentiment analysis using contextual embeddings and a self-attention mechanism. The experimental results for three languages, including morphologically rich Polish and German, show that our model is comparable to or even outperforms state-of-the-art models. In all cases the superiority of models leveraging contextual embeddings is demonstrated. Finally, this work is intended as a step towards introducing a universal, multilingual sentiment classifier.




1 Introduction

All areas of human life are affected by people’s views. With the sheer amount of reviews and other opinions over the Internet, there is a need for automating the process of extracting relevant information. For machines, however, measuring sentiment is not an easy task, because natural language is highly ambiguous at all levels, and thus difficult to process. For instance, a single word can hardly convey the whole meaning of a statement. Moreover, computers often do not distinguish literal from figurative meaning or incorrectly handle complex linguistic phenomena, such as sarcasm, humor, and negation.

In this paper, we take a closer look at two factors that make automatic opinion mining difficult: the problem of representing text information, and sentiment analysis (SA). In particular, we leverage contextual embeddings, which make it possible to convey the meaning of a word depending on the context in which it occurs. Furthermore, we build a hierarchical multi-layer classifier model, based on the architecture of the Transformer encoder [32], primarily relying on a self-attention mechanism and bi-attention. The proposed sentiment classification model is language independent, which is especially useful for low-resource languages (e.g. Polish).

We evaluate our methods on various standard datasets, which allows us to compare our approach against current state-of-the-art models for three languages: English, Polish and German. We show that our approach is comparable to the best performing sentiment classification models; and, importantly, in two cases yields significant improvements over the state of the art.

The paper is organized as follows: Section 2 presents the background and related work. Section 3 describes our proposed method. Section 4 discusses datasets, experimental setup, and results. Section 5 concludes this paper and outlines the future work.

2 Related Work

Sentiment classification has been one of the most active research areas in natural language processing (NLP) and has become one of the most popular downstream tasks to evaluate performance of neural network (NN) based models. The task itself encompasses several different opinion related tasks, hence it tackles many challenging NLP problems, see e.g. [17, 21].

2.1 Sentiment Analysis Approaches

The first fully-formed techniques for SA emerged around two decades ago, and continued to be prevalent for several years, until deep learning methods entered the stage. The most straightforward method, developed in [31], is based on the number of positive and negative words in a piece of text. Concretely, the text is assumed to have positive polarity if it contains more positive than negative terms, and vice versa. Of course, the term-counting method is often insufficient; therefore, an improved method was proposed in [11], which combines counting positive and negative terms with a machine learning (ML) approach (i.e. Support Vector Machine).
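The term-counting rule can be illustrated with a minimal Python sketch. The two word lists and the function name `term_counting_polarity` are hypothetical and chosen only for illustration; they are not the lexicons or the implementation used in [31] or [11].

```python
# Toy lexicons: hypothetical word lists, not taken from [31] or [11].
POSITIVE = {"good", "great", "excellent", "enjoyable"}
NEGATIVE = {"bad", "boring", "poor", "awful"}


def term_counting_polarity(text):
    """Label a text by comparing counts of positive and negative terms."""
    tokens = text.lower().split()
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # tie: the counting rule alone cannot decide


print(term_counting_polarity("a great cast and an enjoyable plot"))
print(term_counting_polarity("not good"))
```

The second call is labeled positive because the rule ignores negation, which is one reason why pure term counting is often insufficient.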

Various studies (e.g. [30]) have shown that one can determine the polarity of an unknown word by calculating its co-occurrence statistics. Moreover, classical solutions to the SA problem are often based on lexicons. Traditional lexicon-based SA leverages word lists that are pre-annotated with positive and negative sentiment. Therefore, for many years lexicon-based approaches have been utilized when there was an insufficient amount of labeled data to train a classifier in a fully supervised way.

In general, ML algorithms are popular methods for determining sentiment polarity. A first ML model applied to SA was implemented in [22]. Moreover, throughout the years, different variants of NN architectures have been introduced in the field of SA. Especially recursive neural networks [23], such as recurrent neural networks (RNN) [28, 29, 14], or convolutional neural networks (CNN) [10, 12] have become the most prevalent choices.

2.2 Vector Representations of Words

One of the principal concepts in linguistics states that related words can be used in similar ways [7]. Importantly, words may have different meaning in different contexts. Nevertheless, until recently it has been a dominant approach (e.g. word2vec [20], GloVe [24]) to learn representations such that each and every word has to capture all its possible meanings.

However, lately a new set of methods to learn dynamic representations of words has emerged [19, 8, 25, 26, 5]. These approaches allow each word representation to capture what a word means in a particular context. While every word token has its own vector, the vector can depend on a variable-length sequence of nearby words (i.e. context). Consequently, a context vector is obtained by feeding a neural network with these context word vectors and subsequently encoding them into a single fixed-length vector.

Figure 1: The architecture of ELMo.

ULMFiT [8] was the very first method to induce contextual word representations by harnessing the power of language modeling. The authors proposed to learn contextual embeddings by pre-training a language model (LM), and then performing task-specific fine-tuning. The ULMFiT architecture is based on a vanilla 3-layer Long Short-Term Memory (LSTM) NN without any attention mechanism.

The other contextual embedding model introduced recently is called ELMo (Embeddings from Language Models) [25]. Similarly to ULMFiT, this model uses tokens at the word level. ELMo contextual embeddings are “deep” as they are a function of all hidden states. Concretely, context-sensitive features are extracted from a 2-layer bidirectional LSTM language model, run left-to-right and right-to-left. Thus, the contextual representation of each word is the concatenation of the left-to-right and right-to-left representations as well as the initial embedding (see Fig. 1).

The most recent model, BERT [5], is architecturally more sophisticated, as it is a multi-layer masked LM based on the Transformer NN utilizing sub-word tokens. However, as we are bound to use word-level tokens in our sentiment classifier, we leverage the ELMo model for obtaining contextual embeddings. More specifically, by means of ELMo we are able to feed our classifier model with context-aware embeddings of an input sequence. Hence, in this setting we do not perform any fine-tuning of ELMo on a downstream task.

2.3 Self-Attention Deep Neural Networks

The attention mechanism was introduced in [3] in 2014 and since then it has been applied successfully to different computer vision (e.g. visual explanation) and NLP (e.g. machine translation) tasks. The mechanism is often used as an extra source of information added on top of a CNN or LSTM model to enhance the extraction of sentence embeddings [6, 16]. However, this scenario is not applicable to sentiment classification, since the model only receives a single sentence as input, hence there is no such extra information [16].

Self-attention (or intra-attention) is an attention mechanism that computes a representation of a sequence by relating different positions of a single sequence. Previous work on sentiment classification has not extensively covered attention-based neural network models for SA (especially using the Transformer architecture [32]), although some papers have appeared recently [2, 15].

3 The Proposed Approach

Our proposed model, called Transformer-based Sentiment Analysis (TSA) (see Fig. 2), is based on the recently introduced Transformer architecture [32], which has provided significant improvements for the neural machine translation task. Unlike RNN or CNN based models, the Transformer is able to learn dependencies between distant positions. Therefore, in this paper we show that attention-based models are suitable for other NLP tasks, such as learning distributed representations and sentiment analysis, and thus are able to improve the overall accuracy.

Figure 2: An overview of the TSA model architecture.

The architecture of the TSA model and steps to train it can be summarized as follows:

  1. At the very beginning there is a simple text pre-processing method that performs text clean-up and splits text into tokens.

  2. We use contextual word representations to represent text as real-valued vectors.

  3. After embedding the text into real-valued vectors, the Transformer network maps the input sequence into hidden states using self-attention.

  4. Next a bi-attention mechanism is utilized to estimate the interdependency between representations.

  5. A single layer LSTM together with self-attentive pooling compute the pooled representations.

  6. A joint representation for the inputs is later passed to a fully-connected neural network.

  7. Finally, a softmax layer is used to determine sentiment of the text.

3.1 Embeddings and Encoded Positional Information

Non-recurrent models, such as deep self-attention NN, do not necessarily process the input sequence in a sequential manner. Hence, there is no way they can record the position of each word in a sequence, which is an inherent limitation of every such model. Therefore, in the case of the Transformer, the need has been addressed in the following manner – the Transformer takes into account the order of the words in the input sequence by encoding their position information in extra vectors (so called positional encoding vectors) and adding them to input embeddings. There are many different approaches to embed position information, such as learned or fixed positional encodings (PE), or recently introduced relative position representations (RPR) [27]. The original Transformer used sine and cosine functions of different frequencies.
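As an illustration of the fixed sine and cosine encodings of the original Transformer [32], the following NumPy sketch builds the positional encoding matrix that is added to the input embeddings. The function name and the assumption of an even `d_model` are ours; this is not the encoding used in the TSA model, which relies on RPR instead.

```python
import numpy as np


def sinusoidal_positional_encoding(seq_len, d_model):
    """Fixed positional encodings of the original Transformer [32].

    Assumes d_model is even: even columns hold sines, odd columns cosines.
    """
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


# Usage: add the encodings to (here random) input embeddings.
embeddings = np.random.default_rng(0).normal(size=(6, 8))
encoder_input = embeddings + sinusoidal_positional_encoding(6, 8)
```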

In this work, we explore the effectiveness of applying a modified approach to incorporate positional information into the model, namely using RPR instead of PE. Furthermore, we use global average pooling in order to average the output of the last self-attention layer and prepare the model for the final classification layer.
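The two ingredients mentioned above can be sketched in NumPy as follows. With RPR [27], the relative distance between positions i and j is clipped to a maximum distance k, so that only 2k + 1 distinct relative position embeddings are needed; global average pooling then averages the encoder output over the sequence. The function names are hypothetical, and the pooling sketch assumes a sequence without padding.

```python
import numpy as np


def relative_position_index(seq_len, clip_distance):
    """Index matrix for relative position embeddings, as in [27].

    Entry [i, j] is clip(j - i, -k, k) + k, a value in 0..2k that can be used
    to look up one of 2k + 1 relative position embeddings.
    """
    positions = np.arange(seq_len)
    relative = positions[None, :] - positions[:, None]   # entry [i, j] = j - i
    clipped = np.clip(relative, -clip_distance, clip_distance)
    return clipped + clip_distance


def global_average_pooling(hidden_states):
    """Average a (seq_len, d_model) matrix over the sequence dimension."""
    return hidden_states.mean(axis=0)


print(relative_position_index(4, 2))
```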

3.2 The Transformer Encoder

The input sequence is combined with word and positional embeddings, which provide a time signal, and together they are fed into an encoder block. Matrices for a query Q, a key K and a value V are calculated and passed to a self-attention layer. Next, a normalization is applied and residual connections provide additional context. Further, a final dense layer with vocabulary size generates the output of the encoder. A fully-connected feed-forward network within the model is a single hidden layer network with a ReLU activation function in between:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 \qquad (1)$$

where $W_1$, $W_2$ are weight matrices and $b_1$, $b_2$ are bias vectors, as in [32].
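A minimal NumPy sketch of such a feed-forward sub-layer, with one hidden layer and a ReLU in between, is given here. The parameter names `w1`, `b1`, `w2`, `b2` and the dimensions in the usage example are assumptions for illustration, not the sizes used in the experiments.

```python
import numpy as np


def position_wise_ffn(x, w1, b1, w2, b2):
    """Apply a single-hidden-layer feed-forward network to every position.

    x: (seq_len, d_model), w1: (d_model, d_ff), w2: (d_ff, d_model).
    """
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU activation
    return hidden @ w2 + b2


# Usage with hypothetical sizes d_model = 8 and d_ff = 16.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out = position_wise_ffn(
    x,
    rng.normal(size=(8, 16)), np.zeros(16),
    rng.normal(size=(16, 8)), np.zeros(8),
)
```

The same weights are applied to each position independently, so the output keeps the shape of the input.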


3.3 Self-Attention Layer

The self-attention block in the encoder is called multi-head self-attention. A self-attention layer allows each position in the encoder to access all positions in the previous layer of the encoder immediately, and in the first layer all positions in the input sequence. The multi-head self-attention layer employs h parallel self-attention layers, called heads, with different Q, K, V matrices obtained for each head. In a nutshell, the attention mechanism in the Transformer architecture relies on a scaled dot-product attention, which is a function of Q and a set of K-V pairs. The computation of attention is performed in the following order. First, a multiplication of a query and transposed key is scaled through the scaling factor of $1/\sqrt{d_k}$ (Eq. 2), where $d_k$ is the dimensionality of the keys [32]:

$$S = \frac{QK^{\top}}{\sqrt{d_k}} \qquad (2)$$

Next, the attention is produced using the softmax function over their scaled inner product:

$$A = \mathrm{softmax}(S) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) \qquad (3)$$

Finally, the weighted sum of each attention head and a value is calculated as follows:

$$\mathrm{Attention}(Q, K, V) = AV \qquad (4)$$
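The three steps (scaling, softmax, weighted sum of values) can be sketched in NumPy as below. This is a simplified illustration of scaled dot-product attention as defined in [32], not the TSA implementation: the per-head projection matrices are hypothetical, and the final output projection, masking and relative position terms are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(q, k, v):
    """Return the attention output and the attention weights."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)        # step 1: scaled query-key product
    weights = softmax(scores, axis=-1)     # step 2: softmax over the keys
    return weights @ v, weights            # step 3: weighted sum of values


def multi_head_self_attention(x, heads):
    """Run one attention per head and concatenate the results.

    heads: list of (w_q, w_k, w_v) projection triples, one per head.
    """
    outputs = []
    for w_q, w_k, w_v in heads:
        out, _ = scaled_dot_product_attention(x @ w_q, x @ w_k, x @ w_v)
        outputs.append(out)
    return np.concatenate(outputs, axis=-1)
```

Each row of the weight matrix sums to one, so every output position is a convex combination of the value vectors.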


3.4 Masking and Pooling

Similar to other sources of data, the datasets used for training and evaluation of our models contain sequences of different length. The most common approach in the literature involves finding a maximal sequence length existing in the dataset/batch and padding sentences that are shorter than the longest one with trailing zeroes. In the proposed TSA model, we deal with the problem of variable-length sequences by using masking and self-attentive pooling. The inspiration for our approach comes from the BCN model proposed in [19]. Thanks to this mechanism, we are able to fit sequences of different length into the final fixed-size vector, which is required for the computation of the sentiment score. The self-attentive pooling layer is applied just after the encoder block.
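A minimal NumPy sketch of masked self-attentive pooling is shown below, under the assumption that the pooling scores come from a single learned vector `w` (a hypothetical parameter name). Padded positions receive a score of minus infinity, so they get zero weight after the softmax and do not influence the fixed-size output.

```python
import numpy as np


def self_attentive_pooling(hidden, mask, w):
    """Pool a padded sequence into one fixed-size vector.

    hidden: (seq_len, d) encoder outputs, mask: (seq_len,) booleans that are
    True for real tokens, w: (d,) scoring vector (assumed to be learned).
    Assumes at least one real token in the sequence.
    """
    scores = hidden @ w
    scores = np.where(mask, scores, -np.inf)   # masked positions get no weight
    scores = scores - scores.max()             # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights @ hidden, weights


# Usage: a sequence of two real tokens followed by one padding position.
hidden = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 0.0]])
mask = np.array([True, True, False])
pooled, weights = self_attentive_pooling(hidden, mask, np.zeros(2))
```

With a zero scoring vector the two real tokens are weighted equally, and the padding position is ignored regardless of its content.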

4 Experiments

4.1 Datasets

In this work, we compare sentiment analysis results considering four benchmark datasets in three languages. All datasets are originally split into training, dev and test sets. Below we describe these datasets in more detail.

| Dataset | # Classes | Train | Dev | Test | Domain | Language |
|---|---|---|---|---|---|---|
| SST-2 | 2 | 6,920 | 872 | 1,821 | movies | English |
| SST-5 | 5 | 8,544 | 1,101 | 2,210 | movies | English |
| PolEmo 2.0-IN | 5 | 5,783 | 723 | 722 | medical, hotels | Polish |
| GermEval | 3 | 19,432 | 2,369 | 2,566 | travel, transport | German |

Table 1: Sentiment analysis datasets with number of classes and train/dev/test split.

Stanford Sentiment Treebank (SST)

This collection of movie reviews [28] is annotated for binary (SST-2) and fine-grained (SST-5) sentiment classification. SST-2 divides reviews into two groups: positive and negative, while SST-5 distinguishes 5 different review types: very positive, positive, neutral, negative, very negative. The dataset consists of 11,855 single sentences and is widely used in the NLP community.

PolEmo 2.0

The dataset [13] comprises online reviews from education, medicine and hotel domains. There are two separate test sets, to allow for in-domain (medicine and hotels) and out-of-domain (products and university) evaluation. The dataset comes with the following sentiment labels: strong positive, weak positive, neutral, weak negative, strong negative, and ambiguous.


GermEval 2017

This dataset [33] contains customer reviews of the railway operator (Deutsche Bahn) published on social media and various web pages. Customers expressed their feedback regarding the service of the railway company (e.g. travel experience, timetables, etc.) by rating it as positive, negative, or neutral.

4.2 Experimental Setup

Pre-processing of input datasets is kept to a minimum as we perform only tokenization when required. Furthermore, even though some datasets, such as SST or GermEval, provide additional information (i.e. phrase, word or aspect-level annotations), for each review we only extract text of the review and its corresponding rating.

The model is implemented in the Python programming language, using PyTorch and AllenNLP. Moreover, we use pre-trained word embeddings, such as ELMo [25] and GloVe [24]. Specifically, we use the following ELMo models: Original, Polish [9] and German [18]. In the ELMo+GloVe+BCN model we use 300-dimensional GloVe embeddings, including Polish [4] and German ones. In order to simplify our approach when training the sentiment classifier model, we establish a very similar setting to the vanilla Transformer. We use the same optimizer, Adam. We incorporate four types of regularization during training: dropout, embedding dropout, residual dropout, and attention dropout. We use 2 encoder layers. In addition, we employ label smoothing. In terms of RPR parameters, we clip the maximum relative distance.
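Label smoothing can be illustrated with the following sketch, which uses the common formulation that mixes the one-hot target with a uniform distribution over the classes. The function name and the smoothing value of 0.1 in the usage line are assumptions for illustration, not the settings used in these experiments.

```python
def smooth_labels(true_class, num_classes, epsilon):
    """Return a smoothed target distribution for one training example.

    The one-hot target is mixed with a uniform distribution:
    (1 - epsilon) * one_hot + epsilon / num_classes.
    """
    off_value = epsilon / num_classes
    target = [off_value] * num_classes
    target[true_class] = 1.0 - epsilon + off_value
    return target


# Usage: a 5-class problem (as in SST-5) with a hypothetical epsilon of 0.1.
print(smooth_labels(1, 5, 0.1))
```

The smoothed target still sums to one, but the model is no longer pushed towards assigning all probability mass to a single class.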

4.3 Results and Discussion

In Table 2, we summarize experimental results achieved by our model and other state-of-the-art systems reported in the literature by their respective authors.

| Model | SST-2 (English) | SST-5 (English) | PolEmo2.0-IN (Polish) | GermEval (German) |
|---|---|---|---|---|
| RNTN [28] | 85.4 | 45.7 | - | - |
| DCNN [10] | 86.8 | 48.5 | - | - |
| CNN [12] | 88.1 | 48.0 | - | - |
| DMN [14] | 88.6 | 52.1 | - | - |
| Constituency Tree-LSTM [29] | 88.0 | 51.0 | - | - |
| CoVe+BCN [19] | 90.3 | 53.7 | - | - |
| SSAN+RPR [2] | 84.2 | 48.1 | - | - |
| Polish BERT [1] | - | - | 88.1 | - |
| SWN2-RNN [33] | - | - | - | 74.9 |
| Our baseline: ELMo+GloVe+BCN | 91.4 | 53.5 | 88.9 | 78.2 |
| Our model: ELMo+TSA | 89.3 | 50.6 | 89.8 | 78.9 |

Table 2: Results of our systems compared to baselines and state-of-the-art systems evaluated on English, Polish and German sentiment classification datasets.

We observe that our models, the baseline and ELMo+TSA, are competitive with state-of-the-art systems for all three languages: the baseline obtains the best SST-2 score (91.4) and is within 0.2 points of CoVe+BCN on SST-5 (53.5 vs. 53.7), while both of our models surpass the previously reported results on PolEmo2.0-IN and GermEval. More importantly, the presented accuracy scores indicate that the TSA model is competitive and for two languages (Polish and German) achieves the best results. Also noteworthy, in Table 2 there are two models that use some variant of the Transformer: SSAN+RPR [2] uses the Transformer encoder for the classifier, while Polish BERT [1] employs the Transformer-based language model introduced in [5]. One of the reasons why we achieve a higher score than SSAN+RPR on the SST dataset might be that its authors used word2vec embeddings [20], whereas we employ ELMo contextual embeddings [25]. Moreover, in our TSA model we use not only self-attention (as in SSAN+RPR) but also a bi-attention mechanism, which should also provide performance gains over standard architectures.

In conclusion, comparing the results of the models leveraging contextual embeddings (CoVe+BCN, Polish BERT, ELMo+GloVe+BCN and ELMo+TSA) with the rest of the reported models, which use traditional distributional word vectors, we note that the former category of sentiment classification systems demonstrates remarkably better results.

5 Conclusion and Future Work

We have presented a novel architecture, based on the Transformer encoder with relative position representations. Unlike existing models, this work proposes a model relying solely on a self-attention mechanism and bi-attention. We show that our sentiment classifier model achieves very good results, comparable to the state of the art, even though it is language-agnostic. Hence, this work is a step towards building a universal, multi-lingual sentiment classifier.

In the future, we plan to evaluate our model using benchmarks also for other languages. It is particularly interesting to analyze the behavior of our model with respect to low-resource languages. Finally, other promising research avenues worth exploring are related to unsupervised cross-lingual sentiment analysis.


References

  • [1] Allegro KLEJ benchmark. Accessed: 2020-01-20.
  • [2] A. Ambartsoumian and F. Popowich (2018) Self-attention: a better building block for sentiment analysis neural network classifiers. In Proceedings of the 9th EMNLP Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 130–139.
  • [3] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv.
  • [4] S. Dadas (2019) A repository of Polish NLP resources. GitHub. Accessed: 2020-01-20.
  • [5] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171–4186.
  • [6] C. N. dos Santos, M. Tan, B. Xiang, and B. Zhou (2016) Attentive pooling networks. arXiv.
  • [7] J. R. Firth (1957) A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis.
  • [8] J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pp. 328–339.
  • [9] A. Janz and P. Miłkowski (2019) ELMo embeddings for Polish. CLARIN-PL digital repository.
  • [10] N. Kalchbrenner, E. Grefenstette, and P. Blunsom (2014) A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pp. 655–665.
  • [11] A. Kennedy and D. Inkpen (2006) Sentiment classification of movie reviews using contextual valence shifters. Computational Intelligence 22, pp. 110–125.
  • [12] Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1746–1751.
  • [13] J. Kocoń, P. Miłkowski, and M. Zaśko-Zielińska (2019) Multi-level sentiment analysis of PolEmo 2.0: extended corpus of multi-domain consumer reviews. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pp. 980–991.
  • [14] A. Kumar, O. Irsoy, P. Ondruska, M. Iyyer, J. Bradbury, I. Gulrajani, V. Zhong, R. Paulus, and R. Socher (2016) Ask me anything: dynamic memory networks for natural language processing. In Proceedings of the 33rd International Conference on Machine Learning, Vol. 48, pp. 1378–1387.
  • [15] G. Letarte, F. Paradis, P. Giguère, and F. Laviolette (2018) Importance of self-attention for sentiment analysis. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP, pp. 267–275.
  • [16] Z. Lin, M. Feng, C. N. dos Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. In 5th International Conference on Learning Representations.
  • [17] B. Liu (2012) Sentiment analysis and opinion mining. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers.
  • [18] P. May (2019) German ELMo Model. Accessed: 2020-01-20.
  • [19] B. McCann, J. Bradbury, C. Xiong, and R. Socher (2017) Learned in translation: contextualized word vectors. In Advances in Neural Information Processing Systems 30, pp. 6294–6305.
  • [20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pp. 3111–3119.
  • [21] S. M. Mohammad (2016) Sentiment analysis: detecting valence, emotions, and other affectual states from text. In Emotion measurement, pp. 201–237.
  • [22] B. Pang, L. Lee, and S. Vaithyanathan (2002) Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pp. 79–86.
  • [23] R. Paulus, R. Socher, and C. D. Manning (2014) Global belief recursive neural networks. In Advances in Neural Information Processing Systems 27, pp. 2888–2896.
  • [24] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543.
  • [25] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2227–2237.
  • [26] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever (2018) Improving language understanding by generative pre-training.
  • [27] P. Shaw, J. Uszkoreit, and A. Vaswani (2018) Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 464–468.
  • [28] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642.
  • [29] K. S. Tai, R. Socher, and C. D. Manning (2015) Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pp. 1556–1566.
  • [30] P. D. Turney and P. Pantel (2010) From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, pp. 141–188.
  • [31] P. Turney (2002) Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 417–424.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pp. 5998–6008.
  • [33] M. Wojatzki, E. Ruppert, S. Holschneider, T. Zesch, and C. Biemann (2017) GermEval 2017: Shared Task on Aspect-based Sentiment in Social Media Customer Feedback. In Proceedings of the GermEval 2017, pp. 1–12.