Improving Results on Russian Sentiment Datasets

by Anton Golubev, et al.
Mail.Ru Group

In this study, we test standard neural network architectures (CNN, LSTM, BiLSTM) and recently introduced BERT architectures on datasets from previous Russian sentiment evaluations. We compare two variants of Russian BERT and show that, for all sentiment tasks in this study, the conversational variant of Russian BERT performs better. The best results were achieved by the BERT-NLI model, which treats sentiment classification as a natural language inference task. On one of the datasets, this model nearly reaches the human level.




1 Introduction

Sentiment analysis studies are currently based on deep learning approaches, which require training and testing on specialized datasets. For English, popular sentiment analysis datasets include the Stanford Sentiment Treebank (SST) [socher2013recursive], the IMDB dataset of movie reviews [maas2011learning], Twitter sentiment datasets [nakov2016semeval, rosenthal2017semeval], and many others. For other languages, far fewer datasets have been created. For Russian, several sentiment evaluations were previously organized, including ROMIP 2012-2013 and SentiRuEval 2015-2016 [chetviorkin2013evaluating, loukachevitch2015entity, loukachevitch2016rubtsova], which included the preparation of annotated data on reviews (movies, books, and digital cameras), news quotes, and Twitter messages. The best results on these datasets were obtained with classical machine learning techniques such as SVM [chetviorkin2013evaluating], early neural network approaches [trofimovich2016comparison], or even engineering methods based on rules and lexicons [kuznetsova2013testing]. The results achieved in the above-mentioned Russian evaluations can now undoubtedly be improved.

In this study, we test standard neural network architectures (CNN, LSTM, BiLSTM) and recently introduced BERT architectures on datasets from previous Russian sentiment evaluations. We compare two variants of Russian BERT [devlin2018bert] and show that, for all sentiment tasks in this study, the conversational variant of Russian BERT performs better. The best results were achieved by the BERT-NLI model, which treats the sentiment classification problem as a natural language inference task. In one of the tasks, this model nearly reaches the human level of sentiment analysis.

The contributions of this paper are as follows:

  • we update previous results on five Russian sentiment analysis datasets using state-of-the-art methods,

  • we test the new Conversational RuBERT model on several sentiment analysis tasks and show that it outperforms the earlier RuBERT model,

  • we show that the BERT model that treats sentiment analysis as a natural language inference task achieves the best results on most of the datasets under analysis.

This paper is structured as follows. In Section 2 we present sentiment analysis datasets previously created for Russian shared tasks, best methods and achieved results in previous evaluations. Section 3 describes preprocessing steps and methods applied to sentiment analysis tasks in the current study, including several BERT-based models. Section 4 presents the achieved results. In Section 5 we analyse the errors of models on difficult examples. Section 6 describes other available Russian sentiment analysis datasets and methods applied to these datasets.

2 Datasets

In our study, we consider five Russian datasets annotated for previous Russian sentiment evaluations: the news quotes of the ROMIP-2013 evaluation [chetviorkin2013evaluating] and the Twitter datasets of the two SentiRuEval evaluations of 2015-2016 [loukachevitch2015entity, loukachevitch2016rubtsova]. Table 1 presents the datasets under evaluation, the volumes of their training and test parts, the main quality measures, the results achieved, and the best methods. Table 2 shows the distribution of the dataset texts by sentiment class.

Dataset                    Train vol.  Test vol.  Metric             Result  Method
News Quotes ROMIP-2013     4260        5500       $F_1$ macro        62.1    Lexicons+Rules
SentiRuEval-2015 Telecom   5000        5322       $F_1^{+-}$ macro   50.3    SVM
SentiRuEval-2015 Banks     5000        5296       $F_1^{+-}$ macro   36.0    SVM
SentiRuEval-2016 Telecom   8643        2247       $F_1^{+-}$ macro   55.9    2-layer GRU
SentiRuEval-2016 Banks     9392        3313       $F_1^{+-}$ macro   55.1    2-layer GRU
Table 1: Datasets under evaluation.

2.1 News Quotes Dataset

To create the news quotes collection, opinions expressed in direct or indirect speech were extracted from news articles [chetviorkin2013evaluating]. The task was to classify each quotation as a neutral, positive, or negative comment by the speaker about the topic of the quotation. It can be seen in Table 2 that the class distribution in the dataset was rather balanced. The main quality measure was $F_1$ macro.

The participants experimented with classical machine learning approaches such as Naive Bayes and SVM classifiers, but the best results were obtained by a knowledge-based approach using a large sentiment lexicon and rules: 62.1 $F_1$ macro and 61.6 accuracy. This can be explained by the great variety of topics and topic-related sentiment discussed in news quotes [chetviorkin2013evaluating].

                           Train sample                  Test sample
Dataset                    Positive  Negative  Neutral   Positive  Negative  Neutral
News Quotes ROMIP-2013     16        36        48        11        33        56
SentiRuEval-2015 Telecom   19        32        49        10        23        67
SentiRuEval-2015 Banks     7         34        59        8         15        79
SentiRuEval-2016 Telecom   15        29        56        10        46        44
SentiRuEval-2016 Banks     8         18        74        10        22        68
Table 2: Class distribution by datasets (%).

2.2 Twitter Datasets

Twitter datasets were annotated for the task of reputation monitoring [amigo2013overview, loukachevitch2015entity]. The goal of Twitter sentiment analysis at SentiRuEval was to find sentiment-oriented opinions or positive and negative facts about two types of organizations: banks and telecom companies. The task can thus be classified as a targeted (entity-oriented) sentiment analysis problem. Similar evaluations were organized twice, in 2015 and 2016, during which four target-oriented datasets were annotated (Table 1). In 2016, the training datasets in both domains were constructed by merging the training and test data of the 2015 evaluation, and were therefore much larger [loukachevitch2016rubtsova].

The participating systems were required to perform a three-way classification of tweets: positive, negative, or neutral. It can be seen in Table 2 that the neutral class prevailed in the datasets. For this reason, the main quality measure was the $F_1^{+-}$ macro measure, calculated as the average of the $F_1$ measure of the positive class and the $F_1$ measure of the negative class. The $F_1$ measure of the neutral class was ignored because this category is usually of little interest. This does not reduce the task to two-class prediction, however, because erroneous labeling of neutral tweets lowers $F_1^{+}$ and $F_1^{-}$. Additionally, micro-average $F_1^{+-}$ measures were calculated for the two sentiment classes [loukachevitch2015entity, loukachevitch2016rubtsova].
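The two-class averaging described above can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names are hypothetical, labels are assumed to be encoded as 1 (positive), -1 (negative) and 0 (neutral), and the micro-average is assumed to pool the error counts of the positive and negative classes, which the text does not spell out.

```python
def class_counts(gold, pred, label):
    """Count true positives, false positives and false negatives for one class."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p == label and g == label:
            tp += 1
        elif p == label:
            fp += 1
        elif g == label:
            fn += 1
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_pn_macro(gold, pred, pos=1, neg=-1):
    """Average of the per-class F1 of the positive and negative classes."""
    return (f1_from_counts(*class_counts(gold, pred, pos))
            + f1_from_counts(*class_counts(gold, pred, neg))) / 2


def f1_pn_micro(gold, pred, pos=1, neg=-1):
    """F1 over pooled counts of the two sentiment classes (assumed definition)."""
    totals = [0, 0, 0]
    for label in (pos, neg):
        for i, count in enumerate(class_counts(gold, pred, label)):
            totals[i] += count
    return f1_from_counts(*totals)


if __name__ == "__main__":
    gold = [1, -1, 0, 0, 1, -1]
    pred = [1, 0, 0, 1, 1, -1]  # one neutral tweet is mislabeled as positive
    print(f1_pn_macro(gold, pred), f1_pn_micro(gold, pred))
```

In the toy data, the neutral tweet mislabeled as positive lowers the precision of the positive class, which shows why ignoring the neutral $F_1$ does not turn the task into two-class prediction.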

It can be seen in Table 1 that the results in 2016 are much higher than in 2015 for the same tasks. There are two possible reasons for this. The first is the larger volume of training data in 2016. The second is the participants' use of more advanced methods, including neural network models and embeddings.

3 Methods

We compare the following groups of sentiment analysis methods on the datasets described above. SVM with pre-trained embeddings is the baseline for our study. We chose FastText embeddings (dimension 300) because they gave better results in preliminary studies than other types of Russian embeddings such as ELMo, Word2Vec, and GloVe. Each sentence was passed to the SVM as the average of its token embeddings. The grid search mechanism of the scikit-learn framework was used to obtain optimal hyperparameters.
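The sentence representation used for the SVM baseline can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the toy three-dimensional vectors are hypothetical (the study uses 300-dimensional FastText embeddings), and skipping tokens that have no vector is an assumption.

```python
def average_embedding(tokens, vectors, dim):
    """Return the mean embedding of the tokens found in `vectors`.

    `vectors` maps a token to a list of `dim` floats. Tokens without a
    vector are skipped (an assumption); if none is known, a zero vector
    is returned.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


if __name__ == "__main__":
    toy_vectors = {"good": [1.0, 0.0, 2.0], "bank": [3.0, 2.0, 0.0]}
    print(average_embedding(["good", "bank", "unknown"], toy_vectors, 3))
```

The resulting fixed-length vectors could then be passed to an SVM classifier, for example scikit-learn's `SVC` wrapped in `GridSearchCV`; the hyperparameter grid used in the study is not given in the text.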

3.1 Preprocessing

Since most of the data are noisy tweets, substantial text preprocessing was applied. The full cycle consisted of the following steps:

  • lowercasing;

  • replacing URLs with url token;

  • replacing mentions with user token;

  • replacing hashtags with hashtag token;

  • replacing emails with email token;

  • replacing phone numbers with phone token;

  • replacing emoticons with appropriate tokens like sad, happy, neutral;

  • removing all special symbols except punctuation marks;

  • replacing any letter repeated more than twice in a row with two repetitions of that letter;

  • lemmatization and removing stop words.

It is worth noting that the last step was applied only for the SVM and the classical neural networks. For BERT-based methods it made no meaningful difference: the results changed by only about 0.01%.
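The preprocessing steps listed above, except lemmatization and stop-word removal (which need external language resources), can be sketched with regular expressions. All patterns below are illustrative assumptions rather than the ones used in the study. Emails are handled before mentions so that the `@` inside an address is not mistaken for a mention, and the final whitespace normalization is an added tidy-up step.

```python
import re

# Illustrative patterns; the exact rules used in the study are not given.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
PHONE_RE = re.compile(r"\+?\d[\d\s()-]{7,}\d")
HAPPY_RE = re.compile(r"[:;=]-?\)+")
SAD_RE = re.compile(r"[:;=]-?\(+")
NEUTRAL_RE = re.compile(r"[:;=]-?\|")
# Anything that is not a word character, whitespace or common punctuation.
SPECIAL_RE = re.compile(r"[^\w\s.,!?;:()\"'-]")
# A letter repeated three or more times in a row.
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")


def preprocess(text):
    """Apply the noise-reduction steps to one tweet and return the cleaned text."""
    text = text.lower()
    text = URL_RE.sub("url", text)
    text = EMAIL_RE.sub("email", text)      # before mentions, see lead-in
    text = MENTION_RE.sub("user", text)
    text = HASHTAG_RE.sub("hashtag", text)
    text = PHONE_RE.sub("phone", text)
    text = HAPPY_RE.sub(" happy ", text)
    text = SAD_RE.sub(" sad ", text)
    text = NEUTRAL_RE.sub(" neutral ", text)
    text = SPECIAL_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1\1", text)     # keep two repetitions
    return " ".join(text.split())           # normalize whitespace


if __name__ == "__main__":
    print(preprocess("@Ivan Beeline is sooooo slow :( see https://example.com/x #fail"))
```

Because Python's `\w` is Unicode-aware for string patterns, the same rules apply to Cyrillic text without modification.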

3.2 Classical neural networks

The first group of methods is a set of classical convolutional and LSTM neural networks.

The architecture of the convolutional neural network considered in this paper is based on the approaches of [cliche2017bbtwtr, zhang2015]. The input data is represented as a matrix of size $s \times d$, where $s$ is the number of tokens in the tweet and $d$ is the dimension of the embedding space. The optimal matrix height $s$ was chosen experimentally. If necessary, a sentence is truncated or zero-padded to the required length.

Several convolution operations of various sizes are then applied to this matrix in parallel. A single convolution branch uses a filter matrix whose height $h$ equals the number of words it covers. The output of each branch is then max-pooled, which extracts the most important information for each convolution regardless of the position of the feature in the text. After all convolution operations, the resulting vectors are concatenated and passed to a fully connected layer followed by a softmax layer, which gives the final classification probabilities. In our model, several convolution branches with different window sizes are used. To reduce overfitting, dropout layers were added after the max-pooling and fully connected layers.
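A single convolution branch followed by max-over-time pooling can be sketched in plain Python. This is a simplified illustration with hypothetical function names: it uses one filter, no activation function and toy two-dimensional embeddings, none of which is taken from the study.

```python
def pad_or_truncate(matrix, length, dim):
    """Cut the token matrix to `length` rows or pad it with zero rows."""
    rows = matrix[:length]
    return rows + [[0.0] * dim for _ in range(length - len(rows))]


def conv_branch_max(matrix, kernel, bias=0.0):
    """Slide one filter of height len(kernel) over the token positions
    and return the maximum response (max-over-time pooling)."""
    height = len(kernel)
    best = float("-inf")
    for start in range(len(matrix) - height + 1):
        window = matrix[start:start + height]
        score = bias + sum(
            w * x
            for kernel_row, token_row in zip(kernel, window)
            for w, x in zip(kernel_row, token_row)
        )
        best = max(best, score)
    return best


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # 3 tokens, d = 2
    padded = pad_or_truncate(tokens, 4, 2)          # s = 4
    print(conv_branch_max(padded, [[1.0, 1.0], [1.0, 1.0]]))
```

Because only the maximum response is kept, the pooled value does not depend on where in the tweet the matching window occurs. In a full model, each branch would hold many filters, and the pooled values of all branches would be concatenated.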

The main idea of LSTM recurrent networks is the introduction of a cell state of dimension $n$, equal to the dimension of the network, which runs straight down the entire chain, together with the ability of the LSTM to remove information from or add information to the cell state using special structures called gates. This helps to avoid the exploding and vanishing gradient problems during backpropagation training of recurrent neural networks. In our work, we chose $n$ equal to the size of the token embeddings.
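For reference, the gate mechanism can be written out in the standard LSTM formulation. The equations below are the textbook form and are not taken from the paper; here $x_t$ is the token embedding at step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
```

The additive update of $c_t$ is what lets gradients flow along the chain without repeatedly passing through a squashing nonlinearity.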

In addition, we used the bidirectional LSTM (BiLSTM) model, which combines two LSTMs. The two networks read the sentence in opposite directions, and their cell states are concatenated to obtain a vector of dimension $2n$, where $n$ is the cell-state dimension of each LSTM. As in the LSTM network, this vector is passed to a fully connected layer of size 40 and then through a softmax layer to give the final classification probabilities.

In both LSTM architectures, we reduced overfitting by adding dropout layers before and after the fully connected layer. For all the neural networks described, we used pre-trained Russian FastText embeddings of dimension 300.

3.3 Fine-tuning BERT model

The second group of methods is based on two pre-trained Russian BERT models and several ways of applying BERT [devlin2018bert] to the sentiment analysis task. These can be subdivided into single-sentence classification and the auxiliary-sentence approach [utilizingbertforsa], which converts a sentiment analysis task into a sentence-pair classification problem. This is possible because the BERT input representation can encode either a single sentence or a pair of sentences, the latter in the same way as in the next sentence prediction task.

The single-sentence BERT model uses only the initial sentence as input and is a vanilla BERT model with an additional single linear layer with matrix $W \in \mathbb{R}^{K \times H}$ on top. Here $K$ denotes the number of classes and $H$ the dimension of the hidden state. For the classification task, the input sequence begins with the special token [CLS]. The input representation is constructed by summing the token, segment, and position embeddings of each token in the sequence. The distribution of classification probabilities is calculated using the softmax function.
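The classification head described above can be written compactly. The notation is an assumption made for illustration: $h_{\mathrm{[CLS]}} \in \mathbb{R}^{H}$ is the final hidden state of the [CLS] token, $W \in \mathbb{R}^{K \times H}$ is the matrix of the added linear layer, $K$ is the number of classes, and $e_i$ is the input representation of the $i$-th token.

```latex
\begin{aligned}
e_i &= e_i^{\text{token}} + e_i^{\text{segment}} + e_i^{\text{position}}, \\
p &= \operatorname{softmax}\left(W h_{\mathrm{[CLS]}}\right), \qquad
p_k = \frac{\exp\left((W h_{\mathrm{[CLS]}})_k\right)}{\sum_{j=1}^{K} \exp\left((W h_{\mathrm{[CLS]}})_j\right)} .
\end{aligned}
```

For the three-way sentiment tasks in this study, $K = 3$, and the predicted class is the one with the largest $p_k$.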

The sentence-pair BERT model differs slightly. The input representation converts a pair of sentences into a single sequence of tokens, inserting the special token [SEP] between them. The classification layer is added over the final hidden state of the first token, [CLS].

For the targeted task, there is a label for each object of sentiment analysis in a sentence, so the real name of an entity was replaced by a special token. For example, the initial tweet "Sberbank is a safe place where you can keep your savings" is converted to "MASK is a safe place where you can keep your savings".

Two sentence-pair models use auxiliary sentences and are based on the question answering (QA) and natural language inference (NLI) tasks. The auxiliary sentences for the targeted analysis are as follows:

  • pair-NLI: "The sentiment polarity of MASK is"

  • pair-QA: "What do you think about MASK?"

The answer is expected to be one of Positive, Negative, or Neutral.

In the general sentiment analysis task, there is one label per sentence and no object of sentiment analysis to mask. We therefore proposed assigning the MASK token to the whole sentence, so that the initial sentence "56% of Rambler Group was sold to Sberbank" is converted to "MASK = 56% of Rambler Group was sold to Sberbank". The same auxiliary sentences were constructed for this task.
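The construction of sentence pairs for both the targeted and the general task can be sketched as follows. The function names are hypothetical, and placing the original sentence first and the auxiliary sentence second is an assumption; the two templates and the masking examples are taken from the text.

```python
NLI_TEMPLATE = "The sentiment polarity of MASK is"
QA_TEMPLATE = "What do you think about MASK?"
TEMPLATES = {"nli": NLI_TEMPLATE, "qa": QA_TEMPLATE}


def mask_entity(text, entity):
    """Targeted task: replace the entity name with the MASK token."""
    return text.replace(entity, "MASK")


def mask_whole_sentence(text):
    """General task: assign the MASK token to the whole sentence."""
    return "MASK = " + text


def build_pair(text, mode, entity=None):
    """Return (masked sentence, auxiliary sentence) for a sentence-pair model.

    `mode` is "nli" or "qa". When `entity` is None, the general-task
    masking is applied.
    """
    if entity is not None:
        first = mask_entity(text, entity)
    else:
        first = mask_whole_sentence(text)
    return first, TEMPLATES[mode]


if __name__ == "__main__":
    print(build_pair("Sberbank is a safe place where you can keep your savings",
                     "nli", entity="Sberbank"))
    print(build_pair("56% of Rambler Group was sold to Sberbank", "qa"))
```

A BERT tokenizer would then join the two strings into one sequence with [CLS] at the start and [SEP] between and after them.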

In our study, we compare two different pre-trained BERT models from the DeepPavlov framework [deeppavlov]: RuBERT and Conversational RuBERT.

During the fine-tuning procedure, we set four hyperparameters: the dropout probability, the number of epochs, the initial learning rate, and the batch size.

4 Results

To compare different models, we calculated standard metrics such as accuracy and $F_1$ macro. In addition, we calculated the metrics needed for comparison with the participants of the competitions: $F_1^{+-}$ macro and $F_1^{+-}$ micro, which take into account only the positive and negative classes. All reported results were obtained by averaging over five runs. To distinguish the two pre-trained BERT models, the label (C) is used for Conversational RuBERT.

Model                                         Accuracy  $F_1$ macro  $F_1^{+-}$ macro  $F_1^{+-}$ micro
ROMIP-2013 [chetviorkin2013evaluating]        61.60     62.10        -                 -
SVM                                           69.12     61.63        74.82             75.07
CNN                                           68.57     60.43        73.51             74.55
LSTM                                          73.61     62.31        77.02             78.20
BiLSTM                                        74.14     62.78        77.61             78.94
BERT-single                                   78.90     68.07        84.33             84.45
BERT-pair-QA                                  79.06     68.54        84.33             84.45
BERT-pair-NLI                                 79.68     69.45        84.96             85.08
BERT-single (C)                               79.81     71.12        85.05             85.10
BERT-pair-QA (C)                              78.95     70.16        84.71             84.83
BERT-pair-NLI (C)                             80.28     70.62        85.52             85.68
Table 3: Results on News Quotes Dataset.

4.1 Results on News Quotes Dataset

Table 3 describes results of the models on the ROMIP-2013 news quotes dataset.

Model                                         Accuracy  $F_1$ macro  $F_1^{+-}$ macro  $F_1^{+-}$ micro
SentiRuEval-2015 [loukachevitch2015entity]    -         -            48.80             53.60
SVM                                           62.86     58.29        50.27             54.70
CNN                                           60.80     57.52        49.92             53.23
LSTM                                          64.46     58.94        52.10             56.03
BiLSTM                                        65.54     59.35        53.01             56.83
BERT-single                                   72.48     67.04        58.43             62.53
BERT-pair-QA                                  74.00     67.83        58.15             62.92
BERT-pair-NLI                                 74.66     68.24        59.17             64.13
BERT-single (C)                               76.55     69.12        61.34             66.23
BERT-pair-QA (C)                              76.63     68.54        63.47             67.51
BERT-pair-NLI (C)                             76.40     68.83        63.14             67.45
Manual                                        -         -            70.30             70.90
Table 4: Results on SentiRuEval-2015 Telecom Operators Dataset.

As mentioned above, the participants of the evaluation applied traditional machine learning methods (SVM, Naive Bayes classifier, etc.) and knowledge-based methods with lexicons and rules. The knowledge-based methods achieved the best results. This was explained by the thematic diversity of news quotes: the test collection could contain sentiment words and expressions absent from the training collection.

Model                                         Accuracy  $F_1$ macro  $F_1^{+-}$ macro  $F_1^{+-}$ micro
SentiRuEval-2015 [loukachevitch2015entity]    -         -            36.00             36.60
SVM                                           49.23     43.39        33.08             36.62
CNN                                           47.91     42.87        31.62             34.18
LSTM                                          51.89     44.12        35.85             39.55
BiLSTM                                        53.21     46.43        36.93             40.18
BERT-single                                   83.78     74.57        57.82             60.64
BERT-pair-QA                                  84.24     75.34        56.65             57.41
BERT-pair-NLI                                 85.14     77.59        60.46             63.15
BERT-single (C)                               85.80     78.71        64.90             66.95
BERT-pair-QA (C)                              86.28     78.62        62.37             67.27
BERT-pair-NLI (C)                             86.88     79.51        67.44             70.09
Table 5: Results on SentiRuEval-2015 Banks Dataset.

The current evaluation shows that the task remained difficult even for the models with embeddings (SVM, CNN, LSTM, BiLSTM). Among the traditional neural network approaches, BiLSTM obtained the best results.

Model                                         Accuracy  $F_1$ macro  $F_1^{+-}$ macro  $F_1^{+-}$ micro
SentiRuEval-2016 [loukachevitch2016rubtsova]  -         -            55.94             65.69
SVM                                           65.89     55.34        53.13             65.87
CNN                                           65.28     54.87        52.62             64.40
LSTM                                          66.71     56.74        56.93             67.18
BiLSTM                                        67.30     57.11        57.23             67.93
BERT-single                                   72.85     65.12        60.29             71.70
BERT-pair-QA                                  74.24     66.34        63.86             73.26
BERT-pair-NLI                                 74.51     67.48        62.81             73.39
BERT-single (C)                               75.20     67.89        64.96             73.91
BERT-pair-QA (C)                              75.27     68.11        65.91             74.22
BERT-pair-NLI (C)                             75.71     68.42        66.07             74.11
Table 6: Results on SentiRuEval-2016 Telecom Operators Dataset.

The use of BERT drastically improves the results. Better results are achieved by the Conversational RuBERT models. The best configuration is BERT-pair-NLI, in which an additional MASK token is assigned to the whole sentence and the task is posed as sentence inference.

4.2 Results on Twitter Datasets

Tables 4 and 5 describe the results of the models on the two Twitter datasets of SentiRuEval-2015. A specific feature of this evaluation was the long six-month gap between the download of the training collection and that of the test collection. During this period, Ukraine-related topics appeared in tweets about telecom operators and banks, which led to large differences between the training and test collections.

These differences between the collections showed up in the very low results obtained on the 2015 banks dataset [loukachevitch2015entity]. The task also remained difficult for the current SVM+FastText, CNN, LSTM, and BiLSTM models. Only the BERT-based methods significantly improved the results. Conversational RuBERT in the NLI setting was again the best method.

It is interesting to note that one participant of the SentiRuEval-2015 uploaded manual annotation of the test Telecom dataset and obtained the results described in Table 4 as Manual [loukachevitch2015entity]. It can be seen that the best BERT results are very close to the manual labeling.

Model                                         Accuracy  $F_1$ macro  $F_1^{+-}$ macro  $F_1^{+-}$ micro
SentiRuEval-2016 [loukachevitch2016rubtsova]  -         -            55.17             58.81
SVM                                           66.46     57.85        51.12             53.74
CNN                                           67.15     58.43        52.06             54.96
LSTM                                          70.80     61.17        57.22             59.71
BiLSTM                                        71.44     61.86        58.40             61.06
BERT-single                                   81.20     73.21        68.19             69.56
BERT-pair-QA                                  80.35     72.61        66.61             68.18
BERT-pair-NLI                                 80.91     72.68        65.62             67.65
BERT-single (C)                               80.47     72.59        66.95             69.46
BERT-pair-QA (C)                              82.28     74.06        69.53             71.76
BERT-pair-NLI (C)                             81.28     73.34        65.82             68.03
Table 7: Results on SentiRuEval-2016 Banks Dataset.

Tables 6 and 7 describe results of the models on two Twitter datasets of SentiRuEval-2016. In contrast to previous evaluations, baseline results of the 2016 competition (the best results achieved by participants) are better than the SVM+FastText and CNN models. This is due to the fact that the participants applied neural network architectures with embeddings and combined the SVM method with existing Russian sentiment lexicons [trofimovich2016comparison, loukachevitch2016rubtsova].

5 Analysis of Difficult Examples

The authors of previous Russian sentiment evaluations described examples that were difficult for most participants of the shared tasks [chetviorkin2013evaluating, loukachevitch2015entity, loukachevitch2016rubtsova]. We gathered these examples into a collection of 21 difficult samples, on which we can now compare the performance of the models.

The difficult examples are translated from Russian and can be subdivided into several groups.

The first type of difficulty is the absence from the training collection of a sentiment word or a word with positive or negative connotations, which was a serious problem for previous approaches. One example from this group was again misclassified by all current models:

  • "Sberbank imposes credit cards". (Ex.1)

The following sentence from this group was successfully processed by all models:

  • "In the capital there was a daring robbery of Sberbank". (Ex.2)

The second group comprises examples with complicated word combinations that include words of different sentiments and/or sentiment operators. Of these, the following example was problematic for all models:

  • "Secretary of the Presidium of the General Council of United Russia, State Duma Deputy Chairman Sergei Neverov said on Saturday that the party is not afraid of a split due to the appearance of different ideological platforms in it". (Ex.3)

This sentence contains two negative words and a negation that inverts the negative sentiment to positive: "not afraid of a split". In contrast, the following example was processed correctly by most BERT-based models:

  • "VTB-24 reduced losses in the second quarter". (Ex.4)

The third group includes tweets with irony. The models treated the following example differently:

  • "Sberbank – the largest network of non-working ATMs in Russia". (Ex.5)

The fourth group includes tweets that mention two telecom operators with different sentiment attitudes. In most cases it was difficult for the models to identify the correct sentiment towards each company.

  • "I always said to you that the best operator is Beeline. Megaphone does not respect you". (Ex.6 - Beeline, Ex.7 - Megaphone)

Example  Gold  SVM  CNN  LSTM  BiLSTM  BERT-single  BERT-pair-QA  BERT-pair-NLI  BERT-single (C)  BERT-pair-QA (C)  BERT-pair-NLI (C)
Ex.1     -1    0    0    0     0       0            0             0              0                0                 0
Ex.2     -1    -1   -1   -1    -1      -1           -1            -1             -1               -1                -1
Ex.3     1     -1   -1   -1    -1      -1           -1            -1             -1               -1                -1
Ex.4     1     -1   -1   0     -1      1            -1            1              1                -1                1
Ex.5     -1    -1   -1   -1    -1      0            0             0              0                -1                -1
Ex.6     1     0    0    0     0       -1           -1            -1             -1               -1                -1
Ex.7     -1    0    0    -1    -1      -1           -1            -1             -1               -1                -1
Acc.     -     0.33 0.24 0.48  0.52    0.48         0.53          0.62           0.62             0.57              0.71
Table 8: Analysis of difficult examples. "Acc." means the accuracy of classification on the whole collection of 21 difficult examples.

Table 8 describes the results of the models on the difficult examples. The first value in each example row is the gold label, and the remaining values are the predictions of the models in the same order as in the previous tables. Here -1, 0, and 1 denote negative, neutral, and positive sentiment, respectively. A prediction is correct when it matches the gold label. The last row is the share of correct answers for each model. The best results are achieved by the BERT-pair-NLI model with pre-trained Conversational RuBERT.

6 Related Work

The most recent and largest Russian sentiment dataset is RuSentiment [rogers2018rusentiment], which contains more than 30000 posts from VKontakte (VK), the most popular social network in Russia. Each post is labeled with one of five classes. The authors evaluated several traditional machine learning methods (logistic regression, linear SVM, Gradient Boosting) and neural networks. The best result (71.7 $F_1$ measure) was achieved by a neural network with four fully connected layers and FastText embeddings trained on VKontakte posts. In [yuadaptation], the authors applied multilingual BERT and RuBERT, trained on Russian text collections, to the RuSentiment dataset and obtained an $F_1$ measure of 87.73 with RuBERT.

Another popular dataset for Russian sentiment analysis is a tweet collection with automatic annotations based on emoticons (RuTweetCorp) [rubtsova2015constructing]. This corpus contains more than 200 thousand Twitter messages posted in 2013-2014 and annotated as positive or negative.

In [rubtsova2018reducing], an SVM with Word2Vec embeddings was applied to the RuTweetCorp dataset. The authors of [svetlov2019sentiment] tested LSTM+CNN and BiGRU models on the RuSentiment and RuTweetCorp datasets. Zvonarev and Bilyi [zvonarev2019comparison] compared logistic regression, an XGBoost classifier, and a convolutional neural network on RuTweetCorp and obtained the best results with the CNN.

The authors of [loukachevitch2018extracting] created the RuSentRel corpus, consisting of analytical articles devoted to international relations. The corpus is annotated with sentiment attitudes towards the named entities mentioned. Rusnachenko et al. [rusnachenko2019distant] study the extraction of sentiment attitudes using a CNN and a distant supervision approach on the RuSentRel corpus.

7 Conclusion

In this study, we tested standard neural network architectures (CNN, LSTM, BiLSTM) and recently introduced BERT models on datasets from previous Russian sentiment evaluations. We applied not only the vanilla BERT classification approach, but also reformulations of the classification task as question-answering (QA) and natural language inference (NLI) tasks. We also compared two variants of Russian BERT and showed that, for all sentiment tasks in this study, the conversational variant of Russian BERT is better.

The best results were mostly achieved by the BERT-NLI model. In one of the tasks, this model nearly reached the human level of sentiment analysis.

The source code and all sentiment datasets used in this work are publicly available.


The reported study was funded by RFBR according to the research project № 20-07-01059.