Attention-Based Neural Networks for Sentiment Attitude Extraction using Distant Supervision

by   Nicolay Rusnachenko, et al.
Mail.Ru Group

In the sentiment attitude extraction task, the aim is to identify "attitudes": sentiment relations between entities mentioned in text. In this paper, we provide a study on attention-based context encoders in the sentiment attitude extraction task. For this task, we adapt attentive context encoders of two types: (1) feature-based; (2) self-based. In our study, we utilize the corpus of Russian analytical texts RuSentRel and the automatically constructed news collection RuAttitudes for enriching the training set. We consider the problem of attitude extraction as two-class (positive, negative) and three-class (positive, negative, neutral) classification tasks for whole documents. Our experiments with the RuSentRel corpus show that the three-class classification models that employ the RuAttitudes corpus for training gain about 10% in F1, with an additional 3% when the model architecture includes an attention mechanism. We also provide an analysis of attention weight distributions depending on the term type.




1. Introduction

Classifying relations between entities mentioned in texts remains one of the difficult tasks in natural language processing (NLP). Sentiment attitude extraction aims to find positive/negative relations between objects expressed as named entities in texts (rusnachenko2018neural). For example, in Figure 1 the named entities "Russia" and "NATO" have a negative attitude towards each other, while other named entities are also mentioned in the same context.

Context: При этом [Москва] неоднократно подчеркивала, что ее активность на [балтике] является ответом именно на действия [НАТО] и эскалацию враждебного подхода к [России] вблизи ее восточных границ
Translation: Meanwhile [Moscow] has repeatedly emphasized that its activity in the [Baltic Sea] is a response precisely to the actions of [NATO] and the escalation of the hostile approach to [Russia] near its eastern borders
Attitudes: NATO→Russia: neg; Russia→NATO: neg
Figure 1. Example of a context with attitudes mentioned in it; the named entities "Russia" and "NATO" have a negative attitude towards each other, and other named entities are mentioned as well (entities are shown in square brackets).

When extracting relations from texts, one encounters the complexity of the sentence structure; sentences can contain many named entity mentions; a single opinion might comprise several sentences.

This paper is devoted to the study of models for targeted sentiment analysis with attention. The intuition exploited in the models with attentive encoders is that only some terms in the context are relevant for attitude indication. The interactions of words, not just their isolated presence, may reveal the specificity of contexts with attitudes of different polarities. We additionally use the distant supervision (DS) technique (mintz2009distant) to fine-tune the attention mechanism by providing relevant contexts with words that indicate the presence of an attitude. Our contribution in this paper is three-fold:

  • We apply attentive encoders based on (1) attitude participants and (2) context itself;

  • We conduct experiments on the RuSentRel (loukachevitch2016creating) collection using the distant supervision technique in the training process. The results demonstrate that the application of attention-based encoders enhances quality by 3% in the three-class classification task;

  • We provide an analysis of the weight distribution to illustrate the influence of distant supervision on the selection of informative terms.

2. Related Work

In previous works, various neural network approaches for targeted sentiment analysis were proposed. In (rusnachenko2018neural), the authors utilize convolutional neural networks (CNN). Considering relation extraction as a three-scale classification task of contexts with attitudes in them, the authors subdivide each context into outer and inner parts (relative to the attitude participants) to apply a Piecewise-CNN (PCNN) (zeng2015distant). The latter architecture utilizes a specific idea of the max-pooling operation. Originally, this operation extracts the maximal value within each convolution. However, for relation classification, it reduces information extremely rapidly and blurs significant aspects of the context parts. In the case of PCNN, separate max-pooling operations are applied to the outer and inner contexts. In the experiments, the authors observed a fast training process and a slight improvement in the PCNN results in comparison to CNN.

In (shen-huang-2016-attention), the authors proposed an attention-based CNN model for semantic relation classification (hendrickx2009semeval). The authors utilized the attention mechanism to select the most relevant context words with respect to the participants of a semantic relation. The architecture of the attention model is a multilayer perceptron (MLP), which calculates the weight of a word in context with respect to the entity. The resulting AttCNN model outperformed several CNN and LSTM based approaches by F1-measure.

In (zhou2016attention; yang2016hierarchical), the authors experimented with self-based attention models, in which targets are adapted automatically during the training process. The authors considered attention as a quantification of context words with respect to abstract targets. In (yang2016hierarchical), the authors also brought a similar idea to the sentence level. The resulting hierarchical model was called HAN.

In (rusnachenko2019distant), the authors apply the distant supervision (DS) approach to automatically develop a collection for the sentiment attitude extraction task in the news domain. A combination of two labeling methods, (1) pair-based and (2) frame-based, was used to perform context labeling. The developed collection was called RuAttitudes. Experimenting with the RuSentRel corpus, the authors consider the problem of sentiment attitude extraction as a two-class classification task and report a 13.4% increase by F1 when models are trained with RuAttitudes over models whose training relies on supervised learning only.

For Russian, Arkhipenko et al. (arkhipenko2016comparison) compared neural architectures for entity-related tweet sentiment classification; they found that the best results were obtained with the GRU neural model (cho2014learning). The authors of (rogers2018rusentiment) annotated thousands of social media posts in Russian with three sentiment categories and compared several baseline classification methods, obtaining the best results with a four-layer neural model with non-linear activations between layers. These results were improved in (kuratov2019adaptation), where the authors applied the BERT model trained on Russian data (RuBERT). Tutubalina et al. (tutubalina2020russian) compared several neural network models to extract positive or negative adverse drug reactions in Russian social network texts.

3. Resources

In our study we utilize the following collections: (1) RuSentRel as a source of texts with manually provided attitude labeling, and (2) the automatically developed RuAttitudes collection, which addresses the lack of training examples in RuSentRel.

We also use two Russian sentiment resources: the RuSentiLex lexicon (loukachevitch2016creating), which contains words and expressions of the Russian language with sentiment labels, and the RuSentiFrames lexicon (rusnachenko2019distant), which provides several types of sentiment attitudes for situations associated with specific Russian predicates.

3.1. RuSentRel collection

We consider sentiment analysis of Russian analytical articles collected in the RuSentRel corpus (loukachevitch2018extracting). The corpus comprises texts in the international politics domain and contains a lot of opinions. The articles are labeled with annotations of two types: (1) the author's opinion on the subject matter of the article; (2) the attitudes between the participants of the described situations. The annotation of the latter type includes 2000 relations across 73 large analytical texts. Annotated sentiments can be only positive or negative. Additionally, each text is provided with an annotation of the mentioned named entities. Synonyms and variants of named entities are also given, which makes it unnecessary to resolve the coreference of named entities.

3.2. RuSentiFrames lexicon

The RuSentiFrames lexicon describes sentiments and connotations conveyed with a predicate in a verbal or nominal form (rusnachenko2019distant), such as "осудить, улучшить, преувеличить" (to condemn, to improve, to exaggerate), etc. The structure of the frames in RuSentiFrames comprises: (1) the set of predicate-specific roles; (2) frame dimensions such as the attitude of the author towards the participants of the situation, attitudes between the participants, and effects for the participants. Currently, RuSentiFrames contains frames for more than 6 thousand words and expressions.

Frame "Одобрить" (Approve)
Dimension | Entries
roles | A0: who approves; A1: what is approved
polarity | A0→A1, pos, 1.0; A1→A0, pos, 0.7
effect | A1, pos, 1.0
state | A0, pos, 1.0; A1, pos, 1.0
Table 1. Example description of the frame "Одобрить" (Approve) in the RuSentiFrames lexicon.

In RuSentiFrames, individual semantic roles are numbered, beginning with zero. For a particular predicate entry, Arg0 is generally the argument exhibiting features of a Prototypical Agent, while Arg1 is a Prototypical Patient or Theme (dowty1991thematic). In the main part of the frame, the most applicable dimension for the current study is the polarity of Arg0 with respect to Arg1 (A0→A1). Table 1 provides an example of the frame "одобрить" (to approve).

3.3. RuAttitudes

RuAttitudes (rusnachenko2019distant) is a corpus of news texts automatically labeled using the distant supervision approach. These are news stories from specialized political sites and Russian sites of world-known news agencies published in 2017. The news texts are annotated with attitudes between participants, whose sentiments can be only positive or negative. In comparison with RuSentRel, the RuAttitudes corpus includes  K attitudes gathered across  K news texts.

Every news text is presented as a sequence of its contexts, where the first context is the news title and the others are sentences of the news content. For a particular news story, the RuAttitudes corpus keeps only those contexts that have at least one attitude mentioned in them. Each context is presented as a sequence of words with named entity markup. As mentioned in Section 2, the authors applied two factors, (1) Pair-based and (2) Frame-based, to establish the presence and the sentiment polarity of an attitude, which is described by a pair of mentioned named entities.

Title: Маккейн: [США] продолжат поддержку [Грузии]
(McCain: [USA] will continue supporting [Georgia])

Sentence 5: «[США] и далее продолжат поддержку свободы, суверенитета и территориальной целостности [Грузии] в рамках международно признанных границ страны», – сказал он.
("[USA] will continue to support the freedom, sovereignty and territorial integrity of [Georgia] within the internationally recognized borders of the country", he said.)

Sentence 11: 29 декабря премьер-министр [Квирикашвили] сообщил, что правительство [Грузии] установило первые контакты с новой администрацией [США].
(On December 29, Prime Minister [Kvirikashvili] reported that the government of [Georgia] had established first contacts with the new [USA] administration.)

Figure 2. Example of a news description (#11323) from the RuAttitudes-1.1 collection; it illustrates the attitude USA→Georgia, which is annotated by the Frame-based and Pair-based factors in the news title, with the corresponding appearance of the pair in sentences #5 and #11 of the news content (entities are shown in square brackets).

The Pair-based factor performs annotation using a list of entity pairs with preassigned sentiment polarities. In turn, the Frame-based factor utilizes information from the RuSentiFrames lexicon (Section 3.2) to perform annotation. A context is retrieved only when both factors are met. For this reason, the specifics of the Frame-based factor are worth mentioning. A pair of neighbouring named entities is considered as having a sentiment attitude when a news title has the following structure:

[Subject] ... {frame}_k ... [Object]

where k corresponds to the size of the non-empty set of frame entries between the subject and the object. The sentiment score is considered positive when all the frame entries of the set are equally positive in terms of A0→A1 polarity values. Otherwise, the sentiment is considered negative. The annotated attitude is then utilized in news content filtering: sentences that have no subject and object entries of the related attitude are discarded. Figure 2 provides an example of a news text in which the attitude is annotated by the Frame-based factor as positive: all the frames mentioned between the attitude ends (to continue, to support) convey the same positive A0→A1 polarity.
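The Frame-based rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `frame_based_label` and `keep_sentence`, the string labels, and the reading that a sentence is kept only when it mentions both attitude ends are hypothetical choices for this example, not the authors' implementation.

```python
from typing import Optional, Sequence


def frame_based_label(frame_polarities: Sequence[str]) -> Optional[str]:
    """Label a title of the form Subject ... {frames} ... Object.

    frame_polarities holds the A0->A1 polarity ("pos" or "neg") of every
    frame entry found between the subject and the object (assumed input).
    """
    if not frame_polarities:
        return None  # empty frame set: the factor does not apply
    if all(p == "pos" for p in frame_polarities):
        return "pos"  # every frame is positive
    return "neg"  # any other combination is treated as negative


def keep_sentence(entities: Sequence[str], subject: str, obj: str) -> bool:
    """Keep a content sentence only if it mentions both attitude ends (assumed reading)."""
    return subject in entities and obj in entities


# Title from Figure 2: frames "to continue" and "to support", both positive.
title_label = frame_based_label(["pos", "pos"])
```

With the title from Figure 2, both frames are positive, so the sketch labels the USA→Georgia attitude as positive; a single negative frame would flip the label to negative.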

Figure 3. General context-based three-scale (positive, negative, neutral) classification model; the "Attention-Based Context Encoder" block is detailed in Sections 5 and 6.

4. Model

In this paper, the problem of sentiment attitude extraction is treated as a classification task of two types: two-scale and three-scale. Given a pair of named entities, we predict a sentiment label of the pair, which could be as follows:

  • sentiment, i.e. positive or negative (two-scale classification format);

  • sentiment or neutral (three-scale classification format).

As the RuSentRel corpus provides opinions with positive or negative sentiment labels only (Section 3), we automatically added neutral sentiments for all pairs that are not mentioned in the annotation and co-occur in the same sentences of the collection texts.

Original: Говорить о разделении [кавказского региона] из-за конфронтации [России] и [Турции] пока не приходится, хотя опасность есть.
Translation: Talking about the separation of the [Caucasus region] due to the confrontation between [Russia] and [Turkey] is not necessary, although there is a danger.
Terms: Talking about the separation of the [masked entity] due to the confrontation between [masked entity] and [masked entity] is not-necessary <COMMA> although there is a danger <DOT>
Figure 4. An example of processing a context into a sequence of terms; attitude participants and other mentioned entities become masked; frames are italic and optionally suffixed with the sentiment value of the A0→A1 polarity.

We consider a context as a text fragment that is limited by a single sentence and includes a pair of named entities. The general architecture is presented in Figure 3, where the sentiment is extracted from the context. To present a context, we treat the original text as a sequence of terms of limited length, with a limit on the distance (in terms) between the attitude participants. Each term belongs to one of the following groups: entities, frames, tokens, and words (if none of the prior groups has been matched).

We use masked representations for the attitude participants (subject and object) and for other mentioned named entities to prevent models from capturing entity-related information. To represent frames, we combine a frame entry with the corresponding A0→A1 sentiment polarity value (and neutral if the latter is absent). We also invert the sentiment polarity when an entry is preceded by "не" (not). The tokens group includes punctuation marks, numbers, and url-links. Each term of the words group is considered in a lemmatized form.

Figure 4 provides an example of processing a context into a sequence of input terms. All frame entries are encoded with the negative A0→A1 polarity: "конфронтация" (confrontation) has a negative polarity, and "не приходится" (not necessary) has the positive polarity of the entry "necessary", which is inverted due to the preceding "not".

To represent the context in a model, each term is embedded with a vector of fixed dimension. The sequence of embedded vectors is denoted as the input embedding. Sections 5 and 6 describe the encoder implementations in detail. In particular, each encoder relies on the input embedding and generates an output context embedding vector $s$.

In order to determine a sentiment class from the context embedding $s$, we apply (1) the hyperbolic tangent activation function to $s$ and (2) a transformation through a fully connected layer:

$$r = W_r \cdot \tanh(s) + b_r \qquad (1)$$

In Formula 1, $W_r \in \mathbb{R}^{c \times z}$ and $b_r \in \mathbb{R}^{c}$ correspond to the hidden states; $z$ corresponds to the size of vector $s$, and $c$ is the number of classes. Finally, the result $o$ is an output vector of probabilities, which is computed by the softmax function:

$$o_i = \frac{\exp(r_i)}{\sum_{j=1}^{c} \exp(r_j)} \qquad (2)$$
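As a rough illustration of this classification step, the following NumPy sketch applies the hyperbolic tangent, a fully connected layer, and a softmax to a context embedding. The names `classify`, `W_r`, `b_r` and the toy sizes are assumptions made for this example; they are not taken from the authors' implementation.

```python
import numpy as np


def softmax(r: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(r - np.max(r))
    return e / e.sum()


def classify(s: np.ndarray, W_r: np.ndarray, b_r: np.ndarray) -> np.ndarray:
    """Map a context embedding s (size z) to class probabilities (size c)."""
    r = W_r @ np.tanh(s) + b_r  # fully connected layer over the activated embedding
    return softmax(r)


rng = np.random.default_rng(0)
z, c = 8, 3  # toy sizes (assumed): embedding size and number of classes
s = rng.normal(size=z)
W_r = rng.normal(size=(c, z))
b_r = np.zeros(c)
probs = classify(s, W_r, b_r)  # one probability per class
```

The predicted class is then simply the index of the largest probability, for example `int(np.argmax(probs))`.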


5. Feature Attentive Context Encoders

In this section, we consider features: context terms that are significant for attitude identification and with respect to which we quantify the relevance of each term in the context. For a particular context, we select the embedded values of the attitude participants (subject and object) as features.

Figure 5 illustrates a feature-based encoder (huang2016attention). Formulas 3–5 describe the quantification process of a context embedding with respect to a particular feature $f$. Given the $i$-th embedded term $x_i$, we concatenate its representation with $f$ (Formula 3): $h_i = [x_i, f]$.

Figure 5. Feature-attentive context encoder architecture, based on the AttCNN model (huang2016attention).

The relevance of $x_i$ with respect to $f$ is then calculated from $h_i$ by a multilayer perceptron (Formula 4), which is parameterized by a weight matrix and an attention matrix; the size of the hidden representation in the weight matrix is a parameter of the model. To deal with normalized weights within a context, we transform the quantified values into probabilities $\alpha_i$ with the softmax function (Formula 2). We utilize Formula 5 to obtain the attention-based embedding of a context with respect to feature $f$ as the weighted sum of its term embeddings: $\hat{s}_f = \sum_i \alpha_i x_i$.

Applying Formula 5 to each feature results in a sequence of vectors, one per feature. We use average-pooling to transform this sequence into a single averaged vector.

We also utilize a "CNN encoder" block (Figure 5) in order to compose the context representation. The resulting context embedding vector is the concatenation of the CNN-based representation and the averaged attention-based vector (Formula 6).
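The feature-attentive encoder described above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the tanh activation inside the perceptron, the parameter names (`W_we`, `b_we`, `w_a`, `b_a`), the helper names, and the toy sizes are all assumptions for illustration, and the CNN output is replaced by a random stand-in vector.

```python
import numpy as np


def softmax(u: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(u - np.max(u))
    return e / e.sum()


def feature_attention(X, f, W_we, b_we, w_a, b_a):
    """Attend over term embeddings X (n, m) with respect to one feature f (m,)."""
    n = X.shape[0]
    H = np.concatenate([X, np.tile(f, (n, 1))], axis=1)  # h_i = [x_i, f], shape (n, 2m)
    u = np.tanh(H @ W_we + b_we) @ w_a + b_a              # MLP relevance scores, shape (n,)
    alpha = softmax(u)                                     # normalized weights
    return alpha @ X, alpha                                # weighted sum of terms, weights


def encode(X, features, s_cnn, params):
    """Average the per-feature attended vectors and append them to the CNN vector."""
    attended = [feature_attention(X, f, *params)[0] for f in features]
    s_att = np.mean(attended, axis=0)
    return np.concatenate([s_cnn, s_att])


rng = np.random.default_rng(1)
n, m, hidden = 5, 4, 6                       # toy sizes (assumed)
X = rng.normal(size=(n, m))                  # input embedding of one context
params = (rng.normal(size=(2 * m, hidden)), np.zeros(hidden),
          rng.normal(size=hidden), 0.0)
features = [X[1], X[3]]                      # embeddings of the two attitude participants
s_cnn = rng.normal(size=7)                   # stand-in for the CNN encoder output
s = encode(X, features, s_cnn, params)
_, alpha = feature_attention(X, features[0], *params)
```

Here `alpha` holds one weight per context term, which is the quantity whose distribution is analyzed per term group later in the paper.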


Structurally, a convolutional neural network based encoder is a sequence of the following transformations: convolutions and pooling. Figure 6 provides a detailed comparison between the classic convolutional neural network (CNN, Figure 6(a)) and the piecewise convolutional neural network (PCNN, Figure 6(b)).

We start with the convolution operation, which is the same across all the encoders of Figure 6. Let $x_{i:j}$ be the concatenation of consecutive embedding vectors from the $i$-th to the $j$-th position. Applying a filter $w \in \mathbb{R}^{d \cdot m}$ to such concatenations is a sequence convolution, where $d$ is the filter window size and $m$ corresponds to the embedding vector size. To calculate a convolved value $c_j$, we apply scalar multiplication as follows:

$$c_j = w \cdot x_{j:j+d-1} \qquad (7)$$

To get multiple feature combinations, a set of $t$ different filters $\{w_1, \dots, w_t\}$ is applied to the input embedding. This leads to a modified version of Formula 7 with the introduced layer index $k$:

$$c_{k,j} = w_k \cdot x_{j:j+d-1}, \quad 1 \le k \le t \qquad (8)$$

Denoting $c_k = \{c_{k,1}, c_{k,2}, \dots\}$ in Formula 8, we reduce the latter by index $j$ and compose the convolution matrix $C = \{c_1, \dots, c_t\}$, with one row per filter and one column per window position.

Max-pooling is an operation that reduces values by keeping the maximum. In the original CNN architecture (Figure 6(a)), max-pooling is applied separately to each convolution layer $c_k$, which results in a single value $\max(c_k)$ per layer. It reduces the convolved information quite rapidly, which is not appropriate for the attitude classification task. To keep context features that are inside and outside of the attitude entities, the authors of (zeng2015distant) perform piecewise max-pooling (Figure 6(b)). Given the attitude entities as borders, we divide each $c_k$ into left, inner and right segments $\{c_{k,1}, c_{k,2}, c_{k,3}\}$. Then max-pooling is applied to each segment separately:

$$p_{k,q} = \max(c_{k,q}), \quad 1 \le q \le 3 \qquad (9)$$

Thus, for each $c_k$ we have a set $p_k = \{p_{k,1}, p_{k,2}, p_{k,3}\}$. The concatenation of these sets over all layers results in $p \in \mathbb{R}^{3t}$, which is the result of the piecewise max-pooling operation.
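To make the difference between the two pooling schemes concrete, here is a small NumPy sketch of the convolution over term windows followed by classic and piecewise max-pooling. The function names, the absence of padding, and the way segment borders are passed (`left_end`, `right_start`) are assumptions for this illustration only.

```python
import numpy as np


def convolve(X: np.ndarray, filters: np.ndarray, d: int) -> np.ndarray:
    """Apply t filters of size d*m to every window of d consecutive term vectors."""
    n, m = X.shape
    windows = np.stack([X[j:j + d].reshape(-1) for j in range(n - d + 1)])  # (n-d+1, d*m)
    return filters @ windows.T  # convolution matrix, shape (t, n-d+1)


def max_pool(conv: np.ndarray) -> np.ndarray:
    """Classic CNN pooling: one maximum per convolution layer."""
    return conv.max(axis=1)


def piecewise_max_pool(conv: np.ndarray, left_end: int, right_start: int) -> np.ndarray:
    """PCNN pooling: one maximum per (layer, segment); requires 0 < left_end < right_start < width."""
    segments = (conv[:, :left_end], conv[:, left_end:right_start], conv[:, right_start:])
    pooled = [seg.max(axis=1) for seg in segments]  # three vectors of size t
    return np.stack(pooled, axis=1).reshape(-1)     # concatenated per layer, size 3t


# Toy convolution matrix with t = 2 layers and 6 window positions.
conv = np.arange(12).reshape(2, 6)
classic = max_pool(conv)
piecewise = piecewise_max_pool(conv, left_end=2, right_start=4)

rng = np.random.default_rng(2)
X = rng.normal(size=(10, 4))        # 10 terms, embedding size 4 (toy values)
filters = rng.normal(size=(5, 12))  # 5 filters for window size d = 3
C = convolve(X, filters, d=3)
```

On the toy matrix, classic pooling keeps only two values (one per layer), while the piecewise variant keeps six, so information from the left, inner, and right parts of the context survives pooling.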

Figure 6. Comparison of CNN-based context encoders: (a) CNN; (b) PCNN (zeng2015distant). The illustrated convolutional filter window has a size of 3.

6. Self Attentive Context Encoders

In Section 5, the application of attention in context embedding fully relies on the sequence of predefined features. The quantification of context terms is performed with respect to each feature. In turn, the self-attentive approach quantifies a context with respect to an abstract parameter. Unlike the quantification methods in feature-attentive embedding models, here the feature is replaced with a hidden state, which is modified during the training process.

To learn the hidden term semantics for each input, we utilize the LSTM (hochreiter1997long) recurrent neural network architecture, which addresses learning long-term dependencies by avoiding the vanishing and exploding gradient problems. The calculation of the hidden representation of the $i$-th embedded term is based on the prior state, where the latter acts as a parameter of auxiliary functions (hochreiter1997long). Figure 7 illustrates the attention-based sentence encoder architecture, built on top of a BiLSTM: a bi-directional LSTM that produces a pair of hidden state sequences, one for the forward and one for the backward direction. The resulting context representation $H$ is composed as the elementwise concatenation of the two sequences. The quantification of the hidden term representations with respect to the hidden state is described in Formulas 10–11.

Figure 7. Self-attentive context encoder architecture, with the self-attention module of the Att-BLSTM model (zhou2016attention) over a bi-directional LSTM encoder.

In order to deal with normalized weights, we transform the quantified values into probabilities $\alpha$ with the softmax function (Formula 2). The resulting context embedding vector is an activated weighted sum of the context hidden states: $s = \tanh(H \alpha)$.
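The self-attention step can be sketched as follows. The sketch follows the general scheme of the Att-BLSTM attention module (tanh over hidden states, a dot product with a trainable vector, softmax, and an activated weighted sum); the exact form of the authors' Formulas 10–11 is not recoverable from the text, so the scoring step, the name `w`, and the toy sizes are assumptions, and the BiLSTM outputs are replaced by random stand-in values.

```python
import numpy as np


def softmax(u: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(u - np.max(u))
    return e / e.sum()


def self_attention(H: np.ndarray, w: np.ndarray):
    """Attend over hidden states H (n, k) using a trainable vector w (k,)."""
    M = np.tanh(H)             # activated hidden states
    u = M @ w                  # relevance of every term with respect to w
    alpha = softmax(u)         # normalized attention weights
    s = np.tanh(alpha @ H)     # activated weighted sum of hidden states
    return s, alpha


rng = np.random.default_rng(3)
n, k = 6, 8                    # toy sizes (assumed): terms, size of concatenated states
H = rng.normal(size=(n, k))    # stand-in for BiLSTM outputs
s, alpha = self_attention(H, rng.normal(size=k))

# With a zero vector w every term receives the same weight.
s_uniform, alpha_uniform = self_attention(H, np.zeros(k))
```

The zero-vector case is a handy sanity check: attention degenerates to plain averaging of the hidden states, which is the behaviour of an untrained attention module.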


7. Model Details

Model | DS | Two-scale F1(avg) | F1(cv1) | F1(cv2) | F1(cv3) | F1(test) | Three-scale F1(avg) | F1(cv1) | F1(cv2) | F1(cv3) | F1(test)
Att-BLSTM | yes | 0.667 | 0.71 | 0.62 | 0.67 | 0.68 | 0.332 | 0.36 | 0.33 | 0.31 | 0.38
BiLSTM | yes | 0.653 | 0.70 | 0.60 | 0.66 | 0.70 | 0.312 | 0.34 | 0.31 | 0.29 | 0.39
Att-BLSTM | no | 0.640 | 0.69 | 0.60 | 0.64 | 0.68 | 0.314 | 0.35 | 0.27 | 0.32 | 0.32
BiLSTM | no | 0.632 | 0.66 | 0.63 | 0.61 | 0.67 | 0.286 | 0.32 | 0.26 | 0.28 | 0.34
AttPCNN | yes | 0.644 | 0.67 | 0.61 | 0.65 | 0.66 | 0.312 | 0.33 | 0.30 | 0.31 | 0.41
PCNN | yes | 0.599 | 0.70 | 0.53 | 0.57 | 0.63 | 0.315 | 0.33 | 0.30 | 0.31 | 0.40
AttPCNN | no | 0.617 | 0.64 | 0.56 | 0.65 | 0.67 | 0.297 | 0.32 | 0.29 | 0.28 | 0.35
PCNN | no | 0.608 | 0.62 | 0.58 | 0.63 | 0.66 | 0.285 | 0.29 | 0.27 | 0.30 | 0.32
AttCNN | yes | 0.631 | 0.64 | 0.64 | 0.62 | 0.66 | 0.316 | 0.35 | 0.29 | 0.30 | 0.41
CNN | yes | 0.625 | 0.62 | 0.63 | 0.63 | 0.68 | 0.305 | 0.31 | 0.30 | 0.31 | 0.40
AttCNN | no | 0.636 | 0.66 | 0.64 | 0.61 | 0.62 | 0.270 | 0.33 | 0.23 | 0.25 | 0.30
CNN | no | 0.553 | 0.60 | 0.56 | 0.51 | 0.59 | 0.274 | 0.30 | 0.26 | 0.26 | 0.31
Table 2. Context classification results (Two-scale and Three-scale experiments) by the F1 measure over the RuSentRel collection. The models are separated into the following groups (from top to bottom): BiLSTM, PCNN, CNN. Models that employ RuAttitudes in the training process (DS mode) are labeled with "yes" in the DS column. The columns of each experiment are (from left to right): (1) the average value in the CV-3 experiment, F1(avg), with the results on each split, F1(cv1) to F1(cv3); (2) the result on the train/test separation, F1(test).

We provide embedding details of the context term groups described in Section 4. For words and frames, we look up vectors in a precomputed and publicly available embedding model trained on news articles. Each term that is not present in the model is treated as a sequence of parts (n-grams), and we look up the related vectors in the model to compose an averaged vector. For a particular part, we start with trigrams (n = 3) and decrease n until the related n-gram is found. For masked entities (attitude participants and other mentioned entities) and tokens, each element is embedded with a vector of size 1000; every vector is randomly initialized from a Gaussian distribution.

Each context term has been additionally expanded with the following parameters:

  • Distance embedding (rusnachenko2018neural) is a vectorized distance in terms from each attitude participant of the entry pair (subject and object respectively) to a given term;

  • Closest to synonym distance embedding is a vectorized absolute distance in terms from a given term to the nearest entity that is synonymous to the subject and to the object respectively;

  • Part-of-speech embedding is a vectorized tag for words (for terms of other groups the "unknown" tag is used);

  • A0→A1 polarity embedding is a vectorized "positive" or "negative" value for frame entries whose description in RuSentiFrames provides the corresponding polarity (otherwise the "neutral" value is used); the polarity is inverted when an entry is preceded by "не" (not).

7.1. Training

This process assumes hidden parameter optimization of a given model. We utilize the algorithm described in (rusnachenko2018neural). The input is organized in minibatches, where each minibatch consists of bags. Each bag is a set of pairs, where each pair is described by an input embedding with the related label. The training process is iterative, and each iteration includes the following steps in order to calculate the vector of bag-level losses and update the hidden states.

The first step composes a minibatch, which consists of bags of a fixed size. Then we perform a forward propagation through the network, which results in a vector of outputs, one per context of the minibatch. In the third step we calculate the cross entropy loss for each output $o_j$ with the one-hot label $y_j$ over $c$ classes as follows:

$$L_j = -\sum_{k=1}^{c} y_{j,k} \log o_{j,k}$$

In the final step we compose a vector whose $i$-th component corresponds to the maximal cross entropy loss within the related $i$-th bag:

$$\mathrm{cost}_i = \max_{j \in \mathrm{bag}_i} L_j$$
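The last two steps of an iteration can be illustrated with a short NumPy sketch: a per-context cross entropy followed by a maximum within each bag. The function name `bag_max_losses`, the assumption that contexts of one bag are stored contiguously, and the toy probabilities are made up for this example.

```python
import numpy as np


def bag_max_losses(probs: np.ndarray, labels: np.ndarray, bag_size: int) -> np.ndarray:
    """Return the maximal cross entropy loss within every bag.

    probs:  (contexts, classes) softmax outputs of the network
    labels: (contexts,) integer class labels
    Contexts of one bag are assumed to be stored contiguously.
    """
    ce = -np.log(probs[np.arange(labels.size), labels])  # loss of every context
    return ce.reshape(-1, bag_size).max(axis=1)          # worst context per bag


# Toy minibatch: 2 bags with 2 contexts each, 3 classes.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.30, 0.30, 0.40],
                  [0.25, 0.50, 0.25]])
labels = np.array([0, 1, 2, 1])
losses = bag_max_losses(probs, labels, bag_size=2)
```

Taking the maximum means that each bag contributes the loss of its hardest context, so an optimizer step focuses on the least well classified context of every bag.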


7.2. Parameters settings

The minibatch size () is set to , where contexts count per bag is set to . All the contexts were limited by terms, with the distance between attitude participants limited to terms. For embedding parameters (Section 7) we use vectors with size of . For CNN and PCNN context encoders, the size of convolutional window () and filters count (c) were set to and respectively. As for parameters related to sizes of hidden states in Sections 5 and 6: , . We utilize the AdaDelta optimizer with parameters and  (zeiler2012adadelta). To prevent models from overfitting, we apply dropout to the output with keep probability set to . For hidden state values initialization we utilize the Xavier weight initializer (glorot2010understanding).

8. Experiments

According to Section 4, we treat sentiment attitude extraction as a classification task with different numbers of output classes. We train and evaluate all the models in the following experiments:

  1. Two-scale (rusnachenko2019distant), in which all the models have to predict a sentiment label of an attitude in context. It is important to note that for each document we consider only those attitudes that fit in a context;

  2. Three-scale (rusnachenko2018neural), in which each model classifies a given context with an attitude in it as sentiment-oriented (positive/negative) or neutral.

Note that the evaluation process in the Two-scale experiment compares only those pairs that can be found within a context of the related document.

8.1. Datasets and Evaluation formats

The evaluation in experiments has been performed over the RuSentRel corpus, using the following formats:

  1. CV-based format, which utilizes 3-fold cross-validation (CV); all folds are equal in terms of sentence count;

  2. Fixed format, in which the predefined separation of documents into train/test sets is considered.

For evaluating models in this task, we adopt the macro-averaged F1-score (F1) over documents. The F1-score is computed as the average over the positive and negative classes, which are the most important in attitude analysis.
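One possible reading of this metric is sketched below: for every document, F1 is computed separately for the positive and negative classes and averaged, and the document scores are then averaged over the collection. The exact aggregation order used by the authors is not stated here, so this order, the label strings, and the function names are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def f1_for_class(gold: Sequence[str], pred: Sequence[str], label: str) -> float:
    """F1 of one class over the attitude pairs of a single document."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def document_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Average of the positive-class and negative-class F1 (neutral is ignored)."""
    return (f1_for_class(gold, pred, "pos") + f1_for_class(gold, pred, "neg")) / 2


def macro_f1(documents: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Average the document-level scores over all documents."""
    return sum(document_f1(g, p) for g, p in documents) / len(documents)


documents = [
    (["pos", "neg", "neu"], ["pos", "neg", "neu"]),  # everything correct
    (["pos", "neg"], ["neg", "pos"]),                # both sentiment labels swapped
]
score = macro_f1(documents)
```

In the toy example the first document scores 1.0 and the second 0.0, so the macro-average is 0.5; a correctly predicted neutral pair does not raise the score, since only the positive and negative classes are averaged.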

8.2. Model Comparisons and Training

In terms of architecture, all the models differ only in the sentence encoder implementation of a single context classification model (Figure 3). The list of the models selected for the experiments is as follows:

  • CNN model with a classic convolutional neural network architecture (Figure 6(a));

  • PCNN model, in which the encoder treats each convolution layer in parts, relative to the positions of the attitude participants in the context (Figure 6(b));

  • AttCNN and AttPCNN are models with feature attentive encoders (Section 5), where the set of attitude participants (subject and object) is used as features;

  • BiLSTM is a bi-directional LSTM (hochreiter1997long);

  • Att-BLSTM model (Section 6).

For a particular model, the training (and related evaluation) process has been performed in the following modes:

  1. DS, an application of distant supervision, in which the RuSentRel and RuAttitudes collections are combined;

  2. SL, supervised learning, using RuSentRel only.

The details of the training set creation in DS mode depend on the evaluation format (Section 8.1):

  • For CV-based, in each split, the RuAttitudes collection is combined with each training block of the RuSentRel collection;

  • For Fixed, the training set is a combination of RuAttitudes with the train part.

We measure the quality on the training part every 10 epochs. The number of epochs was limited to 150. The training process terminates when the measured value on the training part exceeds a predefined threshold.

8.3. Result Analysis

Table 2 provides the results of the experiments for models organized (and separated) into the following groups: CNN, PCNN, BiLSTM. To assess the effectiveness of both the application of distant supervision in the training process (DS mode, marked in the DS column of Table 2) and attention-based encoders (prefixed with "Att"), we provide an efficiency assessment in the following directions:

  1. Application of DS mode for baselines;

  2. Application of attention-based sentence encoders in DS mode.

Ratio | Evaluation format | Two-scale CNN | Two-scale PCNN | Two-scale BiLSTM | Three-scale CNN | Three-scale PCNN | Three-scale BiLSTM
DS over SL (baselines) | CV-based, F1(avg) | 0.13 | – | 0.01 | 0.11 | 0.11 | 0.09
DS over SL (baselines) | Fixed, F1(test) | 0.15 | – | 0.04 | 0.29 | 0.25 | 0.15
Att over baseline (DS mode) | CV-based, F1(avg) | 0.01 | 0.08 | 0.02 | 0.04 | – | 0.06
Att over baseline (DS mode) | Fixed, F1(test) | – | 0.05 | – | 0.03 | 0.03 | –
Table 3. Calculated ratios (baselines in DS mode over baselines in SL mode; attention-based encoders over baselines in DS mode) in each experiment for the CV-based, F1(avg), and Fixed, F1(test), evaluation formats; values below zero are displayed as "–".

To carry out the comparison in a particular experiment, for each model we calculate the following two ratios:

  • the DS ratio is the effectiveness of a baseline model trained in DS mode over the related baseline trained in SL mode;

  • the attention ratio is the effectiveness of a model trained in DS mode with an attention-based sentence encoder (prefixed with Att) over the related baseline version.

Table 3 provides the calculated ratios for the Two-scale and Three-scale experiments. The ratio of a result $a$ over a result $b$ is calculated as $a / b - 1$, rounded to two decimal places.
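The ratio computation is small enough to check by hand. The sketch below evaluates it for a few pairs of F1 values taken from Table 2; the function name `ratio` is chosen for this example only.

```python
def ratio(result: float, reference: float) -> float:
    """Relative gain of `result` over `reference`, rounded to two decimals."""
    return round(result / reference - 1, 2)


# Values taken from Table 2 (F1 averaged over the CV splits).
cnn_ds_two_scale = ratio(0.625, 0.553)     # CNN in DS mode over CNN in SL mode
attpcnn_two_scale = ratio(0.644, 0.599)    # AttPCNN over PCNN, both in DS mode
attblstm_three_scale = ratio(0.332, 0.312) # Att-BLSTM over BiLSTM, both in DS mode
```

A value of 0.13, for example, means that the first result exceeds the reference by about 13%; negative values indicate that the compared model is worse than the reference.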

Analyzing results in the Two-scale experiment by in Table 3, model AttCNN shows a significant increase in 13% and 15% in case of CV-based and Fixed evaluation formats respectively. An application of attention-based encoders does not illustrate an increase in result model quality, only 1% for AttCNN and 5-8% for AttPCNN. The highest result is obtained by the Att-BLSTM model with a 4% increase by .

As for the Three-scale experiment, distant supervision also brings a significant increase: 10% in the CV-based evaluation mode and 15-29% on the test part (Fixed evaluation format). Using attentive encoders in models that employ RuAttitudes in training provides a 3% improvement according to the attention ratio. The highest increase by this ratio, 6%, is achieved by the Att-BLSTM model when it is evaluated in the CV-based format.

9. Analysis of Attention Weights

According to Section 3.3, one of the assumptions behind applying distant supervision to develop the RuAttitudes collection is that an attitude might be conveyed by a frame of a certain sentiment polarity. In this section, for the attention-based models of the Three-scale experiment (AttCNN, AttPCNN, Att-BLSTM), we analyze how contexts with sentiment and neutral attitudes affect the weight distribution depending on the term type.

Panel columns: frames, nouns, prep, sentiment
Figure 8. Kernel density estimations (KDE) of context-level weight distributions across neutral (N) and sentiment (S) context sets for models Att-BLSTM and AttCNN trained in different modes: with distant supervision (DS) and with supervised learning only (SL); the scale of the probability range (x-axis) depends on the group of terms (frames and sentiment, nouns, prep); vertical lines indicate the expected values of the corresponding distributions.
Model      DS  F     N     P     S     V
Att-BLSTM      0.29  0.23  0.26  0.14  0.17
Att-BLSTM      0.13  0.22  0.08  0.11  0.07
AttCNN         0.05  0.03  0.05  0.03  0.03
AttCNN         0.09  0.07  0.09  0.07  0.07
AttPCNN        0.10  0.03  0.04  0.04  0.06
AttPCNN        0.09  0.17  0.15  0.08  0.06
Table 4. Kolmogorov-Smirnov statistics for the following term groups: frames (F), nouns (N), prep (P), sentiment (S), and verbs (V); the highest and second-highest values in each category are bolded and underlined respectively.
Model DS
Att-BLSTM 0.20 -0.09 -0.02 0.09
Att-BLSTM 0.07 0.12 0.03 0.05 0.03
AttPCNN 0.06
AttPCNN -0.02
Table 5. The difference in estimated values of and () by following term groups: frames (F), nouns (N), prep (P), sentiment (S), and verbs (V); absolute max values by each term group are bolded; absolute values less or equal displayed as <<>>
Att-BLSTM (SL) (Original)
Figure 9. Weight distribution visualization on sentiment contexts for model Att-BLSTM, trained in different modes: supervised learning (SL), and with distant supervision (DS); for visualization purposes, the weight of each term is normalized by the maximum in its context; frame entries (marked italic and bolded) that appear between masked attitude participants receive greater weight when the training process employs RuAttitudes (DS mode).

Term quantification remains a significant part of each attention-based encoder. Once assigned and normalized, the weights of the terms in a context can be treated as a probability distribution over all the terms that appear in that context.

The contexts in this analysis come from the test part of the RuSentRel collection (Section 8.1). We analyse the weight distribution of the frames group, declared in Section 4, across all input contexts. We additionally introduce extra groups by separating the subset of words into prepositions (prep), terms that appear in the RuSentiLex lexicon (sentiment, Section 3), nouns (nouns), and verbs (verbs). The nouns and verbs groups contain only those entries that are not present in the RuSentiLex lexicon.

The context-level weight of a particular term group is the sum of the weights of the terms that both appear in the context and belong to that term group. To analyze the discrepancy between sentiment and neutrally labeled contexts, we use the distributions of context-level weights across:

  1. Sentiment contexts (S) – contexts, labeled with positive or negative labels;

  2. Neutral contexts (N) – contexts, labeled as neutral.

Below, these weight distributions over sentiment and neutral contexts are denoted as and respectively, where the asterisk stands for a particular term group.
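The construction of the two samples can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names `group_weight` and `weights_by_label`, the label strings, and the toy contexts are hypothetical, and the context-level weight is read as the plain sum of the attention weights of the group's terms.

```python
from typing import Dict, Iterable, List, Set, Tuple

# One context: (terms, attention weights, label). The label strings
# "positive", "negative", and "neutral" are an assumption of this sketch.
Context = Tuple[List[str], List[float], str]


def group_weight(terms: List[str], weights: List[float], group: Set[str]) -> float:
    """Sum the weights of the terms that appear in the context and belong to `group`."""
    if len(terms) != len(weights):
        raise ValueError("terms and weights must have the same length")
    return sum((w for term, w in zip(terms, weights) if term in group), 0.0)


def weights_by_label(contexts: Iterable[Context], group: Set[str]) -> Dict[str, List[float]]:
    """Collect context-level group weights for sentiment (S) and neutral (N) contexts."""
    samples: Dict[str, List[float]] = {"S": [], "N": []}
    for terms, weights, label in contexts:
        key = "N" if label == "neutral" else "S"
        samples[key].append(group_weight(terms, weights, group))
    return samples


if __name__ == "__main__":
    frames = {"supports", "criticizes"}  # toy stand-in for the frames group
    contexts = [
        (["E1", "supports", "E2"], [0.2, 0.6, 0.2], "positive"),
        (["E1", "met", "with", "E2"], [0.3, 0.2, 0.1, 0.4], "neutral"),
    ]
    print(weights_by_label(contexts, frames))
```

Each list in the returned dictionary is one tabulated distribution: a sequence with one group weight per context of the given polarity.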

To reveal the difference between the distributions, we use the Kolmogorov-Smirnov test statistic (massey1951kolmogorov). The statistic is calculated between a pair of samples (tabulated distributions), where each sample is a sequence of term-group probabilities, one per context. It is worth noting that such tabulated distributions meet the criterion of independent values (weights) drawn from a continuous set. This allows us to switch from the tabulated to the cumulative distributions, one for the context set of each polarity (sentiment or neutral).

The Kolmogorov-Smirnov statistic (KS-statistic) is the maximum absolute deviation between the cumulative distributions of the sentiment and neutral context sets.


Table 4 provides the KS-statistics (Formula 16) calculated separately for each group of terms. Larger values indicate a greater difference in weight distribution between sentiment and neutral contexts.
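For illustration, the two-sample KS-statistic can be computed directly from two lists of context-level weights. The sketch below assumes the standard empirical cumulative distribution (the share of sample values not exceeding x); the function names `ecdf` and `ks_statistic` are hypothetical and the code is not taken from the authors' implementation.

```python
from bisect import bisect_right
from typing import List, Sequence


def ecdf(ordered: List[float], x: float) -> float:
    """Empirical CDF of an already sorted sample: share of values <= x."""
    return bisect_right(ordered, x) / len(ordered)


def ks_statistic(sample_s: Sequence[float], sample_n: Sequence[float]) -> float:
    """Maximum absolute deviation between the two empirical CDFs."""
    if not sample_s or not sample_n:
        raise ValueError("both samples must be non-empty")
    sorted_s, sorted_n = sorted(sample_s), sorted(sample_n)
    # Both CDFs are step functions, so the maximum deviation is reached
    # at one of the observed values.
    points = sorted_s + sorted_n
    return max(abs(ecdf(sorted_s, x) - ecdf(sorted_n, x)) for x in points)


if __name__ == "__main__":
    sentiment = [0.35, 0.42, 0.51, 0.60]  # toy group weights, sentiment contexts
    neutral = [0.10, 0.18, 0.40, 0.45]    # toy group weights, neutral contexts
    print(ks_statistic(sentiment, neutral))
```

The statistic lies between 0 (identical empirical distributions) and 1 (fully separated samples). SciPy's `scipy.stats.ks_2samp` computes the same statistic together with a p-value, should a library routine be preferred.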

Another statistic that we use in the analysis is the difference between the expected values of the weight distributions over sentiment and neutral contexts.
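A minimal sketch of this second statistic, assuming the expected values are estimated by sample means and that the difference is taken as sentiment minus neutral (the order of subtraction and the name `mean_difference` are assumptions of this example):

```python
from statistics import mean
from typing import Sequence


def mean_difference(sample_s: Sequence[float], sample_n: Sequence[float]) -> float:
    """Mean group weight over sentiment contexts minus the mean over neutral contexts."""
    # Assumption: sentiment minus neutral, so a positive value means the
    # group receives more weight in sentiment contexts.
    return mean(sample_s) - mean(sample_n)


if __name__ == "__main__":
    sentiment = [0.35, 0.42, 0.51, 0.60]  # toy group weights, sentiment contexts
    neutral = [0.10, 0.18, 0.40, 0.45]    # toy group weights, neutral contexts
    print(mean_difference(sentiment, neutral))
```

Unlike the KS-statistic, which is non-negative by construction, this value is signed and therefore shows in which context set the term group is weighted more heavily.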


In addition to the KS-statistics, the difference in expected values provides the sign of the difference. Summarizing the results of both statistics, we may conclude that among all the models presented in our analysis, only Att-BLSTM shows a significant difference between the sentiment and neutral distributions across all the term groups. The comparative kernel density estimations of context weight distributions for Att-BLSTM and AttCNN are presented in Figure 8. In the case of Att-BLSTM, using RuAttitudes in training (DS mode) shifts the weight distribution from nouns and prep onto terms of the frames and sentiment groups in sentiment contexts. A similar effect is observed for AttCNN trained in DS mode: terms of the frames and sentiment groups gain weight equally in the sentiment and neutral context sets. We attribute this to the structure of contexts in RuAttitudes (Section 3.3): every context contains frames that appear between the attitude participants. Cases where frames convey the presence of an attitude in a context are presented in Figure 9. In the provided examples for the Att-BLSTM model, frame entries receive greater weight when the training process of the model employs RuAttitudes.

Overall, the Att-BLSTM model stands out from the baselines and from the models with feature-based attention encoders (AttCNN, AttPCNN), both in its results (Section 8) and in showing the greatest discrepancy between the sentiment and neutral distributions across all the term groups presented in the analysis (Figure 8). We assume that the latter is due to the following factors: (1) the use of a bi-directional LSTM encoder; (2) the use of a single trainable vector in the quantification process (Section 6), while the models of the feature-based approach (Section 5, Formula 4) depend on fully-connected layers.


In this paper, we study attention-based models that aim to extract sentiment attitudes from analytical articles. We consider the problem of extraction as two-class and three-class classification tasks for whole documents. Depending on the task, the described models should classify a context with an attitude mentioned in it into the following classes: positive or negative (two-class); positive, negative, or neutral (three-class).

We investigated two types of attention embedding approaches: (1) feature-based, (2) self-based. To fine-tune the attention mechanism, we used the distant supervision technique by employing the RuAttitudes collection in the training process.

We conducted experiments on Russian analytical texts of the RuSentRel corpus and provided an analysis of the results. The effect of the distant supervision technique on attention-based encoders was shown by the difference in the weight distributions of certain term groups between sentiment and non-sentiment contexts. Using the distant supervision approach in training three-class classification models results in a 10% improvement for architectures that do not employ an attention module in the context encoder. Replacing the latter with attention-based encoders improves classification by a further 3%.

In further work, we plan to study the application of language models to the presented tasks, as it continues the idea of applying attentive encoders.


The reported study was funded by RFBR according to the research project № 20-07-01059.