SentiLR: Linguistic Knowledge Enhanced Language Representation for Sentiment Analysis

by   Pei Ke, et al.

Most of the existing pre-trained language representation models neglect to consider the linguistic knowledge of texts, whereas we argue that such knowledge can promote language understanding in various NLP tasks. In this paper, we propose a novel language representation model called SentiLR, which introduces word-level linguistic knowledge including part-of-speech tag and prior sentiment polarity from SentiWordNet to benefit the downstream tasks in sentiment analysis. During pre-training, we first acquire the prior sentiment polarity of each word by querying the SentiWordNet dictionary with its part-of-speech tag. Then, we devise a new pre-training task called label-aware masked language model (LA-MLM) consisting of two subtasks: 1) word knowledge recovering given the sentence-level label; 2) sentence-level label prediction with linguistic knowledge enhanced context. Experiments show that SentiLR achieves state-of-the-art performance on several sentence-level / aspect-level sentiment analysis tasks by fine-tuning, and also obtains comparable results on general language understanding tasks.



1 Introduction

Recently, pre-trained language representation models such as GPT Radford et al. (2018, 2019), ELMo Peters et al. (2018), BERT Devlin et al. (2019) and XLNet Yang et al. (2019) have achieved promising results in NLP tasks, including reading comprehension Rajpurkar et al. (2016), natural language inference Bowman et al. (2015); Williams et al. (2018) and sentiment classification Socher et al. (2013). These models capture contextual information from large-scale unlabelled corpora via well-designed pre-training tasks. The literature has commonly reported that pre-trained models can be used as effective feature extractors and achieve state-of-the-art performance on various downstream tasks Wang et al. (2019).

Although pre-trained language representation models have achieved transformative performance, the pre-training tasks like masked language model and next sentence prediction Devlin et al. (2019) neglect to consider the linguistic knowledge. We argue that such knowledge is important for some NLP tasks, particularly for sentiment analysis. For instance, existing work has shown that linguistic knowledge including part-of-speech tag Qian et al. (2015); Huang et al. (2017) and prior sentiment polarity Qian et al. (2017) of each word is closely related to the sentiment of longer texts such as sentences and paragraphs. We argue that pre-trained models enriched with the linguistic knowledge of words will benefit the understanding of the sentiment of the whole texts, thereby resulting in better performance on sentiment analysis.

Although directly introducing the linguistic knowledge from external linguistic resources is feasible, it remains a challenge for the model to learn beneficial knowledge-aware representation that promotes the downstream tasks in sentiment analysis. The linguistic knowledge roughly reflects different impacts of individual words on the sentiment of a whole sentence. Some of these words may act as sentiment shifters. For example, negation words constantly change the sentiment to the opposite polarity Zhu et al. (2014), while intensity words modify the valence degree, i.e., sentiment intensity of the text Qian et al. (2017). However, the sentiment labels of sentences are commonly derived from multiple sentiment shifts induced by words, and modeling the complex relationship between the sentence-level sentiment labels and word-level sentiment shifts is still underexplored. Thus, the goal of our research is to fully employ the linguistic knowledge to get language representation entailing the connection between high-level labels and words, which improves the performance in the tasks of sentiment analysis.

In this paper, we propose a novel pre-trained language representation model called SentiLR to deal with this challenge. First, to acquire the linguistic knowledge of each word, we utilize SentiWordNet 3.0 Baccianella et al. (2010) as our linguistic resource. Specifically, we look up the sentiment scores of words with corresponding part-of-speech tags in SentiWordNet. Since we cannot accurately match the meaning of each word with the sense in SentiWordNet, we compute a weighted sum of the sentiment score of all the senses as the prior sentiment polarity for each word Guerini et al. (2013). Then, to capture the relationship between sentence-level labels and word-level sentiment shifts using linguistic knowledge, we devise a novel pre-training task called label-aware masked language model. This task contains two sub-tasks: 1) predicting a masked word, part-of-speech tag, and sentiment polarity at masked positions given the sentence-level sentiment label; 2) predicting the sentence-level label, the masked word and its linguistic knowledge including part-of-speech tag and sentiment polarity simultaneously. These two sub-tasks are expected to encourage the model to utilize linguistic knowledge to build the connection between high-level sentiment labels and low-level sentiment shifts. Our contributions are threefold:

  • We analyze the importance of incorporating linguistic knowledge into pre-trained language representation models, and we observe that effectively leveraging linguistic knowledge benefits the sentiment analysis tasks.

  • We propose a novel pre-trained language representation model called SentiLR, which acquires word-level sentiment polarity from SentiWordNet and adopts label-aware masked language model to capture the relationship between sentence-level sentiment labels and word-level sentiment shifts.

  • We conduct experiments on sentence-level / aspect-level sentiment classification tasks and show that SentiLR can outperform state-of-the-art pre-trained language representation models such as BERT and XLNet.

Figure 1: Overview of SentiLR. This model first acquires the word-level sentiment polarity from SentiWordNet for each word with the corresponding part-of-speech tag. During pre-training, the model is trained based on label-aware masked language model and next sentence prediction. After pre-training, SentiLR can be simply fine-tuned to typical sentiment analysis tasks like sentence-level / aspect-level sentiment classification.

2 Related Work

2.1 Pre-trained Language Representation Model

Early work on pre-trained language representation models mainly focuses on distributed word representations, such as word2vec Mikolov et al. (2013) and GloVe Pennington et al. (2014). Since the distributed word representation is independent of context, it’s challenging for such representation to model the complex word characteristics under different contexts. Thus contextual language representation based on pre-trained models including CoVe McCann et al. (2017), ELMo Peters et al. (2018), GPT Radford et al. (2018, 2019) and BERT Devlin et al. (2019) has become prevalent recently. These models use deep LSTM Hochreiter and Schmidhuber (1997) or Transformer Vaswani et al. (2017) as the encoder to acquire contextual language representation. Various pre-training tasks were explored including traditional NLP tasks like machine translation McCann et al. (2017) and language model Peters et al. (2018); Radford et al. (2018, 2019), or other tasks such as masked language model and next sentence prediction Devlin et al. (2019).

With the advent of BERT Devlin et al. (2019) achieving state-of-the-art performances on various NLP tasks, many variants of BERT have been proposed. Due to the important role of entities in language understanding, two heuristic ways have been studied to make the pre-trained model aware of entities, i.e. introducing knowledge graph Zhang et al. (2019) / knowledge base Peters et al. (2019) explicitly and designing entity-specific masking strategies during pre-training Sun et al. (2019a, b). Considering the implicit relationship among different NLP tasks, post-training approaches Xu et al. (2019); Li et al. (2019) conduct supervised training on the pre-trained BERT with transfer tasks which are related to target tasks, in order to get a better initialization for target tasks. The model structure and the pre-training tasks of BERT are also worth exploring. Some researchers measure the impact of key hyper-parameters to improve the under-trained BERT Liu et al. (2019), and others improve the masked language model with masking contiguous random spans Joshi et al. (2019) or decomposing the training objective into auto-regressive language model Yang et al. (2019).

Other works propose task-specific pre-training strategies to acquire task-specific language representation applied to the corresponding tasks such as data augmentation Wu et al. (2019), cross-lingual analysis Lample and Conneau (2019), relation extraction Alt et al. (2019); Soares et al. (2019) and language generation Song et al. (2019); Dong et al. (2019). To the best of our knowledge, SentiLR is the first work to explore sentiment-specific pre-trained language representation model for downstream sentiment analysis tasks.

2.2 Linguistic Knowledge for Sentiment Analysis

Linguistic knowledge such as part of speech and word-level sentiment polarity is commonly used as external features in sentiment analysis. Part of speech can facilitate the understanding of the syntactic structure of texts by improving the parsing performance Socher et al. (2013). It can also be incorporated into all layers of RNN as tag embeddings Qian et al. (2015). Huang et al. (2017) shows that part of speech can help to learn sentiment-favorable representations.

Word-level sentiment polarity is mostly derived from sentiment lexicons Hu and Liu (2004); Wilson et al. (2005). Guerini et al. (2013) obtains the prior sentiment polarity by weighting the sentiment scores over all the senses of words in SentiWordNet Esuli and Sebastiani (2006); Baccianella et al. (2010). Teng et al. (2016) proposes a lexicon-based weighted sum model, which weights the prior sentiment scores of sentiment words to get the sentiment label of the whole sentence. Qian et al. (2017) models the linguistic role of sentiment, negation and intensity words via linguistic regularizers in the training objective.

3 Model

3.1 Task Definition and Model Overview

Our task is formulated as follows: given a text sequence $X = (x_1, x_2, \ldots, x_n)$ of length $n$, our goal is to acquire the representation $H = (h_1, h_2, \ldots, h_n)^\top \in \mathbb{R}^{n \times d}$ of the whole sequence that captures the contextual information and the linguistic knowledge simultaneously. In this formulation, $d$ indicates the dimension of the representation vector.

Figure 1 shows the overview of our model pipeline which contains three stages: 1) Acquiring the prior sentiment polarity for each word with its corresponding part-of-speech tag; 2) Conducting pre-training via two tasks i.e. label-aware masked language modeling and next sentence prediction; 3) Fine-tuning on sentiment analysis tasks with different settings. Compared with the vanilla pre-trained models like BERT Devlin et al. (2019), our model enriches the input sequence with its linguistic knowledge including part-of-speech tags and sentiment polarity labels, and utilizes a modified masked language model to capture the relationship between sentence-level sentiment labels and word-level knowledge in addition to context dependency.

3.2 Linguistic Knowledge Acquisition

This module obtains the sentiment polarity for each word with its part-of-speech tag. The input of this module is a sequence of tuples, each containing a word and its part-of-speech label tagged by external tools such as NLTK. Assume that for each tuple, we can find several different senses in SentiWordNet due to ambiguity, each with a sense number, which indicates the order of the sense, and a positive / negative score assigned by SentiWordNet. Since we can’t accurately match the meaning of each word in the sequence with a sense in SentiWordNet, we follow Guerini et al. (2013) to convert the scores of all the senses to a prior sentiment label. The reciprocal of the sense number of each sense weights the respective score, since in SentiWordNet a smaller sense number indicates more frequent use of this sense in natural language. Note that if we can’t find any sense for a tuple in SentiWordNet, the neutral label will be assigned.
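The conversion from sense-level scores to a prior label can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the weighted score takes the form $s = \sum_j (P_j - N_j) / SN_j$, where $SN_j$ is the sense number and $P_j$, $N_j$ are the positive and negative scores of sense $j$, and it thresholds the score at zero. The function names, the zero thresholds and the toy senses are all hypothetical.

```python
from typing import List, Tuple

# Each sense is (sense_number, positive_score, negative_score).
Sense = Tuple[int, float, float]


def prior_score(senses: List[Sense]) -> float:
    """Sum of (positive - negative) scores, each weighted by 1 / sense_number."""
    return sum((pos - neg) / sense_number for sense_number, pos, neg in senses)


def prior_polarity(senses: List[Sense]) -> str:
    """Map the weighted score to a label; a word with no senses is neutral."""
    if not senses:
        return "neutral"
    score = prior_score(senses)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical senses for one (word, tag) tuple, not real SentiWordNet entries.
toy_senses = [(1, 0.75, 0.0), (2, 0.5, 0.25), (3, 0.0, 0.125)]
print(prior_score(toy_senses), prior_polarity(toy_senses))
```

Because only the sign of the score decides the label, dividing by the sum of the weights (a normalized variant) would give the same labels.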

3.3 Pre-training Tasks

During pre-training, label-aware masked language model (LA-MLM) and next sentence prediction (NSP) are adopted as the pre-training tasks, where the setting of NSP is identical to the one proposed by Devlin et al. (2019). Label-aware masked language model is designed to utilize the linguistic knowledge to grasp the implicit dependency between sentence-level sentiment labels and words in addition to context dependency. It contains two separate sub-tasks, both of which take the position embedding, token embedding and segment embedding as the input. The position embedding introduces the position information into the model, while the segment embedding shows the boundary of different sentences. They are implemented in the same setting as BERT. Besides the original word embedding, the token embedding additionally includes the part-of-speech embedding and the word-level sentiment polarity embedding obtained in Section 3.2.
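One way to picture the enriched input is sketched below. It assumes the five embeddings are combined by element-wise summation, as BERT does for its three embeddings; the text does not state how the part-of-speech and polarity embeddings are merged, so the summation, the table sizes and all names here are assumptions.

```python
import random
from typing import Dict, List

Table = List[List[float]]


def make_table(num_rows: int, dim: int, rng: random.Random) -> Table:
    """Create a small randomly initialised embedding table."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def input_embedding(word_id: int, pos_tag_id: int, polarity_id: int,
                    position: int, segment_id: int,
                    tables: Dict[str, Table]) -> List[float]:
    """Sum word, part-of-speech, polarity, position and segment embeddings."""
    rows = [
        tables["word"][word_id],
        tables["pos_tag"][pos_tag_id],
        tables["polarity"][polarity_id],
        tables["position"][position],
        tables["segment"][segment_id],
    ]
    return [sum(column) for column in zip(*rows)]


rng = random.Random(0)
dim = 4  # toy dimension; BERT-Base uses 768
tables = {
    "word": make_table(10, dim, rng),
    "pos_tag": make_table(5, dim, rng),
    "polarity": make_table(3, dim, rng),  # e.g. negative / neutral / positive
    "position": make_table(8, dim, rng),
    "segment": make_table(2, dim, rng),
}
vector = input_embedding(word_id=7, pos_tag_id=2, polarity_id=1,
                         position=3, segment_id=0, tables=tables)
print(len(vector))
```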

Figure 2: Sub-task#1 of label-aware masked language model. Given the sentence-level sentiment label (negative) as the input embedding, our model is to predict the word (good), part-of-speech tag (JJ) and word-level sentiment polarity (positive) individually at the masked position.

The goal of sub-task#1 is to recover the masked sequence conditioned on the sentence-level label, as shown in Figure 2. In this setting, we add the sentence-level sentiment embedding to the inputs and the model is required to predict the word, part-of-speech tag and word-level sentiment polarity individually using the hidden states at the masked positions. This sub-task explicitly exerts the impact of the high-level sentiment label on the words and the linguistic knowledge of words, enhancing the ability of our model to explore the complex connection among them.

Figure 3: Sub-task#2 of label-aware masked language model. Our model is to predict the sentence-level sentiment label (negative) and recover the word information (word: good, part-of-speech tag: JJ, word-level sentiment polarity: positive) simultaneously.

The purpose of sub-task#2 is to predict the sentence-level label and the word information based on the hidden states at [CLS] and masked positions respectively. From Figure 3, we can see that the label is used as the supervision signal, which is different from sub-task#1. The simultaneous prediction of labels, words and linguistic knowledge of words enables our model to capture the implicit relationship among them.

Since the two sub-tasks are separate, we empirically set the proportion of pre-training data provided for the two sub-tasks to be 4:1. As for the masking strategy, we increase the masking probability of the words with positive / negative sentiment polarity from 15% in the setting of BERT to 30% because they are more likely to cause sentiment shifts in the whole sentence.
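The masking strategy and the data split can be sketched as below. The sketch assumes that words without positive / negative polarity keep BERT's 15% masking probability, that each position is masked independently, and that the 4:1 ratio means four examples for sub-task#1 per one example for sub-task#2; the function names are hypothetical.

```python
import random
from typing import List

BASE_MASK_PROB = 0.15       # masking probability in the setting of BERT
SENTIMENT_MASK_PROB = 0.30  # raised probability for positive / negative words


def mask_probability(polarity: str) -> float:
    """Return the masking probability for a word with the given prior polarity."""
    if polarity in ("positive", "negative"):
        return SENTIMENT_MASK_PROB
    return BASE_MASK_PROB


def choose_mask_positions(polarities: List[str], rng: random.Random) -> List[int]:
    """Independently decide for each position whether it is masked."""
    return [i for i, polarity in enumerate(polarities)
            if rng.random() < mask_probability(polarity)]


def assign_subtask(example_index: int) -> int:
    """Route 4 of every 5 examples to sub-task 1 and the rest to sub-task 2."""
    return 1 if example_index % 5 < 4 else 2


polarities = ["neutral", "neutral", "neutral", "positive", "neutral"]
print(choose_mask_positions(polarities, random.Random(0)))
print([assign_subtask(i) for i in range(10)])
```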

3.4 Fine-tuning Setting

Figure 4: Fine-tuning settings of SentiLR on sentence-level / aspect-level sentiment classification. In both classification tasks, the input contains the text sequence to be classified. In the aspect-level sentiment classification task, an additional aspect term / aspect category sequence is concatenated with the text sequence.

Equipped with the ability to utilize linguistic knowledge via pre-training, our model can be fine-tuned to different sentiment analysis tasks, including sentence-level / aspect-level sentiment classification. We follow the fine-tuning setting of the existing work Devlin et al. (2019); Xu et al. (2019):

Sentence-level Sentiment Classification: The input of this task is a text sequence. The sentiment label is obtained based on the hidden state of [CLS].

Aspect-level Sentiment Classification: In addition to the text sequence, the input contains an aspect term / aspect category sequence. The sentiment label is also acquired based on the hidden state of [CLS]. Figure 4 illustrates the fine-tuning settings.
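The two input layouts can be sketched as follows, assuming the usual BERT sentence-pair convention with [CLS] and [SEP] markers and the text placed before the aspect sequence. The ordering, the segment ids and the function name are assumptions of this sketch rather than details given in the text.

```python
from typing import List, Optional, Tuple


def build_input(text_tokens: List[str],
                aspect_tokens: Optional[List[str]] = None
                ) -> Tuple[List[str], List[int]]:
    """Build tokens and segment ids for sentence- or aspect-level classification."""
    tokens = ["[CLS]"] + text_tokens + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    if aspect_tokens:
        # Append the aspect term / category as a second segment.
        tokens += aspect_tokens + ["[SEP]"]
        segment_ids += [1] * (len(aspect_tokens) + 1)
    return tokens, segment_ids


# Sentence-level input: text only.
print(build_input(["the", "fish", "is", "fresh"]))
# Aspect-level input: text followed by the aspect term.
print(build_input(["the", "fish", "is", "fresh"], ["fish"]))
```

In both cases the classifier would read the hidden state at position 0, which holds [CLS].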

4 Experiment

4.1 Pre-training Dataset and Implementation

We adopted the Yelp Dataset Challenge 2019 as our pre-training dataset. This dataset contains 6,685,900 reviews with 5-class review-level sentiment labels. Each review consists of 8.1 sentences and 127.8 words on average.

Since our method can adapt to all the BERT-style pre-training models, we used vanilla BERT Devlin et al. (2019) as the base framework to construct Transformer blocks in this paper and leave the exploration of other models like RoBERTa Liu et al. (2019) as future work. The hyper-parameters of the Transformer blocks were set to be the same as BERT-Base due to the limited computational power. Considering the large cost of training from scratch, we utilized the parameters of pre-trained BERT to initialize our model. We also followed BERT to use WordPiece vocabulary Wu et al. (2016) with a vocabulary size of 30,522. The maximum sequence length in the pre-training phase was 128, while the batch size was 512. We took Adam Kingma and Ba (2015) as the optimizer and set the learning rate to be 5e-5. Our model was pre-trained on Yelp Dataset Challenge 2019 for 1 epoch with label-aware masked language model and next sentence prediction. Note that we’ll release all the data, code and model parameters.

4.2 Baselines

We compared SentiLR with several state-of-the-art pre-trained language representation models:

BERT: The pre-trained model based on masked language model and next sentence prediction Devlin et al. (2019).

XLNet: The variant of BERT which autoregressively recovers the masked tokens with permutation language model Yang et al. (2019).

For fair comparison, all the baselines in this paper were set to the base version. The number of parameters in each model is listed in Table 1. Since SentiLR adopts the same architecture of Transformer blocks as BERT, the numbers of parameters in these two models are almost the same and smaller than that of XLNet.

Model         BERT      XLNet     SentiLR
# parameters  109.486M  117.313M  109.495M
Table 1: The number of parameters of each pre-trained language representation model.

4.3 Sentence-level Sentiment Classification

Dataset  Amount (train / valid / test)  Avg. length  # classes
SST      8,544 / 1,101 / 2,210          19.2         5
MR       8,534 / 1,078 / 1,050          21.7         2
IMDB     24,749 / 249 / 24,999          279.2        2
Yelp-2   504,000 / 56,000 / 38,000      155.3        2
Yelp-5   594,000 / 56,000 / 50,000      156.6        5
Table 2: Statistics of sentence-level sentiment classification datasets.

The goal of the sentence-level sentiment classification is to predict the sentiment labels of sentences or paragraphs, which examines the model’s ability to understand the whole text. We evaluated our model on several sentence-level sentiment classification benchmarks including Stanford Sentiment Treebank (SST) Socher et al. (2013), Movie Review (MR) Pang and Lee (2005), IMDB Maas et al. (2011) and Yelp-2/5 Zhang et al. (2015) which cover widely used datasets at different scales. We reported the statistics of the datasets in Table 2 including the number of training / validation / test set, the average length and the number of classes. Since MR, IMDB and Yelp-2/5 don’t have validation sets, we randomly sampled subsets from the training sets as the validation sets, and tested all the models with the same data split.

Model    SST    MR     IMDB   Yelp-2  Yelp-5
BERT     53.51  86.29  92.89  97.74   70.16
XLNet    56.68  88.38  95.26  97.41   70.23
SentiLR  55.97  88.38  94.62  98.03   70.86
Table 3: Accuracy on sentence-level sentiment classification benchmarks (%).

The results are shown in Table 3. We can observe that SentiLR performs better than or equally to the baselines on MR, Yelp-2 and Yelp-5. As for SST and IMDB, our model clearly surpasses BERT and shows comparable performance to XLNet. This demonstrates that our model can derive the sentence-level labels based on the sentiment shifts in the sentences and get a better understanding of the sentiment in the whole text.

4.4 Aspect-level Sentiment Classification

Task: Aspect Term Sentiment Classification
Dataset                        SemEval14 (Laptop)      SemEval14 (Restaurant)
Amount (train / valid / test)  2,163 / 150 / 638       3,452 / 150 / 1,120
# classes                      3                       3
# terms                        1,031                   1,268

Task: Aspect Category Sentiment Classification
Dataset                        SemEval14 (Restaurant)  SemEval16 (Restaurant)
Amount (train / valid / test)  3,366 / 150 / 973       2,150 / 150 / 751
# classes                      3                       3
# categories                   5                       12
Table 4: Statistics of aspect-level sentiment classification datasets.

Aspect-level sentiment classification is an important task in sentiment analysis. Given the aspect term / aspect category and the corresponding review, this task aims to predict the sentiment of the aspect based on the review, which evaluates the ability to capture the sentiment of specific content. The difference between aspect term and aspect category is that the former is a specific term (e.g. fish) while the latter is a coarse-grained category (e.g. food). For aspect term sentiment classification, we chose SemEval2014 Task 4 (laptop and restaurant domains) as the benchmarks, while for aspect category sentiment classification, we used SemEval2014 Task 4 (restaurant domain) and SemEval2016 Task 5 (restaurant domain). The statistics of these benchmarks containing the amount of training / validation / test sets, the number of classes and the number of aspect terms / aspect categories are shown in Table 4. We followed the existing work Xu et al. (2019) to leave 150 examples from the training sets for validation.

Task: Aspect Term Sentiment Classification
         SemEval14 (Laptop)  SemEval14 (Restaurant)
Model    Acc.    MF1.        Acc.    MF1.
BERT     76.33   70.85       84.64   77.38
XLNet    79.94   75.81       85.71   77.60
SentiLR  80.72   76.47       86.16   79.20

Task: Aspect Category Sentiment Classification
         SemEval14 (Restaurant)  SemEval16 (Restaurant)
Model    Acc.    MF1.            Acc.    MF1.
BERT     88.90   81.07           87.08   71.28
XLNet    90.65   83.63           87.22   73.14
SentiLR  92.91   87.10           90.28   77.85
Table 5: Accuracy (Acc.) and Macro-F1 (MF1.) on aspect-level sentiment classification benchmarks (%).

We present the results of aspect-level sentiment classification in Table 5. We can see that SentiLR outperforms the baselines in both accuracy and Macro-F1 on these datasets, indicating that our model can successfully grasp the sentiment of the given aspects. Since the improvement in Macro-F1 is more notable than that in accuracy, we are convinced that our model actually does better in all three sentiment classes. Due to the sparsity of aspect terms compared with aspect categories, our model improves by a larger margin on aspect category sentiment classification than on aspect term sentiment classification.
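To see why Macro-F1 is the more telling metric when classes are imbalanced, the toy computation below compares it with accuracy. The labels and predictions are invented for illustration and are not taken from the benchmarks; the function names are hypothetical.

```python
from typing import List, Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: List[str]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denominator = 2 * tp + fp + fn
        # A class that is never gold and never predicted contributes 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# A classifier that always answers "positive" on an imbalanced toy set.
gold = ["positive"] * 8 + ["neutral", "negative"]
pred = ["positive"] * 10
labels = ["positive", "neutral", "negative"]
print(accuracy(gold, pred), macro_f1(gold, pred, labels))
```

The majority-class predictor scores high accuracy but low Macro-F1, so a gain in Macro-F1 indicates improvement on the minority classes as well.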

4.5 General Language Understanding Tasks

Model    SST-2  MNLI   QQP    QNLI
BERT     92.70  84.40  91.18  88.40
SentiLR  93.58  83.42  90.40  91.29

Model    CoLA   STS-B  MRPC   RTE
BERT     56.86  89.21  86.70  74.37
SentiLR  57.08  88.33  87.50  72.92
Table 6: Accuracy on the development sets of different tasks in GLUE (%). The results of BERT on SST-2, MNLI, QNLI and MRPC are those reported by Devlin et al. (2019).

To explore whether the performance of SentiLR on common NLP tasks will improve or degrade, we evaluated our model on General Language Understanding Evaluation (GLUE) benchmark Wang et al. (2019), which collects diverse language understanding tasks. We fine-tuned SentiLR on each task of GLUE respectively, and compared its performance with vanilla BERT. Since the test sets of GLUE are not publicly available, we reported the results on development sets in Table 6. Note that we directly used the results of BERT on SST-2, MNLI, QNLI and MRPC which are reported by Devlin et al. (2019) and re-implemented the BERT model fine-tuned on the rest of the tasks by ourselves.

From Table 6, SentiLR indeed gets better results on sentiment analysis tasks like SST-2. We also observe that our model outperforms BERT on the CoLA, MRPC and QNLI tasks, and gets comparable results on the other datasets. Among these datasets, CoLA requires fine-grained grammaticality distinction for complex syntactic structures Warstadt and Bowman (2019), which may be aided by part-of-speech tag information. Similarly, QNLI has also been reported to be improved with external part-of-speech features Rajpurkar et al. (2016). Thus, our model, which is able to utilize the linguistic knowledge, achieves better performance accordingly.

4.6 Ablation Study

              MR     SemEval14 (Laptop)  SemEval14 (Restaurant)
Model         Acc.   Acc.    MF1.        Acc.    MF1.
BERT          86.29  76.33   70.85       88.90   81.07
SentiLR       88.38  80.72   76.47       92.91   87.10
 - subtask#1  87.14  78.84   73.19       91.88   85.12
 - subtask#2  87.43  78.21   73.25       90.34   82.85
 - knowledge  87.43  78.68   73.58       92.29   87.08
Table 7: Ablation study on sentence-level sentiment classification (MR), aspect term sentiment classification (SemEval14-Laptop) and aspect category sentiment classification (SemEval14-Restaurant), where accuracy (Acc.) and Macro-F1 (MF1.) are reported.

To study the effectiveness and significance of the linguistic knowledge introduced and the label-aware masked language model, we remove the linguistic knowledge and two sub-tasks in the label-aware masked language model respectively and present the results in Table 7. Since the two sub-tasks are separate, the setting of -subtask#1/2 in Table 7 indicates that the pre-training data are all fed into the other sub-task. Additionally, the -knowledge setting means that we remove the part-of-speech and sentiment polarity embedding in the input as well as the supervision signals of linguistic knowledge in two sub-tasks.

Input sentence: The movie is of [MASK] quality with [MASK] good comments about it.

Sentence-level label  Predicted sentence
0                     The movie is of poor quality with no good comments about it.
1                     The movie is of low quality with few good comments about it.
2                     The movie is of decent quality with some good comments about it.
3                     The movie is of good quality with several good comments about it.
4                     The movie is of excellent quality with many good comments about it.
Table 8: Generated cases where multiple words are masked in the input sentence. The predicted sentence is constructed by sampling one word from the top-k candidates at each [MASK] position while keeping other words identical to the input sentence.

The results in Table 7 show that both the linguistic knowledge and the pre-training task contribute to the final performance. In terms of the different effects of two sub-tasks, they perform comparatively on the sentence-level classification and aspect term sentiment classification. Nevertheless, sub-task#2 seems more important to the aspect category sentiment classification as the performance degrades severely on SemEval14 (Restaurant) when sub-task#2 is ablated. Considering the impact of knowledge, the performance of SentiLR doesn’t degrade a lot compared with the setting of removing the pre-training task. This result indicates that SentiLR doesn’t only depend on the external knowledge from SentiWordNet. The well-designed pre-training task facilitates the model to explore the information within contexts even without the explicit knowledge and build the deep connection between the labels and words.

4.7 Further Analysis on Label-aware Masked Language Model

Label-aware masked language model plays an important part in SentiLR. It makes our model learn to capture not only the context dependency but the linguistic knowledge of words. In order to get a deeper understanding of how this pre-training task captures the context dependency and the relationship between sentence-level labels and word-level knowledge, we provided some generated cases of label-aware masked language model after pre-training.

Input sentence: This restaurant is really [MASK] regarding its serve.

Sentence-level label      0      1      2      3     4
Weighted sentiment score  -0.33  -0.25  -0.16  0.25  0.46
[Visualization: generation probabilities of negative words, neutral words and positive words for each label]
Table 9: Statistics of the prediction at [MASK] position given the same input sentence with different sentence-level labels. The weighted sentiment score is computed as a weighted sum of the probability over the vocabulary, and the weight for each word is obtained from SentiWordNet as described in Section 3.2. The visualized generation probabilities of different sentiments are obtained by accumulating the probability of words with the respective prior sentiment polarities.

Firstly, we show that our pre-trained model can capture the deep relationship between sentence-level labels and sentiment words. Given the same input sentence with one masked word and different sentence-level labels in the form of sentence-level embeddings, our model can recover the masked word with respect to the global sentiment. We calculated the weighted sentiment score via $\sum_{w \in V} P(w)\, s(w, pos_w)$, where $V$ is the vocabulary, $P(w)$ is the probability of word $w$ at the [MASK] position computed by SentiLR, $pos_w$ indicates the part-of-speech tag predicted by SentiLR, and $s(w, pos_w)$ is the prior sentiment score obtained from SentiWordNet as described in Section 3.2. As this weighted score reveals the sentiment polarity of the model’s prediction, we can see from Table 9 that it gradually shifts from negative to positive as the sentence-level label goes from 0 to 4. We also calculated the accumulated generation probabilities of negative, neutral and positive words defined by the prior sentiment labels to show the changes of word usage in fine-grained sentiment settings.
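The two statistics of Table 9 can be sketched as follows. For brevity the sketch looks up one prior score per word and ignores the predicted part-of-speech tag, which is a simplification; the toy distribution, the prior scores and the function names are hypothetical.

```python
from typing import Dict


def weighted_sentiment_score(probs: Dict[str, float],
                             prior: Dict[str, float]) -> float:
    """Sum over candidate words of probability times prior sentiment score."""
    return sum(p * prior.get(word, 0.0) for word, p in probs.items())


def polarity_mass(probs: Dict[str, float],
                  prior: Dict[str, float]) -> Dict[str, float]:
    """Accumulate probability over negative, neutral and positive words."""
    mass = {"negative": 0.0, "neutral": 0.0, "positive": 0.0}
    for word, p in probs.items():
        score = prior.get(word, 0.0)  # words without a prior count as neutral
        if score > 0:
            mass["positive"] += p
        elif score < 0:
            mass["negative"] += p
        else:
            mass["neutral"] += p
    return mass


# Toy distribution at the [MASK] position and toy prior scores.
probs = {"good": 0.5, "bad": 0.25, "average": 0.25}
prior = {"good": 0.5, "bad": -0.5, "average": 0.0}
print(weighted_sentiment_score(probs, prior))
print(polarity_mass(probs, prior))
```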

Secondly, we demonstrate that our model can simultaneously capture the context dependency and the sentiment-related linguistic knowledge. We can see from Table 8 that our model chooses different words at the first [MASK] to satisfy the fine-grained sentence-level labels. Then, our model infers the relationship between the amount of positive comments and the quality of the movie via context dependency and fills the second [MASK] with reasonable quantifiers.

5 Conclusion

In this paper, we propose a novel pre-trained language representation model called SentiLR, which captures not only the context dependency but also the linguistic knowledge of each word. We introduce the linguistic knowledge from SentiWordNet and design label-aware masked language model to enable our model to utilize the knowledge in sentiment analysis tasks. Experiments show that our model can outperform several state-of-the-art pre-trained language representations in the sentiment analysis tasks.


Acknowledgments

This work was supported by the National Science Foundation of China key project with grant No. 61936010 and regular project with grant No. 61876096, and the National Key R&D Program of China (Grant No. 2018YFC0830200). This work was also supported by Beijing Academy of Artificial Intelligence, BAAI.