SentiPrompt: Sentiment Knowledge Enhanced Prompt-Tuning for Aspect-Based Sentiment Analysis

September 17, 2021 · Chengxi Li, et al. · Zhejiang University

Aspect-based sentiment analysis (ABSA) is an emerging fine-grained sentiment analysis task that aims to extract aspects, classify corresponding sentiment polarities, and find opinions as the causes of sentiment. The latest research tends to solve the ABSA task in a unified way with end-to-end frameworks. Yet, these frameworks are fine-tuned on downstream tasks without any task-adaptive modification. Specifically, they do not make good use of task-related knowledge or explicitly model relations between aspect and opinion terms, which hinders their performance. In this paper, we propose SentiPrompt, which uses sentiment knowledge enhanced prompts to tune the language model in the unified framework. We inject sentiment knowledge regarding aspects, opinions, and polarities into prompts and explicitly model term relations by constructing consistency and polarity judgment templates from the ground truth triplets. Experimental results demonstrate that our approach outperforms strong baselines on Triplet Extraction, Pair Extraction, and Aspect Term Extraction with Sentiment Classification by a notable margin.


Introduction

Aspect-based Sentiment Analysis (Pontiki et al., 2014) is a fine-grained sentiment analysis task which requires extracting aspects, judging the corresponding polarities, and finding opinions as the causes of sentiment towards each aspect. For the example in Figure 1, the task aims to extract the two aspects “owners” and “beer selection”, their corresponding opinions (“great fun” and “worth staying for”) and polarities (both positive). Previous studies have proposed many subtasks for ABSA, which differ in their required inputs, outputs, and task formulations. In this paper, we focus on the three most complex and challenging subtasks, AESC, Pair, and Triplet, as shown in Figure 1.

Figure 1: Examples of prompt-tuning for ABSA subtasks.

Aspect Term Extraction and Sentiment Classification (AESC) requires extracting the aspect terms and classifying the sentiment polarities about them. Pair Extraction (Pair) extracts the aspect terms as well as the corresponding opinion terms simultaneously. Aspect Sentiment Triplet Extraction (Triplet) (Peng et al., 2020) aims to extract all the aspects, their corresponding opinions, and sentiment polarities from input sentences. These are all nearly complete solutions for the ABSA task, especially Triplet.

Researchers have proposed pipeline approaches (Peng et al., 2020; Mao et al., 2021) to solve these subtasks in a multi-stage framework. However, these multi-stage models are not end-to-end: the various sub-models inside make their predictions separately. Separating the task into multiple stages potentially breaks the relation modeling within the triplets and brings about non-negligible error propagation. Recently, several neural-network-based models (Wu et al., 2020; Xu et al., 2020, 2021) have been proposed to develop end-to-end frameworks with sequence tagging. In these approaches, aspects and opinions are jointly extracted and polarities are also jointly considered. However, these approaches mainly model the relations between aspects and opinions at the word level, ignoring span-level relations, which require supervision from extra subtasks for pruning. More recent studies attempt unified frameworks with Pre-trained Language Models (PLMs) for the ABSA task (Chen et al., 2021a; Yan et al., 2021), yielding promising performance on benchmark datasets. Despite the success of PLM fine-tuning, some recent studies find that one of its critical challenges is the significant gap in objective forms between pre-training and fine-tuning, which restricts PLMs from reaching their full potential.

Prompt-tuning (Jiang et al., 2020; Shin et al., 2020; Schick and Schutze, 2021; Han et al., 2021) has emerged as a new fine-tuning paradigm proposed to bridge the gap in objective forms between pre-training and fine-tuning. With appropriate prompts and tuning objectives, prompt-tuning can steer the model behavior to fit various downstream tasks. By using specially constructed prompts, we can further inject and stimulate task-related knowledge in PLMs, thus boosting model performance. Nevertheless, handcrafting an appropriate prompt for ABSA requires domain expertise, and automatically constructing a well-performing prompt often requires additional computational cost for verification.

Motivated by the above observations, in this paper we propose SentiPrompt, a novel approach that leverages sentiment knowledge enhanced prompts to tune the LM in a generative framework for ABSA. To be specific, we leverage consistency and polarity judgment templates to construct prompts from the ground truth aspects, opinions, and polarities as sentiment knowledge. During training, the model predicts the consistency of prompt samples with the ground truth and classifies polarities for consistent ones as instructed. In this way, the encoder part of the framework learns to explicitly capture sentiment relations between aspect and opinion term pairs. At the same time, this tuning process introduces sentiment knowledge. Together with the supervision from generative predictions, the framework can serve the ABSA task far better than when it is solely fine-tuned by the loss from the final predictions. This is the first attempt to introduce downstream task-related knowledge and to enhance relation modeling for the LM through prompt-tuning methods. Sentiment knowledge enhanced prompt-tuning, working as an ABSA task-adaptive modification to the tuning process, helps to promote term extraction, pairing, and classification. This technique needs no extra parser to obtain semantic information, nor any graph construction from sentence structure. Besides, a sentence usually contains only a few aspect terms, opinion terms, and polarities, which allows convenient, time-saving, and explicit modeling of the relations between those terms and polarities, and a better introduction of sentiment-related knowledge into the LM.

Our main contributions can be summarized as follows:

  • We propose sentiment knowledge enhanced prompt-tuning (SentiPrompt) to explicitly model the relations between terms and leverage sentiment knowledge. We further utilize trainable continuous templates to obtain optimized prompts. To the best of our knowledge, this is the first approach to incorporate prompt-tuning for ABSA subtasks.

  • We revise the latest version of four benchmark datasets regarding mistakes and incompleteness in triplet annotations and release a new version for future research. Experimental results demonstrate that our method achieves promising improvements and obtains new state-of-the-art performance on all datasets.

Related Works

Aspect-based Sentiment Analysis

Early research on the ABSA task started with extracting aspects or opinions alone. Most early research follows a sequence tagging schema (Li et al., 2018; Xu et al., 2020; Bi et al., 2021) to solve Aspect Term Extraction (ATE), while recent works (Ma et al., 2019; Li et al., 2020) attempt to use sequence-to-sequence techniques with PLMs and achieve relatively promising results. As for Opinion Term Extraction (OTE), most works treat it as an auxiliary task (Wang and Pan, 2018; He et al., 2019; Chen and Qian, 2020). Later research turns to extracting the corresponding opinions (Aspect-oriented Opinion Extraction, AOE) or classifying polarities (Aspect-level Sentiment Classification, ALSC) for every given aspect. Fan et al. (2019) first introduce AOE with their baseline model and datasets. Later, others (Wu et al., 2020; Pouran Ben Veyseh et al., 2020) develop sequence tagging models to follow this track. Work on ALSC started earlier. Most of it (Tang et al., 2016; Wang et al., 2016) incorporates long short-term memory (LSTM) units or attention mechanisms to model the relations between aspects and sentiments. In addition, some works (Chen et al., 2020; Wang et al., 2020; Tang et al., 2020; Li et al., 2021) propose to leverage graph neural networks to utilize syntax and semantic information from extra parsers.

Recent studies propose more fine-grained subtasks, AESC, Pair, and Triplet, for ABSA. Peng et al. (2020) first propose a multi-stage pipeline that performs extraction, pairing, and classification step by step. Xu et al. (2020) then follow the commonly used sequence tagging schema with a position-aware method. Another line of work (Wu et al., 2020) tries to solve Triplet by labeling the relations of word pairs. Xu et al. (2021) go further by labeling span pairs with a pruning method. Moreover, Mao et al. (2021) and Yan et al. (2021) propose to combine a set of subtasks through unified formulations, casting various subtasks as machine reading comprehension or generation tasks, which perform quite well. Our method also solves the task in a unified generative framework, yet it includes explicit modeling of relations between terms through sentiment knowledge enhanced prompt-tuning.

Prompt-tuning

Prompt-tuning is a new fine-tuning paradigm inspired by GPT-3 (Brown et al., 2020), especially for language models in few-shot or zero-shot settings. It prepends instructions and demonstrations to the input and has the model output predictions. The recent PET work (Schick and Schutze, 2021) focuses on a semi-supervised setting with a large number of unlabeled examples. Gao et al. (2020) explore prompt-tuning methods with demonstrations on language models for several benchmark tasks, including Sentiment Classification. Prompt-tuning can induce better performance from PLMs on a wide range of NLP tasks, including text classification (Gao et al., 2020; Zhang et al., 2021c), relation extraction (Han et al., 2021; Chen et al., 2021b), NER (Chen et al., 2021c), entity typing (Ding et al., 2021), and so on. To construct better prompts for downstream tasks, several approaches (Han et al., 2021; Hu et al., 2021) leverage knowledge injection (Zhang et al., 2021a, b) for template and verbalizer construction. Besides, many works use prompting to mine knowledge from PLMs (Davison et al., 2019; Petroni et al., 2019; Talmor et al., 2020). Unlike those works, we focus on ABSA, which is quite different from classification tasks. Furthermore, since handcrafting a well-performing prompt is rather difficult, a number of works focus on automatic prompt search via generation or gradient-based search (Jiang et al., 2020; Shin et al., 2020; Gao et al., 2020). The works mentioned above keep a discrete format for prompt search, whereas Liu et al. (2021) and Li and Liang (2021) instead optimize continuous prompt embeddings.

Methodology

In this section, we first present the task formulation and the sequential output generated to cover all subtasks, then elaborate on our proposed sentiment knowledge enhanced prompt-tuning and the generation module.

Task Formulation

For a given sentence X, we tokenize it into a sequence of tokens X = [x_1, ..., x_n]. The ABSA task aims to provide the aspect terms, the sentiment polarities, and the opinion terms, represented by a, s, and o respectively in our method. a and o are span indexes of terms and s is the polarity class index. The superscripts start and end denote the start and end index of the corresponding token span in the sequence. For our generative framework, the target sequence consists of multiple base predictions. The base predictions of the subtasks are all subsets of the above and are listed as follows (a construction sketch follows the list):

  • For AESC: [a^start, a^end, s]

  • For Pair: [a^start, a^end, o^start, o^end]

  • For Triplet: [a^start, a^end, o^start, o^end, s]
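To make the target construction concrete, the following is a minimal sketch of converting annotated triplets into the index sequences described above. The helper name, the polarity mapping, and the offset convention (placing class indexes right after the n pointer indexes) are illustrative assumptions rather than the released implementation:

```python
# Minimal sketch of target-sequence construction (assumptions, not the
# authors' code): spans are token-level [start, end] indexes into the input,
# and polarity class c is encoded as n + 1 + c, i.e., right after the n
# pointer indexes.

POLARITIES = {"POS": 0, "NEG": 1, "NEU": 2}

def build_target_sequence(n_tokens, triplets, subtask="Triplet"):
    """triplets: list of (aspect_span, opinion_span, polarity) tuples."""
    target = []
    for (a_start, a_end), (o_start, o_end), polarity in triplets:
        if subtask == "AESC":
            target += [a_start, a_end, n_tokens + 1 + POLARITIES[polarity]]
        elif subtask == "Pair":
            target += [a_start, a_end, o_start, o_end]
        else:  # Triplet
            target += [a_start, a_end, o_start, o_end,
                       n_tokens + 1 + POLARITIES[polarity]]
    return target

# "Good Sushi High Price ." (5 tokens) with triplets
# (Sushi, Good, POS) and (Price, High, NEG):
print(build_target_sequence(5, [((1, 1), (0, 0), "POS"),
                                ((3, 3), (2, 2), "NEG")]))
# [1, 1, 0, 0, 6, 3, 3, 2, 2, 7]
```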

In our work, we use BART (Lewis et al., 2020) with a pointer network (Vinyals et al., 2015) for sequence-to-sequence generation. During training, the input is used for both sentiment knowledge enhanced prompt-tuning and index sequence generation. The encoder part is shared across both sides.

Figure 2: The overall architecture of our model: (a) SentiPrompt Tuning; (b) Generation Framework. Embedding vectors shown as light yellow and orange boxes are retrieved from different embedding layers (best viewed in color).

SentiPrompt Tuning

                     Manually Designed                  Auto-constructed
Consistency Prompt   The Sushi is High? [MASK]          [P] Sushi [P] High? [MASK]
                     The Good Sushi is High? [MASK]     [P] Good Sushi [P] High? [MASK]
Polarity Prompt      This is [MASK]                     [P] [P] [MASK]
Table 1: The demonstration of our prompt designs, taking “Good Sushi High Price.” as the input example ([P] denotes a pseudo prompt token in the auto-constructed templates).

The sentiment knowledge of ground truth sentiment terms and polarities is injected in the construction of prompts. The prompt encoder is inserted here to provide trainable and continuous representations for prompts.

Sentiment Knowledge Enhanced Prompt Construction

SentiPrompt tunes the LM with sentiment knowledge enhanced prompts as a masked language modeling (MLM) task during training. We therefore construct prompts as samples for SentiPrompt MLM based on the terms and polarities from the ground truth triplets. To construct a masked prompt training sample, we randomly take an aspect and an opinion from the ground truth triplets to fill in a fixed template. Given an input sequence X (e.g., Good Sushi High Price), a template consists of pseudo prompt tokens [P_i] and randomly sampled aspect and opinion terms a, o (sampled from (Sushi, Price) and (Good, High)). Here each [P_i] can be a specific English word or just a randomly-initialized pseudo token to be optimized. The consistency template is formed as:

T_c(a, o) = [P_0] ... [P_i] a [P_{i+1}] ... [P_j] o [P_{j+1}] ... [P_k] [MASK]    (1)

Besides, we randomly apply span manipulations to the aspects and opinions of some prompts during construction. These prompts are constructed with a sub-span of the aspect and opinion terms if the terms contain more than one token. For single-token cases, we attach a few neighboring tokens of the term to it. On the whole, the created prompts may be consistent with the ground truth triplets or inconsistent with them, depending on the random sample choices and random span manipulations. For the example input, the ground truth triplets are (Sushi, Good, POS) and (Price, High, NEG); all other possible pairs are considered inconsistent. In Table 1, the first example is constructed from the inconsistent pair (Sushi, High) and the second by attaching the left neighbor Good to the aspect Sushi. For chosen pairs that are consistent with the ground truth, we attach an extra polarity judgement prompt as a suffix. The template for the polarity judgement is formed as:

T_p = [P_{k+1}] ... [P_l] [MASK]    (2)

The ground-truth label of this polarity judgement prompt is the corresponding polarity of the original source triplet. For inconsistent pairs, because the aspect and opinion terms are mismatched, judging the polarity of such a pair is meaningless; in this case, only the aspect-opinion consistency part is fed into the model. The whole prompt template is:

T(a, o) = T_c(a, o) ⊕ T_p    (3)
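As a concrete illustration, the following is a minimal sketch (our own hypothetical helper, not the released implementation) of how consistency and polarity prompts could be assembled from sampled aspect-opinion pairs, using the manually designed templates from Table 1:

```python
import random

# Minimal sketch of prompt construction with the manually designed templates
# from Table 1. The sampling and span-manipulation details are simplified
# assumptions, not the authors' exact procedure.

def build_prompt(aspect, opinion, gold_pairs, gold_polarity=None):
    """Return (prompt_text, consistency_label, polarity_label)."""
    consistent = (aspect, opinion) in gold_pairs
    # Consistency prompt: "The <aspect> is <opinion>? [MASK]"
    prompt = f"The {aspect} is {opinion}? [MASK]"
    if consistent:
        # Only consistent pairs get the polarity judgement suffix.
        prompt += " This is [MASK]"
        return prompt, "yes", gold_polarity
    return prompt, "no", None

# Example input "Good Sushi High Price." with gold triplets
# (Sushi, Good, POS) and (Price, High, NEG).
gold = {("Sushi", "Good"): "POS", ("Price", "High"): "NEG"}
aspect, opinion = random.choice([("Sushi", "High"), ("Sushi", "Good")])
print(build_prompt(aspect, opinion, set(gold), gold.get((aspect, opinion))))
```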

In this way, we can formulate a simple classification task for SentiPrompt MLM through a prompt built from the input sentence, with the extra polarity judgement prompt as the suffix. Take a manually designed one as an example:

The a is o ? [MASK]    (4)
This is [MASK]    (5)

We use this as an auxiliary task in which the encoder part decides whether to fill in “yes” (consistent with the ground truth) or “no” (inconsistent) at the masked position. In addition, if the polarity judgement prompt is attached, the encoder part also decides the right polarity class to fill in at that masked position. Let y be the label token for [MASK] (a consistency answer or a polarity class token); the probability of predicting it is:

p([MASK] = y | T) = exp(w_y · h_[MASK]) / Σ_{y'} exp(w_{y'} · h_[MASK])    (6)

where h_[MASK] is the hidden vector at the [MASK] position and w_y is the weight vector before the softmax for word y. For this sentiment knowledge enhanced prompt MLM task, we compute the cross-entropy loss of the predictions on the masked tokens:

L_prompt = − Σ_[MASK] log p([MASK] = y | T)    (7)
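For illustration, here is a minimal PyTorch-style sketch of the masked-token cross-entropy loss in Eq. (6)-(7). The shapes, variable names, and token ids are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the prompt-MLM loss in Eq. (6)-(7).
hidden_size, vocab_size = 768, 50265
mask_hidden = torch.randn(4, hidden_size)        # h_[MASK] for 4 masked positions
lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False)  # rows act as w_y
labels = torch.tensor([8505, 117, 22173, 33407]) # illustrative ids of "yes"/"no"/polarity tokens

logits = lm_head(mask_hidden)                    # w_y · h_[MASK] for every candidate word y
loss_prompt = F.cross_entropy(logits, labels)    # Eq. (7): cross-entropy over the masks
print(loss_prompt)
```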

Prompt Encoder

For better prompt representations, we replace the embeddings of the pseudo prompt tokens [P_i] from BART with differentiable output embeddings produced by a prompt encoder and optimize them for the downstream task. To associate these pseudo prompt tokens with each other, we choose a bidirectional long short-term memory network (BiLSTM) to connect the separate token embeddings, followed by a two-layer multi-layer perceptron (MLP) activated by ReLU to obtain the processed hidden states. The prompt encoder takes the embeddings of the pseudo prompt tokens from a prompt-token embedding layer as input and outputs the optimized embeddings e'([P_i]):

h_i = BiLSTM(e_p([P_0]), ..., e_p([P_k]))_i    (8)
e'([P_i]) = MLP(h_i) = W_2 ReLU(W_1 h_i + b_1) + b_2    (9)

where e_p is the prompt-token embedding layer and h_i is the BiLSTM hidden state at position i. The optimized embedding at each pseudo prompt token position replaces the embedding from the BART embedding layer in SentiPrompt MLM, so the pseudo tokens in T are mapped as:

{[P_0], ..., [P_k]} → {e'([P_0]), ..., e'([P_k])}    (10)

and then fed into the encoder part, while the tokens of a, o, and [MASK] are still embedded by the BART embedding layer e.
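A minimal PyTorch sketch of such a prompt encoder (a BiLSTM followed by a two-layer ReLU MLP) is given below; the hyper-parameters and names are illustrative assumptions rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """BiLSTM + two-layer MLP over pseudo prompt token embeddings (sketch)."""

    def __init__(self, num_pseudo_tokens: int, hidden_size: int):
        super().__init__()
        self.embedding = nn.Embedding(num_pseudo_tokens, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size // 2,
                            num_layers=1, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, pseudo_token_ids: torch.Tensor) -> torch.Tensor:
        # pseudo_token_ids: indexes of [P_0], ..., [P_k]
        emb = self.embedding(pseudo_token_ids).unsqueeze(0)  # (1, k+1, d)
        hidden, _ = self.lstm(emb)                           # Eq. (8)
        return self.mlp(hidden).squeeze(0)                   # Eq. (9): (k+1, d)

# These optimized embeddings replace the BART embeddings at the pseudo-token
# positions of the prompt before it is fed to the shared encoder.
encoder = PromptEncoder(num_pseudo_tokens=6, hidden_size=768)
prompt_embeds = encoder(torch.arange(6))
print(prompt_embeds.shape)  # torch.Size([6, 768])
```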

Prompt Optimization for ABSA

Given a PLM M and an input sequence X, the pre-trained embedding layer e maps every token of X to input embeddings. Let V be the vocabulary of M; the pseudo prompt tokens in T_c and T_p, together with the label words for [MASK], are not restricted to V. As e'([P_i]) is produced by a prompt encoder with trainable parameters, this allows better continuous prompts beyond what the original vocabulary could express. With the downstream loss L, these sentiment knowledge enhanced prompts can be differentially optimized by

ê' = argmin_{e'} L(M(T, X), y)    (11)

Generation Framework for ABSA

For the ABSA task, the model takes a sequence of tokens X as input and aims to generate the target sequence Y as defined above. The input and output sequences start and end with the special start-of-sequence and end-of-sequence tokens <s> and </s>. These special tokens are also generated in Y, but we omit them from the equations for simplicity. Given a sequence of tokens X:

P(Y | X) = ∏_{t=1}^{|Y|} P(y_t | X, Y_{<t})    (12)

To model the index probability for every y_t, we use a BART model with a pointer network, which roughly consists of an Encoder part and a Decoder part.

Figure 3: The index generator in the decoder part. The light and dark yellow boxes are token embeddings and outputs from BART Encoder and blue boxes are from BART Decoder as shown in Figure 2.

Encoder

The encoder part encodes X into the hidden representation space as vectors H^e:

E^e = e(X)    (13)
H^e = BARTEncoder(E^e)    (14)

where H^e ∈ R^{n×d}, d is the hidden state dimension, and e is the embedding layer.

Decoder

The decoder part takes the encoder outputs H^e and the previous decoder outputs Y_{<t} as inputs and decodes y_t. As the elements of Y_{<t} are indexes, an Index-to-Token Converter is applied to convert them into tokens:

ŷ_t = x_{y_t}        if y_t ≤ n (a pointer index)
ŷ_t = C_{y_t − n}    if y_t > n (a class index)    (15)

where C is the polarity class token list (the polarity class indexes always start right after the pointer indexes of the given sequence, at n + 1). After this, we use the BART decoder to encode the converted tokens Ŷ_{<t} and obtain the decoder hidden state h^d_t for step t:

h^d_t = BARTDecoder(H^e; e(Ŷ_{<t}))    (16)

where h^d_t ∈ R^d, and the probability distribution P_t of the token index y_t can be computed as:

E^e = e(X)    (17)
H̄^e = α H^e + (1 − α) E^e    (18)
P_t = softmax([H̄^e h^d_t ; C^d h^d_t])    (19)

where α is a weighting coefficient, C^d denotes the embeddings of the polarity class tokens, [· ; ·] denotes concatenation, and P_t is the predicted probability distribution of y_t over all candidate indexes (the n pointer indexes plus the class indexes).
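The pointer-style index distribution can be sketched as follows; the fusion coefficient and tensor names are illustrative assumptions, not the exact released implementation:

```python
import torch
import torch.nn.functional as F

# Sketch of the pointer-network index distribution (Eq. 17-19).
# n input tokens, c polarity classes, hidden size d; values are illustrative.
n, c, d = 12, 3, 768
encoder_hidden = torch.randn(n, d)   # H^e from the BART encoder
input_embeds = torch.randn(n, d)     # E^e, token embeddings of the input
class_embeds = torch.randn(c, d)     # C^d, embeddings of the polarity class tokens
decoder_hidden = torch.randn(d)      # h^d_t for the current decoding step
alpha = 0.5                          # assumed fusion coefficient

fused = alpha * encoder_hidden + (1 - alpha) * input_embeds   # H-bar^e
logits_pointer = fused @ decoder_hidden                       # scores over the n tokens
logits_class = class_embeds @ decoder_hidden                  # scores over the c classes
prob = F.softmax(torch.cat([logits_pointer, logits_class]), dim=-1)
print(prob.shape)  # torch.Size([15]): n pointer indexes + c class indexes
```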

Training and Inference

During training, we compute the weighted sum of the cross-entropy losses from sentiment knowledge enhanced prompt-tuning and from the index predictions to optimize the generation framework and the continuous prompt representations:

L = L_gen + λ L_prompt    (20)

where L_gen is the cross-entropy loss of the index predictions, L_prompt is the prompt MLM loss from Eq. (7), and λ is the weighting coefficient.
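A minimal sketch of this joint objective follows; the weighting value and function name are assumptions for illustration, not the paper's configuration:

```python
import torch

def joint_loss(generation_loss: torch.Tensor,
               prompt_mlm_loss: torch.Tensor,
               lambda_weight: float = 0.5) -> torch.Tensor:
    """Weighted sum of the generation loss and the prompt MLM loss (Eq. 20)."""
    return generation_loss + lambda_weight * prompt_mlm_loss

print(joint_loss(torch.tensor(1.2), torch.tensor(0.8)))
```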

During inference, we only use the generation framework. We apply beam search to generate the prediction sequence in an auto-regressive way. Then, a specific decoding process is applied to convert the output sequence into the required terms and polarities. The corresponding decoding algorithm is included in Appendix G.

Experiments

In this section, we first introduce the datasets on which the experiments are conducted and detail the baselines we compare against, then present our main results.

Datasets

We evaluate our method on three versions of four ABSA datasets. The first version is from Peng et al. (2020), who take the paired opinion annotations from Wang et al. (2017) and the corresponding aspect annotations from Fan et al. (2019) and refine the data into <aspect, opinion, polarity> triplet form. The second version (Xu et al., 2020) is a revision of the first, in which the missing triplets with overlapping opinions in Peng et al. (2020) are added. However, from our observation of the four datasets in this second version, there still exist quite a few unlabeled triplets, mistakes about polarities, and aspect-opinion mismatches. Thus we make corrections to these datasets and produce our own revised version.

We fix up these datasets for the following problems:

  • Previous annotations rarely include long phrases as opinions, even though such phrases are notably important for classifying sentiments. We add phrases such as verb-object and adjective-to-verb phrases to the annotations to make the corresponding opinion terms more reasonable and complete for their resulting sentiments.

  • Whether the adverbs of the corresponding adjective or verb in the opinion are included is ambiguous in previous annotations. We uniformly include adverbs in our annotations.

  • Some opinion terms are actually directed at a pronoun but were wrongly annotated to a nearby aspect.

Some revised examples and a statistical comparison between the three dataset versions are included in Appendix D and E. We will release our revised version later.

Baselines

To put our results in perspective, we summarize state-of-the-art models on the three ABSA subtasks and compare SentiPrompt with them. Detailed comparisons of the baselines are shown in Appendix F. The three baselines with the suffix “+” denote the variants by Peng et al. (2020) that are capable of AESC, Pair, and Triplet; they are modified by adding an MLP to determine whether a triplet is correct in the matching stage. Part of the baseline results are taken from Xu et al. (2020) and from Yan et al. (2021). The “PE” in Tables 2-4 indicates the prompt encoder. More details about implementations and hyper-parameter settings are given in Appendix A.

Main Results

Model 14res 14lap 15res 16res
AESC Pair Triplet AESC Pair Triplet AESC Pair Triplet AESC Pair Triplet
CMLA+ 70.62 48.95 43.12 56.90 44.10 32.90 53.60 44.60 35.90 61.20 50.00 41.60
RINANTE+ 48.15 46.29 34.03 36.70 29.70 20.0 41.30 35.40 28.0 42.10 30.70 23.30
Li-unified+ 73.79 55.34 51.68 63.38 52.56 42.47 64.95 56.85 46.69 70.20 53.75 44.51
Peng-two-stage 74.19 56.10 51.89 62.34 53.85 43.50 65.79 56.23 46.79 71.73 60.04 53.62
JET-BERT - - 63.92 - - 50.0 - - 54.67 - - 62.98
OTE-MTL - - 45.05 - - 59.67 - - 48.97 - - 55.83
Dual-MRC 76.57 74.93 70.32 64.59 63.37 55.58 65.14 64.97 57.21 70.84 75.71 67.40
SPAN-BART 78.47 77.68 72.46 68.17 66.11 57.59 69.95 67.98 60.11 75.69 77.38 69.98
SentiPrompt 81.09 78.96 75.01 70.79 68.05 60.41 74.20 70.36 64.50 79.81 80.13 74.66
SentiPrompt w/o PE 79.44 78.30 73.70 68.10 68.75 58.75 70.63 67.62 61.63 77.59 78.21 72.69
Table 2: Comparison of F1 scores for AESC, Pair, and Triplet on the dataset version of Peng et al. (2020). We highlight the best results in bold.
Model 14res 14lap 15res 16res
P R F1 P R F1 P R F1 P R F1
CMLA+ 39.18 47.13 42.79 30.09 36.92 33.16 34.56 39.84 37.01 41.34 42.1 41.72
RINANTE+ 31.42 39.38 34.95 21.71 18.66 20.07 29.88 30.06 29.97 25.68 22.3 23.87
Li-unified+ 41.04 67.35 51.0 40.56 44.28 42.34 44.72 51.39 47.82 37.33 54.51 44.31
Peng-two-stage 43.24 63.66 51.46 37.38 50.38 42.87 48.07 57.51 52.32 46.96 64.24 54.21
JET-BERT 70.56 55.94 62.40 55.39 47.33 51.04 64.45 51.96 57.53 70.42 58.37 63.83
GTS-BERT 67.76 67.29 67.50 57.82 51.32 54.36 62.59 57.94 60.15 66.08 69.91 67.93
SPAN-BART 65.52 64.99 65.25 61.41 56.19 58.69 59.14 59.38 59.26 66.60 68.68 67.62
SentiPrompt 72.79 72.94 72.86 63.40 58.60 60.90 62.97 62.06 62.51 70.20 73.35 71.74
SentiPrompt w/o PE 73.65 67.20 70.28 60.42 57.86 59.11 60.61 61.24 60.92 67.90 71.21 69.52
Table 3: Comparison of Precision, Recall, and F1 for Triplet on the dataset version of Xu et al. (2020). We highlight the best results in bold.
Model 14res 14lap 15res 16res
P R F1 P R F1 P R F1 P R F1
OTE-MTL 50.81 58.24 54.27 47.97 39.68 43.44 56.90 38.67 46.05 56.25 53.73 54.96
JET-BERT 72.79 30.01 42.50 61.80 25.40 36.00 69.56 37.13 48.42 68.18 29.30 40.98
GTS-BERT 61.09 60.62 60.85 56.48 49.82 52.94 58.39 55.08 56.68 63.78 61.75 62.75
SPAN-BART 64.36 64.24 64.30 60.41 51.95 55.86 54.00 54.10 54.05 65.19 68.47 66.79
SentiPrompt 66.10 63.37 64.71 61.30 55.32 58.15 61.81 62.06 61.93 68.66 69.04 68.85
SentiPrompt w/o PE 65.65 63.95 64.79 59.88 53.72 56.64 55.84 58.79 57.28 66.79 67.91 67.35
Table 4: Comparison of Precision, Recall, and F1 for Triplet on our revised dataset version. We rerun the baseline models using their open-source implementations on this version and report the results. The best results are highlighted in bold.

On the dataset version of Peng et al. (2020), we compare our method on AESC, Pair, and Triplet; the experimental results are shown in Table 2. Our method obtains convincing improvements on all subtasks of all datasets: 3.38% F1 for AESC, 2.09% F1 for Pair, and 3.60% F1 for Triplet on average. Experimental results for Triplet on the version of Xu et al. (2020) are shown in Table 3. According to the table, our method achieves better performance than the baselines on all datasets, surpassing strong baselines by 7.61% at most, with an average gain of 3.44% F1 for Triplet across all datasets. On our revised version, we compare our method on Triplet in Table 4. SentiPrompt outperforms representative baselines by an average of 2.30% F1 on all datasets. However, compared to the results on the version of Xu et al. (2020), the best performances drop by 8.07%, 3.78%, 5.23%, and 2.99% F1 on 14res, 14lap, 15res, and 16res respectively. As reflected by these significant drops, our revised annotations are more complicated and complete and thus set a higher demand on the capability of models. Additionally, SentiPrompt generally outperforms the model without the prompt encoder, except for 14res of our revised version. In conclusion, the main experimental results demonstrate that SentiPrompt is a new state-of-the-art method for the ABSA task.

To illustrate the differences between models, Figure 4 shows triplet predictions from several baselines and our SentiPrompt on a sample sentence. For the given example, OTE-MTL only makes one of the three triplet predictions. GTS simply takes the token “too” following “margaritas” as its opinion term. SPAN-BART gives the right predictions on “food” and “margaritas”, as does our method, but fails to identify “waitress”. Our method gives a larger span for the opinion on “waitress”, which is comparatively reasonable considering that the annotated span for this case is not complete. Generally speaking, SentiPrompt generates more reasonable and accurate outputs in this case. This indicates that our method helps to extract correct terms and capture accurate relations between them, which benefits polarity classification as well and leads to better overall performance. Yet, while our method achieves significant improvements on 14res, 15res, and 16res, the improvement on 14lap is relatively marginal.

Figure 4: A case study on an input with multiple triplets. GT is the ground truth.

Analysis

Tuning the LM in unified frameworks using prompting methods with sentiment knowledge shows exciting potential, but it also has limitations. We conduct several additional experiments to further discuss SentiPrompt.

Few-shot ABSA

To further evaluate the potential of our method, we conduct few-shot experiments with 10% of the training data on the four datasets. Due to the instability of tuning on small datasets, we measure average performance across 5 different randomly sampled training and development splits, using a fixed set of seeds (5 numbers randomly generated in the range (0, 6000): {544, 3210, 8, 5678, 744}). This gives a more robust measure of model performance and a better estimate of the variance. As previous work on these ABSA subtasks in the few-shot setting is scarce, we adopt the baselines mentioned above for comparison in the few-shot experiments. The comparison of experimental results is given in Appendix B. These figures show that SentiPrompt achieves better performance than all baselines on 14lap and 14res and outperforms the majority of baselines on 15res and 16res. However, our results suffer from high variance: in the worst cases, the performance drops up to 10% F1 below the average. In addition, unlike some other prompt-tuning methods, SentiPrompt does not perform well if most parameters of the LM are frozen, which increases the computational cost of tuning.

Prompt Length Analysis

Additionally, to investigate the effect of prompts of different lengths on the ABSA task, we conduct experiments with prompts of various lengths. For automatically optimized prompts, we simplify the template and control the length by the number of pseudo prompt tokens at three fixed positions of the template; the pseudo prompt tokens at these positions share the same number, and we compare prompts with several such numbers. The manually designed prompts are as shown in Table 1. The comparison figure is included in Appendix C. In most cases, continuously optimized prompts achieve comparable or higher performance than manually designed ones. According to the figure, the number of pseudo prompt tokens in a template has a notable influence on model performance as subtasks and datasets change. As more pseudo tokens provide more trainable parameters, they enhance the capability of the model yet require more data and time to be properly trained.

Multi-triplet Analysis

Besides, we compare the performance of SPAN-BART and SentiPrompt on multi-triplet cases, i.e., inputs with more than one aspect or opinion. Such data account for about 70% of the total. As shown in Table 5, SentiPrompt improves Precision considerably while keeping Recall almost unchanged, resulting in an overall lift in F1. Considering the large proportion of multi-triplet cases in the data, the improvements in the main results can be largely attributed to better performance on them. With the LM capturing relations between multiple aspects and opinions better, SentiPrompt performs better on multi-triplet cases and thus benefits the overall performance.

Model Metric Multi-Triplet
14res 14lap 15res 16res
SPAN-BART P 66.27 63.07 58.29 59.14
R 59.35 48.76 48.05 71.08
F1 62.62 55.00 52.68 64.56
SentiPrompt P 70.13 68.09 64.48 77.03
R 57.93 49.48 52.35 61.76
F1 63.45 57.31 57.79 68.55
Table 5: Extended comparison of results for multi-triplet cases.

Invalid Prediction Analysis

Since these subtasks require formatted sequence output for decoding, as described in Section 3, we estimate the invalid prediction rates for Triplet of the generative frameworks SPAN-BART and SentiPrompt, for error length (Err-length) and error order (Err-order) respectively. A valid triplet prediction should contain five indexes, and each start index should be smaller than or equal to its paired end index. From Table 6, SentiPrompt achieves sufficiently low invalid rates for both error cases on most datasets: the invalid rates are lower on 14res and 15res, and the error-length rate decreases considerably on 16res. The lower invalid rates compared to SPAN-BART demonstrate that a better way of introducing knowledge into unified frameworks leads to well-formatted index sequence predictions, which lay a solid foundation for better performance in practical scenarios.

Model Error(%) 14res 14lap 15res 16res
SPAN-BART Err-length 0.48 0.77 1.41 1.40
Err-order 1.75 3.70 3.26 3.26
SentiPrompt Err-length 0.41 1.01 1.24 0.31
Err-order 1.63 3.68 3.11 4.29
Table 6: The invalid prediction rates (%) for Triplet on the test sets.
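For reference, a minimal sketch of how such validity checks could be computed over predicted index sequences is given below; the chunking into five-index groups reflects the triplet format described above, and the function name and exact error definitions are our assumptions:

```python
def triplet_error_rates(predictions, n_tokens):
    """predictions: list of predicted index sequences, one per sentence.
    Returns the fraction of triplets with a wrong length (Err-length)
    and with start > end within a span (Err-order)."""
    total, err_length, err_order = 0, 0, 0
    for seq, n in zip(predictions, n_tokens):
        # A class index (> n) closes one triplet; together with the buffered
        # span indexes it should form exactly five indexes.
        chunk = []
        for idx in seq:
            chunk.append(idx)
            if idx > n:
                total += 1
                if len(chunk) != 5:
                    err_length += 1
                elif chunk[0] > chunk[1] or chunk[2] > chunk[3]:
                    err_order += 1
                chunk = []
    return err_length / total, err_order / total

# Example: one sentence of 6 tokens with one valid and one mis-ordered triplet.
print(triplet_error_rates([[1, 1, 0, 0, 9, 4, 3, 2, 2, 8]], [6]))
```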

Conclusion

In this work, we propose SentiPrompt to utilize sentiment knowledge enhanced prompts for the ABSA task, which is a task-adaptive modification to a unified generative framework. We inject sentiment knowledge regarding aspects, opinions, and polarities from the ground truth into prompts and explicitly model term relations via consistency and polarity judgment templates. Through SentiPrompt MLM during training, the encoder part of the framework learns to explicitly model sentiment relations between aspects and opinions. Likewise, we introduce sentiment knowledge for accurate term extraction and sentiment classification through this tuning process. Additionally, due to mistakes and incompleteness in existing annotations, we revise four public datasets to form our own version. We conduct experiments on three versions of four benchmark datasets, and SentiPrompt achieves significant improvements on ABSA subtasks on all versions of the datasets.

References

  • Z. Bi, N. Zhang, G. Ye, H. Yu, X. Chen, and H. Chen (2021) Interventional aspect-based sentiment analysis. arXiv preprint arXiv:2104.11681. Cited by: Aspect-based Sentiment Analysis.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. External Links: Link Cited by: Prompt-tuning.
  • C. Chen, Z. Teng, and Y. Zhang (2020) Inducing target-specific latent structures for aspect sentiment classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 5596–5607. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • S. Chen, Y. Wang, J. Liu, and Y. Wang (2021a) Bidirectional machine reading comprehension for aspect sentiment triplet extraction. In AAAI, Cited by: Introduction.
  • X. Chen, X. Xie, N. Zhang, J. Yan, S. Deng, C. Tan, F. Huang, L. Si, and H. Chen (2021b) Adaprompt: adaptive prompt-based finetuning for relation extraction. arXiv preprint arXiv:2104.07650. Cited by: Prompt-tuning.
  • X. Chen, N. Zhang, L. Li, X. Xie, S. Deng, C. Tan, F. Huang, L. Si, and H. Chen (2021c) LightNER: a lightweight generative framework with prompt-guided attention for low-resource ner. arXiv preprint arXiv:2109.00720. Cited by: Prompt-tuning.
  • Z. Chen and T. Qian (2020) Relation-aware collaborative learning for unified aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 3685–3694. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • J. Davison, J. Feldman, and A. Rush (2019) Commonsense knowledge mining from pretrained models. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). External Links: Link, Document Cited by: Prompt-tuning.
  • N. Ding, Y. Chen, X. Han, G. Xu, P. Xie, H. Zheng, Z. Liu, J. Li, and H. Kim (2021) Prompt-learning for fine-grained entity typing. arXiv preprint arXiv:2108.10604. Cited by: Prompt-tuning.
  • Z. Fan, Z. Wu, X. Dai, S. Huang, and J. Chen (2019) Target-oriented opinion words extraction with target-fused neural sequence labeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2509–2518. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis, Datasets.
  • T. Gao, A. Fisch, and D. Chen (2020) Making pre-trained language models better few-shot learners. ArXiv abs/2012.15723. Cited by: Prompt-tuning.
  • X. Han, W. Zhao, N. Ding, Z. Liu, and M. Sun (2021) PTR: prompt tuning with rules for text classification. CoRR abs/2105.11259. External Links: Link, 2105.11259 Cited by: Introduction, Prompt-tuning.
  • R. He, W. S. Lee, H. T. Ng, and D. Dahlmeier (2019) An interactive multi-task learning network for end-to-end aspect-based sentiment analysis. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 504–515. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • S. Hu, N. Ding, H. Wang, Z. Liu, J. Li, and M. Sun (2021) Knowledgeable prompt-tuning: incorporating knowledge into prompt verbalizer for text classification. arXiv preprint arXiv:2108.02035. Cited by: Prompt-tuning.
  • Z. Jiang, F. F. Xu, J. Araki, and G. Neubig (2020) How can we know what language models know?. Transactions of the Association for Computational Linguistics 8, pp. 423–438. External Links: ISSN 2307-387X, Link, Document Cited by: Introduction, Prompt-tuning.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. External Links: Link, Document Cited by: Task Formulation.
  • K. Li, C. Chen, X. Quan, Q. Ling, and Y. Song (2020) Conditional augmentation for aspect term extraction via masked sequence-to-sequence generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7056–7066. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • R. Li, H. Chen, F. Feng, Z. Ma, X. Wang, and E. Hovy (2021) Dual graph convolutional networks for aspect-based sentiment analysis. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 6319–6329. External Links: Link Cited by: Aspect-based Sentiment Analysis.
  • X. L. Li and P. Liang (2021) Prefix-tuning: optimizing continuous prompts for generation. ArXiv abs/2101.00190. Cited by: Prompt-tuning.
  • X. Li, L. Bing, P. Li, W. Lam, and Z. Yang (2018) Aspect term extraction with history attention and selective transformation. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 4194–4200. External Links: Document, Link Cited by: Aspect-based Sentiment Analysis.
  • X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang (2021) GPT understands, too. External Links: 2103.10385 Cited by: Prompt-tuning.
  • D. Ma, S. Li, F. Wu, X. Xie, and H. Wang (2019) Exploring sequence-to-sequence learning in aspect term extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3538–3547. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • Y. Mao, Y. Shen, C. Yu, and L. Cai (2021) A joint training dual-mrc framework for aspect based sentiment analysis. In AAAI, Cited by: Introduction, Aspect-based Sentiment Analysis.
  • H. Peng, L. Xu, L. Bing, F. Huang, W. Lu, and L. Si (2020) Knowing what, how and why: a near complete solution for aspect-based sentiment analysis. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 8600–8607. External Links: ISSN 2159-5399, Link, Document Cited by: Introduction, Introduction, Aspect-based Sentiment Analysis, Datasets, Baselines.
  • F. Petroni, T. Rocktäschel, S. Riedel, P. Lewis, A. Bakhtin, Y. Wu, and A. Miller (2019) Language models as knowledge bases?. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). External Links: Link, Document Cited by: Prompt-tuning.
  • M. Pontiki, D. Galanis, J. Pavlopoulos, H. Papageorgiou, I. Androutsopoulos, and S. Manandhar (2014) SemEval-2014 task 4: aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, pp. 27–35. External Links: Link, Document Cited by: Introduction.
  • A. Pouran Ben Veyseh, N. Nouri, F. Dernoncourt, D. Dou, and T. H. Nguyen (2020) Introducing syntactic structures into target opinion word extraction with deep learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 8947–8956. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • T. Schick and H. Schutze (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In EACL, Cited by: Introduction, Prompt-tuning.
  • T. Shin, Y. Razeghi, R. L. Logan IV, E. Wallace, and S. Singh (2020) AutoPrompt: eliciting knowledge from language models with automatically generated prompts. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). External Links: Link, Document Cited by: Introduction, Prompt-tuning.
  • A. Talmor, Y. Elazar, Y. Goldberg, and J. Berant (2020) OLMpics-on what language model pre-training captures. Transactions of the Association for Computational Linguistics 8, pp. 743–758. External Links: ISSN 2307-387X, Link, Document Cited by: Prompt-tuning.
  • D. Tang, B. Qin, X. Feng, and T. Liu (2016) Effective LSTMs for target-dependent sentiment classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 3298–3307. External Links: Link Cited by: Aspect-based Sentiment Analysis.
  • H. Tang, D. Ji, C. Li, and Q. Zhou (2020) Dependency graph enhanced dual-transformer structure for aspect-based sentiment classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6578–6588. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28, pp. . External Links: Link Cited by: Task Formulation.
  • K. Wang, W. Shen, Y. Yang, X. Quan, and R. Wang (2020) Relational graph attention network for aspect-based sentiment analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 3229–3238. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • W. Wang, S. J. Pan, D. Dahlmeier, and X. Xiao (2017) Coupled multi-layer attentions for co-extraction of aspect and opinion terms. In AAAI, Cited by: Datasets.
  • W. Wang and S. J. Pan (2018) Recursive neural structural correspondence network for cross-domain aspect and opinion co-extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 2171–2181. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • Y. Wang, M. Huang, X. Zhu, and L. Zhao (2016) Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 606–615. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • Z. Wu, C. Ying, F. Zhao, Z. Fan, X. Dai, and R. Xia (2020) Grid tagging scheme for aspect-oriented fine-grained opinion extraction. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 2576–2585. External Links: Link, Document Cited by: Introduction, Aspect-based Sentiment Analysis.
  • Z. Wu, F. Zhao, X. Dai, S. Huang, and J. Chen (2020) Latent opinions transfer network for target-oriented opinion words extraction. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 9298–9305. External Links: Link, Document Cited by: Aspect-based Sentiment Analysis.
  • L. Xu, Y. K. Chia, and L. Bing (2021) Learning span-level interactions for aspect sentiment triplet extraction. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 4755–4766. External Links: Link, Document Cited by: Introduction, Aspect-based Sentiment Analysis.
  • L. Xu, H. Li, W. Lu, and L. Bing (2020) Position-aware tagging for aspect sentiment triplet extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 2339–2349. External Links: Link, Document Cited by: Introduction, Aspect-based Sentiment Analysis, Aspect-based Sentiment Analysis, Datasets, Baselines.
  • H. Yan, J. Dai, X. Qiu, Z. Zhang, et al. (2021) A unified generative framework for aspect-based sentiment analysis. arXiv preprint arXiv:2106.04300. Cited by: Introduction, Aspect-based Sentiment Analysis, Baselines.
  • N. Zhang, S. Deng, X. Cheng, X. Chen, Y. Zhang, W. Zhang, and H. Chen (2021a) Drop redundant, shrink irrelevant: selective knowledge injection for language pretraining. In IJCAI, Cited by: Prompt-tuning.
  • N. Zhang, Q. Jia, S. Deng, X. Chen, H. Ye, H. Chen, H. Tou, G. Huang, Z. Wang, N. Hua, et al. (2021b) AliCG: fine-grained and evolvable conceptual graph construction for semantic search at alibaba. arXiv preprint arXiv:2106.01686. Cited by: Prompt-tuning.
  • N. Zhang, L. Li, X. Chen, S. Deng, Z. Bi, C. Tan, F. Huang, and H. Chen (2021c) Differentiable prompt makes pre-trained language models better few-shot learners. arXiv preprint arXiv:2108.13161. Cited by: Prompt-tuning.

A. Hyper-parameter settings in our experiments

All experiments are completed on a single 32G Nvidia V100 GPU. It takes about 2 hours to finish 50 epochs. Our model is built on the BART-base pre-trained model. The hyper-parameters used to get the best results on every dataset are listed in Table 7.

Dataset 14lap 14res 15res 16res
Training Epochs 50
Batch Size 8 32
Learning Rate 5e-5 1e-4
Hidden Size 1024
Table 7: Hyper-parameters used to get the best performance in training

For AESC, Pair, and Triplet, a prediction is correct only when all the span boundaries are exactly matched and the predicted sentiment polarity is accurately classified. For the dataset version of Peng et al. (2020), we report F1 scores of the AESC, Pair, and Triplet tasks for all datasets. For the version of Xu et al. (2020) and our revised version, we report the precision (P), recall (R), and F1 scores of the Triplet task.

B. Experimental results under few-shot setting scenario

Figure 5: Comparison of F1 scores for Triplet under the few-shot setting with 10% of the training data.

C. Experimental results for different prompts

We compare automatically constructed prompts of different lengths with the handcrafted one. The results are shown in Figure 6.

Figure 6: Comparison of F1 scores for AESC, Pair, and Triplet for prompts of different lengths.

D. Examples of our revisions to the annotations

Due to the three problems in previous datasets mentioned in the paper, we revise some annotations to form our own version. To illustrate these problems, we pick some typical annotations from the previous version and our corresponding revised annotations as examples in Table 8.

Example 1
Sentence: The battery lasts as advertised (give or take 15-20 minutes), and the entire user experience is very elegant.
Previous annotation: (user experience, elegant, POS)
Our revision: (battery, lasts as advertised, POS); (user experience, elegant, POS)

Example 2
Sentence: We ordered a tuna melt-it came with out cheese which just made it a tuna sandwich.
Previous annotation: (tuna melt, with out, NEG); (cheese, with out, NEG); (tuna sandwich, with out, NEG)
Our revision: (tuna melt, with out cheese, NEG)

Example 3
Sentence: What can you say about a place where the waitress brings out the wrong entree, then verbally assaults your 80 year old grandmother and gives her lip about sending it back (which she did politely, by the way).
Previous annotation: (entree, wrong, NEU)
Our revision: (waitress, brings out the wrong entree, NEG); (waitress, verbally assaults your 80 year old grandmother, NEG); (waitress, gives her lip about sending it, NEG)

Table 8: Some examples of our revisions to the annotations. In the previous version, some aspect-opinion pairs are missing and some are wrongly matched, and some complex cases are not properly annotated. In our version, we correct those mistakes and use long phrases such as verb-object phrases to annotate complex samples.

E. Statistical comparison between three versions of four datasets.

A detailed statistical comparison between the three dataset versions is presented in Table 9. Our revised version contains a few more triplet annotations than the version of Xu et al. (2020) due to our revision.

Dataset                        14res        14lap       15res       16res
Peng et al. (2020) version
  train                        1300 2145    920 1265    593 923     842 1289
  dev                          323 524      228 337     148 238     210 316
  test                         496 862      339 490     318 455     320 465
Xu et al. (2020) version
  train                        1266 2338    906 1460    605 1013    857 1394
  dev                          310 577      219 346     148 249     210 339
  test                         492 994      328 543     148 485     326 514
Our revised version
  train                        1266 2436    906 1465    605 1036    857 1459
  dev                          310 598      219 372     148 273     210 355
  test                         492 1043     328 567     148 512     326 536
Table 9: The statistics of the four datasets; each cell lists the number of sentences followed by the number of annotated triplets for the corresponding split. These three versions of the four datasets can be applied to all the ABSA subtasks mentioned above, including AE, OE, ALSC, AOE, AESC, Pair, and Triplet.

F. Comparison between different baselines in our experiments

To make a fair comparison between the baselines used in our experiments, we summarize their task formulations, the backbones of their models, and the subtasks they can perform in Table 10.

Baselines E2E Task Formulation Backbone AE OE ALSC AOE AESC Pair Triplet
RINANTE+ - Seq.Tagging LSTM+CRF -
CMLA+ - Seq.Tagging Attention -
Li-unified+ - Seq.Tagging LSTM -
Peng-two-stage - Seq.Tagging LSTM+GCN -
JET-BERT Seq.Tagging BERT -
Dual-MRC - Span.MRC BERT -
OTE-MTL Seq.Tagging LSTM - - - - - -
GTS-BERT Grid.Tagging BERT -
JET-BERT Seq.Tagging BERT -
SPAN-BART Span.Generation BART
Ours Span.Generation BART
Table 10: The baselines in our experiments. “E2E” is short for End-to-End, which means the model should output all the subtasks’ results synchronously rather than requiring any preconditions, e.g., pipeline methods.

G. The decoding algorithm for the conversion process

The decoding algorithm for converting the predicted index sequence into aspect spans, opinion spans, and polarities is shown in Algorithm 1.

1: Input: n, the number of tokens in X; target sequence Y = [y_1, ..., y_m]; buffer B = []
2: Output: target span set T
3: t ← 1
4: while t ≤ m do
5:     y_t ← the t-th index of Y
6:     if y_t > n then ▷ y_t is a polarity class index and closes a triplet
7:         add (B[1], B[2], B[3], B[4], y_t − n) to T as (a_start, a_end, o_start, o_end, s)
8:         B ← []
9:     else
10:         append y_t to B
11:     end if
12:     t ← t + 1
13: end while
14: return T
Algorithm 1 Decoding Algorithm for the Triplet
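A Python rendering of the same decoding logic, under the assumption (consistent with Section 3) that polarity class indexes come right after the n pointer indexes, might look like this; the class order and function name are illustrative:

```python
POLARITY_CLASSES = ["POS", "NEG", "NEU"]  # assumed class order for illustration

def decode_triplets(index_sequence, n_tokens):
    """Convert a predicted index sequence into (aspect span, opinion span, polarity) triplets."""
    triplets, buffer = [], []
    for idx in index_sequence:
        if idx > n_tokens:  # a polarity class index closes the current triplet
            if len(buffer) == 4 and buffer[0] <= buffer[1] and buffer[2] <= buffer[3]:
                aspect = (buffer[0], buffer[1])
                opinion = (buffer[2], buffer[3])
                triplets.append((aspect, opinion, POLARITY_CLASSES[idx - n_tokens - 1]))
            buffer = []     # drop invalid predictions (wrong length or order)
        else:
            buffer.append(idx)
    return triplets

# "Good Sushi High Price ." (5 tokens): predicted sequence for two triplets.
print(decode_triplets([1, 1, 0, 0, 6, 3, 3, 2, 2, 7], n_tokens=5))
# [((1, 1), (0, 0), 'POS'), ((3, 3), (2, 2), 'NEG')]
```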