Weakly-Supervised Hierarchical Models for Predicting Persuasive Strategies in Good-faith Textual Requests

01/16/2021 · Jiaao Chen et al. · Georgia Institute of Technology

Modeling persuasive language has the potential to better facilitate our decision-making processes. Despite its importance, computational modeling of persuasion is still in its infancy, largely due to the lack of benchmark datasets that can provide quantitative labels of persuasive strategies to expedite this line of research. To this end, we introduce a large-scale multi-domain text corpus for modeling persuasive strategies in good-faith text requests. Moreover, we design a hierarchical weakly-supervised latent variable model that can leverage partially labeled data to predict such associated persuasive strategies for each sentence, where the supervision comes from both the overall document-level labels and very limited sentence-level labels. Experimental results showed that our proposed method outperformed existing semi-supervised baselines significantly. We have publicly released our code at https://github.com/GT-SALT/Persuasion_Strategy_WVAE.


Introduction

Persuasive communication has the potential to bring significant positive and pro-social outcomes to our society Hovland et al. (1971). For instance, persuasion can help raise funds for charities and philanthropic organizations, or convince substance-abusing family members to seek professional help. Given the nature of persuasion, it is of great importance to study how and why persuasion works in language. Modeling persuasive language is challenging for natural language understanding because it is difficult to quantify the persuasiveness of requests, and even harder to generalize persuasive strategies learned from one domain to another. Although researchers in social psychology have offered useful advice on understanding persuasion, most of these studies have been conducted from a qualitative perspective (Bartels, 2006; Popkin, 1994). Computational modeling of persuasion is still in its infancy, largely due to the lack of benchmarks that provide a unified, representative corpus to facilitate this line of research, with a few exceptions like (Luu et al., 2019a; Atkinson et al., 2019; Wang et al., 2019).

Most existing datasets concerning persuasive text are either (1) too small (e.g., on the order of hundreds of examples) for current machine learning models Yang et al. (2019), or (2) not representative for understanding persuasive strategies because they only cover one specific domain Wang et al. (2019). To make persuasion research and technology maximally useful, both for practical use and for scientific study, a generic and representative corpus is a must: one that represents persuasive language in a way that is not exclusively tailored to any one specific dataset or platform. To fill these gaps, building on theoretical work on persuasion and the prior empirical studies above, we first introduce a set of generic persuasive strategies and a multi-domain corpus to understand the different persuasion strategies that people use in requests with different persuasion goals across various domains.

However, constructing a large-scale dataset with persuasive strategy labels is time-consuming and expensive. To mitigate the cost of labeling fine-grained sentence-level persuasive strategies, we then introduce a simple but effective weakly-supervised hierarchical latent variable model that leverages mainly global, document-level labels (e.g., the overall persuasiveness of a textual request) alongside limited sentence annotations to predict sentence-level persuasion strategies. Our work is inspired by prior work in computer vision Oquab et al. (2015) that used global image-level labels to classify local objects. Intuitively, our model is hierarchically semi-supervised: sentence-level latent variables reconstruct the input sentences, and the latent variables of all sentences are aggregated to predict document-level persuasiveness. Specifically, at the sentence level, we utilize two latent variables representing persuasion strategies and context separately, in order to disentangle label-oriented and content-specific information when performing reconstruction; at the document level, we encode these two latent variables together to predict the overall document labels, in the hope that this supervises the learning of sentence-level persuasive strategies. To sum up, our contributions include:

  1. A set of generic persuasive strategies based on theoretical and empirical studies, and a relatively large-scale dataset with annotations of persuasive strategies in three domains.

  2. A hierarchical weakly-supervised latent variable model to predict persuasive strategies with partially labeled data.

  3. Extensive experiments that demonstrate the effectiveness of our models and visualize the importance of the proposed persuasion strategies.

Related Work

There has been much attention paid to computational persuasive language understanding Guo et al. (2020); Atkinson et al. (2019); Lukin et al. (2017); Yang and Kraut (2017); Shaikh et al. (2020). For instance, Tan et al. (2016) looked at how interaction dynamics, such as the language interplay between opinion holders and other participants, predict persuasiveness in the ChangeMyView subreddit. Althoff et al. (2014) studied donations in Random Acts of Pizza on Reddit, using the social relations between recipient and donor plus linguistic factors such as narratives to predict the success of these altruistic requests. Although prior work offered predictive and insightful models, most determined their persuasion labels or variables without reference to a taxonomy of persuasion techniques. Yang et al. (2019) identified the persuasive strategies employed in each sentence of textual requests from crowdfunding websites in a semi-supervised manner. Wang et al. (2019) looked at utterances in persuasive dialogues and annotated a corpus with different persuasion strategies such as self-modeling, foot-in-the-door, and credibility, together with classifiers to predict such strategies at the sentence level. These works mainly focused on a small subset of persuasion strategies and on identifying them in a specific context. Inspired by this prior work, we propose a generic and representative set of persuasion strategies to capture the various strategies that people use in their requests.

Commitment
  Definition: The persuader indicates their intention to take action, or justifies an earlier decision to convince others that they have made the correct choice.
  e.g., "I just lent to Auntie Fine's Donut Shop." (Kiva)
  Prior work: Commitment (Yang et al., 2019); Self-modeling (Wang et al., 2019); Commitment (Vargheese et al., 2020a)

Emotion
  Definition: Making the request full of emotional valence and arousal to influence others.
  e.g., "Guys I'm desperate." (Borrow); "I've been in the lowest depressive state of my life." (RAOP)
  Prior work: Ethos (Carlile et al., 2018); Emotion appeal (Carlile et al., 2018); Sentiment (Durmus and Cardie, 2018); Emotion words (Luu et al., 2019b); Emotion (Asai et al., 2020)

Politeness
  Definition: The use of polite language in requests.
  e.g., "Your help is deeply appreciated!" (Borrow)
  Prior work: Politeness (Durmus and Cardie, 2018); Politeness (Althoff et al., 2014); Politeness (Nashruddin et al., 2020)

Reciprocity
  Definition: Responding to a positive action with another positive action; people are more likely to help if they have received help themselves.
  e.g., "I will pay 5% interest no later than May 1, 2016." (Borrow); "I'll pay it forward with my first check." (RAOP)
  Prior work: Reciprocity (Althoff et al., 2014); Reciprocity (Roethke et al., 2020); Reciprocity (Vargheese et al., 2020a)

Scarcity
  Definition: Emphasizing the urgency or rarity of one's needs.
  e.g., "Need this loan urgently." (Borrow); "I haven't ate a meal in two days." (RAOP); "Loan expiring today and still needs $650." (Kiva)
  Prior work: Scarcity (Vargheese et al., 2020a); Scarcity (Yang et al., 2019); Scarcity (Lawson et al., 2020)

Credibility
  Definition: Using credentials to establish credibility and earn others' trust.
  e.g., "Can provide any documentation needed." (Borrow); "She has already repaid 2 previous loans." (Kiva)
  Prior work: Credibility appeal (Wang et al., 2019); Social proof (Roethke et al., 2020); Social proof (Vargheese et al., 2020b)

Evidence
  Definition: Providing concrete facts or evidence for the narrative or request.
  e.g., "My insurance was canceled today." (Borrow); "There is a Pizza Hut and a Dominos near me." (RAOP); "$225 to go and 1 A+ member on the loan." (Kiva)
  Prior work: Evidentiality (Althoff et al., 2014); Evidence (Carlile et al., 2018); Evidence (Stab and Gurevych, 2014); Concreteness (Yang et al., 2019); Evidence (Durmus and Cardie, 2018)

Impact
  Definition: Emphasizing the importance or impact of the request.
  e.g., "I will use this loan to pay my rent." (Borrow); "This loan will help him with his business." (Kiva)
  Prior work: Logos (Carlile et al., 2018); Logic appeal (Wang et al., 2019); Impact (Yang et al., 2019)

Table 1: The generic taxonomy of persuasive strategies, their definitions, example sentences, and connections with prior work.

Recently, many semi-supervised learning approaches have been developed for natural language processing, including adversarial training Miyato et al. (2016), variational auto-encoders Kingma et al. (2014); Yang et al. (2017); Gururangan et al. (2019), consistency training Xie et al. (2020); Chen et al. (2020, 2020), and various pre-training techniques Kiros et al. (2015); Dai and Le (2015). Contextual word representations Peters et al. (2018); Devlin et al. (2019) have emerged as powerful mechanisms for making use of large-scale unlabeled data. Most of these prior works focus on semi-supervised learning, in which labels are partially available and the supervision for both labeled and unlabeled data is at the sentence level. In contrast, our work is hierarchically weakly supervised: we aim to predict sentence-level labels rather than document-level persuasiveness. To the best of our knowledge, weakly supervised learning has been explored much less in natural language processing, except for a few recent works Lee et al. (2019); Min et al. (2019) in question answering. There are a few exceptions: Yang et al. (2019) utilized a small amount of hand-labeled sentences together with a large number of requests automatically labeled at the document level for text classification, and Pryzant et al. (2017) proposed an adversarial objective to learn text features highly predictive of advertisement outcomes. Our work has an analogous task in computer vision, weakly supervised image segmentation Papandreou et al. (2015); Pinheiro and Collobert (2015), which uses image labels or bounding box information to predict pixel-level labels. Similar to image segmentation, obtaining global (document/image) level labels for persuasive understanding is much cheaper than obtaining local (sentence/pixel) level labels. Different from multi-task learning, where models have full supervision in each task, our proposed model is fully supervised at the document level but only partially supervised at the sentence level.

Persuasion Taxonomy and Corpus

Previous work modeling persuasion in language either focuses on a small subset of strategies or looks at one specific platform, making it hard to adapt to other contexts. To fill this gap, we propose a set of generic persuasive strategies based on widely used persuasion models from social psychology. Specifically, we leverage Petty and Cacioppo's elaboration likelihood model (1986) and Chaiken's social information processing model (Chaiken, 1980), which suggest that people process information in two ways: either performing a relatively deep analysis of the quality of an argument, or relying on simple superficial cues to make decisions (Cialdini, 2001). Guided by these insights from psychology, we examine the aforementioned computational studies on persuasion and argumentation Wang et al. (2019); Yang et al. (2019); Durmus and Cardie (2018); Vargheese et al. (2020b); Carlile et al. (2018), and synthesize these theoretical and practical tactics into eight unified categories: Commitment, Emotion, Politeness, Reciprocity, and Scarcity, which allow people to use simple inferential rules to make decisions, and Credibility, Evidence, and Impact, which require people to evaluate the information based on its merits, logic, and importance. As shown in Table 1, our taxonomy distills, extends, and unifies existing persuasion strategies. Different from prior work that introduced domain-specific persuasion tactics with limited generalizability, our generic taxonomy can be easily plugged into different text domains, making large-scale understanding of persuasion in language comparable and replicable across multiple contexts.

Dataset Collection & Statistics

We collected our data from three different domains related to persuasion: (1) Kiva (www.kiva.org), a peer-to-peer philanthropic lending platform where persuading others to make loans is key to success (no interest); (2) the subreddit r/Random_Acts_Of_Pizza (RAOP, www.reddit.com/r/Random_Acts_Of_Pizza), where members write requests to ask for free pizzas (social purpose, no direct money transaction); and (3) the subreddit r/borrow (Borrow, www.reddit.com/r/borrow), which focuses on posts asking to borrow money from others (with interest). After removing personal and sensitive information, we obtained 40,466 posts from Kiva, 18,026 posts from RAOP, and 49,855 posts from Borrow.

We sampled 5% of documents for annotation, with document lengths ranging from 1 to 6 sentences in Kiva, 1 to 8 in RAOP, and 1 to 7 in Borrow: documents with at most 6 sentences account for 89% of posts in Kiva, 86% of posts in RAOP have no more than 8 sentences, and 85% of posts in Borrow have at most 7 sentences. We recruited four research assistants to label the persuasion strategies for each sentence in the sampled documents. Definitions and examples of the different persuasion strategies were provided, together with a training session in which we asked annotators to annotate a number of example sentences and walked them through any disagreed annotations. To assess the reliability of the annotated labels, we then asked them to annotate the same 100 documents with 400 sentences and computed Cohen's kappa coefficient to measure inter-rater reliability. We obtained an average score of 0.538 on Kiva, 0.613 on RAOP, and 0.623 on Borrow, which indicates moderate agreement McHugh (2012). The annotators then annotated the remaining 1,200 documents independently.
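Agreement scores like these can be reproduced with scikit-learn. Below is a minimal sketch that averages pairwise Cohen's kappa over annotators; the `annotations` mapping and the toy labels are hypothetical stand-ins for the real annotation files:

```python
# Sketch: average pairwise Cohen's kappa across annotators.
# `annotations` is a hypothetical dict: annotator id -> per-sentence
# strategy labels for the same shared set of sentences.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def average_pairwise_kappa(annotations):
    pairs = list(combinations(sorted(annotations), 2))
    scores = [cohen_kappa_score(annotations[a], annotations[b]) for a, b in pairs]
    return sum(scores) / len(scores)

toy = {
    "ann1": ["Evidence", "Impact", "Politeness", "Evidence"],
    "ann2": ["Evidence", "Impact", "Scarcity", "Evidence"],
}
print(average_pairwise_kappa(toy))
```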

The dataset statistics are shown in Table 2, and the sentence-level label distribution for each dataset is shown in Figure 1. We merged rare strategies into an Other category: specifically, Commitment, Scarcity, and Emotion in Borrow; Credibility and Commitment in RAOP; and Reciprocity and Emotion in Kiva. We used whether the requester received pizzas or loans as the document-level labels for RAOP and Borrow; 30.1% of requesters successfully got pizzas on RAOP and 48.5% received loans on Borrow. For Kiva, we used the number of people who lent as the document-level label, further bucketed into four ranges, accounting for 44.1%, 20.3%, 12.4%, and 33.2% of all documents.

Figure 1: The distribution of each persuasion strategy in the three annotated datasets.
Dataset   # Docs   # Sents w/ label   # Sents w/o label   Doc Labels        Sent Labels
Borrow    49,855   5,800              164,293             Success or not    Evidence, Impact, Politeness, Reciprocity, Credibility
RAOP      18,026   3,600              77,517              Success or not    Evidence, Impact, Politeness, Reciprocity, Scarcity, Emotion
Kiva      40,466   6,300              135,330             # People loaned   Evidence, Impact, Politeness, Credibility, Scarcity, Commitment
Table 2: Dataset statistics. Strategies that are rare were merged into an Other category.

Method

To alleviate the dependence on labeled data, we propose a hierarchical weakly-supervised latent variable model that leverages partially labeled data to predict sentence-level persuasive strategies. Specifically, we introduce a sentence-level latent variable model that reconstructs the input sentence and predicts the sentence-level persuasion labels simultaneously, supervised by the global or document-level labels (e.g., the overall persuasiveness of the documents). The overall architecture of our method is shown in Figure 2.

Weakly Supervised Latent Model

We are given a corpus of documents $D = \{d_1, \dots, d_{|D|}\}$, where each document $d = \{s_1, \dots, s_n\}$ consists of $n$ sentences. For each document $d$, its document-level label is denoted as $c$, representing the overall persuasiveness of the document. We divide the corpus into two parts: $D = D_L \cup D_U$, where $D_L$ ($D_U$) denotes the set of documents with (without) sentence labels. For each document $d \in D_L$, the corresponding sentence labels are $\{y_1, \dots, y_n\}$, where $y_j \in \{1, \dots, K\}$ represents the persuasive strategy of a given sentence. In many practical scenarios, getting document-level labels is much easier and cheaper than getting fine-grained sentence labels, since the number of sentences in a document can be very large. Similarly, in our setting, the number of documents with fully labeled sentences is very limited, i.e., $|D_L| \ll |D_U|$. To this end, we introduce a novel hierarchical weakly supervised latent variable model that can leverage both the document-level labels and the small amount of sentence-level labels to discover the sentence persuasive strategies. Our model is weakly supervised since we utilize document labels to facilitate the learning of sentence persuasive strategies. The intuition is that global document labels of persuasiveness carry useful information about local sentence persuasive strategies, and thus can provide supervision in an indirect manner.
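To make the setting concrete, the sketch below shows one possible data layout for $D_L$ and $D_U$; the field names are illustrative and not taken from the released code:

```python
# Illustrative data layout for the weakly supervised setting: every
# document has a document-level label; sentence labels exist only for
# the small labeled subset D_L. Field names are hypothetical.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Document:
    sentences: List[str]
    doc_label: int                           # overall persuasiveness c
    sent_labels: Optional[List[int]] = None  # strategy labels y; None for D_U

corpus = [
    Document(["Need this loan urgently.", "Can provide any documentation needed."],
             doc_label=1, sent_labels=[4, 5]),     # in D_L
    Document(["Guys I'm desperate."], doc_label=0),  # in D_U: no sentence labels
]
```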

Figure 2: Overall architecture. At the sentence level, input sentences are first encoded into two latent variables: a strategy variable $y$ and a context variable $z$; the decoder then reconstructs the input sentences. At the document level, a predictor network aggregates the latent variables within the input document to predict document-level labels. For labeled documents, the labels are used directly for reconstruction and prediction; for unlabeled ones, the inferred latent variables are used.

Sentence Level VAE

Following prior work on semi-supervised variational autoencoders (VAEs) Kingma and Welling (2013), for an input sentence $s$, we assume a graphical model whose latent representation contains a continuous vector $z$, denoting the content of the sentence, and a discrete persuasive strategy label $y$:

$$p_\theta(s, z, y) = p_\theta(s \mid z, y)\, p(z)\, p(y). \quad (1)$$

To learn the semi-supervised VAE, we optimize the variational lower bound as our learning objective. For unlabeled sentences, we maximize the evidence lower bound:

$$\log p_\theta(s) \ge \mathbb{E}_{q_\phi(z \mid s)\, q_\phi(y \mid s)}\big[\log p_\theta(s \mid z, y)\big] - \mathrm{KL}\big(q_\phi(z \mid s) \,\|\, p(z)\big) - \mathrm{KL}\big(q_\phi(y \mid s) \,\|\, p(y)\big), \quad (2)$$

where $p_\theta(s \mid z, y)$ is a decoder (generative network) to reconstruct input sentences and $q_\phi(y \mid s)$ is an encoder (an inference or predictor network) to predict sentence-level labels.

For labeled sentences, the variational lower bound is:

$$\log p_\theta(s, y) \ge \mathbb{E}_{q_\phi(z \mid s)}\big[\log p_\theta(s \mid z, y)\big] - \mathrm{KL}\big(q_\phi(z \mid s) \,\|\, p(z)\big) + \log p(y). \quad (3)$$

In addition, for sentences with labels, we also update the inference network $q_\phi(y \mid s)$ directly by minimizing the cross-entropy loss.
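To make the objectives concrete, here is a minimal PyTorch sketch of the labeled-sentence bound (Eq. 3) plus the direct cross-entropy term, with toy linear layers standing in for the encoder and decoder. All names are illustrative, the constant $\log p(y)$ term is dropped, and the actual model uses LSTM/BERT encoders (see Training Details):

```python
# Sketch of the labeled sentence-level VAE objective; toy stand-ins,
# not the released implementation.
import torch
import torch.nn.functional as F

K, Z_DIM, VOCAB, SEQ = 8, 16, 100, 12            # strategies, latent dim, toy vocab/length

enc = torch.nn.Linear(SEQ, 2 * Z_DIM + K)        # q(z, y | s): mu, logvar, strategy logits
dec = torch.nn.Linear(Z_DIM + K, SEQ * VOCAB)    # p(s | z, y): per-token logits

def labeled_elbo(s_feat, tokens, y):
    """-L(s, y): reconstruction + KL(q(z|s) || N(0, I)); log p(y) is a constant here."""
    h = enc(s_feat)
    mu, logvar = h[:, :Z_DIM], h[:, Z_DIM:2 * Z_DIM]
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparametrization
    y_onehot = F.one_hot(y, K).float()
    logits = dec(torch.cat([z, y_onehot], dim=-1)).view(-1, SEQ, VOCAB)
    recon = F.cross_entropy(logits.transpose(1, 2), tokens)  # token reconstruction loss
    kl_z = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_z

def classifier_loss(s_feat, y):
    """Direct cross-entropy on q(y | s) for labeled sentences."""
    return F.cross_entropy(enc(s_feat)[:, 2 * Z_DIM:], y)

s = torch.randn(4, SEQ)                      # toy sentence features
toks = torch.randint(0, VOCAB, (4, SEQ))     # toy token ids
y = torch.randint(0, K, (4,))
loss = labeled_elbo(s, toks, y) + 5.0 * classifier_loss(s, y)  # alpha = 5, as in the Appendix
loss.backward()
```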

Document Level VAE

Different from sentence-level VAEs, we model the input document $d = \{s_1, \dots, s_n\}$ as a whole and assume that the document-level label $c$ depends on the sentence-level latent variables. Thus we obtain the document-level VAE model:

$$p_\theta(d, c) = \int \sum_{y_{1:n}} p_\theta(d, c \mid z_{1:n}, y_{1:n})\, p(z_{1:n})\, p(y_{1:n})\, \mathrm{d}z_{1:n}, \quad (4)$$

where $p_\theta(d, c \mid z_{1:n}, y_{1:n})$ is the generative model for all sentences in the document and the document label $c$.

For simplicity, we further assume conditional independence between the sentences in $d$ and its label $c$ given the latent variables:

$$p_\theta(d, c \mid z_{1:n}, y_{1:n}) = p_\theta(c \mid z_{1:n}, y_{1:n}) \prod_{j=1}^{n} p_\theta(s_j \mid z_j, y_j).$$

Since the possible number of sentence label combinations is huge, directly computing the marginal probability becomes intractable. Thus we optimize the evidence lower bound. Using the mean field approximation Jain et al. (2018), we factorize the posterior distribution as $q_\phi(z_{1:n}, y_{1:n} \mid d, c) = \prod_{j=1}^{n} q_\phi(z_j \mid s_j, c)\, q_\phi(y_j \mid s_j, c)$. That is, the posterior distribution of the latent variables $z_j$ and $y_j$ depends only on the sentence $s_j$ and the document label $c$. For documents without sentence labels, the evidence lower bound is:

$$\log p_\theta(d, c) \ge \mathbb{E}_{q_\phi(z_{1:n}, y_{1:n} \mid d, c)}\Big[\log p_\theta(c \mid z_{1:n}, y_{1:n}) + \sum_{j=1}^{n} \log p_\theta(s_j \mid z_j, y_j)\Big] - \sum_{j=1}^{n} \mathrm{KL}\big(q_\phi(z_j \mid s_j, c) \,\|\, p(z_j)\big) - \sum_{j=1}^{n} \mathrm{KL}\big(q_\phi(y_j \mid s_j, c) \,\|\, p(y_j)\big) \triangleq -\mathcal{U}(d, c). \quad (5)$$

For documents with sentence labels, the variational lower bound can be adapted from the above as:

$$\log p_\theta(d, c, y_{1:n}) \ge \mathbb{E}_{q_\phi(z_{1:n} \mid d, c)}\Big[\log p_\theta(c \mid z_{1:n}, y_{1:n}) + \sum_{j=1}^{n} \log p_\theta(s_j \mid z_j, y_j)\Big] - \sum_{j=1}^{n} \mathrm{KL}\big(q_\phi(z_j \mid s_j, c) \,\|\, p(z_j)\big) + \sum_{j=1}^{n} \log p(y_j) \triangleq -\mathcal{L}(d, c, y_{1:n}). \quad (6)$$

Combining the losses for documents with and without sentence labels, we obtain the overall loss function:

$$\mathcal{J} = \sum_{(d, c) \in D_U} \mathcal{U}(d, c) + \sum_{(d, c, y_{1:n}) \in D_L} \mathcal{L}(d, c, y_{1:n}) + \alpha \sum_{(s, y) \in D_L} \mathcal{L}_{\mathrm{cls}}(s, y). \quad (7)$$

Here, $\mathcal{L}_{\mathrm{cls}}$ represents the discriminative (cross-entropy) loss for sentences with labels, and $\alpha$ controls the trade-off between the generative loss and the discriminative loss (the influence of $\alpha$ is discussed in the Appendix).

Compared to a sentence-level VAE (S-VAE) that only learns sentence representations via the generative network $p_\theta(s \mid z, y)$, the document-level VAE utilizes the contextual relations between sentences by aggregating multiple sentences in a document and further predicting document-level labels via the predictor network $p_\theta(c \mid z_{1:n}, y_{1:n})$. The document-level weakly supervised VAE (WS-VAE) incorporates both direct sentence-level supervision and indirect document-level supervision to better make use of unlabeled sentences, and thus can further help persuasion strategy classification. Note that our hierarchical weakly-supervised latent variable model presents a generic framework for utilizing dependencies between sentence-level and document-level labels, and can be easily adapted to other NLP tasks where document-level supervision is rich and sentence-level supervision is scarce.
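A minimal sketch of the document-level predictor network $p_\theta(c \mid z_{1:n}, y_{1:n})$, aggregating per-sentence latents with an LSTM as described in Training Details; module names are illustrative, not the released code:

```python
# Sketch: an LSTM consumes the per-sentence [z; y] vectors and predicts
# the document label c from its final hidden state.
import torch
import torch.nn.functional as F

K, Z_DIM, N_SENT, DOC_CLASSES = 8, 16, 5, 2

predictor = torch.nn.LSTM(Z_DIM + K, 32, batch_first=True)
doc_head = torch.nn.Linear(32, DOC_CLASSES)

z = torch.randn(1, N_SENT, Z_DIM)                            # content latents, one per sentence
y_soft = torch.softmax(torch.randn(1, N_SENT, K), dim=-1)    # (Gumbel-)soft strategy vectors

h, _ = predictor(torch.cat([z, y_soft], dim=-1))
doc_logits = doc_head(h[:, -1])                              # last hidden state -> document label
doc_loss = F.cross_entropy(doc_logits, torch.tensor([1]))
```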

Training Details

In practice, we parameterize the inference networks $q_\phi(z \mid s, c)$ and $q_\phi(y \mid s, c)$ using an LSTM or BERT, which encodes the sentences (and the document label) to obtain the posterior distributions. We use another LSTM as the decoder to model the generative network $p_\theta(s \mid z, y)$. At the document level, each sentence's content vector $z$ and strategy vector $y$ are fed as input to an LSTM that models the predictor network $p_\theta(c \mid z_{1:n}, y_{1:n})$.

Reparametrization:

It is challenging to back-propagate through random variables as it involves non-differentiable sampling procedures. For latent variable

, we utilized the reparametrization technique proposed by Kingma and Welling (2013) to re-parametrize the Gaussian random variable as , where , and are deterministic and differentiable. For discrete latent variable , we adopted Gumbel softmax Jang et al. (2017) to approximate it continuously:

where are the probabilities of a categorical distribution, follows Gumbel and is the temperature. The approximation is accurate when and smooth when . We gradually decrease in the training process.
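A minimal sketch of the Gumbel-softmax sampler (functionally equivalent to `torch.nn.functional.gumbel_softmax`), illustrating how lower temperatures push samples toward one-hot vectors:

```python
# Sketch: Gumbel-softmax sampling for the discrete strategy variable y.
# Using unnormalized logits instead of log-probabilities is equivalent
# up to an additive constant that cancels in the softmax.
import torch

def gumbel_softmax_sample(logits, tau):
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-10) + 1e-10)
    return torch.softmax((logits + gumbel) / tau, dim=-1)

logits = torch.randn(4, 8)            # unnormalized scores over 8 strategies
for tau in (5.0, 1.0, 0.1):           # tau is annealed downward during training
    print(tau, gumbel_softmax_sample(logits, tau)[0])
```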

Prior Estimation:

Classical variational models usually assume simple priors such as uniform distributions. We performed a Gaussian kernel density estimation over the training data to estimate the prior for $y$, and assumed that the latent variable $z$ follows a standard Gaussian distribution.

Experiment and Result

Dataset   Train   Dev   Test
Borrow    900     400   400
RAOP      300     200   300
Kiva      1000    400   400
Table 3: Statistics of the train, dev, and test splits (number of labeled documents).
[Table 4 body: per-model sentence-level Macro F1 scores at 20/50/100/Max labeled documents and document-level accuracies for LSTM, SH-Net, BERT, S-VAE, WS-VAE, and WS-VAE-BERT on Kiva, RAOP, and Borrow; the numeric values were lost in extraction.]

Table 4: Sentence-level persuasion strategy prediction performance (Macro F1 score) and document-level prediction performance (accuracy). Models are trained with 20 labeled documents (81 sentences in Kiva, 99 in RAOP, 59 in Borrow), 50 (200 in Kiva, 236 in RAOP, 168 in Borrow), 100 (355 in Kiva, 480 in RAOP, 356 in Borrow), and the full training set (3,512 sentences in Kiva, 1,382 in RAOP, 3,136 in Borrow). Results are averaged over 5 runs and reported with 95% confidence intervals.

Experiment Setup: We randomly sampled from the labeled documents to form the maximum labeled train set, the development set, and the test set used to train and evaluate models, and we also utilized all the unlabeled documents as training data. The data splits are shown in Table 3. We utilized NLTK Bird et al. (2009) to split documents into sentences and tokenized each sentence with the BERT-base uncased tokenizer Devlin et al. (2019). We added a special CLS token at the beginning of each sentence and a special SEP token at the end. We used BERT Devlin et al. (2019) as the discriminative network and LSTMs as the generative and predictor networks. The inference network is a 2-layer MLP. We trained our model with AdamW Loshchilov and Hutter (2017) and tuned hyper-parameters on the development set.
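The preprocessing described above can be sketched as follows, assuming the `nltk` and `transformers` packages (the exact pipeline in the released code may differ):

```python
# Sketch of the preprocessing: NLTK sentence splitting plus BERT-base
# uncased tokenization with CLS/SEP special tokens.
import nltk
from transformers import BertTokenizer

nltk.download("punkt", quiet=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

doc = "Need this loan urgently. I will pay 5% interest no later than May 1, 2016."
for sent in nltk.sent_tokenize(doc):
    # add_special_tokens=True prepends [CLS] and appends [SEP]
    ids = tokenizer.encode(sent, add_special_tokens=True)
    print(tokenizer.convert_ids_to_tokens(ids))
```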

Baselines and Model Settings (parameter details are provided in the Appendix)

We compared our model on per-sentence strategy classification with several baselines: (1) LSTM Hochreiter and Schmidhuber (1997): an LSTM is used as the sentence encoder, and the last layer's hidden states serve as sentence representations for classifying persuasion strategies. Only labeled sentences are used. (2) SH-Net Yang et al. (2019): SH-Net uses a hierarchical LSTM to classify strategies with supervision from both sentence-level and document-level labels, so both labeled and unlabeled documents are used. We followed their implementation and modified the document-level inputs to be concatenations of the latent variables $z$ and $y$. (3) BERT Devlin et al. (2019): we used the pre-trained BERT-base uncased model and fine-tuned it for persuasion strategy classification. BERT only uses labeled sentences. (4) S-VAE: a sentence-level VAE that applies variational autoencoding to classification by reconstructing the input sentences while learning to classify them. Both labeled and unlabeled sentences are used.

Figure 3: Average attention weights learned in the predictor network for different strategies in the three datasets.

WS-VAE denotes our proposed weakly supervised latent variable model, which makes use of sentence-level and document-level labels at the same time while also reconstructing input documents. We further show that our proposed WS-VAE is orthogonal to pre-trained models like BERT by utilizing pre-trained BERT as the discriminative network to encode input sentences and 2-layer LSTMs as the generative and predictor networks; we denote this variant, a special case of WS-VAE based on pre-trained transformer models, as WS-VAE-BERT.

Results

Varying the Number of Labeled Documents

We tested the models with varying amounts of labeled documents, from 20 to the maximum number of labeled training documents, and summarize the results in Table 4. The simple LSTM classifier showed the worst performance across the three datasets, especially when few labeled documents were given. After adding document-level supervision and unlabeled documents, SH-Net obtained better Macro F1 scores and lower variance, showing the impact of document-level supervision on sentence-level learning. BERT fine-tuned on the persuasion strategy classification task outperformed LSTM and SH-Net with limited labeled data in most cases.

By reconstructing each input sentence from its persuasion strategy and context latent variables, S-VAE showed a significant performance boost compared to models that only utilize indirect supervision from document-level labels. This indicates that extra supervision coming directly from the input sentence itself helps more than hierarchical supervision from the document level alone. By utilizing the hierarchical latent variable model, which exploits not only sentence reconstruction but also document-level predictions to assist sentence-level classification, WS-VAE outperformed S-VAE. When combined with state-of-the-art pre-trained models like BERT, our WS-VAE-BERT achieved the best performance on all three datasets. This suggests that the improvement comes not only from large pre-trained models, but also from the incorporation of our hierarchical latent variable model.

Note that we also report the document-level prediction accuracy for models that used all the labeled documents. Even though document-level prediction was not our goal, we observed a consistent trend that higher document-level performance correlated with higher sentence-level accuracy, suggesting that the global document-level supervision helped the sentence-level predictions.

Figure 4: Attention weight for content vectors and strategy vectors when predicting document-level labels in the predictor network.
Figure 5: Cosine similarities between different persuasive strategies (Credibility, Reciprocity, Evidence, Commitment, Scarcity, Emotion, Impact and Politeness).
Importance of Strategies vs Content

To better understand how persuasive strategies and text content jointly affect the success of textual requests, we added an attention layer over the content latent variable $z$ and the strategy latent variable $y$ in the predictor network to visualize the importance of persuasive strategies and text content in WS-VAE-BERT, as shown in Figure 4. In all three domains, we found that content vectors tend to receive larger attention weights than strategy vectors. This suggests that when people write requests to convince others to take action, content is relatively more important than persuasion strategies. However, leveraging proper persuasive strategies can further boost the likelihood of a request being fulfilled.

Attention Weight

We further calculated the average attention weights learned in the predictor network (attending over the strategy latent variable $y$ and the content latent variable $z$ to predict the document-level labels) for different strategies in the three datasets, as shown in Figure 3. We observed that Reciprocity, Commitment, Scarcity, and Impact seemed to play more important roles, while Credibility, Evidence, Emotion, and Politeness had lower average attention weights, indicating that simple superficial strategies might be more influential on overall persuasiveness in online forums than strategies that require deeper analysis.

Relation between Persuasive Strategies

To explore possible relations among different persuasive strategies, we took the embedding of each persuasive strategy from the predictor network and visualized their pairwise similarities in Figure 5. All similarity scores were below 0.5, showing that the strategies in our taxonomy are largely orthogonal to each other and capture different aspects of persuasive language. However, some strategies tend to show relatively higher relations; for example, Scarcity correlates highly with Evidence on RAOP and Kiva, indicating that people may often use them together in their requests.
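The similarity computation behind Figure 5 amounts to cosine similarity between the learned strategy embedding vectors; a sketch with a random stand-in embedding matrix:

```python
# Sketch: pairwise cosine similarities between strategy embeddings.
# The embedding matrix here is a random stand-in for the learned one.
import torch
import torch.nn.functional as F

strategy_emb = torch.randn(8, 32)       # 8 strategies x embedding dim
normed = F.normalize(strategy_emb, dim=-1)
sims = normed @ normed.t()              # entry (i, j) = cos(e_i, e_j)
print(sims)
```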

Conclusion and Future Work

This work introduced a set of generic persuasive strategies based on theories of persuasion, together with a large-scale multi-domain text corpus annotated with the associated persuasion strategies. To further utilize both labeled and unlabeled data in real-world scenarios, we designed a hierarchical weakly-supervised latent variable model that uses document-level persuasiveness supervision to guide the learning of sentence-level persuasive strategies. Experimental results showed that our proposed method significantly outperformed existing semi-supervised baselines on three datasets. Note that we assumed the document-level persuasiveness label depends only on the sentence-level information. However, there are other factors closely related to overall persuasiveness, such as requesters'/lenders' backgrounds or their prior interactions Valeiras-Jurado (2020); Longpre et al. (2019). Future work can investigate how these audience factors further affect the predictions of both sentence-level and document-level labels. As an initial effort, our latent variable method disentangles persuasion strategies from content and highlights the relations between persuasion strategies and overall persuasiveness, which can be leveraged by real-world applications to make textual requests more effective via different choices of persuasion strategies.

Acknowledgment

We would like to thank Jintong Jiang, Leyuan Pan, Yuwei Wu, Zichao Yang, the anonymous reviewers, and the members of the Georgia Tech SALT group for their feedback. We acknowledge the support of NVIDIA Corporation with the donation of the GPU used for this research. DY is supported in part by a grant from Google.

References

  • T. Althoff, C. Danescu-Niculescu-Mizil, and D. Jurafsky (2014) How to ask for a favor: a case study on the success of altruistic requests. In Proceedings of ICWSM.
  • S. Asai, K. Yoshino, S. Shinagawa, S. Sakti, and S. Nakamura (2020) Emotional speech corpus for persuasive dialogue system. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 491–497.
  • D. Atkinson, K. B. Srinivasan, and C. Tan (2019) What gets echoed? Understanding the "pointers" in explanations of persuasive arguments. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2904–2914.
  • L. M. Bartels (2006) Priming and persuasion in presidential campaigns. Capturing Campaign Effects, pp. 78–112.
  • S. Bird, E. Klein, and E. Loper (2009) Natural Language Processing with Python. 1st edition, O'Reilly Media, Inc.
  • W. Carlile, N. Gurrapadi, Z. Ke, and V. Ng (2018) Give me more feedback: annotating argument persuasiveness and related attributes in student essays. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 621–631.
  • S. Chaiken (1980) Heuristic versus systematic information processing and the use of source versus message cues in persuasion. Journal of Personality and Social Psychology 39 (5), pp. 752.
  • J. Chen, Y. Wu, and D. Yang (2020) Semi-supervised models via data augmentation for classifying interactive affective responses. In AffCon@AAAI.
  • J. Chen, Z. Yang, and D. Yang (2020) MixText: linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2147–2157.
  • R. Cialdini (2001) 6 principles of persuasion. Arizona State University, eBrand Media Publication.
  • A. M. Dai and Q. V. Le (2015) Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pp. 3079–3087.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • E. Durmus and C. Cardie (2018) Exploring the role of prior beliefs for argument persuasion. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1035–1045.
  • Z. Guo, Z. Zhang, and M. Singh (2020) In opinion holders' shoes: modeling cumulative influence for view change in online argumentation. In Proceedings of The Web Conference 2020, pp. 2388–2399.
  • S. Gururangan, T. Dang, D. Card, and N. A. Smith (2019) Variational pretraining for semi-supervised text classification. arXiv preprint arXiv:1906.02242.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • C. I. Hovland, I. L. Janis, and H. Kelly (1971) Communication and persuasion. Attitude Change, pp. 66–80.
  • V. Jain, F. Koehler, and E. Mossel (2018) The mean-field approximation: information inequalities, algorithms, and complexity. arXiv preprint arXiv:1802.06126.
  • E. Jang, S. Gu, and B. Poole (2017) Categorical reparameterization with Gumbel-softmax. In Proceedings of ICLR.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling (2014) Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589.
  • R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler (2015) Skip-thought vectors. In Advances in Neural Information Processing Systems, pp. 3294–3302.
  • P. Lawson, C. J. Pearson, A. Crowson, and C. B. Mayhorn (2020) Email phishing and signal detection: how persuasion principles and personality influence response patterns and accuracy. Applied Ergonomics 86, pp. 103084.
  • K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300.
  • L. Longpre, E. Durmus, and C. Cardie (2019) Persuasion of the undecided: language vs. the listener. In Proceedings of the 6th Workshop on Argument Mining, pp. 167–176.
  • I. Loshchilov and F. Hutter (2017) Fixing weight decay regularization in Adam. arXiv preprint arXiv:1711.05101.
  • S. Lukin, P. Anand, M. Walker, and S. Whittaker (2017) Argument strength is in the eye of the beholder: audience effects in persuasion. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 742–753.
  • K. Luu, C. Tan, and N. A. Smith (2019a) Measuring online debaters' persuasive skill from text over time. Transactions of the Association for Computational Linguistics 7, pp. 537–550.
  • K. Luu, C. Tan, and N. A. Smith (2019b) Measuring online debaters' persuasive skill from text over time. Transactions of the Association for Computational Linguistics 7, pp. 537–550.
  • M. McHugh (2012) Interrater reliability: the kappa statistic. Biochemia Medica 22, pp. 276–282.
  • S. Min, D. Chen, H. Hajishirzi, and L. Zettlemoyer (2019) A discrete hard EM approach for weakly supervised question answering. arXiv preprint arXiv:1909.04849.
  • T. Miyato, A. M. Dai, and I. Goodfellow (2016) Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725.
  • N. Nashruddin, F. A. Alam, and A. Harun (2020) Moral values found in linguistic politeness patterns of Bugis society. Edumaspul: Jurnal Pendidikan 4 (1), pp. 132–141.
  • M. Oquab, L. Bottou, I. Laptev, and J. Sivic (2015) Is object localization for free? Weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 685–694.
  • G. Papandreou, L. Chen, K. P. Murphy, and A. L. Yuille (2015) Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1742–1750.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  • R. E. Petty and J. T. Cacioppo (1986) The elaboration likelihood model of persuasion. In Communication and Persuasion, pp. 1–24.
  • P. O. Pinheiro and R. Collobert (2015) From image-level to pixel-level labeling with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1713–1721.
  • S. L. Popkin (1994) The Reasoning Voter: Communication and Persuasion in Presidential Campaigns. University of Chicago Press.
  • R. Pryzant, Y. Chung, and D. Jurafsky (2017) Predicting sales from the language of product descriptions. In eCOM@SIGIR.
  • K. Roethke, J. Klumpe, M. Adam, and A. Benlian (2020) Social influence tactics in e-commerce onboarding: the role of social proof and reciprocity in affecting user registrations. Decision Support Systems 131, pp. 113268.
  • O. Shaikh, J. Chen, J. Saad-Falcon, D. H. Chau, and D. Yang (2020) Examining the ordering of rhetorical strategies in persuasive requests. In Findings of EMNLP.
  • C. Stab and I. Gurevych (2014) Identifying argumentative discourse structures in persuasive essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 46–56.
  • C. Tan, V. Niculae, C. Danescu-Niculescu-Mizil, and L. Lee (2016) Winning arguments: interaction dynamics and persuasion strategies in good-faith online discussions. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pp. 613–624.
  • J. Valeiras-Jurado (2020) Genre-specific persuasion in oral presentations: adaptation to the audience through multimodal persuasive strategies. International Journal of Applied Linguistics 30 (2), pp. 293–312.
  • J. P. Vargheese, M. Collinson, and J. Masthoff (2020a) Exploring susceptibility measures to persuasion. In Persuasive Technology. Designing for Future Change, Cham, pp. 16–29.
  • J. P. Vargheese, M. Collinson, and J. Masthoff (2020b) Exploring susceptibility measures to persuasion. In International Conference on Persuasive Technology, pp. 16–29.
  • X. Wang, W. Shi, R. Kim, Y. Oh, S. Yang, J. Zhang, and Z. Yu (2019) Persuasion for good: towards a personalized persuasive dialogue system for social good. arXiv preprint arXiv:1906.06725.
  • Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2020) Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems.
  • D. Yang, J. Chen, Z. Yang, D. Jurafsky, and E. Hovy (2019) Let's make your request more persuasive: modeling persuasive strategies via semi-supervised neural nets on crowdfunding platforms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3620–3630.
  • D. Yang and R. E. Kraut (2017) Persuading teammates to give: systematic versus heuristic cues for soliciting loans. Proceedings of the ACM on Human-Computer Interaction 1, pp. 114.
  • Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick (2017) Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, pp. 3881–3890.

Appendix

Dataset & Annotation Details

In different contexts, people tend to write documents with different numbers of sentences, which might be associated with different sets of persuasion strategies.

The mean and standard deviation of the number of sentences per document are 4.68 and 4.63 in Borrow, 5.10 and 4.40 in RAOP, and 3.83 and 4.12 in Kiva.

We recruited two graduate and two undergraduate students to label the persuasion strategies for each sentence in documents randomly sampled from the whole corpus. Definitions and examples of the different persuasion strategies were provided to the annotators. We also conducted a training session in which we asked annotators to annotate 50 example sentences and walked them through any disagreements or confusions they had. The annotators then annotated 1,200 documents independently.

To assess the reliability of the annotated labels, the same set of 100 documents with 400 sentences was given to all annotators to label, and we computed Cohen's kappa coefficient. We obtained an average score of 0.538 on Kiva, 0.613 on RAOP, and 0.623 on Borrow, indicating moderate agreement and reasonable annotation quality McHugh (2012).


Threshold on KL Divergence

Yang et al. (2017) found that VAEs can easily get stuck in one of two local optima: either the KL term on $y$ is very large and all samples collapse to one class, or the KL term on $y$ is very small and $q_\phi(y \mid s, c)$ stays close to the prior distribution. Thus we minimize the KL term only when it is larger than a threshold $\lambda$:

$$\mathcal{L}_{\mathrm{KL}}(y) = \max\Big(\lambda,\; \mathrm{KL}\big(q_\phi(y \mid s, c) \,\|\, p(y)\big)\Big).$$
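In code, this thresholding is a one-line clamp: below the threshold the term is constant and contributes no gradient. A minimal sketch:

```python
# Sketch: thresholded KL term. clamp makes the term constant (zero
# gradient) below lam; above lam the KL is minimized as usual.
import torch

def thresholded_kl(kl, lam=1.2):
    return torch.clamp(kl, min=lam)

kl = torch.tensor([0.5, 3.0], requires_grad=True)
thresholded_kl(kl).sum().backward()
print(kl.grad)  # zero gradient for the entry below the threshold
```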

Influence of the Trade-off Weight

The overall loss function of our proposed weakly-supervised hierarchical latent variable model is given in Eq. (7). Here, $\alpha$ is a parameter that controls the balance between the reconstruction loss and the supervised sentence classification loss. When $\alpha$ is small, the sentence-level classifier is not well learned; when $\alpha$ is large, the model tends to learn only the sentence-level classification task and ignores the reconstructions and document-level predictions. In our experiments, we set $\alpha$ to 5 through a grid search over a candidate set.

Model Implementation Details

S-VAE

S-VAE, the sentence-level latent variable model, applies variational autoencoders to sentence-level classification by reconstructing the input sentences while learning to classify them, which encourages the model to assign each input sentence a label under which the reconstruction loss is low. S-VAE is a special case of our proposed WS-VAE that only operates at the sentence level. The weight for the reconstruction term is 1, the weight for the classification term is 5, and the weights for the KL divergence terms are annealed from a small value to 1 during training. The learning rate is 0.001.

WS-VAE

WS-VAE, our proposed weakly supervised latent variable model, takes advantage of sentence-level and document-level labels at the same time, as well as reconstructing input documents. The weight for the reconstruction term is 1, the weight for the classification term is 5, the weights for the KL divergence terms are annealed from a small value to 1 during training, and the weight for the predictor term is 0.5. The threshold for the KL regularization on $y$ is 1.2. The learning rate is 0.001.

WS-VAE-BERT

WS-VAE-BERT, a special case of WS-VAE based on pre-trained transformer models, combines WS-VAE with pre-trained BERT. The weight for the reconstruction term is 1, the weight for the classification term is 5, the weights for the KL divergence terms are annealed from a small value to 1 during training, and the weight for the predictor term is 0.1. The threshold for the KL regularization on $y$ is 1.2. The learning rate is 0.00001.

Dataset   Threshold on y   Macro F1
Kiva      0                0.228
          1.2              0.315
          2.0              0.305
RAOP      0                0.274
          1.2              0.321
          2.0              0.316
Borrow    0                0.485
          1.2              0.595
          2.0              0.542
Table 5: Macro F1 scores with different thresholds on the KL regularization term for $y$ in WS-VAE. Models are trained on the three datasets with 20 labeled documents (81 sentences in Kiva, 99 sentences in RAOP, and 59 sentences in Borrow).

Impact of Variational Regularization

To show the importance of the variational regularization on the latent variable $y$ (the threshold on the KL divergence described above), we performed an ablation study on the KL term for $y$. We tested WS-VAE with different threshold values on the three datasets using 20 labeled documents; the results are shown in Table 5. When the threshold is small (e.g., 0), meaning strong regularization on $y$, performance is poor because $q_\phi(y \mid s, c)$ stays close to the estimated prior distribution and barely learns from the objective function. When the threshold is large (e.g., 2), meaning effectively no regularization on $y$, we also obtained lower F1 scores. With an appropriate threshold such as 1.2, WS-VAE achieves the best performance.

Figure 6: Macro F1 scores for WS-VAE with 20 sentence-labeled documents and varying numbers of documents without sentence labels. Results on Borrow follow the left y-axis, while RAOP and Kiva follow the right y-axis.

Varying the Number of Unlabeled Documents

We visualized WS-VAE’s performances on three datasets when varying the amount of unlabeled data in Figure 6: macro F1 scores increased with more unlabeled data, demonstrating the effectiveness of the introduction of unlabeled sentences, and our hierarchical weakly-supervised model.