
Towards a Progression-Aware Autonomous Dialogue Agent

Recent advances in large-scale language modeling and generation have enabled the creation of dialogue agents that exhibit human-like responses in a wide range of conversational scenarios spanning a diverse set of tasks, from general chit-chat to focused goal-oriented discourse. While these agents excel at generating high-quality responses that are relevant to prior context, they suffer from a lack of awareness of the overall direction in which the conversation is headed, and the likelihood of task success inherent therein. Thus, we propose a framework in which dialogue agents can evaluate the progression of a conversation toward or away from desired outcomes, and use this signal to inform planning for subsequent responses. Our framework is composed of three key elements: (1) the notion of a "global" dialogue state (GDS) space, (2) a task-specific progression function (PF) computed in terms of a conversation's trajectory through this space, and (3) a planning mechanism based on dialogue rollouts by which an agent may use progression signals to select its next response.





1 Introduction

All human conversation serves some purpose. These purposes may range from negotiating an agreement to explaining a topic to maintaining a social relationship. People are generally capable of forming an assessment, sometimes subconsciously, of whether a conversation is going well and adjusting their behavior accordingly. Such assessment, which underlies most human conversation, is essential for maintaining continuous awareness of where the interaction is heading and whether the parties are in sync (e.g., Bernieri and Rosenthal, 1991). In a task-oriented interaction, the participants assess whether progress towards a successful outcome is being made. In a negotiation, parties assess whether an agreement is likely. Even in a casual conversation, people intuitively sense when to continue, when to change the subject, or when to stop. Based on such (subjective) assessment, participants adjust what to say next: whether to push forward, make a concession, soften the tone, digress, or say goodbye. A wide range of research in conversation and discourse analysis is devoted to these and related issues (Beebe and Masterson, 2000; Cassell et al., 2007; Friedman, 2004; Gremler and Gwinner, 2008; Langewitz et al., 2003); however, recent efforts in Dialogue State Tracking (DST) have primarily focused on collecting fine-grained details (e.g., slot-value pairs for travel booking or restaurant reservation) (Balaraman et al., 2021) without concern for the overall direction and quality of the conversation, even though the latter is critical for achieving human-level dialogue interaction.

Figure 1: Our framework applied to the charity solicitation task in Persuasion For Good (Wang et al., 2019). Given the dialogue history (center left), the system uses rollouts (Lewis et al., 2017) to simulate the outcome of two response candidates (bottom, in red). Each rollout is mapped as a path through the Global Dialogue State space (center right) where it can be compared with similar outcomes. The candidates are finally ranked using the Progression Function (top), and the best is selected.

As such, we approach dialogue state tracking at a higher level, focusing instead on what we call the Global Dialogue State (GDS). Given a conversational task (e.g., negotiation), the global state of a dialogue reflects the most likely outcome (e.g., a strong agreement or a stalemate) given the history of the dialogue up to the current turn. In contrast to traditional DST, the global state remains invariant to the specific details discussed at each turn (e.g., names, dates, quantities) that are typically the concern of slot-filling models. Rather, global dialogue states are influenced by the contexts in which these details occur (e.g., “I would love to donate $5 to this charity!” vs. “I would never donate $5 to this charity”). Thus, the global state of a dialogue can be measured in terms of its semantic similarity to other groups of dialogues for the same task, which can be naturally formulated as a cluster-assignment problem in the dialogue embedding space. That is, a dialogue which is assigned at the current turn to a cluster of highly successful outcomes may assume a high likelihood of success, and likewise a dialogue assigned to a cluster of unsuccessful outcomes may assume a low likelihood of success. It follows from this that the path of a dialogue through global state space can be used to derive a Progression Function (PF) to give turn-level estimates of task success, which can in turn be used by a dialogue agent to inform its next response.

The remainder of this paper is organized as follows: In Section 2 we review relevant literature pertaining to dialogue state tracking and response planning; in Section 3 we formally define the global dialogue state and progression function, propose supervised and unsupervised approaches for modeling them, and describe how they can be used to assess and select dialogue response candidates; in Section 4 we experimentally apply our framework to the charity solicitation task in the Persuasion For Good dataset (Wang et al., 2019), reporting results from automatic and manual evaluations; and in Sections 5 and 6 we conclude with a discussion of limitations and future directions. Code for our methods and experiments has been released, and a listing of software packages we use can be found in Appendix A.

2 Related Work

Our work lies at the intersection of dialogue state tracking and response planning. As previously noted, we approach dialogue state at a much higher level than is typically seen in the DST literature. Our concept of global dialogue state is not mutually exclusive with traditional DST approaches, which we refer to from here on as local DST. Rather, an effective dialogue system might integrate local and global DST approaches to enable simultaneous tracking of user intents and slot-value pairs (needed for interfacing with external resources) and the overall likelihood of conversational success.

2.1 Dialogue State Tracking

Local DST approaches are used in task-oriented (also called goal-oriented) dialogue systems. Local DST is responsible for identifying user intent (e.g., search for restaurants) and extracting slot-value pairs (e.g., location, price range). Recent DST systems perform state tracking in a diverse set of domains, including food ordering (Lertvittayakumjorn et al., 2021), travel reservations (Qin et al., 2021), negotiations (He et al., 2018), and many others. Datasets such as MultiWOZ (Budzianowski et al., 2018; Eric et al., 2020; Zang et al., 2020) and SGD (Rastogi et al., 2020) provide large-scale testbeds for training single DST systems that generalize across many task domains. However, local DST is generally not deployed in open-domain end-to-end dialogue systems that focus on social interaction and user engagement, recent examples including DialoGPT (Zhang et al., 2020), Meena (Adiwardana et al., 2020), and BlenderBot (Roller et al., 2021; Xu et al., 2021). In open-domain models, the task is unconstrained and thus it makes little sense to employ traditional slot-based dialogue state trackers. Instead, these models track state implicitly in their latent representations of dialogue history. Unlike local DST, global state tracking is applicable in both the task-oriented and open-domain settings.

2.2 Dialogue Response Planning

Many approaches exist for planning in dialogue response generation. Planning helps a dialogue agent maintain coherence over multiple turns and stay on track to complete its goal. Lewis et al. (2017) introduce Dialogue Rollouts, allowing a negotiation agent to simulate the remainder of a conversation based on each of multiple candidate responses and select the one which yields the best outcome. Yarats and Lewis (2018) follow up by separating semantic planning and surface realization for response generation by first producing a latent semantic representation of the dialogue plan and then conditioning on it during generation with rollouts. Similarly, Jiang et al. (2019) implement a look-ahead module to implicitly predict multiple future turns in an end-to-end encoder-decoder architecture, experimenting with negotiation and restaurant reservation settings. These works all experiment in task domains where goal achievement is explicitly measurable, which is not true in the general case. Thus we propose to combine such methods with our progression function which provides estimates of goal completion likelihood. Particularly, in this paper we demonstrate the use of rollouts with the PF as a reward signal.

3 Methods

Figure 2: Architecture of the supervised and unsupervised GDS and PF models (top). In GDS space (top right), each cluster is characterized by similar dialogue semantics, and is thus interpreted as the class of typical outcomes for dialogues within. GDS and PF can be used with rollouts (bottom) to allow a dialogue agent to plan ahead.

The goal of our system is to construct a global dialogue state space for a task-specific dataset and learn a progression function to estimate how well an ongoing dialogue is progressing toward the desired outcome of the task. The quantity output by the progression function is an estimate of a dialogue-level attribute which indicates task success (e.g. satisfaction in a customer service task). In many task domains, the success of a conversation cannot be completely measured by a single attribute. For example, in the charity solicitation task we use in our experiments, donation amount is the primary success attribute. Here, there are cases where the conversation appears to go very well, but ultimately no donation is made for unexpected reasons such as the solicitee not being able to afford to donate. One could reasonably expect such an outcome to be “acceptable” in the context of a solicitation task since the solicitee has engaged with the solicitor and displayed interest, and we cannot reasonably expect the solicitor to force a donation out of someone who cannot afford it. Thus we introduce the “acceptability score”, a synthetic attribute that measures success by considering multiple factors (e.g., donation amount and sentiment). For any dialogue dataset, the acceptability score combines multiple dialogue-level attributes in a way sensitive to their covariance with the primary success attribute:


$$\mathrm{acc}(d) = y_d + \sum_{j} c_j \, x_{d,j}$$

where $y_d$ is the primary success attribute (e.g. donation amount) value for dialogue $d$, $\mathbf{x}_d$ is the vector of all other attribute values (e.g., sentiment) for dialogue $d$, and $c_j$ is the training set covariance between the primary success indicator and the $j$-th other attribute. We define the output of the progression function to be an estimate of the acceptability score.

Learning the progression function requires dialogue-level attribute annotations. However, in many settings such annotations are not available in sufficient quantity to directly learn a progression model that generalizes well. Consequently, we propose supervised and unsupervised approaches for learning the global state and progression models.

3.1 Unsupervised Approach

3.1.1 Global Dialogue State

In the unsupervised approach, the GDS space is a dialogue embedding space where clusters of embeddings represent groups of dialogues with similar semantic content. For each complete dialogue in the training set, all utterances are independently embedded and then pooled to create a dialogue-level embedding $\mathbf{e}_d \in \mathbb{R}^{h}$, where $h$ is the embedding size. The GDS space is thus given as a matrix in $\mathbb{R}^{N \times h}$, where $N$ is the number of complete dialogues. To embed utterances we take advantage of pre-trained sentence encoders exposed to large-scale corpora. Specifically, we use a publicly available MPNet (Song et al., 2020) model fine-tuned for semantic textual similarity using a contrastive objective on over 1B training pairs from 32 distinct datasets. To combine utterance embeddings into a dialogue-level embedding we use recency-weighted mean pooling. The recency weight $w$ determines how much emphasis is placed on more recent utterances, where $w = 0$ means all utterances are weighted evenly and $w > 0$ means that more emphasis is placed on more recent utterances. The motivation for recency weighting is to test the hypothesis that more recent developments in a conversation are more relevant for predicting current progression toward a goal. For example, a conversation may start out off-task with participants engaging in small talk, and then later re-focus.

The embedding for dialogue $d$ with $T$ utterances is thus formulated as $\mathbf{e}_d = \mathrm{softmax}(w\,\mathbf{r})^{\top} U$, where $U$ is the matrix of utterance vectors in $\mathbb{R}^{T \times h}$ and $\mathbf{r}$ is a vector of $T$ evenly spaced real numbers over the interval $[0, 1]$. The softmax ensures all recency weights sum to 1 and can be interpreted as probabilities, as done with attention scores in (Bahdanau et al., 2015; Vaswani et al., 2017). As shown in Figure 3, each utterance is thus weighted by a monotonically increasing probability mass where higher values of $w$ cause more mass to be concentrated at the end of the dialogue.

Figure 3: The recency weight controls how much emphasis is placed on recent utterances when computing the dialogue embedding.
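As a minimal sketch of recency-weighted mean pooling, the following pure-Python example computes softmax weights over evenly spaced positions and uses them to average utterance vectors. The interval $[0, 1]$ for the positions, the function names `recency_weights` and `pool_dialogue`, and the toy two-dimensional vectors are illustrative assumptions, not details taken from the released code.

```python
import math

def recency_weights(num_utterances, w):
    """Softmax over w * r, with r evenly spaced on [0, 1] (interval is an assumption)."""
    if num_utterances == 1:
        return [1.0]
    logits = [w * t / (num_utterances - 1) for t in range(num_utterances)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pool_dialogue(utterance_vectors, w):
    """Recency-weighted mean of utterance vectors, ordered oldest to newest."""
    weights = recency_weights(len(utterance_vectors), w)
    dim = len(utterance_vectors[0])
    return [
        sum(weight * vec[j] for weight, vec in zip(weights, utterance_vectors))
        for j in range(dim)
    ]

# Toy 2-dimensional "utterance embeddings"; real sentence encoders produce far larger vectors.
toy_dialogue = [[1.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
even = pool_dialogue(toy_dialogue, w=0.0)    # plain mean pooling
recent = pool_dialogue(toy_dialogue, w=4.0)  # later utterances dominate
```

With `w=0.0` the weights are uniform and the result is the ordinary mean; larger `w` shifts the pooled vector toward the most recent utterances.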

The unsupervised GDS model is a clustering of the dialogues in their embedding space. The dialogue embeddings are either clustered directly or after projection to a lower-dimensional space using Parametric UMAP (Sainburg et al., 2020; McInnes et al., 2018a). We experiment with k-means and HDBSCAN (McInnes and Healy, 2017; Campello et al., 2013) to cluster the embeddings. For k-means, we choose the number of clusters $k$ and train with 10 random initializations. For HDBSCAN, we choose the minimum cluster size and minimum samples hyperparameters, and the optimal number of clusters is selected automatically. Unlike k-means, which simply partitions the embedding space, HDBSCAN classifies some embeddings as noise points. Clustering hyperparameters are selected by cross-validation on several metrics as described later in Section 4. The process of constructing the GDS model is illustrated in Figure 2.

The clusters output by this process can be interpreted as the equivalence classes of final global states possible for the task represented in the dialogue dataset. To estimate the global state of an ongoing dialogue $d$, it is embedded as $\mathbf{e}_d$ in the same manner as the complete training dialogues, followed by optional dimensionality reduction. The trained k-means or HDBSCAN model is then used to assign $\mathbf{e}_d$ to one of the existing clusters, or possibly to label it as a noise point in the case of HDBSCAN.

Each cluster is assigned an aggregate acceptability score by taking an average of acceptability for each dialogue in the cluster. If k-means is used, we aggregate using a 10% trimmed mean across all dialogues in the cluster. If HDBSCAN is used, a probability is returned for each dialogue representing the likelihood that it is a member of its assigned cluster, so we compute the probability-weighted average across all dialogues in the cluster. Dialogues classified as noise points are ignored.
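The per-cluster aggregation can be sketched as follows. This is an illustration under stated assumptions: the 10% trimmed mean is taken to drop 10% of the values from each tail, noise points are assumed to carry the label -1 (the convention of the hdbscan library), and the helper names are hypothetical.

```python
from collections import defaultdict

NOISE = -1  # label assumed for noise points, following the hdbscan library convention

def trimmed_mean(values, proportion=0.10):
    """Mean after dropping `proportion` of the values from each tail (assumed convention)."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)

def weighted_mean(values, weights):
    """Average of `values` weighted by cluster membership probabilities."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def aggregate_by_cluster(labels, scores, probabilities=None):
    """Per-cluster acceptability: trimmed mean, or probability-weighted mean if given."""
    members = defaultdict(list)
    for i, label in enumerate(labels):
        if label != NOISE:  # dialogues classified as noise are ignored
            members[label].append(i)
    result = {}
    for label, idx in members.items():
        values = [scores[i] for i in idx]
        if probabilities is None:
            result[label] = trimmed_mean(values)
        else:
            result[label] = weighted_mean(values, [probabilities[i] for i in idx])
    return result
```

Calling `aggregate_by_cluster(labels, scores)` corresponds to the k-means case, while passing membership probabilities as a third argument corresponds to the HDBSCAN case.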

To visualize the GDS model, Parametric UMAP is used again to project the clustered dialogue embeddings into $\mathbb{R}^2$ or $\mathbb{R}^3$. As shown in Figure 1, the GDS model can be mapped as a scatter plot with each cluster labeled by its aggregate values. If k-means is used, the cluster centroids can be displayed as a bold point within each cluster. HDBSCAN clusters do not have centroids, but they do have a number of representative points that are close to the cluster core. We average these points to simulate a centroid for display purposes, and likewise show it as a bold point within each cluster. To show how an ongoing dialogue traverses the GDS space over time, its embeddings at each turn are projected onto the map and connected with line segments to form a path.

3.1.2 Computing Progression

Since each cluster in the GDS space is intended to represent a class of end-task global states, we compute the progression of an ongoing dialogue with respect to the likelihood that its final global state will rest in each individual cluster. Supposing there are $k$ final clusters after running k-means or HDBSCAN, we compute a probability vector $\mathbf{p} \in \mathbb{R}^{k}$ such that $p_i = P(d \in C_i)$, where $C_i$ is cluster $i$ and $d$ is the ongoing dialogue. The vector $\mathbf{p}$ is computed differently for k-means and HDBSCAN. K-means does not produce a probabilistic soft clustering, so we define $\mathbf{p}$ with respect to the proximity of the dialogue embedding to the centroid $\boldsymbol{\mu}_i$ of each cluster $C_i$. HDBSCAN does produce a probabilistic soft clustering, so in that case $\mathbf{p}$ is already computed.

We ultimately want the closest (or most probable) clusters for ongoing dialogue $d$ to have the most sway in estimating its progression at the current point in time. That is, if $d$ has moved into a cluster of high-success outcomes, its progression should increase. Likewise, if $d$ has moved away from such a high-success cluster, either into a lower-success cluster or off-task into a noisy or unknown region of the GDS space, its progression should decrease. Thus, once the cluster probability vector $\mathbf{p}$ is computed, we estimate progression as the probability-weighted average of the aggregate acceptability scores assigned to each cluster. This is formulated as

$$\mathrm{PF}(d) = \frac{\mathbf{p} \cdot \mathbf{a}}{\sum_{i} p_i}$$

where $\mathbf{a}$ is a vector of the aggregate acceptability scores assigned to each cluster. The scaling factor in the denominator ensures that ongoing dialogue embeddings classified as noise points by HDBSCAN will not be assigned progression values close to zero as a consequence of not belonging to any cluster, which can cause significant fluctuation in the progression function as the dialogue traverses noisy regions of the GDS space. (For HDBSCAN we also experiment with softmax for re-scaling $\mathbf{p}$, giving $\mathrm{PF}(d) = \mathrm{softmax}(\mathbf{p}) \cdot \mathbf{a}$.) Figure 2 illustrates how progression of an ongoing dialogue depends on its position in GDS space.
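The progression estimate described in this section, a probability-weighted average of cluster acceptability scores normalized by the total probability mass, can be sketched as below. The proximity function used for the k-means case (inverse of one plus Euclidean distance) is a hypothetical stand-in chosen only for illustration, since the exact form is not recoverable here; returning 0.0 when all memberships are zero is likewise an assumption.

```python
import math

def kmeans_proximity(embedding, centroids):
    """Hypothetical proximity score: 1 / (1 + Euclidean distance to each centroid)."""
    return [1.0 / (1.0 + math.dist(embedding, c)) for c in centroids]

def progression(p, cluster_scores):
    """Weighted average of cluster acceptability scores, normalized by sum(p)."""
    total = sum(p)
    if total == 0.0:
        return 0.0  # assumption: no cluster evidence at all yields a neutral estimate
    return sum(pi * ai for pi, ai in zip(p, cluster_scores)) / total

# A noise-like point with weak membership keeps the score of its nearest cluster
# instead of collapsing toward zero, because of the normalization.
weak_membership = progression([0.02, 0.0], [1.5, -1.0])
```

The normalization is what keeps `weak_membership` near 1.5 rather than near 0.03.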

3.2 Supervised Approach

For the supervised approach, we simply fine-tune RoBERTa (Liu et al., 2019) to directly predict acceptability given the dialogue history text, where all utterances are concatenated into a single sequence. To construct the GDS space we obtain the dialogue-level embedding directly from the CLS (<s>) token for each complete dialogue in the training set, and cluster them as in Section 3.1.1. Unlike the unsupervised approach, where recency weighting is used to “attend” to more recent parts of the dialogue, the supervised fine-tuning process causes the CLS embedding to aggregate the parts of the dialogue most relevant to the task objective, which is better suited to the task than the recency heuristic. Also, unlike the unsupervised approach where progression for an ongoing dialogue is computed with respect to its embedding, here progression is directly predicted by RoBERTa. In our experiments we compare RoBERTa-base, RoBERTa-large, and RoBERTa-large-adapted, the latter receiving additional domain adaptation training for dialogue. Domain adaptation is done via Masked Language Modeling (MLM) on a self-generated version of the Gutenberg Dialogue Dataset (Csaky and Recski, 2021). Hyperparameters and model weights from domain adaptation training are provided with our code release.

3.3 Response Planning

To allow a dialogue agent to use the progression function as feedback for response planning, we adopt Dialogue Rollouts (Lewis et al., 2017) to simulate the outcomes of a set of response candidates. A rollout for a response candidate simulates the next $n$ turns of the conversation (for both participants) given that the candidate is used. At each turn of a negotiation task, Lewis et al. (2017) sample a set of $c$ response candidates and $r$ rollouts per candidate. They score each rollout by a deterministic reward (the value of the items “won” by the agent during negotiation), and rank each candidate by the average of its rollout scores. The highest ranking candidate is then selected by the agent. As shown in Figure 2, we generalize this process to any task for which a progression function can be learned, replacing the deterministic reward with the progression function value. To demonstrate this, we fine-tune the 762M parameter DialoGPT (Zhang et al., 2020) as a dialogue response generator and use beam sampling to generate response candidates and rollouts. We select DialoGPT for this task as it is pre-trained on a large Reddit dialogue corpus.
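The candidate selection loop can be sketched in a few lines. In this illustration the callables `propose`, `simulate`, and `score` are hypothetical placeholders for the response generator and the progression function, and the toy versions below exist only so that the sketch runs without any model.

```python
from statistics import mean

def select_response(history, propose, simulate, score,
                    num_candidates=2, num_rollouts=2, rollout_length=3):
    """Return (average score, candidate) for the best-scoring candidate."""
    ranked = []
    for candidate in propose(history, num_candidates):
        extended = history + [candidate]
        rollout_scores = [
            score(extended + simulate(extended, rollout_length))
            for _ in range(num_rollouts)
        ]
        # Each candidate is ranked by the average score of its rollouts.
        ranked.append((mean(rollout_scores), candidate))
    return max(ranked, key=lambda pair: pair[0])

# Deterministic stand-ins (hypothetical) for the generator and progression function.
def toy_propose(history, n):
    return ["Would you like to donate?", "Never mind."][:n]

def toy_simulate(history, length):
    reply = "Sure, I can give $1." if "donate" in history[-1] else "Okay, bye."
    return [reply] * length

def toy_score(dialogue):
    return sum(1.0 for utterance in dialogue if "give" in utterance)

best_score, best_candidate = select_response(
    ["Hi!", "Hello."], toy_propose, toy_simulate, toy_score
)
```

In a real system `simulate` would sample from the dialogue model, so repeated rollouts for the same candidate would differ and the averaging would matter.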

4 Experiments

4.1 Dataset

We apply our framework to the Persuasion For Good dataset (Wang et al., 2019), which is a crowd-sourced dialogue dataset where the task for an individual playing the role of persuader is to convince another individual playing the role of persuadee to make a donation to a well-known children’s charity. We selected this dataset since it has a clear task objective (to solicit donations), but a complex relationship between dialogue content and success. The dataset authors identify 10 distinct persuasion strategies used to solicit donations, where different strategies correlate with donation amount at different strengths. Additionally, participants in Persuasion For Good dialogues complete a pre-task psychological survey, yielding 23 attributes based on the Big-Five personality traits (Goldberg, 1992), the Moral Foundations endorsement (Graham et al., 2011), the Schwartz Portrait Value (Cieciuch and Davidov, 2012), and the Decision-Making style (Hamilton et al., 2016) questionnaires for each individual. The dataset authors demonstrated varying degrees of correlation between these psychological attributes and the end-task donation amount. The complexity in measuring progression in this context, coupled with it being a relatively small dataset, makes Persuasion For Good an interesting and challenging testbed for our framework. Persuasion For Good contains 1017 dialogues, each with approximately 10 turns per speaker (20 utterances).

4.2 Progression Function Experiments

As the objective of the task is to solicit donations, we consider the end-dialogue persuadee donation amount to be the primary dialogue success indicator. We also augment the dataset by computing average dialogue sentiment. To compute sentiment we use a RoBERTa model fine-tuned on the sentiment classification task of the TweetEval benchmark (Barbieri et al., 2020), which was publicly released by the benchmark authors. We score sentiment at the utterance level in the range $[-1, 1]$ by multiplying the sentiment class probabilities predicted by RoBERTa for negative, neutral, and positive by $-1$, $0$, and $1$ respectively and summing the result. We then average the utterance-level sentiment score for each dialogue.
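The sentiment scoring step amounts to an expected value over the three classes followed by a mean over utterances. The sketch below assumes class values of -1, 0, and +1 for negative, neutral, and positive; the function names and the example probabilities are illustrative only.

```python
def utterance_sentiment(p_negative, p_neutral, p_positive):
    """Expected sentiment in [-1, 1], assuming class values -1, 0, and +1."""
    return -1.0 * p_negative + 0.0 * p_neutral + 1.0 * p_positive

def dialogue_sentiment(class_probabilities):
    """Mean utterance-level sentiment over (negative, neutral, positive) triples."""
    scores = [utterance_sentiment(*probs) for probs in class_probabilities]
    return sum(scores) / len(scores)

# Two utterances: one mostly positive, one mostly negative.
example = dialogue_sentiment([(0.1, 0.2, 0.7), (0.6, 0.3, 0.1)])
```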

We filter the dataset to eliminate dialogues with end-task donation amounts outside the allowed task parameters (between $0 and $2 USD), and use a regular expression to filter out dialogues where the persuadee fails to make a donation after promising a non-zero dollar amount in the conversation. After filtration we are left with 751 dialogues for our study. We split the dialogues into a training and test set, leaving 577 dialogues for training and 174 for testing. After splitting, we mean-center the dialogue values in the training set for each attribute and scale them to have unit variance. We apply the same transformation to the test set using the distribution parameters of the training set. Our final pre-processing step is to compute the acceptability score. To do this, we compute the covariance matrix of the dialogue-level attribute values in the training set, which include the donation amount and psychological attributes for both the persuader and persuadee from the original dataset, along with our computed sentiment scores. Since the values are all standardized, the covariances are equivalent to Pearson’s $r$. We select the covariances of all attributes with respect to the persuadee donation amount (see Figure 5 in Appendix B) and compute the acceptability score of each dialogue as defined in Section 3. We use the same covariances obtained from the training set to compute acceptability scores on the test set. After pre-processing, the training set has 52 total attributes. These include the persuadee/persuader donation amounts, psychological variables, sentiment, and the acceptability score.
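The pre-processing pipeline above can be illustrated with a small sketch. It assumes the acceptability score takes the form of the primary attribute plus a covariance-weighted sum of the other attributes, which is one reading of the definition in Section 3 and not a confirmed formula; the helper names and toy values are hypothetical, and only one secondary attribute (sentiment) is used.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Training-set mean and (population) standard deviation."""
    return mean(column), pstdev(column)

def standardize(column, mu, sigma):
    return [(x - mu) / sigma for x in column]

def covariance(xs, ys):
    """Population covariance; equals Pearson's r when both inputs are standardized."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def acceptability(primary, others, covariances):
    """Assumed form: primary attribute plus covariance-weighted other attributes."""
    return primary + sum(c * a for c, a in zip(covariances, others))

# Toy training values (not taken from the dataset).
train_donation = [0.0, 0.5, 2.0, 1.0]
train_sentiment = [-0.2, 0.1, 0.6, 0.3]
z_don = standardize(train_donation, *fit_scaler(train_donation))
z_sent = standardize(train_sentiment, *fit_scaler(train_sentiment))
c_sent = covariance(z_don, z_sent)
scores = [acceptability(d, [s], [c_sent]) for d, s in zip(z_don, z_sent)]
```

For a test set, the same `fit_scaler` outputs and `c_sent` from training would be reused, mirroring the description above.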

4.2.1 Progression Model Training

We train four progression models as outlined in Sections 3.1 and 3.2: (1) Unsupervised, (2) RoBERTa-base, (3) RoBERTa-large, and (4) RoBERTa-large-adapted. For each model, 10% of the training set is held out as a validation set (58 dialogues). For the unsupervised model, a grid search is run for the hyperparameters (e.g., number of clusters, recency weight, dimensionality reduction) over the validation set, and the final model is obtained by re-training over the full training set using the best hyperparameters. The final model uses k-means for clustering, with the number of clusters and recency weight chosen by the grid search. A complete hyperparameter listing and details on the grid search can be found in Appendix F. For the supervised RoBERTa models, fine-tuning is done with AdamW (Loshchilov and Hutter, 2019) for a maximum of 30 epochs. Early stopping is used over the validation set, with the checkpoint corresponding to the lowest validation loss selected as the final model.

4.2.2 Automatic Evaluation

We evaluate the progression models on the following automatic metrics: (1) Mean Absolute Error (MAE) in predicting dialogue acceptability, and (2) Pearson’s correlation ($r$) between overall PF slope and dialogue acceptability. With MAE we validate that the progression function is able to estimate success of a complete dialogue, while PF slope correlation validates that during an ongoing dialogue, progression increases over time for high-success dialogues and decreases over time for low-success dialogues. To measure PF slope correlation, we fit a least-squares regression line to the progression curve of each dialogue in the test set, and measure Pearson’s $r$ between the regression slopes and their corresponding acceptability scores. For robustness we repeat this evaluation 33 times with varying initialization seeds for each model type (final hyperparameters stay constant) and report the means and standard deviations across runs in Table 1.


Model                     MAE             r
Unsupervised*             1.37 (±0.01)    0.40 (±0.02)
RoBERTa-base              0.99 (±0.06)    0.50 (±0.04)
RoBERTa-large             0.97 (±0.10)    0.54 (±0.06)
RoBERTa-large-adapted     0.95 (±0.05)    0.57 (±0.03)

Pearson’s r is statistically significant for all runs (two-tailed; the null hypothesis is non-correlation).
* Hyperparameters of the unsupervised model can be found in Appendix G.
Table 1: Progression Function Auto Eval Results: Mean (SD) of MAE and Pearson’s r across runs.

As expected, the supervised models outperform the unsupervised model on both metrics, since they directly optimize a mean squared error objective. Of the supervised models, the RoBERTa-large instances perform the best, with dialogue domain adaptation improving each metric. Pearson’s $r$ is statistically significant for all runs (the null hypothesis is non-correlation).
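The PF slope correlation metric from this section can be sketched as follows: fit a least-squares line to each dialogue's progression curve, then correlate the slopes with the acceptability scores. The curves and scores below are made-up toy values, and the function names are illustrative.

```python
import math

def least_squares_slope(values):
    """Slope of the least-squares line through (0, v0), (1, v1), ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy progression curves (one PF value per turn) and final acceptability scores.
curves = [[0.0, 0.5, 1.0], [0.2, 0.1, -0.3], [0.0, 0.1, 0.1]]
acceptability_scores = [1.2, -0.8, 0.1]
slopes = [least_squares_slope(curve) for curve in curves]
slope_correlation = pearson_r(slopes, acceptability_scores)
```

A high positive `slope_correlation` indicates that curves trend upward for high-acceptability dialogues and downward for low-acceptability ones.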

4.2.3 Manual Evaluation

To obtain a more precise evaluation, we asked three annotators to estimate sentence-level progression for dialogues in our test set. Two graduate students and one postdoc in our lab served as our annotators. For each of twelve randomly selected test dialogues, each annotator rated all sentences on a scale of {-1, 0, 1} for progression, with -1 indicating regression from the task goal, 0 indicating neutral progression, and +1 indicating progression toward the task goal. Altogether, each annotator provided 431 sentence ratings across 244 utterances. After aggregating at the utterance level, average inter-annotator agreement is 0.57 (Cohen’s kappa). For each dialogue, the cumulative sum of the utterance-level manual ratings creates a ground-truth progression curve, as shown in Figure 4.

We evaluate the progression models against the ground-truth curves using Pearson’s correlation ($r$) since the PF output and cumulative manual ratings are continuous and on different scales. We report the following correlations: (1) between utterance-level PF value and ground-truth value (utt), (2) between utterance-level PF slope and ground-truth slope (utt-sl), (3) between dialogue-level PF slope and ground-truth slope (dlg-sl), and (4) between dialogue-level PF slope and the final ground-truth progression value (dlg-sl-f). We repeat this evaluation for each of the 33 uniquely initialized model instances of each type from Section 4.2.2, averaging each metric across raters and then across runs. We report the aggregate means and standard deviations across runs in Table 2. A complete listing of results for each rater (averaged across runs) is provided in Appendix C, along with detailed explanations of each metric (Appendix D). Additionally, Figure 10 in Appendix E provides examples of disagreement between the PF and ground-truth progression curves which can be challenged despite high inter-annotator agreement, demonstrating difficulty in establishing ground-truth for this open-ended task.

Model          utt             utt-sl          dlg-sl           dlg-sl-f
Unsuperv.      0.09 (±0.13)    0.04 (±0.01)    -0.01 (±0.07)    -0.07 (±0.07)
RoBERTa-b      0.39 (±0.07)    0.18 (±0.03)    0.26 (±0.06)     0.26 (±0.06)
RoBERTa-l      0.39 (±0.08)    0.18 (±0.06)    0.34 (±0.09)     0.35 (±0.10)
RoBERTa-l-a    0.48 (±0.05)    0.24 (±0.03)    0.43 (±0.06)     0.41 (±0.07)

b = base; l = large; l-a = large-adapted.
Table 2: Progression Function Manual Eval Results: Mean (SD) of utt, utt-sl, dlg-sl, & dlg-sl-f across runs.
Figure 4: Left: ground-truth progression curve given by the cumulative sum of utterance-level manual ratings. Right: estimated curve from RoBERTa-large-adapted.

4.2.4 Benefit of Domain Adaptation

To verify the beneficial effect of domain adaptation we perform two-tailed paired t-tests on the differences in means between RoBERTa-large and RoBERTa-large-adapted on all automatic and manual metrics. For each metric, we pair the results from both models for each run of the same seed, since their regression heads would have received identical initializations. We find that the means of utt, utt-sl, dlg-sl, and dlg-sl-f differ significantly, as do the means of the automatic Pearson’s $r$ metric. This confirms our intuition that domain adaptation for dialogue prior to fine-tuning the regression objective aids generalization in this task.
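The paired comparison can be illustrated by computing the paired t statistic over per-seed results. This sketch computes only the statistic and degrees of freedom; converting them to a p-value requires the t distribution, which is left to a statistics package. The metric values below are invented for illustration and are not results from the paper.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1

# Toy per-seed metric values for two model variants (same seeds, same order).
adapted = [1.0, 2.0, 3.0, 4.0]
baseline = [0.0, 1.0, 1.0, 3.0]
t_stat, dof = paired_t(adapted, baseline)
```

Pairing by seed removes the variance shared by both models within a run, which is why the test operates on the per-seed differences.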

4.3 Rollout Experiments

To demonstrate the ability of the PF to guide a dialogue agent, we use it to score rollouts generated with DialoGPT as described in Section 3.3. Specifically, we design a self-play experiment to automatically evaluate the effect of PF-guided rollouts on the success of the solicitation task in Persuasion For Good. The following summarizes the experimental setup, procedure, and results.

4.3.1 Experimental Setup

First, we fine-tune DialoGPT to generate responses on Persuasion For Good. We add speaker control tokens to the vocabulary so that the model can be conditioned to generate as the persuader or persuadee, respectively. Training is done with AdamW for up to 6 epochs with early stopping over a 10% validation set using perplexity. The final model checkpoint was selected after 3 epochs, achieving validation perplexity of 8.82.

We then select a progression model to use for all self-play runs. Since the supervised RoBERTa-large-adapted model achieved the best average scores across all automatic and manual evaluations, we randomly select one of the 33 RoBERTa-large-adapted instances from Section 4.2.2 to use for all runs. We use this instance for rollout scoring and to measure the progression of each self-play dialogue.

Finally, we train a binary classifier to identify if the persuadee has stated the intent to donate in a conversation, which we use to detect successful self-play dialogues. We fine-tune a RoBERTa model as a classifier using just the persuadee’s utterances as input and use binarized donation labels in Persuasion For Good as targets. Specifically, for each dialogue the label is 0 if the donated amount is $0, otherwise it is 1. We use the manually verified intended donation labels from Persuasion For Good “ANNSET” for our validation and test splits and use the remaining end-task donation labels for training. Training is done with early stopping over the validation split using macro F1. The final model checkpoint achieved test F1 of 0.89 and test accuracy of 0.90. All three trained models used in this experiment are available to download (see our code release for instructions and hyperparameters).
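
The label binarization and the macro F1 criterion described above can be sketched as follows. The donation amounts and predictions are invented placeholders, and the function names are hypothetical; a library routine such as scikit-learn's `f1_score` with `average="macro"` computes the same quantity.

```python
def binarize_donation(amount):
    """Label is 0 if the donated amount is $0, otherwise 1."""
    return 0 if amount == 0 else 1


def macro_f1(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # Convention assumed here: F1 is 0 when a class has no support and no predictions.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


donations = [0.0, 0.5, 2.0, 0.0]        # placeholder amounts in dollars
y_true = [binarize_donation(d) for d in donations]
y_pred = [0, 1, 0, 0]                   # placeholder classifier outputs
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```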

4.3.2 Self-Play Procedure

From our test set of 174 dialogues, we manually filter out 41 in which the persuadee pledges a donation within the first 10 utterances, leaving 133 remaining conversations. For each of these, the response generator is given the first 10 utterances as context and tasked to complete a second set of 10 utterances, playing the role of both the persuader and persuadee. Since the task is solicitation, we allow the generator to use rollouts only when acting as the persuader. We perform the self-play experiment using three persuader modes: (1) with no rollouts (No RO), (2) with 2 response candidates, 2 rollouts per candidate, and 3 utterances per rollout (2x2x3), and (3) with 3 response candidates, 3 rollouts per candidate, and 5 utterances per rollout (3x3x5). For each utterance in each rollout, we use beam sampling with num_beams=6, top_k=50, top_p=0.95, and temperature= where is the number of tokens in the dialogue history. After generation, we compute the following metrics for each dialogue: (1) progression using the selected RoBERTa-large-adapted instance (Prog.), (2) persuader and persuadee sentiment using the sentiment classifier from Section 4.2 (ER Sent. & EE Sent.), and (3) the percent of test dialogues where the persuadee pledges a donation amount or explicitly states intent to donate, as detected by the binary donation intent classifier (EE Don.%).
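
The candidate-selection step behind the 2x2x3 and 3x3x5 modes can be sketched as below. This is a simplified illustration under stated assumptions, not the released implementation: `toy_generate` and `toy_progression` are stand-ins for the fine-tuned DialoGPT generator and the progression model, speaker alternation is omitted, and we assume that the rollout length counts the utterances simulated after the candidate response.

```python
import random
import statistics


def select_response(history, generate, progression, n_candidates, n_rollouts, rollout_len):
    """Return the candidate whose rollouts end with the highest mean progression."""
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = generate(history)
        end_scores = []
        for _ in range(n_rollouts):
            rollout = history + [candidate]  # new list, so history is not mutated
            for _ in range(rollout_len):
                rollout.append(generate(rollout))
            end_scores.append(progression(rollout))  # progression at the end of the rollout
        score = statistics.mean(end_scores)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-ins (assumptions): a random utterance sampler and a keyword-based score.
UTTERANCES = ["I would like to donate.", "Tell me more.", "No thanks."]
rng = random.Random(0)


def toy_generate(history):
    return rng.choice(UTTERANCES)


def toy_progression(dialogue):
    return sum(1.0 for u in dialogue if "donate" in u) / len(dialogue)


history = ["Hello!", "Hi, how are you?"]
best, score = select_response(history, toy_generate, toy_progression,
                              n_candidates=2, n_rollouts=2, rollout_len=3)
print(best, round(score, 3))
```

In this sketch the cost per persuader turn grows as candidates x rollouts x utterances generation calls, which is why the 3x3x5 mode is considerably more expensive than 2x2x3.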

4.3.3 Self-Play Results

Mode     Prog.         ER Sent.      EE Sent.      EE Don.%
No RO    0.01 (0.24)   0.51 (0.03)   0.44 (0.06)   38% (6%)
2x2x3    0.69 (0.29)   0.62 (0.05)   0.49 (0.07)   45% (10%)†
3x3x5    0.95 (0.16)   0.66 (0.02)   0.52 (0.04)   47% (11%)
All 2x2x3 and 3x3x5 means significant at (or 0.05 if marked †) when compared to No RO with two-tailed paired t-tests. ER and EE refer to the persuader and persuadee respectively.
Table 3: Rollouts self-play results: Mean (SD) of progression, sentiment, and % donated across runs.

For robustness we repeat this procedure 5 times with varying generation seeds for each persuader mode. In total, 1,995 self-play dialogues are completed (133 dialogues for each of 3 modes for each of 5 seeds). We average each metric across dialogues and then across runs, and report the aggregate means and standard deviations across runs. Additionally, to verify the benefit of rollouts, we perform two-tailed paired t-tests to confirm the differences in means between the rollout-enabled modes (2x2x3 and 3x3x5) and the baseline (No RO). For each metric, we average the results across runs and pair these averages from both modes for each dialogue. Results are shown in Table 3.
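
The two aggregation schemes described above (per-run means for reporting, per-dialogue means for the paired tests) can be sketched as follows. The scores are invented placeholders and the function names are hypothetical; we also assume the sample standard deviation, since the text does not say which estimator is used.

```python
import statistics


def aggregate_across_runs(scores_by_run):
    """Average over dialogues within each run, then report mean and SD across runs."""
    run_means = [statistics.mean(scores) for scores in scores_by_run]
    return statistics.mean(run_means), statistics.stdev(run_means)


def per_dialogue_means(scores_by_run):
    """Average each dialogue's score across runs (the values paired in the t-tests)."""
    return [statistics.mean(column) for column in zip(*scores_by_run)]


# Dialogue count from the text: 133 dialogues x 3 modes x 5 seeds.
total_dialogues = 133 * 3 * 5

# Placeholder scores for one mode: rows are runs (seeds), columns are dialogues.
runs = [
    [0.0, 1.0, 2.0],
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
]
mean_across_runs, sd_across_runs = aggregate_across_runs(runs)
dialogue_means = per_dialogue_means(runs)
print(total_dialogues, mean_across_runs, sd_across_runs, dialogue_means)
```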

We observe that the mean progression increases significantly when rollouts are used. This is expected, since the response candidates with the highest average end-rollout progression are selected. We also observe that rollouts lead to higher average sentiment for both the persuader and persuadee, which is consistent with the correlation between sentiment and the acceptability score (see Figure 5 in Appendix B). Finally, rollouts yield a higher percentage of dialogues with a pledged or intended donation. All of these results are significant at the level except for EE Don.% in 2x2x3 mode, which is significant at the 0.05 level. (Footnote 6: We randomly selected 30 of the 1,995 self-play dialogues and manually inspected them for donation intent. We found that the classifier labeled 26 of them correctly, i.e., 87% accuracy.)

Although progression is noticeably higher for the 3x3x5 mode than for the 2x2x3 mode (0.95 vs 0.69), all other metrics are close between these modes with a small advantage in 3x3x5 mode. This suggests that scaling rollout simulations can be beneficial, but there may be diminishing returns for simulation size. Example self-play dialogues are provided in Tables 7, 8, and 9 in Appendix H.

5 Limitations & Future Direction

We recognize several limitations of our study which warrant follow-up investigation. This study focuses on a single task and dataset, and thus is subject to the assumptions and biases therein. Since we intend our framework to be general, it is prudent to perform additional studies to verify the efficacy of our methods on a variety of datasets spanning multiple dialogue domains and tasks. Also, although we provide automatic evaluation of the ability of rollouts to improve performance on a solicitation task, we cannot assume that humans would respond in the same way as DialoGPT. Thus, human evaluation is needed to further validate this approach.

6 Conclusion

In this work we introduced the concept of global dialogue state and proposed a framework with which a dialogue agent can gain awareness of where an ongoing conversation is headed, the likelihood of a successful outcome, and how its own response decisions impact the overall direction of the dialogue. We demonstrated that an unsupervised approach to modeling the GDS space and progression function is feasible, which is useful in sparsely-labeled settings. However, we showed that with domain-adaptation pre-training for dialogue, supervised methods are preferable when labels are available. Finally, we demonstrated how using the PF as a feedback mechanism via dialogue rollouts allows an agent to improve outcomes on a solicitation task.

Ethical Considerations

Ethical Dialogue Systems

We acknowledge the potential risks inherent in the deployment of goal-oriented dialogue systems, and especially note that care must be taken to ensure persuasive dialogue systems are designed for beneficial use, as discussed by Wang et al. (2019). Concretely, when applying our framework, care must be taken to ensure that the goal of the system (defined by the primary success attribute of the acceptability score) is generally accepted as beneficial. For example, our basis for dialogue acceptability in this work is with respect to raising money for a children’s charity. In general, the achievement of the system’s goal should not intentionally lead the user or any other party to harm. Additionally, the definition of acceptability, through its primary or any other correlated attributes, should not allow for discriminatory responses, purposefully malicious discourse, or other violations of accepted ethical standards. For example, we include sentiment as secondary attributes in the acceptability score, which, when applied via dialogue rollouts, encourages the system to be courteous, polite, and respectful. It is possible with minimal effort to include further secondary attributes that identify bias, hate speech, and other indicators to help the system remain safe to use.

Annotator Compensation

All manual annotators were recruited on a voluntary basis in an educational setting and did not receive or expect monetary compensation. Specifically, two graduate students and one postdoc in our lab served as our annotators.

Environmental Impact

All training and inference in this work was done with two NVIDIA Quadro RTX 8000 GPUs. The most compute-intensive portion of the work was the additional domain adaptation pre-training for RoBERTa-large-adapted (see Section 3.2), which took approximately two weeks. After that the multi-seed self-play evaluations took approximately four days, and all other operations (e.g., training and evaluating PF models, fine-tuning DialoGPT) took 24 hours or less.

Acknowledgements


We would like to thank our manual annotators for their valuable contribution and the anonymous reviewers for their helpful feedback. This paper is based upon work supported in part by the United States Air Force under Contract No. FA8750-21-C-0075 and in part by the IBM Corporation under the Artificial Intelligence Research Collaboration Agreement No. W1771793 between IBM and Rensselaer. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of USAF or IBM Corporation.

References


  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from External Links: Link Cited by: Table 4.
  • D. Adiwardana, M. Luong, D. R. So, J. Hall, N. Fiedel, R. Thoppilan, Z. Yang, A. Kulshreshtha, G. Nemade, Y. Lu, et al. (2020) Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. External Links: Link Cited by: §2.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §3.1.1.
  • V. Balaraman, S. Sheikhalishahi, and B. Magnini (2021) Recent neural methods on dialogue state tracking for task-oriented dialogue systems: a survey. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 239–251. Cited by: §1.
  • F. Barbieri, J. Camacho-Collados, L. Espinosa Anke, and L. Neves (2020) TweetEval: unified benchmark and comparative evaluation for tweet classification. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 1644–1650. External Links: Document, Link Cited by: §4.2.
  • S.A. Beebe and J.T. Masterson (2000) Communicating in small groups: principles and practices. Longman. External Links: ISBN 9780321036483, LCCN 99026288, Link Cited by: §1.
  • F.J. Bernieri and R. Rosenthal (1991) Interpersonal coordination: behavior matching and interactional synchrony. In Fundamentals of nonverbal behaviors. Studies in emotion and social interaction, R.S. Feldman and B. Rime (Eds.), Cited by: §1.
  • P. Budzianowski, T. Wen, B. Tseng, I. Casanueva, S. Ultes, O. Ramadan, and M. Gašić (2018) MultiWOZ - a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 5016–5026. External Links: Document, Link Cited by: §2.1.
  • R. J. Campello, D. Moulavi, and J. Sander (2013) Density-based clustering based on hierarchical density estimates. In Pacific-Asia conference on knowledge discovery and data mining, pp. 160–172. Cited by: §3.1.1.
  • J. Cassell, A. Gill, and P. Tepper (2007) Coordination in conversation and rapport. In Proceedings of the Workshop on Embodied Language Processing, Prague, Czech Republic, pp. 41–50. External Links: Link Cited by: §1.
  • J. Cieciuch and E. Davidov (2012) A comparison of the invariance properties of the pvq-40 and the pvq-21 to measure human values across german and polish samples. In Survey Research Methods, Vol. 6, pp. 37–48. Cited by: §4.1.
  • R. Csaky and G. Recski (2021) The Gutenberg dialogue dataset. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 138–159. External Links: Link Cited by: §3.2.
  • M. Eric, R. Goel, S. Paul, A. Sethi, S. Agarwal, S. Gao, A. Kumar, A. Goyal, P. Ku, and D. Hakkani-Tur (2020) MultiWOZ 2.1: a consolidated multi-domain dialogue dataset with state corrections and state tracking baselines. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, pp. 422–428 (English). External Links: ISBN 979-10-95546-34-4, Link Cited by: §2.1.
  • W. Falcon and The PyTorch Lightning team (2019) PyTorch Lightning. External Links: Document, Link Cited by: Table 4.
  • R. Friedman (2004) Studying negotiations in context: an ethnographic approach. Internat’l Negotiation. Cited by: §1.
  • L. R. Goldberg (1992) The development of markers for the big-five factor structure.. Psychological assessment 4 (1), pp. 26. Cited by: §4.1.
  • J. Graham, B. A. Nosek, J. Haidt, R. Iyer, S. Koleva, and P. H. Ditto (2011) Mapping the moral domain.. Journal of personality and social psychology 101 (2), pp. 366. Cited by: §4.1.
  • D. D. Gremler and K. P. Gwinner (2008) Rapport-building behaviors used by retail employees. Retailing. Cited by: §1.
  • K. Hamilton, S. Shih, and S. Mohammed (2016) The development and validation of the rational and intuitive decision styles scale. Journal of personality assessment 98 (5), pp. 523–535. Cited by: §4.1.
  • C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant (2020) Array programming with NumPy. Nature 585 (7825), pp. 357–362. External Links: Document, Link Cited by: Table 4.
  • H. He, D. Chen, A. Balakrishnan, and P. Liang (2018) Decoupling strategy and generation in negotiation dialogues. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2333–2343. External Links: Document, Link Cited by: §2.1.
  • J. D. Hunter (2007) Matplotlib: a 2d graphics environment. Computing in Science & Engineering 9 (3), pp. 90–95. External Links: Document Cited by: Table 4.
  • Z. Jiang, X. Mao, Z. Huang, J. Ma, and S. Li (2019) Towards end-to-end learning for efficient dialogue agent by modeling looking-ahead ability. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Stockholm, Sweden, pp. 133–142. External Links: Document, Link Cited by: §2.2.
  • W. Langewitz, M. Nübling, and H. Weber (2003) A theory-based approach to analysing conversation sequences. Epidemiologia e Psichiatria Sociale 12 (2), pp. 103–108. External Links: Document Cited by: §1.
  • P. Lertvittayakumjorn, D. Bonadiman, and S. Mansour (2021) Knowledge-driven slot constraints for goal-oriented dialogue systems. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 3407–3419. External Links: Document, Link Cited by: §2.1.
  • M. Lewis, D. Yarats, Y. Dauphin, D. Parikh, and D. Batra (2017) Deal or no deal? end-to-end learning of negotiation dialogues. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2443–2453. External Links: Document, Link Cited by: Figure 1, §2.2, §3.3.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. External Links: Link Cited by: §3.2.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §4.2.1.
  • L. McInnes, J. Healy, and S. Astels (2017) Hdbscan: hierarchical density based clustering. The Journal of Open Source Software 2 (11), pp. 205. Cited by: Table 4.
  • L. McInnes, J. Healy, and J. Melville (2018a) Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. External Links: Link Cited by: §3.1.1.
  • L. McInnes, J. Healy, N. Saul, and L. Grossberger (2018b) UMAP: uniform manifold approximation and projection. The Journal of Open Source Software 3 (29), pp. 861. Cited by: Table 4.
  • L. McInnes and J. Healy (2017) Accelerated hierarchical density based clustering. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pp. 33–42. Cited by: §3.1.1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: Table 4.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: Table 4.
  • plotly technologies inc. (2015) Montreal, QC. External Links: Link Cited by: Table 4.
  • L. Qin, F. Wei, T. Xie, X. Xu, W. Che, and T. Liu (2021) GL-GIN: fast and accurate non-autoregressive model for joint multiple intent detection and slot filling. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, pp. 178–188. External Links: Document, Link Cited by: §2.1.
  • A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2020) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 8689–8696. External Links: Link, Document Cited by: §2.1.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3982–3992. External Links: Document, Link Cited by: Table 4.
  • S. Roller, E. Dinan, N. Goyal, D. Ju, M. Williamson, Y. Liu, J. Xu, M. Ott, E. M. Smith, Y. Boureau, and J. Weston (2021) Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 300–325. External Links: Link Cited by: §2.1.
  • T. Sainburg, L. McInnes, and T. Q. Gentner (2020) Parametric umap: learning embeddings with deep neural networks for representation and semi-supervised learning. Cited by: §3.1.1.
  • V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan (2011) Finding a "kneedle" in a haystack: detecting knee points in system behavior. In 2011 31st International Conference on Distributed Computing Systems Workshops, pp. 166–171. External Links: Document Cited by: Table 4.
  • K. Song, X. Tan, T. Qin, J. Lu, and T. Liu (2020) MPNet: masked and permuted pre-training for language understanding. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 16857–16867. External Links: Link Cited by: §3.1.1.
  • the pandas development team (2020) Pandas-dev/pandas: pandas External Links: Document, Link Cited by: Table 4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §3.1.1.
  • P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors (2020) SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17, pp. 261–272. External Links: Document Cited by: Table 4.
  • X. Wang, W. Shi, R. Kim, Y. Oh, S. Yang, J. Zhang, and Z. Yu (2019) Persuasion for good: towards a personalized persuasive dialogue system for social good. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5635–5649. External Links: Document, Link Cited by: Figure 1, §1, §4.1, Ethical Dialogue Systems.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. External Links: Document, Link Cited by: Table 4.
  • J. Xu, A. Szlam, and J. Weston (2021) Beyond goldfish memory: long-term open-domain conversation. arXiv preprint arXiv:2107.07567. External Links: Link Cited by: §2.1.
  • D. Yarats and M. Lewis (2018) Hierarchical text generation and planning for strategic dialogue. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 5587–5595. External Links: Link Cited by: §2.2.
  • X. Zang, A. Rastogi, S. Sunkara, R. Gupta, J. Zhang, and J. Chen (2020) MultiWOZ 2.2 : a dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, Online, pp. 109–117. External Links: Document, Link Cited by: §2.1.
  • Y. Zhang, S. Sun, M. Galley, Y. Chen, C. Brockett, X. Gao, J. Gao, J. Liu, and B. Dolan (2020) DIALOGPT : large-scale generative pre-training for conversational response generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Online, pp. 270–278. External Links: Document, Link Cited by: §2.1, §3.3.

Appendix A Software Packages Used

Package Version Citation URL
hdbscan 0.8.27 (McInnes et al., 2017)
kneed 0.7.0 (Satopaa et al., 2011)
Matplotlib 3.3.4 (Hunter, 2007)
NumPy 1.19.5 (Harris et al., 2020)
Pandas 1.2.4 (the pandas development team, 2020)
plotly 5.1.0 (plotly technologies inc., 2015)
PyTorch 1.9.0 (Paszke et al., 2019)
PyTorch Lightning 1.5.8 (Falcon and team, 2019)
scikit-learn 0.24.1 (Pedregosa et al., 2011)
SciPy 1.6.2 (Virtanen et al., 2020)
Sentence-Transformers N/A* (Reimers and Gurevych, 2019)
TensorFlow 2.5.1 (Abadi et al., 2015)
Transformers 4.11.3 (Wolf et al., 2020)
umap-learn 0.5.1 (McInnes et al., 2018b)
* We use all-mpnet-base-v2 directly through Transformers, but it is part of the Sentence-Transformers model library. Additionally, we base parts of our sentence embedding implementation on that found in Sentence-Transformers.
Table 4: Software packages used in obtaining or presenting the results in this work

Appendix B Training Set Covariances For Acceptability Score

Figure 5: The covariances of all other dialogue attributes with respect to the persuadee donation are used to weight the acceptability score. ER and EE refer to the persuader and persuadee respectively.
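
To make the weighting idea in Figure 5 concrete, the sketch below computes the sample covariance of each secondary attribute with the persuadee donation and uses these covariances as weights in a simple weighted sum. This is only one plausible reading of the caption: the attribute names, the values, and the weighted-sum form are assumptions for illustration, and the exact definition of the acceptability score is the one given in the main text.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance (n - 1 denominator) of two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Placeholder training-set values, one entry per dialogue (not real data).
donation = [0.0, 1.0, 2.0, 3.0]
attributes = {
    "ee_sentiment": [0.1, 0.3, 0.5, 0.7],
    "er_sentiment": [0.6, 0.6, 0.6, 0.6],
}

# One weight per secondary attribute: its covariance with the donation.
weights = {name: covariance(values, donation) for name, values in attributes.items()}


def weighted_secondary_score(dialogue_attributes):
    """Hypothetical covariance-weighted sum of a dialogue's secondary attributes."""
    return sum(weights[name] * value for name, value in dialogue_attributes.items())


example = weighted_secondary_score({"ee_sentiment": 0.6, "er_sentiment": 0.9})
print(weights, example)
```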

Appendix C Full Manual Evaluation Results

Model utt (1/2/3) utt-sl (1/2/3) dlg-sl (1/2/3) dlg-sl-f (1/2/3)
unsupervised 0.09 / 0.06 / 0.12 0.02 / 0.04 / 0.05 0.04 / -0.03 / -0.02 -0.07 / -0.09 / -0.05
RoBERTa-base 0.39‡/ 0.30‡/ 0.48‡ 0.15†/ 0.17†/ 0.21‡ 0.30 / 0.12 / 0.34 0.26 / 0.16 / 0.35
RoBERTa-large 0.39‡/ 0.30‡/ 0.50‡ 0.16 / 0.18†/ 0.21† 0.41 / 0.17 / 0.46 0.36 / 0.21 / 0.47
RoBERTa-large-adapted 0.49‡/ 0.37‡/ 0.59 0.21‡/ 0.24‡/ 0.29 0.51 / 0.26 / 0.52 0.45 / 0.27 / 0.51
Average Pearson’s p-value across runs: †: ;  ‡: ; (two-tailed; is non-correlation).
Table 5: Progression Function Manual Eval Results (all annotators, averaged across all runs)

Appendix D Explanations of Manual Metrics

Figure 6: utt: Pearson’s r (right) between utterance-level PF values (center, e.g., circled) and ground-truth values (left, e.g., circled) for all 244 utterances across 12 dialogues. Points shown on the right are from annotator 3. This metric is intended to measure if the PF and ground-truth progression curves assign similar values (relative to their respective scales) at each step of an ongoing dialogue.
Figure 7: utt-sl: Pearson’s r (right) between utterance-level PF slopes (center, e.g., see triangle) and ground-truth slopes (left, e.g., see triangle), for all 244 utterances across 12 dialogues. Utterance-level slopes are computed as the differences in the progression curves between two dialogue steps. Points shown on the right are from annotator 3. This metric is intended to measure if the PF and ground-truth progression curves move in the same direction at each step of an ongoing dialogue.
Figure 8: dlg-sl: Pearson’s r (right) between dialogue-level PF slopes (center, e.g., see line) and ground-truth slopes (left, e.g., see line), for all 12 dialogues. Dialogue-level slopes are computed by fitting least-squares regression lines to the progression curves. Points shown on the right are from annotator 3. This metric is intended to measure the ability of the overall PF trend to approximate the ground-truth progression curve.
Figure 9: dlg-sl-f: Pearson’s r (right) between dialogue-level PF slopes (center, e.g., see line) and the final ground-truth progression value (left, e.g., circled), for all 12 dialogues. Dialogue-level slopes are computed by fitting least-squares regression lines to the progression curves. Points shown on the right are from annotator 3. This metric is intended to measure the ability of the overall PF trend to predict the end-task outcome.
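
The four manual metrics described in Figures 6 to 9 can be sketched in a few lines. The ratings and PF values below are invented placeholders, the function names are hypothetical, and the handling of the first utterance when computing utterance-level slopes (here simply omitted) is an assumption. Ground-truth curves are built as the cumulative sum of utterance-level ratings, following the caption of Figure 4.

```python
import math
from itertools import accumulate


def pearson(xs, ys):
    """Pearson's r; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def step_slopes(curve):
    """Utterance-level slopes: differences between consecutive dialogue steps."""
    return [b - a for a, b in zip(curve, curve[1:])]


def ls_slope(curve):
    """Dialogue-level slope: least-squares line fitted against the step index."""
    n = len(curve)
    mx, my = (n - 1) / 2, sum(curve) / n
    num = sum((i - mx) * (y - my) for i, y in enumerate(curve))
    den = sum((i - mx) ** 2 for i in range(n))
    return num / den


def progression_metrics(pf_curves, gt_curves):
    flat = lambda curves: [v for c in curves for v in c]
    pf_slopes = [ls_slope(c) for c in pf_curves]
    return {
        "utt": pearson(flat(pf_curves), flat(gt_curves)),
        "utt-sl": pearson(flat(map(step_slopes, pf_curves)),
                          flat(map(step_slopes, gt_curves))),
        "dlg-sl": pearson(pf_slopes, [ls_slope(c) for c in gt_curves]),
        "dlg-sl-f": pearson(pf_slopes, [c[-1] for c in gt_curves]),
    }


# Placeholder utterance-level ratings for three short dialogues.
ratings = [[1, 1, -1, 1], [-1, -1, 0, -1], [1, 0, 1, 1]]
gt_curves = [list(accumulate(r)) for r in ratings]
pf_curves = [[0.5, 1.0, 0.8, 1.2], [-0.2, -0.9, -1.0, -1.5], [0.3, 0.4, 0.9, 1.4]]
metrics = progression_metrics(pf_curves, gt_curves)
print(metrics)
```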

Appendix E PF Disagreement Examples

Figure 10: Two examples where the PF (RoBERTa-large-adapted) disagrees with multiple annotators. We note that the PF correctly decreases (Right) and stays negative (Left) given the turns boxed in red showing poor progression.

Appendix F Grid Search Details for Unsupervised Model

for  do
       for  do
             for  do
                   for  do
                          k-means experiments
                         for  do
                               for  do
                                     for  do
                         HDBSCAN experiments
                         for  do
                               for  do
                                     for  do
                                           for  do
Algorithm 1 Grid search for hyperparameter tuning of the unsupervised progression model on the validation set. Descriptions for each hyperparameter are provided in Table 6.
Hyperparameter Description
(recency weight) Controls how much emphasis is placed on recent tokens when computing dialogue embeddings.
(embedding size) The dimensionality of dialogue embeddings. Values reduced with Parametric UMAP.
normalize_embeddings If True, embeddings are normalized to have unit magnitude after dimensionality reduction.
distance_metric The distance metric used by Parametric UMAP and centroid proximity calculations.*
k (number of clusters) Number of clusters to use for k-means.
inverse_distance If True, Euclidean centroid proximity is computed as the inverse distance instead of negative distance.**

If True, centroid proximities are converted to z-scores before progression is computed. †

min_cluster_size Minimum number of points in a HDBSCAN cluster. Clusters with fewer points get merged into larger ones.
soft_value_aggregation If True, HDBSCAN cluster attribute aggregations are weighted with cluster membership probabilities.
prob_scaling The type of scaling applied to progression computed from HDBSCAN cluster membership probabilities.
*: Centroid proximity calculations refer to progression computation. Proximities used during clustering are always Euclidean.
**: Does not apply to cosine distance.
†: Applies only if k-means is used, or if HDBSCAN is used with softmax probability scaling.
Table 6: Hyperparameter Descriptions
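
A nested grid search of this shape can be expressed compactly with `itertools.product`. The sketch below is an illustration only: all candidate values are invented, the key `z_score` is a hypothetical name for the z-score option, the value `"linear"` for `prob_scaling` is assumed, the assignment of options to the k-means and HDBSCAN branches is our guess, and `toy_evaluate` stands in for scoring a configuration on the validation set. The one constraint taken from Table 6 is that inverse distance does not apply to cosine distance.

```python
from itertools import product

# Hypothetical search spaces; the actual value ranges are not reproduced here.
SHARED = {
    "recency_weight": [0.5, 1.0],
    "embedding_size": [16, 32],
    "normalize_embeddings": [True, False],
    "distance_metric": ["euclidean", "cosine"],
}
KMEANS = {"k": [8, 16], "inverse_distance": [True, False], "z_score": [True, False]}
HDBSCAN = {
    "min_cluster_size": [5, 10],
    "soft_value_aggregation": [True, False],
    "prob_scaling": ["softmax", "linear"],
}


def grid(space):
    """Yield one dict per point in the Cartesian product of a search space."""
    keys = list(space)
    for values in product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def configs():
    """Enumerate k-means and HDBSCAN configurations under each shared setting."""
    for shared in grid(SHARED):
        for km in grid(KMEANS):
            # Table 6: inverse distance does not apply to cosine distance.
            if shared["distance_metric"] == "cosine" and km["inverse_distance"]:
                continue
            yield {"clusterer": "kmeans", **shared, **km}
        for hd in grid(HDBSCAN):
            yield {"clusterer": "hdbscan", **shared, **hd}


def search(evaluate):
    """Return the configuration with the highest validation score."""
    return max(configs(), key=evaluate)


def toy_evaluate(config):
    # Stand-in for a validation metric; prefers larger embeddings (arbitrary).
    return config["embedding_size"] + (1 if config["clusterer"] == "kmeans" else 0)


all_configs = list(configs())
best_config = search(toy_evaluate)
print(len(all_configs), best_config)
```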

Appendix G Final Unsupervised Model Hyperparameters

The final unsupervised model uses k-means (), , , normalized embeddings, euclidean distance, and inverse distance for centroid proximity.

Appendix H Examples from rollout self-play experiment

The following examples in Tables 7, 8, and 9 compare self-play between the baseline No RO mode and the two rollout modes 2x2x3 and 3x3x5. In each example, the utterance where the persuadee (EE) states intent to donate is highlighted in red. The complete dialogues from the self-play experiments are included with our code release.

ER: Hello, Have you ever donated to a charity?
EE: A few times, yes.
ER: ow was your experience?
EE: It was ok. I didn’t really think much of it. Most of the time its just a dollar or two.
ER: Do you donate all te time?
EE: Not so much.
ER: Have yu heard of Save the children? It is pretty well known around the world.
EE: I think so.
ER: What do you think that they do?
EE: Give money to child. Support efforts in war torn countries. Healthcare. All that stuff.
NO ROLLOUTS (progression: -2.65; ER sentiment: 0.69; EE sentiment: 0.03)
ER: I. It is a worthy cause. You can donate all you want from it.
EE: I’m sorry. I’m broke right now. That’s why I’m asking if you do donate all the time.
ER: It’s okay. I understand and agree. I think it’s a good cause.
EE: So what is the charity that you donate to?
ER: I don’t have a specific charity, but I do think they are a decent charity.
EE: Are they a non-profit? What do they do?
ER: It’s a non profit. They help protect childrens rights to health and education.
EE: Well, I think they are a great charity. Do they provide help in other ways besides donating to them?
ER: Yes they help with children in the US and other developing countries.
EE: So I’m guessing the money goes towards programs to help those children get a proper education?
2x2x3 ROLLOUTS (progression: 0.15; ER sentiment: 0.67; EE sentiment: 0.63)
ER: I, they do. I have donated to the Red Cross and the Salvation Army in the past.
EE: I donated a couple of months ago. Do you donate often? I donate when I can.
ER: Yes, I donate to them as well. How much do you like to donate?
EE: i donate $.50 each month. Do you do the same?
ER: Yeah, that sounds great. Any bit helps though. Do you have an amount in mind?
EE: I would like to donate half. Is that a good idea?
ER: Yes that would be great. The children really do need it. I donate to them as well.
EE: I’m glad we can agree on a generous donation. How much would you like me to donate?
ER: I think half would be fine, I’m not sure, but that is a very generous amount.
EE: Well, it is nice to think that I’m not getting hit by a bot, so that’s a good thing.
Table 7: Rollout Self-Play Example 1: No RO vs. 2x2x3 (dialogue #7; generation seed: 247)

ER: Hello there, how are you?
EE: I am good. How are you?
ER: I’m pretty good, just trying to get ready for work soon.
EE: I understand that. So tell me about this charity.
ER: Have you heard of Save the Children? They do work around health, education, etc.
EE: I’m not sure that I have.
ER: They work mostly in developing countries. I would appreciate it if some of the funds could be donated. The researchers will pass that donation on to the charity at the end.
EE: What kind of work do they do?
ER: They do emergency hunger relief, are engaged with Syrian children.
EE: Ok, that sounds good. How do I know it is a legitimate charity?
NO ROLLOUTS (progression: -0.43; ER sentiment: 0.71; EE sentiment: 0.50)
ER: They are a good non-profit that helps children all around the world
EE: Ah, that’s good to know. Do you know what the funding for this charity actually goes towards?
ER: They actually make a huge impact. Do you donate to charity often?
EE: No, I don’t think so, but I will look them up.
ER: They are a 501c, so it’s not like they’re going to ask you to give anything at this time. Every little bit helps.
EE: Cool, I don’t think I’m ready to give anything up yet. How about you?
ER: They are a legit charity that I donate to regularly. Have you donated to a charity before?
EE: No, I’m usually a pretty cautious individual with donations, but I think that would be a good idea.
ER: They’re great with giving to charities but I wouldn’t be comfortable donating more than the 10.00, but I usually donate my entire earnings from this task to help out others.
EE: Cool, that’s good to know. Have a great day!
3x3x5 ROLLOUTS (progression: 0.36; ER sentiment: 0.65; EE sentiment: 0.70)
ER: Well, it’s not an international non-governmental organization, but they help children in war zones like Syria. Children are our future and we need to help out where we can.
EE: Ah, I see! I will look into this. Does it have any good ratings on Charity Navigator?
ER: I have no idea, but I would be willing to look into them. They’re very trustworthy.
EE: They are very reputable. I will research them later and donate. Thanks!
ER: I’m glad to hear that. This is something I would like to be passionate about.
EE: Yes, I really love helping out! I think we need to take care of our country’s children. And if you would like to know more about a charity you might like to research more: URL
ER: I would like to hear a little more about this charity from you.
EE: Well, I will look that up. This is one I really enjoy working on. I’ve heard of some of the kids who are in need, but this is one I’m very passionate about. Do you have any other charities you’ve worked on?
ER: I’m going to look at some websites right now. It’s called Childrens International. You can research them. They are great. They promote the rights of children in these situations, especially those that are separated from their parents or families. Have you donated to them?
EE: Not yet, but I think I will sometime in the future.
Table 8: Rollout Self-Play Example 2: No RO vs. 3x3x5 (dialogue #88; generation seed: 756)

ER: Hi, do you usually donate to charities
ER: great! You’re such a nice person. What charities do you usually donate to?
EE: To church, I return tithes and I also do the offerings
ER: great! Well, have you ever heard about Save the Children charity?
EE: no, can you tell me more about it?
ER: it’s an international non-governemntal organization that promotes children’s rights, provides relief and helps support children in developing countries
EE: nice! I have a child so i’m all about helping kids
ER: would you donate to this charity? your donation will make such a huge impact for some children
EE: I would like to know more before i donate. How do i know if it’s a legit organization
NO ROLLOUTS (progression: -2.44; ER sentiment: 0.10; EE sentiment: 0.41)
ER: if you would like to donate, you can visit their site: URL
EE: i dont want to donate, if i did however, how much would i donate?
ER: if you have to donate, you would have to submit a donation agreement, i dont think you can donate right now
EE: ok, how do i submit it?
ER: You can’t donate until you check out their website.
EE: ok thanks for telling me, i will check it out
ER: It is a charity that focuses on children in the United States
EE: ok thanks for the information, if i ever get to a child’s situation, i will donate
ER: You’re very welcome
EE: ok, have a nice day
3x3x5 ROLLOUTS (progression: 1.14; ER sentiment: 0.36; EE sentiment: 0.52)
ER: I will send you a copy of the actual donation and you can make a donation from there.
EE: How do i donate?
ER: It’s in your task payment. You can do that right now.
EE: How do i donate?
ER: You don’t need to donate at this point, but I’ll give you a heads up on the website: URL
EE: Ok, i’ll go check it out now. Thank you very much!
ER: You’re welcome, but I have to ask, Would you be willing to donate $2, to the Save the Children charity?
EE: Ok. I will probably just donate $2.
ER: That is great. Thank you so much for your donation!
EE: You’re very welcome. Thank you, have a nice one!
Table 9: Rollout Self-Play Example 3: No RO vs. 3x3x5 (dialogue #98; generation seed: 589)