
WeaSuL: Weakly Supervised Dialogue Policy Learning: Reward Estimation for Multi-turn Dialogue

by Anant Khandelwal, et al.

An intelligent dialogue system in a multi-turn setting should not only generate responses of good quality, but also responses that lead to the long-term success of the dialogue. Although current approaches have improved response quality, they overlook the training signals present in the dialogue data. We can leverage these signals to generate weakly supervised training data for learning a dialog policy and a reward estimator, and make the policy take actions (generate responses) that foresee the future direction of a successful (rewarding) conversation. We simulate a dialogue between an agent and a user (modelled similarly to the agent, with a supervised learning objective) interacting with each other. The agent uses dynamic blocking to generate ranked, diverse responses and exploration-exploitation to select among the Top-K responses. Each simulated state-action pair is evaluated (serving as a weak annotation) by three quality modules: Semantic Relevance, Semantic Coherence and Consistent Flow. Empirical studies with two benchmarks indicate that our model significantly improves response quality and leads to successful conversations under both automatic evaluation and human judgement.





1 Introduction

A dialog policy for multi-turn dialogue decides the next best action to take on the environment so as to complete the conversation according to various success criteria. Reinforcement learning can help learn such a policy, where the environment consists of users (human or model) and the policy takes actions on the environment, from which it receives a reward signal fatemi2016policy; peng2017composite; chen2017agent; yarats2018hierarchical; lei2018sequicity; he2018decoupling; su2018discriminative.

Learning a dialogue policy with reinforcement learning can be challenging with human users, since it requires a large set of reward-annotated samples to train on. Given the many previous works on neural response generation gu2020dialogbert; zhao2020learning; zhang2019recosa; xing2018hierarchical; serban2016building, we can also model the user with any of these encoder-decoder architectures. This lets us simulate conversations in which the simulated user and the agent (policy model) reply to each other zhao2016towards; dhingra2016towards; shah2018bootstrapping. The reward signal for policy learning can be as simple as a small constant negative reward at each turn and a large reward at the end (if the goal is completed) to encourage shorter conversations takanobu2019guided.

However, reward estimation for dialogue is challenging: a small constant negative reward at each turn may cause the conversation to end prematurely. Instead of handcrafting a reward at the end based on success or failure, it is more useful to estimate a reward at every turn, guiding the policy to dynamically change actions according to the user's needs and to end the conversation naturally. As systems grow more complex across different topics, a more sophisticated reward function is needed to avoid manual intervention in accounting for the different factors that make a conversation successful.

In this work, we propose a novel model for contextual response generation in multi-turn dialogue. The model includes a turn-level reward estimator that combines weak supervision signals obtained from three basic modules: 1) Semantic Coherence, 2) Consistent Flow, and 3) Semantic Relevance. These modules are learned jointly with the response generation model using counterfactual examples obtained from negative sampling. Leveraging the weak supervision signals obtained from these modules, we further update the reward estimator and the dialog policy jointly, in an alternating fashion, so that each improves the other.

Our proposed approach integrates semantic understanding of utterances via encoder-decoder systems with the power of Reinforcement Learning (RL) to optimize long-term success. We test the proposed approach on two benchmarks: DailyDialog li2017dailydialog and PersonaChat zhang2018personalizing. Experimental results on both datasets indicate that our model significantly outperforms state-of-the-art generation models in terms of both automatic evaluation and human judgment.

2 Related Work

Open-domain dialogue in a multi-turn setting has been widely explored with different encoder-decoder architectures gu2020dialogbert; feng2021multi; kottur2017exploring; li2016deep; shah2018bootstrapping; shang2015neural; vinyals2015neural; wu2019self; zhao2020learning; zhong2019affect. Basic encoder-decoder architectures such as Seq2Seq models have been widely extended and modified to mitigate generic responses, improve context modelling, and ground responses in persona/emotion/knowledge li2015diversity; xing2017topic; serban2016building; xing2018hierarchical; zhang2019recosa; zhang2018personalizing; zhou2018emotional; dinan2018wizard.

The dialogue literature widely applies reinforcement learning, including recent work based on deep architectures takanobu2019guided; takanobu2020multi; li2020guided; gordon2020learning; gordon2020show. However, these task-oriented RL dialogue systems often model the dialogue with limited parameters and dataset-specific assumptions targeted at a single task. The datasets include hand-built templates, with state, action and reward signals designed by humans for each new domain, making it difficult to extend these systems to open-domain dialogue.

Our goal in this work is to integrate state-of-the-art encoder-decoder architectures such as gu2020dialogbert; zhao2020learning; csaky2020gutenberg with reinforcement learning paradigms to efficiently learn a dialogue policy optimized for long-term success in multi-turn dialogue scenarios. We are inspired by takanobu2019guided; li2020guided; li2016deep to jointly learn the reward function and the dialogue policy, reducing the effort and cost of manually labelling conversations to build the reward model. Specifically, we leverage weak supervision, inspired by chang2021jointly; chang2021neural, to generate the labelled dataset that facilitates this joint learning and the construction of the reward estimation model.

3 Approach

We represent each dialog session as a trajectory of state-action pairs (s_0, a_0, s_1, a_1, ...). The user in our case is a simulator which utters a response a^u given the state s^u, together with a binary signal indicating the end of the dialog session, in which case the response is empty. The dialog policy π decides the action a according to the current state s after the agent interacts with the user simulator. At each time step, the state given to either dialog party is updated after recording the action uttered by the other party. The reward estimator evaluates the quality of each response/action uttered by the dialog policy. The dialog policy is a BERT devlin-etal-2019-bert encoder-decoder model and the reward function is an MLP, parameterized by θ and φ respectively. We model the user simulator in exactly the same way as the agent, but train it only with a supervised learning objective.

In the subsequent sections, we introduce the components: action, state, policy, quality modules and reward estimator. Later sections explain the setup we use for weakly supervised learning and, finally, the experimental results.

3.1 Action

An action is the dialogue utterance generated by the encoder-decoder model, as shown in Figure 1. The model takes as input the context history (state) and outputs a probability distribution over the set of possible actions, π_θ(a|s), parameterized by θ. The user simulator generates the action a^u and the policy generates the action a; the input state for the agent and the user is s and s^u respectively.

3.2 State

The state is the past conversation history between an agent and a user, i.e. the sequence of utterances exchanged so far. The states for the agent and the user are denoted s and s^u respectively. If the agent utterances are denoted by a's and the user utterances by u's, then the agent state is s = (u_1, a_1, ..., u_t) and the agent utters a_t; similarly, the user state is s^u = (u_1, a_1, ..., u_t, a_t) and the user utters u_{t+1}. Each utterance is mapped to a fixed-length sentence vector using SBERT.
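As a concrete illustration, the state can be maintained as a sliding window over utterance embeddings. The sketch below is a minimal, hypothetical version in plain Python; the `embed` function is a trivial stand-in for the SBERT sentence encoder used in the paper, so the example stays runnable:

```python
MAX_UTTERANCES = 20  # maximum history length fed to the encoder

def embed(utterance):
    # Stand-in for an SBERT sentence encoder: a trivial
    # bag-of-characters vector, purely for illustration.
    vec = [0.0] * 4
    for ch in utterance:
        vec[ord(ch) % 4] += 1.0
    return vec

def update_state(state, utterance):
    """Append the new utterance's embedding; keep only the most recent turns."""
    state = state + [embed(utterance)]
    return state[-MAX_UTTERANCES:]

# An agent state interleaves user and agent utterances.
state = []
for turn in ["hello there", "hi, how are you?", "great, thanks!"]:
    state = update_state(state, turn)
```

Once the history exceeds the window, the earliest embeddings are dropped, matching the sliding-window truncation described in Section 4.1.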


Figure 1: BERT-based encoder-decoder with Semantic Coherence and Semantic Relevance losses. The Consistent Flow loss is likewise computed from the encoder outputs.

3.3 Dialogue Policy

The dialogue policy takes the form of a BERT-based encoder-decoder (i.e. π_θ) gu2020dialogbert, as shown in Figure 1. Similar to xu2020learning, we use a BERT-based encoder and a transformer decoder, but instead of feeding the utterances at the word level, we feed utterance representations (obtained from SBERT) into the encoder. The encoder takes the previous context history s as input, and the decoder outputs the response a.

3.4 User Simulator

We model the user simulator in exactly the same way as the BERT-based encoder-decoder shown in Figure 1. However, the user simulator is trained only with a supervised learning objective on the utterances in the dialog corpus, predicting the user response gu2020dialogbert.

3.5 Conversation Quality Modules

We calculate the reward for each state-action pair (see Section 3.8) and use this signal to train the dialogue policy so that it avoids reaching bad states and steers the conversation between the user and the agent to a successful end. We leverage signals from three basic modules, namely Semantic Coherence, Consistent Flow and Semantic Relevance, which are jointly learned with the dialogue policy. For each of the three modules, the data for the positive class is obtained from the source corpus, while the negative class is generated dynamically during training. We describe each module in the following sections.

3.5.1 Semantic Relevance

We need to filter out utterances that are generated with high confidence by the dialog policy but are semantically irrelevant to the previous context. To quantify this characteristic, we model a general response relevance prediction task which exploits the sequential relationship of the dialog data fed to the encoder side of the BERT encoder-decoder framework. Since the task of semantic relevance is to match two sequences of conversation, instead of matching the context and the response we measure the relevance of two fragments of a dialogue session.

Specifically, given a context, we randomly split it into two consecutive pieces, a left part and a right part. Similar to xu2020learning, we replace the left or the right part with a piece sampled from the corpus. We additionally generate negative samples by internal shuffling within the left or the right part. The module is trained as a classifier with corresponding labels y ∈ {0, 1}. Since the individual utterances are fed in after obtaining their vector representations, the aggregated representation of the two pieces is the concatenation of their encoder outputs, over which a non-linear transformation is applied to produce the semantic relevance score p_rel. Similar to xu2020learning, the module is trained with the binary cross-entropy loss:

L_rel = −[ y log p_rel + (1 − y) log(1 − p_rel) ]
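The negative-sample construction for this module can be sketched as follows. This is an illustrative simplification: the function name and the even split between swap-based and shuffle-based negatives are assumptions, not the paper's exact procedure:

```python
import random

def make_relevance_example(context, corpus, rng):
    """Split a context into left/right pieces and optionally corrupt one side.

    Returns ((left, right), label): label 1 for a genuine consecutive pair,
    label 0 for a corrupted (irrelevant) pair.
    """
    split = rng.randrange(1, len(context))
    left, right = context[:split], context[split:]
    if rng.random() < 0.5:
        return (left, right), 1  # positive: two consecutive pieces
    if rng.random() < 0.5:
        # negative: replace one side with a random piece from the corpus
        sampled = list(rng.choice(corpus))
        if rng.random() < 0.5:
            left = sampled
        else:
            right = sampled
    else:
        # negative: internal shuffle within one side
        side = left if rng.random() < 0.5 else right
        rng.shuffle(side)
    return (left, right), 0

rng = random.Random(0)
pair, label = make_relevance_example(["u1", "a1", "u2", "a2"],
                                     [["x1", "x2"], ["y1", "y2", "y3"]], rng)
```

Each emitted pair is then fed to the relevance classifier described above, with the label serving as the binary cross-entropy target.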
3.5.2 Semantic Coherence

The generated response should be rewarded only if it is coherent, not merely if it has adequate content; this pushes the model to generate coherent responses while avoiding incoherent ones. Specifically, given a context, we randomly select an agent response at time t, denoted a_t, and replace it with a random utterance from the corpus. We also generate incoherent samples by internal shuffling of bi-grams. An incoherent utterance is labelled y_i = 1 and a coherent one y_i = 0. The semantic coherence module is also trained as a classifier over the utterance representations obtained at the output of the BERT encoder, as shown in Figure 1. The probability p_i of the i-th utterance being incoherent is obtained by applying a non-linear transformation with a sigmoid output to the i-th encoder representation, and the loss function is the binary cross-entropy summed over utterances:

L_coh = −Σ_i [ y_i log p_i + (1 − y_i) log(1 − p_i) ]
3.5.3 Consistent Flow

We want the agent to continuously add information that keeps the conversation moving forward. To measure this flow, we take the cosine similarity S_1 between the last two agent utterances a_{t−1} and a_t, and the similarity S_2 computed with a randomly sampled utterance in place of a_t. We would like S_1 to be larger than S_2 by at least a margin Δ, and define the learning objective as a hinge loss:

L_flow = max(0, Δ − S_1 + S_2)

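As a worked sketch in plain Python (with toy vectors standing in for SBERT embeddings, and using the margin value 0.54 reported in the appendix), the hinge objective is:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def flow_loss(prev_agent_utt, cur_agent_utt, random_utt, margin=0.54):
    """Hinge loss: the true consecutive pair should be more similar than a
    pair with a randomly sampled utterance, by at least `margin`."""
    s1 = cosine(prev_agent_utt, cur_agent_utt)  # similarity of the real pair
    s2 = cosine(prev_agent_utt, random_utt)     # similarity with random utterance
    return max(0.0, margin - s1 + s2)

# Identical embeddings give s1 = 1; an orthogonal random utterance gives s2 = 0,
# so the margin is satisfied and the loss is zero.
loss = flow_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], margin=0.54)
```

When the real pair is no more similar than the random one, the loss grows linearly with the violation of the margin.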
3.6 Joint Training of Agent and Reward Modules

To initialize the parameters of the agent and the reward modules {Semantic Relevance, Semantic Coherence, Consistent Flow}, we use a supervised learning objective, since all state-action pairs obtained from the pre-training corpus are ground truth and serve as a close approximation for further fine-tuning on other dialog corpora. We use the Gutenberg dialog corpus csaky2020gutenberg as the pre-training corpus. Since the agent model is a BERT encoder-decoder parameterized by θ, similar to gu2020dialogbert, the probability of generating the agent's response a is:

p_θ(a | s) = ∏_{j=1}^{T} p(a_j | a_{<j}, s)

where a_j is the j-th word generated at the output of the decoder, s is the whole context history fed to the encoder, and T is the maximum sequence length of the decoder. The loss for generating the agent response a is the negative log-likelihood:

L_gen = −Σ_{j=1}^{T} log p(a_j | a_{<j}, s)

The joint loss function is defined as the sum of the generation loss and the three module losses (Semantic Relevance, Semantic Coherence, Consistent Flow):

L = L_gen + L_rel + L_coh + L_flow
The policy π is also parameterized by θ, and the probability of an action is given by π_θ(a|s), the same distribution learned from the state-action pairs obtained from the corpus with human demonstrations. It is therefore a good approximation to initialize the parameters of the policy with the parameters of the supervised model. We then further update the policy (the policy-update step of Algorithm 1) to avoid actions which do not lead to rewarding conversations.

3.7 Dialogue Simulation between Agent and User

We set up a simulation between the virtual agent and the user and let them take turns talking to each other. The simulation starts with a starter utterance obtained from the dialog samples (Step 5 of Algorithm 1) and fed to the agent; the agent encodes the utterance and generates a response, the state is updated with the previous history and fed to the user model to obtain the next response, which is appended to the history to obtain the updated state. The process repeats until one of the following conditions occurs after a few turns (the rules apply after the average number of turns in the corpus): a) the agent starts to produce dull responses like "I don't know" (detected with a simple rule-matching method using 9 phrases collected from the corpus; despite occasional false positives and negatives, this works well in practice); b) the agent starts to generate repetitive responses consecutively (two consecutive utterances matching by more than 80% are considered repetitive); c) the conversation reaches the maximum number of turns handled by the agent and user models (set to 20).
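The three termination rules can be sketched as simple checks. The dull-phrase list below is an illustrative subset (the paper uses 9 phrases collected from the corpus), and the token-overlap repetition test is one plausible reading of the 80% matching rule:

```python
DULL_PHRASES = ["i don't know", "i am not sure"]  # illustrative subset
MAX_TURNS = 20

def overlap_ratio(a, b):
    """Fraction of shared tokens relative to the shorter utterance."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / min(len(ta), len(tb))

def should_stop(agent_utterances):
    last = agent_utterances[-1].lower()
    # a) dull response
    if any(p in last for p in DULL_PHRASES):
        return True
    # b) consecutive repetition (> 80% overlap)
    if len(agent_utterances) >= 2 and \
            overlap_ratio(agent_utterances[-1], agent_utterances[-2]) > 0.8:
        return True
    # c) maximum number of turns reached
    return len(agent_utterances) >= MAX_TURNS
```

In the simulation loop, `should_stop` would be evaluated after each agent turn to decide whether the rollout ends.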

3.8 Weakly Supervised Learning Algorithm

Learning with weak supervision has become widespread with the rise of data-driven neural approaches ratner2020snorkel; mrksic-etal-2017-neural; chang2020unsupervised; bach2017learning; wu2018learning; chang2021jointly. Our approach follows a similar line of work: we provide noisy text to a pre-trained model that incorporates prior knowledge from general-domain text and a small amount of in-domain text peng2020few; chen2019few; harkous2020have, and use it as a weak annotator, similar to ratner2020snorkel. The primary challenge with synthetic data is the noise introduced during the generation process, and noisy labels tend to bring little to no improvement frenay2013classification. To train on such noisy data, we employ a three-step training process similar to chang2021jointly; dehghani2017fidelity: a) pre-training, b) generating data with weighted categories, and c) fine-tuning.

Step 1: Pre-train generation and quality modules jointly. This step pre-trains the agent jointly with the quality modules, as explained in Section 3.6. The quality modules are trained on clean data as well as on automatically generated negative samples obtained by random sampling. These modules are further fine-tuned on dialogues sampled from the target dialogue corpus at each training iteration. Similarly, we initialize the user by supervised training on the pre-training dialogue corpus, with fine-tuning on the target dialogue corpus (see Steps 2-7 of Algorithm 1). The fine-tuning steps make use of continual learning to avoid catastrophic forgetting madotto2020continual; lee2017toward.

Step 2: Generate weakly labelled data with reward categories. After the models are initialized with trained parameters, a dialogue simulation is started between the agent and the user (see Section 3.7); they interact with each other and generate synthetic data in which every state-action pair in the sampled dialogues is annotated with a score from each quality module. During dialogue simulation, we employ the Dynamic Blocking mechanism niu2020unsupervised to generate novel words and paraphrased responses. Specifically, we generate the Top-7 responses at each turn and set the agent to explore 60 percent of the time; the rest of the time it exploits by selecting the response from the top two ranked responses. We then filter the state-action pairs into three reward categories: VeryHigh, High and Low. State-action pairs whose scores from every module are greater than or equal to 0.8 are put into the VeryHigh category; pairs whose scores from every module are between 0.6 and 0.8 go into the High category; all remaining pairs go into the Low category. Additionally, we include the state-action pairs sampled from the target dialog corpus in Step 1 in the VeryHigh category.
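The filtering rule can be written directly from the thresholds above (a minimal sketch; `scores` holds one value per quality module):

```python
def reward_category(scores):
    """Bucket a state-action pair by its quality-module scores.

    scores: iterable with one score in [0, 1] per module
    (semantic relevance, semantic coherence, consistent flow).
    """
    if all(s >= 0.8 for s in scores):
        return "VeryHigh"
    if all(0.6 <= s < 0.8 for s in scores):
        return "High"
    return "Low"
```

Note that a pair with mixed scores (e.g. one module above 0.8 and another below 0.6) falls through to the Low category, as in the text.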

Step 3: Update the reward estimator and the policy. The reward estimator maximizes the log-likelihood of state-action pairs with higher rewards relative to those with lower ones. Let the reward estimator f_φ be parameterized by φ, and let H, V and L denote the collections of all state-action pairs in the High, VeryHigh and Low reward categories respectively. We model the state-action pairs of the H, V and L categories as a Boltzmann distribution takanobu2019guided:

p_φ(s, a) = exp(f_φ(s, a)) / Z,  where Z = Σ_{(s,a)} exp(f_φ(s, a))

The cost function for the reward estimator, in terms of the trajectories obtained from the respective reward categories, is:

J_R(φ) = KL(q_H || p_φ) + KL(q_V || p_φ) − KL(q_L || p_φ)

where q_H, q_V and q_L denote the empirical distributions of the corresponding categories. Minimizing J_R(φ) minimizes the KL-divergence between the reward distribution and the state-action pairs of the High and VeryHigh categories, while maximizing the divergence from those in the Low category; its gradient is estimated from samples drawn from the three categories.

Since the dialog policy is required to produce actions at least as good as the High category, it maximizes the entropy-regularized expected reward

J(π_θ) = E_{π_θ}[ f_φ(s, a) ] + H(π_θ)

which effectively minimizes the KL-divergence between the policy distribution and the Boltzmann distribution:

KL(π_θ || p_φ) = −( E_{π_θ}[ f_φ(s, a) ] + H(π_θ) ) + log Z

where the term log Z is independent of θ, and H(·) denotes the entropy of a model. Using the likelihood-ratio trick, the gradient for the policy is:

∇_θ J(π_θ) = E_{π_θ}[ ( f_φ(s, a) − log π_θ(a | s) ) ∇_θ log π_θ(a | s) ]

Hence the reward for each state-action pair is r(s, a) = f_φ(s, a) − log π_θ(a | s), and the loss function can be re-written as the negative entropy-regularized expected reward:

L_π(θ) = −E_{π_θ}[ r(s, a) ]

As in takanobu2019guided, the reward estimator includes a shaping term. Formally, we include the next state s′ as well, instead of just (s, a):

f_φ(s, a, s′) = g(s, a) + γ h(s′) − h(s)

where g is an MLP network whose input is the pre-sigmoid scores from each of the quality modules, and h is an MLP network whose input is the concatenation of the state vector and the SBERT sentence embedding of the action a.
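Under the assumption that the shaping follows the potential-based form of takanobu2019guided, the shaped reward can be sketched with stand-in networks g and h (plain functions here; the discount factor γ = 0.99 is an assumption, not a value given in the text):

```python
GAMMA = 0.99  # discount factor (assumed; not specified in the text)

def shaped_reward(g, h, s, a, s_next, gamma=GAMMA):
    """f(s, a, s') = g(s, a) + gamma * h(s') - h(s)  (potential-based shaping)."""
    return g(s, a) + gamma * h(s_next) - h(s)

# Toy stand-ins: g scores the (state, action) pair, h acts as a state potential.
g = lambda s, a: float(len(a)) / 10.0
h = lambda s: float(len(s))

r = shaped_reward(g, h, s=["u1"], a="hi", s_next=["u1", "hi"])
```

Because the shaping term is a difference of potentials, it changes per-step rewards without changing which policies are optimal.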

1: Input: pre-training corpus and target dialogue corpus.
2: Modules M = {Semantic Relevance, Semantic Coherence, Consistent Flow}.
3: Train the agent on the pre-training corpus jointly with the modules M, as in Section 3.6.
4: Train the user simulator on the pre-training corpus with a supervised objective.
5: for each training iteration do
6:     Sample dialogues randomly from the target dialogue corpus.
7:     Fine-tune the user simulator on the sampled dialogues.
8:     Fine-tune the agent and the modules M on the sampled dialogues jointly.
9:     Collect dialog samples by executing the dialog policy and interacting with the user simulator; the state is updated each time after getting a response from the user and the agent respectively.
10:     Get weak annotation scores for all state-action pairs from each of the modules M.
11:     Filter the pairs into the {VeryHigh, High, Low} reward categories.
12:     Update the reward estimator by minimizing its cost w.r.t. φ (Eq. 10).
13:     Compute the reward for each state-action pair.
14:     Update the policy by minimizing the policy loss w.r.t. θ (Eq. 13).
15: end for
Algorithm 1 Dialogue Policy Learning

4 Experiments

We conduct experiments on DailyDialog li2017dailydialog and PersonaChat zhang2018personalizing, and use the Gutenberg Dialogue Dataset csaky2020gutenberg as the pre-training corpus. We compare our model's performance with baselines on various aspects of response quality.

4.1 Datasets

We evaluate our system on DailyDialog li2017dailydialog and PersonaChat zhang2018personalizing, both open-domain dialog corpora. DailyDialog contains conversations revolving around various topics of daily life, and PersonaChat contains conversations between people with their respective persona profiles. Dialogues can be of varying length; we limit the maximum number of utterances fed to the BERT encoder-decoder model to 20. Since the average length of DailyDialog is 7.9 turns and that of PersonaChat is 9.4, most dialogues fit without truncating the history. For the remaining dialogues, a sliding window keeps the most recent utterances and drops the earliest ones. Since we map utterances to vectors with SBERT, longer utterances are truncated automatically, retaining only the first 512 word pieces. The vocabulary is limited to 100,000 for the pre-training corpus, while the vocabularies for DailyDialog and PersonaChat are 25,000 and 32,768 respectively.

4.2 Baselines

We select various multi-turn response generation baselines. The baselines without pre-training are: (1) HRED: hierarchical encoder-decoder framework serban2016building; (2) VHRED: an extension of HRED that generates responses with latent variables 10.5555/3298023.3298047; (3) HRAN: hierarchical-attention-based encoder-decoder framework xing2018hierarchical; (4) ReCoSa: hierarchical transformer-based model zhang2019recosa; (5) SSN: dialogue generation learning with self-supervision signals extracted from utterance order wu2019self; (6) Transformer-Auxiliary Tasks: a recent state-of-the-art model learning language generation by jointly training a transformer with auxiliary tasks zhao2020learning. Two further baselines from csaky2020gutenberg involve pre-training on the Gutenberg corpus: (1) Transformer: a 50M-parameter version, and (2) GPT-2: a pre-trained model with 117M parameters. The authors' repository contains these two trained models.

4.3 Evaluation Metrics

We evaluate the performance of our model on various aspects of response quality using both automatic and human evaluation. Most automatic metrics correlate poorly with human evaluation liu2016not, and recently proposed metrics li2017adversarial; lowe2017towards; tao2018ruber are harder to evaluate than perplexity and BLEU papineni2002bleu. Human evaluation, in turn, has inherent limitations of bias, cost and replication difficulty tao2018ruber. Reflecting this lack of consensus, some works use only automatic metrics xing2018automatic; xu2018better, some use only human evaluation krause2017edina; fang-etal-2018-sounding, and some use both shen2018nexus; xu2018towards; baheti2018generating; ram2018conversational.

We mainly compute automatic metrics using the DIALOG-EVAL repository dialog-eval; it implements 17 different metrics, of which we measure only a few to facilitate comparison with published baseline results. We specifically follow zhao2020learning for both automatic and human evaluation. For response content quality we measure BLEU-4 papineni2002bleu and perplexity (PPL) sutskever2014sequence. As in zhao2020learning, we use the embedding metrics average (AVG), extrema (EXT) and greedy (GRE), measuring the similarity between response and target embeddings. Also following zhao2020learning, we measure the informativeness of responses with distinct-1 and distinct-2, calculated as the ratios of distinct unigrams and bigrams.
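The distinct-n metrics are straightforward to compute; a minimal sketch:

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to total n-grams over a set of responses."""
    total, seen = 0, set()
    for resp in responses:
        tokens = resp.split()
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0
```

For example, `distinct_n(["a b a b"], 1)` yields 0.5: four unigram tokens, two of them distinct.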

Since our main objective is not only to judge response quality but to predict responses for the long-term success of the dialogue, we follow the guidelines of li2016deep and explore both single-turn and multi-turn settings. We picked 500 dialogues from the test set and asked 3 native speakers for their judgement. In the first setting, judges pick the better response between the one generated by our model and by a baseline model (pre-trained GPT-2), based on criteria such as answerability and semantics. In the second, multi-turn setting, we used 200 simulated conversations between the RL agent and a user model, judging the whole conversation on the responses uttered by the agent: for each complete end-to-end conversation, the judges decide which of the simulated conversations is of higher quality. To compare against the RL model, we use the baseline model to simulate 200 conversations with the same starter utterances used by the RL model. Automatic and human evaluation results are shown in Tables 1 and 2 respectively.


Dataset Model PPL BLEU Distinct-1 Distinct-2 Average Greedy Extrema


DailyDialog HRED 56.22 0.535 1.553 3.569 81.393 65.546 48.109
HRAN 47.23 0.447 1.953 7.400 83.460 67.239 49.599
VHRED 44.79 0.997 1.299 6.113 83.866 67.186 48.570
SSN 44.28 1.250 2.309 7.266 72.796 73.069 44.260
ReCoSa 42.34 1.121 1.987 10.180 84.763 67.557 48.957
Transformer-Auxiliary Tasks 38.60 1.658 3.457 14.954 85.224 69.518 49.069
Pre-Trained Transformer - 11.5 2.92 14.7 55.1 53.5 59.8
Pre-Trained GPT2 - 12.8 4.07 25.9 56.8 54.0 59.6
Our Model 20.13 15.171 6.316 28.422 85.417 73.118 61.539
Our Model w/o weak supervision 20.51 14.718 4.611 26.752 86.481 73.003 59.911
PersonaChat HRED 46.04 1.279 0.164 0.450 83.329 65.546 48.109
HRAN 41.94 1.997 0.235 0.771 82.850 67.239 49.599
VHRED 42.07 2.181 0.312 1.915 82.995 67.186 48.570
SSN 47.90 2.288 0.637 2.623 85.002 73.069 44.260
ReCoSa 34.19 2.258 0.915 4.217 83.963 67.557 48.957
Transformer-Auxiliary Tasks 33.23 2.434 1.279 5.816 83.632 69.518 49.069
Pre-Trained Transformer - 15.5 1.04 4.8 51.3 57.5 57.1
Pre-Trained GPT2 - 15.3 1.82 12.9 53.6 55.9 55.8
Our Model 19.78 16.651 2.434 13.912 84.941 73.081 59.241
Our Model w/o weak supervision 21.49 16.017 2.318 13.274 85.018 72.438 58.816


Table 1: Automatic metrics comparison with baselines. Results in bold indicate the best performing model on the corresponding metrics.


Setting RL-Win RL-Lose Tie
Single-Turn general quality 0.41 0.28 0.31
Single-Turn ease to answer 0.55 0.12 0.33
Multi-Turn general quality 0.76 0.13 0.11
Setting RL-Win RL-Lose Tie
Single-Turn general quality 0.36 0.22 0.42
Single-Turn ease to answer 0.51 0.14 0.35
Multi-Turn general quality 0.71 0.17 0.12


Table 2: Human Evaluation Results. Ratios are calculated after taking majority vote among the decisions made by three judges.

4.4 Results and Discussions

Table 1 reports automatic evaluation metrics for the baselines and the proposed model. Our model outperforms the baselines on most metrics on both datasets. Our main goal is to generate responses that make the conversation successful in the long run, rather than merely optimizing response quality at each turn. This is the main reason our model outperforms the Transformer-Auxiliary Tasks model on both distinct-1 and distinct-2: that model is also trained jointly with similar tasks but lacks fine-tuning with weak supervision signals, indicating that the additional training with weakly labelled data improves generalization. Perplexity also improves, since our model generates responses more like humans do in order to optimize the conversation in the long run. The embedding metrics likewise improve, though only slightly on Average: they capture the overall sense, but there is a length mismatch because our model generates more novel, forward-looking words. Distinct-{1,2} scores show a clear improvement because of the large pre-trained vocabulary, which gives the model more flexibility to generate novel words without disturbing the sense of the sentence.

We also report results for our model without weak supervision training (Our Model w/o Weak Supervision), which simply fine-tunes on DailyDialog li2017dailydialog and PersonaChat zhang2018personalizing without generating the weakly labelled data. Its distinct-1 and distinct-2 scores are clearly lower than the proposed model's, because it tends to generate repetitive words more frequently. Similarly, its embedding metrics and PPL show no improvement over the proposed model, except on the Average embedding metric. However, it performs well on BLEU, since it learns to reproduce responses close to the ground truth but is not optimized for a successful conversation in the long run.

Table 1 also reports the results of the two baselines pre-trained on the Gutenberg Dialogue Corpus csaky2020gutenberg and fine-tuned on DailyDialog and PersonaChat respectively. These models improve considerably on BLEU, distinct-1 and distinct-2, as they benefit from the larger vocabulary and more extensive training for learning language structure, but they lag on the embedding metrics, indicating lower response quality.

Table 2 reports the human evaluation results. Our model is trained to generate responses for a successful conversation in the long run in the multi-turn scenario, and the results meet this expectation: the RL system brings a much larger boost in the multi-turn setting than in single-turn response quality.

5 Conclusions

We proposed a weak supervision framework for policy and reward estimation aimed at the long-term success of the dialogue, by simulating conversations between a virtual agent and a user. Empirical studies on two benchmarks prove the effectiveness of our approach.


Appendix A Implementation Details

Our implementation uses the open-source Huggingface Transformers repository wolf2020huggingfaces. Specifically, we use the base version from sentence-transformers pre-trained on millions of paraphrase examples, named 'paraphrase-distilroberta-base-v1'. The encoder-decoder framework is initialized with the base version 'bert-base-uncased', but with a smaller configuration: 6 transformer layers, a hidden size of 768, and 2 attention heads, {L=6, H=768, A=2}. Similar to gu2020dialogbert, we sum position embeddings with the output sentence embeddings of size 768 to indicate user or agent utterances: odd positions indicate user utterances and even positions agent utterances. The MLP networks for semantic relevance and semantic coherence use a hidden dimension of 128. The margin Δ is set to the best value of 0.54 after a grid search over the range [0.4, 0.7] with step size 0.02. The reward estimator network g is modelled with two hidden layers of sizes 512 and 256 respectively, and h is modelled with a single hidden layer of size one. In each training iteration the policy and reward estimator are updated with continual learning to avoid catastrophic forgetting, using the EWC-modified loss with its parameter set to 0.4. At each training iteration the policy and reward parameters are saved if they reduce the perplexity on the validation set (calculated after running over all the batches of the training dataset), and a patience of 3 is used as the stopping criterion before terminating training.
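The EWC-modified loss used for continual learning can be sketched in its generic form (the paper's λ parameter is 0.4; the diagonal Fisher weights here are illustrative placeholders):

```python
def ewc_loss(task_loss, params, old_params, fisher, lam=0.4):
    """task_loss + (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2

    Penalizes drift of each parameter from its previously learned value
    theta*_i, weighted by its (diagonal) Fisher information F_i.
    """
    penalty = sum(f * (p - p0) ** 2
                  for f, p, p0 in zip(fisher, params, old_params))
    return task_loss + 0.5 * lam * penalty
```

When the parameters have not moved from their previous values, the penalty vanishes and the loss reduces to the plain task loss.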