
Dialogue Evaluation with Offline Reinforcement Learning

Task-oriented dialogue systems aim to fulfill user goals through natural language interactions. They are ideally evaluated with human users, which, however, is infeasible at every iteration of the development phase. Simulated users could be an alternative, but their development is nontrivial. Therefore, researchers resort to offline metrics on existing human-human corpora, which are more practical and easily reproducible. Unfortunately, these metrics are limited in reflecting the real performance of dialogue systems. BLEU, for instance, is poorly correlated with human judgment, and existing corpus-based metrics such as success rate overlook dialogue context mismatches. There is still a need for a reliable metric for task-oriented systems with good generalization and strong correlation with human judgements. In this paper, we propose the use of offline reinforcement learning for dialogue evaluation based on a static corpus. Such an evaluator is typically called a critic and utilized for policy optimization. We go one step further and show that offline RL critics can be trained on a static corpus for any dialogue system as external evaluators, allowing dialogue performance comparisons across various types of systems. This approach has the benefit of being corpus- and model-independent, while attaining strong correlation with human judgements, which we confirm via an interactive user trial.





1 Introduction

With the rise of personal assistants, task-oriented dialogue systems have received a surge in popularity and acceptance. Task-oriented dialogue systems are characterized by a user goal which motivates the interaction, e.g., booking a hotel, searching for a restaurant or calling a taxi. The dialogue agent is considered successful if it is able to fulfill the user goal by the end of the interaction.

Ideally, success rates are obtained via interaction with a real user in-the-wild. Unfortunately, with a handful of exceptions, e.g., LetsGO lee2018dialcrowd and Alexa Challenge gabriel2020further, that is often out of reach. The closest approximation is human trials with paid users such as Amazon Mechanical Turk workers, which has also been adopted as final evaluation in recent incarnations of the Dialogue State Tracking Challenge (DSTC) gunasekara2020overview. However, such evaluations are highly time- and cost-intensive, making them impractical for optimization during iterative development. The third alternative is to use a user simulator to conduct online dialogue simulation; however, the result is subject to the quality of the user simulator itself. Furthermore, developing such simulators is far from straightforward and requires significant amounts of handcrafting schatzmann08a. Only recently have we seen data-driven user simulators that can compete with hand-coded ones lin-etal-2021-domain.

While there has been considerable progress towards more meaningful automatic evaluation metrics for dialogues, there remain a number of limitations, as highlighted by the recent NSF report mehri2022report: the metrics 1) measure only a limited set of dialogue qualities, which mostly focus on subjective aspects such as fluency and coherence, 2) lack generalization across datasets and models, and 3) are not yet strongly correlated with human judgements. These limitations hinder a more widespread use of newly proposed metrics for benchmarking and comparison, especially with prior works. Further, in particular for task-oriented dialogue systems, the need for reliable automatic evaluation of dialogue success is still unanswered.

Being able to automatically evaluate the success rate of any policy using static data offers a number of benefits in terms of required resources, generalizability, and reproducibility. Furthermore, it is not only suitable for the final evaluation of a dialogue policy, but can also be utilized as an objective for iterative optimization. The corpus-based success rate is one such method, which has become the standard metric for state-of-the-art comparisons of policy optimization approaches today budzianowski2018large. Unfortunately, this metric is computed based on pseudo-dialogues that may contain context mismatch. Therefore, we believe it should be treated more as an approximation: it is insufficient at best, and misleading at worst, in reflecting real performance of dialogue systems. In addition, the rules used to check the goal completion need to be handcrafted based on the ontology, making this method data- or ontology-dependent.

In this paper, we propose to use offline reinforcement learning (RL) to train a policy evaluator, also known as a critic, based on a static collection of dialogue data. We show that an offline critic addresses the limitations of current automatic metrics: 1) it can be trained to evaluate any dialogue system architecture after-the-fact, allowing comparisons across various types of systems from prior works, 2) it can be utilized in the iterative development phase to optimize a dialogue policy, 3) it is theoretically grounded, solving the problems that standard corpus-based success rate has due to context mismatch, and 4) it strongly correlates with the performance of the system when interacting with human users, which we confirm via a user trial.

2 Related Work

For a long time, the research in dialogue policy has focused on user-centered criteria such as user satisfaction walker1997paradise; lees12; ultes2017domain. The most reliable way to obtain these scores is to have users interact directly with the system and let them subjectively rate the system afterwards. Due to the time and resource requirements to carry out such evaluations, human trials are usually done only as the final evaluation after the system development is finished.

As the line between policy and natural language generation (NLG) tasks becomes blurred, we see the introduction of metrics such as BLEU papineni2002bleu and perplexity. However, these have been labeled early on to be potentially misleading, as they correlate poorly with human judgement stent2005evaluating; liu2016not. This circumstance motivates automatic metrics that are highly correlated with human ratings dziri-etal-2019-evaluating; mehri2020unsupervised; mehri2020usr. However, these metrics are designed to measure subjective quality of a dialogue response, making them more suitable for evaluating chat-based systems.

Despite the availability of toolkits that facilitate user simulation (US) evaluation zhu2020convlab, corpus-based match and success rates are the default benchmark for works in task-oriented dialogue systems today budzianowski2018large; nekvinda2021shades. These metrics are practical to compute, reproducible, and scalable. Current standard corpus-based metrics are computed on a pseudo-dialogue constructed using user utterances from data and responses generated by the system. A set of rules then checks whether the system provides all information requested by the user. Unfortunately, these metrics do not take into account context mismatches that may originate from the pseudo-dialogue construction, and therefore do not reflect other aspects of dialogue quality, as the resulting dialogue flow is completely overlooked.

There have been few applications of offline RL to dialogue systems. jaques2019way explores various language-based criteria, e.g., sentiment and semantic similarity, as reward signals for open-domain dialogue, paired with a Kullback-Leibler (KL) control for exploration within the support of the data. verma2022chai proposed using fine-tuned language models to utilize unlabeled data for learning the critic function. The method is, however, only demonstrated on a very small state and action space, and it is therefore unclear whether it generalizes to more complex set-ups. ramachandran2021causal applied offline RL with a pair-wise reward learning model based on preference learning; however, it still utilizes the corpus-based success rate for choosing the preferred rollout. To the best of our knowledge, offline RL has not previously been deployed for dialogue evaluation.

3 Preliminaries

3.1 Offline RL

Dialogue can be formulated as a reinforcement learning problem with a Markov decision process (MDP) $\mathcal{M} = \{\mathcal{S}, \mathcal{A}, r, p, p_0, \gamma\}$. In this MDP, $\mathcal{S}$, $\mathcal{A}$, and $r$ denote the state and action spaces, and the reward function, respectively. $p(s_{t+1} \mid s_t, a_t)$ denotes the probability of transitioning to state $s_{t+1}$ from $s_t$ after executing $a_t$, and $p_0(s_0)$ is the probability of starting in state $s_0$. $\gamma \in [0, 1]$ is the discount factor that weighs the importance of immediate and future rewards. At each time step $t$, the agent observes a state $s_t$, executes its policy $\pi$ by selecting an action $a_t$ according to $\pi(a_t \mid s_t)$, transitions to a new state $s_{t+1}$ and receives a reward $r_t = r(s_t, a_t)$. The goal of the policy is to maximize the cumulative discounted rewards, i.e., the return $R = \sum_{t=0}^{T} \gamma^t r_t$.
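To make the notion of a discounted return concrete, the following is a minimal Python sketch. The function name `discounted_return` and the three-turn reward sequence are illustrative assumptions, not values from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t], accumulated from the last turn backwards."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical 3-turn dialogue with a sparse reward given only at the final turn.
example_return = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
print(example_return)
```

With a sparse final reward, the discount factor simply scales the final reward by how many turns it took to reach it, so shorter successful dialogues receive a higher return.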

Instead of interacting with the MDP to learn a policy, offline RL aims to learn a policy exclusively from previously collected data $\mathcal{D}$ containing state transitions $(s_t, a_t, r_t, s_{t+1})$ under an unknown behavior policy $\pi_\beta$. This set-up is especially useful in cases where deploying the agent in the real environment is too costly, as is the case with real user interaction for dialogue systems. As the agent cannot interact with the environment, the performance of the trained policy also needs to be evaluated based on the data $\mathcal{D}$. The Q-value $Q^\pi(s_t, a_t)$ denotes the expected return when executing $a_t$ in $s_t$ and following policy $\pi$ thereafter. Q-learning algorithms estimate the Q-function $Q$ by iteratively applying the Bellman operator

$$\mathcal{T}^\pi Q(s_t, a_t) = r(s_t, a_t) + \gamma \, \mathbb{E}_{s_{t+1} \sim p(\cdot \mid s_t, a_t),\; a_{t+1} \sim \pi(\cdot \mid s_{t+1})} \left[ Q(s_{t+1}, a_{t+1}) \right]. \tag{1}$$

Value-based RL methods optimize the policy by maximizing the Q-values for every state-action pair $(s_t, a_t)$. With discrete actions, and for a given state $s_t$, the actor can then simply select $a_t = \arg\max_{a} Q(s_t, a)$ in a greedy fashion.
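The repeated Bellman backup on logged data can be sketched in tabular form as below. This is a toy illustration under stated assumptions: the states, actions, rewards, and the function name `evaluate_policy_offline` are hypothetical, and the paper itself uses neural critics rather than a lookup table.

```python
from collections import defaultdict


def evaluate_policy_offline(transitions, policy, gamma=0.9, sweeps=50):
    """Tabular Q evaluation of a fixed policy using only logged transitions.

    transitions: list of (state, action, reward, next_state, done) tuples.
    policy: dict mapping a state to the action the evaluated policy takes there.
    """
    q = defaultdict(float)
    for _ in range(sweeps):
        targets = defaultdict(list)
        for s, a, r, s_next, done in transitions:
            # Bootstrap from the action the evaluated policy would choose next.
            bootstrap = 0.0 if done else q[(s_next, policy[s_next])]
            targets[(s, a)].append(r + gamma * bootstrap)
        # Average the backup targets observed for each state-action pair.
        for key, values in targets.items():
            q[key] = sum(values) / len(values)
    return dict(q)


# Hypothetical logged data: three transitions from a tiny dialogue corpus.
data = [
    ("greet", "ask_area", 0.0, "area_known", False),
    ("area_known", "offer", 1.0, "end", True),
    ("area_known", "repeat", 0.0, "area_known", False),
]
policy = {"greet": "ask_area", "area_known": "offer"}
q_values = evaluate_policy_offline(data, policy)

# Greedy selection over the discrete actions seen in state "area_known".
best_action = max(["offer", "repeat"], key=lambda a: q_values[("area_known", a)])
```

Only state-action pairs that occur in the logged data receive a backed-up value here, which is the tabular counterpart of the limited-coverage issue that motivates constraining the policy in offline RL.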

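Alternatively, with an actor-critic method, an actor $\pi_\phi$ is trained which optimizes its parameters $\phi$ to maximize the expected return of the starting states, for example via the deterministic policy gradient method silver2014deterministic; lillicrap2016continuous:

$$\nabla_\phi J(\phi) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \nabla_a Q(s, a) \big|_{a = \pi_\phi(s)} \, \nabla_\phi \pi_\phi(s) \right]. \tag{2}$$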


The challenge in performing offline RL comes from the fact that $\mathcal{D}$ is static and has limited coverage of $\mathcal{S}$ and $\mathcal{A}$. While an out-of-distribution state is not a problem during training, as the state is always sampled from $\mathcal{D}$, the policy may select an out-of-distribution action that is not contained in $\mathcal{D}$. This tends to lead to arbitrarily high Q-value estimates, which further encourage the policy to take out-of-distribution actions. There are two main methods to counteract this: 1) constraining the policy to stay within the support of the dataset wu2019behavior; jaques2019way; fujimoto2019off; PLAS_corl2020, and 2) modifying the critic to better handle out-of-distribution actions kumar2019stabilizing; kumar2020conservative. In this work, we focus on the former.

3.2 Dialogue Policy in the Latent Action Space

RL can be applied to a dialogue system policy at different levels of abstraction. Semantic actions, i.e., tuples containing intent, slot and values, such as inform(area=centre), are widely used for a compact and well-defined action space GeishauserHLLHF21; tseng2021transferable. Pre-defining the actions and labeling the dialogue data however requires considerable labor. In addition, the final policy needs to be evaluated dependent on an NLG module. On the opposite end, natural language actions view each word of the entire system vocabulary as an action in a sequential decision making process mehri2019structured; jaques2019way. This blows up the action space size and the trajectory length, hindering effective learning and optimal convergence.


Prior work proposed instead an automatically inferred latent space to serve as the action space of the dialogue policy, where a latent action is a real-valued vector containing latent meaning. This decouples action selection from language generation and shortens the dialogue trajectory.

LAVA lubis2020lava followed up this work by proposing the use of variational auto-encoding (VAE) for a latent space that is action-characterized. In both of these works, the latent space is trained via supervised learning (SL) on the response generation task, followed by policy gradient RL using the corpus-based success as the reward signal.


3.3 Offline RL for Policy in the Latent Action Space (PLAS)

A latent action space also lends itself well to offline RL with a policy-constraint technique. PLAS_corl2020 proposed to use a conditional VAE (CVAE) to model the behavior policy $\pi_\beta(a \mid s)$ to reconstruct actions conditioned on states. The benefit of learning in the latent space is that the latent policy has the flexibility of choosing the shape of the distribution via the prior. By constraining the latent policy to output latent actions with high probability under the prior, the decoder will output an action that is likely under the behavior policy in expectation. By choosing a simple prior such as a standard Gaussian distribution, the constraint on the latent policy becomes simple to enforce, for example by defining the latent policy output $z = \pi(s)$ such that $z_i \in [-\sigma, \sigma]$ for each dimension $i$ of the latent space for some hyperparameter $\sigma$.
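One simple way to enforce such a per-dimension bound on the latent action is to squash the raw actor output with a scaled `tanh`. The sketch below assumes this squashing choice; the names `bounded_latent_action` and `max_latent` are illustrative and not taken from the paper.

```python
import math


def bounded_latent_action(raw_output, max_latent):
    """Squash unbounded actor outputs into [-max_latent, max_latent] per dimension."""
    return [max_latent * math.tanh(x) for x in raw_output]


# Hypothetical raw actor outputs, including values far outside the bound.
raw = [-10.0, 0.0, 0.3, 25.0]
z = bounded_latent_action(raw, max_latent=2.0)
```

Keeping every dimension within a few standard deviations of a standard Gaussian prior means the decoder is only asked to decode latent actions it has plausibly seen during training.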


Figure 1: Overview of LAVA_mt, critic network and offline RL with PLAS. First, (a) we pre-train LAVA_mt with a modified shared objective. The state encoder and latent space of the resulting model are used to initialize the actor for PLAS. The critic (b) is an RNN-based model that takes state, action and user goal to estimate the return. (c) PLAS samples transitions from the static dataset and uses them to train actor and critic in an alternating fashion. To compute the target Q-value, target actor and critic networks are used with soft updates to improve stability.

PLAS defines a deterministic policy with continuous latent actions that is optimized using the deterministic policy gradient method silver2014deterministic. Dual critics are used that are optimized with soft clipped double Q-learning. The PLAS algorithm has been applied to real-robot experiments as well as locomotion simulation tasks. In these environments, the latent actions and action space are continuous. This differs considerably from dialogue systems, where the latent action needs to be translated to word-level actions, which are discrete.

4 Offline Critic for Dialogue Policy Evaluation and Optimization

The architecture of our proposed critic is depicted in Figure 1(b). We utilize recurrency to let the critic take dialogue context into account. We encode the word-level user utterance with an RNN and concatenate it with the binary belief state to obtain the state $s_t$. On the other hand, the critic has the flexibility of taking any form of action. With latent actions, the action can be used as input directly by concatenating it with the state. When word-level or semantic actions are considered, a separate encoder can be used before concatenating it with the state.

In addition, to leverage the available data as much as possible, we incorporate the user goal for estimating the return. The MDP then becomes the dynamic parameter MDP (DP-MDP) as described by xie2020deep, where a set of task parameters governs the state dynamics and reward function. It is safe to incorporate the user goal for learning, because the critic is only used for policy evaluation and not needed to run the policy. If the user goal is not given in the data, it can be automatically derived from the dialogue state. To maintain the correctness of the dialogue context, when predicting $Q(s_t, a_t)$, all preceding actions $a_{<t}$ are taken from the corpus. Only $a_t$ is taken from the output of the policy. This is in contrast to the existing corpus-based success rate computation, where all actions are taken from the policy and thus create context mismatches.

To keep the critic pessimistic in the face of uncertainty, we implement a dropout layer and perform $K$ forward passes for each state-action pair; the lowest value is then taken as the final prediction, i.e., $Q(s_t, a_t) = \min_{k = 1, \dots, K} Q_k(s_t, a_t)$. In this way, predictions with high variance, i.e., high uncertainty, are punished by taking the lower bound. This mechanism replaces the use of the double critic in PLAS.
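The pessimistic estimate via repeated dropout passes can be sketched as follows. The toy linear critic, the feature and weight values, and the names `critic_forward` and `pessimistic_q` are assumptions for illustration only; the critic in the paper is an RNN-based network.

```python
import random


def critic_forward(features, weights, dropout_rate, rng):
    """One stochastic forward pass of a toy linear critic with inverted dropout."""
    keep = 1.0 - dropout_rate
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            total += w * x / keep
    return total


def pessimistic_q(features, weights, dropout_rate=0.3, num_passes=10, seed=0):
    """Take the lowest estimate over several dropout passes as a lower bound."""
    rng = random.Random(seed)
    estimates = [
        critic_forward(features, weights, dropout_rate, rng)
        for _ in range(num_passes)
    ]
    return min(estimates), estimates


# Hypothetical state-action features and critic weights.
q_min, samples = pessimistic_q([1.0, 0.5, 2.0], [0.4, 0.2, 0.1])
```

The wider the spread of the sampled estimates, the further the minimum falls below their mean, so state-action pairs the critic is unsure about are scored more conservatively.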

4.1 Offline Critic for Optimization: LAVA+PLAS

We combine LAVA lubis2020lava and PLAS PLAS_corl2020 approaches in order to train a dialogue policy with latent action via offline RL. We use the multi-task LAVA approach, i.e., LAVA_mt, depicted in Figure 1(a), using continuous latent variables modeled via Gaussian distributions, as the normal distribution prior works best with the PLAS approach. In the original LAVA_mt, the model utilizes response generation (RG) and response VAE objectives for optimization with a 10:1 ratio, i.e., the VAE objective is optimized once every 10th RG epoch. In other words, the VAE is only used as an auxiliary task to ground the latent space from time to time. In this work, we modify the model training to preserve both RG and VAE abilities equally, as we will need the VAE to retrieve the latent action from the dataset.


For each training pass, both tasks are performed and the model uses their joint loss to update the parameters of its state encoder, action encoder, and decoder.


While the original LAVA uses policy gradient RL with the corpus-based success rate, in this work we follow the SL with the PLAS algorithm. Parts of the LAVA_mt model are used to initialize the actor and critic networks: the state encoder parameters are used for the actor, the action encoder is used to retrieve the latent action given a word-level response, and the decoder is used to map latent actions produced by the actor into word-level responses. Prior to PLAS training, we warm up the LAVA_mt model with only the VAE objective to further improve the latent action reconstruction capability.


PLAS training is depicted in Figure 1(c). It consists of two interleaved training loops. For each pass, an episode is sampled from the static dataset. In the actor training loop, the actor parameters are optimized using deterministic policy gradient silver2014deterministic to maximize the critic estimate. Due to the deterministic nature of the policy, the actor no longer samples from the distribution, but instead takes the distribution mean as the action. To encourage the policy to stay close to the behavior policy, as an additional loss, we add a mean-squared error (MSE) term between the chosen action and the reconstructed action from the corpus. The actor loss thus combines the negated critic estimate with this MSE term.
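An actor objective of this kind, a negated critic estimate plus an MSE pull towards the corpus action, can be sketched in Python as below. This is not the published equation: the function name `actor_loss`, the `mse_weight` parameter, and the example latent vectors are hypothetical.

```python
def actor_loss(q_estimate, policy_action, corpus_action, mse_weight=1.0):
    """Negated critic estimate plus an MSE term towards the corpus latent action."""
    squared_errors = [(p - c) ** 2 for p, c in zip(policy_action, corpus_action)]
    mse = sum(squared_errors) / len(squared_errors)
    return -q_estimate + mse_weight * mse


# Same critic estimate, but the second action is far from the corpus action.
loss_close = actor_loss(0.8, [0.1, -0.2], [0.1, -0.2])
loss_far = actor_loss(0.8, [1.5, 1.0], [0.1, -0.2])
```

For an equal critic estimate, the action that drifts away from the reconstructed corpus action incurs the larger loss, which is how the MSE term keeps the actor near the behavior policy.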


On the other hand, the critic is trained to minimize the error of the Bellman equation. In addition, we penalize the critic with a weighted KL loss term as a means of regularization when the target actor chooses an action that is far from the behavior policy. The critic loss thus combines the Bellman error with this weighted KL term.
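A critic objective with these two ingredients can be sketched for a single transition as follows. The squared-error form, the scalar `kl_term` input, and the parameter names `gamma` and `kl_weight` are assumptions made for illustration; the exact published loss may differ.

```python
def critic_loss(q_pred, reward, q_target_next, done, kl_term,
                gamma=0.99, kl_weight=0.1):
    """Squared Bellman error plus a weighted KL penalty for one transition."""
    # The bootstrap term comes from the target networks and is dropped at episode end.
    td_target = reward + (0.0 if done else gamma * q_target_next)
    return (q_pred - td_target) ** 2 + kl_weight * kl_term


# Hypothetical values: a perfectly consistent estimate with no KL penalty.
consistent = critic_loss(0.5, 0.0, 1.0, done=False, kl_term=0.0, gamma=0.5)
# Final turn: only the reward counts, and the KL penalty is added on top.
final_turn = critic_loss(1.0, 1.0, 5.0, done=True, kl_term=2.0, kl_weight=0.5)
```

Setting `kl_weight` to zero reduces this to a plain Bellman error, which is the natural choice when no latent action distribution is available to compare against.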


As is common practice, we use the target critic and actor networks for computing the target Q-value. The actor, critic, and their corresponding target networks are initialized the same way, but the target networks are updated with a soft update to promote stability in training.
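A soft update of target networks is commonly implemented as an exponential moving average of the online parameters. The sketch below assumes this standard form; the parameter lists and the rate `tau` are illustrative and not taken from the paper.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau towards its online counterpart."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Hypothetical parameters: the target slowly tracks the online network.
target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(3):
    target = soft_update(target, online, tau=0.5)
```

A small `tau` makes the target Q-value change slowly between updates, which reduces the feedback loop between the critic and its own bootstrapped targets.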

4.2 Offline Critic for Evaluation

In this paper, we utilize the offline RL critic in a new way, as a data- and model-independent evaluator for task-oriented dialogue systems. Following the critic training loop in Figure 1(c), we replace the target actor with the fixed policy, i.e., the one to be evaluated, and perform the critic training loop with Equation 7 as the loss function, setting the weight of the KL term to zero for systems with word-level action.

Note that with this approach, the dataset consisting of dialogues for evaluation can take any form as long as the states and actions are compatible with the dialogue system input and output, allowing comparisons across various types of dialogue systems. For instance, the states can be represented as sequences of utterances or binary vectors, and actions as word-level, latent, semantic, or binary actions. In terms of rewards, those can be sparse, i.e., a fixed intermediate reward at every non-final turn, and, in case the corpus represents the desirable behaviour, a maximum reward can be assumed as the final reward for every dialogue in the corpus. Of course, more accurate reward labels would result in an even more precise evaluator. As a consequence, dialogue systems can be evaluated on static corpora that differ from the training corpus and that are not necessarily generated by interacting with the system.
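Assigning sparse rewards to a corpus without reward annotation can be sketched as below. The reward values (0.0 for intermediate turns, 1.0 as the assumed maximum) and the function names are hypothetical choices for illustration.

```python
def label_rewards(num_turns, final_reward, intermediate_reward=0.0):
    """Sparse labelling: a fixed reward per turn, with the final turn overridden."""
    rewards = [intermediate_reward] * num_turns
    if rewards:
        rewards[-1] = final_reward
    return rewards


def label_corpus(dialogue_lengths, max_reward=1.0):
    """Assume every corpus dialogue shows desirable behaviour and ends successfully."""
    return [label_rewards(n, max_reward) for n in dialogue_lengths]


# Hypothetical corpus with dialogues of 1, 2 and 4 system turns.
corpus_rewards = label_corpus([1, 2, 4])
```

If success labels are available, as they are for some corpora, `final_reward` can be set per dialogue instead of assuming the maximum everywhere.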

A possible use case scenario would be a human-human corpus annotated with states and sparse rewards and a number of different dialogue systems being evaluated on this corpus. This is the case we consider in our evaluation below, whereby we use word-level and latent actions, and thus do not require explicit action labels.

5 Experimental Set-up

5.1 Data

We use MultiWOZ 2.1 budzianowski2018large; eric2019multiwoz to conduct our experiments, one of the most challenging and largest corpora of its kind. MultiWOZ is a collection of conversations between humans in a Wizard-of-Oz fashion, where one person plays the role of a dialogue system and the other one a user. The user is tasked to find entities, e.g., a restaurant or a hotel, that fit certain criteria by interacting with the dialogue system. The corpus simulates a multi-domain task-oriented dialogue system interaction. We use the training, validation and test set partitions provided in the corpus, amounting to 8438 dialogues for training and 1000 each for validation and testing.

5.2 Policy and Critic Training

For the LAVA_mt pre-training, we use simple recurrent models as encoder and decoder and follow the hyperparameters as set in the original work lubis2020lava with a few exceptions, i.e. we use -dimensional continuous latent variables with a normal Gaussian as the prior and we lower the learning rate to . As depicted in Figure 1, parts of the LAVA_mt model are then used by the actor, critic, and different parts of PLAS training. For the critic, we set the hidden size to be

and the linear layer to use the sigmoid activation function. During PLAS, we use a learning rate of

for the critic and for the actor. The critic dropout rate and are set to and , respectively. The policy is trained with a maximum of 10K sampled episodes from the corpus, and the best checkpoint is chosen according to the corpus-based success rate. We set the hyper-parameters of the critic as an offline evaluator the same way, except that it uses 100K sampled episodes for training without early stopping.

5.3 Dialogue Systems

To show the generalization ability of our proposed offline evaluation, we evaluate various dialogue systems that differ in terms of modular abstractions and architectures:

HDSA chen2019semantically is a transformer-based dialogue generation architecture with graph-based dialogue actions using hierarchically-disentangled self-attention (HDSA). The model consists of a predictor, which outputs the dialogue action, and a generator, which subsequently maps it into a dialogue response. Two versions of HDSA are included, one which uses ground-truth actions for generation (gold), and one which uses predicted labels (pred). Note that the ‘pred’ version is the only one that can be deployed in an interactive set-up.

AuGPT kulhanek2021augpt is a fully end-to-end dialogue system with GPT2 radford2019language fine-tuned on multi-task objectives, including belief state prediction, response prediction, belief-response consistency, user intent prediction, and system action prediction. The model is trained on MultiWOZ data augmented with the Taskmaster-1 byrne2019taskmaster and Schema-Guided Dialogue rastogi2020towards datasets.

LAVA lubis2020lava is an RNN-based model using latent actions, optimized via SL and policy gradient RL with corpus-based success rate as reward. We use LAVA_kl as the best performing model reported.

LAVA+PLAS is our proposed variant of LAVA that is trained in an offline RL set-up using the offline critic and the PLAS algorithm (Section 4.1).

5.4 Evaluation Metrics

Offline Critic for Evaluation (Ours)

For each system, we train an offline critic using offline Q-learning as described in Section 4.2. While theoretically the critic can take any form of dialogue action as input, in our experiments we utilize word-level or latent action. We consider intermediate rewards to be and the final reward is for a successful dialogue or for a failed dialogue, as provided in the MultiWOZ corpus. As final estimated value of the policy, we report the average estimated return of all initial states on the test set.

Standard corpus-based metrics

Corpus-based evaluation is conducted on the MultiWOZ test set using delexicalized responses with the benchmarking evaluation script provided by budzianowski2018large. A pseudo-dialogue is generated, where user turns are taken from the corpus and system turns are generated by the evaluated model. Match rate computes whether all informable slots in the user goal are generated, and success rate computes whether all information requested by the user is provided. For completeness, we also report the BLEU score on target responses.

US evaluation

We use the default ConvLab2 zhu2020convlab user simulator with the BERT-based NLU module, rule-based agenda policy and template NLG. We conducted dialogues and report the average number of turns across all dialogues. We focus on three measures: book rate, i.e., how often the system finalized a booking, success rate, i.e., the percentage of dialogues where all information requested by the user is provided by the system and bookings are successfully made, and lastly complete rate, i.e., the number of dialogues that are finished regardless of whether the booked entity matches the user criteria. We also report entity F1 and average number of turns across the simulated dialogues.

With the exception of AuGPT, the systems’ dialogue policies require a dialogue state tracker (DST) for online interactions. For this purpose, we utilize a tracker with a joint goal accuracy of 52.26% on the test set of MultiWOZ 2.1 van2020knowing. This tracker is a recurrent neural model, which utilises attention and transformer-based embeddings to extract important information from the dialogue. We perform lexicalization via handcrafted rules using the information from the dialogue state and database query. For handling incomplete lexicalizations due to empty database queries or a wrongly predicted domain by the policy, we replace the response with a generic “I’m sorry, could you say that again?”. This is equal to masking such actions while neither punishing nor rewarding the policy.

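| Evaluation | Metric | After SL | After offline RL (PLAS) |
|---|---|---|---|
| Corpus | Match | 66.06 | 83.94 |
| Corpus | Success | 51.95 | 67.54 |
| Corpus | BLEU | 0.17 | 0.14 |
| ConvLab US | Compl. | 37.42 | 47.02 |
| ConvLab US | Success | 31.87 | 39.40 |
| ConvLab US | Book | 19.12 | 36.74 |
| ConvLab US | F1 | 49.11 | 57.14 |
| ConvLab US | Turns | 21.57 | 21.99 |

Table 1: Offline RL in latent space improves task-related metrics on both corpus and US evaluations. Results are averaged across 5 seeds.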
| Policy | Corpus Match | Corpus Success | Corpus BLEU | Critic Evaluation |
|---|---|---|---|---|
| MultiWOZ (Human) | 90.40 ± 1.82 | 82.30 ± 2.36 | N/A | 52.68 ± 0.02 |
| AuGPT | 83.30 ± 2.31 | 67.20 ± 2.91 | 0.17 | 52.45 ± 0.02 |
| LAVA+PLAS | 88.30 ± 1.99 | 73.40 ± 2.74 | 0.14 | 51.76 ± 0.03 |
| LAVA_kl | 97.50 ± 1.14 | 94.80 ± 1.47 | 0.12 | 48.95 ± 0.08 |
| HDSA (gold) | 91.80 ± 1.70 | 82.50 ± 2.35 | 0.21 | 49.89 ± 0.08 |
| HDSA (pred) | 88.90 ± 1.95 | 74.50 ± 2.70 | 0.20 | 49.00 ± 0.09 |

Table 2: Corpus-based evaluation metrics. 95% confidence intervals are reported.

| Policy | US Compl. | US Success | US Book | US F1 | US Avg. turn | Human Success | Human Rating |
|---|---|---|---|---|---|---|---|
| AuGPT | 89.20 ± 1.92 | 83.30 ± 2.31 | 85.16 ± 3.34 | 81.03 ± 1.40 | 14.50 ± 0.41 | 90.75 ± 2.85 | 4.34 ± 0.08 |
| LAVA+PLAS | 54.20 ± 3.09 | 45.30 ± 3.09 | 61.18 ± 4.51 | 58.85 ± 2.25 | 23.54 ± 0.89 | 63.00 ± 4.75 | 3.34 ± 0.12 |
| LAVA_kl | 49.20 ± 3.10 | 40.00 ± 3.04 | 63.20 ± 4.37 | 54.47 ± 2.24 | 26.64 ± 1.00 | 63.25 ± 4.74 | 3.44 ± 0.12 |
| HDSA (pred) | 36.70 ± 2.99 | 25.90 ± 2.71 | 6.67 ± 2.37 | 49.97 ± 2.23 | 31.32 ± 0.86 | 55.25 ± 4.89 | 3.09 ± 0.12 |

Table 3: Interactive evaluation metrics (ConvLab US evaluation and human evaluation). 95% confidence intervals are reported.
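| Fleiss’ Kappa | Metric | Human Success | Human Rating |
|---|---|---|---|
| Corpus-based | Corpus Match | -0.623 | -0.571 |
| Corpus-based | Corpus Success | -0.460 | -0.397 |
| Corpus-based | BLEU | 0.343 | 0.299 |
| Corpus-based | Critic | 0.755 | 0.713 |
| Interactive | US Complete | 0.992 | 0.984 |
| Interactive | US Success | 0.991 | 0.984 |
| Interactive | US Book | 0.789 | 0.802 |
| Interactive | US F1 | 0.990 | 0.978 |
| Interactive | US Turn | -0.967 | -0.956 |

Table 4: Correlation between evaluation metrics and human judgements. Absolute values show the strength of the correlation. A negative sign shows inverse correlation.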
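As a sanity check on how such coefficients can be obtained, the sketch below computes a Pearson correlation between the critic scores and the human evaluation results of the four systems that appear in both evaluations. Using Pearson correlation is an assumption here (the table header names a different statistic); under that assumption the critic row of Table 4 is reproduced to within rounding.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Order: AuGPT, LAVA+PLAS, LAVA_kl, HDSA (pred); values from Tables 2 and 3.
critic = [52.45, 51.76, 48.95, 49.00]
human_success = [90.75, 63.00, 63.25, 55.25]
human_rating = [4.34, 3.34, 3.44, 3.09]

r_success = pearson(critic, human_success)
r_rating = pearson(critic, human_rating)
```

With only four systems, each coefficient rests on four data points, so the values indicate a trend rather than a precise estimate.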

Human evaluation

Human evaluation is performed via DialCrowd lee2018dialcrowd connected to Amazon Mechanical Turk. The systems are set up identically as in the US evaluation, except that the systems are interacting with paid users instead of a US. Users are provided with a randomly generated user goal and are required to interact with our systems in natural language and to subsequently evaluate them. We ask the user whether their goal is fulfilled through the dialogue, indicating the success rate. We also ask them to rate the overall system performance on a Likert scale from 1 (worst) to 5 (best). For each system we collected 400 dialogues with human workers.
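To give a feel for the width of a 95% confidence interval on a success rate collected from 400 dialogues, the following sketch uses the normal approximation for a proportion. The paper does not state how its intervals were computed, so this formula is an assumption; it lands close to, but not exactly on, the half-widths reported for the human success rates in Table 3.

```python
import math


def ci_half_width(success_rate_percent, num_dialogues, z=1.96):
    """Half-width (in percentage points) of a normal-approximation interval."""
    p = success_rate_percent / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / num_dialogues)


# Human success rates of AuGPT and LAVA+PLAS with 400 dialogues each.
half_width_augpt = ci_half_width(90.75, 400)
half_width_lava_plas = ci_half_width(63.00, 400)
```

Because the half-width shrinks only with the square root of the number of dialogues, halving the interval requires roughly four times as many paid interactions.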

6 Results and Analysis

6.1 Offline Critic for Optimization

Table 1 shows the policy performance after shared multi-task SL training and the performance after subsequent offline RL training with PLAS, averaged over 5 seeds. We observe that offline RL in latent space with the critic estimate as reward signal improves task-related metrics on both corpus and US evaluation. The consistent improvement on offline and interactive evaluations is the result of critic’s value estimate as reward signal, which we believe is noteworthy as the policy is never explicitly trained on either metric.

Like policy gradient RL used by LAVA (Equation 3), PLAS leads to a decrease in BLEU score. This is quite common for end-to-end policies trained with RL following SL lubis2020lava, however the decrease with PLAS is not as drastic. This signals that the policy retains more linguistic variety in the responses, since the reward signal does not overlook context mismatch and thus responses that are out of context are not rewarded. We include a dialogue example in Appendix A to demonstrate the context mismatch issue and how the offline critic addresses it.

6.2 Offline Critic for Evaluation

System performances across metrics

Tables 2 and 3 present the corpus- and interaction-based evaluation results of LAVA+PLAS and our baselines. For completeness, we included the human policy, i.e., the behavior policy of the dataset, on the corpus-based evaluation. For LAVA+PLAS, we pick the best model out of the 5 seeds. For the baseline models, we utilize the released pre-trained parameters and re-run all evaluations.

The ranking of the systems differs depending on the evaluation metrics. With corpus-based success and match rates, LAVA_kl far outperforms the other models and even human wizards. This is expected, as LAVA_kl is directly optimized with the corpus-based success rate as reward. In terms of BLEU, HDSA, which is designed for generation with semantic actions, achieves the first rank. With critic evaluation, the human policy achieves the highest score. The rankings for evaluation with user simulator and paid workers in Table 3 are consistent, showing another trend entirely. AuGPT outperforms the other systems by a large margin, LAVA+PLAS and LAVA_kl show a narrower gap in performance compared to corpus-based metrics, while HDSA performs very poorly. The collected dialogues show that the language understanding and generation of AuGPT are superior to the other models, as it leverages a large pre-trained model as a base model and utilizes multiple dialogue corpora for fine-tuning. In other words, it is trained on orders of magnitude more data compared to the other systems. This results in a more natural interaction with both simulated and human users.

It is interesting to note that the critic has a much narrower confidence interval than the other metrics. Although the values for some policies are seemingly close, the intervals show that the differences between most pairs of systems are statistically significant, the exception being LAVA_kl and HDSA (gold).
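The text does not state how the confidence intervals were obtained. As a minimal sketch, assuming a percentile bootstrap over per-dialogue critic scores (the function names, the resampling scheme and the overlap check are illustrative assumptions, not the authors' procedure), intervals for two systems could be computed and compared as follows:

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    `scores` is a list of per-dialogue metric values (hypothetical input).
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample the dialogues with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def intervals_overlap(ci_a, ci_b):
    """True if two (lower, upper) intervals share at least one point."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]
```

Non-overlapping intervals are a conservative indicator of a significant difference; overlapping intervals do not by themselves show that two systems are equivalent, so a paired test would be needed to settle borderline cases.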

Correlation with human judgements

Table 4 lists the pairwise correlations between human judgements and the automatic metrics. We differentiate between corpus-based metrics, such as the standard match and success rates, BLEU and critic evaluation, and interactive metrics that require some form of user, either simulated or paid. The success rates of current standard evaluations have a moderate inverse correlation with human judgements due to the context mismatch that occurs during their computation. On the other hand, the theoretically grounded value estimation by the offline critic has a strong correlation with human judgements, showing that our proposed method is a more suitable corpus-based metric for reflecting dialogue system performance. Our study confirms the weak correlation between BLEU and human ratings. All metrics computed from interaction with the US are strongly correlated with metrics from human evaluation. The number of turns is strongly but inversely correlated, which aligns with the intuition that the fewer turns the system needs to complete the dialogue, the better it is perceived by human users. This suggests that while the existing US is far from fully imitating human behavior, it provides a good approximation of how the systems will perform when interacting with human users. We advocate that future work report multiple evaluation metrics to provide a more complete picture of dialogue system performance.
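The passage does not name the correlation coefficient behind Table 4. As a rough sketch, assuming Pearson's r computed over per-system scores (the function names and the dictionary layout are hypothetical, and no values from the paper are reproduced here), the correlation of each automatic metric with a human rating could be computed like this:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlate_with_human(metric_scores, human_scores):
    """Correlate every automatic metric with the human rating.

    `metric_scores` maps a metric name to a list with one score per system;
    `human_scores` lists the human rating of the same systems in the same order.
    """
    return {name: pearson(vals, human_scores) for name, vals in metric_scores.items()}
```

With only a handful of systems, each coefficient rests on very few points, which is one more reason to read such correlations alongside the confidence intervals rather than in isolation.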

Note that while US evaluation provides stronger correlations with human judgements, our proposed use of an offline RL critic for evaluation has the benefit of being corpus- and model-independent, whereas a new corpus and ontology would require a new US to be designed and developed. Furthermore, an offline evaluation takes significantly less time to perform, making it an efficient choice for the iterative development process.

6.3 Impact of Reward Signal on RL

LAVA+PLAS and LAVA_kl are the only two systems optimized via RL. We observe that each significantly outperforms the other on the metric it received as reward signal during RL. However, when subjected to interactive evaluation, the gap between their performance shrinks (see Table 3). This shows, on the one hand, the power of reinforcement learning methods to optimize the given reward and, on the other hand, how important it is to define this reward correctly, warranting further research in both extrinsic and intrinsic reward modelling for dialogue wesselmann; GeishauserHLLHF21.

7 Conclusion

We propose the use of offline RL for dialogue evaluation based on a static corpus. While offline RL critics are typically utilized for policy optimization, we show that they can be trained for any dialogue system as external evaluators that are corpus- and model-independent, while attaining strong correlation with human judgements, which we confirm via an interactive user trial. Not only does the offline RL critic provide a corpus-based metric that is reliable and efficient to compute, it also addresses a number of issues highlighted in the recently published NSF report mehri2022report. It is important to note that the proposed framework does not depend on the definition of states, actions and rewards. In principle, one could therefore apply this method beyond task-oriented dialogue systems. For example, one could evaluate a number of chat-bots using a corpus annotated only with the level of engagement achieved in each dialogue, and thus measure the level of engagement of the evaluated chat-bots.


Acknowledgements

N. Lubis, C. van Niekerk, M. Heck and S. Feng are supported by funding provided by the Alexander von Humboldt Foundation in the framework of the Sofja Kovalevskaja Award endowed by the Federal Ministry of Education and Research. C. Geishauser and H-C. Lin are supported by funds from the European Research Council (ERC) provided under the Horizon 2020 research and innovation programme (Grant agreement No. STG2018 804636). Google Cloud and HHU ZIM provided computational infrastructure.


Appendix A Example Dialogue

Table 5 shows an example dialogue accompanied by the corresponding offline critic Q estimation and corpus-based success rate.

| Speaker | Utterance | Q | Corpus succ. |
| User | can you help me find a [value_pricerange] place to stay in the [value_area] part of town? | | |
| True | sure. there are [value_count] guesthouses there. i would be happy to book [value_count] for you if you like. | | |
| Pred | the [hotel_name] is located at [hotel_address] , postcode [hotel_postcode]. the phone number is [hotel_phone]. | 0.4730 | |
| User | does it have a star rating of [value_count]? | | |
| True | no. they all have [value_count] stars. | | |
| Pred | the [hotel_name] is a guesthouse in the [value_area] area. the phone number is [hotel_phone] and the postcode is [hotel_postcode]. | 0.4569 | |
| User | then find me [value_count] in the [value_pricerange] price range. | | |
| True | all of the accommodations in the [value_area] are in the [value_pricerange] price range. would you like to try a different part of town? | | |
| Pred | the [hotel_name] is located at [hotel_address] , postcode [hotel_postcode]. the phone number is [hotel_phone]. | 0.4478 | |
| User | find me a nice [value_count] and book for [value_count] people and [value_count] nights from [value_day] | | |
| True | actually , the [hotel_name] is a [value_count] start hotel in the [value_area] area. it s in the [value_pricerange] price range. how does that sound? | | |
| Pred | i have booked your room at the [hotel_name] for [value_count] nights starting [value_day]. your reference number is [hotel_reference]. is there anything else i can help you with? | 0.4461 | |
| User | that sounds perfect. can you reserve that and provide me with the reference number? | | |
| True | okay , you are all set. the hotel is booked and the reference number is [hotel_reference]. can i help you with anything else today? | | |
| Pred | the reference number is [hotel_reference]. | 0.4205 | |
| User | thank you. i am also looking for a restaurant that serves [value_food] and that is in the same area as the hotel | | |
| True | [restaurant_name] is an [value_pricerange] restaurant in the same area as the hotel , would you like me to make reservations for you? | | |
| Pred | [restaurant_name] is located at [restaurant_address] , [restaurant_postcode]. the phone number is [restaurant_phone]. | 0.4167 | |
| User | yes. same group of people , at [value_time] on the same day please. | | |
| True | booking was successful. the table will be reserved for [value_count] minutes. reference number is [restaurant_reference]. what else can i help you with today? | | |
| Pred | i have booked you a table at [restaurant_name]. the reference number is [restaurant_reference]. can i help you with anything else? | 0.4095 | |
| User | that was all i needed. thanks. bye. | | |
| True | ok! have a great day! | | |
| Pred | you are welcome. have a great day! | 0.4161 | 1.0 |
Table 5: Example dialogue comparing the offline critic estimate and corpus-based success. "True" denotes responses taken from the corpus and "Pred" denotes responses from the policy; in this case we use LAVA_kl, with which context mismatch often occurs. Note that the Q prediction takes the "User" and "True" utterances from the beginning up to the previous turn, and the "User" and "Pred" utterances of the current turn. Corpus-based success, on the other hand, takes the "User" and "Pred" utterances of all turns. Predicted responses in italic highlight the context mismatch that can occur when a pseudo-dialogue is constructed for dialogue success computation. This mismatch is however ignored and the dialogue is considered successful, since all necessary requestable slots are generated by the system. In contrast, the Q-estimate shows a decrease in value, and the policy is given a lower reward signal for the same dialogue.
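The difference between the two inputs described in the caption can be made concrete with a small sketch. It only illustrates which utterances each metric sees. The `Turn` type and both function names are hypothetical, and the actual critic presumably operates on encoded dialogue states and actions rather than on raw (speaker, utterance) pairs:

```python
from typing import List, NamedTuple, Tuple


class Turn(NamedTuple):
    user: str  # user utterance
    true: str  # corpus response ("True" in Table 5)
    pred: str  # policy response ("Pred" in Table 5)


def critic_context(turns: List[Turn], t: int) -> List[Tuple[str, str]]:
    """Input for the Q estimate at turn t (0-indexed).

    Corpus history ("User", "True") up to turn t-1, followed by the
    "User" and "Pred" utterances of turn t.
    """
    context = []
    for turn in turns[:t]:
        context += [("User", turn.user), ("True", turn.true)]
    context += [("User", turns[t].user), ("Pred", turns[t].pred)]
    return context


def success_pseudo_dialogue(turns: List[Turn]) -> List[Tuple[str, str]]:
    """Pseudo-dialogue for corpus-based success: "User" and "Pred" of all turns."""
    dialogue = []
    for turn in turns:
        dialogue += [("User", turn.user), ("Pred", turn.pred)]
    return dialogue
```

Because `critic_context` conditions each predicted response on the true corpus history, a response that does not fit that history can be scored lower, whereas `success_pseudo_dialogue` pairs corpus user turns with predicted responses they were never written in reply to, which is where the context mismatch arises.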