With the rise of personal assistants, task-oriented dialogue systems have seen a surge in popularity and acceptance. Task-oriented dialogue systems are characterized by a user goal which motivates the interaction, e.g., booking a hotel, searching for a restaurant or calling a taxi. The dialogue agent is considered successful if it is able to fulfill the user goal by the end of the interaction.
Ideally, success rates are obtained via interaction with real users in the wild. Unfortunately, with a handful of exceptions, e.g., LetsGO lee2018dialcrowd and Alexa Challenge gabriel2020further, that is often out of reach. The closest approximation is human trials with paid users such as Amazon Mechanical Turk workers, which have also been adopted as the final evaluation in recent incarnations of the Dialogue State Tracking Challenge (DSTC) gunasekara2020overview. However, such evaluations are highly time- and cost-intensive, making them impractical for optimization during iterative development. The third alternative is to use a user simulator to conduct online dialogue simulation; however, the result is subject to the quality of the user simulator itself. Furthermore, developing such simulators is far from straightforward and requires significant amounts of handcrafting schatzmann08a. Only recently have we seen data-driven user simulators that can compete with hand-coded ones lin-etal-2021-domain.
While there has been considerable progress towards more meaningful automatic evaluation metrics for dialogues, there remain a number of limitations as highlighted by the recent NSF report mehri2022report: the metrics 1) measure only a limited set of dialogue qualities, which mostly focus on subjective aspects such as fluency and coherence, 2) lack generalization across datasets and models, and 3) are not yet strongly correlated with human judgements. These limitations hinder a more widespread use of newly proposed metrics for benchmarking and comparison, especially with prior works. Further, in particular for task-oriented dialogue systems, the need for reliable automatic evaluation of dialogue success is still unanswered.
Being able to automatically evaluate the success rate of any policy using static data offers a number of benefits in terms of required resources, generalizability, and reproducibility. Furthermore, it is not only suitable for the final evaluation of a dialogue policy, but can also be utilized as an objective for iterative optimization. The corpus-based success rate is one such method, which has become the standard metric for state-of-the-art comparisons of policy optimization approaches today budzianowski2018large. Unfortunately, this metric is computed based on pseudo-dialogues that may contain context mismatch. Therefore, we believe it should be treated more as an approximation: it is insufficient at best, and misleading at worst, in reflecting real performance of dialogue systems. In addition, the rules used to check the goal completion need to be handcrafted based on the ontology, making this method data- or ontology-dependent.
In this paper, we propose to use offline reinforcement learning (RL) to train a policy evaluator, also known as a critic, based on a static collection of dialogue data (code available at https://gitlab.cs.uni-duesseldorf.de/general/dsml/lava-plas-public). We show that an offline critic addresses the limitations of current automatic metrics: 1) it can be trained to evaluate any dialogue system architecture after-the-fact, allowing comparisons across various types of systems from prior works, 2) it can be utilized in the iterative development phase to optimize a dialogue policy, 3) it is theoretically grounded, solving the problems that standard corpus-based success rate has due to context mismatch, and 4) it strongly correlates with the performance of the system when interacting with human users, which we confirm via a user trial.
2 Related Work
For a long time, the research in dialogue policy has focused on user-centered criteria such as user satisfaction walker1997paradise; lees12; ultes2017domain. The most reliable way to obtain these scores is to have users interact directly with the system and let them subjectively rate the system afterwards. Due to the time and resource requirements to carry out such evaluations, human trials are usually done only as the final evaluation after the system development is finished.
As the line between policy and natural language generation (NLG) tasks becomes blurred, we see the introduction of metrics such as BLEU papineni2002bleu and perplexity. However, these were flagged early on as potentially misleading, as they correlate poorly with human judgement stent2005evaluating; liu2016not. This circumstance motivates automatic metrics that are highly correlated with human ratings dziri-etal-2019-evaluating; mehri2020unsupervised; mehri2020usr. However, these metrics are designed to measure the subjective quality of a dialogue response, making them more suitable for evaluating chat-based systems.
Despite the availability of toolkits that facilitate user simulation (US) evaluation zhu2020convlab, corpus-based match and success rates are the default benchmark for works in task-oriented dialogue systems today budzianowski2018large; nekvinda2021shades. These metrics are practical to compute, reproducible, and scalable. Current standard corpus-based metrics are computed on a pseudo-dialogue constructed using user utterances from data and responses generated by the system. A set of rules then checks whether the system provides all information requested by the user. Unfortunately, these metrics do not take into account context mismatches that may originate from the pseudo-dialogue construction, and therefore do not reflect other aspects of dialogue quality, as the resulting dialogue flow is completely overlooked.
There have been few applications of offline RL to dialogue systems. jaques2019way explored various language-based criteria, e.g., sentiment and semantic similarity, as reward signals for open-domain dialogue, paired with a Kullback-Leibler (KL) control for exploration within the support of the data. verma2022chai proposed using fine-tuned language models to utilize unlabeled data for learning the critic function. The method is, however, only demonstrated on a very small state and action space, and it is therefore unclear whether it generalizes to more complex set-ups. ramachandran2021causal applied offline RL with a pair-wise reward learning model based on preference learning; however, it still utilizes the corpus-based success rate for choosing the preferred rollout. To the best of our knowledge, offline RL has not previously been deployed for dialogue evaluation.
3.1 Offline RL
Dialogue can be formulated as a reinforcement learning problem with a Markov decision process (MDP). In this MDP, $\mathcal{S}$, $\mathcal{A}$, and $r$ denote the state and action spaces, and the reward function, respectively. $p(s_{t+1} \mid s_t, a_t)$ denotes the probability of transitioning to state $s_{t+1}$ from $s_t$ after executing $a_t$, and $p_0(s_0)$ is the probability of starting in state $s_0$. $\gamma \in [0, 1]$ is the discount factor that weighs the importance of immediate and future rewards. At each time step $t$, the agent observes a state $s_t$, executes its policy $\pi$ by selecting an action $a_t$ according to $\pi(a_t \mid s_t)$, transitions to a new state $s_{t+1}$ and receives a reward $r_t$. The goal of the policy is to maximize the cumulative discounted rewards, i.e., the return $R = \sum_{t=0}^{T} \gamma^t r_t$.
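As a small illustration of the return defined above, the following sketch accumulates discounted rewards for a single episode. The function name and the toy reward sequence are illustrative assumptions and are not taken from this work.

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma**t * rewards[t] for one episode.

    The sum is accumulated from the last turn backwards, so each step
    only needs one multiplication and one addition.
    """
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Toy episode (hypothetical): no intermediate reward, reward 1 at the end.
example_return = discounted_return([0.0, 0.0, 1.0], gamma=0.5)
```

With a sparse final reward, the return of the first turn shrinks as the dialogue gets longer, which is how the discount factor trades off immediate and future rewards.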
Instead of interacting with the MDP to learn a policy, offline RL aims to learn a policy exclusively from previously collected data $\mathcal{D}$ containing state transitions $(s_t, a_t, r_t, s_{t+1})$ under an unknown behavior policy $\pi_\beta$. This set-up is especially useful in cases where deploying the agent in the real environment is too costly, as is the case with real user interaction for dialogue systems. As the agent cannot interact with the environment, the performance of the trained policy also needs to be evaluated based on the data $\mathcal{D}$. The Q-value $Q^\pi(s_t, a_t)$ denotes the expected return when executing $a_t$ in $s_t$ and following policy $\pi$ thereafter. Q-learning algorithms estimate the Q-function by iteratively applying the Bellman operator

$$\mathcal{B}^{*} Q(s_t, a_t) = r_t + \gamma \, \mathbb{E}_{s_{t+1}}\left[\max_{a_{t+1}} Q(s_{t+1}, a_{t+1})\right]. \quad (1)$$
Value-based RL methods optimize the policy by maximizing the Q-values for every state-action pair $(s_t, a_t)$. With discrete actions, and for a given state $s_t$, the actor can then simply select $a_t = \arg\max_{a} Q(s_t, a)$ in a greedy fashion.
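The following toy sketch shows what repeatedly applying a Bellman backup to a fixed set of transitions and then acting greedily can look like in the tabular case. It is only a minimal illustration: the state and action names, the transition list, and all hyperparameter values are hypothetical, and the critic proposed in this work is a neural network rather than a table.

```python
from collections import defaultdict


def offline_q_learning(transitions, actions, gamma=0.9, lr=0.5, sweeps=200):
    """Tabular Q-learning over a static list of transitions.

    Each transition is (state, action, reward, next_state, done).
    No new data is collected; the same transitions are replayed.
    """
    q = defaultdict(float)
    for _ in range(sweeps):
        for state, action, reward, next_state, done in transitions:
            if done:
                best_next = 0.0
            else:
                best_next = max(q[(next_state, a)] for a in actions)
            target = reward + gamma * best_next
            q[(state, action)] += lr * (target - q[(state, action)])
    return q


def greedy_action(q, state, actions):
    """Select the action with the highest Q-value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical two-state booking task used only for illustration.
ACTIONS = ["request", "book"]
TRANSITIONS = [
    ("start", "request", 0.0, "informed", False),
    ("start", "book", 0.0, "end", True),
    ("informed", "book", 1.0, "end", True),
    ("informed", "request", 0.0, "informed", False),
]
q_table = offline_q_learning(TRANSITIONS, ACTIONS)
```

State-action pairs that never occur in the transition list keep their initial value here. With function approximation their values are unconstrained instead, which is the source of the out-of-distribution problem discussed next.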
Alternatively, with an actor-critic method, an actor $\pi_\theta$ is trained which optimizes its parameters $\theta$ to maximize the expected return of the starting states, for example via the deterministic policy gradient method silver2014deterministic; lillicrap2016continuous:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s \sim \mathcal{D}}\left[\nabla_a Q(s, a)\big|_{a = \pi_\theta(s)} \, \nabla_\theta \pi_\theta(s)\right]. \quad (2)$$

The challenge in performing offline RL comes from the fact that $\mathcal{D}$ is static and has limited coverage of $\mathcal{S}$ and $\mathcal{A}$. While an out-of-distribution state is not a problem during training, as the state is always sampled from $\mathcal{D}$, the policy may select an out-of-distribution action that is not contained in $\mathcal{D}$. This tends to lead to arbitrarily high Q-value estimates, which further encourage the policy to take out-of-distribution actions. There are two main methods to counteract this: 1) constraining the policy to stay within the support of the dataset wu2019behavior; jaques2019way; fujimoto2019off; PLAS_corl2020, and 2) modifying the critic to better handle out-of-distribution actions kumar2019stabilizing; kumar2020conservative. In this work, we focus on the former.
3.2 Dialogue Policy in the Latent Action Space
RL can be applied to a dialogue system policy at different levels of abstraction. Semantic actions, i.e., tuples containing intent, slot and values, such as inform(area=centre), are widely used for a compact and well-defined action space GeishauserHLLHF21; tseng2021transferable. Pre-defining the actions and labeling the dialogue data however requires considerable labor. In addition, the final policy needs to be evaluated dependent on an NLG module. On the opposite end, natural language actions view each word of the entire system vocabulary as an action in a sequential decision making process mehri2019structured; jaques2019way. This blows up the action space size and the trajectory length, hindering effective learning and optimal convergence.
Prior work proposed instead an automatically inferred latent space to serve as the action space of the dialogue policy, where a latent action is a real-valued vector containing latent meaning. This decouples action selection from language generation and shortens the dialogue trajectory. lubis2020lava followed up this work by proposing the use of variational auto-encoding (VAE) for a latent space that is action-characterized. In both of these works, the latent space is trained via supervised learning (SL) on the response generation task, followed by policy gradient RL using the corpus-based success as the reward signal, i.e.,

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} R_t \, \nabla_\theta \log \pi_\theta(z_t \mid s_t)\right], \quad (3)$$

where $z_t$ is the latent action at turn $t$ and $R_t$ is the discounted return from turn $t$ onwards.
3.3 Offline RL for Policy in the Latent Action Space (PLAS)
A latent action space also lends itself well to offline RL with a policy-constraint technique. PLAS_corl2020 proposed to use a conditional VAE (CVAE) to model the behavior policy $\pi_\beta(a \mid s)$ to reconstruct actions conditioned on states. The benefit of learning in the latent space is that the latent policy has the flexibility of choosing the shape of the distribution via the prior. By constraining the latent policy to output latent actions with high probability under the prior, the decoder will output an action that is likely under the behavior policy in expectation. By choosing a simple prior such as a standard Gaussian distribution, the constraint on the latent policy becomes simple to enforce, for example by defining the latent policy such that $z_i \in [-\sigma, \sigma]$ for each dimension $i$ of the latent space for some hyperparameter $\sigma$.
PLAS defines a deterministic policy with continuous latent actions that is optimized using the deterministic policy gradient method silver2014deterministic. Dual critics are used that are optimized with soft clipped double Q-learning. The PLAS algorithm has been applied to real-robot experiments as well as simulated locomotion tasks. In these environments, both the latent actions and the action space are continuous. This differs considerably from dialogue systems, where the latent action needs to be translated to word-level actions, which are discrete.
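One simple way to enforce a per-dimension bound on latent actions is to squash the raw policy output with a scaled tanh. The sketch below assumes this particular squashing and uses hypothetical names (`constrain_latent`, `sigma`). It is meant only to illustrate the kind of constraint described above, not the exact mechanism used in PLAS or in this work.

```python
import math


def constrain_latent(raw_action, sigma):
    """Map each raw dimension into the interval [-sigma, sigma].

    tanh is smooth and monotone, so gradients can still flow through
    the constraint during policy optimization.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return [sigma * math.tanh(x) for x in raw_action]


# Hypothetical raw actor output with some very large components.
bounded = constrain_latent([-100.0, 0.0, 0.3, 50.0], sigma=2.0)
```

Under a standard Gaussian prior, a small `sigma` keeps latent actions in the high-density region of the prior, so the decoder is more likely to produce responses that resemble the data.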
4 Offline Critic for Dialogue Policy Evaluation and Optimization
The architecture of our proposed critic is depicted in Figure 1(b). We utilize recurrence to let the critic take dialogue context into account. We encode the word-level user utterance with an RNN and concatenate it with the binary belief state to obtain the state $s_t$. The critic has the flexibility of taking any form of action. With latent actions, the action can be used as input directly by concatenating it with the state. When word-level or semantic actions are considered, a separate encoder can be used before concatenating the encoded action with the state.
In addition, to leverage the available data as much as possible, we incorporate the user goal for estimating the return. The MDP then becomes the dynamic parameter MDP (DP-MDP) as described by xie2020deep, where a set of task parameters, here the user goal, governs the state dynamics and reward function. It is safe to incorporate the user goal for learning, because the critic is only used for policy evaluation and is not needed to run the policy. If the user goal is not given in the data, it can be automatically derived from the dialogue state. To maintain the correctness of the dialogue context, when predicting $Q(s_t, a_t)$, all previous actions $a_{<t}$ are taken from the corpus. Only $a_t$ is taken from the output of the policy. This is in contrast to the existing corpus-based success rate computation, where all actions are taken from the policy and thus create context mismatches.
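The difference between the two ways of building dialogue context can be sketched as follows. The data layout (a list of `(user, system)` turn pairs), the function names, and the toy policy are hypothetical and serve only to contrast the two constructions.

```python
def critic_input(corpus_turns, policy, t):
    """Build the input for scoring turn t with the critic.

    All earlier system turns come from the corpus, so the context stays
    consistent. Only the action at turn t comes from the policy.
    """
    history = []
    for user, system in corpus_turns[:t]:
        history.append(("user", user))
        history.append(("system", system))
    history.append(("user", corpus_turns[t][0]))
    return history, policy(history)


def pseudo_dialogue(corpus_turns, policy):
    """Build the pseudo-dialogue used by corpus-based success rate.

    Every system turn comes from the policy, while user turns are still
    the ones recorded in the corpus, which can create context mismatch.
    """
    history = []
    for user, _ in corpus_turns:
        history.append(("user", user))
        history.append(("system", policy(history)))
    return history


def toy_policy(history):
    """Hypothetical policy that only reports how much context it saw."""
    return "reply-after-%d-utterances" % len(history)


TOY_CORPUS = [("u0", "s0"), ("u1", "s1"), ("u2", "s2")]
context, action = critic_input(TOY_CORPUS, toy_policy, t=2)
mismatched = pseudo_dialogue(TOY_CORPUS, toy_policy)
```

In the first construction, the user utterance at turn `t` always follows the system responses it was originally written in reaction to. In the second, it may follow policy responses the user never saw.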
To keep the critic pessimistic in the face of uncertainty, we implement a dropout layer and perform multiple stochastic forward passes for each state-action pair. The lowest value is then taken as the final prediction, i.e., $Q(s_t, a_t) = \min_{k} Q_k(s_t, a_t)$, where $Q_k$ is the estimate from the $k$-th pass. In this way, predictions with high variance, i.e., high uncertainty, are punished by taking the lower bound. This mechanism replaces the use of double critics in PLAS.
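A minimal sketch of this pessimistic estimate is given below, using a hypothetical linear critic with inverted dropout on its input features. The critic form, the dropout rate, and the number of passes are illustrative assumptions. The critic in this work is a recurrent neural network.

```python
import random


def dropout_forward(features, weights, rate, rng):
    """One stochastic forward pass of a linear critic with dropout.

    Each feature is kept with probability (1 - rate) and rescaled so
    that the expected output matches the pass without dropout.
    """
    keep = 1.0 - rate
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            total += (x / keep) * w
    return total


def pessimistic_q(features, weights, rate, passes, rng):
    """Return the lowest estimate over several dropout passes."""
    samples = [dropout_forward(features, weights, rate, rng) for _ in range(passes)]
    return min(samples)


# Hypothetical state-action features and critic weights.
FEATURES = [0.2, 0.5, 0.1, 0.9]
WEIGHTS = [1.0, -0.5, 2.0, 0.3]
q_low = pessimistic_q(FEATURES, WEIGHTS, rate=0.3, passes=10, rng=random.Random(0))
```

The more the individual passes disagree, the further the minimum falls below the average estimate, so uncertain state-action pairs receive lower values.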
4.1 Offline Critic for Optimization: LAVA+PLAS
We combine the LAVA lubis2020lava and PLAS PLAS_corl2020 approaches in order to train a dialogue policy with latent actions via offline RL. We use the multi-task LAVA approach, i.e., LAVA_mt, depicted in Figure 1(a), using continuous latent variables modeled via Gaussian distributions, as the normal distribution prior works best with the PLAS approach. In the original LAVA_mt, the model utilizes response generation (RG) and response VAE objectives for optimization with a 10:1 ratio, i.e., the VAE objective is optimized once every 10th RG epoch. In other words, the VAE is only used as an auxiliary task to ground the latent space from time to time. In this work, we modify the model training to preserve both RG and VAE abilities equally, as we will need the VAE to retrieve the latent action from the dataset.
For each training pass, both tasks are performed and the model uses their joint loss to update the parameters of the state encoder, the action encoder, and the decoder.
While the original LAVA uses policy gradient RL with the corpus-based success rate, in this work we follow SL with the PLAS algorithm. Parts of the LAVA_mt model are used to initialize the actor and critic networks: the state encoder parameters are used for the actor, the action encoder is used to retrieve the latent action given a word-level response, and the decoder is used to map latent actions produced by the actor into word-level responses. Prior to PLAS training, we warm up the LAVA_mt model with only the VAE objective to further improve the latent action reconstruction capability.
PLAS training is depicted in Figure 1(c). It consists of two interleaved training loops. For each pass, an episode is sampled from the static dataset $\mathcal{D}$. In the actor training loop, the actor parameters are optimized using deterministic policy gradient silver2014deterministic to maximize the critic estimate. Due to the deterministic nature of the policy, the actor no longer samples from the distribution, but instead takes the distribution mean as the action. To encourage the policy to stay close to the behavior policy, we add a mean-squared error (MSE) term between the chosen action and the reconstructed action from the corpus as an additional loss. The actor loss thus combines the negated critic estimate with this MSE term.
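As a rough illustration of the two terms described above, the sketch below combines a negated critic estimate with an MSE penalty towards the latent action reconstructed from the corpus. The function name, the equal default weighting (`mse_weight=1.0`), and the scalar formulation are assumptions made for illustration. This is not the exact loss used in this work.

```python
def actor_loss(q_value, chosen_action, reconstructed_action, mse_weight=1.0):
    """Sketch of an actor objective with a behavior-cloning penalty.

    Minimizing the first term maximizes the critic estimate. The second
    term keeps the chosen latent action close to the latent action that
    reconstructs the corpus response.
    """
    if len(chosen_action) != len(reconstructed_action):
        raise ValueError("actions must have the same dimensionality")
    squared = [(c - r) ** 2 for c, r in zip(chosen_action, reconstructed_action)]
    mse = sum(squared) / len(squared)
    return -q_value + mse_weight * mse


# Hypothetical values: the second action is further from the corpus action.
loss_close = actor_loss(0.5, [1.0, 2.0], [1.0, 2.0])
loss_far = actor_loss(0.5, [1.0, 3.0], [1.0, 1.0])
```

For the same critic estimate, an action that drifts away from the corpus action yields a higher loss, which discourages out-of-distribution latent actions.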
The critic, in turn, is trained to minimize the error of the Bellman equation. In addition, as a means of regularization, we penalize the critic with a weighted KL loss term when the target actor chooses an action that is far from the behavior policy. The critic loss is the sum of the Bellman error and this weighted KL term.
As is common practice, we use the target critic and actor networks for computing the target Q-value. The actor, critic, and their corresponding target networks are initialized the same way, but the target networks are updated with a soft update to promote stability in training.
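The target computation and the soft update can be sketched as follows. Parameters are represented as flat lists of floats, and the interpolation factor `tau` and all numeric values are hypothetical. The sketch shows the commonly used Polyak-style update, which is an assumption about the form of soft update meant here.

```python
def td_target(reward, next_q, gamma, done):
    """Bellman target built from the target networks' estimate next_q.

    For the last turn of a dialogue there is no next state, so the
    target is just the final reward.
    """
    if done:
        return reward
    return reward + gamma * next_q


def soft_update(target_params, online_params, tau):
    """Move target parameters a small step towards the online ones."""
    if len(target_params) != len(online_params):
        raise ValueError("parameter lists must have the same length")
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Hypothetical parameter vectors for illustration.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5)
```

A small `tau` makes the target networks change slowly, so the regression target of the critic does not shift abruptly between updates.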
4.2 Offline Critic for Evaluation
In this paper, we utilize the offline RL critic in a new way, as a data- and model-independent evaluator for task-oriented dialogue systems. Following the critic training loop in Figure 1(c), we replace the target actor with the fixed policy $\pi$, i.e., the one to be evaluated, and perform the critic training loop with Equation 7 as the loss function, setting the weight of the KL term to zero for systems with word-level actions.
Note that with this approach, the dataset consisting of dialogues for evaluation can take any form as long as the states and actions are compatible with the dialogue system input and output, allowing comparisons across various types of dialogue systems. For instance, the states can be represented as sequences of utterances or binary vectors, and actions as word-level, latent, semantic, or binary actions. In terms of rewards, those can be sparse (i.e., all intermediate rewards are set to the same constant), and in case the corpus represents the desirable behaviour, a maximum reward can be assumed as the final reward for every dialogue in the corpus. Of course, more accurate reward labels would result in an even more precise evaluator. As a consequence, dialogue systems can be evaluated on static corpora that differ from the training corpus and that were not necessarily generated by interacting with the system.
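A sparse reward labelling of this kind can be sketched in a few lines. The function name and the concrete reward values below are hypothetical placeholders and are not the values used in the experiments.

```python
def sparse_rewards(num_turns, final_reward, step_reward=0.0):
    """Assign one reward per system turn of a dialogue.

    Every turn except the last receives the same constant step reward.
    The task outcome is only reflected in the final turn.
    """
    if num_turns < 1:
        raise ValueError("a dialogue needs at least one turn")
    return [step_reward] * (num_turns - 1) + [final_reward]


# Hypothetical labelling: a corpus dialogue assumed to show desirable
# behaviour receives the maximum final reward.
assumed_success = sparse_rewards(num_turns=4, final_reward=1.0)
```

If per-dialogue success labels are available, `final_reward` can be set per dialogue instead of assuming the maximum everywhere.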
A possible use case scenario would be a human-human corpus annotated with states and sparse rewards and a number of different dialogue systems being evaluated on this corpus. This is the case we consider in our evaluation below, whereby we use word-level and latent actions, and thus do not require explicit action labels.
5 Experimental Set-up
We conduct our experiments on MultiWOZ 2.1 budzianowski2018large; eric2019multiwoz, one of the largest and most challenging corpora of its kind. MultiWOZ is a collection of conversations between humans in a Wizard-of-Oz fashion, where one person plays the role of a dialogue system and the other one a user. The user is tasked to find entities, e.g., a restaurant or a hotel, that fit certain criteria by interacting with the dialogue system. The corpus simulates a multi-domain task-oriented dialogue system interaction. We use the training, validation and test set partitions provided in the corpus, amounting to 8,438 dialogues for training and 1,000 each for validation and testing.
5.2 Policy and Critic Training
For the LAVA_mt pre-training, we use simple recurrent models as encoder and decoder and follow the hyperparameters set in the original work lubis2020lava with a few exceptions: we use continuous latent variables with a standard Gaussian prior, and we lower the learning rate. As depicted in Figure 1, parts of the LAVA_mt model are then used by the actor, the critic, and different parts of PLAS training. The critic's linear layer uses the sigmoid activation function. During PLAS training, the critic and the actor use separate learning rates, and dropout is applied in the critic. The policy is trained with a maximum of 10K sampled episodes from the corpus, and the best checkpoint is chosen according to the corpus-based success rate. We set the hyper-parameters of the critic as an offline evaluator in the same way, except that it uses 100K sampled episodes for training without early stopping.
5.3 Dialogue Systems
To show the generalization ability of our proposed offline evaluation, we evaluate various dialogue systems that differ in terms of modular abstractions and architectures:
HDSA is a transformer-based dialogue generation architecture with graph-based dialogue actions using hierarchically-disentangled self-attention (HDSA). The model consists of a predictor, which outputs the dialogue action, and a generator, which subsequently maps it into a dialogue response. Two versions of HDSA are included, one which uses the ground-truth action for generation (gold), and one which uses predicted labels (pred). Note that the ‘pred’ version is the only one that can be deployed in an interactive set-up.
AuGPT is a fully end-to-end dialogue system that fine-tunes GPT2 radford2019language on multi-task objectives, including belief state prediction, response prediction, belief-response consistency, user intent prediction, and system action prediction. The model is trained on MultiWOZ data augmented with the Taskmaster-1 byrne2019taskmaster and Schema-Guided Dialogue rastogi2020towards datasets.
LAVA is an RNN-based model using latent actions, optimized via SL and policy gradient RL with the corpus-based success rate as reward. We use LAVA_kl as the best performing model reported.
LAVA+PLAS is our proposed variant of LAVA that is trained in an offline RL set-up using the offline critic and the PLAS algorithm (Section 4.1).
5.4 Evaluation Metrics
Offline Critic for Evaluation (Ours)
For each system, we train an offline critic using offline Q-learning as described in Section 4.2. While theoretically the critic can take any form of dialogue action as input, in our experiments we utilize word-level or latent actions. Intermediate rewards are constant, and the final reward depends on whether the dialogue is successful or failed, as provided in the MultiWOZ corpus. As the final estimated value of the policy, we report the average estimated return of all initial states on the test set.
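The reported score can be thought of as a mean over per-dialogue estimates for the initial state. The sketch below computes such a mean together with a normal-approximation 95% confidence interval. That the intervals in the tables were obtained this way is an assumption, and the input values are made up for illustration.

```python
import math


def mean_with_ci(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation interval.

    The half width is z times the standard error of the mean, using the
    sample standard deviation with Bessel's correction.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = z * math.sqrt(variance / n)
    return mean, half_width


# Hypothetical critic estimates for the initial states of four dialogues.
score, half_width = mean_with_ci([0.48, 0.52, 0.55, 0.45])
```

Because each critic estimate is a bounded real number rather than a binary success flag, the spread across dialogues can be small, which would result in narrow intervals.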
Standard corpus-based metrics
Corpus-based evaluation is conducted on the MultiWOZ test set using delexicalized responses with the benchmarking evaluation script provided by budzianowski2018large. A pseudo-dialogue is generated, where user turns are taken from the corpus and system turns are generated by the evaluated model. Match rate computes whether all informable slots in the user goal are generated, and success rate computes whether all information requested by the user is provided. For completeness, we also report the BLEU score on target responses.
We use the default ConvLab2 zhu2020convlab user simulator with the BERT-based NLU module, rule-based agenda policy and template NLG. We conduct simulated dialogues with each system and focus on three measures: book rate, i.e., how often the system finalized a booking; success rate, i.e., the percentage of dialogues where all information requested by the user is provided by the system and bookings are successfully made; and complete rate, i.e., the percentage of dialogues that are finished regardless of whether the booked entity matches the user criteria. We also report entity F1 and the average number of turns across the simulated dialogues.
With the exception of AuGPT, the systems’ dialogue policies require a dialogue state tracker (DST) for online interactions. For this purpose, we utilize a tracker with a joint goal accuracy of 52.26% on the test set of MultiWOZ 2.1 van2020knowing. This tracker is a recurrent neural model, which utilises attention and transformer-based embeddings to extract important information from the dialogue. We perform lexicalization via handcrafted rules using the information from the dialogue state and database query. For handling incomplete lexicalizations due to empty database queries or a wrongly predicted domain by the policy, we replace the response with a generic “I’m sorry, could you say that again?”. This is equivalent to masking such actions while neither punishing nor rewarding the policy.
Table 1: Policy performance after SL and after SL + PLAS (only the column headers |SL||SL + PLAS| are preserved).
|Policy||Corpus Evaluation: Match||Corpus Evaluation: Success||Corpus Evaluation: BLEU||Critic Evaluation|
|MultiWOZ (Human)||90.40 ± 1.82||82.30 ± 2.36||N/A||52.68 ± 0.02|
|AuGPT||83.30 ± 2.31||67.20 ± 2.91||0.17||52.45 ± 0.02|
|LAVA+PLAS||88.30 ± 1.99||73.40 ± 2.74||0.14||51.76 ± 0.03|
|LAVA_kl||97.50 ± 1.14||94.80 ± 1.47||0.12||48.95 ± 0.08|
|HDSA (gold)||91.80 ± 1.70||82.50 ± 2.35||0.21||49.89 ± 0.08|
|HDSA (pred)||88.90 ± 1.95||74.50 ± 2.70||0.20||49.00 ± 0.09|
Table 2: Corpus-based evaluation metrics. 95% confidence intervals are reported.
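|Policy||ConvLab US Evaluation: Complete||ConvLab US Evaluation: Success||ConvLab US Evaluation: Book||ConvLab US Evaluation: Entity F1||ConvLab US Evaluation: Turns||Human Evaluation: Success||Human Evaluation: Rating|
|AuGPT||89.20 ± 1.92||83.30 ± 2.31||85.16 ± 3.34||81.03 ± 1.40||14.50 ± 0.41||90.75 ± 2.85||4.34 ± 0.08|
|LAVA+PLAS||54.20 ± 3.09||45.30 ± 3.09||61.18 ± 4.51||58.85 ± 2.25||23.54 ± 0.89||63.00 ± 4.75||3.34 ± 0.12|
|LAVA_kl||49.20 ± 3.10||40.00 ± 3.04||63.20 ± 4.37||54.47 ± 2.24||26.64 ± 1.00||63.25 ± 4.74||3.44 ± 0.12|
|HDSA (pred)||36.70 ± 2.99||25.90 ± 2.71||6.67 ± 2.37||49.97 ± 2.23||31.32 ± 0.86||55.25 ± 4.89||3.09 ± 0.12|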
|Fleiss’ Kappa||Human Evaluation|
Human evaluation is performed via DialCrowd lee2018dialcrowd connected to Amazon Mechanical Turk. The systems are set up identically as in the US evaluation, except that the systems are interacting with paid users instead of a US. Users are provided with a randomly generated user goal and are required to interact with our systems in natural language and to subsequently evaluate them. We ask the user whether their goal is fulfilled through the dialogue, indicating the success rate. We also ask them to rate the overall system performance on a Likert scale from 1 (worst) to 5 (best). For each system we collected 400 dialogues with human workers.
6 Results and Analysis
6.1 Offline Critic for Optimization
Table 1 shows the policy performance after shared multi-task SL training and the performance after subsequent offline RL training with PLAS, averaged over 5 seeds. We observe that offline RL in latent space with the critic estimate as reward signal improves task-related metrics on both corpus and US evaluation. The consistent improvement on offline and interactive evaluations results from using the critic’s value estimate as reward signal, which we believe is noteworthy as the policy is never explicitly trained on either metric.
Like policy gradient RL used by LAVA (Equation 3), PLAS leads to a decrease in BLEU score. This is quite common for end-to-end policies trained with RL following SL lubis2020lava, however the decrease with PLAS is not as drastic. This signals that the policy retains more linguistic variety in the responses, since the reward signal does not overlook context mismatch and thus responses that are out of context are not rewarded. We include a dialogue example in Appendix A to demonstrate the context mismatch issue and how the offline critic addresses it.
6.2 Offline Critic for Evaluation
System performances across metrics
Tables 2 and 3 present the corpus- and interaction-based evaluation results of LAVA+PLAS and our baselines. For completeness, we included the human policy, i.e., the behavior policy of the dataset, on the corpus-based evaluation. For LAVA+PLAS, we pick the best model out of the 5 seeds. For the baseline models, we utilize the released pre-trained parameters and re-run all evaluations.
The ranking of the systems differs depending on the evaluation metrics. With corpus-based success and match rates, LAVA_kl far outperforms the other models and even human wizards. This is expected, as LAVA_kl is directly optimized with the corpus-based success rate as reward. In terms of BLEU, HDSA, which is designed for generation with semantic actions, achieves the first rank. With critic evaluation, the human policy achieves the highest score. The rankings for evaluation with the user simulator and paid workers in Table 3 are consistent with each other, showing another trend entirely. AuGPT outperforms the other systems by a large margin, LAVA+PLAS and LAVA_kl show a narrower gap in performance compared to corpus-based metrics, while HDSA performs very poorly. The collected dialogues show that the language understanding and generation of AuGPT are superior to those of the other models, as it leverages a large pre-trained model as a base model and utilizes multiple dialogue corpora for fine-tuning. In other words, it is trained on orders of magnitude more data compared to the other systems. This results in a more natural interaction with both simulated and human users.
It is interesting to note that the critic has a much narrower confidence interval compared to the other metrics. Although the values for some policies are seemingly close, the intervals show that the differences between most of the systems are statistically significant, except for LAVA_kl and HDSA (pred), whose intervals in Table 2 overlap.
Correlation with human judgements
Table 4 lists pairwise correlations between human judgements and the automatic metrics. We distinguish corpus-based metrics, such as the standard match and success rates, BLEU and critic evaluation, from interactive metrics that require a form of user, either simulated or paid. Success rates of current standard evaluations have a moderate inverse correlation with human judgements due to the context mismatch that occurs during their computation. On the other hand, the theoretically grounded value estimation by the offline critic has a strong correlation with human judgements, showing that our proposed method is a more suitable corpus-based metric to reflect the dialogue system performance. Our study confirms the weak correlation between BLEU and human ratings. All metrics computed based on interaction with the US are strongly correlated with metrics from human evaluation. The number of turns is strongly but inversely correlated, which aligns with the intuition that the fewer turns the system needs to complete the dialogue, the better it is perceived by human users. This suggests that while the existing US is far from fully imitating human behavior, it provides a good approximation of how the systems will perform when interacting with human users. We advocate that future works report on multiple evaluation metrics to provide a more complete picture of the dialogue system performance.
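To make the notion of pairwise correlation concrete, the sketch below computes a Pearson coefficient between an automatic metric and the human-rated success for the four systems that appear in both evaluations. Two things are assumptions here: that Pearson correlation is the measure behind Table 4, and the reading of the table columns (critic score and corpus-based success from Table 2, human success rate from Table 3). With only four systems, the numbers are illustrative rather than conclusive.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Order: AuGPT, LAVA+PLAS, LAVA_kl, HDSA (pred).
CRITIC = [52.45, 51.76, 48.95, 49.00]
CORPUS_SUCCESS = [67.20, 73.40, 94.80, 74.50]
HUMAN_SUCCESS = [90.75, 63.00, 63.25, 55.25]

critic_vs_human = pearson(CRITIC, HUMAN_SUCCESS)
corpus_vs_human = pearson(CORPUS_SUCCESS, HUMAN_SUCCESS)
```

Under these assumptions, the critic score correlates positively with human-rated success while the corpus-based success rate correlates negatively, in line with the direction of the findings described above.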
Note that while US evaluation provides stronger correlations with human judgements, our proposed use of offline RL critic for evaluation has the benefit of being corpus- and model-independent, whereas for a new corpus and ontology, a new US would need to be designed and developed. Furthermore, an offline evaluation takes significantly less time to perform, making it an efficient choice for the iterative development process.
6.3 Impact of Reward Signal on RL
LAVA+PLAS and LAVA_kl are the only two systems optimized via RL. We observe that each significantly outperforms the other on the metric it received as reward signal during RL. However, when subjected to interactive evaluation, the gap between their performance shrinks (see Table 3). This shows on the one hand the power of reinforcement learning methods to optimize the given reward, and on the other hand how important it is to define this reward correctly, warranting further research in both extrinsic and intrinsic reward modelling for dialogue wesselmann; GeishauserHLLHF21.
We propose the use of offline RL for dialogue evaluation based on a static corpus. While offline RL critics are typically utilized for policy optimization, we show that they can be trained for any dialogue system as external evaluators that are corpus- and model-independent, while attaining strong correlation with human judgements, which we confirm via an interactive user trial. Not only does the offline RL critic provide a corpus-based metric that is reliable and efficient to compute, it also addresses a number of issues highlighted in the recently published NSF report mehri2022report. It is important to note that the proposed framework does not depend on the definition of states, actions and rewards. In principle, one could therefore apply this method beyond task-oriented dialogue systems. For example, one could evaluate a number of chat-bots using a corpus annotated only with the level of engagement achieved in each dialogue, and thus measure the level of engagement of the evaluated chat-bots.
N. Lubis, C. van Niekerk, M. Heck and S. Feng are supported by funding provided by the Alexander von Humboldt Foundation in the framework of the Sofja Kovalevskaja Award endowed by the Federal Ministry of Education and Research. C. Geishauser and H-C. Lin are supported by funds from the European Research Council (ERC) provided under the Horizon 2020 research and innovation programme (Grant agreement No. STG2018 804636). Google Cloud and HHU ZIM provided computational infrastructure.
Appendix A Example Dialogue
Table 5 shows an example dialogue accompanied by the corresponding offline critic Q estimation and corpus-based success rate.
|Speaker||Utterance||Critic Q estimate||Corpus-based success|
|User||can you help me find a [value_pricerange] place to stay in the [value_area] part of town?|
|True||sure. there are [value_count] guesthouses there. i would be happy to book [value_count] for you if you like.|
|Pred||the [hotel_name] is located at [hotel_address] , postcode [hotel_postcode]. the phone number is [hotel_phone].||0.4730|
|User||does it have a star rating of [value_count]?|
|True||no. they all have [value_count] stars.|
|Pred||the [hotel_name] is a guesthouse in the [value_area] area. the phone number is [hotel_phone] and the postcode is [hotel_postcode].||0.4569|
|User||then find me [value_count] in the [value_pricerange] price range.|
|True||all of the accommodations in the [value_area] are in the [value_pricerange] price range. would you like to try a different part of town?|
|Pred||the [hotel_name] is located at [hotel_address] , postcode [hotel_postcode]. the phone number is [hotel_phone].||0.4478|
|User||find me a nice [value_count] and book for [value_count] people and [value_count] nights from [value_day]|
|True||actually , the [hotel_name] is a [value_count] start hotel in the [value_area] area. it s in the [value_pricerange] price range. how does that sound?|
|Pred||i have booked your room at the [hotel_name] for [value_count] nights starting [value_day]. your reference number is [hotel_reference]. is there anything else i can help you with?||0.4461|
|User||that sounds perfect. can you reserve that and provide me with the reference number?|
|True||okay , you are all set. the hotel is booked and the reference number is [hotel_reference]. can i help you with anything else today?|
|Pred||the reference number is [hotel_reference].||0.4205|
|User||thank you. i am also looking for a restaurant that serves [value_food] and that is in the same area as the hotel|
|True||[restaurant_name] is an [value_pricerange] restaurant in the same area as the hotel , would you like me to make reservations for you?|
|Pred||[restaurant_name] is located at [restaurant_address] , [restaurant_postcode]. the phone number is [restaurant_phone].||0.4167|
|User||yes. same group of people , at [value_time] on the same day please.|
|True||booking was successful. the table will be reserved for [value_count] minutes. reference number is [restaurant_reference]. what else can i help you with today?|
|Pred||i have booked you a table at [restaurant_name]. the reference number is [restaurant_reference]. can i help you with anything else?||0.4095|
|User||that was all i needed. thanks. bye.|
|True||ok! have a great day!|
|Pred||you are welcome. have a great day!||0.4161||1.0|