
AARGH! End-to-end Retrieval-Generation for Task-Oriented Dialog

We introduce AARGH, an end-to-end task-oriented dialog system combining retrieval and generative approaches in a single model, aiming at improving dialog management and lexical diversity of outputs. The model features a new response selection method based on an action-aware training objective and a simplified single-encoder retrieval architecture which allow us to build an end-to-end retrieval-enhanced generation model where retrieval and generation share most of the parameters. On the MultiWOZ dataset, we show that our approach produces more diverse outputs while maintaining or improving state tracking and context-to-response generation performance, compared to state-of-the-art baselines.





1 Introduction

Most current research on task-oriented dialog focuses on end-to-end modeling, i.e., integrating the whole dialog system into a single neural network (wen2017; ham2020). Although recent end-to-end generative approaches based on pre-trained language models produce fluent and natural responses, they suffer from two major problems: (1) hallucinations and lack of grounding (dziri2021), which result in faulty dialog management or responses inconsistent with the dialog state or database results, and (2) blandness and low lexical diversity of outputs (zhang2020dialogpt). On the other hand, retrieval-based dialog systems (chaudhuri2018) select the most appropriate response candidate from a human-generated training set, thus producing varied outputs. However, their responses might not fit the context and can lead to disfluent conversations, especially when the set of candidates is sparse. This limits their usage to very large datasets, which typically do not support dialog state tracking or database access (lowe2015; alrfou2016).

Several recent works focus on combining retrieval-based and generative dialog systems via response selection and subsequent refinement, i.e., retrieval-augmented generation (pandey2018; weston2018; cai2019a; thulke2021). These models are used for open-domain conversations or to incorporate external knowledge into task-oriented systems, and they do not consider an explicit dialog state.

Our work follows the retrieve-and-refine approach, but we adapt it for database-aware task-oriented dialog. We aim at improving the diversity of produced responses while preserving their appropriateness. In other words, we do not retrieve any new information from an external knowledge base; instead, we retrieve relevant training-data responses to support the decoder in producing varied outputs. To the best of our knowledge, we are the first to use retrieval-augmented models in this context. Unlike previous works, we merge the retrieval and generative components into a single neural network and train both tasks jointly, instead of using two separately trained models. Our contributions are summarized as follows:

Figure 1: Our retrieval-based generative task-oriented system (AARGH, see Section 3.5). Numbers in module boxes mark the order of processing during inference: (1) inputs are pushed through the shared context encoder and (2) the state encoder; (3) the state decoder produces the update to the current dialog state. The new state is used to query the database, whose outputs are discretized, embedded, and (4) used in the retrieval encoder, whose output is reduced to a single vector via average pooling. The context embedding is used to get the best response candidate (hint). Finally, (5) the response decoder, which can attend to the state encoder outputs via cross-attention and is conditioned on the database results and the hint, generates the final system response to be shown to the user.


  • We propose a single-encoder retrieval model utilizing dialog action annotation during training, and we show its superior retrieval capabilities in the task-oriented setting compared to two-encoder baseline models (humeau2020).

  • We propose an end-to-end task-oriented generative system with an integrated minimalistic retrieval module. We compare it to strong baselines that model response selection and generation separately.

  • On the MultiWOZ benchmark (budzianowski2018large), our approaches outperform previous methods in terms of lexical diversity and achieve competitive or better results in automatic metrics and human evaluation.

2 Related Work

Task-Oriented Response Generation

Most current works focus on building multi-domain database-grounded systems. The breeding ground for this research is the large-scale conversational dataset MultiWOZ (budzianowski2018large; eric-etal-2020-multiwoz; zang2020).

Recent models often benefit from action annotation: zhang2020damd use action-based data augmentation and a three-stage architecture, decoding the dialog state, action, and response; chen2019 generate responses without state tracking, exploiting a hierarchical structure of the action annotation. On the other hand, reinforcement learning models (wang2020) learn latent actions from data without using annotation.

Recent works focus on end-to-end systems based on pre-trained language models. budzianowski2019 fine-tune GPT-2 (radford2019) to model task-oriented dialogs, hosseini2020 enhance this approach with explicitly decoded system actions. peng2021 use auxiliary training objectives and machine teaching for GPT-2 fine-tuning. lin2020 introduced the encoder-decoder-based framework MinTL with BART (devlin2019) or T5 (kale2020) backbones (see Section 3.1).

Response Selection

can be viewed as scoring response candidates given a dialog context. A popular approach is the dual encoder architecture (lowe2015; henderson2019a) where the response and context encoders model a joint embedding space. The encoders can take various forms: henderson2019b compare encoders based on BERT (devlin-etal-2019-bert) and custom encoders pre-trained on Reddit; wu2020 pre-train encoders specifically for task-oriented conversations. humeau2020 introduce poly-encoders, which produce multiple context encodings and add an attention layer to allow rich interaction with the candidate encoding (cf. Section 3.3).

Retrieval-Augmented Generation

To benefit from both retrieval and generative models, weston2018 proposed an open-domain dialog system utilizing a retrieval network and a decoder to refine retrieved responses. roller2021 further developed this approach, using poly-encoders with a large pre-trained decoder. They found that their decoder tends to ignore the retrieved response hints. To combat this, they propose the α-blending method (replacing the retrieval output with the ground truth, see Section 3.2). Similarly, gupta2021 and cai2019b; cai2019a focus on retrieval-augmented open-domain dialog, but to prevent the inflow of erroneous information into the generative part of their models, they use semantic frames or reduced forms of retrieved responses instead of raw response texts.

thulke2021 aim at knowledge retrieval from external documents for resolution of out-of-domain questions on MultiWOZ (kim2020). shalyminov2020 present the only work using generation and retrieval in a single model. They fine-tune GPT-2 (radford2019) for response generation in a low-resource task-oriented setup, retrieve alternative responses based on the model’s embedding similarity, and choose between generated and retrieved responses on-the-fly. However, their model is not trained for retrieval, cannot alter retrieved responses, and does not take a dialog state or database into account.

3 Method

We aim at end-to-end modeling of database-aware task-oriented systems, i.e., systems supporting both dialog state tracking and response generation tasks (young2013). We combine retrieval and generative models to reduce hallucinations and boost output diversity. We first describe our purely generative baseline (Section 3.1), then explain baseline generation based on retrieved hints (Section 3.2). We then introduce baseline retrieval models (Section 3.3) and our action-aware retrieval (Section 3.4). Finally, we describe AARGH, our single-model retrieval-generation hybrid, in Section 3.5. AARGH is shown in Figure 1; other setups are depicted in Appendix A.

3.1 Generative Baseline

Our purely generative baseline model (Gen) follows MinTL (lin2020). It is based on an encoder-decoder backbone with a context encoder shared by two decoders: one modeling the dialog state updates, the other producing the final system response. Both decoders attend to the encoded input tokens via an attention mechanism.

The encoder input sequence consists of a concatenation of two parts: (1) past dialog utterances prepended with <|system|> or <|user|> tokens, and (2) the initial dialog state converted to a string, e.g., hotel [area: center] restaurant [food: African, pricerange: expensive]. The first decoder is conditioned only on the start-of-sequence token and predicts the dialog state update as a difference between the current state and the initial state. The second decoder is conditioned on the number of database results for each queried domain, e.g. train: 6 if there are six matching results for a train search, and generates the final response.
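For illustration, the encoder input described above can be assembled as follows. This is a hedged sketch: the speaker tokens and the state serialization format come from the text, but the helper names are our own.

```python
SYSTEM, USER = "<|system|>", "<|user|>"

def serialize_state(state):
    # {"hotel": {"area": "center"}} -> "hotel [area: center]"
    return " ".join(
        f"{domain} [" + ", ".join(f"{k}: {v}" for k, v in slots.items()) + "]"
        for domain, slots in state.items()
    )

def build_encoder_input(utterances, initial_state):
    # utterances: list of (speaker, text) pairs, oldest first
    history = " ".join(
        f"{USER if speaker == 'user' else SYSTEM} {text}"
        for speaker, text in utterances
    )
    return f"{history} {serialize_state(initial_state)}"
```

In practice the resulting string is tokenized by the backbone's sub-word tokenizer before being fed to the encoder.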

During inference, the input is passed through the encoder, then the state update is predicted, merged with the initial dialog state, and this new state is used to query the database (see Section 4 for details). The final system response is predicted based on the context, state, and database results.

3.2 Retrieval-Augmented Response Generation

To combine the retrieval and generative approaches, we follow weston2018 and incorporate response hints, i.e., the outputs of a retrieval module (Sections 3.3, 3.4), into the generative module in their original form as raw sub-word tokens. Specifically, we add the retrieved response prepended with <|hint|> to the input of Gen’s response decoder (Section 3.1), alongside the database results.

gupta2021 state that this straightforward token-based retrieve & refine setup might lead to generating incoherent responses due to over-copying of contextually irrelevant tokens. However, using more abstract outputs of the retrieval module, e.g. semantic frames or salient words, would go against our goal of reducing blandness and increasing the lexical diversity of responses. To smoothly control the amount of token copying, we follow roller2021 and use so-called α-blending: during training, we replace the retrieved utterance with the ground-truth final response with probability α. This method also ensures that the decoder learns to attend to the retrieval part of its input successfully.
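A minimal sketch of this blending step when forming the response decoder input. The <|hint|> token comes from the text above; the exact serialization of database result counts is an assumption for illustration.

```python
import random

HINT = "<|hint|>"

def response_decoder_input(db_counts, retrieved, ground_truth, alpha, rng=random):
    # With probability alpha, the retrieved hint is replaced by the gold
    # response, so the decoder learns to attend to (and copy from) the hint.
    hint = ground_truth if rng.random() < alpha else retrieved
    db = " ".join(f"{dom}: {n}" for dom, n in db_counts.items())
    return f"{db} {HINT} {hint}"
```

At inference time no blending is applied: the retrieved hint is always used as-is.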

3.3 Baseline Response Selection

We consider two baseline retrieval model variants:

Dual-encoder (DE)

follows the very popular retrieval architecture (lowe2015; humeau2020) which makes use of context and response encoders. Both produce a single vector in a joint embedding space. During training, the context embedding and the corresponding response embedding are pushed towards each other, while other responses in the training batch are used as negative examples, i.e., cross-entropy loss over in-batch similarities is used:

$$\mathcal{L}_{\mathrm{DE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(S_{ii})}{\sum_{j=1}^{B} \exp(S_{ij})},$$

where $S$ is the similarity matrix between the normalized encoded responses $\hat{\mathbf{r}}_j$ and contexts $\hat{\mathbf{c}}_i$ in a batch of size $B$, specifically $S_{ij} = w \cdot \hat{\mathbf{c}}_i^{\top} \hat{\mathbf{r}}_j$, where $w$ is a trainable scaling factor.

Inference-time retrieval is as simple as finding the nearest candidate embedding given a context embedding. The context input is similar to Gen’s (see Section 3.1): a concatenation of the current updated dialog state, the number of matching database results and past user and system utterances. Encoders are followed by average pooling and a fully-connected layer for dimensionality reduction.
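The in-batch contrastive objective above can be sketched in plain Python as follows (illustrative only; the actual models use batched tensor operations):

```python
import math

def de_loss(contexts, responses, w=5.0):
    # contexts, responses: lists of L2-normalized embeddings; pair i matches.
    # Each row of the similarity matrix is a softmax classification problem
    # whose target is the diagonal entry (the gold response for that context).
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    B = len(contexts)
    S = [[w * dot(c, r) for r in responses] for c in contexts]
    loss = 0.0
    for i in range(B):
        loss += math.log(sum(math.exp(s) for s in S[i])) - S[i][i]
    return loss / B
```

With perfectly matched, mutually orthogonal embeddings, the loss approaches zero as the scaling factor grows.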

Poly-encoder (PE)

is an extension of DE, aiming at richer interaction between the candidate and the context. The candidate encoder is unchanged. In the context encoder, the average pooling is replaced with two levels of dot-product attention (vaswani2017; humeau2020). The first level summarizes the encoded context tokens into $m$ vectors: the context tokens act as attention keys and values, while the queries to this attention are $m$ learned embeddings (query codes). The second attention level provides the candidate-context interaction: it takes the context summary vectors as keys and values, and the candidate encoder output acts as the query. The parameter $m$ provides a trade-off between inference complexity and richness of the context encoding. The loss term remains the same.

3.4 Action-aware Response Selection

We argue that the dual- or poly-encoder models are not practical for the task-oriented setting as their performance depends on the way negative examples are sampled during training (nugmanova2019). Choosing appropriate negative examples is difficult in task-oriented datasets, as system responses are often very similar to each other (with the conversations being in a narrow domain and following similar patterns). Therefore, we propose a method for candidate selection based on system action annotation, which is usually available in task-oriented datasets. We designed the method to be usable with a single encoder only, but we also include a dual-encoder version for comparison.

Action-aware-encoder (AAE)

Using two separate encoders to encode the response and the context might be impractical due to large model size. Some recent works (e.g., wu2020; roller2021) use a single shared encoder instead, and henderson2020 discuss parameter sharing between the two encoders. In view of that, we propose a single-encoder action-aware retrieval model. We train it to produce embeddings of dialog contexts which are close to each other if the corresponding responses in the training data have similar action annotation. More precisely, we adapt wan2018’s generalized end-to-end loss, originally developed for batch-wise training of speaker classification from audio: To form training mini-batches, we first sample $N$ random dialog actions, and for each of those actions, we sample $M$ examples that include the particular action in their system action annotation. We then encode the dialog contexts corresponding to the sampled examples into normalized embeddings $\mathbf{e}_{ij}$ and compute the similarity matrix $S$ as follows:

$$S_{ij,k} = \begin{cases} w \cdot \cos\left(\mathbf{e}_{ij}, \mathbf{c}_i^{(-j)}\right) & \text{if } k = i, \\ w \cdot \cos\left(\mathbf{e}_{ij}, \mathbf{c}_k\right) & \text{otherwise,} \end{cases}$$

where $\mathbf{c}_k = \frac{1}{M} \sum_{m=1}^{M} \mathbf{e}_{km}$, $\mathbf{c}_i^{(-j)} = \frac{1}{M-1} \sum_{m \neq j} \mathbf{e}_{im}$, $i, k \in \{1, \dots, N\}$, and $j \in \{1, \dots, M\}$. Same as for DE, $w$ is a trainable scaling factor of the similarity matrix. In other words, the similarity matrix describes the similarity between embeddings of each example and centroids, i.e., the means of embeddings that correspond to the same particular action. For stability reasons and to avoid trivial solutions, we follow wan2018 and exclude $\mathbf{e}_{ij}$ from the centroid calculation when computing $S_{ij,i}$.

We then maximize the similarity between the examples and their corresponding centroids while using other centroids as negative examples:

$$\mathcal{L}_{\mathrm{AAE}} = -\frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} \log \frac{\exp(S_{ij,i})}{\sum_{k=1}^{N} \exp(S_{ij,k})}$$

During inference, we rank the responses from the training set according to the cosine similarity of their corresponding contexts and the query context. Again, the contexts consist of the current updated dialog state, the number of matching database results and past utterances.
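The action-aware similarity and loss can be sketched as follows (a plain-Python illustration of the generalized end-to-end objective; real training would use batched tensors):

```python
import math

def _normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def _mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def aae_similarity(emb, w=5.0):
    # emb[i][j]: normalized context embedding of example j sampled for action i
    N, M = len(emb), len(emb[0])
    centroids = [_normalize(_mean(rows)) for rows in emb]
    S = [[[0.0] * N for _ in range(N)] for _ in range(M)]
    S = [[[0.0] * N for _ in range(M)] for _ in range(N)]
    for i in range(N):
        for j in range(M):
            for k in range(N):
                if k == i:
                    # leave-one-out centroid: exclude e_ij for stability
                    c = _normalize(_mean([emb[i][m] for m in range(M) if m != j]))
                else:
                    c = centroids[k]
                S[i][j][k] = w * sum(a * b for a, b in zip(emb[i][j], c))
    return S

def aae_loss(S):
    # push each example towards its own action centroid (softmax over k)
    N, M = len(S), len(S[0])
    total = 0.0
    for i in range(N):
        for j in range(M):
            total += math.log(sum(math.exp(s) for s in S[i][j])) - S[i][j][i]
    return total / (N * M)
```

The leave-one-out centroid in the `k == i` branch mirrors the stability measure described above.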

Action-aware-dual-encoder (AADE)

This setup follows the DE architecture (see Section 3.3), but it is trained similarly to AAE, i.e., we form training mini-batches identically, and for each of the distinct actions in the batch, we treat all its sampled examples as positive examples.

3.5 Hybrid End-to-end Model

To further simplify the retrieval-augmented setup, reduce the number of trainable parameters and gain back computational efficiency, we introduce an end-to-end Action-Aware Retrieval-Generative Hybrid model (AARGH), which jointly models both response selection and context-to-response generation (see Figure 1). It is a natural extension of the Gen generative model (Section 3.1), enabled by our new single-encoder action-aware response retrieval (AAE, Section 3.4).

A new retrieval encoder, which produces normalized context embeddings, shares most parameters with the original encoder, which is followed by the two decoders and is partially responsible for state tracking and response generation. To build the retrieval encoder, we fork the last few layers of the original encoder and condition them on the outputs of the shared preceding layers, concatenated with an embedding of the number of current database results. To obtain this embedding, we convert the number of database results into a small set of bins, which are then embedded via a learnt embedding layer.222This conversion is dataset-specific and not used in other compared models such as Gen. We use label 0 if the database was not queried, label 1 for 0 matching results, labels 2, 3, or 4 if there are 1, 2 or 3 results, respectively, labels 5 or 6 if there are fewer than 6 or fewer than 11 results, and label 7 if there are 11 or more results. The new retrieval encoder is followed by average pooling and trained using the same objective as AAE (see Section 3.4).
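The binning of database result counts can be sketched as follows. Treating label 0 as "no result count available" is our reading of the footnote; the remaining bin boundaries follow it directly.

```python
def db_count_bin(count):
    # count: number of database results, or None when no query was made
    if count is None:
        return 0          # assumption: label 0 = no result count available
    if count == 0:
        return 1
    if count <= 3:
        return count + 1  # 1, 2, 3 results -> labels 2, 3, 4
    if count < 6:
        return 5          # 4-5 results
    if count < 11:
        return 6          # 6-10 results
    return 7              # 11+ results
```

Each label then indexes a learnt embedding table whose output is concatenated with the shared encoder states.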

During inference, we pass the input through the partially shared context encoder and decode and update the dialog state. The new state is used to query the database. Database results are embedded and added to the output of the last encoder shared layer to form the input to the retrieval encoder, which produces the context embedding and a retrieved response. Based on state, database results, and retrieved response, the response decoder produces the final (delexicalized) response.

4 Experimental Setup


Our models are based on pre-trained models from HuggingFace (wolf-etal-2020-transformers): We implement Gen and the generative parts of our retrieval-based models using T5-base (kale2020). Retrieval encoders in DE, AADE, PE and AAE are implemented as fine-tuned BERT-base (devlin2019). AARGH is built upon T5-base, same as Gen; we fork the last few of its encoder layers. The number of forked layers is a trade-off between model performance and size.333We noticed a performance drop when forking fewer layers, while forking more did not bring any large gain. The database embedding size is a hyperparameter. For simplicity, we do not use specialized backbones pre-trained on dialogs such as ToD-BERT (wu2020). PE uses learned query codes (see Section 3.3) and single-headed attention mechanisms.

Figure 2: Part of a short conversation from MultiWOZ, with user and system turns and annotated slot spans. Both user and system utterances affect the dialog state. Actions are shown below system texts.
Figure 3: t-SNE projection of test-set context embeddings from the retrieval modules of our models, colored by the MultiWOZ domains associated with the corresponding dialog turns.

Data and database

We experiment on the MultiWOZ 2.2 dataset (budzianowski2018large; zang2020), a popular dataset with around 10k task-oriented conversations in 7 different domains such as trains, restaurants, or hotels (see Figure 2). A single conversation can touch multiple domains. The dataset has an associated database, dialog state annotation, dialog action annotation of system turns, and slot value span annotation for easy delexicalization (wen2015), thus enabling development of realistic end-to-end dialog systems.444Unlike the similar-sized Taskmaster (byrne2019) and SGD (rastogi2019) datasets, which lack databases and annotation detail. To query the database using the belief state, we use the fuzzy matching implementation by nekvinda2021. To filter out inactive domains from database results during inference, we follow previous work and estimate the currently active domain from dialog state updates.

Input and output format

We use the same formats for all models. Target responses are delexicalized using MultiWOZ 2.2 span annotation, and we limit the context to 5 utterances. MultiWOZ action labels include domain, action, and slot name, e.g., train-inform-price. We remove domains from the labels to limit data sparsity.

Training procedure

DE, AADE, PE and AAE are trained in two stages: the retrieval part is trained first and provides response hints to the generative model during the second stage. Modules in AARGH are trained jointly, but we alternate parameter updates for the retrieval encoder and the rest of the network, using two separate optimizers. AARGH's hints used in the response decoder during training are refreshed after every epoch. All models are optimized using Adam (kingma2014) and cosine learning rate decay with warmup. We set the batch-sampling parameters for the retrieval parts of AAE and AARGH (the number of sampled actions and of examples per action, see Section 3.4) with respect to the memory limits of our hardware.


We experiment with two α-blending values: a conservative one (low α) and a greedy one (high α), targeting a mostly generation-focused and a mostly retrieval-focused setting, respectively.555The values were chosen empirically, based on preliminary experiments on development data.


We use greedy decoding for dialog state update generation. For response generation, we report results with greedy decoding in Section 5 and with beam search in Appendix B.

5 Evaluation and Results

We focus on end-to-end modeling, which includes dialog state tracking and response generation. All reported results are on the MultiWOZ test set with 1,000 dialogs, averaged over 8 different random seeds. We generated responses given ground-truth contexts. We follow MinTL and predict the dialog state cumulatively for each conversation turn, which means that state tracking errors may compound. See Appendix C for an example end-to-end conversation without any ground-truth information.

Setting BLEU Action IoU % full match % no match % uniq. hints
Random 02.1 05.1 ± 0.2 01.1 85.0 93.5
DE 08.9 34.7 ± 0.5 11.0 29.7 54.1
AADE 07.9 30.9 ± 1.7 08.9 33.4 24.2
PE 08.8 35.0 ± 0.8 11.4 28.9 44.1
AAE 12.8 37.1 ± 0.2 14.5 28.6 88.6
AARGH 12.6 36.6 ± 0.2 14.2 29.0 89.6
Table 1: Evaluation of retrieval components of our models (Sections 3.3–3.5). See Section 5.1 for details.

5.1 Response Selection

Setting BLEU  Inform  Success Unique trigrams BCE Hint-BLEU Hint-copy Joint acc.
Corpus -  93.7  90.9 25,212 3.37 - - -
SOLOIST (peng2020) 13.6  82.3  72.4 07,923 2.41 - - -
PPTOD (pttod2021) 18.2  83.1  72.7 02,538 1.88 - - -
MTTOD (mttod2021) 19.0  85.9  76.5 04,066 1.93 - - -
MinTL (lin2020) 19.4  73.7  65.4 02,525 1.81 - - -
Gen (equiv. to MinTL) 18.6 ± 0.3  77.0 ± 1.2  66.4 ± 1.0 03,209 1.94 - - 54.1 ± 0.2
AAE (retrieval only) 12.8 ± 0.1  79.9 ± 0.6  58.3 ± 0.7 22,457 3.34 100.0 100.0 % -
Conservative α-blending:
DE +Gen 17.6 ± 0.3  80.9 ± 0.5  68.8 ± 0.6 08,190 2.36 32.5 15.2 % 54.2 ± 0.1
AADE +Gen 17.3 ± 0.3  81.2 ± 0.9  69.1 ± 1.0 06,613 2.29 26.7 12.8 % 54.3 ± 0.1
PE +Gen 17.4 ± 0.3  79.9 ± 0.9  66.8 ± 1.0 07,736 2.35 31.3 14.5 % 54.4 ± 0.2
AAE +Gen 17.5 ± 0.6 82.0 ± 1.0 70.3 ± 0.8 08,152 2.32 32.0 16.2 % 54.2 ± 0.2
AARGH 17.3 ± 0.3  81.2 ± 0.6  69.5 ± 0.5 08,200 2.33 28.4 14.2 % 53.8 ± 0.2
Greedy α-blending:
DE +Gen 12.3 ± 0.3  87.8 ± 0.3  69.1 ± 0.5 18,800 3.20 80.4 76.5 % 54.2 ± 0.2
AADE +Gen 14.6 ± 0.4  81.0 ± 0.8  66.7 ± 0.4 10,723 2.72 51.7 44.8 % 54.2 ± 0.1
PE +Gen 12.9 ± 0.4  86.0 ± 0.8  67.1 ± 0.6 16,632 3.13 74.0 69.1% 54.4 ± 0.1
AAE +Gen 11.9 ± 0.2 90.5 ± 0.3 71.3 ± 0.3 19,436 3.23 91.1 89.3 % 54.3 ± 0.2
AARGH 12.1 ± 0.2  89.6 ± 0.2  70.7 ± 0.5 19,813 3.21 87.6 85.0 % 53.6 ± 0.2
Table 2: Response generation and state tracking evaluation on MultiWOZ using automatic metrics, including the bi-gram conditional entropy (BCE) and number of unique trigrams. We compare previous work, the baseline and retrieval-based generative models. See Section 5.2 for details about the metrics and Sections 3–4 for model descriptions.

First, we assess the performance of retrieval components of DE, AADE, PE, AAE and AARGH. We cannot use the popular R@k metric (chaudhuri2018) as AAE and AARGH use embeddings of dialog contexts (not responses) of candidates as the search criterion and would always score 100%. Instead, we use the action annotation and measure the intersection over union (IoU), full-match and no-match rates on sets of actions associated with top-1 retrieved and ground-truth responses. We add BLEU (papineni2002; liu2016) between ground-truth and retrieved responses and the proportion of distinct retrieval outputs to assess their lexical similarity to references and diversity.
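These per-turn action overlap measures can be computed from the retrieved and gold action sets as follows (function names are ours):

```python
def action_overlap(retrieved_actions, gold_actions):
    # each argument: list of sets of action labels, one set per test turn
    n = len(gold_actions)
    iou = full = none = 0.0
    for r, g in zip(retrieved_actions, gold_actions):
        inter, union = len(r & g), len(r | g)
        iou += inter / union if union else 1.0
        full += float(r == g)   # exact set match
        none += float(inter == 0)  # no shared action at all
    return iou / n, full / n, none / n
```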

Table 1 shows that AAE and AARGH significantly outperform the other setups on all measures except for the no-match rate, where PE has comparable results.666According to a paired t-test with 95% confidence level. This is expected, as they use the additional action annotation during training, unlike DE and PE. AADE performs surprisingly badly. According to the unique-hints rate, AAE and AARGH retrieve a much wider range of outputs, which could improve the lexical diversity of final responses. Their higher BLEU, Action IoU and full-match rates suggest that these models retrieve responses more similar to the ground truth.

Silhouette coefficient DE AADE AAE AARGH
per Domain 0.098 0.179 0.151 0.159
per Action 0.147 0.316 0.312 0.320
Table 3: Evaluation of domain and action separation (Section 5.1). We show averages over 8 random seeds.

To further compare the approaches to response selection, we computed the Silhouette coefficient (rousseeuw1987) based on the active domain and action annotation (see Table 3).777In the case of action-based clustering, we treat each action as a separate cluster; each example can belong to multiple clusters. The clustering measure is calculated for each cluster and averaged over all actions, weighted by the sizes of the corresponding clusters. We omit PE because its context embeddings depend on the queries, i.e., the candidate embeddings (other models output the same context embedding regardless of candidates). DE has the worst results; the other systems perform similarly, but AARGH is the best at action separation while AADE has the best scores for domains.

We see that AADE’s context encoder is successful in clustering, but it lags behind in terms of correct action selection. Unlike AARGH and AAE, AADE retrieves candidates based on response embeddings. We hypothesize that lower response variability (compared to context variability) leads the model to prefer responses seen more frequently during training. AARGH and AAE are not affected by this as they use purely context-based retrieval.

Figure 3 provides a visualisation of the domain clusters projected using t-SNE (vandermaaten2008). It supports the findings of our Silhouette-coefficient evaluation: the visualisations of the AARGH and AADE embedding spaces look similar, whereas DE's clusters appear noisier.

5.2 Response Generation

System Response
User: I need a taxi to take me to Yippee Noodle Bar by 9:45.
Corpus Absolutely, where will the taxi be picking you up from today?
Gen I can help with that. Where are you departing from?
DE +Gen Where will you be departing from?
AARGH Okay, I can help with that. Where are you coming from?
AARGH Of course, I just need to know your destination so I can book that for you.
User: I am looking for an expensive hotel with free parking.
System: We have 5 places that offer all of that. Any preference on star rating or area?
User: No, that’s not important to me. But, I would prefer to stay in a guesthouse with free parking.
Corpus Sorry, there are no [type] [price] places with free parking. Perhaps you’d prefer a different type or price range?
Gen I’m sorry, but there are no [type] that meet your criteria. Would you like to try a different price range?
DE +Gen I’m sorry, but there are no [type] that meet your criteria. Would you like to try a different price range or area?
AARGH I’m sorry, but I don’t have any [type] that meet your criteria. Would you like to try a [type] or a [type]?
AARGH I’m sorry but I don’t have any [type] matching that criteria. Could we look at [type] instead?
User: Thank you so much. You have been very helpful.
Corpus Great! I hope you enjoy your stay!
Gen You’re welcome. Have a great day!
DE +Gen You’re welcome. Have a great day!
AARGH Is there anything else I can help you with today?
AARGH You are welcome. I can also help arrange transportation or suggest attractions, if you wish.
Table 4: Side-by-side system output comparison (see Sections 3 and 4 for model details). The responses are continuations of conversations SNG0016, SNG1048, and MUL2138 from MultiWOZ.

We evaluate the response generation abilities of our models using automatic metrics and human assessment of delexicalized texts (see Table 4 for examples).

Evaluation with automatic metrics

We use the corpus-based evaluator by nekvinda2021 to measure the metrics commonly used on MultiWOZ (Inform & Success rates, BLEU) as well as lexical diversity measures, namely the number of distinct trigrams in the outputs and the bigram conditional entropy (li_diversity-promoting_2016; novikova_lexical_2019). State tracking joint accuracy is calculated with scripts adapted from TRADE (trade2019). To better understand the effect of using retrieved hints and to quantify the amount of copying, we calculate BLEU between retrieved hints and the final generated responses (Hint-BLEU) and the proportion of generated responses exactly matching the corresponding retrieved hints (Hint-copy).
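The two diversity measures can be sketched as follows. Tokenization and the log base are assumptions here; the evaluator of nekvinda2021 may normalize differently.

```python
import math
from collections import Counter

def diversity(responses):
    # responses: list of token lists (delexicalized system outputs)
    trigrams = {tuple(r[i:i + 3]) for r in responses for i in range(len(r) - 2)}
    bigrams = Counter((r[i], r[i + 1]) for r in responses for i in range(len(r) - 1))
    first = Counter()
    for (w1, _), c in bigrams.items():
        first[w1] += c
    total = sum(bigrams.values())
    # bigram conditional entropy: H(w2 | w1) = -sum p(w1, w2) * log2 p(w2 | w1)
    bce = -sum((c / total) * math.log2(c / first[w1])
               for (w1, _), c in bigrams.items())
    return len(trigrams), bce
```

Higher values of both measures indicate more varied outputs; a model that always emits the same continuation after a given word scores a BCE of zero.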

We include comparisons with recent strong end-to-end models on MultiWOZ: SOLOIST (peng2020), MTTOD (mttod2021), PPTOD (pttod2021), and MinTL (lin2020), which has the same architecture as Gen. To show the importance of the generative parts of our models, we also include AAE without the refining decoder.

Table 2 shows scores obtained with greedy decoding (see Appendix B for beam search results). All models have similar state tracking performance. AARGH has slightly lower numbers, which is not surprising as it shares a substantial part of the encoder with its retrieval component. As expected, we notice a huge difference in Hint-BLEU and Hint-copy between the versions with different α-blending probabilities (conservative vs. greedy).888Hint-copy of 15% roughly means one turn per dialog. For the conservative variants, the performance boost over Gen and retrieval-only AAE shows mainly in terms of Success. In the greedy setting, more frequent hint copying reduces BLEU and improves lexical diversity; we also see higher Inform. AAE +Gen and AARGH (both conservative and greedy) perform better than the corresponding DE +Gen or PE +Gen on Inform and Success rates.999According to a paired t-test with 95% confidence level. Differences between AAE +Gen and AARGH are not statistically significant, and their Success scores are better than MinTL's, competitive with PPTOD and SOLOIST, but lower than MTTOD's. In terms of lexical diversity, all our models are better than most generative baselines.101010The conservative variants are similar to SOLOIST, which, however, reaches diversity by employing sampling (holtzman2020) instead of greedy decoding.

Setting Gen DE +Gen AARGH (conservative) AARGH (greedy)
Mean Ranking 2.03 1.99 1.91 2.30
Ranked #1 36.1% 35.5% 40.8% 37.9%
Ranked #2 34.7% 36.7% 33.5% 18.8%
Ranked #3 18.8% 20.5% 19.7% 18.8%
Ranked #4 10.4% 07.2% 06.1% 24.6%
Table 5: Human evaluation results – mean ranks (1-4) established from 50 evaluated conversations.

Human evaluation

We conducted an in-house human evaluation on the delexicalized outputs of Gen (i.e., MinTL's architecture), DE +Gen, and the conservative and greedy AARGH variants. We used side-by-side relative ranking, which has been repeatedly found to increase consistency compared to rating isolated examples (callison2007; belz_comparing_2010; kiritchenko_best-worst_2017). Participants were given the full dialog context and current database results, and we asked them to rank the compared models' responses from best-fitting to worst, where multiple responses could be ranked the same (see Appendix D for details). We collected rankings for 346 turns of 50 conversations from 5 linguists with experience in natural language generation. Each of them was given a different set of dialogs, and they were instructed to focus on consistency with the context and database results, naturalness, and attractiveness of the responses. See Table 5 for results.

Although the AARGH variant with the higher blending probability scored best on automatic metrics, it has a worse mean rank than the other models, which all have similar mean ranks (according to a Friedman test at the 95% confidence level with a Nemenyi post-hoc test; only the difference between this AARGH variant and the other models is statistically significant). This confirms previous findings of low correlation between automatic metrics and human assessments (Liu et al., 2016; Novikova et al., 2017). Upon detailed manual error analysis, we found that this variant often copies whole hints, including words that do not fit the context, i.e., contradictions to earlier statements or noisy non-delexicalized values from the training set. The conservative-blending AARGH performs slightly better than the baselines: it is most often ranked best and least often ranked worst.
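The significance test mentioned above can be sketched as a plain implementation of the Friedman rank statistic; this sketch ignores tie corrections and the Nemenyi post-hoc step, both of which a real analysis (e.g., via SciPy and scikit-posthocs) would include.

```python
def friedman_statistic(rank_matrix):
    """Friedman chi-square statistic, without tie correction.
    rank_matrix: one row per evaluated turn, one column per model,
    entries are ranks 1..k assigned within that turn."""
    n = len(rank_matrix)          # number of turns (blocks)
    k = len(rank_matrix[0])       # number of compared models
    col_sums = [sum(row[j] for row in rank_matrix) for j in range(k)]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in col_sums) - 3.0 * n * (k + 1)
```

Under the null hypothesis of no difference between models, the statistic is approximately chi-square distributed with k-1 degrees of freedom.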

6 Conclusion

We present AARGH, an end-to-end task-oriented dialog system combining retrieval and generative approaches. It uses an embedded single-encoder retrieval component which extends a purely generative model without requiring a large number of new parameters, and it features an action-aware response selection training objective. Our experiments on the MultiWOZ dataset show that AARGH outperforms baselines in terms of automatic metrics and human evaluation and is competitive with state-of-the-art models such as SOLOIST or MTTOD. We showed that our proposed action-aware retrieval training objective supports retrieval of a larger variety of unique and relevant responses in the task-oriented setting and makes efficient use of the available system action annotation. Furthermore, using the retrieval module improves dialog management in terms of the Success rate. A limitation of our approach is the need for careful hyperparameter tuning, coupled with the risk of overusing retrieved responses that match the dialog state but are not appropriate for the context.

In future work, we would like to confirm our results on more datasets and explore more sophisticated ways of using the retrieved responses, encouraging the model to copy interesting language structures while ignoring inappropriate tokens or relics of faulty delexicalization.


Acknowledgements

This research was supported by Charles University projects GAUK 373921, SVV 260575 and PRIMUS/19/SCI/10, and by the European Research Council (Grant agreement No. 101039303 NG-NLG). It used resources provided by the LINDAT/CLARIAH-CZ Research Infrastructure (Czech Ministry of Education, Youth and Sports project No. LM2018101).


Appendix A Model Architectures

Figure 4 shows the architectures of the baseline (Gen), the dual-encoder-based model (DE), and the single-encoder action-aware model (AAE). See Figure 1 for details on AARGH and Section 3 for a description of the models.

Figure 4: Architecture of the baseline (Gen, top), dual-encoder-based model (DE, middle) and single-encoder action-aware model (AAE, bottom). Numbers in module boxes mark the order of processing during inference.

Appendix B Beam Search Results

See Table 6 for the results of beam search-based response generation, and compare them with the greedy decoding results (see Section 5.1 and Table 2). For all models, we used a beam size of 8 during decoding.

With conservative blending, beam search decoding results in higher lexical diversity for all retrieval-augmented systems. However, the gains in Inform and Success rates are mostly very small or absent for AADE and AARGH. All BLEU scores are slightly lower, which is consistent with the higher output diversity. The baseline without a retrieval component shows the opposite trend: beam search decoding yields lower lexical diversity and higher BLEU. We attribute this to beam search preferring safer responses with a higher overall probability.

With higher blending, the differences become small even for lexical diversity. We hypothesize that the retrieval-based models are not substantially influenced by the particular response decoding strategy because they rely strongly on copying the retrieved hints.
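The decoding strategy compared here can be sketched as a generic beam search over a next-token model; this is an illustrative implementation (without length normalization), not the exact decoder used in the experiments, and the `next_logprobs` interface is a hypothetical stand-in for the model's output distribution.

```python
import math

def beam_search(next_logprobs, beam_size=8, max_len=20, eos="</s>"):
    """next_logprobs(prefix) -> {token: log-probability} for the next step.
    Expands every live hypothesis and keeps the `beam_size` best prefixes
    by cumulative log-probability; returns the highest-scoring sequence."""
    beams = [((), 0.0)]          # (token tuple, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, lp in next_logprobs(prefix).items():
                hyp = (prefix + (token,), score + lp)
                (finished if token == eos else candidates).append(hyp)
        if not candidates:
            break
        beams = sorted(candidates, key=lambda h: h[1], reverse=True)[:beam_size]
    return max(finished or beams, key=lambda h: h[1])
```

Because whole-sequence probability is maximized, beam search can recover a high-probability continuation that greedy decoding misses after its first locally optimal token choice.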

Setting   Num. trigrams   Bi-gram entropy   Hint-copy
Gen        2683           1.81              -

DE        10098           2.49              25.2%
AADE       7378           2.33              19.2%
PE         9470           2.48              24.4%
AAE       10457           2.46              29.0%
AARGH      9072           2.36              22.2%

DE        19103           3.28              79.6%
AADE      10997           2.76              50.2%
PE        17178           3.19              74.2%
AAE       19448           3.21              90.1%
AARGH     19763           3.22              86.0%

Table 6: Beam search-based response generation on MultiWOZ, evaluated with automatic metrics. For each model setup, we use a beam size of 8 during response decoding and report results averaged over 8 random seeds. We compare the baseline (Gen) and retrieval-based generative models (see Sections 3 and 4). See Section 5.2 for details on the metrics. Cf. Table 2 for results obtained with greedy decoding.

Appendix C End-to-end Conversation

Figure 5 shows a multi-domain (restaurant and taxi) end-to-end conversation between the user and our retrieval-based AARGH model (see Section 3.5).

Figure 5: End-to-end conversation between the user and our retrieval-based AARGH model with conservative blending (see Section 3). For the system turns, we show the delexicalized hints proposed by the retrieval module (left boxes, in italics) and the corresponding lexicalized final responses (right boxes). We highlight the parts of the hints present in the final texts and the parts of the final responses newly introduced by the model during refinement.
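The lexicalization step visible in the figure — filling delexicalized slot placeholders in a hint or generated response with database values — can be sketched as follows; the bracketed placeholder format and slot names are illustrative assumptions, not the exact MultiWOZ conventions.

```python
import re

def lexicalize(delex_response, db_entry):
    """Replace delexicalized placeholders such as [restaurant_name] with
    values from a database entry; placeholder names here are hypothetical."""
    def fill(match):
        slot = match.group(1)
        # Leave unknown slots untouched rather than guessing a value.
        return str(db_entry.get(slot, match.group(0)))
    return re.sub(r"\[([a-z_]+)\]", fill, delex_response)
```

Faulty delexicalization in the training data (e.g., values never replaced by placeholders) is exactly what lets noisy literal values leak into copied hints, as noted in the error analysis above.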

Appendix D Human Evaluation Interface

We used the graphical user interface depicted in Figure 6 for human evaluation. The full dialog context, i.e., all past utterances up to the given turn, and the number of database results were shown to participants. We asked them to rank the provided responses from best to worst. Each participant evaluated only two conversations in a single run, and we sampled the conversations from the test set so that all participants received roughly the same number of turns to assess. Evaluated responses were shown side by side; each had a dedicated discrete scale from 1 to 4, where 1 was labeled best and 4 worst. Multiple responses could receive the same rank. Participants could move forward and backward in the conversations and could switch to another conversation at any time.

Figure 6: Our graphical user interface used for human evaluation.