1.1 Chit-Chat Dialogue Systems
Chit-chat systems are built to converse freely with users: they do not aim to achieve any particular goal, but rather to entertain and engage. The first system of this kind traces back to ELIZA [weizenbaum1966eliza], a rule-based system that uses pre-defined patterns and templates to generate the system response. This kind of rudimentary, hand-crafted system rarely generalizes to unseen conversations. Therefore, statistical chit-chat systems have been proposed [serban2015survey] to overcome this problem. These models leverage transcripts of natural spoken conversations or crowd-sourced datasets to learn dialogue models that respond like humans. These systems use either Sequence-to-Sequence (Seq2Seq) models [sutskever2014sequence] or retrieval systems [isbell2000cobot, zhou2020design], and they are trained end-to-end on human-to-human conversations.
More recently, large pre-trained language models [peters2018deep, radford2019language, raffel2019exploring, shin2020generating, lin2019moel] have greatly improved the state-of-the-art in many downstream tasks. These language models are trained using the simple log-likelihood objective over large amounts of unlabeled data (e.g., Wikipedia articles). This approach results in large, powerful language models that produce coherent text and can be used to perform unconditional language generation. Chit-chat conversational systems are a special case of language modelling in which the prefix is the dialogue history and the continuation is a human-like response [wolf2019transfertransfo]. Recently, large pre-trained language models trained on unlabeled human-to-human conversations (e.g., from Reddit) [zhang2019dialogpt, adiwardana2020towards, roller2020recipes] have shown excellent performance in modelling human responses. Figure 1.1 provides a high-level overview and examples of a statistical chit-chat dialogue system.
1.2 Module-based Task-Oriented Dialogue Systems
Task-oriented dialogue systems are usually built from several modules: Automatic Speech Recognition (ASR), Natural Language Understanding (NLU), a Dialogue Manager (DM), and Natural Language Generation (NLG); thus they are called module-based [williams2007partially, hori2009statistical, lee2009example, levin2000stochastic, young2013pomdp]. Figure 1.2 provides a high-level overview of a module-based task-oriented dialogue system. In such a system, the modules are applied sequentially to produce the system response. For instance:
the Natural Language Understanding (NLU) [raymond2007generative, deng2012use, yao2014spoken, guo2014joint, zhang2016joint, liu2020crossner, liu2020importance, wu2019getting, wu2021qaconv] module extracts a semantic frame from the user utterance, which includes the domain, the intent, and the slots triggered in the current turn. The domain specifies the general topic of the request (e.g., banking), the intent specifies what the user wants to achieve (e.g., getting information about a bank account), and the slots are the specific names and values of the goal (e.g., savings account). Hence, this module includes domain and intent classifiers [tur2012towards, chen2016zero, liu2019zero] and a slot tagger [nguyen2007comparisons, mesnil2014using]. The latter is a parsing task that assigns predefined slot types to words.
the Dialogue Manager [rudnicky1999agenda, young2006using, young2013pomdp] uses the semantic frame provided by the NLU module to generate a system action, a.k.a. speech-act. The latter is a semantic class which represents a high-level description of the response (e.g., request_location). This module is made of a Dialogue State Tracker (DST) and a Dialogue Policy (DP). The DST uses the provided frame to update the dialogue state, a global semantic frame. DSTs are implemented either with hand-crafted features, a complex domain-specific lexicon, and a domain ontology [williams2007partially, thomson2010bayesian, henderson2014robust], or with statistical models [williams2016dialog, mrkvsic2016neural]. On the other hand, the DP [li2009reinforcement, lipton2018bbq] is a classifier that maps the dialogue state to the possible system actions, usually trained using reinforcement learning. Importantly, the DP also issues actions for querying external knowledge bases using slot values present in the dialogue state.
the Natural Language Generation module uses the system action produced by the Dialogue Manager to generate plain text responses. Traditionally, this module is implemented using template responses [busemann1998flexible], while more recently, statistical methods [wen2015semantically, press2017language, winata2020learning, winata2019code, xu2019clickbait] based on Seq2Seq models have been proposed.
Other permutations of these modules have also been explored. For instance, several systems [rastogi2017scalable, ramadan2018large, zhong2018global, lin2021leveraging] remove the NLU module and replace it with only a DST. In contrast, others build a very strong NLU module and ignore the DST [chen2018gunrock].
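The NLU → DM → NLG pipeline described above can be sketched in a few lines of Python. This is purely a toy illustration: the keyword rules, templates, and function names below are hypothetical stand-ins for trained classifiers, a slot tagger, a learned policy, and a real template bank.

```python
# Toy module-based pipeline: NLU -> Dialogue Manager (DST + DP) -> NLG.
# All rules and templates below are hypothetical, for illustration only.

def nlu(utterance):
    # Stub NLU: keyword rules standing in for domain/intent classifiers
    # and a slot tagger.
    frame = {"domain": None, "intent": None, "slots": {}}
    if "account" in utterance:
        frame["domain"] = "banking"
        frame["intent"] = "get_account_info"
    if "savings" in utterance:
        frame["slots"]["account_type"] = "savings"
    return frame

def dialogue_manager(frame, state):
    # DST: merge the turn-level frame into the global dialogue state.
    state["slots"].update(frame["slots"])
    state["intent"] = frame["intent"] or state.get("intent")
    # DP: map the updated state to a system action (speech-act).
    if state["intent"] == "get_account_info" and "account_type" in state["slots"]:
        return "inform_balance", state
    return "request_account_type", state

def nlg(action):
    # Template-based NLG.
    templates = {
        "inform_balance": "Your savings account balance is $100.",
        "request_account_type": "Which account would you like to check?",
    }
    return templates[action]

state = {"slots": {}}
frame = nlu("what is the balance of my savings account")
action, state = dialogue_manager(frame, state)
print(nlg(action))
```

Note how the dialogue state persists across turns: if the first utterance had not mentioned an account type, the DP would have issued `request_account_type` and waited for the slot to be filled in a later turn.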
1.3 End-to-End Task-Oriented Dialogue Systems
End-to-end task-oriented dialogue systems [eric-manning:2017:EACLshort, madotto2018mem2seq, wu2019global, reddy2018multi, yavuzdeepcopy, bordes2016learning, madotto2020learning, lin2020mintl, lin2021bitod] have been proposed to reduce the complexity of modularized systems. Differently from module-based dialogue systems, end-to-end systems train a single model directly on text transcripts of the dialogues. This task is tackled in two ways: response selection [bordes2016learning, perez2016gated, wu2017dstc6, dqmem, williams2017hybrid, seo2016query] and token-by-token generation of the response [madotto2018mem2seq, wen2016network, serban2016building, zhao2017generative, serban2017hierarchical, madotto2020exploration]. In this thesis, we especially focus on the latter, which is indeed very similar to the Seq2Seq models used in chit-chat dialogue systems. The major difference is that task-oriented systems have to generate informative responses grounded on knowledge present in various kinds of databases (e.g., knowledge graphs). A standard strategy for end-to-end dialogue systems is to first generate API-CALLs to retrieve knowledge, and then to provide the retrieved knowledge as further input to the model. Figure 1.3 shows a system where the same Seq2Seq model generates both API-CALLs and system responses by using the knowledge as further input.
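The API-CALL strategy can be sketched as follows. This is a minimal illustration: the `seq2seq` stub and the one-entry knowledge base are hypothetical stand-ins for a trained Seq2Seq model and a real external database.

```python
# Two-stage end-to-end strategy: the same model first emits an API-CALL,
# the retrieved knowledge is appended to the context, and the model then
# emits the final response. The lookup table and stub outputs below are
# hypothetical, standing in for a trained model and a real KB.

KB = {("restaurant", "rome"): "Trattoria Roma, 4.5 stars"}

def seq2seq(context):
    # Stub for a trained Seq2Seq model: maps a context to the next output.
    if "[KB]" not in context:
        return "api_call restaurant rome"
    return "I suggest Trattoria Roma, it has 4.5 stars."

def respond(dialogue_history):
    out = seq2seq(dialogue_history)
    if out.startswith("api_call"):
        _, domain, city = out.split()
        knowledge = KB[(domain, city)]
        # Re-run the same model with the retrieved knowledge as input.
        out = seq2seq(dialogue_history + " [KB] " + knowledge)
    return out

print(respond("user: find me a restaurant in rome"))
```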
1.4 Motivation and Research Problem
With the emergence of powerful “deep learning” architectures, end-to-end generative dialogue systems have been proposed to optimize overall system performance and simplify training. However, these systems cannot be easily controlled or extended like modularized systems. This is because a single neural model is used, usually a large pre-trained language model (e.g., GPT-2), and thus it is hard to surgically change desirable attributes (e.g., style, topics, etc.). More importantly, uncontrollable dialogue systems can generate offensive or even toxic responses.
In modularized/rule-based systems, adding a new dialogue domain, for example, only requires adding rules and functions to the existing codebase, while in end-to-end models this step requires an expensive training process, namely, retraining the entire model on all the domains. Similarly, template-based chit-chat systems can easily control the generated response in terms of style and topics, but this is extremely challenging for generative conversational models.
Generally speaking, controlling dialogue systems is particularly important not only from a research perspective but also to meet an industrial need. In fact, existing smart assistants on the market (e.g., Amazon Alexa, Apple’s Siri, etc.) are slowly moving from modularized systems to end-to-end solutions. One of the main obstacles preventing the wide deployment of these systems is the lack of explicit control over the generated responses. This is especially important in larger systems, where only part of the technology is actually a neural network, while the rest is hard-coded rules. In such cases, the system developer requires specific behaviours from the end-to-end models (e.g., positive responses, or responses from a certain domain). Therefore, having high- and low-level control over different attributes of the dialogue response is an essential capability.
In this thesis, we study controllable methods for end-to-end generative dialogue systems. In particular, we focus on the following three directions.
Controlling Style & Topics. see2019makes showed that the ability to control the response generation can have a significant impact on the quality of conversations. However, controlled generation from large conversational models such as DialoGPT [zhang2019dialogpt], Meena [adiwardana2020towards] and Blender-Bot [roller2020recipes] remains a challenge, and is particularly difficult in the absence of annotated conversational datasets. Therefore, we propose to use residual adapters [houlsby2019parameter], which add less than 1.5% task-specific parameters per style/topic, to make controllable response generation viable for online systems, and we run a comprehensive automatic and human evaluation to show that our method can control the generated responses in terms of style and topics, without losing fluency and without requiring dialogue-specific datasets [madotto2020plug].
Controlling Dialogue Domains Continuously. The ability to continuously update dialogue systems with new features based on the user’s needs, e.g., adding new slots and intents, or even completely new domains, is essential for a robust and deployable dialogue system. However, existing end-to-end dialogue models are trained with the assumption that the dataset and architecture are fixed at the beginning of training, and they are not designed to add new domains and functionalities over time without incurring the high cost of retraining the whole system. Therefore, we propose a highly controllable architectural method based on residual adapters [houlsby2019parameter] to continuously update task-oriented dialogue systems with new features based on the user’s needs. Moreover, we analyze the trade-off between performance, number of parameters, and episodic memory sizes of other existing methods (regularization, rehearsal, architectural) [madotto2020continual].
Controlling Multi-Skill Dialogue Systems. Unlike humans, who can do both, systems for goal-oriented dialogues [williams2007partially, young2013pomdp] and chit-chat conversations [serban2016generative, vinyals2015neural] are often learned with separate models, sometimes rule-based. However, end-to-end dialogue models share the same Seq2Seq architecture for both chit-chat and task-oriented systems, and these models greatly suffer from a lack of controllability and flexibility. Therefore, we propose a novel theoretical framework to control an end-to-end dialogue model with multiple composable and controllable skills, and we empirically show the effectiveness of using specialized parameters on combined chit-chat and task-oriented datasets [madotto2020attention, lin2021adapter].
1.5 Thesis Outline
The thesis is divided into four main chapters, plus a conclusion. In Chapter 2, we introduce the background notation and methodology used throughout the thesis. In Chapter 3, we describe how to control the style and topics of large generative conversational models. In Chapter 4, we describe a flexible dialogue system that is able to add conversational domains continuously. In Chapter 5, we propose a novel way to control an end-to-end dialogue model with multiple composable and controllable skills. Finally, in Chapter 6, we summarize the thesis and the significance of controllable dialogue systems, and we discuss possible future research directions.
2.1 Basic notation
Let us define a generic supervised dataset $\mathcal{D} = \{(x^i, y^i)\}_{i=1}^{n}$, where $x^i$ is a feature vector and $y^i$ its corresponding label. The dataset is split into three non-overlapping sets, namely training, validation, and testing ($\mathcal{D}_{\text{train}}$, $\mathcal{D}_{\text{valid}}$, $\mathcal{D}_{\text{test}}$). These sets are used for training the model, estimating the performance of different hyper-parameters, and estimating the generalization error, respectively.
In this thesis, we focus on two settings: classification and sequence generation. We further define these two settings by specifying the input-output pairs $(x, y)$. In both settings, $x$ is a sequence of tokens $x_1, \dots, x_n$, which represents an ordered sequence of words, e.g., sentences in a paragraph or utterances in a dialogue. Differently, $y$ is a value from the set of classes $\mathcal{C}$ in the classification setting and a sequence of tokens $y_1, \dots, y_m$ in the sequence generation setting.
Independently of the setting, we focus on discriminative models, parameterized by $\theta$, that learn the conditional probability $p_\theta(y|x)$. In sequence generation, we further expand $p_\theta(y|x)$ using the chain rule of probability [bengio2003neural] as

$$p_\theta(y|x) = \prod_{t=1}^{m} p_\theta(y_t \mid y_{<t}, x).$$

In the coming sections, we introduce model instances for $p_\theta$, in both classification and sequence generation, and the loss functions used to train them.
Before diving into the modelling, let us define a vocabulary $\mathcal{V}$ as the set of unique words appearing in a large text corpus. Each token $x_t$ is converted into its 1-hot representation, denoted as $\mathbf{x}_t \in \{0,1\}^{|\mathcal{V}|}$, where $|\mathcal{V}|$ is the cardinality of the vocabulary. In $\mathbf{x}_t$, only the element corresponding to the token is set to 1, and all the others to 0. Following this notation, the input becomes a matrix $X \in \{0,1\}^{n \times |\mathcal{V}|}$, where each row is a vector of vocabulary-size dimension.
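As a concrete illustration, the 1-hot construction can be sketched in NumPy (the toy three-word vocabulary is, of course, hypothetical):

```python
import numpy as np

# Toy vocabulary mapping each word to an index.
vocab = {"<pad>": 0, "hello": 1, "world": 2}

def one_hot(tokens, vocab):
    # Build the n x |V| 1-hot matrix: one row per token,
    # with a single 1 at the token's vocabulary index.
    X = np.zeros((len(tokens), len(vocab)))
    for i, tok in enumerate(tokens):
        X[i, vocab[tok]] = 1.0
    return X

X = one_hot(["hello", "world"], vocab)  # each row sums to 1
```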
2.2 Word Embedding
The basic block of any NLP-based neural network is the embedding matrix, which maps input tokens into their embedded representations. Let us define the word embedding matrix [mikolov2013distributed, xu2018emo2vec] $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, where $d$ is the embedding size. The embedding matrix is multiplied by the input sequence $X$ (we drop the example index to improve readability) to obtain its embedded representation. We denote this transformation as $X_{\text{emb}} = XE$, where $X_{\text{emb}} \in \mathbb{R}^{n \times d}$ is the resulting embedding for each of the $n$ tokens in the input sequence. In most neural network libraries (e.g., PyTorch [NEURIPS2019_9015], TensorFlow [tensorflow2015-whitepaper]), this operation is done efficiently by using the token index to select the corresponding row of $E$. In this case, the input matrix is transformed back to a vector in which each position holds the index of the 1 in the corresponding row. Figure 2.1 shows an example of transforming a sentence into its embedded representation. Using this notation, the embedding operation is denoted as $X_{\text{emb}} = \mathrm{Emb}(X)$. In the rest of the thesis, we use these two notations interchangeably.
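The equivalence between multiplying by the 1-hot matrix and selecting rows of the embedding matrix can be verified with a small NumPy sketch (dimensions chosen arbitrarily for illustration):

```python
import numpy as np

np.random.seed(0)
V, d = 5, 4                      # vocabulary size and embedding size
E = np.random.randn(V, d)        # embedding matrix, |V| x d

ids = np.array([2, 0, 3])        # token indices for a 3-token sentence
X = np.eye(V)[ids]               # 1-hot matrix, shape (3, V)

emb_matmul = X @ E               # multiplication by the 1-hot matrix
emb_lookup = E[ids]              # row selection, as done by libraries
assert np.allclose(emb_matmul, emb_lookup)
```

The row-selection form avoids materializing the sparse 1-hot matrix, which is why libraries implement embedding layers this way.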
2.3 Feed-Forward Neural Network
A feed-forward (FF) neural network is a parametric model made of affine transformations and non-linear activations. More formally, given an input vector $h_0 = x$, a FF network with $L$ layers computes the function

$$h_l = \sigma(W_l h_{l-1} + b_l), \quad l = 1, \dots, L,$$

where each $W_l$ and $b_l$ are trainable parameters, and $\sigma$ is an activation function, e.g., the sigmoid. Without loss of generality, Equation 2.3 describes a FF neural network in which each transformation has the same size (i.e., $W_l \in \mathbb{R}^{d \times d}$) and uses the same activation function (e.g., sigmoidal). In general, the dimensions of the transformations are arbitrary and different activation functions can be used (e.g., ReLU).
Finally, to model the conditional probability $p_\theta(y|x)$, the output of the last layer, $h_L$, is projected to the classification output space using a linear transformation and a Softmax function,

$$p_\theta(y|x) = \mathrm{Softmax}(W_{\text{out}} h_L),$$

where $W_{\text{out}} \in \mathbb{R}^{|\mathcal{C}| \times d}$. Therefore, we denote the model prediction for the class as

$$\hat{y} = \arg\max_{c \in \mathcal{C}} p_\theta(y = c \mid x),$$

where $\hat{y}$ is the predicted class. Furthermore, we denote the overall model's parameters with the set $\theta = \{W_1, b_1, \dots, W_L, b_L, W_{\text{out}}\}$. These parameters are optimized by minimizing the cross-entropy loss between the prediction and the gold label from the dataset $\mathcal{D}$. Formally, the loss is computed as

$$\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log p_\theta(y^i \mid x^i).$$

In this equation, we purposely treat the operation as vectorial, and thus omit the further summation over the number of classes. The loss function is minimized using the gradient descent algorithm (i.e., backpropagation [rumelhart1986learning]). A high-level description of a FF neural network is shown in Figure 2.2.
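The FF forward pass and cross-entropy loss described above can be sketched in NumPy. This is a minimal illustration with arbitrary dimensions and random, untrained weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

np.random.seed(0)
d, n_classes, L = 4, 3, 2            # hidden size, classes, layers
Ws = [np.random.randn(d, d) * 0.1 for _ in range(L)]
bs = [np.zeros(d) for _ in range(L)]
W_out = np.random.randn(n_classes, d) * 0.1

def ff_forward(x):
    h = x
    for W, b in zip(Ws, bs):
        h = sigmoid(W @ h + b)       # affine transformation + non-linearity
    return softmax(W_out @ h)        # classification head

x = np.random.randn(d)
p = ff_forward(x)                    # probability distribution over classes
loss = -np.log(p[1])                 # cross-entropy, assuming gold class 1
```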
2.4 Recurrent Neural Networks
Differently from FF neural networks, recurrent neural networks (RNNs) process a sequence of inputs, e.g., a sequence of words, and are able to capture temporal dependencies in the input.
Given the input sequence $X_{\text{emb}} = (x_1, \dots, x_n)$, where each $x_t \in \mathbb{R}^{d}$, an RNN processes one input at a time by updating a hidden state vector $h_t$. The RNN forward function for an input at time step $t$ is expressed as:

$$h_t = \sigma(W x_t + U h_{t-1} + b),$$

where $W \in \mathbb{R}^{d_h \times d}$, $U \in \mathbb{R}^{d_h \times d_h}$, and $b \in \mathbb{R}^{d_h}$. Notice that $W$ and $U$ are parameters shared across time steps. We denote the set of hidden states at each time step as $H = (h_1, \dots, h_n)$. These hidden states can be used to represent the temporal dependency between "input features", e.g., the word sequence in a sentence. Figure 2.3 shows a high-level description of the model. The RNN described in Equation 2.7 suffers from the gradient vanishing and explosion problems [pascanu2013difficulty]. To cope with this issue, two variants of this simple architecture have been proposed: the Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU).
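The vanilla RNN recurrence of Equation 2.7 can be sketched in NumPy as follows (arbitrary dimensions and random weights; a toy illustration only):

```python
import numpy as np

np.random.seed(0)
d_in, d_h, T = 3, 4, 5                       # input size, hidden size, steps
W = np.random.randn(d_h, d_in) * 0.1         # input-to-hidden weights
U = np.random.randn(d_h, d_h) * 0.1          # hidden-to-hidden (shared)
b = np.zeros(d_h)

xs = np.random.randn(T, d_in)                # embedded input sequence
h = np.zeros(d_h)                            # initial hidden state
H = []
for x_t in xs:                               # one input at a time
    h = np.tanh(W @ x_t + U @ h + b)         # recurrent update
    H.append(h)
H = np.stack(H)                              # hidden state at each step
```

The sequential loop is exactly the non-parallelizable operation discussed later in the Transformer section: each $h_t$ depends on $h_{t-1}$.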
The LSTM [hochreiter1997long] uses three gates (input gate, forget gate, and output gate) to modulate the amount of information to be stored in the hidden memories (the hidden state and the context state). Given the input sequence $X_{\text{emb}} = (x_1, \dots, x_n)$, where each $x_t \in \mathbb{R}^{d}$, the LSTM for an input at time step $t$ computes

$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}$$

where each $W_* \in \mathbb{R}^{d_h \times d}$, $U_* \in \mathbb{R}^{d_h \times d_h}$, and $b_* \in \mathbb{R}^{d_h}$. The $\odot$ denotes the Hadamard product (i.e., element-wise matrix multiplication). The vectors $h_t$ and $c_t$ are usually called the hidden state and the context state. The LSTM greatly reduces the gradient vanishing problem, since the context-state update does not use a sigmoidal ($\sigma$) activation.
The GRU [cho2014learning], similarly to the LSTM, modulates how much information is stored in the hidden memory. However, it greatly simplifies the model's architecture and uses only one hidden state, as in the original RNN. Given the input sequence $X_{\text{emb}} = (x_1, \dots, x_n)$, where each $x_t \in \mathbb{R}^{d}$, the GRU for an input at time step $t$ computes

$$\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}$$

where each $W_* \in \mathbb{R}^{d_h \times d}$, $U_* \in \mathbb{R}^{d_h \times d_h}$, and $b_* \in \mathbb{R}^{d_h}$. Differently from the LSTM, the GRU architecture uses a single hidden state $h_t$, but a similar gating mechanism to avoid gradient vanishing.
RNNs are often used in three settings: classification, sequence tagging, and sequence-to-sequence generation. In the classification setting, the model is trained using the same loss function as in Equation 2.6, where the FF network is replaced by an RNN and the last hidden state is used for the final linear transformation in Equation 2.5. Differently, in the sequence tagging and sequence-to-sequence settings, the RNN is trained to make a sequence of classifications. In this thesis, we focus on the sequence-to-sequence setting; interested readers can refer to goodfellow2016deep for more information about sequence tagging tasks.
Sequence-to-sequence (Seq2Seq) [sutskever2014sequence] is an encoder-decoder architecture used to process sequences in both the input and the output. In this section, we describe an RNN-based Seq2Seq model for language generation, thus leveraging the RNN in Section 2.4 and the word embedding in Section 2.2.
The objective of a Seq2Seq model is to approximate the conditional probability $p_\theta(y|x)$, where both $x$ and $y$ are sequences, e.g., sequences of words in a sentence. Formally, a Seq2Seq model generates a sequence of probability distributions, each of which is conditioned on the previously generated tokens and the input sequence (Equation 2.1).
The Encoder gets the tokens in $x$, computes their embedded representation using the embedding matrix, and passes it through an RNN (we use the simple RNN, but the same explanation holds for both the LSTM and GRU). Hence, given $X_{\text{emb}}$, each token in $x$ is processed by

$$h_t^{\text{enc}} = \sigma(W x_t + U h_{t-1}^{\text{enc}} + b).$$

The last hidden state $h_n^{\text{enc}}$ is used as the initial hidden state of the decoder. The decoder, which is another RNN, generates the output sequence $y$ token by token. Hence, given $h_n^{\text{enc}}$ as the initial hidden state, the decoder computes

$$p_\theta(y_t \mid y_{<t}, x) = \mathrm{Softmax}(W_{\text{out}} h_t^{\text{dec}}),$$

where $W_{\text{out}}$ is a linear transformation, similar to the FF case, that maps the hidden state into a vector of vocabulary-length size. Then, the Softmax activation normalizes the vector to obtain a probability distribution over the vocabulary $\mathcal{V}$. Figure 2.4 shows a high-level description of an RNN-based Seq2Seq model. Given a dataset of input-output pairs $\mathcal{D}$, the parameters of the model are optimized by minimizing

$$\mathcal{L}(\theta) = -\sum_{i=1}^{n} \sum_{t=1}^{m} \log p_\theta(y_t^i \mid y_{<t}^i, x^i),$$

where $y_t^i$ is the $t$-th word in the $i$-th example. As in Equation 2.6, this loss is minimized using gradient descent. During testing, the model produces the output sequence in an auto-regressive manner [graves2013generating], i.e., providing the generated token as input to the next generation time step. Seq2Seq is a powerful generative model, but it still struggles to capture the long-term dependency between the generated tokens and the input sequence. To cope with this issue, the attention mechanism [bahdanau2014neural, luong2015effective] has been proposed.
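Auto-regressive greedy decoding can be sketched as follows. The `next_token_probs` stub and the four-word vocabulary are hypothetical stand-ins for a trained decoder and its Softmax output:

```python
import numpy as np

vocab = ["<eos>", "a", "nice", "day"]

def next_token_probs(prefix):
    # Stub for a trained decoder: deterministically walks "a nice day <eos>".
    script = {(): 1, (1,): 2, (1, 2): 3, (1, 2, 3): 0}
    p = np.full(len(vocab), 0.01)
    p[script[tuple(prefix)]] = 0.97
    return p / p.sum()

def greedy_decode(max_len=10):
    out = []
    for _ in range(max_len):
        tok = int(np.argmax(next_token_probs(out)))
        if tok == 0:                 # stop at <eos>
            break
        out.append(tok)              # feed the generated token back in
    return [vocab[t] for t in out]

print(greedy_decode())
```

Beam search and sampling follow the same loop; only the token-selection rule changes.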
Attention is used to learn an alignment between the generated tokens and the input $x$. In practice, this is done by scoring the current hidden state of the decoder against all the encoder hidden states. Hence, given the encoder hidden states $h_1^{\text{enc}}, \dots, h_n^{\text{enc}}$ and the decoder hidden state at time step $t$ as $h_t^{\text{dec}}$, the attention module computes the context vector $c_t$ as:

$$\alpha_t = \mathrm{Softmax}\big(\mathrm{Score}(h_t^{\text{dec}}, h_i^{\text{enc}})_{i=1}^{n}\big), \qquad c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i^{\text{enc}},$$

where $\alpha_t$ is the attention vector, and Score is a function chosen among three options: dot product, bi-linear, and neural network. These are formally

$$\mathrm{Score}(h^{\text{dec}}, h^{\text{enc}}) =
\begin{cases}
{h^{\text{dec}}}^{\top} h^{\text{enc}} & \text{dot product} \\
{h^{\text{dec}}}^{\top} W_a\, h^{\text{enc}} & \text{bi-linear} \\
v_a^{\top} \tanh\big(W_a\, [h^{\text{dec}}; h^{\text{enc}}]\big) & \text{neural network}
\end{cases}$$

where $[\cdot\,;\cdot]$ is the concatenation of the two vectors. Figure 2.5 shows a Seq2Seq model with attention over one decoding step. The attention mechanism is extremely important for the success of Seq2Seq architectures, up to the point of removing the RNN in favor of a (self-)attention-based mechanism [vaswani2017attention].
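A single decoding step of dot-product attention can be sketched in NumPy (random vectors standing in for trained hidden states):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

np.random.seed(0)
T, d = 4, 6
enc_h = np.random.randn(T, d)        # encoder hidden states h_1..h_T
dec_h = np.random.randn(d)           # decoder hidden state at step t

scores = enc_h @ dec_h               # dot-product Score for each input step
alpha = softmax(scores)              # attention vector over the input
context = alpha @ enc_h              # context vector: weighted sum of states
```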
As described in Section 2.5, RNNs are the basic block for both the encoder and the decoder in Seq2Seq models, and the attention mechanism (Section 2.5.1) plays a key role in the success of the overall architecture. However, RNNs, including LSTMs and GRUs, require a non-parallelizable operation for computing the hidden states of the sequence, which greatly limits the speed of the model. To cope with this temporal-dependency issue and to fully exploit the attention mechanism, vaswani2017attention proposed an RNN-free model: the Transformer.
Similar to RNN-based Seq2Seq, the Transformer is made of 1) an encoder, 2) a decoder, 3) an embedding matrix, and 4) a positional embedding matrix. The latter is particularly important for capturing the temporal dependency in the input. Differently from Seq2Seq, however, both the encoder and the decoder rely only on a generalized attention mechanism and FF networks.
2.6.1 Generalized Attention
Given the input sequence $X_{\text{emb}} \in \mathbb{R}^{n \times d}$, the generalized attention module computes:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,$$

where $Q = K = V = X_{\text{emb}}$ and $d_k$ is the dimension of the keys. This is a generalization of the attention process in Equations 2.18, 2.19 and 2.20: instead of considering each single hidden state and looping over it, we put all the hidden states into a matrix and compute all the attention scores in a single operation. The output of the attention module is indeed a matrix of the same size as the input $X_{\text{emb}}$.
vaswani2017attention also introduce a multi-head attention mechanism, where projection matrices ($W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$) are used to project $Q$, $K$ and $V$ into sub-matrices, and the attention is run in parallel on these transformations. More formally, a multi-head attention module with $h$ heads computes:

$$\mathrm{MultiHead}(Q, K, V) = [\mathrm{head}_1; \dots; \mathrm{head}_h]\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}),$$

where the projections are parameter matrices $W_i^{Q} \in \mathbb{R}^{d \times d_k}$, $W_i^{K} \in \mathbb{R}^{d \times d_k}$, $W_i^{V} \in \mathbb{R}^{d \times d_v}$, and $W^{O} \in \mathbb{R}^{h d_v \times d}$.
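The generalized and multi-head attention computations can be sketched in NumPy as follows (arbitrary dimensions; random matrices standing in for learned projections):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head(X, heads, Wq, Wk, Wv, Wo):
    # Project into per-head sub-matrices, attend in parallel,
    # then concatenate the heads and project back to model dimension.
    outs = [attention(X @ Wq[i], X @ Wk[i], X @ Wv[i]) for i in range(heads)]
    return np.concatenate(outs, axis=-1) @ Wo

np.random.seed(0)
n, d, h = 5, 8, 2                    # sequence length, model dim, heads
d_k = d // h
Wq = [np.random.randn(d, d_k) for _ in range(h)]
Wk = [np.random.randn(d, d_k) for _ in range(h)]
Wv = [np.random.randn(d, d_k) for _ in range(h)]
Wo = np.random.randn(h * d_k, d)

X = np.random.randn(n, d)            # stands in for X_emb
out = multi_head(X, h, Wq, Wk, Wv, Wo)   # same shape as the input X
```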
2.6.2 Embedding Transformation
Similar to Seq2Seq models, the first step of the Transformer is to convert the tokens in the input sequence into their corresponding embeddings. Thus, given the input sequence $x$, we define the embedding matrix $E \in \mathbb{R}^{|\mathcal{V}| \times d}$, which maps tokens to vectors of dimension $d$.
Differently from RNNs, the Transformer does not use any recurrent state and thus, by construction, it is not able to model the temporal dependency in the input sequence. To cope with this issue, vaswani2017attention propose a sinusoidal positional embedding. The positional embedding matrix $PE$ is made of sine and cosine functions of different frequencies depending on the position. More formally, each element in $PE$ is defined as

$$PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d}\big), \qquad PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d}\big),$$

where $pos$ is the position in the sequence and $i$ indexes the embedding dimension. Therefore, given the input sequence $X$, its embedded representation is defined as

$$X_{\text{emb}} = XE + PE,$$

where $X_{\text{emb}} \in \mathbb{R}^{n \times d}$ is the resulting embedded representation of the input.
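The sinusoidal positional embedding can be computed directly; a small NumPy sketch:

```python
import numpy as np

def positional_embedding(n, d):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    PE = np.zeros((n, d))
    pos = np.arange(n)[:, None]              # positions 0..n-1, as a column
    div = 10000 ** (np.arange(0, d, 2) / d)  # one frequency per sin/cos pair
    PE[:, 0::2] = np.sin(pos / div)
    PE[:, 1::2] = np.cos(pos / div)
    return PE

PE = positional_embedding(10, 8)
```

Because the frequencies are fixed, the same matrix can be precomputed once and added to any input sequence up to length $n$.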
The Transformer encoder is made of a stack of encoder layers, each of which is made of a multi-head attention module, a FF layer, and two layer-normalization modules. Each layer gets the embedding $X_{\text{emb}}$, which at the first layer is simply the word embedding of the input, and returns a transformed version of this embedding matrix by computing

$$Z = \mathrm{LayerNorm}\big(X_{\text{emb}} + \mathrm{MultiHead}(X_{\text{emb}}, X_{\text{emb}}, X_{\text{emb}})\big), \qquad H = \mathrm{LayerNorm}\big(Z + \mathrm{FF}(Z)\big),$$

where FF is a feed-forward neural network with two layers and a ReLU activation. The multi-head attention step, in which $Q = K = V$, is often denoted as the self-attention layer. In short, the stack of encoder layers forms a Transformer encoder, and the forward function is denoted as

$$H = \mathrm{Encoder}(x),$$

where the embedding function includes the positional encoding.
The Transformer decoder is composed of a stack of decoder layers, each of which is made of two multi-head attention modules with their corresponding layer normalization, and a FF neural network. Each layer gets the embedding $Y_{\text{emb}}$ and the last-layer embedding from the encoder, denoted as $H$, and returns a transformed version of this embedding matrix by computing a masked self-attention step, a cross-attention step over $H$, and a FF transformation, each followed by layer normalization. The first multi-head attention module is masked to prevent the current tokens from attending to future ones. A summary of the different attention masks is shown in Figure 2.6. In short, the stack of decoder layers forms a Transformer decoder, and the forward function is denoted as

$$O = \mathrm{Decoder}(y, H),$$

where $O \in \mathbb{R}^{m \times d}$. The $t$-th vector of $O$, denoted as $o_t$, is then used to generate a distribution over the vocabulary by computing

$$p_\theta(y_t \mid y_{<t}, x) = \mathrm{Softmax}(W_{\text{out}}\, o_t),$$

where $W_{\text{out}} \in \mathbb{R}^{|\mathcal{V}| \times d}$. This equation is equivalent to Equation 2.15, and thus we use the same cross-entropy loss function as in Equation 2.16. This last transformation is often named the language-model head. Finally, a high-level representation of the Transformer encoder-decoder architecture is shown in Figure 2.8.
2.7 Pre-trained Language Models
The Transformer [vaswani2017attention] has enabled large-scale language models (LMs) trained on a huge amount of data [radford2018improving, radford2019language, devlin2018bert, dai2019transformer] to greatly improve the state-of-the-art on many downstream tasks in natural language processing. In general, pre-trained models are divided into bi-directional [devlin2018bert, liu2019roberta], uni-directional or causal-decoder [radford2018improving, radford2019language, gpt3, dai2019transformer], and encoder-decoder generative [raffel2020exploring, lewis2020bart] models. The bi-directional pre-trained models are trained with a masked-language-model (MLM) loss, which learns how to predict words that are randomly masked in the input. These models achieve state-of-the-art performance in complex natural language understanding tasks [wang2018glue]. On the other hand, the uni-directional and encoder-decoder generative models are usually trained using the likelihood function in Equation 2.1. In this thesis, we mostly focus on generative models, and thus we provide further details about the latter two pre-training strategies.
The causal-decoder is a special Transformer-based architecture in which the encoder and decoder are merged into a single set of parameters. In this architecture, the input and output sequences are concatenated and provided as input to a Transformer encoder, as defined in Equation 2.26. However, the attention matrices (i.e., the result of the Softmax in Equation 2.22) in this encoder are masked, as in the first self-attention step of the Transformer decoder. Figure 2.6(a) shows the attention masking for a causal-decoder architecture. More formally, given a sequence $x$, which can be the concatenation of the input and output sequences or simply a sequence of tokens, the causal-decoder computes:

$$O = \mathrm{CausalDecoder}(x),$$

where $O \in \mathbb{R}^{n \times d}$. The $t$-th vector of $O$, denoted as $o_t$, is then used to generate a distribution over the vocabulary by computing

$$p_\theta(x_t \mid x_{<t}) = \mathrm{Softmax}(W_{\text{out}}\, o_t).$$

Similar to the Transformer decoder, we use the same cross-entropy loss function as in Equation 2.16 to train the model. This model is usually referred to as the Language Model (LM) Transformer, since it is trained in the same way.
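The causal attention masking that characterizes this architecture can be sketched in NumPy (random matrices; only the masking logic matters here):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

np.random.seed(0)
n, d = 4, 8
Q = np.random.randn(n, d)
K = np.random.randn(n, d)

scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((n, n)), k=1).astype(bool)   # strictly upper triangle
scores[mask] = -1e9                                 # block future positions
A = softmax(scores, axis=-1)                        # causal attention matrix
```

After the Softmax, row $t$ of `A` assigns (numerically) zero weight to all positions after $t$, so each token can only attend to itself and the past.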
The most effective causal-decoders pre-trained on large text corpora are GPT-2 [radford2019language] and GPT-3 [brown2020language], while in the dialogue-system scenario, DialoGPT [zhang2019dialogpt] is pre-trained using a large number of unlabeled conversations from Reddit (more details in Chapter 3). Independently of the pre-training corpus, these pre-trained models are fine-tuned to specific generation tasks (e.g., dialogue response generation) by using the same Seq2Seq loss as defined in Equation 2.16. The causal-decoder model is shown in Figure 2.7, and it is widely used in Chapters 3 and 4 of this thesis.
Pre-trained encoder-decoder models are Transformers, as described in Section 2.6, trained on a massive amount of unlabeled data (plain text or conversations). Differently from the causal-decoder, two pre-training strategies have been proposed: span prediction [raffel2020exploring] and denoising pre-training [lewis2020bart]. In both strategies, the input sequence is corrupted and the model is taught to reconstruct the original sequence.
Several pre-trained encoder-decoder conversational models [adiwardana2020towards, roller2020recipes] have been shown to be very effective in generating human-like responses. However, these models are still very hard to control.
2.8 Related Work
Task-oriented dialogue models [gao2018neural] can be categorized into two types: module-based [williams2007partially, hori2009statistical, lee2009example, levin2000stochastic, young2013pomdp, wu2019transferable] and end-to-end. In this thesis, we focus on the latter: systems that train a single model directly on text transcripts of dialogues. These tasks are tackled by selecting a set of predefined utterances [bordes2016learning, perez2016gated, williams2017hybrid, seo2016query] or by generating a sequence of tokens [wen2016network, serban2016building, zhao2017generative, serban2017hierarchical]. Especially in the latter case, copy-augmented models [eric-manning:2017:EACLshort, reddy2018multi, yavuzdeepcopy] are very effective, since extracting entities from a knowledge base is fundamental.
Chit-Chat Dialogue Generating human-like responses involves overcoming a variety of challenges such as personalization [li2016persona, personachat, dinan2020second, wolf2019transfertransfo, madotto2019personalizing], knowledge grounding [dinan2018wizard, gopalakrishnan2019topical, ghazvininejad2018knowledge, moghe2018towards, wu2020controllable, xu2021retrieval], emotions [li2017dailydialog, rashkin2018know, zhou2018emotional, fan2020facial, li2020empathetic], diversity [li2016diversity, li2016deep, ghandeharioun2019approximating, serban2017hierarchical, gao2018neural], and bias [bang2021assessing, lee2019exploring, lee2021mitigating], among others. In terms of controlled dialogue generation, see2019makes studied conditional generative models [kikuchi2016controlling] and weighted decoding [ghazvininejad2017hafez] for controlling models trained on Persona-Chat, and concluded that controlling specificity, relatedness, and repetition increases human engagement, motivating us to extend controllability to styles and topics. In this thesis, we focus on these two, since large pre-trained models can already achieve a high humanness score [adiwardana2020towards, roller2020recipes, zhang2019dialogpt].
Controlled Text Generation
Recent methods for controlled generation include fine-tuning models using supervised learning [peng2020few, subramani2019can], reinforcement learning [ziegler2019fine], adversarial training [yu2017seqgan], pre-training models with control codes [keskar2019ctrl, ficler2017controlling, chan2020cocon], and various other approaches [zhang2020pointer, sheng2020towards, carbone2020etc]. Alternatively, weighted decoding, using both bags-of-words [holtzman2018learning, ghazvininejad2017hafez, baheti2018generating, see2019makes] and discriminators [holtzman2018learning, krause2020gedi], does not require any fine-tuning. Similarly, dathathri2019plug propose the Plug-and-Play Language Model (PPLM) to control the generation of a pre-trained language model, e.g., GPT-2 [radford2019language], both in terms of style and topic of the generated text. Finally, residual adapters [houlsby2019parameter] have been used to learn multiple language generation tasks [lin2020exploring] without fine-tuning the original model's parameters. Concurrently to our work, Smith2020ControllingSI compare the performance and trade-offs of three existing controllable language generation methods on 200 possible styles.
Continual Learning in NLP
Continual learning has been explored for both classification [d2019episodic, sprechmann2018memory, wang2020efficient, lee2021dynamically] and generation [sun2019lamol, hu2020drinking] tasks. For instance, sun2019lamol, chuang2020lifelong proposed LAMOL, which we use as our baseline, and studied its effectiveness on a subset of DecaNLP [mccann2018natural]. On the other hand, the work of d2019episodic, sprechmann2018memory is not suitable for interactive systems such as dialogue systems, since their methods require local adaptation (i.e., a fine-tuning step) during inference. Finally, continual learning has been used for sentence encoding [liu2019continual], compositional language learning [li2019compositional] and relation learning [han2020continual, lee2021towards]. However, these methods are very specific to their particular NLP applications.
Continual Learning in Dialogue Systems
The earliest work on CL for task-oriented dialogue is from lee2017toward, who used EWC to avoid catastrophic forgetting on three domains learned sequentially. Continual learning has also been studied in the NLG setting, where a single model was trained to learn one domain at a time in MWoZ [mi2020continual]. The authors used an episodic memory to replay examples, in combination with EWC. We compare similar baselines, but on a larger benchmark that also includes MWoZ and the NLG setting. For the DST setting, CL was studied by wu2019transferable using MWoZ, where several baselines such as L2, EWC and GEM were compared. Differently, li2019evaluate leveraged CL for evaluating the quality of chat-bot models, and he2019mix studied the catastrophic forgetting problem in chit-chat systems. Finally, shuster2020deploying showed that by training models on human-machine conversations in an open-domain fantasy world game [fan2020generating], the models progressively improved, as measured by automatic metrics and online engagement scores.
Mixture of Experts & Conditional Computation
The idea of having specialized parameters, or so-called experts, has been a widely studied topic over the last two decades [jacobs1991adaptive, jordan1994hierarchical]. For instance, different architectures and methodologies have been used, such as Gaussian Processes [tresp2001mixtures], Hierarchical Experts [yao2009hierarchical], and sequential expert addition [aljundi2017expert]. More recently, the Mixture of Experts [shazeer2017outrageously, kaiser2017one] model was proposed, which adds a large number of experts between two LSTMs. To the best of our knowledge, none of these previous works applies the output of the gating function to the parameters themselves. On the other hand, there are Conditional Computation models, which learn to dynamically select their computation graph [bengio2013estimating, davis2013low]. Several methods have been used, such as reinforcement learning [bengio2016conditional], a halting function [graves2016adaptive, dehghani2018universal, figurnov2017spatially], pruning [lin2017runtime, he2018amc] and routing/controller functions [rosenbaum2018routing]. However, this line of work focuses on optimizing the inference performance of the model rather than on specializing parts of it for a certain task.
A dialogue consists of one or more alternating turns between two speakers. We denote the dialogue history as a single sequence of tokens $\mathcal{D}_t = \{U_1, S_1, \dots, U_t\}$ from the concatenation of the alternating utterances from the user ($U$) and the system ($S$) turns respectively. Without loss of generality, we assume that $\mathcal{D}_t$ has all the dialogue history without the last system utterance, denoted as $S_t$. We model the dialogue responses using a Transformer [vaswani2017attention]-based causal decoder (LM) by using the dialogue history $\mathcal{D}_t$ as a prefix and then generating the continuation $S_t$ in an auto-regressive manner [DBLP:journals/corr/abs-1901-08149].
Let us re-define the concatenation of $\mathcal{D}_t$ and the output $S_t$ as the sequence of tokens $X = \{x_0, \dots, x_n\}$. Then we can compute the language model distribution using the chain rule of probability [bengio2003neural] as
$$p_{\theta}(X) = \prod_{i=1}^{n} p_{\theta}(x_i \mid x_0, \cdots, x_{i-1}),$$
where $\theta$ is the model's parameters and $X$ denotes the concatenation of $\mathcal{D}_t$ and $S_t$. We define the Transformer decoding process in a recursive manner. Let us define the matrix $H_t$ as the key-value pairs from the past dialogue history, i.e., $H_t = \left[(K_t^{(1)}, V_t^{(1)}), \dots, (K_t^{(l)}, V_t^{(l)})\right]$, where $(K_t^{(i)}, V_t^{(i)})$ corresponds to the key-value pairs from the $i$-th layer generated at all time-steps from 0 to $t$.
Thus, we define the recurrent decoding process as:
$$o_t, H_t = \text{LM}(x_t, H_{t-1}),$$
and then the next token $x_{t+1}$ is sampled from the distribution $p(x_{t+1} \mid x_{0:t}) = \text{softmax}(W o_t)$, where $W$ is a linear transformation that maps the hidden state $o_t$ of the last layer to a vector of the vocabulary size. This efficient Transformer implementation [huggingface] leverages the cached memories $H_{t-1}$ to generate $x_{t+1}$ without recomputing the key-value pairs of all previous time-steps.
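The recurrence above can be sketched in a few lines of NumPy. The following is a toy single-layer, single-head attention LM (illustrative names and random weights, not the actual DialoGPT architecture): each step consumes one token, appends its key-value pair to the cache, and produces the next-token distribution, so step $t$ costs $O(t)$ instead of recomputing the whole prefix.

```python
import numpy as np

# Toy illustration of cached autoregressive decoding (hypothetical
# single-layer, single-head attention LM; real models are multi-layer).
rng = np.random.default_rng(0)
V, d = 6, 4                      # vocabulary size, hidden size
E = rng.normal(size=(V, d))      # token embeddings
Wk, Wv, Wq = (rng.normal(size=(d, d)) for _ in range(3))
W_out = rng.normal(size=(d, V))  # maps last hidden state to vocab logits

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lm_step(x_t, cache):
    """One decoding step: consume token x_t, reuse cached keys/values H_{t-1}."""
    h = E[x_t]
    k, v = h @ Wk, h @ Wv
    K = np.vstack([cache["K"], k]) if cache else k[None]
    V_cached = np.vstack([cache["V"], v]) if cache else v[None]
    attn = softmax((h @ Wq) @ K.T / np.sqrt(d))   # attend over all past steps
    o = attn @ V_cached
    return softmax(o @ W_out), {"K": K, "V": V_cached}

# Greedy decoding loop: the cache plays the role of H_{t-1}.
cache, x = {}, 0
generated = [x]
for _ in range(5):
    probs, cache = lm_step(x, cache)
    x = int(np.argmax(probs))
    generated.append(x)
```

In a real implementation the cache holds one key-value pair per layer, exactly the matrix $H_t$ defined above.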
3.1.1 Plug-and-Play Language Model
PPLM [dathathri2019plug] uses an attribute model (i.e., a classifier) for controlling the generated text. We denote the attribute model as $p(a \mid x)$, where $a$ is the specific desired attribute to optimize for (e.g., positivity), and $x$ is the generated response so far. At every generation step $t$, PPLM perturbs the history matrix $H_t$ in the direction of the sum of two gradients: i) to maximize the log-likelihood of the attribute $a$ under the conditional attribute model $p(a \mid x)$ and ii) to ensure the high log-likelihood of the generated text under the unmodified conversational language model $p(x)$. The gradient updates are restricted to $H_t$ so as to preserve the original model parameters.
Let $\Delta H_t$ be the update to $H_t$ that shifts the generated text towards possessing the desired attribute $a$, i.e., the perturbed history is $H_t + \Delta H_t$. At the beginning of the generation, $\Delta H_t$ is initialized to zero and it is updated using the gradients from the attribute model. We rewrite the attribute model as $p(a \mid H_t + \Delta H_t)$, and we define the gradient update for $\Delta H_t$ as
$$\Delta H_t \leftarrow \Delta H_t + \alpha \frac{\nabla_{\Delta H_t} \log p(a \mid H_t + \Delta H_t)}{\left\lVert \nabla_{\Delta H_t} \log p(a \mid H_t + \Delta H_t) \right\rVert^{\gamma}}, \tag{3.3}$$
where $\alpha$ is the step size and $\gamma$ is the scaling coefficient for the normalization term. Equation 3.3 is repeated $p$ times, depending on how strongly we want the response to be conditioned on the attribute. We study the effect of the step size $\alpha$ and the number of iterations $p$ on the generated text in detail in Section 3.4. Subsequently, the new perturbed history $\tilde{H}_t = H_t + \Delta H_t$ is computed and a new token is generated using $o_t, H_{t+1} = \text{LM}(x_t, \tilde{H}_t)$. The described optimization process is repeated for every token in the generated sequence. As previously mentioned, to ensure fluency we also take a step towards minimizing the Kullback–Leibler (KL) divergence between the perturbed and the original distribution. In addition, we also use Post-norm Geometric Fusion [stahlberg2018simple, dathathri2019plug] to avoid adversarial generation [szegedy2013intriguing].
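The latent update can be illustrated with a minimal sketch. Here the attribute model is a toy quadratic log-likelihood with an analytic gradient (a stand-in for a real classifier, so that no autograd is needed); the update itself follows the power-normalized gradient ascent of Equation 3.3, omitting the KL term for brevity.

```python
import numpy as np

# Toy sketch of the PPLM latent update: gradient ascent on the attribute
# log-likelihood with a power-normalized gradient. The quadratic
# "attribute model" below is an illustrative stand-in for a classifier.
target = np.array([1.0, -2.0, 0.5])   # latent direction favored by attribute a

def log_p_attr(H):                    # toy log p(a | H)
    return -np.sum((H - target) ** 2)

def grad_log_p_attr(H):               # its analytic gradient
    return -2.0 * (H - target)

def pplm_update(H, alpha=0.02, gamma=1.0, num_iters=10):
    delta = np.zeros_like(H)          # Delta-H starts at zero
    for _ in range(num_iters):
        g = grad_log_p_attr(H + delta)
        delta = delta + alpha * g / (np.linalg.norm(g) ** gamma + 1e-12)
    return H + delta                  # perturbed history

H0 = np.zeros(3)
H1 = pplm_update(H0)
# H1 scores higher than H0 under the attribute model
```

The step size `alpha` and `num_iters` play the roles of $\alpha$ and $p$; as discussed in Section 3.4, pushing them too far trades fluency for attribute strength.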
The discriminator $f$ is a linear classifier trained on an annotated dataset of sentence and label pairs $(x, y)$ — note that these sentences do not necessarily need to be conversational responses, as in our case. For each sentence $x$ of length $T$, we compute the set of hidden states $o^x_{:T}$ from the LM, then we compute the mean ($\bar{o}^x$) across time, and finally we train $f$ using the cross-entropy between the label distribution $y$ and $f(\bar{o}^x)$. Figure 3.3 shows an example of how PPLM uses different discriminators to control the output generation of DialoGPT (DGPT).
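The forward pass of such a discriminator is a mean-pool followed by one linear layer. A minimal sketch (random weights and dimensions are illustrative; in practice the hidden states come from the frozen DialoGPT):

```python
import numpy as np

# Sketch of the attribute discriminator f: mean-pool the frozen LM's hidden
# states across time, then apply a single trainable linear layer.
rng = np.random.default_rng(1)
d, num_classes = 8, 3
W, b = rng.normal(size=(d, num_classes)), np.zeros(num_classes)

def classify(hidden_states):
    """hidden_states: (T, d) activations from the frozen LM for one sentence."""
    o_bar = hidden_states.mean(axis=0)    # mean across the T time steps
    logits = o_bar @ W + b
    z = logits - logits.max()
    return np.exp(z) / np.exp(z).sum()    # class distribution p(a | x)

probs = classify(rng.normal(size=(12, d)))
```

Only `W` and `b` are trained with cross-entropy; the LM providing `hidden_states` stays frozen.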
3.1.2 Residual Adapters
Residual adapters [houlsby2019parameter, bapna2019simple] are trainable modules added on top of each Transformer layer, which steer the output distribution of a pre-trained model without modifying the original weights. An adapter block consists of layer normalization [ba2016layer], followed by two linear layers [hinton1994autoencoders]. Given the hidden representation $H_i \in \mathbb{R}^{t \times d}$ at layer $i$ of a Transformer [vaswani2017attention], where $d$ is the hidden size and $t$ is the sequence length, the residual adapter computes
$$\text{Adapter}_i(H_i) = \text{ReLU}\left(\text{LN}(H_i) W_i^E\right) W_i^D + H_i,$$
where $W_i^E$ and $W_i^D$ are trainable parameters of dimensions $d \times m$ and $m \times d$ respectively, and $\text{LN}(\cdot)$ denotes the layer normalization. The bottleneck dimension $m$ is a tunable hyper-parameter that allows adjustment of the capacity of the adapter according to the complexity of the target task. We define $\theta_a = \{W_1^E, W_1^D, \dots, W_l^E, W_l^D\}$ as the set of parameters of the adapter for a model with $l$ layers.
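The adapter computation for one layer can be sketched directly from its definition (dimensions and random weights are illustrative):

```python
import numpy as np

# Minimal residual adapter block: LayerNorm, a down-projection to the
# bottleneck m, ReLU, an up-projection back to d, plus the residual.
rng = np.random.default_rng(0)
t, d, m = 5, 16, 4                    # sequence length, hidden size, bottleneck

W_E = rng.normal(size=(d, m)) * 0.1   # down-projection (trainable)
W_D = rng.normal(size=(m, d)) * 0.1   # up-projection (trainable)

def layer_norm(H, eps=1e-5):
    mu = H.mean(axis=-1, keepdims=True)
    var = H.var(axis=-1, keepdims=True)
    return (H - mu) / np.sqrt(var + eps)

def adapter(H):
    """H: (t, d) hidden states of one Transformer layer."""
    z = np.maximum(layer_norm(H) @ W_E, 0.0) @ W_D   # bottleneck transform
    return H + z                                     # residual connection

H = rng.normal(size=(t, d))
H_out = adapter(H)
```

Because the adapter only adds a residual term, initializing `W_E`/`W_D` near zero leaves the pre-trained model's behavior almost unchanged at the start of training, which is what makes it a gentle steering mechanism.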
At decoding time, PPLM uses a fixed number of iterations to generate a single token. This makes the model impracticable for interactive tasks such as conversational modelling. To cope with this issue, we propose to first use PPLM to generate a dataset of dialogues for each attribute $a$, denoted as $\mathscr{D}^a = \{(\mathcal{D}_1, S_1^a), \dots, (\mathcal{D}_n, S_n^a)\}$, where $\mathcal{D}$ is a dialogue history and $S^a$ the corresponding PPLM-generated response. Then we optimize the residual adapter parameters to steer the output of the original LM distribution. Hence, for each attribute $a$, we optimize the parameters in $\theta_a$ to minimize the negative log-likelihood over the dataset of dialogues $\mathscr{D}^a$. Formally,
$$\mathcal{L}(\mathscr{D}^a) = -\sum_{(\mathcal{D}, S^a) \in \mathscr{D}^a} \sum_{i=1}^{n} \log p\left(s_i^a \mid s_{<i}^a, \mathcal{D}\right),$$
where each response $S^a = \{s_1^a, \dots, s_n^a\}$ is of maximum length $n$ in $\mathscr{D}^a$. Figure 3.4 shows the final Transformer with one adapter per attribute.
3.2 Experimental Setup
In this section, we conduct extensive experiments on the proposed methodology using both automatic and human evaluation. Differently from PPLM [dathathri2019plug], where a set of pre-defined prefixes is used to trigger the generation, in our experiments we use 100 conversations [adiwardana2020towards] to generate 1,100 possible prefixes (i.e., a moving window of size two). These open-domain generic dialogues serve only as prefixes to trigger the responses; they are not used for fine-tuning. In all our experiments, we use DialoGPT-medium [zhang2019dialogpt], a large pre-trained model trained on 147 million multi-turn dialogues from Reddit, spanning from 2005 to 2017. Importantly, the proposed methodology is model-agnostic, and thus it can be applied to any other large pre-trained model, such as Meena [adiwardana2020towards] and Blender-Bot [roller2020recipes].
Since the plug-and-play adapters are trained on the responses generated by PPLM, we randomly split the prefixes, with 80% for learning the adapter perturbation and the remaining 20% for the final automatic and human evaluation. This is done to ensure a fair comparison between the adapters and the other baselines (see Appendix A for more details).
3.2.1 Attribute Models
We train three discriminators covering six attribute models: Positive, Negative, Question, Sci/Tech, Business and Sport. To control the Positive and Negative responses, we use SST-5 [socher2013recursive] with the classes Very-Positive and Very-Negative as the attribute, and to control for Question, we use the speech-act annotation from Daily Dialogue [li2017dailydialog]
with the Question class as the attribute. To avoid any dialogue-related data, we only use sentences without their corresponding context. Finally, to generate responses about Sci/Tech, Business and Sport, we use the AG-NEWS [zhang2015character] topic-classification dataset, using the respective classes as attributes. As mentioned in Section 3.1.1, we freeze the DialoGPT parameters and train a linear classifier on top of the representations from the final layer of its Transformer blocks. Table 3.1 shows the sample-size statistics and the performance in terms of F1-score for all the aforementioned datasets. We also report the current state-of-the-art to show that a linear classifier trained on top of the DialoGPT activations can reach competitive performance.
| Daily Dialogue [li2017dailydialog] | Act | 4 | 92,650 | 10,295 | 80.58 | 80.00 | 86.10 |
| AG NEWS [zhang2015character] | Topic | 4 | 120,000 | 7,600 | 90.68 | 90.65 | 95.44 |
3.2.2 Baselines
We compare multiple plug-and-play settings: DG: DialoGPT, proposed by zhang2019dialogpt; WD: DialoGPT plus a word-level weight-decoding schema, as in [ghazvininejad2017hafez] and [see2019makes]; PP: DialoGPT plus PPLM [dathathri2019plug], as explained in Section 3.1.1; and AD: DialoGPT with one adapter per style, as explained in Section 3.1.2. In all the baselines, we sample 10 hypotheses using multinomial sampling after top-$k$ filtering, to ensure response diversity [zhang2020trading], and we select the hypothesis with the lowest attribute model loss as the response. This re-ranking technique has been shown to be very effective for generating good responses [adiwardana2020towards, dathathri2019plug].
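The sample-and-rank procedure can be sketched as follows. The scoring function here is a toy stand-in for the attribute model loss, and the vocabulary, lengths and `k` are illustrative, not the experimental settings:

```python
import numpy as np

# Sketch of sample-and-rank: draw several hypotheses with top-k filtered
# multinomial sampling, then keep the one with the lowest attribute loss.
rng = np.random.default_rng(0)

def top_k_sample(logits, k, rng):
    idx = np.argsort(logits)[-k:]              # keep the k most likely tokens
    z = logits[idx] - logits[idx].max()
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(idx, p=p))

def generate(rng, length=5, vocab=20, k=10):
    # random logits stand in for a real LM's next-token distribution
    return [top_k_sample(rng.normal(size=vocab), k, rng) for _ in range(length)]

def attribute_loss(hypothesis):                # toy attribute-model loss
    return float(np.sum(hypothesis))

hypotheses = [generate(rng) for _ in range(10)]   # 10 sampled hypotheses
best = min(hypotheses, key=attribute_loss)        # re-rank: lowest loss wins
```

In the actual experiments the loss is the discriminator's cross-entropy on the decoded response, so the re-ranking step reuses the same attribute model that drives PPLM.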
3.2.3 Evaluation Metrics
We evaluate the generated responses using both automatic and human evaluations.
Automatic Eval. in open-domain chat is challenging [liu2016not], especially when using n-gram methods over a single reference (e.g., BLEU [papineni-etal-2002-bleu]). In this chapter, no gold-reference response is provided (e.g., a stylistic human-generated response), and thus we rely on unsupervised measures for fluency, diversity and style/topic. For fluency, we compute the perplexity score of the dialogue prefix plus the generated response using GPT2 [radford2019language]. For diversity, we use the distinct n-gram count [li2016diversity] (normalized by the length of the text) across all the responses generated by a given method. To evaluate attribute consistency, we train external classifiers using data that does not overlap with the attribute-model training data. For sentiments, we use AMAZON-5 [mcauley2013hidden] product reviews, and for topics, we use the test-set data of AG-NEWS [zhang2015character], because we could not find another topic-classification dataset with the same classes. For each dataset, we train a separate BERT [devlin2019bert] (base) classifier with a simple classification head. Table 2 in Appendix B summarizes the dataset statistics and the performance of the trained scorers.
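The dist-n diversity metric is simple enough to state in code: the number of distinct n-grams over all generated responses, normalized by the total number of generated tokens.

```python
# Sketch of the dist-n diversity metric used above.
def distinct_n(responses, n):
    ngrams, total_tokens = set(), 0
    for response in responses:
        tokens = response.split()
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / max(total_tokens, 1)

responses = ["so so bad", "so bad indeed", "what a wonderful day"]
d1 = distinct_n(responses, 1)   # unigram diversity: 7 distinct / 10 tokens
d2 = distinct_n(responses, 2)   # bigram diversity: "so bad" counted once
```

Repetitive outputs like "so so bad" lower dist-1 in particular, which is exactly the effect observed later for the adapter model.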
Human Eval. is the most effective way to evaluate open-domain chat-bots. In this chapter, we evaluate two aspects of the generated responses: humanness and attribute consistency. The first is used to evaluate the fluency and coherence of the generated responses, and the second to evaluate whether the generated responses respect the style or topic enforced by the attribute model. We use Acute-Eval [li2019acute]-style A/B testing, in which we compare all possible model pairings (e.g., PP vs. DG, etc.). For each comparison, we show the same dialogue context and two possible options, one generated by model A and one by model B. Then we ask the annotators to select among four options: model A, model B, both or neither. We collect annotations for both humanness and attribute consistency on 30 dialogues per model comparison and attribute, which amounts to a total of 4,200 human annotations. Further details are provided in Appendix C.
3.3 Results
In this section, we evaluate the proposed methodology to answer three research questions: 1) is it possible to use plug-and-play methods to control the output of a large pre-trained conversational model? and if so, 2) what are the most effective plug-and-play methods? and 3) how difficult is it to control the response generation given various attributes? To answer these questions, we rely on both automatic and human evaluation. Table 3.2 and Figure 3.5 report the aggregated results for all the styles and topics in both evaluations. The breakdown per attribute is reported in Appendix D.
3.3.1 Quantitative Evaluation
Automatic Eval. The major evaluation criterion is to have responses that are as fluent as those of the original DialoGPT or humans while following the style or topic enforced by the attribute model. In Table 3.2, we can see that DialoGPT (DG) achieves the lowest perplexity, but it also has the lowest aggregate attribute score ("Score" in Table 3.2). By analysing the breakdown by style, we can see that, by default, the original model has a higher score in both the Positive style and the Sci/Tech topic. We hypothesize that this is due to two factors: 1) The discussions in Reddit are more often related to Sci/Tech topics. When provided general questions as input, e.g., "What do you do for a living?", the model often generates tech-related responses, e.g., "I am a computer science student". 2) zhang2019dialogpt filtered undesired and toxic responses from the Reddit conversations used in training DialoGPT, which explains the positivity of its responses.
Using weighted decoding (WD) on top of DialoGPT leads to an improvement in both the diversity score and the external classifier score. However, WD tends to increase the perplexity score, showing that the generation fluency with respect to the context is lost. In preliminary experiments, we noticed that WD generates responses that are unrelated to the dialogue context but highly similar to the distribution of the discriminator datasets. This is consistent with the observation in [see2019makes] that WD is difficult to tune and often provides control at the cost of fluency, leading to non-sensical generation. On the other hand, PPLM (PP) is able to achieve a lower perplexity than WD while attaining both a higher attribute consistency score and a high response diversity (dist). We hypothesize that this better performance is due to the ability of PPLM to dynamically perturb the latent activations of the model without breaking the original distribution, thanks to the KL regularization and the Post-norm Geometric Fusion [stahlberg2018simple].
The adapter plug-and-play setting has the highest overall attribute score and a lower perplexity than both PP and WD. However, its response diversity, especially dist-1, is lower than that of the other baselines, meaning that the responses may contain repetitive tokens (e.g., "so so bad"). In general, the adapters, though optimized with PPLM-generated responses that are usually not perfect, can properly learn to steer the output distribution without breaking the original DialoGPT output. As previously mentioned, this also comes with the advantage of not computing the PPLM perturbation at decoding time.
|HUMAN 1||Are you doing any home decorating then?|
|HUMAN 2||Yes! We set up an eco-friendly (i.e. fake) Christmas tree and put up some colorful LED lights which is very festive.|
|||Oh that’s so cool! I love your stuff!!||Very nice, good sir|
|PP||I’m not a fan of LEDs in general. They always seem to fail.||Oh wow awesome! Thank you so much for your time!|
|AD||That sounds like the absolute most boring thing. EVER.||That is amazing! I am so excited!! :D So creative and creative!! :D|
Human Eval. In Figure 3.5, we report the winning rate of the A/B testing for both humanness and attribute consistency. From the figure, we highlight the following: 1) There is no statistically significant difference in the humanness score among the multiple methods, even with 210 annotations per cell. In general, all the methods lose against the human response (HM), but not by a large margin, since the annotators often choose the "both" option. 2) In terms of attribute consistency, we observe that the methods form a clean, well-ordered rank, AD $>$ PP $>$ WD $>$ DG $>$ HM, which confirms the automatic evaluation results. Differently from humanness, all the results except those for WD vs. DG are statistically significant, showing that the adapter clearly defeats the other methods.
To answer the first two research questions: both automatic and human evaluation show that plug-and-play methods are suitable for controlling response generation. Moreover, the most effective method is the adapter plug-and-play, which produces fluent and attribute-consistent responses while being three orders of magnitude faster than PPLM at inference time (148.5 s/token vs. 0.123 s/token on a single Nvidia 1080Ti).
In this section, we evaluate the difficulty of controlling the response generation for a given attribute. To do so, we analyse the behaviour of PPLM over two opposite styles (positive and negative) and then we conduct a qualitative evaluation over the generated responses.
Iteration & Step Size
We analyse the loss of the automatic scorers for fluency and attribute consistency to understand the effects of the number of iterations and the step size in Equation 3.3. Figure 3.6 depicts the normalized sum of the log perplexity score, computed by GPT2 [radford2019language], and the external classifier loss on the responses generated for the Negative and Positive styles. In general, the aggregate loss for the Negative attribute (Figure 3.6a) is higher than that for the Positive attribute (Figure 3.6b), as also shown in the sampled responses, where a small step size and few iterations already lead to positive responses. However, when both the step size and the number of iterations surpass a certain threshold, the conditioning becomes very strong and the text generated by PPLM loses its fluency. Overall, this visualization suggests that it is more laborious to control for the negative sentiment with PPLM, and that there is a smaller region of the hyper-parameter space where the responses are both fluent and attribute-consistent.
We sample and read 200 dialogue responses from the adapter plug-and-play model (AD), and we study the overall quality of the responses, especially to understand when and why DialoGPT is hard to steer. We discover three possible factors: 1) the hardness of the response steering is influenced by the context, 2) the available vocabulary for the attributed style/topic, and 3) the mutual exclusivity of the attribute-specific vocabulary.
1) Unlike language models that use short prefixes (e.g., "The issues …") to trigger the generation, conversational models are constrained to the given dialogue history, which significantly influences the controllability. Given an open-ended dialogue context (e.g., Table 11 in the appendix), AD generates an impressively natural and on-topic response, but when provided a more constrained dialogue context (e.g., Table 17 in the appendix), AD generates a response that may sound sudden and out of context.
2) Looking at the overall responses, also shown in Table 3.3, we observe that the models use a restricted vocabulary for generating attribute-consistent responses. For example, AD frequently generates sentences containing “horrible", “terrible" or “worst" for negative, while “beautiful", “happy" or “wonderful" are more common for positive.
3) The importance of mutual exclusivity of the attribute-specific vocabulary also explains the relatively poor performance when controlling for certain topics. As noted above, positive and negative vocabularies are clearly distinguishable. However, the attribute-specific words for topics such as Business are more generic (e.g., “car", “store") than those for other topics such as Sport (e.g., “football", “hockey") or Sci/Tech (e.g., “android", “software"). If the attribute-specific words are common and shared across multiple domains, the generated responses may not sound attribute specific even though the correct vocabulary is used.
Note that this use of restricted vocabulary also harms fluency, because the vocabulary cannot always fit within a given context. Additional generated examples and statistics of attribute-specific vocabulary on each style/topic are provided in Appendix D.
3.5 Short Summary
We explore plug-and-play methods for controlling the response generation of large pre-trained conversational models. With extensive automatic and human evaluations, we show that PPLM is able to generate fluent and attribute-consistent responses. Further, to overcome the significant computational overhead introduced by PPLM at decoding time, we optimize a tiny residual adapter for each attribute based on a few synthetic responses generated using PPLM. The resulting model does not require further computation at decoding time, and outperforms PPLM both in terms of fluency and attribute consistency.
4.1.1 Task-Oriented Dialogue Modelling
We model TODs as a Seq2Seq generation task [lei2018sequicity, byrne2020tickettalk] that generates both API-calls and system responses. As shown in Figure 4.2, the model takes as input a dialogue history to generate an API-call, which is the concatenation of user intents and current dialogue states, and then uses the API-call return, which can be empty or a system speech-act, to generate the system response. This modelling choice is guided by the existing annotated dialogue datasets, which provide the intent and the dialogue state of the user at every turn, and the speech-act of the system; and it allows us to define four distinct settings for studying CL: intent recognition (INTENT), DST, NLG and end-to-end (E2E). In the coming paragraphs, we formally describe the four settings as different input-output pairs for a Seq2Seq model.
Let us define the dialogue history $\mathcal{D}_t$ as a single sequence of tokens from the concatenation of the alternating utterances from the user and the system turns respectively. Without loss of generality, we assume that $\mathcal{D}_t$ has all the dialogue history without the last system utterance, denoted as $S_t$. To distinguish between speakers, we add two special tokens at the beginning of every utterance: "USER:" for the user utterance, and "SYSTEM:" for the system utterance. Then, we define an API-call, denoted by $C_t$, as the concatenation of the API-name, i.e., the user-intent, and its arguments, i.e., slot-value pairs from the DST. The following syntax is used:
$$C_t = I(s_1 = v_1, \dots, s_k = v_k), \tag{4.1}$$
where $I$ is an intent or the API-name, $s_i$ a slot-name and $v_i$ one of the possible values for the slot $s_i$. The return of the API-call is either an empty string, in which case the model uses the dialogue history to generate a response, or a speech-act, denoted as $O_t$, in the same format as the API-call in Equation 4.1. Similarly to the dialogue history, we define two special tokens, "API:" and "OUT:", for triggering the model to generate the API-call and for distinguishing the return of the API from the dialogue history respectively. Based on this pre-processing, we define the four settings.
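The serialization described above can be sketched as plain string construction. The exact surface syntax (parentheses, separators, quoting) is an assumption for illustration; the special speaker tokens are the ones defined in the text:

```python
# Sketch of the pre-processing described above; the exact argument syntax
# of the API-call string is an illustrative assumption.
def to_api_call(intent, slot_values):
    """Serialize an intent and its slot-value pairs into an API-call string."""
    args = ", ".join(f'{slot}="{value}"' for slot, value in slot_values.items())
    return f"{intent}({args})"

def add_speaker_tokens(turns):
    """turns: list of (speaker, utterance); speakers are 'user' or 'system'."""
    prefix = {"user": "USER:", "system": "SYSTEM:"}
    return " ".join(f"{prefix[speaker]} {utt}" for speaker, utt in turns)

history = add_speaker_tokens([("user", "Book a hotel in Paris for two nights.")])
call = to_api_call("hotel_book", {"city": "Paris", "nights": "2"})
# e.g. 'hotel_book(city="Paris", nights="2")'
```

The "API:" and "OUT:" trigger tokens would then be prepended to `call` and to the API return before feeding them to the Seq2Seq model.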
Without loss of generality, we define the three modularized settings by their input-output pairs:
$$\text{INTENT:} \ (\mathcal{D}_t, I_t), \qquad \text{DST:} \ (\mathcal{D}_t, C_t), \qquad \text{NLG:} \ (O_t, S_t),$$
whereas for the E2E setting we define the pairs as:
$$\text{E2E:} \ (\mathcal{D}_t, C_t) \ \text{and} \ (\mathcal{D}_t \oplus C_t \oplus O_t, S_t),$$
where $\oplus$ denotes sequence concatenation.
Often, $O_t$ is empty, and thus the model maps the dialogue history directly to the response ($\mathcal{D}_t \to S_t$), as we have seen in the previous chapter. An example of input-output pairs is shown in Figure 4.2.
Finally, we define a dialogue dataset as $\mathscr{D} = \{(X_1, Y_1, d_1), \dots\}$, where $(X, Y)$ is a general input-output pair from one of the four settings under consideration, and $d$ the dialogue domain under consideration (e.g., hotel).
We employ causal language models (e.g., GPT-2), which are often used in current state-of-the-art task-oriented dialogue models, as in peng2020soloist and hosseini2020simple. Then, given the concatenation of the input and output sequences $X$ and $Y$, we compute the conditional language model distribution using the chain rule of probability [bengio2003neural] as
$$p_{\theta}(Y \mid X) = \prod_{i=1}^{n} p_{\theta}(y_i \mid y_0, \cdots, y_{i-1}, X),$$
where $\theta$ is the model's parameters. The parameters are trained to minimize the negative log-likelihood over a dataset of input-output pairs, which in our case is the data of the four settings. Formally, we define the loss as:
$$\mathcal{L}(\mathscr{D}) = -\sum_{(X, Y) \in \mathscr{D}} \sum_{i=1}^{n} \log p_{\theta}(y_i \mid y_{<i}, X), \tag{4.3}$$
where $n$ is the maximum sequence length in $\mathscr{D}$. At inference time, given the input sequence $X$, the model parameterized by $\theta$ autoregressively generates the output sequence $Y$.
4.1.2 Continual Learning
The goal of continual learning is to learn a set of tasks sequentially without catastrophically forgetting the previously learned ones. In TODs, we cast CL as learning a sequence of domains. Let us define a curriculum of domains as an ordered set $\{\mathscr{D}_1, \dots, \mathscr{D}_T\}$, where $\mathscr{D}_k$ is a dataset under the domain $d_k$. In addition, we denote the model's parameters after learning the $k$-th task by $\theta_k$.
Following the recently defined taxonomy for CL [wortsman2020supermasks], we study the setting in which the task-ID is provided during training, but not during testing (GNs: task given during training, no task inference; shared labels), meaning that, during training, the model is aware of which domain it is currently learning, but during testing, the model is evaluated without specifying the dialogue domain. This assumption makes our CL setting more challenging but more realistic, since at inference time users do not explicitly specify in which domain they want to operate.
We consider three CL approaches: regularization, rehearsal and architectural. In our experiments, we describe the most commonly used methods within each approach, especially those known to work well in language tasks.
Regularization methods add a regularization term to the loss to keep the parameters $\theta_k$ currently being learned close to the previously learned $\theta_{k-1}$, thus avoiding interference. Formally, the loss at task $k$ is:
$$\mathcal{L}_k = \mathcal{L}(\mathscr{D}_k) + \lambda \sum_{i} \Omega_i \left(\theta_{k,i} - \theta_{k-1,i}\right)^2,$$
where $\theta_{k-1}$ is a copy of the previously learned parameters, frozen at this stage, and $\Omega$ weights the importance of each parameter. In our experiments, we consider two kinds of $\Omega$: the identity function (L2) and the Fisher information matrix [kirkpatrick2017overcoming] (EWC).
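The regularized loss is straightforward to compute once the per-parameter weights are chosen. A minimal sketch (parameter values, `lam`, and the diagonal Fisher approximation are illustrative):

```python
import numpy as np

# Sketch of the regularized CL loss: task loss plus a weighted penalty pulling
# the current parameters toward the previous task's frozen parameters.
# omega is all-ones for L2, or a (diagonal) Fisher information estimate for EWC.
def regularized_loss(task_loss, theta, theta_prev, omega, lam=0.01):
    penalty = np.sum(omega * (theta - theta_prev) ** 2)
    return task_loss + lam * penalty

theta_prev = np.array([1.0, -0.5, 2.0])   # frozen parameters after task k-1
theta      = np.array([1.2, -0.4, 1.5])   # parameters while learning task k
l2  = regularized_loss(0.8, theta, theta_prev, omega=np.ones(3))
ewc = regularized_loss(0.8, theta, theta_prev, omega=np.array([10.0, 0.1, 1.0]))
```

EWC's Fisher weights penalize drift on parameters that mattered for the previous task more heavily than L2's uniform weights do.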
Rehearsal methods use an episodic memory $\mathcal{M}$ to store examples from the previously learned domains, and re-use them while learning new tasks. The most straightforward method is to add the content of the memory to the current task data [robins1995catastrophic]. Following our notation, the model is optimized using $\mathscr{D}_k \cup \mathcal{M}$, and we refer to this method as REPLAY. Another rehearsal approach is to constrain the gradient updates so that the loss of the samples in memory never increases. More formally,
$$\min_{\theta_k} \ \mathcal{L}(\mathscr{D}_k) \quad \text{s.t.} \quad \mathcal{L}_{\theta_k}(\mathcal{M}) \le \mathcal{L}_{\theta_{k-1}}(\mathcal{M}).$$
Of this kind, the method Gradient Episodic Memory (GEM) [lopez2017gradient] computes the gradient constraint via a quadratic programming solver that scales with the number of parameters of the model. After our first investigation, we discovered that GEM is impractical for large language models, since they have millions of parameters and the constraints are computed for each batch. To cope with this computational complexity, chaudhry2018efficient proposed A-GEM, which efficiently computes the gradient constraints while remaining effective in CL tasks. Finally, a rehearsal method specific to language tasks is LAMOL [sun2019lamol], which, instead of storing samples in $\mathcal{M}$, trains a model that simultaneously learns to solve tasks and to generate training samples.
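The core of A-GEM reduces to a single projection: if the current task gradient conflicts with the (average) gradient on the memory, project it onto the closest non-conflicting direction. A minimal sketch with two-dimensional gradients:

```python
import numpy as np

# Sketch of the A-GEM gradient constraint: if the current task gradient g
# would increase the memory loss (negative dot product with the memory
# gradient g_ref), project g onto the feasible half-space.
def a_gem_project(g, g_ref):
    dot = float(g @ g_ref)
    if dot >= 0.0:                 # constraint already satisfied
        return g
    return g - (dot / float(g_ref @ g_ref)) * g_ref

g_ref = np.array([1.0, 0.0])       # average gradient on the memory M
g_bad = np.array([-1.0, 1.0])      # conflicts with the memory gradient
g_ok  = np.array([0.5, 0.5])       # already feasible

proj = a_gem_project(g_bad, g_ref)
```

Unlike GEM's quadratic program over all previous tasks, this single dot-product test and rank-one correction is what makes A-GEM tractable for models with millions of parameters.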
Architectural methods add task-specific parameters to an existing base model for each task. Of this kind, multiple models have been proposed, such as Progressive Net [rusu2016progressive], Dynamically Expandable Networks (DEN) [yoon2017lifelong] and Learn-to-Grow [li2019learn]. On the other hand, there are fixed-capacity methods that do not add task-specific parameters, but instead learn parameter masks [fernando2017pathnet], usually binary [mallya2018piggyback], to select task-specific sub-networks. To the best of our knowledge, these models have been tested mostly on computer vision tasks, and they cannot easily handle our CL setting (i.e., no task-ID during testing).
Motivated by the lack of architectural baselines for CL in Seq2Seq modelling, we propose a novel architectural method called AdapterCL. Our proposed method parameterizes each task using residual adapters [houlsby2019parameter], as we have seen in the previous chapter, and it uses an entropy-based classifier to select which adapter to use at testing time. This method is designed for large pre-trained language models, e.g., GPT-2, since only the task-specific parameters are trained, while the original weights are left frozen.
To continually learn new tasks, we first spawn a new adapter, parameterized by $\theta_k$, and then we train its parameters as in Equation 4.3. For instance, given the dataset $\mathscr{D}_k$ and the model with its corresponding adapter parameterized by $\theta_k$, the loss is defined as:
$$\mathcal{L}_{\theta_k}(\mathscr{D}_k) = -\sum_{(X, Y) \in \mathscr{D}_k} \sum_{i=1}^{n} \log p_{\theta_k}(y_i \mid y_{<i}, X).$$
Importantly, the loss is optimized only over the adapter parameters $\theta_k$, which guarantees that each task is learned independently. An example of AdapterCL learning tasks continuously is shown in Figure 4.3.
In our CL setting, the task-ID is provided during training, and thus each $\theta_k$ is optimized over the corresponding $\mathscr{D}_k$. During testing, however, the task-ID is not provided, and thus the model has to predict which adapter to use to accomplish the task. This step is not required in regularization and rehearsal approaches, since a single set of parameters is optimized during training.
Inspired by wortsman2020supermasks, we propose to utilize the perplexity of each adapter over the input as a measure of uncertainty. Thus, by selecting the adapter with the lowest perplexity, we select the most confident model to generate the output sequence. The perplexity of an input sequence $X = \{x_0, \dots, x_n\}$ is defined as
$$\text{PPL}(X) = \exp\left(-\frac{1}{n} \sum_{i=1}^{n} \log p(x_i \mid x_{<i})\right).$$
Therefore, given the set of adapters parameterized by $\theta_1, \dots, \theta_T$, each of which is trained respectively with $\mathscr{D}_1, \dots, \mathscr{D}_T$, and an input sample $X$, we compute:
$$\alpha_k = \text{PPL}_{\theta_k}(X) \qquad \forall k \in \{1, \dots, T\}, \tag{4.8}$$
where each $\alpha_k$ represents the confidence of adapter $k$ for the input $X$. The task-ID is thus selected as
$$\hat{k} = \operatorname*{arg\,min}_{k} \ \alpha_k.$$
The perplexity-based selector requires a number of forward passes linear in the number of adapters (Equation 4.8), but it has the advantage of not requiring a further classifier, which itself would suffer from catastrophic forgetting and would require an episodic memory. An example of perplexity-based adapter selection is shown in Figure 4.4.
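The selection rule can be sketched end-to-end. The per-token probabilities below are toy stand-ins for what each adapter's language model would assign to the same input (domain names and values are illustrative):

```python
import math

# Sketch of perplexity-based adapter selection: score the input under each
# adapter's language model and pick the most confident (lowest perplexity).
def perplexity(token_probs):
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# hypothetical per-token probabilities of the same input X under 3 adapters
adapter_scores = {
    "hotel":      [0.4, 0.5, 0.6],   # this adapter "recognizes" the input
    "restaurant": [0.1, 0.2, 0.1],
    "taxi":       [0.05, 0.1, 0.2],
}
ppl = {task: perplexity(p) for task, p in adapter_scores.items()}
selected = min(ppl, key=ppl.get)     # argmin over adapters
```

The adapter trained on data resembling the input assigns it higher token probabilities, hence a lower perplexity, and wins the argmin.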
4.3 Experimental Settings
In this section we describe 1) the datasets used for creating the learning curriculum, 2) the evaluation metric used to evaluate the different settings, and 3) the experimental setups.
To the best of our knowledge, there is no benchmark for CL in dialogue systems with a high number of tasks to be learned continuously and with multiple training settings. The closest to ours is the method from mi2020continual, which continuously learns five domains in the NLG setting. In general, NLP benchmarks for CL use no more than 10 tasks [sun2019lamol, d2019episodic]. Consequently, we propose a CL benchmark in which we jointly pre-process four task-oriented datasets: Task-Master 2019 (TM19) [byrne-etal-2019-taskmaster], Task-Master 2020 (TM20) [byrne-etal-2019-taskmaster], Schema Guided Dialogue (SGD) [rastogi2019towards] and MultiWoZ (MWOZ) [budzianowski2018large]. This results in a curriculum of 37 domains to be learned continuously under four settings: INTENT classification, DST, NLG, and E2E. This is possible because the four datasets provide the speech act annotation for both the user and the system turns, as well as the dialogue state. To avoid any domain overlapping, we select only the dialogues with a single domain, and we do not merge domains with similar or identical names or semantics. For example, the restaurant domain appears in TM19, TM20 and MWOZ, with different slot names and values. Thus, we intentionally keep these data samples separate to model scenarios in which multiple APIs are available for the same domain.
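The two filtering rules above (keep single-domain dialogues; never merge same-named domains across datasets) can be sketched as follows. The dialogue records are invented; only the bookkeeping mirrors the text.

```python
# Sketch of the curriculum construction: keep only single-domain dialogues,
# and key tasks by the (dataset, domain) pair so that identically named
# domains from different datasets stay separate. Records are made up.

dialogues = [
    {"dataset": "TM19", "domains": ["restaurant"]},
    {"dataset": "TM20", "domains": ["restaurant"]},
    {"dataset": "MWOZ", "domains": ["restaurant", "hotel"]},  # multi-domain: dropped
    {"dataset": "SGD",  "domains": ["flight"]},
]

curriculum = {}
for d in dialogues:
    if len(d["domains"]) != 1:              # rule 1: single-domain dialogues only
        continue
    task = (d["dataset"], d["domains"][0])  # rule 2: same name, different API -> different task
    curriculum.setdefault(task, []).append(d)

assert ("TM19", "restaurant") in curriculum
assert ("TM20", "restaurant") in curriculum  # kept separate from TM19's restaurant
assert len(curriculum) == 3                  # the multi-domain MWOZ dialogue was dropped
```

Keying on the (dataset, domain) pair is what yields distinct tasks even when the surface domain name repeats.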
4.3.2 Evaluation Metrics
Automatic evaluations for E2E TODs are challenging, especially for the response generation task. To overcome this issue, we use well-defined metrics based on the three modularized settings. For each of the three sub-tasks, we define the relevant metrics as follows:
INTENT recognition is evaluated using the accuracy between the generated intents and the gold labels.
DST is evaluated with the Joint Goal Accuracy (JGA) [wu2019transferable] over the gold dialogue states.
NLG is evaluated using both the BLEU score [papineni-etal-2002-bleu] and the slot error rate (EER) [wen2015semantically], which is computed as the ratio between the number of slot values not appearing in the response and the total number of slots. In datasets such as SGD, some slots take binary values, e.g., yes or no, and thus we exclude these from the count, as in kale2020few. In the E2E setting, if the API output is empty, then we rely on the BLEU score alone.
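The EER computation just described can be sketched as a simple membership check. The function name and the binary-value handling via a literal yes/no set are illustrative simplifications.

```python
# Sketch of the slot error rate (EER): the fraction of requested slot values
# that never appear in the generated response. Binary slot values
# (e.g., "yes"/"no") are excluded from the count, as described above.

def slot_error_rate(slot_values, response):
    # Drop binary-valued slots before counting.
    checked = [v for v in slot_values if v.lower() not in {"yes", "no"}]
    if not checked:
        return 0.0
    missing = sum(1 for v in checked if v not in response)
    return missing / len(checked)

response = "I booked a table at Sushi Bar for 4 people."
rate = slot_error_rate(["Sushi Bar", "4", "19:00", "yes"], response)
assert abs(rate - 1 / 3) < 1e-9  # "19:00" is missing; "yes" is excluded
```

A real implementation would match delexicalized slot placeholders rather than raw substrings, but the ratio is computed the same way.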
Independently of these metrics, we also compute CL-specific metrics such as the average metric through time (Avg. Metric), as in lopez2017gradient. We assume access to the test set of each of the $T$ tasks, and after the model finishes learning task $t_i$, we evaluate its test performance on all tasks in the curriculum. To elaborate, we construct the matrix $R \in \mathbb{R}^{T \times T}$, where $R_{i,j}$ is the test metric (e.g., BLEU, JGA) of the model on task $t_j$ after observing the last sample from task $t_i$. Then we define the average accuracy after task $t$ as
$$\text{Avg. Metric}(t) = \frac{1}{t}\sum_{j=1}^{t} R_{t,j}.$$
The Avg. Metric score is useful for understanding the learning dynamics through time of different baselines. Further metrics such as Backward-Transfer and Forward-Transfer [lopez2017gradient] are available to distinguish baselines with similar Avg. Metric scores, but we limit our evaluation to this metric, since there is a large gap among the baselines. Finally, to evaluate the adapter selection, we use the accuracy over the gold task-ID.
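The Avg. Metric computation can be sketched directly from the matrix of per-task test scores. The numbers below are toy values chosen to illustrate forgetting on an early task; one plausible reading of the through-time plot is used, averaging over the tasks seen so far.

```python
# Sketch of the Avg. Metric through time: R[i][j] is the test metric on
# task j after training on task i. Toy numbers; note how task 0's score
# degrades as later tasks are learned (catastrophic forgetting).

R = [
    [80.0,  0.0,  0.0],   # after task 0
    [60.0, 75.0,  0.0],   # after task 1: task 0 has degraded
    [40.0, 55.0, 70.0],   # after task 2: task 0 degrades further
]

def avg_metric(R, t):
    """Average test metric over tasks 0..t after learning task t."""
    return sum(R[t][: t + 1]) / (t + 1)

history = [avg_metric(R, t) for t in range(len(R))]
assert history[0] == 80.0
assert history[1] == 67.5
assert abs(history[2] - 55.0) < 1e-9
```

Plotting `history` against the task index gives exactly the kind of learning-dynamics curve discussed for Figure 4.5.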
| Method  | +Param | Hours       | INTENT ↑    | JGA ↑        | EER ↓       | BLEU ↑       |
|---------|--------|-------------|-------------|--------------|-------------|--------------|
| VANILLA | -      | 0.21 ± 0.02 | 4.1 ± 1.4   | 4.91 ± 4.5   | 48.7 ± 3.9  | 6.38 ± 0.6   |
| L2      |        | 0.56 ± 0.06 | 3.8 ± 1.4   | 3.81 ± 3.4   | 55.7 ± 7.1  | 5.4 ± 0.9    |
| EWC     |        | 0.91 ± 0.10 | 3.9 ± 1.3   | 5.22 ± 4.5   | 58.2 ± 3.7  | 5.06 ± 0.5   |
| AGEM    | -      | 0.38 ± 0.04 | 34.0 ± 6.4  | 6.37 ± 4.0   | 62.1 ± 6.9  | 4.54 ± 0.6   |
| LAMOL   | -      | 2.32 ± 1.24 | 7.5 ± 6.4   | 4.55 ± 3.5   | 66.1 ± 6.9  | 3.0 ± 0.9    |
| REPLAY  | -      | 0.62 ± 0.23 | 81.1 ± 1.4  | 30.33 ± 1.2  | 17.8 ± 0.9  | 17.4 ± 0.7   |
| ADAPT   |        | 0.20 ± 0.02 | 90.5 ± 0.6  | 35.1 ± 0.5   | 31.78 ± 1.3 | 16.76 ± 0.4  |
| MULTI   | -      | 4.14 ± 2.23 | 95.5 ± 0.1  | 48.9 ± 0.2   | 12.56 ± 0.2 | 23.61 ± 0.1  |

Table 4.2: Avg. Metric at the end of the curriculum in the E2E setting (± denotes standard deviation), where +Param is the additional parameters needed per task, and Hours is the average hours per epoch (single NVIDIA 2080Ti) required for training a new domain.
4.3.3 Baselines and Settings
The main goal is to compare the performance of different CL approaches and to understand the trade-offs among them. Therefore, following the definition provided in Section 4.1.2, we compare 1) EWC and L2, 2) A-GEM, LAMOL, and REPLAY, and 3) AdapterCL. Additionally, we provide a baseline trained on each task continuously, namely, VANILLA, without any regularization or memory, and a multitask baseline (MULTI), which is trained on all the data in the curriculum at the same time. In L2, EWC, and A-GEM, we tune the hyper-parameter $\lambda$ in the range 0.0001 to 100, and in the rehearsal-based methods, REPLAY and A-GEM, we keep 50 samples per task, for a total of 1,850 samples in memory at the end of the curriculum. This is particularly important since, if we store in memory all the samples of the seen tasks, the model incurs a high training cost. Arguably, this could be an option if the per-task sample size is small, but this is not always possible, e.g., for large language models [brown2020language]. Therefore, the assumption of minimizing the number of samples in memory is valid and is widely used in the CL literature [mi2020continual]. Finally, for the AdapterCL, we tune the bottleneck size among 10, 50, 100, and 200. Interested readers can refer to Appendix A for further details of the selected hyper-parameters. In CL the model is not able to decide the order of the tasks. Therefore, we create five learning curricula by randomly permuting the 37 tasks.
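Generating the five random curricula is straightforward; a seeded sketch (function name and seed are illustrative choices) might look like this:

```python
import random

# Sketch of curriculum creation: since a CL model cannot choose the task
# order, we evaluate on several random permutations of the 37 tasks.
# Task names and the seed are placeholders.

TASKS = [f"task_{i}" for i in range(37)]

def make_curricula(tasks, n=5, seed=0):
    rng = random.Random(seed)
    # random.sample with k == len(tasks) returns a full permutation.
    return [rng.sample(tasks, len(tasks)) for _ in range(n)]

curricula = make_curricula(TASKS)
assert len(curricula) == 5
assert all(sorted(c) == sorted(TASKS) for c in curricula)  # true permutations
```

Reporting the ± standard deviation over such permutations, as in Table 4.2, controls for the effect of task ordering.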
4.4 Results & Analysis
The main results in the E2E setting are summarized in Table 4.2. In this table we report the Avg. Metric at the end of the curriculum, which is equivalent to the average test-set performance over all the tasks, and the resources used by each model.
From the table, we can observe that 1) both regularization-based methods (L2/EWC) and some rehearsal-based methods (AGEM/LAMOL) cannot continually learn tasks without incurring catastrophic forgetting, 2) REPLAY and AdapterCL perform comparably well on the Intent and DST tasks, 3) REPLAY works best on the NLG task, showing that transferring knowledge between tasks is needed, and 4) no CL method reaches the performance of the multi-task baseline, especially on the DST task. In addition, the adapter selection accuracy based on Equation 4.9 is 95.44 ± 0.2% in E2E, 98.03 ± 0.1% in Intent Recognition, 98.19 ± 0.1% in DST, and 93.98 ± 0.1% in NLG.
Although these numbers are meaningful, they do not describe the entire learning history of the curriculum. To better understand these dynamics, we plot the Avg. Metric in Equation 4.10 after each task is learned ($t$ in the equation). Figure 4.5 shows the plot for the considered metrics and all the baselines. From this figure we can better understand how REPLAY and AdapterCL outperform the other baselines and, interestingly, that LAMOL performs as well as REPLAY on the first 12 tasks. This is because LAMOL learns to generate training samples instead of using an explicit memory, and thus the generation becomes harder as more and more tasks are shown. This result further strengthens our motivation to have a benchmark with a long curriculum.
4.4.1 Training Time Analysis
In Figure 4.6 we plot the training time (Hours × Epochs) required to add a new domain to an existing model. A clear trend emerges: rehearsal-based methods (REPLAY, LAMOL) require a linearly increasing amount of time to add new domains, while for ADAPTER and VANILLA the time remains constant. This is even more evident when the entire training set of all the previous tasks is used for training (REPLAY-ALL), which makes adding new domains very expensive. The average time across domains for all the baselines is shown in Table 4.2. ADAPTER-based CL also incurs an additional cost for selecting which parameters to use during testing: on a single 2080Ti, the average time to select the adapter is roughly as expensive as decoding 4 tokens.
4.4.2 No Free Lunch
Based on the results shown in Table 4.2, and especially based on the resources used by each method, we conclude that there is no free lunch in terms of the resources needed to avoid the catastrophic forgetting problem. To elaborate, in both REPLAY and AdapterCL, the resources used grow linearly with the number of tasks; i.e., in REPLAY the number of samples stored in the episodic memory grows linearly (50 times the number of tasks), and in AdapterCL the number of parameters grows linearly (number of adapter parameters times the number of tasks). Figure 4.7 describes the high-level intuition behind this concept by plotting the number of tasks against the number of parameters and the episodic memory sizes needed.
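The two linear growth curves can be made explicit with a back-of-the-envelope sketch. The per-adapter parameter count below is a placeholder, not the thesis' actual figure; only the 50-samples-per-task memory budget comes from the text.

```python
# Back-of-the-envelope sketch of the "no free lunch" trade-off: both
# REPLAY's memory and AdapterCL's parameter count grow linearly with the
# number of tasks. PARAMS_PER_ADAPTER is an illustrative placeholder.

SAMPLES_PER_TASK = 50           # from the experimental setup
PARAMS_PER_ADAPTER = 1_000_000  # placeholder value

def replay_memory(num_tasks):
    """Episodic memory size for REPLAY after num_tasks tasks."""
    return SAMPLES_PER_TASK * num_tasks

def adapter_params(num_tasks):
    """Extra parameters for AdapterCL after num_tasks tasks."""
    return PARAMS_PER_ADAPTER * num_tasks

assert replay_memory(37) == 1850           # matches the 1,850 samples above
assert adapter_params(37) == 37_000_000    # linear in the number of tasks
```

Whichever resource is cheaper for a given deployment (stored samples vs. stored parameters) determines which family of methods is preferable.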
Therefore, given a resource budget, different baselines are preferable in terms of memory or parameters. The main advantage of using memory-based methods (e.g., REPLAY) is that no parameters are added, and thus the resulting model is closer to the multitask baseline. However, this comes with the disadvantage of losing the learned weights of the original pre-trained model. This is particularly critical for large pre-trained language models, which provide a good starting point for fine-tuning on new tasks. On the other hand, the main advantage of parameter-isolation methods (e.g., AdapterCL) is the ability to retain the original weights and to control which task to trigger, given a certain input. The latter is important in scenarios where just a subset of a domain is shown to the user (e.g., only one particular restaurant API). The main disadvantage, however, is the lack of knowledge transfer among tasks, since each dataset is trained in isolation.
4.4.3 Analysis: Episodic Memory Size
In this section, we analyze the effect of increasing the episodic memory size for the REPLAY method. Trivially, by including all the training samples in the memory, the model, at the last task, converges to the multitask baseline. Then, the question of how many samples to keep per task to avoid catastrophic forgetting becomes important. In light of this, Figure 4.8 shows the performance of the model at different episodic memory sizes on the different tasks. Here, we observe that by storing only a few samples per task (10–50) the model still greatly suffers from catastrophic forgetting, whereas with around 500 samples per task, which is equivalent to a total of 18,500 samples in our setting, the performance is closer to that of the multitask baseline (i.e., a possible upper bound).
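The per-task sampling behind this analysis can be sketched as follows; the dataset contents and function name are made up, and only the bookkeeping (k samples kept per task) mirrors the setup described above.

```python
import random

# Sketch of building an episodic memory for REPLAY: keep k samples per
# task. Dataset contents are invented; the totals match the text.

def build_memory(datasets, k, seed=0):
    rng = random.Random(seed)
    memory = []
    for samples in datasets.values():
        # Sample without replacement, capped at the task's dataset size.
        memory.extend(rng.sample(samples, min(k, len(samples))))
    return memory

datasets = {t: [f"task{t}_sample_{i}" for i in range(600)] for t in range(37)}
assert len(build_memory(datasets, k=50)) == 37 * 50    # 1,850 samples
assert len(build_memory(datasets, k=500)) == 37 * 500  # 18,500 samples
```

Sweeping `k` over {10, 50, 500, ...} and retraining reproduces the memory-size analysis of Figure 4.8.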
4.5 Short Summary
We propose a benchmark for CL in TODs, with 37 tasks to be learned continuously in four settings: intent recognition, DST, NLG, and end-to-end. Then, we implement three CL methodologies: regularization, rehearsal, and architectural. For the latter, we propose a simple yet effective method based on residual adapters, with a perplexity-based classifier to select which adapter to use at inference time. Finally, we analyze the trade-off between the performance, the number of parameters, and the episodic memory sizes of the evaluated baselines, unveiling an insightful trade-off ("no free lunch") among the methods.
We use the standard encoder-decoder architecture and avoid any task-specific designs [wu2019global, reddy2018multi], as we aim to build a generic conversation model for both chit-chat and task-oriented dialogues. Following the notation in Chapter 4, we define two input-output sequence types as
where the elements are the dialogue history, an API call, the output of the API, and the system response. The API output is either the result of an API execution (e.g., a table) or plain text (e.g., a persona description), depending on the task. Therefore, we define a dialogue dataset as a set of general input-output pairs drawn from the two possible types (API and Response). Finally, we define a binary skill vector that specifies the type of skill required to generate the output. This can be considered a prior vector for learning to select the correct expert during training (the vector is absent during testing). For example, in Table 5.1, the first response is of type API in the Hotel domain; thus the skill vector has the API and Hotel positions set to one, while all the other skills/domains are set to zero (under the assumption that each index in the vector is assigned a semantic skill, e.g., a dedicated API position). More importantly, we may set the vector to have multiple skills active, to force the model to compose skills and achieve a semantic compositionality of different experts.
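The skill vector described above can be sketched with an assumed, fixed index layout; the skill names and ordering below are illustrative, not the thesis' actual schema.

```python
# Sketch of the binary skill vector: one position per skill/domain, set to
# one when that skill is required to generate the output. The skill list
# and its ordering are assumed for illustration.

SKILLS = ["API", "Response", "Hotel", "Flight", "Restaurant"]

def skill_vector(active):
    """Binary prior over skills; multiple skills may be set to force
    the model to compose experts."""
    return [1 if s in active else 0 for s in SKILLS]

# First example of Table 5.1: an API-type output in the Hotel domain.
v = skill_vector({"API", "Hotel"})
assert v == [1, 0, 1, 0, 0]
assert sum(v) == 2  # all other skills/domains stay zero
```

At test time this vector is absent, so the model must infer the required skills from the input alone.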
5.1.1 Attention over Parameters
Following the Transformer notation in Chapter 2, the main idea is to produce a single set of parameters for the TRS decoder as the weighted sum of $r$ independently parameterized decoders. This process is similar to attention [luong-pham-manning:2015:EMNLP], where the memories are the parameters and the query is the encoded representation. Let us define $\Theta = [\theta^{1}, \dots, \theta^{r}]$ as the list of parameters of the $r$ decoders, since a TRS decoder is represented by its parameters $\theta$. Since each $\theta^{i}$ can be sized in the order of millions, we assign a corresponding key vector $k^{i}$ to each $\theta^{i}$, similar to key-value memory networks [miller2016key]. Thus, we use a key matrix