VariationalTransformer
Despite the great promise of Transformers in many sequence modeling tasks (e.g., machine translation), their deterministic nature hinders them from generalizing to high-entropy tasks such as dialogue response generation. Previous work proposes to capture the variability of dialogue responses with a recurrent neural network (RNN)-based conditional variational autoencoder (CVAE). However, the autoregressive computation of the RNN limits training efficiency. Therefore, we propose the Variational Transformer (VT), a variational self-attentive feed-forward sequence model. The VT combines the parallelizability and global receptive field of the Transformer with the variational nature of the CVAE by incorporating stochastic latent variables into Transformers. We explore two types of VT: 1) modeling discourse-level diversity with a global latent variable; and 2) augmenting the Transformer decoder with a sequence of fine-grained latent variables. The proposed models are then evaluated on three conversational datasets with both automatic metrics and human evaluation. The experimental results show that our models improve over standard Transformers and other baselines in terms of diversity, semantic relevance, and human judgment.
Convolutional and fully attentional feed-forward architectures, such as Transformers Vaswani et al. (2017), have emerged as effective alternatives to RNNs Dehghani et al. (2018) in a wide range of NLP tasks. These architectures remove the temporal dependency of computation during training and effectively address the long-standing vanishing-gradient problem of recurrent models by processing all inputs simultaneously. Notably, Transformers apply a fully attentional strategy, in which each token in the sequence is informed by the other tokens via a self-attention mechanism. This acts as an effectively global receptive field across the whole sequence, which is absent in RNNs. Despite the powerful modeling capability of Transformers, they often fail to model the one-to-many relation in dialogue response generation tasks (given a similar dialogue history, there may exist many valid responses) Zhao et al. (2017) due to their deterministic nature. As a result, they generate dull and generic responses (e.g., “I am not sure”), especially with greedy and beam search, which are widely used in other sequence modeling tasks. There have been attempts to generate diverse and informative dialogue responses by incorporating latent variable(s) into the RNN encoder-decoder architecture. In particular, Zhao et al. (2017) adapt a conditional variational autoencoder (CVAE) to capture discourse-level variations of dialogue, while Goyal et al. (2017) and Du et al. (2018) integrate latent variables into the hidden states of the RNN decoder. However, the inherently sequential computation of the aforementioned models limits their efficiency for large-scale training.

In this paper, we introduce the Variational Transformer (VT), a variational self-attentive feed-forward sequence model, to address the aforementioned issues (the source code is available at https://github.com/zlinao/VariationalTransformer). The VT combines the parallelizability and global receptive field of the Transformer with the variational nature of the CVAE by incorporating stochastic latent variables into Transformers. We explore two types of VT: 1) the Global Variational Transformer (GVT), and 2) the Sequential Variational Transformer (SVT). The GVT is an extension of the CVAE in Zhao et al. (2017), which models discourse-level diversity with a global latent variable, while the SVT, inspired by variational autoregressive models Goyal et al. (2017); Du et al. (2018), incorporates a sequence of latent variables into the decoding process by using a novel variational decoder layer. Unlike previous approaches Zhao et al. (2017); Goyal et al. (2017); Du et al. (2018), the SVT uses Non-causal Multi-head Attention, which attends to future tokens when computing the posterior latent variables, instead of using an additional encoder.

The proposed VT architectures integrate stochastic latent variables into Transformers. The experimental results on three conversation datasets demonstrate that our models can generate more informative and coherent responses.
Conversational systems have been widely studied Weizenbaum and others (1966); Wallace (2009); Vinyals and Le (2015); Serban et al. (2016). Compared to rule-based systems Weizenbaum and others (1966); Wallace (2009), sequence-to-sequence conversation models achieve superior performance in terms of scalable training and generalization ability Vinyals and Le (2015). However, it has been pointed out that encoder-decoder models tend to generate generic and repetitive responses like “I am sorry” Li et al. (2016a). To address this issue, there have been three main lines of work. The first adds additional information (e.g., persona) as input to guide the model to generate more informative responses Li et al. (2016b); Zhang et al. (2018). The second modifies the learning objective to promote more diverse generation Li et al. (2016a), and the third integrates stochastic latent variables into Seq2Seq models by using the CVAE framework Serban et al. (2017); Zhao et al. (2017). Our work falls within this third line, introducing a novel model, the Variational Transformer, to improve dialogue response generation.

Many works have attempted to combine CVAEs with encoder-decoder architectures for sequence generation tasks. Zhang et al. (2016) propose a variational encoder-decoder model for neural machine translation, while Li et al. (2017) apply variational recurrent neural networks (VRNN) Chung et al. (2015) to text summarization. Zhao et al. (2017) and Zhou and Wang (2018) explore incorporating meta features into the CVAE framework in dialogue response generation tasks. Goyal et al. (2017) and Du et al. (2018) propose variational autoregressive decoders, which are enhanced by highly multi-modal latent variables to capture the high variability in dialogue responses. Le et al. (2018) further augment variational autoregressive decoders with dynamic memory networks to improve generation quality. We unify the previous successful ideas of the CVAE and explore combinations of the CVAE and the Transformer.

Taking advantage of the parallel-in-time structure and global receptive field, Transformers Vaswani et al. (2017) have recently been shown to achieve impressive results on various sequence modeling tasks. Based on this, several follow-up models have been presented. The Image Transformer Parmar et al. (2018) has been proposed for image generation, while the MultiModel Kaiser et al. (2017) integrates convolution, attention and sparsely-gated mixture-of-expert blocks into a single deep-learning model for simultaneously learning multiple tasks from various domains.
Lin et al. (2019) propose a fully attentional mixture-of-expert model (MoEL) for empathetic dialogue modeling. The Universal Transformer Dehghani et al. (2018) incorporates the recurrent inductive bias of RNNs into the standard Transformer and achieves better results on a wide range of algorithmic and language understanding tasks. Kaiser et al. (2018) introduce the Latent Transformer (LT) for non-autoregressive machine translation. During training, the LT first autoencodes a target sequence into a shorter sequence of discrete latent variables. Then a parallel decoder decodes the target using the discrete latent variables and an input sequence. Different from the LT Kaiser et al. (2018), the VT generates continuous latent variables during the decoding process.

The CVAE framework Sohn et al. (2015) represents a dyadic conversation via three random variables: the input condition $c$, including conversation context and meta features (meta features can be ignored when not available); a latent variable $z$; and the target response $x$. A CVAE can be efficiently trained with Stochastic Gradient Variational Bayes (SGVB) Kingma and Welling (2013) by maximizing the variational lower bound of $x$ given $c$, according to:

$$p(x \mid c) = \int_z p(x \mid z, c)\, p(z \mid c)\, dz \tag{1}$$

The typical CVAE consists of a prior network $p_\theta(z \mid c)$, which is used to approximate $p(z \mid c)$; a recognition network $q_\phi(z \mid c, x)$, which is used to approximate the posterior distribution $p(z \mid c, x)$; and a decoder $p_\theta(x \mid z, c)$, which is used to approximate $p(x \mid z, c)$. By assuming that $z$ follows a multivariate Gaussian distribution with a diagonal covariance matrix, the evidence lower bound (ELBO) can be written as

$$\mathcal{L}_{ELBO} = \mathcal{L}_{REC} - \mathcal{L}_{KL} = \mathbb{E}_{q_\phi(z \mid c, x)}\big[\log p_\theta(x \mid z, c)\big] - \mathrm{KL}\big(q_\phi(z \mid c, x) \,\|\, p_\theta(z \mid c)\big) \le \log p(x \mid c) \tag{2}$$

where $\mathcal{L}_{REC}$ denotes the reconstruction loss and $\mathcal{L}_{KL}$ denotes the Kullback-Leibler (KL) divergence between the posterior and prior.
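Because both the prior and the posterior are assumed to be diagonal Gaussians, the KL term of the ELBO has a closed form. The following is a minimal sketch of that computation in plain Python, not the authors' implementation; the function names `kl_diag_gaussians` and `elbo` and the use of log variances as inputs are assumptions made for illustration.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """Closed-form KL(q || p) between two diagonal Gaussians.

    Each argument is a list of per-dimension means or log variances
    (hypothetical interface, chosen for illustration).
    """
    kl = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        kl += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return kl

def elbo(expected_log_likelihood, kl):
    """ELBO = reconstruction term minus KL term."""
    return expected_log_likelihood - kl
```

For example, `kl_diag_gaussians([1.0], [0.0], [0.0], [0.0])` compares two unit-variance Gaussians whose means differ by one, and the KL term vanishes when the posterior equals the prior.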
In dialogue generation tasks, previous works Zhao et al. (2017); Zhou and Wang (2018) apply RNN encoders (with GRU or LSTM cells) to encode dialogue contexts and responses separately. The condition $c$ is represented by the concatenation of the last hidden state of the context encoder and the meta features (e.g., topic, emotion), while the response $x$ is represented by the last hidden state of the response encoder. Then the prior network $p_\theta(z \mid c)$ and the recognition network $q_\phi(z \mid c, x)$, parameterized by multi-layer perceptrons (MLPs), are applied to approximate the means and the log variances of the prior latent distribution and the posterior latent distribution. With the reparameterization trick Kingma and Welling (2013), we can obtain samples of the prior latent variable (for testing) from the prior distribution and samples of the posterior latent variable (for training) from the posterior distribution. Finally, an RNN decoder uses $z$ and $c$ as the initial state to predict the response $x$.

The vanishing latent variable problem Bowman et al. (2016) is a common issue in RNN-based CVAEs. That is, the powerful autoregressive RNN decoder first learns to ignore the latent variable and decodes the response by conditioning only on the previous tokens. Thus the latent variable fails to encode meaningful information, and the CVAE degenerates into a seq2seq model. To alleviate this issue, KL annealing Bowman et al. (2016) and the bag-of-word loss Zhao et al. (2017) have been proposed, and have shown effectiveness in various dialogue tasks Zhao et al. (2017); Zhou and Wang (2018).
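The reparameterization trick and KL annealing mentioned above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the linear warm-up schedule is an assumption, since the text does not state which annealing schedule is used.

```python
import math
import random

def reparameterize(mu, logvar, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    mu and logvar are per-dimension lists, as an MLP prior or
    recognition network might output them (assumed interface).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def kl_weight(step, warmup_steps):
    """Linear KL annealing (assumed schedule): weight grows from 0 to 1."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

def annealed_objective(expected_log_likelihood, kl, step, warmup_steps):
    """ELBO with the KL term scaled by the annealing weight."""
    return expected_log_likelihood - kl_weight(step, warmup_steps) * kl
```

Early in training the KL weight is close to zero, so the decoder is not penalized for using the latent variable; the penalty is then phased in gradually.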
The aforementioned RNN-based CVAE framework integrates the latent variable into the initial state of the RNN decoder, whereas in a Transformer it is more flexible to incorporate the latent variable embedding into the first input token of the decoder to generate the initial state.

The overall architecture of the GVT is depicted in Figure 1. Different from RNNs, the Transformer encoder maps an input sequence of symbol representations to a sequence of contextualized representations Vaswani et al. (2017). In order to get fixed-dimension representations of the response and context, we add a special token at the beginning of the input sequence, as in BERT Devlin et al. (2018), to compute the weighted sum of the output representations via self-attention. Thus the output representation of this token is considered as the representation of the whole sequence. Then we introduce a recognition network and a prior network to compute the posterior latent variable and the prior latent variable, as in Zhao et al. (2017); Zhou and Wang (2018). We add the latent variable sample $z$ and the meta features $m$ (which can be ignored when not available) to $e_{SOS}$, the embedding of the start-of-sequence token:

$$e'_{SOS} = z + m + e_{SOS} \tag{3}$$
Finally, the Transformer decoder decodes the response sequentially while attending to the new embedding of the start-of-sequence token, which carries the latent information.

This design enhances the CVAE framework with the global receptive field, and each position of the GVT can directly access the latent information via the multi-head self-attention mechanism. However, we still observe that the GVT suffers from the vanishing latent variable problem, as RNN-based CVAEs do, because the decoder can bypass the latent information by paying less attention to this token. Hence, we apply KL annealing and the bag-of-word auxiliary loss, as in Zhao et al. (2017); Zhou and Wang (2018), to preserve the useful information of the latent variable. Therefore, the learning objective of the GVT is defined as follows:
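$$\mathcal{L}(\theta, \phi; x, c) = \mathcal{L}_{ELBO} + \mathbb{E}_{q_\phi(z \mid c, x)}\big[\log p(x_{bow} \mid z, c)\big] \tag{4}$$

where $x_{bow}$ is the bag-of-words representation of the response $x$.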
To augment the capacity of the latent variable with multi-modal distributions and to better utilize the latent information, we further explore incorporating a sequence of latent variables into the decoding process. We introduce the Sequential Variational Transformer (SVT) with a novel variational decoder layer that generates a latent variable for each position: $z = (z_1, \dots, z_T)$. Similar to Goyal et al. (2017), we interpret the latent variables as a generation plan for the future sequence. Unlike previous CVAE models, which use an extra encoder to encode the response separately Zhao et al. (2017); Zhou and Wang (2018) or use a backward RNN to encode the future sequence at each time step Goyal et al. (2017); Du et al. (2018), the SVT uses Non-causal Multi-head Attention, which leaks the future information to the recognition network for computing the posterior latent variables.

As shown in Figure 2, the SVT shares the same encoder as the standard Transformer Vaswani et al. (2017), while its decoder consists of a variational decoder layer followed by a stack of standard Transformer decoder layers. The variational decoder layer has two paths, for computing the posterior latent variable and the prior latent variable respectively. We denote them as the Posterior Path and the Prior Path.
The Prior Path (solid line in Figure 2) has a masked multi-head self-attention sub-layer, which performs causal attention on the shifted response, followed by a multi-head attention sub-layer, which performs encoder-decoder multi-head attention on the context encoder. The last sub-layer is composed of an MLP prior network, which approximates a prior latent variable $z_t$ for each position, and a Position-wise Feed-Forward Network (FFN), which fuses the latent information with the observed information representation computed before the prior network (shown in Figure 2). Specifically, we concatenate the observed information representation with $z_t$ as the input to the FFN, and the FFN passes the fused representation to the next layer. As in Vaswani et al. (2017), each sub-layer in the variational decoder layer is followed by a residual connection and layer normalization. That is, the output of each sub-layer is $\mathrm{LayerNorm}(h + \mathrm{Sublayer}(h))$, where $h$ is the sub-layer input.

We decompose the response as $x = (x_1, \dots, x_T)$ and the latent variable as $z = (z_1, \dots, z_T)$. The prior model produces a latent variable $z_t$ at each position by conditioning not only on the input condition $c$ (the concatenation of context and meta features), but also on the observed response tokens $x_{<t}$. By assuming that $z_t$ follows a multivariate Gaussian distribution, the prior model becomes:

$$p_\theta(z \mid c, x) = \prod_{t=1}^{T} p_\theta(z_t \mid c, x_{<t}) \tag{5}$$

where each $p_\theta(z_t \mid c, x_{<t})$ is a Gaussian whose parameters are produced by the MLP prior network.
The only difference between the Posterior Path (dashed line in Figure 2) and the Prior Path is that the mask is removed from the masked multi-head attention. Thus the masked (causal) multi-head attention becomes non-causal multi-head attention, which allows each position to attend to the subsequent positions. Then, the second multi-head attention sub-layer (which shares its weights with the Prior Path) performs posterior attention on the encoder and passes the posterior observed information to the recognition network. The recognition network produces the posterior latent variable for each position $z_t$ as:

$$q_\phi(z \mid c, x) = \prod_{t=1}^{T} q_\phi(z_t \mid c, x) \tag{6}$$

where each $q_\phi(z_t \mid c, x)$ is a Gaussian whose parameters are produced by the MLP recognition network.
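The difference between the two paths comes down to the attention mask. The sketch below contrasts a causal mask with a non-causal one and shows their effect on a single row of attention weights; it is a toy illustration in plain Python under assumed names (`causal_mask`, `noncausal_mask`, `masked_softmax`), not the layer used in the released code.

```python
import math

def causal_mask(length):
    """allowed[i][j] is True when position i may attend to position j <= i."""
    return [[j <= i for j in range(length)] for i in range(length)]

def noncausal_mask(length):
    """Every position may attend to every position, including future ones."""
    return [[True] * length for _ in range(length)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores; disallowed positions get weight 0."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

# With equal scores, the first position sees only itself under the causal
# mask, but spreads its attention over the whole response without it.
prior_row = masked_softmax([0.0, 0.0, 0.0], causal_mask(3)[0])
posterior_row = masked_softmax([0.0, 0.0, 0.0], noncausal_mask(3)[0])
```

Here `prior_row` puts all weight on the first token, while `posterior_row` also weights the two future tokens, which is how future information reaches the recognition network.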
During training, the Posterior Path guides the learning of the Prior Path via a KL divergence constraint:

$$\mathcal{L}_{KL} = \sum_{t=1}^{T} \mathrm{KL}\big(q_\phi(z_t \mid c, x) \,\|\, p_\theta(z_t \mid c, x_{<t})\big) \tag{7}$$

In the training phase, the posterior latent variables from Equation 6 are passed to the FFN, while in the testing phase the Posterior Path is blocked and the posterior latent variables are replaced with the prior latent variables from Equation 5.
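During the decoding process, each response token $x_t$ is generated by conditioning on the observed response tokens $x_{<t}$, the latent variables $z_{\le t}$, and the input condition $c$. The decoding process of the SVT is:

$$p_\theta(x \mid z, c) = \prod_{t=1}^{T} p_\theta(x_t \mid z_{\le t}, x_{<t}, c) \tag{8}$$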
As we expect the latent variables to be a generation plan for the future sequence, we inject such a bias into the latent variables by using an auxiliary loss: the Sequential Bag-of-Word (SBOW) loss proposed by Du et al. (2018). The idea of the SBOW auxiliary objective is to sequentially predict the bag of succeeding target words $x_{t:T}$ by using the latent variable $z_t$. In our case, the prediction of the succeeding words also leverages the observed information $c$ and $x_{<t}$. Thus the auxiliary loss at each position is computed by:

$$p_\xi(x_{t:T} \mid z_t, x_{<t}, c) = f_{aux}(z_t, x_{<t}, c) \tag{9}$$

where $f_{aux}$ is a feed-forward neural network with a softmax output.
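To make the SBOW targets concrete, the sketch below builds, for each position, the bag of words that remain to be generated, and scores a bag under a word distribution. It is an illustration under stated assumptions: the bag at position t is taken to include the token at t itself, bags are treated as sets rather than multisets, and the helper names are hypothetical.

```python
import math

def sbow_targets(response_tokens):
    """For each position t, return the set of tokens in x_t, ..., x_T."""
    targets = []
    remaining = set()
    for token in reversed(response_tokens):
        remaining = remaining | {token}
        targets.append(remaining)
    targets.reverse()
    return targets

def bag_log_likelihood(bag, word_probs):
    """Sum of log-probabilities of the bag's words under a softmax output.

    word_probs maps each word to its probability (toy stand-in for the
    output of the auxiliary feed-forward network).
    """
    return sum(math.log(word_probs[word]) for word in bag)

targets = sbow_targets(["i", "am", "fine"])
```

The targets shrink as decoding proceeds, so latent variables at early positions are pushed to encode more of the remaining response than those at late positions.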
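The evidence lower bound (ELBO) objective of the SVT is the sum of the reconstruction loss $\mathcal{L}_{REC}$ and the Kullback-Leibler divergence loss $\mathcal{L}_{KL}$ at each position:

$$\mathcal{L}_{ELBO} = \mathcal{L}_{REC} - \mathcal{L}_{KL} = \sum_{t=1}^{T} \Big( \mathbb{E}_{q_\phi(z \mid c, x)}\big[\log p_\theta(x_t \mid z_{\le t}, x_{<t}, c)\big] - \mathrm{KL}\big(q_\phi(z_t \mid c, x) \,\|\, p_\theta(z_t \mid c, x_{<t})\big) \Big) \tag{10}$$

We regularize the ELBO learning objective with an auxiliary loss $\mathcal{L}_{SBOW}$ to enhance the expressiveness of the latent variables. Therefore, the final learning objective is formulated as follows:

$$\mathcal{L}(\theta, \phi, \xi; x, c) = \mathcal{L}_{ELBO} + \mathcal{L}_{SBOW} \tag{11}$$

where

$$\mathcal{L}_{SBOW} = \mathbb{E}_{q_\phi(z \mid c, x)}\Big[\sum_{t=1}^{T} \log p_\xi(x_{t:T} \mid z_t, x_{<t}, c)\Big] \tag{12}$$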
We evaluate the proposed models on three conversation datasets: MojiTalk Zhou and Wang (2018), PersonaChat Zhang et al. (2018), and EmpatheticDialogues Rashkin et al. (2019).

MojiTalk: The dataset consists of 596,959 post and response pairs from Twitter. Each response is labeled with one emoji, which indicates the response emotion. There are 64 emoji labels in total, with an unbalanced distribution. We use the preprocessed data and vocabulary released by Zhou and Wang (2018) and follow the same train/validation/test split.

PersonaChat and EmpatheticDialogues: These are one-to-one multi-turn conversation datasets. In PersonaChat (Persona), the conversations revolve around personas, which are established by four to six persona sentences. In EmpatheticDialogues (ED), the conversations are mostly about a situation that happened to one of the speakers, and the other speaker tries to understand the feeling and reply accordingly. Both datasets are about modeling social skills, and the goal is to make the conversation more engaging for users. Therefore, we combine the train/validation/test sets of the two datasets.
We compare the proposed models with the following baselines:
Seq2Seq: An attention-based sequence-to-sequence model with the emoji vector as additional input, as described in MojiTalk Zhou and Wang (2018).

CVAE: An RNN-based conditional variational autoencoder for dialogue response generation Zhou and Wang (2018), which uses a multivariate Gaussian latent variable to model the response and concatenates it with the last hidden state of the encoder as the initial state of the decoder. KL annealing, an early stopping strategy and the bag-of-word auxiliary loss are applied during training. We use the implementation released by Zhou and Wang (2018) (https://github.com/claudezhou/MojiTalk).

Transformer: A Transformer Vaswani et al. (2017) trained with a Maximum Likelihood Estimation (MLE) objective, which can be considered the base model for both the GVT and SVT.
We use a 4-layer Transformer as our base model. The hidden size is set to 300 everywhere, and the word embedding is initialized with the 300-dimensional pre-trained GloVe embeddings for both encoder and decoder. The multi-head attention sub-layers are made up of 4 attention heads, each with embedding dimension 64. The size of the latent variable is 300. The recognition network and the prior network are parameterized by 3-layer MLPs with 512 hidden dimensions. Following the training setup of Zhou and Wang (2018), we first train our baseline Transformer model with the MLE objective and use it to initialize its counterparts in both the GVT and SVT. Then the models are trained end-to-end with the Adam optimizer with an initial learning rate of […]. KL annealing and an early stopping strategy are applied as in Zhou and Wang (2018). At test time, we use the greedy decoding strategy for all models.
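Table 1: Automatic and human evaluation results on MojiTalk and on the combined Persona + ED dataset. The two Embeddings Similarity columns use FastText-averaged and BERT sentence embeddings respectively; "-" marks entries that are not reported.

MojiTalk

| Model | PPL | KLD | Dist-1 | Dist-2 | Dist-3 | Emb. Sim. (FastText) | Emb. Sim. (BERT) | Human: Coherence | Human: Emotion |
|---|---|---|---|---|---|---|---|---|---|
| Seq2Seq | 130.75 | - | 0.0055 | 0.0187 | 0.0347 | 0.738 | 0.594 | 20.67 | 20.67 |
| CVAE | 35.33 | 27.55 | 0.0189 | 0.1340 | 0.3640 | 0.751 | 0.613 | 18.33 | 18 |
| Transformer | 72.66 | - | 0.0040 | 0.0161 | 0.0324 | 0.741 | 0.596 | 19.67 | 23.33 |
| GVT | 19.71 | 18.15 | 0.0207 | 0.1524 | 0.4064 | 0.753 | 0.609 | 23 | 22.67 |
| SVT | 18.96 | 32.27 | 0.0079 | 0.1053 | 0.3654 | 0.762 | 0.619 | 26 | 27.67 |
| Human | - | - | 0.0557 | 0.4009 | 0.7697 | - | - | - | - |

Persona + ED

| Model | PPL | KLD | Dist-1 | Dist-2 | Dist-3 | Emb. Sim. (FastText) | Emb. Sim. (BERT) | Human: Coherence | Human: Engagedness |
|---|---|---|---|---|---|---|---|---|---|
| CVAE | 31.32 | 10.01 | 0.0186 | 0.1102 | 0.295 | 0.917 | 0.666 | 20.67 | 21.33 |
| Transformer | 48.03 | - | 0.0058 | 0.0237 | 0.0524 | 0.915 | 0.672 | 24.67 | 24.67 |
| GVT | 18.34 | 19.13 | 0.0204 | 0.1406 | 0.3995 | 0.917 | 0.675 | 20 | 21.33 |
| SVT | 17.75 | 24.67 | 0.0213 | 0.1521 | 0.3936 | 0.906 | 0.665 | 38.67 | 36.67 |
| Human | - | - | 0.0640 | 0.3800 | 0.7070 | - | - | - | - |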
The evaluation metrics include Perplexity (PPL) and the Kullback-Leibler divergence between the posterior and prior (KLD). A well-trained model should achieve a low reconstruction perplexity and a small but non-trivial KL distance Zhao et al. (2018).

To measure the generation diversity, we calculate Dist-1, Dist-2, and Dist-3, the ratio of the number of distinct n-grams (unigrams, bigrams, and trigrams) over the total number of n-grams. A higher distinct n-gram ratio indicates more diverse generation.
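The distinct n-gram ratio is straightforward to compute. The following sketch pools n-grams over all generated responses before taking the ratio; pooling at the corpus level (rather than averaging per response) is an assumption, and `distinct_n` is a hypothetical helper name.

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to total n-grams over tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

responses = [["i", "am", "not", "sure"], ["i", "am", "fine"]]
dist1 = distinct_n(responses, 1)  # 5 distinct unigrams out of 7
dist2 = distinct_n(responses, 2)  # 4 distinct bigrams out of 5
```

A model that repeats the same generic phrase across many responses accumulates total n-grams much faster than distinct ones, which drives the ratio toward zero.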
The embedding similarity metric computes the cosine similarity between the sentence embedding of a generated sequence and that of a ground-truth response. In our experiments, we introduce two different ways to represent sentence embeddings. The first, following Liu et al. (2016), calculates the average of the word embeddings in a sentence using FastText Mikolov et al. (2018), which is trained on Common Crawl and Wikipedia data. We use FastText embeddings instead of other pre-trained word embeddings because they can handle the out-of-vocabulary issue. However, representing a sentence by simply taking the average of word embeddings ignores the context information. Therefore, we propose to use a pre-trained language model, BERT Devlin et al. (2018), to compute a contextualized sentence representation. Specifically, we use a pre-trained BERT to encode a generated sentence and a ground-truth response, and average the output representations of each to obtain its sentence embedding. This contextualized sentence embedding is the second representation we use.

In the human evaluation, we prepare multiple-choice questions for human evaluators, and the answers are the generation results from the five models (Seq2Seq, CVAE, Transformer, GVT, and SVT). We first randomly sample 100 dialogues and their corresponding responses from our models and the baselines. For each response, we assign three human annotators to select the most coherent (on-topic) response to the context (multiple answers are allowed). In addition, annotators also need to choose the response that best matches the given emoji label in MojiTalk and the most engaging response in PersonaChat and EmpatheticDialogues. If no response satisfies the evaluators, they can choose “all answers are bad”, which means none of the answers is chosen. We compute the rate at which each model is chosen to quantify generation quality with respect to the human standard.
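The averaged-word-embedding variant of the embedding similarity metric described above can be sketched as follows. The toy `vectors` dictionary stands in for pre-trained FastText vectors, and skipping tokens without a vector is an assumption of this sketch (FastText itself can back off to subword units); the function names are hypothetical.

```python
import math

def average_embedding(tokens, vectors):
    """Mean of the word vectors of the tokens that have an entry in vectors."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[d] for v in known) / len(known) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy two-dimensional vectors, for illustration only.
vectors = {"good": [1.0, 0.0], "day": [0.0, 1.0]}
generated = average_embedding(["good", "day"], vectors)
reference = average_embedding(["good"], vectors)
score = cosine_similarity(generated, reference)
```

Because averaging discards word order, two sentences with the same words in a different order receive a similarity of one under this variant, which is the limitation the contextualized BERT representation is meant to address.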
The automatic evaluation results are shown in Table 1. Transformer-based models have significantly lower perplexity than RNN-based models, which indicates that the global receptive field provided by multi-head self-attention boosts the modeling capacity. However, the deterministic Seq2Seq and Transformer models tend to generate generic responses, which leads to low diversity scores. Meanwhile, incorporating a stochastic latent variable into both models (CVAE and GVT) promotes more diverse generation results and boosts the diversity scores Dist-1, Dist-2, and Dist-3.
Table 2: Generated responses from the proposed models and the baselines given the same contexts. "[…]" marks content that is not available in this copy.

Example 1
Context: trade must ’ve made you mad ?
Emotion: […]
Responses:
Seq2Seq: i ’m not sure if i ’m not sure if i ’m not sure if i ’m not sure about it
CVAE: <unk> but i don ’t think it ’s been on
Transformer: i ’m not sure i ’m not
GVT: i ’ll pass it on , she ’s mad
SVT: hell yeah bro . yeah
Ref: i don ’t wanna talk about it

Example 2
Context: love the smell of a good bbq !
Emotion: […]
Responses:
Seq2Seq: i love it ! i love it !
CVAE: aw you ’re getting better
Transformer: i ’m glad you like it !
GVT: i ’d like to know you ’re very well .
SVT: omg what kind of smell ? thanks for sharing it with a pizza in <unk>
Ref: hahaha that sounds like friday to me pc

Example 3
Context: […]
Responses:
CVAE: hi , i am doing well . you ?
Transformer: i am doing well , how are you ?
GVT: i am good . just studying some people there .
SVT: i am doing well , just finished eating some ice cream .
Ref: i am doing well . i am relaxing before bed . i work in the morning .

Example 4
Context: i cook mine at home while watching one tree hill . love that show .
Responses:
CVAE: i love to cook . i like to cook .
Transformer: i love the outdoors . i love the outdoors .
GVT: it is good . . . you can make some money .
SVT: do you have any pets ? i have never watched any of tv.
Ref: i am looking for a new job . i hate sitting still all day

Example 5
Context:
User: well do you work ? do you have a degree to sustain yourself ?
System: i built models when i was a kid . now i sculpt and mold and carve.
User: nice , i specialize in computer science degree so i mostly mold 3d images.
Responses:
CVAE: i do not like it when you get to do the same
Transformer: i am a teacher . i am a teacher . i am a teacher .
GVT: me too ! my favorite is the best baker .
SVT: i love the technology . i like to play when i get older
Ref: i am looking for a new job . i hate sitting still all day
Compared to the baseline models, the GVT achieves relatively lower reconstruction PPL, which suggests that the global latent variable contains rich latent information (e.g., topic) for response generation. Meanwhile, the sequential latent variables of the SVT encode fine-grained latent information and further improve the reconstruction PPL.

On the other hand, the SVT achieves the highest scores on the two semantic relevance-oriented embedding similarity metrics on the MojiTalk dataset, while on the combined dataset of Persona and ED we observe a performance drop of the SVT compared to the other models. This is because both Persona and ED are well designed and have lower entropy than MojiTalk, which was collected from Twitter. We hypothesize that the sequential latent variables have no advantage in terms of similarity to a single, fixed “gold response” when modeling low-entropy responses. Indeed, in open-domain dialogue response generation, automatic metrics are not always aligned with human judgment Liu et al. (2016). In contrast, the human evaluation results reported in Table 1 demonstrate that the generations of the SVT are closer to the human standard in terms of coherence, invoked emotion and engagedness.

Table 2 compares the generations of the proposed models with those of the baselines given the same contexts. We observe that the Seq2Seq and vanilla Transformer models tend to generate generic and repetitive responses (e.g., i am not sure) in MojiTalk because their deterministic structure fails to capture the variability in dialogue responses. By incorporating stochastic latent variables, the CVAE and GVT can generate more diverse responses, but their responses are sometimes digressive (e.g., example 5). Interestingly, the GVT and SVT generalize the topic beyond the context, which makes the dialogue more engaging (e.g., example 4). In general, the SVT is able to generate more coherent and informative responses.
This paper introduces the Variational Transformer (VT), a variational self-attentive feed-forward sequence model that combines the global receptive field of a Transformer with the variational nature of a CVAE. We propose two types of VT: 1) the Global Variational Transformer (GVT), which incorporates a global latent variable as additional input to the Transformer decoder; and 2) the Sequential Variational Transformer (SVT), which generates latent variables for each position during the decoding process. Quantitative and qualitative experimental results show that our models outperform the baselines in terms of diversity, semantic relevance, and human judgment. In future work, we will utilize pre-trained language models Radford et al. (2019) as the backbone to strengthen the language model of the VT for better generation.