1 Introduction
The development of the variational autoencoder framework (Kingma and Welling, 2014; Rezende et al., 2014) has paved the way for learning large-scale, directed latent variable models. This has led to significant progress in a diverse set of machine learning applications, ranging from computer vision (Gregor et al., 2015; Larsen et al., 2016) to natural language processing tasks (Mnih and Gregor, 2014; Miao et al., 2016; Bowman et al., 2015; Serban et al., 2017b). It is hoped that this framework will enable the learning of generative processes of real-world data, including text, audio and images, by disentangling and representing the underlying latent factors in the data. However, latent factors in real-world data are often highly complex. For example, topics in newswire text and responses in conversational dialogue often possess latent factors that follow nonlinear (non-smooth), multimodal distributions (i.e. distributions with multiple local maxima).
Nevertheless, the majority of current models assume a simple prior in the form of a multivariate Gaussian distribution in order to maintain mathematical and computational tractability. This is often a highly restrictive and unrealistic assumption to impose on the structure of the latent variables. First, it imposes a strong unimodal structure on the latent variable space; latent variable samples from the generating model (prior distribution) all cluster around a single mean. Second, it forces the latent variables to follow a perfectly symmetric distribution with constant kurtosis; this makes it difficult to represent asymmetric or rarely occurring factors. Such constraints on the latent variables increase pressure on the downstream generative model, which in turn is forced to carefully partition the probability mass for each latent factor throughout its intermediate layers. For complex, multimodal distributions, such as the distribution over topics in a text corpus or natural language responses in a dialogue system, the unimodal Gaussian prior inhibits the model’s ability to extract and represent important latent structure in the data. In order to learn more expressive latent variable models, we therefore need more flexible, yet tractable, priors.
In this paper, we introduce a simple, flexible prior distribution based on the piecewise constant distribution. We derive an analytical, tractable form that is applicable to the variational autoencoder framework and propose a differentiable parametrization for it. We then evaluate the effectiveness of the distribution when utilized both as a prior and as an approximate posterior across variational architectures in two natural language processing tasks: document modeling and natural language generation for dialogue. We show that the piecewise constant distribution is able to capture elements of a target distribution that cannot be captured by simpler priors, such as the unimodal Gaussian. We demonstrate state-of-the-art results on three document modeling tasks, and show improvements on a dialogue natural language generation task. Finally, we illustrate qualitatively how the piecewise constant distribution represents multimodal latent structure in the data.
2 Related Work
The idea of using an artificial neural network to approximate an inference model dates back to the early work of Hinton and colleagues (Hinton and Zemel, 1994; Hinton et al., 1995; Dayan and Hinton, 1996). Researchers have also proposed Markov chain Monte Carlo (MCMC) methods (Neal, 1992), which do not scale well and mix slowly, as well as variational approaches which require a tractable, factored distribution to approximate the true posterior distribution (Jordan et al., 1999). Others have since proposed using feed-forward inference models to initialize the mean-field inference algorithm for training Boltzmann architectures (Salakhutdinov and Larochelle, 2010; Ororbia II et al., 2015). Recently, the variational autoencoder framework (VAE) was proposed by Kingma and Welling (2014) and Rezende et al. (2014), closely related to the method proposed by Mnih and Gregor (2014). This framework allows the joint training of an inference network and a directed generative model, maximizing a variational lower-bound on the data log-likelihood and facilitating exact sampling of the variational posterior. Our work extends this framework.
With respect to document modeling, neural architectures have been shown to outperform well-established topic models such as Latent Dirichlet Allocation (LDA) (Hofmann, 1999; Blei et al., 2003). Researchers have successfully proposed several models involving discrete latent variables (Salakhutdinov and Hinton, 2009; Hinton and Salakhutdinov, 2009; Srivastava et al., 2013; Larochelle and Lauly, 2012; Uria et al., 2014; Lauly et al., 2016; Bornschein and Bengio, 2015; Mnih and Gregor, 2014). The success of such discrete latent variable models, which are able to partition probability mass into separate regions, serves as one of our main motivations for investigating models with more flexible continuous latent variables for document modeling. More recently, Miao et al. (2016) proposed to use continuous latent variables for document modeling.
Researchers have also investigated latent variable models for dialogue modeling and dialogue natural language generation (Bangalore et al., 2008; Crook et al., 2009; Zhai and Williams, 2014). The success of discrete latent variable models in this task also motivates our investigation of more flexible continuous latent variables. Closely related to our proposed approach is the Variational Hierarchical Recurrent Encoder-Decoder (VHRED, described below) (Serban et al., 2017b), a neural architecture with latent multivariate Gaussian variables. In parallel with our work, Zhao et al. (2017) have also proposed a latent variable model for dialogue modeling with the specific goal of generating diverse natural language responses.
Researchers have explored more flexible distributions for the latent variables in VAEs, such as autoregressive distributions, hierarchical probabilistic models and approximations based on MCMC sampling (Rezende et al., 2014; Rezende and Mohamed, 2015; Kingma et al., 2016; Ranganath et al., 2016; Maaløe et al., 2016; Salimans et al., 2015; Burda et al., 2016; Chen et al., 2017; Ruiz et al., 2016). These are all complementary to our approach; it is possible to combine them with the piecewise constant latent variables. In parallel to our work, multiple research groups have also proposed VAEs with discrete latent variables (Maddison et al., 2017; Jang et al., 2017; Rolfe, 2017; Johnson et al., 2016). This is a promising line of research; however, these approaches often require approximations which may be inaccurate when applied to larger-scale tasks, such as document modeling or natural language generation. Finally, discrete latent variables may be inappropriate for certain natural language processing tasks.
3 Neural Variational Models
We start by introducing the neural variational learning framework. We focus on modeling discrete output variables (e.g. words) in the context of natural language processing applications. However, the framework can easily be adapted to handle continuous output variables.
3.1 Neural Variational Learning
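Let $w_1, \dots, w_N$ be a sequence of $N$ tokens (words) conditioned on a continuous latent variable $z$. Further, let $c$ be an additional observed variable which conditions both $z$ and $w_1, \dots, w_N$. Then, the distribution over words is:
$$P_\theta(w_1, \dots, w_N \mid c) = \int \prod_{n=1}^{N} P_\theta(w_n \mid w_{<n}, z, c) \, P_\theta(z \mid c) \, dz,$$
where $\theta$ are the model parameters. The model first generates the higher-level, continuous latent variable $z$ conditioned on $c$. Given $z$ and $c$, it then generates the word sequence $w_1, \dots, w_N$. For unsupervised modeling of documents, $c$ is excluded and the words are assumed to be independent of each other when conditioned on $z$:
$$P_\theta(w_1, \dots, w_N) = \int \prod_{n=1}^{N} P_\theta(w_n \mid z) \, P_\theta(z) \, dz.$$
Model parameters can be learned using the variational lower-bound (Kingma and Welling, 2014):
$$\log P_\theta(w_1, \dots, w_N \mid c) \geq \mathbb{E}_{z \sim Q_\psi(z \mid w_1, \dots, w_N, c)}\left[\sum_{n=1}^{N} \log P_\theta(w_n \mid w_{<n}, z, c)\right] - \mathrm{KL}\left[Q_\psi(z \mid w_1, \dots, w_N, c) \,\|\, P_\theta(z \mid c)\right], \qquad (1)$$
where we note that $Q_\psi(z \mid w_1, \dots, w_N, c)$ is the approximation to the intractable, true posterior $P_\theta(z \mid w_1, \dots, w_N, c)$. $Q$ is called the encoder, or sometimes the recognition model or inference model, and it is parametrized by $\psi$. The distribution $P_\theta(z \mid c)$ is the prior model for $z$, where the only available information is $c$. The VAE framework further employs the reparametrization trick, which allows one to move the derivative of the lower-bound inside the expectation. To accomplish this, $z$ is parametrized as a transformation of a fixed, parameter-free random distribution, $z = f_\theta(\epsilon)$, where $\epsilon$ is drawn from a random distribution. Here, $f$ is a transformation of $\epsilon$, parametrized by $\theta$, such that $f_\theta(\epsilon) \sim P_\theta(z \mid c)$. For example, $\epsilon$ might be drawn from a standard Gaussian distribution and $f$ might be defined as $f_\theta(\epsilon) = \mu + \sigma \epsilon$, where $\mu$ and $\sigma$ are in the parameter set $\theta$. In this case, $f$ is able to represent any Gaussian with mean $\mu$ and variance $\sigma^2$.
Model parameters are learned by maximizing the variational lower-bound in eq. (1) using gradient-based optimization, where the expectation is computed using samples from the approximate posterior.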
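The Gaussian case described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `reparametrize`, `gaussian_kl` and `lower_bound` are hypothetical, the Gaussians are assumed to be diagonal, and `log_likelihood` stands in for whatever decoder term the model provides.

```python
import math
import random

def reparametrize(mu, sigma, rng):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL[N(mu_q, var_q) || N(mu_p, var_p)] for diagonal Gaussians."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total

def lower_bound(log_likelihood, mu_q, var_q, mu_p, var_p, num_samples, rng):
    """Monte Carlo estimate of E_Q[log P(w | z)] - KL[Q || P]."""
    sigma_q = [math.sqrt(v) for v in var_q]
    recon = 0.0
    for _ in range(num_samples):
        # Sampling through the transformation keeps z a function of (mu, sigma).
        z = reparametrize(mu_q, sigma_q, rng)
        recon += log_likelihood(z)
    return recon / num_samples - gaussian_kl(mu_q, var_q, mu_p, var_p)
```

Because the sample is written as a deterministic function of the distribution parameters and parameter-free noise, an automatic differentiation library could propagate gradients through `z`; the sketch above only evaluates the bound.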
The majority of work on VAEs proposes to parametrize the latent variables as multivariate Gaussian distributions. However, this unrealistic assumption may critically hurt the expressiveness of the latent variable model. See Appendix A for a detailed discussion. This motivates the proposed piecewise constant latent variable distribution.
3.2 Piecewise Constant Distribution
We propose to learn latent variables by parametrizing $z$ using a piecewise constant probability density function (PDF). This should allow $z$ to represent complex aspects of the data distribution in latent variable space, such as non-smooth regions of probability mass and multiple modes.
Let $n \in \mathbb{N}$ be the number of piecewise constant components. We assume $z$ is drawn from the PDF:
$$P(z) = \frac{1}{K} \sum_{i=1}^{n} \mathbb{1}\left(\frac{i-1}{n} \leq z \leq \frac{i}{n}\right) a_i, \qquad (2)$$
where $\mathbb{1}(x)$ is the indicator function, which is one when $x$ is true and otherwise zero. The distribution parameters are $a_i > 0$, for $i = 1, \dots, n$. The normalization constant is:
$$K = \sum_{i=1}^{n} K_i, \quad \text{where } K_0 = 0, \; K_i = \frac{a_i}{n}, \text{ for } i = 1, \dots, n.$$
It is straightforward to show that a piecewise constant distribution with more than two pieces ($n > 2$) is capable of representing a bimodal distribution. When $n > 2$, a vector $z \in [0, 1]^D$ of piecewise constant variables can represent a probability density with $2^D$ modes. Figure 1 illustrates how these variables help model complex, multimodal distributions.
In order to compute the variational bound, we need to draw samples from the piecewise constant distribution using its inverse cumulative distribution function (CDF). Further, we need to compute the KL divergence between the prior and posterior. The inverse CDF and KL divergence quantities are both derived in Appendix B.
During training we must compute derivatives of the variational bound in eq. (1). These expressions involve derivatives of indicator functions, which are zero everywhere except at the changing points, where the derivative is undefined. However, the probability of sampling a value exactly at a changing point is effectively zero. Thus, we fix these derivatives to zero. Similar approximations are used in training networks with rectified linear units.
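To make the sampling and KL computations concrete, the following sketch works through them for a single latent dimension. It is an illustration under stated assumptions rather than the derivation from Appendix B: the density is assumed to live on $[0, 1]$ with $n$ equal-width pieces and positive heights `a`, and the helper names (`piece_masses`, `pdf`, `inverse_cdf`, `kl_divergence`) are hypothetical.

```python
import math
import random

def piece_masses(a):
    """Probability mass of each piece: (a_i / n) / K, which equals a_i / sum(a)."""
    total = sum(a)
    return [ai / total for ai in a]

def pdf(z, a):
    """Density at z in [0, 1]; piece i has width 1/n and height n * mass_i."""
    n = len(a)
    i = min(int(z * n), n - 1)  # index of the piece containing z
    return n * piece_masses(a)[i]

def inverse_cdf(u, a):
    """Map u in [0, 1) to a sample z by inverting the piecewise linear CDF."""
    n = len(a)
    cum = 0.0
    for i, m in enumerate(piece_masses(a)):
        if u < cum + m or i == n - 1:
            # Inside piece i the CDF is linear, so interpolate within the piece.
            return (i + (u - cum) / m) / n
        cum += m

def kl_divergence(a_q, a_p):
    """KL[Q || P] for two piecewise constant densities on the same grid."""
    q = piece_masses(a_q)
    p = piece_masses(a_p)
    # Piece widths cancel, leaving a sum over the piece masses.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
```

For example, `a = [3.0, 1.0, 3.0]` gives a bimodal density with a low-mass middle piece, and `inverse_cdf(random.random(), a)` draws one sample from it. Evaluating `inverse_cdf(0.5, a)` returns the median of the distribution.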
4 Latent Variable Parametrizations
In this section, we develop the parametrization of both the Gaussian variable and our proposed piecewise constant latent variable.
Let $x$ be the current output sequence, which the model must generate (e.g. the word sequence $w_1, \dots, w_N$). Let $c$ be the observed conditioning information. If the task contains additional conditioning information, this will be embedded by $c$. For example, for dialogue natural language generation $c$ represents an embedding of the dialogue history, while for document modeling $c = \emptyset$.
4.1 Gaussian Parametrization
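Let $\mu^{\text{prior}}$ and $\sigma^{2,\text{prior}}$ be the prior mean and variance, and let $\mu^{\text{post}}$ and $\sigma^{2,\text{post}}$ be the approximate posterior mean and variance. For Gaussian latent variables, the prior distribution mean and variances are encoded using linear transformations of a hidden state. In particular, the prior distribution covariance is encoded as a diagonal covariance matrix using a softplus function:
$$\mu^{\text{prior}} = H^{\text{prior}}_{\mu} \, \text{Enc}(c) + b^{\text{prior}}_{\mu}, \qquad \sigma^{2,\text{prior}} = \text{diag}\left(\log\left(1 + \exp\left(H^{\text{prior}}_{\sigma} \, \text{Enc}(c) + b^{\text{prior}}_{\sigma}\right)\right)\right),$$
where $\text{Enc}(c)$ is an embedding of the conditioning information $c$ (e.g. for dialogue natural language generation this might be produced by an LSTM encoder applied to the dialogue history), which is shared across all latent variable dimensions. The matrices $H^{\text{prior}}_{\mu}, H^{\text{prior}}_{\sigma}$ and vectors $b^{\text{prior}}_{\mu}, b^{\text{prior}}_{\sigma}$ are learnable parameters. For the posterior distribution, previous work has shown it is better to parametrize the posterior as a linear interpolation of the prior mean and variance and a new estimate of the mean and variance based on the observation $x$ (Fraccaro et al., 2016). The interpolation is controlled by a gating mechanism, allowing the model to turn latent dimensions on and off:
$$\mu^{\text{post}} = (1 - \alpha_{\mu}) \, \mu^{\text{prior}} + \alpha_{\mu} \left(H^{\text{post}}_{\mu} \, \text{Enc}(c, x) + b^{\text{post}}_{\mu}\right),$$
$$\sigma^{2,\text{post}} = (1 - \alpha_{\sigma}) \, \sigma^{2,\text{prior}} + \alpha_{\sigma} \, \text{diag}\left(\log\left(1 + \exp\left(H^{\text{post}}_{\sigma} \, \text{Enc}(c, x) + b^{\text{post}}_{\sigma}\right)\right)\right),$$
where $\text{Enc}(c, x)$ is an embedding of both $c$ and $x$. The matrices $H^{\text{post}}_{\mu}, H^{\text{post}}_{\sigma}$ and the vectors $b^{\text{post}}_{\mu}, b^{\text{post}}_{\sigma}, \alpha_{\mu}, \alpha_{\sigma}$ are parameters to be learned. The interpolation mechanism is controlled by $\alpha_{\mu}$ and $\alpha_{\sigma}$, which are initialized to zero (i.e. initialized such that the posterior is equal to the prior).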
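The gated interpolation described above can be illustrated with a small sketch. It assumes diagonal variances stored as plain lists and one gate value per latent dimension; the function names and the argument layout are hypothetical and chosen only for this example.

```python
import math

def softplus(x):
    """log(1 + exp(x)), written in a form that avoids overflow for large |x|."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def affine(matrix, vector, bias):
    """Return matrix @ vector + bias using plain lists."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(matrix, bias)]

def gaussian_prior(enc_c, H_mu, b_mu, H_sigma, b_sigma):
    """Prior mean and diagonal variance from the context embedding enc_c."""
    mu = affine(H_mu, enc_c, b_mu)
    var = [softplus(s) for s in affine(H_sigma, enc_c, b_sigma)]
    return mu, var

def gated_posterior(mu_prior, var_prior, mu_new, var_new, alpha_mu, alpha_sigma):
    """Interpolate per dimension between the prior and a new estimate."""
    mu = [(1.0 - a) * p + a * m
          for a, p, m in zip(alpha_mu, mu_prior, mu_new)]
    var = [(1.0 - a) * p + a * v
           for a, p, v in zip(alpha_sigma, var_prior, var_new)]
    return mu, var
```

With all gate values at zero, `gated_posterior` returns the prior unchanged, which matches the initialization described in the text; a gate of one replaces the prior entirely with the new estimate for that dimension.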
4.2 Piecewise Constant Parametrization
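We parametrize the piecewise prior parameters using an exponential function applied to a linear transformation of the conditioning information:
$$a^{\text{prior}}_{i} = \exp\left(H^{\text{prior}}_{a,i} \, \text{Enc}(c) + b^{\text{prior}}_{a,i}\right), \quad i = 1, \dots, n,$$
where matrix $H^{\text{prior}}_{a}$ and vector $b^{\text{prior}}_{a}$ are learnable. As before, we define the posterior parameters as a function of both $c$ and $x$:
$$a^{\text{post}}_{i} = \exp\left(H^{\text{post}}_{a,i} \, \text{Enc}(c, x) + b^{\text{post}}_{a,i}\right), \quad i = 1, \dots, n,$$
where $H^{\text{post}}_{a}$ and $b^{\text{post}}_{a}$ are parameters.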
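As a small illustration of why the exponential is a convenient choice, the sketch below maps an embedding to piece heights. The names `piecewise_parameters`, `H_a` and `b_a` are hypothetical, and the embedding is assumed to be a plain list of floats.

```python
import math

def piecewise_parameters(enc, H_a, b_a):
    """Return a_i = exp(H_a[i] . enc + b_a[i]) for each piece i.

    The exponential guarantees a_i > 0 for any real-valued weights.
    """
    return [math.exp(sum(w * e for w, e in zip(row, enc)) + b)
            for row, b in zip(H_a, b_a)]
```

With zero weights and zero biases every piece receives the same height, so the resulting density is uniform; learning then moves mass between pieces by raising or lowering individual entries.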
5 Variational Text Modeling
We now introduce two classes of VAEs. The models are extended by incorporating the Gaussian and piecewise latent variable parametrizations.
5.1 Document Model
The neural variational document model (NVDM) has previously been proposed for document modeling (Mnih and Gregor, 2014; Miao et al., 2016), where the latent variables are Gaussian. Since the original NVDM uses Gaussian latent variables, we will refer to it as GNVDM. We propose two novel models building on GNVDM. The first model uses piecewise constant latent variables instead of Gaussian latent variables. We refer to this model as PNVDM. The second model uses a combination of Gaussian and piecewise constant latent variables. It samples the Gaussian and piecewise constant latent variables independently and then concatenates them into one vector. We refer to this model as HNVDM.
Let $V$ be the vocabulary of document words. Let $W$ represent a document matrix, where row $w_i$ is the 1-of-$|V|$ binary encoding of the $i$'th word in the document. Each model has an encoder component $\text{Enc}(W)$, which compresses a document vector into a continuous distributed representation upon which the approximate posterior is built. For document modeling, word order information is not taken into account and no additional conditioning information is available. Therefore, each model uses a bag-of-words encoder, defined as a multi-layer perceptron (MLP) $\text{Enc}(c = \emptyset, x) = \text{Enc}(x)$. Based on preliminary experiments, we choose the encoder to be a two-layered MLP with parametrized rectified linear activation functions (we omit these parameters for simplicity). For the approximate posterior, each model has the parameter matrix $H^{\text{post}}_{a}$ and vector $b^{\text{post}}_{a}$ for the piecewise latent variables, and the parameter matrices $H^{\text{post}}_{\mu}, H^{\text{post}}_{\sigma}$ and vectors $b^{\text{post}}_{\mu}, b^{\text{post}}_{\sigma}$ for the Gaussian means and variances. For the prior, each model has parameter vector $b^{\text{prior}}_{a}$ for the piecewise latent variables, and vectors $b^{\text{prior}}_{\mu}, b^{\text{prior}}_{\sigma}$ for the Gaussian means and variances. We initialize the bias parameters to zero in order to start with centered Gaussian and piecewise constant priors. The encoder will adapt these priors as learning progresses, using the gating mechanism to turn latent dimensions on and off.
Let $z$ be the vector of latent variables sampled according to the approximate posterior distribution. Given $z$, the decoder outputs a distribution over words in the document:
$$P_\theta(w \mid z) = \frac{\exp\left(w^{\top} R z + b_{w}\right)}{\sum_{w'} \exp\left(w'^{\top} R z + b_{w'}\right)},$$
where $R$ is a parameter matrix and $b$ is a parameter vector corresponding to the bias for each word to be learned. This output probability distribution is combined with the KL divergences to compute the lower-bound in eq. (1). See Appendix C.
Our baseline model GNVDM is an improvement over the original NVDM proposed by Mnih and Gregor (2014) and Miao et al. (2016). We learn the prior mean and variance, while these were fixed to a standard Gaussian in previous work. This increases the flexibility of the model and makes optimization easier. In addition, we use a gating mechanism for the approximate posterior of the Gaussian variables. This gating mechanism allows the model to turn off latent variables (i.e. fix the approximate posterior to equal the prior for specific latent variables) when computing the final posterior parameters. Furthermore, Miao et al. (2016) alternated between optimizing the approximate posterior parameters and the generative model parameters, while we optimize all parameters simultaneously.
5.2 Dialogue Model
The variational hierarchical recurrent encoder-decoder (VHRED) model has previously been proposed for dialogue modeling and natural language generation (Serban et al., 2017b, 2016b). The model decomposes dialogues using a two-level hierarchy: sequences of utterances (e.g. sentences), and sub-sequences of tokens (e.g. words). Let $w_n$ be the $n$'th utterance in a dialogue with $N$ utterances. Let $w_{n,m}$ be the $m$'th word in the $n$'th utterance from vocabulary $V$, given as a 1-of-$|V|$ binary encoding. Let $M_n$ be the number of words in the $n$'th utterance. For each utterance $n = 1, \dots, N$, the model generates a latent variable $z_n$. Conditioned on this latent variable, the model then generates the next utterance:
$$P_\theta(w_1, z_1, \dots, w_N, z_N) = \prod_{n=1}^{N} P_\theta(z_n \mid w_{<n}) \prod_{m=1}^{M_n} P_\theta(w_{n,m} \mid w_{n,<m}, w_{<n}, z_n),$$
where $\theta$ are the model parameters. VHRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. The encoder RNN computes an embedding for each utterance. This embedding is fed into the context RNN, which computes a hidden state summarizing the dialogue context before utterance $n$: $h^{\text{con}}_{n-1}$. This state represents the additional conditioning information, which is used to compute the prior distribution over $z_n$:
$$P_\theta(z_n \mid w_{<n}) = f^{\text{prior}}_{\theta}(z_n; h^{\text{con}}_{n-1}),$$
where $f^{\text{prior}}_{\theta}$ is a PDF parametrized by both $\theta$ and $h^{\text{con}}_{n-1}$. A sample is drawn from this distribution: $z_n \sim P_\theta(z_n \mid w_{<n})$. This sample is given as input to the decoder RNN, which then computes the output probabilities of the words in the next utterance. The model is trained by maximizing the variational lower-bound, which factorizes into independent terms for each sub-sequence (utterance):
$$\log P_\theta(w_1, \dots, w_N) \geq \sum_{n=1}^{N} \left( \mathbb{E}_{Q_\psi(z_n \mid w_1, \dots, w_n)}\left[\log P_\theta(w_n \mid z_n, w_{<n})\right] - \mathrm{KL}\left[Q_\psi(z_n \mid w_1, \dots, w_n) \,\|\, P_\theta(z_n \mid w_{<n})\right] \right),$$
where distribution $Q_\psi$ is the approximate posterior distribution with parameters $\psi$, computed similarly to the prior distribution but further conditioned on the encoder RNN hidden state of the next utterance.
The original VHRED model (Serban et al., 2017b) used Gaussian latent variables. We refer to this model as GVHRED. The first model we propose uses piecewise constant latent variables instead of Gaussian latent variables. We refer to this model as PVHRED. The second model we propose takes advantage of the representation power of both Gaussian and piecewise constant latent variables. This model samples both a Gaussian latent variable $z^{\text{gaussian}}_n$ and a piecewise latent variable $z^{\text{piecewise}}_n$ independently, conditioned on the context RNN hidden state:
$$P_\theta(z^{\text{gaussian}}_n \mid w_{<n}) = f^{\text{prior, gaussian}}_{\theta}(z^{\text{gaussian}}_n; h^{\text{con}}_{n-1}), \qquad P_\theta(z^{\text{piecewise}}_n \mid w_{<n}) = f^{\text{prior, piecewise}}_{\theta}(z^{\text{piecewise}}_n; h^{\text{con}}_{n-1}),$$
where $f^{\text{prior, gaussian}}$ and $f^{\text{prior, piecewise}}$ are PDFs parametrized by independent subsets of parameters $\theta$. We refer to this model as HVHRED.
6 Experiments
We evaluate the proposed models on two types of natural language processing tasks: document modeling and dialogue natural language generation. All models are trained with backpropagation using the variational lower-bound on the log-likelihood or the exact log-likelihood. We use the first-order gradient descent optimizer Adam (Kingma and Ba, 2015) with gradient clipping (Pascanu et al., 2012). Code and scripts are available at https://github.com/ago109/piecewisenvdmemnlp2017 and https://github.com/julianser/hredlatentpiecewise.
6.1 Document Modeling
Model  20NG  RCV1  CADE 

LDA  
docNADE  
NVDM  
GNVDM  
HNVDM3  
HNVDM5 
Tasks We use three different datasets for document modeling experiments. First, we use the 20 NewsGroups (20NG) dataset (Hinton and Salakhutdinov, 2009). Second, we use the Reuters corpus (RCV1-V2), using a version that contained a selected 5,000 term vocabulary. As in previous work (Hinton and Salakhutdinov, 2009; Larochelle and Lauly, 2012), we transform the original word frequencies using the equation $\log(1 + \text{TF})$, where TF is the original word frequency. Third, to test our document models on text from a non-English language, we use the Brazilian Portuguese CADE-12 dataset (Cardoso-Cachopo, 2007). For all datasets, we track the validation bound on a subset of 100 vectors randomly drawn from each training corpus.





Table 2: Top ten highest ranked words for the query term “space”.
GNVDM  HNVDM3  HNVDM5
environment  project  science
project  gov  built
flight  major  high
lab  based  technology
mission  earth  world
launch  include  form
field  science  scale
working  nasa  sun
build  systems  special
gov  technical  area
Training All models were trained using minibatches with 100 examples each. A learning rate of was used. Model selection and early stopping were conducted using the validation lower-bound, estimated using five stochastic samples per validation example. Inference networks used 100 units in each hidden layer for 20NG and CADE, and 100 for RCV1. We experimented with both and latent random variables for each class of models, and found that latent variables performed best on the validation set. For HNVDM we vary the number of components used in the PDF, investigating the effect that 3 and 5 pieces had on the final quality of the model. The number of hidden units was chosen via preliminary experimentation with smaller models. On 20NG, we use the same setup as Hinton and Salakhutdinov (2009) and therefore report the perplexities of a topic model (LDA; Hinton and Salakhutdinov, 2009), the document neural autoregressive estimator (docNADE; Larochelle and Lauly, 2012), and a neural variational document model with a fixed standard Gaussian prior (NVDM, lowest reported perplexity; Miao et al., 2016).
Results In Table 1, we report the test document perplexity: $\exp\left(-\frac{1}{D} \sum_{n=1}^{D} \frac{1}{L_n} \log P_\theta(x_n)\right)$, where $D$ is the number of test documents and $L_n$ is the number of words in document $x_n$. We use the variational lower-bound as an approximation based on 10 samples, as was done by Mnih and Gregor (2014). First, we note that the best baseline model (i.e. the NVDM) is more competitive when both the prior and posterior models are learnt together (i.e. the GNVDM), as opposed to the fixed prior of Miao et al. (2016). Next, we observe that integrating our proposed piecewise variables yields even better results in our document modeling experiments, substantially improving over the baselines. More importantly, on the 20NG and Reuters datasets, increasing the number of pieces from 3 to 5 further reduces perplexity. Thus, we have achieved a new state-of-the-art perplexity on the 20 NewsGroups task and, to the best of our knowledge, better perplexities on the CADE-12 and RCV1 tasks compared to using a state-of-the-art model like the GNVDM. We also evaluated the converged models using a non-parametric inference procedure, where a separate approximate posterior is learned for each test example in order to tighten the variational lower-bound. HNVDM also performed best in this evaluation across all three datasets, which confirms that the performance improvement is due to the piecewise components. See appendix for details.
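A short sketch of how a per-document perplexity of this kind might be computed from log-likelihood estimates is given below. It assumes the usual per-word definition from the document modeling literature (the exponential of the negative mean per-word log-likelihood, averaged over documents); the function name and arguments are hypothetical.

```python
import math

def document_perplexity(log_likelihoods, doc_lengths):
    """exp of the negative per-word log-likelihood, averaged over documents.

    log_likelihoods[n] is log P(x_n), or a lower bound on it, for document n;
    doc_lengths[n] is the number of words in that document.
    """
    per_word = [ll / length for ll, length in zip(log_likelihoods, doc_lengths)]
    return math.exp(-sum(per_word) / len(per_word))
```

When a variational lower bound is substituted for the exact log-likelihood, the value returned is an upper bound on the true perplexity, so a tighter bound can only lower the reported number.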
In Table 2, we examine the top ten highest ranked words given the query term “space”, using the decoder parameter matrix. The piecewise variables appear to have a significant effect on what is uncovered by the model. In the case of “space”, the hybrid with 5 pieces seems to value two senses of the word: one related to “outer space” (e.g., “sun”, “world”, etc.) and another related to the dimensions of depth, height, and width within which things may exist and move (e.g., “area”, “form”, “scale”, etc.). On the other hand, GNVDM appears to only capture the “outer space” sense of the word. More examples are in the appendix.
Finally, we visualized the means of the approximate posterior latent variables on 20NG through a t-SNE projection. As shown in Figure 2, both GNVDM and HNVDM5 learn representations which disentangle the topic clusters on 20NG. However, GNVDM appears to have more dispersed clusters and more outliers (i.e. data points in the periphery) compared to HNVDM5. Although it is difficult to draw conclusions based on these plots, these findings could potentially be explained by the Gaussian latent variables fitting the latent factors poorly.
6.2 Dialogue Modeling
Model  Activity  Entity 

HRED  
GVHRED  
PVHRED  
HVHRED 
Task We evaluate VHRED on a natural language generation task, where the goal is to generate responses in a dialogue. This is a difficult problem, which has been extensively studied in the recent literature (Ritter et al., 2011; Lowe et al., 2015; Sordoni et al., 2015; Li et al., 2016; Serban et al., 2016b, a). Dialogue response generation has recently gained a significant amount of attention from industry, with high-profile projects such as Google SmartReply (Kannan et al., 2016) and Microsoft Xiaoice (Markoff and Mozur, 2015). Even more recently, Amazon has announced the Alexa Prize Challenge for the research community with the goal of developing a natural and engaging chatbot system (Farber, 2016).
We evaluate on the technical support response generation task for the Ubuntu operating system. We use the well-known Ubuntu Dialogue Corpus (Lowe et al., 2015, 2017), which consists of about half a million natural language dialogues extracted from the #Ubuntu Internet Relay Chat (IRC) channel. The technical problems discussed span a wide range of software-related and hardware-related issues. Given a dialogue history, such as a conversation between a user and a technical support assistant, the model must generate the next appropriate response in the dialogue. For example, when it is the turn of the technical support assistant, the model must generate an appropriate response helping the user resolve their problem.
We evaluate the models using the activity- and entity-based metrics designed specifically for the Ubuntu domain (Serban et al., 2017a). These metrics compare the activities and entities in the model-generated responses with those of the reference responses; activities are verbs referring to high-level actions (e.g. download, install, unzip) and entities are nouns referring to technical objects (e.g. Firefox, GNOME). The more activities and entities a model response shares with the reference response (e.g. expert response), the more likely the response will lead to a solution.
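The overlap idea behind these metrics can be sketched as a set-based F1 score. This is a simplified illustration only: the actual metrics of Serban et al. (2017a) rely on their own procedure for extracting activities and entities from text, which is not reproduced here, and the function name `overlap_f1` is hypothetical.

```python
def overlap_f1(predicted_terms, reference_terms):
    """Precision, recall and F1 between two collections of terms, as sets."""
    predicted = set(predicted_terms)
    reference = set(reference_terms)
    hits = len(predicted & reference)
    if hits == 0:
        # Covers empty inputs and the case of no shared terms.
        return 0.0, 0.0, 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, a response mentioning the activities `install` and `download` scored against a reference containing only `install` has full recall but only half precision.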
Training The models were trained to maximize the log-likelihood of training examples using a learning rate of and minibatches of size . We use a variant of truncated backpropagation. We terminate the training procedure for each model using early stopping, estimated using one stochastic sample per validation example. We evaluate the models by generating dialogue responses: conditioned on a dialogue context, we fix the model latent variables to their median values and then generate the response using a beam search with size 5. We select model hyperparameters based on the validation set using the F1 activity metric, as described earlier.
It is often difficult to train generative models for language with stochastic latent variables (Bowman et al., 2015; Serban et al., 2017b). For the latent variable models, we therefore experiment with reweighting the KL divergence terms in the variational lower-bound with values , , and . In addition to this, we linearly increase the KL divergence weights starting from zero to their final value over the first training batches. Finally, we weaken the decoder RNN by randomly replacing words inputted to the decoder RNN with the unknown token with probability. These steps are important for effectively training the models, and the latter two have been used in previous work by Bowman et al. (2015) and Serban et al. (2017b).
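The two training heuristics mentioned last, linear KL annealing and random replacement of decoder inputs, can be sketched as follows. The function names and all parameter values are hypothetical placeholders; the settings actually used in the experiments are not reproduced here.

```python
import random

def kl_weight(step, warmup_steps, final_weight):
    """Linearly increase the KL weight from zero to final_weight."""
    if warmup_steps <= 0:
        return final_weight
    return final_weight * min(1.0, step / warmup_steps)

def drop_words(tokens, drop_prob, unk_token, rng):
    """Replace each decoder input token by unk_token with probability drop_prob."""
    return [unk_token if rng.random() < drop_prob else tok for tok in tokens]
```

A training loop would multiply the KL term of the lower bound by `kl_weight(step, ...)` at every update and pass `drop_words(...)` output to the decoder in place of the original inputs, which pushes the decoder to rely more on the latent variables.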
HRED (Baseline): We compare to the HRED model (Serban et al., 2016b): a sequence-to-sequence model, shown to outperform other established models on this task, such as the LSTM RNN language model (Serban et al., 2017a). The HRED model’s encoder RNN uses a bidirectional GRU RNN encoder, where the forward and backward RNNs each have hidden units. The context RNN is a GRU encoder with hidden units, and the decoder RNN is an LSTM decoder with hidden units. (Since training lasted between 1 and 3 weeks for each model, we had to fix the number of hidden units during preliminary experiments on the training and validation datasets.) The encoder and context RNNs both use layer normalization (Ba et al., 2016). (We did not apply layer normalization to the decoder RNN, because several of our colleagues have found that this may hurt the performance of generative language models.) We also experiment with an additional rectified linear layer applied on the inputs to the decoder RNN. As with other hyperparameters, we choose whether to include this additional layer based on the validation set performance. HRED, as well as all other models, uses a word embedding dimensionality of size .
GVHRED: We compare to GVHRED, which is VHRED with Gaussian latent variables (Serban et al., 2017b). GVHRED uses the same hyperparameters for the encoder, context and decoder RNNs as the HRED model. The model has Gaussian latent variables per utterance.
PVHRED: The first model we propose is PVHRED, which is the VHRED model with piecewise constant latent variables. We use number of pieces for each latent variable. PVHRED also uses the same hyperparameters for the encoder, context and decoder RNNs as the HRED model. Similar to GVHRED, PVHRED has piecewise constant latent variables per utterance.
HVHRED: The second model we propose is HVHRED, which has piecewise constant (with pieces per variable) and Gaussian latent variables per utterance. HVHRED also uses the same hyperparameters for the encoder, context and decoder RNNs as HRED.
Results: The results are given in Table 3. All latent variable models outperform HRED w.r.t. both activities and entities. This strongly suggests that the high-level concepts represented by the latent variables help generate meaningful, goal-directed responses. Furthermore, each type of latent variable appears to help with different aspects of the generation task. GVHRED performs best w.r.t. activities (e.g. download, install and so on), which occur frequently in the dataset. This suggests that the Gaussian latent variables learn useful latent representations for frequent actions. On the other hand, HVHRED performs best w.r.t. entities (e.g. Firefox, GNOME), which are often much rarer and mutually exclusive in the dataset. This suggests that the combination of Gaussian and piecewise latent variables helps learn useful representations for entities, which could not be learned by Gaussian latent variables alone. We further conducted a qualitative analysis of the model responses, which supports these conclusions. See Appendix G. (Results on a Twitter dataset are given in the appendix.)
7 Conclusions
In this paper, we have sought to learn rich and flexible multimodal representations of latent variables for complex natural language processing tasks. We have proposed the piecewise constant distribution for the variational autoencoder framework. We have derived closed-form expressions for the quantities required in the autoencoder framework, and proposed an efficient, differentiable implementation of it. We have incorporated the proposed piecewise constant distribution into two model classes (NVDM and VHRED) and evaluated the proposed models on document modeling and dialogue modeling tasks. We have achieved state-of-the-art results on three document modeling tasks, and have demonstrated substantial improvements on a dialogue modeling task. Overall, the results highlight the benefits of incorporating the flexible, multimodal piecewise constant distribution into variational autoencoders. Future work should explore other natural language processing tasks, where the data is likely to arise from complex, multimodal latent factors.
Acknowledgments
The authors acknowledge NSERC, Canada Research Chairs, CIFAR, IBM Research, Nuance Foundation and Microsoft Maluuba for funding. Alexander G. Ororbia II was funded by a NACME-Sloan scholarship. The authors thank Hugo Larochelle for sharing the NewsGroup 20 dataset. The authors thank Laurent Charlin, Sungjin Ahn, and Ryan Lowe for constructive feedback. This research was enabled in part by support provided by Calcul Québec (www.calculquebec.ca) and Compute Canada (www.computecanada.ca).
References
 Ba et al. (2016) J. L. Ba, J. R. Kiros, and G. E. Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
 Bangalore et al. (2008) S. Bangalore, G. Di Fabbrizio, and A. Stent. 2008. Learning the structure of task-driven human–human dialogs. IEEE Transactions on Audio, Speech, and Language Processing, 16(7):1249–1259.
 Blei et al. (2003) D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. JMLR, 3:993–1022.
 Bornschein and Bengio (2015) J. Bornschein and Y. Bengio. 2015. Reweighted wake-sleep. In ICLR.
 Bowman et al. (2015) S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio. 2015. Generating sentences from a continuous space. In Conference on Computational Natural Language Learning.
 Burda et al. (2016) Y. Burda, R. Grosse, and R. Salakhutdinov. 2016. Importance weighted autoencoders. ICLR.
 Cardoso-Cachopo (2007) A. Cardoso-Cachopo. 2007. Improving Methods for Single-label Text Categorization. PhD Thesis, Instituto Superior Tecnico, Universidade Tecnica de Lisboa.
 Chen et al. (2017) X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, and P. Abbeel. 2017. Variational lossy autoencoder. In ICLR.
 Crook et al. (2009) N. Crook, R. Granell, and S. Pulman. 2009. Unsupervised classification of dialogue acts using a Dirichlet process mixture model. In Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 341–348.
 Dayan and Hinton (1996) P. Dayan and G. E. Hinton. 1996. Varieties of Helmholtz machine. Neural Networks, 9(8):1385–1403.
 Devroye (1986) L. Devroye. 1986. Sample-based non-uniform random variate generation. In Proceedings of the 18th conference on Winter simulation, pages 260–265. ACM.
 Farber (2016) M. Farber. 2016. Amazon’s ’Alexa Prize’ Will Give College Students Up To $2.5M To Create A Socialbot. Fortune.
 Fraccaro et al. (2016) M. Fraccaro, S. K. Sønderby, U. Paquet, and O. Winther. 2016. Sequential neural models with stochastic layers. In NIPS, pages 2199–2207.

 Gregor et al. (2015) K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. 2015. DRAW: A recurrent neural network for image generation. In ICLR.
 Hinton et al. (1995) G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal. 1995. The “wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158–1161.
 Hinton and Salakhutdinov (2009) G. E. Hinton and R. Salakhutdinov. 2009. Replicated softmax: an undirected topic model. In Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, editors, NIPS, pages 1607–1614. Curran Associates, Inc.
 Hinton and Zemel (1994) G. E. Hinton and R. S. Zemel. 1994. Autoencoders, minimum description length and Helmholtz free energy. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, NIPS, pages 3–10. Morgan Kaufmann.
 Hofmann (1999) T. Hofmann. 1999. Probabilistic latent semantic indexing. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 50–57. ACM.
 Jang et al. (2017) E. Jang, S. Gu, and B. Poole. 2017. Categorical reparameterization with Gumbel-softmax. In ICLR.
 Johnson et al. (2016) M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. 2016. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, pages 2946–2954.
 Jordan et al. (1999) M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. 1999. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233.
 Kannan et al. (2016) A. Kannan, K. Kurach, et al. 2016. Smart Reply: Automated Response Suggestion for Email. In KDD.
 Kingma and Ba (2015) D. Kingma and J. Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
 Kingma et al. (2016) D. P. Kingma, T. Salimans, and M. Welling. 2016. Improving variational inference with inverse autoregressive flow. NIPS, pages 4736–4744.
 Kingma and Welling (2014) D. P. Kingma and M. Welling. 2014. Auto-encoding variational Bayes. ICLR.
 Larochelle and Lauly (2012) H. Larochelle and S. Lauly. 2012. A neural autoregressive topic model. In NIPS, pages 2708–2716.
 Larsen et al. (2016) A. B. Lindbo Larsen, S. K. Sønderby, and O. Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In ICML, pages 1558–1566.
 Lauly et al. (2016) S. Lauly, Y. Zheng, A. Allauzen, and H. Larochelle. 2016. Document neural autoregressive distribution estimation. arXiv preprint arXiv:1603.05962.
 Li et al. (2016) J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan. 2016. A diversity-promoting objective function for neural conversation models. In The North American Chapter of the Association for Computational Linguistics (NAACL), pages 110–119.
 Lowe et al. (2015) R. Lowe, N. Pow, I. Serban, and J. Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. In Special Interest Group on Discourse and Dialogue (SIGDIAL).
 Lowe et al. (2017) R. T. Lowe, N. Pow, I. V. Serban, L. Charlin, C.-W. Liu, and J. Pineau. 2017. Training End-to-End Dialogue Systems with the Ubuntu Dialogue Corpus. Dialogue & Discourse, 8(1).
 Maaløe et al. (2016) L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. 2016. Auxiliary deep generative models. In ICML, pages 1445–1453.

 Maddison et al. (2017) C. J. Maddison, A. Mnih, and Y. W. Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In ICLR.
 Markoff and Mozur (2015) J. Markoff and P. Mozur. 2015. For Sympathetic Ear, More Chinese Turn to Smartphone Program. New York Times.
 Miao et al. (2016) Y. Miao, L. Yu, and P. Blunsom. 2016. Neural variational inference for text processing. In ICML, pages 1727–1736.
 Mnih and Gregor (2014) A. Mnih and K. Gregor. 2014. Neural variational inference and learning in belief networks. In ICML, pages 1791–1799.
 Neal (1992) R. M. Neal. 1992. Connectionist learning of belief networks. Artificial intelligence, 56(1):71–113.
 Ororbia II et al. (2015) A. G. Ororbia II, C. L. Giles, and D. Reitter. 2015. Online semi-supervised learning with deep hybrid Boltzmann machines and denoising autoencoders. arXiv preprint arXiv:1511.06964.
 Pascanu et al. (2012) R. Pascanu, T. Mikolov, and Y. Bengio. 2012. On the difficulty of training recurrent neural networks. ICML, 28:1310–1318.
 Ranganath et al. (2016) R. Ranganath, D. Tran, and D. Blei. 2016. Hierarchical variational models. In ICML, pages 324–333.
 Rezende and Mohamed (2015) D. J. Rezende and S. Mohamed. 2015. Variational inference with normalizing flows. In ICML, pages 1530–1538.

 Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286.
 Ritter et al. (2011) A. Ritter, C. Cherry, and W. B. Dolan. 2011. Data-driven response generation in social media. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 583–593.
 Rolfe (2017) J. T. Rolfe. 2017. Discrete variational autoencoders. In ICLR.
 Ruiz et al. (2016) F. J. R. Ruiz, M. K. Titsias, and D. M. Blei. 2016. The generalized reparameterization gradient. In NIPS, pages 460–468.
 Salakhutdinov and Hinton (2009) R. Salakhutdinov and G. E. Hinton. 2009. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978.

 Salakhutdinov and Larochelle (2010) R. Salakhutdinov and H. Larochelle. 2010. Efficient learning of deep Boltzmann machines. In AISTATS, pages 693–700.
 Salimans et al. (2015) T. Salimans, D. P. Kingma, and M. Welling. 2015. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, pages 1218–1226.
 Sennrich et al. (2016) R. Sennrich, B. Haddow, and A. Birch. 2016. Neural machine translation of rare words with subword units. In Association for Computational Linguistics (ACL).
 Serban et al. (2017a) I. V. Serban, T. Klinger, G. Tesauro, K. Talamadupula, B. Zhou, Y. Bengio, and A. Courville. 2017a. Multiresolution recurrent neural networks: An application to dialogue response generation. In Thirty-First AAAI Conference (AAAI).
 Serban et al. (2016a) I. V. Serban, R. Lowe, L. Charlin, and J. Pineau. 2016a. Generative deep neural networks for dialogue: A short review. In NIPS, Let’s Discuss: Learning Methods for Dialogue Workshop.
 Serban et al. (2016b) I. V. Serban, A. Sordoni, Y. Bengio, A. Courville, and J. Pineau. 2016b. Building end-to-end dialogue systems using generative hierarchical neural network models. In Thirtieth AAAI Conference (AAAI).
 Serban et al. (2017b) I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, and Y. Bengio. 2017b. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference (AAAI).
 Sordoni et al. (2015) A. Sordoni, M. Galley, M. Auli, C. Brockett, Y. Ji, M. Mitchell, J. Nie, J. Gao, and B. Dolan. 2015. A neural network approach to context-sensitive generation of conversational responses. In Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT 2015), pages 196–205.
 Srivastava et al. (2013) N. Srivastava, R. R. Salakhutdinov, and G. E. Hinton. 2013. Modeling documents with deep Boltzmann machines. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI), pages 616–624.
 Uria et al. (2014) B. Uria, I. Murray, and H. Larochelle. 2014. A deep and tractable density estimator. In ICML, pages 467–475.
 Zhai and Williams (2014) K. Zhai and J. D. Williams. 2014. Discovering latent structure in task-oriented dialogues. In Association for Computational Linguistics (ACL), pages 36–46.
 Zhao et al. (2017) T. Zhao, R. Zhao, and M. Eskenazi. 2017. Learning discourse-level diversity for neural dialog models using conditional variational autoencoders. In Association for Computational Linguistics (ACL).
Appendix A Appendix: Inappropriate Gaussian Priors
The majority of work on VAEs parametrizes both the prior and the approximate posterior (encoder) as a multivariate Gaussian variable. However, the multivariate Gaussian is a unimodal distribution and can therefore only represent one mode in latent space. Furthermore, the multivariate Gaussian is perfectly symmetric with a constant kurtosis. These properties are problematic if the latent variables we aim to represent are inherently multimodal, or if the latent variables follow complex, nonlinear probability manifolds (e.g. asymmetric distributions or heavy-tailed distributions). For example, the frequency of topics in news articles could be represented by a continuous probability distribution, where each topic has its own island of probability mass; sports and politics topics might each be clustered on their own separate island of probability mass with zero or little mass in between them. Due to its unimodal nature, the Gaussian distribution can never represent such probability distributions. As another example, ambiguity and uncertainty in natural language conversations could similarly be represented by islands of probability mass; given the question “How do I install Ubuntu on my laptop?”, a model might assign positive probability mass to specific, unambiguous entities like Ubuntu 4.10 and to well-defined procedures like installation using a DVD. In particular, certain entities like Ubuntu 4.10 are now outdated: these entities occur rarely in practice and should be considered rare events. When modeling such complex, multimodal latent distributions, the mapping from multivariate Gaussian latent variables to outputs (i.e. the conditional distribution) has to be highly nonlinear in order to compensate for the simplistic Gaussian distribution and capture the natural latent factors in an intermediate layer of the model. However, it is difficult to learn such nonlinear mappings when using the variational bound in eq. (3.1), as it incurs additional variance from sampling the latent variable $z$. Consequently, such models are likely to converge on solutions that do not capture salient aspects of the latent variables, which in turn leads to a poor fit of the output distribution.
Appendix B Appendix: Piecewise Constant Variable Derivations
To train the model using the reparametrization trick, we need to generate $z = f(\epsilon)$ where $\epsilon \sim \text{Uniform}(0, 1)$. To do so, we employ inverse transform sampling (Devroye, 1986), which requires finding the inverse of the cumulative distribution function (CDF). We derive the CDF of eq. (2):
(3) 
Next, we derive its inverse:
(4) 
Armed with the inverse CDF, we can now draw a sample $z$:
(5) 
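As a rough illustration of the sampling step above, the following sketch implements inverse transform sampling in NumPy. It assumes the piecewise constant density has $n$ equal-width pieces on $[0, 1]$ with positive unnormalized heights `a`; the function name `sample_piecewise_constant` and this parametrization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_piecewise_constant(a, eps):
    """Map uniform noise eps in [0, 1) to samples from a piecewise constant density on [0, 1].

    a:   (n,) positive, unnormalized piece heights; piece i covers [i/n, (i+1)/n).
    eps: array of Uniform(0, 1) draws.
    """
    a = np.asarray(a, dtype=float)
    eps = np.asarray(eps, dtype=float)
    n = a.shape[0]
    mass = a / a.sum()               # probability mass of each piece
    cdf_right = np.cumsum(mass)      # CDF value at the right edge of each piece
    cdf_left = cdf_right - mass      # CDF value at the left edge of each piece
    # Find the piece whose CDF interval [cdf_left, cdf_right) contains eps.
    idx = np.searchsorted(cdf_right, eps, side="right")
    idx = np.minimum(idx, n - 1)     # guard against round-off when eps is close to 1
    # The CDF is linear inside a piece, so it can be inverted in closed form.
    return (idx + (eps - cdf_left[idx]) / mass[idx]) / n
```

For fixed noise `eps`, the returned sample is a continuous function of the heights `a` and differentiable almost everywhere, which is what makes this kind of sampler usable with the reparametrization trick when written in an automatic differentiation framework.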
In addition to sampling, we need to compute the Kullback-Leibler (KL) divergence between the prior and approximate posterior distributions of the piecewise constant variables. We assume both the prior and the posterior are piecewise constant distributions. We use the prior superscript to denote prior parameters and the post superscript to denote posterior parameters (encoder model parameters). The KL divergence between the prior and posterior can be computed using a sum of integrals, where each integral inside the sum corresponds to one constant segment:
(6)  
(7)  
(8)  
(9) 
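The sum-of-integrals computation described above can be sketched numerically as follows. The sketch assumes that prior and posterior share the same $n$ equal-width pieces on $[0, 1]$, that the heights are positive, and that the divergence is taken from the posterior to the prior, as is usual in the variational lower-bound; the function name and argument names are hypothetical.

```python
import numpy as np

def piecewise_constant_kl(a_post, a_prior):
    """KL(posterior || prior) for two piecewise constant densities on [0, 1].

    a_post, a_prior: (n,) positive, unnormalized piece heights over the same n pieces.
    """
    a_post = np.asarray(a_post, dtype=float)
    a_prior = np.asarray(a_prior, dtype=float)
    q_mass = a_post / a_post.sum()     # posterior probability mass per piece
    p_mass = a_prior / a_prior.sum()   # prior probability mass per piece
    # On each piece both densities are constant (n * mass), so the integral of
    # q * log(q / p) over a piece of width 1/n reduces to q_mass * log(q_mass / p_mass).
    return float(np.sum(q_mass * (np.log(q_mass) - np.log(p_mass))))
```

Under these assumptions the divergence depends only on the normalized piece masses, so rescaling either set of heights by a positive constant leaves the result unchanged.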
In order to improve training, we further transform the piecewise constant latent variables to lie within the interval $[-1, 1]$ after sampling: $z' = 2z - 1$. This ensures the input to the decoder RNN has mean zero initially.
Appendix C Appendix: NVDM Implementation
The complete NVDM architecture is defined as:
where is the Hadamard product, is an operator that combines the Gaussian and the piecewise variables and is the decoder model. [Footnote 5: Operations include vector concatenation, summation, or averaging.] As a result of using the reparametrization trick and choice of prior, we calculate the latent variable through the two samples, and . is a nonlinear activation function, which was the parametrized linear rectifier (with a learnable “leak” parameter) for the 20 NewsGroups experiments and the softsign function, or $x / (1 + |x|)$, for Reuters and CADE. The decoder model outputs a probability distribution over words conditioned on . In this case, we define as the softmax function (omitting the bias term for clarity) computed as:
The decoder’s output is used to calculate the first term in the variational lower-bound: . The prior and posterior distributions are used to compute the KL term in the variational lower-bound. The lower-bound is:
where the KL term is the sum of the Gaussian and piecewise KL divergence measures:
KL  
The KL terms may be interpreted as regularizers of the parameter updates for the encoder model (Kingma and Welling, 2014). These terms encourage the posterior distributions to be similar to their corresponding prior distributions, by limiting the amount of information the encoder model transmits regarding the output.
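To make the structure of the lower-bound concrete, the sketch below combines a reconstruction log-likelihood with the two KL terms. The Gaussian part uses the standard closed form for diagonal Gaussians; the piecewise KL is passed in as a precomputed number. The function names and the assumption of diagonal covariances are illustrative and not taken from the authors' code.

```python
import numpy as np

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p))), summed over dimensions."""
    mu_q, var_q, mu_p, var_p = (
        np.asarray(v, dtype=float) for v in (mu_q, var_q, mu_p, var_p)
    )
    per_dim = np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    return float(0.5 * np.sum(per_dim))

def hybrid_lower_bound(log_likelihood, kl_gaussian, kl_piecewise):
    """Lower-bound = reconstruction term minus the sum of both KL regularizers."""
    return log_likelihood - (kl_gaussian + kl_piecewise)
```

Larger KL values lower the bound, which is the sense in which the KL terms act as regularizers that pull each posterior toward its prior.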
Appendix D Appendix: VHRED Implementation
As described in the model section, the probability distribution of the generative model factorizes as:
(10) 
where are the model parameters. VHRED uses three RNN modules: an encoder RNN, a context RNN and a decoder RNN. First, each utterance is encoded into a vector by the encoder RNN:
where is either a GRU or a bidirectional GRU function. The last hidden state of the encoder RNN is given as input to the context RNN. The context RNN uses this state to update its internal hidden state:
where is a GRU function taking as input two vectors. This state conditions the prior distribution over :
(11) 
where is a PDF parametrized by both and . Next, a sample is drawn from this distribution: . The sample and context state are given as input to the decoder RNN:
where is the LSTM gating function taking as input four vectors. The output distribution is computed by passing through an MLP , an affine transformation and a softmax function:
(12) 
where is the word embedding matrix for the output distribution with embedding dimensionality .
As mentioned in the model section, the approximate posterior is conditioned on the encoder RNN state of the next utterance:
(13) 
where is a PDF parametrized by and (i.e. the future state of the encoder RNN after processing ).
For the Gaussian latent variables, we use the interpolation gating mechanism described in the main text for the approximate posterior. We experimented with other mechanisms for controlling the gating variables, such as defining and to be a linear function of the encoder. However, this did not improve performance in our preliminary experiments.
Appendix E Appendix: Training Details
Piecewise Constant Variable Interpolation We conducted initial experiments with the interpolation gating mechanism for the approximate posterior of the piecewise constant latent variables. However, we found that this did not improve performance.
Dialogue Modeling We use the Ubuntu Dialogue Corpus v2.0 extracted January, 2016: http://cs.mcgill.ca/~jpineau/datasets/ubuntucorpus1.0/.
For the HRED model we found that an additional rectified linear units layer decreased performance on the validation set according to the activity F1 metric. Hence we test HRED without the rectified linear units layer. On the other hand, for all VHRED models we found that the additional rectified linear units layer improved performance on the validation set. For PVHRED, we found that a final weight of one for the KL divergence terms performed best on the validation set. For GVHRED and HVHRED, reweighing the KL divergence terms with a final value performed best on the validation set. We conducted preliminary experiments with and pieces, and found that models with were easier to train. Therefore, we use pieces for both PVHRED and HVHRED.
For all models, we compute the log-likelihood and variational lower-bound costs starting from the second utterance in each dialogue.
Appendix F Appendix: Additional Document Modeling Experiments
Iterative Inference
For the document modeling experiments, our results and conclusions depend on how tight the variational lower-bound is. As such, it is in theory possible that some of our models are performing much better than reported by the variational lower-bound on the test set. Therefore, we use a non-parametric iterative inference procedure to tighten the variational lower-bound, which aims to learn a separate approximate posterior for each test example. The iterative inference procedure consists of simple stochastic gradient descent (no more than 100 steps), with a learning rate of and the same gradient rescaling used in training. For 20 NewsGroups, the iterative inference procedure is stopped on a test example if the bound does not improve over 10 iterations. For Reuters and CADE, the iterative inference procedure is stopped if the bound does not improve over iterations. During iterative inference the parameters of the model, as well as the generated prior, are all fixed. Only the gradients of the variational lower-bound with respect to the generated posterior model parameters (i.e. the mean and variance of the Gaussian variables, and the piecewise components, ) are used to update the posterior model for each document (using a freshly drawn sample for each inference iteration step). Note that this form of inference is expensive and requires additional meta-parameters (e.g. a step-size and an early-stopping criterion). We remark that a simpler, and more accurate, approach to inference might perhaps be to use importance sampling.
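The procedure can be illustrated on a deliberately small toy problem. The sketch below is not the authors' setup: it assumes a one-dimensional model with prior $z \sim N(0, 1)$, likelihood $x \mid z \sim N(z, 1)$ and a Gaussian posterior only, uses a hypothetical learning rate of 0.05, and evaluates the bound exactly for the early-stopping check (which is possible only because the toy model is analytic). What it shares with the description above is the structure: fixed model parameters, per-example updates of the posterior parameters, a fresh sample at every step, at most 100 steps, and stopping when the bound has not improved for 10 iterations.

```python
import numpy as np

def elbo(x, mu, log_sigma):
    """Exact lower-bound for z ~ N(0, 1), x | z ~ N(z, 1), q(z) = N(mu, sigma^2)."""
    var = np.exp(2.0 * log_sigma)
    expected_log_lik = -0.5 * np.log(2.0 * np.pi) - 0.5 * ((x - mu) ** 2 + var)
    kl = 0.5 * (mu ** 2 + var - 1.0) - log_sigma
    return expected_log_lik - kl

def iterative_inference(x, mu, log_sigma, lr=0.05, max_steps=100, patience=10, seed=0):
    """Refine the posterior parameters of one test example; the model stays fixed."""
    rng = np.random.default_rng(seed)
    best = (elbo(x, mu, log_sigma), mu, log_sigma)
    stale = 0
    for _ in range(max_steps):
        sigma = np.exp(log_sigma)
        eps = rng.standard_normal()      # freshly drawn noise at every step
        z = mu + sigma * eps             # reparametrized latent sample
        # Single-sample gradients of the bound w.r.t. the posterior parameters:
        # reconstruction term through z, KL term in closed form.
        grad_mu = (x - z) - mu
        grad_log_sigma = (x - z) * sigma * eps - (sigma ** 2 - 1.0)
        mu = mu + lr * grad_mu           # gradient ascent on the bound
        log_sigma = log_sigma + lr * grad_log_sigma
        bound = elbo(x, mu, log_sigma)
        if bound > best[0]:
            best, stale = (bound, mu, log_sigma), 0
        else:
            stale += 1
            if stale >= patience:        # early stopping
                break
    return best                          # (bound, mu, log_sigma) with the best bound seen
```

In this toy model the exact posterior is $N(x/2, 1/2)$, so the refined bound can be compared against the true log-evidence; in the document models no such reference is available, which is why the tightened bound is reported instead.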
The results based on iterative inference are reported in Table 5. As in Section 6.1, we find that HNVDM outperforms the GNVDM model, which confirms our previous conclusions.
In our current examples, it appears that the HNVDM with 5 pieces returns more general words. For example, as evidenced in Table 4, in the case of “government”, the baseline seems to value the plural form of the word (which is largely based on morphology) while the hybrid model actually pulls out meaningful terms such as “federal”, “policy”, and “administration”.





governments  citizens  arms  
citizens  rights  rights  
country  governments  federal  
threat  civil  country  
private  freedom  policy  
rights  legitimate  administration  
individuals  constitution  protect  
military  private  private  
freedom  court  citizens  
foreign  states  military 





LDA  
RSM  
docNADE  
SBN  
fDARN  
NVDM  
GNVDM  
HNVDM3  
HNVDM5 





GNVDM  
HNVDM3  
HNVDM5  
CADE  Sampled  SGDInf  
GNVDM  
HNVDM3  
HNVDM5 






Time-related  | GKL  | GKL  | PKL  | 
months  23  33  40  
day  28  32  35  
time  55  22  40  
century  28  13  19  
past  30  18  28  
days  37  14  19  
ahead  33  20  33  
years  44  16  38  
today  46  27  71  
back  31  30  47  
future  20  15  20  
order  42  14  26  
minute  15  34  40  
began  16  5  13  
night  49  12  18  
hour  18  17  16  
early  42  42  69  
yesterday  25  26  36  
year  60  17  21  
week  28  54  58  
hours  20  26  31  
minutes  40  34  38  
months  23  33  40  
history  32  18  28  
late  41  45  31  
moment  23  17  16  
season  45  29  37  
summer  29  28  31  
start  30  14  38  
continue  21  32  34  
happened  22  27  35 






Names  GKL  GKL  PKL  
henry  33  47  39  
tim  32  27  11  
mary  26  51  30  
james  40  72  30  
jesus  28  87  39  
george  26  56  29  
keith  65  94  61  
kent  51  56  15  
chris  38  55  28  
thomas  19  35  19  
hitler  10  14  9  
paul  25  52  18  
mike  38  76  40  
bush  21  20  14  
Adjectives  GKL  GKL  PKL  
american  50  12  40  
german  25  21  22  
european  20  17  27  
muslim  19  7  23  
french  11  17  17  
canadian  18  10  16  
japanese  16  9  24  
jewish  56  37  54  
english  19  16  26  
islamic  14  18  28  
israeli  24  14  18  
british  35  15  17  
russian  14  19  20 
Approximate Posterior Analysis We present an additional analysis of the approximate posterior on 20 NewsGroups, in order to understand what the models are capturing. For a test example, we calculate the squared norm of the gradient of the KL terms w.r.t. the word embedding input to the approximate posterior model. The higher the squared norm of the gradients of a word is, the more influence it will have on the posterior approximation (encoder model). For every test example, we count the top words with highest squared gradients separately for the multivariate Gaussian and piecewise constant latent variables. [Footnote 6: Our approach is equivalent to counting the top words with the highest L2 gradient norms.]
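The counting step of this analysis can be sketched as follows. The gradients themselves would come from an automatic differentiation framework and are treated here as given arrays; the function names, the `(words, grads)` input format and the cutoff `k` are hypothetical, since the number of top words is not recoverable from the text.

```python
import numpy as np
from collections import Counter

def top_words_by_gradient(words, grads, k):
    """Return the k words whose embedding gradients have the largest squared L2 norm.

    words: list of tokens in one test example.
    grads: (len(words), embedding_dim) gradient of a KL term w.r.t. each word embedding.
    """
    sq_norms = np.sum(np.asarray(grads, dtype=float) ** 2, axis=1)
    order = np.argsort(-sq_norms, kind="stable")[:k]   # largest squared norm first
    return [words[i] for i in order]

def count_top_words(examples, k):
    """Count how often each word is among the top-k words across (words, grads) pairs."""
    counts = Counter()
    for words, grads in examples:
        counts.update(top_words_by_gradient(words, grads, k))
    return counts
```

Running `count_top_words` once with the gradients of the Gaussian KL term and once with those of the piecewise KL term would give two sets of per-word counts of the kind compared in the tables. Because squaring is monotone on non-negative values, ranking by squared norm and ranking by L2 norm select the same words.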
The results, shown in Table 6, illustrate how the piecewise variables capture different aspects of the document data. The Gaussian variables were originally sensitive to some of the words in the table. However, in the hybrid model, nearly all of the temporal words that the Gaussian variables were once more sensitive to now more strongly affect the piecewise variables, which themselves also capture all of the words that were originally missed. This shift in responsibility indicates that the piecewise constant variables are better equipped to handle certain latent factors. This effect appears to be particularly strong in the case of certain nationality-based adjectives (e.g., “american”, “israeli”, etc.). While the GNVDM could model multimodality in the data to some degree, this work would be primarily done in the model’s decoder. In the HNVDM, the piecewise variables provide an explicit mechanism for capturing modes in the unknown target distribution, so it makes sense that the model would learn to use the piecewise variables instead, thus freeing up the Gaussian variables to capture other aspects of the data, as we found was the case with names (e.g., “jesus”, “kent”, etc.).
Appendix G Appendix: Additional Dialogue Modeling Experiments
Dialogue Context (History)  Response 

Hi . I am installing ubuntu now in my new laptop . In ” something else ” partitioning , what mount point should I set for a drive which is not root or not home … It ’s up to you , just choose a directory that will remind you of the contents of that partition . E.G. : if it ’s the Windows partition , use /windows . it ’s a new harddrive with full free space . I bought it without windows preinstalled . I want to create drives in which I will only store files .. I mean , not root or not home . What mount point do I set for it ? ” /mount ” is not shown in drop down menu sorry . I mean /mount I mean , in my desktop , extra drives are mounted in /media do you understand my problem ? Sorry , English is not my native language . I do :) Just use a directory with a meaningful name and preferably in the root , for example ” /files ”. Choose something like ” /files ”. / for root , /home for home , but what mount point for a file storage partition . wait , let me see if there is a ” /files ” , thank you . 
HRED: No problem .
GVHRED: It should be in the list of the drive . If you want to be able to mount it .
PVHRED: If you want to mount it , you can use the mount command .
HVHRED: As far as I know , you can use the ” mount ” command to mount the partitions .
I have a problem installing Ubuntu 12.04 desktop . Tried desktop with USB > hangs . alternate with usb > hangs . alternate with cd > hangs ( even faster) . I tried using different usb drives . Could someonle please help me ? how are you making the usb ? and whats the pc specs . I used Universal usb installer what exactly does it do when it ’ hangs ’ Specs : CPU core i7 @ 3.4 Ghz , Sapphire Radeon HD 7800 , 8GB RAM ATi video ? tried the NOMODESET option yet ? nothing at all , I also can’t witch to console using Ctrl + alt + F not yet Will try that , thanks 
HRED: you could try the alternate install cd .
GVHRED: it might be a problem with your graphics card . You could also try the alternate install cd and see if it works .
PVHRED: try ’ nomodeset ’ or ’ nomodeset ’
HVHRED: only thing you can do is run the command ’ sudo dpkg-reconfigure xserver-xorg ’ and see if you get any errors .
anyone know how to add a route in ubuntu ? getting an error on adding a route , and I’m not sure why pastebin your command and pronlem <url > 
HRED: can you pastebin the output of ” route -n ” please
GVHRED: if you want to use pastebin to pastebin the output of ” sudo route -n ” please pastebin the output of ” route ”
PVHRED: what is the output of : route -n
HVHRED: only if you have a static ip address , you need to specify the ip address of the gateway
Any idea why empathy ’s not playing notification sounds ? Even though I have ’ em ticked in preferences restarted it yet ? yar check pulse to see if the application is muted for some reason ? well Sound settings . Had sound effects turned off in sound settings , didn’t realize that controlled other applications Ah yea , ive done it a few time it ’s annoying My favorite though is recently pulse has been freezing on my desktop and audio will just not be adjustable for like … 30 seconds or so 
HRED: I don’t think it ’s a problem with pulseaudio , but I’m not sure how to fix it
GVHRED: Thank you for your help . I will try that . Thanks for the suggestion .
PVHRED: I think it ’s a bug with pulseaudio , but I don’t know how to fix it .
HVHRED: as far as I can tell , I have a problem with pulseaudio , but I don’t know how to fix it .
Question about setting up drives : Running 12.04 LTS . I have two drives . An SSD that holds the OS for booting , and a second drive ” data ” for all the data , home directories , etc . The data drive is encrypted with LUKS . With my current setup , whenever I reboot , I get a password entry screen to decrypt the data drive , where my home directory is . This works fine . The problem is that if I reboot remotely , I’m stuck as I can’t enter password . If I keep my home directory on the SSD , there won’t be enough room . If I use a symbolic link , and the encrypted drive isn’t mounted , that creates errors and issues . Suggestions ?? yes , you can redirect the Message was cut off … 
HRED: if you want to see the data on the drive , you can mount it in /etc/fstab
GVHRED: it should be a good place to do that , but if you want to do a clean install , you will need to reboot
PVHRED: it ’s not a problem , it ’s just a bug
HVHRED: if you want to see the data on the drive , you can mount it in /etc/fstab
hey guys , does anyone know why i cannot connect to ssh via my external IP ? (connection refused ) but ssh v localhost works fine ? Have you forwarded the port on your router to the machine you’re trying to access ? It is a work machine , so am not sure of the rules 
HRED: You need to forward port 22 to your router to forward port 22 to the server .
GVHRED: That ’s odd . What are you trying to do ? Can you pastebin the output of ” sudo netstat ” to ” pastebin ” please ?
PVHRED: Can you pastebin the output of ” sudo apt-get install openssh-server ”?
HVHRED: Even if it ’s not working , then you need to set the port forward to your router .
Ubuntu Experiments We present test examples (dialogue context and model responses generated using beam search) for the Ubuntu models in Table 7. The examples qualitatively illustrate the differences between models. First, we observe that HRED tends to generate highly generic responses compared to all the latent variable models. This supports the quantitative results reported in the main text, and suggests that modeling the latent factors through latent variables is critical for this task. Next, we observe that HVHRED tends to generate relevant entities and commands, such as mount command, xserver-xorg, static ip address and pulseaudio in examples 1–4. On the other hand, GVHRED tends to be better at generating appropriate verbs, such as list, install, pastebin and reboot in examples 1–3 and example 5. Qualitatively, the PVHRED model appears to perform somewhat worse than both GVHRED and HVHRED. This suggests that the Gaussian latent variables are important for the Ubuntu task, and therefore that the best performance may be obtained by combining both Gaussian and piecewise latent variables together in the HVHRED model.
Twitter Experiments We also conducted a dialogue modeling experiment on a Twitter corpus extracted from public Twitter conversations (Ritter et al., 2011). The dataset is split into training, validation, and test sets, containing 749,060, 93,633 and 9,399 dialogues respectively. On average, each dialogue contains about utterances (dialogue turns) and about words. We preprocessed the tweets using byte-pair encoding (Sennrich et al., 2016) with a vocabulary consisting of 5000 subwords.
We trained our models with a learning rate of and minibatches of size or . [Footnote 7: We had to vary the minibatch size to make the training fit on GPU architectures with low memory.] As for the Ubuntu experiments, we used a variant of truncated backpropagation and applied gradient clipping. We experiment with GVHRED and HVHRED. Similar to Serban et al. (2017b), we use a bidirectional GRU RNN encoder, where the forward and backward RNNs each have hidden units. We experiment with context RNN encoders with and hidden units, and find that hidden units reach better performance w.r.t. the variational lower-bound on the validation set. The encoder and context RNNs use layer normalization (Ba et al., 2016). We experiment with decoder RNNs with , and hidden units (LSTM cells), and find that hidden units reach better performance. For the GVHRED model, we experiment with latent multivariate Gaussian variables with and dimensions, and find that dimensions reach better performance. For the HVHRED model, we experiment with latent multivariate Gaussian and piecewise constant variables each with and dimensions, and find that dimensions reach better performance. We drop words in the decoder with a fixed drop rate of and multiply the KL terms in the variational lower-bound by a scalar, which starts at zero and linearly increases to one over the first 60,000 training batches. Note that, unlike the Ubuntu experiments, the final weight of the KL divergence is exactly one (hence the bound is tight).
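The KL weighting schedule and the word-drop step described above can be sketched as follows. The 60,000-batch warm-up and the final weight of one come from the text; the function names, the choice to replace dropped words with a placeholder token id (`unk_id`), and the drop rate passed in by the caller are assumptions made for illustration.

```python
import numpy as np

def kl_weight(step, warmup_steps=60_000, final_weight=1.0):
    """Linearly increase the KL multiplier from zero to final_weight over warmup_steps batches."""
    return final_weight * min(1.0, step / warmup_steps)

def drop_words(token_ids, drop_rate, unk_id, rng):
    """Replace each decoder input token by unk_id independently with probability drop_rate."""
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < drop_rate   # True where a word is dropped
    return np.where(mask, unk_id, token_ids)
```

During training, the weighted objective at batch `step` would then be the reconstruction term minus `kl_weight(step)` times the sum of the KL terms, with the decoder reading the output of `drop_words` instead of the original tokens.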
Word  GVHRED  HVHRED  Word  GVHRED  HVHRED  

Time-related  | GKL  | GKL  | PKL  | Event-related  | GKL  | GKL  | PKL 
monday  3  5  10  school  9  16  50 
tuesday  2  3  7  class  11  16  27 
wednesday  4  11  13  game  20  26  41 
thursday  2  3  9  movie  12  20  41 
friday  9  18  26  club  13  22  28 
saturday  6  6  13  party  8  10  32 
sunday  2  2  9  wedding  7  13  23 
weekend  8  16  32  birthday  12  20  23 
today  18  28  56  easter  15  15  23 
night  16  31  68  concert  7  16  20 
tonight  32  36  47  dance  11  12  21 
Word  GVHRED  HVHRED  Word  GVHRED  HVHRED  
Sentiment related 
GKL  GKL  PKL 
Acronyms, Punctuation Marks & Emoticons 
GKL  GKL  PKL 
good  72  73  44  lol  394  358  312 
love  102  101  38  omg  52  45  19 
awesome  26  44  39  .  386  558  1009 
cool  14  28  29  !  648  951  525 
haha  132  101  75  ?  507  851  221 
hahaha  60  48  24  *  108  54  19 
amazing  14  38  33  xd  28  42  26 
thank  137  153  29  56  42  24 
Our hypothesis is that the piecewise constant latent variables are able to capture multimodal aspects of the dialogue. Therefore, we evaluate the models by analyzing what information they have learned to represent in the latent variables. For each test dialogue with utterances, we condition each model on the first utterances and compute the latent posterior distributions using all utterances. We then compute the gradients of the KL terms of the multivariate Gaussian and piecewise constant latent variables w.r.t. each word in the dialogue. Since the words are discrete, we compute the sum of the squared gradients w.r.t. each word embedding. The higher the sum of squared gradients for a word, the more influence it has on the posterior approximation (encoder model). For every test dialogue, we count the top words with the highest squared gradients separately for the multivariate Gaussian and piecewise constant latent variables. (Our approach is equivalent to counting the top words with the highest L2 gradient norms. We also ran some experiments using L1 gradient norms, which showed similar patterns.)
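The counting step of this analysis can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the gradients of a KL term w.r.t. each word embedding are assumed to have been computed already (for example by automatic differentiation) and are passed in as plain lists of floats. The names `squared_grad_norm`, `top_k_words` and `count_salient_words`, and the parameter `k`, are hypothetical.

```python
from collections import Counter


def squared_grad_norm(grad):
    """Sum of squared gradient components for one word embedding."""
    return sum(g * g for g in grad)


def top_k_words(words, grads, k):
    """Return the k words whose embeddings have the largest squared gradients.

    words: tokens of one dialogue; grads: one gradient vector per token.
    A word occurring at several positions may be returned more than once.
    """
    scores = [squared_grad_norm(g) for g in grads]
    order = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)
    return [words[i] for i in order[:k]]


def count_salient_words(dialogues, k):
    """Count, over all dialogues, how often each word is among the top k."""
    counts = Counter()
    for words, grads in dialogues:
        counts.update(top_k_words(words, grads, k))
    return counts
```

Running `count_salient_words` once with the gradients of the Gaussian KL term and once with those of the piecewise constant KL term would give two sets of counts per word. Because squaring is monotonic on non-negative values, ranking by the squared norm selects the same words as ranking by the L2 norm.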
The results are shown in Table 8. The piecewise constant latent variables clearly capture different aspects of the dialogue compared to the Gaussian latent variables. The piecewise constant variable approximate posterior encodes words related to time (e.g. weekdays and times of day) and events (e.g. parties, concerts, Easter). On the other hand, the Gaussian variable approximate posterior encodes words related to sentiment (e.g. laughter and appreciation) and acronyms, punctuation marks and emoticons (i.e. smilies). We also conduct a similar analysis on the document models evaluated in Subsection 6.1, the results of which may be found in the Appendix.