1 Introduction
While automatic natural language generation (NLG), in particular from structured data, has a long tradition (Reiter and Dale, 2000), recent advances in deep learning have given it a new impetus. In parallel to the massive number of deep generative models for creating realistic images, a fair number of papers have introduced probabilistic generative models of text (Zhao et al., 2017a; Hu et al., 2017, inter alia) which are claimed to produce fluent and meaningful samples from a continuous vector representation. Similar to research focused on image generation, related but distinct text generation tasks for such models include:

(1) sentence reconstruction – given a natural language sentence, can we encode it into a fixed-length vector and then reconstruct it from that representation?

(2) unconditional sentence generation – can we generate fluent sentences that follow the distribution of sentences in natural language?

(3) conditional sentence generation – given a content and/or style representation, can we generate a sentence expressing that content and exhibiting the desired stylistic properties?
While tasks (1) and (2) may not have obvious applications, they are important for assessing the properties of the learned representations and their usefulness for other tasks, including (3). For example, autoencoder-based models have been proposed (Hu et al., 2017; Zhao et al., 2017a) for learning disentangled representations of style and content from unaligned data. One possible application of such autoencoders is modifying the style of a sentence by manipulating the style representation, but this is only possible if the model can encode and generate accurately, i.e. has a low reconstruction error.
Partly due to the difficulty of evaluating text generation directly, recent studies on autoencoders for text (Bowman et al., 2016; Hu et al., 2017) have mostly focused on applying them to tasks such as language modeling and classification. With the exception of Zhao et al. (2017a), these studies do not consider the reconstruction task. Bowman et al. (2016) report negative results of variational autoencoders on language modeling, which suggests that the reconstruction error of these models will be high.
The lack of evaluation standards has resulted in fierce debates around the experimental setup of some of the most novel neural-based text generation studies, to the point that their utility has been questioned (see the discussions around the posts by Yoav Goldberg from June 2017: https://medium.com/@yoav.goldberg). This is unfortunate because neural generative models for text do hold real promise for NLG, as the progress in MT over the past few years has clearly demonstrated.
In this paper, we strive to address the methodological issues in current neural text generation research and to close some gaps by answering a few natural questions raised by the studies already published. We focus on neural generative models from the autoencoder family and their performance on tasks (1) and (2), because we feel that this area has not been sufficiently explored and deserves a proper treatment before one moves on to more complex setups.
In particular, our contributions are as follows:

We focus on several of the most recent autoencoder models for sentence generation, namely plain (AE), variational (VAE) and adversarially regularized (ARAE) autoencoders (Kingma and Welling, 2013; Bowman et al., 2016; Zhao et al., 2017a), as well as adversarial autoencoders (AAE; Makhzani et al., 2015), and compare them on an equal footing.

We study the effects of alternative techniques for regularizing autoencoders for text, namely latent code normalization, injecting noise into the latent representation, and RNN dropout.

We show that these simple techniques are sufficient for training an autoencoder that is comparable to state-of-the-art models for unconditional text generation while outperforming them in terms of reconstruction accuracy.

We rigorously evaluate different variants of autoencoder models with human raters and compute a rich set of automatic metrics on both generated samples and reconstructions, which is missing from previous work.

In particular, we introduce a novel technique for automatically measuring the quality of generated texts – Fréchet InferSent Distance (inspired by the recent work on image generation by Heusel et al., 2017).
2 Related work
VAEs were introduced by Kingma and Welling (2013) and have since seen many successful applications in computer vision; the first study of VAEs for text generation was performed by Bowman et al. (2016). While demonstrating that VAEs can be a viable way to train unconditional generative models for text, the authors show that training a VAE with an LSTM (Hochreiter and Schmidhuber, 1997) decoder leads to an issue where the decoder tends to ignore the latent code completely and hence collapse to a language model. To alleviate this issue, the authors applied two moderately successful tricks: input dropout and KL term annealing. While demonstrating that their model can generate natural-looking samples, they omit reconstruction performance from the discussion, which is important as it indicates how well the encoder generalizes and structures the latent space.

To address some of the issues of training VAE models for text discussed by Bowman et al. (2016), Semeniuta et al. (2017) propose a hybrid architecture composed of a convolutional encoder and a decoder combining a deconvolutional and an autoregressive layer (LSTM or ByteNet, Kalchbrenner et al., 2016). This model is shown to better handle longer sequences and, more importantly, to allow better control over the KL term. The latter ensures that the latent vector is actually useful and used by the decoder. Additionally, similar to findings in Chen et al. (2016) and Yang et al. (2017), explicit control over the autoregressive power of the decoder, e.g. by using a ByteNet decoder with a smaller receptive field, helps to alleviate this issue. In this work we employ a standard LSTM encoder/decoder architecture; our primary focus is on various mechanisms for matching the posterior and prior distributions and their effects on structuring the latent space.
The original VAE objective includes a KL penalty term whose goal is to match the approximate posterior with a prior. This regularizes (smooths out) the latent space, ensuring that it is possible to generate meaningful samples from any point from the prior. Instead of using a conventional KL penalty, Makhzani et al. (2015) propose to use a GAN discriminator to match the aggregate approximate posterior with the prior. Bousquet et al. (2017) provide a proof that this in effect corresponds to minimizing a Wasserstein distance in the primal between the data and generated distributions. Zhao et al. (2017a) attack the problem from a different angle by using a GAN to instead learn a powerful prior that matches the aggregated posterior. Thus, during generation, the latent vectors are sampled from the GAN generator instead of being drawn directly from an imposed prior.
Hu et al. (2017) propose a conditional VAE model for text where a discriminator is used to impose desired attributes on generated samples and disentangle them from the latent representation produced by the encoder. To enable backpropagation from the discriminator, the recurrent decoder is made fully differentiable by applying a continuous approximation.
Another notable approach to applying VAEs to text was recently proposed by Guu et al. (2017), where generation is treated as a prototype-then-edit task – sample a prototype sentence from the training corpus and then edit it into a new sentence. Unlike conventional VAEs, where the encoder packs the whole sentence into a latent vector, Guu et al. (2017) choose the latent vector to represent an edit that transforms an input prototype into a new sentence.
Finally, autoencoders, as a key technique in unsupervised representation learning, have been widely applied in NLP tasks to regularize language models and sequence-to-sequence models (Dai and Le, 2015), for supervised machine translation (Zhang et al., 2016), and more recently, for enabling unsupervised machine translation (Artetxe et al., 2017; Lample et al., 2017).
3 Background
In this section, we briefly review two previously proposed types of generative models which we adopt in this work: variational and adversarial autoencoders.
Both are autoencoders consisting of two components: an encoder $E$, which transforms an input $x$ into an embedding (latent code) $z = E(x)$, and a decoder (generator) $G$, which produces a reconstruction $\hat{x} = G(z)$ of $x$ from $z$. A prior distribution $p(z)$ is imposed on the embedding space, and the model is trained to match the aggregated posterior to the prior. The two models differ in the way they achieve this goal: a VAE includes a KL divergence term in its cost function, while AAEs employ an adversarial training objective.
3.1 Variational autoencoder
A variational autoencoder (Kingma and Welling, 2013) maximizes a lower bound on the marginal log-likelihood:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$

The first term is the log-probability of reconstructing the input $x$ given the latent vector $z$ sampled from the posterior distribution $q_\phi(z \mid x)$. The second term is the negative KL divergence from the prior to the posterior, which effectively acts as a regularizer, pushing the posterior closer to the prior. A standard Gaussian is usually chosen as the prior distribution, and the posterior (the output distribution of the encoder) is modelled as a diagonal Gaussian to allow for gradient backpropagation using the reparameterization trick.
A VAE for text, as proposed by Bowman et al. (2016), uses an RNN encoder and decoder. The authors use KL cost annealing (gradually increasing the weight of the KL term from 0 to 1) and word dropout (randomly masking out tokens from the decoder’s input during training) to encourage the decoder to make use of the latent vector produced by the encoder.
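The main ingredients of this training recipe – the closed-form KL term for a diagonal Gaussian posterior, the reparameterization trick, and a linear KL annealing schedule – can be sketched in numpy (function names are ours; a framework implementation would backpropagate through `mu` and `logvar`):

```python
import numpy as np

def kl_diag_gaussian(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def reparameterize(mu, logvar, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); gradients can flow through mu and logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_weight(step, anneal_steps):
    # KL cost annealing: the weight of the KL term grows linearly from 0 to 1
    return min(1.0, step / anneal_steps)
```

When the posterior equals the prior (`mu = 0`, `logvar = 0`), the KL term is exactly zero, which is the degenerate solution the annealing schedule is designed to escape.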
3.2 Adversarial autoencoder
Adversarial autoencoders (Makhzani et al., 2015) regularize the embedding space by means of adversarial training. The model is extended with an adversarial network (discriminator) $D$, which is trained to predict whether a given vector is a sample from the imposed prior distribution $p(z)$ or an embedding produced by the encoder $E$:

$$\mathcal{L}_D = -\mathbb{E}_{z \sim p(z)}\left[\log D(z)\right] - \mathbb{E}_{x}\left[\log\left(1 - D(E(x))\right)\right]$$

Here, $D(z)$ is the probability, predicted by the discriminator, that $z$ is a genuine sample from the prior distribution.

Meanwhile, the autoencoder is trained in two alternating optimization steps. In the reconstruction phase, we optimize the standard reconstruction objective:

$$\mathcal{L}_{\mathrm{rec}} = -\mathbb{E}_{x}\left[\log p_G(x \mid E(x))\right]$$

In the regularization phase, the encoder is trained to fool the discriminator so that the latter is unable to distinguish the encoder outputs from the samples coming from $p(z)$:

$$\mathcal{L}_{\mathrm{reg}} = -\mathbb{E}_{x}\left[\log D(E(x))\right]$$

Note that $p(z)$ can now be an arbitrary distribution, as long as we can sample from it.
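The two alternating phases can be illustrated with a toy numpy sketch; the `discriminator` stand-in and batch shapes below are illustrative placeholders, not the paper's architecture:

```python
import numpy as np

def bce(p, target):
    # binary cross-entropy of predicted probabilities p against a constant 0/1 target
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())

def discriminator(z):
    # stand-in for a learned network D: squashes a linear score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z.mean(axis=1)))

rng = np.random.default_rng(0)
prior_samples = rng.standard_normal((64, 100))   # z ~ p(z)
encoder_codes = rng.standard_normal((64, 100))   # stand-in for z = E(x)

# Discriminator step: prior samples are labelled real (1), encoder codes fake (0)
d_loss = bce(discriminator(prior_samples), 1.0) + bce(discriminator(encoder_codes), 0.0)
# Regularization step: the encoder is updated so that D outputs 1 on its codes
e_loss = bce(discriminator(encoder_codes), 1.0)
```

In practice the reconstruction step would also run between these two updates, and `discriminator` would be a trained network rather than a fixed function.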
4 Models
We will now describe the details of the models we examine in this work, our modifications to them, and our choice of prior distributions. In general, the models use an LSTM encoder and decoder with 512 units and 128-dimensional word embeddings; the latent code is 100-dimensional. The discriminator (if present) is a fully connected network with 3 layers of size 300.
VAE.
AAE.
For adversarial autoencoders, we experiment with two kinds of prior distributions: a standard Gaussian and a uniform distribution on the unit sphere (in Euclidean space).
In the case of a Gaussian prior, we use two types of posterior distributions: a diagonal Gaussian parameterized by the encoder, i.e. $q(z \mid x) = \mathcal{N}(\mu(x), \mathrm{diag}(\sigma^2(x)))$, and a deterministic posterior, where the encoder produces a single point $z$ for each input. We refer to the two resulting models as AAEgauss and AAEgaussdet, respectively.
In the spherical case (AAEsph), we normalize the output of the encoder to ensure that the aggregated posterior distribution is supported on the unit sphere. During training, we add Gaussian noise to the normalized embeddings before passing them to the decoder and the discriminator. The variance of this noise is either fixed or exponentially decayed over time.
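The normalization and noise-injection step can be sketched as follows; the noise scale `sigma0` and `decay` rate are illustrative values of our own, since the exact schedule is an experimental choice:

```python
import numpy as np

def to_unit_sphere(z, eps=1e-8):
    # project each embedding (row) onto the unit sphere via L2 normalization
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def noisy_code(z, step, sigma0=0.2, decay=0.9999, rng=None):
    # normalize, then add Gaussian noise whose scale decays exponentially with the step
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = sigma0 * decay**step
    return to_unit_sphere(z) + sigma * rng.standard_normal(z.shape)
```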
Unlike Makhzani et al. (2015), we combine the reconstruction and regularization phases into one training objective:

$$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda \, \mathcal{L}_{\mathrm{reg}}$$

where $\lambda$ weights the regularization term; we use a fixed value of $\lambda$ except where stated otherwise.
To further regularize the decoder, we apply dropout with a keep probability of 0.4 on the LSTM inputs and states (Gal and Ghahramani, 2016).
ARAE.
Adversarially regularized autoencoders (Zhao et al., 2017a) are similar to AAEs, but instead of imposing a prior distribution on the embeddings, they learn a flexible prior and employ adversarial training to match it to the aggregated posterior. We use the original ARAE implementation, modified to perform decoding from the mean of the posterior distribution (our fork of the code, based on a version from August 2017, can be found at https://git.io/fxQnz). We evaluate two ARAE configurations: the defaults used by Zhao et al. (2017a) and a modified setup with hyperparameters and training time matching our models.
Plain AE.
We also include a plain autoencoder, which is not endowed with a means of controlling the aggregated posterior. However, in order to be able to draw samples from the model, we still assume a prior distribution on the embeddings – either Gaussian (AEgaussdet) or spherical (AEsph). These autoencoders are equivalent to their adversarial counterparts with the adversarial weight $\lambda$ set to 0.
Note that while there is no explicit control over the embedding space of AEgaussdet, the outputs of the AEsph encoder are constrained to the unit sphere (although a uniform distribution is not enforced).
5 Experimental setup
We train and evaluate all models on a public corpus consisting of 200,000 sentence summaries extracted from news articles (https://git.io/fxQnR; Filippova and Altun, 2013). We perform unsupervised subword tokenization using SentencePiece (https://git.io/sentencepiece) with a vocabulary of 16,000 tokens.
The models are trained for 500,000 iterations using Adam (Kingma and Ba, 2014) with a batch size of 128 and a learning rate of . For AAEs, we use SGD with a learning rate of to update the discriminator.
Each model is evaluated in two different modes:

Sampling – as a pure unconditional generative model, drawing random samples from the prior distribution and using them to condition greedy decoding. This allows us to measure how well the model approximates the underlying data distribution.

Reconstruction – as an autoencoder, measuring the reconstruction quality. In this case, we first encode the input sentence, use the mean of the posterior distribution as the latent vector $z$, and then run greedy decoding.
We give examples of generated sentences in Appendix A.
5.1 Sampling evaluation
To evaluate an unconditional generative model for text, we would like to make sure that (a) the generated sentences are correct with respect to the language used in the training data, and (b) the generated sentences reflect the diversity of expressions in the training data, i.e. the model avoids mode collapse. In order to capture both requirements, we use a number of different evaluation metrics.
Cross entropy.
A natural way to evaluate a probabilistic model is cross entropy:

$$-\mathbb{E}_{x \sim p_{\mathrm{data}}}\left[\log p_{\mathrm{model}}(x)\right] \qquad (1)$$

Note, however, that $p_{\mathrm{model}}(x)$ is intractable for any given $x$. Following Zhao et al. (2017a), we approximate $p_{\mathrm{model}}$ using an RNN language model trained on 100,000 model samples; then, to obtain an estimate of Eq. (1), we evaluate this LM on the test set (note that Zhao et al. (2017a) use an equivalent metric, but refer to it as 'reverse perplexity'). We are also interested in 'reverse cross entropy', i.e. the expected negative log-probability of samples from the model with respect to the true data distribution:

$$-\mathbb{E}_{x \sim p_{\mathrm{model}}}\left[\log p_{\mathrm{data}}(x)\right] \qquad (2)$$

This can be thought of as a measure of plausibility (fluency) of the generative model's outputs. Again, $p_{\mathrm{data}}$ is unknown, but can be approximated using a language model. Therefore, to estimate Eq. (2), we score the samples from each model using a pretrained RNN LM. The model is trained on a large news corpus from the English Gigaword (https://catalog.ldc.upenn.edu/ldc2003t05).
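Once a scoring LM is available, both estimates reduce to an average negative log-probability over the scored corpus. A minimal sketch (the per-token normalization and the function name are our own choices, not specified in the paper):

```python
import math

def cross_entropy_per_token(log_probs):
    # Monte Carlo estimate of a cross entropy: mean negative log-probability per token.
    # `log_probs` is a list of sentences, each a list of token log-probabilities
    # assigned by the scoring language model.
    total_nll = sum(-lp for sent in log_probs for lp in sent)
    total_tokens = sum(len(sent) for sent in log_probs)
    return total_nll / total_tokens

# Forward CE (Eq. 1): LM trained on model samples, scored on the test set.
# Reverse CE (Eq. 2): LM trained on real data, scored on model samples.
```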
Fréchet distance.
In addition, motivated by a comprehensive study of various GAN models (Lucic et al., 2017), where the authors use the Fréchet Inception Distance (Heusel et al., 2017) extensively and demonstrate that it is superior to the Inception Score (Salimans et al., 2016), we experiment with an equivalent metric for text – the Fréchet InferSent Distance (FID) – to measure the distance between the generative distribution and the data distribution. FID measures the Wasserstein-2 distance (Vaserstein, 1969) between two Gaussians whose means and covariances are estimated from embeddings of the real and generated data (i.e. samples from the data distribution and from the model), respectively. To our knowledge, this is the first time this idea has been applied to evaluating generative models for text. Unlike the negative log-likelihood metrics discussed above, FID directly measures the distance between distributions, hence it offers an additional angle for comparing generative models whose goal is to recover the true data distribution.

We compute FID between 10,000 sentences generated from the model and 10,000 sentences taken from the test set, respectively. To obtain their embeddings, we use a pretrained general-purpose sentence embedding model, InferSent (Conneau et al., 2017), which encodes each sentence as a 4,096-dimensional vector. We chose InferSent for computing the FID metric on sentence samples because it has been shown to provide state-of-the-art results on various sentence representation tasks and is domain-independent to a large extent.
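Given two sets of sentence embeddings, the Fréchet distance itself is straightforward to compute; a sketch assuming numpy arrays of embeddings (`scipy.linalg.sqrtm` provides the matrix square root):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_emb, gen_emb):
    # Frechet (Wasserstein-2) distance between Gaussians fitted to two embedding sets;
    # rows are sentences, columns are embedding dimensions (4,096 for InferSent).
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Two identical embedding sets give a distance of (numerically) zero, while any shift in the mean or covariance increases it.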
5.2 Reconstruction evaluation
The metrics in the previous section quantify the quality and diversity of samples generated while conditioning the decoder on a sample from the prior $p(z)$. Another way to gauge the diversity of sentences the model can represent is to measure how accurately it can reconstruct a given input. We express the reconstruction error as negative log-likelihood (NLL) and as BLEU-3 and ROUGE-3 scores computed with the input sentence as a reference.
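A sentence-level BLEU-3 can be sketched in a few lines; this is a simplified single-reference version with add-one smoothing (the paper does not specify its smoothing scheme, so this variant is illustrative):

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # multiset of n-grams of length n in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu3(reference, hypothesis):
    # geometric mean of smoothed modified n-gram precisions (n = 1..3),
    # multiplied by the standard brevity penalty (single reference)
    log_prec = 0.0
    for n in (1, 2, 3):
        ref, hyp = ngram_counts(reference, n), ngram_counts(hypothesis, n)
        overlap = sum(min(count, ref[g]) for g, count in hyp.items())
        total = sum(hyp.values())
        log_prec += math.log((overlap + 1) / (total + 1))  # add-one smoothing
    bp = 1.0 if len(hypothesis) >= len(reference) else math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(log_prec / 3)
```

A perfect reconstruction scores 1.0; an unrelated output scores close to 0.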
5.3 Human evaluation
We also evaluate the models based on subjective human judgment, focusing on the two tasks mentioned above: sampling and reconstruction.
For sampling, we decoded sentences from random points in the embedding space and asked human raters to rate them on a 5point Likert scale according to their fluency, where 1 = gibberish, 3 = understandable but ungrammatical sentences, and 5 = naturally constructed and understandable sentences.
For reconstruction, we presented the raters with a sentence and its reconstruction produced by one of the models. Besides assessing fluency, the raters were asked to provide another score on a 5point Likert scale measuring how well the output reflects the original meaning (this score is referred to as relevance in the following). A score of 1 corresponds to an unrelated sentence, 3 to a reasonably good paraphrase, and 5 means either a perfect reconstruction or a semantically equivalent paraphrase.
The evaluation was done using a crowdsourced rating platform. For both tasks, we evaluated a sample of 200 sentences from each model, employing three raters per item. The results were calculated as an average of the median sentence ratings. For 84% of the items, there was a majority score, i.e. at least two of the three raters chose the same of the 5 possible scores for the item.
6 Results
Human evaluation of the reconstruction task (relevance and fluency):

Model  Relevance  Fluency

real data  —  4.42 
AAEgaussdet (, )  3.54  3.71 
ARAE  3.35  3.56 
AAEsph ()  2.76  3.53 
AEsph ()  2.54  3.53 
AAEsph ()  2.40  3.54 
AEsph (, )  1.73  2.33 
ARAE (default)  1.48  2.51 
VAE ()  1.39  3.87 
Human evaluation of the sampling task (fluency):

Model  Fluency

real data  4.42 
VAE ()  3.46 
AEsph ()  3.07 
AAEsph ()  2.83 
LM  2.69 
AAEsph ()  2.61 
ARAE (default)  2.08 
AEsph (, )  1.85 
ARAE  1.68 
AAEgaussdet (, )  1.53 
6.1 Quantitative evaluation
The results are shown in Table 1 (automatic evaluation) and Tables 2 and 3
(human evaluation). Results on samples from the training set and from an RNN LM are included for comparison. The RNN LM uses the same architecture and hyperparameters as the decoder of all the other models.
For the sampling task, one thing to notice is that there seems to be a trade-off between the quality and the diversity of the samples: models with a lower (i.e. better) reverse cross entropy and a higher fluency rating tend to have a higher (i.e. worse) forward cross entropy. In particular, the reverse cross entropy of some models (VAE and some sph models) is lower than that of the real data – a clear sign that these models suffer from mode collapse. This is supported by the fact that they also tend to perform worse on reconstruction, which suggests that the set of sentences they are able to encode is less diverse.
Another important observation is that plain autoencoders with the spherical prior (AEsph) achieve relatively good results, on par with their adversarial counterparts (AAEsph). This suggests that the techniques applied in these models – constraining the embeddings to lie on a unit sphere and injecting noise – are sufficient for making the model learn to cover the sphere uniformly and be able to decode sentences from any given point on the sphere. The adversarial training seems to have little additional effect, if any at all.
In particular, AEsph performs at least as well on sampling as all of the other model types we evaluated:

It achieves a superior forward cross entropy.

Its FID is only slightly higher than for VAE (), which achieves the lowest (i.e. best) value.

Although its reverse cross entropy is still below the real data threshold, it is higher than for VAEs, hence it arguably suffers less from the mode collapse problem.

It achieves a higher fluency score than an LM and is only surpassed by the VAE.

Finally, it outperforms VAEs on the reconstruction task by a large margin.
The effect of adversarial training on the Gaussian prior model (AAEgaussdet) seems to be more pronounced than in the spherical prior models – this is unsurprising as the nonadversarial variant (AEgaussdet) doesn’t place any restrictions on the aggregated posterior, and therefore cannot be expected to be useful as a generative model. However, AAEgaussdet still has poor performance on sampling according to both automatic and human evaluation.
Regarding ARAE, it outperforms all other methods on almost all reconstruction metrics, but its results on sampling are rather poor, especially according to human ratings. This might be due to a more challenging dataset than in Zhao et al. (2017a), or simply because of the model’s high sensitivity to hyperparameters, which is noted by the authors.
6.2 Embedding visualization
Fig. 1 shows 2D t-SNE (van der Maaten and Hinton, 2008) projections of the encodings of ten random sentences from the test set. Each sentence has been encoded one hundred times with sampling from the posterior, then plotted with some additional noise in order to better visualize collapsed points.
Plain AEgaussdet is deterministic, and each sentence is mapped to the same point all 100 times. This leads to very high-quality reconstruction, but the embedding space is not smooth, and sampling from random points in the prior would often produce unreadable outputs. Plain VAE exhibits the opposite behaviour: all 10 inputs are encoded into large, heavily overlapping regions of the embedding space. This hints at why this model performs poorly for reconstruction yet samples with very good quality from any random point in the embedding space.
Finally AAEsph and AEsph display similar behaviours, with sentences mapped into smooth regions in the space without significant overlap in the projections.
While not a quantitative study by itself, the plots are consistent with the observed results for sampling and reconstruction described above.
7 Conclusions
We introduced a rigorous evaluation scheme for generative models for text. In addition to previously proposed metrics, we proposed the Fréchet InferSent Distance, adopted from the field of image generation.
Three families of generative models (plain, variational, and adversarially regularized autoencoders) have been thoroughly compared under different regularization strategies. The qualitative evaluation shows that no model outperforms the others under all circumstances, with the VAE being the strongest for sampling but suffering from mode collapse and poor reconstruction performance. The remaining models represent compromises between good sampling and reconstruction, and as we have demonstrated, the trade-off between the two can be controlled using simple regularization techniques.
References
 Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. CoRR, abs/1710.11041.
 Bousquet et al. (2017) Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, CarlJohann SimonGabriel, and Bernhard Schoelkopf. 2017. From optimal transport to generative modeling: the VEGAN cookbook.
 Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.
 Chen et al. (2016) Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. Variational lossy autoencoder. CoRR, abs/1611.02731.
 Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
 Dai and Le (2015) Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised sequence learning. CoRR, abs/1511.01432.
 Filippova and Altun (2013) Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In EMNLP.

 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems 29, pages 1019–1027. Curran Associates, Inc.
 Guu et al. (2017) Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2017. Generating sentences by editing prototypes. CoRR, abs/1709.08878.
 Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. 2017. GANs trained by a two timescale update rule converge to a local Nash equilibrium.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
 Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P. Xing. 2017. Toward controlled generation of text. In ICML.
 Kalchbrenner et al. (2016) Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. CoRR, abs/1610.10099.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
 Kingma and Welling (2013) Diederik P. Kingma and Max Welling. 2013. Autoencoding variational Bayes. CoRR, abs/1312.6114.
 Lample et al. (2017) Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. CoRR, abs/1711.00043.
 Lucic et al. (2017) Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. 2017. Are GANs created equal? A large-scale study. CoRR, abs/1711.10337.

 van der Maaten and Hinton (2008) Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.
 Makhzani et al. (2015) Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. 2015. Adversarial autoencoders. CoRR, abs/1511.05644.
 Reiter and Dale (2000) Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge, U.K.: Cambridge University Press.
 Salimans et al. (2016) Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved techniques for training GANs. CoRR, abs/1606.03498.
 Semeniuta et al. (2017) Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A hybrid convolutional variational autoencoder for text generation. In EMNLP.
 Shoemake (1985) Ken Shoemake. 1985. Animating rotation with quaternion curves. In SIGGRAPH.
 Vaserstein (1969) Leonid Nisonovich Vaserstein. 1969. Markov processes over denumerable products of spaces, describing large systems of automata. Problemy Peredachi Informatsii, 5(3):64–72.
 Yang et al. (2017) Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, and Taylor BergKirkpatrick. 2017. Improved variational autoencoders for text modeling using dilated convolutions. In ICML.
 Zhang et al. (2016) Biao Zhang, Deyi Xiong, and Jinsong Su. 2016. Variational neural machine translation. CoRR, abs/1605.07869.
 Zhao et al. (2017a) Junbo Jake Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, and Yann LeCun. 2017a. Adversarially regularized autoencoders for generating discrete structures. CoRR, abs/1706.04223.
 Zhao et al. (2017b) Tiancheng Zhao, Ran Zhao, and Maxine Eskénazi. 2017b. Learning discourselevel diversity for neural dialog models using conditional variational autoencoders. In ACL.
Appendix A Example outputs
Fig. 2 shows a random sample of the sentences generated from various models. In addition, we show the outputs obtained by encoding two sentences from the test set and using spherical linear interpolation (Slerp; Shoemake, 1985) to generate sentences 'in between'.

VAE ()
A man who was shot while walking through a tree.
A man has died after falling from a bridge.
A man has died after falling into ice.
A man was shot dead in his home.
A man was shot during a fight.
A man was stuck on the face.
Barcelona have made a winning start.
Barcelona have made a winning start.
Barcelona will face a winning start.
Barcelona will face a winning start.
AAEsph ()
A Doncaster man suffered injury after a collision with a train.
A Doncaster man suffered injury after a crash was shot.
A Doncaster man is facing charges after a police officer.
A Doncaster man faces charges after a domestic assault.
A Doncaster man faces charges after he was shot.
Barcelona will appeal the suspension of heavy losses.
Barcelona will appeal the suspension of suspension.
Barcelona will appeal the suspension of suspension.
Barcelona will appeal the transfer ban.
Barcelona will appeal the transfer ban.
AEsph ()
A man suffered life after suffering a tree.
A man suffered life after suffering a car.
A man suffered life after suffering a car.
A man suffered life after suffering a car.
A man suffered injury after a car crash.
A man will not leave the road.
Barcelona will not appeal the transfer.
Barcelona will appeal the transfer ban.
Barcelona will appeal the transfer ban.
Barcelona will appeal the transfer ban.
VAE ()
A woman is accused of stealing a trio of fireworks.
A woman was arrested for stealing a laptop to charity.
A woman was found guilty of stealing from her home.
A woman has been charged with a string of burglaries.
A woman has been charged with a string of burglaries.
A school principal was arrested for a string of burglaries.
A school principal has been convicted of a federal tax scheme.
The Australian Open has been voted as a new Limerick person.
The Australian Open has been launched in a new reality show.
The Humane Society has been awarded a new distribution centre.
AAEsph ()
A woman is accused of leaving her children at home.
A woman is accused of leaving her children at home.
A woman is accused of leaving her children at a home.
A woman is accused of leaving a children at her home.
A woman has been accused of leaving a children’s home.
A woman has been charged with a student at a school bus.
The Union has been named a new national national school.
The Union has been named a national national national national site.
The Union has been named a national national national national site.
The Union has been named a national national national national site.
AEsph ()
A woman is accused of leaving her home in home.
A woman is accused of leaving her home in her home.
A woman is accused of leaving her home in her home.
A woman is accused of leaving her children in her home.
A woman has been accused of vandalizing her children.
A woman has been accused of vandalizing her child.
The Union has declared a heritage partnership with the city.
The Union has been declared a heritage site.
The Union has been declared a heritage site.
The Union has been declared a heritage site.
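The spherical linear interpolation used to produce the sequences above can be sketched in numpy as follows; this is a minimal version of Shoemake's formula, in which intermediate points stay on the unit sphere, matching the spherical prior:

```python
import numpy as np

def slerp(a, b, t):
    # spherical linear interpolation between embeddings a and b (Shoemake, 1985);
    # inputs are normalized first so the interpolation path stays on the unit sphere
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
    if omega < 1e-8:  # nearly parallel vectors: fall back to linear interpolation
        return (1.0 - t) * a + t * b
    return (np.sin((1.0 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
```

Decoding `slerp(z1, z2, t)` for a grid of `t` values in [0, 1] produces the intermediate sentences shown above.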