Latent Space Secrets of Denoising Text-Autoencoders

05/29/2019 ∙ by Tianxiao Shen, et al. ∙ MIT, Amazon

While neural language models have recently demonstrated impressive performance in unconditional text generation, controllable generation and manipulation of text remain challenging. Latent variable generative models provide a natural approach for control, but their application to text has proven more difficult than to images. Models such as variational autoencoders may suffer from posterior collapse or learning an irregular latent geometry. We propose to instead employ adversarial autoencoders (AAEs) and add local perturbations by randomly replacing/removing words from input sentences during training. Within the prior enforced by the adversary, structured perturbations in the data space begin to carve and organize the latent space. Theoretically, we prove that perturbations encourage similar sentences to map to similar latent representations. Experimentally, we investigate the trade-off between text-generation and autoencoder-reconstruction capabilities. Our straightforward approach significantly improves over regular AAEs as well as other autoencoders, and enables altering the tense/sentiment of sentences through simple addition of a fixed vector offset to their latent representation.


1 Introduction

Neural language models trained with massive datasets have shown impressive performance in generating realistic text that can be hard to distinguish from human writing (Radford et al., 2019). Still, controllable generation and manipulation of text remain difficult (Hu et al., 2017). While this can, in principle, be done by mapping text to continuous representations where desired modifications are enacted via real-valued arithmetic operations, such an approach has not yet proven successful, partly due to the challenge of molding a meaningful latent space geometry for discrete text data.

A popular latent variable generative model for text is the variational autoencoder (VAE) (Kingma and Welling, 2014; Bowman et al., 2016). Unfortunately, this method suffers from the posterior collapse problem, where the latent representation is entirely ignored when the decoder is a powerful autoregressive model such as an RNN (Bowman et al., 2016; Chen et al., 2016). Techniques such as KL-weight annealing or weakening the decoder have struggled to inject significant content into the latent code (Yang et al., 2017; Kim et al., 2018), and alternatives like the β-VAE (Higgins et al., 2017) with a small KL coefficient appear necessary. The VAE can be explicitly encouraged to utilize its latent code via an additional mutual information objective (Zhao et al., 2017), bringing the approach closer to adversarial autoencoders (AAEs) (Makhzani et al., 2015). Circumventing the issue of collapse, the AAE makes it possible to perform manipulations in the latent space that induce changes in the data space (Shen et al., 2017; Zhao et al., 2018). However, we have found that its latent space can be highly non-smooth and irregular, resulting in poor-quality generations from prior samples.

In this paper, we extend AAEs to make them substantially more effective for text generation and manipulation. Perhaps surprisingly, this is possible by augmenting AAEs with a simple denoising objective, where original sentences are reconstructed from perturbations of random word replacement/removal (Lample et al., 2017; Artetxe et al., 2017). Similar denoising autoencoders (DAEs) have been introduced before (Vincent et al., 2008) and adapted to image modeling (Creswell and Bharath, 2018). Here, we demonstrate that the perturbations greatly improve the performance of AAEs for text modeling, both theoretically and empirically. While a basic AAE can learn an arbitrary mapping from data to latent variables, introducing perturbations that reflect local structures in the data space can help better organize the latent space. We prove that similar sentences are encouraged to map to similar latent representations as a result.

We systematically evaluate various text autoencoders in terms of their generation and reconstruction capabilities (Cífka et al., 2018). The results demonstrate that our proposed model provides the best trade-off between producing high-quality text vs. informative sentence representations. We further investigate how well text can be manipulated by applying simple transformations in the learned latent space. Our model is able to reasonably perform sentence-level vector arithmetic without any training supervision (Mikolov et al., 2013). It also produces higher-quality sentence interpolations than other text autoencoders, suggesting better linguistic continuity in its latent space (Bowman et al., 2016).

2 Method

Figure 1: Illustration of the latent geometry learned by the AAE before and after introducing perturbations. With high-capacity encoder/decoder networks, a standard AAE has no preference over x–z couplings and thus can learn a random mapping between them (Left). Trained with local perturbations C, the AAE learns to map similar x to close z to best achieve the denoising objective (Right).

Define X to be a space of sequences of discrete symbols from a vocabulary V (with some maximum length); also define Z to be a continuous latent space. Our goal is to learn a mapping between the data distribution p_data(x) over X and a given prior distribution p(z) over the latent space Z (a Gaussian prior is used in the experiments of this work). Such a mapping allows us to easily manipulate discrete data x through continuous latent representations z, and provides a generative model where samples from X can be obtained by first drawing z from the prior and then mapping it to the data space.

We adopt the adversarial autoencoder (AAE) framework, which involves a deterministic encoder E mapping from the data space to the latent space, a probabilistic decoder G that generates a sequence x from a latent code z and evaluates its likelihood using the parameterized distribution p_G(x|z), and a discriminator D trying to distinguish the encodings E(x) from the prior p(z). Both E and G are recurrent neural networks (RNNs) in this work, although other sequence models (Dehghani et al., 2019) could be employed as well. E takes an input sequence x and outputs the last RNN hidden state as its encoding z = E(x). G generates a sequence autoregressively, with each step conditioned on z and the previously generated symbols. The discriminator D is a feed-forward network with a sigmoid output layer that estimates the probability of a latent vector coming from the prior rather than from the encoder.

Apart from the usual AAE, we introduce perturbations in the data space X to learn smoother representations that reflect structure in the data. Given a perturbation process C(x̃|x) that stochastically maps x to a nearby x̃, let x̃ ∼ C(·|x) denote a perturbed input and E(x̃) its encoding. We optimize the following objective:

(1)   min_{E,G} max_D   L_rec(E, G) + λ L_adv(E, D)
(2)   L_rec(E, G) = E_{x∼p_data, x̃∼C(·|x)} [ −log p_G(x | E(x̃)) ]
(3)   L_adv(E, D) = E_{z∼p(z)} [ log D(z) ] + E_{x∼p_data, x̃∼C(·|x)} [ log(1 − D(E(x̃))) ]

Here, L_rec is the loss of reconstructing x from x̃, L_adv is the adversarial loss evaluated on the perturbed x, and λ is a hyperparameter weighting the two terms. (In practice, we train the encoder with the non-saturating loss, maximizing log D(E(x̃)) instead of minimizing log(1 − D(E(x̃))), which turns out to be more stable (Goodfellow et al., 2014). We also tried the WGAN objective (Arjovsky et al., 2017) but did not observe much difference.)

The objective function combines the denoising technique with the AAE (Vincent et al., 2008; Creswell and Bharath, 2018). When C is the identity (i.e., there is no perturbation), the above simply becomes the usual AAE objective. In the next section, we provide a theoretical analysis of the AAE with perturbed x and show that it enjoys better properties than the version without input perturbations.
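To make the objective concrete, here is a minimal PyTorch sketch of how the losses in (1)–(3) could be computed for one batch. The module interfaces (`encoder`, `decoder`, `discriminator`) and the `perturb` function are illustrative assumptions rather than the authors' released implementation; architectural details are discussed in Appendix D.

```python
import torch
import torch.nn.functional as F

def daae_losses(encoder, decoder, discriminator, perturb, x, lam=10.0):
    """One-batch sketch of the denoising AAE losses (Eqs. 1-3).

    x: LongTensor of token ids, shape (batch, seq_len).
    `decoder(z, x)` is assumed to return per-step vocabulary logits under
    teacher forcing on the clean target x (input/target shifting omitted).
    """
    x_tilde = perturb(x)                         # x~ ~ C(.|x), e.g. random word masking
    z = encoder(x_tilde)                         # deterministic encoding E(x~)

    # Reconstruction loss (Eq. 2): -log p_G(x | E(x~)) on the clean sentence x.
    logits = decoder(z, x)                       # (batch, seq_len, vocab)
    rec_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), x.reshape(-1))

    # Discriminator loss (Eq. 3): tell prior samples apart from encodings.
    z_prior = torch.randn_like(z)                # z ~ p(z), Gaussian prior
    disc_loss = -(torch.log(discriminator(z_prior) + 1e-8).mean()
                  + torch.log(1 - discriminator(z.detach()) + 1e-8).mean())

    # Non-saturating encoder term: maximize log D(E(x~)) to fool the discriminator.
    adv_loss = -torch.log(discriminator(z) + 1e-8).mean()

    return rec_loss + lam * adv_loss, disc_loss
```

The first returned value would be minimized with respect to the encoder/decoder parameters and the second with respect to the discriminator, alternating the two updates as in standard adversarial training.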

3 Theoretical Analysis

3.1 Posterior Properties with Perturbations

Tolstikhin et al. (2017) previously connected the AAE objective to a relaxed form of the Wasserstein distance between the model and data distributions. Specifically, for a cost function c(x, y) and a deterministic decoder mapping G : Z → X:

(4)   W_c(p_data, p_G) = inf_{Γ ∈ P(p_data, p_G)}  E_{(x, y) ∼ Γ} [ c(x, y) ]

where the minimization over couplings Γ with marginals p_data and p_G can be replaced with minimization over conditional distributions q(z|x) whose marginal q(z) = E_{p_data(x)}[q(z|x)] matches the latent space prior p(z). Relaxing this marginal constraint via a divergence penalty estimated by adversarial training, one recovers the AAE objective. In particular, the AAE on discrete x with the cross-entropy loss minimizes an upper bound of the total variation distance between p_data and p_G, with c chosen as the indicator cost function c(x, y) = 1[x ≠ y] (Zhao et al., 2018).

For the AAE with perturbation process C(x̃|x), we define:

(5)   q(z|x) = Σ_{x̃} C(x̃|x) · δ_{E(x̃)}(z)

Our model thus optimizes over conditional distributions of the form (5), a subset of all possible conditional distributions. Hence, after introducing input perturbations, our method still minimizes an upper bound of the Wasserstein distance between p_data and p_G described in (4).

Let us now examine more closely how perturbations affect the model. Expression (5) shows they enable the use of stochastic encodings even though our model merely employs a deterministic encoder network trained without any reparameterization-style tricks. Assume that x can always be preserved with positive probability, i.e. C(x|x) > 0. When the supports of C(·|x_i) and C(·|x_j) do not overlap for different training examples x_i ≠ x_j, the encoder can learn to assign disjoint latent regions to them, and we are back to the unconstrained posterior scenario. If C(·|x_i) and C(·|x_j) intersect, then the latent posteriors of x_i and x_j will have overlapping components at the encodings of their shared perturbations. For example, if C(x̃|x) assigns high probability to x̃ that lies close to x (based on some metric over X), then for similar x_i and x_j, the high-probability overlap between their perturbations will inherently force their posteriors close together in the latent space. This is desirable for learning good representations and is not guaranteed by merely minimizing a statistical divergence between the aggregated posterior and the prior. In the next subsection, we formally prove how perturbations help better structure the latent space (all proofs of our theorems are relegated to the Appendix).
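As a toy illustration of this overlap (our own example, not from the paper), the snippet below enumerates single-word-mask perturbations of two sentences that differ in one word: they share a perturbed version, so the induced posteriors in (5) must place mass on the same encoding, whereas a dissimilar sentence shares none.

```python
# Two similar sentences share masked variants; a dissimilar one does not.
MASK = "<mask>"

def single_mask_perturbations(sentence):
    """All variants of the sentence with exactly one word masked."""
    words = sentence.split()
    return {" ".join(words[:i] + [MASK] + words[i + 1:]) for i in range(len(words))}

p1 = single_mask_perturbations("the service was excellent as always")
p2 = single_mask_perturbations("the service was terrible as always")
p3 = single_mask_perturbations("i will definitely be back soon")

print(len(p1 & p2))  # 1 shared perturbation: "the service was <mask> as always"
print(len(p1 & p3))  # 0 shared perturbations
```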

3.2 Latent Space Geometry

Following prior analysis of language decoders (Mueller et al., 2017), we assume a powerful decoder G that can approximate arbitrary conditional distributions p_G(x|z), so long as it remains sufficiently Lipschitz continuous in z.

Assumption 1.

There exists L > 0 such that all decoder models G obtainable via training satisfy, for all x ∈ X and z_1, z_2 ∈ Z: | log p_G(x | z_1) − log p_G(x | z_2) | ≤ L ‖z_1 − z_2‖.

When G is implemented as an RNN language model, log p_G(x|z) will remain Lipschitz in its continuous input z if the RNN weight matrices have bounded norm. This property is naturally encouraged by popular training methods that utilize SGD with early stopping and regularization (Zhang et al., 2017). Note that we have not assumed log p_G(x|z) is Lipschitz in x, which would be unreasonable since x stands for discrete text, and the decoder likelihood can vary drastically when a few symbols fed to the RNN cell change (e.g., p_G may assign a much higher probability to a grammatically valid sentence than to an invalid one that differs by only one word). Our discussion is directed at the nature of such families of log-likelihood functions with a continuous variable z and a discrete variable x.

Our analysis presumes an effectively trained discriminator that succeeds in ensuring that the latent encodings resemble samples from the prior. For simplicity, we thus directly assume that z_1, …, z_n are actual samples from p(z) which are given a priori. Here, the task of the encoder is to map the n given unique training examples x_1, …, x_n to the given latent points, and the goal of the decoder is to maximize the reconstruction log-likelihood under the encoder mapping (cf. Eq. 2). The analysis aims to highlight differences between optimal encoder/decoder solutions under the AAE objective with or without perturbations. Throughout, we assume the encoder is a universal function approximator capable of producing any possible mapping from each x_i to a unique z_i. Likewise, the decoder can approximate arbitrary p_G(x|z), subject only to the Lipschitz constraint. Let F denote the set of possible decoder models subject to Assumption 1 with Lipschitz constant L, and σ denote the sigmoid function.

Theorem 1.

For any one-to-one encoder mapping from {x_1, …, x_n} to {z_1, …, z_n}, the optimal value of the objective max_{G ∈ F} Σ_{i=1}^{n} log p_G(x_i | z_i) is the same.

Intuitively, this result stems from the fact that the model receives no information about the structure of X, and x_1, …, x_n are simply provided as distinct symbols. Hence the AAE offers no preference over x–z couplings, and a random matching in which the z_i do not reflect any data structure is as good as any other matching (Figure 1, Left). Latent point assignments start to differentiate, however, once we introduce local input perturbations.

To elucidate how perturbations affect latent space geometry, it helps to first consider a simple setting with only four examples x_1, x_2, x_3, x_4. Again, we consider four given latent points z_1, z_2, z_3, z_4 sampled from p(z), and the encoder/decoder are tasked with learning which x to match with which z. As depicted in Figure 1, suppose there are two pairs of x closer together and also two pairs of z closer together. More precisely, under a distance metric d over X, the examples satisfy, for some 0 < δ < Δ: d(x_1, x_2) ≤ δ, d(x_3, x_4) ≤ δ, and d(x_i, x_j) ≥ Δ for all other pairs. In addition, the latent points satisfy, for some 0 < δ′ < Δ′: two pairs lie within distance δ′ of each other, and all other pairs are at least Δ′ apart. We have the following conclusion:

Theorem 2.

Suppose our perturbation process C reflects local geometry, with C(x_j | x_i) > 0 if d(x_i, x_j) ≤ δ and C(x_j | x_i) = 0 otherwise. Then, for δ′ small enough and Δ′ large enough relative to the Lipschitz constant L, the perturbation objective achieves its largest value when the encoder maps close pairs of x to close pairs of z.

This entails that the AAE with perturbed x will always prefer to map similar x to similar z. Note that Theorem 1 still applies here, and the regular AAE will not prefer any particular pairing over the other possibilities. We next generalize beyond the basic four-point scenario to consider n examples of x that are clustered. Here, we can ask whether this cluster organization will also be reflected in the latent space of an AAE trained with local input perturbations.

Theorem 3.

Suppose x_1, …, x_n are divided into clusters of equal size, with c(i) denoting the cluster index of x_i. Let the perturbation process C be uniform within clusters, i.e. C(x_j | x_i) is constant if c(j) = c(i) and zero otherwise. For an encoder mapping from {x_1, …, x_n} to {z_1, …, z_n}, the perturbation objective is upper bounded by a quantity that grows as encodings of examples within the same cluster move closer together and encodings of different clusters move farther apart (the bound is derived in Appendix C).

Theorem 3 provides an upper bound on the achievable log-likelihood objective value for a particular x–z mapping. This achievable limit is substantially better when examples in the same cluster are mapped to latent points that are well-separated from the encodings of other clusters. In other words, by preserving input-space cluster structure in the latent space, the AAE with perturbed x can achieve better objective values and is thus incentivized to learn an encoder/decoder that behaves in this manner. An analogous corollary can be shown for the case where examples are perturbed to yield additional inputs not present in the training data. In this case, the model would aim to map each example and its perturbations as a group to a compact group of points well-separated from other groups in the latent space.

4 Related Work

Vincent et al. (2008) first used input perturbations to improve autoencoder representations. However, their DAE requires sophisticated MCMC sampling to be employed generatively (Bengio et al., 2013). Im et al. (2017) later proposed a VAE with input perturbations, but their model remains prone to posterior collapse due to the per-example KL penalty. While the β-VAE can trade the KL penalty for reconstruction improvements, the resulting aggregated posterior may not match the prior, leading to poor generative performance. In contrast, the adversarial prior enforcement in our AAE poses a global constraint over all training examples and does not as severely affect individual reconstructions.

As an alternative to our proposed perturbations in the data space, Rubenstein et al. (2018) suggest AAEs may be improved through Gaussian perturbations in the latent space. They argue that deterministic encoding in the AAE may induce suboptimal latent geometry, particularly if the adversarial prior causes the encoder to act as a space-filling curve. Rubenstein et al. (2018) demonstrate that stochastic encodings can help avoid this issue, but they have to enforce an additional penalty on the Gaussian log-variance to prevent their latent perturbations from vanishing. Crucially, our use of input perturbations enables us to obtain stochastic sentence representations without parametric restrictions like Gaussianity or the excessive training variance/instability associated with learning nondeterministic encoder models (Roeder et al., 2017).

Previous work on controllable text generation has employed the standard AE, the β-VAE, as well as the AAE trained with attribute label information (Hu et al., 2017; Shen et al., 2017; Zhao et al., 2018; Logeswaran et al., 2018; Subramanian et al., 2018). Our proposed model can perform text manipulation without any training labels. Moreover, it can be utilized as a superior base autoencoder model when additional supervision signals are available.

5 Experiments

Datasets   We evaluate various text autoencoders, including our proposed model, on two text corpora: Yelp reviews and Yahoo answers. The Yelp dataset has millions of reviews, which we segment into individual sentences. We then sample 200K/10K/10K sentences with length less than 16 words as train/dev/test sets. The vocabulary size is 10K after replacing words with fewer than 5 occurrences by an “<unk>” token. Our second dataset is based on a subset of Yahoo answers from Yang et al. (2017). We again perform sentence segmentation and eliminate sentences whose length exceeds 30 words. The resulting dataset has 463K/46K/47K sentences for train/dev/test sets, with a vocabulary size of 20K.
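For concreteness, here is a minimal sketch of the vocabulary preprocessing described above (our own illustration; whitespace tokenization and the exact threshold handling are assumptions):

```python
from collections import Counter

def build_vocab(sentences, min_count=5):
    """Keep words occurring at least `min_count` times; everything else -> <unk>."""
    counts = Counter(w for s in sentences for w in s.split())
    return {w for w, c in counts.items() if c >= min_count}

def apply_vocab(sentence, vocab, unk="<unk>"):
    return " ".join(w if w in vocab else unk for w in sentence.split())
```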

Perturbation Process   We randomly mask each word with probability p. This way, perturbations of sentences with more words in common will have larger overlap. We also tried removing each word or replacing it with a random word from the vocabulary, and found that these variants perform similarly. We leave it to future work to explore more sophisticated text perturbations.
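A minimal sketch of this word-masking perturbation C(x̃|x); the mask token name and the default probability are illustrative choices rather than details from the paper:

```python
import random

def mask_words(sentence, p=0.2, mask_token="<mask>", rng=random):
    """Independently replace each word with a mask token with probability p."""
    return " ".join(mask_token if rng.random() < p else w for w in sentence.split())

print(mask_words("the staff was friendly and the food was great"))
```

Sentences that share many words also share many likely masked versions, which is the property exploited in Section 3.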

Baselines and Experimental Details   We compare five alternative text autoencoders with our proposed model: the adversarially regularized autoencoder (ARAE; Zhao et al., 2018), where the prior is implicitly transformed from a latent code generator; the β-VAE (Higgins et al., 2017); the AAE (Makhzani et al., 2015); the AAE with perturbed z (Rubenstein et al., 2018); and a purely reconstruction-focused autoencoder (AE). Descriptions of hyperparameters and the training regime are detailed in Appendix D.

5.1 Generation-Reconstruction Trade-off

Figure 2: Generation-reconstruction trade-off of different text autoencoders on the Yelp dataset. The KL coefficient β of the β-VAE is swept over a range of values. The log-variance penalty weight of the AAE with perturbed z is 0.01, 0.05, 0.1, or 0.2. The word mask probability of the AAE with perturbed x ranges from 0.1 to 1. The “real data” dotted line marks the perplexity of a language model trained and evaluated on real data. In the BLEU–reverse PPL plot (Right), we removed points of severe collapse that have huge reverse PPL (>200) arising from extreme parameter settings.

In this section, we evaluate the latent variable generative models in terms of both generation quality and reconstruction accuracy. A strong generative model should not only generate high quality sentences from prior samples, but also learn useful latent variables that capture significant data content. Only when both requirements are met can we successfully manipulate sentences by modifying their latent representation (in order to produce valid output sentences that remain faithful to the input).

We compute BLEU (Papineni et al., 2002) between input sentences and reconstructed sentences to measure reconstruction accuracy. To quantitatively evaluate the quality of generated sentences, we adopt two model-based evaluation metrics: PPL and reverse PPL (Zhao et al., 2018). PPL is the perplexity of a language model trained on real data and evaluated on generated data. It measures the fluency of the generated text, but cannot detect the collapsed case where the model repeatedly generates a few common sentences. Reverse PPL is the perplexity of a language model trained on generated data and evaluated on real data. It takes into account both the fluency and the diversity of the generated text: if a model generates only a few common sentences, a language model trained on its output will exhibit poor PPL on real data.
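As a small sketch of the reconstruction-BLEU computation (our own illustration using NLTK, not the paper's evaluation code; whitespace tokenization is an assumption):

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def reconstruction_bleu(inputs, reconstructions):
    """Corpus BLEU between input sentences and their reconstructions."""
    references = [[s.split()] for s in inputs]   # one reference per hypothesis
    hypotheses = [s.split() for s in reconstructions]
    return corpus_bleu(references, hypotheses,
                       smoothing_function=SmoothingFunction().method1)
```

Reverse PPL follows the mirrored recipe: train a language model on samples generated from the prior, then measure its perplexity on held-out real sentences.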

Figure 2 plots the results of different models on the Yelp dataset (see Figure E.1 in the Appendix for Yahoo results). The x-axis is reconstruction BLEU (higher is better); the y-axis is PPL/reverse PPL (lower is better). The bottom right corner represents an ideal situation where high reconstruction BLEU and low PPL/reverse PPL are achieved at the same time. For models with tunable hyperparameters, we sweep the full spectrum of their generation-reconstruction trade-off by varying the KL coefficient β of the β-VAE, the log-variance penalty of the AAE with perturbed z, and the word mask probability of the AAE with perturbed x. The BLEU–PPL plots show that as the degree of regularization/perturbation increases, the generated samples become more fluent (lower PPL) but reconstruction accuracy drops (lower BLEU). The AAE with perturbed x provides a strictly better trade-off than the β-VAE and the AAE with perturbed z, both of which have similar performance. This implies that introducing perturbations in the data space is superior to perturbations in the latent space: the latter are often limited to be Gaussian for tractability, whereas the former may be mapped to any desired latent distribution by our neural encoder. ARAE falls on or above the curve of the β-VAE and the AAE with perturbed z, revealing that it does not fare better than these methods. The basic AAE has extremely high PPL, indicating that the text it produces is of low quality.

Figure 2 (Right) shows that reverse PPL first drops and then rises as we increase the degree of regularization/perturbation. This is because when z encodes little information, generations from prior samples lack enough diversity to cover the real data. Again, the AAE with perturbed x dominates the other baselines, which tend to have higher reverse PPL and lower reconstruction BLEU.

In subsequent experiments, we use settings of the β-VAE's KL coefficient, the log-variance penalty of the AAE with perturbed z, and the word mask probability of the AAE with perturbed x that achieve fairly high reconstruction accuracy, which is essential for text manipulation.

5.2 Vector Arithmetic

Model               ACC    BLEU   PPL
AE                  26.4   61.5   49.1
β-VAE               42.2   45.7   53.2
ARAE                28.5   19.3   46.7
AAE                 16.7   67.0   48.9
AAE w/ perturbed z  41.8   48.1   54.7
AAE w/ perturbed x  44.5   56.5   39.7

n/a   both good   both bad   >    <
20    14          27         26   13

Table 1: Above: automatic evaluations of vector arithmetic for tense inversion. Below: human evaluation statistics of our model vs. the β-VAE (“>”: ours is better, “<”: the β-VAE is better).

Model                               ACC    BLEU   PPL
Shen et al. (2017)                  81.7   12.4   38.4
AAE, offset ±v                       7.7   78.3   39.5
AAE, larger offset                  39.1   30.4   107.5
AAE, largest offset                 73.5    6.6   289.7
AAE w/ perturbed x, offset ±v       10.0   73.1   33.6
AAE w/ perturbed x, larger offset   50.2   31.3   58.3
AAE w/ perturbed x, largest offset  90.7    6.8   129.9

Table 2: Automatic evaluations of vector arithmetic for sentiment transfer; successive rows under each autoencoder correspond to increasingly scaled offsets of the sentiment vector. Accuracy is measured by a sentiment classifier. Shen et al. (2017) is specifically tailored for sentiment transfer, while our text autoencoders are not.

Input               and more unbelievably the pizza served was missing a portion . “ the oven ate it ” - according to the waitress .
AE                  and more importantly the pizza served was missing a portion . “ “ skinny food ” ’s prices to replace the waitress .
β-VAE               or some … the mediterranean it was missing a bowl . “ love the sandwich that happens is ( my waitress .
ARAE                and more pizza there is large portion of the greek steak . “ the corned beef is better pizza ” was the best .
AAE                 and more how the pizza served was missing a portion . “ holy station out it ” - according to the waitress .
AAE w/ perturbed z  - even support the food rings was missing a portion . “ the wall ate it ” - according to the waitress .
AAE w/ perturbed x  and more importantly the pizza served is missing a lot . “ the saucer goes it ” : bring to the waitress .
Input               they have a nice selection of stuff and the prices seem just about right . husband loves the thin crust pizza .
AE                  they have a nice selection of stuff and the prices seem just about right . husband loves the thin crust pizza .
β-VAE               they have a nice selection of stuff and the prices were just right about . husband ordered the thin crust pizza .
ARAE                they have a nice selection of beers and i get no <unk> products . always the baked goods are very salty .
AAE                 they have a nice selection of stuff and the prices seemed just about right . husband loves the thin crust pizza .
AAE w/ perturbed z  they had a nice selection of stuff and the prices did nothing like right . husband ordered the curry crust pizza .
AAE w/ perturbed x  they had a nice selection of stuff and the prices seemed just about right . husband loved the thin crust pizza .
Table 3: Examples of vector arithmetic for tense inversion.

Input               service was excellent ( as always ) .
AAE                 service was excellent ( as always ) .
                    service was tasteless ( as not ) .
                    service was overcooked ( without not please .
AAE w/ perturbed x  service was excellent ( as always ) .
                    service was fine ( as already ) .
                    service was rude ( after already ) .
Input               i am truly annoyed and disappointed at this point .
AAE                 i am truly impressed and disappointed at this point .
                    i am truly impressed and disappointed at this point .
                    my pizza is pleasant and disappointed and chinese .
AAE w/ perturbed x  i am truly annoyed and disappointed at this point .
                    i am truly flavorful and disappointed at this point .
                    i am truly friendly and pleasant at this point .
Table 4: Examples of vector arithmetic for sentiment transfer; successive lines under each model correspond to increasingly scaled offsets of the sentiment vector.

Mikolov et al. (2013) previously discovered that word embeddings learned without supervision can capture linguistic relationships via simple arithmetic. A canonical example is the embedding arithmetic “King” − “Man” + “Woman”, which results in a vector that lies very close to the embedding of “Queen”. We now investigate whether analogous structure emerges in the latent space of our sentence-level models, with tense and sentiment as two example attributes (Hu et al., 2017).

Tense

We use the Stanford Parser (https://nlp.stanford.edu/software/srparser.html) to extract the main verb of a sentence and determine the sentence tense based on its part-of-speech tag. The Yelp development and test sets consist of around 2.8K past tense sentences and 4.5K present tense sentences. We compute a single “tense vector” v by averaging the latent code z separately for past tense sentences and present tense sentences in the development set and then computing the difference between the two means. Given a sentence from the test set, we attempt to change its tense from past to present or from present to past through simple addition/subtraction of the tense vector. More precisely, a source sentence x is first encoded to z = E(x), and the tense-modified sentence is then produced via G(z ± v), where v denotes the fixed tense vector.
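A minimal numpy sketch of this vector arithmetic (our own illustration; `encode` and `decode` stand in for the trained encoder E and greedy decoder G and are not real APIs from the paper's code):

```python
import numpy as np

def attribute_vector(encode, past_sents, present_sents):
    """v = mean(z | present) - mean(z | past), computed on the dev set."""
    z_past = np.stack([encode(s) for s in past_sents])
    z_present = np.stack([encode(s) for s in present_sents])
    return z_present.mean(axis=0) - z_past.mean(axis=0)

def flip_tense(encode, decode, sentence, v, to_present=True):
    z = encode(sentence)                             # z = E(x)
    return decode(z + v if to_present else z - v)    # G(z +/- v)
```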

To quantitatively compare different models, we compute their tense transfer accuracy as measured by the parser, the output BLEU with respect to the input sentence, and the output PPL evaluated by a language model. Table 1 (Above) shows that the AAE with perturbed x achieves the highest accuracy, the lowest PPL, and relatively high BLEU, implying that the output sentences produced by our model are more likely to be of high quality and of the proper tense while remaining similar to the source sentence. We also conduct a human evaluation on 100 test sentences (50 past and 50 present) to compare our model with the β-VAE, the closest baseline model. The human annotator is presented with a source sentence and two outputs (one from each approach, presented in random order) and asked to judge which one successfully changes the tense while being faithful to the source, whether both are good/bad, or whether the input is not suitable to have its tense inverted. From Table 1 (Below), one can see that the AAE with perturbed x outperforms the β-VAE twice as often as it is outperformed, with an overall success rate 16% higher than that of the β-VAE.

Table 3 shows the result of adding this simple latent vector offset to example sentences under different models. In three examples, the AAE with perturbed x successfully changes “was” to “is”, “loves” to “loved”, “have” to “had”, and “seem” to “seemed”, with only slight sentence distortions. Other baselines either fail to alter the tense or change the meaning of the source sentence (e.g., “loves” to “ordered”). The second example depicts a difficult case where all models fail.

Sentiment

We repeat a similar analysis on sentiment, using the sentiment transfer dataset of Shen et al. (2017), which is also derived from Yelp reviews. We sample 100 negative and positive sentences and compute the difference between their mean z-representations as the “sentiment vector” v. We then apply ±v to another 1000 negative and positive sentences to change their sentiment, following the same procedure previously used to alter tense. Table 2 reports the automatic evaluations, and Table 4 (also Table E.1 in the Appendix) shows examples generated by the AAE and the AAE with perturbed x. Sentiment seems to be less salient in the data than tense, and adding ±v alone does not invert the sentiment of a sentence effectively. Thus, we also tried larger multiples of the sentiment vector, and found that the resulting sentences become more and more positive/negative (Table 4). However, the PPL also increases dramatically with the scaling factor, indicating that the sentences become unnatural when their encodings are offset too much. The AAE with perturbed x outperforms the AAE, but is not competitive with style transfer models that are specifically trained with sentiment labels (Shen et al., 2017). Nevertheless, our model can be employed as a base model in place of other autoencoders when training with additional supervision.

5.3 Latent Space Interpolation

Our final experiments study sentence interpolation in the latent space of generative models. Given two input sentences x_1 and x_2, we encode them to z_1 and z_2, and decode from interpolated points between z_1 and z_2 to obtain intermediate sentences (a minimal sketch of this procedure is given after Table 5). Ideally this should produce fluent sentences with gradual semantic change (Bowman et al., 2016). Table 5 shows two examples from the Yelp dataset, where it is clear that the AAE with perturbed x leads to more fluent and coherent interpolations than the AAE without perturbations. Table E.2 in the Appendix shows two challenging examples from Yahoo, where we interpolate between dissimilar sentences. While it is difficult for our model, trained with simple perturbations, to generate semantically correct sentences in these cases, its learned latent space exhibits continuity in topic and syntactic structure.

Input 1 i highly recommend it and i ’ll definitely be back ! everyone is sweet !
Input 2 i will be back ! everyone who works there is very sweet and genuine too !
AAE i highly recommend it and i ’ll definitely be back ! everyone is sweet !
i highly recommend it and i ’ll definitely be back ! everyone is sweet !
i will be it ! everyone is pre-made - my tea and low pickles !
i will be back ! everyone who works there is very sweet and genuine too !
i will be back ! everyone who works there is very sweet and genuine too !
AAE w/ perturbed x  i highly recommend it and i ’ll definitely be back ! everyone is sweet !
i highly recommend it and i ’ll definitely be back ! everyone is sweet !
i highly recommend it and will be back ! ! everyone ’s friendly - and sweet ! !
i will be back ! everyone who works there is very sweet and genuine ! !
i will be back ! everyone who works there is very sweet and genuine too !
Table 5: Interpolations between two input sentences generated by AAE and our model on the Yelp dataset.
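A minimal sketch of the interpolation procedure referenced above (our own illustration; `encode` and `decode` again stand in for the trained encoder and greedy decoder, not real APIs):

```python
import numpy as np

def interpolate(encode, decode, sent1, sent2, steps=5):
    """Decode sentences along the segment between z1 = E(sent1) and z2 = E(sent2)."""
    z1, z2 = encode(sent1), encode(sent2)
    return [decode((1.0 - t) * z1 + t * z2) for t in np.linspace(0.0, 1.0, steps)]
```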

6 Conclusion

This paper demonstrated the utility of data-space perturbations in AAEs for generative text modeling. In line with previous work (Devlin et al., 2018; Lample et al., 2018), we find denoising techniques particularly effective for learning improved text representations. Our proposed model substantially outperforms other text autoencoders and demonstrates potential for performing sentence-level vector arithmetic. Future work might investigate better perturbation strategies and delve deeper into the latent space geometry of text autoencoders to further improve controllable generation.

Acknowledgments

We thank Tao Lei and the MIT NLP group for their helpful comments.

References

Appendix A Proof of Theorem 1

Theorem 1 (restated).

Proof.

Consider two encoder matchings, x_i ↦ z_{π(i)} and x_i ↦ z_{π′(i)}, where both π and π′ are permutations of the indices {1, …, n}. Suppose G is the optimal decoder model for the first matching, i.e. it attains the optimal value of Σ_i log p_G(x_i | z_{π(i)}).

Now define G′ by p_{G′}(x_i | z) = p_G(x_{π^{-1}(π′(i))} | z) for all z; this relabels the discrete outputs while leaving the dependence on z unchanged, so G′ satisfies the same Lipschitz constraint as G. Then Σ_i log p_{G′}(x_i | z_{π′(i)}) = Σ_i log p_G(x_i | z_{π(i)}), i.e. G′ achieves exactly the same log-likelihood objective value for matching π′ as G does for matching π. Hence the optimal objective value is the same for any encoder matching. ∎

Appendix B Proof of Theorem 2

Theorem 2 (restated).

Proof.

Assume without loss of generality that the encoder maps each x_i to z_i, so that (x_1, x_2) and (x_3, x_4) are the two close x-pairs. For our choice of C, the training objective to be maximized is:

(6)   max_{G ∈ F}  Σ_{i=1}^{4} Σ_{x̃} C(x̃ | x_i) log p_G(x_i | E(x̃))

The remainder of our proof is split into two cases:

Case 1. The close x-pairs are mapped to the close z-pairs, i.e. ‖z_1 − z_2‖ ≤ δ′ and ‖z_3 − z_4‖ ≤ δ′.

Case 2. The close x-pairs are mapped to far-apart z-pairs, i.e. ‖z_1 − z_2‖ ≥ Δ′ and ‖z_3 − z_4‖ ≥ Δ′.

Under Case 1, points that lie far apart also have encodings that remain far apart. Under Case 2, points that lie far apart have encodings that lie close together. We complete the proof by showing that the achievable objective value in Case 2 is strictly worse than in Case 1, and thus an optimal encoder/decoder pair would avoid the matching that leads to Case 2.

In Case 1, we can lower bound the training objective (6) by choosing a decoder probability assignment of the form:

(7)

constructed from the sigmoid function σ so that each x_i receives high conditional probability at the latent points near z_i, while the assignment does not violate the Lipschitz condition from Assumption 1. Plugging the assignment from (7) into (6), we see that an optimal decoder can attain a correspondingly large training objective value in Case 1.


Next, we consider Case 2, where the close x-pairs are mapped to far-apart z-pairs. For each x_i and each perturbation x̃ in the support of C(·|x_i), Assumption 1 bounds the log-likelihood log p_G(x_i | E(x̃)) in terms of the distances between the corresponding latent points, which are now at least Δ′ apart. Continuing from (6), the overall training objective in this case is therefore upper bounded, with the bound attained by the decoder that is optimal under this constraint.

Finally, plugging in the range for δ′ and Δ′ stated in Theorem 2 shows that the best achievable objective value in Case 2 is strictly worse than the objective value achievable in Case 1. Thus, the optimal encoder/decoder pair under the AAE with perturbed x will always prefer the matching between x and z that ensures nearby x are encoded to nearby z (corresponding to Case 1). ∎

Appendix C Proof of Theorem 3

Theorem 3 (restated).

Proof.

Without loss of generality, assume the encoder maps each x_i to z_i for notational convenience. We consider the optimal decoder probability assignment under the Lipschitz constraint of Assumption 1.

The objective of the AAE with perturbed x is to maximize the expected reconstruction log-likelihood Σ_i Σ_j C(x_j | x_i) log p_G(x_i | z_j), where C is uniform within clusters.

We first show that the optimal G assigns the same probability within a cluster, i.e. p_G(x_i | z_j) takes a common value over all j such that c(j) = c(i). If not, we can equalize these within-cluster probabilities; the reassigned decoder still conforms to the Lipschitz constraint if the original one does, and attains an objective value at least as large.

Now denote this common within-cluster probability for each example; the objective can then be rewritten in terms of these per-cluster probabilities. Each resulting term is maximized when the decoder concentrates probability on the corresponding cluster, while the Lipschitz constraint of Assumption 1 limits how much probability can simultaneously be concentrated near the encodings of other clusters.

Combining these bounds over all terms yields the upper bound stated in Theorem 3, which improves as the encodings of different clusters become better separated. ∎

Appendix D Experimental Details

In all models, the encoder and generator are one-layer LSTMs with hidden dimension 1024 and word embedding dimension 512. The last hidden state of the encoder is projected into 128 dimensions to produce the latent code z, which is then concatenated with the input word embeddings fed to the generator. The discriminator is an MLP with one hidden layer of size 512. The adversarial weight λ of the AAE-based models is set to 10 to ensure the latent codes are indistinguishable from the prior. All models are trained via the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.0005. At test time, encoder-side perturbations are disabled, and we use greedy decoding to generate x from z.
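A minimal PyTorch sketch of the architecture described above (our own illustration, not the authors' released code); dimensions follow the text: 512-dimensional embeddings, one-layer LSTMs with 1024 hidden units, a 128-dimensional latent code, and a discriminator MLP with one hidden layer of size 512.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=512, hid_dim=1024, z_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=1, batch_first=True)
        self.to_z = nn.Linear(hid_dim, z_dim)        # project last hidden state to z

    def forward(self, x):
        _, (h, _) = self.lstm(self.emb(x))
        return self.to_z(h[-1])                      # (batch, z_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=512, hid_dim=1024, z_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        # z is concatenated with every input word embedding.
        self.lstm = nn.LSTM(emb_dim + z_dim, hid_dim, num_layers=1, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, z, x_in):
        z_rep = z.unsqueeze(1).expand(-1, x_in.size(1), -1)
        h, _ = self.lstm(torch.cat([self.emb(x_in), z_rep], dim=-1))
        return self.out(h)                           # per-step vocabulary logits

class Discriminator(nn.Module):
    def __init__(self, z_dim=128, hid_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, 1), nn.Sigmoid(),     # probability that z came from the prior
        )

    def forward(self, z):
        return self.net(z).squeeze(-1)
```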

Appendix E Additional Results

Figure E.1: Generation-reconstruction trade-off of different text autoencoders on the Yahoo dataset. As in Figure 2, the KL coefficient β of the β-VAE is swept over a range of values, the log-variance penalty weight of the AAE with perturbed z is 0.01, 0.05, 0.1, or 0.2, and the word mask probability of the AAE with perturbed x ranges from 0.1 to 1. The “real data” dotted line marks the perplexity of a language model trained and evaluated on real data. In the BLEU–reverse PPL plot (Right), we removed points of severe collapse that have huge reverse PPL (>300) arising from extreme parameter settings.
Input               the service was top notch and so was the food .
AAE                 the service was top notch and so was the food .
                    the service was top asap and so was the tv .
                    unfortunately but immediately walked pain and did getting the entre .
AAE w/ perturbed x  the service was top notch and so was the food .
                    the service was broken off to how was my food .
                    but <unk> was handed me to saying was my fault .
Input               really dissapointed wo n’t go back .
AAE                 really margarita wo n’t go back .
                    really margarita wo n’t go back .
                    really wonderful place !
AAE w/ perturbed x  really dissapointed wo n’t go back .
                    really fantastic place will go back .
                    really wonderful place always great !
Input               dining was a disappointing experience in comparison .
AAE                 dining was a disappointing experience in comparison .
                    dining was a disappointing sushi in restaurants .
                    dining was a seafood experience .
AAE w/ perturbed x  dining was a pleasant experience in comparison .
                    dining was a great experience in addition .
                    dining food is great experience in addition .
Table E.1: More examples of vector arithmetic for sentiment transfer; successive lines under each model correspond to increasingly scaled offsets of the sentiment vector.

Input 1 what language should i learn to be more competitive in today ’s global culture ?
Input 2 what languages do you speak ?
AAE what language should i learn to be more competitive in today ’s global culture ?
what language should i learn to be more competitive in today ’s global culture ?
what language should you speak ?
what languages do you speak ?
what languages do you speak ?
AAE w/ perturbed x  what language should i learn to be more competitive in today ’s global culture ?
what language should i learn to be competitive today in arabic ’s culture ?
what languages do you learn to be english culture ?
what languages do you learn ?
what languages do you speak ?
Input 1 i believe angels exist .
Input 2 if you were a character from a movie , who would it be and why ?
AAE i believe angels exist .
i believe angels - there was the exist exist .
i believe in tsunami romeo or <unk> i think would it exist as the world population .
if you were a character from me in this , would we it be ( why !
if you were a character from a movie , who would it be and why ?
AAE w/ perturbed x  i believe angels exist .
i believe angels exist in the evolution .
what did <unk> worship by in <unk> universe ?
if you were your character from a bible , it will be why ?
if you were a character from a movie , who would it be and why ?
Table E.2: Interpolations between two input sentences generated by AAE and our model on the Yahoo dataset.