Disentangled Representation Learning for Text Style Transfer

08/13/2018 ∙ by Vineet John, et al. ∙ University of Waterloo

This paper tackles the problem of disentangling the latent variables of style and content in language models. We propose a simple, yet effective approach, which incorporates auxiliary objectives: a multi-task classification objective, and dual adversarial objectives for label prediction and bag-of-words prediction, respectively. We show, both qualitatively and quantitatively, that the style and content are indeed disentangled in the latent space, using this approach. This disentangled latent representation learning method is applied to attribute (e.g. style) transfer on non-parallel corpora. We achieve similar content preservation scores compared to previous state-of-the-art approaches, and significantly better style-transfer strength scores. Our code is made publicly available for replicability and extension purposes.







1 Introduction

Neural networks have been successful learning machines during the past decade, owing to their highly expressive modeling capability, which is a consequence of multiple layers of non-linear transformations of input features. Such transformations, however, make intermediate features “latent,” in the sense that they do not have explicit meaning and are not interpretable. Therefore, neural networks are usually treated as black-box machinery.

Disentangling the latent space of neural networks has become an increasingly important research topic. In the image domain, for example, Chen et al. (2016) use adversarial and information-maximization objectives to produce interpretable latent representations that can be tweaked to adjust writing style for handwritten digits, as well as lighting and orientation for face models. Mathieu et al. (2016) utilize a convolutional autoencoder to achieve the same objective. However, this problem is not well explored in natural language processing.

In this paper, we address the problem of disentangling the latent space of neural networks for text generation. Our model is built on an autoencoder that encodes a sentence to the latent space (vector representation) by learning to reconstruct the sentence itself. We would like the latent space to be disentangled with respect to different features, namely, style and content in our task.

To accomplish this, we propose a simple approach that combines multi-task and adversarial objectives. We artificially divide the latent representation into two parts: the style space and the content space. In this work, we consider the sentiment of a sentence as the style. We design auxiliary losses enforcing the separation of the style and content latent spaces. In particular, the multi-task loss operates on a latent space to ensure that the space does contain the information we wish to encode. The adversarial loss, by contrast, minimizes the predictability of information that should not be contained in that space. In previous work, researchers typically work with the style, or specifically, sentiment space [Hu et al.2017, Shen et al.2017, Fu et al.2018], but simply ignore the content space, as it is hard to formalize what “content” actually refers to.

In our paper, we propose to approximate the content information by bag-of-words (BoW) features, where we focus on style-neutral, non-stopwords. Along with traditional style-oriented auxiliary losses, our BoW multi-task loss and BoW adversarial loss make the style and content spaces much more disentangled from each other.

The learned disentangled latent space can be directly used for text style transfer [Hu et al.2017, Shen et al.2017], which aims to transform a given sentence to a new sentence with the same content but a different style. Since it is difficult to obtain training sentence pairs with the same content and differing styles (i.e., parallel corpora), we follow the setting where we train our model on non-parallel but style-labeled corpora. We call this non-parallel text style transfer. To accomplish this, we train an autoencoder with disentangled latent spaces. For style-transfer inference, we simply use the autoencoder to encode the content vector of a sentence, but ignore its encoded style vector. We then infer, from the training data, an empirical embedding of the style that we would like to transfer. The encoded content vector and the empirically inferred style vector are concatenated and fed to the decoder. This grafting technique enables us to obtain a new sentence similar in content to the input sentence, but with a different style.

We conducted experiments on two customer review datasets. Qualitative and quantitative results show that both the style and content spaces are indeed disentangled well. In the style-transfer evaluation, we achieve substantially better style-transfer strength, content preservation, and language fluency scores, compared with previous results. Ablation tests also show that the auxiliary losses can be combined well, each playing its own role in disentangling the latent space.

2 Related Work

Disentangling neural networks’ latent space has been explored in the image processing domain in the recent years, and researchers have successfully disentangled rotation features, color features, etc. of images [Chen et al.2016, Luan et al.2017]. Some image characteristics (e.g., artistic style) can be captured well by certain statistics [Gatys, Ecker, and Bethge2016]. In other work, researchers adopt data augmentation techniques to learn a disentangled latent space [Kulkarni et al.2015, Champandard2016].

In natural language processing, the definition of “style” itself is vague, and as a convenient starting point, NLP researchers often treat sentiment as a salient style of text. Hu et al. (2017) manage to control the sentiment by using discriminators to reconstruct sentiment and content from generated sentences. However, there is no evidence that the latent space would be disentangled by this reconstruction. Shen et al. (2017) use a pair of adversarial discriminators to align the recurrent hidden decoder states of original and style-transferred sentences, for a given style. Fu et al. (2018) propose two approaches: training style-specific embeddings, and training separate style-specific decoders for style transfer. They apply an adversarial loss on the encoded space to discourage encoding style in the latent space of an autoencoding model. All the above approaches deal only with the style information and simply ignore the content part.

Zhao et al. (2018) extend the multi-decoder approach and use a Wasserstein-distance penalty to align content representations of sentences with different styles. However, the Wasserstein penalty is applied to empirical samples from the data distribution, and is more indirect than our BoW-based auxiliary losses. Recently, Rao and Tetreault (2018) treat the formality of writing as a style, and create a parallel corpus for style transfer with sequence-to-sequence models. This is beyond the scope of our paper, as we focus on non-parallel text style transfer.

Our paper differs from previous work in that both our style space and content space are encoded from the input, and we design several auxiliary losses to ensure that each space encodes and only encodes the desired information. Such disentanglement of latent space has its own research interest in the deep learning community. The disentangled representation can be directly applied to non-parallel text style-transfer tasks, as in the aforementioned studies.

3 Approach

In this section, we describe our approach in detail, shown in Figure 1. Our model is built upon an autoencoder with a sequence-to-sequence neural network [Sutskever, Vinyals, and Le2014], and we design multi-task and adversarial losses for both style and content spaces. Finally, we present our approach to transfer style in the context of natural language generation.

Figure 1: Overview of our approach.

3.1 Autoencoder

An autoencoder encodes an input to a latent vector space, from which it reconstructs the input itself. The latent vector space is usually of much smaller dimensionality than input data, and the autoencoder learns salient and compact representations of data during the reconstruction process.

Let $x = (x_1, x_2, \dots, x_n)$ be an input sequence with $n$ tokens. The encoder recurrent neural network (RNN) with gated recurrent units (GRU) [Cho et al.2014] encodes $x$ and obtains a hidden vector representation $h$, which is linearly transformed from the encoder RNN’s final hidden state.

Then a decoder RNN generates a sentence, which ideally should be $x$ itself. Suppose at a time step $t$, the decoder RNN predicts the word $x_t$ with probability $p(x_t \mid h, x_1, \dots, x_{t-1})$. Then the autoencoder is trained with a sequence-aggregated cross-entropy loss, given by

$$J_{\text{AE}}(\theta_E, \theta_D) = -\sum_{t=1}^{n} \log p(x_t \mid h, x_1, \dots, x_{t-1}) \tag{1}$$

where $\theta_E$ and $\theta_D$ are the parameters of the encoder and decoder, respectively. (For brevity, we only present the loss for a single data point, i.e., a sentence, throughout the paper. The total loss sums over all data points, and is implemented using mini-batches.) Both the encoder and decoder are deterministic functions in the original autoencoder model [Rumelhart, Hinton, and Williams1985], and thus we call it a deterministic autoencoder (DAE).
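As a sanity check on the loss above, the per-sentence cross-entropy can be computed directly from the probabilities the decoder assigns to each ground-truth token. The following is a minimal numpy sketch; the probability values are illustrative, not from the paper:

```python
import numpy as np

def sequence_cross_entropy(token_probs):
    """Sequence-aggregated cross-entropy: the negative log-probability
    of each ground-truth token, summed over the time steps of one sentence."""
    return -sum(np.log(p) for p in token_probs)

# Probabilities the decoder assigns to the correct token at each step
# (toy numbers for illustration).
probs = [0.9, 0.8, 0.95]
loss = sequence_cross_entropy(probs)
```

A perfect reconstruction (all probabilities 1) would give a loss of zero; any uncertainty makes the loss strictly positive.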

3.1.1 Variational Autoencoder.

In addition to the DAE, we also implement a variational autoencoder (VAE) [Kingma and Welling2014], which imposes a probabilistic distribution on the latent vector. A Kullback-Leibler (KL) divergence [Kullback and Leibler1951] penalty is added to the loss function to regularize the latent space. The decoder reconstructs data based on a latent vector sampled from its posterior distribution.

Formally, the autoencoding loss in the VAE is

$$J_{\text{VAE}}(\theta_E, \theta_D) = -\mathbb{E}_{q(h\mid x)}\big[\log p(x\mid h)\big] + \lambda_{\text{KL}}\,\text{KL}\big(q(h\mid x)\,\|\,p(h)\big) \tag{2}$$

where $\lambda_{\text{KL}}$ is the hyperparameter balancing the reconstruction loss and the KL term, $p(h)$ is the prior, set to the standard normal distribution $\mathcal{N}(0, I)$, and $q(h\mid x)$ is the posterior, taking the form $\mathcal{N}(\mu, \text{diag}\,\sigma^2)$, where $\mu$ and $\sigma$ are predicted by the encoder network. The motivation for using the VAE as opposed to the DAE is that the reconstruction is based on samples of the posterior, which populates encoded vectors to the neighborhood and thus smooths the latent space. Bowman et al. (2016) show that the VAE enables more fluent sentence generation from a latent space than the DAE.
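For a diagonal-Gaussian posterior and a standard-normal prior, the KL term has a well-known closed form. The sketch below is a generic VAE identity, not code from the paper:

```python
import numpy as np

def kl_to_standard_normal(mu, log_sigma_sq):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form:
    0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(log_sigma_sq) + mu ** 2 - 1.0 - log_sigma_sq)

# The penalty vanishes exactly when the posterior equals the prior.
mu = np.zeros(4)
log_sigma_sq = np.zeros(4)
kl = kl_to_standard_normal(mu, log_sigma_sq)  # 0.0
```

Any deviation of the posterior mean or variance from the prior increases the penalty, which is what keeps the latent space smooth.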

The autoencoding losses in Equations (1) and (2) serve as our primary training objective. Besides, the autoencoder is also used for text generation in the style-transfer application. We also design several auxiliary losses to disentangle the latent space. In particular, we hope that $h$ can be separated into two spaces $s$ and $c$, representing style and content, respectively, i.e., $h = [s; c]$, where $[\cdot\,;\cdot]$ denotes concatenation. This is accomplished by the auxiliary losses described in the rest of this section.

3.2 Style-Oriented Losses

We first design auxiliary losses that ensure the style information is contained in the style space $s$. This involves a multi-task loss that ensures $s$ is discriminative for the style, as well as an adversarial loss that ensures the content space $c$ is not discriminative for the style.

3.2.1 Multi-Task Loss for Style.

Although the corpus we use is non-parallel, we assume that each sentence is labeled with its style. In particular, we treat the sentiment as the style of interest, following previous work [Hu et al.2017, Shen et al.2017, Fu et al.2018, Zhao et al.2018], and each sentence is labeled with a binary sentiment tag (positive or negative).

We build a classifier on the style space that predicts the style label. Formally, a two-way softmax layer (equivalent to logistic regression) is applied to the style vector $s$, given by

$$y_s = \text{softmax}(W_{\text{mul}(s)}\, s + b_{\text{mul}(s)}) \tag{3}$$

where $\theta_{\text{mul}(s)} = \{W_{\text{mul}(s)}, b_{\text{mul}(s)}\}$ are the parameters for multi-task learning of style, and $y_s$ is the output of the softmax layer.

The classifier is trained with a simple cross-entropy loss against the ground-truth distribution $t_s$, given by

$$J_{\text{mul}(s)}(\theta_E; \theta_{\text{mul}(s)}) = -\sum_{l \in \text{labels}} t_s(l) \log y_s(l) \tag{4}$$

where $\theta_E$ are the encoder’s parameters.

We train the style classifier at the same time as the autoencoding loss. Thus, this could be viewed as multi-task learning, incentivizing the entire model to not only decode the sentence, but also predict its sentiment from the style vector $s$. We denote this loss by “mul(s).” The idea of multi-task losses is not new, and has been used in previous work for sequence-to-sequence learning [Luong et al.2015], sentence representation learning [Jernite, Bowman, and Sontag2017], and sentiment analysis [Balikas, Moura, and Amini2017], among others.

3.2.2 Adversarial Loss for Style.

The above multi-task loss only ensures that the style space contains style information. However, the content space might also contain style information, which is undesirable for disentanglement.

We thus apply an adversarial loss to discourage the content space from containing style information. The idea is to first introduce a classifier, called an adversary, that deliberately discriminates the true style label using the content vector $c$. Then the encoder is trained to learn a content vector space from which its adversary cannot predict style information.

Concretely, the adversarial discriminator and its training objective have a similar form as Equations (3) and (4), but with different input and parameters, given by

$$y_{\text{adv}(s)} = \text{softmax}(W_{\text{dis}(s)}\, c + b_{\text{dis}(s)}) \tag{5}$$

$$J_{\text{dis}(s)}(\theta_{\text{dis}(s)}) = -\sum_{l \in \text{labels}} t_s(l) \log y_{\text{adv}(s)}(l) \tag{6}$$

where $\theta_{\text{dis}(s)} = \{W_{\text{dis}(s)}, b_{\text{dis}(s)}\}$ are the parameters of the adversary.

It should be emphasized that, for the adversary, the gradients are not propagated back to the autoencoder, i.e., the variables in $c$ are treated as shallow features. Therefore, we view $J_{\text{dis}(s)}$ as a function of $\theta_{\text{dis}(s)}$ only, whereas $J_{\text{mul}(s)}$ is a function of both $\theta_E$ and $\theta_{\text{mul}(s)}$.

Having trained an adversary, we would like the autoencoder to be tuned in such an ad hoc fashion that $c$ is not discriminative for style. In the existing literature, there are different approaches, for example, maximizing the adversary’s loss [Shen et al.2017, Zhao et al.2018] or penalizing the entropy of the adversary’s prediction [Fu et al.2018]. In our work, we adopt the latter, as it can be easily extended to multi-category classification, which we use for the content-oriented losses of our approach. Formally, the adversarial objective for the style is to maximize

$$J_{\text{adv}(s)}(\theta_E) = \mathcal{H}(y_{\text{adv}(s)}) \tag{7}$$

where $\mathcal{H}(p) = -\sum_i p_i \log p_i$ is the entropy and $y_{\text{adv}(s)}$ is the predicted distribution over the style labels. Here, $J_{\text{adv}(s)}$ is maximized with respect to the encoder, and attains its maximum value when $y_{\text{adv}(s)}$ is a uniform distribution. It is viewed as a function of $\theta_E$, and we fix $\theta_{\text{dis}(s)}$.
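The entropy objective can be illustrated numerically: a confident adversary prediction has low entropy, while the uniform distribution attains the maximum. A small sketch with illustrative values:

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_i p_i log p_i of a discrete distribution.
    A small epsilon guards against log(0)."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

# The adversarial objective pushes the adversary's prediction toward
# uniform, where entropy is maximized (log 2 for two classes).
confident = entropy([0.99, 0.01])
uniform = entropy([0.5, 0.5])
```

Maximizing this entropy with respect to the encoder makes the content vector uninformative about style, which is exactly the disentanglement goal.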

While adversarial loss has been explored in previous style-transfer papers [Shen et al.2017, Fu et al.2018], it has not been combined with the multi-task loss. As we shall show in our experiments, combining these two losses is promisingly effective, achieving better style transfer performance than a variety of previous state-of-the-art methods.

3.3 Content-Oriented Losses

The above style-oriented losses only regularize style information, but they do not impose any constraint on how the content information should be encoded. This also happens in most previous work [Hu et al.2017, Shen et al.2017, Fu et al.2018]. Although the style space is usually much smaller than the content space, it is unrealistic to expect that the content would not flow into the style space because of its limited capacity. Therefore, we need to design content-oriented auxiliary losses to regularize the content information.

Inspired by the above combination of multi-task and adversarial losses, we apply the same idea to the content space. However, it is usually hard to define what “content” actually refers to.

To this end, we propose to approximate the content information by bag-of-words (BoW) features. The BoW features of an input sentence form a vector, each element indicating the probability of a word’s occurrence in the sentence. For a sentence $x$ with $N$ words, the word $w$’s BoW probability is given by $t_c(w) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[x_i = w]$, where $t_c$ denotes the target distribution of content, and $\mathbb{1}[\cdot]$ is an indicator function. Here, we only consider content words, excluding stopwords and style-specific words, since we focus on “content” information. In particular, we exclude sentiment words from a curated lexicon [Hu and Liu2004] for sentiment style transfer. The effect of using different vocabularies for BoW is analyzed in Supplemental Material A.
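The BoW target distribution might be computed as in the following sketch, where the stopword and sentiment-word sets are toy placeholders for the curated lexicons the paper uses:

```python
from collections import Counter

def bow_target(tokens, stopwords, sentiment_words):
    """Bag-of-words target distribution over content words only:
    each kept word's probability is its count divided by the number
    of kept words in the sentence."""
    content = [w for w in tokens
               if w not in stopwords and w not in sentiment_words]
    counts = Counter(content)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

# Toy word lists (illustrative; not the paper's lexicon).
t = bow_target("the food was great great".split(),
               stopwords={"the", "was"},
               sentiment_words={"great"})
# Only "food" survives the filtering, so it gets probability 1.0.
```

Filtering before normalizing matters: otherwise sentiment words like "great" would leak style information into the supposedly style-neutral content target.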

3.3.1 Multi-Task Loss for Content.

Similar to the style-oriented losses, the multi-task loss for content, denoted as “mul(c),” ensures that the content space $c$ contains content information, i.e., BoW features.

We introduce a softmax classifier over the BoW vocabulary, given by

$$y_c = \text{softmax}(W_{\text{mul}(c)}\, c + b_{\text{mul}(c)}) \tag{8}$$

where $\theta_{\text{mul}(c)} = \{W_{\text{mul}(c)}, b_{\text{mul}(c)}\}$ are the classifier’s parameters, and $y_c$ is the predicted BoW distribution.

The training objective is a cross-entropy loss against the ground-truth distribution $t_c$, given by

$$J_{\text{mul}(c)}(\theta_E; \theta_{\text{mul}(c)}) = -\sum_{w \in \text{vocab}} t_c(w) \log y_c(w) \tag{9}$$

where the optimization is performed over both the encoder parameters $\theta_E$ and the multi-task classifier parameters $\theta_{\text{mul}(c)}$. Notice that although the target distribution $t_c$ is not one-hot for BoW prediction, the cross-entropy loss (Equation 9) has the same form.
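Because the cross-entropy loss has the same form for soft targets, it applies unchanged to the non-one-hot BoW distribution. A small numpy sketch over an illustrative three-word vocabulary:

```python
import numpy as np

def cross_entropy(target, predicted):
    """Cross-entropy -sum_w t(w) log y(w); the same expression works
    whether the target is one-hot or a soft BoW distribution."""
    return -float(np.sum(target * np.log(predicted + 1e-12)))

t = np.array([0.5, 0.5, 0.0])   # soft BoW target over a 3-word vocabulary
y = np.array([0.5, 0.5, 0.0])   # a prediction matching the target exactly
soft_loss = cross_entropy(t, y)
```

When the prediction matches a soft target exactly, the loss equals the target's own entropy (here log 2) rather than zero, which is the expected behavior for distribution-valued targets.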

It is also interesting that, at first glance, the multi-task loss for content appears redundant given the autoencoding loss, when in fact it is not. The multi-task loss considers only content words, which exclude stopwords and sentiment words, and it is applied only to the content space $c$. This ensures that the content information is captured in the content space. The autoencoding loss, by contrast, only requires that the model reconstruct the sentence from the content and style spaces together, and does not ensure their separation.

3.3.2 Adversarial Loss for Content.

To ensure that the style space does not contain content information, we design our final auxiliary loss, the adversarial loss for content, denoted as “adv(c).”

We build an adversary, a softmax classifier on the style space, to predict the BoW features approximating the content information, given by

$$y_{\text{adv}(c)} = \text{softmax}(W_{\text{dis}(c)}\, s + b_{\text{dis}(c)}) \tag{10}$$

where $\theta_{\text{dis}(c)} = \{W_{\text{dis}(c)}, b_{\text{dis}(c)}\}$ are the classifier’s parameters for BoW prediction; the adversary itself is trained with a cross-entropy loss $J_{\text{dis}(c)}$ analogous to the style adversary’s.

The adversarial loss for the model is to maximize the entropy of the discriminator,

$$J_{\text{adv}(c)}(\theta_E) = \mathcal{H}(y_{\text{adv}(c)}) \tag{11}$$

Again, $J_{\text{dis}(c)}$ is trained with respect to the discriminator’s parameters $\theta_{\text{dis}(c)}$, whereas $J_{\text{adv}(c)}$ is trained with respect to $\theta_E$, similar to the adversarial loss for style.

3.4 Training Process

The overall loss $J_{\text{ovr}}$ for the autoencoder comprises several terms: the reconstruction objective, the multi-task objectives for style and content, and the adversarial objectives for style and content:

$$J_{\text{ovr}} = J_{\text{AE}} + \lambda_{\text{mul}(s)} J_{\text{mul}(s)} - \lambda_{\text{adv}(s)} J_{\text{adv}(s)} + \lambda_{\text{mul}(c)} J_{\text{mul}(c)} - \lambda_{\text{adv}(c)} J_{\text{adv}(c)} \tag{12}$$

where the $\lambda$’s are hyperparameters that balance the autoencoding loss and these auxiliary losses. (The adversarial objectives are maximized with respect to the encoder, hence the negative signs.)

To put it all together, model training involves an alternation of optimizing the discriminator losses $J_{\text{dis}(s)}$ and $J_{\text{dis}(c)}$, and the model’s own loss $J_{\text{ovr}}$, shown in Algorithm 1.

1 foreach mini-batch do
2       minimize $J_{\text{dis}(s)}$ w.r.t. $\theta_{\text{dis}(s)}$;
3       minimize $J_{\text{dis}(c)}$ w.r.t. $\theta_{\text{dis}(c)}$;
4       minimize $J_{\text{ovr}}$ w.r.t. $\theta_E, \theta_D, \theta_{\text{mul}(s)}, \theta_{\text{mul}(c)}$;
5 end foreach
Algorithm 1 Training process.
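The alternation in Algorithm 1 can be sketched as a toy loop; the step functions below are placeholders standing in for real optimizer updates, not the paper's implementation:

```python
def train(num_batches, step_discriminator, step_model):
    """Alternating optimization per mini-batch: first update each
    adversary on its own loss, then update the autoencoder (together
    with the multi-task classifiers) on the overall loss."""
    schedule = []
    for _ in range(num_batches):
        schedule.append(step_discriminator("style"))    # min J_dis(s) w.r.t. adversary params
        schedule.append(step_discriminator("content"))  # min J_dis(c) w.r.t. adversary params
        schedule.append(step_model())                   # min overall loss w.r.t. encoder/decoder
    return schedule

# Record the call order with stub step functions.
log = train(2, lambda which: f"dis-{which}", lambda: "model")
```

The key property is the ordering: each adversary sees the current latent space before the encoder takes its step against the (frozen) adversaries.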

3.5 Generating Style-Transferred Sentences

A direct application of our disentangled latent space is style-transfer for natural language generation. For example, we can generate a sentence with generally the same meaning (content) but a different style (e.g., sentiment).

Let $x$ be an input sentence, with $s$ and $c$ being its encoded, disentangled style and content vectors, respectively. If we would like to transfer its content to a different style, we compute an empirical estimate of the target style’s vector $\hat{s}$ by averaging the encoded style vectors of all training sentences labeled with the target style:

$$\hat{s} = \frac{\sum_{x' \in X_{\text{target}}} s_{x'}}{|X_{\text{target}}|} \tag{13}$$

The inferred target style $\hat{s}$ is concatenated with the encoded content $c$ for decoding style-transferred sentences, as shown in Figure 1b.
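The inference-time grafting might look like the following numpy sketch, where the vector dimensionalities and values are illustrative:

```python
import numpy as np

def transfer_latent(content_vec, target_style_vecs):
    """Estimate the target style embedding as the mean style vector of
    training sentences labeled with the target style, then concatenate
    it with the encoded content vector for decoding."""
    style_hat = np.mean(target_style_vecs, axis=0)
    return np.concatenate([style_hat, content_vec])

# Tiny 2-D style / 3-D content example (dimensions are illustrative).
styles = np.array([[1.0, 0.0],
                   [3.0, 2.0]])       # style vectors of target-style sentences
content = np.array([0.5, 0.5, 0.5])   # encoded content of the input sentence
h = transfer_latent(content, styles)  # [2.0, 1.0, 0.5, 0.5, 0.5]
```

Feeding this grafted latent vector to the decoder yields a sentence with the input's content but the empirically estimated target style.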

4 Experiments

4.1 Datasets

We conducted experiments on two datasets, Yelp and Amazon reviews. Both of these datasets comprise sentences accompanied by binary sentiment labels (positive, negative). They are used to train latent space disentanglement as well as to evaluate sentiment transfer.

4.1.1 Yelp Service Reviews.

We used a Yelp review dataset, following previous work [Shen et al.2017, Zhao et al.2018]. (The Yelp dataset is available at https://github.com/shentianxiao/language-style-transfer) It contains 444,101, 63,483, and 126,670 labeled reviews for train, validation, and test, respectively. The maximum review length is 15 words, and the vocabulary size is approximately 9200.

4.1.2 Amazon Product Reviews.

We further evaluate our model with an Amazon review dataset, following another previous paper [Fu et al.2018]. (The Amazon dataset is available at https://github.com/fuzhenxin/text_style_transfer) It contains 559,142, 2,000, and 2,000 labeled reviews for train, validation, and test, respectively. The maximum review length is 20 words, and the vocabulary size is approximately 58,000.

4.2 Experiment Settings

We used the Adam optimizer [Kingma and Ba2014] for the autoencoder and the RMSProp optimizer [Tieleman and Hinton2012] for the discriminators, following adversarial training stability tricks [Arjovsky, Chintala, and Bottou2017]; each optimizer was used with a fixed initial learning rate. Our model is trained for 20 epochs, by which time it has mostly converged. The word embedding layer was initialized by word2vec [Mikolov et al.2013] trained on the respective training sets. Both the autoencoder and the discriminators are trained once per mini-batch. The hyperparameters $\lambda_{\text{mul}(s)}$, $\lambda_{\text{adv}(s)}$, $\lambda_{\text{mul}(c)}$, and $\lambda_{\text{adv}(c)}$ were tuned by performing a log-scale grid search within two orders of magnitude around their default values, choosing the settings that yielded the best validation results. The style vector size is 8 and the content vector size is 128. We append the latent vector to the hidden state at every time step of the decoder.

For the VAE model, we enforce the KL-divergence penalty on both the style and content posterior distributions, with separate weights $\lambda_{\text{KL}(s)}$ and $\lambda_{\text{KL}(c)}$, respectively, and use the KL-weight annealing schedule following Bahuleyan et al. (2018). These weights were tuned in the same manner as the other hyperparameters of the model.

4.3 Experiment I: Disentangling Latent Space

First, we analyze how the style (sentiment) and content of the latent space are disentangled. We train classifiers on the different latent spaces, and report their inference-time classification accuracies in Table 1.

We see that the 128-dimensional content vector is not particularly discriminative for style. It achieves accuracies slightly better than majority guess. However, the 8-dimensional style vector , despite its low dimensionality, achieves substantially higher style classification accuracy. When combining content and style vectors, we observe no further improvement. These results verify the effectiveness of our disentangling approach, as the style space contains style information, whereas the content space does not.

We show t-SNE plots of both the deterministic autoencoder (DAE) and the variational autoencoder (VAE) models in Figure 2. As seen, sentences with different styles are noticeably separated in a clean manner in the style space (LHS), but are indistinguishable in the content space (RHS). It is also evident that the latent space learned by the variational autoencoder is considerably smoother and continuous compared with the one learned by the deterministic autoencoder.

We show t-SNE plots for ablation tests with different combinations of auxiliary losses in Supplemental Material B.

Latent Space               Yelp (DAE)  Yelp (VAE)  Amazon (DAE)  Amazon (VAE)
None (majority guess)      0.602       0.602       0.512         0.512
Content space (c)          0.658       0.697       0.675         0.693
Style space (s)            0.974       0.974       0.821         0.810
Complete space ([s; c])    0.974       0.974       0.819         0.810
Table 1: Classification accuracy on the latent spaces, for both the DAE and VAE variants.
Figure 2: t-SNE plots of the disentangled style and content spaces (with all auxiliary losses on the Yelp dataset).
Model                                       Yelp Dataset                             Amazon Dataset
                                   Transfer  Cosine      Word     Language   Transfer  Cosine      Word     Language
                                   Accuracy  Similarity  Overlap  Fluency    Accuracy  Similarity  Overlap  Fluency
Style-Embedding [Fu et al.2018]    0.182     0.959       0.666    -16.17     0.400†    0.930†      0.359    -28.13
Cross-Alignment [Shen et al.2017]  0.784†    0.892       0.209    -23.39     0.606     0.893       0.024    -26.31
Multi-Decoder [Zhao et al.2018]    0.818†    0.883       0.272    -20.95     0.552     0.926       0.169    -34.70
Ours (DAE)                         0.883     0.915       0.549    -10.17     0.720     0.921       0.354    -24.74
Ours (VAE)                         0.934     0.904       0.473     -9.84     0.822     0.900       0.196    -21.70
Table 2: Performance of non-parallel text style transfer. The style-embedding approach achieves poor transfer accuracy, and should not be considered an effective style-transfer model. Our model outperforms the other previous methods in all aspects (transfer strength, content preservation, and language fluency). Numbers marked with † are quoted from the respective papers; the rest are based on our replication using the code published with previous work. Our replicated experiments achieve 0.809 and 0.835 transfer accuracy on the Yelp dataset, close to the results reported by Shen et al. (2017) and Zhao et al. (2018), respectively, showing that our replication is fair.

4.4 Experiment II: Non-Parallel Text Style Transfer

We also conducted sentiment transfer experiments with our disentangled latent space.

4.4.1 Metrics.

We evaluate competing models based on (1) style transfer strength, (2) content preservation and (3) quality of generated language. The evaluation of generated sentences is a difficult task in contemporary literature, so we adopt a few automatic metrics and use human judgment as well.

Style-Transfer Accuracy. We follow most previous work [Hu et al.2017, Shen et al.2017, Fu et al.2018] and train a separate convolutional neural network (CNN) to predict the sentiment of a sentence [Kim2014], which is then used to approximate the style-transfer accuracy. In other words, we report the CNN classifier’s accuracy on the style-transferred sentences, considering the target style to be the ground truth.

While the style classifier itself may not be perfect, it achieves reasonable sentiment accuracy on the validation sets of both datasets. Thus, it provides a quantitative way of evaluating the strength of style transfer.

Cosine Similarity. We followed Fu et al. (2018) and computed a sentence embedding by concatenating the min, mean, and max of its word embeddings (with sentiment words removed). Then, we computed the cosine similarity between the source and generated sentence embeddings, which is intended as an indicator of content preservation.
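Assuming the concatenated statistics are the element-wise min, mean, and max of the word embeddings (following Fu et al. 2018), the metric can be sketched as:

```python
import numpy as np

def sentence_embedding(word_vecs):
    """Concatenate the element-wise min, mean, and max of a sentence's
    word embeddings into one fixed-size vector."""
    w = np.asarray(word_vecs, dtype=float)
    return np.concatenate([w.min(axis=0), w.mean(axis=0), w.max(axis=0)])

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy "sentences" of 2-D word vectors (illustrative values).
src = sentence_embedding([[1.0, 0.0], [0.0, 1.0]])
gen = sentence_embedding([[1.0, 0.0], [0.0, 1.0]])
sim = cosine(src, gen)  # identical embeddings -> 1.0
```

For real word embeddings the vectors would come from word2vec or similar, with sentiment words filtered out before pooling, as the metric description states.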

Word Overlap. We find that the cosine similarity measure, although correlated with human judgment, is not a sensitive measure. We therefore propose a simple yet effective measure that counts the unigram word-overlap rate of the original sentence and the style-transferred sentence, computed as the number of unigrams shared by the two sentences divided by the number of unigrams in their union.
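Assuming an intersection-over-union (Jaccard-style) form for the unigram overlap rate, a minimal sketch:

```python
def word_overlap(src_tokens, gen_tokens):
    """Unigram word-overlap rate: |intersection| / |union| of the two
    sentences' unigram sets."""
    a, b = set(src_tokens), set(gen_tokens)
    return len(a & b) / len(a | b)

rate = word_overlap("the food is great".split(),
                    "the food is awful".split())  # 3 shared / 5 total = 0.6
```

Unlike cosine similarity over pooled embeddings, this measure drops sharply as soon as content words are replaced, which is what makes it the more sensitive content-preservation signal.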

Language Fluency. We use a trigram Kneser-Ney (KN) smoothed language model [Kneser and Ney1995] as a quantitative, automated metric for the fluency of a sentence. It estimates the empirical distribution of trigrams in a corpus, and computes the log-likelihood of a test sentence. We train the language model on the respective dataset, and report its log-likelihood on the generated sentences. A larger (closer to zero) number indicates a more fluent sentence.
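A simplified, unsmoothed (maximum-likelihood) trigram model illustrates the log-likelihood computation; the Kneser-Ney smoothing used in the paper, which handles unseen trigrams, is omitted here for brevity:

```python
import math
from collections import Counter

def train_trigram_lm(corpus):
    """Count trigrams and their bigram contexts over a tokenized corpus,
    with sentences padded by boundary symbols."""
    tri, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>", "<s>"] + sent + ["</s>"]
        for i in range(len(toks) - 2):
            tri[tuple(toks[i:i + 3])] += 1
            bi[tuple(toks[i:i + 2])] += 1
    return tri, bi

def log_likelihood(sent, tri, bi):
    """Sum of log P(w_t | w_{t-2}, w_{t-1}) under the MLE trigram model.
    (Unseen trigrams would need smoothing; not handled in this sketch.)"""
    toks = ["<s>", "<s>"] + sent + ["</s>"]
    ll = 0.0
    for i in range(len(toks) - 2):
        ll += math.log(tri[tuple(toks[i:i + 3])] / bi[tuple(toks[i:i + 2])])
    return ll

corpus = [["the", "food", "is", "great"]]
tri, bi = train_trigram_lm(corpus)
ll = log_likelihood(["the", "food", "is", "great"], tri, bi)  # 0.0: every trigram seen
```

A sentence composed entirely of trigrams seen in training scores 0; disfluent word sequences receive increasingly negative log-likelihoods, matching the "closer to zero is more fluent" reading above.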

Manual Evaluation. In addition to the above automatic metrics, we also conduct a human evaluation to further confirm the performance of our model. This was done on the Yelp dataset only, due to the amount of manual effort involved. We asked 6 human evaluators to rate each sentence on a 1–5 Likert scale [Stent, Marge, and Singhai2005] in terms of transfer strength, content similarity, and language quality. The evaluation was conducted in a strictly blind fashion: samples obtained from all evaluated models were randomly shuffled, so that an evaluator would be unaware of which model generated a particular sentence. The inter-rater agreement, as measured by Krippendorff’s alpha [Krippendorf2004] for our Likert-scale ratings, is 0.74, 0.68, and 0.72 for transfer strength, content preservation, and language quality, respectively. According to Krippendorff (2004), this is an acceptable level of inter-rater agreement.

Model                Transfer  Content       Language
                     Strength  Preservation  Quality
Fu et al. (2018)     1.67      3.84          3.66
Shen et al. (2017)   3.63      3.07          3.08
Zhao et al. (2018)   3.55      3.09          3.77
Ours (DAE)           3.67      3.64          4.19
Ours (VAE)           4.32      3.73          4.48
Table 3: Manual evaluation on the Yelp dataset.

4.4.2 Results and Analysis.

We compare our approach with previous state-of-the-art work in Table 2. For baseline methods, we quoted results from existing papers whenever possible, and replicated the experiments to report the other metrics using publicly available code [Shen et al.2017, Fu et al.2018, Zhao et al.2018]. (Fu et al. (2018) propose another model using multiple decoders; the method is further developed by Zhao et al. (2018), and we adopt the latter for comparison.) As discussed in Table 2, our replication involves reasonable effort and is fair for comparison.

Objectives                                      Transfer  Cosine      Word     Language
                                                Accuracy  Similarity  Overlap  Fluency
J_AE                                            0.106     0.939       0.472    -12.58
J_AE, J_mul(s)                                  0.767     0.911       0.331    -12.17
J_AE, J_adv(s)                                  0.782     0.886       0.230    -12.03
J_AE, J_mul(s), J_adv(s)                        0.912     0.866       0.171     -9.59
J_AE, J_mul(s), J_adv(s), J_mul(c), J_adv(c)    0.934     0.904       0.473     -9.84
Table 4: Ablation tests on the Yelp dataset. In all variants, we follow the same protocol of style transfer by substituting an empirical estimate of the target style vector.
Original (Positive) DAE Transferred (Negative) VAE Transferred (Negative)
the food is excellent and the service is exceptional the food was a bit bad but the staff was exceptional the food was bland and i am not thrilled with this
the waitresses are friendly and helpful the guys are rude and helpful the waitresses are rude and are lazy
the restaurant itself is romantic and quiet the restaurant itself is awkward and quite crowded the restaurant itself was dirty
great deal horrible deal no deal
both times i have eaten the lunch buffet and it was outstanding their burgers were decent but the eggs were not the consistency both times i have eaten here the food was mediocre at best
Original (Negative) DAE Transferred (Positive) VAE Transferred (Positive)
the desserts were very bland the desserts were very good the desserts were very good
it was a bed of lettuce and spinach with some italian meats and cheeses it was a beautiful setting and just had a large variety of german flavors it was a huge assortment of flavors and italian food
the people behind the counter were not friendly whatsoever the best selection behind the register and service presentation the people behind the counter is friendly caring
the interior is old and generally falling apart the decor is old and now perfectly the interior is old and noble
they are clueless they are stoked they are genuinely professionals
Table 5: Examples of style transferred sentence generation.

We observe that the style-embedding model [Fu et al.2018] performs poorly on the style-transfer objective, resulting in inflated cosine similarity and word-overlap scores. (It should be noted that the transfer accuracy is lower-bounded by 0% as opposed to 50%, because we always transfer a sentence to the opposite sentiment; the lower bound of zero transfer accuracy is achieved by a trivial model that copies the input.) We also examined the number of times each model generates exact copies of the source sentences during style transfer, and notice that the style-embedding model simply reconstructs the exact source sentence for a large fraction of test cases, whereas all other models do so only rarely. Therefore, we do not think the style-embedding approach is an effective model for text style transfer.

The other two competing methods [Shen et al.2017, Zhao et al.2018] achieve reasonable transfer accuracy and cosine similarity. However, our model outperforms them by roughly 10% in transfer accuracy, as well as in content-preservation scores (measured by cosine similarity and the word-overlap rate). This shows that our model is able to generate high-quality style-transferred sentences, which in turn indicates that the latent space is well disentangled into style and content subspaces. Regarding language fluency, we see that the VAE is better than the DAE in both experiments. This is expected, as the VAE regularizes the latent space by imposing a probabilistic distribution. We also see that our method achieves considerably more fluent sentences than competing methods, suggesting that our multi-task and adversarial losses are more “natural” objectives than, for example, aligning RNN hidden states [Shen et al.2017].

Table 3 presents the results of human evaluation. Again, we see that the style embedding model [Fu et al.2018] is ineffective as it has a very low transfer strength, and that our method outperforms other baselines in all aspects. The results are consistent with the automatic metrics in both experiments (Table 2). This implies that the automatic metrics we used are reasonable; it also shows consistent evidence of the effectiveness of our approach.

We conducted ablation tests on the Yelp dataset, and show the results in Table 4. With the autoencoding loss alone, we cannot achieve reasonable style-transfer accuracy by substituting an empirically estimated style vector of the target style. This is because the style and content spaces are not disentangled spontaneously by the autoencoding loss.

With either the style-oriented multi-task loss or the style-oriented adversarial loss, the model achieves reasonable transfer accuracy and cosine similarity. Combining the two improves the transfer accuracy to 90%, outperforming previous methods by a margin of 10% (Table 2). This shows that the multi-task loss and the adversarial loss work in complementary ways. Our insight of combining the two auxiliary losses is a simple yet effective way of disentangling the latent space.
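One common instantiation of the encoder-side adversarial objective is to train a style discriminator on the content embedding and have the encoder maximize the entropy of the discriminator's predictions, so that the content space carries no style signal. The sketch below shows that encoder-side term under this assumption; the function name and the NumPy formulation are ours:

```python
import numpy as np

def adversarial_entropy_loss(probs):
    """Encoder-side adversarial objective: maximize the entropy of the style
    discriminator's predicted distribution over styles, given only the
    content embedding. Returned as a loss to minimize (negative entropy);
    probabilities are clipped for numerical stability."""
    probs = np.clip(probs, 1e-12, 1.0)
    entropy = -np.sum(probs * np.log(probs), axis=-1)
    return -float(np.mean(entropy))
```

The multi-task loss, by contrast, is an ordinary cross-entropy on the style embedding, so the two losses pull the style signal into the style space and out of the content space simultaneously.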

However, the style-oriented losses only regularize the style information, leading to a gradual drop in the content-preservation scores. Our second insight is to introduce content-oriented auxiliary losses, a multi-task loss and an adversarial loss based on BoW features, which regularize the content information in the same way as the style information. By incorporating all of these auxiliary losses, we achieve high transfer accuracy, high content preservation, and high language fluency.
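The BoW feature underlying the content-oriented losses can be sketched as a normalized bag-of-words distribution over a style-neutral, non-stopword vocabulary (see Appendix A for the vocabulary ablation). The function name and dict-based representation below are illustrative assumptions:

```python
from collections import Counter

def bow_content_target(tokens, content_vocab):
    """Normalized bag-of-words distribution over a style-neutral,
    non-stopword vocabulary, usable as the target for a content-oriented
    multi-task loss on the content embedding."""
    counts = Counter(t for t in tokens if t in content_vocab)
    total = sum(counts.values())
    if total == 0:
        return {w: 0.0 for w in content_vocab}
    return {w: counts[w] / total for w in content_vocab}
```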

Table 5 provides several examples generated by our style-transfer model. The results show that we can successfully transfer the sentiment while preserving the content of a sentence, and that, with the empirically estimated style vector, we can reliably control the sentiment of the generated sentences.
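The empirically estimated style vector used at transfer time can be sketched as the mean of the inferred style embeddings of all training sentences carrying the target label; this vector then replaces the source sentence's style embedding before decoding. The function name and NumPy formulation are our assumptions:

```python
import numpy as np

def empirical_style_vector(style_embeddings, labels, target_label):
    """Estimate a target-style vector as the mean of the inferred style
    embeddings of all training sentences with the target label."""
    embs = np.asarray(style_embeddings, dtype=float)
    mask = np.asarray(labels) == target_label
    return embs[mask].mean(axis=0)
```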

5 Conclusion

In this paper, we propose a simple yet effective approach for disentangling the latent space of neural networks. We combine multi-task and adversarial objectives to separate content and style information from each other, and propose to approximate content information with bag-of-words features of style-neutral, non-stopword vocabulary.

Both qualitative and quantitative experiments show that the latent space is indeed separated into style and content parts. This disentangled space can be directly applied to text style-transfer tasks, achieving substantially better style-transfer strength, content-preservation scores, and language fluency than previous state-of-the-art work.


  • [Arjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein generative adversarial networks. In ICML, 214–223.
  • [Bahuleyan et al.2018] Bahuleyan, H.; Mou, L.; Vamaraju, K.; Zhou, H.; and Vechtomova, O. 2018. Probabilistic natural language generation with wasserstein autoencoders. arXiv preprint arXiv:1806.08462.
  • [Balikas, Moura, and Amini2017] Balikas, G.; Moura, S.; and Amini, M.-R. 2017. Multitask learning for fine-grained twitter sentiment analysis. In SIGIR, 1005–1008.
  • [Bowman et al.2016] Bowman, S. R.; Vilnis, L.; Vinyals, O.; Dai, A.; Jozefowicz, R.; and Bengio, S. 2016. Generating sentences from a continuous space. In CoNLL, 10–21.
  • [Champandard2016] Champandard, A. J. 2016. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768.
  • [Chen et al.2016] Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2172–2180.
  • [Cho et al.2014] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP, 1724–1734.
  • [Fu et al.2018] Fu, Z.; Tan, X.; Peng, N.; Zhao, D.; and Yan, R. 2018. Style transfer in text: Exploration and evaluation. In AAAI, 663–670.
  • [Gatys, Ecker, and Bethge2016] Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image style transfer using convolutional neural networks. In CVPR, 2414–2423.
  • [Hu and Liu2004] Hu, M., and Liu, B. 2004. Mining and summarizing customer reviews. In KDD, 168–177.
  • [Hu et al.2017] Hu, Z.; Yang, Z.; Liang, X.; Salakhutdinov, R.; and Xing, E. P. 2017. Toward controlled generation of text. In ICML, 1587–1596.
  • [Jernite, Bowman, and Sontag2017] Jernite, Y.; Bowman, S. R.; and Sontag, D. 2017. Discourse-based objectives for fast unsupervised sentence representation learning. arXiv preprint arXiv:1705.00557.
  • [Kim2014] Kim, Y. 2014. Convolutional neural networks for sentence classification. In EMNLP, 1746–1751.
  • [Kingma and Ba2014] Kingma, D. P., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. International Conference on Learning Representations.
  • [Kneser and Ney1995] Kneser, R., and Ney, H. 1995. Improved backing-off for m-gram language modeling. In ICASSP, volume 1, 181–184.
  • [Krippendorf2004] Krippendorf, K. 2004. Content analysis: An introduction to its methodology. London: SAGE.
  • [Kulkarni et al.2015] Kulkarni, T. D.; Whitney, W. F.; Kohli, P.; and Tenenbaum, J. 2015. Deep convolutional inverse graphics network. In NIPS, 2539–2547.
  • [Kullback and Leibler1951] Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. The Annals of Mathematical Statistics 22(1):79–86.
  • [Luan et al.2017] Luan, F.; Paris, S.; Shechtman, E.; and Bala, K. 2017. Deep photo style transfer. In CVPR, 4990–4998.
  • [Luong et al.2015] Luong, M.-T.; Le, Q. V.; Sutskever, I.; Vinyals, O.; and Kaiser, L. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.
  • [Mathieu et al.2016] Mathieu, M. F.; Zhao, J. J.; Zhao, J.; Ramesh, A.; Sprechmann, P.; and LeCun, Y. 2016. Disentangling factors of variation in deep representation using adversarial training. In NIPS, 5040–5048.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, 3111–3119.
  • [Rao and Tetreault2018] Rao, S., and Tetreault, J. 2018. Dear sir or madam, may i introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In NAACL, volume 1, 129–140.
  • [Rumelhart, Hinton, and Williams1985] Rumelhart, D. E.; Hinton, G. E.; and Williams, R. J. 1985. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
  • [Shen et al.2017] Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style transfer from non-parallel text by cross-alignment. In NIPS, 6833–6844.
  • [Stent, Marge, and Singhai2005] Stent, A.; Marge, M.; and Singhai, M. 2005. Evaluating evaluation methods for generation in the presence of variation. In Int. Conf. Intelligent Text Processing and Computational Linguistics, 341–351.
  • [Sutskever, Vinyals, and Le2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to sequence learning with neural networks. In NIPS, 3104–3112.
  • [Tieleman and Hinton2012] Tieleman, T., and Hinton, G. 2012. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
  • [Zhao et al.2018] Zhao, J. J.; Kim, Y.; Zhang, K.; Rush, A. M.; and LeCun, Y. 2018. Adversarially regularized autoencoders. In ICML, 5897–5906.

Appendix A Supplemental Material

A.1 Bag-of-Words (BoW) Vocabulary Ablation Tests

The tests in Table 6 demonstrate the effect of the choice of vocabulary used for the auxiliary content losses.

BoW Vocabulary                                        Transfer    Cosine        Word      Language
                                                      Strength    Similarity    Overlap   Fluency
Full Corpus Vocabulary                                0.822       0.896         0.344     -10.13
Vocabulary without sentiment words                    0.872       0.901         0.359     -10.33
Vocabulary without stopwords                          0.836       0.894         0.429     -10.06
Vocabulary without stopwords and sentiment words      0.934       0.904         0.473     -9.84

Table 6: Ablation tests on the BoW vocabulary.

It is evident that a BoW vocabulary excluding both stopwords and sentiment words performs best on every quantitative metric.
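The best-performing vocabulary can be constructed by filtering a frequency-ranked corpus vocabulary against a stopword list and a sentiment lexicon (e.g., the polarity lexicon of Hu and Liu 2004). The sketch below is illustrative; the function name and the `max_size` cutoff are our assumptions:

```python
from collections import Counter

def build_bow_vocab(corpus_tokens, stopwords, sentiment_words, max_size=10000):
    """Build a BoW vocabulary by corpus frequency, excluding stopwords and
    sentiment words so the vocabulary is style-neutral.

    corpus_tokens: iterable of tokenized sentences (lists of strings).
    """
    counts = Counter(
        t for sent in corpus_tokens for t in sent
        if t not in stopwords and t not in sentiment_words
    )
    return [w for w, _ in counts.most_common(max_size)]
```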

A.2 t-SNE Plots of Ablation Tests

Figure 3 shows the t-SNE plots of the style and content embeddings, without any auxiliary losses. Figures 4, 5, 6 and 7 show the effect of adding each of the auxiliary losses independently.
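Plots of this kind can be produced by projecting the style and content embeddings to 2-D separately; if disentanglement holds, points cluster by label in the style projection but not in the content projection. A minimal sketch using scikit-learn's `TSNE` (the function name and parameter choices are our assumptions):

```python
import numpy as np
from sklearn.manifold import TSNE

def tsne_style_content(style_embs, content_embs, perplexity=30.0, seed=0):
    """Project style and content embeddings to 2-D with t-SNE for visual
    inspection of the two latent subspaces."""
    def project(embs):
        return TSNE(n_components=2, perplexity=perplexity,
                    init="random", random_state=seed).fit_transform(
                        np.asarray(embs, dtype=float))
    return project(style_embs), project(content_embs)
```

The resulting 2-D points would then be scattered with one color per sentiment label.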

Figure 3: t-SNE plot of VAE latent embeddings with the autoencoding loss only.
Figure 4: t-SNE plot of VAE latent embeddings with the style-oriented multi-task loss added.
Figure 5: t-SNE plot of VAE latent embeddings with the style-oriented adversarial loss added.
Figure 6: t-SNE plot of VAE latent embeddings with the content-oriented multi-task loss added.
Figure 7: t-SNE plot of VAE latent embeddings with the content-oriented adversarial loss added.