Style Transformer: Unpaired Text Style Transfer without Disentangled Latent Representation

05/14/2019 ∙ by Ning Dai, et al. ∙ FUDAN University

Disentangling the content and style in the latent space is prevalent in unpaired text style transfer. However, two major issues exist in most of the current neural models. 1) It is difficult to completely strip the style information from the semantics of a sentence. 2) The recurrent neural network (RNN) based encoder and decoder, mediated by the latent representation, cannot deal well with long-term dependencies, resulting in poor preservation of non-stylistic semantic content. In this paper, we propose the Style Transformer, which makes no assumption about the latent representation of the source sentence and harnesses the attention mechanism of the Transformer to achieve better style transfer and better content preservation.




1 Introduction

Text style transfer is the task of changing the stylistic properties (e.g., sentiment) of a text while retaining its style-independent content. Since the definition of text style is vague, it is difficult to construct paired sentences with the same content and differing styles. Therefore, studies of text style transfer focus on the unpaired setting.

Recently, neural networks have become the dominant methods in text style transfer. Most of the previous methods Hu et al. (2017); Shen et al. (2017); Fu et al. (2018); Carlson et al. (2017); Zhang et al. (2018b, a); Prabhumoye et al. (2018); Jin et al. (2019) formulate the style transfer problem within the “encoder-decoder” framework. The encoder maps the text into a style-independent latent representation (vector representation), and the decoder generates a new text with the same content but a different style from the disentangled latent representation plus a style variable.

These methods focus on how to disentangle the content and style in the latent space. The latent representation needs to preserve the meaning of the text while reducing its stylistic properties. Because paired sentences are lacking, an adversarial loss Goodfellow et al. (2014) is used to discourage encoding style in the latent space. Although a disentangled latent representation brings better interpretability, in this paper we raise the following concerns about these models.

1) It is difficult to judge the quality of disentanglement. As reported in Elazar and Goldberg (2018); Lample et al. (2019), the style information can still be recovered from the latent representation even when the model has been trained adversarially. Therefore, it is not easy to disentangle the stylistic property from the semantics of a sentence.

2) Disentanglement is also unnecessary. Lample et al. (2019) also found that a good decoder can generate the text with the desired style from an entangled latent representation by “overwriting” the original style.

3) Due to the limited capacity of a vector representation, it is hard for the latent representation to capture rich semantic information, especially for long text. Recent progress in neural machine translation also suggests that it is hard to recover the target sentence from the latent representation without referring to the original sentence.

4) Most of these models adopt recurrent neural networks (RNNs) as encoder and decoder, which have a weak ability to capture long-range dependencies between words in a sentence. Besides, without referring to the original text, it is hard for an RNN-based decoder to preserve the content. The generation quality for long text is also hard to control.

In this paper, we address the above concerns of disentangled models for style transfer. Unlike them, we propose the Style Transformer, which takes the Transformer Vaswani et al. (2017) as the basic block. The Transformer is a fully-connected self-attention neural architecture, which has achieved many exciting results on natural language processing (NLP) tasks such as machine translation Vaswani et al. (2017), language modeling Dai et al. (2019), and text classification Devlin et al. (2018). Unlike RNNs, the Transformer uses stacked self-attention and point-wise, fully connected layers for both the encoder and decoder. Moreover, the Transformer decoder fetches information from the encoder via the attention mechanism, rather than from a fixed-size vector as in RNNs. With the strong ability of the Transformer, our model can transfer the style of a sentence while better preserving its meaning. The difference between our model and previous models is shown in Figure 1.

Our contributions are summarized as follows:

  • To the best of our knowledge, this is the first work that applies the Transformer architecture to the style transfer task.

  • We introduce a novel training algorithm which makes no assumptions about the disentangled latent representations of the input sentences, and thus the model can employ attention mechanisms to improve its performance further.

  • Experimental results show that our proposed approach generally outperforms the other approaches on two style transfer datasets. In particular, for content preservation, the Style Transformer achieves the best performance with a significant improvement.

2 Related Work

Recently, many text style transfer approaches have been proposed. Among these approaches, one line of work aims to infer a latent representation for the input sentence and manipulate the style of the generated sentence based on this learned latent representation. Shen et al. (2017) propose a cross-aligned auto-encoder with adversarial training to learn a shared latent content distribution and a separate latent style distribution. Hu et al. (2017) propose a new neural generative model which combines variational auto-encoders and holistic attribute discriminators for the effective imposition of semantic structures. Following their work, many methods Fu et al. (2018); John et al. (2018); Zhang et al. (2018a, b) have been proposed based on the standard encoder-decoder architecture.

Although learning a latent representation makes the model more interpretable and easier to manipulate, a model that assumes a fixed-size latent representation can no longer utilize the information from the source sentence.

On the other hand, some approaches that do not manipulate a latent representation have also been proposed recently. Xu et al. (2018) propose a cycled reinforcement learning method for the unpaired sentiment-to-sentiment translation task. Li et al. (2018) propose a three-stage method. Their model first extracts content words by deleting phrases with a strong attribute value, then retrieves new phrases associated with the target attribute, and finally uses a neural model to combine these into a final output. Lample et al. (2019) reduce text style transfer to an unsupervised machine translation problem Lample et al. (2018). They employ denoising auto-encoders Vincent et al. (2008) and back-translation Sennrich et al. (2016) to build a translation system between different styles.

However, both lines of previous models make few attempts to utilize the attention mechanism to refer to the long-term history or the source sentence, except Lample et al. (2019). In many NLP tasks, especially text generation, the attention mechanism has proved to be an essential technique for enabling the model to capture long-term dependencies Bahdanau et al. (2014); Luong et al. (2015); Vaswani et al. (2017).

In this paper, we follow the second line of work and propose a novel method which makes no assumption about the latent representation of the source sentence and takes the proven self-attention network, the Transformer, as a basic module to train a style transfer system.



Figure 1: General illustration of previous models and our model: (a) Disentangled Style Transfer; (b) Style Transformer. $z$ denotes the style-independent content vector and $s$ denotes the style variable.

3 Style Transformer

To make our discussion clearer, in this section we first give a brief introduction to the style transfer task, and then discuss our proposed model based on our problem definition.

3.1 Problem Formalization

In this paper, we define the style transfer problem as follows. Consider a collection of datasets $\{\mathcal{D}_i\}_{i=1}^{K}$, where each dataset $\mathcal{D}_i$ is composed of many natural language sentences. All of the sentences in a single dataset $\mathcal{D}_i$ share some common characteristic (e.g., they are all positive reviews of a specific product), and we refer to this common characteristic as the style of these sentences. In other words, a style is defined by the distribution of a dataset. Suppose we have $K$ different datasets $\mathcal{D}_i$; then we can define $K$ different styles, and we denote each style by the symbol $s^{(i)}$. The goal of style transfer is: given an arbitrary natural language sentence $x$ and a desired style $\hat{s} \in \{s^{(i)}\}_{i=1}^{K}$, rewrite this sentence into a new one $\hat{x}$ which has the style $\hat{s}$ and preserves the information in the original sentence $x$ as much as possible.

3.2 Model Overview

To tackle the style transfer problem defined above, our goal is to learn a mapping function $f_\theta(x, s)$, where $x$ is a natural language sentence and $s$ is a style control variable. The output of this function is the transferred sentence $\hat{x}$ for the input sentence $x$.

A big challenge in text style transfer is that we have no access to parallel corpora, so we cannot directly obtain supervision to train our transfer model. In Section 3.4, we employ two discriminator-based approaches to create supervision from non-parallel corpora.

Finally, we combine the Transformer network and the discriminator network via an overall learning algorithm in Section 3.5 to train our style transfer system.

3.3 Transformer Network

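Generally, the Transformer follows the standard encoder-decoder architecture. Explicitly, for an input sentence $x = (x_1, x_2, \dots, x_n)$, the Transformer encoder $\mathrm{Enc}(x; \theta_E)$ maps the inputs to a sequence of continuous representations $z = (z_1, z_2, \dots, z_n)$. The Transformer decoder $\mathrm{Dec}(z; \theta_D)$ estimates the conditional probability of the output sentence $y = (y_1, y_2, \dots, y_m)$ by factorizing it auto-regressively:

$$p_\theta(y \mid x) = \prod_{t=1}^{m} p_\theta(y_t \mid z, y_1, \dots, y_{t-1}). \tag{1}$$

At each time step $t$, the probability of the next token is computed by a softmax classifier:

$$p_\theta(y_t \mid z, y_1, \dots, y_{t-1}) = \mathrm{softmax}(o_t), \tag{2}$$

where $o_t$ is the logit vector output by the decoder network.

To enable style control in the standard Transformer framework, we add an extra style embedding as input to the Transformer encoder, $\mathrm{Enc}(x, s; \theta_E)$. The network can therefore compute the probability of the output conditioned on both the input sentence $x$ and the style control variable $s$. Formally, this can be expressed as:

$$p_\theta(y \mid x, s) = \prod_{t=1}^{m} p_\theta(y_t \mid z, y_1, \dots, y_{t-1}), \tag{3}$$

and we denote the predicted output sentence of this network by $f_\theta(x, s)$.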
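The autoregressive factorization and the per-step softmax described above can be illustrated with a small, self-contained sketch. It is only an illustration under stated assumptions: the function names `softmax` and `sequence_log_prob` and the toy logits are hypothetical, and the logit vectors stand in for whatever a real decoder would output at each time step.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_prob(step_logits, target_ids):
    """Log of the factorized sequence probability.

    step_logits: one logit vector per time step (hypothetical decoder output).
    target_ids:  index of the token emitted at each time step.
    The product over time steps becomes a sum in log space.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token")
    total = 0.0
    for logits, token in zip(step_logits, target_ids):
        total += math.log(softmax(logits)[token])
    return total


# Toy example: a vocabulary of 3 tokens and an output of length 2.
toy_logits = [[2.0, 0.5, 0.1], [0.3, 0.3, 3.0]]
toy_targets = [0, 2]
log_p = sequence_log_prob(toy_logits, toy_targets)
```

Working in log space avoids numerical underflow when the per-token probabilities of a long sentence are multiplied together.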

3.4 Discriminator Network

Suppose we use $x$ and $s$ to denote a sentence and its style from the dataset $\mathcal{D}$. Because of the absence of parallel corpora, we cannot directly obtain supervision for the case $f_\theta(x, \hat{s})$ where $s \neq \hat{s}$. Therefore, we introduce a discriminator network to learn this supervision from the non-parallel corpora.

The intuition behind the training of the discriminator is as follows. As mentioned above, we only have supervision for the case $f_\theta(x, s)$. In this case, because the input sentence $x$ and the chosen style $s$ both come from the same dataset $\mathcal{D}$, one of the optimum solutions is to reproduce the input sentence. Thus, we can train our network to reconstruct the input in this case. In the case of $f_\theta(x, \hat{s})$ where $s \neq \hat{s}$, we construct supervision in two ways. 1) For content preservation, we train the network to reconstruct the original input sentence $x$ when we feed the transferred sentence $\hat{y} = f_\theta(x, \hat{s})$ to the Transformer network with the original style label $s$. 2) For style control, we train a discriminator network to help the Transformer network better control the style of the generated sentence.

In short, the discriminator network is another Transformer encoder, which learns to distinguish the styles of different sentences, and the Transformer network receives style supervision from this discriminator. To achieve this goal, we experiment with two different discriminator architectures.

Conditional Discriminator

In a setting similar to Conditional GANs Mirza and Osindero (2014), the discriminator makes its decision conditioned on an input style. Explicitly, a sentence $x$ and a proposal style $s$ are fed into the discriminator $d_\phi(x, s)$, and the discriminator is asked to answer whether the input sentence has the corresponding style. In the discriminator training stage, the real sentence $x$ from the dataset and the reconstructed sentence $y = f_\theta(x, s)$ are labeled as positive, and the transferred sentences $\hat{y} = f_\theta(x, \hat{s})$, where $s \neq \hat{s}$, are labeled as negative. In the Transformer network training stage, the network $f_\theta$ is trained to maximize the probability of the positive class when $f_\theta(x, \hat{s})$ and $\hat{s}$ are fed to the discriminator.

Multi-class Discriminator

Different from the previous one, in this case only one sentence $x$ is fed into the discriminator $d_\phi(x)$, and the discriminator aims to answer the style of this sentence. More concretely, the discriminator is a classifier with $K+1$ classes. $K$ of the classes represent the $K$ different styles, and the remaining class stands for the data generated by $f_\theta(x, \hat{s})$, which is also often referred to as the fake sample. In the discriminator training stage, we label the real sentences $x$ and the reconstructed sentences $y = f_\theta(x, s)$ with the label of the corresponding style. The transferred sentence $\hat{y} = f_\theta(x, \hat{s})$, where $s \neq \hat{s}$, is labeled as class $0$. In the Transformer network learning stage, we train the network $f_\theta(x, \hat{s})$ to maximize the probability of the class which stands for style $\hat{s}$.

Figure 2: The training process for the Style Transformer network. The input sentence $x$ and an input style $\hat{s}$ are fed into the Transformer network $f_\theta$. If the input style $\hat{s}$ is the same as the style $s$ of the sentence $x$, the generated sentence $\hat{y}$ is trained to reconstruct $x$. Otherwise, the generated sentence $\hat{y}$ is fed into the Transformer $f_\theta$ and the discriminator $d_\phi$ to reconstruct the input sentence $x$ and the input style $\hat{s}$, respectively.

3.5 Learning Algorithm

In this section, we discuss how to train these two networks. The training algorithm of our model can be divided into two parts: discriminator learning and Transformer network learning. A brief illustration is shown in Figure 2.

3.5.1 Discriminator Learning

Loosely speaking, in the discriminator training stage, we train our discriminator to distinguish the real sentence $x$ and the reconstructed sentence $y = f_\theta(x, s)$ from the transferred sentence $\hat{y} = f_\theta(x, \hat{s})$. The loss function for the discriminator is simply the cross-entropy loss of the classification problem.

For the conditional discriminator:

$$\mathcal{L}_{\mathrm{discriminator}}(\phi) = -\log p_\phi(c \mid x, s). \tag{4}$$

And for the multi-class discriminator:

$$\mathcal{L}_{\mathrm{discriminator}}(\phi) = -\log p_\phi(c \mid x). \tag{5}$$

Here $c$ denotes the class label assigned to the discriminator input. The protocol for labeling the sentences differs between the two discriminator architectures; the details can be found in Algorithm 1.

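Input: Transformer network $f_\theta$, discriminator $d_\phi$, and a dataset $\mathcal{D}_i$ with style $s$
Sample a minibatch of $m$ sentences $\{x_1, x_2, \dots, x_m\}$ from $\mathcal{D}_i$;
foreach $x \in \{x_1, x_2, \dots, x_m\}$ do
    Randomly sample a style $\hat{s}$ ($s \neq \hat{s}$);
    Use $f_\theta$ to generate two new sentences
        $y = f_\theta(x, s)$,
        $\hat{y} = f_\theta(x, \hat{s})$;
    if $d_\phi$ is a conditional discriminator then
        Label $\{(x, s), (y, s)\}$ as 1;
        Label $\{(\hat{y}, \hat{s})\}$ as 0;
    else
        Label $\{x, y\}$ as $i$;
        Label $\{\hat{y}\}$ as 0;
    end if
    Compute the loss for $d_\phi$ by Eq. (4) or (5).
end foreach
Algorithm 1 Discriminator Learning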
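The labeling protocol for the two discriminator variants can be sketched in a few lines of Python. This is a minimal sketch following the description in Section 3.4, not the authors' implementation: `discriminator_examples`, `toy_transfer`, and the `style_class` mapping (styles indexed from 1, with class 0 reserved for transferred text) are hypothetical names and conventions chosen for illustration.

```python
def discriminator_examples(x, s, s_hat, transfer, conditional, style_class):
    """Return (discriminator input, label) pairs for one real sentence x of style s.

    transfer(sentence, style) is a hypothetical stand-in for the Transformer network.
    style_class maps each style to a class index 1..K; class 0 marks generated
    (transferred) text. This indexing is an assumption of the sketch.
    """
    if s == s_hat:
        raise ValueError("the sampled style must differ from the sentence style")
    y = transfer(x, s)          # reconstruction under the original style
    y_hat = transfer(x, s_hat)  # transfer to the sampled style
    if conditional:
        # Input is a (sentence, style) pair; label 1 = positive, 0 = negative.
        return [((x, s), 1), ((y, s), 1), ((y_hat, s_hat), 0)]
    # Input is a sentence only; label is a style class, or 0 for transferred text.
    return [(x, style_class[s]), (y, style_class[s]), (y_hat, 0)]


def toy_transfer(sentence, style):
    """Placeholder generator that just tags the sentence with the style."""
    return f"{sentence} [{style}]"


styles = {"negative": 1, "positive": 2}
cond = discriminator_examples(
    "great food", "positive", "negative", toy_transfer, True, styles)
multi = discriminator_examples(
    "great food", "positive", "negative", toy_transfer, False, styles)
```

In both variants the real and reconstructed sentences share a label, so the discriminator is only asked to separate them from transferred sentences.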

3.5.2 Transformer Network Learning

The training of the Transformer network depends on which case of $f_\theta(x, \hat{s})$ applies: $s = \hat{s}$ or $s \neq \hat{s}$.

Self Reconstruction

For the case $s = \hat{s}$, or equivalently the case $f_\theta(x, s)$: as discussed before, the input sentence $x$ and the input style $s$ come from the same dataset $\mathcal{D}$, so we can simply train our Transformer network to reconstruct the input sentence by minimizing the negative log-likelihood:

$$\mathcal{L}_{\mathrm{self}}(\theta) = -\log p_\theta(y = x \mid x, s). \tag{6}$$

For the case $s \neq \hat{s}$, we cannot obtain direct supervision from our training set, so we introduce two different training losses to create supervision indirectly.

Cycle Reconstruction

To encourage the generated sentence to preserve the information in the input sentence $x$, we feed the generated sentence $\hat{y} = f_\theta(x, \hat{s})$ to the Transformer network with the style of $x$ and train our network to reconstruct the original input sentence by minimizing the negative log-likelihood:

$$\mathcal{L}_{\mathrm{cycle}}(\theta) = -\log p_\theta(y = x \mid f_\theta(x, \hat{s}), s). \tag{7}$$

Style Controlling

If we only train our Transformer network to reconstruct the input sentence $x$ from the transferred sentence $\hat{y} = f_\theta(x, \hat{s})$, the network can only learn to copy the input to the output. To handle this degeneration problem, we further add a style controlling loss for the generated sentence. Namely, the generated sentence $\hat{y}$ is fed into the discriminator to maximize the probability of style $\hat{s}$.

For the conditional discriminator, the Transformer network aims to minimize the negative log-likelihood of class $1$ when $\hat{y}$ is fed to the discriminator with the style label $\hat{s}$:

$$\mathcal{L}_{\mathrm{style}}(\theta) = -\log p_\phi(c = 1 \mid f_\theta(x, \hat{s}), \hat{s}). \tag{8}$$

In the case of the multi-class discriminator, the Transformer network is trained to minimize the negative log-likelihood of the class corresponding to style $\hat{s}$:

$$\mathcal{L}_{\mathrm{style}}(\theta) = -\log p_\phi(c = \hat{s} \mid f_\theta(x, \hat{s})). \tag{9}$$

Combining the loss functions discussed above, the training procedure of the Transformer network is summarized in Algorithm 2.

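Input: Transformer network $f_\theta$, discriminator $d_\phi$, and a dataset $\mathcal{D}_i$ with style $s$
Sample a minibatch of $m$ sentences $\{x_1, x_2, \dots, x_m\}$ from $\mathcal{D}_i$;
foreach $x \in \{x_1, x_2, \dots, x_m\}$ do
    Randomly sample a style $\hat{s}$ ($s \neq \hat{s}$);
    Use $f_\theta$ to generate two new sentences
        $y = f_\theta(x, s)$,
        $\hat{y} = f_\theta(x, \hat{s})$;
    Compute $\mathcal{L}_{\mathrm{self}}(\theta)$ for $y$ by Eq. (6);
    Compute $\mathcal{L}_{\mathrm{cycle}}(\theta)$ for $\hat{y}$ by Eq. (7);
    Compute $\mathcal{L}_{\mathrm{style}}(\theta)$ for $\hat{y}$ by Eq. (8) or (9);
end foreach
Algorithm 2 Transformer Network Learning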
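To make the three loss terms concrete, the following sketch computes them for a single sentence as negative log-likelihoods. It is an illustration under stated assumptions rather than the authors' code: `transformer_losses` and the three callables (`transfer`, `seq_prob`, `style_prob`) are hypothetical placeholders for the Transformer network and the discriminator, and the toy probabilities are made up.

```python
import math


def transformer_losses(x, s, s_hat, transfer, seq_prob, style_prob):
    """Return the self, cycle, and style loss terms for one sentence x of style s.

    transfer(sentence, style)        -> generated sentence (hypothetical)
    seq_prob(target, source, style)  -> probability the Transformer assigns to
                                        `target` given `source` and `style` (hypothetical)
    style_prob(sentence, style)      -> discriminator probability that `sentence`
                                        has `style` (hypothetical)
    """
    y_hat = transfer(x, s_hat)
    return {
        # Reconstruct x from x itself under its own style.
        "self": -math.log(seq_prob(x, x, s)),
        # Reconstruct x from the transferred sentence under the original style.
        "cycle": -math.log(seq_prob(x, y_hat, s)),
        # Make the discriminator accept the transferred sentence as style s_hat.
        "style": -math.log(style_prob(y_hat, s_hat)),
    }


def toy_transfer(sentence, style):
    return sentence + " <" + style + ">"


def toy_seq_prob(target, source, style):
    return 0.5 if source == target else 0.25


def toy_style_prob(sentence, style):
    return 0.8


losses = transformer_losses(
    "good food", "pos", "neg", toy_transfer, toy_seq_prob, toy_style_prob)
```

The sketch returns the terms separately and leaves their combination open, since how they are weighted is a training choice outside this illustration.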

3.5.3 Summarization and Discussion

Finally, we construct our final training algorithm from the discriminator learning and Transformer network learning steps. Similar to the training process of GANs Goodfellow et al. (2014), in each training iteration we first perform $n_d$ steps of discriminator learning to get a better discriminator, and then train our Transformer network for $n_f$ steps to improve its performance. The training process is summarized in Algorithm 3.

Input: A collection of datasets $\{\mathcal{D}_i\}_{i=1}^{K}$, each representing a different style $s^{(i)}$
Initialize the Transformer network $f_\theta$ and the discriminator network $d_\phi$ with random weights $\theta$, $\phi$;
repeat
    for $n_d$ steps do
        foreach dataset $\mathcal{D}_i$ do
            Accumulate loss by Algorithm 1
        end foreach
        Perform gradient descent to update $d_\phi$.
    end for
    for $n_f$ steps do
        foreach dataset $\mathcal{D}_i$ do
            Accumulate loss by Algorithm 2
        end foreach
        Perform gradient descent to update $f_\theta$.
    end for
until network $f_\theta$ converges;
Algorithm 3 Training Algorithm
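The alternating schedule can be sketched as a plain Python loop. This is only a skeleton under stated assumptions: `run_training`, the loss and update callables, and `max_iterations` (a stand-in for the convergence check) are hypothetical, and the constant toy losses exist solely to make the update order visible.

```python
def run_training(datasets, disc_loss, trans_loss, update_disc, update_trans,
                 n_d, n_f, max_iterations):
    """Alternate n_d discriminator updates with n_f Transformer updates.

    Each update uses the loss accumulated over all datasets.
    max_iterations replaces the "until the network converges" condition.
    """
    for _ in range(max_iterations):
        for _ in range(n_d):
            update_disc(sum(disc_loss(d) for d in datasets))
        for _ in range(n_f):
            update_trans(sum(trans_loss(d) for d in datasets))


# Record the order of updates with toy, constant losses.
trace = []
run_training(
    datasets=["positive_reviews", "negative_reviews"],
    disc_loss=lambda d: 1.0,
    trans_loss=lambda d: 2.0,
    update_disc=lambda loss: trace.append(("D", loss)),
    update_trans=lambda loss: trace.append(("F", loss)),
    n_d=2,
    n_f=1,
    max_iterations=2,
)
```

With `n_d=2` and `n_f=1`, every outer iteration performs two discriminator updates followed by one Transformer update.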

Before finishing this section, we discuss a problem we face in the training process. Because of the discrete nature of natural language, for the generated sentence $\hat{y} = f_\theta(x, \hat{s})$, we cannot directly propagate gradients from the discriminator through the discrete samples. To handle this problem, one can use REINFORCE Williams (1992) or the Gumbel-Softmax trick Kusner and Hernández-Lobato (2016) to estimate gradients from the discriminator. However, both approaches suffer from high variance, which makes the model hard to converge. For these reasons, empirically, we view the softmax distribution generated by $f_\theta$ as a “soft” generated sentence and feed this distribution to the downstream network to keep the whole training process continuous. When this approximation is used, we also switch our decoder network from greedy decoding to continuous decoding. That is, at every time step, instead of feeding the token that has the maximum probability in the previous prediction step to the network, we feed the whole softmax distribution (Eq. (2)) to the network, and the decoder uses this distribution to compute a weighted average embedding from the embedding matrix for the input.
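The weighted average embedding used in continuous decoding can be illustrated with a tiny example. This is a sketch with hypothetical names (`soft_embedding`, `embeddings`) and a made-up 3-token, 2-dimensional embedding matrix; it only shows the arithmetic, not a trainable model.

```python
def soft_embedding(distribution, embedding_matrix):
    """Expected embedding under a token distribution.

    distribution:     one probability per vocabulary entry.
    embedding_matrix: one embedding row per vocabulary entry.
    Returns the probability-weighted average of the rows.
    """
    if len(distribution) != len(embedding_matrix):
        raise ValueError("need one probability per embedding row")
    mixed = [0.0] * len(embedding_matrix[0])
    for weight, row in zip(distribution, embedding_matrix):
        for j, value in enumerate(row):
            mixed[j] += weight * value
    return mixed


embeddings = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
# A soft distribution mixes several rows.
soft = soft_embedding([0.5, 0.25, 0.25], embeddings)
# A one-hot distribution recovers the row that greedy decoding would pick.
hard = soft_embedding([0.0, 1.0, 0.0], embeddings)
```

Because the output is a smooth function of the probabilities, gradients can flow back through the distribution, which is what a discrete argmax choice would block.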

4 Experiment

4.1 Datasets

We evaluated and compared our approach with several state-of-the-art systems on two review datasets, Yelp Review Dataset (Yelp) and IMDb Movie Review Dataset (IMDb). The statistics of the two datasets are shown in Table 1.

Yelp Review Dataset (Yelp) The Yelp dataset is provided by the Yelp Dataset Challenge, consisting of restaurant and business reviews with sentiment labels (negative or positive). Following previous work, we use the processed dataset provided by Li et al. (2018). Additionally, it also provides human reference sentences for the test set.

IMDb Movie Review Dataset (IMDb) The IMDb dataset consists of movie reviews written by online users. To get a high quality dataset, we use the highly polar movie reviews provided by Maas et al. (2011). Based on this dataset, we construct a highly polar sentence-level style transfer dataset by the following steps: 1) fine-tune a BERT Devlin et al. (2018) classifier on the original training set, which achieves accuracy on test set; 2) split each review in the original dataset into several sentences; 3) filter out sentences to which our fine-tuned BERT classifier assigns a confidence below 0.9; 4) remove sentences with uncommon words. Finally, this dataset contains 366K, 4K, and 2K sentences for training, validation, and testing, respectively.
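Steps 2) to 4) of this construction can be sketched as a small filtering pipeline. The sketch makes several assumptions that are not specified in the text: sentences are split with a simple punctuation-based regular expression, "uncommon words" are defined as words outside a given `vocabulary` set, and `classify` is a hypothetical stand-in for the fine-tuned BERT classifier that returns a label and a confidence.

```python
import re


def build_sentence_dataset(reviews, classify, vocabulary, threshold=0.9):
    """Split reviews into sentences and keep confident, common-word sentences.

    classify(sentence) -> (label, confidence); hypothetical classifier stand-in.
    vocabulary: set of lowercase words considered common (an assumption).
    """
    kept = []
    for review in reviews:
        # Naive sentence splitting on end-of-sentence punctuation (an assumption).
        for sentence in re.split(r"(?<=[.!?])\s+", review.strip()):
            if not sentence:
                continue
            label, confidence = classify(sentence)
            if confidence < threshold:
                continue  # drop low-confidence sentences
            words = [tok.strip(".,!?") for tok in sentence.lower().split()]
            if any(word not in vocabulary for word in words):
                continue  # drop sentences containing uncommon words
            kept.append((sentence, label))
    return kept


def toy_classify(sentence):
    """Keyword-based placeholder for a real sentiment classifier."""
    lowered = sentence.lower()
    if "great" in lowered:
        return ("positive", 0.99)
    if "terrible" in lowered:
        return ("negative", 0.97)
    return ("positive", 0.6)


toy_vocabulary = {"great", "movie", "it", "was", "okay", "i", "guess",
                  "terrible", "acting"}
toy_reviews = ["Great movie. It was okay I guess. Terrible zxqv acting!"]
dataset = build_sentence_dataset(toy_reviews, toy_classify, toy_vocabulary)
```

In the toy run, the second sentence is dropped for low confidence and the third for containing an out-of-vocabulary word.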

Dataset      Yelp                    IMDb
             Positive   Negative     Positive   Negative
Train        266,041    177,218      178,869    187,597
Dev.         2,000      2,000        2,000      2,000
Test         500        500          1,000      1,000
Avg. Len.    8.9                     18.5
Table 1: Dataset statistics.

Model                                 Yelp                                   IMDb
                                      ACC    ref-BLEU  self-BLEU  PPL        ACC    self-BLEU  PPL
Input Copy                            3.3    23        100        11         5.2    100        5
RetrieveOnly Li et al. (2018)         92.9   0.4       0.7        10         N/A    N/A        N/A
TemplateBased Li et al. (2018)        84.2   13.7      44.1       67         N/A    N/A        N/A
DeleteOnly Li et al. (2018)           85.5   9.7       28.6       79         N/A    N/A        N/A
DeleteAndRetrieve Li et al. (2018)    88.0   10.4      29.1       61         58.7   55.4       18
ControlledGen Hu et al. (2017)        88.9   14.3      45.7       201        93.9   62.1       58
CrossAlignment Shen et al. (2017)     76.3   4.3       13.2       90         N/A    N/A        N/A
MultiDecoder Fu et al. (2018)         49.9   9.2       37.9       127        N/A    N/A        N/A
CycleRL Xu et al. (2018)              88.0   2.8       7.2        204        97.6   4.9        246
Ours (Conditional)                    93.6   17.1      45.3       78         86.8   66.2       38
Ours (Multi-Class)                    87.6   20.3      54.9       50         79.7   70.5       29
Table 2: Automatic evaluation results on the Yelp and IMDb datasets.

4.2 Evaluation

A good transferred sentence should be fluent, content-complete, and in the target style. To evaluate the performance of the different models, following previous work, we compare three dimensions of the generated samples: 1) style control, 2) content preservation, and 3) fluency.

4.2.1 Automatic Evaluation

Style Control We measure style control automatically by evaluating the target sentiment accuracy of transferred sentences. For an accurate evaluation of style control, we trained two sentiment classifiers on the training sets of Yelp and IMDb using fastText.

Content Preservation To measure content preservation, we calculate the BLEU score Papineni et al. (2002) between the transferred sentence and its source input using NLTK. A higher BLEU score indicates that the transferred sentence achieves better content preservation by retaining more words from the source sentence. If a human reference is available, we also calculate the BLEU score between the transferred sentence and the corresponding reference. These two BLEU metrics are referred to as self-BLEU and ref-BLEU, respectively.

Fluency Fluency is measured by the perplexity of the transferred sentence; we trained a 5-gram language model on the training sets of the two datasets using KenLM Heafield (2011).
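Two of these metrics reduce to short formulas, sketched below in plain Python. The helper names `style_accuracy` and `perplexity` are hypothetical, the classifier predictions and token log-probabilities are assumed to be supplied by external tools, and the perplexity helper expects natural-log probabilities (scores reported in another base would need converting first).

```python
import math


def style_accuracy(predicted_styles, target_styles):
    """Percentage of transferred sentences classified as the target style."""
    pairs = list(zip(predicted_styles, target_styles))
    if not pairs:
        raise ValueError("need at least one sentence")
    return 100.0 * sum(p == t for p, t in pairs) / len(pairs)


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


acc = style_accuracy(["pos", "neg", "pos", "pos"], ["pos", "neg", "neg", "pos"])
# Four tokens, each assigned probability 0.25 by a hypothetical language model.
ppl = perplexity([math.log(0.25)] * 4)
```

A model that assigns every token probability 0.25 has perplexity 4, which matches the intuition of choosing uniformly among four options at each step.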

4.2.2 Human Evaluation

Due to the lack of parallel data in the style transfer area, automatic metrics are insufficient to evaluate the quality of the transferred sentence. Therefore, we also conduct human evaluation experiments on the two datasets.

We randomly select 100 source sentences (50 for each sentiment) from each test set for human evaluation. For each review, one source input and three anonymous transferred samples are shown to a reviewer, who is asked to choose the best sentence for style control, content preservation, and fluency, respectively.

  • Which sentence has the most opposite sentiment toward the source sentence?

  • Which sentence retains most content from the source sentence?

  • Which sentence is the most fluent one?

To avoid interference from similar or identical generated sentences, “no preference” is also an available answer to these questions.

4.3 Training Details

In all of the experiments, we use a 4-layer Transformer with 4-way multi-head attention for the encoder, the decoder, and the discriminator. The hidden size, embedding size, and positional encoding size in the Transformer are all 256 dimensions. Another embedding matrix with 256 hidden units is used to represent the different styles, and the style embedding is fed into the encoder as an extra token of the input sentence. Positional encoding is not used for the style token. For the discriminator, similar to Radford et al. (2018) and Devlin et al. (2018), we further add a <cls> token to the input, and the output vector at the corresponding position is fed into a softmax classifier, which represents the output of the discriminator.
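The stated hyperparameters and the handling of the style token can be captured in a short sketch. Everything named here is hypothetical (`StyleTransformerConfig`, its field names, and `build_encoder_input`), and combining token embeddings with positional encodings by element-wise addition is an assumption of the sketch; the text only states that the style token receives no positional encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleTransformerConfig:
    """Hyperparameters stated in the text; the field names are hypothetical."""
    num_layers: int = 4
    num_heads: int = 4
    hidden_size: int = 256
    embedding_size: int = 256
    positional_size: int = 256
    style_embedding_size: int = 256


def build_encoder_input(token_vectors, position_vectors, style_vector):
    """Prepend the style vector as an extra token.

    Positional encodings are added to the word tokens only, so the style
    token carries no position information.
    """
    if len(position_vectors) < len(token_vectors):
        raise ValueError("need a positional encoding for every token")
    positioned = [
        [t + p for t, p in zip(tok, pos)]
        for tok, pos in zip(token_vectors, position_vectors)
    ]
    return [list(style_vector)] + positioned


config = StyleTransformerConfig()
# Toy 2-dimensional vectors stand in for the 256-dimensional ones.
encoder_input = build_encoder_input(
    token_vectors=[[1.0, 0.0], [0.0, 1.0]],
    position_vectors=[[0.5, 0.5], [0.25, 0.25]],
    style_vector=[9.0, 9.0],
)
```

The encoder therefore sees one more position than the sentence has tokens, with the style vector in the first slot.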

4.4 Experimental Results

Results using automatic metrics are presented in Table 2. Compared to previous approaches, our models achieve competitive performance overall and better content preservation on both datasets. Our conditional model achieves better style control than the multi-class model. Both of our models are able to generate sentences with relatively low perplexity. For the previous models that perform best on a single metric, an obvious drawback can always be found on another metric.

For the human evaluation, we choose two of the best-performing models according to the automatic evaluation results as competitors: DeleteAndRetrieve (D&R) Li et al. (2018) and Controlled Generation (Ctrl.Gen) Hu et al. (2017). The generated outputs from the multi-class discriminator model are used as our final model. We have performed over 400 human evaluation reviews. Results are presented in Table 3. The human evaluation results are largely consistent with our automatic evaluation results. They also show that our model is better at content preservation than the two competitor models.

Finally, to better understand the characteristics of the different models, we sampled several output sentences from the Yelp dataset, which are shown in Table 4.

Model            Yelp                           IMDb
                 Style   Content   Fluency      Style   Content   Fluency
Ctrl.Gen         16.8    23.6      17.7         30.0    19.5      22.0
D&R              13.6    15.5      21.4         21.0    27.0      25.0
Ours             48.6    36.8      41.4         29.5    35.0      31.5
No Preference    20.9    24.1      19.5         19.5    18.5      21.5
Table 3: Human evaluation results on two datasets. Each cell indicates the proportion of being preferred.
negative to positive
Input the food ’s ok , the service is among the worst i have encountered .
DAR the food ’s ok , the service is among great and service among .
Ctrl the food ’s ok , the service is among the randy i have encountered .
Ours the food ’s delicious , the service is among the best i have encountered .
Human the food is good , and the service is one of the best i ’ve ever encountered .
Input this is the worst walmart neighborhood market out of any of them .
DAR walmart market is one of my favorite places in any neighborhood out of them .
Ctrl fantastic is the randy go neighborhood market out of any of them .
Ours this is the best walmart neighborhood market out of any of them .
Human this is the best walmart out of all of them .
Input always rude in their tone and always have shitty customer service !
DAR i always enjoy going in always their kristen and always have shitty customer service !
Ctrl always good in their tone and always have shitty customer service !
Ours always nice in their tone and always have provides customer service !
Human such nice customer service , they listen to anyones concerns and assist them with it .
positive to negative
Input everything is fresh and so delicious !
DAR small impression was ok , but lacking i have piss stuffing night .
Ctrl everything is disgrace and so bland !
Ours everything is overcooked and so cold !
Input these two women are professionals .
DAR these two scam women are professionals .
Ctrl shame two women are unimpressive .
Ours these two women are amateur .
Human these two women are not professionals .
Input fantastic place to see a show as every seat is a great seat !
DAR there is no reason to see a show as every seat seat !
Ctrl unsafe place to embarrassing lazy run as every seat is lazy disappointment seat !
Ours disgusting place to see a show as every seat is a terrible seat !
Human terrible place to see a show as every seat is a horrible seat !
Table 4: Case study from Yelp dataset. The red words indicate good transfer; the blue words indicate bad transfer; the brown words indicate grammar error.

4.5 Ablation Study

Model                     Conditional              Multi-class
                          ACC    BLEU   PPL        ACC    BLEU   PPL
Style Transformer         93.6   17.1   78         87.6   20.3   50
- self reconstruction     50.0   0      N/A        20.7   0      N/A
- cycle reconstruction    94.2   8.6    56         93.2   8.7    40
- discriminator           3.3    22.9   11         3.3    22.9   11
- real sample             89.7   17.4   75         83.8   19.4   55
- generated sample        46.3   21.6   34         35.6   22.0   33
Table 5: Model ablation study results on the Yelp dataset.

To study the impact of different components on overall performance, we further conducted an ablation study of our model on the Yelp dataset; the results are reported in Table 5.

To better understand the role of the different loss functions, we disable each loss function in turn and retrain our model with the same settings for the remaining hyperparameters. After we disable the self-reconstruction loss (Eq. (6)), our model fails to learn a meaningful output and only learns to generate a single word for any combination of input sentence and style. However, when we do not use the cycle reconstruction loss (Eq. (7)), it is still possible to train the model successfully, and both models converge to reasonable performance. Compared to the full model, there is a small improvement in style accuracy but a significant drop in BLEU score. Lastly, when the discriminator loss (Eq. (8) and (9)) is not used, the model quickly degenerates into one that only copies the input sentence to the output without any style control. This behavior matches our intuition: if the model is only asked to minimize the self-reconstruction loss and the cycle reconstruction loss, directly copying the input is one of the optimum solutions and the easiest to achieve. In summary, each of these losses plays an important role in the Transformer network training stage: 1) the self-reconstruction loss guides the model to generate readable natural language sentences; 2) the cycle reconstruction loss encourages the model to preserve the information in the source sentence; 3) the discriminator provides style supervision to help the model control the style of the generated sentences.

Another group of studies focuses on the different types of samples used in the discriminator training step. In Algorithm 1, we used a mixture of real sentences and generated sentences as the positive training samples for the discriminator. By contrast, in the ablation study, we trained our model with only one of them. As the results show, the generated sentence is the key component in discriminator training. When we remove real sentences from the training data of the discriminator, our model can still achieve a result competitive with the full model, with only a small performance drop. However, if we only use real sentences, the model loses a significant part of its ability to control the style of the generated sentence and thus yields poor style accuracy. Even so, the model still controls style far better than the input copy model discussed above. For these reasons, we used a mixture of real samples and generated samples in our final version.

5 Conclusions

In this paper, we propose the Style Transformer with a novel training algorithm for the text style transfer task. Experimental results on two text style transfer datasets show that our model achieves competitive or better performance compared to previous state-of-the-art approaches. In particular, because our proposed approach does not assume a disentangled latent representation for manipulating the sentence style, our model achieves better content preservation on both datasets.


We would like to thank the anonymous reviewers for their valuable comments. The research work is supported by National Natural Science Foundation of China (No. 61672162 and 61751201) and Shanghai Municipal Science and Technology Commission (No. 16JC1420401 and 17JC1404100).