Revision in Continuous Space: Fine-Grained Control of Text Style Transfer
Typical methods for unsupervised text style transfer often rely on two key ingredients: 1) disentangling the content from the attributes, and 2) adversarial learning, which is troublesome to train. In this paper, we show that neither of these components is indispensable. We propose a new framework that uses neither and instead consists of three key components: a variational auto-encoder (VAE), some attribute predictors (one for each attribute), and a content predictor. The VAE and the two types of predictors enable us to perform gradient-based optimization in the continuous space, which is mapped from sentences in a discrete space, to find the representation of a target sentence with the desired attributes and preserved content. Moreover, the proposed method can, for the first time, simultaneously manipulate multiple fine-grained attributes, such as sentence length and the presence of specific words, in synergy when performing text style transfer tasks. Extensive experimental studies on three popular text style transfer tasks show that the proposed method significantly outperforms five state-of-the-art methods.
Text style transfer, an under-explored and challenging task in the field of text generation, aims to convert some attributes of a sentence (e.g., negative sentiment) to other attributes (e.g., positive sentiment) while preserving attribute-independent content. In other words, text style transfer can generate sentences with desired attributes in a controlled manner. Because it is difficult to obtain training sentence pairs with the same content and differing styles, this task is usually tackled in an unsupervised manner, where the model can only access non-parallel but style-labeled sentences.
Most existing methods (Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; Li et al., 2018; Prabhumoye et al., 2018; Yang et al., 2018; John et al., 2019) for text style transfer first explicitly disentangle the content and the attribute through an adversarial learning paradigm (Goodfellow et al., 2014). The attribute-independent content and the desired attribute vector are then fed into the decoder to generate the target sentence. However, some recent evidence suggests that adversarial learning may fail to learn representations that are actually disentangled (Li et al., 2018; Guillaume Lample, 2019). Moreover, vanilla adversarial learning is designed for generating real-valued and continuous data but has difficulties in directly generating sequences of discrete tokens. As a result, algorithms such as REINFORCE (Sutton et al., 2000; Yu et al., 2017; Li et al., 2017; Che et al., 2017; Lin et al., 2017; Guo et al., 2018) or those that approximate the discrete tokens with temperature-softmax probability vectors (Kusner and Hernández-Lobato, 2016; Zhang et al., 2017; Hu et al., 2017; Prabhumoye et al., 2018; Yang et al., 2018) are used. Unfortunately, these methods tend to be unstable, slow, and hard to tune in practice (Guillaume Lample, 2019).
Is it really necessary to explicitly disentangle the content and the attributes? Also, do we have to use adversarial learning to achieve text style transfer? Recently, the idea of mapping the discrete input into a continuous space and then performing gradient-based optimization with a predictor to find the representation of a new discrete output with a desired property has been applied to sentence revision (Mueller et al., 2017) and neural architecture search (Luo et al., 2018). Motivated by the success of these works, we propose a new solution to the task of content-preserving text style transfer. This method can be easily trained on non-parallel data without the adversarial training used in most existing methods. Furthermore, unlike most previous methods that only control a single binary attribute (e.g., positive and negative sentiments), our approach can further control multiple fine-grained attributes such as sentence length and the existence of specific words (Liu et al., 2018).
The proposed approach contains three key components: (a) A variational auto-encoder (VAE) (Kingma and Welling, 2013; Rezende et al., 2014; Fabius and van Amersfoort, 2015; Bowman et al., 2016), whose encoder maps sentences into a smooth continuous space and whose decoder maps a continuous representation back to a sentence. (b) Some attribute predictors, each of which takes the continuous representation of a sentence as input and predicts one attribute of its decoder output sentence. These attribute predictors enable us to find the target sentence with the desired attributes in the continuous space. (c) A content predictor that takes the continuous representation of a sentence as input and predicts the Bag-of-Words (BoW) feature of its decoder output sentence. The purpose of component (c) is threefold: first, it enhances content preservation during style transfer; second, it enables the target sentence to contain some specific words; third, it tackles the vanishing latent variable problem of the VAE (Zhao et al., 2017). With the gradients obtained from these predictors, we can revise the continuous representation of the original sentence by gradient-based optimization to find a target sentence with the desired fine-grained attributes, and thereby achieve content-preserving text style transfer.
The contributions of this paper can be summarized as follows:
We propose a new method for fine-grained control in text style transfer, which does not explicitly disentangle the content and the attribute and avoids the training difficulties caused by the use of adversarial learning in previous methods.
Unlike most previous methods that only control a single binary attribute, the proposed method can simultaneously control multiple fine-grained attributes such as sentence length and the presence of specific words. To the best of our knowledge, it is the first text style transfer method that can control such fine-grained attributes.
Extensive experimental comparisons on three popular text style transfer tasks show that the proposed method significantly outperforms five state-of-the-art methods.
We have witnessed an increasing interest in text style transfer under the setting of non-parallel data. Most such methods explicitly disentangle the content and the attribute. One line of research leverages the auto-encoder framework to encode the original sentence into an attribute-independent content representation with adversarial learning, which is then fed into the decoder with a style vector to output the transferred sentence.
In Hu et al. (2017); Shen et al. (2017); Prabhumoye et al. (2018), adversarial learning is utilized to ensure that the output sentence has the desired style. In order to disentangle the content and the attribute, Hu et al. (2017) enforces the output sentence to reconstruct the content representation, while Fu et al. (2018); Zhao et al. (2017); John et al. (2019) apply adversarial learning to discourage encoding style information into the content representation. Shen et al. (2017) utilizes adversarial learning to align the generated sentences from one style to the data domain of the other style. In (Yang et al., 2018), the authors extend the cross-align method (Shen et al., 2017) by employing a language model as the discriminator, which can provide a more stable and more informative training signal for adversarial learning.
However, as argued in (Li et al., 2018; Guillaume Lample, 2019), it is often easy to fool the discriminator without actually learning disentangled representations. Unlike the methods mentioned above that disentangle the content and the attribute with adversarial learning, another line of research (Prabhumoye et al., 2018; Logeswaran et al., 2018; Guillaume Lample, 2019) applies back-translation (Wintner et al., 2016) to rephrase a sentence while reducing the stylistic properties and encouraging content compatibility. Besides, the authors in (Li et al., 2018) directly mask out the words associated with the original style of the sentence to obtain the attribute-independent content text.
Instead of revising the sentence in the discrete space with prior knowledge as in (Li et al., 2018), our method maps the discrete sentence into a continuous representation space and revises the continuous representation with the gradient provided by the predictors. This method does not explicitly disentangle the content and the attribute and avoids the training difficulties caused by the use of adversarial learning in previous methods. Similar ideas have been proposed in (Mueller et al., 2017; Luo et al., 2018) for sentence revision and neural architecture search. As pointed out in (Shen et al., 2017), the model proposed in (Mueller et al., 2017) does not necessarily enforce content preservation, while our method employs a content predictor to enhance content preservation. Furthermore, unlike most previous methods that only control a single binary attribute (e.g., positive and negative sentiments), our approach can further control multiple fine-grained attributes such as sentence length and the existence of specific words. To the best of our knowledge, these fine-grained attributes have not been studied before in the text style transfer task.
Let $D = \{(x_1, s_1), \ldots, (x_n, s_n)\}$ denote a dataset which contains $n$ sentences $x_i$, each paired with a set of attributes $s_i$. Each $s$ has $K$ attributes of interest, $s = \{s^{(1)}, \ldots, s^{(K)}\}$. Unlike most previous methods (Shen et al., 2017; Fu et al., 2018; Prabhumoye et al., 2018; Li et al., 2018; Yang et al., 2018) that only consider a single binary attribute (e.g., positive or negative sentiments), here we consider multiple fine-grained attributes such as sentence length and the presence of specific words (e.g., a pre-defined subject noun). For example, given an original sentence $x$ = “the salads are fresh and delicious.”, its attribute set can be $s$ = {sentiment=positive, length=7, subject_noun=salads}. Our task is to learn a generative model $G$ that can generate a new sentence $\hat{x}$ with the required attributes $\hat{s}$, and retain the attribute-independent content of $x$ as much as possible.
The proposed model consists of three components: a variational auto-encoder (VAE), attribute predictors, and a content predictor.
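Variational auto-encoder $G$. The VAE integrates a stochastic latent representation $z$ into the auto-encoder architecture. Its RNN encoder maps a sentence $x$ into a continuous latent representation $z$:
$$z \sim G_{enc}(\theta_{enc}; x) = q_E(z \mid x), \tag{1}$$
and its RNN decoder maps the representation back to reconstruct the sentence $x$:
$$x \sim G_{dec}(\theta_{dec}; z) = p_G(x \mid z), \tag{2}$$
where $\theta_{enc}$ and $\theta_{dec}$ denote the parameters of the encoder and decoder. The VAE is then optimized to minimize the reconstruction error of input sentences and, at the same time, to minimize the KL term, which encourages $q_E(z \mid x)$ to match the prior $p(z)$:
$$\mathcal{L}_{VAE}(\theta_{enc}, \theta_{dec}) = -\mathbb{E}_{q_E(z \mid x)}[\log p_G(x \mid z)] + D_{KL}(q_E(z \mid x) \,\|\, p(z)), \tag{3}$$
where $D_{KL}(\cdot \,\|\, \cdot)$ is the KL-divergence. Compared with a traditional deterministic auto-encoder, the VAE offers two main advantages in our approach: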
(1) Deterministic auto-encoders often have “holes” in their latent space, where the latent representations may not be able to generate anything realistic (Roberts et al., 2018). In contrast, by imposing a standard normal prior $\mathcal{N}(0, I)$ on the latent representations, the VAE learns latent representations not as single isolated points, but as soft dense regions in the continuous latent space, which makes it able to generate plausible examples from every point in the latent space (Bowman et al., 2016). This characteristic avoids the problem that a representation revised (optimized) by the gradient may fail to decode into a plausible sentence.
(2) The continuous and smooth latent space learned by the VAE makes sentences generated from adjacent latent representations similar in content and semantics (Bowman et al., 2016; Semeniuta et al., 2017; Goyal et al., 2017; Yang et al., 2017; Shen et al., 2018). Therefore, if we revise the representation within a reasonable range (i.e., by a small enough amount), the resulting new sentence will not differ much in content from the original sentence.
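The two terms of the VAE objective can be illustrated with a small, dependency-free sketch. It assumes a diagonal Gaussian posterior and a standard normal prior, for which the KL term has a closed form; the function names and the `kl_weight` argument are hypothetical and are not taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def vae_loss(token_log_probs, mu, log_var, kl_weight=1.0):
    """Negative reconstruction log-likelihood plus the (weighted) KL term.

    token_log_probs: log-probabilities the decoder assigns to the gold tokens.
    kl_weight: assumed knob for KL annealing or a weighted KL term.
    """
    reconstruction = -sum(token_log_probs)
    return reconstruction + kl_weight * kl_to_standard_normal(mu, log_var)
```

A posterior that already equals the prior (zero mean, unit variance) contributes no KL penalty, so the loss reduces to the reconstruction error alone.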
Attribute predictors $f_1, \ldots, f_K$. Each of them takes the representation $z$ as input and predicts one attribute of the decoder output sentence $\hat{x}$ generated from $z$. For example, the attribute predictor can be a binary classifier for positive-negative sentiment prediction or a regression model for sentence length prediction. With the gradients provided by the predictors, we can revise the continuous representation $z$ of the original sentence by gradient-based optimization to find a target sentence with the desired attributes $\hat{s}$.
The attribute predictors are trained in two stages. We first train these attribute predictors jointly with the VAE. For $M$-class classification predictors, we have
$$\mathcal{L}_{Attr,k}(\theta_k, \theta_{enc}) = -\mathbb{E}_{q_E(z \mid x)}[\log p(s^{(k)} \mid z)], \tag{4}$$
where $p(\cdot \mid z) = \mathrm{softmax}(f_k(z))$ is the predicted distribution over the $M$ classes and $\theta_k$ denotes the parameters of $f_k$. For the regression predictors, we have
$$\mathcal{L}_{Attr,k}(\theta_k, \theta_{enc}) = \mathbb{E}_{q_E(z \mid x)}[(f_k(z) - s^{(k)})^2], \tag{5}$$
where $f_k(z)$ is the predicted attribute value. In this joint training, we take the attributes of the input sentence as the labels of the predictors. Since the predictors are designed to predict the attributes of the sentence generated from $z$, we further train each predictor individually after joint training. We sample $z$ from the prior $p(z)$ and feed it into the decoder to generate a new sentence $\hat{x}$. Afterwards we feed $\hat{x}$ into the CNN text classifiers $C_k$ (Kim, 2014), which are trained on the training set, to predict its attributes (some attributes, such as the length of $\hat{x}$, can be obtained directly without using classifiers) and use them as the labels of the predictors:
$$\mathcal{L}_{Attr,k}(\theta_k) = -\mathbb{E}_{p(z)\, p_G(\hat{x} \mid z)}[\log p(C_k(\hat{x}) \mid z)]. \tag{6}$$
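The two kinds of predictor losses and the second training stage can be sketched as follows. This is only an illustration under stated assumptions: cross-entropy is used for classification and squared error for regression, and `sample_prior`, `decode`, and `label_fns` are hypothetical callables standing in for the prior sampler, the VAE decoder, and the pre-trained classifiers (or direct measurements such as length).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_loss(logits, label):
    """Cross-entropy for an M-way attribute such as sentiment."""
    return -math.log(softmax(logits)[label])


def regression_loss(prediction, target):
    """Squared error for a real-valued attribute such as sentence length."""
    return (prediction - target) ** 2


def relabel_batch(sample_prior, decode, label_fns, n):
    """Second stage: pair sampled z with labels of the sentence decoded from it.

    sample_prior, decode and label_fns are hypothetical stand-ins, not the
    paper's actual components.
    """
    pairs = []
    for _ in range(n):
        z = sample_prior()
        sentence = decode(z)
        labels = {name: fn(sentence) for name, fn in label_fns.items()}
        pairs.append((z, labels))
    return pairs
```

The pairs returned by `relabel_batch` would then be used to fine-tune each predictor separately, so that it predicts the attributes of what the decoder actually produces rather than those of the encoder input.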
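Content predictor $f_{bow}$. It is a multi-label classifier that takes $z$ as input and predicts the Bag-of-Words (BoW) feature $x_{bow}$ of its decoder output sentence:
$$f = f_{bow}(z) \in \mathbb{R}^{V}. \tag{7}$$
We assume $p(x_{bow} \mid z)$ to be a $|x|$-trial multinomial distribution:
$$\log p(x_{bow} \mid z) = \log \prod_{t=1}^{|x|} \frac{e^{f_{x_t}}}{\sum_{j=1}^{V} e^{f_j}}, \tag{8}$$
where $V$ is the size of the vocabulary, $|x|$ is the length of $x$, and $f_{x_t}$ is the output value of the $t$-th word in $x$.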
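The BoW log-likelihood described above is a sum of log-softmax values, one per token of the sentence. A minimal sketch, assuming the predictor output is a plain list of $V$ scores and the sentence is given as a list of vocabulary indices (both representations are assumptions made for illustration):

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over the vocabulary scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def bow_log_likelihood(scores, word_ids):
    """Sum of log-softmax scores over the tokens of the sentence.

    scores: one output value per vocabulary entry (length V).
    word_ids: vocabulary indices of the tokens in the sentence, in any order.
    """
    log_probs = log_softmax(scores)
    return sum(log_probs[w] for w in word_ids)
```

Because the sum ignores token order, the predictor is only asked which words occur, not where, which is what makes it a content signal rather than a second decoder.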
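The training of the content predictor also has two stages. First, it is jointly trained with the VAE:
$$\mathcal{L}_{BOW}(\theta_{bow}, \theta_{enc}) = -\mathbb{E}_{q_E(z \mid x)}[\log p(x_{bow} \mid z)]. \tag{9}$$
After joint training, it is trained separately on sentences $\hat{x}$ decoded from sampled representations:
$$\mathcal{L}_{BOW}(\theta_{bow}) = -\mathbb{E}_{p(z)\, p_G(\hat{x} \mid z)}[\log p(\hat{x}_{bow} \mid z)], \tag{10}$$
where $x_{bow}$ and $\hat{x}_{bow}$ denote the BoW features of $x$ and $\hat{x}$, and $\theta_{bow}$ denotes the parameters of the content predictor.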
During text style transfer, we can similarly revise the representation $z$ with the gradient provided by the content predictor to enhance content preservation. Here we consider two ways of setting the target BoW feature $x_{bow}$. First, we can set $x_{bow}$ to contain all the words in the original sentence $x$, which means that we try to find a sentence that has the desired attributes and keeps as many words of the original sentence as possible. However, retaining all the words is often not what we want. For example, the target sentence should not contain the original emotional words in the task of text sentiment transfer. Instead, the nouns in the original sentence should be retained in such a task (Melnyk et al., 2017; Li et al., 2018; John et al., 2019). Therefore, as a second option, we can set $x_{bow}$ to contain only the nouns in $x$. Furthermore, we can set $x_{bow}$ to contain some desired specific words to achieve finer-grained control of target sentences.
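The three choices of content target (all words, nouns only, plus extra keywords) can be sketched as one helper. The function name, the `vocab` mapping, and the pre-computed `nouns` set are hypothetical; in practice the noun set would come from a POS tagger, which is left out here to keep the example self-contained. Duplicate words are collapsed, which is a simplification.

```python
def build_bow_target(tokens, vocab, nouns=None, keywords=()):
    """Return the sorted vocabulary ids that the content predictor should favour.

    tokens: tokens of the original sentence.
    vocab: hypothetical word -> id mapping.
    nouns: if None, keep every word; otherwise keep only tokens in this set.
    keywords: extra words that must appear in the target sentence.
    """
    kept = list(tokens) if nouns is None else [t for t in tokens if t in nouns]
    kept.extend(keywords)
    return sorted({vocab[t] for t in kept if t in vocab})


vocab = {"the": 0, "salads": 1, "are": 2, "fresh": 3, "and": 4,
         "delicious": 5, ".": 6, "dressing": 7}
tokens = "the salads are fresh and delicious .".split()

all_words = build_bow_target(tokens, vocab)
nouns_only = build_bow_target(tokens, vocab, nouns={"salads"})
with_keyword = build_bow_target(tokens, vocab, nouns={"salads"}, keywords=["dressing"])
```

For sentiment transfer, the nouns-only target avoids pulling the original sentiment words ("fresh", "delicious") back into the revised sentence.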
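Putting them together, the final joint training loss is as follows:
$$\mathcal{L} = \mathcal{L}_{VAE} + \lambda_b \mathcal{L}_{BOW} + \lambda_s \sum_{k=1}^{K} \mathcal{L}_{Attr,k}, \tag{11}$$
where $\lambda_b$ and $\lambda_s$ are balancing hyper-parameters. It should be noted that $\mathcal{L}_{BOW}$ and $\mathcal{L}_{Attr,k}$ also act as regularizers that prevent the encoder from being trapped in a KL vanishing state (Bowman et al., 2016; Kingma et al., 2016; Yang et al., 2017; Shen et al., 2018; Alemi et al., 2018; Liu et al., 2019).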
Given the original sentence $x$, the inference process of style transfer is performed in the continuous space. We revise its representation $z$ by gradient-based optimization as follows:
$$\hat{z} = z - \eta \left( \sum_{k=1}^{K} \nabla_z \mathcal{L}_{Attr,k} + \lambda_c \nabla_z \mathcal{L}_{BOW} \right), \tag{12}$$
where $\eta$ is the step size and $\lambda_c$ is the trade-off parameter that balances content preservation against style transfer strength. We iterate this update until the confidence of the attribute predictors is greater than a threshold $\beta$ or the maximum number of rounds $T$ is reached, which yields the revised representation $z^*$. The target sentence $\hat{x}$ is obtained by decoding $z^*$ with beam search (Och and Ney, 2004). An example procedure is shown in Figure 1.
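The revision procedure is an ordinary gradient-descent loop with two stopping conditions. The sketch below assumes hand-written gradients for toy stand-ins: a logistic attribute predictor and a quadratic content term that pulls the representation back to its starting point. Neither is the paper's predictor; they only make the loop runnable without an autodiff library.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def revise(z, attr_grad, content_grad, confidence,
           step_size=0.1, trade_off=0.1, threshold=0.9, max_rounds=100):
    """Repeat z <- z - step_size * (grad_attr + trade_off * grad_content).

    Stops once confidence(z) exceeds the threshold or after max_rounds updates.
    """
    z = list(z)
    for _ in range(max_rounds):
        if confidence(z) > threshold:
            break
        g_attr, g_cont = attr_grad(z), content_grad(z)
        z = [zi - step_size * (ga + trade_off * gc)
             for zi, ga, gc in zip(z, g_attr, g_cont)]
    return z


# Toy stand-ins (hypothetical): a logistic "sentiment" predictor and a
# quadratic content loss 0.5 * ||z - Z0||^2 in place of the BoW loss.
W = [1.0, -2.0]
Z0 = [0.0, 0.5]


def confidence(z):
    return sigmoid(sum(w * zi for w, zi in zip(W, z)))


def attr_grad(z):
    # Gradient of -log(confidence(z)) with respect to z.
    p = confidence(z)
    return [-(1.0 - p) * w for w in W]


def content_grad(z):
    # Gradient of 0.5 * ||z - Z0||^2 with respect to z.
    return [zi - z0 for zi, z0 in zip(z, Z0)]


z_star = revise(Z0, attr_grad, content_grad, confidence)
```

Raising `trade_off` keeps `z_star` closer to the original representation at the cost of lower attribute confidence, which mirrors the role the trade-off parameter plays in the update rule.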
The experiments are designed for answering the following questions: Q1: Compared with the state-of-the-art methods, how well do our methods perform in the text style transfer tasks? To answer this question, we evaluate them on three publicly available datasets of sentiment transfer and gender style transfer tasks. Q2: Can our methods further control fine-grained attributes such as length and control multiple attributes at the same time? To verify this, we conduct several experiments on text sentiment transfer tasks and simultaneously control other fine-grained attributes such as length and keyword presence.
We use two datasets, Yelp restaurant reviews and Amazon product reviews (He and McAuley, 2016), which are also commonly used in prior works (Shen et al., 2017; Fu et al., 2018; Li et al., 2018; Prabhumoye et al., 2018). These datasets can be downloaded at http://bit.ly/2LHMUsl. Following their experimental settings, we use the same pre-processing steps and similar experimental configurations.
Table 1 (Yelp): Automatic evaluation results of sentiment transfer.
Methods | Accuracy | PPL | Overlap | Noun% | BLEU | Suc% |
---|---|---|---|---|---|---|
Original | 0.1 | 22.9 | 100.0 | 100.0 | 42.4 | 0.1 |
Human | 91.8 | 76.9 | 47.2 | 78.5 | 100.0 | 83.3 |
CrossAligned (Shen et al., 2017) | 73.6 | 72.0 | 41.1 | 42.9 | 18.4 | 27.9 |
StyleEmbedding (Fu et al., 2018) | 7.2 | 93.9 | 75.4 | 74.2 | 31.9 | 2.1 |
MultiDecoder (Fu et al., 2018) | 48.8 | 166.5 | 51.5 | 52.2 | 23.1 | 11.3 |
BST (Prabhumoye et al., 2018) | 94.8 | 32.8 | 21.5 | 23.5 | 6.8 | 31.9 |
Delete, Retrieve, & Generate (Li et al., 2018): | | | | | | |
TemplateBased | 81.3 | 183.6 | 55.6 | 83.3 | 28.9 | 42.5 |
DeleteOnly | 85.8 | 81.4 | 49.5 | 74.9 | 24.7 | 51.4 |
RetrievalOnly | 98.4 | 25.7 | 15.8 | 39.6 | 4.7 | 51.0 |
DeleteAndRetrieve | 89.5 | 96.1 | 49.4 | 74.0 | 24.9 | 55.7 |
Ours-1 | 88.2 | 26.5 | 46.6 | 77.4 | 21.8 | 66.9 |
Ours-2 | 92.3 | 18.3 | 38.9 | 69.3 | 18.8 | 67.9 |
Ours-3 | 95.7 | 20.6 | 39.7 | 61.5 | 17.9 | 66.3 |
Table 1 (Amazon): Automatic evaluation results of sentiment transfer.
Methods | Accuracy | PPL | Overlap | Noun% | BLEU | Suc% |
---|---|---|---|---|---|---|
Original | 23.4 | 24.4 | 100.0 | 100.0 | 57.2 | 23.2 |
Human | 88.1 | 62.9 | 60.5 | 85.0 | 100.0 | 81.2 |
CrossAligned (Shen et al., 2017) | 69.6 | 18.3 | 19.3 | 20.4 | 5.0 | 28.8 |
StyleEmbedding (Fu et al., 2018) | 40.5 | 87.7 | 42.2 | 41.8 | 22.1 | 13.2 |
MultiDecoder (Fu et al., 2018) | 66.5 | 80.8 | 30.6 | 30.4 | 14.3 | 19.8 |
BST (Prabhumoye et al., 2018) | 82.6 | 25.3 | 24.7 | 22.5 | 9.2 | 36.9 |
Delete, Retrieve, & Generate (Li et al., 2018): | | | | | | |
TemplateBased | 69.6 | 108.9 | 73.3 | 87.9 | 42.8 | 50.0 |
DeleteOnly | 51.6 | 49.3 | 74.4 | 95.1 | 44.7 | 44.1 |
RetrievalOnly | 87.2 | 28.7 | 21.0 | 44.5 | 6.7 | 51.2 |
DeleteAndRetrieve | 55.2 | 48.2 | 69.1 | 92.6 | 41.8 | 48.7 |
Ours-1 | 81.9 | 35.0 | 37.7 | 76.0 | 11.5 | 59.1 |
Ours-2 | 85.1 | 21.8 | 49.3 | 49.8 | 21.5 | 55.9 |
Ours-3 | 90.0 | 15.9 | 39.5 | 41.4 | 16.3 | 54.5 |
There are three criteria for a good style transfer (Li et al., 2018; Prabhumoye et al., 2018). Concretely, the generated sentences should: 1) have the desired attributes; 2) be fluent; 3) preserve the attribute-independent content of the original sentence as much as possible. For the first and second criteria, we follow previous works (Shen et al., 2017; Fu et al., 2018; Li et al., 2018; Prabhumoye et al., 2018) in using model-based evaluation. We measure whether the style is successfully transferred according to the prediction of a pre-trained bidirectional LSTM classifier (Schuster and Paliwal, 1997; Hochreiter and Schmidhuber, 1997), and measure the language quality by the perplexity (PPL) of the generated sentences under a pre-trained language model. Following previous works, we use the trigram Kneser-Ney smoothed language model (Kneser and Ney, 1995) trained on the respective dataset. Since it is hard to measure content preservation, we follow previous works and report two metrics: 1) Word overlap, which is the unigram word overlap rate of the original sentence $x$ and the generated sentence $\hat{x}$, computed by $\frac{count(w_x \cap w_{\hat{x}})}{count(w_x \cup w_{\hat{x}})}$; 2) Noun%: because most nouns in sentences are attribute-independent content (Melnyk et al., 2017; Li et al., 2018) in this task, we also calculate the percentage of nouns (e.g., as detected by a POS tagger) in the original sentence that appear in the generated sentence. Because a good model should perform well on all three criteria, we additionally propose a more comprehensive metric that serves as a lower bound of the transfer success percentage (denoted as Suc%): a sample is considered successfully transferred if the classifier predicts the desired attribute for it, its language-model probability is no less than a threshold, and it contains at least one noun of the original sentence. Li et al. (2018) provide 1000 human-annotated sentences as the ground truth of the transferred sentences. We also take them as references and report the bi-gram BLEU scores (Papineni et al., 2002).
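The content-preservation metrics and the combined success criterion can be sketched in a few lines. This assumes the word overlap is the intersection-over-union of the two unigram sets, that the nouns of the source sentence are supplied by the caller (for example from a POS tagger), and that fluency is checked by comparing a language-model log-probability with a caller-chosen threshold; all function and parameter names are hypothetical.

```python
def word_overlap(source_tokens, output_tokens):
    """Unigram overlap rate: |intersection| / |union| of the two word sets."""
    a, b = set(source_tokens), set(output_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def noun_retention(source_nouns, output_tokens):
    """Fraction of source nouns that reappear in the output (Noun%)."""
    nouns = set(source_nouns)
    if not nouns:
        return 0.0
    return len(nouns & set(output_tokens)) / len(nouns)


def is_success(predicted_attr, target_attr, log_prob, min_log_prob,
               source_nouns, output_tokens):
    """Combined criterion: right attribute, fluent enough, keeps some noun."""
    keeps_noun = bool(set(source_nouns) & set(output_tokens))
    return predicted_attr == target_attr and log_prob >= min_log_prob and keeps_noun


source = "the salads are fresh and delicious .".split()
output = "the salads are stale and bland .".split()
overlap = word_overlap(source, output)
```

In this example five of the nine distinct words are shared, and the single source noun "salads" is retained, so the output would count as a success if the classifier and language-model checks also pass.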
We compare our method with several previous state-of-the-art methods (Shen et al., 2017; Fu et al., 2018; Li et al., 2018; Prabhumoye et al., 2018). We report the results of the human-written sentences as a strong baseline. The results of not making any changes to the original sentences (denoted as Original) are also reported. The effect of using different hyper-parameters and the ablation study are analyzed in Appendix A.
Table 1 shows the evaluation results on two datasets. Generally we find that StyleEmbedding and MultiDecoder achieve high content retention (Overlap, BLEU, and Noun%), but their fluency (PPL) and transfer accuracy are poor, resulting in low overall scores (Suc%). In contrast, BST achieves high fluency and transfer accuracy, while the content is poorly preserved. The fluency of CrossAligned is better, but it does not perform well in either content preservation or sentiment transfer. Because the methods proposed in (Li et al., 2018) rely on prior knowledge to revise the original sentence in the discrete space, they (except for RetrievalOnly) can achieve both high content retention and transfer accuracy. However, the generated sentences are not fluent enough. Our methods revise the original sentence in a continuous space and do well in fluency, content preservation, and transfer accuracy. They achieve the highest overall scores (Suc%) among all compared methods. In addition, we can see that our methods can control the trade-off between the transfer accuracy and content preservation.
We conduct human evaluations to further verify the performance of our methods on two datasets. Following previous works (Li et al., 2018; Fu et al., 2018), we randomly select 50 original sentences and ask 7 evaluators (all of whom hold a Bachelor's or higher degree and are independent of the authors' research group) to evaluate the sentences generated by different methods. Each generated sentence is rated on a scale of 1 to 5 in terms of transfer accuracy, preservation of content, and language fluency. The results are shown in Table 2. It can be seen that our models significantly outperform all the baselines on the success rate (Suc%) on both datasets. The generated examples can be found in Appendix B.
Methods | Yelp Acc | Yelp Gra | Yelp Con | Yelp Suc% | Amazon Acc | Amazon Gra | Amazon Con | Amazon Suc% |
---|---|---|---|---|---|---|---|---|
Human | 4.1 | 4.4 | 3.6 | 78 | 3.5 | 4.3 | 3.9 | 60 |
CrossAligned (Shen et al., 2017) | 3.3 | 2.9 | 2.6 | 22 | 3.0 | 3.3 | 1.6 | 6 |
MultiDecoder (Fu et al., 2018) | 2.4 | 3.0 | 3.1 | 12 | 2.3 | 2.7 | 2.5 | 6 |
BST (Prabhumoye et al., 2018) | 3.9 | 3.7 | 1.8 | 26 | 2.8 | 3.3 | 1.8 | 8 |
DeleteAndRetrieve (Li et al., 2018) | 3.8 | 3.6 | 3.5 | 54 | 2.4 | 3.5 | 3.8 | 28 |
Ours-1 | 3.6 | 4.1 | 3.1 | 66 | 3.4 | 4.0 | 2.8 | 42 |
Ours-2 | 3.7 | 4.3 | 3.2 | 72 | 3.7 | 4.0 | 2.4 | 40 |
Ours-3 | 3.8 | 4.1 | 3.0 | 60 | 3.8 | 4.5 | 2.5 | 50 |
Methods | Accuracy | PPL | Overlap | Noun% | Suc% |
---|---|---|---|---|---|
Original | 21.9 | 183.4 | 100.0 | 100.0 | 21.9 |
BST (Prabhumoye et al., 2018) | 60.3 | 145.0 | 37.9 | 35.3 | 36.3 |
Ours-1 | 79.9 | 78.9 | 46.4 | 53.8 | 63.9 |
Ours-2 | 71.3 | 87.8 | 51.8 | 57.5 | 58.7 |
Ours-3 | 70.6 | 98.2 | 46.8 | 69.6 | 66.6 |
We use the same dataset as in (Prabhumoye et al., 2018), which contains reviews from Yelp annotated with two sexes (they only consider male or female due to the absence of corpora with other gender annotations (Eckert and McConnell-Ginet, 2013)). This dataset can be downloaded at http://tts.speech.cs.cmu.edu/style_models/gender_classifier.tar. Following (Prabhumoye et al., 2018), we use the same pre-processing steps and similar experimental configurations. We directly compare our method against BST (Prabhumoye et al., 2018), which has been shown to outperform the previous approach (Shen et al., 2017) on this task. We use the same metrics described in Section 4.1 except for the BLEU score, because this dataset does not provide human-annotated sentences. The results are shown in Table 3. We can see that our methods outperform BST (Prabhumoye et al., 2018) on all metrics. The generated examples are shown in Appendix C.
Methods | Accuracy | PPL | Overlap | Noun% | Len% | Key% |
---|---|---|---|---|---|---|
Original | 0.1 | 22.9 | 100.0 | 100.0 | 100.0 | 7.8 |
Keywords | 16.7 | 43.9 | 39.2 | 56.0 | 98.1 | 92.3 |
Sentiment + Keywords | 91.6 | 52.6 | 24.5 | 42.4 | 106.0 | 78.3 |
Length↑ | 0.2 | 29.8 | 25.0 | 48.3 | 208.8 | 5.9 |
Sentiment + Length↑ | 97.7 | 25.4 | 21.4 | 51.7 | 189.5 | 9.2 |
Keywords + Length↑ | 25.6 | 44.5 | 29.8 | 61.8 | 165.0 | 83.2 |
Sentiment + Keywords + Length↑ | 93.0 | 51.8 | 18.8 | 50.0 | 183.7 | 66.6 |
Length↓ | 0.2 | 31.3 | 30.7 | 25.2 | 40.8 | 6.3 |
Sentiment + Length↓ | 95.1 | 23.0 | 29.1 | 38.1 | 66.9 | 6.7 |
Keywords + Length↓ | 21.4 | 87.0 | 28.4 | 38.9 | 61.6 | 83.7 |
Sentiment + Keywords + Length↓ | 87.6 | 123.8 | 16.3 | 23.7 | 60.9 | 63.0 |
We conduct experiments on controlling fine-grained attributes (length or keyword presence) and on simultaneously manipulating multiple attributes (length, keyword presence, and sentiment) of the original sentence. We use the same dataset, Yelp, and the same metrics used in Section 4.1. For the attribute of length, we design two experiments: 1) the target sentence should add some relevant content to the original sentence and double its length (denoted as Length↑); 2) the target sentence should compress the content of the original sentence and halve its length (denoted as Length↓). For evaluation, we measure the length of the generated sentences as a percentage of the length of the original sentences (denoted as Len%). For the attribute of keyword presence, the target sentence should contain a pre-defined keyword and retain the content of the original sentence as much as possible (denoted as Keywords). In our experiments, we define a keyword as the noun that is semantically most relevant to the original sentence (computed by the cosine distance of pre-trained word embeddings) but does not appear in it. We report the percentage of generated sentences that contain the pre-defined keyword (denoted as Key%).
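The two fine-grained metrics and the keyword selection can be sketched as below. Several details are assumptions made for illustration: Len% is taken as the mean of per-sentence length ratios, a sentence is embedded as the mean of its word vectors, and relevance is scored by cosine similarity (the highest similarity corresponds to the smallest cosine distance). The toy embeddings are made up.

```python
import math


def len_percent(sources, outputs):
    """Mean per-sentence ratio of output length to source length, in percent."""
    ratios = [len(out) / len(src) for src, out in zip(sources, outputs)]
    return 100.0 * sum(ratios) / len(ratios)


def key_percent(outputs, keywords):
    """Percentage of outputs that contain their pre-defined keyword."""
    hits = sum(1 for out, kw in zip(outputs, keywords) if kw in out)
    return 100.0 * hits / len(outputs)


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pick_keyword(tokens, candidate_nouns, embeddings):
    """Pick the candidate noun closest to the sentence that is not already in it."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    dim = len(known[0])
    mean = [sum(vec[i] for vec in known) / len(known) for i in range(dim)]
    pool = [n for n in candidate_nouns if n not in tokens and n in embeddings]
    return max(pool, key=lambda n: cosine(embeddings[n], mean))


# Made-up 2-d embeddings, for illustration only.
embeddings = {"salads": [1.0, 0.0], "fresh": [0.8, 0.2],
              "vegetables": [0.9, 0.1], "car": [0.0, 1.0]}
keyword = pick_keyword(["the", "salads", "are", "fresh"],
                       ["salads", "vegetables", "car"], embeddings)
```

With these toy vectors the selected keyword is "vegetables": "salads" is excluded because it already occurs in the sentence, and "car" points in a nearly orthogonal direction.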
The results are shown in Table 4. For a single fine-grained attribute, it can be observed that Keywords achieves a Key% score of 92.3, while Length↑ and Length↓ achieve Len% scores of 208.8 and 40.8, respectively. At the same time, the fluency and content retention scores are still high. These results demonstrate that the proposed method can control such fine-grained attributes. When we further control the sentiment attribute, Sentiment + Keywords achieves 91.6% accuracy, while the accuracy of Sentiment + Length↑ and Sentiment + Length↓ is 97.7% and 95.1%, respectively. Meanwhile, their other scores do not decline significantly. When simultaneously controlling all these attributes, Sentiment + Keywords + Length↑ achieves 93.0% accuracy, a Len% score of 183.7, and a Key% score of 66.6, while Sentiment + Keywords + Length↓ achieves 87.6% accuracy, a Len% score of 60.9, and a Key% score of 63.0. Since it is more difficult to reduce sentence length than to increase it while controlling other attributes, the fluency of Sentiment + Keywords + Length↓ is worse than that of Sentiment + Keywords + Length↑. We show some generated examples in Appendix D. These results indicate that our proposed method can control multiple attributes simultaneously.
In this paper, we explore a novel task setting for text style transfer, in which multiple fine-grained attributes must be manipulated simultaneously. We propose to address it by revising the original sentences in a continuous space with gradient-based optimization. Experimental results demonstrate that the proposed method can simultaneously manipulate multiple fine-grained attributes such as sentence length and the presence of specific words. To the best of our knowledge, this is the first time that a style transfer algorithm can control all of these fine-grained attributes. Furthermore, extensive experiments on three popular text style transfer tasks show that our approach outperforms five previous state-of-the-art methods by a large margin.
Rezende et al. (2014). Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
Semeniuta et al. (2017). A hybrid convolutional variational autoencoder for text generation. In EMNLP, 2017.
Sutton et al. (2000). Policy gradient methods for reinforcement learning with function approximation. In NIPS, 2000.
We study the effect of the following hyper-parameters and configurations:
The hyper-parameter $\lambda_c$ described in Equation 12, which is the trade-off parameter that balances content preservation against style transfer strength.
The target $x_{bow}$ of the content predictor. As described in Section 3.1, we proposed two kinds of $x_{bow}$: (a) set $x_{bow}$ to contain all the words in the original sentence $x$ (denoted as Cont-1); (b) set $x_{bow}$ to contain only the nouns (detected by an NLTK POS tagger) in $x$ (denoted as Cont-2). Besides, we test not using the content predictor (denoted as Cont-0).
The KL loss of the variational auto-encoder (VAE). If the KL loss is too large, the VAE will collapse into an auto-encoder. If the KL loss drops to 0, the VAE will collapse into a plain language model. Ideally, it should be small but non-zero. Under different configurations (e.g., KL annealing and the weighted KL term), we obtain VAEs with different KL losses and then test their performance in our scenarios.
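KL annealing and a weighted KL term both amount to multiplying the KL loss by a coefficient during training. A minimal sketch of a linear warm-up schedule follows; the linear shape, the function name, and the parameters are assumptions, since the schedule actually used is not specified here.

```python
def kl_weight(step, warmup_steps, max_weight=1.0):
    """Linear KL annealing: ramp the KL coefficient from 0 up to max_weight.

    With warmup_steps <= 0 the weight is constant, which corresponds to a
    plain weighted KL term.
    """
    if warmup_steps <= 0:
        return max_weight
    return max_weight * min(1.0, step / warmup_steps)
```

A smaller `max_weight` or a longer warm-up lets the encoder keep more information in the latent code, which generally leads to a larger final KL loss; sweeping these settings is one way to obtain VAEs with different KL values.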
Table 5 reports these results on Yelp sentiment transfer task with the same settings in Section 4.1. From the results we can see:
When the value of $\lambda_c$ is increased, the word overlap score and the Noun% score increase while the sentiment transfer accuracy decreases. This demonstrates that $\lambda_c$ can control the trade-off between attribute transfer and content preservation.
After retraining the sentiment predictor, the sentiment transfer accuracy increases from 88.3% to 93.1%. Retraining the content predictor further improves the word overlap score and the Noun% score. These results show that retraining the predictors by Equations 6 and 10 can further improve the performance.
As expected, Cont-1 can improve the word overlap score, while Cont-2 can further improve the Noun% score and the sentiment transfer accuracy. Compared with the Cont-0, both Cont-1 and Cont-2 have significantly improved the success rate, which indicates the effectiveness of the content predictor.
When the KL loss of the VAE is lower, the reconstruction error is higher. At the same time, the accuracy and the fluency are better, but the content preservation is poor. The KL term of VAE can also control the trade-off between attribute transfer and content preservation.
Settings | Accuracy | PPL | Overlap | Noun% | Suc% |
---|---|---|---|---|---|
1. : | |||||
= 0.1 | 88.7 | 20.6 | 35.7 | 73.6 | 61.4 |
= 0.05 | 93.4 | 19.8 | 34.8 | 68.6 | 64.5 |
2. Retraining: | |||||
No Retraining | 88.3 | 19.4 | 39.4 | 58.6 | 59.4 |
+ Retrain Sentiment Predictor | 93.1 | 22.8 | 39.7 | 60.0 | 62.9 |
+ Retrain Content Predictor | 94.1 | 20.6 | 41.6 | 61.5 | 65.4 |
3. type: | |||||
Cont-0 | 91.9 | 12.6 | 33.2 | 43.4 | 49.4 |
Cont-1 | 92.8 | 19.4 | 36.3 | 60.2 | 60.5 |
Cont-2 | 93.4 | 19.8 | 35.7 | 68.6 | 64.5 |
4. KL loss: | |||||
= 13.85 | 92.6 | 14.6 | 31.7 | 53.4 | 57.4 |
= 17.27 | 88.8 | 19.8 | 38.1 | 69.1 | 64.0 |
= 21.84 | 84.7 | 27.6 | 43.1 | 86.8 | 63.8 |
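To make the Overlap column more concrete, the sketch below computes a word overlap score between a source sentence and its transferred version. It assumes a Jaccard-style definition (shared word types divided by all word types in either sentence) over whitespace tokens; the exact definition and any stop-word or sentiment-word filtering used for Table 5 are not given in this section, so the optional `ignore` parameter is a hypothetical stand-in for such filtering. The sentences are taken from the Yelp samples in Table 6.

```python
def word_overlap(source, transferred, ignore=frozenset()):
    """Jaccard-style overlap between the word types of two tokenized sentences.

    `ignore` is an assumed hook for words that should not count
    (e.g., stop words or sentiment words).
    """
    src = set(source.split()) - ignore
    out = set(transferred.split()) - ignore
    union = src | out
    return len(src & out) / len(union) if union else 0.0

original = "there was only meat and bread ."
ours1 = "the bread was fresh and the meat was tender ."
copied = "there was only meat and bread ."  # the MultiDecoder output for this source

print(word_overlap(original, ours1))   # 0.5
print(word_overlap(original, copied))  # 1.0
```

The second call shows why overlap must be read together with accuracy: a system that copies its input obtains a perfect overlap score without transferring the attribute at all.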
Samples produced by our method and the baselines on the sentiment transfer task are shown in Table 6 (Yelp) and Table 7 (Amazon).
Sentiment transfer from negative to positive (Yelp) | |
---|---|
Original | we sit down and we got some really slow and lazy service . |
Human | the service was quick and responsive . |
CrossAligned | we went down and we were a good , friendly food . |
MultiDecoder | we sit down and we got some really and fast food . |
DeleteAndRetrieve | we got very nice place to sit down and we got some service . |
BackTranslation | we got and i and it is very nice and friendly staff . |
Ours1 | we sat down and got some really good service and friendly people . |
Ours2 | we sat down the street and had some really nice and fast service . |
Ours3 | we really sit down and the service and food were great . |
Original | there was only meat and bread . |
Human | there was a wide variety of meats and breads . |
CrossAligned | there was amazing flavorful and . |
MultiDecoder | there was only meat and bread . |
DeleteAndRetrieve | meat and bread was very fresh . |
BackTranslation | it was very nice and helpful . |
Ours1 | the bread was fresh and the meat was tender . |
Ours2 | the bread was good and the bread was fresh and plentiful . |
Ours3 | the bread was fresh and very tasty . |
Original | anyway , we got our coffee and will not return to this location . |
Human | we got coffee and we ’ll think about going back . |
CrossAligned | anyway , we got our food and will definitely return to this location . |
MultiDecoder | anyway , we got our coffee and will not return to this location . |
DeleteAndRetrieve | anyway , we got our coffee and would recommend it to everyone . |
BackTranslation | everything in the staff is very nice and it was the best . |
Ours1 | i will return to this location , and we will definitely return . |
Ours2 | we will return to this location again , and the coffee was great . |
Ours3 | we will definitely return , and this is our new favorite coffee place . |
Sentiment transfer from positive to negative (Yelp) | |
Original | i love this place , the service is always great ! |
Human | hate this place , service was bad . |
CrossAligned | i know this place , the food is just a horrible ! |
MultiDecoder | i love this place , the service is always great ! |
DeleteAndRetrieve | i did not like the homework of lasagna , not like it , . |
BackTranslation | i wish i have been back , this place is a empty ! |
Ours1 | however , this place is the worst i have ever been to . |
Ours2 | i do n’t know why i love this place , but the service is horrible . |
Ours3 | i do n’t know why this place has the worst customer service ever . |
Original | their pizza is the best i have ever had as well as their ranch ! |
Human | their pizza is the worst i have ever had as well as their ranch ! |
CrossAligned | their pizza is the other i have ever had as well as their onions ! |
MultiDecoder | their pizza is the best i have ever had as well at their job ! |
DeleteAndRetrieve | had their bad taste like ranch ! |
BackTranslation | their food is n’t the worst i ’ve ever had to go ! |
Ours1 | this is the worst pizza i have ever had as well as their ranch . |
Ours2 | this is the worst pizza i have ever had as well as their bruchetta . |
Ours3 | i have had the worst pizza i have ever had in my life as well . |
Original | i will be going back and enjoying this great place ! |
Human | i wo n’t be going back and suffering at this terrible place ! |
CrossAligned | i will be going back because from the _num_ stars place ! |
MultiDecoder | i will be going back and often at no place ! |
DeleteAndRetrieve | i will be going back and will not be returning into this anymore . |
BackTranslation | i will not be going back and this place is awful ! |
Ours1 | i will not be going back to this place for a while . |
Ours2 | i will not be going back to this place for a while . |
Ours3 | i wo n’t be going back to this place unless i ’m desperate . |
Sentiment transfer from negative to positive (Amazon) | |
---|---|
Original | ridiculous ! i had trouble getting it on with zero bubbles . |
Human | great ! i had no trouble getting it on with zero bubbles . |
CrossAligned | so far i have been using it for years and now . |
MultiDecoder | beautiful i have to replace it with after using the _num_ |
DeleteAndRetrieve | they are easy to use , i had trouble getting it on with zero bubbles . |
BackTranslation | flavorful ! i don t have used it to work with _num_ years . |
Ours1 | i have no trouble putting bubbles on it . |
Ours2 | i have had no trouble getting bubbles on it . |
Ours3 | i ve had no problems with bubbles on it . |
Original | i ve used it twice and it has stopped working . |
Human | used it without problems . |
CrossAligned | i have it s so it s just work well . |
MultiDecoder | i ve used it twice and it has gave together . |
DeleteAndRetrieve | i ve used it twice and it has performed well . |
BackTranslation | i ve been using this for _num_ years now and it works great . |
Ours1 | i ve used it several times and it works great . |
Ours2 | i ve used it several times and it has worked flawlessly . |
Ours3 | i ve used it for several months now and it has been working great . |
Original | i ve used these a few times and broke them very easily . |
Human | i ve used these a few times and loved them . |
CrossAligned | i ve had this for a few months and it s fine . |
MultiDecoder | i ve used these a few times and use the iphone very quickly . |
DeleteAndRetrieve | i ve used these a few times and broke them very easily ! . |
BackTranslation | i ve had this case for _num_ years and it works great . |
Ours1 | i ve used them a few times and they are very sturdy . |
Ours2 | i ve used them several times a week and they are very sturdy . |
Ours3 | i ve used these a few times and they are very sturdy . |
Sentiment transfer from positive to negative (Amazon) | |
Original | this product does what it is suppose to do . |
Human | this product does not do what it is supposed to do . |
CrossAligned | this product isn t work and i have used . |
MultiDecoder | this product does what it is supposed to do . |
DeleteAndRetrieve | this product did not do what it was suppose to do . |
BackTranslation | this product metropolis what it s like . |
Ours1 | this product does not do what it claims to do . |
Ours2 | this product does not do what it claims to do . |
Ours3 | this product does not do what it claims to do . |
Original | i would recommend to anyone who wants a pda . |
Human | i would not recommend this to anyone who wants a pda . |
CrossAligned | i would not recommend it to be a refund . |
MultiDecoder | i would recommend to anyone who has it into . |
DeleteAndRetrieve | i would not recommend this to anyone who wants a sensitive pda . |
BackTranslation | i wish i would give them a lot of them . |
Ours1 | i would not recommend this product to anyone . |
Ours2 | i would not recommend this to anyone who wants a <UNK> . |
Ours3 | i would not recommend this to anyone who wants a <UNK> . |
Original | i have been extremely happy with my purchase . |
Human | upset at purchase from the start . |
CrossAligned | i have been using them for my hair . |
MultiDecoder | i have been extremely happy with my review . |
DeleteAndRetrieve | i have been extremely disappointed with this purchase . |
BackTranslation | i was very disappointed with my phone . |
Ours1 | i am very disappointed with this purchase and would not purchase again . |
Ours2 | i have been extremely disappointed with my purchase . |
Ours3 | i am very disappointed with this purchase . |
Table 8 shows samples produced by our method and a strong baseline (BackTranslation) on the gender style transfer task.
Gender style transfer from male to female | |
---|---|
Original | i wish there is more than 0 stats to give you . |
BackTranslation | i think there ’ s than 0 stars to see you . |
Ours1 | i wish i could give more stars . |
Ours2 | i wish there would give more stars . |
Ours3 | i wish i could give more stars . |
Original | good vibe , good drinks and prices and unique decoration . |
BackTranslation | good service , good service , and the service and décoration . |
Ours1 | overall , the drinks were really good . |
Ours2 | overall , the drinks are really good and unique . |
Ours3 | the drinks are good , and the decor is cute . |
Original | the food was n’t anything outstanding to justify the price . |
BackTranslation | the food was kind of a good time to try the price . |
Ours1 | i ca n’t wait to go back and the food was n’t anything special . |
Ours2 | the food was n’t anything special . |
Ours3 | the food was n’t anything special . |
Original | the cost was more for the size than the quality . |
BackTranslation | the service itself was very good for the price that ’ s hotels . |
Ours1 | the portion size was more than enough for me . |
Ours2 | the portion size was more than enough for the size . |
Ours3 | the portion size was more than $ _num_ for the size of the portion . |
Gender style transfer from female to male | |
Original | we went here for my fiance ’ s birthday . |
BackTranslation | we went here for my wife ’ s anniversaire . |
Ours1 | went here for my wife ’ s birthday . |
Ours2 | went here for my wife ’ s birthday . |
Ours3 | went here for my wife ’ s birthday . |
Original | they always take such good care of me . |
BackTranslation | they always do a good job . |
Ours1 | they do a good job of taking care of you . |
Ours2 | they always take care of you . |
Ours3 | they do a good job of taking care of you . |
Original | if you do come for breakfast get a croissant . |
BackTranslation | if you are looking for lunch , has a stems . |
Ours1 | if you come here for breakfast , you get a breakfast sandwich . |
Ours2 | do n’t come here if you want a breakfast sandwich . |
Ours3 | breakfast croissant is a must if you come here for breakfast . |
Original | the only thing worth mentioning was their dessert . |
BackTranslation | only compared to say was their service . |
Ours1 | the only thing worth mentioning is the deserts . |
Ours2 | the only thing worth mentioning is the deserts . |
Ours3 | the only thing worth mentioning is the dessert . |
Samples of multiple fine-grained attribute control are shown in Table 9.
Multiple fine-grained attributes control (from negative to positive) | |
---|---|
Original | i was very disappointed with this place . |
Keywords | i love this place . |
Sentiment + Keywords | i love this place too . |
Length | i love this place , and i ’m so glad i went to the house . |
Sentiment + Length | i was very disappointed with this place , and i was not impressed with it . |
Keywords + Length | i was very impressed with this place and this place was very good . |
Sentiment + Keywords + Length | i was very impressed with . |
Length | very disappointed overall . |
Sentiment + Length | love this . |
Keywords + Length | i was very impressed with this place and love this place . |
Sentiment + Keywords + Length | i love this place . |
Original | at this location the service was terrible . |
Keywords | the location at location was very convenient . |
Sentiment + Keywords | the location is convenient and convenient . |
Length | the location at this location was convenient and the service was horrible . |
Sentiment + Length | this was the first time i went to this location and the service was terrible . |
Keywords + Length | the service at this location was great and the food was very good . |
Sentiment + Keywords + Length | the service at this location . |
Length | terrible customer service . |
Sentiment + Length | this location is convenient . |
Keywords + Length | the location at this location is great and the location is very convenient . |
Sentiment + Keywords + Length | location is convenient . |
Original | i ’ll keep looking for a different salon . |
Keywords | i love looking for this nail salon . |
Sentiment + Keywords | i love this nail salon for sure . |
Length | i love this place , and i ’ll be looking for a new nail . |
Sentiment + Length | i ’ll be looking for a different nail salon , and i do n’t know . |
Keywords + Length | i have been to this salon for a couple of the day , and it ’s always the same thing i have ever made . |
Sentiment + Keywords + Length | love this salon . |
Length | definitely a salon . |
Sentiment + Length | i love this nail salon . |
Keywords + Length | i love this place , and i ’ll be looking for a new nail . |
Sentiment + Keywords + Length | i love this nail salon . |
Multiple fine-grained attributes control (from positive to negative) | |
Original | the best mexican food in the phoenix area . |
Keywords | this is the best mexican restaurant in the area . |
Sentiment + Keywords | this was the worst chinese restaurant in the phoenix area . |
Length | this is the best mexican food i have had in the area and the area . |
Sentiment + Length | this was the worst chinese food i have had in the phoenix area in phoenix . |
Keywords + Length | this is the best mexican food in the area and the best restaurant in phoenix . |
Sentiment + Keywords + Length | this was the worst chinese restaurant i have ever been to in the entire area . |
Length | best mexican food . |
Sentiment + Length | the worst food in phoenix . |
Keywords + Length | best mexican restaurant in phoenix . |
Sentiment + Keywords + Length | the worst restaurant in the area . |
Original | thank you amanda , i will be back ! |
Keywords | thanks again , thank you angela ! |
Sentiment + Keywords | no thanks , i will not be back . |
Length | if you are in the mood , i will definitely be taking care of you . |
Sentiment + Length | if you want to be treated rudely , i will be taking care of you . |
Keywords + Length | thanks to steven , i will be back , thank you for my next experience ! |
Sentiment + Keywords + Length | if i asked him , i will be taking my car elsewhere , no thanks . |
Length | thank you ! |
Sentiment + Length | i will not be back . |
Keywords + Length | thanks again , thank you ! |
Sentiment + Keywords + Length | no thanks , thank you ! |
Original | service was great and food was even better . |
Keywords | terrible customer service and even better customer service . |
Sentiment + Keywords | the customer service was terrible even worse than it was . |
Length | the food was great , and the service was even better than i remembered it . |
Sentiment + Length | the food was terrible and the service was even worse than it was even worse . |
Keywords + Length | the customer service was terrible and the food was even worse than i remembered it . |
Sentiment + Keywords + Length | the customer service was terrible even though it was n’t even worse than before . |
Length | service was great . |
Sentiment + Length | the service was even worse . |
Keywords + Length | customer service was even better . |
Sentiment + Keywords + Length | even worse customer service was terrible . |
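The combined rows in Table 9 result from optimizing a latent representation against several attribute predictors at once. The toy sketch below illustrates that idea only; it is not the model used in the paper. It assumes each attribute predictor is a logistic model on the latent vector, and it replaces the learned content predictor with a simple L2 penalty that keeps the revised vector close to the original one. All names (`revise`, `content_weight`, the two-dimensional latent space) are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(weights, bias, z):
    """Toy attribute predictor: logistic regression on the latent vector z."""
    return sigmoid(sum(w * zi for w, zi in zip(weights, z)) + bias)

def revise(z0, predictors, targets, content_weight=0.1, lr=0.5, steps=200):
    """Gradient descent on z for: sum_k BCE(p_k(z), t_k) + content_weight * ||z - z0||^2.

    The L2 term is an assumed stand-in for a content-preservation objective.
    """
    z = list(z0)
    for _ in range(steps):
        # Gradient of the content penalty.
        grad = [2.0 * content_weight * (zi - z0i) for zi, z0i in zip(z, z0)]
        # Add the gradient of each attribute's cross-entropy term.
        for (weights, bias), target in zip(predictors, targets):
            err = predict(weights, bias, z) - target  # d(BCE)/d(logit)
            for i, w in enumerate(weights):
                grad[i] += err * w
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z

# Two toy attributes, each reading one latent dimension.
sentiment = ([1.0, 0.0], 0.0)
length = ([0.0, 1.0], 0.0)
z0 = [-1.0, -1.0]  # both attributes start in the "off" state

z_both = revise(z0, [sentiment, length], [1.0, 1.0])
z_sent = revise(z0, [sentiment], [1.0])
print(predict(*sentiment, z_both), predict(*length, z_both))
print(z_sent)
```

In this toy setting, revising only the sentiment attribute leaves the length dimension untouched, while adding the second predictor moves both. Raising `content_weight` keeps the revised vector closer to `z0`, which mirrors the trade-off between attribute transfer and content preservation discussed for Table 5.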