Text style transfer aims to change the language style of an input sentence to a target style, under the constraint that the style-independent content remains the same across the transfer. While several methods have been proposed for the task John et al. (2019); Smith et al. (2019); Jhamtani et al. (2017); Kerpedjiev (1992); Xu et al. (2012); Shen et al. (2017); Subramanian et al. (2018); Xu et al. (2018), they commonly model the distribution of the transfer outputs as a delta distribution, which implies a one-to-one mapping that converts an input sentence in one language style to a single corresponding sentence in the target language style.
We argue that a multimodal mapping is better suited to the text style transfer task. For example, the following two reviews:
“This lightweight vacuum is simply effective.”,
“This easy-to-carry vacuum picks up dust and trash amazingly well.”
would both be considered correct negative-to-positive transfer results for the input sentence, “This heavy vacuum sucks”. Furthermore, a one-to-many mapping allows a user to pick the preferred style transfer output at inference time.
In this paper, we propose a one-to-many text style transfer framework that can be trained using non-parallel text. That is, we assume the training data consists of two corpora of different styles, and no paired input and output sentences are available. The core of our framework is a latent decomposition scheme learned via adversarial training. We decompose the latent representation of a sentence into two parts: one encodes the style of the sentence, while the other encodes its style-independent content. At test time, to change the style of an input sentence, we first extract its content code. We then sample a sentence from the training dataset of the target style corpus and extract its style code. The two codes are combined to generate an output sentence, which carries the same content but in the target style. Sampling a different style sentence yields a different style code and hence a different style transfer output. We conduct experiments with comparison to several state-of-the-art approaches on multiple public datasets, including Yelp (Yelp Dataset Challenge) and Amazon He and McAuley (2016). The results, evaluated using various performance metrics, including content preservation, style accuracy, output diversity, and user preference, show that the model trained with our framework performs consistently better than the competing approaches.
Let and be two spaces of sentences of two different language styles. Let and be their corresponding latent spaces. We further assume and can be decomposed into two latent spaces and where and are the latent spaces that control the style variations in and and and are the latent spaces that control the style-independent content information. Since and are style-independent content representation, we have . For example, and may denote the spaces of negative and positive product reviews where the elements in encode the product and its features reviewed in a sentence, the elements in represent variations in negative styles such as the degree of preferences and the exact phrasing, and the elements in represent the corresponding variations in positive styles. The above modeling implies
A sentence can be decomposed to a content code and a style code .
A sentence can be reconstructed by fusing its content code and its style code .
To transfer a sentence in to a corresponding sentence in , one can simply fuse the content code with a style code where .
Figure 1 provides a visualization of the modeling.
Under this formulation, the text style transfer mechanism is given by a conditional distribution , where is the sentence generated by transferring sentence to the target domain . Note that existing works Fu et al. (2018); Shen et al. (2017) formulate the text style transfer mechanism as a one-to-one mapping that converts an input sentence to only a single corresponding output sentence. That is where is the Dirac delta function. As a result, they cannot be used to generate multiple style transfer outputs for an input sentence.
One-to-Many Style Transfer. To model the transfer function, we use a framework consisting of a set of networks, as visualized in Figure 2. It has a content encoder , a style encoder , and a decoder for each domain . In the following, we explain the framework in detail using the task of transferring from to . The task of transferring from to follows the same pattern.
The content encoder takes the sequence of elements as input and computes a content code , which is a sequence of vectors describing the sentence’s style-independent content. The style encoder converts to a style code , which is a pair of vectors. Note that we will use and as the new mean and standard deviation of the feature activation of the input for the style transfer task of converting a sentence in to a corresponding sentence in . Specifically, we combine the content code and the style code using a composition function , which will be discussed momentarily, to obtain . Then, we use the decoder to map the representation to the output sequence . Note that is extracted from a randomly sampled , and by sampling a different sentence, say where , we have and hence a different style transfer output. By treating style variations as sample-able quantities, we achieve one-to-many style transfer output capability.
The combination function is given by
where denotes element-wise product, denotes element-wise division, and indicate the operation of computing mean and standard deviation for the content latent code by treating each vector in as an independent realization of a random variable. In other words, the latent representation is constructed by first normalizing the content code in the latent space and then applying the non-linear transformation whose parameters are provided by a sentence of the target style. Since contains no learnable parameters, we consider as part of the decoder. This design draws inspiration from image style transfer works Huang and Belongie (2017); Dumoulin et al. (2016), which show that image style transfer can be achieved by controlling the mean and variance of the feature activations in the neural networks. We hypothesize that the same holds for the text style transfer task and apply it to achieve the one-to-many style transfer capability.
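The normalize-then-rescale step described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `column_stats` and `combine`, the stabilizer `eps`, and the toy dimensions are hypothetical and are not taken from the paper.

```python
import math
import random

def column_stats(code):
    """Per-dimension mean and standard deviation over the sequence axis."""
    length, dim = len(code), len(code[0])
    means = [sum(vec[d] for vec in code) / length for d in range(dim)]
    stds = [
        math.sqrt(sum((vec[d] - means[d]) ** 2 for vec in code) / length)
        for d in range(dim)
    ]
    return means, stds

def combine(content_code, style_mean, style_std, eps=1e-5):
    """Normalize the content code, then scale and shift it with the style code.

    content_code: list of T vectors (each a list of D floats).
    style_mean, style_std: lists of D floats (the two halves of a style code).
    eps is an assumed stabilizer, not a value from the paper.
    """
    means, stds = column_stats(content_code)
    return [
        [
            style_std[d] * (vec[d] - means[d]) / (stds[d] + eps) + style_mean[d]
            for d in range(len(vec))
        ]
        for vec in content_code
    ]

random.seed(0)
content = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
# Two style codes, standing in for codes extracted from two target-style sentences.
z_a = combine(content, style_mean=[2.0] * 4, style_std=[0.5] * 4)
z_b = combine(content, style_mean=[-1.0] * 4, style_std=[3.0] * 4)
```

The same content code paired with two different style codes gives two different decoder inputs, which is the mechanism behind the one-to-many behavior.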
Network Design. We realize the content encoder using a convolutional network. To ensure the length of the output sequence is equal to the length of the input sentence, we pad the input by zero vectors on both the left and right sides, where is the length of the input sequence as discussed in Gehring et al. (2017). We do not use any strided convolutions. We also realize the style encoder using a convolutional network. To extract the style code, after several convolution layers, we apply global average pooling and then project the results to and using a two-layer multi-layer perceptron. We apply the log-exponential nonlinearity to compute to ensure the outputs are strictly positive, as required for modeling the deviations. The decoder is realized using a convolutional network with an attention mechanism followed by a convolutional sequence-to-sequence network (ConvS2S) Gehring et al. (2017). We build our method on ConvS2S, but it can be extended to work with transformer models Vaswani et al. (2017); Devlin et al. (2018); Radford et al. (2019). Further details are given in the appendix.
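As a rough illustration of the style-code head described above, the sketch below pools a sequence of feature vectors and maps the result to a mean and a positive deviation. It assumes that the "log-exponential nonlinearity" is the softplus function log(1 + exp(x)). The single linear layers and all weights are hypothetical stand-ins for the paper's two-layer perceptron.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x)); positive for moderate inputs."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def global_average_pool(features):
    """Average a list of T feature vectors into a single vector."""
    length = len(features)
    return [sum(vec[d] for vec in features) / length for d in range(len(features[0]))]

def linear(weights, bias, vector):
    """Apply one affine layer: weights is a list of rows, bias a list of floats."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]

def style_code(features, mean_layer, std_layer):
    """Return (mean, std); each layer is a hypothetical (weights, bias) pair."""
    pooled = global_average_pool(features)
    mean = linear(*mean_layer, pooled)
    std = [softplus(v) for v in linear(*std_layer, pooled)]
    return mean, std

features = [[1.0, -2.0], [3.0, 0.0]]  # T = 2 positions, 2 channels
mean_layer = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
std_layer = ([[-1.0, 0.0], [0.0, -1.0]], [0.0, 0.0])
mean, std = style_code(features, mean_layer, std_layer)
```

Even when the pre-activation of the deviation is negative, the softplus keeps the resulting value positive, so it can serve as a scale.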
2.1 Learning Objective
We train our one-to-many text style transfer model by minimizing multiple loss terms.
Reconstruction loss. We use a reconstruction loss to regularize the text style transfer learning. Specifically, we assume the pair of content encoder and style encoder and the decoder form an auto-encoder. We train them by minimizing the negative log likelihood of the training corpus:
where , and denote the parameters of , , and respectively.
For each training sentence, synthesizes the output sequence by predicting the most probable token based on the latent representation and the previous output predictions , so that the probability of a sentence can be calculated by
where denotes the token index and is the sentence length. Following Gehring et al. (2017), the probability of a token is computed by applying a softmax to a linear projection of the decoder output.
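The factorization of the sentence probability into per-token probabilities can be made concrete with a small sketch. It assumes teacher forcing and that the decoder has already produced one vector of logits per position. The function names and the toy vocabulary size are illustrative only.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sentence_nll(step_logits, target_ids):
    """Negative log-likelihood of a token sequence.

    step_logits: one list of vocabulary logits per output position.
    target_ids: the reference token index at each position.
    """
    nll = 0.0
    for logits, token in zip(step_logits, target_ids):
        nll -= math.log(softmax(logits)[token])
    return nll

# Three positions, a toy vocabulary of four tokens, uniform predictions.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = sentence_nll(uniform_logits, [0, 1, 2])
```

Summing the per-token negative log-probabilities is equivalent to taking the negative log of the product of token probabilities, which is the quantity the reconstruction loss minimizes over the corpus.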
Back-translation loss. Recent studies Prabhumoye et al. (2018); Sennrich et al. (2015); Brislin (1970) show that the back-translation loss, which is closely related to the cycle-consistency loss Zhu et al. (2017a) used in computer vision, helps preserve the content of the input. Inspired by these studies, we adopt a back-translation loss to regularize the learning. As shown in Figure 3, we transfer the input to the other style domain . We then transfer it back to the original domain by using its original style code . The resulting sentence should be as similar as possible to the original input . In other words, we minimize the discrepancy between and given by
where . We also define in a similar way.
To avoid the non-differentiability of the beam search Och and Ney (2004); Sutskever et al. (2014), we replace the hard decoding of with a set of differentiable non-linear transformations between the decoder and the content encoder when minimizing the back-translation loss. The non-linear transformations project the feature activation of the second-to-last layer of the decoder to the second layer of the content encoder . These non-linear projections are learned by a multilayer perceptron (MLP), which is trained jointly with the text style transfer task. We also apply the same mechanism to compute . This way, our model can be trained purely using back-propagation.
To ensure the MLP correctly projects the feature activation to the second layer of , we enforce the output of the MLP to be as similar as possible to the feature activation of the second layer of . This is based on the idea that and should have the same content code across the transfer, so their feature activations in the content encoder should also be the same. Accordingly, we apply a mean square error (MSE) loss to achieve this objective:
where and denote the function for computing feature activation of the second layer of and , respectively. The loss for the other domain is defined in a similar way.
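A minimal sketch of this matching term follows, assuming both activations are available as equally shaped lists of vectors. The variable names and toy values are hypothetical.

```python
def mse(projected, target):
    """Mean squared error between two equally shaped lists of vectors."""
    total, count = 0.0, 0
    for p_vec, t_vec in zip(projected, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count

# Output of the bridging MLP versus the content encoder's second-layer activation.
mlp_output = [[0.2, 0.4], [0.1, 0.9]]
encoder_activation = [[0.0, 0.4], [0.1, 1.1]]
loss = mse(mlp_output, encoder_activation)
```

Driving this value toward zero pushes the MLP to reproduce the activation that the content encoder would have computed from the decoded sentence.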
Style classification loss. During learning, we enforce a style classification loss on the style code with the standard cross-entropy loss . This encourages the style code to capture the stylistic properties of the input sentences.
Adversarial loss. We use GANs Goodfellow et al. (2014) to match the distribution of the decoder's input latent code in the reconstruction stream to the distribution of the decoder's input latent code in the translation stream. That is, (1) we match the distribution of to the distribution of , and (2) we match the distribution of to the distribution of . This way, we ensure that the distribution of the transfer outputs matches the distribution of the target style sentences, since they use the same decoder. As we apply adversarial training to the latent representation, we also avoid dealing with the non-differentiability of beam search.
The adversarial loss for the second domain is given by
where is the discriminator which aims at distinguishing the latent representation of the sentence from . The adversarial loss is defined in a similar manner.
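For orientation, the sketch below writes out the standard GAN objective of Goodfellow et al. (2014) applied to latent codes. It is an assumption that the paper uses this exact form. In particular, treating the reconstruction-stream latents as the "real" class and using the non-saturating generator loss are choices made here only for illustration.

```python
import math

def discriminator_loss(scores_reconstruction, scores_transfer):
    """Binary cross-entropy for a discriminator that outputs probabilities.

    scores_reconstruction: D(z) for latents from the reconstruction stream.
    scores_transfer: D(z) for latents from the translation stream.
    """
    real = sum(math.log(s) for s in scores_reconstruction) / len(scores_reconstruction)
    fake = sum(math.log(1.0 - s) for s in scores_transfer) / len(scores_transfer)
    return -(real + fake)

def generator_loss(scores_transfer):
    """Non-saturating loss for the networks that produce the transfer latents."""
    return -sum(math.log(s) for s in scores_transfer) / len(scores_transfer)

# A discriminator that cannot tell the two streams apart outputs 0.5 everywhere.
d_loss = discriminator_loss([0.5, 0.5], [0.5, 0.5])
g_loss = generator_loss([0.5, 0.5])
```

Because both losses are computed on continuous latent codes rather than on decoded tokens, gradients flow to the encoders without passing through a discrete decoding step.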
Overall learning objective. We then learn a one-to-many text style transfer model by solving
In the following, we first introduce the datasets and evaluation metrics and then present the experiment results with comparison to the competing methods.
Datasets. We use the following datasets.
Amazon product reviews (Amazon) He and McAuley (2016) contains positive and negative review sentences for training, and positive and negative review sentences for testing. The length of a sentence ranges from to words. We use this dataset for converting a negative product review to a positive one, and vice versa. Our evaluation follows the protocol described in Li et al. (2018).
Yelp restaurant reviews (Yelp), from the Yelp Dataset Challenge, contains a training set of positive and negative sentences, and a test set of positive and negative sentences. The length of a sentence ranges from to words. We use this dataset for converting a negative restaurant review to a positive one, and vice versa. We use two evaluation settings: Yelp500 and Yelp25000. Yelp500 is proposed by Li et al. (2018) and includes randomly sampled positive and negative sentences from the test set, while Yelp25000 includes randomly sampled positive and negative sentences from the test set.
Evaluation metrics. We evaluate a text style transfer model on several aspects. Firstly, the transfer output should carry the target style (style score). Secondly, the style-independent content should be preserved (content preservation score). We also measure the diversity of the style transfer outputs for an input sentence (diversity score).
Style score. We use a classifier to evaluate the fidelity of the style transfer results Fu et al. (2018); Shen et al. (2017). Specifically, we apply the Byte-mLSTM Radford et al. (2017) to classify the output sentence generated by a text style transfer model. When transferring a negative sentence to a positive one, a good transfer model should generate a sentence that the classifier labels as positive. The overall style transfer performance of a model is then given by the average accuracy on the test set measured by the classifier.
Content score. We build a style-independent distance metric that can quantify content similarity between two sentences, by comparing embeddings of the sentences after removing their style words. Specifically, we compute the embedding of each non-style word in the sentence using word2vec Mikolov et al. (2013). Next, we compute the average embedding, which serves as the content representation of the sentence. The content similarity between two sentences is given by the cosine distance of their average embeddings. We compute the relative n-gram frequency to determine which word is a style word, based on the observation that the language style is largely encoded in the n-gram distribution Xu et al. (2012). This is similar in spirit to the term frequency-inverse document frequency analysis Sparck Jones (1972). Let and be the n-gram frequencies of two corpora of different styles. The style magnitude of an n-gram in style domain is given by
where is a small constant. We use -gram. A word is considered a style word if is greater than a threshold.
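The content score pipeline can be illustrated end to end on toy data. Several pieces below are assumptions rather than the paper's definitions: the style magnitude is taken to be a smoothed frequency ratio (in the spirit of the salience measure of Li et al. (2018)), the smoothing constant and threshold are arbitrary, unigrams are used, and the three-dimensional vectors merely stand in for word2vec embeddings.

```python
import math
from collections import Counter

def style_magnitude(ngram, counts_here, counts_other, smoothing=1.0):
    """Smoothed frequency ratio (assumed form; smoothing value is hypothetical)."""
    return (counts_here[ngram] + smoothing) / (counts_other[ngram] + smoothing)

def style_words(counts_here, counts_other, threshold):
    """Words whose style magnitude in this corpus exceeds the threshold."""
    return {w for w in counts_here
            if style_magnitude(w, counts_here, counts_other) > threshold}

def content_vector(tokens, embeddings, ignored):
    """Average the embeddings of the tokens that are not style words."""
    kept = [embeddings[t] for t in tokens if t in embeddings and t not in ignored]
    dim = len(next(iter(embeddings.values())))
    if not kept:
        return [0.0] * dim
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

negative = Counter("the vacuum sucks the food was terrible".split())
positive = Counter("the vacuum works great the food was great".split())
ignored = style_words(negative, positive, 1.5) | style_words(positive, negative, 1.5)

# Toy 3-dimensional vectors standing in for word2vec embeddings.
embeddings = {"the": [1.0, 0.0, 0.0], "vacuum": [0.0, 1.0, 0.0],
              "food": [0.0, 0.0, 1.0], "was": [1.0, 0.0, 1.0]}
source = content_vector("the vacuum sucks".split(), embeddings, ignored)
transfer = content_vector("the vacuum works great".split(), embeddings, ignored)
unrelated = content_vector("the food was great".split(), embeddings, ignored)
```

In this toy setting the sentiment-bearing words are filtered out, so the source and its transfer map to the same content vector, while a sentence about a different subject scores lower.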
Diversity score. To quantify the diversity of the style transfer outputs, we resort to the self-BLEU score proposed by Zhu et al. (2018). Given an input sentence, we apply the style transfer model 5 times to obtain 5 outputs. We then compute self-BLEU scores between any two generated sentences (10 pairs). We apply this procedure to all the sentences in the test set and compute the average self-BLEU score . After that, we define the diversity score as . A higher diversity score means that the model is better at generating diverse outputs. In the experiments, we denote Diversity- as the diversity score computed by using self-BLEU-.
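The pairwise procedure can be sketched as follows. Two simplifications are made and should be read as assumptions: a clipped n-gram precision is used as a stand-in for the full BLEU score (no brevity penalty, a single n-gram order), and the diversity score is taken to be one minus the average pairwise overlap, since the exact definition is not recoverable from the text.

```python
from collections import Counter
from itertools import combinations

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference."""
    hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return overlap / total

def diversity_score(outputs, n):
    """One minus the mean pairwise overlap over all unordered output pairs."""
    pairs = list(combinations(outputs, 2))
    mean_overlap = sum(
        clipped_precision(a.split(), b.split(), n) for a, b in pairs
    ) / len(pairs)
    return 1.0 - mean_overlap

identical = ["the food is great"] * 5
disjoint = ["a b c", "d e f", "g h i", "j k l", "m n o"]
pair_count = len(list(combinations(identical, 2)))
```

Five outputs yield ten unordered pairs. Five copies of the same sentence receive the lowest possible score, and five sentences with no shared n-grams receive the highest.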
Implementation. We use the convolutional sequence-to-sequence model Gehring et al. (2017). Our content and style encoders consist of convolution layers, respectively. The decoder has convolution layers. The content and style codes are dimensional. We use the PyTorch Paszke et al. (2017) and fairseq Ott et al. (2019) libraries and train our model using a single GeForce GTX 1080 Ti GPU. We use the SGD algorithm with the learning rate set to . Once the content and style scores converge, we reduce the learning rate by an order of magnitude after every epoch until it reaches . Detailed model parameters are given in the appendix.
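The learning-rate rule can be written as a small helper. The initial rate, the floor, and the epoch at which the scores are considered converged are all hypothetical values chosen for illustration, since the actual numbers are not given in this text.

```python
def next_learning_rate(lr, scores_converged, floor, factor=0.1):
    """Divide the rate by ten per epoch once the scores converge, down to a floor."""
    if not scores_converged:
        return lr
    return max(lr * factor, floor)

lr = 0.25            # hypothetical initial learning rate
history = []
for epoch in range(6):
    history.append(lr)
    # Assume, for illustration, that the scores converge from epoch 2 onward.
    lr = next_learning_rate(lr, scores_converged=epoch >= 2, floor=1e-4)
```

The rate stays constant before convergence, then drops by an order of magnitude each epoch and is clamped once it would fall below the floor.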
Baselines. We compare the proposed approach to the following competing methods.
CAE Shen et al. (2017) is based on an auto-encoder and is trained using a GAN framework. It assumes a shared content latent space between different domains and computes the content code by using a content encoder. The output is generated with a pre-defined binary style code.
MD Fu et al. (2018) extends the CAE to work with multiple style-specific decoders. It learns style-independent representations by adversarial training and generates output sentences by using style-specific decoders.
BTS Prabhumoye et al. (2018) learns style-independent representations by using back-translation techniques. BTS assumes the latent representation of the sentence preserves the meaning after machine translation.
DR Li et al. (2018) employs retrieval techniques to find similar sentences with the desired style. It uses neural networks to fuse the input and the retrieved sentences for generating the output.
CopyPast simply uses the input as the output, which serves as a reference for evaluation.
3.1 Results on One-to-Many Style Transfer
Our model can generate different text style transfer outputs for an input sentence. To generate multiple outputs for an input, we randomly sample a style code from the target style training dataset during testing. Since the CAE Shen et al. (2017) and BTS Prabhumoye et al. (2018) are not designed for the one-to-many style transfer, we extend their methods to achieve this capability by injecting random noise, termed CAE+noise and BTS+noise. Specifically, we add random Gaussian noise to the latent code of their models during training, which is based on the intuition that the randomness would result in different activations in the networks, leading to different outputs. Table 2 shows the average diversity scores achieved by the competing methods over runs. We find that our method performs favorably against others.
User Study. We conduct a user study to evaluate one-to-many style transfer performance using the Amazon Mechanical Turk (AMT) platform. We set up the pairwise comparison following Prabhumoye et al. (2018). Given an input sentence and two sets of model-generated sentences (5 sentences per set), the workers are asked to choose which set has more diverse sentences with the same meaning, and which set provides more desirable sentences considering both content preservation and style transfer. These are denoted as Diversity and Overall in Table 2. The workers are also asked to compare the transfer quality in terms of grammaticality and fluency, which is denoted as Fluency. For each comparison, a third option, No Preference, is given for cases where both are equally good or bad.
We randomly sampled sentences from the Yelp500 test set for the user study. Each comparison is evaluated by at least three different workers. We received more than responses from AMT, and the results are summarized in Table 2. Our method outperforms the competing methods by a large margin in terms of diversity, fluency, and overall quality. In the appendix, we present further details of the comparisons with different variants of CAE+noise and BTS+noise. Our method achieves significantly better performance. Table 3 shows the qualitative results of the proposed method. Our proposed method generates multiple different style transfer outputs for restaurant reviews and lyrics (we use the country song lyrics and romance novel collections, which are available in the Stylish descriptions dataset Chen et al. (2019)).
3.2 More Results and Ablation Study
In addition to generating multiple style transfer outputs, our model can also generate high-quality style transfer outputs. In Figure 7, we compare the quality of our style transfer outputs with those from the competing methods. We show the performance of our model using the style–content curve, where each point on the curve is the style score and the content score achieved at a given training iteration. In Figure 7(a), given a fixed content preservation score, our method achieves a better style score on the Amazon dataset. Similarly, given a fixed style score, our model achieves a better content preservation score. The results on the Yelp500 and Yelp25000 datasets show a similar trend, as shown in Figure 7(b) and Figure 7(c), respectively.
The style–content curve also depicts the behavior of the proposed model during the entire learning process. As visualized in Figure 8, we find that our model achieves a high style score but a low content score in the early training stage. With more iterations, our model improves the content score at the expense of a reduced style score. To strike a balance between the two scores, we decrease the learning rate when the model reaches a similar number for the two scores.
User Study. We also conduct a user study on the transfer output quality. Given an input sentence with two generated style transferred sentences from two different models (the sentences generated by other methods have been made publicly available by Li et al. (2018)), workers are asked to compare the transfer quality of the two generated sentences in terms of content preservation, style transfer, fluency, and overall performance, respectively. We received more than responses from the AMT platform, and the results are summarized in Table 4. We observe that No Preference was chosen more often than the other options, which shows that existing methods may not fully satisfy human expectations. However, our method achieves comparable or better performance than the prior works.
Ablation Study. We conduct a study where we consider three different designs of the proposed model. (1) full: This is the full version of the proposed model; (2) sharing-encoders: In this case, we have a content encoder and a style encoder that are shared by the two domains; (3) sharing-decoder: In this case, we have a decoder that is shared by the two domains. Through this study, we aim to determine whether regularization via weight-sharing is beneficial to our approach.
Table 5 shows the comparison of our method using different designs. The sharing-encoders baseline performs much better than the sharing-decoder baseline, and our full method performs the best. The results show that the style-specific decoder is more effective for generating target-style outputs. On the other hand, the style-specific encoder extracts more domain-specific style codes from the inputs. Weight-sharing schemes do not lead to a better performance.
Impact of the loss terms. In the appendix, we present an ablation study on the loss terms, which shows that all the terms in our objective function are important.
4 Related Works
is a core problem in natural language processing. It has a wide range of applications including machine translation Johnson et al. (2017); Wu et al. (2016), image captioning Vinyals et al. (2015), and dialogue systems Li et al. (2016a, b). Recent studies Devlin et al. (2018); Gehring et al. (2017); Graves (2013); Johnson et al. (2017); Radford et al. (2019); Wu et al. (2016) proposed to train deep neural networks using maximum-likelihood estimation (MLE) for computing the lexical translation probabilities in parallel corpora. Though these methods are effective, acquiring a parallel corpus is difficult for many language tasks.
Text style transfer has a longstanding history Kerpedjiev (1992). Early studies utilize strong supervision on parallel corpora Rao and Tetreault (2018); Xu (2017); Xu et al. (2012). However, the lack of parallel training data renders these methods inapplicable to many text style transfer tasks. Instead of training with paired sentences, recent studies Fu et al. (2018); Hu et al. (2017); Prabhumoye et al. (2018); Shen et al. (2017); Li et al. (2019) addressed this problem by using adversarial learning techniques. In this paper, we argue that while the existing methods address the parallel data acquisition difficulty, they do not address the diversity problem in the translated outputs. We address the issue by formulating text style transfer as a one-to-many mapping problem and demonstrate one-to-many style transfer results.
Generative adversarial networks (GANs) Arjovsky et al. (2017); Goodfellow et al. (2014); Salimans et al. (2016); Zhu et al. (2017a) have achieved great success on image generation Huang et al. (2018); Zhu et al. (2017b). Several attempts have been made to apply GANs to the text generation task Guo et al. (2018); Lin et al. (2017); Yu et al. (2017); Zhang et al. (2017). However, these methods are based on unconditional GANs and tend to generate context-free sentences. Our method is different in that our model is conditioned on the content and style codes, which allows a more controllable style transfer.
We have presented a novel framework for generating different style transfer outputs for an input sentence. This was achieved by modeling the style transfer as a one-to-many mapping problem with a novel latent decomposition scheme. Experimental results showed that the proposed method achieves better performance than the baselines in terms of the diversity and the overall quality.
- Wasserstein gan. arXiv preprint arXiv:1701.07875. Cited by: §4.
- Back-translation for cross-cultural research. Journal of cross-cultural psychology 1 (3), pp. 185–216. Cited by: §2.1.
- Unsupervised stylish image description generation via domain layer norm. In Proc. AAAI, Cited by: footnote 1.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2, §4.
- A learned representation for artistic style. In Proc. ICLR, Cited by: §2.
- Style transfer in text: exploration and evaluation. In Proc. AAAI, Cited by: §2, 1st item, 2nd item, §4.
- Convolutional sequence to sequence learning. In Proc. ICML, Cited by: Appendix F, §2.1, §2, §3, §4.
- Generative adversarial nets. In Proc. NeurIPS, Cited by: §2.1, §4.
- Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §4.
- Long text generation via adversarial training with leaked information. In Proc. AAAI, Cited by: §4.
- Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proc. WWW, Cited by: §1, 1st item.
- Toward controlled generation of text. In Proc. ICML, Cited by: §4.
- Arbitrary style transfer in real-time with adaptive instance normalization. In Proc. ICCV, Cited by: §2.
- Multimodal unsupervised image-to-image translation. In Proc. ECCV, Cited by: §4.
- Shakespearizing modern language using copy-enriched sequence to sequence models. In Proc. EMNLP Workshop on Stylistic Variation, Cited by: §1.
- Disentangled representation learning for non-parallel text style transfer. In Proc. ACL, Cited by: §1.
- Google’s multilingual neural machine translation system: enabling zero-shot translation. TACL. Cited by: §4.
- Generation of informative texts with style. In Proc. COLING, Cited by: §1, §4.
- Domain adaptive text style transfer. In Proc. EMNLP, Cited by: §4.
- A persona-based neural conversation model. In Proc. ACL, Cited by: §4.
- Deep reinforcement learning for dialogue generation. arXiv preprint arXiv:1606.01541. Cited by: §4.
- Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proc. NAACL, Cited by: 1st item, 2nd item, 4th item, footnote 2.
- Adversarial ranking for language generation. In Proc. NeurIPS, Cited by: §4.
- Distributed representations of words and phrases and their compositionality. In Proc. NeurIPS, Cited by: 2nd item.
- The alignment template approach to statistical machine translation. Computational linguistics. Cited by: §2.1.
- Fairseq: a fast, extensible toolkit for sequence modeling. In Proc. NAACL Demonstrations, Cited by: §3.
- BLEU: a method for automatic evaluation of machine translation. In Proc. ACL, Cited by: Appendix D.
- Automatic differentiation in PyTorch. In Proc. NeurIPS Autodiff Workshop, Cited by: §3.
- Style transfer through back-translation. In Proc. ACL, Cited by: Appendix A, §2.1, 3rd item, §3.1, §3.1, §4.
- Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444. Cited by: 1st item.
- Language models are unsupervised multitask learners. OpenAI Tech Report. Cited by: §2, §4.
- Dear sir or madam, may i introduce the yafc corpus: corpus, benchmarks and metrics for formality style transfer. In Proc. NAACL, Cited by: §4.
- Improved techniques for training gans. In Proc. NeurIPS, Cited by: §4.
- Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709. Cited by: §2.1.
- Style transfer from non-parallel text by cross-alignment. In Proc. NeurIPS, Cited by: §1, §2, 1st item, 1st item, §3.1, §4.
- Zero-shot fine-grained style transfer: leveraging distributed continuous style representations to transfer to unseen styles. arXiv preprint arXiv:1911.03914. Cited by: §1.
- A statistical interpretation of term specificity and its application in retrieval. Journal of documentation. Cited by: 2nd item.
- Multiple-attribute text style transfer. arXiv preprint arXiv:1811.00552. Cited by: §1.
- Sequence to sequence learning with neural networks. In Proc. NeurIPS, Cited by: §2.1.
- Attention is all you need. In Proc. NeurIPS, Cited by: §2.
- Show and tell: a neural image caption generator. In Proc. CVPR, Cited by: §4.
- Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §4.
- Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach. In Proc. ACL, Cited by: §1.
- Paraphrasing for style. Proc. COLING. Cited by: §1, 2nd item, §4.
- From shakespeare to twitter: what are language styles all about?. In Proc. EMNLP Workshop on Stylistic Variation, Cited by: §4.
-  Yelp Dataset Challenge. Note: https://www.yelp.com/dataset/challenge Cited by: §1, 2nd item.
- Seqgan: sequence generative adversarial nets with policy gradient. In Proc. AAAI, Cited by: §4.
- Adversarial feature matching for text generation. arXiv preprint arXiv:1706.03850. Cited by: §4.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proc. ICCV, Cited by: §2.1, §4.
- Toward multimodal image-to-image translation. In Proc. NeurIPS, Cited by: §4.
- Texygen: a benchmarking platform for text generation models. Proc. SIGIR. Cited by: 3rd item.
Appendix A User Study
To control the quality of human evaluation, we conduct a pilot study to design and improve our evaluation questionnaire. We invite 23 participants who are native or proficient English speakers to evaluate the sentences generated by different methods. For each participant, we randomly present sentences from the Yelp500 test set, and the corresponding style transferred sentences generated by different models. We ask the participants to vote for the transferred sentence whose meaning they consider most closely related to the original sentence but with the opposite sentiment. However, we find that it may be difficult to interpret the evaluation results in terms of detailed transfer quality.
Therefore, instead of asking the participants to directly vote for one sentence, we switch the task to evaluating the sentences in terms of four different aspects: style transfer, content preservation, fluency and grammaticality, and overall performance. Following Prabhumoye et al. (2018), for each pairwise comparison, a third option, No Preference, is given for cases where both are equally good or bad. Figure 11 and Figure 12 show the instructions and the guidelines of our questionnaire for human evaluation on the Amazon Mechanical Turk platform. We refer the reader to Section 3 in the main paper for the details of the human evaluation results.
To evaluate the performance of one-to-many style transfer, we extend the pair-wise comparison to a set-wise comparison. Given an input sentence and two sets of model-generated sentences (5 sentences per set), the workers are asked to choose which set has more diverse sentences with the same meaning, and which set provides more desirable sentences considering both content preservation and style transfer. We also ask the workers to compare the transfer quality in terms of content preservation, style transfer, grammaticality, and fluency.
Appendix B Diversity Baselines
We report further comparisons with different variants of CAE and BTS. We added random Gaussian noise to the style codes of CAE and BTS, respectively. Specifically, we randomly sample the noise from the Gaussian distribution with and , respectively. We empirically found that the generations are of poor quality when . Thus, we evaluated the baselines with in the experiments. In addition, we explored other extensions to increase the diversity of the sequences generated by the baselines. For example, we expanded the set of generations by randomly selecting a beam search size for each generation.
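The two diversity baselines described above can be illustrated with a minimal sketch. It assumes a style code is a plain list of floats, and the function names, the noise scale `sigma=0.1`, and the candidate beam sizes are all hypothetical placeholders rather than the settings used in the experiments.

```python
import random

def perturb_style_code(style_code, sigma, rng):
    """Return a copy of style_code with i.i.d. N(0, sigma^2) noise added to each entry."""
    return [s + rng.gauss(0.0, sigma) for s in style_code]

def sample_beam_size(candidates, rng):
    """Pick one beam search size uniformly at random for a single generation."""
    return rng.choice(candidates)

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
style_code = [0.2, -0.5, 1.0, 0.0]     # toy style code (hypothetical values)

# Five noisy variants of the same style code, one per requested output.
variants = [perturb_style_code(style_code, sigma=0.1, rng=rng) for _ in range(5)]

# One randomly chosen beam size per generation (candidate sizes are assumed).
beam_sizes = [sample_beam_size([1, 2, 4, 8], rng) for _ in range(5)]
```

Each perturbed code would then be passed to the baseline decoder in place of the original style code; a larger `sigma` moves the code further from the learned style and, as noted above, can degrade generation quality.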
Appendix C Additional One-to-Many Style Transfer User Study Results
We report the human evaluation with comparisons to different variants of CAE and BTS. Similar to the human study presented in the main paper, we conduct the evaluation using Amazon Mechanical Turk. We randomly sampled sentences from the Yelp test set for the user study. Each comparison is evaluated by at least three experts whose HIT Approval Rate is greater than . We received more than responses, and the results are summarized in Table 6 and Table 7. We observed that previous models achieve higher style scores, but their output sentences often follow a generic format and may neither preserve the content nor be grammatically correct. In contrast, our method achieves significantly better performance than the baselines in terms of diversity, fluency, and overall quality.
Appendix D Ablation Study on Objective Function
The proposed objective function consists of five different learning objectives. We conduct an ablation study to understand how each loss function contributes to the performance. Since the adversarial loss is essential for domain alignment, we keep it fixed and evaluate different combinations of the reconstruction loss, the back-translation loss (together with the mean square loss), and the style loss.
We report the style score and the content preservation score in this experiment. We additionally present the BLEU score Papineni et al. (2002), which is a common metric for evaluating the performance of machine translation. A higher BLEU score indicates that the model is better at producing reasonable sentences. As shown in Table 8, we find that training without the reconstruction loss may not produce reasonable sentences according to the BLEU score. Training with the reconstruction loss works well for content preservation, yet it performs less favorably for style transfer. The back-translation loss improves both the style and content preservation scores, since it encourages the content and style representations to be disentangled. When trained with the style loss, our model improves the style accuracy, yet performs worse on content preservation. Overall, we observe that training with all the objective terms achieves a balanced performance across the different evaluation scores. The results show that the reconstruction loss, the back-translation loss, and the style loss are all important for style transfer.
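As a rough illustration of how such an ablation grid can be organized, the sketch below enumerates every subset of the three optional loss groups while always keeping the adversarial loss, and attaches the mean square loss whenever the back-translation loss is active. The term names are hypothetical labels, and the sketch lists all possible subsets, which may be more than the combinations actually reported in Table 8.

```python
from itertools import combinations

# Loss groups that are switched on or off in the ablation (labels are assumptions).
OPTIONAL_TERMS = ["reconstruction", "back_translation", "style"]

def ablation_configs():
    """List every combination of optional terms; the adversarial loss is always kept."""
    configs = []
    for r in range(len(OPTIONAL_TERMS) + 1):
        for subset in combinations(OPTIONAL_TERMS, r):
            terms = ["adversarial", *subset]
            # The mean square loss is tied to the back-translation loss.
            if "back_translation" in subset:
                terms.append("mean_square")
            configs.append(terms)
    return configs

configs = ablation_configs()
for terms in configs:
    print(" + ".join(terms))
```

The last configuration in the list contains all five objectives and corresponds to the full model.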
Appendix E Style Code Sampling Scheme
We design a sampling scheme that can lead to a more accurate style transfer. During inference, our network takes the input sentence as a query and retrieves a pool of target style sentences whose content is similar to the query. We measure the similarity by computing the cosine similarity between the sentence embeddings. Next, we randomly sample a target style code from the retrieved pool and generate the output sentence. The test-time sampling scheme improves the content preservation score from to , and achieves a similar style score (from to ) on the Yelp25000 test set. The results show that it is possible to improve content preservation by using the top-ranked target style sentences.
We provide further analysis of the sampling scheme for the training phase. Specifically, during training, we sample the target style code from the pool of top-ranked sentences in the target style domain. Figure 9 shows the content preservation scores of our method using different sampling schemes. The results suggest that we can improve content preservation by learning with the style codes extracted from the top-ranked sentences in the target style domain. However, we noticed that this sampling scheme effectively reduces the amount of training data, which makes it more challenging for the model to learn the style transfer function, as shown in Figure 10. The results suggest that the sampling scheme is better suited to the inference phase.
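The retrieve-then-sample procedure can be sketched as follows. This is a minimal illustration under stated assumptions: sentence embeddings are given as plain lists of floats, the pool size `k` and all function names are hypothetical, and the style encoder that turns the sampled sentence into a style code is not shown.

```python
import math
import random

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_pool(query_emb, target_embs, k):
    """Indices of the k target-style sentences most similar to the query."""
    ranked = sorted(
        range(len(target_embs)),
        key=lambda i: cosine_similarity(query_emb, target_embs[i]),
        reverse=True,
    )
    return ranked[:k]

def sample_style_index(query_emb, target_embs, k, rng):
    """Randomly pick one sentence from the retrieved pool to supply the style code."""
    return rng.choice(retrieve_pool(query_emb, target_embs, k))

# Toy 2-D embeddings (hypothetical values) for one query and four target sentences.
query = [1.0, 0.0]
targets = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2], [-1.0, 0.0]]

pool = retrieve_pool(query, targets, k=2)
idx = sample_style_index(query, targets, k=2, rng=random.Random(0))
```

A smaller `k` keeps only sentences whose content is very close to the query, which favors content preservation, while a larger `k` leaves more style codes to sample from and therefore more varied outputs.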
Appendix F Additional Implementation Details
We use hidden units for the content encoder, the style encoder, and the decoder. All embeddings in our model have dimensionality . We use the same dimensionalities for the linear layers mapping between the hidden and embedding sizes. Additionally, we modify the convolution block in the style encoder to have max pooling layers for capturing the activation of the style words. We also modify the convolution block of the content encoder to have average pooling layers for computing the average activation of the input. During inference, the decoder generates the output sentence with the multi-step attention mechanism of Gehring et al. (2017).
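The difference between the two pooling choices can be seen in a small sketch. It assumes the convolutional features are a list of per-position vectors and pools over the whole sequence, which is a simplification: the actual blocks may pool over local windows, and the function names are hypothetical.

```python
def max_pool_over_time(features):
    """Per-channel maximum over positions: keeps the strongest activation."""
    return [max(channel) for channel in zip(*features)]

def avg_pool_over_time(features):
    """Per-channel mean over positions: summarizes the whole input evenly."""
    return [sum(channel) / len(channel) for channel in zip(*features)]

# Three positions, two channels (toy values). The second position fires strongly,
# as a salient style word might.
features = [[1.0, 0.0], [9.0, 2.0], [2.0, 1.0]]

style_summary = max_pool_over_time(features)    # [9.0, 2.0]
content_summary = avg_pool_over_time(features)  # [4.0, 1.0]
```

Max pooling lets a single strongly activated position dominate the summary, which suits a style code driven by a few style words, whereas average pooling weights every position equally.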
Appendix G Failure Cases
Although our approach performs favorably against previous methods, our model still fails in some situations. Table 9 shows common failure examples generated by our model. We observe that it is challenging to preserve the content when the input sentences are lengthy. It is also challenging to transfer the style if the sentence contains novel symbols or a complicated structure.