Language Style Transfer from Sentences with Arbitrary Unknown Styles

08/13/2018 ∙ by Yanpeng Zhao, et al. ∙ Tencent

Language style transfer is the problem of migrating the content of a source sentence to a target style. In many of its applications, parallel training data are not available and source sentences to be transferred may have arbitrary and unknown styles. We present an encoder-decoder framework for this setting. First, each sentence is encoded into its content and style latent representations. Then, by recombining the content with the target style, we decode a sentence aligned in the target domain. To adequately constrain the encoding and decoding functions, we couple them with two loss functions. The first is a style discrepancy loss, which enforces that the style representation accurately encodes the style information, guided by the discrepancy between the sentence style and the target style. The second is a cycle consistency loss, which ensures that the transferred sentence preserves the content of the original sentence disentangled from its style. We validate the effectiveness of our model on three tasks: sentiment modification of restaurant reviews, dialog response revision with a romantic style, and sentence rewriting with a Shakespearean style.




1 Introduction

Style transfer aims at migrating the content of a sample from a source style to a target style. Recently, great progress has been achieved by applying deep neural networks to redraw an image in a particular style (Kulkarni et al., 2015; Liu and Tuzel, 2016; Gatys et al., 2016; Zhu et al., 2017; Luan et al., 2017). However, so far very few approaches have been proposed for style transfer of natural language sentences, i.e., changing the style or genre of a sentence while preserving its semantic content. For example, we would like a system that can convert a given piece of text into the language of Shakespeare (Mueller et al., 2017), rewrite product reviews with a favored sentiment (Shen et al., 2017), or generate responses with a consistent persona (Li et al., 2016).

user: Have you ever been to Moribor before?
bot: Yes, I have been to Moribor twice.
transferred: Yes, I have been to Moribor twice, it is beautiful.

user: How do you like your neighbors there?
bot: They were very polite and quiet throughout the night.
transferred: They were very kind and made me feel comfortable.

user: What was the weather like during your stay?
bot: The muggy weather made me cranky.
transferred: The warm sunlight made it a great day.

user: I heard you even met your father there.
bot: Um, I thought he would not go there.
transferred: Um, I was very glad to meet him there.

user: How is he getting on?
bot: Terrible! No one wants to make friends with him.
transferred: Great! Everyone wants to be his friend.

Table 1: Chatbot responses have inconsistent sentiments. Transferred responses have a consistent positive sentiment regardless of the sentiments of the original responses (bot).

A big challenge faced by language style transfer is that large-scale parallel data are unavailable. However, parallel data are necessary for most text generation frameworks, such as the popular sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2014; Rush et al., 2015; Nallapati et al., 2016; Paulus et al., 2017). Hence these methods are not applicable to the language style transfer problem. A few approaches have been proposed to deal with non-parallel data (Hu et al., 2017; Shen et al., 2017). Most of these approaches try to learn a latent representation of the content disentangled from the source style, and then recombine it with the target style to generate the corresponding sentence.

All the above approaches assume that data have only two styles, and their task is to transfer sentences from one style to the other. However, in many practical settings, we may deal with sentences with arbitrary unknown styles. Consider building a chatbot. A good chatbot needs to exhibit a consistent persona, so that it can gain the trust of users. However, existing chatbots such as Siri, Cortana, and XiaoIce (Shum et al., 2018) lack the ability to generate responses with a consistent persona throughout a conversation. Table 1 shows some examples, in which the chatbot responds to the user with varying sentiments (neutral, positive, or negative). One possible solution is to transfer the generated chatbot responses into a target persona before sending them to users. Hence, in this paper, we study the setting of language style transfer in which the source data to be transferred can have arbitrary unknown styles.

Another challenge in language style transfer is that the transferred sentence should preserve the content of the original sentence disentangled from its style. To tackle this problem, Shen et al. (2017) assumed that the source and target domains share the same latent content space, and trained their model by aligning these two latent spaces. Hu et al. (2017) imposed the constraint that the latent content representation of the original sentence could be inferred from the transferred sentence. However, these attempts constrained content modification in the latent content space but not in the sentence space.

The contribution of this paper mainly consists of the following three parts:

  • We address a new style transfer task where sentences in the source domain can have arbitrary language styles but those in the target domain share a single language style.

  • We propose a style discrepancy loss to learn disentangled representations of content and style. This loss enforces that the discrepancy between an arbitrary style representation and the target style representation should be consistent with the closeness of its sentence style to the target style. Additionally, we employ a cycle consistency loss to avoid content change.

  • We evaluate our model in three tasks: sentiment modification of restaurant reviews, dialog response revision with a romantic style, and sentence rewriting with a Shakespearean style. Experimental results show that our model surpasses the state-of-the-art style transfer model (Shen et al., 2017) in these three tasks.

2 Related Work

Image Style Transfer: Most style transfer approaches in the literature focus on vision data. Kulkarni et al. (2015) proposed to disentangle the content representations from image attributes, and control the image generation by manipulating the graphics code that encodes the attribute information. Gatys et al. (2016) used Convolutional Neural Networks (CNNs) to learn separated representations of the image content and style, and then created the new image from their combination. Some approaches have been proposed to align the two data domains with the idea of generative adversarial networks (GANs) (Goodfellow et al., 2014). Liu and Tuzel (2016) proposed a coupled GAN framework to learn a joint distribution of multi-domain data through a weight-sharing constraint. Zhu et al. (2017) introduced a cycle consistency loss, which minimizes the gap between the transferred images and the original ones. However, due to the discreteness of natural language, this loss function cannot be directly applied to text data. In our work, we show how the idea of cycle consistency can be used on text data.

Text Style Transfer: To handle the non-parallel data problem, Mueller et al. (2017) revised the latent representation of a sentence in a certain direction guided by a classifier, so that the decoded sentence imitates those favored by the classifier. Ficler and Goldberg (2017) encoded textual property values with embedding vectors, and adopted a conditioned language model to generate sentences satisfying the specified content and style properties. Li et al. (2018) demonstrated that a simple delete-retrieve-generate approach could achieve good performance in sentiment transfer tasks. Hu et al. (2017) used the variational auto-encoder (VAE; Kingma and Welling, 2013) to encode the sentence into a latent content representation disentangled from the source style, and then recombined it with the target style to generate its counterpart. Shen et al. (2017) considered transferring between two styles simultaneously. They utilized adversarial training to align the generated sentences from one style to the data domain of the other style. We also adopt similar adversarial training in our model. However, since we assume the source domain contains data with various and possibly unknown styles, we cannot apply a discriminator to determine whether a sentence transferred from the target domain is aligned with the source domain, as is done in Shen et al. (2017).

3 Problem Formulation

We now formally present our problem formulation. Suppose there are two data domains: a source domain $X_1$, in which each sentence may have its own language style, and a target domain $X_2$, consisting of data with the same language style. During training, we observe $n$ samples from $X_1$ and $m$ samples from $X_2$, denoted as $X_1 = \{x_1^{(1)}, \ldots, x_1^{(n)}\}$ and $X_2 = \{x_2^{(1)}, \ldots, x_2^{(m)}\}$. Note that we can hardly find a sentence pair $(x_1^{(i)}, x_2^{(j)})$ that describes the same content. Our task is to design a model that learns from these non-parallel training data such that, for an unseen test sentence $x \in X_1$, we can transfer it into its counterpart $\tilde{x} \in X_2$, where $\tilde{x}$ preserves the content of $x$ but has the language style of $X_2$.

4 Model

4.1 Encoder-decoder Framework

Figure 1: (a) Basic model with the style discrepancy loss. Solid lines: encode and decode the sample itself; dashed lines: transfer the sample into the target style. (b) Proposed cycle consistency loss (can be applied similarly to samples in the other domain).

We assume each sentence $x$ can be decomposed into two representations: a style representation $y$ and a content representation $z$, which is disentangled from its style. Each sentence $x_1$ in the source domain $X_1$ has its individual style $y_1$, while all the sentences $x_2$ in the target domain $X_2$ share the same style, denoted as $y^*$. Our model is built upon the encoder-decoder framework. In the encoding module, we assume that $z$ and $y$ of a sentence $x$ can be obtained through two encoding functions $E_z$ and $E_y$ respectively:

$$z = E_z(x), \qquad y = E_y(x) = \mathbb{1}\{x \in X_1\} \cdot g(x) + \mathbb{1}\{x \in X_2\} \cdot y^*$$

where $x \in X_1 \cup X_2$ and $\mathbb{1}\{\cdot\}$ is an indicator function. When a sentence $x$ comes from the source domain, we use a function $g(x)$ to encode its style representation. For $x$ from the target domain, a shared style representation $y^*$ is used. Both $y^*$ and the parameters in $g$ are learnt jointly with the other parameters in our model.

For the decoding module, we first employ a reconstruction loss to encourage that the sentence produced by the decoding function, given the content representation $z$ and the style representation $y$ of a sentence $x$, reconstructs $x$ itself. Here, we use a probabilistic generator $G$ as the decoding function and the reconstruction loss is:


where $\theta$ denotes the parameter of the corresponding module.
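As a hedged sketch of the reconstruction loss described above, one form consistent with the text is the expected negative log-likelihood of each sentence under the generator, given its own content and style. The notation is assumed here for illustration and may differ from the original: $X_1$ and $X_2$ are the source and target domains, $E_z$ is the content encoder, $y_1$ is the style of a source sentence, $y^*$ is the shared target style, and $p_G$ is the distribution defined by the generator.

```latex
\mathcal{L}_{rec}(\theta_{E_z}, \theta_{E_y}, \theta_G)
  = \mathbb{E}_{x_1 \sim X_1}\big[-\log p_G(x_1 \mid E_z(x_1), y_1)\big]
  + \mathbb{E}_{x_2 \sim X_2}\big[-\log p_G(x_2 \mid E_z(x_2), y^*)\big]
```

Under this form, the loss is small only when the content and style representations together retain enough information to regenerate the sentence.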

To enable style transfer using non-parallel training data, we enforce that for a sample $x_1$ from the source domain $X_1$, its decoded sequence using the generator $G$, given its content representation $z_1$ and the target style $y^*$, should be in the target domain $X_2$. We use the idea of GAN (Goodfellow et al., 2014) and introduce an adversarial loss to be minimized in decoding. The goal of the discriminator $D$ is to distinguish between transferred source sentences and target-domain sentences, while the generator tries to fool the discriminator:
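The adversarial game described above can be sketched with the standard GAN cross-entropy losses. This is a minimal illustration, not the exact loss of the model: the function names are hypothetical, the inputs are assumed to be discriminator output probabilities, and the non-saturating generator loss is a common substitute chosen here for simplicity.

```python
import math

def discriminator_loss(p_target, p_transferred, eps=1e-8):
    """Cross-entropy minimized by the discriminator.

    p_target: probabilities assigned to target-domain sentences.
    p_transferred: probabilities assigned to transferred source sentences.
    """
    real_term = -sum(math.log(p + eps) for p in p_target) / len(p_target)
    fake_term = -sum(math.log(1.0 - p + eps) for p in p_transferred) / len(p_transferred)
    return real_term + fake_term

def generator_loss(p_transferred, eps=1e-8):
    """Non-saturating generator loss (assumed variant): the generator is
    rewarded when the discriminator believes transferred sentences are real."""
    return -sum(math.log(p + eps) for p in p_transferred) / len(p_transferred)
```

A discriminator that separates the two groups well obtains a lower loss than one that outputs 0.5 everywhere, while the generator loss falls as transferred sentences become harder to tell apart from target-domain ones.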


Remarks: The above encoder-decoder framework is under-constrained in two aspects:

  • For a sample $x_1$ from the source domain, its style representation $y_1$ can be an arbitrary value that minimizes the reconstruction and adversarial losses above, and may not necessarily capture the sentence style. This affects the other decomposed part, the content representation $z_1$, so that it does not fully represent the content, which should be invariant to the style.

  • The discriminator $D$ can only encourage the generated sentence to be aligned with the target domain $X_2$, but cannot guarantee that the content of the source sentence is kept intact.

To address the first problem, we propose a style discrepancy loss, which constrains the distance of the learnt $y_1$ from the target style $y^*$ to be guided by another discriminator $D_s$ that evaluates the closeness of the sentence style to the target style. For the second problem, inspired by He et al. (2016) and Zhu et al. (2017), we introduce a cycle consistency loss applicable to word sequences, which requires that the generated sentence $\tilde{x}$ can be transferred back to the original sentence $x$.

4.2 Style Discrepancy Loss

By using a portion of the training data, we can first train a discriminator $D_s$ to predict whether a given sentence $x$ has the target language style, with an output probability denoted as $D_s(x)$. When learning the decomposed style representation $y_1$ for a sample $x_1$ from the source domain, we enforce that the discrepancy between this style representation and the target style representation $y^*$ should be consistent with the output probability from $D_s$. Specifically, since the styles are represented with embedding vectors, we measure the style discrepancy using the norm of their difference:

$$d(y_1, y^*) = \lVert y_1 - y^* \rVert$$

Intuitively, if a sentence has a larger probability of being considered to have the target style, its style representation should be closer to the target style representation $y^*$. Thus, we would like the closeness of $y_1$ to $y^*$ to be positively correlated with $D_s(x_1)$. To incorporate this idea in our model, we use a probability density function $f$ of the discrepancy, and define the style discrepancy loss as:


where $f$ is a valid probability density function and $D_s$ is pre-trained and then fixed. If a sentence $x_1$ has a large $D_s(x_1)$, incorporating the above loss into the encoder-decoder framework will encourage a large $f(d(y_1, y^*))$ and hence a small $d(y_1, y^*)$, which means $y_1$ will be close to $y^*$. In our experiment, we instantiate $f$ with the standard normal distribution for simplicity:

$$f(d) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{d^2}{2}\right)$$

However, better probability density functions can be used if we have some prior knowledge of the style distribution. With Equation 5, the style discrepancy loss can be equivalently minimized by:


Note that $D_s$ is not jointly trained in our model. The reason is that, if we integrate it into the end-to-end training, we may start with a $D_s$ of low accuracy, and our model is then inclined to optimize a wrong style discrepancy loss for many epochs and get stuck in a poor local optimum.
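The interplay between the pre-trained discriminator and the style distance can be illustrated with a small numerical sketch. It rests on assumptions that are not confirmed by the text: that the loss weights the negative log-density of the distance by the discriminator probability, and that the distance is Euclidean. All function names are hypothetical.

```python
import math

def l2_distance(y, y_star):
    """Euclidean distance between a style vector and the target style vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_star)))

def std_normal_pdf(d):
    """Standard normal density evaluated at d."""
    return math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)

def style_discrepancy_loss(styles, y_star, probs):
    """Assumed form: batch mean of -p * log f(d(y, y*)), where p is the
    fixed probability from the pre-trained discriminator."""
    total = 0.0
    for y, p in zip(styles, probs):
        total += -p * math.log(std_normal_pdf(l2_distance(y, y_star)))
    return total / len(styles)

def simplified_loss(styles, y_star, probs):
    """Batch mean of p * d^2. With the standard normal density this differs
    from the loss above only by a factor 1/2 and a term that does not depend
    on the style vectors, so both share the same minimizers."""
    return sum(p * l2_distance(y, y_star) ** 2
               for y, p in zip(styles, probs)) / len(styles)
```

Under these assumptions, a sentence with a high discriminator probability is penalized more strongly for a style vector far from the target style, which matches the intuition stated in this section.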

4.3 Cycle Consistency Loss

Inspired by He et al. (2016) and Zhu et al. (2017), we require that a sentence transferred by the generator $G$ should preserve the content of its original sentence, and thus it should have the capacity to recover the original sentence in a cyclic manner. For a sample $x_1$ from the source domain $X_1$ with its transferred sentence $\tilde{x}_1$ having the target style $y^*$, we encode $\tilde{x}_1$ and combine its content with the original style $y_1$ for decoding. We should expect that, with a high probability, the original sentence $x_1$ is generated. For a sample $x_2$ from the target domain $X_2$, though we do not aim to change its language style in our task, we can still compute its cycle consistency loss for the purpose of additional regularization. We first choose an arbitrary style $y'$ obtained from a sentence in $X_1$, and transfer $x_2$ into this style. Next, we put this generated sentence $\tilde{x}_2$ into the encoder-decoder model with the style $y^*$, and the original sentence $x_2$ should be generated. Formally, the cycle consistency loss is:
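The two cycles described above can be written out as a hedged sketch. The notation is assumed for illustration and may differ from the original: $X_1$ and $X_2$ are the source and target domains, $E_z$ is the content encoder, $G$ is the generator with distribution $p_G$, $y_1$ is the style of the source sentence $x_1$, $y'$ is an arbitrary style taken from a source sentence, and $y^*$ is the target style.

```latex
\begin{align*}
\tilde{x}_1 &= G(E_z(x_1), y^*), \qquad \tilde{x}_2 = G(E_z(x_2), y') \\
\mathcal{L}_{cyc}
  &= \mathbb{E}_{x_1 \sim X_1}\big[-\log p_G(x_1 \mid E_z(\tilde{x}_1), y_1)\big]
   + \mathbb{E}_{x_2 \sim X_2}\big[-\log p_G(x_2 \mid E_z(\tilde{x}_2), y^*)\big]
\end{align*}
```

Each term asks the generator to assign high probability to the original sentence when decoding from the content of its transferred version combined with its original style.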


4.4 Full Objective

An illustration of our basic model with the style discrepancy loss is shown in Figure 1(a) and the full model combined with the cycle consistency loss is shown in Figure 1(b). To summarize, the full loss function of our model is:


where the $\lambda$ coefficients are parameters balancing the relative importance of the different loss parts. The overall training objective is a minimax game played among the encoders $E_z$ and $E_y$, the generator $G$, and the discriminator $D$:


We implement the content encoder $E_z$ using an RNN with the last hidden state as the content representation, and the style encoder $g$ using a CNN with the output representation of the last layer as the style representation. The generator $G$ is an RNN that takes the concatenation of the content and style representations as the initial hidden state. The discriminator $D$ and the pre-trained discriminator $D_s$ used in the style discrepancy loss are CNNs with a network structure similar to that of the style encoder, followed by a sigmoid output layer.

5 Experiments

5.1 Datasets

Yelp: Raw data are from the Yelp Dataset Challenge Round 10, which are restaurant reviews on Yelp. Generally, reviews rated with 4 or 5 stars are considered positive, 1 or 2 stars are negative, and 3 stars are neutral. For positive and negative reviews, we use the processed data released by Shen et al. (2017), which contains 250k negative sentences and 350k positive sentences. For neutral reviews, we follow similar steps in Shen et al. (2017) to process and select the data. We first filter out neutral reviews (rated with 3 stars and categorized with the keyword ‘restaurant’) with the length exceeding 15 or less than 3. Then, data selection in Moore and Lewis (2010) is used to ensure a large enough vocabulary overlap between the neutral data and the data in Shen et al. (2017). Afterwards, we sample 500k sentences from the resulting dataset as the neutral data. We use the positive data as the target style domain. Based on the three classes of data, we construct two source datasets with multiple styles:

  • Positive+Negative (Pos+Neg): we add different numbers of positive data (50k, 100k, 150k) into the negative data so that the source domain contains data with two sentiments.

  • Neutral+Negative (Neu+Neg): we combine neutral (50k, 100k, 150k) and negative data together as the source data.
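The length filtering and the construction of the mixed source sets described above can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, sentence length is assumed to be counted in whitespace-separated tokens, and uniform random sampling of the added sentences is an assumption.

```python
import random

def keep_by_length(sentences, min_len=3, max_len=15):
    """Keep sentences whose token count lies in [min_len, max_len]."""
    return [s for s in sentences if min_len <= len(s.split()) <= max_len]

def build_mixed_source(negative, extra, n_extra, seed=0):
    """Add n_extra sentences sampled from `extra` (positive or neutral data)
    to the negative data and shuffle, so the source has mixed styles."""
    rng = random.Random(seed)
    mixed = list(negative) + rng.sample(list(extra), n_extra)
    rng.shuffle(mixed)
    return mixed
```

For example, calling `build_mixed_source(negative, positive, 50000)` would correspond to the 50k Pos+Neg setting, given lists `negative` and `positive` that are large enough.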

We consider the Neu+Neg dataset harder to learn from than the Pos+Neg dataset. For Pos+Neg, a pre-trained classifier can be used to filter out some of the positive data, so that most of the source data have the same style and the model in Shen et al. (2017) can work. The neutral data, however, cannot be removed in this way. Also, most real data may have a neutral sentiment, and we want to see whether such sentences can be transferred well.

Chat: We use sentences from a real Chinese dialog dataset as the source domain. Users can chat with various personalized language styles, which are not easy to classify into one of the three sentiments as in Yelp. Romantic sentences are collected from several online novel websites and filtered by human annotators. The dataset has 400k romantic sentences and 800k general sentences. Our task is to transfer the dialog sentences to a romantic style, characterized by the selected romantic sentences.

Shakespeare: We experiment on revising modern text in the Shakespearean style at the sentence-level as in Mueller et al. (2017). Following their experimental setup, we collect 29,388 sentences authored by Shakespeare and 54,800 sentences from non-Shakespeare-authored works. The length of all the sentences ranges from 3 to 15.

Each of the above three datasets is divided into three non-overlapping parts, which are respectively used for the style transfer model, the pre-trained discriminator $D_s$, and the evaluation classifier used in Section 5.3. Each part is further divided into training, testing, and validation sets. The one exception is that $D_s$ for Shakespeare is trained on a subset of the data used for training the style transfer model, because the Shakespeare dataset is small. Statistics of the data for training and evaluating the style transfer model are shown in the supplementary material.

5.2 Compared Methods and Configurations

We compare our method with Shen et al. (2017), the state-of-the-art language style transfer model with non-parallel data, which we refer to as the Style Transfer Baseline (STB). As described in Section 2 and Section 4, STB is built upon an auto-encoder framework. It focuses on transferring sentences from one style to the other, with the source and target language styles represented by two embedding vectors. It also relies on adversarial training methods to align the content spaces of the two domains. We keep the configurations of the modules in STB, such as the encoder, decoder and discriminator, the same as ours for a fair comparison.

We implement our model using TensorFlow (Abadi et al., 2016). We use GRU as the encoder and generation cells in our encoder-decoder framework. Dropout (Srivastava et al., 2014) is applied in GRUs and the dropout probability is set to 0.5. Throughout our experiments, we set the dimension of the word embedding, content representation and style representation as , and respectively. For the style encoder $g$, we follow the CNN architecture in Kim (2014), and use filter sizes of with 100 feature maps each, so that the resulting output layer is of size , i.e., the dimension of the style representation. The pre-trained discriminator $D_s$ is implemented similarly to the style encoder but using filter sizes with 250 feature maps each. The testing accuracy of the pre-trained $D_s$ is 95.23% for Yelp, 87.60% for Chat, and 87.60% for Shakespeare. We further set the balancing parameters , , and train the model using the Adam optimizer (Kingma and Ba, 2015) with the learning rate . All input sentences are padded so that they have the same length: 20 for Yelp and Shakespeare and 35 for Chat. Furthermore, when training the classifiers, we use the pre-trained GloVe word embeddings (Pennington et al., 2014) for Yelp and Shakespeare, and Chinese word embeddings trained on a large amount of Chinese news data for Chat.
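The fixed-length padding mentioned above can be sketched as follows. The helper name and the `<pad>` token are hypothetical, and truncating sentences longer than the target length is an assumption, since the text only states that inputs are padded.

```python
def pad_tokens(tokens, length, pad="<pad>"):
    """Right-pad a token list to exactly `length` tokens.

    Longer inputs are truncated, which is an assumption made here."""
    return (list(tokens) + [pad] * length)[:length]
```

With `length=20` this would correspond to the Yelp and Shakespeare setting, and with `length=35` to Chat.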

5.3 Evaluation Setup

Following Shen et al. (2017), we use a model-based evaluation metric. Specifically, we use a pre-trained evaluation classifier to classify whether the transferred sentence has the correct style. The evaluation classifier is implemented with the same network structure as the discriminator. The testing accuracy of the evaluation classifiers is 95.36% for Yelp, 87.05% for Chat, and 88.70% for Shakespeare. We repeat the training three times for each experiment setting and report the mean accuracy on the testing data together with its standard deviation.
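The aggregation over repeated runs can be sketched as follows. The helper name is hypothetical, and using the sample standard deviation rather than the population one is an assumption, since the text does not specify which is reported.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated training runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # statistics.pstdev gives the population version
    return mean, std
```

The function expects at least two runs; in the setup above it would receive the three accuracies obtained for one experiment setting.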

5.4 Experiment Roadmap

We perform a set of experiments on Yelp, which is also used in Shen et al. (2017), to systematically investigate the effectiveness of our model:

  • We validate the usefulness of the proposed style discrepancy loss and the cycle consistency loss in our setting (Sec 5.5.1).

  • We vary the difficulty level of the source data and verify the robustness of our model (Sec 5.5.2).

  • We look into some transferred sentences and analyze successful cases and failed cases (Sec 5.5.3).

  • We perform human evaluations to evaluate the overall quality of the transferred sentences and check if the results are consistent with those using the model-based evaluation (Sec 5.5.4).

After conducting the thorough study of our model on Yelp, we apply our model on Chat and Shakespeare, two real datasets in which the source domain has various language styles.

5.5 Experiments and Analysis on Yelp

5.5.1 Ablation Study

To investigate the effectiveness of the style discrepancy loss, we train our full model with only the style discrepancy loss removed on the Pos+Neg dataset, whose source domain contains both positive and negative sentences. When the model converges, we find that the cycle consistency loss is much larger than that obtained with the full model in the same setting. We also check manually and find that almost all sentences fail to transfer their styles with this model. This indicates that the proposed style discrepancy loss is an essential part of our model.

#positive samples used   STB   STB (with Cyc)   Ours (without Cyc)   Ours
50k
100k
150k

Table 2: Testing accuracies on Yelp with Pos+Neg source data.

We also validate the effectiveness of the cycle consistency loss (note that our proposed cycle consistency loss can be similarly added to STB). Specifically, we compare two versions of both STB and our model, one with the cycle consistency loss and one without. We vary the number of positive sentences in the source domain and results are shown in Table 2. It can be seen that incorporating the cycle consistency loss consistently improves the performance for both STB and our proposed model.

5.5.2 Difficulty Level of Training Data

Pos+Neg as Source Data: We compare STB and our proposed model on the first dataset, Pos+Neg, using the results in Table 2. As the number of positive sentences in the source data increases, the average performance of both versions of STB decreases drastically. This is reasonable because STB introduces a discriminator to align the sentences from the target domain back to the source domain, and when the source domain contains more positive samples, it is hard to find a good alignment to the source domain. Meanwhile, the performance of our model, even the basic one without the cycle consistency loss, does not fluctuate much as the number of positive samples increases, showing that our model is not very sensitive to source data containing more than one sentiment. Overall, our model with the cycle consistency loss performs the best.

#Neutral samples   STB (with Cyc)   Ours
50k
100k
150k

Table 3: Testing accuracies on Yelp with Neu+Neg source data.

Neu+Neg as Source Data: The Pos+Neg dataset is not so challenging because we can use a pre-trained discriminator, similar to the pre-trained discriminator $D_s$ in our model, to remove those samples classified as positive with high probabilities, so that only sentences with a less positive sentiment remain in the source domain. Thus, we test on our second dataset, Neu+Neg. In this setting, in case some positive sentences exist among the neutral reviews, when STB is trained we use the same pre-trained discriminator as in our model to filter out samples classified as positive with probabilities larger than 0.9. In comparison, our model uses all data, since it naturally allows for data with styles similar to the target style. Experimental results in Table 3 show that, with the same amount of neutral data mixed into the source domain, our model performs better than STB (with Cyc) and is relatively stable across multiple runs.
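The filtering applied to the STB training data can be sketched as follows. The helper name is hypothetical, and the probabilities are assumed to come from a pre-trained classifier that outputs the probability of a sentence being positive.

```python
def filter_confident_positives(sentences, positive_probs, threshold=0.9):
    """Drop sentences whose predicted positive probability exceeds threshold."""
    return [s for s, p in zip(sentences, positive_probs) if p <= threshold]
```

Only sentences strictly above the threshold are removed, mirroring the phrase "probabilities larger than 0.9".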

Limited Sentences in Target Domain: In real applications, there may be only a small amount of data in the target domain. To simulate this scenario, we limit the amount of the target data (randomly sampled from the positive data) used for training, and evaluate the robustness of the compared methods. Table 4 shows the experimental results. It is surprising to see that both methods obtain relatively steady accuracies with different numbers of target samples. Yet, our model surpasses STB (with Cyc) in all the cases.

#Target samples used   STB (with Cyc)   Ours
100k
150k
200k

Table 4: Testing accuracies on Yelp with different numbers of target samples used.

5.5.3 Case Study

We manually examine the generated sentences for a detailed study. Overall, our full model can generate grammatically correct positive reviews without changing the original content in more cases than the other methods. In Table 5, we present some example results of the various methods. We can see that when the original sentences are simple, such as the first example, all models can transfer the sentence successfully. However, when the original sentences are complex, both versions of STB and our basic model (without the cycle consistency loss) cannot generate fluent sentences, but our full model still succeeds. One minor problem with our model is that it may use a wrong tense in transferred sentences (e.g., the last transferred sentence by our model), which however has little influence on the sentence style and meaning.

Original Sentence: and just not very good .
STB: but i was very good .
STB (with Cyc): and just always very good .
Ours (without Cyc): and just always very good .
Ours: and i love it .

Original Sentence: i have tried to go to them twice silly me .
STB: i ’m going to anyone when they need to .
STB (with Cyc): i ’ve been to go here for years out .
Ours (without Cyc): i have recommend to anyone ’s your home needs .
Ours: i have tried the place and it was great .

Original Sentence: i am so thankful to be out of this place .
STB: i am so impressed to the experience i had .
STB (with Cyc): i am always impressed to see for a family .
Ours (without Cyc): i am so grateful for being on this place again .
Ours: i am so happy to have found this place .

Original Sentence: service was okay but joseph is just rude .
STB: service was well , but well .
STB (with Cyc): service was friendly and everyone is .
Ours (without Cyc): service was quick , and just funny .
Ours: service is great and food was great .

Original Sentence: they were very loud and made noise throughout the night .
STB: they were very good and made my own water .
STB (with Cyc): they were very good and made out on the game .
Ours (without Cyc): they were very nice and made the other time .
Ours: they are very friendly and made you feel comfortable .

Table 5: Example sentences on Yelp transferred into a positive sentiment.

5.5.4 Human Evaluation

The model-based evaluation metric is inadequate for measuring whether a transferred sentence preserves the content of a source sentence (Fu et al., 2017). Therefore, we rely on human evaluations to assess model performance in content preservation. We randomly select 200 test samples from Yelp and ask annotators to rate the overall quality of the transferred sentences on a scale of 1 (failed), 2 (tolerable), 3 (satisfying), and 4 (perfect), jointly considering content preservation, sentiment modification, and fluency of the transferred sentences.

We hire five annotators to evaluate the results. Since we have 9 settings and 24 method configurations in total in Tables 2-4, we select one of the settings on Yelp due to a limited budget. Table 6 shows the averaged scores of the annotators' evaluations. As can be seen, considering all three aspects above, our model is better than the other methods. This result is also consistent with the automatic evaluation results.

          STB   STB (with Cyc)   Ours (without Cyc)   Ours
Overall

Table 6: Human evaluation on Yelp when 150k positive sentences are added to source domain (row 3 in Table 2).

#target samples used   STB (with Cyc)   Ours
10k
50k
100k
150k

Table 7: Testing accuracies on Chat with different numbers of target samples used.

5.6 Experiments and Analysis on Chat

As in the Yelp experiment, we vary the number of target sentences to test the robustness of the compared methods. The experimental results are shown in Table 7. Several observations can be made. First, STB (with Cyc) obtains a relatively low performance with only 10k target samples. As more target samples are used, its performance increases. Second, our model achieves a high accuracy even with 10k target samples, and remains stable in all the cases. Thus, our model achieves better performance as well as stronger robustness on Chat. Due to space limitations, we present a few examples in Table 19 in the supplementary material. We find that our model generally succeeds in transferring the sentence into a romantic style, using some romantic phrases.

5.7 Experiments and Analysis on Shakespeare

The testing accuracies are shown in Table 8. We also present some example sentences in Table 9. Compared with STB, our model can generate sentences which are more fluent and have a higher probability of having the correct target style. For example, in the sentences transferred by our model, words such as 'sir', 'master', and 'lustre' are used, which are common in Shakespearean works. However, we find that both STB and our model tend to generate short sentences and change the content of source sentences in more cases in this set of experiments than on the Yelp and Chat datasets. We conjecture this is caused by the scarcity of training data. Sentences in the Shakespearean style form a vocabulary of 8559 words, but almost 60% of them appear fewer than 10 times. On the other hand, the source domain contains 19962 words, but the two vocabularies share only 5211 words. Thus aligned words or phrases may not exist in the dataset.
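Vocabulary statistics of the kind quoted above can be computed with a short sketch like the following. The function name is hypothetical, and whitespace tokenization and exact string matching between the two vocabularies are assumptions.

```python
from collections import Counter

def vocab_stats(target_sentences, source_sentences, min_count=10):
    """Vocabulary sizes, fraction of rare target words, and shared words."""
    target_counts = Counter(w for s in target_sentences for w in s.split())
    source_vocab = {w for s in source_sentences for w in s.split()}
    rare = sum(1 for c in target_counts.values() if c < min_count)
    return {
        "target_vocab": len(target_counts),
        "source_vocab": len(source_vocab),
        "rare_fraction": rare / len(target_counts),
        "shared": len(source_vocab & set(target_counts)),
    }
```

A small shared vocabulary combined with a large fraction of rare target words suggests that many source words have no frequent counterpart in the target style.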

#target samples used   STB (with Cyc)   Ours
21,888

Table 8: Testing accuracies on Shakespeare.

Original Sentence: i should never have thought of such a thing .
STB (with Cyc): i shall not have to for you .
Ours: i will never be thee , sir .

Original Sentence: do n’t try to make any stupid moves .
STB (with Cyc): i think you have to your .
Ours: do you walk , the master ?

Original Sentence: where were the hardships she had expected ?
STB (with Cyc): what is it is it ?
Ours: where are the lustre here ?

Table 9: Example non-Shakespeare sentences transferred into a Shakespearean language style.

6 Conclusion

We have presented an encoder-decoder framework for language style transfer. It allows for the use of non-parallel data, where the source data have various unknown language styles. Each sentence is encoded into two latent representations, one corresponding to its content disentangled from the style and the other representing the style only. By recombining the content with the target style, we can decode a sentence aligned in the target domain. Specifically, we propose two loss functions, the style discrepancy loss and the cycle consistency loss, to adequately constrain the encoding and decoding functions. The style discrepancy loss enforces a properly encoded style representation, while the cycle consistency loss ensures that the style-transferred sentences can be transferred back to their original sentences. Experimental results on three tasks demonstrate that our proposed method outperforms previous style transfer methods.


Supplementary Materials

This supplementary material contains the following contents. (1) Tables 10-18 show statistics of the data used in the experiments. (2) Table 19 shows some transferred sentences from the testing data of Chat.

Data       Training   Test    Validation
Positive   240417     40000   20000
Negative   151026     40000   20000

Table 10: Statistics of Yelp for the style transfer model

Data       Training   Test    Validation
Romantic   207312     40000   40000
General    514460     40000   40000

Table 11: Statistics of Chat for the style-transfer model

Data              Training   Test   Validation
Shakespeare       21888      1000   2000
Non-Shakespeare   43800      1000   2000

Table 12: Statistics of Shakespeare for the style transfer model

Data       Training   Test   Validation
Positive   75000      5000   2500
Negative   37500      5000   2500

Table 13: Statistics of Yelp for the discriminator

Data       Training   Test   Validation
Positive   37500      1500   2250
Negative   18750      1500   2250

Table 14: Statistics of Yelp for the evaluation classifier

Data       Training   Test    Validation
Romantic   100000     10000   10000
General    200000     10000   10000

Table 15: Statistics of Chat for the discriminator

Data       Training   Test   Validation
Romantic   50000      5000   5000
General    100000     5000   5000

Table 16: Statistics of Chat for the evaluation classifier

Data              Training   Test   Validation
Shakespeare       3500       500    500
Non-Shakespeare   7000       500    500

Table 17: Statistics of Shakespeare for the discriminator

Data              Training   Test   Validation
Shakespeare       3500       500    500
Non-Shakespeare   7000       500    500

Table 18: Statistics of Shakespeare for the evaluation classifier

Original Sentence: 回眸一笑 就 好 (It is enough to look back and smile)
STB (with Cyc): 回眸一笑 就 好 了 (It would be just fine to look back and smile)
Ours: 回眸一笑 , 勿念 。 (Look back and smile, please do not miss me.)

Original Sentence: 得过且过 吧 ! (Just live with it!)
STB (with Cyc): 想不开 吧 , 我 的 吧 。 (I just take things too hard. *)
Ours: 爱到深处 , 随遇而安 。 (Love to the depths, enjoy myself wherever I am.)

Original Sentence: 自己 的 幸福 给 别人 了 (Give up your happiness to others)
STB (with Cyc): 自己 的 幸福 给 别人 , 你 的 。 (Give up your happiness to others. *)
Ours: 自己 的 幸福 是 自己 , 自己 的 。 (Leave some happiness to yourself, yourself.)

Table 19: Example sentences on Chat transferred into a romantic style. English translations are provided (* denotes that the sentence has grammar mistakes in Chinese).