Domain Adaptive Text Style Transfer

08/25/2019 ∙ by Dianqi Li, et al. ∙ Microsoft, University of Washington

Text style transfer without parallel data has achieved some practical success. However, in the scenario where less data is available, these methods may yield poor performance. In this paper, we examine domain adaptation for text style transfer to leverage massively available data from other domains. These data may demonstrate domain shift, which impedes the benefits of utilizing such data for training. To address this challenge, we propose simple yet effective domain adaptive text style transfer models, enabling domain-adaptive information exchange. The proposed models presumably learn from the source domain to: (i) distinguish stylized information and generic content information; (ii) maximally preserve content information; and (iii) adaptively transfer the styles in a domain-aware manner. We evaluate the proposed models on two style transfer tasks (sentiment and formality) over multiple target domains where only limited non-parallel data is available. Extensive experiments demonstrate the effectiveness of the proposed model compared to the baselines.




1 Introduction

Text style transfer, which aims to edit an input sentence with the desired style while preserving style-irrelevant content, has received increasing attention in recent years. It has been applied successfully to stylized image captioning Gan et al. (2017), personalized conversational response generation Zhang et al. (2018a), formalized writing Rao and Tetreault (2018), offensive to non-offensive language transfer dos Santos et al. (2018), and other stylized text generation tasks Akama et al. (2017); Zhang et al. (2019).

Text style transfer has been explored as a sequence-to-sequence learning task using parallel datasets Jhamtani et al. (2017). However, parallel datasets are often not available, and hand-annotating sentences in different styles is expensive. The recent surge of deep generative models Kingma and Welling (2013); Goodfellow et al. (2014) has spurred progress in text style transfer without parallel data by learning disentanglement Hu et al. (2017); Shen et al. (2017); Fu et al. (2018); Li et al. (2018); Prabhumoye et al. (2018). These methods typically require massive amounts of data Subramanian et al. (2018), and may perform poorly in limited data scenarios.

A natural solution to the data-scarcity issue is to resort to massive data from other domains. However, directly leveraging abundant data from other domains is problematic due to the discrepancies in data distribution on different domains. Different domains generally manifest themselves in domain-specific lexica. For example, sentiment adjectives such as “delicious”, “tasty”, and “disgusting” in restaurant reviews might be out of place in movie reviews, where the sentiment words such as “imaginative”, “hilarious”, and “dramatic” are more typical. Domain shift Gretton et al. (2009) is thus apt to result in feature misalignment.

In this work, we take up the problem of domain adaptation in scenarios where the target domain data is scarce and misaligned with the distribution in the source domain. Our goal is to achieve successful style transfer into the target domain, with the help of the source domain, while the transferred sentences carry relevant characteristics in the target domain.

We present two first-of-their-kind domain adaptive text style transfer models that facilitate domain-adaptive information exchange between the source and target domains. These models effectively learn generic content information and distinguish domain-specific information. Generic content information, primarily captured by modeling a large corpus from the source domain, facilitates better content preservation on the target domain. Meanwhile, domain-specific information, implicitly imposed by domain vectors and domain-specific style classifiers, underpins the transferred sentences by generating target-specific lexical terms.

Our contributions in this paper are threefold: (i) We explore a challenging domain adaptation problem for text style transfer by leveraging massively-available data from other domains. (ii) We introduce simple text style transfer models that preserve content and meanwhile translate text adaptively into target-domain-specific terms. (iii) We demonstrate through extensive experiments the robustness of these methods for style transfer tasks (sentiment and formality) on multiple target domains where only limited non-parallel data is available. Our implementation is available at

2 Related Work

Text Style Transfer.

Text style transfer using neural networks has been widely studied in the past few years. A common paradigm is to first disentangle latent space as content and style features, and then generate stylistic sentences by tweaking the style-relevant features and passing through a decoder.

Hu et al. (2017); Fu et al. (2018); Shen et al. (2017); Yang et al. (2018); Gong et al. (2019); Lin et al. (2017) explored this direction by assuming the disentanglement can be achieved in an auto-encoding procedure with a suitable style regularization, implemented by either adversarial discriminators or style classifiers. Li et al. (2018); Xu et al. (2018); Zhang et al. (2018c) achieved disentanglement by filtering the stylistic words of input sentences. Recently, Prabhumoye et al. (2018) proposed using back-translation for text style transfer with a de-noising auto-encoding objective Logeswaran et al. (2018); Subramanian et al. (2018). Our work differs from the above in that we leverage domain adaptation to deal with limited target domain data, whereas previous methods require massive style-labelled samples from the target domain.

Domain Adaptation.

Domain adaptation has been studied in various natural language processing tasks, such as sentiment classification Qu et al. (2019), dialogue systems Wen et al. (2016), abstractive summarization Hua and Wang (2017); Zhang et al. (2018b), machine translation Koehn and Schroeder (2007); Axelrod et al. (2011); Sennrich et al. (2016b); Michel and Neubig (2018), etc. However, no prior work has explored domain adaptation for text style transfer. To the best of our knowledge, we are the first to explore the adaptation of text style transfer models to a new domain with only limited non-parallel data available. The task requires both style transfer and domain-specific generation on the target domain. To differentiate domains, Sennrich et al. (2016a); Chu et al. (2017) appended domain tokens to the input sentences. Our model instead uses learnable domain vectors combined with domain-specific style classifiers, which force the model to learn distinct stylized information in each domain.

3 Preliminary

We first describe a standard text style transfer approach, which only considers data in the target domain. We limit our discussion to the scenario where only non-parallel data is available, since collecting large amounts of parallel data is typically infeasible.

Given a set of style-labelled sentences X_t = {(x, s)} in the target domain, the goal is to transfer a sentence x with style s to a sentence y with another style s̃, where s ≠ s̃. Both s and s̃ belong to the set of style labels in the target domain: s, s̃ ∈ S_t. Typically, an encoder encodes the input x into a semantic representation z, while a decoder controls or modifies the stylistic property and decodes the sentence based on z and the pre-specified style s̃.

Specifically, we denote an encoder-decoder model as (E, D). The semantic representation of sentence x is extracted by the encoder E, i.e., z = E(x). The decoder D aims to learn a conditional distribution of y given the semantic representation z and style s̃:

p_D(y | z, s̃) = ∏_{t=1}^{|y|} p_D(y_t | z, s̃, y_{<t}),    (1)

where y_t is the t-th token of y, and y_{<t} is the prefix of y up to the t-th token.
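As a toy illustration (not the authors' implementation), the factorized likelihood in Eqn. (1) can be computed from per-token probabilities; the function name and the probability values below are our own illustrative assumptions:

```python
import math

def sequence_log_prob(token_probs):
    """Log-likelihood of a sentence y under the decoder's factorized
    distribution in Eqn. (1): p(y | z, s~) = prod_t p(y_t | z, s~, y_{<t}).
    `token_probs` is a hypothetical list holding p(y_t | z, s~, y_{<t})
    for each step t, e.g. the softmax outputs of a decoder."""
    return sum(math.log(p) for p in token_probs)

# Three decoding steps with illustrative probabilities:
# log(0.5 * 0.25 * 0.5) = log(0.0625)
assert abs(sequence_log_prob([0.5, 0.25, 0.5]) - math.log(0.0625)) < 1e-9
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow for long sentences.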

Directly estimating Eqn. (1) is impractical during training due to the lack of parallel data (x, y). Alternatively, the original sentence x should have high probability under the conditional distribution p_D(x | z, s). Thus, an auto-encoding reconstruction loss can be formulated as:

L_rec = - E_{(x, s) ∈ X_t} [ log p_D(x | z, s) ].    (2)
Note that we assume the decoder recovers x's original stylistic property as accurately as possible when given the style label s. To achieve text style transfer, the decoder manipulates the style of generated sentences by replacing s with a desired style s̃. Specifically, the generated sentence ŷ is sampled from p_D(y | z, s̃). However, by directly optimizing Eqn. (2), the encoder-decoder model tends to ignore the style labels and collapses to a reconstruction model, which might simply copy the input sentence and hence fail to transfer the style. To force the model to learn meaningful style properties, Hu et al. (2017, 2018) apply a style classifier for style regularization. The style classifier encourages the encoder-decoder model to transfer x with the correct style label s̃:

L_style = - E [ log p_C(s̃ | ŷ) ],    (3)
where p_C is the style classifier pretrained on the target domain. The overall training objective for text style transfer within the target domain is written as:

L_target = L_rec + λ L_style,    (4)

where λ balances the two terms.
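A minimal sketch of how the reconstruction loss, the style regularization, and the combined target-domain objective fit together; the weight `lam` and all function names are our own illustrative assumptions, not the paper's implementation:

```python
import math

def reconstruction_loss(token_probs):
    # Eqn. (2): negative log-likelihood of recovering x from (z, s).
    return -sum(math.log(p) for p in token_probs)

def style_regularization(p_desired_style):
    # Eqn. (3): negative log-probability that the pre-trained style
    # classifier assigns the desired style to the transferred sentence.
    return -math.log(p_desired_style)

def target_objective(token_probs, p_desired_style, lam=1.0):
    # Eqn. (4): reconstruction plus weighted style regularization;
    # `lam` is an assumed balancing weight.
    return reconstruction_loss(token_probs) + lam * style_regularization(p_desired_style)

# Two reconstruction steps at p=0.5 each and a classifier score of 0.5:
# -log(0.25) - log(0.5) = log(8)
assert abs(target_objective([0.5, 0.5], 0.5) - math.log(8)) < 1e-9
```

The sketch makes the failure mode discussed in Section 4.2 concrete: if the classifier term is much easier to drive down than the reconstruction term, the combined loss is dominated by style regularization.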
4 Domain Adaptive Text Style Transfer

In this section, we present Domain Adaptive Style Transfer (DAST) models that perform style transfer on a target domain by borrowing strength from a source domain, while keeping the transfer domain-specific.

4.1 Problem Definition

Suppose we have two sets of style-labelled sentences X_s = {(x_s, s_s)} and X_t = {(x_t, s_t)} in the source domain and the target domain, respectively. x_s denotes a source sentence, and s_s denotes the corresponding style label, which belongs to a source style label set S_s (e.g., positive/negative). s_s can be available or unknown. Likewise, the pair (x_t, s_t) represents a sentence and its style label in the target domain, where s_t ∈ S_t.

We consider domain adaptation in two settings: (i) the source style is unknown, e.g., we may have a large corpus, such as Yahoo! Answers, but the underlying style of each sample is not available; (ii) the source styles are available and are the same as the target styles, i.e., S_s = S_t; e.g., both IMDB movie reviews and Yelp restaurant reviews have the same style classes (negative and positive sentiment).

In both scenarios, we assume that the target domain only has limited non-parallel data. With the help of the source domain data X_s, the goal is to transfer x_t to y_t in the target domain. The transferred sentence y_t should simultaneously hold: the main content of x_t, a style different from s_t, and domain-specific characteristics of the target data distribution.

Figure 1: Illustration of the proposed DAST-C (left) and DAST (right) models. DAST-C learns generic content information through auto-encoding on massive source domain data whose style is unknown. DAST additionally employs domain vectors and domain-specific style classifiers. Best viewed in color.

4.2 DAST with unknown-stylized source data

In this section, we investigate the case where the source style is unknown. We first examine a drawback of limited target data to motivate our method. With limited target data, Eqn. (4) may yield undesirable transferred text, where the generated text tends to use the most discriminative words that the target style prefers while ignoring the content. This is because the classifier typically requires less data to train than a sequence auto-encoder. The classifier objective thus dominates Eqn. (4), biasing the generator toward the most representative stylized (e.g., positive or negative) words rather than preserving the content (see Table 5 for examples).

We consider alleviating this issue by leveraging massive source domain data to enhance the content-preserving ability, though the underlying styles in the source domain are unknown. By jointly training an auto-encoder on both the source and target domain data, the learned generic content information enables the model to yield better content preservation on the target domain.

To utilize the source data, we assume the source style set contains only a special unknown-style label s_u, separate from the target styles S_t. The semantic representation of the source data is encoded by the shared encoder, i.e., z_s = E(x_s). The decoder takes z_s with style s_u to generate sentences in the source domain. The auto-encoding reconstruction objective on the source domain is:

L_rec_s = - E_{x_s ∈ X_s} [ log p_D(x_s | z_s, s_u) ],    (5)

where the encoder-decoder model is shared across both domains. Therefore, the corresponding objective can be written as:

L_DAST-C = L_rec_s + L_rec + λ L_style.    (6)
This can be perceived as combining the source domain data with the target domain data to train a better encoder-decoder framework, while target-specific style information on the target domain is learned through L_style.

Note that the two reconstruction objectives are conditioned on domain-specific style labels, s_u and s, which implicitly encourages the model to learn domain-specific features. The decoder can thus generate target sentences adaptively with the target styles, while achieving favorable content preservation with the generic content information modeled by the shared encoder-decoder. We refer to this model, illustrated in Figure 1 (left), as Domain Adaptive Style Transfer with generic Content preservation (DAST-C).
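The DAST-C training signal can be sketched as combining reconstruction on both domains with target-only style regularization; the function name and weight below are our own illustrative assumptions:

```python
def dast_c_objective(rec_src, rec_tgt, style_tgt, lam=1.0):
    """Sketch of Eqn. (6): auto-encoding reconstruction on the source
    domain (conditioned on the unknown-style label s_u) and on the
    target domain, plus style regularization on the target domain
    only, since source styles are unknown. `lam` is an assumed weight."""
    return rec_src + rec_tgt + lam * style_tgt

# A massive source corpus contributes only through rec_src, so style
# supervision comes solely from the (limited) target-domain term.
assert dast_c_objective(2.0, 1.5, 0.5, lam=2.0) == 4.5
```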

4.3 DAST with stylized source data

We further explore the scenario where S_s = S_t. In this case, besides the generic content information, there is much style information from the source domain that can be leveraged; e.g., generic stylized expressions like "fantastic" and "terrible" for sentiment transfer apply to both restaurant and movie reviews. We thus consider borrowing the full strength of the source data by sharing learned knowledge of both generic content and style information.

A straightforward way to achieve this is to train Eqn. (4) on both domains. However, simply mixing the two domains together will lead to undesirable style transfers, where the transfer is not domain-specific. For example, when adapting the IMDB movie reviews to the Yelp restaurant reviews, directly sharing the style transfer model without specifying the domain will inevitably result in generations like “The pizza is dramatic!”.

To alleviate this problem, we introduce additional domain vectors, encouraging the model to perform style transfer in a domain-aware manner. The proposed DAST model is illustrated in Figure 1 (right). Consider two domain vectors, d_s for the source domain and d_t for the target domain. We rewrite the auto-encoding loss as:

L_rec = - E_{(x_s, s_s) ∈ X_s} [ log p_D(x_s | z_s, s_s, d_s) ] - E_{(x_t, s_t) ∈ X_t} [ log p_D(x_t | z_t, s_t, d_t) ],    (7)

where the encoder-decoder model is shared across domains. The domain vectors d_s and d_t, learned by the model, implicitly guide the decoder to generate sentences with domain-specific characteristics. Note that the style label sets are shared, i.e., S_s = S_t. This enables the model to learn generic style information from both domains. On the other hand, explicitly learning precise stylized information within each domain is crucial for generating domain-specific styles. Thus, two domain-specific style classifiers ensure that the model learns the corresponding styles by conditioning on d_s in the source domain or d_t in the target domain:

L_style = - E [ log p_C_s(s̃_s | ŷ_s) ] - E [ log p_C_t(s̃_t | ŷ_t) ],    (8)
where ŷ_s and ŷ_t are the transferred sentences with pre-specified styles s̃_s and s̃_t in the source and target domains, respectively. The domain-specific style classifiers, C_s and C_t, are trained separately on each domain. The signals from the classifiers, combined with the domain vectors and style labels, encourage the model to learn domain-specific styles. The overall training objective of the proposed DAST model is:

L_DAST = L_rec + λ L_style.    (9)
The domain-specific style classifiers enforce the model to learn domain-specific style information conditioned on d_s or d_t, which in turn drives the model to generate sentences with domain-specific words. The model can thus distinguish domain-specific features and adaptively transfer styles in a domain-aware manner.
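A minimal sketch of the routing behind Eqn. (8): a transferred sentence is scored by the classifier of its own domain, so style signals stay domain-aware. The dictionary interface and names here are our own illustrative assumptions, not the paper's API:

```python
import math

def domain_specific_style_loss(domain, p_style):
    """One term of Eqn. (8): the transferred sentence ŷ is scored by
    the classifier of its own domain (C_s or C_t), never the other's.
    `p_style` maps each domain to that domain's classifier probability
    of the desired style for ŷ (hypothetical precomputed values)."""
    assert domain in {"source", "target"}
    return -math.log(p_style[domain])

# Hypothetical classifier scores for the same transferred sentence:
scores = {"source": 0.8, "target": 0.5}
# Routed through the target-domain classifier C_t: -log(0.5) = log(2)
assert abs(domain_specific_style_loss("target", scores) - math.log(2)) < 1e-9
```

Training both classifiers separately on their own domains is what lets the shared decoder receive distinct, domain-specific style gradients.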

5 Experiments

We evaluate our proposed models on two tasks: sentiment transfer (positive-to-negative and negative-to-positive), and formality transfer (informal-to-formal). In both tasks, we make comparisons with previous approaches over multiple target domains. All experiments are conducted on one Nvidia GTX 1080Ti GPU.

5.1 Dataset

Statistics of the source and target corpora used in the experiments are summarized in Table 1.

Sentiment Transfer
Source Train Target Train Dev Test
IMDB 344k Yelp 444k 4k 1k
Amazon 554k 2k 1k
Yahoo 4k 2k 1k
Formality Transfer
Source Train Target Train Dev Test
GYAFC 103k Enron 6k 500 500
Table 1: Statistics of source and target datasets.

Sentiment Transfer.

For the source domain, we use the IMDB movie review corpus Diao et al. (2014), following the filtering and preprocessing pipelines from Shen et al. (2017). This results in 344k training samples with sentiment labels. For the target domain, both the Yelp restaurant review dataset and the Amazon product review dataset are from Li et al. (2018). For the test sets, we evaluate our methods using the human-transferred sentences annotated by Li et al. (2018) on both the Yelp and Amazon datasets. In addition to the two standard sentiment datasets, we manually collected a Yahoo sentiment question dataset: sentiment-bearing question samples from the Yahoo! Answers dataset Zhang et al. (2015). We split the sentiment questions into 4k/2k/1k samples for the train/dev/test sets, respectively (Table 1). Note that the Yahoo sentiment dataset consists only of questions, whose domain characteristics differ from the IMDB dataset. In all the sentiment experiments, we consider both transfer directions (positive-to-negative and negative-to-positive).

Formality Transfer.

We use Grammarly's Yahoo Answers Formality Corpus (GYAFC) Rao and Tetreault (2018) as the source dataset. The publicly released version of GYAFC covers two topics (Entertainment & Music and Family & Relationships), where each topic contains paired informal and formal sentences written by humans. For the target domain, we use the Enron email conversation dataset, which covers several different fields such as business, politics and daily life. We manually labeled non-parallel sentences written in either the formal or informal style, and split the Enron dataset into 6k/500/500 samples for training, validation and testing, respectively (Table 1). Both the validation and test sets consist solely of informal sentences, for which we collected corresponding formal references from a crowd-sourcing platform for evaluation. We only assess the informal-to-formal transfer direction in the formality transfer experiment.

Yelp Amazon
Model (100% target data) D-acc S-acc hBLEU G-score D-acc S-acc hBLEU G-score
CrossAlign Shen et al. (2017) - 85.0 3.7 8.3 - 23.0 34.1 18.0
Delete&Retrieve Li et al. (2018) - 90.6 14.8 17.9 - 50.9 30.3 25.7
CycleRL Xu et al. (2018) - 88.7 12.3 16.4 - 68.7 14.2 15.5
SMAE Zhang et al. (2018c) - 85.1 12.1 15.5 - 71.1 12.9 14.9
ControlGen Hu et al. (2018) - 91.5 25.5 27.4 - 79.0 31.1 30.5
Finetune 96.1 91.3 25.6 27.8 97.4 79.2 34.1 34.3
DAST-C (ours) 93.8 91.7 25.7 27.5 96.7 81.9 35.7 35.0
DAST (ours) 95.8 92.3 26.3 28.9 96.9 83.0 35.9 35.1
Model (1% target data) D-acc S-acc hBLEU G-score D-acc S-acc hBLEU G-score
CrossAlign Shen et al. (2017) - 76.3 4.8 8.5 - 83.2 2.0 5.9
Delete&Retrieve Li et al. (2018) - 82.1 4.1 7.6 - 63.0 6.9 9.3
CycleRL Xu et al. (2018) - 86.6 1.4 5.2 - 79.5 0.7 3.8
SMAE Zhang et al. (2018c) - 96.0 1.2 4.8 - 87.2 0.4 3.2
ControlGen Hu et al. (2018) - 98.5 3.7 8.6 - 83.2 1.9 5.8
Finetune 98.1 96.7 13.9 18.5 96.0 89.2 11.3 14.4
DAST-C (ours) 96.9 90.3 17.8 19.3 94.8 78.2 20.1 21.6
DAST (ours) 97.0 92.6 20.1 23.1 94.6 82.7 21.0 23.1
Table 2: Automatic evaluation results on Yelp and Amazon test sets. D-acc and S-acc denote domain accuracy and style accuracy, respectively. G-score is the geometric mean of S-acc and hBLEU.
5.2 Evaluation

Automatic Metrics.

We evaluate the effectiveness of our DAST models based on three automatic metrics:


(i) Content Preservation. We assess content preservation according to n-gram statistics, by measuring the BLEU scores Papineni et al. (2002) between generated sentences and human references on the target domain, referred to as human BLEU (hBLEU). When no human reference is available (e.g., Yahoo), we compute the BLEU scores with respect to the input sentences.

(ii) Style Control. We generate samples from the model and measure the style accuracy with a style classifier that is pre-trained on the target domain. We refer to this style accuracy as S-acc.

(iii) Domain Control. To validate whether the generated sentences hold the characteristics of the target domain, we adopt a pre-trained domain classifier to measure the percentage of generated sentences that belong to the target domain. We refer to this domain accuracy as D-acc.

All the pre-trained classifiers are implemented with TextCNN Kim (2014); Zhang et al. (2017). The test accuracies of all classifiers used for evaluation are reported in Appendix A.1. Following Xu et al. (2018), we also evaluate all methods using a single unified metric, G-score, which is the geometric mean of style accuracy and hBLEU.
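The unified metric is straightforward to compute; a minimal sketch (function name is ours, and the synthetic inputs below are illustrative rather than taken from Table 2):

```python
import math

def g_score(s_acc, h_bleu):
    """G-score following Xu et al. (2018): the geometric mean of
    style accuracy (S-acc) and hBLEU."""
    return math.sqrt(s_acc * h_bleu)

# Synthetic values: a system with S-acc 64 and hBLEU 16 scores
# sqrt(64 * 16) = 32. The geometric mean rewards balance: a system
# with perfect style accuracy but near-zero hBLEU scores near zero.
assert g_score(64.0, 16.0) == 32.0
assert g_score(100.0, 0.0) == 0.0
```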

Human Evaluation.

To accurately evaluate the quality of transferred sentences, we conduct human evaluations of content preservation, style control and fluency, following Mir et al. (2019). Previous works Subramanian et al. (2018); Gong et al. (2019) ask workers to evaluate quality via a numerical score; however, we found that this empirically leads to high-variance results. Instead, we pair transferred sentences from two different models and ask workers to choose the sentence they prefer, compared to the input, on each evaluation aspect. We provide a "No Preference" option for cases where workers find the qualities of the two sentences indistinguishable. Details of the human evaluation instructions are included in Appendix A.3. For each test, we randomly sample 100 sentences from the corresponding test set and collect three human responses for each pair on every evaluation aspect, resulting in 2,700 responses in total.

5.3 Experimental Setup

The encoder and the decoder are implemented by one-layer GRU Cho et al. (2014) with hidden dimensions 500 and 700, respectively. The domain-vector dimension is set to 50. The style labels are represented by learnable vectors with 150 dimensions. The decoder is initialized by a concatenation of representations of content, style, and domain vectors. If domain vectors are not used, the dimension of style labels is set to 200; accordingly, the initialization of the decoder is a concatenation of content and style representations. TextCNN Kim (2014) is employed for the domain-specific style classifiers pre-trained on corresponding domains. After pre-training, the parameters of the classifiers are fixed. We use the hard-sampling trick Logeswaran et al. (2018) to back-propagate the loss through discrete tokens from the classifier to the encoder-decoder model. During training, we assign each mini-batch the same amount of source and target data to balance the training.
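As a quick consistency check on the reported architecture, the decoder's initial state is a concatenation of the content, style and domain representations, whose dimensions add up to the 700-d GRU hidden size; the variable names below are ours:

```python
# Decoder initialization dimensions from Section 5.3 (DAST):
# the 500-d content representation z, the 150-d style embedding, and
# the 50-d domain vector are concatenated to initialize the 700-d GRU.
content_dim, style_dim, domain_dim = 500, 150, 50
decoder_init_dim = content_dim + style_dim + domain_dim
assert decoder_init_dim == 700

# Without domain vectors (the DAST-C variant), the style embedding is
# enlarged to 200-d so the concatenation still spans 700 dimensions.
assert content_dim + 200 == 700
```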

We make an extensive comparison with five state-of-the-art text style transfer models: CrossAlign Shen et al. (2017), Delete&Retrieve Li et al. (2018), CycleRL Xu et al. (2018), SMAE Zhang et al. (2018c) and ControlGen Hu et al. (2018). We also experiment with a simple and effective domain adaptation baseline, Finetune, which is trained with Eqn. (4) on the source domain and then fine-tuned on the target domain.

5.4 Results

Figure 2: Results on the Yelp test set with different percentages of target domain training samples.
Style Control (Yelp data) Content Preservation (Yelp data) Fluency (Yelp data)
Our Model Neutral Comparison Our Model Neutral Comparison Our Model Neutral Comparison
DAST 56.2% 30.5% 13.3% ControlGen DAST 47.0% 48.4% 4.6% ControlGen DAST 47.1% 40.8% 12.0% ControlGen
DAST 40.5% 42.3% 17.2% DAST-C DAST 22.4% 65.7% 11.9% DAST-C DAST 29.1% 55.8% 15.1% DAST-C
DAST 17.9% 18.5% 63.6% Human DAST 17.7% 47.4% 34.9% Human DAST 10.1% 30.4% 59.5% Human
Style Control (Enron) Content Preservation (Enron) Fluency (Enron)
Our Model Neutral Comparison Our Model Neutral Comparison Our Model Neutral Comparison
DAST 74.2% 19.8% 6% ControlGen DAST 80.8% 14.8% 4.4% ControlGen DAST 73.8% 20.6% 5.6% ControlGen
DAST 28.4% 50.2% 21.4% DAST-C DAST 26.8% 48.8% 24.4% DAST-C DAST 26.9% 51.6% 21.5% DAST-C
DAST 17.6% 30.5% 51.9% Human DAST 15.3% 36.9% 47.8% Human DAST 11.6% 36.5% 51.9% Human
Table 3: Results of human evaluation for style control, content preservation and fluency, showing preferences (%) for the DAST model vis-à-vis baselines or other comparison systems. Evaluation results of the overall transfer quality are provided in Appendix A.3.


Yahoo Sentiment Transfer
Model D-acc S-acc BLEU
ControlGen - 99.1 9.7
Finetune 97.8 98.8 31.4
DAST-C 90.7 98.8 35.9
DAST 90.8 99.2 39.2


Table 4: Results on Yahoo sentiment transfer task.

Model Comparisons.

To evaluate the effectiveness of leveraging massive data from other domains, we compare our proposed DAST models with previously proposed models trained on the target domain (Table 2). We observe that by leveraging massive data from the IMDB dataset, our models achieve better performance against all baselines on the sentiment transfer tasks in both the Yelp and Amazon domains.

Notably, when the target domain has limited data (1%), all baselines trained only on the target domain fail completely on content preservation. Finetune preserves content better but suffers from catastrophic forgetting Goodfellow et al. (2013) of the source domain information. As a result, its overall style transfer performance is still suboptimal. On the contrary, with the help of the source domain, DAST obtains a considerable improvement in content preservation compared with the other baselines. Our model also attains favorable performance in terms of style transfer accuracy (S-acc), resulting in a good overall G-score. In general, we observe that DAST-C better preserves content information, while DAST further improves both content preservation and style control. Additionally, both DAST-C and DAST adapt to the target domain, as evidenced by the high domain accuracy (D-acc). The human evaluation results (Table 3) show a strong preference for DAST over DAST-C as well as ControlGen in terms of style control, content preservation and fluency.

Finally, we evaluate our models on the Yahoo sentiment transfer task. As shown in Table 4, both DAST and DAST-C achieve successful style transfer even though the target data consists of questions, which differ substantially from the source IMDB domain. Samples of Yelp and Yahoo sentiment transfer are shown in Table 5. We also investigate the effect of different source domain data in Appendix A.2.

Yelp (positive-to-negative) Yelp (negative-to-positive)
Input the service was great , food delicious , and the value impeccable . and the pizza was cold , greasy , and generally quite awful .
ControlGen the service was horrible , service , the service and very frustrated . and the food was delicious, delicious , and freaking tasty , delicious .
Finetune the service was poor , food , and the experience were . and the pizza was professional , friendly , and always have great .
DAST-C the service was horrible , food horrible , and the slow sparse . and the pizza was fresh, greasy , and generally quite cool .
DAST the service was horrible , food bland , and the value lousy . and the pizza was tasty , juicy , and definitely quite amazing .
Human service was poor and the food expensive and weak tasting . the pizza was warm , not greasy , and generally tasted great .
Yahoo (positive-to-negative) Yahoo (negative-to-positive)
Input who is more romantic ? man or woman ? why do stupid questions constantly receive intelligent answers ?
ControlGen which is more stupid ? and or why ? men do fantastic questions constantly receive intelligent bound !
Finetune the is more expensive ? man or woman ? why do great questions read more entertaining answers ?
DAST-C who is more ugly ? man or woman ? why do important questions constantly receive intelligent answers ?
DAST who is more crazy ? man or woman ? why do nice questions constantly receive intelligent answers ?
Enron (informal-to-formal) Enron (informal-to-formal)
Input ya ’ll need to come visit us in austin . are n’t you suppose to be teaching some kids or something ?
ControlGen could we need to look on saturday in enpower . are you not supposed to be disloyal some kids or something ?
Finetune you will need to go in bed with him . are you not to be able to be some man or something ?
DAST-C you will need to visit town . are not you supposed to be teaching some kids or something ?
DAST yes , you will need to visit us in austin . are you not supposed to be teaching some children or something ?
Human all of you should come visit us in austin . are you not supposed to be instructing children ?
Table 5: Transferred sentences on the Yelp (1% data), Yahoo and Enron datasets, where red denotes successful style transfers, blue denotes content losses, and orange denotes grammar errors. Best viewed in color.


Model D-acc S-acc hBLEU G-score
DAST 97.0 92.6 20.1 23.1
w/o d-spec attributes 83.9 90.9 20.0 22.7
w/o d-spec classifiers 91.4 83.8 19.0 20.8
w/o both 73.8 80.6 18.7 19.9


Setup D-acc S-acc hBLEU G-score
IMDB+Yelp 97.0 92.6 20.1 23.1
Finetune 98.1 96.7 13.9 18.5
IMDB 62.8 59.3 21.4 12.2
Yelp 96.8 98.5 3.7 8.6


Table 6: Ablation study on Yelp (1%) dataset with help from IMDB dataset. The results are evaluated on Yelp test set. d-spec is short for domain-specific.

Limiting the Target Domain Data.

We further test the limit of our model by using as little target domain data as possible. Figure 2 shows the quantitative results with different percentages of target domain training data. When the target domain data is insufficient, the content preservation ability of the baseline (trained with target data only) degenerates rapidly despite a relatively high style transfer accuracy. This is undesirable because a transferred sentence can easily have the correct style while barely containing any content similar to the input, simply by retrieving sentences with the target style. Finetune improves content preservation but still suffers from the same problem as target data decreases. Note that DAST-C is not directly comparable to Finetune, as the former does not use the style information in the source domain.

On the other hand, both DAST models bring substantial improvements to content preservation, and can still successfully manipulate the styles, resulting in consistently higher G-scores. This is presumably because our models adapt both the content information and the style information from the source domain to consistently sustain style transfer on the target domain. By learning both generic and domain-specific stylized information, DAST outperforms DAST-C in terms of content preservation and style control. Even with only 400 target domain samples, DAST still attains reasonable text style transfer, whereas the model trained only on the target data generates nonsensical sentences. Meanwhile, DAST keeps transferring sentences in a domain-aware manner, achieving consistently high domain accuracy.

Ablation Study.

To investigate the effect of individual components and the training setup on overall performance, we conduct an ablation study in Table 6. The domain vectors enable the model to transfer sentences in a domain-aware manner, and thus give the largest boost in domain accuracy. Without domain-specific style classifiers, the model mixes the style information of both domains, resulting in worse style control and content preservation. Additionally, simply increasing the amount of training data (i.e., the row "w/o both") improves content preservation, while introducing a data distribution discrepancy between the training (Yelp+IMDB) and test data (Yelp), as evidenced by the lower S-acc and D-acc scores.

In terms of the training setup, the source domain IMDB mostly helps content preservation, while accurate style information is mainly learned from the target domain Yelp. Finetune gives higher S-acc and D-acc but lower hBLEU due to catastrophic forgetting. Our proposed DAST uses the source domain data more wisely, thus giving balanced results on style and domain control as well as content preservation.


Enron Formality Transfer
Model        D-acc   S-acc   hBLEU
ControlGen   -       81.2     4.74
Finetune     91.3    81.6    14.7
DAST-C       87.6    89.2    15.5
DAST         88.4    91.6    16.4

Table 7: Results on Enron formality transfer tasks.

Non-parallel Style Transfer with Parallel Source Data.

Finally, to verify the versatility of the proposed models in different scenarios, we investigate another domain adaptation setting on the challenging formality transfer task, where the source domain data (GYAFC) is parallel but the target domain data (Enron) is non-parallel. Since parallel data is available in the source domain, we can simply add a sequence-to-sequence loss on the source domain data to the objectives in Eqn. (6) and Eqn. (9) to help the target domain, which lacks parallel data. Results are summarized in Table 7. DAST outperforms the other methods on both style control and content preservation while keeping the transferred sentences faithful to target-specific characteristics (D-acc). A strong human preference for DAST over the baselines can be observed in Table 3. Qualitative samples are provided in Table 5.
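The combination just described — a supervised term on the parallel source pairs added to the non-parallel target objective — can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names and the weighting knob `lam` are our own assumptions.

```python
import math

def seq2seq_nll(step_dists, target_ids):
    """Token-level negative log-likelihood of a reference sequence under
    the decoder's per-step distributions (teacher forcing).
    `step_dists` is one {token_id: probability} dict per decoding step."""
    return -sum(math.log(d[t]) for d, t in zip(step_dists, target_ids)) / len(target_ids)

def combined_loss(nonparallel_loss, parallel_nll, lam=1.0):
    """Target-domain non-parallel objective plus a weighted supervised
    seq2seq term on the parallel source-domain pairs (hypothetical form)."""
    return nonparallel_loss + lam * parallel_nll

# Toy example: two decoding steps, reference tokens [0, 0].
nll = seq2seq_nll([{0: 0.5, 1: 0.5}, {0: 0.9, 1: 0.1}], [0, 0])
total = combined_loss(1.0, nll, lam=0.5)
```

In practice both terms would be computed on minibatches and backpropagated jointly; the sketch only shows how the two losses combine.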

6 Conclusion

We present two simple yet effective domain adaptive text style transfer models that leverage massively available data from other domains to facilitate the transfer task in the target domain. The proposed models achieve better content preservation with the generic information learned from the source domain while simultaneously distinguishing the domain-specific information, which enables them to transfer text in a domain-adaptive manner. Extensive experiments demonstrate their robustness and applicability in various scenarios where the target data is limited.


Acknowledgments

We would like to thank the reviewers for their constructive comments. We thank NVIDIA Corporation for the donation of the GPU used for this research. We also thank Hao Peng and Tianyi Zhou for their helpful discussions.


References

  • R. Akama, K. Inada, N. Inoue, S. Kobayashi, and K. Inui (2017) Generating stylistically consistent dialog responses with transfer learning. In IJCNLP, Cited by: §1.
  • A. Axelrod, X. He, and J. Gao (2011) Domain adaptation via pseudo in-domain data selection. In EMNLP, Cited by: §2.
  • K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In EMNLP, Cited by: §5.3.
  • C. Chu, R. Dabre, and S. Kurohashi (2017) An empirical comparison of simple domain adaptation methods for neural machine translation. arXiv preprint arXiv:1701.03214. Cited by: §2.
  • Q. Diao, M. Qiu, C. Wu, A. J. Smola, J. Jiang, and C. Wang (2014) Jointly modeling aspects, ratings and sentiments for movie recommendation (jmars). In SIGKDD, Cited by: §5.1.
  • C. N. dos Santos, I. Melnyk, and I. Padhi (2018) Fighting offensive language on social media with unsupervised text style transfer. In ACL, Cited by: §1.
  • Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan (2018) Style transfer in text: exploration and evaluation. In AAAI, Cited by: §1, §2.
  • C. Gan, Z. Gan, X. He, J. Gao, and L. Deng (2017) Stylenet: generating attractive visual captions with styles. In CVPR, Cited by: §1.
  • H. Gong, S. Bhat, L. Wu, J. Xiong, and W. Hwu (2019) Reinforcement learning based text style transfer without parallel training corpus. arXiv preprint arXiv:1903.10671. Cited by: §2, §5.2.
  • I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211. Cited by: §5.4.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §1.
  • A. Gretton, A. Smola, J. Huang, M. Schmittfull, K. Borgwardt, and B. Schölkopf (2009) Covariate shift and local learning by distribution matching. In Dataset Shift in Machine Learning, Cited by: §1.
  • Z. Hu, H. Shi, Z. Yang, B. Tan, T. Zhao, J. He, W. Wang, L. Qin, D. Wang, et al. (2018) Texar: a modularized, versatile, and extensible toolkit for text generation. arXiv preprint arXiv:1809.00794. Cited by: §3, §5.3, Table 2.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In ICML, Cited by: §1, §2, §3.
  • X. Hua and L. Wang (2017) A pilot study of domain adaptation effect for neural abstractive summarization. In Proceedings of the Workshop on New Frontiers in Summarization, Cited by: §2.
  • H. Jhamtani, V. Gangal, E. Hovy, and E. Nyberg (2017) Shakespearizing modern language using copy-enriched sequence to sequence models. In Proceedings of the Workshop on Stylistic Variation, Cited by: §1.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In EMNLP, Cited by: §5.2, §5.3.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1.
  • P. Koehn and J. Schroeder (2007) Experiments in domain adaptation for statistical machine translation. In Proceedings of the second workshop on statistical machine translation, Cited by: §2.
  • J. Li, R. Jia, H. He, and P. Liang (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. In NAACL, Cited by: §1, §2, §5.1, §5.3, Table 2.
  • K. Lin, D. Li, X. He, Z. Zhang, and M. Sun (2017) Adversarial ranking for language generation. In NeurIPS, Cited by: §2.
  • L. Logeswaran, H. Lee, and S. Bengio (2018) Content preserving text generation with attribute controls. In NeurIPS, Cited by: §2, §5.3.
  • P. Michel and G. Neubig (2018) Extreme adaptation for personalized neural machine translation. In ACL, Cited by: §2.
  • R. Mir, B. Felbo, N. Obradovich, and I. Rahwan (2019) Evaluating style transfer for text. arXiv preprint arXiv:1904.02295. Cited by: §A.3, §5.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In ACL, Cited by: §5.2.
  • S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black (2018) Style transfer through back-translation. In ACL, Cited by: §1, §2.
  • X. Qu, Z. Zou, Y. Cheng, Y. Yang, and P. Zhou (2019) Adversarial category alignment network for cross-domain sentiment classification. In NAACL, Cited by: §2.
  • S. Rao and J. Tetreault (2018) Dear sir or madam, may i introduce the gyafc dataset: corpus, benchmarks and metrics for formality style transfer. In NAACL, Cited by: §1, §5.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016a) Controlling politeness in neural machine translation via side constraints. In NAACL, Cited by: §2.
  • R. Sennrich, B. Haddow, and A. Birch (2016b) Improving neural machine translation models with monolingual data. In ACL, Cited by: §2.
  • T. Shen, T. Lei, R. Barzilay, and T. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. In NeurIPS, Cited by: §1, §2, §5.1, §5.3, Table 2.
  • S. Subramanian, G. Lample, E. M. Smith, L. Denoyer, M. Ranzato, and Y. Boureau (2018) Multiple-attribute text style transfer. arXiv preprint arXiv:1811.00552. Cited by: §1, §2, §5.2.
  • T. Wen, M. Gašić, N. Mrkšić, L. M. Rojas-Barahona, P. Su, D. Vandyke, and S. Young (2016) Multi-domain neural network language generation for spoken dialogue systems. In NAACL, Cited by: §2.
  • J. Xu, S. Xu, Q. Zeng, X. Zhang, X. Ren, H. Wang, and W. Li (2018) Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach. In ACL, Cited by: §2, §5.2, §5.3, Table 2.
  • Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick (2018) Unsupervised text style transfer using language models as discriminators. In NeurIPS, Cited by: §2.
  • S. Zhang, E. Dinan, J. Urbanek, A. Szlam, D. Kiela, and J. Weston (2018a) Personalizing dialogue agents: i have a dog, do you have pets too?. In ACL, Cited by: §1.
  • X. Zhang, J. Zhao, and Y. LeCun (2015) Character-level convolutional networks for text classification. In NeurIPS, Cited by: §5.1.
  • Y. Zhang, N. Ding, and R. Soricut (2018b) SHAPED: shared-private encoder-decoder for text style adaptation. In NAACL, Cited by: §2.
  • Y. Zhang, J. Xu, P. Yang, and X. Sun (2018c) Learning sentiment memories for sentiment modification without parallel data. In EMNLP, Cited by: §2, §5.3, Table 2.
  • Y. Zhang, X. Gao, S. Lee, C. Brockett, M. Galley, J. Gao, and B. Dolan (2019) Consistent dialogue generation with self-supervised feature learning. arXiv preprint arXiv:1903.05759. Cited by: §1.
  • Y. Zhang, D. Shen, G. Wang, Z. Gan, R. Henao, and L. Carin (2017) Deconvolutional paragraph representation learning. In NeurIPS, Cited by: §5.2.

Appendix A Supplementary Material

a.1 Evaluation Classifiers

We train the style classifiers to classify styles on the target domain. The domain classifiers are trained to distinguish samples from different domains. After training, all classifiers are used for evaluation only. The test accuracy of the evaluation classifiers is reported in Table 8.

Style Classifier          Domain Classifier
Dataset   Accuracy        Dataset          Accuracy
Yelp      97.6%           IMDB & Yelp      94.8%
Amazon    81.0%           IMDB & Amazon    97.1%
Yahoo     99.4%           IMDB & Yahoo     86.9%
ENRON     87.0%           GYAFC & ENRON    89.7%
Table 8: Test accuracy of evaluation classifiers.
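The train-once, evaluate-only protocol above can be mimicked with any text classifier; the paper uses CNN classifiers (Kim, 2014), but a toy bag-of-words scorer is enough to sketch the idea. All function names and the tiny dataset here are illustrative, not from the paper.

```python
from collections import Counter

def train_word_counts(labeled_sentences):
    """Per-label word counts for a naive bag-of-words classifier.
    A stand-in for the CNN classifiers used in the paper."""
    counts = {}
    for sentence, label in labeled_sentences:
        counts.setdefault(label, Counter()).update(sentence.lower().split())
    return counts

def classify(counts, sentence):
    """Pick the label whose training vocabulary overlaps the sentence most."""
    words = sentence.lower().split()
    return max(counts, key=lambda lab: sum(counts[lab][w] for w in words))

# Illustrative two-example "training set" for a sentiment style classifier:
model = train_word_counts([
    ("the food was great", "positive"),
    ("terrible service and bad food", "negative"),
])
print(classify(model, "the food was great"))  # positive
```

Once trained (on real data, with a real model), such a classifier is frozen and used only to score transferred outputs, exactly as in Table 8.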

a.2 Source Domain Data

To investigate the effectiveness of the source domain data, we evaluate the proposed models on different source domains that have unknown styles or the same styles as Yelp. Results are included in Table 9. The proposed models robustly achieve favorable style transfer with the help of different source domain data. Since the DAST-C model mainly learns generic content information by modeling the large corpus in the source domain, the amount of source training data significantly affects its performance, especially content preservation (BLEU). On the other hand, since DAST also adapts the generic style information, a source domain with closer sentiment information (IMDB) benefits the target domain (Yelp) more than the TripAdvisor dataset does.

Model    Source        # samples   D-acc   S-acc   BLEU
DAST-C   IMDB          572k        96.9    90.3    17.8
         Yahoo         900k        90.3    91.3    19.6
         GYAFC         206k        93.5    92.9    16.1
DAST     IMDB          334k        97.0    92.6    20.1
         TripAdvisor   572k        86.2    91.4    18.4
Table 9: Performance on the Yelp (1% data) dataset with the help of different source domain data.

a.3 Human Evaluation

For each human evaluation on the Yelp sentiment transfer and Enron formality transfer tasks, we randomly sampled 100 sentences from the corresponding test set and collected three responses for each pair on every evaluation aspect, yielding 2,700 responses in total. Each pair of system outputs was randomly presented to 7 crowd-sourced judges, who indicated their preference for style control, content preservation, and fluency using the form shown in Figure 3. To minimize the impact of spamming, we employed the top-ranked 30% of U.S. workers provided by the crowd-sourcing service. To make the task less abstract, following Mir et al. (2019), we asked the judges to evaluate content preservation independently of style information. Detailed task descriptions and examples were also provided to guide the judges. Inter-rater agreement, measured as agreement with the most common judgment, was 75.9%.
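The agreement measure above (fraction of individual judgments matching the most common judgment for their item) can be computed as follows; the function and variable names are our own, not from the paper.

```python
from collections import Counter

def majority_agreement(items):
    """Fraction of individual judgments that match the most common
    judgment for their item. `items` is a list of per-item judgment
    lists, one judgment per rater."""
    matches = total = 0
    for judgments in items:
        # Count how many raters gave the single most frequent answer.
        matches += Counter(judgments).most_common(1)[0][1]
        total += len(judgments)
    return matches / total

# Two items, three raters each: 5 of 6 judgments match their majority.
print(round(majority_agreement([["A", "A", "B"], ["A", "A", "A"]]), 3))  # 0.833
```

This is a simple descriptive statistic; unlike chance-corrected measures such as Cohen's kappa, it does not adjust for agreement expected by random guessing.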

Besides the style control, content preservation, and fluency evaluated in Table 3, we also asked each worker to judge the overall quality, considering the three aspects as a whole. Results are summarized in Table 10, showing that DAST achieves better overall quality than the baselines.

Figure 3: Questionnaire used to elicit pairwise judgments from crowd-sourced annotators. Candidate responses were presented in random order.
Overall Quality (Yelp 1% data)
Our Model            Neutral            Comparison
DAST       81.1%     14.0%      4.9%    ControlGen
DAST       31.4%     43.0%     25.6%    DAST-C
DAST       16.9%     23.9%     59.2%    human

Overall Quality (Enron)
Our Model            Neutral            Comparison
DAST       52.7%     35.3%     12.0%    ControlGen
DAST       34.0%     48.4%     17.6%    DAST-C
DAST       12.0%     17.8%     68.0%    human

Table 10: Results of human evaluation in terms of overall quality on the Yelp sentiment transfer and Enron formality transfer tasks.