
The Daunting Task of Real-World Textual Style Transfer Auto-Evaluation

The difficulty of textual style transfer lies in the lack of parallel corpora, and numerous advances have been proposed for unsupervised generation. However, significant problems remain with the auto-evaluation of style transfer tasks. Based on the summaries of Pang and Gimpel (2018) and Mir et al. (2019), style transfer evaluation relies on three criteria: style accuracy of transferred sentences, content similarity between original and transferred sentences, and fluency of transferred sentences. We elucidate the problematic current state of style transfer research: given that current tasks do not represent real use cases of style transfer, the current auto-evaluation approach is flawed. This discussion aims to prompt researchers to think about the future of style transfer and style transfer evaluation research.





1 Introduction

There are numerous recent works on textual style transfer, the task of changing the style of an input sentence while preserving its content (hu-1; shen-1; fu-1). One factor that makes textual style transfer difficult is the lack of parallel corpora. There have been abundant advances in developing methods that do not require parallel corpora (simple-transfer; zhang2018learning; logeswaran2018content; yang2018unsupervised), but significant issues remain with automatic evaluation metrics. Researchers started by using post-transfer style classification accuracy as the only automatic metric (hu-1; shen-1). Researchers then realized the importance of targeting content preservation and fluency in style transfer models, and developed corresponding metrics, starting from fu-1 and authorship. pang2018learning and mir2019evaluating summarized the three evaluation aspects (style accuracy of transferred sentences, content preservation between original and transferred sentences, and fluency of transferred sentences) and developed metrics that correlate well with human judgments. However, given that current tasks do not represent real use cases of style transfer (Section 2), we discuss the potential problems of existing metrics when facing real-world style transfer tasks (Section 3). Moreover, fu-1 and pang2018learning have shown that models taken from different intermediate points of the same training run yield different tradeoffs of style accuracy, content preservation, and fluency. Therefore, more discussion of tradeoffs and metric aggregation is needed (Section 4) for better model comparison and selection.

1.1 Background: Evaluation based on Human-Written “Gold Standards”

First, we show that one intuitive way of evaluating style transfer is inadequate: computing BLEU scores (papineni2002bleu) between generated/transferred outputs and human-written gold-standard outputs. In fact, simple-transfer crowdsourced 1000 Yelp human-written references as test data (500 positive-sentiment sentences transferred from negative sentiment, and 500 negative-sentiment sentences transferred from positive sentiment). From Table 1, we see the striking phenomenon that untransferred sentences have the highest BLEU score by a large margin, compared to transferred sentences generated by the best-performing models. (We use the multi-bleu.perl script to compute BLEU.)

Model                    BLEU   Accuracy
CAE                       4.9    0.818
CAE                       6.8    0.765
Multi-decoder             7.6    0.792
Multi-decoder            11.2    0.525
Style embedding          15.4    0.095
Template                 18.0    0.867
Delete/Retrieve          12.6    0.909
LM                       13.4    0.854
LM + classifier          22.3    0.900
CAE+losses (model 6)     22.5    0.843
CAE+losses (model 6)     16.3    0.897
Untransferred            31.4    0.024

Table 1: Results on Yelp “style” (sentiment) transfer. BLEU is computed between 1000 transferred sentences and human references; accuracy is restricted to the same 1000 sentences. Accuracy: post-transfer style classification accuracy (by a classifier pretrained on the two corpora). CAE: cross-aligned autoencoder as in shen-1. BLEU scores for simple-transfer are copied from evaluations by yang2018unsupervised. Note that if a model name appears twice, the models are from different stopping points during training.

This phenomenon suggests either that prior work on this task has not surpassed the baseline of copying the input sentence, or that BLEU is not a good style transfer metric by itself (as it trades off against transfer accuracy, as shown in the table). However, it may be a good metric for content preservation, one particular aspect of style transfer evaluation. In fact, simple-transfer used BLEU to measure content preservation.
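To make the copy-baseline phenomenon concrete, here is a minimal, self-contained sketch of the comparison with a simplified BLEU. The paper uses multi-bleu.perl; the `bleu` implementation below (BLEU-2 with a brevity penalty) and the example sentences are illustrative stand-ins, not the paper's setup.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference, hypothesis, max_n=2):
    """Sentence-level BLEU with uniform weights and a brevity penalty
    (a simplified stand-in for multi-bleu.perl)."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_counts, hyp_counts = Counter(ngrams(ref, n)), Counter(ngrams(hyp, n))
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Hypothetical example: the human reference flips sentiment but keeps content.
reference = "the food was great and the staff was friendly"
original = "the food was terrible and the staff was rude"      # untransferred input
transferred = "good food and nice people"                      # a lossy transfer

copy_score = bleu(reference, original)
transfer_score = bleu(reference, transferred)
# The untransferred copy shares far more n-grams with the reference,
# so it can score higher despite performing no transfer at all.
```

This mirrors the table: copying the input preserves most reference n-grams, so BLEU alone cannot detect that no transfer happened.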

Obtaining human references is costly, and using human references may only solve one aspect of evaluation (i.e., content preservation). We thus complement this aspect and reduce cost by focusing our discussion on automatic evaluation metrics that do not require a large number of references.

1.2 Background: Existing Auto-Evaluation Metrics

Researchers have agreed on the following three aspects to evaluate style transfer (mir2019evaluating; pang2018learning).

Style accuracy.

Style accuracy is the percentage of sentences transferred to the correct (target) style. It is evaluated automatically as post-transfer style classification accuracy, computed by a classifier pretrained on the original corpora. Initially, this was the only auto-evaluation approach used in style transfer work (hu-1; shen-1).
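As a hedged illustration of how this metric is computed, the sketch below substitutes a toy lexicon-based classifier for the pretrained one; `classify`, the cue lists, and the example outputs are all made up.

```python
# Toy lexicons standing in for a classifier pretrained on the two corpora.
POSITIVE_CUES = {"great", "friendly", "delicious"}
NEGATIVE_CUES = {"terrible", "rude", "bland"}

def classify(sentence):
    """Toy lexicon-based stand-in for a pretrained style classifier."""
    tokens = set(sentence.split())
    pos, neg = len(tokens & POSITIVE_CUES), len(tokens & NEGATIVE_CUES)
    return "positive" if pos >= neg else "negative"

def style_accuracy(transferred_sentences, target_style):
    """Fraction of transferred sentences classified as the target style."""
    hits = sum(classify(s) == target_style for s in transferred_sentences)
    return hits / len(transferred_sentences)

outputs = ["the food was great", "the staff was rude", "delicious and friendly"]
acc = style_accuracy(outputs, target_style="positive")  # 2 of 3 classified positive
```

In practice the classifier is a trained model, but the accuracy computation over transferred outputs has this shape.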

Content similarity.

Researchers have realized that even when accuracy is high, the content of the transferred sentence does not necessarily correspond to the content of the original sentence. In particular, pang2018learning computed sentence-level content similarity by first averaging GloVe word embeddings (glove) weighted by word-importance scores, and then computing the cosine similarity between the embedding of the original sentence and the embedding of the transferred sentence. Next, they averaged the cosine similarities over all original-transferred sentence pairs. The metric has high correlation with human judgments.
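A minimal sketch of this kind of content similarity metric follows; the embedding table `EMB`, the weight table `IDF`, and the example sentences are toy stand-ins for pretrained GloVe vectors and real importance scores, not the paper's data.

```python
import math

# Toy 2-d stand-ins for GloVe vectors and word-importance weights.
EMB = {"oliver": [1.0, 0.0], "thought": [0.4, 0.6], "deemed": [0.5, 0.5],
       "party": [0.0, 1.0], "gathering": [0.1, 0.9]}
IDF = {"oliver": 2.0, "thought": 1.0, "deemed": 1.5, "party": 1.2, "gathering": 1.3}

def sent_embedding(sentence):
    """Importance-weighted average of word vectors."""
    vecs = [(IDF[w], EMB[w]) for w in sentence.split() if w in EMB]
    total = sum(w for w, _ in vecs)
    return [sum(w * v[i] for w, v in vecs) / total for i in range(2)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Two paraphrases sharing the content word "oliver" score close to 1.
sim = cosine(sent_embedding("oliver deemed gathering"),
             sent_embedding("oliver thought party"))
```

Averaging `sim` over all original-transferred pairs yields the corpus-level score.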


mir2019evaluating first removed style words from the original sentence and the transferred sentence using a style lexicon, and then replaced those words with a placeholder. Next, they used METEOR (denkowski-1) and Earth Mover’s Distance (pele2009fast) to compute content similarity. Other works have used similar approaches (cycle-reinforce; simple-transfer; back-translation), mostly involving BLEU and METEOR (papineni2002bleu; denkowski-1).


Fluency.

Researchers realized that style accuracy and content similarity do not guarantee a natural or fluent sentence. pang2018learning trained a language model on the concatenation of the original two corpora (of the two styles) and used the perplexity of the transferred sentence to measure fluency. mir2019evaluating named the metric “naturalness” and followed similar logic with one critical difference: they trained the language model on the target-style corpus only. santos2018fighting and yang2018unsupervised also used perplexity as a measure of naturalness or fluency.
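As an illustration of perplexity as a fluency score, the sketch below uses a toy add-alpha unigram language model in place of a trained LM; the corpus and sentences are made up.

```python
import math
from collections import Counter

# Toy corpus standing in for the concatenation of the two style corpora.
corpus = "the food was great . the food was terrible . the staff was rude .".split()
counts = Counter(corpus)
total = sum(counts.values())
V = len(counts)

def perplexity(sentence, alpha=1.0):
    """Unigram perplexity with add-alpha smoothing (real metrics use a
    trained neural or n-gram LM; this is only illustrative)."""
    tokens = sentence.split()
    logp = sum(math.log((counts[t] + alpha) / (total + alpha * V)) for t in tokens)
    return math.exp(-logp / len(tokens))

fluent = perplexity("the food was great")
odd = perplexity("vampire flummox gathering york")  # unseen words, higher perplexity
```

The comparison also previews the Section 3.2 concern: rare content words inflate perplexity even when the sentence is perfectly natural.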

2 Problem 1: Style Transfer Tasks

Before diving into the problems of unsupervised auto-evaluation metrics, we first discuss the style transfer tasks in relevant research. The central idea is that we need to move forward from the current operational definition of style to the real-world, useful definition of style, explained below. This transition will create problems for existing style transfer metrics.

What are the practical use cases of style transfer?

Here are some possibilities.

  1. Writing assistance and dialogue (heidorn2000intelligent; ritter2011data). For example, it is helpful to have programs that transfer a formal sentence to an informal sentence (formality). It is helpful to have programs that make emails more polite (politeness).

  2. Author obfuscation and anonymity (authorship; gender) so that authors can stay relatively anonymous in, for example, heated political discussions.

  3. For artistic purposes: As an example, we may transfer a modern article to old literature styles.

  4. Adjusting reading difficulty in education (campbell1987adapted): Programs may be helpful in generating passages of the same content, but of different difficulty levels appropriate to different age groups.

  5. Data augmentation to fix dataset bias (anonymous2020learning): In sentiment classification using the IMDb movie review dataset (Maas:2011:LWV:2002472.2002491), for example, the appearance of the word “romantic” is highly correlated with positive sentiment, and the appearance of the word “horror” is highly correlated with negative sentiment. anonymous2020learning thus asked workers to write sentences (where words like “romantic” and “horror” stay unchanged) with flipped sentiment to reduce spurious correlations. This counterfactual data augmentation approach may also be used to address social bias issues in NLP such as gender, race, and nationality (zhao-etal-2017-men; kiritchenko-mohammad-2018-examining; ws-2019-gender). Style transfer is a good way to replace most or all of the expensive crowdsourcing procedure. This direction is in line with the NLP community’s current interest in bias and fairness.

What do the collected datasets (from the above use cases) look like?

The two datasets may have very different vocabularies, and it is hard to train a classifier to differentiate style-related words from content-related words. As elaborated in Section 3, certain words need to stay constant despite the fact that the two corpora have drastically different vocabularies. A quick example: in use case 5 above, words like “romantic” need to stay unchanged, even though “romantic” may rarely appear in the negative-style vocabulary.

Here is another example. In the task of transferring Dickens-style literature to modern-style literature while keeping the content (pang2018learning), or in similar literature-related tasks (kabbara2016stylistic; xu2017shakespeare), the former may contain words like “English farm” and “horses,” while the latter may contain words like “vampire” and “pop music.” However, these words should stay the same, as they are content-related, not style-related. On the other hand, Dickens’ literature may contain words like “devil-may-care” and “flummox” numerous times, but these words are style-related and should be changed. Compared to the Yelp sentiment datasets, it is very difficult to automatically differentiate content-related words from style-related words in the literature dataset. Similar situations may occur frequently in author obfuscation and other practical applications.

Current research focuses on an operational definition of style. These tasks, including Yelp sentiment transfer, do NOT represent real-world style transfer.

According to the previous paragraph, Yelp sentiment transfer is very idealized: we can use a simple classifier to identify which words are content-related and which are style-related, so changing a word can often change the style (sentiment, in this case) successfully. However, to make style transfer useful, we need to go beyond the Yelp sentiment task, on which most research focuses.

In fact, if we generalize the phenomenon, we would find that the current research mostly deals with an operational definition of style where the corpus-specific content words are changed. In the Dickens vs. modern literature example, if the sentence contains the word “Oliver,” then it is most likely Dickens style (according to the operational definition), because the word “Oliver” has appeared so many times in the novel Oliver Twist but the word may have rarely appeared in the modern literature corpus. However, this is not the practical or useful definition of style.

The vast majority of datasets and use cases are not as idealized as the Yelp dataset. We need to recognize the real-world definition of style (e.g., keeping “Oliver” as it is during style transfer), so that style transfer research can show promise of being integrated into real applications. This creates problems for the existing automatic evaluation metrics.

3 Problem 2: The Issue of Metrics

3.1 Content Similarity

In the task of author obfuscation or writing-style transfer, the idea of content similarity becomes rather complicated. In the task of literature style transfer, what are the style keywords? Take the example where the two non-parallel corpora are Dickens-written sentences and modern literature sentences. Consider the following sentence: “Oliver deemed the gathering in York a great success.” The expected transfer from the Dickens style to the modern literature style (if we train human annotators/specialists to produce it) should be similar to “Oliver thought the gathering was successful” (real-world style transfer). However, the most likely transfer (if we use a simple autoencoder framework directly) will be something like “Karl enjoyed the party in LA” (operational style transfer). Consider the following types of words:

  • Corpus-specific content proper nouns: Names may be different in the transferred sentences, as the names in the two corpora are different; similarly for locations, organizations, etc. To transfer correctly, a simple baseline could use an NER tagger: replace entities with their labels, transfer the sentence (with some words represented by labels), and then replace the labels with the original words. In short, these proper nouns need to stay consistent.

  • Other corpus-specific content words: “English farms” should be transferred to “English farms” instead of “baseball fields”; “horses” should be transferred to “horses” instead of “vampires.” In this case, the human-expected rules do not correspond with the machine-identified differences between two corpora. When evaluating, these words are not style keywords, and we should use semantic similarity to make sure that the words stay consistent.

  • Style words: “Deemed” and “gathering” may belong to the Dickens style. They should be changed.
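The NER-masking baseline from the first bullet can be sketched as follows; `toy_ner` and `toy_transfer` are hypothetical stand-ins for a real NER tagger and a trained transfer model, and the entity table is made up.

```python
# Sketch of the mask-transfer-restore baseline: replace proper nouns with
# typed placeholders, transfer, then restore the original words.

TOY_ENTITIES = {"oliver": "PERSON", "york": "LOCATION"}

def toy_ner(tokens):
    """Hypothetical NER tagger: returns an entity label or None per token."""
    return [TOY_ENTITIES.get(t.lower()) for t in tokens]

def toy_transfer(tokens):
    """Pretend transfer model: swaps a few Dickens-style words for modern ones."""
    swaps = {"deemed": "thought", "gathering": "party"}
    return [swaps.get(t, t) for t in tokens]

def transfer_with_ner_masking(sentence):
    tokens = sentence.split()
    saved, masked = {}, []
    for i, (tok, tag) in enumerate(zip(tokens, toy_ner(tokens))):
        if tag:
            placeholder = f"<{tag}_{i}>"
            saved[placeholder] = tok
            masked.append(placeholder)
        else:
            masked.append(tok)
    transferred = toy_transfer(masked)
    # Restore the original proper nouns after transfer.
    return " ".join(saved.get(t, t) for t in transferred)

out = transfer_with_ner_masking("Oliver deemed the gathering in York a success")
# "Oliver" and "York" survive the transfer unchanged.
```

The placeholders keep corpus-specific proper nouns out of the transfer model's reach, so they cannot be swapped for names from the other corpus.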

mir2019evaluating removed and masked the style keywords by using a classifier. In this case, all of the word types itemized above would be masked, and content similarity evaluation would fail.

We can address this problem by manually creating the list of style keywords, or by retrieving the style keywords by relying on outside knowledge. Another possibility is to keep the words as they are, without removing and masking the style keywords, as the style keywords are likely the minority.

3.2 Fluency and Style Accuracy

pang2018learning and mir2019evaluating both used perplexity. However, one issue is that low perplexity may simply reflect unnatural sentences composed of common words; we can penalize abnormally small perplexity as in Section 4.1. Moreover, fluency and style accuracy suffer from a problem similar to that in Section 3.1: perplexity will be large for sentences with the same content but a different style, if the content words appear only rarely in the target corpus. Accuracy has a similar problem.

Therefore, to address this problem, we can mask out corpus-specific content words, before pretraining the language model to evaluate fluency and before pretraining the classifier to evaluate accuracy.
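One hedged way to realize this masking is to treat words that are frequent in one corpus and absent from the other as corpus-specific; the sketch below does this before the evaluator would be trained (the threshold, placeholder token, and corpora are illustrative assumptions).

```python
from collections import Counter

def corpus_specific_words(corpus_a, corpus_b, min_count=2):
    """Words frequent in one corpus and absent from the other
    (an illustrative heuristic, not the only possible criterion)."""
    ca, cb = Counter(corpus_a), Counter(corpus_b)
    return ({w for w, c in ca.items() if c >= min_count and cb[w] == 0}
            | {w for w, c in cb.items() if c >= min_count and ca[w] == 0})

def mask(tokens, vocab, placeholder="<content>"):
    """Replace corpus-specific content words before training the LM/classifier."""
    return [placeholder if t in vocab else t for t in tokens]

dickens = "oliver saw the farm . oliver left the farm .".split()
modern = "the vampire heard pop music . the vampire left .".split()

specific = corpus_specific_words(dickens, modern)
masked = mask(dickens, specific)
# "oliver", "farm", and "vampire" get masked; shared words survive.
```

Training the fluency LM and the style classifier on such masked text keeps rare content words from being penalized as disfluent or misclassified as style markers.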

4 Problem 3: Trade-off and Aggregation of Scores

Once we have three numbers (style accuracy, content similarity, and fluency), how do practitioners decide which combination to select? According to pang2018learning and mir2019evaluating, style accuracy is inversely correlated with content similarity, fluency is inversely correlated with content similarity, and fluency is inversely correlated with style accuracy. How, then, do practitioners determine the degree of trade-off for model selection?

It is often useful to summarize multiple metrics into one number, for ease of tuning and model selection. One natural approach is aggregation. Suppose we use s, c, and f to represent style accuracy, content similarity, and fluency, respectively. Note that different papers may define s, c, and f in slightly different ways. cycle-reinforce simply took the geometric mean of two of these scores. However, this choice is arbitrary. Across style transfer models on different datasets, each of s, c, and f has a different range, minimum, and maximum. (For example, s may fluctuate between 0.4 and 0.6 for models on one dataset, but between 0.8 and 0.9 for models on another.) The geometric mean then breaks down: it is designed so that the same percentage change in any component has the same effect on the mean, but percentage change ceases to be meaningful in our case.
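A small numeric illustration of this range sensitivity (the score values are made up): the same absolute improvement in content similarity moves the raw geometric mean by different amounts depending on where the scores sit.

```python
import math

def geo_mean(*scores):
    """Geometric mean of positive scores."""
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Dataset 1: content similarity fluctuates around 0.5.
gain_d1 = geo_mean(0.8, 0.55) - geo_mean(0.8, 0.45)
# Dataset 2: content similarity fluctuates around 0.85.
gain_d2 = geo_mean(0.8, 0.90) - geo_mean(0.8, 0.80)
# The same +0.10 change in similarity has a different effect on the aggregate,
# so raw geometric-mean rankings are not comparable across datasets.
```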

4.1 Potential Solutions for Aggregation

If we still decide to design an aggregation method based on the geometric mean, one possible simple remedy, similar to pang2018learning, is to learn thresholds such that a given change in each score represents a similar percentage change across many datasets. For each sentence, we compute an aggregated score, say G(s, c, f; θ), where θ denotes the parameters to be learned as described later. Note that the metric is also designed to penalize abnormally small perplexity, as discussed previously.

One question arises: is a universal aggregator necessary or helpful (i.e., do we need parameters that work across many datasets)? Current research strives for universal metrics that work across datasets. If we also strive to do so, we proceed as follows.

If we need a universal evaluator that works across many datasets.

We can randomly sample a few hundred pairs of transferred sentences from a range of style transfer outputs (from different models, good ones and bad ones) on a range of style transfer tasks, and ask annotators which of the two transferred sentences is better. (For each annotation, annotators are given an original sentence and two model-transferred versions of it, and they judge which transferred sentence is better, taking all three evaluation aspects into account: style accuracy, content similarity, and fluency.)

We denote a pair of transferred sentences by (x_a, x_b), where x_a is preferred by the annotator. We train the parameters using the margin ranking loss max(0, m − G(x_a) + G(x_b)), where G is the aggregated score and m is a commonly used margin. (As an example, training on the Yelp and Dickens-vs-modern literature datasets only, we obtained parameter values following the metrics of pang2018learning; note that this is an extended abstract, so we do not conduct detailed evaluations.) To further improve the quality of the metric, we propose adding more pairs of transferred sentences from other style transfer tasks to train the parameters.
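A minimal sketch of this training procedure, using a linear aggregator over the three scores and subgradient descent on the margin ranking loss; the score triples, learning rate, and margin are made-up illustrations, not learned values from the paper.

```python
def G(theta, x):
    """Linear aggregation of (style accuracy, content similarity, fluency)."""
    return sum(t * xi for t, xi in zip(theta, x))

def train(pairs, lr=0.1, margin=0.1, epochs=200):
    """Fit aggregation weights from pairwise human preferences with the
    margin ranking loss max(0, margin - G(a) + G(b)), a = preferred output."""
    theta = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for a, b in pairs:
            if margin - G(theta, a) + G(theta, b) > 0:  # margin violated
                # Subgradient step: push G(a) up, G(b) down.
                theta = [t + lr * (ai - bi) for t, ai, bi in zip(theta, a, b)]
    return theta

# Each pair: (preferred output's scores, other output's scores).
pairs = [((0.9, 0.7, 0.8), (0.6, 0.8, 0.5)),
         ((0.8, 0.9, 0.7), (0.9, 0.4, 0.6))]
theta = train(pairs)
# After training, every human-preferred output outscores its alternative.
violations = sum(G(theta, a) <= G(theta, b) for a, b in pairs)
```

The same loop works for richer aggregators (e.g., a small neural network in place of the linear G) once enough annotated pairs are available.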

To make the metric even more convincing, we may design more complicated aggregation functions. We can also design the aggregator to be a very small neural network (with nonlinear activation), especially if we have many annotations. We can provide a set of possible functional forms, train parameters for each one individually, and select the best. We can estimate the quality of an aggregator by computing the percentage of machine preferences (“which transferred sentence in a pair is better” according to the aggregated scores) that match the human preferences (the same judgment according to human annotators).

If we do not need a universal evaluator.

Then we can repeat the above procedure by only sampling pairs of transferred sentences from the dataset of interest. We suggest this approach, as it will be more accurate for the particular task.

5 Conclusion

We discussed existing auto-evaluation metrics for style transfer with non-parallel corpora. We also emphasized that we need to move on from operational style transfer and pay more attention to real-world style transfer, so that style transfer systems can be put into practical applications. This shift will create problems for existing style transfer evaluation metrics. Finally, for ease of model selection and comparison, we discussed possible ways of aggregating the metrics. We hope this discussion will accelerate research on real-world style transfer.


The author would like to thank He He and Kevin Gimpel for helpful discussions.