Text Style Transfer: A Review and Experiment Evaluation

10/24/2020 ∙ by Zhiqiang Hu, et al. ∙ IBM ∙ Singapore University of Technology and Design

The stylistic properties of text have intrigued computational linguistics researchers in recent years. Specifically, researchers have investigated the Text Style Transfer (TST) task, which aims to change the stylistic properties of the text while retaining its style-independent content. Over the last few years, many novel TST algorithms have been developed, while the industry has leveraged these algorithms to enable exciting TST applications. The field of TST research has burgeoned because of this symbiosis. This article aims to provide a comprehensive review of recent research efforts on text style transfer. More concretely, we create a taxonomy to organize the TST models and provide a comprehensive summary of the state of the art. We review the existing evaluation methodologies for TST tasks and conduct a large-scale reproducibility study where we experimentally benchmark 19 state-of-the-art TST algorithms on two publicly available datasets. Finally, we expand on current trends and provide new perspectives on the new and exciting developments in the TST field.


1. Introduction

The stylistic properties of text have intrigued linguistic researchers for a long time. Enkvist (Enkvist, 2016) opined that text style is a “concept that is as common as it is elusive” and suggested that style may be described as a linguistic variation while preserving the conceptual content of the text. To give a practical example, the formality of text will vary across settings for similar content; examples include a conversation with friends such as “let’s hang out on Sunday afternoon!”, or a professional email such as “We will arrange a meeting on Sunday afternoon.”

In recent years, the study of text style has attracted the attention not only of linguists but also of many computer science researchers. Specifically, computer science researchers are investigating the Text Style Transfer (TST) task, which is an increasingly popular branch of natural language generation (NLG) (Gatt and Krahmer, 2018) that aims to change the stylistic properties of the text while retaining its style-independent content. Earlier TST studies mainly attempted to perform TST with a parallel corpus (Xu et al., 2012; Jhamtani et al., 2017; Carlson et al., 2018; Shang et al., 2019; Wang et al., 2019b; Jin et al., 2019; Nikolov and Hahnloser, 2018; Liao et al., 2018; Xu et al., 2019b). For instance, Xu et al. (Xu et al., 2012) presented one of the first works to apply a phrase-based machine translation (PBMT) model to TST. They generated a parallel corpus of 30K sentence pairs by scraping the modern translations of Shakespearean plays and trained a PBMT system to translate from modern English to Shakespearean English. However, parallel data are scarce in many real-world TST applications, such as dialogue generation in different styles. The scarcity of parallel data motivated a new breed of TST algorithms that attempt to transfer text style without parallel data (Li et al., 2018; Xu et al., 2018; Zhang et al., 2018b; Sudhakar et al., 2019; Wu et al., 2019a; Shen et al., 2017; Zhao et al., 2018a; Fu et al., 2018; Chen et al., 2018; Logeswaran et al., 2018; Zhao et al., 2018b; Lai et al., 2019; John et al., 2019; Park et al., 2019; Yin et al., 2019; Yang et al., 2018; Hu et al., 2017; Tian et al., 2018; Lample et al., 2019; Dai et al., 2019; Zhang et al., 2018a; Jain et al., 2019; Mueller et al., 2017; Xu et al., 2019a; Wang et al., 2019a; Liu et al., 2020; Luo et al., 2019; Gong et al., 2019; He et al., 2020).

The goal of this survey is to review the literature on the advances in TST thoroughly and to provide experimental comparisons of various algorithms. It provides a panorama through which readers can quickly understand and step into the field of TST. It is noteworthy that the literature in the field is rather disparate, and a unified comparison is critical to aid in understanding the strengths and weaknesses of various methods. This survey lays the foundations of future innovations in the area of TST and taps into the richness of this research area. To summarize, the key contributions of this survey are three-fold: (i) We investigate, classify, and summarize recent advances in the field of TST. (ii) We present several evaluation methods and experimentally compare different TST algorithms. (iii) We discuss the challenges in this field and propose possible directions for addressing them in future works.

The organization of this paper is as follows. We start our discussion on the related research areas that inspire the commonly used TST techniques in Section 2. Next, we explore and demonstrate some of the commercial applications of TST in Section 3. In Section 4, we categorize and explain the existing TST algorithms. The methodologies for evaluating TST algorithms are presented in Section 5. In Section 6, we present experiments on publicly available datasets to benchmark the existing TST algorithms. In Section 7, we outline the open issues in TST research and offer possible future TST research directions. Finally, we conclude our survey paper in Section 8.

2. Related Research Areas

Text style transfer (TST) is a relatively new research area. Many of the earlier TST works are heavily influenced by two related research areas: neural style transfer, i.e., transferring styles in images (Jing et al., 2019; Gatys et al., 2015), and neural machine translation (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015). We found that a substantial number of TST techniques were adapted from the common methods used in neural style transfer and neural machine translation. In addition, some of the evaluation metrics used in TST are also “inherited” from the neural machine translation task. In this section, we will introduce the two related research areas and highlight some of the common techniques and evaluation metrics that are transferred or adapted for the TST task.

2.1. Neural Style Transfer

Gatys et al. (Gatys et al., 2015) first explored the use of a convolutional neural network (CNN) to extract content and style features from images separately. Their experimental results demonstrated that a CNN is capable of extracting content information from an arbitrary photograph and style information from a well-known artwork. Based on this finding, Gatys et al. (Gatys et al., 2015) further experimented with the idea of exploiting CNN feature activations to recombine the content of a given photo with the style of a famous artwork. The underlying idea of their proposed algorithm is to minimize the loss between the synthesized image’s CNN latent representation and the desired CNN feature distribution, which is the combination of the photo’s content feature representation and the artwork’s style feature representation. Interestingly, the algorithm also places no explicit restriction on the type of style images and does not require ground truth for training. The seminal work of Gatys et al. opened the new field of Neural Style Transfer (NST), which is the process of using neural networks to render content images in different styles (Jing et al., 2019).

The burgeoning research in the emerging field of NST has attracted wide attention from both academia and industry. In particular, natural language processing (NLP) researchers are motivated to adopt similar strategies to implicitly disentangle the content and style features in text and transfer the learned style features onto other textual content (Prabhumoye et al., 2018; Zhang et al., 2018c; Shen et al., 2017; Zhao et al., 2018a; Fu et al., 2018; Chen et al., 2018; Logeswaran et al., 2018; Yin et al., 2019; Zhao et al., 2018b; Lai et al., 2019; John et al., 2019; Park et al., 2019; Yang et al., 2018; Hu et al., 2017; Tian et al., 2018). For example, Fu et al. (Fu et al., 2018) proposed two TST models, which adopted an adversarial learning approach to implicitly disentangle the content and style in text. The first method used multiple decoders, one for each type of style, to generate text of different styles from a common content embedding. In the second approach, a style embedding is learned and concatenated with the content embedding, and a single decoder is used to generate output in different styles.

While the goal in this line of TST works is similar to the objective of NST, disentangling content and style in text has proven to be much harder than in the case of images (Lample et al., 2019). Firstly, the styles in images are distinctive; it is easier to visualize and differentiate the styles of two images in terms of patterns that can be modeled easily by a neural network. However, text styles are somewhat more subtle, which makes it challenging to differentiate and define the styles of two given pieces of text. Secondly, unlike an image’s content and style, which are easily separated in the different CNN layers, the content and style in text are tightly coupled and not easily separated even with style labels. Hence, some of the recent TST works have proposed a new direction: performing text style transfer without disentanglement of the text’s content and style (Lample et al., 2019; Dai et al., 2019; Zhang et al., 2018a; Jain et al., 2019; Mueller et al., 2017; Xu et al., 2019a; Wang et al., 2019a; Liu et al., 2020; Luo et al., 2019; Gong et al., 2019; He et al., 2020).

2.2. Neural Machine Translation

Neural machine translation (NMT), which is a deep learning-based approach to machine translation, is a well-studied research area (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015). Unlike traditional statistical machine translation techniques (Brown et al., 1990; Koehn et al., 2007), NMT can perform end-to-end training of a machine translation model without the need to deal with word alignments, translation rules, and complicated decoding algorithms. Both TST and NMT are branches of natural language generation (NLG) (Gatt and Krahmer, 2018). Naturally, the two research areas share a few similarities. Firstly, the two tasks are quite similar: NMT aims to change the language of a text while preserving its content, while TST aims to modify the stylistic properties of a text while preserving its content. Secondly, most TST models have “borrowed” the most commonly used NMT technique: the sequence-to-sequence encoder-decoder model (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015). Other TST studies have also adopted the back-translation technique originally proposed for NMT (Sennrich et al., 2016) to transfer styles in text (Prabhumoye et al., 2018; dos Santos et al., 2018). For instance, Prabhumoye et al. (Prabhumoye et al., 2018) used a back-translation model to extract the content features in text and subsequently generate text in different styles using the extracted content features and multiple decoders. Thirdly, TST works have inherited some of the automatic quantitative evaluation metrics that were originally proposed to evaluate the NMT task. For example, the bilingual evaluation understudy (BLEU) metric (Papineni et al., 2002), which is used to evaluate the quality of machine-translated text by computing the modified n-gram precision between the generated and reference text, is also widely used in the TST task.

Despite the close similarities between TST and NMT, it is also worth noting their subtle differences. In NMT, the languages are definitive, and most NMT models only evaluate if the content of the text is preserved during translation while assuming that the language itself is correctly translated. However, in TST, the text style is abstract, and most TST models need to evaluate both whether the text’s content is preserved and whether the style is effectively modified. These differences motivated TST researchers to explore other techniques such as controllable generation (Hu et al., 2017; Tian et al., 2018; Lample et al., 2019; Dai et al., 2019; Zhang et al., 2018a; Jain et al., 2019) to ensure that the style of a text is modified during TST. The need to evaluate if the style is effectively transferred has also motivated new evaluation metrics for TST models (Fu et al., 2018; Mir et al., 2019; Pang and Gimpel, 2019).

3. Application

The research on TST algorithms has many industrial applications and could lead to many commercial benefits. In this section, we summarize these applications and present some potential usages.

3.1. Writing Tools

One of the industrial applications of TST algorithms is the design of writing tools. Computer-aided writing tools have been widely researched by academics across various domains, and the industry has developed many writing tool applications (Klahold and Fathi, 2020b, a; Silva et al., 2019; Snyder, 1993; Parra and others, 2019; MacArthur, 2009). TST methods can be applied as useful new features in existing writing tool applications.

The utility of writing style has been widely studied by linguistic and literacy education scholars (Can and Patton, 2004; Young, 2002; Halliday, 1981; Johnstone, 2009; Ashok et al., 2013; Ivanič, 2004). TST algorithms enable writing tool applications to apply the insights from existing linguistics studies to improve users’ writing. For instance, TST algorithms enable writing tool users to switch between writing styles for different audiences while preserving the content of their writing. The style evaluation methods developed to evaluate TST algorithms can also be applied to analyze the writing style of users (Parra and others, 2019). For instance, the writing tool could detect that the style of a user’s business email draft is too informal and recommend modifications to make the writing style more formal. Cao et al. (Cao et al., 2020) developed an interesting real-world TST application, where text is transferred between expert and layman styles. The underlying intuition is that expertise style transfer aims to improve the readability of a text by reducing its expertise level, for example, by explaining complex terminology with simple phrases. Conversely, it also aims to raise the expertise level based on context, so that laymen’s expressions can be more accurate and professional.

3.2. Persuasion and Marketing

Studies have explored utilizing persuasive text to influence the attitudes or behaviors of people (Kaptein et al., 2012; Chambliss and Garner, 1996), and the insights gained from these studies have also been applied to improve marketing and advertising in the industry. The style of a text has an impact on its persuasiveness (Darani, 2014; Muehlenhaus, 2012; Johnstone, 1989), and TST algorithms can be used to convert a text into a more persuasive style. Recent studies have also explored personalizing persuasive strategies according to the user’s profile (Kaptein et al., 2015). Similarly, TST algorithms could be used to structure text in the persuasive styles that best appeal to particular user profiles. For instance, TST algorithms can be applied to modify a marketing message into an authoritative style for users who are receptive to authority. Jin et al. (Jin et al., 2020) proposed a compelling use case that utilizes TST methods to make news headlines more attractive. Specifically, a TST algorithm is used to transfer news headlines among humorous, romantic, and clickbaity styles.

3.3. Chatbot Dialogue

The research and development of chatbots, i.e., intelligent dialogue systems that are able to engage in conversations with humans, has been one of the longest-running goals in artificial intelligence (Abdul-Kader and Woods, 2015; Xu et al., 2017; Zhou et al., 2020b). Kim et al. (Kim et al., 2019) conducted a study on the impact of a chatbot’s conversational style on users and found that a chatbot using a casual conversational style was less likely to persuade users to perform an action than one using a formal conversational style. The encouraging results from the study suggest that users may be influenced by a chatbot’s conversational style, and TST algorithms could be exploited to enhance chatbots’ flexibility in conversational styles.

TST algorithms can be applied to equip chatbots with the ability to switch between conversational styles, and this makes the chatbots more appealing and engaging to the users. For instance, a chatbot recommending products to customers may adopt a more persuasive conversational style while the same chatbot may switch to a formal conversational style when addressing the customer’s complaint.

4. A Taxonomy of Text Style Transfer Methods

In this section, we first propose a taxonomy to organize the most notable and promising advances in TST research in recent years. Subsequently, we discuss each category of TST models in greater detail.

4.1. Categories of Text Style Transfer Models

To provide a bird’s-eye view of this field, we classify the existing TST models based on the types of (1) data setting, (2) strategy, and (3) technique used. Fig. 1 summarizes the taxonomy for text style transfer.

Figure 1. Text Style Transfer Taxonomy

4.1.1. Types of Data Setting

We broadly classify the existing TST studies into two categories based on the types of data settings used for model training.

  • Parallel Data. In this data setting, the TST models are trained with known pairs of text in different styles. Commonly, NMT methods such as sequence-to-sequence (Seq2Seq) models (Xu et al., 2012; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) are applied to transfer the style of text. For example, Jhamtani et al. (Jhamtani et al., 2017) trained a Seq2Seq model with a pointer network on a parallel corpus and applied the model to translate modern English phrases to Shakespearean English. Details of the techniques applied to parallel datasets will be discussed in Section 4.2.

  • Non-Parallel Data. TST models in the non-parallel data setting aim to transfer the style of text without any knowledge of matching text pairs in different styles. Most of the existing TST studies fall into this category as parallel datasets are scarce in many real-world TST applications.

4.1.2. Strategy

In order to perform TST in a non-parallel data setting, existing studies have proposed to disentangle the style and content in text, a strategy commonly used in NST (Gatys et al., 2015). On the whole, there are three types of strategies adopted by existing TST studies:

  • Explicit Style-Content Disentanglement. In this strategy, the TST models adopt an explicit text replacement approach to generate text of a target style. For instance, Li et al. (Li et al., 2018) first explicitly identify the parts of the text that are associated with the original style and then replace them with new phrases associated with the target style. The text with the replaced phrases is then input into a Seq2Seq model to generate a fluent text in the target style. Details of the techniques applied to explicitly disentangle content and style will be discussed in Section 4.3.

  • Implicit Style-Content Disentanglement. To disentangle style and content in text implicitly, TST models aim first to learn the content and style latent representations of a given text. Subsequently, the original text’s content latent representation is combined with the latent representation of the target style to generate a new text in the target style. Multiple techniques such as back-translation, adversarial learning, and controllable generation have been proposed to disentangle the content and style latent representations.

  • Without Style-Content Disentanglement. Recent studies have suggested that it is difficult to judge the quality of text style and content disentanglement and that disentanglement is also unnecessary for TST (Lample et al., 2019). Therefore, newer TST studies have explored performing TST without disentangling the text’s style and content. Techniques such as adversarial learning, controllable generation, reinforcement learning, probabilistic modeling, and pseudo-parallel corpus construction have been applied to perform TST without disentanglement of the text’s content and style.

4.1.3. Techniques

Table 1 lists the types of techniques that are commonly used to perform TST. We organize them following the previously mentioned taxonomy and review them in detail in subsequent sections. We also summarize the list of relevant publications that proposed variant models using these techniques.

Data Setting | Strategy (Content-Style Disentanglement) | Technique | Publication
Parallel | – | Sequence-to-Sequence | (Jhamtani et al., 2017; Carlson et al., 2018; Shang et al., 2019; Wang et al., 2019b; Jin et al., 2019; Nikolov and Hahnloser, 2018; Liao et al., 2018; Xu et al., 2019b; Zhang et al., 2020)
Non-Parallel | Explicit | Explicit Style Keyword Replacement | (Li et al., 2018; Xu et al., 2018; Zhang et al., 2018b; Sudhakar et al., 2019; Wu et al., 2019a)
Non-Parallel | Implicit | Back-Translation | (Prabhumoye et al., 2018; Zhang et al., 2018c)
Non-Parallel | Implicit | Adversarial Learning | (Shen et al., 2017; Zhao et al., 2018a; Fu et al., 2018; Chen et al., 2018; Logeswaran et al., 2018; Yin et al., 2019; Zhao et al., 2018b; Lai et al., 2019; John et al., 2019; Park et al., 2019; Yang et al., 2018)
Non-Parallel | Implicit | Attribute Control Generation | (Hu et al., 2017; Tian et al., 2018)
Non-Parallel | Without | Attribute Control Generation | (Lample et al., 2019; Dai et al., 2019; Zhang et al., 2018a; Jain et al., 2019; Zhou et al., 2020a)
Non-Parallel | Without | Entangled Latent Representation Edition | (Mueller et al., 2017; Xu et al., 2019a; Wang et al., 2019a; Liu et al., 2020)
Non-Parallel | Without | Reinforcement Learning | (Luo et al., 2019; Gong et al., 2019)
Non-Parallel | Without | Probabilistic Modeling | (He et al., 2020)
Table 1. Publications Based on Different Text Style Transfer Techniques

4.2. Sequence-to-Sequence Model with Parallel Data

The Sequence-to-Sequence (Seq2Seq) model (Xu et al., 2012; Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015) is core to many natural language generation tasks (Gatt and Krahmer, 2018), and TST is no exception. Generally, a Seq2Seq model is trained on a parallel corpus, where the text of the original style is input into an encoder, and the decoder outputs the corresponding text of the target style. Variants of this general approach have been proposed in TST models trained on parallel datasets. Jhamtani et al. (Jhamtani et al., 2017) extended the work of Xu et al. (Xu et al., 2012) by adding a pointer network (Vinyals et al., 2015) to the Seq2Seq model to selectively copy word tokens directly from the input text when generating the text in a target style. Carlson et al. (Carlson et al., 2018) added an attention mechanism (Vaswani et al., 2017) to the Seq2Seq model to evaluate their proposed parallel Bible prose-style corpus. Other studies have also attempted to use parallel datasets to train a Seq2Seq model in a semi-supervised fashion (Shang et al., 2019), as well as to fine-tune pre-trained models to perform TST (Wang et al., 2019b).
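To make the general setup concrete, below is a minimal sketch of Seq2Seq training on a parallel corpus; the GRU architecture, vocabulary size, and random toy batch are illustrative assumptions, not the configuration of any cited work.

```python
# Minimal GRU-based Seq2Seq sketch for parallel-data TST (illustrative only).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src, tgt):
        # src: source-style token ids; tgt: target-style token ids (teacher forcing)
        _, h = self.encoder(self.embed(src))       # h summarizes the source sentence
        dec_out, _ = self.decoder(self.embed(tgt), h)
        return self.out(dec_out)                   # logits over the vocabulary

model = Seq2Seq(vocab_size=10000)
src = torch.randint(0, 10000, (8, 20))             # toy batch of source-style sentences
tgt = torch.randint(0, 10000, (8, 22))             # aligned target-style sentences
logits = model(src, tgt[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000), tgt[:, 1:].reshape(-1))
loss.backward()                                    # standard MLE training on parallel pairs
```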

Another interesting approach is to generate pseudo-parallel datasets and apply Seq2Seq models to perform TST. Jin et al. (Jin et al., 2019) first constructed a pseudo-parallel corpus by matching sentences in a source-style corpus with sentences in a target-style corpus using cosine similarity. Subsequently, a Seq2Seq model is trained on the constructed pseudo-parallel corpus to perform TST. Nikolov and Hahnloser (Nikolov and Hahnloser, 2018) improved pseudo-parallel corpus generation with a hierarchical method that computes similarity scores at the document and sentence levels to find parallel text pairs across corpora of different styles. Liao et al. (Liao et al., 2018) first generated a pseudo-parallel dataset and then applied a dual-encoder Seq2Seq framework to disentangle the content from the style for text style transfer. Zhang et al. (Zhang et al., 2020) proposed and experimented with several pseudo-parallel dataset generation methods. Specifically, the researchers explored simultaneous training, pre-training, and fine-tuning data augmentation methods to generate pseudo-parallel data for TST tasks.
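A hedged sketch of this sentence-matching step is shown below; the TF-IDF sentence representation, the toy corpora, and the 0.5 similarity threshold are assumptions for illustration, not the settings used by Jin et al. (Jin et al., 2019).

```python
# Construct pseudo-parallel pairs by cosine similarity over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_corpus = ["the food was awful and cold", "service was painfully slow"]
target_corpus = ["the food was wonderful and warm", "service was quick and friendly"]

vec = TfidfVectorizer().fit(source_corpus + target_corpus)
S, T = vec.transform(source_corpus), vec.transform(target_corpus)

sims = cosine_similarity(S, T)              # similarity of every source/target pair
pseudo_pairs = []
for i, row in enumerate(sims):
    j = row.argmax()                        # best-matching target-style sentence
    if row[j] > 0.5:                        # keep only confident matches (assumed threshold)
        pseudo_pairs.append((source_corpus[i], target_corpus[j]))

print(pseudo_pairs)                         # training pairs for a standard Seq2Seq model
```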

A major drawback of conventional Seq2Seq models is that training requires large parallel corpora, which are scarce in the TST domain. Xu et al. (Xu et al., 2019b) recognized this limitation and proposed a joint training approach that combines the information gained from a Seq2Seq model trained on a parallel corpus with the class-labeled annotations learned from training a classifier to predict the text’s style. However, most TST studies have moved on and experimented with performing TST without parallel datasets.

4.3. Explicit Style Keyword Replacement

A common approach to explicitly disentangle content and style in text is to replace keywords that are attributed to a certain style. Li et al. (Li et al., 2018) first proposed the Delete-Retrieve-Generate framework, which became the signature approach utilizing explicit style-attributed word replacement for TST. Fig. 2 shows an overview of the framework. In the proposed framework, word frequency statistics are first used to identify and delete style-attributed words such as “good, bad” from the original text. Next, a text that is most similar to the original text is retrieved from the corpus of the target style. The style-attributed words from the retrieved text are combined with the content words from the original text to generate the output text, either in a rule-based fashion or with a neural sequence-to-sequence model. Zhang et al. (Zhang et al., 2018b) adopted a similar explicit style keyword replacement framework to perform sentiment transfer in text. Sudhakar et al. (Sudhakar et al., 2019) extended the Delete-Retrieve-Generate framework and improved the delete step by exploiting a Transformer (Vaswani et al., 2017) to identify style-attributed keywords. Wu et al. (Wu et al., 2019b) proposed the two-step “Mask and Infill” approach. In the mask step, style-attributed words are masked. In the infill step, a pre-trained masked language model is used to infill the masked positions by predicting words or phrases attributed to the target style. Leeftink and Spanakis (Leeftink and Spanakis, 2019) applied a classifier with an attention mechanism to highlight keywords and phrases that determine the style of a sentence. These identified phrases are then changed to phrases of the target style using a Seq2Seq approach. Recent works have also combined explicit style keyword replacement with cycled reinforcement learning to iteratively replace style-attributed keywords while maintaining the content in text (Xu et al., 2018; Wu et al., 2019a).

Figure 2. An overview of the Delete-Retrieve-Generate framework proposed by Li et al. (Li et al., 2018).
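To make the delete step concrete, the following is a small sketch of word-frequency salience in the spirit of the Delete-Retrieve-Generate framework; the toy corpora, the smoothing constant, and the salience threshold are illustrative assumptions.

```python
# Identify and delete style-attributed words by their frequency ratio across
# the two style corpora (illustrative sketch of the "delete" step).
from collections import Counter

positive = ["the food is good and delicious", "good service and great coffee"]
negative = ["the food is bad and stale", "bad service and terrible coffee"]

pos_counts = Counter(w for s in positive for w in s.split())
neg_counts = Counter(w for s in negative for w in s.split())

def salience(word, lam=0.5):
    # ratio of smoothed counts across the two style corpora
    return (pos_counts[word] + lam) / (neg_counts[word] + lam)

def delete_style_words(sentence, threshold=2.0):
    # keep words whose frequency is not strongly tied to either style
    return [w for w in sentence.split()
            if not (salience(w) > threshold or salience(w) < 1 / threshold)]

print(delete_style_words("the food is good and delicious"))
# -> ['the', 'food', 'is', 'and']  (content words only)
```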

4.4. Adversarial Learning

A common technique used to perform implicit content-style disentanglement is adversarial learning. Shen et al. (Shen et al., 2017) leveraged an adversarial training scheme where a classifier is used to evaluate if an encoder is able to generate a latent representation devoid of style. Inspired by this earlier TST study, Zhao et al. (Zhao et al., 2018a) proposed a generic natural language generation technique named Adversarially Regularized Autoencoders (ARAE). ARAE uses a similar adversarial learning technique to remove specific attributes from the encoder’s latent output and then uses the manipulated latent output to induce changes in the natural language output of the decoder. Among the earlier adversarial learning TST works, Fu et al. (Fu et al., 2018) proposed a framework that is definitive of works that utilize adversarial learning to implicitly disentangle content and style in text. Fig. 3 illustrates the two models included in their proposed framework.

Figure 3. Two common adversarial learning based TST models: multi-decoder (left) and style-embedding (right) proposed by Fu et al. (Fu et al., 2018). The content representation c is the output of the encoder. The style classifier aims at distinguishing the style of the input. An adversarial network is used to make sure content c does not contain style representation. In style-embedding, the content representation c and style embedding s are concatenated and fed into the decoder.

An encoder is trained to generate an intermediate latent representation z of an input text sequence x. An adversarial network is used to separate the content representation c from the style. The adversarial network is composed of two parts. The first part aims at classifying the style of input x given the representation learned by the encoder. The loss function minimizes the negative log probability of the style labels in the training data:

$$ \mathcal{L}_{adv1}(\theta_c) = -\sum_{i=1}^{N} \log p\left(y_i \mid \mathrm{Encoder}(x_i; \theta_e); \theta_c\right) \tag{1} $$

where $\theta_c$ and $\theta_e$ are the parameters of the classifier and encoder, respectively, $N$ denotes the size of the training data, and $y_i$ refers to the style label. The second part of the adversarial network aims at making the classifier unable to identify the style of input x by maximizing the entropy (i.e., minimizing the negative entropy) of the predicted style labels:

$$ \mathcal{L}_{adv2}(\theta_e) = \sum_{i=1}^{N} \sum_{j=1}^{K} p_j\left(\mathrm{Encoder}(x_i; \theta_e); \theta_c\right) \log p_j\left(\mathrm{Encoder}(x_i; \theta_e); \theta_c\right) \tag{2} $$

where $K$ is the number of styles and $p_j(\cdot)$ is the classifier’s predicted probability of style $j$. Note that although the two parts of the adversarial network update different sets of parameters, they work together to ensure that the output of the encoder does not contain style information.
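A minimal sketch of the two objectives in Eq. (1) and Eq. (2) is given below; the linear encoder and classifier, the random feature batch, and the alternating-update shortcut are illustrative assumptions rather than Fu et al.’s implementation.

```python
# Sketch of the two adversarial objectives: the classifier learns to predict
# style from the content code; the encoder learns to make that prediction
# maximally uncertain.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 2                                      # number of styles
encoder = nn.Linear(64, 32)                # stand-in for a sentence encoder
classifier = nn.Linear(32, K)              # adversarial style classifier

x = torch.randn(16, 64)                    # batch of "sentence" features
y = torch.randint(0, K, (16,))             # style labels

# Eq. (1): the classifier minimizes the negative log-likelihood of the style
# labels given the encoder output (encoder is frozen here via detach).
c = encoder(x)
cls_loss = F.cross_entropy(classifier(c.detach()), y)
cls_loss.backward()                        # gradients reach the classifier only

# Eq. (2): the encoder minimizes the negative entropy of the predicted style
# distribution, pushing the classifier toward maximal uncertainty.
p = F.softmax(classifier(encoder(x)), dim=-1)
neg_entropy = (p * torch.log(p + 1e-9)).sum(dim=-1).mean()
neg_entropy.backward()                     # in practice, step only the encoder here
```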

Figure 4. Reconstruction loss and cycle-consistency loss used in text style transfer.

Once the encoder is trained to produce the content representation, two methods are used to generate the text in the target style. The first method involves training multiple decoders (shown in Fig. 3, left) to take in the content representation and generate outputs in different styles. The second method involves training a style embedding (shown in Fig. 3, right) and concatenating it to the content representation so that a decoder can generate output in the target style.

There are, however, limitations to the above adversarial learning TST framework. The style label alone may not be able to guide the generation of fluent sentences in the target style, and some content information may still be lost during the latent representation generation process. Improved variants of the adversarial learning TST framework have been proposed to address these limitations. For instance, reconstruction loss and cycle-consistency loss were introduced to improve content preservation during TST (Chen et al., 2018; Logeswaran et al., 2018; Yin et al., 2019; Zhao et al., 2018b; Lai et al., 2019; John et al., 2019). A reconstruction loss enforces that the decoder, which takes the content representation and original style embedding as input, can reconstruct the input sentence x itself. We also require the sentence transferred by the decoder to preserve the content of its original sentence; thus, the model should have the capability to recover the original sentence in a cyclic manner. Therefore, the transferred-style sentence is input into the TST model to transfer the sentence back to its original style, and a cycle-consistency loss is used to enforce that the generated sentence in the original style is similar to the input sentence.
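The two losses can be summarized with the following schematic pseudocode, where encode, decode, and the distance function dist are placeholders rather than a specific paper’s API:

```python
# Schematic of the reconstruction and cycle-consistency objectives.
def reconstruction_loss(x, s_orig, encode, decode, dist):
    c = encode(x)                          # content representation of x
    return dist(decode(c, s_orig), x)      # decoder must reproduce x itself

def cycle_consistency_loss(x, s_orig, s_tgt, encode, decode, dist):
    y = decode(encode(x), s_tgt)           # transfer x to the target style
    x_back = decode(encode(y), s_orig)     # transfer the result back again
    return dist(x_back, x)                 # the round trip should recover x
```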

Other types of losses were also added to improve the adversarial learning TST models. Chen et al. (Chen et al., 2018) proposed FM-GAN, which enhances the cycle-consistency loss by minimizing the feature-mover’s distance between the latent representations of the input and the generated sentence in the original style. John et al. (John et al., 2019) added a style-oriented loss to ensure the style information is contained in the style embedding s, i.e., a loss which ensures that s is discriminative for the style. Zhao et al. (Zhao et al., 2018b) adopted a different approach and added a style discrepancy loss to ensure that the style representation accurately encodes the style information by maximizing the discrepancy between the input and target sentences’ styles. Park et al. (Park et al., 2019) proposed relational losses to distinguish the semantic, syntactic, and lexical features between the input and the generated transferred-style sentences.

Besides adding loss functions, other studies have proposed auxiliary components to enhance the adversarial learning process. Yin et al. (Yin et al., 2019) presented two partial comparators to guide adversarial learning: a content comparator that judges whether the input sentence and the generated sentence share the same content, and a style comparator that judges whether they have different styles. Lai et al. (Lai et al., 2019) combined the adversarial learning framework with a word-level conditional mechanism to preserve content information by retaining style-unrelated words while modifying the style-related words. Yang et al. (Yang et al., 2018) replaced the binary classifiers with a target-domain language model as the discriminator to provide richer and more stable token-level feedback during the adversarial learning process.

4.5. Back-Translation

Another technique that has been explored to disentangle content and style in text is back-translation. Fig. 5 shows the back-translation framework adopted by Prabhumoye et al. (Prabhumoye et al., 2018) to perform TST. In this approach, Prabhumoye et al. attempted to use NMT models to rephrase a sentence and remove its stylistic properties. In the proposed model, an English text is first translated into French using an NMT model, and the translated French text is subsequently translated back to English using another NMT model. The latent representation learned by the NMT model is assumed to contain only content information, devoid of any stylistic properties. Finally, the latent representation is used to generate text in a different style using the multi-decoder approach. Zhang et al. (Zhang et al., 2018c) adopted an iterative back-translation pipeline to perform TST. The pipeline first learns a cross-domain word embedding in order to build an initial phrase table. The phrase table is then used to bootstrap an iterative back-translation model, which jointly trains two NMT systems to transfer text style.

Figure 5. Back Translation framework for Text Style Transfer proposed by Prabhumoye et al. (Prabhumoye et al., 2018)

4.6. Attribute Control Generation

Existing TST works have also explored learning a style attribute to control text generation in different styles. Fig. 6 shows the proposed attribute controlled generation model by Hu et al. (Hu et al., 2017).

Figure 6. Attribute controlled generation proposed by Hu et al. (Hu et al., 2017)

Unlike an autoencoder, which learns a compressed representation of the input data, the Variational Autoencoder (VAE) (Kingma and Welling, 2013) learns the parameters of a probability distribution representing the data. The learned distribution can also be sampled to generate new data samples. The generative nature of the VAE has therefore made it widely explored and utilized in many natural language generation tasks (Gatt and Krahmer, 2018). Hu et al. (Hu et al., 2017) proposed a TST model that utilizes a VAE to learn a sentence’s latent representation z and leverages a style classifier to learn a style attribute vector c. The probabilistic encoder of the VAE also functions as an additional discriminator to capture variations of implicitly modeled aspects and to guide the generator to avoid entanglement during attribute code manipulation. Finally, z and c are input into a decoder to generate a sentence in the specified style. Specifically, the VAE loss function is as follows:

$$ \mathcal{L}_{VAE}(\theta_G, \theta_E; x) = -\mathbb{E}_{q_E(z|x)\, q_D(c|x)}\left[ \log p_G(x \mid z, c) \right] + D_{KL}\left( q_E(z|x) \,\|\, p(z) \right) \tag{3} $$

where $D_{KL}$ is the KL-divergence, and $\theta_G$ and $\theta_E$ denote the parameters of the decoder and encoder, respectively. $q_E(z|x)$ is the conditional probabilistic encoder that infers the latent representation z given input sentence x, and $q_D(c|x)$ is the conditional distribution defined by the classifier for each structured variable in c. To ensure that z retains the style-independent content information, an independency constraint is proposed to keep the latent representations of the input sentence and the transferred-style sentence close. Tian et al. (Tian et al., 2018) further extended this approach by adding more constraints to preserve the style-independent content, using POS information preservation and a content-conditional language model.
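Below is a hedged sketch of the objective in Eq. (3); the Gaussian linear encoder, the mean-squared-error stand-in for the reconstruction term, and all dimensions are assumptions made for illustration, not Hu et al.’s implementation.

```python
# Sketch of an attribute-controlled VAE objective: reconstruction + KL terms.
import torch
import torch.nn as nn
import torch.nn.functional as F

enc = nn.Linear(64, 2 * 16)                # predicts mean and log-variance of q_E(z|x)
dec = nn.Linear(16 + 1, 64)                # reconstructs x from [z; c]

x = torch.randn(8, 64)                     # batch of sentence features
c = torch.randint(0, 2, (8, 1)).float()    # style attribute code from the classifier

mu, logvar = enc(x).chunk(2, dim=-1)
z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick

recon = dec(torch.cat([z, c], dim=-1))
recon_loss = F.mse_loss(recon, x)          # stands in for -log p_G(x|z,c)
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()

loss = recon_loss + kl                     # Eq. (3): reconstruction + KL terms
loss.backward()
```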

Lample et al. (Lample et al., 2019) proposed an attribute-controlled text generation approach without style and content disentanglement. They argued that it is difficult to judge if the style is indeed disentangled from the content representation, and it is also unnecessary to perform style and content disentanglement for TST. Fig. 7 illustrates the proposed model by Lample et al. (Lample et al., 2019).

Figure 7. Attribute controlled generation with back translation proposed by Lample et al. (Lample et al., 2019)

The model employs Denoising Autoencoders (Vincent et al., 2010) and back-translation (Sennrich et al., 2016) to build a translation system between different styles. The intuition for this model is that the noise function may corrupt words in the input sentence x that convey its original style. The corrupted input sentence is then fed into an encoder to generate the latent representation z, which is subsequently combined with a trainable target style attribute and input into a decoder to generate a sentence in the target style. Finally, a back-translation process feeds the generated sentence into the same encoder-decoder process to reconstruct the original sentence from its latent representation and the original style attribute. Note that in this model, the authors did not claim or constrain the latent representation z to contain only content information.

There are other variants of attribute-controlled approaches that perform TST without content and style disentanglement. For instance, Dai et al. (Dai et al., 2019) adopted a transformer-based autoencoder (Vaswani et al., 2017) to perform TST with a trainable style attribute. The model’s goal is to harness the power of the attention mechanism in the Transformer to achieve better style transfer and better content preservation. Zhang et al. (Zhang et al., 2018a) proposed a Shared-private encoder-decoder (SHAPED) framework that learns the style attributes to transfer the text style. Li et al. (Li et al., 2019) extended the attribute-controlled TST works and proposed a domain-adaptive TST, which enables style transfer to be performed in a domain-aware manner. Specifically, besides the latent style attributes, the proposed model also learns domain vectors of the text in the source and target domains. The domain vectors are subsequently used with the style attributes and the sentence’s latent representation to perform TST across domains. Jain et al. (Jain et al., 2019) proposed an unsupervised TST method using the attribute-controlled technique. The approach is similar to the work proposed in (Hu et al., 2017). However, unlike most TST methods, which require a style classifier, the proposed model assumes that the style label is unknown. Instead, the authors proposed a scoring mechanism that scores the sentences’ semantic relatedness, fluency, and readability grade to guide the learning of the style attribute for TST. Similarly, Zhou et al. (Zhou et al., 2020a) proposed an unsupervised method to perform fine-grained attribute control for TST. The proposed model utilizes an attentional Seq2Seq model that dynamically exploits the relevance of each output word to the target style for unsupervised style transfer. Included in the model is a carefully designed objective function that combines style transfer, style relevance consistency, content preservation, and fluency modeling loss terms.

4.7. Entangled Latent Representation Edition

Another line of work, which also attempts to perform TST without any content and style disentanglement, directly edits the latent representation learned by autoencoder-based models. Fig. 8 shows a common framework adopted by works that edit latent representations for TST. Typically, the latent representation learned by an autoencoder is manipulated using various methods. The manipulated latent representation is then input into the decoder to generate text of the target style.

Figure 8. Common framework for editing text’s latent representation for TST

In earlier work, Mueller et al. (Mueller et al., 2017) explored manipulating the hidden representation learned by a VAE to generate sentences that exhibit a certain style, as measured by a corresponding classifier. However, it is interesting to note that there was no quantitative evaluation of the effectiveness of text style transfer in this earlier work.

Xu et al. (Xu et al., 2019a) conducted extensive experiments to investigate the latent vacancy in unsupervised learning of controllable representations when modeling text with VAEs. Similar to the study in (Mueller et al., 2017), Xu et al. studied the impact on text style of manipulating factors in the latent representation and found that when a manipulation fails to decode accurate sentences, it is because the manipulation results in representation areas that the decoder has never seen during training. To handle this issue, they proposed constraining the posterior mean to a learned probability simplex and performing manipulation only within that simplex.

Liu et al. (Liu et al., 2020) adopted gradient-based optimization in the continuous space to manipulate the latent representation learned using a VAE and style classifiers to achieve text style transfer. Moreover, the proposed method naturally has the ability to simultaneously control multiple fine-grained attributes, such as sentence length and the presence of specific words, when performing text style transfer tasks. Wang et al. (Wang et al., 2019a) adopted a similar approach and applied a Fast-Gradient-Iterative-Modification algorithm to edit the latent representation learned using a transformer-based autoencoder until the generated text conforms to the target style.
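A minimal sketch of this kind of gradient-based latent editing is shown below; the linear latent-space style classifier, the step size, and the iteration count are illustrative assumptions rather than the cited algorithms’ exact settings.

```python
# Edit a latent code toward the target style by descending the style
# classifier's loss directly in latent space (illustrative sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

style_clf = nn.Linear(32, 2)                 # stand-in for a latent-space style classifier
z = torch.randn(1, 32, requires_grad=True)   # latent code of the input sentence
target = torch.tensor([1])                   # desired style label

for _ in range(10):
    loss = F.cross_entropy(style_clf(z), target)
    grad, = torch.autograd.grad(loss, z)
    z = (z - 1.0 * grad).detach().requires_grad_()  # move z toward the target style

# z would now be decoded back to text by the autoencoder's decoder.
```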

4.8. Reinforcement Learning

Reinforcement learning has also been applied to perform TST. For instance, Luo et al. (Luo et al., 2019) proposed to learn two seq2seq models between two styles via reinforcement learning, without disentangling style and content. Fig. 9 illustrates the proposed dual reinforcement learning framework. The authors considered the learning of source-to-target style and target-to-source style as a dual task. A style classifier reward and a reconstruction reward are designed to encourage style transfer accuracy and content preservation, respectively. The overall reward is the harmonic mean of the two rewards, and it is used as the feedback signal to guide learning in the dual-task structure. As such, the model can be trained via reinforcement learning without any use of parallel data or content-style disentanglement.
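The overall reward described above can be computed as follows; the epsilon term and the example reward values are illustrative assumptions.

```python
# Overall reward as the harmonic mean of the style and content rewards.
def overall_reward(r_style: float, r_content: float) -> float:
    # the harmonic mean penalizes models that score well on only one criterion
    return 2 * r_style * r_content / (r_style + r_content + 1e-9)

print(overall_reward(0.9, 0.8))   # balanced transfer -> high reward (~0.847)
print(overall_reward(0.9, 0.1))   # style changed but content lost -> low reward (0.18)
```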

Figure 9. Dual reinforcement learning model for TST proposed in (Luo et al., 2019)

Gong et al. (Gong et al., 2019) proposed a reinforcement learning-based generator-evaluator framework to perform TST. Similar to previous TST works, the proposed model employs an attention-based encoder-decoder model to transfer and generate target-style sentences. However, unlike previous models that utilize only a style classifier to guide the generation process, the proposed model employs a style classifier, a semantic model, and a language model to provide style, semantic, and fluency rewards, respectively, to guide the text generation. The authors’ intuition is that text style transfer should not only ensure the transfer of style and the preservation of content but also generate fluent sentences.

4.9. Probabilistic Model

He et al. (He et al., 2020) proposed a probabilistic deep generative model that infers the latent representations of sentences for TST. The proposed model hypothesizes a parallel latent sequence that generates each observed sequence, and it learns to transform sequences from one domain to another in a completely unsupervised fashion. Specifically, the model combines a recurrent language model prior with an encoder-decoder transducer to infer the latent representations of the sentences in an assumed parallel style corpus. The inferred latent representation is then used to generate the sentence of a specific style via a decoder.

5. Evaluation Methodology

As TST is a relatively new research area, new methods will need to be designed to evaluate the TST algorithms. In this section, we first summarize the downstream tasks and existing datasets used to evaluate the TST models. Next, we discuss the automated and human evaluation methods used to assess the quality of TST algorithms.

5.1. Tasks and Datasets

Datasets. Table 2 summarizes the datasets used in existing studies to evaluate the performance of TST algorithms. These corpora often contain texts labeled with two or more attributes. For example, the Yelp dataset contains review text records labeled with a binary sentiment class (i.e., positive or negative), and the Caption dataset contains caption-text records labeled with romantic, humorous, and factual classes. Most of these datasets are non-parallel, i.e., there are no matching text pairs in the different attribute classes, except for Shakespeare-Modern and GYAFC. It is also interesting to note that while the GYAFC corpus is a parallel dataset, most existing TST studies assume a non-parallel setting when training TST models with this dataset.

Many downstream tasks have been proposed to leverage these datasets to evaluate TST models. In the rest of this section, we will review these downstream tasks in greater detail.

Dataset | Subset | Attributes | #Text records
Shakespeare-Modern (Xu et al., 2012) | – | Shakespeare / Modern | 21,076 / 21,076
Yelp (Shen et al., 2017) | – | Positive / Negative | 381,911 / 252,343
IMDb (Dai et al., 2019) | – | Positive / Negative | 181,869 / 190,597
Amazon (He and McAuley, 2016) | – | Positive / Negative | 278,713 / 279,284
GYAFC (Rao and Tetreault, 2018) | F&R | Informal / Formal | 56,087 / 55,233
GYAFC (Rao and Tetreault, 2018) | E&M | Informal / Formal | 56,888 / 56,033
PNTD (Fu et al., 2018) | – | Paper / News | 107,538 / 108,503
Caption (Li et al., 2018) | – | Romantic / Humorous / Factual | 6,300 / 6,300 / 300
Gender (Prabhumoye et al., 2018) | – | Male / Female | 1,604,068 / 1,604,068
Political (Prabhumoye et al., 2018) | – | Democratic / Republican | 298,961 / 298,961
Offensive (dos Santos et al., 2018) | Twitter | Offensive / Non-offensive | 74,218 / 1,962,224
Offensive (dos Santos et al., 2018) | Reddit | Offensive / Non-offensive | 266,785 / 7,096,473
Table 2. Dataset statistics for text style transfer.

Author Imitation. Author imitation is the task of paraphrasing a sentence to match a specific author’s style. To perform this task, Xu et al. (Xu et al., 2012) collected a parallel dataset, which captures the line-by-line modern paraphrases for 16 of Shakespeare’s 36 plays (Antony & Cleopatra, As You Like It, Comedy of Errors, Hamlet, Henry V, etc.) using the educational site SparkNotes (www.sparknotes.com). The goal was to imitate Shakespeare’s text style by transferring modern English sentences into Shakespearean-style sentences. This dataset is publicly available (http://tinyurl.com/ycdd3v6h) and was also used in other TST studies (Jhamtani et al., 2017; He et al., 2020).

The imitation of an author’s writing style is an exciting TST task. There are many interesting industrial applications, such as transferring the writing style of a famous novelist onto other stories or unifying the writing styles of multiple authors into a single style in a collaborative setting. However, the Shakespeare-Modern dataset is the only known corpus that facilitates author imitation in TST studies. The corpus also has some apparent limitations: the dataset size is small, and the approach is limited to transferring to only one author’s style. Interesting future work may be to collect text written by various authors and transfer styles among multiple authors.

Sentiment transfer. Sentiment transfer is a very popular evaluation task adopted in many TST studies. The task involves modifying the sentiment of a sentence while preserving its original contextual content. Table 3 shows an example of the sentiment transfer task. Given the input sentence with positive sentiment, “Everything is fresh and so delicious!”, the goal of the TST model is to convert the sentence into negative sentiment while preserving the contextual content information. In this example, the word “Everything” represents the content information and is preserved during the style transfer operation. The examples also reveal an interesting aspect of the sentiment transfer task: while the sentence’s style is transferred and the contextual content is preserved, the semantics of the sentence have also changed. For instance, a sentence supporting a particular political party may semantically change to a negative opinion during the sentiment transfer process. Nevertheless, this task is widely used to evaluate TST models, and three popular datasets have been proposed for it:

  • Yelp (https://github.com/shentianxiao/language-style-transfer) is a corpus of restaurant reviews from Yelp collected in (Shen et al., 2017). The original Yelp reviews are on a 5-point rating scale. As part of data preprocessing, reviews rated above 3 points are labeled as positive, while those rated below 3 points are labeled as negative. Reviews with an exact 3-point rating are considered neutral and are excluded from this dataset.

  • Amazon (https://github.com/lijuncen/Sentiment-and-Style-Transfer) is a product review dataset from Amazon collected in (He and McAuley, 2016). It is preprocessed using the same method as the Yelp dataset.

  • IMDb (https://github.com/fastnlp/nlp-dataset) is a movie review dataset. Dai et al. (Dai et al., 2019) constructed this dataset by applying similar data preprocessing to a publicly available and popular movie review dataset (Maas et al., 2011).

 | Sentence
Input (positive) | Everything is fresh and so delicious!
Ref-0 (negative) | Everything was so stale.
Ref-1 (negative) | Everything is rotten and not so delicious.
Ref-2 (negative) | Everything is stale and horrible.
Ref-3 (negative) | Everything is stale and tastes bad.
Table 3. Sentiment transfer examples (positive → negative).

Formality transfer. The formality transfer task involves modifying the formality of a given sentence; typically, an informal sentence is transferred to its formal form and vice versa. Formality transfer is presumably more complex than sentiment transfer, as multiple attributes may affect the formality of text. For instance, modifications to the sentence structure, the length of the text, punctuation, capitalization, etc., influence the formality of text. Table 4 shows an example of transferring an informal sentence to its formal form. The four transferred formal sentences illustrate that simple keyword replacement methods cannot achieve formality transfer. After the sentence is transferred to its formal form, the length of the sentence (as shown in Ref-0) and the punctuation may change (e.g., replacement of an ellipsis with a full stop). Furthermore, unlike sentiment, the formality of a sentence is highly subjective; individuals may perceive a sentence’s degree of formality differently.

GYAFC (https://github.com/raosudha89/GYAFC-corpus) is the largest human-labeled parallel dataset proposed for the formality transfer task (Rao and Tetreault, 2018). The authors extracted informal sentences from the Entertainment & Music (E&M) and Family & Relationship (F&R) domains of the Yahoo Answers L6 corpus (https://webscope.sandbox.yahoo.com/catalog.php?datatype=l). The collected dataset was further preprocessed to remove sentences that are too short or too long. Finally, the authors employed crowdsourced workers to manually rewrite the informal sentences into formal sentences, resulting in a parallel formality dataset. This dataset has also been widely used to evaluate recent TST models.

 | Sentence
Input (informal) | He loves you, too, girl…Time will tell.
Ref-0 (formal) | He loves you as well, but only time can tell what will happen.
Ref-1 (formal) | He loves you too, lady…time will tell.
Ref-2 (formal) | He loves you, as well. Time will tell.
Ref-3 (formal) | He loves you too and time will tell.
Table 4. Formality transfer examples (informal → formal).

Paper-news title transfer. Paper-news title transfer is the task of transferring a title between paper and news styles while preserving the content. Fu et al. (Fu et al., 2018) collected the PNTD dataset (https://github.com/fuzhenxin/textstyletransferdata), which consists of paper titles retrieved from academic publication archive websites such as the ACM Digital Library, arXiv, Springer, ScienceDirect, and Nature, and news titles from the UC Irvine Machine Learning Repository.

Captions style transfer. Li et al. (Li et al., 2018) proposed the task of transferring factual captions into romantic and humorous styles. The researchers collected the Caption dataset (available in the same repository as the Amazon dataset above), where each sentence is labeled as factual, romantic, or humorous. This is also the smallest TST dataset.

Gender style transfer. The difference between male and female writing styles is a widely studied research topic in sociolinguistics. Prabhumoye et al. (Prabhumoye et al., 2018) extended these sociolinguistics studies to perform TST between texts written by different genders, i.e., transferring a text written in a male writing style to a female writing style and vice versa. The researchers constructed the Gender dataset (https://github.com/shrimai/Style-Transfer-Through-Back-Translation) by preprocessing a Yelp review dataset annotated with the gender of the reviewers (Reddy and Knight, 2016), splitting the reviews into sentences, and preserving the gender label for each sentence. Sentences deemed gender-neutral were removed from the dataset.

Political slant transfer. Political slant transfer is the task of modifying the writing style associated with the writer’s political affiliation while preserving the content. Prabhumoye et al. (Prabhumoye et al., 2018) collected comments on Facebook posts from 412 members of the United States Senate and House who have public Facebook pages. The comments are annotated with the congressperson’s political party affiliation: Democratic or Republican. Table 5 shows examples of comments collected in the dataset.

Republican | defund them all, especially when it comes to the illegal immigrants.
Republican | thank u james, praying for all the work u do.
Democratic | on behalf of the hard-working nh public school teachers- thank you!
Democratic | we need more strong voices like yours fighting for gun control.
Table 5. Political slant examples.

Offensive language correction. The use of offensive and abusive language is a growing problem in online social media. The offensive language correction task aims to transfer offensive sentences into non-offensive ones. Santos et al. (dos Santos et al., 2018) collected posts from Twitter and Reddit. The posts were subsequently classified into “offensive” and “non-offensive” classes using a classifier pre-trained on an annotated offensive language dataset.

Multiple-attribute style transfer. Thus far, the tasks we have discussed involve transferring text between two style attributes. Lai et al. (Lai et al., 2019) proposed a multiple-attribute style transfer task and collected multi-attribute datasets based on the Yelp and Amazon review datasets. Table 6 summarizes the statistics of the three datasets collected in their studies. The goal is to transfer a text with specific multiple style attributes, such as sentiment, the gender of the author, etc. The specification of multiple attributes makes the TST task more complex and realistic, as text style is often multi-faceted. For instance, the gender and the sentiment of the author could both affect the style of the text. We postulate that multiple-attribute style transfer may be one of TST research’s future directions, and we discuss this further in Section 7.

FYelp | Sentiment: Positive 2,056,132 / Negative 639,272 | Gender: Male 1,218,068 / Female 1,477,366 | Category: American 904,026 / Asian 518,370 / Bar 595,681 / Dessert 431,225 / Mexican 246,102
Amazon | Sentiment: Positive 64,251,073 / Negative 10,944,310 | Gender: – | Category: Book 26,208,872 / Clothing 14,192,554 / Electronics 25,894,877 / Movies 4,324,913 / Music 4,574,167
Social Media Content | Sentiment: Relaxed 7,682,688 / Annoyed 17,823,468 | Gender: Male 14,501,958 / Female 18,463,789 | Age: 18-24 12,628,250 / 65+ 7,629,505
Table 6. Dataset statistics for multiple-attribute transfer datasets.

5.2. Automated Evaluation

Several automated evaluation metrics have been proposed to measure the effectiveness of TST models (Pang, 2019b, a; Pang and Gimpel, 2019; Mir et al., 2019). Broadly, these metrics evaluate TST algorithms on three criteria:

  1. The ability to transfer the text style.

  2. The amount of original content preserved after the TST operation.

  3. The fluency of the transferred style sentence.

A TST algorithm underperforming on any of these three criteria is considered ineffective at the TST task. For example, suppose a TST algorithm transfers a negative sentiment sentence, “the pasta taste bad!”, into a positive one, “the movie is great!”. While the algorithm transfers the style of the input text, i.e., from negative to positive sentiment, it fails to preserve the original statement’s content, i.e., describing the pasta. Like many other natural language generation tasks, the transferred sentence must also achieve a certain level of fluency for the TST algorithm to be useful in real-world applications. Therefore, an effective TST algorithm has to perform well on all three criteria of the evaluation.

Transfer strength. A TST model’s transfer strength, i.e., its ability to transfer text style, is commonly measured using Style Transfer Accuracy (Hu et al., 2017; Shen et al., 2017; Fu et al., 2018; Luo et al., 2019; John et al., 2019). Typically, a binary style classifier such as TextCNN (Moschitti et al., 2014) is first pre-trained separately to predict the style label of an input sentence. The style classifier is then used to approximate the style transfer accuracy of the transferred sentences by treating the target style as the ground truth. It is important to note that the style classifier is not perfect: when pre-trained on the Yelp and GYAFC datasets and applied to their respective validation sets, the style classifier only achieves accuracies of 97.2% and 83.4%, respectively. Nevertheless, style transfer accuracy is thus far the only known automated quantitative approach to evaluate the transfer strength of TST algorithms.
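To make the procedure concrete, the sketch below approximates style transfer accuracy with a simple bag-of-words classifier standing in for TextCNN; the classifier choice and the toy training data are illustrative assumptions, and any pre-trained binary style classifier could be substituted.

```python
# Sketch: style transfer accuracy with a stand-in style classifier.
# A bag-of-words logistic regression replaces TextCNN purely for
# illustration; the tiny training corpus is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy style-labeled corpus (0 = negative, 1 = positive).
train_texts = ["the pasta taste bad", "horrible service",
               "the pasta is great", "lovely staff"]
train_labels = [0, 0, 1, 1]

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

def style_transfer_accuracy(transferred_sentences, target_label):
    """Fraction of transferred sentences classified as the target style."""
    predictions = clf.predict(transferred_sentences)
    return (predictions == target_label).mean()

# Outputs of a hypothetical model transferring negative -> positive.
outputs = ["the pasta is great", "lovely pasta"]
print(style_transfer_accuracy(outputs, target_label=1))
```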

Content preservation. To quantitatively measure the amount of original content preserved after the style transfer operation, TST studies have borrowed three automated evaluation metrics commonly used in other natural language generation tasks (a combined code sketch follows the list):

  • self-BLEU: The BLEU score (Papineni et al., 2002) was originally designed to evaluate the quality of machine-translated text and was one of the first metrics to claim a high correlation with human judgments of translation quality. To compute the BLEU score, the machine-translated text is compared with a set of good-quality reference translations. However, most TST tasks assume a non-parallel setting, in which matching references for style-transferred sentences are not always available. Therefore, self-BLEU is adopted, comparing the style-transferred sentence with its original sentence. The intuition is that content is assumed to be preserved when the style-transferred sentence shares many n-grams with the original sentence.

  • Cosine Similarity: Fu et al. (Fu et al., 2018) calculated the cosine similarity between the original sentence embedding and the transferred sentence embedding. The intuition is that the embeddings of the two sentences should be close if the semantics of the transferred sentence are preserved.

  • Word Overlap: John et al. (John et al., 2019) argued that cosine similarity is not a sensitive metric, as the original and transferred sentences may have a high cosine similarity score even when the content of the sentences differs. Thus, they employed a simple metric that counts the unigram word overlap rate between the original and style-transferred sentences. Note that stop words and style-attributed words (e.g., sentiment words) are excluded from the word overlap calculation.
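The combined sketch below implements all three content preservation metrics in a minimal form; the stop-word list and sentiment lexicon are toy placeholders, and the bag-of-words count vectors stand in for the sentence embeddings used in the original papers.

```python
# Sketch of the three content preservation metrics.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

STOP_WORDS = {"the", "is", "a", "was"}           # toy stop-word list
STYLE_WORDS = {"good", "bad", "great", "awful"}  # toy sentiment lexicon

def self_bleu(source, transferred):
    """BLEU of the transferred sentence against its own source sentence."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short inputs
    return sentence_bleu([source.split()], transferred.split(),
                         smoothing_function=smooth)

def cosine_similarity(source, transferred):
    """Cosine similarity over word-count vectors (embedding stand-in)."""
    vocab = sorted(set(source.split()) | set(transferred.split()))
    v1 = np.array([source.split().count(w) for w in vocab], dtype=float)
    v2 = np.array([transferred.split().count(w) for w in vocab], dtype=float)
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

def word_overlap(source, transferred):
    """Unigram overlap rate, excluding stop and style-attributed words."""
    s = set(source.lower().split()) - STOP_WORDS - STYLE_WORDS
    t = set(transferred.lower().split()) - STOP_WORDS - STYLE_WORDS
    return len(s & t) / len(s | t) if s | t else 0.0

src, out = "the pasta taste bad", "the pasta is great"
print(self_bleu(src, out), cosine_similarity(src, out), word_overlap(src, out))
```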

Fluency. Generating fluent sentences is a common goal for almost all natural language generation models. A common approach to measuring the fluency of a sentence is to use a trigram Kneser-Ney language model (Kneser and Ney, 1995). The Kneser-Ney language model is pre-trained to estimate the empirical distribution of trigrams in a training corpus. Subsequently, the perplexity score of a generated sentence is calculated by comparing the sentence’s trigrams against the estimated trigram distribution. The intuition is that a generated sentence with a lower perplexity score is more “aligned” with the training corpus and is therefore considered more fluent. In TST tasks, the language model is similarly trained on the TST datasets, and the perplexity scores of the style-transferred sentences are computed to evaluate their fluency.
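As a concrete sketch, assuming a trigram Kneser-Ney model has already been trained on the TST corpus and exported in ARPA format (e.g., with the KenLM toolkit), the perplexity of a transferred sentence can be computed as follows; the model file path is a placeholder.

```python
# Sketch: fluency scoring with a pre-trained Kneser-Ney trigram model.
# Assumes the `kenlm` Python bindings and an ARPA file trained on the
# TST dataset; "yelp_trigram_kn.arpa" is a hypothetical file name.
import kenlm

model = kenlm.Model("yelp_trigram_kn.arpa")

def fluency_ppl(sentence):
    """Perplexity of a transferred sentence; lower suggests more fluent."""
    return model.perplexity(sentence)

print(fluency_ppl("the pasta is great"))
```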

5.3. Human Evaluation

A few TST studies have performed human evaluations of their proposed TST algorithms (Shen et al., 2017; Li et al., 2018), as such evaluations are often expensive and laborious. In a typical human evaluation setting, crowd-sourced workers rate how the style-transferred sentence fares on the three evaluation criteria using a rating scale. For example, given a pair of original and transferred sentences, a human worker is asked to rate how well the content is preserved in the transferred sentence on a scale of 1 to 5 points, with 5 points being “very well preserved”. Multiple human workers evaluate a given pair of original and transferred sentences, and the average scores are reported to reduce individual bias. Although researchers have put in great effort to ensure the quality of human evaluation on TST tasks, the approach has proven to be very challenging, as the interpretation of text style is subjective and may vary across individuals (Pang, 2019b, a; Mir et al., 2019). Nevertheless, human evaluations still offer valuable insights into how well TST algorithms are able to transfer style and generate sentences that are acceptable by human standards.

6. Reproducibility Study

Although most of the existing TST methods were evaluated in their original works using the downstream tasks discussed in Section 5, the experiments were often carried out with few or no baselines. Thus, we conduct a reproducibility study (code implementations of the reproduced models are compiled in this repository: https://gitlab.com/bottle_shop/style/tst_survey) and benchmark 19 TST models on two popular corpora: Yelp reviews and GYAFC, representing the sentiment transfer and formality transfer tasks, respectively. To the best of our knowledge, this is the first time that so many TST models have been evaluated on the same datasets. Specifically, the experimental results from this study provide new insights into how each TST algorithm fares against the others in terms of transfer strength, content preservation, and fluency. This section is organized as follows: we first describe the experimental setup of our reproducibility study. Next, we discuss the experimental results on the sentiment transfer and formality transfer tasks. Finally, we perform trade-off analyses, where we investigate how the relationships between multiple evaluation criteria influence TST model performance.

6.1. Experimental Setup

Environment Settings. The experiments were performed on an Ubuntu 18.04.4 LTS system with 24 cores, 128 GB RAM, and a clock speed of 2.9 GHz. The GPU used for deep neural network-based models was an Nvidia GTX 2080Ti. We followed the environmental requirements and hyperparameter settings of the released code implementations of the TST models to reproduce the experimental results. For TST models that did not experiment on our datasets in their original publications, we tuned the hyperparameters ourselves to obtain the best results. Table 7 shows the training, validation, and test splits of the Yelp and GYAFC datasets used in our experiments.

Dataset   Subset   Attribute   Train     Dev      Test
Yelp      -        Positive    267,314   38,205   76,392
Yelp      -        Negative    176,787   25,278   50,278
GYAFC     F&R      Informal    51,967    2,788    1,332
GYAFC     F&R      Formal      51,967    2,247    1,019
Table 7. Dataset statistics for Yelp and GYAFC.

Evaluation Metrics. We adopt the evaluation metrics discussed in Section 5 to measure the performance of the TST models. Specifically, we apply Style Transfer Accuracy (ACC) to measure transfer strength. For measuring content preservation, we adopt self-BLEU, Cosine Similarity (CS), and Word Overlap (WO). For the experiments on the GYAFC dataset, human style-transferred sentences are available for the test set; therefore, we also compute the BLEU score between the TST model’s transferred sentences and the human style-transferred sentences. We compute the perplexity score (PPL) to quantify the fluency of the transferred sentences. Finally, we compute two average metrics that consider all evaluation aspects:

  • Geometric Mean (G-Score): We compute the geometric mean of ACC (transfer strength), self-BLEU (content preservation), WO (content preservation), and 1/PPL (fluency). We exclude the CS measure from the mean computation due to its insensitivity, and we use the inverse of the perplexity score because a smaller PPL indicates better fluency.

  • Harmonic Mean (H-Score): Different averaging methods reflect different priorities. Thus, we also compute the harmonic mean of ACC, self-BLEU, WO, and 1/PPL. A sketch of both averages follows this list.
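Both averages can be computed directly from the four metric values, as in the sketch below; plugging in DualRL’s Yelp scores from Table 8 reproduces (up to rounding) its reported G-Score of 2.29 and H-Score of 0.030.

```python
# Sketch: the two aggregate scores used in Tables 8 and 9.
# Requires Python 3.8+ for statistics.geometric_mean.
from statistics import geometric_mean, harmonic_mean

def g_score(acc, self_bleu, wo, ppl):
    """Geometric mean of ACC, self-BLEU, WO, and inverse perplexity."""
    return geometric_mean([acc, self_bleu, wo, 1.0 / ppl])

def h_score(acc, self_bleu, wo, ppl):
    """Harmonic mean of the same four quantities."""
    return harmonic_mean([acc, self_bleu, wo, 1.0 / ppl])

# DualRL on Yelp (Table 8): ACC 79.0, self-BLEU 58.3, WO 0.801, PPL 134.
print(round(g_score(79.0, 58.3, 0.801, 134), 2))  # 2.29
print(round(h_score(79.0, 58.3, 0.801, 134), 3))  # 0.030
```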

Reproduced Models. We limit our reproducibility study to these 19 TST models because their implementation code has been published. We hope to encourage fellow researchers to publish their code and datasets, as doing so promotes the field’s development. Specifically, we reproduced and implemented the following TST models:

  • DeleteOnly; Template; Del&Retri (Li et al., 2018): A style keyword replacement method that disentangles the style and content of a sentence explicitly by keyword replacement. The authors proposed three variants of their model. DeleteOnly first removes the style-attributed keywords from the source sentence; the latent representation of the keyword-removed sentence is then combined with the target style attribute and fed into a sequence model to generate the sentence in the target style. The Template model simply replaces the deleted style-attributed keywords with keywords of the target style. The Del&Retri model first performs the same keyword removal as the DeleteOnly method; next, it retrieves a new sentence associated with the target attribute; lastly, the keyword-removed source sentence and the retrieved sentence are fed into a sequence model to generate the transferred-style sentence.

  • B-GST; G-GST (Sudhakar et al., 2019): A TST model that extends the work in (Li et al., 2018) and proposes the Generative Style Transformer (GST) to perform text style transfer. There are two variants of the GST model: the Blind Generative Style Transformer (B-GST) and the Guided Generative Style Transformer (G-GST).

  • UST (Xu et al., 2018): A style keyword replacement method that utilizes cycled reinforcement learning to iteratively replace style-attributed keywords while maintaining the content of the text. This model was originally implemented for the sentiment transfer task.

  • PTO (Wu et al., 2019a): A style keyword replacement TST model that applies hierarchical reinforcement learning to a Point-Then-Operate (PTO) sequence operation. The PTO operation has two agents: a high-level agent that iteratively proposes operation positions, and a low-level agent that alters the sentence based on the high-level proposals. Using this reinforcement framework, the style-attributed keywords are replaced explicitly to perform TST.

  • SMAE (Zhang et al., 2018b): A style keyword replacement model that performs TST by disentangling style and content explicitly. The model was originally designed for sentiment transfer: the sentiment-attribute words are first detected, and a sentiment-memory-based auto-encoder is subsequently used to perform sentiment modification without parallel data.

  • ARAE (Zhao et al., 2018a): A generic natural language generation technique that utilizes adversarial learning to modify specific attributes of text. TST is one of the applications proposed in the model’s original paper.

  • CAAE (Shen et al., 2017): An adversarial learning TST model that implicitly disentangles the text’s style. Specifically, the model assumes a shared latent content distribution across different text corpora and proposes a method that leverages refined alignment of latent representations to perform TST.

  • Multi-Dec; Style-Emb (Fu et al., 2018): An adversarial learning TST model that utilizes a style classifier to disentangle the style and content representations for the style transfer task. Two variants of the model were proposed: the multi-decoder (Multi-Dec) model, which uses different decoders to generate text in different styles, and the style embedding (Style-Emb) model, which concatenates the style embedding vector with the content representation to generate text in different styles with a single decoder.

  • BST (Prabhumoye et al., 2018): A back-translation-based TST model that employs a pre-trained back-translation model to rephrase a sentence while reducing its stylistic characteristics. Subsequently, separate style-specific decoders are used for style transfer.

  • DRLST (John et al., 2019): An adversarial learning TST model that incorporates auxiliary multi-task and adversarial objectives for style prediction and bag-of-words prediction, respectively, to perform text style transfer.

  • Ctrl-Gen (Hu et al., 2017): An attribute-controlled TST model that utilizes variational auto-encoders and a style classifier to guide the learning of a style attribute that controls the generation of text in different styles.

  • DAST; DAST-C (Li et al., 2019): Attribute-controlled TST models that perform TST in a domain-aware manner. Two variants are proposed: the Domain Adaptation Style (DAST) model and DAST with generic content information (DAST-C). In these models, latent style attributes and domain vectors are learned to perform TST across domains.

  • DualRL (Luo et al., 2019): A reinforcement learning-based TST model that utilizes two seq2seq models to transfer between two text styles. Specifically, this model treats the learning of source-to-target and target-to-source style transfer as dual tasks that mutually reinforce each other, performing TST without disentangling style and content.

  • PFST (He et al., 2020): A probabilistic deep generative TST model. The model combines a language model prior with an encoder-decoder transducer to infer the latent representations of sentences in an assumed parallel style corpus. The inferred latent representations are subsequently used to generate a sentence of a specific style via a decoder.

6.2. Sentiment Transfer

Table 8 shows the performance of various TST models on the sentiment transfer task. While no TST model achieved the best performance on all evaluation metrics, DualRL (Luo et al., 2019), PTO (Wu et al., 2019a), B-GST (Sudhakar et al., 2019), and PFST (He et al., 2020) achieved a well-balanced trade-off among text fluency, content preservation, and style transfer accuracy. We also note that the different averaging methods, i.e., G-Score and H-Score, weight the evaluation metrics differently. For instance, the H-Score gives a higher weight to the perplexity scores of generated sentences. Thus, DRLST (John et al., 2019), which has the lowest PPL score, also has the highest H-Score; conversely, the Template model (Li et al., 2018) has the highest PPL score and the lowest H-Score.

More interestingly, we observed that the style keyword replacement methods, such as DeleteOnly (Li et al., 2018), Template (Li et al., 2018), Del&Retri (Li et al., 2018), B-GST (Sudhakar et al., 2019), G-GST (Sudhakar et al., 2019), UST (Xu et al., 2018), PTO (Wu et al., 2019a), and SMAE (Zhang et al., 2018b), achieved good performance on the sentiment transfer task. These methods achieved high transfer accuracy scores while preserving the content information, i.e., high self-BLEU, CS, and WO scores. A possible reason for the good performance of the style keyword replacement methods is the nature of the task: the sentiment of a sentence can easily be modified by replacing keywords related to the source sentiment. For example, replacing “fresh” with “rotten” would transform a sentence from positive to negative sentiment. However, it is interesting to note that the Template method (Li et al., 2018), an algorithm that simply replaces the sentiment-related keywords, has a high perplexity score, which indicates poor sentence fluency. This motivates more complex generative approaches that can prevent the generation of implausible sentences by simple keyword replacement.
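To illustrate why keyword replacement suffices for sentiment transfer, a toy Template-style substitution is sketched below; the lexicon is invented for illustration, whereas Li et al. (2018) mine style-attributed keywords from corpus statistics.

```python
# Toy illustration of style keyword replacement for sentiment transfer.
# The lexicon is invented; it is not the one used by Li et al. (2018).
POS_TO_NEG = {"fresh": "rotten", "great": "bad", "delicious": "bland"}

def transfer_to_negative(sentence):
    """Replace positive-sentiment keywords with negative counterparts."""
    return " ".join(POS_TO_NEG.get(tok, tok) for tok in sentence.split())

print(transfer_to_negative("the bread is fresh"))  # -> "the bread is rotten"
```

As the Template results suggest, such replacements can score well on accuracy and content preservation while still producing disfluent sentences, since no language model constrains the output.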

Yelp
Model ACC(%) self-BLEU CS WO PPL G-Score H-Score
DualRL 79.0 58.3 0.97 0.801 134 2.29 0.030
DRLST 91.2 7.6 0.904 0.484 86 1.41 0.045
DeleteOnly 84.2 28.7 0.893 0.501 115 1.80 0.034
Template 78.2 48.1 0.850 0.603 1959 1.04 0.002
Del&Retri 88.1 30.0 0.897 0.464 101 1.87 0.039
Ctrl-Gen 89.6 49.5 0.953 0.707 384 1.69 0.010
BST 83.1 2.3 0.827 0.076 261 0.49 0.015
CAAE 82.7 11.2 0.901 0.277 145 1.15 0.027
PTO 82.3 57.4 0.982 0.737 245 1.94 0.016
ARAE 83.2 18.0 0.874 0.270 138 1.31 0.028
B-GST 89.2 46.5 0.959 0.649 216 1.88 0.018
G-GST 72.7 52.0 0.967 0.617 407 1.55 0.010
DAST 90.7 49.7 0.961 0.705 323 1.77 0.012
DAST-C 93.6 41.2 0.933 0.560 450 1.48 0.009
Multi-Dec 69.6 17.2 0.887 0.244 299 0.99 0.013
Style-Emb 47.5 31.4 0.926 0.433 217 1.31 0.018
UST 74.0 41.0 0.929 0.448 394 1.36 0.010
SMAE 84.4 14.8 0.907 0.294 210 1.315 0.019
PFST 85.3 41.7 0.902 0.527 104 2.06 0.038
Table 8. Results of the benchmarked style transfer models on the Yelp dataset.

6.3. Formality Transfer

Table 9 shows the performance of various TST models on the formality transfer task. Similar to the observations for sentiment transfer, none of the TST models scores well on all evaluation metrics. We note that the average style transfer accuracy on GYAFC is 52.9%, which is significantly lower than the average of 84.4% on Yelp. This highlights the difficulty of the formality transfer task. We also observe that most models performed worse on this task than on the sentiment transfer task. It is also unsurprising that the style keyword replacement methods did not perform well on the formality transfer task; most of these models achieved low style transfer accuracy. Some of the adversarial learning-based TST models, such as CAAE (Shen et al., 2017) and DRLST (John et al., 2019), achieved high style transfer accuracy but very low content preservation, as these models lack a mechanism to control content preservation during the generative process. Interestingly, we observe that the attribute-controlled TST methods, i.e., Ctrl-Gen (Hu et al., 2017), DAST (Li et al., 2019), and DAST-C (Li et al., 2019), achieved good performance in both style transfer accuracy and content preservation.

The GYAFC dataset also provides the performance of four human references on the formality transfer task over the test dataset (shown at the bottom of Table 9). On average, the human references achieved 78.1% style transfer accuracy. This is a reasonable performance, given that the pre-trained binary classifier itself only achieved 83.4% accuracy on the test set. Furthermore, formality in text is subjective, and the four human references may have had different opinions on the degree of formality of a text.

As GYAFC is a parallel dataset, i.e., there are matching sentences in the source and target styles, we are able to compute the BLEU score between the transferred sentence and the matching sentence in the target style. Unsurprisingly, the human references achieved the highest BLEU scores, suggesting that the sentences generated by the human references are quite similar to the matching sentences in the target style. In comparison, the TST models fare poorly on the BLEU scores. We also observe that the TST models’ average content preservation scores on the formality transfer task are lower than those on the sentiment transfer task. For instance, the WO scores on the sentiment transfer task are higher because only a few keywords need to be replaced to perform the style transfer. In the formality transfer case, however, a more drastic and complex modification of the text has to be performed, so there is less word overlap between the original and transferred sentences, resulting in lower WO scores. The limitation of existing metrics in measuring content preservation for formality transfer highlights the need to search for better evaluation methods for this challenging task.

GYAFC
Model ACC(%) self-BLEU BLEU CS WO PPL G-Score H-Score
DualRL 56.7 61.6 18.8 0.944 0.447 122 1.89 0.032
DRLST 71.1 4.2 2.7 0.909 0.342 86 1.04 0.045
DeleteOnly 26.0 35.4 16.2 0.945 0.431 82 1.48 0.047
Template 51.5 45.1 19.0 0.943 0.509 111 1.81 0.035
Del&Retri 50.6 22.1 11.8 0.934 0.345 94 1.42 0.041
Ctrl-Gen 73.1 57.0 15.6 0.943 0.446 168 1.82 0.023
BST 69.7 0.5 0.5 0.883 0.04 69 0.38 0.042
CAAE 72.3 1.8 1.5 0.896 0.028 55 0.51 0.044
ARAE 76.2 4.8 2.2 0.903 0.042 77 0.67 0.040
B-GST 30.3 22.5 11.6 0.951 0.557 117 1.34 0.034
G-GST 31.0 20.7 10.2 0.941 0.556 127 1.29 0.031
DAST 73.1 50.6 14.3 0.934 0.350 204 1.59 0.019
DAST-C 78.2 48.5 13.8 0.927 0.328 308 1.42 0.013
Multi-Dec 22.2 13.4 5.9 0.911 0.168 146 0.76 0.026
Style-Emb 27.7 8.3 3.6 0.897 0.102 136 0.64 0.027
UST 23.6 0.5 0.5 0.881 0.012 28 0.27 0.035
SMAE 21.6 6.5 1.2 0.898 0.079 74 0.62 0.046
PFST 50.8 55.3 16.5 0.940 0.466 200 0.51 0.020
Human0 78.1 20.5 43.5 0.942 0.393 80 1.67 0.048
Human1 78.7 18.2 43.2 0.931 0.342 199 1.25 0.020
Human2 78.2 18.6 43.4 0.932 0.354 192 1.28 0.021
Human3 77.4 18.8 43.5 0.931 0.354 196 1.27 0.020
Table 9. Results of the benchmarked style transfer models on the GYAFC dataset.
Figure 10. Metrics trade-off analysis for sentiment transfer on the Yelp review dataset.
Figure 11. Metrics trade-off analysis for formality transfer on the GYAFC dataset.

6.4. Evaluation Metrics Trade-off Analysis

Besides evaluating the TST models on the two style transfer tasks, we also reproduced the evaluation metric trade-off analysis proposed in (Mir et al., 2019). The goal of this analysis is to investigate the relationships between the evaluation metrics. Specifically, we create variants of TST models by varying their hyperparameters and study the trade-off effects between pairs of evaluation metrics. Similar to the study in (Mir et al., 2019), we select the ARAE, DualRL, and CAAE models for our analysis.
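A minimal sketch of this analysis is shown below, assuming each hyperparameter variant of a model has already been evaluated and its (ACC, self-BLEU) pair recorded; the variant names and metric values are placeholders, not results from our experiments.

```python
# Sketch: plotting a transfer strength vs. content preservation trade-off
# across hyperparameter variants of one TST model. Values are placeholders.
import matplotlib.pyplot as plt

variants = {"w=0.1": (62.0, 55.0), "w=0.5": (75.0, 43.0), "w=1.0": (88.0, 28.0)}

accs, bleus = zip(*variants.values())
plt.scatter(accs, bleus)
for name, (acc, bleu) in variants.items():
    plt.annotate(name, (acc, bleu))
plt.xlabel("Style transfer accuracy (%)")
plt.ylabel("self-BLEU")
plt.title("Trade-off across hyperparameter variants (illustrative)")
plt.show()
```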

Fig. 10 shows the trade-off analysis results for the sentiment transfer task. Specifically, Fig. 10A, B, and C show the trade-off relationships between style transfer strength and the content preservation metrics, while Fig. 10D shows the trade-off relationship between style transfer strength and the fluency metric. Similar to the observations made in (Mir et al., 2019), we notice that as transfer strength increases, the content preservation metrics decrease across the three models. However, the trade-off relationship between transfer strength and sentence fluency is less obvious, as we notice that ARAE is able to achieve a lower PPL when ACC increases. Similar observations are made for the formality transfer task in Fig. 11.

The observations made in our trade-off analysis suggest some form of competing relationship between transfer strength and content preservation, i.e., when transfer strength scores increase, content preservation scores decrease, and vice versa. A potential reason for this observation is the entanglement between the semantic and stylistic properties of natural language: it is hard to separate the two properties, and changing one affects the other. Therefore, when optimizing to transfer the style of a text, it is hard to maintain the sentence’s semantics, i.e., the content information.

7. Future Research Direction and Open Issues

TST is a relatively new research area. While existing works have established a foundation for TST research, this section outlines several promising prospective research directions. We also discuss the open issues that we believe are critical to the present state of the field.

7.1. Deeper Dive into Style Representation

Although studies have suggested that existing techniques are not able to effectively disentangle a text’s style from its content (Lample et al., 2019), we believe that style could still be a standalone element of text. For example, a Shakespearean scholar can recognize a text written in Shakespearean style regardless of its content. Similarly, a Star Wars fan can recognize Yoda’s lines regardless of their content, because the speech style of the Star Wars character is distinctive. Humans’ ability to discern styles in text suggests the possibility of learning a representation that is descriptive of text styles. Therefore, new techniques can be explored to learn representations of distinctive text styles.

More studies could also be done on the style embeddings learned by existing techniques. Currently, we have little understanding of the style representations learned using existing techniques; beyond knowing that the style representation supposedly has some correlation with the style labels, we do not know much about the information that is preserved in the style representations. For instance, in learning style representations for formality transfer tasks, little is known about whether sentence structure is preserved in the representations, even though sentence structure may have an impact on the formality of the text.

To this end, a potential future research direction would be to conduct a deeper analysis of the style representations learned for the different tasks using existing techniques. We believe that this will provide new insights, which can guide the development of future TST techniques.

7.2. Unsupervised Text Style Transfer

While most of the existing TST methods are developed for the non-parallel dataset setting, these techniques still require a large amount of style-labeled data to guide the transfer of text styles. A promising research direction would be to explore unsupervised methods to perform TST with little or no labeled data. For instance, recent studies (Jain et al., 2019; Gong et al., 2019) have explored guiding the transfer of text style by scoring the sentences’ semantic relatedness, fluency, and readability grade instead of relying on style labels. We postulate that more aspects of the text, such as tone, brevity, and sentence structure, can be explored to train future TST models and reduce dependence on style labels.

7.3. Going Beyond Transferring Between Two Styles

Currently, most of the existing TST methods focus on transferring text between two styles. We believe that TST studies should go beyond performing binary style transfer and explore richer and more dynamic tasks. For example, Lai et al. (Lai et al., 2019) proposed a multiple-attribute style transfer task in which a text is transferred by specifying multiple style attributes, such as the sentiment and the gender of the author. Domain-aware TST methods have also been explored, where the domain of the text (e.g., food or movie reviews) is considered when transferring text styles (e.g., from positive to negative sentiment). We believe that more dynamic TST tasks with better real-life applications will be a promising future research direction.

7.4. Automatic Evaluation for Text Style Transfer

Our experimental evaluation in Section 6 has illustrated the challenges of evaluating the effectiveness of TST models. The existing evaluation methods have a few limitations. First, the evaluation of text style transfer based on transfer accuracy is limited by the performance of the style classifier. Second, similar to previous studies (Fu et al., 2018; Pang and Gimpel, 2019), we notice that transfer strength is inversely related to content preservation, suggesting that these metrics are in tension and challenging to optimize simultaneously. The limitations of existing evaluation metrics underscore the need to explore novel automatic evaluation metrics for TST models.

8. Discussion and Conclusion

Although TST is a relatively new branch of the natural language processing field, a considerable amount of TST research has been conducted in recent years. This explosive growth has generated many novel and interesting TST models. This survey aims to organize these models using a taxonomy (cf. Fig. 1) and summarizes the common techniques used by modern TST models to transfer text styles. We also emphasize important TST research trends, such as the shift from TST models that attempt to disentangle text style from content to models that aim to perform TST without any style-content disentanglement. While we postulate that the trend of performing TST without style-content disentanglement will continue, we believe that the study of style representation remains an interesting research direction that deserves further exploration.

Besides discussing the common TST techniques, we also conducted a large-scale reproducibility study in which we replicated and benchmarked 19 state-of-the-art TST algorithms on two publicly available datasets. To the best of our knowledge, this is the first large-scale reproducibility study on TST methods. The results of our study show that none of the TST methods dominates on all evaluation metrics. This reflects the complexity of the TST task: different methods have advantages in different aspects, and there is no simple way to declare a winner. The evaluation analysis in our reproducibility study also highlights the need to search for better TST evaluation metrics.

We believe that research on TST will continue to flourish, and the industry will continue to find more exciting applications for the existing TST methods. We hope that this survey can provide readers with a comprehensive understanding of the critical aspects of this field, clarify the important types of TST methods, and shed some light on future studies.

References

  • S. A. Abdul-Kader and J. Woods (2015) Survey on chatbot design techniques in speech conversation systems. International Journal of Advanced Computer Science and Applications 6 (7). Cited by: §3.3.
  • V. G. Ashok, S. Feng, and Y. Choi (2013) Success with style: using writing style to predict the success of novels. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1753–1764. Cited by: §3.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, Cited by: §2.2, §2, 1st item, §4.2.
  • P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. Lafferty, R. L. Mercer, and P. S. Roossin (1990) A statistical approach to machine translation. Computational linguistics 16 (2), pp. 79–85. Cited by: §2.2.
  • F. Can and J. M. Patton (2004) Change of writing style with time. Computers and the Humanities 38 (1), pp. 61–82. Cited by: §3.1.
  • Y. Cao, R. Shui, L. Pan, M. Kan, Z. Liu, and T. Chua (2020) Expertise style transfer: a new task towards better communication between experts and laymen. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Cited by: §3.1.
  • K. Carlson, A. Riddell, and D. Rockmore (2018) Evaluating prose style transfer with the bible. Royal Society open science 5 (10), pp. 171920. Cited by: §1, §4.2, Table 1.
  • M. J. Chambliss and R. Garner (1996) Do adults change their minds after reading persuasive text?. Written Communication 13 (3), pp. 291–313. Cited by: §3.2.
  • L. Chen, S. Dai, C. Tao, H. Zhang, Z. Gan, D. Shen, Y. Zhang, G. Wang, R. Zhang, and L. Carin (2018) Adversarial text generation via feature-mover’s distance. In Advances in Neural Information Processing Systems, pp. 4666–4677. Cited by: §1, §2.1, §4.4, §4.4, Table 1.
  • K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Cited by: §2.2, §2, 1st item, §4.2.
  • N. Dai, J. Liang, X. Qiu, and X. Huang (2019) Style transformer: unpaired text style transfer without disentangled latent representation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5997–6007. Cited by: §1, §2.1, §2.2, §4.6, Table 1, 3rd item, Table 2.
  • L. H. Darani (2014) Persuasive style and its realization through transitivity analysis: a sfl perspective. Procedia-social and behavioral sciences 158, pp. 179–186. Cited by: §3.2.
  • C. dos Santos, I. Melnyk, and I. Padhi (2018) Fighting offensive language on social media with unsupervised text style transfer. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 189–194. Cited by: §2.2, §5.1, Table 2.
  • N. E. Enkvist (2016) Linguistic stylistics. Vol. 5, Walter de Gruyter GmbH & Co KG. Cited by: §1.
  • Z. Fu, X. Tan, N. Peng, D. Zhao, and R. Yan (2018) Style transfer in text: exploration and evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2.1, §2.2, Figure 3, §4.4, Table 1, 2nd item, §5.1, §5.2, Table 2, 8th item, §7.4.
  • A. Gatt and E. Krahmer (2018) Survey of the state of the art in natural language generation: core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61, pp. 65–170. Cited by: §1, §2.2, §4.2, §4.6.
  • L. A. Gatys, A. S. Ecker, and M. Bethge (2015) A neural algorithm of artistic style. CoRR abs/1508.06576. External Links: Link, 1508.06576 Cited by: §2.1, §2, §4.1.2.
  • H. Gong, S. Bhat, L. Wu, J. Xiong, and W. Hwu (2019) Reinforcement learning based text style transfer without parallel training corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3168–3180. Cited by: §1, §2.1, §4.8, Table 1, §7.2.
  • M. Halliday (1981) Linguistic function and literary style: an inquiry into the language of william golding’s the inheritors. Essays in Modern Stylistics, pp. 325–60. Cited by: §3.1.
  • J. He, X. Wang, G. Neubig, and T. Berg-Kirkpatrick (2020) A probabilistic formulation of unsupervised text style transfer. In International Conference on Learning Representations (ICLR), Cited by: §1, §2.1, §4.9, Table 1, §5.1, 14th item, §6.2.
  • R. He and J. McAuley (2016) Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pp. 507–517. Cited by: 2nd item, Table 2.
  • Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing (2017) Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1587–1596. Cited by: §1, §2.1, §2.2, Figure 6, §4.6, §4.6, §4.6, Table 1, §5.2, 11st item, §6.3.
  • R. Ivanič (2004) Discourses of writing and learning to write. Language and education 18 (3), pp. 220–245. Cited by: §3.1.
  • P. Jain, A. Mishra, A. P. Azad, and K. Sankaranarayanan (2019) Unsupervised controllable text formalization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6554–6561. Cited by: §1, §2.1, §2.2, §4.6, Table 1, §7.2.
  • H. Jhamtani, V. Gangal, E. Hovy, and E. Nyberg (2017) Shakespearizing modern language using copy-enriched sequence to sequence models. In Proceedings of the Workshop on Stylistic Variation, pp. 10–19. Cited by: §1, 1st item, §4.2, Table 1, §5.1.
  • D. Jin, Z. Jin, J. T. Zhou, L. Orii, and P. Szolovits (2020) Hooks in the headline: learning to generate headlines with controlled styles. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Cited by: §3.2.
  • Z. Jin, D. Jin, J. Mueller, N. Matthews, and E. Santus (2019) IMaT: unsupervised text attribute transfer via iterative matching and translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3088–3100. Cited by: §1, §4.2, Table 1.
  • Y. Jing, Y. Yang, Z. Feng, J. Ye, Y. Yu, and M. Song (2019) Neural style transfer: a review. IEEE transactions on visualization and computer graphics. Cited by: §2.1, §2.
  • V. John, L. Mou, H. Bahuleyan, and O. Vechtomova (2019) Disentangled representation learning for non-parallel text style transfer. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 424–434. Cited by: §1, §2.1, §4.4, §4.4, Table 1, 3rd item, §5.2, 10th item, §6.2, §6.3.
  • B. Johnstone (1989) Linguistic strategies and cultural styles for persuasive discourse. Cited by: §3.2.
  • B. Johnstone (2009) Stance, style, and the linguistic individual. Stance: sociolinguistic perspectives, pp. 29–52. Cited by: §3.1.
  • M. Kaptein, B. De Ruyter, P. Markopoulos, and E. Aarts (2012) Adaptive persuasive systems: a study of tailored persuasive text messages to reduce snacking. ACM Transactions on Interactive Intelligent Systems (TiiS) 2 (2), pp. 1–25. Cited by: §3.2.
  • M. Kaptein, P. Markopoulos, B. De Ruyter, and E. Aarts (2015) Personalizing persuasive technologies: explicit and implicit personalization using persuasion profiles. International Journal of Human-Computer Studies 77, pp. 38–51. Cited by: §3.2.
  • S. Kim, J. Lee, and G. Gweon (2019) Comparing data from chatbot and web surveys: effects of platform and conversational style on survey response quality. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–12. Cited by: §3.3.
  • D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §4.6.
  • A. Klahold and M. Fathi (2020a) Computer aided writing. Springer. Cited by: §3.1.
  • A. Klahold and M. Fathi (2020b) Word processing as writing support. In Computer Aided Writing, pp. 21–29. Cited by: §3.1.
  • R. Kneser and H. Ney (1995) Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 181–184 vol.1. Cited by: §5.2.
  • P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. (2007) Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pp. 177–180. Cited by: §2.2.
  • C. Lai, Y. Hong, H. Chen, C. Lu, and S. Lin (2019) Multiple text style transfer by using word-level conditional generative adversarial network with two-phase training. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3570–3575. Cited by: §1, §2.1, §4.4, §4.4, Table 1, §5.1, §7.3.
  • G. Lample, S. Subramanian, E. M. Smith, L. Denoyer, M. Ranzato, and Y. Boureau (2019) Multiple-attribute text rewriting. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, Cited by: §1, §2.1, §2.2, Figure 7, 3rd item, §4.6, Table 1, §7.1.
  • W. Leeftink and G. Spanakis (2019) Towards controlled transformation of sentiment in sentences. In Proceedings of the 11th International Conference on Agents and Artificial Intelligence, ICAART 2019, Volume 2, Prague, Czech Republic, February 19-21, 2019, pp. 809–816. Cited by: §4.3.
  • D. Li, Y. Zhang, Z. Gan, Y. Cheng, C. Brockett, B. Dolan, and M. Sun (2019) Domain adaptive text style transfer. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3295–3304. Cited by: §4.6, 12nd item, §6.3.
  • J. Li, R. Jia, H. He, and P. Liang (2018) Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1865–1874. Cited by: §1, Figure 2, 1st item, §4.3, Table 1, §5.1, §5.3, Table 2, 1st item, 2nd item, §6.2, §6.2.
  • Y. Liao, L. Bing, P. Li, S. Shi, W. Lam, and T. Zhang (2018) Quase: sequence editing under quantifiable guidance. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3855–3864. Cited by: §1, §4.2, Table 1.
  • D. Liu, J. Fu, Y. Zhang, C. Pal, and J. Lv (2020) Revision in continuous space: fine-grained control of text style transfer. Cited by: §1, §2.1, §4.7, Table 1.
  • L. Logeswaran, H. Lee, and S. Bengio (2018) Content preserving text generation with attribute controls. In Advances in Neural Information Processing Systems, pp. 5103–5113. Cited by: §1, §2.1, §4.4, Table 1.
  • F. Luo, P. Li, J. Zhou, P. Yang, B. Chang, X. Sun, and Z. Sui (2019) A dual reinforcement learning framework for unsupervised text style transfer. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 5116–5122. Cited by: §1, §2.1, Figure 9, §4.8, Table 1, §5.2, 13rd item, §6.2.
  • A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pp. 142–150. Cited by: 3rd item.
  • C. A. MacArthur (2009) Reflections on research on writing and technology for struggling writers. Learning Disabilities Research & Practice 24 (2), pp. 93–103. Cited by: §3.1.
  • R. Mir, B. Felbo, N. Obradovich, and I. Rahwan (2019) Evaluating style transfer for text. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 495–504. Cited by: §2.2, §5.2, §5.3, §6.4, §6.4.
  • A. Moschitti, B. Pang, and W. Daelemans (Eds.) (2014) Proceedings of the 2014 conference on empirical methods in natural language processing, EMNLP 2014, october 25-29, 2014, doha, qatar, A meeting of sigdat, a special interest group of the ACL. ACL. Cited by: §5.2.
  • I. Muehlenhaus (2012) If looks could kill: the impact of different rhetorical styles on persuasive geocommunication. The Cartographic Journal 49 (4), pp. 361–375. Cited by: §3.2.
  • J. Mueller, D. Gifford, and T. Jaakkola (2017) Sequence to better sequence: continuous revision of combinatorial structures. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2536–2544. Cited by: §1, §2.1, §4.7, §4.7, Table 1.
  • N. I. Nikolov and R. H. R. Hahnloser (2018) Large-scale hierarchical alignment for author style transfer. CoRR abs/1810.08237. External Links: Link, 1810.08237 Cited by: §1, §4.2, Table 1.
  • R. Y. Pang and K. Gimpel (2019) Unsupervised evaluation metrics and learning criteria for non-parallel textual transfer. In Proceedings of the 3rd Workshop on Neural Generation and Translation, pp. 138–147. Cited by: §2.2, §5.2, §7.4.
  • R. Y. Pang (2019a) The daunting task of real-world textual style transfer auto-evaluation. CoRR abs/1910.03747. External Links: Link, 1910.03747 Cited by: §5.2, §5.3.
  • R. Y. Pang (2019b) Towards actual (not operational) textual style transfer auto-evaluation. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pp. 444–445. Cited by: §5.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §2.2, 1st item.
  • S. Park, S. Hwang, F. Chen, J. Choo, J. Ha, S. Kim, and J. Yim (2019) Paraphrase diversification using counterfactual debiasing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6883–6891. Cited by: §1, §2.1, §4.4, Table 1.
  • G. Parra et al. (2019) Automated writing evaluation tools in the improvement of the writing skill.. International Journal of Instruction 12 (2), pp. 209–226. Cited by: §3.1, §3.1.
  • S. Prabhumoye, Y. Tsvetkov, R. Salakhutdinov, and A. W. Black (2018) Style transfer through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 866–876. Cited by: §2.1, §2.2, Figure 5, §4.5, Table 1, §5.1, §5.1, Table 2, 9th item.
  • S. Rao and J. Tetreault (2018) Dear sir or madam, may i introduce the gyafc dataset: corpus, benchmarks and metrics for formality style transfer. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 129–140. Cited by: §5.1, Table 2.
  • S. Reddy and K. Knight (2016) Obfuscating gender in social media writing. In Proceedings of the First Workshop on NLP and Computational Social Science, pp. 17–26. Cited by: §5.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96. Cited by: §2.2, §4.6.
  • M. Shang, P. Li, Z. Fu, L. Bing, D. Zhao, S. Shi, and R. Yan (2019) Semi-supervised text style transfer: cross projection in latent space. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4939–4948. Cited by: §1, §4.2, Table 1.
  • T. Shen, T. Lei, R. Barzilay, and T. Jaakkola (2017) Style transfer from non-parallel text by cross-alignment. In Advances in neural information processing systems, pp. 6830–6841. Cited by: §1, §2.1, §4.4, Table 1, 1st item, §5.2, §5.3, Table 2, 7th item, §6.3.
  • P. C. D. Silva, R. L. P. Teixeira, and V. O. A. V. Boas (2019) Computational linguistics: analysis of the functional use of microsoft text word processor text corrector. International Journal of Linguistics, Literature and Culture, LLC, pp. 23. Cited by: §3.1.
  • I. Snyder (1993) Writing with word processors: a research overview. Educational Research 35 (1), pp. 49–68. Cited by: §3.1.
  • A. Sudhakar, B. Upadhyay, and A. Maheswaran (2019) “Transforming” delete, retrieve, generate approach for controlled text style transfer. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3260–3270. Cited by: §1, §4.3, Table 1, 2nd item, §6.2, §6.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2.2, §2, 1st item, §4.2.
  • Y. Tian, Z. Hu, and Z. Yu (2018) Structured content preservation for unsupervised text style transfer. CoRR abs/1810.06526. External Links: Link, 1810.06526 Cited by: §1, §2.1, §2.2, §4.6, Table 1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.2, §4.3, §4.6.
  • P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. Manzagol (2010) Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of machine learning research 11 (Dec), pp. 3371–3408. Cited by: §4.6.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In Advances in neural information processing systems, pp. 2692–2700. Cited by: §4.2.
  • K. Wang, H. Hua, and X. Wan (2019a) Controllable unsupervised text attribute transfer via editing entangled latent representation. In Advances in Neural Information Processing Systems, pp. 11034–11044. Cited by: §1, §2.1, §4.7, Table 1.
  • Y. Wang, Y. Wu, L. Mou, Z. Li, and W. Chao (2019b) Harnessing pre-trained neural networks with rules for formality style transfer. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3564–3569. Cited by: §1, §4.2, Table 1.
  • C. Wu, X. Ren, F. Luo, and X. Sun (2019a) A hierarchical reinforced sequence operation method for unsupervised text style transfer. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4873–4883. Cited by: §1, §4.3, Table 1, 4th item, §6.2, §6.2.
  • X. Wu, T. Zhang, L. Zang, J. Han, and S. Hu (2019b) Mask and infill: applying masked language model to sentiment transfer. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 5271–5277. Cited by: §4.3.
  • A. Xu, Z. Liu, Y. Guo, V. Sinha, and R. Akkiraju (2017) A new chatbot for customer service on social media. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 3506–3510. Cited by: §3.3.
  • J. Xu, X. Sun, Q. Zeng, X. Zhang, X. Ren, H. Wang, and W. Li (2018) Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 979–988. Cited by: §1, §4.3, Table 1, 3rd item, §6.2.
  • P. Xu, Y. Cao, and J. C. K. Cheung (2019a) On variational learning of controllable representations for text without supervision. CoRR abs/1905.11975. External Links: Link, 1905.11975 Cited by: §1, §2.1, §4.7, Table 1.
  • R. Xu, T. Ge, and F. Wei (2019b) Formality style transfer with hybrid textual annotations. CoRR abs/1903.06353. External Links: Link, 1903.06353 Cited by: §1, §4.2, Table 1.
  • W. Xu, A. Ritter, B. Dolan, R. Grishman, and C. Cherry (2012) Paraphrasing for style. In Proceedings of COLING 2012, pp. 2899–2914. Cited by: §1, 1st item, §4.2, §5.1, Table 2.
  • Z. Yang, Z. Hu, C. Dyer, E. P. Xing, and T. Berg-Kirkpatrick (2018) Unsupervised text style transfer using language models as discriminators. In Advances in Neural Information Processing Systems, pp. 7287–7298. Cited by: §1, §2.1, §4.4, Table 1.
  • D. Yin, S. Huang, X. Dai, and J. Chen (2019) Utilizing non-parallel text for style transfer by making partial comparisons. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 5379–5386. Cited by: §1, §2.1, §4.4, §4.4, Table 1.
  • M. Young (2002) The technical writer’s handbook: writing with style and clarity. University Science Books. Cited by: §3.1.
  • Y. Zhang, N. Ding, and R. Soricut (2018a) SHAPED: shared-private encoder-decoder for text style adaptation. In Proceedings of NAACL-HLT, pp. 1528–1538. Cited by: §1, §2.1, §2.2, §4.6, Table 1.
  • Y. Zhang, T. Ge, and X. Sun (2020) Parallel data augmentation for formality style transfer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Cited by: §4.2, Table 1.
  • Y. Zhang, J. Xu, P. Yang, and X. Sun (2018b) Learning sentiment memories for sentiment modification without parallel data. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1103–1108. Cited by: §1, §4.3, Table 1, 5th item, §6.2.
  • Z. Zhang, S. Ren, S. Liu, J. Wang, P. Chen, M. Li, M. Zhou, and E. Chen (2018c) Style transfer as unsupervised machine translation. CoRR abs/1808.07894. External Links: Link, 1808.07894 Cited by: §2.1, §4.5, Table 1.
  • J. Zhao, Y. Kim, K. Zhang, A. M. Rush, and Y. LeCun (2018a) Adversarially regularized autoencoders. In 35th International Conference on Machine Learning, ICML 2018, pp. 9405–9420. Cited by: §1, §2.1, §4.4, Table 1, 6th item.
  • Y. Zhao, W. Bi, D. Cai, X. Liu, K. Tu, and S. Shi (2018b) Language style transfer from sentences with arbitrary unknown styles. CoRR abs/1808.04071. External Links: Link, 1808.04071 Cited by: §1, §2.1, §4.4, §4.4, Table 1.
  • C. Zhou, L. Chen, J. Liu, X. Xiao, J. Su, S. Guo, and H. Wu (2020a) Exploring contextual word-level style relevance for unsupervised style transfer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Cited by: §4.6, Table 1.
  • L. Zhou, J. Gao, D. Li, and H. Shum (2020b) The design and implementation of xiaoice, an empathetic social chatbot. Computational Linguistics 46 (1), pp. 53–93. Cited by: §3.3.