Text style transfer is the task of changing the style of input sentences while preserving their style-independent content. Due to its wide applications, such as sentiment transfer Hu et al. (2017); Shen et al. (2017) and text formalization Jain et al. (2019), it has become a research hotspot in natural language generation in recent years. However, due to the lack of parallel training data, researchers mainly focus on unsupervised style transfer.
In this aspect, many approaches Hu et al. (2017); Shen et al. (2017); Fu et al. (2018) resort to an auto-encoding framework, where the encoder is used to disentangle content and style, and the decoder generates the output sentence with the target style. Another line of research Xu et al. (2018); Li et al. (2018) focuses on removing the style markers of the input sentence to obtain a style-independent sentence representation. When generating the output sentence, both lines directly feed the target style into the decoder as a whole. From another perspective, some researchers treat style transfer as a translation process and adapt unsupervised machine translation to this task Logeswaran et al. (2018); Zhang et al. (2018); Lample et al. (2019), where the style switch is implicitly achieved. Besides, many recent models further explore this task in different ways, such as gradient-based optimization Liu et al. (2019), dual reinforcement learning Luo et al. (2019), hierarchical reinforcement learning Wu et al. (2019) and Transformer-based models Dai et al. (2019). Overall, in these models, the quality of output sentences mainly depends on the content representation of the input sentence and the exploitation of the target style.
However, one main drawback of the aforementioned models is that they lack fine-grained control over the influence of the target style on the generation process, limiting the potential for further improvement. Intuitively, the frequencies with which words occur in sentences of different styles are distinct, and thus different words are related to the styles to different degrees. In view of this, we believe that during ideal style transfer, the impact of the target style should be distinguished depending on different words. If we equip the current style transfer model with a neural network component that can automatically quantify the style relevance of the output sentence at the word level, the performance of the model is expected to be further improved.
In this paper, we propose a novel attentional sequence-to-sequence (Seq2seq) model that dynamically predicts and exploits the relevance of each output word to the target style for unsupervised style transfer. Specifically, we first pre-train a style classifier, with which the relevance of each input word to the original style can be quantified through layer-wise relevance propagation (LRP) Bach et al. (2015). After that, in a denoising auto-encoding manner, we train a basic attentional Seq2seq model to reconstruct the input sentence and simultaneously repredict its previously quantified word-level style relevance. In this way, the model is endowed with the ability to automatically predict the style relevance of each output word. Then, we equip the decoder of this model with a neural style component to exploit the predicted word-level style relevance for better style transfer. Particularly, we fine-tune this model using a carefully-designed objective function involving style transfer, style relevance consistency, content preservation and fluency modeling loss terms.
Compared with previous approaches, our proposed model avoids the complex disentanglement procedure, whose quality cannot be guaranteed. Our model also avoids the source-side information loss caused by unsatisfactory disentanglement or the explicit removal of style markers. More importantly, it is capable of achieving fine-grained control over the impact of the target style on different output words, leading to better style transfer. Our contributions can be summarized as follows:
- We explore a training approach based on LRP and denoising auto-encoding for the Seq2seq style transfer model, which enables the model to automatically predict the word-level style relevance of output sentences;
- We propose a novel Seq2seq model, which exploits the predicted word-level style relevance of output sentences for better style transfer. To the best of our knowledge, text style transfer with fine-grained style control has not been explored before;
- Experimental results and in-depth analysis on two benchmark datasets strongly demonstrate the effectiveness of our model. We release our code at https://github.com/PaddlePaddle/Research/tree/master/NLP/ACL2020-WST
2 Related Work
In recent years, unsupervised text style transfer has attracted increasing attention. Most previous work Hu et al. (2017); Shen et al. (2017); Fu et al. (2018); Prabhumoye et al. (2018); Xu et al. (2018); Li et al. (2018) aimed at producing a style-independent content representation from the input sentence and generating the output with the target style. For example, Hu et al. (2017) employed a variational auto-encoder with an attribute classifier as discriminator, forcing the disentanglement of specific attributes and content in the latent representation. Shen et al. (2017) exploited an auto-encoder framework with an adversarial style discriminator to obtain a shared latent space cross-aligning the content of text from different styles. Based on multi-task learning and adversarial training of deep neural networks, Fu et al. (2018) explored two models to learn style transfer from non-parallel data. Prabhumoye et al. (2018) learned a latent representation of the input sentence to better preserve its meaning while reducing stylistic properties, and then exploited adversarial training and multi-task learning techniques to make the output match the desired style. Although these works have shown effectiveness to some extent, as analyzed by some recent work Li et al. (2017); Lample et al. (2019), their style discriminators are prone to being fooled.
Meanwhile, some studies Luo et al. (2019); Li et al. (2018) explicitly removed style-related words identified by a pre-trained classifier to obtain a style-independent content representation, and then added the target style to generate output sentences. Nevertheless, such approaches tend to cause information loss from the input sentence, since its style-related words often contain meaningful content. Besides, several studies Logeswaran et al. (2018); Zhang et al. (2018); Lample et al. (2019) adopted back-translation to build style transfer models without parallel data. Logeswaran et al. (2018) introduced a reconstruction loss interpolating auto-encoding and back-translation loss components, where attribute compatibility is encouraged by a discriminator. Along this line, Zhang et al. (2018) and Lample et al. (2019) directly adapted unsupervised machine translation approaches to this task, where the style transfer is implicitly achieved via iterative back-translation between texts in different styles.
Very recently, some attempts Liu et al. (2019); Luo et al. (2019); Wu et al. (2019); Dai et al. (2019) have been made to perform style transfer from other perspectives. For example, Liu et al. (2019) mapped a discrete sentence into a continuous space and then used gradient-based optimization with a pre-trained attribute predictor to find the latent representation satisfying desired properties (e.g., sentence length, sentiment). Luo et al. (2019) performed a one-step mapping to directly transfer the style of the original sentences via dual reinforcement learning. Wu et al. (2019) adopted a hierarchical reinforced sequence operation method to iteratively revise the words of original sentences. Dai et al. (2019) proposed a Transformer-based Vaswani et al. (2017) style transfer model without disentangling the latent representation. Finally, note that exploring word-level style relevance has also been studied in other NLP tasks, such as machine translation Zeng et al. (2018); Su et al. (2019).
3 Our Model
Given a set of labelled training instances $\{(x^{(k)}, s^{(k)})\}_{k=1}^{K}$, where $s^{(k)}$ is the style label of the sentence $x^{(k)}$, we aim to train a style transfer model that can automatically convert an input sentence $x = x_1 x_2 \dots x_N$ with the original style $s$ into a content-invariant output sentence $y = y_1 y_2 \dots y_M$ with the target style $\hat{s}$.
To achieve this goal, we extend the standard attentional Seq2seq model Sutskever et al. (2014); Bahdanau et al. (2015) by equipping its decoder with a neural style component to achieve fine-grained control over the impacts of target style on different output words. As shown in Figure 1, the training of our model consists of two stages, which will be described below.
3.1 Stage 1: Train a Basic Attentional Seq2seq Model with Repredicting Word-level Style Relevance
At this stage, we first introduce a pre-trained style classifier to quantify the word-level relevance of training sentences to the original style via LRP Bach et al. (2015). Then, we train a basic attentional Seq2seq model in a denoising auto-encoding manner, where this model is required to reconstruct the input sentence and repredict its word-level style relevance simultaneously. By doing so, our model acquires the preliminary ability to predict the style relevance of output words and reconstruct input sentences, which makes the training in the subsequent stage easier. Next, we briefly describe our basic model, and then introduce its objective function in detail.
3.1.1 Attentional Seq2seq Model
It mainly consists of an encoder and a decoder with an attention mechanism.
The encoder is a forward GRU network. Taking the sentence $x = x_1 x_2 \dots x_N$ as input, this network maps the input words into a hidden state sequence $\{h_i\}_{i=1}^{N}$ as $h_i = \mathrm{GRU}(h_{i-1}, e(x_i))$, where $e(x_i)$ and $h_i$ denote the embedding vector and the hidden state of the $i$-th word, respectively. Specially, the last hidden state $h_N$ is used to initialize the decoder hidden state $d_0$.
The decoder is also a forward GRU network. Its hidden state is updated by $d_t = \mathrm{GRU}(d_{t-1}, [e(y_{t-1}); c_t])$, where $d_t$ is the decoder hidden state at the $t$-th timestep, $e(y_{t-1})$ is the embedding vector of the previously generated word $y_{t-1}$, and $c_t$ denotes the corresponding attention-based context vector. Formally, $c_t$ is defined as the weighted sum of all hidden states of input words:

$c_t = \sum_{i=1}^{N} \alpha_{t,i} h_i$, with $\alpha_{t,i} = \mathrm{softmax}_i\big(v_a^\top \tanh(W_a d_{t-1} + U_a h_i)\big)$,   (1)

where $v_a$, $W_a$ and $U_a$ are trainable parameters. Finally, the output prediction probability over the vocabulary is calculated as

$P(y_t \mid y_{<t}, x) = \mathrm{softmax}(W_o d_t)$,   (2)

where $W_o$ is a learnable matrix. Please note that all bias terms in the above equations are also trainable parameters, which are omitted for simplicity of notation.
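As a concrete illustration, the attention step can be sketched in a few lines of plain Python. This is a simplified dot-product scoring variant (the model itself scores with trainable parameters), and `attention_context` is a hypothetical helper name:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_context(dec_state, enc_states):
    """Score each encoder hidden state against the previous decoder state,
    normalize the scores with a softmax, and return the weighted sum of the
    encoder states as the context vector c_t (cf. Equation 1)."""
    scores = [sum(d * h for d, h in zip(dec_state, enc)) for enc in enc_states]
    alphas = softmax(scores)
    dim = len(enc_states[0])
    return [sum(a * enc[k] for a, enc in zip(alphas, enc_states)) for k in range(dim)]
```

Since the attention weights sum to one, the returned vector is a convex combination of the encoder hidden states.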
3.1.2 The Objective Function
To effectively train the above basic model, we define the following objective function $J_1$:

$J_1(\theta, \theta_r) = L_{rec}(\theta) + L_{rel}(\theta, \theta_r)$,   (3)

where $L_{rec}$ and $L_{rel}$ denote the sentence reconstruction loss and the style relevance restoration loss, respectively.
1. Sentence reconstruction loss $L_{rec}$: Using this loss, we expect our model to capture informative features for reconstructing the sentence. Formally, we define $L_{rec}$ as follows:

$L_{rec}(\theta) = -\log P_\theta(x \mid \tilde{x})$,   (4)

where $\theta$ denotes the parameters of this Seq2seq model, and $\tilde{x}$ is the partially corrupted version of $x$, in which a certain proportion of input words are randomly replaced to prevent our model from simply copying $x$.
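The corruption step can be sketched as follows; replacing dropped words with an `<unk>` token is an assumption, since the text only states that a certain proportion of words are randomly replaced:

```python
import random

def corrupt(tokens, drop_prob=0.1, unk="<unk>", seed=0):
    """Randomly replace a proportion of input words so the denoising
    auto-encoder cannot simply copy its input. The replacement token and
    drop probability here are illustrative, not the paper's exact choices."""
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    return [unk if rng.random() < drop_prob else t for t in tokens]
```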
2. Style relevance restoration loss $L_{rel}$: It is used to measure how well the word-level style relevance of an input sentence can be repredicted during the denoising auto-encoding. Formally, it is defined as

$L_{rel}(\theta, \theta_r) = \sum_{t=1}^{N} (\hat{r}_t - r_t)^2$,   (5)

where $r_t$ and $\hat{r}_t$ denote the style relevance of the $t$-th input and output word, respectively, and $\theta_r$ denotes the set of other parameters used to calculate $\hat{r}_t$ (see Equation 6). It is notable that $r_t$ and $\hat{r}_t$ are not involved in the sentence reconstruction. Apparently, two key issues arise, namely, how to calculate $\hat{r}_t$ and $r_t$.
As for $\hat{r}_t$, we calculate it based on the previous decoder hidden state $d_{t-1}$:

$\hat{r}_t = \sigma\big(v_r^\top \tanh(W_r d_{t-1})\big)$,   (6)

where $W_r$ and $v_r$ form the previously-mentioned parameter set $\theta_r$ (see Equation 5).
As for $r_t$, we employ LRP Bach et al. (2015), which has been widely used to measure the contributions of neurons to the final prediction of a classifier, to quantify the word-level style relevance of sentences. Concretely, we calculate the relevance score $R_i^{(l)}$ of the $i$-th neuron at the $l$-th layer in a manner similar to back-propagation:

$R_i^{(l)} = \sum_{j} \frac{z_{ij}}{\sum_{i'} z_{i'j}} R_j^{(l+1)}$,   (7)

where $z_{ij} = a_i w_{ij}$. Here, $w_{ij}$ denotes the weight of the edge between adjacent neurons, $a_i$ is the value of neuron $i$ that can be computed in advance during the forward-propagation, and the sums run over the neurons of each layer. In this way, we can obtain the neuron-wise contribution scores of the embedding vector of each input word to the final style prediction. Furthermore, we define the style relevance score of a word as the sum of these contribution scores, and finally map this score into the range $(0, 1)$ via a tanh(·) function, where a hyper-parameter serves as a scaling factor. In practice, since too low a style relevance may be noise, we directly treat the words with style relevance lower than a threshold as style independent and set their style relevance to 0. Experiments show that such treatment enhances the stability of our system.
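A minimal sketch of this relevance computation, assuming the epsilon-stabilized LRP rule over a single fully-connected layer; the exact stabilizer and taking the absolute score before the tanh mapping are assumptions:

```python
import math

def lrp_backward(relevance_next, activations, weights, eps=1e-6):
    """One LRP step: redistribute the relevance of layer l+1 onto layer l in
    proportion to each neuron's contribution z_ij = a_i * w_ij (cf. Equation 7).
    `weights[i][j]` connects input neuron i to output neuron j."""
    n_in, n_out = len(activations), len(relevance_next)
    rel = [0.0] * n_in
    for j in range(n_out):
        z = [activations[i] * weights[i][j] for i in range(n_in)]
        denom = sum(z)
        denom = denom + eps if denom >= 0 else denom - eps  # stabilizer
        for i in range(n_in):
            rel[i] += z[i] / denom * relevance_next[j]
    return rel

def style_relevance(raw_score, gamma=1.0, tau=0.3):
    """Map a summed relevance score into (0, 1) with tanh (gamma is the
    scaling factor) and zero out scores below the noise threshold tau."""
    r = math.tanh(gamma * abs(raw_score))
    return r if r >= tau else 0.0
```

The rule is approximately conservative: a layer's relevance mass is redistributed rather than created, which is what makes per-word sums interpretable as contributions to the classifier's decision.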
3.2 Stage 2: Fine-tune the Extended Model
At this stage, we extend the above basic model by equipping its decoder with a neural style component, which predicts and then exploits the style relevance of the next output word to refine its generation. As before, we first give a detailed description of our extended model, and then describe how to fine-tune it using a novel objective function involving multiple loss terms.
3.2.1 The Extended Model
Here, we omit the descriptions of our basic encoder and decoder, which are identical to those of the previously-described basic model, and only depict the newly introduced neural style component. Figure 2 shows the architecture of this component.
To incorporate word-level style relevance into our decoder, at the $t$-th timestep, we predict the style relevance $\hat{r}_t$ following Equation 6, and then use $\hat{r}_t$ to revise the decoder hidden state $d_t$ as follows:

$\tilde{d}_t = (1 - g_t) \odot d_t + g_t \odot \delta_t$,   (8)

where $\delta_t$ is the revision to the hidden state $d_t$, $g_t$ is a gate controlling to what extent $d_t$ will be revised by $\delta_t$, and $\tilde{d}_t$ is the revised hidden state used to generate $y_t$ (see Subsection 3.1.1). Note that $g_t$ reflects how much information of the target style is incorporated.
More specifically, $\delta_t$ and $g_t$ are updated as below:

$\delta_t = \mathrm{MLP}_{\theta_s}\big([e(y_{t-1}); c_t; e(\hat{s})]\big)$,   (9)
$g_t = \hat{r}_t \cdot \sigma\big(W_g [d_t; \delta_t]\big)$,   (10)

where $e(y_{t-1})$ is the embedding vector of $y_{t-1}$, $c_t$ is the context vector calculated as in Equation 1, $e(\hat{s})$ is the embedding of the target style $\hat{s}$, and $\mathrm{MLP}_{\theta_s}$ is an MLP function with the parameter set $\theta_s$ used to produce $\delta_t$.
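The gating idea can be sketched as follows. Purely for illustration, the predicted relevance (already in [0, 1)) is used directly as the gate; in the model, the gate and the revision vector are produced by learned transformations:

```python
def revise_hidden(hidden, style_delta, relevance):
    """Mix a style-conditioned revision vector into the decoder hidden state.
    `relevance` in [0, 1) plays the role of the gate: 0 leaves the state
    untouched (style-independent word), values near 1 let the target style
    dominate (strongly style-relevant word)."""
    g = relevance  # hypothetical simplification of the learned gate
    return [(1 - g) * h + g * d for h, d in zip(hidden, style_delta)]
```

This is exactly the fine-grained control described in the text: the amount of target-style information injected varies per output word.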
By doing so, we can impose fine-grained control over the influence of the target style on the generation of the next word. When our model predicts that the next word is strongly style-relevant, it encourages the decoder to produce proper stylistic content by leveraging more target style information. Conversely, the target style exerts less influence on the decoder hidden state, preventing it from being disturbed and losing the original style-independent content.
3.2.2 The Objective Function
After initializing our extended model with the parameters of the basic model, we then fine-tune it using the following objective function:

$J_2 = L_{style} + \lambda_1 L_{consist} + \lambda_2 L_{cont} + \lambda_3 L_{flu}$,   (11)

where $L_{style}$, $L_{consist}$, $L_{cont}$ and $L_{flu}$ denote the style transfer loss, the style relevance consistency loss, the content preservation loss, and the fluency modeling loss, respectively, with $\lambda_1$, $\lambda_2$ and $\lambda_3$ as their balancing parameters.
1. Style transfer loss $L_{style}$: It is used to ensure that the output sentence carries the target style. To this end, we apply the above-mentioned pre-trained style classifier (see Stage 1) to classify the style of the output sentence, where the related parameters of our model are updated to encourage the target style to be predicted from the output sentence:

$L_{style} = -\log P_{cls}(\hat{s} \mid \tilde{y})$,   (12)

where $\tilde{y}$ denotes a “soft” generated sentence based on the Gumbel-softmax distribution Jang et al. (2017), in which the representation of each word is defined as the weighted sum of word embeddings with the prediction probability at the current timestep.
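A sketch of how such a “soft” word can be built with the Gumbel-softmax trick (Jang et al., 2017); the function names are hypothetical:

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, seed=0):
    """Differentiable 'soft' one-hot sample from a categorical distribution:
    add Gumbel noise to the logits, then apply a temperature-scaled softmax."""
    rng = random.Random(seed)
    noise = [-math.log(-math.log(rng.random() + 1e-20) + 1e-20) for _ in logits]
    y = [(l + n) / tau for l, n in zip(logits, noise)]
    m = max(y)
    es = [math.exp(v - m) for v in y]
    s = sum(es)
    return [e / s for e in es]

def soft_word_embedding(logits, embeddings, tau=1.0, seed=0):
    """The 'soft' word fed to the classifier: a probability-weighted sum of
    the embedding vectors, which keeps the whole pipeline differentiable."""
    probs = gumbel_softmax(logits, tau, seed)
    dim = len(embeddings[0])
    return [sum(p * emb[k] for p, emb in zip(probs, embeddings)) for k in range(dim)]
```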
2. Style relevance consistency loss $L_{consist}$: It ensures that the predicted style relevance of each output word is consistent with its stylistic outcome as evaluated by the classifier. Specifically, during the above style classification, we apply LRP to obtain the style relevance $r'_t$ of each “soft” word, and then try to minimize the following loss:

$L_{consist} = \sum_{t=1}^{M} (\hat{r}_t - r'_t)^2$.   (13)
3. Content preservation loss $L_{cont}$: Only using the above two style-related loss terms will lead to model collapse, producing an extremely short output sentence that matches the target style but totally loses the original meaning. To address this issue, we introduce a content preservation loss to prevent the model from collapsing.

Specifically, we define the content representations of the input sentence and the output sentence as the weighted sums of their individual word embeddings according to the weights $(1 - r_t)$ and $(1 - \hat{r}_t)$, respectively. Then, we minimize the following loss term to force these two representations to be close:

$L_{cont} = \Big\| \sum_{t=1}^{N} (1 - r_t)\, e(x_t) - \sum_{t=1}^{M} (1 - \hat{r}_t)\, e(y_t) \Big\|_2^2$.   (14)

In this way, the less relevant a word is to the corresponding style, the more its embedding is considered.
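This weighting can be sketched as follows, assuming weight 1 - r for a word with style relevance r, with normalization; the exact weighting and normalization are assumptions consistent with the stated intuition:

```python
def content_representation(embeddings, relevances):
    """Weighted sum of word embeddings where style-independent words
    (low relevance r) contribute more, via weight 1 - r (normalized)."""
    weights = [1.0 - r for r in relevances]
    total = sum(weights) or 1.0  # avoid division by zero
    dim = len(embeddings[0])
    return [sum(w * e[k] for w, e in zip(weights, embeddings)) / total for k in range(dim)]

def content_loss(x_emb, x_rel, y_emb, y_rel):
    """Squared L2 distance between input and output content representations."""
    cx = content_representation(x_emb, x_rel)
    cy = content_representation(y_emb, y_rel)
    return sum((a - b) ** 2 for a, b in zip(cx, cy))
```

A fully style-relevant word (r = 1) thus contributes nothing to the content representation, so the model is free to rewrite it.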
4. Fluency modeling loss $L_{flu}$: Finally, we follow Yang et al. (2018) to introduce a bidirectional GRU based language model, which is pre-trained on the training instances with the target style, to ensure that our model can generate fluent output sentences.

For the forward direction, we aim to reduce the distribution divergence between the prediction probability vector of our model and that of the forward language model by minimizing their cross-entropy as follows:

$\overrightarrow{L}_{flu} = -\sum_{t=1}^{M} P_\theta(y_t \mid y_{<t}, x)^\top \log \overrightarrow{P}_{lm}(y_t \mid y_{<t})$,   (15)

where $P_\theta(\cdot)$ and $\overrightarrow{P}_{lm}(\cdot)$ are the predicted probability distributions produced by our model and the forward language model, respectively. Note that at each timestep, we feed the continuous approximation of the output word, defined as the weighted sum of word embeddings with the current probability vector, into the language model. For the backward direction, we directly reverse the output sentence and calculate the reverse language model loss $\overleftarrow{L}_{flu}$ in a similar way. Finally, we directly define the total fluency modeling loss as the average of $\overrightarrow{L}_{flu}$ and $\overleftarrow{L}_{flu}$.
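The forward term can be sketched as a cross-entropy between two per-timestep distributions; averaging over timesteps here is an assumption:

```python
import math

def fluency_loss(model_probs, lm_probs, eps=1e-12):
    """Average per-timestep cross-entropy H(p_model, p_lm): penalizes the
    transfer model for putting probability mass on words the pre-trained
    language model finds unlikely. Each argument is a list of per-timestep
    probability distributions over the vocabulary."""
    ce = 0.0
    for p_step, q_step in zip(model_probs, lm_probs):
        ce += -sum(p * math.log(q + eps) for p, q in zip(p_step, q_step))
    return ce / len(model_probs)
```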
| Model | Acc | BLEU | GM | HM | Acc | BLEU | GM | HM |
|---|---|---|---|---|---|---|---|---|
| CrossAlign Shen et al. (2017) | 75.3 | 17.9 | 36.7 | 28.9 | 70.5 | 3.6 | 15.9 | 6.8 |
| DelRetri Li et al. (2018) | 89.0 | 31.1 | 52.6 | 46.1 | 55.2 | 21.1 | 34.2 | 30.6 |
| Unpaired Xu et al. (2018) | 64.9 | 37.0 | 49.0 | 47.1 | 79.5 | 2.0 | 12.6 | 3.9 |
| UnsuperMT Zhang et al. (2018) | 95.4 | 44.5 | 65.1 | 60.7 | 70.8 | 33.4 | 48.6 | 45.4 |
| DualRL Luo et al. (2019) | 85.6 | 55.2 | 68.7 | 67.1 | 71.1 | 41.9 | 54.6 | 52.7 |
| PoiGen Wu et al. (2019) | 91.5 | 59.0 | 73.5 | 71.8 | 46.2 | 45.8 | 46.0 | 46.0 |

Table 1: Automatic evaluation results on YELP (left four columns) and GYAFC (right four columns). Acc and BLEU denote the transfer accuracy and BLEU score, and GM and HM their geometric and harmonic means, respectively. Numbers in bold mean that the improvement over the best performing baseline is statistically significant (t-test with p-value < 0.05).
| Model | Acc | Con | Flu | Avg | Acc | Con | Flu | Avg |
|---|---|---|---|---|---|---|---|---|
| DelRetri Li et al. (2018) | 2.18 | 2.21 | 2.40 | 2.26 | 1.53 | 1.55 | 1.62 | 1.57 |
| UnsuperMT Zhang et al. (2018) | 3.26 | 3.07 | 3.24 | 3.19 | 2.46 | 2.42 | 2.75 | 2.54 |
| DualRL Luo et al. (2019) | 3.31 | 3.43 | 3.47 | 3.40 | 2.26 | 2.28 | 2.36 | 2.30 |
| PoiGen Wu et al. (2019) | 3.42 | 3.51 | 3.54 | 3.49 | 1.39 | 1.52 | 1.43 | 1.45 |

Table 2: Human evaluation results on YELP (left four columns) and GYAFC (right four columns). Acc, Con and Flu denote style transfer accuracy, content preservation and fluency, and Avg is their average.
4 Experiments
4.1 Datasets

YELP: This dataset is comprised of restaurant and business reviews and has been widely used in sentiment transfer. To evaluate our model, we adopt the human references released by Luo et al. (2019), which provide four references for each sentence in the test set. Following common practice Shen et al. (2017); Li et al. (2018), we choose reviews over 3 stars as positive instances and those under 3 stars as negative instances. The splitting of train, development and test sets is in accordance with the setting in Li et al. (2018). Moreover, we filter out sentences with more than 16 words, leaving roughly 448K, 64K and 1K sentences in the train, development and test sets, respectively.

GYAFC: The parallel data of GYAFC Rao and Tetreault (2018) consists of formal and informal texts and provides four human references for each test sentence. Particularly, we use this dataset in a non-alignment setting during training. There are roughly 102K, 5K and 1K sentences remaining in the train, development and test sets, respectively.
4.3 Training Details
As for the threshold for filtering noisy style relevance, we set it to 0.3 by observing model performance on the development set. For the training of Stage 2, we empirically set the learning rate and clip the gradients when their norms exceed a fixed threshold. We search the three balancing hyper-parameters of the objective function over the ranges [0.5, 1.5] with step 0.1, [0.5, 5] with step 0.5, and [0.1, 2] with step 0.1, respectively. As before, the overall performance of our model on the development set is employed to guide the hyper-parameter search, and we choose the best-performing values.
4.4 Automatic Evaluation
We evaluate the quality of output sentences in terms of transfer accuracy and content preservation. Following previous work Luo et al. (2019), we use the pre-trained style classifier to calculate the transfer accuracy of output sentences. The classifier achieves 97.8% and 88.3% accuracy on the test sets of YELP and GYAFC, respectively. Moreover, we compute the BLEU scores of output sentences to measure content preservation. Finally, we report the geometric mean and harmonic mean of these metrics, which quantify the overall performance of various models.
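The two overall metrics reduce to one-line formulas; plugging in the DualRL row of Table 1 (Acc 85.6, BLEU 55.2 on YELP) reproduces the reported 68.7 and 67.1:

```python
import math

def overall_scores(acc, bleu):
    """Geometric and harmonic means of transfer accuracy and BLEU
    (both on a 0-100 scale), the two overall metrics of Table 1."""
    gm = math.sqrt(acc * bleu)
    hm = 2 * acc * bleu / (acc + bleu) if acc + bleu > 0 else 0.0
    return gm, hm
```

Both means penalize models that trade one metric for the other, which is why they are preferred over a simple average for summarizing the accuracy/preservation trade-off.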
Experimental results in Table 1 show that our model achieves the best overall performance among all models.
4.5 Human Evaluation
We invite 3 annotators with linguistic background to evaluate the output sentences, and use Fleiss' kappa to quantify the agreement among them; the Fleiss' kappa score is 0.76 for the YELP dataset and 0.80 for the GYAFC dataset. The accuracy of style transfer (Acc), the preservation of original content (Con) and the fluency (Flu) are the three aspects of model performance we consider. Following the criteria introduced in Zhang et al. (2018), the annotators are required to score each aspect of the sentences from 1 to 5.
Table 2 shows the human evaluation results. Our model achieves the best performance on both datasets in almost every aspect, except that the Acc score of our model is slightly lower than that of PoiGen on YELP. Interestingly, the automatic transfer accuracy of our model is higher than that of PoiGen on YELP (see Table 1); this discrepancy may be due to errors of the pre-trained style classifier. Note that the content preservation of our model is significantly higher than that of the others, showing that our word-level control actually preserves more style-independent content of the original sentences.
4.6 Ablation Studies
Compared with previous studies, the training of our model contains two stages, involving a neural style component and several novel loss terms, such as the style relevance restoration loss (see Equation 3) and the style relevance consistency, content preservation and fluency modeling losses (see Equation 11). To fully investigate their effects on our model, we conduct extensive ablation studies on YELP, which is larger than GYAFC. Specifically, we compare our model with the following variants:
-NSC: A variant of our model, where the proposed neural style component is removed from the model. It should be noted that this variant is actually the basic model, which is only trained at Stage 1.
NSC-: It is a variant of our model, which is equipped with a neural style component but does not use the predicted word-level style relevance (see Equation 8). Note that this variant does not have the ability of fine-grained control over the influence of the target style on the generation process.
-: In this model, the style relevance restoration loss is directly removed from Equation 3 at Stage 1.
A variant of our model, where its content preservation loss is modified so that all words are equally considered:

$L'_{cont} = \Big\| \sum_{t=1}^{N} e(x_t) - \sum_{t=1}^{M} e(y_t) \Big\|_2^2$.

Compared with the original loss (see Equation 14), all words are equally considered in this variant. Thus, through this experiment, we can investigate the impact of differential word-level style relevance modeling on our model. (Because our model would collapse if the content preservation loss were completely removed from the objective function, we do not compare our model with a variant without this loss.)

-: It is also a variant of our model, where the weight of the style relevance consistency loss is set to 0 (see Equation 11).
-: A variant of our model with the weight of the fluency modeling loss set to 0 (see Equation 11).
Finetuning-: For this model, we fix all parameters trained at Stage 1 when training at Stage 2.
Table 3 lists the experimental results. We can observe that most variants are significantly inferior to our model in terms of BLEU score. Particularly, although the Acc of some variants increases, these models may overly change the original content to conduct the transfer, still resulting in lower BLEU scores. These results demonstrate the effectiveness of our introduced neural style component, the different loss terms and the two-stage training strategy. As an exception to the above observations, when we replace the content preservation loss with its equally-weighted variant, the BLEU score increases but the Acc drops significantly. This is because the equally-weighted loss does not discriminate between words of different style relevance and overly constrains the model to keep its original content.
4.7 Case Study
We conduct a case study to understand the advantage of our model. Figure 3 displays several instances of input and output sentences with word-level style relevance. For example, according to the word-level style relevance from LRP, we can observe that in the first input YELP sentence, the words "slow" and "lazy" are most related to the original style, while other words hardly contribute to the style of the whole sentence. Meanwhile, some words of the first output sentence, such as "really", "good" and "friendly", are predicted to be most relevant to the target style. Our model successfully exploits the predicted style relevance, changing the style-related words into "good" and "friendly" while keeping the other parts unchanged. In the second, informal-to-formal transfer case, besides replacing "just" with "simply", our model also appends a "." token with high predicted style relevance, which is a sign characterizing formality.
From Figure 3, we can also see that the relevance scores of words in GYAFC are less discriminative than those in YELP. Thus, we provide corpus-level statistics by counting the frequencies of some typical GYAFC words shown in Figure 3, which indicate that the predicted scores often reflect the distribution of words across different styles. For example, "simply" appears 29 and 380 times in the informal and formal sets, respectively. The period token "." is an indicative marker of text formality, since many informal sentences end with no punctuation. In contrast, "your", "about" and "this" are distributed uniformly across styles: there are 4,590 and 5,357 occurrences of "your", 2,489 and 2,353 of "about", and 2,062 and 2,211 of "this" in the informal and formal sets, respectively.
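A hypothetical corpus-level indicator consistent with these counts can be computed directly from the per-style frequencies:

```python
def style_skew(word, informal_count, formal_count):
    """Hypothetical indicator (not from the paper): the proportion of a
    word's occurrences that fall in the formal set. Values near 0.5 suggest
    a style-independent word; values near 0 or 1 suggest a style marker."""
    total = informal_count + formal_count
    return formal_count / total if total else 0.5
```

With the GYAFC counts above, "simply" skews strongly formal while "about" sits near 0.5, matching the predicted word-level relevance scores.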
These results are consistent with our intuition, verifying the correlation between the predicted style relevance of each word and its actual stylistic outcome.
5 Conclusion

This paper has proposed a novel attentional Seq2seq model equipped with a neural style component for unsupervised style transfer. Using the quantified style relevance from a pre-trained style classifier as supervision, our model is first trained to reconstruct input sentences and simultaneously repredict their word-level style relevance. Then, equipped with the style component, our model exploits the predicted word-level style relevance for better style transfer. Experiments on two benchmark datasets demonstrate the superiority of our model over several competitive baselines.
Acknowledgments

This work was supported by the Beijing Advanced Innovation Center for Language Resources (No. TYR17002), the National Key R&D Project of China (No. 2018AAA0101900), the National Natural Science Foundation of China (No. 61672440), and the Scientific Research Project of National Language Committee of China (No. YB135-49).
References

- Bach et al. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE.
- Bahdanau et al. (2015). Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR 2015.
- Dai et al. (2019). Style transformer: unpaired text style transfer without disentangled latent representation. In Proceedings of ACL 2019.
- Fu et al. (2018). Style transfer in text: exploration and evaluation. In Proceedings of AAAI 2018.
- Hu et al. (2017). Toward controlled generation of text. In Proceedings of ICML 2017.
- Jain et al. (2019). Unsupervised controllable text formalization. In Proceedings of AAAI 2019.
- Jang et al. (2017). Categorical reparameterization with Gumbel-softmax. In Proceedings of ICLR 2017.
- Kim (2014). Convolutional neural networks for sentence classification. In Proceedings of EMNLP 2014.
- Lample et al. (2019). Multiple-attribute text rewriting. In Proceedings of ICLR 2019.
- Li et al. (2017). Adversarial learning for neural dialogue generation. In Proceedings of EMNLP 2017.
- Li et al. (2018). Delete, retrieve, generate: a simple approach to sentiment and style transfer. In Proceedings of NAACL 2018.
- Liu et al. (2019). Revision in continuous space: fine-grained control of text style transfer. CoRR abs/1905.12304.
- Logeswaran et al. (2018). Content preserving text generation with attribute controls. In Proceedings of NIPS 2018.
- Luo et al. (2019). A dual reinforcement learning framework for unsupervised text style transfer. CoRR.
- Prabhumoye et al. (2018). Style transfer through back-translation. In Proceedings of ACL 2018.
- Rao and Tetreault (2018). Dear sir or madam, may I introduce the GYAFC dataset: corpus, benchmarks and metrics for formality style transfer. In Proceedings of NAACL 2018.
- Shen et al. (2017). Style transfer from non-parallel text by cross-alignment. In Proceedings of NIPS 2017.
- Su et al. (2018). Variational recurrent neural machine translation. In Proceedings of AAAI 2018.
- Su et al. (2019). Exploring discriminative word-level domain contexts for multi-domain neural machine translation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Proceedings of NIPS 2014.
- Vaswani et al. (2017). Attention is all you need. In Proceedings of NIPS 2017.
- Wu et al. (2019). A hierarchical reinforced sequence operation method for unsupervised text style transfer. In Proceedings of ACL 2019.
- Xu et al. (2018). Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach. In Proceedings of ACL 2018.
- Yang et al. (2018). Unsupervised text style transfer using language models as discriminators. In Proceedings of NIPS 2018.
- Zeng et al. (2018). Multi-domain neural machine translation with word-level domain context discrimination. In Proceedings of EMNLP 2018.
- Zhang et al. (2016). Variational neural machine translation. In Proceedings of EMNLP 2016.
- Zhang et al. (2018). Style transfer as unsupervised machine translation. CoRR.