Text style transfer is the task of rewriting a given sentence so that its style changes while its content is preserved. The style of a sentence refers to a predefined class (e.g. sentiment, formality, tense), and the content refers to everything in the sentence other than the style. The lack of parallel data makes the task difficult: because no ground-truth transferred sentences are available, the problem cannot be solved by standard supervised learning.
One line of previous work [hu2017toward, shen2017style, fu2018style, prabhumoye-etal-2018-style, logeswaran2018content] learns latent representations that separate style from content. First, these approaches use adversarial training to learn a disentangled latent representation of content and style. Second, the decoder generates a transferred sentence by combining the disentangled latent representation with the target style. However, the experimental results of [lample2018multipleattribute] report that a disentangled latent representation is hard to obtain through adversarial training and is not necessary. In addition, adversarial training is not effective at encoding sentences of various lengths into a fixed-length vector representation.
Other methods of text style transfer do not depend on disentanglement. [dai-etal-2019-style, lample2018multipleattribute, ijcai2019-711] do not attempt to find a disentangled latent representation of the sentence, so sentences with different styles are mapped to the same space. [xu-etal-2018-unpaired, li-etal-2018-delete, sudhakar-etal-2019-transforming, ijcai2019-732] neutralize sentences by deleting style-dependent attribute markers. The content tokens that remain after the attribute markers are deleted are style independent, and they are combined with a style attribute to generate the transferred sentence.
We propose a two-stage approach, Delete and Generate, that does not use adversarial training for disentanglement. (1) Attribute markers of a sentence are identified using a pre-trained classifier as the Delete model and are then deleted. Our method is model-agnostic and is not affected by the design of the classifier. (2) A transferred sentence is generated by combining the target attribute with the content tokens that remain after stage 1. The Generate model consists of an encoder and a decoder with the Transformer structure.
Among methods that delete attribute markers, [li-etal-2018-delete] deletes them statistically using a frequency ratio, and [sudhakar-etal-2019-transforming, xu-etal-2018-unpaired] delete them using the attention weights of a classifier. [ijcai2019-732] deletes attribute markers by fusing the frequency ratio and the attention weights. We introduce an intuitive delete method that uses the change in classifier probability: if the classifier probability changes significantly when a token is removed from the sentence, that token is considered an attribute marker. Unlike previous methods, our method does not need to build attribute dictionaries or define attention weights, and it makes the trade-off between content and style easy to control.
We test our method on two sentiment transfer datasets: Yelp reviews and Amazon reviews. We evaluate content preservation, fluency, style accuracy, and semantics. Content and style accuracy are measured as in previous studies. Fluency is measured in two ways: general-fluency using pre-trained GPT-2 [radford2019language] and data-fluency using fine-tuned GPT-1 [Radford2018ImprovingLU]. Semantics is newly evaluated in this paper using BERTscore [zhang*2020bertscore], whose goal is to evaluate semantic equivalence between two sentences. We thus use the pre-trained models GPT and BERT [devlin-etal-2019-bert], which perform well in natural language processing and generation, to evaluate transferred sentences with various automatic metrics. Since automatic evaluations are not perfect measures of generated sentences, it is hard to know which system is the best, but we can determine which systems have problems. The comparison models are unstable on some evaluation metrics, whereas our proposed model has stable results on all automatic evaluations; we therefore call it SST (Stable Style Transformer). In addition, by generating sentences through latent space walking in the vector space of the style attribute token, we make a first observation of a point where the style-control ability can be enhanced.
2 Related Work
One line of text style transfer research [shen2017style, fu2018style, hu2017toward, prabhumoye-etal-2018-style, logeswaran2018content] separates content and style through disentangled representation learning. [hu2017toward] uses the VAE model to derive the disentanglement of the content between the generated sentence and the original sentence through KL loss. [shen2017style] introduces the aligned auto-encoder and the cross-aligned auto-encoder, which are trained with discriminators. [fu2018style] proposes the multi-decoder and StyleEmbedding models: the multi-decoder model has a separate decoder for each style, while the StyleEmbedding model uses a single decoder and feeds a style embedding into it. The methods of [prabhumoye-etal-2018-style, logeswaran2018content] use back-translation to learn latent representations.
The second line of text style transfer research does not rely on learning disentangled latent representations. The first approach [xu-etal-2018-unpaired, li-etal-2018-delete, sudhakar-etal-2019-transforming, ijcai2019-732] finds and deletes tokens, called attribute markers, that are highly related to style. [li-etal-2018-delete] deletes attribute markers with a statistical method based on the frequency ratio, and [sudhakar-etal-2019-transforming, xu-etal-2018-unpaired] use the attention scores of a Transformer classifier and an LSTM classifier, respectively. [ijcai2019-732] deletes attribute markers by fusing the frequency ratio and attention scores. The second approach [dai-etal-2019-style, lample2018multipleattribute, ijcai2019-711] does not attempt to control content and style separately, so sentences with different styles are encoded into the same latent representation space. [dai-etal-2019-style, lample2018multipleattribute] are based on learning with a cycle reconstruction loss. [lample2018multipleattribute] reported that disentanglement is not easy and that latent representations learned through adversarial training are unnecessary because the learned latent representations still depend on style. Unlike the previous models, [ijcai2019-711] learns dual models in two directions, style1 (e.g. negative) to style2 (e.g. positive) and style2 to style1, by reinforcement learning.
In language modeling research, RNN-based language models are known to be weak at long-range dependencies. Recent studies of text style transfer [dai-etal-2019-style, sudhakar-etal-2019-transforming, ijcai2019-732] have therefore been conducted with the Transformer [NIPS2017-7181], which is known to perform well in language modeling. [dai-etal-2019-style] uses both the encoder and the decoder of the Transformer, and [sudhakar-etal-2019-transforming] fine-tunes a decoder on the style transfer datasets, starting from the pre-trained GPT-1. [ijcai2019-732] treats text style transfer in a way similar to Text Infilling or Cloze by presenting the Attribute Conditional Masked Language Model (AC-MLM), which uses pre-trained BERT.
In this paper, we choose the first approach of the second research line (Delete and Generate), which does not rely on latent representations, in light of the results of [lample2018multipleattribute]. Our system has both a Transformer encoder and a decoder because the style transfer task takes an input text. If the system uses only a decoder, as in [sudhakar-etal-2019-transforming], it cannot encode the content tokens bidirectionally. Conversely, if only a bidirectional encoder is used, as in AC-MLM, the positions and lengths of the masked spans to be filled in are not flexible.
In this section, we introduce our proposed method. The style transfer problem definition is described in Section 3.1. An overview of the model is shown in Section 3.2. The proposed generation process is introduced in Sections 3.3 and 3.4. The learning mechanism is described in Section 3.5.
3.1 Problem Statement
Given a dataset of sentence-label pairs $D = \{(x_i, s_i)\}_{i=1}^{N}$, where $x_i$ is a sentence, $s_i$ is its style attribute (e.g. sentiment), and $N$ is the number of examples, our goal is to train a model that generates a sentence $\hat{x}$ with a different style $\hat{s}$ while preserving the content of the sentence $x$. For example, if $x$ is "The food is salty and tasteless" and $s$ is the "negative" attribute, then $\hat{x}$ is generated to mean "The food is not salty and delicious", which has the "positive" attribute $\hat{s}$. However, the dataset is non-parallel, so the model cannot be given an $\hat{x}$ aligned with $x$.
3.2 Model Overview
Our approach consists of two stages, Delete and Generate, as shown in Fig. 1. The first stage is the Delete process: a pre-trained style classifier finds and deletes the tokens that carry most of the style. The second stage encodes the content tokens and combines them with a target style to generate a sentence. Both the encoder and the decoder have the Transformer structure, which handles long-range dependencies better than an RNN.
3.3 Stage-1: Delete process
Stage 1 finds and deletes attribute-marker tokens for a given sentence and style attribute. In previous studies, attribute markers are deleted using the frequency-ratio method, the classifier's attention scores, or a fusion of both. However, the frequency-ratio method requires a vocabulary pre-built from the training dataset and has difficulty capturing contextual information. The attention-score method constrains the structure of the classifier, because the style classifier must be built on self-attention regardless of its accuracy. It is also unclear whether the attention score is directly proportional to the attribute.
We propose a novel method of removing attribute markers that uses a pre-trained classifier and needs neither a pre-built dictionary nor attention scores. The method is model-agnostic and finds attribute markers more intuitively than previous methods. Given an input sentence $x$, the style probability is

$$p_s(x) = P_C(s \mid x),$$

where $P_C$ is the probability predicted by the classifier and $s$ is the style label. If we delete token $t_i$ from the sentence $x$, the style probability changes as follows:

$$IS(t_i) = p_s(x) - p_s(x_{\setminus t_i}),$$

where $x_{\setminus t_i}$ denotes the tokens that remain after $t_i$ is deleted. The value of the Important Score $IS(t_i)$ indicates how much the token $t_i$ affects the style classifier. Tokens are deleted in decreasing order of IS, and the Delete process ends as soon as either of the following two conditions holds: (1) $p_s$ of the remaining tokens is less than $\alpha$, or (2) the ratio of content tokens is less than $\beta$.
$\alpha$ is a hyperparameter that determines when a sentence no longer has the source style attribute, and $\beta$ is a hyperparameter that determines how much of the content is preserved. The two hyperparameters make it easy to control the trade-off between content and style; the experimental results are explained in Section 4.7.
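The Delete process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two thresholds are named `alpha` and `beta` here for illustration, the importance scores are computed once on the full sentence, the stop conditions are checked before each deletion, and `toy_negative_prob` is a hypothetical lexicon-based stand-in for the pre-trained style classifier.

```python
from typing import Callable, List


def delete_attribute_markers(
    tokens: List[str],
    style_prob: Callable[[List[str]], float],
    alpha: float,
    beta: float,
) -> List[str]:
    """Delete tokens in decreasing order of importance score until a stop condition holds."""
    base = style_prob(tokens)
    # Importance score of token i: drop in source-style probability when i is removed.
    scores = [base - style_prob(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)

    deleted = set()
    for idx in order:
        kept = [t for i, t in enumerate(tokens) if i not in deleted]
        if style_prob(kept) < alpha:          # condition (1): source style is gone
            break
        if len(kept) / len(tokens) < beta:    # condition (2): too little content left
            break
        deleted.add(idx)
    return [t for i, t in enumerate(tokens) if i not in deleted]


# Hypothetical stand-in for the pre-trained classifier: P(negative | tokens).
NEGATIVE_WEIGHTS = {"salty": 0.4, "tasteless": 0.5}


def toy_negative_prob(tokens: List[str]) -> float:
    return min(1.0, 0.05 + sum(NEGATIVE_WEIGHTS.get(t, 0.0) for t in tokens))


sentence = "the food is salty and tasteless".split()
content = delete_attribute_markers(sentence, toy_negative_prob, alpha=0.3, beta=0.0)
print(content)  # ['the', 'food', 'is', 'and']
```

Because the procedure only queries the classifier for probabilities, any classifier can be plugged in as `style_prob`, which is what makes the deletion model-agnostic. Raising `alpha` to 0.7 in this toy setting stops after a single deletion and keeps "salty".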
3.4 Stage-2: Generate process
Our model generates a transferred sentence with the encoder and the decoder of the Transformer.
All content tokens produced by the Delete process are fed to the bidirectional self-attention Transformer encoder. Explicitly, the Transformer encoder maps the content tokens $x^c$ to a continuous representation $z$ as follows:

$$z = \mathrm{Encoder}(x^c).$$
To generate a sentence with the desired style, two special tokens, style and start, are first fed to the decoder, as shown in Fig. 1. The special tokens always occupy the first positions, so no positional embedding is added to them. We use teacher forcing at training time and generate sentences without teacher forcing at test time. The Generate process ends when the special token end is generated. The decoder auto-regressively predicts the conditional probability of the next token as follows:

$$P(y_t \mid y_{<t}, z, \hat{s}) = \mathrm{softmax}(o_t),$$

where $o_t$ is the logit vector of the decoder, $z$ is the encoder output, $\hat{s}$ is the desired style, and $y_t$ is the token predicted at step $t$.
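The test-time generation loop can be illustrated as follows. This is a sketch under assumptions: it uses greedy decoding (the text only states that no teacher forcing is used at test time), the token strings `<start>`, `<end>`, `<pos>`, and `<neg>` are hypothetical names for the special tokens, and `toy_step_logits` is a scripted stand-in for the Transformer decoder that ignores the encoder output.

```python
import math
from typing import Callable, Dict, List


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def greedy_decode(
    step_logits: Callable[[List[str]], Dict[str, float]],
    style: str,
    max_len: int = 20,
) -> List[str]:
    """Start from the style and start tokens, then feed back the model's own predictions."""
    prefix = [style, "<start>"]
    output: List[str] = []
    for _ in range(max_len):
        probs = softmax(step_logits(prefix + output))
        token = max(probs, key=probs.get)  # greedy choice (an assumption)
        if token == "<end>":
            break
        output.append(token)
    return output


# Hypothetical scripted decoder: the style token alone decides what is generated.
SCRIPT = {
    "<pos>": ["the", "food", "is", "delicious", "<end>"],
    "<neg>": ["the", "food", "is", "bland", "<end>"],
}
VOCAB = ["the", "food", "is", "delicious", "bland", "<end>"]


def toy_step_logits(prefix: List[str]) -> Dict[str, float]:
    style, generated = prefix[0], prefix[2:]
    target = SCRIPT[style][len(generated)]
    return {tok: (5.0 if tok == target else 0.0) for tok in VOCAB}


positive = greedy_decode(toy_step_logits, "<pos>")
negative = greedy_decode(toy_step_logits, "<neg>")
print(positive, negative)
```

Swapping only the first token changes the generated sentence, which mirrors how the style token conditions the decoder.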
Since we only have non-parallel datasets, we cannot train on transferred sentences in a supervised way. SST is trained to minimize two losses, depending on the style condition: $s$ (source style) or $\hat{s}$ (target style).
3.5.1 Reconstruction loss
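SST reconstructs the original sentence $x$ conditioned on the content tokens $x^c$ and the source style $s$. The reconstruction loss is

$$L_{rec} = -\log P(x \mid x^c, s).$$

In non-parallel datasets, the reconstruction loss cannot be calculated if the style of the generated sentence is the target style $\hat{s}$, because no reference sentence in that style is available.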
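As a small numeric illustration, a reconstruction loss of this kind is typically a token-level negative log-likelihood of the original sentence under the decoder's per-step distributions. The sketch below assumes a sum over steps (an average is equally plausible), and the probabilities are made-up values rather than model outputs.

```python
import math
from typing import Dict, List


def reconstruction_loss(step_probs: List[Dict[str, float]], target: List[str]) -> float:
    """Sum of -log p(target token) over decoding steps."""
    return -sum(math.log(probs[tok]) for probs, tok in zip(step_probs, target))


# Made-up decoder distributions for a two-token target.
step_probs = [
    {"good": 0.5, "bad": 0.5},
    {"food": 0.25, "<end>": 0.75},
]
loss = reconstruction_loss(step_probs, ["good", "food"])
print(loss)  # -(log 0.5 + log 0.25) = log 8
```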
3.5.2 Style loss
If the model is trained only with the reconstruction loss, the decoder never learns how to transform the style, so a discrepancy arises between training time and test time. To learn how to generate a sentence with the target style $\hat{s}$, we introduce the style loss

$$L_{style} = -\log P_C(\hat{s} \mid \hat{x}),$$

where $\hat{x}$ is the transferred sentence and $P_C$ is the probability predicted by the pre-trained classifier.
The style loss thus uses a pre-trained classifier to measure whether the transferred sentence has the style $\hat{s}$. Since generated sentences are discrete, we feed soft embeddings of the predicted tokens to the classifier so that the style loss can be optimized. While SST is trained, the parameters of the classifier are not fine-tuned.
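The soft-embedding idea can be made concrete with a short sketch: instead of picking the arg-max token (which is not differentiable), each decoding step passes the probability-weighted average of the embedding rows to the classifier. The embedding table and `toy_target_prob` below are hypothetical stand-ins, and the style loss is written as a negative log-probability of the target style, which is an assumption about its exact form.

```python
import math
from typing import Callable, List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_embedding(logits: List[float], embedding_table: List[List[float]]) -> List[float]:
    """Expected embedding under the predicted token distribution."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table)) for d in range(dim)]


def style_loss(soft_embeddings: List[List[float]],
               target_prob: Callable[[List[List[float]]], float]) -> float:
    """-log of the (frozen) classifier's probability for the target style."""
    return -math.log(target_prob(soft_embeddings))


def toy_target_prob(embeddings: List[List[float]]) -> float:
    # Hypothetical classifier: sigmoid of the summed first coordinate.
    score = sum(e[0] for e in embeddings)
    return 1.0 / (1.0 + math.exp(-score))


table = [[1.0, 0.0], [0.0, 1.0]]              # two-token vocabulary, 2-d embeddings
uniform = soft_embedding([0.0, 0.0], table)   # equal mix of both rows
peaked = soft_embedding([10.0, 0.0], table)   # close to the first row
print(uniform, peaked, style_loss([peaked], toy_target_prob))
```

As the logits become more peaked, the soft embedding approaches the embedding of the arg-max token, so the classifier sees inputs similar to real sentences while gradients can still flow to the decoder.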
3.5.3 Model Details
The Transformer encoder and decoder each consist of 3 layers, and each layer has 4 heads. The style classifier consists of 5 convolution filters based on [kim-2014-convolutional]. Text is tokenized using Byte-Pair-Encoding, and the (word, style, position) embeddings are 256-dimensional vectors. In the Delete process, the two hyperparameters are held fixed at training time, and we observe the trade-off between content and style by changing them at test time.
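For reference, the stated sizes can be collected in a small configuration object. The class and field names are hypothetical; only the numbers come from the text. The per-head dimension is derived under the usual assumption that multi-head attention splits the embedding evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTConfig:
    num_layers: int = 3               # encoder and decoder layers
    num_heads: int = 4                # attention heads per layer
    embedding_dim: int = 256          # word, style, and position embeddings
    classifier_conv_filters: int = 5  # convolution filters in the style classifier

    @property
    def head_dim(self) -> int:
        # Assumes the embedding is split evenly across heads.
        return self.embedding_dim // self.num_heads


config = SSTConfig()
print(config.head_dim)  # 64
```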
In this paper, we test our model on two datasets, YELP and AMAZON, which are provided in [li-etal-2018-delete]. The Yelp dataset is for business reviews, and the Amazon dataset is product reviews. Both datasets are labeled negative and positive and are used for sentiment transfer. The datasets are split into train, dev, and test sets, and the statistics are shown in Table 1.
4.2 Human References
Human references are used to measure human-BLEU and BERTscore. We use 2 human references for Yelp and 1 for Amazon. Yelp: [li-etal-2018-delete] provides 1 human reference, and [ijcai2019-711] provides 3 additional ones. To increase reliability, we use 2 of them: the one from [li-etal-2018-delete] and the one from [ijcai2019-711] with the best performance in automatic evaluation. Amazon: We use the human reference provided by [li-etal-2018-delete].
4.3 Previous Method
We compare against previous models from three approaches. The first group comprises CrossAligned [shen2017style], [StyleEmbedding, multi-decoder] [fu2018style], and BackTranslation [prabhumoye-etal-2018-style], which attempt to separate content and style through latent representation learning. The second group comprises [DeleteOnly, DeleteAndRetrieve] [li-etal-2018-delete], UnpairedRL [xu-etal-2018-unpaired], and [B-GST, G-GST] [sudhakar-etal-2019-transforming], which delete attribute markers and then generate the sentence; [TemplateBased, RetrieveOnly] [li-etal-2018-delete] return the target sentence through retrieval without generation. The final comparison is DualRL [ijcai2019-711], which does not distinguish between content and style.
Content preserving intensity is measured by G-BLEU, the geometric mean of self-BLEU and human-BLEU, as in previous works. A high BLEU score indicates that the model is good at content preservation.
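As a worked check of the definition, take the two BLEU values reported in the SST (0.7, 0) row of the results, 35.48 and 10.29; reading these as self-BLEU and human-BLEU is an assumption about the column layout, but the geometric mean is symmetric, so their order does not matter:

```latex
\text{G-BLEU} = \sqrt{\text{self-BLEU} \times \text{human-BLEU}},
\qquad
\sqrt{35.48 \times 10.29} = \sqrt{365.0892} \approx 19.11
```

The result matches the G-BLEU of 19.11 listed in the same row. Unlike an arithmetic mean, the geometric mean drops sharply when either component is small, so a system cannot score well by copying the input (high self-BLEU) while ignoring the references.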
In the Yelp dataset, RetrieveOnly and BackTranslation are considered unstable models because their G-BLEU scores are much lower than those of the other systems. In the Amazon dataset, the scores of CrossAligned and RetrieveOnly are much lower than those of the other systems.
Most style transfer studies measure style accuracy using a classifier. We also evaluate style accuracy with a classifier (note that this is different from the one used in training).
In the Yelp dataset, StyleEmbedding, multi-decoder, and UnpairedRL have quite a low accuracy. In the Amazon datasets, StyleEmbedding, DeleteOnly, and DeleteAndRetrieve are unstable in style transfer.
Fluency is measured as the perplexity of the transferred sentence. We use GPT-1 and GPT-2, which are known to perform well as language models. General-fluency (g-PPL) is measured using pre-trained GPT-2, and data-fluency (d-PPL) is measured using GPT-1 (instead of GPT-2, due to GPU memory) fine-tuned on the dataset. General-fluency gives a general view because the language model is not fitted to the data, whereas data-fluency evaluates fluency with respect to the specific data of the style transfer task. The total-fluency (t-PPL) is the geometric mean of d-PPL and g-PPL, and lower values indicate better fluency.
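The two quantities can be sketched as follows. Perplexity is computed in the standard way, as the exponential of the mean negative log-likelihood per token; the token log-probabilities in the example are made up, while the two perplexity values passed to `total_ppl` are taken from the SST (0.7, 0) row of the results (which of them is d-PPL and which is g-PPL does not affect the geometric mean).

```python
import math
from typing import List


def perplexity(token_log_probs: List[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def total_ppl(d_ppl: float, g_ppl: float) -> float:
    """t-PPL: geometric mean of data-fluency and general-fluency perplexities."""
    return math.sqrt(d_ppl * g_ppl)


# Made-up example: every token has probability 0.5, so the perplexity is 2.
example_ppl = perplexity([math.log(0.5)] * 4)

# Values from the SST (0.7, 0) row.
sst_t_ppl = total_ppl(274.19, 342.95)
print(round(example_ppl, 6), round(sst_t_ppl, 2))  # 2.0 306.65
```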
In the Yelp dataset, TemplateBased is unstable because its t-PPL is far larger than that of the other systems. In the Amazon dataset, the fluency of B-GST and G-GST is judged to be unstable.
Semantics is measured using BERTscore. Unlike BLEU and ROUGE, BERTscore is an evaluation metric defined in continuous space: BERT extracts contextual token embeddings from a human reference and a transferred sentence, and the score is computed from their cosine similarities. BERTscore addresses limitations of previous metrics and better captures the similarity between the reference and the candidate. The original BERTscore ranges from 0 to 1, but we rescale it to the range 0 to 100 to make differences easier to see.
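The matching step behind BERTscore can be sketched with toy vectors. The sketch follows the greedy-matching formulation of BERTscore (each token is matched to its most similar token on the other side) and returns the F1 variant scaled to 0 to 100; which variant (precision, recall, or F1) is reported here is not stated in the text, so F1 is an assumption, and optional idf weighting is omitted. The 2-d vectors stand in for real contextual BERT embeddings.

```python
import math
from typing import List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bertscore_f1(reference: List[Vector], candidate: List[Vector], scale: float = 100.0) -> float:
    """Greedy-matching F1 over token embeddings, rescaled to [0, scale]."""
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    return scale * 2 * precision * recall / (precision + recall)


reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 1.0]]
score = bertscore_f1(reference, candidate)
print(round(score, 2))  # 85.36
```

Because matching happens in embedding space, a candidate token that is merely similar to a reference token still earns partial credit, which exact n-gram overlap metrics such as BLEU cannot give.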
We set the unstable threshold as 1.4 point lower than the mean of all systems. CrossAligned, RetrieveOnly, and BackTranslation have limitations on Yelp datasets. CrossAligned, multi-decoder, and RetrieveOnly have limitations on Amazon datasets.
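The thresholding rule can be written down directly. The system names and scores below are made up for illustration and are not results from the paper; only the 1.4-point margin comes from the text.

```python
from typing import Dict, List


def unstable_systems(scores: Dict[str, float], margin: float = 1.4) -> List[str]:
    """Systems whose score falls more than `margin` points below the mean of all systems."""
    mean = sum(scores.values()) / len(scores)
    return sorted(name for name, s in scores.items() if s < mean - margin)


# Hypothetical BERTscore values on the 0-100 scale.
toy_scores = {"system_a": 90.0, "system_b": 89.0, "system_c": 85.0}
flagged = unstable_systems(toy_scores)
print(flagged)  # ['system_c']  (mean 88.0, threshold 86.6)
```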
In the Yelp dataset, the SST model is evaluated in two settings of the Delete hyperparameters, where $(\alpha, \beta)$ is (0.7, 0) and (0.7, 0.75). SST(0.7, 0) changes style better, with a style accuracy of 82.2, but SST(0.7, 0.75) performs better on the other metrics. In the Amazon dataset, the SST model is evaluated with $(\alpha, \beta)$ = (0.6, 0.5). The effects of $\alpha$ and $\beta$ are discussed in detail in Section 4.7.
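|Model||G-BLEU||Accuracy||t-PPL||BERTscore|
|SST (0.7, 0)||19.11||82.2||306.65||89.96|
|- Style loss||19.78||78.2||341.51||89.84|

|Model||self-BLEU||human-BLEU||G-BLEU||Accuracy||d-PPL||g-PPL||t-PPL||BERTscore|
|SST (0.7, 0)||35.48||10.29||19.11||82.2||274.19||342.95||306.65||89.96|
|SST (0.6, 0.5)||45.47||20.34||30.41||66.5||4.51||367.73||40.72||89.17|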
4.5 Result Analysis
The human references do not achieve the best performance except on human-BLEU and BERTscore, which are calculated using human references. But which sentences are actually more realistic, those written by humans or those generated by machines? Probably the human ones. It is difficult to determine the best system with automatic evaluation alone, but it is possible to determine which systems are stable or unstable: if a system performs significantly worse than the others on some metric, it is considered unstable. The stable systems in the Yelp dataset are SST, DeleteOnly, DeleteAndRetrieve, DualRL, B-GST, and G-GST. In the Amazon dataset, the stable systems are SST and TemplateBased. For all the metrics in both datasets, the stable systems are SST and DualRL. In automatic evaluation, DualRL outperforms SST, but DualRL does not share model parameters between the positive-to-negative and negative-to-positive tasks. A direct comparison is therefore difficult, because DualRL is effectively two models.
We trained SST several times with different random seeds for model initialization and found that SST always yields stable and comparable results. SST can be considered a stable system for the following reasons. (1) G-BLEU: Delete and Generate approaches show stable G-BLEU because they generate a sentence based on content tokens. There is no guarantee that content tokens will always be kept, but they help the generator. (2) Attribute: Our Delete process decides which tokens to delete using the Important Score. This direct and model-agnostic deletion is effective for neutralizing sentences. SST also improves style accuracy by adding the style loss. (3) Fluency: TemplateBased, B-GST, and G-GST show non-ideal fluency in d-PPL. TemplateBased is considered unstable because it simply inserts attribute tokens from the training data when generating test sentences. Since B-GST and G-GST use pre-trained GPT, they can also predict tokens that do not appear in the training data. This ability to predict more general tokens is usually helpful, but it can sometimes hurt d-PPL. SST, with its Transformer encoder-decoder structure, learns only the distribution of the given data and therefore has a stable d-PPL. (4) Semantic: Transformer language models are known to perform better than RNNs on various tasks. In the style transfer task as well, Transformer-based structures seem to reflect linguistic characteristics.
Table 5 shows generated samples from the models, which reveal the shortcomings of the comparison models. In Yelp's negative-to-positive example, only SST and DualRL change the style while preserving the content, which concerns the taste and price of the food. In Yelp's positive-to-negative example, the word "professionals" carries both style and content. In such cases, the Delete and Generate framework has the disadvantage of corrupting content information.
4.6 Ablation Study
Table 2 shows that using the style loss for SST training yields a 4-point gain in style accuracy. Fluency and semantics are also slightly better: the style loss improves data-fluency, resulting in better total fluency. However, the style loss slightly decreases G-BLEU, because it lets the transferred sentence change the attribute more strongly.
4.7 Trade-off between Content and Style
With the two Delete hyperparameters $\alpha$ and $\beta$, we can simply adjust the trade-off between content and style. The results on Yelp are shown in Fig. 2. Smaller $\alpha$ and $\beta$ make the model focus on style change, while larger $\alpha$ and $\beta$ make it focus on content preservation. The trade-off changes linearly with one of the two hyperparameters and is sensitive to the other (see Fig. 2). The appropriate $\alpha$ and $\beta$ depend on the dataset.
4.8 Latent Space Walking
In this section, we observe how the transferred sentences change with the weights given to positive and negative in the continuous style vector space. Ideally, a neutral sentence should be generated when the style attribute has the same weight for negative and positive. An example is shown in Table 6. For many inputs, as in this example, no neutral sentence is generated even when the style has equal weight for negative and positive. If we train our model to address this problem, we can expect better style control.
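One simple way to realize such a walk is to interpolate linearly between the negative and positive style embeddings and feed the mixed vector to the decoder in place of a single style token. The linear form and the 2-d vectors below are assumptions for illustration; the text only states that the weights of positive and negative are varied.

```python
from typing import List


def interpolate_style(neg: List[float], pos: List[float], weight_pos: float) -> List[float]:
    """Mix the two style embeddings; weight_pos = 0 gives negative, 1 gives positive."""
    return [(1.0 - weight_pos) * n + weight_pos * p for n, p in zip(neg, pos)]


# Hypothetical 2-d style embeddings.
negative_style = [0.0, 2.0]
positive_style = [2.0, 0.0]

# Walk from fully negative to fully positive in five steps.
walk = [interpolate_style(negative_style, positive_style, w / 4) for w in range(5)]
print(walk[2])  # [1.0, 1.0], the equal-weight midpoint
```

The midpoint corresponds to the equal-weight case discussed above, where a neutral sentence would ideally be generated.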
5 Conclusion and Future Work
We propose the Stable Style Transformer (SST), which rewrites sentences with a Delete and Generate approach. SST produces stable results overall compared with other systems, which makes it usable in the real world. The proposed direct and model-agnostic deletion lets the classifier delete attribute markers intuitively and makes the trade-off between content and style easy to handle. In future work, we would like to explore the transfer of multiple attributes, not only sentiment. In addition, we will study solutions for cases where attribute markers also carry content in the Delete and Generate framework.
|Yelp ( negative to positive)||Yelp ( positive to negative)|
|Input (source)||the food was so-so and very over priced for what you get .||these two women are professionals .|
|SST||the service is so-so and very reasonably priced for what you get .||these two women are rude .|
|CrossAligned||the food was fantastic and very very nice for what you .||these two dogs are hard down .|
|StyleEmbedding||the food was so-so and very over priced for what you get .||these two pot everywhere was .|
|DeleteOnly||the food was so-so and very over priced for what you get .||i would n’t like these two women are professionals .|
|these two scam women are professionals .|
|Back-translation||the food is delicious and the staff are very good for me .||this place is just not good .|
|UnpairedRL||the food was so-so and very over priced for what great qualities .||these two women are great .|
|DualRL||the food was surprising and very reasonably priced for what you get .||these two women are unprofessional .|
|B-GST||the food was amazing - so fresh and very good for what you get .||these two women are terrible liars .|
|G-GST||the food was priced right - so nice and very good for what you get .||these two women are condescending .|
|Human_DRG||the food was great and perfectly priced||these two women are not professionals .|
|Human_DualRL||the food was good and the price is low .||these two women are not professionals at all|
|Amazon ( negative to positive)||Amazon ( positive to negative)|
|Input (source)||i have to lower the rating another notch .||it seems to be of very good quality in its build .|
|SST||love the rating another one ,||it seems to be of very poor quality in its build .|
|CrossAligned||i would recommend this for the price .||it s not be for a good game for my phone .|
|StyleEmbedding||i have to get by a one market .||it seems to be the num_extend is good nice high cases .|
|DeleteOnly||i have to lower the rating and it fits into another notch .||
|DeleteAndRetrieve||i have to lower the rating another notch and i love it .||initially it was very good quality in its build .|
|B-GST||i have lower levels for the other notch .||it seems to be of very good quality in taste .|
|G-GST||i have lower the steel another notch .||it seems to be of very good value in return .|
|Human_DRG||i have to raise the rating another notch .||it seems to be of very poor quality in its build|
|when i finally walked in , i was very disappointed .|
|when i finally got , i was very happy .|