Text style transfer aims at rephrasing a given sentence in a desired style. It can be used to rewrite stylized literary works, to generate news or journal articles in different registers (e.g., formal vs. informal), and to adapt educational texts with specialized knowledge to different reading levels.
Due to the lack of parallel data for this task, previous works mainly focused on unsupervised learning of styles, usually assuming that there is a substantial amount of nonparallel corpora for each style and that the contents of the two corpora do not differ significantly Shen et al. (2017); John et al. (2018); Fu et al. (2018). Existing state-of-the-art models either attempt to disentangle style and content in the latent space Shen et al. (2017); John et al. (2018); Fu et al. (2018), directly modify the input sentence to remove stylized words Li et al. (2018), or use reinforcement learning to control the generation of transferred sentences in terms of style and content Wu et al. (2019a); Luo et al. (2019). However, in our experiments most of these approaches fail on low-resource datasets, which calls for new few-shot style transfer techniques.
The general notion of style is not restricted to the heavily studied sentiment styles; it also covers the writing style of an individual. However, even the most prolific writer cannot produce a fraction of the text corpora commonly used for unsupervised style transfer training today, and in the real world there are as many writing styles as there are writers. Viewing the transfer between each pair of styles as a separate domain-specific task, we formulate a multi-task learning problem in which each task corresponds to a pair of styles. To this end, we apply a meta-learning scheme that exploits data from other domains, i.e., other styles, to enhance the performance of few-shot style transfer Finn et al. (2017).
Moreover, existing works mainly focus on a very limited range of styles. In this work, we take both personal writing styles and previously studied general styles, such as sentiment, into account. We test our model and other state-of-the-art style transfer models on two datasets, each containing several style transfer tasks with small training sets, and verify that the cross-domain style information used by our model improves content preservation, style transfer accuracy, and language fluency.
Our contributions are listed as follows:
We propose the Multi-task Small-data Text Style Transfer (ST) algorithm, which adapts the meta-learning framework to existing state-of-the-art models. To the best of our knowledge, this is the first work that applies meta-learning to text style transfer (see Section 2).
The proposed algorithm substantially outperforms the state-of-the-art models on few-shot text style transfer in terms of content preservation, transfer accuracy, and language fluency (see Section 3).
We create and release a literature writing style transfer dataset, which is the first of its kind (see Section 3.1).
In this section, we first present two simple but effective style transfer models, namely CrossAlign Shen et al. (2017) and VAE John et al. (2018), as our base models, and then present a meta-learning framework called model-agnostic meta-learning Finn et al. (2017) that incorporates the base models to solve the few-shot style transfer problem.
The CrossAlign model architecture proposed by Shen et al. (2017) is shown in Figure 1. Let X1 and X2 be two corpora with styles s1 and s2, respectively. The encoder and decoder take both a sentence x1 or x2 and its corresponding style label s1 or s2 as inputs. The encoded sentences, together with their labels, are then fed to two adversarial discriminators D1 and D2, which are trained to differentiate between decoder logits generated by concatenating a content embedding with the original style label and those generated with the opposite label.
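As a minimal, hedged sketch of this adversarial objective (illustrative names and shapes, not the authors' implementation): a logistic discriminator is trained to separate decoder states produced with the original style label from those produced with the opposite one, while the encoder–decoder is trained to fool it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_loss(w, h_real, h_fake):
    """Binary cross-entropy: push D(h_real) -> 1 and D(h_fake) -> 0."""
    p_real = sigmoid(h_real @ w)
    p_fake = sigmoid(h_fake @ w)
    return -np.mean(np.log(p_real + 1e-9)) - np.mean(np.log(1 - p_fake + 1e-9))

def adversarial_loss(w, h_fake):
    """The encoder-decoder is trained to fool D: push D(h_fake) -> 1."""
    p_fake = sigmoid(h_fake @ w)
    return -np.mean(np.log(p_fake + 1e-9))
```

The two losses pull in opposite directions on the fake states; alternating updates of the discriminator and the encoder–decoder implement the adversarial game.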
VAE for Style Transfer
In order to disentangle style and content in the latent space, John et al. (2018) used a variational autoencoder (VAE) with specially designed style-oriented and content-oriented losses that guide the updates of the latent-space distributions for the two components Kingma and Welling (2013).
The architecture of this model is shown in Figure 2. Given a corpus with unknown latent style and content spaces, an RNN encoder maps an input sequence into the latent space, which defines distributions over style and content Cho et al. (2014). A style embedding and a content embedding are then sampled from their corresponding latent distributions and concatenated to form the sentence embedding used in training.
The two embeddings are used to compute multi-task and adversarial losses for content and style so as to separate their information. The concatenated latent vector then serves as the generative latent vector: it is appended to every step of the input sequence and fed into the decoder, which reconstructs the sentence. The final loss is the sum of these multi-task losses and the usual VAE reconstruction loss with KL divergence terms for both the style embedding and the content embedding Kingma and Welling (2013).
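The sampling and concatenation step can be sketched as follows (a minimal numpy illustration with hypothetical embedding sizes; the KL term shown is the standard diagonal-Gaussian form used in VAEs):

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps (eps ~ N(0, I)) so gradients flow through mu, logvar."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

rng = np.random.default_rng(0)
mu_s, logvar_s = np.zeros(8), np.zeros(8)      # style posterior (hypothetical size)
mu_c, logvar_c = np.zeros(64), np.zeros(64)    # content posterior (hypothetical size)
z = np.concatenate([reparameterize(mu_s, logvar_s, rng),
                    reparameterize(mu_c, logvar_c, rng)])  # fed to the decoder
kl = kl_to_standard_normal(mu_s, logvar_s) + kl_to_standard_normal(mu_c, logvar_c)
```

In the full model, z would be appended to each decoder input step, and kl would be added to the reconstruction loss.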
The main style- and content-oriented losses are designed as follows John et al. (2018).
The style embedding should contain enough information to be discriminative. Therefore, a multitask style discriminator is added, whose loss aligns the distribution of style labels it predicts from the style embedding with the ground-truth distribution of style labels.
The content embedding should not contain style information. Therefore, an adversarial discriminator is added: the discriminator, a fully connected layer, is trained to predict the style label from the content embedding, while the autoencoder receives an adversarial loss that drives this prediction toward uninformativeness.
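One common instantiation of this pair of losses (a sketch; batch and class sizes are illustrative) trains the discriminator with cross-entropy while the autoencoder maximizes the entropy of the discriminator's prediction:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def disc_loss(logits, labels):
    """Cross-entropy: the discriminator learns to predict style from content."""
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-9))

def adv_entropy_loss(logits):
    """Adversarial term for the autoencoder: maximize the entropy of the
    discriminator's prediction (minimize its negative), so the content
    embedding carries no style signal."""
    p = softmax(logits)
    return np.mean(np.sum(p * np.log(p + 1e-9), axis=-1))  # = -entropy
```

When the discriminator is perfectly confused (uniform prediction), the adversarial term reaches its minimum of -log(number of styles).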
The content embedding should be able to predict the bag-of-words (BoW) information of the sentence, where the BoW feature of each vocabulary word is its count in the sentence divided by the sentence length Wallach (2006). Therefore, a multitask content discriminator is added to align the predicted BoW distribution with the ground-truth BoW representation.
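The BoW target itself is straightforward to compute; a small sketch (vocabulary and tokens are illustrative):

```python
from collections import Counter

def bow_features(tokens, vocab):
    """Normalized bag-of-words: count of each vocabulary word / sentence length."""
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in vocab]
```

The multitask content discriminator is trained to reproduce this vector from the content embedding.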
The style embedding should not contain content information. As before, an adversarial discriminator is trained to predict the BoW features from the style embedding, and the autoencoder receives the corresponding adversarial loss.
In the training phase, the adversarial discriminators are trained together with the other parts of the model, and the final autoencoder loss is the weighted sum of the traditional VAE loss, the multitask losses for style and content, and the adversarial losses from the style and content discriminators. In the inference phase, a style embedding is extracted from the latent space of the target domain and substituted for the original style embedding during decoding.
2.2 Model-Agnostic Meta-learning (MAML)
Meta-learning is designed to help a model quickly adapt to a new task, given that it has been trained on several other similar tasks. Compared with other model-based meta-learning methods, the model-agnostic meta-learning algorithm (MAML) uses only gradient information Finn et al. (2017), so it can easily be applied to any model trained by gradient descent.
Given a distribution over similar tasks, a task-specific loss function for each task, and parameters shared across tasks, we aim to jointly learn a model whose parameters are so well initialized that, when fine-tuned on a new task, it converges quickly with fewer epochs and a smaller dataset.
Figure 3 shows the architecture of MAML. We define the shared model as the meta-learner. The data for each task is divided into a support set and a query set. Every update of the meta-learner's parameters consists of k-step updates for each of the tasks: the support set of each task is used to update the corresponding sub-learner, and the query set is used to evaluate a query loss that later drives the meta-learner's update.
In each sub-task training, the sub-learner is initialized with the meta-learner's parameters. These parameters are then updated k times using the support data for that specific task, yielding parameters adapted to the i-th task, on which a loss is evaluated using the task's query set. This sub-training process is performed for every sub-task, and the losses from all sub-tasks are aggregated into the meta-training loss.
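As a toy illustration of this inner/outer loop (a first-order sketch on scalar linear regression tasks, not the paper's model; all names and hyperparameters are illustrative), the meta-update can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

def task_batch(slope, n=20):
    """A toy task: fit y = slope * x; tasks differ only in the slope."""
    x = rng.uniform(-1, 1, n)
    return x, slope * x

def grad(w, x, y):
    """d/dw of the mean squared error mean((w*x - y)^2)."""
    return np.mean(2 * (w * x - y) * x)

def maml_step(w, slopes, inner_lr=0.1, outer_lr=0.05, k=3):
    """One meta-update (first-order approximation): adapt on each task's
    support set, then update shared w with the query-set gradient taken
    at the adapted parameters."""
    meta_grad = 0.0
    for s in slopes:
        xs, ys = task_batch(s)          # support set
        xq, yq = task_batch(s)          # query set
        w_i = w
        for _ in range(k):              # k inner steps on the support set
            w_i -= inner_lr * grad(w_i, xs, ys)
        meta_grad += grad(w_i, xq, yq)  # query loss drives the meta-update
    return w - outer_lr * meta_grad / len(slopes)

w = 0.0
for _ in range(100):
    w = maml_step(w, slopes=[1.0, 2.0, 3.0])
```

After about a hundred meta-updates the shared parameter settles near the average task slope, from which a few support-set steps suffice to fit any single task.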
In our application, the sub-tasks correspond to different pairs of styles to be transferred. The meta-learner contains the transfer function, which takes a sentence with its style label and outputs a sentence in the target style with similar content. This transfer function is shared by all style pairs in the meta-training phase. In addition, since both our base models include adversarial components for style disentanglement, the updates of the adversarial parameters are also included in the meta-learner's updates. Because the data for each task, with its single pair of styles, is assumed to be small, the goal of MAML is to exploit information from other style pairs to obtain a better initialization for fine-tuning on a specific sub-task. The multi-task style transfer via meta-learning (ST) algorithm is described in Algorithm 1.
In order to incorporate a more diverse range of styles, we gather two datasets for our experiments. The first is collected from literature translations with different writing styles, and the second groups standard datasets used in existing style transfer work, which also cover several types of styles.
We test our ST model and state-of-the-art models on these two datasets and verify our model's effectiveness in the few-shot style transfer setting. By comparing our models with the pretrained base models, we verify that the meta-learning framework improves performance in terms of both language fluency and style transfer accuracy.
| Common Source | Writer A | Writer B |
|---|---|---|
| Notre-Dame de Paris | Alban Kraisheimer | Isabel F. Hapgood |
| The Brothers Karamazov | Andrew R. MacAndrew | Richard Pevear |
| The Story of Stone | David Hawkes | Yang Xianyi |
| The Magic Mountain | John E. Woods | H. T. Lowe-Porter |
| The Iliad | Ian C. Johnston | Robert Fagles |
| Les Miserables | Isabel F. Hapgood | Julie Rose |
| Crime and Punishment | Michael R. Katz | Richard Pevear |
Since we extend the definition of style to the general writing style of a person, we need not be limited to the widely used Yelp/Amazon review and GYAFC datasets. To model the realistic situation in which we have many style pairs but not enough data for each, we use a literature translations dataset and a set of popular style transfer datasets with reduced sizes.
Literature Translations (LT)
Current state-of-the-art work on text style transfer requires large training datasets and thus cannot be applied to personal writing styles. One reason is that personal writing styles are relatively difficult to learn compared with more discriminative styles such as sentiment and formality. Furthermore, sources of data reflecting personal writing styles are quite limited.
For these reasons, we consider a literature translations dataset. First, there are multiple translations of the same source work; since these comparable sentences can be aligned to construct ground-truth references, they are well suited as test data. Moreover, beyond the common-source translated work, each translator has other written works, which we use as non-parallel training data.
We collect a set of writers with unknown, distinct writing styles, each writer having his/her own set of written works. In order to build a test set with ground-truth references, we use translated works from non-English sources (obtained from http://gen.lib.rus.ec/), so that each writer in our set has at least one translated work that shares a source with another writer; that is, for each writing style in the set, there exists another style whose writer translated the same source work. In this dataset, each writer has approximately 10k nonparallel sentences for training.
For testing, we align sentences for each style pair using the algorithm provided by Chen et al. (2019); the sentence pairs are extracted from the common translated work of each writer pair. The test data has approximately 1k sentences per writer. More information is shown in Table 1.
Grouped Standard Datasets (GSD)
In our second set, we group popular style transfer datasets. For large datasets, we use only a small portion of them in order to model our few-shot style transfer task. The datasets we use are listed in Table 2. For the standard/simple versions of Wikipedia, we use the sentences aligned by Hwang et al. (2015) for testing. For all datasets listed in the table, we use 10k sentences for training and 1k sentences for testing.
| Dataset | Styles |
|---|---|
| Amazon | (musical instrument) positive/negative |
BLEU for Content Preservation
To evaluate content preservation of transferred sentences, we use a multi-BLEU score between reference sentences and generated sentences Papineni et al. (2002). When ground-truth sentences are available in the dataset, we calculate BLEU scores between generated and ground-truth sentences; when they are missing, we calculate self-BLEU scores against the original sentences (we use the BLEU score provided by multi-bleu.perl).
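For illustration, a simplified sentence-level BLEU with add-one smoothing (a sketch of the metric's idea, not multi-bleu.perl itself, which works at corpus level without this smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty, with add-one smoothing so a single missing
    n-gram order does not zero the score."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(sum(cand.values()), 1)
        log_p += math.log((overlap + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_p)
```

An exact copy of the reference scores 1.0, while a candidate sharing no n-grams with the reference scores much lower.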
Perplexity (PPL) for Fluency
Following the metrics used by John et al. (2018), we use a bigram Kneser-Ney language model to evaluate the fluency and naturalness of generated sentences Kneser and Ney (1995). The language models are trained on the target domain for each style pair; for the GSD set, we train them on the training data before reduction.
Transfer Accuracy (ACC)
To evaluate the effectiveness of style transfer, we pretrain a TextCNN classifier as proposed by Kim (2014); the transfer accuracy is the score output by this classifier. Our classifier achieves 80% accuracy on GSD and 77% even on the LT dataset, so it serves as a reasonable evaluator of transfer effectiveness.
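A minimal numpy sketch of such a classifier's forward pass (illustrative shapes and parameter names; the actual TextCNN uses several window sizes and is trained with cross-entropy):

```python
import numpy as np

def textcnn_forward(token_ids, emb, filters, w_out, window=3):
    """Toy TextCNN forward pass: embed -> 1-D convolution over word windows
    -> ReLU -> max-over-time pooling -> linear layer -> style logits."""
    x = emb[token_ids]                                        # (seq_len, emb_dim)
    seq_len = x.shape[0]
    windows = np.stack([x[i:i + window].ravel()
                        for i in range(seq_len - window + 1)])  # (n_win, window*emb_dim)
    conv = np.maximum(windows @ filters, 0.0)                 # (n_win, n_filters)
    pooled = conv.max(axis=0)                                 # (n_filters,)
    return pooled @ w_out                                     # (n_styles,) logits

# Hypothetical sizes: vocab 50, embedding dim 8, 4 filters, 2 styles.
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 8))
filters = rng.normal(size=(3 * 8, 4))
w_out = rng.normal(size=(4, 2))
logits = textcnn_forward(np.array([1, 2, 3, 4, 5]), emb, filters, w_out)
```

The argmax of the logits gives the predicted style label used to score transfer accuracy.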
Human Evaluation of Fluency and Content
| Model | LT BLEU-r | LT BLEU-s | LT PPL | LT ACC | LT Flu/Con | GSD BLEU-r | GSD BLEU-s | GSD PPL | GSD ACC | GSD Flu/Con |
|---|---|---|---|---|---|---|---|---|---|---|
| Template | 41.6 | 81.48 | 5.4 | 0.31 | 4.3 / 4.2 | 81.7 | 88.8 | 5.3 | 0.42 | 4.2 / 4.2 |
| CrossAlign | 2.2 | 2.1 | 1895.6 | 0.45 | 1.2 / 1.1 | 2.7 | 2.2 | 1049.7 | 0.36 | 1.0 / 1.0 |
| DeleteRetrieve | 35.9 | 41.6 | 63.3 | 0.33 | 1.0 / 1.0 | 20.5 | 21.4 | 28.8 | 0.41 | 2.1 / 1.3 |
| DualRL | 4.1 | 3.9 | 1400.7 | 0.49 | 1.2 / 1.2 | 25.4 | 27.5 | 171.0 | 0.41 | 2.9 / 1.5 |
| VAE | 13.5 | 16.3 | 8.5 | 0.49 | 3.5 / 1.7 | 12.4 | 26.4 | 21.5 | 0.45 | 4.3 / 2.1 |
| ST-CA (ours) | 6.3 | 6.8 | 54.8 | 0.65 | 3.1 / 2.3 | 66.7 | 73.2 | 21.4 | 0.42 | 3.6 / 3.8 |
| ST-VAE (ours) | 20.5 | 15.1 | 8.2 | 0.62 | 3.8 / 1.9 | 14.7 | 13.9 | 10.9 | 0.71 | 4.3 / 2.7 |
We conduct an additional human evaluation, following Luo et al. (2019). Two native English speakers are asked to score the generated sentences from 1 to 5 in terms of fluency/naturalness and content preservation. Before annotation, the two evaluators are shown the best and worst generated sentences so that they know the upper and lower bounds and can score more consistently. The final score for each model is the average of the annotators' scores. The kappa inter-judge agreement is 0.769, indicating substantial agreement.
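The reported agreement is Cohen's kappa, which corrects raw agreement for chance; a small self-contained sketch (assuming two equal-length rating lists):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa: (observed - expected) / (1 - expected) agreement.
    Undefined when expected agreement is 1 (both raters constant)."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    ca, cb = Counter(ratings_a), Counter(ratings_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)
```

Values above roughly 0.6 are conventionally read as substantial agreement.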
3.3 Multi-task Style Transfer
We compare the results with the state-of-the-art models for the style transfer task. All the baseline models are trained on the single style pair. The ST model is trained on all the tasks for both LT and GSD sets, and then fine-tuned using a specific style pair in the sets. The trained meta-learner is fine-tuned on each of the sub-tasks, and the scores are calculated as the average among all sub-tasks for both ST models and baselines. The results are shown in Table 3.
We note that the BLEU and PPL scores of the template-based model appear superior to those of the other models. This is because it directly modifies the original sentence by changing only a couple of words, so the modification is minimal. However, its transfer accuracy suffers, as expected, so it should serve only as a reference in our task.
For qualitative analysis, we randomly select sample sentences produced by the baseline models, the pretrained base models, and our ST models on the Translations dataset and the Yelp positive/negative review dataset; these are shown in Table 4.
| Model | Output |
|---|---|
| Original sentence (Notre-Dame de Paris) | in their handsome tunics of purple camlet , with big white crosses on the front . |
| Template | in their handsome tunics of purple camlet, with big white crosses on front. |
| CrossAlign | heel skilful skilful skilful skilful |
| DeleteRetrieve | the man , and the man , the man , the |
| DualRL | lyres lyres lyres |
| VAE | the gypsy girl had stirred up from the conflict |
| ST-CrossAlign (ours) | so the spectacle who prayed and half white streets , |
| ST-VAE (ours) | all four were dressed in robes of white and were white locks from |
| Model | Output |
|---|---|
| Original sentence (Yelp positive) | the staff is welcoming and professional . |
| Template | the staff is welcoming and professional . |
| CrossAlign | glad glad glad |
| CrossAlign (pretrained) | the staff is welcoming and professional . |
| DeleteRetrieve | the staff is a time . |
| DualRL | less expensive have working . |
| VAE | the staff is rude and rude |
| VAE (pretrained) | the staff is extremely welcoming and professional . |
| ST-CrossAlign (ours) | the staff is friendly and unprofessional |
| ST-VAE (ours) | the staff are rude and unprofessional . |
| Model | Output |
|---|---|
| Original sentence (Yelp negative) | these people do not care about patients at all ! |
| Template | these people wonderful about patients at all ! |
| CrossAlign | glad glad glad |
| CrossAlign (pretrained) | these people do not care about patients at all ! |
| DeleteRetrieve | i was n't be a a appointment and i have . |
| DualRL | and just like that it was over and i was . |
| VAE | these people do not care about patients or doctors |
| VAE (pretrained) | these guys do n't care about the patients at time |
| ST-CrossAlign (ours) | these people do not satisfied at all ! |
| ST-VAE (ours) | i was so happy and i did n't consent |
From the results, we notice that state-of-the-art models fail to achieve satisfying performance in our few-shot style transfer task, and many baseline models fail to generate syntactically or logically consistent sentences. In comparison, our methods generate relatively fluent sentences under both automatic and human evaluation while achieving higher transfer accuracy.
One might be tempted to attribute this simply to the ST models learning better language models because they are trained on more data, i.e., data from all styles rather than a single pair. Further experiments are therefore required to rule this out.
3.4 Pretrained Base Models
Based on this reasoning, we extract the language model component of our base models (CrossAlign and VAE) and pretrain it on the union of data from all sub-tasks. Starting from a well-trained language model, we then fine-tune the models on the style transfer task. By comparing these models with our ST model, we verify that the meta-learning framework improves style transfer accuracy in addition to language fluency. We perform this experiment only on the GSD dataset, which suffices for analysis purposes.
In addition, to examine the effect of pretraining combined with meta-learning, we also add a pretraining phase to our ST model. The quantitative and qualitative results are included in Table 5 and Table 4 (on Yelp dataset for the pretrained base models).
By adding a pretraining phase, the models get a chance to see all the data and learn to generate fluent sentences via reconstruction. It is therefore not surprising that the content preservation measure (BLEU) and the sentence naturalness measure (PPL) improve significantly, but at the cost of style transfer accuracy.
In effect, the pretrained models tend to reconstruct the original sentence without transferring the style. In comparison, our ST model learns to generate reasonable sentences and transfer styles jointly during training, and therefore remains superior in style transfer accuracy. This verifies that the success of ST does not merely result from a larger training dataset: the model updates its knowledge across tasks in parallel rather than sequentially, which yields both better language models and more effective style transfer.
Furthermore, we notice that the pretraining phase in our ST model is not crucial, suggesting that it is the meta-learning framework that significantly contributes to the model's improvements in generating fluent sentences and effectively transferring styles.
3.5 Disentanglement of Style
Following the experiments conducted by John et al. (2018), we use t-SNE plots (shown in Figure 4) to analyze how effectively the style and content embeddings are disentangled in the latent space Maaten and Hinton (2008). In particular, we compare the pretrained base models (CrossAlign and VAE) with our ST models.
These two models, together with our ST models, explicitly attempt to disentangle style and content in the latent space and are thus well suited for this experiment, whereas it would be unreasonable to treat the hidden state vectors of the other baseline models as content/style embeddings. They are therefore excluded from this experiment.
As the figures show, the content spaces learned by all models are similarly clustered, while the style spaces are more clearly separated in our ST models than in the pretrained base models. This verifies that the improvement from the meta-learning framework is not limited to a better language model but also extends to the disentanglement of styles.
4 Related Work
Fu et al. (2018) devised a multi-encoder and multi-embedding scheme to learn a style vector via adversarial training. Adapting a similar idea, Zhang et al. (2018) built a shared private encoder-decoder model to control the latent content and style space. Also based on a seq2seq model, Shen et al. (2017) proposed a cross-align algorithm to align the hidden states with a latent style vector from target domain using teacher-forcing. More recently, John et al. (2018) used well-defined style-oriented and content-oriented losses based on a variational autoencoder to separate style and content in latent space.
Li et al. (2018) directly removed style attribute words based on TF-IDF weights and trained a generative model that takes the remaining content words to construct the transferred sentence. Inspired by recent achievements of masked language models, Wu et al. (2019b) used an attribute marker identifier to mask out the style words in the source domain and trained an "infill" model to generate sentences in the target domain.
Based on reinforcement learning, Xu et al. (2018) proposed a cycled-RL scheme with two modules, one for removing emotional words (neutralization), and the other for adding sentiment words (emotionalization). Wu et al. (2019a) devised a hierarchical reinforced sequence operation method using a point-then-operate framework, with a high-level agent proposing the position in a sentence to operate on and a low-level agent altering the proposed positions. Luo et al. (2019) proposed a dual reinforcement learning model to jointly train the transfer functions using explicit evaluations of style and content as guidance. Although these methods work well on large datasets such as Yelp Asghar (2016) and GYAFC Rao and Tetreault (2018), they fail in our few-shot style transfer task.
Prabhumoye et al. (2018) adapted a back-translation scheme in an attempt to remove stylistic characteristics in some intermediate language domain, such as French.
There are also meta-learning applications to text generation tasks. Qian and Yu (2019) used the model-agnostic meta-learning algorithm for domain-adaptive dialogue generation; however, their task has paired training data, unlike ours. To enhance content preservation, Li et al. (2019) proposed to first train an autoencoder on both the source and target domains. In contrast, beyond utilizing off-domain data, we apply a meta-learning method to enhance the model's performance in terms of both language modeling and transfer ability.
In this paper, we extend the concept of style to general writing styles, of which countless varieties naturally exist, each with only a limited amount of data. To tackle this new problem, we propose a multi-task style transfer (ST) framework, the first of its kind to apply meta-learning to small-data text style transfer. We use the literature translations dataset and the grouped standard dataset to evaluate the state-of-the-art models and our proposed model.
Both quantitative and qualitative results show that ST outperforms the state-of-the-art baselines. Compared with state-of-the-art models, our model does not rely on a large dataset for each style pair, but is able to effectively use off-domain information to improve both language fluency and style transfer accuracy.
Noticing that the baseline models might fail to learn an effective language model from small datasets, a possible cause of their poor performance, we further eliminate this bias in our experiments by pretraining the base models on data from all tasks. The results confirm that the improvement from the meta-learning framework is substantial.
- Yelp dataset challenge: review rating prediction. arXiv preprint arXiv:1605.05362. Cited by: §4.
- Aligning sentences between comparable texts of different styles. In The 9th Joint International Semantic Technology Conference, Cited by: §3.1.
- Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078. Cited by: §2.1.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135. Cited by: §1, §2.2, §2.
- Style transfer in text: exploration and evaluation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §4.
- Aligning sentences from standard wikipedia to simple wikipedia. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 211–217. Cited by: §3.1.
- Disentangled representation learning for text style transfer. arXiv preprint arXiv:1808.04339. Cited by: §1, §2.1, §2.1, §2, §3.2, §3.5, §4.
- Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §2.1, §3.2.
- Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §2.1, §2.1.
- Improved backing-off for m-gram language modeling. In 1995 International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 181–184. Cited by: §3.2.
- Domain adaptive text style transfer. arXiv preprint arXiv:1908.09395. Cited by: §4.
- Delete, retrieve, generate: a simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437. Cited by: §1, §4.
- A dual reinforcement learning framework for unsupervised text style transfer. arXiv preprint arXiv:1905.10060. Cited by: §1, §3.2, §4.
- Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §3.5.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §3.2.
- Style transfer through back-translation. arXiv preprint arXiv:1804.09000. Cited by: §4.
- Domain adaptive dialog generation via meta learning. arXiv preprint arXiv:1906.03520. Cited by: §4.
- Dear sir or madam, may I introduce the GYAFC dataset: corpus, benchmarks and metrics for formality style transfer. arXiv preprint arXiv:1803.06535. Cited by: §4.
- Style transfer from non-parallel text by cross-alignment. In Advances in neural information processing systems, pp. 6830–6841. Cited by: §1, §2.1, §2, §4.
- Topic modeling: beyond bag-of-words. In Proceedings of the 23rd international conference on Machine learning, pp. 977–984. Cited by: item 3.
- A hierarchical reinforced sequence operation method for unsupervised text style transfer. arXiv preprint arXiv:1906.01833. Cited by: §1, §4.
- “Mask and infill”: applying masked language model to sentiment transfer. arXiv preprint arXiv:1908.08039. Cited by: §4.
- Unpaired sentiment-to-sentiment translation: a cycled reinforcement learning approach. arXiv preprint arXiv:1805.05181. Cited by: §4.
- Shaped: shared-private encoder-decoder for text style adaptation. arXiv preprint arXiv:1804.04093. Cited by: §4.