Codes for "Massive Styles Transfer with Limited Labeled Data"
Language style transfer has attracted increasing attention in the past few years. Recent research focuses on improving neural models that transfer from one style to another with labeled data. However, transferring across multiple styles is often very useful in real-life applications. Previous research on language style transfer has two main deficiencies: dependency on massive labeled data and neglect of the mutual influence among different style transfer tasks. In this paper, we propose a multi-agent style transfer system (MAST) for addressing multiple style transfer tasks with limited labeled data, by leveraging abundant unlabeled data and the mutual benefit among the multiple styles. A style transfer agent in our system not only learns from unlabeled data using techniques such as denoising auto-encoders and back-translation, but also learns to cooperate with other style transfer agents in a self-organizing manner. We conduct our experiments by simulating a set of real-world style transfer tasks with multiple versions of the Bible. Our model significantly outperforms the other competitive methods, and extensive results and analysis further verify the efficacy of the proposed system.
There are various language styles in our lives. For example, people of different ages and backgrounds, or in different areas, talk in different ways; famous writers have their own special writing styles; network language is more fashionable than formal language; and so on. Style transfer techniques can be applied in real life to help generate robotic instructions Kiddon et al. (2015), simplify text for children, and personalize expression Lin and Walker (2017). It is often very useful to transfer from one given style to a number of different styles, enabling text generation applications to meet specific requirements for different target audiences.
Most existing style transfer methods Jhamtani et al. (2017); Ficler and Goldberg (2017) rely heavily on large-scale labeled (or paired) data for training. However, large labeled datasets across different styles are usually hard to obtain. Although the data shortage problem has been alleviated by related work Xu (2017), the proposed datasets are still limited in both scale and domain. Note that unlabeled data are abundant and easy to collect for any style, and they can be leveraged to enhance style transfer models.
Besides, different styles are internally related. For example, Shakespeare's plays and modern literature differ greatly in morphology, grammar, and so on, yet they still share many common expressions. Therefore, we argue that the style transfer tasks for different style pairs influence each other. Nevertheless, previous works investigate the style transfer tasks for different style pairs independently, ignoring the mutual influence among them.
In this work, we investigate the problem of transferring across multiple language styles, as described below:
There is a set of writing styles in the same language (e.g., English). For each style, there are plenty of unlabeled data, along with a few labeled (parallel) data between any two styles. The scale of the labeled data is very limited due to annotator resources and time. The goal is to find a set of style transfer models to transfer text across the styles.
First, we leverage unlabeled data to improve style transfer for each style pair (i.e., one-to-one style transfer) by proposing a semi-supervised framework that enhances both encoders and decoders, inspired by recent research in NMT Lample et al. (2017); Artetxe et al. (2017).
Second and more importantly, for multiple (or, in our setting, one-to-many) style transfer, the data are further used to improve the one-to-one models in a multi-agent manner so that they achieve better performance. Although this could be done in a popular multi-task learning manner Collobert and Weston (2008) by sharing parameters across the different one-to-one transfer models, our experiments demonstrate that parameter sharing easily leads to performance drops due to the inconsistency of the multiple tasks. Instead, we propose a multi-agent system to address the multiple style transfer problem. The one-to-one style transfer models are regarded as our basic style transfer agents, and we design self-organization algorithms to let the agents find and use helpful neighbors to improve themselves.
We set up a general scenario following our problem definition, using a dataset consisting of several versions of the Bible Carlson et al. (2017). Without loss of generality, we set one version as the source version and the others as target versions. As the different Bible versions are translated by different authors and targeted at different audiences, their writing styles differ from each other. Agents for the different one-to-one style transfer tasks are trained independently and then enhanced by a few neighbor agents. The evaluation results show the efficacy and superiority of our system for multiple style transfer.
The contributions are summarized as follows:
To the best of our knowledge, we are the first to introduce multi-agent learning to style transfer tasks.
Our proposed system can leverage unlabeled data and the mutual benefits across different style transfer tasks to address the data shortage problem. Our system performs significantly better than the state-of-the-art models proposed in previous works.
Our code and dataset will be released on GitHub (https://github.com/zhyack/MAST), so the experiments can be easily reproduced and extended.
Among research on all kinds of style transfer tasks, our work is most relevant to studies of writing style transfer Jhamtani et al. (2017); Ficler and Goldberg (2017); Nisioi et al. (2017); Prabhumoye et al. (2018). Xu et al. (2012) propose a dataset of style transfer between Shakespeare's scripts and modern English; they build language models and train the Moses decoder (http://www.statmt.org/moses) to learn paraphrasing between styles. Jhamtani et al. (2017) follow this work and propose sequence-to-sequence models trained on the parallel data in the dataset. Recently, Carlson et al. (2017) propose a more content-rich dataset for similar studies, the Bibles. Sequence-to-sequence models applied to different versions of the Bible achieve better results than Moses.
The need to leverage unlabeled data has drawn much interest from NMT researchers. Studies such as Yang et al. (2018a); Lample et al. (2017), Sennrich et al. (2016), and Artetxe et al. (2017) propose methods to build semi-supervised or unsupervised models. However, these techniques are mainly designed for NMT tasks and have not been widely used for style transfer. Some unsupervised approaches Shen et al. (2017); Yang et al. (2018b) try to address style transfer with GANs Goodfellow et al. (2014), but their architectures show drawbacks in content preservation Xu et al. (2018). In this paper, we follow the ideas of Sennrich et al. (2016) to propose a semi-supervised method that leverages unlabeled data on both the source and target sides.
The core inspiration for our proposed system comes from multi-agent system design. P2P self-organization systems Gorodetskii (2012) have been successfully applied in practical security systems; they design policies for agents to choose useful neighbors to produce better predictions, which enlightens us in building style transfer systems. Research on reinforcement learning in text generation tasks Yu et al. (2017) also shows the practicability of regarding text generation models as agents with a large action space.
We use the popular attentional sequence-to-sequence models as baselines and build our system based on them.
Several techniques have been applied to the vanilla sequence-to-sequence model Sutskever et al. (2014) to achieve better performance, such as a bidirectional encoder Schuster and Paliwal (1997), an attentional decoder Bahdanau et al. (2014); Luong et al. (2015), and LSTM Hochreiter and Schmidhuber (1997) cells instead of vanilla RNN cells. In this work, we use a model with a bidirectional encoder and a Luong attentional decoder Luong et al. (2015), one of the state-of-the-art models for neural language style transfer used in previous works Xu et al. (2012); Jhamtani et al. (2017); Carlson et al. (2017), as one of our baseline models.
Many studies in the NMT area try to make use of unlabeled data. Semi-supervised Artetxe et al. (2017) and unsupervised Lample et al. (2017); Yang et al. (2018a) models have been proposed to enhance the basic sequence-to-sequence models, and they are inspiring for improving the performance of style transfer models.
Among these studies, Sennrich et al. (2016) propose two effective methods. One is to pair monolingual target data with a dummy source sentence, which is very similar to the technique of denoising auto-encoders (DAE) Bengio et al. (2013). The other is to pair the unlabeled target data with synthetic source data obtained from a target-to-source translation model (back-translation). We improve this model to fit the style transfer setting, making use of the unlabeled data on both the source and target sides.
The semi-supervised sequence-to-sequence model (Semi) is shown in Figure 1, where E and D are short for encoder and decoder, and noised data are marked accordingly. Our aim is to obtain an encoder and a decoder such that we can encode text of the source style and decode the embedding into text of the target style. Different from the model in Sennrich et al. (2016), we use back-translation to translate the target-side outputs back to the source side, which makes use of the unlabeled source data in an auto-encoder manner. In short, there are three routes for training the Semi model: supervised training on the labeled pairs, denoising auto-encoding on noised unlabeled data, and reconstruction of unlabeled source data via back-translation.
We jointly train the model by randomly choosing one of the three routes for each training batch. Finally, the encoder and decoder are enhanced by training with unlabeled data through DAE and back-translation.
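The route selection and the noising step for DAE can be sketched as below. This is a minimal illustration, not the released implementation: `add_noise` uses a common noising recipe (word dropout plus a local shuffle), and the route names are illustrative placeholders for the three training modes described above.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_k=3, rng=random):
    """Noise a token sequence for denoising auto-encoding:
    randomly drop words, then locally shuffle the survivors so that
    each token moves at most `shuffle_k` positions."""
    kept = [t for t in tokens if rng.random() >= drop_prob] or tokens[:1]
    # Sort by index + small random offset => bounded local reordering.
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda x: x[0])]

def sample_route(rng=random):
    """Pick one of the three training routes for the next batch."""
    return rng.choice(["supervised", "dae", "back_translation"])
```

With `drop_prob=0` the noise only reorders tokens locally, which is the usual sanity check for a DAE corruption function: the model must restore order, not invent content.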
As mentioned before, naively applying multi-task learning methods hurts the performance on some tasks due to the inconsistency among different targets. Therefore, we propose a new approach that integrates the models for different tasks not by sharing parameters, but by designing a multi-agent system. We regard the multiple Semi models as the basic agents for style transfer between the fixed source style and the multiple target styles. To model how they communicate and use each other's information, we follow the framework of classic multi-agent systems to build our multi-agent system for style transfer (MAST). There are two core steps for each agent in MAST: finding other helpful agents as neighbors, and learning to predict with those neighbors.
In MAST, an agent cannot make use of the information of all other agents, since some agents may not be helpful to a specific one and computing resources are limited. Therefore, we need algorithms that let an agent automatically locate the most helpful agents as its neighbors. To achieve this, we propose and combine two novel strategies.
As similar styles usually share many common expressions, agents built on similar styles can be referential for each other. A binary classification model is an effective way to evaluate the similarity between two styles: if two styles are very different, a good classification model can easily distinguish them and produce high-accuracy predictions, while the accuracy may be much worse when the styles are very similar. Therefore, for each pair of target styles, we train a binary classification model with an attentional RNN to obtain a representation vector, followed by a fully connected layer for classification. The attention mechanism distributes weights over the vectors output by the RNN to construct a better feature vector for classification. We collect the classification accuracies of all the target style pairs into a matrix, whose size is determined by the total number of target styles in the system. We then re-scale (https://en.wikipedia.org/wiki/Feature_scaling) each row (style) so that all the scores lie within [0,1], and define the similarity of two styles as the complement of the re-scaled accuracy (lower accuracy means higher similarity).
Apart from the similarities drawn from the classification, some agents may not perform well on their own target styles, and such agents may not provide help to a specific agent. Figuratively speaking, working with someone who is unskilled, or who shares no interests with you, is usually a bad choice. Inspired by this, we balance the similarity between agents against agent performance to make the final decision. To evaluate agent performance, we use the BLEU scores Papineni et al. (2002) achieved on the development set, re-scaled to [0,1] across all agents to obtain the performance scores.
Finally, we linearly combine the similarity scores and the performance scores as score(i, j) = α · sim(i, j) + (1 − α) · perf(j) (Eq. 1), where α is a coefficient weight with 0 ≤ α ≤ 1. In this work, we set α to 0.5 to weight the two scores equally. The neighbors of an agent are chosen by taking the top-k scores (excluding the agent itself). Note that score(i, j) ≠ score(j, i) in general, so the choice of neighbors may be unidirectional.
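The SOS scoring above can be sketched as follows, assuming the pairwise classifier accuracies and per-agent development-set BLEU scores are already available. The function names are illustrative, not from the released code.

```python
def rescale(xs):
    """Min-max re-scale a list of scores to [0, 1]."""
    lo, hi = min(xs), max(xs)
    if hi == lo:
        return [0.0 for _ in xs]
    return [(x - lo) / (hi - lo) for x in xs]

def choose_neighbors(acc, bleu, i, k, alpha=0.5):
    """Pick the top-k neighbors of agent i.

    acc[i][j]: accuracy of the style classifier for target styles i vs j
               (high accuracy => very different styles => low similarity).
    bleu[j]:   BLEU of agent j on the development set.
    """
    sim = [1.0 - a for a in rescale(acc[i])]   # row-wise re-scaling
    perf = rescale(bleu)
    # Eq. 1: score(i, j) = alpha * sim(i, j) + (1 - alpha) * perf(j)
    score = [alpha * s + (1 - alpha) * p for s, p in zip(sim, perf)]
    ranked = sorted((j for j in range(len(bleu)) if j != i),
                    key=lambda j: score[j], reverse=True)
    return ranked[:k]
```

Because perf(j) depends only on the candidate neighbor j, score(i, j) and score(j, i) generally differ, which is why neighbor selection can be unidirectional.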
After setting the neighbors, we need to give an agent the ability to learn to make use of its neighbors. Imagine that you face an important life choice: the first thing you will probably do is ask for advice from family and friends, and then weigh all the suggestions according to your own estimate of the situation to make the final decision. In a similar way, we propose the multi-agent training framework (MAT), illustrated in Figure 2.
In Figure 2, the agent to be enhanced is one of the basic agents in the system, together with its chosen neighbors. All these basic agents are pre-trained using the Semi model. Apart from the basic agents, an auxiliary controller model is trained, which is the key to letting the agent make use of its neighbors. As each of the agents produces a probability distribution over its action space (vocabulary) at each time step, the controller is trained to predict a weight distribution over all the basic agents. That is to say, the controller does not learn to take generative actions itself, but learns to integrate the probability distributions of all the basic agents as in Eq. 2, according to the environment (the inputs and the last predicted word). In Eq. 2, the combined distribution is over a global action space, which is the union of the agents' local action spaces; a mapping operation sets the probabilities of words outside an agent's local vocabulary to zero when projecting to the global space. Finally, predictions are greedily sampled from the combined distribution so that the agents keep pace with each other and reach similar states at the next time step.
It is noteworthy that different agents have different action spaces (vocabularies). Therefore, we use a mapping operation to project the probability vectors from the local action spaces to the global action space. Also, the valid predictions can differ across agents, because words outside an agent's vocabulary are not valid actions for it.
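The combination step can be sketched as below, assuming each agent's next-token distribution is a dict over its own (local) vocabulary and the controller weights are given; in the system they come from the trained RNN controller.

```python
def combine_distributions(agent_dists, weights):
    """Map each agent's local distribution into the global vocabulary
    (words missing from an agent's vocabulary get probability 0) and
    take the weighted sum of all agents' distributions."""
    global_vocab = set()
    for dist in agent_dists:
        global_vocab.update(dist)
    combined = {}
    for word in global_vocab:
        combined[word] = sum(w * dist.get(word, 0.0)
                             for w, dist in zip(weights, agent_dists))
    return combined

def greedy_pick(combined):
    """Greedily sample the next token from the combined distribution,
    so that all agents are fed the same word at the next time step."""
    return max(combined, key=combined.get)
```

Feeding every agent the same greedily sampled token is what keeps the agents' decoder states aligned across time steps.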
In the controller, we use an RNN encoder to model the environment and another RNN decoder to predict the weights. As all the basic agents are pre-trained, we fix their parameters; that is to say, in Figure 2, only the controller is trained. Algorithm 1 shows the whole training process.
In MAST, training SOS and MAT is efficient. Moreover, compared with other approaches, all the modules in our system (training the basic agents, SOS, and MAT) can easily be deployed in a distributed manner. That is to say, when new styles are introduced, extending the system is easy and efficient.
The Bible is one of the most-translated books, with over 50 English versions. BibleGateway collects these Bible texts (https://www.biblegateway.com/; information about the differences between versions can be found on this site). Carlson et al. (2017) collate data crawled from this site and align sentences across different versions according to chapters and sentence order. They release 6 versions (ASV, BBE, DARBY, DRA, WEB, YLT) that are available in the public domain, along with their preprocessing system (https://github.com/keithecarlson/Zero-Shot-Style-Transfer). We crawl another 10 versions (AMP, CJB, CSB, ERV, ESV, KJ21, MEV, NCV, NIV, NOG) from BibleGateway and pre-process the data in the same way.
Without loss of generality, we set ASV as the source version and the other 15 versions as target versions. For each source-target version pair, we get 29,803 aligned sentence pairs on average. To simulate a situation with few labeled data and abundant unlabeled data, we sample 2,000 aligned pairs as the training set, 500 aligned pairs as the development set, and 500 aligned pairs as the test set, plus 20,000 sentences in each version as unlabeled supplementary material.
All the models except SOS adopt a 2-layer bidirectional LSTM encoder and a 2-layer attentional LSTM decoder. In SOS, only a 1-layer LSTM is used. Dropout rates are set to 0.3 for all RNN layers. The dimensions of the hidden vectors and embedding vectors are all set to 500. Parameters are optimized by stochastic gradient descent with a learning rate of 1.0.
AttS2S models are trained on the labeled training set, while Semi models also leverage the supplementary unlabeled data in addition to the labeled data. For MAST, we take the Semi models as pre-trained agents and use the labeled data to train the controllers. For SOS, the training set is used to train the classification model, and the development set is used to calculate similarities and performances.
Here we present the results on five style transfer tasks (public Bible versions). Baseline models are AttS2S and Semi. The results of the state-of-the-art unsupervised approach STTBT (https://github.com/shrimai/Style-Transfer-Through-Back-Translation) Prabhumoye et al. (2018) are also provided for comparison. We also implement a multi-task model, MulT, where the models of different tasks share parameters except for the output layers. Two different ways of choosing neighbors, random choice and SOS, are applied to MAST as MAST:Rand-k and MAST:SOS-k, where k stands for the number of neighbors. In this comparison, we set k to 2.
We use BLEU Papineni et al. (2002) for automatic evaluation. BLEU is widely used for evaluating the quality of output texts against given references; higher BLEU scores usually indicate better results. The results of the automatic evaluation are shown in Table 1.
As we can see in Table 1, the fully unsupervised method STTBT performs rather poorly compared to the others, even though the other methods use only a few labeled data pairs. The results of STTBT and AttS2S can be regarded as lower bounds for the semi-supervised methods. MulT achieves some improvements over AttS2S on most tasks, but we observe a serious performance drop on the ASV-WEB task. We believe this is caused by the inconsistency between WEB and the other versions: sharing parameters in a multi-task learning manner amplifies the inconsistency and leads to worse results than AttS2S. Moreover, the results of MulT cannot even beat Semi, which is more relevant and more comparable to the other methods. Therefore, we conduct no further experiments on STTBT and MulT.
We further conduct a human evaluation on Amazon Mechanical Turk (AMT, https://www.mturk.com) to make more convincing comparisons of four typical models. For each style (version) transfer task, we randomly choose 30 samples from the test set and present the input texts, the generated outputs of the different models, and the gold outputs to three judges. Each judge is required to rate the generated texts of all models for each sample in three aspects on a 5-point Likert scale (https://en.wikipedia.org/wiki/Likert_scale): fluency, accuracy, and style. Fluency indicates whether the output text is fluent and grammatical. Accuracy indicates whether the output text is semantically consistent with the input text. Style indicates how likely the output text is to come from the target version. We require the judges to have read at least one version of the Bible, so that they can understand the contents and spot differences between versions. The average scores and a statistical significance analysis are presented in Table 2. We also present some samples from the test set in Figure 3.
(Scores are marked in Table 2 as significantly better, under a two-tailed t-test, than AttS2S, Semi, and MAST:Rand-2, respectively.)
From the comparison of Semi and AttS2S in Table 1 and Table 2, we find that the Semi models with DAE and back-translation do help improve the performance of the basic AttS2S models. Across all the results, our models MAST:Rand-2 and MAST:SOS-2 generally make significant improvements, and MAST:SOS-2 outperforms AttS2S and Semi by 6.57 and 2.94 BLEU points on average, respectively. MAST:Rand-2 seems to be unsteady, sometimes a little worse than AttS2S and much worse than Semi, while MAST:SOS-2 is almost always much better than the baseline models. This shows the effectiveness of our proposed SOS algorithm. However, in the human evaluation results, MAST:SOS-2 is not always significantly better than MAST:Rand-2. Considering that the space for choosing neighbors is rather small in these experiments, we extend the dataset with 10 more versions.
In this section, we extend the experiments to 15 style (version) transfer tasks, where there are 14 candidate neighbors for each agent. As BLEU has shown good consistency with the human evaluation results in the previous experiments, we only use BLEU for automatic evaluation. The results are shown in Table 3. As so many numbers are hard to compare, we also present color scales in Figure 4 according to the values in Table 3. (The neighbor sets are different from those in the previous experiments, so some results can differ from those in Table 1.)
From Table 3 and Figure 4, we clearly see that the performance of MAST:Rand-k is erratic. Although some results of MAST:Rand-k reach the state of the art, many are even worse than the Semi models they are based on. After all, the neighbors of these agents are chosen randomly, so the erratic results are to be expected. On the contrary, the results of MAST:SOS-k are much better: in Figure 4, they are almost all darker (better) than the others. Moreover, with more neighbors, the results become steadier and more significant. For example, MAST:SOS-4 outperforms AttS2S by 8.73 BLEU points on average, and it also outperforms Semi by 5.15 BLEU points on average.
By comparing the results of MAST:SOS-2 in Table 1 and Table 3, we find that the number of neighbor candidates influences the performance of our system. We conduct further experiments to verify this. In this set of experiments, we vary the versions used, from the 5 public versions (BBE, DARBY, DRA, WEB, YLT) to all 15 versions, adding additional versions two at a time. As the 5 basic versions exist in all the experiments, we regard the results of MAST:SOS-2 in Table 1 as the initial results and calculate the average improvement over them for each following experiment of MAST:SOS-2. The trend is shown in Figure 5: more candidates tend to help our system achieve better results.
We vary the number of labeled data pairs from 0.5k to 5k to explore whether the system still provides significant improvements under different data settings. We run experiments on the 5 basic versions (BBE, DARBY, DRA, WEB, YLT) and measure the average BLEU scores of the major models in this study (AttS2S, Semi, and MAST:SOS-2). As we can see in Figure 6, our system performs substantially better than the other models in truly low-resource settings, which is very promising for practical use.
In this work, we propose a multi-agent system for addressing multiple style transfer tasks with limited labeled data. We design a semi-supervised model to leverage unlabeled data for one-to-one style transfer, and propose MAST for multiple-style transfer. In MAST, SOS picks useful neighbors under limited resources, and MAT provides controllers for model integration. The design takes into account practicality and extensibility. Comprehensive experiments demonstrate the effectiveness of our system.
In future work, we plan to explore other algorithms to find neighbor agents more accurately. Moreover, the controllers we use in MAST admit many alternative designs, and the basic agents can be improved with other techniques to better support MAST.
This work was supported by Key Laboratory of Science, Technology and Standard in Press Industry (Key Laboratory of Intelligent Press Media Technology). Special thanks to Zhiwei Yu for her insightful suggestions. Xiaojun Wan is the corresponding author.
Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP 2015, pages 1412-1421.