A time-honored way to nudge human creativity is to structure generation around the idea of variation, from literary pastiches to variations in classical music or the concept of jazz standards. Variation is then used primarily as an inspiration device, where it is not necessary to stick too closely to the original template. Artificial text style transfer can similarly act as a loosely constrained generative device, to combat monotony by generating more variations of a given piece of text, or to avoid blandness through anchoring on an interesting original. Within that framing, it is more important to be able to generate richer variations than to strictly preserve content.
Most existing text style transfer work has focused on a narrow set of applications where the attributes of interest have a very limited set of discrete possible values, e.g., two valences of reviews (positive and negative), three different writing styles [example], or five types of restaurant cuisines (Lample et al., 2019). This is well suited to applications where style transfer has to adhere closely to its input (e.g., editing text to make it more formal or business-like), but less so when the emphasis is on creativity rather than faithfulness to the original. In this work, we propose a new approach that allows for text generation conditioned on a much richer and finer-grained specification of target attributes, by leveraging distributed representations pre-trained through a separate supervised classification task. By specifying attributes through continuous distributed representations, we show that our architecture allows for fine-grained conditioned text generation that can match new attribute targets unseen during training, or attribute targets implicitly specified through text, which may not precisely match any of the discrete labels originally used to define the attribute space.
This work thus makes the following contributions: first, we propose a method that allows transfer to a much larger set of fine-grained styles without requiring additional optimization during inference. Second, we show how this method can be used to perform zero-shot style transfer to new styles unseen during the style transfer training, through leveraging a joint underlying lower-dimensional style embedding space. Third, we show how fine-tuning a pre-trained attribute control architecture affords control over a different but related attribute space.
2 Related work
Many earlier approaches to text style transfer rely on a disentangling objective seeking to extract a representation from which the original style is hard to recover (Lample et al., 2017b). However, recent work has shown that this disentanglement was neither empirically achieved, nor necessary (Lample et al., 2019). In this work, we do not use any disentanglement objective either.
Style transfer can be viewed as translation from one style to another. Recent strides in unsupervised translation have led to a body of work adapting machine translation techniques to style transfer (Prabhumoye et al., 2018; Lample et al., 2019; Zhang et al., 2018). This work follows that approach and uses an architecture very similar to that of Lample et al. (2019).
When used to generate a richer set of alternatives, style transfer can be viewed as a controlled text generation technique with a particularly strong conditioning anchor. The recently released CTRL model (Keskar et al., 2019) allows for generation based on control codes such as a specific website link, which are used as a pre-pended token. The style attribute is similarly specified here by providing an initial token to the model to specify the target attribute, but the generated text is also conditioned much more strongly on a source sentence, as was done in Lample et al. (2019).
In this work, we instead propose to decouple the classifier from the style transfer architecture by merely using the classifier to produce a distributed representation of the target attribute, so that existing pre-trained supervised representations can be re-used. This would allow for our method to be applied to any type of consistent distributed embedding space (e.g., pre-trained unsupervised fastText embeddings (Joulin et al., 2016)).
3 Specifying target attributes as distributed continuous representations
Our approach relies on an autoencoder architecture similar to that of Lample et al. (2019), modified to leverage consistent pre-trained distributed continuous representations of attributes. This section presents the notation and base architecture before introducing our key modification to leverage embeddings.
3.1 Base architecture
This section briefly introduces the architecture and training objective of Lample et al. (2019), which we use as the basis for our style transfer system.
Let $\mathcal{D} = \{(x_i, y_i)\}_{i \in [1, n]}$ be a training set of sentences $x_i$ paired with source attribute values $y_i$. Each $y_i$ is a discrete attribute value in the set $\mathcal{Y}$ of possible values for the attribute being considered, e.g. $\mathcal{Y} = \{1, \dots, 5\}$ if $y$ represents the overall rating of a restaurant review. In this work, we only consider transfer of a single attribute, but our approach could easily be extended to multiple attributes using an attribute embedding averaging heuristic, as in Lample et al. (2019).
The style transfer architecture consists of a model that maps any pair $(x, \tilde{y})$ of a source sentence $x$ (whose source attribute value is $y$) and a target attribute value $\tilde{y}$ to a new sentence $\tilde{x}$ that has the target attribute value $\tilde{y}$, while striving to remain as close as possible to $x$ and to be fluent English. This is achieved by training a sequence-to-sequence auto-encoder as a denoising auto-encoder, with an added back-translation objective to ensure transfer to the target attribute.
The input $x$ is encoded into a latent representation $z = e(x)$, which is then decoded into $\tilde{x} = d(z, \tilde{y})$, where the parameters of the encoder $e$ and decoder $d$ are trainable, and the target attribute value $\tilde{y}$ can be a different value, or the original value $y$ when reconstructing without modifying the attribute.
In order to retain fluency and the ability to reconstruct well without merely copying, the architecture is trained with a denoising auto-encoding objective (Fu et al., 2017):

$$\mathcal{L}_{AE} = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[-\log p_d\big(x \mid e(x_c), y\big)\right],$$

where $x_c$ denotes a corrupted version of the input $x$ and $p_d$ is the probability distribution over sequences induced by the decoder.
The decoder is encouraged to leverage the provided target attribute through a back-translation loss (Sennrich et al., 2015; Lample et al., 2017a, 2018; Artetxe et al., 2018): the input $x$ is encoded into $z = e(x)$, but then decoded using target attribute value $\tilde{y}$, yielding the reconstruction $\tilde{x} = d(e(x), \tilde{y})$. $\tilde{x}$ is in turn used as input to the encoder and decoded using the source attribute value $y$ to ideally recover the source $x$, and we train the model to map $\tilde{x}$ back into $x$. The back-translation objective is thus written:

$$\mathcal{L}_{BT} = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[-\log p_d\big(x \mid e(\tilde{x}), y\big)\right],$$

where $\tilde{x} = d(e(x), \tilde{y})$ is a variation of the input sentence $x$ written with a randomly sampled target attribute value $\tilde{y}$ that is specified according to the procedure described in Sec. 3.2. Back-translated sentences are generated on the fly during training by greedy decoding at each time step.
The system is trained by combining the denoising auto-encoding and back-translation losses:

$$\mathcal{L} = \lambda_{AE}\,\mathcal{L}_{AE} + \lambda_{BT}\,\mathcal{L}_{BT},$$

where $\lambda_{AE}$ and $\lambda_{BT}$ weight the relative contribution of each objective.
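As a toy numerical illustration of the two objectives and their combination, the sketch below uses small linear maps in place of the LSTM encoder and decoder, and mean squared error in place of the decoder's log-likelihood; the shapes and weights are illustrative assumptions, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 8
W_enc = rng.normal(size=(DIM, DIM)) / DIM
W_dec = rng.normal(size=(DIM, 2 * DIM)) / DIM  # decoder sees [z; attribute]

def encode(x):
    return np.tanh(W_enc @ x)

def decode(z, attr):
    return np.tanh(W_dec @ np.concatenate([z, attr]))

def corrupt(x):
    # analogue of the word-drop / shuffle noise of a denoising auto-encoder
    mask = rng.random(DIM) > 0.2
    return x * mask

def mse(a, b):
    return float(np.mean((a - b) ** 2))

x = rng.normal(size=DIM)
y_src = rng.normal(size=DIM)   # source attribute embedding
y_tgt = rng.normal(size=DIM)   # randomly sampled target attribute embedding

# Denoising auto-encoding loss: reconstruct x from its corrupted encoding.
loss_ae = mse(decode(encode(corrupt(x)), y_src), x)

# Back-translation loss: transfer to y_tgt, then reconstruct x with y_src.
x_bt = decode(encode(x), y_tgt)          # generated on the fly during training
loss_bt = mse(decode(encode(x_bt), y_src), x)

lambda_ae, lambda_bt = 1.0, 1.0
total_loss = lambda_ae * loss_ae + lambda_bt * loss_bt
```

The two terms pull in different directions: the first keeps reconstructions faithful and fluent, while the second forces the decoder to actually use the attribute input.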
Architecture building blocks
The encoder is a 2-layer bidirectional LSTM using word embedding look-up tables trained from scratch. The decoder is a 2-layer LSTM augmented with an attention mechanism (Bahdanau et al., 2014). All the embedding and hidden layer dimensions are 512, including the attribute embedding obtained as explained in Section 3.2. Decoding is conditioned on both that attribute embedding, which is provided as the first token embedding, similar to Lample et al. (2018), and on a representation of the input obtained from the encoder with an attention mechanism.
3.2 Leveraging pre-trained distributed continuous representations
Lample et al. (2019) specify the target attribute as an embedding read from a lookup table that is optimized during training. This means that each target attribute value has its own entry, and precludes leveraging known similarities between target attribute values.
Instead, we propose to write the target embedding as the product $a(\tilde{y}) = W c(\tilde{y})$ of an existing distributed embedding $c(\tilde{y})$ and a trainable weight matrix $W$. The motivation is that pre-trained distributed embeddings encode similarities between attribute values that can be learned from other tasks (e.g., supervised classification) and directly leveraged for style transfer.
In this work, we obtain the embedding $c$ by running some text possessing the desired target attribute value through a feedforward classifier. We experiment with a fastText classifier (Joulin et al., 2016) and a classifier derived from BERT (Devlin et al., 2018) with an added bottleneck layer, and use the last hidden layer, whose dot-product with class embeddings determines which class is selected. The dimension of that layer is arbitrary; preliminary experiments showed better training with smaller dimensions, so in the remainder of the paper we set the supervised embedding dimension to 8. Thus, the weight matrix $W$ is of dimension $512 \times 8$. Note that the base style transfer architecture adapted from Lample et al. (2019) for $n$ possible attribute values would correspond to $W$ being a look-up table of dimension $512 \times n$, with a one-hot encoding of each attribute value instead of the supervised distributed embeddings used here.
During training, randomly selected samples from the training set are run through the classifier to obtain a fine-grained continuous distributed target embedding, which is scaled to unit norm and used as the target attribute value for the back-translation loss. For validation and for measuring transfer accuracy, class embeddings are used instead, also scaled to unit norm.
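A minimal sketch of this target-embedding computation follows; the classifier here is a random stand-in for the pre-trained bottleneck fastText/BERT classifier, and only the dimensions and the unit-norm scaling follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

BOTTLENECK_DIM = 8   # last hidden layer of the supervised classifier (Sec. 3.2)
ATTR_DIM = 512       # attribute embedding dimension fed to the decoder

def classifier_embedding(text):
    # Stand-in for the pre-trained classifier's bottleneck activation;
    # scaled to unit norm, as done for samples and class embeddings alike.
    h = rng.normal(size=BOTTLENECK_DIM)
    return h / np.linalg.norm(h)

# Weight matrix mapping the 8-dim classifier space to the 512-dim attribute
# embedding space, learned jointly with the style transfer auto-encoder.
W = rng.normal(size=(ATTR_DIM, BOTTLENECK_DIM))

def target_attribute_embedding(text):
    return W @ classifier_embedding(text)

a = target_attribute_embedding("so pumped to pick this up!")
```

Because the decoder only ever sees $W$ times a point in the classifier's embedding space, any text (or any class embedding) that lands in that space can specify a target attribute, which is what enables the zero-shot transfer of later sections.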
4 Experiments in original fine-grained attribute space
We demonstrate the technique using a set of fine-grained sentiment labels such as happy, curious, angry, hopeful, sad, thankful, etc. (see full list in Table 1).
|Base task||aggravated, angry, annoyed, confused, curious, delighted, ecstatic, emotional, fabulous, fantastic, frustrated, grateful, happy, heartbroken, hopeful, irritated, joyful, overwhelmed, perplexed, pumped, sad, shocked, sleepy, thankful|
|ED task||afraid, angry, annoyed, anticipating, anxious, apprehensive, ashamed, caring, confident, content, devastated, disappointed, disgusted, embarrassed, excited, faithful, furious, grateful, guilty, hopeful, impressed, jealous, joyful, lonely, nostalgic, prepared, proud, sad, sentimental, surprised, terrified, trusting|
The choice of fine-grained sentiment as set of attributes is motivated by the richness of the attribute space, for which large labelled datasets are available (e.g., Li et al. (2017); Rashkin et al. (2019)), while also being in continuity with the use of sentiment as style in much of the text style transfer literature.
We train a sentiment classifier over 24 sentiments using an unreleased dataset of millions of samples of social media content written by English speakers, each carrying a writer-assigned sentiment tag. To make our work reproducible, we then select training data from publicly available sources: starting from a Reddit dump collected and published by a third party, we use that classifier to select a subset of millions of posts matching each of the 24 sentiment labels of interest. A new classifier is then trained from scratch on that data to provide the target embeddings, and the initial classifier is discarded.

We pick 24 sentiment labels to demonstrate fine-grained transfer to a larger set of possible labels than in previous work, which usually limits transfer to a handful of possible attribute values. The set of 24 labels (see Table 1) is selected by keeping sentiment labels that have reasonable-looking matches among the Reddit posts from the third-party dump, after a quick manual inspection of random samples to determine which labels to keep and what score threshold to use when deciding which posts to retain. Posts that score above those thresholds are run through the safety classifier from Dinan et al. (2019) to remove offensive or toxic content, and the English-language classifier from fastText (Joulin et al., 2016) to remove non-English content. We also remove content that contains URLs or images. The remaining data comprises between 22k and 11M examples per sentiment label, and data from each label is sampled in a balanced way during training. The final data consists of a train set of 31M labeled samples, with an additional 730k samples each for the validation and test sets.
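The per-post filtering logic described above can be sketched as a single predicate; the scores, thresholds, and flags here are stand-ins for the outputs of the real sentiment classifier, the safety classifier (Dinan et al., 2019), and the fastText language identifier.

```python
import re

def keep_post(text, sentiment_score, threshold, is_offensive, lang):
    # Per-label score threshold chosen via manual inspection of samples.
    if sentiment_score < threshold:
        return False
    # Safety classifier filter: drop offensive or toxic content.
    if is_offensive:
        return False
    # Language identification filter: keep English posts only.
    if lang != "en":
        return False
    # Drop posts containing URLs (image filtering is handled analogously).
    if re.search(r"https?://", text):
        return False
    return True

kept = keep_post("what a great day", 0.9, 0.5, False, "en")
```

Posts passing all four checks for some label are added to that label's pool, from which training batches are later sampled in a balanced way.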
Following Lample et al. (2019), we use three automated metrics to measure target attribute control, fluency, and content preservation:
Attribute control: measured by a fastText or BERT classifier trained to predict attribute values. This classifier does not have the low-dimensional bottleneck of the classifier used to produce the target attribute embedding, as classification is more accurate with larger dimensions.
Fluency: measured by the perplexity assigned to generated text sequences by an LSTM language model trained on the third-party Reddit training data.
Content preservation: measured by the self-BLEU score of the generated text with respect to the source sentence.
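The fluency metric reduces to computing perplexity from the language model's per-token log-probabilities; the sketch below is generic and not tied to the specific LSTM model used here.

```python
import math

def perplexity(token_logprobs):
    # Perplexity of a sequence under a language model: the exponential
    # of the negative mean per-token log-probability.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A sequence in which every token has probability 0.25 has perplexity 4.
ppl = perplexity([math.log(0.25)] * 4)
```

Lower perplexity indicates text the language model finds more predictable, i.e. more fluent by this proxy.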
The best trade-off between those three aspects of transfer is dependent on the desired application. If the goal is to generate new utterances for a retrieval system in a conversation while keeping them from being bland or too repetitive through anchoring on a source utterance, in a manner reminiscent of the retrieve-and-refine approach (Weston et al., 2018), fluency and attribute control would matter more than content preservation. If the goal is to stick as close to the source sentence as possible and say the same things another way, which is better defined for language types (e.g., casual vs. formal) than for sentiment, then content preservation would matter more, but in a way that self-BLEU might not be sophisticated enough to capture.
Hyperparameters are picked by looking at performance over the validation set, using self-BLEU and transfer control. We also experimented with pooling (as in Lample et al. (2019)) and sampling with a temperature instead of greedy decoding, as well as larger bottleneck dimensions, but these all resulted in worse performance on the datasets we use here. Evaluation is performed by running style transfer on all non-matching combinations of source and target labels, on up to 900 source sequences per source label. Results are reported using source sentences from the test set.
|source||it is annoying how Meme has already changed meanings…|
|Model 2||it is fantastic football Meme has already changed meanings…|
|Model 4||it is fantastic =D|
|source||I wish people would stop making right-handed Link pics.|
|Model 2||Fantastic show in right-handed Link pics.|
|Model 4||I think this is fantastic and Star Wars videos…|
4.3 Fine-grained style transfer
We first use our system to demonstrate successful transfer over a large number of fine-grained attribute values. Results in Table 3 show that training achieves very good accuracy while maintaining reasonable self-BLEU scores and perplexity similar to the average perplexity of reference sentences. Classification accuracy to the source attribute for the identity baseline is slightly lower than classification accuracy to the target attribute for the target baseline, because the former uses test set examples, which were not seen by the classifier. Example generations are given in Table 4; there, four sentiment classes are held out during training, but training is otherwise similar.
|grateful||I appreciate him. And I love him.|
|angry||I hate him. And I am angry about him.|
|hopeful||I would love him. And I hope it’s true.|
|sad||I miss him. And I liked him.|
|thankful||I have seen him. And thanks for doing that.|
|hopeful||I hope I’m not too late to the party.|
|angry||I am so angry I’m not too late to the party.|
|curious||I wonder if I’m not too late to the party.|
|ecstatic||I am ecstatic I’m not too late to the party.|
|happy||I am happy I’m not too late to the party.|
|pumped||Thank you! So pumped to pick this up!|
|curious||Am I the only one who didn’t pick this up?|
|frustrated||Of course it would be hard to pick this up!|
|hopeful||Any chance I can pick this up?|
|shocked||But she was shocked when she found out what’d happened.|
|angry||But she was so angry when she found out what’d happened.|
|curious||Do you know if she found out what’d happened.|
|delighted||Hey she laughed when she found out what’d happened.|
|ecstatic||Absolutely ecstatic when she found out what’d happened.|
|emotional||But she cried when she found out what’d happened.|
|thankful||Thank you, she was looking forward to something like what’d happened.|
4.4 Zero-shot style transfer to unseen attribute values
Limiting the capacity of the attribute value representations through a low-dimensional bottleneck may make it easier for the auto-encoder to learn to generalize over the embedding space as a whole, beyond the specific combinations of sentiment labels seen during training. To check whether transfer can indeed generalize to unseen sentiment labels, we train a system on 20 of the 24 sentiment labels, holding out 4 labels that are seen by the classifier (shown in italics in Table 1) but not by the style-transfer auto-encoder during training. We then evaluate transfer to these unseen classes. Results in Table 5 show that transfer to these unseen classes is still largely successful, with the target class being picked more than half the time out of 24 possible classes. However, transfer to these held-out classes remains less successful than transfer to the classes seen during training. Examples of transfer to unseen classes are given at the bottom of Table 4.
5 Transferring to a new, related attribute space
Training the style transfer architecture requires millions of training examples. In this section, we examine whether it is possible to leverage pre-training on a given sentiment transfer task, and then transfer that training to an attribute transfer task with a training set orders of magnitude smaller, as long as the attribute space is related. (Note that transfer is used here first in the context of transfer learning, then in the context of style transfer.)
The dataset we use here to examine transfer to a related task is the EmpatheticDialogues dataset (Rashkin et al., 2019), which comprises about 25k dialogues accompanied by a situation description of a few sentences, and a sentiment label belonging to a list of 32, some of which are also in the list of 24 from the first task (e.g., angry, grateful, joyful, as shown in Table 1). We use the situation descriptions and sentiment labels, not the dialogues.
We perform evaluation using the same metrics as before. The classification task over the EmpatheticDialogues labels is harder overall, both because there are more labels and, more importantly, because the dataset has not been pre-filtered by a classifier in the way the base training dataset was selected from the third-party Reddit dump. Thus, classification metrics (shown in Table 7) are lower across the board, with the upper bound being the 56.5% source-classification accuracy of the identity baseline. The language in EmpatheticDialogues is also easier to predict than that of Reddit, resulting in lower perplexity scores.
|source||I come home from work and my parents are always arguing. It frustrates me.|
|Scratch||I have a big presentation at work that I am really looking forward to it.|
|Zero-shot||I come home from her and my parents are always arguing. It compliments me.|
|Fine-tuned||I come home from work and my parents are always studing. I am so content with my wife.|
|source||My boss made me work overtime yesterday and I didn’t even get paid for it!|
|Scratch||My husband and I went on a vacation trip to New York. I was not expecting it|
|Zero-shot||My boss made it overtime kicked and I didn’t even get arrested for it!|
|Fine-tuned||My boss made me work yesterday. Everything I had is going well now.|
5.2 Transfer experiments
We compare three different approaches to perform attribute control anchored in this new dataset.
Training from scratch
The EmpatheticDialogues dataset has only 25k situation descriptions, and is therefore too small to allow for successful training of the transfer architecture from scratch. To show this, we perform training exactly as in the previous section, but using only data from the 25k situation descriptions. Results in Table 7 show that the system learns adequate attribute control, but ignores the source sequence.
Zero-shot transfer
The "zero-shot" approach to task transfer requires mapping the new attribute space to the old one, so as to specify the new desired targets in the embedding space understood by the model. To see whether this can work without any fine-tuning, we train a logistic regression layer from the previous Reddit sentiment embedding space to the new attribute space, and use the learned attribute embeddings to specify the new target attributes. Attribute control is performed as before, using a style transfer architecture trained on 20 sentiment labels (so as to allow comparison with transfer to a held-out sentiment label from the same data), but the attribute targets, source sequences, and label classifiers are all from the EmpatheticDialogues dataset. This approach performs very poorly, as shown in Table 7. This is not surprising: the low-dimensional embedding space for the original sentiment labels is trained to represent sentiment information from conversational posts that are quite removed from the task of inferring the sentiment felt in a situation description, and may simply have lost too much information to adequately infer sentiment in this new context. In fact, the accuracy of the logistic regression classifier used to map the new sentiment labels to the old space is below 18% (on the test set), compared to over 50% achieved by a bottleneck BERT-based classifier trained on that data in raw text form.
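The label-mapping step can be sketched as follows: a softmax (logistic) regression is fit from the old 8-dimensional sentiment embedding space to the new labels, and its learned class weight vectors, scaled to unit norm, serve as target attribute embeddings in the space the pre-trained model already understands. The data below is random filler; real inputs would be bottleneck embeddings of EmpatheticDialogues situation descriptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_LABELS, DIM, N, LR = 32, 8, 512, 1.0

X = rng.normal(size=(N, DIM))            # old-space (8-dim) embeddings
y = rng.integers(0, N_LABELS, size=N)    # new-task (ED) labels
Y = np.eye(N_LABELS)[y]                  # one-hot targets

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Fit multinomial logistic regression with plain gradient descent.
W = np.zeros((N_LABELS, DIM))
for _ in range(100):
    grad = (softmax(X @ W.T) - Y).T @ X / N
    W -= LR * grad

# Unit-normalized class weight rows act as the new target attribute
# embeddings, one per new label, expressed in the old embedding space.
new_targets = W / np.linalg.norm(W, axis=1, keepdims=True)
```

Each row of `new_targets` can then be fed to the frozen transfer model exactly as a held-out sentiment class embedding would be.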
Fine-tuning
Starting from the same pre-trained architecture as in the zero-shot baseline, we fine-tune the architecture on the situation descriptions from EmpatheticDialogues. This gives the model a chance to adapt to the different language, framing, and attribute space. Results in Table 7 show that fine-tuning reaches reasonable transfer performance. Example generations are shown in Table 8.
|anxious||Waiting for my results|
|anticipating||Waiting for the results to come out.|
|caring||Waiting for my grandmother.|
|joyful||Waiting for my paycheck at the end|
|prepared||Waiting for my exams|
|grateful||My grandfather invited me over and made us an awesome dinner today.|
|hopeful||My grandfather promised to buy me a car as soon as he went on vacation.|
|jealous||My grandfather bought a car and I was pretty envious of him.|
|sad||My grandfather passed away and it was a shock.|
|prepared||I’m going overseas and i’m super ready|
|afraid||I’m going to the doctor on Monday. I hope he does well|
|anticipating||I’m going to eat with some friends tonight. I can’t wait to eat at the university.|
|confident||I’m going to get a new car this year. I just know it|
|content||I’m going overseas and i’m ready to go start my new job.|
|excited||I’m going camping next weekend. I am so stoked!|
|hopeful||I’m going to be able to get my degree next week.|
|jealous||I’m going hiking with another person who is in a relationship.|
|joyful||I’m going overseas and i’m super excited.|
6 Discussion and Conclusion
This work has shown that taking advantage of consistent embedding spaces obtained through a separate task (in this case, supervised classification) makes it possible to achieve reasonable success with zero-shot transfer to classes that were not seen during training or even, with some fine-tuning, transfer to an altogether different attribute space.
When viewed as a method to generate controlled variations of an input text, this style transfer approach paves the way for promising data augmentation methods in which an existing set of retrieval utterances could be augmented to fit specific target styles. Given that retrieval models still perform better than generative models in conversational systems (e.g., see Rashkin et al. (2019)), this would allow combining the flexibility of enhanced fine-grained control with the power of retrieval models, while still avoiding flaws of generative models such as blandness and repetition, similar to the retrieve-and-refine approach (Weston et al., 2018).
Another promising potential use of this style transfer architecture is through the indirect, implicit definition of a style through examples: instead of requiring a label, which could lead to quantization noise when the desired attribute is not an exact match to a pre-defined attribute value, the target attribute representation can be directly inferred from an example text input that conveys the desired style. This would allow mirroring of the style of a text without labeling it, or conversely complementing it by looking at a maximally distant embedding. Our approach would also lend itself well to using un-labelled styles extracted in an unsupervised way, as long as they can be represented in a consistent embedding space.
References

- Unsupervised neural machine translation. In International Conference on Learning Representations (ICLR).
- Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Build it break it fix it for dialogue safety: robustness from adversarial human attack. arXiv preprint arXiv:1908.06083.
- Style transfer in text: exploration and evaluation. arXiv preprint arXiv:1711.06861.
- Structuring latent spaces for stylized response generation. arXiv preprint arXiv:1909.05361.
- Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
- CTRL: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858.
- Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
- Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.
- Multiple-attribute text rewriting. In International Conference on Learning Representations (ICLR).
- Fader networks: manipulating images by sliding attributes. In Advances in Neural Information Processing Systems, pp. 5967–5976.
- DailyDialog: a manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 986–995.
- Revision in continuous space: fine-grained control of text style transfer. arXiv preprint arXiv:1905.12304.
- Style transfer through back-translation. arXiv preprint arXiv:1804.09000.
- Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5370–5381.
- Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 86–96.
- Controllable unsupervised text attribute transfer via editing entangled latent representation. arXiv preprint arXiv:1905.12926.
- Retrieve and refine: improved sequence generation models for dialogue. In Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, Brussels, Belgium, pp. 87–92.
- Style transfer as unsupervised machine translation. arXiv preprint arXiv:1808.07894.