Generative modeling of images and text has seen increasing progress over the last few years. Deep generative models such as variational auto-encoders (Kingma and Welling, 2013), adversarial networks (Goodfellow et al., 2014) and Pixel Recurrent Neural Nets (van den Oord et al., 2016) have driven most of this success in vision. Conditional generative models capable of providing fine-grained control over the attributes of a generated image such as facial attributes (Yan et al., 2016) and attributes of birds and flowers (Reed et al., 2016) have been extensively studied. The style transfer problem which aims to change more abstract properties of an image has seen significant advances (Gatys et al., 2015; Isola et al., 2016).
The discrete and sequential nature of language makes it difficult to approach language problems in a similar manner. Changing the value of a pixel by a small amount has negligible perceptual effect on an image. However, distortions to text are not imperceptible in a similar way and this has largely prevented the transfer of these methods to text.
In this work we consider a generative model for sentences that is capable of expressing a given sentence in a form that is compatible with a given set of conditioning attributes. Applications of such models include conversational systems (Li et al., 2016), paraphrasing (Xu et al., 2012), machine translation (Sennrich et al., 2016), authorship obfuscation (Shetty et al., 2017) and many others. Sequence mapping problems have been addressed successfully with the sequence-to-sequence paradigm (Sutskever et al., 2014). However, this approach requires training pairs of source and target sentences. The lack of parallel data with pairs of similar sentences that differ along certain stylistic dimensions makes this an important and challenging problem.
We focus on categorical attributes of language. Examples of such attributes include sentiment, language complexity, tense, voice, honorifics, mood, etc. Our approach draws inspiration from style transfer methods in the vision and language literature. We enforce content preservation using auto-encoding and back-translation losses. Attribute compatibility and realistic sequence generation are encouraged by an adversarial discriminator. The proposed adversarial discriminator is more data efficient and scales more easily to multiple attributes with several classes than prior methods.
Evaluating models that address the transfer task is also quite challenging. Previous works have mostly focused on assessing the attribute compatibility of generated sentences. These evaluations do not penalize vacuous mappings that simply generate a sentence of the desired attribute value while ignoring the content of the input sentence. This calls for new metrics to objectively evaluate models for content preservation. In addition to evaluating attribute compatibility, we consider new metrics for content preservation and generation fluency, and evaluate models using these metrics. We also perform a human evaluation to assess the performance of models along these dimensions.
We also take a step forward and consider a writing style transfer task for which parallel data is available. Evaluating the model on parallel data assesses it in terms of all properties of interest: generating content and attribute compatible, realistic sentences. Finally, we show that the model is able to learn to control multiple attributes simultaneously. To our knowledge, we demonstrate the first instance of learning to modify multiple textual attributes of a given sentence without parallel data.
2 Related Work
Conditional Text Generation Prior work has considered controlling aspects of generated sentences in machine translation such as length (Kikuchi et al., 2016), voice (Yamagishi et al., 2016), and honorifics/politeness (Sennrich et al., 2016). Kiros et al. (2014) use multiplicative interactions between a word embedding matrix and learnable attribute vectors for attribute conditional language modeling. Radford et al. (2017) train a character-level language model on Amazon reviews using LSTMs (Hochreiter and Schmidhuber, 1997) and discover that the LSTM learns a ‘sentiment unit’. By clamping this unit to a fixed value, they are able to generate label conditional paragraphs.
Hu et al. (2017) propose a generative model of sentences which can be conditioned on a sentence and attribute labels. The model has a VAE backbone which attempts to express holistic sentence properties in its latent variable. A generator reconstructs the sentence conditioned on the latent variable and the conditioning attribute labels. Discriminators are used to ensure attribute compatibility. Training sequential VAE models has proven to be very challenging (Bowman et al., 2015; Chen et al., 2016) because of the posterior collapse problem. Annealing techniques are generally used to address this issue. However, reconstructions from these models tend to differ from the input sentence.
Style Transfer Recent approaches have proposed neural models learned from non-parallel text to address the text style transfer problem. Li et al. (2018) propose a simple approach to perform sentiment transfer and generate stylized image captions. Words that capture the stylistic properties of a given sentence are identified and masked out, and the model attempts to reconstruct the sentence using the masked version and its style information. Shen et al. (2017) employ adversarial discriminators to match the distribution of decoder hidden state trajectories corresponding to real and synthetic sentences specific to a certain style. Prabhumoye et al. (2018) assume that translating a sentence to a different language alters the stylistic properties of a sentence. They adopt an adversarial training approach similar to Shen et al. (2017) and replace the input sentence with a back-translated sentence obtained from a machine-translation system.
To encourage generated sentences to match the conditioning stylistic attributes, prior discriminator based approaches train a classifier or adversarial discriminator specific to each attribute or attribute value. In contrast, our proposed adversarial loss involves learning a single discriminator which determines whether a sentence is both realistic and is compatible with a given set of attribute values. We demonstrate that the model can handle multiple attributes simultaneously, while prior work has mostly focused on one or two attributes, which limits their practical applicability.
Unsupervised Machine Translation There is growing interest in discovering latent alignments between text from multiple languages. Back-translation is commonly used in this context: mapping a sentence from a source domain to a target domain and then mapping it back should reproduce the original sentence. He et al. (2016) attempt to use monolingual corpora for machine translation. They learn a pair of translation models, one in each direction, and the model is trained via policy gradients using reward signals coming from pre-trained language models and a back-translation constraint. Artetxe et al. (2017) propose a sequence-to-sequence model with a shared encoder, trained using a de-noising auto-encoding objective and an iterative back-translation based training process. Lample et al. (2017) adopt a similar approach but with an unshared encoder-decoder pair. In addition to de-noising and back-translation losses, adversarial losses are introduced to learn a shared embedding space, similar to the aligned-autoencoder of Shen et al. (2017). While the auto-encoding loss and back-translation loss have been used to encourage content preservation in prior work, we identify shortcomings with these individual losses: auto-encoding prefers the copy solution and back-translated samples can be noisy or incorrect. We propose a reconstruction loss which interpolates between these two losses to reduce the sensitivity of the model to these issues.
Suppose we have $K$ attributes of interest $\{a_1, \ldots, a_K\}$. We are given a set of labelled sentences $\mathcal{D} = \{(x^n, l^n)\}_{n=1}^{N}$, where $l^n$ is a set of labels for a subset of the attributes. Given a sentence $x$ and attribute values $l' = (l_1, \ldots, l_K)$, our goal is to produce a sentence that shares the content of $x$ but reflects the attribute values specified by $l'$ (figure 1). In this context, we define content as the information in the sentence that is not captured by the attributes. We use the term attribute vector to refer to a binary vector representation of the attribute labels. This is a concatenation of one-hot vector representations of the attribute labels.
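To make the attribute vector concrete, the following is a minimal sketch of the concatenation of one-hot vectors described above. The function name, the `SCHEMA` layout and the example attributes are hypothetical and chosen only for illustration; the sketch also assumes every attribute has a label.

```python
def attribute_vector(labels, schema):
    """Concatenate one-hot encodings of each attribute label.

    `schema` is a list of (attribute_name, [possible values]) pairs and
    `labels` maps attribute_name -> chosen value. Both layouts are
    assumptions made for this sketch.
    """
    vec = []
    for name, values in schema:
        one_hot = [0] * len(values)
        one_hot[values.index(labels[name])] = 1  # raises if the value is unknown
        vec.extend(one_hot)
    return vec


# Hypothetical schema, for illustration only.
SCHEMA = [
    ("sentiment", ["negative", "positive"]),
    ("tense", ["past", "present", "future"]),
]

example = attribute_vector({"sentiment": "positive", "tense": "past"}, SCHEMA)
```

Each attribute contributes exactly one active bit, so the vector length is the total number of classes across attributes.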
3.1 Model Overview
We denote the generative model by $G$. We want $G$ to use the conditioning information effectively, i.e., $G$ should generate a sentence that is closely related in meaning to the input sentence and is consistent with the attributes. We design $G$ as an encoder-decoder model. The encoder is an RNN that takes the words of the input sentence $x$ as input and produces a content representation $z_x = G_{enc}(x)$ of the sentence. Given a set of attribute values $l'$, a decoder RNN generates a sequence $y \sim p_G(\cdot \mid z_x, l')$ conditioned on $z_x$ and $l'$.
3.2 Content compatibility
We consider two types of reconstruction losses to encourage content compatibility.
Autoencoding loss Let $x$ be a sentence and the corresponding attribute vector be $l$. Let $z_x = G_{enc}(x)$ be the encoded representation of $x$. Since sentence $x$ should have high probability under $p_G(\cdot \mid z_x, l)$, we enforce this constraint using an auto-encoding loss.
$$\mathcal{L}^{ae}(x, l) = -\log p_G(x \mid z_x, l)$$
Back-translation loss Consider $l'$, an arbitrary attribute vector different from $l$ (i.e., $l'$ corresponds to a different set of attribute values). Let $y \sim p_G(\cdot \mid z_x, l')$ be a generated sentence conditioned on $x$ and $l'$. Assuming a well-trained model, the sampled sentence $y$ will preserve the content of $x$. In this case, sentence $x$ should have high probability under $p_G(\cdot \mid z_y, l)$, where $z_y = G_{enc}(y)$ is the encoded representation of sentence $y$. This requirement can be enforced in a back-translation loss as follows.
$$\mathcal{L}^{bt}(x, l) = -\log p_G(x \mid z_y, l)$$
A common pitfall of the auto-encoding loss in auto-regressive models is that the model learns to simply copy the input sequence without capturing any informative features in the latent representation. A de-noising formulation is often considered where noise is introduced to the input sequence by deleting, swapping or re-arranging words. On the other hand, the generated sample $y$ can be mismatched in content from the input $x$ during the early stages of training, so that the back-translation loss can potentially misguide the generator. We address these issues by interpolating the latent representations of the ground truth sentence $x$ and the generated sentence $y$.
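Interpolated reconstruction loss We merge the autoencoding and back-translation losses by fusing the two latent representations $z_x$ and $z_y$. We consider $z_{xy} = g \odot z_x + (1 - g) \odot z_y$, where $g$ is a binary random vector of values sampled from a Bernoulli distribution with parameter $\Gamma$. We define a new reconstruction loss which uses $z_{xy}$ to reconstruct the original sentence.
$$\mathcal{L}^{int} = \mathbb{E}_{(x,l) \sim p_{data},\; y \sim p_G(\cdot \mid z_x, l')} \left[ -\log p_G(x \mid z_{xy}, l) \right]$$
Note that $\mathcal{L}^{int}$ degenerates to the auto-encoding loss $\mathcal{L}^{ae}$ when $g_i = 1$ for all $i$, and to the back-translation loss $\mathcal{L}^{bt}$ when $g_i = 0$ for all $i$. The interpolated content embedding makes it harder for the decoder to learn trivial solutions since it cannot rely on the original sentence alone to perform the reconstruction. Furthermore, it also implicitly encourages the content representations $z_x$ and $z_y$ of $x$ and $y$ to be similar, which is a favorable property of the encoder.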
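The latent interpolation can be sketched in a few lines. This is a minimal illustration, assuming the two content representations are plain lists of floats; the names `interpolate_latents`, `z_x`, `z_y` and `gamma` are chosen here for illustration and are not taken from any released code.

```python
import random


def interpolate_latents(z_x, z_y, gamma, rng=random):
    """Mix two latent vectors coordinate-wise with a Bernoulli(gamma) mask.

    Returns (z_xy, g) where g[i] = 1 selects z_x[i] and g[i] = 0 selects z_y[i].
    """
    if len(z_x) != len(z_y):
        raise ValueError("latent vectors must have the same size")
    # rng.random() lies in [0, 1), so gamma=1 always picks z_x and gamma=0 always picks z_y.
    g = [1 if rng.random() < gamma else 0 for _ in z_x]
    z_xy = [gi * a + (1 - gi) * b for gi, a, b in zip(g, z_x, z_y)]
    return z_xy, g
```

With `gamma=1.0` the decoder sees only the encoding of the original sentence (the auto-encoding case), and with `gamma=0.0` only the encoding of the generated sentence (the back-translation case); intermediate values mix the two.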
3.3 Attribute compatibility
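We consider an adversarial loss which encourages generating realistic and attribute compatible sentences. The adversarial loss tries to match the distribution of sentence and attribute vector pairs $(s, a)$, where the sentence can either be a real or generated sentence. Let $h_x$ and $h_y$ be the decoder hidden-state sequences corresponding to the real sentence $x$ and the generated sentence $y$ respectively. We consider an adversarial loss of the following form, where $D$ is a discriminator. Sequence $h_x$ is held constant and $l' \neq l$.
$$\mathcal{L}^{adv} = \min_G \max_D \; \mathbb{E}_{(x,l) \sim p_{data},\; y \sim p_G(\cdot \mid z_x, l')} \left[ \log D(h_x, l) + \log\left(1 - D(h_y, l')\right) \right]$$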
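It is possible that the discriminator ignores the attributes and makes the real/fake decision based on just the hidden states, or vice versa. To prevent this situation, we consider additional fake pairs similar to Reed et al. (2016) where we consider a real sentence and a mismatched attribute vector, and encourage the discriminator to classify these pairs as fake. The new objective takes the following form.
$$\mathcal{L}^{adv} = \min_G \max_D \; \mathbb{E}_{(x,l) \sim p_{data},\; y \sim p_G(\cdot \mid z_x, l')} \left[ 2 \log D(h_x, l) + \log\left(1 - D(h_y, l')\right) + \log\left(1 - D(h_x, l')\right) \right]$$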
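The effect of the mismatched pairs can be illustrated with a small sketch of the quantity the discriminator maximizes for a single example. The function name and arguments are hypothetical, and the weight on the real term is an assumption of this sketch (set to 2 so that one real pair balances the two kinds of fake pairs).

```python
import math


def discriminator_objective(d_real, d_generated, d_mismatched, real_weight=2.0):
    """Per-example value the discriminator tries to maximize.

    d_real:       D's score for a real sentence with its true attributes
    d_generated:  D's score for a generated sentence with its target attributes
    d_mismatched: D's score for a real sentence with mismatched attributes
    All scores are probabilities strictly between 0 and 1.
    """
    return (
        real_weight * math.log(d_real)
        + math.log(1.0 - d_generated)
        + math.log(1.0 - d_mismatched)
    )
```

A discriminator that ignores the attribute vector assigns the mismatched pair the same high score as the matching real pair, which lowers this value; that is the pressure the extra fake pairs are meant to create.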
Our discriminator architecture follows the projection discriminator (Miyato and Koyama, 2018),
$$D(s, l) = \sigma\left( l_v^{T} W \phi(s) + v^{T} \phi(s) \right)$$
where $l_v$ represents the binary attribute vector corresponding to $l$, $\phi$ is a bi-directional RNN encoder ($\phi(\cdot)$ represents the final hidden state), $W$ and $v$ are learnable parameters, and $\sigma$ is the sigmoid function.
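As a rough illustration of a projection-style discriminator score, the sketch below assumes the sequence encoder has already produced a fixed-size feature vector `phi_s`, and uses plain Python lists in place of tensors. All names and shapes are assumptions made for this sketch.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def projection_discriminator(phi_s, l_v, W, v):
    """Score sigmoid(l_v^T W phi_s + v^T phi_s).

    phi_s: encoder features of the hidden-state sequence (length d)
    l_v:   binary attribute vector (length k)
    W:     k x d matrix given as a list of rows
    v:     vector of length d
    """
    w_phi = [sum(w_ij * p for w_ij, p in zip(row, phi_s)) for row in W]
    attribute_term = sum(l * wp for l, wp in zip(l_v, w_phi))  # attribute compatibility
    realism_term = sum(v_j * p for v_j, p in zip(v, phi_s))    # attribute-independent
    return sigmoid(attribute_term + realism_term)
```

The first term depends on the attribute vector while the second does not, so a single discriminator can score both attribute compatibility and realism, with the same parameters shared across all attributes.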
Soft-sampling and hard-sampling A challenging aspect of text generation models is dealing with the discrete nature of language, which makes it difficult to generate a sequence and then obtain a learning signal based on it. Soft-sampling is generally used to back-propagate gradients through the sampling process, where an approximation of the sampled word vector at every time-step is used as the input for the next time-step (Shen et al., 2017; Hu et al., 2017). Inference performs hard-sampling, where sampled words are used instead. Thus, when soft-sampled sequences are used at training time, the training and inference behaviors are mismatched. For instance, Shen et al. (2017)’s adversarial loss encourages the hidden-state dynamics of teacher-forced and soft-sampled sequences to be similar. However, there remains a gap between the dynamics of these sequences and sequences hard-sampled at test time. We eliminate this gap by hard-sampling the generated sequence $y$. Soft-sampling also has a tendency to introduce artifacts during generation, and the approximation becomes poorer as the vocabulary size grows. We present an ablative experiment comparing these two sampling strategies in Appendix C.
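The difference between the two strategies can be seen in a toy sketch. It assumes one common form of soft-sampling, in which the next input is the softmax-weighted average of word embeddings; the function names and the tiny embedding table are illustrative only.

```python
import math
import random


def softmax(logits, temperature=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((x - m) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_sample(logits, embeddings, temperature=1.0):
    """Expected word embedding under the softmax: a continuous stand-in for a sampled word."""
    probs = softmax(logits, temperature)
    dim = len(embeddings[0])
    return [sum(p * emb[j] for p, emb in zip(probs, embeddings)) for j in range(dim)]


def hard_sample(logits, embeddings, rng=random):
    """Draw one word from the softmax and return its actual embedding."""
    probs = softmax(logits)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return embeddings[idx]
```

With two equally likely words, `soft_sample` returns the midpoint of their embeddings, a vector that corresponds to no word in the vocabulary, whereas `hard_sample` always returns a real word embedding. This is the train/inference mismatch the paragraph above describes.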
Scalability to multiple attributes Shen et al. (2017) use multiple class-specific discriminators to match the class conditional distributions of sentences. In contrast, our proposed discriminator models the joint distribution of realistic sentences and corresponding attribute labels. Our approach is more data-efficient and exploits the correlation between different attributes as well as attribute values.
The sentiment attribute has been widely considered in previous work (Hu et al., 2017; Shen et al., 2017). We first address the sentiment control task and perform a comprehensive comparison against previous methods. We perform quantitative, qualitative and human evaluations to compare sentences generated by different models. Next we evaluate the model in a setting where parallel data is available. Finally we consider the more challenging setting of controlling multiple attributes simultaneously and show that our model easily extends to the multiple attribute scenario.
4.1 Training and hyperparameters
We use the following validation metrics for model selection. The autoencoding loss is used to measure how well the model generates content compatible sentences. Attribute compatibility is measured by generating sentences conditioned on a set of labels, and using pre-trained attribute classifiers to measure how well the samples match the conditioning labels.
For all tasks we use a GRU (Gated Recurrent Unit; Chung et al., 2015) RNN with hidden state size 500 as the encoder. Attribute labels are represented as a binary vector, and an attribute embedding is constructed via linear projection. The decoder is initialized using a concatenation of the representation coming from the encoder and the attribute embedding. Attribute embeddings of size 200 and a decoder GRU with hidden state size 700 were used (these parameters are identical to Shen et al. (2017)). The discriminator receives an RNN hidden state sequence and an attribute vector as input. The hidden state sequence is encoded using a bi-directional RNN with hidden state size 500. The interpolation probability and the weight of the adversarial loss are chosen based on the validation metrics above. Word embeddings are initialized with pre-trained GloVe embeddings (Pennington et al., 2014).
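The sizes quoted above can be collected in a small configuration sketch. The class and field names are hypothetical; only the numbers come from the text. The helper makes explicit that the concatenated encoder output and attribute embedding have the same size as the decoder hidden state, which appears to be why the decoder can be initialized from the concatenation directly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hyperparameters quoted in the text; field names are illustrative."""
    encoder_hidden: int = 500        # GRU encoder hidden state size
    attribute_embedding: int = 200   # linear projection of the binary attribute vector
    decoder_hidden: int = 700        # GRU decoder hidden state size
    discriminator_hidden: int = 500  # bi-directional RNN inside the discriminator

    def decoder_init_size(self) -> int:
        # Size of [encoder representation ; attribute embedding]
        return self.encoder_hidden + self.attribute_embedding


config = ModelConfig()
```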
Although the evaluation setups in prior work assess how well the generated sentences match the conditioning labels, they do not assess whether they match the input sentence in content. For most attributes of interest, parallel corpora do not exist. Hence we define objective metrics that evaluate models in a setting where ground truth annotations are unavailable. While individually these metrics have their deficiencies, taken together they are helpful in objectively comparing different models and performing consistent evaluations across different work.
Attribute accuracy To quantitatively evaluate how well the generated samples match the conditioning labels we adopt a protocol similar to Hu et al. (2017). We generate samples from the model and measure label accuracy using a pre-trained sentiment classifier. For the sentiment experiments, the pre-trained classifiers are CNNs trained to perform sentiment analysis at the review level on the Yelp and IMDB datasets (Maas et al., 2011). The classifiers achieve test accuracies of 95% and 90% on the respective datasets.
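The accuracy protocol amounts to the following sketch. The `toy_classifier` is a keyword-based stand-in used only to make the example runnable; it is not the CNN classifier described above, and all names are hypothetical.

```python
def attribute_accuracy(generated, target_labels, classify):
    """Fraction of generated sentences whose predicted label matches the conditioning label."""
    if len(generated) != len(target_labels):
        raise ValueError("need one target label per generated sentence")
    if not generated:
        raise ValueError("no sentences to evaluate")
    hits = sum(1 for s, t in zip(generated, target_labels) if classify(s) == t)
    return hits / len(generated)


def toy_classifier(sentence):
    # Stand-in for a pre-trained sentiment classifier (assumption for this sketch).
    return "positive" if "friendly" in sentence.split() else "negative"


samples = ["the staff is friendly", "the staff is rude", "the food was bland"]
targets = ["positive", "negative", "positive"]
score = attribute_accuracy(samples, targets, toy_classifier)
```

Note that this measure looks only at the generated sentence and its target label, so a model that ignores the input sentence entirely can still score well.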
Content compatibility Measuring content preservation using objective metrics is challenging. Fu et al. (2017) propose a content preservation metric which extracts features from word embeddings and measures cosine similarity in the feature space. However, it is hard to design an embedding based metric which disregards the attribute information present in the sentence. We take an indirect approach and measure properties that would hold if the models do indeed produce content compatible sentences. We consider a content preservation metric inspired by the unsupervised model selection criteria of Lample et al. (2017) to evaluate machine-translation models without parallel data. Given two non-parallel datasets $D_{src}, D_{tgt}$ and translation models $M_{src \to tgt}, M'_{tgt \to src}$ that map between the two domains, the following metric is defined.
$$f_{content}(M, M') = 0.5 \left[ \mathbb{E}_{x \sim D_{src}} \mathrm{BLEU}\left(x, M' \circ M(x)\right) + \mathbb{E}_{x \sim D_{tgt}} \mathrm{BLEU}\left(x, M \circ M'(x)\right) \right]$$
where $M' \circ M(x)$ represents translating $x \in D_{src}$ to domain $D_{tgt}$ and then back to $D_{src}$. We assume $D_{src}$ and $D_{tgt}$ to be test set sentences of positive and negative sentiment respectively and $M', M$ to be the generative model $G$ conditioned on positive and negative sentiment, respectively.
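The round-trip structure of this metric can be sketched as follows. To keep the example self-contained it uses clipped unigram precision as a simplified stand-in for BLEU (no brevity penalty, no higher-order n-grams); the function names and the callable interface for the two mappings are assumptions of this sketch.

```python
from collections import Counter


def unigram_precision(reference, hypothesis):
    """Clipped unigram precision; a simplified stand-in for BLEU-1."""
    ref_counts = Counter(reference.split())
    hyp_tokens = hypothesis.split()
    if not hyp_tokens:
        return 0.0
    hyp_counts = Counter(hyp_tokens)
    overlap = sum(min(c, ref_counts[t]) for t, c in hyp_counts.items())
    return overlap / len(hyp_tokens)


def content_score(src_sentences, tgt_sentences, to_tgt, to_src,
                  similarity=unigram_precision):
    """Average round-trip similarity over both domains.

    to_tgt / to_src are callables mapping a sentence to the other domain.
    """
    def mean(values):
        return sum(values) / len(values)

    src_term = mean([similarity(x, to_src(to_tgt(x))) for x in src_sentences])
    tgt_term = mean([similarity(x, to_tgt(to_src(x))) for x in tgt_sentences])
    return 0.5 * (src_term + tgt_term)
```

A mapping that preserves content returns something close to the original after the round trip and scores high, while a vacuous mapping that always emits the same sentence of the desired sentiment scores low, which is the failure mode that attribute accuracy alone does not penalize.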
Fluency We use a pre-trained language model to measure the fluency of generated sentences. The perplexity of generated sentences, as evaluated by the language model, is treated as a measure of fluency. A state-of-the-art language model trained on the Billion words benchmark (Jozefowicz et al., 2016) dataset is used for the evaluation.
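For reference, perplexity is computed from per-token log-probabilities as sketched below. The function assumes natural-log probabilities have already been obtained from some language model; how they are obtained is outside the scope of this sketch.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Lower values mean the language model finds the sentence more predictable; a model that assigns probability 1/4 to every token has perplexity 4.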
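|Model||Yelp: Accuracy||Yelp: BLEU-1||Yelp: BLEU-4||Yelp: Perplexity||IMDB: Accuracy||IMDB: BLEU-1||IMDB: BLEU-4||IMDB: Perplexity|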
|Ctrl-gen (Hu et al., 2017)||76.36%||11.5||0.0||156||76.99%||15.4||0.1||94|
|Cross-align (Shen et al., 2017)||90.09%||41.9||3.9||180||88.68%||31.1||1.1||63|
4.3 Sentiment Experiments
Data We use the restaurant reviews dataset from Shen et al. (2017). The dataset is a filtered version of the Yelp reviews dataset. Similar to Hu et al. (2017), we use the IMDB movie review corpus from Diao et al. (2014). We use Shen et al. (2017)’s filtering process to construct a subset of the data for training and testing. The Yelp and IMDB datasets have 447k and 300k training sentences and 128k and 36k test sentences, respectively.
We compare our model against Ctrl-gen, the VAE model of Hu et al. (2017) and Cross-align, the cross alignment model of Shen et al. (2017). Code obtained from the authors is used to train models on the datasets. We use a pre-trained model provided by Hu et al. (2017) for movie review experiments.
Quantitative evaluation Table 1 compares our model against prior work in terms of the objective metrics discussed in the previous section. Both Shen et al. (2017) and Hu et al. (2017) perform soft-decoding so that back-propagation through the sampling process is possible. However, this leads to artifacts in generation, producing low fluency scores. Note that the fluency scores do not represent the perplexity of the generators, but perplexity measured on generated sentences using a pre-trained language model. While the absolute numbers may not be representative of the generation quality, they serve as a useful measure for relative comparison.
We report BLEU-1 and BLEU-4 scores for the content metric. Back-translation has been effectively used for data augmentation in unsupervised translation approaches. The interpolation loss can be thought of as data augmentation in the feature space, taking into account the noisy nature of parallel text produced by the model, and encourages content preservation when modifying attribute properties. The cross-align model performs strongly in terms of attribute accuracy; however, it has difficulties generating grammatical text. Our model is able to outperform these methods in terms of all metrics.
Qualitative evaluation Table 4 shows samples generated from the models for a given conditioning sentence and sentiment label. For each query sentence, we generate a sentence conditioned on the opposite label. The Ctrl-gen model rarely produces content compatible sentences. Cross-align produces relevant sentences, but parts of them are ungrammatical. Our model generates sentences that are more related to the input sentence. More examples can be found in the supplementary material.
|Query||the people behind the counter were not friendly whatsoever .|
|Ctrl-gen (Hu et al., 2017)||the food did n’t taste as fresh as it could have been either .|
|Cross-align (Shen et al., 2017)||the owners are the staff is so friendly .|
|Ours||the people at the counter were very friendly and helpful .|
|Query||they do an exceptional job here , the entire staff is professional and accommodating !|
|Ctrl-gen (Hu et al., 2017)||very little water just boring ruined !|
|Cross-align (Shen et al., 2017)||they do not be back here , the service is so rude and do n’t care !|
|Ours||they do not care about customer service , the staff is rude and unprofessional !|
|Query||once again , in this short , there isn’t much plot .|
|Ctrl-gen (Hu et al., 2017)||it’s perfectly executed with some idiotically amazing directing .|
|Cross-align (Shen et al., 2017)||but <unk> , , the film is so good , it is .|
|Ours||first off , in this film , there is nothing more interesting .|
|Query||that’s another interesting aspect about the film .|
|Ctrl-gen (Hu et al., 2017)||peter was an ordinary guy and had problems we all could <unk> with|
|Cross-align (Shen et al., 2017)||it’s the <unk> and the plot .|
|Ours||there’s no redeeming qualities about the film .|
Human evaluation We supplement the quantitative and qualitative evaluations with human assessments of generated sentences. Human judges on MTurk were asked to rate the three aspects of generated sentences we are interested in - attribute compatibility, content preservation and fluency. We chose 100 sentences from the test set randomly and generated corresponding sentences with the same content and opposite sentiment. Attribute compatibility is assessed by asking judges to label generated sentences and comparing the opinions with the actual conditioning sentiment label. For content assessment, we ask judges whether the original and generated sentences are related by the desired property (same semantic content and opposite sentiment). Fluency/grammaticality ratings were obtained on a 5-point Likert scale. More details about the evaluation setup are provided in section B of the appendix. Results are presented in Table 3. These ratings are in agreement with the objective evaluations and indicate that samples from our model are more realistic and reflect the conditioning information better than previous methods.
4.4 Monolingual Translation
We next consider a style transfer experiment where we attempt to emulate a particular writing style. This has been traditionally formulated as a monolingual translation problem where aligned data from two styles are used to train translation models. We consider English texts written in old English and address the problem of translating between old and modern English. We used a dataset of Shakespeare plays crawled from the web (Xu et al., 2012). A subset of the data has alignments between the two writing styles. The aligned data was split as 17k pairs for training and 2k, 1k pairs respectively for development and test. All remaining 80k sentences are considered unpaired.
We consider two sequence to sequence models as baselines. The first one is a simple sequence to sequence model that is trained to translate old to modern English. The second variation learns to translate both ways, where the decoder takes the domain of the target sentence as an additional input. We compare the performance of models in Table 3. In addition to the unsupervised setting which doesn’t use any parallel data, we also train our model in the semi-supervised setting. In this setting we first train the model using supervised sequence-to-sequence learning and fine-tune on the unpaired data using our objective. Our version of the model that does not use any aligned data falls short of the supervised models. However, in the semi-supervised setting we observe an improvement of more than 2 BLEU points over the purely supervised baselines. This shows that the model is capable of finding sentence alignments by exploiting the unlabelled data.
|Mood||Tense||Voice||Neg.||john was born in the camp|
|Indicative||Past||Passive||No||john was born in the camp .|
|Indicative||Past||Passive||Yes||john wasn’t born in the camp .|
|Indicative||Past||Active||No||john had lived in the camp .|
|Indicative||Past||Active||Yes||john didn’t live in the camp .|
|Indicative||Present||Passive||No||john is born in the camp .|
|Indicative||Present||Passive||Yes||john isn’t born in the camp .|
|Indicative||Present||Active||No||john has lived in the camp .|
|Indicative||Present||Active||Yes||john doesn’t live in the camp .|
|Indicative||Future||Passive||No||john will be born in the camp .|
|Indicative||Future||Passive||Yes||john will not be born in the camp .|
|Indicative||Future||Active||No||john will live in the camp .|
|Indicative||Future||Active||Yes||john will not survive in the camp .|
|Subjunctive||Cond||Passive||No||john could be born in the camp .|
|Subjunctive||Cond||Passive||Yes||john couldn’t live in the camp .|
|Subjunctive||Cond||Active||No||john could live in the camp .|
|Subjunctive||Cond||Active||Yes||john couldn’t live in the camp .|
4.5 Ablative study
Figure 3 shows an ablative study of the different loss components of the model. Each point in the plots represents the performance of a model (on the validation set) during training, where we plot the attribute compatibility against content compatibility. As training progresses, models move to the right. Models at the top right are desirable (high attribute and content compatibility). $\mathcal{L}^{ae}$ and $\mathcal{L}^{int}$ refer to models trained with only the auto-encoding loss or the interpolated loss respectively. We observe that the interpolated reconstruction loss by itself produces a reasonable model. It augments the data with generated samples and acts as a regularizer. Adding the adversarial loss $\mathcal{L}^{adv}$ to each of the above losses improves the attribute compatibility since it explicitly requires generated sequences to be label compatible (and realistic). We also consider $\mathcal{L}^{ae} + \mathcal{L}^{bt} + \mathcal{L}^{adv}$ (auto-encoding, back-translation and adversarial losses) in our control experiment. While this model performs strongly, it suffers from the issues associated with $\mathcal{L}^{ae}$ and $\mathcal{L}^{bt}$ discussed in section 3.2. The attribute compatibility of the proposed model drops more gracefully compared to the other settings as the content preservation improves.
4.6 Simultaneous control of multiple attributes
In this section we discuss experiments on simultaneously controlling multiple attributes of the input sentence. Given a set of sentences annotated with multiple attributes, our goal is to be able to plug this data into the learning algorithm and obtain a model capable of tweaking these properties of a sentence. Towards this end, we consider the following four attributes: tense, voice, mood and negation. We use an annotation tool (Ramm et al., 2017) to annotate a large corpus of sentences. We do not make fine distinctions such as progressive and perfect tenses and collapse them into a single category. We used a subset of 2M sentences from the BookCorpus dataset (Kiros et al., 2014), chosen to be approximately class-balanced across the different attributes.
Table 5 shows generated sentences conditioned on all valid combinations of attribute values for a given query sentence. We use the annotation tool to assess attribute compatibility of generated sentences. Attribute accuracies measured on generated sentences for mood, tense, voice and negation were 98%, 98%, 90% and 97%, respectively. The voice attribute is more difficult to control than the other attributes since some sentences require global changes such as switching the subject-verb-object order, and we found that the model tends to distort the content during voice control.
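The set of valid attribute combinations in Table 5 can be enumerated with a short sketch. The mood-to-tense mapping below is read off the rows of the table (the conditional tense appears only with the subjunctive mood); the variable and function names are illustrative.

```python
from itertools import product

# Tenses observed with each mood in Table 5 (assumption: no other pairings are valid).
VALID_TENSES = {
    "Indicative": ["Past", "Present", "Future"],
    "Subjunctive": ["Cond"],
}
VOICES = ["Passive", "Active"]
NEGATION = ["No", "Yes"]


def valid_combinations():
    """List (mood, tense, voice, negation) tuples in the row order of Table 5."""
    combos = []
    for mood, tenses in VALID_TENSES.items():
        for tense, voice, neg in product(tenses, VOICES, NEGATION):
            combos.append((mood, tense, voice, neg))
    return combos
```

This yields 16 conditioning settings (12 indicative and 4 subjunctive), one per row of the table.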
In this work we considered the problem of modifying textual attributes in sentences. We proposed a model that explicitly encourages content preservation, attribute compatibility and generating realistic sequences through carefully designed reconstruction and adversarial losses. We demonstrate that our model effectively reflects the conditioning information through various experiments and metrics. While previous work has been centered around controlling a single attribute and transferring between two styles, the proposed model easily extends to the multiple attribute scenario. It would be interesting future work to consider attributes with continuous values in this framework and a much larger set of semantic and syntactic attributes.
Acknowledgements We thank Andrew Dai, Quoc Le, Xinchen Yan and Ruben Villegas for helpful discussions. We also thank Jongwook Choi, Junhyuk Oh, Kibok Lee, Seunghoon Hong, Sungryull Sohn, Yijie Guo, Yunseok Jang and Yuting Zhang for helpful feedback on the manuscript.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
- van den Oord et al. (2016) Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
- Yan et al. (2016) Xinchen Yan, Jimei Yang, Kihyuk Sohn, and Honglak Lee. Attribute2image: Conditional image generation from visual attributes. In European Conference on Computer Vision, pages 776–791. Springer, 2016.
- Reed et al. (2016) Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
- Gatys et al. (2015) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
- Isola et al. (2016) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. A persona-based neural conversation model. arXiv preprint arXiv:1603.06155, 2016.
- Xu et al. (2012) Wei Xu, Alan Ritter, William B Dolan, Ralph Grishman, and Colin Cherry. Paraphrasing for style. In 24th International Conference on Computational Linguistics, COLING 2012, 2012.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. Controlling politeness in neural machine translation via side constraints. In Proceedings of NAACL-HLT, pages 35–40, 2016.
- Shetty et al. (2017) Rakshith Shetty, Bernt Schiele, and Mario Fritz. Author attribute anonymity by adversarial training of neural machine translation. arXiv preprint arXiv:1711.01921, 2017.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
- Kikuchi et al. (2016) Yuta Kikuchi, Graham Neubig, Ryohei Sasano, Hiroya Takamura, and Manabu Okumura. Controlling output length in neural encoder-decoders. arXiv preprint arXiv:1609.09552, 2016.
- Yamagishi et al. (2016) Hayahide Yamagishi, Shin Kanouchi, Takayuki Sato, and Mamoru Komachi. Controlling the voice of a sentence in Japanese-to-English neural machine translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 203–210, 2016.
- Kiros et al. (2014) Ryan Kiros, Richard Zemel, and Ruslan R Salakhutdinov. A multiplicative model for learning distributed text-based attribute representations. In Advances in neural information processing systems, pages 2348–2356, 2014.
- Radford et al. (2017) Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444, 2017.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.
- Bowman et al. (2015) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
- Chen et al. (2016) Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
- Li et al. (2018) Juncen Li, Robin Jia, He He, and Percy Liang. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437, 2018.
- Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. arXiv preprint arXiv:1705.09655, 2017.
- Prabhumoye et al. (2018) Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. Style transfer through back-translation. arXiv preprint arXiv:1804.09000, 2018.
- He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828, 2016.
- Artetxe et al. (2017) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.
- Lample et al. (2017) Guillaume Lample, Ludovic Denoyer, and Marc’Aurelio Ranzato. Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043, 2017.
- Miyato and Koyama (2018) Takeru Miyato and Masanori Koyama. cGANs with projection discriminator. arXiv preprint arXiv:1802.05637, 2018.
- Chung et al. (2015) Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. Gated feedback recurrent neural networks. In International Conference on Machine Learning, pages 2067–2075, 2015.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543, 2014.
- Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.
- Fu et al. (2017) Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. Style transfer in text: Exploration and evaluation. arXiv preprint arXiv:1711.06861, 2017.
- Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
- Diao et al. (2014) Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J Smola, Jing Jiang, and Chong Wang. Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 193–202. ACM, 2014.
- Ramm et al. (2017) Anita Ramm, Sharid Loáiciga, Annemarie Friedrich, and Alexander Fraser. Annotating tense, mood and voice for English, French and German. Proceedings of ACL 2017, System Demonstrations, pages 1–6, 2017.
Appendix A Qualitative comparison
Table 6 shows samples from different models, each given a query sentence from the restaurant reviews test dataset and the opposite sentiment as the target attribute.
|Query||sorry but i do n’t get the rave reviews for this place .|
|Ctrl gen||i ordered the nachos , have perfect seasonal beans on amazing amazing .|
|Cross-align||sorry , i do n’t be the best experience ever .|
|Ours||thanks but i love this place for lunch .|
|Query||however my recent visit there made me change my mind entirely .|
|Ctrl gen||not like other target stores .|
|Cross-align||best little one time to go for in charlotte .|
|Ours||overall my experience here was great as well .|
|Query||okay so this place has been a pain even after i already moved out .|
|Ctrl gen||like i mentioned , i thought to this these fun .|
|Cross-align||food and this place has been a good place to be back .|
|Ours||overall this is a great place to go when i ’m in town .|
|Query||personally i ’d rather spend my money at a business that appreciates my business .|
|Ctrl gen||i became quite a gem at the beginning but we amazing fantastic amazing .|
|Cross-align||then i will be back my time to get a regular time .|
|Ours||definitely i ’ll definitely be back for a good haircut .|
|Query||seems their broth just has no flavor kick to it .|
|Ctrl gen||i expected more for the price i paid .|
|Cross-align||loved their menu , has a great place .|
|Ours||definitely it ’s cooked perfectly and it ’s delicious .|
|Query||best chinese food i ’ve had in a long time .|
|Ctrl gen||very lousy texture and ruined .|
|Cross-align||worst chinese food i ’ve had in a long in years .|
|Ours||worst food i ’ve had in a long time .|
|Query||high quality food at prices comparable to lower quality take out .|
|Ctrl gen||the rock becomes my daughter ruined and it was terrible lousy lousy lousy|
|Cross-align||terrible quality , , <unk> <unk> , _num_ % of $ _num_ minutes .|
|Ours||poor quality of food quality at all costs .|
|Query||my appetizer was also very good and unique .|
|Ctrl gen||both were ruined . ruined|
|Cross-align||my wife was just very bland and no flavor .|
|Ours||my chicken was very dry and had no flavor .|
|Query||everything tasted great and the service was excellent .|
|Ctrl gen||but the real pleasure is the service department .|
|Cross-align||everything tasted horrible and the service was very bad .|
|Ours||everything tasted bad and the service was horrible .|
|Query||atmosphere is cozy and comfortable .|
|Ctrl gen||atmosphere is not good .|
|Cross-align||rude is dirty and way in .|
|Ours||restaurant is dirty and dirty .|
Table 7 shows samples from different models, each given a query sentence from the movie reviews test dataset and the opposite sentiment as the target attribute.
|Query||this is the most vapid movie i have ever seen .|
|Ctrl gen||if this grabs your interest , you may want to give it a try|
|Cross-align||this is a great movie that is so good .|
|Ours||this is the most beautiful movie i have ever seen .|
|Query||this 1944 film is too awful as it ’s just incredible .|
|Ctrl gen||<unk> the three dead world and <unk> ’s <unk> is a cult in a life|
|Cross-align||this film is one of the best movies ever made .|
|Ours||this film is an excellent and it is definitely worth it .|
|Query||1 out of 10 .|
|Ctrl gen||he ’s cold and hateful exactly what his part <unk>|
|Cross-align||my rating of the cast .|
|Ours||10 out of 10 .|
|Query||i always thought she was a colorless , plain jane .|
|Ctrl gen||a great comedy all wrapped up in a tiny package !|
|Cross-align||i think that is the best of the film .|
|Ours||i also thought she was a beautiful , talented actor .|
|Query||her character is truly hateful and her acting , if you can call it that , is strictly wretched .|
|Ctrl gen||a great ‘ proper ’ summer movie|
|Cross-align||<unk> , is the <unk> , and you can be able to be more than it to be .|
|Ours||his character is very funny , and in fact , it ’s just what he does n’t disappoint .|
|Query||this is one of his best efforts .|
|Ctrl gen||as david <unk> picked up the franchise , it has just <unk> to pieces|
|Cross-align||this is a complete waste of time .|
|Ours||this is one of the worst films .|
|Query||if you love silent films , you ’ll adore this one .|
|Ctrl gen||nice photographic effects as jessica <unk> the process|
|Cross-align||if you ’re no , but it is not bad .|
|Ours||if you love horror movies , do n’t see this one .|
|Query||and congratulations to kino for a superb video restoration .|
|Ctrl gen||peter <unk> is not that she ’s not gone bad movie|
|Cross-align||but then , it ’s a waste of time .|
|Ours||and save your money on this piece of garbage .|
|Query||the characters are portrayed vividly and realistically .|
|Ctrl gen||problem is , not enough good work went into this|
|Cross-align||the characters are <unk> and <unk> .|
|Ours||the characters are completely unsympathetic and annoying .|
|Query||there are some of the most stunning and grisly combat scenes ever filmed .|
|Ctrl gen||unfortunately the only thing you see is <unk>|
|Cross-align||there is no a <unk> , and the <unk> , <unk> and <unk> .|
|Ours||there are some of the most boring and boring scenes ever made .|
Appendix B Human Evaluation
B.1 Content compatibility
Given a reference sentence and a set of candidate sentences, pick the candidates that have the same semantic content as the reference sentence but have the opposite sentiment (i.e., mean the opposite). Select all that apply. If you think neither of the given sentences have this property, choose No preference (This can happen when all the candidate sentences are either semantically irrelevant to the reference sentence or have the incorrect sentiment).
Reference sentence: This is a great movie !
You would pick sentences such as
✓ This is not a good movie.
✓ This is a bad movie.
The following sentences do not fit the criteria because they are either semantically irrelevant to the reference sentence or have the incorrect sentiment.
✗ I did not like the salad.
✗ This is a wonderful movie.
B.2 Attribute compatibility
Pick the best sentiment based on the following criterion.
|Positive||Sentence conveys positive sentiment. Eg: "I really liked the food."|
|Negative||Sentence conveys negative sentiment. Eg: "This was the worst experience ever."|
|Neutral||Sentence does not carry any sentiment information.|
Rate the grammaticality/fluency of the sentence based on the following criterion.
|5||The sentence is grammatical and does not have any grammar errors.|
|4||Sentence is mostly grammatical except for one/two mistakes.|
|3||Parts of the sentence are grammatical and sentence is somewhat coherent, but there are glaring errors.|
|2||Too many grammatical errors and sentence is incoherent.|
|1||Sentence is completely ungrammatical.|
Appendix C Sampling strategy
In this section we compare soft and hard sampling during training. For the soft-sampling model, we use an exponentially decaying temperature annealing schedule with an initial temperature of 1. The temperature decays until it reaches 0.01 and remains constant afterwards. All other model parameters are identical to those in section 4.1. We use the Yelp dataset for this experiment. Table 8 compares the models with respect to the metrics in section 4.2.
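As an illustration of the schedule described above, the following is a minimal sketch and not the implementation used in the experiments. The function names and the decay rate `decay_rate` are assumptions introduced here; the text specifies only the initial temperature of 1 and the floor of 0.01. The sketch also shows how a temperature-scaled softmax over token logits, one common way to realize soft sampling, approaches a one-hot vector as the temperature approaches its floor.

```python
import math


def annealed_temperature(step, decay_rate=1e-3, t_init=1.0, t_min=0.01):
    """Exponentially decay the temperature from t_init, clamped at t_min.

    decay_rate is a hypothetical value; it is not given in the text.
    """
    return max(t_min, t_init * math.exp(-decay_rate * step))


def soft_sample(logits, temperature):
    """Temperature-scaled softmax over vocabulary logits.

    Returns a probability vector (a "soft" token) rather than a discrete token.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]  # toy logits over a three-word vocabulary
    for step in (0, 1000, 5000, 10000):
        t = annealed_temperature(step)
        probs = soft_sample(logits, t)
        print(step, round(t, 4), [round(p, 4) for p in probs])
```

Early in training the soft tokens are spread over several words, while at the temperature floor they are nearly one-hot. The decoder therefore sees different kinds of input at different stages of training, which is one way to picture the training and inference mismatch examined in this appendix.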
Models trained with soft sampling produce sentences judged to be highly attribute compatible. However, their content compatibility is considerably worse and the generated sentences are less fluent. This supports our claim that training and inference behavior are mismatched when soft-sampled sequences are used for training.
Table 8: Comparison between soft and hard sampling. Evaluation metrics are described in section 4.2. Higher numbers are better for accuracy and content compatibility, and lower numbers are better for perplexity.