Style Transfer from Non-Parallel Text by Cross-Alignment

05/26/2017 ∙ by Tianxiao Shen, et al. ∙ MIT ∙ ASAPP Inc.

This paper focuses on style transfer on the basis of non-parallel text. This is an instance of a broad family of problems including machine translation, decipherment, and sentiment modification. The key challenge is to separate the content from other aspects such as style. We assume a shared latent content distribution across different text corpora, and propose a method that leverages refined alignment of latent representations to perform style transfer. The transferred sentences from one style should match example sentences from the other style as a population. We demonstrate the effectiveness of this cross-alignment method on three tasks: sentiment modification, decipherment of word substitution ciphers, and recovery of word order.




1 Introduction

Using massive amounts of parallel data has been essential for recent advances in text generation tasks, such as machine translation and summarization. However, in many text generation problems, we can only assume access to non-parallel or mono-lingual data. Problems such as decipherment or style transfer are all instances of this family of tasks. In all of these problems, we must preserve the content of the source sentence but render the sentence consistent with desired presentation constraints (e.g., style, plaintext/ciphertext).

The goal of controlling one aspect of a sentence such as style independently of its content requires that we can disentangle the two. However, these aspects interact in subtle ways in natural language sentences, and we can succeed in this task only approximately even in the case of parallel data. Our task is more challenging here. We merely assume access to two corpora of sentences with the same distribution of content albeit rendered in different styles. Our goal is to demonstrate that this distributional equivalence of content, if exploited carefully, suffices for us to learn to map a sentence in one style to a style-independent content vector and then decode it to a sentence with the same content but a different style.

In this paper, we introduce a refined alignment of sentence representations across text corpora. We learn an encoder that takes a sentence and its original style indicator as input, and maps it to a style-independent content representation. This is then passed to a style-dependent decoder for rendering. We do not use typical VAEs for this mapping since it is imperative to keep the latent content representation rich and unperturbed. Indeed, richer latent content representations are much harder to align across the corpora and therefore they offer more informative content constraints. Moreover, we reap additional information from cross-generated (style-transferred) sentences, thereby getting two distributional alignment constraints. For example, positive sentences that are style-transferred into negative sentences should match, as a population, the given set of negative sentences. We illustrate this cross-alignment in Figure 1.

Figure 1: An overview of the proposed cross-alignment method. $X_1$ and $X_2$ are two sentence domains with different styles $y_1$ and $y_2$, and $Z$ is the shared latent content space. Encoder $E$ maps a sentence to its content representation, and generator $G$ generates the sentence back when combining with the original style. When combining with a different style, transferred $\tilde{X}_1$ is aligned with $X_2$ and $\tilde{X}_2$ is aligned with $X_1$ at the distributional level.

To demonstrate the flexibility of the proposed model, we evaluate it on three tasks: sentiment modification, decipherment of word substitution ciphers, and recovery of word order. In all of these applications, the model is trained on non-parallel data. On the sentiment modification task, the model successfully transfers the sentiment while preserving the content for 41.5% of review sentences according to human evaluation, compared to 41.0% achieved by the control-gen model of Hu et al. (2017). It achieves strong performance on the decipherment and word order recovery tasks, reaching Bleu scores of 57.4 and 26.1 respectively, a gain of 50.2 and 20.9 points over a comparable method without cross-alignment.

2 Related work

Style transfer in vision

Non-parallel style transfer has been extensively studied in computer vision (Gatys et al., 2016; Zhu et al., 2017; Liu and Tuzel, 2016; Liu et al., 2017; Taigman et al., 2016; Kim et al., 2017; Yi et al., 2017). Gatys et al. (2016) explicitly extract content and style features, and then synthesize a new image by combining “content” features of one image with “style” features from another. More recent approaches learn generative networks directly via generative adversarial training (Goodfellow et al., 2014) from two given data domains $X_1$ and $X_2$. The key computational challenge in this non-parallel setting is aligning the two domains. For example, CoupledGANs (Liu and Tuzel, 2016) employ weight-sharing between networks to learn cross-domain representation, whereas CycleGAN (Zhu et al., 2017) introduces cycle consistency which relies on transitivity to regularize the transfer functions. While our approach has a similar high-level architecture, the discreteness of natural language does not allow us to reuse these models and necessitates the development of new methods.

Non-parallel transfer in natural language

In natural language processing, most tasks that involve generation (e.g., translation and summarization) are trained using parallel sentences. Our work most closely relates to approaches that do not utilize parallel data, but instead guide sentence generation from an indirect training signal (Mueller et al., 2017; Hu et al., 2017). For instance, Mueller et al. (2017) manipulate the hidden representation to generate sentences that satisfy a desired property (e.g., sentiment) as measured by a corresponding classifier. However, their model does not necessarily enforce content preservation. More similar to our work, Hu et al. (2017) aim at generating sentences with controllable attributes by learning disentangled latent representations (Chen et al., 2016). Their model builds on variational auto-encoders (VAEs) and uses independence constraints to enforce that attributes can be reliably inferred back from generated sentences. While our model builds on distributional cross-alignment for the purpose of style transfer and content preservation, these constraints can be added in the same way.

Adversarial training over discrete samples

Recently, a wide range of techniques addresses challenges associated with adversarial training over discrete samples generated by recurrent networks (Yu et al., 2016; Lamb et al., 2016; Hjelm et al., 2017; Che et al., 2017). In our work, we employ the Professor-Forcing algorithm (Lamb et al., 2016) which was originally proposed to close the gap between teacher-forcing during training and self-feeding during testing for recurrent networks. This design fits well with our scenario of style transfer that calls for cross-alignment. By using continuous relaxation to approximate the discrete sampling process (Jang et al., 2016; Maddison et al., 2016), the training procedure can be effectively optimized through back-propagation (Kusner and Hernández-Lobato, 2016; Goyal et al., 2017).

3 Formulation

In this section, we formalize the task of non-parallel style transfer and discuss the feasibility of the learning problem. We assume the data are generated by the following process:

  1. a latent style variable $y$ is generated from some distribution $p(y)$;

  2. a latent content variable $z$ is generated from some distribution $p(z)$;

  3. a datapoint $x$ is generated from conditional distribution $p(x|y,z)$.

We observe two datasets with the same content distribution but different styles $y_1$ and $y_2$, where $y_1$ and $y_2$ are unknown. Specifically, the two observed datasets $X_1 = \{x_1^{(1)}, \dots, x_1^{(n)}\}$ and $X_2 = \{x_2^{(1)}, \dots, x_2^{(m)}\}$ consist of samples drawn from $p(x_1|y_1)$ and $p(x_2|y_2)$ respectively. We want to estimate the style transfer functions between them, namely $p(x_1|x_2; y_1, y_2)$ and $p(x_2|x_1; y_1, y_2)$.

A question we must address is when this estimation problem is feasible. Essentially, we only observe the marginal distributions of $x_1$ and $x_2$, yet we are going to recover their joint distribution:

$$p(x_1, x_2 | y_1, y_2) = \int_z p(z)\, p(x_1|y_1, z)\, p(x_2|y_2, z)\, dz \tag{1}$$

As we only observe $p(x_1|y_1)$ and $p(x_2|y_2)$, $y_1$ and $y_2$ are unknown to us. If two different $y$ and $y'$ lead to the same distribution $p(x|y) = p(x|y')$, then given a dataset $X$ sampled from it, its underlying style can be either $y$ or $y'$. Consider the following two cases: (1) both datasets $X_1$ and $X_2$ are sampled from the same style $y$; (2) $X_1$ and $X_2$ are sampled from style $y$ and $y'$ respectively. These two scenarios have different joint distributions, but the observed marginal distributions are the same. To prevent such confusion, we constrain the underlying distributions as stated in the following proposition:

Proposition 1.

In the generative framework above, $x_1$ and $x_2$'s joint distribution can be recovered from their marginals only if for any different $y, y' \in \mathcal{Y}$, distributions $p(x|y)$ and $p(x|y')$ are different.

This proposition basically says that $X$ generated from different styles should be “distinct” enough, otherwise the transfer task between styles is not well defined. While this seems trivial, it may not hold even for simplified data distributions. The following examples illustrate how the transfer (and recovery) becomes feasible or infeasible under different model assumptions. As we shall see, for a certain family of styles $\mathcal{Y}$, the more complex the distribution of $z$, the more probable it is to recover the transfer function and the easier it is to search for the transfer.

3.1 Example 1: Gaussian

Consider the common choice that $z \sim \mathcal{N}(0, I)$ has a centered isotropic Gaussian distribution. Suppose a style $y = (A, b)$ is an affine transformation, i.e. $x = Az + b + \epsilon$, where $\epsilon$ is a noise variable. For $b = 0$ and any orthogonal matrix $A$, $Az + b \sim \mathcal{N}(0, I)$ and hence $x$ has the same distribution for any such styles $y = (A, 0)$. In this case, the effect of rotation cannot be recovered.
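The rotation ambiguity can be checked numerically. The following is a minimal sketch, assuming a noise-free style (no noise term) and a randomly drawn orthogonal matrix; the dimension, sample size, and variable names are illustrative choices rather than anything specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 200_000

# Content z ~ N(0, I): centered isotropic Gaussian.
z = rng.standard_normal((n, d))

# A random orthogonal matrix A, taken from a QR decomposition.
A, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Style y = (A, 0) without noise: x = A z.
x = z @ A.T

# Analytically, Cov(Az) = A I A^T = I for any orthogonal A,
# so x is again N(0, I) and A leaves no trace in the marginal.
analytic_cov = A @ A.T
empirical_cov = np.cov(x, rowvar=False)
```

Both covariance matrices are (approximately) the identity, so samples of `x` carry no information about which orthogonal `A` produced them.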

Interestingly, if $z$ has a more complex distribution, such as a Gaussian mixture, then affine transformations can be uniquely determined.

Lemma 1.

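Let $z$ be a mixture of Gaussians $p(z) = \sum_{k=1}^{K} \pi_k \mathcal{N}(z; \mu_k, \Sigma_k)$. Assume $K \geq 2$, and there are two different $\Sigma_i \neq \Sigma_j$. Let $\mathcal{Y} = \{(A, b) \mid |A| \neq 0\}$ be all invertible affine transformations, and $p(x|y,z) = \mathcal{N}(x; Az + b, \epsilon^2 I)$, in which $\epsilon$ is a noise. Then for all $y \neq y' \in \mathcal{Y}$, $p(x|y)$ and $p(x|y')$ are different distributions.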

Theorem 1.

If the distribution of $z$ is a mixture of Gaussians which has more than two different components, and $x_1, x_2$ are two affine transformations of $z$, then the transfer between them can be recovered given their respective marginals.

3.2 Example 2: Word substitution

Consider here another example when $z$ is a bi-gram language model and a style $y$ is a vocabulary in use that maps each “content word” onto its surface form (lexical form). If we observe two realizations $x_1$ and $x_2$ of the same language $z$, the transfer and recovery problem becomes inferring a word alignment between $x_1$ and $x_2$.

Note that this is a simplified version of language decipherment or translation. Nevertheless, the recovery problem is still sufficiently hard. To see this, let $M_1, M_2 \in \mathbb{R}^{n \times n}$ be the estimated bi-gram probability matrices of data $X_1$ and $X_2$ respectively. Seeking the word alignment is equivalent to finding a permutation matrix $P$ such that $P^\top M_1 P \approx M_2$, which can be expressed as an optimization problem,

$$\min_P \|P^\top M_1 P - M_2\|^2$$

The same formulation applies to graph isomorphism (GI) problems given $M_1$ and $M_2$ as the adjacency matrices of two graphs, suggesting that determining the existence and uniqueness of $P$ is at least GI hard. Fortunately, if $M$ as a graph is complex enough, the search problem could be more tractable. For instance, if each vertex’s weights of incident edges as a set is unique, then finding the isomorphism can be done by simply matching the sets of edges. This assumption largely applies to our scenario where $M$ is a complex language model. We empirically demonstrate this in the results section.
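The edge-set matching idea can be illustrated on a toy bi-gram matrix. This is a sketch under idealized assumptions: the second matrix is an exact permutation of the first (no estimation noise), and every word has a unique set of incident weights. The helper names `signature` and `match_by_signature` are hypothetical.

```python
import numpy as np

def signature(M, i):
    """Sorted outgoing and incoming bi-gram weights of word i."""
    return tuple(np.sort(M[i, :])) + tuple(np.sort(M[:, i]))

def match_by_signature(M1, M2):
    """Return perm with M2[perm[i], perm[j]] == M1[i, j],
    or None when signatures are not unique or do not match."""
    n = M1.shape[0]
    index = {signature(M2, j): j for j in range(n)}
    if len(index) < n:
        return None
    perm = [index.get(signature(M1, i)) for i in range(n)]
    if None in perm or len(set(perm)) < n:
        return None
    return np.array(perm)

rng = np.random.default_rng(0)
n = 6
M1 = rng.random((n, n))
M1 /= M1.sum(axis=1, keepdims=True)   # rows act as conditional bi-gram probabilities

# Hidden word substitution: word i in corpus 1 appears as word true_perm[i] in corpus 2.
true_perm = rng.permutation(n)
M2 = np.empty_like(M1)
M2[np.ix_(true_perm, true_perm)] = M1

recovered = match_by_signature(M1, M2)
```

With matrices estimated from finite, non-parallel corpora the signatures only match approximately, so exact lookup would have to be replaced by a nearest-neighbour or optimization-based search.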

The above examples suggest that $z$ as the latent content variable should carry most complexity of data $x$, while $y$ as the latent style variable should have relatively simple effects. We construct the model accordingly in the next section.

4 Method

Learning the style transfer function under our generative assumption is essentially learning the conditional distributions $p(x_1|x_2; y_1, y_2)$ and $p(x_2|x_1; y_1, y_2)$. Unlike in vision where images are continuous and hence the transfer functions can be learned and optimized directly, the discreteness of language requires us to operate through the latent space. Since $x_1$ and $x_2$ are conditionally independent given the latent content variable $z$,

$$p(x_1|x_2; y_1, y_2) = \int_z p(x_1, z|x_2; y_1, y_2)\, dz = \int_z p(z|x_2, y_2)\, p(x_1|y_1, z)\, dz = \mathbb{E}_{z \sim p(z|x_2, y_2)}[p(x_1|y_1, z)] \tag{2}$$

This suggests learning an auto-encoder model. Specifically, a style transfer from $x_2$ to $x_1$ involves two steps: an encoding step that infers $x_2$’s content $z \sim p(z|x_2, y_2)$, and a decoding step which generates the transferred counterpart from $p(x_1|y_1, z)$. In this work, we approximate and train $p(z|x, y)$ and $p(x|y, z)$ using neural networks (where $y \in \{y_1, y_2\}$).

Let $E: \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$ be an encoder that infers the content $z$ for a given sentence $x$ and a style $y$, and $G: \mathcal{Y} \times \mathcal{Z} \to \mathcal{X}$ be a generator that generates a sentence $x$ from a given style $y$ and content $z$. $E$ and $G$ form an auto-encoder when applying to the same style, and thus we have reconstruction loss,

$$\mathcal{L}_{rec}(\theta_E, \theta_G) = \mathbb{E}_{x_1 \sim X_1}[-\log p_G(x_1|y_1, E(x_1, y_1))] + \mathbb{E}_{x_2 \sim X_2}[-\log p_G(x_2|y_2, E(x_2, y_2))] \tag{3}$$

where $\theta$ are the parameters to estimate.

In order to make a meaningful transfer by flipping the style, $X_1$ and $X_2$’s content space must coincide, as our generative framework presumed. To constrain that $x_1$ and $x_2$ are generated from the same latent content distribution $p(z)$, one option is to apply a variational auto-encoder (Kingma and Welling, 2013). A VAE imposes a prior density $p(z)$, such as $z \sim \mathcal{N}(0, I)$, and uses a KL-divergence regularizer to align both posteriors $p_E(z|x_1, y_1)$ and $p_E(z|x_2, y_2)$ to it,

$$\mathcal{L}_{KL}(\theta_E) = \mathbb{E}_{x_1 \sim X_1}[D_{KL}(p_E(z|x_1, y_1) \,\|\, p(z))] + \mathbb{E}_{x_2 \sim X_2}[D_{KL}(p_E(z|x_2, y_2) \,\|\, p(z))] \tag{4}$$

The overall objective is to minimize $\mathcal{L}_{rec} + \mathcal{L}_{KL}$, whose opposite is the variational lower bound of data likelihood.

However, as we have argued in the previous section, restricting $z$ to a simple and even distribution and pushing most complexity to the decoder may not be a good strategy for non-parallel style transfer. In contrast, a standard auto-encoder simply minimizes the reconstruction error, encouraging $z$ to carry as much information about $x$ as possible. On the other hand, it lowers the entropy in $p(x|y, z)$, which helps to produce meaningful style transfer in practice as we flip between $y_1$ and $y_2$. Without explicitly modeling $p(z)$, it is still possible to force distributional alignment of $p(z|y_1)$ and $p(z|y_2)$. To this end, we introduce two constrained variants of auto-encoder.

4.1 Aligned auto-encoder

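Dispensing with VAEs, which make an explicit assumption about $p(z)$ and align both posteriors to it, we align $p_E(z|y_1)$ and $p_E(z|y_2)$ with each other, which leads to the following constrained optimization problem:

$$\theta^* = \arg\min_\theta \mathcal{L}_{rec}(\theta_E, \theta_G) \quad \text{s.t.} \quad E(x_1, y_1) \overset{d}{=} E(x_2, y_2), \quad x_1 \sim X_1,\ x_2 \sim X_2 \tag{5}$$

In practice, a Lagrangian relaxation of the primal problem is instead optimized. We introduce an adversarial discriminator $D$ to align the aggregated posterior distribution of $z$ from different styles (Makhzani et al., 2015). $D$ aims to distinguish between these two distributions:

$$\mathcal{L}_{adv}(\theta_E, \theta_D) = \mathbb{E}_{x_1 \sim X_1}[-\log D(E(x_1, y_1))] + \mathbb{E}_{x_2 \sim X_2}[-\log(1 - D(E(x_2, y_2)))] \tag{6}$$

The overall training objective is a min-max game played among the encoder $E$, generator $G$ and discriminator $D$. They constitute an aligned auto-encoder:

$$\min_{E, G} \max_D\ \mathcal{L}_{rec} - \lambda \mathcal{L}_{adv} \tag{7}$$

We implement the encoder $E$ and generator $G$ using single-layer RNNs with GRU cell. $E$ takes an input sentence $x$ with initial hidden state $y$, and outputs the last hidden state $z$ as its content representation. $G$ generates a sentence $x$ conditioned on latent state $(y, z)$. To align the distributions of $z_1 = E(x_1, y_1)$ and $z_2 = E(x_2, y_2)$, the discriminator $D$ is a feed-forward network with a single hidden layer and a sigmoid output layer.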
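To make the discriminator's role concrete, here is a minimal NumPy sketch of a single-hidden-layer discriminator with a sigmoid output, and of a binary cross-entropy loss that labels content vectors from the first style as 1 and those from the second as 0. The tanh hidden activation, the layer sizes, and the random stand-in content vectors are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def discriminator(z, W1, b1, w2, b2):
    """Feed-forward net: one hidden layer (tanh assumed), sigmoid output."""
    h = np.tanh(z @ W1 + b1)
    return sigmoid(h @ w2 + b2)

def adversarial_loss(z1, z2, params, eps=1e-8):
    """Cross-entropy of D on content vectors z1 (label 1) and z2 (label 0)."""
    d1 = discriminator(z1, *params)
    d2 = discriminator(z2, *params)
    return -np.mean(np.log(d1 + eps)) - np.mean(np.log(1.0 - d2 + eps))

rng = np.random.default_rng(0)
dim_z, dim_h = 8, 16
params = (rng.standard_normal((dim_z, dim_h)) * 0.1, np.zeros(dim_h),
          rng.standard_normal(dim_h) * 0.1, 0.0)

# Stand-ins for encoder outputs of the two styles (deliberately misaligned).
z1 = rng.standard_normal((32, dim_z)) + 1.0
z2 = rng.standard_normal((32, dim_z)) - 1.0
loss = adversarial_loss(z1, z2, params)
```

The discriminator is trained to lower this loss, while the encoder is trained to raise it, which pushes the two content distributions toward each other. When the discriminator cannot tell the styles apart and outputs 0.5 everywhere, the loss equals $2\log 2$.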

4.2 Cross-aligned auto-encoder

The second variant, cross-aligned auto-encoder, directly aligns the transferred samples from one style with the true samples from the other. Under the generative assumption, $p(x_2|y_2) = \int_{x_1} p(x_2|x_1; y_1, y_2)\, p(x_1|y_1)\, dx_1$, thus $x_2$ (sampled from the left-hand side) should exhibit the same distribution as transferred $x_1$ (sampled from the right-hand side), and vice versa. Similar to our first model, the second model uses two discriminators $D_1$ and $D_2$ to align the populations. $D_1$’s job is to distinguish between real $x_1$ and transferred $x_2$, and $D_2$’s job is to distinguish between real $x_2$ and transferred $x_1$.

Adversarial training over the discrete samples generated by $G$ hinders gradient propagation. Although sampling-based gradient estimators such as REINFORCE (Williams, 1992) can be adopted, training with these methods can be unstable due to the high variance of the sampled gradient. Instead, we employ two recent techniques to approximate the discrete training (Hu et al., 2017; Lamb et al., 2016). First, instead of feeding a single sampled word as the input to the generator RNN, we use the softmax distribution over words. Specifically, during the generating process of transferred $\tilde{x}_2$ from $G(y_1, z_2)$, suppose at time step $t$ the output logit vector is $v_t$. We feed its peaked distribution $\mathrm{softmax}(v_t / \gamma)$ as the next input, where $\gamma \in (0, 1)$ is a temperature parameter.
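The effect of the temperature can be seen in a short sketch. It assumes that the soft distribution is fed to the RNN as the probability-weighted average of word embeddings, which is one common way to realize a "soft" input; the vocabulary size, embedding matrix, and logits below are made-up values.

```python
import numpy as np

def soft_next_input(logits, embeddings, gamma):
    """Temperature softmax over the vocabulary and the resulting soft input vector."""
    scaled = logits / gamma
    scaled = scaled - scaled.max()      # stabilise the exponentials
    probs = np.exp(scaled)
    probs /= probs.sum()
    # Assumed realization of "feeding a distribution": expected word embedding.
    return probs, probs @ embeddings

rng = np.random.default_rng(0)
vocab, dim = 5, 4
embeddings = rng.standard_normal((vocab, dim))
v_t = np.array([2.0, 1.0, 0.5, 0.0, -1.0])   # hypothetical output logits at step t

p_warm, _ = soft_next_input(v_t, embeddings, gamma=1.0)
p_cold, x_cold = soft_next_input(v_t, embeddings, gamma=0.01)
```

As the temperature shrinks, the distribution concentrates on the highest-scoring word and the soft input approaches that word's embedding, while the whole operation stays differentiable with respect to the logits.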

Secondly, we use Professor-Forcing (Lamb et al., 2016) to match the sequence of hidden states instead of the output words, which contains the information about outputs and is smoothly distributed. That is, the input to the discriminator $D_1$ is the sequence of hidden states of either (1) $G(y_1, z_1)$ teacher-forced by a real example $x_1$, or (2) $G(y_1, z_2)$ self-fed by previous soft distributions.

The running procedure of our cross-aligned auto-encoder is illustrated in Figure 2. Note that cross-aligning strengthens the alignment of latent variable $z$ over the recurrent network of generator $G$. By aligning the whole sequence of hidden states, it prevents an initial misalignment between $z_1$ and $z_2$ from propagating through the recurrent generating process, which could otherwise leave the transferred sentence far from the target domain.

We implement both $D_1$ and $D_2$ using convolutional neural networks for sequence classification (Kim, 2014). The training algorithm is presented in Algorithm 1.

Figure 2: Cross-aligning between $x_1$ and transferred $x_2$. For $x_1$, $G$ is teacher-forced by its words $w_1 w_2 \cdots w_t$. For transferred $x_2$, $G$ is self-fed by previous output logits. The sequences of hidden states $h^0, \dots, h^t$ and $\tilde{h}^0, \dots, \tilde{h}^t$ are passed to discriminator $D_1$ to be aligned. Note that our first variant aligned auto-encoder is a special case of this, where only $h^0$ and $\tilde{h}^0$, i.e. $z_1$ and $z_2$, are aligned.
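Input: Two corpora of different styles $X_1, X_2$. Lagrange multiplier $\lambda$, temperature $\gamma$.
  Initialize $\theta_E, \theta_G, \theta_{D_1}, \theta_{D_2}$
  repeat
     for $p = 1, q = 2$; $p = 2, q = 1$ do
        Sample a mini-batch of $k$ examples $\{x_p^{(i)}\}_{i=1}^{k}$ from $X_p$
        Get the latent content representations $z_p^{(i)} = E(x_p^{(i)}, y_p)$
        Unroll $G$ from initial state $(y_p, z_p^{(i)})$ by feeding $x_p^{(i)}$, and get the hidden states sequence $h_p^{(i)}$
        Unroll $G$ from initial state $(y_q, z_p^{(i)})$ by feeding previous soft output distribution with temperature $\gamma$, and get the transferred hidden states sequence $\tilde{h}_p^{(i)}$
     end for
     Compute the reconstruction loss $\mathcal{L}_{rec}$ by Eq. (3)
     Compute $D_1$’s (and symmetrically $D_2$’s) loss: $\mathcal{L}_{adv_1} = -\frac{1}{k}\sum_{i=1}^{k} \log D_1(h_1^{(i)}) - \frac{1}{k}\sum_{i=1}^{k} \log(1 - D_1(\tilde{h}_2^{(i)}))$
     Update $\{\theta_E, \theta_G\}$ by gradient descent on loss $\mathcal{L}_{rec} - \lambda(\mathcal{L}_{adv_1} + \mathcal{L}_{adv_2})$
     Update $\theta_{D_1}$ and $\theta_{D_2}$ by gradient descent on loss $\mathcal{L}_{adv_1}$ and $\mathcal{L}_{adv_2}$ respectively
  until convergence
Output: Style transfer functions $G(y_2, E(\cdot, y_1)): X_1 \to X_2$ and $G(y_1, E(\cdot, y_2)): X_2 \to X_1$
Algorithm 1: Cross-aligned auto-encoder training. The same settings of $\lambda$, $\gamma$ and the learning rate are used for all experiments in this paper.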
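The loss bookkeeping in one training step can be sketched as follows. The sketch assumes that each discriminator is trained with a binary cross-entropy loss (real hidden-state sequences labelled 1, transferred ones labelled 0) and that the encoder and generator minimize the reconstruction loss minus $\lambda$ times the sum of the two adversarial losses, mirroring the min-max game described for the aligned auto-encoder. The function names and the numbers passed in are hypothetical.

```python
import numpy as np

def adv_loss(d_real, d_transferred, eps=1e-8):
    """Discriminator cross-entropy on a mini-batch of output probabilities."""
    return (-np.mean(np.log(d_real + eps))
            - np.mean(np.log(1.0 - d_transferred + eps)))

def cross_aligned_losses(l_rec, d1_real, d1_fake, d2_real, d2_fake, lam):
    """Return (encoder/generator loss, D1 loss, D2 loss) for one step."""
    l_adv1 = adv_loss(d1_real, d1_fake)   # D1: real x1 vs. transferred x2
    l_adv2 = adv_loss(d2_real, d2_fake)   # D2: real x2 vs. transferred x1
    l_eg = l_rec - lam * (l_adv1 + l_adv2)
    return l_eg, l_adv1, l_adv2

# Hypothetical mini-batch: both discriminators are maximally unsure.
half = np.full(4, 0.5)
l_eg, l_adv1, l_adv2 = cross_aligned_losses(
    l_rec=3.0, d1_real=half, d1_fake=half, d2_real=half, d2_fake=half, lam=1.0)
```

In an actual implementation the three losses would drive separate gradient updates: `l_eg` for the encoder and generator parameters, and `l_adv1`, `l_adv2` for the respective discriminators.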

5 Experimental setup

Sentiment modification

Our first experiment focuses on text rewriting with the goal of changing the underlying sentiment, which can be regarded as “style transfer” between negative and positive sentences. We run experiments on Yelp restaurant reviews, utilizing readily available user ratings associated with each review. Following standard practice, reviews with rating above three are considered positive, and those below three are considered negative. While our model operates at the sentence level, the sentiment annotations in our dataset are provided at the document level. We assume that all the sentences in a document have the same sentiment. This is clearly an oversimplification, since some sentences (e.g., background) are sentiment neutral. Given that such sentences are more common in long reviews, we filter out reviews that exceed 10 sentences. We further filter the remaining sentences by eliminating those that exceed 15 words. The resulting dataset has 250K negative sentences, and 350K positive ones. The vocabulary size is 10K after replacing words occurring less than 5 times with the “<unk>” token. As a baseline model, we compare against the control-gen model of Hu et al. (2017).
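The filtering steps above can be sketched in a few lines. The input format (a rating paired with a list of tokenized sentences) and the function name are assumptions; the thresholds default to the values stated in the text.

```python
from collections import Counter

def build_sentiment_corpora(reviews, max_sents=10, max_words=15, min_count=5):
    """reviews: iterable of (rating, list_of_tokenized_sentences).
    Returns (negative_sentences, positive_sentences) with rare words replaced."""
    neg, pos = [], []
    for rating, sentences in reviews:
        # Drop neutral reviews and reviews longer than max_sents sentences.
        if rating == 3 or len(sentences) > max_sents:
            continue
        bucket = pos if rating > 3 else neg
        # Every sentence inherits the document-level label; drop long sentences.
        bucket.extend(s for s in sentences if len(s) <= max_words)
    counts = Counter(w for s in neg + pos for w in s)

    def unk(sentence):
        return [w if counts[w] >= min_count else "<unk>" for w in sentence]

    return [unk(s) for s in neg], [unk(s) for s in pos]

toy_reviews = [
    (5, [["great", "food"], ["great", "service"]]),
    (1, [["bad", "food"]]),
    (3, [["okay", "food"]]),      # neutral rating: skipped
    (4, [["great"] * 16]),        # sentence too long: dropped
]
neg, pos = build_sentiment_corpora(toy_reviews, min_count=2)
```

The toy call lowers `min_count` to 2 only so that the `<unk>` replacement is visible on four tiny reviews.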

To quantitatively evaluate the transferred sentences, we adopt a model-based evaluation metric similar to the one used for image transfer (Isola et al., 2016). Specifically, we measure how often a transferred sentence has the correct sentiment according to a pre-trained sentiment classifier. For this purpose, we use the TextCNN model as described in Kim (2014). On our simplified dataset for style transfer, it achieves nearly perfect accuracy of 97.4%.

While the quantitative evaluation provides some indication of transfer quality, it does not capture all the aspects of this generation task. Therefore, we also perform two human evaluations on 500 sentences randomly selected from the test set (we eliminated 37 sentences from them that were judged as neutral by human judges). In the first evaluation, the judges were asked to rate generated sentences in terms of their fluency and sentiment. Fluency was rated from 1 (unreadable) to 4 (perfect), while sentiment categories were “positive”, “negative”, or “neither” (which could be contradictory, neutral or nonsensical). In the second evaluation, we evaluate the transfer process comparatively. The annotator was shown a source sentence and the corresponding outputs of the systems in a random order, and was asked “Which transferred sentence is semantically equivalent to the source sentence with an opposite sentiment?”. The possible answers were that both outputs are satisfactory, that A or B is better, or that both are unsatisfactory. We collect two labels for each question. The label agreement and conflict resolution strategy can be found in the supplementary material. Note that the two evaluations are not redundant. For instance, a system that always generates the same grammatically correct sentence with the right sentiment independently of the source sentence will score high in the first evaluation setup, but low in the second one.

Word substitution decipherment

Our second set of experiments involves decipherment of word substitution ciphers, which has been previously explored in NLP literature (Dou and Knight, 2012; Nuhn and Ney, 2013). These ciphers replace every word in plaintext (natural language) with a cipher token according to a 1-to-1 substitution key. The decipherment task is to recover the plaintext from ciphertext. It is trivial if we have access to parallel data. However, we are interested in a non-parallel decipherment scenario. For training, we select 200K sentences as $X_1$, and apply a substitution cipher $f$ on a different set of 200K sentences to get $X_2$. While these sentences are non-parallel, they are drawn from the same distribution from the review dataset. The development and test sets have 100K parallel sentences $D_1 = \{x^{(1)}, \dots, x^{(n)}\}$ and $D_2 = \{f(x^{(1)}), \dots, f(x^{(n)})\}$. We can quantitatively compare between $D_1$ and transferred (deciphered) $D_2$ using Bleu score (Papineni et al., 2002).

Clearly, the difficulty of this decipherment task depends on the number of substituted words. Therefore, we report model performance with respect to the percentage of the substituted vocabulary. Note that the transfer models do not know that $f$ is a word substitution function. They learn it entirely from the data distribution.
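A word substitution cipher over a chosen fraction of the vocabulary can be sketched as below. The cipher token format (`<c0>`, `<c1>`, ...), the random choice of which words to substitute, and the function names are illustrative assumptions, not details given in the paper.

```python
import random

def make_substitution_key(vocab, rate, seed=0):
    """Map a random fraction `rate` of the vocabulary to fresh cipher tokens.
    The mapping is 1-to-1; all other words are left unchanged."""
    rng = random.Random(seed)
    words = sorted(vocab)
    chosen = rng.sample(words, round(rate * len(words)))
    return {w: f"<c{i}>" for i, w in enumerate(chosen)}

def encipher(sentence, key):
    return [key.get(w, w) for w in sentence]

vocab = {"the", "food", "was", "great", "bad"}
key = make_substitution_key(vocab, rate=0.4)
plain = ["the", "food", "was", "great"]
cipher = encipher(plain, key)

# Because the key is 1-to-1, inverting it recovers the plaintext exactly.
inverse = {c: w for w, c in key.items()}
deciphered = [inverse.get(w, w) for w in cipher]
```

In the non-parallel setting the model never sees `key` or aligned pairs such as (`plain`, `cipher`); only unpaired plaintext and ciphertext corpora are available for training.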

In addition to having different transfer models, we introduce a simple decipherment baseline based on word frequency. Specifically, we assume that words shared between $X_1$ and $X_2$ do not require translation. The rest of the words are mapped based on their frequency, and ties are broken arbitrarily. Finally, to assess the difficulty of the task, we report the accuracy of a machine translation system trained on a parallel corpus (Klein et al., 2017).
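The frequency baseline can be sketched as follows, under the reading that the remaining words of the two corpora are paired by frequency rank. The function name and toy corpora are hypothetical, and ties are broken by first occurrence, which is one arbitrary choice among many.

```python
from collections import Counter

def unigram_matching(plain_corpus, cipher_corpus):
    """Build a cipher-token -> plaintext-word mapping from unigram frequencies."""
    plain_counts = Counter(w for s in plain_corpus for w in s)
    cipher_counts = Counter(w for s in cipher_corpus for w in s)

    # Words that occur in both corpora are assumed to need no translation.
    shared = plain_counts.keys() & cipher_counts.keys()
    mapping = {w: w for w in shared}

    # Pair the remaining words by frequency rank.
    plain_rest = [w for w, _ in plain_counts.most_common() if w not in shared]
    cipher_rest = [w for w, _ in cipher_counts.most_common() if w not in shared]
    mapping.update(zip(cipher_rest, plain_rest))
    return mapping

plain_corpus = [["the", "food", "was", "good"], ["the", "food"], ["food", "was"]]
cipher_corpus = [["the", "<c0>", "was", "<c1>"], ["the", "<c0>"]]
mapping = unigram_matching(plain_corpus, cipher_corpus)
decoded = [mapping.get(w, w) for w in cipher_corpus[0]]
```

The baseline ignores word context entirely, which is why it degrades quickly once many words with similar frequencies are substituted.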

Word order recovery

Our final experiments focus on the word ordering task, also known as bag translation (Brown et al., 1990; Schmaltz et al., 2016). By learning the style transfer functions between original English sentences $X_1$ and shuffled English sentences $X_2$, the model can be used to recover the original word order of a shuffled sentence (or conversely to randomly permute a sentence). The process to construct non-parallel training data and parallel testing data is the same as in the word substitution decipherment experiment. Again the transfer models do not know that $f$ is a shuffle function and learn it completely from data.
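Constructing the non-parallel training data for this task can be sketched as follows: one half of the sentences is kept in its original order and a disjoint half is shuffled, so no sentence appears in both styles. The even split and the helper names are assumptions for illustration.

```python
import random

def shuffle_words(sentence, rng):
    words = list(sentence)
    rng.shuffle(words)
    return words

def make_nonparallel_corpora(sentences, seed=0):
    """Original-order corpus from the first half, shuffled corpus from the second half."""
    rng = random.Random(seed)
    half = len(sentences) // 2
    x1 = [list(s) for s in sentences[:half]]
    x2 = [shuffle_words(s, rng) for s in sentences[half:]]
    return x1, x2

sentences = [
    ["the", "food", "was", "great"],
    ["service", "was", "slow"],
    ["i", "love", "this", "place"],
    ["we", "will", "come", "back", "soon"],
]
x1, x2 = make_nonparallel_corpora(sentences)
```

For evaluation, a parallel test set would instead pair each held-out sentence with its own shuffled version.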

6 Results

Sentiment modification

Table 1 and Table 2 show the performance of various models for both human and automatic evaluation. The control-gen model of Hu et al. (2017) performs better in terms of sentiment accuracy in both evaluations. This is not surprising as their generation is directly guided by a sentiment classifier. Their system also achieves a higher fluency score. However, these gains do not translate into improvements in terms of the overall transfer, where our model fared better. As can be seen from the examples listed in Table 3, our model is more consistent with the grammatical structure and semantic meaning of the source sentence. In contrast, their model achieves sentiment change by generating an entirely new sentence which has little overlap with the source. The discrepancy between the two experiments demonstrates the crucial importance of developing appropriate evaluation measures to compare models for style transfer.

Method                        Accuracy (%)
Hu et al. (2017)              83.5
Variational auto-encoder      23.2
Aligned auto-encoder          48.3
Cross-aligned auto-encoder    78.4
Table 1: Sentiment accuracy of transferred sentences, as measured by a pretrained classifier.
Method              Sentiment (%)   Fluency (1-4)   Overall transfer (%)
Hu et al. (2017)    70.8            3.2             41.0
Cross-align         62.6            2.8             41.5
Table 2: Human evaluations on sentiment, fluency and overall transfer quality. Fluency rating is from 1 (unreadable) to 4 (perfect). Overall transfer quality is evaluated in a comparative manner, where the judge is shown a source sentence and two transferred sentences, and decides whether they are both good, both bad, or one is better.
From negative to positive
     Input:          consistently slow .
     Hu et al.:      consistently good .
     Cross-aligned:  consistently fast .

     Input:          my goodness it was so gross .
     Hu et al.:      my husband ’s steak was phenomenal .
     Cross-aligned:  my goodness was so awesome .

     Input:          it was super dry and had a weird taste to the entire slice .
     Hu et al.:      it was a great meal and the tacos were very kind of good .
     Cross-aligned:  it was super flavorful and had a nice texture of the whole side .

From positive to negative
     Input:          i love the ladies here !
     Hu et al.:      i avoid all the time !
     Cross-aligned:  i hate the doctor here !

     Input:          my appetizer was also very good and unique .
     Hu et al.:      my bf was n’t too pleased with the beans .
     Cross-aligned:  my appetizer was also very cold and not fresh whatsoever .

     Input:          came here with my wife and her grandmother !
     Hu et al.:      came here with my wife and hated her !
     Cross-aligned:  came here with my wife and her son .
Table 3: Sentiment transfer samples. In each group, the first line is an input sentence, and the second and third lines are the sentences generated after sentiment transfer by Hu et al. (2017) and our cross-aligned auto-encoder, respectively.

Word substitution decipherment

Table 4 summarizes the performance of our model and the baselines on the decipherment task, at various levels of word substitution. Consistent with our intuition, the last row in this table shows that the task is trivial when parallel data is provided. In the non-parallel case, the difficulty of the task is driven by the substitution rate. Across all the testing conditions, our cross-aligned model consistently outperforms its counterparts. The difference becomes more pronounced as the task becomes harder. When the substitution rate is 20%, all methods do a reasonably good job in recovering substitutions. However, when 100% of the words are substituted (as expected in real language decipherment), the poor performance of the variational auto-encoder and the aligned auto-encoder rules out their application for this task.

                              Substitution decipher                 Order recover
Method                        20%     40%     60%     80%     100%
No transfer (copy)            56.4    21.4    6.3     4.5     0       5.1
Unigram matching              74.3    48.1    17.8    10.7    1.2     -
Variational auto-encoder      79.8    59.6    44.6    34.4    0.9     5.3
Aligned auto-encoder          81.0    68.9    50.7    45.6    7.2     5.2
Cross-aligned auto-encoder    83.8    79.1    74.7    66.1    57.4    26.1
Parallel translation          99.0    98.9    98.2    98.5    97.2    64.6
Table 4: Bleu scores of word substitution decipherment and word order recovery.

Word order recovery

The last column in Table 4 demonstrates the performance on the word order recovery task. Order recovery is much harder: even when trained with parallel data, the machine translation model achieves only a Bleu score of 64.6. Note that some generated orderings may be completely valid (e.g., reordering conjunctions), but the models will be penalized for producing them. In this task, only the cross-aligned auto-encoder recovers grammatical word order to a certain extent, as shown by its Bleu score of 26.1. Other models fail this task, doing no better than no transfer.

7 Conclusion

Models that transfer language from one style to another have previously been trained using parallel data. In this work, we formulate the task as a decipherment problem with access only to non-parallel data. The two data collections are assumed to be generated by a latent variable generative model. Through this view, our method optimizes neural networks by forcing distributional alignment (invariance) over the latent space or sentence populations. We demonstrate the effectiveness of our method on tasks that permit quantitative evaluation, such as sentiment transfer, word substitution decipherment and word ordering. The decipherment view also raises an interesting open question: when can the joint distribution be recovered given only marginal distributions? We believe addressing this general question would advance style transfer research in both vision and NLP.


We thank Nicholas Matthews for helping to facilitate human evaluations, and Zhiting Hu for sharing his code. We also thank Jonas Mueller, Arjun Majumdar, Olga Simek, Danelle Shah, MIT NLP group and the reviewers for their helpful comments. This work was supported by MIT Lincoln Laboratory.


Appendix A Proof of Lemma 1

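See Lemma 1.

Proof. For different $y = (A, b)$ and $y' = (A', b')$, $p(x|y) = p(x|y')$ entails that for $k = 1, \dots, K$,

$$A\mu_k + b = A'\mu_k + b', \qquad A\Sigma_k A^\top = A'\Sigma_k A'^\top$$

Since all $y \in \mathcal{Y}$ are invertible,

$$(A^{-1}A')\, \Sigma_k\, (A^{-1}A')^\top = \Sigma_k$$

Suppose $\Sigma_k = Q_k D_k Q_k^\top$ is $\Sigma_k$’s orthogonal diagonalization. If $K = 1$, all solutions for $A^{-1}A'$ have the form:

$$\{ Q D^{1/2} U D^{-1/2} Q^\top \mid U \text{ is orthogonal} \}$$

However, when $K \geq 2$ and there are two different $\Sigma_i \neq \Sigma_j$, the only solution is $A^{-1}A' = I$, i.e. $A = A'$, and thus $b = b'$.

Therefore, for all $y \neq y'$, $p(x|y) \neq p(x|y')$. ∎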