1 Introduction
Disentangled representation learning (DRL), which maps different aspects of data into distinct and independent low-dimensional latent vector spaces, has attracted considerable attention for making deep learning models more interpretable. Through a series of operations such as selecting, combining, and switching, the learned disentangled representations can be utilized for downstream tasks, such as domain adaptation (Liu et al., 2018), style transfer (Lee et al., 2018), conditional generation (Denton and others, 2017; Burgess et al., 2018), and few-shot learning (Kumar Verma et al., 2018). Although widely used in various domains, such as images (Tran et al., 2017; Lee et al., 2018), videos (Yingzhen and Mandt, 2018; Hsieh et al., 2018), and speech (Chou et al., 2018; Zhou et al., 2019), many challenges in DRL have received limited exploration in natural language processing (John et al., 2019).
To disentangle various attributes of text, two distinct types of embeddings are typically considered: the style embedding and the content embedding (John et al., 2019). The content embedding is designed to encapsulate the semantic meaning of a sentence. In contrast, the style embedding should represent desired attributes, such as the sentiment of a review or the personality associated with a post. Ideally, a disentangled text-representation model should learn representative embeddings for both style and content.
To accomplish this, several strategies have been introduced. Shen et al. (2017) proposed to learn a semantically-meaningful content embedding space by matching the content embeddings from two different style domains. However, their method requires predefined style domains, and thus cannot automatically infer style information from unlabeled text. Hu et al. (2017) and Lample et al. (2019) utilized one-hot vectors as style-related features (instead of inferring the style embeddings from the original data). These models are not applicable when new data comes from an unseen style class. John et al. (2019) proposed an encoder-decoder model in combination with an adversarial training objective to infer both style and content embeddings from the original data. However, their adversarial training framework requires manually-processed supervised information for content embeddings (e.g., reconstructing sentences with manually-chosen sentiment-related words removed). Further, there is no theoretical guarantee for the quality of disentanglement.
In this paper, we introduce a novel Information-theoretic Disentangled Embedding Learning method (IDEL) for text, based on guidance from information theory. Inspired by Variation of Information (VI), we introduce a novel information-theoretic objective to measure how well the learned representations are disentangled. Specifically, our IDEL reduces the dependency between style and content embeddings by minimizing a sample-based mutual information upper bound. Furthermore, the mutual information between latent embeddings and the input data is also maximized to ensure the representativeness of the latent embeddings (i.e., style and content embeddings). The contributions of this paper are summarized as follows:

A principled framework is introduced to learn disentangled representations of natural language. By minimizing a novel VI-based DRL objective, our model not only explicitly reduces the correlation between style and content embeddings, but also simultaneously preserves the sentence information in the latent spaces.

A general sample-based mutual information upper bound is derived to facilitate the minimization of our VI-based objective. With this new upper bound, the dependency of style and content embeddings can be decreased effectively and stably.

The proposed model is evaluated empirically relative to other disentangled representation learning methods. Our model exhibits competitive results in several real-world applications.
2 Preliminary
2.1 Mutual Information Variational Bounds
Mutual information (MI) is a key concept in information theory, measuring the dependence between two random variables. Given two random variables x and y, their MI is defined as

I(x; y) = \int p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy,   (1)

where p(x, y) is the joint distribution of the two random variables, with p(x) and p(y) representing the respective marginal distributions.
In disentangled representation learning, a common goal is to minimize the MI between different types of embeddings (Poole et al., 2019). However, the exact MI value is difficult to calculate in practice, because in most cases the integral in Eq. (1) is intractable. To address this problem, various MI estimation methods have been introduced (Chen et al., 2016; Belghazi et al., 2018; Poole et al., 2019). One commonly used approach is the Barber-Agakov lower bound (Barber and Agakov, 2003). By introducing a variational distribution q(x|y), one may derive

I(x; y) \ge H(x) + E_{p(x, y)}[\log q(x|y)],   (2)

where H(x) is the entropy of variable x.
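As a concrete check of the bound, the following sketch computes Eq. (2) on a small discrete joint distribution (the 2×2 table is illustrative): with the optimal choice q(x|y) = p(x|y) the bound is tight and recovers I(x;y) exactly, while a mismatched q only loosens it.

```python
import numpy as np

# Toy check of the Barber-Agakov bound I(x;y) >= H(x) + E_{p(x,y)}[log q(x|y)]
# on a small discrete joint distribution (illustrative values).
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.55]])          # joint p(x, y): rows = x, cols = y
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Exact mutual information, Eq. (1), in nats
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))

# Entropy H(x)
h_x = -np.sum(p_x * np.log(p_x))

# Barber-Agakov bound with the optimal variational choice q(x|y) = p(x|y):
# the bound becomes H(x) - H(x|y), i.e., exactly I(x;y).
q_x_given_y = p_xy / p_y                 # column j holds p(x | y = j)
ba_bound = h_x + np.sum(p_xy * np.log(q_x_given_y))

# A deliberately suboptimal q (uniform over x) gives a strictly looser bound.
ba_uniform = h_x + np.sum(p_xy * np.log(0.5))
```

With the optimal q the two quantities coincide; any other q can only decrease the right-hand side, which is what makes the bound usable as a maximization target.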
2.2 Variation of Information
In information theory, Variation of Information (VI, also called Shared Information Distance) is a measure of independence between two random variables. The mathematical definition of VI between random variables x and y is

VI(x; y) = H(x) + H(y) - 2 I(x; y),   (3)

where H(x) and H(y) are the entropies of x and y, respectively (shown in Figure 1). Kraskov et al. (2005) show that VI is a well-defined metric, which satisfies the triangle inequality:

VI(x; y) + VI(y; z) \ge VI(x; z),   (4)

for any random variables x, y, and z. Additionally, VI(x; y) = 0 indicates that x and y are the same variable (Meilă, 2007). From Eq. (3), the VI distance has a close relation to mutual information: if the mutual information is a measure of "dependence" between two variables, then the VI distance is a measure of "independence" between them.
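The definitions above can be exercised numerically. The sketch below (with toy discrete variables, not the paper's embeddings) estimates VI from samples and checks the identity-of-indiscernibles and triangle-inequality properties of Eqs. (3) and (4).

```python
import numpy as np

# Numerical check of VI, Eq. (3), and the triangle inequality, Eq. (4),
# for discrete variables that are deterministic functions of a common source.
rng = np.random.default_rng(0)
source = rng.integers(0, 8, size=200_000)   # common underlying variable
x = source % 4
y = source % 2
z = (source // 2) % 2

def entropy(*vars_):
    """Empirical joint entropy (nats) of one or more discrete sample arrays."""
    joint = np.stack(vars_, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def vi(a, b):
    """VI(a;b) = H(a) + H(b) - 2 I(a;b) = 2 H(a,b) - H(a) - H(b)."""
    return 2 * entropy(a, b) - entropy(a) - entropy(b)
```

Here vi(x, x) is zero (a variable is at distance 0 from itself), and vi(x, y) + vi(y, z) upper-bounds vi(x, z), as the metric property requires.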
3 Method
Consider data D = {(x_i, y_i)}_{i=1}^N, where each x_i is a sentence drawn from a distribution p(x), and y_i is the label indicating the style of x_i. The goal is to encode each sentence x into its corresponding style embedding s and content embedding c with an encoder E:

(s, c) = E(x).   (5)

The collection of style embeddings can be regarded as samples drawn from a variable s in the style embedding space, while the collection of content embeddings are samples from a variable c in the content embedding space. In practice, the dimension of the content embedding is typically higher than that of the style embedding, considering that the content usually contains more information than the style (John et al., 2019).
We first give an intuitive introduction to our proposed VI-based objective; in Section 3.1 we provide the theoretical justification for it. To disentangle the style and content embeddings, we try to minimize the mutual information I(s; c) between s and c. Meanwhile, we maximize I(x; c) to ensure that the content embedding sufficiently encapsulates information from the sentence x. The embedding s is expected to contain rich style information; therefore, the mutual information I(x; s) should be maximized. Thus, our overall disentangled representation learning objective is:

min I(s; c) - I(x; c) - I(x; s).
3.1 Theoretical Justification of the Objective
The objective has a strong connection with the independence measurement in information theory. As described in Section 2.2, Variation of Information (VI) is a well-defined metric of independence between variables. Applying the triangle inequality from Eq. (4) to x, s, and c, we have

VI(x; s) + VI(x; c) \ge VI(s; c).

Equality occurs if and only if the information from variable x is totally separated into two independent variables s and c, which is an ideal scenario for disentangling sentence x into its corresponding style embedding s and content embedding c.
Therefore, the difference between VI(x; s) + VI(x; c) and VI(s; c) represents the degree of disentanglement. Hence we introduce a measurement:

D(x; s, c) = VI(x; s) + VI(x; c) - VI(s; c).

From Eq. (4), we know that D(x; s, c) is always non-negative. By the definition of VI in Eq. (3), D(x; s, c) can be simplified as:

D(x; s, c) = 2 [H(x) + I(s; c) - I(x; s) - I(x; c)].

Since H(x) is a constant associated with the data, we only need to focus on I(s; c) - I(x; s) - I(x; c).
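The equality condition can be illustrated with a toy example in which a variable splits exactly into two independent parts, so the measurement D(x; s, c) = VI(x; s) + VI(x; c) − VI(s; c) vanishes. The discrete "style" and "content" bits here are illustrative stand-ins for the embeddings, not the paper's model.

```python
import numpy as np

# Toy illustration of the disentanglement measurement: x carries two
# independent fair bits; the "style" s and "content" c each take one bit.
# In this ideal split, D(x;s,c) = VI(x;s) + VI(x;c) - VI(s;c) = 0, matching
# the equality condition of the triangle inequality.
rng = np.random.default_rng(1)
x = rng.integers(0, 4, size=100_000)   # two independent fair bits
s = x % 2                              # "style": low bit of x
c = x // 2                             # "content": high bit of x

def entropy(*vars_):
    joint = np.stack(vars_, axis=1)
    _, counts = np.unique(joint, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mi(a, b):
    return entropy(a) + entropy(b) - entropy(a, b)

def vi(a, b):
    return entropy(a) + entropy(b) - 2 * mi(a, b)

D = vi(x, s) + vi(x, c) - vi(s, c)
# Equivalent simplified form of the same measurement:
D_simplified = 2 * (entropy(x) + mi(s, c) - mi(x, s) - mi(x, c))
```

Both forms agree term-by-term, and both vanish here because s and c jointly determine x while being independent of each other.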
The measurement D(x; s, c) is symmetric in the style s and content c, giving rise to the problem that, without any inductive bias in supervision, the disentangled representation could be meaningless (as observed by Locatello et al. (2019)). Therefore, we add an inductive bias by utilizing the style label y as supervised information for the style embedding s. Noting that y \to x \to s is a Markov chain, we have

I(y; s) \le I(x; s)

based on the MI data-processing inequality (Cover and Thomas, 2012). Then we convert the minimization of -I(x; s) into the minimization of its upper bound -I(y; s), which further leads to our objective

L = I(s; c) - I(x; c) - I(y; s).

However, minimizing the exact value of mutual information in the objective causes numerical instabilities, especially when the dimension of the latent embeddings is large (Chen et al., 2016). Therefore, we provide MI estimations for the objective terms I(s; c), I(x; c), and I(y; s) in the following two sections.
3.2 MI Variational Lower Bound
To maximize I(x; c) and I(y; s), we derive two variational lower bounds. For I(x; c), we introduce a variational decoder q_\phi(x|c) to reconstruct the sentence x from the content embedding c. Leveraging the MI variational lower bound from Eq. (2), we have

I(x; c) \ge H(x) + E_{p(x, c)}[\log q_\phi(x|c)].

Similarly, for I(y; s), another variational lower bound can be obtained as I(y; s) \ge H(y) + E_{p(y, s)}[\log q_\psi(y|s)], where q_\psi(y|s) is a classifier mapping the style embedding s to its corresponding style label y. Based on these two lower bounds, the objective L = I(s; c) - I(x; c) - I(y; s) has an upper bound:

L \le I(s; c) - H(x) - E_{p(x, c)}[\log q_\phi(x|c)] - H(y) - E_{p(y, s)}[\log q_\psi(y|s)].   (6)

Noting that both H(x) and H(y) are constants determined by the data, we only need to minimize:

L_{Dis} = I(s; c) - E_{p(x, c)}[\log q_\phi(x|c)] - E_{p(y, s)}[\log q_\psi(y|s)].   (7)

As an intuitive explanation of L_{Dis}, the style embedding s and content embedding c are expected to be independent by minimizing the mutual information I(s; c), while they also need to be representative: the style embedding s is encouraged to give a better prediction of the style label y by maximizing E[\log q_\psi(y|s)]; the content embedding c should maximize the log-likelihood E[\log q_\phi(x|c)] to contain sufficient information from the sentence x.
3.3 MI Sample-based Upper Bound
To estimate I(s; c), we propose a novel sample-based upper bound. Assume we have N latent embedding pairs {(s_i, c_i)}_{i=1}^N drawn from p(s, c). As shown in Theorem 3.1, we derive an upper bound of mutual information based on the samples. A detailed proof is provided in the Supplementary Material.
Theorem 3.1.
If (s_i, c_i) \sim p(s, c) for i = 1, \dots, N, then

I(s; c) \le E\Big[\frac{1}{N} \sum_{i=1}^N R_i\Big],   (8)

where R_i = \log p(s_i | c_i) - \frac{1}{N} \sum_{j=1}^N \log p(s_j | c_i).
Based on Theorem 3.1, given embedding samples {(s_i, c_i)}_{i=1}^N, we can minimize \hat{I} = \frac{1}{N} \sum_{i=1}^N R_i as an unbiased estimation of the upper bound in Eq. (8). The calculation of \hat{I} requires the conditional distribution p(s|c), whose closed form is unknown. Therefore, we use a variational network q_\theta(s|c) to approximate p(s|c) with the embedding samples.
To implement the upper bound in Eq. (8), we first feed sentences into the encoder E to obtain embedding pairs (s_i, c_i). Then, we train the variational distribution q_\theta(s|c) by maximizing the log-likelihood of the embedding pairs. After the training of q_\theta is finished, we calculate R_i for each embedding pair (s_i, c_i). Finally, the gradient of \hat{I} is calculated and back-propagated to the encoder E. We apply the reparameterization trick (Kingma and Welling, 2013) to ensure the gradient back-propagates through the sampled embeddings (s_i, c_i). When the encoder weights are updated, the distribution p(s, c) changes, which leads to a change in the conditional distribution p(s|c). Therefore, we need to update the approximation network q_\theta again. Consequently, the encoder network and the approximation network are updated alternately during training.
In each training step, the above algorithm requires N pairs of embedding samples and the calculation of all N^2 conditional densities q_\theta(s_j|c_i). This leads to O(N^2) computational complexity. To accelerate training, we further approximate the term \frac{1}{N} \sum_{j=1}^N \log q_\theta(s_j|c_i) in R_i by \log q_\theta(s_{k_i}|c_i), where k_i is selected uniformly from the indices {1, \dots, N}. This stochastic sampling not only leads to an unbiased estimation of the inner average, but also improves the model robustness (as shown in Algorithm 1).
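A minimal sketch of the estimator follows, assuming the variational conditional is a fitted linear-Gaussian model (an illustrative stand-in for the paper's approximation network q_theta, not its actual architecture). It computes both the full O(N^2) estimate of Eq. (8) and the stochastic single-sample variant.

```python
import numpy as np

# Sketch of the sample-based MI upper bound of Eq. (8), assuming q(s|c) is a
# diagonal Gaussian whose mean is linear in c (stand-in for q_theta).
rng = np.random.default_rng(0)
N, ds, dc = 64, 4, 16
C = rng.normal(size=(N, dc))                  # content embeddings c_i
W = rng.normal(size=(dc, ds)) * 0.3
S = C @ W + 0.1 * rng.normal(size=(N, ds))    # correlated style embeddings s_i

def log_q(s, c, W, sigma=0.1):
    """log q(s|c) under the assumed linear-Gaussian variational model."""
    diff = s - c @ W
    return (-0.5 * np.sum(diff ** 2) / sigma ** 2
            - s.shape[-1] * np.log(sigma * np.sqrt(2 * np.pi)))

# Full estimator: (1/N) sum_i [ log q(s_i|c_i) - (1/N) sum_j log q(s_j|c_i) ],
# which needs all N^2 conditional evaluations.
pos = np.mean([log_q(S[i], C[i], W) for i in range(N)])
neg = np.mean([log_q(S[j], C[i], W) for i in range(N) for j in range(N)])
mi_upper_full = pos - neg

# Stochastic variant: replace the inner average over j by one uniformly
# sampled index k_i per i, giving an unbiased O(N) estimate.
k = rng.integers(0, N, size=N)
mi_upper_stoch = pos - np.mean([log_q(S[k[i]], C[i], W) for i in range(N)])
```

Because s and c are strongly correlated here, matched pairs score much higher under q(s|c) than shuffled pairs, so both estimates come out positive; with an independent (s, c) the two terms would roughly cancel.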
Symmetrically, we can also derive an MI upper bound based on the conditional distribution p(c|s). However, the dimension of c is much higher than that of s, which indicates that a neural approximation to p(c|s) would perform worse than an approximation to p(s|c). In contrast, the lower-dimensional distribution p(s|c) used in our model is relatively easy to approximate with neural networks.
3.4 Encoder-Decoder Framework
One important downstream task for disentangled representation learning (DRL) is conditional generation. Our MI-based text DRL method can also be embedded into an Encoder-Decoder generative model and trained end-to-end.
Since the proposed DRL encoder E is a stochastic neural network, a natural extension is to add a decoder to build a variational autoencoder (VAE) (Kingma and Welling, 2013). Therefore, we introduce another decoder network p_D(x|s, c) that generates a new sentence x based on the given style s and content c. A prior distribution p(s, c) = p(s) p(c), the product of two multivariate unit-variance Gaussians, is used to regularize the posterior distribution q_E(s, c|x)
by KL-divergence minimization. Meanwhile, the log-likelihood term for text reconstruction should be maximized. The objective for the VAE is:

L_{VAE} = KL(q_E(s, c|x) \| p(s) p(c)) - E_{q_E(s, c|x)}[\log p_D(x|s, c)].

We combine the VAE objective and our MI-based disentanglement term to form an end-to-end learning framework (as shown in Figure 2). The total loss function is

L_{total} = L_{VAE} + \lambda L_{Dis},

with \lambda a balancing weight.
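For the factorized standard-Gaussian prior, the KL regularizer has a closed form when the posterior is a diagonal Gaussian. The sketch below uses illustrative posterior parameters; the 32/512 dimension split matches the embedding sizes used in the experiments, but the parameter names are assumptions.

```python
import numpy as np

# Closed-form KL used to regularize the posterior q(s,c|x) toward the
# factorized standard-Gaussian prior p(s)p(c). Both sides factorize over
# dimensions, so the KL is a sum of per-dimension terms:
# KL(N(mu, sig^2) || N(0, 1)) = 0.5 * (mu^2 + sig^2 - log sig^2 - 1).
def kl_to_standard_normal(mu, log_var):
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)

# Illustrative posterior parameters for one sentence:
# 32-dim style embedding + 512-dim content embedding.
rng = np.random.default_rng(0)
mu_s, log_var_s = 0.1 * rng.normal(size=32), np.zeros(32)
mu_c, log_var_c = 0.1 * rng.normal(size=512), np.zeros(512)

# Because the prior factorizes into p(s) and p(c), the KL splits into a
# style part plus a content part.
kl_total = kl_to_standard_normal(mu_s, log_var_s) + \
    kl_to_standard_normal(mu_c, log_var_c)
```

The KL is zero exactly when the posterior matches the prior (zero mean, unit variance), and grows as the posterior drifts away from it.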
4 Related Work
4.1 Disentangled Representation Learning
Disentangled representation learning (DRL) can be classified into two categories: unsupervised disentangling and supervised disentangling. Unsupervised disentangling methods focus on adding constraints on the embedding space to enforce that each dimension of the space be as independent as possible (Burgess et al., 2018; Chen et al., 2018). However, Locatello et al. (2019) challenge the effectiveness of unsupervised disentangling without any inductive bias from data or supervision. For supervised disentangling, supervision is provided for the different parts of the disentangled representation. However, for text representation learning, supervised information can typically be provided only for the style embeddings (e.g., sentiment or personality labels), making the task much more challenging. John et al. (2019) tried to alleviate this issue by manually removing sentiment-related words from sentences. In contrast, our model is trained in an end-to-end manner without manually adding any supervision on the content embeddings.
4.2 Mutual Information Estimation
Mutual information (MI) is a fundamental measurement of the dependence between two random variables. MI has been applied to a wide range of tasks in machine learning, including generative modeling (Chen et al., 2016), the information bottleneck (Tishby et al., 2000), and domain adaptation (Gholami et al., 2020). In our proposed method, we utilize MI to measure the dependence between the content and style embeddings. By minimizing this MI, the learned content and style representations are explicitly disentangled.
However, the exact value of MI is hard to calculate, especially for high-dimensional embedding vectors (Poole et al., 2019). To approximate MI, most previous work focuses on lower-bound estimations (Chen et al., 2016; Belghazi et al., 2018; Poole et al., 2019), which are not applicable to MI minimization tasks. Poole et al. (2019) propose a leave-one-out upper bound of MI; however, it is not numerically stable in practice. Inspired by these observations, we introduce a novel MI upper bound for disentangled representation learning, which stably minimizes the correlation between content and style embeddings in a principled manner.
5 Experiments
5.1 Datasets
We conduct experiments to evaluate our models on the following realworld datasets:
Yelp Reviews: The Yelp dataset contains online service reviews with associated rating scores. We follow the preprocessing from Shen et al. (2017) for a fair comparison. The resulting dataset includes 250,000 positive review sentences and 350,000 negative review sentences.
Personality Captioning: The Personality-Captioning dataset (Shuster et al., 2019) contains captions of images written according to 215 different personality traits. These traits can be divided into three categories: positive, neutral, and negative. We select sentences from the positive and negative classes for evaluation.
5.2 Experimental Setup
We build the sentence encoder with a one-layer bidirectional LSTM plus a multi-head attention mechanism. The style classifier is parameterized by a single fully-connected network with softmax activation. The content-based decoder is a one-layer unidirectional LSTM followed by a linear layer with vocabulary-size output, producing the predicted probability of the next word. The conditional-distribution approximation network is a two-layer fully-connected network with ReLU activation. The generator is built from a two-layer unidirectional LSTM plus a linear projection with output dimension equal to the vocabulary size, providing the next-word prediction based on the previous sentence information and the current word.
We initialize and fix our word embeddings with the 300-dimensional pretrained GloVe vectors (Pennington et al., 2014). The style embedding dimension is set to 32 and the content embedding dimension to 512. We use a standard multivariate normal distribution as the prior of the latent spaces. We train the model with the Adam optimizer (Kingma and Ba, 2014); the batch size is 128.
5.3 Embedding Disentanglement Quality
We first examine the disentangling quality of learned latent embeddings, primarily studying the latent spaces of IDEL on the Yelp dataset.
Latent Space Visualization: We randomly select 1,000 sentences from the Yelp test set and visualize their latent embeddings in Figure 3, via t-SNE plots (van der Maaten and Hinton, 2008). The blue and red points respectively represent positive and negative sentences. The left side of the figure shows the style embedding space, which is well separated into two parts with different colors. This supports the claim that our model learns a semantically meaningful style embedding space. The right side of the figure shows the content embedding space, which cannot be distinguished by the style labels (colors). The lack of label-dependent structure in the content embeddings provides further evidence that our content embeddings have little correlation with the style labels.
For an ablation study, we train another IDEL model under the same setup, but with our MI upper bound term removed. We call this model IDEL (w/o MI) in the following experiments. We encode the same sentences used in Figure 3 and display the corresponding embeddings in Figure 4. Compared with the results from the original IDEL, the style embedding space (left in Figure 4) is not separated as cleanly. On the other hand, the positive and negative embeddings become distinguishable in the content embedding space. The difference between Figures 3 and 4 indicates the disentangling effectiveness of our MI upper bound.
Label-Embedding Correlation: Besides visualization, we also numerically analyze the correlation between latent embeddings and style labels. Inspired by statistical two-sample tests (Gretton et al., 2012), we use sample-based divergences between the positive embedding distribution and the negative embedding distribution as measurements of label-embedding correlation. We consider four divergences: Mean Absolute Deviation (MAD) (Geary, 1935), Energy Distance (ED) (Sejdinovic et al., 2013), Maximum Mean Discrepancy (MMD) (Gretton et al., 2012), and Wasserstein distance (WD) (Ramdas et al., 2017). For a fair comparison, we reimplement previous text embedding methods and set their content embedding dimension to 512 and their style embedding dimension to 32 (if applicable). Details about the divergences and embedding processing are given in the Supplementary Material.
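Two of these statistics can be sketched as follows, using Gaussian toy samples in place of real embeddings; the kernel bandwidth, sample sizes, and the biased (V-statistic) estimators are illustrative choices, not the paper's exact protocol.

```python
import numpy as np

# Sketch of two two-sample statistics used to measure label-embedding
# correlation: energy distance and (biased, Gaussian-kernel) MMD between
# "positive-class" and "negative-class" embedding samples.
def pairwise_dists(A, B):
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def energy_distance(A, B):
    # 2 E||a - b|| - E||a - a'|| - E||b - b'||
    return (2 * pairwise_dists(A, B).mean()
            - pairwise_dists(A, A).mean()
            - pairwise_dists(B, B).mean())

def mmd(A, B, bw=1.0):
    # Biased MMD^2 estimate with a Gaussian kernel of bandwidth bw.
    k = lambda D: np.exp(-D ** 2 / (2 * bw ** 2))
    return (k(pairwise_dists(A, A)).mean()
            + k(pairwise_dists(B, B)).mean()
            - 2 * k(pairwise_dists(A, B)).mean())

rng = np.random.default_rng(0)
pos = rng.normal(0.0, 1.0, size=(200, 8))    # stand-in "positive" embeddings
neg = rng.normal(1.0, 1.0, size=(200, 8))    # mean-shifted "negative" embeddings
same = rng.normal(0.0, 1.0, size=(200, 8))   # null case: same distribution
```

Both statistics come out much larger for the shifted pair (pos, neg) than for the null pair (pos, same), which is exactly the behavior that makes them usable as label-embedding correlation scores.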
Table 1: Automatic evaluation results on the Yelp and Personality-Captioning (PC) datasets.

Method         | Yelp Cond. Gen.  | Yelp Style Transfer     | PC Cond. Gen.    | PC Style Transfer
               | ACC   BLEU  GM   | ACC   BLEU  SBLEU  GM   | ACC   BLEU  GM   | ACC   BLEU  SBLEU  GM
CtrlGen        | 82.5  20.8  41.4 | 83.4  19.4  31.4   37.0 | 73.6  18.9  37.0 | 73.3  18.9  30.0   34.6
CAAE           | 78.9  19.7  39.4 | 79.3  18.5  28.2   34.6 | 72.2  19.5  37.5 | 72.1  18.3  27.4   33.1
ARAE           | 78.3  23.1  42.4 | 78.5  21.3  32.5   37.9 | 72.8  22.5  40.4 | 71.5  20.4  31.6   35.8
BT             | 81.4  20.2  40.5 | 86.3  24.1  35.6   41.9 | 74.1  21.0  39.4 | 75.9  23.1  34.2   39.1
DRLST          | 83.7  22.8  43.7 | 85.0  23.9  34.9   41.4 | 74.9  22.0  40.5 | 75.7  21.9  33.8   38.3
IDEL (w/o MI)  | 78.1  20.3  39.8 | 79.1  20.1  27.5   35.1 | 72.0  19.7  37.7 | 72.4  19.7  27.1   33.8
IDEL           | 83.9  23.0  43.9 | 85.7  24.3  35.2   41.9 | 75.1  22.3  40.9 | 75.6  23.3  34.6   39.4
From Table 2, the proposed IDEL achieves the lowest divergences between positive and negative content embeddings, compared with CtrlGen (Hu et al., 2017), CAAE (Shen et al., 2017), ARAE (Zhao et al., 2018), Back-Translation (BT) (Lample et al., 2019), and DRLST (John et al., 2019), indicating our model better disentangles the content embeddings from the style labels. For style embeddings, we compare IDEL with DRLST, the only prior method that infers text style embeddings. Table 3 shows a larger distribution gap between positive and negative style embeddings with IDEL than with DRLST, which demonstrates that the proposed IDEL better expresses style information in the style embedding space. The comparison between IDEL and IDEL (w/o MI) supports the effectiveness of our MI upper bound minimization.
Table 2: Divergences between positive and negative content embeddings (lower indicates better content disentanglement).

Method         | MAD    ED     WD     MMD
CtrlGen        | 0.261  0.105  0.311  0.063
CAAE           | 0.285  0.112  0.306  0.078
ARAE           | 0.194  0.050  0.248  0.042
BT             | 0.211  0.053  0.269  0.049
DRLST          | 0.181  0.048  0.215  0.031
IDEL (w/o MI)  | 0.217  0.077  0.293  0.051
IDEL           | 0.063  0.015  0.084  0.010
Table 3: Divergences between positive and negative style embeddings (higher indicates better style separation).

Method         | MAD    ED     WD     MMD
DRLST          | 1.024  0.503  1.375  0.286
IDEL (w/o MI)  | 0.996  0.489  1.124  0.251
IDEL           | 1.167  0.583  1.392  0.302
5.4 Embedding Representation Quality
To show the representation ability of IDEL, we conduct experiments on two text-generation tasks: style transfer and conditional generation.
For style transfer, we encode two sentences into disentangled representations, and then combine the style embedding from one sentence with the content embedding from the other to generate a new sentence via the generator. For conditional generation, we fix one of the style or content embeddings and sample the other from the latent prior distribution, then use the combination to generate text. Since most previous work only embeds the content information, for a fair comparison we mainly focus on fixing the style and sampling content embeddings in the conditional generation setup.
To measure generation quality for both tasks, we test the following metrics (more specific description is provided in the Supplementary Material).
Style Preservation: Following previous work (Hu et al., 2017; Shen et al., 2017; John et al., 2019), we pretrain a style classifier and use it to test whether a generated sentence can be categorized into the correct target style class.
Content Preservation: For style transfer, we measure whether a generated sentence preserves the content information of the original sentence via the self-BLEU score (Zhang et al., 2019, 2020). The self-BLEU is calculated between an original sentence and its style-transferred sentence.
Generation Quality: To measure the generation quality, we calculate the corpuslevel BLEU score (Papineni et al., 2002) between a generated sentence and the testing data corpus.
Geometric Mean: We use the geometric mean (GM) (John et al., 2019) of the above metrics to obtain an overall evaluation of the representativeness of DRL models.
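Assuming GM is the plain geometric mean of the individual metrics on a common percentage scale (the exact aggregation follows John et al. (2019)), the overall score can be reproduced from the table entries:

```python
import numpy as np

# Sketch of the overall GM score, assumed to be the plain geometric mean of
# the individual metrics on a common percentage scale.
def geometric_mean(scores):
    return float(np.prod(scores) ** (1.0 / len(scores)))

# Recomputing the Yelp style-transfer GM for the full IDEL row of Table 1:
# ACC = 85.7, BLEU = 24.3, SBLEU = 35.2, which the table reports as GM 41.9.
gm = geometric_mean([85.7, 24.3, 35.2])
```

The recomputed value agrees with the tabulated 41.9 up to rounding, which is consistent with the assumed aggregation.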
Table 4: Style transfer examples from IDEL on the Yelp dataset.

Example 1 — style source: "never before had a bad experience at the habit until tonight."
  I enjoy it thoroughly!  ->  I dislike it thoroughly.
  quality is just so so.  ->  quality is so bad.
  I am so grateful.  ->  I am so disgusted.

Example 2 — content source: "never before had a bad experience at the habit until tonight."
  style: I am so grateful.  ->  never had a service that was enjoyable experience tonight.
  style: quality is just so so.  ->  never had a unimpressed experience until tonight.
  style: quality of food is fantastic.  ->  never had awesome routine until tonight.

Example 3 — content source: "I am so disappointed with palm today."
  style: we were both so impressed.  ->  I am so impressed with palm again.
  style: quality of food is fantastic.  ->  I am good with palm today.
  style: never before had a bad experience at the habit until tonight.  ->  I am so disgusted with palm today.
We compare our IDEL with previous state-of-the-art methods on the Yelp and Personality-Captioning datasets, as shown in Table 1. References for the other models are given in Section 5.3. Note that the original Back-Translation (BT) method (Lample et al., 2019) is an autoencoder framework, and thus is not able to do conditional generation. To compare with BT fairly, we add a standard Gaussian prior to its latent space to make it a variational autoencoder.
From the results in Table 1, ARAE performs well on conditional generation. Compared to ARAE, our model's performance is slightly lower on content preservation (BLEU). In contrast, the style classification score of IDEL exceeds that of ARAE by a large margin. Back-Translation (BT) has a better performance on style transfer tasks, especially on the Yelp dataset: IDEL has a lower style classification accuracy (ACC) than BT on style transfer. However, IDEL achieves a high BLEU on style transfer, which leads to a high overall GM score on the Personality-Captioning dataset. On the Yelp dataset, IDEL also has a competitive GM score compared with BT. The experiments show a clear trade-off between style preservation and content preservation, in which our IDEL learns more representative disentangled representations and achieves a better balance.
Besides the automatic evaluation metrics mentioned above, we further test the effectiveness of our disentangled representations by human evaluation. Due to the cost of manual annotation, we only evaluate the style transfer performance on the Yelp dataset. The generated sentences are manually evaluated on style accuracy (SA), content preservation (CP), and sentence fluency (SF). The CP and SF scores range from 0 to 5. Details are provided in the Supplementary Material. Our method achieves better style and content preservation, with a small sacrifice in sentence fluency.
Human evaluation of style transfer on the Yelp dataset.

Method   | SA           | CP    SF    GM
CtrlGen  | 71.2 (3.56)  | 3.25  3.12  3.30
CAAE     | 63.1 (3.16)  | 2.83  3.06  3.01
ARAE     | 68.0 (3.40)  | 3.44  3.09  3.31
IDEL     | 73.7 (3.69)  | 3.39  3.21  3.42
Ablation study on the Yelp style transfer task.

Method              | ACC   BLEU  SBLEU  GM
VAE                 | 52.1  24.7  20.8   29.9
VAE + I(y; s)       | 86.1  23.3  16.4   32.0
VAE + I(x; c)       | 50.2  24.0  36.3   34.7
IDEL (w/o MI)       | 79.1  20.1  27.5   35.1
IDEL (closed-form)  | 85.5  24.0  35.0   41.5
IDEL                | 85.7  24.3  35.2   41.9
Table 4 shows three style transfer examples from IDEL on the Yelp dataset. The first example shows three sentences transferred with the style from a given sentence. The other two examples transfer each given sentence based on the styles of three different sentences. Our IDEL not only transfers sentences into target sentiment classes, but also renders the sentence with more detailed style information (e.g., the degree of the sentiment).
In addition, we conduct an ablation study to test the influence of different objective terms in our model. We retrain the model with different training-loss combinations while keeping all other setups the same. In Table 1, IDEL surpasses IDEL (w/o MI upper bound minimization) by a large gap, demonstrating the effectiveness of our proposed MI upper bound. The vanilla VAE has the best generation quality; however, its transfer style accuracy is only slightly better than a random guess. When adding the style term I(y; s), the ACC score significantly improves, but the content preservation (SBLEU) becomes worse. When adding the content term I(x; c), the content information is well preserved, while the ACC even decreases. By gradually adding MI terms, the model performance becomes more balanced on all metrics, with the overall GM monotonically increasing. Additionally, we compare the stochastic calculation of R_i in Algorithm 1 (IDEL) with the closed form from Theorem 3.1 (IDEL (closed-form)). The stochastic IDEL not only accelerates the training but also gains a performance improvement relative to the closed-form variant.
6 Conclusions
We have proposed a novel information-theoretic disentangled text representation learning framework. Following theoretical guidance from information theory, our method separates the textual information into independent spaces, constituting style and content representations. A sample-based mutual information upper bound is derived to help reduce the dependence between the embedding spaces. Concurrently, the original text information is well preserved by maximizing the mutual information between input sentences and latent representations. In experiments, we introduce several two-sample test statistics to measure label-embedding correlation. The proposed model achieves competitive performance compared with previous methods on both conditional generation and style transfer. For future work, our model can be extended to disentangled representation learning with non-categorical style labels, and applied to zero-shot style transfer with newly arriving, unseen styles.
Acknowledgements
This work was supported by NEC Labs America, and was conducted while the first author was doing an internship at NEC Labs America.
References
Barber and Agakov (2003). The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems.
Belghazi et al. (2018). Mutual information neural estimation. In International Conference on Machine Learning, pp. 530–539.
Burgess et al. (2018). Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599.
Chen et al. (2018). Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
Chen et al. (2016). InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180.
Chou et al. (2018). Multi-target voice conversion without parallel data by adversarially learning disentangled audio representations. In Proc. Interspeech 2018, pp. 501–505.
Cover and Thomas (2012). Elements of Information Theory. John Wiley & Sons.
Denton et al. (2017). Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems, pp. 4414–4423.
Geary (1935). The ratio of the mean deviation to the standard deviation as a test of normality. Biometrika 27(3/4), pp. 310–332.
Gholami et al. (2020). Unsupervised multi-target domain adaptation: an information theoretic approach. IEEE Transactions on Image Processing 29, pp. 3993–4002.
Gretton et al. (2012). A kernel two-sample test. Journal of Machine Learning Research 13, pp. 723–773.
Hsieh et al. (2018). Learning to decompose and disentangle representations for video prediction. In Advances in Neural Information Processing Systems, pp. 517–526.
Hu et al. (2017). Toward controlled generation of text. In Proceedings of the 34th International Conference on Machine Learning, pp. 1587–1596.
John et al. (2019). Disentangled representation learning for non-parallel text style transfer. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kraskov et al. (2005). Hierarchical clustering using mutual information. EPL (Europhysics Letters) 70(2), p. 278.
Kumar Verma et al. (2018). Generalized zero-shot learning via synthesized examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4281–4289.
Lample et al. (2019). Multiple-attribute text rewriting. In International Conference on Learning Representations.
Lee et al. (2018). Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 35–51.
Liu et al. (2018). Detach and adapt: learning cross-domain disentangled deep representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8867–8876.
Locatello et al. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In International Conference on Machine Learning, pp. 4114–4124.
Meilă (2007). Comparing clusterings—an information based distance. Journal of Multivariate Analysis 98(5), pp. 873–895.
Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
Pennington et al. (2014). GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
Poole et al. (2019). On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180.
Ramdas et al. (2017). On Wasserstein two-sample testing and related families of nonparametric tests. Entropy 19(2), p. 47.
Sejdinovic et al. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. The Annals of Statistics 41(5), pp. 2263–2291.
Shen et al. (2017). Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pp. 6830–6841.
Shuster et al. (2019). Engaging image captioning via personality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12516–12526.
Tishby et al. (2000). The information bottleneck method. arXiv preprint physics/0004057.
Tran et al. (2017). Disentangled representation learning GAN for pose-invariant face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1415–1424.
van der Maaten and Hinton (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research.
Yingzhen and Mandt (2018). Disentangled sequential autoencoder. In International Conference on Machine Learning, pp. 5656–5665.
Zhang et al. (2020). Improving adversarial text generation by modeling the distant future. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
Zhang et al. (2019). Text-based interactive recommendation via constraint-augmented reinforcement learning. In Advances in Neural Information Processing Systems, pp. 15214–15224.
Zhao et al. (2018). Adversarially regularized autoencoders. In Proceedings of the 35th International Conference on Machine Learning, pp. 5902–5911.
Zhou et al. (2019). Talking face generation by adversarially disentangled audio-visual representation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9299–9306.
Appendix A Proofs of Theorems
Proof of Theorem 3.1.
First, we show that
\[
\mathbb{E}_{p(x,y)}\big[\log p(y|x)\big] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}\big[\log p(y|x)\big] \;\ge\; I(x;y). \tag{9}
\]
Calculate the gap between the left-hand side and right-hand side of Eq. (9):
\[
\begin{aligned}
&\mathbb{E}_{p(x,y)}\big[\log p(y|x)\big] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}\big[\log p(y|x)\big] - I(x;y) \\
&= \mathbb{E}_{p(y)}\big[\log p(y)\big] - \mathbb{E}_{p(y)}\Big[\mathbb{E}_{p(x)}\big[\log p(y|x)\big]\Big] \\
&\ge \mathbb{E}_{p(y)}\big[\log p(y)\big] - \mathbb{E}_{p(y)}\Big[\log \mathbb{E}_{p(x)}\big[p(y|x)\big]\Big] \qquad \text{(Jensen's inequality)} \\
&= \mathbb{E}_{p(y)}\big[\log p(y) - \log p(y)\big] = 0.
\end{aligned}
\]
Therefore, the inequality in Eq. (9) holds.
Given sample pairs \(\{(x_i, y_i)\}_{i=1}^{N}\), the left-hand side of Eq. (9) has an unbiased estimation:
\[
\frac{1}{N}\sum_{i=1}^{N}\Big[\log p(y_i|x_i) - \frac{1}{N}\sum_{j=1}^{N}\log p(y_j|x_i)\Big],
\]
which is what we claim in Theorem 3.1. ∎
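As a numerical sanity check of this estimator, the sketch below (our own illustration, not code from the paper) applies it to a Gaussian channel, where the conditional density p(y|x) and the true mutual information are both known in closed form; the sample estimate should upper-bound the analytic value.

```python
import numpy as np

def mi_upper_bound(x, y, sigma):
    """Sample-based MI upper bound from Theorem 3.1:
    (1/N) sum_i [ log p(y_i|x_i) - (1/N) sum_j log p(y_j|x_i) ],
    here for a Gaussian channel with known p(y|x) = N(y; x, sigma^2)."""
    # log_cond[i, j] = log p(y_j | x_i), built by broadcasting
    log_cond = (-0.5 * ((y[None, :] - x[:, None]) / sigma) ** 2
                - np.log(sigma * np.sqrt(2 * np.pi)))
    positive = np.mean(np.diag(log_cond))  # jointly drawn pairs (x_i, y_i)
    negative = np.mean(log_cond)           # all (x_i, y_j) pairs (marginal term)
    return positive - negative

rng = np.random.default_rng(0)
sigma = 0.5
x = rng.standard_normal(2000)
y = x + sigma * rng.standard_normal(2000)

est = mi_upper_bound(x, y, sigma)
true_mi = 0.5 * np.log(1 + 1 / sigma**2)  # analytic I(x; y) for this channel
```

With the true conditional plugged in, `est` exceeds `true_mi`, as Eq. (9) predicts; in practice the conditional is unknown and is replaced by a learned approximation.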
Proof of the Lower Bounds in Eq. (6).
The inequality is based on the fact that the KL divergence is always non-negative: for any variational distribution \(q(y|x)\),
\[
I(x;y) = \mathbb{E}_{p(x,y)}\big[\log q(y|x)\big] + H(y) + \mathbb{E}_{p(x)}\Big[\mathrm{KL}\big(p(y|x)\,\|\,q(y|x)\big)\Big] \;\ge\; \mathbb{E}_{p(x,y)}\big[\log q(y|x)\big] + H(y).
\]
The lower bound for the other term in Eq. (6) can be derived in the same way. ∎
Appendix B Detailed Experimental Setups
We set the dimension of the style embedding smaller than that of the content embedding, because the content carries more information than the style of a sentence. The hyperparameter in our loss function reweights the two objectives of disentanglement and autoencoding. In practice, we vary it from 0 to 1 in steps of 0.1 during the first 10 training epochs. At the beginning of training, the output latent embeddings are not yet representative enough, so we place a small weight on the disentanglement term to avoid obstructing the learning of representative embeddings. Once the latent embeddings are sufficiently trained, i.e., can successfully reconstruct the input sentences, we slowly enlarge the weight of the disentanglement term. After it reaches 1, we keep it fixed until all training epochs are finished.
Appendix C Sample-based Embedding Divergences
In this section, we provide implementation details for computing the label-embedding correlation. As mentioned in Section 5.4, the divergence between the two label-conditional embedding distributions measures the correlation between content embeddings and style labels. Assume two groups of samples \(\{a_i\}_{i=1}^{N}\) and \(\{b_j\}_{j=1}^{M}\), one drawn from each conditional distribution. The four metrics MAD, ED, WD, and MMD are then calculated from these two groups. With a ground distance \(d(\cdot,\cdot)\), the four metrics are implemented as follows:
\[
\mathrm{MAD} = d\Big(\frac{1}{N}\sum_{i=1}^{N} a_i,\; \frac{1}{M}\sum_{j=1}^{M} b_j\Big), \tag{10}
\]
\[
\mathrm{ED} = \frac{2}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} d(a_i, b_j) - \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} d(a_i, a_{i'}) - \frac{1}{M^2}\sum_{j=1}^{M}\sum_{j'=1}^{M} d(b_j, b_{j'}), \tag{11}
\]
\[
\mathrm{WD} = \min_{\pi \in \Pi} \sum_{i=1}^{N}\sum_{j=1}^{M} \pi_{ij}\, d(a_i, b_j), \tag{12}
\]
where \(\Pi\) is the set of couplings \(\pi \ge 0\) with uniform marginals \(\sum_j \pi_{ij} = 1/N\) and \(\sum_i \pi_{ij} = 1/M\), and
\[
\mathrm{MMD}^2 = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} k(a_i, a_{i'}) + \frac{1}{M^2}\sum_{j=1}^{M}\sum_{j'=1}^{M} k(b_j, b_{j'}) - \frac{2}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M} k(a_i, b_j), \tag{13}
\]
where \(k(\cdot,\cdot)\) is a kernel function. Here we choose \(k\) from the RBF kernel family, \(k(a,b) = \exp\big(-d(a,b)^2 / (2\sigma^2)\big)\), with bandwidth \(\sigma\).
For the style embeddings, the calculations take the same form as the equations above. However, the style embeddings and content embeddings have different dimensions, which makes a Euclidean ground metric inconsistent between the two. Therefore, instead of the Euclidean distance, we use the cosine distance as the ground metric.
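Under the standard sample-based forms of these four metrics, the estimators can be computed directly from the two sample groups. The sketch below is our own illustration (function and parameter names are not from the paper); it uses the Euclidean ground distance and computes WD by optimal one-to-one assignment, which is exact when the two groups have equal size:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def divergences(a, b, sigma=1.0):
    """Sample-based MAD, ED, WD, and MMD^2 between two groups of
    embeddings a (N x d) and b (M x d), Euclidean ground distance.
    WD assumes N == M so the optimal coupling is a permutation."""
    dab, daa, dbb = cdist(a, b), cdist(a, a), cdist(b, b)
    mad = np.linalg.norm(a.mean(0) - b.mean(0))      # distance between sample means
    ed = 2 * dab.mean() - daa.mean() - dbb.mean()    # energy distance
    row, col = linear_sum_assignment(dab)            # empirical Wasserstein distance
    wd = dab[row, col].mean()
    k = lambda dist: np.exp(-dist ** 2 / (2 * sigma ** 2))  # RBF kernel on distances
    mmd = k(daa).mean() + k(dbb).mean() - 2 * k(dab).mean()
    return mad, ed, wd, mmd

rng = np.random.default_rng(1)
a = rng.standard_normal((64, 8))
same = divergences(a, a.copy())    # all four are exactly 0 for identical groups
shifted = divergences(a, a + 2.0)  # all four clearly positive for shifted groups
```

All four estimators vanish when the two groups coincide and grow with the separation between the groups, which is the behavior the label-embedding correlation metric relies on.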
Appendix D Details in Representation Quality Evaluation
For style preservation, we pre-train a style classifier on each dataset. The classifier is a one-layer LSTM followed by a multi-head attention layer with 6 attention heads. The classifiers reach 95% prediction accuracy on Yelp and 93% on Personality-Captioning. We feed the transferred sentences into the classifier and test whether the predicted style label matches the target style label.
For human evaluation, we transfer 1000 sentences with randomly selected style labels. We then ask 10 human annotators to judge the style label, content preservation, and fluency of each transferred sentence. The style label is 0 or 1, representing the positive or negative sentiment of the given sentence; content preservation and fluency are each scored from 0 to 5. To make the style accuracy comparable with the other two scores, we scale it to the range [0, 5]. If the scores from the two annotators of a sentence differ by more than 2, the scores are not recorded. In this way, we ensure that the annotators' evaluation criteria are similar.
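The scaling and agreement-filtering protocol described above amounts to a few lines of bookkeeping. The following is a minimal sketch (the function names are ours, chosen for illustration):

```python
def scale_accuracy(acc, top=5):
    """Map a style accuracy in [0, 1] onto the [0, 5] scale used for
    the content-preservation and fluency scores."""
    return acc * top

def aggregate(pairs, max_diff=2):
    """Average per-sentence scores from two annotators, discarding a
    sentence's scores when the annotators differ by more than
    `max_diff` (the agreement filter described above)."""
    kept = [(a + b) / 2 for a, b in pairs if abs(a - b) <= max_diff]
    return sum(kept) / len(kept) if kept else None

# e.g. (1, 5) is dropped because the annotators disagree by 4 > 2
score = aggregate([(4, 5), (1, 5), (3, 3)])
```

Dropping high-disagreement pairs before averaging keeps a single outlier annotator from skewing the reported scores.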