1 Introduction
Representation learning is one of the most fundamental problems in machine learning. A real-world data distribution can be regarded as a low-dimensional manifold in a high-dimensional space
[Bengio et al.2013]. Generative models in deep learning, such as the variational autoencoder (VAE) [Kingma and Welling2013] and generative adversarial network (GAN) [Goodfellow et al.2014], can learn a low-dimensional manifold representation (factor) as a latent variable. The factors are fundamental components such as position, color, and degree of smiling in an image of a human face [Liu et al.2015]. A disentangled representation is defined as one in which a single factor is represented by a single latent variable [Bengio et al.2013]. Thus, in a model that has learned a disentangled representation, shifting one latent variable while leaving the others fixed generates data in which only the corresponding factor changes. This is called latent traversal (a good demonstration was given by [Higgins et al.2016a]; it is available at http://tinyurl.com/jgbyzke). There are two advantages of disentangled representation. First, the latent variables are interpretable. Second, the disentangled representation is generalizable and robust against adversarial attacks [Alemi et al.2016].

We focus on disentangled representation learning for sequential data. Sequential data are characterized by dynamic and static factors: dynamic factors are time dependent, and static factors are independent of time. With disentangled representation learning from sequential data, we should be able to extract dynamic factors that cannot be extracted by disentangled-representation-learning models for non-sequential data, such as β-VAE [Higgins et al.2016a, Higgins et al.2016b] and InfoGAN [Chen et al.2016]. The concept of disentangled representation learning for sequential data is illustrated in Fig. 1. Consider a pseudo-dataset of the movement of a submarine with one dynamic factor: the trajectory shape. A disentangled-representation-learning model for sequential data can extract this shape. On the other hand, since a disentangled-representation-learning model for non-sequential data does not take the sequence of the data into account, it merely extracts the x- and y-positions.
There is a wide range of potential applications if we extend disentangled representation to sequential data such as speech, video, and stock-market data. For example, disentangled representation learning for stock-market data can extract the fundamental trend of a given stock price. Another application is the reduction of the action space in reinforcement learning. Extracting dynamic factors would enable the generation of macro-actions [Durugkar et al.2016], which are sets of sequential actions that represent the fundamental factors of the actions. Thus, disentangled representation learning for sequential data opens the door to new areas of research.

Very recent related work [Hsu et al.2017, Li and Mandt2018] separated the factors of sequential data into dynamic and static factors. The factorized hierarchical VAE (FHVAE) [Hsu et al.2017] is based on a graphical model using latent variables with different time dependencies. By maximizing the variational lower bound of the graphical model, FHVAE separates factors with different time dependencies, such as dynamic factors, from static factors. The VAE architecture developed by [Li and Mandt2018] is the same as that of FHVAE in terms of the time dependencies of the latent variables. Since these models require different time dependencies for the latent variables, they cannot be used to disentangle factors that have the same time dependency.
We address this problem by taking a different approach. First, we analyze the root cause of disentanglement from the perspective of information theory. As a result, the term causing disentanglement is derived from a more fundamental rule: reduce the mutual dependence between the input and output of an encoder while keeping the reconstruction of the data. This is called the information bottleneck (IB) principle. We naturally extend this principle to sequential data, from the relationship between x and z to that between x_{1:T} and z. This enables the separation of multiple dynamic factors as a consequence of information compression. It is difficult to learn a disentangled representation of sequential data since not only the feature space but also the time space should be compressed. We therefore developed the factorized action VAE (FAVAE), in which we implemented the concept of information capacity to stabilize learning and a ladder network to learn a disentangled representation in accordance with the level of data abstraction. Since FAVAE is a more general model without the restriction of a graphical-model design distinguishing between static and dynamic factors, it can separate factors that have the same time dependency. It can also separate factors into dynamic and static ones.
2 Disentanglement for Non-Sequential Data
The β-VAE [Higgins et al.2016a, Higgins et al.2016b] is commonly used for learning disentangled representations; it is based on the VAE framework [Kingma and Welling2013] for a generative model. The VAE can estimate the probability density p(x) from data x. The objective function of the VAE maximizes the evidence lower bound (ELBO) of log p(x) as
\log p(x) = \mathrm{ELBO} + D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z|x)), \quad (1)

where z is a latent variable, D_{\mathrm{KL}} is the Kullback–Leibler divergence, and q_\phi(z|x) is an approximated distribution of p(z|x). The D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z|x)) term reduces to zero as the ELBO increases; thus, q_\phi(z|x) learns a good approximation of p(z|x). The ELBO is defined as

\mathrm{ELBO} = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)), \quad (2)

where the first term is a reconstruction term used to reconstruct x, and the second term is a regularization term used to regularize the posterior q_\phi(z|x). The encoder q_\phi(z|x) and decoder p_\theta(x|z) are learned in the VAE.
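As a minimal sketch (not the authors' code), the ELBO of Eq. (2) can be written as a loss function. Here we assume a Gaussian decoder, so the reconstruction term reduces to a squared error, and a diagonal-Gaussian encoder, so the KL term has a closed form; `negative_elbo` and its argument names are our own illustrative choices:

```python
import numpy as np

def negative_elbo(x, x_recon, mu, log_var, beta=1.0):
    """Negative ELBO (Eq. 2) for an encoder q(z|x) = N(mu, diag(sigma^2))
    and standard normal prior p(z) = N(0, I).
    beta = 1 gives the plain VAE; beta > 1 gives the beta-VAE objective."""
    # Reconstruction term: squared error stands in for -E_q[log p(x|z)]
    recon = np.sum((x_recon - x) ** 2)
    # Closed-form KL divergence D_KL(N(mu, sigma^2) || N(0, 1)), summed over dims
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    return recon + beta * kl
```

Minimizing this quantity maximizes the ELBO; the `beta` parameter anticipates the β-VAE objective discussed next.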
Next, we explain how β-VAE extracts disentangled representations from unlabeled data. The β-VAE is an extension of the VAE that introduces a coefficient β on the regularization term. The objective function of β-VAE is

\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta \, D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)), \quad (3)

where β > 1 and p(z) = \mathcal{N}(0, 1). The β-VAE promotes disentangled representation learning via D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)). As β increases, the latent variable q_\phi(z|x) approaches the prior p(z); therefore, each z_i is forced toward the probability distribution \mathcal{N}(0, 1). However, if all latent variables become \mathcal{N}(0, 1), the model cannot reconstruct x. As a result, as long as x can be reconstructed, β-VAE reduces the information in z.

3 Preliminary: Origin of Disentanglement
To clarify the origin of disentanglement, we examine the regularization term, which has been decomposed into three terms [Chen et al.2018, Kim and Mnih2018, Hoffman and Johnson2016]:

\mathbb{E}_{p(x)}\left[ D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)) \right] = I_q(x; z) + D_{\mathrm{KL}}\Big( q(z) \,\Big\|\, \prod_j q(z_j) \Big) + \sum_j D_{\mathrm{KL}}(q(z_j) \,\|\, p(z_j)), \quad (4)

where z_j denotes the jth dimension of the latent variable. The second term, which is called "total correlation" in information theory, quantifies the redundancy or dependency among a set of n random variables [Watanabe1960]. The β-TCVAE [Chen et al.2018] has been experimentally shown to reduce the total correlation, causing disentanglement. The third term indirectly causes disentanglement by bringing q(z_j) close to the independent standard normal distribution \mathcal{N}(0, 1). The first term is the mutual information between the data variable and the latent variable based on the empirical data distribution. Minimizing the regularization term causes disentanglement but disrupts reconstruction via the first term in Eq. (4). The shift-C scheme was proposed [Burgess et al.2018] to solve this conflict:

\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta \, \big| D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)) - C \big|, \quad (5)
where the constant shift C, which is called "information capacity," linearly increases during training. This can be understood from the point of view of the information bottleneck [Tishby et al.2000]. The VAE can be derived by maximizing the ELBO, but β-VAE can no longer be interpreted as an ELBO once this scheme has been applied. Instead, the objective function of β-VAE is derived from the information bottleneck [Alemi et al.2016, Achille and Soatto2018, Tishby et al.2000, Chechik et al.2005]:

\max \; I(z; y) - \beta \, I(z; \tilde{x}), \quad (6)

where \tilde{x} follows the empirical distribution. Solving this equation by using Lagrange multipliers derives the objective function of β-VAE (Eq. (5)) with β as the Lagrange multiplier (details in Appendix B of [Alemi et al.2016]). In Eq. (5), C prevents I(z; \tilde{x}) from becoming zero. In the literature on the information bottleneck, y typically stands for a classification task; however, the formulation can be related to the autoencoding objective [Alemi et al.2016]. Therefore, the objective function of β-VAE can be understood via the IB principle.
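To make the shift-C scheme of Eq. (5) concrete, the following sketch (our own illustrative names, not the authors' code) anneals the capacity linearly over training steps and applies the β-weighted absolute penalty:

```python
def capacity_at(step, c_max=25.0, anneal_steps=100_000):
    # C linearly increases from 0 to c_max during training (hypothetical schedule)
    return min(c_max, c_max * step / anneal_steps)

def capacity_objective(recon_loss, kl, step, beta=100.0,
                       c_max=25.0, anneal_steps=100_000):
    """Eq. (5): reconstruction loss plus beta * |KL - C|.
    The absolute value keeps the KL (the information encoded in z)
    pinned near the target C, so C prevents it from collapsing to zero."""
    c = capacity_at(step, c_max, anneal_steps)
    return recon_loss + beta * abs(kl - c)
```

The values of `c_max`, `anneal_steps`, and `beta` are placeholders; in practice they are tuned per dataset.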
4 Proposed Model: Disentanglement for Sequential Data
FAVAE learns disentangled and interpretable representations from sequential data without supervision. We consider sequential data x_{1:T} = (x_1, \ldots, x_T) generated from a latent variable model,

p(x_{1:T}) = \int p(x_{1:T} | z) \, p(z) \, dz. \quad (7)

For sequential data, we replace x with x_{1:T} in Eq. (5). The objective function of FAVAE is thus

\mathbb{E}_{q_\phi(z|x_{1:T})}[\log p_\theta(x_{1:T}|z)] - \beta \, \big| D_{\mathrm{KL}}(q_\phi(z|x_{1:T}) \,\|\, p(z)) - C \big|, \quad (8)

where p(z) = \mathcal{N}(0, 1). The variational recurrent neural network [Chung et al.2015] and stochastic recurrent neural network (SRNN) [Fraccaro et al.2016] extend the VAE to a recurrent framework. The priors of both networks are dependent on time; a time-dependent prior experimentally improves the ELBO. In contrast, the prior of FAVAE is independent of time, like those of the stochastic recurrent network (STORN) [Bayer and Osendorfer2014] and the Deep Recurrent Attentive Writer (DRAW) neural network architecture [Gregor et al.2015]; this is because FAVAE targets disentangled representation learning rather than density estimation. For better understanding, consider FAVAE from the perspective of the information bottleneck. As with β-VAE, FAVAE can be understood from the IB principle:

\max \; I(z; x_{1:T}) - \beta \, I(z; \tilde{x}_{1:T}), \quad (9)

where \tilde{x}_{1:T} follows an empirical distribution. This principle makes the representation z compact, while the reconstruction of the sequential data is maintained via I(z; x_{1:T}) (see Appendix A).
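Since FAVAE uses a ladder network (next section) with a separate capacity per ladder, the training loss of Eq. (8) can be sketched as follows; this is our own reading of the objective, not the authors' exact code, and all names are illustrative:

```python
def favae_objective(recon_loss, kl_per_ladder, c_per_ladder, beta=100.0):
    """Eq. (8) extended to L ladders: reconstruction of x_{1:T} plus
    beta * |KL_l - C_l| summed over the ladder networks l = 1..L.
    kl_per_ladder and c_per_ladder are per-ladder KL values and capacities."""
    penalty = sum(abs(kl - c) for kl, c in zip(kl_per_ladder, c_per_ladder))
    return recon_loss + beta * penalty
```

With a single ladder and C = 0 this reduces to the β-VAE objective of Eq. (3).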
4.1 Ladder Network
An important extension of FAVAE is a hierarchical representation scheme inspired by the variational ladder autoencoder (VLAE) [Zhao et al.2017]. The encoder within the ladder network is defined as

h_l = f_l(h_{l-1}), \quad (10)
q(z_l | x_{1:T}) = \mathcal{N}\big(z_l \,\big|\, \mu_l(h_l), \mathrm{diag}(\sigma_l^2(h_l))\big), \quad (11)

where l = 1, \ldots, L is the layer index, h_0 = x_{1:T}, and f_l is a time-convolution network, which is explained in the next section. The decoder within the ladder network is defined as

\tilde{z}_L = z_L, \quad (12)
\tilde{z}_l = g_{l+1}(\tilde{z}_{l+1}) + \mathrm{gate}(z_l), \quad l = L-1, \ldots, 1, \quad (13)
x_{1:T} \sim r\big(x_{1:T};\, g_1(\tilde{z}_1)\big), \quad (14)

where g_l is the time-deconvolution network and r is a distribution family parameterized by g_1(\tilde{z}_1). The gate computes the Hadamard product of its learnable parameter and the input tensor. We set r as a fixed-variance factored Gaussian distribution with the mean given by g_1(\tilde{z}_1).
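The top-down decoder pass can be sketched as follows. This is a simplified illustration under our own assumptions: each time-deconvolution network g_l is replaced by a plain matrix, and all names (`ladder_decode`, `g_mats`, `gate_params`) are hypothetical:

```python
import numpy as np

def gate(w, z):
    # The gate: Hadamard product of a learnable parameter w and input z
    return w * z

def ladder_decode(zs, g_mats, gate_params):
    """Top-down decoder pass sketch: start from the deepest latent z_L
    and successively combine g_{l+1}(z~_{l+1}) with gate(z_l).
    zs: latent samples [z_1, ..., z_L] from shallow to deep;
    g_mats[l]: matrix standing in for the time-deconvolution network."""
    z_tilde = zs[-1]                              # z~_L = z_L
    for l in range(len(zs) - 2, -1, -1):
        z_tilde = g_mats[l] @ z_tilde + gate(gate_params[l], zs[l])
    return z_tilde  # parameterizes the output distribution r(x_{1:T})
```

The gate lets each ladder inject its latent code at its own abstraction level instead of forcing all information through the deepest latent.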
. Figure (2) shows the architecture of FAVAE. The difference between each ladder network in the model is the number of convolution networks through which data passes. The abstract expressions should differ between ladders since the timeconvolution layer abstracts sequential data. Without the ladder network, FAVAE can disentangle only the representations at the same level of abstraction; with the ladder network, it can disentangle representations at different levels of abstraction.4.2 How to encode and decode
There are several mainstream neural network models designed for sequential data, such as the long short-term memory (LSTM) model [Hochreiter and Schmidhuber1997], the gated recurrent unit model [Chung et al.2014], and the quasi-recurrent neural network (QRNN) [Bradbury et al.2016]. Since VLAE builds its hierarchical structure by stacking convolutional layers, it is simple to add the time convolution of the QRNN to FAVAE. The input data are x_{1:T}, where t is the time index and d is the index of the feature-vector dimension. The time convolution treats the dimensions of the feature vector as convolution channels and performs convolution in the time direction:

h_t^{c'} = \sum_{c} \sum_{k} W_k^{c',c} \, x_{t+k}^{c}, \quad (15)

where c is the channel index. FAVAE has a network similar to that of β-VAE with respect to time convolution and a loss function similar to that of β-VAE (Eq. (8)). We use batch normalization [Hinton et al.2012] and the rectified linear unit as the activation function, though other variations are possible. Our 1D convolutional layers use a filter size of 3 and a stride of 2 and do not use a pooling layer.
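A strided time convolution in the sense of Eq. (15) can be sketched directly in NumPy; this is an illustrative reference implementation (the name `time_conv` is ours), not the paper's training code:

```python
import numpy as np

def time_conv(x, w, stride=2):
    """Strided 1D convolution in the time direction (Eq. 15 sketch).
    x: (C_in, T) sequence with feature-vector dimensions as channels,
    w: (C_out, C_in, K) filter bank; filter size K=3 and stride 2
    match the layer settings described in the text."""
    c_out, c_in, k = w.shape
    T = x.shape[1]
    n_out = (T - k) // stride + 1
    out = np.zeros((c_out, n_out))
    for i in range(n_out):
        window = x[:, i * stride : i * stride + k]      # (C_in, K)
        # h_i^{c'} = sum over channels c and taps k of W_k^{c',c} * x_{i*stride+k}^{c}
        out[:, i] = np.tensordot(w, window, axes=([1, 2], [0, 1]))
    return out
```

Stacking such layers halves the temporal resolution at each level, which is what gives the ladders their different levels of abstraction.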
5 Measuring Disentanglement
While latent traversals are useful for checking the success or failure of disentanglement, quantification of disentanglement is required for reliably evaluating a learned model. Various disentanglement quantification methods have been reported [Eastwood and Williams2018, Chen et al.2018, Kim and Mnih2018, Higgins et al.2016b, Higgins et al.2016a], but there is no standard method. We use the mutual information gap (MIG) [Chen et al.2018] as the metric for disentanglement. The basic idea of MIG is to measure the mutual information between latent variables z and a ground-truth factor v_k; higher mutual information means that z_j contains more information regarding v_k:

\mathrm{MIG} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{H(v_k)} \Big( I(z_{j^{(k)}}; v_k) - \max_{j \neq j^{(k)}} I(z_j; v_k) \Big), \quad (16)

where j^{(k)} = \arg\max_j I(z_j; v_k), and H(v_k) is the entropy of v_k used for normalization.
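Given a precomputed matrix of mutual information values, Eq. (16) is a simple gap computation; the following sketch (illustrative names, assuming the MI matrix and factor entropies are estimated elsewhere) makes it explicit:

```python
import numpy as np

def mig_score(mi, factor_entropy):
    """MIG (Eq. 16): for each ground-truth factor v_k, the gap between the
    largest and second-largest mutual information I(z_j; v_k) over latent
    dimensions j, normalized by the entropy H(v_k), averaged over factors.
    mi: (K, J) matrix of I(z_j; v_k); factor_entropy: (K,) vector of H(v_k)."""
    sorted_mi = np.sort(mi, axis=1)[:, ::-1]   # descending per factor
    gaps = (sorted_mi[:, 0] - sorted_mi[:, 1]) / factor_entropy
    return float(np.mean(gaps))
```

A score near 1 means each factor is captured by exactly one latent dimension; a score near 0 means the top two latents carry similar information about the factor.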
There is a problem with MIG when measuring disentanglement on simple data. When it is possible to reconstruct the data with one latent variable, a large β gathers all factor information into that one latent variable, and MIG becomes large (Fig. 4). For example, when the goal position, curved inward/outward, and degree of curvature cannot be disentangled into different latent variables in the 2D Reaching dataset, MIG can nevertheless become large. In our experiments, we avoided this problem by excluding the case in which all factor information is concentrated in one latent variable.
6 Related Work
Several recently reported models [Hsu et al.2017, Li and Mandt2018] disentangle static and dynamic factors in sequential data, such as speech and video [Garofolo et al.1993, Pearce and Picone2002], by means of graphical models. These models learn by building the time dependency into the prior of the latent variable. In particular, FHVAE [Hsu et al.2017] uses label data that distinguish time series for learning. Note that the label is not a dynamic factor but a label for distinguishing between time series. In contrast, FAVAE performs disentanglement through its loss function (see Eq. (8)). The advantage of graphical models is that they can control the interpretable factors by controlling the prior's time dependency. However, since all dynamic factors have the same time dependency, these models cannot disentangle dynamic factors from one another. Since FAVAE has no time-dependency constraint on the prior, it can disentangle static from dynamic factors as well as disentangle sets of dynamic factors.
7 Experiments
We experimentally evaluated FAVAE using five sequential datasets: 2D Reaching with sequence lengths 100 and 1000, 2D Wavy Reaching with sequence lengths 100 and 1000, and the Sprites dataset [Li and Mandt2018]. We used a fixed batch size and the Adam [Kingma and Ba2014] optimizer with a fixed learning rate.
7.1 2D Reaching
To determine the differences between FAVAE and β-VAE, we used a two-dimensional reaching dataset. Starting from point (0, 0), the point travels to goal position (−0.1, +1) or (+0.1, +1). There are ten possible trajectories to each goal; five are curved inward, and the other five are curved outward. The degree of curvature differs among the five trajectories. The number of factor combinations was thus 20 (2 × 2 × 5). The trajectory lengths were 100 and 1000.
We compared the performances of β-VAE and FAVAE trained on the 2D Reaching dataset. Latent traversal shifts one dimension of latent variable z to another value and reconstructs the output data from the traversed latent variables. The β-VAE, which can learn only from each point of a trajectory separately, encodes data points into latent variables that are parallel to the x- and y-axes (Fig. 3). In contrast, FAVAE learns from each entire trajectory and can encode disentangled representations effectively, so feasible trajectories are generated from the traversed latent variables (Fig. 3).
7.2 2D Wavy Reaching
Model | 2D Reaching, len=100 (MIG / Rec) | 2D Reaching, len=1000 (MIG / Rec) | 2D Wavy Reaching, len=100 (MIG / Rec) | 2D Wavy Reaching, len=1000 (MIG / Rec)
FHVAE | 0.43(14) / 0.0013(23) | – / – | 0.22(8) / 0.043(61) | – / –
FAVAE (L, β = 0) | 0.06(3) / 0.022(22) | 0.05(4) / 0.493(790) | 0.02(1) / 0.015(5) | 0.04(3) / 0.085(17)
FAVAE (β) | 0.07(12) / 0.257(173) | 0.46(18) / 2.209(1869) | 0.66(15) / 0.041(8) | 0.47(18) / 11.881(24014)
FAVAE (β, C) | 0.09(13) / 0.257(172) | 0.46(18) / 1.193(1274) | 0.67(16) / 0.042(21) | 0.31(10) / 5.937(18033)
FAVAE (L, β) | 0.28(21) / 0.006(4) | 0.43(6) / 0.022(9) | 0.29(9) / 0.123(16) | 0.28(4) / 0.707(86)
FAVAE (L, β, C) | 0.28(11) / 0.008(14) | 0.64(6) / 0.017(6) | 0.42(17) / 0.046(11) | 0.24(7) / 0.190(95)

Table 1: Disentanglement scores (MIG) and reconstruction losses (Rec), with standard deviations in parentheses, from repeating each experiment 10 times. (L) means with the ladder network, and (C) means with information capacity; β = 0 denotes the reconstruction-only baseline.

To confirm the effect of disentanglement through the information bottleneck, we evaluated the validity of FAVAE under more complex conditions by adding more factors to 2D Reaching. Five factors in total generated the data, compared with the three factors in 2D Reaching. This modified dataset differs in that four of the five factors affect only part of the trajectory: two affect the first half, and the other two affect the second half. This means that the model should be able to focus on a certain part of the whole trajectory and extract factors related to that part. A detailed explanation of these factors is given on GitHub (https://github.com/favae/favae_ijcai2019).
We show the training dataset of 2D Wavy Reaching and the latent traversal of FAVAE (L, C) with sequence length 1000 in Fig. 6. Even though not all learned representations were perfectly disentangled, the visualization shows that all five generation factors were learned by five latent variables; the other latent variables did not learn any meaningful factors, indicating that the factors could be expressed as a combination of the five "active" latent variables.
We compared various models on the basis of MIG to demonstrate the validity of FAVAE: a time-convolution AE in which only the reconstruction loss is used (β = 0), FAVAE with and without the ladder network (L) and information capacity (C), and FHVAE [Hsu et al.2017], a recently proposed disentangled-representation-learning model, as the baseline. Note that FHVAE uses label information (a label for distinguishing time series, not a dynamic factor) to disentangle time-series data, which is a different setup from that of FAVAE. Table 1 compares MIG scores and reconstruction losses, using FHVAE as the baseline, for 2D Reaching and 2D Wavy Reaching, each with sequence lengths of 100 and 1000. On 2D Reaching, the MIG of the baseline was larger, while on 2D Wavy Reaching the MIG of FAVAE was larger. This is because FHVAE uses goal-position information as a label when learning. Even when there were multiple dynamic factors, as in 2D Wavy Reaching, FAVAE exhibited good disentanglement performance (a large MIG and a small reconstruction loss).
When the ladder network was added, the reconstruction loss was stable (especially at sequence length 1000). For example, for length 1000 of 2D Wavy Reaching in Table 1, FAVAE without the ladder had a large MIG, but the variance of its reconstruction loss was very large.
To confirm the effect of C, we evaluated the reconstruction losses and MIG scores for various β using three ladder networks (Fig. 2) with a different capacity for each ladder (Fig. 5). In one setting, C was not used; in the other, each ladder's capacity was adjusted on the basis of its KL divergence. When C was not used, FAVAE could not reconstruct the data when β was high; thus, a disentangled representation was not learned well for high β. When C was used, the MIG score increased with β while the reconstruction loss was suppressed.
Dataset | Factor | 1st | 2nd | 3rd
2D Reaching | factor 1 | 1 | 1 | 8
2D Reaching | factor 2 | 10 | 0 | 0
2D Reaching | factor 3 | 10 | 0 | 0
2D Wavy Reaching | factor 1 | 3 | 0 | 7
2D Wavy Reaching | factor 2 | 8 | 0 | 2
2D Wavy Reaching | factor 3 | 8 | 0 | 2
2D Wavy Reaching | factor 4 | 9 | 1 | 0
2D Wavy Reaching | factor 5 | 9 | 0 | 1

Table 2: Count of the ladder (1st, 2nd, or 3rd) containing the latent variable with the highest mutual information with each factor, over 10 runs.
We expected that the ladder network can disentangle representations at different levels of abstraction. In this section, we evaluate the factor extracted in each ladder using 2D Reaching and 2D Wavy Reaching. Table 2 counts, for each factor, which ladder network contained the latent variable with the highest mutual information; the rows represent factors, and the columns represent the ladder index. Factor 1 (goal left/right) in 2D Reaching and factor 1 (goal position) in 2D Wavy Reaching were extracted most frequently by a latent variable in the 3rd ladder. Since the latent variables have eight dimensions in the 1st ladder, four in the 2nd ladder, and two in the 3rd ladder, the 3rd ladder should be the least frequent if factors were assigned to latent variables at random. In 2D Wavy Reaching, there is a distinct difference between factors with long and short time dependencies: the goal position affects the entire trajectory, while the other factors each affect half of it (Fig. 6). In these experiments, the factor affecting the entire trajectory tended to be expressed in the 3rd ladder. In both datasets, only factor 1 represents the goal position while the others represent the shape of the trajectories; since factor 1 has a different abstraction level from the others, it ends up in a different ladder (the 3rd) from the rest.
7.3 Sprites dataset
To evaluate effectiveness on a video dataset, we trained FAVAE on the Sprites dataset used in [Li and Mandt2018]. This dataset contains RGB video data of fixed sequence length and consists of static and dynamic factors. Note that the motions are not created as combinations of dynamic factors; each motion exists individually (dataset details are given on GitHub, https://github.com/favae/favae_ijcai2019). We performed disentangled representation learning with FAVAE; the β, C, and network-architecture settings used for this training are also given on GitHub. Figure 7 shows the results of latent traversal, in which each latent variable is traversed between two fixed values. The latent variables in the 1st ladder extract expressions of motion (4th in the 1st ladder), pant color (5th), direction of the character (6th), and shirt color (7th). The latent variables in the 2nd ladder extract expressions of hair color (1st in the 2nd ladder) and skin color (2nd). Thus, FAVAE can extract disentangled representations of static and dynamic factors in high-dimensional datasets.
8 Summary and Future Work
FAVAE learns disentangled and interpretable representations from sequential data via the information bottleneck. The experiments using three sequential datasets demonstrated that it can learn disentangled representations. Future work includes extending the time-convolution part to a sequence-to-sequence model [Sutskever et al.2014] and applying the model to actions in reinforcement learning to reduce the pattern of actions.
References
 [Achille and Soatto2018] Alessandro Achille and Stefano Soatto. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 [Alemi et al.2016] Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
 [Bayer and Osendorfer2014] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
 [Bengio et al.2013] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013.
 [Bradbury et al.2016] James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. Quasi-recurrent neural networks. arXiv preprint arXiv:1611.01576, 2016.
 [Burgess et al.2018] Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
 [Chechik et al.2005] Gal Chechik, Amir Globerson, Naftali Tishby, and Yair Weiss. Information bottleneck for Gaussian variables. Journal of machine learning research, 6(Jan):165–188, 2005.
 [Chen et al.2016] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
 [Chen et al.2018] Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
 [Chung et al.2014] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
 [Chung et al.2015] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
 [Durugkar et al.2016] Ishan P Durugkar, Clemens Rosenbaum, Stefan Dernbach, and Sridhar Mahadevan. Deep reinforcement learning with macro-actions. arXiv preprint arXiv:1606.04615, 2016.
 [Eastwood and Williams2018] Cian Eastwood and Christopher KI Williams. A framework for the quantitative evaluation of disentangled representations. 2018.
 [Fraccaro et al.2016] Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in neural information processing systems, pages 2199–2207, 2016.
 [Garofolo et al.1993] John S Garofolo, Lori F Lamel, William M Fisher, Jonathan G Fiscus, and David S Pallett. DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. NIST speech disc 1-1.1. NASA STI/Recon technical report n, 93, 1993.
 [Goodfellow et al.2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [Gregor et al.2015] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 [Higgins et al.2016a] Irina Higgins, Loic Matthey, Xavier Glorot, Arka Pal, Benigno Uria, Charles Blundell, Shakir Mohamed, and Alexander Lerchner. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579, 2016.
 [Higgins et al.2016b] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.
 [Hinton et al.2012] Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing coadaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
 [Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.

 [Hoffman and Johnson2016] Matthew D Hoffman and Matthew J Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
 [Hsu et al.2017] Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in neural information processing systems, pages 1878–1889, 2017.
 [Kim and Mnih2018] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
 [Kingma and Ba2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [Kingma and Welling2013] Diederik P Kingma and Max Welling. Autoencoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [Li and Mandt2018] Yingzhen Li and Stephan Mandt. A deep generative model for disentangled representations of sequential data. arXiv preprint arXiv:1803.02991, 2018.

 [Liu et al.2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
 [Pearce and Picone2002] David Pearce and J Picone. Aurora working group: DSR front end LVCSR evaluation AU/384/02. Inst. for Signal & Inform. Process., Mississippi State Univ., Tech. Rep, 2002.
 [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014.
 [Tishby et al.2000] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
 [Watanabe1960] Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of research and development, 4(1):66–82, 1960.
 [Zhao et al.2017] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from generative models. arXiv preprint arXiv:1702.08396, 2017.