
FAVAE: Sequence Disentanglement using Information Bottleneck Principle

We propose the factorized action variational autoencoder (FAVAE), a state-of-the-art generative model for learning disentangled and interpretable representations from sequential data via the information bottleneck without supervision. The purpose of disentangled representation learning is to obtain interpretable and transferable representations from data. We focus on the disentangled representation of sequential data because extending disentangled representation learning to sequential data such as video, speech, and stock-market data opens a wide range of potential applications. Sequential data are characterized by dynamic and static factors: dynamic factors are time dependent, and static factors are independent of time. Previous models disentangle static and dynamic factors by explicitly modeling the priors of latent variables to distinguish between these factors. However, these models cannot disentangle representations between dynamic factors, such as disentangling "picking up" and "throwing" in robotic tasks. FAVAE can disentangle multiple dynamic factors. Since it does not require modeling priors, it can disentangle "between" dynamic factors. We conducted experiments to show that FAVAE can extract disentangled dynamic factors.





1 Introduction

(a) $\beta$-VAE accepts sequential data
(b) Latent traversal of $\beta$-VAE
(c) FAVAE accepts sequential data
(d) Latent traversal of FAVAE
Figure 1: Illustration of how FAVAE differs from $\beta$-VAE. $\beta$-VAE does not accept data sequentially; it cannot differentiate data points from different trajectories or sequences of data points. FAVAE takes into account the sequence of data points, taking all data points in a trajectory as one datum. For example, for a pseudo-dataset representing the trajectory of a submarine (1a, 1c), $\beta$-VAE accepts 11 different positions of the submarine as non-sequential data, while FAVAE accepts three different trajectories of the submarine as sequential data. Therefore, the latent variable in $\beta$-VAE learns only the coordinates of the submarine, and latent traversal shows a change in the submarine's position. In contrast, FAVAE learns the factor that controls the trajectory of the submarine, so latent traversal shows a change in the submarine's trajectory.

Representation learning is one of the most fundamental problems in machine learning. A real-world data distribution can be regarded as a low-dimensional manifold in a high-dimensional space [Bengio et al.2013]. Generative models in deep learning, such as the variational autoencoder (VAE) [Kingma and Welling2013] and generative adversarial network (GAN) [Goodfellow et al.2014], can learn a low-dimensional manifold representation (factor) as a latent variable. The factors are fundamental components such as position, color, and degree of smiling in an image of a human face [Liu et al.2015]. Disentangled representation is defined as a single factor being represented by a single latent variable [Bengio et al.2013]. Thus, in a model that has learned a disentangled representation, shifting one latent variable while leaving the others fixed generates data in which only the corresponding factor has changed. This is called latent traversal (a good demonstration of which was given by [Higgins et al.2016a]). There are two advantages of disentangled representation. First, latent variables are interpretable. Second, the disentangled representation is generalizable and robust against adversarial attacks [Alemi et al.2016].

We focus on the disentangled representation learning of sequential data. Sequential data are characterized by dynamic and static factors: dynamic factors are time dependent, and static factors are independent of time. With disentangled representation learning from sequential data, we should be able to extract dynamic factors that cannot be extracted using disentangled-representation-learning models for non-sequential data, such as $\beta$-VAE [Higgins et al.2016a, Higgins et al.2016b] and InfoGAN [Chen et al.2016]. The concept of disentangled representation learning for sequential data is illustrated in Fig. 1. Consider that the pseudo-dataset of the movement of a submarine has a dynamic factor: the trajectory shape. The disentangled-representation-learning model for sequential data can extract this shape. On the other hand, since the disentangled-representation-learning model for non-sequential data does not take into account the sequence of data, it merely extracts the x- and y-positions.

There is a wide range of potential applications if we extend disentangled representation learning to sequential data such as speech, video, and stock-market data. For example, disentangled representation learning for stock-market data can extract the fundamental trend of a given stock price. Another application is the reduction of action space in reinforcement learning. Extracting dynamic factors would enable the generation of macro-actions [Durugkar et al.2016], which are sets of sequential actions that represent the fundamental factors of the actions. Thus, disentangled representation learning for sequential data opens the door to new areas of research.

Very recent related work [Hsu et al.2017, Li and Mandt2018] separated factors of sequential data into dynamic and static factors. The factorized hierarchical VAE (FHVAE)  [Hsu et al.2017] is based on a graphical model using latent variables with different time dependencies. By maximizing the variational lower bound of the graphical model, FHVAE separates the different time-dependent factors, such as dynamic, from static factors. The VAE architecture developed by [Li and Mandt2018] is the same as that in the FHVAE in terms of the time dependencies of the latent variables. Since these models require different time dependencies for the latent variables, they cannot be used to disentangle variables with the same time-dependency factor.

We address this problem by taking a different approach. First, we analyze the root cause of disentanglement from the perspective of information theory. As a result, the term causing disentanglement is derived from a more fundamental rule: reduce the mutual dependence between the input and output of an encoder while keeping the reconstruction of the data. This is called the information bottleneck (IB) principle. We naturally extend this principle to sequential data, from the relationship between $x$ and $z$ to that between $x_{1:T}$ and $z$. This enables the separation of multiple dynamic factors as a consequence of information compression. It is difficult to learn a disentangled representation of sequential data since not only the feature space but also the time space should be compressed. We developed the factorized action VAE (FAVAE), in which we implemented the concept of information capacity to stabilize learning and a ladder network to learn a disentangled representation in accordance with the level of data abstraction. Since FAVAE is a more general model without the restriction of a graphical model design to distinguish between static and dynamic factors, it can separate factors that share the same time dependency. It can also separate factors into dynamic and static.

2 Disentanglement for Non-Sequential Data

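The $\beta$-VAE [Higgins et al.2016a, Higgins et al.2016b] is commonly used for learning disentangled representations based on the VAE framework [Kingma and Welling2013] for a generative model. The VAE can estimate the probability density from data $x$. The objective function of the VAE maximizes the evidence lower bound (ELBO) of $\log p(x)$:

$$\log p(x) = \mathcal{L}_{VAE} + D_{KL}(q(z|x)\,\|\,p(z|x)), \tag{1}$$

where $z$ is a latent variable, $D_{KL}$ is the Kullback-Leibler divergence, and $q(z|x)$ is an approximated distribution of $p(z|x)$. The $D_{KL}(q(z|x)\,\|\,p(z|x))$ reduces to zero as the ELBO $\mathcal{L}_{VAE}$ increases; thus, $q(z|x)$ learns a good approximation of $p(z|x)$. The ELBO is defined as

$$\mathcal{L}_{VAE} \equiv \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}(q(z|x)\,\|\,p(z)), \tag{2}$$

where the first term is a reconstruction term used to reconstruct $x$, and the second term is a regularization term used to regularize posterior $q(z|x)$. Encoder $q(z|x)$ and decoder $p(x|z)$ are learned in the VAE.

Next, we explain how $\beta$-VAE extracts disentangled representations from unlabeled data. The $\beta$-VAE extends the VAE by placing a coefficient $\beta$ on the regularization term. The objective function of $\beta$-VAE is

$$\mathcal{L}_{\beta VAE} \equiv \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta\, D_{KL}(q(z|x)\,\|\,p(z)), \tag{3}$$

where $\beta > 1$ and $p(z) = \mathcal{N}(0, I)$. The $\beta$-VAE promotes disentangled representation learning via $D_{KL}(q(z|x)\,\|\,p(z))$. As $\beta$ increases, the latent variable $q(z|x)$ approaches the prior $p(z)$; therefore, each $z_i$ is forced to learn the probability distribution of $\mathcal{N}(0, 1)$. However, if all latent variables become $\mathcal{N}(0, 1)$, the model cannot reconstruct $x$. As a result, as long as $z$ reconstructs $x$, $\beta$-VAE reduces the information of $z$.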
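As a rough illustration of the $\beta$-VAE objective described above, the sketch below assumes a diagonal-Gaussian encoder and a standard normal prior, for which the KL term has a well-known closed form. The function names and the default `beta=4.0` are hypothetical choices for illustration and are not taken from the paper.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def beta_vae_objective(log_likelihood, mu, log_var, beta=4.0):
    """Single-sample estimate of the beta-VAE objective (a quantity to maximize).

    log_likelihood: reconstruction term log p(x|z) for one sampled z (assumed given).
    beta: weight on the KL regularizer; beta = 1 recovers the plain VAE ELBO.
    """
    return log_likelihood - beta * gaussian_kl(mu, log_var)
```

With `mu` and `log_var` both zero the posterior equals the prior, so the KL term vanishes and the latent variable carries no information about $x$; this is the collapse that the reconstruction term has to counteract.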

3 Preliminary: Origin of Disentanglement

To clarify the origin of disentanglement, we explain the regularization term. The regularization term has been decomposed into three terms [Chen et al.2018, Kim and Mnih2018, Hoffman and Johnson2016]:

$$\mathbb{E}_{p(x)}\left[D_{KL}(q(z|x)\,\|\,p(z))\right] = I(x; z) + D_{KL}\Big(q(z)\,\Big\|\,\prod_j q(z_j)\Big) + \sum_j D_{KL}(q(z_j)\,\|\,p(z_j)), \tag{4}$$

where $z_j$ denotes the $j$-th dimension of the latent variable. The second term, which is called "total correlation" in information theory, quantifies the redundancy or dependency among a set of n random variables [Watanabe1960]. The $\beta$-TCVAE [Chen et al.2018] has been experimentally shown to reduce the total correlation causing disentanglement. The third term indirectly causes disentanglement by bringing $q(z)$ close to the independent standard normal distribution $p(z)$. The first term is mutual information between the data variable and latent variable based on the empirical data distribution. Minimizing the regularization term causes disentanglement but disrupts reconstruction via the first term in Eq. (4). The shift scheme was proposed [Burgess et al.2018] to solve this conflict:

$$\mathcal{L} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta\,\left|D_{KL}(q(z|x)\,\|\,p(z)) - C\right|, \tag{5}$$

where constant shift $C$, which is called "information capacity," linearly increases during training. This can be understood from the point of view of an information bottleneck [Tishby et al.2000]. The VAE can be derived by maximizing the ELBO, but $\beta$-VAE can no longer be interpreted as an ELBO once this scheme has been applied. The objective function of $\beta$-VAE is derived from the information bottleneck [Alemi et al.2016, Achille and Soatto2018, Tishby et al.2000, Chechik et al.2005]:

$$\max I(z; y) \quad \text{s.t.} \quad \left|I(\hat{x}; z) - C\right| = 0, \tag{6}$$

where $\hat{x}$ follows the empirical distribution. Solving this equation by using Lagrange multipliers yields the objective function of $\beta$-VAE (Eq. (5)) with $\beta$ as the Lagrange multiplier (details in Appendix B of [Alemi et al.2016]). In Eq. (5), $C$ prevents $I(\hat{x}; z)$ from becoming zero. In the literature on information bottleneck, $y$ typically stands for a classification task; however, the formulation can be related to the autoencoding objective [Alemi et al.2016]. Therefore, the objective function of $\beta$-VAE can be understood using the IB principle.
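The shift scheme described above can be sketched as follows. This is a minimal sketch under stated assumptions: the text says only that the information capacity increases linearly during training, so the schedule endpoints (`total_steps`, `c_max`) and both function names are hypothetical.

```python
def capacity(step, total_steps, c_max):
    """Information capacity C, increased linearly from 0 to c_max and then held.

    total_steps and c_max are hypothetical hyperparameters for illustration.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return c_max * frac

def shifted_objective(log_likelihood, kl, beta, c):
    """Objective with the KL term penalized for deviating from capacity c.

    The penalty is zero when kl == c, so the encoder is pushed to transmit
    about c nats instead of being pushed all the way to zero.
    """
    return log_likelihood - beta * abs(kl - c)
```

Because the penalty is symmetric around $C$, a KL value below the capacity is penalized as much as one above it, which is what keeps the mutual information from collapsing to zero for large $\beta$.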

4 Proposed Model: Disentanglement for Sequential Data

FAVAE learns disentangled and interpretable representations from sequential data without supervision. We consider sequential data $x_{1:T} \equiv \{x_1, x_2, \ldots, x_T\}$ generated from a latent variable model,

$$x_{1:T} \sim p(x_{1:T}|z), \quad z \sim p(z). \tag{7}$$

For sequential data, we replace $x$ with $x_{1:T}$ in Eq. (5). The objective function of FAVAE is

$$\mathcal{L}_{FAVAE} = \mathbb{E}_{q(z|x_{1:T})}[\log p(x_{1:T}|z)] - \beta\,\left|D_{KL}(q(z|x_{1:T})\,\|\,p(z)) - C\right|, \tag{8}$$

where $p(z) = \mathcal{N}(0, I)$. The variational recurrent neural network [Chung et al.2015] and stochastic recurrent neural network (SRNN) [Fraccaro et al.2016] extend the VAE to a recurrent framework. The priors of both networks are dependent on time. The time-dependent prior experimentally improves the ELBO. In contrast, the prior of FAVAE is independent of time, like those of the stochastic recurrent network (STORN) [Bayer and Osendorfer2014] and the Deep Recurrent Attentive Writer (DRAW) neural network architecture [Gregor et al.2015]; this is because FAVAE targets disentangled representation learning rather than density estimation. For better understanding, consider FAVAE from the perspective of the information bottleneck. As with $\beta$-VAE, FAVAE can be understood from the IB principle:

$$\max I(z; x_{1:T}) \quad \text{s.t.} \quad \left|I(\hat{x}_{1:T}; z) - C\right| = 0, \tag{9}$$

where $\hat{x}_{1:T}$ follows an empirical distribution. These principles make the representation of $z$ compact, while reconstruction of the sequential data is represented by $x_{1:T}$ (see Appendix A).

4.1 Ladder Network

Figure 2: FAVAE architecture

An important extension to FAVAE is a hierarchical representation scheme inspired by the variational ladder AE (VLAE) [Zhao et al.2017]. Encoder within a ladder network is defined as

$$h_l = f_l(h_{l-1}),$$
$$z_l \sim \mathcal{N}(\mu_l(h_l), \sigma_l(h_l)),$$

where $l$ is a layer index, $h_0 \equiv x_{1:T}$, and $f_l$ is a time-convolution network, which is explained in the next section. Decoder within the ladder network is defined as

$$\tilde{z}_L = f_L(z_L),$$
$$\tilde{z}_l = f_l(\tilde{z}_{l+1} + \mathrm{gate}(z_l)),$$
$$x_{1:T} \sim r(x_{1:T}; f_0(\tilde{z}_1)),$$

where $f_l$ is the time deconvolution network with $l = 1, \ldots, L-1$, and $r$ is a distribution family parameterized by $f_0(\tilde{z}_1)$. The gate computes the Hadamard product of its learnable parameter and input tensor. We set $r$ as a fixed-variance factored Gaussian distribution with the mean given by $f_0(\tilde{z}_1)$. Figure 2 shows the architecture of FAVAE. The difference between each ladder network in the model is the number of convolution networks through which data passes. The abstract expressions should differ between ladders since the time-convolution layer abstracts sequential data. Without the ladder network, FAVAE can disentangle only the representations at the same level of abstraction; with the ladder network, it can disentangle representations at different levels of abstraction.
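The gate operation mentioned above is just an element-wise product, which the following sketch makes concrete with plain Python lists. The `merge` helper is an assumption made for illustration: it combines the gated latent with the signal coming from the ladder above by addition, which the text does not spell out.

```python
def gate(weights, z):
    """Hadamard (element-wise) product of a learnable parameter and the input."""
    if len(weights) != len(z):
        raise ValueError("gate parameter and input must have the same shape")
    return [w * v for w, v in zip(weights, z)]

def merge(upper, z, weights):
    """Combine the signal from the ladder above with the gated latent.

    Addition is an illustrative assumption, not a detail confirmed by the text.
    """
    return [u + g for u, g in zip(upper, gate(weights, z))]
```

A gate weight near zero effectively switches off the corresponding latent dimension in that ladder, so the learnable parameter lets the decoder decide how much each latent contributes.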

4.2 How to encode and decode

There are several mainstream neural network models designed for sequential data, such as the long short-term memory (LSTM) model [Hochreiter and Schmidhuber1997], gated recurrent unit model [Chung et al.2014], and quasi-recurrent neural network (QRNN) [Bradbury et al.2016]. Because VLAE has a hierarchical structure built by stacking convolutional neural networks, it is simple to add the time convolution of the QRNN to FAVAE. The input data are $x_{t,i}$, where $t$ is the time index and $i$ is the index of the feature-vector dimension. The time convolution treats the dimensions $i$ of the feature vector as convolution channels and performs convolution in the time direction:

$$z_{t,j} = \sum_{p,i} h_{p,j,i}\, x_{t+p,\,i} + b_j,$$

where $j$ is the channel index, $h$ is the convolution kernel, and $b$ is the bias. FAVAE has a network similar to that of VAE regarding time convolution and a loss function similar to that of $\beta$-VAE (Eq. (8)). We use batch normalization [Hinton et al.2012] and the rectified linear unit as the activation function, though other variations are possible. For example, the 1D convolutional neural networks use a filter size of 3 and a stride of 2 and do not use a pooling layer.
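To make the time convolution concrete, here is a minimal pure-Python sketch with the filter size of 3 and stride of 2 mentioned above. It assumes no padding, and the kernel layout `kernel[p][j][i]` (time offset, output channel, input feature) is a hypothetical convention chosen for readability.

```python
def time_conv(x, kernel, bias, stride=2):
    """1D convolution along the time axis, with features as input channels.

    x:      list of T frames, each a list of D feature values.
    kernel: kernel[p][j][i] = weight for time offset p, output channel j, input feature i.
    bias:   one bias value per output channel.
    Returns (T - P) // stride + 1 output frames (no padding), P = filter size.
    """
    filter_size = len(kernel)
    out = []
    for t in range(0, len(x) - filter_size + 1, stride):
        frame = []
        for j in range(len(bias)):
            s = bias[j]
            for p in range(filter_size):
                for i, x_i in enumerate(x[t + p]):
                    s += kernel[p][j][i] * x_i
            frame.append(s)
        out.append(frame)
    return out
```

With stride 2 and no pooling, each layer roughly halves the sequence length, so deeper ladders see progressively coarser summaries of the trajectory.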

Figure 3: Visualization of latent traversal of $\beta$-VAE and FAVAE. (a) represents all data trajectories of 2D Reaching; (b) and (c) represent latent traversal in $\beta$-VAE; (d) and (e) represent latent traversal in FAVAE. Each latent variable is traversed and purple and/or blue points are generated. The color corresponds to the value of the traversed latent variable.
Figure 4: Visualization of the mutual information $I(z_j; v_k)$. Horizontal axis shows latent variable and vertical axis shows factor. It shows a case in which all information concentrates in the 4th latent variable in 2D Reaching.

5 Measuring Disentanglement

While latent traversals are useful for checking the success or failure of disentanglement, quantification of the disentanglement is required for reliably evaluating a learned model. Various disentanglement quantification methods have been reported [Eastwood and Williams2018, Chen et al.2018, Kim and Mnih2018, Higgins et al.2016b, Higgins et al.2016a], but there is no standard method. We use the mutual information gap (MIG) [Chen et al.2018] as the metric for disentanglement. The basic idea of MIG is measuring the mutual information between latent variables $z_j$ and a ground-truth factor $v_k$. Higher mutual information means that $z_j$ contains more information regarding $v_k$.

$$\mathrm{MIG} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{H(v_k)}\Big(I(z_{j^{(k)}}; v_k) - \max_{j \neq j^{(k)}} I(z_j; v_k)\Big),$$

where $j^{(k)} = \arg\max_j I(z_j; v_k)$, and $H(v_k)$ is entropy for normalization.

There is a problem with MIG when measuring with simple data. When it is possible to reconstruct with one latent variable, using a large $\beta$ gathers all factor information in one latent variable and MIG becomes large (Fig. 4). For example, when goal position, curved inward/outward, and degree of curvature cannot be disentangled to different latent variables in the 2D Reaching dataset, MIG can become large. In our experiments we avoided this problem by excluding the case in which all factor information concentrates in one latent variable.
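The MIG computation and the degenerate case described above can be sketched as follows. The sketch assumes the mutual information values and factor entropies have already been estimated; `mig` and `concentrated` are hypothetical helper names, and the concentration check is one simple way to flag the excluded case, not necessarily the criterion used in the experiments.

```python
def mig(mi, entropy):
    """Mutual information gap.

    mi[k][j]:   estimated mutual information between factor k and latent j
                (each row needs at least two latents).
    entropy[k]: entropy of factor k, used for normalization.
    """
    total = 0.0
    for row, h in zip(mi, entropy):
        top, second = sorted(row, reverse=True)[:2]
        total += (top - second) / h
    return total / len(mi)

def concentrated(mi):
    """True when a single latent has the highest MI for every factor."""
    winners = {max(range(len(row)), key=row.__getitem__) for row in mi}
    return len(winners) == 1
```

The second helper shows why the metric can mislead: if one latent wins for every factor, each per-factor gap can still be large, so the score is high even though nothing has been separated.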

6 Related Work

Several recently reported models [Hsu et al.2017, Li and Mandt2018] graphically disentangle static and dynamic factors in sequential data such as speech and video [Garofolo et al.1993, Pearce and Picone2002]. These models learn by building the time dependency of the prior of the latent variable. In particular, FHVAE [Hsu et al.2017] uses label data that distinguish time series for learning. Note that the label is not a dynamic factor but a label to distinguish between time series. In contrast, FAVAE performs disentanglement by using a loss function (see Eq. 8). The advantage of graphical models is that they can control the interpretable factors by controlling the prior’s time dependency. Since dynamic factors have the same time dependency, these models cannot disentangle dynamic factors. Since FAVAE has no time-dependency constraint of the prior, it can disentangle static and dynamic factors as well as disentangle sets of dynamic factors.

7 Experiments

We experimentally evaluated FAVAE using five sequential datasets: 2D Reaching with sequences 100 and 1000, 2D Wavy Reaching with sequences 100 and 1000, and Sprites dataset [Li and Mandt2018]. We used a batch size of and the Adam [Kingma and Ba2014] optimizer with a learning rate of .

7.1 2D Reaching

To determine the differences between FAVAE and $\beta$-VAE, we used a two-dimensional reaching dataset. Starting from point (0, 0), the point travels to goal position (-0.1, +1) or (+0.1, +1). There are ten possible trajectories to each goal; five are curved inward, and the other five are curved outward. The five trajectories in each group differ in their degree of curvature. The number of factor combinations was thus 20 (2 x 2 x 5). The trajectory lengths were 100 and 1000.
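The count of 20 factor combinations can be checked by enumerating the three factors directly. The labels below (`"inward"`, `"outward"`, curvature levels 0 to 4) are placeholder names for illustration; the actual curvature values of the dataset are not given in the text.

```python
from itertools import product

goals = [(-0.1, 1.0), (0.1, 1.0)]   # two goal positions
directions = ["inward", "outward"]  # curved inward or outward
curvature_levels = range(5)         # five degrees of curvature (placeholder labels)

combos = list(product(goals, directions, curvature_levels))
per_goal = sum(1 for goal, _, _ in combos if goal == goals[0])
```

Here `len(combos)` is 2 x 2 x 5 = 20 and `per_goal` is 10, matching the ten trajectories per goal described above.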

We compared the performances of $\beta$-VAE and FAVAE trained on the 2D Reaching dataset. Latent traversal transforms one dimension of latent variable $z$ into another value and reconstructs the output data from the traversed latent variables. The $\beta$-VAE, which is only able to learn from every point of a trajectory separately, encodes data points into latent variables that are parallel to the x and y axes (Figs. 3(b), 3(c)). In contrast, FAVAE learns from one entire trajectory at a time and can encode disentangled representations effectively, so that feasible trajectories are generated from traversed latent variables (Figs. 3(d), 3(e)).
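Latent traversal as described above amounts to a short loop: copy the latent vector, overwrite one dimension, and decode. In this sketch `toy_decode` is a hypothetical stand-in for a trained decoder, used only so the example runs on its own.

```python
def latent_traversal(decode, z, index, values):
    """Decode copies of z in which only dimension `index` is changed."""
    outputs = []
    for v in values:
        z_new = list(z)   # leave the other latent variables fixed
        z_new[index] = v
        outputs.append(decode(z_new))
    return outputs

def toy_decode(z):
    """Hypothetical decoder: a 5-step trajectory whose slope is set by z[0]."""
    return [z[0] * t / 4 for t in range(5)]

trajectories = latent_traversal(toy_decode, [0.0, 0.0], 0, [-1.0, 1.0])
```

In this toy setup, traversing dimension 0 changes the whole trajectory, while traversing dimension 1 leaves the output unchanged, which is how an unused latent variable shows up in a traversal plot.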

7.2 2D Wavy Reaching

| Model | 2D Reaching, length=100: MIG | Rec. loss | 2D Reaching, length=1000: MIG | Rec. loss | 2D Wavy Reaching, length=100: MIG | Rec. loss | 2D Wavy Reaching, length=1000: MIG | Rec. loss |
|---|---|---|---|---|---|---|---|---|
| FHVAE | 0.43(14) | 0.0013(23) | - | - | 0.22(8) | 0.043(61) | - | - |
| FAVAE (L) ($\beta = 0$) | 0.06(3) | 0.022(22) | 0.05(4) | 0.493(790) | 0.02(1) | 0.015(5) | 0.04(3) | 0.085(17) |
| FAVAE (- -) | 0.07(12) | 0.257(173) | 0.46(18) | 2.209(1869) | 0.66(15) | 0.041(8) | 0.47(18) | 11.881(24014) |
| FAVAE (- C) | 0.09(13) | 0.257(172) | 0.46(18) | 1.193(1274) | 0.67(16) | 0.042(21) | 0.31(10) | 5.937(18033) |
| FAVAE (L -) | 0.28(21) | 0.006(4) | 0.43(6) | 0.022(9) | 0.29(9) | 0.123(16) | 0.28(4) | 0.707(86) |
| FAVAE (L C) | 0.28(11) | 0.008(14) | 0.64(6) | 0.017(6) | 0.42(17) | 0.046(11) | 0.24(7) | 0.190(95) |

Table 1: Disentanglement scores (MIG and reconstruction loss) with standard deviations obtained by repeating the experiment 10 times for different models. Best results are shown in bold. (L) means with ladder and (C) means with information capacity (e.g., FAVAE (L -) means FAVAE with the ladder network and without information capacity).

To confirm the effect of disentanglement through the information bottleneck, we evaluated the validity of FAVAE under more complex factors by adding more factors to 2D Reaching. Five factors in total generate the data, compared with the three factors that generate the data in 2D Reaching. This modified dataset differs in that four out of the five factors affect only part of the trajectory: two affect the first half, and the other two affect the second half. This means that the model should be able to focus on a certain part of the whole trajectory and extract factors related to that part. A detailed explanation of these factors is given on GitHub, where the dataset is also available.

We show the training dataset of 2D Wavy Reaching and latent traversal in FAVAE (L C) with sequence length 1000 in Fig. 6. The latent traversal results for 2D Wavy Reaching are plotted in Figs. 6(f) to 6(j). Even though not all learned representations were perfectly disentangled, the visualization shows that all five generation factors were learned by five latent variables; the other latent variables did not learn any meaningful factors, indicating that the factors could be expressed as a combination of five "active" latent variables.

We compared various models on the basis of MIG to demonstrate the validity of FAVAE: a time-convolution AE in which the loss function consists only of the AE (reconstruction) term ($\beta = 0$), FAVAE with/without the ladder network (L) and information capacity (C), and, as the baseline, FHVAE [Hsu et al.2017], a recently proposed disentangled-representation-learning model. Note that FHVAE uses label information (this label for distinguishing time series is not a dynamic factor) to disentangle time-series data, which is a different setup from that of FAVAE. Table 1 shows a comparison of MIG scores and reconstruction losses, using FHVAE as the baseline, for 2D Reaching and 2D Wavy Reaching, each with sequence lengths of 100 and 1000. In 2D Reaching, the MIG of the baseline was large, while in 2D Wavy Reaching the MIG of FAVAE was large. This is because FHVAE uses goal-position information as a label when learning. Even when there were multiple dynamic factors, as in 2D Wavy Reaching, FAVAE exhibited good disentanglement performance (a large MIG and a small reconstruction loss).

When the ladder was added, the reconstruction loss was stable (especially at sequence length 1000). For example, for 2D Wavy Reaching with length = 1000 in Table 1, FAVAE without the ladder had a large MIG, but the spread of its reconstruction loss was very large.

Figure 5: MIG scores and reconstruction losses for different $\beta$. Blue line represents results with information capacity $C$ greater than zero; red line represents results with $C$ set to zero. Note that the x axis is plotted in log scale.

To confirm the effect of $C$, we evaluated the reconstruction losses and MIG scores for various $\beta$ using three ladder networks (Fig. 2) with a different $C$ for each ladder (Fig. 5). In one setting, $C$ was set to zero, meaning that $C$ was not used; in the other, $C$ was greater than zero and was adjusted on the basis of the KL divergence. When $C$ was not used, FAVAE could not reconstruct data when $\beta$ was high; thus, disentangled representation was not learned well when $\beta$ was high. When $C$ was used, the MIG score increased with $\beta$ while reconstruction loss was suppressed.

(a) factor 1
(b) factor 2
(c) factor 3
(d) factor 4
(e) factor 5
(f) factor 1
(g) factor 2
(h) factor 3
(i) factor 4
(j) factor 5
Figure 6: Visualization of training data ((a) to (e)) and latent traversal ((f) to (j)) for 2D Wavy Reaching. The vertical and horizontal axes represent coordinates. Factors 1, 2, 3, 4, and 5 respectively correspond to "Goal position", "1st trajectory shape", "2nd trajectory shape", "1st trajectory degree of curvature", and "2nd trajectory degree of curvature". Each plot was decoded by traversing one latent variable; different colors represent trajectories generated from different values of the same latent variable, $z$.
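| Dataset | Factor | 1st | 2nd | 3rd |
|---|---|---|---|---|
| 2D Reaching | factor 1 | 1 | 1 | 8 |
| 2D Reaching | factor 2 | 10 | 0 | 0 |
| 2D Reaching | factor 3 | 10 | 0 | 0 |
| 2D Wavy Reaching | factor 1 | 3 | 0 | 7 |
| 2D Wavy Reaching | factor 2 | 8 | 0 | 2 |
| 2D Wavy Reaching | factor 3 | 8 | 0 | 2 |
| 2D Wavy Reaching | factor 4 | 9 | 1 | 0 |
| 2D Wavy Reaching | factor 5 | 9 | 0 | 1 |

Table 2: For each factor, the number of runs in which the latent variable with the highest mutual information belongs to each ladder (1st, 2nd, 3rd). The same operation was performed ten times and the counts are shown. Details of the factors are given on GitHub.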

We expect the ladder network to disentangle representations at different levels of abstraction. In this section, we evaluate the factors extracted in each ladder by using 2D Reaching and 2D Wavy Reaching. Table 2 counts, for each factor, which ladder contains the latent variable with the highest mutual information. In Table 2, the rows represent factors and the columns represent the index of the ladder networks. Factor 1 (goal left/goal right) in 2D Reaching and factor 1 (goal position) in 2D Wavy Reaching were extracted most frequently by a latent variable in the 3rd ladder. Since the latent variables have eight dimensions for the 1st ladder, four dimensions for the 2nd ladder, and two dimensions for the 3rd ladder, the 3rd ladder should be chosen least frequently if factors were assigned to the latent variables at random. The difference between factors of long and short time dependency is clear in 2D Wavy Reaching: the "goal position" factor affects the entire trajectory, while each of the other factors affects half the trajectory (Fig. 6). In these experiments, the goal position, which affects the entire trajectory, tended to be expressed in the 3rd ladder. In both datasets, only factor 1 represents the goal position, while the others represent the shape of the trajectories. Since factor 1 has a different abstraction level from the others, it tends to be captured in a different ladder (the 3rd) from the others.
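The counting behind Table 2 and the chance-level argument above can be sketched as follows. The sketch assumes the latent variables of the three ladders are concatenated in order (8, then 4, then 2 dimensions) into one flat index; that ordering and the helper names are hypothetical.

```python
LADDER_DIMS = [8, 4, 2]  # latent dimensions of the 1st, 2nd, and 3rd ladders

def ladder_of(latent_index, dims=LADDER_DIMS):
    """Map a flat latent index to its ladder number (1-based).

    Assumes the ladders are concatenated in order, which is an illustrative choice.
    """
    upper = 0
    for ladder, dim in enumerate(dims, start=1):
        upper += dim
        if latent_index < upper:
            return ladder
    raise IndexError("latent index out of range")

def count_best_ladder(mi_per_run, dims=LADDER_DIMS):
    """For one factor, count over runs which ladder holds the highest-MI latent.

    mi_per_run: one list per run, holding the MI of every latent with the factor.
    """
    counts = [0] * len(dims)
    for row in mi_per_run:
        best = max(range(len(row)), key=row.__getitem__)
        counts[ladder_of(best, dims) - 1] += 1
    return counts

# Probability that a uniformly random latent falls in each ladder.
chance = [d / sum(LADDER_DIMS) for d in LADDER_DIMS]
```

Under uniform random assignment the 3rd ladder would be picked with probability 2/14, about 0.14, so seeing it win in 7 or 8 out of 10 runs for the goal-position factor is far from what chance alone would suggest.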

7.3 Sprites dataset

(a) pant color
(b) hair color
(c) skin color
(d) shirt color
(e) motion
(f) direction of character
Figure 7: Visualization of latent traversal of FAVAE. Horizontal axis represents sequence and vertical axis represents differences in $z$.

To evaluate the effectiveness of FAVAE on a video dataset, we trained FAVAE with the Sprites dataset, which was used in [Li and Mandt2018]. This dataset contains RGB video sequences and consists of static and dynamic factors. Note that motions are not created by combining dynamic factors; each motion exists individually (the dataset is explained in detail on GitHub). We executed disentangled representation learning by using FAVAE; the network architecture used for this training is explained on GitHub. Figure 7 shows the results of latent traversal, for which we use two values of each latent variable. The latent variables in the 1st ladder extract expressions of motion (4th in 1st ladder), pant color (5th in 1st ladder), direction of character (6th in 1st ladder), and shirt color (7th in 1st ladder). The latent variables in the 2nd ladder extract expressions of hair color (1st in 2nd ladder) and skin color (2nd in 2nd ladder). FAVAE can extract disentangled representations of both static and dynamic factors in high-dimensional datasets.

8 Summary and Future Work

FAVAE learns disentangled and interpretable representations via the information bottleneck from sequential data. The experiments using three sequential datasets demonstrated that it can learn disentangled representations. Future work includes extending the time convolution part to a sequence-to-sequence model [Sutskever et al.2014] and applying the model to actions of reinforcement learning to reduce the pattern of actions.