Disentanglement Challenge: From Regularization to Reconstruction

by Jie Qiao, et al.
Carnegie Mellon University

The challenge of learning disentangled representations has recently attracted much attention and has culminated in a competition using a new real-world disentanglement dataset (Gondal et al., 2019). Various methods based on the variational auto-encoder have been proposed to solve this problem by enforcing independence within the representation through a modified regularization term in the variational lower bound. However, recent work by Locatello et al. (2018) has demonstrated that the proposed methods are heavily influenced by randomness and the choice of hyper-parameters. In this work, instead of designing a new regularization term, we adopt FactorVAE but improve the reconstruction performance by increasing the capacity of the network and the number of training steps. This strategy turns out to be very effective and achieved 1st place in the challenge.




1 Introduction

The success of unsupervised learning depends heavily on how real-world features are represented. It is widely believed that real-world data is generated by a few explanatory factors which are distributed, invariant, and disentangled (bengio2013representation). The challenge of learning disentangled representations has culminated in a competition that uses a new real-world disentanglement dataset (gondal2019transfer) to build the best disentangled model.

The key idea in disentangled representation learning is that a perfect representation should be a one-to-one mapping to the ground-truth disentangled factors. Thus, if one factor changes while the other factors are held fixed, the representations of the fixed factors should stay fixed, and only the representation of the changed factor should vary. As a result, it is essential to find representations that (i) are independent of each other, and (ii) align with the ground-truth factors.

A recent line of work in disentangled representation learning commonly focuses on enforcing independence of the representation by modifying the regularization term in the variational lower bound of Variational Autoencoders (VAE) (kingma2013auto), including β-VAE (higgins2017beta), AnnealedVAE (burgess2018understanding), β-TCVAE (chen2018isolating), DIP-VAE (kumar2018variational) and FactorVAE (kim2018disentangling). See Appendix A for more details of these models.

To evaluate the performance of disentanglement, several metrics have been proposed, including the FactorVAE metric (kim2018disentangling), Mutual Information Gap (MIG) (chen2018isolating), DCI metric (eastwood2018framework), IRS metric (suter2019robustly), and SAP score (kumar2018variational).

Figure 1: The architecture of FactorVAE, in which FC denotes a fully connected layer.

However, one of our findings is that these methods are heavily influenced by randomness and the choice of hyper-parameters. This phenomenon was also observed by locatello2018challenging. Therefore, rather than designing a new regularization term, we simply use FactorVAE and focus on improving its reconstruction performance. We believe that the better the reconstruction, the better the alignment with the ground-truth factors; consequently, the larger the capacity of the encoder and decoder networks, the better the result should be. Furthermore, after increasing the capacity, we also increase the number of training steps, which yields a further significant improvement in the evaluation metrics. The final architecture of FactorVAE is given in Figure 1. Note that this report contains the results from both stage 1 and stage 2 of the competition.
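FactorVAE (kim2018disentangling) penalizes the total correlation of the latent code, which is commonly estimated with a discriminator trained to tell samples of the aggregate posterior from samples of the product of its marginals. Samples of the latter are typically obtained by shuffling each latent dimension independently across a batch. The following is a minimal sketch of that shuffling step in plain Python; the function name `permute_dims` and the list-of-lists batch layout are illustrative assumptions, not the code used in the competition.

```python
import random

def permute_dims(batch, rng=None):
    """Shuffle every latent dimension independently across the batch.

    batch: list of latent codes, each a list of floats of equal length
           (hypothetical layout chosen for this sketch).
    Returns a new batch whose columns are independent permutations
    of the input columns, which breaks dependence between dimensions.
    """
    rng = rng or random.Random()
    if not batch:
        return []
    n_dims = len(batch[0])
    columns = []
    for d in range(n_dims):
        column = [z[d] for z in batch]
        rng.shuffle(column)  # in-place permutation of one dimension
        columns.append(column)
    # Reassemble rows from the independently shuffled columns.
    return [[columns[d][i] for d in range(n_dims)] for i in range(len(batch))]

# Usage: each column keeps its values, but rows are no longer aligned.
codes = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]]
shuffled = permute_dims(codes, random.Random(0))
```

Each dimension keeps its marginal distribution, so a discriminator that separates `codes` from `shuffled` can only do so by exploiting dependence between dimensions.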

Overall, our contributions can be summarized as follows: (1) we find that reconstruction performance is also essential for learning disentangled representations, and (2) we achieve state-of-the-art performance in the competition.

2 Experiments Design

In this section, we explore the effectiveness of different disentanglement learning models and the role of reconstruction performance in disentanglement learning. First, we evaluate different kinds of variational autoencoders, including BottleneckVAE, AnnealedVAE, DIPVAE, BetaTCVAE, and BetaVAE, with a fixed number of training steps. Second, we want to know whether capacity plays an important role in disentanglement. The hypothesis is that the larger the capacity, the better the reconstruction that can be obtained, which in turn reinforces the disentanglement. Specifically, we control the number of latent variables.

3 Experiments Results

In this section, we present our experimental results in stage 1 and stage 2 of the competition. We first present the performance of different kinds of VAEs in stage 1, given in Table 1. It shows that FactorVAE achieves the best result when the number of training steps is 30000. In the following experiments, we therefore choose FactorVAE as the base model.

VAE variation    FactorVAE metric   SAP score   DCI      IRS      MIG
BottleneckVAE    0.453              0.0395      0.107    0.547    0.0589
AnnealedVAE      0.3586             0.0069      0.1153   0.5122   0.0237
DIPVAE           0.265              0.005       0.021    0.265    0.490
BetaTCVAE        0.342              0.026       0.093    0.342    0.981
BetaVAE          0.3586             0.0069      0.1153   0.5122   0.0237
Table 1: Variational autoencoders trained for 30000 steps in Stage 1.

Furthermore, we find that (i) the activation function at each layer and (ii) the number of latent variables both matter for the disentanglement performance. Therefore, Leaky ReLU and a latent size of 256 are selected in stage 1. Then, as shown in Table 2, we increase the number of training steps and find that the best result is achieved at 1000k training steps. The experiments in this part may not be sufficient, but they still suggest that the larger the capacity, the better the disentanglement performance. Since we increase the capacity of the model, it is reasonable to also increase the number of training steps. In stage 2, as shown in Table 3, we use a sufficiently large number of training steps and investigate the effect of the number of latent variables. This experiment suggests that the FactorVAE and DCI metrics tend to increase as the number of latent variables grows, while the other metrics decrease. The best result in the ranking is marked in bold, which suggests that an appropriate number of latent variables should be chosen.

training step FactorVAE sap score dci irs mig
FactorVAE (30k) 0.449
FactorVAE (500k) 0.432
FactorVAE (1000k)
Table 2: FactorVAE with different numbers of training steps in Stage 1.

Num. of latent   FactorVAE metric   SAP score   DCI      IRS      MIG
256              0.4458             0.1748      0.5785   0.5553   0.4166
512              0.5018             0.1545      0.5457   0.6824   0.3739
768              0.4526             0.0942      0.5154   0.4638   0.2793
1024             0.4708             0.0955      0.542    0.4867   0.2781
1536             0.5202             0.008       0.5932   0.5071   0.0058
2048             0.5374             0.0025      0.6351   0.5218   0.0062
3072             0.5426             0.0053      0.6677   0.5067   0.0117
Table 3: Sensitivity analysis for the number of latent variables in Stage 2.
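
One way to make the trend in Table 3 concrete is to compute a rank correlation between the number of latent variables and each metric. The sketch below does this with Spearman's coefficient in plain Python; the choice of Spearman's coefficient and the helper names `ranks` and `spearman` are our own illustrative assumptions, and the numbers are transcribed from Table 3.

```python
# Values transcribed from Table 3 (Stage 2).
latent_sizes = [256, 512, 768, 1024, 1536, 2048, 3072]
metrics = {
    "FactorVAE": [0.4458, 0.5018, 0.4526, 0.4708, 0.5202, 0.5374, 0.5426],
    "SAP":       [0.1748, 0.1545, 0.0942, 0.0955, 0.008, 0.0025, 0.0053],
    "DCI":       [0.5785, 0.5457, 0.5154, 0.542, 0.5932, 0.6351, 0.6677],
    "IRS":       [0.5553, 0.6824, 0.4638, 0.4867, 0.5071, 0.5218, 0.5067],
    "MIG":       [0.4166, 0.3739, 0.2793, 0.2781, 0.0058, 0.0062, 0.0117],
}

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

trend = {name: spearman(latent_sizes, vals) for name, vals in metrics.items()}
for name, rho in trend.items():
    print(f"{name:10s} {rho:+.3f}")
```

On these values the coefficient is positive for the FactorVAE and DCI metrics and negative for SAP, IRS, and MIG, which matches the qualitative reading of the table. With only seven settings and a single run per setting, this is a descriptive summary rather than a statistical test.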

4 Conclusion

In this work, we conducted an empirical study on disentangled representation learning. We first ran several experiments with different disentanglement learning methods and selected FactorVAE as the base model; we then improved the reconstruction performance by increasing the capacity of the model and the number of training steps. Our results are competitive, achieving 1st place in the challenge.


Appendix A Related works

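In this section, we summarize the state-of-the-art unsupervised disentanglement learning methods. Most of these works build on the Variational Auto-encoder (VAE) (kingma2013auto), a generative model that maximizes the following evidence lower bound, approximating the intractable distribution $p_\theta(z|x)$ using $q_\phi(z|x)$,

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left(q_\phi(z|x) \,\|\, p(z)\right),$$

where $q_\phi(z|x)$ denotes the encoder with parameter $\phi$ and $p_\theta(x|z)$ denotes the decoder with parameter $\theta$.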
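
As a rough illustration of how an evidence lower bound of this kind is evaluated for one data point, the sketch below computes a reconstruction log-likelihood minus a KL term. It assumes a diagonal Gaussian encoder, a standard normal prior, a Bernoulli decoder, and a single-sample estimate of the expectation; none of these choices is stated in this report, and the function names are hypothetical.

```python
import math

def gaussian_kl(mu, logvar):
    """Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x|z) for values in [0, 1] under a Bernoulli decoder.

    x_hat holds the decoder's predicted means (assumed output format).
    """
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clip to avoid log(0)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total

def elbo(x, x_hat, mu, logvar):
    """Single-sample ELBO estimate: reconstruction term minus KL term."""
    return bernoulli_log_likelihood(x, x_hat) - gaussian_kl(mu, logvar)

# Usage with toy numbers (not taken from the experiments).
x = [1.0, 0.0, 1.0]
x_hat = [0.9, 0.2, 0.7]
value = elbo(x, x_hat, mu=[0.5, -0.5], logvar=[0.0, 0.0])
```

The KL term is zero when the encoder output equals the prior and grows as it moves away, which is the quantity that the regularization-based VAE variants reweight or decompose.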

As shown in Table 4, the lower bounds of all the VAE variants can be described as a reconstruction term combined with a regularization term, where the regularization terms and the hyper-parameters are given in the table.

Model Regularization Hyper-Parameters
Table 4: Summary of variant unsupervised disentanglement learning methods