## 1 Introduction

The success of unsupervised learning depends heavily on how real-world features are represented. It is widely believed that real-world data is generated by a few explanatory factors that are distributed, invariant, and disentangled (bengio2013representation). The challenge of learning disentangled representations has been turned into a competition that uses a new real-world disentanglement dataset (gondal2019transfer), with the goal of building the best disentangled model.

The key idea behind disentangled representation learning is that a perfect representation should map one-to-one onto the ground-truth disentangled factors. Thus, if one factor changes while the others stay fixed, the representations of the fixed factors should stay fixed as well, and only the representation of the changed factor should change. As a result, it is essential to find representations that (i) are independent of each other, and (ii) align with the ground-truth factors.

Recent work on disentangled representation learning commonly focuses on enforcing independence of the representation by modifying the regularization term in the variational lower bound of Variational Autoencoders (VAE) (kingma2013auto), including β-VAE (higgins2017beta), AnnealedVAE (burgess2018understanding), β-TCVAE (chen2018isolating), DIP-VAE (kumar2018variational), and FactorVAE (kim2018disentangling). See Appendix A for more details on these models. To evaluate disentanglement performance, several metrics have been proposed, including the FactorVAE metric (kim2018disentangling), Mutual Information Gap (MIG) (chen2018isolating), the DCI metric (eastwood2018framework), the IRS metric (suter2019robustly), and the SAP score (kumar2018variational).
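To make one of these metrics concrete, the following is a minimal sketch of the Mutual Information Gap for codes and factors that have already been discretized. It follows the usual definition from chen2018isolating: for each factor, take the gap between the two latent dimensions with the highest mutual information and normalize it by the factor's entropy. The function names are hypothetical, and the competition's evaluation code may differ in details such as how continuous latents are binned.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def mig(latents, factors):
    """Mutual Information Gap, averaged over ground-truth factors.

    latents: list of latent dimensions, each a list of discrete codes.
    factors: list of ground-truth factors, each a list of discrete values.
    Assumes at least two latent dimensions and that every factor varies
    (non-zero entropy).
    """
    gaps = []
    for v in factors:
        mi = sorted((mutual_information(z, v) for z in latents), reverse=True)
        gaps.append((mi[0] - mi[1]) / entropy(v))
    return sum(gaps) / len(gaps)


# Two binary factors, observed over all four combinations.
v1 = [0, 0, 1, 1]
v2 = [0, 1, 0, 1]

disentangled_score = mig([v1, v2], [v1, v2])          # one latent per factor
entangled = [0, 1, 2, 3]                              # each latent encodes both factors
entangled_score = mig([entangled, entangled], [v1, v2])
```

In this toy setting the disentangled code scores 1 (up to floating-point error), while the code in which every latent dimension encodes both factors scores 0.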

However, one of our findings is that these methods are heavily influenced by randomness and by the choice of hyper-parameters. This phenomenon was also observed by locatello2018challenging. Therefore, rather than designing a new regularization term, we simply use FactorVAE while improving its reconstruction performance. We believe that the better the reconstruction, the better the alignment with the ground-truth factors, and hence that increasing the capacity of the encoder and decoder networks should improve the result. Furthermore, after increasing the capacity, we also increase the number of training steps, which yields a significant improvement in the evaluation metrics. The final architecture of FactorVAE is given in Figure 1. Note that this report contains the results from both stage 1 and stage 2 of the competition.

Overall, our contributions can be summarized as follows: (1) we find that reconstruction performance is also essential for learning disentangled representations, and (2) we achieve state-of-the-art performance in the competition.

## 2 Experiment Design

In this section, we explore the effectiveness of different disentanglement learning models and the role of reconstruction performance in disentanglement learning. First, we train different kinds of variational autoencoders, including BottleneckVAE, AnnealedVAE, DIPVAE, BetaTCVAE, and BetaVAE, for a fixed number of training steps. Second, we examine whether capacity plays an important role in disentanglement. Our hypothesis is that a larger capacity yields a better reconstruction, which in turn reinforces disentanglement. Specifically, we vary the number of latent variables.
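The experimental design can be written down as three small sweeps. The sketch below is only illustrative: `RunConfig` and the sweep functions are hypothetical names, the model names, step counts, and latent sizes are the ones reported in the results tables, and any setting the report does not state (such as the latent size used for the model comparison) is left as a required argument rather than guessed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunConfig:
    """One training run (hypothetical container, not the authors' code)."""
    model: str
    num_latent: int
    training_steps: int


# Model names as listed in the stage 1 comparison.
STAGE1_MODELS = ["BottleneckVAE", "AnnealedVAE", "DIPVAE",
                 "BetaTCVAE", "BetaVAE", "FactorVAE"]


def model_sweep(num_latent: int, training_steps: int) -> List[RunConfig]:
    """Compare VAE variants under an identical latent size and step budget."""
    return [RunConfig(m, num_latent, training_steps) for m in STAGE1_MODELS]


def step_sweep(num_latent: int = 256) -> List[RunConfig]:
    """Vary the number of training steps for FactorVAE (30k, 500k, 1000k)."""
    return [RunConfig("FactorVAE", num_latent, s)
            for s in (30_000, 500_000, 1_000_000)]


def latent_sweep(training_steps: int) -> List[RunConfig]:
    """Vary the number of latent variables for FactorVAE."""
    sizes = (256, 512, 768, 1024, 1536, 2048, 3072)
    return [RunConfig("FactorVAE", n, training_steps) for n in sizes]
```

Keeping each sweep to a single varying quantity makes it easier to attribute a change in a metric to capacity or to training length rather than to a mix of both.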

## 3 Experiment Results

In this section, we present our experimental results from stage 1 and stage 2 of the competition. We first present the performance of different kinds of VAEs in stage 1, given in Table 1. FactorVAE achieves the best result when the number of training steps is 30000. In the following experiments, we therefore use FactorVAE as the base model.

VAE variant | FactorVAE metric | SAP score | DCI | IRS | MIG |
---|---|---|---|---|---|
BottleneckVAE | 0.453 | 0.0395 | 0.107 | 0.547 | 0.0589 |
AnnealedVAE | 0.3586 | 0.0069 | 0.1153 | 0.5122 | 0.0237 |
DIPVAE | 0.265 | 0.005 | 0.021 | 0.265 | 0.490 |
BetaTCVAE | 0.342 | 0.026 | 0.093 | 0.342 | 0.981 |
BetaVAE | 0.3586 | 0.0069 | 0.1153 | 0.5122 | 0.0237 |
FactorVAE | | | | | |

Furthermore, we find that both (i) the activation function at each layer and (ii) the number of latent variables matter for disentanglement performance. Therefore, Leaky ReLU and a latent size of 256 are selected in stage 1. Then, as shown in Table 2, we increase the number of training steps and find that the best result is achieved at 1000k training steps. The experiments in this part may not be sufficient, but they still suggest that the larger the capacity, the better the disentanglement performance. Since we increase the capacity of the model, it is reasonable to increase the number of training steps at the same time.

In stage 2, as shown in Table 3, we investigate the effect of the number of latent variables, using a sufficiently large number of training steps. This experiment suggests that the FactorVAE metric and the DCI metric improve as the number of latent variables increases, while the other metrics decrease. The best result in the ranking is marked in bold, which suggests that an appropriate number of latent variables should be chosen.

Training steps | FactorVAE metric | SAP score | DCI | IRS | MIG |
---|---|---|---|---|---|
FactorVAE (30k) | 0.449 | | | | |
FactorVAE (500k) | 0.432 | | | | |
FactorVAE (1000k) | | | | | |

Num. of latent variables | FactorVAE metric | SAP score | DCI | IRS | MIG |
---|---|---|---|---|---|
256 | 0.4458 | 0.1748 | 0.5785 | 0.5553 | 0.4166 |
512 | 0.5018 | 0.1545 | 0.5457 | 0.6824 | 0.3739 |
768 | 0.4526 | 0.0942 | 0.5154 | 0.4638 | 0.2793 |
1024 | 0.4708 | 0.0955 | 0.542 | 0.4867 | 0.2781 |
1536 | 0.5202 | 0.008 | 0.5932 | 0.5071 | 0.0058 |
2048 | 0.5374 | 0.0025 | 0.6351 | 0.5218 | 0.0062 |
3072 | 0.5426 | 0.0053 | 0.6677 | 0.5067 | 0.0117 |
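The trade-off across latent sizes can be checked directly from the Table 3 values. The short script below is a sketch that copies those numbers verbatim and reports, for each metric, the latent size with the highest score and the change between the smallest and largest model; the variable and function names are illustrative only.

```python
# Values copied from Table 3 (one list entry per latent size).
latent_sizes = [256, 512, 768, 1024, 1536, 2048, 3072]
table3 = {
    "FactorVAE metric": [0.4458, 0.5018, 0.4526, 0.4708, 0.5202, 0.5374, 0.5426],
    "SAP score":        [0.1748, 0.1545, 0.0942, 0.0955, 0.008, 0.0025, 0.0053],
    "DCI":              [0.5785, 0.5457, 0.5154, 0.542, 0.5932, 0.6351, 0.6677],
    "IRS":              [0.5553, 0.6824, 0.4638, 0.4867, 0.5071, 0.5218, 0.5067],
    "MIG":              [0.4166, 0.3739, 0.2793, 0.2781, 0.0058, 0.0062, 0.0117],
}


def best_latent_size(scores):
    """Latent size with the highest score (higher is better for all metrics)."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return latent_sizes[best_index]


# Best latent size per metric.
best = {name: best_latent_size(scores) for name, scores in table3.items()}

# Change from the smallest (256) to the largest (3072) latent size.
delta = {name: scores[-1] - scores[0] for name, scores in table3.items()}

for name in table3:
    print(f"{name:17s} best at {best[name]:4d} latents, "
          f"change 256 -> 3072: {delta[name]:+.4f}")
```

On these values, the FactorVAE metric and DCI peak at 3072 latent variables, the SAP score and MIG peak at 256, and IRS peaks at 512, which is consistent with the observation that no single latent size is best for every metric.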

## 4 Conclusion

In this work, we conducted an empirical study of disentanglement learning. We first ran several experiments with different disentanglement learning methods and selected FactorVAE as the base model; we then improved reconstruction performance by increasing the capacity of the model and the number of training steps. Our results appear to be competitive.

## References

## Appendix A Related works

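In this section, we summarize state-of-the-art unsupervised disentanglement learning methods. Most of these works build on the Variational Autoencoder (VAE) (kingma2013auto), a generative model that maximizes the following evidence lower bound, approximating the intractable posterior $p_\theta(z \mid x)$ with $q_\phi(z \mid x)$:

$$
\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right) \tag{1}
$$

where $q_\phi(z \mid x)$ denotes the encoder with parameters $\phi$ and $p_\theta(x \mid z)$ denotes the decoder with parameters $\theta$.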
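As a worked illustration of the evidence lower bound, the sketch below evaluates a single-sample estimate of it in plain Python. It assumes a diagonal Gaussian encoder with a standard normal prior (so the KL term has a closed form) and a Bernoulli decoder over values in [0, 1]; these modelling choices, the function names, and the optional `beta` weight are assumptions made for illustration, not details taken from this report.

```python
import math


def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dims."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def bernoulli_log_likelihood(x, x_hat, eps=1e-7):
    """log p(x | z) under a Bernoulli decoder with predicted means x_hat."""
    total = 0.0
    for xi, pi in zip(x, x_hat):
        pi = min(max(pi, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += xi * math.log(pi) + (1.0 - xi) * math.log(1.0 - pi)
    return total


def elbo(x, x_hat, mu, log_var, beta=1.0):
    """Single-sample ELBO estimate: reconstruction term minus weighted KL term.

    x_hat is the decoder output for one latent sample z ~ q(z | x).
    beta = 1.0 gives the standard VAE bound; beta > 1 weights the KL term
    more heavily, as in beta-VAE.
    """
    return bernoulli_log_likelihood(x, x_hat) - beta * gaussian_kl(mu, log_var)


# Encoder output equal to the prior: the KL term vanishes.
kl_at_prior = gaussian_kl([0.0, 0.0], [0.0, 0.0])

# A mean shifted by 1 in a single unit-variance dimension costs 0.5 nats.
kl_shifted = gaussian_kl([1.0], [0.0])

# Uninformative decoder (all outputs 0.5) on a two-pixel input.
value = elbo(x=[1.0, 0.0], x_hat=[0.5, 0.5], mu=[0.0], log_var=[0.0])
```

The regularized variants discussed in this appendix keep the reconstruction term and change only how the second term is defined or weighted.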

As shown in Table 4, the lower bounds of all VAE variants can be described as a reconstruction term combined with a regularization term, where the regularization terms and the hyper-parameters are given in the table.

Model | Regularization | Hyper-Parameters |
---|---|---|
β-VAE | | |
AnnealedVAE | | |
FactorVAE | | |
DIP-VAE-I | | |
DIP-VAE-II | | |