## 1 Introduction

Deep generative models represent data $x$ using a low-dimensional variable $z$ (sometimes referred to as a code). The relationship between $x$ and $z$ is described by a conditional probability distribution $p_\theta(x \mid z)$ parameterized by a deep neural network. There have been many recent successes in training deep generative models for complex data types such as images (Gatys et al., 2015; Gulrajani et al., 2017), audio (Oord et al., 2016), and language (Bowman et al., 2016). The latent code $z$ can also serve as a compressed representation for downstream tasks such as text classification (Xu et al., 2017), Bayesian optimization (Gómez-Bombarelli et al., 2018; Kusner et al., 2017), and lossy image compression (Theis et al., 2017). The setting in which an approximate posterior distribution $q_\phi(z \mid x)$ is simultaneously learnt with the generative model via optimization of the evidence lower bound (ELBO) is known as a variational autoencoder (VAE), where $q_\phi(z \mid x)$ and $p_\theta(x \mid z)$ represent probabilistic encoders and decoders respectively. In contrast to VAEs, inference and generative models can also be learnt jointly in an adversarial setting (Makhzani et al., 2015; Dumoulin et al., 2016; Donahue et al., 2016).

Figure 1: *ELBO decomposition*. The VAE objective can be defined in terms of the KL divergence between a generative model $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ and an inference model $q_\phi(z, x) = q_\phi(z \mid x)\, q(x)$. We can decompose this objective into 4 terms,

$$
-\mathrm{KL}\big(q_\phi(z, x) \,\|\, p_\theta(x, z)\big)
= \underbrace{\mathbb{E}_{q_\phi(z, x)}\!\left[\log \frac{p_\theta(x \mid z)}{p_\theta(x)}\right]}_{\text{①}}
\;\underbrace{-\, \mathbb{E}_{q_\phi(z, x)}\!\left[\log \frac{q_\phi(z \mid x)}{q_\phi(z)}\right]}_{\text{②}}
\;\underbrace{-\, \mathrm{KL}\big(q(x) \,\|\, p_\theta(x)\big)}_{\text{③}}
\;\underbrace{-\, \mathrm{KL}\big(q_\phi(z) \,\|\, p(z)\big)}_{\text{④}}
$$

Term ①, which can be intuitively thought of as the uniqueness of the reconstruction, is regularized by the mutual information in ②, which represents the uniqueness of the encoding. Minimizing the KL in term ③ is equivalent to maximizing the marginal likelihood $\mathbb{E}_{q(x)}[\log p_\theta(x)]$. Combined maximization of ① + ③ is equivalent to maximizing $\mathbb{E}_{q_\phi(z, x)}[\log p_\theta(x \mid z)]$. Term ④ matches the inference marginal $q_\phi(z)$ to the prior $p(z)$, which in turn ensures realistic samples from the generative model.

While deep generative models often provide high-fidelity reconstructions, the representation is generally not directly amenable to human interpretation.
In contrast to classical methods such as principal components or factor analysis, individual dimensions of $z$ do not necessarily encode any particular semantically meaningful variation in $x$.
This has motivated a search for ways of learning *disentangled* representations, where perturbations of an individual dimension of the latent code $z$ perturb the corresponding $x$ in an interpretable manner.
Various strategies for weak supervision have been employed, including semi-supervision of latent variables (Kingma et al., 2014; Siddharth et al., 2017), triplet supervision (Karaletsos et al., 2015; Veit et al., 2016), or batch-level factor invariances (Kulkarni et al., 2015; Bouchacourt et al., 2017).
There has also been a concerted effort to develop fully unsupervised approaches that modify the VAE objective to induce disentangled representations. A well-known example is the $\beta$-VAE (Higgins et al., 2016).
This has prompted a number of approaches that modify the VAE objective by adding, removing, or altering the weight of individual terms (Kumar et al., 2017; Zhao et al., 2017; Gao et al., 2018; Achille and Soatto, 2018).

In this paper, we introduce hierarchically factorized VAEs (HFVAEs). The HFVAE objective is based on a two-level hierarchical decomposition of the VAE objective, which allows us to control the relative levels of statistical independence between groups of variables and for individual variables in the same group. At each level, we induce statistical independence by minimizing the *total correlation* (TC), a generalization of the mutual information to more than two variables. A number of related approaches have also considered the TC (Kim and Mnih, 2018; Chen et al., 2018; Gao et al., 2018), but do not employ the two-level decomposition that we consider here. In our derivation, we reinterpret the standard VAE objective as a KL divergence between a generative model and its corresponding inference model. This has the side benefit that it provides a unified perspective on trade-offs in modifications of the VAE objective.

We illustrate the power of this decomposition by disentangling discrete factors of variation from continuous variables, which remains problematic for many existing approaches. We evaluate our methodology on a variety of datasets including dSprites, MNIST, Fashion MNIST (F-MNIST), CelebA and 20NewsGroups. Inspection of the learned representations confirms that our objective uncovers interpretable features in an unsupervised setting, and quantitative metrics demonstrate improvement over related methods. Crucially, we show that the learned representations can recover combinations of latent features that were not present in any examples in the training set, which has long been an implicit goal in learning disentangled representations that is now considered explicitly.

## 2 A Unified View of Generalized VAE Objectives

Variational autoencoders jointly optimize two models. The generative model $p_\theta(x, z)$ defines a distribution on a set of latent variables $z$ and observed data $x$ in terms of a prior $p(z)$ and a likelihood $p_\theta(x \mid z)$, which is often referred to as the *decoder* model. This distribution is estimated in tandem with an *encoder*, a conditional distribution $q_\phi(z \mid x)$ that performs approximate inference in this model. The encoder and decoder together define a probabilistic autoencoder.

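The VAE objective is traditionally defined as a sum over data points of the per-datapoint ELBO, or alternatively as an expectation over an empirical distribution $q(x)$ that approximates an unknown data distribution with a finite set of $N$ data points,

$$
\mathcal{L}(\theta, \phi) := \mathbb{E}_{q(x)}\!\left[\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]\right],
\qquad
q(x) := \frac{1}{N} \sum_{n=1}^{N} \delta_{x_n}(x).
\tag{1}
$$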

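To better understand the various modifications of the VAE objective, which have often been introduced in an ad hoc manner, we here consider an alternate but equivalent definition of the VAE objective as a KL divergence between the generative model $p_\theta(x, z)$ and the inference model $q_\phi(z, x) = q_\phi(z \mid x)\, q(x)$,

$$
-\mathrm{KL}\big(q_\phi(z, x) \,\|\, p_\theta(x, z)\big)
= \mathbb{E}_{q_\phi(z, x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z, x)}\right].
\tag{2}
$$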

This definition differs from the expression in Equation (1) only by a constant term $\log N$, which is the entropy of the empirical data distribution $q(x)$. The advantage of this interpretation as a KL divergence is that it becomes more apparent what it means to optimize the objective with respect to the generative model parameters $\theta$ and the inference model parameters $\phi$. In particular, it is clear that the KL is minimized when $p_\theta(x, z) = q_\phi(z, x)$, which in turn implies that the marginal distributions on data, $p_\theta(x) = q(x)$, and on the latent code, $p(z) = q_\phi(z)$, must also match. We will refer to $q_\phi(z)$ as the *inference marginal*, which is the average over the data of the encoder distribution, $q_\phi(z) = \frac{1}{N} \sum_{n=1}^{N} q_\phi(z \mid x_n)$.
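The constant offset between Equations (1) and (2) can be checked numerically. The following is a minimal sketch on a toy model with discrete $x$ and $z$; the sizes `N`, `K`, `M`, the helper `random_rows`, and the random probability tables are all assumptions made for illustration and are not part of the model described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, M = 5, 3, 8  # data points, latent states, size of the observation space


def random_rows(rows, cols):
    """Return a (rows, cols) matrix whose rows are probability vectors."""
    m = rng.random((rows, cols)) + 0.1
    return m / m.sum(axis=1, keepdims=True)


prior = random_rows(1, K)[0]   # p(z)
lik = random_rows(K, M)        # p(x | z), rows indexed by z
enc = random_rows(N, K)        # q(z | x_n), rows indexed by n
data = np.arange(N)            # toy assumption: x_n is outcome n of the M outcomes

# log p(x_n, z) for every data point and latent state, shape (N, K)
log_joint_p = np.log(prior)[None, :] + np.log(lik[:, data]).T

# Equation (1): average of the per-datapoint ELBO.
elbo = np.mean(np.sum(enc * (log_joint_p - np.log(enc)), axis=1))

# Equation (2): -KL(q(z, x) || p(x, z)) with q(z, x_n) = q(z | x_n) / N.
q_joint = enc / N
neg_kl = np.sum(q_joint * (log_joint_p - np.log(q_joint)))

# The two objectives differ by the entropy of the empirical distribution, log N.
print(neg_kl - elbo, np.log(N))
```

Because the offset does not depend on $\theta$ or $\phi$, both forms have the same optima.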

To more explicitly represent the trade-offs that are implicit in optimizing the VAE objective, we perform a decomposition (Figure 1) similar to the one obtained by Hoffman and Johnson (2016). This decomposition yields 4 terms. Terms ③ and ④ enforce consistency between the marginal distributions over $x$ and $z$. Minimizing the KL in term ③ maximizes the marginal likelihood $\mathbb{E}_{q(x)}[\log p_\theta(x)]$, whereas minimizing the KL in term ④ ensures that the inference marginal $q_\phi(z)$ approximates the prior $p(z)$. Terms ① and ② enforce consistency between the conditional distributions. Intuitively speaking, term ① maximizes the identifiability of the values $z$ that generate each $x_n$; when we sample $z \sim q_\phi(z \mid x_n)$, then the likelihood $p_\theta(x_n \mid z)$ under the generative model should be *higher* than the marginal likelihood $p_\theta(x_n)$. Term ② regularizes term ① by minimizing the mutual information $I(z; x)$ in the inference model, which means that $q_\phi(z \mid x_n)$ maps each $x_n$ to less identifiable values.

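Note that term ① is intractable in practice, since we are not able to pointwise evaluate $p_\theta(x)$. We can circumvent this intractability by combining ① + ③ into a single term, which recovers the likelihood up to the constant $\log N$,

$$
\mathbb{E}_{q_\phi(z, x)}\!\left[\log \frac{p_\theta(x \mid z)}{p_\theta(x)}\right] - \mathrm{KL}\big(q(x) \,\|\, p_\theta(x)\big)
= \mathbb{E}_{q_\phi(z, x)}\big[\log p_\theta(x \mid z)\big] + \log N.
$$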
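The four-term decomposition and the cancellation of $\log p_\theta(x)$ can be verified on a small discrete example. This is a sketch under toy assumptions (random probability tables, the hypothetical helper `random_rows`, and data points identified with the first `N` of `M` outcomes). The variables `term1` to `term4` hold the unsigned quantities: the expected log-ratio $\log p_\theta(x \mid z) / p_\theta(x)$, the mutual information $I(z; x)$, $\mathrm{KL}(q(x) \,\|\, p_\theta(x))$, and $\mathrm{KL}(q_\phi(z) \,\|\, p(z))$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, M = 4, 3, 6  # data points, latent states, size of the observation space


def random_rows(rows, cols):
    """Return a (rows, cols) matrix whose rows are probability vectors."""
    m = rng.random((rows, cols)) + 0.1
    return m / m.sum(axis=1, keepdims=True)


prior = random_rows(1, K)[0]            # p(z), shape (K,)
lik_data = random_rows(K, M)[:, :N].T   # p(x_n | z), shape (N, K)
enc = random_rows(N, K)                 # q(z | x_n), shape (N, K)

q_joint = enc / N                       # q(z, x_n) = q(z | x_n) q(x_n)
q_z = q_joint.sum(axis=0)               # inference marginal q(z)
p_x = lik_data @ prior                  # marginal likelihood p(x_n)

term1 = np.sum(q_joint * (np.log(lik_data) - np.log(p_x)[:, None]))
term2 = np.sum(q_joint * (np.log(enc) - np.log(q_z)[None, :]))   # I(z; x)
term3 = np.sum((1.0 / N) * (np.log(1.0 / N) - np.log(p_x)))      # KL(q(x) || p(x))
term4 = np.sum(q_z * (np.log(q_z) - np.log(prior)))              # KL(q(z) || p(z))

# Left-hand side: -KL(q(z, x) || p(x, z)) computed directly.
neg_kl = np.sum(q_joint * (np.log(prior)[None, :] + np.log(lik_data) - np.log(q_joint)))

# Combining the first and third terms removes the dependence on p(x).
expected_log_lik = np.sum(q_joint * np.log(lik_data))

print(neg_kl, term1 - term2 - term3 - term4)
print(term1 - term3, expected_log_lik + np.log(N))
```

Both printed pairs should agree to numerical precision, since the identities are exact for any choice of tables.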

To build intuition for the impact of each of these terms, Figure 2 shows the effect of removing each term from the objective. When we remove ③ or ④ we can learn models in which $p_\theta(x)$ deviates from $q(x)$, or $q_\phi(z)$ deviates from $p(z)$. When we remove ①, we eliminate the requirement that $p_\theta(x_n \mid z)$ should be higher when $z \sim q_\phi(z \mid x_n)$ than when $z \sim p(z)$. Provided the decoder model is sufficiently expressive, we would then learn a generative model that ignores the latent code $z$. This undesirable type of solution does in fact arise in certain cases, even when ① is included in the objective, particularly when using auto-regressive decoder architectures (Chen et al., 2016b).

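Figure 3: *Hierarchical KL decomposition*. We can decompose term ④ into subcomponents Ⓐ and Ⓑ,

$$
-\mathrm{KL}\big(q_\phi(z) \,\|\, p(z)\big)
= \underbrace{\mathbb{E}_{q_\phi(z)}\!\left[\log \frac{p(z)}{\prod_d p(z_d)} - \log \frac{q_\phi(z)}{\prod_d q_\phi(z_d)}\right]}_{\text{Ⓐ}}
\;\underbrace{-\, \sum_d \mathrm{KL}\big(q_\phi(z_d) \,\|\, p(z_d)\big)}_{\text{Ⓑ}}
$$

Term Ⓐ matches the total correlation between variables in the inference model relative to the total correlation under the generative model. Term Ⓑ minimizes the KL divergence between the inference marginal and prior marginal for each variable $z_d$. When the variable $z_d$ contains sub-variables $z_{d,e}$, we can recursively decompose the KL on the marginals $z_d$ into term (i), which matches the total correlation, and term (ii), which minimizes the per-dimension KL divergence.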

When we remove ②, we learn a model that minimizes the overlap between $q_\phi(z \mid x_n)$ for different data points $x_n$, in order to maximize ①. This maximizes the mutual information $I(z; x)$, which is upper-bounded by $\log N$. In practice the mutual information in ② often saturates to $\log N$, even when ② is included in the objective, which suggests that maximizing ① outweighs this cost, at least for the encoder/decoder architectures that are commonly considered in present-day models.
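The $\log N$ upper bound on the mutual information follows in one line, assuming the $N$ training points are distinct so that the empirical distribution $q(x)$ is uniform over $N$ outcomes. Here $H_q$ denotes entropy under the inference model, notation introduced only for this remark:

$$
I(z; x) = H_q(x) - H_q(x \mid z) \;\le\; H_q(x) = \log N .
$$

The inequality holds because $x$ is discrete under $q(x)$, so $H_q(x \mid z) \ge 0$. The bound is attained when $z$ identifies $x_n$ exactly, that is, when the encoder distributions $q_\phi(z \mid x_n)$ for different data points do not overlap.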

## 3 Hierarchically Factorized VAEs (HFVAEs)

In this paper, we are interested in defining an objective that will encourage statistical independence between features. The $\beta$-VAE (Higgins et al., 2016) aims to achieve this goal by defining the objective

$$
\mathcal{L}_{\beta\text{-VAE}}(\theta, \phi) = \mathbb{E}_{q(x)}\!\Big[\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \beta\, \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)\Big].
$$

We can express this objective in the terms of Figure 1 as ① + ③ + $\beta$(② + ④). In order to induce disentangled representations, the authors set $\beta > 1$. This works well in certain cases, but it has the drawback that it also increases the strength of ②, which means that the encoder model may discard more information about $x$ in order to minimize the mutual information $I(z; x)$.
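As a concrete illustration of how $\beta$ scales the whole KL term, the sketch below evaluates a $\beta$-VAE objective for a batch. It assumes a diagonal Gaussian encoder and a standard normal prior, for which the KL has a closed form, and it takes the expected log-likelihood per example as a given input. The function names and the numbers in the usage example are hypothetical.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) for each row of a (B, D) batch."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)


def beta_vae_objective(log_lik, mu, log_var, beta=1.0):
    """Batch average of log p(x | z) - beta * KL(q(z | x) || p(z)).

    log_lik: (B,) estimates of E_{q(z|x)}[log p(x | z)], supplied by the caller.
    mu, log_var: (B, D) parameters of the assumed diagonal Gaussian encoder.
    """
    kl = gaussian_kl_to_standard_normal(mu, log_var)
    return np.mean(log_lik - beta * kl)


# Made-up values for two data points and a two-dimensional code.
log_lik = np.array([-10.0, -12.0])
mu = np.array([[0.5, -1.0], [0.0, 0.0]])
log_var = np.zeros((2, 2))

print(beta_vae_objective(log_lik, mu, log_var, beta=1.0))  # -11.3125
print(beta_vae_objective(log_lik, mu, log_var, beta=4.0))  # -12.25
```

Since the per-datapoint KL averages to the sum of the mutual information and $\mathrm{KL}(q_\phi(z) \,\|\, p(z))$, a single weight $\beta$ cannot strengthen one of these without strengthening the other.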

Looking at the $\beta$-VAE objective, it seems intuitive that increasing the weight of term ④ is likely to aid disentanglement. One notion of disentanglement is that there should be a low degree of correlation between different latent variables $z_d$. If we choose a mean-field prior $p(z) = \prod_d p(z_d)$, then minimizing the KL term should induce an inference marginal $q_\phi(z) = \prod_d q_\phi(z_d)$ in which the $z_d$ are also independent. However, in addition to being sensitive to correlations, the KL will also be sensitive to discrepancies in the shape of the distribution. When our primary interest is to disentangle representations, then we may wish to relax the constraint that the shape of the distribution matches the prior in favor of enforcing statistical independence.

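To make this intuition explicit, we decompose ④ into two terms Ⓐ and Ⓑ (Figure 3). As with term ① + ②, term Ⓐ consists of two components. The second of these takes the form of a total correlation, which is the generalization of the mutual information to more than two variables,

$$
\mathrm{TC}(z) = \mathbb{E}_{q_\phi(z)}\!\left[\log \frac{q_\phi(z)}{\prod_d q_\phi(z_d)}\right] = \mathrm{KL}\Big(q_\phi(z) \,\Big\|\, \prod_d q_\phi(z_d)\Big).
\tag{3}
$$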
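To make the total correlation concrete, the sketch below evaluates it in closed form for a zero-mean Gaussian and checks that, with a factorized standard normal prior, the KL to the prior splits into the total correlation plus the sum of per-dimension KLs. The Gaussian form is a toy assumption chosen because everything is analytic; the inference marginal of a VAE is a mixture over data points and is generally not Gaussian. All function names are hypothetical.

```python
import numpy as np


def gaussian_total_correlation(cov):
    """TC of N(0, cov): KL between the joint and the product of its marginals."""
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)


def kl_to_standard_normal(cov):
    """KL(N(0, cov) || N(0, I))."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.trace(cov) - d - logdet)


def marginal_kls(cov):
    """Per-dimension KL(N(0, cov_dd) || N(0, 1))."""
    var = np.diag(cov)
    return 0.5 * (var - 1.0 - np.log(var))


cov = np.array([[1.0, 0.8],
                [0.8, 2.0]])

tc = gaussian_total_correlation(cov)
kl = kl_to_standard_normal(cov)

# With a factorized prior, the prior's own total correlation is zero, so
# KL(q(z) || p(z)) = TC(z) + sum_d KL(q(z_d) || p(z_d)).
print(kl, tc + marginal_kls(cov).sum())

# Removing the off-diagonal correlation leaves the marginals unchanged
# and drives the total correlation to zero.
print(gaussian_total_correlation(np.diag(np.diag(cov))))
```

The example shows why the two parts can be weighted separately: the total correlation responds only to dependence between dimensions, while the per-dimension KLs respond only to the shape of each marginal.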

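Minimizing the total correlation yields a $q_\phi(z)$ in which different $z_d$ are statistically independent, thereby providing a possible mechanism for inducing disentanglement. In cases where $z_d$ itself represents a group of variables, rather than a single variable, we can continue to decompose to another set of terms (i) and (ii), which match the total correlation for $z_d$ and the KL divergences for the constituent variables $z_{d,e}$. This provides an opportunity to induce hierarchies of disentangled features. We can in principle continue this decomposition for any number of levels to define an HFVAE objective. We here restrict ourselves to the two-level case, which corresponds to an objective of the form

$$
\mathcal{L}_{\text{HFVAE}}(\theta, \phi) = \text{①} + \text{③} + \alpha\, \text{②} + \beta\, \text{Ⓐ} + \sum_d \big[\gamma\, \text{(i)}_d + \text{(ii)}_d\big],
\tag{4}
$$

where each label denotes the corresponding signed term of the decomposition and the subscript $d$ indexes the group $z_d$ to which terms (i) and (ii) apply.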

In this objective, $\alpha$ controls the $I(x; z)$ regularization, $\beta$ controls the TC regularization between groups of variables, and $\gamma$ controls the TC regularization within groups. This objective is similar to, but more general than, the one recently proposed by Kim and Mnih (2018) and Chen et al. (2018). Our objective admits these objectives as a special case corresponding to a non-hierarchical decomposition in which $\beta = \gamma$. The first component of Ⓐ is not present in these objectives, which implicitly assume that $p(z) = \prod_d p(z_d)$. In the more general case where $p(z) \neq \prod_d p(z_d)$, maximizing Ⓐ with respect to $\phi$ will match the total correlation in $q_\phi(z)$ to that in $p(z)$.

[Figure: image panels comparing HFVAE (top row) and $\beta$-VAE (bottom row) on three features: Orientation, Smiling, and Sunglasses. Images not recovered.]

### 3.1 Approximation of the Objective

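In order to optimize this objective, we need to approximate the inference marginals $q_\phi(z)$, $q_\phi(z_d)$, and $q_\phi(z_{d,e})$. Computing these quantities exactly requires a full pass over the dataset, since $q_\phi(z)$ is a mixture over all data points in the training set. We approximate $q_\phi(z)$ with a Monte Carlo estimate $\hat{q}_\phi(z)$ over the same batch of samples that we use to approximate all other terms in the objective. For simplicity we will consider the term

$$
\mathbb{E}_{q_\phi(z, x)}\big[\log q_\phi(z)\big] \approx \frac{1}{B} \sum_{b=1}^{B} \log q_\phi(z^b),
\qquad z^b \sim q_\phi(z \mid x^b), \quad x^b \sim q(x).
\tag{5}
$$

We define the estimate of $q_\phi(z^b)$ as (see Appendix A.1)

$$
\hat{q}_\phi(z^b) := \frac{1}{N}\, q_\phi(z^b \mid x^b) + \frac{N - 1}{N (B - 1)} \sum_{b' \neq b} q_\phi(z^b \mid x^{b'}).
\tag{6}
$$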

This estimator differs from the one in Kim and Mnih (2018), which is based on adversarial-style estimation of the density ratio, and is also distinct from the estimators in Chen et al. (2018), who employ different approximations. We can think of this approximation as a partially stratified sample, in which we deterministically include the term for $x^b$, the data point from which $z^b$ was sampled, and compute a Monte Carlo estimate over the remaining terms, treating the indices $b' \neq b$ as samples from the distribution $q(x \mid x \neq x^b)$. We now substitute $\log \hat{q}_\phi(z)$ for $\log q_\phi(z)$ in Equation (5). By Jensen's inequality this yields a lower bound on the original expectation. While this induces a bias, the estimator is consistent. In practice, the bias is likely to be small given the batch sizes (512-1024) needed to approximate the inference marginal.
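A minimal NumPy sketch of the estimator in Equation (6) is given below, assuming a diagonal Gaussian encoder so that $q_\phi(z^b \mid x^{b'})$ can be evaluated for every pair in the batch. The function names and the random inputs are hypothetical, and the computation is done in log space for numerical stability. When the batch is the entire dataset ($B = N$), all weights equal $1/N$ and the estimate coincides with the exact mixture, which gives a simple sanity check.

```python
import numpy as np


def log_normal_diag(z, mu, log_var):
    """log N(z; mu, diag(exp(log_var))), summed over the last axis."""
    return -0.5 * np.sum(
        np.log(2.0 * np.pi) + log_var + (z - mu) ** 2 / np.exp(log_var), axis=-1
    )


def log_q_marginal_estimate(z, mu, log_var, dataset_size):
    """Estimate log q(z^b) for each sample in a batch (requires B > 1, N > 1).

    z, mu, log_var: (B, D) arrays, with z[b] sampled from the encoder at x^b.
    dataset_size: N, the number of points in the training set.
    """
    B, N = z.shape[0], dataset_size
    # pairwise[b, b2] = log q(z^b | x^{b2})
    pairwise = log_normal_diag(z[:, None, :], mu[None, :, :], log_var[None, :, :])
    # Weight 1/N on the diagonal, (N - 1) / (N (B - 1)) elsewhere.
    log_w = np.full((B, B), np.log(N - 1) - np.log(N) - np.log(B - 1))
    np.fill_diagonal(log_w, -np.log(N))
    # Numerically stable log-sum-exp over the batch dimension.
    a = pairwise + log_w
    m = a.max(axis=1, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=1, keepdims=True)))[:, 0]


rng = np.random.default_rng(0)
B, D = 6, 2
mu = rng.normal(size=(B, D))
log_var = 0.1 * rng.normal(size=(B, D))
z = mu + np.exp(0.5 * log_var) * rng.normal(size=(B, D))

# Sanity check: with B == N the estimate equals the exact mixture density.
estimate = log_q_marginal_estimate(z, mu, log_var, dataset_size=B)
pairwise = log_normal_diag(z[:, None, :], mu[None, :, :], log_var[None, :, :])
exact = np.log(np.mean(np.exp(pairwise), axis=1))
print(np.max(np.abs(estimate - exact)))
```

The marginals $q_\phi(z_d)$ and $q_\phi(z_{d,e})$ can be estimated in the same way by restricting the sum inside `log_normal_diag` to the relevant dimensions.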

## 4 Related Work

In addition to the work of Kim and Mnih (2018) and Chen et al. (2018), our objective is also related to, and generalizes, a number of recently proposed modifications to the VAE objective (see Table 1 for an overview). Zhao et al. (2017) consider an objective that eliminates the mutual information in ② entirely and assigns an additional weight to the KL divergence in ④. Kumar et al. (2017) approximate the KL divergence in ④ by matching the covariance of $q_\phi(z)$ and $p(z)$. Recent work by Gao et al. (2018) connects VAEs to the principle of correlation explanation, and defines an objective that reduces the mutual information regularization in ② for a subset of “anchor” variables. Achille and Soatto (2018) interpret VAEs from an information-bottleneck perspective and introduce an additional TC term into the objective. In addition to VAEs, generative adversarial networks (GANs) have also been used to learn disentangled representations. The InfoGAN (Chen et al., 2016a) achieves disentanglement by *maximizing* the mutual information between individual features and the data under the *generative model*.

In settings where we are not primarily interested in inducing disentangled representations, the $\beta$-VAE objective has also been used with $\beta < 1$ in order to improve the quality of reconstructions (Alemi et al., 2016; Engel et al., 2017; Liang et al., 2018). While this also decreases the relative weight of ②, in practice it does not influence the learned representation in cases where ② saturates anyway. The trade-off between the likelihood and the KL term, and the influence of penalizing the KL term on the mutual information, have been studied in more depth by Alemi et al. (2018) and Burgess et al. (2018). In other recent work, Dupont (2018) considered models containing both Concrete and Gaussian variables. However, that objective was not derived from a decomposition of the kind considered here, but was based on the objective proposed by Burgess et al. (2018).
