Variational Autoencoders Pursue PCA Directions (by Accident)

12/17/2018 · Michal Rolinek et al. · Max Planck Society

The Variational Autoencoder (VAE) is a powerful architecture capable of representation learning and generative modeling. When it comes to learning interpretable (disentangled) representations, VAE and its variants show unparalleled performance. However, the reasons for this are unclear, since a very particular alignment of the latent embedding is needed but the design of the VAE does not encourage it in any explicit way. We address this matter and offer the following explanation: the diagonal approximation in the encoder together with the inherent stochasticity force local orthogonality of the decoder. The local behavior of promoting both reconstruction and orthogonality matches closely how the PCA embedding is chosen. Alongside providing an intuitive understanding, we justify the statement with full theoretical analysis as well as with experiments.


1 Introduction

The Variational Autoencoder (VAE) [23, 32] is one of the foundational architectures in modern-day deep learning. It serves both as a generative model and as a representation learning technique. The generative model is predominantly exploited in computer vision [24, 14, 21, 15], with notable exceptions such as generating combinatorial graphs [25]. As for representation learning, there is a variety of applications, ranging over image interpolation [18], one-shot generalization [31], language models [39], speech transformation [4], and more. Aside from direct applications, VAEs embody the success of variational methods in deep learning and have inspired a wide range of ongoing research [22, 40].

Recently, unsupervised learning of interpretable latent representations has received a lot of attention. Interpretability of the latent code is an intuitively clear concept. For instance, when representing faces, one latent variable would solely correspond to the gender of the person, another to skin tone, yet another to hair color, and so forth. Once such a representation is found, it allows for interpretable latent code manipulation, which is desirable in a variety of applications; recently, for example, in reinforcement learning [35, 17, 11, 37, 30].

The term disentanglement [10, 3] offers a more formal approach. A representation is considered disentangled if each latent component encodes precisely one “aspect” (a generative factor) of the data. Under the current disentanglement metrics [16, 20, 7], VAE-based architectures (β-VAE [16], TCVAE [7], FactorVAE [20]) dominate the benchmarks, leaving behind other approaches such as InfoGAN [8] and DCIGN [24].

The success of VAE-based architectures on disentanglement tasks comes with a certain surprise. One surprising aspect is that VAEs have been challenged on both of their design functionalities: as generative models [13] and as log-likelihood optimizers [27, 29]. Yet, no such claims are made in terms of disentanglement. Another surprise stems from the fact that disentanglement requires the following feature: the representative low-dimensional manifold must be aligned well with the coordinate axes. However, the design of the VAE does not suggest any such mechanism. On the contrary, the idealized log-likelihood objective is, for example, invariant to rotational changes in the alignment.

Such observations have planted a suspicion that the inner workings of the VAE are not sufficiently understood. The recent works on the subject made intriguing empirical observations  [6, 2], gave a fresh theoretical analysis [9], and raised pressing questions [1]. However, a mechanistic explanation for the VAE’s unexpected ability to disentangle is still missing.

In this paper, we isolate an internal mechanism of the VAE (also β-VAE) responsible for choosing a particular latent representation and its alignment. We give a theoretical analysis covering also the nonlinear case and explain the discovered dynamics intuitively. We show that this mechanism promotes local orthogonality of the embedding transformation and clarify how this orthogonality corresponds to good disentanglement. Further, we uncover a strong resemblance between this mechanism and the classical Principal Component Analysis (PCA) algorithm. We confirm our theoretical findings in experiments.

Our theoretical approach is particular in the following ways: (a) we base the analysis on the implemented loss function, in contrast to the typically considered idealized loss, and (b) we identify a specific regime, prevalent in practice, and utilize it for a crucial simplification. This simplification is the key step that enables the formalization.

The results, other than being significant on their own, also provide a solid explanation of “why β-VAEs disentangle”.

2 Background

Let us begin by reviewing the basics of the VAE, PCA, and the Singular Value Decomposition (SVD), along with a more detailed overview of disentanglement.

2.1 Variational Autoencoders

Let X = {x⁽¹⁾, …, x⁽ᴺ⁾} be a dataset consisting of i.i.d. samples of a random variable x. An autoencoder framework operates with two mappings, the encoder Enc : X → Z and the decoder Dec : Z → X, where Z is called the latent space. In the case of the VAE, both mappings are probabilistic and a fixed prior distribution p(z) over Z is assumed. Since the distribution of x is also fixed (the actual data distribution p(x)), the mappings Enc and Dec induce joint distributions q(x, z) and p(x, z), respectively (omitting the dependencies on the parameters φ and θ). The idealized VAE objective is then the marginalized log-likelihood

(1)   E_{x∼p(x)} [ log p_θ(x) ],   with p_θ(x) = ∫ p_θ(x|z) p(z) dz.

This objective is, however, not tractable and is approximated by the evidence lower bound (ELBO) [23]. For a fixed x, the log-likelihood is lower bounded by

(2)   log p_θ(x) ≥ E_{z∼q_φ(z|x)} [ log p_θ(x|z) ] − D_KL( q_φ(z|x) ‖ p(z) ),

where the first term corresponds to the reconstruction loss and the second to the KL divergence between the latent representation and the prior distribution p(z). A variant, the β-VAE [16], introduces a weighting β on the KL term for regulating the trade-off between reconstruction (first term) and the proximity to the prior. Our analysis will automatically cover this case as well.

Finally, the prior is set to p(z) = N(0, I) and the encoder is assumed to have the form

(3)   q_φ(z|x) = N( μ(x), diag(σ²(x)) ),

where μ(x) and σ(x) are deterministic mappings depending on the parameters φ. Note, in particular, that the covariance matrix is enforced to be diagonal. This turns out to be highly significant for the main result of this work. The KL divergence in (2) can be computed in closed form as

(4)   D_KL( q_φ(z|x) ‖ N(0, I) ) = ½ Σ_j ( σ_j²(x) + μ_j²(x) − 1 − log σ_j²(x) ).

In practical implementations, the reconstruction term from (2) is approximated with either a square loss or a cross-entropy loss.
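To make the implemented objective concrete, the following is a minimal numpy sketch of the per-sample (β-)VAE loss described above: the closed-form KL term of Eq. (4) plus a Monte Carlo estimate of the square-loss reconstruction term. The toy linear decoder, dimensions, and values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_diag_gaussian(mu, sigma):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), Eq. (4)."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

def beta_vae_loss(x, mu, sigma, decoder, beta=4.0, n_samples=64):
    """Square-loss reconstruction (Monte Carlo) plus beta-weighted KL."""
    eps = rng.standard_normal((n_samples, mu.size))
    z = mu + sigma * eps                      # reparameterization trick
    recon = np.mean(np.sum((x - decoder(z))**2, axis=1))
    return recon + beta * kl_diag_gaussian(mu, sigma)

# toy example: 2-dim latent, 3-dim data, linear decoder (illustrative only)
D = np.array([[2.0, 0.1], [0.0, 1.0], [0.5, 0.3]])
decoder = lambda z: z @ D.T
x = np.array([1.0, -0.5, 0.2])
mu, sigma = np.array([0.4, -0.6]), np.array([0.1, 0.3])
print(beta_vae_loss(x, mu, sigma, decoder))
```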

2.2 Disentanglement

In the context of learning interpretable representations [3, 16, 6, 2, 34] it is useful to assume that the data originates from a process with some generating factors. For instance, for images of faces this could be face azimuth, skin brightness, hair length, and so on. Disentangled representations can then be defined as ones in which individual latent variables are sensitive to changes in individual generating factors, while being relatively insensitive to other changes [3]. Although quantifying disentanglement is nontrivial, several metrics have been proposed [20, 16, 7].

Note also that disentanglement is impossible without first learning a sufficiently expressive latent representation capable of good reconstruction.

In an unsupervised setting, the generating factors are of course unknown and the learning has to resort to statistical properties. Linear dimensionality reduction techniques demonstrate the two basic statistical approaches. Principal Component Analysis (PCA) greedily isolates sources of variance in the data, while Independent Component Analysis (ICA) recovers a factorized representation; see [33] for a recent review.

One important point to make is that disentanglement is sensitive to rotations of the latent embedding. Following the example above, let us denote by z₁, z₂, and z₃ continuous values corresponding to face azimuth, skin brightness, and hair length. Then, if we change the ideal latent representation as follows

(5)   (z₁, z₂, z₃) ↦ R (z₁, z₂, z₃)ᵀ   for some rotation matrix R ∈ SO(3),

we obtain a representation that is equally expressive in terms of reconstruction (in fact, we only multiplied with a 3D rotation matrix) but the individual latent variables entirely lose their interpretable meaning.
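The point can be checked in a few lines. In the sketch below (all matrices, factor names, and dimensions are hypothetical), a linear decoder is adjusted by the inverse rotation, so the reconstructions are untouched, while each rotated latent coordinate now mixes several generating factors.

```python
import numpy as np

rng = np.random.default_rng(1)

# generating factors: azimuth, skin brightness, hair length (illustrative)
Z = rng.uniform(-1, 1, size=(1000, 3))
D = rng.normal(size=(10, 3))            # a linear "decoder" to observations
X = Z @ D.T

# a rotation of the latent space
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])

Z_rot = Z @ R.T                          # rotated latent representation
D_rot = D @ R.T                          # decoder adjusted by the inverse rotation
X_rot = Z_rot @ D_rot.T

print(np.allclose(X, X_rot))             # True: reconstruction is unaffected
# but the first rotated latent now mixes azimuth and skin brightness:
print(np.corrcoef(Z_rot[:, 0], Z[:, 0])[0, 1],
      np.corrcoef(Z_rot[:, 0], Z[:, 1])[0, 1])
```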

2.3 PCA and Latent Representations

Let us examine more closely how PCA chooses the alignment of the latent embedding and why it matters.

It is well known [5] that for a linear autoencoder with encoder E : Rⁿ → R^d, decoder D : R^d → Rⁿ, and the square error as reconstruction loss, the objective

(6)   Σᵢ ‖ x⁽ⁱ⁾ − D E x⁽ⁱ⁾ ‖²

is minimized by the PCA decomposition. Specifically, by setting E = P Uᵀ and D = U Pᵀ, where U is an orthogonal matrix formed by the normalized eigenvectors (ordered by the magnitudes of the corresponding eigenvalues) of the sample covariance matrix of the data and P = [I_d | 0] is a trivial projection matrix.
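This classical solution is easy to verify numerically. Below is a minimal numpy sketch on synthetic data (dimensions, noise level, and variable names are illustrative): the PCA encoder/decoder built from the sample covariance minimizes (6), with the residual matching the discarded variance.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy data in R^5 with a dominant 2-dimensional structure
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(2000, 5))
X = X - X.mean(axis=0)

n, d = X.shape[1], 2
evals, U = np.linalg.eigh(np.cov(X.T))        # sample covariance eigendecomposition
order = np.argsort(evals)[::-1]
evals, U = evals[order], U[:, order]

P = np.eye(d, n)                               # trivial projection [I_d | 0]
E, D = P @ U.T, U @ P.T                        # PCA encoder / decoder minimizing (6)

recon_err = np.mean(np.sum((X - X @ E.T @ D.T)**2, axis=1))
print(recon_err, np.sum(evals[d:]))            # discarded variance ≈ reconstruction error
```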

However, there are many minimizers of (6) that do not induce the same latent representation. In fact, it suffices to append the encoder with some invertible transformations (e.g. rotations and scalings) and prefix the decoder with their inverses. This geometrical intuition is well captured using the singular value decomposition (SVD); see also Figure 1.

Figure 1: Geometric interpretation of the singular value decomposition (SVD). Sequential illustration of the effects of applying the corresponding SVD matrices of the encoder transformation (left to right) and the decoder (right to left). We notice that steps (i) and (ii) of the encoder preserve the principal directions of the data. Step (iii), however, causes misalignment. In that regard, good encoders are the ones for which step (iii) is trivial. The same argument works for the decoder (in reverse order). This condition is equivalent (for non-degenerate transformations) to having orthogonal columns (see Proposition 1, where this is phrased for the decoder).
Theorem 1 (SVD rephrased, [12]).

Let A be a linear transformation (matrix). Then there exist

  • V, an orthogonal transformation (matrix) of the input space,

  • Σ, a “scale-and-embed” transformation (induced by a diagonal matrix),

  • U, an orthogonal transformation (matrix) of the output space

such that A = U Σ Vᵀ.

Remark 1.

Since orthogonal transformations of the latent space will play a vital role in further considerations, we will for brevity refer to them (with slight abuse of terminology) simply as rotations of the latent space.

Now we can adequately describe the minimizers of (6).

Example 1 (Other minimizers of the PCA objective).

Take the PCA solution E, D of (6), an invertible matrix A, and its inverse, and define Ê = A E and D̂ = D A⁻¹. Then

(7)   D̂ Ê = D A⁻¹ A E = D E,

so Ê and D̂ are indeed also minimizers of the objective (6), irrespective of our choice of A.

It is also straightforward to check that the only choices of A which respect the coordinate axes given by PCA are those for which A is a (scaled) permutation matrix.

The take-away message (valid also in the non-linear case) from this example is:

Different rotations of the same latent space are equally suitable for reconstruction.

Following the PCA example, we formalize which linear mappings have the desired “axes-preserving” property.

Proposition 1 (Axes-preserving linear mappings).

Assume A ∈ Rⁿˣᵈ with n ≥ d has distinct nonzero singular values. Then the following statements are equivalent:

  (a) The columns of A are (pairwise) orthogonal.

  (b) In every SVD of A as A = U Σ Vᵀ, the matrix V is a signed permutation matrix.

We strongly suggest developing a geometrical understanding for both cases (a) and (b) via Figure 1.

Note that once the encoder preserves the principal directions of the data, this already ensures an axis-aligned embedding. The same is true if the decoder is axes-preserving, provided the reconstruction of the autoencoder is accurate.

How the requirement of distinct nonzero singular values manifests in practical applications is discussed in Suppl. 9.2.
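The equivalence in Proposition 1 is easy to verify numerically. In the sketch below (matrices are random and purely illustrative), a matrix with pairwise orthogonal columns of distinct norms yields a V factor that is a signed permutation, while a generic matrix does not.

```python
import numpy as np

rng = np.random.default_rng(3)

# build a 4x2 matrix with orthogonal columns of distinct norms, and a generic one
Q, _ = np.linalg.qr(rng.normal(size=(4, 2)))
A_orth = Q @ np.diag([3.0, 0.7])
A_generic = rng.normal(size=(4, 2))

for A in (A_orth, A_generic):
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    # V is a signed permutation iff every row of |V| has exactly one entry equal to 1
    is_signed_perm = np.allclose(np.sort(np.abs(Vt), axis=1), [[0.0, 1.0], [0.0, 1.0]])
    print(np.round(Vt, 3), "signed permutation:", is_signed_perm)
```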

2.4 Related work

Due to high activity surrounding VAEs, additional care is needed when it comes to evaluating novelty. To the best of our knowledge, two recent works (one of which is concurrent) address related questions and require special attention.

The authors of [6] also aim to explain the good performance of (β-)VAE in disentanglement tasks. A compelling intuitive picture of the underlying dynamics is drawn and supporting empirical evidence is given. In particular, the authors hypothesize that “β-VAE finds latent components which make different contributions to the log-likelihood term of the cost function [reconstruction loss]”, while suspecting that the diagonal posterior approximation is responsible for this behavior. Our theoretical analysis confirms both conjectures (see Section 4).

Concurrent work [2] develops ISA-VAE, another VAE-based architecture suited for disentanglement. Some parts of the motivation overlap with the content of our work. First, rotationally non-symmetric priors are introduced for reasons similar to the content of Section 3.1. And second, both orthogonalization and alignment with PCA directions are empirically observed for VAEs applied to toy tasks.

3 Results

3.1 The problem with log-likelihood

The message from Example 1 and from the discussion about disentanglement is clear: latent space rotation matters. Let us look at how the idealized objectives (1) and (2) handle this.

For a fixed rotation matrix R, we will be comparing a baseline encoder-decoder pair (Enc, Dec) with a rotated pair (Enc_R, Dec_R) defined as

(8)   Enc_R(x) = R · Enc(x),
(9)   Dec_R(z) = Dec(Rᵀ z).

The shortcomings of idealized losses are summarized in the following propositions.

Proposition 2 (Log-likelihood rotation invariance).

Let φ, θ be any choice of parameters for the encoder-decoder pair (Enc_R, Dec_R). Then, if the prior p(z) is rotationally symmetric, the value of the log-likelihood objective (1) does not depend on the choice of R.

Note that the standard prior N(0, I) is rotationally symmetric. This deficiency is not salvaged by the ELBO approximation.

Proposition 3 (ELBO rotation invariance).

Let φ, θ be any choice of parameters for the encoder-decoder pair (Enc_R, Dec_R). Then, if the prior p(z) is rotationally symmetric, the value of the ELBO objective (2) does not depend on the choice of R.

We do not claim novelty of these propositions; however, we are not aware of their formalization in the literature. The proofs can be found in the Supplementary Material (Suppl. 7). An important point now follows:

Log-likelihood based methods (with rotationally symmetric priors) cannot claim to be designed to produce disentangled representations.

However, enforcing a diagonal posterior of the VAE encoder (3) disrupts the rotational symmetry and consequently the resulting objective (4) escapes the invariance arguments. Moreover, as we are about to see, this diagonalization comes with beneficial effects regarding disentanglement. We assume this diagonalization was primarily introduced for different reasons (tractability, computational convenience), hence the “by accident” part of the title.
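The symmetry behind Propositions 2 and 3 is easy to see numerically for the KL part of the ELBO: rotating a full-covariance Gaussian posterior (and compensating in the decoder, which leaves the reconstruction term untouched by construction) does not change its KL divergence to the standard normal prior. A small check, with arbitrary dimensions and values:

```python
import numpy as np
from scipy.stats import special_ortho_group

rng = np.random.default_rng(4)

def kl_gaussian_to_standard(mu, Sigma):
    """KL( N(mu, Sigma) || N(0, I) ) for a full covariance matrix."""
    k = mu.size
    return 0.5 * (np.trace(Sigma) + mu @ mu - k - np.log(np.linalg.det(Sigma)))

mu = rng.normal(size=3)
A = rng.normal(size=(3, 3))
Sigma = A @ A.T + 0.1 * np.eye(3)                  # a full-covariance posterior

R = special_ortho_group.rvs(3, random_state=0)     # random rotation of the latent space
print(kl_gaussian_to_standard(mu, Sigma))
print(kl_gaussian_to_standard(R @ mu, R @ Sigma @ R.T))   # identical: rotation invariant

# with the diagonal restriction of Eq. (3), the rotated posterior is generally
# no longer expressible, which is what breaks this symmetry
```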

3.2 Reformulating VAE loss

The fact that VAEs were not meant to promote orthogonality is reflected in some technical challenges. For one, we cannot follow the usual workflow of a theoretical argument: set up an idealized objective and find suitable approximations which allow for stochastic gradient descent (a top-down approach). We need to do the exact opposite: start with the implemented loss function and find the right simplifications that allow isolating the effects in question while preserving the original training dynamics (a bottom-up approach). This is the main content of this section.

First, we formalize the typical situation in which VAE architectures “shut down” (fill with pure noise) a subset of latent variables and put high precision on the others.

Definition 1.

We say that parameters φ, θ induce a polarized regime if the latent coordinates can be partitioned as V_a ∪ V_p (sets of active and passive variables) such that

  (a) σ_j(x) ≈ 1 and μ_j(x) ≈ 0 for j ∈ V_p,

  (b) σ_j²(x) ≪ 1 for j ∈ V_a,

  (c) the decoder ignores the passive latent components, i.e. ∂ Dec(z) / ∂ z_j ≈ 0 for j ∈ V_p.

The polarized regime simplifies the loss from (4): part (a) ensures zero loss for the passive variables and part (b) implies that the σ_j²(x) terms of the active variables are negligible. All in all, the per-sample loss reduces to

(10)   L_KL(x) ≈ ½ Σ_{j∈V_a} ( μ_j²(x) − 1 − log σ_j²(x) ).

We will assume the VAE operates in the polarized regime. In Section 5.2, we show on multiple tasks and datasets that the two objectives align very early in the training. This behavior is well-known to practitioners.
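A small numpy sketch of this assumption (thresholds, dimensions, and encoder statistics are invented for illustration): coordinates are split into active and passive sets as in Definition 1, and the full KL term (4) is compared with the simplified loss (10) restricted to the active set.

```python
import numpy as np

def split_polarized(mu, sigma, tol=0.1):
    """Partition latent coordinates into active / passive sets (Definition 1).
    The threshold is illustrative, not taken from the paper."""
    passive = (np.abs(mu).mean(0) < tol) & (np.abs(sigma - 1.0).mean(0) < tol)
    return np.where(~passive)[0], np.where(passive)[0]

def kl_full(mu, sigma):
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2), axis=1)

def kl_polarized(mu, sigma, active):
    """Simplified per-sample loss (10): only active coordinates contribute."""
    m, s = mu[:, active], sigma[:, active]
    return 0.5 * np.sum(m**2 - 1.0 - np.log(s**2), axis=1)

# toy encoder outputs: 2 active coordinates (small sigma), 3 passive ones (mu~0, sigma~1)
rng = np.random.default_rng(5)
mu = np.hstack([rng.normal(size=(100, 2)), 0.01 * rng.normal(size=(100, 3))])
sigma = np.hstack([0.05 * np.ones((100, 2)), 1.0 + 0.01 * rng.normal(size=(100, 3))])

active, passive = split_polarized(mu, sigma)
rel_err = np.abs(kl_full(mu, sigma) - kl_polarized(mu, sigma, active)) / kl_full(mu, sigma)
print(active, passive, rel_err.max())    # relative error (cf. Eq. (30)) stays small
```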

Also, we approximate the reconstruction term in (2), as is most common, with a square loss

(11)   L_rec(x) = E_{z∼q_φ(z|x)} [ ‖ x − Dec(z) ‖² ],

where the expectation is over the stochasticity of the encoder. All in all, the loss we will analyze has the form

(12)   L(x) = L_rec(x) + β · L_KL(x).

Moreover, the reconstruction loss can be further decomposed into two parts: deterministic and stochastic. The former is defined by

(13)   L_rec^det(x) = ‖ x − Dec(μ(x)) ‖²

and captures the square loss of the mean encoder, whereas the stochastic loss

(14)   L_rec^stoch(x) = E_{z∼q_φ(z|x)} [ ‖ Dec(μ(x)) − Dec(z) ‖² ]

is purely induced by the noise injected in the encoder.

Proposition 4.

If the stochastic estimate Dec(z) is unbiased around Dec(μ(x)), i.e. E_{z∼q_φ(z|x)}[Dec(z)] = Dec(μ(x)), then

(15)   L_rec(x) = L_rec^det(x) + L_rec^stoch(x).

This decomposition resembles the classical bias-variance decomposition of the square error [19].
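A quick Monte Carlo check of this decomposition on a toy linear decoder (chosen so that the unbiasedness assumption of Proposition 4 holds exactly; all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)

D = rng.normal(size=(6, 2))                  # toy linear decoder: z -> D z (unbiased case)
decode = lambda z: z @ D.T

x = rng.normal(size=6)
mu, sigma = np.array([0.8, -0.3]), np.array([0.2, 0.5])

z = mu + sigma * rng.standard_normal((200_000, 2))       # samples from the encoder

total = np.mean(np.sum((x - decode(z))**2, axis=1))                      # Eq. (11)
deterministic = np.sum((x - decode(mu[None]))**2)                        # Eq. (13)
stochastic = np.mean(np.sum((decode(mu[None]) - decode(z))**2, axis=1))  # Eq. (14)

print(total, deterministic + stochastic)     # Proposition 4: the two agree (up to MC noise)
```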

3.3 The main result

Now, we finally give theoretical evidence for the central claim of the paper:

Optimizing the stochastic part of the reconstruction loss promotes local orthogonality of the decoder.

On that account, we set up an optimization problem which allows us to optimize the stochastic loss (14) independently of the other two. This will isolate its effects on the training dynamics.

In order to make statements about local orthogonality, we introduce for each sample x the Jacobian (linear approximation) of the decoder at the point μ(x), i.e. J_x = ∇_z Dec(z) |_{z=μ(x)}. Since, according to (3), the encoder can be written as z = μ(x) + σ(x) ⊙ ε with

(16)   ε ∼ N(0, I),

we can approximate the stochastic loss (14) with

(17)   L̂_rec^stoch(x) = E_ε [ ‖ J_x ( σ(x) ⊙ ε ) ‖² ].

Although we aim to fix the deterministic loss (13), we do not need to freeze the mean encoder and the decoder entirely. Following Example 1, for each Jacobian J_x and its SVD J_x = U_x Σ_x V_xᵀ, we are free to modify V_x as long as we correspondingly (locally) modify the mean encoder.
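The following sketch illustrates the linearization step with a toy nonlinear decoder (weights and dimensions invented for illustration): the stochastic loss (14) estimated by sampling is compared with its local approximation (17), which for a diagonal encoder reduces to a weighted sum of squared column norms of the Jacobian.

```python
import numpy as np

rng = np.random.default_rng(7)
W1, W2 = rng.normal(size=(8, 2)), rng.normal(size=(4, 8))   # weights of a toy decoder

def decoder(z):
    """A toy nonlinear decoder R^2 -> R^4 (purely illustrative)."""
    return np.tanh(z @ W1.T) @ W2.T

def jacobian(f, z, eps=1e-6):
    """Finite-difference Jacobian of f at z."""
    J = np.zeros((f(z).size, z.size))
    for j in range(z.size):
        dz = np.zeros_like(z); dz[j] = eps
        J[:, j] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return J

mu, sigma = np.array([0.3, -0.7]), np.array([0.05, 0.2])
J = jacobian(decoder, mu)

# stochastic loss (14) by sampling vs. its local approximation (17)
eps_samples = rng.standard_normal((100_000, 2))
z = mu + sigma * eps_samples
mc = np.mean(np.sum((decoder(mu) - decoder(z))**2, axis=1))
approx = np.sum(sigma**2 * np.sum(J**2, axis=0))   # sum_j sigma_j^2 * ||j-th column of J||^2
print(mc, approx)                                   # close for small sigma
```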

Then we state the optimization problem as follows:

(18)   minimize over {σ(x)} and {V_x} the objective Σ_x log L̂_rec^stoch(x)
(19)   s. t. the total KL loss (10) over the dataset is kept fixed,

where the ε are sampled as in (16).

A few remarks are now in order.

  • This optimization is not over network parameters, but directly over the values of all σ(x) and V_x (only constrained by (19)).

  • Both the objective and the constraint concern global losses, not per-sample losses.

  • Indeed, none of these quantities interfere with the rest of the VAE objective (12).

The presence of the (monotone) log function has one main advantage: we can describe all global minima of (18) in closed form. This is captured in the following theorem, the technical heart of this work.

Theorem 2 (Main result).

The following holds for the optimization problem (18, 19):

  (a) Every local minimum is a global minimum.

  (b) In every global minimum, the columns of every J_x are orthogonal.

The full proof as well as an explicit description of the minima is given in Suppl. 7.1. However, an outline of the main steps is given in the next section on the example of a linear decoder.

The presence of the log term in (18) admittedly makes our argument indirect. There are, however, a couple of points to make. First, as was mentioned earlier, encouraging orthogonality was not a design feature of the VAE. In this sense, it is unsurprising that our results are also mildly indirect.

Also, and more importantly, the global optimality of Theorem 2 also implies that, locally, orthogonality is encouraged even for the pure (without logarithm) stochastic loss.

Corollary 1.

For a fixed sample x, consider the subproblem of (18) defined as

(20)
(21)

Also then, the result on the structure of local (global) minima holds:

  (a) Every local minimum is a global minimum.

  (b) In every global minimum, the columns of every J_x are orthogonal.

All in all, Theorem 2 justifies the central message of the paper stated at the beginning of this section. The analogy with PCA is now also clearer. Locally, VAEs optimize a tradeoff between reconstruction and orthogonality.

This result is unaffected by the potential β weighting in Equation (2), although an appropriate β might be required to ensure the polarized regime.

4 Proof outline

In this section, we sketch the key steps in the proof of Theorem 2 and, more notably, the intuition behind them. The full proof can be found in Suppl. 7.1.

We will restrict ourselves to a simplified setting. Consider a linear decoder Dec(z) = D z with SVD D = U Σ Vᵀ, which removes the necessity of local linearization. This reduces the objective (18) from a “global” problem over all examples to an objective with the same subproblem for each sample x.

As in optimization problem (18, 19), we resort to fixing the mean encoder (imagine a well performing one).

In the next paragraphs, we separately perform the optimization over the precision parameters σ and the optimization over the matrix V.

4.1 Weighting precision

For this part, we fix the decoder matrix D and optimize over the values σ_j. The simplified objective is

(22)   min_σ E_ε [ ‖ D (σ ⊙ ε) ‖² ]
(23)   s. t.  Σ_j log σ_j² = const,

where the μ terms from (10) disappear since the mean encoder is fixed.

The values 1/σ_j can now be thought of as precisions allowed for the different latent coordinates. The function −log σ_j² even suggests thinking in terms of the number of significant digits. Problem (22) then asks to distribute the “total precision budget” so that the deviation from decoding the “uncorrupted” values is minimal.

We will now solve this problem for an example linear decoder D given by

(24)

Already here we see that the latent variable z₁ seems more influential for the reconstruction. We would expect that z₁ receives higher precision than z₂.

Now, for z = μ + σ ⊙ ε, we compute the decoding error D(σ ⊙ ε), and after taking the expectation we can use the fact that ε has zero mean. Finally, we use that for uncorrelated random variables a and b we have Var(a + b) = Var(a) + Var(b). After rearranging, we obtain

E_ε [ ‖ D (σ ⊙ ε) ‖² ] = γ₁² σ₁² + γ₂² σ₂²,

where γ_j = ‖d_j‖. Note that the coefficients γ_j² are the squared norms of the column vectors d_j of D.

This turns the optimization problem (22) into a simple exercise, particularly after realizing that (23) fixes the value of the product σ₁σ₂. Indeed, we can set a = γ₁²σ₁² and b = γ₂²σ₂² in the trivial inequality a + b ≥ 2√(ab) and find that

(25)   γ₁²σ₁² + γ₂²σ₂² ≥ 2 γ₁γ₂ σ₁σ₂,

with equality achieved when γ₁σ₁ = γ₂σ₂. This also implies that the precision on variable z₁ will be considerably higher than on z₂, just as expected.
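The precision-allocation trade-off can be reproduced numerically. The sketch below (column norms and budget are made up) minimizes the weighted sum from above under a fixed product of the σ values and recovers the equality condition of (25): the more influential column receives the smaller σ, i.e. the higher precision.

```python
import numpy as np
from scipy.optimize import minimize

gamma = np.array([2.0, 0.5])      # column norms of an illustrative decoder: z1 matters more
c = 0.01                          # fixed "precision budget": sigma1 * sigma2 = c

def objective(log_sigma):
    s = np.exp(log_sigma)
    return np.sum(gamma**2 * s**2)

# enforce the budget by parameterizing log sigma2 = log c - log sigma1
res = minimize(lambda t: objective(np.array([t[0], np.log(c) - t[0]])), x0=[0.0])
s1 = np.exp(res.x[0]); s2 = c / s1

print(s1, s2)                        # z1 (large column) gets the smaller sigma = higher precision
print(gamma[0] * s1, gamma[1] * s2)  # equal at the optimum, as in Eq. (25)
```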

Two remarks regarding the general case follow.

  • The full version of inequality (25) relies on the concavity of the log function; in particular, on (a version of) Jensen’s inequality.

  • The minimum value of the objective depends on the product of the column norms. This also carries over to the unsimplified setting.

4.2 Isolating sources of variance

Now that we can find optimal values of the precision, the focus shifts to optimally rotating the latent space. In order to understand how such rotations influence the minimum of objective (22), let us consider the following example, in which we again resort to the decoder matrix D from (24).

Imagine that the encoder alters the latent representation by a rotation. Then we can adjust the decoder by first undoing this rotation. In particular, we set D′ = D R_α, where R_α is a 2D rotation matrix rotating by an angle α. We obtain new column norms for D′, and performing the analogous optimization as before gives

(26)

We see that the minimal value of the objective is more than twice as high, a substantial difference. On a high level, the reason the original decoder was the better choice is that the variables z₁ and z₂ had very different impacts on the reconstruction. This allowed saving some precision on variable z₂, as it had a smaller effect, and using it on z₁, where it is more beneficial.

For a higher number of latent variables, one way to achieve a “maximum stretch” among the impacts of the latent variables is to pick them greedily, always choosing the next one so that its impact is maximized. This is, at heart, the greedy algorithm for PCA.

Let us consider a slightly more technical statement. We saw in (25) and (26) that, after finding optimal values of σ, the remaining objective is governed by the product of the column norms of the decoder matrix. Let us denote this quantity for a matrix M by Π(M). Then, for a fixed matrix D, we optimize

(27)   min_V Π(D V)

over orthogonal matrices V.

Figure 2: 2D illustration of orthogonality. The vectors are the columns of D V. Minimizing the product of their norms while maintaining the volume they span results in orthogonal columns.

This problem can be interpreted geometrically. The column vectors of D V are the images of the base vectors e_j. Consequently, the product of their norms gives an upper bound on the volume of the image of the unit cube,

(28)   vol(D V) ≤ Π_j ‖ (D V) e_j ‖.

However, as orthogonal matrices are isometries, they do not change this volume. Also, the bound (28) is tight precisely when the vectors are orthogonal. Hence, the only way to optimize (27) is by tightening the bound, that is, by finding V for which the column vectors of D V are orthogonal; see Figure 2 for an illustration. In this regard, it is important that D performs a different scaling along each of the axes (via Σ), which allows for changing the angles among the vectors (cf. Figure 1).
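A small numerical illustration of this argument (the decoder matrix is random): sweeping over 2D rotations V, the product of the column norms of D V is minimized exactly when the columns become orthogonal, and the minimum equals the product of the singular values of D.

```python
import numpy as np

rng = np.random.default_rng(8)
D = rng.normal(size=(4, 2))                         # a fixed decoder matrix

def col_norm_product(M):
    return np.prod(np.linalg.norm(M, axis=0))

angles = np.linspace(0, np.pi, 2000)
products, orthogonality = [], []
for a in angles:
    V = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    M = D @ V
    products.append(col_norm_product(M))
    orthogonality.append(abs(M[:, 0] @ M[:, 1]))    # 0 iff columns orthogonal

best = int(np.argmin(products))
print(products[best], np.prod(np.linalg.svd(D, compute_uv=False)))  # min ≈ product of singular values
print(orthogonality[best])                                          # ≈ 0: columns orthogonal at the minimum
```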

5 Experiments

We performed several experiments with different architectures and datasets to validate our results empirically. We show the prevalence of the polarized regime, the strong orthogonality effects of the (β-)VAE, as well as the links to disentanglement.

β-VAE (dep.)   β-VAE (10)
dSprites
fMNIST
MNIST
Synth. Lin.
Synth. Non-Lin.
Table 1: Percentage of training time during which the relative error (Eq. (30)) stays below the threshold continuously until the end. Reported for β-VAE with a low (dataset dependent) and a high (10) latent dimension.

5.1 Setup

Architectures.

We evaluate the classical VAE, the β-VAE, a plain autoencoder (AE), and a full-covariance variant of the β-VAE, where the latter removes the critical diagonal approximation (3) and produces a full covariance matrix for every sample. The resulting KL term of the loss is changed accordingly (see Suppl. 8.3 for details).

Datasets.

We evaluate on the well-known datasets dSprites [28], MNIST [26], and FashionMNIST [38], as well as on two synthetic ones. For both synthetic tasks, the input data is generated by embedding a unit square into a higher-dimensional space. The latent representation is then expected to be disentangled with respect to the axes of the square. In one case (Synth. Lin.) we used a linear embedding and in the other a non-linear one (Synth. Non-Lin.). The exact choice of transformations can be found in Suppl. 8. Further information regarding network structures and training parameters is also provided in Suppl. 8.4.

Columns: β-VAE, VAE, AE, full-covariance β-VAE, Random Decoder. Rows: Disent. and DtO for dSprites, Synth. Lin., and Synth. Non-Lin.; DtO for MNIST and fMNIST.

Table 2: Results for the distance to orthogonality (DtO) of the decoder (Equation (29)) and the disentanglement score for different architectures and datasets. Lower DtO values are better and higher Disent. values are better. Random decoders provide a simple baseline for the numbers.

Disentanglement metric.

For quantifying the disentanglement of a representation, the so-called Mutual Information Gap (MIG) was introduced in [7]. As MIG is not well defined for continuous variables, we use an adjusted definition comprising both continuous and discrete variables, simply referred to as the Disentanglement score. Details are described in Suppl. 8.1. Just as in the case of MIG, the Disentanglement score is a number between 0 and 1, where a higher value means stronger disentanglement.

Orthogonality metric.

For measuring the practical effects of Theorem 2, we introduce a measure of non-orthogonality. As argued in Proposition 1 and Figure 1, for a good decoder and its SVD U Σ Vᵀ, the matrix V should be trivial (a signed permutation matrix). We measure the non-triviality with the Distance to Orthogonality (DtO), defined as follows. For each sample x, we employ again the Jacobian J_x of the decoder at μ(x) and its SVD J_x = U_x Σ_x V_xᵀ, and define

(29)   DtO = (1/N) Σ_x ‖ V_x − P_x ‖_F,

where ‖·‖_F is the Frobenius norm and P_x is a signed permutation matrix that is closest to V_x (in the ‖·‖_F sense). Finding the nearest signed permutation matrix is solved to optimality via mixed-integer linear programming (see Suppl. 8.2).
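A sketch of how such a metric can be computed (the exact aggregation over samples in Eq. (29) is reconstructed above and may differ in detail from the paper). The paper finds the nearest signed permutation via mixed-integer programming; for the Frobenius norm, the linear-assignment formulation below should reach the same optimum.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def nearest_signed_permutation(V):
    """Signed permutation matrix closest to V in Frobenius norm."""
    rows, cols = linear_sum_assignment(-np.abs(V))     # maximize the selected |V| entries
    P = np.zeros_like(V)
    P[rows, cols] = np.sign(V[rows, cols])
    return P

def dto(jacobians):
    """Sketch of the distance to orthogonality (Eq. (29)), averaged over samples."""
    dists = []
    for J in jacobians:
        U, S, Vt = np.linalg.svd(J, full_matrices=False)
        V = Vt.T
        dists.append(np.linalg.norm(V - nearest_signed_permutation(V)))
    return float(np.mean(dists))

# example: a decoder with orthogonal columns scores ~0, a generic one does not
rng = np.random.default_rng(9)
Q, _ = np.linalg.qr(rng.normal(size=(6, 2)))
print(dto([Q @ np.diag([2.0, 0.5])]), dto([rng.normal(size=(6, 2))]))
```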

Figure 3: Alignment of the latent representation (low DtO, Eq. (29)) results in better disentanglement (higher score). Each data point corresponds to an independent run with a different number of training epochs.

5.2 Polarized regime

In Section 3.2, we assumed VAEs operate in a polarized regime and approximated the KL term of the implemented objective (4) with (10). In Table 1 we show that the polarized regime indeed dominates the training in all examples after a short initial phase. We report the fraction of the training time in which the relative error between the full KL term (4) and its approximation (10),

(30)   δ_rel = | L_KL − L̂_KL | / L_KL,

stays below a fixed threshold continuously until the end (evaluated every 500 batches).

5.3 Orthogonality and Disentanglement

Now, we provide evidence for Theorem 2 by investigating the DtO (29) for a variety of architectures and datasets; see Table 2. The results clearly support the claim that the VAE-based architectures indeed strive for local orthogonality. By generalizing the β-VAE architecture such that the approximate posterior is an arbitrary multivariate Gaussian (the full-covariance variant), the objective becomes rotationally symmetric (just as the idealized objective). As such, no specific alignment is prioritized. The simple autoencoders also do not favor particular orientations of the latent space.

Another important observation is the clear correlation between the DtO and the disentanglement score. We show this in Figure 3, where different restarts of the same β-VAE architecture on the dSprites dataset are displayed. We used the state-of-the-art value of β from [16]. Additional experiments are reported in Suppl. 9.

6 Discussion

We isolated the mechanism of VAE that leads to local orthogonalization and, in effect, to performing local PCA. Additionally, we demonstrated the functionality of this mechanism in intuitive terms, in formal terms, and also in experiments. We also explained why this behavior is desirable for enforcing disentangled representations.

Our insights show that VAEs make use of the differences in variance to form the representation in the latent space. This does not directly encourage factorized latent representations, see also Suppl. 9.2. With this in mind, it makes perfect sense that recent improvements of the (β-)VAE [7, 20, 2] incorporate additional terms promoting precisely this independence.

It is also unsatisfying that VAEs promote orthogonality somewhat indirectly. It would seem that designing architectures allowing explicit control over this feature would be beneficial.

Acknowledgements

We thank the whole Autonomous Learning Group at MPI IS, as well as Friedrich Solowjow for the fruitful and invaluable discussions. Also, we thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Dominik Zietlow.

References

Supplementary Material

The supplementary information is structured as follows. We start with a remark on Table 2 and then provide the proofs in Section 7.1. Section 8 reports the details of the experiments followed by additional experiments in Section 9.

Remark on Table 2

Some dataset-architecture combinations listed in Table 2 are omitted for the following reasons.

On the one hand, calculating the Disentanglement Score for MNIST and fMNIST does not make sense, as the generating factors are not given (the one categorical label cannot serve as a replacement). Consequently, as the values of β are chosen according to this score, we do not report β-VAE numbers for these datasets. On the other hand, for either synthetic task, the regular VAE vastly overprunes, see Figure S4, and the values become meaningless.

7 Proofs

7.1 Proof of Theorem 2

Proof strategy:

For part (b), we aim to derive a lower bound on the objective (18) that is independent of the optimization variables σ and V. Moreover, we show that this lower bound is tight for some specific choices of σ and V, i.e. the global optima. For these choices, all local Jacobians J_x will have orthogonal columns.

The strategy for part (a) is to show that whenever σ and V do not induce a global optimum, we can find a small perturbation that decreases the objective function, thereby showing that local minima do not exist.

Technical lemmas:

We begin by introducing a few useful statements. The first is the inequality between the arithmetic and geometric mean, a consequence of Jensen’s inequality.

Lemma S1 (AM-GM inequality).

Let a₁, …, aₙ be nonnegative real numbers. Then

(S31)   (a₁ + ⋯ + aₙ) / n ≥ (a₁ ⋯ aₙ)^(1/n),

with equality occurring if and only if a₁ = ⋯ = aₙ.

The second bound to be used is the classical Hadamard’s inequality.

Lemma S2 (Hadamard’s inequality [36]).

Let A be a non-singular n × n matrix with column vectors a₁, …, aₙ. Then

(S32)   |det A| ≤ ∏ⱼ ‖aⱼ‖,

with equality if and only if the vectors a₁, …, aₙ are pairwise orthogonal.

And finally a simple lemma for characterizing matrices with orthogonal columns.

Lemma S3 (Column orthogonality).

Let A be a matrix and let A = U Σ Vᵀ be its singular value decomposition. Then the following statements are equivalent:

  (a) The columns of A are (pairwise) orthogonal.

  (b) The matrix AᵀA is diagonal.

  (c) The columns of Σ Vᵀ are (pairwise) orthogonal.

Proof.

The equivalence of (a) and (b) is immediate. For the equivalence of (a) and (c) it suffices to notice that if we set B = Σ Vᵀ, then

(S33)   AᵀA = V Σᵀ Uᵀ U Σ Vᵀ = (Σ Vᵀ)ᵀ (Σ Vᵀ) = BᵀB.

The equivalence of (a) and (b) now implies that B has orthogonal columns if and only if A does. ∎

Initial considerations:

First, without loss of generality, we will ignore all passive latent variables (in the sense of Definition 1). Formally speaking, we will restrict to the case where the local decoder mappings are non-degenerate (i.e. have nonzero singular values). From now on, d denotes the dimensionality of the (active) latent space.

Next, we simplify the loss L_KL from Equation (10). Up to additive and multiplicative constants, this loss can, for a fixed sample x, be written as

(S34)   Σ_j ( μ_j²(x) − log σ_j²(x) ).

In the optimization problem (18, 19), the values μ_j(x) can only be affected via applying an orthogonal transformation V_x. But such transformations are norm-preserving (isometric), and hence the values ‖μ(x)‖² do not change in the optimization. As a result, we can restate the constraint (19) as

(S35)   Σ_x Σ_j log σ_j²(x) = C

for some constant C.

Proof of Theorem 2(b):

Here, we explain how Theorem 2(b) follows from the following two propositions.

Proposition S5.

For a fixed sample x, let us denote by v₁, …, v_d the column vectors of J_x. Then

(S36)   E_ε [ ‖ J_x ( σ(x) ⊙ ε ) ‖² ] = Σ_j σ_j²(x) ‖v_j‖² ≥ d ( ∏_j σ_j(x) ‖v_j‖ )^(2/d),

with equality if and only if σ_j(x) ‖v_j‖ is the same for every j.

Proposition S6.

Let A ∈ R^{n×d}, where n ≥ d, be a matrix with column vectors a₁, …, a_d and nonzero singular values s₁, …, s_d. Then

(S37)   ∏_j ‖a_j‖ ≥ ∏_j s_j,

where the right-hand side is the product of the singular values of A. Equality occurs if and only if a₁, …, a_d are pairwise orthogonal.

First, Proposition S6 allows making further estimates in the inequality from Proposition S5. Indeed, we get

(S38)

and after applying the (monotonous) log function we are left with

(S39)
(S40)

Finally, we sum over the samples and simplify via (S35) as

(S41)

The right-hand side of this inequality is independent of the values of σ, as well as of the orthogonal matrices V_x, since these do not influence the singular values of any J_x.

Moreover, it is possible to make inequality (S41) tight (i.e. reach the global minimum) by setting σ as hinted by Proposition S5 and by choosing the matrices V_x such that every J_x has orthogonal columns (this is clearly possible, as seen in Proposition 1).

This yields the desired description of the global minima of (18). ∎

Proof of Proposition S5:

We further denote by , …, the row vectors of , and by the element of at -th row and -th column. With sampling according to

(S42)

we begin simplifying the objective (18) with

(S43)
(S44)

Now, as the samples ε are zero mean, we can further write

(S45)

Now we use the fact that for uncorrelated random variables a and b we have Var(a + b) = Var(a) + Var(b). This allows us to expand the variance of the inner product as

(S46)

Now, we can regroup the terms via

(S47)

All in all, we obtain

(S48)

from which the desired inequality follows via setting for , …, in Lemma S1. Indeed, then we have

(S49)

as required. ∎

Proof of Proposition S6:

As the first step, we show that both sides of the desired inequality are invariant to multiplying the matrix from the left with an orthogonal matrix .

For the right-hand side, this is clear as the singular values of are identical to those of . As for the left-hand side, we first need to realize that the vectors are the images of the canonical basis vectors , i.e. for . But since is an isometry, we have for every , and hence also the column norms are intact by prepending to .

This allows us to restrict to matrices for which the SVD has the simplified form A = Σ Vᵀ. Next, let us denote by Σ̃ the top-left d × d submatrix of Σ. Note that Σ̃ contains all nonzero elements of Σ. As a result, the matrix Σ̃ Vᵀ contains precisely the nonzero rows of the matrix Σ Vᵀ. This implies

(S50)

In particular, the column vectors of Σ̃ Vᵀ have the same norms as those of A. Now we can write

(S51)

where the inequality follows from Lemma S2 applied to the nonsingular matrix Σ̃ Vᵀ. Equality in Lemma S2 occurs precisely if the columns of Σ̃ Vᵀ are orthogonal. However, according to Lemma S3 and (S50), it also follows that the columns of Σ̃ Vᵀ are orthogonal if and only if the columns of A are. Note that Lemma S3(c) is needed for covering the reduction performed in the first two paragraphs. ∎

Proof of Theorem 2(a):

We show the nonexistence of local minima as follows. For any values of σ and V that do not minimize the objective function (18), we find a small perturbation that improves this objective.

All estimates involved in establishing inequality (S41) rely on either Lemma S1 or Lemma S2, where in both cases the right-hand side was kept fixed. We show that both of these inequalities can be tightened in such a fashion by small perturbations of their parameters.

Lemma S4 (Locally improving AM-GM).

For any non-negative values , …, for which

(S52)

there exists a small perturbation of for such that

(S53)
(S54)
Proof.

Since (S52) is a sharp inequality, we have for some . Then setting , , and otherwise, will do the trick. Indeed, we have as well as for small enough . This ensures both S53 and S54. ∎

An analogous statement for Lemma S2 has the following form.

Lemma S5 (Locally improving Hadamard’s inequality).

Let be a non-singular matrix with SVD