Loss Landscapes of Regularized Linear Autoencoders

01/23/2019
by   Daniel Kunin, et al.

Autoencoders are a deep learning model for representation learning. When trained to minimize the Euclidean distance between the data and its reconstruction, linear autoencoders (LAEs) learn the subspace spanned by the top principal directions but cannot learn the principal directions themselves. In this paper, we prove that L_2-regularized LAEs learn the principal directions as the left singular vectors of the decoder, providing an extremely simple and scalable algorithm for rank-k SVD. More generally, we consider LAEs with (i) no regularization, (ii) regularization of the composition of the encoder and decoder, and (iii) regularization of the encoder and decoder separately. We relate the minimum of (iii) to the MAP estimate of probabilistic PCA and show that for all critical points the encoder and decoder are transposes. Building on topological intuition, we smoothly parameterize the critical manifolds for all three losses via a novel unified framework and illustrate these results empirically. Overall, this work clarifies the relationship between autoencoders and Bayesian models and between regularization and orthogonality.


Code Repositories

Regularized-Linear-Autoencoders: Loss Landscapes of Regularized Linear Autoencoders (github.com/danielkunin/Regularized-Linear-Autoencoders)

PCA_tests: Comparison of Incremental PCA and LAE PCA

1 Introduction

Consider a data set consisting of points in . Let be the data matrix with the columns . We will assume throughout that and that the singular values of are positive and distinct.

An autoencoder consists of an encoder and decoder ; the latter maps the latent representation to the reconstruction (goodfellow). The full network is trained to minimize reconstruction error, typically the squared Euclidean distance between the dataset and its reconstruction (or equivalently, the Frobenius norm of ). When the activations of the network are the identity, the model class reduces to that of one encoder layer and one decoder layer . We refer to this model as a linear autoencoder (LAE) with loss function defined by

Parameterizing by the product of the decoder and encoder, the Eckart-Young Theorem (young) states that the optimal product orthogonally projects the data onto the subspace spanned by its top principal directions. (The principal directions of the data are the eigenvectors of its covariance in descending order by eigenvalue, or equivalently the left singular vectors of the mean-centered data matrix in descending order by squared singular value.)

Without regularization, LAEs learn this subspace but cannot learn the principal directions themselves due to the symmetry of the loss under the action of the group of invertible matrices defined by:

(1)

Indeed, the loss achieves its minimum value on a smooth submanifold diffeomorphic to this group; the learned latent representation is only defined up to deformation by invertible linear maps; and the k-dimensional eigenspace of the reconstruction map with eigenvalue 1 has no preferred basis.

In light of the above, the genesis of this work was our surprise at the empirical observation in plaut that the principal directions of the data are recovered from a trained LAE as the left singular vectors of the decoder (or as the right singular vectors of the encoder). While the paper made no mention of it, we realized by looking at the code that training was done with the common practice of L_2-regularization:

In this paper, we prove that LAEs with L_2-regularization do in fact learn the principal directions in this way. We further show how to recover those eigenvalues larger than the regularization constant.
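To make the claim concrete, here is a minimal NumPy sketch (ours, not the authors' released code) that trains a sum-regularized LAE by full-batch gradient descent on synthetic data and reads off the principal directions as the left singular vectors of the decoder. The data construction, the hyperparameters, and the eigenvalue-recovery relation sigma_i^2 = lam / (1 - s_i^2) (suggested by the scalar analysis in Section 2.1) are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k, lam, lr, steps = 10, 200, 3, 2.0, 0.01, 20000

# Synthetic data with distinct singular values (assumed mean-centered).
svals = np.linspace(4.0, 1.0, m)
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, m)))
X = U @ np.diag(svals) @ V.T

W1 = 0.01 * rng.standard_normal((k, m))   # encoder
W2 = 0.01 * rng.standard_normal((m, k))   # decoder

for _ in range(steps):                    # full-batch gradient descent on the sum loss
    R = X - W2 @ W1 @ X
    gW1 = -2 * W2.T @ R @ X.T + 2 * lam * W1
    gW2 = -2 * R @ X.T @ W1.T + 2 * lam * W2
    W1 -= lr * gW1
    W2 -= lr * gW2

# Left singular vectors of the decoder vs. top-k principal directions of X.
U_data = np.linalg.svd(X, full_matrices=False)[0][:, :k]
U_dec, s_dec, _ = np.linalg.svd(W2, full_matrices=False)
print(np.round(np.abs(U_data.T @ U_dec), 2))      # ~ identity, up to sign

# Assumed eigenvalue recovery (scalar-case relation): sigma_i^2 = lam / (1 - s_i^2).
print(np.round(lam / (1 - s_dec**2), 2))          # ~ 16.0, 13.4, 11.1
print(np.round(svals[:k] ** 2, 2))                # true squared singular values
```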

(a) Unregularized

(b) Product

(c) Sum

(d) Sum (different regularization constant)
Figure 1: Loss landscapes in the scalar case. Yellow points are saddles and red curves and points are global minima.

1.1 Related work

Building on the original work of young on low-rank matrix approximation, izenman demonstrated a connection between a rank-reduced regression model similar to an LAE and PCA. bourlard characterized the minima of an unregularized LAE; baldi1989 extended this analysis to all critical points.

Several studies of the effect of regularization on LAEs have emerged of late. The rank-reduced regression model was extended in mukherjee to the study of rank-reduced ridge regression. A similar extension of the LAE model was given in josse. An in-depth analysis of the linear denoising autoencoder was given in pretorius, and most recently mianjy explored the effect of dropout regularization on the minima of an LAE.

While L_2-regularization is a foundational technique in statistical learning, its effect on autoencoder models has not been fully characterized. Recent work of mehta on L_2-regularized deep linear networks applies techniques from algebraic geometry to highlight how algebraic symmetries result in “flat” critical manifolds and how L_2-regularization breaks these symmetries to produce isolated critical points. We instead apply (deeply connected) techniques from algebraic topology to completely resolve dynamics in the special case of LAEs.

1.2 Our contributions

The contributions of our paper are as follows.

  • In Section 2 we consider LAEs with (i) no regularization, (ii) regularization of the composition of the encoder and decoder, and (iii) regularization of the encoder and decoder separately. We build intuition by analyzing the scalar case, relate the losses to denoising and contractive LAEs, reflect on the relationship between regularization and orthogonality, and deduce that the encoder and decoder are transposes for all critical points of the sum loss.

  • In Section 3, we realize all three LAE models as generative processes, most notably relating the minimum of the sum loss to the MAP estimate of probabilistic PCA.

  • In Section 4, we embark on the technical work of characterizing all three loss landscapes. To build intuition, we first leave the overparameterized world of coordinate representations to think deeply about the squared distance from a plane to a point cloud. We expand on this topological viewpoint in Appendix B.

  • In Section 5, we illustrate these results empirically, with all code available at github.com/danielkunin/Regularized-Linear-Autoencoders.

  • In Section 6, we discuss this work and potential next steps.

The connections we draw between regularization and orthogonality, LAEs and probabilistic PCA (bishop99), and the topology of Grassmannians are novel and provide a deeper understanding of the loss landscapes of regularized linear autoencoders.

2 Regularized LAEs

In Appendix A, we provide a self-contained derivation of the fact that an LAE with bias parameters is equivalent to an LAE without bias parameters trained on mean-centered data (bourlard). So without loss of generality, we will assume the data is mean-centered and consider the following three LAE loss functions:

We call these the unregularized, product, and sum losses, respectively.
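For concreteness, a hedged NumPy sketch of the three losses follows, using our own names W1 for the encoder, W2 for the decoder, and lam for the regularization constant; the exact forms are assumed from the verbal descriptions above rather than copied from the paper's notation.

```python
import numpy as np

def recon(X, W1, W2):
    """Squared Frobenius reconstruction error ||X - W2 W1 X||_F^2."""
    R = X - W2 @ W1 @ X
    return np.sum(R * R)

def loss_unregularized(X, W1, W2):
    return recon(X, W1, W2)

def loss_product(X, W1, W2, lam):
    # Regularize the composition of the decoder and encoder.
    return recon(X, W1, W2) + lam * np.sum((W2 @ W1) ** 2)

def loss_sum(X, W1, W2, lam):
    # Regularize the encoder and decoder separately (standard weight decay).
    return recon(X, W1, W2) + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
```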

2.1 Visualizing LAE loss landscapes

We can visualize these loss functions directly in the scalar case, as shown in Figure 1. In fact, working out the critical points in this scalar case led us to conjecture the general result in Section 4. We invite the reader to enjoy deriving the following results for themselves.

For all three losses, the origin is the unique rank-0 critical point. For the unregularized and product losses, the origin is always a saddle point, while for the sum loss the origin is either a saddle point or the global minimum depending on the value of the regularization constant.

For the unregularized loss, the global minima are rank-1 and consist of a hyperbola.

For the product loss, the global minima are rank-1 and consist of this hyperbola shrunk toward the origin as in ridge regression.

For the sum loss, the critical points depend on the scale of the regularization constant relative to the squared singular value of the data. Below this threshold, the origin is a saddle point and the global minima are the two isolated rank-1 critical points. As the regularization constant increases toward the threshold, these minima move toward the origin, which remains a saddle point. Once it exceeds the threshold, the origin becomes the unique global minimum. This loss of information was our first hint at the connection to probabilistic PCA formalized in Theorem 3.1.
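The scalar claims above are straightforward to check numerically. The sketch below (assuming the scalar sum loss is the squared reconstruction error plus lam times the sum of the two squared weights, and assuming the minima satisfy |w1| = |w2| = sqrt(1 - lam/sigma^2) below the threshold and sit at the origin above it) runs gradient descent from a small random start and compares the result with these predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100)               # scalar data set
s2 = np.sum(x * x)                         # squared singular value of the data

def descend(lam, steps=50000, lr=1e-3):
    w1, w2 = 0.1 * rng.normal(), 0.1 * rng.normal()
    for _ in range(steps):
        r = np.sum((x - w2 * w1 * x) * x)  # inner product of the residual with the data
        g1 = -2 * w2 * r + 2 * lam * w1
        g2 = -2 * w1 * r + 2 * lam * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1, w2

for lam in (0.1 * s2, 2.0 * s2):           # below and above the squared singular value
    w1, w2 = descend(lam)
    predicted = np.sqrt(max(0.0, 1.0 - lam / s2))
    print(round(abs(w1), 3), round(abs(w2), 3), "predicted:", round(predicted, 3))
# Below the threshold the minima are the two isolated points with |w1| = |w2|;
# above it, both weights collapse to the origin.
```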

2.2 Denoising and contractive autoencoders

Two of the most well-studied regularized autoencoder models are the denoising autoencoder (DAE) and the contractive autoencoder (CAE) (Bengio). In the linear case, their loss functions mirror the product and sum losses, respectively.

A linear DAE receives a corrupted data matrix and is trained to reconstruct the original data by minimizing the expected reconstruction error. As shown in pretorius, if the corruption adds a noise matrix with elements sampled iid from a distribution with mean zero and fixed variance, then the expected DAE loss equals the reconstruction error on the clean data plus a penalty on the composition of the decoder and encoder; with the regularization constant set to match the noise level, this is exactly the product loss.
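As a sanity check of this correspondence, the following Monte Carlo sketch (our own, with arbitrary untrained weights) compares the empirical expectation of the denoising loss with the product-style expression, assuming the noise is iid Gaussian with standard deviation sigma and the effective regularization constant is n * sigma^2 (n the number of data points); the two numbers agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, k, sigma = 6, 50, 2, 0.3

X = rng.standard_normal((m, n))
W1 = 0.5 * rng.standard_normal((k, m))    # arbitrary (untrained) encoder
W2 = 0.5 * rng.standard_normal((m, k))    # arbitrary (untrained) decoder
P = W2 @ W1

# Monte Carlo estimate of the expected denoising loss E ||X - W2 W1 (X + noise)||_F^2.
trials, acc = 50000, 0.0
for _ in range(trials):
    noise = sigma * rng.standard_normal((m, n))
    R = X - P @ (X + noise)
    acc += np.sum(R * R)
dae = acc / trials

# Product-loss form: clean reconstruction error plus (n * sigma^2) * ||W2 W1||_F^2.
product_form = np.sum((X - P @ X) ** 2) + n * sigma**2 * np.sum(P * P)

print(round(dae, 1), round(product_form, 1))   # agree up to Monte Carlo error
```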

The loss function of a linear CAE includes a penalty on the derivative of the encoder:

As shown in rifai, if the encoder and decoder are tied as transposes of one another, then the CAE loss equals the sum loss up to rescaling the regularization constant.

In Theorem 2.1 we prove that the critical points of the sum loss are identical to those of this tied loss even when the encoder and decoder are not tied.

2.3 Regularization and orthogonality

While the sum loss is not invariant under the full group of invertible matrices, it is still invariant under the length-preserving action of the orthogonal group. In fact, L_2-regularization reduces the symmetry group from the general linear group to the orthogonal group, making the right singular vectors of the encoder and the left singular vectors of the decoder well-defined.

Two related facts about the relationship between regularization and orthogonality have guided our intuition throughout this work:

  1. Orthogonal matrices are the determinant ±1 matrices of minimal Frobenius norm (i.e., their columns have minimal squared length among those spanning a unit-volume parallelepiped, namely when they form a hypercube). This fact follows from the AM-GM inequality after casting the problem in terms of singular values: the sum of squared singular values, with their product fixed, achieves its minimum iff the singular values are all equal. (A numerical check of this fact appears after the list.)

  2. Among pairs of weights minimizing the reconstruction term, those which also minimize the sum of the squared Frobenius norms of the encoder and decoder are transpose pairs involving an orthogonal factor. This fact follows from noting that the minimization again reduces to the singular values of the factors, and is similarly minimized iff the singular values are all equal.
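A quick numerical check of fact 1 (our own sketch, not the paper's code): rescale random matrices to determinant ±1 and compare their Frobenius norms with that of an orthogonal matrix, which equals the square root of the dimension.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4

def unit_det(A):
    """Rescale A so that |det A| = 1."""
    return A / abs(np.linalg.det(A)) ** (1.0 / n)

Q, _ = np.linalg.qr(rng.standard_normal((n, n)))        # an orthogonal matrix
print(np.linalg.norm(Q, "fro"), np.sqrt(n))             # both equal sqrt(n)

norms = [np.linalg.norm(unit_det(rng.standard_normal((n, n))), "fro")
         for _ in range(1000)]
print(min(norms) >= np.sqrt(n) - 1e-9)                  # True, by the AM-GM bound
```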

While it was not immediately clear to us that the solutions to (2) are also the minima of the sum loss, Theorem 4.2 shows this is indeed the case in the limit as the regularization constant goes to zero. The core property differentiating the sum loss is the following, proven in the Appendix; in its statement, pseudoinverses are Moore–Penrose pseudoinverses.

Theorem 2.1 (Transpose Theorem).

While critical points of satisfy , those of satisfy

3 Bayesian Models

In this section, we identify Bayesian counterparts of our three loss functions and derive a novel connection between (regularized) LAEs and (probabilistic) PCA. This connection enables the application of any LAE training method to Bayesian MAP estimation of the corresponding model.

Consider the rank-k (self-)regression model

where the encoder and decoder act through their product.

  • The unregularized loss corresponds to rank-k regression. The prior on the product is the uniform distribution restricted to rank-k matrices. (Note that rank-(k-1) matrices are a measure-zero subset of the rank-k matrices.)

  • The product loss corresponds to rank-k ridge regression. The prior on the product is a Gaussian restricted to rank-k matrices.

  • The sum loss corresponds to the model with the encoder and decoder independently drawn from Gaussian priors.

Theorem 4.2 shows that the minima of the sum loss, or equivalently the MAP of the Bayesian model, are such that the composition of the decoder and encoder is the orthogonal projection onto the top principal directions followed by compression along each direction, with directions whose eigenvalues are dominated by the regularization constant collapsed to zero. Notably, for such principal directions all information is lost no matter the number of data points. The same phenomenon occurs for pPCA with respect to eigenvalues dominated by the variance of the noise. Let us consider the two Bayesian models, ours and pPCA, side by side, with the noise variance as the parameter of pPCA (see Chapter 12.2 of bishop for background on pPCA).

Comparing the critical points of the sum loss in Theorem 4.2,

(2)

and those of pPCA (bishop99),

(3)

where the latter factor has orthonormal columns, we see that the regularization constant corresponds to the noise variance (rather than the precision) in the sense that principal directions with eigenvalues dominated by either are collapsed to zero. The critical points differ only in the factor by which the remaining principal directions are shrunk. More precisely:

Theorem 3.1 (pPCA Theorem).

With the regularization constant matched to the noise variance, the critical points of the (reparameterized) sum loss coincide with the critical points of pPCA.

Proof.

Multiplying the expression for in (2) on the left by gives the expression for in (3). ∎

Interestingly, the generative model for the sum loss is not that of pPCA. In the scalar case, the sum loss is

whereas the negative log likelihood of pPCA is

4 Loss Landscapes

Having contextualized LAE models in a Bayesian framework, we now turn to understanding their loss landscapes. Symmetries such as (1) exist because the model is expressed in an “overparameterized” coordinate form rooted in classical linear algebra and necessary for computer processing. This results in “flat” critical manifolds rather than a finite number of critical points. In Section 4.1, we remove all symmetries by expressing the loss geometrically over a topological domain. This results in isolated critical points, and in particular a unique minimum. This intuition will pay off in Section 4.2, where we fully characterize all three LAE loss landscapes.

4.1 Points and planes

We now consider reconstruction loss over the domain of k-dimensional planes through the origin in the ambient data space. This space has the structure of a smooth, compact manifold called the Grassmannian of k-planes (hatcher). We’ll focus on a few simple examples.

(a) Lines in the plane.

(b) Lines in space.
Figure 2: Left: Principal directions of a point cloud. Middle: the reconstruction loss as a height function on the manifold of lines through the origin. Right: Negative gradient flow of the loss.
  • is the space of lines through the origin in the plane, which may be smoothly parameterized by a counterclockwise angle of rotation of the -axis modulo the half turn that maps a line to itself.

  • is the space of lines through the origin in 3-space, also known as the real projective plane. We can visualize as the northern hemisphere of the 2-sphere with equator glued by the antipodal map.

  • is identified with by mapping a plane to its 1-dimensional orthogonal complement.

A point cloud in determines a smooth function

whose value on a k-plane is the sum of squared distances from the points to the plane. Figure 2 depicts this loss as a height function for lines in the plane and for lines in space. Note that the min and max in (a) are the two principal directions, while the min, saddle, and max in (b) are the three principal directions. At right, we depict the negative gradient flow on each space, with (b) represented as a disk with glued boundary. In (a), the maximum may descend to the minimum by rotating clockwise or counterclockwise (formally, there are two geometric gradient trajectories that converge to the maximum and minimum in each time direction asymptotically, namely the left and right halves of the circle). In (b), the maximum may descend to either the saddle or the minimum by rotating in either of two directions in the plane they span, and the saddle may similarly descend to the minimum.
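The picture in Figure 2(a) is easy to reproduce (a sketch with synthetic data of our own choosing): parameterize lines through the origin by an angle, sum the squared distances from the points to each line, and check that the minimizer and maximizer of this height function are the two principal directions.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((2, 300)) * np.array([[2.0], [0.5]])   # point cloud in the plane
X -= X.mean(axis=1, keepdims=True)

thetas = np.linspace(0.0, np.pi, 1000, endpoint=False)         # lines mod a half turn
losses = []
for t in thetas:
    u = np.array([np.cos(t), np.sin(t)])                       # unit direction of the line
    proj = np.outer(u, u @ X)                                   # project each point onto the line
    losses.append(np.sum((X - proj) ** 2))                      # sum of squared distances

U = np.linalg.svd(X, full_matrices=False)[0]
t_min, t_max = thetas[int(np.argmin(losses))], thetas[int(np.argmax(losses))]
print(np.cos(t_min), np.sin(t_min), U[:, 0])   # minimizer ~ first principal direction (up to sign)
print(np.cos(t_max), np.sin(t_max), U[:, 1])   # maximizer ~ second principal direction (up to sign)
```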

The following theorem requires our assumption that the singular values of are distinct. As a simple example, could consist of one point on each coordinate axis:

Theorem 4.1 (Grassmannian Theorem).

The loss is a smooth function on the Grassmannian with critical points given by all rank-k principal subspaces. Near the critical point spanned by a given subset of principal directions, the loss takes the form of a non-degenerate saddle with

(4)

descending directions.

The latter formula counts the total number of pairs consisting of a chosen principal direction and an unchosen one with larger eigenvalue. These correspond to directions to flow along by rotating one principal direction to another of higher eigenvalue, fixing the rest.
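This count is easy to tabulate for a small example. The helper below is our own illustration: directions are indexed in descending order of eigenvalue, and we count pairs of a chosen direction and an unchosen direction with larger eigenvalue; the count vanishes only at the top subspace, consistent with it being the unique minimum.

```python
from itertools import combinations

def descending_directions(chosen, num_directions):
    """Count pairs (i, j): i chosen, j unchosen, and j has larger eigenvalue (smaller index)."""
    unchosen = set(range(1, num_directions + 1)) - set(chosen)
    return sum(1 for i in chosen for j in unchosen if j < i)

# All critical points of the squared-distance function on 2-planes in 4-space:
for I in combinations(range(1, 5), 2):
    print(I, descending_directions(I, 4))
# Only the top-2 subspace (1, 2) has no descending directions; the bottom-2 subspace
# (3, 4) has the maximum number, equal to the dimension of the Grassmannian.
```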

In Appendix B, we prove a stronger form of the Grassmannian Theorem by combining Theorem 4.2, the commutative diagram B relating and , and techniques from algebraic topology.

4.2 Coordinates

Translating the Grassmannian Theorem back to the coordinate representation introduces two additional phenomena we saw in the scalar case in Section 2.1.

  • Each critical point on the Grassmannian corresponds to a manifold of rank-k critical points in coordinates, namely an orbit of the relevant symmetry group.

  • Critical manifolds also appear with rank less than k. In particular, the origin is a critical point for all three losses.

Now let us combine our topological and scalar intuition to understand the loss landscapes of LAEs in all dimensions and for all three losses.

Theorem 4.2 requires our assumption that the data matrix has distinct singular values, with corresponding left singular vectors (or principal directions). For an index set we define:

  • and increasing indices ,

  • ,

  • consisting of columns of .

Fix the regularization constant. For the sum loss result, we also assume that it is distinct from every squared singular value.

Theorem 4.2 (Landscape Theorem).

As submanifolds of , the critical landscapes of and are smoothly parameterized by pairs , where has size at most and has full rank; and with smooth structure that of the disjoint union of open submanifolds of given by -frames in .

As a submanifold of , the critical landscape of , is smoothly parameterized by pairs where and is the largest index such that , has size at most , and has orthonormal columns; and with smooth structure that of the disjoint union of Stiefel manifolds given by orthonormal -frames in .

These diffeomorphisms map or to as follows:

Corollary 4.2.1.

Near any point on the critical manifold indexed by , all three losses take the form of a degenerate saddle with descending directions.

  • and have flat directions.

  • has flat directions.

Thus all three losses satisfy the strict saddle property (zhu). In the terminology of Appendix B, all three losses are Morse-Bott functions.

Proof.

In addition to the descending directions of the Grassmannian loss, there are more that correspond to scaling one of the remaining slots toward one of the available principal directions (in Figure 1c, these two gradient trajectories descend from the yellow saddle to the two red minima). The number of flat directions is given by the dimension of the symmetry group. For the unregularized and product losses, the ascending directions are the ascending directions of the Grassmannian loss; for the sum loss, additional ascending directions preserve the reconstruction term while increasing the regularization term by decreasing orthogonality. ∎

The proof of the Landscape Theorem 4.2 follows quickly from the Transpose Theorem 2.1 and Proposition 4.3 below.

Proposition 4.3.

Let be diagonal matrices such that is invertible and the diagonal of has distinct non-zero elements. Then the critical points of

are smoothly parameterized by pairs where has size at most and has full rank. The diffeomorphism is defined as follows:

Proof of Theorem 4.2.

Given the singular value decomposition , let and . By invariance of the Frobenius norm under the smooth action of the orthogonal group, we may instead parameterize the critical points of the following loss functions and then pull the result back to and :

expands to

By Proposition 4.3 with and , the critical points have the form

expands to

By Proposition 4.3 with and , the critical points have the form

By Lemma A.1 with and , expands to the sum of two functions:

So at a critical point, and by the Transpose Theorem 2.1. The latter also implies . So the critical points of coincide with the critical points of such that (in fact, the same critical points are obtained if we add the constraint a priori, as proven in Appendix A).

By Proposition 4.3 with and , these critical points have the form

In particular, real solutions do not exist for . ∎

5 Empirical Illustration

In this section, we illustrate the Landscape Theorem 4.2 by training an LAE with synthetic and real data and visualizing properties of the learned weight matrices. All three losses satisfy the strict saddle property (Corollary 4.2.1), which implies that gradient descent and its extensions will reach a global minimum, regardless of the initialization, if trained for a sufficient number of epochs with a small enough learning rate (zhu).

5.1 Synthetic data

In the following experiments, we set and fix a data set with singular values and random left and right singular vectors under Haar measure. We train the LAE for each loss using the Adam optimizer for epochs with random normal initialization and learning rate .

Figure 3 tracks the log squared distance between the encoder and the transposed decoder during training. Indeed, only the sum loss pulls the encoder and (transposed) decoder together as claimed in the Transpose Theorem 2.1.

Figure 3: Distance between the encoder and the transposed decoder during training.

Let be the product after training and fix the singular value decompositions

For each loss, the heat map of

in Figure 4 is consistent with approximating a global minimum defined by the Landscape Theorem. Namely, the lower right quadrant shows the expected structure for each loss, and the upper right and lower left quadrants show it up to column sign for the product and sum losses, but not for the unregularized loss. That is, for the product and sum losses, the left singular vectors of the data are obtained as the right singular vectors of the encoder and the left singular vectors of the decoder.

(a) Unregularized
(b) Product
(c) Sum
Figure 4: Heat map of the matrix . Black and white correspond to and , respectively.

The Landscape Theorem also gives explicit formulae for the eigenvalues of at convergence. Letting and be the largest eigenvalues of and , respectively, in Figure 5 we plot the points for many values of . We superimpose a curve for each value of defined by the theoretical relationship between and in the Landscape Theorem. The (literal) alignment of theory and practice is visually perfect.

(a) Unregularized

(b) Product

(c) Sum

Figure 5: Illustration of the relationship between the eigenvalues of the weight matrix and the data matrix for various values of the regularization constant. Points are empirical values and lines are theoretical curves.

The Landscape Theorem also gives explicit forms for the encoder and decoder such that the matrices

(5)

satisfy the claimed relation for all losses and are each orthogonal for the sum loss. In Figure 6, we illustrate these properties by applying these linear transformations to the unit circle. Non-orthogonal transformations deform the circle to an ellipse, whereas orthogonal transformations (including the identity) preserve the unit circle.

(a) Unregularized

(b) Product

(c) Sum
Figure 6: Image of the unit circle (green) under (blue), (orange), and (green) from (5). Non-orthogonal transformations deform the circle to an ellipse; orthogonal transformations preserve the circle.

5.2 Mnist

In the following experiment, the data set is the test set of the MNIST handwritten digit database (lecun). We train an LAE with and for each loss, again using the Adam optimizer for epochs with random normal initialization, batch size of , and learning rate .

Figure 7 further illustrates the Landscape Theorem 4.2 by reshaping the left singular vectors of the trained decoder and the top principal directions of the data into greyscale images. Indeed, only the decoder from the LAE trained on the sum loss has left singular vectors that match the principal directions up to sign.

(a) Unregularized

(b) Product

(c) Sum

(d) PCA
Figure 7: Left singular vectors of the decoder from an LAE trained on unregularized, product, and sum losses and the principal directions of MNIST reshaped into images.

As described in Section 3, for an LAE trained on the sum loss, the latent representation is, up to orthogonal transformation, the principal component embedding compressed along each principal direction. We illustrate this in Figure 8 by comparing the representation to that of PCA.

(a) Sum

(b) PCA

Figure 8: Latent representations of MNIST learned by an LAE with sum loss and by PCA. Colors represent class label.

6 Discussion

In 1989, baldi1989 characterized the loss landscape of an LAE. In 2018, zhou characterized the loss landscape of an autoencoder with ReLU activations on a single hidden layer. This paper fills out and ties together the rich space of research on linear networks over the last thirty years by characterizing the loss landscapes of LAEs under several forms of regularization through a unified framework.

We proved and illustrated empirically that for LAEs trained under the standard (sum) form of regularization:

  1. The encoder and decoder are transposes and well-defined up to orthogonal, as opposed to invertible, transformation in the latent space.

  2. The critical points under the (reparameterized) sum loss are those of probabilistic PCA, even without tying the encoder and decoder as transposes a priori.

  3. The principal directions exhibit a form of complete information loss more commonly associated with regularization.

  4. For those eigenvalues greater than the regularization constant, both the singular values and left singular vectors of the data can be recovered through SVD of the encoder or decoder.

While the dataset scales with the number of data points, the decoder does not. This suggests algorithmic experiments to optimize scalable LAE training followed by SVD of the decoder as an efficient inference framework for (probabilistic) PCA and low-rank SVD. While many other methods have been proposed for recovering principal components with neural networks, they all require specialized algorithms for iteratively updating weights, similar to classical numerical SVD approaches (Warmuth; Feng1; Feng2). By contrast, training an L_2-regularized LAE by SGD is more similar to modern randomized SVD methods, and has proven highly robust, scalable, and efficient with mini-batching in preliminary experiments.
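To indicate what the mini-batched variant looks like, here is a hedged sketch of one SGD epoch on the sum loss (our own code, not the authors' pipeline); the hyperparameters, the batch-fraction scaling of the weight-decay gradient, and the function name are assumptions for illustration.

```python
import numpy as np

def lae_sgd_epoch(X, W1, W2, lam, lr, batch_size, rng):
    """One epoch of mini-batch SGD on the sum loss; returns updated (W1, W2).

    The regularization gradient is scaled by the batch fraction so that a full epoch
    applies roughly the same total weight decay as one full-batch gradient step.
    """
    n = X.shape[1]
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):
        B = X[:, perm[start:start + batch_size]]   # mini-batch of data columns
        frac = B.shape[1] / n
        R = B - W2 @ W1 @ B
        gW1 = -2 * W2.T @ R @ B.T + 2 * lam * frac * W1
        gW2 = -2 * R @ B.T @ W1.T + 2 * lam * frac * W2
        W1, W2 = W1 - lr * gW1, W2 - lr * gW2
    return W1, W2

# After enough epochs, np.linalg.svd(W2)[0][:, :k] approximates the top-k principal
# directions of the (mean-centered) data, as in the full-batch experiments above.
```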

This paper establishes a rigorous topological foundation for further exploration of the effects of regularization on loss landscapes. For example, in Appendix B we explain how this lens guarantees the empirical observation of valley passages between minima in Garipov. More recent ideas in algebraic topology, in particular the extension of Morse homology to manifolds with boundary (bloom), suggest applications of this view to other representation learning frameworks, such as non-negative matrix factorization, that may yield insights for improved training and robustness. By thinking deeply about how regularization encourages orthogonality in LAEs, this work also hints at new principles and algorithms for training deep nonlinear networks.

Appendix A Deferred Proofs

Proof of Theorem 2.1.

Let be the SVD of , , and . By orthogonal invariance of the Frobenius norm, reduces to

The critical landscape is defined by setting the partial derivatives to zero:

Multiplying the latter on the right by gives

which implies

Let . is negative semi-definite by Lemma A.2 with and . Now note:

(6)
(7)
(8)
(9)

Since and are symmetric, (6) transposed gives:

(10)

Equality of (6) and (9) implies

while (7) or (8) each imply

Finally, (7) plus (9) minus (8) minus (10) gives

The left-hand side is NSD since is NSD. The right-hand side is PSD by Lemma A.3 with and . So both sides are zero, and in particular . Therefore

which implies by Lemma A.4 with and . We conclude . ∎

Proof of Proposition 4.3.

The critical landscape is defined by setting the partial derivatives to zero:

(11)
(12)

Let . Multiplying (11) on the right by and on the left by gives

which implies is symmetric and idempotent. Multiplying (12) on the right by gives

which can be rewritten as

Since the left-hand side is symmetric, is diagonal and idempotent by Lemma A.5 with and . Lemma A.6 with the same implies there exists an index set of size with such that

and hence

(13)

Consider the smooth map with

from the critical submanifold of to the manifold of pairs with full-rank. Note

by (13). Commuting diagonal matrices to rearrange terms in (11) and (12), we obtain a smooth inverse map from pairs to critical points: