Backpropagation for Implicit Spectral Densities

by Aditya Ramesh, et al.
New York University

Most successful machine intelligence systems rely on gradient-based learning, which is made possible by backpropagation. Some systems are designed to aid us in interpreting data when explicit goals cannot be provided. These unsupervised systems are commonly trained by backpropagating through a likelihood function. We introduce a tool that allows us to do this even when the likelihood is not explicitly specified, by instead using the implicit likelihood of the model. Explicitly defining the likelihood often entails making heavy-handed assumptions that impede our ability to solve challenging tasks. The implicit likelihood of the model, on the other hand, is accessible without the need for such assumptions. Our tool, which we call spectral backpropagation, allows us to optimize the implicit likelihood in much greater generality than what has been attempted before. GANs can also be viewed as a technique for optimizing implicit likelihoods. We study them using spectral backpropagation in order to demonstrate robustness for high-dimensional problems, and identify two novel properties of the generator $g$: (1) there exist aberrant, nonsensical outputs to which $g$ assigns very high likelihood, and (2) the eigenvectors of the metric induced by $g$ over latent space correspond to quasi-disentangled explanatory factors.






1 Introduction

Density estimation is an important component of unsupervised learning. In a typical scenario, we are given a finite sample $X = \{x_1, \ldots, x_n\}$, and must make decisions based on information about the hypothetical process that generated $X$. Suppose that $x_i \sim p^\ast$, where $p^\ast$ is a distribution over some sample space $\mathcal{X}$. We can model this generating process by learning a probabilistic model $g$ that transforms a simple distribution $p_z$ over a latent space $\mathcal{Z}$ into a distribution $p_g$ over $\mathcal{X}$. When $p_g$ is a good approximation to $p^\ast$, we can use it to generate samples and perform inference. Both operations are integral to the design of intelligent systems.

One common approach is to use gradient-based learning to maximize the likelihood of $X$ under the model distribution $p_g$. We typically place two important constraints on $g$ in order to make this possible: (1) existence of an explicit inverse $g^{-1}$, and (2) existence of a simple procedure by which we can evaluate $\log p_g$. Oftentimes, $g$ is instead constructed as a map from $\mathcal{X}$ to $\mathcal{Z}$ in order to make it convenient to evaluate $p_g(x)$ for a given observation $x$. This is the operation on which we place the greatest demand for throughput during training. Each choice corresponds to making one of the two operations – generating samples or evaluating $p_g$ – convenient, and the other inconvenient. Regardless of the choice, both operations are required, and so both constraints are made to hold in practice. For the sake of simplicity, we regard $g$ as a map from $\mathcal{Z}$ to $\mathcal{X}$ in this presentation.

Much of the work in deep probabilistic modeling is concerned with allowing $g$ to be flexible enough to capture intricate latent structure, while simultaneously ensuring that both conditions hold. We can dichotomize current approaches based on how the second constraint – existence of a simple procedure to evaluate $\log p_g$ – is satisfied. The first approach involves making $g$ autoregressive by appropriately masking the weights of each layer. This induces a lower-triangular structure in the Jacobian, since each component of the model’s output is made to depend only on the previous ones. We can then rapidly evaluate the log-determinant term involved in likelihood computation by accumulating the diagonal elements of the Jacobian.
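As a concrete illustration (our own sketch, not code from any of the cited works): for a triangular Jacobian, the log-determinant reduces to a sum over the diagonal, which is what makes autoregressive likelihoods cheap to evaluate.

```python
import numpy as np

# For an autoregressive map, output i depends only on inputs 1..i, so the
# Jacobian is lower-triangular and log|det J| is simply the sum of the
# log-magnitudes of its diagonal entries -- no O(n^3) determinant needed.
rng = np.random.default_rng(0)
n = 6
J = np.tril(rng.normal(size=(n, n)))
J[np.diag_indices(n)] = np.abs(np.diag(J)) + 0.5   # keep the map invertible

fast = np.sum(np.log(np.abs(np.diag(J))))          # O(n) accumulation
sign, slow = np.linalg.slogdet(J)                  # dense reference
```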

Research into autoregressive modeling dates back several decades, and we only note some recent developments. Germain et al. (2015) describe an autoregressive autoencoder for density estimation. Kingma et al. (2016) synthesize autoregressive modeling with normalizing flows (Rezende and Mohamed, 2015) for variational inference, and Papamakarios et al. (2017) make further improvements. van den Oord et al. (2016c) apply this idea to image generation, with follow-up work (Dinh et al., 2016; van den Oord et al., 2016b) that exploits parallelism using masked convolutions. van den Oord et al. (2016a) do the same for audio generation, and van den Oord et al. (2017) introduce strategies to improve efficiency. We refer the reader to Jang (2018) for a more detailed overview of these works.

The second approach involves choosing the layers of $g$ to be transformations, not necessarily autoregressive, for which explicit expressions for the Jacobian are still available. We can then evaluate the log-determinant term for $g$ by accumulating the layerwise contributions in accordance with the chain rule, using a procedure analogous to backpropagation. Rezende and Mohamed (2015) introduced this idea to variational inference, and recent work, including Berg et al. (2018) and Tomczak and Welling (2016), describes new types of such transformations.

Both approaches must invariably compromise on model flexibility. An efficient method for differentiating implicit densities that do not fulfill these constraints would enrich the current toolset for probabilistic modeling. Wu et al. (2016) advocate using annealed importance sampling (Neal, 2001) for evaluating implicit densities, but it is not clear how this approach could be used to obtain gradients. Very recent work (Li and Turner, 2017) uses Stein’s identity to cast gradient computation for implicit densities as a sparse recovery problem. Our approach, which we call spectral backpropagation, harnesses the capabilities of modern automatic differentiation (Abadi et al., 2016; Paszke et al., 2017) by directly backpropagating through an approximation for the spectral density of $J^\top J$, where $J$ is the Jacobian of the model.

We make the first steps toward demonstrating the viability of this approach by minimizing $D_{\mathrm{KL}}(p_g \,\|\, p^\ast)$ and $D_{\mathrm{KL}}(p^\ast \,\|\, p_g)$, where $p_g$ is the implicit density of a non-invertible Wide ResNet (Zagoruyko and Komodakis, 2016), on a set of test problems. Having done so, we then turn our attention to characterizing the behavior of the generator $g$ in GANs (Goodfellow et al., 2014), using a series of computational studies made possible by spectral backpropagation. Our purpose in conducting these studies is twofold. Firstly, we show that our approach is suitable for application to high-dimensional problems. Secondly, we identify two novel properties of generators:

  • The existence of adversarial perturbations for classification models (Szegedy et al., 2013) is paralleled by the existence of aberrant, nonsensical outputs to which $p_g$ assigns very high likelihood.

  • The eigenvectors of the metric $J^\top J$ induced by $g$ over latent space correspond to meaningful, quasi-disentangled explanatory factors. Perturbing latent variables along these eigenvectors allows us to quantify the extent to which $g$ makes use of latent space.

We hope that these observations will contribute to an improved understanding of how well generators are able to capture the latent structure of the underlying data-generating process.

2 Background

2.1 Generalizing the Change of Variable Theorem

We begin by revisiting the geometric intuition behind the usual change of variable theorem. First, we consider a rectangle in $\mathbb{R}^2$ with vertices $\{x_1, x_2, x_3, x_4\}$ given in clockwise order, starting from the bottom-left vertex. To determine its area, we compute its side lengths $a = \|x_2 - x_1\|$ and $b = \|x_4 - x_1\|$, and write $A = ab$. Now suppose we are given a parallelepiped in $\mathbb{R}^3$ whose sides are described by the vectors $u$, $v$, and $w$. Its volume is given by the triple product $V = |\langle u, v \times w \rangle|$, where $\times$ and $\langle \cdot, \cdot \rangle$ denote cross product and inner product, respectively. This triple product can be rewritten as

$$V = \left| \det \begin{bmatrix} u & v & w \end{bmatrix} \right|,$$

which we can generalize to compute the volume of a parallelepiped in $\mathbb{R}^n$:

$$V = \left| \det \begin{bmatrix} u_1 & \cdots & u_n \end{bmatrix} \right|.$$

If we regard the vertices as observations in $\mathcal{X}$, the change of variable theorem can be understood as the differential analog of this formula. To wit, we suppose that $g \colon \mathcal{Z} \to \mathcal{X}$ is a diffeomorphism, and denote by $J$ the Jacobian of its output with respect to its input. Now, $|\det J|$ becomes the infinitesimal volume element determined by $g$. For an observation $x$, the change of variable theorem says that we can compute

$$p_g(x) = p_z(z) \, |\det J(z)|^{-1},$$

where we set $z = g^{-1}(x)$.
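To make the formula concrete, here is a small numerical check (our own example, with an affine $g$ standing in for a general diffeomorphism): the density computed via the change of variable theorem matches the known density of the push-forward distribution.

```python
import numpy as np

def gauss_logpdf(x, mean, cov):
    """Log-density of a multivariate normal, written out explicitly."""
    d = x - mean
    n = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet + n * np.log(2 * np.pi))

# Affine "generator" g(z) = A z + b; its Jacobian is the constant matrix A.
A = np.array([[2.0, 0.5],
              [0.0, 1.5]])
b = np.array([1.0, -1.0])

x = np.array([0.3, 0.7])
z = np.linalg.solve(A, x - b)                     # z = g^{-1}(x)
log_pz = gauss_logpdf(z, np.zeros(2), np.eye(2))  # spherical normal prior
log_pg = log_pz - np.linalg.slogdet(A)[1]         # change of variable theorem

# Reference: the push-forward of N(0, I) under g is exactly N(b, A A^T).
ref = gauss_logpdf(x, b, A @ A.T)
```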

An $m$-dimensional parallelepiped in $\mathbb{R}^n$ requires $m$ vectors to specify its sides. When $m < n$, its volume is given by the more general formula (Hanson, 1994)

$$V = \sqrt{\det(M^\top M)},$$

where the columns of $M$ are the side vectors. The corresponding analog of the change of variable theorem is known in the context of geometric measure theory as the smooth coarea formula (Krantz and Parks, 2008). When $g$ is a diffeomorphism between manifolds, it says that

$$p_g(x) = p_z(z) \det\!\big(J(z)^\top J(z)\big)^{-1/2}, \tag{1}$$

where we set $z = g^{-1}(x)$ as before, and define $J^\top J$ to be the metric induced by $g$ over the latent manifold $\mathcal{Z}$. In many cases of interest, such as in GANs, the function $g$ is not necessarily injective. Application of the coarea formula would then require us to evaluate an inner integral over the preimage $g^{-1}(x)$, rather than over a singleton. We ignore this technicality and apply Equation 1 anyway.

The change of variable theorem gives us access to the implicit density $p_g$ in the form of the spectral density of $J^\top J$. Indeed, the Lie identity $\log \det A = \operatorname{tr} \log A$ allows us to express the log-likelihood corresponding to Equation 1 as

$$\log p_g(x) = \log p_z(z) - \tfrac{1}{2} \operatorname{tr} \log\big(J^\top J\big). \tag{2}$$
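The identity is easy to verify numerically for a positive definite matrix (a sketch of ours, evaluating the matrix logarithm through an eigendecomposition):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4 * np.eye(4)          # positive definite

# log det A via the usual route...
_, logdet = np.linalg.slogdet(A)

# ...and via tr log A: for symmetric positive definite A, tr log A is
# the sum of the logarithms of its eigenvalues.
tr_log = np.sum(np.log(np.linalg.eigvalsh(A)))
```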

We focus on the factor involving $\operatorname{tr} \log(J^\top J)$ on the RHS, which can be written as

$$\operatorname{tr} \log\big(J^\top J\big) = \sum_{\lambda \in \operatorname{spec}(J^\top J)} \log \lambda = n \, \mathbb{E}_{\lambda \sim \mu}[\log \lambda],$$

where $\operatorname{spec}$ denotes the spectrum, and $\mu$ the delta distribution over the eigenvalues in the spectrum. We let $\theta$ denote the parameters of $g$, and assume that $p_z$ is independent of $\theta$. Now, differentiating Equation 2 with respect to $\theta$ gives

$$\frac{\partial}{\partial \theta} \log p_g(x) = -\frac{n}{2} \, \frac{\partial}{\partial \theta} \, \mathbb{E}_{\lambda \sim \mu}[\log \lambda]. \tag{3}$$
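Differentiating through the log-determinant rests on the matrix calculus identity $\partial_\theta \operatorname{tr} \log G = \operatorname{tr}(G^{-1} \partial_\theta G)$, which we can sanity-check with finite differences (an illustrative sketch of ours using a one-parameter family of SPD matrices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.normal(size=(n, n))
A = M @ M.T + n * np.eye(n)          # SPD base point
B = rng.normal(size=(n, n))
B = B + B.T                          # symmetric perturbation direction

# G(theta) = A + theta * B; at theta = 0 the derivative of tr log G
# (equivalently log det G) is tr(G^{-1} dG/dtheta) = tr(A^{-1} B).
analytic = np.trace(np.linalg.solve(A, B))

logdet = lambda t: np.linalg.slogdet(A + t * B)[1]
eps = 1e-6
numeric = (logdet(eps) - logdet(-eps)) / (2 * eps)
```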


Equation 3 allows us to formulate gradient computation for implicit densities as a variant of stochastic backpropagation (Kingma and Welling, 2013; Rezende et al., 2014), in which the base distribution for the expectation is the spectral density of $J^\top J$ rather than a normal distribution.

2.2 An Estimator for Spectral Backpropagation

Algorithm 1: The procedure StochasticLogDet estimates $\operatorname{tr} \log(J^\top J)$ using the Chebyshev approximation, with StochasticChebyshevTrace as a subroutine; the rescaling constants are defined using Equation 6. Its inputs are: the implicit matrix $A$; the desired order $m$; the number of probe vectors $N$ for the trace estimator; the number of power iterations $p$; a multiplier $\alpha$ for the estimate of $\lambda_{\max}$ returned by the power method; and a stipulated lower bound $\delta$ on $\lambda_{\min}$.

To obtain an estimator for Equation 3, we turn to the thriving literature on stochastic approximation of spectral sums. These methods estimate quantities of the form $\operatorname{tr} f(A)$, where $A$ is a large or implicitly-defined matrix, by accessing $A$ using only matrix-vector products. In our case, $A = J^\top J$ and $f = \log$, and the products involving $A$ can be evaluated rapidly using automatic differentiation. We make no attempt to conduct a comprehensive survey, but note that among the most promising recent approaches are those described by Han et al. (2017), Boutsidis et al. (2017), Ubaru et al. (2017), and Fitzsimons et al. (2017).

We briefly describe the approaches of Han et al. (2017) and Boutsidis et al. (2017), which work on the basis of polynomial interpolation. Given a function $f \colon [-1, 1] \to \mathbb{R}$, these methods construct an order-$m$ approximating polynomial $p_m$ to $f$, given by

$$p_m(x) = \sum_{j=0}^{m} c_j \, b_j(x),$$

where the $b_j$ are basis polynomials and the $c_j$ are the corresponding coefficients. The main difference between the two approaches is the choice of approximating polynomial. Boutsidis et al. (2017) use Taylor polynomials, for which $b_j(x) = x^j$ and

$$c_j = \frac{f^{(j)}(0)}{j!},$$

where we use the superscript $(j)$ to denote iterated differentiation. On the other hand, Han et al. (2017) use Chebyshev polynomials. These are defined by the recurrence relation

$$T_0(x) = 1, \qquad T_1(x) = x, \qquad T_{j+1}(x) = 2x \, T_j(x) - T_{j-1}(x). \tag{4}$$

The coefficients $c_j$ for the Chebyshev polynomials are computed by interpolating $f$ at the Chebyshev nodes $x_k$, which are defined by

$$x_k = \cos\!\left(\frac{\pi (k + 1/2)}{m + 1}\right), \qquad c_j = \frac{2 - [j = 0]}{m + 1} \sum_{k=0}^{m} f(x_k) \, T_j(x_k),$$

where $[j = 0]$ is one when $j = 0$ and zero otherwise. Now suppose that we are given a matrix $C$ such that $\operatorname{spec}(C) \subset [-1, 1]$. After having made a choice for the construction of $p_m$, we can use the approximation

$$\operatorname{tr} f(C) \approx \sum_{j=0}^{m} c_j \operatorname{tr} b_j(C). \tag{5}$$

This reduces the problem of estimating the spectral sum $\operatorname{tr} f(C)$ to computing the traces $\operatorname{tr} b_j(C)$ for $0 \le j \le m$.

Two issues remain in applying this approximation. The first is that both $f$ and $p_m$ are restricted to $[-1, 1]$. In our case, $f = \log$, and $A = J^\top J$ can be an arbitrary positive definite matrix. To address this issue, we define $\tilde{f} = f \circ \phi$, where

$$\phi(x) = \frac{(b - a) x + b + a}{2} \tag{6}$$

is the affine map taking $[-1, 1]$ onto an interval $[a, b]$ containing the spectrum of $A$ after rescaling. Now we set $C = \phi^{-1}(A)$, so that $\operatorname{spec}(C) \subset [-1, 1]$. We stress that while $p_m$ is applied to $C$, the coefficients $c_j$ are computed using $\tilde{f}$. With these definitions in hand, we can write

$$\operatorname{tr} f(A) \approx \sum_{j=0}^{m} c_j \operatorname{tr} b_j(C). \tag{7}$$

Han et al. (2017) require spectral bounds $\lambda_{\min}$ and $\lambda_{\max}$, so that $\operatorname{spec}(A) \subset [\lambda_{\min}, \lambda_{\max}]$, and set

$$a = \frac{\lambda_{\min}}{\lambda_{\max}}, \qquad b = 1,$$

working with the rescaled matrix $B = A / \lambda_{\max}$. After using Equation 7 with a Chebyshev approximation for $\tilde{f}$ to obtain $\Gamma \approx \operatorname{tr} \log B$, we compute

$$\operatorname{tr} \log A \approx \Gamma + n \log \lambda_{\max}.$$

Boutsidis et al. (2017) instead define $C = I - A / \bar{\lambda}$ for some $\bar{\lambda} \ge \lambda_{\max}$, and write

$$\operatorname{tr} \log A = n \log \bar{\lambda} + \operatorname{tr} \log(I - C).$$

This time, we set $f(x) = \log(1 - x)$ and use Equation 7 with a Taylor approximation for $f$. Then, we compute

$$\operatorname{tr} \log A \approx n \log \bar{\lambda} - \sum_{k=1}^{m} \frac{\operatorname{tr} C^k}{k}.$$

We can easily obtain an accurate upper bound $\lambda_{\max}$ using the power method. The lower bound $\lambda_{\min}$ is fixed to a small, predetermined constant in our work.
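The upper bound can be obtained with a few power iterations; a minimal sketch (the inflation factor `alpha` is our own safeguard, mirroring the multiplier used by Algorithm 1):

```python
import numpy as np

def spectral_upper_bound(matvec, n, iters=50, alpha=1.05, seed=0):
    """Estimate lambda_max of an SPD operator with power iteration, then
    inflate the Rayleigh quotient by alpha so the returned value is
    (with high probability) a genuine upper bound on the spectrum."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=n)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = matvec(v)
        lam = v @ w                  # Rayleigh quotient; converges to lambda_max
        v = w / np.linalg.norm(w)
    return alpha * lam

# Example: a diagonal operator whose largest eigenvalue is 5.
d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
bound = spectral_upper_bound(lambda v: d * v, n=5)
```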

The second issue is that deterministically evaluating the terms $\operatorname{tr} b_j(C)$ in Equation 5 requires us to compute matrix powers of $C$. Thankfully, we can drastically reduce the computational cost and approximate these terms using only matrix-vector products. This is made possible by the stochastic trace estimator introduced by Hutchinson (1990):

$$\operatorname{tr} B \approx \frac{1}{N} \sum_{i=1}^{N} v_i^\top B v_i.$$

When the distribution for the probe vectors $v_i$ has zero mean and identity covariance, the estimate is unbiased. We use the Rademacher distribution, which samples the components of $v_i$ uniformly from $\{-1, +1\}$. We refer the reader to Avron and Toledo (2011) for a detailed study of the variance of this estimator.
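A minimal implementation of the estimator (our sketch; note that for a diagonal matrix the Rademacher estimate is exact, since $v_i^2 = 1$ componentwise):

```python
import numpy as np

def hutchinson_trace(matvec, n, num_probes=64, seed=0):
    """Estimate tr(B) using only matrix-vector products with B."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        total += v @ matvec(v)                # unbiased sample of tr(B)
    return total / num_probes

# For a diagonal B, v^T B v = sum_i B_ii v_i^2 = tr(B) for every probe,
# so the estimator recovers the trace exactly.
d = np.array([1.0, -2.0, 3.5, 0.5])
est = hutchinson_trace(lambda v: d * v, n=4)
```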

We first describe how the trace estimator is applied when a Taylor approximation is used to construct $p_m$. In this case, we have

$$\operatorname{tr} p_m(C) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{j=0}^{m} c_j \, v_i^\top C^j v_i.$$

The inner summands $v_i^\top C^j v_i$ are evaluated using the recursion $w_0 = v_i$ and $w_j = C w_{j-1}$ for $j \ge 1$. It follows that the number of matrix-vector products involved in the approximation increases linearly with respect to the order $m$ of the approximating polynomial $p_m$. The same idea allows us to accumulate the traces for the Chebyshev approximation, based on the recurrence in Equation 4. The resulting procedure is given in Algorithm 1; it is our computational workhorse for evaluating the log-likelihood in Equation 2 and estimating the gradient in Equation 3.
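To make the whole pipeline concrete, here is a self-contained sketch of the Chebyshev variant (our own reconstruction in the spirit of Han et al. (2017), not the authors' code): rescale the matrix so its spectrum lies in $[\delta, 1]$, expand $\log$ in Chebyshev polynomials on that interval, and estimate the traces with Hutchinson probes via the three-term recurrence.

```python
import numpy as np

def chebyshev_logdet(matvec, n, lam_min, lam_max, order=30, num_probes=200, seed=0):
    """Stochastic Chebyshev estimate of log det(A) for SPD A, given only
    matrix-vector products v -> A v and spectral bounds lam_min, lam_max."""
    rng = np.random.default_rng(seed)
    delta = lam_min / lam_max            # spectrum of A/lam_max lies in [delta, 1]

    # Chebyshev coefficients of f(x) = log(phi(x)), where phi affinely
    # maps [-1, 1] onto [delta, 1].
    k = np.arange(order + 1)
    xk = np.cos(np.pi * (k + 0.5) / (order + 1))      # Chebyshev nodes
    fk = np.log(((1 - delta) * xk + 1 + delta) / 2)
    j = np.arange(order + 1)[:, None]
    Tjk = np.cos(j * np.arccos(xk)[None, :])          # T_j evaluated at the nodes
    c = (2.0 / (order + 1)) * (Tjk @ fk)
    c[0] /= 2.0

    # Mapped operator C = phi^{-1}(A / lam_max), applied implicitly.
    def C_matvec(v):
        return (2.0 * matvec(v) / lam_max - (1 + delta) * v) / (1 - delta)

    # Hutchinson estimate of sum_j c_j tr(T_j(C)) via the recurrence
    # w_{j+1} = 2 C w_j - w_{j-1}.
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=n)           # Rademacher probe
        w_prev, w = v, C_matvec(v)
        acc = c[0] * (v @ v) + c[1] * (v @ w)
        for cj in c[2:]:
            w_prev, w = w, 2.0 * C_matvec(w) - w_prev
            acc += cj * (v @ w)
        total += acc
    tr_log_rescaled = total / num_probes              # ~ tr log(A / lam_max)
    return tr_log_rescaled + n * np.log(lam_max)

# Usage on a random SPD matrix with a known answer.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))
eigs = rng.uniform(1.0, 10.0, size=20)
A = (Q * eigs) @ Q.T
est = chebyshev_logdet(lambda v: A @ v, 20, eigs.min(), eigs.max())
exact = np.linalg.slogdet(A)[1]
```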

3 Learning Implicit Densities

Figure 1: Results for minimizing $D_{\mathrm{KL}}(p_g \,\|\, p^\ast)$ for the four test energies described in Rezende and Mohamed (2015). Subfigures (b) and (c) show the model samples superimposed over contour plots of the corresponding ground-truth test energies. Each epoch corresponds to a fixed number of iterations. We see in (a) that the relative error for the approximation to the log determinant typically stays below 30%, except toward the end of training for the last two test energies. At this point, samples from these two models begin to drift away from the origin, as shown in (c).

Suppose that we are tasked with matching a given data distribution $p^\ast$ with the implicit density $p_g$ of the model $g$. Two approaches for learning $g$ are available, and the choice of which to use depends on the type of access we have to the data distribution $p^\ast$. The first approach – minimizing $D_{\mathrm{KL}}(p_g \,\|\, p^\ast)$ – is applicable when we know how to evaluate the likelihood of $p^\ast$, but are not necessarily able to sample from it. The second approach – minimizing $D_{\mathrm{KL}}(p^\ast \,\|\, p_g)$ – is applicable when we are able to sample from $p^\ast$, but are not necessarily able to evaluate its likelihood. We show that spectral backpropagation can be used in both cases, even when neither of the two conditions described in Section 1 holds.

All of the examples considered here are densities over $\mathbb{R}^2$. We match them by transforming a prior $p_z$ given by a spherical normal distribution. Our choice for the architecture of $g$ is a Wide ResNet (Zagoruyko and Komodakis, 2016) comprised of four residual blocks. Each block is a three-layer bottleneck from $\mathbb{R}^2$ to $\mathbb{R}^2$ whose hidden layer size is 32. All layers are equipped with biases, and use LeakyReLU activations. We compute the gradient updates using a batch size of 64, and apply the updates using Adam (Kingma and Ba, 2014) with a fixed step size, and all other parameters kept at their default values.

To compute the gradient update given by Equation 3, we use Algorithm 1 with the same polynomial order for all experiments; the number of probe vectors differs between the two objectives. In order to monitor the accuracy of the approximation to the likelihood, we compute the exact Jacobian at each iteration, and evaluate the ground-truth likelihood in accordance with Equation 2. We define the relative error of the approximation $\hat{L}$ with respect to the ground-truth log-likelihood $L$ by

$$e = \min\!\left( \frac{|\hat{L} - L|}{|L|},\; e_{\max} \right)$$

for a fixed cap $e_{\max}$, so that the quotient is never reported as too large. This definition of relative error avoids numerical problems when $L \approx 0$.
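A capped relative error of this kind might be implemented as follows (the cap value here is our own choice for illustration):

```python
def relative_error(approx, exact, cap=10.0):
    """Relative error |approx - exact| / |exact|, capped so the quantity
    stays finite and plottable when the ground truth is near zero."""
    if exact == 0.0:
        return cap
    return min(abs(approx - exact) / abs(exact), cap)
```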

We begin by considering the first approach, in which we seek to minimize $D_{\mathrm{KL}}(p_g \,\|\, p^\ast)$. This objective requires that we be able to sample from $p_g$, so we choose $g$ to be a map from latent space to observation space. The results are shown in Figure 1. To prevent the objective from making $p_g$ collapse as it attempts to model the infinite support of $p^\ast$, we found it helpful to incorporate a regularizer penalizing the spectral norm $\|J\|_2$ of the Jacobian into the objectives for the third and fourth test energies. To implement this regularizer, we simply backpropagate through the estimate of $\|J\|_2$ that is already produced by Algorithm 1, and we use the same regularization coefficient in both cases. Despite the use of this regularizer, we find that continuing to train the models for these last two test energies causes the samples to drift away from the origin (see Figure 1(c)). We have not made any attempt to address this behavior. Finally, we note that since $D_{\mathrm{KL}}(p_g \,\|\, p^\ast)$ is bounded from below by the negative log-normalization constant of $p^\ast$, it can become negative when $p^\ast$ is unnormalized. We see that this happens for all four examples.

Figure 2: Results for minimizing $D_{\mathrm{KL}}(p^\ast \,\|\, p_g)$ for the crescent and circular mixture densities whose definitions are given in the supplementary material. The relative error for the approximation to the log determinant typically stays below 5%. In (b), we show samples from $p^\ast$ superimposed over contour plots of the log-likelihood of $p_g$.

In the second approach, we seek to minimize $D_{\mathrm{KL}}(p^\ast \,\|\, p_g)$. This objective requires that we be able to evaluate the likelihood of $p_g$, so we choose $g$ to be a map from observation space to latent space. The results are shown in Figure 2. We note that minimizing $D_{\mathrm{KL}}(p^\ast \,\|\, p_g)$ is ill-posed when $p^\ast$ is unnormalized. In this scenario, the model distribution $p_g$ can match $p^\ast$ while also assigning mass outside the support of $p^\ast$. We see that this expected behavior manifests in both examples.

4 Evaluating GAN Likelihoods

Figure 3: Samples from two models evaluated at fixed grids in latent space (left), and contour plots of the model log-likelihood evaluated over the same grids (right).

For our explorations involving GANs, we train a series of DCGAN (Radford et al., 2015) models on downscaled versions of the CelebA (Liu et al., 2015) and LSUN Bedroom datasets. We vary model capacity in terms of the base feature map count multiplier $d$ for the DCGAN architecture. The generator and discriminator have five layers each, and use translated LeakyReLU activations (Xiang and Li, 2017). To stabilize training, we use weight normalization with fixed scale factors in the discriminator (Salimans et al., 2016). Our prior is defined by $p_z = \mathcal{N}(0, I_k)$, where $k$ is the size of the embedding space. All models were trained with a batch size of 32, using RMSProp with a fixed step size and decay factor. We present results from two of these models in Figure 3.

Figure 4: Trajectories of four latent variables as they are perturbed to optimize likelihood under the generator. Trajectories (a) and (b) correspond to the CelebA model, and trajectories (c) and (d) to the LSUN Bedroom model. Statistics from these trajectories are tabulated above. The last column of the table specifies the parameters used for Algorithm 1 to compute the gradient estimates.

We apply spectral backpropagation to explore the effect of perturbing a given latent variable $z$ to maximize likelihood under the generator distribution $p_g$. This is readily accomplished by noting that the same procedure used to evaluate Equation 3 can also be used to obtain gradients with respect to $z$. The results are shown in Figure 4. Intuitively, we might expect the outputs to be transformed in such a way that they gravitate toward modes of the dataset. But this is not what happens. Instead, the outputs are transformed into highly aberrant, out-of-distribution examples that nonetheless attain very high likelihood. As optimization proceeds, $J^\top J$ also becomes increasingly ill-conditioned. This shows that likelihood for generators need not correspond to intuitive notions of visual plausibility.

5 Uncovering Latent Explanatory Factors

Figure 5: Result of perturbing latent variables along the eigenvectors of $J^\top J$. In (a) and (b), we show the top eigenvalues of $J^\top J$ evaluated at 12 trial latent variables. Small eigenvalues are not shown. The leftmost images of each grid contain duplicates of the original, each corresponding to one of the 12 latent variables. The top row shows the effect of applying perturbations along random directions, and the bottom row the result of applying perturbations with the same step size along the eigenvectors.

The generator in GANs is well-known for organizing latent space such that semantic features can be transferred by means of algebraic operations over latent variables (Radford et al., 2015). This suggests the existence of a systematic organization of latent space, but perhaps one that cannot be globally characterized in terms of a handful of simple explanatory factors. We instead explore whether local changes in latent space can be characterized in this way. Since the metric $J^\top J$ describes local change in the generator’s output, it is natural to consider the effect of perturbations along its eigenvectors. To this end, we fix 12 trial embeddings in latent space, and compare the effect of perturbations along random directions to perturbations along these eigenvectors. The random directions are obtained by sampling from a spherical normal distribution. We show the results in Figure 5.
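The computation can be sketched on a toy generator (entirely our own example; the real experiments obtain $J$ by automatic differentiation through the DCGAN generator):

```python
import numpy as np

# Toy "generator" g: R^2 -> R^3. Directions of largest local change in the
# output are the top eigenvectors of the induced metric G = J^T J.
def g(z):
    return np.array([np.tanh(3.0 * z[0]), z[0] * z[1], 0.1 * z[1]])

def jacobian(f, z, eps=1e-6):
    # Central finite differences, one column per latent coordinate.
    cols = []
    for i in range(len(z)):
        e = np.zeros_like(z)
        e[i] = eps
        cols.append((f(z + e) - f(z - e)) / (2 * eps))
    return np.stack(cols, axis=1)

z0 = np.array([0.2, -0.5])
J = jacobian(g, z0)
G = J.T @ J                              # metric over latent space at z0
eigvals, eigvecs = np.linalg.eigh(G)     # ascending eigenvalue order

# To first order, a step along eigenvector v changes the output norm by
# step * sqrt(lambda_v), so the top direction dominates the bottom one.
step = 1e-3
top = np.linalg.norm(g(z0 + step * eigvecs[:, -1]) - g(z0))
bottom = np.linalg.norm(g(z0 + step * eigvecs[:, 0]) - g(z0))
```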

We can see that perturbations along the dominant eigenvectors, especially the principal eigenvector, often result in the most drastic changes. Furthermore, these changes are not only semantically meaningful, but also tend to modify distinct attributes of the image. To see this more clearly, we consider the top two rows of Figure 5(g). Movement along the first two eigenvectors changes hair length and facial orientation; movement along the third eigenvector decreases the length of the bangs; movement along the fourth and fifth eigenvectors changes background color; and movement along the sixth and seventh eigenvectors changes hair color.

Dataset | $N_\tau$ for fixed $d$, varied $k$ | $N_\tau$ for fixed $k$, varied $d$
CelebA | 6.94921875, 10.4921875, 16.55078125 | 16.140625, 16.55078125, 17.28515625
LSUN Bedroom | 25.2265625, 34.9296875, 53.6015625 | 28.3671875, 34.9296875, 40.04296875
Figure 6: Effect of varying $k$ and $d$ on $N_\tau$. Plots (a)–(d) show $N_\tau$ as a function of $\tau$. Larger values of $N_\tau$ suggest increased utilization of latent space. Table (e) shows that doubling $k$ roughly doubles $N_\tau$. On the other hand, doubling $d$ does not result in noticeable change for CelebA, and only results in a modest increase in $N_\tau$ for LSUN Bedroom. The step sizes used for the perturbations are the same as those reported in Figure 5.

Inspecting the two columns (c), (e), (g) and (d), (f), (h) in Figure 5 suggests that larger values of $k$ may encourage the generator to capture more explanatory factors, possibly at the price of decreased sample quality. We would like to explore the effect of varying $k$ and $d$ on the number of such factors. To do this, we fix a sample of latent variables $\{z_1, \ldots, z_n\}$. For each $z_i$, we define

$$\delta_i(v) = \frac{\|g(z_i + \varepsilon v) - g(z_i)\|}{\mathbb{E}_r \|g(z_i + \varepsilon r) - g(z_i)\|}$$

for every eigenvector $v$ of $J^\top J$ evaluated at $z_i$, where $\varepsilon$ is the step size and $r$ is a random direction. The quantity $\delta_i(v)$ measures the pixelwise change resulting from a perturbation along an eigenvector, relative to the change we expect from a random perturbation. Finally, we define

$$N_\tau = \frac{1}{n} \sum_{i=1}^{n} \sum_{v} \mathbb{1}[\delta_i(v) \ge \tau],$$

where $\mathbb{1}$ is the indicator function. This quantity measures the average number of eigenvectors for which the relative change is greater than the threshold $\tau$. As such, it can be regarded as an effective measure of dimensionality for latent space. We explore the effect of varying $k$ and $d$ on $N_\tau$ in Figure 6.
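Given the per-eigenvector relative changes, $N_\tau$ reduces to a thresholded count (a sketch with made-up numbers; all names here are our own):

```python
import numpy as np

def effective_dimension(deltas, tau):
    """Average number of eigen-directions whose relative change exceeds tau.

    `deltas` has shape (n_latents, n_eigvecs); entry (i, j) holds the
    relative change delta_i(v_j) for latent variable i and eigenvector j."""
    return float(np.mean(np.sum(deltas >= tau, axis=1)))

# Two latent variables, three eigen-directions each.
deltas = np.array([[2.0, 0.8, 0.1],
                   [1.5, 1.2, 0.3]])
n_tau = effective_dimension(deltas, tau=1.0)   # per-latent counts are 1 and 2
```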

6 Conclusion

Current approaches for probabilistic modeling attempt to satisfy two goals that are fundamentally at odds with one another: fulfillment of the two constraints described in Section 1, and model flexibility. In this work, we develop a computational tool that aims to expand the scope of probabilistic modeling to functions that do not satisfy these constraints. We make the first steps toward demonstrating the feasibility of this approach by minimizing divergences in far greater generality than what has been attempted before. Finally, we uncover surprising facts about the organization of latent space for GANs that we hope will contribute to an improved understanding of how effectively they capture underlying latent structure.