Surfing: Iterative optimization over incrementally trained deep networks

We investigate a sequential optimization procedure to minimize the empirical risk functional f_{θ̂}(x) = ½‖G_{θ̂}(x) − y‖² for certain families of deep networks G_θ(x). The approach is to optimize a sequence of objective functions that use network parameters obtained during different stages of the training process. When initialized with random parameters θ_0, we show that the objective f_{θ_0}(x) is "nice" and easy to optimize with gradient descent. As learning is carried out, we obtain a sequence of generative networks x ↦ G_{θ_t}(x) and associated risk functions f_{θ_t}(x), where t indicates a stage of stochastic gradient descent during training. Since the parameters of the network do not change by very much in each step, the surface evolves slowly and can be incrementally optimized. The algorithm is formalized and analyzed for a family of expansive networks. We call the procedure surfing since it rides along the peak of the evolving (negative) empirical risk function, starting from a smooth surface at the beginning of learning and ending with a wavy nonconvex surface after learning is complete. Experiments show how surfing can be used to find the global optimum and for compressed sensing even when direct gradient descent on the final learned network fails.


1 Introduction

Intensive recent research has provided insight into the performance and mathematical properties of deep neural networks, improving understanding of their strong empirical performance on different types of data. Some of this work has investigated gradient descent algorithms that optimize the weights of deep networks during learning (Du et al., 2018b, a; Davis et al., 2018; Li and Yuan, 2017; Li and Liang, 2018). In this paper we focus on optimization over the inputs to an already trained deep network in order to best approximate a target data point. Specifically, we consider the least squares objective function

 f_{θ̂}(x) = ½‖G_{θ̂}(x) − y‖²

where G_{θ̂} denotes a multi-layer feed-forward network and θ̂ denotes the parameters of the network after training. The network is considered to be a mapping from a latent input x ∈ ℝ^k to an output G_{θ̂}(x) ∈ ℝ^n with k ≪ n. A closely related objective is to minimize f_{A,θ̂}(x) = ½‖AG_{θ̂}(x) − Ay‖², where A ∈ ℝ^{m×n} is a random matrix.

Hand and Voroninski (2017) study the behavior of the function f_{A,θ_0} in a compressed sensing framework where y = G_{θ_0}(x_0) is generated from a random network whose parameters θ_0 are drawn from Gaussian matrix ensembles; thus, the network is not trained. In this setting, it is shown that the surface is very well behaved. In particular, outside of small neighborhoods around x_0 and a scalar multiple of x_0, the function always has a descent direction.

When the parameters θ̂ of the network are trained, the landscape of the function f_{θ̂} can be complicated; it will in general be nonconvex with multiple local optima. Figure 1 illustrates the behavior of the surfaces as they evolve from random networks (left) to fully trained networks (right) for 4-layer networks trained on Fashion-MNIST using a variational autoencoder. For each of two target values y, three surfaces are shown for different levels of training.

This paper explores the following simple idea. We incrementally optimize a sequence of objective functions f_{θ_t}, where the parameters θ_t are obtained using stochastic gradient descent in θ during training. When initialized with random parameters θ_0, we show that the empirical risk function f_{θ_0} is “nice” and easy to optimize with gradient descent. As learning is carried out, we obtain a sequence of generative networks G_{θ_t} and associated risk functions f_{θ_t}(x), where t indicates an intermediate stage of stochastic gradient descent during training. Since the parameters of the network do not change by very much in each step (Du et al., 2018a, b), the surface evolves slowly. We initialize the optimization for the current network at the optimum x_{t−1} found for the previous network and then carry out gradient descent to obtain the updated point x_t.
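To make the procedure concrete, here is a minimal numpy sketch of surfing for a toy bias-free ReLU network. The drifting-weights "training trajectory", dimensions, and step sizes are illustrative stand-ins, not the setup used in our experiments.

```python
import numpy as np

def f_and_grad(x, weights, y):
    """f(x) = 0.5*||G(x) - y||^2 for G(x) = relu(W_d ... relu(W_1 x)), with gradient."""
    h, masks = x, []
    for W in weights:
        pre = W @ h
        masks.append(pre > 0)            # ReLU activation pattern at this layer
        h = np.maximum(pre, 0.0)
    r = h - y
    g = r
    for W, m in zip(reversed(weights), reversed(masks)):
        g = W.T @ (g * m)                # chain rule through diag(pre > 0) W
    return 0.5 * float(r @ r), g

def surfing(snapshots, y, x0, steps=300, lr=0.005):
    """Gradient descent on each f_{theta_t}, warm-started at the previous solution."""
    x = x0.copy()
    for weights in snapshots:            # theta_0, theta_1, ..., theta_T
        for _ in range(steps):
            _, g = f_and_grad(x, weights, y)
            x = x - lr * g
    return x

# Toy "training trajectory": weights drift slightly between snapshots.
rng = np.random.default_rng(0)
k, n1, n2 = 4, 32, 64
theta = [rng.normal(0, 1 / np.sqrt(n1), (n1, k)),
         rng.normal(0, 1 / np.sqrt(n2), (n2, n1))]
snapshots = []
for _ in range(5):
    theta = [W + 0.02 * rng.normal(0, 1 / np.sqrt(W.shape[0]), W.shape) for W in theta]
    snapshots.append([W.copy() for W in theta])
x_true = rng.normal(size=k)
W1, W2 = snapshots[-1]
y = np.maximum(W2 @ np.maximum(W1 @ x_true, 0.0), 0.0)   # y in range of the final net
x0 = rng.normal(size=k)
x_hat = surfing(snapshots, y, x0)
```

The inner loop is ordinary gradient descent; only the warm start across snapshots distinguishes surfing from optimizing the final network directly.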

We call this process surfing since it rides along the peaks of the evolving (negative) empirical risk function, starting from a smooth surface at the beginning of learning and ending with a wavy nonconvex surface after learning is complete. We formalize this algorithm in a manner that makes it amenable to analysis. First, when θ_0 is initialized so that the weights are random Gaussian matrices, we prove a theorem showing that the surface f_{θ_0} has a descent direction at each point outside of a small neighborhood of zero. The analysis of Hand and Voroninski (2017) does not directly apply in our case since the target y is an arbitrary test point, and not necessarily generated according to the random network. We then give an analysis that describes how projected gradient descent can be used to proceed from the optimum of one network to the next. Our approach is based on the fact that the ReLU network and squared error objective result in a piecewise quadratic surface. Experiments are run to show how surfing can be used to find the global optimum and for compressed sensing even when direct gradient descent fails, using several experimental setups with networks trained with both VAE and GAN techniques.

2 Background and Previous Results

In this work we treat the problem of approximating an observed vector y ∈ ℝ^n in terms of the output of a trained generative model. Traditional generative processes such as graphical models are statistical models that define a distribution over a sample space. When deep networks are viewed as generative models, the distribution is typically singular, being a deterministic mapping of a low-dimensional latent random vector to a high-dimensional output space. Certain forms of “reversible deep networks” allow for the computation of densities and inversion (Dinh et al., 2017; Kingma and Dhariwal, 2018; Chen et al., 2018).

The variational autoencoder (VAE) approach to training a generative (decoder) network is to model the conditional probability of y given x as Gaussian with mean μ(x) and covariance Σ(x), assuming a priori that x is Gaussian. The mean and covariance are treated as the output of a secondary (encoder) neural network. The two networks are trained by maximizing the evidence lower bound (ELBO) with coupled gradient descent algorithms, one for the encoder network and the other for the decoder network (Kingma and Welling, 2014). Whether fitting the networks using a variational or GAN approach (Goodfellow et al., 2014; Arjovsky et al., 2017), the problem of “inverting” the network to obtain a latent input x that reproduces a given y is not addressed by the training procedure.

In the now classical compressed sensing framework (Candes et al., 2006; Donoho et al., 2006), the problem is to reconstruct a sparse signal after observing multiple linear measurements, possibly with added noise. More recent work has begun to investigate generative deep networks as a replacement for sparsity in compressed sensing. Bora et al. (2017) consider identifying y from linear measurements Ay by optimizing f_{A,θ̂}(x) = ½‖AG_{θ̂}(x) − Ay‖². Since this objective is nonconvex, it is not guaranteed that gradient descent will converge to the true global minimum. However, for certain classes of ReLU networks it is shown that so long as a point x̂ is found for which f_{A,θ̂}(x̂) is sufficiently close to zero, then ‖G_{θ̂}(x̂) − y‖ is also small. For the case where y does not lie in the image of G_{θ̂}, an oracle type bound is shown, implying that the solution x̂ satisfies ‖G_{θ̂}(x̂) − y‖ ≤ C min_x ‖G_{θ̂}(x) − y‖ + δ for some small error term δ. The authors observe that in experiments the error seems to converge to zero when x̂ is computed using simple gradient descent; but an analysis of this phenomenon is not provided.

Hand and Voroninski (2017) establish the important result that for a d-layer random network and random measurement matrix A, the least squares objective has favorable geometry, meaning that outside two small neighborhoods there are no first order stationary points, neither local minima nor saddle points. We describe their setup and result in some detail, since it provides a springboard for the surfing algorithm. Let G_{θ_0}: ℝ^k → ℝ^n be a d-layer fully connected feedforward generative neural network, which has the form

 G_{θ_0}(x) = σ(W_d σ(W_{d−1} ⋯ σ(W_1 x) ⋯ )),

where σ(v) = max(v, 0) is the ReLU activation function applied entrywise. The matrix W_i ∈ ℝ^{n_i×n_{i−1}} is the set of weights for the ith layer, and n_i is the number of neurons in this layer, with n_0 = k. If x_0 ∈ ℝ^k is the input, then AG_{θ_0}(x_0) is a set of random linear measurements of the signal G_{θ_0}(x_0). The objective is to minimize f_{A,θ_0}(x) = ½‖AG_{θ_0}(x) − AG_{θ_0}(x_0)‖², where θ_0 = (W_1, …, W_d) is the set of weights.

Due to the fact that the nonlinearities are piecewise linear, G_{θ_0} is a piecewise linear function of x. It is convenient to introduce notation that absorbs the activation σ into the weight matrix W, denoting

 W_{+,x} = diag(Wx > 0) W.

For a fixed x, the matrix W_{+,x} zeros out the rows of W that do not have a positive dot product with x; thus, σ(Wx) = W_{+,x}x. We further define W_{1,+,x} = (W_1)_{+,x} and

 W_{i,+,x} = diag(W_i W_{i−1,+,x} ⋯ W_{1,+,x} x > 0) W_i.

With this notation, we can rewrite the generative network in what looks like a linear form,

 G_{θ_0}(x) = W_{d,+,x} W_{d−1,+,x} ⋯ W_{1,+,x} x,

noting that each matrix W_{i,+,x} depends on the input x. If f_{A,θ_0} is differentiable at x, we can write the gradient as

 ∇f_{A,θ_0}(x) = (∏_{i=d}^{1} W_{i,+,x})^⊤ A^⊤ A (∏_{i=d}^{1} W_{i,+,x}) x − (∏_{i=d}^{1} W_{i,+,x})^⊤ A^⊤ A (∏_{i=d}^{1} W_{i,+,x_0}) x_0.

In this expression, one can see intuitively that under the assumption that the W_i and A are Gaussian matrices, the gradient should concentrate around a deterministic vector. Hand and Voroninski (2017) establish sufficient conditions for concentration of the random matrices around deterministic quantities, so that ∇f_{A,θ_0}(x) has norm bounded away from zero if x is sufficiently far from x_0 or a scalar multiple of x_0. Their results show that for random networks having a sufficiently expansive number of neurons in each layer, the objective has a landscape favorable to gradient descent.
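The linear-form identity G_{θ_0}(x) = W_{d,+,x} ⋯ W_{1,+,x} x above is easy to check numerically. The small random network below (dimensions arbitrary) confirms that the masked-matrix product reproduces the ReLU forward pass:

```python
import numpy as np

def plus(W, v):
    """W_{+,v} = diag(Wv > 0) W: zero out the rows of W inactive at v."""
    return (W @ v > 0)[:, None] * W

rng = np.random.default_rng(1)
k, n1, n2 = 3, 8, 12                     # arbitrary small dimensions
W1, W2 = rng.normal(size=(n1, k)), rng.normal(size=(n2, n1))
x = rng.normal(size=k)

# Forward pass G(x) = sigma(W2 sigma(W1 x)).
h1 = np.maximum(W1 @ x, 0.0)
h2 = np.maximum(W2 @ h1, 0.0)

# Linearized form W_{2,+,x} W_{1,+,x} x, with each mask evaluated at x.
W1p = plus(W1, x)
W2p = plus(W2, W1p @ x)
linear_form = W2p @ W1p @ x
```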

We build on these ideas, showing first that optimizing f_{A,θ_0}(x) with respect to x for a random network and arbitrary signal y can be done with gradient descent. This requires modified proof techniques, since it is no longer assumed that y = G_{θ_0}(x_0). In fact, y can be arbitrary, and we wish to approximate it as G_{θ_0}(x) for some x. Second, after this initial optimization is carried out, we show how projected gradient descent can be used to track the optimum as the network undergoes a series of small changes. Our results are stated formally in the following section.

3 Theoretical Results

Suppose we have a sequence of networks G_0, G_1, …, G_T generated from the training process. For instance, we may take a network with randomly initialized weights as G_0, and record the network after each step of gradient descent in training; G_T is the final trained network.

For a given vector y ∈ ℝ^n, we wish to minimize the objective ½‖AG_T(x) − Ay‖² with respect to x for the final network G_T, where either A = I, or A ∈ ℝ^{m×n} is a measurement matrix with i.i.d. Gaussian entries in a compressed sensing context. Write

 f_t(x) = ½‖AG_t(x) − Ay‖², ∀ t ∈ {0, 1, …, T}. (3.1)

The idea is that we first minimize f_0, which has a nicer landscape, to obtain its minimizer. We then apply gradient descent on f_1, f_2, …, f_T successively, starting in each case from the minimizer found for the previous network.

We provide some theoretical analysis in partial support of this algorithmic idea. First, we show that at random initialization θ_0, all critical points of f_0 are localized to a small ball around zero. Second, we show that if θ_1, …, θ_T are obtained from a discretization of a continuous flow, along which the global minimizer of f_t is unique and Lipschitz-continuous, then a projected-gradient version of surfing can successively find the minimizers of f_1, …, f_T, starting from the minimizer of f_0.

We consider expansive feedforward neural networks given by

 G(x, θ) = V σ(W_d ⋯ σ(W_2 σ(W_1 x + b_1) + b_2) ⋯ + b_d).

Here, d is the number of intermediate layers (which we will treat as constant), σ is the ReLU activation function applied entrywise, and θ = (W_1, b_1, …, W_d, b_d, V) are the network parameters. The input dimension is n_0 = k, each intermediate layer has weights W_i ∈ ℝ^{n_i×n_{i−1}} and biases b_i ∈ ℝ^{n_i}, and a linear transform V ∈ ℝ^{n×n_d} is applied in the final layer.
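For concreteness, this architecture can be written out directly. The dimensions and Gaussian scalings below are illustrative placeholders, matching the setup of the analysis only in spirit:

```python
import numpy as np

def G(x, theta):
    """G(x, theta) = V sigma(W_d ... sigma(W_1 x + b_1) ... + b_d)."""
    Ws, bs, V = theta
    h = x
    for W, b in zip(Ws, bs):
        h = np.maximum(W @ h + b, 0.0)   # intermediate ReLU layers
    return V @ h                         # final linear map, no activation

rng = np.random.default_rng(0)
k, n1, n2, n = 4, 16, 32, 64             # expansive: k <= n1 <= n2; output dim n
theta = ([rng.normal(0, 1 / np.sqrt(n1), (n1, k)),
          rng.normal(0, 1 / np.sqrt(n2), (n2, n1))],
         [rng.normal(0, 1 / np.sqrt(n1), n1),
          rng.normal(0, 1 / np.sqrt(n2), n2)],
         rng.normal(0, 1 / np.sqrt(n), (n, n2)))
out = G(rng.normal(size=k), theta)
```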

For our first result, consider fixed y and a random initialization θ_0 in which the weights have Gaussian entries (independent of y). If the network is sufficiently expansive at each intermediate layer, then the following shows that with high probability, all critical points of f_0 belong to a small ball around 0. More concretely, outside this ball the directional derivative of f_0 at x in the direction −x/‖x‖ satisfies

 D_{−x/‖x‖} f_0(x) ≡ lim_{t→0+} [f_0(x − t x/‖x‖) − f_0(x)]/t < 0. (3.2)

Thus −x is a first-order descent direction of the objective f_0 at x.

Theorem 3.1.

Fix y ∈ ℝ^n. Let V have i.i.d. N(0, 1/n) entries, let W_i and b_i have i.i.d. N(0, 1/n_i) entries for each i = 1, …, d, and suppose these are independent. There exist d-dependent constants C, c > 0 such that for any ε ≤ 1/C, if

1. n_i ≥ Cε^{−2}(log ε^{−1}) n_{i−1} log n_i for all i = 1, …, d, and

2. either A = I and m = n, or A has i.i.d. N(0, 1/m) entries (independent of V, {W_i}, {b_i}) where m ≥ Ckε^{−1}(log ε^{−1}) log(n_1 ⋯ n_d),

then with probability at least 1 − e^{−cεm}, every x outside a small ball around 0 satisfies (3.2).

We defer the proof to Section 5. Note that if instead the weights were correlated with y, say y = G(x̄, θ_0) for some input x̄ with ‖x̄‖ = 1, then x̄ would be a global minimizer of f_0, and we would have ‖x_d‖ ≈ 2^{−d/2} in the above network, where x_d is the output of the dth layer. The theorem shows that for a random initialization of θ_0 which is independent of y, the minimizer is instead localized to a ball around 0 of much smaller radius.
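As an informal numerical illustration (not a substitute for the proof), one can draw a random two-layer instance of this network, pick a target y independent of the weights, and check that f_0 decreases when a point far from the origin is moved toward 0:

```python
import numpy as np

rng = np.random.default_rng(2)
k, n1, n2, n = 5, 200, 400, 800
Ws = [rng.normal(0, 1 / np.sqrt(n1), (n1, k)),
      rng.normal(0, 1 / np.sqrt(n2), (n2, n1))]
bs = [rng.normal(0, 1 / np.sqrt(n1), n1),
      rng.normal(0, 1 / np.sqrt(n2), n2)]
V = rng.normal(0, 1 / np.sqrt(n), (n, n2))
y = rng.normal(size=n)                   # arbitrary target, independent of the weights

def f0(x):
    """f_0(x) = 0.5*||G(x, theta_0) - y||^2 at the random initialization."""
    h = x
    for W, b in zip(Ws, bs):
        h = np.maximum(W @ h + b, 0.0)
    return 0.5 * np.sum((V @ h - y) ** 2)

# For a point well outside a small ball around 0, shrinking x decreases f_0.
x_far = 10.0 * rng.normal(size=k)
decrease = f0(0.9 * x_far) < f0(x_far)
```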

For our second result, consider a network flow

 G_s(x) ≡ G(x, θ(s))

for s ≥ 0, where the parameters θ(s) evolve continuously in a time parameter s. As a model for network training, we assume that θ_0, θ_1, …, θ_T are obtained by discrete sampling from this flow via θ_t = θ(tδ), corresponding to G_t = G_{tδ}, for a small time discretization step δ > 0.

We assume boundedness of the weights and uniqueness and Lipschitz-continuity of the global minimizer along this flow.

Assumption 3.2.

There are constants M, L > 0 such that

1. For every s and every i = 1, …, d,

 ‖W_i(s)‖ ≤ M.

2. The global minimizer x_*(s) is unique and satisfies

 ‖x_*(s) − x_*(s′)‖ ≤ L|s − s′|,

where x_*(s) = argmin_x ½‖AG_s(x) − Ay‖².

Fixing θ, the function x ↦ G(x, θ) is continuous and piecewise-linear in x. For each x ∈ ℝ^k, there is at least one linear piece (a polytope in ℝ^k) of this function that contains x. For a slack parameter τ > 0, consider the rows given by

 S(x, θ, τ) = {(i, j) : |w_{i,j}^⊤ x_{i−1} + b_{i,j}| ≤ τ},

where

 x_{i−1} = σ(W_{i−1} ⋯ σ(W_1 x + b_1) ⋯ + b_{i−1})

is the output of the (i−1)th layer for this input x, and w_{i,j}^⊤ and b_{i,j} are respectively the jth row of W_i and the jth entry of b_i. Define

 P(x, θ, τ) = {P_0, P_1, …, P_G}

as the set of all linear pieces whose activation patterns differ from that of x only in rows belonging to S(x, θ, τ). That is, for every x′ ∈ P_g and (i, j) ∉ S(x, θ, τ), we have

 sign(w_{i,j}^⊤ x′_{i−1} + b_{i,j}) = sign(w_{i,j}^⊤ x_{i−1} + b_{i,j}),

where x′_{i−1} is the output of the (i−1)th layer for input x′.

With this definition, we consider a stylized projected-gradient surfing procedure in Algorithm 3.2, where the projection step is the orthogonal projection onto the current polytope P_g.

The complexity of this algorithm depends on the number of pieces |P(x, θ, τ)| to be optimized over in each step. We expect this to be small in practice when the slack parameter τ is chosen sufficiently small.
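The slack set S(x, θ, τ) is straightforward to compute with a single forward pass. The sketch below is a simplified, hypothetical rendering for a plain ReLU network (it omits the final linear layer and is not the paper's Algorithm 3.2):

```python
import numpy as np

def slack_set(x, Ws, bs, tau):
    """Rows (i, j) whose pre-activation at input x is within tau of the ReLU boundary."""
    S, h = [], x
    for i, (W, b) in enumerate(zip(Ws, bs), start=1):
        pre = W @ h + b
        S.extend((i, int(j)) for j in np.flatnonzero(np.abs(pre) <= tau))
        h = np.maximum(pre, 0.0)         # x_i: output of layer i
    return S

rng = np.random.default_rng(3)
Ws = [rng.normal(size=(6, 2)), rng.normal(size=(8, 6))]
bs = [rng.normal(size=6), rng.normal(size=8)]
x = rng.normal(size=2)
S_small = slack_set(x, Ws, bs, 0.05)     # only rows whose sign may flip nearby
S_all = slack_set(x, Ws, bs, np.inf)     # with infinite slack, every row qualifies
```

Each element of P(x, θ, τ) then corresponds to one choice of signs on the rows in S(x, θ, τ).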

The following shows that for any τ > 0, there is a sufficiently fine time discretization depending on τ such that Algorithm 3.2 tracks the global minimizer. In particular, for the final objective f_T corresponding to the network G_T, the output is the global minimizer of f_T.

Theorem 3.3.

Suppose Assumption 3.2 holds. For any τ > 0, if δ ≤ τ/(M^{d+1}L) and the procedure is initialized at x_0 = x_*(0), then the iterates in Algorithm 3.2 are given by x_t = x_*(tδ) for each t = 1, …, T.

Proof.

For any fixed s, let x, x′ be two inputs to G_s. If x_i, x′_i are the corresponding outputs of the ith layer, then using the assumption ‖W_i(s)‖ ≤ M and the fact that the ReLU activation is 1-Lipschitz, we have

 ‖x_i − x′_i‖ = ‖σ(W_i x_{i−1} + b_i) − σ(W_i x′_{i−1} + b_i)‖ ≤ ‖(W_i x_{i−1} + b_i) − (W_i x′_{i−1} + b_i)‖ ≤ M‖x_{i−1} − x′_{i−1}‖ ≤ ⋯ ≤ M^i‖x − x′‖.

Let s = tδ. By assumption, ‖x_*(s − δ) − x_*(s)‖ ≤ Lδ. For the network with parameter θ(s) at time s, let x_{*,i}(s − δ) and x_{*,i}(s) be the outputs at the ith layer corresponding to inputs x_*(s − δ) and x_*(s). Then for any i and j, the above yields

 |(w_{i,j}(s)^⊤ x_{*,i}(s − δ) + b_{i,j}) − (w_{i,j}(s)^⊤ x_{*,i}(s) + b_{i,j})| ≤ ‖w_{i,j}(s)‖ ‖x_{*,i}(s − δ) − x_{*,i}(s)‖ ≤ M ⋅ M^i ‖x_*(s − δ) − x_*(s)‖ ≤ M^{i+1}Lδ.

For δ ≤ τ/(M^{d+1}L), this implies that for every (i, j) where |w_{i,j}(s)^⊤ x_{*,i}(s − δ) + b_{i,j}| > τ, we have

 sign(w_{i,j}(s)^⊤ x_{*,i}(s − δ) + b_{i,j}) = sign(w_{i,j}(s)^⊤ x_{*,i}(s) + b_{i,j}).

That is, x_*(s) belongs to some linear piece P_g ∈ P(x_*(s − δ), θ(s), τ).

Assuming that x_{t−1} = x_*(s − δ), this implies that the next global minimizer x_*(s) belongs to some P_g ∈ P(x_{t−1}, θ(s), τ). Since f_s is quadratic on P_g, projected gradient descent over P_g in Algorithm 3.2 converges to the minimizer of f_s on P_g, and hence Algorithm 3.2 yields x_t = x_*(s). The result then follows from induction on t. ∎
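The layerwise Lipschitz bound used at the start of this proof is easy to check numerically; here M is taken as the largest operator norm among the weight matrices of a small random network (all sizes illustrative):

```python
import numpy as np

def layer_outputs(x, Ws, bs):
    """Outputs x_1, ..., x_d of each ReLU layer."""
    outs, h = [], x
    for W, b in zip(Ws, bs):
        h = np.maximum(W @ h + b, 0.0)
        outs.append(h)
    return outs

rng = np.random.default_rng(4)
Ws = [rng.normal(size=(10, 3)), rng.normal(size=(12, 10))]
bs = [rng.normal(size=10), rng.normal(size=12)]
M = max(np.linalg.norm(W, 2) for W in Ws)     # common bound on the operator norms

x, xp = rng.normal(size=3), rng.normal(size=3)
outs, outs_p = layer_outputs(x, Ws, bs), layer_outputs(xp, Ws, bs)
gap = np.linalg.norm(x - xp)
# Check ||x_i - x'_i|| <= M^i ||x - x'|| at every layer (i starts at 1).
bounds_hold = all(np.linalg.norm(a - b) <= M ** (i + 1) * gap
                  for i, (a, b) in enumerate(zip(outs, outs_p)))
```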

4 Experiments

We present experiments to illustrate the performance of surfing over a sequence of networks during training, compared with gradient descent over the final trained network. We mainly use the Fashion-MNIST dataset to carry out the simulations; it is similar to MNIST in many characteristics, but is more difficult to train on. We build multiple generative models, trained using VAE (Kingma and Welling, 2014), DCGAN (Radford et al., 2015), WGAN (Arjovsky et al., 2017) and WGAN-GP (Gulrajani et al., 2017). The structure of the generator/decoder networks that we use is the same as those reported by Chen et al. (2016); they include two fully connected layers and two transposed convolution layers, with batch normalization after each layer (Ioffe and Szegedy, 2015). We use the simple surfing algorithm in these experiments, rather than the projected-gradient algorithm proposed for theoretical analysis. Note also that the network architectures do not precisely match the expansive ReLU networks used in our analysis. Instead, we experiment with architectures and training procedures that are meant to better reflect the current state of the art.

We first consider the problem of minimizing the objective f_{θ_T}(x) = ½‖G_{θ_T}(x) − y‖² and recovering the image y = G_{θ_T}(x̃) generated from the trained network with input x̃. We run surfing by taking a sequence of parameters θ_0, θ_1, …, θ_T, where θ_0 are the initial random parameters and the intermediate θ_t's are taken every 40 training steps. In order to improve convergence speed, we use Adam (Kingma and Ba, 2014) to carry out gradient descent in x during each surfing step. We also use Adam when optimizing over x in only the final network. For each network training condition we apply surfing and regular Adam for 300 trials, where in each trial a randomly generated x̃ and initial point are chosen uniformly from the hypercube [−1, 1]^k. Table 1 shows the percentage of trials where the solutions x̂ are close to the true input x̃, for different models, over three different input dimensions k. We also provide the distributions of ‖x̂ − x̃‖ under each setting. Figure 2 shows the results for DCGAN.
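For reference, the Adam update applied to the input x (rather than to network weights) can be sketched as follows; the quadratic toy objective standing in for f is illustrative only:

```python
import numpy as np

def adam_minimize(grad, x0, steps=500, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam updates applied to the network input x rather than the weights."""
    x = x0.astype(float).copy()
    m, v = np.zeros_like(x), np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad(x)
        m = b1 * m + (1 - b1) * g                 # first-moment estimate
        v = b2 * v + (1 - b2) * g ** 2            # second-moment estimate
        mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
        x -= lr * mhat / (np.sqrt(vhat) + eps)    # bias-corrected step
    return x

# Toy stand-in for f(x) = 0.5*||G(x) - y||^2, with a linear "network" W.
rng = np.random.default_rng(5)
W = rng.normal(size=(20, 5))
y = W @ rng.normal(size=5)
x0 = rng.normal(size=5)
x_hat = adam_minimize(lambda x: W.T @ (W @ x - y), x0)
```

In the surfing loop, `adam_minimize` would be called once per parameter snapshot, warm-started at the previous solution.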

We next consider the compressed sensing problem with objective f_{A,θ_T}(x) = ½‖AG_{θ_T}(x) − Ay‖², where A ∈ ℝ^{m×n} is the Gaussian measurement matrix. We carry out 200 trials for each choice of the number of measurements m. The parameters θ_t for surfing are taken every 100 training steps. As before, we record the proportion of the solutions that are close to the truth. Figure 3 shows the results for DCGAN and WGAN trained networks.

Lastly, we consider the objective f_{A,θ_T}(x) = ½‖AG_{θ_T}(x) − Ay‖², where y is a real image from the held-out test data. This can be thought of as a rate-distortion setting, where the error varies as a function of the number of measurements used. We carry out the same experiments as before and compute the average per-pixel reconstruction error as in Bora et al. (2017). Figure 3 shows the distributions of the reconstruction error as the number of measurements m varies.

Figure 4 shows additional plots for experiments comparing surfing over a sequence of networks during training to gradient descent over the final trained network. As described above, we consider the problem of minimizing the objective f_{θ_T}(x) = ½‖G_{θ_T}(x) − y‖², that is, recovering the image y = G_{θ_T}(x̃) generated from the trained network with input x̃. We run surfing by taking a sequence of parameters θ_0, θ_1, …, θ_T, where θ_0 are the initial random parameters and the intermediate θ_t's are taken every 40 training steps. In order to improve convergence speed we use Adam (Kingma and Ba, 2014) to carry out gradient descent in each step of surfing. We also use Adam when optimizing over just the final network. We apply surfing and regular Adam for 300 trials, where in each trial a randomly generated x̃ and initial point are chosen. Figure 4 shows the distribution of the distance between the computed solution and the truth for VAE, WGAN and WGAN-GP, using surfing (red) and regular gradient descent with Adam (blue), over three different input dimensions k.

5 Proof of Theorem 3.1

Throughout this section, ‖v‖ and ‖M‖ denote the Euclidean vector norm and the matrix operator norm, respectively. C, C′, c, c′ denote d-dependent constants that may change from instance to instance.

We adapt ideas of Hand and Voroninski (2017). Denote for simplicity G(x) = G(x, θ_0) and f(x) = f_0(x). Define

 W_{i,+,v} = diag(W_i v + b_i > 0) W_i,  b_{i,+,v} = diag(W_i v + b_i > 0) b_i,

where diag(w > 0) denotes a diagonal matrix whose jth diagonal element is 1{w_j > 0}. Then

 σ(W_i v + b_i) = W_{i,+,v} v + b_{i,+,v}.

The analysis of Hand and Voroninski (2017) shows that the matrices

 W̃_{i,+,v} ≡ (W_{i,+,v}  b_{i,+,v}) ∈ ℝ^{n_i×(n_{i−1}+1)}

satisfy a certain Weight Distribution Condition (WDC), yielding a deterministic approximation for W̃_{i,+,v}^⊤ W̃_{i,+,v′} for any v and v′. We will use the following consequence of this condition.

Lemma 5.1.

Under the conditions of Theorem 3.1, with high probability, the following hold for every v, v′ ∈ ℝ^{n_{i−1}} and every i = 1, …, d:

1. ‖W_{i,+,v}‖ ≤ 1 + ε and ‖b_{i,+,v}‖ ≤ 1 + ε.

2. ‖W̃_{i,+,v}^⊤ W̃_{i,+,v′} − ½I‖ ≤ ε + θ̃/π, where θ̃ is the angle formed by (v, 1) and (v′, 1).

3. ‖W̃_{i,+,v}^⊤ W̃_{i,+,v} − ½I‖ ≤ ε.

Proof.

For (a), note that ‖b_i‖ ≤ 1 + ε and ‖W_i‖ ≤ 1 + ε with high probability, by a standard tail bound and operator norm bound for a Gaussian matrix. On the event that these hold, the bounds hold also for b_{i,+,v} and W_{i,+,v} and every v.

For (b) and (c), by (Hand and Voroninski, 2017, Lemma 11), with high probability the matrix W̃_i ≡ (W_i  b_i) satisfies the WDC with constant ε for every i. (The dependence of the constants in (Hand and Voroninski, 2017, Lemma 11) is as indicated in the proof there; this condition matches the growth rate of n_i specified in our Theorem 3.1.) From the form of the deterministic approximation in (Hand and Voroninski, 2017, Definition 2), the WDC implies

 ‖W̃_{i,+,v}^⊤ W̃_{i,+,v′} − ½I‖ ≤ ε + θ̃/π,

where θ̃ is the angle between (v, 1) and (v′, 1). Noting that ‖(v, 1)‖ ≥ 1 and recalling the definition of W̃_{i,+,v}, we get (b) and (c). ∎

For x ∈ ℝ^k, let x_0 = x and let x_i = σ(W_i x_{i−1} + b_i) be the output of the ith layer. Denote

 W_{i,x} = W_{i,+,x_{i−1}},  b_{i,x} = b_{i,+,x_{i−1}}.

Then also x_i = W_{i,x} x_{i−1} + b_{i,x}.

Lemma 5.2.

Under the conditions of Theorem 3.1, with probability 1, the total number of distinct possible tuples (W_{1,x}, b_{1,x}, …, W_{d,x}, b_{d,x}) satisfies

 |{(W_{1,x}, b_{1,x}, …, W_{d,x}, b_{d,x}) : x ∈ ℝ^k}| ≤ 10^{d²}(n_1 ⋯ n_d)^{d(k+1)}.
Proof.

Let S = span{(x, 1) : x ∈ ℝ^k}, a subspace of dimension k + 1, which contains every vector (x, 1). Then the result of (Hand and Voroninski, 2017, Lemma 15) applied to the vector space S and to W̃_1 yields

 |{(W_{1,x}, b_{1,x}) : x ∈ ℝ^k}| ≤ 10 n_1^{k+1}.

Each distinct (W_{1,x}, b_{1,x}) defines an affine linear space of dimension at most k which contains the first layer output x_1, and hence a subspace of dimension at most k + 1 which contains (x_1, 1). Applying (Hand and Voroninski, 2017, Lemma 15) to each such subspace and to W̃_2 yields

 |{(W_{2,x}, b_{2,x}) : x ∈ ℝ^k}| ≤ 10 n_1^{k+1} ⋅ 10 n_2^{k+1}.

Proceeding inductively,

 |{(W_{i,x}, b_{i,x}) : x ∈ ℝ^k}| ≤ 10^i (n_1 ⋯ n_i)^{k+1},

which is analogous to (Hand and Voroninski, 2017, Lemma 16) in our setting with biases b_i. The result follows from taking the product over i = 1, …, d. ∎
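The flavor of this count is easy to see empirically: sampling many inputs to a small one-layer network yields far fewer distinct activation patterns than the naive 2^{n_1}, consistent with the polynomial bound 10 n_1^{k+1} (all sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
k, n1 = 2, 10
W1 = rng.normal(size=(n1, k))
b1 = rng.normal(size=n1)

# Each input x induces the pattern diag(W1 x + b1 > 0); distinct patterns
# correspond to regions of an arrangement of n1 hyperplanes in R^k.
X = rng.normal(scale=5.0, size=(20000, k))
patterns = {tuple((W1 @ x + b1 > 0).astype(int)) for x in X}
n_patterns = len(patterns)
```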

Lemma 5.3.

Let A ∈ ℝ^{m×n} have i.i.d. N(0, 1/m) entries. Fix ε ∈ (0, 1) and k ≥ 1, and let x ∈ S_x and y ∈ S_y, where S_x and S_y are subspaces of ℝ^n of dimension at most k. Then with probability at least 1 − (c/ε)^{2k} e^{−c′εm}, for all such x and y we have

 |x^⊤A^⊤Ay − x^⊤y| ≤ ε‖x‖‖y‖.
Proof.

See (Hand and Voroninski, 2017, Lemma 14). ∎

Using these results, we analyze the gradient and critical points of f. Note that with the above definitions,

 G(x) = V(W_{d,x} ⋯ (W_{1,x}x + b_{1,x}) ⋯ + b_{d,x}) = V(∏_{i=d}^{1} W_{i,x})x + V ∑_{j=1}^{d} (∏_{i=d}^{j+1} W_{i,x}) b_{j,x}.

The function G is piecewise linear in x, so f is piecewise quadratic. If f is differentiable at x, then the gradient of f can be written as

 ∇f(x) = (∏_{i=1}^{d} W_{i,x}^⊤) V^⊤ A^⊤ (AV(∏_{i=d}^{1} W_{i,x})x + AV ∑_{j=1}^{d} (∏_{i=d}^{j+1} W_{i,x}) b_{j,x} − Ay).
Lemma 5.4.

Define

 g_x = 2^{−d} x − (∏_{i=1}^{d} W_{i,x}^⊤) V^⊤ y.

Under the conditions of Theorem 3.1, we have with high probability that at every x where f is differentiable,

 ‖∇f(x) − g_x‖ ≤ C′ε(1 + ‖x‖ + ‖y‖).
Proof.

By Lemma 5.2, for fixed y, the range {G(x) : x ∈ ℝ^k} belongs to a union of at most 10^{d²}(n_1 ⋯ n_d)^{d(k+1)} subspaces of dimension at most k + 1. For some constants C, c, c′, under the condition m ≥ Ckε^{−1}(log ε^{−1}) log(n_1 ⋯ n_d), we have

 C²(n_1 ⋯ n_d)^{2d(k+1)} (c/ε)^{2k} e^{−c′εm} ≤ e^{−cεm}.

Then for A with i.i.d. N(0, 1/m) entries, applying Lemma 5.3 conditional on V and {W_i, b_i}, and then Lemma 5.1(a) to bound the operator norms, we get

 ‖(∏_{i=1}^{d} W_{i,x}^⊤) V^⊤ (A^⊤A − I) V (∏_{i=d}^{1} W_{i,x}) x‖ ≤ Cε‖x‖.

For A = I, this bound is trivial. The given conditions imply also

 n ≥ n_d ≥ C′k(ε^{−1} log ε^{−1}) log(n_1 ⋯ n_d),

so applying the same argument with V in place of A yields

 ‖(∏_{i=1}^{d} W_{i,x}^⊤)(V^⊤V − I)(∏_{i=d}^{1} W_{i,x}) x‖ ≤ Cε‖x‖.

Next, applying Lemma 5.1(a–b) yields, for each j,

 ‖(∏_{i=1}^{j−1} W_{i,x}^⊤)(W_{j,x}^⊤ W_{j,x}