## 1 Introduction

In the last decade, deep learning models have provided state-of-the-art results for a broad spectrum of problems in computer vision Krizhevsky et al. (2012); Taigman et al. (2014), Socher et al. (2011a, b), Hamel & Eck (2010); Dahl et al. (2011) and biomedical imaging Plis et al. (2013). The underlying deep architecture with multiple layers of hidden variables allows for learning high-level representations which fall beyond the hypothesis space of (shallow) alternatives Bengio (2009). This representation-learning behavior is attractive in many applications where setting up a suitable feature engineering pipeline that captures the discriminative content of the data remains difficult, but is critical to the overall performance. Despite many desirable qualities, the richness afforded by multiple levels of variables and the non-convexity of the learning objectives make training deep architectures challenging. An interesting solution to this problem, proposed in Hinton & Salakhutdinov (2006); Bengio et al. (2007), is a hybrid two-stage procedure. The first step performs layer-wise unsupervised learning, referred to as “pre-training”, which provides a suitable initialization of the parameters. With this warm start, the subsequent discriminative (supervised) step simply fine-tunes the network with an appropriate loss function. Such procedures broadly fall under two categories: restricted Boltzmann machines and autoencoders Bengio (2009). Extensive empirical evidence has demonstrated the benefits of this strategy, and the recent success of deep learning is at least partly attributed to pre-training Bengio (2009); Erhan et al. (2010); Coates et al. (2011).

Given this role of pre-training, there is significant interest in understanding precisely what the unsupervised phase does and why it works well. Several authors have provided interesting explanations to these questions. Bengio (2009) interprets pre-training as providing the downstream optimization with a suitable initialization. Erhan et al. (2009, 2010) presented compelling empirical evidence that pre-training serves as an “unusual form of regularization” which biases the parameter search by minimizing variance. The influence of the network structure (lengths of visible and hidden layers) and optimization methods on the pre-training estimates has been well studied Coates et al. (2011); Ngiam et al. (2011). Dahl et al. (2011) evaluate the role of pre-training for DBN-HMMs as a function of sample sizes and discuss the regimes which yield the maximum improvements in performance. A related but distinct set of results describes procedures that construct “meaningful” data representations. Denoising autoencoders Vincent et al. (2010) seek representations that are invariant to data corruption, while contractive autoencoders (CA) Rifai et al. (2011b) seek robustness to data variations. The manifold tangent classifier Rifai et al. (2011a) searches for a low-dimensional non-linear sub-manifold that approximates the input distribution. Other works have shown that with a suitable architecture, even a random initialization seems to give impressive performance Saxe et al. (2011). Very recently, Livni et al. (2014); Bianchini & Scarselli (2014) have analyzed the complexity of multi-layer neural networks, theoretically justifying that certain types of deep networks learn complex concepts. While the significance of the results above cannot be overemphasized, our current understanding of the conditions under which pre-training is guaranteed to work well is still not very mature. Our goal here is to complement the above body of work by deriving specific conditions under which this pre-training procedure will have convergence guarantees.

To keep the presentation simple, we restrict our attention to a widely used form of pre-training, the denoising autoencoder, as a sandbox to develop our main ideas, while noting that a similar style of analysis is possible for other (unsupervised) formulations also. Denoising autoencoders (DA) seek robustness to partial destruction (or corruption) of the inputs, implying that a good higher level representation must characterize only the ‘stable’ dependencies among the data dimensions (features) and remain invariant to small variations Vincent et al. (2010). Since the downstream layers correspond to increasingly non-linear compositions, the layer-wise unsupervised pre-training with DAs gives increasingly abstract representations of the data as the depth (number of layers) increases. These non-linear transformations (e.g., sigmoid functions) make the objective non-convex, and so DAs are typically optimized via stochastic gradients. Recently, large scale architectures have also been successfully trained in a massively distributed setting where the stochastic descent is performed asynchronously over a cluster Dean et al. (2012). The empirical evidence regarding the performance of this scheme is compelling. The analysis in this paper is an attempt to understand this behavior on the theoretical side (for both classical and distributed DA), and identify situations where such constructions will work well with certain guarantees.

We summarize the main contributions of this paper. We first derive convergence results and the associated sample size estimates of pre-training a single layer DA using the randomized stochastic gradients Ghadimi & Lan (2013). We show that the convergence of expected gradients is and the number of calls (to a first order oracle) is , where and correspond to the number of hidden and visible units, is the number of iterations, and is an error parameter. We then show that the DA objective can be distributed and present improved rates while learning small fractions of the network synchronously. These bounds provide a nice relationship between the sample size, asymptotic convergence of gradient norm (to zero) and the number of hidden/visible units. Our results extend easily to stacked and convolutional denoising autoencoders. Finally, we provide sets of experiments to evaluate if the results are meaningful in practice.

## 2 Preliminaries

Autoencoders are single layer neural networks that learn over-complete representations by applying nonlinear transformations on the input data Vincent et al. (2010); Bengio (2009). Given an input , an autoencoder identifies representations of the form , where is a transformation matrix and denotes point-wise sigmoid nonlinearity. Here, and denote the lengths of visible and hidden layers respectively. Various types of autoencoders are possible depending on the assumptions that generate the ’s, for example robustness to data variations/corruptions or enforcing data to lie on some low-dimensional sub-manifold Rifai et al. (2011b, a).

Denoising autoencoders are a widely used class of autoencoders Vincent et al. (2010) that learn higher-level representations by leveraging the inherent correlations/dependencies among input dimensions (), thereby ensuring that is robust to changes in less informative input/visible units. This is based on the hypothesis that abstract high-level representations should only encode stable data dependencies across input dimensions, and be robust to spurious correlations and invariant features. This is done by ‘corrupting’ each individual visible dimension randomly, and using the corrupted version (’s) instead, to learn ’s. The corruption generally corresponds to ignoring (setting to 0) the input signal with some probability (denoted by ), although other types of additive/multiplicative corruption may also be used. If is the input at the unit, then the corrupted signal is with probability and otherwise where . Note that each of the dimensions is corrupted independently with the same probability . DA pre-training then corresponds to estimating the transformation by minimizing the following objective Bengio (2009),

(1) |

where the expectation is over the joint probability of sampling an input and generating the corresponding using . The bias term (which is never corrupted) is taken care of by appending inputs with .
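The corruption process and the reconstruction loss described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `corrupt`, `encode`, `decode`, `da_loss` and `estimate_objective`, and the symbol `zeta` for the corruption probability, are hypothetical. The decoder is assumed to reuse the transpose of the encoder matrix (tied weights), the loss is assumed to be the squared error, and the bias term is omitted.

```python
import math
import random


def sigmoid(z):
    """Numerically stable point-wise logistic sigmoid."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def corrupt(x, zeta, rng):
    """Masking noise: zero each coordinate independently with probability zeta."""
    return [0.0 if rng.random() < zeta else xi for xi in x]


def encode(W, x):
    """Hidden representation sigmoid(W x); W is a list of d_h rows of length d_v."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]


def decode(W, h):
    """Reconstruction sigmoid(W^T h); tying the weights is an assumption."""
    d_v = len(W[0])
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(W)))) for i in range(d_v)]


def da_loss(W, x, x_tilde):
    """Squared error between the clean x and its reconstruction from x_tilde."""
    x_hat = decode(W, encode(W, x_tilde))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def estimate_objective(W, data, zeta, n_draws, seed=0):
    """Monte Carlo estimate of the expected loss over (x, x_tilde) pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.choice(data)
        total += da_loss(W, x, corrupt(x, zeta, rng))
    return total / n_draws
```

Each draw inside `estimate_objective` is one sample of the pair (clean input, corrupted input), so averaging the loss over many draws approximates the expectation that defines the objective.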

For notational simplicity, let us denote the process of generating by a random variable , i.e., one sample of corresponds to a pair where is constructed by randomly corrupting each of the dimensions of with some probability . Then, if the reconstruction loss is , the objective in (1) becomes

(2) |

Observe that the objective in (2) is an expectation of the loss over the randomly corrupted sample pairs , and that it is non-convex. Analyzing convergence properties of such an objective using classical techniques, especially in a (distributed) stochastic gradient setup, is difficult. Given that the loss function is a composition of sigmoids, one possibility is to adopt convex variational relaxations of sigmoids in (2) and then apply standard convex analysis. But non-convexity is, in fact, the most interesting aspect of deep architectures, and so the analysis of a loose convex relaxation will be unable to explain the empirical success of DAs, and deep learning in general.

High Level Idea. The starting point of our analysis is a very recent result on stochastic gradients which only makes a weaker assumption of Lipschitz differentiability of the objective (rather than convexity). We assume that the optimization of (2) proceeds by querying a stochastic first order oracle (SFO), which provides noisy gradients of the objective function. For instance, the SFO may simply compute a noisy gradient with a single sample at the iteration and use that alone to evaluate . The main idea adapted from Ghadimi & Lan (2013) to our problem is to express the stopping criterion for the gradient updates by a probability distribution over iterations , i.e., the stopping iteration is (and hence the name randomized stochastic gradients, RSG). Observe that this is the only difference from classical stochastic gradients used in pre-training, where the stopping criterion is assumed to be the last iteration. RSG offers more useful theoretical properties, and requires only a negligible change to existing implementations. This then allows us to compute the expectation of the gradient norm, where the expectation is over stopping iterations sampled according to . For our case, the updates are given by,

(3) |

where is the noisy gradient computed at iteration ( is the stepsize). We have flexibility in specifying the distribution of the stopping criterion . It can be fixed a priori or selected by a hyper-training procedure that chooses the best (based on an accuracy measure) from a pool of distributions . With these basic tools in hand, we first compute the expectation of gradients where the expectation accounts for both the stopping criterion and . We show that if the stepsizes in (3) are chosen carefully, the expected gradients decrease monotonically and converge. Based on this analysis, we derive the rate of convergence and corresponding sample size estimates for DA pre-training. We describe the one-layer DA (i.e., with one hidden layer) in detail, and all our results extend easily to the stacked and convolutional settings since the pre-training is done layer-wise in multi-layer architectures.
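The randomized stopping rule can be illustrated with a short sketch. It assumes a plain gradient-step update with a noisy gradient, draws the stopping iteration before any update is made, and returns the iterate at that iteration. The function names `rsg` and `toy_oracle`, the toy objective, and the uniform stopping weights in the usage lines are all hypothetical choices for illustration; they are not the DA objective or the stopping distribution derived later in the section.

```python
import random


def rsg(w0, oracle, stepsizes, stop_probs, rng):
    """Randomized stochastic gradient: plain SGD, except that the returned
    iterate is the one at a stopping iteration R drawn from stop_probs."""
    n = len(stepsizes)
    # Draw the stopping iteration R in {1, ..., n} before any update is made.
    R = rng.choices(range(1, n + 1), weights=stop_probs, k=1)[0]
    w = list(w0)
    # The returned point is the R-th iterate, so R - 1 updates are applied.
    for k in range(1, R):
        g = oracle(w, rng)  # noisy gradient at the current iterate
        w = [wi - stepsizes[k - 1] * gi for wi, gi in zip(w, g)]
    return w, R


def toy_oracle(w, rng, noise=0.1):
    """Unbiased noisy gradient of the nonconvex toy objective
    f(w) = sum_i w_i^2 / (1 + w_i^2), used only for illustration."""
    return [2.0 * wi / (1.0 + wi * wi) ** 2 + rng.gauss(0.0, noise) for wi in w]


rng = random.Random(0)
N = 200
w_R, R = rsg([1.5, -0.7], toy_oracle, [0.1] * N, [1.0] * N, rng)
```

Because the only change from ordinary stochastic gradient descent is the draw of `R`, an existing training loop could be adapted by sampling the stopping iteration once and breaking out of the loop there.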

## 3 Denoising Autoencoders (DA) pre-training

We first present some results on the continuity and boundedness of the objective in (2), followed by the convergence rates for the optimization. Denote the element in row and column of by where and . We require the following Lipschitz continuity assumptions on and the gradient , which are fairly common in numerical optimization. and are Lipschitz constants.

###### Assumption (A1).

###### Assumption (A2).

We see from (1) that is symmetric in . Depending on where is located in the parameter space (and the variance of each data dimension ), each corresponds to some , and will then be the maximum of all such ’s (similarly for ).

Based on the definition of and (2), we see that the noisy gradients are unbiased estimates of the true gradient since . To compute the expectation of the gradients, , over the distribution governing whether the process stops at iteration , i.e., , we first state a result regarding the variance of the noisy gradients and the Lipschitz constant of . All proofs are included in the supplement.

###### Proof.

Recall that the assumptions [A1] and [A2] are,

The noisy gradient is defined as . Using the mean value theorem and [A1], we have . This implies that the maximum variance of is . We can then obtain the following upper bound on the variance of ,

(5) |

Using [A2], we have

(6) |

where the equality follows from the definition of -norm. The second inequality is from . The last two inequalities use the definition of -norm and that is the maximum of all s. ∎

Whenever the inputs are bounded between and , is finite-valued everywhere and there exists a minimum due to the bounded range of the sigmoid in (1). Also, is analytic with respect to . Now, if one adopts the RSG scheme for the optimization, using Lemma 3.1, we have the following upper bound on the expected gradients for the one-layer DA pre-training in (2).

###### Lemma 3.2 (Expected gradients of one–layer DA).

Let be the maximum number of RSG iterations with step sizes . Let be given as

(7) | ||||

where . If , we have

(8) | ||||

###### Proof.

Broadly, this proof emulates the proof of Theorem 2.1 in Ghadimi & Lan (2013) with several adjustments. The Lipschitz continuity assumptions (refer to and ) give the following bounds on the variance of and the Lipschitz continuity of (refer to Lemma 3.1),

(9) | ||||

Using the properties of Lipschitz continuity we have,

Since the update of using the noisy gradient is , where is the step–size, we then have,

By denoting ,

Rearranging terms on the right hand side above,

Summing the above inequality for ,

(10) | ||||

where is the initial estimate. Using , we have,

(11) | ||||

We now take the expectation of the above inequality over all the random variables in the RSG updating process – which include the randomization used for constructing noisy gradients, and the stopping iteration . First, note that the stopping criterion is chosen at random with some given probability and is independent of . Second, recall that the random process is such that the random variable is independent of for some iteration number , because selects then randomly. However, the update point depends on (which are functions of the random variables ) from the first to the iteration. That is, is not independent of , and in fact the updates form a Markov process. So, we can take the expectation with respect to the joint probability where denotes the random process from until . We analyze each of the last two terms on the right hand side of (11) by first taking expectation with respect to . The second last term becomes,

(12) | ||||

where the last equality follows from the definition of and . Further, from Equation 9 we have . So, the expectation of the last term in (11) becomes,

(13) |

Using (12) and (13) and the inequality in (11) we have,

(14) |

Using the definition of from Equation 7 and denoting , we finally obtain

(15) | ||||

∎

The expectation in (8) is over and . Here, ensures that the summations in the denominators of in (7) and the bound in (8) are positive. represents a quantity which is twice the deviation of the objective at the RSG starting point () from the optimum. Observe that the bound in (8) is a function of and network parameters, and we will analyze it shortly.
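For readers who want the skeleton of the argument in one place, the following is a sketch of the standard descent-lemma computation that yields a bound of this type. The symbols are assumptions introduced for illustration and are not the notation of this section: $L$ stands in for the Lipschitz constant of the gradient and $\sigma^2$ for the variance bound on the noisy gradients (the two quantities supplied by Lemma 3.1), $\gamma^k$ denotes the stepsizes, $G^k$ the noisy gradient at iteration $k$, $\delta^k = G^k - \nabla f(W^k)$ the gradient noise, and $f^*$ the optimal value. The constants in the section's own bound may be arranged differently.

```latex
% Step 1: Lipschitz gradient (descent lemma) plus the update W^{k+1} = W^k - \gamma^k G^k.
\begin{align*}
f(W^{k+1}) &\le f(W^k) + \langle \nabla f(W^k),\, W^{k+1} - W^k \rangle
             + \tfrac{L}{2}\,\|W^{k+1} - W^k\|^2 \\
&= f(W^k) - \gamma^k \langle \nabla f(W^k),\, G^k \rangle
   + \tfrac{L}{2}(\gamma^k)^2 \|G^k\|^2 \\
&= f(W^k) - \bigl(\gamma^k - \tfrac{L}{2}(\gamma^k)^2\bigr)\|\nabla f(W^k)\|^2
   - \bigl(\gamma^k - L(\gamma^k)^2\bigr)\langle \nabla f(W^k),\, \delta^k \rangle
   + \tfrac{L}{2}(\gamma^k)^2 \|\delta^k\|^2 .
\end{align*}
% Step 2: sum over k = 1, ..., N, use f(W^{N+1}) \ge f^*, and take expectations with
% E<\nabla f(W^k), \delta^k> = 0 (unbiasedness) and E\|\delta^k\|^2 \le \sigma^2.
\begin{equation*}
\sum_{k=1}^{N} \bigl(2\gamma^k - L(\gamma^k)^2\bigr)\, \mathbb{E}\,\|\nabla f(W^k)\|^2
\le 2\bigl(f(W^1) - f^*\bigr) + L\sigma^2 \sum_{k=1}^{N} (\gamma^k)^2 .
\end{equation*}
% Step 3: choose P_R(k) proportional to 2\gamma^k - L(\gamma^k)^2 (positive when
% \gamma^k < 2/L) and write D_f = 2 (f(W^1) - f^*).
\begin{equation*}
\mathbb{E}\,\|\nabla f(W^R)\|^2 \le
\frac{D_f + L\sigma^2 \sum_{k=1}^{N} (\gamma^k)^2}
     {\sum_{k=1}^{N} \bigl(2\gamma^k - L(\gamma^k)^2\bigr)} .
\end{equation*}
```

In this generic form the numerator stays strictly positive even when $D_f$ is close to zero, which is the looseness noted in the caveats that follow.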

As stated, there are a few caveats that are useful to point out. Since no convexity assumptions are imposed on the loss function, Lemma 3.2 on its own offers no guarantee that the function values decrease as increases. In particular, in the worst case, the bound may be loose. For instance, when (i.e., the initial point is already a good estimate of the stationary point), the upper bound in (8) is non–zero. Further, the bound contains summations involving the stepsizes, both in the numerator and denominator, indicating that the limiting behavior may be sensitive to the choices of . The following result gives a remedy — by choosing to be small enough, the upper bound in (8) will decrease monotonically as increases.

###### Lemma 3.3 (Monotonicity and convergence of expected gradients).

By choosing such that

(16) |

the upper bound of expected gradients in (8) decreases monotonically. Further, if the sequence for satisfies

(17) | ||||

###### Proof.

We first show the monotonicity of the expected gradients followed by its limiting behavior. Observe that whenever , we have

Then the upper bound in (8) reduces to

(18) |

To show that the right hand side of the above inequality decreases as increases, we need to show the following

(19) |

By denoting the terms in the above inequality as follows,

(20) |

To show that the inequality in Equation 19 holds,

(21) |

Rearranging the terms in the last inequality above, we have

(22) |

Recall that ; so without loss of generality we always have . With this result, the last inequality in (22) is always satisfied whenever for . Since this needs to be true for all , we require for . This proves the monotonicity of expected gradients. For the limiting case, recall the relaxed upper bound from (18). Whenever , the right hand side in (18) converges to . ∎

The second part of the lemma is easy to ensure by choosing diminishing step-sizes (as a function of ). This result ensures the convergence of expected gradients, provides an easy way to construct based on (7) and (17), and to decide the stopping iteration based on ahead of time.
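As an illustration of how a stopping distribution might be built from a stepsize schedule and sampled ahead of time, the sketch below uses the generic form in which the probability of stopping at iteration k is proportional to `2*gamma_k - L*gamma_k**2`. That form, the symbol `L` for the gradient's Lipschitz constant, the schedule `gamma1 / sqrt(k)`, and both function names are assumptions made for the example; the exact expression in (7) may carry different constants.

```python
import math
import random


def stopping_distribution(stepsizes, L):
    """Stopping probabilities proportional to 2*gamma_k - L*gamma_k**2
    (assumed generic form; needs 0 < gamma_k < 2/L for positive weights)."""
    weights = [2.0 * g - L * g * g for g in stepsizes]
    if any(wt <= 0.0 for wt in weights):
        raise ValueError("every stepsize must lie in (0, 2/L)")
    total = sum(weights)
    return [wt / total for wt in weights]


def diminishing_stepsizes(gamma1, n):
    """Schedule gamma_k = gamma1 / sqrt(k): the stepsizes sum to infinity as
    n grows, while the sum of squares grows only logarithmically."""
    return [gamma1 / math.sqrt(k) for k in range(1, n + 1)]


N = 1000
gammas = diminishing_stepsizes(0.5, N)
probs = stopping_distribution(gammas, L=1.0)
# The stopping iteration can be fixed before any training takes place.
R = random.Random(0).choices(range(1, N + 1), weights=probs, k=1)[0]
```

With a decreasing schedule kept below `1/L`, earlier iterations receive larger stopping probability, whereas a constant stepsize yields a uniform distribution over iterations.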

Remarks. Note that the maximum in (17) needed to ensure the monotonic decrease of expected gradients depends on . Whenever the estimate of is too loose, the corresponding might be too small to be practically useful. An alternative in such cases is to compute the RSG updates for some (fixed a priori) iterations using a reasonably small stepsize, and select to be the iteration with the smallest possible gradient or the cumulative gradient among some last iterations. While a diminishing stepsize following (17) is ideal, the next result gives the best possible constant stepsize , and the corresponding rate of convergence.

###### Corollary 3.4 (Convergence of one–layer DA).

The optimal constant step sizes are given by

(23) |

If we denote , then we have

(24) |

###### Proof.

Using constant stepsizes , the convergence bound in (8) reduces to

(25) |

To achieve monotonic decrease of expected gradients, we require (from (17) in Lemma 3.3). For such s,

which when used in (25) gives,

(26) |

Observe that as increases (resp. decreases), the two terms on the right hand side of the above inequality decrease (resp. increase) and increase (resp. decrease), respectively. Therefore, the optimal for all , is obtained by balancing these two terms, as in

(27) |

However, the above choice of has the unknowns , and (although note that the latter two constants can be empirically estimated by sampling the loss functions for different choices of and ). Replacing by some , the best possible choice of constant stepsize is

(28) |

Since needs to be smaller than as discussed at the start of the proof, we have

(29) |

Now substituting this optimal constant stepsize from (28) into the upper bound in (26) we get

(30) | ||||

and by denoting , we finally have

(31) |

∎

The upper bound in (8) can be written as a summation of two terms, one of which involves . The optimal stepsize in (23) is calculated by balancing these terms as increases (refer to the supplement). The ideal choice for is in which case reduces to . For a fixed network size ( and ), Corollary 3.4 shows that the rate of convergence for one–layer DA pre-training using RSG is . It is interesting to see that the convergence rate is proportional to where the number of parameters of our bipartite network (of which DA is one example) is .
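The balancing argument can be checked numerically with a small sketch. It assumes the generic relaxed bound `D/(N*gamma) + L*sigma2*gamma` for a constant stepsize `gamma <= 1/L`, where `D`, `L` and `sigma2` are hypothetical stand-ins for the initial-deviation constant, the gradient's Lipschitz constant and the gradient-noise variance; in the section's own result these constants depend on the network size, which the sketch does not model.

```python
import math


def relaxed_bound(gamma, D, L, sigma2, N):
    """Assumed relaxed bound D/(N*gamma) + L*sigma2*gamma for a constant
    stepsize gamma in (0, 1/L]."""
    return D / (N * gamma) + L * sigma2 * gamma


def balanced_stepsize(D, L, sigma2, N):
    """Stepsize that equalizes the two terms of the bound, capped at 1/L so
    that the relaxation remains valid."""
    return min(1.0 / L, math.sqrt(D / (L * sigma2 * N)))


D, L, sigma2 = 2.0, 4.0, 1.0
for N in (10_000, 40_000):
    gamma = balanced_stepsize(D, L, sigma2, N)
    print(N, gamma, relaxed_bound(gamma, D, L, sigma2, N))
```

When the cap at `1/L` is inactive, the balanced bound equals `2*sqrt(D*L*sigma2/N)`, so quadrupling the number of iterations halves it, which is the inverse-square-root behavior in the number of iterations.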

Corollary 3.4 gives the convergence properties of a single RSG run over some iterations. However, in practice one is interested in a large deviation bound, where the best possible solution is selected from multiple independent runs of RSG. Such a large deviation estimate is indeed more meaningful than one RSG run because of the randomization over in (2). Consider a –fold RSG with independent RSG estimates of denoted by . Using the expected convergence from (24), we can compute a -solution defined as,

###### Definition (-solution).

For some given and , an -solution of one–layer DA is given by such that

(32) |

governs the goodness of the estimate , and bounds the probability of good estimates over multiple independent RSG runs. Since is the maximum iteration count (i.e., maximum number of calls), the number of data instances required is , where denotes the average number of times each instance is used by the oracle. Although in practice there is no control over (in which case, we simply have ), we estimate the required sample size and the minimum number of folds () in terms of , as shown by the following result.

###### Corollary 3.5 (Sample size estimates of one–layer DA).

The number of independent RSG runs () and the number of data instances () required to compute a -solution are given by

(33) |

where is a given constant, denotes ceiling operation and denotes the average number of times each data instance is used.

###### Proof.

Recall that a -solution is defined such that

(34) |

for some given and . Using basic probability properties,

(35) | ||||

Using Markov's inequality and (24),