# Gradient descent in Gaussian random fields as a toy model for high-dimensional optimisation in deep learning

In this paper we model the loss function of high-dimensional optimization problems by a Gaussian random field, or equivalently a Gaussian process. Our aim is to study gradient descent in such loss functions or energy landscapes and compare it to results obtained from real high-dimensional optimization problems such as encountered in deep learning. In particular, we analyze the distribution of the improved loss function after a step of gradient descent, provide analytic expressions for the moments as well as prove asymptotic normality as the dimension of the parameter space becomes large. Moreover, we compare this with the expectation of the global minimum of the landscape obtained by means of the Euler characteristic of excursion sets. Besides complementing our analytical findings with numerical results from simulated Gaussian random fields, we also compare it to loss functions obtained from optimisation problems on synthetic and real data sets by proposing a "black box" random field toy-model for a deep neural network loss function.


## 1 Introduction

For almost a decade there have been significant advances in many areas of machine learning through the application of deep learning techniques, such as in the context of image recognition [1, 2], generative adversarial networks [3] and reinforcement learning, most notably in the development of AlphaGo [4] (see also [5] for an overview). This progress, combined with some discovered weaknesses such as robust adversarial examples [6, 7], has led to interest in a better understanding of the underlying process. One contributing factor to the success of deep neural networks is the nature and complexity of their loss functions or energy landscapes. In particular, it was found that the loss functions of deep neural networks have properties very similar to those of random fields or Gaussian processes [8, 9, 10]. For example, it was observed that the Hessian of such a loss function is mostly governed by the spectrum of a random matrix [8], which can be used to show that local minima are located in a band close to the global minimum.

Given the above evidence that loss functions of deep neural networks share many properties with those of random energy landscapes, we want to investigate optimization procedures in such landscapes further. In particular, we model the loss function of a high-dimensional optimization problem as a Gaussian random field (GRF) [11], which can also be viewed as a Gaussian process (GP) [12], and study, both theoretically and experimentally, the performance of gradient descent in such landscapes as well as properties of the global minimum. Recently, there has been a revived interest in studying distributional properties of GRFs within the research community working on GPs; examples include the study of the distribution of arc length in GPs [13], as well as expected improvements in batch optimization [14]. We aim to fill a gap in the literature by studying distributional aspects of improvements in gradient descent in such landscapes, including proving asymptotic normality of the improved field value. Interesting scalings can be obtained by studying the optimal learning rate, comparing it with that of random search, and locating the global minimum, all as functions of the dimension of the parameter space.

As in [8], but differing from [9] and [10], we explicitly consider the field to be a function of the input and the parameters. Concretely, we choose the loss function to be a Gaussian random field $\phi$ with squared exponential correlation $C(r) = \sigma^2 e^{-r^2/(2\lambda^2)}$ and constant mean $\mu$, where $r$ is the distance between two points. As different choices of these parameters amount only to a scaling and translation of the field, we choose $\mu = 0$ and $\sigma = \lambda = 1$ for definiteness. As an example, Figure 1 illustrates a two-dimensional slice of a 500-dimensional realization of the field.

We are concerned with analyzing properties of gradient descent in the above random energy landscape,

$$x_1 = x_0 - \eta\, \nabla\phi(x_0). \tag{1}$$

Here $x_1$ is the updated point, $x_0$ is the initial point where we start our gradient descent, and $\eta$ is the learning rate. As we will frequently use the values of the field and its gradient at both points, we introduce the short-hand notation $\Phi_0 = \phi(x_0)$, $\Phi_1 = \phi(x_1)$, $\Xi_0 = \nabla\phi(x_0)$ and $\Xi_1 = \nabla\phi(x_1)$, as illustrated in Figure 2.

After a short introduction to Gaussian processes (GPs) in the next section, we present our main theoretical results in Section 3. Firstly, in Section 3.1 we obtain a formal expression for the distribution of $\Phi_1$, as well as analytic expressions for its expected value and variance as functions of the dimension $N$ of our parameter space. In Section 3.2, we use the expected value of $\Phi_1$ to compute the optimal learning rate. When comparing the optimal learning rate with that of random search, it is seen how random search gives superior results in small dimensions, while gradient descent outperforms random search in larger dimensions. In Section 3.3 we prove asymptotic normality of the rescaled random variable $\Phi_1 - \mu_N(X)$. For most practical applications, the Gaussian approximation of the distribution of $\Phi_1$ is sufficiently close to the true distribution. In the following Section 3.4 we compare the expected value of $\Phi_1$ with that of the global minimum in a unit ball, which we estimate by means of the Euler characteristic of excursion sets. The latter is found to have the same scaling with the dimension $N$ as the expected value of $\Phi_1$, but with a slightly larger pre-factor. Besides the above theoretical results, we also provide numerical results and simulation experiments in Section 4. More precisely, in Section 4.1 we numerically simulate GRFs of dimensions up to $N = 500$ and verify the theoretical results obtained in the previous sections. Furthermore, we also investigate gradient descent on a toy model based on GRFs which mimics the loss functions of deep neural networks on synthetic as well as real-life datasets, and compare the results with the findings on GRFs, which are in good agreement.

## 2 Gaussian random fields and Gaussian processes

The Gaussian random field (GRF) we introduced in the previous section is equivalent to saying that the field or loss function $\phi$ is given by a Gaussian process (GP),

$$\phi \sim \mathcal{GP}(0, K), \tag{2}$$

with zero mean and squared exponential kernel $K(x, x') = e^{-\|x - x'\|^2/2}$. For a comprehensive review of GPs the reader is referred to [12].

The GP can be understood as an infinite-dimensional extension of the multivariate Gaussian distribution, such that the joint distribution of any finite number of points is again multivariate Gaussian. For the problem at hand this means that the joint distribution of $\Phi_0$ and $\Phi_1$, given $x_0$ and $x_1$, is Gaussian with mean zero and covariance determined by the kernel. Having a kernel which depends on the distance makes closer points more correlated; in fact, the correlation goes to one as the distance goes to zero. This essentially ensures that $\phi$ will be a continuous function, which makes GPs popular choices for prior distributions over continuous functions in Bayesian statistics. One frequently occurring quantity to compute in this context is the posterior distribution of the field at a new point $x_1$ given the value $\Phi_0$ observed at $x_0$, which can be obtained by conditioning the joint distribution. The conditional distribution is indeed Gaussian, $\Phi_1 | \Phi_0 \sim \mathcal{N}(\mu, \Sigma)$, with conditional mean and conditional covariance given by

$$\mu = K(x_0, x_1)^T K(x_0, x_0)^{-1}\, \Phi_0, \qquad \Sigma = K(x_1, x_1) - K(x_0, x_1)^T K(x_0, x_0)^{-1} K(x_0, x_1). \tag{3}$$

This is a well-known result for GPs and a similar result holds true when conditioning on more than one point. The implications of this relation are important since it essentially means that we can efficiently compute posterior updates of probability distributions at the expense of a few matrix operations.
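As an aside, the posterior update (3) indeed amounts to a few dense matrix operations; a minimal sketch in Python (our own illustration, using the squared-exponential kernel from above and noise-free observations; function names are ours):

```python
import numpy as np

def sq_exp_kernel(A, B):
    """Squared-exponential kernel K(x, x') = exp(-||x - x'||^2 / 2)."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / 2)

def gp_posterior(X_obs, y_obs, X_new):
    """Posterior mean and covariance at X_new given noise-free observations
    (X_obs, y_obs) of a zero-mean GP, following Eq. (3)."""
    K_oo = sq_exp_kernel(X_obs, X_obs)
    K_on = sq_exp_kernel(X_obs, X_new)
    K_nn = sq_exp_kernel(X_new, X_new)
    A = np.linalg.solve(K_oo, K_on)   # K_oo^{-1} K_on
    mu = A.T @ y_obs                  # K_on^T K_oo^{-1} y
    cov = K_nn - K_on.T @ A           # K_nn - K_on^T K_oo^{-1} K_on
    return mu, cov
```

Conditioning at the observed points reproduces them exactly (zero posterior variance), while far from all observations the posterior reverts to the prior $\mathcal{N}(0, 1)$.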

## 3 Theoretical results

### 3.1 Distribution of the field after one step of gradient descent

In the previous section we saw that the joint distribution of $(\Phi_0, \Phi_1)$ is Gaussian and that the conditional distribution can be easily obtained. Moreover, the joint distribution of the $2(N+1)$-dimensional vector $(\Phi_0, \Xi_0, \Phi_1, \Xi_1)$ is also multivariate Gaussian, with zero mean and a covariance matrix which can easily be expressed in terms of the kernel function and its derivatives,

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}, \tag{4}$$

$$\Sigma_{11} = \Sigma_{22} = \mathbb{1}, \tag{5}$$

$$\Sigma_{12} = \Sigma_{21}^T = e^{-\Delta x^2/2} \left( \mathbb{1} - \begin{bmatrix} 0 & \Delta x^T \\ -\Delta x & \Delta x\, \Delta x^T \end{bmatrix} \right), \tag{6}$$

where $\Delta x = x_1 - x_0$ and $\Delta x^2 = \|\Delta x\|^2$.

The values of the field and its gradient at $x_1$, $\Phi_1$ and $\Xi_1$, are Gaussian random variables when conditioned on the values at $x_0$, $\Phi_0$ and $\Xi_0$, similar to the example presented in the previous section:

$$\begin{bmatrix} \Phi_1 \\ \Xi_1 \end{bmatrix}_{\rm cond} \sim\; \mathcal{N}\!\left( \Sigma_{12} \Sigma_{22}^{-1} \begin{bmatrix} \Phi_0 \\ \Xi_0 \end{bmatrix},\; \mathbb{1} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right). \tag{7}$$

Multiplying out the terms and substituting $x_1$ from (1), we obtain that $\Phi_1$ follows a conditional normal distribution with mean and variance

$$m_1(\varphi_0, \xi_0^2) := \mathbb{E}\!\left[\Phi_1 \,\middle|\, \Phi_0 = \varphi_0, \Xi_0^2 = \xi_0^2\right] = e^{-\frac{\eta^2}{2}\xi_0^2} \left( \varphi_0 - \eta\, \xi_0^2 \right), \tag{8}$$

$$v_1(\xi_0) := {\rm Var}\!\left[\Phi_1 \,\middle|\, \Phi_0 = \varphi_0, \Xi_0^2 = \xi_0^2\right] = 1 - e^{-\eta^2 \xi_0^2} \left( 1 + \eta^2 \xi_0^2 \right). \tag{9}$$

As $\Phi_0$ has a standard normal distribution and $\Xi_0^2 = \|\Xi_0\|^2$ has a $\chi^2$-distribution with $N$ degrees of freedom, being the sum of the squares of $N$ independent standard normal components, we can write the overall distribution as

$$f_{\Phi_1}(\varphi_1) = \int_{-\infty}^{+\infty} \! d\varphi_0 \int_{0}^{+\infty} \! d\xi_0^2\; \frac{f_{\Phi_0}(\varphi_0)\, f_{\Xi_0^2}(\xi_0^2)}{\left( 2\pi\, v_1(\xi_0^2) \right)^{1/2}}\, \exp\!\left( -\frac{\left( \varphi_1 - m_1(\varphi_0, \xi_0^2) \right)^2}{2\, v_1(\xi_0^2)} \right), \tag{10}$$

where $f_{\Phi_0}$ is the standard normal probability density function (PDF), $f_{\Xi_0^2}$ is the PDF of a chi-squared random variable with $N$ degrees of freedom, and $m_1$, $v_1$ are the conditional mean and variance of $\Phi_1$ as given in (8)-(9). For illustration purposes, we plotted the resulting PDF by means of numerical integration, as shown in Figure 3.
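Although (10) has no closed form, it is easy to evaluate numerically; the sketch below (our own illustration, with arbitrary small values of $N$ and $\eta$) replaces the outer integrals by Monte Carlo over $(\varphi_0, \xi_0^2)$ and checks that the result is a normalized density:

```python
import numpy as np

rng = np.random.default_rng(1)
N, eta = 10, 0.3           # illustrative dimension and learning rate
S = 20_000                 # Monte Carlo samples for the outer integrals

phi0 = rng.standard_normal(S)      # Phi_0 ~ N(0, 1)
xi2 = rng.chisquare(N, S)          # Xi_0^2 ~ chi^2_N

# Conditional mean and variance of Phi_1, Eqs. (8)-(9).
m1 = np.exp(-eta**2 * xi2 / 2) * (phi0 - eta * xi2)
v1 = 1 - np.exp(-eta**2 * xi2) * (1 + eta**2 * xi2)

def f_phi1(p):
    """Monte Carlo estimate of the marginal density (10) at the points p."""
    z = (p[:, None] - m1[None, :]) ** 2 / (2 * v1[None, :])
    return np.mean(np.exp(-z) / np.sqrt(2 * np.pi * v1[None, :]), axis=1)

grid = np.linspace(-8, 6, 201)
dens = f_phi1(grid)
print(np.trapz(dens, grid))        # ≈ 1: the estimate is a proper density
```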

A closed-form expression for the distribution of $\Phi_1$ seems out of reach; however, we can calculate its moments, and we prove asymptotic normality later. For now, we focus on the first moments, which can be easily derived:

$$\mathbb{E}[\Phi_1] = \mathbb{E}_{\varphi_0, \xi_0^2}\!\left[ m_1(\varphi_0, \xi_0^2) \right] = -\eta\, \mathbb{E}_{\xi_0^2}\!\left[ \xi_0^2\, e^{-\frac{\eta^2}{2}\xi_0^2} \right] = -N \eta \left( \eta^2 + 1 \right)^{-N/2-1}, \tag{11}$$

where the last expectation value is computed by integrating over the PDF of the $\chi^2$-distribution. Similarly, for the variance one obtains

$${\rm Var}(\Phi_1) = \mathbb{E}_{\xi_0^2}\!\left[ v_1(\xi_0) \right] + {\rm Var}_{\varphi_0, \xi_0^2}\!\left[ m_1(\varphi_0, \xi_0^2) \right] = 1 + N \eta^2 \left( 1 + 2\eta^2 \right)^{-N/2-2} \left( N + 1 - 2\eta^2 \right) - N^2 \eta^2 \left( \eta^2 + 1 \right)^{-N-2}. \tag{12}$$

It can be observed that the mean and variance of $\Phi_1$ converge to those of $\Phi_0$ in the limit $\eta \to 0$, as we stay at the same point, and also when $\eta \to \infty$, as the gradient only gives significant information in a neighborhood of $x_0$ whose size is set by the correlation length.
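The closed-form moments are easy to cross-check by simulation, since $\Phi_1$ given $(\varphi_0, \xi_0^2)$ is normal with the mean and variance of (8)-(9); a sketch with illustrative parameter values of our own choosing:

```python
import numpy as np

def phi1_moments(N, eta):
    """Closed-form mean and variance of Phi_1, Eqs. (11)-(12)."""
    mean = -N * eta * (eta**2 + 1) ** (-N / 2 - 1)
    var = (1 + N * eta**2 * (1 + 2 * eta**2) ** (-N / 2 - 2) * (N + 1 - 2 * eta**2)
           - N**2 * eta**2 * (eta**2 + 1) ** (-N - 2))
    return mean, var

rng = np.random.default_rng(2)
N, eta, S = 20, 0.2, 400_000
phi0 = rng.standard_normal(S)          # Phi_0 ~ N(0, 1)
xi2 = rng.chisquare(N, S)              # Xi_0^2 ~ chi^2_N
m1 = np.exp(-eta**2 * xi2 / 2) * (phi0 - eta * xi2)
v1 = 1 - np.exp(-eta**2 * xi2) * (1 + eta**2 * xi2)
samples = m1 + np.sqrt(v1) * rng.standard_normal(S)   # draws of Phi_1

print(samples.mean(), phi1_moments(N, eta)[0])   # should agree to ~1e-2
print(samples.var(), phi1_moments(N, eta)[1])
```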

### 3.2 Optimal learning rate and comparison to random search

The optimal learning rate can be easily derived by solving $d\,\mathbb{E}[\Phi_1]/d\eta = 0$ for the expectation (11), yielding

$$\eta_{\rm opt} = (N+1)^{-1/2}. \tag{13}$$
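Differentiating (11) with respect to $\eta$ makes this explicit (a short check of the stated optimum):

$$\frac{d\,\mathbb{E}[\Phi_1]}{d\eta} = -N\left(\eta^2+1\right)^{-N/2-1} + N(N+2)\,\eta^2\left(\eta^2+1\right)^{-N/2-2} = -N\left(\eta^2+1\right)^{-N/2-2}\left[1 - (N+1)\,\eta^2\right],$$

which vanishes precisely at $\eta^2 = (N+1)^{-1}$.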

Computing the second derivative shows that this is indeed a minimum of the expected field value. We can now obtain the expected value of the field for the optimal learning rate, which is given by

$$\mathbb{E}\!\left[\Phi_1 \,\middle|\, \eta = \eta_{\rm opt}\right] = -N(N+1)^{-1/2}\left(\frac{N+2}{N+1}\right)^{-N/2-1} = -\sqrt{N/e} + O\!\left(N^{-1/2}\right). \tag{14}$$

The expected value of $\Phi_1$ thus improves as the square root of the number of dimensions $N$, as shown in Figure 4, and, as we show in Section 3.4, this is within a constant factor of the minimum field value within a unit-radius ball. We would not expect significantly better results, as the information provided by the gradient decays very fast for distances greater than the correlation length (1 in our case).

The expected step length will also tend to 1 for large values of $N$ when using the optimal learning rate. This can be seen by using the previously discussed fact that the squared gradient has a $\chi^2$-distribution with $N$ degrees of freedom and computing the expectation

$$\mathbb{E}\!\left[ \eta_{\rm opt}\, \|\Xi_0\| \right] = \frac{1}{\sqrt{N+1}}\, \frac{\sqrt{2}\, \Gamma(N/2 + 1/2)}{\Gamma(N/2)} = 1 + O\!\left(N^{-1}\right). \tag{15}$$

Taking $N = 500$, a single step of gradient descent with the optimal learning rate gives us an expected value of

$$\mathbb{E}\!\left[\Phi_1 \,\middle|\, \eta = \eta_{\rm opt}, N = 500\right] \approx -\sqrt{500/e} \approx -13.56. \tag{16}$$

To put this value into context we compare it to random search. Since any single evaluation of the random field gives a value smaller than the above with probability $\Phi(-13.56)$, where $\Phi$ is the cumulative distribution function of a standard normal, on average more than $1/\Phi(-13.56) \approx 10^{41}$ tries would be needed to get to a smaller value by random search. On the other hand, for $N = 1$ the expected value after a gradient descent step would only be

$$\mathbb{E}\!\left[\Phi_1 \,\middle|\, \eta = \eta_{\rm opt}, N = 1\right] = -\frac{2}{3\sqrt{3}} \approx -0.385. \tag{17}$$

This value or better would be obtained by random search in an average of $1/\Phi(-0.385) \approx 2.9$ tries. The difference exemplifies how gradient descent becomes increasingly powerful when moving to higher-dimensional optimization. Below, in Section 3.4, we further investigate how this compares to the value of the global minimum.
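The random-search counts quoted above follow directly from the Gaussian tail; a quick standard-library check (the numerical estimates are ours):

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

# Expected number of independent field evaluations needed to beat level v
# is 1 / P(phi < v) for a standard normal field value.
tries_1d = 1 / norm_cdf(-2 / (3 * math.sqrt(3)))      # N = 1 case, Eq. (17)
tries_500d = 1 / norm_cdf(-math.sqrt(500 / math.e))   # N = 500 case, Eq. (16)
print(tries_1d)     # ≈ 2.9
print(tries_500d)   # ≈ 3e41
```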

### 3.3 Asymptotic normality of the distribution of the field

We now analyze the convergence of the distribution of $\Phi_1$ for $N \to \infty$ when we scale the learning rate around its optimal value, namely under the scaling

$$\eta = \frac{X}{\sqrt{N}}, \tag{18}$$

where $X$ is the rescaled learning rate. Under this scaling we see that the expected value of $\Phi_1$,

$$\mathbb{E}[\Phi_1] = \mu_N(X) + \ldots, \qquad \mu_N(X) := -\sqrt{N}\, X\, e^{-\frac{X^2}{2}}, \tag{19}$$

is of order $\sqrt{N}$, while the variance,

$${\rm Var}(\Phi_1) = \sigma^2(X) + \ldots, \qquad \sigma^2(X) := 1 + X^2 e^{-X^2}, \tag{20}$$

remains finite.

remains finite. We will now show that

###### Theorem 1.

As the dimension $N \to \infty$, the field value after a single step of gradient descent, recentered by $\mu_N(X)$, converges asymptotically to a normal distribution:

$$\Phi_1 - \mu_N(X) \;\xrightarrow{\;d\;}\; \mathcal{N}\!\left(0, \sigma^2(X)\right), \tag{21}$$

where $X$, $\eta$, $\mu_N(X)$ and $\sigma^2(X)$ are defined in (18)-(20).

###### Proof.

To prove the theorem we first compute the moment generating function:

$$\mathbb{E}\!\left[ e^{t \Phi_1} \right] = \mathbb{E}_{\varphi_0, \xi_0^2}\!\left[ \exp\!\left\{ t\, m_1(\varphi_0, \xi_0^2) + \frac{t^2}{2} v_1(\xi_0) \right\} \right] = e^{\frac{t^2}{2}}\, \mathbb{E}_{\xi_0^2}\!\left[ \exp\!\left\{ -\frac{t^2}{2} \eta^2 \xi_0^2\, e^{-\eta^2 \xi_0^2} - t\, \eta\, \xi_0^2\, e^{-\eta^2 \xi_0^2 / 2} \right\} \right]. \tag{22}$$

Inserting the scaling relation (18) for the learning rate, and using a saddle point expansion of the integral over $\xi_0^2$ when writing out the expectation, one can see that the above expression is given to leading order by

$$\mathbb{E}\!\left[ e^{t \Phi_1} \right] = e^{\frac{t^2}{2}} \exp\!\left\{ \frac{t^2}{2} X^2 e^{-X^2} - t\, X \sqrt{N}\, e^{-X^2/2} \right\} + \ldots = \exp\!\left\{ \sigma^2(X)\, \frac{t^2}{2} + t\, \mu_N(X) \right\} + \ldots, \tag{23}$$

where the expectation over $\xi_0^2$ collapsed to its saddle point, which to leading order is given by $\xi_0^2 = N$. Thus

$$\mathbb{E}\!\left[ e^{t \left( \Phi_1 - \mu_N(X) \right)} \right] \longrightarrow \exp\!\left\{ \sigma^2(X)\, \frac{t^2}{2} \right\}, \tag{24}$$

which is the moment generating function of $\mathcal{N}(0, \sigma^2(X))$ and thus proves the above convergence. ∎

Figure 5 shows an example of the normal approximation of the distribution of for finite . Details of the numerical analysis will be presented in Section 4.

### 3.4 Comparison with optimal values

In the previous sections we have analyzed the expected value of the field after a step of gradient descent. A natural follow-up question is how it compares with the global extremum of the field. As we are working with a Gaussian random field whose covariance decays to zero, we can expect to find values of arbitrarily large magnitude at sufficient distance, but a more useful comparison is obtained by restricting ourselves to a unit ball around the random starting point $x_0$.

There is no known analytical expression for the expected value of the maximum or minimum of a Gaussian random field over a general multidimensional domain, but a number of powerful estimation techniques have been developed [11, 15, 16]. Most of them are based on finding quantities that are related to the extrema and computing their expectations. In our case we will use the exact computation of the expected Euler characteristic, as described in [11], to estimate the number of connected components of an excursion set, and use that estimate to obtain the expected value of the minimum.

In general, the Euler characteristic $\chi$ can be seen as the unique functional from a suitable family of subsets of a manifold to the integers with the following properties:

• $\chi(\emptyset) = 0$.

• $\chi(A) = 1$ if $A$ is contractible.

• $\chi(A \cup B) = \chi(A) + \chi(B)$ if $A \cap B = \emptyset$.

In the following discussion we will only use these properties of the Euler characteristic and assume that all the sets under consideration belong to such a family. A detailed proof of the uniqueness of the Euler characteristic and a description of the family in the context of random fields can be found in [11].

In our case the manifold is the unit ball $B_N(x_0)$ around the starting point. We can then define the excursion set $A_u$ as the subset of $B_N(x_0)$ composed of the points where the field takes a value of $u$ or smaller, $A_u = \{ x \in B_N(x_0) : \phi(x) \le u \}$. It is clear then that $A_u$ will be non-empty if and only if $\min_{x \in B_N(x_0)} \phi(x) \le u$, allowing us to connect geometrical properties of the excursion set with the value of the minimum.

As our random field is continuous, if we start from a large positive value of $u$ and gradually decrease it, we expect the excursion set to start as all of $B_N(x_0)$, then develop some holes, become disconnected, turn into a few contractible components and finally end as the empty set, as shown in Figure 6. If our excursion set has the form of disjoint contractible components, its Euler characteristic gives us the number of components, and we will be able to use its expected value for different values of $u$ to estimate the value of the minimum.

Following [11], we note that it has been proved that the expected Euler characteristic of the excursion set, under the conditions we described, is given by

$$\mathbb{E}\!\left[ \chi(A_u) \right] = \sum_{j=0}^{N} \mathcal{L}_j\!\left( B_N(x_0) \right) \rho_j(u), \tag{25}$$

where the $\mathcal{L}_j$ are the Lipschitz-Killing curvatures of $B_N(x_0)$, given by

$$\mathcal{L}_j\!\left( B_N(x_0) \right) = \binom{N}{j} \frac{\omega_N}{\omega_{N-j}}, \qquad \omega_j = \frac{\pi^{j/2}}{\Gamma(j/2 + 1)}, \tag{26}$$

and $\rho_j(u)$ is given by

$$\rho_j(u) = (2\pi)^{-(j+1)/2}\, H_{j-1}(u)\, e^{-\frac{u^2}{2}}, \tag{27}$$

with $H_j$ being the Hermite polynomials.

As shown in Figure 6, we anticipate the expected Euler characteristic to start at 1 for large positive values of $u$ (the whole ball has characteristic 1), to remain at 1 for values of $u$ that leave a single non-empty excursion set with high probability (a small “droplet” has characteristic 1), and to decrease to 0 when it starts being less probable to find a non-empty excursion set (in other words, when $u$ is below the minimum).

The expected behavior can be seen in Figure 7, with the threshold increasing in absolute value as we move to higher-dimensional spaces. If we estimate the expected minimum as the value of $u$ where the expected Euler characteristic equals 0.5, we find that its values are well approximated by $-\sqrt{N}$, i.e.

$$\mathbb{E}\!\left[ \min_{x \in B(x_0)} \phi(x) \right] \approx -\sqrt{N}. \tag{28}$$

Comparing this with the expected value of the field after a gradient descent step, as found in Section 3.2, we see that they differ only by a constant factor of $\sqrt{e} \approx 1.65$.

Applying this estimate for $N = 500$, the expected value of the minimum inside the unit ball is $\approx -\sqrt{500} \approx -22.4$, which should be contrasted with the expected field value after one step of gradient descent, $-\sqrt{500/e} \approx -13.6$, as obtained in Section 3.2.
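Expressions (25)-(27) can be evaluated with nothing more than the standard library; the sketch below (our own code, using probabilists' Hermite polynomials via their recurrence, and the $\phi \to -\phi$ symmetry to handle sub-level sets) locates the level at which the expected Euler characteristic crosses 0.5:

```python
import math

def expected_euler_char(N, v):
    """E[chi] of the excursion set {phi >= v} over the unit ball, Eqs. (25)-(27);
    by symmetry, the sub-level set {phi <= u} corresponds to v = -u."""
    # omega_j = pi^(j/2) / Gamma(j/2 + 1): volume of the unit j-ball.
    omega = [math.pi ** (j / 2) / math.gamma(j / 2 + 1) for j in range(N + 1)]
    # Probabilists' Hermite polynomials He_0 .. He_{N-1} at v, by recurrence.
    he = [1.0, v]
    for k in range(1, N):
        he.append(v * he[k] - k * he[k - 1])
    total = 0.5 * math.erfc(v / math.sqrt(2))          # rho_0: Gaussian tail
    for j in range(1, N + 1):
        L_j = math.comb(N, j) * omega[N] / omega[N - j]
        rho_j = (2 * math.pi) ** (-(j + 1) / 2) * he[j - 1] * math.exp(-v * v / 2)
        total += L_j * rho_j
    return total

N = 100
v = 3 * math.sqrt(N)                 # start far above any plausible threshold
while expected_euler_char(N, v) < 0.5:
    v -= 0.05                        # scan down to the 0.5 crossing
print(v / math.sqrt(N))              # threshold comparable to sqrt(N)
```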

## 4 Experimental results

### 4.1 Random field simulation

Our experiments require generating approximate instances of Gaussian random fields in spaces of high dimensionality, with values of $N$ reaching 500. Most of the conventional methods for simulating random fields [17, 18] don't scale well to a large number of dimensions, as they generate explicit grids representing the field values.

We can take the spectral representation [11] of the random field,

$$\phi(x) = \int e^{i z^T x}\, W(dz), \tag{29}$$

which can be seen as the Fourier transform of the field expressed as a stochastic integral, and use Monte Carlo sampling to approximate the integration over $z$. In that way, we obtain an approximate instance of the random field expressed as the real part of a sum of complex exponentials,

$$\phi_{\rm sim}(x) = \Re\!\left[ w^T\, \overline{\exp}(i Z x) \right], \tag{30}$$

where $\overline{\exp}$ denotes component-wise exponentiation, $w$ is a complex Gaussian random vector of $M$ weights, and $Z$ is a real $M \times N$ Gaussian random matrix with independently distributed elements. This can be considered a multidimensional variant of the randomization method described in [19], although in a high-dimensional context it is important to ensure that the number of samples is significantly higher than the number of dimensions, to avoid confining the gradient to a low-dimensional subspace.

The gradient can then be computed by differentiating the previous expression:

$$\nabla\phi_{\rm sim}(x) = -\Im\!\left[ Z^T {\rm diag}(w)\, \overline{\exp}(i Z x) \right]. \tag{31}$$

The number of spectral samples $M$ determines how accurate our representation of the random field is: higher values increase the amount of computational resources required, while lower values produce a lower-quality realization of the random field. In practice, a value of $M$ can be found that gives high-quality results for $N = 500$ at acceptable computational cost.
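A minimal version of this sampler can be written in a few lines (our own sketch; $N$ and $M$ here are small illustrative values rather than the paper's, and the weight scaling is chosen so that the empirical covariance approximates $e^{-\|\Delta x\|^2/2}$):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4000            # dimension and number of spectral samples

Z = rng.standard_normal((M, N))    # frequencies z_k ~ N(0, I_N)
w = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(M)

def phi(x):
    """Field value: Re[w^T exp(i Z x)], Eq. (30)."""
    return np.real(w @ np.exp(1j * (Z @ x)))

def grad_phi(x):
    """Gradient: -Im[Z^T diag(w) exp(i Z x)], Eq. (31)."""
    return -np.imag(Z.T @ (w * np.exp(1j * (Z @ x))))

# Far-apart points decorrelate; their values should look standard normal.
xs = 5 * rng.standard_normal((300, N))
vals = np.array([phi(x) for x in xs])
print(vals.mean(), vals.var())     # ≈ 0 and ≈ 1
```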

We first compare the expected values for the sample mean and variance computed in Section 3.1 with experimental results. With the previously discussed representation, and for 20 different values of the learning rate $\eta$, we perform one step of gradient descent from a large number of starting points distributed uniformly in the domain. The resulting sample means and variances are shown in Figure 8 together with the theoretical expectations, and we can see that they match quite accurately.

To compare the expected distribution with the empirical one, we repeated the gradient descent step simulation with a large number of points at high dimension. The simulated results can be seen in Figure 5, and they also fit the expected distribution very closely.

### 4.2 Experiments on synthetic and real datasets

In this section we show how this random field model can be used to classify real data. To that end, we introduce a toy model in which we take a standard multilayer-network-based binary classifier and replace the entire network by a “black box” loss function, given by a static random field with no adjustable internal parameters. The $P$-dimensional parameter vector $\beta$, replacing the weights of a normal network, and the $i$-th $n$-dimensional input to be classified, $x_i^{\rm input}$, are concatenated to get an $N$-dimensional vector, with $N = P + n$,

$$x_i = \begin{bmatrix} \beta \\ x_i^{\rm input} \end{bmatrix}, \tag{32}$$

that serves as the random field input.

It is normal practice [5] in classifier networks to use softmax as the activation function in the last layer and cross-entropy as the loss function. As we are replacing the rest of the network with a random field and using only two classes, the output of the classifier can be written as

$$y_i = {\rm sigmoid}\!\left( \phi(x_i) \right), \tag{33}$$

where $x_i$ is the vector (32) associated with the $i$-th input instance and ${\rm sigmoid}(z) = 1/(1 + e^{-z})$ is the sigmoid function.

As usual in supervised binary classification problems, we associate a true class label $t_i \in \{0, 1\}$ to each of our input instances, and we try to minimize the cross-entropy loss between the true labels $t_i$ and the classifier outputs $y_i$, i.e.

$$L = -\sum_i \left[ t_i \log y_i + (1 - t_i) \log(1 - y_i) \right].$$

The training process is standard minibatch gradient descent. By analogy with neural networks, where only the weights are updated, only the parameter vector $\beta$ is updated after each minibatch. Following usual practice, we divide the input data into training and test sets, using the training set to select a value for the parameter vector and the test set to evaluate the accuracy of the classifier. Furthermore, both sets of input instances are normalized to mean 0 and mean norm 1, matching the scale of the dataset distribution to the correlation scale of the random field; this is empirically observed to make a significant difference in classification accuracy.

One way of visualizing the training process is to think of the parameter vector $\beta$ as selecting a slice of the random field. The gradient descent over the loss function will then try to select a slice where the naive Bayes decision surface $\phi = 0$ divides both classes, putting the instances with $t_i = 1$ on the positive side and the instances with $t_i = 0$ on the negative side of the surface. This process can be seen clearly in Figure 9, where the parameter vector after training selects a 2D slice of the random field in which the intersection of the decision boundary with the slice is close to the class boundary.

Note that the toy model proposed here, a “black box” random field that mimics the energy landscape of a deep neural network, is different from GP classification [20, 12] or other kernel methods such as support vector machines [21, 22], in the sense that we combine the input and parameter vector into a joint input vector to a GRF which is kept fixed during the learning process.
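To make the pipeline concrete, here is a compact end-to-end sketch (entirely our own illustration: sizes, dataset, learning rate and iteration count are arbitrary small values, and full-batch descent stands in for minibatches):

```python
import numpy as np

rng = np.random.default_rng(3)

# "Black box" field: spectral approximation of a GRF with kernel exp(-r^2/2).
P, n = 40, 2               # parameter and input dimensions
N, M = P + n, 1000         # field dimension and spectral samples
Z = rng.standard_normal((M, N))
w = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(M)

def field(X):
    """phi at the rows of X (batch x N)."""
    return np.real(np.exp(1j * (X @ Z.T)) @ w)

def grad_beta(X):
    """Gradient of phi w.r.t. the first P components (the parameters beta)."""
    return -np.imag(np.exp(1j * (X @ Z.T)) * w) @ Z[:, :P]

# Tiny synthetic two-class dataset: two Gaussian blobs in the plane.
X_in = np.vstack([rng.normal(-1, 0.5, (100, n)), rng.normal(1, 0.5, (100, n))])
t = np.concatenate([np.zeros(100), np.ones(100)])
X_in /= np.mean(np.linalg.norm(X_in, axis=1))     # mean norm 1, as in the text

def stack(beta):
    return np.hstack([np.tile(beta, (len(X_in), 1)), X_in])   # Eq. (32)

def loss(beta):
    y = 1 / (1 + np.exp(-field(stack(beta))))                 # Eq. (33)
    return -np.mean(t * np.log(y + 1e-12) + (1 - t) * np.log(1 - y + 1e-12))

beta = rng.standard_normal(P) / np.sqrt(P)
loss0, lr = loss(beta), 0.01
for _ in range(60):
    X = stack(beta)
    y = 1 / (1 + np.exp(-field(X)))
    beta -= lr * (y - t) @ grad_beta(X) / len(X_in)   # d(cross-entropy)/d(beta)
print(loss0, loss(beta))   # the loss decreases as beta selects a better slice
```

Only $\beta$ is ever updated; the field itself, playing the role of the network, stays fixed throughout.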

To test the classification power of these random field instances, we run the training and test process over two simple data sets:

• Normally distributed points in the plane separated by a sine function (see Figure 9), with 6000 elements in the training set and 1000 in the test set ($n = 2$, as we are talking about points in the plane).

• MNIST [23, 24], modified to classify the digits as even or odd, using 60000 elements in the training set and 10000 in the test set ($n = 784$ in this case).

The training is done using a fixed batch size of 128 and 10 epochs, combined with different learning rates and values of $P$ to evaluate their impact on test-set accuracy.

The test-set accuracy for MNIST with $P = 5000$ is over 96%, showing that, even though it is far from matching the state of the art, the model has significant classification power. When compared with models with a similar number of parameters, the model is competitive [24].

We can observe in Figure 10 that accuracy doesn't depend on the learning rate until reaching a critical value, after which it drops to random performance. The MNIST drop occurs for $P = 5000$ at $\eta \approx 0.03$ and for $P = 500$ at $\eta \approx 0.13$, and

$$0.03 \cdot \sqrt{5000} \approx 2.1, \qquad 0.13 \cdot \sqrt{500} \approx 2.9, \tag{34}$$

roughly matching the $\eta \sim N^{-1/2}$ scaling found before in the single-step regime.

## 5 Conclusion

The successes of deep learning, as well as some unexpected weaknesses, such as the difficulty of combining good generalization with resistance to adversarial examples, have led to a significant research effort aiming to understand why its training process performs so well in high-dimensional problems. The complex structure of deep neural network error landscapes makes it difficult to understand how the optimization process works, but observing its performance in a simple random field model can help to clarify some of the reasons behind its successes and limitations.

In this work we aim to get a better understanding of gradient descent as a tool for high dimensional optimization by obtaining theoretical and empirical results about its performance over Gaussian random fields. Following a brief introduction, we establish some asymptotic results about the distribution of field values reached after a single step of gradient descent. Those results are then compared with a theoretical estimate of the extreme field values at a similar distance. Finally, we compare the previously obtained theoretical results with experimental simulations, while also showing that the “black box” Gaussian random field model is capable of solving realistic classification tasks.

We present our theoretical results about the distribution of values after a gradient descent step in Section 3. Starting in Section 3.1, we obtain the first and second moments as functions of the learning rate and the number of dimensions of the parameter space. In the following Section 3.2 we use the previously derived expressions to get the optimal value of the learning rate as a function of the number of dimensions, finding that $\eta_{\rm opt} \approx N^{-1/2}$ for large values of $N$. Using that optimal learning rate we show that the expected value of the field after a gradient descent step is approximately $-\sqrt{N/e}$, comparing very favorably with the values that can be obtained through random search when $N \gg 1$, as those are independent of the dimensionality of the space. Closing our analysis of the distribution of values, we prove in Section 3.3 that in the high-dimensional limit the distribution of the values after the gradient descent step is approximately normal, with a variance that is independent of the dimensionality and a mean that is proportional to $\sqrt{N}$. Finally, in Section 3.4 we show, using the expected Euler characteristic of excursion sets, that the expected minimum inside the unit ball differs from the expected value obtained through one step of gradient descent with the optimal learning rate only by a factor of $\sqrt{e}$ in the large-$N$ limit.

In Section 4 we start by showing how we simulate a high-dimensional random field and comparing the experimental gradient descent results with the previous theoretical results in Section 4.1, finding them to be in good agreement. Finally, we show that the model we obtain by replacing a neural network by a Gaussian random field can be trained by gradient descent, obtaining competitive results in a simple synthetic dataset and in MNIST, once we take into account that the model is only using 5000 parameters.

The introduced “black box” GRF model succeeds in combining nontrivial classification performance on realistic datasets with being simple enough to be amenable to exact theoretical analysis. A possible line of future investigation would be to look at other aspects of deep neural networks through the lens of our toy model, such as the interpretability of hidden-layer neurons in image classification tasks, or transfer learning. That could be combined with extending these results to the normal multistep minibatch training process.