The sharp, the flat and the shallow: Can weakly interacting agents learn to escape bad minima?

05/10/2019 · Nikolas Kantas et al.

An open problem in machine learning is whether flat minima generalize better and how to compute such minima efficiently. This is a very challenging problem. As a first step towards understanding this question we formalize it as an optimization problem with weakly interacting agents. We review appropriate background material from the theory of stochastic processes and provide insights that are relevant to practitioners. We propose an algorithmic framework for an extended stochastic gradient Langevin dynamics and illustrate its potential. The paper is written as a tutorial, and presents an alternative use of multi-agent learning. Our primary focus is on the design of algorithms for machine learning applications; however, the underlying mathematical framework is suitable for understanding large-scale systems of agent-based models that are popular in the social sciences, economics and finance.


1 Introduction

It is part of the machine learning folklore that “flat minima generalize better”. This claim is intriguing and has been investigated in many papers, starting from Hochreiter and Schmidhuber (1997) and more recently in Keskar et al. (2016), Neyshabur et al. (2017), Chaudhari et al. (2016) and Li et al. (2018). An even more intriguing claim that appears in the recent machine learning literature is that, at least in practice, variants of the Stochastic Gradient Descent (SGD) method of Robbins and Monro, originally proposed in the 1950s (see Kushner and Yin (2003); Bottou et al. (2018) for a modern exposition), do indeed converge to flat minima; see Keskar et al. (2016); Jastrzebski et al. (2017); Xing et al. (2018). However, the observation that SGD typically converges to flat local minima, and that these minima generalize well, lacks a rigorous justification. Firstly, there is no widely accepted or robust definition of “flatness”. Indeed, in Dinh et al. (2017) the authors exploit the usual definitions of flatness to show that, through a re-parametrization, a sharp local minimum can be transformed into a flat one without losing generalization. Secondly, these claims rely on numerical experiments, which of course hold only for a particular dataset, choice of hyper-parameters, and amount of computation time. From a theoretical perspective, it is well known that it is NP-hard to escape from a local minimum, SGD is only guaranteed to converge to first-order stationary points (Kushner and Yin (2003)), and even the best rigorous generalization bounds are too loose to be of use in practice (Arora et al. (2018)). Thus, whether SGD or any of its variants computes local minima that generalize well remains an open question, with many implications for the theory and practice of machine learning. In this short paper, we review several variants of SGD and illustrate that a system of interacting, rather than i.i.d., agents performing gradient descent can help to smooth out sharp minima and (approximately) convexify the loss function. The proposed ideas may be useful in answering a variant of the question above: is there a principled and systematic way to remove “sharp” local minima from a given loss function and thus provably converge to a flat minimum? We believe that answering this question is an essential first step towards understanding generalization in state-of-the-art machine learning models.

We consider various modifications of Stochastic Gradient Langevin Dynamics (SGLD):

dX_t = -\nabla f(X_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t, \qquad X_0 \sim \mu_0, \qquad (1)

where W_t is a standard d-dimensional Brownian motion, \beta > 0 is an inverse-temperature parameter and \mu_0 denotes the initial distribution. In this paper we assume that the loss function f and its gradient are known. When this is not the case, the resulting SGD can be described using Langevin dynamics with multiplicative noise (Li et al. (2017)); this case will be studied elsewhere. We assume that \beta is fixed in time. If instead \beta is gradually increased using a so-called annealing schedule, then adding noise to the standard gradient flow can help the dynamics in (1) provably converge to a global minimum of f (Geman and Hwang (1986)). Our goal is somewhat different: we are interested in methods that can smooth the objective function and eliminate sharp local minima. Therefore the role of the noise term in our setting is to help the trajectory escape local minima and saddle points. In addition to the direct smoothing of the objective function via convolution with an appropriate kernel, discussed in Section 2, we present some additional approaches to smoothing the loss function, using mean field limits of (weakly) interacting SGLD and tools from multiscale analysis for stochastic differential equations. Both approaches, multiscale analysis and mean field averaging, can be viewed as principled ways to “smooth” or regularize f. The resulting smoothed loss function will have fewer local minima, and each local minimum will be flatter. We assume that flat local minima are more desirable in machine learning applications than sharp ones.
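For concreteness, the following is a minimal sketch of an Euler-Maruyama discretization of (1) in Python. The loss function, step size, inverse temperature and iteration count are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def sgld(grad_f, x0, beta=10.0, dt=1e-2, n_steps=10_000, rng=None):
    """Euler-Maruyama discretization of the SGLD dynamics (1):
    x_{k+1} = x_k - dt * grad_f(x_k) + sqrt(2 dt / beta) * xi_k, xi_k ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_steps):
        x = x - dt * grad_f(x) + np.sqrt(2.0 * dt / beta) * rng.standard_normal(x.shape)
    return x

# Example: double-well loss f(x) = (x^2 - 1)^2, with minima at x = -1 and x = 1.
x_final = sgld(lambda x: 4.0 * x * (x**2 - 1.0), x0=[2.0])
```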

For simplicity we will present SGLD algorithms and their mean field or homogenized variants mainly in continuous time. The analysis of fully discrete algorithms based on the continuous-time dynamics studied here will be presented elsewhere. The SDEs considered in this paper are well understood in the literature, but are rarely considered as optimization methods; this paper can be seen as an early contribution in this direction, presented in a tutorial or short review style. We emphasize that this paper is not meant as a thorough review, and hence we avoid technicalities and provide only indicative references. The organization of the paper is as follows: in Section 2 we discuss direct smoothing of the loss function via convolutions. In Section 3 we briefly review how interacting SDEs and mean field models can be used as extensions of SGLD. In Section 4 we present some tools from multiscale analysis that are useful for the development of the algorithms in this paper. In Section 5 we put everything together in a single algorithm. In Section 6 we present some early numerical results, and in Section 7 we provide concluding remarks and ideas for extensions.

2 Direct smoothing of the loss function

To eliminate sharp local minima one could replace the gradient term in the basic gradient descent algorithm with a smoother version. This idea has its origins in global optimization using continuation methods (see e.g. Moré and Wu (1997); Hazan et al. (2016)). Concretely, one simulates the gradient descent dynamics of a “smoothed” version f_\sigma of the cost function instead:

dX_t = -\nabla f_\sigma(X_t)\,dt, \qquad (2)

where we denote

f_\sigma(x) := (k_\sigma \ast f)(x) = \int_{\mathbb{R}^d} k_\sigma(x - y)\,f(y)\,dy, \qquad (3)

i.e. \ast denotes convolution. A typical choice for the smoothing kernel k_\sigma is the Gaussian kernel with variance \sigma^2.

The technical conditions for the above modification of the gradient to lead to smoothing can be found in Wu (1996), where the method is studied (in discrete time) in more detail. Regardless of the choice of the smoothing kernel, f_\sigma(x) can be interpreted as an expectation f_\sigma(x) = \mathbb{E}_{Y \sim \mu_\sigma}[f(x - Y)] for a suitably chosen probability measure \mu_\sigma. Furthermore, (under appropriate conditions)

\nabla f_\sigma(x) = \mathbb{E}_{Y \sim \mu_\sigma}[\nabla f(x - Y)]. \qquad (4)

An additional point here is that when Y \sim \mu_\sigma, the random variable \nabla f(x - Y) is an unbiased estimate of the “biased” gradient \nabla f_\sigma(x). This implies, in particular, that there is an additional option for randomizing the smoothed gradient.

Loosely speaking, the effect of k_\sigma here is to smooth f; a rigorous justification of this statement appears in Wu (1996). The first natural question to ask is how to design k_\sigma (or \mu_\sigma) to obtain the desired smoothing of f. An even more pressing question for machine learning practitioners is to find a minimum that is independent of any other variables hidden in the notation, e.g. training data and hyper-parameters. These are challenging questions. The conventional approach in continuation methods is to start with a very smooth model (using a high \sigma) and gradually decrease \sigma as the method progresses. An additional complication is that computing the integral in (3) for the type of loss functions that appear in machine learning applications is intractable. To address these issues, in the remainder we will look at approaches where the smoothing measure does not act directly on f but is instead constructed from the stochastic process itself.
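Property (4) suggests a simple randomization: sample Y from \mu_\sigma and use \nabla f(x - Y) as an unbiased estimator of \nabla f_\sigma(x). The sketch below implements this Monte Carlo estimator for the Gaussian choice of \mu_\sigma; function names and parameter values are illustrative, not from the paper.

```python
import numpy as np

def smoothed_grad(grad_f, x, sigma=0.5, n_samples=100, rng=None):
    """Monte Carlo estimate of the smoothed gradient (4):
    grad f_sigma(x) = E[grad f(x - Y)], with Y ~ N(0, sigma^2 I)
    (the Gaussian choice of mu_sigma)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    samples = sigma * rng.standard_normal((n_samples,) + x.shape)
    return np.mean([grad_f(x - y) for y in samples], axis=0)

# Example: each call returns an unbiased estimate of grad f_sigma(x).
g = smoothed_grad(lambda z: 4.0 * z * (z**2 - 1.0), x=np.array([0.9]), sigma=0.3)
```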

3 Mean-field formulations of SGLD

Our aim is to design variants of (1) so that the process ends up close to a minimizer of f as quickly as possible. The law of the diffusion process has a smooth density \rho_t with respect to the Lebesgue measure that satisfies the Fokker-Planck (forward Kolmogorov) equation:

\partial_t \rho_t = \nabla \cdot (\nabla f\, \rho_t) + \beta^{-1} \Delta \rho_t.

Under appropriate assumptions on the growth of the loss function at infinity, the density \rho_t converges exponentially fast in relative entropy (Kullback-Leibler divergence) to the unique stationary state

\rho_\infty(x) = \frac{1}{Z} e^{-\beta f(x)},

where Z = \int e^{-\beta f(x)}\,dx denotes the normalization constant (Pavliotis (2014, Ch. 4)). Finding the mode, i.e. the maximizer, of \rho_\infty is equivalent to finding the global minimum of f. Here \beta acts as an annealing parameter: higher values result in \rho_\infty putting more of its mass around lower-valued local minima of f. For high \beta, the SGLD (1) approaches a deterministic gradient descent, so it can be harder to escape from local minima. In practice, the trade-off between convergence speed and optimality is difficult to balance.

An alternative approach is to use interacting SGLD, as opposed to i.i.d. copies of the Langevin dynamics (1). If we choose the interaction law appropriately, then the resulting coupled system of SGLD could, in principle, explore the state space more efficiently when the agents act in a cooperative manner (Schweitzer (2003, Ch. 7)). We will consider a system of interacting SGLD of the form

dX_t^i = -\nabla f(X_t^i)\,dt - \frac{\theta}{N}\sum_{j=1}^N \nabla V(X_t^i - X_t^j)\,dt + \sqrt{2\beta^{-1}}\,dW_t^i, \qquad (5)

where i = 1, \dots, N and the W_t^i are independent standard d-dimensional Brownian motions. Compared to the i.i.d. SGLD (1), the dynamics (5) uses V as an interaction potential, which, in this paper, we will take to be convex, with \theta > 0 denoting the interaction strength. In particular, we will consider the so-called Curie-Weiss interaction (Dawson (1983))

V(x) = \frac{1}{2}|x|^2, \qquad (6)

so that each particle experiences a linear attractive (mean-reverting) force towards the empirical mean of all particles, \bar{X}_t^N = \frac{1}{N}\sum_{j=1}^N X_t^j.
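A minimal sketch of an Euler-Maruyama discretization of (5)-(6) follows. With the Curie-Weiss potential the interaction term reduces to a mean-reversion towards the empirical mean, which is what the code exploits; all parameter values are illustrative.

```python
import numpy as np

def mf_sgld(grad_f, X0, theta=1.0, beta=10.0, dt=1e-2, n_steps=5_000, rng=None):
    """Euler-Maruyama discretization of (5) with the Curie-Weiss interaction (6):
    each agent follows -grad f plus a linear attraction, of strength theta,
    towards the empirical mean of all agents."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X0, dtype=float).copy()         # shape (N, d): N agents in R^d
    for _ in range(n_steps):
        attraction = theta * (X - X.mean(axis=0))  # Curie-Weiss mean-reversion
        noise = np.sqrt(2.0 * dt / beta) * rng.standard_normal(X.shape)
        X = X - dt * (np.apply_along_axis(grad_f, 1, X) + attraction) + noise
    return X
```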

Several questions arise: how does the density of each agent change in time? What is the invariant measure of the agent dynamics? Can we pass to the mean field limit N \to \infty, both for the law of the agent process and for the invariant measure of the dynamics? How does the rate of convergence of the N-particle dynamics compare to that of the i.i.d. SGLD? These questions have been studied extensively in the mathematical physics literature, see for instance Shiino (1987); Dawson (1983); Gomes and Pavliotis (2018) and the references therein. In recent years it has been recognized that interacting agents and their mean field limit can be useful for developing global optimization algorithms (Pinnau et al. (2017); Carrillo et al. (2018)). However, the connection between SGLD, mean field limits, consensus formation for agent-based models and the training of neural networks is still largely unexplored.

We assume that the interaction between agents is due to the interaction potential V and that all agents experience the same loss function f. This is in slight contrast to the growing literature on mean field models for neural networks and supervised learning, where the loss typically depends on the empirical measure of the agents themselves, so interaction can arise from the number of layers and the parameters (Bengio et al. (2006); Mei et al. (2019); Rotskoff and Vanden-Eijnden (2018); Sirignano and Spiliopoulos (2018); Chizat and Bach (2018)) or from the number of data points (see E et al. (2018) for a control-theoretic perspective). Our framework is a good abstraction of popular machine learning algorithms, e.g. federated learning (Kamp et al. (2018)), elastic-SGD (Zhang et al. (2015)) and ensemble methods for ResNets (Zhang et al. (2015)).

The interacting SGLD (5) is exchangeable (Dawson (1983)), i.e. the law of the process is invariant under arbitrary permutations of the agents. Under appropriate assumptions on the loss function and on the initial conditions, the position of each agent converges, in the limit N \to \infty, to the solution of the McKean SDE

dX_t = -\nabla f(X_t)\,dt - \theta\,(\nabla V \ast \rho_t)(X_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t, \qquad \rho_t = \mathrm{Law}(X_t).

The density \rho_t of the law of the process satisfies the McKean-Vlasov equation:

\partial_t \rho_t = \nabla \cdot \big( (\nabla f + \theta\, \nabla V \ast \rho_t)\, \rho_t \big) + \beta^{-1} \Delta \rho_t. \qquad (7)

The analysis of (5) is standard when both f and V are convex (see e.g. Malrieu (2001) and the references therein). When the loss function is non-convex, which is the case in most machine learning applications, the analysis becomes more involved, see Cattiaux et al. (2008); Tugaut (2011, 2014); Del Moral and Tugaut (2018); Durmus et al. (2018). In particular, more than one stationary state can exist.

Stationary states of the McKean-Vlasov dynamics (7) are given by solutions of the integral equation (Tamura (1984))

\rho_\infty(x) = \frac{1}{Z} \exp\Big( -\beta \big( f(x) + \theta\, (V \ast \rho_\infty)(x) \big) \Big), \qquad Z = \int \exp\Big( -\beta \big( f + \theta\, V \ast \rho_\infty \big) \Big)\,dx. \qquad (8)

For small values of the product \beta\theta a unique stationary state exists, given, for the quadratic interaction potential (6), by

\rho_\infty(x) = \frac{1}{Z} \exp\Big( -\beta \big( f(x) + \theta\, (\tfrac{1}{2}x^2 - m x) \big) \Big), \qquad m = \int x\, \rho_\infty(x)\,dx.

On the other hand, for \beta\theta sufficiently large and for f non-convex, multiple stationary states exist, with different stability properties. These states are given by the same formula, where the parameter m is now determined self-consistently via the equation m = \int x\, \rho_\infty(x; m)\,dx, which admits multiple solutions. The number of stationary states is related to the number of critical points of the loss function; see Dawson (1983); Shiino (1987); Tugaut (2011, 2014); Gomes and Pavliotis (2018). In particular, for loss functions with many local minima, there exist many stationary states. In the limit of infinitely many local minima, no SGD-type algorithm can converge. This is another manifestation of the fact that it is necessary to smooth (quasi-convexify) the loss function f. It is important to note that, although the rigorous proof of such results is valid only in the mean field limit, these issues appear also at finite times and for a finite number of agents (Gomes and Pavliotis (2018)).
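To make the self-consistency construction concrete, the sketch below computes stationary states of the one-dimensional Curie-Weiss model by fixed-point iteration on the mean m, under the reconstruction of (8) given above; the grid, iteration count and example loss are illustrative choices.

```python
import numpy as np

def stationary_mean(f, theta=1.0, beta=5.0, m0=1.0, n_iter=500):
    """Fixed-point iteration for the 1D self-consistency condition behind (8)
    with the Curie-Weiss interaction:
    rho(x; m) ~ exp(-beta (f(x) + theta (x^2/2 - m x))),  m = E_rho[x].
    Different starting points m0 can converge to different stationary states."""
    x = np.linspace(-3.0, 3.0, 2001)
    m = m0
    for _ in range(n_iter):
        a = beta * (f(x) + theta * (0.5 * x**2 - m * x))
        w = np.exp(-(a - a.min()))     # unnormalized density, numerically stabilized
        w /= np.trapz(w, x)
        m = np.trapz(x * w, x)         # updated estimate of the mean
    return m

# Example: bistable loss; for large beta*theta, m0 = +1 and m0 = -1
# can pick out two distinct stationary states.
f = lambda x: 0.25 * (x**2 - 1.0)**2
m_plus, m_minus = stationary_mean(f, m0=1.0), stationary_mean(f, m0=-1.0)
```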

Using f + \theta\, V \ast \rho instead of f is a form of regularizing or smoothing the cost function. From an optimization point of view, substituting the empirical mean and using a linear interaction force is equivalent to adding an \ell_2-penalty to the objective function for the constraint X^i = \bar{X}^N, for each agent i. Therefore, for an appropriate choice of the interaction strength \theta, the modified objective function is approximately convex. The idea of “encouraging” each agent to follow an effective loss function has featured recently in particle gradient algorithms (Nitanda and Suzuki (2017); Chizat and Bach (2018)).

We remark that the mean field dynamics can, in principle, converge faster to the stationary state(s) than the i.i.d. copies of SGLD. Furthermore, the McKean-Vlasov PDE has a variational structure with respect to the free energy functional (Villani (2003))

\mathcal{F}[\rho] = \int f(x)\,\rho(x)\,dx + \frac{\theta}{2} \iint V(x - y)\,\rho(x)\,\rho(y)\,dx\,dy + \beta^{-1} \int \rho(x) \log \rho(x)\,dx. \qquad (9)

This variational formulation makes clear the link between the mean field SGLD and variational inference. The details will be presented elsewhere.

4 Homogenization and averaging for SGLD

We return our attention to (1) and seek an alternative way of smoothing the loss function f, different from the convolution smoothing introduced in Section 2. In Section 3 we used the empirical measure of the interacting agents to smooth the loss. In this section we briefly present the approach developed in Chaudhari et al. (2018), which converts (1) into the following gradient descent algorithm:

dX_t = -\nabla f_\gamma(X_t)\,dt, \qquad \nabla f_\gamma(x) = \frac{1}{\gamma}\Big( x - \int y\, \rho^\infty(y; x)\,dy \Big), \qquad (10)

where \rho^\infty(\cdot\,; x) is the invariant measure of the fast process Y_t that appears in the limit \epsilon \to 0 of the following fast/slow SDE system, written in terms of the extended potential \Phi(x, y) = f(y) + \frac{1}{2\gamma}|x - y|^2:

dX_t = -\nabla_x \Phi(X_t, Y_t)\,dt, \qquad (11a)
dY_t = -\frac{1}{\epsilon}\,\nabla_y \Phi(X_t, Y_t)\,dt + \sqrt{2(\beta\epsilon)^{-1}}\,dW_t. \qquad (11b)

The parameter \epsilon measures the scale separation. The limit \epsilon \to 0 can be justified using multiscale analysis (Pavliotis and Stuart (2008)). Note that (10) is a gradient scheme for the modified loss function f_\gamma. In particular, \gamma here acts as a regularization parameter, precisely like the (inverse of the) interaction strength \theta in the previous section. We emphasize the similarities between (3) and (10): in the formulation presented in this section, we can calculate the regularized loss function and its gradient by taking an expectation with respect to an appropriate probability measure, just as in standard kernel-based smoothing.

It is important to note that the smoothed loss function in (10) can also be calculated via convolution with a Gaussian kernel:

f_\gamma(x) = -\frac{1}{\beta} \log \big( G_{\gamma/\beta} \ast e^{-\beta f} \big)(x), \qquad (12)

where G_s denotes the Gaussian kernel with variance s.

This is the Cole-Hopf formula for the solution of the viscous Hamilton-Jacobi equation with the loss function f as the initial condition, posed on the time interval [0, \gamma]. The larger \gamma is, the more regularized the effective potential (or local entropy) f_\gamma is. At the level of the mean field SGLD studied in the previous section, this corresponds to taking the interaction strength \theta to be sufficiently small, so that a unique steady state for the McKean-Vlasov dynamics exists. The connection between smoothing at the level of the effective potential and the absence of phase transitions for the mean field dynamics is an intriguing problem that will be studied in future work. We mention here that the critical noise-interaction strength at which the phase transition occurs (i.e. at which multiple stationary states appear) can be calculated by solving a self-consistency equation involving the non-interacting equilibrium distribution \bar{\rho}(x) \propto e^{-\beta f(x)}.
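The Cole-Hopf representation (12) can be evaluated numerically on a grid. The sketch below does this in one dimension, under the convention f_\gamma = -\beta^{-1} \log(G_{\gamma/\beta} \ast e^{-\beta f}) reconstructed above; the oscillatory example loss is illustrative, and boundary truncation of the convolution is ignored.

```python
import numpy as np

def local_entropy(f_vals, x, gamma=0.5, beta=5.0):
    """Evaluate the smoothed loss (12) on a 1D grid via the Cole-Hopf formula:
    f_gamma(x) = -(1/beta) log( (G_s * exp(-beta f))(x) ), with s = gamma / beta
    and G_s the Gaussian kernel with variance s."""
    s = gamma / beta
    dx = x[1] - x[0]
    # Dense pairwise kernel matrix; fine for small 1D grids.
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2.0 * s)) / np.sqrt(2.0 * np.pi * s)
    v = (K @ np.exp(-beta * f_vals)) * dx   # heat-kernel smoothing of exp(-beta f)
    return -np.log(v) / beta

# Example: an oscillatory loss; f_gamma has fewer and flatter minima than f.
x = np.linspace(-2.0, 2.0, 801)
f_vals = x**2 + 0.2 * np.sin(20.0 * x)
f_gamma = local_entropy(f_vals, x, gamma=0.5, beta=5.0)
```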

We also remark that Chaudhari et al. (2018) use an equivalent formulation of (11):

dX_t = -\frac{1}{\gamma}\,(X_t - Y_t)\,dt, \qquad (13a)
dY_t = -\frac{1}{\epsilon}\Big( \nabla f(Y_t) - \frac{1}{\gamma}(X_t - Y_t) \Big)\,dt + \sqrt{2(\beta\epsilon)^{-1}}\,dW_t. \qquad (13b)

Here the gradient of the regularized cost appears explicitly as \frac{1}{\gamma}(X_t - Y_t). This form is more convenient for the numerical implementation and is the one that will be used in Algorithm 1.

5 Towards an algorithm

Combining (10) with (5) we obtain:

dX_t^i = -\frac{1}{\gamma}\,(X_t^i - Y_t^i)\,dt - \theta\,(X_t^i - \bar{X}_t^N)\,dt + \sqrt{2\beta^{-1}}\,dW_t^i, \qquad (14)
dY_t^i = -\frac{1}{\epsilon}\Big( \nabla f(Y_t^i) - \frac{1}{\gamma}(X_t^i - Y_t^i) \Big)\,dt + \sqrt{2(\beta\epsilon)^{-1}}\,d\tilde{W}_t^i, \qquad i = 1, \dots, N. \qquad (15)

We will use this scheme with a sufficiently small value of \epsilon, so that we are close to the homogenized limit. The theoretical justification of this algorithm requires the study of the joint limits \epsilon \to 0 and N \to \infty. Note that these limits do not commute when the loss function is nonconvex and when we are also interested in the long-time limit (Gomes and Pavliotis (2018)). For our purposes, it is better to first let \epsilon \to 0 and then N \to \infty. This way we first obtain a smoother loss function f_\gamma, so that many of the local minima have disappeared; then, mean field interactions facilitate the exploration of the state space, without suffering from the drawback of the existence of multiple stationary states. A more detailed study of these issues, with particular emphasis on the optimal tuning of the algorithm, will be presented elsewhere.

To discretize (14)-(15) effectively for low \epsilon we will use the heterogeneous multiscale method (Weinan et al. (2005)) presented in Algorithm 1. We do not discretize (14) directly, but rather its integral with respect to the invariant distribution of (15), similarly to (10). This integral is approximated in Algorithm 1 by a time average of the faster time scale, which is run for N_y intermediate iterations. We will use two step sizes, \Delta t and \delta t, for equations (13a) and (13b) respectively, and write x_k^i and y_l for the corresponding discretized variables. In Algorithm 1 the parameter N_b plays the role of a “burn-in” time that eliminates transient behavior from the time average.

Initialize x_0^i \sim \mu_0, for i = 1, \dots, N.

For k = 0, \dots, K - 1; for each agent i = 1, \dots, N:

Set y_0 = x_k^i; for l = 0, \dots, N_y - 1: y_{l+1} = y_l - \delta t \big( \nabla f(y_l) - \tfrac{1}{\gamma}(x_k^i - y_l) \big) + \sqrt{2\,\delta t\,\beta^{-1}}\,\xi_l, with \xi_l \sim N(0, I).

Compute average \bar{y}_k^i = \frac{1}{N_y - N_b} \sum_{l = N_b + 1}^{N_y} y_l.

Update x_{k+1}^i = x_k^i - \Delta t \big( \tfrac{1}{\gamma}(x_k^i - \bar{y}_k^i) + \theta\,(x_k^i - \bar{x}_k) \big) + \sqrt{2\,\Delta t\,\beta^{-1}}\,\zeta_k^i, with \zeta_k^i \sim N(0, I) and \bar{x}_k = \frac{1}{N}\sum_{j=1}^N x_k^j.

Algorithm 1: Discretized mean field SGLD with homogenization
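A possible Python realization of Algorithm 1 is sketched below. It follows the heterogeneous multiscale structure described above (an inner SGLD loop for the fast variable, a time average replacing the invariant-measure average, and an outer update with the Curie-Weiss attraction), but all parameter defaults are illustrative, and the placement of noise in the outer update follows our reconstruction of (14).

```python
import numpy as np

def mf_hom_sgld(grad_f, X0, theta=1.0, gamma=0.1, beta=10.0,
                dt=1e-2, dt_inner=1e-3, n_outer=500, n_inner=50,
                n_burn=10, rng=None):
    """Sketch of Algorithm 1: discretized mean field SGLD with homogenization.
    Inner loop: SGLD for the fast variable y, cf. (13b); its time average after
    a burn-in of n_burn steps approximates the invariant-measure average <y>.
    Outer loop: each agent moves along the regularized gradient (x - <y>)/gamma,
    plus a Curie-Weiss attraction towards the empirical mean, plus noise."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X0, dtype=float).copy()             # shape (N, d)
    for _ in range(n_outer):
        X_mean = X.mean(axis=0)                        # empirical mean of agents
        for i in range(len(X)):
            y, y_sum = X[i].copy(), np.zeros_like(X[i])
            for l in range(n_inner):                   # fast dynamics (13b)
                drift = -(grad_f(y) - (X[i] - y) / gamma)
                y = y + dt_inner * drift \
                      + np.sqrt(2.0 * dt_inner / beta) * rng.standard_normal(y.shape)
                if l >= n_burn:
                    y_sum += y
            y_avg = y_sum / (n_inner - n_burn)         # time average ~ <y>
            X[i] = (X[i]
                    - dt * ((X[i] - y_avg) / gamma + theta * (X[i] - X_mean))
                    + np.sqrt(2.0 * dt / beta) * rng.standard_normal(X[i].shape))
    return X
```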

6 Numerical Results

6.1 Toy examples

Figure 1: Traces of the agents on 2D contour plots, for f the six-hump camel function. Initially sampled points are labelled by + and the two global minima by *.

We will illustrate the performance of Algorithm 1 first by looking at some common problems from global optimisation. The full algorithm will be denoted MF-HOM-SGLD, and we will compare it with HOM-SGLD (\theta = 0), MF-SGLD (i.e. an Euler discretization of (5)) and standard SGLD. To make the comparisons fair, for HOM-SGLD and SGLD we will use N i.i.d. copies and pick the best performing sample/agent every time.

We begin our numerical comparisons with some two-dimensional illustrations of each method. For this we use the six-hump camel function

f(x, y) = \big(4 - 2.1 x^2 + x^4/3\big) x^2 + x y + \big(-4 + 4 y^2\big) y^2,

whose global minima are located at (0.0898, -0.7126) and (-0.0898, 0.7126). In Figure 1 we present the paths of the agents for each method, together with contours of f. The plots also illustrate that homogenization smooths the gradient of the cost function, leading to smoother trajectories. For the mean field methods we show paths for a low- and a higher-interaction case. The interaction strength is important for escaping local minima, and the value of \theta influences performance directly. In this example, when \theta is low, some agents of both mean field methods get stuck in local minima. As \theta increases, the paths of the agents approach the global minima (labelled by * in Figure 1). The method performs well for higher values of \theta (e.g. in panels (e), (f)), whereby all agents reach consensus and move very close to one of the global minima. For low values of \theta (e.g. in panels (b), (d)), the agents settle at a short distance from the global minima and do not seem to reach consensus. This is due to using an attraction towards the empirical mean, which, without consensus, is close to the origin because of the symmetry of f. This shortcoming can be overcome using cost-weighted averages as in Pinnau et al. (2017). A more thorough investigation of such weighted interactions is left for future work.

The six-hump camel function is not a very challenging problem, as it is low dimensional with a very smooth landscape and flat curvature around each local minimum; in particular, the absence of multiple scales and fast oscillations in f meant that a low \epsilon was not really required. Still, homogenization performed very well. We therefore proceed with a more challenging cost function, with a rougher landscape and high-frequency oscillations, and show that in this case the homogenized algorithms perform very well as smoothers of the cost. We will use a cost function in which a parameter controls the number of local minima. In Figure 2 we plot statistics from multiple independent runs of f(X^i), with X^i being either the worst or the best performing agent of a given run of each algorithm. In all cases we use common noise and interaction parameters for MF-SGLD and SGLD, with matching choices for the homogenized versions. Figure 2 shows a clear improvement in performance for the homogenized versions, but this comes at the price of slower convergence. For both the plain and the homogenized SGLD, the addition of the mean field interaction seems to accelerate convergence. In our simulations we noticed that Algorithm 1 can exhibit numerical stiffness for some parameter ranges, which in practice means that a smaller step size is necessary, so more advanced discretization schemes should be investigated.

Figure 2: From left to right: f(X^i) against iteration count for SGLD, MF-SGLD, HOM-SGLD and MF-HOM-SGLD, respectively. Shown is either the median of the worst performing agent over 50 independent runs (dotted lines) or the median of the best performer (solid lines).

6.2 Concentric Ellipse Example

We describe preliminary results on the simple classification problem of two concentric ellipses (see Figure 7(a)). This problem is easy to solve, but it neatly illustrates the challenges of generalization, and we use it to demonstrate some qualitative differences between the solutions computed by the different algorithms. We used the dynamical-systems formulation of ResNets described in Haber and Ruthotto (2018) and the serial version of the implementation described in Parpas and Muir (2019). After discretization with the Verlet scheme (with a step-size of 0.05), the resulting neural network has 128 fully connected layers; the resulting optimization problem has 774 parameters and the objective function is a cross-entropy loss. Note that the original dataset contains points in the range shown in Figure 7(a). We ran all algorithms for 100 epochs. In Figure 7(b) we show the classification probability obtained from the SGLD solution. In the range where the algorithm has observations, it learns to classify points with the correct probability. However, outside the original range, where the algorithm has not seen any data points, we observe some strange behavior (bottom of Figure 7(b)). In Figure 7(c) we plot the results from HOM-SGLD. The solution obtained from HOM-SGLD is valid in a broader range. Still, the solution obtained with HOM-SGLD does eventually exhibit similar behavior to SGLD (bottom right of Figure 7(c)). Both algorithms reach a similar loss function value and test accuracy, yet the solutions behave differently outside the original data range. We conjecture that this phenomenon will be even stronger in higher dimensions.

The example above suggests that HOM-SGLD may be more robust than SGLD, but more research is needed to understand the behavior of all the algorithms. We illustrate this point with a slightly different numerical example: instead of using the Verlet scheme to obtain a discretized neural network, we use a simple explicit Euler scheme, with the same step-size and all other parameters unchanged. As can be seen in Figure 7(d), the HOM-SGLD scheme no longer performs as well as before (the other algorithms are even worse in this case). From these experiments we conclude that robustness and stability of both the underlying algorithm and the model are necessary to obtain results that generalize.

Figure 7: (a) The concentric ellipse classification problem. Classification probabilities obtained with (b) SGLD and (c) HOM-SGLD. (d) Failure of the explicit Euler scheme with HOM-SGLD.

7 Conclusions

We conclude by returning to the question posed in the title. In our view, the techniques described in this paper suggest that designing algorithms using interacting agents is a very effective smoothing strategy for loss functions with rough landscapes. While it is not yet clear how this will benefit machine learning applications in practice, our examples show that the method is promising.

There are many interesting avenues for extending the material in this paper. For instance, one could investigate interaction potentials beyond the quadratic one considered here. Such potentials were shown in Benning et al. (2017) to encourage coarse changes to the dynamics, which may be desirable in non-convex optimization. In addition, there are many aspects related to the numerical analysis that could be improved, including discretization methods, error analysis and possible ways of preconditioning. For the latter, the parallel work in Kovachki and Stuart (2018) is in a similar spirit, and it will be interesting to see how it can be combined with Algorithm 1. Finally, for the proposed method to be useful in practice, one must address the additional computation and communication costs required to manage agents that interact over a network.

In this paper we showed that, by considering weakly interacting agents and averaging, we can smooth the loss function and improve the performance of the standard SGLD. A judicious choice of interaction between agents can be used to improve the performance of algorithms in, e.g., sampling, filtering and control applications (Del Moral (2013)). On the other hand, agent-based models are routinely used for the quantitative description of phenomena in the social sciences (Pareschi and Toscani (2013)). The connection between these two viewpoints is currently under investigation by our group.

Acknowledgments

This research was funded in part by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material does not constitute a solicitation or offer in any jurisdiction. The work of the authors was partly funded by Engineering & Physical Sciences Research Council grants EP/M028240/1, EP/P031587/1, EP/L024926/1 and EP/L020564/1.

References