A Bayesian Variational Framework for Stochastic Optimization

05/05/2019 ∙ by Philippe Casgrain, et al. ∙ 0

This work proposes a theoretical framework for stochastic optimization algorithms, based on a continuous Bayesian variational model for algorithms. Using techniques from stochastic control with asymmetric information, the solution to this variational problem is shown to be equivalent to a system of Forward Backward Differential Equations (FBSDEs). Using an analytical approximation to the solution of these FBSDEs, we recover a variety of existing adaptive stochastic gradient descent methods. This framework establishes a direct connection between stochastic optimization algorithms and a secondary Bayesian inference problem on gradients, where the prior and assumed observation dynamics determine the resulting algorithm.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Stochastic optimization algorithms are tools which are crucial to solving optimization problems arising in machine learning. The initial motivation for these algorithms comes from the fact that computing the gradients of a target loss function becomes increasingly difficult as the scale and dimension of an optimization problem grows larger. In these large-scale optimization problems, deterministic gradient-based optimization algorithms to perform poorly due to the computational load of repeatedly computing gradients. Stochastic optimization algorithms remedy this issue by replacing exact gradients of the target loss with a computationally cheap gradient estimator, trading off noise in gradient estimates for computational efficiency at each step.

To illustrate this idea, consider the problem of minimizing a generic risk function , taking the form


where , and where we define the set to be a set of training points. In this definition, we interpret as the model loss at a single training point for the parameters .

When and are typically large, computing the gradients of can be time-consuming. Knowing this, let us consider the path of an optimization algorithm as given by . Rather than computing directly at each point of the optimization process, we may instead collect noisy samples of gradients as


where for each , is an independent sample of size from the set of training points. We assume that is chosen small enough so that can be computed at a significantly lower cost than . Using the collection of noisy gradients , stochastic optimization algorithms construct an estimator of the gradient in order to determine the next step of the optimizer.

There exists a large body of literature on stochastic optimization algorithms. Each algorithm introduces its own variation on the gradient estimator as well as other features which can speed up convergence to an optimum. Among the simplest of these is stochastic gradient descent and its variants Robbins and Monro (1951), which use an estimator based on single gradient samples. Others, such as Lucas et al. (2018); Nesterov , use momentum and acceleration as features to enhance convergence, and can be interpreted as using exponentially weighted moving averages as gradient estimators. Adaptive gradient descent methods such as AdaGrad from Duchi et al. (2011) and Adam from Kingma and Ba (2014) use similar moving average estimators, as well as dynamically updated normalization factors. For a survey paper which covers many modern stochastic optimization methods, see Ruder (2016).

This paper presents a theoretical framework which provides new perspectives on stochastic optimization algorithms, and explores the implicit model assumptions that (!) are made by existing ones. We achieve this by extending the variational approach taken by Wibisono et al. (2016) to stochastic algorithms. The key step in our approach is to interpret the task of optimization with a stochastic algorithm as a latent variational problem. As a result, we can recover algorithms from this framework which have built-in online learning properties. In particular, these algorithms use an online Bayesian filter on the stream of noisy gradient samples, , to compute estimates of . Under various model assumptions on and , we recover a number of common stochastic optimization algorithms.

A number of related works have produced theories for various aspects of stochastic optimization. Cesa-Bianchi et al. (2004) have shown a parallel between stochastic optimization and online learning. Some previous related works, such as Gupta et al. (2017) provide a general model for adaptive methods, generalizing the subgradient projection approach of Duchi et al. (2011). Aitchison (2018) use a Bayesian theory to explain the various gradient estimators used in different algorithms . This paper differs from these works by naturally generating stochastic algorithms from a variational principle, rather than attempting to explain their individual features. This work is most similar to that of Wibisono et al. (2016) who provide a similar variational model for deterministic optimization algorithms.

The paper is structured as follows. Section 2 presents a Bayesian surrogate model for stochastic optimization. Section 3 introduces and motivates a stochastic variational problem over optimizers. Section 4 presents necessary and sufficient conditions for an optimizer to form a solution to the variational problem. Lastly, Section 5 recovers SGD, mirror descent, momentum, and other optimization algorithms as discretizations of the continuous optimality equations derived in Section 4. The proofs of the mathematical results of this paper are found within the appendices.

2 A Statistical Model for Stochastic Optimization

Over the course of the section, we present a variational model for stochastic optimization. The ultimate objective will be to construct a framwork for measuring the average performance of an algorithm over a random collection of optimization problems. We define random variables in an ambient probability space

, where is a filtration which we define at a later point in this section. We assume that loss functions are drawn from a random variable . Each draw from the random variable satisfies for fixed , and is assumed to be an almost-surely continuously differentiable in . In addition, we make the technical assumption that for all .

We define an optimizer as a controlled process satisfying for all , with initial condition . We assume that the paths of are continuously differentiable in time so that the dynamics of the optimizer may be written as , where represents the control, where we use the superscript to express the explicit dependence of on the control . We may also write the optimizer in its integral form as , demonstrating that the optimizer is entirely characterized by a pair consisting of a control process and an initial condition . Using an explicit Euler discretization with step size , the optimizer can be approximately represented through the update rule . This leads to the interpretation of as the (infinitesimal) step the algorithm takes at each point during the optimization process.

In order to capture the essence of stochastic optimization, we construct our model so that optimizers have restricted access to the gradients of the loss function . Rather than being able to directly observe over the path of , we assume that the algorithm may only use a noisy source of gradient samples, modelled by a càdlàg semi-martingale111A càdlàg (continue à droite, limite à gauche) process is a continuous time process that is almost-surely right-continuous with left limits. A semi-martingale is the sum of a process of finite variation and a local martingale. For more information on continuous time stochastic processes and these definitions, see Jacod and Shiryaev (2013) . As a simple motivating example, we can consider the model , where

is a white noise process. This particular model for the noisy gradiet process can be interpreted as consisting of observing

plus an independent source of noise. This concrete example will be useful to keep in mind to make sense of the results which we present over the course of the paper.

To make the concept of information restriction mathematically rigorous, we restrict ourselves only to optimizers which are measurable with respect to the information generated by the noisy gradient process . To do this, we first define the global filtration , as as the sigma algebra generated by the paths of as well as the realizations of the loss surface . The filtration is defined so that it contains the complete set of information generating the optimization problem unti time .

Next, we define the coarser filtration generated strictly by the paths of the noisy gradient process. This filtration represents the total set of information available to the optimizer up until time . This allows us to formally restrict the flow of information to the algorithm by restricting ourselves to optimizers which are adapted to . More precisely, we say that the optimizer’s control is admissible if


The set of optimizers generated by can be interpreted as the set of optimizers which may only use the source of noisy gradients, which have bounded expected travel distance and have square-integrable gradients over their path.

3 The Optimizer’s Variational Problem

Having defined the set of admissible optimization algorithms, we set out to select those which are optimal in an appropriate sense. We proceed similarly to Wibisono et al. (2016), by proposing an objective functional which measures the performance of the optimizer over a finite time period.

The motivation for the optimizer’s performance metric comes from a physical interpretation of the optimization process. We can think of our optimization process as a particle traveling through a potential field define by the target loss function . As the particle travels through the potential field, it may either gain or lose momentum depending on its location and velocity, which will in turn affect the particle’s trajectory. Naturally, we may seek to find the path of a particle which reaches the optimum of the loss function while minimizing the total amount of kinetic and potential energy that is spent. We therefore turn to the Lagrangian interpretation of classical mechanics, which provides a framework for obtaining solutions to this problem. Over the remainder of this section, we lay out the Lagrangian formalism for the optimization problem we defined in Section 2.

To define a notion of energy in the optimization process, we provide a measure of distance in the parameter space. We use the Bregman Divergence as the measure of distance within our parameter space, which can embed additional information about the geometry of the optimization problem. The Bregman divergence, , is defined as


where is a strictly convex function satisfying . We assume here that the gradients of are -Lipschitz smooth for a fixed constant . The choice of determines the way we measure distance, and is typically chosen so that it mimics features of the loss function . In particular, this quantity plays a central role in mirror descent and non-linear sub-gradient algorithms. For more information on this connection and on Bregman Divergence, see Nemirovsky and Yudin (1983) and Beck and Teboulle (2003).

We define the total energy in our problem as the kinetic energy, accumulated through the movement of the optimizer, and the potential energy generated by the loss function . Under the assumption that almost surely admits a global minimum , we may represent the total energy via the Bregman Lagrangian as


for fixed inputs , and where we assume that are deterministic, and satisfy . The functions

can be interpreted as hyperparameters which tune the energy present at any state of the optimization process. An important property to note is that the Lagrangian is itself a random variable due to the randomness introduced by the latent loss function


The objective is then to find an optimizer within the admissible set which can get close to the minimum , while simultaneously minimizing the energy cost over a finite time period . The approach taken in classical mechanics and in Wibisono et al. (2016) fixes the endpoint of the optimizer at . Since we assume that the function is not directly visible to our optimizer, it is not possible to add a constraint of this type that will hold almost surely. Instead, we introduce a soft constraint which penalizes the algorithm’s endpoint in proportion to its distance to the global minimum, . As such, we define the expected action functional as


where is assumed to be an additional model hyperparameter, which controls the strength of the soft constraint.

With this definition in place, the objective will be to select amongst admissible optimizers for those which minimize the expected action. Hence, we seek optimizers which solve the stochastic variational problem

Remark 1.

Note that the variational problem (7) is identical to the one with Lagrangian


and terminal penalty , since they differ by constants independent of . Because of this, the results presented in Section 4 also hold the case where and do not exist or are infinite.

4 Critical Points of the Expected Action Functional

In order to solve the variational problem (7), we make use techniques from the calculus of variations and infinite dimensional convex analysis to provide optimality conditions for the variational problem (7). To address issues of information restriction, we rely on the stochastic control techniques developed by Casgrain and Jaimungal (2018a, c, b).

The approach we take relies on the fact that a necessary condition for the optimality of a Gâteaux differentiable functional is that its Gâteaux derivative vanishes in all directions. Computing the Gâteaux derivative of , we find an equivalence between the Gâteaux derivative vanishing and a system of Forward-Backward Stochastic Differential Equations (FBSDEs), yielding a generalization of the Euler-Lagrange equations to the context of our optimization problem. The precise result is stated in Theorem 4.1 below.

Theorem 4.1 (Stochastic Euler-Lagrange Equation).

A control is a critical point of if and only if is a solution to the system of FBSDEs,


where we define the processes


and where the process is an -adapted martingale. As a consequence, if the solution to this FBSDE is unique, then it is the unique critical point of the functional up to null sets.


See Appendix C

Theorem 4.1 presents an analogue of the Euler-Lagrange equation with free terminal boundary. Rather than obtaining an ODE as in the classical result, we obtain an FBSDE, with backwards process , and forward state processes , and .

For a background on FBSDEs, we point readers to Pardoux and Tang (1999); Ma et al. (1999); Carmona (2016). At a high level, the solution to an FBSDE of the form (9) consists of a pair of processes , which simultaneously satisfy the dynamics and the boundary condition of (9). Intuitively, the martingale part of the solution can be interpreted as a random process which guides towards the boundary condition at time .

An important feature of equation (9), is that optimality relies on the projection of onto . Thus, the optimization algorithm makes use of past noisy gradient observations in order to make local gradient predictions. Local gradient predictions are updated in a Bayesian manner, where the prior model for is updated with path information contained in . Moreover, this demonstrates that the solution depends only on the gradients of along the path of and no higher order properties.

4.1 Expected Rates of Convergence of the Continuous Algorithm

Using the dynamics (9) we obtain a bound on the rate of convergence of the continuous optimization algorithm that is analogous to Wibisono et al. (2016, Theorem 2.1). We introduce the Lyapunov energy functional


where we define to be a global minimum of . Under additional model assumptions, and by showing that this quantity is a super-martingale with respect to the filtration , we obtain an upper bound for the expected rate of convergence from towards the minimum.

Theorem 4.2 (Convergence Rate).

Assume that the function is almost surely convex and that the scaling conditions and hold. Moreover, assume that in addition to having -Lipschitz smooth gradients, is also -strongly-convex with . Define to be a global minimum of . If exists almost surely, the optimizer defined by FBSDE (9) satisfies


where represents the quadratic variation of the process , where is the martingale part of the solution defined in Theorem 4.1.


See Appendix D. ∎

We can interpret the term as a penalty on the rate of convergence, which scales with the amount of noise present in our gradient observations. To see this, note that if there is no noise in our gradient observations, we obtain that , and hence , recovering the exact rate of convergence as in the deterministic case of Wibisono et al. (2016). If the noise in our gradient estimates is immense, we can expect to grow at a fast rate and to counteract the shrinking effects of . We point out, however, that there will be a nontrivial dependence of on all model hyperparameters and the specific definition of the random variable .

Remark 2.

We do not assume that the conditions of Theorem 4.2 carry throughout the remainder of the paper. In particular, Sections 5 study models which may not guarantee almost-sure convexity of the latent loss function.

5 Recovering Discrete Optimization Algorithms

In this section, we use the optimality equations of Theorem 4.1 to produce discrete stochastic optimization algorithms. The procedure we take is as follows. We first define a model for the processes . Second, we solve the optimality FBSDE (9) in closed form or approximate the solution via the first-order singular perturbation (FOSP) technique, as described in Appendix A. Lastly, we discretize the solutions with a simple Forward-Euler scheme in order to recover discrete algorithms. Over the course of Sections 5.1 and 5.2, we show that various simple models for and different specifications of produce many well-known stochastic optimization algorithms.

5.1 Stochastic Gradient Descent and Stochastic Mirror Descent

Here we propose a general martingale model on gradients which loosely represents the behavior of minibatch stochastic gradient descent with a training set of size and minibatches of size . By specifying a martingale model for , we recover the stochastic gradient descent and stochastic mirror descent algorithms as solutions to the variational problem described in Section 2.

Let us assume that , where and is a Wiener process. Next, assume that the noisy gradients samples obtained from minibatches over the course of the optimization, evolve according to the model , where and is an independent copy of . Here, we choose so that

, which allows the variance to scale in

and as it does with minibatches.

Using symmetry, we obtain the trivial solution to the gradient filter, , implying that the best estimate of the gradient at the point will be the most recent mini-batch sample observed. re-scaled by a constant depending on and . Using this expression for the filter, we obtain the following result.

Proposition 5.1.

The FOSP approximation to the solution of the optimality equations (9) can be expressed as


where is the convex dual of and where is a deterministic learning rate with . When has the form for a symmetric positive-definite matrix , the FOSP approximation is exact, and (14) is the exact solution to the optimality FBSDE (9).


See Appendix E.1. ∎

To obtain a discrete optimization algorithm from the result of 5.1, we employ a forward-Euler discretization of the ODE (14) on the finite mesh . This discretization results in the update rule


corresponding exactly to mirror descent (e.g. see Beck and Teboulle (2003)) using the noisy minibatch gradients and a time-varying learning rate . Moreover, setting , we recover the update rule , exactly corresponding to the minibatch SGD with a time-dependent learning rate.

This derivation demonstrates that the solution to the variational problem described in Section 2, under the assumption of a minimal martingale model for the evolution of gradients, recovers mirror descent and SGD. In particular, the martingale gradient model proposed in this section can be roughly interpreted as assuming that gradients behave as random walks over the path of the optimizer. Moreover, the optimal gradient filter shows that, for the algorithm to be optimal, minibatch gradients should be re-scaled in proportion to .

5.2 Kalman Gradient Descent and Momentum Methods

Using a linear state-space model for gradients, we can recover both the Kalman Gradient Descent algorithm of Vuckovic (2018) and momentum-based optimization methods of Polyak (1964). We assume that each component of is modeled independently as a linear diffusive process. Specifically, we assume that there exist processes so that for each , , where is the solution to the linear SDE . In particular, we the notation to refer to element of , and use the notation . We assume here that are positive definite matrices and each of the are independent -dimensional Brownian Motions.

Next, we assume that we may write each element of a noisy gradient process as , where and where are independent white noise processes. Noting that , we find that this model implicitly assumes that gradients are expected decrease in exponentially in magnitude as a function of time, at a rate controlled by the matrix . The parameters and can be interpreted as controlling the scale of the noise within the observation and signal processes.

Using this model, we obtain that the filter can be expressed as , where . The process is expressed as the solution to the Kalman-Bucy222For information on continuous time filtering and the Kalman-Bucy filter we refer the reader to the text of Bensoussan (2004) or the lecture notes of Van Handel (2007). filtering equations


with the initial conditions and , and where we define innovations process with the property that each is an independent -adapted Brownian motion.

Inserting the linear state space model and its filter into the optimality equations (9) we obtain the following result.

Proposition 5.2 (State-Space Model Solution to the FOSP).

Assume that the gradient state-space model described above holds. The FOSP approximation to the solution of the optimality equations (9) can be expressed as


where is a deterministic learning rate, where represents the matrix exponential, and where

can be chosen to have arbitrarily large eigenvalues by scaling



See Appendix E.2

5.2.1 Kalman Gradient Descent

In order to recover Kalman Gradient Descent, we discretize the processes and over the finite mesh , defined in equation (17). Applying a Forward-Euler-Maruyama discretization of (17) and the filtering equations (16), we obtain the discrete dynamics


where each of the and are standard Gaussian random variables of appropriate size. The filter for the discrete equations can be written as the solution to the discrete Kalman filtering equations, provided in Appendix B. Discretizing the process over with the Forward-Euler scheme, we obtain discrete dynamics for the optimizer in terms of the Kalman Filter , as


yielding a generalized version of Kalman gradient descent of Vuckovic (2018) with states for each gradient element. Setting , and recovers the original Kalman gradient descent algorithm with a time-varying learning rate.

Just as in Section 5.1, we interpret each as being a minibatch gradient, as with equation (2). The algorithm (19) computes a Kalman filter from these noisy mini-batch observations and uses it to update the optimizer’s position.

5.2.2 Momentum and Generalized Momentum Methods

By considering the asymptotic behavior of the Kalman gradient descent method described in Section 5.2.1, we recover a generalized version of momentum gradient descent methods, which includes mirror descent behavior, as well as multiple momentum states. Let us assume that remains constant in time. Then, using the asymptotic update rule for the Kalman filter, as shown in Proposition B.2, and equation (19), we obtain the update rule


where and where is defined in the statement of the Proposition B.2. This yields a generalized momentum update rule where we keep track of momentum states with , and update its position using a linear update rule. This algorithm can be seen as being most similar to the Aggregated Momentum technique of Lucas et al. (2018), which also keeps track of multiple momentum states which decay at different rates.

Under the special case where , , and we recover the exact momentum algorithm update rule of Polyak (1964) as


where we have a scalar learning rate , where , are positive scalars, and where are minibatch draws from the gradient as in equation 2.

The recovery of the momentum algorithm of Polyak (1964) has some interesting consequences. Since and are functions of the model parameters and , we obtain a direct relationship between the optimal choice for the momentum model parameters, the assumed scale of gradient noise and the assumed expected rate of decay of gradients, as given by . This result gives insight as to how momentum parameters should be chosen in terms of their prior beliefs on the optimization problem.

6 Discussion and Future Research Directions

Over the course of the paper we present a variational framework on optimizers, which interprets the task of stochastic optimization as an inference problem on a latent surface that we wish to optimize. By solving a variational problem over continuous optimizers with asymmetric information, we find that optimal algorithms should satisfy a system of FBSDEs projected onto the filtration generated by the noisy observations of the latent process.

By solving these FBSDEs and obtaining continuous-time optimizers, we find a direct relationship between the measure assigned to the latent surface and its relationship to how data is observed. In particular, assigning simple prior models to the pair of processes , recovers a number of well-known and widely used optimization algorithms. The fact that this framework can recover these algorithms in a natural way begs further study. In particular, it is still an open question whether it is possible to recover other stochastic algorithms via this framework, particularly those with second-order scaling adjustments such as ADAM or AdaGrad.

From a more technical perspective, the intent is to further explore properties of the optimization model presented here and the form of the algorithms it suggests. In particular, the optimality FBSDE 9 is nonlinear, high-dimensional and intractable in general, making it difficult to use existing FBSDE approximation techniques, so new tools may need to be developed to understand the full extent of its behavior.

Lastly, numerical work on the algorithms generated by this framework would give some insights as to which prior gradient models work well when discretized. The extension of simplectic and quasi-simplectic stochastic integrators applied to the BSDEs and SDEs that appear in this paper also has the potential for interesting future work.


Appendix A Obtaining Solutions to the Optimality FBSDE

a.1 A Momentum-Based Representation of the Optimizer Dynamics

Using a simple change of variables we may represent the dynamics of the FBSDE (9) in a simpler fashion, which will aid us in obtaining solutions to this system of equations. Let us define the momentum process as


Noting that since is convex, we have the property that , we may use equation (22) to write in terms of the momentum process as


The introduction of this process allows us to represent the solution to the optimality FBSDE (9), and by extension the optimizer, in a much more tractable way. Re-writing (9) in terms of , we find that


where the dynamics of the forward process can be expressed as


This particular change of variables corresponds exactly to the Hamiltonian representation of the optimizer’s dynamics, which we show in Appendix A.3.

Writing out the explicit solution to the FBSDE (24), we obtain a representation for the optimizer’s dynamics as


showing that optimizer’s momentum can be represented as a time-weighted average of the expected future gradients over the remainder of the optimization and the term , where the weights are determined by the choice of hyperparameters and . Noting that


we find that the additional correction term in (26) can be interpreted as the remainder in the first-order Taylor expansion of the term .

The representation (26) demonstrates optimizer does not only depend on the instantaneous value of gradients at the point . Rather, we find that the algorithm’s behaviour depends on the expected value of all future gradients that will be encountered over the remainder of the optimization process, projected onto the set of accumulated gradient information, . This is in stark contrast to most known stochastic optimization algorithms which only make explicit use of local gradient information in order to bring the optimizer towards an optimum.

a.2 First-Order Singular Perturbation Approximation

When does not take the quadratic form for some positive-definite matrix , the nonlinear dynamics of the FBSDE (9) or in the equivalent momentum form (24) make it difficult to provive a solution for general . More precisely, the Taylor expansion term (27) constitutes the main obstacle in obtaining solutions in general.

In cases where the scaling parameter is sufficiently large, we can assume that the Taylor expansion remainder term of equation (27) will become negligibly small. Hence, we may approximate the optimality dynamics of the FBSDE (24) by setting this term to zero. This can be interpreted as the first-order term in a singular perturbation expansion of the solution to the momentum FBSDE (24).

Under the assumption that the Taylor remainder term vanishes, we obtain the approximation for the momentum, which we present in the following proposition.

Proposition A.1 (First-Order Singular Perturbation (FOSP)).

The linear FBSDE


admits a solution that can be expressed as


provided that .


Noting that the remainder term in the expression (27) vanishes, we get that


Under the assumption that are continuous over and that , the right part of (30) is bounded. Now note that the integral on the left side of (30) is upper bounded for all by the integral provided in the integrability condition of Proposition A.1, and therefore this condition is a sufficient condition for the expression (30) to be finite and well-defined.

Although a general, model independent bound for the accuracy of such approximations is beyond the scope of this paper, it can still serve as a reasonable and computationally cheap alternative to attempting to solve the original problem dynamics directly with a BSDE numerical scheme. For more information on singular perturbation methods in the context of FBSDEs, see Janković et al. [2012].

a.3 Hamiltonian Representation of the Optimizer Dynamics

Just as in Hamiltonian classical mechanics, it is possible to express the optimality FBSDE of Theorem (4.1) with Hamiltonian equations of motion. We define the Hamiltonian as the Legendre dual of at, which can be written as


where . Using the identity , where is the Legendre dual of , and inverting the expression for in terms , we may compute equation (31) as333See Wibisono et al. [2016][Appendix B.4] for the full details of the computation.


Using this definition of , and using the FBSDE (9), we obtain the following equivalent representation for the dynamics of the optimizer.

Using the simple substitution and noting from equations (10) and (11) that


a straightforward computation applied to the definition of shows that the dynamics of the optimality FBSDE (9) admit the alternate Hamiltonian representation of the optimizer dynamics


along with the boundary condition .

Appendix B The Discrete Kalman Filter

Here we present the reader to the Kalman Filtering equations used in Section 5.2. Consider the model presented in equations (18),


where we use the notation and , and where and are all independent standard Gaussian random variables. We provide the Kalman filtering equations for this model in the following proposition.

Proposition B.1 (Walrand and Dimakis [2006, Theorem 10.2]).

Let . Then satisfies the recursive equation


where the matrices are obtained via the independent recursive equations