 # Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Learning in the Big Data Regime

Stochastic Gradient Hamiltonian Monte Carlo (SGHMC) is a momentum version of stochastic gradient descent with properly injected Gaussian noise to find a global minimum. In this paper, a non-asymptotic convergence analysis of SGHMC is given in the context of non-convex optimization, where subsampling techniques are used over an i.i.d. dataset for gradient updates. Our results complement those of [RRT17] and improve on those of [GGZ18].


## 1 Introduction

Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space where all the random objects of this paper will be defined. The expectation of a random variable $X$ with values in a Euclidean space will be denoted by $\mathbb{E}[X]$.

We consider the following optimization problem

$$F^* := \min_{x \in \mathbb{R}^d} F(x), \quad \text{where } F(x) := \mathbb{E}[f(x,Z)] = \int_{\mathcal{Z}} f(x,z)\,\mu(dz), \quad x \in \mathbb{R}^d, \tag{1}$$

and $Z$ is a random element in some measurable space $\mathcal{Z}$ with an unknown probability law $\mu$. The function $x \mapsto f(x,z)$ is assumed continuously differentiable (for each $z \in \mathcal{Z}$) but it can possibly be non-convex. Suppose that one has access to i.i.d. samples $\mathbf{Z} = (Z_1, \dots, Z_n)$ drawn from $\mu$, where $n$ is fixed. Our goal is to compute an approximate minimizer $X^\dagger$ such that the population risk

$$\mathbb{E}[F(X^\dagger)] - F^*$$

is minimized, where the expectation is taken with respect to the training data and the additional randomness generating $X^\dagger$.

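Since the distribution of $Z$ is unknown, we consider the empirical risk minimization problem

$$\min_{x \in \mathbb{R}^d} F_z(x), \quad \text{where } F_z(x) := \frac{1}{n} \sum_{i=1}^{n} f(x, z_i), \tag{2}$$

using the dataset $z := \{z_1, \dots, z_n\}$.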

Stochastic gradient algorithms based on Langevin Monte Carlo have gained more attention in recent years. Two popular algorithms are Stochastic Gradient Langevin Dynamics (SGLD) and Stochastic Gradient Hamiltonian Monte Carlo (SGHMC). First, we summarize the use of SGLD in optimization, as presented in [RRT17]. Consider the overdamped Langevin stochastic differential equation

$$dX_t = -\nabla F_z(X_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t, \tag{3}$$

where $(B_t)_{t \ge 0}$ is the standard Brownian motion in $\mathbb{R}^d$ and $\beta > 0$ is the inverse temperature parameter. Under suitable assumptions on $f$, the SDE (3) admits the Gibbs measure $\pi_z(dx) \propto \exp(-\beta F_z(x))\,dx$ as its unique invariant distribution. In addition, it is shown that for sufficiently large $\beta$, the Gibbs distribution concentrates around the global minimizers of $F_z$. Therefore, one can use the value of $X_t$ from (3) (or from its discretized counterpart SGLD) as an approximate solution to the empirical risk problem, provided that $t$ is large and the temperature is low.
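The concentration of the Gibbs measure can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: the one-dimensional double-well objective, the grid, and the function name `gibbs_mass_near_minimizer` are all assumptions chosen for illustration. It approximates the Gibbs measure on a grid and reports how much mass lies near the global minimizer as $\beta$ grows.

```python
import numpy as np

def gibbs_mass_near_minimizer(beta, radius=0.25):
    """Mass that the 1-D Gibbs measure pi(dx) ~ exp(-beta * F(x)) dx puts
    within `radius` of the global minimizer of a toy double-well F."""
    x = np.linspace(-3.0, 3.0, 20001)
    # Hypothetical non-convex objective: wells near x = -1 and x = +1;
    # the tilt 0.3 * x makes the left well the global minimum.
    F = (x**2 - 1.0) ** 2 + 0.3 * x
    x_star = x[np.argmin(F)]
    # Subtract min(F) before exponentiating for numerical stability.
    w = np.exp(-beta * (F - F.min()))
    w /= w.sum()  # grid approximation of the normalized density
    return w[np.abs(x - x_star) <= radius].sum()

# Mass near the global minimizer for increasing inverse temperature.
masses = [gibbs_mass_near_minimizer(b) for b in (1.0, 5.0, 25.0)]
```

In this toy setting the mass near the global minimizer increases with $\beta$, which is the qualitative behaviour the paragraph above relies on.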

In this paper, we consider the underdamped (second-order) Langevin diffusion

$$dV_t = -\gamma V_t\,dt - \nabla F_z(X_t)\,dt + \sqrt{2\gamma\beta^{-1}}\,dB_t, \tag{4}$$

$$dX_t = V_t\,dt, \tag{5}$$

where $(X_t)_{t \ge 0}$, $(V_t)_{t \ge 0}$ model the position and the momentum of a particle moving in a field of force with random force given by Gaussian noise. It is shown that under some suitable conditions on $f$, the Markov process $(X_t, V_t)_{t \ge 0}$ is ergodic and has a unique stationary distribution

$$\pi_z(dx, dv) = \frac{1}{\Gamma_z} \exp\left(-\beta\left(\frac{1}{2}\|v\|^2 + F_z(x)\right)\right) dx\,dv,$$

where $\Gamma_z$ is the normalizing constant

$$\Gamma_z = \left(\frac{2\pi}{\beta}\right)^{d/2} \int_{\mathbb{R}^d} e^{-\beta F_z(x)}\,dx.$$

It is easy to observe that the $x$-marginal distribution of $\pi_z(dx, dv)$ is the invariant distribution of (3). We consider the first-order Euler discretization of (4), (5), also called Stochastic Gradient Hamiltonian Monte Carlo (SGHMC), given as follows

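$$\overline{V}^\lambda_{k+1} = \overline{V}^\lambda_k - \lambda\left[\gamma \overline{V}^\lambda_k + \nabla F_z(\overline{X}^\lambda_k)\right] + \sqrt{2\gamma\beta^{-1}\lambda}\,\xi_{k+1}, \quad \overline{V}^\lambda_0 = v_0, \tag{6}$$

$$\overline{X}^\lambda_{k+1} = \overline{X}^\lambda_k + \lambda \overline{V}^\lambda_k, \quad \overline{X}^\lambda_0 = x_0, \tag{7}$$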

where $\lambda > 0$ is a step size parameter and $(\xi_k)_{k \in \mathbb{N}}$ is a sequence of i.i.d. standard Gaussian random vectors in $\mathbb{R}^d$. The initial condition $(x_0, v_0)$ may be random, but independent of $(\xi_k)_{k \in \mathbb{N}}$.

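In certain contexts, full knowledge of the gradient $\nabla F_z$ is not available; however, using the dataset $z$, one can construct unbiased estimates of it. In what follows, we adopt the general setting given by [RRT17]. Let $\mathcal{U}$ be a measurable space, and let $g: \mathbb{R}^d \times \mathcal{U} \to \mathbb{R}^d$ be such that for any $x \in \mathbb{R}^d$,

$$\mathbb{E}[g(x, U_z)] = \nabla F_z(x), \tag{8}$$

where $U_z$ is a random element in $\mathcal{U}$ with probability law $Q_z$. Conditionally on $\mathbf{Z} = z$, the SGHMC algorithm is defined by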

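$$V^\lambda_{k+1} = V^\lambda_k - \lambda\left[\gamma V^\lambda_k + g(X^\lambda_k, U_{z,k})\right] + \sqrt{2\gamma\beta^{-1}\lambda}\,\xi_{k+1}, \quad V^\lambda_0 = v_0, \tag{9}$$

$$X^\lambda_{k+1} = X^\lambda_k + \lambda V^\lambda_k, \quad X^\lambda_0 = x_0, \tag{10}$$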

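where $(U_{z,k})_{k \in \mathbb{N}}$ is a sequence of i.i.d. random elements in $\mathcal{U}$ with law $Q_z$. We also assume from now on that $(x_0, v_0)$, $(U_{z,k})_{k \in \mathbb{N}}$, $(\xi_k)_{k \in \mathbb{N}}$ are independent.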
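The recursion (9), (10) can be sketched in a few lines. The following is an illustrative implementation under stated assumptions, not the authors' code: the function name `sghmc`, the quadratic toy loss $f(x,z) = \frac{1}{2}\|x - z\|^2$, the minibatch estimator, and all parameter values are hypothetical choices made only to produce a runnable example.

```python
import numpy as np

def sghmc(grad_f, data, x0, v0, lam, gamma, beta, batch_size, n_iter, rng):
    """Iterate the SGHMC recursion:
    V <- V - lam * (gamma * V + g(X, U)) + sqrt(2 * gamma * lam / beta) * xi,
    X <- X + lam * V, where both updates use the previous pair (X, V)."""
    x = np.array(x0, dtype=float)
    v = np.array(v0, dtype=float)
    n = data.shape[0]
    noise_scale = np.sqrt(2.0 * gamma * lam / beta)
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=batch_size)  # uniform, with replacement
        g = grad_f(x, data[idx]).mean(axis=0)      # minibatch gradient estimate
        xi = rng.standard_normal(x.shape)          # standard Gaussian vector
        v_new = v - lam * (gamma * v + g) + noise_scale * xi
        x = x + lam * v                            # uses the old velocity
        v = v_new
    return x, v

# Toy loss (assumption): f(x, z) = 0.5 * ||x - z||^2, so grad_x f(x, z) = x - z.
def grad_quadratic(x, z_batch):
    return x - z_batch  # shape (batch, d) by broadcasting

rng = np.random.default_rng(0)
data = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(500, 2))
x_out, v_out = sghmc(grad_quadratic, data, x0=np.zeros(2), v0=np.zeros(2),
                     lam=0.01, gamma=1.0, beta=1e4, batch_size=32,
                     n_iter=5000, rng=rng)
```

For this toy loss the empirical risk is minimized at the sample mean, so with a large inverse temperature the returned position should land close to `data.mean(axis=0)`.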

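Our ultimate goal is to find approximate global minimizers to the problem (1). Let $X^\dagger := X^\lambda_k$ be the output of the algorithm (9), (10) after $k \in \mathbb{N}$ iterations, and let $(\hat{X}^*_z, \hat{V}^*_z)$ be such that $\mathcal{L}(\hat{X}^*_z, \hat{V}^*_z) = \pi_z$. The excess risk is decomposed as follows, see also [RRT17],

$$\mathbb{E}[F(X^\dagger)] - F^* = \left(\mathbb{E}[F(X^\dagger)] - \mathbb{E}[F(\hat{X}^*_{\mathbf{Z}})]\right) + \left(\mathbb{E}[F(\hat{X}^*_{\mathbf{Z}})] - \mathbb{E}[F_{\mathbf{Z}}(\hat{X}^*_{\mathbf{Z}})]\right) + \left(\mathbb{E}[F_{\mathbf{Z}}(\hat{X}^*_{\mathbf{Z}})] - F^*\right). \tag{11}$$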

The remaining part of the present paper is about finding bounds for these errors. Section 2 summarizes technical conditions and the main results. Comparison of our contributions to previous studies is discussed in Section 3. Proofs are given in Section 4.

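Notation and conventions. For $l \ge 1$, the scalar product in $\mathbb{R}^l$ is denoted by $\langle \cdot, \cdot \rangle$. We use $\|\cdot\|$ to denote the Euclidean norm (where the dimension of the space may vary). $\mathcal{B}(\mathbb{R}^l)$ denotes the Borel $\sigma$-field of $\mathbb{R}^l$. For any $\mathbb{R}^l$-valued random variable $X$ and for any $1 \le p < \infty$, let us set $\|X\|_p := \mathbb{E}^{1/p}[\|X\|^p]$. We denote by $L^p$ the set of $X$ with $\|X\|_p < \infty$. The Wasserstein distance of order $p \in [1, \infty)$ between two probability measures $\mu$ and $\nu$ on $\mathcal{B}(\mathbb{R}^l)$ is defined by

$$W_p(\mu, \nu) = \left(\inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^l} \|x - y\|^p\,d\pi(x, y)\right)^{1/p}, \tag{12}$$

where $\Pi(\mu, \nu)$ is the set of couplings of $(\mu, \nu)$, see e.g. [Vil08]. For two $\mathbb{R}^l$-valued random variables $X$ and $Y$, we denote $W_2(X, Y) := W_2(\mathcal{L}(X), \mathcal{L}(Y))$, where $\mathcal{L}(X)$ is the law of $X$. We do not indicate $l$ in the notation and it may vary.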
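To make definition (12) concrete, here is a small sketch for the special case of two empirical measures on the real line with the same number of equally weighted atoms. In that case the optimal coupling matches sorted samples, so the infimum has a closed form. The function name `wasserstein_1d` is an assumption introduced for illustration, and the sketch does not cover the general multivariate case used in the paper.

```python
import numpy as np

def wasserstein_1d(x, y, p=2):
    """W_p between two 1-D empirical measures with equally many, equally
    weighted atoms: the optimal coupling pairs the sorted samples."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

# The second sample is the first one shifted by 1.5 (in a different order).
w_shift = wasserstein_1d([0.0, 1.0, 2.0], [3.5, 2.5, 1.5])
w_same = wasserstein_1d([0.0, 1.0], [1.0, 0.0])
```

A pure shift by a constant moves every atom by the same amount, so the distance equals the size of the shift, and a reordering of the same atoms has distance zero.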

## 2 Assumptions and main results

The following conditions are required throughout the paper.

###### Assumption 2.1.

The function $f$ is continuously differentiable, takes non-negative values, and there are constants $A_0, B \ge 0$ such that for any $z \in \mathcal{Z}$,

$$\|f(0, z)\| \le A_0, \quad \|\nabla f(0, z)\| \le B.$$
###### Assumption 2.2.

There is $M > 0$ such that, for each $z \in \mathcal{Z}$,

$$\|\nabla f(x_1, z) - \nabla f(x_2, z)\| \le M \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^d.$$
###### Assumption 2.3.

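There exist constants $m > 0$, $b \ge 0$ such that

$$\langle x, \nabla f(x, z) \rangle \ge m \|x\|^2 - b, \quad \forall x \in \mathbb{R}^d,\ z \in \mathcal{Z}.$$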
###### Assumption 2.4.

For each $u \in \mathcal{U}$, it holds that

$$\|g(x_1, u) - g(x_2, u)\| \le M \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^d.$$
###### Assumption 2.5.

There exists a constant $\delta > 0$ such that for every $x \in \mathbb{R}^d$,

$$\mathbb{E}\|g(x, U_z) - \nabla F_z(x)\|^2 \le 2\delta\left(M^2 \|x\|^2 + B^2\right).$$
###### Assumption 2.6.

The law $\mu_0$ of the initial state $(x_0, v_0)$ satisfies

$$\int_{\mathbb{R}^{2d}} e^{\mathcal{V}(x, v)}\,d\mu_0(x, v) < \infty,$$

where $\mathcal{V}$ is the Lyapunov function defined in (16) below.

###### Remark 2.7.

If the set of global minimizers is bounded, we can always redefine the function $f$ to be quadratic outside a compact set containing the origin while maintaining its minimizers. Hence, Assumption 2.3 can be satisfied in practice. Assumption 2.4 means that the estimated gradient is also Lipschitz when using the same training dataset. For example, at each iteration of SGHMC, we may sample uniformly with replacement a random minibatch of size $\ell$. Then we can choose $U_z = (z_{I_1}, \dots, z_{I_\ell})$ where $I_1, \dots, I_\ell$ are i.i.d. random variables having the uniform distribution on $\{1, \dots, n\}$. The gradient estimate is thus

$$g(x, U_z) = \frac{1}{\ell} \sum_{j=1}^{\ell} \nabla f(x, z_{I_j}),$$

which is clearly unbiased, and Assumption 2.4 will be satisfied whenever Assumption 2.2 is in force. Assumption 2.5 controls the variance of the gradient estimate.
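The unbiasedness of the minibatch estimate can be checked exactly on a tiny example by averaging over all equally likely index tuples. This is only a sketch: the per-sample gradient `grad_f` (a least-squares-type loss), the dataset, and the sizes $n = 4$, $\ell = 2$ are hypothetical choices made for illustration.

```python
import itertools
import numpy as np

def grad_f(x, z):
    # Hypothetical per-sample gradient of f(x, z) = 0.5 * (<x, z> - 1)^2.
    return (x @ z - 1.0) * z

def minibatch_gradient(x, data, indices):
    # g(x, U_z): average of per-sample gradients over the drawn indices.
    return np.mean([grad_f(x, data[i]) for i in indices], axis=0)

rng = np.random.default_rng(1)
data = rng.standard_normal((4, 3))  # n = 4 samples in R^3
x = rng.standard_normal(3)
ell = 2                             # minibatch size

full_gradient = np.mean([grad_f(x, z) for z in data], axis=0)

# Sampling with replacement makes all n**ell index tuples equally likely,
# so the expectation of g is the plain average over these tuples.
all_batches = list(itertools.product(range(len(data)), repeat=ell))
expected_g = np.mean(
    [minibatch_gradient(x, data, idx) for idx in all_batches], axis=0
)
```

Enumerating the tuples avoids Monte Carlo error, so `expected_g` agrees with `full_gradient` up to floating-point rounding.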

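An auxiliary continuous-time process is needed in the subsequent analysis. For a step size $\lambda > 0$, denote by $B^\lambda_t := \lambda^{-1/2} B_{\lambda t}$ the scaled Brownian motion. Let $(\hat{V}(t, s, v), \hat{X}(t, s, x))$, $t \ge s$, be the solutions of

$$d\hat{V}(t, s, v) = -\lambda\left(\gamma \hat{V}(t, s, v) + \nabla F_z(\hat{X}(t, s, x))\right) dt + \sqrt{2\gamma\lambda\beta^{-1}}\,dB^\lambda_t, \tag{13}$$

$$d\hat{X}(t, s, x) = \lambda \hat{V}(t, s, v)\,dt, \tag{14}$$

with initial condition $\hat{V}(s, s, v) = v$, $\hat{X}(s, s, x) = x$, where $(v, x)$ may be random but independent of $(B^\lambda_t)_{t \ge 0}$.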

Our first result tracks the discrepancy between the SGHMC algorithm (9), (10) and the auxiliary processes (13), (14).

###### Theorem 2.8.

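There exists a constant $\tilde{C} > 0$ such that for all $k \in \mathbb{N}$,

$$W_2\left((V^\lambda_k, X^\lambda_k), (\hat{V}(k, 0, v_0), \hat{X}(k, 0, x_0))\right) \le \tilde{C}\left(\sqrt{\lambda} + \sqrt{\delta}\right). \tag{15}$$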
###### Proof.

The proof of this theorem is given in Section 4.2. ∎

The following is the main result of the paper.

###### Theorem 2.9.

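Suppose that the SGHMC iterates $(X^\lambda_k, V^\lambda_k)$ are defined by (9), (10). The expected population risk can be bounded as

$$\mathbb{E}[F(X^\lambda_k)] - F^* \le \mathcal{B}_1 + \mathcal{B}_2 + \mathcal{B}_3,$$

where

$$\begin{aligned}
\mathcal{B}_1 &:= (M\sigma + B)\left(\tilde{C}\left(\sqrt{\lambda} + \sqrt{\delta}\right) + C_* \sqrt{W_\rho(\mu_0, \pi_z)}\,\exp(-c_* k \lambda)\right), \\
\mathcal{B}_2 &:= \frac{4\beta c_{LS}}{n}\left(\frac{M^2}{m}\left(b + \frac{d}{\beta}\right) + B^2\right), \\
\mathcal{B}_3 &:= \frac{d}{2\beta} \log\left(\frac{eM}{m}\left(\frac{b\beta}{d} + 1\right)\right),
\end{aligned}$$

where $\tilde{C}, C_*, c_*, c_{LS}, \sigma$ are appropriate constants and $W_\rho$ is the metric defined in (17) below.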

###### Proof.

The proof of this theorem is given in Section 4.3. ∎

###### Corollary 2.10.

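Let $\varepsilon > 0$. We have

$$W_2(\mathcal{L}(X^\lambda_k), \pi_z) \le \varepsilon$$

whenever

$$\lambda + \delta = O(\varepsilon^2), \quad k = O\left(\frac{1}{\varepsilon^2} \log\left(\frac{1}{\varepsilon}\right)\right).$$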
###### Proof.

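From the proof of Theorem 2.9, or more precisely from (43), we need to choose $\lambda$, $\delta$ and $k$ such that

$$\tilde{C}\left(\sqrt{\lambda} + \sqrt{\delta}\right) + C_* \sqrt{W_\rho(\mu_0, \pi_z)}\,\exp(-c_* k \lambda) \le \varepsilon.$$

First, we choose $\lambda$ and $\delta$ so that $\tilde{C}(\sqrt{\lambda} + \sqrt{\delta}) \le \varepsilon/2$, and then

$$C_* \sqrt{W_\rho(\mu_0, \pi_z)}\,\exp(-c_* k \lambda) \le \varepsilon/2$$

will hold for $k$ large enough. ∎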
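The two-step choice in the proof can be turned into a small calculator. This is a sketch under stated assumptions: the function name `choose_step_and_iterations` is introduced here, and the numerical values of the constants (written `C_tilde`, `C_star`, `c_star`, `W_rho0` for $W_\rho(\mu_0, \pi_z)$) are hypothetical placeholders, since the paper gives them only implicitly.

```python
import math

def choose_step_and_iterations(eps, delta, C_tilde, C_star, c_star, W_rho0):
    """Pick a step size lam and iteration count k so that each of the two
    error terms in the proof is at most eps / 2."""
    # Step 1: need C_tilde * (sqrt(lam) + sqrt(delta)) <= eps / 2.
    budget = eps / (2.0 * C_tilde) - math.sqrt(delta)
    if budget <= 0:
        raise ValueError("delta too large: need C_tilde * sqrt(delta) < eps / 2")
    lam = budget ** 2
    # Step 2: need C_star * sqrt(W_rho0) * exp(-c_star * k * lam) <= eps / 2.
    ratio = 2.0 * C_star * math.sqrt(W_rho0) / eps
    k = math.ceil(math.log(ratio) / (c_star * lam)) if ratio > 1.0 else 0
    return lam, k

# Placeholder constants (assumptions, for illustration only).
eps, delta = 0.1, 1e-4
C_tilde, C_star, c_star, W_rho0 = 2.0, 5.0, 0.5, 4.0
lam, k = choose_step_and_iterations(eps, delta, C_tilde, C_star, c_star, W_rho0)
```

Since `lam` scales like $\varepsilon^2$ and the logarithm like $\log(1/\varepsilon)$, the resulting `k` grows like $\varepsilon^{-2}\log(1/\varepsilon)$, matching the corollary.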

## 3 Related work and our contributions

Non-asymptotic convergence rates of Langevin dynamics based algorithms for approximate sampling from log-concave distributions have been intensively studied in recent years. For example, overdamped Langevin dynamics are discussed in [WT11], [Dal17b], [DM16], [DK17], [DM17] and others. Recently, [BCM18] treated the case of non-i.i.d. data streams with a certain mixing property. Underdamped Langevin dynamics are examined in [CFG14], [Nea11], [CCBJ17], etc. Further analyses of HMC are given in [BBLG17], [Bet17]. Subsampling methods are applied to speed up HMC for large datasets, see [DQK17], [QKVT18].

The use of momentum to accelerate optimization methods is discussed intensively in the literature, for example in [AP16]. In particular, SGHMC has been observed experimentally to perform better than SGLD in many applications, see [CDC15], [CFG14]. An important advantage of the underdamped SDE is that convergence to its stationary distribution is faster than that of the overdamped SDE in the $2$-Wasserstein distance, as shown in [EGZ17].

Finding an approximate minimizer is similar to sampling from distributions that concentrate around the true minimizer. This well-known connection gave rise to the study of simulated annealing algorithms, see [Hwa80], [Gid85], [Haj85], [CHS87], [HKS89], [GM91], [GM93]. Recently, many studies have further investigated this connection by means of non-asymptotic convergence of Langevin based algorithms in stochastic non-convex optimization and large-scale data analysis, see [CCG16], [Dal17a].

Relaxing convexity is a more challenging issue. [CCAY18] studies the problem of sampling from a target distribution proportional to $\exp(-U)$, where $U$ is $L$-smooth everywhere and $m$-strongly convex outside a ball of finite radius. They provide upper bounds for the number of steps needed to be within a given precision level of the 1-Wasserstein distance between the HMC algorithm and the equilibrium distribution.

Our work continues these lines of research; the setting most similar to ours is that of the recent paper [GGZ18]. We summarize our contributions below:

• Diffusion approximation. In Lemma 10 of [GGZ18], the upper bound for the 2-Wasserstein distance between the SGHMC algorithm at step $k$ and the underdamped SDE at time $k\lambda$ is (up to constants) given by

$$\left(\delta^{1/4} + \lambda^{1/4}\right) \sqrt{k\lambda}\,\sqrt{\log(k\lambda)},$$

which depends on the number of iterations $k$. Therefore obtaining a precision $\varepsilon$ requires a careful choice of $k$, $\lambda$ and even $\delta$. By introducing the auxiliary SDEs (13), (14), we are able to improve this bound to

$$\tilde{C}\left(\sqrt{\lambda} + \sqrt{\delta}\right),$$

see Theorem 2.8. This upper bound is better not only in convergence rate for both step size ($\lambda^{1/2}$ vs. $\lambda^{1/4}$) and variance ($\delta^{1/2}$ vs. $\delta^{1/4}$) but also in the number of iterations. This improves Lemma 10 and hence Theorem 2 of [GGZ18]. Our analysis of the variance of the algorithm is also different. The iteration does not accumulate mean squared errors as the number of steps goes to infinity.

• Our proof of Theorem 2.8 is relatively simple and we do not need to adopt the techniques of [RRT17], which involve heavy functional analysis; e.g. the weighted Csiszár-Kullback-Pinsker inequalities in [BV05] are not needed.

• Thanks to the big data regime, the dependence structure of the dataset in the sampling mechanism can be arbitrary, see the proof of Theorem 2.8. The i.i.d. assumption on the dataset is used only for the generalization error. We could also incorporate non-i.i.d. data in our analysis, see Remark 4.5, but this is left for further research.

## 4 Proofs

### 4.1 A contraction result

In this section, we recall a contraction result of [EGZ17]. First, it should be noticed that the constant $u$ and the function $U$ in their paper are $\beta^{-1}$ and $\beta F_z$ in the present paper, respectively. Here, the subscript $c$ stands for “contraction”. Using the upper bound of Lemma 5.1 for $F_z$ below, there exist constants $\lambda_c \in (0, 1/4]$ small enough and $A_c \ge 0$ such that

$$\langle x, \nabla F_z(x) \rangle \ge m\|x\|^2 - b \ge 2\lambda_c\left(F_z(x) + \gamma^2 \|x\|^2 / 4\right) - 2A_c/\beta.$$

Therefore, Assumption 2.1 of [EGZ17] is satisfied, noting that $L_c := \beta M$ and

$$\|\nabla F_z(x) - \nabla F_z(y)\| \le \beta^{-1} L_c \|x - y\|.$$

We define the Lyapunov function

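$$\mathcal{V}(x, v) = \beta F_z(x) + \frac{\beta}{4}\gamma^2\left(\|x + \gamma^{-1} v\|^2 + \|\gamma^{-1} v\|^2 - \lambda_c \|x\|^2\right). \tag{16}$$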

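For any $(x_1, v_1), (x_2, v_2) \in \mathbb{R}^{2d}$, we set

$$\begin{aligned}
r((x_1, v_1), (x_2, v_2)) &= \alpha_c \|x_1 - x_2\| + \|x_1 - x_2 + \gamma^{-1}(v_1 - v_2)\|, \\
\rho((x_1, v_1), (x_2, v_2)) &= f\big(r((x_1, v_1), (x_2, v_2))\big)\left(1 + \varepsilon_c \mathcal{V}(x_1, v_1) + \varepsilon_c \mathcal{V}(x_2, v_2)\right),
\end{aligned}$$

where $\alpha_c, \varepsilon_c$ are suitable positive constants to be fixed later and $f: [0, \infty) \to [0, \infty)$ is a continuous, non-decreasing concave function such that $f(0) = 0$, $f$ is $C^2$ on $(0, R_1)$ for some constant $R_1 > 0$ with right-sided derivative $f'_+(0) = 1$ and left-sided derivative $f'_-(R_1) > 0$, and $f$ is constant on $[R_1, \infty)$. For any two probability measures $\mu, \nu$ on $\mathbb{R}^{2d}$, we define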

$$W_\rho(\mu, \nu) := \inf_{(X_1, V_1) \sim \mu,\ (X_2, V_2) \sim \nu} \mathbb{E}\left[\rho((X_1, V_1), (X_2, V_2))\right]. \tag{17}$$

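Note that $\rho$ and $W_\rho$ are semimetrics but not necessarily metrics. A result from [EGZ17] is recalled below.

For a probability measure $\mu$ on $\mathcal{B}(\mathbb{R}^{2d})$, we denote by $\mu p_t$ the law of $(X_t, V_t)$ when $\mathcal{L}(X_0, V_0) = \mu$.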

###### Theorem 4.1.

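There exists a continuous non-decreasing concave function $f$ with $f(0) = 0$ such that for all probability measures $\mu, \nu$ on $\mathbb{R}^{2d}$, we have

$$W_2(\mu p_t, \nu p_t) \le C_* \sqrt{W_\rho(\mu, \nu)}\,\exp(-c_* t), \quad \forall t \ge 0, \tag{18}$$

where the following relations hold:

$$\begin{aligned}
c_* &= \frac{\gamma}{768} \min\left\{\lambda_c L_c \beta^{-1} \gamma^{-2},\ \Lambda_c^{1/2} e^{-\Lambda_c} L_c \beta^{-1} \gamma^{-2},\ \Lambda_c^{1/2} e^{-\Lambda_c}\right\}, \\
C_* &= \sqrt{2}\,e^{1 + \Lambda_c/2}\,\frac{1 + \gamma}{\min\{1, \alpha_c\}} \left(\max\left\{1,\ \frac{4(1 + 2\alpha_c + 2\alpha_c^2)(d + A_c)\beta^{-1}\gamma^{-1} c_*^{-1}}{\min\{1, R_1\}}\right\}\right)^{1/2}, \\
\Lambda_c &= \frac{12}{5}(1 + 2\alpha_c + 2\alpha_c^2)(d + A_c) L_c \beta^{-1} \gamma^{-2} \lambda_c^{-1} (1 - 2\lambda_c)^{-1}, \\
\alpha_c &= (1 + \Lambda_c^{-1}) L_c \beta^{-1} \gamma^{-2} > 0, \\
\varepsilon_c &= 4\gamma^{-1} c_* / (d + A_c) > 0, \\
R_1 &= 4 \cdot (6/5)^{1/2} (1 + 2\alpha_c + 2\alpha_c^2)^{1/2} (d + A_c)^{1/2} \beta^{-1/2} \gamma^{-1} (\lambda_c - 2\lambda_c^2)^{-1/2}.
\end{aligned}$$

The function $f$ is constant on $[R_1, \infty)$ and $C^2$ on $(0, R_1)$, with

$$\begin{aligned}
f(r) &= \int_0^{r \wedge R_1} \varphi(s) g(s)\,ds, \\
\varphi(s) &= \exp\left(-(1 + \eta_c) L_c s^2 / 8 - \gamma^2 \beta \varepsilon_c \max\{1, (2\alpha_c)^{-1}\}\,s^2 / 2\right), \\
g(r) &= 1 - \frac{9}{4} c_* \gamma \beta \int_0^r \Phi(s) \varphi(s)^{-1}\,ds, \qquad \Phi(s) = \int_0^s \varphi(x)\,dx,
\end{aligned}$$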

and satisfies .

###### Proof.

See Theorem 2.3 and Corollary 2.6 of [EGZ17]. ∎

It should be emphasized that , and consequently, contracts at the rate .

### 4.2 Proof of Theorem 2.8

###### Proof.

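For each $k \in \mathbb{N}$, we define

$$\mathcal{H}_k := \sigma(U_{z,i},\ 1 \le i \le k) \vee \sigma(\xi_j,\ j \in \mathbb{N}).$$

Let $\tilde{v}, \tilde{x}$ be $\mathbb{R}^d$-valued random variables satisfying Assumption 2.6. For $0 \le i \le j$, we recursively define $\tilde{V}^\lambda(i, i, \tilde{v}) := \tilde{v}$, $\tilde{X}^\lambda(i, i, \tilde{x}) := \tilde{x}$, and

$$\tilde{V}^\lambda(j+1, i, \tilde{v}) = \tilde{V}^\lambda(j, i, \tilde{v}) - \lambda\left[\gamma \tilde{V}^\lambda(j, i, \tilde{v}) + \nabla F_z(\tilde{X}^\lambda(j, i, \tilde{x}))\right] + \sqrt{2\gamma\beta^{-1}\lambda}\,\xi_{j+1}, \tag{19}$$

$$\tilde{X}^\lambda(j+1, i, \tilde{x}) = \tilde{X}^\lambda(j, i, \tilde{x}) + \lambda \tilde{V}^\lambda(j, i, \tilde{v}). \tag{20}$$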

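Let $T := \lfloor 1/\lambda \rfloor$. For each $n \in \mathbb{N}$, and for each $nT \le k < (n+1)T$, we set

$$\tilde{V}^\lambda_k := \tilde{V}^\lambda(k, nT, V^\lambda_{nT}), \quad \tilde{X}^\lambda_k := \tilde{X}^\lambda(k, nT, X^\lambda_{nT}). \tag{21}$$

We estimate, for $nT \le k < (n+1)T$,

$$\|V^\lambda_k - \tilde{V}^\lambda_k\| \le \lambda \left\| \sum_{i=nT}^{k-1} \left(g(X^\lambda_i, U_{z,i}) - \nabla F_z(\tilde{X}^\lambda_i)\right) \right\|$$

and

$$\|X^\lambda_k - \tilde{X}^\lambda_k\| \le \lambda \sum_{i=nT}^{k-1} \|V^\lambda_i - \tilde{V}^\lambda_i\|. \tag{22}$$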

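Denote $g_{k,nT}(x) := \mathbb{E}[g(x, U_{z,k}) \mid \mathcal{H}_{nT}]$, $x \in \mathbb{R}^d$. By the triangle inequality and Assumption 2.4, the estimation continues as follows

$$\begin{aligned}
\|V^\lambda_k - \tilde{V}^\lambda_k\| &\le \lambda \sum_{i=nT}^{k-1} \left\|g(X^\lambda_i, U_{z,i}) - g(\tilde{X}^\lambda_i, U_{z,i})\right\| \\
&\quad + \lambda \left\| \sum_{i=nT}^{k-1} \left(g(\tilde{X}^\lambda_i, U_{z,i}) - g_{i,nT}(\tilde{X}^\lambda_i)\right) \right\| + \lambda \sum_{i=nT}^{k-1} \left\|g_{i,nT}(\tilde{X}^\lambda_i) - \nabla F_z(\tilde{X}^\lambda_i)\right\| \\
&\le \lambda L \sum_{i=nT}^{k-1} \|X^\lambda_i - \tilde{X}^\lambda_i\| + \lambda \max_{nT \le m < (n+1)T} \left\| \sum_{i=nT}^{m} \left(g(\tilde{X}^\lambda_i, U_{z,i}) - g_{i,nT}(\tilde{X}^\lambda_i)\right) \right\| \\
&\quad + \lambda \sum_{i=nT}^{(n+1)T-1} \left\|g_{i,nT}(\tilde{X}^\lambda_i) - \nabla F_z(\tilde{X}^\lambda_i)\right\|.
\end{aligned} \tag{23}$$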

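Using (22), one obtains

$$\sum_{i=nT}^{k-1} \|X^\lambda_i - \tilde{X}^\lambda_i\| \le \lambda T \|V^\lambda_{nT} - \tilde{V}^\lambda_{nT}\| + \dots + \lambda T \|V^\lambda_{k-1} - \tilde{V}^\lambda_{k-1}\| \le \sum_{i=nT}^{k-1} \|V^\lambda_i - \tilde{V}^\lambda_i\|, \tag{24}$$

noting that $\lambda T \le 1$. Therefore, the estimation in (23) continues as

$$\begin{aligned}
\|V^\lambda_k - \tilde{V}^\lambda_k\| &\le \lambda L \sum_{i=nT}^{k-1} \|V^\lambda_i - \tilde{V}^\lambda_i\| + \lambda \max_{nT \le m < (n+1)T} \left\| \sum_{i=nT}^{m} \left(g(\tilde{X}^\lambda_i, U_{z,i}) - g_{i,nT}(\tilde{X}^\lambda_i)\right) \right\| \\
&\quad + \lambda \sum_{i=nT}^{(n+1)T-1} \left\|g_{i,nT}(\tilde{X}^\lambda_i) - \nabla F_z(\tilde{X}^\lambda_i)\right\|.
\end{aligned}$$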

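Applying the discrete-time version of Grönwall’s lemma and taking squares, noting also that $(x + y)^2 \le 2(x^2 + y^2)$ for $x, y \in \mathbb{R}$, yield

$$\|V^\lambda_k - \tilde{V}^\lambda_k\|^2 \le 2\lambda^2 e^{2LT\lambda} \left[\max_{nT \le m < (n+1)T} \left\| \sum_{i=nT}^{m} \left(g(\tilde{X}^\lambda_i, U_{z,i}) - g_{i,nT}(\tilde{X}^\lambda_i)\right) \right\|^2 + \Xi_n^2\right],$$

where

$$\Xi_n := \sum_{i=nT}^{(n+1)T-1} \left\|g_{i,nT}(\tilde{X}^\lambda_i) - \nabla F_z(\tilde{X}^\lambda_i)\right\|. \tag{25}$$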
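For readers who want the Grönwall step spelled out, the following is a sketch of the argument as we read it. The shorthand $a_k$ and $c_n$ is introduced here only for illustration and does not appear in the original. Write $a_k := \|V^\lambda_k - \tilde{V}^\lambda_k\|$ and collect the two terms that do not depend on $k$ into

$$c_n := \lambda \max_{nT \le m < (n+1)T} \left\| \sum_{i=nT}^{m} \left(g(\tilde{X}^\lambda_i, U_{z,i}) - g_{i,nT}(\tilde{X}^\lambda_i)\right) \right\| + \lambda\,\Xi_n.$$

The preceding estimate then reads $a_k \le c_n + \lambda L \sum_{i=nT}^{k-1} a_i$ for $nT \le k < (n+1)T$, and induction on $k$ (the discrete Grönwall lemma) gives

$$a_k \le c_n (1 + \lambda L)^{k - nT} \le c_n\,e^{\lambda L T}.$$

Squaring and using $(x + y)^2 \le 2x^2 + 2y^2$ on the two terms of $c_n$ gives $a_k^2 \le 2\lambda^2 e^{2LT\lambda}\left[\max(\cdot)^2 + \Xi_n^2\right]$, which is the displayed bound.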

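Taking conditional expectation with respect to $\mathcal{H}_{nT}$ and using $\lambda T \le 1$, the estimation becomes

$$\begin{aligned}
\mathbb{E}\left[\|V^\lambda_k - \tilde{V}^\lambda_k\|^2 \,\middle|\, \mathcal{H}_{nT}\right] &\le 2\lambda^2 e^{2L}\,\mathbb{E}\left[\max_{nT \le m < (n+1)T} \left\| \sum_{i=nT}^{m} \left(g(\tilde{X}^\lambda_i, U_{z,i}) - g_{i,nT}(\tilde{X}^\lambda_i)\right) \right\|^2 \,\middle|\, \mathcal{H}_{nT}\right] \\
&\quad + 2\lambda^2 e^{2L}\,\mathbb{E}\left[\Xi_n^2 \mid \mathcal{H}_{nT}\right].
\end{aligned}$$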

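Since the random variables $U_{z,i}$ are independent, the random variables $g(\tilde{X}^\lambda_i, U_{z,i}) - g_{i,nT}(\tilde{X}^\lambda_i)$, $nT \le i < (n+1)T$, are independent conditionally on $\mathcal{H}_{nT}$, noting that $\tilde{X}^\lambda_i$ is measurable with respect to $\mathcal{H}_{nT}$. In addition, they have zero mean by the tower property of conditional expectation. By Assumption 2.4,

$$\|g(x, u)\| \le M\|x\| + \|g(0, u)\|$$

and thus

$$\mathbb{E}\left[\|g(\tilde{X}^\lambda_i, U_{z,i})\|^2 \mid \mathcal{H}_{nT}\right] \le 2M\,\mathbb{E}\left[\|\tilde{X}^\lambda_i\|^2\right] + 2\,\mathbb{E}\left[\|g(0, U_{z,i})\|^2\right]$$

by the independence of $U_{z,i}$ from $\mathcal{H}_{nT}$. From Assumptions 2.1, 2.5, and from Lemma 5.1, we deduce that

$$\mathbb{E}\left[\|g(0, U_{z,i})\|^2\right] \le 2\,\mathbb{E}\left[\|g(0, U_{z,i}) - \nabla F_z(0)\|^2\right] + 2\,\mathbb{E}\left[\|\nabla F_z(0)\|^2\right] \le 2\delta B^2 + 2B^2 =: c_1.$$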

Therefore,

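$$\mathbb{E}\left[\|g(\tilde{X}^\lambda_i, U_{z,i})\|^2 \mid \mathcal{H}_{nT}\right] \le 2M\,\mathbb{E}\left[\|\tilde{X}^\lambda_i\|^2\right] + 2c_1. \tag{26}$$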

Doob’s inequality and (26) imply

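$$\mathbb{E}\left[\max_{nT \le m < (n+1)T} \left\| \sum_{i=nT}^{m} \left(g(\tilde{X}^\lambda_i, U_{z,i}) - g_{i,nT}(\tilde{X}^\lambda_i)\right) \right\|^2 \,\middle|\, \mathcal{H}_{nT}\right] \le 8M \sum_{i=nT}^{(n+1)T-1} \mathbb{E}\left[\|\tilde{X}^\lambda_i\|^2\right] + 8c_1 T.$$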

Taking one more expectation and using Lemma 5.3 give

 E⎡⎣maxnT≤m<(n+1)T∥∥ ∥∥m∑i=nTg(~Xλi,Uz,i)−gi,nT(~Xλi)∥∥ ∥∥2⎤⎦ ≤ 8M(n+1)T−1∑i=nTE[∥~Xλi∥2]+8