# Stochastic Gradient Descent for Semilinear Elliptic Equations with Uncertainties

Randomness is ubiquitous in modern engineering. The uncertainty is often modeled as random coefficients in the differential equations that describe the underlying physics. In this work, we describe a two-step framework for numerically solving semilinear elliptic partial differential equations with random coefficients: 1) reformulate the problem as a functional minimization problem based on the direct method of the calculus of variations; 2) solve the minimization problem using the stochastic gradient descent method. We provide the convergence criterion for the resulting stochastic gradient descent algorithm and discuss some useful techniques to overcome the issues of ill-conditioning and large variance. The accuracy and efficiency of the algorithm are demonstrated by numerical experiments.

## 1 Introduction

Many problems in science and engineering involve spatially varying input data. Often, the input data is subject to uncertainties due to inherent randomness. For example, the details of spatial variations of properties and structure of engineering materials are typically obscure, and randomness and uncertainty are fundamental features of these complex physical systems. Under these circumstances, traditional deterministic models are rarely capable of properly handling this randomness and yielding accurate predictions. Therefore, in order to furnish accurate predictions, randomness must be incorporated directly into the model and the propagation of the resulting uncertainty between its input and output must be quantified accordingly. A model, particularly important in many applications, consists of the random input data in the form of a random field and a partial differential equation (PDE). The specific model problem considered here is

$$\mathcal{L}(\kappa)(u) = 0 \quad \text{in } D, \tag{1}$$

where $\mathcal{L}$ is a nonlinear elliptic operator dependent on a random field $\kappa$ over a probability space $(\Omega, \mathcal{F}, P)$, and $u$ is a solution. Examples of (1) include, but are not limited to, flow of water through a random porous medium and modeling of the mechanical response of materials with random microstructure.

Over the past few decades, the critical need to seek solutions of (1) has yielded a wealth of numerical approaches. Numerical methods for solving PDEs with random coefficients have traditionally been classified into three major categories: stochastic collocation (SC) babuvska2007stochastic ; nobile2008sparse ; nobile2008anisotropic , stochastic Galerkin (SG) ghanem2003stochastic ; babuska2004galerkin ; gunzburger2014stochastic ; xiu2002wiener ; xiu2003modeling and Monte Carlo (MC) babuska2004galerkin ; matthies2005galerkin ; kuo2016application . SC aims to first solve the deterministic counterpart of (1) on a set of collocation points and then interpolate over the entire image space of the random element. Hence, the method is non-intrusive, meaning that it can take advantage of existing legacy solvers developed for deterministic problems. Similarly, MC is non-intrusive as well since it relies on taking the sample average over a set of deterministic solutions computed from a set of realizations of the random field. In contrast, SG is considered intrusive since it requires the simultaneous discretization of both the stochastic space and the physical space and, as a result, it commonly produces large systems of algebraic equations that must be solved. However, these algebraic systems are considerably different from their deterministic counterparts and thus deterministic legacy solvers cannot be easily utilized.

We emphasize that all of the three categories discussed so far depend on the stochastic weak formulation of (1). In this work, alternatively, we take the variational viewpoint and reformulate problem (1) as a problem of seeking minimizers of the following functional

$$\mathcal{E}(u) = \mathbb{E}\left\{\int_D I(x, u, \nabla u, \omega)\,dx\right\}, \tag{2}$$

whose Euler-Lagrange equations coincide with (1) under suitable assumptions. Therefore, over an appropriate space, solving (1) is equivalent to minimizing (2). The use of variational formulations has become widespread in many areas of science and engineering due to their many advantages struwe1990variational ; reddy2017energy . First and foremost, the equations in the weak form are often applicable in situations when the strong form may no longer be valid. A case in point is modeling of microstructure evolution in materials, where fine scale oscillations may emerge, leading to highly irregular solutions ball1989fine ; muller1999variational . Second, variational formulations are known to be remarkably convenient for numerical computation as they often produce numerical methods capable of preserving, at least to some extent, the structure of the original problem simo2006computational ; marsden2001discrete .

In order to establish the existence of minimizers of $\mathcal{E}$ in an appropriate space $W$, by the direct method of calculus of variations dacorogna2007direct , it is sufficient to identify a minimizing sequence $\{u_\nu\}$ which satisfies two properties:

1. the sequence $\{u_\nu\}$ is compact under the weak topology on $W$, i.e.,

$$u_\nu \rightharpoonup u^* \quad \text{in } W.$$

This is often implied by the boundedness of the sequence (up to the extraction of a subsequence), i.e., $\|u_\nu\|_W \le C$ for some constant $C$ independent of $\nu$.

2. the functional $\mathcal{E}$ is lower semicontinuous with respect to weak convergence, i.e.,

$$u_\nu \rightharpoonup u^* \ \text{in } W \quad \text{implies} \quad \liminf_{\nu\to\infty} \mathcal{E}(u_\nu) \ge \mathcal{E}(u^*).$$

Given the above two properties, it is straightforward to verify that the function $u^*$ is indeed a minimizer of $\mathcal{E}$. The direct method is not only of theoretical importance. From the numerical point of view, it suggests that if we can identify a minimizing sequence $\{u_\nu\}$, each element of which solves (2) over a finite dimensional subspace $W_\nu \subset W$, i.e.,

$$u_\nu = \operatorname*{arg\,min}_{u \in W_\nu} \mathcal{E}(u),$$

then the above two properties ensure that $u_\nu$ converges weakly to the minimizer $u^*$. That is, when $\{u_\nu\}$ is interpreted as a sequence of solutions to (2) over a sequence of finite dimensional spaces $\{W_\nu\}$ that approximates $W$, the direct method of variational calculus automatically guarantees the numerical consistency. Bearing this in mind, numerical approximation to (2) boils down to minimizing $\mathcal{E}$ over finite dimensional spaces with suitable optimization methods.

The fundamental difficulty in solving the stochastic optimization problem (2) is that the expectation often involves a high dimensional integral which generally cannot be computed with high accuracy nemirovski2009robust . Thus, conventional nonlinear optimization techniques are seldom suitable for problems like (2), since an inaccurate gradient estimate is usually detrimental to the convergence of the algorithms. In contrast, stochastic gradient descent (SGD) replaces the actual gradient by a noisy estimate, yet is guaranteed to converge under mild conditions bottou2010large ; bottou2018optimization ; kingma2014adam . The method can be traced back to the Robbins–Monro algorithm robbins1951stochastic and has nowadays become one of the cornerstones of large-scale machine learning bottou2018optimization . However, due to the noisy nature of the SGD iteration, a naive use of the algorithm in many instances suffers from difficult parameter tuning and an extremely slow convergence rate nemirovski2009robust . In this article, we describe an application of SGD to construct numerical schemes for the solution of the variational stochastic problem (2). We also provide simple, yet powerful, strategies for efficient and robust SGD algorithms in the above context.

The remainder of the article is organized as follows. In Section 2, we set up the semilinear model problem and impose several running assumptions on the model. The variational reformulation of the model problem as a stochastic minimization problem is described in Section 3. Afterward, in Section 4, we propose to use SGD to solve the minimization problem and discuss some useful techniques for noise reduction and convergence acceleration. Finally, numerical benchmarks are presented in Section 5.

## 2 Model problem

We introduce a probability space $(\Omega, \mathcal{F}, P)$ where $\Omega$ is the set of all outcomes, $\mathcal{F}$ is the $\sigma$-algebra consisting of all measurable events and $P$ is a probability measure. We consider the following semilinear elliptic PDE with random coefficient defined on $(\Omega, \mathcal{F}, P)$,

$$\begin{aligned} -\nabla\cdot\big(\kappa(x,\omega)\nabla u(x,\omega)\big) + f(x, u(x,\omega), \omega) &= 0, && x \in D, \\ u(x,\omega) &= 0, && x \in \partial D, \end{aligned} \tag{3}$$

where the domain $D$ is a bounded subset of $\mathbb{R}^d$, the boundary $\partial D$ is either smooth or convex and piecewise smooth, the diffusion coefficient $\kappa$ is a random field with continuous and bounded covariance functions and the nonlinear term $f$ is sufficiently smooth for almost surely all $\omega \in \Omega$. We assume the solution to (3) exists and is unique.

We deal with the case of finite dimensional noise, i.e., there exist finitely many independent random variables $Y_1, \dots, Y_K$ such that

$$\kappa(x,\omega) = \kappa\big(x, Y_1(\omega), \dots, Y_K(\omega)\big).$$

These random variables are often referred to as the stochastic germ that brings randomness into the system. From the practical perspective, the finite dimensional noise assumption is reasonable since the random input often admits a parametrization in terms of finitely many random variables. From the theoretical perspective, the assumption is motivated by the Karhunen-Loève (KL) expansion mercer1909xvi : when the random field is square integrable with continuous covariance function, i.e.,

$$\kappa(x,\cdot) \in L^2_P(\Omega), \quad \forall x \in D,$$

and the covariance function

$$\mathrm{Cov}_\kappa(x,y) = \mathbb{E}\big[(\kappa(x) - \mathbb{E}[\kappa(x)])(\kappa(y) - \mathbb{E}[\kappa(y)])\big]$$

is a well-defined continuous function of $(x, y)$, then the random field can be approximated by the truncated KL expansion

$$\kappa(x,\omega) \approx \bar{\kappa}(x) + \sum_{n=1}^{N} \sqrt{\lambda_n}\,\psi_n(x)\,Y_n(\omega),$$

where $\bar{\kappa}$ is the mean of the random field, $\{(\lambda_n, \psi_n)\}$ are eigen-pairs of the covariance operator

$$(C_\kappa g)(x) \triangleq \int_D \mathrm{Cov}_\kappa(x,y)\,g(y)\,dy$$

and $\{Y_n\}$ are uncorrelated and identically distributed random variables with mean zero and unit variance.
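As a concrete illustration of a truncated KL expansion, the following Python sketch samples a one-dimensional random field on $[0,1]$. Everything specific here is an assumption made for illustration and is not taken from the article: the covariance kernel $\min(x,y)$ (chosen because its eigen-pairs are known in closed form), the constant mean `kappa_bar`, the uniform germ, and the helper names `kl_eigenpair`, `sample_germ` and `kappa`.

```python
import math
import random

def kl_eigenpair(n):
    """Closed-form eigen-pair (lambda_n, psi_n) of the kernel min(x, y) on [0, 1]."""
    freq = (n - 0.5) * math.pi
    lam = 1.0 / freq**2
    psi = lambda x: math.sqrt(2.0) * math.sin(freq * x)
    return lam, psi

def sample_germ(num_terms, rng):
    """Draw Y_1..Y_N i.i.d. uniform on [-sqrt(3), sqrt(3)] (mean 0, variance 1)."""
    a = math.sqrt(3.0)
    return [rng.uniform(-a, a) for _ in range(num_terms)]

def kappa(x, germ, kappa_bar=3.0):
    """Truncated KL expansion: kappa_bar + sum_n sqrt(lambda_n) * psi_n(x) * Y_n."""
    total = kappa_bar
    for n, y in enumerate(germ, start=1):
        lam, psi = kl_eigenpair(n)
        total += math.sqrt(lam) * psi(x) * y
    return total

rng = random.Random(0)                 # fixed seed so the sketch is reproducible
germ = sample_germ(5, rng)             # one realization of the stochastic germ
field = [kappa(i / 10, germ) for i in range(11)]  # the field on a coarse grid
```

With five terms and `kappa_bar = 3.0`, the random perturbation is bounded by roughly 2.79 in absolute value, so this particular truncation stays positive; a bounded germ is used here precisely so that a uniform lower bound of this kind is possible.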

Consequently, when the randomness of (3) is completely characterized by finitely many independent random variables $Y = (Y_1, \dots, Y_K)$, problem (3) is equivalent to

$$\begin{aligned} -\nabla\cdot\big(\kappa(x,Y)\nabla u(x,Y)\big) + f(x, u(x,Y), Y) &= 0, && x \in D, \\ u(x,Y) &= 0, && x \in \partial D, \end{aligned} \tag{4}$$

by the Doob-Dynkin lemma oksendal2013stochastic . To define a suitable space of solutions of the above problem, we introduce the physical space $V = H_0^1(D)$, i.e., the Sobolev space of functions with weak derivatives up to order $1$ and vanishing on the boundary. We also define the stochastic space $S$, i.e., the space of $\mathbb{R}$-valued square integrable (with respect to the probability measure $P$) random variables. The solution to (4) is thus defined in the tensor space $V \otimes S$.

Now we make some technical assumptions on $\kappa$ and $f$. We denote the closure of $D$ by $\bar{D}$.

###### Assumption 1.

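$\kappa$ is uniformly bounded and uniformly coercive, i.e., there exist constants $0 < \kappa_{\min} \le \kappa_{\max} < \infty$ such that

$$P\big(\omega \in \Omega : \kappa_{\min} \le \kappa(x, Y(\omega)) \le \kappa_{\max},\ \forall x \in \bar{D}\big) = 1.$$
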
###### Assumption 2.

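$f$ is uniformly bounded, i.e., there exists a constant $f_{\max}$ such that

$$P\big(\omega \in \Omega : |f(x, u, Y(\omega))| \le f_{\max},\ \forall x \in \bar{D},\ \forall u \in \mathbb{R}\big) = 1.$$

Furthermore, $f$ is uniformly Lipschitz continuous in $u$, i.e., there exists a Lipschitz constant $L_f$ such that

$$P\big(\omega \in \Omega : |f(x, u_1, Y(\omega)) - f(x, u_2, Y(\omega))| \le L_f |u_1 - u_2|,\ \forall u_1, u_2 \in \mathbb{R},\ \forall x \in \bar{D}\big) = 1.$$

Finally, $\partial_u f$ is uniformly bounded from below, i.e., there exists a constant $\delta$ such that

$$P\big(\omega \in \Omega : \partial_u f(x, u, Y(\omega)) \ge \delta,\ \forall x \in \bar{D},\ \forall u \in \mathbb{R}\big) = 1.$$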

As we shall see hereafter, these assumptions are crucial to guarantee the convergence of the SGD algorithm.

## 3 Direct method and polynomial chaos expansion

### 3.1 Direct method in calculus of variations

Following basic ideas of the calculus of variations, our starting point is to reformulate the stochastic PDE problem (4) as the minimization problem

$$\min_{u \in V \otimes S} \mathcal{E}(u) = \mathbb{E}\left\{\int_D \frac{1}{2}\kappa(x,Y)\,|\nabla_x u(x,Y)|^2 + F(x, u(x,Y), Y)\,dx\right\} \tag{5}$$

with $F$ an antiderivative of $f$ in $u$, i.e., $\partial_u F(x,u,Y) = f(x,u,Y)$, and the expectation $\mathbb{E}$ taken with respect to the random vector $Y$. We first show that (5) has a minimizer and that this minimizer satisfies the weak form of (4). The result is a simple application of the direct method in variational calculus dacorogna2007direct .

###### Theorem 3.1.

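Under Assumptions 1 and 2, and assuming additionally that $F$ is uniformly bounded from below, the problem (5) has a minimizer $u^* \in V \otimes S$. Furthermore, $u^*$ satisfies the following weak form of (4):

$$\mathbb{E}\left\{\int_D \kappa(x,Y)\,\nabla_x u^*(x,Y)\cdot\nabla_x v(x,Y) + f(x, u^*, Y)\,v(x,Y)\,dx\right\} = 0, \quad \forall v \in V \otimes S. \tag{6}$$
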
###### Proof.

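As we sketched in the introduction, it is sufficient to show that 1) there exists a minimizing sequence $\{u_\nu\}$ that converges weakly to some $u^* \in V \otimes S$ and 2) the functional $\mathcal{E}$ is (weakly) lower semicontinuous dacorogna2007direct .

For weak convergence of the sequence $\{u_\nu\}$, it suffices to uniformly bound $u_\nu$ under the $V \otimes S$ norm. To this end, note that by Assumption 1 and the uniform lower boundedness of $F$,

$$\frac{1}{2}\kappa(x,Y)\,|\nabla_x u_\nu(x,Y)|^2 + F(x, u_\nu, Y) \ge \frac{1}{2}\kappa_{\min}\,|\nabla_x u_\nu(x,Y)|^2 + F_{\min}.$$

Integrating over $D$ and taking expectation of both sides lead to

$$\mathcal{E}(u_\nu) \ge \frac{1}{2}\kappa_{\min}\,\mathbb{E}\left\{\int_D |\nabla_x u_\nu(x,Y)|^2\,dx\right\} + \int_D F_{\min}\,dx.$$

Invoking Poincaré's inequality, there exist some constants $C_1$ and $C_2$ such that

$$\mathcal{E}(u_\nu) \ge C_1\,\mathbb{E}\left\{\|u_\nu\|_{H_0^1}^2\right\} + C_2.$$

Now note that $\mathcal{E}(u_\nu)$ is uniformly bounded (in $\nu$) since $\{u_\nu\}$ is a minimizing sequence of (5), which implies that $\mathbb{E}\{\|u_\nu\|_{H_0^1}^2\}$ is uniformly bounded and hence there exists $u^* \in V \otimes S$ such that

$$u_\nu \rightharpoonup u^* \quad \text{in } V \otimes S.$$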

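Next, we justify that $\mathcal{E}$ is (weakly) lower semicontinuous. To this end, we define

$$\mathcal{E}_1(u) = \mathbb{E}\left\{\int_D \frac{1}{2}\kappa(x,Y)\,|\nabla_x u(x,Y)|^2\,dx\right\}$$

and

$$\mathcal{E}_2(u) = \mathbb{E}\left\{\int_D F(x, u, Y)\,dx\right\}$$

and show that both $\mathcal{E}_1$ and $\mathcal{E}_2$ are lower semicontinuous. To see the lower semicontinuity of $\mathcal{E}_1$, since $u_\nu$ converges weakly to $u^*$ in the Hilbert space $V \otimes S$, by definition (through choosing an appropriate test function)

$$\mathcal{E}_1(u^*) = \lim_{\nu\to\infty} \mathbb{E}\left\{\int_D \frac{1}{2}\kappa(x,Y)\,\nabla_x u^*(x,Y)\cdot\nabla_x u_\nu(x,Y)\,dx\right\}.$$

Squaring both sides and then applying the Cauchy-Schwarz inequality lead to

$$\mathcal{E}_1(u^*)^2 \le \mathcal{E}_1(u^*)\,\liminf_{\nu\to\infty} \mathcal{E}_1(u_\nu).$$

If $\mathcal{E}_1(u^*) > 0$ (the case when $\mathcal{E}_1(u^*) = 0$ is trivial since $\mathcal{E}_1(u_\nu) \ge 0$), the above inequality implies

$$\mathcal{E}_1(u^*) \le \liminf_{\nu\to\infty} \mathcal{E}_1(u_\nu),$$

i.e., $\mathcal{E}_1$ is (weakly) lower semicontinuous. Now for $\mathcal{E}_2$, by Taylor expansion in $u$ and the uniform boundedness of $f$ (Assumption 2), we have

$$\left|\mathbb{E}\left\{\int_D F(x, u_\nu, Y)\,dx\right\} - \mathbb{E}\left\{\int_D F(x, u^*, Y)\,dx\right\}\right| \le f_{\max}\,\mathbb{E}\left\{\int_D |u_\nu - u^*|\,dx\right\}.$$

The lower semicontinuity of $\mathcal{E}_2$ follows immediately from the weak convergence of $u_\nu$ to $u^*$. Therefore, $\mathcal{E}$ is lower semicontinuous and hence has a minimizer $u^*$.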

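It remains to be shown that $u^*$ satisfies the weak form (6). To this end, we consider the functional $\mathcal{E}$ evaluated at $u^* + \epsilon v$ for every $v \in V \otimes S$, i.e., $\mathcal{E}(u^* + \epsilon v)$. A simple calculation shows that the Gateaux derivative satisfies

$$\lim_{\epsilon\to 0} \frac{1}{\epsilon}\big(\mathcal{E}(u^* + \epsilon v) - \mathcal{E}(u^*)\big) = \mathbb{E}\left\{\int_D \kappa(x,Y)\,\nabla_x u^*\cdot\nabla_x v\,dx\right\} + \lim_{\epsilon\to 0} \mathbb{E}\left\{\int_D f\big(x, u^* + \epsilon(x,Y)\,v, Y\big)\,v\,dx\right\}$$

for some random variable $\epsilon(x,Y)$. Since $f$ is uniformly bounded by Assumption 2, by the dominated convergence theorem

$$\lim_{\epsilon\to 0} \mathbb{E}\left\{\int_D f\big(x, u^* + \epsilon(x,Y)\,v, Y\big)\,v\,dx\right\} = \mathbb{E}\left\{\int_D f(x, u^*, Y)\,v\,dx\right\}.$$

Owing to the fact that $u^*$ is a minimizer, the Gateaux derivative satisfies

$$\left.\frac{d}{d\epsilon}\right|_{\epsilon=0} \mathcal{E}(u^* + \epsilon v) = 0,$$

which yields the weak form (6). ∎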

Due to the infinite dimensionality of the solution space $V \otimes S$, it is not practical to solve (5) directly and hence we seek an approximate solution over a finite dimensional subspace $(V \otimes S)_\nu$, where $\nu$ is a generic index parameterizing the approximation accuracy. Consequently, the approximated minimization problem over the finite dimensional subspace becomes

$$\min_{u \in (V \otimes S)_\nu} \mathcal{E}(u) = \mathbb{E}\left\{\int_D \frac{1}{2}\kappa(x,Y)\,|\nabla_x u(x,Y)|^2 + F(x, u(x,Y), Y)\,dx\right\}.$$

Denote by $u_\nu$ the minimizer of $\mathcal{E}$ over the finite dimensional subspace $(V \otimes S)_\nu$. Since $\{u_\nu\}$ is a minimizing sequence of (5), i.e.,

$$\mathcal{E}(u_\nu) \to \mathcal{E}(u^*) \quad \text{as } \nu \to \infty,$$

the sequence $\{u_\nu\}$ is compact on $V \otimes S$ by Theorem 3.1, that is, $u_\nu$ converges weakly to $u^*$, a minimizer of (5). In other words, the consistency of the numerical approximation is guaranteed automatically under the framework of the direct method of variational calculus. This serves as the theoretical foundation of the algorithm proposed in this work. Thus, the problem is reduced to finding a finite dimensional subspace minimizer $u_\nu$ as an approximation to the infinite dimensional space minimizer $u^*$.
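The idea of minimizing over a growing family of finite dimensional subspaces can be seen on a deterministic toy problem. The sketch below is only an assumed illustration and not the article's setting: it takes $E(u) = \int_0^1 \big(\tfrac{1}{2}|u'|^2 - u\big)\,dx$ with $u(0) = u(1) = 0$ and the subspaces spanned by $\sin(k\pi x)$, $k \le n$ (a hypothetical basis choice). Because the sine functions decouple the quadratic energy, each coefficient can be minimized in closed form, and the minimal energy decreases toward the exact value $-1/24$ as $n$ grows.

```python
import math

def ritz_energy(n):
    """Minimum of E(u) = int_0^1 (0.5 * u'^2 - u) dx over span{sin(k*pi*x), k <= n}."""
    energy = 0.0
    for k in range(1, n + 1):
        # int_0^1 (d/dx sin(k pi x))^2 dx = (k pi)^2 / 2
        stiffness = 0.5 * (k * math.pi) ** 2
        # int_0^1 sin(k pi x) dx = (1 - cos(k pi)) / (k pi)
        load = (1.0 - math.cos(k * math.pi)) / (k * math.pi)
        # Minimizer of the decoupled quadratic 0.5 * stiffness * c^2 - load * c
        c_k = load / stiffness
        energy += 0.5 * stiffness * c_k**2 - load * c_k
    return energy

energies = [ritz_energy(n) for n in (1, 3, 5, 199)]  # nested subspaces
```

The exact minimizer of this toy problem is $u^*(x) = x(1-x)/2$, whose energy is $-1/24$, so the entries of `energies` can be compared against that reference value.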

### 3.2 Polynomial chaos expansion

In this section, we make the above finite dimensional subspace approximation explicit. It is customary to construct approximations in the physical space $V$ by means of polynomials, in particular those with compact support, as is the case for the finite-element method. There exist various finite dimensional approximations to the stochastic space $S$. For the sake of clarity, throughout this paper we adopt the generalized polynomial chaos (PC) expansion xiu2002wiener as a convenient approximation method in the space $S$. However, we emphasize that the general framework presented in this work extends naturally to other approaches for constructing approximants, such as piecewise polynomial expansions ghanem2003stochastic and multiwavelet decompositions le2004a ; le2004b , to name a few.

The PC expansion is essentially a representation of second order random objects (e.g., square integrable random variables or stochastic fields) xiu2002wiener ; xiu2003modeling ; le2010spectral . Given a random variable $X$, the PC expansion asserts that we can identify a set of univariate polynomial bases $\{\psi_j\}_{j \ge 0}$, orthogonal with respect to the distribution of $X$, so that any function $r$ satisfying $\mathbb{E}\{r(X)^2\} < \infty$ can be expressed as

$$r(X) = \sum_{j=0}^{\infty} r_j\,\psi_j(X)$$

in the $L^2$ sense, where the coefficients are

$$r_j = \frac{\mathbb{E}\{r(X)\,\psi_j(X)\}}{\mathbb{E}\{\psi_j^2(X)\}}, \quad j = 0, 1, \dots.$$

For instance, $\{\psi_j\}$ are Hermite polynomials when $X$ is normal and Legendre polynomials when $X$ is uniform. The PC expansion can be generalized to the case of a $K$-dimensional random vector $X$ with independent components. Specifically, for any function $R$ satisfying $\mathbb{E}\{R(X)^2\} < \infty$ we can write

$$R(X) = \sum_{j=0}^{\infty} R_j\,\Psi_j(X),$$

where $\{\Psi_j\}$ are $K$-variate polynomials involving products of those univariate polynomials associated with each component.
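The coefficient formula $r_j = \mathbb{E}\{r(X)\psi_j(X)\} / \mathbb{E}\{\psi_j^2(X)\}$ can be checked numerically in the univariate Legendre case. The sketch below assumes $X$ uniform on $[-1, 1]$ and the test function $r(x) = x^2$, and it approximates the expectations with composite Simpson quadrature; the helper names `legendre`, `expect_uniform` and `pc_coefficient` are hypothetical and chosen only for this example.

```python
def legendre(j, x):
    """Evaluate the Legendre polynomial P_j(x) by the three-term recurrence."""
    p_prev, p = 1.0, x
    if j == 0:
        return p_prev
    for k in range(1, j):
        # (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
        p_prev, p = p, ((2 * k + 1) * x * p - k * p_prev) / (k + 1)
    return p

def expect_uniform(func, num_panels=2000):
    """E{func(X)} for X ~ Uniform(-1, 1) via composite Simpson quadrature."""
    h = 2.0 / num_panels
    total = func(-1.0) + func(1.0)
    for i in range(1, num_panels):
        total += (4 if i % 2 else 2) * func(-1.0 + i * h)
    return 0.5 * total * h / 3.0  # the density of Uniform(-1, 1) is 1/2

def pc_coefficient(r, j):
    """r_j = E{r(X) psi_j(X)} / E{psi_j(X)^2} with psi_j = P_j."""
    num = expect_uniform(lambda x: r(x) * legendre(j, x))
    den = expect_uniform(lambda x: legendre(j, x) ** 2)
    return num / den

coeffs = [pc_coefficient(lambda x: x * x, j) for j in range(4)]
```

Since $x^2 = \tfrac{1}{3}P_0(x) + \tfrac{2}{3}P_2(x)$, the computed coefficients should be close to $1/3$, $0$, $2/3$ and $0$, which gives a quick sanity check of the projection formula.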

Now, given the $K$-dimensional random vector $Y$ in (5), we can expand the random field $u(x,Y)$ as a generalized PC series

$$u(x,Y) = \sum_{j=0}^{\infty} u_j(x)\,\Psi_j(Y),$$

where the coefficient function is

$$u_j(x) = \frac{\mathbb{E}\{u(x,Y)\,\Psi_j(Y)\}}{\mathbb{E}\{\Psi_j^2(Y)\}}, \quad \forall j = 0, 1, \dots.$$

In practice, truncation of the PC series is required for numerical approximation. To this end, we define $S_N$, the finite dimensional subspace spanned by the orthogonal polynomials $\{\Psi_j\}_{j=0}^{N}$, i.e.,

$$S_N = \operatorname{span}\{\Psi_0(Y), \dots, \Psi_N(Y)\}.$$

Note that $N$ is determined by both the stochastic dimensionality $K$ and the highest order $p$ of the basis polynomials through

$$N + 1 = \frac{(p + K)!}{p!\,K!}.$$
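The count $N + 1 = (p+K)!/(p!\,K!)$ is a binomial coefficient, so it is easy to tabulate how quickly the stochastic basis grows. The short sketch below uses Python's `math.comb`; the function name `num_pc_basis` is a hypothetical helper introduced only for this illustration.

```python
from math import comb

def num_pc_basis(K, p):
    """Number of K-variate polynomials of total degree <= p, i.e. N + 1."""
    return comb(p + K, K)

# Basis size for a few (K, p) pairs.
sizes = {(K, p): num_pc_basis(K, p) for K in (2, 5, 10) for p in (2, 3, 4)}
```

For example, $K = 2$ and $p = 3$ give 10 basis polynomials, while $K = 10$ and $p = 3$ already give 286.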

Hence, the dimensionality of the stochastic subspace can be very high when $K$ and $p$ are large. In order to further expand the coefficients $u_j(x)$, we approximate them over

$$V_M = \operatorname{span}\{\phi_1, \dots, \phi_M\} \subset V,$$

the finite dimensional subspace of $V$ spanned by the bases $\{\phi_i\}_{i=1}^{M}$. Therefore, the finite dimensional subspace of $V \otimes S$ is $V_M \otimes S_N$, over which we have a finite dimensional approximation

$$u_c(x,Y) \triangleq \sum_{i=1}^{M} \sum_{j=0}^{N} c_{ij}\,\phi_i(x)\,\Psi_j(Y) \approx u(x,Y).$$

Note that in the notation $u_c$ we omit the dependence on $M$ and $N$ for simplicity. For convenience, we define the vector valued function $\Gamma(x,Y)$ consisting of all bases of $V_M \otimes S_N$ with the following index numbering:

$$\Gamma(x,Y) = \big(\phi_1(x)\Psi_0(Y), \dots, \phi_M(x)\Psi_0(Y), \dots\dots, \phi_1(x)\Psi_N(Y), \dots, \phi_M(x)\Psi_N(Y)\big)^T.$$

Hence, the approximated solution can be written in the following compact form

$$u_c(x,Y) = c^T \Gamma(x,Y), \tag{7}$$

where the coefficient (column) vector is

$$c = [c_{1,0}, \dots, c_{M,0}, \dots\dots, c_{1,N}, \dots, c_{M,N}]^T \in \mathbb{R}^{M(N+1)}.$$
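The ordering of $\Gamma$ and $c$ above puts all physical-space indices for $\Psi_0$ first, then those for $\Psi_1$, and so on. A minimal sketch of the corresponding index map follows; the function names `flat_index` and `unflatten` and the 0-based storage convention are assumptions made for illustration.

```python
def flat_index(i, j, M):
    """0-based position of c_{i,j} (1 <= i <= M, 0 <= j <= N) in the vector c."""
    return j * M + (i - 1)

def unflatten(k, M):
    """Inverse map: 0-based position k -> (i, j)."""
    j, r = divmod(k, M)
    return r + 1, j

M, N = 3, 2
# Enumerating positions 0 .. M(N+1)-1 reproduces the ordering of c.
order = [unflatten(k, M) for k in range(M * (N + 1))]
```

With $M = 3$ and $N = 2$, `order` starts with $(1,0), (2,0), (3,0), (1,1)$ and ends with $(3,2)$, matching the layout of the coefficient vector.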

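Over the finite dimensional subspace $V_M \otimes S_N$, the functional associated with $u_c$ becomes

$$\mathcal{E}(u_c) = \mathbb{E}\left\{\int_D \frac{1}{2}\kappa(x,Y)\,|\nabla_x u_c(x,Y)|^2 + F(x, u_c(x,Y), Y)\,dx\right\}. \tag{8}$$

Note that $\mathcal{E}(u_c)$ is indeed a function of $c$, hence we rewrite it as a function of the coefficients $c$, denoted by $J(c)$. Therefore, minimizing $\mathcal{E}(u_c)$ is equivalent to minimizing the following function with respect to the coefficient vector $c$,

$$\min_{c \in \mathbb{R}^{M(N+1)}} J(c) = \min_{c \in \mathbb{R}^{M(N+1)}} J_1(c) + J_2(c), \tag{9}$$

where

$$J_1(c) = \mathbb{E}\left\{\int_D \frac{1}{2}\kappa(x,Y)\,\big|c^T \nabla_x \Gamma(x,Y)\big|^2\,dx\right\}, \qquad J_2(c) = \mathbb{E}\left\{\int_D F\big(x, c^T \Gamma(x,Y), Y\big)\,dx\right\} \tag{10}$$

are the values associated with the linear part and the nonlinear part of (4), respectively. Here the gradient of $\Gamma$ with respect to $x$ is defined as

$$\nabla_x \Gamma(x,Y) = \big[\nabla_{x_1}\Gamma(x,Y), \dots, \nabla_{x_d}\Gamma(x,Y)\big] \in \mathbb{R}^{M(N+1) \times d}.$$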

Finally, we make the following technical assumption regarding the basis functions.

###### Assumption 3.

For each $i = 1, \dots, M$, the physical-space basis $\phi_i$ satisfies

$$\int_D |\phi_i(x)|^4\,dx < \infty, \qquad \int_D |\nabla_x \phi_i(x)|^2\,dx < \infty.$$

For each $j = 0, \dots, N$, the PC basis $\Psi_j$ satisfies

$$\mathbb{E}\{|\Psi_j(Y)|^4\} < \infty.$$

The integrability conditions are satisfied in most settings. For example, when the physical domain $D$ is compact and $\{\phi_i\}$ are finite element bases, the integrability is readily verified. For the stochastic space, when $Y$ is normal or uniform, $\Psi_j(Y)$ has finite moments of all orders.

## 4 Stochastic gradient descent for semilinear problem

### 4.1 Convergence of stochastic gradient descent

As mentioned in Section 3.1, in order to find an approximation to the solution of problem (3), it is sufficient to solve the stochastic optimization problem (9). The natural choice for optimizing the function $J$ is stochastic gradient descent robbins1951stochastic , which is one of the most fundamental ingredients of large-scale machine learning bottou2010large ; bottou2018optimization . In this section, we discuss an application of SGD to the specific minimization problem (9). For convenience, we denote the unbiased estimator of $\nabla J(c)$ by

$$g(c,Y) = g_1(c,Y) + g_2(c,Y), \tag{11}$$

where

$$\begin{aligned} g_1(c,Y) &= \left(\int_D \kappa(x,Y)\,\nabla_x \Gamma(x,Y)\,\nabla_x \Gamma(x,Y)^T\,dx\right) c \in \mathbb{R}^{M(N+1)}, \\ g_2(c,Y) &= \int_D f\big(x, c^T \Gamma(x,Y), Y\big)\,\Gamma(x,Y)\,dx \in \mathbb{R}^{M(N+1)}, \end{aligned}$$

so that $\mathbb{E}\{g(c,Y)\} = \nabla J(c)$. Instead of computing the deterministic gradient $\nabla J(c_n)$ at each iteration, SGD simply requires the stochastic gradient $g(c_n, Y_n)$ for each iteration:

$$c_{n+1} = c_n - \eta_n\,g(c_n, Y_n), \quad n \ge 1, \tag{12}$$

where $\eta_n$ is the learning rate. In the context of machine learning, each $Y_n$ corresponds to a randomly picked example from the dataset. In the context of this article, we interpret $Y_n$ as a realization of the stochastic germ $Y$. Note that the iterative sequence of coefficients $\{c_n\}$ is a sequence of random variables since each $c_n$ depends on $Y_k$ for $k < n$.

Intuitively, SGD works because, while each direction $-g(c_n, Y_n)$ may not be a descent direction of $J$, it is a descent direction in expectation. SGD is advantageous as it only requires computation of a single realization of the gradient at each iteration. Yet, it is a fundamental question whether SGD applied to the problem (9) produces a convergent sequence minimizing the function $J$. To answer this question, we first present three important lemmas concerning the properties of the function $J$ and the gradient estimator $g$.

###### Lemma 4.1.

Under Assumptions 1, 2 and 3, the function $J$ is continuously differentiable and its gradient is Lipschitz continuous with Lipschitz constant $L$, i.e., for any $c_1, c_2 \in \mathbb{R}^{M(N+1)}$,

$$|\nabla J(c_1) - \nabla J(c_2)| \le L\,|c_1 - c_2|.$$

###### Proof.

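Throughout the proof, $L$ denotes a generic positive constant that may differ by a scaling constant. Let $c_1, c_2$ be two arbitrary vectors of coefficients for the expansions over $V_M \otimes S_N$. For the linear part of $J$, by the uniform bound on $\kappa$ in Assumption 1,

$$\begin{aligned} \big|\nabla J_1(c_1) - \nabla J_1(c_2)\big|^2 &= \left|\mathbb{E}\left\{\int_D \kappa(x,Y)\,\nabla_x \Gamma(x,Y)\,\nabla_x \Gamma(x,Y)^T\,dx\right\}(c_1 - c_2)\right|^2 \\ &\le L \left|\mathbb{E}\left\{\int_D \nabla_x \Gamma(x,Y)\,\nabla_x \Gamma(x,Y)^T\,dx\right\}(c_1 - c_2)\right|^2 \\ &\le L \left\|\mathbb{E}\left\{\int_D \nabla_x \Gamma(x,Y)\,\nabla_x \Gamma(x,Y)^T\,dx\right\}\right\|_2^2 |c_1 - c_2|^2, \end{aligned}$$

where $\|\cdot\|_2$ is the matrix $2$-norm. To see that the matrix norm is finite, it is sufficient to bound terms of the form

$$\left(\int_D \partial_{x_{k_1}} \phi_{i_1}(x)\,\partial_{x_{k_2}} \phi_{i_2}(x)\,dx\right) \mathbb{E}\{\Psi_{j_1}(Y)\,\Psi_{j_2}(Y)\} \tag{13}$$

with $i_1, i_2 \in \{1, \dots, M\}$, $j_1, j_2 \in \{0, \dots, N\}$ and $k_1, k_2 \in \{1, \dots, d\}$. By Assumption 3, the finiteness of (13) follows immediately.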

Now for the nonlinear part , by Jensen’s inequality and Assumption 2

 (14)

Finally, is finite by Assumption 3. ∎

###### Lemma 4.2.

Under Assumptions 1, 2 and 3, there exist constants $M_1$ and $M_2$ such that

$$\mathbb{E}\{|g(c,Y)|^2\} \le M_1\,|\nabla J(c)|^2 + M_2.$$

That is, the second moment of the gradient estimator is allowed to grow quadratically in the mean gradient.

###### Proof.

In virtue of Assumption 1 and 2,

which is finite by Assumption 3. Similar to the proof of Lemma 4.1, we can readily show that is finite as well. Therefore, we can choose two appropriate constants such that

$$\mathbb{E}\{|g(c,Y)|^2\} \le M_1\,|\nabla J(c)|^2 + M_2.$$

###### Lemma 4.3.

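Suppose Assumptions 1 and 2 hold and, additionally, that for all $x \in D$ and almost surely all $\omega$, $f(x, c^T\Gamma, Y(\omega))$, when viewed as a function of $c$, is continuously differentiable with respect to $c$, i.e., for all $x \in D$,

$$P\big(\omega \in \Omega : f(x, c^T\Gamma, Y(\omega)) \text{ is continuously differentiable w.r.t. } c\big) = 1.$$

Then, the function $J$ is strongly convex (in $c$), i.e., there exists a constant $\lambda > 0$ such that for any $c_1, c_2 \in \mathbb{R}^{M(N+1)}$

$$\big(\nabla J(c_1) - \nabla J(c_2)\big)^T (c_1 - c_2) \ge \lambda\,|c_1 - c_2|^2$$

and hence $J$ has a unique minimizer $c^*$.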

###### Proof.

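Note that by Assumption 1,

$$\big(\nabla J_1(c_1) - \nabla J_1(c_2)\big)^T (c_1 - c_2) \ge \kappa_{\min}\,(c_1 - c_2)^T\,\mathbb{E}\left\{\int_D \nabla_x \Gamma\,\nabla_x \Gamma^T\,dx\right\}(c_1 - c_2). \tag{15}$$

In view of the continuous differentiability of $f$ in $c$, there exists $\tilde{c} = \theta c_1 + (1 - \theta) c_2$ for some $\theta \in [0, 1]$ (that may depend on $x$ and $Y$) such that

$$\begin{aligned} \nabla J_2(c_1) - \nabla J_2(c_2) &= \mathbb{E}\left\{\int_D \big(f(x, c_1^T\Gamma, Y) - f(x, c_2^T\Gamma, Y)\big)\,\Gamma\,dx\right\} \\ &= \mathbb{E}\left\{\int_D \nabla_c f(x, \tilde{c}^T\Gamma, Y)^T (c_1 - c_2)\,\Gamma\,dx\right\} \\ &= \mathbb{E}\left\{\int_D \partial_u f(x, \tilde{c}^T\Gamma, Y)\,\Gamma\Gamma^T (c_1 - c_2)\,dx\right\}. \end{aligned}$$

Since $\partial_u f$ is uniformly bounded from below by Assumption 2, we have

$$\begin{aligned} \big(\nabla J_2(c_1) - \nabla J_2(c_2)\big)^T (c_1 - c_2) &= (c_1 - c_2)^T\,\mathbb{E}\left\{\int_D \partial_u f(x, \tilde{c}^T\Gamma, Y)\,\Gamma\Gamma^T\,dx\right\}(c_1 - c_2) \\ &\ge \delta\,(c_1 - c_2)^T\,\mathbb{E}\left\{\int_D \Gamma\Gamma^T\,dx\right\}(c_1 - c_2). \end{aligned} \tag{16}$$

The strong convexity follows immediately by combining (15) and (16). ∎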

Now, we are ready to present the main result concerning the convergence of SGD when applied for solving problem (9). Given Lemmas 4.1, 4.2 and 4.3, the proof of the following result is simply a straightforward application of Theorem  in bottou2018optimization .

###### Theorem 4.1.

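Under the same assumptions as in Lemmas 4.1, 4.2 and 4.3, suppose that the learning rate satisfies

$$\eta_n = \frac{\beta}{\gamma + n}, \quad n > 1,$$

for some constants $\beta$ and $\gamma$ satisfying the conditions of the cited theorem in bottou2018optimization . Then, the function $J$ decays sublinearly to the minimum in expectation, i.e.,

$$\mathbb{E}\{J(c_n)\} - J(c^*) \le \frac{\nu}{\gamma + n},$$

where $\nu$ is a constant independent of $n$.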

###### Remark 4.1.

Some remarks about the above result are in order.

1. A diminishing learning rate has to be used to guarantee convergence. The initial learning rate cannot be larger than a certain threshold.

2. We comment that the rate $O(1/n)$ is the fastest convergence rate that stochastic gradient descent can achieve agarwal2012information . However, the multiplicative constant $\nu$ can be improved by incorporating the second order information of the function $J$. We discuss this aspect further in the next section.

3. A similar result can be obtained when the function $J$ is convex but not strongly convex. However, the convergence rate deteriorates to $O(1/\sqrt{n})$ nemirovski2009robust .
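The iteration (12) with a diminishing learning rate $\eta_n = \beta/(\gamma + n)$ can be tried on a scalar toy problem. The sketch below is an assumed stand-in for the PDE setting, not the article's benchmark: it minimizes $J(c) = \mathbb{E}\{\tfrac{1}{2}(2 + Y)c^2 - c\}$ with $Y$ uniform on $[-1, 1]$, so that $\nabla J(c) = 2c - 1$ and $c^* = 0.5$. The values $\beta = 1$ and $\gamma = 1$ are illustrative choices and have not been checked against the constants of the theorem.

```python
import random

def stochastic_gradient(c, y, b=1.0):
    """g(c, Y) for the toy objective J(c) = E{0.5 * (2 + Y) * c**2 - b * c}."""
    return (2.0 + y) * c - b

def sgd(c0, num_iters, beta=1.0, gamma=1.0, seed=0):
    """Iterate c_{n+1} = c_n - eta_n * g(c_n, Y_n) with eta_n = beta / (gamma + n)."""
    rng = random.Random(seed)
    c = c0
    for n in range(1, num_iters + 1):
        y = rng.uniform(-1.0, 1.0)   # one realization of the germ per iteration
        eta = beta / (gamma + n)     # diminishing learning rate
        c -= eta * stochastic_gradient(c, y)
    return c

c_final = sgd(c0=5.0, num_iters=20000)  # the exact minimizer is c* = 0.5
```

Each step uses a single sample of $Y$, so the iterates fluctuate, but the shrinking step size damps the noise and `c_final` ends up close to $0.5$.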

It is well known that the SGD (12) suffers from the adverse effect of noisy gradient estimation. On the other hand, there is no particular reason to estimate the gradient based on only one realization of the random variables at each iteration. Therefore, it is natural to introduce a mini-batch at each iteration in order to “stabilize” the algorithm. That is, at the $n$-th iteration we average the gradient over a batch of realizations of the random variable in order to obtain a less noisy gradient estimate

$$g^{mb}(c_n, Y_n^{mb}) = \frac{1}{N_g} \sum_{i=1}^{N_g} g(c_n, Y_{n,i}),$$

where $Y_n^{mb} = \{Y_{n,1}, \dots, Y_{n,N_g}\}$ and each $Y_{n,i}$ is the $i$-th realization in the mini-batch at the $n$-th iteration of SGD. With the mini-batch gradient, we have the mini-batch SGD

$$c_{n+1} = c_n - \eta_n\,g^{mb}(c_n, Y_n^{mb}). \tag{17}$$

Clearly, the mini-batch averaging reduces the variance of the gradient estimate by a factor of $N_g$ but is also $N_g$ times more expensive than standard SGD. More sophisticated mini-batch strategies can be applied to accelerate SGD zhao2014accelerating ; dekel2012optimal . Our simulation results suggest that mini-batching is crucial for the convergence of SGD when a relatively large learning rate is used.
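The variance reduction by a factor of $N_g$ can be observed empirically. The following sketch reuses an assumed scalar toy gradient $g(c, Y) = (2 + Y)c - 1$ with $Y$ uniform on $[-1, 1]$ (a hypothetical stand-in, not the article's estimator) and compares the sample variance of single-sample gradients with that of mini-batch averages.

```python
import random
import statistics

def grad_sample(c, rng):
    """Single-sample gradient g(c, Y) = (2 + Y) * c - 1 with Y ~ Uniform(-1, 1)."""
    return (2.0 + rng.uniform(-1.0, 1.0)) * c - 1.0

def minibatch_grad(c, batch_size, rng):
    """Average of batch_size independent single-sample gradients."""
    return sum(grad_sample(c, rng) for _ in range(batch_size)) / batch_size

rng = random.Random(1)                 # fixed seed for reproducibility
c, batch_size, trials = 2.0, 16, 5000
var_single = statistics.pvariance([grad_sample(c, rng) for _ in range(trials)])
var_batch = statistics.pvariance(
    [minibatch_grad(c, batch_size, rng) for _ in range(trials)]
)
ratio = var_single / var_batch         # expected to be close to batch_size
```

For this toy gradient the single-sample variance is $c^2/3$, so with $c = 2$ one expects `var_single` near $1.33$ and `ratio` near 16, at 16 times the sampling cost per iteration.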

### 4.2 The second order SGD

Further improvements to the SGD (12) may be achieved by incorporating second order information about the function $J$. Theorem 4.1 shows that the constant $\nu$ appearing in the convergence rate depends on the condition number of the Hessian $\nabla^2 J$. This is similar to deterministic optimization, where second order information is often incorporated to overcome the ill-conditioning of the optimization problem nocedal2006numerical . The same approach can be used in the stochastic setting by adaptively rescaling the stochastic gradients with matrices that capture local curvature information of the function $J$, so that the constant $\nu$ is significantly improved as a result. More precisely, we consider an iteration scheme

$$c_{n+1} = c_n - \eta_n\,H_n\,g^{mb}(c_n, Y_n^{mb}), \quad n \ge 1, \tag{18}$$

where $H_n$ is a symmetric positive definite approximation to the inverse of the Hessian $\nabla^2 J(c_n)$. In fact, it was shown in bottou2005line that if $H_n$ is updated dynamically such that $H_n \to \big(\nabla^2 J(c^*)\big)^{-1}$, then the multiplicative constant appearing in the convergence rate is independent of the condition number of the Hessian. It is true that the approximation of $\nabla^2 J$ is often based on a small set of samples and hence is very noisy. However, it has long been observed that the Hessian matrix need not be as accurate as the gradient in order to yield an effective iteration, since the iteration is more tolerant to noise in the Hessian estimate than it is to noise in the gradient estimate. Therefore, it may be beneficial to incorporate partial Hessian information in the stochastic setting. To this end, we denote the unbiased estimator of the Hessian as

$$h(c,Y) = h_1(c,Y) + h_2(c,Y), \tag{19}$$

where

$$\begin{aligned} h_1(c,Y) &= \int_D \kappa(x,Y)\,\nabla_x \Gamma(x,Y)\,\nabla_x \Gamma(x,Y)^T\,dx \in \mathbb{R}^{M(N+1) \times M(N+1)}, \\ h_2(c,Y) &= \int_D \partial_u f\big(x, c^T\Gamma(x,Y), Y\big)\,\Gamma(x,Y)\,\Gamma(x,Y)^T\,dx \in \mathbb{R}^{M(N+1) \times M(N+1)}, \end{aligned}$$

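so that $\mathbb{E}\{h(c,Y)\} = \nabla^2 J(c)$. Note that the estimator $h_1$ is indeed independent of $c$ whereas $h_2$ depends on $c$ in a nonlinear way. For convenience, we define two $M$ by $M$ matrices $A(Y)$ and $B(c,Y)$ with components

$$A_{i_1 i_2}(Y) = \int_D \kappa(x,Y)\,\big(\partial_{x_1}\phi_{i_1}(x)\,\partial_{x_1}\phi_{i_2}(x) + \dots + \partial_{x_d}\phi_{i_1}(x)\,\partial_{x_d}\phi_{i_2}(x)\big)\,dx$$

and

$$B_{i_1 i_2}(c,Y) = \int_D \partial_u f\big(x, c^T\Gamma(x,Y), Y\big)\,\phi_{i_1}(x)\,\phi_{i_2}(x)\,dx$$

for $i_1, i_2 = 1, \dots, M$. Then, we can express the Hessian estimator in the following block matrix form,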

 h1(c,Y)=⎡⎢ ⎢⎣A(Y)Ψ0(Y)Ψ0(Y)…A(Y)Ψ0(Y