Forward-backward-forward methods with variance reduction for stochastic variational inequalities

We develop a new stochastic algorithm with variance reduction for solving pseudo-monotone stochastic variational inequalities. Our method builds on Tseng's forward-backward-forward (FBF) algorithm, which is known in the deterministic literature to be a valuable alternative to Korpelevich's extragradient method when solving variational inequalities over a convex and closed set governed by pseudo-monotone, Lipschitz continuous operators. The main computational advantage of Tseng's algorithm is that it relies only on a single projection step and two independent queries of a stochastic oracle. Our algorithm incorporates a variance reduction mechanism and leads to almost sure (a.s.) convergence to an optimal solution. To the best of our knowledge, this is the first stochastic look-ahead algorithm achieving this by using only a single projection at each iteration.

02/16/2021

1. Introduction

In this paper we consider the following variational inequality problem, denoted as VI(T,X), or simply VI: given a nonempty closed and convex set X ⊆ ℝ^d and a single-valued map T: ℝ^d → ℝ^d, find x∗ ∈ X such that

 ⟨T(x∗), x − x∗⟩ ≥ 0 for all x ∈ X. (1.1)

We call X∗ the set of (Stampacchia) solutions of VI(T,X). The variational inequality problem (1.1) arises in many interesting applications in economics, game theory and engineering [29, 27, 19, 18, 24], and includes as a special case first-order optimality conditions for nonlinear optimization, obtained by choosing T = ∇f for some smooth function f. If X is unbounded, it can also be used to formulate complementarity problems, systems of equations, saddle point problems and many equilibrium problems. We refer the reader to [11] for an extensive review of applications in engineering and economics.
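To make the special case T = ∇f concrete, the following small sketch (the quadratic function, box constraint and sample grid are our own illustrative choices, not from the paper) checks numerically that the constrained minimizer of a smooth convex function over a box satisfies inequality (1.1):

```python
# For f(x) = ||x - c||^2 over X = [-1, 1]^2, verify the Stampacchia
# inequality <grad f(x*), x - x*> >= 0 at the constrained minimizer x*.
def grad_f(x, c=(2.0, 0.5)):
    return [2.0 * (x[i] - c[i]) for i in range(2)]

def project_box(x, lo=-1.0, hi=1.0):
    # Euclidean projection onto a box is a componentwise clamp.
    return [min(max(xi, lo), hi) for xi in x]

# For this f the constrained minimizer is the projection of c onto X.
x_star = project_box([2.0, 0.5])          # = [1.0, 0.5]
g = grad_f(x_star)                        # T(x*) = grad f(x*)

# Evaluate <T(x*), x - x*> over a coarse grid of points x in X.
grid = [(-1.0 + 0.25 * i, -1.0 + 0.25 * j) for i in range(9) for j in range(9)]
min_value = min(sum(g[k] * (x[k] - x_star[k]) for k in range(2)) for x in grid)
```

Since the minimum of ⟨T(x∗), x − x∗⟩ over the grid is zero (attained at x = x∗ itself), inequality (1.1) holds at x∗.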

In many instances the problem VI(T,X) arises as the expected value of an underlying stochastic optimization problem whose primitives are defined on a probability space (Ω, F, P) carrying a random variable ξ taking values in a measurable space (Ξ, A) and inducing a law on that space. Given the random element ξ, consider the measurable mapping F: ℝ^d × Ξ → ℝ^d, defining an integrable random vector via the composition ω ↦ F(x, ξ(ω)). The stochastic variational inequality problem on which we will focus in this paper is denoted by SVI and defined as follows:

Definition 1.1.

Let the operator T: ℝ^d → ℝ^d be defined by

 T(x):=Eξ[F(x,ξ)]:=∫ΩF(x,ξ(ω))dP(ω)=∫ΞF(x,z)dP(z). (1.2)

Find x∗ ∈ X satisfying (1.1).

This definition is known as the expected value formulation of the stochastic variational inequality problem. The expected value formulation goes back to the seminal work of [20]. By its very definition, if the operator T defined in (1.2) were known, then the expected value formulation could be solved by any standard solution technique for deterministic variational inequalities. However, in practice, the operator T is usually not directly accessible, either due to the excessive computations involved in performing the integral, or because F itself is the solution of an embedded subproblem. Hence, in most situations of interest, the solution of SVI relies on random samples of the operator F. In this context, two methodologies are currently available. The sample average approximation (SAA) approach replaces the expected value formulation with an empirical estimator of the form

 ^T_N(x) = (1/N) ∑_{j=1}^{N} F(x, ξ_j),

and uses the resulting deterministic map as the input to an existing algorithm of choice. We refer to [30] for this solution approach in connection with Monte Carlo simulation. We note that this approach is the standard choice in expected residual minimization problems, when T is unknown but accessible via a Monte Carlo approach.
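As a minimal illustration of the SAA estimator ^T_N (the linear operator, Gaussian noise and sample size below are our own choices), averaging N i.i.d. oracle evaluations recovers T(x) up to an O(1/√N) error:

```python
import random

random.seed(0)

# Illustrative primitives: T(x) = A x for a fixed 2x2 matrix, and
# F(x, xi) = T(x) + xi with centered Gaussian noise of std 0.5.
A = [[2.0, 1.0], [1.0, 3.0]]

def T(x):
    return [sum(A[i][j] * x[j] for j in range(2)) for i in range(2)]

def F(x):
    return [T(x)[i] + random.gauss(0.0, 0.5) for i in range(2)]

def T_hat(x, N):
    """SAA estimator: average of N i.i.d. oracle calls at the fixed point x."""
    total = [0.0, 0.0]
    for _ in range(N):
        fx = F(x)
        total = [total[i] + fx[i] for i in range(2)]
    return [t / N for t in total]

x = [1.0, -1.0]
err = max(abs(T_hat(x, 4000)[i] - T(x)[i]) for i in range(2))
```

With N = 4000 samples the componentwise error has standard deviation 0.5/√4000 ≈ 0.008, so the empirical map is already a close deterministic surrogate for T.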

A different methodology is the stochastic approximation (SA) approach, where samples are obtained in an online fashion: the decision maker chooses a deterministic algorithm to solve the expected value formulation, and draws a fresh random sample of ξ whenever needed. The mechanism producing a fresh sample from the law of ξ is usually called a stochastic oracle (SO), whose reports are subject to a stochastic error.

Until very recently, the SA approach had only been used for the expected value formulation under very restrictive assumptions. To the best of our knowledge, the first formulation of an SA approach for a stochastic VI problem was made by [16], under the assumption of strong monotonicity and continuity of the operator T. There, a projection-type algorithm of the form

 X_{n+1} = Π_X[X_n − α_n F(X_n, ξ_n)] (1.3)

is considered, where Π_X denotes the Euclidean projection onto X, ξ_n is an i.i.d. sample of ξ, and (α_n)_{n≥0} is a sequence of positive step sizes. Almost sure convergence of the iterates is proven for small step sizes, assuming T is Lipschitz continuous and strongly monotone, and the stochastic error is uniformly bounded. Relaxing strong monotonicity to plain monotonicity, the recent paper [37] incorporated a Tikhonov regularization scheme into the stochastic approximation algorithm (1.3) and proved almost sure convergence of the generated stochastic process. The only established method guaranteeing almost sure convergence under the significantly weaker assumption of pseudo-monotonicity of the mean operator T is the extragradient approach of [15]. The original Korpelevich extragradient scheme of [21] consists of two projection steps using two evaluations of the deterministic map T at generated test points. Extending this to the stochastic oracle case, we arrive at the stochastic extra-gradient (SEG) method

 Y_n = Π_X[X_n − α_n A_{n+1}],
 X_{n+1} = Π_X[X_n − α_n B_{n+1}], (SEG)

where A_{n+1} and B_{n+1} are stochastic estimators of T(X_n) and T(Y_n), respectively. The paper [15] constructs these estimators by relying on a dynamic sampling strategy, where noise reduction is achieved via mini-batch sampling of the stochastic operators F(X_n, ·) and F(Y_n, ·). Within this mini-batch formulation, almost sure convergence of the stochastic process to the solution set can be proven even with constant step size implementations of SEG. Moreover, optimal convergence rates in terms of the mean squared residual of the iterates are obtained.
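For concreteness, the SEG iteration with mini-batch oracles can be sketched on a toy instance of our own choosing: the rotation operator T(x₁, x₂) = (x₂, −x₁) (monotone and 1-Lipschitz) over the box X = [−1, 1]², whose unique solution is x∗ = 0. The step size, noise level and batch schedule below are illustrative, not the paper's experimental settings.

```python
import random

random.seed(1)

def T(x):
    # Rotation operator: monotone, Lipschitz with L = 1, unique solution 0.
    return [x[1], -x[0]]

def oracle(x, m):
    # Mini-batch estimator of T(x): average of m noisy evaluations.
    tx, est = T(x), [0.0, 0.0]
    for _ in range(m):
        est[0] += tx[0] + random.gauss(0.0, 0.5)
        est[1] += tx[1] + random.gauss(0.0, 0.5)
    return [e / m for e in est]

def proj(x):
    # Euclidean projection onto the box X = [-1, 1]^2.
    return [min(max(xi, -1.0), 1.0) for xi in x]

alpha = 0.3
x = [0.5, 0.5]
for n in range(1, 151):
    m = n * n                                   # growing batch size
    a = oracle(x, m)                            # first oracle query, at X_n
    y = proj([x[i] - alpha * a[i] for i in range(2)])     # first projection
    b = oracle(y, m)                            # second oracle query, at Y_n
    x = proj([x[i] - alpha * b[i] for i in range(2)])     # second projection

dist = (x[0] ** 2 + x[1] ** 2) ** 0.5
```

Note the two projections per iteration; the SFBF scheme developed below removes one of them.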

1.1. Our Contribution

We briefly summarize the main contributions of this work. The most costly part of SEG is the two separate projection steps performed at each iteration of the method. We show in this paper that a stochastic version of Tseng's forward-backward-forward algorithm [35], which we call the stochastic forward-backward-forward (SFBF) algorithm, preserves the strong trajectory-based convergence results, while the saving of one projection step allows us to beat SEG significantly in terms of computational overhead and runtime. In terms of convergence properties, the SFBF algorithm developed in this paper matches SEG. However, SFBF is potentially more efficient than SEG in each iteration, since it relies on only a single Euclidean projection step. The price to pay for this is that we obtain an infeasible method (as is typical for primal-dual schemes), with, on the positive side, a lower computational complexity count. Additionally, the theoretically allowed range of step sizes is larger by a constant factor than the theoretically allowed largest step size in SEG. This constant factor gain results in significant improvements in terms of convergence speed. This will be illustrated with the extensive numerical evidence reported in Section 6.

2. Preliminaries

2.1. Notation

For x, y ∈ ℝ^d, we denote by ⟨x, y⟩ the standard inner product, and by ∥x∥ = √⟨x, x⟩ the corresponding norm. For p ≥ 1, the ℓ_p norm on ℝ^d is defined for x = (x_1, …, x_d) as ∥x∥_p = (∑_{i=1}^d |x_i|^p)^{1/p}. For a nonempty, closed and convex set X ⊆ ℝ^d, the Euclidean projector is defined as Π_X(x) ≜ argmin_{y∈X} ∥x − y∥ for x ∈ ℝ^d. All random elements are defined on a given probability space (Ω, F, P). An ℝ^d-valued random variable is an F-measurable mapping X: Ω → ℝ^d. For every p ≥ 1, L^p(Ω) denotes the equivalence classes of random variables X with E[∥X∥^p] < ∞. If G ⊆ F is a sub-sigma-algebra, the conditional expectation of the random variable X is denoted by E[X|G]. For random variables X_1, …, X_n we denote the sigma-algebra generated by these random variables by σ(X_1, …, X_n); this is the smallest sigma-algebra measuring X_1, …, X_n. Let (Ω, F, (F_n)_{n≥0}, P) be a complete stochastic basis. We denote by ℓ^0(F) the set of random sequences (ξ_n)_{n≥1} such that, for each n, ξ_n is F_n-measurable. For p > 0, we set

 ℓ^p(F) ≜ {(ξ_n)_{n≥1} ∈ ℓ^0(F) | ∑_{n≥1} |ξ_n|^p < ∞ P-a.s.}.

The following properties of the Euclidean projection onto a closed and convex set are well known.

Lemma 2.1.

Let X ⊆ ℝ^d be a nonempty, closed and convex set. Then:

• Π_X(x) is the unique point of X satisfying ⟨x − Π_X(x), y − Π_X(x)⟩ ≤ 0 for all y ∈ X;

• for all x ∈ ℝ^d and y ∈ X, we have ∥Π_X(x) − y∥² + ∥Π_X(x) − x∥² ≤ ∥x − y∥²;

• for all x, y ∈ ℝ^d, ∥Π_X(x) − Π_X(y)∥ ≤ ∥x − y∥;

• given a > 0 and T: ℝ^d → ℝ^d, the set of solutions of the variational problem VI(T,X) can be expressed as X∗ = {x ∈ X : x = Π_X(x − aT(x))}.
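The standard projection properties collected in Lemma 2.1 are easy to check numerically for a box constraint, for which Π_X is a componentwise clamp. The following sketch (box, dimension and sample points are our own choices) tests the variational characterization, the strengthened triangle inequality and nonexpansiveness on random points:

```python
import random

random.seed(2)

def proj(x):
    # Euclidean projection onto the box X = [-2, 2]^3: componentwise clamp.
    return [min(max(xi, -2.0), 2.0) for xi in x]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def dist2(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

ok_variational = ok_firm = ok_nonexpansive = True
for _ in range(1000):
    x = [random.uniform(-5, 5) for _ in range(3)]
    z = [random.uniform(-5, 5) for _ in range(3)]
    y = proj([random.uniform(-5, 5) for _ in range(3)])   # a point of X
    px, pz = proj(x), proj(z)
    # (i) variational characterization: <x - Px, y - Px> <= 0 for all y in X
    ok_variational &= dot([x[i] - px[i] for i in range(3)],
                          [y[i] - px[i] for i in range(3)]) <= 1e-12
    # (ii) ||Px - y||^2 + ||Px - x||^2 <= ||x - y||^2
    ok_firm &= dist2(px, y) + dist2(px, x) <= dist2(x, y) + 1e-12
    # (iii) nonexpansiveness: ||Px - Pz|| <= ||x - z||
    ok_nonexpansive &= dist2(px, pz) <= dist2(x, z) + 1e-12
```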

Remark 2.1.

In the literature on variational inequalities, there exists an alternative solution concept known as weak, or Minty, solutions. In this paper we are only interested in strong, or Stampacchia, solutions of VI(T,X), defined by inequality (1.1).

Another useful fact we use in this paper is the following elementary identity.

Lemma 2.2 (Pythagorean identity).

For all x_n, x_{n+1}, x ∈ ℝ^d we have

 ∥xn+1−x∥2+∥xn+1−xn∥2−∥xn−x∥2=2⟨xn+1−xn,xn+1−x⟩.

2.2. Probabilistic Tools

We recall the conditional Minkowski inequality: for given random vectors f, g ∈ L^p(Ω) with p ≥ 1 and a sub-sigma-algebra G ⊆ F, we have

 E[∥f+g∥p|G]1/p≤E[∥f∥p|G]1/p+E[∥g∥p|G]1/p. (2.1)

For the convergence analysis we will make use of the following classical lemma (see e.g. [26, Lemma 11, page 50]).

Lemma 2.3 (Robbins-Siegmund).

Let (Ω, F, (F_n)_{n≥0}, P) be a discrete stochastic basis. Let (v_n)_n, (u_n)_n, (θ_n)_n and (β_n)_n be nonnegative (F_n)-adapted processes with ∑_{n≥1} θ_n < ∞ and ∑_{n≥1} β_n < ∞ a.s., such that for all n

 E[v_{n+1}|F_n] ≤ (1 + θ_n)v_n − u_n + β_n P-a.s.

Then (v_n)_n converges a.s. to a random variable v_∞, and ∑_{n≥1} u_n < ∞ a.s.

Finally, we need the celebrated Burkholder-Davis-Gundy inequality (see e.g. [33]).

Lemma 2.4.

Let (Ω, F, (F_n)_{n≥0}, P) be a discrete stochastic basis and (U_n)_{n≥0} a vector-valued martingale relative to this basis with U_0 = 0. Then, for all p ≥ 1, there exists a universal constant C_p > 0 such that for every N ≥ 1

 E[(sup_{0≤i≤N} ∥U_i∥)^p]^{1/p} ≤ C_p E[(∑_{i=1}^N ∥U_i − U_{i−1}∥²)^{p/2}]^{1/p}.

When combined with the Minkowski inequality, we obtain for all p ≥ 2 a constant C_p > 0 such that for every N ≥ 1

 E[(sup_{0≤i≤N} ∥U_i∥)^p]^{1/p} ≤ C_p √(∑_{i=1}^N E[∥U_i − U_{i−1}∥^p]^{2/p}).

3. The stochastic forward-backward-forward algorithm

In this paper we study a forward-backward-forward algorithm of Tseng type under weak monotonicity assumptions. The blanket hypotheses we consider throughout our analysis are summarized here:

Assumption 1 (Consistency).

The solution set X∗ of VI(T,X) is nonempty.

Assumption 2 (Stochastic Model).

The set X ⊆ ℝ^d is nonempty, closed and convex, (Ξ, A) is a measurable space and F: ℝ^d × Ξ → ℝ^d is a Carathéodory map, i.e. the mapping x ↦ F(x, z) is continuous for a.e. z ∈ Ξ, and z ↦ F(x, z) is measurable for all x ∈ ℝ^d; ξ is a random variable with values in Ξ, defined on a probability space (Ω, F, P).

Assumption 3 (Lipschitz continuity).

The averaged operator T defined in (1.2) is Lipschitz continuous with modulus L > 0.

Assumption 4 (Pseudo-Monotonicity).

The averaged operator T is pseudo-monotone, which means

 ∀x,y∈Rd:⟨T(x),y−x⟩≥0⇒⟨T(y),y−x⟩≥0.

At each iteration, the decision maker has access to a stochastic oracle, reporting an approximation of T(x) of the form

 ^Tn+1(x,ξn+1)≜1mn+1mn+1∑i=1F(x,ξ(i)n+1) for x∈Rd. (3.1)

The sequence (m_n)_{n≥1} determines the batch size of the stochastic oracle. The random sequence (ξ^{(i)}_{n+1})_i is an i.i.d. draw from the law of ξ. Approximations of the form (3.1) are very common in Monte-Carlo simulation approaches, machine learning and computational statistics (see e.g. [1, 2], and references therein); they are easy to obtain whenever we are able to sample from the underlying measure. The forward-backward-forward algorithm requires two queries of the stochastic oracle, in which mini-batch estimators of the averaged map T are revealed. This dynamic sampling strategy requires a sequence of integers (m_n)_n (the batch size) determining the size of the data set to be processed at each iteration. The random sample of each mini-batch consists of two independent stochastic processes (ξ_n)_n and (η_n)_n drawn from the law of ξ, and explicitly given by

 ξ_n ≜ (ξ^{(1)}_n, …, ξ^{(m_n)}_n) and η_n ≜ (η^{(1)}_n, …, η^{(m_n)}_n) ∀n ≥ 1.

Given the current position X_n, Algorithm SFBF queries the SO once to obtain the estimator A_{n+1} ≜ ^T_{n+1}(X_n, ξ_{n+1}), and then constructs the random variable Y_n = Π_X[X_n − α_n A_{n+1}]. Next, a second query to the SO is made to obtain the estimator B_{n+1} ≜ ^T_{n+1}(Y_n, η_{n+1}), followed by the update X_{n+1} = Y_n + α_n(A_{n+1} − B_{n+1}). The pseudocode for SFBF is given in Algorithm 1.
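The iteration just described — two oracle queries but only one projection per step — can be sketched as follows on a toy instance of our own choosing (rotation operator over a box, with illustrative step size, noise level and batch schedule); note that the forward correction producing X_{n+1} may leave the feasible set, in line with the infeasibility of the method:

```python
import random

random.seed(3)

def T(x):
    # Rotation operator: pseudo-monotone (indeed monotone), Lipschitz L = 1.
    return [x[1], -x[0]]

def oracle(x, m):
    # Mini-batch stochastic oracle as in (3.1): average of m noisy samples.
    tx, est = T(x), [0.0, 0.0]
    for _ in range(m):
        est[0] += tx[0] + random.gauss(0.0, 0.5)
        est[1] += tx[1] + random.gauss(0.0, 0.5)
    return [e / m for e in est]

def proj(x):
    # Euclidean projection onto X = [-1, 1]^2.
    return [min(max(xi, -1.0), 1.0) for xi in x]

alpha = 0.3          # constant step size, below 1/(sqrt(2) L) here
x = [0.5, 0.5]
for n in range(1, 151):
    m = n * n                                   # growing batch size
    a = oracle(x, m)                            # first SO query, at X_n
    y = proj([x[i] - alpha * a[i] for i in range(2)])     # single projection
    b = oracle(y, m)                            # second SO query, at Y_n
    x = [y[i] + alpha * (a[i] - b[i]) for i in range(2)]  # forward correction

dist = (x[0] ** 2 + x[1] ** 2) ** 0.5
```

Compared with a SEG sketch, the per-iteration cost differs exactly by one projection, which becomes significant when Π_X itself requires solving an optimization problem.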

Observe that Algorithm SFBF is an infeasible method: the iterates X_n are not necessarily elements of the admissible set X, but the process (Y_n)_n is feasible by construction. In the stochastic optimization case, i.e. for instances where F(·, ξ) is an unbiased estimator of the gradient of a real-valued function, the step producing Y_n is a projected gradient step, where A_{n+1} acts as an unbiased estimator of the stochastic gradient. This gradient step is then used in an extrapolation step to generate the iterate X_{n+1}. We just mention that related popular primal-dual splitting schemes like ADMM [3, 5] are infeasible by nature as well.

Assumption 5 (Step-size choice).

The step-size sequence (α_n)_{n≥0} in Algorithm SFBF satisfies

 0 < α̲ ≜ inf_{n≥0} α_n ≤ ᾱ ≜ sup_{n≥0} α_n < 1/(√2 L).

For n ≥ 0, we introduce the approximation errors

 Wn+1≜An+1−T(Xn), and Zn+1≜Bn+1−T(Yn), (3.2)

and the sub-sigma-algebras F_n and ^F_n, defined by F_0 ≜ σ(X_0), and

 Fn ≜σ(X0,ξ1,ξ2,…,ξn,η1,…,ηn)∀n≥1,

and

 ^Fn ≜σ(X0,ξ1,…,ξn,ξn+1,η1,…,ηn)∀n≥0,

respectively. Observe that F_n ⊆ ^F_n ⊆ F_{n+1} for all n ≥ 0. We also define the filtrations F ≜ (F_n)_{n≥0} and ^F ≜ (^F_n)_{n≥0}. The introduction of these two different sub-sigma-algebras is important for several reasons. First, observe that they embody the information the learner has about the optimization problem. Indeed, the sub-sigma-algebra F_n corresponds to the information the decision maker has at the beginning of the n-th iteration, whereas ^F_n is the information the decision maker has after the first (projection) step of the iteration. Therefore, Y_n is measurable with respect to the sub-sigma-algebra ^F_n, and X_{n+1} is measurable with respect to the sub-sigma-algebra F_{n+1}. Second, we see that the process (W_{n+1})_{n≥0} is ^F-adapted, whereas the process (Z_{n+1})_{n≥0} is F-adapted, and unbiased approximations relative to the respective information structures are provided:

 E[Wn+1|Fn]=0 and E[Zn+1|^Fn]=0 ∀n≥0.
Assumption 6 (Batch Size).

The batch size sequence (m_n)_{n≥1} satisfies ∑_{n≥1} 1/m_n < ∞.

A sufficient condition on the sequence (m_n)_n is that for some constant c > 0 and integer n_0 ≥ 1, we have

 m_n = c (n + n_0)^{1+a} ln(n + n_0)^{1+b} (3.3)

for a > 0 and b ≥ −1, or a = 0 and b > 0. The next assumption is essentially the same as the variance control assumption in [15].
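A quick numerical sanity check (with illustrative constants c = 1, n₀ = 2, a = 1, b = 0 of our own choosing, rounded up to integers) confirms that batch sizes of the form (3.3) grow fast enough for ∑_n 1/m_n to converge:

```python
import math

# Batch sizes m_n = ceil(c (n + n0)^(1+a) ln(n + n0)^(1+b)) with c = 1,
# n0 = 2, a = 1, b = 0: illustrative constants satisfying a > 0, b >= -1.
def m(n, c=1.0, n0=2, a=1.0, b=0.0):
    return math.ceil(c * (n + n0) ** (1.0 + a) * math.log(n + n0) ** (1.0 + b))

partial = [0.0]
for n in range(1, 100001):
    partial.append(partial[-1] + 1.0 / m(n))

# The partial sums stabilize: the tail beyond n = 10^4 is already tiny.
tail = partial[100000] - partial[10000]
```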

Assumption 7 (Variance Control).

For all x ∈ ℝ^d and p ≥ 2, let

 sp(x)≜E[∥F(x,ξ)−T(x)∥p]1/p.

There exist p ≥ 2, σ_0 ≥ 0 and a measurable locally bounded function σ: X∗ → [0, ∞) such that for all x ∈ ℝ^d and all x∗ ∈ X∗

 sp(x)≤σ(x∗)+σ0∥x−x∗∥. (3.4)

Before we proceed with the convergence analysis, we want to make some clarifying remarks on this assumption. The most frequently used assumption on the SO’s approximation error, which dates back to the seminal work of Robbins and Monro (see [22, 9] for a textbook reference), asks for uniformly bounded variance (UBV), i.e.

 supx∈Xs2(x)≤σ. (UBV)

UBV is covered by Assumption 7 when σ_0 = 0 and σ(·) ≡ σ. UBV is for instance valid when additive noise with finite p-th moment is assumed, that is, for some random variable ξ with E[∥ξ∥^p] < ∞ we have

 F(x,ξ)=T(x)+ξP-a.s. .

However, assuming a global variance bound is not realistic in cases where the variance of the stochastic oracle depends on the position (see e.g. Example 1 in [17]). Assumption 7 is much weaker than UBV, as it exploits the local variance of the stochastic oracle rather than global mean-square variance bounds, which are potentially hard to estimate. The recent papers [15, 17] make similar assumptions on the variance of the stochastic oracle. It is shown there that such an assumption is most natural in cases where the feasible set is unbounded, and it is always satisfied when the Carathéodory functions are random Lipschitz (see Example 3.1 below). Since Algorithm 1 is an infeasible method, we are forced to analyze the behavior of the stochastic process on an unbounded domain, which makes Assumption 7 the only realistic and convenient choice for us. Example 3.1 illustrates an important instance where Assumption 7 holds.

Example 3.1.

Assume for the Carathéodory mapping F that there exists a random variable L(ξ) ≥ 0 with

 ∥F(x, ξ) − F(y, ξ)∥ ≤ L(ξ)∥x − y∥ ∀x, y ∈ ℝ^d.

Call L(ξ) the (random) Lipschitz constant of the map x ↦ F(x, ξ). Then, a repeated application of the Minkowski inequality shows that for all x ∈ ℝ^d and all x∗ ∈ X∗ we have

 s_p(x) ≤ E[∥F(x, ξ) − F(x∗, ξ)∥^p]^{1/p} + s_p(x∗) + ∥T(x) − T(x∗)∥
  ≤ (E[L(ξ)^p]^{1/p} + L)∥x − x∗∥ + s_p(x∗).

Let L_p denote a bound on E[L(ξ)^p]^{1/p} and set σ_0 ≜ L_p + L and σ(x∗) ≜ s_p(x∗), to get a variance bound as required in Assumption 7.

4. Convergence Analysis

We consider the quadratic residual function defined by

 ra(x)2≜∥x−ΠX(x−aT(x))∥2 ∀x∈Rd.

The reader familiar with the literature on finite-dimensional variational inequalities will recognize this immediately as the energy defined by the natural map [11, chapter 10]. It is well known that x ↦ r_a(x) is a merit function for VI(T,X): it is nonnegative and vanishes exactly on the solution set X∗. Moreover, {r_a}_{a>0} is a family of equivalent merit functions for VI(T,X), in the sense that the residuals for different parameter values a > 0 mutually bound each other [11, Proposition 10.3.6]. Denote

 ρn≜1−2L2α2n∀n≥0. (4.1)
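The residual r_a is straightforward to evaluate once Π_X is available. The sketch below (box constraint and rotation operator chosen by us purely for illustration) confirms that it vanishes at the solution x∗ = 0 and is positive away from it:

```python
def T(x):
    # Rotation operator on R^2; for X = [-1, 1]^2 the unique solution of
    # VI(T, X) is x* = (0, 0).
    return [x[1], -x[0]]

def proj(x):
    # Euclidean projection onto the box X = [-1, 1]^2.
    return [min(max(xi, -1.0), 1.0) for xi in x]

def residual(x, a):
    # r_a(x) = ||x - Pi_X(x - a T(x))||: the natural-map residual.
    step = proj([x[i] - a * T(x)[i] for i in range(2)])
    return sum((x[i] - step[i]) ** 2 for i in range(2)) ** 0.5

r_at_solution = residual([0.0, 0.0], 0.5)
r_elsewhere = residual([0.5, 0.5], 0.5)
```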

We define recursively the process (V_n)_{n≥0} by V_0 ≜ 0 and, for all n ≥ 0,

 Vn+1≜Vn+(4+ρn)α2n∥Wn+1∥2+4α2n∥Zn+1∥2,

so that

 ΔVn≜Vn+1−Vn=(4+ρn)α2n∥Wn+1∥2+4α2n∥Zn+1∥2∀n≥0. (4.2)

Additionally, we define for all x ∈ ℝ^d the process (U_n(x))_{n≥0} given by U_0(x) ≜ 0, and

 U_{n+1}(x) ≜ U_n(x) + 2α_n⟨Z_{n+1}, x − Y_n⟩ ∀n ≥ 0

with corresponding increment

 ΔUn(x)≜2αn⟨Zn+1,x−Yn⟩∀n≥0.

For any reference point x ∈ ℝ^d we see that E[ΔU_n(x)|^F_n] = 0 for all n ≥ 0. Hence, the process (U_n(x))_{n≥0} is a martingale w.r.t. the filtration ^F. Since F_n ⊆ ^F_n, the tower property implies that

 E[ΔUn(x)|Fn]=0∀x∈Rd ∀n≥0, (4.3)

showing that it is also an F-martingale. (V_n)_{n≥0} is an increasing process, with increments whose expected value is determined by the variance of the approximation error of the stochastic oracle feedback. In terms of these increment processes, we establish the following fundamental recursion.

Lemma 4.1.

For all x∗ ∈ X∗ and all n ≥ 0 we have

 ∥Xn+1−x∗∥2≤∥Xn−x∗∥2−ρn2rαn(Xn)2+ΔUn(x∗)+ΔVnP−a.s.. (4.4)
Proof.

This recursive relation follows via several simple algebraic steps. Let n ≥ 0 and x∗ ∈ X∗ be fixed.

Step 1

We have

 ⟨T(x∗),y−x∗⟩≥0∀y∈X.

Using that Y_n ∈ X, as well as the pseudo-monotonicity of T, we see

 ⟨αnT(Yn),Yn−x∗⟩≥0.

Using the Doob decomposition in equation (3.2), we can rewrite this inequality as

 ⟨αnBn+1,Yn−x∗⟩≥αn⟨Zn+1,Yn−x∗⟩. (4.5)

Since Y_n = Π_X(X_n − α_n A_{n+1}), from Lemma 2.1(i) we conclude that

 ⟨x∗−Yn,Yn−Xn+αnAn+1⟩≥0. (4.6)

Adding (4.5) and (4.6), we obtain

 ⟨α_n(A_{n+1} − B_{n+1}) − X_n + Y_n, x∗ − Y_n⟩ ≥ α_n⟨Z_{n+1}, Y_n − x∗⟩,

which is equivalent to

 ⟨x∗ − Y_n, X_{n+1} − X_n⟩ ≥ α_n⟨Z_{n+1}, Y_n − x∗⟩. (4.7)

Step 2

Using (4.7), we get

 ⟨X_{n+1} − X_n, X_{n+1} − x∗⟩ = ⟨X_{n+1} − X_n, Y_n − x∗⟩ + ⟨X_{n+1} − X_n, X_{n+1} − Y_n⟩
  ≤ ⟨α_n Z_{n+1}, x∗ − Y_n⟩ + ∥X_{n+1} − X_n∥² + ⟨X_{n+1} − X_n, X_n − Y_n⟩
  = ⟨α_n Z_{n+1}, x∗ − Y_n⟩ + ∥X_{n+1} − X_n∥² − ∥X_n − Y_n∥² + α_n⟨A_{n+1} − B_{n+1}, X_n − Y_n⟩,

where we have used the definition of X_{n+1} in the last equality. The Pythagorean identity in Lemma 2.2 gives us

 ∥X_{n+1} − x∗∥² = ∥X_n − x∗∥² − ∥X_{n+1} − X_n∥² + 2⟨X_{n+1} − X_n, X_{n+1} − x∗⟩
  ≤ ∥X_n − x∗∥² + ∥X_{n+1} − X_n∥² − 2∥X_n − Y_n∥² + 2⟨α_n Z_{n+1}, x∗ − Y_n⟩ + 2α_n⟨A_{n+1} − B_{n+1}, X_n − Y_n⟩.

Step 3.

Using again the definition of X_{n+1}, we see

 ∥X_{n+1} − X_n∥² = ∥Y_n + α_n(A_{n+1} − B_{n+1}) − X_n∥²
  = ∥X_n − Y_n∥² + α_n²∥A_{n+1} − B_{n+1}∥² + 2α_n⟨A_{n+1} − B_{n+1}, Y_n − X_n⟩
  ≤ ∥X_n − Y_n∥² + 2α_n²∥T(X_n) − T(Y_n)∥² + 2α_n²∥W_{n+1} − Z_{n+1}∥² + 2α_n⟨A_{n+1} − B_{n+1}, Y_n − X_n⟩
  ≤ ∥X_n − Y_n∥² + 2L²α_n²∥X_n − Y_n∥² + 4α_n²∥W_{n+1}∥² + 4α_n²∥Z_{n+1}∥² + 2α_n⟨A_{n+1} − B_{n+1}, Y_n − X_n⟩.

The first inequality uses the decomposition A_{n+1} − B_{n+1} = (T(X_n) − T(Y_n)) + (W_{n+1} − Z_{n+1}) together with the elementary bound ∥a + b∥² ≤ 2∥a∥² + 2∥b∥². The second inequality follows from the L-Lipschitz continuity of the averaged operator T (Assumption 3), and again the same elementary bound. Combining this with the last inequality obtained in Step 2, we see that

 ∥Xn+1−x∗∥2 ≤∥Xn−x∗∥2−(1−2L2α2n)∥Xn−Yn∥2+4α2n∥Wn+1∥2 +4α2n∥Zn+1∥2+2⟨αnZn+1,x∗−Yn⟩.

Step 4

By the definition of the squared residual function, the definition of Y_n, and Lemma 2.1(iii), we have

 r_{α_n}(X_n)² = ∥X_n − Π_X(X_n − α_n T(X_n))∥²
  ≤ 2∥X_n − Y_n∥² + 2∥Y_n − Π_X(X_n − α_n T(X_n))∥²
  = 2∥X_n − Y_n∥² + 2∥Π_X(X_n − α_n A_{n+1}) − Π_X(X_n − α_n T(X_n))∥²
  ≤ 2∥X_n − Y_n∥² + 2α_n²∥W_{n+1}∥².

Hence,

 −2∥Xn−Yn∥2≤2α2n∥Wn+1∥2−rαn(Xn)2. (4.8)

Step 5

Combining (4.8) with the last inequality from Step 3 and recalling Assumption 5, we conclude

 ∥X_{n+1} − x∗∥² ≤ ∥X_n − x∗∥² − ½(1 − 2L²α_n²) r_{α_n}(X_n)² + (1 − 2L²α_n²)α_n²∥W_{n+1}∥² + 4α_n²∥W_{n+1}∥² + 4α_n²∥Z_{n+1}∥² + 2⟨α_n Z_{n+1}, x∗ − Y_n⟩
  = ∥X_n − x∗∥² − (ρ_n/2) r_{α_n}(X_n)² + (4 + ρ_n)α_n²∥W_{n+1}∥² + 4α_n²∥Z_{n+1}∥² + 2⟨α_n Z_{n+1}, x∗ − Y_n⟩.

The definitions of the increments ΔU_n(x∗) and ΔV_n associated with the processes (U_n(x∗))_n and (V_n)_n give the claimed result. ∎

Remark 4.1.

One can notice that the pseudo-monotonicity of T is used only in Step 1 of the above proof, in order to obtain relation (4.5). Thus, as in [8, 32], the pseudo-monotonicity of T can actually be replaced by the following weaker assumption:

 ⟨T(x),x−x∗⟩≥0∀x∈X,x∗∈X∗.