Forward-backward-forward methods with variance reduction for stochastic variational inequalities

We develop a new stochastic algorithm with variance reduction for solving pseudo-monotone stochastic variational inequalities. Our method builds on Tseng's forward-backward-forward (FBF) algorithm, which is known in the deterministic literature to be a valuable alternative to Korpelevich's extragradient method when solving variational inequalities over a convex and closed set governed by pseudo-monotone, Lipschitz continuous operators. The main computational advantage of Tseng's algorithm is that it relies only on a single projection step and two independent queries of a stochastic oracle. Our algorithm incorporates a variance reduction mechanism and leads to almost sure (a.s.) convergence to an optimal solution. To the best of our knowledge, this is the first stochastic look-ahead algorithm achieving this by using only a single projection at each iteration..

READ FULL TEXT VIEW PDF
02/16/2021

Stochastic Variance Reduction for Variational Inequality Methods

We propose stochastic variance reduced algorithms for solving convex-con...
03/17/2020

A Relaxed Inertial Forward-Backward-Forward Algorithm for Solving Monotone Inclusions with Application to GANs

We introduce a relaxed inertial forward-backward-forward (RIFBF) splitti...
11/05/2020

Simple and optimal methods for stochastic variational inequalities, I: operator extrapolation

In this paper we first present a novel operator extrapolation (OE) metho...
08/22/2019

On the convergence of single-call stochastic extra-gradient methods

Variational inequalities have recently attracted considerable interest i...
06/22/2019

A Unifying Framework for Variance Reduction Algorithms for Finding Zeroes of Monotone Operators

A wide range of optimization problems can be recast as monotone inclusio...
02/19/2019

Stochastic Conditional Gradient++

In this paper, we develop Stochastic Continuous Greedy++ (SCG++), the fi...
10/15/2020

Adaptive and Universal Single-gradient Algorithms for Variational Inequalities

Variational inequalities with monotone operators capture many problems o...

1. Introduction

In this paper we consider the following variational inequality problem, denoted as , or simply : given a nonempty closed and convex set and a single valued map , find such that

(1.1)

We call the set of (Stampacchia) solutions of . The variational inequality problem (1.1) arises in many interesting applications in economics, game theory and engineering [29, 27, 19, 18, 24], and includes as a special case first-order optimality conditions for nonlinear optimization, by choosing for some smooth function . If is unbounded, it can also be used to formulate complementarity problems, systems of equations, saddle point problems and many equilibrium problems. We refer the reader to [11] for an extensive review of applications in engineering and economics.

In many instances the problem

arises as the expected value of an underlying stochastic optimization problem whose primitives are defined on a probability space

carrying a random variable

taking values in a measurable space and inducing a law . Given the random element , consider the measurable mapping

, defining an integrable random vector

via the composition . The stochastic variational inequality problem on which we will focus in this paper is denoted by and defined as follows:

Definition 1.1.

Let the operator be defined by

(1.2)

Find satisfying (1.1).

This definition is known as the expected value formulation of the stochastic variational inequality problem. The expected value formulation goes back to the seminal work of [20]. By its very definition, if the operator defined in (1.2) would be known, then the expected value formulation can be solved by any standard solution technique for deterministic variational inequalities. However, in practice, the operator is usually not directly accessible, either due to excessive computations involved in performing the integral, or because itself is the solution of an embedded subproblem. Hence, in most situations of interest, the solution of relies on random samples of the operator . In this context, there are two current methodologies available; the sample average approximation (SAA

) approach replaces the expected value formulation with an empirical estimator of the form

and use the resulting deterministic map as the input in one existing algorithm of choice. We refer to [30] for this solution approach in connection with Monte Carlo simulation. We note that this approach is the standard choice in expected residual minimization problems, when is unknown but accessible via a Monte Carlo approach.

A different methodology is the stochastic approximation (SA) approach, where samples are obtained in an online fashion, namely, the decision maker chooses one deterministic algorithm to solve the expected value formulation, and draws a fresh random variable whenever needed. The mechanism to draw a fresh sample from is usually named a stochastic oracle (SO), which report generates a stochastic error .

Until very recently, the SA approach has only been used for the expected value formulation under very restrictive assumptions. To the best of our knowledge, the first formulation of an SA approach for a stochastic VI problem was made by [16], under the assumption of strong monotonicity and continuity of the operator . There, a proximal point algorithm of the form

(1.3)

is considered, where denotes the Euclidean projection onto , is a sample of , and is a sequence of positive step sizes. Almost sure convergence of the iterates is proven for small step sizes, assuming is Lipschitz continuous and strongly monotone, and the stochastic error is uniformly bounded. Relaxing strong monotonicity to plain monotonicity, the recent paper [37] incorporated a Tikhonov regularization scheme into the stochastic approximation algorithm (1.3) and proved almost sure convergence of the generated stochastic process. The only established method guaranteeing almost sure convergence under the significantly weaker assumption of pseudo-monotonicty of the mean operator is the extragradient approach of [15]. The original Korpelevich extragradient scheme of [21] consists of two projection steps using two evaluations of the deterministic map at generated test points and . Extending this to the stochastic oracle case, we arrive at the stochastic extra-gradient (SEG) method

(SEG)

where are stochastic estimators of , and , respectively. The paper [15] constructs these estimators by relying on a dynamic sampling strategy, where noise reduction of the estimators is achieved via a mini-batch sampling of the stochastic operators ) and . Within this mini-batch formulation, almost sure convergence of the stochastic process to the solution set can be proven even with constant step size implementations of SEG. On top, optimal convergence rates of in terms of the mean squared residual of the are obtained.

1.1. Our Contribution

We briefly summarize the main contributions of this work. The most costly part of SEG are the two separate projection steps performed at each single iteration of the method. We show in this paper that a stochastic version of Tseng’s forward-backward-forward [35], which we call the stochastic forward-backward-forward (SFBF) algorithm, preserves the strong trajectory-based convergence results, while the saving of one projection step allows us to beat SEG significantly in terms of computational overheads and runtimes. In terms of convergence properties the SFBF algorithm developed in this paper has the same good properties as SEG. However, SFBF is potentially more efficient than SEG in each iteration since it relies only on a single euclidean projection step. The price to pay for this is that we obtain an infeasible method (as is typical for primal-dual schemes) with a lower computational complexity count at the positive side. Additionally, the theoretically allowed range for step sizes is by the constant factor times larger than the theoretically allowed largest step size in SEG. This constant factor gain results in significant improvements in terms of the convergence speed. This will be illustrated with extensive numerical evidences reported in Section 6.

2. Preliminaries

2.1. Notation

For , we denote by the standard inner product, and by the corresponding norm. For , the norm on is defined for as . For a nonempty, closed and convex set , the Euclidean projector is defined as for . All random elements are defined on a given probability space . An -valued random variable is a -measurable mapping ; we write . For every , define the equivalence class of random variables with as . If , the conditional expectation of the random variable is denoted by . For we denote the sigma-algebra generated by these random variables by , this is the smallest sigma-algebra measuring the random variables . Let be a complete stochastic basis. We denote by the set of random sequences such for each , . For , we set

The following properties of the euclidean projection on a closed and convex set are well known.

Lemma 2.1.

Let be a nonempty, closed and convex set. Then:

  • is the unique point of satisfying for all ;

  • for all and , we have ;

  • for all , ;

  • given and , the set of solutions of the variational problem can be expressed as .

Remark 2.1.

In the literature on variational inequalities, there exists an alternative solution concept known as weak, or Minty, solutions. In this paper we are only interested in strong, or Stampacchia, solutions of , defined by inequality (1.1).

Another useful fact we use in this paper is the following elementary identity.

Lemma 2.2 (Pythagorean identity).

For all we have

2.2. Probabilistic Tools

We recall the Minkowski inequality: for given functions and , we have

(2.1)

For the convergence analysis we will make use of the following classical lemma (see e.g. [26, Lemma 11, page 50]).

Lemma 2.3 (Robbins-Siegmund).

Let be a discrete stochastic basis. Let and be such that for all

Then converges a.s. to a random variable , and .

Finally, we need the celebrated Burkholder-Davis-Gundy inequality (see e.g. [33]).

Lemma 2.4.

Let be a discrete stochastic basis and a vector-valued martingale relative to this basis. Then, for all , there exists a universal constant such that for every

When combined with Minkowski inequality, we obtain for all a constant such that for every

3. The stochastic forward-backward-forward algorithm

In this paper we study a forward-backward-forward algorithm of Tseng type under weak monotonicity assumptions. The blanket hypotheses we consider throughout our analysis are summarized here:

Assumption 1 (Consistency).

The solution set is nonemtpy.

Assumption 2 (Stochastic Model).

The set is nonempty, closed and convex, is a measurable space and is a Carathéodory map.111The mapping is continuous for a. e. , and is measurable for all ; is a random variable with values in , defined on a probability space .

Assumption 3 (Lipschitz continuity).

The averaged operator is Lipschitz continuous with modulus .

Assumption 4 (Pseudo-Monotonicity).

The averaged operator is pseudo-monotone on , which means

At each iteration, the decision maker has access to a stochastic oracle, reporting an approximation of of the form

(3.1)

The sequence determines the batch size of the stochastic oracle. The random sequence is an i.i.d draw from . Approximations of the form (3.1

) are very common in Monte-Carlo simulation approaches, machine learning and computational statistics (see e.g.

[1, 2], and references therein); they are easy to obtain in case we are able to sample from the measure . The forward-backward-forward algorithm requires two queries from the stochastic oracle in which mini-batch estimators of the averaged map are revealed. This dynamic sampling strategy requires a sequence of integers (the batch size) determining the size of the data set to be processed at each iteration. The random sample on each mini-batch consists of two independent stochastic processes and drawn from the law , and explicitly given by

Given the current position , Algorithm SFBF queries the SO once, to obtain the estimator , and then constructs the random variable . Next, a second query to SO is made to obtain the estimator , followed by the update . The pseudocode for SFBF is given in Algorithm 1.

1:step-size sequence ; batch size sequence
2:Initialize # initialization
3:for  do
4:     Draw samples and from ()
5:     Oracle returns # first oracle query
6:     Set #forward-backward step
7:     Oracle returns # second oracle query
8:     Set # second forward step
9:end for
Algorithm 1 Stochastic forward-backward-forward (SFBF)

Observe that Algorithm SFBF is an infeasible method: the iterates are not necessarily elements of the admissible set , but the process is by construction so. In the stochastic optimization case, i.e. for instances where

is an unbiased estimator of the gradient of a real-valued function, the process

is seen to be a projected gradient step, where acts as an unbiased estimator for the stochastic gradient. This gradient step is used in an extrapolation step to generate the iterate . We just mention that related popular primal-dual splitting schemes like ADMM [3, 5] are infeasible by nature as well.

Assumption 5 (Step-size choice).

The step-size sequence in Algorithm SFBF satisfies

For , we introduce the approximation error

(3.2)

and the sub-sigma algebras , defined by , and

and

respectively. Observe that for all . We also define the filtrations and . The introduction of these two different sub-sigma algebras is important for many reasons. First, observe that they embody the information the learner has about the optimization problem. Indeed, the sub-sigma algebra corresponds to the information the decision maker has at the beginning the -th iteration, whereas is the information the decision maker has after the first (projection)-step of the iteration. Therefore, is measurable with respect to the sub-sigma algebra and is measurable with respect to the sub-sigma algebra . Second, we see that the process is -adapted, whereas the process is -adapted, unbiased approximations relative to the respective information structures are provided:

Assumption 6 (Batch Size).

The batch size sequence satisfies .

A sufficient condition on the sequence is that for some constant and integer , we have

(3.3)

for and , or and . The next assumption is essentially the same as the variance control assumption in [15].

Assumption 7 (Variance Control).

For all and , let

There exist , and a measurable locally bounded function such that for all and all

(3.4)

Before we proceed with the convergence analysis, we want to make some clarifying remarks on this assumption. The most frequently used assumption on the SO’s approximation error, which dates back to the seminal work of Robbins and Monro (see [22, 9] for a textbook reference), asks for uniformly bounded variance (UBV), i.e.

(UBV)

UBV is covered by creftype 7 when and . UBV is for instance valid when additive noise with finite

-th moment is assumed, that is, for some random variable

with we have

However, assuming a global variance bound is not realistic in cases where the variance of the stochastic oracle depends on the position (see e.g. Example 1 in [17]). creftype 7 is much weaker than UBV, as it exploits the local variance of the stochastic oracle rather than, potentially hard to estimate, global mean square variance bounds. The recent papers [15, 17] make similar assumptions on the variance of the stochastic oracle. It is shown there that creftype 7 is most natural in cases where the feasible set is unbounded, and it is always satisfied when the Carathéodory functions are random Lipschitz (see Example 3.1 below). Since Algorithm 1 is an infeasible method, we are forced to analyze the behavior of the stochastic process on an unbounded domain, which makes creftype 7 the only realistic and convenient choice for us. Example 3.1 illustrates an important instance where creftype 7 holds.

Example 3.1.

Assume for the Carathéodory mapping that there exists with

Call the Lipschitz constant of the map . Then, a repeated application of the Minkowski inequality shows that for all and all we have

Let denote a bound on and set , to get a variance bound as required in creftype 7.

4. Convergence Analysis

We consider the quadratic residual function defined by

The reader familiar with the literature on finite-dimensional variational inequalities will recognize this immediately as the energy defined by the natural map [11, chapter 10]. It is well known that is a merit function for . Moreover, is a family of equivalent merit functions for , in the sense that for all [11, Proposition 10.3.6]. Denote

(4.1)

We define recursively the process by and, for all ,

so that

(4.2)

Additionally, we define for all the process given by , and

with corresponding increment

For any reference point we see that for all . Hence, the process is a martingale w.r.t. the filtration . Since , the tower property implies that

(4.3)

showing that it is also a -martingale. is an increasing process, with increments whose expected value is determined by the variance of the approximation error of the stochastic oracle feedback. In terms of these increment processes, we establish the following fundamental recursion.

Lemma 4.1.

For all and all we have

(4.4)
Proof.

This recursive relation follows via several simple algebraic steps. Let be and fixed.

Step 1

We have

Using that as well as the pseudo-monotonicity of , we see

Using the Doob decomposition in equation (3.2), we can rewrite this inequality as

(4.5)

Since , from Lemma 2.1(i) we conclude that

(4.6)

Adding (4.5) and (4.6) gives

which is equivalent to

(4.7)

Step 2

Using (4.7), we get

where we have used the definition of in the last equality. The Pythagoras identity in Lemma 2.2 gives us

Step 3.

Using again the definition of , we see

The first inequality is the Cauchy-Schwarz inequality. The second inequality follows from the -Lipschitz continuity of the averaged operator (creftype 3), and again the Cauchy-Schwarz inequality. Combining this with the last inequality obtained in Step 2, we see that

Step 4

By the definition of the squared residual function, the definition of and Lemma 2.1(iii), we have

Hence,

(4.8)

Step 5

Combining (4.8) with the last inequality from Step 3 and recalling creftype 5, we conclude

The definitions of the increments associated with the martingales and give the claimed result.

Remark 4.1.

One can notice that in the above proof the pseudo-monotonicity of is used only in Step 1 of the above proof, if order to obtain relation (4.5). Thus, as happened in [8, 32], the pseudo-monotonicity of can actually be replaced by the following weaker assumption

See also [25] for a similar condition.

In the following, we let be the exponent as specified in creftype 7. Taking conditional expectations in equation (4.4) and using the martingale property (4.3), we see for all that

(4.9)

In order to prove convergence of the process , we aim to deduce a stochastic quasi-Fejér relation. For that we need to understand the properties of the conditional expectation

Let be . The monotonicity of