1. Introduction
In this paper we consider the following variational inequality problem, denoted as $\operatorname{VI}(T,X)$, or simply $\operatorname{VI}$: given a nonempty closed and convex set $X\subseteq\mathbb{R}^d$ and a single-valued map $T:X\to\mathbb{R}^d$, find $x^*\in X$ such that
(1.1) $\qquad \langle T(x^*),\,x-x^*\rangle\ge 0 \quad \text{for all } x\in X.$
We call $\operatorname{SOL}(T,X)$ the set of (Stampacchia) solutions of $\operatorname{VI}(T,X)$. The variational inequality problem (1.1) arises in many interesting applications in economics, game theory and engineering [29, 27, 19, 18, 24], and includes as a special case first-order optimality conditions for nonlinear optimization, obtained by choosing $T=\nabla f$ for some smooth function $f$. If $X$ is unbounded, it can also be used to formulate complementarity problems, systems of equations, saddle point problems and many equilibrium problems. We refer the reader to [11] for an extensive review of applications in engineering and economics.
In many instances the problem $\operatorname{VI}(T,X)$ arises as the expected value of an underlying stochastic optimization problem whose primitives are defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$ carrying a random variable $\xi$ taking values in a measurable space $(\Xi,\mathcal{E})$ and inducing a law $\mathsf{P}=\mathbb{P}\circ\xi^{-1}$. Given the random element $\xi$, consider the measurable mapping $F:X\times\Xi\to\mathbb{R}^d$, defining an integrable random vector via the composition $x\mapsto F(x,\xi)$. The stochastic variational inequality problem on which we will focus in this paper is denoted by $\operatorname{SVI}(F,X)$ and defined as follows:

Definition 1.1.
Given $F$ and $X$ as above, the problem $\operatorname{SVI}(F,X)$ is to find $x^*\in X$ such that $\langle T(x^*),\,x-x^*\rangle\ge 0$ for all $x\in X$, where
(1.2) $\qquad T(x):=\mathbb{E}[F(x,\xi)] \quad \text{for all } x\in X.$
This definition is known as the expected value formulation of the stochastic variational inequality problem. The expected value formulation goes back to the seminal work of [20]. By its very definition, if the operator $T$ defined in (1.2) were known, then the expected value formulation could be solved by any standard solution technique for deterministic variational inequalities. However, in practice, the operator $T$ is usually not directly accessible, either due to excessive computations involved in performing the integral, or because $F$ itself is the solution of an embedded subproblem. Hence, in most situations of interest, the solution of $\operatorname{SVI}(F,X)$ relies on random samples of the operator $F$. In this context, there are two current methodologies available; the sample average approximation (SAA) approach replaces the expected value formulation with an empirical estimator of the form
$\qquad \hat{T}_N(x):=\frac{1}{N}\sum_{i=1}^{N}F(x,\xi_i),$
and uses the resulting deterministic map as the input to an existing algorithm of choice. We refer to [30] for this solution approach in connection with Monte Carlo simulation. We note that this approach is the standard choice in expected residual minimization problems, when $\mathsf{P}$ is unknown but accessible via a Monte Carlo approach.
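To make the SAA construction concrete, the following sketch builds the empirical estimator from i.i.d. Monte Carlo samples. The toy operator $F(x,\xi)=x+\xi$ with $\mathbb{E}[\xi]=0$ (so that $T(x)=x$) is a hypothetical choice for illustration only:

```python
import numpy as np

# Hypothetical toy stochastic operator (an assumption for illustration):
# F(x, xi) = x + xi with E[xi] = 0, so the mean map is T(x) = x.
def F(x, xi):
    return x + xi

def saa_estimator(x, samples):
    """Empirical (sample average) estimator of T(x) = E[F(x, xi)]."""
    return np.mean([F(x, xi) for xi in samples], axis=0)

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
samples = rng.normal(size=(10_000, 2))   # i.i.d. draws of xi ~ N(0, I)
T_hat = saa_estimator(x, samples)        # close to T(x) = x for large N
```

The resulting deterministic map `T_hat` would then be handed to any solver for deterministic variational inequalities.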
A different methodology is the stochastic approximation (SA) approach, where samples are obtained in an online fashion: the decision maker chooses one deterministic algorithm to solve the expected value formulation, and draws a fresh random variable $\xi_n\sim\mathsf{P}$ whenever needed. The mechanism that draws a fresh sample from $\mathsf{P}$ is usually named a stochastic oracle (SO), whose report $F(x,\xi_n)$ generates a stochastic error $F(x,\xi_n)-T(x)$.
Until very recently, the SA approach has only been used for the expected value formulation under very restrictive assumptions. To the best of our knowledge, the first formulation of an SA approach for a stochastic VI problem was made by [16], under the assumption of strong monotonicity and continuity of the operator $T$. There, a proximal point algorithm of the form
(1.3) $\qquad X_{n+1}=\Pi_X\big(X_n-\gamma_n F(X_n,\xi_n)\big)$
is considered, where $\Pi_X$ denotes the Euclidean projection onto $X$, $\xi_n$ is a sample of $\xi$, and $(\gamma_n)_{n\in\mathbb{N}}$ is a sequence of positive step sizes. Almost sure convergence of the iterates is proven for small step sizes, assuming $T$ is Lipschitz continuous and strongly monotone, and the stochastic error is uniformly bounded. Relaxing strong monotonicity to plain monotonicity, the recent paper [37] incorporated a Tikhonov regularization scheme into the stochastic approximation algorithm (1.3) and proved almost sure convergence of the generated stochastic process. The only established method guaranteeing almost sure convergence under the significantly weaker assumption of pseudo-monotonicity of the mean operator $T$ is the extragradient approach of [15]. The original Korpelevich extragradient scheme of [21] consists of two projection steps using two evaluations of the deterministic map $T$ at generated test points. Extending this to the stochastic oracle case, we arrive at the stochastic extra-gradient (SEG) method
(SEG) $\qquad Y_n=\Pi_X(X_n-\gamma_n A_n), \qquad X_{n+1}=\Pi_X(X_n-\gamma_n B_n),$
where $A_n$ and $B_n$ are stochastic estimators of $T(X_n)$ and $T(Y_n)$, respectively. The paper [15] constructs these estimators by relying on a dynamic sampling strategy, where noise reduction of the estimators is achieved via a mini-batch sampling of the stochastic operators $F(X_n,\cdot)$ and $F(Y_n,\cdot)$. Within this mini-batch formulation, almost sure convergence of the stochastic process to the solution set can be proven even with constant step size implementations of SEG. On top of that, optimal convergence rates in terms of the mean squared residual of the iterates are obtained.
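The SEG iteration can be sketched on a small synthetic problem. All problem data below are illustrative assumptions: a monotone linear map $T(x)=Ax$ with a mildly rotational matrix $A$ (unique solution $x^*=0$), the box $X=[-1,1]^2$, and mini-batch oracles whose noise variance shrinks like the reciprocal of a growing batch size:

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the box X = [lo, hi]^d."""
    return np.clip(x, lo, hi)

# Hypothetical monotone operator T(x) = A x with solution x* = 0.
A = np.array([[0.1, 1.0], [-1.0, 0.1]])

def oracle(x, batch, rng):
    """Mini-batch estimator of T(x); noise std shrinks like 1/sqrt(batch)."""
    return A @ x + rng.normal(size=2) / np.sqrt(batch)

def seg_step(x, gamma, batch, rng):
    An = oracle(x, batch, rng)           # first oracle query, at X_n
    y = project_box(x - gamma * An)      # first projection: Y_n
    Bn = oracle(y, batch, rng)           # second oracle query, at Y_n
    return project_box(x - gamma * Bn)   # second projection: X_{n+1}

rng = np.random.default_rng(1)
x = np.array([0.9, -0.7])
for n in range(1, 2001):
    x = seg_step(x, gamma=0.1, batch=n, rng=rng)   # growing batch sizes
```

Note the two projections per iteration; this is the per-iteration cost that the forward-backward-forward scheme studied in this paper reduces.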
1.1. Our Contribution
We briefly summarize the main contributions of this work. The most costly part of SEG consists of the two separate projection steps performed at each single iteration of the method. We show in this paper that a stochastic version of Tseng's forward-backward-forward algorithm [35], which we call the stochastic forward-backward-forward (SFBF) algorithm, preserves the strong trajectory-based convergence results, while the saving of one projection step allows us to beat SEG significantly in terms of computational overheads and runtimes. In terms of convergence properties, the SFBF algorithm developed in this paper has the same good properties as SEG. However, SFBF is potentially more efficient than SEG in each iteration since it relies only on a single Euclidean projection step. The price to pay for this is that we obtain an infeasible method (as is typical for primal-dual schemes), with, on the positive side, a lower computational complexity count. Additionally, the theoretically allowed range for step sizes is larger by a constant factor than the theoretically allowed largest step size in SEG. This constant factor gain results in significant improvements in terms of convergence speed. This will be illustrated with the extensive numerical evidence reported in Section 6.
2. Preliminaries
2.1. Notation
For $x,y\in\mathbb{R}^d$, we denote by $\langle x,y\rangle$ the standard inner product, and by $\|x\|=\sqrt{\langle x,x\rangle}$ the corresponding norm. For a nonempty, closed and convex set $X\subseteq\mathbb{R}^d$, the Euclidean projector is defined as $\Pi_X(x):=\operatorname{argmin}_{y\in X}\|y-x\|$ for $x\in\mathbb{R}^d$. All random elements are defined on a given probability space $(\Omega,\mathcal{F},\mathbb{P})$. An $\mathbb{R}^d$-valued random variable is a $\mathcal{F}$-measurable mapping $X:\Omega\to\mathbb{R}^d$. For every $p\ge 1$, define $L^p(\Omega,\mathcal{F},\mathbb{P};\mathbb{R}^d)$ as the equivalence class of random variables $X$ with $\mathbb{E}\|X\|^p<\infty$, equipped with the norm $\|X\|_{L^p}:=(\mathbb{E}\|X\|^p)^{1/p}$. If $\mathcal{G}\subseteq\mathcal{F}$ is a sub-sigma-algebra, the conditional expectation of the random variable $X$ is denoted by $\mathbb{E}[X\,|\,\mathcal{G}]$. For random variables $X_1,\dots,X_n$ we denote the sigma-algebra generated by these random variables by $\sigma(X_1,\dots,X_n)$; this is the smallest sigma-algebra measuring the random variables $X_1,\dots,X_n$. Let $(\Omega,\mathcal{F},(\mathcal{F}_n)_{n\in\mathbb{N}},\mathbb{P})$ be a complete stochastic basis. We denote by the set of random sequences such that for each , . For , we set
The following properties of the Euclidean projection onto a closed and convex set are well known.
Lemma 2.1.
Let $X\subseteq\mathbb{R}^d$ be a nonempty, closed and convex set. Then:
-
(i) for every $x\in\mathbb{R}^d$, $\Pi_X(x)$ is the unique point of $X$ satisfying $\langle x-\Pi_X(x),\,y-\Pi_X(x)\rangle\le 0$ for all $y\in X$;
-
(ii) for all $x\in\mathbb{R}^d$ and $y\in X$, we have $\|\Pi_X(x)-y\|^2\le\|x-y\|^2-\|\Pi_X(x)-x\|^2$;
-
(iii) for all $x,y\in\mathbb{R}^d$, $\|\Pi_X(x)-\Pi_X(y)\|\le\|x-y\|$;
-
(iv) given $T:X\to\mathbb{R}^d$ and $\gamma>0$, the set of solutions of the variational problem $\operatorname{VI}(T,X)$ can be expressed as $\operatorname{SOL}(T,X)=\{x\in X:\,x=\Pi_X(x-\gamma T(x))\}$.
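The variational characterization and nonexpansiveness of the projector are easy to check numerically. The sketch below does so for the Euclidean unit ball (a hypothetical choice of $X$, used here only because its projector has a simple closed form):

```python
import numpy as np

def proj_ball(x, r=1.0):
    """Euclidean projection onto the closed ball {y : ||y|| <= r}."""
    nx = np.linalg.norm(x)
    return x if nx <= r else (r / nx) * x

rng = np.random.default_rng(2)
x, y = rng.normal(size=2) * 3, rng.normal(size=2) * 3
px, py = proj_ball(x), proj_ball(y)

# Variational characterization: <x - Px, z - Px> <= 0 for all z in X.
z = proj_ball(rng.normal(size=2))          # an arbitrary point of X
assert np.dot(x - px, z - px) <= 1e-12

# Nonexpansiveness: ||Px - Py|| <= ||x - y||.
assert np.linalg.norm(px - py) <= np.linalg.norm(x - y) + 1e-12
```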
Remark 2.1.
In the literature on variational inequalities, there exists an alternative solution concept known as weak, or Minty, solutions. In this paper we are only interested in strong, or Stampacchia, solutions of $\operatorname{VI}(T,X)$, defined by inequality (1.1).
Another useful fact we use in this paper is the following elementary identity.
Lemma 2.2 (Pythagorean identity).
For all $a,b,c\in\mathbb{R}^d$ we have
$\qquad 2\langle a-b,\,a-c\rangle=\|a-b\|^2+\|a-c\|^2-\|b-c\|^2.$
2.2. Probabilistic Tools
We recall the Minkowski inequality: for given functions $f,g\in L^p(\Omega,\mathcal{F},\mathbb{P};\mathbb{R}^d)$ with $p\ge 1$, we have
(2.1) $\qquad \|f+g\|_{L^p}\le\|f\|_{L^p}+\|g\|_{L^p}.$
For the convergence analysis we will make use of the following classical lemma (see e.g. [26, Lemma 11, page 50]).
Lemma 2.3 (Robbins-Siegmund).
Let $(\Omega,\mathcal{F},(\mathcal{F}_n)_{n\in\mathbb{N}},\mathbb{P})$ be a discrete stochastic basis. Let $(\theta_n)_n$, $(u_n)_n$, $(\beta_n)_n$ and $(\varepsilon_n)_n$ be nonnegative $(\mathcal{F}_n)$-adapted processes with $\sum_n\beta_n<\infty$ and $\sum_n\varepsilon_n<\infty$ almost surely, such that for all $n$
$\qquad \mathbb{E}[\theta_{n+1}\,|\,\mathcal{F}_n]\le(1+\beta_n)\theta_n-u_n+\varepsilon_n \quad \text{a.s.}$
Then $(\theta_n)_n$ converges a.s. to a random variable $\theta_\infty$, and $\sum_n u_n<\infty$ a.s.
Finally, we need the celebrated Burkholder-Davis-Gundy inequality (see e.g. [33]).
Lemma 2.4.
Let $(\Omega,\mathcal{F},(\mathcal{F}_n)_{n\in\mathbb{N}},\mathbb{P})$ be a discrete stochastic basis and $(M_n)_{n\in\mathbb{N}}$ a vector-valued martingale relative to this basis with $M_0=0$. Then, for all $p\ge 1$, there exists a universal constant $c_p>0$ such that for every $N\in\mathbb{N}$
$\qquad \mathbb{E}\Big[\sup_{0\le n\le N}\|M_n\|^p\Big]\le c_p\,\mathbb{E}\Big[\Big(\sum_{n=1}^{N}\|M_n-M_{n-1}\|^2\Big)^{p/2}\Big].$
When combined with the Minkowski inequality, we obtain for all $p\ge 2$ a constant $C_p>0$ such that for every $N\in\mathbb{N}$
$\qquad \Big\|\sup_{0\le n\le N}\|M_n\|\Big\|_{L^p}\le C_p\Big(\sum_{n=1}^{N}\|M_n-M_{n-1}\|_{L^p}^2\Big)^{1/2}.$
3. The stochastic forward-backward-forward algorithm
In this paper we study a forward-backward-forward algorithm of Tseng type under weak monotonicity assumptions. The blanket hypotheses we consider throughout our analysis are summarized here:
Assumption 1 (Consistency).
The solution set $\operatorname{SOL}(T,X)$ is nonempty.
Assumption 2 (Stochastic Model).
The set $X\subseteq\mathbb{R}^d$ is nonempty, closed and convex, $(\Xi,\mathcal{E})$ is a measurable space and $F:X\times\Xi\to\mathbb{R}^d$ is a Carathéodory map.111The mapping $x\mapsto F(x,z)$ is continuous for a.e. $z\in\Xi$, and $z\mapsto F(x,z)$ is measurable for all $x\in X$; $\xi$ is a random variable with values in $\Xi$, defined on a probability space $(\Omega,\mathcal{F},\mathbb{P})$.
Assumption 3 (Lipschitz continuity).
The averaged operator $T(x)=\mathbb{E}[F(x,\xi)]$ is Lipschitz continuous with modulus $L>0$.
Assumption 4 (Pseudo-Monotonicity).
The averaged operator $T$ is pseudo-monotone on $X$, which means
$\qquad \langle T(x),\,y-x\rangle\ge 0\ \Longrightarrow\ \langle T(y),\,y-x\rangle\ge 0 \quad \text{for all } x,y\in X.$
At each iteration, the decision maker has access to a stochastic oracle, reporting an approximation of $T(x)$ of the form
(3.1) $\qquad \hat{T}_n(x):=\frac{1}{m_n}\sum_{i=1}^{m_n}F(x,\xi_i^{(n)}).$
The sequence $(m_n)_{n\in\mathbb{N}}$ determines the batch size of the stochastic oracle. The random sequence $(\xi_i^{(n)})_i$ is an i.i.d. draw from $\mathsf{P}$. Approximations of the form (3.1) are very common in Monte-Carlo simulation approaches, machine learning and computational statistics (see e.g. [1, 2], and references therein); they are easy to obtain in case we are able to sample from the measure $\mathsf{P}$. The forward-backward-forward algorithm requires two queries from the stochastic oracle in which mini-batch estimators of the averaged map $T$ are revealed. This dynamic sampling strategy requires a sequence of integers $(m_n)_{n\in\mathbb{N}}$ (the batch size) determining the size of the data set to be processed at each iteration. The random sample on each mini-batch consists of two independent stochastic processes $(\xi_i^{(n)})_i$ and $(\eta_i^{(n)})_i$ drawn from the law $\mathsf{P}$, and explicitly given by
$\qquad A_n:=\frac{1}{m_n}\sum_{i=1}^{m_n}F(X_n,\xi_i^{(n)}), \qquad B_n:=\frac{1}{m_n}\sum_{i=1}^{m_n}F(Y_n,\eta_i^{(n)}).$
Given the current position $X_n$, Algorithm SFBF queries the SO once, to obtain the estimator $A_n$, and then constructs the random variable $Y_n=\Pi_X(X_n-\gamma_n A_n)$. Next, a second query to the SO is made to obtain the estimator $B_n$, followed by the update $X_{n+1}=Y_n-\gamma_n(B_n-A_n)$. The pseudocode for SFBF is given in Algorithm 1.
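The two oracle queries per iteration, with a single projection, can be sketched on a hypothetical toy problem (a monotone linear map $T(x)=Ax$ on the box $[-1,1]^2$, with solution $x^*=0$; all data are illustrative assumptions, not the paper's setting):

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    return np.clip(x, lo, hi)

A = np.array([[0.1, 1.0], [-1.0, 0.1]])   # toy monotone map T(x) = A x

def oracle(x, batch, rng):
    """Mini-batch estimator of T(x); noise std shrinks like 1/sqrt(batch)."""
    return A @ x + rng.normal(size=2) / np.sqrt(batch)

def sfbf_step(x, gamma, batch, rng):
    An = oracle(x, batch, rng)       # first query: estimator A_n of T(X_n)
    y = project_box(x - gamma * An)  # single projection step: Y_n
    Bn = oracle(y, batch, rng)       # second query: estimator B_n of T(Y_n)
    return y - gamma * (Bn - An)     # correction step, no projection

rng = np.random.default_rng(3)
x = np.array([0.9, -0.7])
for n in range(1, 2001):
    x = sfbf_step(x, gamma=0.1, batch=n, rng=rng)   # growing batch sizes
```

The returned iterate may leave the box (the method is infeasible), while the intermediate point `y` is always feasible.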
Observe that Algorithm SFBF is an infeasible method: the iterates $X_n$ are not necessarily elements of the admissible set $X$, but the process $(Y_n)_n$ is feasible by construction. In the stochastic optimization case, i.e. for instances where $F(x,\xi)$ is an unbiased estimator of the gradient of a real-valued function, the process $(Y_n)_n$ is seen to be a projected gradient step, where $A_n$ acts as an unbiased estimator for the stochastic gradient. This gradient step is used in an extrapolation step to generate the iterate $X_{n+1}$. We just mention that related popular primal-dual splitting schemes like ADMM [3, 5] are infeasible by nature as well.

Assumption 5 (Step-size choice).
The step-size sequence in Algorithm SFBF satisfies
For $n\in\mathbb{N}$, we introduce the approximation errors
(3.2) $\qquad W_n:=A_n-T(X_n), \qquad Z_n:=B_n-T(Y_n),$
and the sub-sigma-algebras $(\mathcal{F}_n)_n$ and $(\hat{\mathcal{F}}_n)_n$, defined by $\mathcal{F}_0:=\sigma(X_0)$, and
$\qquad \mathcal{F}_n:=\sigma\big(X_0,\xi^{(0)},\eta^{(0)},\dots,\xi^{(n-1)},\eta^{(n-1)}\big)$
and
$\qquad \hat{\mathcal{F}}_n:=\sigma\big(\mathcal{F}_n\cup\sigma(\xi^{(n)})\big),$
respectively. Observe that $\mathcal{F}_n\subseteq\hat{\mathcal{F}}_n\subseteq\mathcal{F}_{n+1}$ for all $n$. We also define the filtrations $\mathbb{F}:=(\mathcal{F}_n)_n$ and $\hat{\mathbb{F}}:=(\hat{\mathcal{F}}_n)_n$. The introduction of these two different sub-sigma-algebras is important for many reasons. First, observe that they embody the information the learner has about the optimization problem. Indeed, the sub-sigma-algebra $\mathcal{F}_n$ corresponds to the information the decision maker has at the beginning of the $n$-th iteration, whereas $\hat{\mathcal{F}}_n$ is the information the decision maker has after the first (projection) step of the iteration. Therefore, $X_n$ is measurable with respect to the sub-sigma-algebra $\mathcal{F}_n$ and $Y_n$ is measurable with respect to the sub-sigma-algebra $\hat{\mathcal{F}}_n$. Second, we see that the process $(X_n)_n$ is $\mathbb{F}$-adapted, whereas the process $(Y_n)_n$ is $\hat{\mathbb{F}}$-adapted, and unbiased approximations relative to the respective information structures are provided:
$\qquad \mathbb{E}[W_n\,|\,\mathcal{F}_n]=0 \quad\text{and}\quad \mathbb{E}[Z_n\,|\,\hat{\mathcal{F}}_n]=0.$
Assumption 6 (Batch Size).
The batch size sequence $(m_n)_{n\in\mathbb{N}}$ satisfies $\sum_{n=1}^{\infty}\frac{1}{m_n}<\infty$.

A sufficient condition on the sequence $(m_n)_n$ is that for some constant $c>0$ and integer $n_0\ge 1$, we have
(3.3) $\qquad m_n\ge c\,(n+n_0)^{a}\big(\ln(n+n_0)\big)^{1+b}$
for $a>1$ and $b\ge-1$, or $a=1$ and $b>0$. The next assumption is essentially the same as the variance control assumption in [15].
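A schedule of this polynomial-logarithmic type can be generated as follows; the constants `c=1`, `n0=2`, `a=1`, `b=1` are illustrative choices satisfying the second branch of the condition, and the summability of the reciprocals can be checked numerically:

```python
import math

def batch_size(n, c=1.0, n0=2, a=1.0, b=1.0):
    """Batch size m_n = ceil(c * (n + n0)^a * ln(n + n0)^(1 + b))."""
    return math.ceil(c * (n + n0) ** a * math.log(n + n0) ** (1 + b))

# With a = 1 and b = 1 the series sum 1/m_n converges, so its partial
# sums stay bounded:
partial = sum(1.0 / batch_size(n) for n in range(1, 100_000))
```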
Assumption 7 (Variance Control).
For all and , let
There exist , and a measurable locally bounded function such that for all and all
(3.4) |
Before we proceed with the convergence analysis, we want to make some clarifying remarks on this assumption. The most frequently used assumption on the SO's approximation error, which dates back to the seminal work of Robbins and Monro (see [22, 9] for a textbook reference), asks for uniformly bounded variance (UBV), i.e.
(UBV) $\qquad \sup_{x\in X}\mathbb{E}\big[\|F(x,\xi)-T(x)\|^2\big]\le\sigma^2.$
UBV is covered by Assumption 7 when $p=2$ and the variance bound is a constant independent of the position. UBV is for instance valid when additive noise with finite $p$-th moment is assumed, that is, for some random variable $\varepsilon$ with $\mathbb{E}\|\varepsilon\|^p<\infty$ we have $F(x,\xi)=T(x)+\varepsilon$. However, assuming a global variance bound is not realistic in cases where the variance of the stochastic oracle depends on the position (see e.g. Example 1 in [17]). Assumption 7 is much weaker than UBV, as it exploits the local variance of the stochastic oracle rather than, potentially hard to estimate, global mean square variance bounds. The recent papers [15, 17] make similar assumptions on the variance of the stochastic oracle. It is shown there that Assumption 7 is most natural in cases where the feasible set is unbounded, and it is always satisfied when the Carathéodory functions are random Lipschitz (see Example 3.1 below). Since Algorithm 1 is an infeasible method, we are forced to analyze the behavior of the stochastic process on an unbounded domain, which makes Assumption 7 the only realistic and convenient choice for us. Example 3.1 illustrates an important instance where Assumption 7 holds.
Example 3.1.
Assume for the Carathéodory mapping $F$ that there exists a random variable $L(\xi)$ with $\mathbb{E}[L(\xi)^p]<\infty$ and
$\qquad \|F(x,\xi)-F(y,\xi)\|\le L(\xi)\,\|x-y\| \quad \text{for all } x,y\in X, \text{ a.s.}$
Call $L(\xi)$ the Lipschitz constant of the map $x\mapsto F(x,\xi)$. Then, a repeated application of the Minkowski inequality shows that for all and all we have
Let denote a bound on and set , to get a variance bound as required in Assumption 7.
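The position-dependent oracle noise that motivates this assumption is easy to observe empirically. The sketch below uses a hypothetical random-Lipschitz map $F(x,\xi)=\xi x$ with $\xi\sim N(1,1)$ (so $T(x)=x$, and the oracle's root mean square error scales with $\|x\|$, which rules out a uniform variance bound on an unbounded domain):

```python
import numpy as np

rng = np.random.default_rng(4)

def oracle_std(x, n_samples=100_000):
    """Monte Carlo estimate of E[||F(x, xi) - T(x)||^2]^(1/2)
    for the random-Lipschitz map F(x, xi) = xi * x with xi ~ N(1, 1)."""
    xis = rng.normal(loc=1.0, scale=1.0, size=n_samples)
    # F(x, xi) - T(x) = (xi - 1) * x, so the error norm is |xi - 1| * ||x||.
    errs = np.abs(xis - 1.0) * np.linalg.norm(x)
    return float(np.sqrt(np.mean(errs ** 2)))

s1 = oracle_std(np.array([1.0, 0.0]))    # noise level near the origin
s10 = oracle_std(np.array([10.0, 0.0]))  # noise level farther out
# The oracle noise grows linearly with ||x||: no uniform variance bound,
# but a locally bounded variance function exists.
```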
4. Convergence Analysis
We consider the quadratic residual function defined by
$\qquad r_\gamma(x)^2:=\|x-\Pi_X(x-\gamma T(x))\|^2, \qquad \gamma>0.$
The reader familiar with the literature on finite-dimensional variational inequalities will recognize this immediately as the energy defined by the natural map [11, chapter 10]. It is well known that $r_\gamma$ is a merit function for $\operatorname{VI}(T,X)$: $r_\gamma(x)=0$ if and only if $x\in\operatorname{SOL}(T,X)$. Moreover, $\{r_\gamma\}_{\gamma>0}$ is a family of equivalent merit functions for $\operatorname{VI}(T,X)$, in the sense that $r_{\gamma_1}(x)\le r_{\gamma_2}(x)$ and $\frac{r_{\gamma_2}(x)}{\gamma_2}\le\frac{r_{\gamma_1}(x)}{\gamma_1}$ for all $0<\gamma_1\le\gamma_2$ and all $x$ [11, Proposition 10.3.6]. Denote
(4.1) |
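The residual is straightforward to evaluate; the sketch below does so for a hypothetical toy linear map on a box (illustrative data only, with solution $x^*=0$):

```python
import numpy as np

def project_box(x, lo=-1.0, hi=1.0):
    return np.clip(x, lo, hi)

A = np.array([[0.1, 1.0], [-1.0, 0.1]])   # toy map T(x) = A x, solution x* = 0

def squared_residual(x, gamma=1.0):
    """r_gamma(x)^2 = ||x - Pi_X(x - gamma * T(x))||^2 (natural map energy)."""
    return float(np.linalg.norm(x - project_box(x - gamma * A @ x)) ** 2)

# The residual vanishes exactly at the solution and is positive elsewhere.
```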
We define recursively the process by and, for all ,
so that
(4.2) |
Additionally, we define for all the process given by , and
with corresponding increment
For any reference point we see that for all . Hence, the process is a martingale w.r.t. the filtration . Since , the tower property implies that
(4.3) |
showing that it is also a -martingale. is an increasing process, with increments whose expected value is determined by the variance of the approximation error of the stochastic oracle feedback. In terms of these increment processes, we establish the following fundamental recursion.
Lemma 4.1.
For all $n\in\mathbb{N}$ and all $p\in X$ we have
(4.4) |
Proof.
This recursive relation follows via several simple algebraic steps. Let $n\in\mathbb{N}$ and $p\in X$ be fixed.
Step 1.
Step 2.
Step 3.
Using again the definition of , we see
The first inequality is the Cauchy-Schwarz inequality. The second inequality follows from the $L$-Lipschitz continuity of the averaged operator (Assumption 3), and again the Cauchy-Schwarz inequality. Combining this with the last inequality obtained in Step 2, we see that
Step 4.
By the definition of the squared residual function, the definition of and Lemma 2.1(iii), we have
Hence,
(4.8) |
Step 5.
Combining (4.8) with the last inequality from Step 3 and recalling Assumption 5, we conclude
The definitions of the increments associated with the martingales and give the claimed result.
Remark 4.1.
In the following, we let $p\ge 2$ be the exponent as specified in Assumption 7. Taking conditional expectations in equation (4.4) and using the martingale property (4.3), we see for all $n$ that
(4.9) |
In order to prove convergence of the process , we aim to deduce a stochastic quasi-Fejér relation. For that we need to understand the properties of the conditional expectation
Let be . The monotonicity of