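Given a closed convex set $\mathcal{U} \subseteq \mathbb{R}^d$ and a single-valued monotone operator $F: \mathbb{R}^d \to \mathbb{R}^d$, i.e., an operator that maps each vector to another vector and satisfies:
$$\langle F(u) - F(v), u - v \rangle \geq 0, \quad \forall u, v \in \mathbb{R}^d, \tag{1.1}$$
the monotone inclusion problem consists in finding a point $u^*$ that satisfies:
$$0 \in F(u^*) + \partial I_{\mathcal{U}}(u^*), \tag{MI}$$
where $I_{\mathcal{U}}$ is the indicator function of the set $\mathcal{U}$ and $\partial I_{\mathcal{U}}(\cdot)$ denotes the subdifferential operator (the set of all subgradients at the argument point) of $I_{\mathcal{U}}$.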
Monotone inclusion is a fundamental problem in continuous optimization that is closely related to variational inequalities (VIs) with monotone operators, which model a plethora of problems in mathematical programming, game theory, engineering, and finance (Facchinei and Pang, 2003, Section 1.4). Within machine learning, VIs with monotone operators and associated monotone inclusion problems arise, for example, as an abstraction of convex-concave min-max optimization problems, which naturally model adversarial training (Madry et al., 2018; Arjovsky et al., 2017; Arjovsky and Bottou, 2017; Goodfellow et al., 2014).
When it comes to convex-concave min-max optimization, approximating the associated VI leads to guarantees in terms of the optimality gap. Such guarantees are generally possible only when the feasible set is bounded; a simple example that demonstrates this fact is $\Phi(x, y) = \langle x, y \rangle$ with the feasible set $x, y \in \mathbb{R}^d$. The only (min-max or saddle-point) solution in this case is obtained when both $x$ and $y$ are the all-zeros vectors. However, if either $x \neq 0$ or $y \neq 0$, then the optimality gap is infinite.
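To see why the gap is infinite away from the origin, here is a short worked computation. It assumes the bilinear objective $\Phi(x, y) = \langle x, y \rangle$ on $\mathbb{R}^d \times \mathbb{R}^d$ and the usual definition of the optimality gap as the best response value of the maximizing player minus that of the minimizing player; the symbols $\bar{x}$, $\bar{y}$, and $\mathrm{Gap}$ are notation introduced only for this illustration.

```latex
\mathrm{Gap}(\bar{x}, \bar{y})
  = \sup_{y \in \mathbb{R}^d} \langle \bar{x}, y \rangle
    - \inf_{x \in \mathbb{R}^d} \langle x, \bar{y} \rangle
  = \begin{cases}
      0, & \bar{x} = \bar{y} = 0, \\
      +\infty, & \text{otherwise}.
    \end{cases}
% If \bar{x} \neq 0, take y = t \bar{x}: \langle \bar{x}, y \rangle = t \|\bar{x}\|^2 \to \infty as t \to \infty.
% If \bar{y} \neq 0, take x = -t \bar{y}: \langle x, \bar{y} \rangle = -t \|\bar{y}\|^2 \to -\infty as t \to \infty.
```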
On the other hand, approximate monotone inclusion is well-defined even for unbounded feasible sets. In the context of min-max optimization, it corresponds to guarantees in terms of stationarity. Specifically, in the unconstrained setting, solving monotone inclusion corresponds to minimizing the norm of the gradient of the min-max objective. Note that even in the special setting of convex optimization, convergence in norm of the gradient is much less understood than convergence in optimality gap (Nesterov, 2012; Kim and Fessler, 2018). Further, unlike classical results for VIs that provide convergence guarantees for approximating weak solutions (Nemirovski, 2004; Nesterov, 2007), approximations to monotone inclusion lead to approximations to strong solutions (see Section 1.2 for definitions of weak and strong solutions and their relationship to monotone inclusion).
We leverage the connections between nonexpansive maps, structured monotone operators, and proximal maps to obtain near-optimal algorithms for solving monotone inclusion over different classes of problems with Lipschitz-continuous operators. In particular, we make use of the classical Halpern iteration, which is defined by (Halpern, 1967):
$$u_{k+1} = \lambda_{k+1} u_0 + (1 - \lambda_{k+1}) T(u_k), \tag{Hal}$$
where $T: \mathbb{R}^d \to \mathbb{R}^d$ is a nonexpansive map, i.e., $\|T(u) - T(v)\| \leq \|u - v\|$ for all $u, v \in \mathbb{R}^d$.
In addition to its simplicity, Halpern iteration is particularly relevant to machine learning applications, as it is an implicitly regularized method with the following property: if the set of fixed points of $T$ is non-empty, then Halpern iteration (Hal) started at a point $u_0$ and applied with any choice of step sizes $\{\lambda_k\}_{k \geq 1}$ that satisfy all of the following conditions:
$$\text{(i)} \ \lim_{k \to \infty} \lambda_k = 0, \qquad \text{(ii)} \ \sum_{k=1}^{\infty} \lambda_k = \infty, \qquad \text{(iii)} \ \sum_{k=1}^{\infty} |\lambda_{k+1} - \lambda_k| < \infty,$$
converges to the fixed point of $T$ with the minimum distance to $u_0$. This result was proved by Wittmann (1992), who extended a similar though less general result previously obtained by Browder (1967). The result of Wittmann (1992) has since been extended to various other settings (Bauschke, 1996; Xu, 2002; Körnlein, 2015; Lieder, 2017, and references therein).
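As an illustration, the following minimal sketch runs a Halpern-type iteration of the form $u_{k+1} = \lambda_{k+1} u_0 + (1 - \lambda_{k+1}) T(u_k)$ on a toy problem. The step size $\lambda_k = 1/(k+1)$ is an assumed choice (it tends to zero, is not summable, and has summable successive differences), and the map `rotate90` and the helper name `halpern` are hypothetical names chosen for this example only.

```python
import math

def halpern(T, u0, num_iters):
    """Halpern iteration u_{k+1} = lam_{k+1}*u0 + (1 - lam_{k+1})*T(u_k), lam_k = 1/(k+1).

    Returns the last iterate and the residuals ||u_k - T(u_k)|| for k = 0..num_iters.
    """
    def residual(u):
        return math.dist(u, T(u))

    u = list(u0)
    residuals = [residual(u)]
    for k in range(num_iters):
        lam = 1.0 / (k + 2)  # lam_{k+1} = 1 / ((k + 1) + 1)
        Tu = T(u)
        # Convex combination of the anchor point u0 and the image T(u_k).
        u = [lam * a + (1.0 - lam) * b for a, b in zip(u0, Tu)]
        residuals.append(residual(u))
    return u, residuals

def rotate90(u):
    """Illustrative nonexpansive map: rotation by 90 degrees (unique fixed point: the origin)."""
    return [-u[1], u[0]]

u0 = [1.0, 0.0]
u_last, residuals = halpern(rotate90, u0, 1000)
print(residuals[1], residuals[10], residuals[1000])
```

A rotation is a useful test case because the plain fixed-point iteration $u_{k+1} = T(u_k)$ cycles forever on it, whereas the anchored iteration above drives the residual to zero.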
1.1 Contributions and Related Work
A special case of what is now known as the Halpern iteration (Hal) was introduced and its asymptotic convergence properties were analyzed by Halpern (1967) in the setting of and where is the unit Euclidean ball. Using the proof-theoretic techniques of Kohlenbach (2008), Leustean (2007) extracted from the asymptotic convergence and implicit regularization result of Wittmann (1992) the rate at which Halpern iteration converges to a fixed point. The results obtained by Leustean (2007) are rather loose and provide guarantees of the form in the best case (obtained for ), where
More recently, Lieder (2017) proved that under the standard assumption that $T$ has a fixed point and for the step size $\lambda_k = \frac{1}{k+1}$, Halpern iteration converges to a fixed point as $\|T(u_k) - u_k\| = O(1/k)$. A similar result but for an alternative algorithm was recently obtained by Kim (2019). Unlike Halpern iteration, the algorithm introduced by Kim (2019) is not known to possess the implicit regularization property discussed earlier in this paper.
The results of Lieder (2017) and Kim (2019) can be used to obtain the same convergence rate for monotone inclusion with a cocoercive operator $F$, but only if the cocoercivity parameter is known, which is rarely the case in practice. Similarly, those results can also be extended to more general monotone Lipschitz operators, but only if the proximal map (or resolvent) of $F$ can be computed exactly, an assumption that can rarely be met (see Section 1.2 for definitions of cocoercive operators and proximal maps). We also note that the results of Lieder (2017) and Kim (2019) were obtained using the performance estimation (PEP) framework of Drori and Teboulle (2014). The convergence proofs resulting from the use of PEP are computer-assisted: they are generated as solutions to large semidefinite programs, which makes them hard to interpret and generalize.
Our approach is arguably simpler, as it relies on the use of a potential function, which allows us to remove the assumptions about the knowledge of the problem parameters and availability of exact proximal maps. Our main contributions are summarized as follows:
Results for cocoercive operators.
We introduce a new potential-based proof of convergence of Halpern iteration that applies to more general step sizes than handled by the analysis of Lieder (2017) (Section 2). The proof is simple and only requires elementary algebra. Further, the proof is derived for cocoercive operators and leads to a parameter-free algorithm for monotone inclusion. We also extend this parameter-free method to the constrained setting using the concept of gradient mapping generalized to monotone operators (Section 2.1). To the best of our knowledge, this is the first work to obtain the $1/k$ convergence rate with a parameter-free method.
Results for monotone Lipschitz operators.
Up to a logarithmic factor, we obtain the same convergence rate for the parameter-free setting of the more general monotone Lipschitz operators (Section 2.2). The best known convergence rate established by previous work for the same setting was of the order $1/\sqrt{k}$ (Dang and Lan, 2015; Ryu et al., 2019). We obtain the improved convergence rate through the use of the Halpern iteration with inexact proximal maps that can be implemented efficiently. The idea of coupling inexact proximal maps with another method is similar in spirit to the Catalyst framework (Lin et al., 2017) and other instantiations of the inexact proximal-point method, such as in the work of Davis and Drusvyatskiy (2019); Asi and Duchi (2019); Lin et al. (2018). However, we note that, unlike in the previous work, the coupling used here is with a method (Halpern iteration) whose convergence properties were not well-understood and for which no simple potential-based convergence proof existed prior to our work.
Results for strongly monotone Lipschitz operators.
We show that a simple restarting-based approach applied to our method for operators that are only monotone and Lipschitz (described above) leads to a parameter-free method for strongly monotone and Lipschitz operators (Section 2.3). Under mild assumptions about the problem parameters and up to a poly-logarithmic factor, the resulting algorithm is iteration-complexity-optimal. To the best of our knowledge, this is the first near-optimal parameter-free method for the setting of strongly monotone Lipschitz operators and any of the associated problems – monotone inclusion, VIs, or convex-concave min-max optimization.
To certify near-optimality of the analyzed methods, we provide lower bounds that rely on algorithmic reductions between different problem classes and highlight connections between them (Section 3). The lower bounds are derived by leveraging the recent lower bound of Ouyang and Xu (2019) for approximating the optimality gap in convex-concave min-max optimization.
1.2 Notation and Preliminaries
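Let $(E, \|\cdot\|)$ be a real $d$-dimensional Hilbert space, with norm $\|\cdot\| = \sqrt{\langle \cdot, \cdot \rangle}$, where $\langle \cdot, \cdot \rangle$ denotes the inner product. In particular, one may consider the Euclidean space $(\mathbb{R}^d, \|\cdot\|_2)$. Definitions that were already introduced at the beginning of the paper easily generalize from $(\mathbb{R}^d, \|\cdot\|_2)$ to $(E, \|\cdot\|)$, and are not repeated here for space considerations.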
Variational Inequalities and Monotone Operators.
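Let $\mathcal{U} \subseteq E$ be closed and convex, and let $F: E \to E$ be an $L$-Lipschitz-continuous operator defined on $\mathcal{U}$. Namely, we assume that:
$$\|F(u) - F(v)\| \leq L \|u - v\|, \quad \forall u, v \in \mathcal{U}.$$
The definition of monotonicity was already provided in Eq. (1.1), and easily specializes to monotonicity on the set $\mathcal{U}$ by restricting $u, v$ to be from $\mathcal{U}$. Further, $F$ is said to be: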
strongly monotone (or coercive) on with parameter , if:
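cocoercive on $\mathcal{U}$ with parameter $\gamma$, if:
$$\langle F(u) - F(v), u - v \rangle \geq \gamma \|F(u) - F(v)\|^2, \quad \forall u, v \in \mathcal{U}. \tag{1.5}$$
It is immediate from the definition of cocoercivity that every $\gamma$-cocoercive operator is monotone and $\frac{1}{\gamma}$-Lipschitz. The latter follows by applying the Cauchy-Schwarz inequality to the left-hand side of Eq. (1.5) and then dividing both sides by $\gamma \|F(u) - F(v)\|$.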
Examples of monotone operators include the gradient of a convex function and appropriately modified gradient of a convex-concave function. Namely, if a function $\Phi(x, y)$ is convex in $x$ and concave in $y$, then $F([x; y]) = [\nabla_x \Phi(x, y); -\nabla_y \Phi(x, y)]$ is monotone.
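The monotonicity of the modified gradient of a convex-concave function can be checked numerically on a small instance. The sketch below assumes the toy function $\Phi(x, y) = \frac{1}{2}x^2 + xy - \frac{1}{2}y^2$ on $\mathbb{R} \times \mathbb{R}$ (convex in $x$, concave in $y$); the function and helper names are hypothetical and serve only this illustration.

```python
import random

def saddle_operator(z):
    """F(x, y) = (dPhi/dx, -dPhi/dy) for Phi(x, y) = 0.5*x**2 + x*y - 0.5*y**2."""
    x, y = z
    grad_x = x + y  # dPhi/dx
    grad_y = x - y  # dPhi/dy
    return (grad_x, -grad_y)

def monotonicity_gap(F, u, v):
    """Return <F(u) - F(v), u - v>; nonnegative for every pair when F is monotone."""
    Fu, Fv = F(u), F(v)
    return sum((a - b) * (p - q) for a, b, p, q in zip(Fu, Fv, u, v))

rng = random.Random(0)
gaps = []
for _ in range(1000):
    u = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    v = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    gaps.append(monotonicity_gap(saddle_operator, u, v))
print(min(gaps) >= 0.0)
```

For this particular $\Phi$ the inner product equals $(x - x')^2 + (y - y')^2$, so the operator is in fact strongly monotone; for the purely bilinear $\Phi(x, y) = xy$ the same quantity is identically zero.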
The Stampacchia Variational Inequality (SVI) problem consists in finding $u^* \in \mathcal{U}$ such that:
$$\langle F(u^*), u - u^* \rangle \geq 0, \quad \forall u \in \mathcal{U}. \tag{SVI}$$
In this case, $u^*$ is also referred to as a strong solution to the variational inequality (VI) corresponding to $F$ and $\mathcal{U}$. The Minty Variational Inequality (MVI) problem consists in finding $u^* \in \mathcal{U}$ such that:
$$\langle F(u), u^* - u \rangle \leq 0, \quad \forall u \in \mathcal{U}, \tag{MVI}$$
in which case $u^*$ is referred to as a weak solution to the variational inequality corresponding to $F$ and $\mathcal{U}$. In general, if $F$ is continuous, then the solutions to (MVI) are a subset of the solutions to (SVI). If we assume that $F$ is monotone, then (1.1) implies that every solution to (SVI) is also a solution to (MVI), and thus the two solution sets are equivalent. The solution set to monotone inclusion is the same as the solution set to (SVI).
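Similarly, $\epsilon$-approximate monotone inclusion can be defined as finding $u^*_\epsilon$ that satisfies:
$$0 \in F(u^*_\epsilon) + \partial I_{\mathcal{U}}(u^*_\epsilon) + \mathcal{B}(\epsilon), \tag{1.6}$$
where $\mathcal{B}(\epsilon)$ denotes the ball w.r.t. $\|\cdot\|$ of radius $\epsilon$ centered at the origin.
Fact 1.1. Given $F$ and $\mathcal{U}$, let $u^*_\epsilon$ satisfy Eq. (1.6). Then:
$$\langle F(u^*_\epsilon), u^*_\epsilon - u \rangle \leq \epsilon, \quad \forall u \in \mathcal{U} \cap \mathcal{B}_{u^*_\epsilon},$$
where $\mathcal{B}_{u^*_\epsilon}$ denotes the unit ball w.r.t. $\|\cdot\|$ centered at $u^*_\epsilon$.
Further, if the diameter of $\mathcal{U}$, $D = \sup_{u, v \in \mathcal{U}} \|u - v\|$, is bounded, then:
$$\langle F(u^*_\epsilon), u^*_\epsilon - u \rangle \leq \epsilon D, \quad \forall u \in \mathcal{U}.$$
Thus, when the diameter $D$ is bounded, any $\epsilon$-approximate solution to monotone inclusion is an $\epsilon D$-approximate solution to (SVI) (and thus also to (MVI)); the converse does not hold in general. Recall that when $\mathcal{U}$ is unbounded, neither (SVI) nor (MVI) can be approximated.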
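The step from approximate monotone inclusion to an approximate variational inequality can be spelled out in a few lines. The derivation below is a sketch under assumed notation: $\bar{u}$ is a point whose inclusion residual has norm at most $\epsilon$, $g$ is a subgradient of the indicator function of $\mathcal{U}$ at $\bar{u}$ (an element of the normal cone, so that $\langle g, u - \bar{u} \rangle \leq 0$ for all $u \in \mathcal{U}$), and $b$ is the residual vector.

```latex
\begin{align*}
0 \in F(\bar{u}) + \partial I_{\mathcal{U}}(\bar{u}) + \mathcal{B}(\epsilon)
  &\;\Longrightarrow\; F(\bar{u}) = -g - b, \quad
     g \in \partial I_{\mathcal{U}}(\bar{u}), \ \|b\| \leq \epsilon, \\
\langle F(\bar{u}), \bar{u} - u \rangle
  &= \langle g, u - \bar{u} \rangle + \langle b, u - \bar{u} \rangle \\
  &\leq 0 + \epsilon \|u - \bar{u}\|, \qquad \forall u \in \mathcal{U}.
\end{align*}
% The right-hand side is at most \epsilon on the unit ball around \bar{u},
% and at most \epsilon D when the diameter D of \mathcal{U} is finite.
```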
We assume throughout the paper that a solution to monotone inclusion (MI) exists. This assumption implies that solutions to both (SVI) and (MVI) exist as well. Existence of solutions follows from standard results and is guaranteed whenever e.g., is compact, or, if there exists a compact set such that maps to itself (Facchinei and Pang, 2003).
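Let $T: E \to E$. We say that $T$ is nonexpansive on $\mathcal{U} \subseteq E$, if $\|T(u) - T(v)\| \leq \|u - v\|$ for all $u, v \in \mathcal{U}$.
Nonexpansive maps are closely related to cocoercive operators, and here we summarize some of the basic properties that are used in our analysis. More information can be found in, e.g., the book by Bauschke and Combettes (2011).
Fact 1.2. $T$ is nonexpansive if and only if $\mathrm{Id} - T$ is $\frac{1}{2}$-cocoercive, where $\mathrm{Id}$ is the identity map.
$T$ is said to be firmly nonexpansive or averaged, if
$$\|T(u) - T(v)\|^2 \leq \langle T(u) - T(v), u - v \rangle, \quad \forall u, v \in E.$$
Useful properties of firmly nonexpansive maps are summarized in the following fact.
Fact 1.3. For any firmly nonexpansive operator $T$, $\mathrm{Id} - T$ is also firmly nonexpansive, and, moreover, both $T$ and $\mathrm{Id} - T$ are 1-cocoercive.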
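The equivalence between nonexpansiveness of a map and $\frac{1}{2}$-cocoercivity of its complement can be verified by expanding a single square. The following worked derivation is a sketch in assumed shorthand: $G = \mathrm{Id} - T$, $\Delta = u - v$, and $\delta = G(u) - G(v)$, so that $T(u) - T(v) = \Delta - \delta$. The last line records the consequence for an operator $F$ and a constant $L > 0$, which is what allows a cocoercive operator to be plugged into a fixed-point iteration.

```latex
\begin{align*}
\|T(u) - T(v)\|^2
  &= \|\Delta - \delta\|^2
   = \|\Delta\|^2 - 2 \langle \delta, \Delta \rangle + \|\delta\|^2, \\
\|T(u) - T(v)\|^2 \leq \|\Delta\|^2
  &\iff \langle \delta, \Delta \rangle \geq \tfrac{1}{2} \|\delta\|^2, \\
\text{with } G = \tfrac{2}{L} F: \quad
\mathrm{Id} - \tfrac{2}{L} F \text{ is nonexpansive}
  &\iff \langle F(u) - F(v), u - v \rangle \geq \tfrac{1}{L} \|F(u) - F(v)\|^2 .
\end{align*}
```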
2 Halpern Iteration for Monotone Inclusion and Variational Inequalities
Halpern iteration is typically stated for nonexpansive maps $T$ as in (Hal). Because our interest is in cocoercive operators $F$ with the unknown parameter $1/L$, we instead work with the following version of the Halpern iteration:
$$u_{k+1} = \lambda_{k+1} u_0 + (1 - \lambda_{k+1}) \Big( u_k - \frac{2}{L_{k+1}} F(u_k) \Big). \tag{H}$$
We start with the assumption that the setting is unconstrained: $\mathcal{U} \equiv E$. We will see in Section 2.1 how the result can be extended to the constrained case. Section 2.2 will consider the case of operators that are monotone and Lipschitz, while Section 2.3 will deal with the strongly monotone and Lipschitz case. Some of the proofs are omitted and are instead provided in Appendix A.
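For intuition, here is a minimal sketch of a Halpern-type iteration driven directly by a cocoercive operator in the unconstrained setting. It makes several simplifying assumptions that are not the paper's algorithm: the constant $L$ is taken as known and fixed (the parameter-free method estimates it instead), the update is assumed to have the form $u_{k+1} = \lambda_{k+1} u_0 + (1 - \lambda_{k+1})(u_k - \frac{2}{L} F(u_k))$, and the step size is $\lambda_k = 1/(k+1)$. The diagonal operator and all function names are hypothetical.

```python
import math

def halpern_cocoercive(F, u0, L, num_iters):
    """Iterate u_{k+1} = lam*u0 + (1 - lam)*(u_k - (2/L)*F(u_k)) with lam = 1/(k+2).

    Assumes F is (1/L)-cocoercive with a known L. Returns the last iterate and
    the operator norms ||F(u_k)|| for k = 0..num_iters.
    """
    u = list(u0)
    norms = [math.hypot(*F(u))]
    for k in range(num_iters):
        lam = 1.0 / (k + 2)
        Fu = F(u)
        u = [lam * a + (1.0 - lam) * (b - (2.0 / L) * g)
             for a, b, g in zip(u0, u, Fu)]
        norms.append(math.hypot(*F(u)))
    return u, norms

def diag_operator(u):
    """Gradient of the convex quadratic 0.25*u0**2 + 1.5*u1**2; (1/3)-cocoercive."""
    return [0.5 * u[0], 3.0 * u[1]]

L = 3.0                  # largest eigenvalue of the diagonal operator
u0 = [1.0, 1.0]          # the unique zero of the operator is the origin
u_last, norms = halpern_cocoercive(diag_operator, u0, L, 2000)
print(norms[0], norms[100], norms[2000])
```

On this example the operator norm decays at least as fast as $L \|u_0 - u^*\| / (k+1)$, which is the $1/k$ behavior known for Halpern iteration with this step size applied to the nonexpansive map $\mathrm{Id} - \frac{2}{L} F$.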
To analyze the convergence of (H) for the appropriate choices of sequences and we make use of the following potential function:
Let us first show that if is non-increasing with for an appropriately chosen sequence of positive numbers then we can deduce a property that, under suitable conditions on and implies a convergence rate for (H).
Using Lemma 2.1, our goal is now to show that we can choose and which in turn would imply the desired convergence rate: The following lemma provides sufficient conditions for , and to ensure that so that Lemma 2.1 applies.
Let be defined as in Eq. (2.1). Let be defined recursively as and for Assume that is chosen so that and for . Finally, assume that and , Then,
Observe first the following. If we knew and set and then all of the conditions from Lemma 2.2 would be satisfied, and Lemma 2.1 would then imply which recovers the result of Lieder (2017). The choice is also the tightest possible that satisfies the conditions of Lemma 2.2 – the inequality relating and is satisfied with equality. This result is in line with the numerical observations made by Lieder (2017), who found that the convergence of Halpern iteration is fastest for .
To construct a parameter-free method, we use that is -cocoercive; namely, that there exists a constant such that satisfies Eq. (1.5) with . The idea is to start with a “guess” of (e.g., ) and double the guess as long as The total number of times that the guess can be doubled is bounded above by Parameter is simply chosen to satisfy the condition from Lemma 2.2. The algorithm pseudocode is stated in Algorithm 1 for a given accuracy specified at the input.
We now prove the first of our main results. Note that the total number of arithmetic operations in Algorithm 1 is of the order of the number of oracle queries to $F$ multiplied by the complexity of evaluating $F$ at a point. The same will be true for all the algorithms stated in this paper, except that the complexity of evaluating $F$ may be replaced by the complexity of projections onto $\mathcal{U}$.
Given and an operator that is -cocoercive on Algorithm 1 returns a point such that after at most oracle queries to .
As is -cocoercive, and the total number of times that the algorithm enters the inner while loop is at most The parameters satisfy the assumptions of Lemmas 2.1 and 2.2, and, thus, Hence, we only need to show that decreases sufficiently fast with As can only be increased in any iteration, we have that
Hence, the total number of outer iterations is at most . Combining with the maximum total number of inner iterations from the beginning of the proof, the result follows. ∎
2.1 Constrained Setups with Cocoercive Operators
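Assume now that $\mathcal{U} \subseteq E$. We will make use of a counterpart to gradient mapping (Nesterov, 2018, Chapter 2) that we refer to as the operator mapping, defined as:
$$G_\eta(u) = \eta \Big( u - \Pi_{\mathcal{U}}\Big( u - \frac{1}{\eta} F(u) \Big) \Big), \tag{2.2}$$
where $\Pi_{\mathcal{U}}$ is the projection operator, namely:
$$\Pi_{\mathcal{U}}(u) = \operatorname*{argmin}_{v \in \mathcal{U}} \frac{1}{2} \|u - v\|^2.$$
Operator mapping generalizes a cocoercive operator to the constrained case: when $\mathcal{U} \equiv E$, $G_\eta \equiv F$.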
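A small sketch may help make the operator mapping concrete. It assumes the mapping has the gradient-mapping form $G_\eta(u) = \eta\,(u - \Pi_{\mathcal{U}}(u - F(u)/\eta))$ and uses a box as the feasible set so that the projection is a coordinate-wise clip; the operator `shifted_identity`, the box bounds, and all function names are hypothetical choices for this example.

```python
def project_box(u, lo, hi):
    """Euclidean projection onto the box [lo_1, hi_1] x ... x [lo_d, hi_d]."""
    return [min(max(x, l), h) for x, l, h in zip(u, lo, hi)]

def operator_mapping(F, project, u, eta):
    """Assumed form G_eta(u) = eta * (u - project(u - F(u) / eta))."""
    Fu = F(u)
    shifted = [x - g / eta for x, g in zip(u, Fu)]
    u_bar = project(shifted)
    return [eta * (x - p) for x, p in zip(u, u_bar)]

def shifted_identity(u):
    """Illustrative operator F(u) = u - c with c = (2, -1); gradient of 0.5*||u - c||^2."""
    return [u[0] - 2.0, u[1] + 1.0]

def box(u):
    return project_box(u, [0.0, 0.0], [1.0, 1.0])

def whole_space(u):
    return list(u)  # projection onto the whole space is the identity

# At the constrained solution (the projection of c onto the box) the mapping vanishes.
G_at_solution = operator_mapping(shifted_identity, box, [1.0, 0.0], eta=2.0)
# Without constraints the mapping reduces to the operator itself.
G_unconstrained = operator_mapping(shifted_identity, whole_space, [0.3, 0.7], eta=2.0)
print(G_at_solution, G_unconstrained)
```

Note that the operator itself does not vanish at the constrained solution (here $F(u^*) = (-1, 1)$), which is why the norm of the operator mapping, rather than the norm of $F$, is the natural stationarity measure in the constrained case.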
It is a well-known fact that the projection operator is firmly-nonexpansive (Bauschke and Combettes, 2011, Proposition 4.16). Thus, Fact 1.3 can be used to show that, if is -cocoercive and then is -cocoercive. This is shown in the following (simple) proposition.
Let be an -cocoercive operator and let be defined as in Eq. (2.2), where Then is -cocoercive.
As is -cocoercive, applying results from the beginning of the section to , it is now immediate that Algorithm 2 (provided for completeness) produces with after at most oracle queries to (as each computation of requires one oracle query to ).
To complete this subsection, it remains to show that is a good surrogate for approximating (MI) (and (SVI)). This is indeed the case and it follows as a suitable generalization of Lemma 3 from Ghadimi and Lan (2016), which is provided here for completeness.
Let be defined as in Eq. (2.2). Denote so that If, for some then
As, by definition, by first-order optimality of we have: Equivalently: The rest of the proof follows simply by using and ∎
Lemma 2.5 implies that when the operator mapping is small in norm then is an approximate solution to (MI) corresponding to on We can now formally bound the number of oracle queries to needed to approximate (MI) and (SVI).
Given and a -cocoercive operator , Algorithm 2 returns such that
, after at most
oracle queries to
after at most
oracle queries to
Further, every point that Algorithm 2 constructs is from the feasible set: , and a simple modification to the algorithm takes at most oracle queries to to construct a point such that .
By the definition of if then for all This follows simply as:
Observe that, due to Line 2 of Algorithm 2, The rest of the proof follows using Lemma 2.5, Fact 1.1, and the same reasoning as in the proof of Theorem 2.3. Observe that if the goal is to only output a point such that , then computing and is not needed, and the algorithm can instead use as the exit condition in the outer while loop. ∎
2.2 Setups with non-Cocoercive Lipschitz Operators
We now consider the case in which $F$ is not cocoercive, but only monotone and $L$-Lipschitz. To obtain the desired convergence result, we make use of the resolvent operator, defined as $J_{F + \partial I_{\mathcal{U}}} = (\mathrm{Id} + F + \partial I_{\mathcal{U}})^{-1}$. A useful property of the resolvent is that it is firmly nonexpansive (Ryu and Boyd, 2016, and references therein), which, due to Fact 1.3, implies that $P = \mathrm{Id} - J_{F + \partial I_{\mathcal{U}}}$ is 1-cocoercive.
Finding a point such that is sufficient for approximating monotone inclusion (and (SVI)). This is shown in the following simple proposition, provided here for completeness.
Let . If , then satisfies
By the definition of and , Equivalently:
As the result follows. ∎
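For intuition about the resolvent-based approach, the sketch below treats one of the rare cases in which the resolvent is available in closed form: an unconstrained linear operator. All specifics are assumptions made for illustration: the operator is $F(u) = Mu$ with the skew-symmetric matrix $M = [[0, 1], [-1, 0]]$ (monotone and 1-Lipschitz but not cocoercive), the resolvent is $(\mathrm{Id} + F)^{-1} = (I + M)^{-1}$, and a Halpern-type iteration with step size $1/(k+1)$ is applied to the resolvent. The function names are hypothetical.

```python
import math

def F(u):
    """Illustrative monotone, non-cocoercive operator F(u) = M u, M = [[0, 1], [-1, 0]]."""
    return [u[1], -u[0]]

def resolvent(u):
    """Exact resolvent (Id + F)^{-1} u = (I + M)^{-1} u for the linear F above."""
    return [0.5 * (u[0] - u[1]), 0.5 * (u[0] + u[1])]

def halpern_resolvent(u0, num_iters):
    """Halpern iteration u_{k+1} = lam*u0 + (1 - lam)*resolvent(u_k), lam = 1/(k+2)."""
    u = list(u0)
    history = [u]
    for k in range(num_iters):
        lam = 1.0 / (k + 2)
        Ju = resolvent(u)
        u = [lam * a + (1.0 - lam) * b for a, b in zip(u0, Ju)]
        history.append(u)
    return history

u0 = [3.0, -4.0]
history = halpern_resolvent(u0, 500)
u_k = history[-1]
u_bar = resolvent(u_k)                   # candidate approximate solution
residual = math.dist(u_k, u_bar)         # norm of (Id - resolvent)(u_k)
operator_norm = math.hypot(*F(u_bar))    # norm of F at the candidate solution
print(residual, operator_norm)
```

In this unconstrained case $F(\bar{u}) = u_k - \bar{u}$ holds exactly, so a small residual of $\mathrm{Id}$ minus the resolvent at $u_k$ directly certifies a small operator norm at $\bar{u}$.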
If we could compute the resolvent exactly, it would suffice to directly apply the result of Lieder (2017). However, excluding very special cases, computing the exact resolvent efficiently is generally not possible. Fortunately, since $F$ is Lipschitz, the resolvent can be approximated efficiently. This is because it corresponds to solving a VI defined on a closed convex set with the operator that is -strongly monotone and -Lipschitz. Thus, it can be computed by solving a strongly monotone and Lipschitz VI, for which one can use the results of, e.g., Nesterov and Scrimali (2011); Mokhtari et al. (2019); Gidel et al. (2019) if $L$ is known, or Stonyakin et al. (2018) if $L$ is not known. For completeness, we provide a simple modification to the Extragradient algorithm of Korpelevich (1977) in Algorithm 4 (Appendix A), for which we prove that it attains the optimal convergence rate without the knowledge of $L$. The convergence result is summarized in the following lemma, whose proof is provided in Appendix A.
Let where and is -Lipschitz. Then, there exists a parameter-free algorithm that queries at most times and outputs a point such that
To obtain the desired result, we need to prove the convergence of a Halpern iteration with inexact evaluations of the cocoercive operator . Note that here we do know the cocoercivity parameter of – it is equal to . The resulting inexact version of Halpern’s iteration for is:
where is the error.
To analyze the convergence of (2.3), we again use the potential function from Eq. (2.1), with as the operator. For simplicity of exposition, we take the best choice of that can be obtained from Lemma 2.1 for The key result for this setting is provided in the following lemma, whose proof is deferred to the appendix.
We are now ready to state the algorithm and prove the main theorem for this subsection.
Let be a monotone and -Lipschitz operator and let be an arbitrary initial point. For any Algorithm 3 outputs a point with after at most iterations, where each iteration can be implemented with oracle queries to Hence, the total number of oracle queries to is:
Recall that and Hence, as Algorithm 3 outputs a point