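Consider a min-max optimization – or saddle-point – problem of the form
$$\min_{x\in\mathcal{X}} \max_{y\in\mathcal{Y}} f(x,y) \tag{SP}$$
where $\mathcal{X}$, $\mathcal{Y}$ are subsets of a Euclidean space and $f\colon \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ may be non-convex/non-concave. Given an algorithm for solving (SP), the following fundamental questions arise: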
When does the algorithm converge? Where does the algorithm converge to?
The goal of this paper is to provide concrete answers to these two questions and to study their practical implications for a wide array of existing methods.
Min-max problems of this type have found widespread applications in machine learning in the context of generative adversarial networks [GPAM+14], robust reinforcement learning [pinto2017robust], and other models of adversarial training [madry2018towards]. In this broad setting, it has become empirically clear that the joint training of two neural networks with competing objectives is fundamentally more difficult than training a single NN of similar size and architecture. The latter task boils down to successfully finding a (good) local minimum of a non-convex function, so it is instructive to revisit the questions above in the context of (non-convex) minimization problems.
In this case, much of the theory on stochastic gradient descent (SGD) methods – the “gold standard” for deep NN training – can be informally summed up as follows:
Bounded trajectories of SGD always converge to a set of critical points [Lju77, Lju86, BT00].
The limits of SGD do not contain saddle points or other spurious solutions [Pem90, BD96, GHJY15].
At first glance, these positive results might raise high expectations for solving (SP). Unfortunately, one can easily find counterexamples with very simple bilinear games of the form $f(x,y) = xy$: naïvely applying stochastic gradient descent/ascent (SGDA) methods in this case leads to recurrent orbits that do not contain any critical point of $f$. Such a phenomenon has no counterpart in non-convex minimization, and is fundamentally tied to the min-max structure of (SP).
The failure of SGDA in bilinear games has been studied extensively [YSXJ+18, GBVV+19, abernethy2019last, azizian2019tight, gidel2019negative, mokhtari2019unified, MPP18, liang2019interaction, schafer2019competitive, zhang2019convergence, peng2020training], leading to more sophisticated schemes such as stochastic extra-gradient (SEG) methods and their variants [DISZ18, MLZF+19, GBVV+19, HIMM19, CGFLJ19]. Meanwhile, to bypass such globally oscillatory issues, another thread of research [heusel2017gans, nagarajan2017gradient, daskalakis2018limit, MLZF+19, adolphs2019local, mazumdar2019finding, nouiehed2019solving, jin2019local, liu2019towards, raghunathan2019game, mazumdar2020gradient] has shifted its attention to local analysis. Essentially, these works either analyze the behavior of the algorithm only “sufficiently close” to critical points, or impose stringent assumptions on $f$ (such as “coherence” [MLZF+19] or the existence of solutions to a Minty variational inequality [liu2019towards]) to ensure the equivalence between global and local convergence.
Although these studies have certainly led to fruitful results, the realm beyond bilinear games and (locally) idealized objectives remains somewhat unexplored (with a few exceptions that we discuss in detail below). In particular, a convergence theory for general non-convex/non-concave problems is still lacking.
In this paper, we aim to bridge this gap by providing precise answers to the questions above for a wide range of min-max optimization algorithms that can be seen as generalized Robbins–Monro (RM) schemes [RM51]. Mirroring the minimization perspective, we prove that, for any such algorithm:
Bounded trajectories of the algorithm always converge to an internally chain-transitive (ICT) set.
Trajectories of the algorithm may converge with arbitrarily high probability to spurious attractors that contain no critical point of $f$.
The most critical implication of our theory is that one can reduce the long-term behavior of a training algorithm to its associated internally chain-transitive (ICT) sets, a notion deeply rooted in the study of dynamical systems [Bow75, Con78, BH96, Ben99, BHS05] that formalizes the idea of “discrete limits of continuous flows”; cf. Section 4. As an example, in minimization problems, one can prove that the ICT sets of SGD consist solely of components of critical points; on the other hand, we show that ICT sets in min-max optimization can exhibit drastically more complicated structures, even when . In particular, we establish the following negative results:
An ICT set may contain (almost) globally attracting limit cycles, and the algorithms designed to eliminate periodic orbits in bilinear games cannot escape them. This observation corroborates the persistence of non-convergent behaviors in GAN training, and suggests that bilinear games may be insufficient as models for such applications.
There exist unstable critical points whose neighborhood contains an (almost) globally stable ICT set. Therefore, in sharp contrast to minimization problems, “avoiding unstable critical points” does not imply “escaping unstable critical points” in min-max problems.
There exist stable min-max points whose basin of attraction is “shielded” by an unstable ICT set. As a result, existing algorithms are repelled from a desirable solution with high probability, even if initialized arbitrarily close to it.
Finally, we provide numerical illustrations of the above, which further show that common practical tweaks (such as averaging or adaptive algorithms) also fail to address these problematic cases.
Further related work.
To our knowledge, the convergence to non-critical sets in (SP) has only been systematically studied in a few settings. Besides the bilinear games alluded to above, other instances include the “almost bilinear games” [abernethy2019last] and deterministic gradient descent/ascent (GDA) applied to “hidden bilinear games” [FVP19a]. In contrast to these works, our framework does not impose any structural assumption and requires only mild regularity of $f$, and our results apply to many existing methods beyond (S)GDA; cf. Section 3. The generality of our approach is made possible by foundational results in dynamical systems [Ben99, BH96], which have not been exploited before in the context of min-max optimization, and have only recently been applied to learning in games with the aim of showing convergence to (local) Nash equilibria [PL12, BHS05, BHS06, PML17, mazumdar2020gradient, CHM17-NIPS, BM17, BLM18, MZ19, BBF18].
Upon completion of our paper (two weeks prior to the actual submission date), we discovered a preprint by Letcher [letcher2020impossibility] whose motivation is similar to our own. The focus of [letcher2020impossibility] is on providing counterexamples that rule out the convergence of deterministic “reasonable” and “global” algorithms. There are two major distinctions that make our approaches complementary: [letcher2020impossibility] focuses on the impossibility of desirable convergence guarantees in a purely deterministic setting; in contrast, our paper focuses squarely on the occurrence of undesirable convergence phenomena with probability $1$ in stochastic algorithms. Taken together, the work [letcher2020impossibility] and our own paint a fairly complete picture of the fundamental limits of min-max optimization algorithms.
2. Setup and preliminaries
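We focus on general problems of the form (SP) with $\mathcal{X} = \mathbb{R}^{d_1}$, $\mathcal{Y} = \mathbb{R}^{d_2}$, and $f$ assumed $C^1$. To ease notation, we will denote $z = (x,y)$, $\mathcal{Z} = \mathcal{X}\times\mathcal{Y}$ and $d = d_1 + d_2$. In addition, we will write
$$V(z) = \big({-\nabla_x f(x,y)},\, \nabla_y f(x,y)\big)$$
for the (min-max) gradient field of $f$, and we will assume that $V$ is Lipschitz. In some cases we will also require $f$ to be $C^2$ and we will write $J(z)$ for the Jacobian of $V$; this additional assumption will be stated explicitly whenever invoked.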
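A solution of (SP) is a tuple $z^* = (x^*, y^*)$ with $f(x^*, y) \le f(x^*, y^*) \le f(x, y^*)$ for all $x\in\mathcal{X}$, $y\in\mathcal{Y}$; likewise, a local solution of (SP) is a tuple $(x^*, y^*)$ that satisfies this inequality locally. Finally, a state $z^*$ with $V(z^*) = 0$ is said to be a critical (or stationary) point of $f$. When $f$ is $C^2$, any local solution is a stable critical point [jin2019local], i.e., $\nabla_x^2 f(x^*, y^*) \succeq 0$ and $\nabla_y^2 f(x^*, y^*) \preceq 0$.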
From an algorithmic standpoint, we will focus exclusively on the black-box optimization paradigm [Nes04] with stochastic first-order oracle (SFO) feedback; algorithms with a more complicated feedback structure (such as a best-response oracle [jin2019local, naveiro2019gradient, fiez2019convergence]) or based on mixed-strategy sampling [hsieh2019finding, domingo2020mean] are not considered in this work. In detail, when called at $z = (x,y)$ with random seed $\omega\in\Omega$, an SFO returns a random vector $V(z;\omega)$ of the form
$$V(z;\omega) = V(z) + U(z;\omega)$$
where the error term $U(z;\omega)$ captures all sources of uncertainty in the model (e.g., the selection of a minibatch in GAN training models, system state observations in reinforcement learning, etc.). Regarding this error term, we will assume throughout that it is zero-mean and sub-Gaussian:
$$\mathbb{E}[U(z;\omega)] = 0 \quad\text{and}\quad \mathbb{P}\big(\|U(z;\omega)\| \ge t\big) \le 2\exp\!\left(-\frac{t^2}{2\sigma^2}\right)$$
for some $\sigma > 0$ and all $z\in\mathcal{Z}$. The sub-Gaussian tail assumption is standard in the literature [Nes04, Nes09, NJLS09, JNT11], and it can be further relaxed with little loss of generality to finite variance. To streamline our discussion, we will present our results in the sub-Gaussian regime and we will rely on a series of remarks to explain any modifications required for different assumptions on $U$.
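As a rough illustration of the oracle model above, the following sketch builds an SFO for the bilinear objective $f(x,y) = xy$ with independent Gaussian errors, which are zero-mean and sub-Gaussian. The objective, the name `make_sfo` and the choice of Gaussian noise are assumptions made purely for illustration; they are not prescribed by the text.

```python
import random

def make_sfo(sigma, seed=0):
    """Return a hypothetical stochastic first-order oracle for f(x, y) = x*y.

    Each call returns V(z) + U, where V(z) = (-df/dx, df/dy) = (-y, x) and
    U has independent N(0, sigma^2) components (an illustrative noise model).
    """
    rng = random.Random(seed)

    def oracle(z):
        x, y = z
        vx, vy = -y, x  # exact min-max gradient field of f(x, y) = x*y
        return (vx + rng.gauss(0.0, sigma), vy + rng.gauss(0.0, sigma))

    return oracle

if __name__ == "__main__":
    sfo = make_sfo(sigma=1.0)
    print(sfo((1.0, 2.0)))  # one noisy sample of V at z = (1, 2)
```

Averaging many oracle calls at a fixed point should recover $V(z)$, which is one quick way to sanity-check the zero-mean requirement.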
3. Core algorithmic framework
3.1. The Robbins–Monro template
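Much of our analysis will focus on iterative algorithms that can be cast in the abstract Robbins–Monro framework of stochastic approximation [RM51]:
$$Z_{n+1} = Z_n + \gamma_n \big[V(Z_n) + W_n\big] \tag{RM}$$
where:
$Z_n = (X_n, Y_n)$ denotes the state of the algorithm at each stage $n = 1, 2, \dotsc$
$W_n$ is a generalized error term (described in detail below).
$\gamma_n > 0$ is the step-size (a hyperparameter, typically of the form $\gamma_n \propto 1/n^p$, $p \ge 0$).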
In the above, the error term $W_n$ is generated after $Z_n$; thus, by default, $W_n$ is not adapted to the history (natural filtration) $\mathcal{F}_n$ of $Z_n$. For concision, we will write
$$V_n = V(Z_n) + W_n$$
so $V_n$ can be seen as a noisy estimate of $V(Z_n)$. In more detail, to differentiate between “random” (zero-mean) and “systematic” (non-zero-mean) errors in $V_n$, it will be convenient to further decompose the error process $W_n$ as
$$W_n = U_n + b_n$$
where $b_n = \mathbb{E}[W_n \mid \mathcal{F}_n]$ represents the systematic component of the error and $U_n = W_n - b_n$ captures the random, zero-mean part. In view of all this, we will consider the following descriptors for $W_n$:
$$B_n = \mathbb{E}[\|b_n\|] \quad\text{(bias)}, \qquad \sigma_n^2 = \mathbb{E}[\|U_n\|^2] \quad\text{(variance)}.$$
The precise behavior of $B_n$ and $\sigma_n^2$ will be examined on a case-by-case basis below.
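The template above can be sketched as a short generic loop. The function name `robbins_monro`, the callback signature, and the step-size schedule $\gamma_n = \gamma_0 / n^p$ are illustrative assumptions; the callback is expected to return the full noisy direction, i.e., the mean field plus the generalized error term.

```python
def robbins_monro(z0, noisy_field, n_steps, gamma0=1.0, p=1.0):
    """Generic Robbins-Monro loop: z <- z + gamma_n * (V(z) + W_n).

    `noisy_field(z, n)` is a user-supplied (hypothetical) callback returning
    the noisy direction at stage n; gamma_n = gamma0 / n**p is one common,
    assumed step-size schedule.
    """
    z = list(z0)
    for n in range(1, n_steps + 1):
        gamma = gamma0 / n ** p
        direction = noisy_field(z, n)
        z = [zi + gamma * di for zi, di in zip(z, direction)]
    return z

if __name__ == "__main__":
    # Noise-free example: f(x, y) = (x^2 - y^2) / 2 gives V(z) = (-x, -y) = -z.
    z_final = robbins_monro((1.0, -2.0), lambda z, n: [-zi for zi in z],
                            n_steps=10_000, gamma0=0.5)
    print(z_final)
```

Specific algorithms then differ only in how the callback produces its direction, which is the viewpoint taken in the rest of this section.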
3.2. Specific algorithms
In the rest of this section, we discuss how a wide range of algorithms used in the literature can be seen as special instances of the general template (RM) above.
Algorithm 1 (SGDA).
The basic SGDA algorithm – also known as the Arrow–Hurwicz method [AHU58] – queries an SFO and proceeds as:
$$Z_{n+1} = Z_n + \gamma_n V(Z_n;\omega_n) \tag{SGDA}$$
where $\omega_n\in\Omega$ ($n = 1, 2, \dotsc$) is an independent and identically distributed (i.i.d.) sequence of oracle seeds. As such, (SGDA) admits a straightforward RM representation by taking $W_n = U_n = U(Z_n;\omega_n)$ and $b_n = 0$.
Algorithm 2 (Alt-SGDA).
A common variant of SGDA is to alternate the updates of the min/max variables, resulting in the alternating stochastic gradient descent/ascent (alt-SGDA) method:
where () are sequences of i.i.d. random seeds, , and . The RM representation of (alt-SGDA) is obtained by taking , , and .
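To make the difference between simultaneous and alternating updates concrete, here is a minimal sketch on the illustrative objective $f(x,y) = xy$. It assumes the usual alternating order (the min variable is updated first and the max variable then uses the fresh value); the function names and the way noise is passed in are hypothetical.

```python
def V_x(x, y):
    return -y  # -df/dx for the illustrative f(x, y) = x*y

def V_y(x, y):
    return x   # +df/dy for the same f

def sim_gda_step(x, y, gamma, noise_x=0.0, noise_y=0.0):
    """Simultaneous update: both components are evaluated at the old state."""
    return (x + gamma * (V_x(x, y) + noise_x),
            y + gamma * (V_y(x, y) + noise_y))

def alt_gda_step(x, y, gamma, noise_x=0.0, noise_y=0.0):
    """Alternating update: the y-step is evaluated at the already updated x."""
    x_next = x + gamma * (V_x(x, y) + noise_x)
    y_next = y + gamma * (V_y(x_next, y) + noise_y)
    return x_next, y_next

if __name__ == "__main__":
    print(sim_gda_step(1.0, 1.0, 0.1))
    print(alt_gda_step(1.0, 1.0, 0.1))
```

The two steps agree in the first component and differ in the second by a term of order $\gamma^2$, which is the kind of systematic offset that the RM representation absorbs into the error term.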
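Algorithm 3 (SEG).
Going beyond (SGDA), the (stochastic) extra-gradient algorithm exploits the following principle [Kor76, Nem04, JNT11]: given a “base” state $Z_n$, the algorithm queries the oracle at $Z_n$ to generate a leading state $Z_{n+1/2}$ and then updates $Z_n$ with oracle information from $Z_{n+1/2}$. Assuming SFO feedback as above, this process may be described as follows:
$$Z_{n+1/2} = Z_n + \gamma_n V(Z_n;\omega_n), \qquad Z_{n+1} = Z_n + \gamma_n V(Z_{n+1/2};\omega_{n+1/2}) \tag{SEG}$$
To recast (SEG) in the Robbins–Monro framework, simply take $W_n = V(Z_{n+1/2};\omega_{n+1/2}) - V(Z_n)$, i.e., $U_n = U(Z_{n+1/2};\omega_{n+1/2})$ and $b_n = V(Z_{n+1/2}) - V(Z_n)$.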
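The Robbins–Monro representation of an extra-gradient step can be checked numerically. The sketch below assumes the objective $f(x,y) = xy$, Gaussian oracle noise, and the decomposition in which the generalized error is the oracle output at the leading state minus the mean field at the base state; all names are illustrative.

```python
import random

def V(z):
    """Min-max gradient field of the illustrative objective f(x, y) = x*y."""
    x, y = z
    return (-y, x)

def seg_step(z, gamma, sigma, rng):
    """One stochastic extra-gradient step, returning (z_next, W, U, b)."""
    noise_base = (rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    v_base = tuple(v + e for v, e in zip(V(z), noise_base))        # first query
    z_lead = tuple(zi + gamma * vi for zi, vi in zip(z, v_base))   # leading state
    u = (rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))             # zero-mean part
    v_lead = tuple(v + e for v, e in zip(V(z_lead), u))            # second query
    z_next = tuple(zi + gamma * vi for zi, vi in zip(z, v_lead))
    b = tuple(vl - vb for vl, vb in zip(V(z_lead), V(z)))          # systematic part
    w = tuple(vl - vb for vl, vb in zip(v_lead, V(z)))             # generalized error
    return z_next, w, u, b

if __name__ == "__main__":
    rng = random.Random(0)
    print(seg_step((1.0, 2.0), 0.1, 0.5, rng))
```

For this linear field the systematic part has norm proportional to the step-size, so it vanishes as the step-size does; this is only an observation about the toy example, not a general statement.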
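Algorithm 4 (OG / PEG).
Compared to (SGDA), the scheme (SEG) involves two oracle queries per iteration, which is considerably more costly. An alternative iterative method with a single oracle query per iteration was proposed by Popov [Pop80]:
$$Z_{n+1/2} = Z_n + \gamma_n V(Z_{n-1/2};\omega_{n-1/2}), \qquad Z_{n+1} = Z_n + \gamma_n V(Z_{n+1/2};\omega_{n+1/2}) \tag{OG/PEG}$$
Its Robbins–Monro representation is obtained by setting $W_n = V(Z_{n+1/2};\omega_{n+1/2}) - V(Z_n)$, i.e., $U_n = U(Z_{n+1/2};\omega_{n+1/2})$ and $b_n = V(Z_{n+1/2}) - V(Z_n)$.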
Popov’s extra-gradient has been rediscovered several times and is more widely known as the optimistic gradient (OG) method in the machine learning literature [RS13-COLT, CYLM+12, DISZ18, HIMM19]. In unconstrained min-max optimization, (OG/PEG) turns out to be equivalent to a number of other existing methods, including “extrapolation from the past” [GBVV+19], reflected gradient [malitsky2020forward], and the “prediction method” of [yadav2017stabilizing].
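The single-query structure of the optimistic / Popov scheme amounts to caching the previous oracle output and reusing it to form the next leading state. The sketch below is a noise-free illustration under stated assumptions: a constant step-size, the objective $f(x,y) = xy$, and initialization of the cache with one extra query at the starting point; the names are hypothetical.

```python
import math

def optimistic_gradient(z0, oracle, n_steps, gamma):
    """Optimistic gradient / Popov iteration with one oracle call per step.

    The leading state reuses the oracle output stored from the previous
    iteration; the cache is initialised with a query at z0 (an assumption).
    """
    z = tuple(z0)
    v_prev = oracle(z)
    for _ in range(n_steps):
        z_lead = tuple(zi + gamma * vi for zi, vi in zip(z, v_prev))
        v_prev = oracle(z_lead)  # the only oracle call of this iteration
        z = tuple(zi + gamma * vi for zi, vi in zip(z, v_prev))
    return z

def bilinear_field(z):
    x, y = z
    return (-y, x)  # V(z) for the illustrative f(x, y) = x*y

if __name__ == "__main__":
    z_final = optimistic_gradient((1.0, 0.0), bilinear_field, 2000, 0.1)
    print(math.hypot(*z_final))
```

In this toy run the iterates contract towards the origin, whereas plain gradient descent/ascent with the same step-size would move away from it.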
Algorithm 5 (Kiefer–Wolfowitz).
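When first-order feedback is unavailable, a popular alternative is to obtain gradient information of $f$ via zeroth-order observations [liu2019min]. This idea can be traced back to the seminal work of Kiefer and Wolfowitz [KW52] and the subsequent development of the simultaneous perturbation stochastic approximation (SPSA) method by Spall [Spa92]. In our setting, this leads to the recursion:
$$Z_{n+1} = Z_n \pm \gamma_n\, d\, \frac{f(Z_n + \delta_n \omega_n) - f(Z_n - \delta_n \omega_n)}{2\delta_n}\, \omega_n$$
where $\delta_n \to 0$ is a vanishing “sampling radius” parameter, $\omega_n$ is drawn uniformly at random from the composite basis $\mathcal{E} = \mathcal{E}_{\mathcal{X}} \cup \mathcal{E}_{\mathcal{Y}}$ of $\mathcal{X}\times\mathcal{Y}$, and the “$\pm$” sign is equal to $-1$ if $\omega_n\in\mathcal{E}_{\mathcal{X}}$ and $+1$ if $\omega_n\in\mathcal{E}_{\mathcal{Y}}$. Viewed this way, the interpretation of Algorithm 5 as a Robbins–Monro method is immediate; furthermore, a straightforward calculation (that we defer to the supplement) shows that the sequence of gradient estimators in Algorithm 5 has $B_n = \mathcal{O}(\delta_n)$ and $\sigma_n^2 = \mathcal{O}(1/\delta_n^2)$.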
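A two-point, coordinate-wise estimator of the min-max gradient field can be sketched as follows. The sketch assumes noise-free function evaluations, sampling of one coordinate direction uniformly at random, a rescaling by the total dimension, and a sign flip on the minimization coordinates; the function names and the argument `n_min` (number of minimization coordinates) are hypothetical.

```python
import random

def kw_direction_estimate(f, z, delta, n_min, i):
    """Two-point estimate of the min-max field along coordinate i.

    Coordinates 0..n_min-1 are treated as minimization variables (sign -1),
    the remaining ones as maximization variables (sign +1).
    """
    d = len(z)
    z_plus, z_minus = list(z), list(z)
    z_plus[i] += delta
    z_minus[i] -= delta
    slope = (f(z_plus) - f(z_minus)) / (2.0 * delta)  # central difference
    sign = -1.0 if i < n_min else 1.0
    estimate = [0.0] * d
    estimate[i] = sign * d * slope  # factor d offsets the 1/d sampling probability
    return estimate

def kw_estimate(f, z, delta, n_min, rng):
    """Draw a coordinate uniformly at random and return the estimate."""
    return kw_direction_estimate(f, z, delta, n_min, rng.randrange(len(z)))

if __name__ == "__main__":
    f = lambda z: z[0] * z[1]  # illustrative bilinear objective
    print(kw_estimate(f, (1.5, -0.5), 0.01, 1, random.Random(0)))
```

Averaging the estimate over all coordinate directions recovers the exact field for quadratic objectives, since central differences are exact in that case; for general objectives a bias depending on the sampling radius remains.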
Further examples that can be cast in the general framework (RM) include the negative momentum method [gidel2019negative], generalized OG schemes [mokhtari2019unified], and centripetal acceleration [peng2020training]; the analysis is similar and we omit the details. Certain scalable second-order methods can also be viewed as Robbins–Monro schemes, but the driving vector field is no longer the gradient field of $f$; we discuss this in LABEL:ex:2nd-order and the supplement.
4. Convergence analysis
4.1. Continuous vs. discrete time
The main idea of our approach will be to treat (RM) as a noisy discretization of the mean dynamics
$$\dot z(t) = V(z(t)) \tag{MD}$$
This is motivated by the fact that $\dot z$ can be seen as the continuous-time limit of the finite difference quotient $(Z_{n+1} - Z_n)/\gamma_n$: in this way, if the error term $W_n$ in (RM) is sufficiently well-behaved, it is plausible to expect that the iterates of (RM) and the solutions of (MD) eventually come together. This approach has proved very fruitful when the mean dynamics (MD) comprise a gradient system, i.e., $V = -\nabla f$ for some (possibly non-convex) $f\colon \mathbb{R}^d\to\mathbb{R}$. In this case (and modulo mild assumptions), the systems (RM) and (MD) both converge to the critical set of $f$, see e.g., [Lju77, KC78, BMP90, KY97, BT00].
On the other hand, the min-max landscape is considerably more involved. The most widely known illustration is given by the bilinear objective $f(x,y) = xy$: in this case (see Fig. 1), the trajectories of (MD) comprise periodic orbits of perfect circles centered at the origin (the unique critical point of $f$). However, the behavior of different RM schemes can vary wildly, even in the absence of noise ($\sigma = 0$): trajectories of (SGDA) spiral outwards, each converging to an (initialization-dependent) periodic orbit; instead, trajectories of (SEG) spiral inwards, eventually converging to the solution $z^* = (0,0)$.
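The qualitative contrast between gradient and extra-gradient updates on the bilinear objective can be reproduced in a few lines. The sketch below is noise-free and uses a constant step-size, which is a simplifying assumption: with a constant step the gradient descent/ascent radius keeps growing rather than settling on a periodic orbit, so only the outward versus inward direction of the spiral is illustrated. All names are hypothetical.

```python
import math

def V(z):
    x, y = z
    return (-y, x)  # (-df/dx, df/dy) for f(x, y) = x*y

def move(z, direction, gamma):
    return tuple(zi + gamma * di for zi, di in zip(z, direction))

def final_radius(method, z0, gamma, n_steps):
    """Distance from the origin after n_steps of 'gda' or 'eg' (noise-free)."""
    z = tuple(z0)
    for _ in range(n_steps):
        if method == "gda":
            z = move(z, V(z), gamma)
        elif method == "eg":
            z_lead = move(z, V(z), gamma)      # extrapolation step
            z = move(z, V(z_lead), gamma)      # update from the base state
        else:
            raise ValueError(f"unknown method: {method}")
    return math.hypot(*z)

if __name__ == "__main__":
    for method in ("gda", "eg"):
        print(method, final_radius(method, (1.0, 0.0), 0.1, 500))
```

A short calculation for this example shows that each gradient step multiplies the squared radius by $1 + \gamma^2$, while each extra-gradient step multiplies it by $1 - \gamma^2 + \gamma^4$, which is below one for $0 < \gamma < 1$.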
This particular difference between gradient and extra-gradient schemes has been well-documented in the literature, cf. [DISZ18, GBVV+19, MLZF+19]. More pertinent to our theory, it also raises several key questions:
What is the precise link between RM methods and the mean dynamics (MD)?
When can (MD) accurately predict the long-run behavior of an RM method?
The rest of this section is devoted to providing precise answers to these questions.
4.2. Stochastic approximation
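Let $\tau_n = \sum_{k=1}^{n} \gamma_k$ denote the time that has elapsed at the $n$-th iteration of (RM), and define the continuous-time interpolation $Z(t)$ of $Z_n$ as
$$Z(t) = Z_n + \frac{t - \tau_n}{\tau_{n+1} - \tau_n}\,(Z_{n+1} - Z_n)$$
for all $t\in[\tau_n, \tau_{n+1}]$, $n = 1, 2, \dotsc$. To compare $Z(t)$ to the solution orbits of (MD), we will further consider the flow $\Theta\colon \mathbb{R}_+\times\mathcal{Z}\to\mathcal{Z}$ of (MD), which is simply the orbit $\Theta_t(z)$ of (MD) at time $t$ with an initial condition $z(0) = z$. We then have the following notion of “asymptotic closeness” due to Benaïm and Hirsch [BH96, BH95]:
$Z(t)$ is an asymptotic pseudotrajectory (APT) of (MD) if, for all $T > 0$, we have:
$$\lim_{t\to\infty}\, \sup_{0\le h\le T} \|Z(t+h) - \Theta_h(Z(t))\| = 0.$$
This comparison criterion plays a central role in our analysis. In words, it simply posits that $Z(t)$ eventually tracks the flow of (MD) with arbitrary accuracy over windows of arbitrary length; as a result, if $Z(t)$ is an asymptotic pseudotrajectory (APT) of (MD), it is reasonable to expect its behavior to be closely correlated to that of (MD).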
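The tracking property can be probed numerically in the bilinear example, where the flow of the mean dynamics is known in closed form. The sketch below assumes noise-free gradient descent/ascent on $f(x,y) = xy$ with step-size $1/n$, stores the state $(x, y)$ as the complex number $x + iy$ (so that the field acts as multiplication by $i$ and the flow as rotation), and evaluates the gap only at the grid times rather than over the whole interpolated curve. These simplifications and all names are illustrative.

```python
import cmath

def gda_iterates(z0, n_steps):
    """Noise-free GDA on f(x, y) = x*y with gamma_n = 1/n.

    States are complex numbers x + 1j*y, so V(z) = (-y, x) is 1j * z.
    Returns the iterates and the elapsed times (sums of step-sizes).
    """
    zs, taus = [z0], [0.0]
    for n in range(1, n_steps + 1):
        gamma = 1.0 / n
        zs.append(zs[-1] + gamma * 1j * zs[-1])
        taus.append(taus[-1] + gamma)
    return zs, taus

def window_gap(zs, taus, start, horizon):
    """Largest grid-point distance between the iterates and the exact flow
    started at zs[start], over a time window of length `horizon`."""
    gap, m = 0.0, start
    while m < len(zs) and taus[m] - taus[start] <= horizon:
        h = taus[m] - taus[start]
        flow = cmath.exp(1j * h) * zs[start]  # rotation by angle h
        gap = max(gap, abs(zs[m] - flow))
        m += 1
    return gap

if __name__ == "__main__":
    zs, taus = gda_iterates(1.0 + 0.0j, 5000)
    print(window_gap(zs, taus, 0, 1.0), window_gap(zs, taus, 1000, 1.0))
```

In this toy run the gap over a window of fixed length is large at early times and small at late times, in line with the idea that the iterates eventually shadow the flow over windows of any fixed length.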
Our first result below makes this link precise. To state it, we will make the following assumptions:
both assumed to hold with probability $1$. Under these blanket requirements, we have: