This paper focuses on solving convex-concave saddle-point problems of the form
$$\min_{x\in\mathcal X}\max_{y\in\mathcal Y}\; \mathcal L(x,y) = f(x) + g(x) + \langle Kx, y\rangle - h^*(y), \tag{1}$$
which is the convex-concave saddle-point formulation of the primal problem
$$\min_{x\in\mathcal X}\; f(x) + g(x) + h(Kx).$$
In particular we will show that a simple modification of the iterate-averaging schemes of two well-known algorithms, Chambolle and Pock's primal-dual algorithm (PD) (Chambolle and Pock, 2011, 2016) and Nemirovski's mirror prox (MP) (Nemirovski, 2004), leads to much better practical performance, while retaining the same worst-case dependence on $T$ in the convergence rate. Ergodic rates for methods for solving saddle-point problems often rely on uniformly averaging all iterates (for example in the case of PD and MP), whereas in practice the last iterate is known to often converge at a much faster rate than the uniform average (Chambolle and Pock, 2016). However, the last iterate has no guarantee on the convergence rate. This drawback is mentioned, for example, by Tran-Dinh et al. (2018). To ameliorate this issue, we suggest averaging schemes that put increasing weight on later iterates, for example by weighting the $t$'th iterate according to $t\,\tau_t$ (linear averaging) or $t^2\tau_t$ (quadratic averaging), where $\tau_t$ is the stepsize of the algorithm. This is inspired by the practical success of the CFR+ algorithm (Tammelin et al., 2015), which uses linear averaging, and has been used for several recent breakthroughs in the solution of large-scale sequential games (Bowling et al., 2015; Moravčík et al., 2017; Brown and Sandholm, 2018).
We show that linear and quadratic averaging achieve the same theoretical performance as uniform averaging in both PD and MP. Perhaps more importantly, we point out that their convergence rate depends on the sum of iterate distances $\sum_{t=1}^T D_t$ (letting $D_t$ be a suitably-defined stand-in for the distance of the $t$'th iterate to the saddle point), rather than just on the initial distance $D_1$ as is achieved via the telescoping argument for uniform averaging. In practice we expect $D_t$ to decrease as $t$ increases, and in that case a dependence on the sum (with a correspondingly-larger sum-of-weights denominator) is much better than a dependence on just $D_1$.
In numerical experiments we show that linear and quadratic averaging achieve strong practical performance across three different domains of saddle-point problems, and for three algorithms: PD, MP, and a linesearch variant of PD. First, in the case of matrix games, we show that quadratic averaging leads to stronger performance than the last iterate in all three algorithms, and we even find that it converges faster than the CFR+ algorithm (Tammelin et al., 2015), the practical state-of-the-art for solving large-scale sequential games (Brown et al., 2017; Burch, 2017; Kroer et al., 2018). Second, we investigate denoising of images via $\ell_1$-based total-variation minimization. Again we find that increasing averaging, especially quadratic averaging, performs better than, or roughly as well as, the last iterate. Finally, we consider the solution of a convex program for computing a competitive equilibrium in a Fisher market, a problem that has applications in Internet auction markets (Conitzer et al., 2018a,b) and fair division (Varian, 1974; Cole and Gkatzelis, 2015; Caragiannis et al., 2016). Again we find that linear and quadratic averaging perform much better than uniform averaging, although in this case the last iterate performs somewhat better in late stages of the optimization.
The use of linear averaging in the present paper is inspired by the practical performance of linear averaging in CFR+. Similar to our setting, linear averaging provides no worst-case theoretical benefit in CFR+, but it does lead to much stronger practical performance (quadratic averaging has also been shown to work well in CFR settings (Brown and Sandholm, 2019)). Another technique from CFR+ that is important for its practical performance is alternation, where one player's update is computed based on the other player's current iterate, and the second player then updates based on the first player's new iterate. In our setting PD has a form of alternation in that $y^{t+1}$ is computed based on the extrapolation $2x^{t+1} - x^t$, but MP does not have any form of alternation. Yet we find that the algorithms have near-identical performance, and so while the performance of linear (or greater) iterate averaging generalizes to our setting, alternation does not.
We will be solving problems of the form (1), where $\mathcal X$ and $\mathcal Y$ are finite-dimensional real vector spaces equipped with norms $\|\cdot\|_{\mathcal X}$ and $\|\cdot\|_{\mathcal Y}$ (we will generally refrain from using the subscripts when e.g. $\|x\|$ clearly refers to $\|x\|_{\mathcal X}$). We will also need the dual norm for $\mathcal X$: $\|w\|_* = \max_{\|x\| \le 1} \langle w, x \rangle$. We make the following further assumptions (which are identical to those of Chambolle and Pock (2016)):
We have access to distance-generating functions $d_{\mathcal X}, d_{\mathcal Y}$ which are 1-strongly convex wrt. the respective norms $\|\cdot\|_{\mathcal X}, \|\cdot\|_{\mathcal Y}$, and continuously differentiable on their domains. $D_{\mathcal X}, D_{\mathcal Y}$ are the corresponding Bregman divergences defined as
$$D(z, z') = d(z) - d(z') - \langle \nabla d(z'), z - z' \rangle.$$
$K$ is a bounded linear operator, with operator norm $L = \|K\| = \max\{\langle Kx, y \rangle : \|x\| \le 1, \|y\| \le 1\}$. This implies $\langle Kx, y \rangle \le L \|x\| \|y\|$.
$f$ is a proper, lower semicontinuous (l.s.c.) convex function. Its gradient is Lipschitz continuous on $\mathcal X$, i.e.
$$\|\nabla f(x) - \nabla f(x')\|_* \le L_f \|x - x'\| \quad \text{for all } x, x' \in \mathcal X.$$
$g$ and $h^*$ are proper, l.s.c. convex functions with simple structure such that their prox-mappings can be computed for any $\hat x \in \mathcal X$, $\hat y \in \mathcal Y$:
$$\mathrm{prox}_{\tau g}(\hat x) = \arg\min_{x \in \mathcal X}\; g(x) + \tfrac{1}{\tau} D_{\mathcal X}(x, \hat x), \qquad \mathrm{prox}_{\sigma h^*}(\hat y) = \arg\min_{y \in \mathcal Y}\; h^*(y) + \tfrac{1}{\sigma} D_{\mathcal Y}(y, \hat y).$$
We assume that and .
We now give examples of two of the most common choices of distance-generating functions $d$. The squared Euclidean norm $d(z) = \frac12 \|z\|_2^2$ leads to the Bregman divergence $D(z, z') = \frac12 \|z - z'\|_2^2$, and is often called the "Euclidean setting." This leads to 1-strong convexity over all of $\mathbb R^n$ with respect to the $\ell_2$ norm. The negative entropy
$$d(z) = \sum_{i=1}^n z_i \log z_i$$
is 1-strongly convex over the probability simplex $\Delta^n = \{z \ge 0 : \sum_i z_i = 1\}$ wrt. the $\ell_1$ norm, and we will refer to this as the entropy distance.
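The two Bregman divergences above can be evaluated in a few lines; the following Python sketch (the function names are ours) shows both, with the negative-entropy case reducing to the KL divergence on the simplex:

```python
import numpy as np

def bregman_euclidean(z, zp):
    # D(z, z') = d(z) - d(z') - <grad d(z'), z - z'> with d(z) = 0.5 ||z||_2^2,
    # which simplifies to 0.5 ||z - z'||_2^2.
    return 0.5 * np.sum((z - zp) ** 2)

def bregman_entropy(z, zp):
    # Negative entropy d(z) = sum_i z_i log z_i gives the KL divergence
    # D(z, z') = sum_i z_i log(z_i / z'_i) on the probability simplex.
    return np.sum(z * np.log(z / zp))
```

Both divergences are nonnegative and vanish exactly when the two arguments coincide, as any Bregman divergence of a strictly convex distance-generating function must.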
We will measure error according to the saddle-point residual
$$\xi(\bar x, \bar y) = \max_{y \in \mathcal Y} \mathcal L(\bar x, y) - \min_{x \in \mathcal X} \mathcal L(x, \bar y).$$
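For the bilinear games over simplexes studied later, the inner maximization and minimization in the residual are attained at vertices, so the residual has a closed form. A minimal Python sketch (our own helper):

```python
import numpy as np

def simplex_game_residual(A, x, y):
    # Residual of min_x max_y <Ax, y> over probability simplexes:
    # the inner max/min are attained at simplex vertices, so
    # xi(x, y) = max_i (Ax)_i - min_j (A^T y)_j.
    return np.max(A @ x) - np.min(A.T @ y)
```

For instance, at the uniform equilibrium of rock-paper-scissors the residual is zero, while any exploitable strategy yields a positive residual.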
We will investigate two specific algorithms from the literature that have strong performance both in theory and practice. Both algorithms rely on ergodic convergence rates, usually achieved by uniformly averaging iterates. We first introduce the alternative averaging schemes that we will use.
3.1 Averaging schemes
We assume that we have a sequence of iterates $x^1, x^2, \ldots$ and $y^1, y^2, \ldots$. We will construct the ergodic solution $\bar z^T = (\bar x^T, \bar y^T)$ according to the scheme
$$\bar z^T = \frac{\sum_{t=1}^T w(t)\, \tau_t\, z^t}{\sum_{t=1}^T w(t)\, \tau_t},$$
where $\tau_t$ is the stepsize used at iteration $t$ of a given algorithm, and $w(t)$ is a weakly-increasing function in $t$. If the given algorithm uses the same stepsize at all iterations then we can set $\tau_t = 1$ in the averaging scheme, even if the stepsize is not 1 (this simplifies calculations without changing the algorithm). We will be particularly interested in the following variants:
Uniform averaging: $w(t) = 1$. Linear averaging: $w(t) = t$. Quadratic averaging: $w(t) = t^2$. Last iterate: $w(t) = 0$ for $t < T$, and $w(T) = 1$.
We will say that $w$ constitutes an increasing averaging scheme if $w(t) \le w(t+1)$ for all $t$, with $w(1) > 0$.
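The averaging scheme above can be maintained as a running weighted mean, so no iterate history needs to be stored. A small Python sketch (our own helper, assuming float arrays):

```python
import numpy as np

def averaged_iterates(iterates, stepsizes, w):
    # Ergodic solution bar z^T = sum_t w(t) tau_t z^t / sum_t w(t) tau_t,
    # maintained incrementally so no history must be stored.
    # w is a weight function, e.g. lambda t: 1 (uniform), t (linear), t**2 (quadratic).
    zbar = np.zeros_like(iterates[0], dtype=float)
    denom = 0.0
    for t, (z, tau) in enumerate(zip(iterates, stepsizes), start=1):
        weight = w(t) * tau
        denom += weight
        zbar += (weight / denom) * (z - zbar)  # incremental weighted mean update
    return zbar
```

With iterates $1, 2, 3$ and unit stepsizes, uniform averaging returns $2$, linear averaging $(1 + 4 + 9)/6 = 14/6$, and quadratic averaging $(1 + 8 + 27)/14 = 18/7$, matching the definition directly.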
3.2 Chambolle & Pock’s primal-dual algorithm
The first algorithm we will describe is Chambolle & Pock's primal-dual algorithm (Chambolle and Pock, 2011), using the more recent description given by Chambolle and Pock (2016). The algorithm is given in Algorithm 1. It repeatedly updates $x$ based on proximal steps using a linearization of $f$, and then computes the proximal update for $y$ based on an extrapolation of $x^t$ and $x^{t+1}$.
At each iteration two proximal mappings are computed. First $x^{t+1}$ is computed based on the proximal mapping wrt. the previous iterates $(x^t, y^t)$, and secondly $y^{t+1}$ is computed, but using a gradient based on an extrapolation of $x^t$ and $x^{t+1}$. Chambolle and Pock (2016) show that the uniform average of iterates converges to a saddle point at a rate of $O(1/T)$. We will use their Lemma 1 (Chambolle and Pock, 2016), which is a descent lemma for the primal-dual iteration. Here we specialize it to the iterates of PD, whereas Chambolle and Pock (2016) show a more general variant with flexible choices of how gradients are computed: [Chambolle and Pock (2016)] If $(x^t, y^t)$ are generated according to PD, then for all $t$ and all $z = (x, y)$ we have
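In the Euclidean setting, one iteration of the two proximal updates described above can be sketched as follows. This is a minimal illustration, not the exact experimental implementation: `prox_g` and `prox_hstar` are user-supplied prox operators, and the extrapolation parameter is fixed to the standard value of 1.

```python
import numpy as np

def pd_step(x, y, K, grad_f, prox_g, prox_hstar, tau, sigma):
    # One primal-dual iteration (Euclidean setting, sketch):
    #   x^{t+1} = prox_{tau g}( x^t - tau * (grad f(x^t) + K^T y^t) )
    #   y^{t+1} = prox_{sigma h*}( y^t + sigma * K (2 x^{t+1} - x^t) )
    # prox_g and prox_hstar take (point, stepsize) and return the prox point.
    x_new = prox_g(x - tau * (grad_f(x) + K.T @ y), tau)
    y_new = prox_hstar(y + sigma * (K @ (2 * x_new - x)), sigma)
    return x_new, y_new
```

Note how the dual update sees the extrapolated point $2x^{t+1} - x^t$ rather than $x^{t+1}$ itself; this is the form of alternation discussed in the introduction.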
With this lemma we can show that PD achieves a form of ergodic rate that depends on the distance to optimality over the whole sequence of iterates (rather than the usual dependence on the distance at the initial point $z^1$).
Let the stepsize parameters $\tau, \sigma$ in PD be such that for all $t$ it holds that
Then for any $z = (x, y)$ we have
where $(\bar x^T, \bar y^T)$ are constructed according to some increasing averaging scheme $w$.
Now we sum this estimate from $t = 1$ to $T$, where each term is weighted by $w(t)$:
where we have dropped the last negative term in (3.2), performed a variant of the standard telescoping trick using our increasing averaging scheme, and dropped the final negative term in the sequence at $t = T$.
Now we note that (5) can be lower-bounded using the convex-concave structure of $\mathcal L$:
If we have an upper bound $\Omega$ on each distance term $D(z^\star, z^t)$, where $z^\star$ is a saddle point, then we can use Theorem 3.2 to get simpler bounds when using increasing averaging schemes. Here we show two special cases: linear and quadratic averaging. Similar results for higher-order odd polynomials can be derived via Faulhaber's formula. PD with the linear averaging scheme and stepsizes according to the condition in Theorem 3.2 satisfies
By Theorem 3.2 and the fact that $\sum_{t=1}^T t = \frac{T(T+1)}{2}$ for linear averaging, we have
PD with the quadratic averaging scheme and stepsizes according to the condition in Theorem 3.2 satisfies
Theorems 1 and 3.2 lead to the same asymptotic rate as uniform averaging, and are even slightly worse, in that factors of 2 and 6 are lost, respectively. Nonetheless we will show later that these averaging schemes achieve practical state-of-the-art performance, while retaining the $O(1/T)$ rate; uniform averaging does not achieve this practical performance.
From a theoretical perspective, Theorems 1 and 3.2 might not be the right way to motivate the superiority of linear or quadratic averaging. Instead, Theorem 3.2 itself is quite interesting: if we use linear averaging (chosen as an example; these observations apply to e.g. quadratic averaging as well) and apply Theorem 3.2 directly we get a rate
$$\xi(\bar x^T, \bar y^T) \le \frac{2}{\tau\, T(T+1)} \sum_{t=1}^T D(z, z^t). \tag{7}$$
In uniform averaging the comparable result is that we get a rate which depends only on the $D(z, z^1)$ part of this equation. Having only $D(z, z^1)$ is nice from an interpretability perspective: our error depends only on how far away we start, rather than depending on intermediate distances. If each term in (7) is upper bounded by some constant, then the bound is basically the same for uniform and linear averaging, as Theorem 1 shows. However, in practice we expect $z^t$ to move closer to $z$ as $t$ increases, and in that case (7) gives a much stronger bound than the one achieved by uniform averaging. For example, if the error at $z^t$ goes down at a rate of $O(1/t)$ then we get a convergence rate of $O(H_T / T^2)$, where $H_T$ is the $T$'th harmonic number. An extremely interesting direction of future work would be to theoretically justify cases where the error at $z^t$ goes down; in practice it happens very frequently, for example in all our experiments. Last-iterate convergence has been shown for some specific algorithms (Malitsky and Pock, 2018; Daskalakis et al., 2018; Daskalakis and Panageas, 2019), but these results are for the limit as $T \to \infty$, and so cannot be used to reason about our setting, where we need some guarantee on the rate as well. Furthermore, in order to match our experimental findings, as well as those of Chambolle and Pock (2016) for last-iterate performance, the theory should most likely be specific to the Euclidean distance, since the entropy distance does not perform well under increasing averaging or in the most-recent iterate.
3.3 Mirror prox
Next we describe the mirror prox (MP) algorithm (Nemirovski, 2004), which has arguably simpler updates in that there is no alternation, and it avoids the extrapolation of $x^t$ and $x^{t+1}$. On the other hand MP further assumes that $\mathcal X$ and $\mathcal Y$ are compact convex sets, unlike PD.
It is convenient, and customary, to work with the product space $\mathcal Z = \mathcal X \times \mathcal Y$ in the context of MP. Distance functions, Bregman divergences, and norms can be constructed for $\mathcal Z$ based on those given for $\mathcal X$ and $\mathcal Y$ (see e.g. Nemirovski (2004); Juditsky and Nemirovski (2011a,b) for details). Mirror prox considers saddle-point problems of the form
under the assumptions that the objective is convex-concave, that $\mathcal X$ and $\mathcal Y$ are convex and compact, and that the objective is smooth in the sense that its gradient mapping is Lipschitz continuous with constant $L$ with respect to the norm associated to the distance-generating function used. The pseudocode is given in Algorithm 2.
Nemirovski (2004) proves the following descent lemma (see e.g. Bubeck (2015), at the end of the proof of Theorem 4.4, for a direct statement of this lemma): For any $z$, the iterates $z^t$ of MP with stepsizes $\tau_t$ satisfy
With this lemma we can show a result similar to that for PD, using the exact same steps. For any $z$, the averaged solution $\bar z^T$ from MP with stepsize $\tau = 1/L$ satisfies
With MP we know that all iterates remain in the compact domain, so each Bregman term is bounded, and we can bound (9) for specific increasing averaging schemes. MP with the linear averaging scheme and stepsize $\tau = 1/L$ satisfies
By Theorem 2 and the fact that $\sum_{t=1}^T t = \frac{T(T+1)}{2}$ for linear averaging, we have
MP with the quadratic averaging scheme and stepsize $\tau = 1/L$ satisfies
By Theorem 2 and the fact that $\sum_{t=1}^T t^2 = \frac{T(T+1)(2T+1)}{6}$ for quadratic averaging, we have
4 Numerical experiments
Our experiments will show several variants of mirror prox (denoted MP in plots) and the primal-dual algorithm (denoted PD in plots). We also show experiments for a linesearch variant of PD (denoted PD ls in plots), which is like PD except that a linesearch is performed to find the stepsize (Malitsky and Pock, 2018), and we do not average the extrapolated iterates as Malitsky and Pock (2018) suggest. For variants using the Euclidean distance function we will add "l2" to the name of the algorithm, and for variants using the entropy distance (when applicable) we add "entropy" to the name. Additionally we try four averaging schemes: "last" uses the most recent iterate, while uniform, linear, and quadratic denote three variants of ergodic averaging. We also tried cubic averaging but do not include those experiments here; it performed similarly to quadratic averaging, sometimes better and sometimes worse.
4.1 Matrix games
We start with the simplest case, computing a Nash equilibrium of a zero-sum matrix game. This can be done by solving the saddle-point problem
$$\min_{x \in \Delta^n} \max_{y \in \Delta^m} \langle Ax, y \rangle, \tag{10}$$
where $A$ is the payoff matrix for the $y$ player. Relating (10) to (1) we see that the functions $g$ and $h^*$ simply play the role of indicator functions for the two simplexes, $f = 0$, and $K = A$. As mentioned previously, the entropy distance is known to be 1-strongly convex wrt. the $\ell_1$ norm over the simplex, and this leads to a convergence rate with logarithmic dependence on the dimensions $n$ and $m$. If we use the Euclidean distance then we get a square-root dependence. Nonetheless, we are about to show that when using increasing averaging the Euclidean distance performs much better than the entropy distance. For each game we compute the exact Lipschitz constant and use stepsizes according to the theory of each algorithm. (Our results suggest that hand-tuning stepsizes is not necessary. However, we use payoff distributions centered at zero; if one centers the payoffs at a positive number then the theoretical stepsize seems to be too conservative, and hand-tuning is necessary in order to achieve comparable performance.)
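To make the entropy setup concrete, the following Python sketch runs mirror prox with multiplicative (entropy-prox) updates on a small matrix game and applies quadratic averaging. The stepsize heuristic and function names are our own simplifications, not the exact experimental configuration.

```python
import numpy as np

def mirror_prox_game(A, T=500, tau=None, w=lambda t: t**2):
    # Mirror prox on min_x max_y <Ax, y> over two probability simplexes,
    # using the entropy distance (multiplicative updates) and an increasing
    # averaging scheme w(t); w(t) = t**2 is quadratic averaging.
    m, n = A.shape
    if tau is None:
        tau = 1.0 / np.max(np.abs(A))  # crude stepsize heuristic (assumption)
    x, y = np.full(n, 1/n), np.full(m, 1/m)
    xbar, ybar, denom = np.zeros(n), np.zeros(m), 0.0
    for t in range(1, T + 1):
        # extrapolation step: entropy prox at (x, y) with the gradient at (x, y)
        xm = x * np.exp(-tau * (A.T @ y)); xm /= xm.sum()
        ym = y * np.exp(tau * (A @ x));    ym /= ym.sum()
        # main step: entropy prox at (x, y) with the gradient at the midpoint
        x = x * np.exp(-tau * (A.T @ ym)); x /= x.sum()
        y = y * np.exp(tau * (A @ xm));    y /= y.sum()
        # running increasing-weighted average of the iterates
        denom += w(t) * tau
        xbar += (w(t) * tau / denom) * (x - xbar)
        ybar += (w(t) * tau / denom) * (y - ybar)
    return xbar, ybar
```

For a $2 \times 2$ game with a mixed equilibrium the averaged strategies drive the saddle-point residual $\max_i (A\bar x)_i - \min_j (A^\top \bar y)_j$ toward zero.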
We run simulations on three classes of random matrix games: two sizes of normally-distributed payoff matrices, and uniformly-distributed payoff matrices. For each setting we sample 50 games at random, and run each algorithm for 2000 iterations. For all games we show the relative regret $\xi(x, y) / \xi_{\mathrm{low}}$, where $\xi_{\mathrm{low}}$ is the lowest regret seen across all iterates and all algorithms. The plotted values are averaged over the 50 random instances per setting. We also report standard deviations, but they are so small that they are hidden by the plot points in most cases.
Figure 1 shows the results in four different comparisons. First, the upper-left plot shows performance for our variants of PD. When using the entropy distance, linear averaging performs worse than uniform averaging. In contrast, for the Euclidean distance both linear and quadratic averaging perform much better than uniform averaging. We also see that they perform even better than the last iterate on all three game classes. Secondly, the upper-right plot shows the performance of PD ls with the four averaging schemes. Again the last iterate performs better than uniform averaging as expected, and again quadratic and linear averaging perform significantly better. Thirdly, on the lower-left we plot the performance of MP with entropy and Euclidean distances, as well as our averaging schemes. The relative performances closely mirror those for PD. Finally, on the lower-right we plot the performance of two well-known algorithms for solving games in practice: regret matching (Hart and Mas-Colell, 2000) and CFR+ (Tammelin et al., 2015) (specialized to matrix games; see Farina et al. (2019) for a generalization of CFR+ that is more similar to the setting in this paper), along with the quadratic-averaging variants of PD, MP, and PD ls (the plot for PD ls is hard to see because it is near-identical to that of PD). The FOMs all have performance that exceeds that of regret matching across all three games. In two of the three game classes our algorithms perform significantly better than CFR+.
4.2 Total variation-L1 minimization
The TV-$\ell_1$ model is a convex optimization formulation of an imaging problem, often used for reconstruction when an image has been subjected to salt-and-pepper noise. We follow the presentation and notation of Chambolle and Pock (2011). In this model we are given some corrupted image $g \in X$, where $X$ is the image domain. Images in $X$ are represented by a grid in $\mathbb{R}^{M \times N}$, with each entry representing the pixel value at that particular location, and $X$ is equipped with the standard inner product. The model minimizes
$$\min_{u \in X}\; \|Du\|_1 + \lambda \|u - g\|_1, \tag{11}$$
where $D$ is the discrete finite-differences gradient operator, a sparse matrix whose rows are all zeros except for the two entries forming each forward difference. The inner product in the dual space is defined analogously to the one on $X$. We let $D^*$ be the adjoint of $D$, defined by $\langle Du, p \rangle = \langle u, D^* p \rangle$. The Lipschitz constant can be bounded as $L^2 = \|D\|^2 \le 8$ (Chambolle, 2004). The saddle-point variant of the problem looks as follows
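To make the operator concrete, here is a small Python sketch of a forward-difference gradient and its (negative) adjoint, together with a power-iteration check of the bound $\|D\|^2 \le 8$. The boundary handling and array layout are our own choices for illustration.

```python
import numpy as np

def grad_op(u):
    # Forward-difference image gradient, zero at the far boundary.
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:-1, :] = u[1:, :] - u[:-1, :]
    gy[:, :-1] = u[:, 1:] - u[:, :-1]
    return gx, gy

def div_op(gx, gy):
    # Discrete divergence, satisfying <grad u, p> = <u, -div p>,
    # so -div is the adjoint D* of the gradient operator.
    d = np.zeros_like(gx)
    d[:-1, :] += gx[:-1, :]; d[1:, :] -= gx[:-1, :]
    d[:, :-1] += gy[:, :-1]; d[:, 1:] -= gy[:, :-1]
    return d

# Estimate ||D||^2 by power iteration on D* D = -div o grad;
# the theory gives ||D||^2 <= 8 (Chambolle, 2004).
rng = np.random.default_rng(0)
u = rng.standard_normal((32, 32))
for _ in range(200):
    gx, gy = grad_op(u)
    u = -div_op(gx, gy)
    u /= np.linalg.norm(u)
gx, gy = grad_op(u)
op_norm_sq = np.sum(gx**2 + gy**2) / np.sum(u**2)  # Rayleigh quotient
```

On a $32 \times 32$ grid the estimate comes out just below 8, consistent with the stated bound.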
$\lambda$ is a hyperparameter, and $\delta_P$ is the indicator function for pointwise norm balls at each pixel, that is, $\|p_{i,j}\| \le 1$ for all $i, j$. To solve this model we use the Euclidean distance as our distance metric, and in terms of (1) we let the smooth term be zero, the primal nonsmooth term be $\lambda \|u - g\|_1$, and $h^*(p) = \delta_P(p)$. With this setup the proximal mappings can be written as
where $\mathrm{shrink}$ is the pointwise shrinkage (soft-thresholding) operator
$$\mathrm{shrink}_{\kappa}(u)_i = \mathrm{sign}(u_i) \max(|u_i| - \kappa, 0).$$
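The shrinkage operator, i.e. the prox of a scaled $\ell_1$ norm, is one line of vectorized code:

```python
import numpy as np

def shrink(u, kappa):
    # Pointwise soft-thresholding, the prox of kappa * ||.||_1:
    # shrink(u)_i = sign(u_i) * max(|u_i| - kappa, 0).
    return np.sign(u) * np.maximum(np.abs(u) - kappa, 0.0)
```

Entries smaller in magnitude than the threshold are set to zero, and larger entries are pulled toward zero by exactly the threshold, which is what makes the $\ell_1$ data term robust to salt-and-pepper outliers.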
Following Chambolle and Pock (2011) we set $\lambda$ as in their experiments.
Figure 2 shows two original pictures, along with their corrupted variants, and the reconstructed solutions based on the TV- model.
To generate a ground-truth solution, PD was run for a large number of iterations with both last-iterate and quadratic averaging, and the best solution across all the iterations of the two variants was used as a ground truth. The convergence-rate plots (Figure 3) show the relative error $(\phi^t - \phi^*) / \phi^*$, where $\phi^t$ is the current value of the primal objective (11), and $\phi^*$ is the value at the ground-truth solution.
Figure 3 shows the performance of the different algorithm and averaging variants on the two images: the left plot shows results for PD and the right plot shows results for PD ls. As expected, uniform averaging performs much worse than the last iterate for both algorithms. For PD, linear and quadratic averaging achieve performance similar to that of the last iterate, with linear performing slightly worse and quadratic averaging performing about the same. For PD ls, linear and quadratic averaging lead to better performance than using the last iterate, although the last iterate eventually overtakes them on the "cameraman" image.
4.3 Competitive equilibrium in Fisher markets
In the third set of experiments we compute competitive equilibria in Fisher markets. A Fisher market consists of a set of $n$ buyers and $m$ goods. Each buyer $i$ has a valuation vector $v_i$ describing their value for each good, and an allocation $x_i$ of goods to buyer $i$ gives utility $\langle v_i, x_i \rangle$. Each buyer has a budget $B_i$, and each good $j$ has supply $s_j$. A competitive equilibrium is a set of prices $p$ for the goods, and an allocation $x$, such that each buyer's allocation maximizes their utility subject to their budget constraint, and demand equals supply for each item. The famous Eisenberg-Gale convex program (Eisenberg and Gale, 1959) shows that a competitive equilibrium can be computed via convex programming. Here we use the saddle-point formulation of that program:
This formulation was previously considered by Kroer et al. (2019), where first-order methods are used in order to compute competitive equilibria. Kroer et al. (2019) note that while the gradients of the above formulation are not Lipschitz, each buyer is guaranteed at least their proportional allocation under any feasible set of prices, and so we can add lower bounds on buyer utilities according to each buyer's proportional share. We add that here, and thus when relating (13) to (1), the utility terms play the role of the smooth part, the budget and supply constraints enter through the simple nonsmooth terms, and $K$ is the linear operator coupling allocations and prices. The Lipschitz constant can be bounded as shown by Kroer et al. (2019).
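The proportional-share lower bound described above is straightforward to compute. A Python sketch (our own helper, taking the proportional allocation to be each buyer's budget share of every good's supply):

```python
import numpy as np

def proportional_share_utilities(V, budgets, supplies):
    # Each buyer's proportional allocation gives them a B_i / sum(B) fraction
    # of every good's supply; the resulting utility is a lower bound on what
    # the buyer can guarantee, used to bound utilities away from zero.
    # V is the (buyers x goods) valuation matrix.
    frac = budgets / budgets.sum()
    return frac * (V @ supplies)
```

These per-buyer utility floors are what restore Lipschitz gradients in the modified formulation, since the logarithmic utility terms are then evaluated away from zero.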
We generate Fisher market instances according to three settings: 1) 60 buyers and 20 goods, with values generated by a truncated normal distribution with mean 5, truncated at 0 and 10. 2) 20 buyers and 20 goods, with the same truncated normal values. 3) 20 buyers and 20 goods with uniformly-distributed values. We generate a total of 50 instances for each setting.
The results for this setting are shown in Figure 4. Again we find that linear and quadratic averaging perform much better than uniform averaging, though the last iterate performs even better after a few hundred iterations. For this problem we ran some preliminary experiments suggesting that using higher-order polynomial weights in the averaging scheme can lead to performance similar to that of the last iterate.
- Bowling et al.  Michael Bowling, Neil Burch, Michael Johanson, and Oskari Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, 2015.
- Brown and Sandholm  Noam Brown and Tuomas Sandholm. Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
- Brown and Sandholm  Noam Brown and Tuomas Sandholm. Solving imperfect-information games via discounted regret minimization. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
- Brown et al.  Noam Brown, Christian Kroer, and Tuomas Sandholm. Dynamic thresholding and pruning for regret minimization. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
- Bubeck  Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3-4):231–357, 2015.
- Burch  Neil Burch. Time and space: Why imperfect information games are hard. PhD thesis, 2017.
- Caragiannis et al.  Ioannis Caragiannis, David Kurokawa, Hervé Moulin, Ariel D Procaccia, Nisarg Shah, and Junxing Wang. The unreasonable fairness of maximum Nash welfare. In Proceedings of the 2016 ACM Conference on Economics and Computation, pages 305–322. ACM, 2016.
- Chambolle  Antonin Chambolle. An algorithm for total variation minimization and applications. Journal of Mathematical imaging and vision, 20(1-2):89–97, 2004.
- Chambolle and Pock  Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40(1):120–145, 2011.
- Chambolle and Pock  Antonin Chambolle and Thomas Pock. On the ergodic convergence rates of a first-order primal–dual algorithm. Mathematical Programming, 159(1-2):253–287, 2016.
- Cole and Gkatzelis  Richard Cole and Vasilis Gkatzelis. Approximating the Nash social welfare with indivisible items. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, pages 371–380. ACM, 2015.
- Conitzer et al. [2018a] Vincent Conitzer, Christian Kroer, Debmalya Panigrahi, Okke Schrijvers, Eric Sodomka, Nicolas E Stier-Moses, and Chris Wilkens. Pacing equilibrium in first-price auction markets. arXiv preprint arXiv:1811.07166, 2018a.
- Conitzer et al. [2018b] Vincent Conitzer, Christian Kroer, Eric Sodomka, and Nicolás E. Stier-Moses. Multiplicative pacing equilibria in auction markets. In Web and Internet Economics - 14th International Conference, WINE, 2018b.
- Daskalakis and Panageas  Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. 10th Innovations in Theoretical Computer Science, 2019.
- Daskalakis et al.  Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. In International Conference on Learning Representations (ICLR 2018), 2018.
- Eisenberg and Gale  Edmund Eisenberg and David Gale. Consensus of subjective probabilities: The pari-mutuel method. The Annals of Mathematical Statistics, 30(1):165–168, 1959.
- Farina et al.  Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Online convex optimization for sequential decision processes and extensive-form games. In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
- Hart and Mas-Colell  Sergiu Hart and Andreu Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
- Juditsky and Nemirovski [2011a] Anatoli Juditsky and Arkadi Nemirovski. First order methods for nonsmooth convex large-scale optimization, i: general purpose methods. Optimization for Machine Learning, pages 121–148, 2011a.
- Juditsky and Nemirovski [2011b] Anatoli Juditsky and Arkadi Nemirovski. First order methods for nonsmooth convex large-scale optimization, ii: utilizing problems structure. Optimization for Machine Learning, pages 149–183, 2011b.
- Kroer et al.  Christian Kroer, Kevin Waugh, Fatma Kılınç-Karzan, and Tuomas Sandholm. Faster algorithms for extensive-form game solving via improved smoothing functions. Mathematical Programming, pages 1–33.
- Kroer et al.  Christian Kroer, Gabriele Farina, and Tuomas Sandholm. Solving large sequential games with the excessive gap technique. In Advances in Neural Information Processing Systems, pages 872–882, 2018.
- Kroer et al.  Christian Kroer, Alexander Peysakhovich, Eric Sodomka, and Nicolas E Stier-Moses. Computing large market equilibria using abstractions. arXiv preprint arXiv:1901.06230, 2019.
- Malitsky and Pock  Yura Malitsky and Thomas Pock. A first-order primal-dual algorithm with linesearch. SIAM Journal on Optimization, 28(1):411–432, 2018.
- Moravčík et al.  Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337), May 2017.
- Nemirovski  Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
- Tammelin et al.  Oskari Tammelin, Neil Burch, Michael Johanson, and Michael Bowling. Solving heads-up limit Texas Hold’em. In Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
- Tran-Dinh et al.  Quoc Tran-Dinh, Ahmet Alacaoglu, Olivier Fercoq, and Volkan Cevher. An adaptive primal-dual framework for nonsmooth convex minimization. arXiv preprint arXiv:1808.04648, 2018.
- Varian  Hal R Varian. Equity, envy, and efficiency. Journal of Economic Theory, 9(1):63–91, 1974.