    # Efficient Projection-Free Algorithms for Saddle Point Problems

The Frank-Wolfe algorithm is a classic method for constrained optimization problems. It has recently become popular in many machine learning applications because its projection-free property leads to more efficient iterations. In this paper, we study projection-free algorithms for convex-strongly-concave saddle point problems with complicated constraints. Our method combines Conditional Gradient Sliding with Mirror-Prox and requires only Õ(1/√ϵ) gradient evaluations and Õ(1/ϵ²) linear optimizations in the batch setting. We also extend our method to the stochastic setting and propose the first stochastic projection-free algorithms for saddle point problems. Experimental results demonstrate the effectiveness of our algorithms and verify our theoretical guarantees.

## 1 Introduction

In this paper, we study the following saddle point problem:

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y),$$

where the objective function $f$ is convex-concave and $L$-smooth, and $\mathcal{X}$ and $\mathcal{Y}$ are convex and compact sets. Besides this general form, we also consider the stochastic minimax problem:

$$\min_{x\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(x,y)\triangleq \mathbb{E}_{\xi}[F(x,y;\xi)], \tag{1}$$

where $\xi$ is a random variable. One popular specific setting of (1) is the finite-sum case where $\xi$ is sampled from a finite set $\{\xi_1,\dots,\xi_n\}$. Denoting $F_i(x,y)\triangleq F(x,y;\xi_i)$, we can write the objective function as

$$f(x,y)\triangleq \frac{1}{n}\sum_{i=1}^{n} F_i(x,y). \tag{2}$$

We are interested in the cases where the feasible set $\mathcal{X}$ is complicated, such that projecting onto $\mathcal{X}$ is rather expensive or even intractable. One example of such a case is the nuclear norm ball constraint, which is widely used in machine learning applications such as multiclass classification Dudik et al. (2012), matrix completion Candès and Recht (2009); Jaggi and Sulovskỳ (2010); Lacoste-Julien and Jaggi (2013), factorization machines Lin et al. (2018), polynomial neural nets Livni et al. (2014) and two-player games whose strategy space contains a large number of constraints Ahmadinejad et al. (2019).

The Frank-Wolfe (FW) algorithm Frank and Wolfe (1956) (a.k.a. the conditional gradient method) was initially proposed for constrained convex optimization. It has recently become popular in the machine learning community because of its projection-free property Jaggi (2013). The Frank-Wolfe algorithm calls a linear optimization (LO) oracle at each iteration, which is usually much faster than projection for complicated feasible sets. Recently, FW-style algorithms for convex and nonconvex minimization problems have been widely studied Lacoste-Julien and Jaggi (2013); Lan and Zhou (2016); Hazan and Luo (2016); Hazan and Kale (2012); Qu et al. (2018); Reddi et al. (2016); Yurtsever et al. (2019); Shen et al. (2019); Xie et al. (2020); Zhang et al. (2020); Hassani et al. (2019). However, the only known projection-free algorithms for minimax optimization are for very special cases (e.g. the saddle point belongs to the interior of the feasible set Gidel et al. (2017)).
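As a concrete illustration of the projection-free template (our sketch, not taken from the paper), here is a minimal Frank-Wolfe loop in Python where the LO oracle over the $\ell_1$ ball simply returns a signed vertex; the feasible set, objective, and all names are illustrative assumptions:

```python
import numpy as np

def lo_l1_ball(g, radius=1.0):
    """LO oracle for the l1 ball: argmin_{||u||_1 <= r} <g, u>.
    The minimizer is a signed vertex, so no projection is needed."""
    i = np.argmax(np.abs(g))
    u = np.zeros_like(g)
    u[i] = -radius * np.sign(g[i])
    return u

def frank_wolfe(grad, lo, x0, num_iters=200):
    """Classic Frank-Wolfe with the standard step size 2/(k+2)."""
    x = x0.copy()
    for k in range(num_iters):
        s = lo(grad(x))                   # one linear optimization per iteration
        gamma = 2.0 / (k + 2)             # diminishing step size
        x = (1 - gamma) * x + gamma * s   # convex combination stays feasible
    return x

# minimize 0.5 * ||x - c||^2 over the unit l1 ball; c lies outside the ball
c = np.array([0.8, 0.6, 0.0])
x_star = frank_wolfe(lambda x: x - c, lo_l1_ball, np.zeros(3))
```

Each iterate is a convex combination of vertices, so feasibility holds by construction; this is the property that makes LO-based methods cheap on sets like the nuclear norm ball.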

In this paper, we propose a projection-free algorithm, which we refer to as Mirror-Prox Conditional Gradient Sliding (MPCGS), for convex-strongly-concave saddle point problems. Our method leverages ideas from projection-type methods Nemirovski (2004); Thekumparampil et al. (2019) based on proximal point iterations. By combining the idea of Mirror-Prox Thekumparampil et al. (2019) with conditional gradient sliding (CGS) Lan and Zhou (2016), MPCGS only requires at most $\tilde{O}(1/\sqrt{\epsilon})$ exact gradient evaluations and $\tilde{O}(1/\epsilon^2)$ linear optimizations to guarantee an $\epsilon$ suboptimality error. We also extend our framework to the stochastic setting and propose Mirror-Prox Stochastic Conditional Gradient Sliding (MPSCGS), with corresponding bounds on the number of stochastic gradient computations and LO calls. To the best of our knowledge, MPSCGS is the first stochastic projection-free algorithm for convex-strongly-concave saddle point problems. We also conduct experiments on several real-world data sets for the robust optimization problem to validate our theoretical analysis. The empirical results show that the proposed methods outperform previous projection-free and projection-based methods when the feasible set is complicated.

#### Related Works

Most existing works on constrained minimax optimization solve the problem with projections; we only review some representative literature. For the batch setting, the classical extragradient method Korpelevich (1976) considered the more general variational inequality (VI) problem. Nemirovski (2004) proposed the Mirror-Prox method, which achieves an $O(1/T)$ convergence rate for solving VIs. Recently, Thekumparampil et al. (2019) improved the convergence rate to $\tilde{O}(1/T^2)$ when the objective function is strongly-convex-concave. For the stochastic setting, Chavdarova et al. (2019); Palaniappan and Bach (2016) adopted variance reduction methods to obtain linear convergence rates for strongly-convex-strongly-concave objective functions.

Projection-free methods for saddle point problems are scarce. Hammond (1984) showed that the FW algorithm with an appropriately chosen step size converges for VIs when the feasible set is strongly convex. Recently, Gidel et al. (2017) proposed the SP-FW algorithm for strongly-convex-strongly-concave saddle point problems, which achieves a linear convergence rate under the conditions that the saddle point belongs to the interior of the feasible set and the condition number is small enough. They also provided an away-step Frank-Wolfe variant Lacoste-Julien and Jaggi (2013), called SP-AFW, to address polytope constraints. However, SP-AFW has to store history information and perform extra operations in each iteration. Roy et al. (2019) extended SP-FW to the zeroth-order setting and studied a gradient-free projection-free algorithm with theoretical guarantees under the same assumptions on the objective function as SP-FW. He and Harchaoui (2015) proposed a projection-free algorithm for non-smooth composite saddle point problems. Their method requires calling a composite LO oracle, which is not suitable for the general case.

Some recent works focus on hybrid algorithms which combine projection-based and projection-free methods. For example, Juditsky and Nemirovski (2016); Cox et al. (2017) transformed a VI with complicated constraints into a “dual” VI which is projection-friendly. Lan (2013); Nouiehed et al. (2019) solved the saddle point problem by running projection-free methods on $\mathcal{X}$ and performing projections on $\mathcal{Y}$. In contrast, our methods are purely projection-free.

#### Paper Organization

In Section 2, we provide preliminaries and relevant backgrounds. We present our results for the batch setting and stochastic setting in Section 3 and Section 4 respectively. We give empirical results for our algorithm in Section 5, followed by a conclusion in Section 6.

## 2 Preliminaries and Backgrounds

In this section, we first present some notation and assumptions used in this paper. Then we introduce the oracle models which are necessary to our methods, followed by an example application. After that, we provide some properties of saddle point problems. Finally, we introduce CGS and its variants, which are used in our algorithms.

### 2.1 Notation and Assumptions

Given a differentiable function $f(x,y)$, we use $\nabla_x f(x,y)$ (or $\nabla_y f(x,y)$) to denote the partial gradient of $f$ with respect to $x$ (or $y$) and define $\nabla f(x,y)=(\nabla_x f(x,y),\nabla_y f(x,y))$. We use the notation $\tilde{O}(\cdot)$ to hide logarithmic factors in the complexity, and $\|\cdot\|$ denotes the Euclidean norm.

We impose the following assumptions for our method.

###### Assumption 1.

We assume the saddle point problem (1) satisfies:

• $f$ is $L$-smooth, i.e., for every $(x_1,y_1),(x_2,y_2)\in\mathcal{X}\times\mathcal{Y}$, it holds that

$$\|\nabla f(x_1,y_1)-\nabla f(x_2,y_2)\|^2 \le L^2\big(\|x_1-x_2\|^2+\|y_1-y_2\|^2\big).$$

• $f(\cdot,y)$ is convex for every $y\in\mathcal{Y}$, i.e., for any $x_1,x_2\in\mathcal{X}$ and $y\in\mathcal{Y}$, it holds that

$$f(x_1,y)-f(x_2,y)\ge \nabla_x f(x_2,y)^\top(x_1-x_2).$$

• $f(x,\cdot)$ is $\mu$-strongly concave for every $x\in\mathcal{X}$, i.e., for any $x\in\mathcal{X}$ and $y_1,y_2\in\mathcal{Y}$, it holds that

$$f(x,y_1)-f(x,y_2)\le \nabla_y f(x,y_2)^\top(y_1-y_2)-\frac{\mu}{2}\|y_1-y_2\|^2.$$

• $\mathcal{X}$ and $\mathcal{Y}$ are convex compact sets with diameters $D_\mathcal{X}$ and $D_\mathcal{Y}$ respectively.

We use $\kappa \triangleq L/\mu$ to denote the condition number.

###### Assumption 2.

In the stochastic setting, we make the following additional assumptions:

• $\mathbb{E}_\xi[\nabla F(x,y;\xi)] = \nabla f(x,y)$ for every $x\in\mathcal{X}$ and $y\in\mathcal{Y}$.

• $\mathbb{E}_\xi\|\nabla F(x,y;\xi)-\nabla f(x,y)\|^2 \le \sigma^2$ for every $x\in\mathcal{X}$, $y\in\mathcal{Y}$ and a constant $\sigma>0$.

• $F$ is $L$-average smooth, i.e., for every $(x_1,y_1)$ and $(x_2,y_2)$ in $\mathcal{X}\times\mathcal{Y}$, it holds that

$$\mathbb{E}\|\nabla F(x_1,y_1;\xi)-\nabla F(x_2,y_2;\xi)\|^2 \le L^2\big(\|x_1-x_2\|^2+\|y_1-y_2\|^2\big).$$

In the convex-concave setting, for any $(\hat{x},\hat{y})\in\mathcal{X}\times\mathcal{Y}$, we have the following inequality:

$$\min_{x\in\mathcal{X}} f(x,\hat{y}) \le f(\hat{x},\hat{y}) \le \max_{y\in\mathcal{Y}} f(\hat{x},y).$$

Furthermore, Problem (1) has at least one saddle point solution $(x^*,y^*)$ which satisfies:

$$\min_{x\in\mathcal{X}} f(x,y^*) = f(x^*,y^*) = \max_{y\in\mathcal{Y}} f(x^*,y).$$

We measure the suboptimality error by the primal-dual gap $\max_{y\in\mathcal{Y}} f(\hat{x},y) - \min_{x\in\mathcal{X}} f(x,\hat{y})$, which is widely used for saddle point problems. We further define the $\epsilon$-saddle point as follows:

###### Definition 1.

A point $(\hat{x},\hat{y})\in\mathcal{X}\times\mathcal{Y}$ is an $\epsilon$-saddle point of a convex-concave function $f$ if:

$$\max_{y\in\mathcal{Y}} f(\hat{x},y) - \min_{x\in\mathcal{X}} f(x,\hat{y}) \le \epsilon. \tag{3}$$

Notice that Gidel et al. (2017) adopted a different criterion defined with respect to the saddle point $(x^*,y^*)$. It is obvious that the left-hand side of (3) is an upper bound of their criterion.
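To make Definition 1 concrete, consider the toy bilinear game $f(x,y)=xy$ over $\mathcal{X}=\mathcal{Y}=[-1,1]$ (our illustrative example, not from the paper); there the primal-dual gap has a closed form:

```python
def primal_dual_gap(x_hat, y_hat):
    """Primal-dual gap for the toy game f(x, y) = x * y on X = Y = [-1, 1]:
    max_y f(x_hat, y) = |x_hat| and min_x f(x, y_hat) = -|y_hat|,
    so the gap of Definition 1 equals |x_hat| + |y_hat|.
    It vanishes only at the saddle point (0, 0)."""
    return abs(x_hat) + abs(y_hat)

gap = primal_dual_gap(0.3, 0.1)  # (0.3, 0.1) is a 0.4-saddle point
```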

### 2.2 Oracle models

In this paper, we consider the following oracles for different settings:

• First Order Oracle (FO): Given $(x,y)$, the FO returns $\nabla_x f(x,y)$ and $\nabla_y f(x,y)$.

• Stochastic First Order Oracle (SFO): For the function $f(x,y)=\mathbb{E}_\xi[F(x,y;\xi)]$, the SFO returns $\nabla_x F(x,y;\xi)$ and $\nabla_y F(x,y;\xi)$, where $\xi$ is a sample drawn from its distribution.

• Incremental First Order Oracle (IFO): In the finite-sum setting, the IFO takes an index $i\in\{1,\dots,n\}$ and returns $\nabla_x F_i(x,y)$ and $\nabla_y F_i(x,y)$.

• Linear Optimization Oracle (LO): Given a vector $g$ and a convex and compact set $\Omega$, the LO returns a solution of the problem $\min_{u\in\Omega}\langle g,u\rangle$.
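These oracle models can be written down as plain callables; the following is an illustrative Python rendering of ours (the toy objective and all names are assumptions, not an API from the paper):

```python
from typing import Callable, Tuple
import numpy as np

Point = np.ndarray
# FO: (x, y) -> (grad_x f, grad_y f)
FO = Callable[[Point, Point], Tuple[Point, Point]]
# LO: direction g -> argmin_{u in Omega} <g, u>
LO = Callable[[Point], Point]

def make_oracles(c: np.ndarray) -> Tuple[FO, LO]:
    """Oracles for the toy bilinear objective f(x, y) = <x - c, y>."""
    def fo(x: Point, y: Point) -> Tuple[Point, Point]:
        return y, x - c               # partial gradients of f
    def lo_box(g: Point) -> Point:    # LO oracle for the box [-1, 1]^d
        return -np.sign(g)
    return fo, lo_box

fo, lo = make_oracles(np.array([0.5, -0.5]))
gx, gy = fo(np.zeros(2), np.ones(2))
```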

### 2.3 Example Application: Robust Optimization for Multiclass Classification

We consider the multiclass classification problem with $m$ classes. Suppose the training set is $\{(a_i,b_i)\}_{i=1}^n$, where $a_i\in\mathbb{R}^d$ is the feature vector of the $i$-th sample and $b_i\in\{1,\dots,m\}$ is the corresponding label. The goal is to find an accurate linear predictor with parameter $X=[x_1,\dots,x_m]^\top\in\mathbb{R}^{m\times d}$ that predicts $\arg\max_j x_j^\top a$ for any input feature vector $a$.

The robust optimization model Namkoong and Duchi (2017) with the multivariate logistic loss Dudik et al. (2012); Zhang et al. (2012) under a nuclear norm ball constraint can be formulated as the following convex-concave minimax optimization:

$$\min_{X\in\mathcal{X}}\max_{y\in\mathcal{Y}} f(X,y)\triangleq \frac{1}{n}\sum_{i=1}^{n} y_i\log\Bigg(1+\sum_{j\neq b_i}\exp\big(x_j^\top a_i - x_{b_i}^\top a_i\big)\Bigg) - \frac{\lambda}{2}\|ny-\mathbf{1}_n\|_2^2, \tag{4}$$

where $\mathcal{X}$ is the nuclear norm ball $\{X\in\mathbb{R}^{m\times d}:\|X\|_*\le r\}$ and $\mathcal{Y}$ is the probability simplex $\Delta_n$. It is obvious that the objective function (4) is convex-strongly-concave, which satisfies our assumptions. In this case, projecting onto $\mathcal{X}$ requires performing a full SVD, which takes $O(md\min(m,d))$ time. On the other hand, the linear optimization on $\mathcal{X}$ only needs to find the top singular vectors, whose cost is linear in the number of non-zero entries of the gradient matrix.
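The LO oracle on the nuclear norm ball is just a rank-1 computation. A minimal sketch (ours, not the paper's implementation; we call `np.linalg.svd` for clarity, whereas in practice a few power/Lanczos iterations on the top singular pair suffice):

```python
import numpy as np

def lo_nuclear_ball(G, radius=1.0):
    """LO oracle for {X : ||X||_* <= r}: argmin_X <G, X> = -r * u1 v1^T,
    where (u1, v1) is the top singular pair of the gradient matrix G.
    Only a rank-1 object is needed, versus a full SVD for projection."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return -radius * np.outer(U[:, 0], Vt[0, :])

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 4))
S = lo_nuclear_ball(G, radius=2.0)  # rank-1 vertex of the radius-2 ball
```

The output attains the optimal value $\langle G,S\rangle=-r\,\sigma_1(G)$, which is what makes Frank-Wolfe steps on this set cheap.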

### 2.4 Conditional Gradient Sliding

Conditional Gradient Sliding (CGS) Lan and Zhou (2016) is a projection-free algorithm for convex minimization. It leverages Nesterov's accelerated gradient descent Nesterov (1983) to speed up Frank-Wolfe algorithms. For a strongly-convex objective function, CGS only requires $O(\sqrt{\kappa}\log(1/\epsilon))$ FO calls and $O(1/\epsilon)$ LO calls to find an $\epsilon$-suboptimal solution. We present the details of CGS in Algorithm 1. Notice that the $k$-th iteration of CGS considers the following sub-problem

$$\min_{u\in\Omega}\ \langle\nabla h(w_k), u\rangle + \frac{\beta_k}{2}\|u-u_{k-1}\|^2,$$

which can be efficiently solved by the conditional gradient method in Algorithm 2. Lan and Zhou (2016) also extended CGS to the stochastic setting and proposed stochastic conditional gradient sliding (SCGS). Later, Hazan and Luo (2016) proposed STOchastic variance-Reduced Conditional gradient sliding (STORC) for the finite-sum setting, whose IFO and LO complexities improve upon SCGS.
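The quadratic sub-problem above can be solved by a plain conditional gradient loop with exact line search, stopping once the FW gap certifies the target accuracy. A simplified sketch of this Algorithm 2-style routine (our version, with an $\ell_1$-ball LO oracle and illustrative names):

```python
import numpy as np

def cndg(g, w, beta, zeta, lo, max_iters=1000):
    """Conditional gradient on min_u <g, u> + (beta/2) ||u - w||^2,
    run until the FW gap is at most zeta. `lo` is the LO oracle."""
    u = lo(g)  # a feasible vertex to start from
    for _ in range(max_iters):
        grad = g + beta * (u - w)      # gradient of the quadratic at u
        s = lo(grad)
        gap = grad @ (u - s)           # FW gap upper-bounds suboptimality
        if gap <= zeta:
            break
        d = u - s
        gamma = min(1.0, gap / (beta * (d @ d)))  # exact line search
        u = u - gamma * d
    return u

def lo_l1(g, radius=1.0):  # l1-ball LO oracle for illustration
    i = np.argmax(np.abs(g))
    u = np.zeros_like(g)
    u[i] = -radius * np.sign(g[i])
    return u

u = cndg(g=np.array([1.0, -2.0, 0.5]), w=np.zeros(3),
         beta=1.0, zeta=1e-6, lo=lo_l1)
```

For this instance the minimizer over the unit $\ell_1$ ball is the vertex $(0,1,0)$, and the FW gap reaching zero certifies it.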

## 3 Mirror-Prox Conditional Gradient Sliding

For the batch setting of (1), we propose Mirror-Prox Conditional Gradient Sliding (MPCGS), which is presented in Algorithm 3. Our MPCGS method combines ideas of the Mirror-Prox algorithm Thekumparampil et al. (2019) and the CGS method Lan and Zhou (2016). The key idea of MPCGS is to solve a proximal problem in each iteration, which makes $x_k$ and $y_k$ satisfy the following conditions:

• $y_k$ is an $\epsilon_k$-approximate maximizer of $f(x_k,\cdot)$, i.e., $f(x_k,y_k)\ge \max_{y\in\mathcal{Y}} f(x_k,y)-\epsilon_k$;

• The update of $x_k$ corresponds to a CGS updating step (Algorithm 1), i.e.,

$$v_k=\mathrm{CndG}\big(\nabla_x f(z_k,y_k),\, v_{k-1},\, \alpha_k,\, \zeta_k,\, \mathcal{X}\big),\qquad x_k=(1-\gamma_k)x_{k-1}+\gamma_k v_k.$$

The procedure of solving the proximal problem is presented in Algorithm 4. In the Prox-step procedure, we iteratively compute an approximate maximizer of $f(x_{r-1},\cdot)$ and then update $x_r$ and $v_r$ accordingly. Since $f(x,\cdot)$ is smooth and strongly concave for all $x\in\mathcal{X}$, the number of FO calls performed by the CGS method for finding an approximate maximizer grows only logarithmically in the target accuracy, while the number of LO calls grows polynomially.

On the other hand, in Algorithm 4 the CndG procedure computes an approximate solution of the following problem:

$$\min_{u\in\mathcal{X}}\left\{\nabla_x f(z, y^*(x_{r-1}))^\top u + \frac{\alpha}{2}\|u-v\|^2\right\}.$$

Thus, the idealized update of $x_r$ in Algorithm 4 is

$$x_r=(1-\gamma)x+\gamma\cdot\operatorname*{argmin}_{u\in\mathcal{X}}\left\{\nabla_x f(z, y^*(x_{r-1}))^\top u + \frac{\alpha}{2}\|u-v\|^2\right\}.$$

Since this idealized update is a contraction mapping with a unique fixed point (see the proof of Lemma 2 in the Appendix), the Prox-step procedure only requires logarithmically many iterations if the inner tolerances are small enough.
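The contraction argument can be illustrated with a generic fixed-point iteration (purely illustrative; the map `psi`, its contraction factor, and the stopping rule are our assumptions): geometric error decay is exactly why only $O(\log(1/\epsilon))$ iterations are needed.

```python
def fixed_point_iterate(psi, x0, tol, rho=0.5):
    """Iterate x_r = psi(x_{r-1}) for a rho-contraction psi.
    Stop when |x_r - x_{r-1}| <= tol * (1 - rho), which guarantees
    |x_r - fixed point| <= rho * tol by the contraction property."""
    x, iters = x0, 0
    while True:
        x_next = psi(x)
        iters += 1
        if abs(x_next - x) <= tol * (1 - rho):
            return x_next, iters
        x = x_next

# a 0.5-contraction on R with fixed point 2.0 (hypothetical psi)
psi = lambda x: 0.5 * x + 1.0
x_fix, iters = fixed_point_iterate(psi, x0=0.0, tol=1e-8)
```

The iteration count grows like $\log(1/\mathrm{tol})/\log(1/\rho)$, i.e. roughly 30 steps for an 8-digit tolerance with $\rho=1/2$.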

The following theorem shows the convergence rate of solving problem (1) by Algorithm 3.

###### Theorem 1.

Suppose the objective function satisfies Assumption 1. Setting

$$\gamma_k=\frac{3}{k+2},\quad \alpha_k=\frac{6\kappa L}{k+1},\quad \zeta_k=\frac{LD_\mathcal{X}^2}{384k(k+1)},\quad \epsilon_k=\frac{\kappa LD_\mathcal{X}^2}{k(k+1)(k+2)}$$

for Algorithm 3, we have

$$\max_{y\in\mathcal{Y}}f(x_k,y)-\min_{x\in\mathcal{X}}f(x,\bar{y}_k)\le \frac{11\kappa LD_\mathcal{X}^2}{(k+1)(k+2)}.$$

Theorem 1 implies the upper bound complexities of the algorithm as follows.

###### Corollary 1.

Under the same assumptions as Theorem 1, Algorithm 3 requires $\tilde{O}(1/\sqrt{\epsilon})$ FO complexity and $\tilde{O}(1/\epsilon^2)$ LO complexity to achieve an $\epsilon$-saddle point.

## 4 Mirror-Prox Stochastic Conditional Gradient Sliding

In this section, we extend MPCGS to the stochastic setting (1). Recall that in the batch setting we adopt CGS to find an approximate maximizer in the proximal problem, which only requires a logarithmic number of iterations. In the stochastic case, we would like to use the STORC Hazan and Luo (2016) algorithm instead. Since the original STORC can only be applied to the finite-sum situation, we first develop an inexact variant of STORC which does not depend on the exact gradient. We then leverage the inexact STORC algorithm to establish our projection-free algorithm for stochastic saddle point problems.

### 4.1 Inexact Stochastic Variance Reduced Conditional Gradient Sliding

We propose Inexact STORC (iSTORC) algorithm to solve the following stochastic convex optimization problem:

$$\min_{x\in\Omega} h(x)=\mathbb{E}_\xi[H(x;\xi)], \tag{5}$$

where $\xi$ is a random variable and the feasible set $\Omega$ is convex, compact and has diameter $D$. We assume that $h$ is $L$-smooth and $\mu$-strongly convex. We also suppose that the algorithm can access a stochastic gradient $\mathcal{G}(x;\xi)$ which satisfies:

• $\mathbb{E}_\xi[\mathcal{G}(x;\xi)]=\nabla h(x)$ and $\mathbb{E}_\xi\|\mathcal{G}(x;\xi)-\nabla h(x)\|^2\le\sigma^2$ for every $x\in\Omega$.

The idea of iSTORC is to approximate the exact gradient in STORC by averaging an appropriate number of stochastic gradient samples. The following theorem shows the convergence rate of iSTORC.

###### Theorem 2.

Running Inexact STORC (Algorithm 5) with the following parameters:

$$\lambda_k=\frac{2}{k+1},\quad \beta_k=\frac{3L}{k},\quad M=\big\lceil 4\sqrt{2\kappa}\big\rceil,\quad \eta_{t,k}=\frac{\kappa LD^2}{2^{t-2}Mk},\quad S=4800M\kappa,\quad Q_t=\left\lceil \frac{1200\cdot 2^{t-1}\sigma^2}{\sqrt{\kappa}\,L^2D^2}\right\rceil,$$

we have

$$\mathbb{E}[h(\bar{x}_t)-h(x^*)]\le \frac{LD^2}{2^{t+1}},$$

where $x^*=\operatorname*{argmin}_{x\in\Omega}h(x)$.
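The role of the growing sample sizes $Q_t$ can be sketched numerically. Under a toy Gaussian noise model of ours (not the paper's estimator), averaging $Q$ i.i.d. stochastic gradients divides the variance of the estimate by $Q$:

```python
import numpy as np

rng = np.random.default_rng(42)
true_grad = np.array([1.0, -2.0])
sigma = 3.0

def sfo(batch_size):
    """Average `batch_size` noisy gradients (hypothetical noise model):
    the variance of the mean drops by a factor of batch_size."""
    noise = rng.normal(0.0, sigma, size=(batch_size, 2))
    return true_grad + noise.mean(axis=0)

def mse(batch_size, trials=2000):
    """Empirical E||mean - grad||^2; theory predicts d * sigma^2 / Q = 18 / Q."""
    errs = [np.sum((sfo(batch_size) - true_grad) ** 2) for _ in range(trials)]
    return float(np.mean(errs))
```

Doubling $Q_t$ at each stage $t$ thus matches the geometrically shrinking accuracy target $LD^2/2^{t+1}$ of Theorem 2.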

Theorem 2 implies the following upper bound complexities of iSTORC.

###### Corollary 2.

To achieve $\bar{x}$ such that $\mathbb{E}[h(\bar{x})-h(x^*)]\le\epsilon$, iSTORC (Algorithm 5) requires the SFO and LO complexities obtained by summing the per-stage costs in Theorem 2 over $O(\log(LD^2/\epsilon))$ stages.

###### Remark 1.

If the objective function has the finite-sum form, we can choose the sample sizes $Q_t$ accordingly and obtain the same upper complexity bounds as STORC.

###### Remark 2.

Notice that iSTORC can achieve better SFO complexity than SCGS when the target accuracy is sufficiently small.

### 4.2 Mirror-Prox Stochastic Conditional Gradient Sliding

We present our Mirror-Prox Stochastic Conditional Gradient Sliding (MPSCGS) in Algorithm 6. The idea of MPSCGS is similar to that of MPCGS. The main difference is that we solve the proximal problem in MPSCGS by a stochastic proximal step, where we adopt the proposed iSTORC algorithm. Specifically, in each iteration we ensure that $x_k$ and $y_k$ satisfy the following conditions:

• $y_k$ is an $\epsilon_k$-approximate maximizer of $f(x_k,\cdot)$ in expectation, i.e.,

$$\mathbb{E}[f(x_k,y_k)]\ge \mathbb{E}\Big[\max_{y\in\mathcal{Y}}f(x_k,y)\Big]-\epsilon_k;$$

• The update of $v_k$ and $x_k$ ensures that

$$v_k=\mathrm{CndG}\big(\nabla_x f_{P_k}(z_k,y_k),\, v_{k-1},\, \alpha_k,\, \zeta_k,\, \mathcal{X}\big),\qquad x_k=(1-\gamma_k)x_{k-1}+\gamma_k v_k.$$

The following theorem shows the convergence rate of solving problem (1).

###### Theorem 3.

Suppose the objective function satisfies Assumptions 1 and 2. If we set

$$\gamma_k=\frac{3}{k+2},\quad \alpha_k=\frac{6\kappa L}{k+1},\quad \zeta_k=\frac{LD_\mathcal{X}^2}{576k(k+1)},\quad \epsilon_k=\frac{\kappa LD_\mathcal{X}^2}{k(k+1)(k+2)},\quad P_k=\left\lceil\frac{96\sigma^2(k+1)^3}{\kappa L^2D_\mathcal{X}^2}\right\rceil$$

for Algorithm 6, we have

$$\mathbb{E}\Big[\max_{y\in\mathcal{Y}}f(x_k,y)-\min_{x\in\mathcal{X}}f(x,\bar{y}_k)\Big]\le \frac{12\kappa LD_\mathcal{X}^2}{(k+1)(k+2)}.$$

Theorem 3 implies the following corollary of oracle complexity.

###### Corollary 3.

Under the assumptions of Theorem 3, if the objective function has the finite-sum form (2), Algorithm 6 achieves an $\epsilon$-saddle point with bounded IFO and LO complexity.

###### Corollary 4.

Under the assumptions of Theorem 3, if the objective function has the expectation form (1), Algorithm 6 achieves an $\epsilon$-saddle point with bounded SFO and LO complexity.

## 5 Experiments

In this section, we empirically evaluate the performance of our methods on the robust multiclass classification problem introduced in Section 2.3, using a nuclear norm ball constraint with a fixed radius and regularization parameter $\lambda$. We compare our methods with saddle point Frank-Wolfe (SPFW) Gidel et al. (2017) and stochastic variance reduced extragradient (SVRE) Chavdarova et al. (2019). SPFW is a projection-free algorithm as discussed before, while SVRE is a state-of-the-art projection-based stochastic method for saddle point problems. We conduct experiments on three real-world data sets from the LIBSVM repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/): rcv1, sector and news20.

Since the primal-dual gap is hard to compute, we evaluate the algorithms by the following FW-gap Jaggi (2013):

$$G(x,y)=\max_{u\in\mathcal{X}}\langle x-u, \nabla_x f(x,y)\rangle + \max_{v\in\mathcal{Y}}\langle y-v, -\nabla_y f(x,y)\rangle,$$

which is an upper bound of the primal-dual gap and easy to compute. We measure the actual running time rather than the number of iterations because the computational costs of projection, linear optimization and gradient computation are quite different.
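For intuition, the FW-gap can be evaluated with just two LO calls. A toy sketch on the bilinear game $f(x,y)=x^\top y$ over the box $[-1,1]^d$ (an illustrative feasible set of ours, not the paper's experimental setup):

```python
import numpy as np

def fw_gap(x, y, gx, gy, lo_X, lo_Y):
    """FW-gap G(x, y) = max_{u in X} <x - u, gx> + max_{v in Y} <y - v, -gy>,
    where gx, gy are the partial gradients; each max costs one LO call."""
    u = lo_X(gx)    # minimizes <u, gx>, hence maximizes <x - u, gx>
    v = lo_Y(-gy)
    return (x - u) @ gx + (y - v) @ (-gy)

def lo_box(g):  # LO oracle for the box [-1, 1]^d
    return -np.sign(g)

# bilinear game f(x, y) = x^T y: partial gradients are gx = y, gy = x
x, y = np.array([0.5]), np.array([-0.25])
gap = fw_gap(x, y, gx=y, gy=x, lo_X=lo_box, lo_Y=lo_box)
```

For this bilinear game the FW-gap equals $|x|+|y|$, which coincides with the primal-dual gap, so the bound is tight here.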

We implement the mini-batch version of SVRE and tune its learning rate by grid search. The parameters of the projection-free methods follow what the theory suggests. We report the experimental results in Figure 1.

In all experiments, our methods outperform the baselines. SVRE only performs a few iterations due to the heavy computational cost of the projection onto the nuclear norm ball. SPFW converges slowly since it has no theoretical guarantee in the convex-strongly-concave case. We also find that MPSCGS converges faster than MPCGS, because the stochastic algorithms take advantage when $n$ is very large.

Figure 1: We demonstrate the performance of the algorithms by time (s) versus log(FW-gap) for robust multiclass classification with a nuclear norm ball constraint on the datasets “rcv1”, “sector”, and “news20”.

## 6 Conclusion and Future Works

In this paper, we propose projection-free algorithms for solving saddle point problems with complicated constraints in both batch and stochastic settings. Our methods are purely projection-free and do not require the saddle point problem to have special structure. We also provide convergence analysis for our algorithms in the convex-strongly-concave case. The experimental results demonstrate the effectiveness of our algorithms on three real-world data sets. On the other hand, we believe there is room for improving the complexity of the LO oracle, which we leave for future work. In addition, we will investigate how to extend our framework to the general convex-concave case and establish stronger convergence results in the strongly-convex-strongly-concave case.

This paper studies projection-free algorithms for convex-strongly-concave saddle point problems. From a theoretical viewpoint, we propose the first stochastic projection-free algorithm for saddle point problems without special conditions on the problem. From a practical viewpoint, our method can be applied to many machine learning applications that solve minimax problems with complicated constraints, e.g. robust optimization, matrix completion, two-player games and more.

The team is supported by "New Generation of AI 2030" Major Project (2018AAA0100900) and National Natural Science Foundation of China (61702327, 61772333, 61632017). Luo Luo is supported by GRF 16201320.

## References

• Ahmadinejad et al.  AmirMahdi Ahmadinejad, Sina Dehghani, MohammadTaghi Hajiaghayi, Brendan Lucier, Hamid Mahini, and Saeed Seddighin. From duels to battlefields: Computing equilibria of blotto and other games. Mathematics of Operations Research, 44(4):1304–1325, 2019.
• Candès and Recht  Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6):717, 2009.
• Chavdarova et al.  Tatjana Chavdarova, Gauthier Gidel, Franccois Fleuret, and Simon Lacoste-Julien. Reducing noise in gan training with variance reduced extragradient. In Advances in Neural Information Processing Systems, pages 391–401, 2019.
• Cox et al.  Bruce Cox, Anatoli Juditsky, and Arkadi Nemirovski. Decomposition techniques for bilinear saddle point problems and variational inequalities with affine monotone operators. Journal of Optimization Theory and Applications, 172(2):402–435, 2017.
• Dudik et al.  Miroslav Dudik, Zaid Harchaoui, and Jérôme Malick. Lifted coordinate descent for learning with trace-norm regularization. In Artificial Intelligence and Statistics, pages 327–336, 2012.
• Frank and Wolfe  Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval research logistics quarterly, 3(1-2):95–110, 1956.
• Gidel et al.  Gauthier Gidel, Tony Jebara, and Simon Lacoste-Julien. Frank-Wolfe algorithms for saddle point problems. In Artificial Intelligence and Statistics, pages 362–371, 2017.
• Hammond  Janice H. Hammond. Solving asymmetric variational inequality problems and systems of equations with generalized nonlinear programming algorithms. PhD thesis, Massachusetts Institute of Technology, 1984.
• Hassani et al.  Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++. arXiv preprint arXiv:1902.06992, 2019.
• Hazan and Kale  Elad Hazan and Satyen Kale. Projection-free online learning. In International Conference on Machine Learning, pages 521–528, 2012.
• Hazan and Luo  Elad Hazan and Haipeng Luo. Variance-reduced and projection-free stochastic optimization. In International Conference on Machine Learning, pages 1263–1271, 2016.
• He and Harchaoui  Niao He and Zaid Harchaoui. Semi-proximal mirror-prox for nonsmooth composite minimization. In Advances in Neural Information Processing Systems, pages 3411–3419, 2015.
• Jaggi  Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning, pages 427–435, 2013.
• Jaggi and Sulovskỳ  Martin Jaggi and Marek Sulovskỳ. A simple algorithm for nuclear norm regularized problems. In International Conference on Machine Learning, pages 471–478, 2010.
• Juditsky and Nemirovski  Anatoli Juditsky and Arkadi Nemirovski. Solving variational inequalities with monotone operators on domains given by linear minimization oracles. Mathematical Programming, 156(1-2):221–256, 2016.
• Korpelevich  G.M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
• Lacoste-Julien and Jaggi  Simon Lacoste-Julien and Martin Jaggi. An affine invariant linear convergence analysis for frank-wolfe algorithms. arXiv preprint arXiv:1312.7864, 2013.
• Lan  Guanghui Lan. The complexity of large-scale convex programming under a linear optimization oracle. arXiv preprint arXiv:1309.5550, 2013.
• Lan and Zhou  Guanghui Lan and Yi Zhou. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization, 26(2):1379–1409, 2016.
• Lin et al.  Tianyi Lin, Chi Jin, and Michael I. Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, 2020.
• Lin et al.  Xiao Lin, Wenpeng Zhang, Min Zhang, Wenwu Zhu, Jian Pei, Peilin Zhao, and Junzhou Huang. Online compact convexified factorization machine. In The World Wide Web Conference, pages 1633–1642, 2018.
• Livni et al.  Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.
• Namkoong and Duchi  Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
• Nemirovski  Arkadi Nemirovski. Prox-method with rate of convergence for variational inequalities with lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15(1):229–251, 2004.
• Nesterov  Yurii E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. In Dokl. Akad. Nauk SSSR, volume 269, pages 543–547, 1983.
• Nouiehed et al.  Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D. Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. In Advances in Neural Information Processing Systems, pages 14905–14916, 2019.
• Palaniappan and Bach  Balamurugan Palaniappan and Francis Bach. Stochastic variance reduction methods for saddle-point problems. In Advances in Neural Information Processing Systems, pages 1416–1424, 2016.
• Qu et al.  Chao Qu, Yan Li, and Huan Xu. Non-convex conditional gradient sliding. In International Conference on Machine Learning, pages 4208–4217, 2018.
• Reddi et al.  Sashank J. Reddi, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic Frank-Wolfe methods for nonconvex optimization. In 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1244–1251. IEEE, 2016.
• Roy et al.  Abhishek Roy, Yifang Chen, Krishnakumar Balasubramanian, and Prasant Mohapatra. Online and bandit algorithms for nonstationary stochastic saddle-point optimization. arXiv preprint arXiv:1912.01698, 2019.
• Shen et al.  Zebang Shen, Cong Fang, Peilin Zhao, Junzhou Huang, and Hui Qian. Complexities in projection-free stochastic non-convex minimization. In International Conference on Artificial Intelligence and Statistics, pages 2868–2876, 2019.
• Thekumparampil et al.  Kiran K. Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, pages 12659–12670, 2019.
• Xie et al.  Jiahao Xie, Zebang Shen, Chao Zhang, Boyu Wang, and Hui Qian. Efficient projection-free online methods with stochastic recursive gradient. In AAAI Conference on Artificial Intelligence, pages 6446–6453, 2020.
• Yurtsever et al.  Alp Yurtsever, Suvrit Sra, and Volkan Cevher. Conditional gradient methods via stochastic path-integrated differential estimator. In International Conference on Machine Learning, pages 7282–7291, 2019.
• Zhang et al.  Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. One sample stochastic Frank-Wolfe. In International Conference on Artificial Intelligence and Statistics, pages 4012–4023, 2020.
• Zhang et al.  Xinhua Zhang, Dale Schuurmans, and Yao-liang Yu. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems, pages 2906–2914, 2012.

## Appendix A Proofs For MPCGS

In this section, we assume $f$ satisfies Assumption 1.

### a.1 Definitions and Lemmas

We define the following functions:

$$y^*(x)=\operatorname*{argmax}_{y\in\mathcal{Y}}f(x,y),$$

$$\psi_k(x)=(1-\gamma_k)x_{k-1}+\gamma_k\cdot\operatorname*{argmin}_{v\in\mathcal{X}}\left\{\nabla_x f(z_k,y^*(x))^\top v+\frac{\alpha_k}{2}\|v-v_{k-1}\|^2\right\}.$$

Since $f(x,\cdot)$ is $\mu$-strongly-concave, $y^*(x)$ is unique. Then, we have the following two lemmas.

###### Lemma 1 ((Lin et al., 2020, Lemma 4.3)).

$y^*(x)$ is $\kappa$-Lipschitz continuous.

###### Lemma 2.

$\psi_k$ is a contraction.

###### Proof.

Let

$$\nabla_1=\nabla_x f(z_k,y^*(x_1)),\qquad v_1=\operatorname*{argmin}_{v\in\mathcal{X}}\Big\{\nabla_1^\top v+\frac{\alpha_k}{2}\|v-v_{k-1}\|^2\Big\};$$

$$\nabla_2=\nabla_x f(z_k,y^*(x_2)),\qquad v_2=\operatorname*{argmin}_{v\in\mathcal{X}}\Big\{\nabla_2^\top v+\frac{\alpha_k}{2}\|v-v_{k-1}\|^2\Big\}.$$