1 Introduction
High-dimensional statistics has achieved remarkable success over the last decade, including results on consistency and convergence rates for various estimators under non-asymptotic high-dimensional scaling, especially when the problem dimension is larger than the number of samples. It is now well known that while this setup appears ill-posed, estimation or recovery is indeed possible by exploiting the underlying structure of the parameter space. Notable examples include sparse vectors, low-rank matrices, and structured regression functions, among others. To impose a structural assumption, a popular approach is structural risk minimization, which can generally be formulated as the following optimization problem:
(1) $\displaystyle \min_{x \in \Omega}\; f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad \Omega := \{x : \mathcal{R}(x) \le \rho\},$

where $\mathcal{R}$ denotes a regularizer and the radius $\rho$ controls the strength of the regularization; we denote by $\Omega$ the constraint set of our problem. First-order optimization methods for the structural risk minimization problem have been studied extensively in recent years with fruitful developments. The (projected) gradient descent (PGD) algorithm (Nesterov, 2013) and its variants are arguably the most popular algorithms used in practice. It is well known that the projected gradient descent algorithm has iteration complexity $O(L/\epsilon)$ for smooth convex objectives, and $O(\kappa \log(1/\epsilon))$ for smooth and strongly convex objectives, where $\kappa$ denotes the condition number. When solving (1) with a batch version of a first-order algorithm, a full pass over the entire dataset to evaluate the gradient can be computationally expensive. In comparison, the stochastic gradient descent (SGD) algorithm computes a noisy gradient over one sample or a mini-batch of samples, which has light computational overhead. However, it is known that SGD converges at a significantly slower rate
(Bottou, 2010). Stochastic variance-reduced algorithms (Defazio et al., 2014; Xiao and Zhang, 2014; Shalev-Shwartz and Zhang, 2012) combine the merits of both worlds: they have the light per-iteration gradient cost of SGD and the fast convergence of batch algorithms. We should note that there are in fact two operations that consume most of the computational resources in the PGD algorithm: one is the gradient evaluation, and the other is the projection. Although variance-reduced PGD variants reduce the complexity of the former, they do not improve the latter. Though commonly assumed to be easy, there are numerous applications where the projection is computationally expensive (projection onto the trace norm ball, or onto base polytopes (Fujishige and Isotani, 2011)) or even intractable (Collins et al., 2008). The Frank-Wolfe (FW) algorithm (Frank and Wolfe, 1956) arises as a natural alternative in these scenarios. Unlike projection-based algorithms, FW assumes a linear oracle (LO) that solves a linear optimization problem over the constraint set, which can be significantly easier than the projection. However, the original Frank-Wolfe algorithm suffers from slow convergence, even when the objective is strongly convex. It has been shown in (Lan, 2013) that the $O(1/\epsilon)$ LO complexity is not improvable for LO-based algorithms on general smooth convex problems. Improving the convergence of FW is only possible under stronger assumptions, for example when the constraint set is polyhedral (Lacoste-Julien and Jaggi, 2015) or strongly convex (Garber and Hazan, 2015). The slow sublinear convergence of the FW algorithm can be a serious issue: when applying FW to problem (1), the algorithm needs an expensive full gradient evaluation at every iteration, and the number of iterations needed is also large. To address this problem, Lan and Zhou (2016) show that the gradient evaluation complexity can be improved to $O(\sqrt{L/\epsilon})$ for smooth convex objectives, and to $O(\sqrt{\kappa}\log(1/\epsilon))$ for smooth and strongly convex objectives.
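To make the projection-versus-LO contrast concrete, the following sketch compares the two oracles over the nuclear norm ball. It is illustrative only: it uses plain power iteration for the leading singular pair and the simplex-projection routine of Duchi et al. (2008) for the singular values; dimensions and iteration counts are arbitrary choices, not values from the paper.

```python
import numpy as np

def nuclear_lo(G, rho):
    """Linear oracle: argmin_{||S||_* <= rho} <G, S>.
    Only the leading singular vector pair of G is needed."""
    # power iteration on G^T G recovers the top right singular vector
    v = np.random.default_rng(0).standard_normal(G.shape[1])
    for _ in range(100):
        v = G.T @ (G @ v)
        v /= np.linalg.norm(v)
    u = G @ v
    u /= np.linalg.norm(u)
    return -rho * np.outer(u, v)  # rank-one vertex of the ball

def nuclear_projection(Theta, rho):
    """Euclidean projection onto {||S||_* <= rho}: a full SVD plus an
    l1-ball projection of the singular values (much more expensive)."""
    U, s, Vt = np.linalg.svd(Theta, full_matrices=False)
    if s.sum() <= rho:
        return Theta
    mu = np.sort(s)[::-1]
    cum = np.cumsum(mu)
    k = np.nonzero(mu - (cum - rho) / np.arange(1, len(mu) + 1) > 0)[0][-1]
    theta = (cum[k] - rho) / (k + 1)
    return U @ np.diag(np.maximum(s - theta, 0.0)) @ Vt
```

The LO costs one leading singular pair (a few matrix-vector products), while the projection needs every singular value, which is exactly the gap the FW-type methods exploit.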
For the finite-sum problem in (1), a similar result has also been shown in (Hazan and Luo, 2016) using a variance-reduction technique. In summary, for both smooth convex problems and smooth strongly convex problems, the gradient evaluation complexities of PGD-type and FW-type algorithms are comparable, in the sense that they share the same dependency on the target accuracy in order to obtain an optimal solution. Related Work: The high-dimensional nature of statistical estimation problems is particularly suited to the FW algorithm, as projection onto the constraint set can be difficult. Consider the case where the constraint set is a nuclear norm ball: projection requires a full SVD, while the linear oracle only requires computing the leading singular vectors. One might assume that the lack of strong convexity in the objective implies sublinear convergence of the PGD algorithm. This is in fact not true. For the batch PGD algorithm, Agarwal et al. (2010) show linear convergence of the algorithm until it reaches the statistical limit of the model. Their fast convergence result is based on the assumption that the objective function satisfies restricted strong convexity (RSC). The batch PGD requires $O(\kappa \log(1/\epsilon))$ gradient evaluations and projections, where $\kappa$ is the condition number derived from the RSC of the objective function. Qu et al. (2016); Qu and Xu (2017); Qu et al. (2017b) later extend the linear convergence result to stochastic variance-reduced PGD algorithms, including SDCA, SVRG, and SAGA. Their results improve the gradient evaluation complexity, while the number of projections remains the same. Contributions: In contrast to the fruitful study of PGD-type algorithms on the high-dimensional estimation problem (1), to the best of our knowledge there are still no results on how FW-type algorithms perform on this problem. Our results in this paper show that, as with PGD-type algorithms, the efficiency of FW-type algorithms (i.e. CGS and STORC) extends much more generally to the setting where the objective only satisfies restricted strong convexity. We briefly summarize our (informal) main results and defer the detailed discussion to our formal theorems and corollaries.
Assuming $f$ is restricted strongly convex, the Conditional Gradient Sliding (CGS) algorithm attains $O(\sqrt{\kappa}\log(1/\epsilon))$ gradient evaluation complexity up to the statistical limit of the model. Our result matches the known result, which requires a strong convexity assumption.

Assuming $f$ is restricted strongly convex and each $f_i$ is convex, the stochastic variance-reduced CGS (STORC) algorithm attains the gradient evaluation complexity of the strongly convex case, up to the statistical limit of the model. Our result also matches the known result, which requires a strong convexity assumption. A similar complexity can be obtained even if some $f_i$ are nonconvex.

Both STORC and CGS maintain the optimal $O(1/\epsilon)$ LO complexity under restricted strong convexity.

The main technical challenge comes from the nonconvexity. For instance, to handle the nonconvexity in STORC, we exploit the notion of lower smoothness and generalize the analysis of STORC to a type of nonconvex setting.
2 Problem Setup
We begin with some definitions from the classical optimization literature. A function $f$ is $L$-smooth and $\mu$-strongly convex with respect to a norm $\|\cdot\|$ if for all $x, y$:

(2) $\displaystyle \frac{\mu}{2}\|y - x\|^2 \;\le\; f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle \;\le\; \frac{L}{2}\|y - x\|^2;$

we additionally say $f$ is $l$-lower smooth if:

(3) $\displaystyle f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle \;\ge\; -\frac{l}{2}\|y - x\|^2.$
Restricted Strong Convexity. Restricted strong convexity was initially proposed in (Negahban et al., 2009) to establish statistical consistency of regularized M-estimators, and was later exploited to establish linear convergence of the gradient descent algorithm up to the statistical limit of the model (Agarwal et al., 2010). We say $f$ satisfies restricted strong convexity with respect to the regularizer $\mathcal{R}$ and the norm $\|\cdot\|$ with parameters $(\gamma, \tau)$ if for all $x, y$:

(4) $\displaystyle f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle \;\ge\; \frac{\gamma}{2}\|y - x\|^2 - \tau\, \mathcal{R}^2(y - x).$
Decomposable regularizer. Intuitively, for restricted strong convexity to behave like strong convexity, one needs $\mathcal{R}(y - x)$ in definition (4) to be comparable to $\|y - x\|$, so that the tolerance term does not swamp the curvature term. This is an essential argument for establishing both the consistency of M-estimators and the fast convergence of PGD algorithms, and a decomposable regularizer is a sufficient condition for such a result. Given a subspace pair $(\mathcal{M}, \bar{\mathcal{M}})$ with $\mathcal{M} \subseteq \bar{\mathcal{M}}$, both belonging to a Hilbert space $\mathcal{H}$, define $\bar{\mathcal{M}}^{\perp}$ to be the orthogonal complement of $\bar{\mathcal{M}}$. We say that the regularizer $\mathcal{R}$ is decomposable with respect to $(\mathcal{M}, \bar{\mathcal{M}})$ if:

(5) $\mathcal{R}(u + v) = \mathcal{R}(u) + \mathcal{R}(v) \quad \text{for all } u \in \mathcal{M},\; v \in \bar{\mathcal{M}}^{\perp}.$
We further define the subspace compatibility as $\Psi(\bar{\mathcal{M}}) := \sup_{u \in \bar{\mathcal{M}} \setminus \{0\}} \mathcal{R}(u)/\|u\|$. Assumptions. We make the following assumptions throughout this paper. We assume each $f_i$ is $L$-smooth, hence $f$ is also $L$-smooth. If some $f_i$ is nonconvex, we assume it is $l$-lower smooth. Finally, we assume that $f$ satisfies restricted strong convexity with respect to $\mathcal{R}$ and $\|\cdot\|$ with parameters $(\gamma, \tau)$.
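For intuition, here is a minimal numeric check of decomposability (5), using the l1 norm as an illustrative decomposable regularizer (the paper's running example, the nuclear norm, is treated in the next subsection). The support set `S` and dimensions are arbitrary choices for the sketch; the model subspace is the set of vectors supported on `S`.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
S = np.array([0, 3, 7])                  # model subspace: support on S
mask = np.zeros(d, dtype=bool)
mask[S] = True

u = np.zeros(d); u[mask] = rng.standard_normal(mask.sum())       # u in M
v = np.zeros(d); v[~mask] = rng.standard_normal((~mask).sum())   # v in M^perp

# decomposability (5): R(u + v) = R(u) + R(v) for the l1 regularizer
lhs = np.abs(u + v).sum()
rhs = np.abs(u).sum() + np.abs(v).sum()
assert np.isclose(lhs, rhs)

# subspace compatibility of (l1, l2) on a |S|-dimensional subspace is sqrt(|S|):
# ||u||_1 <= sqrt(|S|) ||u||_2 for any u supported on S
assert np.abs(u).sum() <= np.sqrt(len(S)) * np.linalg.norm(u) + 1e-12
```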
2.1 Example: Matrix Regression
In this subsection we put the previously defined notions into context by considering the matrix regression problem: let $\Theta^* \in \mathbb{R}^{d \times d}$ be an unknown low-rank matrix to be estimated, for which we have $n$ observations from the model $y_i = \langle\langle X_i, \Theta^* \rangle\rangle + \xi_i$, where the noise $\xi_i$ is i.i.d. sampled from $N(0, \sigma^2)$. We denote by $\langle\langle A, B \rangle\rangle := \mathrm{trace}(A^\top B)$ the Frobenius inner product and by $\|A\|_F$ the Frobenius norm of a matrix $A$. Decomposability of the nuclear norm. Define $\|\Theta\|_*$ to be the nuclear norm of the matrix $\Theta$; then it is well known that $\|A + B\|_* = \|A\|_* + \|B\|_*$ if $A$ and $B$ have orthogonal row spaces and column spaces. Let $\Theta^* = U D V^\top$ be the singular value decomposition of $\Theta^*$. Define $U^r$ as the submatrix of $U$ consisting of its first $r$ columns, and similarly define $V^r$. Letting $\mathrm{col}(A)$ denote the column space of a matrix $A$, we define the subspace pair $\bar{\mathcal{M}} := \{\Theta : \mathrm{col}(\Theta) \subseteq \mathrm{col}(U^r),\ \mathrm{col}(\Theta^\top) \subseteq \mathrm{col}(V^r)\}$ and $\bar{\mathcal{M}}^{\perp} := \{\Theta : \mathrm{col}(\Theta) \perp \mathrm{col}(U^r),\ \mathrm{col}(\Theta^\top) \perp \mathrm{col}(V^r)\}$. Then any $\Theta \in \bar{\mathcal{M}}$ and $\Theta' \in \bar{\mathcal{M}}^{\perp}$ have orthogonal row and column subspaces, hence we have decomposability of the nuclear norm with respect to this pair. We will use $\mathrm{vec}(\Theta)$ to denote the vectorization of the matrix $\Theta$.
We define the loss function for matrix regression in (6); the surrogates $\hat\Gamma$ and $\hat\gamma$ will be specified separately depending on whether the sensing matrices are observed with noise.

(6) $\displaystyle \mathcal{L}_n(\Theta) = \frac{1}{2}\,\mathrm{vec}(\Theta)^\top \hat\Gamma\, \mathrm{vec}(\Theta) - \langle \hat\gamma,\, \mathrm{vec}(\Theta)\rangle$
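The decomposability of the nuclear norm invoked above can be verified numerically: matrices whose row and column spaces are mutually orthogonal have additive nuclear norms. A small sketch with illustrative dimensions (`d`, `r` are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2

# Build orthonormal bases for the leading-r subspaces and their complements.
U, _, Vt = np.linalg.svd(rng.standard_normal((d, d)))
Ur, Vr = U[:, :r], Vt[:r].T               # leading singular subspaces
Uc, Vc = U[:, r:], Vt[r:].T               # orthogonal complements

A = Ur @ rng.standard_normal((r, r)) @ Vr.T           # A in M_bar
B = Uc @ rng.standard_normal((d - r, d - r)) @ Vc.T   # B in M_bar^perp

nuc = lambda M: np.linalg.svd(M, compute_uv=False).sum()
# orthogonal row/column spaces => ||A + B||_* = ||A||_* + ||B||_*
assert np.isclose(nuc(A + B), nuc(A) + nuc(B))
```

The equality holds because the SVDs of `A` and `B` concatenate into an SVD of `A + B` when the subspaces are orthogonal.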
Convex loss: If the sensing matrices are observed without noise, we set $\hat\Gamma = X^\top X / n$ and $\hat\gamma = X^\top y / n$, where the $i$-th row of the matrix $X$ is $\mathrm{vec}(X_i)^\top$. It is easy to verify that

(7) $\displaystyle \mathcal{L}_n(\Theta) = \frac{1}{2n}\|y - X\,\mathrm{vec}(\Theta)\|_2^2 - \frac{1}{2n}\|y\|_2^2.$
Nonconvex loss: Suppose the sensing matrices are observed with additive noise, i.e. we observe $Z_i = X_i + W_i$, where $W_i$ is independent of $X_i$ and $\mathrm{vec}(W_i) \sim N(0, \Sigma_w)$. We set $\hat\Gamma = Z^\top Z / n - \Sigma_w$ and $\hat\gamma = Z^\top y / n$, where the $i$-th row of the matrix $Z$ is $\mathrm{vec}(Z_i)^\top$. We can rewrite (6) as:

(8) $\displaystyle \mathcal{L}_n(\Theta) = \frac{1}{2n}\|y - Z\,\mathrm{vec}(\Theta)\|_2^2 - \frac{1}{2}\,\mathrm{vec}(\Theta)^\top \Sigma_w\, \mathrm{vec}(\Theta) - \frac{1}{2n}\|y\|_2^2.$
Note that $Z^\top Z / n$ has rank at most $n$, and we subtract from it the positive definite matrix $\Sigma_w$; hence $\hat\Gamma$ cannot be positive semidefinite when $n < d^2$ (the typical setup for matrix regression), which results in the nonconvexity of $\mathcal{L}_n$. Structural risk minimization: To impose low-rankness on the recovered matrix, a popular approach is to minimize the objective over a nuclear norm ball with a suitable radius (Koltchinskii et al., 2011). The matrix regression problem is defined by:
(9) $\displaystyle \hat\Theta \in \arg\min_{\|\Theta\|_* \le \rho}\; \mathcal{L}_n(\Theta)$
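A small sketch of the two quadratic surrogates, under the notation assumed above (design matrix with rows $\mathrm{vec}(X_i)^\top$, noisy design $Z$, noise covariance $\Sigma_w$); sizes and noise levels are illustrative. With $n < d^2$, the corrected matrix $Z^\top Z/n - \Sigma_w$ necessarily has a negative eigenvalue, so the errors-in-variables loss is nonconvex:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma_w = 4, 10, 0.5          # n = 10 < d^2 = 16 (illustrative sizes)
p = d * d

X = rng.standard_normal((n, p))                      # rows are vec(X_i)^T
Theta_star = np.outer(rng.standard_normal(d), rng.standard_normal(d))  # rank 1
y = X @ Theta_star.ravel() + 0.01 * rng.standard_normal(n)

W = sigma_w * rng.standard_normal((n, p))            # additive design noise
Z = X + W
Sigma_w = sigma_w**2 * np.eye(p)

G_cvx = X.T @ X / n                   # convex surrogate (7): PSD
G_ncvx = Z.T @ Z / n - Sigma_w        # corrected surrogate (8): indefinite

# quadratic loss (6): 0.5 vec(T)' G vec(T) - <g, vec(T)>
loss = lambda G, g, Th: 0.5 * Th.ravel() @ G @ Th.ravel() - g @ Th.ravel()

assert np.linalg.eigvalsh(G_cvx).min() >= -1e-10     # PSD up to roundoff
assert np.linalg.eigvalsh(G_ncvx).min() < 0          # nonconvexity when n < d^2
```

The second assertion is guaranteed here: $Z^\top Z/n$ has rank at most $n < d^2$, so on its null space the corrected matrix acts as $-\Sigma_w \prec 0$.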
Lemma 1 establishes restricted strong convexity of the loss functions in the matrix regression problem; it was essentially proved in (Agarwal et al., 2010).
Lemma 1.
Suppose the sensing matrices satisfy $\mathrm{vec}(X_i) \overset{\text{i.i.d.}}{\sim} N(0, \Sigma)$. Then the loss function (7) satisfies restricted strong convexity (4) with curvature of the order of $\lambda_{\min}(\Sigma)$ and tolerance of the order of $\lambda_{\max}(\Sigma)\, d/n$.
Suppose in addition that $\mathrm{vec}(W_i) \overset{\text{i.i.d.}}{\sim} N(0, \Sigma_w)$ is sampled independently of $X_i$, and that the sample size is sufficiently large. Then the loss function (8) satisfies restricted strong convexity (4) with curvature of the order of $\lambda_{\min}(\Sigma)$ and tolerance of the order of $(\lambda_{\max}(\Sigma) + \lambda_{\max}(\Sigma_w))\, d/n$.
Both statements hold with probability at least $1 - c_1 \exp(-c_2 n)$, for some absolute constants $c_1, c_2$.
3 Theoretical Results
In this section we present the complexity results for CGS and STORC under the restricted strong convexity assumption. A few more definitions are needed to present our formal results. Definition. Let $f$ be a function satisfying restricted strong convexity with respect to $\mathcal{R}$ and $\|\cdot\|$ with parameters $(\gamma, \tau)$. Suppose $\mathcal{R}$ is decomposable with respect to the subspace pair $(\mathcal{M}, \bar{\mathcal{M}})$, and let $\Pi_{\bar{\mathcal{M}}^{\perp}}(\cdot)$ denote projection onto the subspace $\bar{\mathcal{M}}^{\perp}$. Let $x^*$ be the unknown target parameter and $\hat{x}$ the optimal solution of (1). We define the following quantities:

Optimization error: the gap between the current iterate and the optimum $\hat{x}$ of (1);

Effective strong convexity parameter: the RSC curvature $\gamma$ discounted by the tolerance $\tau$ through the subspace compatibility $\Psi(\bar{\mathcal{M}})$;

Statistical error: the precision floor determined by the estimation error $\hat{x} - x^*$ and by $\mathcal{R}(\Pi_{\bar{\mathcal{M}}^{\perp}}(x^*))$, below which further optimization yields no statistical gain.
Parameter Specification. In the CGS and STORC algorithms, we set:
(10) 
for the convex STORC we additionally set:
(11) 
and for the nonconvex STORC we set:
(12) 
We present the batch algorithm CGS (Lan and Zhou, 2016) in Algorithm 1. CGS is a careful combination of Nesterov's accelerated gradient descent (AGD) (Nesterov, 1983) and the Frank-Wolfe algorithm; it essentially uses the FW algorithm to solve the projection subproblem in AGD.
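The combination just described can be sketched as follows. This is a simplified, illustrative implementation, not the paper's Algorithm 1: the stepsize, prox-parameter, and inner-tolerance schedules (`gamma`, `beta`, `eta`) are one admissible choice from the literature on CGS for smooth convex objectives, and the l1-ball oracle is used only because its LO is trivial.

```python
import numpy as np

def lo_l1(g, rho):
    """Linear oracle over the l1 ball: argmin_{||x||_1 <= rho} <g, x>."""
    i = np.argmax(np.abs(g))
    x = np.zeros_like(g)
    x[i] = -rho * np.sign(g[i])
    return x

def cnd_g(g, u, beta, eta, lo, max_steps=50):
    """Frank-Wolfe on phi(x) = <g, x> + beta/2 ||x - u||^2 until the
    Wolfe gap drops below eta: an approximate AGD projection step."""
    x = u.copy()
    for _ in range(max_steps):
        grad = g + beta * (x - u)
        v = lo(grad)
        gap = grad @ (x - v)
        if gap <= eta:
            break
        # exact line search for the quadratic subproblem
        alpha = min(1.0, gap / (beta * np.sum((v - x) ** 2) + 1e-18))
        x = x + alpha * (v - x)
    return x

def cgs(grad_f, lo, x0, L, D2, n_outer=30):
    """Sketch of conditional gradient sliding: AGD outer loop whose
    prox step is solved approximately by the FW subroutine above."""
    x = x0.copy()
    y = x0.copy()
    for k in range(1, n_outer + 1):
        gamma = 3.0 / (k + 2)
        beta = 3.0 * L / (k + 1)
        eta = L * D2 / (k * (k + 1))
        z = (1 - gamma) * y + gamma * x
        x = cnd_g(grad_f(z), x, beta, eta, lo)
        y = (1 - gamma) * y + gamma * x
    return y
```

Every point the method touches is a convex combination of LO outputs and the start point, so feasibility is maintained without a single projection.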
We now present our result for Conditional Gradient Sliding. We emphasize that our result holds uniformly for both the convex and the nonconvex setting, with a parameter setting that does not depend on convexity.
Theorem 1.
Let $f$ satisfy our assumptions and be restricted strongly convex with respect to $\mathcal{R}$ and $\|\cdot\|$ with parameters $(\gamma, \tau)$. Suppose $\mathcal{R}$ is decomposable with respect to the subspace pair $(\mathcal{M}, \bar{\mathcal{M}})$. Let $D$ denote the diameter of the feasible set $\Omega$, and suppose we are given an estimate of the distance from the initial point to the optimum. If we run CGS with the parameters specified in (10), then for both the convex and the nonconvex case: for any target accuracy $\epsilon$, in order for the CGS algorithm to obtain an iterate $x$ such that $f(x) - f(\hat{x}) \le \epsilon$, the numbers of calls to the gradient evaluation and to the linear oracle are bounded respectively by:
(13) 
and
(14) 
Furthermore, with an appropriate choice of the number of iterations, whenever the required precision remains above the statistical error of the model, we have:
(15) 
Remarks: We note that these bounds are known to be optimal for smooth and strongly convex functions (Lan and Zhou, 2016), and we extend the applicability of CGS in the following sense: the bounds also hold for a class of convex but not strongly convex functions, and even for nonconvex functions, provided that they satisfy restricted strong convexity. While our result carries a mild restriction on the precision up to which the algorithm converges at the predicted rate, we remark that there would be no additional statistical gain from optimizing beyond this precision. In fact, in many models this precision is shown to be of the same or lower order than the statistical precision of the model, as illustrated in the following corollary.
Corollary 1.
Consider both the convex and the nonconvex matrix regression problem (9). Under the conditions of Lemma 1, the effective strong convexity parameter and the statistical error are bounded in terms of absolute constants, and for a sample size satisfying an appropriate scaling the condition number is bounded by an absolute constant. Setting the target accuracy at the order of the statistical precision, then with
(16) 
gradient evaluations, and
(17) 
calls to the linear oracle, Algorithm 1 achieves an optimality gap of at most the target accuracy, and the distance to the optimum satisfies:
(18) 
where is an absolute constant.
Next we present the Stochastic Variance-Reduced Conditional Gradient Sliding (Hazan and Luo, 2016). Its only difference from the CGS algorithm is that the full gradient is replaced by a variance-reduced gradient in the SVRG style. We use $s$ to index the outer iterations and $t$ to index the inner iterations. We compute the full gradient $\nabla f(\tilde{x}_s)$ at a snapshot point $\tilde{x}_s$ once per outer iteration, and at each inner iteration we replace the full gradient $\nabla f(z_t)$ with the estimator $G_t = \nabla f(\tilde{x}_s) + \frac{1}{|S_t|}\sum_{i \in S_t}\big(\nabla f_i(z_t) - \nabla f_i(\tilde{x}_s)\big)$, where $S_t$ is a set of indices sampled uniformly and independently from $\{1, \dots, n\}$. The details are presented in Algorithm 2.
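The estimator just described can be sketched directly; the least-squares components below are only an illustrative finite sum. Averaging the estimator over every possible index recovers the full gradient exactly, which is the unbiasedness property the variance-reduction analysis relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# f_i(x) = 0.5 (a_i' x - b_i)^2, so f = (1/n) sum_i f_i
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
full_grad = lambda x: A.T @ (A @ x - b) / n

def vr_grad(x, snapshot, snap_grad, batch):
    """SVRG-style estimator used in place of the full gradient:
    grad f(snapshot) + mean_{i in batch} [grad f_i(x) - grad f_i(snapshot)]."""
    corr = np.mean([grad_i(x, i) - grad_i(snapshot, i) for i in batch], axis=0)
    return snap_grad + corr

x = rng.standard_normal(d)
tilde = rng.standard_normal(d)          # snapshot point
g_tilde = full_grad(tilde)              # full gradient, computed once

# averaging over every index recovers the full gradient exactly (unbiasedness)
est = np.mean([vr_grad(x, tilde, g_tilde, [i]) for i in range(n)], axis=0)
assert np.allclose(est, full_grad(x))
```

The correction term shrinks as the inner iterate approaches the snapshot, which is what lets the mini-batch estimator stand in for the full gradient without a persistent variance floor.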
Theorem 2.
Under the same conditions as Theorem 1: if each $f_i$ is convex, run STORC with the parameters specified in (10) and (11); if some $f_i$ are nonconvex but lower smooth, run STORC with the parameters specified in (10) and (12). Then for any target accuracy $\epsilon$, in order to obtain an iterate $x$ such that $f(x) - f(\hat{x}) \le \epsilon$, the number of gradient evaluations is bounded by:
Convex:  (19)  
Nonconvex:  (20) 
and the number of calls to linear oracle is bounded by:
(21) 
Furthermore, with an appropriate choice of the number of iterations, whenever the required precision remains above the statistical error of the model, we have:
(22) 
Remarks: For the convex loss, our result parallels the original STORC result, except that we do not need $f$ to be strongly convex. For the nonconvex loss, our result comes from an analysis of STORC applied to a strongly convex objective (indeed, RSC suffices) that is a sum of nonconvex functions. Note that similar results had previously been shown only for variance-reduced PGD-type algorithms (Allen-Zhu and Yuan, 2016). We also note that whenever the lower smoothness parameter is of the same order as the smoothness parameter or smaller, the complexity for the nonconvex objective reduces to that of the convex objective; in other words, we pay no penalty for handling nonconvexity. Since the lower smoothness parameter can always be bounded by the smoothness parameter $L$, the worst-case gradient evaluation complexity for the nonconvex loss is obtained by this substitution.
Corollary 2.
Consider both the convex and the nonconvex matrix regression problem (9). Under the conditions of Lemma 1, the effective strong convexity parameter and the statistical error are bounded in terms of universal constants, and for a sample size satisfying an appropriate scaling the condition number is bounded by an absolute constant. Setting the target accuracy at the order of the statistical precision, then with
(23) 
gradient evaluations, and
(24) 
calls to the linear oracle, Algorithm 2 achieves an optimality gap of at most the target accuracy, and the distance to the optimum satisfies
(25) 
where is an absolute constant.
4 Simulation
We compare the performance of CGS and STORC with the batch projected gradient descent and SVRG algorithms on the matrix regression problem (9). For the convex loss problem we generate the data as follows: $\Theta^*$ is a matrix of rank $r$. We sample the sensing matrices with a diagonal covariance whose diagonal entries all equal one except the first, which is set larger to illustrate the impact of the condition number on performance. Finally we generate $n$ samples and set the nuclear norm radius accordingly. Figure 1 reports the simulation results for the convex loss problem. For the batch algorithms, CGS and projected gradient descent, the computation time per iteration is dominated by the gradient evaluation. As our theorem predicts, when the required precision is not too small, the CGS algorithm outperforms projected gradient descent due to its acceleration in terms of gradient evaluation complexity (square-root dependence on the condition number). This is even more significant for the ill-conditioned problem in Figure 1(c). Both STORC and SVRG outperform their batch counterparts by significantly reducing the cost of gradient evaluations per iteration. Since STORC performs an LO step, which only requires computing the leading singular vectors, it outperforms SVRG, which involves a full SVD computation at each iteration. In fact, we observe that SVRG can even be outperformed by CGS, which further emphasizes the importance of replacing the full projection with a linear optimization oracle.
For the nonconvex loss problem, we generate $\Theta^*$ in the same way as for the convex loss. Instead of observing the sensing matrices directly, we only observe their noisy versions $Z_i = X_i + W_i$, with $W_i$ independent of $X_i$. We again generate $n$ samples and set the radius accordingly. Figure 2 reports the simulation results for the nonconvex loss. Again, CGS outperforms batch gradient descent through its accelerated gradient evaluation complexity. In this setting the lower smoothness parameter is of the same order as the smoothness parameter, which by our theorem implies that nonconvexity causes essentially no extra computational overhead for the STORC algorithm. As predicted by our theoretical results, STORC outperforms SVRG by replacing the projection step with the much cheaper linear oracle.
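The data-generating process described above can be sketched as follows. All numeric values (`d`, `r`, `n`, the condition-number knob `kappa`, the noise level) are illustrative placeholders, not the values used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n, kappa, sigma = 20, 3, 1500, 10.0, 0.1   # illustrative sizes

# rank-r target: product of two thin Gaussian factors
Theta_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))

# ill-conditioned design covariance: identity except one inflated entry
diag = np.ones(d * d)
diag[0] = kappa
X = rng.standard_normal((n, d * d)) * np.sqrt(diag)   # rows ~ N(0, diag(Sigma))

# responses from the trace-regression model with Gaussian noise
y = X @ Theta_star.ravel() + sigma * rng.standard_normal(n)

assert np.linalg.matrix_rank(Theta_star) == r
```

For the nonconvex variant one would additionally corrupt `X` with independent Gaussian noise and hand the solver the covariance-corrected surrogate from (8).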
5 Conclusions
In this paper we show that batch and stochastic variants of the Frank-Wolfe algorithm, namely the CGS and STORC algorithms, can be used to solve high-dimensional statistical estimation problems efficiently, especially when the projection step in gradient descent type algorithms is computationally hard. While efficient gradient complexity results for CGS and STORC have been established in the literature, such results require the objective function to be strongly convex and each individual component to be convex. In this paper we relax these restrictive assumptions, which are hardly ever satisfied in high-dimensional statistics, to restricted strong convexity, which holds in various statistical models, and show that the same gradient evaluation complexity can be maintained under this more general condition.
References
 Agarwal et al. (2010) Agarwal, A., Negahban, S., Wainwright, M. J., 2010. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In: Advances in Neural Information Processing Systems 23. pp. 37–45.
 Allen-Zhu and Yuan (2016) Allen-Zhu, Z., Yuan, Y., 2016. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In: Proceedings of The 33rd International Conference on Machine Learning. pp. 1080–1089.
 Bottou (2010) Bottou, L., 2010. Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (Eds.), Proceedings of COMPSTAT'2010. Physica-Verlag HD, Heidelberg, pp. 177–186.
 Collins et al. (2008) Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P. L., 2008. Exponentiated gradient algorithms for conditional random fields and max-margin Markov networks. J. Mach. Learn. Res. 9, 1775–1822.
 Defazio et al. (2014) Defazio, A., Bach, F., Lacoste-Julien, S., 2014. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. ArXiv e-prints.
 Duchi et al. (2008) Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T., 2008. Efficient projections onto the l1-ball for learning in high dimensions. In: ICML '08. pp. 272–279.
 Fujishige and Isotani (2011) Fujishige, S., Isotani, S., 2011. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization.
 Garber and Hazan (2015) Garber, D., Hazan, E., 2015. Faster rates for the Frank-Wolfe method over strongly-convex sets. In: ICML '15. pp. 541–549.
 Hazan and Luo (2016) Hazan, E., Luo, H., 2016. Variance-reduced and projection-free stochastic optimization. Vol. 48 of Proceedings of Machine Learning Research. pp. 1263–1271.
 Jaggi (2013) Jaggi, M., 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. Vol. 28 of Proceedings of Machine Learning Research. pp. 427–435.
 Koltchinskii et al. (2011) Koltchinskii, V., Lounici, K., Tsybakov, A. B., 2011. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics 39 (5), 2302–2329.
 Kuczyński and Woźniakowski (1992) Kuczyński, J., Woźniakowski, H., 1992. Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications 13 (4), 1094–1122.
 Lacoste-Julien (2016) Lacoste-Julien, S., 2016. Convergence rate of Frank-Wolfe for non-convex objectives. ArXiv e-prints.
 Lacoste-Julien and Jaggi (2015) Lacoste-Julien, S., Jaggi, M., 2015. On the global linear convergence of Frank-Wolfe optimization variants. ArXiv e-prints.
 Lan (2013) Lan, G., 2013. The complexity of large-scale convex programming under a linear optimization oracle. ArXiv e-prints.
 Lan and Zhou (2016) Lan, G., Zhou, Y., 2016. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization 26 (2), 1379–1409.
 Loh and Wainwright (2011) Loh, P.-L., Wainwright, M. J., 2011. High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity. ArXiv e-prints.
 Loh and Wainwright (2013) Loh, P.-L., Wainwright, M. J., 2013. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In: Advances in Neural Information Processing Systems 26. pp. 476–484.
 Frank and Wolfe (1956) Frank, M., Wolfe, P., 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3 (1), 95–110.
 Negahban et al. (2009) Negahban, S., Yu, B., Wainwright, M. J., Ravikumar, P. K., 2009. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In: Advances in Neural Information Processing Systems 22. pp. 1348–1356.
 Nesterov (1983) Nesterov, Y., 1983. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. Soviet Mathematics Doklady 27 (2), 372–376.
 Nesterov (2013) Nesterov, Y., 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140 (1), 125–161.
 Qu et al. (2016) Qu, C., Li, Y., Xu, H., 2016. Linear convergence of SVRG in statistical estimation. ArXiv e-prints.
 Qu et al. (2017a) Qu, C., Li, Y., Xu, H., 2017a. Non-convex conditional gradient sliding. ArXiv e-prints.
 Qu et al. (2017b) Qu, C., Li, Y., Xu, H., 2017b. SAGA and restricted strong convexity. ArXiv e-prints.
 Qu and Xu (2017) Qu, C., Xu, H., 2017. Linear convergence of SDCA in statistical estimation. ArXiv e-prints.
 Reddi et al. (2016) Reddi, S. J., Sra, S., Poczos, B., Smola, A., 2016. Stochastic Frank-Wolfe methods for nonconvex optimization. In: 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). pp. 1244–1251.
 Shalev-Shwartz and Zhang (2012) Shalev-Shwartz, S., Zhang, T., 2012. Stochastic dual coordinate ascent methods for regularized loss minimization. ArXiv e-prints.
 Xiao and Zhang (2014) Xiao, L., Zhang, T., 2014. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24 (4), 2057–2075.
Appendix A Proof overview
In this section we provide a roadmap that we will follow in establishing the complexity results presented in the paper. It is convenient to note that an $L$-smooth function satisfies the following property, which is useful in our proof:
(26) 
for all $x, y$. If $f$ is additionally convex, then:
(27) 
For simplicity of exposition, we emphasize our proof scheme for the convex case here. The basic idea of our proof is to re-analyze the convergence of the aforementioned algorithms with strong convexity replaced by restricted strong convexity. More precisely, the first important lemma says that the optimum $\hat{x}$ of (1) is in fact close to the target parameter $x^*$.
Lemma 2.
In problem (1), if $\mathcal{R}(x^*) \le \rho$, then the error $\hat{x} - x^*$ belongs to the set:
(28) 
Now, combining (28) with the definition of restricted strong convexity, we obtain the following pseudo strong convexity structure of the objective function, which holds at the global optimum:
(29) 
where the last inequality combines (28), the Cauchy-Schwarz inequality, and the definition of the statistical error. To analyze the convergence of CGS and STORC, we first present a general convergence result for the inner iterations. This theorem covers both the deterministic case, as in CGS, and the stochastic case, as in STORC.
Theorem 3.
Fix an outer iteration; if for every inner iteration in CGS and STORC the stated bound on the gradient estimation error holds, then we have:
(30) 
where we have defined as:
(31) 
Notice that for the CGS algorithm, which corresponds to the deterministic case, the gradient estimation error vanishes at every iteration. By specifying the particular parameters of the two algorithms, we obtain concrete convergence results for the inner iterations, provided we can control the estimation error to decrease at a suitable rate. The proof of the following corollary is a simple induction after plugging all the parameters into (30).
Corollary 3.
Fix an outer iteration and suppose the stated initial condition holds. Then, with the parameters set as above, if we can control the gradient estimation error appropriately for all inner iterations, we have:
(32) 
Again, since the estimation error vanishes for the CGS algorithm, the claim of the corollary follows immediately once we have specified appropriate parameters. The main challenge of our proof is to control the estimation error as prescribed by the corollary for the STORC algorithm. We now explain why Corollary 3 almost completes our argument: if we can control the estimation error for all inner iterations, then by this corollary we have:
(33) 
where the first inequality uses the parameter specification, and the second inequality uses the definition of the condition number, the pseudo strong convexity (29), and the assumption that we have not yet reached the statistical precision. Hence, by specifying appropriate parameters as in CGS and STORC, we establish that the convergence of the outer iterations satisfies
(34) 
With this simple recursion for the outer iterations and a careful summation of the calls to the gradient evaluation and the linear oracle per iteration, the claims for CGS and STORC follow immediately; we leave the details of the proof to the rest of the supplementary material.
Appendix B Proof of Lemma 1
Proof.
We first have
(35) 
Define the error matrix accordingly. For the convex loss: by Lemma 7 of Agarwal et al. (2010), there exist universal positive constants such that:
(36) 
with probability at least the stated level, which gives the result for the convex case. For the nonconvex loss, a similar concentration argument applies, and we have that, with probability at least the stated level,
(37) 
Now, using the assumption on the sample size, we obtain the result immediately. ∎