# Projection-Free Algorithms in Statistical Estimation

The Frank-Wolfe (FW) algorithm and its variants have attracted a surge of interest in the machine learning community due to their projection-free property. Recently, the gradient evaluation complexity of FW-type algorithms has been reduced to O(log(1/ϵ)) for smooth and strongly convex objectives. This complexity result is especially significant in learning problems, where the overwhelming data size makes even a single gradient evaluation computationally expensive. However, in high-dimensional statistical estimation problems the objective is typically not strongly convex, and is sometimes even non-convex. In this paper, we extend the state-of-the-art FW-type algorithms to large-scale, high-dimensional estimation problems. We show that as long as the objective satisfies restricted strong convexity, and we do not optimize beyond the statistical limit of the model, the O(log(1/ϵ)) gradient evaluation complexity can still be attained.


## 1 Introduction

High-dimensional statistics has achieved remarkable success in the last decade, including results on consistency and convergence rates for various estimators under non-asymptotic high-dimensional scaling, especially when the problem dimension is larger than the number of data points. It is now well known that while this setup appears ill-posed, estimation or recovery is indeed possible by exploiting the underlying structure of the parameter space. Notable examples include sparse vectors, low-rank matrices, and structured regression functions, among others. To impose such structural assumptions, a popular approach is structural risk minimization, which can generally be formulated as the following optimization problem:

$$\min_{\varphi(\theta)\leqslant\rho}\; f(\theta)=\frac{1}{n}\sum_{i=1}^{n}f_i(\theta) \qquad (1)$$

where φ denotes a regularizer and ρ controls the strength of the regularization. We denote Ω = {θ : φ(θ) ⩽ ρ} as the constraint set of our problem. First-order optimization methods for the structural risk minimization problem have been studied extensively in recent years, with fruitful developments. The (projected) gradient descent (PGD) algorithm (Nesterov, 2013) and its variants are arguably the most popular algorithms used in practice. It is well known that projected gradient descent has O(1/ϵ) iteration complexity for smooth convex objectives, and O(log(1/ϵ)) for smooth and strongly convex objectives. When solving (1) with a batch first-order algorithm, a full pass over the entire dataset to evaluate the gradient can be computationally expensive. In comparison, the stochastic gradient descent (SGD) algorithm computes a noisy gradient over one sample or a minibatch, which has light computational overhead; however, SGD is known to converge at a significantly slower rate (Bottou, 2010). Stochastic variance-reduced algorithms such as SVRG (Xiao and Zhang, 2014) and SAGA (Defazio et al., 2014) combine the merits of both, retaining cheap per-iteration cost while converging linearly for strongly convex objectives. Our main contributions are summarized as follows:
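To make the projection-free appeal concrete, here is a minimal sketch (ours, not from the paper) of the linear oracle over an ℓ1 ball: whereas a Euclidean projection onto the ℓ1 ball requires a sorting-based routine (Duchi et al., 2008), the LO reduces to a single argmax over coordinates.

```python
import numpy as np

def l1_linear_oracle(grad, rho):
    """Solve min_{||s||_1 <= rho} <grad, s>.
    The minimizer is a signed vertex of the l1 ball: -rho * sign(grad_i) * e_i
    at the coordinate i with the largest |grad_i|."""
    i = np.argmax(np.abs(grad))
    s = np.zeros_like(grad)
    s[i] = -rho * np.sign(grad[i])
    return s

g = np.array([0.3, -2.0, 1.1])
s = l1_linear_oracle(g, rho=1.0)
# the oracle pushes against the largest-magnitude gradient coordinate,
# and <g, s> = -rho * max_i |g_i|
```

This O(d) cost per step, versus the more involved projection, is the root of the computational advantage of FW-type methods discussed throughout the paper.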

• Assuming f is restricted strongly convex, the Conditional Gradient Sliding (CGS) algorithm attains the O(log(1/ϵ)) gradient evaluation complexity up to the statistical limit of the model. Our result matches the known result that requires the strong convexity assumption.

• Assuming f is restricted strongly convex and each f_i is convex, the stochastic variance-reduced CGS (STORC) algorithm attains the O(log(1/ϵ)) gradient evaluation complexity up to the statistical limit of the model. Our result also matches the known result that requires the strong convexity assumption. A similar complexity can be obtained even if each f_i is non-convex.

• Both CGS and STORC still maintain the optimal O(1/ϵ) LO complexity under restricted strong convexity.

• The main technical challenge comes from non-convexity. For instance, to handle non-convexity in STORC, we exploit the notion of lower smoothness and generalize the analysis of STORC to a type of non-convex setting.

## 2 Problem Setup

We begin with some definitions from the classic optimization literature. A function f is L-smooth and σ-strongly convex with respect to a norm ∥·∥ if for all x, y:

$$\frac{\sigma}{2}\|x-y\|^2 \leqslant f(y)-f(x)-\langle\nabla f(x),\,y-x\rangle \leqslant \frac{L}{2}\|x-y\|^2 \qquad (2)$$

We additionally say f is l-lower smooth if:

$$-\frac{l}{2}\|y-x\|^2 \leqslant f(y)-f(x)-\langle\nabla f(x),\,y-x\rangle. \qquad (3)$$
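The sandwich in (2) can be checked numerically: for a quadratic f(x) = ½xᵀAx, the middle (Bregman-type) term equals ½(y−x)ᵀA(y−x), so σ = λ_min(A) and L = λ_max(A) make both inequalities hold. A quick sketch (our illustration, with an arbitrarily generated A):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = B @ B.T + np.eye(5)                     # symmetric positive definite
sigma, L = np.linalg.eigvalsh(A)[[0, -1]]   # smallest/largest eigenvalues

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x, y = rng.standard_normal(5), rng.standard_normal(5)
bregman = f(y) - f(x) - grad(x) @ (y - x)   # middle term of (2)
d2 = np.sum((x - y) ** 2)

# the sandwich (2) with sigma = lambda_min(A), L = lambda_max(A)
assert sigma / 2 * d2 <= bregman <= L / 2 * d2
```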

Restricted Strong Convexity. Restricted strong convexity was initially proposed in (Negahban et al., 2009) to establish statistical consistency of regularized M-estimators, and was later exploited to establish linear convergence of the gradient descent algorithm up to the statistical limit of the model (Agarwal et al., 2010). We say f satisfies restricted strong convexity with respect to φ and the norm ∥·∥ with parameters (σ, τ_σ) if:

$$f(y)-f(x)-\langle\nabla f(x),\,y-x\rangle \geqslant \frac{\sigma}{2}\|y-x\|^2 - \tau_\sigma\,\varphi^2(y-x). \qquad (4)$$

Decomposable regularizer. Intuitively, for restricted strong convexity to behave like strong convexity, one needs the penalty term τ_σ φ²(y−x) in definition (4) to be dominated by the quadratic term. This is an essential argument for establishing consistency of M-estimators and fast convergence of the PGD algorithm. A decomposable regularizer is a sufficient condition to establish such a result. Given a subspace pair (M, M̄) with M ⊆ M̄, both belonging to a Hilbert space H, define M̄⊥ to be the orthogonal complement of M̄. We say that the regularizer φ is decomposable with respect to (M, M̄⊥) if:

$$\varphi(x+y)=\varphi(x)+\varphi(y) \quad \forall\, x\in M,\; y\in \bar{M}^{\perp} \qquad (5)$$

We further define the subspace compatibility as ϕ(M̄) = sup_{x∈M̄, x≠0} φ(x)/∥x∥.

Assumptions. We will make the following assumptions throughout this paper. We assume each f_i is L-smooth, hence f is also L-smooth. If f_i is non-convex, we will assume it is l-lower smooth. Finally, we assume that f satisfies restricted strong convexity with respect to φ with parameters (σ, τ_σ).
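As a concrete instance of definition (5), the ℓ1 norm is decomposable with respect to coordinate subspaces: taking M = M̄ to be the vectors supported on a set S, the subspace compatibility with respect to the Euclidean norm is ϕ(M̄) = √|S|. A small numerical check (our illustration):

```python
import numpy as np

d, S = 10, [0, 3, 7]                      # model subspace M: support S
Sc = [i for i in range(d) if i not in S]  # its orthogonal complement

rng = np.random.default_rng(1)
x = np.zeros(d); x[S] = rng.standard_normal(len(S))    # x in M
y = np.zeros(d); y[Sc] = rng.standard_normal(len(Sc))  # y in M-perp

# decomposability (5): phi(x + y) = phi(x) + phi(y) for the l1 norm
assert np.isclose(np.abs(x + y).sum(), np.abs(x).sum() + np.abs(y).sum())

# subspace compatibility: sup_{x in M} ||x||_1 / ||x||_2 = sqrt(|S|)
ratio = np.abs(x).sum() / np.linalg.norm(x)
assert ratio <= np.sqrt(len(S)) + 1e-12
```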

### 2.1 Example: Matrix Regression

In this subsection we put the previously defined notions into context and consider the matrix regression problem: let Θ⋆ be an unknown low-rank matrix to be estimated, and suppose we have observations from the model y_i = ⟨X_i, Θ⋆⟩ + ε_i, where the sensing matrices X_i are i.i.d. Gaussian with covariance Σ_x. We denote ⟨A, B⟩ = trace(AᵀB) as the Frobenius inner product and ∥A∥_F as the Frobenius norm of a matrix A.

Decomposability of the nuclear norm. Let ∥Θ∥_* denote the nuclear norm of a matrix Θ; it is well known that ∥A + B∥_* = ∥A∥_* + ∥B∥_* if A and B have orthogonal row spaces and column spaces. Let Θ⋆ = UDVᵀ be the singular value decomposition of Θ⋆. Define U_r as the submatrix of U consisting of its first r columns, and similarly define V_r. Letting col(A) denote the column space of a matrix A, we define the subspace pair:

$$M(U_r,V_r)=\{\Theta:\operatorname{col}(\Theta^T)\subseteq\operatorname{col}(V_r),\;\operatorname{col}(\Theta)\subseteq\operatorname{col}(U_r)\}$$
$$\bar{M}^{\perp}(U_r,V_r)=\{\Theta:\operatorname{col}(\Theta^T)\subseteq\operatorname{col}(V_r)^{\perp},\;\operatorname{col}(\Theta)\subseteq\operatorname{col}(U_r)^{\perp}\}$$

Then any Θ ∈ M(U_r, V_r) and Θ′ ∈ M̄⊥(U_r, V_r) have orthogonal row and column subspaces, hence we have the decomposability of the nuclear norm with respect to this subspace pair. We will use vec(Θ) to denote the vectorization of the matrix Θ. We define the loss function for matrix regression in (6); the matrix Γ̂ and vector γ̂ will be specified separately depending on whether the sensing matrix is observed with noise.

$$\mathcal{L}_n(\Theta)=\frac{1}{2}\operatorname{vec}(\Theta)^T\hat{\Gamma}\operatorname{vec}(\Theta)-\langle\hat{\gamma},\operatorname{vec}(\Theta)\rangle \qquad (6)$$

Convex loss: If the sensing matrices are observed without noise, we set Γ̂ = (1/n)XᵀX and γ̂ = (1/n)Xᵀy, where the i-th row of the matrix X is vec(X_i)ᵀ. It is easy to verify that, up to an additive constant independent of Θ:

$$\mathcal{L}_n(\Theta)=\frac{1}{2n}\sum_{i=1}^{n}\big(y_i-\langle X_i,\Theta\rangle\big)^2 \qquad (7)$$
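The match between the quadratic form (6) and the least-squares form (7) can be verified numerically. The following sketch assumes the construction Γ̂ = (1/n)∑ vec(X_i)vec(X_i)ᵀ and γ̂ = (1/n)∑ y_i vec(X_i) described above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 20, 4
X = rng.standard_normal((n, d, d))          # sensing matrices X_i
Theta = rng.standard_normal((d, d))
y = np.einsum('nij,ij->n', X, Theta) + 0.1 * rng.standard_normal(n)

Xv = X.reshape(n, d * d)                    # rows are vec(X_i)
Gamma = Xv.T @ Xv / n                       # \hat\Gamma
gamma = Xv.T @ y / n                        # \hat\gamma

v = Theta.reshape(-1)
quad_form = 0.5 * v @ Gamma @ v - gamma @ v          # (6)
least_sq = 0.5 * np.mean((y - Xv @ v) ** 2)          # (7)

# (6) and (7) differ only by the Theta-independent term mean(y^2)/2
assert np.isclose(quad_form + 0.5 * np.mean(y ** 2), least_sq)
```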

Non-convex loss: Suppose the sensing matrices are observed with additive noise, i.e., we observe Z_i = X_i + W_i, where W_i is independent of X_i and has known covariance Σ_w. We set Γ̂ = (1/n)ZᵀZ − Σ_w and γ̂ = (1/n)Zᵀy, where the i-th row of the matrix Z is vec(Z_i)ᵀ. Up to an additive constant independent of Θ, we can rewrite (6) as:

$$\mathcal{L}_n(\Theta)=\frac{1}{2n}\sum_{i=1}^{n}\Big\{\big(y_i-\langle Z_i,\Theta\rangle\big)^2-\operatorname{vec}(\Theta)^T\Sigma_w\operatorname{vec}(\Theta)\Big\} \qquad (8)$$

Note that (1/n)ZᵀZ has rank at most n, and we subtract from it the positive definite matrix Σ_w; hence Γ̂ cannot be positive semidefinite when n is smaller than the ambient dimension (the typical setup for matrix regression), which results in the non-convexity of L_n. Structural risk minimization: To impose low-rankness on the recovered matrix, a popular approach is to minimize the objective over a nuclear norm ball with a suitable radius (Koltchinskii et al., 2011). The matrix regression problem is defined by:

$$\min_{\|\Theta\|_*\leqslant\rho}\;\mathcal{L}_n(\Theta) \qquad (9)$$
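The rank argument above is easy to see numerically: with n < d², the empirical second-moment matrix is rank-deficient, so subtracting a positive definite Σ_w necessarily produces negative eigenvalues. A sketch under an assumed isotropic noise covariance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 10, 4                                 # n = 10 < d^2 = 16
sigma_w = 0.5
Z = rng.standard_normal((n, d * d))          # rows play the role of vec(Z_i)
Sigma_w = sigma_w ** 2 * np.eye(d * d)       # assumed known noise covariance

Gamma = Z.T @ Z / n - Sigma_w                # bias-corrected \hat\Gamma
eigs = np.linalg.eigvalsh(Gamma)

# Z.T @ Z / n has rank at most n < d^2, so at least d^2 - n eigenvalues
# of Gamma fall strictly below zero
assert eigs.min() < 0
assert (eigs < -1e-9).sum() >= d * d - n
```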

Lemma 1 establishes restricted strong convexity of the loss function in the matrix regression problem, which was essentially proved in (Agarwal et al., 2010).

###### Lemma 1.

Suppose vec(X_i) is i.i.d. sampled from N(0, Σ_x), and define ξ(Σ_x) as in (Agarwal et al., 2010). Then for the loss function (7):

$$\mathcal{L}_n(V)-\mathcal{L}_n(U)-\langle\nabla\mathcal{L}_n(U),V-U\rangle \geqslant \frac{\lambda_{\min}(\Sigma_x)}{2}\|V-U\|_F^2-c\,\xi(\Sigma_x)\frac{d}{n}\|V-U\|_*^2$$

Suppose additionally that vec(W_i) is i.i.d. sampled from N(0, Σ_w) and independent of X_i. Under the additional sample-size assumption of (Agarwal et al., 2010), for the loss function (8):

$$\mathcal{L}_n(V)-\mathcal{L}_n(U)-\langle\nabla\mathcal{L}_n(U),V-U\rangle \geqslant \frac{\lambda_{\min}(\Sigma_x)}{4}\|V-U\|_F^2-c\,\xi(\Sigma_x)\frac{d}{n}\|V-U\|_*^2$$

Both bounds hold with probability at least 1 − c₁ exp(−c₂ n), for some absolute constants c, c₁, c₂.

## 3 Theoretical Results

In this section we present the complexity results for CGS and STORC under the restricted strong convexity assumption. A few more definitions are needed to present our formal results.

Definition. Let f be a function satisfying restricted strong convexity with respect to φ with parameters (σ, τ_σ). Suppose φ is decomposable with respect to the subspace pair (M, M̄⊥), and let Π_M denote the projection onto a subspace M. Let θ⋆ be the unknown target parameter and θ̂ be the optimal solution of (1). We define the following quantities:

• Optimization error: Δ⋆ := θ̂ − θ⋆;

• Effective strong convexity parameter: σ̂;

• Statistical error: ϵ_stat.

Parameter Specification. In the CGS and STORC algorithms, we set:

$$N_t=8\sqrt{\frac{L}{\hat{\sigma}}},\quad \gamma_k=\frac{2}{k+1},\quad \beta_k=\frac{3L}{k},\quad \eta_{t,k}=\frac{8L\,\delta_0\,2^{-t}}{\hat{\sigma}N_t k}; \qquad (10)$$

for convex STORC we set:

$$m_{t,k}=5200\,N_t\frac{L}{\hat{\sigma}}; \qquad (11)$$

and for non-convex STORC we set:

$$\tilde{L}=(L+l)\left(1+\frac{l}{\hat{\sigma}}\right),\quad m_{t,k}=8000\,N_t\frac{\tilde{L}}{\hat{\sigma}}. \qquad (12)$$

We present the batch algorithm CGS (Lan and Zhou, 2016) in Algorithm 1. CGS is a clever combination of Nesterov's accelerated gradient descent (AGD) (Nesterov, 1983/2) and the Frank-Wolfe algorithm: it essentially uses the FW algorithm to solve the projection subproblem in AGD.
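For the nuclear-norm constraint used in our matrix regression example, the linear oracle invoked by CGS is particularly cheap: it needs only the leading singular pair of the gradient, rather than the full SVD that projection requires. A minimal sketch (our illustration; a practical implementation would obtain the top pair by power iteration or Lanczos, cf. Kuczyński and Woźniakowski (1992)):

```python
import numpy as np

def nuclear_linear_oracle(G, rho):
    """Solve min_{||S||_* <= rho} <G, S>_F.
    The minimizer is -rho * u1 v1^T, where (u1, v1) is the leading
    singular pair of G -- a rank-1 computation, versus the full SVD
    needed to project onto the nuclear-norm ball."""
    U, s, Vt = np.linalg.svd(G)
    return -rho * np.outer(U[:, 0], Vt[0, :])

G = np.diag([3.0, 1.0, 0.5])
S = nuclear_linear_oracle(G, rho=2.0)
# optimal value is -rho * ||G||_op = -2 * 3 = -6
```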

We now present our result for Conditional Gradient Sliding. We emphasize that our result holds uniformly for both the convex and the non-convex settings, with a parameter setting that does not depend on convexity.

###### Theorem 1.

Let f satisfy our assumptions and be restricted strongly convex with respect to φ with parameters (σ, τ_σ). Suppose φ is decomposable with respect to the subspace pair (M, M̄⊥) and σ̂ > 0. Let θ₀ be an initial estimate such that f(θ₀) − f(θ̂) ⩽ δ₀, and let D be the diameter of the feasible set Ω. If we run CGS with the parameters specified in (10), then for both the convex and the non-convex case: for any ϵ > 0, in order for the CGS algorithm to obtain an iterate θ_t such that f(θ_t) − f(θ̂) ⩽ ϵ, the numbers of calls to the gradient evaluation and linear oracles are bounded respectively by:

$$O\left(n\sqrt{\frac{L}{\hat{\sigma}}}\,\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right), \qquad (13)$$

and

$$O\left(\frac{LD^2}{\epsilon}+\sqrt{\frac{L}{\hat{\sigma}}}\,\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right). \qquad (14)$$

Furthermore, taking ϵ of order σ̂ϵ²_stat, once this precision is attained we have:

$$\big\|\theta_t-\hat{\theta}\big\|^2 \leqslant \epsilon_{\mathrm{stat}}^2. \qquad (15)$$

Remarks: We note that these bounds are known to be optimal for smooth and strongly convex functions (Lan and Zhou, 2016), and we extend the applicability of CGS in the following sense: the bounds also hold for a class of convex but not strongly convex functions, and even non-convex functions, provided that they satisfy restricted strong convexity. While our result mildly restricts the precision up to which the algorithm converges at the predicted rate, we remark that there would be no additional statistical gain from optimizing beyond this precision. In fact, in many models ϵ_stat is shown to be of the same or lower order than the statistical precision of the model, as illustrated in the following corollary.

###### Corollary 1.

Consider the convex and non-convex matrix regression problems (9). Under the conditions of Lemma 1, and for a sample size satisfying the scaling required there, σ̂ is lower bounded by a constant multiple of σ_min(Σ_x) and ϵ_stat is of the order of the statistical precision of the model. Then with

$$O\left(n\sqrt{\frac{\sigma_{\max}(\Sigma_x)}{\sigma_{\min}(\Sigma_x)}}\,\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right) \qquad (16)$$

gradient evaluations and

$$O\left(\frac{\sigma_{\max}(\Sigma_x)\rho^2}{\epsilon}+\sqrt{\frac{\sigma_{\max}(\Sigma_x)}{\sigma_{\min}(\Sigma_x)}}\,\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right) \qquad (17)$$

calls to the linear oracle, Algorithm 1 achieves an optimality gap f(θ_t) − f(θ̂) ⩽ ϵ, and the distance to the optimum satisfies:

$$\big\|\theta_t-\hat{\theta}\big\|_F^2 \leqslant c_2\cdot\|\Delta^\star\|_F^2 \qquad (18)$$

where c₂ is an absolute constant.

Next we present the Stochastic Variance-Reduced Conditional Gradient Sliding (STORC) (Hazan and Luo, 2016). Its only difference from the CGS algorithm is to replace the full gradient with a variance-reduced gradient in SVRG style. We use t to index the outer iterations and k to index the inner iterations. We compute the full gradient at a snapshot point once per outer iteration, and at each inner iteration we replace the gradient ∇f(x_k) with the estimator (1/|S|)∑_{i∈S}(∇f_i(x_k) − ∇f_i(x̃)) + ∇f(x̃), where x̃ is the snapshot point and S is a set of m_{t,k} indices sampled uniformly and independently from {1, …, n}. The details are presented in Algorithm 2.
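The variance-reduced estimator described above can be sketched as follows (our illustration on a toy finite sum, not the paper's code); note that at the snapshot point the correction term vanishes and the estimator coincides with the full gradient:

```python
import numpy as np

def vr_gradient(grad_i, x, snapshot, full_grad, batch):
    """SVRG-style estimator used inside STORC's inner loop:
    (1/|S|) sum_{i in S} (grad_i(x) - grad_i(snapshot)) + full_grad.
    It is unbiased for grad f(x), and its variance shrinks as both
    x and the snapshot approach the optimum."""
    corr = np.mean([grad_i(i, x) - grad_i(i, snapshot) for i in batch], axis=0)
    return corr + full_grad

# toy least squares: f_i(x) = 0.5 * (a_i . x - b_i)^2
rng = np.random.default_rng(4)
n, d = 50, 3
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
grad_i = lambda i, x: (A[i] @ x - b[i]) * A[i]

snapshot = rng.standard_normal(d)
full_grad = np.mean([grad_i(i, snapshot) for i in range(n)], axis=0)

x = snapshot.copy()   # at the snapshot the correction vanishes ...
g = vr_gradient(grad_i, x, snapshot, full_grad, batch=[1, 5, 9])
# ... and the estimator equals the full gradient exactly
```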

###### Theorem 2.

Under the same conditions as Theorem 1: if each f_i is convex, run STORC with the parameters specified in (10) and (11); if some f_i is non-convex but l-lower smooth, run STORC with the parameters specified in (10) and (12). Then for any ϵ > 0, in order to obtain an iterate θ_t such that E[f(θ_t) − f(θ̂)] ⩽ ϵ, the number of gradient evaluations is bounded by:

$$\text{Convex:}\quad O\left(\left(n+\Big(\frac{L}{\hat{\sigma}}\Big)^2\right)\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right), \qquad (19)$$

$$\text{Non-convex:}\quad O\left(\left(n+\frac{\tilde{L}L}{\hat{\sigma}^2}\right)\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right), \qquad (20)$$

and the number of calls to the linear oracle is bounded by:

$$O\left(\frac{LD^2}{\epsilon}+\sqrt{\frac{L}{\hat{\sigma}}}\,\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right). \qquad (21)$$

Furthermore, taking ϵ of order σ̂ϵ²_stat, once this precision is attained we have:

$$\big\|\theta_t-\hat{\theta}\big\|^2 \leqslant \epsilon_{\mathrm{stat}}^2. \qquad (22)$$

Remarks: For the convex loss, our result parallels the original STORC analysis, except that we do not need f to be strongly convex. For the non-convex loss, our result comes from an analysis of STORC applied to an objective that satisfies restricted strong convexity (indeed RSC suffices) yet is a sum of non-convex functions. Note that similar results had previously been shown only for variance-reduced PGD-type algorithms (Allen-Zhu and Yuan, 2016). We also note that whenever l = O(σ̂), we have L̃ = O(L), which implies the complexity for the non-convex objective reduces to that of the convex objective; in other words, we pay no penalty for handling the non-convexity. Since we can always bound l by L, in the worst case the gradient evaluation complexity for the non-convex loss is O((n + (L/σ̂)³) log₂(δ₀/ϵ)).

###### Corollary 2.

Consider the convex and non-convex matrix regression problems (9). Under the conditions of Lemma 1, and for a sample size satisfying the scaling required there, σ̂ is lower bounded by a constant multiple of σ_min(Σ_x) and ϵ_stat is of the order of the statistical precision of the model, for a universal constant. Then with

$$O\left(\left(n+\left(\frac{\sigma_{\max}(\Sigma_x)}{\sigma_{\min}(\Sigma_x)}\right)^2\right)\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right) \qquad (23)$$

gradient evaluations and

$$O\left(\frac{\sigma_{\max}(\Sigma_x)\rho^2}{\epsilon}+\sqrt{\frac{\sigma_{\max}(\Sigma_x)}{\sigma_{\min}(\Sigma_x)}}\,\log_2\!\left(\frac{\delta_0}{\epsilon}\right)\right) \qquad (24)$$

calls to the linear oracle, Algorithm 2 achieves an optimality gap E[f(θ_t) − f(θ̂)] ⩽ ϵ, and the distance to the optimum satisfies

$$\big\|\theta_t-\hat{\theta}\big\|_F^2 \leqslant c_2\cdot\|\Delta^\star\|_F^2 \qquad (25)$$

where c₂ is an absolute constant.

## 4 Simulation

We compare the performance of CGS and STORC with the batch projected gradient descent and SVRG algorithms on the matrix regression problem (9). For the convex loss problem we generate data y_i = ⟨X_i, Θ⋆⟩ + ε_i, where Θ⋆ is a low-rank matrix. We sample X_i with a diagonal covariance Σ_x whose diagonal entries are all equal except the first, which is set larger to illustrate the impact of the condition number on performance. Figure 1 reports the simulation results for the convex loss problem. For the batch algorithms CGS and projected gradient descent, the computation time per iteration is dominated by gradient evaluation. As our theory predicts, when the required precision is not too small, the CGS algorithm outperforms projected gradient descent due to its acceleration in terms of gradient evaluation complexity (square-root dependence on the condition number). This is even more significant for ill-conditioned problems, as in Figure 1(c). Both STORC and SVRG outperform their batch counterparts by saving gradient evaluations significantly per iteration. Since STORC performs an LO step that only requires computing the leading singular vectors, it outperforms SVRG, which involves a full SVD computation at each iteration. In fact, we observe that SVRG can even be outperformed by CGS, which further emphasizes the importance of replacing the full projection with a linear optimization oracle.

For the non-convex loss problem, we generate Θ⋆ in the same way as for the convex loss. Instead of observing X_i we only observe Z_i = X_i + W_i, where W_i is noise independent of X_i. Figure 2 reports the simulation results for the non-convex loss. Again, CGS outperforms batch gradient descent due to its accelerated gradient evaluation complexity. Notice that here the lower smoothness parameter satisfies l = O(σ̂), which by our theorem implies that the non-convexity essentially causes no extra computational overhead for the STORC algorithm. As predicted by our theoretical results, STORC outperforms SVRG by replacing the projection step with the much cheaper linear oracle.

## 5 Conclusions

In this paper we show that batch and stochastic variants of the Frank-Wolfe algorithm, namely the CGS and STORC algorithms, can be used to solve high-dimensional statistical estimation problems efficiently, especially when the projection step in gradient descent type algorithms is computationally hard. While efficient gradient complexity results for CGS and STORC have been established in the literature, those results require the objective function to be strongly convex and each individual f_i to be convex. In this paper we relax these restrictive assumptions, which are hardly ever satisfied in high-dimensional statistics, to restricted strong convexity, which holds in various statistical models, and show that the same gradient evaluation complexity can be maintained under this more general condition.

## References

• Agarwal et al. (2010) Agarwal, A., Negahban, S., Wainwright, M. J., 2010. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In: Advances in Neural Information Processing Systems 23. pp. 37–45.
• Allen-Zhu and Yuan (2016) Allen-Zhu, Z., Yuan, Y., 20–22 Jun 2016. Improved svrg for non-strongly-convex or sum-of-non-convex objectives. In: Proceedings of The 33rd International Conference on Machine Learning. pp. 1080–1089.
• Bottou (2010) Bottou, L., 2010. Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (Eds.), Proceedings of COMPSTAT’2010. Physica-Verlag HD, Heidelberg, pp. 177–186.
• Collins et al. (2008) Collins, M., Globerson, A., Koo, T., Carreras, X., Bartlett, P. L., Jun. 2008. Exponentiated gradient algorithms for conditional random fields and max-margin markov networks. J. Mach. Learn. Res. 9, 1775–1822.
• Defazio et al. (2014) Defazio, A., Bach, F., Lacoste-Julien, S., Jul. 2014. SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. ArXiv e-prints.
• Duchi et al. (2008) Duchi, J., Shalev-Shwartz, S., Singer, Y., Chandra, T., 2008. Efficient projections onto the l1-ball for learning in high dimensions. ICML ’08. pp. 272–279.
• Fujishige and Isotani (2011) Fujishige, S., Isotani, S., 2011. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization.
• Garber and Hazan (2015) Garber, D., Hazan, E., 2015. Faster rates for the frank-wolfe method over strongly-convex sets. ICML’15. pp. 541–549.
• Hazan and Luo (2016) Hazan, E., Luo, H., 20–22 Jun 2016. Variance-reduced and projection-free stochastic optimization. Vol. 48 of Proceedings of Machine Learning Research. pp. 1263–1271.
• Jaggi (2013) Jaggi, M., 17–19 Jun 2013. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. Vol. 28 of Proceedings of Machine Learning Research. pp. 427–435.
• Koltchinskii et al. (2011) Koltchinskii, V., Lounici, K., Tsybakov, A. B., 2011. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics 39 (5), 2302–2329.
• Kuczyński and Woźniakowski (1992) Kuczyński, J., Woźniakowski, H., 1992. Estimating the largest eigenvalue by the power and lanczos algorithms with a random start. SIAM Journal on Matrix Analysis and Applications 13 (4), 1094–1122.

• Lacoste-Julien (2016) Lacoste-Julien, S., Jul. 2016. Convergence Rate of Frank-Wolfe for Non-Convex Objectives. ArXiv e-prints.
• Lacoste-Julien and Jaggi (2015) Lacoste-Julien, S., Jaggi, M., Nov. 2015. On the Global Linear Convergence of Frank-Wolfe Optimization Variants. ArXiv e-prints.
• Lan (2013) Lan, G., Sep. 2013. The Complexity of Large-scale Convex Programming under a Linear Optimization Oracle. ArXiv e-prints.
• Lan and Zhou (2016) Lan, G., Zhou, Y., 2016. Conditional gradient sliding for convex optimization. SIAM Journal on Optimization 26 (2), 1379–1409.
• Loh and Wainwright (2011) Loh, P.-L., Wainwright, M. J., Sep. 2011. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. ArXiv e-prints.
• Loh and Wainwright (2013) Loh, P.-L., Wainwright, M. J., 2013. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In: Advances in Neural Information Processing Systems 26. pp. 476–484.
• Frank and Wolfe (1956) Frank, M., Wolfe, P., 1956. An algorithm for quadratic programming. Naval Research Logistics Quarterly 3 (1), 95–110.
• Negahban et al. (2009) Negahban, S., Yu, B., Wainwright, M. J., Ravikumar, P. K., 2009. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. In: Advances in Neural Information Processing Systems 22. pp. 1348–1356.
• Nesterov (1983/2) Nesterov, Y., 1983/2. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady 27 (2), 372–376.
• Nesterov (2013) Nesterov, Y., Aug 2013. Gradient methods for minimizing composite functions. Mathematical Programming 140 (1), 125–161.
• Qu et al. (2016) Qu, C., Li, Y., Xu, H., Nov. 2016. Linear Convergence of SVRG in Statistical Estimation. ArXiv e-prints.
• Qu et al. (2017a) Qu, C., Li, Y., Xu, H., Aug. 2017a. Non-convex Conditional Gradient Sliding. ArXiv e-prints.
• Qu et al. (2017b) Qu, C., Li, Y., Xu, H., Feb. 2017b. SAGA and Restricted Strong Convexity. ArXiv e-prints.
• Qu and Xu (2017) Qu, C., Xu, H., Jan. 2017. Linear convergence of SDCA in statistical estimation. ArXiv e-prints.
• Reddi et al. (2016) Reddi, S. J., Sra, S., Poczos, B., Smola, A., Sept 2016. Stochastic frank-wolfe methods for nonconvex optimization. In: 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). pp. 1244–1251.
• Shalev-Shwartz and Zhang (2012) Shalev-Shwartz, S., Zhang, T., Sep. 2012. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. ArXiv e-prints.
• Xiao and Zhang (2014) Xiao, L., Zhang, T., 2014. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24 (4), 2057–2075.

## Appendix A Proof overview

In this section we provide a roadmap that we will follow in establishing the complexity results presented in the paper. It is convenient to note that an L-smooth function f satisfies the following properties, which are useful in our proof:

$$f(\lambda x+(1-\lambda)y)\geqslant \lambda f(x)+(1-\lambda)f(y)-\frac{L}{2}\lambda(1-\lambda)\|x-y\|^2 \qquad (26)$$

for any λ ∈ [0, 1]. If f is additionally convex, then:

$$\|\nabla f(x)-\nabla f(y)\|^2\leqslant 2L\big(f(y)-f(x)-\langle\nabla f(x),\,y-x\rangle\big). \qquad (27)$$

For simplicity of exposition, we emphasize the proof scheme for the convex case here. The basic idea of our proof is to re-analyze the convergence of the aforementioned algorithms with strong convexity replaced by restricted strong convexity. More precisely, let Δ_t = θ_t − θ̂; the first important lemma says that φ(Δ_t) is in fact comparable to ∥Δ_t∥.

###### Lemma 2.

In problem (1), for any feasible iterate θ_t, the error Δ_t = θ_t − θ̂ belongs to the set:

$$S(M,\bar{M})=\Big\{\Delta:\varphi(\Delta)\leqslant 2\phi(\bar{M})\|\Delta\|+2\varphi\big(\Pi_{M^\perp}(\theta^\star)\big)+2\varphi(\Delta^\star)+2\phi(\bar{M})\|\Delta^\star\|\Big\}. \qquad (28)$$

Now, combining (28) with the definition of restricted strong convexity, we obtain the following pseudo strong convexity of the objective function, which holds at the global optimum:

$$f(\theta)-f(\hat{\theta}) \geqslant \langle\nabla f(\hat{\theta}),\theta-\hat{\theta}\rangle+\frac{\sigma}{2}\big\|\theta-\hat{\theta}\big\|^2-\tau_\sigma\varphi^2(\theta-\hat{\theta}) \geqslant \frac{\hat{\sigma}}{2}\big\|\theta-\hat{\theta}\big\|^2-\hat{\sigma}\epsilon_{\mathrm{stat}}^2, \qquad (29)$$

where the last inequality combines (28), the Cauchy-Schwarz inequality, the first-order optimality of θ̂, and the definitions of σ̂ and ϵ_stat. To analyze the convergence of CGS and STORC, we first present a general convergence result for the inner iterations. This theorem covers both the deterministic case (CGS) and the stochastic case (STORC).

###### Theorem 3.

Fix an outer iteration t. Suppose that at every inner iteration k of CGS or STORC, the variance of the gradient estimator is bounded by σ_k². Then:

$$\mathbb{E}\big[f(y_k)-f(\hat{\theta})\big]\leqslant \beta_1\Gamma_k\big\|x_0-\hat{\theta}\big\|^2+\Gamma_k\sum_{i=1}^{k}\left[\frac{\eta_{t,i}}{\Gamma_i}+\frac{\gamma_i\sigma_i^2}{2\Gamma_i(\beta_i-L\gamma_i)}\right] \qquad (30)$$

where Γ_i is defined as:

$$\Gamma_i=\begin{cases}1 & i=1;\\ (1-\gamma_i)\,\Gamma_{i-1} & i\geqslant 2.\end{cases} \qquad (31)$$

Notice that for the CGS algorithm, which corresponds to the deterministic case, we have σ_i = 0 for every iteration i. By specifying the concrete parameters of the two algorithms, we obtain concrete convergence results for the inner iterations, provided we can control σ_i² to decrease at a certain rate. The proof of the following corollary is a simple induction after plugging all the parameters into (30).
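For the parameter choice γ_k = 2/(k+1) in (10), the recursion (31) admits the closed form Γ_k = 2/(k(k+1)), which is exactly the O(1/k²) factor behind the accelerated rate in the corollary below. A quick exact-arithmetic check:

```python
from fractions import Fraction

def Gamma(k):
    """Recursion (31) with gamma_i = 2/(i+1) from the parameter choice (10)."""
    G = Fraction(1)
    for i in range(2, k + 1):
        G *= 1 - Fraction(2, i + 1)   # (1 - gamma_i) = (i-1)/(i+1)
    return G

# the telescoping product (1*2*...*(k-1)) / (3*4*...*(k+1)) = 2/(k(k+1))
for k in range(1, 20):
    assert Gamma(k) == Fraction(2, k * (k + 1))
```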

###### Corollary 3.

Fix an outer iteration t and suppose f(θ_{t−1}) − f(θ̂) ⩽ δ₀2^{−(t−1)}. With the parameters set as in (10), if we can control the variance so that σ_k² is sufficiently small for all k, then:

$$\mathbb{E}\big[f(y_k)-f(\hat{\theta})\big]\leqslant \frac{8LD_t^2}{k(k+1)}. \qquad (32)$$

Again, since σ_i = 0 for the CGS algorithm, the claim of the corollary follows immediately once we have specified the appropriate parameters. The main challenge of our proof is to control σ_k² as prescribed by the corollary for the STORC algorithm. We now explain why Corollary 3 almost completes our argument: if we can control σ_k² for all k, then by this corollary we have:

$$\mathbb{E}\big[f(\theta_t)-f(\hat{\theta})\big]=\mathbb{E}\big[f(y_{N_t})-f(\hat{\theta})\big]\leqslant \frac{16\,L/\hat{\sigma}}{N_t(N_t+1)}\,\mathbb{E}\big[f(\theta_{t-1})-f(\hat{\theta})\big], \qquad (33)$$

where the first equality uses θ_t = y_{N_t}, and the inequality uses the definition of the condition number, the pseudo strong convexity (29), and the assumption that we have not yet reached the statistical precision. Hence, with N_t specified as in CGS and STORC, we establish the following convergence for the outer iterations:

$$\mathbb{E}\big[f(\theta_t)-f(\hat{\theta})\big]\leqslant \frac{1}{2}\,\mathbb{E}\big[f(\theta_{t-1})-f(\hat{\theta})\big]. \qquad (34)$$

With this simple recursion for the outer iterations and a careful summation of the calls to gradient evaluations and linear optimizations per iteration, the claims for CGS and STORC follow immediately; we leave the details of the proof to the rest of the supplementary material.

## Appendix B Proof of Lemma 1

###### Proof.

We first compute:

$$E_n(\Delta)=\mathcal{L}_n(\theta+\Delta)-\mathcal{L}_n(\theta)-\langle\nabla\mathcal{L}_n(\theta),\Delta\rangle=\frac{1}{2}\operatorname{vec}(\Delta)^T\hat{\Gamma}\operatorname{vec}(\Delta) \qquad (35)$$

For the convex loss: by Lemma 7 of Agarwal et al. (2010), there exist universal positive constants c₁, c₂ such that:

$$E_n(\Delta)\geqslant\frac{1}{2}\lambda_{\min}(\Sigma_x)\|\Delta\|_F^2-c_1\,\zeta(\Sigma_x)\frac{d}{n}\|\Delta\|_{\mathrm{nuc}}^2 \qquad (36)$$

with probability at least 1 − exp(−c₂ n), which gives the result for the convex case. For the non-convex loss, write F_n(Δ) for the corresponding error term; the same argument applied to the noisy observations gives, with probability at least 1 − exp(−c₂ n):

$$F_n(\Delta)\geqslant\frac{1}{2}\lambda_{\min}(\Sigma_x)\|\Delta\|_F^2-c_1\,\zeta(\Sigma_z)\frac{d}{n}\|\Delta\|_{\mathrm{nuc}}^2 \qquad (37)$$

Now, using the stated assumption, we obtain the result immediately. ∎

## Appendix C Proof of Lemma 2

###### Proof.

First, using the feasibility of θ_t together with the triangle inequality, we have:

$$\varphi(\theta_t)\leqslant\varphi(\theta^\star)+\varphi(\Delta^\star)\leqslant\varphi\big(\Pi_M(\theta^\star)\big)+\varphi\big(\Pi_{M^\perp}(\theta^\star)\big)+\varphi(\Delta^\star) \qquad (38)$$

Now we lower bound the left-hand side using the decomposability of the regularizer; denote Δ_t = θ_t − θ⋆:

$$\begin{aligned}\varphi(\theta_t)&=\varphi\big(\Pi_M(\theta^\star)+\Pi_{M^\perp}(\theta^\star)+\Pi_{\bar{M}}(\Delta_t)+\Pi_{\bar{M}^\perp}(\Delta_t)\big)\\&\geqslant\varphi\big(\Pi_M(\theta^\star)+\Pi_{\bar{M}^\perp}(\Delta_t)\big)-\varphi\big(\Pi_{M^\perp}(\theta^\star)+\Pi_{\bar{M}}(\Delta_t)\big)\\&\geqslant\varphi\big(\Pi_M(\theta^\star)+\Pi_{\bar{M}^\perp}(\Delta_t)\big)-\varphi\big(\Pi_{M^\perp}(\theta^\star)\big)-\varphi\big(\Pi_{\bar{M}}(\Delta_t)\big)\\&=\varphi\big(\Pi_M(\theta^\star)\big)+\varphi\big(\Pi_{\bar{M}^\perp}(\Delta_t)\big)-\varphi\big(\Pi_{M^\perp}(\theta^\star)\big)-\varphi\big(\Pi_{\bar{M}}(\Delta_t)\big)\end{aligned} \qquad (39)$$

Now, combining (38) and (39), we have:

 φ(Π