# Fast Global Convergence via Landscape of Empirical Loss

While optimizing convex objective (loss) functions has been a powerhouse for machine learning for at least two decades, non-convex loss functions have recently attracted fast-growing interest, due to desirable properties such as superior robustness and classification accuracy compared with their convex counterparts. The main obstacle for non-convex estimators is that it is in general intractable to find the optimal solution. In this paper, we study the computational issues for some non-convex M-estimators. In particular, we show that stochastic variance reduction methods converge to the global optimum at a linear rate, by exploiting the statistical properties of the population loss. En route, we improve the convergence analysis for the batch gradient method in Mei et al. (2016).

## 1 Introduction

The last several years have witnessed the surge of big data and, in particular, the rise of non-convex optimization. Indeed, non-convex optimization is at the frontier of machine learning research; an incomplete list of actively studied problems includes dictionary learning Mairal et al. (2009), phase retrieval Candes et al. (2015), robust regression Huber (2011), and training deep neural networks Goodfellow et al. (2016). It is well known that for general non-convex optimization problems, there is no efficient algorithm to find the global optimal solution unless P = NP. Thus, research on non-convex optimization can be divided into two categories. The first one drops the requirement of global optimality and seeks a more modest solution concept, such as finding a stationary point; that is, it shows that an algorithm finds a solution $\theta$ such that $\|\nabla f(\theta)\|_2^2 \le \varepsilon$ Ghadimi and Lan (2013, 2016); Allen-Zhu and Hazan (2016); Reddi et al. (2016a). The second one takes statistical assumptions into consideration and aims to design algorithms with global convergence under reasonable statistical models Agarwal et al. (2010); Loh and Wainwright (2013); Qu and Xu (2017); Qu et al. (2017). This paper belongs to the second class. In particular, we consider the following non-convex M-estimator with finitely many data points:

$$\hat{\theta} = \arg\min_{\theta \in \Omega} R_n(\theta) \equiv \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; (x_i, y_i)), \quad (1)$$

where $\ell(\theta; \cdot)$ is a non-convex loss function for $\theta$, $\{(x_i, y_i)\}_{i=1}^{n}$ are the samples, $\Omega$ is a convex set, and $\hat{\theta}$ is the global optimal solution. This problem is motivated by the following two examples.

The first example is the binary classification problem. Here,

$$R_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \zeta(\langle \theta, x_i \rangle)\big)^2,$$

where $y_i \in \{0, 1\}$ is the label and $\zeta$ is a threshold function (see Section 2). Several empirical studies have demonstrated superior robustness and classification accuracy of non-convex losses, compared with their convex counterparts Wu and Liu (2007); Nguyen and Sanner (2013). One popular choice of $\zeta$ is the logistic function, which has been used in neural networks.
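As a minimal sketch of this objective, the snippet below evaluates $R_n(\theta)$ and its gradient in plain Python, assuming the logistic function as the choice of $\zeta$. The helper names (`zeta`, `dot`, `empirical_risk`, `risk_gradient`) and the toy data are illustrative assumptions, not part of the original paper.

```python
import math

def zeta(z):
    """Logistic function, used here as an assumed choice of the threshold function."""
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def empirical_risk(theta, xs, ys):
    """R_n(theta) = (1/n) * sum_i (y_i - zeta(<theta, x_i>))^2."""
    return sum((y - zeta(dot(theta, x))) ** 2 for x, y in zip(xs, ys)) / len(xs)

def risk_gradient(theta, xs, ys):
    """Gradient of R_n: (1/n) * sum_i -2 (y_i - s_i) s_i (1 - s_i) x_i, with s_i = zeta(<theta, x_i>)."""
    n = len(xs)
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        s = zeta(dot(theta, x))
        coef = -2.0 * (y - s) * s * (1.0 - s)  # chain rule through the logistic function
        for j, xj in enumerate(x):
            grad[j] += coef * xj / n
    return grad

# Toy data (hypothetical values, for illustration only).
xs = [[1.0, 0.5], [-0.3, 2.0], [0.7, -1.2]]
ys = [1, 0, 1]
theta = [0.1, -0.2]
print(empirical_risk(theta, xs, ys))
print(risk_gradient(theta, xs, ys))
```

Because each summand composes a square with a sigmoid, the per-sample loss is bounded in $[0, 1]$ and is not convex in $\theta$.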

The second example is the robust regression problem, in the following form:

$$R_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \rho(y_i - \langle \theta, x_i \rangle). \quad (2)$$

Research on robust algorithms for learning and inference was initiated in the 1960s by Tukey (1960) and developed rapidly in the 1970s and 1980s Huber (2011). One common situation in which robust estimation is used occurs when the data contain outliers, where the ordinary regression method may fail Huber (2011). Another situation is when there is a strong suspicion of heteroscedasticity in the data, which allows the variance to depend on $x$ Tofallis (2008). This is often the case in real scenarios.

One renowned loss function in robust statistics is Tukey’s bi-square loss, defined as

$$\rho_{\mathrm{Tukey}}(t) = \begin{cases} 1 - \big(1 - (t/t_0)^2\big)^3 & \text{for } |t| \le t_0, \\ 1 & \text{for } |t| \ge t_0. \end{cases}$$

It is clear that Tukey’s bisquare loss saturates when $|t|$ is large, and thus it is non-convex.
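The saturation can be checked numerically with a short sketch of the loss and its derivative. The function names and the value `t0 = 4.685` are illustrative assumptions (the source does not state the value of $t_0$ at this point).

```python
def tukey_bisquare(t, t0):
    """Tukey's bisquare loss: 1 - (1 - (t/t0)^2)^3 for |t| <= t0, and 1 otherwise."""
    if abs(t) >= t0:
        return 1.0
    return 1.0 - (1.0 - (t / t0) ** 2) ** 3

def tukey_score(t, t0):
    """Derivative of the loss: (6 t / t0^2) (1 - (t/t0)^2)^2 for |t| <= t0, and 0 otherwise."""
    if abs(t) >= t0:
        return 0.0
    return 6.0 * t / t0 ** 2 * (1.0 - (t / t0) ** 2) ** 2

t0 = 4.685  # assumed threshold, for illustration only

# Convexity would require rho(midpoint) <= average of the endpoint values.
left, right = 0.0, 2.0 * t0
midpoint_value = tukey_bisquare(0.5 * (left + right), t0)
average_value = 0.5 * (tukey_bisquare(left, t0) + tukey_bisquare(right, t0))
print(midpoint_value > average_value)  # the convexity inequality is violated
```

The derivative vanishes for $|t| \ge t_0$, so large residuals (outliers) contribute nothing to the gradient, which is the source of the robustness.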

While the above formulations have superior statistical properties, a natural question is how to find their global optimal solution. In particular, if we apply first-order methods (the gradient descent method and its variants, including the stochastic gradient method and stochastic variance reduction methods), what theoretical guarantee does the solution have? Existing work in the literature asserts that the above algorithms converge to a stationary point $\theta$ such that $\|\nabla f(\theta)\|_2^2 \le \varepsilon$ with rate $O(1/\varepsilon^2)$ using SGD Ghadimi and Lan (2013), with rate $O(n/\varepsilon)$ using gradient descent (folklore in the optimization community), and with rate $O(n + n^{2/3}/\varepsilon)$ using SVRG Reddi et al. (2016a); Allen-Zhu and Hazan (2016).

This paper aims to provide stronger results, by a refined analysis making use of the statistical properties of the problem. The high-level intuition is that although the above finite-sum problem is non-convex, its population counterpart

$$R(\theta) = \mathbb{E}_{x, y}\, \ell(\theta; (x, y))$$

has good properties for optimization (although it may still be non-convex). Our work then exploits the resemblance between the population and empirical loss. In a nutshell, when $n$ is large, not only does the objective function of the finite-sample problem converge to that of the population problem, but both the gradient and the Hessian converge as well under appropriate regularity conditions; in the literature this is known as the study of the landscape of the empirical loss Mei et al. (2016). In particular, Mei et al. (2016) proved that the gradient (in Euclidean norm) and the Hessian (in operator norm) converge to their population counterparts with a rate $O(\sqrt{p \log n / n})$. Using this tool, they showed a global convergence result for the batch gradient method. However, the results in Mei et al. (2016) do not make use of the smoothness of the objective function. Instead, they use Lipschitz continuity, which leads to a loose rate. In this paper, we refine the analysis of the batch gradient method by exploiting the smoothness of the objective function and obtain a better rate. Moreover, since the objective functions are finite sums of $n$ terms and $n$ is large, we apply stochastic variance reduction methods (SVRG Johnson and Zhang (2013); Xiao and Zhang (2014) and SAGA Defazio et al. (2014)) to these problems and establish much faster convergence than the batch gradient method, both in theory (Section 3.1) and in numerical experiments (Section 5).

We now offer a brief introduction to SVRG and SAGA. SVRG and related methods have recently surged into prominence for convex optimization given their edge over stochastic gradient descent Johnson and Zhang (2013); Xiao and Zhang (2014); Shalev-Shwartz and Zhang (2013). Algorithmically, SVRG has inner loops and outer loops. At the beginning of each outer loop, SVRG defines a snapshot vector $\tilde{\theta}$ to be the average (or the last value) of the previous inner loop and computes the full gradient $\nabla f(\tilde{\theta})$. In the inner loops, it randomly samples a data point $i$ and calculates the variance-reduced gradient

$$\nabla f_i(\theta_k) - \nabla f_i(\tilde{\theta}) + \nabla f(\tilde{\theta}).$$

The wisdom of SVRG is that, by applying this variance reduction technique, the variance of the gradient estimate approaches zero, as opposed to SGD, where the variance does not diminish.
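The inner/outer loop structure described above can be sketched as follows. This is a generic illustration for a finite sum $f(\theta) = \frac{1}{n}\sum_i f_i(\theta)$, not the paper's Algorithm 1; the function name `svrg`, its parameters, and the toy quadratic problem are assumptions made for the example, and the last inner iterate is used as the next snapshot.

```python
import random

def svrg(grad_i, n, theta0, step, inner_steps, outer_steps, seed=0):
    """Sketch of SVRG; grad_i(i, theta) returns the gradient of f_i at theta as a list."""
    rng = random.Random(seed)
    snapshot = list(theta0)
    for _ in range(outer_steps):
        # Full gradient at the snapshot, computed once per outer loop.
        full = [0.0] * len(snapshot)
        for i in range(n):
            g = grad_i(i, snapshot)
            full = [f + gj / n for f, gj in zip(full, g)]
        theta = list(snapshot)
        for _ in range(inner_steps):
            i = rng.randrange(n)
            g_new = grad_i(i, theta)
            g_old = grad_i(i, snapshot)
            # Variance-reduced gradient: grad f_i(theta) - grad f_i(snapshot) + grad f(snapshot).
            theta = [t - step * (a - b + c)
                     for t, a, b, c in zip(theta, g_new, g_old, full)]
        snapshot = theta  # last inner iterate becomes the next snapshot
    return snapshot

# Toy problem (hypothetical): f_i(theta) = 0.5 * (theta - a_i)^2, minimized at the mean of a.
a = [1.0, 2.0, 4.0, 7.0]
grad_i = lambda i, theta: [theta[0] - a[i]]
theta_hat = svrg(grad_i, n=len(a), theta0=[0.0], step=0.1, inner_steps=20, outer_steps=30)
print(theta_hat)
```

On this toy problem all $f_i$ share the same curvature, so the correction term cancels the sampling noise exactly; on general problems the variance shrinks only as the iterate and the snapshot approach the optimum.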

Note that SVRG is not a fully “incremental” algorithm, since it needs to calculate the full gradient once in each epoch. SAGA Defazio et al. (2014), another popular stochastic variance reduction method, avoids computing the full gradient by storing the historical gradients and using them to estimate the full gradient. To achieve this, it pays the price of a higher memory demand ($O(np)$ in general, where $p$ is the dimension). Nevertheless, in many machine learning problems the storage demand can be reduced to $O(n)$, which makes it practical. In these cases, SAGA performs as well as or better than SVRG Defazio et al. (2014). Another merit, particularly useful in practice, is that SAGA does not need to tune the length of the inner loop, as opposed to SVRG.
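A minimal sketch of the SAGA update, in the same style, makes the memory trade-off concrete: a table holds the most recent gradient of every $f_i$. This is a generic single-sample illustration, not the minibatch variant used later in the paper; the function name `saga`, its parameters, and the toy problem are assumptions made for the example.

```python
import random

def saga(grad_i, n, theta0, step, num_steps, seed=0):
    """Sketch of SAGA; grad_i(i, theta) returns the gradient of f_i at theta as a list."""
    rng = random.Random(seed)
    theta = list(theta0)
    # Table of the most recent gradient seen for each data point (the extra memory cost).
    table = [grad_i(i, theta) for i in range(n)]
    avg = [sum(g[j] for g in table) / n for j in range(len(theta))]
    for _ in range(num_steps):
        i = rng.randrange(n)
        g_new = grad_i(i, theta)
        g_old = table[i]
        # Step along grad f_i(theta) - stored grad f_i + average of stored gradients.
        theta = [t - step * (a - b + c) for t, a, b, c in zip(theta, g_new, g_old, avg)]
        # Refresh the running average and the stored gradient for index i.
        avg = [c + (a - b) / n for c, a, b in zip(avg, g_new, g_old)]
        table[i] = g_new
    return theta

# Toy problem (hypothetical): f_i(theta) = 0.5 * (theta - a_i)^2, minimized at the mean of a.
a = [1.0, 2.0, 4.0, 7.0]
grad_i = lambda i, theta: [theta[0] - a[i]]
theta_hat = saga(grad_i, n=len(a), theta0=[0.0], step=0.1, num_steps=2000)
print(theta_hat)
```

Unlike the SVRG sketch, there is no inner-loop length to tune; the only hyperparameter besides the number of steps is the step size.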

Summary of contribution: In this paper, we adapt SVRG and SAGA to the non-convex formulations of the binary classification and robust regression problems and prove that they converge exactly to the global optimum at a linear rate, through a novel analysis that considers the statistical properties of the problems. From a high level, we unify the statistics perspective and the optimization perspective for machine learning, with an emphasis on the interplay between them. These two areas are traditionally studied separately, partly because disparate communities developed them. We also improve the analysis of the batch gradient method in Mei et al. (2016). We briefly state the main results and contributions of this paper below and leave the details and discussion to Section 3.1.

• We show that the gradient complexity (i.e., the number of gradient evaluations required) of the batch gradient method is

$$O\left(n \left(\frac{L}{\mu_0}\right)^2 \log\frac{1}{\varepsilon}\right),$$

where $L$ is the smoothness of the loss function and $\mu_0$ is a term similar to the strong convexity parameter. The gradient complexities of SVRG and SAGA are

$$O\left(\left(n + n^{2/3} \left(\frac{L}{\mu_0}\right)^2\right) \log\frac{1}{\varepsilon}\right),$$

where $L$ and $\mu_0$ are the same as above. It is clear that, when the condition number $L/\mu_0$ and the number of samples $n$ are large, SVRG and SAGA converge much faster than the batch gradient method.

• Conventional techniques for establishing convergence results for finite-sample problems analyze $R_n$ directly. The novelty in our analysis is that we first analyze $R$ to exploit the favorable properties of the population loss, and then relate that to $R_n$. The main challenge in our proofs for SVRG and SAGA is that, besides bounding the deviation between the finite-sample problem and the population problem, we need to control the impact of non-convex terms in SVRG and SAGA.

The novelty of our analysis, compared to that in Mei et al. (2016), which also makes use of the landscape of the empirical loss, is that we exploit smoothness. In particular, the gradient complexity that Mei et al. (2016) show for the batch gradient method depends on the Lipschitz continuity parameter of the loss rather than on its smoothness parameter. Generally, the Lipschitz continuity parameter can be much larger than the smoothness parameter; for example, the ratio between the two can be as large as the radius $r$. More importantly, since Mei et al. (2016) do not make use of smoothness, their proof technique cannot be adapted to stochastic variance reduction methods.
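To get a feel for the gap between the two gradient-complexity bounds stated in the summary, the following sketch evaluates both expressions with all hidden constants set to one. The function names and the example values of $n$, $L/\mu_0$ and $\varepsilon$ are hypothetical; only the ratio of the two quantities is meaningful.

```python
import math

def batch_complexity(n, kappa, eps):
    """n * kappa^2 * log(1/eps), with kappa standing for L / mu_0 (constants dropped)."""
    return n * kappa ** 2 * math.log(1.0 / eps)

def vr_complexity(n, kappa, eps):
    """(n + n^(2/3) * kappa^2) * log(1/eps), the SVRG/SAGA bound (constants dropped)."""
    return (n + n ** (2.0 / 3.0) * kappa ** 2) * math.log(1.0 / eps)

# Hypothetical problem sizes, for illustration only.
n, kappa, eps = 10 ** 6, 100.0, 1e-6
ratio = batch_complexity(n, kappa, eps) / vr_complexity(n, kappa, eps)
print(ratio)  # how many times more gradient evaluations the batch bound allows for
```

When $(L/\mu_0)^2$ dominates $n^{1/3}$, the ratio behaves like $n^{1/3}$, so the advantage of the variance-reduced methods grows with the sample size.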

## Related work

Optimizing non-convex objective functions by the batch gradient method and SGD is well studied Nesterov (1983); Ghadimi and Lan (2013). The convergence criterion is $\|\nabla f(\theta)\|_2^2 \le \varepsilon$ in the smooth, unconstrained problem; in the constrained or non-smooth regularized case, the gradient mapping is used instead. The gradient complexity is shown to be $O(1/\varepsilon^2)$ for SGD Ghadimi and Lan (2013), and $O(n/\varepsilon)$ for gradient descent.

Restricted Strong Convexity (RSC) Negahban et al. (2009); Agarwal et al. (2010) is a powerful tool for analyzing non-strongly convex and non-convex optimization problems using statistical information in the high-dimensional setup. Under RSC, it has been shown that the batch gradient and stochastic variance reduction methods converge to the global optimum up to the statistical tolerance Loh and Wainwright (2013); Qu and Xu (2017); Qu et al. (2017). Both the gradient method and the stochastic variance reduction methods reach this tolerance at a linear rate. These results cover problems including Lasso, logistic regression, and generalized linear models with non-convex regularization. Our result differs from these in two ways. First, our result shows convergence to the exact global optimum rather than to its neighborhood. Second, the conditions on the loss functions studied are different. In their work, the loss function is convex (e.g., the squared loss in Lasso), and thus the whole objective function is either convex or slightly non-convex due to the term in the non-convex regularization. In our work, each individual loss is non-convex and RSC does not hold.

There is a vast body of research on stochastic variance reduction methods from the last several years, and we list here a few of the most relevant works; see Johnson and Zhang (2013); Defazio et al. (2014); Shalev-Shwartz and Zhang (2013); Schmidt et al. (2017). In general, if the objective function is $\mu$-strongly convex and the loss function is $L$-smooth, the rate (gradient complexity) is $O((n + L/\mu) \log(1/\varepsilon))$ and can be accelerated to $O((n + \sqrt{nL/\mu}) \log(1/\varepsilon))$ Lin et al. (2015); Lan and Zhou (2015); Zhang and Lin (2015). If the objective function is non-convex, these algorithms converge to a stationary point with a sub-linear rate Allen-Zhu and Hazan (2016); Reddi et al. (2016a). In stark contrast, our paper shows linear convergence to the global optimum. Shalev-Shwartz (2016) proposed the dual-free SDCA algorithm, which converges with rate $O((n + L^2/\mu^2) \log(1/\varepsilon))$, where each individual loss function can be non-convex but the objective function as a whole is strongly convex. Recently, several papers have revisited an old idea called the P-L (Polyak-Łojasiewicz) condition Polyak (1963), a.k.a. gradient dominated functions, and proved linear convergence to the global optimum Karimi et al. (2016); Reddi et al. (2016a). In particular, Reddi et al. (2016b) prove that SVRG and SAGA converge with a rate $O((n + \kappa n^{2/3}) \log(1/\varepsilon))$, where $\kappa$ is a condition number depending on the P-L condition. Our results are different, since the examples we mentioned (binary classification and Tukey’s bisquare loss) do not satisfy the P-L condition.

## 2 Problem setup and Assumption

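Binary classification: In the binary classification problem, we are given $n$ data points $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^p$, and $y_i \in \{0, 1\}$ is a target with probability $\mathbb{P}(y_i = 1 \mid x_i) = \zeta(\langle \theta^*, x_i \rangle)$, where $\zeta$ is a threshold function and $\theta^*$ is the true parameter. We consider the following optimization problem to estimate $\theta^*$: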

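$$\min_{\theta} R_n(\theta) \triangleq \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \zeta(\langle \theta, x_i \rangle)\big)^2 \quad \text{subject to } \|\theta\|_2 \le r. \quad (3)$$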

The aim is to find the optimal solution $\hat{\theta}$ of Problem (3). We make the following set of mild assumptions on the threshold function above and on the feature vectors, following Mei et al. (2016).

###### Assumption 1.
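• $\zeta$ is three times differentiable with $\zeta'(z) > 0$ for all $z$. Furthermore, there exists some constant $L_\zeta > 0$ such that $\max\{\|\zeta'\|_\infty, \|\zeta''\|_\infty, \|\zeta'''\|_\infty\} \le L_\zeta$.

• The feature vector $x$ is zero-mean and $\tau^2$-sub-Gaussian, i.e., $\mathbb{E}[\exp(\langle \lambda, x \rangle)] \le \exp(\tau^2 \|\lambda\|_2^2 / 2)$ for all $\lambda \in \mathbb{R}^p$.

• $\mathbb{E}[x x^\top] \succeq \gamma \tau^2 I_{p \times p}$ for some $0 < \gamma < 1$. In other words, the feature vector spans all directions in $\mathbb{R}^p$.

Notice that the above assumption on $\zeta$ is quite mild; e.g., it is satisfied by the logistic function $\zeta(z) = 1/(1 + e^{-z})$, which can be used in neural networks.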

Robust regression: We assume the data generation model is $y_i = \langle \theta^*, x_i \rangle + \epsilon_i$. The noise terms $\epsilon_i$ are zero-mean and i.i.d. Notice that the noise can be very large; e.g., $\epsilon_i$ may be sampled from a Gaussian mixture distribution in which one component has a large variance and the mixing weight controls the percentage of large noise. We consider robust regression in the following form:

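$$\min_{\theta} R_n(\theta) \equiv \frac{1}{n} \sum_{i=1}^{n} \rho(y_i - \langle \theta, x_i \rangle) \quad \text{subject to } \|\theta\|_2 \le r. \quad (4)$$
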
###### Assumption 2.

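We define the score function $\psi(z) = \rho'(z)$ and follow the assumptions in Mei et al. (2016).

• The score function $\psi(z)$ is twice differentiable and odd in $z$, with $\psi(z) \ge 0$ for all $z \ge 0$. Similar to the binary classification case, we need $\max\{\|\psi\|_\infty, \|\psi'\|_\infty, \|\psi''\|_\infty\} \le L_\psi$ for some constant $L_\psi > 0$.

• The feature vector $x$ is a zero-mean, $\tau^2$-sub-Gaussian random vector, and $\mathbb{E}[x x^\top] \succeq \gamma \tau^2 I_{p \times p}$ for some $0 < \gamma < 1$.

• The noise $\epsilon$ has a symmetric distribution ($\epsilon$ and $-\epsilon$ have the same distribution). Define $h(z) = \mathbb{E}_{\epsilon}[\psi(z + \epsilon)]$; we have $h(z) > 0$ for all $z > 0$ and $h'(0) > 0$.

Remarks: The condition on $h$ is mild. It is not hard to show that it is satisfied if the noise has a density that is strictly positive for all $z$ and decreasing for $z > 0$.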

## 3 Algorithm and Analysis

The batch gradient descent method is the standard one, with update

$$\theta_{k+1} = \theta_k - \eta \nabla R_n(\theta_k).$$

The step size $\eta$ is specified later in the theorems.
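The update above amounts to the following loop. This is a minimal sketch: the function name `gradient_descent`, its parameters, and the toy quadratic objective are assumptions for illustration, and the constraint $\|\theta\|_2 \le r$ is not enforced here, mirroring the update as written.

```python
def gradient_descent(grad, theta0, step, num_steps):
    """Repeat theta <- theta - step * grad(theta) for a fixed number of steps."""
    theta = list(theta0)
    for _ in range(num_steps):
        g = grad(theta)  # full (batch) gradient at the current iterate
        theta = [t - step * gj for t, gj in zip(theta, g)]
    return theta

# Toy objective (hypothetical): 0.5 * ||theta - c||^2 with c = (2, -1).
grad = lambda theta: [theta[0] - 2.0, theta[1] + 1.0]
theta_hat = gradient_descent(grad, theta0=[0.0, 0.0], step=0.5, num_steps=100)
print(theta_hat)
```

Each step of the batch method costs one pass over all $n$ samples when `grad` is the gradient of $R_n$, which is what the factor $n$ in its gradient complexity accounts for.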

We list the SVRG and SAGA algorithms for completeness. Algorithm 1 is the vanilla SVRG, and Algorithm 2 calls Algorithm 1 repeatedly as a subroutine. Algorithm 4 is non-convex SAGA, which calls Algorithm 3 repeatedly as a subroutine; here we follow the minibatch version of SAGA in Reddi et al. (2016b).

### 3.1 Theoretical results

Before we present our main theorems, we briefly recall the concentration properties of the gradient and Hessian of the empirical loss, since they give some intuition for why the algorithms work. We use them repeatedly in our proofs.

###### Theorem 1 (Theorem 1 in Mei et al. (2016)).

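Under the assumptions of the binary classification and robust regression problems, there exist absolute positive constants such that the following holds:

• The sample gradient converges uniformly to the population gradient in Euclidean norm, i.e., when the sample size $n$ is sufficiently large, we have

$$\mathbb{P}\left( \sup_{\theta \in B^p(r)} \|\nabla R_n(\theta) - \nabla R(\theta)\|_2 \le \tau \sqrt{\frac{C p \log n}{n}} \right) \ge 1 - \exp(-C_1 n).$$

• The sample Hessian converges uniformly to the population Hessian in operator norm, i.e., when the sample size $n$ is sufficiently large, we have

$$\mathbb{P}\left( \sup_{\theta \in B^p(r)} \|\nabla^2 R_n(\theta) - \nabla^2 R(\theta)\|_{\mathrm{op}} \le \tau^2 \sqrt{\frac{C_0 p \log n}{n}} \right) \ge 1 - \exp(-C_2 n).$$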

Theorem 1 essentially says that the gradient and Hessian are close to their population counterparts if the sample size $n$ is large enough. For the population loss, it can be shown that, by following the gradient $\nabla R(\theta)$, the gradient descent algorithm converges to the ground truth $\theta^*$, i.e., the optimal solution of $R(\theta)$, even though $R(\theta)$ may be non-convex. Thus, although it is hard to analyze the empirical loss directly, we can exploit the error bound in Theorem 1 and hope for a similar convergence result. In particular, since Theorem 1 is a uniform convergence result, by carefully controlling the error terms in the algorithm we can prove convergence for the empirical loss $R_n$.

The next theorem is our main result for binary classification. For ease of exposition for SVRG and SAGA, our analysis makes an additional assumption, a property analogous to the high condition number regime for strongly convex functions in machine learning Xiao and Zhang (2014); Reddi et al. (2016a).

###### Theorem 2 ( Binary classification).

Let be the global optimal solution of Equation (3), and be the smoothness of loss function . Suppose , and Assumption 1 is satisfied, then there exists a positive constant that only depends on , such that if the sample size , the following holds with probability at least for some absolute positive constant :

• Set in the batch gradient method, the algorithm converges to with gradient complexity .

• For SVRG, suppose , we set for some absolute positive constant , , , then the algorithm converges to with gradient complexity .

• For SAGA, suppose we set minibatch size , , then the algorithm converges to with gradient complexity .

Some remarks are in order.

• We call $L/\mu_0$ an effective condition number, by analogy with the analysis of strongly convex functions. Notice that SVRG and SAGA have a weaker dependence on $n$ than the batch gradient method. When the problem is ill-conditioned ($L/\mu_0$ is large), SVRG and SAGA are much faster than the batch gradient method, which is verified in the experiments reported in Section 5.

• In Mei et al. (2016), the authors prove a gradient complexity for the batch gradient method that depends on the Lipschitz continuity parameter rather than the smoothness parameter, since their analysis does not make use of the smoothness of the objective function. Compared to their result, ours is tighter, as the Lipschitz continuity parameter is larger than the smoothness parameter in general.

• Most variance reduction work on the optimization of strongly convex functions in the literature has a linear dependence on the condition number, e.g., $O((n + L/\mu) \log(1/\varepsilon))$. An open question is whether the dependence of our result on $L/\mu_0$ can be improved from quadratic to linear.

• Our setting does not satisfy the P-L condition Karimi et al. (2016); Reddi et al. (2016b); thus existing convergence results based on the P-L condition do not apply.

• The extra requirement imposed for SVRG and SAGA is only there to ease the theoretical analysis. In practice, we find it unnecessary.

The next theorem is our main result for robust regression. It is similar to the binary classification case, except for the dependence on problem-specific quantities and a change of constants.

###### Theorem 3 (Robust regression).

Let be the global optimal solution of Equation (4), and be the smoothness of loss function . Suppose , and Assumption 2 is satisfied. Then there exists a positive constant that only depends on , such that if the sample size , the following results hold with probability , for some absolute positive constant .

• Set in the batch gradient method, the algorithm converge to with gradient complexity .

• For SVRG, suppose , we set with some absolute positive constant , , , then the algorithm converges to with gradient complexity .

• For SAGA, suppose we set sample size , , then the algorithm converges to with gradient complexity .

The same set of remarks as in binary classification holds for robust regression as well. We also remark that the basic idea of the convergence proof for robust regression is the same as for binary classification, except that the constants depend differently on the problem parameters.

## 4 Roadmap of the proof

We briefly explain the high-level idea of the proof and defer the details to the supplementary material. The proof consists of two steps. We take the batch gradient method as an example, since its analysis is relatively easy. The proof for stochastic variance reduction methods is similar, although more technically involved.

We divide the region $B^p(0, r)$, i.e., the ball of radius $r$ in $p$-dimensional space, into two parts:

$$B^p(0, r) \setminus B^p(\theta^*, \epsilon_0) \quad \text{and} \quad B^p(\theta^*, \epsilon_0).$$

In the first step we focus on $B^p(0, r) \setminus B^p(\theta^*, \epsilon_0)$. We analyze the objective function $R(\theta)$ (i.e., the population loss) rather than $R_n(\theta)$, so that we can exploit good statistical properties of the population loss (e.g., the directional gradient toward $\theta^*$ is larger than zero). However, notice that the algorithm only has access to the gradient of the finite-sample loss, i.e., $\nabla R_n(\theta)$, rather than $\nabla R(\theta)$. Thus, since the algorithm follows the direction of $\nabla R_n(\theta)$, there are additional error terms on the objective function $R(\theta)$. Thanks to Theorem 1, these terms can be bounded and are small when $n$ is large. Therefore, we can show that in this region the iterate $\theta_k$ converges toward $\theta^*$ at a linear rate.

The second step, where we analyze the region $B^p(\theta^*, \epsilon_0)$, is easier. In this region, the population Hessian can be shown to be positive definite. Then the empirical Hessian of $R_n$ can be bounded from below using the uniform convergence result in Theorem 1. Thus the objective function behaves like a strongly convex function in this region, which leads to convergence to $\hat{\theta}$ (notice that the uniqueness of $\hat{\theta}$ is proved in Theorem 4 of Mei et al. (2016)).
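One way to see how the error terms in the first step arise is the following elementary decomposition. It is a sketch based only on the Cauchy-Schwarz inequality; the actual argument in the supplementary material may proceed differently.

```latex
\begin{aligned}
\langle \nabla R_n(\theta),\, \theta - \theta^* \rangle
&= \langle \nabla R(\theta),\, \theta - \theta^* \rangle
 + \langle \nabla R_n(\theta) - \nabla R(\theta),\, \theta - \theta^* \rangle \\
&\ge \langle \nabla R(\theta),\, \theta - \theta^* \rangle
 - \|\nabla R_n(\theta) - \nabla R(\theta)\|_2 \, \|\theta - \theta^*\|_2 .
\end{aligned}
```

By Theorem 1, the factor $\|\nabla R_n(\theta) - \nabla R(\theta)\|_2$ is at most $\tau \sqrt{C p \log n / n}$ uniformly over the ball with high probability, so the empirical gradient inherits the favorable direction of the population gradient once $n$ is large enough.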

## 5 Simulation result

In this section, we report numerical experiment results for SVRG, SAGA, batched gradient and SGD in both synthetic and real datasets.

### 5.1 Synthetic dataset

The aim of the synthetic data experiment is to verify our main theorem: SVRG and SAGA converge to the global optimum at a linear rate, even in some non-convex settings, and they converge faster than the batch gradient method. The step sizes for all algorithms are chosen by grid search.

Binary classification: The feature vector $x$ is generated from a normal distribution. The label $y$ is 0 or 1, with $\mathbb{P}(y = 1 \mid x) = \zeta(\langle \theta^*, x \rangle)$, where $\theta^*$ is the true parameter. We generate each entry of $\theta^*$ from a Bernoulli distribution and then normalize $\theta^*$. The number of data points is fixed across the experiments, and the dimension takes one of two values. The optimal solution is obtained by running SVRG long enough (e.g., 1000 passes over the dataset). We use two different parameter settings, which affect the condition number of the objective function: one in the first experiment and another in the last two experiments. The experiment results are presented in Figure 1.

The experiment results show that SAGA and SVRG have similar performance and are followed by the batch gradient method and then SGD. When the condition number is small (the left panel in Figure 1), SVRG, SAGA and the batch gradient method all work well, while SVRG and SAGA reach a small optimality gap much faster than the batch gradient method. The middle panel reports the result when the condition number is significantly larger. We observe that even after 1000 passes, the batch gradient method still has a large objective gap, while SVRG and SAGA still work well. In the right panel, we decrease the dimensionality of the problem. Consequently, all algorithms converge faster than in the middle panel, since there are fewer parameters. In all settings, SGD converges fast at the beginning and then gets stuck at a relatively high objective gap due to the variance of the gradient estimate.

Robust regression: The data generation model is $y_i = \langle \theta^*, x_i \rangle + \epsilon_i$, where the feature vector $x_i$ is sampled from a normal distribution. The noise $\epsilon_i$ is generated from a Gaussian mixture distribution. Each entry of the true parameter $\theta^*$ is generated from a Bernoulli distribution. Again, we normalize $\theta^*$. We use Tukey’s bisquare loss as our loss function and set its parameters as in Mei et al. (2016). The optimal solution is obtained by running SVRG for a long time (1000 passes over the dataset) in each experiment. We ran three experiments with different parameter settings and present the results in Figure 2.
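A data generator of this kind might look like the sketch below. The exact distribution parameters are not recoverable from the text above, so everything numeric here is an assumption made for illustration: standard normal features, Bernoulli(0.5) entries for $\theta^*$ normalized to unit norm, and noise drawn from $N(0, 1)$ with probability $1 - \alpha$ and from $N(0, \sigma^2)$ with probability $\alpha$. The function name and its parameters are hypothetical as well.

```python
import math
import random

def make_robust_regression_data(n, p, alpha, sigma, seed=0):
    """Generate (xs, ys, theta_star) from y = <theta_star, x> + noise (assumed settings)."""
    rng = random.Random(seed)
    # True parameter: Bernoulli(0.5) entries (assumed), normalized to unit norm (assumed).
    theta = [float(rng.random() < 0.5) for _ in range(p)]
    if not any(theta):
        theta[0] = 1.0  # guard against the all-zero draw
    norm = math.sqrt(sum(t * t for t in theta))
    theta = [t / norm for t in theta]
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # Gaussian mixture noise: with probability alpha, use the large-variance component.
        scale = sigma if rng.random() < alpha else 1.0
        noise = rng.gauss(0.0, scale)
        xs.append(x)
        ys.append(sum(a * b for a, b in zip(theta, x)) + noise)
    return xs, ys, theta

xs, ys, theta_star = make_robust_regression_data(n=50, p=5, alpha=0.1, sigma=10.0)
print(len(xs), len(ys), theta_star)
```

Here `alpha` plays the role of the fraction of large-noise samples and `sigma` the scale of that noise; both would need to be set to the values used in the original experiments to reproduce Figure 2.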

In all settings, SAGA performs the best and is followed by SVRG. Both algorithms significantly outperform the batch gradient method, which verifies our theorem. Even when the problem is ill-conditioned (e.g., in the right panel), SAGA and SVRG converge fast, while the batch gradient method is very slow. SGD in all settings converges fast at the beginning and then gets stuck at a relatively high objective gap, due to the variance of the stochastic gradient.

### 5.2 Real dataset

#### 5.2.1 Binary classification

In this section, we test the binary classification problem on IJCNN1 (n=49990, p=22) Prokhorov (2001), Covertype (n=495141, p=54) Blackard and Dean (1999), and the Dota2 Games Results dataset (n=92650, p=116) Tridgell (2016). In the experiments, all features are normalized. Since Covertype is a multi-class classification dataset, we extract classes one and two as our data. We compare SVRG, SAGA, the batch gradient method and SGD on all three datasets and present the results in Figure 3. The optimal solution for each dataset is obtained by running SVRG long enough (e.g., 1000 passes over the dataset).

SAGA converges fastest in all three experiments, followed by SVRG. The batch gradient method converges very slowly due to its bad dependence on the condition number. Indeed it is even worse than the SGD in all three datasets. SGD converges very fast at early stage and saturates at a large objective gap.

#### 5.2.2 Robust regression

We test the robust regression problem on the following datasets: Airfoil Self-Noise (n=1503, p=6) Brooks et al. (2014), Communities and Crime (n=1994, p=128) Redmond and Baveja (2002), and Parkinsons Telemonitoring (n=5875, p=26) Tsanas et al. (2010). We corrupt the output by adding noise from a Gaussian mixture distribution, similarly to the synthetic experiments, with a separate parameter choice for each of the three experiments. The results of SVRG, SAGA, the batch gradient method and SGD on these three datasets are reported in Figure 4.

In all three experiments, SAGA converges the fastest, followed by SVRG. SGD converges quickly at the beginning in all experiments and then gets stuck at large objective gaps due to the variance of the stochastic gradient. On the Airfoil Self-Noise dataset, the batch gradient method performs well. However, on the other two datasets its performance is either similar to that of SGD (Parkinsons Telemonitoring) or even worse (Communities and Crime). This is likely due to its heavy dependence on the condition number.

## 6 Conclusion and future work

In this paper, we solve two kinds of non-convex problems with stochastic variance reduction methods and prove that the algorithms converge linearly to the global optimum of each problem. Our analysis exploits the fact that the population problem often has more favorable properties in terms of optimization. Although the finite-sample problem does not possess these favorable properties, it is possible to control the impact of departing from the population problem on the performance of optimization algorithms.

A direction for future work is to consider optimization in the high-dimensional statistics setting, i.e., when $p \gg n$. In this case, Theorem 1 fails, and a possible solution is to add regularization to encourage sparsity.

## References

• Agarwal et al. (2010) Alekh Agarwal, Sahand Negahban, and Martin J Wainwright. Fast global convergence rates of gradient methods for high-dimensional statistical recovery. In Advances in Neural Information Processing Systems, pages 37–45, 2010.
• Allen-Zhu and Hazan (2016) Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707, 2016.
• Blackard and Dean (1999) Jock A Blackard and Denis J Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and electronics in agriculture, 24(3):131–151, 1999.
• Brooks et al. (2014) Thomas Brooks, Stuart Pope, and Michael Marcolini. UCI machine learning repository, 2014.
• Candes et al. (2015) Emmanuel J Candes, Xiaodong Li, and Mahdi Soltanolkotabi. Phase retrieval from coded diffraction patterns. Applied and Computational Harmonic Analysis, 39(2):277–299, 2015.
• Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
• Ghadimi and Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
• Ghadimi and Lan (2016) Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1-2):59–99, 2016.
• Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
• Huber (2011) Peter J Huber. Robust statistics. In International Encyclopedia of Statistical Science, pages 1248–1251. Springer, 2011.
• Johnson and Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
• Karimi et al. (2016) Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
• Lan and Zhou (2015) Guanghui Lan and Yi Zhou. An optimal randomized incremental gradient method. arXiv preprint arXiv:1507.02000, 2015.
• Lin et al. (2015) Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pages 3384–3392, 2015.
• Loh and Wainwright (2013) Po-Ling Loh and Martin J Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pages 476–484, 2013.
• Mairal et al. (2009) Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In Proceedings of the 26th annual international conference on machine learning, pages 689–696. ACM, 2009.
• Mei et al. (2016) Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.
• Negahban et al. (2009) Sahand Negahban, Bin Yu, Martin J Wainwright, and Pradeep K Ravikumar. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. In Advances in Neural Information Processing Systems, pages 1348–1356, 2009.
• Nesterov (1983) Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). In Soviet Mathematics Doklady, volume 27, pages 372–376, 1983.
• Nesterov (2013) Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
• Nguyen and Sanner (2013) Tan Nguyen and Scott Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pages 1085–1093, 2013.
• Polyak (1963) Boris Teodorovich Polyak. Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, 3(4):643–653, 1963.
• Prokhorov (2001) Danil Prokhorov. IJCNN 2001 neural network competition. In Slide presentation in IJCNN’01, 2001.
• Qu and Xu (2017) Chao Qu and Huan Xu. Linear convergence of SDCA in statistical estimation. arXiv preprint arXiv:1701.07808, 2017.
• Qu et al. (2017) Chao Qu, Yan Li, and Huan Xu. SAGA and restricted strong convexity. arXiv preprint arXiv:1702.05683, 2017.
• Reddi et al. (2016a) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016a.
• Reddi et al. (2016b) Sashank J Reddi, Suvrit Sra, Barnabás Póczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145–1153, 2016b.
• Redmond and Baveja (2002) Michael Redmond and Alok Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660–678, 2002.
• Schmidt et al. (2017) Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
• Shalev-Shwartz (2016) Shai Shalev-Shwartz. SDCA without duality, regularization, and individual convexity. In International Conference on Machine Learning, pages 747–754, 2016.
• Shalev-Shwartz and Zhang (2013) Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
• Tofallis (2008) Chris Tofallis. Least squares percentage regression. Journal of Modern Applied Statistical Methods, 7(2):18, 2008.
• Tridgell (2016) Stephen Tridgell. UCI machine learning repository, 2016.
• Tsanas et al. (2010) Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Accurate telemonitoring of Parkinson’s disease progression by noninvasive speech tests. IEEE Transactions on Biomedical Engineering, 57(4):884–893, 2010.
• Tukey (1960) John W Tukey. A survey of sampling from contaminated distributions. Contributions to probability and statistics, 2:448–485, 1960.
• Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
• Wu and Liu (2007) Yichao Wu and Yufeng Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974–983, 2007.
• Xiao and Zhang (2014) Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
• Zhang and Lin (2015) Yuchen Zhang and Xiao Lin. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In International Conference on Machine Learning, pages 353–361, 2015.

## Appendix A Binary classification

We start with a lemma that presents some properties of . It is quite similar to Lemma 8 in [17]; we present it here for completeness.

###### Lemma 1.

Assume , and Assumption 1 is satisfied,

• There exist an and such that .

• There exists some positive constant such that .

• For all ,

All constants depend only on .

###### Proof.

It is easy to verify that

$$\nabla^2 R(\theta) = \mathbb{E}\{\beta(\theta) X X^T\},$$

where , and we use the fact that

At the ground truth , we have

$$\nabla^2 R(\theta^*) = \mathbb{E}\left(2\zeta'(\theta^{*T}X)^2 X X^T\right).$$
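As a sanity check on this identity, the following minimal sketch compares the formula with a finite-difference second derivative of the population risk in one dimension. Everything specific here is an illustrative assumption and not taken from the paper: a sigmoid link for ζ, the squared loss with a binary label satisfying E[Y | X] = ζ(θ∗X), a scalar parameter, and X uniform on a small hypothetical grid `XS`.

```python
import math

# Illustrative assumptions (not from the paper): sigmoid link, scalar theta,
# X uniform on a small finite grid, Y | X ~ Bernoulli(zeta(theta_star * X)).
XS = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
THETA_STAR = 0.7


def zeta(t):
    """Sigmoid link function."""
    return 1.0 / (1.0 + math.exp(-t))


def zeta_prime(t):
    """Derivative of the sigmoid."""
    z = zeta(t)
    return z * (1.0 - z)


def population_risk(theta):
    """R(theta) = E[(Y - zeta(theta X))^2], with the expectation over Y taken exactly."""
    total = 0.0
    for x in XS:
        p = zeta(THETA_STAR * x)  # P(Y = 1 | X = x)
        q = zeta(theta * x)       # model prediction
        total += p * (1.0 - q) ** 2 + (1.0 - p) * q ** 2
    return total / len(XS)


def hessian_formula():
    """E[2 zeta'(theta_star X)^2 X^2], the scalar version of the displayed identity."""
    return sum(2.0 * zeta_prime(THETA_STAR * x) ** 2 * x * x for x in XS) / len(XS)


def hessian_finite_difference(h=1e-4):
    """Central second difference of the population risk at theta_star."""
    return (
        population_risk(THETA_STAR + h)
        - 2.0 * population_risk(THETA_STAR)
        + population_risk(THETA_STAR - h)
    ) / (h * h)


if __name__ == "__main__":
    print(hessian_formula(), hessian_finite_difference())
```

The two numbers should agree up to finite-difference error, because the term involving ζ′′ carries the factor ζ(θX) − ζ(θ∗X), which vanishes at θ = θ∗.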

Recall in Assumption 1, we have , thus for any there exist such that

$$\zeta'(t) \geq L(s), \quad \forall t \in [-s, s]. \tag{5}$$

We define an event ; then for any , we have

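$$
\begin{aligned}
u^T \nabla^2 R(\theta^*) u &\geq \mathbb{E}\{2\zeta'(\theta^{*T}X)^2 \langle u, X\rangle^2 \mathbf{1}_B\} \geq 2L^2(s)\,\mathbb{E}\{\langle u, X\rangle^2 \mathbf{1}_B\} \\
&\overset{(a)}{\geq} 2L^2(s)\left(\mathbb{E}[\langle u, X\rangle^2] - \mathbb{E}[\langle u, X\rangle^2 \mathbf{1}_{B^c}]\right) \\
&\overset{(b)}{\geq} 2L^2(s)\left(\underline{\gamma}\tau^2 - \left(\mathbb{E}\langle u, X\rangle^4 \, P(B^c)\right)^{1/2}\right) \\
&\overset{(c)}{\geq} 2L^2(s)\left(\underline{\gamma}\tau^2 - \tau^2\sqrt{C_4 P(B^c)}\right) \\
&\overset{(d)}{\geq} 2L^2(s)\tau^2\left(\underline{\gamma} - \sqrt{C_4}\exp\left(-\frac{s^2}{r^2\tau^2}\right)\right),
\end{aligned}
\tag{6}
$$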

where (a) uses equation (5), (b) follows from the Cauchy-Schwarz inequality, and (c) follows from this property of sub-Gaussian variables [36]: Suppose is a zero-mean and sub-Gaussian random variable; then there exist numerical constants for all integers such that

$$\mathbb{E}|\langle u, X\rangle|^{2k} \leq C_{2k}\|u\|_2^{2k}\tau^{2k}. \tag{7}$$

(d) uses the concentration of sub-Gaussian variables; we refer the reader to [36].
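For intuition about the constants in (7), consider the Gaussian case, which is an illustrative assumption and not part of the lemma. If $X \sim N(0, \tau^2 I)$, then $\langle u, X\rangle \sim N(0, \tau^2\|u\|_2^2)$ and its even moments are available in closed form:

$$\mathbb{E}|\langle u, X\rangle|^{2k} = (2k-1)!!\,\|u\|_2^{2k}\tau^{2k}.$$

In this special case (7) holds with equality for $C_{2k} = (2k-1)!!$, and in particular $C_4 = 3$.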

Thus we can choose with some positive constant such that and then have

$$u^T \nabla^2 R(\theta^*) u \geq L^2(s)\tau^2\underline{\gamma}.$$
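As a worked instance of this choice (one sufficient option, not necessarily the constant the authors use), read the last line of (6) as $2L^2(s)\tau^2\left(\underline{\gamma} - \sqrt{C_4}\exp(-s^2/(r^2\tau^2))\right)$. It then suffices to make the exponential term at most $\underline{\gamma}/2$:

$$\sqrt{C_4}\exp\left(-\frac{s^2}{r^2\tau^2}\right) \leq \frac{\underline{\gamma}}{2} \quad\Longleftrightarrow\quad s \geq r\tau\sqrt{\log\frac{2\sqrt{C_4}}{\underline{\gamma}}},$$

where the right-hand condition is understood to hold for every $s \geq 0$ when $2\sqrt{C_4} \leq \underline{\gamma}$. Substituting gives $2L^2(s)\tau^2\left(\underline{\gamma} - \underline{\gamma}/2\right) = L^2(s)\tau^2\underline{\gamma}$, which matches the bound displayed above.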

Now we bound by bounding the difference of . We denote the Lipschitz parameter of as (notice that it depends only on ). For , we have

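$$
\begin{aligned}
|u^T(\nabla^2 R(\theta) - \nabla^2 R(\theta^*))u| &= \mathbb{E}\{u^T[(\beta(\theta) - \beta(\theta^*)) X X^T]u\} \\
&\leq L_\beta\,\mathbb{E}\{\langle\theta - \theta^*, X\rangle \cdot \langle u, X\rangle^2\} \\
&\leq L_\beta\left[\mathbb{E}\langle\theta - \theta^*, X\rangle^2 \cdot \mathbb{E}\langle u, X\rangle^4\right]^{1/2} \\
&\leq L_\beta\left(C_4\|\theta - \theta^*\|_2^2\tau^6\right)^{1/2} = L_\beta\sqrt{C_4}\|\theta - \theta^*\|_2\tau^3,
\end{aligned}
\tag{8}
$$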

where the first inequality uses the fact that is Lipschitz, the second follows from the Cauchy-Schwarz inequality, and the third uses equation (7) again.

Now we set .

It is clear that when ,

$$\lambda_{\min}(\nabla^2 R(\theta^*)) \geq \kappa_0 = \frac{1}{2}L^2(s)\tau\underline{\gamma}, \tag{9}$$

where .

Then we lower bound the magnitude of the gradient when .

Recall that we have . It is easy to verify that is minimized at the true parameter . Notice that

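$$
\begin{aligned}
\langle\theta - \theta^*, \nabla R(\theta)\rangle &= 2\mathbb{E}\{(\zeta(\theta^T X) - Y)\zeta'(\theta^T X) \cdot \langle\theta - \theta^*, X\rangle\} \\
&= 2\mathbb{E}\{(\zeta(\theta^T X) - \zeta(\theta^{*T}X))\zeta'(\theta^T X) \cdot \langle\theta - \theta^*, X\rangle\},
\end{aligned}
\tag{10}
$$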

where we use the fact that .

We define an event such that when this event happens. In particular, let be an orthogonal transform from to , whose row space contains and and . Since , we have when happens. Then we have

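$$
\begin{aligned}
\langle\theta - \theta^*, \nabla R(\theta)\rangle &\geq 2L^2(s)\left\{\mathbb{E}\langle\theta - \theta^*, X\rangle^2 - \mathbb{E}\langle\theta - \theta^*, X\rangle^2\mathbf{1}_{A^c}\right\} \\
&\geq 2L^2(s)\left\{\underline{\gamma}\tau^2\|\theta - \theta^*\|_2^2 - \left(\mathbb{E}[\langle\theta - \theta^*, X\rangle^4]\,P(A^c)\right)^{1/2}\right\} \\
&\geq 2L^2(s)\|\theta - \theta^*\|_2^2\tau^2\left(\underline{\gamma} - \sqrt{C_4 P(A^c)}\right),
\end{aligned}
\tag{11}
$$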

where the first inequality follows from equation (5) and the mean value theorem applied to , the second inequality uses Assumption 1 and the Cauchy-Schwarz inequality, and the third inequality follows from equation (7).

Now we provide a bound on .

Using the fact that is sub-Gaussian, we have

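$$P(A^c) = P\left(\|UX\|_2 \geq \frac{2s}{3r}\right) \leq P\left(|\langle U_1, X\rangle| \geq \frac{\sqrt{2}s}{3r}\right) + P\left(|\langle U_2, X\rangle| \geq \frac{\sqrt{2}s}{3r}\right) \leq 4\exp\left(-\frac{s^2}{9r^2\tau^2}\right).$$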
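The chain above can be checked in closed form in a special case. The sketch below assumes, purely for illustration, that X is standard Gaussian (so τ = 1) and that U has two orthonormal rows, so that UX is a standard Gaussian vector in the plane. The function names are hypothetical. Under these assumptions the exact tail is exp(−t²/2) with t = 2s/(3r), each one-dimensional tail is given by the complementary error function, and the final bound is 4 exp(−s²/(9r²)).

```python
import math


def exact_tail_2d(t):
    """P(||Z||_2 >= t) for Z ~ N(0, I_2); ||Z||^2 is chi-square with 2 degrees of freedom."""
    return math.exp(-t * t / 2.0)


def union_bound(t):
    """P(|Z_1| >= t/sqrt(2)) + P(|Z_2| >= t/sqrt(2)) for independent standard normals."""
    # P(|Z_1| >= a) = erfc(a / sqrt(2)); with a = t / sqrt(2) this is erfc(t / 2).
    return 2.0 * math.erfc(t / 2.0)


def subgaussian_bound(t):
    """4 exp(-t^2 / 4), which equals 4 exp(-s^2 / (9 r^2)) at t = 2s/(3r) and tau = 1."""
    return 4.0 * math.exp(-t * t / 4.0)


def tail_chain(s, r):
    """Return (exact tail, union bound, sub-Gaussian bound) at threshold t = 2s/(3r)."""
    t = 2.0 * s / (3.0 * r)
    return exact_tail_2d(t), union_bound(t), subgaussian_bound(t)


if __name__ == "__main__":
    for s in (0.5, 1.0, 3.0, 6.0):
        print(s, tail_chain(s, r=1.0))
```

Each triple printed by the script should be nondecreasing from left to right, mirroring the two inequalities in the display.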

Thus we have

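$$\langle\theta - \theta^*, \nabla R(\theta)\rangle \geq 2L^2(s)\|\theta - \theta^*\|_2^2\tau^2\left(\underline{\gamma} - 2\sqrt{C_4}\exp\left(-\frac{s^2}{18r^2\tau^2}\right)\right).$$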

We set such that