A SMART Stochastic Algorithm for Nonconvex Optimization with Applications to Robust Machine Learning

In this paper, we show how to transform any optimization problem that arises from fitting a machine learning model into one that (1) detects and removes contaminated data from the training set while (2) simultaneously fitting the trimmed model on the uncontaminated data that remains. To solve the resulting nonconvex optimization problem, we introduce a fast stochastic proximal-gradient algorithm that incorporates prior knowledge through nonsmooth regularization. For datasets of size n, our approach requires O(n^2/3/ε) gradient evaluations to reach ε-accuracy and, when a certain error bound holds, the complexity improves to O(κ n^2/3(1/ε)). These rates are n^1/3 times better than those achieved by typical, full gradient methods.

Authors

• 16 publications
• 16 publications
12/22/2021

Accelerated Proximal Alternating Gradient-Descent-Ascent for Nonconvex Minimax Machine Learning

Alternating gradient-descent-ascent (AltGDA) is an optimization algorith...
02/16/2019

Faster Gradient-Free Proximal Stochastic Methods for Nonconvex Nonsmooth Optimization

Proximal gradient method has been playing an important role to solve man...
08/20/2020

An Optimal Hybrid Variance-Reduced Algorithm for Stochastic Composite Nonconvex Optimization

In this note we propose a new variant of the hybrid variance-reduced pro...
07/03/2014

Global convergence of splitting methods for nonconvex composite optimization

We consider the problem of minimizing the sum of a smooth function h wit...
10/29/2020

A Single-Loop Smoothed Gradient Descent-Ascent Algorithm for Nonconvex-Concave Min-Max Problems

Nonconvex-concave min-max problem arises in many machine learning applic...
01/24/2018

Training Set Debugging Using Trusted Items

Training set bugs are flaws in the data that adversely affect machine le...
05/11/2017

Practical Bayesian Optimization for Model Fitting with Bayesian Adaptive Direct Search

Computational models in fields such as computational neuroscience are of...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Potential outliers in datasets can be identified in several ways. For low-dimensional models, scatter plots, box plots, and histograms can be used to visually identify points that deviate from modeling assumptions. For higher-dimensional data, several tests involving order statistics exist (so called L-estimators

(Maronna et al., 2006)), such as the three-sigma rule for Gaussian data, or trimming strategies for disregarding points that are furthest away from the mean. After potential outliers are removed from a dataset, models are fit on the remaining data. After fitting the model, potential outliers are again identified and removed and another model is fit (Ruppert and Carroll, 1980). This process can repeat indefinitely, until no points are left in the dataset.

Identifying outliers using a fitted model can be problematic, since outliers affect the fit. Robust loss functions are often used to estimate model parameters from potentially contaminated data, without any

a priori outlier removal or pre-processing. Examples include the , huber, and Student’s t losses, all of which attempt to minimize the influence of observations that deviate from modeling assumptions (Huber, 2004; Lange et al., 1989). After fitting a model using a robust loss, potential outliers can be identified by sorting the loss applied to individual observations. Observations with higher loss are considered more likely to be outliers.

Another approach, called trimmed estimation, couples explicit outlier identification and removal with model fitting. Given a set of training examples, typical model fitting, i.e., M-estimation, solves

 minimizexn∑i=1fi(x),

where each represents the loss associated with the th training example. In contrast, trimmed M-estimators couple this already difficult, potentially nonconvex, optimization problem with explicit outlier removal

 minimizexh∑i=1fi:n(x), (1)

where are the first order statistics of the objective values. If loss is the log likelihood of the th observed sample, then trimming attempts to jointly fit a probabilistic model while simultaneously eliminating the influence of all low likelihood observations.

Trimmed M-estimators were initially introduced by Rousseeuw (1985)

in the context of least-squares regression. The author’s original motivation was to develop linear regression estimators that have a high breakdown point (in this case 50%) and good statistical efficiency (in this case

)111Breakdown refers to the percentage of outlying points which can be added to a dataset before the resulting M-estimator can change in an unbounded way. Here, outliers can affect both the outcomes and training data (features).. These Least Trimmed Squares (LTS) estimators were proposed as a higher efficiency alternative to Least Median Squares (LMS) estimators (Rousseeuw, 1984), which replace the sum in (1) by a median. For a number of years, the difficulty of efficiently optimizing LTS problems limited their application. The problem is difficult because

even if all losses are smooth and convex, (1) is, in general, nonsmooth and nonconvex.

Nevertheless, several approaches for finding LTS and other trimmed M-estimators have been developed. The authors of Rousseeuw and Van Driessen (2006) developed the FAST-LTS algorithm, which was able to find LTS estimators faster than existing algorithms for LMS estimations. Later, Mount et al. (2014) introduced an exact algorithm for computing LTS, which suffered from exponential complexity in higher dimensional problems. Generalizing the approach developed in Rousseeuw and Van Driessen (2006), Neykov and Müller (2003) developed the FAST-TLE method, which replaces the least squares terms in the LTS formulation with log-likelihoods of generalized linear models. In a different direction, Alfons et al. (2013) proposed a sparse variant of the Fast-LTS algorithm for L1-regularized LTS estimation. Further work in (Yang and Lozano, 2015; Yang et al., 2016) proposed algorithms for graphical lasso and regularized trimming of convex losses.

With the exception of Mount et al. (2014); Yang and Lozano (2015); Yang et al. (2016), each of the proposed algorithms above are variants of the alternating minimization algorithm. The algorithms in Yang and Lozano (2015); Yang et al. (2016) mixed alternating minimization and proximal-gradient steps. The algorithm of Mount et al. (2014) is combinatorial in nature, but has exponential complexity.

There are two drawbacks to trimming algorithms based on alternating minimization. First, they are greedy algorithms, which do not always work well for nonconvex problems; and second, they require, at every iteration, solving a large optimization problem typically involving more than 50% of the dataset.222For example, Alfons et al. (2013) requires solving a full LASSO problem at each iteration. And although the algorithm of (Yang et al., 2016) requires only one pass over the dataset at each iteration, this is still problematic for large datasets. The first drawback is well-known in the optimization community, while the second is motivation for introducing stochastic gradient approaches for trimming.

At first glance, the standard stochastic gradient (SG) method appears to be the natural algorithm for solving (1). However, (1) is nonsmooth and nonconvex, so there are, as of yet, no known convergence rate guarantees for SG applied to (1

). In this paper we develop a variance-reduced stochastic gradient algorithm with convergence rate guarantees.

1.1 Contributions

Fully Nonconvex Problem Class.

Our new algorithm extends the Stochastic Monotone Aggregated Root-Finding (SMART) algorithm (Davis, 2016a) to the nonsmooth, nonconvex trimming problem. To keep with tradition, we call this algorithm SMART. It is the first variance-reduced stochastic gradient algorithm for fully nonconvex optimization (our losses and our regularizers are nonconvex). It also applies to much more general problems than (1). We consider the following class:

 minimizew∈Rn,x∈H{1nn∑i=1wifi(x)+r1(w)+r2(x)}, (2)

where each is and and are lower semincontinuous (potentially nonconvex) functions. This more general problem class recovers (1): simply let be the indicator function of the capped simplex

 Δh:={(w1,…,wn)∣w∈[0,1],n∑i=1wi=h},

and minimize jointly over and .

Better Dependence on Lipschitz Constants.

It is possible to apply the proximal gradient algorithm to this problem333For example, the pioneering work of Attouch et al. (2013) proved that the proximal gradient algorithm converges under extremely general conditions. but its convergence is not guaranteed without taking very small stepsizes. This restriction arises because the standard sufficient condition for guaranteeing the convergence of the proximal gradient method requires using a stepsize that is proportional to the inverse of the Lipschitz constant of the gradient of the smooth function , which is not globally Lipschitz: . Even for least squares problems, the local Lipschitz constant of grows with and . This issue likewise prevents our using the ProxSAGA and ProxSVRG (Reddi et al., 2016).

Convergence Rates that Scale with n2/3.

A good alternative to the proximal-gradient method is called the Proximal Alternating Linearized Minimization (PALM) method (Bolte et al., 2014) (see Section 2), which allows for stepsizes that only scale inversely with and the Lipschitz constants of . The convergence rate of this algorithm was analyzed in the fully nonconvex case in (Davis, 2016b, Theorem 5.4), where it was shown that an -stationary point (see Section 3.1) could be found within iterations. Thus, in total PALM finds -stationary points using gradients.

SMART scales better than PALM and other competing methods by a factor of . In particular, without any regularity assumptions

SMART finds an -stationary point with gradient evaluations

(see Corollaries 1 and 2). This matches the complexity of ProxSAGA/ProxSVRG (Reddi et al., 2016), which only apply to the special case of problem (2) considered in Section 2.2.

When a certain error bound holds (see (5)),

SMART finds an -stationary point with gradient evaluations,

where is akin to a condition number of (2) (see Corollaries 3 and 4). In contrast, ProxSAGA and ProxSVRG (Reddi et al., 2016), which only apply to the special case of problem (2) considered in Section 2.2, both require gradient evaluations to reach accuracy .

Organization.

We present algorithms related to SMART in Sections 2.2 and 2.3. We also present several theoretical guarantees for SMART in Section 3. In Section 4, we perform three trimming experiments. We present robust digit recognition for the mnist

dataset, introduce trimmed Principal Component Analysis to determine the quality of judges in the

USJudges dataset, and apply SMART to find a homography between two images using interest point matching. Proofs of the main theorems are presented in the appendices.

1.2 Notation

In Problem (2) the variable is an element of a finite dimensional Euclidean space ; each function is , each gradient is -Lipschitz continuous; both functions and are proper and lower-semicontinuous. We assume that the point-to-set proximal mapping is always nonempty for every small enough, say for if and for if .

We work with an underlying probability space denoted by

, and we assume that the space is equipped with Borel -algebra . An

-valued random variable is a measurable map

. We always let denote the sub -algebra generated by a random variable . We use the shorthand to denote almost sure convergence of a sequence of random variables. By our assumptions on and , for there exists measurable mappings such that for all , where and (Rockafellar and Wets, 1998). For the rest of the paper, we let mean that .

We use the notation

 F(w,x)=1nn∑i=1wifi(x)+r1(w)+r2(x)

throughout the paper and assume exists.

We assume that is bounded: there exists such that for all , we have .

2 Algorithm

To find a stationary point of (2

), our algorithm iteratively updates a state vector

. The algorithm is designed so that will not only be close to a stationary point after just a few iterations, but so that the average computational complexity of obtaining from will be small. These competing objectives can both be achieved simultaneously by combining ideas from the Proximal Alternating Linearized Minimization (PALM) method (Bolte et al., 2014), which obtains from via

 wk+1 :=proxτr1(wk−τn(f1(xk),…,fn(xk))); xk+1 :=proxγr2(xk−γnn∑i=1wk+1i∇fi(xk)),

and the partially stochastic proximal-gradient (PSPG) method, which obtains from via

 wk+1 :=proxτr1(wk−τn(f1(xk),…,fn(xk))); xk+1 :=proxγr2(xk−γknwkik∇fik(xk)),

where is randomly sampled and as .

PALM takes few iterations to obtain near stationary ( accuracy obtained after iterations), but for each it computes the full gradient , which can be costly. On the other hand, PSPG takes many iterations to obtain near stationary , but for each it only computes a single gradient , which can be done quickly. But for nonconvex problems, there is no known rate of convergence for PSPG (unless minibatches of stochastic gradients of increasing size are used (Ghadimi et al., 2016; Davis et al., 2016)). Even in the relatively simple case where , , and are convex, there is still a nonconvex coupling between and and, hence, no known rate of convergence for PSPG.

By reducing the variance of the stochastic gradient estimator , we create a fast algorithm, which we call SMART, that combines the PALM and PSPG updates and obtains an accuracy solution after steps. As in PSPG, SMART typically evaluates a single gradient (or a small batch) at one or two points per iteration. But unlike PSPG, SMART on average only evaluates all the function values once per every iterations, where is user defined.

2.1 Implementation and Features

Rather than evaluating a full gradient at each iteration, we instead sample elements uniformly at random with replacement and denote this collection by ; then we only evaluate for . We assume is IID.

At every iteration we sample a coordinate that indicates whether is modified () or whether is modified () to obtain . We assume that is IID and the variables and are independent. We let

 q :=P(jk=1)>0, q′=P(jk=2)>0.

For each index , we maintain a sequence of dual variables, denoted by . The dual variables are always parametrically defined: for old iterates . The sum approximates the gradient and is used in the following stochastic estimator of the sum, which has smaller variance than the SG estimator :

 1b∑i∈Ik(wki∇fi(xk)−yki)+1nn∑i=1yki. (3)

The dual variables need not be recomputed at every iteration, so can be quite a stale estimate of . We introduce the set-valued random variable and probability

 Dk⊆{1,…,n}; and ρi:=P(i∈Dk),

which control whether the th dual variable is updated at iteration :

 yk+1i={wki∇fi(xk)if i∈Dk;ykiotherwise.

We assume that is IID and that is independent from , but we do not assume that is independent from .

2.2 Connection to ProxSAGA and ProxSVRG.

Our main goal is to use the regularizer to trim statistical models, but we can turn off trimming by choosing to be the convex, -valued indicator that forces all weights to be . In this case, we recover and extend the ProxSAGA algorithm, introduced by Defazio et al. (2014) and recently analyzed for nonconvex problems by Reddi et al. (2016), by letting be a set consisting of elements of , sampled uniformly at random with replacement, and by letting . In terms of implementation, we never perform a or a full gradient update, but at every iteration we update the dual variable for . Our work extends the work by Reddi et al. (2016) by allowing nonconvex regualizers , whereas Reddi et al. (2016) requires to be convex.

We also recover a variant of ProxSVRG, introduced by Xiao and Zhang (2014) and recently analyzed for nonconvex problems analyzed by Reddi et al. (2016), by setting and , where is the average number of iterations we wish to perform before recomputing a full gradient. Although it appears that the step requires a computation of the function values , it does not because . As in the ProxSAGA case, our work extends Reddi et al. (2016) by allowing nonconvex regularizers .

2.3 Connection to Partial Minimization and Randomized Coordinate Descent

With appropriate choices of the random variables , , and , we recover randomized variants of PALM (Bolte et al., 2014) and the full gradient method of Aravkin et al. (2016). The key is to choose , so that all dual variables are constantly updated, and . Then, our stochastic estimator (3) is equal to the full gradient: For fixed , we get a randomized variant of the algorithm of Bolte et al. (2014). For , we get a method similar to that of Aravkin et al. (2016), except that we allow nonconvex regularizers. When is convex, converges to an element of  (Bauschke and Combettes, 2011, Theorem 23.44); in the general case need only be prox bounded, so may not even be defined for large .

3 Convergence Theory

Our convergence rates are organized in Table 1. We separate our sublinear and linear convergence rate results into Section 3.1 and 3.2, respectively.

3.1 Sublinear Rates

ε-Stationary Points.

For all , we define and by:

 ¯¯¯¯wk+1 :=proxτr1(wk−τn(f1(xk),…,fn(xk))) ¯¯¯xk+1 :=prox(γ/η)r2(xk−γηnn∑i=1wki∇fi(xk)).

SMART never actually computes ; it is only used in the analysis of the algorithm. Its existence shows that a nearby, nearly stationary point can be obtained with gradient evaluations. For our analysis, it is crucial that be a constant greater than 1, i.e., we must shorten the steplength in order to measure stationarity.

We measure convergence of by bounding the normalized step sizes

 1τ(wk−¯¯¯¯wk+1) ∈1n(f1(xk),…,fn(xk))+∂Lr1(¯¯¯¯wk+1); ηγ(xk−¯¯¯xk+1) ∈1nn∑i=1wki∇fi(xk)+∂Lr2(¯¯¯xk+1),

where denotes the limiting subdifferential of (Rockafellar and Wets, 1998, Definition 8.3). It is common to compute bounds on the square of these step lengths, although it is perhaps misleading to do so. To make it easy to compare our results with the current literature, we also bound the squared steplengths Theorem 3.1.

Using the Lipschitz continuity of and the local Lipschitz continuity of , these bounds easily translate bounds on We omit this straightforward derivation.

Independence of Algorithm History and Sampling

The SMART algorithm generates a sequence of random variables . Throughout the algorithm, we make the standard assumption that

Assumption 1

The -algebra generated by the history of SMART, denoted by is independent of the -algebra .

SMART converges, provided we choose properly. In measuring convergence, we introduce a particular (which depends on a user defined constant ):

 η =2+4γ⎡⎢ ⎢ ⎢⎣  ⎷1nn∑i=1q′(1+ϵ0)(BiL)22b(1−√q′(1−ρi))2+4Lnn∑i=1Bi⎤⎥ ⎥ ⎥⎦. (4)

This constant is key for showing that Algorithm 1 converges with nonconvex regularizers and . We place the proof in Appendix A.

Theorem 3.1 (SMART Converges)

Suppose is generated by Algorithm 1 and that Assumption 1 holds. Let and let be defined as in (4). Then, if

 γ≤14L√1n∑ni=1q′(1+ϵ0)B2i2b(1−√q′(1−ρi))2+Ln∑ni=1Bi,

the following hold:

1. Objective Decrease. The limit exists almost surely and for all , we have

 E[F(wk+1,xk+1)∣Fk] ≤F(w0,x0)−k∑t=0[q′γ2η∥∥∥ηγ(xt−¯¯¯xt+1)∥∥∥2+qτ2∥∥∥1τ(wt−¯¯¯¯wt+1)∥∥∥2].
2. Limit Points are Stationary. Suppose that the sequence is almost surely bounded. Then converges almost surely to a random variable. Moreover, there exists a subset such that and for all , every limit point of is a stationary point of .

3. Convergence Rate. Fix . Sample uniformly at random from . Then

 q′γ2ηE[∥∥∥ηγ(xt0−¯¯¯xt0+1)∥∥∥2]+qτ2E[∥∥∥1τ(wt0−¯¯¯¯wt0+1)∥∥∥2]≤F(w0,x0)−F(w∗,x∗)T.

With proper choices of , we actually achieve an -accuracy solution with fewer gradient and function evaluations than the proximal gradient method or PALM (Bolte et al., 2014), which require gradient evaluations and function evaluations.

The first corollary, whose proof is given Appendix A.1, applies to a variant of the ProxSAGA algorithm:

Corollary 1 (Convergence Rate of SAGA Variant of SMART)

Suppose that ,

 γ=14L ⎷1n∑ni=1(1−1/n)(1+ϵ0)B2i2b(1−√(1−1/n)b+1)2+Ln∑ni=1Bi, and τ=(n−1)γη.

Then SMART achieves an accurate solution with, on average, gradient evaluations, evaluations of , function evaluations, and evaluations of . In particular, when , SMART achieves an accurate solution with, on average, gradient evaluations, evaluations, function evaluations, and evaluations.

The second corollary, whose proof is given Appendix A.2, applies to a variant of the ProxSVRG algorithm:

Corollary 2 (Convergence Rate of SVRG Variant of SMART)

Suppose that , ,

 γ=14L ⎷1n∑ni=1(1−1/n)b(1+ϵ0)B2i2b(1−√(1−1/n)b)2+Ln∑ni=1Bi, and τ=(1−(1−1/n)b)(1−1/n)bγη.

Then SMART achieves an accurate solution with, on average, gradient evaluations, evaluations of , function evaluations, and evaluations of . In particular, when , SMART achieves an accurate solution with, on average, gradient evaluations, evaluations, function evaluations, and evaluations.

3.2 Linear Rates

Assuming that an error bound holds for all points , a potentially bounded set, we can prove stronger convergence rates.

The Global Error Bound.

In our analysis, we use a modified globalization of the error bound found in Drusvyatskiy and Lewis (2016). We assume that there exists such that for all , we have

 μ[F(w,x)−F(w∗,x∗)]≤ ∥∥ ∥∥ηγ(x−prox(γ/η)r1(x−γηn∑i=1wi∇fi(x)))∥∥ ∥∥2 + ∥∥∥1τ(w−proxτr2(x−τ(f1(x),…,fn(x))))∥∥∥2 (5)

Drusvyatskiy and Lewis (2016) use a localized version of (5) to prove linear convergence of a proximal algorithm for minimizing convex composite objectives. Our error bound differs from their error bound in two ways: (1) their bound is only assumed to hold locally around critical points of ; and (2) their right hand side is , rather than . We use this simplified error bound to keep the presentation short, but in future work, we may study the behavior of SMART assuming the localized bound in Drusvyatskiy and Lewis (2016).444Equation (5) is also quite similar to the Kurdyka-Łojasiewicz (KL) inequality with exponent  (Bolte et al., 2007a, b), which replaces the left hand side of (5) by . Its straightforward to prove linear convergence of SMART under this globalized KL error bound, but we omit it to keep the presentation short.

As in the sublinear case, we define a constant (which depends on a user defined constant ):

 η =2+4γ⎡⎣ ⎷1nn∑i=1q′(1+ϵ0)(BiL)22b√q′(1−ρi)(1−(q′(1−ρi))1/4)2+Lnn∑i=1Bi⎤⎦. (6)

The ratio controls the linear convergence rate of SMART.

Theorem 3.2 (Convergence Rate of SMART Assuming a Global Error Bound)

Assume the notation of Theorem 3.1. Let , let be defined as in (6), and let

 γ=14L√1n∑ni=1q′(1+ϵ0)B2i2b√q′(1−ρi)(1−(q′(1−ρi))1/4)2+Ln∑ni=1Bi.

Define Then provided that the error bound (5) holds, then we have

 (∀k∈N)E[F(wk,xk)−F(w∗,x∗)]≤δk[F(w0,x0)−F(w∗,x∗)].

By assuming an error bound similar to (5) and employing a restart strategy, Reddi et al. (2016) developed a linearly converging variant of ProxSAGA and ProxSVRG. In this strategy, the authors ran ProxSAGA or ProxSVRG for iterations, where is akin to the inverse condition number before restarting the algorithm. Every time that ProxSAGA or ProxSVRG is restarted, a full gradient must be computed. In contrast, SMART never needs to be restarted: it simply adapts to the regularity of the problem at hand.

Frequent restarts of ProxSAGA and ProxSVRG lead to worse complexity. In both of the corollaries below, we show SMART needs gradients to reach accuracy . In contrast, ProxSAGA/SVRG need gradients to reach accuracy .

The first corollary, whose proof is given Appendix B.1, applies to a variant of the ProxSAGA algorithm:

Corollary 3 (Linear Convergence Rate of SAGA Variant of SMART)

Suppose that , , that is chosen as in Theorem 3.2, and . Then SMART achieves an accurate solution with, on average, gradient evaluations, evaluations of , function evaluations, and evaluations of . In particular, when , SMART achieves an accurate solution with, on average, gradient evaluations, evaluations, function evaluations, and evaluations.

The second corollary, whose proof is a straightforward modification of the proof of Corollaries 3 and 2, applies to a variant of the ProxSVRG algorithm:

Corollary 4 (Linear Convergence Rate of SVRG Variant of SMART)

Suppose that , , that is chosen as in Theorem 3.2, and that Then SMART achieves an accurate solution with, on average, gradient evaluations, evaluations of , function evaluations, and evaluations of . In particular, when , SMART achieves an accurate solution with, on average, gradient evaluations, -proximal operator evaluations, function evaluations, and evaluations.

4 Numerics

In this section we perform trimmed model fitting (i.e., we solve (