1 Introduction
Potential outliers in datasets can be identified in several ways. For lowdimensional models, scatter plots, box plots, and histograms can be used to visually identify points that deviate from modeling assumptions. For higherdimensional data, several tests involving order statistics exist (so called Lestimators
(Maronna et al., 2006)), such as the threesigma rule for Gaussian data, or trimming strategies for disregarding points that are furthest away from the mean. After potential outliers are removed from a dataset, models are fit on the remaining data. After fitting the model, potential outliers are again identified and removed and another model is fit (Ruppert and Carroll, 1980). This process can repeat indefinitely, until no points are left in the dataset.Identifying outliers using a fitted model can be problematic, since outliers affect the fit. Robust loss functions are often used to estimate model parameters from potentially contaminated data, without any
a priori outlier removal or preprocessing. Examples include the , huber, and Student’s t losses, all of which attempt to minimize the influence of observations that deviate from modeling assumptions (Huber, 2004; Lange et al., 1989). After fitting a model using a robust loss, potential outliers can be identified by sorting the loss applied to individual observations. Observations with higher loss are considered more likely to be outliers.Another approach, called trimmed estimation, couples explicit outlier identification and removal with model fitting. Given a set of training examples, typical model fitting, i.e., Mestimation, solves
where each represents the loss associated with the th training example. In contrast, trimmed Mestimators couple this already difficult, potentially nonconvex, optimization problem with explicit outlier removal
(1) 
where are the first order statistics of the objective values. If loss is the log likelihood of the th observed sample, then trimming attempts to jointly fit a probabilistic model while simultaneously eliminating the influence of all low likelihood observations.
Trimmed Mestimators were initially introduced by Rousseeuw (1985)
in the context of leastsquares regression. The author’s original motivation was to develop linear regression estimators that have a high breakdown point (in this case 50%) and good statistical efficiency (in this case
)^{1}^{1}1Breakdown refers to the percentage of outlying points which can be added to a dataset before the resulting Mestimator can change in an unbounded way. Here, outliers can affect both the outcomes and training data (features).. These Least Trimmed Squares (LTS) estimators were proposed as a higher efficiency alternative to Least Median Squares (LMS) estimators (Rousseeuw, 1984), which replace the sum in (1) by a median. For a number of years, the difficulty of efficiently optimizing LTS problems limited their application. The problem is difficult becauseeven if all losses are smooth and convex, (1) is, in general, nonsmooth and nonconvex.
Nevertheless, several approaches for finding LTS and other trimmed Mestimators have been developed. The authors of Rousseeuw and Van Driessen (2006) developed the FASTLTS algorithm, which was able to find LTS estimators faster than existing algorithms for LMS estimations. Later, Mount et al. (2014) introduced an exact algorithm for computing LTS, which suffered from exponential complexity in higher dimensional problems. Generalizing the approach developed in Rousseeuw and Van Driessen (2006), Neykov and Müller (2003) developed the FASTTLE method, which replaces the least squares terms in the LTS formulation with loglikelihoods of generalized linear models. In a different direction, Alfons et al. (2013) proposed a sparse variant of the FastLTS algorithm for L1regularized LTS estimation. Further work in (Yang and Lozano, 2015; Yang et al., 2016) proposed algorithms for graphical lasso and regularized trimming of convex losses.
With the exception of Mount et al. (2014); Yang and Lozano (2015); Yang et al. (2016), each of the proposed algorithms above are variants of the alternating minimization algorithm. The algorithms in Yang and Lozano (2015); Yang et al. (2016) mixed alternating minimization and proximalgradient steps. The algorithm of Mount et al. (2014) is combinatorial in nature, but has exponential complexity.
There are two drawbacks to trimming algorithms based on alternating minimization. First, they are greedy algorithms, which do not always work well for nonconvex problems; and second, they require, at every iteration, solving a large optimization problem typically involving more than 50% of the dataset.^{2}^{2}2For example, Alfons et al. (2013) requires solving a full LASSO problem at each iteration. And although the algorithm of (Yang et al., 2016) requires only one pass over the dataset at each iteration, this is still problematic for large datasets. The first drawback is wellknown in the optimization community, while the second is motivation for introducing stochastic gradient approaches for trimming.
At first glance, the standard stochastic gradient (SG) method appears to be the natural algorithm for solving (1). However, (1) is nonsmooth and nonconvex, so there are, as of yet, no known convergence rate guarantees for SG applied to (1
). In this paper we develop a variancereduced stochastic gradient algorithm with convergence rate guarantees.
1.1 Contributions
Fully Nonconvex Problem Class.
Our new algorithm extends the Stochastic Monotone Aggregated RootFinding (SMART) algorithm (Davis, 2016a) to the nonsmooth, nonconvex trimming problem. To keep with tradition, we call this algorithm SMART. It is the first variancereduced stochastic gradient algorithm for fully nonconvex optimization (our losses and our regularizers are nonconvex). It also applies to much more general problems than (1). We consider the following class:
(2) 
where each is and and are lower semincontinuous (potentially nonconvex) functions. This more general problem class recovers (1): simply let be the indicator function of the capped simplex
and minimize jointly over and .
Better Dependence on Lipschitz Constants.
It is possible to apply the proximal gradient algorithm to this problem^{3}^{3}3For example, the pioneering work of Attouch et al. (2013) proved that the proximal gradient algorithm converges under extremely general conditions. but its convergence is not guaranteed without taking very small stepsizes. This restriction arises because the standard sufficient condition for guaranteeing the convergence of the proximal gradient method requires using a stepsize that is proportional to the inverse of the Lipschitz constant of the gradient of the smooth function , which is not globally Lipschitz: . Even for least squares problems, the local Lipschitz constant of grows with and . This issue likewise prevents our using the ProxSAGA and ProxSVRG (Reddi et al., 2016).
Convergence Rates that Scale with .
A good alternative to the proximalgradient method is called the Proximal Alternating Linearized Minimization (PALM) method (Bolte et al., 2014) (see Section 2), which allows for stepsizes that only scale inversely with and the Lipschitz constants of . The convergence rate of this algorithm was analyzed in the fully nonconvex case in (Davis, 2016b, Theorem 5.4), where it was shown that an stationary point (see Section 3.1) could be found within iterations. Thus, in total PALM finds stationary points using gradients.
SMART scales better than PALM and other competing methods by a factor of . In particular, without any regularity assumptions
SMART finds an stationary point with gradient evaluations
(see Corollaries 1 and 2). This matches the complexity of ProxSAGA/ProxSVRG (Reddi et al., 2016), which only apply to the special case of problem (2) considered in Section 2.2.
When a certain error bound holds (see (5)),
SMART finds an stationary point with gradient evaluations,
where is akin to a condition number of (2) (see Corollaries 3 and 4). In contrast, ProxSAGA and ProxSVRG (Reddi et al., 2016), which only apply to the special case of problem (2) considered in Section 2.2, both require gradient evaluations to reach accuracy .
Organization.
We present algorithms related to SMART in Sections 2.2 and 2.3. We also present several theoretical guarantees for SMART in Section 3. In Section 4, we perform three trimming experiments. We present robust digit recognition for the mnist
dataset, introduce trimmed Principal Component Analysis to determine the quality of judges in the
USJudges dataset, and apply SMART to find a homography between two images using interest point matching. Proofs of the main theorems are presented in the appendices.1.2 Notation
In Problem (2) the variable is an element of a finite dimensional Euclidean space ; each function is , each gradient is Lipschitz continuous; both functions and are proper and lowersemicontinuous. We assume that the pointtoset proximal mapping is always nonempty for every small enough, say for if and for if .
We work with an underlying probability space denoted by
, and we assume that the space is equipped with Borel algebra . Anvalued random variable is a measurable map
. We always let denote the sub algebra generated by a random variable . We use the shorthand to denote almost sure convergence of a sequence of random variables. By our assumptions on and , for there exists measurable mappings such that for all , where and (Rockafellar and Wets, 1998). For the rest of the paper, we let mean that .We use the notation
throughout the paper and assume exists.
We assume that is bounded: there exists such that for all , we have .
2 Algorithm
To find a stationary point of (2
), our algorithm iteratively updates a state vector
. The algorithm is designed so that will not only be close to a stationary point after just a few iterations, but so that the average computational complexity of obtaining from will be small. These competing objectives can both be achieved simultaneously by combining ideas from the Proximal Alternating Linearized Minimization (PALM) method (Bolte et al., 2014), which obtains from viaand the partially stochastic proximalgradient (PSPG) method, which obtains from via
where is randomly sampled and as .
PALM takes few iterations to obtain near stationary ( accuracy obtained after iterations), but for each it computes the full gradient , which can be costly. On the other hand, PSPG takes many iterations to obtain near stationary , but for each it only computes a single gradient , which can be done quickly. But for nonconvex problems, there is no known rate of convergence for PSPG (unless minibatches of stochastic gradients of increasing size are used (Ghadimi et al., 2016; Davis et al., 2016)). Even in the relatively simple case where , , and are convex, there is still a nonconvex coupling between and and, hence, no known rate of convergence for PSPG.
By reducing the variance of the stochastic gradient estimator , we create a fast algorithm, which we call SMART, that combines the PALM and PSPG updates and obtains an accuracy solution after steps. As in PSPG, SMART typically evaluates a single gradient (or a small batch) at one or two points per iteration. But unlike PSPG, SMART on average only evaluates all the function values once per every iterations, where is user defined.
2.1 Implementation and Features
Incremental Gradients and Minibatches.
Rather than evaluating a full gradient at each iteration, we instead sample elements uniformly at random with replacement and denote this collection by ; then we only evaluate for . We assume is IID.
Block Coordinates Updates.
At every iteration we sample a coordinate that indicates whether is modified () or whether is modified () to obtain . We assume that is IID and the variables and are independent. We let
Dual Variables and Dual Updates.
For each index , we maintain a sequence of dual variables, denoted by . The dual variables are always parametrically defined: for old iterates . The sum approximates the gradient and is used in the following stochastic estimator of the sum, which has smaller variance than the SG estimator :
(3) 
The dual variables need not be recomputed at every iteration, so can be quite a stale estimate of . We introduce the setvalued random variable and probability
and 
which control whether the th dual variable is updated at iteration :
We assume that is IID and that is independent from , but we do not assume that is independent from .
2.2 Connection to ProxSAGA and ProxSVRG.
Our main goal is to use the regularizer to trim statistical models, but we can turn off trimming by choosing to be the convex, valued indicator that forces all weights to be . In this case, we recover and extend the ProxSAGA algorithm, introduced by Defazio et al. (2014) and recently analyzed for nonconvex problems by Reddi et al. (2016), by letting be a set consisting of elements of , sampled uniformly at random with replacement, and by letting . In terms of implementation, we never perform a or a full gradient update, but at every iteration we update the dual variable for . Our work extends the work by Reddi et al. (2016) by allowing nonconvex regualizers , whereas Reddi et al. (2016) requires to be convex.
We also recover a variant of ProxSVRG, introduced by Xiao and Zhang (2014) and recently analyzed for nonconvex problems analyzed by Reddi et al. (2016), by setting and , where is the average number of iterations we wish to perform before recomputing a full gradient. Although it appears that the step requires a computation of the function values , it does not because . As in the ProxSAGA case, our work extends Reddi et al. (2016) by allowing nonconvex regularizers .
2.3 Connection to Partial Minimization and Randomized Coordinate Descent
With appropriate choices of the random variables , , and , we recover randomized variants of PALM (Bolte et al., 2014) and the full gradient method of Aravkin et al. (2016). The key is to choose , so that all dual variables are constantly updated, and . Then, our stochastic estimator (3) is equal to the full gradient: For fixed , we get a randomized variant of the algorithm of Bolte et al. (2014). For , we get a method similar to that of Aravkin et al. (2016), except that we allow nonconvex regularizers. When is convex, converges to an element of (Bauschke and Combettes, 2011, Theorem 23.44); in the general case need only be prox bounded, so may not even be defined for large .
3 Convergence Theory
Our convergence rates are organized in Table 1. We separate our sublinear and linear convergence rate results into Section 3.1 and 3.2, respectively.
3.1 Sublinear Rates
Stationary Points.
For all , we define and by:
SMART never actually computes ; it is only used in the analysis of the algorithm. Its existence shows that a nearby, nearly stationary point can be obtained with gradient evaluations. For our analysis, it is crucial that be a constant greater than 1, i.e., we must shorten the steplength in order to measure stationarity.
We measure convergence of by bounding the normalized step sizes
where denotes the limiting subdifferential of (Rockafellar and Wets, 1998, Definition 8.3). It is common to compute bounds on the square of these step lengths, although it is perhaps misleading to do so. To make it easy to compare our results with the current literature, we also bound the squared steplengths Theorem 3.1.
Using the Lipschitz continuity of and the local Lipschitz continuity of , these bounds easily translate bounds on We omit this straightforward derivation.
Algorithm  GradEvals  FunEvals  Evals  Evals 

SMART(SAGA)  
SMART(SAGA(5))  
SMART(SVRG)  
SMART(SVRG(5))  
PALM  
PALM((5)) 
Independence of Algorithm History and Sampling
The SMART algorithm generates a sequence of random variables . Throughout the algorithm, we make the standard assumption that
Assumption 1
The algebra generated by the history of SMART, denoted by is independent of the algebra .
SMART converges, provided we choose properly. In measuring convergence, we introduce a particular (which depends on a user defined constant ):
(4) 
This constant is key for showing that Algorithm 1 converges with nonconvex regularizers and . We place the proof in Appendix A.
Theorem 3.1 (SMART Converges)
Suppose is generated by Algorithm 1 and that Assumption 1 holds. Let and let be defined as in (4). Then, if
the following hold:

Objective Decrease. The limit exists almost surely and for all , we have

Limit Points are Stationary. Suppose that the sequence is almost surely bounded. Then converges almost surely to a random variable. Moreover, there exists a subset such that and for all , every limit point of is a stationary point of .

Convergence Rate. Fix . Sample uniformly at random from . Then
With proper choices of , we actually achieve an accuracy solution with fewer gradient and function evaluations than the proximal gradient method or PALM (Bolte et al., 2014), which require gradient evaluations and function evaluations.
The first corollary, whose proof is given Appendix A.1, applies to a variant of the ProxSAGA algorithm:
Corollary 1 (Convergence Rate of SAGA Variant of SMART)
Suppose that ,
and 
Then SMART achieves an accurate solution with, on average, gradient evaluations, evaluations of , function evaluations, and evaluations of . In particular, when , SMART achieves an accurate solution with, on average, gradient evaluations, evaluations, function evaluations, and evaluations.
The second corollary, whose proof is given Appendix A.2, applies to a variant of the ProxSVRG algorithm:
Corollary 2 (Convergence Rate of SVRG Variant of SMART)
Suppose that , ,
and 
Then SMART achieves an accurate solution with, on average, gradient evaluations, evaluations of , function evaluations, and evaluations of . In particular, when , SMART achieves an accurate solution with, on average, gradient evaluations, evaluations, function evaluations, and evaluations.
3.2 Linear Rates
Assuming that an error bound holds for all points , a potentially bounded set, we can prove stronger convergence rates.
The Global Error Bound.
In our analysis, we use a modified globalization of the error bound found in Drusvyatskiy and Lewis (2016). We assume that there exists such that for all , we have
(5) 
Drusvyatskiy and Lewis (2016) use a localized version of (5) to prove linear convergence of a proximal algorithm for minimizing convex composite objectives. Our error bound differs from their error bound in two ways: (1) their bound is only assumed to hold locally around critical points of ; and (2) their right hand side is , rather than . We use this simplified error bound to keep the presentation short, but in future work, we may study the behavior of SMART assuming the localized bound in Drusvyatskiy and Lewis (2016).^{4}^{4}4Equation (5) is also quite similar to the KurdykaŁojasiewicz (KL) inequality with exponent (Bolte et al., 2007a, b), which replaces the left hand side of (5) by . Its straightforward to prove linear convergence of SMART under this globalized KL error bound, but we omit it to keep the presentation short.
As in the sublinear case, we define a constant (which depends on a user defined constant ):
(6) 
The ratio controls the linear convergence rate of SMART.
Theorem 3.2 (Convergence Rate of SMART Assuming a Global Error Bound)
By assuming an error bound similar to (5) and employing a restart strategy, Reddi et al. (2016) developed a linearly converging variant of ProxSAGA and ProxSVRG. In this strategy, the authors ran ProxSAGA or ProxSVRG for iterations, where is akin to the inverse condition number before restarting the algorithm. Every time that ProxSAGA or ProxSVRG is restarted, a full gradient must be computed. In contrast, SMART never needs to be restarted: it simply adapts to the regularity of the problem at hand.
Frequent restarts of ProxSAGA and ProxSVRG lead to worse complexity. In both of the corollaries below, we show SMART needs gradients to reach accuracy . In contrast, ProxSAGA/SVRG need gradients to reach accuracy .
The first corollary, whose proof is given Appendix B.1, applies to a variant of the ProxSAGA algorithm:
Corollary 3 (Linear Convergence Rate of SAGA Variant of SMART)
Suppose that , , that is chosen as in Theorem 3.2, and . Then SMART achieves an accurate solution with, on average, gradient evaluations, evaluations of , function evaluations, and evaluations of . In particular, when , SMART achieves an accurate solution with, on average, gradient evaluations, evaluations, function evaluations, and evaluations.
The second corollary, whose proof is a straightforward modification of the proof of Corollaries 3 and 2, applies to a variant of the ProxSVRG algorithm:
Corollary 4 (Linear Convergence Rate of SVRG Variant of SMART)
Suppose that , , that is chosen as in Theorem 3.2, and that Then SMART achieves an accurate solution with, on average, gradient evaluations, evaluations of , function evaluations, and evaluations of . In particular, when , SMART achieves an accurate solution with, on average, gradient evaluations, proximal operator evaluations, function evaluations, and evaluations.
4 Numerics
In this section we perform trimmed model fitting (i.e., we solve (1) with a regularizer) on three models/datasets:
The latter two applications are formulated using nonconvex constraints. Plots for figures 2 and 3 were generated with Matplotlib (Hunter, 2007).
4.1 Multiclass classification
The mnist training dataset contains 60000 pictures of handwritten digits between 09. We model automated digit recognition as a multiclass classification problem with
classes. We briefly review multinomial logistic regression to align (
1) with our current formulation.Formulation:
We are given
Comments
There are no comments yet.