Uncertainty is inherent in the real world: the data-generating processes found in physical and social systems are stochastic by nature. Machine learning algorithms are fueled by this random data, and coupled with our impoverished knowledge of the mechanisms underlying such systems, it becomes very difficult to provide exact, certain guarantees on the performance of learning algorithms, even for tasks that seem basic at face value. At best, we can typically provide “weak contracts”: probabilistic statements about the distribution of some performance metric, which depend on the algorithmic approach taken and the nature of the data sampled for training. More concretely, it is traditional to formulate machine learning tasks as risk minimization problems over some set of candidates, where the risk is defined by
The risk of a candidate is the expected value of the loss evaluated at that candidate, where the expectation is taken over our random data, which takes values in some underlying set. This notion of risk dates back to classical statistical decision theory. In the context of machine learning (by empirical risk minimization), it was popularized in the statistical community by the work of Vapnik, and in the computer science community by the work of Haussler. A learning algorithm will have access to a sample of data points drawn from the underlying distribution. Write
to denote the output of an arbitrary learning algorithm. The usual starting point for analyzing algorithm performance is the estimation error, namely the gap between the risk achieved by the algorithm's output and the best possible risk, or more precisely, the distribution of this error. Since we never know much about the underlying data-generating process, typically all we can assume is that the data distribution belongs to some class
of probability measures on the data space, and typical guarantees are given in the form of
Flipping the inequalities around, this says that the algorithm's output enjoys ε-good performance with high confidence over the draw of the sample, where the error level depends on the sample size, the desired confidence level, the underlying data distribution, and any constraints encoded in the model class. In order to have meaningful performance guarantees, citing Holland, the following properties are important.
Transparency: can we actually compute the output that we study in theory?
Strength: what form do bounds on the error take? How rich is the class of distributions they cover?
Scalability: how do computational costs scale with the above-mentioned factors?
Balancing these three points is critical to developing guarantees for algorithms that will actually be used in practice.
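In generic notation (an illustrative reconstruction; the symbols ε, δ, n, and μ are ours, not fixed by the text above), such a weak contract can be written as:

```latex
% With an n-point sample from distribution mu, the output \hat{h}_n of the
% learner fails to be epsilon-good with probability at most delta, where
% the error level epsilon depends on n, delta, mu, and the model class.
\mathbb{P}\left\{ R(\hat{h}_{n}) - \inf_{h \in \mathcal{H}} R(h)
    > \varepsilon(n, \delta, \mu) \right\} \le \delta .
```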
Our problem setting
This work considers the setup of potentially heavy-tailed data under general convex loss functions; it is meant as a companion to earlier work that specialized to the case of strongly convex losses, a requirement we do not make here. More concretely, all the learner can know is that, for some finite moment bound,
where typically only such low-order moment bounds are available. Thus, it is unknown whether the losses (or partial derivatives, etc.) are congenial in a sub-Gaussian sense (where (1) holds for all moment orders), or heavy-tailed in the sense that all higher-order moments could be infinite or undefined. The goal then comes down to obtaining the strongest possible guarantees for a tractable learning algorithm, given only (1). We next review the related technical literature, and give an overview of our contributions.
2 Context and contributions
Challenges under weak convexity
When one is lucky enough to have a strongly convex risk, a simple idea opens up a wide range of distance-based algorithmic strategies [22, 16, 12]. For example, say we construct several candidates, and we know that with high probability, a majority of the candidates are ε-good in terms of the risk. Since the risk is unknown, we can never know which candidates are the ε-good ones. However, this barrier can be circumvented by utilizing the fact that strong convexity of the risk implies that any ε-good candidate must be correspondingly close to the risk minimizer. It follows that on the “good event” in which the majority of candidates are ε-good, it is sufficient to simply “follow the majority.” This can be done in various ways, but in the end all such procedures come down to computing and comparing pairwise distances between the candidates. This can be done without knowing which of the candidates are ε-good, which makes the problem tractable.
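The “follow the majority” idea can be sketched in a few lines. The selection rule below, which picks the candidate with the smallest median distance to its peers, is our illustration of this family of strategies, not the exact procedure from the cited works:

```python
import numpy as np

def follow_the_majority(candidates):
    """Select the candidate closest to the majority of the others.

    Under strong convexity, epsilon-good candidates cluster near the risk
    minimizer, so the candidate with the smallest median distance to its
    peers sits inside the good majority whenever that majority exists.
    """
    C = np.asarray(candidates, dtype=float)                 # shape (k, d)
    dists = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)  # pairwise
    scores = np.median(dists, axis=1)   # median distance to the other candidates
    return C[int(np.argmin(scores))]
```

Note that this uses only distances, never risk values, which is exactly why it breaks down once strong convexity (and hence the good-implies-close property) is lost.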
Unfortunately, strong convexity is a luxury that is often unavailable. In particular, in high-dimensional settings it is common for the strong convexity parameter to shrink rapidly as the dimension grows, making error bounds that depend on this parameter vacuous. Algorithmically, if strong convexity cannot be guaranteed, then the distance-based strategy just described will fail, since for any particular minimizer, it is perfectly plausible to have an ε-good candidate which is arbitrarily far away (see Figure 1). Even when we assume smoothness of the risk, all we can say is that sufficiently bad candidates must be far from all minimizers; the converse need not hold. The traditional approach to this problem is to set aside some additional data, and simply choose the empirical risk minimizer among the candidates on this new data. More concretely, assume that from the first sample we obtain several independent candidates, and that we have a second sample available for “validation,” as it were. With this second sample, the traditional approach has the learner return
This technique of confidence boosting for bounded losses is well-known; see Kearns and Vazirani [18, Ch. 4.2] for a textbook introduction, and a more modern statement due to Shalev-Shwartz et al. [27, Thm. 26]. Under exp-concave distributions, Mehta also recently made use of this technique. Problems arise, however, when the losses can be potentially heavy-tailed: the quality of the validated final candidate is only as good as the precision of the risk estimate, and the empirical risk is well-known to be sub-optimal under potentially heavy-tailed data.
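The traditional validation step amounts to the following (a minimal sketch; `loss` and the candidate list are user-supplied names of our choosing, and the plain sample mean used here is precisely the weak point under heavy tails):

```python
import numpy as np

def validate_by_empirical_risk(candidates, loss, Z_val):
    """Classical confidence boosting: return the candidate with the smallest
    empirical risk on a held-out validation sample.

    With bounded losses, a Hoeffding-type bound makes the winner near-optimal
    with high probability; under heavy tails the plain mean can be badly off.
    """
    risks = [np.mean([loss(th, z) for z in Z_val]) for th in candidates]
    return candidates[int(np.argmin(risks))]
```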
Limitations of existing procedures
Recalling the three properties of transparency, strength, and scalability discussed in the previous section, here we briefly compare and contrast different algorithmic strategies within the context of these properties (see Holland  for a more detailed review).
To begin, empirical risk minimization (ERM) is the cornerstone of classical learning theory, which studies the statistical properties of any minimizer of the empirical risk, i.e., the sample mean of the losses. Concrete implementations of ERM just require minimizing a finite sum, and thus are computationally congenial and scale well, taken at face value. However, formal guarantees for ERM-based procedures are limited: the empirical mean is known to be sensitive to outliers, and this sensitivity shows up as weak formal guarantees. Concretely, under potentially heavy-tailed losses, the empirical mean is sub-optimal in that it cannot guarantee sub-Gaussian deviation bounds; put roughly, its error bounds scale polynomially, rather than logarithmically, in the inverse confidence level; see Catoni and Devroye et al. for more details. Furthermore, since ERM leaves the method of implementation completely abstract, a large conceptual gap remains. Feldman showed lucidly how there exist both “good” and “bad” ERM solutions; the problem with transparency is that we can never know whether any particular ERM candidate is one of the good ones.
Next, starting with seminal work by Brownlees et al., a recent line of work has led to new statistical learning procedures that address the weak guarantees and lack of robustness of ERM. The basic idea is simply to minimize a different estimator of the risk, for example a median-of-means estimator or an M-estimator. Under weak moment bounds like (1), their minimizers enjoy excess risk rates with logarithmic dependence on the inverse confidence level. This provides a significant improvement in the strength of guarantees over ERM, but unfortunately the issue of transparency remains. As with ERM, the algorithmic side of the problem is left abstract, and minimizing these robust risk estimates may in general be a much more difficult computational task. As such, the gap between formal guarantees and the guarantees that hold for any given output of the algorithm may be even more severe than in the case of ERM.
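As a concrete example of such a robust risk estimator, a basic median-of-means mean estimate can be computed as follows (a sketch; for i.i.d. data one often permutes before splitting, and block counts of order log(1/δ) are typical):

```python
import numpy as np

def median_of_means(x, k):
    """Median-of-means estimate of E[x]: split the sample into k blocks,
    average each block, and return the median of the block means.

    Assuming only finite variance, choosing k on the order of log(1/delta)
    yields sub-Gaussian deviation bounds at confidence level 1 - delta.
    """
    x = np.asarray(x, dtype=float)
    blocks = np.array_split(x, k)               # k nearly-equal blocks
    return float(np.median([b.mean() for b in blocks]))
```

A single wild observation can only corrupt one block mean, which the median then discards; the plain sample mean has no such protection.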
Finally, in an effort to address the transparency gap without sacrificing the strength of formal guarantees, several new families of algorithms have been designed in the past few years to tackle the potentially heavy-tailed setting using a tractable procedure. Such algorithms may naturally be called robust gradient descent (RGD), since the core update takes the same form as steepest descent applied directly to the risk, except that the unknown risk gradient is replaced with an estimator
whose deviations admit near-optimal confidence intervals under potentially heavy-tailed data (i.e., both the losses and partial gradients can be heavy-tailed). The first works in this line came from Chen et al. and Holland and Ikeda, with further developments by Prasad et al., Lecué et al., and Holland. Under strong convexity, all these proposals enjoy excess risk bounds with optimal dependence on the sample size and confidence level under potentially heavy-tailed data, with the significant merit that the computational procedures are transparent and easy to implement as-is. Unfortunately, instead of a simple one-dimensional robust mean estimate as in Brownlees et al., all RGD methods rely on sub-routines that work in d dimensions. This makes the procedures much more expensive computationally for “big” learning tasks, and leads to an undesirable dependence on the ambient dimension in the statistical guarantees as well, hampering their overall scalability. Furthermore, when strong convexity is not available, the propagation of statistical error over time for RGD methods becomes much worse, leading to bounds that are extremely sensitive to mis-specified smoothness parameters, step sizes, and the total number of iterations. We provide more detailed discussion of this point in section 3.3, with key bounds summarized in Table 1.
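A single RGD update might look as follows; we use a coordinate-wise median-of-means gradient estimate as a simplified stand-in for the estimators used in the cited works, so this is an illustration of the update's shape, not a verbatim reproduction of any one method:

```python
import numpy as np

def rgd_step(theta, grad_fn, Z, step, k=5):
    """One robust gradient descent step.

    The sample-mean gradient is replaced with a coordinate-wise
    median-of-means estimate over k blocks, so the update direction is
    resistant to a small number of heavy-tailed per-point gradients.
    """
    G = np.stack([grad_fn(theta, z) for z in Z])     # (n, d) per-point grads
    blocks = np.array_split(G, k)                    # k blocks along axis 0
    means = np.stack([b.mean(axis=0) for b in blocks])
    g_hat = np.median(means, axis=0)                 # coordinate-wise median
    return theta - step * g_hat
```

Note that the robust sub-routine runs in d dimensions on every iteration, which is the source of the extra computational cost discussed above.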
Considering the above limitations of existing learning algorithms, and in particular the lack of scalability and the weak formal guarantees available for RGD procedures when strong convexity is not available, here we study a different algorithmic approach to the problem. Our approach has equal generality, and the hope is to achieve as-good or better dependence on the sample size, confidence level, and dimension under potentially heavy-tailed data (losses and/or partial derivatives), without strong convexity, and in provably less time for larger problems. The main technique that we investigate is a natural robustification of classical confidence boosting (2), applied to traditional stochastic gradient descent routines run in parallel, though we note that the basic argument can easily be generalized to other optimization strategies (e.g., accelerated methods, adaptive methods, quasi-Newton techniques, etc.). Our main contributions:
The archetype procedure given in Algorithm 1 is concrete, easy to implement as-is, and trivial to run in parallel, meaning that all else equal, for high-dimensional learning tasks we can expect to obtain a result as good or better than existing serial RGD methods in far less time.
Empirically, we study the robustness of our proposed procedure and relevant competitors to various perturbations in the experimental conditions, simulating a lack of prior knowledge about noise levels and convexity. We see that the proposed procedure achieves performance competitive with benchmark RGD methods using far fewer operations, in less time, and that this holds over a variety of test settings. Furthermore, we show that the proposed procedure cannot simply be replaced with a naive cross-validation heuristic.
3 Theoretical analysis
First we establish some basic notation, and organize the numerous technical assumptions in one place for ease of reference. For any positive integer, we write the corresponding set of indices, and for any subset of indices, we write the data points so indexed; analogous notation is used for an independent copy of the sample. To keep the notation simple, when the full index set is used, we suppress the subset from the notation. We use a generic symbol to denote probability; in most cases this will be the product measure induced by the sample or its independent copy. For any convex function, we denote its sub-differential evaluated at a given point. The variance of the loss is denoted separately for each candidate. Finally, we denote the indicator function of an event, returning a value of 1 when the event is true, and 0 otherwise.
The two key running assumptions that we make are related to independence and convexity. First, we assume that all the observed data are independent, i.e., the random variables in both samples, taken over all indices, are independent copies of the underlying data. Second, for each data point, we assume the loss map is a real-valued convex function over the parameter set, which is assumed non-empty, convex, and compact, with finite diameter. All results derived in the next sub-section hold for an arbitrary choice of distribution satisfying (1) with finite variance. Finally, to make formal statements technically simpler, we assume that the risk achieves its minimum on the interior of the parameter set.
3.2 Error bounds when both losses and gradients can be heavy-tailed
Recalling the challenges described in section 2, we consider a straightforward robustification of the classical validation-based approach using robust mean estimators in a sub-routine. The full procedure is summarized in Algorithm 1. First we outline the key points of the procedure, then give a general-purpose excess risk bound in Theorem 1.
Viewed at a high level, Algorithm 1 is comprised of three extremely simple steps: partition, train, and validate. For our purposes, the key to improving on traditional ERM-style boosting techniques is to ensure the validation step is done with sufficient precision, even when the losses can be heavy-tailed. To achieve this, we shall require that there exists a constant, not depending on the distribution, such that for any choice of confidence level and large enough sample size, the sub-routine satisfies
Recall the variance notation introduced above; the only requirement on the class of data distributions is thus finite variance, readily allowing for both heavy-tailed losses and gradients. The training step can be done in any number of ways; for concreteness and clarity of the results, we elect to use a simple stochastic gradient descent sub-process. Unpacking the notation from Algorithm 1, the basic update used is traditional projected (sub-)gradient descent, with update denoted by
Here we have a step-size parameter and projection onto the parameter set with respect to the Euclidean norm, and the standard assumption is that the random update direction is, in expectation, a sub-gradient of the risk at the current iterate. Then, for an arbitrary data sequence, we define
noting that we have suppressed the step-sizes from the notation for readability. Replacing the generic sequence here with the data indexed by each subset yields the iterate sequences used in Algorithm 1, with each candidate being simply the arithmetic (vector) mean of its iterates. As all the data we work with are independent copies of the underlying data, the order in which we take the indices from each subset does not matter. Under weak assumptions on the underlying loss distribution, the output of this algorithm enjoys strong excess risk bounds, as the following theorem shows.
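Putting the partition, train, and validate steps together, the archetype procedure can be sketched as follows. This is an illustration of Algorithm 1's structure under an interface of our choosing (the names `grad_fn`, `loss_fn`, and `robust_mean` are ours), not a verbatim transcription; `robust_mean` stands for any sub-routine satisfying the deviation requirement (3):

```python
import numpy as np

def rv_sgd(Z_train, Z_val, grad_fn, loss_fn, k, theta0, step, robust_mean):
    """Partition/train/validate archetype: run averaged SGD on each of k
    disjoint chunks of the training data, then return the candidate whose
    robust risk estimate on the validation data is smallest."""
    candidates = []
    for chunk in np.array_split(Z_train, k):         # partition
        theta, avg = np.asarray(theta0, float).copy(), 0.0
        for t, z in enumerate(chunk, start=1):       # train: one SGD pass
            theta = theta - step * grad_fn(theta, z)
            avg = avg + (theta - avg) / t            # running iterate average
        candidates.append(avg)
    scores = [robust_mean(np.array([loss_fn(th, z) for z in Z_val]))
              for th in candidates]                  # validate robustly
    return candidates[int(np.argmin(scores))]
```

The k training loops touch disjoint data and can run on independent cores; only the final scoring step needs the validation sample.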
Here we give an overview of the proof of Theorem 1. We have data sequences and . The former is used to obtain independent candidates , and the latter is used to select among these candidates. As mentioned earlier, distance-based strategies require that the majority of these candidates are -good, in order to ensure that points near the majority coincide with -good points. In our present setup, where strong convexity is not available, we are taking a very different approach. Now we only require that at least one of the candidates is -good. Making this explicit,
is our first event of interest. Note that if for each sub-process we have an upper bound, depending on the per-subset sample size, on the expected excess risk of the sub-process output, such that
where the expectation is taken over the subset of data indexed by the corresponding sub-process, then using Markov's inequality and taking a union bound, a suitable setting of the error level ensures this first event holds with high probability. Asking for one ε-good candidate is a much weaker requirement than asking for the majority to be ε-good, but we must pay the price in a different form, as we require that the validation sub-routine provide a good estimate of the true risk for all of the candidates. In particular, writing a second error level for a confidence interval to be specified shortly, this is the following event:
Intuitively, while we only require that at least one of the candidates be good, we must reliably know which is best at the available precision, which requires paying the price of the intersection defining this second event. Recalling the requirement (3), if we condition on the training sample, the candidates become non-random elements of the parameter set, and a union bound over the candidates controls the conditional probability of this event. The resulting inequality holds as-is for any realization of the training sample, so we can integrate to obtain
The good event of interest then has probability
On this good event, we know that there does exist an ε-good candidate, even though we can never know which one it is. Furthermore, even though this candidate is unknown, since we have sharp risk estimates for all candidates, the candidate chosen to minimize the estimated risk cannot be much worse. More precisely, we have
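In illustrative notation, with R the risk, R̂ its robust validation estimate, ε the candidate error level, and ε′ the validation precision (our reconstruction, not necessarily the paper's exact symbols), the selected candidate satisfies, on the good event:

```latex
% First and third inequalities: every risk estimate is within eps' of the
% truth; second: the selected index k-hat minimizes the estimated risk;
% fourth: theta-star is an eps-good candidate.
R(\hat{\theta}_{\widehat{k}})
  \;\le\; \widehat{R}(\hat{\theta}_{\widehat{k}}) + \varepsilon'
  \;\le\; \widehat{R}(\hat{\theta}_{\star}) + \varepsilon'
  \;\le\; R(\hat{\theta}_{\star}) + 2\varepsilon'
  \;\le\; R^{\ast} + \varepsilon + 2\varepsilon' .
```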
We have effectively proved the following lemma.
Lemma 2 (Boosting the confidence under potentially heavy tails).
Assume we have a learning algorithm such that for and , we have
Splitting the data using sub-indices , if we set
then when satisfies (3), it follows that for any , we have
with probability no less than .
Note that Lemma 2 here makes no direct requirements on the underlying loss or risk, beyond the need for a variance bound, which appears as a variance term in the statement of Theorem 1. Indeed, convexity does not even make an appearance. This is in stark contrast with distance-based confidence boosting methods, which rely upon the strong convexity of the risk [22, 16, 12]. As such, so long as we can validate in the sense of (3), Lemma 2 gives us a general-purpose tool from which we can construct algorithms with competitive risk bounds under potentially heavy-tailed data. The following lemma shows that validation can indeed be done in the desired way, using straightforward computational procedures.
With these key facts in place, it is straightforward to prove Theorem 1.
Proof of Theorem 1.
Since most key facts have already been laid out, we just need to fill in a few blanks and connect them. To begin, consider the expected excess risk bound required by Lemma 2. The sub-process used in Algorithm 1 is the simplest form of averaged stochastic gradient descent. Given our assumptions, we are doing averaged SGD under a smooth risk, without assuming strong convexity or a Lipschitz loss; using the step-sizes specified in the hypothesis, a standard argument gives us
where the constant is as given in the theorem statement. See Theorem 4 in the appendix for a proof of the more general result that implies (7). This bound applies to each sub-process, identifying each subset with an independent sample of the appropriate size, and the output of Algorithm 1 corresponds to the validated candidate. Thus, leveraging Lemma 2 and (7), and bounding the variance term, we have for any confidence level that
with probability no less than the stated level. Note that we assume the number of subsets divides the sample size for simplicity, and use separate notation to distinguish the per-subset sample size from the full sample size in the theorem statement. It just remains to clean up this probability and specify the number of subsets. To do this, given the confidence level in the theorem statement, first fix the validation error level. Next, set the number of subsets to be
and note that with this setting of and , we have that
The inequalities follow readily from elementary facts about the ceiling function and the exponential function. As for the exponential term, note that
It thus immediately follows that the good event of (8) holds with probability no less than
To conclude, since the available observations are split in half between training and validation, we must replace the sample size accordingly in (8). Bounding the leading coefficient for simplicity yields the desired result. ∎
3.3 Comparison of error bounds
Recall that in the introduction, we highlighted the properties of transparency, strength, and scalability as being important to close the gap between formal guarantees and the performance achieved by the methods we actually code. As mentioned in the literature review in section 2, the robust gradient descent algorithms cited are noteworthy in that they have strong (albeit slightly sub-optimal) guarantees for a wide class of distributions, for procedures which can be implemented essentially as stated in the cited papers, making the guarantees very transparent. Unfortunately, the best results are essentially limited to problems in which the risk is strongly convex; all of the cited papers make extensive use of this property in their analysis [6, 13, 26]. If one is lucky enough to have strong convexity (and smoothness), then for any step, one has
The only difference between the left-hand side of this inequality and the general-purpose robust GD update studied in the literature is that the true risk gradient is replaced with some estimator. As such, one can easily control the distance to the minimizer using an upper bound that depends on the right-hand side of the above inequality and the statistical estimation error of the gradient. After a given number of iterations, one can then readily unfold the recursion and obtain final error bounds given as a sum of an optimization error term depending on the number of iterations, and a statistical error term depending on the sample size (e.g., Chen et al. [5, Thm. 5], Prasad et al. [26, Sec. 7], Holland and Ikeda [14, Thm. 5]).
Error bounds without strong convexity
On the other hand, when one does not have strong convexity, such a technique fails, and one is left having to compare two sequences: the actual robust GD iterates, and the ideal sequence of gradient descent using the true risk gradient, assuming both sequences are initialized at the same point. This point is discussed with analysis by Holland and Ikeda [15, Sec. A.3]. (Their original bounds involve a factor given by an upper bound on the variance of the partial derivatives of the loss taken over all coordinates; one can easily strengthen their bounds by replacing this coordinate-wise bound with the overall gradient variance. Analogous analysis can extend the results of Chen et al. to the weak convexity case as well.) One can still unfold the recursion without much difficulty, but the propagation of the statistical error becomes much more severe. In the simple case of a fixed step-size, ignoring non-dominant terms, under the same technical assumptions used in our theoretical analysis, the RGD procedures can only obtain high-probability bounds of the form
Note that the exponential dependence on the number of iterations makes the maximum number of iterations one can guarantee extremely sensitive to the smoothness parameter and the step size.
In contrast, under the same assumptions, Theorem 1 for our Algorithm 1 has no such sensitivity; it achieves the same dependence on the key problem parameters with just one pass over the data. Furthermore, there is no explicit dependence on the number of parameters, a dimension factor is removed, and the dependence on the confidence level is improved exponentially. Since typical RGD procedures never use loss values, the only moment bound they require concerns the gradients, whereas our procedure requires finite variance of both the losses and the gradients. Other minor tradeoffs exist in the form of an extra logarithmic factor, and the dependence on the diameter in our bounds is linear, whereas the works of Chen et al. and Holland and Ikeda have logarithmic dependence. Arguably, this is a small price to pay for the improvements afforded. These results are shown in the second column of Table 1.
Due to the ease of distributed computation and the simplicity of the underlying sub-routines, Algorithm 1 has significant potential to improve upon existing robust GD methods in terms of computational scalability. Using arithmetic operations as a rough estimate of time complexity, first note that for each subset in Algorithm 1, a fixed number of arithmetic operations must be done per coordinate and iteration. Thus each candidate can be obtained with cost linear in the dimension and the subset size, and this is done for each subset; these computations can trivially be done on independent cores running in parallel. It then just remains to make a single call to the validation sub-routine to conclude the procedure. In this final call, one evaluates all candidates at the validation data points, plus the cost of the final robust estimate, given respectively for the cases described in Lemma 3. Adding these costs up, ignoring logarithmic factors, and simplifying the number of subsets yields the cost shown in the third column of Table 1.
In contrast, RGD by median-of-means [6, 26] requires a fixed cost per subset mean, and again assuming the computations for each subset are done across cores in parallel, the merging step, which computes a geometric median, must be repeated once per iteration, and this dominates the total cost. Regarding this merging step, the geometric median is a convex program and can be solved efficiently to arbitrary accuracy; Cohen et al. give an implementation whose objective is accurate to any desired relative precision, with near-linear time complexity. For RGD by M-estimation, note that the M-estimate can be computed readily using a fixed-point update, and in practice the number of fixed-point iterations is small and fixed independently of the sample size and dimension, so a fixed per-coordinate cost is incurred at each of the gradient descent steps. Assuming a standard empirical estimate of the per-coordinate variance is plugged in, this requires additional arithmetic operations of the same order. Adding these costs up yields the estimate shown in Table 1.
4 Empirical analysis
To study how the theoretical insights obtained in the previous section play out in practice, we carried out a series of tightly controlled numerical tests. The basic experimental design strategy that we employ is to calibrate all the methods (learning algorithms) of interest to achieve good performance under a particular learning setup, and then we systematically modify characteristics of the learning tasks, leaving the methods fixed, to observe how performance changes in both an absolute and relative sense. Viewed from a high level, the main points we address can be categorized as follows:
How do error trajectories of baseline methods change via robust validation?
How does relative performance change in high dimensions without strong convexity?
How do actual computation times compare as and/or grow?
Can robust validation be replaced by cross-validation?
We proceed by giving additional details on our experimental setting, before taking up the key experiments just listed one at a time.
Our basic setup is precisely the noisy convex minimization simulation of Holland [12, Sec. 4], with new modifications made here to control the degree of strong convexity, among other experimental parameters. Details are in the cited reference, but essentially we construct the data-generating distribution such that the risk takes a quadratic form that we can control, with curvature matrix, linear term, and constant offset determined by the design of the distribution. In particular, with this design it is easy to allow both the losses and partial derivatives to be heavy-tailed, while still satisfying the key technical assumptions of Theorem 1, namely a smooth risk and gradients with bounded variance. Furthermore, since we are interested in the case where strong convexity may not hold, this experimental design puts the strong convexity parameter of the risk at our control, allowing us to construct many flat directions and observe algorithm performance as the strong convexity parameter vanishes. All tests and methods are implemented using Python (ver. 3.8), chiefly relying upon the numpy library (ver. 1.18).
For clarity of results, we limit our comparisons to two main families of distributions, namely those in which the loss contains a Normal noise term, and those in which it contains a log-Normal noise term. In all cases, this noise is centered, and controlled to have nearly equal signal/noise ratios, where the noise level is measured by the width of the interquartile range of the additive noise term; we follow the cited reference precisely.
With respect to the different methods being studied, we use a mixture of classical baselines and modern alternatives to compare with our Algorithm 1, denoted RV-SGDAve, with the validation step carried out using a Catoni-type M-estimator. For baselines, we do empirical risk minimization using batch gradient descent (denoted ERM-GD) and stochastic gradient descent, both without and with averaging (denoted SGD and SGD-Ave, respectively). Several representative robust gradient descent methods discussed in section 2 are implemented here, including RGD via median-of-means [6, 26] (denoted RGD-MoM), median-of-means minimization by gradient descent (denoted RGD-Lec), and RGD via M-estimation (denoted RGD-M). Detailed settings of each method, including random initialization, follow exactly the procedures given in the cited reference.
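For reference, a Catoni-type M-estimator of the mean can be implemented with a simple fixed-point iteration. The following is a sketch under our own choices of influence function form, scale handling, and stopping rule, which may differ from the exact implementation used in the experiments:

```python
import numpy as np

def catoni_mean(x, scale, iters=20):
    """Catoni-type M-estimator of the mean via fixed-point iteration.

    psi is a Catoni-style soft-truncation influence function; `scale`
    should grow roughly like sqrt(n * variance / log(1/delta)) so that
    the estimate enjoys sub-Gaussian deviations under finite variance.
    """
    def psi(u):
        # odd, logarithmically-growing influence function
        return np.sign(u) * np.log1p(np.abs(u) + 0.5 * u * u)
    x = np.asarray(x, dtype=float)
    theta = float(np.median(x))          # robust starting point
    for _ in range(iters):
        theta = theta + scale * np.mean(psi((x - theta) / scale))
    return float(theta)
```

Because psi grows only logarithmically, a single enormous observation shifts the estimate by a bounded amount, unlike the sample mean.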
By design, the risk in these experiments can be computed analytically, and our chief performance metric of interest is the excess risk, computed as the gap between the risk of the output of any learning algorithm being studied and the minimal risk, which we set in advance. Each experimental setting is characterized by the data distribution, the sample size, and the number of parameters to determine. In particular, the data distribution has a direct impact on both the sub-Gaussian/heavy-tailed nature of the losses and gradients, as well as the strength of the convexity of the risk. We modify all these parameters in various ways, and for each setting, we run multiple independent trials; the results given are based on statistics computed over these trials. For example, the mean and standard deviation of the excess risk (denoted ave and sd), as well as all box-plots, are based on multiple independent trials. Exact trial numbers for each experiment are given in the following exposition.
(E1) How do error trajectories of baseline methods change via robust validation?
Before we look at the impact of weak convexity, we begin with a preliminary investigation of the basic workings of the robust validation procedure of interest. We run independent trials for both the Normal and log-Normal settings described previously, with a strongly convex risk. Here we let all methods run with a fixed “budget,” where the cost is measured in gradient computations, i.e., the cost is incremented by one each time a single-point gradient is computed at any candidate. Naturally, this means Algorithm 1 will be run for multiple passes over the data, so the behavior after the first pass takes us, strictly speaking, beyond the scope of Theorem 1; this is a natural point of empirical interest. In Figures 4–4, we show how the baseline stochastic methods change when passed through a robust validation procedure such as the one used in our Algorithm 1. Here RV-SGDAve is precisely Algorithm 1, while RV-SGD denotes the same procedure without averaging the SGD sub-processes. It is natural to choose RV-SGDAve as a representative, and in Figure 4 we compare just RV-SGDAve against the modern RGD methods.
(E2) How does relative performance change in high dimensions without strong convexity?
Next we look at how the competing learning algorithms perform as the number of parameters to determine increases, with the risk having very weak convexity in many directions. More precisely, the curvature matrix is diagonal, and half of its diagonal elements are very small, implying a tiny upper bound on the strong convexity parameter of the risk. Under this setting, we look at how increasing the dimension over a wide range, with fixed sample size, impacts algorithm performance. We run independent trials, and for each trial record the performance achieved by each method once it has spent its budget, again measured in gradient computations. Batch methods are given a large budget. In contrast with the previous experiments, here we only let RV-SGDAve (Algorithm 1) take one pass over the data for initialization, and one pass for learning. This aligns more precisely with the setting of Theorem 1. Noise distribution settings are as previously introduced. In Figure 5, we give box-plots of the final excess risk achieved by each method for different dimensions.
(E3) How do actual computation times compare as and/or grow?
While we have given a brief discussion of the computational complexity of the different algorithms in section 3.3, it is assuredly worthwhile to actually measure how much time is needed to achieve the performance realized in the results of (E2) above. In particular, we are interested in whether even a very simple parallel implementation of the SGD sub-processes used in RV-SGDAve can lead to substantial time savings. We look at two types of tests. First, the sample size and dimension grow together; second, the sample size is fixed, and the dimension ranges as in (E2). Budget constraints used for stopping rules are exactly as described in (E2). We run independent trials and compute the median times for each method. We remark that in comparing the log-Normal versus Normal cases, there is virtually no difference between the computation times for any method; thus, to save space, we show times only for the log-Normal case. These median times for both experimental settings are shown in Figure 6.
(E4) Can robust validation be replaced by cross-validation?
Finally, it is natural to ask whether the procedure of Algorithm 1 could be replaced by a heuristic cross-validation procedure that uses all the data for learning, doubling the effective sample size available to each sub-process. More precisely, instead of splitting the -sized sample into and as done by RV-SGD (Algorithm 1), suppose we use the full -sized sample , partition it into subsets , and obtain independent candidates , now with double the sample size compared with RV-SGD. One might then be intuitively inclined to do a cross-validation type of selection, where for each , the validation score returned by is computed for each using the data indexed by , and the winning index is selected as the minimizer of this cross-validation error. Such heuristics break the assumptions used in the theoretical analysis of Algorithm 1, and it is interesting to see how this plays out in practice. We have therefore re-implemented both RV-SGD and RV-SGDAve in this fashion, denoted RV-SGD-CV and RV-SGDAve-CV respectively. Error trajectories for these methods, under the same experimental setting as (E1), are compared in Figure 7.
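The cross-validation style of selection described above can be sketched as follows. All names (`cv_select`, `losses`, `parts`) are hypothetical, and the exact held-out indexing used in the paper's re-implementation is not reproduced; the sketch scores each candidate on the union of the subsets it was not trained on.

```python
import numpy as np

def cv_select(candidates, losses, data, parts):
    """Cross-validation style selection: candidate k, trained on
    subset `parts[k]`, is scored on all the *other* subsets, and the
    index with the smallest held-out mean loss wins.
    `losses(w, X)` returns per-point losses on the array X."""
    k = len(candidates)
    scores = []
    for i, w in enumerate(candidates):
        held_out = np.concatenate([data[parts[j]] for j in range(k) if j != i])
        scores.append(losses(w, held_out).mean())
    return int(np.argmin(scores))
```

The point of contrast with Algorithm 1 is the scoring rule: a plain empirical mean of held-out losses is sensitive to heavy-tailed losses, whereas the robust validation sub-routine is designed to tolerate them.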
Discussion of results
From the initial proof-of-concept tests with results given in Figure 4, we see that even very noisy sub-processes can be ironed out easily using the simple robust validation sub-routine included in Algorithm 1. Even when running the algorithm for much longer than a single pass over the data, risk comparable to that of the benchmark RGD methods can be realized at a much smaller cost, with comparable variance across trials, and this holds under both sub-Gaussian and heavy-tailed data, without any modification to the procedure being run. A particularly clear improvement in the cost-performance tradeoff is evident from Figures 5–6, since near-identical performance can be achieved at a small fraction of the computational cost. Note that under Normal noise, running Algorithm 1 for just a single pass leaves room for improvement performance-wise, but as we saw in the low-dimensional case, in practice this can be remedied by taking additional passes over the data. Finally, regarding the question of whether or not Algorithm 1 can be replaced by a cross-validation heuristic, the answer is clear (Figure 7): while the results are comparable under well-behaved data (the Normal noise case here), when heavy tails are a possibility (e.g., the log-Normal case), the naive cross-validation method fails to come even near the performance of Algorithm 1.
5 Future directions
The main take-away from this initial study is that even without strong convexity, under potentially heavy-tailed losses and/or gradients, there exists a computationally efficient procedure which improves upon the formal performance guarantees of modern robust GD techniques, is very competitive in practice, and scales much better to large tasks with many parameters. The basic archetype for such a procedure was illustrated using the concrete Algorithm 1, but naturally this can be extended in many directions, to deal with accelerated, adaptive, or variance-reducing sub-processes, under more general geometries and more challenging constraints applied to . In particular, if one considers a stochastic mirror descent type of generalization of the proposed algorithm, it would be interesting to compare the robust validation approach taken here with, say, the truncation-based approach studied recently by Juditsky et al. , and to examine how the performance of the respective methods changes under different constraints on prior knowledge of the underlying data-generating distribution.
Appendix A Technical appendix
A.1 Additional proofs for Section 3
Proof of Lemma 3.
In the proof of Theorem 1, one key underlying result we rely on has to do with convergence rates of averaged SGD for smooth objectives. Recall that assuming is differentiable, we say that is -smooth in norm if its gradients are -Lipschitz continuous in the same norm, that is
for all . Nesterov [25, Thm. 2.1.5] gives many useful characterizations of -smoothness. The fact cited directly in the main text is summarized in the following theorem; it can be extracted readily from well-known properties of (stochastic) mirror descent, a family of algorithms dating back to Nemirovsky and Yudin .
Theorem 4 (Convex and smooth case; averaged).
Let be -smooth in the norm. Furthermore, assume that for all . Run (4) with step size for iterations, setting , and take the average as
We then have with probability no less than that
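As a concrete point of reference, the averaged procedure of Theorem 4 amounts to plain SGD followed by iterate averaging. The sketch below uses a uniform running average over all iterates as one common choice; the step-size and averaging window prescribed by the theorem involve constants not reproduced here, so all settings shown are illustrative.

```python
import numpy as np

def averaged_sgd(grad_sample, w0, n_iter, step, rng):
    """SGD with iterate averaging (Polyak-Ruppert style).
    `grad_sample(w, rng)` returns a stochastic gradient at w."""
    w = np.asarray(w0, dtype=float)
    avg = np.zeros_like(w)
    for t in range(n_iter):
        w = w - step * grad_sample(w, rng)
        avg += (w - avg) / (t + 1)  # running uniform average of iterates
    return avg
```

On a smooth convex objective, the average typically has much smaller variance than the final iterate, which is what the high-probability bound in the theorem exploits.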
Proof of Theorem 4.
To begin, we establish some extra terms and notation related to mirror descent. For any differentiable convex function , define the Bregman divergence induced by as
In mirror descent, one utilizes Bregman divergences of a particular class of convex functions, often called “mirror maps.” Let be an open convex set including within its closure, and let . We denote arbitrary mirror maps on by . Strictly speaking, to call a mirror map on , it suffices that is strictly convex and differentiable, and that its gradient takes on all values (i.e., ) and diverges on the boundary of ; see Bubeck [3, Sec. 4] and the references therein for more details. Bregman divergences induced by mirror maps, namely , play an important role in mirror descent when constructing a projection map that moves between the primal space and the (dual) space where we can leverage gradient information. The generic mirror descent procedure is as follows. Initializing at an arbitrary , we update as
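As one concrete instance of this generic update, consider the negative-entropy mirror map on the probability simplex, for which the update has a closed form (a multiplicative-weights rule). This is a minimal sketch of that special case, not the general formulation used in the proof.

```python
import numpy as np

def md_step_simplex(w, g, step):
    """One mirror descent step on the probability simplex under the
    negative-entropy mirror map: a multiplicative gradient step in
    the dual space, followed by the Bregman projection (here, simple
    normalization) back onto the simplex."""
    u = w * np.exp(-step * g)  # gradient step mapped back to the primal
    return u / u.sum()         # Bregman projection onto the simplex
```

The role of the mirror map is visible here: the gradient acts multiplicatively rather than additively, and the projection induced by the Bregman divergence reduces to normalization.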
Assume for all , and that is -smooth in norm . Write