In this work, we study a new approach to optimizing the margin distribution realized by binary classifiers. The classical approach to this problem is simply maximization of the expected margin, while more recent proposals consider simultaneous variance control and proxy objectives based on robust location estimates, in the vein of keeping the margin distribution sharply concentrated in a desirable region. While conceptually appealing, these new approaches are often computationally unwieldy, and theoretical guarantees are limited. Given this context, we propose an algorithm which searches the hypothesis space in such a way that a pre-set "margin level" ends up being a distribution-robust estimator of the margin location. This procedure is easily implemented using gradient descent, and admits finite-sample bounds on the excess risk under unbounded inputs. Empirical tests on real-world benchmark data reinforce the basic principles highlighted by the theory, and are suggestive of a promising new technique for classification.
Machine learning systems depend on both statistical inference procedures and efficient implementations of these procedures. These issues are reflected clearly within a risk minimization framework: given a known loss $\ell(w; z)$ depending on data $z$ and parameters $w$, the ultimate objective is minimization of the risk $R(w) := \mathbb{E}\,\ell(w; z)$, where expectation is taken with respect to the data. Since the underlying distribution is unknown, the learner seeks to determine a candidate $\widehat{w}$ based on a limited sample $z_1, \ldots, z_n$ such that the excess risk $R(\widehat{w}) - \inf_{w} R(w)$ is sufficiently small, with high probability over the random draw of the sample. Inference is important because the distribution is always unknown, and the implementation is important because the only $\widehat{w}$ we ever have in practice is one that can be computed given finite time, memory, and processing power.
Our problem of interest is binary classification, where $z = (x, y)$ with inputs $x \in \mathbb{R}^{k}$ and labels $y \in \{-1, +1\}$. Parameter $w$ shall determine a scoring rule $x \mapsto \langle w, x \rangle$, where $\langle w, x \rangle \geq 0$ implies a prediction of $+1$, and $\langle w, x \rangle < 0$ implies a prediction of $-1$. The classification margin achieved by such a candidate is $y \langle w, x \rangle$, and the importance of the margin in terms of evaluating algorithm performance has been recognized for many years [1, 13]. The work of Koltchinskii and Panchenko provides risk bounds that depend on the empirical mean of the margin, providing useful generalization bounds for existing procedures whose on-sample margin error can be controlled. Intuitively, one might expect that having larger margins on average would lead to better off-sample generalization. However, influential work by Breiman showed that the problem is not so simple, demonstrating cases in which the margins achieved are higher, but generalization is worse. In response to this, Reyzin and Schapire make the important suggestion that it is not merely the location of the margins, but properties of the entire margin distribution that are important to generalization.
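To make the object of study concrete, the margins of a linear scorer are easy to compute; the following minimal numpy sketch (illustrative names, not from the paper) evaluates the empirical margin distribution of a candidate $w$:

```python
import numpy as np

def margins(w, X, y):
    """Empirical margins y_i * <w, x_i> of the linear scorer x -> <w, x>.

    A positive margin means the example is classified correctly; its
    magnitude measures the confidence of that prediction.
    """
    return y * (X @ w)

# Toy example: a scorer aligned with the labeling rule yields
# non-negative margins on every example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = np.array([1.0, 0.0, 0.0])
y = np.sign(X @ w)
m = margins(w, X, y)
```

Summaries of this empirical distribution (mean, variance, quantiles) are exactly the quantities that the procedures discussed below attempt to control.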
New algorithms based on trying to control the empirical margin distribution, albeit indirectly, were proposed early on by Garg and Roth, who suggested a strategy of optimizing the random projection error, namely the probability $P\{y \langle \widetilde{w}, \widetilde{x} \rangle \leq 0\}$ that the margin sign flips under projection, where $\widetilde{w}$ and $\widetilde{x}$ are respectively random projections of $w$ and $x$ from $k$-dimensional Euclidean space to a $d$-dimensional subspace, with $d < k$. The bounds are lucid and are suggestive of practical objective functions, but their analysis requires that the inputs be bounded, namely that they are distributed on the unit ball, $\|x\| \leq 1$. More recent work from Zhang and Zhou
suggests an objective which simultaneously maximizes the mean while minimizing the variance of the empirical margin distribution. Their routines are computationally tractable, but hyperparameter settings are non-trivial, and their risk bounds (in expectation) depend on the expected outcome of a leave-one-out cross-validation procedure, which is not characterized using interpretable quantities, reducing the utility of the bounds.
Another natural algorithmic strategy is to construct loss functions using more “robust” estimators of the true expected margin $\mathbb{E}\,y \langle w, x \rangle$, or related quantities such as the expected hinge loss. In this regard the work of Brownlees et al. is highly relevant, in that sharp, descriptive risk bounds can be obtained for a wide class of learning algorithms, indeed any minimizer of such a robust loss. The practical downside is that computation is highly non-trivial, and no concrete procedures are proposed. The formal downside is that once again the inputs must be bounded for meaningful guarantees.
To deal with the limitations of existing procedures highlighted above, the key idea here is to introduce a new convex loss that encourages the distribution of the margin to be tightly concentrated near a certain prescribed level. The procedure is easily implemented using gradient descent, admits formal performance guarantees reflecting both computational cost and optimization error, and aside from the usual cost of gradient computation there is virtually no computational overhead. Two key highlights are:
The proposed algorithm enjoys high-probability risk bounds under moment bounds on the underlying distribution, and does not require the inputs $x$ to be bounded.
Numerical experiments show how a simple data-dependent re-scaling procedure can reduce the need for trial-and-error tuning of regularization.
In this section we begin by introducing relevant algorithms from the literature, after which we introduce our proposed procedure.
Here we review the technical literature closely related to our work. Starting with the proposal of Garg and Roth, their main theoretical result is a bound on the misclassification risk, valid for any candidate and confidence level. Assuming that $\|x\| \leq 1$, and given $n$ observations, with probability no less than $1 - \delta$, we have
where , and the term takes the form
The projection error terms are derived from the fact that
where $A$ is a random matrix of independent Gaussian random variables. Probability here is over the random draw of the matrix elements. Based on these guarantees, they construct a new loss, defined by
where $I_{+}$ and $I_{-}$ are respectively the indices of correctly and incorrectly classified observations. For correctly classified examples, they seek to minimize the projection error bound, whereas for incorrectly classified examples, they use a standard exponential surrogate loss. Depending on what minimizes their upper bound, the dependence on the number of parameters may be better than that of standard bounds, but a price is paid in the form of dependence on the confidence level. On the computational side, proper setting of the hyperparameters in practice is non-trivial.
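The projection error at the heart of this approach can be estimated directly by Monte Carlo; the sketch below (a hypothetical helper of our own, with Gaussian projections scaled so the projected inner product is unbiased) estimates, for a single example, the probability that projecting both $w$ and $x$ to $d$ dimensions flips the sign of the margin:

```python
import numpy as np

def projection_flip_prob(w, x, y, d, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that the projected
    margin y * <Aw, Ax> is non-positive, where A is a d-by-k matrix
    of iid N(0, 1/d) entries (so E<Aw, Ax> = <w, x>)."""
    rng = np.random.default_rng(seed)
    k = len(w)
    flips = 0
    for _ in range(trials):
        A = rng.normal(scale=1.0 / np.sqrt(d), size=(d, k))
        if y * np.dot(A @ w, A @ x) <= 0:
            flips += 1
    return flips / trials

# A confidently classified example (large margin relative to the
# norms of w and x) should survive projection almost every time.
w = np.array([1.0, 0.0])
x = np.array([0.8, 0.6])
p_easy = projection_flip_prob(w, x, y=1.0, d=20)
```

This makes the intuition behind their loss tangible: the larger the normalized margin, the smaller the chance a random projection flips the predicted label.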
The work of Zhang and Zhou  considers using first- and second-order moments of the margin distribution as relevant quantities to build an objective. Writing
in the case of , they construct a loss
where the weights are parameters to be set manually. The authors show how the optimization can readily be cast into a dual program of the form
for an appropriate data-dependent matrix, vector, and weight bounds, and they give some examples of practical implementations using dual coordinate descent and variance-reduced stochastic gradient descent. In all cases, parameter settings are left up to the user. Furthermore, the statistical guarantees leave something to be desired; the authors prove that for any candidate satisfying their dual objective, risk bounds hold as
where expectation is taken with respect to the sample, are the diagonal elements of , and the index sets are defined
These bounds provide limited insight into how and when the algorithm performs well, and in practice the algorithm requires substantial effort for model selection.
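The first/second-moment idea behind this family of methods can be written down in a few lines; the following is a schematic sketch only (the weights `lam1` and `lam2` are illustrative stand-ins for the authors' parameters, and this is not their dual solver):

```python
import numpy as np

def margin_mean_var_loss(w, X, y, lam1=1.0, lam2=1.0):
    """Schematic margin-distribution objective: reward a large mean
    margin, penalize margin variance. Minimizing this pushes the
    empirical margin distribution up while keeping it concentrated."""
    m = y * (X @ w)
    return -lam1 * m.mean() + lam2 * m.var()

# Sanity check: identical margins incur no variance penalty.
X = np.array([[1.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0])
w = np.array([2.0, 0.0])
loss = margin_mean_var_loss(w, X, y)
```

Even in this toy form, the model-selection burden is visible: the behavior of the minimizer depends entirely on how the two weights are balanced, which is precisely the tuning issue raised above.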
Finally, we consider the path-breaking analysis of Brownlees et al., which greatly extends foundational work done by Catoni. Letting $\varphi$ denote the hinge loss, the Catoni estimator of the true location of the margin-based loss at candidate $w$, namely $\mathbb{E}\,\varphi(y \langle w, x \rangle)$, is defined as the solution $\widehat{\theta}$ of
$$\sum_{i=1}^{n} \psi\!\left(\frac{\varphi(y_i \langle w, x_i \rangle) - \widehat{\theta}}{s}\right) = 0,$$
where $s > 0$ is a scaling parameter, and $\psi$ is a non-decreasing soft truncation function (see Figure 1), which can be taken to satisfy $-\log(1 - u + u^2/2) \leq \psi(u) \leq \log(1 + u + u^2/2)$.
The general analysis of Brownlees et al. provides a rich set of tools for obtaining risk bounds for any minimizer of this new robust objective function, namely bounds on the excess risk of any $\widehat{w}$ satisfying
and $\mathcal{W}$ denotes the hypothesis space our candidate lives in. Note that the 1-Lipschitz continuity of the hinge loss gives us that for any candidates $w$ and $w'$,
which means we can bound distances defined on the space by distances on the space . Going back to the linear model case of , bounds in the distance can be constructed using
and bounds in the distance take the form
Now, using their results, for large enough and , one can show that with probability no less than , it holds that
where is a universal constant, and and are complexity terms. When these terms can be bounded, we can use the fact that the hinge loss is “classification calibrated,” and using standard results from Bartlett et al. , can obtain bounds on the excess misclassification risk based on the above inequality. The problem naturally is how to control these complexity terms. Skipping over some technical details, these terms can be bounded using covering number integrals dependent on . As a concrete example, we have
where the covering number is the number of $\epsilon$-balls in the given metric needed to cover the hypothesis class. In the case of linear classifiers, this means the inputs must be almost surely bounded in order for the distances involved to be finite and the upper bounds to be meaningful. Under such assumptions, say $x$ comes from the unit ball, with $\|x\| \leq 1$ almost surely. Then ignoring non-dominant terms, the high-probability bounds can be specified as
While extremely flexible and applicable to a wide variety of learning tasks and algorithms, for the classification task, getting around the boundedness requirement on the inputs is impossible using the machinery of Brownlees et al. Even more serious complications are introduced by the difficulty of computation: while simple fixed-point procedures can be used to accurately approximate the robust objective, it cannot be expressed explicitly, and indeed need not be convex as a function of the parameters, even in the linear model case. Approximation error is unavoidable due to early stopping, and in addition to this computational overhead, using non-linear solvers to minimize the function can be costly and unstable in high-dimensional tasks. A recent pre-print from Lecué et al. considers replacing the M-estimator of Brownlees et al. with a median-of-means risk estimate, which does not require bounded inputs to obtain strong guarantees, but which requires an expensive iterative sub-routine for every loss evaluation, leading to substantial overhead for even relatively small learning tasks.
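For concreteness, a Catoni-type location estimate can be approximated with a simple root-finding routine; the sketch below uses bisection and Catoni's narrow truncation function (our assumptions for illustration; fixed-point iteration, as mentioned above, is the approach usually described):

```python
import numpy as np

def psi(u):
    """Catoni's narrow soft truncation function: non-decreasing,
    approximately linear near zero, logarithmic in the tails."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0,
                    np.log1p(u + u**2 / 2),
                    -np.log1p(-u + u**2 / 2))

def catoni_estimate(x, s, tol=1e-9):
    """Location estimate theta solving sum_i psi((x_i - theta)/s) = 0.

    The left-hand side is decreasing in theta and changes sign
    between min(x) and max(x), so bisection converges."""
    lo, hi = float(x.min()), float(x.max())
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi((x - mid) / s).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For symmetric samples the estimate coincides with the sample mean; the payoff comes under asymmetric contamination, where the logarithmic tails of $\psi$ keep the estimate near the bulk of the data rather than chasing outliers.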
We would like to utilize the strong elements of the existing procedures cited, while addressing their chief weaknesses. To do so, we begin by integrating the Catoni influence function defined in (3), which results in a new function of the form
Note that $\rho(u) \geq 0$ for all $u$. This function satisfies $\rho(0) = 0$, is symmetric about zero so that $\rho(u) = \rho(-u)$, and since the absolute value of the slope is bounded, $\rho$ is Lipschitz continuous.
Here $m$ is the desired margin level, and once again $s$ is a re-scaling parameter. Note that this loss penalizes not only incorrectly classified examples, but also examples which are correctly classified, but overconfident. The intuition here is that by also penalizing overconfident correct examples to some degree, we seek to constrain the variance of the margin distribution. The nature of this penalization is controlled by $s$: a larger value leads to fewer correct examples being penalized.
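To illustrate the shape of such a loss, here is one way to realize it numerically; we stress that this is a sketch under our own assumptions ($\psi$ taken to be the narrow Catoni truncation, $\rho$ its primitive computed by quadrature), with the paper's equation (5) as the actual reference:

```python
import numpy as np

def psi(u):
    # Catoni-type narrow truncation; odd, so its primitive is even.
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0,
                    np.log1p(u + u**2 / 2),
                    -np.log1p(-u + u**2 / 2))

def rho(u, grid=4096):
    """rho(u) = integral_0^|u| psi(t) dt via the midpoint rule:
    quadratic near zero, nearly linear (hence Lipschitz) in the tails."""
    out = []
    for ui in np.atleast_1d(np.abs(np.asarray(u, dtype=float))):
        dt = ui / grid
        t = (np.arange(grid) + 0.5) * dt
        out.append(psi(t).sum() * dt)
    return np.array(out)

def level_loss(w, X, y, level, s):
    """Average penalty for margins sitting away from the target level,
    on either side: misclassified AND overconfident examples both pay."""
    m = y * (X @ w)
    return rho((m - level) / s).mean()
```

Note the overconfidence penalty at work: a margin far above the level is treated like one far below it, which is what drives the variance control just described.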
It remains to set the scale . To do so, first note that for any candidate , we have
and that this estimator enjoys a pointwise error bound dependent on (see appendix 6.2 for details), which says
with probability no less than $1 - \delta$. Minimizing this bound in $s$ naturally leads to a particular scale setting, but in our case, a certain amount of bias is assuredly tolerable; say a certain fraction of the desired margin level, plus error that vanishes as $n \to \infty$. With $s$ set accordingly, we have
The exact setting of $s$ plays an important role both in theory and in practice; we shall look at this in more detail in sections 3–4. In practice, the true variance will of course be unknown, but we can replace it with any valid upper bound; rough estimates are easily constructed using moments of the empirical distribution (see section 4).
With scaling taken care of, our proposed algorithm is simply to minimize the new loss (5) using gradient descent, namely to run the iterative update
where the $\alpha_t \geq 0$ are step sizes. We summarize the key computations in Algorithm 1 for the case of a linear model with fixed step sizes.
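A minimal version of the resulting update, for a linear model with a fixed step size, might look as follows (a sketch: we take $\psi$, the slope of $\rho$, to be the narrow Catoni truncation, which is our assumption rather than the paper's exact specification):

```python
import numpy as np

def psi(u):
    # Assumed slope of the level-seeking loss rho.
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0,
                    np.log1p(u + u**2 / 2),
                    -np.log1p(-u + u**2 / 2))

def train_margin_pursuit(X, y, level=1.0, s=1.0, step=0.2, iters=1000):
    """Batch gradient descent on mean_i rho((y_i <w, x_i> - level)/s).

    By the chain rule, the gradient contribution of example i is
    psi((m_i - level)/s) * y_i * x_i / s, with m_i the margin."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        m = y * (X @ w)
        grad = ((psi((m - level) / s) / s)[:, None] * (y[:, None] * X)).mean(axis=0)
        w -= step * grad
    return w

# Linearly separable toy problem: labels given by the first coordinate.
X = np.array([[1.5, 0.3], [2.0, -0.4], [-1.2, 0.5], [-2.2, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_hat = train_margin_pursuit(X, y)
```

Aside from evaluating $\psi$ instead of a hinge sub-gradient, the per-iteration cost is the same as for standard gradient descent on a surrogate loss, which is the "virtually no overhead" point made earlier.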
Intuitively, in running Algorithm 1 (or any generalization of it), the expectation is that with enough iterations, the margin location estimate should approximate the prescribed level rather sharply, although arbitrary precision assuredly cannot be guaranteed. If the level is set too high given a hypothesis class with low complexity, we cannot expect it to be near the location of the margin distribution, which is accurately approximated by the Catoni-type estimate. This can be easily proven: there exists a set of classifiers and a distribution under which even a perfect optimizer of the new risk has a Catoni-type estimate smaller than the prescribed level (proof given in appendix 6.2).
If the approximation actually is sharp, how does this relate to control of the margin distribution? By design, the estimator is resistant to errant observations and is located near the majority of observations (see Proposition 2); thus, if it turns out that the estimate is close to the prescribed level, then it is not possible for the majority of margin points to be much smaller (or much larger) than that level. (Note that we still cannot rule out the possibility that the margin distribution is spread out over a wide region; a simple example is the case where the margins are symmetrically distributed around the level.) Conceptually, the desired outcome is similar to that of the procedure of Brownlees et al. discussed in section 2.1, but with an easy implementation and a more straightforward statistical analysis. In section 3, we show that risk bounds are readily available for the proposed procedure, even without a bound on the inputs. Empirical analysis in section 4 illustrates the basic mechanisms underlying the algorithm, using real-world benchmark data sets.
For positive integer $n$, write $[n]$ for the set of all positive integers no greater than $n$. The underlying distribution of interest is that of $z = (x, y)$, here taking values on $\mathbb{R}^{k} \times \{-1, +1\}$. The data sample refers to $n$ independent and identically distributed (“iid”) copies of $z$, denoted $z_i = (x_i, y_i)$ for $i \in [n]$. Let $\mathcal{F}$ denote a generic class of functions. The running assumption will be that all members are measurable, and at the very least integrable. Denote the input variance by $\sigma^2$.
Our chief interest from a theoretical standpoint is in statistical properties of Algorithm 1, in particular we seek high-probability upper bounds on the excess risk of the procedure after iterations, given observations, that depend on , , and low-order moments of the underlying distribution. We begin with some statistical properties of the motivating estimator, and a look at how scale settings impact these properties.
For any and scale , the estimate satisfies the following:
There exists such that for all , we have .
There exists a constant such that for all ,
The basic facts laid out in Proposition 1 illustrate how controls the “bias” of the Catoni estimator. A larger scale factor makes the estimator increasingly sensitive to errant data, and causes it to close in on the empirical mean. A sufficiently small value on the other hand causes the estimator to effectively ignore the distribution tails, closing in on the empirical median.
Given any dataset and candidate , construct as usual. Then consider a modified dataset , which is identical to the original except for one point, subject to arbitrary perturbation. Let denote the estimator under the modified data set. Defining a sub-index as
it follows that whenever and are large enough that , we have
The stability property highlighted in Proposition 4 is appealing because the difference between the original and modified points could be arbitrarily large, while the estimator, in shifting to its new value, remains close to the majority of the points, and cannot be drawn arbitrarily far away. For clarity, we have considered the case of just one modified point, but a brief glance at the proof (in the appendix) should demonstrate how analogous results can readily be obtained for the case of larger fractions of modified points.
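The contrast with the empirical mean is easy to see numerically; assuming the narrow Catoni truncation and a bisection solver (our assumptions for illustration, not the paper's exact specification), corrupting a single point shifts the mean without bound while the robust estimate barely moves:

```python
import numpy as np

def psi(u):
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0,
                    np.log1p(u + u**2 / 2),
                    -np.log1p(-u + u**2 / 2))

def catoni_estimate(x, s, tol=1e-9):
    # Bisection on the decreasing map theta -> sum(psi((x - theta)/s)).
    lo, hi = float(x.min()), float(x.max())
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi((x - mid) / s).sum() > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# 100 well-behaved points, then perturb a single one arbitrarily far.
x = np.linspace(-1.0, 1.0, 100)
x_bad = x.copy()
x_bad[-1] = 1e6

shift_mean = abs(x_bad.mean() - x.mean())        # grows without bound
shift_catoni = abs(catoni_estimate(x_bad, 1.0) - catoni_estimate(x, 1.0))
```

The corrupted point moves the sample mean by roughly $10^4$ here, while the robust estimate moves by a small constant that does not grow with the size of the perturbation.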
Fixing any , consider the estimate defined in (2), equivalently characterized as a minimizer of in , with scaling parameter set such that , where is any upper bound . It follows that
The confidence interval in Lemma 6 is called pointwise because it holds for a pre-fixed candidate, in contrast with uniform bounds that hold independent of the choice of candidate. When considering our Algorithm 1, the candidate will be data-dependent and thus random, meaning that pointwise bounds will have to be extended to cover all possible contingencies; see the proof of Theorem 11 for details.
Proceeding with our analysis, the ultimate evaluation metric of interest here is the classification risk (expectation of the zero-one loss), denoted
Using empirical estimates of the zero-one loss is not conducive to efficient learning algorithms, and our Algorithm 1 involves the minimization of a new loss , defined in equation (5). To ensure that good performance in this metric implies low classification risk, the first step is to ensure that the function is calibrated for classification, in the sense of Bartlett et al. . To start, fixing any , define . This furnishes the surrogate risk
The basic idea is that if this loss is calibrated, then one can show that there exists a function depending on user-specified and settings, which is non-decreasing on the positive real line and satisfies
Our loss function defined in (4) is congenial due to the fact that it is classification-calibrated, with a calibration transform that can be computed exactly, for arbitrary values of the margin level and scale. Details of this computation are not difficult, but are rather tedious, and thus we relegate them to appendix 6.3. Basic facts are summarized in the following lemma.
The loss function is classification calibrated such that for each , the following statements hold.
-transform: there exists a function for which , depending on , , , and a concave function defined on , specified in the proof (also see Figure 2). This -transform function takes the form
Risk convergence: given a sequence of sample-dependent , we have that convergence in our surrogate is sufficient for convergence in the zero-one risk, namely
Invertibility: is invertible on , and thus for small enough excess risk, we can bound as .
One would naturally expect that all else equal, if a classifier achieves the same excess -risk for a larger value of , then the resulting excess classification risk should be smaller, or at least no larger. More concretely, we should expect that
This range comes from the fact that and . This monotonicity follows from the definition of and the convexity of the -transform (also see Figure 2 in the following section).
With preparatory results in place, we can now pursue an excess risk bound for Algorithm 1. To make notation more transparent, we accordingly write and to denote the respective risks under , where . The core technical assumptions are as follows:
is a compact subset of , with diameter .
There exists at which .
is strongly convex on the parameter set, with minimum denoted by $w^*$. (Assuming we can take the derivative under the integral, the smoothness of the loss implies differentiability of the risk; then using the compactness of the parameter set, it follows that the minimum is attained.)
The gradient distribution follows a standard form of high-dimensional sub-Gaussianity, characterized as follows. Writing for the new loss gradient before scaling by , and for its covariance matrix, there exists some such that for all , , and , we have
The important assumptions here are the strong convexity and gradient sub-Gaussianity conditions. The latter can be satisfied with inputs that have sub-Gaussian tails; this does not include data with infinite higher-order moments, but requires no bound on the inputs at all. As for the former, first note that each element of the Hessian of the new loss function is
Write for readability, and use and to denote integration over the positive and non-positive parts of . First, observe that
Using this inequality, we have
The second term on the right-hand side is a negative value that can be taken near zero for any candidate by taking the scale large enough. The first term is positive, and thus with a large enough scale, as long as the second moment matrix of the inputs is positive definite (a weak assumption), it follows that the Hessian of the risk is uniformly bounded below. Since the risk is twice continuously differentiable, this implies strong convexity [15, Theorem 2.1.11].
With these assumptions in place, finite-sample risk bounds can be obtained.
Running Algorithm 1 for $T$ iterations, the final output produced, written $\widehat{w}_{T}$, satisfies
with probability no less than over the random draw of the sample, where the dominant term is defined
Excess risk bounds given in Theorem 11 are composed of two key terms, one of a computational nature, and one of a statistical nature. The first term is optimization error, which decreases as $T$ grows, and depends on the initial estimate, the step size, and the convexity of the surrogate risk. The second term is statistical error, and depends on the sample size, the scale $s$, the number of parameters, and second-order moments of the inputs. Note that there is a clear tradeoff due to $s$: a sufficiently large scale factor is needed to ensure strong convexity holds, but setting $s$ too large impacts the statistical error in a negative way.
Finally, we note that the extra factor in the statistical error is due to a covering number argument used to obtain a bound on the empirical gradient error that holds uniformly over the parameter set. Does there exist another computational procedure, with the same optimization error, but without this seemingly superfluous factor in the statistical error? We pursue such analysis in future work.
In our numerical experiments, we aim to complement the theoretical analysis carried out in the previous section. We look at how algorithm parameter settings impact generalization guarantees, and using real-world datasets, investigate how Algorithm 1 performs, comparing its behavior with a benchmark procedure.
First, we look at the function introduced in the previous section, and its inverse, . In the two leftmost plots of Figure 2, we plot the graph of and over , for and varying values of . Convexity of and its monotonic dependence on can be clearly observed.
In the second plot from the right, we fix and and plot the graph of over a range of and values. We can clearly observe how achieving a better excess surrogate risk (corresponding to a smaller value) for a larger value leads to smaller excess misclassification risk (corresponding to smaller values of the function plotted). In addition, the same excess surrogate risk clearly leads to better generalization in the misclassification risk if it is achieved with a larger value, although this positive impact diminishes quickly as gets large.
Finally, in the rightmost plot of Figure 2, we fix the margin level and excess surrogate risk, and plot the resulting excess misclassification risk bound for a range of positive scale values. In the limit as the scale gets large, we find that this quantity bottoms out quickly at a positive value. This has important implications in terms of scaling strategies, because it demonstrates where issues can arise when $s$ is scaled with $\sqrt{n}$, as would be implied by simply minimizing the pointwise error bound (as seen in (6) and Lemma 6). Indeed, even if an algorithm achieves a small excess surrogate risk, if $s$ is allowed to scale as $\sqrt{n}$, then even taking $n$ large will not imply a small misclassification risk. This is one important reason that Algorithm 1 does not scale using the bound-minimizing value, but rather a value that allows for consistency in the limit as $n$ and $T$ grow large.
In all the experiments discussed here, we consider binary classification on real-world data sets, modified to control for unbalanced ratios of positive and negative labels. Training for each data set is done using a sample of pre-fixed size, and testing is done on a disjoint subset. The train-test sequence is repeated over 25 trials, and all numerical performance metrics displayed henceforth should be assumed to be averages taken over all trials.
We use four data sets, denoted cov, digit5, protein, and sido, creating subsets under the following constraints: (1) Sample size is no more than ten times the nominal dimension , and (2) both the training and testing data sets have balanced ratios of labels (as close as possible to each). Starting with cov (, , non-zero: ), this is the “Forest CoverType dataset” on the UC Irvine repository, converted into a binary task identifying class 1 against the rest. digit5 (, , non-zero: ) is the MNIST hand-written digit data, converted into a binary task for the digit 5. protein (, , non-zero: ) is the protein homology dataset (KDD Cup 2004). sido (, , non-zero: ) is the molecular descriptor data set (NIPS 2008 causality challenge), with binary-valued features. In each trial, from the full original data set, we take a random sub-sample of the specified size, without replacement, for training, and for test data we use as much of the remaining data as possible, within the confines of constraint (2) above.
As a well-known benchmark algorithm against which we can compare the behaviour and performance of the proposed Algorithm 1, we implement and run the well-known Pegasos algorithm of Shalev-Shwartz et al. For both methods, the initial value is determined randomly in each trial. We explore multiple settings of Algorithm 1 described further below, but in all cases we take the stochastic optimization approach: instead of using all training examples at each step, we randomly select one at a time for computing the update direction. For direct comparison with Pegasos, we set a margin level, add a squared $\ell_2$-norm regularization term with coefficient $\lambda$, utilize a step size of $1/(\lambda t)$, and project to the $1/\sqrt{\lambda}$-radius ball. That is, we run a stochastic projected gradient descent version of Algorithm 1, and evaluate the impact of the proposed loss function.
We begin with the simplest setting of Algorithm 1, where is fixed throughout. In Figures 3–4, we plot training error, test error, and numerous statistics of the empirical margin distribution, all as a function of cost incurred (equal to number of gradients computed). For each dataset, we experimented with and display the results for the case of that resulted in the best performance, as measured by the lowest test error achieved over all iterations.
We see that our proposed procedure is highly competitive with the best setting of Pegasos, and results in a margin distribution very distinct from that of the competing procedure. On the whole, we see a much more symmetrical distribution, with smaller variance, that over iterations pushes the margin location up in a monotonic fashion, in stark contrast to that of Pegasos, whose empirical distribution peaks early and slowly settles down over time. The smaller variance and higher degree of symmetry is precisely what we would expect given the definition of , which assigns a penalty for correctly classified examples that are overconfidently classified, as discussed in section 2.2.
Next, we look at the impact of a fixed scale, determined by observed data, as follows. Each run of Algorithm 1 starts with fixed just as in the previous tests, but after a pre-fixed number of steps, updates the scale just once, to take a value of (see Lemma 6), where
is approximated using the 75th quantile of the empirical distribution induced by the data. This time, we intentionally under-regularize, setting the regularization coefficient at less than 1/100th of the best setting found in the previous tests.
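In code, this one-shot re-scaling step might look as follows; the specific formula is our reading of the Catoni-type pointwise bound (Lemma 6), and the 75th-quantile variance proxy is an assumed stand-in for the paper's exact choice:

```python
import numpy as np

def data_driven_scale(margins, n, delta=0.05):
    """One-shot scale update: bound the margin variance with a robust
    second-moment estimate (75th quantile of squared deviations from
    the median), then set s ~ sigma * sqrt(n / (2 log(2/delta))),
    the form suggested by Catoni-type deviation bounds."""
    dev2 = (np.asarray(margins) - np.median(margins)) ** 2
    var_bound = np.quantile(dev2, 0.75)
    return np.sqrt(var_bound * n / (2.0 * np.log(2.0 / delta)))
```

The point here is not the exact constants, but that $s$ is set entirely from observable quantities, so no additional hyper-parameter search is required.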
Representative results are given in Figure 5. When highly under-regularized, and without re-scaling, the learning algorithm just wanders about, overwhelmed by the variance of the per-iteration sub-sampling; when the procedure is left to run like this, a good solution can rarely be found before the step size grows small, which is highly inefficient. On the other hand, using the simple data-driven scaling procedure just described to fix a “safe” value of $s$, we find that the learning algorithm is almost immediately accelerated, and in less time essentially catches up with the performance achieved under the best regularization possible. This is extremely encouraging, as it suggests that a safe, inexpensive, automated scaling procedure can make up for our lack of knowledge about the ideal regularization parameter, allowing for potentially significant savings in hyper-parameter exploration.
In this paper, we introduced and analyzed a new learning algorithm which, via a new convex loss with re-scaling, lets us pursue stronger guarantees for the resulting margin distribution (and classifier) than are possible with the traditional hinge loss. This allows us to bridge the gap between inference and computation, since strong learning guarantees are available for Algorithm 1, which is readily implemented in practice. Empirical tests confirmed that the algorithm basically behaves as we would expect, and that even with naive parameter settings, appropriate re-scaling on the back end allows our procedure to match or exceed the performance of well-known competitors.
Here we put together a few standard technical results that are utilized in the main proofs.
Let be continuously differentiable, convex, and -smooth. Then, we have
Given in chapter 2 of Nesterov . ∎
The surrogate risk defined in (8), for , , is -smooth with coefficient .
Assuming the order of integration and differentiation can be reversed, one can write as
It follows that for arbitrary we have
where we utilized the property that is 1-Lipschitz. This implies that
with coefficient , namely is -smooth. ∎
Let be a random vector taking values in , with the sub-Gaussian property
for some constant and . Given independent copies of , denoted , write . Then with probability no less than , we have
We use the Chernoff extension of Markov’s inequality to establish exponential tails for the deviation of the sample mean from its expectation, a standard technique. For a non-negative real-valued random variable $X$, taking any $a > 0$ we have $a\,I\{X \geq a\} \leq X$ almost surely, using the non-negativity of $X$. Taking expectations on both sides implies $a\,P\{X \geq a\} \leq \mathbb{E}X$. Thus $P\{X \geq a\} \leq \mathbb{E}X / a$, the classic Markov inequality. For a non-decreasing, non-negative function $g$, this naturally extends via $P\{X \geq a\} \leq P\{g(X) \geq g(a)\}$ to $P\{X \geq a\} \leq \mathbb{E}\,g(X) / g(a)$