Classification is a fundamental problem in statistics and machine learning, including scientific problems such as cancer diagnosis and satellite image processing as well as engineering applications such as credit card fraud detection, handwritten digit recognition, and text processing [khan2001classification, lee2004cloud], but modern applications have brought new challenges. In online retailing, websites such as Amazon have hundreds of thousands or millions of products to taxonomize [lin2018overview]. In text data, the distribution of words in documents has been observed to follow a power law, in that there are many labels with few instances [zipf1936psycho, feldman2019does]. Similarly, image data also exhibit a long tail of many classes with few examples [salakhutdinov2011learning, zhu2014capturing]. In such settings, the classes with smaller probabilities are generally classified incorrectly more often, and this is undesirable when the smaller classes are important, such as rare forms of cancer, fraudulent credit card transactions, and expensive online purchases. Thus, we need modern classification methods that work well when there are a large number of classes and when the class-wise probabilities are imbalanced.
When faced with such class imbalance, a popular approach in practice is to choose a metric other than zero-one accuracy, such as precision, recall, or the $F$-measure [van1974foundation, van1979information], which explicitly take class-conditional risks into account, and train classifiers to optimize this metric. A difficulty with this approach, however, is that the right metric for imbalanced classification is often not clear. A related class of approaches keeps the zero-one accuracy metric but modifies the samples instead. The popular algorithm SMOTE [chawla2002smote] performs a type of data augmentation for a minority class, i.e., a class with lower probability, and sub-samples the large classes. This has led to variants with different forms of data augmentation [zhou2006training, mariani2018bagan], but from a theoretical perspective, these methods remain poorly understood.
A much simpler approach, which is also related to the approaches above, is class-weighting, in which different costs are incurred for mis-classifying samples of different labels. Practically, this is a natural approach because it is often possible to assign different costs to different classes. For example, the average fraudulent credit card transaction may cost hundreds of dollars, and in online retailing, failing to show a customer the correct item causes the company to lose out on the profit of selling that item. Thus, a good classifier should be fairly sensitive to possibly fraudulent transactions, and online retailers should prioritize displaying high-profit products. As a result, class-weighting has been studied in a variety of settings, including modifying black-box classifiers, SVMs, and neural networks [domingos1999metacost, lin2002support, scott2012calibrated, zhou2006training]. Additionally, class-weighting has been observed to be useful for estimating class probabilities, since class-weighting amounts to adjusting decision thresholds [wang2008probability, wu2010robust, wang2019multiclass].
A crucial caveat with cost-weighting, however, is that the right choice of costs is often not clear, and with any one choice of costs, the performance of the corresponding classifier may suffer under other, perhaps more suitable, choices of costs.
In this paper, we use cost-weighting for imbalanced classification in three ways. We start by examining a weighted sum of class-conditional risks, i.e., the risks conditional on the class label taking some specific value $k$. This allows us to upweight a minority class to achieve better performance on the minority examples. We then provide an illuminating analysis of the fundamental tradeoffs that occur with any single choice of costs.
Since we may not know precisely which weighting to pick, we examine a robust risk that is a supremum of the weighted risks over an uncertainty set of possible weights. This objective can be interpreted as a class-wise distributionally robust optimization problem, where we ask for robustness over the marginal distribution of $Y$. This leads to a minimax problem, for which we provide generalization guarantees. We also note that a standard gradient descent-ascent algorithm may solve the optimization problem when the risk is convex in the classifier parameters.
Finally, we show that for a natural class of uncertainty sets, the robust risk reduces to what we call label conditional value at risk (LCVaR). We highlight a connection to conditional value at risk (CVaR), a well-studied quantity in portfolio optimization and stochastic programming parametrized by a level $\alpha \in (0,1)$ [rockafellar2000optimization, shapiro2009lectures]. Further, we propose a generalization that we call label heterogeneous conditional value at risk (LHCVaR), which allows for a different level $\alpha_k$ for each class $k$. To the best of our knowledge, this has not been examined previously, and it could possibly be used more broadly. To give an example in portfolio optimization, we may wish to treat risks arising from different types of assets, e.g., large-cap stocks versus small-cap stocks or domestic debt versus international debt, differently. Next, we show that the dual form for LHCVaR is similar to that for LCVaR as long as the heterogeneity is finite-dimensional, and this leads to an unconstrained optimization problem. Finally, we examine the efficacy of LCVaR and LHCVaR on real and synthetic data.
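For context, recall the classical variational form of CVaR due to Rockafellar and Uryasev [rockafellar2000optimization]. Writing $\alpha \in (0,1)$ for the tail probability (conventions differ; some authors parametrize by the confidence level $1-\alpha$), the CVaR of a loss $Z$ is
\[
\operatorname{CVaR}_{\alpha}(Z)
\;=\;
\inf_{\lambda \in \mathbb{R}}
\Big\{ \lambda + \tfrac{1}{\alpha}\, \mathbb{E}\big[(Z - \lambda)_{+}\big] \Big\},
\qquad (t)_{+} = \max\{t, 0\},
\]
and the infimum is attained at a $(1-\alpha)$-quantile of $Z$. It is unconstrained dual representations of this kind that make CVaR-type objectives tractable in optimization.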
The rest of the paper is outlined as follows. In Section 2, we discuss our problem setup. In Section 3, we examine weighting in plug-in classification. In particular, we elucidate the fundamental trade-off in weighted classification and its methodological implications. In Section LABEL:sec:RobustProblem, we examine a robust version of the weighted risk problem, including generalization guarantees and connections to stochastic programming. In Section LABEL:sec:NumericalResults, we provide numerical results, and we conclude with a discussion in Section LABEL:sec:Discussion. Additional proofs and results in related settings are deferred to the appendices.
1.1 Further Related Work
We briefly review other research related to imbalanced classification; for a far more exhaustive treatment, see the surveys of the area [he2009learning, fernandez2018learning]. First, two other methods may be employed to solve imbalanced classification problems. The first is class-based margin adjustment [lin2002support, scott2012calibrated, cao2019learning], in which the margin parameter for the margin loss function may vary by class. Broadly, margin adjustment and weighting may both be considered loss-modification procedures. The second method is Neyman-Pearson classification, in which one attempts to minimize the error on one class given a constraint on the worst permissible error on the other class [rigollet2011neyman, tong2013plug, tong2016survey].
An important topic related to our paper, but one that has not been well connected to imbalanced classification, is robust optimization. Robust optimization is a well-studied topic [ben1999robust, ben2003robust, ben2004adjustable, ben2009]. A variant that has gained traction more recently is distributionally robust optimization [ben2013, bertsimas2014, namkoong2017variance]. Unsurprisingly, CVaR, as a coherent risk measure, has been previously connected to distributionally robust optimization [goh2010distributionally]. Distributionally robust optimization generally and CVaR specifically have also previously been used in machine learning to deal with imbalance [duchi2018mixture, duchi2018learning], but in these works, the imbalance was considered to exist in the covariates, whether known to the algorithm or not. These works are motivated by the recent push toward fairness in machine learning, in particular ensuring that ethnic minorities do not suffer discrimination, due to biases in the data, in high-stakes situations such as loan applications, medical diagnoses, or parole decisions.
2.1 Classification with Imbalanced Classes
In this section, we briefly go over the problem setup. First, we draw samples from the space $\mathcal{X} \times \mathcal{Y}$. For our purposes, we are interested in $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{1, \ldots, K\}$. Note there are two slightly different mechanisms for the data-generating process that are considered in imbalanced classification and Neyman-Pearson classification. In the first, we are given i.i.d. samples from a distribution $P$ on $\mathcal{X} \times \mathcal{Y}$. Here, we let $\pi_k = P(Y = k)$ be the probability of class $k$. Additionally, we sometimes refer to the vector of class probabilities as $\pi$. This is our framework of interest, since it corresponds to standard assumptions in nonparametric statistics and learning theory. In the alternative framework, we are given $n_k$ samples from each class-conditional distribution $P_k = P(X \mid Y = k)$. The probability of class $k$ in this case is then known: $\pi_k = n_k / n$. For the most part, these two mechanisms yield similar results, but the analyses differ slightly. To streamline the presentation, we only consider the first case in the main paper, although we give a result for the alternative framework in the appendix that illustrates the difference.
2.2 Class Conditioned Risk
We are interested in finding a good classifier $f$ in some function space $\mathcal{F}$, such as linear classifiers or neural networks. In this section, we establish our risk measures of interest. In general, we want to minimize the expectation of some loss function $\ell$, which we call the risk and denote
\[
R_{\ell}(f) = \mathbb{E}\big[\ell(f(X), Y)\big].
\]
Analogously, we define the class-conditioned risk for class $k$ to be
\[
R_{\ell, k}(f) = \mathbb{E}\big[\ell(f(X), Y) \mid Y = k\big].
\]
At this point, we make some observations for plug-in classification and empirical risk minimization. In the plug-in classification results, we consider the zero-one loss $\ell_{0\text{-}1}(f(x), y) = \mathbf{1}\{f(x) \neq y\}$, and for our results on empirical risk minimization, we are primarily interested in convex surrogate losses. For simplicity, when $\ell$ is clear from context, or a statement is made for a generic $\ell$, we will denote these simply as $R(f)$ and $R_k(f)$.
Now, we can work toward defining weighted risks. Observe that we can relate the risk to the class-conditioned risks by
\[
R(f) = \sum_{k} \pi_k R_k(f).
\]
An important part of our paper is an examination of class-weighted risk. Let $w$ be a vector such that $w_k \geq 0$ for all $k$ and $\sum_k w_k \pi_k = 1$. Then, the $w$-weighted risk is
\[
R_w(f) = \sum_{k} w_k \pi_k R_k(f).
\]
Note that the usual risk is recovered by setting $w = (1, \ldots, 1)$.
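As a quick sanity check, the decomposition $R(f) = \sum_k \pi_k R_k(f)$ holds exactly in-sample when $\pi_k$ is replaced by the empirical class proportions. The following minimal sketch verifies this numerically; the Gaussian data-generating model and the fixed threshold classifier are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pi = 10_000, np.array([0.8, 0.2])

# Sample (X, Y) with imbalanced class probabilities pi.
y = rng.choice(2, size=n, p=pi)
x = rng.normal(loc=np.where(y == 1, 1.0, -1.0), scale=1.0)

f = (x >= 0).astype(int)          # a fixed threshold classifier
risk = np.mean(f != y)            # overall empirical zero-one risk

# Class-conditioned risks and empirical class proportions.
pi_hat = np.array([np.mean(y == k) for k in (0, 1)])
risk_k = np.array([np.mean(f[y == k] != k) for k in (0, 1)])

# The decomposition R(f) = sum_k pi_k R_k(f) holds exactly in-sample.
assert np.isclose(risk, np.sum(pi_hat * risk_k))
```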
2.3 Plug-in Classification
In this section, we discuss weighted plug-in classification. For plug-in estimation, we restrict our attention to the binary classification case $\mathcal{Y} = \{0, 1\}$, and the primary quantity of interest is usually the zero-one risk, i.e., the risk under $\ell_{0\text{-}1}$. In general, the risk of the best classifier is nonzero because, for a given $x \in \mathcal{X}$, the label may take either the value $0$ or $1$ with positive probability.
As a result, we need a way to discuss the convergence of our estimator to the best possible estimator. We define the regression function by
\[
\eta(x) = P(Y = 1 \mid X = x).
\]
Now, the Bayes optimal classifier is the classifier that minimizes the risk, and it is defined by
\[
f^*(x) = \mathbf{1}\{\eta(x) \geq 1/2\}.
\]
The minimum possible risk is called the Bayes risk and denoted by $R^* = R(f^*)$, and generally we focus on minimizing the excess risk $R(f) - R^*$.
Following the form of the Bayes classifier, a plug-in estimator attempts to estimate the regression function by some $\hat{\eta}$ and then ``plugs in'' the result to a threshold function. Thus, $\hat{f}$ has the form
\[
\hat{f}(x) = \mathbf{1}\{\hat{\eta}(x) \geq 1/2\},
\]
which is analogous to the form of the Bayes classifier. For additional background on plug-in estimation, see, e.g., [devroye1996probabilistic].
At this point, we wish to define the weighted versions of the Bayes classifier, Bayes risk, plug-in classifier, and excess risk. For brevity, define the threshold $t_w = w_0 / (w_0 + w_1)$. First, we consider the Bayes classifier. Let $w$ be a weighting. The Bayes optimal classifier for the $w$-weighted risk is
\[
f_w^*(x) = \mathbf{1}\{\eta(x) \geq t_w\}.
\]
The proof, along with proofs of other subsequent results on plug-in classification, appears in the appendix. In this case, we denote the Bayes risk by $R_w^* = R_w(f_w^*)$. Lemma 2.3 reveals that the Bayes classifier is a plug-in rule, and analogously, we see that a plug-in estimator in the weighted case takes the form
\[
\hat{f}_w(x) = \mathbf{1}\{\hat{\eta}(x) \geq t_w\}.
\]
Consequently, we define the excess $w$-risk for an empirical classifier $\hat{f}_w$ to be
\[
\mathcal{E}_w(\hat{f}_w) = R_w(\hat{f}_w) - R_w^*,
\]
and note that we are interested in bounding the expected excess $w$-risk for plug-in estimators.
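The effect of weighting the plug-in threshold can be seen numerically. In cost-sensitive binary classification, the weighted Bayes rule thresholds $\eta$ at $w_0/(w_0 + w_1)$ rather than $1/2$; the sketch below illustrates this with a Gaussian mixture whose regression function is known in closed form. The mixture parameters and the weighting $w = (1, 9)$ are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, pi1 = 20_000, 0.1              # class 1 is the minority class

y = rng.binomial(1, pi1, size=n)
x = rng.normal(np.where(y == 1, 1.0, -1.0), 1.0)

def eta(x):
    """True regression function P(Y=1 | X=x) for this Gaussian mixture."""
    p1 = pi1 * norm.pdf(x, 1.0, 1.0)
    p0 = (1 - pi1) * norm.pdf(x, -1.0, 1.0)
    return p1 / (p0 + p1)

def classwise_errors(threshold):
    pred = (eta(x) >= threshold).astype(int)
    return np.mean(pred[y == 0] != 0), np.mean(pred[y == 1] != 1)

# Unweighted Bayes rule thresholds eta at 1/2; upweighting the minority
# class with w = (w0, w1) = (1, 9) moves the threshold to w0/(w0+w1) = 0.1.
err0_plain, err1_plain = classwise_errors(0.5)
err0_wtd, err1_wtd = classwise_errors(1.0 / (1.0 + 9.0))

assert err1_wtd <= err1_plain      # minority-class error drops...
assert err0_wtd >= err0_plain      # ...at the cost of majority-class error
```

Lowering the threshold enlarges the predict-1 region, so the minority-class conditional error can only decrease while the majority-class error can only increase, which is exactly the trade-off analyzed in Section 3.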
2.4 Empirical Risk Minimization
In this section, we define the empirical quantities that we need for empirical risk minimization, particularly the weighted and robust risks. We consider an i.i.d. sample $(X_1, Y_1), \ldots, (X_n, Y_n)$. We define the empirical class-conditioned risk by
\[
\hat{R}_k(f) = \frac{1}{n_k} \sum_{i : Y_i = k} \ell(f(X_i), Y_i),
\]
where $n_k = |\{i : Y_i = k\}|$. Let $\hat{\pi}_k = n_k / n$ denote the empirical proportion of observations of class $k$, and let $w$ be a weight vector. The empirical $w$-weighted risk is
\[
\hat{R}_w(f) = \sum_k w_k \hat{\pi}_k \hat{R}_k(f).
\]
The empirical robust risk over an uncertainty set $\mathcal{W}$ is defined analogously by
\[
\hat{R}_{\mathcal{W}}(f) = \sup_{w \in \mathcal{W}} \hat{R}_w(f).
\]
This problem is convex in $f$ when the loss is convex, and concave in $w$ due to linearity; so one may solve the resulting saddle-point problem with standard techniques such as gradient descent-ascent, which we give in the appendix.
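A minimal sketch of such a descent-ascent scheme follows. It is not the paper's exact algorithm: it uses a linear logistic model, takes the uncertainty set to be the full probability simplex over class weights (so the inner supremum targets the worst class-conditioned risk), and uses a multiplicative mirror-ascent update for $w$; the data model and step sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Imbalanced binary data: class 1 is the minority.
n1, n0 = 100, 900
x = np.concatenate([rng.normal(1.0, 1.0, n1), rng.normal(-1.0, 1.0, n0)])
y = np.concatenate([np.ones(n1), np.zeros(n0)])
X = np.stack([x, np.ones_like(x)], axis=1)      # feature + intercept

def class_risks(theta):
    """Empirical class-conditioned logistic risks (R_0, R_1)."""
    margins = np.where(y == 1, 1.0, -1.0) * (X @ theta)
    losses = np.log1p(np.exp(-margins))
    return np.array([losses[y == k].mean() for k in (0, 1)])

def class_risk_grads(theta):
    """Gradients of the class-conditioned risks with respect to theta."""
    s = np.where(y == 1, 1.0, -1.0)
    g = (-s / (1.0 + np.exp(s * (X @ theta))))[:, None] * X
    return np.stack([g[y == k].mean(axis=0) for k in (0, 1)])

theta, w = np.zeros(2), np.array([0.5, 0.5])
lr_theta, lr_w = 0.5, 0.1
for _ in range(500):
    theta -= lr_theta * (w @ class_risk_grads(theta))  # descent in theta
    w *= np.exp(lr_w * class_risks(theta))             # mirror-ascent in w,
    w /= w.sum()                                       # staying on the simplex

worst = class_risks(theta).max()                       # worst-class risk
```

Because the logistic loss is convex in $\theta$ and the objective is linear in $w$, this alternating scheme drives the worst class-conditioned risk down from its initial value of $\log 2$ at $\theta = 0$.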
Often in empirical risk minimization, generalization bounds are provided, i.e., a bound on the true risk of a classifier in $\mathcal{F}$ in terms of its empirical risk and a variance term. To bring our results closer to those of plug-in estimation, we also consider a form of excess risk. To distinguish the two, define the excess $w$-weighted risk over $\mathcal{F}$ to be
\[
R_w(\hat{f}_w) - R_w(f_w^{\mathcal{F}}),
\]
where $\hat{f}_w$ is the $w$-weighted empirical risk minimizer in $\mathcal{F}$ and $f_w^{\mathcal{F}}$ is the population $w$-weighted risk minimizer in $\mathcal{F}$. Beyond the robust formulation, the key difference between the excess $w$-risk of the previous section and the excess $w$-weighted risk defined here is that in the former we compete with the Bayes classifier given by the true regression function, and in the latter we compete with the best classifier in $\mathcal{F}$.
One additional tool we need for empirical risk minimization is a measure of function class complexity, and a typical measure of the expressiveness of a function class is Rademacher complexity. The empirical Rademacher complexity given a sample $S = (Z_1, \ldots, Z_n)$ is
\[
\hat{\mathfrak{R}}_S(\mathcal{F}) = \mathbb{E}_{\sigma}\left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(Z_i) \right],
\]
where the expectation is taken with respect to the $\sigma_i$, which are independent Rademacher random variables, i.e., uniform on $\{-1, +1\}$. The Rademacher complexity is $\mathfrak{R}_n(\mathcal{F}) = \mathbb{E}_S[\hat{\mathfrak{R}}_S(\mathcal{F})]$, where the expectation is with respect to the sample.
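For a concrete instance, consider the linear class $\{x \mapsto \langle \theta, x \rangle : \|\theta\|_2 \leq 1\}$, for which the inner supremum has a closed form by Cauchy-Schwarz, so the empirical Rademacher complexity can be estimated by Monte Carlo over the Rademacher signs. The sample, dimensions, and number of draws below are arbitrary illustration choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 5
Xs = rng.normal(size=(n, d))        # a fixed sample S of n points

# For F = {x -> <theta, x> : ||theta||_2 <= 1}, Cauchy-Schwarz gives
# sup_theta (1/n) sum_i s_i <theta, x_i> = ||(1/n) sum_i s_i x_i||_2,
# attained at the normalized sign-weighted sum. Averaging over random
# sign draws estimates the empirical Rademacher complexity.
draws = 2000
sigma = rng.choice([-1.0, 1.0], size=(draws, n))    # Rademacher signs
rad_hat = np.mean(np.linalg.norm(sigma @ Xs, axis=1)) / n

# Classical comparison: rad_hat <= sqrt(sum_i ||x_i||^2) / n = O(1/sqrt(n)).
bound = np.sqrt(np.sum(Xs ** 2)) / n
```

The $O(1/\sqrt{n})$ upper bound here follows from Jensen's inequality and is the standard rate that enters the generalization bounds discussed above.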
Finally, we make one note about the loss for our empirical risk minimization results. For binary classification, one can obtain bounds for any bounded loss function that is Lipschitz continuous in its first argument. Since we present multiclass results, we use the multiclass margin loss, which is a bounded version of the multiclass hinge loss [mohri2012]. Here, it is assumed that for each $y \in \mathcal{Y}$, the function $f$ outputs a score $f(x, y)$, and the chosen class is $\arg\max_{y} f(x, y)$. The multiclass margin loss is defined as
\[
\ell(f(x), y) = \Phi\big(m_f(x, y)\big), \qquad \Phi(t) = \min\{1, \max\{0, 1 - t\}\},
\]
where $m_f(x, y) = f(x, y) - \max_{y' \neq y} f(x, y')$ is the margin. For simplicity, we ignore the margin parameter, usually denoted by $\rho$, and treat it as $1$ in our results. Finally, we define the projection set $\Pi(\mathcal{F}) = \{x \mapsto f(x, y) : y \in \mathcal{Y},\, f \in \mathcal{F}\}$.
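The multiclass margin loss with $\rho = 1$ is straightforward to implement; the following sketch computes the margin $m_f(x, y) = f(x, y) - \max_{y' \neq y} f(x, y')$ and clips $1 - m$ to $[0, 1]$. The score matrix is an invented example.

```python
import numpy as np

def margin_loss(scores, y):
    """Multiclass margin loss with margin parameter rho = 1.

    scores: (n, K) array of per-class scores f(x, y'); y: (n,) true labels.
    The loss is Phi(m) = min(1, max(0, 1 - m)): equal to 1 for m <= 0,
    linear on (0, 1), and 0 once the margin m reaches 1.
    """
    idx = np.arange(scores.shape[0])
    true_score = scores[idx, y]
    masked = scores.copy()
    masked[idx, y] = -np.inf             # exclude the true class from the max
    margin = true_score - masked.max(axis=1)
    return np.clip(1.0 - margin, 0.0, 1.0)

scores = np.array([[2.5, 0.0, 1.0],      # margin  1.5 -> loss 0
                   [1.2, 1.0, 0.0],      # margin  0.2 -> loss 0.8
                   [0.0, 2.0, 1.0]])     # margin -2.0 -> loss 1
losses = margin_loss(scores, np.array([0, 0, 0]))
```

Because the loss is bounded in $[0, 1]$ and $1$-Lipschitz in the margin, it fits the boundedness and Lipschitz requirements noted above.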
3 Tradeoffs with Class Weighted Risk
In this section, we examine weighted plug-in classification, and we have two main results. First, we show that weighted plug-in classification enjoys essentially the same rate of convergence as unweighted plug-in classification, although there is dependence on the chosen weights. Second, there is a fundamental trade-off in that optimizing for one set of weights $w$ may lead to suboptimal performance under another set of weights $w'$.
3.1 Excess Risk Bounds
We start with the excess risk bound for plug-in estimators when the weighting is well-specified.
Suppose the regression function $\eta$ is $\beta$-Hölder. Then, the $w$-weighted excess risk of the plug-in classifier $\hat{f}_w$ satisfies
Here, we see that the upper bound depends linearly on $w_0$ and $w_1$. This implies that when we increase the weight for a class with few examples, our bound on the excess risk increases. While previous cost-weighting setups have normalized the sum of the weights [scott2012calibrated], our normalization scheme is computed with respect to the prior probabilities on each class as well, and consequently we explicitly include $\pi$ in our bound. Our choice of domain for the weights is defined in Section LABEL:sec:RobustProblem.
Now, we turn to our second task: examining the weighted excess risk of the classifier $\hat{f}_w$ under a different weighting $w'$. Observe that we can decompose the excess risk as
Unsurprisingly, we see that a constant, or ``irreducible,'' error term appears in equation (1). Then, we see the irreducible error is given by the measure of the subset of $\mathcal{X}$ where $\eta$ lies between the thresholds $t_w$ and $t_{w'}$. Given that we know the Bayes optimal classifier for any weighting, we observe that the irreducible error can be upper bounded by a term proportional to the product of the measure of the region where $\eta$ lies between $t_w$ and $t_{w'}$, and the difference between the thresholds themselves. We state this formally in the following proposition.
Let and . The irreducible error satisfies the bound
A visualization is given in Figure 1. Now, we turn to analyzing the estimation error. The result is in many ways similar to Proposition 3.1, but an additional term appears due to the decision threshold for $w$ differing from that of the risk measurement $w'$. For any estimator $\hat{\eta}$ of the regression function, the estimation error satisfies