Calibrated Surrogate Maximization of Linear-fractional Utility in Binary Classification

05/29/2019, by Han Bao et al.

Complex classification performance metrics such as the F_β-measure and the Jaccard index are often used to handle class imbalance in applications such as information retrieval and image segmentation. These performance metrics are not decomposable, that is, they cannot be expressed in a per-example manner, which hinders a straightforward application of the M-estimation widely used in supervised learning. In this paper, we consider linear-fractional metrics, a family of classification performance metrics that encompasses many standard metrics such as the F_β-measure and the Jaccard index, and propose methods to directly maximize performance under those metrics. A key to their direct optimization is a calibrated surrogate utility, which is a tractable lower bound of the true utility function representing a given metric. We characterize sufficient conditions under which the surrogate maximization coincides with the maximization of the true utility. To the best of our knowledge, this is the first surrogate calibration analysis for the linear-fractional metrics. We also propose gradient-based optimization algorithms and show their practical usefulness in experiments.


1 Introduction

Binary classification, one of the main focuses in machine learning, is the problem of predicting binary responses for input covariates. Classifiers are usually evaluated by the classification accuracy, which is the expected proportion of correct predictions. Since the accuracy cannot evaluate classifiers appropriately under class imbalance [27] or in the presence of label noise [28], alternative performance metrics have been employed, such as the F-measure [45, 21, 30, 22], the Jaccard index [22, 5], and the balanced error rate (BER) [8, 27, 28, 11]. Once a performance metric is given, it is a natural strategy to optimize the performance of classifiers directly under that metric. However, the alternative performance metrics are generally difficult to optimize directly because they are non-decomposable: a per-example loss decomposition is unavailable. In other words, the M-estimation procedure [44, 42] cannot be applied, and neither can empirical risk minimization (ERM) [46], which makes the optimization of non-decomposable metrics hard.
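To make non-decomposability concrete, the following minimal sketch (plain NumPy; the arrays and names are ours, for illustration only) contrasts the accuracy, which is a mean of per-example scores, with the F_1-measure, which couples all examples through the confusion matrix and hence admits no per-example decomposition:

```python
import numpy as np

y_true = np.array([+1, +1, -1, -1, -1])
y_pred = np.array([+1, -1, +1, -1, -1])

# Accuracy is decomposable: a mean of per-example 0/1 scores.
accuracy = np.mean(y_true == y_pred)

# The F1-measure is not: it is a ratio of sums over the whole sample,
# not a sum of per-example ratios.
tp = np.sum((y_pred == +1) & (y_true == +1))
fp = np.sum((y_pred == +1) & (y_true == -1))
fn = np.sum((y_pred == -1) & (y_true == +1))
f1 = 2 * tp / (2 * tp + fp + fn)

print(accuracy, f1)  # 0.6 0.5
```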

One of the earliest works on direct optimization [22] generalizes performance metrics into the linear-fractional metrics, which take a linear-fractional form of the entries of the confusion matrix and encompass the BER, F-measure, Jaccard index, Gower-Legendre index [18, 34], and weighted accuracy [22]. Koyejo et al. [22] formulated the optimization problem in two ways. One is a plug-in rule [22, 31, 47], which estimates the class-posterior probability and its optimal threshold; the other is an iterative weighted ERM approach [22, 36], which searches for a good cost with which the cost-sensitive risk [39] minimizer achieves higher utility. Although both are consistent, the former suffers from high sample complexity due to the class-posterior probability estimation, while the latter is computationally demanding because of the iterative classifier training.

Our goal is to seek computationally more efficient procedures to directly optimize the linear-fractional metrics, without sacrificing consistency. We provide a novel calibrated surrogate utility, which is a tractable lower bound of the true utility representing the metric of our interest. We derive sufficient conditions for surrogate calibration, under which the surrogate maximization implies the maximization of the true utility. Then, we give model-agnostic optimization algorithms for the surrogate utility. Noting that the gradient direction of the surrogate utility can be estimated with U-statistics [43], we apply optimization methods using gradients, such as quasi-Newton methods. We show their consistency based on the theory of Z-estimation [43]. The overview is illustrated in Fig. 1.

Contributions: (i) Surrogate calibration (Sec. 3): We propose a tractable lower bound of the linear-fractional metrics together with calibration conditions, which guarantee that the surrogate maximization implies the maximization of the true utility. This approach is model-agnostic, unlike many previous approaches [22, 31, 32, 47]. (ii) Gradient-based optimization (Sec. 4.1): The surrogate utility is amenable to gradient-based optimization because its gradient direction can easily be estimated in an unbiased manner. Thus, gradient ascent and quasi-Newton methods can be applied. (iii) Consistency analysis (Sec. 4.2): The estimator obtained via the surrogate maximization with a finite sample is shown to be consistent with the maximizer of the expected utility.

Figure 1: Overview of this work. Z-estimation (§4.1) solves an estimating equation whose solution is consistent (§4.2) with the surrogate maximization (§3.1), which is in turn calibrated (§3.2) to the maximization of the true utility. Intuitively, we can obtain the utility maximizer by solving the estimating equation.

2 Preliminaries

Throughout this work, we focus on binary classification. Let $\mathbb{1}\{A\} := 1$ if the predicate $A$ holds and $0$ otherwise. Let $\mathcal{X} \subseteq \mathbb{R}^d$ be a feature space and $\mathcal{Y} := \{+1, -1\}$ be the label space. We assume that a sample $\{(x_i, y_i)\}_{i=1}^n$ independently follows the joint distribution over $\mathcal{X} \times \mathcal{Y}$ with a density $p(x, y)$. An expectation with respect to $X$ is written as $\mathbb{E}_X[g(X)]$ for a function $g$, where $X$ follows the $x$-marginal distribution $p(x)$. A classifier is given as a function $f: \mathcal{X} \to \mathbb{R}$, where $\mathrm{sign}(f(x))$ determines predictions. Here we adopt the convention $\mathrm{sign}(0) := +1$. Let $\mathcal{F}$ be a hypothesis set of classifiers. Let $\pi := p(y = +1)$ and $\eta(x) := p(y = +1 \mid x)$ be the class-prior and class-posterior probabilities, respectively. The 0/1-loss is denoted as $\ell_{01}(m) := \mathbb{1}\{m \le 0\}$, while $\phi$ denotes a surrogate loss. The norm $\|\cdot\|$ without a subscript is the $\ell_2$-norm.

In general, the following four quantities are the focal targets in binary classification: the true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).

Definition 1 (Confusion matrix).

Given a classifier $f$ and a distribution with density $p$, its confusion matrix is defined as $C(f) := \begin{pmatrix} \mathrm{TP}(f) & \mathrm{FN}(f) \\ \mathrm{FP}(f) & \mathrm{TN}(f) \end{pmatrix}$, where

$$\mathrm{TP}(f) := \mathbb{E}_X[\eta(X)\,\mathbb{1}\{f(X) > 0\}], \quad \mathrm{FN}(f) := \mathbb{E}_X[\eta(X)\,\mathbb{1}\{f(X) \le 0\}],$$
$$\mathrm{FP}(f) := \mathbb{E}_X[(1-\eta(X))\,\mathbb{1}\{f(X) > 0\}], \quad \mathrm{TN}(f) := \mathbb{E}_X[(1-\eta(X))\,\mathbb{1}\{f(X) \le 0\}].$$

TP and FN, as well as FP and TN, can be transformed into each other: $\mathrm{TP}(f) + \mathrm{FN}(f) = \pi$ and $\mathrm{FP}(f) + \mathrm{TN}(f) = 1 - \pi$. They can be expressed with $\eta$ and $\ell_{01}$, such as $\mathrm{TP}(f) = \mathbb{E}_X[\eta(X)(1 - \ell_{01}(f(X)))]$. The goal of binary classification is to obtain a classifier that maximizes TP and TN while keeping FP and FN as low as possible. Classifiers are evaluated by performance metrics that trade off the entries of the confusion matrix. Performance metrics need to be chosen based on our requirements on the confusion matrix [40, 28]. In this work, we focus on the following family of utilities representing the linear-fractional metrics.
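As a concrete reference point, here is a minimal NumPy sketch (function and variable names are ours) of the empirical counterparts of Definition 1, where every expectation becomes a sample mean so that the four entries sum to one:

```python
import numpy as np

def empirical_confusion(scores, y):
    """Empirical TP/FN/FP/TN as probabilities (sample means), per Definition 1.

    scores: real-valued outputs f(x_i); predictions are sign(f(x)),
            with sign(0) treated as +1 to match the convention above.
    y: labels in {+1, -1}.
    """
    pred = np.where(scores >= 0, 1, -1)
    tp = np.mean((pred == 1) & (y == 1))
    fn = np.mean((pred == -1) & (y == 1))
    fp = np.mean((pred == 1) & (y == -1))
    tn = np.mean((pred == -1) & (y == -1))
    return tp, fn, fp, tn  # tp + fn ~ pi,  fp + tn ~ 1 - pi
```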

Definition 2 (Linear-fractional utility).

A linear-fractional utility is

$$U(f) := \frac{a_0 + W_1(f)}{b_0 + W_2(f)}, \quad (1)$$

where $W_1$ and $W_2$ are class-conditional score functions given as

$$W_k(f) := \mathbb{E}_X\big[\eta(X)\,w_{k,+}\,\ell_{01}(f(X)) + (1 - \eta(X))\,w_{k,-}\,\ell_{01}(-f(X))\big] \quad (k = 1, 2),$$

and $a_0$, $b_0$, and $w_{k,\pm}$ are constants such that the denominator is positive ($b_0 + W_2(f) > 0$ for all $f$).

The class-conditional score functions correspond to a linear transformation of $\mathrm{TP}(f)$ and $\mathrm{TN}(f)$:

$$W_k(f) = w_{k,+}\,\big(\pi - \mathrm{TP}(f)\big) + w_{k,-}\,\big(1 - \pi - \mathrm{TN}(f)\big).$$

Examples of $U$ are shown in Table 1.

Metric                     | Definition
F_β-measure [45]           | $\frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}}$
Jaccard index [20]         | $\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}$
Gower-Legendre index [18]  | $\frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \alpha(\mathrm{FN} + \mathrm{FP}) + \mathrm{TN}}$
Table 1: Examples of the linear-fractional performance metrics. $\beta > 0$ is a trade-off parameter for the F-measure, while $\alpha > 0$ is for the Gower-Legendre index.
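For concreteness, a small sketch (ours) that evaluates the three metrics of Table 1 from confusion-matrix entries:

```python
def f_beta(tp, fn, fp, beta=1.0):
    # F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP)
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

def jaccard(tp, fn, fp):
    # J = TP / (TP + FP + FN)
    return tp / (tp + fp + fn)

def gower_legendre(tp, fn, fp, tn, alpha=1.0):
    # GL = (TP + TN) / (TP + alpha (FN + FP) + TN); alpha = 1 recovers accuracy
    return (tp + tn) / (tp + alpha * (fn + fp) + tn)
```

With the probabilities returned by empirical_confusion above, gower_legendre(..., alpha=1.0) reduces to the accuracy, illustrating how the family interpolates between standard metrics.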

Given a utility function $U$, our goal is to obtain a classifier that maximizes $U$:

$$\max_{f \in \mathcal{F}}\; U(f). \quad (2)$$

Traditional Supervised Classification: Here, we briefly review the traditional procedure for supervised classification [46]. The aim is to obtain a classifier with high accuracy, which corresponds to minimizing the classification risk $R(f) := \mathbb{E}[\ell_{01}(Y f(X))]$. Since optimizing the 0/1-loss directly is a computationally infeasible problem [4, 16], it is common practice to instead minimize a surrogate risk $R_\phi(f) := \mathbb{E}[\phi(Y f(X))]$, where $\phi$ is a surrogate loss. If $\phi$ is a classification-calibrated loss [3], it is known that minimizing $R_\phi$ corresponds to minimizing $R$. Eventually, what we actually minimize is the empirical (surrogate) risk $\widehat{R}_\phi(f) := \frac{1}{n} \sum_{i=1}^n \phi(y_i f(x_i))$. The empirical risk $\widehat{R}_\phi(f)$ is an unbiased estimator of the true risk $R_\phi(f)$ for a fixed $f$, and the uniform law of large numbers guarantees that $\widehat{R}_\phi(f)$ converges to $R_\phi(f)$ for any $f \in \mathcal{F}$ in probability [46, 42, 29]. This strategy to minimize $\widehat{R}_\phi$ is called empirical risk minimization (ERM).
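As a reference implementation of this classical pipeline, here is a minimal ERM sketch (ours, for illustration) for a linear model with the logistic surrogate, which is classification-calibrated; the loss is scaled so that $\phi(0) = 1$, making it an upper bound of the 0/1-loss:

```python
import numpy as np

def logistic_loss(m):
    # scaled logistic loss: phi(m) = log(1 + exp(-m)) / log(2), so phi(0) = 1
    return np.logaddexp(0.0, -m) / np.log(2)

def empirical_surrogate_risk(theta, X, y):
    # hat{R}_phi(f_theta) = (1/n) sum_i phi(y_i <theta, x_i>)
    return logistic_loss(y * (X @ theta)).mean()

def erm_gradient_step(theta, X, y, lr=0.1):
    m = y * (X @ theta)
    # d phi / d m = -sigmoid(-m) / log(2); chain rule through m_i = y_i <theta, x_i>
    grad_m = -1.0 / (1.0 + np.exp(m)) / np.log(2)
    grad = ((grad_m * y)[:, None] * X).mean(axis=0)
    return theta - lr * grad
```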

The traditional ERM is devoted to maximizing the accuracy, which is not necessarily suitable when another metric is used for evaluation. Our aim is to give an alternative procedure to maximize $U$ directly as in Eq. (2). Here, we have two questions: (i) how to construct an alternative utility to $U$ that is easier to optimize (as $R_\phi$ is for $R$), and (ii) how to incorporate a sample to optimize the surrogate utility (as $\widehat{R}_\phi$ does for $R_\phi$). We give answers to them in Secs. 3 and 4, respectively.

3 Surrogate Utility and Calibration Analysis

The true utility $U$ in Eq. (1) consists of the 0/1-loss $\ell_{01}$, which is difficult to optimize. In this section, we introduce an alternative utility in order to make the optimization problem in Eq. (2) easier.

3.1 Surrogate Utility: Tractable Objective for Linear-fractional Utility

Assume that we are given a surrogate loss $\phi$. We impose two postulates on an alternative utility $U_\phi$: the alternative utility should lower-bound the true utility $U$, and TP / FP should become larger / smaller as a result of optimization, respectively. We realize these by constructing surrogate class-conditional score functions $W_{1,\phi}$ and $W_{2,\phi}$ as follows:

$$W_{k,\phi}(f) := \mathbb{E}_X\big[\eta(X)\,w_{k,+}\,\phi(f(X)) + (1 - \eta(X))\,w_{k,-}\,\phi(-f(X))\big] \quad (k = 1, 2). \quad (3)$$

We often abbreviate $W_{k,\phi}$ as $W_k$ if it is clear from the context. Given the surrogate class-conditional scores, define the surrogate utility as follows:

$$U_\phi(f) := \frac{a_0 + W_{1,\phi}(f)}{b_0 + W_{2,\phi}(f)}. \quad (4)$$

To construct $U_\phi$, the 0/1-losses appearing in $U$ are substituted with the surrogate loss $\phi$. The surrogate class-conditional scores in (3) are designed so that the surrogate utility in (4) satisfies the above postulates.

Lemma 3.

For all $f$ and any surrogate loss $\phi$ such that $\phi(m) \ge \ell_{01}(m)$ for all $m$, we have $U_\phi(f) \le U(f)$.

Lemma 3 is clear from the assumption that $\phi$ upper-bounds $\ell_{01}$. Due to this property, maximizing $U_\phi$ is at least maximizing a lower bound of $U$. Immediately, $\sup_f U_\phi(f) \le \sup_f U(f)$ for any such $\phi$. Throughout the paper, we assume that $U_\phi$ is Fréchet differentiable.

3.2 Calibration Analysis: Bridging Surrogate Utility and True Utility

Figure 2: An example of a $\delta$-discrepant loss: it upper-bounds the 0/1-loss and is steeper for $m \le 0$ than for $m > 0$.

Given the surrogate utility $U_\phi$, a natural question arises in the same way as for classification calibration in binary classification [49, 3]: does maximizing the surrogate utility $U_\phi$ imply maximizing the true utility $U$? In this section, we study sufficient conditions on the surrogate loss that connect the maximization of $U_\phi$ with the maximization of $U$. All proofs in this section are deferred to App. A.

First, we define the notion of $U$-calibration.

Definition 4 ($U$-calibration).

The surrogate utility $U_\phi$ is said to be $U$-calibrated if, for any sequence of measurable functions $\{f_j\}_{j \ge 1}$ and any distribution, it holds that $U(f_j) \to U^*$ whenever $U_\phi(f_j) \to U_\phi^*$, where $U^* := \sup_f U(f)$ and $U_\phi^* := \sup_f U_\phi(f)$ are the suprema taken over all measurable functions.

This definition is motivated by calibration in other learning problems such as binary classification [3, Theorem 3], multi-class classification [50, Theorem 3], structured prediction [35, Theorem 2], and AUC optimization [17, Definition 1]. If a surrogate utility is $U$-calibrated, we may safely optimize the surrogate utility $U_\phi$ instead of the true utility $U$. Note that $U$-calibration is a concept that reduces the surrogate maximization to the maximization of $U$ within all measurable functions. The approximation error of the hypothesis set $\mathcal{F}$ is not the target of our analysis [3].

Next, we give a property of loss functions that is needed to guarantee $U$-calibration.

Definition 5 ($\delta$-discrepant loss).

For a fixed $\delta \in (0, 1]$, a non-increasing loss function $\phi$ is said to be $\delta$-discrepant if $\phi$ satisfies $\phi'(0^-) \le \phi'(0^+)/\delta < 0$, where $\phi'(0^-)$ and $\phi'(0^+)$ denote the left and right derivatives of $\phi$ at the origin.

Intuitively, $\delta$-discrepancy means that the gradient of $\phi$ around the origin is steeper in the negative domain than in the positive domain (see Figure 2). The loss is no longer discrepant if it is $1$-discrepant. $\delta$-discrepancy is an important property to guarantee $U$-calibration.
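The exact formula of the loss in Figure 2 did not survive extraction, so the following piecewise-linear construction is only an illustrative stand-in (ours): it is non-increasing, upper-bounds the 0/1-loss, and has slope $-1/\delta$ on the negative side versus $-1$ on the positive side, so the negative side is steeper whenever $\delta < 1$:

```python
import numpy as np

def delta_discrepant_loss(m, delta=0.5):
    """Piecewise-linear, non-increasing upper bound of the 0/1-loss.

    Slope is -1/delta for m <= 0 and -1 for 0 < m < 1 (flat afterwards),
    so the negative side is steeper whenever delta < 1, matching the
    intuition behind Definition 5. Illustrative stand-in for Figure 2.
    """
    m = np.asarray(m, dtype=float)
    return np.where(m <= 0, 1.0 - m / delta, np.maximum(0.0, 1.0 - m))
```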

Below, we examine calibration properties for two specific linear-fractional metrics: the F-measure and the Jaccard index. Those calibration analyses can be extended to general linear-fractional utilities; the extension is deferred to App. A.4.

F-measure: The F-measure is widely used especially in the field of information retrieval, where relevant items are rare [25]. Since it is defined as the weighted harmonic mean of the precision and recall (see Table 1), its optimization is difficult in general. While much previous work has tried to optimize it directly in the context of class-posterior probability estimation [30, 22, 47] or iterative cost-sensitive learning [22, 36], we show that there exists a calibrated surrogate utility that can be used for direct optimization as well.

For the F-measure, define the true utility $U_F$ and the surrogate utility $U_{F,\phi}$ as

$$U_F(f) := \frac{(1+\beta^2)\big(\pi - \mathbb{E}_X[\eta\,\ell_{01}(f)]\big)}{(1+\beta^2)\pi - \mathbb{E}_X[\eta\,\ell_{01}(f)] + \mathbb{E}_X[(1-\eta)\,\ell_{01}(-f)]}, \qquad U_{F,\phi}(f) := \frac{(1+\beta^2)\big(\pi - \mathbb{E}_X[\eta\,\phi(f)]\big)}{(1+\beta^2)\pi - \mathbb{E}_X[\eta\,\phi(f)] + \mathbb{E}_X[(1-\eta)\,\phi(-f)]}.$$

As for $U_{F,\phi}$, we have the following F-calibration guarantee, where $\bar{\delta}_F$ denotes the upper limit of the admissible discrepancy parameter, a constant determined by $\beta$ and $\pi$.

Theorem 6 (F-calibration).

Assume that a surrogate loss $\phi$ is non-increasing and differentiable almost everywhere, that $\phi(m) \ge \ell_{01}(m)$ for all $m$, and that $\phi$ is $\delta$-discrepant for some constant $\delta \in (0, \bar{\delta}_F]$. Then, $U_{F,\phi}$ is F-calibrated.

An example of a $\delta$-discrepant surrogate loss is shown in Figure 2. Here, $\delta$ is a discrepancy hyperparameter, which ranges over $(0, \bar{\delta}_F]$. We may determine $\delta$ by cross-validation, or fix it by assuming a value of the class-prior $\pi$.
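To make the object we maximize concrete, here is a sketch (ours, following the form of $U_{F,\phi}$ given above) of its empirical version, where the class-conditional expectations are replaced by sample means:

```python
import numpy as np

def empirical_surrogate_f_utility(theta, X, y, phi, beta=1.0):
    """Empirical counterpart of the surrogate F-utility U_{F,phi}.

    phi: surrogate loss, e.g., delta_discrepant_loss above.
    All population expectations are replaced by sample means.
    """
    n = len(y)
    f = X @ theta
    pi_hat = np.mean(y == 1)
    s_pos = phi(f[y == 1]).sum() / n    # estimates E_X[eta * phi(f)]
    s_neg = phi(-f[y == -1]).sum() / n  # estimates E_X[(1 - eta) * phi(-f)]
    b2 = beta ** 2
    num = (1 + b2) * (pi_hat - s_pos)
    den = (1 + b2) * pi_hat - s_pos + s_neg
    return num / den
```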

Jaccard Index: The Jaccard index, also referred to as the intersection over union (IoU), is a metric of similarity between two sets: for two sets $A$ and $B$, it is defined as $J(A, B) := |A \cap B| / |A \cup B|$. If we measure the similarity between the set of samples predicted as positive and the set of samples labeled as positive, the Jaccard index becomes $\mathrm{TP} / (\mathrm{TP} + \mathrm{FP} + \mathrm{FN})$, as shown in Table 1. This measure is not only used for measuring the performance of binary classification [22, 32], but also for semantic segmentation [15, 12, 1, 5].
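In the segmentation setting, for example, the set view reduces to the familiar IoU of two boolean masks; a short sketch (ours):

```python
import numpy as np

def iou(mask_a, mask_b):
    # Jaccard index of two boolean masks: |A ∩ B| / |A ∪ B|
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return inter / union
```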

For the Jaccard index, define the true utility $U_J$ and the surrogate utility $U_{J,\phi}$ as

$$U_J(f) := \frac{\pi - \mathbb{E}_X[\eta\,\ell_{01}(f)]}{\pi + \mathbb{E}_X[(1-\eta)\,\ell_{01}(-f)]}, \qquad U_{J,\phi}(f) := \frac{\pi - \mathbb{E}_X[\eta\,\phi(f)]}{\pi + \mathbb{E}_X[(1-\eta)\,\phi(-f)]}.$$

As for $U_{J,\phi}$, we have the following Jaccard-calibration guarantee, where $\bar{\delta}_J$ denotes the upper limit of the admissible discrepancy parameter.

Theorem 7 (Jaccard-calibration).

Assume that a surrogate loss $\phi$ is non-increasing and differentiable almost everywhere, that $\phi(m) \ge \ell_{01}(m)$ for all $m$, and that $\phi$ is $\delta$-discrepant for some constant $\delta \in (0, \bar{\delta}_J]$. Then, $U_{J,\phi}$ is Jaccard-calibrated.

Theorem 7 also relies on $\delta$-discrepancy, as does Theorem 6. Thus, the loss shown in Figure 2 can also be used in the Jaccard case with a certain range of $\delta$. In the same manner as for the F-measure, the hyperparameter $\delta$ ranges over $(0, \bar{\delta}_J]$, and we may either determine it by cross-validation or fix it to a certain value.

Remark: The $\delta$-discrepancy is a technical assumption making stationary points of $U_\phi$ lie in the Bayes optimal set of $U$. This is a mere sufficient condition for $U$-calibration, while classification-calibration [3] is the necessary and sufficient condition for the accuracy. We give the surrogate calibration conditions for the accuracy in App. A.3. It is left as an open problem to seek necessary conditions for the general linear-fractional case.

4 Optimization with Unbiased Gradient Direction Estimator

Input: initial classifier parameter $\theta_0$
1 repeat
2     compute the gradient direction estimator $\hat{g}(\theta_t)$ by Eq. (5)
3     choose a step size $\gamma_t$ by backtracking line search (Armijo condition)
4     $\theta_{t+1} \leftarrow \theta_t + \gamma_t\,\hat{g}(\theta_t)$
until stopping criterion is satisfied
Output: learned classifier parameter
Algorithm 1: Gradient Ascent with $\hat{g}$

Input: initial classifier parameter $\theta_0$, initial Hessian approximator $B_0$
1 repeat
2     compute $\hat{g}(\theta_t)$ by Eq. (5) and the ascent direction $d_t := B_t^{-1}\hat{g}(\theta_t)$
3     choose a step size $\gamma_t$ by backtracking line search (Wolfe conditions)
4     $\theta_{t+1} \leftarrow \theta_t + \gamma_t\,d_t$; update $B_{t+1}$ by the BFGS rule
until stopping criterion is satisfied
Output: learned classifier parameter
Algorithm 2: BFGS with $\hat{g}$

In this section, we propose algorithms to optimize the surrogate utility, and analyze the consistency of the finite-sample maximizer.

4.1 Gradient Direction Estimator as U-statistics

Now, the surrogate utility $U_\phi$ is a calibrated and differentiable alternative to $U$, and gradient-based optimization can be applied. For a classifier $f_\theta$ parametrized by $\theta$, write $V(\theta) := a_0 + W_{1,\phi}(f_\theta)$ and $W(\theta) := b_0 + W_{2,\phi}(f_\theta)$. Under a certain regularity condition allowing the interchange of expectation and derivative, the gradient can be computed as

$$\nabla_\theta U_\phi(f_\theta) = \frac{\nabla_\theta V(\theta)\,W(\theta) - V(\theta)\,\nabla_\theta W(\theta)}{W(\theta)^2},$$

from which we can see that the gradient direction is dominated by the numerator $g(\theta) := \nabla_\theta V(\theta)\,W(\theta) - V(\theta)\,\nabla_\theta W(\theta)$, since the denominator $W(\theta)^2$ is positive. In comparison with estimating $U_\phi$ itself, estimating $g(\theta)$ is straightforward following the idea of U-statistics [43]:

$$\hat{g}(\theta) := \frac{1}{n(n-1)} \sum_{i \ne j} \big[\nabla_\theta v_\theta(z_i)\,w_\theta(z_j) - v_\theta(z_i)\,\nabla_\theta w_\theta(z_j)\big], \quad (5)$$

where $z_i := (x_i, y_i)$, and $v_\theta$ and $w_\theta$ are per-example scores satisfying $\mathbb{E}[v_\theta(Z)] = V(\theta)$ and $\mathbb{E}[w_\theta(Z)] = W(\theta)$:

$$v_\theta(x, y) := a_0 + \mathbb{1}\{y = +1\}\,w_{1,+}\,\phi(f_\theta(x)) + \mathbb{1}\{y = -1\}\,w_{1,-}\,\phi(-f_\theta(x)),$$

and analogously for $w_\theta$ with $(b_0, w_{2,\pm})$. The gradient direction estimator in Eq. (5) can be regarded as a second-order U-statistic, and it is known to be unbiased for $g(\theta)$ [19].

Once we have the estimator $\hat{g}$, optimization procedures that only need gradients, such as gradient ascent and quasi-Newton methods [7], can be applied to maximize $U_\phi$, because they only require gradients up to positive constants. Algorithms 1 and 2 are extensions of the traditional gradient ascent and BFGS, respectively, obtained by plugging $\hat{g}$ into them. For line search methods, we use the backtracking line search [7] with the Armijo condition for Algorithm 1 and the Wolfe conditions for Algorithm 2.¹

¹ The Armijo condition needs oracle access to the objective, for which we use an empirical estimate of $U_\phi$ as a proxy. We use $\hat{g}$ as a proxy for the gradient in the curvature condition as well.
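The following sketch (ours) instantiates Eq. (5) for the F-measure surrogate with a linear model, computing the $O(n^2)$ pairwise sum in closed form via totals and diagonal corrections; phi and dphi are the surrogate loss and its (a.e.) derivative, e.g., the piecewise-linear stand-in above:

```python
import numpy as np

def dphi_stand_in(m, delta=0.5):
    # a.e. derivative of delta_discrepant_loss:
    # -1/delta for m <= 0, -1 on (0, 1), and 0 afterwards
    m = np.asarray(m, dtype=float)
    return np.where(m <= 0, -1.0 / delta, np.where(m < 1.0, -1.0, 0.0))

def grad_direction_estimator(theta, X, y, phi, dphi, beta=1.0):
    """Second-order U-statistic estimator of the gradient direction, Eq. (5),
    instantiated for the F-measure surrogate with f_theta(x) = <theta, x>.
    """
    n = len(y)
    f = X @ theta
    pos = (y == 1).astype(float)
    neg = 1.0 - pos
    b2 = beta ** 2

    # per-example scores whose means estimate the numerator V and denominator W
    v = (1 + b2) * pos * (1.0 - phi(f))
    w = b2 * pos + pos * (1.0 - phi(f)) + neg * phi(-f)

    # per-example gradients of v and w w.r.t. theta (one row per example)
    dv = (-(1 + b2) * pos * dphi(f))[:, None] * X
    dw = (-pos * dphi(f) - neg * dphi(-f))[:, None] * X

    # sum_{i != j} [dv_i * w_j - v_i * dw_j], computed without a double loop
    g = (dv.sum(axis=0) * w.sum() - (dv * w[:, None]).sum(axis=0)
         - v.sum() * dw.sum(axis=0) + (v[:, None] * dw).sum(axis=0))
    return g / (n * (n - 1))
```

A plain gradient-ascent step is then theta = theta + lr * grad_direction_estimator(theta, X, y, delta_discrepant_loss, dphi_stand_in); since $\hat{g}$ matches the true gradient only up to the positive factor $1/W(\theta)^2$, ascending along it still ascends $U_\phi$.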

4.2 Consistency Analysis: Bridging Finite Sample and Asymptotics

In this subsection, we analyze statistical properties of the estimator in Eq. (5). To keep the analysis simple, the linear-in-input model $f_\theta(x) := \theta^\top x$ is considered throughout this subsection, where $\theta \in \Theta$ is a classifier parameter and $\Theta \subseteq \mathbb{R}^d$ is a compact parameter space. The maximization procedure introduced above can naturally be seen as Z-estimation [43], which is an estimation procedure that solves an estimating equation. In our case, the maximization of $U_\phi$ is reduced to a Z-estimation problem that solves the system $\hat{g}(\theta) = 0$. The first lemma shows that the derivative estimator admits uniform convergence. Its proof is deferred to App. B.

Lemma 8 (Uniform convergence).

For $z = (x, y)$, let $v_\theta(z)$ and $w_\theta(z)$ be the per-example scores defined above. Assume that $v_\theta$, $w_\theta$, and their gradients are Lipschitz continuous in $\theta$ with finite Lipschitz constants, and that they are uniformly bounded by finite constants. In addition, assume that $\phi$ is smooth in the positive domain and smooth in the negative domain, with finite (possibly different) smoothness constants. Then,

$$\sup_{\theta \in \Theta}\,\big\|\hat{g}(\theta) - g(\theta)\big\| = \mathcal{O}_p\!\left(\frac{1}{\sqrt{n}}\right), \quad (6)$$

where $\mathcal{O}_p$ denotes the order in probability.

The Lipschitz continuity and smoothness assumptions in Lemma 8 can be satisfied if the surrogate loss satisfies certain Lipschitzness and smoothness. Note that Lemma 8 still holds for $\delta$-discrepant surrogates, since we allow surrogates to have different smoothness parameters in the positive and negative domains. Lemma 8 is the basis for showing consistency. Let $\theta^*$ be a zero of $g$, and let $\hat{\theta}_n$ be an estimator defined by $\hat{g}(\hat{\theta}_n) = 0$. Under the identifiability assumption below, $\theta^*$ and $\hat{\theta}_n$ are roots of $g$ and $\hat{g}$, respectively. Then, we can show the consistency of $\hat{\theta}_n$.

Theorem 9 (Consistency).

Assume that $\theta^*$ is identifiable, that is, $\inf_{\theta\,:\,\|\theta - \theta^*\| \ge \epsilon} \|g(\theta)\| > 0 = \|g(\theta^*)\|$ for all $\epsilon > 0$, and that Eq. (6) holds for $\hat{g}$. Then, $\hat{\theta}_n \to \theta^*$ in probability.

Theorem 9 is a corollary of van der Vaart [43, Theorem 5.9], given the uniform convergence (Lemma 8) and the identifiability assumption. Note that identifiability assumes that $g$ has a unique zero $\theta^*$, which is also usual in M-estimation: the global optimizer is assumed identifiable. Though this is not a mild assumption in the non-convex case, further analysis is beyond our scope.

5 Related Work

(i) Surrogate optimization: One of the earliest attempts to optimize non-decomposable performance metrics dates back to Joachims [21], who formulated the problem as a structured SVM with a surrogate objective. However, Dembczyński et al. [13] showed that this surrogate is inconsistent, meaning that the surrogate maximization does not necessarily imply the maximization of the true metric. Later, Yu and Blaschko [48], Eban et al. [14], and Berman et al. [5] tried different surrogates, but their calibration has not been studied yet.

(ii) Plug-in rule: Instead of surrogate optimization, Dembczyński et al. [13] note that a plug-in rule is consistent, where the class-posterior probability $\eta$ and a threshold parameter are estimated independently. We can estimate $\eta$ by minimizing strictly proper losses [37]. The plug-in rule has been investigated in many settings [30, 13, 22, 31, 9, 47]. One of the problems of the plug-in rule is that it requires an accurate estimate of $\eta$, which is less sample-efficient than usual classification with convex surrogates [6, 41]. Moreover, the threshold parameter heavily relies on the estimate of $\eta$.

(iii) Cost-sensitive risk minimization: On the other hand, Parambath et al. [36] is a pioneering work that focuses on the pseudo-linearity of the metrics, which reduces their maximization to alternating optimization with respect to a classifier and a sublevel parameter. This can be formulated as iterative cost-sensitive risk minimization [22, 32, 33, 38]. Though these methods are blessed with consistency, they need to train classifiers many times, which may lead to high computational cost, especially for complex hypothesis sets.

Remark: Our proposed methods can be considered to belong to family (i), while one crucial difference is that we have a calibration guarantee. We need neither estimate the class-posterior probability as in (ii) nor train classifiers many times as in (iii).

6 Experiments

Figure 3: Convergence comparison of the F-measure (left two figures) and the Jaccard index (right two figures). Standard errors over 50 trials are shown as shaded areas.

Figure 4: Relationship between the test F-measure (left two figures) / Jaccard index (right two figures) and the sample size (horizontal axes). Standard errors over 50 trials are shown as shaded areas.
(F-measure)
Dataset      U-GD        U-BFGS       ERM          W-ERM        Plug-in
cod-rna      0.501 (6)   0.890 (3)    0.002 (1)    0.500 (3)    0.698 (42)
ionosphere   0.829 (52)  0.913 (45)   0.904 (37)   0.848 (249)  0.902 (45)
sonar        0.720 (93)  0.744 (99)   0.687 (181)  0.653 (228)  0.734 (111)
w8a          0.156 (19)  0.251 (14)   0.062 (21)   0.397 (26)   0.489 (30)

(Jaccard index)
Dataset      U-GD        U-BFGS       ERM          W-ERM        Plug-in
cod-rna      0.326 (3)   0.866 (3)    0.001 (0)    0.324 (3)    0.537 (50)
ionosphere   0.815 (66)  0.827 (70)   0.827 (61)   0.763 (256)  0.824 (74)
sonar        0.617 (99)  0.598 (145)  0.537 (198)  0.503 (217)  0.586 (134)
w8a          0.323 (19)  0.486 (39)   0.032 (11)   0.248 (21)   0.324 (27)

Table 2: Benchmark results: 50 trials are conducted for each pair of method and dataset. U-GD and U-BFGS are the proposed methods; ERM, W-ERM, and Plug-in are baselines. Standard errors (multiplied by a power of ten) are shown in parentheses. Boldface indicates the outperforming methods, chosen by a one-sided t-test at the 5% significance level.

In this section, we examine the empirical performance of the surrogate optimization (Algorithms 1 and 2). Details of the datasets, baselines, and full experimental results are shown in App. C.

Implementation details of proposed methods: The linear-in-input model $f_\theta(x) = \theta^\top x$ is used for the hypothesis set. For the initializer of $\theta$, the ERM minimizer trained by SVM is used. For both Algorithms 1 and 2, gradient updates are iterated 100 times. Algorithms 1 and 2 are referred to as U-GD and U-BFGS below, respectively. The surrogate loss shown in Figure 2 is used, with different slopes for $m \le 0$ and $m > 0$; the discrepancy parameter $\delta$ is fixed in the F-measure case and in the Jaccard index case to values slightly smaller than the upper limits $\bar{\delta}_F$ and $\bar{\delta}_J$ of their admissible ranges.² In App. C.6, we study the sensitivity of the performance to $\delta$.

² The discrepancy parameter should be chosen within $(0, \bar{\delta}_F]$ and $(0, \bar{\delta}_J]$ for the F-measure and Jaccard index, respectively.

Convergence Comparison: We compare the convergence behaviors of U-GD and U-BFGS. In this experiment, we ran them for 300 iterations from randomly drawn initial parameters. The results are summarized in Figure 3. As we expect, U-BFGS converges much faster in most of the cases, within up to 30 iterations. Note that U-BFGS and U-GD are in a trade-off relationship: the former converges within fewer steps, while the latter computes each update faster.

Performance Comparison with Benchmark Data: We compare the proposed methods with the baselines. The results for the F-measure and the Jaccard index are summarized in Table 2, from which we can see better, or at least competitive, performance of the proposed methods.

Sample Complexity: We empirically study the relationship between the performance and the sample size. We randomly subsample each original dataset to reduce the sample size, and train all methods on the reduced samples. The experimental results are shown in Figure 4. Overall, U-GD and U-BFGS outperform the baselines, and the gap is especially significant when the sample sizes are quite small. It is worth noting that U-GD sometimes works even better than U-BFGS, though U-GD does not behave significantly better in Table 2. This can happen because the Hessian approximation in BFGS might not work well when the sample sizes are extremely small.

7 Conclusion

In this work, we gave a new insight into calibrated surrogate maximization for handling the linear-fractional performance metrics. Sufficient conditions for the surrogate calibration were stated, which is the first calibration result for the linear-fractional metrics to the best of our knowledge. The surrogate maximization can be carried out by gradient-based optimization; thus we can avoid the class-posterior probability estimation and the iterative training of classifiers. The uniform convergence and consistency of the surrogate maximizer are guaranteed, and experimental results show the superiority of our approaches.

Acknowledgement

We would like to thank Nontawat Charoenphakdee, Junya Honda, and Akiko Takeda for fruitful discussions. HB was supported by JST, ACT-I, Japan, Grant Number JPMJPR18UI. MS was supported by JST CREST Grant Number JPMJCR18A2.

References

  • Ahmed et al. [2015] F. Ahmed, D. Tarlow, and D. Batra. Optimizing expected Intersection-over-Union with candidate-constrained CRFs. In CVPR, pages 1850–1858, 2015.
  • Bartlett and Mendelson [2002] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bartlett et al. [2006] P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.
  • Ben-David et al. [2003] S. Ben-David, N. Eiron, and P. M. Long. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003.
  • Berman et al. [2018] M. Berman, A. R. Triki, and M. B. Blaschko. The Lovász-Softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, 2018.
  • Bousquet et al. [2004] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. In Advanced Lectures on Machine Learning, pages 169–207. Springer, 2004.
  • Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
  • Brodersen et al. [2010] K. H. Brodersen, C. S. Ong, K. E. Stephan, and J. M. Buhmann. The balanced accuracy and its posterior distribution. In ICPR, pages 3121–3124, 2010.
  • Busa-Fekete et al. [2015] R. Busa-Fekete, B. Szörényi, K. Dembczyński, and E. Hüllermeier. Online F-measure optimization. In NeurIPS, pages 595–603, 2015.
  • Chang and Lin [2011] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • Charoenphakdee et al. [2019] N. Charoenphakdee, J. Lee, and M. Sugiyama. On symmetric losses for learning from corrupted labels. In ICML, 2019.
  • Csurka et al. [2013] G. Csurka, D. Larlus, F. Perronnin, and F. Meylan. What is a good evaluation measure for semantic segmentation? In BMVC, pages 1–11, 2013.
  • Dembczyński et al. [2013] K. Dembczyński, A. Jachnik, W. Kotłowski, W. Waegeman, and E. Hüllermeier. Optimizing the F-measure in multi-label classification: Plug-in rule approach versus structured loss minimization. In ICML, pages 1130–1138, 2013.
  • Eban et al. [2017] E. Eban, M. Schain, A. Mackey, A. Gordon, R. A. Saurous, and G. Elidan. Scalable learning of non-decomposable objectives. In AISTATS, pages 832–840, 2017.
  • Everingham et al. [2010] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
  • Feldman et al. [2012] V. Feldman, V. Guruswami, P. Raghavendra, and Y. Wu. Agnostic learning of monomials by halfspaces is hard. SIAM Journal on Computing, 41(6):1558–1590, 2012.
  • Gao and Zhou [2015] W. Gao and Z.-H. Zhou. On the consistency of AUC pairwise optimization. In IJCAI, pages 939–945, 2015.
  • Gower and Legendre [1986] J. C. Gower and P. Legendre. Metric and euclidean properties of dissimilarity coefficients. Journal of Classification, 3(1):5–48, 1986.
  • Halmos [1946] P. R. Halmos. The theory of unbiased estimation. The Annals of Mathematical Statistics, pages 34–43, 1946.
  • Jaccard [1901] P. Jaccard. Étude de la distribution florale dans une portion des alpes et du jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.
  • Joachims [2005] T. Joachims. A support vector method for multivariate performance measures. In ICML, pages 377–384, 2005.
  • Koyejo et al. [2014] O. O. Koyejo, N. Natarajan, P. K. Ravikumar, and I. S. Dhillon. Consistent binary classification with generalized performance metrics. In NeurIPS, pages 2744–2752, 2014.
  • Ledoux and Talagrand [1991] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
  • Lichman [2013] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
  • Manning et al. [2008] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
  • McDiarmid [1989] C. McDiarmid. On the method of bounded differences. Surveys in Combinatorics, 141(1):148–188, 1989.
  • Menon et al. [2013] A. Menon, H. Narasimhan, S. Agarwal, and S. Chawla. On the statistical consistency of algorithms for binary classification under class imbalance. In ICML, pages 603–611, 2013.
  • Menon et al. [2015] A. Menon, B. van Rooyen, C. S. Ong, and R. Williamson. Learning from corrupted binary labels via class-probability estimation. In ICML, pages 125–134, 2015.
  • Mohri et al. [2012] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  • Nan et al. [2012] Y. Nan, K. M. Chai, W. S. Lee, and H. L. Chieu. Optimizing F-measure: A tale of two approaches. In ICML, pages 289–296, 2012.
  • Narasimhan et al. [2014] H. Narasimhan, R. Vaish, and S. Agarwal. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In NeurIPS, pages 1493–1501, 2014.
  • Narasimhan et al. [2015] H. Narasimhan, P. Kar, and P. Jain. Optimizing non-decomposable performance measures: a tale of two classes. In ICML, pages 199–208, 2015.
  • Narasimhan et al. [2016] H. Narasimhan, W. Pan, P. Kar, P. Protopapas, and H. G. Ramaswamy. Optimizing the multiclass F-measure via biconcave programming. In ICDM, pages 1101–1106, 2016.
  • Natarajan et al. [2016] N. Natarajan, O. Koyejo, P. Ravikumar, and I. Dhillon. Optimal classification with multivariate losses. In NeurIPS, pages 1530–1538, 2016.
  • Osokin et al. [2017] A. Osokin, F. Bach, and S. Lacoste-Julien. On structured prediction theory with calibrated convex surrogate losses. In NeurIPS, pages 302–313, 2017.
  • Parambath et al. [2014] S. P. Parambath, N. Usunier, and Y. Grandvalet. Optimizing F-measures by cost-sensitive classification. In NeurIPS, pages 2123–2131, 2014.
  • Reid and Williamson [2009] M. D. Reid and R. C. Williamson. Surrogate regret bounds for proper losses. In ICML, pages 897–904, 2009.
  • Sanyal et al. [2018] A. Sanyal, P. Kumar, P. Kar, S. Chawla, and F. Sebastiani. Optimizing non-decomposable measures with deep networks. Machine Learning, 107(8-10):1597–1620, 2018.
  • Scott [2012] C. Scott. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.
  • Sokolova and Lapalme [2009] M. Sokolova and G. Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4):427–437, 2009.
  • Tsybakov [2008] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2008.
  • van de Geer [2000] S. van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
  • van der Vaart [2000] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 2000.
  • van der Vaart and Wellner [1996] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
  • van Rijsbergen [1974] C. J. van Rijsbergen. Foundation of evaluation. Journal of Documentation, 30(4):365–373, 1974.
  • Vapnik [1998] V. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
  • Yan et al. [2018] B. Yan, O. Koyejo, K. Zhong, and P. Ravikumar. Binary classification with Karmic, threshold-quasi-concave metrics. In ICML, pages 5531–5540, 2018.
  • Yu and Blaschko [2015] J. Yu and M. Blaschko. Learning submodular losses with the Lovász hinge. In ICML, pages 1623–1631, 2015.
  • Zhang [2004a] T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85, 2004a.
  • Zhang [2004b] T. Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5(Oct):1225–1251, 2004b.

Appendix A Calibration Analysis and Deferred Proofs from Section 3.2

In this section, we analyze the calibration of the surrogate utility. Before proceeding, we need to describe the Bayes optimal classifier for a given metric.

Definition 10.

Given a linear-fractional utility $U$, the Bayes optimal set is the set of functions that achieve the supremum of $U$, that is, $\mathcal{F}^*_U := \{f \mid U(f) = \sup_{f'} U(f')\}$, where the supremum is taken over all measurable functions.

Classifiers in $\mathcal{F}^*_U$ are referred to as Bayes optimal classifiers. Note that they are not necessarily unique. In this work, we assume that $\mathcal{F}^*_U \ne \emptyset$. First, we characterize the Bayes optimal set $\mathcal{F}^*_U$.

Proposition 11.

Given a linear-fractional utility $U$ in Eq. (1), the Bayes optimal set for $U$ is

$$\mathcal{F}^*_U = \big\{f \;\big|\; \mathrm{sign}(f(x)) = \mathrm{sign}\big(h_U(x)\big) \text{ for almost all } x\big\},$$

where $h_U(x) := (1 - \eta(x))\,(w_{1,-} - U^* w_{2,-}) - \eta(x)\,(w_{1,+} - U^* w_{2,+})$ and $U^* := \sup_f U(f)$.

Proof.

The maximization problem in Eq. (2) can be restated as a pointwise choice of the prediction $\mathrm{sign}(f(x))$, since $U$ depends on $f$ only through $\mathbb{1}\{f(x) > 0\}$. Let $f^*$ be a function that maximizes $U$, and $U^* := U(f^*)$. Then, $f^*$ maximizes the linearized objective $a_0 + W_1(f) - U^*(b_0 + W_2(f))$, and it satisfies ([22, Lemma 12])

$$a_0 + W_1(f^*) - U^*\big(b_0 + W_2(f^*)\big) = 0 \;\ge\; a_0 + W_1(f) - U^*\big(b_0 + W_2(f)\big) \quad \text{for all } f.$$

Thus, the necessary condition for local optimality is that the pointwise contribution to the linearized objective is maximized for almost all $x$.³ Expanding this contribution as a function of $d(x) := \mathbb{1}\{f(x) > 0\}$, it equals a constant plus $d(x)\,h_U(x)$, so the condition is $\mathbb{1}\{f^*(x) > 0\}\,h_U(x) \ge \mathbb{1}\{f(x) > 0\}\,h_U(x)$ for all $f$ and almost all $x$, which is equivalent to the condition $\mathrm{sign}(f^*(x)) = \mathrm{sign}(h_U(x))$ for almost all $x$. This concludes the proof. Note that $b_0 + W_2(f^*)$ is a positive value. ∎

³ This can be confirmed in a similar manner to the proof of Yan et al. [47, Theorem 3.1].

You may confirm that Proposition 11 is consistent with the Bayes optimal classifier in the classical case of the accuracy [3]: a Bayes optimal classifier should satisfy $\mathrm{sign}(f(x)) = \mathrm{sign}(\eta(x) - 1/2)$ for almost all $x$, since the accuracy corresponds to $a_0 = 1$, $w_{1,\pm} = -1$, $b_0 = 1$, and $w_{2,\pm} = 0$, for which $h_U(x) = 2\eta(x) - 1$.

Next, we state a proposition that gives a direction for proving the surrogate calibration of a surrogate utility. This proposition follows the latter half of Gao and Zhou [17, Theorem 2].

Proposition 12.

Fix a true utility $U$ and a surrogate utility $U_\phi$, and let $\mathcal{F}^*_U$ be the Bayes optimal set corresponding to the utility $U$. Assume that

$$\sup_{f \notin \mathcal{F}^*_U} U_\phi(f) < \sup_{f} U_\phi(f) =: U_\phi^*. \quad (7)$$

Then, the surrogate utility $U_\phi$ is $U$-calibrated.

Proof.

Recall that $U^* = \sup_f U(f)$, and let

$$\epsilon := U_\phi^* - \sup_{f \notin \mathcal{F}^*_U} U_\phi(f) > 0,$$

and let $\{f_j\}_{j \ge 1}$ be any sequence such that $U_\phi(f_j) \to U_\phi^*$. Then, there exists $J$ such that $U_\phi(f_j) > U_\phi^* - \epsilon$ for $j \ge J$. If we assume that $f_j \notin \mathcal{F}^*_U$ for some $j \ge J$, this contradicts the definition of $\epsilon$: for such a function $f_j$,

$$U_\phi(f_j) \le \sup_{f \notin \mathcal{F}^*_U} U_\phi(f) = U_\phi^* - \epsilon.$$

Thus, it holds that $f_j \in \mathcal{F}^*_U$ for $j \ge J$, that is, $U(f_j) = U^*$ for $j \ge J$, which indicates $U$-calibration. ∎

Thus, the proof of $U$-calibration of $U_\phi$ is reduced to showing the condition (7). Below, we show the surrogate calibration for the F-measure and Jaccard index utilizing Propositions 11 and 12. The proofs are based on the above propositions, Gao and Zhou [17, Lemma 6], and Charoenphakdee et al. [11, Theorem 11].

Throughout the proofs, we assume that the critical set $\mathcal{X}_0 := \{x \in \mathcal{X} \mid f^*_\phi(x) = 0\}$ satisfies $p(\mathcal{X}_0) = 0$, where $f^*_\phi$ is the classifier attaining the supremum of $U_\phi$. For example, this holds for any $\eta$-continuous distribution [47, Assumption 2].

A.1 Proof of Theorem 6

Figures 5, 6, and 7 illustrate the ranges of the quantities used in this proof.

As a surrogate utility of the F-measure following Eq. (4), we have

$$U_{F,\phi}(f) = \frac{(1+\beta^2)\big(\pi - \mathbb{E}_X[\eta\,\phi(f)]\big)}{(1+\beta^2)\pi - \mathbb{E}_X[\eta\,\phi(f)] + \mathbb{E}_X[(1-\eta)\,\phi(-f)]}.$$

From Proposition 11, the Bayes optimal set for the F-measure is

$$\mathcal{F}^*_F = \big\{f \;\big|\; \mathrm{sign}(f(x)) = \mathrm{sign}\big((1+\beta^2)\,\eta(x) - U_F^*\big) \text{ for almost all } x\big\}.$$

We will show F-calibration by utilizing Proposition 12, which casts our proof target into showing Eq. (7). We prove it by contradiction. Assume that