Implicit Rate-Constrained Optimization of Non-decomposable Objectives

07/23/2021 ∙ by Abhishek Kumar, et al.

We consider a popular family of constrained optimization problems arising in machine learning that involve optimizing a non-decomposable evaluation metric with a certain thresholded form, while constraining another metric of interest. Examples of such problems include optimizing the false negative rate at a fixed false positive rate, optimizing precision at a fixed recall, optimizing the area under the precision-recall or ROC curves, etc. Our key idea is to formulate a rate-constrained optimization that expresses the threshold parameter as a function of the model parameters via the Implicit Function theorem. We show how the resulting optimization problem can be solved using standard gradient based methods. Experiments on benchmark datasets demonstrate the effectiveness of our proposed method over existing state-of-the-art approaches for these problems. The code for the proposed method is available at https://github.com/google-research/google-research/tree/master/implicit_constrained_optimization .


1 Introduction

In many modern machine learning applications, the performance of a model is evaluated using metrics that are complex and nuanced. For example, in retrieval systems, it is common to evaluate a scoring model based on the area under the precision-recall curve or the ROC curve, or on its precision at a certain recall value (Eban et al., 2017). Similarly, in many medical diagnostic applications, a model is required to yield low false positive rates while restricting its false negative rate to be within an allowed limit (Rao et al., 2008), while in machine learning fairness applications one might be interested in imposing the “80% rule”, which requires a positive prediction rate of at least 80% on the minority class (Biddle, 2006; Zafar et al., 2017).

The above problems cannot be directly solved by minimizing a standard classification loss. In fact, prior work has found that doing so can result in inferior model performance (Joachims, 2005; Koyejo et al., 2014; Kar et al., 2014; Eban et al., 2017; Cotter et al., 2019a). Moreover, many of the metrics that we are interested in have a non-decomposable structure, i.e., they cannot be expressed directly in terms of an average over individual data points, making them hard to optimize using standard optimization tools. Much prior work has sought to address this problem, resulting in a range of methods targeting different classes of non-decomposable metrics (Yue et al., 2007; Narasimhan & Agarwal, 2013b; Kar et al., 2015; Yan et al., 2018).

In this paper, we consider a popular family of non-decomposable objectives that have a certain thresholded form. This includes metrics like the false negative rate (FNR) at a certain fixed false positive rate (FPR), precision at a fixed recall, precision@k, AUC-PR, and AUC-ROC, as well as more recent threshold-based fairness metrics (Hardt et al., 2016). The task of optimizing these metrics can naturally be written as a constrained optimization problem, wherein one seeks to optimize a quantity such as the model’s precision or false positive rate at one or more thresholds, subject to the model satisfying a set of rate constraints at those thresholds. The dominant approach for solving such rate-constrained problems has been to relax the constraints with surrogate losses, and to formulate an equivalent Lagrangian-based primal-dual problem (Eban et al., 2017). Follow-up work has improved upon this approach by using the surrogate relaxations only for the primal updates, but not the dual (Cotter et al., 2019b; Narasimhan et al., 2019a).

Our proposed optimization strategy departs significantly from the prior Lagrangian-based methods, and avoids explicitly solving the constrained optimization problem. Instead, we express the threshold variables in the optimization problem as an implicit function of the model parameters, and thus re-formulate the problem as an unconstrained optimization over the model parameters. By appealing to the Implicit Function Theorem (Tu, 2011), we show how to compute the gradients for the resulting unconstrained objective, despite not knowing the form of the implicit function, and then use them to perform standard gradient-based optimization (see Section 3). Although the Implicit Function Theorem makes a local statement about the existence of the implicit function (i.e., valid in a small neighborhood around current model parameters), we can still effectively use the theorem to make local gradient updates towards optimizing the objective.

We experiment with two image classification datasets and several UCI datasets, and show that our proposed method often performs significantly better than state-of-the-art constrained optimization solvers in optimizing popular metrics such as FNR at a fixed FPR, and the “partial” areas under the ROC and Precision-Recall curves evaluated over a selected range of FPR/recall values (see Section 5). We find our approach to be particularly effective when used to target extreme values of FPR or recall. We also discuss how our formulation can be extended to apply to more complex learning problems, such as query-based ranking, where standard constrained optimization techniques are known to have notable drawbacks (see Section 6).

2 Problem Formulation

We describe our formulation in a binary classification setting, with input space X and binary labels {0, 1}. Later, we will discuss how to extend our setup to multi-class classification problems. Our goal is to learn a scoring model f_θ : X → R, parameterized by θ, whose scores can be thresholded to make a binary prediction. We denote a scoring model thresholded at λ by x ↦ 1[f_θ(x) > λ]. We will use TP(θ, λ), FP(θ, λ) and FN(θ, λ) to denote the true positives, false positives and false negatives respectively for the thresholded classifier.
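As a concrete reference, the rates used throughout this section can be computed directly from model scores. The short sketch below (written in JAX; the helper and array names are illustrative and not taken from the paper's released code) evaluates the thresholded counts and the derived rates:

    import jax.numpy as jnp

    def thresholded_rates(scores, labels, lam):
        """Counts and rates of the thresholded classifier 1[f_theta(x) > lam].

        scores: model scores f_theta(x), shape [n]
        labels: binary labels in {0, 1}, shape [n]
        lam:    scalar decision threshold
        """
        labels = labels.astype(jnp.float32)
        preds = (scores > lam).astype(jnp.float32)
        tp = jnp.sum(preds * labels)            # true positives  TP(theta, lam)
        fp = jnp.sum(preds * (1.0 - labels))    # false positives FP(theta, lam)
        fn = jnp.sum((1.0 - preds) * labels)    # false negatives FN(theta, lam)
        tn = jnp.sum((1.0 - preds) * (1.0 - labels))
        return {
            "FPR": fp / jnp.maximum(fp + tn, 1.0),
            "FNR": fn / jnp.maximum(fn + tp, 1.0),
            "precision": tp / jnp.maximum(tp + fp, 1.0),
            "recall": tp / jnp.maximum(tp + fn, 1.0),
            "coverage": (tp + fp) / labels.shape[0],
        }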

We are interested in solving constrained optimization problems of the form:

    min_{θ, λ}  g(θ, λ)   subject to   φ_i(θ, λ) = 0,  i = 1, …, K,        (1)

where the objective g maps the model parameters θ and a set of thresholds λ = (λ_1, …, λ_K) to a real value, and the constraints φ_1, …, φ_K map (θ, λ) to real numbers. We further assume the problem stays in the feasible region, i.e., for the model parameters θ of interest there exist thresholds λ satisfying all the constraints. We will further discuss how several commonly used constrained optimization problems of interest in machine learning satisfy this assumption of feasibility. We provide some popular examples of evaluation metrics below.

Example 1 (Precision at fixed recall).

To maximize the model’s precision at the threshold at which its recall is β, we will have:

    g(θ, λ) = −precision(θ, λ) = −TP(θ, λ) / (TP(θ, λ) + FP(θ, λ)),    φ(θ, λ) = recall(θ, λ) − β = TP(θ, λ) / (TP(θ, λ) + FN(θ, λ)) − β.

Example 2 (FNR at fixed FPR).

To minimize the model’s false negative rate at the threshold at which its false positive rate is α, we will have:

    g(θ, λ) = FNR(θ, λ) = FN(θ, λ) / (FN(θ, λ) + TP(θ, λ)),    φ(θ, λ) = FPR(θ, λ) − α.

Example 3 (Precision at a fixed coverage).

To maximize the model’s precision at the threshold at which it achieves a coverage of κ (i.e., predicts a κ fraction of the examples as positive), we can set:

    g(θ, λ) = −precision(θ, λ),    φ(θ, λ) = (TP(θ, λ) + FP(θ, λ)) / n − κ,

where n is the total number of examples.

Example 4 (AUC-PR).

To maximize the area under the Precision-Recall curve, following Eban et al. (2017), we use a Riemann approximation to the area: we divide the recall range into K equally-spaced values β_1, …, β_K, and evaluate the average precision that the model achieves when thresholded to match each of the target recalls. This can be written as a constrained optimization problem with thresholds λ_1, …, λ_K, and with the objective and constraints set to:

    g(θ, λ) = −(1/K) Σ_{i=1}^{K} precision(θ, λ_i),    φ_i(θ, λ) = recall(θ, λ_i) − β_i,  i = 1, …, K.

One can similarly compute the “partial” area under the PR curve in any given range of recall (precision) targets by thresholding only at those particular recall (precision) values. This is particularly useful for excluding low recalls (precisions), since one is generally uninterested in the performance of the model at such thresholds.
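For evaluation, the Riemann approximation above can be computed by sweeping recall targets and reading off a matching threshold from the quantiles of the positive scores. The sketch below is a minimal illustration of that construction (JAX, illustrative names); it is an evaluation helper under these assumptions, not the constrained training objective itself:

    import jax.numpy as jnp

    def partial_auc_pr(scores, labels, recall_lo=0.0, recall_hi=1.0, num_points=10):
        """Average precision at thresholds matching equally-spaced recall targets."""
        labels = labels.astype(jnp.float32)
        pos_scores = jnp.sort(scores[labels == 1.0])
        precisions = []
        for beta in jnp.linspace(recall_lo, recall_hi, num_points):
            # Thresholding at the (1 - beta) quantile of the positive scores
            # gives a recall of approximately beta.
            lam = jnp.quantile(pos_scores, 1.0 - beta) - 1e-6
            preds = (scores > lam).astype(jnp.float32)
            tp = jnp.sum(preds * labels)
            fp = jnp.sum(preds * (1.0 - labels))
            precisions.append(tp / jnp.maximum(tp + fp, 1.0))
        return jnp.mean(jnp.stack(precisions))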

Example 5 (AUC-ROC).

To maximize the (partial) area under the Receiver Operator Characteristic (ROC) curve, we can again apply a Riemann approximation: we divide the FPR range of interest into K values α_1, …, α_K, and compute the average TPR at thresholds λ_1, …, λ_K, chosen to satisfy the FPR targets:

    g(θ, λ) = −(1/K) Σ_{i=1}^{K} TPR(θ, λ_i),    φ_i(θ, λ) = FPR(θ, λ_i) − α_i,  i = 1, …, K.

Example 6 (Fairness criterion).

In a typical group fairness application (Hardt et al., 2016), each example belongs to one of several protected groups, and the goal is to constrain the model to have equitable performance across all groups. One way to enforce this requirement is to introduce a separate threshold for examples from each group, and to tune them to satisfy the fairness constraints. For example, the popular demographic parity constraint for two (disjoint) groups, which requires equal positive prediction rates for both groups, can be encoded as:

    (TP_1(θ, λ_1) + FP_1(θ, λ_1)) / n_1 = (TP_2(θ, λ_2) + FP_2(θ, λ_2)) / n_2,

where TP_1 and TP_2 are the true positives on examples belonging to groups 1 and 2 respectively, FP_1 and FP_2 are the false positives for the two groups, and n_1 and n_2 are the number of examples in the two groups.

Feasibility of constraints by tuning λ.  Assuming that the model does not map two different training examples to the same output, it is easy to see that for fixed model parameters θ, any of the aforementioned rate constraints (e.g., false positive rate, precision, recall, etc.) can be satisfied to any feasible value by tuning only the thresholds λ.

All the rate-based optimization objectives and constraints discussed above are non-smooth. To make the problem amenable to gradient-based optimization, following earlier work (Eban et al., 2017; Cotter et al., 2019b; Narasimhan et al., 2019a) we replace g and φ_i with smooth differentiable surrogates g̃ and φ̃_i, and relax (1) into:

    min_{θ, λ}  g̃(θ, λ)   subject to   φ̃_i(θ, λ) = 0,  i = 1, …, K.        (2)

Surrogate losses.  We use sigmoid and softplus functions, denoted σ_T (with temperature T), as surrogates in our experiments. Specifically, we replace the innermost indicators with the smooth surrogate σ_T. For example, if the objective is FNR = FN / (FN + TP) and s_i is the prediction (logit) for the i-th example, then we replace FN = Σ_{i: y_i = 1} 1[s_i ≤ λ] with Σ_{i: y_i = 1} σ_T(λ − s_i), yielding the surrogate FNR (where σ_T denotes a temperature-scaled sigmoid or softplus function). We use similar surrogates for ratio-based objectives such as precision and recall. For example, if the objective is precision = TP / (TP + FP), then we replace TP with Σ_{i: y_i = 1} σ_T(s_i − λ) and the predicted positives TP + FP with Σ_i σ_T(s_i − λ), yielding the surrogate precision.
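To make the surrogates concrete, the sketch below smooths the FNR and precision rates with a temperature-scaled sigmoid, following the construction described above; the function names, and the choice of sigmoid rather than softplus, are illustrative assumptions:

    import jax
    import jax.numpy as jnp

    def sigma_T(z, temp=1.0):
        """Temperature-scaled sigmoid, a smooth stand-in for the indicator 1[z > 0]."""
        return jax.nn.sigmoid(z / temp)

    def surrogate_fnr(logits, labels, lam, temp=1.0):
        """Smooth FNR: each positive contributes sigma_T(lam - s_i) instead of 1[s_i <= lam]."""
        labels = labels.astype(jnp.float32)
        fn = jnp.sum(labels * sigma_T(lam - logits, temp))
        return fn / jnp.maximum(jnp.sum(labels), 1.0)

    def surrogate_precision(logits, labels, lam, temp=1.0):
        """Smooth precision: TP and predicted positives both smoothed with sigma_T(s_i - lam)."""
        labels = labels.astype(jnp.float32)
        pred_pos = sigma_T(logits - lam, temp)
        tp = jnp.sum(labels * pred_pos)
        return tp / jnp.maximum(jnp.sum(pred_pos), 1e-6)

Smaller temperatures make the surrogate closer to the hard indicator but harder to optimize, which is why the temperature is treated as a hyperparameter in the experiments.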

3 Optimization with Implicit Thresholds

The canonical approach to solving the constrained optimization problem in (2) is to formulate a Lagrangian for the problem, and then perform gradient updates to maximize the Lagrangian over the multipliers and minimize it over θ and λ. Our key idea is to avoid explicitly solving the constrained problem by instead formulating an equivalent unconstrained problem in which we express the thresholds λ as an implicit function of the model parameters θ (within a neighborhood around the current iterate).

To this end, we make use of the Implicit Function Theorem (Tu, 2011). Specifically, suppose the point (θ_0, λ_0) satisfies the constraint, i.e., φ̃(θ_0, λ_0) = 0. Then we can express the thresholds as λ = h(θ) in a neighborhood around θ_0, for some implicit function h:

Theorem 1 (Implicit Function Theorem (Tu, 2011)).

Let U be an open subset of R^n × R^m and F : U → R^m a smooth map. Write (x, y) for a point in U, with x ∈ R^n and y ∈ R^m. At a point (a, b) ∈ U where F(a, b) = 0 and the determinant det[∂F_i/∂y_j (a, b)] is nonzero, there exists a neighborhood A × B of (a, b) in U and a unique function h : A → B such that, in A × B, F(x, y) = 0 if and only if y = h(x).

Using this theorem, we can write the implicit threshold as λ = h(θ) within a neighborhood around θ_0, which enables us to turn problem (2) into the equivalent unconstrained problem of minimizing g̃(θ, h(θ)) over θ.

3.1 Characterization of the implicit function

A differentiable function h that provides us with the thresholds at which the constraints are satisfied may not always exist, and even if it does, it may be available in closed form only in some highly simplified settings.¹ Nonetheless, under some assumptions, we can show that when h exists, the resulting composite function g(θ, h(θ)) is convex in θ.

¹For example, if the distribution of the instances conditioned on label y = 1 is a Gaussian distribution with mean μ and covariance matrix Σ, and we wish to constrain the false negative rate (FNR) of a linear model parameterized by θ, then the FNR at any threshold λ is given by F(λ), and the threshold at which the FNR equals α is F⁻¹(α), where F is the CDF of a normal distribution with mean θᵀμ and variance θᵀΣθ.

Proposition 1.

Suppose the objective g(θ, λ) is jointly convex in (θ, λ) and strictly increasing in λ, and the constraint φ(θ, λ) is jointly convex in (θ, λ) and strictly decreasing in λ. Suppose there exists a differentiable function h such that φ(θ, h(θ)) = 0 for all θ. Then h is convex in θ. Consequently, the composite objective g(θ, h(θ)) is convex in θ.

The proof adapts a result from Wurker (2001) and is given in Appendix A. The assumptions in the proposition hold for simple linear models, for example, when minimizing the FPR while constraining the FNR (see appendix for details).

3.2 Gradient computation

To compute a (local) derivative of the composite objective g̃(θ, h(θ)) w.r.t. θ within the neighborhood of Theorem 1, we use the chain rule:

    d g̃(θ, h(θ)) / dθ = ∂g̃(θ, λ)/∂θ + ∂g̃(θ, λ)/∂λ · dh(θ)/dθ,  evaluated at λ = h(θ),        (3)

where for simplicity we show the derivative for a scalar threshold λ. We further need the derivative of the implicit function h. Since φ̃(θ, h(θ)) = 0 in this neighborhood, differentiating both sides w.r.t. θ gives

    dh(θ)/dθ = − (∂φ̃(θ, λ)/∂θ) / (∂φ̃(θ, λ)/∂λ),  evaluated at λ = h(θ).        (4)

This gives us the derivative of the implicit function, which can be plugged into Eq. (3) to get the final gradients for the model parameters θ.
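As an illustration of Eqs. (3) and (4), the sketch below computes the implicit gradient for the FNR-at-fixed-FPR problem with a linear model and the sigmoid surrogates of Section 2, using automatic differentiation for the partial derivatives. The function names are illustrative assumptions, not the paper's implementation:

    import jax
    import jax.numpy as jnp

    def fnr_surrogate(theta, lam, X, y, temp=1.0):
        """Objective: smoothed false negative rate (y holds float {0, 1} labels)."""
        s = X @ theta
        return jnp.sum(y * jax.nn.sigmoid((lam - s) / temp)) / jnp.maximum(jnp.sum(y), 1.0)

    def fpr_gap(theta, lam, X, y, alpha, temp=1.0):
        """Constraint: smoothed false positive rate minus the target alpha."""
        s = X @ theta
        fpr = jnp.sum((1.0 - y) * jax.nn.sigmoid((s - lam) / temp)) / jnp.maximum(jnp.sum(1.0 - y), 1.0)
        return fpr - alpha

    def implicit_gradient(theta, lam, X, y, alpha):
        # Eq. (4): d lam / d theta = -(d phi / d theta) / (d phi / d lam)
        dphi_dtheta = jax.grad(fpr_gap, argnums=0)(theta, lam, X, y, alpha)
        dphi_dlam = jax.grad(fpr_gap, argnums=1)(theta, lam, X, y, alpha)
        dlam_dtheta = -dphi_dtheta / dphi_dlam
        # Eq. (3): total derivative of the objective through theta and through lam = h(theta)
        dg_dtheta = jax.grad(fnr_surrogate, argnums=0)(theta, lam, X, y)
        dg_dlam = jax.grad(fnr_surrogate, argnums=1)(theta, lam, X, y)
        return dg_dtheta + dg_dlam * dlam_dtheta, dlam_dtheta

The returned dlam_dtheta is also the quantity reused by the threshold update of Algorithm 1 between correction steps.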

1:  Hyper-parameters: optimizer OPT, correction period τ
2:  Initialize: model parameters θ_0, thresholds λ_0
3:  For t = 1, …, T:
4:      Compute dλ/dθ at (θ_{t−1}, λ_{t−1}) using Eq. (4)
5:      Compute the gradient of the objective w.r.t. θ at (θ_{t−1}, λ_{t−1}) using Eq. (3)
6:      θ_t ← OPT(θ_{t−1}, gradient)              // optimizer step
7:      If t mod τ = 0:              // correction step
8:          Set λ_t s.t. φ(θ_t, λ_t) = 0 on the accumulated minibatches
9:      Else:                        // gradient based update for λ
10:          λ_t ← λ_{t−1} + (dλ/dθ)ᵀ (θ_t − θ_{t−1})
11:      End If
12:  End For
13:  Return θ_T, λ_T
Algorithm 1 Implicit Constrained Optimization (ICO)

3.3 Updating thresholds

Having performed the gradient update on θ with

    θ_t = θ_{t−1} − η ∇_θ g̃(θ_{t−1}, h(θ_{t−1})),

where η is a step-size parameter, what remains is to update the threshold. Again appealing to the Implicit Function Theorem, in the neighborhood around the current iterate (θ_{t−1}, λ_{t−1}), we can approximate the new threshold as:

    λ_t ≈ λ_{t−1} + (dh(θ_{t−1})/dθ)ᵀ (θ_t − θ_{t−1}).

As this is an approximation, we employ a correction step after every τ minibatch iterations that sets the threshold to satisfy the constraint exactly based on the accumulated minibatches. Note that for all the metrics described in Examples 3–5, this correction step can be performed efficiently using a straightforward line search.

3.4 Practical improvements

Algorithm 1 outlines our overall approach. In our experiments, we found it effective to use the unrelaxed (and non-smooth) rates, instead of surrogates, for the correction step, i.e. we set λ_t = h*(θ_t), where h* computes the threshold at which the unrelaxed constraint is satisfied exactly.
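Since the rates in question are monotone in the threshold, the correction step reduces to a simple search over the accumulated scores; for a fixed FPR target it is just a quantile of the negative scores, as in this small sketch (illustrative names):

    import jax.numpy as jnp

    def correct_threshold_for_fpr(scores, labels, target_fpr):
        """Set lam so the unrelaxed FPR of 1[f(x) > lam] matches target_fpr on the
        scores accumulated over the last few minibatches."""
        neg_scores = scores[labels == 0]
        # FPR is the fraction of negatives scoring above lam, so lam is the
        # (1 - target_fpr) quantile of the negative scores.
        return jnp.quantile(neg_scores, 1.0 - target_fpr)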

Regularization.  In some of our experiments, particularly with the smaller UCI datasets, we also impose a regularizer on the model parameters that penalizes the gradient of the constraint w.r.t. the threshold λ. This encourages the optimization to prefer model parameters for which the constraint function varies smoothly as a function of the threshold. We expect this to help the model generalize better on unseen examples.

Optimizing objectives with multiple constraints.  For optimizing objectives involving multiple constraints (and hence multiple thresholds λ_1, …, λ_K), such as (partial) PR-AUC or ROC-AUC, a naïve implementation would need multiple gradient computations w.r.t. θ as follows:

    d g̃ / dθ = ∂g̃/∂θ + Σ_{i=1}^{K} (∂g̃/∂λ_i) (dλ_i/dθ),  with  dλ_i/dθ = − (∂φ̃_i/∂θ) / (∂φ̃_i/∂λ_i).        (5)

It needs computation of ∂φ̃_i/∂θ for all K constraints. We avoid this by first computing the partial derivatives of g̃ and φ̃_i w.r.t. λ_i, which are much cheaper to compute, and then treating their ratios as constants (akin to using stop-gradient). Denoting r_i = (∂g̃/∂λ_i) / (∂φ̃_i/∂λ_i), we then rewrite the gradient computation as

    d g̃ / dθ = ∂g̃/∂θ − ∂/∂θ [ Σ_{i=1}^{K} r_i φ̃_i(θ, λ_i) ],  with the r_i treated as constants,        (6)

which again reduces to just two gradient computations. We also disable the gradient-based updates for the thresholds to avoid computing a separate dλ_i/dθ for all i, and only rely on the correction step of Algorithm 1 every τ minibatches.
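The two-gradient-pass computation of Eq. (6) can be sketched as follows (JAX, with illustrative function names): the ratios r_i are formed from the partials w.r.t. the thresholds and held constant with a stop-gradient, so only two passes w.r.t. θ remain, one for the objective and one for the r-weighted sum of constraints.

    import jax
    import jax.numpy as jnp

    def multi_constraint_gradient(theta, lams, g_fn, phi_fns):
        """g_fn(theta, lams) -> scalar surrogate objective.
        phi_fns[i](theta, lams[i]) -> scalar surrogate constraint i.
        Returns the implicit gradient of Eq. (6)."""
        # Partials w.r.t. each threshold; cheap in practice, since the thresholds
        # only enter at the output of the model.
        dg_dlam = jax.grad(g_fn, argnums=1)(theta, lams)                    # shape [K]
        dphi_dlam = jnp.array([jax.grad(f, argnums=1)(theta, lams[i])
                               for i, f in enumerate(phi_fns)])             # shape [K]
        r = jax.lax.stop_gradient(dg_dlam / dphi_dlam)                      # ratios treated as constants
        # Pass 1: partial derivative of the objective w.r.t. theta.
        grad_g = jax.grad(g_fn, argnums=0)(theta, lams)
        # Pass 2: gradient of the r-weighted sum of constraints w.r.t. theta.
        weighted_sum = lambda th: sum(r[i] * f(th, lams[i]) for i, f in enumerate(phi_fns))
        return grad_g - jax.grad(weighted_sum)(theta)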

4 Related Work

The problem of training models to optimize a given non-decomposable metric has received much attention in the literature. Early methods on this topic focused on constructing convex surrogates that closely approximate the metric of interest (Joachims, 2005; Yue et al., 2007; Narasimhan & Agarwal, 2013a; Kar et al., 2014; Mohapatra et al., 2014; Narasimhan et al., 2015a), often using structured support vector machines (Tsochantaridis et al., 2005). One of the drawbacks of these approaches is that they are not directly amenable to handling constraints on multiple rates, and they may sometimes result in a loose approximation to the metric (Kar et al., 2015). More recent methods seek to directly optimize a given rate metric subject to constraints on multiple rate metrics (Goh et al., 2016; Eban et al., 2017; Narasimhan, 2018; Cotter et al., 2019a, b; Narasimhan et al., 2019a), and come with scalable gradient-based solvers. There has also been work on the construction of structured surrogates (Fathony & Kolter, 2020; Bao & Sugiyama, 2020).

There is also a distinction between methods which handle classification metrics such as the F-measure (Koyejo et al., 2014; Narasimhan et al., 2014; Yan et al., 2018), where tuning a threshold on a pre-trained class probability model often results in a consistent estimator, and those that handle scoring metrics such as the precision-recall and AUC metrics we consider in this paper (Eban et al., 2017), where the focus is on learning a scoring model that performs well at one or more operating thresholds. Other techniques focus on optimizing specialized evaluation metrics that, for example, emphasize good top-k performance in ranking and classification tasks (Agarwal, 2011; Boyd et al., 2012; Fan et al., 2017; Lapin et al., 2017; Hiranandani et al., 2020).

Our approach is most closely related to the method of Eban et al. (2017), who encode the given metric as constraints on classification rates, relax the rates with differentiable surrogates, and perform gradient updates to minimize over the model parameters and maximize over the Lagrange multipliers for the constraints. The recent work of Cotter et al. (2019b) and Narasimhan et al. (2019a) improves upon their method by observing that the use of surrogates is only required when minimizing over the model parameters, while the maximization over the Lagrange multipliers can be performed with the original unrelaxed rates. The resulting min-max problem can be interpreted as a non-zero-sum game, for which the authors provide efficient gradient-based algorithms to find an equilibrium. This selective use of surrogates is incorporated into our proposal: we use surrogates only when we need to compute gradients for the objective, whereas, as mentioned in Section 3.4, we use the original unrelaxed rates while computing a correction for the threshold λ.

Unlike these earlier papers, our method avoids explicitly solving a constrained optimization problem, and instead expresses the threshold as an implicit function of the model parameters. This proposal is similar in flavor to the approach taken by Mackey et al. (2018), who like us formulate an unconstrained objective, but do so by expressing the threshold as a quantile of the model scores.

Finally, the growing literature on fairness in machine learning has opened the door for many new applications for constrained optimization (Hardt et al., 2016; Agarwal et al., 2018), introducing many group-based fairness metrics that can be easily handled using the proposed approach.

5 Experiments

Datasets #Examples #Features
CelebA 202,599 32×32×3
BigEarthNet 590,326 40×40×3
Letter 19,999 16
IJCNN1 49,990 22
Adult 48,842 122
Spambase 4,601 57
Com. & Crime 1,994 145
Table 1: Summary of datasets.

FPR High-cheekbones Heavy-makeup Wearing-lipstick Smiling Black-hair Blond-hair
1% 53.5/ 49.0/ 46.9 57.0/ 57.0/ 49.6 44.0/ 42.6/ 37.5 37.4/ 35.9/ 33.7 69.3/ 64.4/ 63.2 40.4/ 38.6/ 36.8
2% 44.8/ 40.9/ 39.8 45.6/ 41.2/ 38.9 32.7/ 30.4/ 26.7 29.4/ 27.8/ 26.1 56.4/ 52.0/ 50.5 28.9/ 25.6/ 24.2
5% 32.9/ 30.1/ 28.5 28.2/ 25.4/ 23.1 16.3/ 14.9/ 13.1 18.7/ 17.0/ 16.9 36.7/ 32.4/ 32.5 13.4/ 11.6/ 10.8
10% 22.9/ 20.4/ 19.7 15.1/ 13.6/ 12.4 6.6/ 5.9/ 4.7 11.7/ 10.7/ 10.2 23.0/ 19.2/ 18.6 6.5/ 4.9/ 4.7
Table 2: Minimizing false negative rate (FNR) at a given false positive rate (FPR) for CelebA.   The mean FNR values (in %) are reported over five random trials for cross-entropy/ TFCO/ ICO, respectively. The proposed ICO outperforms both CE and TFCO by a considerable margin. We report results on more attributes, along with the std. errors, in Appendix C. Lower values are better.
FPR High-cheekbones Heavy-makeup Wearing-lipstick Smiling Black-hair Mean
1% 66.1/ 62.9/ 69.8 65.0/ 68.5/ 66.7 70.2/ 74.8/ 72.3 75.4/ 78.0/ 75.6 60.5/ 61.4/ 61.2 64.7/ 65.9/ 66.1
2% 70.8/ 74.9/ 73.2 68.8/ 73.4/ 71.6 75.4/ 79.3/ 78.4 78.4/ 81.5/ 79.8 64.5/ 66.5/ 66.1 68.6/ 70.8/ 70.5
5% 75.9/ 73.8/ 78.5 75.5/ 79.5/ 78.3 82.2/ 84.7/ 84.6 83.5/ 65.0/ 84.8 70.9/ 73.2/ 73.8 74.7/ 72.8/ 76.8
10% 80.1/ 74.1/ 82.7 81.5/ 85.0/ 84.4 87.7/ 89.3/ 89.7 86.7/ 73.8/ 88.8 78.0/ 79.9/ 80.2 79.7/ 77.9/ 81.8
20% 84.2/ 72.4/ 86.8 88.2/ 89.9/ 89.8 91.8/ 93.1/ 93.9 90.9/ 76.6/ 92.1 84.3/ 86.0/ 86.1 84.9/ 82.4/ 86.8
Table 3: Maximizing area under the ROC curve for CelebA, in a given FPR range [0, α].   The mean ROC-AUC values are reported over five random trials for cross-entropy/ TFCO/ ICO, respectively. The last column shows the mean partial AUC over all 8 attributes. We report results on more attributes, along with the std. errors, in Appendix C. Higher values are better.

We evaluate our approach on five UCI classification tasks and two image classification tasks. A summary of the datasets used in the main text is provided in Table 1. We present additional experimental results in Appendices B, C and D.

Baselines.  We compare the proposed ICO approach with the state-of-the-art tools of Cotter et al. (2019a) and Narasimhan et al. (2019a) for constrained optimization, open-sourced as part of the TensorFlow Constrained Optimization Library (TFCO), available at https://github.com/google-research/tensorflow_constrained_optimization. The prior technique of Eban et al. (2017) can be viewed as a special case of the functionality provided by this library. We also compare to the baseline approach of optimizing a standard cross-entropy (CE) loss.

Experimental protocol.  For our experiments with image datasets, we use a 6-layer neural network with 5 convolutional layers with 128, 256, 256, 512 and 512 filters respectively. We use ReLU activation functions and batch normalization layers in the network. We use a separate validation split for model selection in all our experiments. For all three methods (CE, TFCO, and ICO), we use the evaluation metric of interest on the validation set for model selection (e.g., FNR, ROC-AUC, PR-AUC). This makes the CE baseline stronger in our experiments. We perform 5 random trials for each experiment and report the average value of the metric. Other details, such as standard deviations across the random trials, are reported in the Appendix.

5.1 Minimizing FNR at FPR

First, we consider the task of minimizing the false negative rate (FNR) at a given false positive rate (FPR). This setting is particularly relevant for security-sensitive applications where one wants to operate at a desired false positive rate. We experiment with the publicly available CelebA dataset (Liu et al., 2015), which contains 202,599 celebrity face images (resized to 32×32×3 in our experiments; see Table 1). CelebA has 40 annotated binary attributes for every face image, of which we randomly choose a subset for our experiments. We use the standard train, validation, and test splits for CelebA (https://www.tensorflow.org/datasets), and train a binary classifier for each attribute.

We use TFCO and ICO for optimizing FNR at four different FPR targets: 1%, 2%, 5% and 10%. For the cross-entropy baseline, we use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001. For TFCO, we use Adam for both the primal and dual updates; the primal learning rate is set to 0.001, while the dual learning rate is chosen using the validation sample. We use the cross-entropy surrogate (the softplus function in the binary case) provided by the TFCO library to approximate the rates. For ICO, we again use Adam with a learning rate of 0.001. We approximate the rates for ICO with temperature-scaled surrogates, and choose the temperature parameter using the validation sample. We do not apply gradient regularization, and perform the correction step described in Section 3.4 once every 1000 mini-batch updates, using data from the next 10 mini-batches (see Table 6). All optimizers use a batch size of 512.

Labels 5% 10% 20%
BLF 66.2/ 66.4/ 69.9 71.0/ 71.7/ 73.9 75.4/ 76.2/ 77.9
CC 62.2/ 62.7/ 63.6 67.4/ 66.8/ 68.3 73.8/ 76.0/ 74.9
CF 71.9/ 71.5/ 74.7 78.6/ 80.0/ 80.7 84.8/ 86.6/ 86.0
DUF 69.8/ 71.8/ 73.9 75.1/ 77.2/ 78.0 78.8/ 81.4/ 81.8
ANV 58.8/ 58.8/ 60.4 62.7/ 63.9/ 64.4 67.8/ 69.1/ 69.0
Mean 65.8/ 66.2/ 68.5 71.0/ 71.9/ 73.1 76.1/ 77.9/ 77.9
Table 4: Maximizing area under the ROC curve for BigEarthNet, in a given FPR range [0, α].   The mean ROC-AUC values are reported over five random trials for cross-entropy/ TFCO/ ICO, respectively. We report results on more labels, along with the std. errors, in Appendix D. Higher values are better.

We present in Table 2 the test evaluation metrics. The proposed ICO method performs the best for all six prediction tasks and for all four FPR targets, with TFCO coming in second. Interestingly, the gap between ICO and the other methods is larger for smaller FPR targets. This suggests that ICO is most advantageous when applied to metrics that are harder to optimize. Unsurprisingly, cross-entropy optimization often yields higher FNR values at the specified FPR targets than the other methods, indicating that methods which directly optimize performance at the desired FPR do end up performing better at that target. Results on other attributes are reported in Appendix C. We also report timing comparisons of ICO and TFCO in Appendix B.

(a) Wearing-lipstick attribute
(b) Black-hair attribute
Figure 1: ROC curves for CelebA: (a) for attribute Wearing-lipstick, optimizing the partial area under the ROC curve in the target FPR range; (b) for attribute Black-hair, optimizing the partial area under the ROC curve in the target FPR range. Left figures show the full ROC curves, while the right figures show the (zoomed-in) ROC curves in the respective target FPR ranges.
(a) Letter
(b) IJCNN1
Figure 2: Precision-Recall curves on the test set. Both TFCO and the proposed method seek to optimize PR-AUC in the recall range [0.95, 1]. The left plots for each dataset show the entire curve, while the right plots zoom in on the right end of the curve.
Dataset %Positives Cross-entropy TFCO ICO
Letter 4% 15.13 ± 0.86 20.49 ± 0.43 23.04 ± 0.77
IJCNN1 9.7% 21.18 ± 0.33 26.14 ± 0.57 27.28 ± 0.58
Adult 24% 39.74 ± 0.29 40.21 ± 0.38 40.34 ± 0.41
Spam 39% 71.51 ± 1.73 73.08 ± 1.70 73.48 ± 1.80
Com. & Crime 30% 47.00 ± 0.94 47.04 ± 0.97 47.03 ± 1.08
Table 5: Maximizing (partial) PR-AUC in the recall range [0.95, 1] on UCI datasets. Proposed ICO performs better than the other methods on datasets with severe class imbalance. Higher values are better.

5.2 Maximizing ROC-AUC in FPR range

Next, we consider the task of maximizing the (partial) area under the ROC curve in a select range of FPRs [0, α]. This metric is used in medical diagnostic tasks and biometric screening (Rao et al., 2008; Ricamato & Tortorella, 2011), where optimizing performance in a relevant FPR range may prove critical. We also compare with a pairwise loss baseline (Narasimhan & Agarwal, 2013b), which optimizes the average of ℓ(s_i − s_j) over pairs of a positive example i and a negative example j drawn from the top-scoring α fraction of negatives, where s_i denotes the score (e.g., logit) for example i and ℓ is the surrogate used for the 0-1 loss (either softplus or sigmoid with a temperature hyperparameter, as used for the proposed method). We use the pairwise loss, TFCO, and the proposed method to optimize this metric for five different values of α: 1%, 2%, 5%, 10% and 20%. The results for the pairwise loss baseline are reported in the supplementary material. We experiment with the CelebA and BigEarthNet (Sumbul et al., 2019) image datasets. BigEarthNet contains 590,326 Sentinel-2 image patches of size 120×120×3 in RGB, which we down-size to 40×40×3, and 43 annotated binary labels, of which we choose five for our experiments. These labels are Broad-Leaved Forest (BLF), Complex Cultivation patterns (CC), Coniferous Forest (CF), Discontinuous Urban Fabric (DUF) and Agricultural with Natural Vegetation land (ANV). We split the dataset randomly into 70% for training, 15% for validation and 15% for testing.

For both datasets, we train the same convolutional neural network model as in the previous experiment. Both TFCO and our method divide the specified FPR range [0, α] into 10 equally-spaced values, and optimize the average true positive rate (TPR) at those targets. We replicate the parameter configurations used in the previous experiment, except that the correction step for ICO is performed either once every 100 updates or once every 1000 updates, based on which of the two choices yields the highest validation ROC-AUC. We do not use gradient regularization in this experiment.

We present in Table 3 the test evaluation metrics for the different methods on CelebA, where we applied TFCO and ICO to optimize the partial ROC-AUC metric for five different values of α: 1%, 2%, 5%, 10% and 20%. We apply the standard McClish correction (McClish, 1989) to rescale the area to lie between 0 and 100. On at least half the classification tasks, the proposed ICO method performs the best, with TFCO coming in second. On average across all six image attributes, ICO is considerably better than TFCO on three of the five FPR targets α. We also show the ROC plots for a few specific cases in Figure 1. We present the results for BigEarthNet in Table 4, where we experiment with three values of α: 5%, 10% and 20%. For all five labels, the proposed ICO performs better than the baselines for the smaller false-positive ranges, i.e. for smaller α, demonstrating its effectiveness in optimizing performance in the initial portion of the ROC curve for this dataset. We also report timing comparisons of ICO and TFCO in Appendix B, observing that ICO can converge faster than TFCO in terms of wall-clock time.

5.3 Maximizing PR-AUC in Recall Range

In our final set of experiments, we consider the task of maximizing the (partial) area under the Precision-Recall curve in a select range of recall values. This metric is relevant in retrieval applications, where the quality of the system is often evaluated at multiple recall targets. We experiment with the five smaller datasets in Table 1, obtained from the UCI repository (Frank & Asuncion, 2010). For the Letter dataset, we treat the most frequent letter as the positive class, and the rest as negative. For the Communities & Crime dataset, we seek to predict whether a community in the US has a crime rate above a given percentile (Kearns et al., 2018). We train a linear model in each case. We split the datasets into train, validation and test sets in the ratios 50%:25%:25%.

Both TFCO and ICO divide the specified recall range into five equally-spaced values, and optimize the average precision at those targets. We use Adam for the cross-entropy baseline and for TFCO, and Adagrad for the proposed ICO. For cross-entropy optimization, we tune the learning rate, picking the value with the maximum PR-AUC metric on the validation sample. For TFCO, we tune the learning rate and dual scale parameters on the validation sample as well. We approximate the rates for TFCO using the cross-entropy surrogate loss provided by the library. For the proposed ICO, we use a fixed learning rate of 0.1, and approximate the rates with temperature-scaled sigmoid surrogates, with the temperature parameter chosen on the validation sample. We also apply the gradient regularizer described in Section 3.4, with the regularization strength chosen on the validation sample. We perform the correction step once every 10 updates. All optimizers perform full gradient updates.

We first evaluate the performance of the different methods at very high recall values. For this, we run both TFCO and ICO to optimize the average precision in the recall range [0.95, 1]. Table 5 presents the test PR-AUC metric values in this range for the different methods. We find that on the Letter and IJCNN1 datasets, which have severe class imbalance, the proposed approach performs significantly better than the two baselines. On these datasets, cross-entropy optimization performs poorly. On the datasets where the classes are reasonably balanced, all three methods perform similarly. On the Communities & Crime dataset, we find our approach yielding better metric values than the other methods on the training sample, but, because of the small data size, it does not generalize as well to the test set.

In Figure 2, we show the Precision-Recall curves for the different methods on the Letter and IJCNN1 datasets. Notice that while cross-entropy optimization yields higher precision at lower recall values, it does not fare well in the recall range that matters. Clearly, there is a notable benefit to directly optimizing for the recall range that we care about instead of using an off-the-shelf loss function. Moreover, the proposed ICO method outperforms TFCO at high recall values. We also apply the methods to optimize PR-AUC in recall ranges with varying lower limits. In the results shown in Figure 3 for the Letter and IJCNN1 datasets, one can see that the benefit offered by ICO over the two baselines is most notable for higher values of the lower recall limit.

(a) Letter
(b) IJCNN1
Figure 3: PR-AUC in the target recall range as a function of the range's lower limit, on the Letter and IJCNN1 datasets. The proposed method is often advantageous for very high values of the lower recall limit. Higher values are better.

6 Discussion and Future Work

We proposed an approach for solving popular constrained optimization problems arising in machine learning that involve non-decomposable rate metrics, such as the false positive rate, the true positive rate, and the areas under the precision-recall or ROC curves. Our approach deviates significantly from the existing methods based on Lagrange multipliers, and uses the Implicit Function Theorem to express the classifier thresholds as a function of the model parameters. Our experiments showed considerable improvements in optimizing several common evaluation metrics (such as FNR at a fixed FPR, and areas under the (partial) precision-recall and ROC curves) over existing state-of-the-art methods (Cotter et al., 2019a; Narasimhan et al., 2019a) that are part of open-source tools. In this work, we primarily focused on the straightforward setting where the classifier thresholds are expressed as a function of all model parameters, but it is also possible to consider an alternative setting where the thresholds are expressed as a function of only a subset of the model parameters (e.g., the last few layers of the neural network). We leave this as a direction for future work.

We close by highlighting how our proposal can be extended to handle more complex settings and constraints.

6.1 Inequality constraints

All the constrained metrics we described are defined as equality constraints. In our current implementation, we handle inequality constraints by searching for thresholds that satisfy the original non-smooth inequality constraints (during the correction step in Algorithm 1 and finally at the end of training). However, we can easily extend our proposal to handle inequality constraints of the form φ_i(θ, λ) ≤ 0 in a more principled manner. In this case, one can introduce auxiliary non-negative slack variables κ_i, and rewrite the inequality-constrained problem as one with equality constraints:

    min_{θ, λ, κ}  g̃(θ, λ)   subject to   φ̃_i(θ, λ) + κ_i = 0,  κ_i ≥ 0,  i = 1, …, K.        (7)

We can now apply Algorithm 1 to solve the rewritten problem by treating the model parameters θ and the auxiliary variables κ together as the optimization variables, with an additional projection step in the gradient descent procedure to ensure non-negativity of the κ_i's. We did not experiment with this version of the algorithm for the sake of implementation simplicity.
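As a sketch of how the slack-variable treatment in (7) would plug into Algorithm 1, the update below performs a plain gradient step on the slacks followed by a projection onto the non-negative orthant; the step size and names are illustrative, and, as noted above, this variant was not used in the experiments.

    import jax.numpy as jnp

    def slack_update(kappa, grad_kappa, step_size=0.1):
        """Gradient step on the slack variables of the equality-constrained
        rewrite (7), followed by projection onto kappa >= 0."""
        return jnp.maximum(kappa - step_size * grad_kappa, 0.0)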

6.2 Multi-class metrics

In our experiments, we handled binary classification tasks. The extension to multi-class metrics requires some effort, as in this case we are allowed to predict only one among m labels. As in the binary classification setting, we work with a model that outputs m scores f_1(x; θ), …, f_m(x; θ), and we maintain one parameter λ_k for each class k. The parameters λ_k are then used to post-shift the model via a weighted or shifted argmax to predict the final class:

    ŷ(x) = argmax_k ( f_k(x; θ) − λ_k ).

Computing the parameters λ_k so that the resulting classifier satisfies the specified constraints is not straightforward, but can still be performed efficiently with, e.g., the methods in Narasimhan et al. (2015b). While it isn’t clear if a feasible λ exists for general rate constraints, it certainly does for constraints like “coverage” (Cotter et al., 2019a), which require that the model makes a certain percentage of predictions from each class.
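A minimal sketch of the shifted-argmax prediction rule described above (the weighted variant would scale the scores instead of subtracting shifts); names are illustrative:

    import jax.numpy as jnp

    def post_shifted_predict(class_scores, lams):
        """Predict argmax_k (f_k(x) - lam_k) for per-class shift parameters lam_k.

        class_scores: [n, m] model scores, lams: [m] shifts."""
        return jnp.argmax(class_scores - lams[None, :], axis=-1)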

6.3 Ranking metrics

Perhaps the most interesting extension of our approach is to query-based ranking problems (Schütze et al., 2008), where each example contains a query and a list of documents, and the goal is to rank the documents based on their relevance to the query. Popular ranking metrics such as Precision@k or Recall@k seek to measure performance among the top k ranked documents (Lapin et al., 2017). Unfortunately, writing these metrics out as an explicit constrained optimization problem would require one constraint per query, with the number of constraints growing with the size of the training set. Consequently, standard constrained optimization approaches, when applied to optimize these metrics, would need to maintain one Lagrange multiplier per query, making them impractical to use with large datasets. Recently, Narasimhan et al. (2019b) proposed solving such heavily-constrained problems with lower-dimensional representations of the Lagrange multipliers. In contrast, our method offers an alternative route which does not require explicitly handling the large number of constraints, through implicit modeling of the per-query thresholds. We look forward to future work comparing our implicit thresholding approach with the state-of-the-art methods for these ranking metrics (e.g. Kar et al. (2015); Lapin et al. (2017)).

References

  • Agarwal et al. (2018) Agarwal, A., Beygelzimer, A., Dudik, M., Langford, J., and Wallach, H. A reductions approach to fair classification. In International Conference on Machine Learning, pp. 60–69, 2018.
  • Agarwal (2011) Agarwal, S. The infinite push: A new support vector ranking algorithm that directly optimizes accuracy at the absolute top of the list. In Proceedings of the 2011 SIAM International Conference on Data Mining, pp. 839–850. SIAM, 2011.
  • Bao & Sugiyama (2020) Bao, H. and Sugiyama, M. Calibrated surrogate maximization of linear-fractional utility in binary classification. In

    International Conference on Artificial Intelligence and Statistics

    , pp. 2337–2347. PMLR, 2020.
  • Biddle (2006) Biddle, D. Adverse impact and test validation: A practitioner’s guide to valid and defensible employment testing. Gower Publishing, Ltd., 2006.
  • Boyd et al. (2012) Boyd, S., Cortes, C., Mohri, M., and Radovanovic, A. Accuracy at the top. In Advances in Neural Information Processing Systems, 2012.
  • Cotter et al. (2019a) Cotter, A., Jiang, H., Gupta, M. R., Wang, S., Narayan, T., You, S., and Sridharan, K. Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research, 20(172):1–59, 2019a.
  • Cotter et al. (2019b) Cotter, A., Jiang, H., and Sridharan, K. Two-player games for efficient non-convex constrained optimization. In Algorithmic Learning Theory, pp. 300–332. PMLR, 2019b.
  • Eban et al. (2017) Eban, E., Schain, M., Mackey, A., Gordon, A., Rifkin, R., and Elidan, G. Scalable learning of non-decomposable objectives. In Artificial intelligence and statistics, pp. 832–840. PMLR, 2017.
  • Fan et al. (2017) Fan, Y., Lyu, S., Ying, Y., and Hu, B.-G. Learning with average top-k loss. arXiv preprint arXiv:1705.08826, 2017.
  • Fathony & Kolter (2020) Fathony, R. and Kolter, Z. Ap-perf: Incorporating generic performance metrics in differentiable learning. In International Conference on Artificial Intelligence and Statistics, pp. 4130–4140. PMLR, 2020.
  • Frank & Asuncion (2010) Frank, A. and Asuncion, A. UCI machine learning repository. URL: http://archive.ics.uci.edu/ml, 2010.
  • Goh et al. (2016) Goh, G., Cotter, A., Gupta, M., and Friedlander, M. P. Satisfying real-world goals with dataset constraints. In Advances in Neural Information Processing Systems, pp. 2415–2423, 2016.
  • Hardt et al. (2016) Hardt, M., Price, E., and Srebro, N. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323, 2016.
  • Hiranandani et al. (2020) Hiranandani, G., Vijitbenjaronk, W., Koyejo, S., and Jain, P. Optimization and analysis of the pap@ k metric for recommender systems. In International Conference on Machine Learning, pp. 4260–4270. PMLR, 2020.
  • Joachims (2005) Joachims, T. A support vector method for multivariate performance measures. In Proceedings of the 22nd international conference on Machine learning, pp. 377–384. ACM, 2005.
  • Kar et al. (2014) Kar, P., Narasimhan, H., and Jain, P. Online and stochastic gradient methods for non-decomposable loss functions. arXiv preprint arXiv:1410.6776, 2014.
  • Kar et al. (2015) Kar, P., Narasimhan, H., and Jain, P. Surrogate functions for maximizing precision at the top. In International Conference on Machine Learning, pp. 189–198. PMLR, 2015.
  • Kearns et al. (2018) Kearns, M., Neel, S., Roth, A., and Wu, Z. Preventing fairness gerrymandering: Auditing and learning for subgroup fairness. In ICML, 2018.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. ICLR, 2014.
  • Koyejo et al. (2014) Koyejo, O. O., Natarajan, N., Ravikumar, P. K., and Dhillon, I. S. Consistent binary classification with generalized performance metrics. In NIPS, pp. 2744–2752, 2014.
  • Lapin et al. (2017) Lapin, M., Hein, M., and Schiele, B. Analysis and optimization of loss functions for multiclass, top-k, and multilabel classification. IEEE transactions on pattern analysis and machine intelligence, 40(7):1533–1554, 2017.
  • Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
  • Mackey et al. (2018) Mackey, A., Luo, X., and Eban, E. Constrained classification and ranking via quantiles. arXiv preprint arXiv:1803.00067, 2018.
  • McClish (1989) McClish, D. K. Analyzing a portion of the roc curve. Medical Decision Making, 9(3):190–195, 1989.
  • Mohapatra et al. (2014) Mohapatra, P., Jawahar, C., and Kumar, M. P. Efficient optimization for average precision SVM. In NIPS-Advances in Neural Information Processing Systems, 2014.
  • Narasimhan (2018) Narasimhan, H. Learning with complex loss functions and constraints. In International Conference on Artificial Intelligence and Statistics, pp. 1646–1654, 2018.
  • Narasimhan & Agarwal (2013a) Narasimhan, H. and Agarwal, S. A structural svm based approach for optimizing partial auc. In International Conference on Machine Learning, pp. 516–524. PMLR, 2013a.
  • Narasimhan & Agarwal (2013b) Narasimhan, H. and Agarwal, S. Svmpauctight: a new support vector method for optimizing partial auc based on a tight convex upper bound. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 167–175, 2013b.
  • Narasimhan et al. (2014) Narasimhan, H., Vaish, R., and Agarwal, S. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Advances in Neural Information Processing Systems, pp. 1493–1501, 2014.
  • Narasimhan et al. (2015a) Narasimhan, H., Kar, P., and Jain, P. Optimizing non-decomposable performance measures: A tale of two classes. In International Conference on Machine Learning, pp. 199–208. PMLR, 2015a.
  • Narasimhan et al. (2015b) Narasimhan, H., Ramaswamy, H., Saha, A., and Agarwal, S. Consistent multiclass algorithms for complex performance measures. In ICML, pp. 2398–2407, 2015b.
  • Narasimhan et al. (2019a) Narasimhan, H., Cotter, A., and Gupta, M. Optimizing generalized rate metrics with three players. In Advances in Neural Information Processing Systems, pp. 10747–10758, 2019a.
  • Narasimhan et al. (2019b) Narasimhan, H., Cotter, A., Zhou, Y., Wang, S., and Guo, W. Approximate heavily-constrained learning with lagrange multiplier models. In Advances in Neural Information Processing Systems, pp. 10747–10758, 2019b.
  • Rao et al. (2008) Rao, R. B., Yakhnenko, O., and Krishnapuram, B. Kdd cup 2008 and the workshop on mining medical data. ACM SIGKDD Explorations Newsletter, 10(2):34–38, 2008.
  • Ricamato & Tortorella (2011) Ricamato, M. T. and Tortorella, F. Partial auc maximization in a linear combination of dichotomizers. Pattern Recognition, 44(10-11):2669–2677, 2011.
  • Schütze et al. (2008) Schütze, H., Manning, C. D., and Raghavan, P. Introduction to information retrieval, volume 39. Cambridge University Press Cambridge, 2008.
  • Sumbul et al. (2019) Sumbul, G., Charfuelan, M., Demir, B., and Markl, V. Bigearthnet: A large-scale benchmark archive for remote sensing image understanding. In IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, pp. 5901–5904. IEEE, 2019.
  • Tsochantaridis et al. (2005) Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., and Singer, Y. Large margin methods for structured and interdependent output variables. Journal of machine learning research, 6(9), 2005.
  • Tu (2011) Tu, L. W. An Introduction to Manifolds. Second edition. Springer, 2011.
  • Wurker (2001) Wurker, U. Convexity properties of some implicit functions. Journal of Convex Analysis, 2001.
  • Yan et al. (2018) Yan, B., Koyejo, S., Zhong, K., and Ravikumar, P. Binary classification with karmic, threshold-quasi-concave metrics. In International Conference on Machine Learning, pp. 5531–5540. PMLR, 2018.
  • Yue et al. (2007) Yue, Y., Finley, T., Radlinski, F., and Joachims, T. A support vector method for optimizing average precision. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 271–278, 2007.
  • Zafar et al. (2017) Zafar, M. B., Valera, I., Rogriguez, M. G., and Gummadi, K. P. Fairness constraints: Mechanisms for fair classification. In Artificial Intelligence and Statistics, pp. 962–970. PMLR, 2017.

Appendix A Proof of Proposition 1

The assumption in Proposition 1 holds, for example, when we seek to minimize the FPR subject to a constraint on the FNR. In this case, the FPR objective and the FNR constraint can be approximated by expectations of the standard logistic loss ℓ over P_0 and P_1 respectively, where P_0 and P_1 are the class-conditional distributions over examples with labels 0 and 1. The resulting surrogate objective is jointly convex in (θ, λ) and strictly increasing in λ, while the surrogate constraint is jointly convex in (θ, λ) and strictly decreasing in λ.

Proof of Proposition 1.

From the joint convexity of φ, we have for any (θ_1, λ_1) and (θ_2, λ_2):

    φ(θ_2, λ_2) ≥ φ(θ_1, λ_1) + ∇_θ φ(θ_1, λ_1)ᵀ (θ_2 − θ_1) + ∂_λ φ(θ_1, λ_1) (λ_2 − λ_1).

Therefore this also holds for (θ_1, h(θ_1)) and (θ_2, h(θ_2)):

    φ(θ_2, h(θ_2)) ≥ φ(θ_1, h(θ_1)) + ∇_θ φ(θ_1, h(θ_1))ᵀ (θ_2 − θ_1) + ∂_λ φ(θ_1, h(θ_1)) (h(θ_2) − h(θ_1)).

Since both φ(θ_1, h(θ_1)) and φ(θ_2, h(θ_2)) equal zero, these terms cancel. Because φ is strictly decreasing in λ, ∂_λ φ(θ_1, h(θ_1)) < 0, and therefore we can rewrite the above inequality as:

    h(θ_2) − h(θ_1) ≥ − (∂_λ φ(θ_1, h(θ_1)))⁻¹ ∇_θ φ(θ_1, h(θ_1))ᵀ (θ_2 − θ_1).

Using the fact that ∇_θ h(θ_1) = − (∂_λ φ(θ_1, h(θ_1)))⁻¹ ∇_θ φ(θ_1, h(θ_1)) (see (4) in the main text), we have:

    h(θ_2) − h(θ_1) ≥ ∇_θ h(θ_1)ᵀ (θ_2 − θ_1),

or

    h(θ_2) ≥ h(θ_1) + ∇_θ h(θ_1)ᵀ (θ_2 − θ_1).

This shows that h is convex in θ. The convexity of g(θ, h(θ)) follows from the convexity of g and h, and from the fact that g is monotonically increasing in its second argument. ∎

Appendix B Timing comparisons

Figure 4: Minimizing false negative rate (FNR) at a fixed false positive rate (FPR) of 0.05 for CelebA: FNR as a function of training epochs for TFCO and the proposed ICO.

We monitor the performance of TFCO and ICO in terms of the value of the evaluation metric as training proceeds. At the end of every training epoch, we record the best value of the metric seen so far on the validation set, and use the same model (that yields the best validation metric) to score the test set.

For the problem of optimizing false negative rate (FNR) at a fixed false positive rate (FPR) on CelebA, Figures 4 and 5 show these FNR values for the attribute High_Cheekbones on the validation and test sets as the training proceeds in terms of training epochs and actual wall-clock time, respectively. We observe that while TFCO converges to a much lower FNR at the end of the first epoch (the first data point shown in Figure 4), the proposed ICO eventually achieves a lower FNR on both validation and test sets. Both TFCO and ICO were trained for 40 epochs in these experiments and TFCO was about 1.3x faster in terms of wall-clock time.

Figure 5: Minimizing false negative rate (FNR) at a fixed false positive rate (FPR) of 0.05 for CelebA: FNR as a function of wall-clock time for TFCO and the proposed ICO.
Figure 6: Maximizing partial area under the ROC curve (in a fixed FPR range) for CelebA: ROC-AUC as a function of training epochs for TFCO and the proposed ICO.
Figure 7: Maximizing partial area under the ROC curve (in a fixed FPR range) for CelebA: ROC-AUC as a function of wall-clock time for TFCO and the proposed ICO.

We repeat a similar experiment for the problem of optimizing the partial area under the ROC curve in the target FPR range, again for CelebA. Figures 6 and 7 show the ROC-AUC on the validation and test sets for the High_Cheekbones attribute as the training proceeds, in terms of epochs and wall-clock time respectively. We observe similar behavior as in the earlier experiment: TFCO converges to a much better ROC-AUC early in training, by the end of the first epoch (the first data point in the plots). However, the proposed ICO eventually achieves a better ROC-AUC on both the validation and test sets. Both TFCO and ICO were trained for 25 epochs in this experiment, and ICO is about 5x faster than TFCO in terms of wall-clock time. This is because optimizing ROC-AUC is a problem with multiple constraints (10 in this case): we do not optimize the thresholds using gradients in ICO and only rely on the threshold correction step after every 100 minibatches, whereas TFCO's training time per minibatch slows down due to the multiple constraints.

Attributes FPR CE TFCO ICO ICO hyperparameter
High-cheekbones 1% 53.52 (1.71) 49.09 (1.56) 46.96 (0.55) 0.01
2% 44.82 (1.23) 40.89 (0.82) 39.84 (0.49) 0.01
5% 32.87 (0.83) 30.11 (0.67) 28.54 (0.30) 0.01
10% 22.88 (0.43) 20.37 (0.65) 19.66 (0.36) 0.01
Heavy-makeup 1% 57.00 (0.84) 52.07 (1.24) 49.65 (0.81) 1
2% 45.59 (1.24) 41.23 (1.41) 38.89 (0.80) 0.001
5% 28.22 (0.80) 25.43 (0.97) 23.11 (0.70) 0.001
10% 15.12 (0.44) 13.61 (0.77) 12.36 (0.19) 0.01
Wearing-lipstick 1% 44.00 (1.28) 42.64 (1.29) 37.47 (0.97) 1
2% 32.74 (0.88) 30.44 (1.51) 26.74 (0.50) 0.01
5% 16.33 (0.43) 14.97 (0.77) 13.05 (0.25) 0.001
10% 6.61 (0.27) 5.92 (0.21) 4.78 (0.12) 0.001
Smiling 1% 37.40 (1.30) 35.93 (1.37) 33.74 (0.71) 0.01
2% 29.44 (0.76) 27.80 (1.23) 26.10 (0.57) 0.01
5% 18.73 (0.53) 17.04 (0.80) 16.88 (0.25) 0.01
10% 11.78 (0.20) 10.74 (0.29) 10.23 (0.25) 0.01
Black-hair 1% 69.32 (1.78) 64.47 (1.55) 63.23 (1.19) 0.001
2% 56.48 (1.48) 52.00 (1.10) 50.50 (0.67) 0.001
5% 36.72 (1.61) 32.41 (0.60) 32.48 (0.66) 0.001
10% 22.97 (1.87) 19.16 (1.22) 18.62 (0.43) 0.001
Blond-hair 1% 40.49 (1.18) 38.62 (1.17) 36.85 (0.58) 0.01
2% 28.89 (1.17) 25.64 (1.20) 24.20 (0.72) 0.001
5% 13.44 (1.01) 11.64 (0.76) 10.81 (0.24) 0.001
10% 6.54 (0.46) 4.91 (0.22) 4.68 (0.20) 0.001
Brown-hair 1% 80.75 (1.82) 77.16 (0.74) 76.34 (0.78) 0.001
2% 69.69 (1.77) 66.10 (1.04) 65.74 (1.28) 0.001
5% 52.41 (2.55) 45.83 (0.92) 46.43 (0.51) 0.001
10% 35.92 (2.71) 29.94 (0.87) 30.02 (0.67) 0.001
Wavy-hair 1% 85.04 (0.80) 84.42 (0.95) 83.54 (0.69) 0.001
2% 78.91 (1.20) 77.02 (1.47) 76.07 (0.90) 0.001
5% 65.79 (1.77) 61.81 (0.92) 60.71 (1.01) 0.001
10% 50.52 (1.41) 47.49 (0.98) 45.95 (1.12) 0.001
Table 6: Minimizing false negative rate (FNR) at a given false positive rate (FPR) for CelebA.   The mean FNR values (in %) are reported over five random trials, along with std. deviations, for cross-entropy (CE), TFCO and ICO. The proposed ICO outperforms both CE and TFCO by a considerable margin in most cases. Lower values are better. We also list the hyperparameters for ICO: the surrogate function for both the objective and the constraint is taken to be softplus; the correction step in Alg. 1 is applied every 1000 minibatch steps using data from the next 10 minibatches; the temperature hyperparameter for softplus is given in the last column.
Attributes FPR CE Pairwise-loss TFCO ICO ICO hyperparameters
High-cheekbones 1% 66.10 (1.14) 68.18 (2.01) 62.96 (10.45) 69.83 (0.84)
2% 70.87 (0.91) 72.85 (0.28) 74.98 (0.20) 73.17 (0.84)
5% 75.89 (1.26) 78.13 (0.36) 73.83 (11.30) 78.45 (0.53)
10% 80.15 (1.02) 81.51 (0.57) 74.07 (12.13) 82.67 (0.49)
20% 84.26 (0.74) 85.70 (0.33) 72.44 (17.48) 86.82 (0.30)
Heavy-makeup 1% 65.03 (0.92) 66.33 (1.41) 68.55 (0.83) 66.75 (0.31)
2% 68.82 (0.61) 71.13 (0.80) 73.43 (0.13) 71.57 (0.80)
5% 75.48 (1.44) 77.78 (0.40) 79.54 (0.09) 78.32 (0.33)
10% 81.53 (1.10) 82.85 (0.75) 85.00 (0.12) 84.39 (0.47)
20% 88.22 (0.40) 88.68 (0.36) 89.93 (0.11) 89.81 (0.19)
Wearing-lipstick 1% 70.24 (1.13) 71.77 (0.70) 74.82 (0.33) 72.29 (0.77)
2% 75.42 (0.68) 77.21 (0.39) 79.28 (0.22) 78.44 (0.28)
5% 82.19 (0.37) 83.42 (0.97) 84.68 (0.19) 84.56 (0.28)
10% 87.70 (0.89) 88.44 (0.25) 89.35 (0.18) 89.73 (0.27)
20% 91.88 (0.44) 93.00 (0.20) 93.19 (0.12) 93.93 (0.15)
Smiling 1% 75.39 (0.51) 75.87 (0.63) 78.03 (0.42) 75.59 (0.76)
2% 78.44 (0.41) 79.85 (0.52) 81.51 (0.20) 79.80 (0.55)
5% 83.48 (0.68) 84.50 (0.54) 64.97 (17.24) 84.81 (0.41)
10% 86.76 (0.84) 88.06 (0.22) 73.79 (18.63) 88.80 (0.30)
20% 90.88 (0.52) 91.46 (0.11) 76.61 (18.69) 92.08 (0.09)
Black-hair 1% 60.53 (0.86) 57.73 (1.11) 61.44 (0.44) 61.24 (0.26)
2% 64.48 (0.45) 63.93 (1.10) 66.53 (0.44) 66.07 (0.61)
5% 70.87 (1.01) 71.85 (0.25) 73.19 (0.29) 73.79 (0.45)
10% 78.04 (0.71) 78.49 (0.56) 79.88 (0.11) 80.19 (0.07)
20% 84.33 (0.54) 85.45 (0.37) 86.00 (0.09) 86.09 (0.30)
Blond-hair 1% 71.36 (0.62) 70.38 (0.89) 73.11 (0.46) 72.11 (0.37)
2% 76.50 (0.91) 76.25 (0.66) 79.01 (0.16) 78.06 (0.54)
5% 84.74 (0.69) 84.21 (0.37) 86.18 (0.13) 85.76 (0.25)
10% 89.39 (0.59) 89.75 (0.73) 90.63 (0.15) 90.49 (0.21)
20% 93.30 (0.33) 93.56 (0.38) 94.24 (0.10) 94.27 (0.14)
Brown-hair 1% 55.40 (0.54) 52.65 (0.56) 56.09 (0.28) 56.61 (0.41)
2% 58.82 (0.38) 54.62 (1.34) 59.68 (0.21) 60.10 (0.40)
5% 64.87 (0.63) 61.21 (2.25) 66.71 (0.11) 67.13 (0.40)
10% 69.74 (1.14) 70.05 (0.34) 73.23 (0.27) 73.01 (0.25)
20% 77.67 (1.03) 78.25 (0.51) 80.09 (0.16) 80.06 (0.23)
Wavy-hair 1% 54.03 (0.28) 50.91 (0.23) 52.36 (1.31) 54.33 (0.09)
2% 55.85 (0.78) 52.08 (0.50) 52.64 (2.14) 57.02 (0.31)
5% 60.16 (0.73) 54.17 (2.30) 53.70 (2.98) 61.30 (0.38)
10% 64.42 (0.37) 56.70 (2.09) 57.16 (4.32) 65.34 (0.48)
20% 69.26 (1.03) 68.56 (0.49) 66.47 (5.60) 71.48 (0.32)
Table 7: Maximizing area under the ROC curve for CelebA, in a given FPR range [0, α].   The mean ROC-AUC values are reported over five random trials, along with std. deviations, for cross-entropy (CE), Pairwise-loss, TFCO and ICO. Higher values are better. For ICO, the surrogate function for both the objective and the constraint is taken to be sigmoid; the correction step in Alg. 1 is applied periodically using data accumulated from the preceding minibatches, with the correction period, the number of accumulation minibatches, and the temperature hyperparameter selected on the validation set.

Appendix C CelebA results

We report results on more CelebA attributes for the two problems considered in the main text: (i) minimizing false negative rate (FNR) at a fixed false positive rate (FPR), for FPR targets of 1%, 2%, 5% and 10% (Table 6), and (ii) maximizing partial area under the ROC curve (ROC-AUC) for FPR in the range [0, α], for α ∈ {1%, 2%, 5%, 10%, 20%} (Table 7). These results also show the standard deviation over five random trials, which was omitted in the main text due to space constraints. For partial AUC in the FPR range [0, α], we also compare with a pairwise loss baseline (Narasimhan & Agarwal, 2013b), which optimizes the average of ℓ(s_i − s_j) over pairs of a positive example i and a negative example j drawn from the top-scoring α fraction of negatives, where s_i denotes the score (e.g., logit) for example i and ℓ is the surrogate used for the 0-1 loss (either softplus or sigmoid with a temperature hyperparameter, as used for the proposed method).

Labels FPR CE Pairwise-loss TFCO ICO
Broad-Leaved Forest (BLF) 5% 66.20 (0.59) 52.07 (0.35) 66.43 (1.86) 69.90 (0.53)
10% 71.00 (0.80) 53.78 (1.34) 71.72 (0.78) 73.91 (0.66)
20% 75.42 (0.67) 57.94 (0.81) 76.20 (0.61) 77.91 (0.77)
Complex Cultivation patterns (CC) 5% 62.19 (0.48) 52.44 (0.26) 62.71 (2.19) 63.61 (0.12)
10% 67.46 (0.25) 54.71 (0.62) 66.75 (0.82) 68.35 (0.42)
20% 73.81 (0.83) 59.34 (0.64) 76.01 (0.77) 74.88 (0.43)
Coniferous Forest (CF) 5% 71.93 (0.66) 59.00 (0.56) 71.49 (3.44) 74.70 (0.62)
10% 78.62 (0.59) 77.66 (2.22) 79.98 (1.26) 80.76 (0.94)
20% 84.82 (0.68) 85.84 (0.31) 86.62 (0.62) 86.02 (0.29)
Discontinuous Urban Fabric (DUF) 5% 69.80 (1.45) 55.33 (0.98) 71.76 (1.89) 73.94 (0.39)
10% 75.13 (0.86) 59.15 (2.17) 77.20 (0.87) 78.03 (0.32)
20% 78.86 (1.19) 78.89 (0.38) 81.45 (0.62) 81.83 (0.57)
Land principally occupied by Agriculture, with significant areas of Natural Vegetation (ANV) 5% 58.79 (0.49) 51.23 (0.16) 58.80 (0.77) 60.46 (0.17)
10% 62.72 (0.46) 52.65 (0.56) 63.89 (0.98) 64.38 (0.32)
20% 67.77 (0.34) 54.36 (0.63) 69.17 (0.52) 68.97 (0.76)
Mixed Forest (MF) 5% 64.48 (0.76) 54.59 (0.60) 65.50 (0.30) 65.93 (0.60)
10% 71.05 (0.47) 60.41 (0.28) 72.06 (0.33) 71.89 (0.48)
20% 77.56 (0.69) 76.44 (0.26) 78.83 (0.14) 79.20 (0.26)
Non-Irrigated Arable Land (NIAL) 5% 70.07 (0.27) 55.10 (0.28) 72.67 (0.19) 71.45 (0.48)
10% 75.29 (0.37) 59.62 (2.37) 77.30 (0.12) 76.78 (0.76)
20% 79.97 (0.67) 80.23 (0.17) 82.05 (0.09) 81.72 (0.56)
Pastures 5% 72.70 (0.46) 59.41 (0.85) 74.16 (0.29) 73.61 (0.55)
10% 75.95 (0.55) 74.78 (0.29) 78.04 (0.31) 77.85 (0.65)
20% 80.31 (0.87) 80.42 (0.66) 82.38 (0.17) 82.10 (0.18)
Transitional Woodland/Shrub (TWS) 5% 57.12 (0.32) 51.30 (0.32) 58.21 (0.08) 59.64 (0.27)
10% 60.24 (0.21) 52.45 (0.61) 61.82 (0.13) 62.82 (0.61)
20% 64.98 (0.92) 55.47 (0.31) 67.15 (0.26) 67.24 (1.10)
Water Bodies (WB) 5% 76.52 (0.59) 54.29 (1.35) 77.47 (0.48) 78.71 (0.33)
10% 80.81 (0.69) 57.23 (0.73) 81.76 (0.34) 82.79 (0.35)
20% 85.27 (0.39) 86.11 (0.25) 85.46 (0.19) 86.66 (0.31)
Table 8: Maximizing area under the ROC curve for BigEarthNet, in a given FPR range [0, α].   The mean ROC-AUC values are reported over five random trials, along with std. deviations, for cross-entropy, Pairwise-loss, TFCO and ICO. Higher values are better.
Labels FPR ICO hyperparameters
Broad-Leaved Forest (BLF) 5%
10%
20%
Complex Cultivation patterns (CC) 5%
10%
20%
Coniferous Forest (CF) 5%
10%
20%
Discontinuous Urban Fabric (DUF) 5%
10%
20%
Land principally occupied by Agriculture, with significant areas of Natural Vegetation (ANV) 5%
10%
20%
Mixed Forest (MF) 5%
10%
20%
Non-Irrigated Arable Land (NIAL) 5%
10%
20%
Pastures 5%
10%
20%
Transitional Woodland/Shrub (TWS) 5%
10%
20%
Water Bodies (WB) 5%
10%
20%
Table 9: Maximizing area under the ROC curve for BigEarthNet, in a given FPR range [0, α]: hyperparameters for ICO selected using the validation set.  The surrogate function for both the objective and the constraint is taken to be softplus; the correction step in Alg. 1 is applied periodically using data from the following minibatches. The correction period, the number of accumulation minibatches, and the temperature hyperparameter are selected on the validation set.

Appendix D BigEarthNet results

We also report results on more BigEarthNet labels for the problem of maximizing partial area under the ROC curve (ROC-AUC) for FPR in the range [0, α], for α ∈ {5%, 10%, 20%} (Table 8). These results also show the standard deviation over five random trials, which was omitted in the main text due to space constraints. We also compare with the pairwise loss baseline for partial AUC as described earlier. The proposed ICO outperforms the cross-entropy and pairwise loss baselines in all cases, and also outperforms TFCO in most cases.