1 Introduction
Learning applications with binary classification problems involving severe label imbalance abound, often accompanied by specific requirements on false positive or false negative rates. Examples include spam classification, anomaly detection, and medical applications. Class imbalance is also often introduced as a result of reducing a problem to binary classification, as in multiclass problems Bishop (2006) and multilabel problems with extreme label sparsity Hsu et al. (2009).
Traditional performance measures such as misclassification rate are ill-suited to such situations, as it is usually trivial to optimize them by constantly predicting the majority class. Instead, the performance measures of choice in such cases are those that perform a more holistic evaluation over the entire data. Naturally, these performance measures are nondecomposable over the dataset and cannot be expressed as a sum of errors on individual data points. Popular examples include F-measure, G-mean, and H-mean.
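To make the point about misclassification rate concrete, the following minimal sketch scores a classifier that always predicts the majority class. It is illustrative only: the 1% positive rate and the helper name `rates` are assumptions, and G-mean is taken to be the usual geometric mean of the true positive and true negative rates.

```python
import math

def rates(y_true, y_pred):
    """Return (accuracy, TPR, TNR) for labels in {-1, +1}."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == -1 and p == -1)
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    return (tp + tn) / len(y_true), tp / n_pos, tn / n_neg

# Hypothetical data: 10 positives among 1000 points.
y_true = [1] * 10 + [-1] * 990
always_negative = [-1] * len(y_true)

acc, tpr, tnr = rates(y_true, always_negative)
g_mean = math.sqrt(tpr * tnr)
print(f"accuracy={acc:.2f} TPR={tpr:.2f} TNR={tnr:.2f} G-mean={g_mean:.2f}")
```

The trivial predictor looks excellent under accuracy yet gets a G-mean of zero, which is why measures built from both rates are preferred under imbalance.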
A consistent effort directed at optimizing these performance measures has, over the years, resulted in the development of two broad approaches: 1) surrogate-based approaches (e.g. SVMPerf Joachims et al. (2009)) that design convex surrogates for these performance measures, and 2) indirect approaches, which include cost-sensitive classification-based approaches Parambath et al. (2014) that solve weighted classification problems, and plug-in approaches Koyejo et al. (2014); Narasimhan et al. (2014) that rely on consistent estimates of class probabilities.
Both these approaches are known to work fairly well on small datasets but do not scale well to large ones, especially those too large to fit in memory. SVMPerf-style approaches, which employ cutting plane methods, do not scale well. Plug-in approaches, on the other hand, first need to solve a class probability estimation problem optimally and then tune a threshold. This two-stage approach prevents the method from exploiting better classifiers to automatically obtain better thresholds. Moreover, for multiclass problems, jointly estimating parameters can take time exponential in the number of classes.
For large datasets, streaming methods such as stochastic gradient descent Shalev-Shwartz et al. (2011) that take only a few passes over the entire data are preferable. However, traditional SGD techniques cannot handle nondecomposable losses. Recently, Kar et al. (2014) proposed optimizing SVMPerf-style surrogates using SGD techniques. Although their method is generic, allowing optimization of performance measures such as F-measure and partial AUC, it requires maintaining a large buffer to compute online gradient estimates, which can be prohibitive.
Motivated by the state of the art, we develop novel methods for optimizing two broad families of nondecomposable performance measures. Our methods incorporate truly pointwise updates, i.e. they do not require a buffer, and require only a few passes over the data. At an intuitive level, at the core of our work are adaptive linearization strategies for these performance measures, which make them amenable to SGD-style pointwise updates. Moreover, our linearizations are able to feed off the improvements made in learning a better classifier, resulting in faster convergence.
We consider two classes of performance measures:
Concave Performance Measures (see Table 1): These measures can be written as concave functions of the true positive rate (TPR) and true negative rate (TNR) and include G-mean, H-mean etc. We exploit the dual structure of these functions via their Fenchel dual to linearize them in terms of the TPR and TNR variables. Our method then, in parallel, tunes the dual variables in this linearization and maximizes the weighted TPR-TNR combination. These updates are done in an online fashion using stochastic mirror descent steps.
Pseudolinear Performance Measures (see Table 2): These measures can be written as fractional linear functions of TPR and TNR and include F-measure and the Jaccard coefficient. These functions need not be concave, so the techniques outlined above do not apply. Instead, we exploit the pseudolinear structure to linearize the function and develop a technique to alternately optimize the combination weights and the classifier model via stochastic updates. Although such “alternate-maximization” strategies in general need not converge even to a local optimum, we show that our strategy converges to an approximate global optimum with either batch updates or stochastic updates.
Finally, we present an empirical validation of our methods. Our experiments reveal that for a range of performance measures in both classes, our methods can be significantly faster than either plug-in or SVMPerf-style methods, while giving higher or comparable accuracies.
2 Related Works
As noted in Section 1, existing methods for optimizing the performance measures that we study can be divided into surrogate-based approaches and indirect approaches based on cost-sensitive classification or plug-in methods. A third approach applicable to certain performance measures is the decision-theoretic method, which learns a class probability estimate and computes predictions that maximize the expected value of the performance measure on a test set Lewis (1995); Ye et al. (2012). In addition to these, there exist methods dedicated to specific performance measures.
For instance, Parambath et al. (2014) focus on optimizing F-measure by exploiting the pseudolinearity of the function along with a cross-validation-based strategy. Our STAMP method, on the other hand, uses an alternating maximization strategy that does not require cross-validation, which considerably improves training time (see Figure 3). It is important to note that these performance measures have also been studied in multilabel settings, where they no longer remain nondecomposable. For instance, Dembczyński et al. (2013) study plug-in style methods for maximizing F-measure in multilabel settings, whereas works such as Koyejo et al. (2014); Narasimhan et al. (2014); Ye et al. (2012) study plug-in approaches for the same problem in the more challenging binary classification setting.
Historically, online learning algorithms have played a key role in designing solvers for large-scale batch problems. However, for nondecomposable loss functions, defining an online learning framework and providing efficient algorithms with small regret is itself challenging. Rakhlin et al. (2011) propose a generic method for such loss functions; however, the algorithms proposed therein run in exponential time. Kar et al. (2014) also study such measures with the aim of designing stochastic gradient-style methods. However, their methods require a large buffer to be maintained, which causes them to have poorer convergence guarantees and, in practice, to be slower than our methods.
By exploiting the special structure in our function classes, we are able to do away with such requirements. Our methods make use of standard online convex optimization primitives Zinkevich (2003). However, their application requires special care in order to avoid divergent behavior.
3 Problem Setting
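Let $\mathcal{X} \subset \mathbb{R}^d$ denote the instance space and $\mathcal{Y} = \{-1, +1\}$ the label space, with some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. Let $p := \Pr_{(x,y) \sim \mathcal{D}}[y = +1]$ denote the proportion of positives in the population. Let $\{(x_1, y_1), \ldots, (x_T, y_T)\}$ denote a sample of $T$ training points sampled i.i.d. from $\mathcal{D}$. For the sake of simplicity, we shall present our algorithms and analyses for a set of linear models $\mathcal{W} \subseteq \mathbb{R}^d$. Let $R_{\mathcal{X}}$ and $R_{\mathcal{W}}$ denote the radii of the domain and hypothesis class respectively.
We consider performance measures that can be expressed in terms of the true positive and negative rates of a classifier. To represent these measures, we shall use the notion of a reward function $r$ that assigns a reward $r(\hat{y}, y)$ to a prediction $\hat{y} \in \mathbb{R}$ when the true label is $y \in \mathcal{Y}$. We will use
$$r^+(w; x, y) = \frac{1}{p} \, r(w^\top x, y) \, \mathbb{1}(y = +1), \qquad r^-(w; x, y) = \frac{1}{1-p} \, r(w^\top x, y) \, \mathbb{1}(y = -1)$$
to calculate rewards on positive and negative points. Since $\mathbb{E}[r^+(w; x, y)] = \mathbb{E}[r(w^\top x, y) \mid y = +1]$, setting $r(\hat{y}, y) = \mathbb{1}(y \hat{y} > 0)$ gives us $\mathbb{E}[r^+(w; x, y)] = \mathrm{TPR}(w)$. For the sake of convenience, we will use $P(w) = \mathbb{E}[r^+(w; x, y)]$ and $N(w) = \mathbb{E}[r^-(w; x, y)]$ to denote population averages of the reward functions. We shall assume that our reward functions are concave, Lipschitz, and take values in a bounded range.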
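The role of the class-wise reward functions can be illustrated with a small sketch. It assumes that rewards on positives and negatives are rescaled by the inverse class proportions (1/p and 1/(1-p)), so that a plain average over the whole sample recovers a per-class average; the function names and the toy scores below are hypothetical.

```python
def zero_one_reward(score, y):
    """0-1 reward: 1 if the sign of the score agrees with the label y in {-1, +1}."""
    return 1.0 if y * score > 0 else 0.0

def r_plus(score, y, p, reward=zero_one_reward):
    """Reward on positives, rescaled by 1/p (assumed normalization)."""
    return reward(score, y) / p if y == 1 else 0.0

def r_minus(score, y, p, reward=zero_one_reward):
    """Reward on negatives, rescaled by 1/(1 - p) (assumed normalization)."""
    return reward(score, y) / (1.0 - p) if y == -1 else 0.0

# Hypothetical model scores and true labels.
scores = [2.0, -0.5, 1.0, -1.0, -2.0, 0.3, -0.7, -1.5]
labels = [1, 1, 1, 1, -1, -1, -1, -1]
p = sum(1 for y in labels if y == 1) / len(labels)

# Averages over the *whole* sample equal the empirical TPR and TNR.
P_hat = sum(r_plus(s, y, p) for s, y in zip(scores, labels)) / len(labels)
N_hat = sum(r_minus(s, y, p) for s, y in zip(scores, labels)) / len(labels)
print(P_hat, N_hat)
```

With the 0-1 reward these averages coincide with the empirical TPR and TNR; swapping in a concave surrogate reward would give the smoothed quantities that the algorithms optimize.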
4 Concave Performance Measures
The first class of performance measures we analyze are concave performance measures. These measures can be written as concave functions of the TPR and TNR, i.e.
$$\mathcal{P}_\Psi(w) = \Psi(\mathrm{TPR}(w), \mathrm{TNR}(w))$$
for some concave link function $\Psi$. A large number of popular performance measures fall in this family, since these measures are relevant in situations with severe label imbalance or in situations where cost-sensitive classification is required, such as detection theory Vincent (1994). Table 1 gives a list of such performance measures along with some of their relevant properties and references to works that utilize these performance measures.
Name  Expression  Mon.?  Lip.?  Sufficient dual region
Min (Vincent (1994))  ✓  ✓
H-mean (Kennedy et al. (2010))  ✓  ✓
Q-mean (Liu and Chawla (2011))  ✓  ✓
G-mean (Daskalaki et al. (2006))  ✓  ✗
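As a quick illustration of the measures named in Table 1, the sketch below evaluates them from a (TPR, TNR) pair. The expressions are the commonly used definitions of these measures and are stated here as assumptions, since the table's expression column is not reproduced in this text; the function names are hypothetical.

```python
import math

def min_measure(tpr, tnr):
    """Min: the smaller of the two rates."""
    return min(tpr, tnr)

def h_mean(tpr, tnr):
    """H-mean: harmonic mean of the two rates (0 when both are 0)."""
    return 0.0 if tpr + tnr == 0 else 2.0 * tpr * tnr / (tpr + tnr)

def q_mean(tpr, tnr):
    """Q-mean: one minus the quadratic mean of the two error rates."""
    return 1.0 - math.sqrt(((1.0 - tpr) ** 2 + (1.0 - tnr) ** 2) / 2.0)

def g_mean(tpr, tnr):
    """G-mean: geometric mean of the two rates."""
    return math.sqrt(tpr * tnr)

MEASURES = {"Min": min_measure, "H-mean": h_mean, "Q-mean": q_mean, "G-mean": g_mean}

# A balanced classifier versus one that always predicts the majority class.
for name, psi in MEASURES.items():
    print(name, round(psi(0.5, 0.75), 4), round(psi(0.0, 1.0), 4))
```

All four are increasing in both rates, and all four penalize the majority-class predictor (TPR 0, TNR 1) heavily.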
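We shall find it convenient to define the (concave) Fenchel conjugate of the link functions for our performance measures. For any concave function $\Psi$ and $(\alpha, \beta) \in \mathbb{R}^2$, define
$$\Psi^*(\alpha, \beta) = \inf_{u, v \in \mathbb{R}} \{\alpha u + \beta v - \Psi(u, v)\}.$$
By the concavity of $\Psi$, we have, for any $(u, v) \in \mathbb{R}^2$,
$$\Psi(u, v) = \inf_{\alpha, \beta \in \mathbb{R}} \{\alpha u + \beta v - \Psi^*(\alpha, \beta)\}.$$
We shall use the notation $\Psi$ to denote both the link function and the performance measure it induces.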
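The dual representation can be checked numerically on the simplest link function, Min. The sketch relies on a standard fact that is assumed here rather than taken from the text: the concave conjugate of min(u, v) is zero on the set where the two dual variables are non-negative and sum to one, and minus infinity elsewhere, so the infimum reduces to a minimum of alpha*u + (1 - alpha)*v over alpha in [0, 1]. The function name and grid size are hypothetical.

```python
def min_via_dual(u, v, grid=101):
    """Evaluate min over alpha in [0, 1] of alpha*u + (1 - alpha)*v on a grid."""
    values = []
    for k in range(grid):
        alpha = k / (grid - 1)
        values.append(alpha * u + (1.0 - alpha) * v)
    return min(values)

for u, v in [(0.2, 0.9), (0.7, 0.4), (0.5, 0.5)]:
    print(u, v, min(u, v), min_via_dual(u, v))
```

For any fixed dual pair the objective is linear in (u, v), which is the property that makes pointwise stochastic updates possible.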
4.1 A Stochastic Primaldual Method for Optimizing Concave Performance Measures
We now present a novel online stochastic method for optimizing the class of concave performance measures. The use of stochastic gradient techniques for these measures presents specific challenges due to their nondecomposable nature, which makes it difficult to obtain cheap, unbiased estimates of the gradient using a single point. Recent works Kar et al. (2013, 2014) have tried to resolve this issue by looking at mini-batch methods or by using a buffer to maintain a sketch of the stream. However, such techniques introduce a bias into the learning algorithm in the form of buffer size or mini-batch length, which results in slower convergence. Indeed, the rate of convergence that the 1PMB method of Kar et al. (2014) is able to guarantee is weaker than the rates SGD techniques are usually able to guarantee. This is indicative of suboptimal performance and our experiments confirm this (see Figure 3).
Here we show that for the class of concave performance measures, such workarounds are not necessary. To this end we present the SPADE algorithm (Algorithm 1), which exploits the dual structure of the performance measures to obtain efficient point updates that do not require the use of mini-batches or online buffers. SPADE is able to offer convergence guarantees identical to those that stochastic methods offer for additive performance measures such as least squares, without the presence of any algorithmic bias.
Let $\mathcal{W}$ and $\mathcal{A}$ be convex regions within the model and dual spaces respectively, and let $\Pi_{\mathcal{W}}$ and $\Pi_{\mathcal{A}}$ denote the projection operators onto them. Table 1 lists the relevant dual regions for the performance measures listed therein.
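The primal-dual idea can be sketched for the Min measure as follows. This is not Algorithm 1 itself, which is not reproduced in this text; it is a simplified illustration under several stated assumptions: a hinge-based reward min(1, y * score), rewards rescaled by inverse class proportions, a dual feasible set equal to the segment where the two dual variables are non-negative and sum to one, step sizes proportional to 1/sqrt(t), and a known positive rate `p`. All function names are hypothetical.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def project_ball(w, radius):
    """Euclidean projection onto the ball {w : ||w|| <= radius}."""
    norm = dot(w, w) ** 0.5
    return w if norm <= radius else [wi * radius / norm for wi in w]

def project_simplex2(a, b):
    """Euclidean projection of (a, b) onto {alpha + beta = 1, alpha, beta >= 0}."""
    alpha = min(1.0, max(0.0, (a - b + 1.0) / 2.0))
    return alpha, 1.0 - alpha

def primal_dual_min_sketch(stream, dim, p, eta=0.1, radius=10.0):
    """One pass of SPADE-style pointwise updates for Min(TPR, TNR)."""
    w, w_sum = [0.0] * dim, [0.0] * dim
    alpha, beta = 0.5, 0.5
    t = 0
    for x, y in stream:
        t += 1
        step = eta / t ** 0.5
        margin = y * dot(w, x)
        reward = min(1.0, margin)                  # hinge-based reward (assumed)
        scale = 1.0 / p if y == 1 else 1.0 / (1.0 - p)
        weight = alpha if y == 1 else beta
        if margin < 1.0:                           # reward gradient is y * x here
            # Primal ascent on alpha * r_plus + beta * r_minus.
            w = [wi + step * weight * scale * y * xi for wi, xi in zip(w, x)]
            w = project_ball(w, radius)
        # Dual descent: only the coordinate of the observed class moves.
        if y == 1:
            alpha, beta = project_simplex2(alpha - step * scale * reward, beta)
        else:
            alpha, beta = project_simplex2(alpha, beta - step * scale * reward)
        w_sum = [s + wi for s, wi in zip(w_sum, w)]
    return [s / max(t, 1) for s in w_sum], (alpha, beta)

# Hypothetical toy stream: one feature plus a bias term, about 10% positives.
rng = random.Random(0)
data = []
for _ in range(2000):
    y = 1 if rng.random() < 0.1 else -1
    data.append(([y * rng.uniform(1.5, 2.5), 1.0], y))
p_hat = sum(1 for _, y in data if y == 1) / len(data)
w_avg, (alpha, beta) = primal_dual_min_sketch(data, dim=2, p=p_hat)
print(w_avg, alpha, beta)
```

Each update touches a single point and needs no buffer: the dual variables shift weight toward whichever class currently earns less reward, and the model then takes a weighted ascent step.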
4.2 Convergence Analysis for SPADE
This section presents a convergence analysis for the SPADE algorithm. The convergence proof is formally stated in Theorem 4. Apart from demonstrating the utility of the algorithm, the proof also sheds light on the choice of algorithm parameters, such as primal/dual feasible regions.
We shall work with performance measures that are monotonically increasing in the true positive and negative rates of the classifier, i.e. if $u \geq u'$ and $v \geq v'$, then $\Psi(u, v) \geq \Psi(u', v')$. This is a natural assumption and is satisfied by all performance measures considered here (see Table 1). We now introduce two useful concepts.
Definition 1 (Stable Performance Measure).
A performance measure will be called stable if for some function , we have for all and ,
Table 1 lists the stability parameters of all the concave performance measures. Clearly, a performance measure has a linear stability parameter iff its corresponding link function is Lipschitz. We now define the notion of a sufficient dual region for a performance measure.
Definition 2 (Sufficient Dual Region).
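For any link function $\Psi$ with Fenchel conjugate $\Psi^*$, define its sufficient dual region to be the minimal set $\mathcal{A}_\Psi \subseteq \mathbb{R}^2$ such that for all $(u, v) \in \mathbb{R}^2$, we have
$$\Psi(u, v) = \inf_{(\alpha, \beta) \in \mathcal{A}_\Psi} \{\alpha u + \beta v - \Psi^*(\alpha, \beta)\}.$$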
The reason for defining this quantity will become clear in a moment. A closer look at Algorithm 1 indicates that it performs online gradient descent steps on the dual variables. Clearly, for this procedure to have statistical convergence properties, the magnitude of the updates must be bounded in some sense; otherwise the learning procedure may diverge. This motivates the projection step in Step 17. However, in order for the updated dual variables to be informative about the current primal function value, the projection step must be done in a way that does not distort the link function. The notion of a sufficient dual region formally captures such a projection step.
Having said that, there is no a priori guarantee that the sufficient dual region for a given performance measure would be bounded, in which case this entire exercise counts for naught. However, the following lemma, by closely linking the stability properties of a performance measure with the size of its sufficient dual region, shows that for well-behaved link functions, this will not be the case.
Lemma 3.
The stability parameter of a performance measure can be written as iff its sufficient dual region is bounded in a ball of radius .
The proof of this result follows from elementary manipulations and can be found in Appendix A. In some sense this result can be seen as a realization of the well known connection between the Fenchel dual of a function and its Lipschitz properties.
To simplify the initial analysis, we shall first concentrate on performance measures whose link functions are Lipschitz. It is easy to see that these are exactly the performance measures whose gradients do not diverge within any compact region of the real plane. Of the performance measures listed in Table 1, all except G-mean have associated link functions that are Lipschitz. Subsequently, we shall address the more involved case of non-Lipschitz performance measures such as G-mean as well.
Theorem 4.
Suppose we are given a stream of random samples drawn from a distribution over . Let be a concave, Lipschitz link function. Let Algorithm 1 be executed with a dual feasible set , and . Then, the average model output by the algorithm satisfies, with probability at least ,
We refer the reader to Appendix B for a proof and explicit constants. The proof closely analyzes the primal ascent and dual descent steps, tying them together using the Fenchel dual of .
4.3 The Case of Non-Lipschitz Link Functions
Non-Lipschitz link functions, such as the one used in the G-mean performance measure, pose a particular challenge to the previous analysis. Owing to their non-Lipschitz nature, their sufficient dual region is unbounded. Indeed, as Table 1 indicates, the sufficient dual region for G-mean extends indefinitely along both coordinate axes. More precisely, the gradients of the G-mean link function diverge as either of its arguments approaches zero. This poses a stumbling block for the proof of Theorem 4, since the regret and online-to-batch conversion results used therein fail.
A natural way to solve this problem is to ensure that the reward functions always assign rewards that are bounded away from zero, that is, every reward is at least some $\epsilon > 0$. For this restricted reward region, one can show, using Lemma 3, that the sufficient dual region can be restricted to a ball of bounded radius.
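The divergence, and the effect of keeping rewards away from zero, can be seen numerically. The sketch below assumes the usual G-mean link function sqrt(u * v); the helper names are hypothetical. On the square where both arguments lie in [eps, 1], the gradient norm is largest at a corner such as (eps, 1), where it equals 0.5 * sqrt(1/eps + eps).

```python
import math

def g_mean_grad(u, v):
    """Gradient of sqrt(u * v) for u, v > 0."""
    return 0.5 * math.sqrt(v / u), 0.5 * math.sqrt(u / v)

def grad_norm(u, v):
    gu, gv = g_mean_grad(u, v)
    return math.hypot(gu, gv)

def grad_norm_bound(eps):
    """Largest gradient norm of sqrt(u * v) over the square [eps, 1] x [eps, 1]."""
    return 0.5 * math.sqrt(1.0 / eps + eps)

# The gradient blows up as one argument approaches zero ...
for eps in (1e-1, 1e-2, 1e-4):
    print(eps, grad_norm(eps, 1.0), grad_norm_bound(eps))
```

For any fixed eps the bound is finite, which is why restricting rewards to be at least eps yields a bounded dual region, at the price of a bound that degrades as eps shrinks.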
The above discussion suggests that we regularize the reward function, i.e. at each time step $t$, we add a small value $\epsilon(t)$ to the original reward function. However, the amount of regularization remains to be decided, since over-regularization could cause our resulting excess risk bound to be vacuous with respect to the original reward function. It turns out that letting the regularization decay with $t$ at a suitable rate strikes a fine balance between regularization and fidelity to the original reward function; this seems intuitive since the regularization becomes milder and milder as learning progresses. The following extension of Theorem 4 formalizes this statement.
Theorem 5.
The proof of this theorem can be found in Appendix C. We note here that primal dual frameworks have been utilized before in diverse areas such as distributed optimization Jaggi et al. (2014) and multiobjective optimization Mahdavi et al. (2013). However, these works simply assume the functions involved therein to be Lipschitz and/or smooth and do not address cases where they fail to be so. Theorem 5 on the other hand, is able to recover a nontrivial, albeit weaker, statement even for locally Lipschitz functions.
5 Pseudolinear Performance Measures
Name  Popular Form  Canonical Form  Mon.?  Range  Rate
F-measure (Manning et al.)  ✓
Jaccard Coefficient (Koyejo et al.)  ✓
Gower-Legendre (Sokolova and Lapalme)  ✓
Gower-Legendre (Sokolova and Lapalme)  ✓
The second class of performance measures we analyze are pseudolinear performance measures. These measures have a fractional linear function as the link function and can be written as follows:
$$\mathcal{P}_{(a,b)}(w) = \frac{a_0 + a_1 \cdot \mathrm{TPR}(w) + a_2 \cdot \mathrm{TNR}(w)}{b_0 + b_1 \cdot \mathrm{TPR}(w) + b_2 \cdot \mathrm{TNR}(w)}$$
for some weighing coefficients $a = (a_0, a_1, a_2)$ and $b = (b_0, b_1, b_2)$. Several popularly used performance measures, most notably the F-measure, can be represented as pseudolinear functions. Table 2 enumerates some popular pseudolinear performance measures as well as their properties.
We note that these performance measures are usually represented in the literature using the entries of the confusion matrix. However, for the sake of our analysis, we shall find it useful to represent them in terms of the true positive and true negative rates. To do so, we shall use $p$ to denote the proportion of positives in the population and $\theta = \frac{1-p}{p}$ to denote the label skew.
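The passage from confusion-matrix entries to rates is simple arithmetic, and the sketch below checks it for the F1-measure and the Jaccard coefficient. The rate-based expressions are derived here by substituting TP = p * TPR, FN = p * (1 - TPR) and FP = (1 - p) * (1 - TNR) and dividing through by p, with the label skew taken to be (1 - p) / p; they are offered as a worked example rather than as the table's exact canonical forms. The counts are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    return 2.0 * tp / (2.0 * tp + fp + fn)

def jaccard_from_counts(tp, fp, fn):
    return tp / (tp + fp + fn)

def f1_from_rates(tpr, tnr, theta):
    """F1 as a ratio of two functions that are linear in (TPR, TNR)."""
    return 2.0 * tpr / (1.0 + theta + tpr - theta * tnr)

def jaccard_from_rates(tpr, tnr, theta):
    """Jaccard coefficient as a ratio of two functions linear in (TPR, TNR)."""
    return tpr / (1.0 + theta - theta * tnr)

# Hypothetical confusion matrix with 5% positives.
tp, fn, fp, tn = 30, 20, 45, 905
n_pos, n_neg = tp + fn, fp + tn
tpr, tnr = tp / n_pos, tn / n_neg
theta = n_neg / n_pos          # equals (1 - p) / p

print(f1_from_counts(tp, fp, fn), f1_from_rates(tpr, tnr, theta))
print(jaccard_from_counts(tp, fp, fn), jaccard_from_rates(tpr, tnr, theta))
```

Both measures are ratios whose numerator and denominator are linear in the two rates, which is exactly the fractional linear structure described above.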
5.1 Alternatemaximization for Optimizing Pseudolinear Performance Measures
Pseudolinear functions are so named because their level sets can be described using linear halfspaces. More specifically, every pseudolinear function has an associated “level-finder” function: for any level, it yields an affine function of the arguments whose sign indicates whether the pseudolinear function attains at least that level. We refer the reader to Parambath et al. (2014) for a more relaxed introduction to these functions and their properties. For our purposes, however, it suffices to notice that this property immediately points toward a cost-sensitive method to optimize these performance measures.
This fact was noticed by Parambath et al. (2014) who exploited this to develop a costsensitive classification method for optimizing the Fmeasure by simply searching for the best weights with which to perform costsensitive classification. However, we notice that instead of performing such a brute force search, one can adaptively tune the weights to better and better values and obtain much faster convergence. To develop this intuition, we first define the notion of a valuation function below.
Definition 6 (Valuation Function).
The valuation function of a performance measure , for a classifier , and at a level is defined as
where .
The following wellknown lemma closely links the valuation function to the performance measure.
Lemma 7.
For any performance measure , and we have . Moreover, in such a situation we say that classifier has achieved valuation at level .
Lemma 7 indicates that the performance of a classifier is intimately linked to its valuation. This suggests a natural alternate maximization approach wherein we alternate between posing a challenge level to the classifier and training a classifier to achieve that level. The resulting algorithm AMP is detailed in Algorithm 2. Note that using Lemma 7, step 5 in the algorithm can be executed simply by setting . Thus, in a very natural manner, the current classifier challenges the next classifier to beat its own performance. It turns out that this approach results in rapid convergence as outlined in the following theorem.
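The alternation between posing a challenge level and beating it can be sketched on a toy problem. This is not Algorithm 2 itself, which is not reproduced in this text: a finite list of hypothetical (TPR, TNR) pairs stands in for the model space, the measure is F1 written in terms of rates with label skew `theta`, and the valuation is taken to be the numerator minus the level times the denominator, so that it is non-negative exactly when the measure is at least the level.

```python
def f1(tpr, tnr, theta):
    return 2.0 * tpr / (1.0 + theta + tpr - theta * tnr)

def valuation_f1(tpr, tnr, theta, v):
    """Numerator minus v times denominator of F1; >= 0 iff F1 >= v."""
    return 2.0 * tpr - v * (1.0 + theta + tpr - theta * tnr)

def amp_finite(candidates, theta, max_iter=100, tol=1e-12):
    """Alternate between setting the level and maximizing the valuation."""
    current = candidates[0]
    levels = []
    for _ in range(max_iter):
        v = f1(current[0], current[1], theta)   # challenge level = current value
        levels.append(v)
        best = max(candidates, key=lambda c: valuation_f1(c[0], c[1], theta, v))
        if valuation_f1(best[0], best[1], theta, v) <= tol:
            break                               # nobody beats the current level
        current = best
    return current, levels

# Hypothetical classifiers described by their (TPR, TNR); 10% positives.
candidates = [(0.1, 0.99), (0.6, 0.95), (0.9, 0.7), (1.0, 0.0), (0.8, 0.9)]
theta = 9.0
best, levels = amp_finite(candidates, theta)
print(best, levels)
```

Because each valuation maximization is a linear (cost-sensitive) problem in the rates, every round is cheap, and the challenge level strictly increases until no candidate can beat it.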
Theorem 8.
Let Algorithm 2 be executed with a performance measure and reward functions that offer values in the range . Let . Also let be the excess error for the model generated at time . Then there exists a value such that .
The proof of this theorem can be found in Appendix D. Table 2 gives values for the convergence rates of all the pseudolinear performance measures, as well as the allowed range of values that the reward functions can take for those measures. This is important since performance measures such as the Fmeasure diverge if the reward function values approach . Other performance measures like the GowerLegendre measure do not impose any such restrictions. Note that the above result shows that Algorithm 2 will always terminate in steps.
At this point it would be apt to make a historical note. Pseudolinear functions have enjoyed a fair amount of interest in the optimization community Schaible (1976); Dinkelbach (1967); Jagannathan (1966) within the subfield of fractional programming. Of the many methods that have been developed to optimize these functions, the Dinkelbach-Jagannathan (DJ) procedure Dinkelbach (1967); Jagannathan (1966) is of specific interest to us. It turns out that the AMP method can be seen as performing DJ-style updates over parameterized spaces (the parameter being the model). It is known (for instance, see Schaible (1976)) that the DJ process is able to offer a linear convergence rate. Our proof of Theorem 8, which was obtained independently, can then be seen as giving a similar result in the parameterized setting.
However, we wish to move one step further and optimize these performance measures in an online stochastic manner. To this end, we observe that the AMP algorithm can be executed in an online fashion by using stochastic updates to train the intermediate models. The resulting algorithm, STAMP, is presented in Algorithm 3. However, this algorithm is much harder to analyze because, unlike AMP, which has the luxury of offering exact updates, STAMP offers inexact, even noisy updates. Indeed, even existing works in the optimization community (for example Schaible (1976)) do not seem to have analyzed DJ-style methods with noisy updates.
Our next contribution hence, is an analysis of the convergence rate offered by the AMP algorithm when neither of the two maximizations is carried out exactly. For the sake of simplicity, we present the STAMP algorithm and its analysis for the case of measure. Suppose at each time step, for some , we have
then for some , we have
As a corollary we present a convergence analysis for the STAMP algorithm in Theorem 9.
Theorem 9.
Let Algorithm 3 be executed with a performance measure and reward functions with range . Let be the rate of convergence guaranteed for by the AMP algorithm. Set the epoch lengths to . Then after epochs, we can ensure with probability at least that . Moreover the number of samples consumed till this point is at most .
The convergence analysis for noisy AMP can be found in Appendix E. The proof of this theorem can be found in Appendix F. Both results require a fine grained analysis of how errors accumulate throughout the learning process.
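A stochastic version of the alternation can be sketched as follows. This is not Algorithm 3 itself, which is not reproduced in this text, and several choices are assumptions made for illustration: a hinge-based reward, per-sample weights of (2 - v) on positives and v on negatives (obtained from the F1 valuation at level v when rewards are averaged per class), epoch lengths that double after every epoch, use of the last iterate rather than an average, and a challenge level computed on the full sample for simplicity. All names are hypothetical.

```python
import random

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def empirical_f1(w, data):
    tp = fp = fn = 0
    for x, y in data:
        pred = 1 if dot(w, x) > 0 else -1
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1:
            fp += 1
        elif y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2.0 * tp / denom if denom > 0 else 0.0

def stamp_f1_sketch(data, dim, epochs=6, first_epoch=50, eta=0.1):
    w, v = [0.0] * dim, 0.0          # model and current challenge level
    levels, seen, length = [], 0, first_epoch
    for _ in range(epochs):
        # Model step: pointwise cost-sensitive ascent at the fixed level v.
        for i in range(length):
            x, y = data[(seen + i) % len(data)]
            if y * dot(w, x) < 1.0:  # hinge-based reward has gradient y * x
                weight = (2.0 - v) if y == 1 else v
                step = eta / (seen + i + 1) ** 0.5
                w = [wi + step * weight * y * xi for wi, xi in zip(w, x)]
        seen += length
        length *= 2                  # doubled epoch length
        # Challenge step: the current model sets the next level.
        v = empirical_f1(w, data)
        levels.append(v)
    return w, levels

# Hypothetical toy data: one feature plus a bias term, about 10% positives.
rng = random.Random(0)
data = []
for _ in range(2000):
    y = 1 if rng.random() < 0.1 else -1
    data.append(([y * rng.uniform(1.5, 2.5), 1.0], y))
w, levels = stamp_f1_sketch(data, dim=2)
print(levels)
```

At level zero only positives carry weight, so the model first learns to fire on positives; as the level rises, negatives gain weight and false positives are penalized more heavily.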
6 Experimental Results
We shall now compare our methods with the state of the art on various performance measures and datasets.
Datasets: We evaluated our methods on 5 publicly available benchmark datasets: a) PPI, b) KDD Cup 2008, c) IJCNN, d) Covertype, e) MNIST. All datasets exhibited moderate to severe label imbalance with the KDD Cup 2008 dataset having just positives.
Methods: We instantiated the SPADE algorithm (Algorithm 1) on the Q-mean and MinTPR/TNR performance measures. We also instantiated the STAMP method (Algorithm 3) on the F1-measure and the JAC coefficient. In both cases we compared to the SVMPerf method Joachims et al. (2009) and the plug-in method Koyejo et al. (2014) specialized to these measures. For the sake of reference, we also compared to the standard logistic regression method for (unweighted) binary classification. Additionally, for the F1-measure, we also compared to the 1PMB stochastic gradient descent method proposed recently by Kar et al. (2014). All methods were implemented in C.
Parameters: We used a portion of the dataset for training and the rest for testing. Tunable parameters, including thresholds for the plug-in approaches, were cross-validated on a validation set. All results reported here were averaged over 5 random train-test splits. We used hinge loss-based reward functions for our methods. STAMP was executed by setting the challenge level to the actual F-measure/JAC at each stage. We used a state-of-the-art L-BFGS solver to implement the plug-in methods and used standard implementations of the SVMPerf algorithm. Since our methods are able to take a single pass over the data very rapidly, SPADE was allowed to run for 25 passes over the data, and STAMP was allowed 25 passes with an initial epoch length that was doubled after every iteration. The SVMPerf algorithm was given a runtime budget relative to that of our method, after which it was terminated. The L-BFGS solver was always allowed to run till convergence.
Figures 1 and 2 compare the SPADE method with the baseline methods for the Qmean and MinTPR/TNR measures. In general, SPADE was found to offer comparable or superior accuracies with greatly accelerated convergence as compared to other methods. On the IJCNN and Covtype datasets, SPADE outperformed every other method by about 23%. As SPADE is a stochastic first order method, it is expected to rapidly find out a fairly accurate solution. Indeed, the method was found to offer greatly accelerated convergence without fail. For instance, on the MNIST dataset, SPADE found out the best solution as much as faster than any other method whereas on the KDD Cup and PPI datasets it was and faster respectively. The SVMPerf method, on the other hand, was found to be extremely slow in general and require at least an order of magnitude time more than SPADE to find reasonably accurate solutions. It is also notable that in all cases, simple binary classification gave very poor accuracies due to the severe label imbalance in these datasets.
Figures 3 and 4 report the performance of the STAMP method applied to pseudolinear functions. Similar to the concave measures, STAMP was found to provide competitive accuracies as compared to the baseline methods but require at least less computational time. Interestingly, for the F1measure, the 1PMB method, which is another stochastic gradient descentbased method, was found to struggle to obtain accuracies similar to that of STAMP or else offer much slower convergence. We suspect two main reasons for the suboptimal behavior of this other stochastic method. Firstly these results confirm the adverse effect of the dependence on an inmemory buffer on these methods. It is notable that this dependence causes even the theoretical convergence rates for these methods to be weaker as was noted earlier in the discussion. Secondly, we note that both SVMPerf and 1PMB optimize the same “structSVM” style surrogate for the Fmeasure Kar et al. (2014). This surrogate has been observed to give poor accuracies when compared to plugin methods in several previous works Koyejo et al. (2014); Narasimhan et al. (2014). STAMP on the other hand, works directly with Fmeasure in a manner similar to, but faster than, the plugin methods which might explain its better performance.
Acknowledgements
HN thanks support from a Google India PhD Fellowship.
References
 Bishop [2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. SpringerVerlag New York, Inc., 2006.
 CesaBianchi et al. [2001] Nicoló CesaBianchi, Alex Conconi, and Claudio Gentile. On the Generalization Ability of OnLine Learning Algorithms. In 15th Annual Conference on Neural Information Processing Systems (NIPS), pages 359–366, 2001.

 Daskalaki et al. [2006] Sophia Daskalaki, Ioannis Kopanas, and Nikolaos Avouris. Evaluation of Classifiers for an Uneven Class Distribution Problem. Applied Artificial Intelligence, 20:381–417, 2006.
 Dembczyński et al. [2013] Krzysztof Dembczyński, Arkadiusz Jachnik, Wojciech Kotlowski, Willem Waegeman, and Eyke Hüllermeier. Optimizing the F-Measure in Multi-Label Classification: Plug-in Rule Approach versus Structured Loss Minimization. In 30th International Conference on Machine Learning (ICML), 2013.
 Dinkelbach [1967] Werner Dinkelbach. On Nonlinear Fractional Programming. Management Science, 13(7, Series A, Sciences):492–498, 1967.
 Hsu et al. [2009] Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. MultiLabel Prediction via Compressed Sensing. In 23rd Annual Conference on Neural Information Processing Systems (NIPS), pages 772–780, 2009.
 Jagannathan [1966] R. Jagannathan. On Some Properties of Programming Problems in Parametric Form Pertaining to Fractional Programming. Management Science, 12(7, Series A, Sciences):609–615, 1966.
 Jaggi et al. [2014] Martin Jaggi, Virginia Smith, Martin Takác, Jonathan Terhorst, Sanjay Krishnan, Thomas Hofmann, and Michael I. Jordan. CommunicationEfficient Distributed Dual Coordinate Ascent. In 28th Annual Conference on Neural Information Processing Systems (NIPS), pages 694–702, 2014.
 Joachims et al. [2009] Thorsten Joachims, Thomas Finley, and ChunNam John Yu. Cuttingplane training of structural SVMs. Machine Learning, 77(1):27–59, 2009.
 Kar et al. [2013] Purushottam Kar, Bharath K Sriperumbudur, Prateek Jain, and Harish Karnick. On the Generalization Ability of Online Learning Algorithms for Pairwise Loss Functions. In 30th International Conference on Machine Learning (ICML), 2013.
 Kar et al. [2014] Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. Online and Stochastic Gradient Methods for Nondecomposable Loss Functions. In 28th Annual Conference on Neural Information Processing Systems (NIPS), pages 694–702, 2014.
 Kennedy et al. [2010] Kenneth Kennedy, Brian Mac Namee, and Sarah Jane Delany. Learning without default: a study of oneclass classification and the lowdefault portfolio problem. In International Conference on Artificial Intelligence and Cognitive Science (ICAICS), volume 6202 of Lecture Notes in Computer Science, pages 174–187, 2010.
 Koyejo et al. [2014] Oluwasanmi O. Koyejo, Nagarajan Natarajan, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Consistent Binary Classification with Generalized Performance Metrics. In 28th Annual Conference on Neural Information Processing Systems (NIPS), pages 2744–2752, 2014.
 Lewis [1995] D.D. Lewis. Evaluating and optimizing autonomous text classification systems. In 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1995.

 Liu and Chawla [2011] W. Liu and S. Chawla. A Quadratic Mean based Supervised Learning Model for Managing Data Skewness. In 11th SIAM International Conference on Data Mining (SDM), 2011.
 Mahdavi et al. [2013] Mehrdad Mahdavi, Tianbao Yang, and Rong Jin. Stochastic Convex Optimization with Multiple Objectives. In 27th Annual Conference on Neural Information Processing Systems (NIPS), pages 1115–1123, 2013.
 Manning et al. [2008] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
 Narasimhan et al. [2014] Harikrishna Narasimhan, Rohit Vaish, and Shivani Agarwal. On the Statistical Consistency of Plugin Classifiers for Nondecomposable Performance Measures. In 28th Annual Conference on Neural Information Processing Systems (NIPS), 2014.
 Parambath et al. [2014] Shameem Puthiya Parambath, Nicolas Usunier, and Yves Grandvalet. Optimizing FMeasures by CostSensitive Classification. In 28th Annual Conference on Neural Information Processing Systems (NIPS), pages 2123–2131, 2014.
 Rakhlin et al. [2011] Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online Learning: Beyond Regret. In 24th Annual Conference on Learning Theory (COLT), 2011.
 Schaible [1976] Siegfried Schaible. Fractional Programming. II, on Dinkelbach’s Algorithm. Management Science, 22(8):868–873, 1976.
 Shalev-Shwartz et al. [2011] Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, and Andrew Cotter. Pegasos: primal estimated sub-gradient solver for SVM. Math. Program., 127(1):3–30, 2011.
 Sokolova and Lapalme [2009] Marina Sokolova and Guy Lapalme. A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4):427–437, 2009.
 Vincent [1994] P.H. Vincent. An Introduction to Signal Detection and Estimation. SpringerVerlag New York, Inc., 1994.
 Ye et al. [2012] Nan Ye, Kian Ming A. Chai, Wee Sun Lee, and Hai Leong Chieu. Optimizing FMeasures: A Tale of Two Approaches. In 29th International Conference on Machine Learning (ICML), 2012.
 Zinkevich [2003] Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In 20th International Conference on Machine Learning (ICML), pages 928–936, 2003.
Appendix A Proof of Lemma 3
Lemma 3.
The stability parameter of a performance measure can be written as iff its sufficient dual region is bounded in a ball of radius .
Proof.
Let us denote primal variables using the notation and dual variables using the notation . The proof follows from the fact that any value of for which can be safely excluded from the sufficient dual region.
For proving the result in one direction suppose is stable with for some . Now consider some such that . Now set . Then we have
Thus, we can conclude that no dual vector with norm greater than can be a part of the sufficient dual region. This shows that the sufficient dual region is bounded inside a ball of radius .
For proving the result in the other direction, suppose the dual sufficient region is indeed bounded in a ball of radius . Consider two points such that
Now define so that, by the above definition, and . Now we have
where the fourth step follows from the norm bound on . Similarly we have
This establishes the result. ∎
Appendix B Proof of Theorem 4
Theorem 4.
Suppose we are given a stream of random samples drawn from a distribution over . Let be a concave, Lipschitz link function. Let Algorithm 1 be executed with a dual feasible set , and . Then, the average model output by the algorithm satisfies, with probability at least ,
Proof.
For this proof we shall assume that is Lipschitz so that its sufficient dual region can be bounded by an application of Lemma 3. Notice that the updates for can be written as follows:
where
which can be interpreted as simple gradient descent with . Moreover, since is concave, is convex with respect to for every . Note that the terms and do not involve and hence act as arbitrary bounded positive constants for this part of the analysis.
Note that by Lemma 3, we have the radius of bounded by . Also, since is a monotone function, by a similar argument, can be shown to be a Lipschitz function. For all the performance measures considered, we have . Thus, is a Lipschitz function. Hence, using a standard GIGAstyle analysis Zinkevich [2003] on the (descent) updates on and in Algorithm 1, we have (for )
where the last step follows from Fenchel conjugacy.
Further, noting that , and , we use the standard onlinebatch conversion bounds CesaBianchi et al. [2001] to the loss functions and individually to obtain w.h.p.
By monotonicity of , we get
(1)  
where the second inequality follows from stability of , and the third inequality follows from concavity of and , Jensen’s inequality, and stability of .
Similarly, the update to can be written as
where is the projection operator for the domain and
Since are concave and the term does not involve , is convex in for all . Also, we can show that