# Loss Functions for Top-k Error: Analysis and Insights

In order to push performance on realistic computer vision tasks, the number of classes in modern benchmark datasets has increased significantly in recent years. This growth in the number of classes comes with increased ambiguity between class labels, raising the question of whether top-1 error is the right performance measure. In this paper, we provide an extensive comparison and evaluation of established multiclass methods, comparing their top-k performance from both a practical and a theoretical perspective. Moreover, we introduce novel top-k loss functions as modifications of the softmax and the multiclass SVM losses and provide efficient optimization schemes for them. In the experiments, we compare all of the proposed and established methods for top-k error optimization on various datasets. An interesting insight of this paper is that the softmax loss yields competitive top-k performance for all k simultaneously. For a specific top-k error, our new top-k losses typically lead to further improvements while being faster to train than the softmax.



## 1 Introduction

The number of classes is rapidly growing in modern computer vision benchmarks [47, 62]. Typically, this also leads to ambiguity in the labels as classes start to overlap. Even for humans, the top-1 error rates are often quite high, e.g. on SUN 397 [60]. While previous research focuses on minimizing the top-1 error, we address top-k error optimization in this paper. We are interested in two cases: a) achieving a small top-k error for all reasonably small k; and b) minimization of a specific top-k error.

While it is argued in [2] that the one-versus-all (OVA) SVM scheme performs on par in top-1 and top-5 accuracy with the other SVM variations based on ranking losses, we have recently shown in [28] that minimization of the top-k hinge loss leads to improvements in top-k performance compared to OVA SVM, multiclass SVM, and other ranking-based formulations. In this paper, we study top-k error optimization from a wider perspective. On the one hand, we compare OVA schemes and direct multiclass losses in extensive experiments; on the other, we present a theoretical discussion regarding their calibration for the top-k error. Based on these insights, we suggest new families of loss functions for the top-k error. Two are smoothed versions of the top-k hinge losses [28], and the other two are top-k versions of the softmax loss. We discuss their advantages and disadvantages, and for the convex losses provide an efficient implementation based on stochastic dual coordinate ascent (SDCA) [48].

We evaluate a battery of loss functions on datasets covering tasks ranging from text classification to large-scale vision benchmarks, including fine-grained and scene classification. We systematically optimize and report results separately for each top-k accuracy. One interesting message that we would like to highlight is that the softmax loss is able to optimize all top-k error measures simultaneously. This is in contrast to multiclass SVM and is also reflected in our experiments. Finally, we show that our new top-k variants of the smooth multiclass SVM and the softmax loss can further improve top-k performance for a specific k.

Related work. Top-k optimization has recently received revived attention with the advent of large-scale problems [20, 28, 30, 31]. The top-k error in multiclass classification, which promotes a good ranking of class labels for each example, is closely related to the precision@k metric in information retrieval, which counts the fraction of positive instances among the k top-ranked examples. In essence, both approaches enforce a desirable ranking of items [28].

The classic approaches optimize pairwise ranking [7, 11, 25, 53]. An alternative direction was proposed by Usunier et al. [54], who described a general family of convex loss functions for ranking and classification. One of the loss functions that we consider [28] also falls into that family. Weston et al. [59] then introduced Wsabie, which optimizes an approximation of a ranking-based loss from [54]. A Bayesian approach was suggested by [51].

Recent works focus on the top of the ranked list [1, 9, 39, 46] and scalability to large datasets [20, 28, 30], and explore transductive learning [31] and prediction of tuples [45].

Contributions. We study the problem of top-k error optimization on a diverse range of learning tasks. We consider existing methods as well as propose four novel loss functions for minimizing the top-k error. A brief overview of the methods is given in Table 1. For the proposed convex top-k losses, we develop an efficient optimization scheme based on SDCA¹, which can also be used for training with the softmax loss. All methods are evaluated empirically in terms of the top-k error and, whenever possible, in terms of classification calibration. We discover that the softmax loss and the proposed smooth top-k SVM are astonishingly competitive in all top-k errors. Further small improvements can be obtained with the new top-k losses.

¹ Code available at: https://github.com/mlapin/libsdca

## 2 Loss Functions for Top-k Error

We consider multiclass problems with $m$ classes, where the training set consists of $n$ examples $x_i$ along with the corresponding labels $y_i \in \mathcal{Y} = \{1, \dots, m\}$. We use $\pi$ to denote a permutation of the indexes $\{1, \dots, m\}$. Unless stated otherwise, $\pi$ reorders the components of a vector $a$ in descending order, i.e. $a_{\pi_1} \ge a_{\pi_2} \ge \dots \ge a_{\pi_m}$.

While we consider linear classifiers in our experiments, all loss functions below are formulated in the general setting where a function $f : \mathcal{X} \to \mathbb{R}^m$ is learned and prediction at test time is done via $\operatorname{arg\,max}_{y \in \mathcal{Y}} f_y(x)$, resp. the top-$k$ predictions. For the linear case, all predictors have the form $f_y(x) = \langle w_y, x \rangle$. Let $W$ be the stacked weight matrix, $L$ be a convex loss function, and $\lambda > 0$ be a regularization parameter. We consider the following multiclass optimization problem

$$ \min_{W} \; \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \lambda \, \lVert W \rVert_F^2. $$

We use the Iverson bracket notation $[\![P]\!]$, defined as $[\![P]\!] = 1$ if $P$ is true and $0$ otherwise, and introduce the shorthand $p_y(x) = \Pr(Y = y \mid X = x)$. We generalize the standard zero-one error and allow $k$ guesses instead of one. Formally, the top-$k$ zero-one loss (top-$k$ error) is

$$ \operatorname{err}_k\big(y, f(x)\big) \triangleq [\![\, f_{\pi_k}(x) > f_y(x) \,]\!]. \tag{1} $$

Note that for $k = 1$ we recover the standard zero-one error. Top-$k$ accuracy is defined as $1$ minus the top-$k$ error.
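As a concrete illustration of definition (1), here is a minimal numpy sketch (the function name and the toy scores are ours, not from the paper); ties are resolved in favor of the true class, matching the strict inequality in (1).

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Empirical top-k zero-one error, following Eq. (1): an example is an
    error iff the k-th largest score strictly exceeds the true class score."""
    scores = np.asarray(scores, dtype=float)
    # k-th largest score per example (position k-1 after sorting descending)
    kth_largest = -np.sort(-scores, axis=1)[:, k - 1]
    true_scores = scores[np.arange(len(labels)), labels]
    return np.mean(kth_largest > true_scores)

scores = np.array([[0.5, 0.3, 0.2],
                   [0.1, 0.6, 0.3],
                   [0.2, 0.5, 0.3]])
labels = np.array([0, 2, 2])
print(top_k_error(scores, labels, k=1))  # 0.666...: examples 2 and 3 miss
print(top_k_error(scores, labels, k=2))  # 0.0: all true classes in the top 2
```

Top-$k$ accuracy is then simply `1 - top_k_error(...)`.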

### 2.1 Bayes Optimality and Top-k Calibration

In this section, we establish the best achievable top-$k$ error, determine when a classifier achieves it, and define a notion of top-$k$ calibration.

**Lemma 2.1.** The Bayes optimal top-$k$ error at $x$ is

$$ \min_{g \in \mathbb{R}^m} \mathbb{E}_{Y \mid X}\big[\operatorname{err}_k(Y, g) \mid X = x\big] = 1 - \sum_{j=1}^{k} p_{\tau_j}(x), $$

where $\tau$ is a permutation with $p_{\tau_1}(x) \ge p_{\tau_2}(x) \ge \dots \ge p_{\tau_m}(x)$. A classifier $f$ is top-$k$ Bayes optimal at $x$ if and only if the $k$ classes with the largest scores $f_{\pi_1}(x), \dots, f_{\pi_k}(x)$ attain the maximal probability mass $\sum_{j=1}^{k} p_{\tau_j}(x)$.

###### Proof.

Let $p_y(x) = \Pr(Y = y \mid X = x)$ and let $\pi$ be a permutation such that $g_{\pi_1} \ge g_{\pi_2} \ge \dots \ge g_{\pi_m}$. The expected top-$k$ error at $x$ is

$$ \mathbb{E}_{Y \mid X}\big[\operatorname{err}_k(Y, g) \mid X = x\big] = \sum_{y \in \mathcal{Y}} [\![\, g_{\pi_k} > g_y \,]\!]\, p_y(x) = \sum_{j \in \mathcal{Y}} [\![\, g_{\pi_k} > g_{\pi_j} \,]\!]\, p_{\pi_j}(x) = \sum_{j=k+1}^{m} p_{\pi_j}(x) = 1 - \sum_{j=1}^{k} p_{\pi_j}(x). $$

The error is minimal when $\sum_{j=1}^{k} p_{\pi_j}(x)$ is maximal, which corresponds to taking the $k$ largest conditional probabilities and yields the Bayes optimal top-$k$ error at $x$.

Since the relative order within the top-$k$ scores is irrelevant for the top-$k$ error, any classifier $g$ for which the sets $\{\pi_1, \dots, \pi_k\}$ and $\{\tau_1, \dots, \tau_k\}$ coincide is Bayes optimal.

Note that we assumed that there is a clear cut between the $k$ most likely classes and the rest. In general, ties can be resolved arbitrarily as long as we can guarantee that the $k$ largest components of $g$ correspond to classes (indexes) that yield the maximal sum $\sum_{j=1}^{k} p_{\tau_j}(x)$ and thus lead to top-$k$ Bayes optimality. ∎

Optimization of the zero-one loss (and, by extension, the top-$k$ error) leads to hard combinatorial problems. Instead, a standard approach is to use a convex surrogate loss which upper bounds the zero-one error. Under mild conditions on the loss function [3, 52], the optimal classifier w.r.t. the surrogate yields a Bayes optimal solution for the zero-one loss. Such a loss is called classification calibrated, which is known in statistical learning theory to be a necessary condition for a classifier to be universally Bayes consistent [3]. We now introduce the notion of calibration for the top-$k$ error. A loss function $L$ (or a reduction scheme) is called top-$k$ calibrated if, for all possible data generating measures on $\mathcal{X} \times \mathcal{Y}$ and all $x \in \mathcal{X}$,

$$ \operatorname*{arg\,min}_{g \in \mathbb{R}^m} \mathbb{E}_{Y \mid X}\big[L(Y, g) \mid X = x\big] \;\subseteq\; \operatorname*{arg\,min}_{g \in \mathbb{R}^m} \mathbb{E}_{Y \mid X}\big[\operatorname{err}_k(Y, g) \mid X = x\big]. $$

If a loss is not top-$k$ calibrated, then even in the limit of infinite data one does not obtain a classifier with the Bayes optimal top-$k$ error from Lemma 2.1.

### 2.2 OVA and Direct Multiclass Approaches

The standard multiclass problem is often solved using a one-vs-all (OVA) reduction into a set of binary classification problems. Every class is trained versus the rest, which yields $m$ classifiers $f_y(x)$.

Typically, the binary classification problems are formulated with a convex margin-based loss function $L\big(y f(x)\big)$, where $y \in \{\pm 1\}$ and $f : \mathcal{X} \to \mathbb{R}$. We consider in this paper:

$$ L\big(y f(x)\big) = \max\big\{0,\; 1 - y f(x)\big\}, \tag{2} $$
$$ L\big(y f(x)\big) = \log\big(1 + e^{-y f(x)}\big). \tag{3} $$

The hinge (2) and logistic (3) losses correspond to the SVM and logistic regression, respectively. We now show when the OVA schemes are top-$k$ calibrated, not only for $k = 1$ (the standard multiclass setting) but for all $k$ simultaneously.
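For concreteness, the two margin-based losses (2) and (3) can be written as functions of the margin $y f(x)$ in a few lines of numpy (a sketch; the numerically stable `logaddexp` formulation is an implementation detail of ours, not from the paper):

```python
import numpy as np

def hinge_loss(margin):
    """Binary hinge loss (2): max{0, 1 - y f(x)} as a function of the margin."""
    return np.maximum(0.0, 1.0 - margin)

def logistic_loss(margin):
    """Binary logistic loss (3): log(1 + exp(-y f(x))), computed stably."""
    return np.logaddexp(0.0, -margin)

margins = np.array([-2.0, 0.0, 0.5, 3.0])
print(hinge_loss(margins))     # [3.  1.  0.5 0. ]
print(logistic_loss(margins))  # approx. [2.127 0.693 0.474 0.049]
```

Note that the hinge loss is exactly zero past margin 1, while the logistic loss is strictly positive everywhere; this difference is what drives the calibration results below.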

**Lemma 2.2.** The OVA reduction is top-$k$ calibrated for any $1 \le k \le m$ if the Bayes optimal function of the convex margin-based loss $L\big(y f(x)\big)$ is a strictly monotonically increasing function of $\Pr(Y = 1 \mid X = x)$.

###### Proof.

For every class $y \in \mathcal{Y}$, the Bayes optimal classifier for the corresponding binary problem has the form

$$ f_y(x) = g\big(\Pr(Y = y \mid X = x)\big), $$

where $g$ is a strictly monotonically increasing function. The ranking of the scores $f_y(x)$ thus corresponds to the ranking of the conditional probabilities $p_y(x)$, and hence the OVA reduction is top-$k$ calibrated for any $k$. ∎

Next, we check whether the one-vs-all schemes employing the hinge and logistic losses are top-$k$ calibrated.

**Proposition.** OVA SVM is not top-$k$ calibrated.

###### Proof.

First, we show that the Bayes optimal function for the binary hinge loss is

$$ f^*(x) = 2\, [\![\, \Pr(Y = 1 \mid X = x) > \tfrac12 \,]\!] - 1. $$

We decompose the expected loss as

$$ \mathbb{E}_{X, Y}\big[L(Y f(X))\big] = \mathbb{E}_{X}\Big[\mathbb{E}_{Y \mid X}\big[L(Y f(X)) \mid X = x\big]\Big]. $$

Thus, one can compute the Bayes optimal classifier pointwise by solving

$$ \operatorname*{arg\,min}_{\alpha \in \mathbb{R}} \; \mathbb{E}_{Y \mid X}\big[L(Y \alpha) \mid X = x\big] $$

for every $x \in \mathcal{X}$, which leads to the following problem:

$$ \operatorname*{arg\,min}_{\alpha \in \mathbb{R}} \; \max\{0,\, 1 - \alpha\}\, p_1(x) + \max\{0,\, 1 + \alpha\}\, p_{-1}(x), $$

where $p_{-1}(x) = 1 - p_1(x)$. It is obvious that the optimal $\alpha$ is contained in $[-1, 1]$. We get

$$ \operatorname*{arg\,min}_{-1 \le \alpha \le 1} \; (1 - \alpha)\, p_1(x) + (1 + \alpha)\, p_{-1}(x). $$

The minimum is attained at the boundary and we get

$$ f^*(x) = \begin{cases} +1 & \text{if } p_1(x) > \tfrac12, \\ -1 & \text{if } p_1(x) \le \tfrac12. \end{cases} $$

Therefore, the Bayes optimal classifier for the hinge loss is not a strictly monotonically increasing function of $\Pr(Y = 1 \mid X = x)$.

To show that OVA hinge is not top-$k$ calibrated, consider a problem where every class has conditional probability $p_y(x) < \tfrac12$. Then, for every class $y$, the Bayes optimal binary classifier is $f^*_y(x) = -1$, hence the predicted ranking of labels is arbitrary and need not produce the Bayes optimal top-$k$ error. ∎
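The failure mode in this proof is easy to see numerically: whenever no class reaches probability 1/2, every binary Bayes optimal SVM score collapses to −1 and the ranking information is lost (a sketch with hypothetical conditional probabilities):

```python
import numpy as np

def bayes_binary_hinge(p):
    """Bayes optimal score of the binary hinge loss: +1 if p > 1/2, else -1."""
    return 1.0 if p > 0.5 else -1.0

p = np.array([0.4, 0.35, 0.25])    # no class is more likely than 1/2
scores = np.array([bayes_binary_hinge(py) for py in p])
print(scores)                       # [-1. -1. -1.]: all classes tie
```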

In contrast, logistic regression is top-$k$ calibrated.

**Proposition.** OVA logistic regression is top-$k$ calibrated.

###### Proof.

First, we show that the Bayes optimal function for the binary logistic loss is

$$ f^*(x) = \log\Big(\frac{p_1(x)}{1 - p_1(x)}\Big). $$

As above, the pointwise optimization problem is

$$ \operatorname*{arg\,min}_{\alpha \in \mathbb{R}} \; \log\big(1 + \exp(-\alpha)\big)\, p_1(x) + \log\big(1 + \exp(\alpha)\big)\, p_{-1}(x). $$

The logistic loss is convex and differentiable, and thus the optimum can be computed via

$$ -\frac{\exp(-\alpha)}{1 + \exp(-\alpha)}\, p_1(x) + \frac{\exp(\alpha)}{1 + \exp(\alpha)}\, p_{-1}(x) = 0. $$

Re-writing the first fraction, we get

$$ -\frac{1}{1 + \exp(\alpha)}\, p_1(x) + \frac{\exp(\alpha)}{1 + \exp(\alpha)}\, p_{-1}(x) = 0, $$

which can be solved as $\exp(\alpha) = p_1(x) / p_{-1}(x)$ and leads to the formula for the Bayes optimal classifier stated above.

We now check that the function $\varphi$ defined as $\varphi(x) = \log\big(x / (1 - x)\big)$ is strictly monotonically increasing:

$$ \varphi'(x) = \frac{1 - x}{x} \Big(\frac{1}{1 - x} + \frac{x}{(1 - x)^2}\Big) = \frac{1 - x}{x} \cdot \frac{1}{(1 - x)^2} = \frac{1}{x (1 - x)} > 0, \quad \forall\, x \in (0, 1). $$

The derivative is strictly positive on $(0, 1)$, which implies that $\varphi$ is strictly monotonically increasing. The logistic loss therefore fulfills the conditions of Lemma 2.2 and is top-$k$ calibrated for any $1 \le k \le m$. ∎
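Numerically, the log-odds transform above is strictly increasing, so the OVA logistic scores rank the classes exactly as the conditional probabilities do (a sketch; the probabilities are hypothetical):

```python
import numpy as np

def bayes_binary_logistic(p):
    """Bayes optimal score of the binary logistic loss: the log-odds of p."""
    return np.log(p / (1.0 - p))

p = np.array([0.4, 0.35, 0.25])
scores = bayes_binary_logistic(p)
# Strict monotonicity on a grid, and identical rankings:
grid = np.linspace(0.01, 0.99, 99)
print(np.all(np.diff(bayes_binary_logistic(grid)) > 0))      # True
print(np.array_equal(np.argsort(-scores), np.argsort(-p)))   # True
```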

An alternative to the OVA scheme with binary losses is to use a multiclass loss directly. We consider the following two generalizations of the hinge and logistic losses:

$$ L\big(y, f(x)\big) = \max_{j \in \mathcal{Y}} \big\{ [\![\, j \neq y \,]\!] + f_j(x) - f_y(x) \big\}, \tag{4} $$
$$ L\big(y, f(x)\big) = \log\Big( \sum_{j \in \mathcal{Y}} e^{\, f_j(x) - f_y(x)} \Big). \tag{5} $$

Both the multiclass hinge loss (4) of Crammer & Singer [15] and the softmax loss (5) are popular losses for multiclass problems. The latter is also known as the cross-entropy or multiclass logistic loss and is often used as the last layer in deep architectures [6, 26, 50]. The multiclass hinge loss has been shown to be competitive in large-scale image classification [2]; however, it is known to be not calibrated [52] for the top-1 error. Next, we show that it is not top-$k$ calibrated for any $k$.

**Proposition.** Multiclass SVM is not top-$k$ calibrated.

###### Proof.

First, we derive the Bayes optimal function.

Let $y = \operatorname{arg\,max}_{j \in \mathcal{Y}} p_j(x)$. Given any $c \in \mathbb{R}$, a Bayes optimal function for the loss (4) is

$$ f^*_y(x) = \begin{cases} c + 1 & \text{if } \max_{j \in \mathcal{Y}} p_j(x) \ge \tfrac12, \\ c & \text{otherwise}, \end{cases} \qquad f^*_j(x) = c, \quad j \in \mathcal{Y} \setminus \{y\}. $$

Let $g = f(x) \in \mathbb{R}^m$, then

$$ \mathbb{E}_{Y \mid X}\big[L(Y, g) \mid X = x\big] = \sum_{l \in \mathcal{Y}} \max_{j \in \mathcal{Y}} \big\{ [\![\, j \neq l \,]\!] + g_j - g_l \big\}\, p_l(x). $$

Suppose that the maximum of $g$ is not unique. In this case, we have

$$ \max_{j \in \mathcal{Y}} \big\{ [\![\, j \neq l \,]\!] + g_j - g_l \big\} \ge 1, \quad \forall\, l \in \mathcal{Y}, $$

as the term $[\![\, j \neq l \,]\!]$ is always active. The best possible loss is then obtained by setting all components of $g$ equal, which yields an expected loss of $1$. On the other hand, if the maximum is unique and is achieved by $g_y$, then

$$ \max_{j \in \mathcal{Y}} \big\{ [\![\, j \neq l \,]\!] + g_j - g_l \big\} = \begin{cases} 1 + g_y - g_l & \text{if } l \neq y, \\ \max\big\{0,\; \max_{j \neq y} \{1 + g_j - g_y\}\big\} & \text{if } l = y. \end{cases} $$

As the loss only depends on the gaps between $g_y$ and the remaining components, we can optimize over $\beta_l = g_y - g_l$, $l \neq y$:

$$ \mathbb{E}_{Y \mid X}\big[L(Y, g) \mid X = x\big] = \sum_{l \neq y} (1 + g_y - g_l)\, p_l(x) + \max\Big\{0,\; \max_{l \neq y} \{1 + g_l - g_y\}\Big\}\, p_y(x) $$
$$ = \sum_{l \neq y} (1 + \beta_l)\, p_l(x) + \max\Big\{0,\; \max_{l \neq y} \{1 - \beta_l\}\Big\}\, p_y(x) = \sum_{l \neq y} (1 + \beta_l)\, p_l(x) + \max\Big\{0,\; 1 - \min_{l \neq y} \beta_l\Big\}\, p_y(x). $$

As only the minimal $\beta_l$ enters the last term, the optimum is achieved when all $\beta_l$ are equal for $l \neq y$ (otherwise one could reduce the first term without affecting the last). Let $\beta_l = \alpha \ge 0$ for all $l \neq y$. The problem becomes

$$ \min_{\alpha \ge 0} \; \sum_{l \neq y} (1 + \alpha)\, p_l(x) + \max\{0,\, 1 - \alpha\}\, p_y(x) \;\equiv\; \min_{0 \le \alpha \le 1} \; \alpha \big(1 - 2 p_y(x)\big), $$

up to an additive constant. Let $p = p_y(x) = \max_{j \in \mathcal{Y}} p_j(x)$. The solution is

$$ \alpha^* = \begin{cases} 0 & \text{if } p < \tfrac12, \\ 1 & \text{if } p \ge \tfrac12, \end{cases} $$

and the associated risk is

$$ \mathbb{E}_{Y \mid X}\big[L(Y, g) \mid X = x\big] = \begin{cases} 1 & \text{if } p < \tfrac12, \\ 2 (1 - p) & \text{if } p \ge \tfrac12. \end{cases} $$

If $p < \tfrac12$, then the Bayes optimal classifier is $f^*_j(x) = c$ for all $j \in \mathcal{Y}$ and any $c \in \mathbb{R}$. Otherwise, $\alpha^* = 1$ and

$$ f^*_j(x) = \begin{cases} c + 1 & \text{if } j = y, \\ c & \text{if } j \in \mathcal{Y} \setminus \{y\}. \end{cases} $$

Moreover, we have that the Bayes risk at $x$ is

$$ \mathbb{E}_{Y \mid X}\big[L(Y, f^*(x)) \mid X = x\big] = \min\big\{1,\; 2 (1 - p)\big\} \le 1. $$

It follows that the multiclass hinge loss is not (top-$1$) classification calibrated at any $x$ where $\max_{j} p_j(x) < \tfrac12$, as its Bayes optimal classifier reduces to a constant. Moreover, even if $p_y(x) \ge \tfrac12$ for some $y$, the loss is not top-$k$ calibrated for $k \ge 2$, as the predicted order of the remaining classes need not be optimal. ∎
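The closed-form Bayes risk min{1, 2(1−p)} can be checked by brute force over the one-parameter family g_y = α, g_j = 0 that the proof reduces the problem to (a sketch with hypothetical probabilities; the helper name is ours):

```python
import numpy as np

def expected_multiclass_hinge(g, p):
    """E_{Y|X}[L(Y, g) | X = x] for the multiclass hinge loss (4)."""
    m = len(p)
    risk = 0.0
    for y in range(m):
        margins = np.array([(j != y) + g[j] - g[y] for j in range(m)])
        risk += margins.max() * p[y]
    return risk

p = np.array([0.7, 0.2, 0.1])      # hypothetical conditionals, max >= 1/2
alphas = np.linspace(0.0, 1.0, 1001)
risks = [expected_multiclass_hinge(np.array([a, 0.0, 0.0]), p) for a in alphas]
print(min(risks))                  # ~0.6 = min{1, 2(1 - 0.7)}
```

With `p = [0.4, 0.35, 0.25]` (no class above 1/2), the same brute force yields a minimum of 1, attained at α = 0, i.e. the constant classifier.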

Again, there is a contrast between the hinge and logistic losses.

**Proposition.** The softmax loss is top-$k$ calibrated.

###### Proof.

The multiclass logistic loss is (top-$1$) calibrated for the zero-one error in the following sense: if

$$ f^*(x) \in \operatorname*{arg\,min}_{g \in \mathbb{R}^m} \mathbb{E}_{Y \mid X}\big[L(Y, g) \mid X = x\big], $$

then, for some $\alpha > 0$ and all $y \in \mathcal{Y}$,

$$ f^*_y(x) = \begin{cases} \log\big(\alpha\, p_y(x)\big) & \text{if } p_y(x) > 0, \\ -\infty & \text{otherwise}, \end{cases} $$

which implies

$$ \operatorname*{arg\,max}_{y \in \mathcal{Y}} f^*_y(x) = \operatorname*{arg\,max}_{y \in \mathcal{Y}} \Pr(Y = y \mid X = x). $$

We now prove this result and show that it also generalizes to top-$k$ calibration for $k > 1$. Using the identity

$$ L(y, g) = \log\Big(\sum_{j \in \mathcal{Y}} e^{\, g_j - g_y}\Big) = \log\Big(\sum_{j \in \mathcal{Y}} e^{\, g_j}\Big) - g_y $$

and the fact that $\sum_{y \in \mathcal{Y}} p_y(x) = 1$, we write, for a given $g \in \mathbb{R}^m$,

$$ \mathbb{E}_{Y \mid X}\big[L(Y, g) \mid X = x\big] = \sum_{y \in \mathcal{Y}} L(y, g)\, p_y(x) = \log\Big(\sum_{y \in \mathcal{Y}} e^{\, g_y}\Big) - \sum_{y \in \mathcal{Y}} g_y\, p_y(x). $$

As the loss is convex and differentiable, we obtain the global optimum by computing a critical point. We have

$$ \frac{\partial}{\partial g_j}\, \mathbb{E}_{Y \mid X}\big[L(Y, g) \mid X = x\big] = \frac{e^{\, g_j}}{\sum_{y \in \mathcal{Y}} e^{\, g_y}} - p_j(x) = 0 $$

for $j \in \mathcal{Y}$. We note that the critical point is not unique, as adding the same constant to every $g_j$ leaves the equations invariant. One can verify that $g_j = \log\big(\alpha\, p_j(x)\big)$ satisfies the equations for any $\alpha > 0$. This yields a solution

$$ f^*_y(x) = \begin{cases} \log\big(\alpha\, p_y(x)\big) & \text{if } p_y(x) > 0, \\ -\infty & \text{otherwise}, \end{cases} $$

for any fixed $\alpha > 0$. We note that $f^*_y(x)$ is a strictly monotonically increasing function of the conditional class probabilities. Therefore, it preserves the ranking of $p_y(x)$, which implies that the softmax loss is top-$k$ calibrated for any $1 \le k \le m$. ∎
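The critical-point condition above, softmax(g) = p, can be verified by minimizing the expected softmax loss directly with gradient descent (a sketch; the step size and iteration count are arbitrary choices of ours):

```python
import numpy as np

def softmax(g):
    e = np.exp(g - g.max())
    return e / e.sum()

def grad_expected_softmax(g, p):
    """Gradient of E_{Y|X}[L(Y,g)|X=x] for the softmax loss (5): softmax(g) - p."""
    return softmax(g) - p

p = np.array([0.5, 0.3, 0.2])      # hypothetical conditional probabilities
g = np.zeros(3)
for _ in range(5000):              # plain gradient descent
    g -= 0.5 * grad_expected_softmax(g, p)

print(np.round(softmax(g), 3))     # recovers [0.5 0.3 0.2], i.e. g_y = log(alpha p_y)
```

The minimizer reproduces the conditional probabilities up to the additive constant log α, so the ranking of the scores matches the ranking of the probabilities, as the proof asserts.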

The implicit reason for the top-$k$ calibration of the OVA schemes and the softmax loss is that one can estimate the conditional probabilities $p_y(x)$ from the Bayes optimal classifier. Loss functions which allow this are called proper. We refer to [41] and references therein for a detailed discussion.

We have established that OVA logistic regression and the softmax loss are top-$k$ calibrated for any $k$, so why should we be interested in defining new loss functions for the top-$k$ error? The reason is that calibration is an asymptotic property, as the Bayes optimal functions are obtained pointwise. The picture changes if we use linear classifiers, since they obviously cannot be optimized independently at each point. Indeed, most of the Bayes optimal classifiers cannot be realized by linear functions.

In particular, the convexity of the softmax and multiclass hinge losses leads to phenomena where the top-$k$ error of a prediction is zero, yet its loss remains large. This happens when the true class is ranked among the top $k$ but its score is still far below the largest one, and it adds a bias when working with "rigid" function classes such as linear ones. The loss functions which we introduce in the following are modifications of the above losses with the goal of alleviating that phenomenon.

### 2.3 Smooth Top-k Hinge Loss

Recently, we introduced two top-$k$ versions of the multiclass hinge loss (4) in [28], the second of which is based on the family of ranking losses introduced earlier by [54]. We use our notation from [28] for direct comparison and refer to the first version as top-$k$ SVM$^\alpha$ and to the second one as top-$k$ SVM$^\beta$. Let $a \in \mathbb{R}^m$ be defined componentwise as $a_j = f_j(x) - f_y(x)$, and let $c = \mathbf{1} - e_y$, where $\mathbf{1}$ is the all ones vector and $e_y$ is the $y$-th basis vector. The two top-$k$ hinge losses are

$$ L(a) = \max\Big\{0,\; \frac{1}{k} \sum_{j=1}^{k} (a + c)_{\pi_j}\Big\} \qquad (\text{top-}k\ \mathrm{SVM}^\alpha), \tag{6} $$
$$ L(a) = \frac{1}{k} \sum_{j=1}^{k} \max\big\{0,\; (a + c)_{\pi_j}\big\} \qquad (\text{top-}k\ \mathrm{SVM}^\beta), \tag{7} $$

where $(a + c)_{\pi_j}$ denotes the $j$-th largest component of $a + c$. It was shown in [28] that (6) is a tighter upper bound on the top-$k$ error than (7); however, both losses performed similarly in our experiments. In the following, we simply refer to them as the top-$k$ hinge or the top-$k$ SVM loss.
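Both definitions can be sketched directly in numpy (our own helper; `variant` selects between (6) and (7)). Since $(a+c)_y = 0$, both losses are nonnegative, and for $k = 1$ both reduce to the multiclass hinge loss (4).

```python
import numpy as np

def topk_hinge(f, y, k, variant="alpha"):
    """Top-k hinge losses (6) and (7); a_j = f_j(x) - f_y(x), c = 1 - e_y."""
    a = f - f[y]
    c = np.ones_like(f)
    c[y] = 0.0
    u = np.sort(a + c)[::-1][:k]          # k largest components of a + c
    if variant == "alpha":                # Eq. (6): hinge of the average
        return max(0.0, u.mean())
    return np.maximum(0.0, u).mean()      # Eq. (7): average of the hinges

f = np.array([0.2, 1.0, -0.3, 0.7])       # true class y = 0 is ranked 3rd
for k in (1, 2, 3):
    la, lb = topk_hinge(f, 0, k, "alpha"), topk_hinge(f, 0, k, "beta")
    print(k, la, lb)                       # alpha <= beta: (6) is tighter
```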

Both losses reduce to the multiclass hinge loss (4) for $k = 1$. Therefore, they are unlikely to be top-$k$ calibrated, even though we can currently neither prove nor disprove this for $k > 1$. The multiclass hinge loss is not calibrated as it is non-smooth and does not allow one to estimate the class conditional probabilities $p_y(x)$. Our new family of smooth top-$k$ hinge losses is based on Moreau-Yosida regularization [5, 34]. This technique has been used in [48] to smooth the binary hinge loss (2). Interestingly, the smooth binary hinge loss fulfills the conditions of Lemma 2.2 and leads to a top-$k$ calibrated OVA scheme. The hope is that the smooth top-$k$ hinge loss becomes top-$k$ calibrated as well.

Smoothing works by adding a quadratic term to the conjugate function², which then becomes strongly convex. Smoothness of the loss, among other things, typically leads to much faster optimization, as we discuss in Section 3.

² The convex conjugate of $f$ is $f^*(b) = \sup_{a} \{\langle a, b \rangle - f(a)\}$.

**Proposition.** OVA smooth hinge is top-$k$ calibrated.

###### Proof.

In order to derive the smooth hinge loss, we first compute the conjugate of the standard binary hinge loss $L(\alpha) = \max\{0,\, 1 - \alpha\}$:

$$ L^*(\beta) = \sup_{\alpha \in \mathbb{R}} \big\{ \alpha \beta - \max\{0,\, 1 - \alpha\} \big\} = \begin{cases} \beta & \text{if } -1 \le \beta \le 0, \\ \infty & \text{otherwise}. \end{cases} \tag{8} $$

The smoothed conjugate is

$$ L^*_\gamma(\beta) = L^*(\beta) + \frac{\gamma}{2} \beta^2. $$

The corresponding primal smooth hinge loss is given by

$$ L_\gamma(\alpha) = \sup_{-1 \le \beta \le 0} \Big\{ \alpha \beta - \beta - \frac{\gamma}{2} \beta^2 \Big\} = \begin{cases} 1 - \alpha - \frac{\gamma}{2} & \text{if } \alpha < 1 - \gamma, \\[2pt] \frac{(\alpha - 1)^2}{2\gamma} & \text{if } 1 - \gamma \le \alpha \le 1, \\[2pt] 0 & \text{if } \alpha > 1. \end{cases} \tag{9} $$

$L_\gamma$ is convex and differentiable with the derivative

$$ L'_\gamma(\alpha) = \begin{cases} -1 & \text{if } \alpha < 1 - \gamma, \\[2pt] \frac{\alpha - 1}{\gamma} & \text{if } 1 - \gamma \le \alpha \le 1, \\[2pt] 0 & \text{if } \alpha > 1. \end{cases} $$

We compute the Bayes optimal classifier pointwise:

$$ f^*(x) = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}} \; L_\gamma(\alpha)\, p_1(x) + L_\gamma(-\alpha)\, p_{-1}(x). $$

Let $p = p_1(x)$; the optimal $\alpha^*$ is found by solving

$$ L'_\gamma(\alpha)\, p - L'_\gamma(-\alpha)\, (1 - p) = 0. $$

Case $\gamma \le 1$. Consider the case $1 - \gamma \le \alpha \le 1$,

$$ \frac{\alpha - 1}{\gamma}\, p + (1 - p) = 0 \;\Longrightarrow\; \alpha^* = 1 - \gamma\, \frac{1 - p}{p}. $$

This case corresponds to $p \ge \tfrac12$, which follows from the constraint $1 - \gamma \le \alpha^* \le 1$. Next, consider $\gamma - 1 < \alpha < 1 - \gamma$,

$$ -p + (1 - p) = 1 - 2p \neq 0, $$

unless $p = \tfrac12$, which is already captured by the first case. Finally, consider $-1 \le \alpha \le \gamma - 1$. Then

$$ -p - \frac{-\alpha - 1}{\gamma}\, (1 - p) = 0 \;\Longrightarrow\; \alpha^* = -1 + \gamma\, \frac{p}{1 - p}, $$

where we have $\alpha^* \le \gamma - 1$ if $p \le \tfrac12$. We obtain the Bayes optimal classifier for $\gamma \le 1$ as follows:

$$ f^*(x) = \begin{cases} 1 - \gamma\, \frac{1 - p}{p} & \text{if } p \ge \tfrac12, \\[2pt] -1 + \gamma\, \frac{p}{1 - p} & \text{if } p < \tfrac12. \end{cases} $$

Note that while $f^*(x)$ is not a continuous function of $p$ for $\gamma < 1$, it is still a strictly monotonically increasing function of $p$ for any $\gamma \le 1$.

Case $\gamma > 1$. First, consider $\gamma - 1 < \alpha \le 1$, so that $-\alpha < 1 - \gamma$,

$$ \frac{\alpha - 1}{\gamma}\, p + (1 - p) = 0 \;\Longrightarrow\; \alpha^* = 1 - \gamma\, \frac{1 - p}{p}. $$

From $\alpha^* \ge \gamma - 1$, we get the condition $p \ge \tfrac{\gamma}{2}$. Next, consider the case where both $\alpha$ and $-\alpha$ lie in $[1 - \gamma,\, 1]$,

$$ \frac{\alpha - 1}{\gamma}\, p - \frac{-\alpha - 1}{\gamma}\, (1 - p) = 0 \;\Longrightarrow\; \alpha^* = 2p - 1, $$

which is in the required range if $1 - \tfrac{\gamma}{2} \le p \le \tfrac{\gamma}{2}$. Finally, consider $\alpha < 1 - \gamma$,

$$ -p - \frac{-\alpha - 1}{\gamma}\, (1 - p) = 0 \;\Longrightarrow\; \alpha^* = -1 + \gamma\, \frac{p}{1 - p}, $$

where we have $\alpha^* < 1 - \gamma$ if $p < 1 - \tfrac{\gamma}{2}$. Overall, the Bayes optimal classifier for $\gamma > 1$ is

$$ f^*(x) = \begin{cases} 1 - \gamma\, \frac{1 - p}{p} & \text{if } p \ge \tfrac{\gamma}{2}, \\[2pt] 2p - 1 & \text{if } 1 - \tfrac{\gamma}{2} \le p \le \tfrac{\gamma}{2}, \\[2pt] -1 + \gamma\, \frac{p}{1 - p} & \text{if } p < 1 - \tfrac{\gamma}{2}. \end{cases} $$

Note that $f^*(x)$ is again a strictly monotonically increasing function of $p$. Therefore, for any $\gamma > 0$, the one-vs-all scheme with the smooth hinge loss (9) is top-$k$ calibrated for all $1 \le k \le m$ by Lemma 2.2. ∎
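A quick numerical check of the case analysis: the piecewise Bayes optimal score derived above is strictly increasing in p, here for γ = 0.5 and γ = 1.5 (a sketch with our own helper name):

```python
import numpy as np

def bayes_smooth_hinge(p, gamma):
    """Bayes optimal score of the smooth hinge (9), per the case analysis."""
    if gamma <= 1:
        return 1 - gamma * (1 - p) / p if p >= 0.5 else -1 + gamma * p / (1 - p)
    if p >= gamma / 2:
        return 1 - gamma * (1 - p) / p
    if p >= 1 - gamma / 2:
        return 2 * p - 1
    return -1 + gamma * p / (1 - p)

ps = np.linspace(0.05, 0.95, 181)
for gamma in (0.5, 1.5):
    scores = np.array([bayes_smooth_hinge(p, gamma) for p in ps])
    print(gamma, bool(np.all(np.diff(scores) > 0)))   # True: monotone in p
```

For γ < 1 the score jumps upward at p = 1/2, so it is discontinuous but still strictly increasing; for γ > 1 the three branches join continuously.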

Next, we introduce the multiclass smooth top-$k$ hinge losses, which extend the top-$k$ hinge losses (6) and (7). We define the top-$k$ simplex ($\alpha$ and $\beta$ versions) of radius $r$ as

$$ \Delta^\alpha_k(r) \triangleq \Big\{ x \;\Big|\; \langle \mathbf{1}, x \rangle \le r, \;\; 0 \le x_i \le \tfrac{1}{k} \langle \mathbf{1}, x \rangle, \; \forall i \Big\}, \tag{10} $$
$$ \Delta^\beta_k(r) \triangleq \Big\{ x \;\Big|\; \langle \mathbf{1}, x \rangle \le r, \;\; 0 \le x_i \le \tfrac{1}{k}\, r, \; \forall i \Big\}. \tag{11} $$

We also let $\Delta^\alpha_k \triangleq \Delta^\alpha_k(1)$ and $\Delta^\beta_k \triangleq \Delta^\beta_k(1)$. As shown in [28], the convex conjugates of (6) and (7) are, respectively, $L^*(b) = -\langle c, b \rangle$ if $b \in \Delta^\alpha_k$, $+\infty$ otherwise; and $L^*(b) = -\langle c, b \rangle$ if $b \in \Delta^\beta_k$, $+\infty$ otherwise.
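The two feasible sets are easy to test for membership (a sketch with our own helper names). Note that Δ^α_k(r) ⊆ Δ^β_k(r), since ⟨1, x⟩ ≤ r implies ⟨1, x⟩/k ≤ r/k:

```python
import numpy as np

def in_simplex_alpha(x, k, r=1.0, tol=1e-9):
    """Membership in the top-k simplex of Eq. (10): per-coordinate bound <1,x>/k."""
    s = x.sum()
    return bool(s <= r + tol and np.all(x >= -tol) and np.all(x <= s / k + tol))

def in_simplex_beta(x, k, r=1.0, tol=1e-9):
    """Membership in the top-k simplex of Eq. (11): per-coordinate bound r/k."""
    return bool(x.sum() <= r + tol and np.all(x >= -tol) and np.all(x <= r / k + tol))

x = np.array([0.45, 0.05, 0.0])
print(in_simplex_alpha(x, k=2), in_simplex_beta(x, k=2))  # False True
```

The example point satisfies the fixed bound r/k = 0.5 of (11) but violates the data-dependent bound ⟨1, x⟩/k = 0.25 of (10), illustrating the inclusion.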

Smoothing applied to the top-$k$ hinge loss (6) yields the following smooth top-$k$ hinge loss ($\alpha$). Smoothing of (7) is done similarly, but the set $\Delta^\alpha_k$ is replaced with $\Delta^\beta_k$.

Let $\gamma > 0$ be the smoothing parameter. The smooth top-$k$ hinge loss ($\alpha$) and its conjugate are

$$ L_\gamma(a) = \frac{1}{\gamma} \Big( \langle a + c,\, p \rangle - \frac{1}{2} \langle p, p \rangle \Big), \tag{12} $$
$$ L^*_\gamma(b) = \frac{\gamma}{2} \langle b, b \rangle - \langle c, b \rangle \;\text{ if } b \in \Delta^\alpha_k, \quad +\infty \text{ otherwise}, \tag{13} $$

where $p = \operatorname{proj}_{\Delta^\alpha_k(\gamma)}(a + c)$ is the Euclidean projection of $a + c$ onto $\Delta^\alpha_k(\gamma)$. Moreover, $L_\gamma$ is $\tfrac{1}{\gamma}$-smooth.

###### Proof.

We take the convex conjugate of the top-$k$ hinge loss, which was derived in [28, Proposition 2],

$$ L^*(b) = \begin{cases} -\langle c, b \rangle & \text{if } b \in \Delta^\alpha_k(1), \\ +\infty & \text{otherwise}, \end{cases} $$

and add the regularizer $\tfrac{\gamma}{2} \langle b, b \rangle$ to obtain the $\gamma$-strongly convex conjugate loss $L^*_\gamma$ as stated in the proposition. As mentioned above [21] (see also [48, Lemma 2]), the primal smooth top-$k$ hinge loss $L_\gamma$, obtained as the convex conjugate of $L^*_\gamma$, is $\tfrac{1}{\gamma}$-smooth. We now obtain a formula to compute it based on the Euclidean projection onto the top-$k$ simplex. By definition,

$$ L_\gamma(a) = \sup_{b \in \mathbb{R}^m} \big\{ \langle a, b \rangle - L^*_\gamma(b) \big\} = \max_{b \in \Delta^\alpha_k(1)} \Big\{ \langle a, b \rangle - \frac{\gamma}{2} \langle b, b \rangle + \langle c, b \rangle \Big\} $$
$$ = -\min_{b \in \Delta^\alpha_k(1)} \Big\{ \frac{\gamma}{2} \langle b, b \rangle - \langle a + c,\, b \rangle \Big\} = -\frac{1}{\gamma} \min_{b \in \Delta^\alpha_k(1)} \Big\{ \frac{1}{2} \langle \gamma b,\, \gamma b \rangle - \langle a + c,\, \gamma b \rangle \Big\}. $$

Substituting $\bar{b} = \gamma b$ in the constraint $b \in \Delta^\alpha_k(1)$, we have

$$ \langle \mathbf{1}, \bar{b}/\gamma \rangle \le 1, \;\; 0 \le \bar{b}_i/\gamma \le \tfrac{1}{k} \langle \mathbf{1}, \bar{b}/\gamma \rangle \;\Longleftrightarrow\; \langle \mathbf{1}, \bar{b} \rangle \le \gamma, \;\; 0 \le \bar{b}_i \le \tfrac{1}{k} \langle \mathbf{1}, \bar{b} \rangle \;\Longleftrightarrow\; \bar{b} \in \Delta^\alpha_k(\gamma). $$

The final expression follows from the fact that

$$ \operatorname*{arg\,min}_{\bar{b} \in \Delta^\alpha_k(\gamma)} \Big\{ \frac{1}{2} \langle \bar{b}, \bar{b} \rangle - \langle a + c,\, \bar{b} \rangle \Big\} \equiv \operatorname*{arg\,min}_{\bar{b} \in \Delta^\alpha_k(\gamma)} \big\lVert (a + c) - \bar{b} \big\rVert^2 \equiv \operatorname{proj}_{\Delta^\alpha_k(\gamma)}(a + c). $$

There is no analytic expression for (12) and evaluation requires computing a projection onto the top- simplex