Multiclass classification based on stochastic dual coordinate ascent
In order to push the performance on realistic computer vision tasks, the number of classes in modern benchmark datasets has significantly increased in recent years. This increase in the number of classes comes along with increased ambiguity between the class labels, raising the question of whether top-1 error is the right performance measure. In this paper, we provide an extensive comparison and evaluation of established multiclass methods, comparing their top-k performance from both a practical and a theoretical perspective. Moreover, we introduce novel top-k loss functions as modifications of the softmax and the multiclass SVM losses and provide efficient optimization schemes for them. In the experiments, we compare all of the proposed and established methods for top-k error optimization on various datasets. An interesting insight of this paper is that the softmax loss yields competitive top-k performance for all k simultaneously. For a specific top-k error, our new top-k losses typically lead to further improvements while being faster to train than the softmax.
The number of classes is rapidly growing in modern computer vision benchmarks [47, 62]. Typically, this also leads to ambiguity in the labels as classes start to overlap. Even for humans, the error rates in top-1 performance are often quite high (on SUN 397). While previous research focuses on minimizing the top-1 error, we address top-k error optimization in this paper. We are interested in two cases: a) achieving small top-k error for all reasonably small k; and b) minimization of a specific top-k error.
While it has been argued that the one-versus-all (OVA) SVM scheme performs on par in top-1 and top-5 accuracy with the other SVM variations based on ranking losses, we have recently shown that minimization of the top-k hinge loss leads to improvements in top-k performance compared to OVA SVM, multiclass SVM, and other ranking-based formulations. In this paper, we study top-k error optimization from a wider perspective. On the one hand, we compare OVA schemes and direct multiclass losses in extensive experiments, and on the other, we present a theoretical discussion regarding their calibration for the top-k error. Based on these insights, we suggest new families of loss functions for the top-k error. Two are smoothed versions of the top-k hinge losses, and the other two are top-k versions of the softmax loss. We discuss their advantages and disadvantages, and for the convex losses provide an efficient implementation based on stochastic dual coordinate ascent (SDCA).
We evaluate a battery of loss functions on datasets of different tasks ranging from text classification to large scale vision benchmarks, including fine-grained and scene classification. We systematically optimize and report results separately for each top-k accuracy. One interesting message that we would like to highlight is that the softmax loss is able to optimize all top-k error measures simultaneously. This is in contrast to multiclass SVM and is also reflected in our experiments. Finally, we show that our new top-k variants of smooth multiclass SVM and the softmax loss can further improve top-k performance for a specific k.
Related work. Top-k optimization has recently received revived attention with the advent of large scale problems [20, 28, 30, 31]. The top-k error in multiclass classification, which promotes good ranking of class labels for each example, is closely related to the precision@k metric in information retrieval, which counts the fraction of positive instances among the top-k ranked examples. In essence, both approaches enforce a desirable ranking of items.
The classic approaches optimize pairwise ranking with [25, 53], , and . An alternative direction was proposed by Usunier et al. , who described a general family of convex loss functions for ranking and classification. One of the loss functions that we consider ( ) also falls into that family. Weston et al.  then introduced Wsabie, which optimizes an approximation of a ranking-based loss from . A Bayesian approach was suggested by .
|Method||Name||Loss function||Conjugate||SDCA update||Top-k calibrated||Convex|
|One-vs-all (OVA) SVM||||||no (Prop. 2.2)||yes|
|yes (Prop. 2.2)|
|Multiclass SVM||[28, 48]||[28, 48]||no (Prop. 2.2)|
|Softmax (maximum entropy)||Prop. 2.4||Prop. 3||yes (Prop. 2.2)|
|Top-k hinge ($\alpha$)||||||open|
|Top-k hinge ($\beta$)|
|Smooth top-k hinge ($\alpha$)||Eq. (12) w/||Prop. 2.3||Prop. 3|
|Smooth top-k hinge ($\beta$)||Eq. (12) w/|
|Top-k entropy||Prop. 2.4||Eq. (14)||Prop. 3|
|Truncated top-k entropy||Eq. (22)||-||-||yes (Prop. 2.5)||no|
|Note that and .|
|We let (binary one-vs-all); , (multiclass); .|
Contributions. We study the problem of top-k error optimization on a diverse range of learning tasks. We consider existing methods and propose 4 novel loss functions for minimizing the top-k error. A brief overview of the methods is given in Table 1. For the proposed convex top-k losses, we develop an efficient optimization scheme based on SDCA (code available at https://github.com/mlapin/libsdca), which can also be used for training with the softmax loss. All methods are evaluated empirically in terms of the top-k error and, whenever possible, in terms of classification calibration. We discover that the softmax loss and the proposed smooth top-1 SVM are astonishingly competitive in all top-k errors. Further small improvements can be obtained with the new top-k losses.
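We consider multiclass problems with $m$ classes where the training set $(x_i, y_i)_{i=1}^n$ consists of $n$ examples $x_i \in \mathbb{R}^d$ along with the corresponding labels $y_i \in \mathcal{Y} \triangleq \{1, \ldots, m\}$. We use $\pi$ and $\tau$ to denote a permutation of (indexes) $\mathcal{Y}$. Unless stated otherwise, $a_\pi$ reorders components of a vector $a \in \mathbb{R}^m$ in descending order, i.e. $a_{\pi_1} \geq a_{\pi_2} \geq \ldots \geq a_{\pi_m}$.
While we consider linear classifiers in our experiments, all loss functions below are formulated in the general setting where a function $f : \mathcal{X} \to \mathbb{R}^m$ is learned and prediction at test time is done via $\arg\max_{y \in \mathcal{Y}} f_y(x)$, resp. the top-k predictions. For the linear case, all predictors $f_y$ have the form $f_y(x) = \langle w_y, x \rangle$. Let $W \in \mathbb{R}^{d \times m}$ be the stacked weight matrix, $L : \mathcal{Y} \times \mathbb{R}^m \to \mathbb{R}$ be a convex loss function, and $\lambda > 0$ be a regularization parameter. We consider the following multiclass optimization problem
$$\min_{W} \; \frac{1}{n} \sum_{i=1}^n L(y_i, W^\top x_i) + \lambda \|W\|_F^2 .$$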
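We use the Iverson bracket notation $[\![ P ]\!]$, defined as $[\![ P ]\!] = 1$ if $P$ is true, $0$ otherwise; and introduce a shorthand $p_y(x) \triangleq \Pr(Y = y \mid X = x)$. We generalize the standard zero-one error and allow $k$ guesses instead of one. Formally, the top-k zero-one loss (top-k error) is
$$\mathrm{err}_k(y, f(x)) \triangleq [\![ f_{\pi_k}(x) > f_y(x) ]\!] . \qquad (1)$$
Note that for $k = 1$ we recover the standard zero-one error. Top-k accuracy is defined as $1$ minus the top-k error.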
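The definition can be made concrete with a short sketch. It assumes 0-based class indices and plain Python lists of scores; the function names `top_k_error` and `top_k_accuracy` are illustrative and are not taken from the released code. An error is counted only when the k-th largest score is strictly larger than the score of the true class, so ties are resolved in favor of the true class.

```python
def top_k_error(scores, y, k):
    """Top-k zero-one loss for one example.

    scores: list of per-class scores f_j(x); y: true class index (0-based).
    Returns 1 if the k-th largest score strictly exceeds the true class score.
    """
    if not 1 <= k <= len(scores):
        raise ValueError("k must be between 1 and the number of classes")
    kth_largest = sorted(scores, reverse=True)[k - 1]
    return int(kth_largest > scores[y])


def top_k_accuracy(score_rows, labels, k):
    """Top-k accuracy over a dataset: one minus the mean top-k error."""
    errors = [top_k_error(s, y, k) for s, y in zip(score_rows, labels)]
    return 1.0 - sum(errors) / len(errors)
```

For `k = 1` the function reduces to the usual zero-one error of the arg-max prediction.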
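In this section, we establish the best achievable top-k error, determine when a classifier achieves it, and define a notion of top-k calibration.
The Bayes optimal top-k error at $x$ is
$$\min_{g \in \mathbb{R}^m} \mathbb{E}_{Y|X}[\mathrm{err}_k(Y, g) \mid X = x] = 1 - \sum_{j=1}^k p_{\tau_j}(x),$$
where $p_{\tau_1}(x) \geq p_{\tau_2}(x) \geq \ldots \geq p_{\tau_m}(x)$. A classifier $f$ is top-k Bayes optimal at $x$ if and only if
$$\{ y \mid f_y(x) \geq f_{\pi_k}(x) \} \subset \{ y \mid p_y(x) \geq p_{\tau_k}(x) \},$$
where $f_{\pi_1}(x) \geq f_{\pi_2}(x) \geq \ldots \geq f_{\pi_m}(x)$.
Let $g \triangleq f(x)$ and $\pi$ be a permutation such that $g_{\pi_1} \geq g_{\pi_2} \geq \ldots \geq g_{\pi_m}$. The expected top-k error at $x$ is
$$\mathbb{E}_{Y|X}[\mathrm{err}_k(Y, g) \mid X = x] = \sum_{y \in \mathcal{Y}} [\![ g_{\pi_k} > g_y ]\!] \, p_y(x) = \sum_{j=k+1}^m p_{\pi_j}(x) = 1 - \sum_{j=1}^k p_{\pi_j}(x) .$$
The error is minimal when $\sum_{j=1}^k p_{\pi_j}(x)$ is maximal, which corresponds to taking the $k$ largest conditional probabilities $p_{\tau_1}(x), \ldots, p_{\tau_k}(x)$ and yields the Bayes optimal top-k error at $x$.
Since the relative order within $\{ p_{\tau_j}(x) \}_{j=1}^k$ is irrelevant for the top-k error, any classifier $f(x)$, for which the sets $\{ \pi_1, \ldots, \pi_k \}$ and $\{ \tau_1, \ldots, \tau_k \}$ coincide, is Bayes optimal.
Note that we assumed that there is a clear cut $p_{\tau_k}(x) > p_{\tau_{k+1}}(x)$ between the $k$ most likely classes and the rest. In general, ties can be resolved arbitrarily as long as we can guarantee that the $k$ largest components of $f(x)$ correspond to the classes (indexes) that yield the maximal sum $\sum_{j=1}^k p_{\pi_j}(x)$ and lead to top-k Bayes optimality. ∎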
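A small numerical sketch of this result, under the assumption that the conditional class probabilities at a point are known and given as a list: the Bayes optimal top-k error is one minus the sum of the k largest probabilities, and any scoring that ranks the same k classes on top attains it. The function names are illustrative only.

```python
def bayes_optimal_top_k_error(probs, k):
    """One minus the sum of the k largest conditional class probabilities."""
    return 1.0 - sum(sorted(probs, reverse=True)[:k])


def expected_top_k_error(probs, scores, k):
    """Expected top-k error at a point for a given score vector.

    Sums the probability of every class whose score lies strictly below
    the k-th largest score (assumes no ties around the k-th position).
    """
    kth_largest = sorted(scores, reverse=True)[k - 1]
    return sum(p for p, s in zip(probs, scores) if s < kth_largest)


probs = [0.5, 0.3, 0.2]
print(bayes_optimal_top_k_error(probs, 2))              # best achievable top-2 error
print(expected_top_k_error(probs, [3.0, 2.0, 1.0], 2))  # scores ranked like probs
print(expected_top_k_error(probs, [1.0, 2.0, 3.0], 2))  # reversed ranking is worse
```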
Optimization of the zero-one loss (and, by extension, the top-k error) leads to hard combinatorial problems. Instead, a standard approach is to use a convex surrogate loss which upper bounds the zero-one error. Under mild conditions on the loss function [3, 52], the optimal classifier w.r.t. the surrogate yields a Bayes optimal solution for the zero-one loss. Such a loss is called classification calibrated, which is known in statistical learning theory as a necessary condition for a classifier to be universally Bayes consistent. We now introduce the notion of calibration for the top-k error. A loss function $L$ (or a reduction scheme) is called top-k calibrated if for all possible data generating measures on $\mathcal{X} \times \mathcal{Y}$ and all $x \in \mathcal{X}$
$$\arg\min_{g \in \mathbb{R}^m} \mathbb{E}_{Y|X}[L(Y, g) \mid X = x] \subseteq \arg\min_{g \in \mathbb{R}^m} \mathbb{E}_{Y|X}[\mathrm{err}_k(Y, g) \mid X = x] .$$
If a loss is not top-k calibrated, it implies that even in the limit of infinite data, one does not obtain a classifier with the Bayes optimal top-k error from Lemma 2.1.
The standard multiclass problem is often solved using the one-vs-all (OVA) reduction into a set of $m$ binary classification problems. Every class is trained versus the rest, which yields $m$ classifiers $\{ f_y \}_{y \in \mathcal{Y}}$.
Typically, the binary classification problems are formulated with a convex margin-based loss function $L(y f(x))$, where $L : \mathbb{R} \to \mathbb{R}$ and $y = \pm 1$. We consider in this paper:
$$L(y f(x)) = \max\{ 0, 1 - y f(x) \}, \qquad (2)$$
$$L(y f(x)) = \log(1 + e^{-y f(x)}). \qquad (3)$$
The hinge (2) and logistic (3) losses correspond to the SVM and logistic regression respectively. We now show when the OVA schemes are top-k calibrated, not only for $k = 1$ (standard multiclass loss) but for all $k$ simultaneously.
The OVA reduction is top-k calibrated for any $1 \leq k \leq m$ if the Bayes optimal function of the convex margin-based loss $L(y f(x))$ is a strictly monotonically increasing function of $\Pr(Y = 1 \mid X = x)$.
For every class $y \in \mathcal{Y}$, the Bayes optimal classifier for the corresponding binary problem has the form
$$f_y(x) = g\big( \Pr(Y = y \mid X = x) \big),$$
where $g$ is a strictly monotonically increasing function. The ranking of $f_y$ corresponds to the ranking of $\Pr(Y = y \mid X = x)$ and hence the OVA reduction is top-k calibrated for any $k = 1, \ldots, m$. ∎
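Next, we check if the one-vs-all schemes employing hinge and logistic regression losses are top-k calibrated. OVA SVM is not top-k calibrated.
First, we show that the Bayes optimal function for the binary hinge loss is
$$f^*(x) = 2 \, [\![ \Pr(Y = 1 \mid X = x) > \tfrac{1}{2} ]\!] - 1 .$$
We decompose the expected loss as
$$\mathbb{E}_{X,Y}[L(Y f(X))] = \mathbb{E}_X \big[ \mathbb{E}_{Y|X}[L(Y f(x)) \mid X = x] \big] .$$
Thus, one can compute the Bayes optimal classifier $f^*$ pointwise by solving
$$\arg\min_{\alpha \in \mathbb{R}} \mathbb{E}_{Y|X}[L(Y \alpha) \mid X = x]$$
for every $x \in \mathcal{X}$, which leads to the following problem
$$\arg\min_{\alpha \in \mathbb{R}} \; \max\{ 0, 1 - \alpha \} \, p_1(x) + \max\{ 0, 1 + \alpha \} \, p_{-1}(x),$$
where $p_y(x) \triangleq \Pr(Y = y \mid X = x)$. It is obvious that the optimal $\alpha^*$ is contained in $[-1, 1]$. We get
$$\arg\min_{-1 \leq \alpha \leq 1} \; (1 - \alpha) \, p_1(x) + (1 + \alpha) \, p_{-1}(x) .$$
The minimum is attained at the boundary and we get
$$f^*(x) = \begin{cases} +1 & \text{if } p_1(x) > \tfrac{1}{2}, \\ -1 & \text{if } p_1(x) \leq \tfrac{1}{2}. \end{cases}$$
Therefore, the Bayes optimal classifier for the hinge loss is not a strictly monotonically increasing function of $p_1(x)$.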
To show that OVA hinge is not top-k calibrated, we construct an example problem with $3$ classes and $p_1(x) = 0.4$, $p_2(x) = p_3(x) = 0.3$. Note that for every class $y = 1, 2, 3$, the Bayes optimal binary classifier is $-1$, hence the predicted ranking of labels is arbitrary and may not produce the Bayes optimal top-k error. ∎
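In contrast, logistic regression is top-k calibrated.
OVA logistic regression is top-k calibrated.
First, we show that the Bayes optimal function for the binary logistic loss is
$$f^*(x) = \log\Big( \frac{p_1(x)}{1 - p_1(x)} \Big), \qquad p_1(x) = \Pr(Y = 1 \mid X = x) .$$
As above, the pointwise optimization problem is
$$\arg\min_{\alpha \in \mathbb{R}} \; \log(1 + \exp(-\alpha)) \, p_1(x) + \log(1 + \exp(\alpha)) \, p_{-1}(x) .$$
The logistic loss is known to be convex and differentiable and thus the optimum can be computed via
$$\frac{-\exp(-\alpha)}{1 + \exp(-\alpha)} \, p_1(x) + \frac{\exp(\alpha)}{1 + \exp(\alpha)} \, p_{-1}(x) = 0 .$$
Re-writing the first fraction we get
$$\frac{-1}{1 + \exp(\alpha)} \, p_1(x) + \frac{\exp(\alpha)}{1 + \exp(\alpha)} \, p_{-1}(x) = 0,$$
which can be solved as $\alpha^* = \log\big( p_1(x) / p_{-1}(x) \big)$ and leads to the formula for the Bayes optimal classifier stated above.
We check now that the function $\phi : (0, 1) \to \mathbb{R}$ defined as $\phi(t) = \log\big( \frac{t}{1 - t} \big)$ is strictly monotonically increasing:
$$\phi'(t) = \frac{1 - t}{t} \Big( \frac{1}{1 - t} + \frac{t}{(1 - t)^2} \Big) = \frac{1}{t (1 - t)} > 0, \qquad \forall t \in (0, 1) .$$
The derivative is strictly positive on $(0, 1)$, which implies that $\phi$ is strictly monotonically increasing. The logistic loss, therefore, fulfills the conditions of Lemma 2.2 and is top-k calibrated for any $1 \leq k \leq m$. ∎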
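The contrast between the two one-vs-all reductions can be illustrated with a minimal sketch. It assumes the pointwise Bayes optimal solutions have the standard forms (a sign function of $2p - 1$ for the hinge loss, the log-odds for the logistic loss); the probabilities and function names below are hypothetical and chosen only for illustration.

```python
import math


def bayes_optimal_hinge(p):
    """Pointwise minimizer of the expected binary hinge loss: +1 if p > 1/2, else -1."""
    return 1.0 if p > 0.5 else -1.0


def bayes_optimal_logistic(p):
    """Pointwise minimizer of the expected binary logistic loss, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def ova_scores(class_probs, bayes_optimal):
    """Scores of the m one-vs-all classifiers at a single point."""
    return [bayes_optimal(p) for p in class_probs]


# Hypothetical conditional class probabilities, all below 1/2.
probs = [0.4, 0.35, 0.25]
print(ova_scores(probs, bayes_optimal_hinge))     # all scores equal: ranking is lost
print(ova_scores(probs, bayes_optimal_logistic))  # scores ordered like the probabilities
```

With all probabilities below one half, every hinge classifier outputs the same value, so the ranking needed for the top-k prediction cannot be recovered, whereas the log-odds preserve it.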
An alternative to the OVA scheme with binary losses is to use a multiclass loss $L : \mathcal{Y} \times \mathbb{R}^m \to \mathbb{R}$ directly. We consider two generalizations of the hinge and logistic losses below:
$$L(y, f(x)) = \max_{j \in \mathcal{Y}} \big\{ [\![ j \neq y ]\!] + f_j(x) - f_y(x) \big\}, \qquad (4)$$
$$L(y, f(x)) = \log\Big( \sum_{j \in \mathcal{Y}} \exp(f_j(x) - f_y(x)) \Big). \qquad (5)$$
Both the multiclass hinge loss (4) of Crammer & Singer and the softmax loss (5) are popular losses for multiclass problems. The latter is also known as the cross-entropy or multiclass logistic loss and is often used as the last layer in deep architectures [6, 26, 50]. The multiclass hinge loss has been shown to be competitive in large-scale image classification; however, it is known not to be calibrated for the top-1 error. Next, we show that it is not top-k calibrated for any $k$.
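As a minimal sketch, the two multiclass losses can be written as follows, assuming (4) is the Crammer & Singer form $\max_j \{ [\![ j \neq y ]\!] + f_j - f_y \}$ and (5) is the usual log-sum-exp form. Class indices are 0-based and the function names are illustrative.

```python
import math


def multiclass_hinge(scores, y):
    """Crammer-Singer multiclass hinge: max over j of [j != y] + f_j - f_y."""
    return max((0.0 if j == y else 1.0) + s - scores[y]
               for j, s in enumerate(scores))


def softmax_loss(scores, y):
    """Softmax (cross-entropy) loss: log sum_j exp(f_j - f_y).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    shifted = [s - scores[y] for s in scores]
    m = max(shifted)
    return m + math.log(sum(math.exp(v - m) for v in shifted))
```

Both functions depend only on score differences, so adding a constant to all scores leaves the loss unchanged.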
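Multiclass SVM is not top-k calibrated.
First, we derive the Bayes optimal function.
Let $y \in \arg\max_{j \in \mathcal{Y}} p_j(x)$. Given any $c \in \mathbb{R}$, a Bayes optimal function $f^*$ for the loss (4) is
$$f^*_y(x) = \begin{cases} c + 1 & \text{if } \max_{j \in \mathcal{Y}} p_j(x) \geq \tfrac{1}{2}, \\ c & \text{otherwise,} \end{cases} \qquad f^*_j(x) = c, \quad j \in \mathcal{Y} \setminus \{ y \} .$$
Let $g = f(x) \in \mathbb{R}^m$, then
$$\mathbb{E}_{Y|X}[L(Y, g) \mid X = x] = \sum_{l \in \mathcal{Y}} \max_{j \in \mathcal{Y}} \big\{ [\![ j \neq l ]\!] + g_j - g_l \big\} \, p_l(x) .$$
Suppose that the maximum of $(g_j)_{j \in \mathcal{Y}}$ is not unique. In this case, we have
$$\max_{j \in \mathcal{Y}} \big\{ [\![ j \neq l ]\!] + g_j - g_l \big\} \geq 1, \qquad \forall \, l \in \mathcal{Y},$$
as the term $[\![ j \neq l ]\!]$ is always active. The best possible loss is obtained by setting $g_j = c$ for all $j \in \mathcal{Y}$, which yields an expected loss of $1$. On the other hand, if the maximum is unique and is achieved by $g_y$, then
$$\max_{j \in \mathcal{Y}} \big\{ [\![ j \neq l ]\!] + g_j - g_l \big\} = \begin{cases} 1 + g_y - g_l & \text{if } l \neq y, \\ \max\big\{ 0, \max_{j \neq y} \{ 1 + g_j - g_y \} \big\} & \text{if } l = y . \end{cases}$$
As the loss only depends on the gap $\alpha_l \triangleq g_y - g_l \geq 0$, we can optimize this with $\beta \triangleq p_y(x)$:
$$\mathbb{E}_{Y|X}[L(Y, g) \mid X = x] = \sum_{l \neq y} (1 + \alpha_l) \, p_l(x) + \max\big\{ 0, 1 - \min_{l \neq y} \alpha_l \big\} \, \beta .$$
As only the minimal $\alpha_l$ enters the last term, the optimum is achieved if all $\alpha_l$ are equal for $l \neq y$ (otherwise it is possible to reduce the first term without affecting the last term). Let $\alpha \triangleq \alpha_l$ for all $l \neq y$. The problem becomes
$$\min_{\alpha \geq 0} \; \sum_{l \neq y} (1 + \alpha) \, p_l(x) + \max\{ 0, 1 - \alpha \} \, \beta \;\; \equiv \;\; \min_{0 \leq \alpha \leq 1} \; \alpha (1 - 2 \beta) .$$
Let $p \triangleq p_y(x)$. The solution is
$$\alpha^* = \begin{cases} 0 & \text{if } p < \tfrac{1}{2}, \\ 1 & \text{if } p \geq \tfrac{1}{2}, \end{cases}$$
and the associated risk is
$$\mathbb{E}_{Y|X}[L(Y, g) \mid X = x] = \begin{cases} 1 & \text{if } p < \tfrac{1}{2}, \\ 2 (1 - p) & \text{if } p \geq \tfrac{1}{2} . \end{cases}$$
If $p < \tfrac{1}{2}$, then the Bayes optimal classifier $f^*_j(x) = c$ for all $j \in \mathcal{Y}$ and any $c \in \mathbb{R}$. Otherwise, $p \geq \tfrac{1}{2}$ and
$$f^*_j(x) = \begin{cases} c + 1 & \text{if } j = y, \\ c & \text{if } j \in \mathcal{Y} \setminus \{ y \} . \end{cases}$$
Moreover, we have that the Bayes risk at $x$ is
$$\mathbb{E}_{Y|X}[L(Y, f^*(x)) \mid X = x] = \min\{ 1, 2 (1 - p) \} \leq 1 .$$
It follows that the multiclass hinge loss is not (top-1) classification calibrated at any $x$ where $\max_{y \in \mathcal{Y}} p_y(x) < \tfrac{1}{2}$, as its Bayes optimal classifier reduces to a constant. Moreover, even if $p_y(x) \geq \tfrac{1}{2}$ for some $y$, the loss is not top-k calibrated for $k \geq 2$ as the predicted order of the remaining classes need not be optimal. ∎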
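Again, the hinge and logistic losses behave differently.
The softmax loss is top-k calibrated.
The multiclass logistic loss is (top-1) calibrated for the zero-one error in the following sense. If
$$f^*(x) \in \arg\min_{g \in \mathbb{R}^m} \mathbb{E}_{Y|X}[L(Y, g) \mid X = x],$$
then for some $\alpha > 0$ and all $y \in \mathcal{Y}$
$$f^*_y(x) = \begin{cases} \log(\alpha \, p_y(x)) & \text{if } p_y(x) > 0, \\ -\infty & \text{otherwise.} \end{cases}$$
We now prove this result and show that it also generalizes to top-k calibration for $k > 1$. Using the identity
$$L(y, g) = \log\Big( \sum_{j \in \mathcal{Y}} e^{g_j - g_y} \Big) = \log\Big( \sum_{j \in \mathcal{Y}} e^{g_j} \Big) - g_y$$
and the fact that $\sum_{y \in \mathcal{Y}} p_y(x) = 1$, we write for a $g \in \mathbb{R}^m$
$$\mathbb{E}_{Y|X}[L(Y, g) \mid X = x] = \sum_{y \in \mathcal{Y}} L(y, g) \, p_y(x) = \log\Big( \sum_{y \in \mathcal{Y}} e^{g_y} \Big) - \sum_{y \in \mathcal{Y}} g_y \, p_y(x) .$$
As the loss is convex and differentiable, we get the global optimum by computing a critical point. We have
$$\frac{\partial}{\partial g_j} \mathbb{E}_{Y|X}[L(Y, g) \mid X = x] = \frac{e^{g_j}}{\sum_{y \in \mathcal{Y}} e^{g_y}} - p_j(x) = 0$$
for $j \in \mathcal{Y}$. We note that the critical point is not unique as multiplication $e^{g_j} \to \kappa \, e^{g_j}$ leaves the equation invariant for any $\kappa > 0$. One can verify that $e^{g_j} = \alpha \, p_j(x)$ satisfies the equations for any $\alpha > 0$. This yields a solution
$$f^*_y(x) = \begin{cases} \log(\alpha \, p_y(x)) & \text{if } p_y(x) > 0, \\ -\infty & \text{otherwise,} \end{cases}$$
for any fixed $\alpha > 0$. We note that $f^*_y$ is a strictly monotonically increasing function of the conditional class probabilities. Therefore, it preserves the ranking of $p_y(x)$ and implies that $f^*$ is top-k calibrated for any $k \geq 1$. ∎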
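The critical point condition can be checked numerically. The sketch below assumes the expected softmax loss at a point has the form $\log \sum_j e^{g_j} - \sum_y g_y p_y$, whose gradient is the softmax of the scores minus the class probabilities; the probabilities, the scale `alpha`, and the function names are hypothetical.

```python
import math


def softmax(g):
    """Numerically stable softmax of a score vector."""
    m = max(g)
    e = [math.exp(v - m) for v in g]
    z = sum(e)
    return [v / z for v in e]


def expected_softmax_loss(probs, g):
    """Expected softmax loss at a point: log-sum-exp(g) minus the p-weighted mean score."""
    m = max(g)
    lse = m + math.log(sum(math.exp(v - m) for v in g))
    return sum(p * (lse - gy) for p, gy in zip(probs, g))


def expected_loss_gradient(probs, g):
    """Gradient of the expected loss with respect to g: softmax(g) - p."""
    return [s - p for s, p in zip(softmax(g), probs)]


probs = [0.5, 0.3, 0.2]   # hypothetical conditional class probabilities
alpha = 7.0               # any positive scale gives a minimizer
g_star = [math.log(alpha * p) for p in probs]
print(expected_loss_gradient(probs, g_star))  # numerically zero in every coordinate
```

Changing `alpha` shifts all scores by the same constant, which leaves both the loss and the induced ranking unchanged.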
The implicit reason for top-k calibration of the OVA schemes and the softmax loss is that one can estimate the probabilities $p_y(x)$ from the Bayes optimal classifier. Loss functions which allow this are called proper. We refer to  and references therein for a detailed discussion.
We have established that the OVA logistic regression and the softmax loss are top-k calibrated for any $k$, so why should we be interested in defining new loss functions for the top-k error? The reason is that calibration is an asymptotic property as the Bayes optimal functions are obtained pointwise. The picture changes if we use linear classifiers, since the expected loss then obviously cannot be minimized independently at each point. Indeed, most of the Bayes optimal classifiers cannot be realized by linear functions.
In particular, convexity of the softmax and multiclass hinge losses leads to phenomena where $\mathrm{err}_k(y, f(x)) = 0$, but $L(y, f(x)) \gg 0$. This happens if $f_{\pi_1}(x) \gg f_y(x) \geq f_{\pi_k}(x)$ and adds a bias when working with “rigid” function classes such as linear ones. The loss functions which we introduce in the following are modifications of the above losses with the goal of alleviating that phenomenon.
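Recently, we introduced two top-k versions of the multiclass hinge loss (4), where the second version is based on a family of ranking losses introduced earlier. We use our earlier notation for direct comparison and refer to the first version as $\alpha$ and the second one as $\beta$. Let $c = \mathbf{1} - e_y$, where $\mathbf{1}$ is the all ones vector, $e_y$ is the $y$-th basis vector, and let $a \in \mathbb{R}^m$ be defined componentwise as $a_j \triangleq \langle w_j, x \rangle - \langle w_y, x \rangle$. The two top-k hinge losses are
$$L(a) = \max\Big\{ 0, \frac{1}{k} \sum_{j=1}^k (a + c)_{\pi_j} \Big\} \qquad (\text{top-k SVM}^{\alpha}), \qquad (6)$$
$$L(a) = \frac{1}{k} \sum_{j=1}^k \max\big\{ 0, (a + c)_{\pi_j} \big\} \qquad (\text{top-k SVM}^{\beta}), \qquad (7)$$
where $(a)_{\pi_j}$ is the $j$-th largest component of $a$. It was shown earlier that (6) is a tighter upper bound on the top-k error than (7); however, both losses performed similarly in our experiments. In the following, we simply refer to them as the top-k hinge or the top-k SVM loss.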
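A minimal sketch of the two top-k hinge variants follows. It assumes the first variant clips the average of the k largest shifted margins at zero and the second averages the k largest margins after clipping each at zero, with shifted margins $(a + c)_j = [\![ j \neq y ]\!] + f_j - f_y$. The suffixes `alpha` and `beta` and all function names are labels chosen for this illustration.

```python
def shifted_margins(scores, y):
    """Components of a + c: 1 + f_j - f_y for j != y, and 0 for the true class."""
    return [(0.0 if j == y else 1.0) + s - scores[y]
            for j, s in enumerate(scores)]


def top_k_hinge_alpha(scores, y, k):
    """First variant: clip the mean of the k largest shifted margins at zero."""
    top = sorted(shifted_margins(scores, y), reverse=True)[:k]
    return max(0.0, sum(top) / k)


def top_k_hinge_beta(scores, y, k):
    """Second variant: mean of the k largest shifted margins, each clipped at zero."""
    top = sorted(shifted_margins(scores, y), reverse=True)[:k]
    return sum(max(0.0, v) for v in top) / k
```

Under these assumptions the first variant never exceeds the second, and for `k = 1` both coincide with the largest shifted margin, that is, with the multiclass hinge loss.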
Both losses reduce to the multiclass hinge loss (4) for $k = 1$. Therefore, they are unlikely to be top-k calibrated, even though we can currently neither prove nor disprove this for $k > 1$. The multiclass hinge loss is not calibrated as it is non-smooth and does not allow one to estimate the class conditional probabilities $p_y(x)$. Our new family of smooth top-k hinge losses is based on the Moreau-Yosida regularization [5, 34]. This technique has been used to smooth the binary hinge loss (2). Interestingly, the smooth binary hinge loss fulfills the conditions of Lemma 2.2 and leads to a top-k calibrated OVA scheme. The hope is that the smooth top-k hinge loss becomes top-k calibrated as well.
Smoothing works by adding a quadratic term to the conjugate function (the convex conjugate of $f$ is $f^*(x^*) = \sup_x \{ \langle x^*, x \rangle - f(x) \}$), which then becomes strongly convex. Smoothness of the loss, among other things, typically leads to much faster optimization as we discuss in Section 3.
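OVA smooth hinge is top-k calibrated.
In order to derive the smooth hinge loss, we first compute the conjugate of the standard binary hinge loss,
$$L(\alpha) = \max\{ 0, 1 - \alpha \}, \qquad L^*(\beta) = \sup_{\alpha \in \mathbb{R}} \big\{ \alpha \beta - \max\{ 0, 1 - \alpha \} \big\} = \begin{cases} \beta & \text{if } -1 \leq \beta \leq 0, \\ \infty & \text{otherwise.} \end{cases}$$
The smoothed conjugate is
$$L^*_\gamma(\beta) = L^*(\beta) + \frac{\gamma}{2} \beta^2 .$$
The corresponding primal smooth hinge loss is given by
$$L_\gamma(\alpha) = \sup_{-1 \leq \beta \leq 0} \Big\{ \alpha \beta - \beta - \frac{\gamma}{2} \beta^2 \Big\} = \begin{cases} 1 - \alpha - \frac{\gamma}{2} & \text{if } \alpha < 1 - \gamma, \\ \frac{(\alpha - 1)^2}{2 \gamma} & \text{if } 1 - \gamma \leq \alpha \leq 1, \\ 0 & \text{if } \alpha > 1 . \end{cases}$$
$L_\gamma(\alpha)$ is convex and differentiable with the derivative
$$L'_\gamma(\alpha) = \begin{cases} -1 & \text{if } \alpha < 1 - \gamma, \\ \frac{\alpha - 1}{\gamma} & \text{if } 1 - \gamma \leq \alpha \leq 1, \\ 0 & \text{if } \alpha > 1 . \end{cases}$$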
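The smoothed loss can be sketched in a few lines, assuming it takes the piecewise form obtained by maximizing $\alpha \beta - \beta - \frac{\gamma}{2} \beta^2$ over $-1 \leq \beta \leq 0$: linear for small margins, quadratic in a band of width $\gamma$ below one, and zero afterwards. The parameter name `gamma` and the function names are illustrative.

```python
def smooth_hinge(alpha, gamma):
    """Smoothed binary hinge loss with smoothing parameter gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    if alpha >= 1.0:
        return 0.0                               # margin satisfied
    if alpha <= 1.0 - gamma:
        return 1.0 - alpha - gamma / 2.0         # linear part, slope -1
    return (alpha - 1.0) ** 2 / (2.0 * gamma)    # quadratic transition


def smooth_hinge_derivative(alpha, gamma):
    """Derivative of the smoothed hinge loss; continuous for gamma > 0."""
    if alpha >= 1.0:
        return 0.0
    if alpha <= 1.0 - gamma:
        return -1.0
    return (alpha - 1.0) / gamma
```

The smoothed loss lies below the standard hinge loss by at most `gamma / 2`, and its derivative changes continuously from -1 to 0, which is what makes gradient-based and dual solvers behave better.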
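We compute the Bayes optimal classifier pointwise,
$$f^*(x) = \arg\min_{\alpha \in \mathbb{R}} \; L_\gamma(\alpha) \, p_1(x) + L_\gamma(-\alpha) \, p_{-1}(x) .$$
Let $p \triangleq p_1(x)$, the optimal $\alpha^*$ is found by solving
$$L'_\gamma(\alpha) \, p - L'_\gamma(-\alpha) \, (1 - p) = 0 .$$
Case $0 < \gamma \leq 1$. Consider the case $1 - \gamma \leq \alpha \leq 1$,
$$\frac{\alpha - 1}{\gamma} \, p + (1 - p) = 0 \quad \Longrightarrow \quad \alpha^* = 1 - \gamma \, \frac{1 - p}{p} .$$
This case corresponds to $p \geq \tfrac{1}{2}$, which follows from the constraint $\alpha^* \geq 1 - \gamma$. Next, consider $\gamma - 1 \leq \alpha \leq 1 - \gamma$,
$$-p + (1 - p) = 1 - 2p \neq 0,$$
unless $p = \tfrac{1}{2}$, which is already captured by the first case. Finally, consider $-1 \leq \alpha \leq \gamma - 1$. Then
$$-p + \frac{\alpha + 1}{\gamma} \, (1 - p) = 0 \quad \Longrightarrow \quad \alpha^* = -1 + \gamma \, \frac{p}{1 - p},$$
where we have $-1 \leq \alpha^* \leq \gamma - 1$ if $p \leq \tfrac{1}{2}$. We obtain the Bayes optimal classifier for $0 < \gamma \leq 1$ as follows:
$$f^*(x) = \begin{cases} 1 - \gamma \, \frac{1 - p}{p} & \text{if } p \geq \tfrac{1}{2}, \\ -1 + \gamma \, \frac{p}{1 - p} & \text{if } p < \tfrac{1}{2} . \end{cases}$$
Note that while $f^*(x)$ is not a continuous function of $p = p_1(x)$ for $\gamma < 1$, it is still a strictly monotonically increasing function of $p$ for any $0 < \gamma \leq 1$.
Case $\gamma > 1$. First, consider $\gamma - 1 \leq \alpha \leq 1$,
$$\frac{\alpha - 1}{\gamma} \, p + (1 - p) = 0 \quad \Longrightarrow \quad \alpha^* = 1 - \gamma \, \frac{1 - p}{p} .$$
From $\alpha^* \geq \gamma - 1$, we get the condition $p \geq \tfrac{\gamma}{2}$. Next, consider $1 - \gamma \leq \alpha \leq \gamma - 1$,
$$\frac{\alpha - 1}{\gamma} \, p + \frac{\alpha + 1}{\gamma} \, (1 - p) = 0 \quad \Longrightarrow \quad \alpha^* = 2p - 1,$$
which is in the range $[1 - \gamma, \gamma - 1]$ if $1 - \tfrac{\gamma}{2} \leq p \leq \tfrac{\gamma}{2}$. Finally, consider $-1 \leq \alpha \leq 1 - \gamma$,
$$-p + \frac{\alpha + 1}{\gamma} \, (1 - p) = 0 \quad \Longrightarrow \quad \alpha^* = -1 + \gamma \, \frac{p}{1 - p},$$
where we have $-1 \leq \alpha^* \leq 1 - \gamma$ if $p \leq 1 - \tfrac{\gamma}{2}$. Overall, the Bayes optimal classifier for $\gamma > 1$ is
$$f^*(x) = \begin{cases} 1 - \gamma \, \frac{1 - p}{p} & \text{if } p \geq \tfrac{\gamma}{2}, \\ 2p - 1 & \text{if } 1 - \tfrac{\gamma}{2} \leq p < \tfrac{\gamma}{2}, \\ -1 + \gamma \, \frac{p}{1 - p} & \text{if } p < 1 - \tfrac{\gamma}{2} . \end{cases}$$
This function is again strictly monotonically increasing in $p$, so the conditions of Lemma 2.2 are fulfilled for any $\gamma > 0$.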
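The case analysis can be sanity-checked numerically. The sketch below assumes the piecewise smooth hinge loss described in the derivation and the resulting closed-form minimizer for $0 < p < 1$; it compares the closed form against a brute-force grid search over the expected loss. All names are illustrative, and the grid range and resolution are arbitrary choices.

```python
def smooth_hinge(alpha, gamma):
    """Smoothed binary hinge loss with smoothing parameter gamma > 0."""
    if alpha >= 1.0:
        return 0.0
    if alpha <= 1.0 - gamma:
        return 1.0 - alpha - gamma / 2.0
    return (alpha - 1.0) ** 2 / (2.0 * gamma)


def expected_smooth_hinge(alpha, p, gamma):
    """Expected loss at a point where the positive class has probability p."""
    return p * smooth_hinge(alpha, gamma) + (1.0 - p) * smooth_hinge(-alpha, gamma)


def bayes_optimal_smooth_hinge(p, gamma):
    """Closed-form pointwise minimizer of the expected smooth hinge loss, 0 < p < 1."""
    upper = 1.0 - gamma * (1.0 - p) / p      # solution in the band near +1
    lower = -1.0 + gamma * p / (1.0 - p)     # solution in the band near -1
    if gamma <= 1.0:
        return upper if p >= 0.5 else lower
    if p >= gamma / 2.0:
        return upper
    if p >= 1.0 - gamma / 2.0:
        return 2.0 * p - 1.0                 # both terms are in their quadratic part
    return lower
```

A grid search over `expected_smooth_hinge` for a few values of `p` and `gamma` should not find a lower value than the one attained at `bayes_optimal_smooth_hinge(p, gamma)`, and the minimizer should increase with `p`.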
Let $\gamma > 0$ be the smoothing parameter. The smooth top-k hinge loss ($\alpha$) and its conjugate are
where is the Euclidean projection of on . Moreover, is -smooth.
We take the convex conjugate of the top-k hinge loss, which was derived in [28, Proposition 2],
and add the regularizer $\frac{\gamma}{2} \langle b, b \rangle$ to obtain the $\gamma$-strongly convex conjugate loss $L^*_\gamma(b)$ as stated in the proposition. As mentioned above (see also [48, Lemma 2]), the primal smooth top-k hinge loss $L_\gamma(a)$, obtained as the convex conjugate of $L^*_\gamma(b)$, is $1/\gamma$-smooth. We now obtain a formula to compute it based on the Euclidean projection onto the top-k simplex. By definition,
For the constraint , we have
The final expression follows from the fact that
There is no analytic expression for (12) and evaluation requires computing a projection onto the top-k simplex