Multiclass classification based on stochastic dual coordinate ascent
Class ambiguity is typical in image classification problems with a large number of classes. When classes are difficult to discriminate, it makes sense to allow k guesses and evaluate classifiers based on the top-k error instead of the standard zero-one loss. We propose top-k multiclass SVM as a direct method to optimize for top-k performance. Our generalization of the well-known multiclass SVM is based on a tight convex upper bound of the top-k error. We propose a fast optimization scheme based on an efficient projection onto the top-k simplex, which is of its own interest. Experiments on five datasets show consistent improvements in top-k accuracy compared to various baselines.
As the number of classes increases, two important issues emerge: class overlap and the multi-label nature of examples. When a predictor is allowed k guesses and is not penalized for up to k − 1 mistakes, such an evaluation measure is known as the top-k error. We argue that this is an important metric that will inevitably receive more attention in the future, as the illustration in Figure 1 indicates.
How obvious is it that each row of Figure 1 shows examples of different classes? Can we expect a human to predict correctly on the first attempt? Does it even make sense to penalize a learning system for such “mistakes”? While the problem of class ambiguity is apparent in computer vision, similar problems arise in other domains when the number of classes becomes large.
We propose top-k multiclass SVM as a generalization of the well-known multiclass SVM. It is based on a tight convex upper bound of the top-k zero-one loss which we call top-k hinge loss. While it turns out to be similar to a top-k version of the ranking based loss proposed by Usunier et al., we show that the top-k hinge loss is a lower bound on their version and is thus a tighter bound on the top-k zero-one loss. We propose an efficient implementation based on stochastic dual coordinate ascent (SDCA). A key ingredient in the optimization is the (biased) projection onto the top-k simplex. This projection turns out to be a tricky generalization of the continuous quadratic knapsack problem, respectively the projection onto the standard simplex. The proposed algorithm for solving it has complexity for . Our implementation of the top-k multiclass SVM scales to large datasets like Places 205 with about million examples and classes. Finally, extensive experiments on several challenging computer vision problems show that top-k multiclass SVM consistently improves in top-k error over the multiclass SVM (equivalent to our top-1 multiclass SVM), one-vs-all SVM and other methods based on different ranking losses [12, 17].
In multiclass classification, one is given a set of training examples along with the corresponding labels . Let be the feature space and the set of labels. The task is to learn a set of linear predictors such that the risk of the classifier is minimized for a given loss function, which is usually chosen to be a convex upper bound of the zero-one loss. The generalization to nonlinear predictors using kernels is discussed below.
The classification problem becomes extremely challenging in the presence of a large number of ambiguous classes. It is natural in that case to extend the evaluation protocol to allow k guesses, which leads to the popular top-k error and top-k accuracy performance measures. Formally, we consider a ranking of labels induced by the prediction scores . Let the bracket [·] denote a permutation of labels such that [j] is the index of the j-th largest score,
The top-k zero-one loss is defined as
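where and if is true and otherwise. Note that the standard zero-one loss is recovered when k = 1, and that the top-k loss is always zero when k equals the number of classes. Therefore, we are interested in the regime where k is at least 1 and smaller than the number of classes.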
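As a concrete illustration of the definition above, the following minimal sketch evaluates the top-k zero-one loss on a vector of prediction scores. The function name `topk_zero_one_loss` and the tie-handling convention (an error is counted only when the k-th largest score is strictly greater than the ground truth score) are assumptions made for this example, since the displayed formula did not survive extraction.

```python
import numpy as np

def topk_zero_one_loss(scores, y, k):
    """Return 1 if the k-th largest score strictly exceeds the score of label y, else 0.

    Assumed convention: ties with the ground truth score do not count as errors.
    """
    scores = np.asarray(scores, dtype=float)
    kth_largest = np.sort(scores)[-k]  # k-th largest entry of the score vector
    return int(kth_largest > scores[y])

scores = [0.1, 2.0, 1.5, 0.3]
print(topk_zero_one_loss(scores, y=2, k=1))  # label 2 is not the top prediction
print(topk_zero_one_loss(scores, y=2, k=2))  # label 2 is among the two best guesses
```

With k = 1 this reduces to the standard zero-one loss, and with k equal to the number of classes it is always zero.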
Given a training pair , the multiclass SVM loss on example is defined as
Since our optimization scheme is based on Fenchel duality, we also require a convex conjugate of the primal loss function (1). Let , where is the all ones vector and is the -th standard basis vector in , let be defined componentwise as , and let
Note that thresholding with in is actually redundant as and is only given to enhance similarity to the top- version defined later.
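The multiclass SVM loss can be sketched in a few lines. The example below assumes the usual Crammer and Singer form, in which every wrong label contributes its score minus the ground truth score plus a unit margin, and the ground truth label contributes zero. That form and the name `multiclass_svm_loss` are assumptions for illustration. Because the ground truth entry is always zero, the maximum is never negative, which is why an explicit threshold at zero is redundant.

```python
import numpy as np

def multiclass_svm_loss(scores, y):
    """Crammer-Singer style hinge loss (assumed form): max_j (1[j != y] + s_j - s_y)."""
    scores = np.asarray(scores, dtype=float)
    margins = scores - scores[y] + 1.0  # unit margin for every label
    margins[y] = 0.0                    # the ground truth label contributes exactly zero
    return float(margins.max())         # non-negative because margins[y] == 0

print(multiclass_svm_loss([0.1, 2.0, 1.5, 0.3], y=1))  # correct label wins, margin violated
print(multiclass_svm_loss([0.0, 5.0, 0.0], y=1))       # correct label wins with enough margin
```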
The main motivation for the top-k loss is to relax the penalty for making an error in the top k − 1 predictions. Looking at in (2), a direct extension to the top-k setting would be a function
which incurs a loss iff . Since the ground truth score , we conclude that
which directly corresponds to the top-k zero-one loss with margin . Note that the function ignores the values of the first k − 1 scores, which could be quite large if there are highly similar classes. That would be fine in this model as long as the correct prediction is within the first k guesses. However, the function is unfortunately nonconvex since the function returning the k-th largest coordinate is nonconvex for k ≥ 2. Therefore, finding a globally optimal solution is computationally intractable.
Instead, we propose the following convex upper bound on , which we call the top-k hinge loss,
where the sum of the k largest components is known to be convex. We have that
for any and . Moreover, unless all k largest scores are the same. This extra slack can be used to increase the margin between the current and the remaining least similar classes, which should then lead to an improvement in the top-k metric.
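The relation between the three losses can be checked numerically. The sketch below assumes that the top-k hinge loss is the positive part of the average of the k largest margin terms, and that the nonconvex variant is the positive part of the k-th largest margin term alone. These forms are reconstructed from the surrounding description, and all function names are hypothetical.

```python
import numpy as np

def margin_terms(scores, y):
    """Per-label terms s_j - s_y + 1 for j != y and 0 for j == y (assumed form)."""
    s = np.asarray(scores, dtype=float)
    v = s - s[y] + 1.0
    v[y] = 0.0
    return v

def kth_largest_margin_loss(scores, y, k):
    """Nonconvex variant: positive part of the k-th largest margin term."""
    v = np.sort(margin_terms(scores, y))[::-1]
    return max(0.0, float(v[k - 1]))

def topk_hinge_loss(scores, y, k):
    """Top-k hinge loss (assumed form): positive part of the mean of the k largest terms."""
    v = np.sort(margin_terms(scores, y))[::-1]
    return max(0.0, float(v[:k].mean()))

scores, y = [0.1, 2.0, 1.5, 0.3], 2
for k in range(1, 5):
    lower = kth_largest_margin_loss(scores, y, k)
    ours = topk_hinge_loss(scores, y, k)
    upper = topk_hinge_loss(scores, y, 1)  # k = 1 recovers the multiclass SVM loss
    print(k, lower, ours, upper)
```

In each printed row the middle value lies between the other two, matching the chain of inequalities stated in the text.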
In this section we derive the conjugate of the proposed loss (3). We begin with a well known result that is used later in the proof. All proofs can be found in the supplement. Let .
[, Lemma 1]
For a , we have
On the other hand, for any , we get
We also define a set which arises naturally as the effective domain of the conjugate of (3), where the effective domain of a convex function is the set on which it is finite. By analogy, we call it the top-k simplex as for k = 1 it reduces to the standard simplex with the inequality constraint (). Let .
The top-k simplex is a convex polytope defined as
where is the bound on the sum . We let . The crucial difference to the standard simplex is the upper bound on ’s, which limits their maximal contribution to the total sum . See Figure 2 for an illustration.
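A membership test makes the constraints of the top-k simplex concrete. The sketch below assumes the set consists of non-negative vectors whose sum is at most r and whose every coordinate is at most a 1/k fraction of that sum. This reading is reconstructed from the description of the upper bound above, and the name `in_topk_simplex` and the tolerance parameter are hypothetical.

```python
import numpy as np

def in_topk_simplex(x, k, r=1.0, tol=1e-12):
    """Check x >= 0, sum(x) <= r, and x_i <= sum(x) / k for all i (assumed definition)."""
    x = np.asarray(x, dtype=float)
    total = x.sum()
    return bool(
        total <= r + tol
        and np.all(x >= -tol)
        and np.all(x <= total / k + tol)
    )

print(in_topk_simplex([0.5, 0.5, 0.0], k=2))  # mass spread over two coordinates
print(in_topk_simplex([1.0, 0.0, 0.0], k=2))  # one coordinate carries all the mass
print(in_topk_simplex([1.0, 0.0, 0.0], k=1))  # k = 1: standard simplex with sum <= 1
```

The second call fails for k = 2 because a single coordinate may hold at most half of the total sum, which is the coupling that distinguishes this set from the standard simplex.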
The first technical contribution of this work is a primal-conjugate pair for the top-k hinge loss (3), given as follows:
We use Lemma 2.2.1 to write
The Lagrangian is given as
Minimizing over , we get , , . As and , it follows that and . Since the duality gap is zero, we get
The conjugate can now be computed as
Since unless , we get the formula for as in (4).
Usunier et al. have recently formulated a very general family of convex losses for ranking and multiclass classification. In their framework, the hinge loss on example can be written as
where is a non-increasing sequence of non-negative numbers which act as weights for the ordered losses. The relation to the top-k hinge loss becomes apparent if we choose if , and otherwise. In that case, we obtain another version of the top-k hinge loss
It is straightforward to check that
The bound holds with equality if or . Otherwise, there is a gap and our top-k loss is a strictly better upper bound on the actual top-k zero-one loss. While  employed LaRank  and ,  optimized an approximation of , we show in § 5 how the loss function (5) can be optimized exactly and efficiently within the Prox-SDCA framework.
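The gap between the two top-k losses is easy to see on a small example. The sketch below assumes that the ranking based version averages the positive parts of the k largest margin terms (weights 1/k on the first k ordered losses), whereas the top-k hinge loss takes the positive part of their average. Both forms are reconstructions and the function names are hypothetical.

```python
import numpy as np

def margin_terms(scores, y):
    """Per-label terms s_j - s_y + 1 for j != y and 0 for j == y (assumed form)."""
    s = np.asarray(scores, dtype=float)
    v = s - s[y] + 1.0
    v[y] = 0.0
    return v

def topk_hinge_loss(scores, y, k):
    """Positive part applied after averaging the k largest terms."""
    v = np.sort(margin_terms(scores, y))[::-1][:k]
    return max(0.0, float(v.mean()))

def ranking_topk_hinge_loss(scores, y, k):
    """Positive part applied to each of the k largest terms before averaging."""
    v = np.sort(margin_terms(scores, y))[::-1][:k]
    return float(np.maximum(v, 0.0).mean())

scores, y = [0.1, 2.0, 1.5, 0.3], 2
print(topk_hinge_loss(scores, y, 3), ranking_topk_hinge_loss(scores, y, 3))  # strict gap
print(topk_hinge_loss(scores, y, 2), ranking_topk_hinge_loss(scores, y, 2))  # equal
```

For k = 3 one of the three largest terms is negative, so truncating it at zero before averaging inflates the ranking based version. For k = 2 both terms are non-negative and the two losses coincide.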
Multiclass to binary reduction. It is also possible to compare directly to ranking based methods that solve a binary problem using the following reduction. We employ it in our experiments to evaluate the ranking based methods  and . The trick is to augment the training set by embedding each into using a feature map for each . The mapping places at the -th position in and puts zeros everywhere else. The example is labeled and all for are labeled . Therefore, we have a new training set with examples and dimensional (sparse) features. Moreover, which establishes the relation to the original multiclass problem.
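The reduction above can be sketched as follows. The example assumes d-dimensional features, m classes, zero-based class indices, and a weight matrix `W` whose columns are the per-class predictors. The names `embed` and `reduce_to_binary` are hypothetical. Dense arrays are used for brevity, although the embedded features are sparse in practice.

```python
import numpy as np

def embed(x, j, m):
    """Place x in the j-th block of a vector of length m * d, zeros elsewhere."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    z = np.zeros(m * d)
    z[j * d:(j + 1) * d] = x
    return z

def reduce_to_binary(X, labels, m):
    """Turn n multiclass examples into m * n binary examples with labels +1 / -1."""
    feats, signs = [], []
    for x, y in zip(X, labels):
        for j in range(m):
            feats.append(embed(x, j, m))
            signs.append(1 if j == y else -1)
    return np.array(feats), np.array(signs)

W = np.arange(6, dtype=float).reshape(2, 3)  # d = 2, m = 3, columns are w_j
w = W.flatten(order="F")                     # stack the columns into one long vector
x = np.array([1.0, 2.0])
print(w @ embed(x, 1, 3), W[:, 1] @ x)       # both give the score of class 1
```

The inner product of the stacked weight vector with the embedded example equals the original class score, which is the link back to the multiclass problem.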
Another approach to general performance measures is given in . It turns out that using the above reduction, one can show that under certain constraints on the classifier, the recall@k is equivalent to the top-k error. A convex upper bound on recall@k is then optimized in  via structured SVM. As their convex upper bound on the recall@k is not decomposable in an instance based loss, it is not directly comparable to our loss. While being theoretically very elegant, the approach of  does not scale to very large datasets.
We begin with a general ℓ2-regularized multiclass classification problem, where for notational convenience we keep the loss function unspecified. The multiclass SVM or the top-k multiclass SVM are obtained by plugging in the corresponding loss function from § 2.
Let be the matrix of training examples , let be the matrix of primal variables obtained by stacking the vectors , and the matrix of dual variables.
Before we prove our main result of this section (Theorem 3.1), we first impose a technical constraint on a loss function to be compatible with the choice of the ground truth coordinate. The top- hinge loss from Section 2 satisfies this requirement as we show in Proposition 3.1. We also prove an auxiliary Lemma 3.1, which is then used in Theorem 3.1.
A convex function is -compatible if for any with we have that
This constraint is needed to prove equality in the following Lemma. Let be -compatible, let , and let , then
We have that and .
It follows that can only be finite if , which implies . Let be the Moore-Penrose pseudoinverse of . For a , we can write
where . Using rank- update of the Moore-Penrose pseudoinverse (, § 3.2.7), we can compute . Since , the last term is zero and we have . Finally, we use the fact that is -compatible to prove that the inequality in (6) is satisfied with equality. We have that and . Therefore, when , . ∎
We can now use Lemma 3.1 to compute convex conjugates of the loss functions.
Let be -compatible for each , let be a regularization parameter, and let be the Gram matrix. The primal and Fenchel dual objective functions are given as:
Moreover, we have that and , where is the -th column of .
We use Fenchel duality (see , Theorem 3.3.5), to write , and , for the functions and defined as follows:
where . One can easily verify that and . From Lemma 3.1, we have that if , and otherwise. To complete the proof, we redefine for convenience, and use the first order optimality condition (, Ex. 9.f in § 3) for the formula. ∎
Finally, we show that Theorem 3.1 applies to the loss functions that we consider.
The top- hinge loss function from Section 2 is -compatible.
We have repeated the derivation from Section 5.7 in  as there is a typo in the optimization problem (20) leading to the conclusion that must be at the optimum. Lemma 3.1 fixes this by making the requirement explicit. Note that this modification is already mentioned in their pseudo-code for Prox-SDCA.
As an optimization scheme, we employ the proximal stochastic dual coordinate ascent (Prox-SDCA) framework of Shalev-Shwartz and Zhang , which has strong convergence guarantees and is easy to adapt to our problem. In particular, we iteratively update a batch of dual variables corresponding to the training pair , so as to maximize the dual objective from Theorem 3.1. We also maintain the primal variables and stop when the relative duality gap is below . This procedure is summarized in Algorithm 1.
Let us make a few comments on the advantages of the proposed method. First, apart from the update step which we discuss below, all main operations can be computed using a BLAS library, which makes the overall implementation efficient. Second, the update step in Line 9 is optimal in the sense that it yields maximal dual objective increase jointly over variables. This is opposed to SGD updates with data-independent step sizes, as well as to maximal but scalar updates in other SDCA variants. Finally, we have a well-defined stopping criterion as we can compute the duality gap (see discussion in ). The latter is especially attractive if there is a time budget for learning. The algorithm can also be easily kernelized since (Theorem 3.1).
For the proposed top-k hinge loss from Section 2, optimization of the dual objective over given other variables fixed is an instance of a regularized (biased) projection problem onto the top-k simplex . Let be obtained by removing the -th coordinate from vector .
The following two problems are equivalent with and
where , and .
For the loss function, we get
with . One can verify that the latter constraint is equivalent to , . Similarly, we write for the regularization term
where the does not depend on . Note that and can be computed using the “old” . Let , we have
Plugging everything together and multiplying with , we obtain
Collecting the corresponding terms finishes the proof. ∎
We discuss in the following section how to project onto the set efficiently.
One of our main technical results is an algorithm for efficiently computing projections onto , respectively the biased projection introduced in Proposition 3.2.1. The optimization problem in Proposition 3.2.1 reduces to the Euclidean projection onto for , and for it biases the solution to be orthogonal to . Let us highlight that is substantially different from the standard simplex and none of the existing methods can be used as we discuss below.
Finding the Euclidean projection onto the simplex is an instance of the general optimization problem known as the continuous quadratic knapsack problem (CQKP). For example, to project onto the simplex we set , and . This is a well examined problem and several highly efficient algorithms are available (see the surveys [20, 21]). The first main difference to our set is the upper bound on the ’s. All existing algorithms expect that is fixed, which allows them to consider decompositions which can be solved in closed-form. In our case, the upper bound introduces coupling across all variables, which makes the existing algorithms not applicable. A second main difference is the bias term added to the objective. The additional difficulty introduced by this term is relatively minor. Thus we solve the problem for general (including for the Euclidean projection onto ) even though we need only in Proposition 3.2.1. The only case when our problem reduces to CQKP is when the constraint is satisfied with equality. In that case we can let and use any algorithm for the knapsack problem. We choose  since it is easy to implement, does not require sorting, and scales linearly in practice. The bias in the projection problem reduces to a constant in this case and has, therefore, no effect.
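For reference, the special case k = 1 (the standard simplex with an inequality constraint on the sum) admits a well-known sorting-based projection. The sketch below implements that classical method. It is not the algorithm proposed here for general k, and it differs from the sorting-free knapsack solver mentioned above. The function names are hypothetical.

```python
import numpy as np

def project_simplex_eq(a, r=1.0):
    """Euclidean projection onto {x >= 0, sum(x) = r} via the classical sort-based method."""
    u = np.sort(a)[::-1]                      # entries in decreasing order
    css = np.cumsum(u) - r                    # shifted cumulative sums
    idx = np.arange(1, len(a) + 1)
    rho = idx[u - css / idx > 0][-1]          # number of strictly positive coordinates
    tau = css[rho - 1] / rho                  # threshold subtracted from every entry
    return np.maximum(a - tau, 0.0)

def project_capped_simplex(a, r=1.0):
    """Euclidean projection onto {x >= 0, sum(x) <= r}."""
    a = np.asarray(a, dtype=float)
    x = np.maximum(a, 0.0)                    # projection onto the positive orthant
    if x.sum() <= r:
        return x                              # sum constraint inactive at the optimum
    return project_simplex_eq(a, r)           # sum constraint holds with equality

print(project_capped_simplex([0.2, 0.1, -0.5]))
print(project_capped_simplex([0.8, 0.6]))
```

The two branches mirror the case distinction in the text: either the sum constraint is inactive and can be dropped, or it holds with equality and the problem becomes a knapsack-type projection.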
When the constraint is not satisfied with equality at the optimum, it has essentially no influence on the projection problem and can be removed. In that case we are left with the problem of the (biased) projection onto the top-k cone, which we address with the following lemma.
Let be the solution to the following optimization problem
and let , , .
If and , then .
If and , then , for , where is the index of the -th largest component in .
Otherwise (), the following system of linear equations holds
together with the feasibility constraints on
and we have .
We consider an equivalent problem
Let , , be the dual variables, and let be the Lagrangian:
From the KKT conditions, we have that
We have that , , and . Let . We have . Using the definition of the sets and , we get
In the case and we get the simplified equations
In the remaining case solving this system for and , we get exactly the system in (7). The constraints (8) follow from the definition of the sets , , , and ensure that the computed thresholds are compatible with the corresponding partitioning of the index set. ∎
We now show how to check if the (biased) projection is zero. For the standard simplex, where the cone is the positive orthant , the projection is zero when all . It is slightly more involved for .
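The biased projection onto the top-k cone is zero if the sum of the k largest components of the point being projected is non-positive (sufficient condition). For the Euclidean (unbiased) projection, this condition is also necessary.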
Let be the top- cone. It is known that the Euclidean projection of onto is if and only if , is in the normal cone to at . Therefore, we obtain as an equivalent condition that . Take any and let . If , we have that at least components in must be positive. To maximize , we would have exactly positive corresponding to the largest components in . That would result in , which is non-positive if and only if .
For , the objective function has an additional term that vanishes at . Therefore, if is optimal for the Euclidean projection, it must also be optimal for the biased projection. ∎
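The zero-projection test for the Euclidean case can be checked by brute force on small inputs. The sketch below assumes the top-k cone consists of non-negative vectors whose coordinates are each at most a 1/k fraction of their sum, so that on the slice with unit sum the vertices have exactly k coordinates equal to 1/k. The function names are hypothetical.

```python
import itertools
import numpy as np

def topk_cone_projection_is_zero(a, k):
    """Euclidean case: the projection is zero iff the k largest entries of a sum to <= 0."""
    a = np.asarray(a, dtype=float)
    return bool(np.sort(a)[-k:].sum() <= 0.0)

def max_inner_product_over_vertices(a, k):
    """Brute-force max of <a, x> over vertices of {0 <= x_i <= 1/k, sum(x) = 1}."""
    a = np.asarray(a, dtype=float)
    return max(
        a[list(idx)].sum() / k
        for idx in itertools.combinations(range(len(a)), k)
    )

a = [1.0, -2.0, -3.0]
for k in (1, 2, 3):
    print(k, topk_cone_projection_is_zero(a, k), max_inner_product_over_vertices(a, k))
```

The projection is zero exactly when no point of the cone has a positive inner product with the input, and the brute-force maximum changes sign at the same k as the closed-form test.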
Projection. Lemmas 4.2 and 4.2 suggest a simple algorithm for the (biased) projection onto the top- cone. First, we check if the projection is constant (cases and in Lemma 4.2). In case , we compute and check if it is compatible with the corresponding sets , , . In the general case , we suggest a simple exhaustive search strategy. We sort and loop over the feasible partitions , , until we find a solution to (7) that satisfies (8). Since we know that and , we can limit the search to iterations in the worst case, where each iteration requires a constant number of operations. For the biased projection, we leave as the fallback case as Lemma 4.2 gives only a sufficient condition. This yields a runtime complexity of , which is comparable to simplex projection algorithms based on sorting.
As we argued in § 4.1, the (biased) projection onto the top-k simplex becomes either the knapsack problem or the (biased) projection onto the top-k cone depending on the constraint at the optimum. The following Lemma provides a way to check which of the two cases applies.
Let be the solution to the following optimization problem