Multi-label classification [1, 2, 3, 4], in which each instance can belong to multiple labels simultaneously, has attracted significant attention from researchers because of its wide range of applications, from document classification and automatic image annotation to video annotation. For example, in automatic image annotation, one needs to automatically predict relevant keywords, such as beach, sky and tree, to describe a natural scene image. When classifying documents, one may need to classify them into different groups, such as Science, Finance and Sports. In video annotation, labels such as Government, Policy and Election may be needed to describe the subject of the video.
A popular strategy in multi-label learning is binary relevance (BR), which trains a linear regression model for each label independently. Recently, some sophisticated models have been developed to improve the performance of BR. For example, embedding approaches [6, 7, 8, 9, 10] have become popular techniques. Even though embedding methods improve the prediction performance of BR to some extent, their training process usually involves a complex quadratic or semidefinite programming problem, as in , or their model may involve an NP-hard problem, as in  and . Thus, these kinds of methods are prohibitive for large-scale applications. Much of the literature, such as ,  and , has already shown that BR with an appropriate base learner is usually good enough for some applications, such as document classification . Unfortunately, BR runs slowly due to its linear dependence on the size of the input data. The question is how to overcome these computational obstacles while obtaining results comparable to those of BR.
To address the above problem, we provide a simple stochastic sketch strategy for multi-label classification. In particular, we carefully construct a small sketch of the full data set, and then use that sketch as a surrogate to perform fast optimization. This paper first introduces the stochastic $\sigma$-Subgaussian sketch, and then proposes constructing a sketch matrix based on the Walsh-Hadamard matrix to reduce the expensive matrix multiplications of the $\sigma$-Subgaussian sketch. From an algorithmic perspective, we provide provable guarantees that our proposed methods are approximately as good as the exact solution of BR. From a statistical learning perspective, we provide the generalization error bound of multi-label classification using our proposed stochastic sketch model.
Experiments on various real-world data sets demonstrate the superiority of the proposed methods. The results verify our theoretical findings. We organize this paper as follows. The second section introduces our proposed stochastic sketch for multi-label classification. The third section provides the provable guarantees for our algorithm from both algorithmic and statistical learning perspectives, and experimental results are presented in the fourth section. The last section provides our conclusions.
II Stochastic Sketch for Multi-label Classification
is a real vector representing an input (instance), and is a real vector representing the corresponding output . denotes the number of training samples. The input matrix is and the output matrix is . and represent the inner product and the identity matrix, respectively. We denote the transpose of the vector/matrix by the superscript and the logarithms to base 2 by . Let and represent the norm and Frobenius norm, respectively. Let be the regressors and denote the standard Gaussian distribution.
A simple linear regression model for BR  learns the matrix through the following formulation:
Assuming that and , the computational complexity for this problem is . The computational cost of an exact solution for problem 1 will be prohibitive in large-scale settings. To solve this problem, we construct a small sketch of the full data set by stochastic projection methods, and then use that sketch as a surrogate to perform fast optimization for problem 1. Specifically, we define a sketch matrix and , where is the projection dimension and is the zero matrix. The input matrix and output matrix are approximated by their sketched matrices and , respectively. We aim to solve the following sketched version of problem 1.
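The sketch-and-solve idea can be illustrated with a minimal NumPy example. It assumes that problem 1 is the least-squares objective $\min_V \|XV - Y\|_F^2$ and that the sketched problem replaces $X$ and $Y$ with $SX$ and $SY$; the function names, the Gaussian choice of $S$, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def solve_br_exact(X, Y):
    """Least-squares regressors for all labels at once (assumed form of problem 1)."""
    V, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return V

def solve_br_sketched(X, Y, m, rng):
    """Solve the same problem on an m-row Gaussian sketch (S X, S Y)."""
    n = X.shape[0]
    # The 1/sqrt(m) scaling is a convention; it does not change the minimizer.
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    V, *_ = np.linalg.lstsq(S @ X, S @ Y, rcond=None)
    return V

def objective(X, Y, V):
    """Objective evaluated on the full (unsketched) data."""
    return np.linalg.norm(X @ V - Y, "fro") ** 2

# Synthetic multi-label data: n samples, p features, q labels (hypothetical sizes).
rng = np.random.default_rng(0)
n, p, q, m = 2000, 20, 5, 400
X = rng.standard_normal((n, p))
V_true = rng.standard_normal((p, q))
Y = (X @ V_true + rng.standard_normal((n, q)) > 0).astype(float)

V_star = solve_br_exact(X, Y)
V_hat = solve_br_sketched(X, Y, m, rng)
ratio = objective(X, Y, V_hat) / objective(X, Y, V_star)
print(f"f(V_hat) / f(V_star) = {ratio:.3f}")
```

The ratio is at least 1 by optimality of the exact solution, and it should move toward 1 as the sketch size `m` grows.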
Motivated by [17, 18, 12], we use a $k$-nearest neighbor ($k$NN) classifier in the embedding space for prediction, instead of using an expensive decoding process . Next, we introduce two kinds of stochastic sketch methods.
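A nearest-neighbor prediction step of this kind might look as follows. This is a sketch under stated assumptions: the embeddings `Z_train` and `Z_test` are taken as given (for example, inputs mapped through the learned regressors), labels are aggregated by a per-label majority vote among the neighbors, and ties are resolved toward 1. None of these choices is confirmed by the text.

```python
import numpy as np

def knn_predict(Z_train, Y_train, Z_test, k=1):
    """Predict binary label vectors by a per-label majority vote over the k nearest
    training embeddings (Euclidean distance). A tie (vote of exactly 0.5) predicts 1."""
    # Squared distances between every test and every training embedding.
    d2 = ((Z_test[:, None, :] - Z_train[None, :, :]) ** 2).sum(axis=2)
    nn = np.argsort(d2, axis=1)[:, :k]      # indices of the k nearest neighbors
    votes = Y_train[nn].mean(axis=1)        # shape (n_test, n_labels)
    return (votes >= 0.5).astype(int)

# Toy embeddings with two well-separated groups and two labels.
Z_train = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
Y_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
Z_test = np.array([[0.1, 0.2], [5.2, 5.1]])

pred_1nn = knn_predict(Z_train, Y_train, Z_test, k=1)
pred_3nn = knn_predict(Z_train, Y_train, Z_test, k=3)
print(pred_1nn.tolist(), pred_3nn.tolist())
```

The brute-force distance matrix keeps the example short; a spatial index would be the usual choice for large training sets.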
II-A Stochastic $\sigma$-Subgaussian Sketch
Definition 1 ($\sigma$-Subgaussian).
A row of the sketch matrix is $\sigma$-Subgaussian if it has zero mean and, for any vector and , we have
Clearly, a vector with i.i.d. standard Gaussian entries or Bernoulli entries is 1-Subgaussian. We refer to any matrix as a Subgaussian sketch if its rows are zero mean, 1-Subgaussian, and have the covariance matrix . A Subgaussian sketch is straightforward to construct. However, given the Subgaussian sketch , the cost of computing and is and , respectively. Next, we introduce the following technique to reduce this time complexity.
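The two constructions mentioned above can be written in a few lines. The example below assumes that "Bernoulli entries" means symmetric $\pm 1$ (Rademacher) entries and that the target row covariance is the identity; it checks the latter empirically. The function names are illustrative.

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    """m x n sketch with i.i.d. standard Gaussian entries."""
    return rng.standard_normal((m, n))

def rademacher_sketch(m, n, rng):
    """m x n sketch with i.i.d. +/-1 entries (assumed meaning of 'Bernoulli')."""
    return rng.choice([-1.0, 1.0], size=(m, n))

def max_cov_error(S):
    """Largest entrywise gap between the empirical row covariance and the identity."""
    m, n = S.shape
    return float(np.abs(S.T @ S / m - np.eye(n)).max())

rng = np.random.default_rng(0)
m, n = 4000, 8
err_gauss = max_cov_error(gaussian_sketch(m, n, rng))
err_rad = max_cov_error(rademacher_sketch(m, n, rng))
print(err_gauss, err_rad)
```

Applying such a dense sketch to an input matrix is an ordinary matrix product, which is the cost that the structured sketch in the next subsection is meant to reduce.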
II-B Stochastic Walsh-Hadamard Sketch
Inspired by , we propose to construct the sketch matrix based on the Walsh-Hadamard matrix to reduce the expensive matrix multiplications of the Subgaussian sketch. Formally, a stochastic Walsh-Hadamard sketch matrix is obtained with i.i.d. rows of the form:
where is a random subset of rows uniformly sampled from , is a random diagonal matrix whose entries are i.i.d. Rademacher variables and constitutes a Walsh-Hadamard matrix defined as:
where and represent the binary expression with -bit of and (assume ).
Then, we can employ the fast Walsh-Hadamard transform  to perform and in and .
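A fast Walsh-Hadamard transform and a sketch built on it can be sketched as follows. The assumptions are: the number of rows is a power of two, the Hadamard matrix uses the Sylvester (natural) ordering, rows are sampled uniformly with replacement, and the scaling is chosen so that the sketch has identity expected Gram matrix. The helper names are hypothetical, and the explicit `hadamard` matrix is built only to check the fast transform.

```python
import numpy as np

def fwht(A):
    """Unnormalized fast Walsh-Hadamard transform along axis 0, O(n log n) per column."""
    A = np.array(A, dtype=float)  # work on a copy
    n = A.shape[0]
    if n == 0 or n & (n - 1):
        raise ValueError("number of rows must be a power of two")
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            top = A[i:i + h].copy()
            bot = A[i + h:i + 2 * h].copy()
            A[i:i + h] = top + bot            # butterfly: sum
            A[i + h:i + 2 * h] = top - bot    # butterfly: difference
        h *= 2
    return A

def hadamard(n):
    """Explicit Sylvester-ordered Hadamard matrix with +/-1 entries (for checking only)."""
    H = np.ones((1, 1))
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def wh_sketch(A, m, rng):
    """Apply an m-row randomized Hadamard sketch to the 2-D array A without forming S."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)        # diagonal of the random sign matrix
    rows = rng.integers(0, n, size=m)              # uniformly sampled row indices
    HDA = fwht(signs[:, None] * A) / np.sqrt(n)    # orthonormal Hadamard times D A
    return np.sqrt(n / m) * HDA[rows]

rng = np.random.default_rng(0)
n, p, m = 256, 6, 64
X = rng.standard_normal((n, p))
fast = fwht(X)
slow = hadamard(n) @ X
SX = wh_sketch(X, m, rng)
print(np.allclose(fast, slow), SX.shape)
```

When the number of samples is not a power of two, zero-padding the data to the next power of two is a common workaround.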
III Main Results
Since we address problem 2 rather than directly solving problem 1, which has great advantages for fast optimization, it is interesting to ask the question: what is the relationship between problem 2 and problem 1? Let and be the optimal solutions of problem 1 and problem 2. We define and . We will prove that we can choose an appropriate such that the two optimal objectives and are approximately the same. This means that we can speed up the computation of problem 1, without sacrificing too much accuracy. Furthermore, we provide the generalization error bound of the multi-label classification problem using our proposed stochastic sketch model. To measure the quality of approximation, we first define the -optimality approximation as follows:
Definition 2 (-Optimality Approximation).
Given , is a -optimality approximation solution, if
According to the properties of the matrix norm, we have , so is proportional to . Therefore, the closeness of and implies the closeness of and .
III-A $\sigma$-Subgaussian Sketch Guarantee
We first introduce the tangent cone, which is used by :
Definition 3 (Tangent Cone).
Given a set and , the tangent cone of at is defined as for some and , where clconv denotes the closed convex hull.
The tangent cone arises naturally in the convex optimality conditions: any defines a feasible direction at the optimal , and optimality means that it is impossible to decrease the objective function by moving in directions belonging to the tangent cone. Then, we introduce the Gaussian width, which is an important complexity measure used by :
Definition 4 (Gaussian Width).
Given a closed set , the Gaussian width of , denoted by , is defined as:
This complexity measure plays an important role in learning theory and statistics . Let be the Euclidean sphere.
represents the linearly transformed cone: , and we use Gaussian width to measure the width of the intersection of and . This paper defines . We state the following theorem for guaranteeing the $\sigma$-Subgaussian sketch:
Let be a stochastic $\sigma$-Subgaussian sketch matrix, and be universal constants. Given any and , then with probability at least , is a -optimality approximation solution.
The proof sketch of this theorem can be found in the supplementary material.
Remark. Theorem 1 guarantees that the stochastic $\sigma$-Subgaussian sketch method is able to construct a small sketch of the full data set for the fast optimization of problem 1, while preserving the -optimality of the solution.
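The Gaussian width of Definition 4 can be estimated numerically in simple cases. The example below assumes the common form $\mathbb{W}(\mathcal{K}) = \mathbb{E}_g[\sup_{z \in \mathcal{K}} \langle g, z \rangle]$ with $g$ standard Gaussian, and takes $\mathcal{K}$ to be the unit sphere intersected with the column space of a full-column-rank matrix $X$, which is what the transformed cone reduces to when the regressors are unconstrained. The function name and the test matrix are illustrative.

```python
import numpy as np

def gaussian_width_subspace(X, n_draws, rng):
    """Monte Carlo estimate of E sup_{z in range(X), ||z|| = 1} <g, z>.
    For a subspace the supremum is the norm of the projection of g onto it."""
    Q, _ = np.linalg.qr(X)                       # orthonormal basis of range(X), shape (n, p)
    G = rng.standard_normal((n_draws, X.shape[0]))
    return float(np.linalg.norm(G @ Q, axis=1).mean())

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.standard_normal((n, p))
width = gaussian_width_subspace(X, n_draws=5000, rng=rng)
print(f"estimated width = {width:.3f}, sqrt(p) = {np.sqrt(p):.3f}")
```

In this setting the estimate is close to $\sqrt{p}$ and does not depend on the number of samples $n$, which gives some intuition for why a sketch much smaller than $n$ can be sufficient.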
III-B Walsh-Hadamard Sketch Guarantee
We generalize the concept of Gaussian width to two additional measures, -Gaussian width and Rademacher width:
Definition 5 (-Gaussian Width).
Given a closed set and a stochastic sketch matrix , the -Gaussian width of , denoted by , is defined as:
Definition 6 (Rademacher Width).
Given a closed set , the Rademacher width of , denoted by , is defined as:
where is an i.i.d. vector of Rademacher variables.
Next, we still define and state the following theorem for guaranteeing the Walsh-Hadamard sketch:
Let be a stochastic Walsh-Hadamard sketch matrix, , and be universal constants. Given any and , then with probability at least , is a -optimality approximation solution.
Remark. An additional term appears in the sketch size, so the required sketch size for the Walsh-Hadamard sketch is larger than that required for the $\sigma$-Subgaussian sketch. However, the potentially larger sketch size is offset by the much lower cost of matrix multiplications via the stochastic Walsh-Hadamard sketch matrix. Theorem 2 guarantees that the stochastic Walsh-Hadamard sketch method is also able to construct a small sketch of the full data set for the fast optimization of problem 1, while preserving the -optimality of the solution.
III-C Generalization Error Bound
This subsection provides the generalization error bound of the multi-label classification problem using our two proposed stochastic sketch models. Because our results apply to both models, we simply refer to them as SS. Assume our model is characterized by a distribution on the space of inputs and labels , where . Let a sample be drawn i.i.d. from the distribution , where are the ground truth label vectors. Assume samples are drawn i.i.d. times from the distribution , which is denoted by . For two inputs in , we define as the Euclidean metric in the original input space and as the metric in the embedding input space. Let represent the prediction of the -th label for input using our model SS-$k$NN, which is trained on . The performance of SS-$k$NN: is then measured in terms of its generalization error, which is its expected loss on a new example drawn according to :
where means the -th label and represents the loss function for the -th label. We define the loss function as follows for the analysis.
For the -th label, we define the function as follows:
The Bayes optimal classifier for the -th label is defined as
Before deriving our results, we first present several important definitions and theorems.
Definition 7 (Covering Numbers, ).
Let be a metric space, be a subset of and . A set is an -cover for , if for every , there exists such that . The -covering number of , , is the minimal cardinality of an -cover for (if there is no such finite cover then it is defined as ).
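For a finite point set, an upper bound on the covering number in Definition 7 can be obtained with a simple greedy procedure. The example below is a sketch under assumptions: the metric is Euclidean, the closeness condition is taken as distance at most $\epsilon$, and centers are chosen from the set itself. It produces a valid cover, though not necessarily a minimal one.

```python
import numpy as np

def greedy_cover(points, eps):
    """Return indices of centers such that every point lies within eps of some center.
    The number of centers upper-bounds the eps-covering number of the point set."""
    centers = []
    uncovered = np.ones(len(points), dtype=bool)
    while uncovered.any():
        c = int(np.argmax(uncovered))            # first still-uncovered point becomes a center
        centers.append(c)
        dist = np.linalg.norm(points - points[c], axis=1)
        uncovered &= dist > eps                  # everything within eps of c is now covered
    return centers

rng = np.random.default_rng(0)
points = rng.random((500, 2))                    # points in the unit square
eps = 0.25
centers = greedy_cover(points, eps)
# Distance from every point to its nearest chosen center.
nearest = np.linalg.norm(points[:, None, :] - points[centers][None, :, :], axis=2).min(axis=1)
print(len(centers), float(nearest.max()))
```

Because each new center is farther than `eps` from all earlier ones, the chosen centers are also pairwise separated, so the count cannot grow without bound on a bounded set.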
Definition 8 (Doubling Dimension, ).
Let be a metric space, and let be the smallest value such that every ball in can be covered by balls of half the radius. The doubling dimension of is defined as : .
Theorem 3 ().
Let be a metric space. The diameter of is defined as . The -covering number of , , is bounded by:
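A bound of this kind is commonly written in the following form. It is given here as an assumption about the intended statement, together with a short justification; the symbols $\mathcal{N}$, $\operatorname{diam}$, $\operatorname{ddim}$ and $\lambda$ are illustrative names for the covering number, the diameter, the doubling dimension and the doubling constant.

```latex
% Assumed form of the covering-number bound, for 0 < \epsilon \le \operatorname{diam}(\mathcal{X}):
\mathcal{N}(\epsilon, \mathcal{X})
  \le \left( \frac{2\,\operatorname{diam}(\mathcal{X})}{\epsilon} \right)^{\operatorname{ddim}(\mathcal{X})}

% Justification: \mathcal{X} lies in one ball of radius \operatorname{diam}(\mathcal{X}).
% Halving the radius t = \lceil \log_2(\operatorname{diam}(\mathcal{X})/\epsilon) \rceil times
% gives at most \lambda^{t} balls of radius at most \epsilon, and
\lambda^{t}
  \le \lambda^{\log_2(\operatorname{diam}(\mathcal{X})/\epsilon) + 1}
  = \lambda^{\log_2(2\operatorname{diam}(\mathcal{X})/\epsilon)}
  = \left( \frac{2\,\operatorname{diam}(\mathcal{X})}{\epsilon} \right)^{\log_2 \lambda},
\qquad \operatorname{ddim}(\mathcal{X}) = \log_2 \lambda .
```

The last equality uses the identity $a^{\log_2 b} = b^{\log_2 a}$, which is why the doubling dimension appears as the exponent.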
We provide the following generalization error bound for SS-1NN:
Given a metric space , assume function is Lipschitz with constant with respect to the sup-norm for each label. Suppose has a finite doubling dimension: and . Let and be drawn i.i.d. from the distribution . Then, we have
Inspired by Theorem 19.5 in , we derive the following lemma for SS-$k$NN:
Given metric space , assume function is Lipschitz with constant with respect to the sup-norm for each label. Suppose has a finite doubling dimension: and . Let and be drawn i.i.d. from the distribution . Then, we have
The following corollary reveals important statistical properties of SS-1NN and SS-$k$NN.
As goes to infinity, the error of the SS-1NN and SS-$k$NN converges to the sum of twice the Bayes error and times Bayes error over the labels, respectively.
IV-A Data Sets and Baselines
We abbreviate our proposed stochastic $\sigma$-Subgaussian sketch and stochastic Walsh-Hadamard sketch to SS+GAU and SS+WH, respectively. In the experiments, we set the entries of the $\sigma$-Subgaussian sketch matrix as i.i.d. standard Gaussian entries. This section evaluates the performance of the proposed methods on four data sets: corel5k, nus(vlad), nus(bow) and rcv1x. The statistics of these data sets are presented on the website http://mulan.sourceforge.net. We compare SS+GAU and SS+WH with several state-of-the-art methods, as follows.
BR: We implement two base classifiers for BR. The first uses the linear classification/regression package LIBLINEAR  with -regularized squared hinge loss as the base classifier. We simply call this baseline BR+LIB. The second uses $k$NN as the base classifier. We simply call this baseline BR+$k$NN and count the $k$NN search time as the training time.
FastXML: An advanced tree-based multi-label classifier.
SLEEC: A state-of-the-art embedding method, which is based on sparse local embeddings for large-scale multi-label classification. We use the solvers of FastXML and SLEEC provided by the respective authors with default parameters.
Following similar settings in  and , we set for the $k$NN search in all $k$NN-based methods. The sketch size is chosen in a range of . Following ,  and , we consider the Hamming Loss and Example-F1 measures to evaluate the prediction performance of all the methods. The smaller the value of the Hamming Loss, the better the performance, while the larger the value of Example-F1, the better the performance.
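The two measures can be computed as follows. This sketch uses the usual definitions: Hamming Loss is the fraction of label entries predicted incorrectly, and Example-F1 is the F1 score computed per example and then averaged. Scoring an example with no true and no predicted labels as 1 is an assumed convention, not one stated in the text.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of (example, label) entries that are predicted incorrectly."""
    return float(np.mean(Y_true != Y_pred))

def example_f1(Y_true, Y_pred):
    """F1 per example, 2|y AND y_hat| / (|y| + |y_hat|), averaged over examples."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=1)
    denom = Y_true.sum(axis=1) + Y_pred.sum(axis=1)
    # Assumed convention: an example with no true and no predicted labels scores 1.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 1.0)
    return float(f1.mean())

Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])
hl = hamming_loss(Y_true, Y_pred)   # 2 wrong entries out of 6
f1 = example_f1(Y_true, Y_pred)     # each example has F1 = 2/3
print(hl, f1)
```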
Figure 1 shows that as the sketch size increases, the training time of SS+GAU and SS+WH rises, while their prediction performance improves. The results verify our theoretical analysis. The Hamming Loss, Example-F1 and training time comparisons of various methods on the corel5k, nus(vlad), nus(bow) and rcv1x data sets are shown in Table I, Table II and Table III, respectively. From Tables I, II and III, we can see that:
Because we perform the optimization only on a small sketch of the full data set, our proposed methods are significantly faster than BR and state-of-the-art embedding approaches. Moreover, we can maintain competitive prediction performance by setting an appropriate sketch size. The empirical results illustrate our theoretical studies.
This paper carefully constructs the stochastic $\sigma$-Subgaussian sketch and the Walsh-Hadamard sketch for multi-label classification. From an algorithmic perspective, we show that we can obtain answers that are approximately as good as the exact answer for BR. From a statistical learning perspective, we also provide the generalization error bound of multi-label classification using our proposed stochastic sketch model. Lastly, our empirical studies corroborate our theoretical findings and demonstrate the superiority of the proposed methods.
Supplementary: The Proof of Important Theorems and Lemmas
V-A Proof of Theorem 1
Let be a stochastic $\sigma$-Subgaussian sketch matrix. Then there are universal constants and such that for any subset , any and , we have
with probability at least , and we have
with probability at least , where .
Let be a stochastic $\sigma$-Subgaussian sketch matrix, and be universal constants. Given any and , then with probability at least , is a -optimality approximation solution.
Let , and be columns of matrix , and , respectively. Then and can be decomposed to and . Next, we study the relationship between and . We define . According to Definition 3 in the main paper, we know that belongs to the tangent cone of at .
Because , we have . Then, we get:
As , we have . Then, we get and
We derive the following:
By using Lemma 1, with probability at least , we have
where . Given , we have . For the sake of clarity, we define and , and then substitute them into the above expression. With probability at least , we have
Clearly, we have . By using Lemma 1, with probability at least , we have
By setting , with probability at least , we have
Eq.(18) implies that, with probability at least , we have
We define and , and then substitute them into the above expression. With probability at least , we have
By using Lemma 1 again, with probability at least , we have
By setting , with probability at least , we have