I Introduction
Multilabel classification [1, 2, 3, 4], in which each instance can belong to multiple labels simultaneously, has attracted significant attention from researchers owing to its wide range of applications, from document classification and automatic image annotation to video annotation. For example, in automatic image annotation, one needs to automatically predict relevant keywords, such as beach, sky and tree, to describe a natural scene image. When classifying documents, one may need to assign them to different groups, such as Science, Finance and Sports. In video annotation, labels such as Government, Policy and Election may be needed to describe the subject of the video.
A popular strategy in multilabel learning is binary relevance (BR) [5], which trains a linear regression model for each label independently. Recently, more sophisticated models have been developed to improve the performance of BR. For example, embedding approaches [6, 7, 8, 9, 10] have become popular techniques. Even though embedding methods improve the prediction performance of BR to some extent, their training usually involves a complex quadratic or semidefinite programming problem, as in [11], or an NP-hard problem, as in [8] and [12]. Thus, these methods are prohibitively expensive in large-scale applications. Much of the literature, such as [13], [14] and [15], has already shown that BR with an appropriate base learner is usually good enough for some applications, such as document classification [15]. Unfortunately, BR runs slowly because its cost grows linearly with the size of the input data. The question is how to overcome these computational obstacles while obtaining results comparable to BR.
To address this problem, we provide a simple stochastic sketch strategy for multilabel classification. In particular, we carefully construct a small sketch of the full data set, and then use that sketch as a surrogate to perform fast optimization. This paper first introduces the stochastic Subgaussian sketch, and then proposes the construction of a sketch matrix based on the Walsh-Hadamard matrix to reduce the expensive matrix multiplications of the Subgaussian sketch. From an algorithmic perspective, we provide provable guarantees that our proposed methods are approximately as good as the exact solution of BR. From a statistical learning perspective, we provide the generalization error bound of multilabel classification using our proposed stochastic sketch model.
Experiments on various real-world data sets demonstrate the superiority of the proposed methods, and the results verify our theoretical findings. We organize this paper as follows. The second section introduces our proposed stochastic sketch for multilabel classification. The third section provides provable guarantees for our algorithm from both algorithmic and statistical learning perspectives, and experimental results are presented in the fourth section. The last section provides our conclusions.
II Stochastic Sketch for Multilabel Classification
Assume that $\mathbf{x} \in \mathbb{R}^{d}$ is a real vector representing an input (instance), and $\mathbf{y} \in \mathbb{R}^{L}$ is a real vector representing the corresponding output. $n$ denotes the number of training samples. The input matrix is $X \in \mathbb{R}^{n \times d}$ and the output matrix is $Y \in \mathbb{R}^{n \times L}$. $\langle \cdot, \cdot \rangle$ and $I$ represent the inner product and the identity matrix, respectively. We denote the transpose of a vector/matrix by the superscript $\top$ and logarithms to base 2 by $\log$. Let $\|\cdot\|_{2}$ and $\|\cdot\|_{F}$ represent the $\ell_{2}$ norm and the Frobenius norm, respectively. Let $W \in \mathbb{R}^{d \times L}$ be the regressors and $N(0, 1)$ denote the standard Gaussian distribution.
A simple linear regression model for BR [5] learns the regressors $W$ through the following formulation:
$$W^{*} = \arg\min_{W} \|XW - Y\|_{F}^{2}. \qquad (1)$$
Assuming that $n \gg d$ and $n \gg L$, the computational complexity of solving this problem exactly is $O(nd(d + L))$ [16]. The computational cost of an exact solution for problem 1 is therefore prohibitive in large-scale settings. To solve this problem, we construct a small sketch of the full data set by stochastic projection methods, and then use that sketch as a surrogate to perform fast optimization for problem 1. Specifically, we define a sketch matrix $S \in \mathbb{R}^{m \times n}$ with $\mathbb{E}[S] = \mathbf{0}$, where $m$ ($m \ll n$) is the projection dimension and $\mathbf{0}$ is the matrix with all zero entries. The input matrix $X$ and the output matrix $Y$ are approximated by their sketched matrices $SX \in \mathbb{R}^{m \times d}$ and $SY \in \mathbb{R}^{m \times L}$, respectively. We aim to solve the following sketched version of problem 1:
$$\widetilde{W} = \arg\min_{W} \|SXW - SY\|_{F}^{2}. \qquad (2)$$
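To make this concrete, the following is a minimal NumPy sketch (not the paper's actual implementation) of problem 1 versus problem 2 on synthetic data, using an i.i.d. Gaussian sketch matrix, one of the Subgaussian choices introduced in the next subsection; the dimensions, the scaling of $S$, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L, m = 10000, 50, 20, 500          # samples, features, labels, sketch size

X = rng.standard_normal((n, d))          # input matrix
Y = (X @ rng.standard_normal((d, L)) + 0.1 * rng.standard_normal((n, L)) > 0).astype(float)

# Problem 1: exact binary relevance via least squares on the full data.
W_exact, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Problem 2: sketch the data with S (m x n), then solve the much smaller problem.
S = rng.standard_normal((m, n)) / np.sqrt(m)   # scaled so that E[S^T S] = I
W_sketch, *_ = np.linalg.lstsq(S @ X, S @ Y, rcond=None)

f = lambda W: np.linalg.norm(X @ W - Y) ** 2   # objective of problem 1
print("exact objective   :", f(W_exact))
print("sketched objective:", f(W_sketch))      # close to the exact value for large enough m
```

As the sketch size $m$ grows, the sketched objective approaches the exact one, which is the behavior analyzed in the next section.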
Motivated by [17, 18, 12], we use a nearest neighbor (NN) classifier in the embedding space for prediction, instead of using an expensive decoding process [11]. Next, we introduce two kinds of stochastic sketch methods.
II-A Stochastic Subgaussian Sketch
The entries of a sketch matrix $S$ can simply be defined as i.i.d. random variables drawn from certain distributions, such as the Gaussian distribution and the Bernoulli distribution. [19] has already shown that each of these distributions is a special case of the Subgaussian distribution, which is defined as follows:

Definition 1 (Subgaussian). A row $s_{i} \in \mathbb{R}^{n}$ of the sketch matrix $S$ is $\sigma$-Subgaussian if it has zero mean and, for any unit vector $u \in \mathbb{R}^{n}$ and any $\epsilon \ge 0$, we have
$$\mathbb{P}\big[\,|\langle s_{i}, u\rangle| \ge \epsilon\,\big] \le 2\exp\!\Big(-\frac{\epsilon^{2}}{2\sigma^{2}}\Big).$$

Clearly, a vector with i.i.d. standard Gaussian entries or Bernoulli ($\pm 1$) entries is 1-Subgaussian. We call a matrix $S$ a Subgaussian sketch if its rows are zero-mean, 1-Subgaussian, and have covariance matrix $I_{n}$. A Subgaussian sketch is straightforward to construct. However, given a Subgaussian sketch $S \in \mathbb{R}^{m \times n}$, the cost of computing $SX$ and $SY$ is $O(mnd)$ and $O(mnL)$, respectively. Next, we introduce the following technique to reduce this time complexity.
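As a small, purely illustrative sanity check (assuming the standard Gaussian and Rademacher constructions above), both choices of Subgaussian rows approximately preserve Euclidean norms once rescaled by $1/\sqrt{m}$, which is the property the sketched problem relies on:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 400, 64
v = rng.standard_normal(n)

S_gauss = rng.standard_normal((m, n))            # rows ~ N(0, I_n), 1-Subgaussian
S_rade = rng.choice([-1.0, 1.0], size=(m, n))    # Rademacher (+/-1) entries, also 1-Subgaussian

for name, S in [("Gaussian", S_gauss), ("Rademacher", S_rade)]:
    # After rescaling by 1/sqrt(m), ||S v||_2 concentrates around ||v||_2.
    print(name, np.linalg.norm(S @ v) / np.sqrt(m), "vs", np.linalg.norm(v))
```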
II-B Stochastic Walsh-Hadamard Sketch
Inspired by [20], we propose to construct the sketch matrix based on the Walsh-Hadamard matrix to reduce the expensive matrix multiplications of the Subgaussian sketch. Formally, a stochastic Walsh-Hadamard sketch matrix $S \in \mathbb{R}^{m \times n}$ is obtained with i.i.d. rows of the form
$$s_{i} = \sqrt{n}\, e_{j_{i}}^{\top} H D,$$
where $j_{i}$ is an index uniformly sampled from $\{1, \ldots, n\}$ (so the selected rows form a random subset of the rows of $HD$), $D$ is a random $n \times n$ diagonal matrix whose entries are i.i.d. Rademacher variables, and $H$ constitutes a Walsh-Hadamard matrix defined as
$$H_{jk} = \frac{1}{\sqrt{n}}\,(-1)^{\langle j_{\mathrm{bin}},\, k_{\mathrm{bin}}\rangle},$$
where $j_{\mathrm{bin}}$ and $k_{\mathrm{bin}}$ represent the binary expressions, with $\log n$ bits, of $j$ and $k$ (assume $n = 2^{p}$ for an integer $p$). Then, we can employ the fast Walsh-Hadamard transform [21] to compute $SX$ and $SY$ in $O(nd\log m)$ and $O(nL\log m)$ time, respectively.
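Below is a minimal NumPy sketch of this construction under the stated assumption that $n$ is a power of two (otherwise one can zero-pad). The function names are illustrative, and this simple implementation costs $O(nd\log n)$ rather than the subsampled $O(nd\log m)$; it is still far cheaper than the $O(mnd)$ dense multiplication of the Subgaussian sketch.

```python
import numpy as np

def fwht(a):
    """Unnormalized fast Walsh-Hadamard transform along axis 0 (length must be a power of 2)."""
    a = a.copy()
    h, n = 1, a.shape[0]
    while h < n:
        for i in range(0, n, 2 * h):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def walsh_hadamard_sketch(A, m, rng):
    """Apply an S = sqrt(n/m) * P * H * D sketch to A: random signs, transform, subsample m rows."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)       # diagonal Rademacher matrix D
    Z = fwht(signs[:, None] * A) / np.sqrt(n)     # orthonormal H applied to D A
    rows = rng.choice(n, size=m, replace=True)    # uniform row subsampling P
    return np.sqrt(n / m) * Z[rows]               # rescaled so that E[S^T S] = I

# To sketch X and Y with the *same* S, apply the function once to np.hstack([X, Y]).
```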
III Main Results
Since we address problem 2 rather than directly solving problem 1, which has great advantages for fast optimization, it is natural to ask: what is the relationship between problem 2 and problem 1? Let $W^{*}$ and $\widetilde{W}$ be the optimal solutions of problem 1 and problem 2, respectively. We define $f(W) = \|XW - Y\|_{F}^{2}$ as the objective of problem 1. We will prove that we can choose an appropriate sketch size $m$ such that the two objective values $f(W^{*})$ and $f(\widetilde{W})$ are approximately the same. This means that we can speed up the computation of problem 1 without sacrificing too much accuracy. Furthermore, we provide the generalization error bound of the multilabel classification problem using our proposed stochastic sketch model. To measure the quality of approximation, we first define the optimality approximation as follows:
Definition 2 (Optimality Approximation). Given $\epsilon > 0$, $\widetilde{W}$ is an $\epsilon$-optimality approximation solution if
$$f(\widetilde{W}) \le (1 + \epsilon)\, f(W^{*}).$$
Because $W^{*}$ minimizes the least-squares objective, we have $\|XW^{*} - X\widetilde{W}\|_{F}^{2} = f(\widetilde{W}) - f(W^{*})$, so $\|XW^{*} - X\widetilde{W}\|_{F}^{2}$ is proportional to the objective gap. Therefore, the closeness of $f(\widetilde{W})$ and $f(W^{*})$ implies the closeness of $\widetilde{W}$ and $W^{*}$.
III-A Subgaussian Sketch Guarantee
We first introduce the tangent cone, which is used by [22]:

Definition 3 (Tangent Cone). Given a set $\mathcal{C}$ and $W^{*} \in \mathcal{C}$, the tangent cone of $\mathcal{C}$ at $W^{*}$ is defined as
$$\mathcal{K} = \mathrm{clconv}\{\Delta : \Delta = t\,(W - W^{*}) \ \text{for some}\ t \ge 0 \ \text{and}\ W \in \mathcal{C}\},$$
where clconv denotes the closed convex hull.
The tangent cone arises naturally in the convex optimality conditions: any $\Delta \in \mathcal{K}$ defines a feasible direction at the optimum $W^{*}$, and optimality means that it is impossible to decrease the objective function by moving in directions belonging to the tangent cone. Then, we introduce the Gaussian width, which is an important complexity measure used by [23]:
Definition 4 (Gaussian Width). Given a closed set $\mathcal{Z} \subseteq \mathbb{R}^{n}$, the Gaussian width of $\mathcal{Z}$, denoted by $\mathcal{W}(\mathcal{Z})$, is defined as
$$\mathcal{W}(\mathcal{Z}) = \mathbb{E}_{g}\Big[\sup_{z \in \mathcal{Z}} \langle g, z\rangle\Big],$$
where $g \sim N(0, I_{n})$.
This complexity measure plays an important role in learning theory and statistics [24]. Let $\mathbb{S}^{n-1} = \{z \in \mathbb{R}^{n} : \|z\|_{2} = 1\}$ be the Euclidean sphere. $X\mathcal{K} = \{X\Delta : \Delta \in \mathcal{K}\}$ represents the linearly transformed cone, and we use the Gaussian width to measure the width of the intersection of $X\mathcal{K}$ and $\mathbb{S}^{n-1}$. This paper defines $\mathcal{W} = \mathcal{W}(X\mathcal{K} \cap \mathbb{S}^{n-1})$. We state the following theorem for guaranteeing the Subgaussian sketch:

Theorem 1. Let $S \in \mathbb{R}^{m \times n}$ be a stochastic Subgaussian sketch matrix, and let $c_{0}$, $c_{1}$ and $c_{2}$ be universal constants. Given any $\epsilon \in (0, 1)$ and any sketch size $m \ge \frac{c_{0}\,\mathcal{W}^{2}}{\epsilon^{2}}$, then with probability at least $1 - c_{1}\exp(-c_{2}\, m\, \epsilon^{2})$, $\widetilde{W}$ is an $\epsilon$-optimality approximation solution.

The proof sketch of this theorem can be found in the supplementary material.
Remark. Theorem 1 guarantees that the stochastic Subgaussian sketch method is able to construct a small sketch of the full data set for the fast optimization of problem 1, while preserving the optimality of the solution.
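To build intuition for the complexity measure in Definition 4 (this is an illustration, not part of the proposed algorithm), the Gaussian width of the intersection of a random $k$-dimensional subspace with the unit sphere can be estimated by Monte Carlo; it is close to $\sqrt{k}$, illustrating how the width grows with the effective dimension of the set.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 256, 8, 2000

# Orthonormal basis U of a random k-dimensional subspace of R^n.
U, _ = np.linalg.qr(rng.standard_normal((n, k)))

# For Z = (subspace) ∩ (unit sphere), sup_{z in Z} <g, z> equals ||U^T g||_2,
# so the Gaussian width E[sup <g, z>] can be estimated by averaging that norm.
samples = [np.linalg.norm(U.T @ rng.standard_normal(n)) for _ in range(trials)]
print("Monte Carlo estimate:", np.mean(samples))   # close to sqrt(k)
print("sqrt(k)             :", np.sqrt(k))
```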
TABLE I: Hamming Loss comparison (lower is better). For SS+GAU and SS+WH, the three columns correspond to increasing sketch sizes.

| Data Set  | BR+LIB | BR+NN  | FastXML | SLEEC  | SS+GAU |        |        | SS+WH  |        |        |
|-----------|--------|--------|---------|--------|--------|--------|--------|--------|--------|--------|
| corel5k   | 0.0098 | 0.0095 | 0.0093  | 0.0094 | 0.0095 | 0.0095 | 0.0094 | 0.0103 | 0.0102 | 0.0099 |
| nus(vlad) | 0.0211 | 0.0213 | 0.0209  | 0.0207 | 0.0221 | 0.0218 | 0.0216 | 0.0230 | 0.0225 | 0.0218 |
| nus(bow)  | 0.0215 | 0.0220 | 0.0216  | 0.0213 | 0.0227 | 0.0223 | 0.0222 | 0.0229 | 0.0226 | 0.0223 |
| rcv1x     | 0.0017 | 0.0019 | 0.0019  | 0.0018 | 0.00189 | 0.00188 | 0.00187 | 0.00199 | 0.00195 | 0.00192 |
TABLE II: Example-F1 comparison (higher is better). For SS+GAU and SS+WH, the three columns correspond to increasing sketch sizes.

| Data Set  | BR+LIB | BR+NN  | FastXML | SLEEC  | SS+GAU |        |        | SS+WH  |        |        |
|-----------|--------|--------|---------|--------|--------|--------|--------|--------|--------|--------|
| corel5k   | 0.1150 | 0.0930 | 0.0530  | 0.0824 | 0.0475 | 0.0446 | 0.0659 | 0.0539 | 0.0817 | 0.0902 |
| nus(vlad) | 0.1247 | 0.1547 | 0.1118  | 0.1578 | 0.1099 | 0.1310 | 0.1460 | 0.1001 | 0.1289 | 0.1443 |
| nus(bow)  | 0.0984 | 0.1012 | 0.0892  | 0.1122 | 0.0896 | 0.0932 | 0.0952 | 0.0882 | 0.0903 | 0.0920 |
| rcv1x     | 0.2950 | 0.2894 | 0.2367  | 0.2801 | 0.2063 | 0.2767 | 0.2813 | 0.2173 | 0.2621 | 0.2796 |
III-B Walsh-Hadamard Sketch Guarantee
We generalize the concept of Gaussian width to two additional measures, the $S$-Gaussian width and the Rademacher width:

Definition 5 ($S$-Gaussian Width). Given a closed set $\mathcal{Z} \subseteq \mathbb{R}^{n}$ and a stochastic sketch matrix $S \in \mathbb{R}^{m \times n}$, the $S$-Gaussian width of $\mathcal{Z}$, denoted by $\mathcal{W}_{S}(\mathcal{Z})$, is defined as
$$\mathcal{W}_{S}(\mathcal{Z}) = \mathbb{E}_{g}\Big[\sup_{z \in \mathcal{Z}} \langle g, Sz\rangle\Big],$$
where $g \sim N(0, I_{m})$.

Definition 6 (Rademacher Width). Given a closed set $\mathcal{Z} \subseteq \mathbb{R}^{n}$, the Rademacher width of $\mathcal{Z}$, denoted by $\mathcal{R}(\mathcal{Z})$, is defined as
$$\mathcal{R}(\mathcal{Z}) = \mathbb{E}_{\varepsilon}\Big[\sup_{z \in \mathcal{Z}} \langle \varepsilon, z\rangle\Big],$$
where $\varepsilon \in \{-1, +1\}^{n}$ is an i.i.d. vector of Rademacher variables.
Next, we still define $\mathcal{W} = \mathcal{W}(X\mathcal{K} \cap \mathbb{S}^{n-1})$ and state the following theorem for guaranteeing the Walsh-Hadamard sketch:

Theorem 2. Let $S \in \mathbb{R}^{m \times n}$ be a stochastic Walsh-Hadamard sketch matrix, and let $c_{0}$, $c_{1}$ and $c_{2}$ be universal constants. Given any $\epsilon \in (0, 1)$ and a sufficiently large sketch size $m$ (which, compared with Theorem 1, involves an additional term depending on $n$), then with probability at least $1 - c_{1}\exp(-c_{2}\, m\, \epsilon^{2})$, $\widetilde{W}$ is an $\epsilon$-optimality approximation solution.
Remark. An additional term appears in the sketch size, so the required sketch size for the Walsh-Hadamard sketch is larger than that required for the Subgaussian sketch. However, the potentially larger sketch size is offset by the much lower cost of matrix multiplications with the stochastic Walsh-Hadamard sketch matrix. Theorem 2 guarantees that the stochastic Walsh-Hadamard sketch method is also able to construct a small sketch of the full data set for the fast optimization of problem 1, while preserving the optimality of the solution.
III-C Generalization Error Bound
This subsection provides the generalization error bound of the multilabel classification problem using our two proposed stochastic sketch models. Because our results apply to both models, we simply call our stochastic sketch model SS. Assume our model is characterized by a distribution $\mathcal{D}$ on the space of inputs and labels $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{Y} = \{0, 1\}^{L}$. Let a sample $(\mathbf{x}, \mathbf{y})$ be drawn from the distribution $\mathcal{D}$, where $\mathbf{y}$ is the ground-truth label vector. Assume samples are drawn i.i.d. $n$ times from the distribution $\mathcal{D}$, which is denoted by $T \sim \mathcal{D}^{n}$. For two inputs in $\mathcal{X}$, we define $\rho$ as the Euclidean metric in the original input space and $\rho_{S}$ as the metric in the embedding input space. Let $h_{j}(\mathbf{x})$ represent the prediction of the $j$th label for input $\mathbf{x}$ using our model SS-$k$NN, which is trained on $T$. The performance of SS-$k$NN is then measured in terms of its generalization error, which is its expected loss on a new example drawn according to $\mathcal{D}$:
$$L_{\mathcal{D}}(h) = \sum_{j=1}^{L} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}}\big[\ell_{j}\big(h_{j}(\mathbf{x}), y_{j}\big)\big], \qquad (3)$$
where $y_{j}$ means the $j$th label and $\ell_{j}$ represents the loss function for the $j$th label. We define the loss function as follows for the analysis:
$$\ell_{j}\big(h_{j}(\mathbf{x}), y_{j}\big) = \mathbb{1}\big[h_{j}(\mathbf{x}) \neq y_{j}\big]. \qquad (4)$$
For the $j$th label, we define the conditional probability function $\eta_{j}$ as follows:
$$\eta_{j}(\mathbf{x}) = \mathbb{P}\big[y_{j} = 1 \mid \mathbf{x}\big]. \qquad (5)$$
The Bayes optimal classifier for the $j$th label is defined as
$$h_{j}^{*}(\mathbf{x}) = \mathbb{1}\big[\eta_{j}(\mathbf{x}) > 1/2\big]. \qquad (6)$$
Before deriving our results, we first present several important definitions and theorems.
Definition 7 (Covering Numbers, [25]). Let $(\mathcal{X}, \rho)$ be a metric space, $\mathcal{Z}$ be a subset of $\mathcal{X}$ and $\epsilon > 0$. A set $C \subseteq \mathcal{X}$ is an $\epsilon$-cover for $\mathcal{Z}$ if, for every $z \in \mathcal{Z}$, there exists $c \in C$ such that $\rho(z, c) \le \epsilon$. The covering number of $\mathcal{Z}$, $N(\epsilon, \mathcal{Z})$, is the minimal cardinality of an $\epsilon$-cover for $\mathcal{Z}$ (if there is no such finite cover, then it is defined as $\infty$).
Definition 8 (Doubling Dimension, [26]). Let $(\mathcal{X}, \rho)$ be a metric space, and let $\lambda$ be the smallest value such that every ball in $\mathcal{X}$ can be covered by $\lambda$ balls of half the radius. The doubling dimension of $\mathcal{X}$ is defined as $\mathrm{ddim}(\mathcal{X}) := \log_{2} \lambda$.
Theorem 3 ([26]). Let $(\mathcal{X}, \rho)$ be a metric space. The diameter of $\mathcal{X}$ is defined as $\mathrm{diam}(\mathcal{X}) = \sup_{x, x' \in \mathcal{X}} \rho(x, x')$. The covering number of $\mathcal{X}$, $N(\epsilon, \mathcal{X})$, is bounded by:
$$N(\epsilon, \mathcal{X}) \le \left(\frac{2\,\mathrm{diam}(\mathcal{X})}{\epsilon}\right)^{\mathrm{ddim}(\mathcal{X})}. \qquad (7)$$
TABLE III: Training time comparison (lower is better). For SS+GAU and SS+WH, the three columns correspond to increasing sketch sizes.

| Data Set  | BR+LIB   | BR+NN  | FastXML | SLEEC    | SS+GAU |        |        | SS+WH  |        |        |
|-----------|----------|--------|---------|----------|--------|--------|--------|--------|--------|--------|
| corel5k   | 7.198    | 0.678  | 4.941   | 736.670  | 0.196  | 0.218  | 0.366  | 0.119  | 0.197  | 0.239  |
| nus(vlad) | 222.21   | 179.04 | 715.86  | 9723.49  | 25.29  | 51.68  | 93.97  | 11.87  | 20.22  | 33.04  |
| nus(bow)  | 511.83   | 351.64 | 1162.53 | 11391.54 | 52.05  | 72.65  | 120.37 | 25.41  | 34.32  | 48.85  |
| rcv1x     | 22607.53 | 353.42 | 1116.05 | 78441.93 | 72.53  | 114.55 | 144.17 | 48.88  | 55.94  | 72.22  |
We provide the following generalization error bound for SS-1NN:

Theorem 4. Given a metric space $(\mathcal{X}, \rho_{S})$, assume that the function $\eta_{j}$ is Lipschitz with constant $c_{j}$ with respect to the sup-norm for each label. Suppose $\mathcal{X}$ has a finite doubling dimension $\mathrm{ddim}(\mathcal{X})$ and a finite diameter $\mathrm{diam}(\mathcal{X})$. Let the training set $T$ and a test example $(\mathbf{x}, \mathbf{y})$ be drawn i.i.d. from the distribution $\mathcal{D}$. Then, we have
(8)
Inspired by Theorem 19.5 in [27], we derive the following lemma for SS-$k$NN:

Lemma 1. Given a metric space $(\mathcal{X}, \rho_{S})$, assume that the function $\eta_{j}$ is Lipschitz with constant $c_{j}$ with respect to the sup-norm for each label. Suppose $\mathcal{X}$ has a finite doubling dimension $\mathrm{ddim}(\mathcal{X})$ and a finite diameter $\mathrm{diam}(\mathcal{X})$. Let the training set $T$ and a test example $(\mathbf{x}, \mathbf{y})$ be drawn i.i.d. from the distribution $\mathcal{D}$. Then, we have
(9)
The following corollary reveals important statistical properties of SS-1NN and SS-$k$NN.

Corollary 1. As $n$ goes to infinity, the error of SS-1NN converges to the sum over the labels of twice the Bayes error, and the error of SS-$k$NN converges to the sum over the labels of $\big(1 + \sqrt{8/k}\big)$ times the Bayes error.
IV Experiments
IV-A Data Sets and Baselines
We abbreviate our proposed stochastic Subgaussian sketch and stochastic Walsh-Hadamard sketch as SS+GAU and SS+WH, respectively. In the experiments, we set the entries of the Subgaussian sketch matrix as i.i.d. standard Gaussian variables. This section evaluates the performance of the proposed methods on four data sets: corel5k, nus(vlad), nus(bow) and rcv1x. The statistics of these data sets are available at http://mulan.sourceforge.net. We compare SS+GAU and SS+WH with several state-of-the-art methods, as follows.

BR [5]: We implement two base classifiers for BR. The first uses the linear classification/regression package LIBLINEAR [28] with regularized squared hinge loss as the base classifier; we call this baseline BR+LIB. The second uses NN as the base classifier; we call this baseline BR+NN and count the NN search time as the training time.

FastXML [1]: An advanced tree-based multilabel classifier.

SLEEC [12]: A state-of-the-art embedding method based on sparse local embeddings for large-scale multilabel classification. We use the FastXML and SLEEC solvers provided by the respective authors with default parameters.
Following settings similar to [29] and [12], we fix the number of neighbors used for the NN search in all NN-based methods. The sketch size is varied over a range of values; three settings per method are reported in the tables. Following [7], [11] and [30], we use the Hamming Loss and Example-F1 measures to evaluate the prediction performance of all methods. The smaller the Hamming Loss, the better the performance, while the larger the Example-F1, the better the performance.
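For reference, the two evaluation measures can be computed as follows. This is a minimal sketch using their standard definitions; the paper's exact conventions (e.g., for instances with empty label sets) are assumptions here.

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Average fraction of label positions that are predicted incorrectly (lower is better)."""
    return np.mean(Y_true != Y_pred)

def example_f1(Y_true, Y_pred, eps=1e-12):
    """Example-based F1: compute F1 per instance over its label vector, then average (higher is better)."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=1)
    precision = tp / (np.sum(Y_pred == 1, axis=1) + eps)
    recall = tp / (np.sum(Y_true == 1, axis=1) + eps)
    return np.mean(2 * precision * recall / (precision + recall + eps))
```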
IV-B Results
Figure 1 shows that as the sketch size increases, the training time of SS+GAU and SS+WH rises, while their prediction performance improves. The results verify our theoretical analysis. The Hamming Loss, Example-F1 and training time comparisons of the various methods on the corel5k, nus(vlad), nus(bow) and rcv1x data sets are shown in Table I, Table II and Table III, respectively. From Tables I, II and III, we can see that:

Because we perform the optimization only on a small sketch of the full data set, our proposed methods are significantly faster than BR and the state-of-the-art embedding approaches. Moreover, we can maintain competitive prediction performance by setting an appropriate sketch size. The empirical results support our theoretical studies.
V Conclusion
This paper carefully constructs stochastic Subgaussian and Walsh-Hadamard sketches for multilabel classification. From an algorithmic perspective, we show that we can obtain solutions that are approximately as good as the exact solution of BR. From a statistical learning perspective, we also provide the generalization error bound of multilabel classification using our proposed stochastic sketch model. Lastly, our empirical studies corroborate our theoretical findings and demonstrate the superiority of the proposed methods.
Supplementary: The Proof of Important Theorems and Lemmas
A Proof of Theorem 1
Lemma 1.
Let be a stochastic Subgaussian sketch matrix. Then there are universal constants and such that for any subset , any and , we have
(10) 
with probability at least , and we have
(11) 
with probability at least , where .
Theorem 1. Let $S \in \mathbb{R}^{m \times n}$ be a stochastic Subgaussian sketch matrix, and let $c_{0}$, $c_{1}$ and $c_{2}$ be universal constants. Given any $\epsilon \in (0, 1)$ and any sketch size $m \ge \frac{c_{0}\,\mathcal{W}^{2}}{\epsilon^{2}}$, then with probability at least $1 - c_{1}\exp(-c_{2}\, m\, \epsilon^{2})$, $\widetilde{W}$ is an $\epsilon$-optimality approximation solution.
Proof.
Let , and be columns of matrix , and , respectively. Then and can be decomposed into and . Next, we study the relationship between and . We define . According to Definition 3 in the main paper, we know that belongs to the tangent cone of at .
Because , we have . Then, we get:
(12) 
As , we have . Then, we get and
(13) 
We derive the following:
By using Lemma 1, with probability at least , we have
where . Given , we have . For the sake of clarity, we define and , and then substitute them into the above expression; then, with probability at least , we have
(14) 
Clearly, we have . By using Lemma 1, with probability at least , we have
(15) 
By using Lemma 1, with probability at least , we have . By using Eq.(13), with probability at least , we have
(16) 
Eq.(14), Eq.(15) and Eq.(16) imply that, with probability at least , we have
(17) 
By setting , with probability at least , we have
(18) 
Eq.(18) implies that, with probability at least , we have
(19) 
By using Eq.(12) and Lemma 1 again, with probability at least , we have
We define and , and then substitute them into the above expression; then, with probability at least , we have
(20) 
By using Lemma 1 again, with probability at least , we have
(21) 
Similar to Eq.(16), by using Eq.(13) and Lemma 1, with probability at least , we have
(22) 
Eq.(20), Eq.(21) and Eq.(22) imply that, with probability at least , we have
(23) 
By setting , with probability at least , we have