# Fast Multi-label Learning

Embedding approaches have become one of the most pervasive techniques for multi-label classification. However, the training process of embedding methods usually involves a complex quadratic or semidefinite programming problem, or the model may even involve an NP-hard problem. Thus, such methods are prohibitive on large-scale applications. More importantly, much of the literature has already shown that the binary relevance (BR) method is usually good enough for some applications. Unfortunately, BR runs slowly due to its linear dependence on the size of the input data. The goal of this paper is to provide a simple method, yet with provable guarantees, which can achieve competitive performance without a complex training process. To achieve our goal, we provide a simple stochastic sketch strategy for multi-label classification and present theoretical results from both algorithmic and statistical learning perspectives. Our comprehensive empirical studies corroborate our theoretical findings and demonstrate the superiority of the proposed methods.


## I Introduction

Multi-label classification [1, 2, 3, 4], in which each instance can belong to multiple labels simultaneously, has attracted significant attention from researchers owing to its wide range of applications, from document classification and automatic image annotation to video annotation. For example, in automatic image annotation, one needs to automatically predict relevant keywords, such as beach, sky and tree, to describe a natural scene image. When classifying documents, one may need to classify them into different groups, such as Science, Finance and Sports. In video annotation, labels such as Government, Policy and Election may be needed to describe the subject of the video.

A popular strategy in multi-label learning is binary relevance (BR) [5], which trains a linear regression model for each label independently. Recently, some sophisticated models have been developed to improve the performance of BR. For example, embedding approaches [6, 7, 8, 9, 10] have become popular techniques. Even though embedding methods improve the prediction performance of BR to some extent, their training process usually involves a complex quadratic or semidefinite programming problem, as in [11], or their model may involve an NP-hard problem, as in [8] and [12]. Thus, these kinds of methods are prohibitive on large-scale applications. Much of the literature, such as [13], [14] and [15], has already shown that BR with an appropriate base learner is usually good enough for some applications, such as document classification [15]. Unfortunately, BR runs slowly due to its linear dependence on the size of the input data. The question is how to overcome these computational obstacles yet obtain results comparable with BR.

To address the above problem, we provide a simple stochastic sketch strategy for multi-label classification. In particular, we carefully construct a small sketch of the full data set, and then use that sketch as a surrogate to perform fast optimization. This paper first introduces the stochastic $\sigma$-subgaussian sketch, and then proposes the construction of a sketch matrix based on the Walsh-Hadamard matrix to reduce the expensive matrix multiplications of the $\sigma$-subgaussian sketch. From an algorithmic perspective, we provide provable guarantees that our proposed methods are approximately as good as the exact solution of BR. From a statistical learning perspective, we provide the generalization error bound of multi-label classification using our proposed stochastic sketch model.

Experiments on various real-world data sets demonstrate the superiority of the proposed methods. The results verify our theoretical findings. We organize this paper as follows. The second section introduces our proposed stochastic sketch for multi-label classification. The third section provides the provable guarantees for our algorithm from both algorithmic and statistical learning perspectives, and experimental results are presented in the fourth section. The last section provides our conclusions.

## II Stochastic Sketch for Multi-label Classification

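Assume that $x_i \in \mathbb{R}^{p \times 1}$ is a real vector representing an input (instance), and $y_i \in \{0,1\}^{q \times 1}$ is a real vector representing the corresponding output $(i \in \{1, \dots, n\})$. $n$ denotes the number of training samples. The input matrix is $X \in \mathbb{R}^{n \times p}$ and the output matrix is $Y \in \{0,1\}^{n \times q}$. $\langle \cdot, \cdot \rangle$ and $I_{n \times n}$ represent the inner product and the identity matrix, respectively. We denote the transpose of the vector/matrix by the superscript $'$ and the logarithms to base 2 by $\log$. Let $\|\cdot\|_2$ and $\|\cdot\|_F$ represent the $\ell_2$ norm and Frobenius norm, respectively. Let $V \in \mathbb{R}^{p \times q}$ be the regressors and $N(0,1)$ denote the standard Gaussian distribution.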

A simple linear regression model for BR [5] learns the matrix through the following formulation:

$$\min_{V \in \mathbb{R}^{p \times q}} \frac{1}{2}\|XV - Y\|_F^2 \tag{1}$$

Solving this least-squares problem exactly has a cost that grows with the number of training samples $n$ [16], so the computational cost of an exact solution for problem 1 will be prohibitive in large-scale settings. To solve this problem, we construct a small sketch of the full data set by stochastic projection methods, and then use that sketch as a surrogate to perform fast optimization for problem 1. Specifically, we define a sketch matrix $S \in \mathbb{R}^{m \times n}$ with $S \neq \mathbf{0}$, where $m$ is the projection dimension and $\mathbf{0}$ is the zero matrix. The input matrix $X$ and output matrix $Y$ are approximated by their sketched matrices $SX$ and $SY$, respectively. We aim to solve the following sketched version of problem 1.

$$\min_{V \in \mathbb{R}^{p \times q}} \frac{1}{2}\|SXV - SY\|_F^2 \tag{2}$$
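To make the relationship between problems 1 and 2 concrete, the following is a minimal NumPy sketch under stated assumptions: the data are synthetic, the sketch has i.i.d. Gaussian entries, and the $1/\sqrt{m}$ scaling (chosen so that $E[S'S] = I$) is a normalization convention assumed here rather than taken from the text. The helper names `solve_br`, `gaussian_sketch` and `solve_sketched_br` are hypothetical.

```python
import numpy as np


def solve_br(X, Y):
    """Exact least-squares solution of min_V 0.5 * ||XV - Y||_F^2 (problem 1)."""
    V, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return V


def gaussian_sketch(m, n, rng):
    """m x n sketch with i.i.d. N(0, 1/m) entries (assumed scaling, E[S'S] = I)."""
    return rng.standard_normal((m, n)) / np.sqrt(m)


def solve_sketched_br(X, Y, m, rng):
    """Least-squares solution of the sketched problem min_V 0.5 * ||SXV - SY||_F^2."""
    S = gaussian_sketch(m, X.shape[0], rng)
    V_hat, *_ = np.linalg.lstsq(S @ X, S @ Y, rcond=None)
    return V_hat, S


# Synthetic multi-label data: n samples, p features, q binary labels.
rng = np.random.default_rng(0)
n, p, q, m = 2000, 20, 5, 400
X = rng.standard_normal((n, p))
V_true = rng.standard_normal((p, q))
Y = (X @ V_true + rng.standard_normal((n, q)) > 0).astype(float)

V_star = solve_br(X, Y)
V_hat, S = solve_sketched_br(X, Y, m, rng)

# Evaluate both solutions on the ORIGINAL objective.
f_star = np.linalg.norm(X @ V_star - Y, "fro") ** 2
f_hat = np.linalg.norm(X @ V_hat - Y, "fro") ** 2
print(f"exact objective: {f_star:.2f}, sketched solution objective: {f_hat:.2f}")
```

The sketched solve works on an $m \times p$ system instead of an $n \times p$ one; forming $SX$ with a dense Gaussian sketch is itself a dense matrix product, which is the cost the Walsh-Hadamard construction is meant to reduce.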

Motivated by [17, 18, 12], we use a $k$-nearest neighbor ($k$NN) classifier in the embedding space for prediction, instead of using an expensive decoding process [11]. Next, we introduce two kinds of stochastic sketch methods.
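The nearest-neighbor prediction step can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the embedding of an input is taken to be its image under the learned regressors (rows of `Z_train` and `Z_test`), each label is decided by a majority vote among the $k$ nearest training points, and ties are resolved in favor of the label being present. The function name `knn_predict` is hypothetical.

```python
import numpy as np


def knn_predict(Z_train, Y_train, Z_test, k):
    """Predict each label by majority vote among the k nearest training points.

    Z_train: (n_train, d) embedded training inputs
    Y_train: (n_train, q) binary label matrix
    Z_test:  (n_test, d) embedded test inputs
    """
    # Squared Euclidean distances between every test and training point.
    d2 = (
        np.sum(Z_test ** 2, axis=1)[:, None]
        - 2.0 * Z_test @ Z_train.T
        + np.sum(Z_train ** 2, axis=1)[None, :]
    )
    # Indices of the k smallest distances per test point (order within the k is arbitrary).
    nn = np.argpartition(d2, k - 1, axis=1)[:, :k]
    # Fraction of neighbors carrying each label; ties (exactly 0.5) count as present.
    votes = Y_train[nn].mean(axis=1)
    return (votes >= 0.5).astype(int)


Z_train = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Y_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
Z_test = np.array([[0.0, 0.4], [10.0, 10.2]])
Y_pred = knn_predict(Z_train, Y_train, Z_test, k=2)
print(Y_pred)
```

In this toy example each test point sits inside one of two well-separated clusters, so it inherits the labels of that cluster.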

### II-A Stochastic σ-Subgaussian Sketch

The entries of a sketch matrix can be simply defined as i.i.d. random variables from certain distributions, such as the Gaussian distribution and the Bernoulli distribution. [19] has already shown that each of these distributions is a special case of the Subgaussian distribution, which is defined as follows:

###### Definition 1 (σ-Subgaussian).

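A row $s_i \in \mathbb{R}^n$ of the sketch matrix $S$ is $\sigma$-Subgaussian if it has zero mean and, for any vector $\zeta \in \mathbb{R}^n$ and $\epsilon \geq 0$, we have

$$P\left(|\langle s_i, \zeta \rangle| \geq \epsilon \|\zeta\|_2\right) \leq 2e^{-\frac{n\epsilon^2}{2\sigma^2}}$$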

Clearly, a vector with i.i.d. standard Gaussian entries or Bernoulli entries is 1-Subgaussian. We refer to any matrix $S \in \mathbb{R}^{m \times n}$ as a Subgaussian sketch if its rows are zero mean, 1-Subgaussian, and have covariance matrix $I_{n \times n}$. A Subgaussian sketch is straightforward to construct. However, given the Subgaussian sketch $S$, the cost of computing $SX$ and $SY$ is $O(mnp)$ and $O(mnq)$, respectively. Next, we introduce the following technique to reduce this time complexity.

Inspired by [20], we propose to construct the sketch matrix based on the Walsh-Hadamard matrix to reduce the expensive matrix multiplications of the Subgaussian sketch. Formally, a stochastic Walsh-Hadamard sketch matrix $S \in \mathbb{R}^{m \times n}$ is obtained with i.i.d. rows of the form:

$$s_i = \sqrt{n}\, e_i H R, \quad i = 1, \cdots, m$$

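where $\{e_1, \cdots, e_m\}$ is a random subset of $m$ rows uniformly sampled from $I_{n \times n}$, $R \in \mathbb{R}^{n \times n}$ is a random diagonal matrix whose entries are i.i.d. Rademacher variables, and $H \in \mathbb{R}^{n \times n}$ constitutes a Walsh-Hadamard matrix defined as:

$$H_{ij} = (-1)^{\langle B(i-1),\, B(j-1) \rangle}, \quad i, j = 1, \cdots, n$$

where $B(i-1)$ and $B(j-1)$ represent the $\log n$-bit binary expressions of $i-1$ and $j-1$ (assume $n$ is a power of 2).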

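Then, we can employ the fast Walsh-Hadamard transform [21] to compute $SX$ and $SY$ without forming $S$ explicitly, at a cost per column that is log-linear rather than quadratic in $n$.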
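The construction above can be sketched in NumPy as follows. This is a minimal illustration under assumptions: `fwht` is a plain butterfly implementation of the unnormalized transform (entries of $H$ are $\pm 1$), rows are sampled with replacement to mimic i.i.d. rows, and the final $1/\sqrt{m}$ scaling is a normalization convention assumed here so that $E[S'S] = I$; it may differ from the paper's $\sqrt{n}$ factor by the normalization of $H$. The function names are hypothetical.

```python
import numpy as np


def fwht(A):
    """Unnormalized fast Walsh-Hadamard transform along axis 0.

    Returns H @ A for the n x n Hadamard matrix with +/-1 entries,
    where n = A.shape[0] must be a power of 2.
    """
    A = np.array(A, dtype=float)  # work on a copy
    n = A.shape[0]
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("number of rows must be a power of 2")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            top = A[start:start + h].copy()
            bottom = A[start + h:start + 2 * h].copy()
            A[start:start + h] = top + bottom
            A[start + h:start + 2 * h] = top - bottom
        h *= 2
    return A


def walsh_hadamard_sketch(A, m, rng):
    """Apply a subsampled randomized Hadamard sketch to the 2-D array A (n x d)."""
    n = A.shape[0]
    signs = rng.choice([-1.0, 1.0], size=n)      # diagonal of R (Rademacher signs)
    rows = rng.choice(n, size=m, replace=True)   # m uniformly sampled rows
    HRA = fwht(signs[:, None] * A)               # H R A via the fast transform
    return HRA[rows] / np.sqrt(m)                # assumed scaling so that E[S'S] = I


# Check the fast transform against the explicit bitwise definition of H.
n = 8
H = np.array(
    [[(-1) ** bin(i & j).count("1") for j in range(n)] for i in range(n)],
    dtype=float,
)
A = np.arange(16.0).reshape(n, 2)
SA = walsh_hadamard_sketch(A, 4, np.random.default_rng(0))
print(np.allclose(fwht(A), H @ A), SA.shape)
```

With zero-based indices, the exponent $\langle B(i-1), B(j-1) \rangle$ is the number of bit positions where both indices have a 1, which is what `bin(i & j).count("1")` computes.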

## III Main Results

Since we address problem 2 rather than directly solving problem 1, which has great advantages for fast optimization, it is natural to ask: what is the relationship between problem 2 and problem 1? Let $V^*$ and $\hat{V}$ be the optimal solutions of problem 1 and problem 2, respectively. We define $f(V^*) = \|XV^* - Y\|_F^2$ and $g(\hat{V}) = \|SX\hat{V} - SY\|_F^2$. We will prove that we can choose an appropriate $m$ such that the two optimal objectives $f(V^*)$ and $g(\hat{V})$ are approximately the same. This means that we can speed up the computation of problem 1 without sacrificing too much accuracy. Furthermore, we provide the generalization error bound of the multi-label classification problem using our proposed stochastic sketch model. To measure the quality of approximation, we first define the $\delta$-optimality approximation as follows:

###### Definition 2 (δ-Optimality Approximation).

Given $\delta \in (0,1)$, $\hat{V}$ is a $\delta$-optimality approximation solution if

$$(1-\delta) f(V^*) \leq g(\hat{V}) \leq (1+\delta) f(V^*)$$

According to the properties of Matrix norm, we have , so is proportional to . Therefore, the closeness of and implies the closeness of and .
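Definition 2 is a two-sided relative-error condition, which the following small helper makes explicit. It is a sketch under the assumption that `f_star` holds the optimal objective value of problem 1 and `g_hat` the optimal objective value of problem 2; both function names are hypothetical.

```python
def is_delta_optimal(f_star, g_hat, delta):
    """Check (1 - delta) * f_star <= g_hat <= (1 + delta) * f_star."""
    return (1.0 - delta) * f_star <= g_hat <= (1.0 + delta) * f_star


def smallest_delta(f_star, g_hat):
    """Smallest delta for which the two-sided bound holds (requires f_star > 0)."""
    return abs(g_hat / f_star - 1.0)


print(is_delta_optimal(100.0, 104.0, 0.05))  # within 5% of the exact objective
print(is_delta_optimal(100.0, 110.0, 0.05))  # 10% off, so the bound fails
print(smallest_delta(100.0, 104.0))
```

In an experiment, `smallest_delta` gives a single number that can be tracked as the sketch size grows.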

### III-A σ-Subgaussian Sketch Guarantee

We first introduce the tangent cone, which is used by [22]:

###### Definition 3 (Tangent Cone).

Given a set $\mathcal{C} \subseteq \mathbb{R}^p$ and $x^* \in \mathcal{C}$, the tangent cone of $\mathcal{C}$ at $x^*$ is defined as $\mathcal{K} = \text{clconv}\{\Delta \in \mathbb{R}^p \mid \Delta = t(x - x^*)$ for some $t \geq 0$ and $x \in \mathcal{C}\}$, where clconv denotes the closed convex hull.

The tangent cone arises naturally in the convex optimality conditions: any $\Delta \in \mathcal{K}$ defines a feasible direction at the optimal $x^*$, and optimality means that it is impossible to decrease the objective function by moving in directions belonging to the tangent cone. Then, we introduce the Gaussian width, which is an important complexity measure used by [23]:

###### Definition 4 (Gaussian Width).

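Given a closed set $\mathbb{Y} \subseteq \mathbb{R}^n$, the Gaussian width of $\mathbb{Y}$, denoted by $\omega(\mathbb{Y})$, is defined as:

$$\omega(\mathbb{Y}) = \mathbb{E}_g\left[\sup_{z \in \mathbb{Y}} |\langle g, z \rangle|\right]$$

where $g \sim N(0, I_{n \times n})$.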
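For intuition, the Gaussian width of a finite set of points can be estimated by Monte Carlo, as in the sketch below. The sets that appear in the theorems are generally infinite, so a finite sample of such a set only yields a lower estimate; the function name `gaussian_width` and the two test sets are illustrative assumptions.

```python
import numpy as np


def gaussian_width(points, num_draws, rng):
    """Monte Carlo estimate of E_g[sup_z |<g, z>|] over the rows of `points`."""
    dim = points.shape[1]
    G = rng.standard_normal((num_draws, dim))  # one standard Gaussian vector per row
    return float(np.abs(G @ points.T).max(axis=1).mean())


rng = np.random.default_rng(0)

# A single unit vector: the width is E|g_1| = sqrt(2 / pi).
single = np.array([[1.0, 0.0, 0.0]])
w_single = gaussian_width(single, 200_000, rng)

# The standard basis of R^50: the width is E[max_i |g_i|] <= sqrt(2 * log(2 * 50)).
basis = np.eye(50)
w_basis = gaussian_width(basis, 20_000, rng)

print(w_single, np.sqrt(2 / np.pi))
print(w_basis, np.sqrt(2 * np.log(100)))
```

The second example shows the typical behavior: the width of a set of $N$ unit vectors grows only like $\sqrt{\log N}$, which is why it is a useful measure of the effective size of a set.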

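This complexity measure plays an important role in learning theory and statistics [24]. Let $\mathcal{S}^{n-1}$ be the Euclidean sphere in $\mathbb{R}^n$, and let $\mathcal{K}$ denote the tangent cone of Definition 3. $X\mathcal{K}$ represents the linearly transformed cone $\{X\Delta \mid \Delta \in \mathcal{K}\}$, and we use the Gaussian width to measure the width of the intersection of $X\mathcal{K}$ and $\mathcal{S}^{n-1}$. This paper defines $\mathbb{Y} = X\mathcal{K} \cap \mathcal{S}^{n-1}$. We state the following theorem for guaranteeing the $\sigma$-Subgaussian sketch: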

###### Theorem 1.

Let be a stochastic -Subgaussian sketch matrix, and be universal constants. Given any and

, then with probability at least

, is a -optimality approximation solution.

The proof sketch of this theorem can be found in the supplementary material.

Remark. Theorem 1 guarantees that the stochastic $\sigma$-Subgaussian sketch method is able to construct a small sketch of the full data set for the fast optimization of problem 1, while preserving the $\delta$-optimality of the solution.

We generalize the concept of Gaussian width to two additional measures, the $S$-Gaussian width and the Rademacher width:

###### Definition 5 (S-Gaussian Width).

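Given a closed set $\mathbb{Y} \subseteq \mathbb{R}^n$ and a stochastic sketch matrix $S \in \mathbb{R}^{m \times n}$, the $S$-Gaussian width of $\mathbb{Y}$, denoted by $\omega_S(\mathbb{Y})$, is defined as:

$$\omega_S(\mathbb{Y}) = \mathbb{E}_{g,S}\left[\sup_{z \in \mathbb{Y}} \left|\left\langle g, \frac{Sz}{\sqrt{m}} \right\rangle\right|\right]$$

where $g \sim N(0, I_{m \times m})$.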

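###### Definition 6 (Rademacher Width).

Given a closed set $\mathbb{Y} \subseteq \mathbb{R}^n$, the Rademacher width of $\mathbb{Y}$, denoted by $\Upsilon(\mathbb{Y})$, is defined as:

$$\Upsilon(\mathbb{Y}) = \mathbb{E}_{\varpi}\left[\sup_{z \in \mathbb{Y}} |\langle \varpi, z \rangle|\right]$$

where $\varpi \in \{-1, +1\}^n$ is an i.i.d. vector of Rademacher variables.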

Next, we still define and state the following theorem for guaranteeing the Walsh-Hadamard sketch:

###### Theorem 2.

Let be a stochastic Walsh-Hadamard sketch matrix, , and be universal constants. Given any and , then with probability at least , is a -optimality approximation solution.

Remark. An additional term appears in the sketch size, so the required sketch size for the Walsh-Hadamard sketch is larger than that required for the $\sigma$-Subgaussian sketch. However, the potentially larger sketch size is offset by the much lower cost of matrix multiplications via the stochastic Walsh-Hadamard sketch matrix. Theorem 2 guarantees that the stochastic Walsh-Hadamard sketch method is also able to construct a small sketch of the full data set for the fast optimization of problem 1, while preserving the $\delta$-optimality of the solution.

### III-C Generalization Error Bound

This subsection provides the generalization error bound of the multi-label classification problem using our proposed two stochastic sketch models. Because our results can be applied to two models, we simply call our stochastic sketch models SS. Assume our model is characterized by a distribution on the space of inputs and labels , where . Let a sample be drawn i.i.d. from the distribution , where are the ground truth label vectors. Assume samples are drawn i.i.d. times from the distribution , which is denoted by . For two inputs in , we define as the Euclidean metric in the original input space and as the metric in the embedding input space. Let represent the prediction of the -th label for input using our model SS-NN, which is trained on . The performance of SS-NN: is then measured in terms of its generalization error, which is its expected loss on a new example drawn according to :

 (3)

where $y_i$ means the $i$-th label and $\ell(\cdot, \cdot)$ represents the loss function for the $i$-th label. We define the loss function as follows for the analysis.

$$\ell\left(y_i, h^{D}_{k\text{nn}_i}(x)\right) = P\left(y_i \neq h^{D}_{k\text{nn}_i}(x)\right) \tag{4}$$

For the $i$-th label, we define the function $\nu^i_j(x)$ as follows:

$$\nu^i_j(x) = P(y_i = j \mid x), \quad j \in \{0, 1\}. \tag{5}$$

The Bayes optimal classifier for the $i$-th label is defined as

$$b^*_i(x) = \arg\max_{j \in \{0,1\}} \nu^i_j(x) \tag{6}$$

Before deriving our results, we first present several important definitions and theorems.

###### Definition 7 (Covering Numbers, [25]).

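Let $(\mathcal{X}, d)$ be a metric space, $A$ be a subset of $\mathcal{X}$ and $\varepsilon > 0$. A set $A' \subseteq \mathcal{X}$ is an $\varepsilon$-cover for $A$ if, for every $a \in A$, there exists $a' \in A'$ such that $d(a, a') \leq \varepsilon$. The $\varepsilon$-covering number of $A$, $N(\varepsilon, A, d)$, is the minimal cardinality of an $\varepsilon$-cover for $A$ (if there is no such finite cover then it is defined as $\infty$).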

###### Definition 8 (Doubling Dimension, [26]).

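Let $(\mathcal{X}, d)$ be a metric space, and let $\bar{\lambda}$ be the smallest value such that every ball in $\mathcal{X}$ can be covered by $\bar{\lambda}$ balls of half the radius. The doubling dimension of $\mathcal{X}$ is defined as: $ddim(\mathcal{X}) = \log \bar{\lambda}$.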

###### Theorem 3 ([26]).

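Let $(\mathcal{X}, d)$ be a metric space. The diameter of $\mathcal{X}$ is defined as $diam(\mathcal{X}) = \sup_{x, x' \in \mathcal{X}} d(x, x')$. The $\varepsilon$-covering number of $\mathcal{X}$, $N(\varepsilon, \mathcal{X}, d)$, is bounded by:

$$N(\varepsilon, \mathcal{X}, d) \leq \left(\frac{2\, diam(\mathcal{X})}{\varepsilon}\right)^{ddim(\mathcal{X})} \tag{7}$$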
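As a sanity check on the bound in Eq. (7), the sketch below builds an $\varepsilon$-cover of a finite point set greedily and compares its size with the bound. The greedy cover is a valid cover but not necessarily a minimal one, so its size only upper-bounds the covering number. The example uses integer points on a line and takes the doubling dimension of the real line (1) as an assumption; both function names are hypothetical.

```python
import numpy as np


def covering_number_bound(diam, eps, ddim):
    """Right-hand side of Eq. (7): (2 * diam / eps) ** ddim."""
    return (2.0 * diam / eps) ** ddim


def greedy_cover(points, eps):
    """Return indices of centers forming an eps-cover of `points` (Euclidean metric)."""
    points = np.asarray(points, dtype=float)
    if points.ndim == 1:
        points = points[:, None]
    uncovered = np.ones(len(points), dtype=bool)
    centers = []
    while uncovered.any():
        idx = int(np.argmax(uncovered))  # first point not yet covered
        centers.append(idx)
        dist = np.linalg.norm(points - points[idx], axis=1)
        uncovered &= dist > eps          # everything within eps is now covered
    return centers


points = np.arange(0, 101)               # 0, 1, ..., 100 on the real line
eps = 10.0
centers = greedy_cover(points, eps)
bound = covering_number_bound(diam=100.0, eps=eps, ddim=1)
print(len(centers), bound)
```

Here the greedy procedure picks the centers 0, 11, 22, ..., 99, which stays well within the bound of 20.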

We provide the following generalization error bound for SS-1NN:

###### Theorem 4.

Given a metric space , assume function is Lipschitz with constant with respect to the sup-norm for each label. Suppose has a finite doubling dimension: and . Let and be drawn i.i.d. from the distribution . Then, we have

 (8)

Inspired by Theorem 19.5 in [27], we derive the following lemma for SS-$k$NN:

###### Lemma 1.

Given metric space , assume function is Lipschitz with constant with respect to the sup-norm for each label. Suppose has a finite doubling dimension: and . Let and be drawn i.i.d. from the distribution . Then, we have

 (9)

The following corollary reveals important statistical properties of SS-1NN and SS-$k$NN.

###### Corollary 1.

As $n$ goes to infinity, the errors of SS-1NN and SS-$k$NN converge to the sum over the labels of twice the Bayes error and of $(1 + \sqrt{8/k})$ times the Bayes error, respectively.

## IV Experiment

### IV-A Data Sets and Baselines

We abbreviate our proposed stochastic $\sigma$-Subgaussian sketch and stochastic Walsh-Hadamard sketch to SS+GAU and SS+WH, respectively. In the experiment, we set the entries in the $\sigma$-Subgaussian sketch matrix as i.i.d. standard Gaussian entries. This section evaluates the performance of the proposed methods on four data sets: corel5k, nus(vlad), nus(bow) and rcv1x. The statistics of these data sets are available on the corresponding website. We compare SS+GAU and SS+WH with several state-of-the-art methods, as follows.

• BR [5]: We implement two base classifiers for BR. The first uses the linear classification/regression package LIBLINEAR [28] with regularized square hinge loss as the base classifier; we call this baseline BR+LIB. The second uses $k$NN as the base classifier; we call this baseline BR+$k$NN and count the $k$NN search time as the training time.

• FastXML [1]: An advanced tree-based multi-label classifier.

• SLEEC [12]: A state-of-the-art embedding method, which is based on sparse local embeddings for large-scale multi-label classification. We use solvers of FastXML and SLEEC provided by the respective authors with default parameters.

Following the similar settings in [29] and [12], we set for the NN search in all NN based methods. The sketch size is chosen in a range of . Following [7], [11] and [30], we consider the Hamming Loss and Example-F1 measures to evaluate the prediction performance of all the methods. The smaller the value of the Hamming Loss, the better the performance, while the larger the value of Example-F1, the better the performance.
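The two evaluation measures can be computed as in the following sketch, which uses their common definitions: Hamming Loss is the fraction of label entries predicted incorrectly, and Example-F1 is the F1 score computed per instance and then averaged. Scoring an instance with no true and no predicted labels as 1 is a convention assumed here, and the function names are hypothetical.

```python
import numpy as np


def hamming_loss(Y_true, Y_pred):
    """Fraction of (instance, label) entries that are predicted incorrectly."""
    return float(np.mean(Y_true != Y_pred))


def example_f1(Y_true, Y_pred):
    """F1 score computed per instance, then averaged over instances."""
    tp = np.sum((Y_true == 1) & (Y_pred == 1), axis=1)
    denom = Y_true.sum(axis=1) + Y_pred.sum(axis=1)
    # Assumed convention: an instance with no true and no predicted labels scores 1.
    f1 = np.where(denom > 0, 2.0 * tp / np.maximum(denom, 1), 1.0)
    return float(f1.mean())


Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])
hl = hamming_loss(Y_true, Y_pred)
f1 = example_f1(Y_true, Y_pred)
print(hl, f1)
```

In this example two of the six label entries are wrong, so the Hamming Loss is 1/3, and each instance has an F1 of 2/3.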

### IV-B Results

Figure 1 shows that as the sketch size increases, the training time of SS+GAU and SS+WH rises, while their prediction performance improves. The results verify our theoretical analysis. The Hamming Loss, Example-F1 and training time comparisons of the various methods on the corel5k, nus(vlad), nus(bow) and rcv1x data sets are shown in Table I, Table II and Table III, respectively. From Tables I, II and III, we can see that:

• BR and SLEEC usually achieve better results, which is consistent with the empirical results in [12] and [15]. However, SLEEC is the slowest method compared to other baselines.

• Because we perform the optimization only on a small sketch of the full data set, our proposed methods are significantly faster than BR and state-of-the-art embedding approaches. Moreover, we can maintain competitive prediction performance by setting an appropriate sketch size. The empirical results illustrate our theoretical studies.

## V Conclusion

This paper carefully constructs the stochastic $\sigma$-Subgaussian sketch and the Walsh-Hadamard sketch for multi-label classification. From an algorithmic perspective, we show that we can obtain answers that are approximately as good as the exact answer for BR. From a statistical learning perspective, we also provide the generalization error bound of multi-label classification using our proposed stochastic sketch model. Lastly, our empirical studies corroborate our theoretical findings, and demonstrate the superiority of the proposed methods.

## Supplementary: The Proof of Important Theorems and Lemmas

### V-A Proof of Theorem 1

We first present the following lemma, which is derived from [31] and [32].

###### Lemma 1.

Let be a stochastic -Subgaussian sketch matrix. Then there are universal constants and such that for any subset , any and , we have

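$$\sup_{z \in \mathbb{Y}} |z' \mathcal{S} z| \leq \frac{c_1}{\sqrt{m}}\,\omega(\mathbb{Y}) + \delta, \qquad \mathcal{S} := S'S - I_{n \times n} \tag{10}$$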

with probability at least , and we have

$$\sup_{z \in \mathbb{Y}} |z' \mathcal{S} u| \leq \frac{5c_1}{\sqrt{m}}\,\omega(\mathbb{Y}) + 3\delta \tag{11}$$

with probability at least , where .

###### Theorem 1.

Let be a stochastic -Subgaussian sketch matrix, and be universal constants. Given any and , then with probability at least , is a -optimality approximation solution.

###### Proof.

Let , and be columns of matrix , and , respectively. Then and can be decomposed to and . Next, we study the relationship between and . We define . According to Definition 3 in the main paper, we know that belongs to the tangent cone of at .

Because , we have . Then, we get:

$$2\langle XV^*_i - Y_i, XM \rangle + \|XM\|_2^2 \geq 0 \tag{12}$$

As , we have . Then, we get and

$$\|SXM\|_2 \leq 2\|SXV^*_i - SY_i\|_2 \tag{13}$$

We derive the following:

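$$\begin{aligned}
\|SX\hat{V}_i - SY_i\|_2^2 &= \|SXV^*_i - SY_i\|_2^2 + \|SXM\|_2^2 + 2\langle SXV^*_i - SY_i, SXM \rangle \\
&= \|SXV^*_i - SY_i\|_2^2 + \|XM\|_2^2 + \langle XM, \mathcal{S}XM \rangle + 2\langle XV^*_i - Y_i, \mathcal{S}XM \rangle + 2\langle XV^*_i - Y_i, XM \rangle
\end{aligned}$$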

By using Lemma 1, with probability at least , we have

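$$\|SX\hat{V}_i - SY_i\|_2^2 \leq \|SXV^*_i - SY_i\|_2^2 + \|XM\|_2^2\left(1 + \frac{c_1}{\sqrt{m}}\,\omega(\mathbb{Y}) + \delta\right) + 2\|XV^*_i - Y_i\|_2\|XM\|_2\left(1 + \frac{5c_1}{\sqrt{m}}\,\omega(\mathbb{Y}) + 3\delta\right)$$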

where . Given , we have . For the sake of clarity, we define and , and then substitute them to the above expression, with probability at least , we have

$$\|SX\hat{V}_i - SY_i\|_2^2 \leq \|SXV^*_i - SY_i\|_2^2 + \gamma\psi\|XV^*_i - Y_i\|_2^2 + \left(\frac{\psi}{\gamma} + \varphi\right)\|XM\|_2^2 \tag{14}$$

Clearly, we have . By using Lemma 1, with probability at least , we have

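$$\begin{aligned}
\|SXV^*_i - SY_i\|_2^2 &= \|XV^*_i - Y_i\|_2^2 + \langle XV^*_i - Y_i, \mathcal{S}(XV^*_i - Y_i) \rangle \\
&\leq \|XV^*_i - Y_i\|_2^2\left(1 + \frac{c_1}{\sqrt{m}}\,\omega\!\left(\frac{XV^*_i - Y_i}{\|XV^*_i - Y_i\|_2}\right) + \delta\right) \\
&\leq \|XV^*_i - Y_i\|_2^2\,\varphi
\end{aligned} \tag{15}$$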

By using Lemma 1, with probability at least , we have . By using Eq.(13), with probability at least , we have

$$\|XM\|_2^2 \leq \frac{\|SXM\|_2^2}{2 - \varphi} \leq \frac{4\|SXV^*_i - SY_i\|_2^2}{2 - \varphi} \tag{16}$$

Eq.(14), Eq.(15) and Eq.(16) imply that, with probability at least , we have

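$$\begin{aligned}
\|SX\hat{V}_i - SY_i\|_2^2 &\leq \left(1 + 4\,\frac{\frac{\psi}{\gamma} + \varphi}{2 - \varphi}\right)\|SXV^*_i - SY_i\|_2^2 + \gamma\psi\|XV^*_i - Y_i\|_2^2 \\
&\leq \left(1 + 4\,\frac{\frac{\psi}{\gamma} + \varphi}{2 - \varphi}\right)\varphi\,\|XV^*_i - Y_i\|_2^2 + \gamma\psi\|XV^*_i - Y_i\|_2^2 \\
&\leq \left(\varphi - \frac{4\psi}{\gamma} - 4\varphi + \gamma\psi\right)\|XV^*_i - Y_i\|_2^2
\end{aligned} \tag{17}$$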

By setting , with probability at least , we have

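$$\|SX\hat{V}_i - SY_i\|_2^2 \leq (3\psi - 3\varphi)\|XV^*_i - Y_i\|_2^2 = \left(\frac{12c_1}{\sqrt{m}}\,\omega(\mathbb{Y}) + 6\delta\right)\|XV^*_i - Y_i\|_2^2 \tag{18}$$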

Eq.(18) implies that, with probability at least , we have

$$\|SX\hat{V} - SY\|_F^2 \leq \left(\frac{12c_1}{\sqrt{m}}\,\omega(\mathbb{Y}) + 6\delta\right)\|XV^* - Y\|_F^2 \tag{19}$$

By using Eq.(12) and Lemma 1 again, with probability at least , we have

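$$\begin{aligned}
\|SX\hat{V}_i - SY_i\|_2^2 &\geq \|SXV^*_i - SY_i\|_2^2 + \langle XM, \mathcal{S}XM \rangle + 2\langle XV^*_i - Y_i, \mathcal{S}XM \rangle \\
&\geq \|SXV^*_i - SY_i\|_2^2 - \|XM\|_2^2\left(\frac{c_1}{\sqrt{m}}\,\omega(\mathbb{Y}) + \delta\right) - 2\|XV^*_i - Y_i\|_2\|XM\|_2\left(\frac{5c_1}{\sqrt{m}}\,\omega(\mathbb{Y}) + 3\delta\right)
\end{aligned}$$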

We define and , and then substitute them to the above expression, with probability at least , we have

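$$\|SX\hat{V}_i - SY_i\|_2^2 \geq \|SXV^*_i - SY_i\|_2^2 - \gamma\hat{\psi}\|XV^*_i - Y_i\|_2^2 - \left(\frac{\hat{\psi}}{\gamma} + \hat{\varphi}\right)\|XM\|_2^2 \tag{20}$$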

By using Lemma 1 again, with probability at least , we have

$$\|SXV^*_i - SY_i\|_2^2 \geq \|XV^*_i - Y_i\|_2^2\,(1 - \hat{\varphi}) \tag{21}$$

Similar to Eq.(16), by using Eq.(13) and Lemma 1, with probability at least , we have

$$\|XM\|_2^2 \leq \frac{\|SXM\|_2^2}{1 - \hat{\varphi}} \leq \frac{4\|SXV^*_i - SY_i\|_2^2}{1 - \hat{\varphi}} \tag{22}$$

Eq.(20), Eq.(21) and Eq.(22) imply that, with probability at least , we have

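$$\begin{aligned}
\|SX\hat{V}_i - SY_i\|_2^2 &\geq \left(1 - 4\,\frac{\frac{\hat{\psi}}{\gamma} + \hat{\varphi}}{1 - \hat{\varphi}}\right)(1 - \hat{\varphi})\|XV^*_i - Y_i\|_2^2 - \gamma\hat{\psi}\|XV^*_i - Y_i\|_2^2 \\
&\geq \left(1 - \hat{\varphi} - \frac{4\hat{\psi}}{\gamma} - 4\hat{\varphi} - \gamma\hat{\psi}\right)\|XV^*_i - Y_i\|_2^2
\end{aligned} \tag{23}$$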

By setting , with probability at least , we have

 ||SX^Vi−SYi||22≥(1−4^ψ−5^φ)||XV∗i−Yi|