Online Metric Learning for Multi-Label Classification

06/12/2020 ∙ by Xiuwen Gong, et al. ∙ The University of Sydney

Existing research into online multi-label classification, such as the online sequential multi-label extreme learning machine (OSML-ELM) and stochastic gradient descent (SGD), has achieved promising performance. However, these works do not take label dependencies into consideration and lack a theoretical analysis of their loss functions. Accordingly, we propose a novel online metric learning paradigm for multi-label classification to fill the current research gap. We first propose a new metric for multi-label classification based on k-Nearest Neighbour (kNN) and the large margin principle. We then adapt it to the online setting to derive our model, which handles massive volumes of streaming data at high speed. Specifically, in order to learn the new kNN-based metric, we first project instances in the training dataset into the label space, which makes it possible to compare instances and labels in the same dimension. After that, we project both instances and labels into a new lower-dimensional space simultaneously, which enables us to extract the structure of dependencies between instances and labels. Finally, we leverage the large margin and kNN principles to learn the metric with an efficient optimization algorithm. Moreover, we provide a theoretical analysis of the upper bound of the cumulative loss for our method. Comprehensive experiments on a number of benchmark multi-label datasets validate our theoretical approach and illustrate that our proposed online metric learning (OML) algorithm outperforms state-of-the-art methods.


I Introduction

Real-world applications often generate massive volumes of streaming data at unprecedentedly high speed. Many researchers have focused on data classification to help customers or users obtain better search results; among these efforts, online multi-label classification, in which each instance can be assigned multiple labels, is very useful in several applications. For example, in web-related applications, Twitter, Facebook and Instagram posts and RSS feeds are tagged with multiple essential forms of categorization tags [34]. In the search industry, revenue comes from clicks on ads embedded in result pages, and ad selection and placement can be significantly improved if ads are tagged correctly. There are many other applications, such as object detection in video surveillance [25] and image retrieval in dynamic databases [12].

In the development of multi-label classification [28, 14], one challenge that remains unsolved is that most multi-label classification algorithms are developed in an off-line mode [7, 6, 1, 19, 36, 20]. These methods assume that all data are available in advance for learning. However, there are two major limitations of developing multi-label methods under such an assumption: firstly, these methods are impractical for large-scale datasets, since they require all data to be stored in memory; secondly, it is non-trivial to adapt off-line multi-label methods to sequential data. In practice, data is collected sequentially, and data collected earlier in this process may expire as time passes. Therefore, it is important to develop new multi-label classification methods that can deal with streaming data.

Several online multi-label classification studies have recently been developed to overcome the above-mentioned limitations. For example, online learning with accelerated nonsmooth stochastic gradient descent (OLANSGD) [22] was proposed to solve the online multi-label classification problem. Moreover, the online sequential multi-label extreme learning machine (OSML-ELM) [29] is a single-hidden-layer feed-forward neural network-based learning technique that classifies examples using its output weights and activation function. Unfortunately, all of these online multi-label classification methods lack an analysis of the loss function and disregard label dependencies. Many studies [9, 26, 2, 30, 18] have shown that multi-label learning methods that do not capture label dependency usually achieve degraded prediction performance. This paper aims to fill these gaps.

k-Nearest Neighbour (kNN) algorithms have achieved superior performance in various applications [10]. Moreover, experiments show that distance metric learning for single-label prediction can improve the prediction performance of kNN. Nevertheless, there are two problems associated with applying a kNN algorithm to an online multi-label setting. Firstly, naive kNN algorithms do not consider label dependencies. Secondly, it is non-trivial to learn an appropriate metric for online multi-label classification.

To break the bottleneck of kNN, we here propose a novel online metric learning paradigm for multi-label classification. More specifically, we project instances and labels into the same embedding space for comparison, after which we learn the distance metric by enforcing the constraint that the distance between an embedded instance and its correct label must be smaller than the distance between the embedded instance and other labels. Thus, two nearby instances with different labels will be pushed further apart. Moreover, an efficient optimization algorithm is proposed for the online multi-label scenario. In theoretical terms, we analyze the upper bound of the cumulative loss for our proposed model. A wide range of experiments on benchmark datasets corroborate our theoretical results and verify the improved accuracy of our method relative to state-of-the-art approaches.
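For concreteness, the sketch below illustrates this shared-embedding idea on toy data: an instance and two candidate label vectors are projected into one low-dimensional space, and a hinge-style margin violation is computed. The projection matrices V and W, the unit margin, and the toy data are illustrative assumptions, not the exact formulation of this paper.

```python
import numpy as np

# Hedged sketch of the core idea: project an instance and candidate label
# vectors into a shared low-dimensional space, then require the instance
# codeword to be closer to its own label's codeword than to any other
# label's codeword by a margin.
rng = np.random.default_rng(0)
d, q, k = 20, 5, 3           # feature dim, label dim, embedding dim
V = rng.normal(size=(k, d))  # projects instances into the embedding space
W = rng.normal(size=(k, q))  # projects label vectors into the same space

x = rng.normal(size=d)               # one instance
y_true = np.array([1, 0, 1, 0, 0])   # its label vector
y_other = np.array([0, 1, 0, 0, 1])  # some other label vector

z_x = V @ x                                   # instance codeword
d_true = np.linalg.norm(z_x - W @ y_true)     # distance to own label codeword
d_other = np.linalg.norm(z_x - W @ y_other)   # distance to another label codeword

margin = 1.0
violation = max(0.0, d_true - d_other + margin)  # hinge-style margin loss
print(d_true, d_other, violation)
```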

The remainder of this paper is organized as follows. We first describe the related work, the online metric learning for multi-label classification and the optimization algorithm. Next, we introduce the upper bound of the loss function. Finally, we present the experimental results and conclude this paper.

II Related Work

Existing multi-label classification methods can be grouped into two major categories: namely, algorithm adaptation (AA) and problem transformation (PT). AA extends specific learning algorithms to deal with multi-label classification problems. Typical AA methods include [32, 5, 35]. Moreover, PT methods such as that developed by [16], transform the learning task into one or more single-label classification problems. However, all of these methods assume that all data are available for learning in advance. These methods thus incur prohibitive computational costs on large-scale datasets, and it is also non-trivial to apply them to sequential data.

The state-of-the-art approaches to online multi-label classification have been developed to handle sequential data. These approaches can be divided into two key categories: Neural Network and Label Ranking. Neural Network approaches are based on a collection of connected units or nodes, referred to as artificial neurons. Each connection between artificial neurons can transmit a signal from one neuron to another; the neuron that receives the signal can process it and then transmit a signal to other neurons. Label ranking, another popular approach to multi-label learning, involves learning a set of ranking functions that order all the labels such that relevant labels are ranked higher than irrelevant ones.

From the neural network perspective, Ding et al. [11] developed a single-hidden-layer feedforward neural network-based learning technique named the extreme learning machine (ELM). In this method, the initial weights and the hidden-layer bias are selected at random, and only the output weights are trained to perform classification. Moreover, Venkatesan et al. [29] developed the OSML-ELM approach, which uses ELM to handle streaming data. OSML-ELM uses a sigmoid activation function and the learned output weights to predict the labels; in each step, the output weight is learned from a specific update equation. OSML-ELM converts the label set from a bipolar to a unipolar representation in order to solve multi-label classification problems.
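A minimal sketch of the ELM idea described above is shown below: the hidden layer is random and never trained, and only the output weights are solved for in closed form via a pseudo-inverse. The sigmoid activation matches the OSML-ELM description, while the variable names, toy data and 0.5 threshold are our own assumptions rather than the authors' implementation.

```python
import numpy as np

# Illustrative ELM-style fit (not OSML-ELM itself): random hidden layer,
# closed-form output weights via the Moore-Penrose pseudo-inverse.
rng = np.random.default_rng(0)
n, d, h, q = 200, 10, 50, 4      # samples, features, hidden units, labels

X = rng.normal(size=(n, d))
Y = (rng.random(size=(n, q)) > 0.7).astype(float)  # toy multi-label targets

W_in = rng.normal(size=(d, h))   # random input weights (never trained)
b = rng.normal(size=h)           # random hidden bias

H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))  # sigmoid hidden activations
beta = np.linalg.pinv(H) @ Y               # closed-form output weights

scores = H @ beta                          # label scores for the instances
pred = (scores > 0.5).astype(int)          # thresholded multi-label prediction
```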

Some other existing approaches are based on label ranking, such as OLANSGD [22]. In the majority of cases, ranking functions are learned by minimizing the ranking loss in the max-margin framework. However, the memory and computational costs of this process are expensive on large-scale datasets. Stochastic gradient descent (SGD) approaches update the model parameters using only the gradient information calculated from a single label at each iteration. OLANSGD minimizes the primal form using Nesterov's smoothing, which has recently been extended to the stochastic setting.
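For illustration only, the sketch below shows a generic single-step SGD update on a pairwise ranking hinge loss; it is not OLANSGD's actual accelerated, smoothed update, and the per-label scorer, sampling scheme and margin are our assumptions.

```python
import numpy as np

# Generic SGD step on a pairwise ranking hinge loss: sample one relevant
# and one irrelevant label and update the model from that single violation.
rng = np.random.default_rng(0)
d, q = 10, 4
M = rng.normal(size=(q, d)) * 0.01   # one linear scorer per label

def sgd_rank_step(M, x, y, lr=0.1, margin=1.0):
    scores = M @ x
    rel = np.flatnonzero(y == 1)      # relevant labels
    irr = np.flatnonzero(y == 0)      # irrelevant labels
    if rel.size == 0 or irr.size == 0:
        return M
    r = rng.choice(rel)               # sample one relevant label
    i = rng.choice(irr)               # and one irrelevant label
    if scores[r] - scores[i] < margin:   # ranking constraint violated
        M[r] += lr * x                    # push the relevant score up
        M[i] -= lr * x                    # push the irrelevant score down
    return M

x = rng.normal(size=d)
y = np.array([1, 0, 1, 0])
M = sgd_rank_step(M, x, y)
```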

However, none of these methods analyze the loss function, and all of them fail to capture the interdependencies among labels; these issues have been shown to result in degraded prediction performance. Accordingly, this paper aims to address these issues.

Notation Definition
the round of the algorithm
the instance presented on round t
the label vector corresponding to the instance on round t
the nearest-neighbour instance of the instance on round t
the output corresponding to that nearest neighbour
the initialized input matrix
the corresponding output matrix
the number of instances
the number of features
the number of labels
the dimension of the new projection space
the projection matrix on round t
the lower bound and upper bound used in the loss-bound analysis
the Frobenius inner product of two matrices
norm
norm
the Frobenius norm
TABLE I: Summary of Notations

III Our Proposed Method

III-A Notations

We denote the instance presented to the algorithm on round t, together with its label, and refer to each instance-label pair as an example. Suppose that we initially have a set of examples in memory; for each new instance, a nearest neighbour is selected from this memory. The initialized instance matrix and the corresponding output matrix are stored, and the dimension of the projection space is a positive integer. One projection matrix maps each output (label) vector into a lower-dimensional space; another projection matrix maps each input vector into a space of the same dimension, so that the projected inputs and outputs can be compared in this common projection space under the Frobenius norm. Notations are summarized in Table I.

Iii-B Online Metric Learning

Inspired by Hsu et al. [15], who showed that each label vector can be projected into a lower-dimensional label space (which is deemed an encoding), we propose the following large-margin metric learning approach with nearest-neighbour constraints to learn the projection. If the encoding scheme works well, the distance between the codeword of an instance and the codeword of its own label should tend to 0 and should be smaller than the distance between that codeword and any other output's codeword. The following large-margin formulation is then presented to learn the projection matrix:

(1)

The constraints in Eq.(1) guarantee that the distance between the codeword of an instance and the codeword of its own label is smaller than the distance between the codeword of the instance and the codeword of any other output. To give Eq.(1) more robustness, we add a loss function, defined via a norm, as the margin. After that, we use the Euclidean metric to measure the distances between instances and then learn a new distance metric, which improves the performance of kNN and also captures label dependency.
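Since the display equation did not survive formatting, the following is a hedged sketch of what a large-margin formulation with these nearest-neighbour constraints typically looks like; the symbols (V for the instance projection, W for the label projection, and the margin-generating loss term) are our notation and are not necessarily identical to the paper's Eq.(1).

```latex
% A plausible instantiation of the nearest-neighbour large-margin
% constraints described above (notation is ours, not the paper's):
\min_{V,\,W} \ \|V\|_F^{2} + \|W\|_F^{2}
\quad \text{s.t.} \quad
\|V x_i - W y_i\|_2^{2} + \ell(y_i, y_j) \le \|V x_i - W y_j\|_2^{2},
\qquad \forall\, j \ne i .
```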

To retain the information learned on previous rounds, we adapt the above large-margin formulation to the online setting. Thus, we have to define the initialization of the projection matrix and its update rule. We initialize the projection matrix to a non-zero matrix and set the new projection matrix to be the solution of the following constrained optimization problem on round t.

(2)

The loss function is defined as follows:

(3)

where the matrix is learned through the following formulation:

Define the loss function on round t as

(4)

When the loss function is zero on round t, the projection matrix is left unchanged. In contrast, on those rounds where the loss function is positive, the algorithm forces the new projection matrix to satisfy the constraint regardless of the step size required. This update rule requires the new matrix to correctly classify the current example with a sufficiently high margin while staying as close as possible to the previous matrix, so as to retain the information learned on the previous rounds.
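This has the same structure as the passive-aggressive updates of Crammer et al. [8], which Section IV follows in its analysis. A hedged sketch of such an update, written in our own generic notation rather than the paper's Eq.(2), is:

```latex
% Passive-aggressive style update (generic sketch; W_t denotes the current
% projection matrix and \ell_t the margin loss on round t):
W_{t+1} = \arg\min_{W} \ \tfrac{1}{2}\,\|W - W_t\|_F^{2}
\quad \text{s.t.} \quad \ell_t(W) = 0 .
```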

Iii-C Optimization

The optimization problem in Eq.(2) can be solved using standard tools from convex optimization [4]. If the loss is zero, the previous projection matrix itself satisfies the constraint in Eq.(2) and is clearly the optimal solution. Therefore, we concentrate on the case where the loss is positive. Firstly, we define the Lagrangian of the optimization problem in Eq.(2) to be,

(5)

where a Lagrange multiplier is introduced for the constraint.

Setting the partial derivatives of the Lagrangian with respect to the elements of the projection matrix to zero, and solving the resulting equation, yields an expression for the new projection matrix in which an identity matrix appears.

Inspired by [24], we use an approximate form of this expression to simplify the subsequent calculation.

(6)

After defining the relevant auxiliary quantities, plugging the approximation formula Eq.(6) back into Eq.(5) yields a cubic function of the Lagrange multiplier, where

If this cubic function is non-monotonic on the feasible interval, we take the Lagrange multiplier to be its maximum point. We obtain,

(7)

where ,

1:  Set the projection matrix to a non-zero matrix
2:  Initialize the memory with the initial examples
3:  for each round t do
4:     Receive the pairwise instances (the current instance and its label)
5:     Find the nearest neighbour of the current instance in memory
6:     Compute the loss by Eq.(4)
7:     if the loss is positive then
8:        Set the step size as in Eq.(7)
9:        Update the projection matrix
10:     else
11:        Keep the projection matrix unchanged
12:     end if
13:     Append the current instances to the memory
14:  end for
Algorithm 1 Online Metric Learning for Multi-Label Classification

Algorithm 1 provides the details of the optimization. We denote the loss suffered by our algorithm on round t by the quantity defined in Eq.(4).
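The following is a hedged Python sketch of the online loop in Algorithm 1. The exact closed-form step of Eq.(7) is not reproduced here; a simple clamped subgradient step on the margin violation (updating only the label projection, for brevity) stands in for it, and the names V, W, margin and tau_max are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hinge_loss(V, W, x, y_nn_out, y, margin=1.0):
    """Margin violation between the instance codeword, its own label's
    codeword, and the codeword of the nearest neighbour's output."""
    z = V @ x
    d_own = np.linalg.norm(z - W @ y) ** 2
    d_nn = np.linalg.norm(z - W @ y_nn_out) ** 2
    return max(0.0, d_own - d_nn + margin)

def online_metric_learning(stream, d, q, k=5, tau_max=1.0):
    V = rng.normal(size=(k, d)) * 0.1   # instance projection (non-zero init)
    W = rng.normal(size=(k, q)) * 0.1   # label projection (non-zero init)
    mem_X, mem_Y = [], []               # memory of previously seen examples
    for x, y in stream:
        if mem_X:                        # find the nearest neighbour in memory
            j = int(np.argmin([np.linalg.norm(x - m) for m in mem_X]))
            loss = hinge_loss(V, W, x, mem_Y[j], y)
            if loss > 0:                 # update only when the margin is violated
                tau = min(tau_max, loss)  # clamped step size (stand-in for Eq.(7))
                z = V @ x
                grad_W = (2 * np.outer(W @ y - z, y)
                          - 2 * np.outer(W @ mem_Y[j] - z, mem_Y[j]))
                W -= tau * grad_W
        mem_X.append(x)                  # append the current example to memory
        mem_Y.append(y)
    return V, W
```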

We focus on the situation in which the loss is positive. The optimal solution is the one satisfying the stationarity conditions derived above; based on this derivation, the projection matrix can be updated in closed form using the multiplier given in Eq.(7).

Inspired by metric learning [17], we use the learned metric to select the nearest neighbours of each testing instance from the training data, and make predictions based on these nearest neighbours. The distance between the codewords of a testing instance and a training instance is computed in the embedding space.
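A hedged sketch of this prediction step: project the test instance and all training instances into the embedding space, take the k nearest training instances there, and average their label vectors. The simple averaging and the 0.5 threshold are our assumptions, not necessarily the paper's decision rule.

```python
import numpy as np

def predict(V, X_train, Y_train, x_test, k=10, threshold=0.5):
    z_test = V @ x_test                          # embed the test instance
    Z_train = X_train @ V.T                      # embed all training instances
    dists = np.linalg.norm(Z_train - z_test, axis=1)
    nn = np.argsort(dists)[:k]                   # k nearest neighbours in the embedding
    scores = Y_train[nn].mean(axis=0)            # average the neighbours' label vectors
    return (scores > threshold).astype(int)
```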

Method Training Time Testing Time
OSML-ELM
OLANSGD
kNN -
OML
TABLE II: Training time complexity per iteration and testing time complexity per testing instance for all methods.

Iii-D Computational Complexity Analysis

We compare the time complexity of our proposed method (i.e. OML) with that of three popular methods: OSML-ELM [29], OLANSGD [22] and kNN [10].

The training time of OML is dominated by finding the nearest neighbour of each training instance and computing the loss in Eq.(4). It takes time to search for the nearest neighbour in the training dataset, while computing the loss with the two embedded projections takes additional time; the per-iteration training complexity is the sum of these two terms.

We now analyze the testing time per testing instance. The testing time of kNN is dominated by searching for the nearest neighbours of a testing instance in the training dataset. Our proposed OML performs prediction in a similar way but additionally projects all instances into the embedding space before searching for the nearest neighbours; the testing time of OML therefore includes this projection cost.

Moreover, the training time complexity per iteration and the testing time complexity per testing instance of the other methods are listed in Table II for comparison.

From Table II, we can conclude that the training time complexity of OML in each iteration is lower than that of OSML-ELM and OLANSGD with respect to the number of training instances, which is usually much larger than the number of features and the number of labels. Besides, the training time complexity of kNN is denoted by "-" as it has no training process.

Moreover, in terms of testing time complexity, OML is lower than OSML-ELM and kNN with respect to the number of training instances. In addition, the reduced dimension of the new projection space is much smaller than both the number of features and the number of labels. OLANSGD is the fastest method at prediction time, mainly because it only computes label scores from the learned model parameters.

IV Loss Bound

Following the analysis in [8], we state the upper bounds for our online metric learning algorithm. Let an arbitrary fixed projection matrix be given for comparison. We use the approximate form given in Eq.(6) in place of the exact expression.

Lemma 1.

Let the multiplier be as defined in Eq.(7), and let the initial projection matrix be a non-zero matrix. The following bound holds for any fixed comparison matrix:

Proof.

Define the per-round difference term; this lemma is proved by summing it over all rounds, and the bound on this sum follows directly, as shown below.

Lemma 2.

Assume there exists some matrix that satisfies the constraints on every round. Let the multiplier be as defined in Eq.(7), let the initial projection matrix be a non-zero matrix, and let the loss be as defined in Eq.(4). We bound the cumulative loss as follows.

Proof.

By using the properties of the Frobenius norm, where the Frobenius inner product is involved, we can expand the per-round difference term. Using the assumption in Lemma 2, the resulting matrix is clearly symmetric. We take its SVD, then replace the non-positive elements by the minimum non-negative singular value, and denote the resulting approximation accordingly; this approximation is a non-negative symmetric matrix. Furthermore, by using the definition of the Frobenius inner product, we get that

Taking the SVD of this matrix, and noting that the corresponding factor is a unitary matrix, let its minimum singular value be denoted accordingly; we then get

(8)

where an identity matrix appears. Now, we get that,

Summing both sides of the inequality over all rounds, and using the result above, gives

Then, we can get that

Lemma 2 has been proved. ∎

Based on Lemma 2, we provide the following theorem.

Theorem 1.

Let there be a sequence of examples, each consisting of an instance and its label vector, presented to the algorithm. The projection matrix on each round is as defined above, with a non-zero initial matrix, and let an upper bound on the loss be given. Under the assumption of Lemma 2, the cumulative loss suffered on the sequence is bounded as follows,

Proof.

By using Eq.(4), we get that

and,

Since the loss is defined via a norm, it is bounded; hence the corresponding per-round terms are bounded as well. By using Lemma 2, we can get,

Therefore, the cumulative loss is bounded. This bound guarantees the performance of our proposed model on unseen data.

V Experiments

In this section, we conduct experiments to evaluate the prediction performance of the proposed OML for online multi-label classification, and compare it with several state-of-the-art methods. All experiments are conducted on a workstation with a 3.20 GHz Intel CPU and 16 GB of main memory, running Windows 10.

V-A Datasets

We conduct experiments on eight benchmark datasets: Corel5k [13], Enron (http://bailando.sims.berkeley.edu/enron_email.html), Medical [23], Emotions [27], Cal500 [13], Image [33], scene [3] and slashdot (http://waikato.github.io/meka/datasets). The datasets are collected from different domains, such as images (i.e. Corel5k, Image, scene), music (i.e. Emotions, Cal500) and text (i.e. Enron, Medical, slashdot). The statistics of these datasets can be found in Table III.

Datasets #Instances #Features #Labels Domain
Corel5k 5000 499 374 images
Emotions 593 72 6 music
Enron 1702 1001 53 text
Medical 978 1449 45 text
Cal500 502 68 174 music
Image 2000 103 14 image
scene 2407 294 6 image
slashdot 3782 103 14 text
TABLE III: Statistics of multi-label benchmark datasets.
Fig. 1: Macro F1 of various methods on the Corel5k, Enron, Medical and Emotions datasets.
Fig. 2: Example F1 of various methods on the Corel5k, Enron, Medical and Emotions datasets.
Fig. 3: Micro F1 of various methods on the Corel5k, Enron, Medical and Emotions datasets.
Fig. 4: Hamming Loss of various methods on the Corel5k, Enron, Medical and Emotions datasets.
Fig. 5: Macro F1 of various methods on the Cal500, Image, scene and slashdot datasets.
Fig. 6: Example F1 of various methods on the Cal500, Image, scene and slashdot datasets.
Fig. 7: Micro F1 of various methods on the Cal500, Image, scene and slashdot datasets.
Fig. 8: Hamming Loss of various methods on the Cal500, Image, scene and slashdot datasets.

V-B Experiment Setup

Baseline Methods We compare our OML method with several state-of-the-art online multi-label prediction methods:

  • OSML-ELM [29]: OSML-ELM uses a sigmoid activation function and output weights to predict the labels. In each step, the output weight is learned from a specific update equation. OSML-ELM converts the label set from a bipolar to a unipolar representation in order to solve multi-label classification problems.

  • OLANSGD [22]: Based on Nesterov's smoothing method, OLANSGD uses accelerated nonsmooth stochastic gradient descent to solve the online multi-label classification problem. It updates the model parameters using only the gradient information calculated from a single label at each iteration, and learns a ranking function that ranks relevant labels above irrelevant ones.

  • kNN: We adapt the k-nearest neighbour (kNN) algorithm to solve online multi-label classification problems. A Euclidean metric is used to measure the distances between instances.

In our experiments, the projection matrix is initialized as a normally distributed random matrix, and we initially keep 20% of the data for nearest-neighbour searching. The remaining hyper-parameters are set to 100000, 0.00001 and 10, respectively. The codes of the baseline methods are provided by their respective authors. The parameter of OLANSGD is chosen by five-fold cross-validation, and we use the default parameters for OSML-ELM.

Performance Measurements To fairly measure the performance of our method and baseline methods, we consider the following evaluation measurements [21, 31]:

  • Micro-F1: computes true positives, true negatives, false positives and false negatives over all labels, then calculates an overall F-1 score.

  • Macro-F1: calculates the F-1 score for each label, then takes the average of the F-1 score.

  • Example-F1: computes the F-1 score for all labels of each testing sample, then takes the average of the F-1 score.

  • Hamming Loss: computes the average zero-one error over all labels and instances.

The smaller the Hamming Loss value, the better the performance; moreover, the larger the value of the other three measurements, the better the performance.
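As a quick reference, these four measures can be computed with scikit-learn as sketched below; the toy label matrices are our own, and scikit-learn's "samples" average is used here to play the role of Example-F1.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Toy binary label matrices: rows are test instances, columns are labels.
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])

micro_f1 = f1_score(Y_true, Y_pred, average="micro")      # pooled counts over all labels
macro_f1 = f1_score(Y_true, Y_pred, average="macro")      # per-label F1, then averaged
example_f1 = f1_score(Y_true, Y_pred, average="samples")  # per-instance F1, then averaged
h_loss = hamming_loss(Y_true, Y_pred)                     # fraction of wrong instance-label pairs

print(micro_f1, macro_f1, example_f1, h_loss)
```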

V-C Prediction Performance

Figures 1 to 8 present the four measurement results for our method and the baseline approaches on the various datasets. From these figures, we can see that:

  • OML outperforms OSML-ELM and OLANSGD on most datasets; this is because neither of the latter approaches considers label dependency.

  • OML achieves better performance than kNN on all datasets, except on Cal500 under Hamming Loss, where the two are comparable. This result illustrates that our proposed method is able to learn an appropriate metric for online multi-label classification.

  • Moreover, kNN is comparable to OSML-ELM and OLANSGD on most datasets, which demonstrates the competitive performance of kNN.

Our experiments verify our theoretical studies and the motivation of this work: in short, our method is able to capture the interdependencies among labels while also overcoming the bottleneck of kNN.

VI Conclusion

Current multi-label classification methods assume that all data are available in advance for learning. Unfortunately, this assumption prevents off-line multi-label methods from handling sequential data. OLANSGD and OSML-ELM overcome this limitation and achieve promising results in online multi-label classification; however, these methods lack a theoretical analysis of their loss functions and do not consider label dependency, which has been proven to lead to degraded performance. Accordingly, to fill the current research gap on streaming data, we propose a novel online metric learning method for multi-label classification based on the large margin principle. We first project instances and labels into the same embedding space for comparison, then learn the distance metric by enforcing the constraint that the distance between an embedded instance and its correct label must be smaller than the distance between the embedded instance and other labels. Thus, two nearby instances with different labels will be pushed further apart. Moreover, we develop an efficient online algorithm for our proposed model. Finally, we provide the upper bound of the cumulative loss for our proposed model, which guarantees its performance on unseen data. Extensive experiments corroborate our theoretical results and demonstrate the superiority of our method.

References

  • [1] R. Babbar and B. Schölkopf (2017) DiSMEC: distributed sparse machines for extreme multi-label classification. In WSDM, pp. 721–729. Cited by: §I.
  • [2] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain (2015) Sparse local embeddings for extreme multi-label classification. In NIPS, pp. 730–738. Cited by: §I.
  • [3] M. R. Boutell, J. Luo, X. Shen, and C. M. Brown (2004) Learning multi-label scene classification. Pattern Recognit., pp. 1757–1771. Cited by: §V-A.
  • [4] S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press, New York, NY, USA. External Links: ISBN 0521833787 Cited by: §III-C.
  • [5] K. Brinker and E. Hullermeier (2007) Case-based multilabel ranking. In IJCAI, pp. 702–707. Cited by: §II.
  • [6] Y. Chen and H. Lin (2012) Feature-aware label space dimension reduction for multi-label classification. In NIPS, pp. 1538–1546. Cited by: §I.
  • [7] W. Cheng and E. Hüllermeier (2009) Combining instance-based learning and logistic regression for multilabel classification. Machine Learning 76 (2-3), pp. 211–225. Cited by: §I.
  • [8] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer (2006) Online passive-aggressive algorithms. Journal of Machine Learning Research 7, pp. 551–585. Cited by: §IV.
  • [9] K. Dembczynski, W. Cheng, and E. Hüllermeier (2010) Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pp. 279–286. Cited by: §I.
  • [10] J. Deng, A. C. Berg, K. Li, and F. Li (2010) What does classifying more than 10, 000 image categories tell us?. In ECCV, pp. 71–84. Cited by: §I, §III-D.
  • [11] S. Ding, H. Zhao, Y. Zhang, X. Xu, and R. Nie (2015) Extreme learning machine: algorithm, theory and applications. Artif. Intell. Rev. 44 (1), pp. 103–115. Cited by: §II.
  • [12] A. Dong and B. Bhanu (2003) Concept learning and transplantation for dynamic image databases. In ICME, pp. 765–768. Cited by: §I.
  • [13] P. Duygulu, K. Barnard, J. F. G. de Freitas, and D. A. Forsyth (2002) Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In Computer Vision - ECCV, Vol. 2353, pp. 97–112. Cited by: §V-A.
  • [14] E. Gibaja and S. Ventura (2015) A tutorial on multilabel learning. ACM Comput. Surv. 47 (3), pp. 52:1–52:38. Cited by: §I.
  • [15] D. J. Hsu, S. Kakade, J. Langford, and T. Zhang (2009) Multi-label prediction via compressed sensing. In NIPS, pp. 772–780. Cited by: §III-B.
  • [16] D. Hsu, S. Kakade, J. Langford, and T. Zhang (2009) Multi-label prediction via compressed sensing. In NIPS, pp. 772–780. Cited by: §II.
  • [17] B. Kulis (2013) Metric learning: a survey. Foundations and Trends in Machine Learning 5 (4), pp. 287–364. Cited by: §III-C.
  • [18] W. Liu, I. W. Tsang, and K. Müller (2017) An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research 18 (94), pp. 1–38. Cited by: §I.
  • [19] W. Liu and I. W. Tsang (2017) Making decision trees feasible in ultrahigh feature and label dimensions. Journal of Machine Learning Research 18 (81), pp. 1–36. Cited by: §I.
  • [20] W. Liu, D. Xu, I. W. Tsang, and W. Zhang (2019) Metric learning for multi-output tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 408–422. Cited by: §I.
  • [21] Q. Mao, I. W. Tsang, and S. Gao (2013) Objective-guided image annotation. IEEE Transactions on Image Processing 22 (4), pp. 1585–1597. Cited by: §V-B.
  • [22] S. Park and S. Choi (2013) Online multi-label learning with accelerated nonsmooth stochastic gradient descent. In ICASSP, pp. 3322–3326. Cited by: §I, §II, §III-D, 2nd item.
  • [23] J. P. Pestian, C. Brew, P. Matykiewicz, D. J. Hovermale, N. Johnson, K. B. Cohen, and W. Duch (2007) A shared task involving multi-label classification of clinical free text. In Biological, translational, and clinical language processing, pp. 97–104. Cited by: §V-A.
  • [24] K. B. Petersen and M. S. Pedersen (2012-11) The matrix cookbook. Technical University of Denmark. Cited by: §III-C.
  • [25] R. Popovici, A. Weiler, and M. Grossniklaus (2014) On-line clustering for real-time topic detection in social media streaming data. In WWW, pp. 57–63. Cited by: §I.
  • [26] J. Read, B. Pfahringer, G. Holmes, and E. Frank (2011) Classifier chains for multi-label classification. Machine Learning 85 (3), pp. 333–359. Cited by: §I.
  • [27] K. Trohidis, G. Tsoumakas, G. Kalliris, and I. P. Vlahavas (2008) Multi-label classification of music into emotions. In International Conference on Music Information Retrieval, pp. 325–330. Cited by: §V-A.
  • [28] G. Tsoumakas, M. Zhang, and Z. Zhou (2012) Introduction to the special issue on learning from multi-label data. Machine Learning 88 (1-2), pp. 1–4. Cited by: §I.
  • [29] R. Venkatesan, M. J. Er, M. Dave, M. Pratama, and S. Wu (2017) A novel online multi-label classifier for high-speed streaming data applications. Evolving Systems 8 (4), pp. 303–315. Cited by: §I, §II, §III-D, 1st item.
  • [30] I. E. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. S. Dhillon (2016) PD-sparse : A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, pp. 3069–3077. Cited by: §I.
  • [31] L. Zhang, Q. Zhang, L. Zhang, D. Tao, X. Huang, and B. Du (2015) Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding. Pattern Recognit. 48 (10), pp. 3102–3112. Cited by: §V-B.
  • [32] M. Zhang and Z. Zhou (2006) Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18 (10), pp. 1338–1351. Cited by: §II.
  • [33] M. Zhang and Z. Zhou (2007) ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit., pp. 2038–2048. Cited by: §V-A.
  • [34] X. Zhang, T. Graepel, and R. Herbrich (2012) Bayesian online learning for multi-label and multi-variate performance measures. In AISTATS, pp. 956–963. Cited by: §I.
  • [35] J. T. Zhou, I. W. Tsang, S. Ho, and K. Müller (2019) N-ary decomposition for multi-class classification. Machine Learning 108 (5), pp. 809–830. Cited by: §II.
  • [36] J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan (2019) Multi-class heterogeneous domain adaptation. Journal of Machine Learning Research 20, pp. 57:1–57:31. Cited by: §I.