Real-world applications often generate massive volumes of streaming data at unprecedented speed. Many researchers have focused on data classification to help users obtain better search results. Within this area, online multi-label classification, in which each instance can be assigned multiple labels, is particularly useful. For example, in web-related applications, Twitter, Facebook and Instagram posts and RSS feeds are attached with multiple categorization tags. In the search industry, revenue comes from clicks on ads embedded in the result pages, so ad selection and placement can be significantly improved if ads are tagged correctly. Other applications include object detection in video surveillance and image retrieval in dynamic databases.
In the development of multi-label classification [28, 14], one challenge that remains unsolved is that most multi-label classification algorithms are developed in an off-line mode [7, 6, 1, 19, 36, 20]. These methods assume that all data are available in advance for learning. However, there are two major limitations to developing multi-label methods under this assumption: first, these methods are impractical for large-scale datasets, since they require the entire dataset to be stored in memory; second, it is non-trivial to adapt off-line multi-label methods to sequential data. In practice, data are collected sequentially, and data collected earlier in this process may expire as time passes. Therefore, it is important to develop new multi-label classification methods that can deal with streaming data.
Several online multi-label classification methods have recently been developed to overcome the above-mentioned limitations. For example, online learning with accelerated nonsmooth stochastic gradient (OLANSGD) was proposed to solve the online multi-label classification problem. Moreover, the online sequential multi-label extreme learning machine (OSML-ELM) is a learning technique based on a single-hidden-layer feed-forward neural network; it classifies examples using its output weights and activation function. Unfortunately, these online multi-label classification methods lack an analysis of the loss function and disregard label dependencies. Many studies [9, 26, 2, 30, 18] have shown that multi-label learning methods that do not capture label dependency usually achieve degraded prediction performance. This paper aims to fill these gaps.
k-Nearest Neighbour (kNN) algorithms have achieved superior performance in various applications. Moreover, experiments show that distance metric learning on single-label prediction can improve the prediction performance of kNN. Nevertheless, there are two problems associated with applying a kNN algorithm to an online multi-label setting. First, naive kNN algorithms do not consider label dependencies. Second, it is non-trivial to learn an appropriate metric for online multi-label classification.
To break the bottleneck of kNN, we propose a novel learning paradigm for online multi-label classification. More specifically, we project instances and labels into the same embedding space for comparison, after which we learn the distance metric by enforcing the constraint that the distance between an embedded instance and its correct label must be smaller than the distance between the embedded instance and other labels. Thus, two nearby instances with different labels will be pushed further apart. Moreover, an efficient optimization algorithm is proposed for the online multi-label scenario. In theoretical terms, we analyze the upper bound of the cumulative loss for our proposed model. A wide range of experiments on benchmark datasets corroborate our theoretical results and verify the improved accuracy of our method relative to state-of-the-art approaches.
The remainder of this paper is organized as follows. We first describe the related work, the online metric learning for multi-label classification and the optimization algorithm. Next, we introduce the upper bound of the loss function. Finally, we present the experimental results and conclude this paper.
II Related Work
Existing multi-label classification methods can be grouped into two major categories: algorithm adaptation (AA) and problem transformation (PT). AA extends specific learning algorithms to deal with multi-label classification problems. Typical AA methods include [32, 5, 35]. PT methods, in contrast, transform the learning task into one or more single-label classification problems. However, all of these methods assume that all data are available for learning in advance. They thus incur prohibitive computational costs on large-scale datasets, and it is also non-trivial to apply them to sequential data.
The state-of-the-art approaches to online multi-label classification have been developed to handle sequential data. These approaches can be divided into two key categories: Neural Network and Label Ranking. Neural Network approaches are based on a collection of connected units or nodes, referred to as artificial neurons. Each connection between artificial neurons can transmit a signal from one neuron to another. The artificial neuron that receives the signal can process it and then transmit a signal to other artificial neurons. Label ranking, another popular approach to multi-label learning, learns a set of ranking functions that order all the labels such that relevant labels are ranked higher than irrelevant ones.
From the neural network perspective, Ding et al. developed a learning technique based on a single-hidden-layer feedforward neural network, named ELM. In this method, the initial weights and the hidden layer bias are selected at random, and the network is trained for the output weights to perform the classification. Moreover, Venkatesan et al. developed the OSML-ELM approach, which uses ELM to handle streaming data. OSML-ELM uses a sigmoid activation function and output weights to predict the labels. In each step, the output weight is learned from a specific equation. OSML-ELM converts the label set from bipolar to unipolar representation in order to solve multi-label classification problems.
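As a rough illustration of the ELM idea described above (a random, untrained hidden layer with output weights fitted in closed form), the following Python sketch trains a batch ELM on toy data. It is a minimal sketch, not the OSML-ELM implementation: the function names, the hidden-layer size, the use of a pseudo-inverse and the zero decision threshold are assumptions made for illustration, and the sequential update used by OSML-ELM is not reproduced.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_elm(X, Y, n_hidden=40, seed=0):
    """Fit a basic batch ELM (hypothetical helper, not the authors' code)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights, never trained
    b = rng.normal(size=n_hidden)                # random hidden biases, never trained
    H = sigmoid(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ Y                 # output weights by least squares
    return W, b, beta


def predict_elm(X, W, b, beta, threshold=0.0):
    """Score with bipolar targets, then report unipolar 0/1 labels."""
    scores = sigmoid(X @ W + b) @ beta
    return (scores > threshold).astype(int)


# Toy data (assumed): 20 instances, 5 features, 3 labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
Y01 = (rng.random(size=(20, 3)) > 0.5).astype(int)  # unipolar labels in {0, 1}
Y_bipolar = 2 * Y01 - 1                              # bipolar labels in {-1, +1}

W, b, beta = train_elm(X, Y_bipolar)
pred = predict_elm(X, W, b, beta)
```

Only `beta` is learned here; the hidden layer stays fixed after random initialization, which is what makes the fit a single linear least-squares problem.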
Some other existing approaches are based on label ranking, such as OLANSGD. In the majority of cases, ranking functions are learned by minimizing the ranking loss in the max margin framework. However, the memory and computational costs of this process are high on large-scale datasets. Stochastic gradient descent (SGD) approaches update the model parameters using only the gradient information calculated from a single label at each iteration. OLANSGD minimizes the primal form using Nesterov's smoothing, which has recently been extended to the stochastic setting.
However, none of these methods analyzes the loss function, and all of them fail to capture the interdependencies among labels; these issues have been shown to result in degraded prediction performance. Accordingly, this paper aims to address these issues.
|the round of algorithm|
|an instance presented on round t|
|corresponding label vector to|
|nearest neighbour instance to|
|corresponding output of|
|initialized input matrix|
|corresponding output matrix|
|the number of instances|
|the number of features|
|the number of labels|
|the dimension of the new projection space|
|projection matrix on round|
|lower bound and upper bound of|
|Frobenius inner product of and|
III Our Proposed Method
We denote the instance presented to the algorithm on round by , and the label by , and refer to each instance-label pair as an example. Suppose that we initially have examples in memory, denoted by . is a nearest neighbour to . The initialized instance matrix is denoted as and the corresponding output matrix is denoted as . is a positive integer. is the Frobenius norm. is a projection matrix which maps each output vector ( dimension) to ( dimension). Let also be a projection matrix. Each input vector ( dimension) is projected to ( dimension). Then and can be compared in the projection space ( dimension). Notations are summarized in Table I.
III-B Online Metric Learning
Inspired by Hsu et al., who stated that each label vector can be projected into a lower dimensional label space, which is regarded as encoding, we propose the following large-margin metric learning approach with nearest neighbor constraints to learn the projection. If the encoding scheme works well, the distance between the codeword of , , and , , should tend to 0 and be less than the distance between the codeword and any other output . The following large margin formulation is then presented to learn the projection matrix :
The constraints in Eq.(1) guarantee that the distance between the codeword of and the codeword of is less than the distance between the codeword of and the codeword of any other output. To make Eq.(1) more robust, we add a loss function as the margin. The loss function is defined as , where is the norm. After that, we use the Euclidean metric to measure the distances between instances and and then learn a new distance metric, which improves the performance of kNN and also captures label dependency.
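The large-margin constraint described above can be sketched in a few lines of Python. This is only an illustrative reading of the prose, under stated assumptions: `P` and `V` are hypothetical names for the input and output projection matrices, distances are squared Euclidean distances in the embedding space, and the margin is taken to be the l1 distance between the two label vectors. None of these choices is confirmed by the surviving text, so the sketch should not be read as the exact form of Eq.(1).

```python
import numpy as np


def margin_violation(P, V, x, y_true, y_other):
    """Hinge-style violation of: d(Px, V y_true) + margin <= d(Px, V y_other).

    P: (d, p) input projection, V: (d, q) output projection (hypothetical names).
    Returns 0.0 when the constraint holds, a positive value otherwise.
    """
    z = P @ x                                   # embedded instance
    d_true = np.sum((z - V @ y_true) ** 2)      # distance to the correct label code
    d_other = np.sum((z - V @ y_other) ** 2)    # distance to a competing label code
    margin = np.sum(np.abs(y_true - y_other))   # assumed l1 label distance
    return max(0.0, margin + d_true - d_other)


# Toy check with identity projections (p = q = d = 2).
P = np.eye(2)
V = np.eye(2)
y_true = np.array([1.0, 0.0])
y_other = np.array([0.0, 1.0])

ok = margin_violation(P, V, np.array([1.0, 0.0]), y_true, y_other)
bad = margin_violation(P, V, np.array([0.0, 1.0]), y_true, y_other)
```

In the first call the instance already sits on its own label code, so the constraint holds with no slack to spare; in the second it sits on the competing code and the violation is positive.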
To retain the information learned on the round , we apply the above large margin formulation to the online setting. Thus, we have to define the initialization of the projection matrix and the update rule. We initialize the projection matrix to a non-zero matrix and set the new projection matrix to be the solution of the following constrained optimization problem on round .
The loss function is defined as follows:
where the matrix is learned through the following formulation:
Define the loss function on round as
When the loss function is zero on round , . In contrast, on those rounds where the loss function is positive, the algorithm enforces to satisfy the constraint regardless of the step-size required. This update rule requires to correctly classify the current example with a sufficiently high margin, and to stay as close as possible to so as to retain the information learned on the previous round.
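The behaviour described above (no change when the loss is zero, otherwise the smallest change that satisfies the constraint) follows the same pattern as the classical passive-aggressive update of Crammer et al. (2006), which the paper cites. The Python sketch below shows that classical update for a binary linear classifier with hinge loss. It is an analogy only, not the matrix update derived in this paper; the function name and toy vectors are assumptions.

```python
import numpy as np


def pa_update(w, x, y):
    """One passive-aggressive step for a binary linear classifier (y in {-1, +1})."""
    loss = max(0.0, 1.0 - y * float(w @ x))  # hinge loss on the current example
    if loss == 0.0:
        return w, loss                       # passive: constraint already satisfied
    tau = loss / float(x @ x)                # smallest step that removes the loss
    return w + tau * y * x, loss             # aggressive: move just far enough


# Aggressive round: zero weights give loss 1, and the update removes it exactly.
x = np.array([3.0, 4.0])
w1, loss1 = pa_update(np.zeros(2), x, 1)
margin_after = float(w1 @ x)

# Passive round: the margin is already 2, so the weights are returned unchanged.
w2, loss2 = pa_update(np.array([1.0, 0.0]), np.array([2.0, 0.0]), 1)
```

The step size `tau` comes from solving the constrained problem in closed form, which is the role the Lagrange multiplier plays in the derivation that follows.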
Eq.(2) can be solved using standard tools from convex optimization. If then itself satisfies the constraint in Eq.(2) and is clearly the optimal solution. Therefore, we concentrate on the case where . First, we define the Lagrangian of the optimization problem in Eq.(2) to be,
where is a Lagrange multiplier.
Setting the partial derivatives of with respect to the elements of to zero gives
From this equation, we obtain
stands for an identity matrix.
Inspired by , we use an approximate form of to simplify the following calculation.
If is a non-monotonic function when , let be the maximum point of . We obtain,
Algorithm 1 provides the details of the optimization. We denote the loss suffered by our algorithm on round by .
We focus on the situation when . The optimal solution comes from the one satisfying , . Based on the derivation, can be updated by , where .
Inspired by metric learning , we use the learned metric to select nearest neighbours from for each testing instance, and make predictions based on these nearest neighbours. The distance between codeword and in the embedding space can be computed as .
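The prediction step described above can be sketched as follows: project the stored instances and the test instance with the learned projection, find the nearest neighbours in the embedding space, and combine their label vectors. This is a minimal sketch under stated assumptions. The name `P` for the learned input projection, the squared Euclidean distance, and the averaging-plus-threshold voting rule are illustrative choices that are not specified in the surviving text.

```python
import numpy as np


def predict_knn_embedded(P, X_train, Y_train, x_test, k=2, threshold=0.5):
    """Predict a 0/1 label vector from the k nearest neighbours after projection."""
    Z_train = X_train @ P.T                      # embed stored instances, shape (n, d)
    z = P @ x_test                               # embed the test instance, shape (d,)
    dists = np.sum((Z_train - z) ** 2, axis=1)   # squared distances in the embedding
    nn_idx = np.argsort(dists)[:k]               # indices of the k nearest neighbours
    votes = Y_train[nn_idx].mean(axis=0)         # assumed rule: average neighbour labels
    return (votes >= threshold).astype(int)


# Toy data (assumed): the second feature is pure noise for the labels.
X_train = np.array([[0.0, 100.0], [0.1, -50.0], [5.0, 0.0], [5.1, 0.0]])
Y_train = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
x_test = np.array([0.05, 0.0])

P_learned = np.array([[1.0, 0.0]])   # hypothetical projection that drops the noisy feature
pred_embedded = predict_knn_embedded(P_learned, X_train, Y_train, x_test)
pred_raw = predict_knn_embedded(np.eye(2), X_train, Y_train, x_test)
```

With the identity projection the noisy feature dominates the distances and the neighbours come from the wrong group; with the hypothetical one-dimensional projection the neighbours share the test instance's informative feature, which illustrates why the choice of metric matters for kNN.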
|Method||Training Time||Testing Time|
III-D Computational Complexity Analysis
The training time of OML is dominated by finding the nearest neighbour of each training instance and computing the loss in Eq.(4). It takes time to search for the nearest neighbour from the training dataset while computing the loss with two projections embedded takes time. Thus, the time complexity is .
We analyze the testing time for each testing instance. The testing time of kNN involves searching for the nearest neighbours of a testing instance in the training dataset, which takes time. Our proposed OML performs prediction in a similar way, but additionally projects all instances into the embedding space of dimensions before searching for the nearest neighbours; therefore, the testing time complexity of OML is .
Moreover, the training time complexity of each iteration and the testing time complexity of each testing instance for the other methods are listed in Table II for comparison.
From Table II, the training time complexity of OML in each iteration is lower than that of OSML-ELM and OLANSGD with respect to the number of training instances , which is usually much larger than the number of features and the number of labels . In addition, the training time complexity of kNN is denoted by as it has no training process.
Moreover, for the testing time complexity, OML is lower than OSML-ELM and kNN with respect to the number of training instances . In addition, the reduced dimension of the new projected space is much smaller than the number of features as well as the number of labels. OLANSGD is the fastest in prediction among all methods, mainly because it performs prediction only by computing the label scores based on the learned model parameter.
IV Loss Bound
Let as defined in Eq.(7), , is a non-zero matrix. The following bound holds for any
Define . The lemma is proved by summing over all in ; this sum is bounded as follows,
By the properties of the Frobenius norm,
where is the Frobenius inner product, we obtain
Using the assumption in Lemma 2, we obtain , where . Clearly, is a symmetric matrix. We take the SVD of as , then use the minimum non-negative singular value of to replace the non-positive elements in matrix , and denote the resulting approximation of matrix as . Clearly, is a non-negative symmetric matrix. Furthermore, by the definition of the Frobenius inner product , where , we obtain
We take the SVD of as . Since matrix is a unitary matrix, . Let be the minimum singular value of ; then
where is an identity matrix. Now, we get that,
Summing both sides of the inequality over all in , and using , gives
Then, we obtain
Lemma 2 has been proved. ∎
Based on Lemma 2, we provide the following theorem.
Let be a sequence of examples where and . is a projection matrix, is in . is a non-zero matrix. Let be the upper bound of . Under the assumption of Lemma 2, the cumulative loss suffered on the sequence is bounded as follows,
By using Eq.(4), we get that
Since is defined as the norm, is bounded by . It follows that is bounded by as well. By using Lemma 2, we obtain,
Therefore, the cumulative loss is bounded by . As is bounded, it guarantees the performance of our proposed model for unseen data.
In this section, we conduct experiments to evaluate the prediction performance of the proposed OML for online multi-label classification, and compare it with several state-of-the-art methods. All experiments are conducted on a workstation with 3.20GHz Intel CPU and 16GB main memory, running the Windows 10 platform.
We conduct experiments on eight benchmark datasets: Corel5k, Enron (http://bailando.sims.berkeley.edu/enron_email.html), Medical, Emotions, Cal500, Image, scene and slashdot (http://waikato.github.io/meka/datasets). The datasets are collected from different domains: images (Corel5k, Image, scene), music (Emotions, Cal500) and text (Enron, Medical, slashdot). The statistics of these datasets can be found in Table III.
V-B Experiment Setup
Baseline Methods. We compare our OML method with several state-of-the-art online multi-label prediction methods:
OSML-ELM: OSML-ELM uses a sigmoid activation function and output weights to predict the labels. In each step, the output weight is learned from a specific equation. OSML-ELM converts the label set from bipolar to unipolar representation in order to solve multi-label classification problems.
OLANSGD: Based on Nesterov's smoothing method, OLANSGD uses accelerated nonsmooth stochastic gradient descent to solve the online multi-label classification problem. It updates the model parameters using only the gradient information calculated from a single label at each iteration. It then implements a ranking function that ranks relevant and irrelevant labels.
kNN: We adapt the k nearest neighbor (kNN) algorithm to solve online multi-label classification problems. A Euclidean metric is used to measure the distances between instances.
In our experiment, the matrix is set to 100000 and is set to 0.00001, while is set to 10. The code for each baseline is provided by the respective authors. Parameter in OLANSGD is chosen from among using five-fold cross validation. We use the default parameter for OSML-ELM.
Micro-F1: computes true positives, true negatives, false positives and false negatives over all labels, then calculates an overall F-1 score.
Macro-F1: calculates the F-1 score for each label, then takes the average of the F-1 score.
Example-F1: computes the F-1 score for all labels of each testing sample, then takes the average of the F-1 score.
Hamming Loss: computes the average zero-one score for all labels and instances.
The smaller the Hamming Loss value, the better the performance; moreover, the larger the value of the other three measurements, the better the performance.
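The four measurements above can be computed directly from binary label matrices. The Python sketch below is a minimal illustration under stated assumptions: labels are stored as 0/1 arrays with one row per testing instance and one column per label, and an F-1 score whose denominator is zero is set to 0. The function names and the toy matrices are hypothetical.

```python
import numpy as np


def _f1(tp, fp, fn):
    """F-1 from counts; defined as 0 where the denominator is zero (assumption)."""
    denom = 2 * tp + fp + fn
    return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)


def multilabel_scores(Y_true, Y_pred):
    """Return Micro-F1, Macro-F1, Example-F1 and Hamming Loss for 0/1 matrices."""
    Y_true = np.asarray(Y_true)
    Y_pred = np.asarray(Y_pred)
    tp = Y_true * Y_pred          # true positives, per instance and label
    fp = (1 - Y_true) * Y_pred    # false positives
    fn = Y_true * (1 - Y_pred)    # false negatives
    return {
        # pool the counts over all labels and instances, then one F-1
        "micro_f1": float(_f1(tp.sum(), fp.sum(), fn.sum())),
        # one F-1 per label (column), then average
        "macro_f1": float(_f1(tp.sum(axis=0), fp.sum(axis=0), fn.sum(axis=0)).mean()),
        # one F-1 per instance (row), then average
        "example_f1": float(_f1(tp.sum(axis=1), fp.sum(axis=1), fn.sum(axis=1)).mean()),
        # fraction of label entries that are predicted incorrectly
        "hamming_loss": float(np.mean(Y_true != Y_pred)),
    }


Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1]])
scores = multilabel_scores(Y_true, Y_pred)
```

On this toy input the three F-1 variants coincide at 2/3 and the Hamming Loss is 1/3, since two of the six label entries are wrong. On real datasets the three averages generally differ, because Macro-F1 weights rare labels as heavily as frequent ones.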
V-C Prediction Performance
OML outperforms OSML-ELM and OLANSGD on most datasets; this is because neither of the latter approaches considers the label dependency.
OML achieves better performance than kNN on all datasets except Cal500 under Hamming Loss, where the two are comparable. This result illustrates that our proposed method is able to learn an appropriate metric for online multi-label classification.
Moreover, kNN is comparable to OSML-ELM and OLANSGD on most datasets, which demonstrates the competitive performance of kNN.
Our experiments verify our theoretical studies and the motivation of this work: in short, our method is able to capture the interdependencies among labels, while also overcoming the bottleneck of kNN.
Current multi-label classification methods assume that all data are available in advance for learning. Unfortunately, this assumption hinders off-line multi-label methods from handling sequential data. OLANSGD and OSML-ELM have overcome this limitation and achieved promising results in online multi-label classification; however, these methods lack a theoretical analysis of their loss functions, and also do not consider the label dependency, which has been shown to lead to degraded performance. Accordingly, to fill the current research gap on streaming data, we propose a novel online metric learning method for multi-label classification based on the large margin principle. We first project instances and labels into the same embedding space for comparison, then learn the distance metric by enforcing the constraint that the distance between an embedded instance and its correct label must be smaller than the distance between the embedded instance and other labels. Thus, two nearby instances with different labels will be pushed further apart. Moreover, we develop an efficient online algorithm for our proposed model. Finally, we also provide the upper bound of the cumulative loss for our proposed model, which guarantees the performance of our method on unseen data. Extensive experiments corroborate our theoretical results and demonstrate the superiority of our method.
-  (2017) DiSMEC: distributed sparse machines for extreme multi-label classification. In WSDM, pp. 721–729. Cited by: §I.
-  (2015) Sparse local embeddings for extreme multi-label classification. In NIPS, pp. 730–738. Cited by: §I.
- Learning multi-label scene classification. Pattern Recognit., pp. 1757–1771. Cited by: §V-A.
-  (2004) Convex optimization. Cambridge University Press, New York, NY, USA. Cited by: §III-C.
-  (2007) Case-based multilabel ranking. In IJCAI, pp. 702–707. Cited by: §II.
-  (2012) Feature-aware label space dimension reduction for multi-label classification. In NIPS, pp. 1538–1546. Cited by: §I.
- Combining instance-based learning and logistic regression for multilabel classification. Machine Learning 76 (2-3), pp. 211–225. Cited by: §I.
-  (2006) Online passive-aggressive algorithms. Journal of Machine Learning Research 7, pp. 551–585. Cited by: §IV.
-  (2010) Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pp. 279–286. Cited by: §I.
-  (2010) What does classifying more than 10, 000 image categories tell us?. In ECCV, pp. 71–84. Cited by: §I, §III-D.
-  (2015) Extreme learning machine: algorithm, theory and applications. Artif. Intell. Rev. 44 (1), pp. 103–115. Cited by: §II.
-  (2003) Concept learning and transplantation for dynamic image databases. In ICME, pp. 765–768. Cited by: §I.
- Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In Computer Vision - ECCV, Vol. 2353, pp. 97–112. Cited by: §V-A.
-  (2015) A tutorial on multilabel learning. ACM Comput. Surv. 47 (3), pp. 52:1–52:38. Cited by: §I.
-  (2009) Multi-label prediction via compressed sensing. In NIPS, pp. 772–780. Cited by: §II, §III-B.
-  (2013) Metric learning: a survey. Foundations and Trends in Machine Learning 5 (4), pp. 287–364. Cited by: §III-C.
-  (2017) An easy-to-hard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research 18 (94), pp. 1–38. Cited by: §I.
- Making decision trees feasible in ultrahigh feature and label dimensions. Journal of Machine Learning Research 18 (81), pp. 1–36. Cited by: §I.
-  (2019) Metric learning for multi-output tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 408–422. Cited by: §I.
-  (2013) Objective-guided image annotation. IEEE Transactions on Image Processing 22 (4), pp. 1585–1597. Cited by: §V-B.
-  (2013) Online multi-label learning with accelerated nonsmooth stochastic gradient descent. In ICASSP, pp. 3322–3326. Cited by: §I, §II, §III-D, 2nd item.
-  (2007) A shared task involving multi-label classification of clinical free text. In Biological, translational, and clinical language processing, pp. 97–104. Cited by: §V-A.
-  (2012-11) The matrix cookbook. Technical University of Denmark. Cited by: §III-C.
-  (2014) On-line clustering for real-time topic detection in social media streaming data. In WWW, pp. 57–63. Cited by: §I.
-  (2011) Classifier chains for multi-label classification. Machine Learning 85 (3), pp. 333–359. Cited by: §I.
-  (2008) Multi-label classification of music into emotions. In International Conference on Music Information Retrieval, pp. 325–330. Cited by: §V-A.
-  (2012) Introduction to the special issue on learning from multi-label data. Machine Learning 88 (1-2), pp. 1–4. Cited by: §I.
-  (2017) A novel online multi-label classifier for high-speed streaming data applications. Evolving Systems 8 (4), pp. 303–315. Cited by: §I, §II, §III-D, 1st item.
-  (2016) PD-sparse : A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, pp. 3069–3077. Cited by: §I.
-  (2015) Ensemble manifold regularized sparse low-rank approximation for multiview feature embedding. Pattern Recognit. 48 (10), pp. 3102–3112. Cited by: §V-B.
-  (2006) Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18 (10), pp. 1338–1351. Cited by: §II.
-  (2007) ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognit., pp. 2038–2048. Cited by: §V-A.
-  (2012) Bayesian online learning for multi-label and multi-variate performance measures. In AISTATS, pp. 956–963. Cited by: §I.
-  (2019) N-ary decomposition for multi-class classification. Machine Learning 108 (5), pp. 809–830. Cited by: §II.
-  (2019) Multi-class heterogeneous domain adaptation. Journal of Machine Learning Research 20, pp. 57:1–57:31. Cited by: §I.