I Introduction
Real-world applications often involve generating massive volumes of streaming data at unprecedented speed. Many researchers have focused on data classification to help customers or users obtain better search results; among these efforts, online multi-label classification, in which each instance can be assigned multiple labels, is useful in many applications. For example, in web-related applications, Twitter, Facebook and Instagram posts and RSS feeds are attached with multiple essential forms of categorization tags [34]. In the search industry, revenue comes from clicks on ads embedded in the result pages, and ad selection and placement can be significantly improved if ads are tagged correctly. There are many other applications, such as object detection in video surveillance [25] and image retrieval in dynamic databases [12].
In the development of multi-label classification [28, 14], one challenge that remains unsolved is that most multi-label classification algorithms are developed in an offline mode [7, 6, 1, 19, 36, 20]. These methods assume that all data are available in advance for learning. However, there are two major limitations of developing multi-label methods under this assumption: first, these methods are impractical for large-scale datasets, since they require all data to be stored in memory; second, it is non-trivial to adapt offline multi-label methods to sequential data. In practice, data is collected sequentially, and data collected earlier in this process may expire as time passes. It is therefore important to develop new multi-label classification methods that can deal with streaming data.
Several online multi-label classification studies have recently been developed to overcome the above limitations. For example, online learning with accelerated nonsmooth stochastic gradient descent (OLANSGD) [22] was proposed to solve the online multi-label classification problem. Moreover, the online sequential multi-label extreme learning machine (OSML-ELM) [29] is a single-hidden-layer feedforward neural network-based learning technique; OSML-ELM classifies examples using output weights and an activation function. Unfortunately, all of these online multi-label classification methods lack an analysis of the loss function and disregard label dependencies. Many studies [9, 26, 2, 30, 18] have shown that multi-label learning methods that do not capture label dependency usually suffer degraded prediction performance. This paper aims to fill these gaps.
Nearest Neighbour (NN) algorithms have achieved superior performance in various applications [10]. Moreover, experiments show that distance metric learning for single-label prediction can improve the prediction performance of NN. Nevertheless, two problems arise when applying an NN algorithm in an online multi-label setting. First, naive NN algorithms do not consider label dependencies. Second, it is non-trivial to learn an appropriate metric for online multi-label classification.
To break this bottleneck of NN, we propose a novel metric learning paradigm for multi-label classification. More specifically, we project instances and labels into the same embedding space for comparison, after which we learn the distance metric by enforcing the constraint that the distance between an embedded instance and its correct label must be smaller than the distance between the embedded instance and any other label. Thus, two nearby instances with different labels will be pushed further apart. Moreover, we propose an efficient optimization algorithm for the online multi-label scenario. On the theoretical side, we analyze the upper bound of the cumulative loss of our proposed model. A wide range of experiments on benchmark datasets corroborates our theoretical results and verifies the improved accuracy of our method relative to state-of-the-art approaches.
The remainder of this paper is organized as follows. We first describe related work, then present our online metric learning method for multi-label classification and its optimization algorithm. Next, we derive the upper bound of the cumulative loss. Finally, we present the experimental results and conclude the paper.
II Related Work
Existing multi-label classification methods can be grouped into two major categories: algorithm adaptation (AA) and problem transformation (PT). AA methods extend specific learning algorithms to deal with multi-label classification problems; typical AA methods include [32, 5, 35]. PT methods, such as that developed by [16], transform the learning task into one or more single-label classification problems. However, all of these methods assume that all data are available for learning in advance. They thus incur prohibitive computational costs on large-scale datasets, and it is also non-trivial to apply them to sequential data.
State-of-the-art approaches to online multi-label classification have been developed to handle sequential data. These approaches can be divided into two key categories: neural network approaches and label ranking approaches. Neural network approaches are based on a collection of connected units or nodes, referred to as artificial neurons; each connection can transmit a signal from one neuron to another, and the receiving neuron processes the signal before transmitting it onward. Label ranking, another popular approach to multi-label learning, learns a set of ranking functions that order all the labels such that relevant labels are ranked higher than irrelevant ones.
From the neural network perspective, Ding et al. [11] developed a single-hidden-layer feedforward neural network-based learning technique named the extreme learning machine (ELM). In this method, the input weights and the hidden-layer biases are selected at random, and the network is trained only in its output weights to perform classification. Moreover, Venkatesan et al. [29] developed the OSML-ELM approach, which uses ELM to handle streaming data. OSML-ELM uses a sigmoid activation function and output weights to predict the labels; in each step, the output weights are learned from a specific equation. OSML-ELM converts the label set from bipolar to unipolar representation in order to solve multi-label classification problems.
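As an illustration of the ELM scheme just described (random hidden layer, least-squares output weights), the following is a minimal sketch; it is not the authors' implementation, and the hidden-layer size, sigmoid activation and 0.5 decision threshold are assumptions made for the example.

```python
import numpy as np

def elm_train(X, Y, n_hidden=32, seed=0):
    """Basic ELM: the hidden layer is random and fixed; only the
    output weights are fitted, by solving a least-squares problem."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W_in = rng.standard_normal((d, n_hidden))  # random input weights (never updated)
    b = rng.standard_normal(n_hidden)          # random hidden-layer biases
    H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))  # sigmoid hidden activations
    # Output weights solve H @ W_out ~ Y in the least-squares sense.
    W_out, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return W_in, b, W_out

def elm_predict(X, W_in, b, W_out, threshold=0.5):
    """Score each label and threshold for a multi-label prediction."""
    H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))
    return (H @ W_out > threshold).astype(int)
```

A usage note: because only the linear output layer is fitted, training reduces to one least-squares solve, which is what makes ELM-style methods fast enough for streaming extensions such as OSML-ELM.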
Other existing approaches are based on label ranking, such as OLANSGD [22]. In the majority of cases, ranking functions are learned by minimizing the ranking loss in the max-margin framework; however, the memory and computational costs of this process are prohibitive on large-scale datasets. Stochastic gradient descent (SGD) approaches update the model parameters using only the gradient information calculated from a single label at each iteration. OLANSGD minimizes the primal form using Nesterov's smoothing method, which has recently been extended to the stochastic setting.
However, none of these methods analyzes its loss function, and all of them fail to capture the interdependencies among labels; both issues have been shown to result in degraded prediction performance. Accordingly, this paper aims to address them.
Notation  Definition

t — the round of the algorithm
x_t — an instance presented on round t
y_t — the label vector corresponding to x_t
x'_t — the nearest-neighbour instance to x_t
y'_t — the output corresponding to x'_t
X — the initialized input matrix
Y — the corresponding output matrix
n — the number of instances
d — the number of features
m — the number of labels
k — the dimension of the new projection space
W_t — the projection matrix on round t
the lower bound and upper bound of the loss function
⟨A, B⟩_F — the Frobenius inner product of A and B
‖·‖_1 — the ℓ1 norm
‖·‖_2 — the ℓ2 norm
‖·‖_F — the Frobenius norm
III Our Proposed Method
III-A Notations
We denote the instance presented to the algorithm on round t by x_t and its label vector by y_t, and refer to each instance-label pair (x_t, y_t) as an example. Suppose that we initially have a set of examples in memory; x'_t denotes a nearest neighbour of x_t among them, with corresponding output y'_t. The initialized instance matrix is denoted by X and the corresponding output matrix by Y. k is a positive integer, and ‖·‖_F is the Frobenius norm. W is the projection matrix that maps each output vector y (of dimension m) to a codeword W y (of dimension k). Let V likewise be a projection matrix for inputs: each input vector x (of dimension d) is projected to V x (of dimension k). Then V x and W y can be compared in the k-dimensional projection space. The notations are summarized in Table I.
III-B Online Metric Learning
Inspired by Hsu et al. [15], who stated that each label vector can be projected into a lower-dimensional label space (a process deemed encoding), we propose the following large-margin metric learning approach with nearest-neighbour constraints to learn the projection. If the encoding scheme works well, the distance between the codeword of x_t, V x_t, and the codeword of y_t, W y_t, should tend to 0 and be smaller than the distance between V x_t and the codeword of any other output. The following large-margin formulation is then used to learn the projection matrix W:
\|V x_i - W y_i\|_2^2 \le \|V x_i - W y_s\|_2^2, \quad \forall\, y_s \ne y_i.   (1)
The constraints in Eq.(1) guarantee that the distance between the codeword of x_i and the codeword of y_i is less than the distance between the codeword of x_i and the codeword of any other output. To make Eq.(1) more robust, we add a loss function as the margin; the loss function is defined as \ell(y_i, y_s) = \|y_i - y_s\|_1, where \|\cdot\|_1 is the \ell_1 norm. In this way, we use the Euclidean metric to measure distances in the embedding space and learn a new distance metric, which improves the performance of NN and also captures label dependency.
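This large-margin check can be sketched as follows; since the extracted equations are incomplete, the exact forms of the projections V and W and the ℓ1 margin are assumptions made for the illustration.

```python
import numpy as np

def margin_loss(V, W, x, y_true, y_other):
    """Hinge-style large-margin loss: the embedded instance V @ x should be
    closer to its correct label codeword W @ y_true than to another label
    codeword W @ y_other, by a margin taken here as the l1 distance between
    the two label vectors (an assumed form)."""
    z = V @ x
    d_true = np.sum((z - W @ y_true) ** 2)    # distance to correct codeword
    d_other = np.sum((z - W @ y_other) ** 2)  # distance to competing codeword
    margin = np.abs(y_true - y_other).sum()   # assumed l1 margin
    return max(0.0, margin + d_true - d_other)
```

The loss is zero exactly when the constraint is satisfied with the required margin, which is the property the online update in the next subsection exploits.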
To retain the information learned on round t, we adapt the above large-margin formulation to the online setting. We thus have to define the initialization of the projection matrix and the update rule. We initialize the projection matrix W_1 to a nonzero matrix and set the new projection matrix W_{t+1} to be the solution of the following constrained optimization problem on round t:

W_{t+1} = \arg\min_{W} \tfrac{1}{2}\|W - W_t\|_F^2 \quad \text{s.t.} \quad \ell_t(W) = 0.   (2)
The margin loss function is defined as follows:

\ell(y_t, y'_t) = \|y_t - y'_t\|_1,   (3)

where y'_t is the output corresponding to the nearest neighbour x'_t of x_t. The loss function on round t is then defined as

\ell_t(W) = \left[\, \ell(y_t, y'_t) + \|V x_t - W y_t\|_2^2 - \|V x_t - W y'_t\|_2^2 \,\right]_+ .   (4)
When the loss is zero on round t, W_{t+1} = W_t. In contrast, on rounds where the loss is positive, the algorithm forces W_{t+1} to satisfy the constraint regardless of the step size required. This update rule requires W_{t+1} to correctly classify the current example with a sufficiently high margin, while W_{t+1} must also stay as close as possible to W_t in order to retain the information learned on previous rounds.
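The round-by-round behaviour can be sketched as a passive-aggressive-style loop. Note that the gradient step below is a simplification for illustration only: the paper derives a closed-form multiplier in Section III-C, and the learning rate lr is an assumption of this sketch, not part of the original method.

```python
import numpy as np

def online_update(W, V, x, y_true, y_nn, lr=0.1):
    """One online round: if the margin constraint is violated, move W
    toward satisfying it; otherwise leave W unchanged (the passive case)."""
    z = V @ x
    margin = np.abs(y_true - y_nn).sum()
    loss = margin + np.sum((z - W @ y_true) ** 2) - np.sum((z - W @ y_nn) ** 2)
    if loss <= 0:
        return W  # constraint satisfied: W_{t+1} = W_t
    # Gradient of the violated hinge term with respect to W (simplified step
    # standing in for the closed-form multiplier of Section III-C).
    grad = -2 * np.outer(z - W @ y_true, y_true) + 2 * np.outer(z - W @ y_nn, y_nn)
    return W - lr * grad
```

On a violated round the update strictly decreases the hinge term, while the passive branch keeps the previous matrix, mirroring the "stay close to W_t" behaviour described above.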
III-C Optimization
The optimization problem in Eq.(2) can be solved using standard tools from convex optimization [4]. If \ell_t = 0, then W_t itself satisfies the constraint in Eq.(2) and is clearly the optimal solution. We therefore concentrate on the case where \ell_t > 0. First, we define the Lagrangian of the optimization problem in Eq.(2) to be
(5) 
where \tau \ge 0 is a Lagrange multiplier.
Setting the partial derivatives of the Lagrangian with respect to the elements of W to zero and solving the resulting equation yields an expression for W, in which I stands for an identity matrix.
Inspired by [24], we use an approximate form of this expression to simplify the subsequent calculation:

(6)
Plugging the approximation formula Eq.(6) back into Eq.(5), we obtain a cubic function g(\tau) in the Lagrange multiplier \tau. If g(\tau) is non-monotonic for \tau > 0, let \tau^* be the maximum point of g(\tau). We then obtain

(7)
Algorithm 1 provides the details of the optimization. We denote the loss suffered by our algorithm on round t by \ell_t. We focus on the situation where \ell_t > 0; the optimal solution is then the one satisfying the stationarity conditions above. Based on this derivation, W can be updated accordingly.
Inspired by metric learning [17], we use the learned metric to select nearest neighbours from the stored examples for each testing instance, and make predictions based on these nearest neighbours. The distance between the codewords of a testing instance and a stored instance is computed in the embedding space.
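Prediction with the learned metric can be sketched as follows; thresholding the neighbours' average label vector is our assumption about the aggregation step, which the text does not spell out.

```python
import numpy as np

def predict_knn_embedded(V, X_train, Y_train, x_test, k=3, threshold=0.5):
    """Embed instances with the learned projection V, find the k nearest
    training instances in the embedding space, and predict labels by
    thresholding the neighbours' average label vector (assumed rule)."""
    Z_train = X_train @ V.T            # embed all stored instances (cacheable)
    z_test = V @ x_test                # embed the test instance
    dists = np.linalg.norm(Z_train - z_test, axis=1)
    nn_idx = np.argsort(dists)[:k]     # indices of the k nearest neighbours
    return (Y_train[nn_idx].mean(axis=0) >= threshold).astype(int)
```

Because distances are computed on k-dimensional embeddings rather than d-dimensional raw features, the per-query search cost drops when k is much smaller than d, matching the complexity discussion in Section III-D.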
Method    Training Time    Testing Time

OSML-ELM
OLANSGD
kNN    O(1)    O(nd)
OML    O(nd + dk + mk)    O(nk + dk)
III-D Computational Complexity Analysis
We compare the time complexity of our proposed method (OML) with that of three popular methods: OSML-ELM [29], OLANSGD [22] and kNN [10].
The training time of OML is dominated by finding the nearest neighbour of each training instance and computing the loss in Eq.(4). It takes O(nd) time to search for the nearest neighbour in the training dataset, while computing the loss with the two projections takes O(dk + mk) time. Thus, the training time complexity is O(nd + dk + mk).
We next analyze the testing time for each testing instance. The testing time of kNN involves searching for the nearest neighbours of a testing instance in the training dataset, which takes O(nd) time. Our proposed OML performs prediction in a similar way, but differs in the additional procedure of projecting instances into the k-dimensional embedding space before searching for the nearest neighbours; since the embeddings of the training instances need to be computed only once, the testing time complexity of OML is O(nk + dk).
Moreover, the training time complexity per iteration and the testing time complexity per testing instance of the other methods are listed in Table II for comparison.
From Table II, we can conclude that the training time complexity of OML in each iteration is lower than that of OSML-ELM and OLANSGD with respect to the number of training instances n, which is usually much larger than the number of features d and the number of labels m. The training time complexity of kNN is denoted by O(1), as it has no training process.
For the testing time complexity, OML is lower than OSML-ELM and kNN with respect to the number of training instances n, because the reduced dimension k of the new projection space is much smaller than both the number of features and the number of labels. OLANSGD is the fastest of all methods in prediction, mainly because it performs prediction only by computing label scores based on the learned model parameters.
IV Loss Bound
Following the analysis in [8], we state the upper bound of the cumulative loss for our online metric learning algorithm. Let W^* be an arbitrary matrix; we use the approximate form given in Eq.(6) throughout the analysis.
Lemma 1.
Let \tau_t be as defined in Eq.(7), and let W_1 be a nonzero matrix. The following bound holds for any matrix W^*:
Proof.
Define \Delta_t = \|W_t - W^*\|_F^2 - \|W_{t+1} - W^*\|_F^2. The lemma is proved by summing \Delta_t over all t in 1, \dots, T and noting that the sum telescopes, so it is bounded by \|W_1 - W^*\|_F^2.
∎
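Written out, the telescoping sum in the proof of Lemma 1 takes the following form (a reconstruction following the standard passive-aggressive analysis of [8], with W^* an arbitrary comparison matrix):

```latex
\sum_{t=1}^{T} \Delta_t
  = \sum_{t=1}^{T} \left( \|W_t - W^{\ast}\|_F^2 - \|W_{t+1} - W^{\ast}\|_F^2 \right)
  = \|W_1 - W^{\ast}\|_F^2 - \|W_{T+1} - W^{\ast}\|_F^2
  \le \|W_1 - W^{\ast}\|_F^2 .
```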
Lemma 2.
Proof.
Using properties of the Frobenius norm and the Frobenius inner product, we expand \Delta_t. Using the assumption in Lemma 2, we obtain a matrix that is clearly symmetric. We take its SVD, replace any non-positive elements of the diagonal factor with the minimum non-negative singular value, and denote the resulting approximation of the matrix accordingly; this approximation is a non-negative symmetric matrix. Furthermore, using the definition of the Frobenius inner product, taking the SVD whose left and right factors are unitary matrices, and letting \sigma_{\min} be the minimum singular value, we obtain

(8)
where I is an identity matrix. Summing both sides of this inequality over all t in 1, \dots, T, and applying Lemma 1, gives the stated bound. Lemma 2 has thus been proved. ∎
Based on Lemma 2, we provide the following theorem.
Theorem 1.
Let (x_1, y_1), \dots, (x_T, y_T) be a sequence of examples. W is the projection matrix, and W_1 is a nonzero matrix. Let C be the upper bound of the loss \ell(y_t, y'_t). Under the assumption of Lemma 2, the cumulative loss suffered on the sequence is bounded as follows:
Proof.
Since \ell(y_t, y'_t) is defined via the \ell_1 norm, it is bounded by C, and hence the per-round loss is bounded as well. Applying Lemma 2, we obtain the stated bound.
∎
Therefore, the cumulative loss is bounded. As the bound is finite, it guarantees the performance of our proposed model on unseen data.
V Experiments
In this section, we conduct experiments to evaluate the prediction performance of the proposed OML method for online multi-label classification, and compare it with several state-of-the-art methods. All experiments are conducted on a workstation with a 3.20GHz Intel CPU and 16GB main memory, running Windows 10.
V-A Datasets
We conduct experiments on eight benchmark datasets: Corel5k [13], Enron (http://bailando.sims.berkeley.edu/enron_email.html), Medical [23], Emotions [27], Cal500 [13], Image [33], scene [3] and slashdot (http://waikato.github.io/meka/datasets). The datasets are collected from different domains, such as images (Corel5k, Image, scene), music (Emotions, Cal500) and text (Enron, Medical, slashdot). The statistics of these datasets can be found in Table III.
Datasets  #Instances  #Features  #Labels  #Domain 

Corel5k  5000  499  374  images 
Emotions  593  72  6  music 
Enron  1702  1001  53  text 
Medical  978  1449  45  text 
Cal500  502  68  174  music 
Image  2000  103  14  image 
scene  2407  294  6  image 
slashdot  3782  103  14  text 
V-B Experiment Setup
Baseline Methods. We compare our OML method with several state-of-the-art online multi-label prediction methods:

OSML-ELM [29]: OSML-ELM uses a sigmoid activation function and output weights to predict the labels. In each step, the output weights are learned from a specific equation. OSML-ELM converts the label set from bipolar to unipolar representation in order to solve multi-label classification problems.

OLANSGD [22]: Based on Nesterov's smoothing method, OLANSGD uses accelerated nonsmooth stochastic gradient descent to solve the online multi-label classification problem. It updates the model parameters using only the gradient information calculated from a single label at each iteration, and learns a ranking function that ranks relevant labels above irrelevant ones.

kNN: We adapt the k-nearest-neighbour (kNN) algorithm to solve online multi-label classification problems. A Euclidean metric is used to measure the distances between instances.
In our experiments, the projection matrix is initialized as a normally distributed random matrix, and we initially keep 20% of the data for nearest-neighbour searching. The three hyperparameters of OML are set to 100000, 0.00001 and 10, respectively. The codes of the baselines are provided by their respective authors; the regularization parameter of OLANSGD is chosen using five-fold cross-validation, and we use the default parameters for OSML-ELM.

Performance Measurements. To fairly measure the performance of our method and the baseline methods, we consider the following evaluation measurements [21, 31]:

Micro-F1: pools true positives, true negatives, false positives and false negatives over all labels, then calculates an overall F1 score.

Macro-F1: calculates the F1 score for each label, then takes the average of these scores.

Example-F1: computes the F1 score over the labels of each testing sample, then takes the average of these scores.

Hamming Loss: computes the average zero-one error over all labels and instances.
The smaller the Hamming Loss, the better the performance; for the other three measurements, larger values indicate better performance.
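These four measures can be computed with scikit-learn as in the minimal sketch below on toy predictions; mapping Example-F1 to average='samples' is our reading of the definition above, not a statement about the authors' evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

# Toy ground truth and predictions: 3 instances, 3 labels.
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

micro = f1_score(Y_true, Y_pred, average='micro')      # pooled over all labels
macro = f1_score(Y_true, Y_pred, average='macro')      # mean of per-label F1
example = f1_score(Y_true, Y_pred, average='samples')  # mean of per-instance F1
hl = hamming_loss(Y_true, Y_pred)                      # fraction of wrong labels
```

Note that micro-averaging weights every label decision equally, whereas macro-averaging weights every label equally, which is why the two can diverge sharply on datasets with rare labels such as Corel5k.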
V-C Prediction Performance
Figures 1 to 8 present the four measurement results for our method and the baseline approaches on the various datasets. From these figures, we can see that:

OML outperforms OSML-ELM and OLANSGD on most datasets; this is because neither of these approaches considers label dependency.

OML achieves better performance than kNN on all datasets, except on Cal500 under Hamming Loss, where the two are comparable. This result illustrates that our proposed method is able to learn an appropriate metric for online multi-label classification.

Moreover, kNN is comparable to OSML-ELM and OLANSGD on most datasets, which demonstrates the competitive performance of nearest-neighbour methods.
Our experiments verify our theoretical results and the motivation of this work: in short, our method is able to capture the interdependencies among labels while also overcoming the bottleneck of NN.
VI Conclusion
Current multi-label classification methods assume that all data are available in advance for learning. Unfortunately, this assumption prevents offline multi-label methods from handling sequential data. OLANSGD and OSML-ELM overcome this limitation and have achieved promising results in online multi-label classification; however, these methods lack a theoretical analysis of their loss functions and do not consider label dependency, which has been proven to lead to degraded performance. Accordingly, to fill this research gap for streaming data, we propose a novel online metric learning method for multi-label classification based on the large-margin principle. We first project instances and labels into the same embedding space for comparison, then learn the distance metric by enforcing the constraint that the distance between an embedded instance and its correct label must be smaller than the distance between the embedded instance and any other label. Thus, two nearby instances with different labels will be pushed further apart. Moreover, we develop an efficient online algorithm for our proposed model. Finally, we provide the upper bound of the cumulative loss for our proposed model, which guarantees its performance on unseen data. Extensive experiments corroborate our theoretical results and demonstrate the superiority of our method.
References
 [1] (2017) DiSMEC: distributed sparse machines for extreme multilabel classification. In WSDM, pp. 721–729. Cited by: §I.
 [2] (2015) Sparse local embeddings for extreme multilabel classification. In NIPS, pp. 730–738. Cited by: §I.

 [3] (2004) Learning multi-label scene classification. Pattern Recognit., pp. 1757–1771. Cited by: §VA.
 [4] (2004) Convex optimization. Cambridge University Press, New York, NY, USA. ISBN 0521833787. Cited by: §IIIC.
 [5] (2007) Casebased multilabel ranking. In IJCAI, pp. 702–707. Cited by: §II.
 [6] (2012) Featureaware label space dimension reduction for multilabel classification. In NIPS, pp. 1538–1546. Cited by: §I.

 [7] (2009) Combining instance-based learning and logistic regression for multilabel classification. Machine Learning 76 (2-3), pp. 211–225. Cited by: §I.
 [8] (2006) Online passive-aggressive algorithms. Journal of Machine Learning Research 7, pp. 551–585. Cited by: §IV.
 [9] (2010) Bayes optimal multilabel classification via probabilistic classifier chains. In ICML, pp. 279–286. Cited by: §I.
 [10] (2010) What does classifying more than 10, 000 image categories tell us?. In ECCV, pp. 71–84. Cited by: §I, §IIID.
 [11] (2015) Extreme learning machine: algorithm, theory and applications. Artif. Intell. Rev. 44 (1), pp. 103–115. Cited by: §II.
 [12] (2003) Concept learning and transplantation for dynamic image databases. In ICME, pp. 765–768. Cited by: §I.

 [13] (2002) Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In Computer Vision - ECCV, Vol. 2353, pp. 97–112. Cited by: §VA.
 [14] (2015) A tutorial on multi-label learning. ACM Comput. Surv. 47 (3), pp. 52:1–52:38. Cited by: §I.
 [15] (2009) Multilabel prediction via compressed sensing. In NIPS, pp. 772–780. Cited by: §IIIB.
 [16] (2009) Multilabel prediction via compressed sensing. In NIPS, pp. 772–780. Cited by: §II.
 [17] (2013) Metric learning: a survey. Foundations and Trends in Machine Learning 5 (4), pp. 287–364. Cited by: §IIIC.
 [18] (2017) An easytohard learning paradigm for multiple classes and multiple labels. Journal of Machine Learning Research 18 (94), pp. 1–38. Cited by: §I.

 [19] (2017) Making decision trees feasible in ultrahigh feature and label dimensions. Journal of Machine Learning Research 18 (81), pp. 1–36. Cited by: §I.
 [20] (2019) Metric learning for multi-output tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 408–422. Cited by: §I.
 [21] (2013) Objectiveguided image annotation. IEEE Transactions on Image Processing 22 (4), pp. 1585–1597. Cited by: §VB.
 [22] (2013) Online multilabel learning with accelerated nonsmooth stochastic gradient descent. In ICASSP, pp. 3322–3326. Cited by: §I, §II, §IIID, 2nd item.
 [23] (2007) A shared task involving multilabel classification of clinical free text. In Biological, translational, and clinical language processing, pp. 97–104. Cited by: §VA.
 [24] (201211) The matrix cookbook. Technical University of Denmark. Cited by: §IIIC.
 [25] (2014) Online clustering for realtime topic detection in social media streaming data. In WWW, pp. 57–63. Cited by: §I.
 [26] (2011) Classifier chains for multilabel classification. Machine Learning 85 (3), pp. 333–359. Cited by: §I.
 [27] (2008) Multilabel classification of music into emotions. In International Conference on Music Information Retrieval, pp. 325–330. Cited by: §VA.
 [28] (2012) Introduction to the special issue on learning from multilabel data. Machine Learning 88 (12), pp. 1–4. Cited by: §I.
 [29] (2017) A novel online multilabel classifier for highspeed streaming data applications. Evolving Systems 8 (4), pp. 303–315. Cited by: §I, §II, §IIID, 1st item.
 [30] (2016) PDsparse : A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, pp. 3069–3077. Cited by: §I.
 [31] (2015) Ensemble manifold regularized sparse lowrank approximation for multiview feature embedding. Pattern Recognit. 48 (10), pp. 3102–3112. Cited by: §VB.
 [32] (2006) Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18 (10), pp. 1338–1351. Cited by: §II.
 [33] (2007) MLKNN: A lazy learning approach to multilabel learning. Pattern Recognit., pp. 2038–2048. Cited by: §VA.
 [34] (2012) Bayesian online learning for multilabel and multivariate performance measures. In AISTATS, pp. 956–963. Cited by: §I.
 [35] (2019) Nary decomposition for multiclass classification. Machine Learning 108 (5), pp. 809–830. Cited by: §II.
 [36] (2019) Multiclass heterogeneous domain adaptation. Journal of Machine Learning Research 20, pp. 57:1–57:31. Cited by: §I.