1 Introduction
Multilabel classification (MLC) problems involve learning how to predict a (small) subset of classes a given data instance belongs to from a large set of classes. Given a set of labeled training data instances with input feature vectors and label vectors, we wish to learn the relationship between the feature vectors and the label vectors in order to predict the label vector of a new data instance. MLC problems are encountered in many domains such as recommendation systems [16], bioinformatics [32] [9][26], and music [33]. In the large-scale MLC problems that we are interested in, the number of labels can be as large as but the norm of the label vectors is quite small (constant). In some modern applications, the number of classes can be in the thousands, or even millions [40, 15]. However, the label vectors are typically sparse as individual instances belong to just a few classes. Examples of such large-scale MLC problems include image and video annotation for searches [39, 9], ads recommendation and web page categorization [1, 29], tagging text and documents for categorization [34, 16], and others [15]. There are two practical challenges associated with these large-scale MLC problems: (1) how many classifiers one has to train, and (2) what the latency is when predicting the label vector of a new data instance using these classifiers. In the rest of this paper, we address these two challenges.
Related Work:
Most of the prior methods that have been proposed to solve largescale sparse MLC problems fall under four categories:
(1) One versus all (OvA) classifiers: Earlier approaches for the MLC problem involve training a binary classifier for each label independently [46]. Recent approaches such as DiSMEC [2], PDSparse [43], PPDSparse [42], ProXML [3], and Slice [15] propose different paradigms to deal with the scalability issue of this naive approach. These methods typically train linear classifiers and achieve high prediction accuracy but at the same time suffer from high training and prediction runtimes. Slice reduces the training cost per label by subsampling the negative training points and reducing the number of training instances logarithmically.
(2) Tree based classifiers: These approaches exploit the hierarchical nature of labels when there is such a hierarchy, e.g., HOMER [34]. Recent tree based methods include FastXML [29], PfastreXML [16], Probabilistic Label Trees [18], Parabel [28], SwiftXML [27], extremeText [40], CraftXML [31], and Bonsai [20]. These methods yield high prediction accuracy when labels indeed have a hierarchical structure. However, they also tend to have high training times as they typically use clustering methods for label partitioning, and need to train many linear classifiers, one for each label in leaf nodes.
(3) Deep learning based classifiers: More recently, neural network based methods such as XMLCNN [24], DeepXML [47], AttentionXML [44], and XBERT [7] have also been proposed. These methods perform as well as the tree based and OvA methods in many cases. However, they also suffer from high training and prediction costs, and the resulting model sizes can be quite large (in GBs).
(4) Embedding based classifiers: These approaches reduce the number of labels by projecting the label vectors onto a low-dimensional space. Most of these methods assume that the label matrix is low-rank, see [32, 6, 48, 8, 45]. In this case, certain error guarantees can be established using the label correlation. However, the low-rank assumption does not always hold, see [5, 41, 2]. Recent embedding methods such as SLEEC [5], XMLDS [12] and DEFRAG [17] overcome this issue by using local embeddings and negative sampling. Most of these embedding methods require expensive techniques to recover the high-dimensional label vectors, involving eigendecompositions or matrix inversions, and solving large optimization problems.
To deal with the scalability issue, a group testing based approach (MLGT) was recently proposed in [35]. This method involves creating random subsets (called groups defined by a binary group testing matrix) of classes and training independent binary classifiers to learn whether a given instance belongs to a group or not. When the label sparsity is , this method requires only groups to predict the labels and therefore, only a small number of classifiers need to be trained. Under certain assumptions, the labels of a new data instance can be predicted by simply predicting the groups it belongs to. The MLGT method has been shown to yield low Hamming loss errors. However, since the groups are formed in a random fashion, the individual classifiers might be poorly trained. That is, the random groupings might club together unrelated classes and the binary classifiers trained on such groups will be inefficient.
Our contributions:
In this work, we build on the MLGT framework and present a new MLC approach based on hierarchical partitioning and a datadependent group construction. We first present the novel grouping approach (NMFGT) that improves the accuracy of MLGT. This new method samples the group testing (GT) matrix (which defines the groups) from a lowrank Nonnegative Matrix Factorization (NMF) of the training data label matrix . Specifically, we exploit symmetric NMF [22] of the correlation matrix , which is known to capture the clustering/grouping within the data [13]. This helps us capture the label correlations in the groups formed, yielding better trained classifiers. We analyze the proposed datadependent construction and give theoretical results explaining why it performs well in MLGT. In the supplement, we discuss a GT construction that has constant weight across rows and columns, i.e., each group gets the same number of labels, and each label belongs to same number of groups. These constructions yield better classifiers and improved decoding, see Section 5 for details.
These new constructions also enable us – using recent results – to develop a novel prediction algorithm with logarithmic runtime in the number of labels . If the sparsity of the label vector desired is , then the complexity of the prediction algorithm will be . This significant improvement over existing methods will allow us to predict labels of new data instances in highthroughput and realtime settings such as recommendation systems [27]. This will address some of the limitations in traditional approaches to obtain related searches (search suggestions) [15].
We then present a hierarchical partitioning approach that exploits the label hierarchy in largescale problems to divide the large label set into smaller subsets. The associated subproblems can then be solved simultaneously (in parallel) using the MLGT approach. During prediction, the outputs of individual fast decoders are simply combined (or weighted) to obtain the top labels in log time. In numerical experiments, we first show that the new group construction (NMFGT) performs better than the previous random constructions in [35]. We then compare the performance of the proposed hierarchical method (HeNMFGT) to some of the popular stateoftheart methods on large datasets. We also show how the group testing framework can achieve learning with less labeled data for multilabel classification.
2 MLGT method
We first describe the group testing framework for MLC problems. The training data consists of instances , where are the input feature vectors and are the corresponding label vectors for each instance, and are assumed to be sparse, i.e., .
Training.
The first step in training is to construct an binary matrix , called the group testing matrix. Rows of correspond to groups, columns to labels, and is 1 if the th label index (or class) belongs to the th group. There exists an with (e.g., a disjunct matrix, see [35]) such that for any sparse binary vector , can be uniquely recovered (in polynomial time) from . Here is the Boolean OR operation (replacing the vector inner product between a row of and in ). In section 3, we describe how to construct these group testing matrices. This motivates projecting the label space into a lowerdimensional space via , and creating reduced label vectors for each where . The last step is to train binary classifiers on where , the th entry of , indicates whether the th instance belongs to the th group or not. Algorithm 1 summarizes the training algorithm.
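The label-reduction step can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name `reduce_label`, the representation of each group (a row of the group testing matrix) as a Python set of label indices, and the toy sizes are all assumptions made for this example.

```python
def reduce_label(groups, label_set):
    """Boolean-OR projection of one label vector.

    groups    : list of sets; groups[i] holds the label indices in group i
                (row i of the group testing matrix).
    label_set : set of label indices that are 1 for this instance.
    Returns a 0/1 list z with z[i] = 1 iff the instance has a label in group i.
    """
    return [1 if group & label_set else 0 for group in groups]


# Toy example (hypothetical): 4 labels, 3 groups.
groups = [{0, 1}, {1, 2}, {2, 3}]
train_labels = [{0}, {2}, {0, 3}]

# One reduced vector per training instance. Column i across these rows is the
# 0/1 target on which the i-th binary (group membership) classifier is trained.
reduced = [reduce_label(groups, y) for y in train_labels]
print(reduced)  # [[1, 0, 0], [0, 1, 1], [1, 0, 1]]
```

Only as many binary classifiers as there are groups are trained, one per column of `reduced`, rather than one per label.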
Prediction.
For a new instance , we first use the classifiers to predict a reduced label vector . We then apply the following simple linear decoding technique : For all ,
Here, denotes the support of the vector . When is disjunct [35] and for some sparse vector , the above algorithm recovers . Unlike other embedding methods, this decoding technique does not require expensive matrix operations such as decompositions or inversion, and is linear in the number of labels using sparse matrixvector products.
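One common form of such a support-based decoder in the group testing literature is the cover decoder, which predicts a label exactly when every group containing it is predicted positive. The sketch below assumes that rule; the names `cover_decode` and `groups` and the toy matrix are hypothetical, and the loop scans all groups per label for clarity, whereas a sparse representation would make it linear in the number of labels.

```python
from itertools import combinations


def cover_decode(groups, z_hat, num_labels):
    """Predict label l iff every group that contains l is predicted positive."""
    positive = {i for i, bit in enumerate(z_hat) if bit}
    predicted = []
    for label in range(num_labels):
        member_of = {i for i, group in enumerate(groups) if label in group}
        if member_of and member_of <= positive:
            predicted.append(label)
    return predicted


# Toy matrix (hypothetical): 6 labels, 4 groups. Label l sits in a distinct
# pair of groups, so no label's set of groups is contained in another's.
pairs = list(combinations(range(4), 2))
groups = [{l for l, pair in enumerate(pairs) if i in pair} for i in range(4)]

true_labels = {2}
z = [1 if group & true_labels else 0 for group in groups]  # exact classifiers
decoded = cover_decode(groups, z, num_labels=6)
print(z, decoded)  # [1, 0, 0, 1] [2]
```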
We will next present a new construction of together with a decoding algorithm that is logarithmic in and can be used in the last step of Algorithm 2 in place of the linear decoder described above.
3 Data dependent construction and decoding
In [35], the authors construct the group testing matrix using a uniform random construction that does not use any information about the training data. Even if two distinct classes (or label indices) are indistinguishable with respect to data instances, the columns of for these classes are different. We present a novel datadependent construction for such that ”similar” classes are represented by similar columns of and show that this construction leads to much better prediction quality. We also present a fast decoding technique. Consider the following metric:
(1) 
is the label correlation matrix, also called the label cooccurence matrix [20]. The entry of is the number of training instances shared by the th and th classes. The entries of give the number of groups shared by a pair of classes. Given a training label matrix , we construct so as to minimize , and have the groups membership structure for two similar classes be similar. See the supplement for relevant experiments. A completely random (disjunct) matrix is unlikely to yield low , since random grouping will not capture the correlation between labels. However, for proper decoding, the GT matrix needs to be sparse and columns need to have low coherence. We construct to account for both issues as follows.
Given and – the number of groups – we compute a rank symmetric Nonnegative Matrix Factorization (symNMF) of as , where is called the basis matrix [22]. It has been shown that symNMF is closely related to clustering, see [22, 13]. Given , the basis matrix defines the clustering within the labels. Therefore, we use the columns of to sample .
For a column of , let be the normalized column such that its entries add to 1. Let be the column weights desired for . For each column , we form , and then reweight these vectors in order to avoid entries greater than 1. We find all , set these entries to 1 and distribute the excess sum to the remaining entries. This is needed because many entries of will be zero. The columns of are then sampled using the reweighted s as the sampling probability vectors. Then each column will have ones per column on average. We do this instead of sampling the th column of as a random binary vector – with the probability of the th entry being 1 equal to – as in the disjunct construction used in [35]. In the supplement, we describe other constant weight constructions, where each group has the same number of labels, and each label belongs to the same number of groups. Such constructions have been shown to perform well in the group testing problem [36, 38].
Remark 1 (Choosing ).
In these constructions, we choose the parameter (the column sparsity, i.e., the number of ones per column) using a simple procedure. For a range of s we form the matrix , reduce and recover (a random subset of) training label vectors, and choose the which yields the smallest Hamming loss error.
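The column-sampling step described before Remark 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the names `reweight` and `sample_column` are hypothetical, the input vector is assumed to have a positive sum, and spreading the excess over the remaining nonzero entries in proportion to their size is one possible reading of "distribute the excess sum", not necessarily the rule used in the experiments.

```python
import random


def reweight(h, c):
    """Turn a nonnegative vector h into sampling probabilities that sum to c.

    Entries of c * h / sum(h) that exceed 1 are capped at 1, and the excess is
    spread over the remaining nonzero entries in proportion to their size
    (the proportional rule is an assumption of this sketch).
    """
    total = float(sum(h))
    p = [c * v / total for v in h]
    for _ in range(len(p)):
        over = [i for i, v in enumerate(p) if v > 1.0]
        if not over:
            break
        excess = sum(p[i] - 1.0 for i in over)
        for i in over:
            p[i] = 1.0
        free = [i for i, v in enumerate(p) if 0.0 < v < 1.0]
        if not free:
            break  # c exceeds the number of nonzero entries; excess cannot be absorbed
        free_sum = sum(p[i] for i in free)
        for i in free:
            p[i] += excess * p[i] / free_sum
    return [min(v, 1.0) for v in p]


def sample_column(p, rng):
    """Draw one binary column: entry i is 1 with probability p[i]."""
    return [1 if rng.random() < v else 0 for v in p]


# Toy example: one basis column over 5 groups, target column weight c = 2.
p = reweight([0.7, 0.2, 0.1, 0.0, 0.0], c=2)
column = sample_column(p, random.Random(0))
```

Because the probabilities sum to `c`, the expected number of ones in a sampled column is `c`, and groups with zero weight for a label are never selected for it.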
In MLGT, for our data-dependent GT matrix, we can use the linear decoder described in section 2. However, since the sampled matrix has constant weight columns, we can consider it as an adjacency matrix of a left regular graph. Therefore, we can use the recently proposed SAFFRON construction [23] and its fast decoding algorithm.
Fast decoding algorithm via SAFFRON:
Recently, in [23], a bipartite graph based GT construction called SAFFRON (Sparse-grAph codes Framework For gROup testiNg) was proposed, see the supplement for details. Since our NMF based construction ensures constant weight columns, the resulting matrix can be viewed as an adjacency matrix of a left regular graph. This helps us adapt the fast decoding algorithm developed for the SAFFRON construction for label prediction in our method.
We next briefly describe the decoding algorithm (an adaptation of the fast decoder presented in [4] for sparse vector recovery). It has two steps, namely a bin decoder and a peeling decoder. The right nodes of the bipartite graph are called bins and the left nodes are called the variables.
Given the output reduced vector in the first step of prediction, the bin decoder is applied on to bins ’s (these are partitions of as per the construction, see supplement), and all the variable nodes connected to singletons (connected to nonzero nodes) are decoded and put to a set say . Next, in an iterative manner, a node from is considered at each iteration, and the bin decoder is applied to the bins connected to this variable node. If one of these node is a resolvable doubleton (connected to two nonzeros, but one already decoded), we can get a new nonzero variable (). These decoded variables are moved from to a new set of peeled off nodes , and the newly decoded nonzero variable node, if any, is put in . The decoder will terminate when is empty, and if the set has items, we have succeeded. For complete details, see [23]. The computational complexity of the decoding scheme is , see [38]. Therefore, for any leftregular graph with the SAFFRON construction and , the decoder recovers items in time. We can use this fast decoder in the last step of Algorithm 2 to predict the sparse label for a given instance .
Analysis:
We next present an analysis that shows why the proposed datadependent construction will perform well in MLGT. Let be the reweighted matrix derived from the label data . is the potential matrix that is used to sample the binary group testing matrix . By construction, we know that the sum of entries in a column of is , a constant.
Suppose in the prediction phase, the correct label vector is . We know that there are at most ones in , i.e., . Then, by using the binary classifiers we obtain the reduced label vector , which if the classifiers are exact, will be . To perform the decoding for then, in effect we compute and set the top coordinates to , the rest to . The next result shows the effectiveness of this method.
Theorem 1 (Sampling using ).
For any , , whereas, for any , , where is the th row of .
The proof of this theorem is presented in the supplement. This result explains why our construction is a good idea. Indeed, since we generate in a datadependent manner, any given label will likely have high correlations with the rows of . As a result, the value of when is in the support of is much higher compared to the value of when is not in the support, with high probability. Therefore, choosing the top coordinates of indeed will produce .
4 Hierarchical approach for extreme classification
In very largescale MLC problems (called extreme multilabel or XML problems), the labels typically have certain (unknown) hierarchy. By discovering and using this label hierarchy, one can design efficient classifiers for XML problems that have low computational cost. A limitation of our datadependent approach is that we perform symNMF of the correlation matrix . As the symNMF problem is NPhard, and also difficult to solve for matrices with more than a few thousand columns, getting good quality classifiers for XML problems is not guaranteed. Moreover, these large matrices are unlikely to be low rank [5]. Therefore, we propose a simple hierarchical labelpartitioning approach to divide the set of label classes into smaller sets, and then apply our NMFGT method to each smaller set independently.
Matrix reordering techniques on sparse matrices are popularly used for graph partitioning [19] and solving sparse linear systems [30]. Here, a large sparse matrix (usually the adjacency matrix of a large graph) is reordered such that the matrix/graph can be partitioned into smaller submatrices that can be handled independently. Since the label matrix is highly sparse in XML problems and the labels have a hierarchy, the nonzero entries in can be viewed as defining an adjacency matrix of a sparse graph. Let denote a graph, where each node corresponds to a label, and if and only if . In other words, an edge between nodes/labels and is present if and only if labels and occur together in at least one data point, which indicates “interaction” between these labels.
Suppose that has say components, i.e., it can be partitioned into disjoint subgraphs, as assumed in Bonsai [20]. Then each component corresponds to a subset of labels that interact with one another but not with labels in other components. Permuting the labels so that labels in a component are adjacent to one another, and applying the same permutation to the columns of , one can obtain a blockdiagonal reordering of the label matrix . Now the symNMF problem for can be reduced to a number of smaller symNMF problems, one for each block of the matrix. Most large datasets (label matrices) with hierarchy will have many smaller noninteracting subsets of labels and few subsets that interact with many other labels. A natural approach is to use the vertex separator partitioning based reordering [11] or nested dissection [19] to obtain this permutation.
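The component-based reordering can be sketched with a small union-find routine. This is only an illustration of the idea of grouping labels into non-interacting blocks; the function name `label_components` and the representation of each training instance as a set of label indices are assumptions of this sketch, and it does not implement the vertex separator or nested dissection algorithms of [11, 19].

```python
def label_components(instances, num_labels):
    """Connected components of the label co-occurrence graph via union-find."""
    parent = list(range(num_labels))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    # Labels that occur together in an instance belong to the same component;
    # linking each of them to the first one is enough to connect them all.
    for labels in instances:
        labels = sorted(labels)
        for other in labels[1:]:
            ra, rb = find(labels[0]), find(other)
            if ra != rb:
                parent[rb] = ra

    components = {}
    for label in range(num_labels):
        components.setdefault(find(label), []).append(label)
    return sorted(components.values())


# Toy data (hypothetical): 6 labels, 5 training instances.
instances = [{0, 3}, {1, 4}, {3, 5}, {4}, {2}]
blocks = label_components(instances, num_labels=6)

# Listing labels block by block gives a permutation under which the
# co-occurrence matrix is block diagonal.
permutation = [label for block in blocks for label in block]
print(blocks, permutation)  # [[0, 3, 5], [1, 4], [2]] [0, 3, 5, 1, 4, 2]
```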
The idea is to find a small vertex separator of (here ) such that has a number of disjoint components . The labels can then be viewed as belonging to one of the subsets , and we can apply NMFGT to each separately. This idea can be further extended to a hierarchical partitioning of (by finding partitions of the subgraphs as – where is a vertex separator of ). Each level of the hierarchy would be partitioned further till the components are small enough so that the MLGT (symNMF) algorithm can be efficiently applied.
In Figure 1, we display the hierarchical reordering of obtained by the algorithm in [11] for four popular XML datasets: Eurlex (with number of labels ), Wiki10 (), WikiLSHTC (), and Amazon (), respectively. We note that there are a few distinct blocks (the block diagonals), where the labels only occur together and are independent of other blocks (do not interact). We also have a small subset of labels (the outer band) that interact with most blocks . We can partition the label set into subsets of size each and apply our NMF based MLGT individually (it can be done in parallel). During prediction, the individual fast decoders will return the positive labels for each subsets in time. We can simply combine these positive labels or weight them to output top labels. Since the subset of labels interact with most other labels and occur more frequently (powerlaw distribution), we can rank them higher when picking top of the outputted positive labels.
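One simple way to combine the per-block outputs is to count how many block decoders return each label and keep the most frequent ones. The sketch below assumes that weighting; the function name `combine_block_predictions`, the use of global label ids, and the tie-break are hypothetical choices made for illustration.

```python
from collections import Counter


def combine_block_predictions(block_outputs, k):
    """Merge positive labels returned by per-block decoders and keep the top k.

    block_outputs : list of lists of global label ids, one list per block.
    Labels returned by more blocks rank higher; ties go to the smaller label
    id (the tie-break is an arbitrary choice for this sketch).
    """
    counts = Counter(label for output in block_outputs for label in set(output))
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    return ranked[:k]


# Hypothetical outputs of three block decoders. Label 7 belongs to the shared
# (separator) set and is therefore reported by several blocks.
outputs = [[2, 7], [7, 11], [7, 15, 16]]
top = combine_block_predictions(outputs, k=3)
print(top)  # [7, 2, 11]
```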
Comparison with tree methods: The tree based methods such as HOMER [34], Parabel [28], Bonsai [20], and others use label partitioning to recursively construct label tree/s with prespecified number of labels in leaf nodes or tree depth. Most methods use k-means clustering for partitioning, which has a cost of . Then, OvA classifiers are learned for each label in leaf nodes. However, in our approach, we use label partitioning to identify label subsets on which we can apply NMFGT independently. Our matrix reordering approach is inexpensive with cost , see [11]. We use the NMFGT strategy to learn only classifiers per partition.
5 Numerical Experiments
We now present numerical results to illustrate the performance of the proposed approaches (the datadependent construction NMFGT and with hierarchical partitioning HeNMFGT) on MLC problems. Several additional results and details are presented in the supplement.
Table 1: Dataset statistics.
Dataset        Labels  Avg. sparsity per instance  Training instances  Test instances  Features
Mediamill      101     4.38                        30993               12914           120
Bibtex         159     2.40                        4880                2515            1839
RCV12K (ss)    2016    4.76                        30000               10000           29699
EurLex4K       3993    5.31                        15539               3809            5000
AmazonCat(ss)  7065    5.08                        100000              50000           57645
Wiki1031K      30938   18.64                       14146               6616            101850
WikiLSHTC      325056  3.18                        1813391             78743           85600
Amazon670K     670091  5.45                        490449              153025          135909
Datasets:
For our experiments, we consider some of the popular publicly available multilabel datasets put together in The Extreme Classification Repository [5] (http://manikvarma.org/downloads/XC/XMLRepository.html). The applications, details and the original sources of the datasets can be found in the repository. Table 1 lists the statistics.
In the table, we list the number of labels, the average sparsity per instance, and the numbers of training instances, test instances, and features. The datasets marked (ss) are subsampled versions of the original data with statistics as indicated.
Evaluation metrics:
To compare the performance of the different MLC methods, we use the most popular evaluation metric, called Precision@k (P@k) [1], with $k \in \{1, 3, 5\}$. It has been argued that this metric is more suitable for modern applications such as tagging or recommendation, where one is interested in only predicting a subset of (top $k$) labels correctly. P@k is defined as:
$$\mathrm{P@}k := \frac{1}{k} \sum_{l \in \mathrm{rank}_k(\hat{y})} y_l,$$
where $\hat{y}$ is the predicted vector and $y$ is the actual label vector. This metric assumes that the vector $\hat{y}$ is real valued and its coordinates can be ranked so that the summation above can be taken over the $k$ highest ranked entries of $\hat{y}$. For the hierarchical approach, we weight and rank the labels based on repeated occurrence (in the overlapping set).
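The standard ranking-based P@k can be computed as in the following minimal sketch. The function name `precision_at_k` and the toy scores are assumptions for illustration; the sketch covers only the ranking-based metric, not the modified definition used for binary MLGT outputs.

```python
def precision_at_k(scores, y_true, k):
    """Fraction of the k highest-scored labels that are relevant.

    scores : real-valued predicted scores, one per label.
    y_true : 0/1 ground-truth label vector of the same length.
    """
    top_k = sorted(range(len(scores)), key=lambda l: scores[l], reverse=True)[:k]
    return sum(y_true[l] for l in top_k) / k


# Toy example: 5 labels, labels 0 and 4 are relevant.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
y_true = [1, 0, 0, 0, 1]
results = {k: precision_at_k(scores, y_true, k) for k in (1, 3, 5)}
```

Here the top-ranked label is relevant, two of the top three are relevant, and two of all five are relevant.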
In general, the MLGT method returns a binary label vector of predefined sparsity, so there is no ranking among its nonzero entries. Hence, we also use a slightly modified definition:
(2) 
where is the nonzero coordinates of predicted by MLGT assuming that the predefined sparsity is set to 5. To make the comparison fair for other (ranking based) methods, we sum over the top 5 labels based on their ranking (i.e. we use instead of in the original definition).
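Table 2: Comparison of the three group testing constructions and OvA.
Dataset    Metric              NMFGT   CWGT    SPGT    OvA
Bibtex     Precision (k=1)     0.7354  0.7089  0.6939  0.6111
Bibtex     Precision (k=3)     0.3664  0.3328  0.3034  0.2842
Bibtex     Precision (k=5)     0.2231  0.2017  0.1823  0.1739
Bibtex     Correlation metric  10.610  12.390  12.983  —
Bibtex     Total time          5.13s   4.01s   3.98s   8.22s
Bibtex     Test time           0.13s   0.13s   0.13s   0.18s
Mediamill  Precision (k=1)     0.8804  0.8286  0.6358  0.8539
Mediamill  Precision (k=3)     0.6069  0.5413  0.2729  0.5315
Mediamill  Precision (k=5)     0.3693  0.3276  0.1638  0.3231
Mediamill  Correlation metric  10.377  11.003  10.876  —
Mediamill  Total time          17.2s   15.7s   15.82s  29.4s
Mediamill  Test time           0.17s   0.17s   0.17s   0.54s
RCV1x      Precision (k=1)     0.9350  0.9205  0.8498  0.9289
RCV1x      Precision (k=3)     0.6983  0.6596  0.5732  0.6682
RCV1x      Precision (k=5)     0.4502  0.4104  0.3449  0.4708
RCV1x      Correlation metric  53.916  58.459  58.671  —
RCV1x      Total time          88.4s   77.5s   74.2s   363.2s
RCV1x      Test time           1.20s   1.04s   1.10s   6.37s
Eurlex     Precision (k=1)     0.8477  0.8430  0.6792  0.8535
Eurlex     Precision (k=3)     0.5547  0.5582  0.3933  0.6132
Eurlex     Precision (k=5)     0.3444  0.3597  0.2758  0.4085
Eurlex     Correlation metric  80.023  80.732  82.257  —
Eurlex     Total time          227.3s  99.6s   90.4s   560.1s
Eurlex     Test time           0.94s   0.93s   0.93s   7.26s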

Comparing group testing constructions:
In the first set of experiments, we compare the new group testing constructions with the sparse random construction (SPGT) used in [35], where each entry of is sampled with uniform probability . Our first construction (NMFGT) is based on the symNMF as described in Section 3. Given the training label matrix , we first compute the symNMF of of rank using the Coordinate Descent algorithm by [37] (code provided by the authors) and then compute a sparse binary matrix using reweighted rows of the NMF basis. Our second construction (CWGT) is the constant weight construction defined in supplementary A.1. For both constructions, the number of nonzeros (sparsity) per column of is selected using the search method described in Remark 1, see supplement for more details.
Figure 2 plots and we obtained for the three constructions (red star is NMFGT, blue circle is CWGT, and black triangle is SPGT) as the number of groups increases. The first two plots correspond to the Bibtex dataset, and the next two correspond to the RCV1x dataset. As expected, the performance of all constructions improves as the number of groups increases. Note that NMFGT consistently outperforms the other two.
In Table 2, we compare the three constructions discussed in this paper on four datasets. We also include the One versus All (OvA) method (which is computationally very expensive) to provide a frame of reference. In the table, we list for , the correlation metric , the total time as well as the time taken to predict the labels of test instances. The NMFGT method performs better than both methods, because it groups the labels based on the correlation between them. This observation is supported by the fact that the correlation metric of NMFGT is the lowest among the three methods. Also note that even though NMFGT has longer training time compared to the other GT methods (due to the NMF computation), its prediction time is essentially the same. We also note that the runtimes of all three MLGT methods are much lower than OvA, particularly for larger datasets as they require much fewer () classifiers.
In all cases, NMFGT outperforms the other two (possibly because it groups the labels based on the correlation between them), and CWGT performs better than SPGT. Both NMFGT and CWGT ensure that classifiers are trained on similar amounts of data. Decoding will also be efficient since all columns of have the same support size. NMFGT is superior to the other two constructions, and therefore, we will use it in the following experiments for comparison with other popular XML methods.
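Table 3: Comparison of HeNMFGT and NMFGT with MLCS, SLEEC, PfastreXML, and Parabel (modified precision metric).
Dataset    Metric           HeNMFGT  NMFGT   MLCS    SLEEC*  PfastreXML  Parabel
Mediamill  Precision (k=1)  –        0.8804  0.8359  0.8538  0.9376      0.9358
Mediamill  Precision (k=3)  –        0.6069  0.6593  0.6967  0.7701      0.7622
Mediamill  Precision (k=5)  –        0.3693  0.4102  0.5562  0.5328      0.5169
Mediamill  Total time       –        17.2s   20.3s   3.5m    190.1s      74.19s
Mediamill  Test time        –        0.17s   6.93s   80.5s   18.4s       17.85s
RCV1x      Precision (k=1)  –        0.9350  0.9244  0.9034  0.9508      0.9680
RCV1x      Precision (k=3)  –        0.6983  0.6945  0.6395  0.7412      0.7510
RCV1x      Precision (k=5)  –        0.4502  0.4486  0.4457  0.4993      0.5040
RCV1x      Total time       –        88.4s   541.1s  34m     7.73m       6.7m
RCV1x      Test time        –        1.04s   176.7s  53.1s   3.03m       1.68m
Eurlex     Precision (k=1)  0.9265   0.8477  0.8034  0.7474  0.9004      0.9161
Eurlex     Precision (k=3)  0.7084   0.5547  0.5822  0.5885  0.6946      0.7397
Eurlex     Precision (k=5)  0.4807   0.3444  0.3965  0.4776  0.4939      0.5048
Eurlex     Total time       322s     227.3s  343.3s  21m     11.8m       6.1m
Eurlex     Test time        1.1s     0.94s   235.1s  45s     59.2s       74.3s
Amazon13   Precision (k=1)  0.9478   0.8629  0.7837  0.8053  0.9098      0.9221
Amazon13   Precision (k=3)  0.6555   0.5922  0.5469  0.5622  0.6722      0.6957
Amazon13   Precision (k=5)  0.4474   0.3915  0.3257  0.4152  0.5119      0.5226
Amazon13   Total time       8.7m     7.5m    19.7m   68.8m   27.5m       16.9m
Amazon13   Test time        4.42s    4.21s   13.7m   106.3s  241.6s      114.7s
Wiki10     Precision (k=1)  0.9666   0.9155  0.5223  0.8079  0.9289      0.9410
Wiki10     Precision (k=3)  0.7987   0.6353  0.2995  0.5050  0.7269      0.7880
Wiki10     Precision (k=5)  0.5614   0.4105  0.1724  0.3526  0.5061      0.5502
Wiki10     Total time       14.7m    13.6m   63m     54.9m   40.5m       33.5m
Wiki10     Test time        11.5s    9.82s   45m     51.3s   8.2m        4.2m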

Comparison with popular methods:
We next compare the NMFGT method (best one from our previous experiments) and the hierarchical method (HeNMFGT) with four popular methods, namely MLCS [14], SLEEC [5], PfastreXML [16], and Parabel [28] with respect to the modified precision metric. Table 3 summarizes the results obtained by these six methods for different datasets along with total computation time and the test prediction time . The no. of groups used in NMFGT and no. of blocks used in HeNMFGT are also given.
We note that NMFGT performs fairly well given its low computational burden. The hierarchical approach HeNMFGT yields superior accuracies with runtimes similar to those of NMFGT (outperforms other methods wrt. ). PfastreXML and Parabel yield slightly more accurate results in some cases, but require significantly longer run times. Note that the prediction time for our methods is orders of magnitude lower in some cases. For HeNMFGT, the total computation time includes computing the partition, applying MLGT for one block (since this can be done in parallel), and predicting the labels of all test instances. For the two smaller datasets, HeNMFGT was not used since they lacked well-defined partitions.
Table 4: Comparison of HeNMFGT with other popular XML methods (standard metric).
Method families: SLEEC (Embedding); PfastreXML, Parabel, XT (Tree); Dismec, PPDsparse (OvA); XMLCNN (DNN).
Dataset     Metric      HeNMFGT  SLEEC   PfastreXML  Parabel  XT      Dismec  PPDsparse  XMLCNN
Eurlex      P@1 (%)     75.04    74.74   73.63       74.54    –       83.67   83.83      76.38
Eurlex      P@3 (%)     61.08    58.88   60.31       61.72    –       70.70   70.72      62.81
Eurlex      P@5 (%)     48.07    47.76   49.39       50.48    –       59.14   59.21      51.41
Eurlex      Train time  4.8m     20m     10.8m       5.4m     –       0.94hr  0.15hr     0.28hr
Eurlex      Test time   0.28ms   4.87ms  1.82ms      0.91ms   –       7.05ms  1.14ms     0.38ms
Wiki10      P@1 (%)     82.28    80.78   82.03       83.77    85.23   85.20   73.80      82.78
Wiki10      P@3 (%)     69.68    50.50   67.43       71.96    73.18   74.60   60.90      66.34
Wiki10      P@5 (%)     56.14    35.36   52.61       55.02    63.39   65.90   50.40      56.23
Wiki10      Train time  14.2m    53m     32.3m       29.3m    18m     –       –          88m
Wiki10      Test time   0.69ms   7.7ms   74.1ms      38.1ms   1.83ms  –       –          1.39s
WikiLSHTC   P@1 (%)     55.62    54.83   56.05       64.38    58.73   64.94   64.08      –
WikiLSHTC   P@3 (%)     33.81    33.42   36.79       42.40    39.24   42.71   41.26      –
WikiLSHTC   P@5 (%)     23.04    23.85   27.09       31.14    29.26   31.5    30.12      –
WikiLSHTC   Train time  47.5m    18.3hr  7.4hr       3.62hr   9.2hr   750hr   3.9hr      –
WikiLSHTC   Test time   0.8ms    5.7ms   2.2ms       1.2ms    0.8ms   43m     37ms       –
Amazon670K  P@1 (%)     39.60    35.05   39.46       43.90    39.90   45.37   45.32      35.39
Amazon670K  P@3 (%)     36.78    31.25   35.81       39.42    35.60   40.40   40.37      33.74
Amazon670K  P@5 (%)     32.40    28.56   33.05       36.09    32.04   36.96   36.92      32.64
Amazon670K  Train time  47.8m    11.3hr  1.23hr      1.54hr   4.0hr   373hr   1.71hr     52.2hr
Amazon670K  Test time   1.45ms   18.5ms  19.3ms      2.8ms    1.7ms   429ms   429ms      16.2ms

In Table 4, we compare the performance of HeNMFGT with several other popular XML methods with respect to the standard metric. We compare the accuracies and computational costs for HeNMFGT, SLEEC (embedding method), three tree methods (PfastreXML, Parabel, ExtremeText XT), two OvA methods (Dismec, PPDsparse) and a DNN method XMLCNN (see sec. 1 for references). The precision results and the runtimes for the four additional methods were obtained from [28, 40]. In the table, a ‘–’ indicates that these results were not reported by the authors.
We note that, compared to other methods, HeNMFGT is significantly faster in both training and test times, and yet yields comparable results. The other methods have several parameters that need to be tuned. More importantly, the main routines of most other methods are written in C/C++, while HeNMFGT was implemented in Matlab and hence we believe the run times can be improved to enable truly realtime predictions. The code for our method will be made publicly available (Matlab code is provided in the supplement for review). Several additional results, implementation details and result discussions are given in the supplement.
Learning with less training data:
In supervised learning problems such as MLC, training highly accurate models requires large volumes of labeled data, and creating such volumes of labeled data can be very expensive in many applications [21, 41]. As a result, there is an increasing interest among research agencies in developing learning algorithms that achieve ‘Learning with Less Labels’ (LwLL; see darpa.mil/program/learningwithlesslabels). Since MLGT requires training only classifiers (as opposed to classifiers in OvA or other methods), we will need less labeled data for training the model. Here, we present preliminary results that demonstrate how MLGT achieves higher accuracy (higher precision) with less training data compared to the OvA method (see Table 2 in suppl). Figure 3 plots the precision (Prec@1) for test data instances for the bibtex (left) and RCV1x (right) datasets, when different fractions of training data were used to train the MLGT and OvA models. We note that MLGT achieves the same accuracy as OvA with only 15-20% of the number of training points (over less training data). We used the same binary classifiers for both methods, and MLGT requires only classifiers, as opposed to OvA, which needs classifiers. Therefore, MLGT likely requires fewer training data instances.
Conclusions
In this paper, we extended the MLGT framework [35] and presented new GT constructions (constant weight and data dependent), and a fast prediction algorithm that requires logarithmic time in the number of labels. We then presented a hierarchical partitioning approach to scale the MLGT approach to larger datasets. Our computational results show that the NMF construction yields superior performance compared to other GT matrices. We also presented a theoretical analysis which showed why the proposed data-dependent method (with a nontrivial data-dependent sampling approach) will perform well. With a comprehensive set of experiments, we showed that our method is significantly faster in both training and test times, and yet yields competitive results compared to other popular XML methods.
References
 [1] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. Multilabel learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13–24. ACM, 2013.
 [2] R. Babbar and B. Schölkopf. Dismec: Distributed sparse machines for extreme multilabel classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 721–729. ACM, 2017.
 [3] R. Babbar and B. Schölkopf. Data scarcity, robustness and extreme multilabel classification. Machine Learning, 108(8-9):1329–1351, 2019.
 [4] R. Berinde, A. Gilbert, P. Indyk, H. Karloff, and M. Strauss. Combining geometry and combinatorics: A unified approach to sparse signal recovery. CoRR, abs/0804.4666, 01 2008.
 [5] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain. Sparse local embeddings for extreme multilabel classification. In Advances in Neural Information Processing Systems, pages 730–738, 2015.
 [6] W. Bi and J. T. Y. Kwok. Efficient multilabel classification with many labels. In 30th International Conference on Machine Learning, ICML 2013, pages 405–413, 2013.
 [7] W.C. Chang, H.F. Yu, K. Zhong, Y. Yang, and I. Dhillon. A modular deep learning approach for extreme multilabel text classification. arXiv preprint arXiv:1905.02331, 2019.
 [8] Y.N. Chen and H.T. Lin. Featureaware label space dimension reduction for multilabel classification. In Advances in Neural Information Processing Systems, pages 1529–1537, 2012.
 [9] J. Deng, S. Satheesh, A. C. Berg, and F. Li. Fast and balanced: Efficient label tree learning for large scale object recognition. In J. ShaweTaylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 567–575. Curran Associates, Inc., 2011.
 [10] R. Gallager. Lowdensity paritycheck codes. IRE Transactions on Information Theory, 8(1):21–28, January 1962.
 [11] A. Gupta. Fast and effective algorithms for graph partitioning and sparsematrix ordering. IBM Journal of Research and Development, 41(1.2):171–183, 1997.
 [12] V. Gupta, R. Wadbude, N. Natarajan, H. Karnick, P. Jain, and P. Rai. Distributional semantics meets multilabel learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3747–3754, 2019.
 [13] C. H. Q. Ding and X. He. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the SIAM International Conference on Data Mining, pages 606–610, 01 2005.
 [14] D. Hsu, S. M. Kakade, J. Langford, and T. Zhang. Multilabel prediction via compressed sensing. NIPS, 22:772–780, 2009.
 [15] H. Jain, V. Balasubramanian, B. Chunduri, and M. Varma. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, pages 528–536. ACM, 2019.
 [16] H. Jain, Y. Prabhu, and M. Varma. Extreme multilabel loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–944. ACM, 2016.
 [17] A. Jalan and P. Kar. Accelerating extreme classification via adaptive feature agglomeration. In Proceedings of the TwentyEighth International Joint Conference on Artificial Intelligence, IJCAI19, pages 2600–2606. International Joint Conferences on Artificial Intelligence Organization, 7 2019.
 [18] K. Jasinska, K. Dembczynski, R. BusaFekete, K. Pfannschmidt, T. Klerx, and E. Hullermeier. Extreme fmeasure maximization using sparse probability estimates. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1435–1444, New York, New York, USA, 20–22 Jun 2016.
 [19] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.
 [20] S. Khandagale, H. Xiao, and R. Babbar. Bonsaidiverse and shallow trees for extreme multilabel classification. arXiv preprint arXiv:1904.08249, 2019.
 [21] A. Klein and J. Tourville. 101 labeled brain images and a consistent human cortical labeling protocol. Frontiers in neuroscience, 6:171, 2012.
 [22] D. Kuang, C. Ding, and H. Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the 2012 SIAM International Conference on Data Mining, pages 106–117, 2012.
 [23] K. Lee, R. Pedarsani, and K. Ramchandran. Saffron: A fast, efficient, and robust framework for group testing based on sparsegraph codes. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2873–2877, July 2016.
 [24] J. Liu, W.C. Chang, Y. Wu, and Y. Yang. Deep learning for extreme multilabel text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 115–124, New York, NY, USA, 2017.
 [25] C. McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
 [26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
 [27] Y. Prabhu, A. Kag, S. Gopinath, K. Dahiya, S. Harsola, R. Agrawal, and M. Varma. Extreme multilabel learning with label features for warmstart tagging, ranking & recommendation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, pages 441–449. ACM, 2018.
 [28] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 993–1002, Republic and Canton of Geneva, Switzerland, 2018.
 [29] Y. Prabhu and M. Varma. Fastxml: A fast, accurate and stable treeclassifier for extreme multilabel learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 263–272. ACM, 2014.
 [30] Y. Saad. Iterative methods for sparse linear systems, volume 82. SIAM, 2003.
 [31] W. Siblini, P. Kuntz, and F. Meyer. Craftml, an efficient clusteringbased random forest for extreme multilabel learning. In International Conference on Machine Learning, pages 4664–4673, 2018.
 [32] F. Tai and H.T. Lin. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508–2542, 2012.
 [33] K. Trohidis. Multilabel classification of music into emotions. In 9th International Conference on Music Information Retrieval, pages 325–330, 2008.
 [34] G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08), 2008.
 [35] S. Ubaru and A. Mazumdar. Multilabel classification with group testing and codes. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 3492–3501. JMLR.org, 2017.
 [36] S. Ubaru, A. Mazumdar, and A. Barg. Group testing schemes from lowweight codewords of BCH codes. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2863–2867. IEEE, 2016.
 [37] A. Vandaele, N. Gillis, Q. Lei, K. Zhong, and I. Dhillon. Efficient and nonconvex coordinate descent for symmetric nonnegative matrix factorization. IEEE Transactions on Signal Processing, 64(21):5571–5584, 2016.
 [38] A. Vem, N. Thenkarai Janakiraman, and K. Narayanan. Group testing using leftandrightregular sparsegraph codes. arXiv preprint arXiv:1701.07477, 2017.
 [39] C. Wang, S. Yan, L. Zhang, and H.J. Zhang. Multilabel sparse coding for automatic image annotation. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1643–1650. IEEE, 2009.
 [40] M. Wydmuch, K. Jasinska, M. Kuznetsov, R. BusaFekete, and K. Dembczynski. A noregret generalization of hierarchical softmax to extreme multilabel classification. In Advances in Neural Information Processing Systems, pages 6355–6366, 2018.
 [41] C. Xu, D. Tao, and C. Xu. Robust extreme multilabel learning. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1275–1284. ACM, 2016.
 [42] I. E.H. Yen, X. Huang, W. Dai, P. Ravikumar, I. Dhillon, and E. Xing. Ppdsparse: A parallel primaldual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, pages 545–553, 2017.
 [43] I. E. H. Yen, X. Huang, K. Zhong, P. Ravikumar, and I. S. Dhillon. Pdsparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of the 33rd International Conference on Machine Learning, Volume 48, ICML’16, pages 3069–3077. JMLR.org, 2016.
 [44] R. You, S. Dai, Z. Zhang, H. Mamitsuka, and S. Zhu. Attentionxml: Extreme multilabel text classification with multilabel attention based recurrent neural networks. arXiv preprint arXiv:1811.01727, 2018.
 [45] H.f. Yu, P. Jain, P. Kar, and I. Dhillon. Largescale multilabel learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning (ICML14), pages 593–601, 2014.
 [46] M.L. Zhang, Y.K. Li, X.Y. Liu, and X. Geng. Binary relevance for multilabel learning: an overview. Frontiers of Computer Science, 2018.
 [47] W. Zhang, J. Yan, X. Wang, and H. Zha. Deep extreme multilabel learning. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, ICMR ’18, pages 100–107, New York, NY, USA, 2018. ACM.
 [48] Y. Zhang and J. G. Schneider. Multilabel output codes using canonical correlation analysis. In AISTATS, pages 873–882, 2011.
Appendix A Constant Weight Construction
In this supplement, we first describe two constant weight constructions, where each group has the same number of labels, and each label belongs to the same number of groups. Such constructions have been shown to perform well in the group testing problem [36, 38].
A.1 Randomized construction
The first construction we consider is based on LDPC (low-density parity-check) codes. Gallager proposed a low-density code with constant weights in [10]. We can develop a constant weight GT matrix based on this LDPC construction as follows: Suppose the matrix we desire has a given number of columns, a constant number of ones in each column, and a constant number of ones in each row. The number of rows of the LDPC matrix is then fixed by these quantities: it equals the number of columns times the column weight, divided by the row weight. The matrix is divided into as many submatrices as there are ones per column, each containing a single 1 in each column. The first of these submatrices contains all its ones in descending order, i.e., each row has its ones in one block of consecutive columns, and successive rows cover successive blocks. The remaining submatrices are simply column permutations of the first. We consider this construction in our experiments.
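The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the experiments: the function name `gallager_matrix` and the parameter names `n_cols`, `col_weight`, and `row_weight` are hypothetical, and the sketch assumes the number of columns is divisible by the row weight.

```python
import numpy as np

def gallager_matrix(n_cols, col_weight, row_weight, seed=0):
    """Binary matrix with `col_weight` ones per column and `row_weight` ones per row.

    Hypothetical helper: stacks `col_weight` submatrices, each with a single
    one in every column, following the Gallager-style layout described above.
    """
    if n_cols % row_weight != 0:
        raise ValueError("n_cols must be divisible by row_weight")
    rng = np.random.default_rng(seed)
    rows_per_block = n_cols // row_weight

    # First submatrix: row i has ones in one block of consecutive columns.
    base = np.zeros((rows_per_block, n_cols), dtype=np.uint8)
    for i in range(rows_per_block):
        base[i, i * row_weight:(i + 1) * row_weight] = 1

    # Remaining submatrices: random column permutations of the first.
    blocks = [base]
    for _ in range(col_weight - 1):
        blocks.append(base[:, rng.permutation(n_cols)])
    return np.vstack(blocks)

A = gallager_matrix(n_cols=12, col_weight=3, row_weight=4)
```

With these example values the matrix has 12 * 3 / 4 = 9 rows, and every row and column sum is constant by construction, which is the property the group testing scheme relies on.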
A.2 SAFFRON construction
Recently, in [23], a bipartite graph based GT construction called SAFFRON (SparsegrAph codes Framework For gROup testiNg) was proposed. [38] extended this SAFFRON construction to form left-and-right-regular sparse-graph codes called regularSAFFRON. The adjacency matrices corresponding to such graphs give us the desired constant weight constructions. The regularSAFFRON construction starts with a left-and-right-regular graph, with left nodes called variable nodes and right nodes called bin nodes. The edge connections from the left and the edge connections from the right are paired up according to a random permutation.
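The random pairing of edge connections can be illustrated with the following Python sketch. It is only an assumption-laden illustration of the pairing step, not the regularSAFFRON code of [38]: the function name and parameters are hypothetical, and the returned matrix counts edges, so an entry larger than 1 indicates a repeated pairing that a practical implementation would likely need to resample.

```python
import numpy as np

def regular_bipartite_counts(n_left, left_deg, n_right, seed=0):
    """Pair left and right edge sockets by a random permutation.

    Returns an (n_right x n_left) matrix of edge counts. Every column sums
    to `left_deg` and every row sums to the implied right degree.
    """
    total_edges = n_left * left_deg
    if total_edges % n_right != 0:
        raise ValueError("n_left * left_deg must be divisible by n_right")
    right_deg = total_edges // n_right
    rng = np.random.default_rng(seed)

    # One socket per edge endpoint on each side of the graph.
    left_sockets = np.repeat(np.arange(n_left), left_deg)
    right_sockets = np.repeat(np.arange(n_right), right_deg)

    # A random permutation decides which right socket meets which left socket.
    perm = rng.permutation(total_edges)
    counts = np.zeros((n_right, n_left), dtype=np.int64)
    np.add.at(counts, (right_sockets[perm], left_sockets), 1)
    return counts

T = regular_bipartite_counts(n_left=12, left_deg=3, n_right=9)
```

Here each of the 12 variable nodes has 3 edges and each of the 9 bin nodes has 4, counting multiplicity.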
Let be the adjacency matrix corresponding to the left-and-right-regular graph . Then, has ones in each column and ones in each row. Let be the universal signature matrix (see [4, 38] for definition). If is the row of , then the GT matrix A is formed as , where the submatrix of size . The total number of tests will be . We have the following recovery guarantee of this construction:
Proposition 1.
Suppose we wish to recover a sparse binary vector . A binary testing matrix formed from the regularSAFFRON graph with tests recovers proportion of the support of correctly with high probability (w.h.p), for any . With , we can recover the whole support set w.h.p. The constants and depend on and the error tolerance . The computational complexity of the decoding scheme will be .
Proof of the proposition can be found in [38]. The decoding algorithm was discussed in the main text.
Appendix B Proof of Theorem 1
Next, we sketch the proof of Theorem 1 in the main text.
Proof.
Let us denote the entries of and as and respectively, . From our construction: and
First, let us find the probability that Since will be 0 if and only if the support of th row of has no intersection with the support of , hence,
Now note that, . Therefore, It turns out that,
Now, we consider two cases. When , . On the other hand, when , . Therefore,
Hence, when ,
But when ,
∎
We can make stronger claims to bolster this theorem. Since the random variables are all Lipschitz functions of independent underlying variables, by McDiarmid's inequality [25] we can say that they are tightly concentrated around their respective average values.
Appendix C Additional experimental results
Here, we present additional results and further discuss the results we presented in the main text for the proposed methods. We then give a few results that help us better understand the parameters that affect the performance of our MLGT method. First, we describe the evaluation metrics used in the main text and here for comparison.
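As a point of reference for the P@k values discussed below, the standard precision at k can be computed as in the following Python sketch. This is the textbook definition (fraction of the k highest-scoring labels that are relevant, averaged over instances) and is an assumption here; it may differ from the variant defined in Eq. 2 of the main text. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def precision_at_k(scores, y_true, k):
    """Average fraction of the top-k scored labels that are truly relevant."""
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    # Indices of the k highest scores in each row.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # Look up the ground-truth label at each selected index.
    hits = np.take_along_axis(y_true, top_k, axis=1)
    return hits.sum(axis=1).mean() / k

# Toy example: 2 instances, 4 labels.
scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1]])
p1 = precision_at_k(scores, y_true, 1)  # both top-1 labels are relevant
p2 = precision_at_k(scores, y_true, 2)  # 1 of 2 and 2 of 2 are relevant
```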
Results discussion:
In Table 3 of the main text, we summarized the results obtained for six methods for different datasets. We note that NMFGT performs very well given its low computational burden. PfastreXML and Parabel, on the other hand, yield slightly more accurate results but require significantly longer run times.
Note that, when compared to MLGT, the other methods require significantly more time for training. This is because the tree-based methods use k-means clustering recursively to build the label trees, and require several OvA classifiers to be trained, one for each label in the leaf nodes. OvA methods are expensive since they learn one classifier per label. Moreover, the prediction time for MLGT is also orders of magnitude less than that of many of the popular methods. In addition, the other methods have several parameters that need to be tuned (we used the default settings provided by the authors). We also note that the main routines of most other methods are written in C/C++, while MLGT was implemented in Matlab, and hence the run times can be further improved to enable truly realtime predictions.
In Table 4 of the main paper, for the two large datasets, the label set was divided into blocks of sizes roughly around . We also used negative sampling of the training data for each block, as done in many recent XML works [28, 15]. We also reduced the feature dimension via sketching. For hierarchical partitioning, we used the vertex separator approach described in the main text, using the FORTRAN code provided by the author of [11]. The reorderings for the four datasets in Table 3 are shown in Figure 1 of the main text. The approach is extremely fast, and the runtimes for reordering and partitioning the four datasets were:
Eurlex: 0.5s; Wiki10: 4.11s; WikiLSHTC: 40.3s; and Amazon670: 15.5s.
For Eurlex and Wiki10, the accuracy and runtime results for SLEEC, PfastreXML and Parabel were computed by us using their Matlab code. Results for these three methods for the remaining two datasets, and all results for the additional four methods (Dismec, PPDsparse, XT and XMLCNN), were obtained from [28] and [40]. All runtimes are based on single-core implementations.
MLGT Analysis:
We conducted several numerical tests to analyze the performance of MLGT with respect to various settings. Figure 4 presents a few of these numerical analysis results, which help us understand the performance of MLGT better. In the left figure, we plot the P@k achieved by MLGT with different GT constructions, as a function of the correlation metric. The different points (circles) in the plot correspond to different GT matrices with different values of this metric. These GT matrices were formed by randomly permuting disjunct matrices and changing their size. We observe that GT matrices with lower values of the metric yield better classification. These results motivated us to develop the data-dependent grouping approach.
In the middle plot, we have the performance of the NMFGT method for different values of the column sparsity. We clearly note that as the column sparsity increases, the performance first increases and then decreases for larger values. This is because, for larger column sparsity, the GT matrix will have higher coherence between the columns. As indicated in our analysis, the performance of the GT construction will depend on this coherence. This analysis motivated us to use the search technique described in Remark 1 to select the optimal column sparsity.
In the right plot, we compare the performance of NMFGT vs CWGT as a function of the number of groups for the Eurlex dataset. We observe that for a smaller number of groups, NMFGT performs better. However, for a larger number of groups, and more so for a larger number of labels, NMFGT becomes less accurate. This is due to the difficulty in computing an accurate NMF for such large matrices. NMF is known to be an NP-hard problem. This result likely explains why NMFGT’s performance on larger datasets is less accurate. A possible approach to improve the accuracy of NMFGT is to use the hierarchical approach described above, split the large label set into smaller disjoint subsets, and apply NMFGT independently to each.
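To make the notion of coherence between columns concrete, the following Python sketch computes the usual mutual coherence of a matrix, i.e., the largest normalized inner product between two distinct columns. This definition is an assumption made for illustration and may not match the exact correlation metric used in the analysis; the function name is hypothetical.

```python
import numpy as np

def column_coherence(A):
    """Largest normalized inner product between two distinct columns of A."""
    A = np.asarray(A, dtype=float)
    norms = np.linalg.norm(A, axis=0)
    if np.any(norms == 0):
        raise ValueError("every column must contain at least one nonzero entry")
    gram = (A.T @ A) / np.outer(norms, norms)
    np.fill_diagonal(gram, 0.0)  # ignore each column's overlap with itself
    return gram.max()

# Disjoint columns share no group, so the coherence is zero.
coh_disjoint = column_coherence(np.eye(3))

# Two columns of weight 2 that share one group have coherence 1/2.
coh_overlap = column_coherence(np.array([[1, 0],
                                         [1, 1],
                                         [0, 1]]))
```

For a binary matrix with a fixed number of rows, denser columns overlap more often, which is consistent with the observation above that larger column sparsity values lead to higher coherence.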
Dataset  Labels  NMFGT RLoss  NMFGT TLoss  CWGT RLoss  CWGT TLoss
Bibtex  159  3.49  3.68  2.95  4.30
RCV12K  2016  3.99  4.72  3.96  4.91
EurLex4K  3993  1.38  4.77  1.05  5.03
In Table 5, we list the average Hamming loss errors incurred in label reduction (and decoding) when using NMFGT and CWGT for the three datasets (RLoss). That is, we check the average error in the group testing procedure (label reduction and decoding), without classifiers. We also list the average Hamming loss on the training data after classification for comparison (TLoss). We observe that NMFGT has a worse reduction loss than CWGT. This is because NMFGT is data dependent and is not close to being k-disjunct, as opposed to CWGT, which is random. However, we note that the training loss of NMFGT is better. This shows that, even though the reduction and decoding steps are imperfect (they introduce more noise), NMFGT results in better individual classifiers. These comparisons show that data-dependent grouping does result in improved classifiers.
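The reduction loss discussed above can be illustrated with a small Python sketch that reduces label vectors through a GT matrix, decodes them again without any classifier in between, and measures the Hamming loss. The decoding rule used here (a label is predicted when at most `max_misses` of its groups are negative) is a simple assumed rule for illustration and may differ from the decoder in the main text; the Hamming loss is taken as the average number of mismatched labels per instance, which is also an assumption. All function names and the toy matrix are hypothetical.

```python
import numpy as np

def reduce_labels(A, Y):
    """Group labels: a group is positive if it contains any active label."""
    return ((Y.astype(int) @ A.T.astype(int)) > 0).astype(np.uint8)

def decode_labels(A, Z, max_misses=0):
    """Predict label j if at most `max_misses` of its groups are negative."""
    misses = (1 - Z.astype(int)) @ A.astype(int)
    return (misses <= max_misses).astype(np.uint8)

def hamming_loss(Y, Y_hat):
    """Average number of mismatched labels per instance."""
    return np.abs(Y.astype(int) - Y_hat.astype(int)).sum(axis=1).mean()

# Toy GT matrix: 4 groups (rows), 4 labels (columns), 2 groups per label.
A = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=np.uint8)

# Two label vectors: a 1-sparse one and a 2-sparse one.
Y = np.array([[1, 0, 0, 0],
              [1, 0, 0, 1]], dtype=np.uint8)

Z = reduce_labels(A, Y)      # reduced (group) label vectors
Y_hat = decode_labels(A, Z)  # labels recovered from the groups alone
rloss = hamming_loss(Y, Y_hat)
```

In this toy case the 1-sparse vector is recovered exactly, while the 2-sparse vector makes every group positive and decodes to all ones, so the reduction alone already introduces errors when the matrix is not disjunct enough for the label sparsity.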
Implementation details:
All experiments for NMFGT and HeNMFGT were implemented in Matlab and conducted on a standard workstation with a 2.3GHz Intel i5 processor. The timings reported were computed using the cputime function in Matlab. For the SLEEC method, we could not compute the metric as in Eq. 2, since the source code did not output the score matrix. The values reported for SLEEC in Table 4 are the P@k returned by the source code. Also, for the last two examples, SLEEC was run for 50 iterations (for the rest it was 200).