Multilabel Classification by Hierarchical Partitioning and Data-dependent Grouping

by Shashanka Ubaru, et al.

In modern multilabel classification problems, each data instance belongs to a small number of classes from a large set of classes. In other words, these problems involve learning very sparse binary label vectors. Moreover, in large-scale problems, the labels typically have a certain (unknown) hierarchy. In this paper we exploit the sparsity of label vectors and the hierarchical structure to embed them in a low-dimensional space using label groupings. Consequently, we solve the classification problem in a much lower dimensional space and then obtain labels in the original space using an appropriately defined lifting. Our method builds on the work of (Ubaru & Mazumdar, 2017), where the idea of group testing was also explored for multilabel classification. We first present a novel data-dependent grouping approach, where we use a group construction based on a low-rank Nonnegative Matrix Factorization (NMF) of the label matrix of training instances. The construction also allows us, using recent results, to develop a fast prediction algorithm that has a logarithmic runtime in the number of labels. We then present a hierarchical partitioning approach that exploits the label hierarchy in large-scale problems to divide up the large label space and create smaller sub-problems, which can then be solved independently via the grouping approach. Numerical results on many benchmark datasets illustrate that, compared to other popular methods, our proposed methods achieve competitive accuracy with significantly lower computational costs.




1 Introduction

Multilabel classification (MLC) problems involve learning how to predict a (small) subset of classes a given data instance belongs to from a large set of classes. Given a set of labeled training data instances $\{(x_i, y_i)\}_{i=1}^n$ with input feature vectors $x_i \in \mathbb{R}^p$ and label vectors $y_i \in \{0,1\}^d$, we wish to learn the relationship between the $x_i$s and $y_i$s in order to predict the label vector of a new data instance. MLC problems are encountered in many domains such as recommendation systems [16], bioinformatics [32], computer vision, natural language processing [26], and music [33]. In the large-scale MLC problems that we are interested in, the number of labels $d$ can be as large as $O(n)$ but the $\ell_0$-norm of the label vectors is quite small (constant). In some modern applications, the number of classes can be in the thousands, or even millions [40, 15]. However, the label vectors are typically sparse as individual instances belong to just a few classes. Examples of such large-scale MLC problems include image and video annotation for searches [39, 9], ads recommendation and web page categorization [1, 29], tagging text and documents for categorization [34, 16], and others [15]. There are two practical challenges associated with these large-scale MLC problems: (1) how many classifiers one has to train, and (2) what the latency is when predicting the label vector of a new data instance using these classifiers. In the rest of this paper, we address these two challenges.

Related Work:

Most of the prior methods that have been proposed to solve large-scale sparse MLC problems fall under four categories:

(1) One versus all (OvA) classifiers: Earlier approaches for the MLC problem involve training a binary classifier for each label independently [46]. Recent approaches such as DiSMEC [2], PD-Sparse [43], PPD-Sparse [42], ProXML [3], and Slice [15] propose different paradigms to deal with the scalability issue of this naive approach. These methods typically train linear classifiers and achieve high prediction accuracy but at the same time suffer from high training and prediction runtimes. Slice reduces the training cost per label by subsampling the negative training points and reducing the number of training instances logarithmically.

(2) Tree based classifiers: These approaches exploit the hierarchical nature of labels when there is such a hierarchy, e.g., HOMER [34]. Recent tree based methods include FastXML [29], PfastreXML [16], Probabilistic Label Trees [18], Parabel [28], SwiftXML [27], extremeText [40], CraftXML [31], and Bonsai [20]. These methods yield high prediction accuracy when labels indeed have a hierarchical structure. However, they also tend to have high training times as they typically use clustering methods for label partitioning, and need to train many linear classifiers, one for each label in leaf nodes.

(3) Deep learning based classifiers: More recently, neural network based methods such as XML-CNN [24], DeepXML [47], AttentionXML [44], and X-BERT [7] have also been proposed. These methods perform as well as the tree based and OvA methods in many cases. However, they also suffer from high training and prediction costs, and the resulting model sizes can be quite large (in GBs).

(4) Embedding based classifiers: These approaches reduce the number of labels by projecting the label vectors onto a low-dimensional space. Most of these methods assume that the label matrix is low-rank, see [32, 6, 48, 8, 45]. In this case, certain error guarantees can be established using the label correlation. However, the low-rank assumption does not always hold, see [5, 41, 2]. Recent embedding methods such as SLEEC [5], XMLDS [12] and DEFRAG [17] overcome this issue by using local embeddings and negative sampling. Most of these embedding methods require expensive techniques to recover the high-dimensional label vectors, involving eigen-decompositions or matrix inversions, and solving large optimization problems.

To deal with the scalability issue, a group testing based approach (MLGT) was recently proposed in [35]. This method involves creating random subsets of classes (called groups, defined by a binary group testing matrix) and training independent binary classifiers to learn whether a given instance belongs to a group or not. When the label sparsity is $k$, this method requires a number of groups $m$ that grows only logarithmically with the number of labels $d$ to predict the $k$ labels, and therefore only a small number of classifiers need to be trained. Under certain assumptions, the labels of a new data instance can be predicted by simply predicting the groups it belongs to. The MLGT method has been shown to yield low Hamming loss errors. However, since the groups are formed in a random fashion, the individual classifiers might be poorly trained. That is, the random groupings might club together unrelated classes, and the binary classifiers trained on such groups will be inefficient.

Our contributions:

In this work, we build on the MLGT framework and present a new MLC approach based on hierarchical partitioning and a data-dependent group construction. We first present the novel grouping approach (NMF-GT) that improves the accuracy of MLGT. This new method samples the group testing (GT) matrix (which defines the groups) from a low-rank Nonnegative Matrix Factorization (NMF) of the training data label matrix $Y$. Specifically, we exploit symmetric NMF [22] of the correlation matrix $YY^T$, which is known to capture the clustering/grouping within the data [13]. This helps us capture the label correlations in the groups formed, yielding better trained classifiers. We analyze the proposed data-dependent construction and give theoretical results explaining why it performs well in MLGT. In the supplement, we discuss a GT construction that has constant weight across rows and columns, i.e., each group gets the same number of labels, and each label belongs to the same number of groups. These constructions yield better classifiers and improved decoding, see Section 5 for details.

These new constructions also enable us, using recent results, to develop a novel prediction algorithm with logarithmic runtime in the number of labels $d$. If the desired sparsity of the label vector is $k$, then the complexity of the prediction algorithm will be $O(k \log d)$. This significant improvement over existing methods will allow us to predict labels of new data instances in high-throughput and real-time settings such as recommendation systems [27]. This will address some of the limitations in traditional approaches to obtain related searches (search suggestions) [15].

We then present a hierarchical partitioning approach that exploits the label hierarchy in large-scale problems to divide the large label set into smaller subsets. The associated sub-problems can then be solved simultaneously (in parallel) using the MLGT approach. During prediction, the outputs of the individual fast decoders are simply combined (or weighted) to obtain the top $k$ labels in log time. In numerical experiments, we first show that the new group construction (NMF-GT) performs better than the previous random constructions in [35]. We then compare the performance of the proposed hierarchical method (He-NMFGT) to some of the popular state-of-the-art methods on large datasets. We also show how the group testing framework can achieve learning with less labeled data for multilabel classification.

2 MLGT method

We first describe the group testing framework for MLC problems. The training data consists of $n$ instances $\{(x_i, y_i)\}_{i=1}^n$, where $x_i \in \mathbb{R}^p$ are the input feature vectors and $y_i \in \{0,1\}^d$ are the corresponding label vectors for each instance, which are assumed to be $k$-sparse, i.e., $\|y_i\|_0 \le k$.

The first step in training is to construct an $m \times d$ binary matrix $A$, called the group testing matrix. Rows of $A$ correspond to groups, columns to labels, and $A_{ij}$ is 1 if the $j$th label index (or class) belongs to the $i$th group. There exists an $A$ with $m$ growing only logarithmically in $d$ (e.g., a $k$-disjunct matrix, see [35]) such that for any $k$-sparse binary vector $y \in \{0,1\}^d$, $y$ can be uniquely recovered (in polynomial time) from $z = A \vee y$. Here $\vee$ is the Boolean OR operation (replacing the vector inner product between a row of $A$ and $y$ in $Ay$). In Section 3, we describe how to construct these group testing matrices. This motivates projecting the label space into a lower-dimensional space via $A$, and creating reduced label vectors $z_i$ for each $y_i$, where $z_i = A \vee y_i$. The last step is to train $m$ binary classifiers $\{w_j\}_{j=1}^m$ on $\{(x_i, (z_i)_j)\}_{i=1}^n$, where $(z_i)_j$, the $j$th entry of $z_i$, indicates whether the $i$th instance belongs to the $j$th group or not. Algorithm 1 summarizes the training algorithm.

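Algorithm 1 MLGT: Training Algorithm
  Input: Training data $\{(x_i, y_i)\}_{i=1}^n$, group testing matrix $A \in \{0,1\}^{m \times d}$, binary classifier $\mathcal{C}$.
  Output: $m$ classifiers $\{w_j\}_{j=1}^m$.
  for $i = 1, \ldots, n$ do
    $z_i = A \vee y_i$.
  end for
  for $j = 1, \ldots, m$ do
    $w_j = \mathcal{C}(\{(x_i, (z_i)_j)\}_{i=1}^n)$.
  end for

Algorithm 2 MLGT: Prediction Algorithm
  Input: Test data $x$, the group testing matrix $A$, $m$ classifiers $\{w_j\}_{j=1}^m$, sparsity $k$.
  Output: predicted label $\hat{y}$.
  for $j = 1, \ldots, m$ do
    $\hat{z}_j = w_j(x)$.
  end for
  $\hat{y} = \mathrm{decode}(A, \hat{z}, k)$.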
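The Boolean OR reduction used in training can be illustrated with a minimal Python sketch. The toy matrix `A`, the label vectors `Y_train`, and the helper name `or_reduce` are illustrative assumptions, not part of the original method description; only the rule "a group is active if it contains at least one active label" is taken from the text.

```python
def or_reduce(A, y):
    """Boolean OR reduction: z[i] = 1 iff group i contains an active label of y."""
    return [int(any(a and b for a, b in zip(row, y))) for row in A]

# Toy group testing matrix (assumed for illustration): m = 4 groups (rows),
# d = 6 labels (columns). Every label belongs to exactly two groups.
A = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]

# Three sparse training label vectors (one per instance).
Y_train = [
    [0, 0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
]

# Reduced label vectors, one per training instance.
Z_train = [or_reduce(A, y) for y in Y_train]

# The binary targets for the classifier of group j are the j-th entries
# of the reduced vectors; here j = 1.
targets_group_1 = [z[1] for z in Z_train]
print(Z_train)
print(targets_group_1)
```

Each of the $m$ binary classifiers would then be fit on the feature vectors paired with one such target column; any off-the-shelf binary classifier could play that role.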


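For a new instance $x$, we first use the $m$ classifiers $\{w_j\}_{j=1}^m$ to predict a reduced label vector $\hat{z}$. We then apply the following simple linear decoding technique: for all $l \in \{1, \ldots, d\}$,

$$\hat{y}_l = \begin{cases} 1 & \text{if } \mathrm{supp}(A^{(l)}) \subseteq \mathrm{supp}(\hat{z}), \\ 0 & \text{otherwise,} \end{cases}$$

where $A^{(l)}$ denotes the $l$th column of $A$. Here, $\mathrm{supp}(z) := \{i : z_i \neq 0\}$ denotes the support of the vector $z$. When $A$ is $k$-disjunct [35] and $\hat{z} = A \vee y$ for some $k$-sparse vector $y$, the above algorithm recovers $y$. Unlike other embedding methods, this decoding technique does not require expensive matrix operations such as decompositions or inversion, and is linear in the number of labels $d$ using sparse matrix-vector products.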
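A minimal Python sketch of this linear decoder follows. It assumes the rule that a label is predicted positive exactly when every group containing it is predicted positive; the toy matrix `A` and the function name `cover_decode` are hypothetical, chosen only for illustration.

```python
def cover_decode(A, z_hat):
    """Predict label l as 1 iff every group that contains l is active in z_hat."""
    m, d = len(A), len(A[0])
    y_hat = []
    for l in range(d):
        support = [i for i in range(m) if A[i][l]]   # groups containing label l
        y_hat.append(int(bool(support) and all(z_hat[i] for i in support)))
    return y_hat

# Toy matrix (assumed): 4 groups, 6 labels, each label in two distinct groups.
# No column support contains another, so single labels are always recoverable.
A = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]

z_hat = [0, 1, 1, 0]            # reduced vector produced by label 3 alone
y_hat = cover_decode(A, z_hat)
print(y_hat)
```

This toy matrix only separates single labels: a reduced vector with all four groups active (for example from labels 0 and 5 together) decodes to all six labels, which is why the recovery guarantee in the text requires a $k$-disjunct matrix for $k$-sparse vectors.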

We will next present a new construction of $A$ together with a decoding algorithm that is logarithmic in $d$ and can be used in the last step of Algorithm 2 in place of the linear decoder described above.

3 Data dependent construction and decoding

In [35], the authors construct the group testing matrix $A$ using a uniform random construction that does not use any information about the training data. Even if two distinct classes (or label indices) are indistinguishable with respect to data instances, the columns of $A$ for these classes are different. We present a novel data-dependent construction for $A$ such that "similar" classes are represented by similar columns of $A$, and show that this construction leads to much better prediction quality. We also present a fast decoding technique. Consider a metric $\Phi_Y(A)$ that measures the discrepancy between the matrices $YY^T$ and $A^TA$, where $Y$ is the training label matrix.

$YY^T$ is the label correlation matrix, also called the label co-occurrence matrix [20]. The $(i,j)$ entry of $YY^T$ is the number of training instances shared by the $i$th and $j$th classes. The entries of $A^TA$ give the number of groups shared by a pair of classes. Given a training label matrix $Y$, we construct $A$ so as to minimize $\Phi_Y(A)$, so that the group membership structures of two similar classes are similar. See the supplement for relevant experiments. A completely random (disjunct) matrix is unlikely to yield low $\Phi_Y(A)$, since random grouping will not capture the correlation between labels. However, for proper decoding, the GT matrix needs to be sparse and its columns need to have low coherence. We construct $A$ to account for both issues as follows.
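To make the comparison between label co-occurrence and group co-membership concrete, here is a small Python sketch. The normalization (dividing by the number of instances and the number of groups) and the Frobenius norm are assumptions made for this illustration and may differ from the exact metric used by the authors; the matrices are toy data, with the label matrix stored as one row per label.

```python
import math

def label_cooccurrence(Y):
    """Y has one row per label and one column per instance (assumed layout)."""
    d = len(Y)
    return [[sum(a * b for a, b in zip(Y[i], Y[j])) for j in range(d)] for i in range(d)]

def group_cooccurrence(A):
    """Entry (i, j) counts the groups shared by labels i and j."""
    d = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(d)] for i in range(d)]

def discrepancy(Y, A):
    """Frobenius distance between normalized co-occurrence matrices (assumed form)."""
    n, m, d = len(Y[0]), len(A), len(Y)
    C, G = label_cooccurrence(Y), group_cooccurrence(A)
    return math.sqrt(sum((C[i][j] / n - G[i][j] / m) ** 2
                         for i in range(d) for j in range(d)))

# Labels 0 and 1 always occur together, as do labels 2 and 3.
Y = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

A_aligned = [[1, 1, 0, 0], [0, 0, 1, 1]]      # groups follow the co-occurrence
A_unrelated = [[1, 0, 1, 0], [0, 1, 0, 1]]    # groups mix unrelated labels

print(discrepancy(Y, A_aligned), discrepancy(Y, A_unrelated))
```

In this toy case the grouping that mirrors the co-occurrence structure has zero discrepancy, while the grouping that mixes unrelated labels does not.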

Given $Y$ and $m$ (the number of groups), we compute a rank-$m$ symmetric Nonnegative Matrix Factorization (symNMF) of $YY^T$ as $YY^T \approx H^T H$, where $H \in \mathbb{R}_{+}^{m \times d}$ is called the basis matrix [22]. It has been shown that symNMF is closely related to clustering, see [22, 13]. Given $Y$, the basis matrix $H$ defines the clustering within the labels. Therefore, we use the columns of $H$ to sample $A$.
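As a rough illustration of a symmetric NMF, the sketch below uses a simple damped multiplicative update. This is a swapped-in textbook-style scheme, not the solver used in the experiments, and the function names, the damping value `beta = 0.5`, the iteration count, and the toy matrix are all assumptions made for the example.

```python
import math
import random

def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def transpose(P):
    return [list(r) for r in zip(*P)]

def sym_nmf(C, m, iters=300, beta=0.5, seed=0, eps=1e-12):
    """Approximate C ~ H^T H with H >= 0 (H is m x d) by damped multiplicative updates."""
    rng = random.Random(seed)
    d = len(C)
    W = [[rng.uniform(0.1, 1.0) for _ in range(m)] for _ in range(d)]   # W = H^T
    for _ in range(iters):
        CW = matmul(C, W)                            # d x m
        WWtW = matmul(W, matmul(transpose(W), W))    # d x m
        W = [[W[i][k] * (1 - beta + beta * CW[i][k] / (WWtW[i][k] + eps))
              for k in range(m)] for i in range(d)]
    return transpose(W)

def fro_error(C, H):
    """Frobenius norm of C - H^T H."""
    R = matmul(transpose(H), H)
    return math.sqrt(sum((C[i][j] - R[i][j]) ** 2
                         for i in range(len(C)) for j in range(len(C))))

# Toy co-occurrence matrix with two blocks of co-occurring labels.
C = [[2, 2, 0, 0],
     [2, 2, 0, 0],
     [0, 0, 2, 2],
     [0, 0, 2, 2]]

H0 = sym_nmf(C, 2, iters=0)    # random starting point
H = sym_nmf(C, 2)              # after the updates
print(fro_error(C, H0), fro_error(C, H))
```

The entries of `H` stay nonnegative because each update multiplies by a positive factor; the columns of `H` can then serve as the starting point for the sampling step described next.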

For a column $h_i$ of $H$, let $\bar{h}_i$ be the normalized column such that its entries add to 1. Let $c$ be the column weight desired for $A$. For each column $i$, we form $\tilde{h}_i = c \cdot \bar{h}_i$, and then re-weight these $\tilde{h}_i$ vectors in order to avoid entries greater than 1. We find all entries $\tilde{h}_i[j] > 1$, set these entries to 1, and distribute the excess sum to the remaining entries. This is needed because many entries of $h_i$ will be zero. The columns of $A$ are then sampled using the re-weighted $\tilde{h}_i$s as the sampling probability vectors. Then each column will have $c$ ones on average. We do this instead of sampling the $i$th column of $A$ as a random binary vector (with the probability of the $j$th entry being 1 equal to $\frac{1}{k+1}$) as in the $k$-disjunct construction used in [35]. In the supplement, we describe other constant weight constructions, where each group has the same number of labels, and each label belongs to the same number of groups. Such constructions have been shown to perform well in the group testing problem [36, 38].
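The re-weighting and sampling of one column can be sketched as follows. The text does not say how the excess is split among the remaining entries; spreading it proportionally over the other nonzero entries is an assumption of this sketch, as are the function names and the toy column.

```python
import random

def sampling_probs(h, c, tol=1e-12):
    """Scale nonnegative column h to sum to c, cap entries at 1, and spread the
    excess over the remaining nonzero entries (proportional split is assumed)."""
    total = sum(h)                       # assumes h has at least one positive entry
    p = [c * v / total for v in h]
    while True:
        excess = sum(v - 1.0 for v in p if v > 1.0)
        free = [i for i, v in enumerate(p) if 0.0 < v < 1.0]
        p = [min(v, 1.0) for v in p]
        if excess <= tol or not free:
            return p
        free_mass = sum(p[i] for i in free)
        for i in free:
            p[i] += excess * p[i] / free_mass

def sample_column(p, rng):
    """Draw one binary column of the group testing matrix from probabilities p."""
    return [int(rng.random() < q) for q in p]

h = [4.0, 1.0, 1.0, 0.0]       # one column of the NMF basis (toy values)
p = sampling_probs(h, c=2)     # -> probabilities summing to c
column = sample_column(p, random.Random(1))
print(p, column)
```

For this column the first entry is capped at 1 and its excess is shared by the two other nonzero entries, so the probabilities still sum to the desired column weight and the zero entry is never selected.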

Remark 1 (Choosing $c$).

In these constructions, we choose the parameter $c$ (the column sparsity, i.e., the number of ones per column) using a simple procedure. For a range of values of $c$, we form the matrix $A$, reduce and recover (a random subset of) the training label vectors, and choose the $c$ which yields the smallest Hamming loss error.
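The selection procedure in Remark 1 can be sketched in Python as below. For brevity the sketch builds each candidate matrix with random constant-weight columns as a stand-in for the NMF-based sampling, and it uses the simple linear decoder; both choices, along with all names and the toy label vectors, are assumptions made for illustration.

```python
import random

def or_reduce(A, y):
    return [int(any(a and b for a, b in zip(row, y))) for row in A]

def cover_decode(A, z):
    m, d = len(A), len(A[0])
    out = []
    for l in range(d):
        support = [i for i in range(m) if A[i][l]]
        out.append(int(bool(support) and all(z[i] for i in support)))
    return out

def random_constant_weight(m, d, c, rng):
    """Stand-in construction: every column gets exactly c ones at random rows."""
    cols = [rng.sample(range(m), c) for _ in range(d)]
    return [[int(i in cols[j]) for j in range(d)] for i in range(m)]

def choose_c(labels, m, candidates, seed=0):
    """Return the candidate c with the smallest Hamming loss after reduce + recover."""
    rng = random.Random(seed)
    d = len(labels[0])
    losses = {}
    for c in candidates:
        A = random_constant_weight(m, d, c, rng)
        wrong = sum(a != b
                    for y in labels
                    for a, b in zip(y, cover_decode(A, or_reduce(A, y))))
        losses[c] = wrong / (len(labels) * d)
    best = min(losses, key=losses.get)
    return best, losses

labels = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0],
]
best_c, losses = choose_c(labels, m=6, candidates=[1, 2, 3])
print(best_c, losses)
```

In practice the same loop would wrap whichever construction is being tuned, and the label vectors would be a random subset of the training set.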

In MLGT, for our data-dependent GT matrix, we can use the linear decoder described in Section 2. However, since the sampled matrix has constant weight columns, we can consider it as an adjacency matrix of a left regular graph. Therefore, we can use the recently proposed SAFFRON construction [23] and its fast decoding algorithm.

Fast decoding algorithm via SAFFRON:

Recently, in [23], a bipartite graph based GT construction called SAFFRON (Sparse-grAph codes Framework For gROup testiNg) was proposed, see the supplement for details. Since our NMF based construction ensures constant weight columns, the resulting matrix can be viewed as an adjacency matrix of a left regular graph. This helps us adapt the fast decoding algorithm developed for the SAFFRON construction for label prediction in our method.

We next briefly describe the decoding algorithm (an adaptation of the fast decoder presented in [4] for sparse vector recovery). It has two steps, namely a bin decoder and a peeling decoder. The right nodes of the bipartite graph are called bins and the left nodes are called the variables.

Given the output reduced vector $\hat{z}$ in the first step of prediction, the bin decoder is applied to the bins (these are partitions of $A$ as per the construction, see supplement), and all the variable nodes connected to singletons (bins connected to a single nonzero variable) are decoded and put into a set, say $\mathcal{D}$. Next, in an iterative manner, a node from $\mathcal{D}$ is considered at each iteration, and the bin decoder is applied to the bins connected to this variable node. If one of these bins is a resolvable double-ton (connected to two nonzeros, one of which is already decoded), we obtain a new nonzero variable. The decoded variables are moved from $\mathcal{D}$ to a new set of peeled-off nodes $\mathcal{P}$, and the newly decoded nonzero variable node, if any, is put in $\mathcal{D}$. The decoder terminates when $\mathcal{D}$ is empty, and if the set $\mathcal{P}$ has $k$ items, we have succeeded. For complete details, see [23]. The computational complexity of the decoding scheme is $O(k \log d)$, see [38]. Therefore, for any left-regular graph with the SAFFRON construction and a sufficient number of groups $m$, the decoder recovers $k$ items in $O(k \log d)$ time. We can use this fast decoder in the last step of Algorithm 2 to predict the sparse label $\hat{y}$ for a given instance $x$.


We next present an analysis that shows why the proposed data-dependent construction will perform well in MLGT. Let $\tilde{H}$ be the reweighted matrix derived from the label data $Y$. $\tilde{H}$ is the potential matrix that is used to sample the binary group testing matrix $A$. By construction, we know that the sum of entries in a column of $\tilde{H}$ is $c$, a constant.

Suppose in the prediction phase, the correct label vector is $y \in \{0,1\}^d$. We know that there are at most $k$ ones in $y$, i.e., $\|y\|_0 \le k$. Then, by using the $m$ binary classifiers we obtain the reduced label vector $\hat{z}$, which, if the classifiers are exact, will be $z = A \vee y$. To perform the decoding for $z$ then, in effect we compute $b = A^T z$ and set the top $k$ coordinates to 1 and the rest to 0. The next result shows the effectiveness of this method.

Theorem 1 (Sampling using ).

For any , , whereas, for any , , where is the th row of .

The proof of this theorem is presented in the supplement. This result explains why our construction is a good idea. Indeed, since we generate $A$ in a data-dependent manner, any given label will likely have high correlations with the rows of $A$. As a result, the value of $b_j$ when $j$ is in the support of $y$ is much higher than the value of $b_j$ when $j$ is not in the support, with high probability. Therefore, choosing the top-$k$ coordinates of $b$ will indeed produce $y$.
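The score-based decoding analyzed here (count, for each label, how many of its groups are active, then keep the highest counts) can be sketched in a few lines of Python. The toy matrix and the function name `score_decode` are assumptions for illustration, and ties are broken by the lower label index, which the text does not specify.

```python
def score_decode(A, z, k):
    """Score each label by the number of its groups that are active in z,
    then set the k highest-scoring labels to 1."""
    m, d = len(A), len(A[0])
    b = [sum(A[i][j] * z[i] for i in range(m)) for j in range(d)]
    top = sorted(range(d), key=lambda j: -b[j])[:k]   # stable sort: ties keep index order
    y_hat = [0] * d
    for j in top:
        y_hat[j] = 1
    return b, y_hat

# Toy matrix (assumed): 4 groups, 6 labels, column weight 2.
A = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
z = [0, 1, 1, 0]                  # exact reduced vector of label 3
b, y_hat = score_decode(A, z, k=1)
print(b, y_hat)
```

The true label reaches the full column weight (here 2), while every other label scores strictly less, which is the separation the analysis above relies on.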

4 Hierarchical approach for extreme classification

In very large-scale MLC problems (called extreme multilabel or XML problems), the labels typically have a certain (unknown) hierarchy. By discovering and using this label hierarchy, one can design efficient classifiers for XML problems that have low computational cost. A limitation of our data-dependent approach is that we perform symNMF of the correlation matrix $YY^T$. As the symNMF problem is NP-hard, and also difficult to solve for matrices with more than a few thousand columns, getting good quality classifiers for XML problems is not guaranteed. Moreover, these large matrices are unlikely to be low rank [5]. Therefore, we propose a simple hierarchical label-partitioning approach to divide the set of label classes into smaller sets, and then apply our NMF-GT method to each smaller set independently.

Matrix reordering techniques on sparse matrices are popularly used for graph partitioning [19] and solving sparse linear systems [30]. Here, a large sparse matrix (usually the adjacency matrix of a large graph) is reordered such that the matrix/graph can be partitioned into smaller submatrices that can be handled independently. Since the label matrix $Y$ is highly sparse in XML problems and the labels have a hierarchy, the nonzero entries in $YY^T$ can be viewed as defining an adjacency matrix of a sparse graph. Let $G = (V, E)$ denote a graph, where each node corresponds to a label, and $(i, j) \in E$ if and only if $(YY^T)_{ij} \neq 0$. In other words, an edge between nodes/labels $i$ and $j$ is present if and only if labels $i$ and $j$ occur together in at least one data point, which indicates "interaction" between these labels.

Suppose that $G$ has, say, $\ell$ components, i.e., it can be partitioned into $\ell$ disjoint sub-graphs, as assumed in Bonsai [20]. Then each component corresponds to a subset of labels that interact with one another but not with labels in other components. Permuting the labels so that labels in a component are adjacent to one another, and applying the same permutation to the label dimension of $Y$, one can obtain a block-diagonal reordering of the label matrix $YY^T$. Now the symNMF problem for $YY^T$ can be reduced to a number of smaller symNMF problems, one for each block of the matrix. Most large datasets (label matrices) with hierarchy will have many smaller non-interacting subsets of labels and a few subsets that interact with many other labels. A natural approach is to use the vertex separator partitioning based reordering [11] or nested dissection [19] to obtain this permutation.
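For the simple case of fully disconnected components, the label permutation can be found with a union-find pass over the training label sets, as in the Python sketch below. This only illustrates the connected-component case, not the vertex separator or nested dissection algorithms cited in the text; the function name and the toy data are assumptions.

```python
def label_components(instances, d):
    """Group labels into connected components of the co-occurrence graph.
    `instances` lists, for each training point, the indices of its active labels."""
    parent = list(range(d))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving
            u = parent[u]
        return u

    for labels in instances:
        for l in labels[1:]:
            ra, rb = find(labels[0]), find(l)
            if ra != rb:
                parent[rb] = ra             # labels co-occur: merge their components

    comps = {}
    for l in range(d):
        comps.setdefault(find(l), []).append(l)
    return sorted(comps.values(), key=lambda comp: comp[0])

instances = [[0, 3], [3, 5], [1, 4], [2], [4, 1]]
components = label_components(instances, d=6)

# Listing the labels component by component gives a permutation under which
# the co-occurrence matrix becomes block diagonal.
permutation = [l for comp in components for l in comp]
print(components, permutation)
```

Each component can then be handled as its own, much smaller, factorization problem.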

The idea is to find a small vertex separator $S$ of $G$ (here $S \subset V$) such that $G \setminus S$ has a number of disjoint components $C_1, \ldots, C_\ell$. The labels can then be viewed as belonging to one of the subsets $S \cup C_1, \ldots, S \cup C_\ell$, and we can apply NMF-GT to each separately. This idea can be further extended to a hierarchical partitioning of $G$ (by finding partitions of the subgraphs $C_i$, each with its own vertex separator $S_i$). Each level of the hierarchy would be partitioned further until the components are small enough that the MLGT (symNMF) algorithm can be efficiently applied.

Figure 1: Hierarchical reordering of label co-occurrence matrices for four XML datasets.

In Figure 1, we display the hierarchical reordering of $YY^T$ obtained by the algorithm in [11] for four popular XML datasets: Eurlex (with number of labels $d \approx 4$K), Wiki10 ($d \approx 31$K), WikiLSHTC ($d \approx 325$K), and Amazon ($d \approx 670$K), respectively. We note that there are a few distinct blocks $C_i$ (the block diagonals), where the labels only occur together and are independent of other blocks (do not interact). We also have a small subset of labels $S$ (the outer band) that interact with most blocks $C_i$. We can partition the label set into $\ell$ subsets $\{S \cup C_i\}_{i=1}^{\ell}$ and apply our NMF based MLGT to each individually (this can be done in parallel). During prediction, the individual fast decoders return the positive labels for each subset in time logarithmic in the subset size. We can simply combine these positive labels or weight them to output the top $k$ labels. Since the labels in the subset $S$ interact with most other labels and occur more frequently (power-law distribution), we can rank them higher when picking the top $k$ of the outputted positive labels.
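One simple way to combine the per-block outputs is to count how often each label is returned and rank by that count, so that labels from the shared subset (which can be returned by several blocks) rise to the top. The Python sketch below is a hypothetical illustration of this idea; the counting rule, the tie-breaking by label index, and the names are assumptions, not the exact weighting used in the experiments.

```python
from collections import Counter

def combine_blocks(block_outputs, k):
    """Merge positive labels returned by the per-block decoders and keep the
    k labels that were returned most often (ties broken by label index)."""
    counts = Counter(label for out in block_outputs for label in out)
    ranked = sorted(counts, key=lambda label: (-counts[label], label))
    return ranked[:k]

# Global label ids returned by three block decoders; label 100 stands for a
# label of the shared subset and is reported by every block.
block_outputs = [[7, 100], [100, 42], [100, 7, 3]]
top_labels = combine_blocks(block_outputs, k=3)
print(top_labels)
```

Because every block decoder runs independently, the merge step is the only sequential part of prediction.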

Comparison with tree methods: Tree based methods such as HOMER [34], Parabel [28], Bonsai [20], and others use label partitioning to recursively construct label trees with a pre-specified number of labels in the leaf nodes or a pre-specified tree depth. Most methods use k-means clustering for partitioning, which contributes to their high training times. Then, OvA classifiers are learned for each label in the leaf nodes. In our approach, by contrast, we use label partitioning to identify label subsets on which we can apply NMF-GT independently. Our matrix reordering approach is inexpensive, see [11]. We use the NMF-GT strategy to learn only a small number of classifiers per partition.

5 Numerical Experiments

We now present numerical results to illustrate the performance of the proposed approaches (the data-dependent construction NMF-GT and its hierarchical partitioning variant He-NMFGT) on MLC problems. Several additional results and details are presented in the supplement.

Dataset          Labels    Avg. sparsity   Train instances   Test instances   Features
Mediamill        101       4.38            30993             12914            120
Bibtex           159       2.40            4880              2515             1839
RCV1-2K (ss)     2016      4.76            30000             10000            29699
EurLex-4K        3993      5.31            15539             3809             5000
AmazonCat (ss)   7065      5.08            100000            50000            57645
Wiki10-31K       30938     18.64           14146             6616             101850
WikiLSHTC        325056    3.18            1813391           78743            85600
Amazon-670K      670091    5.45            490449            153025           135909
Table 1: Dataset statistics


For our experiments, we consider some of the popular publicly available multilabel datasets put together in The Extreme Classification Repository [5]. The applications, details and the original sources of the datasets can be found in the repository. Table 1 lists the statistics.

In the table, the columns give, in order, the number of labels $d$, the average sparsity per instance, the number of training instances $n$, the number of test instances, and the number of features $p$. The datasets marked (ss) are subsampled versions of the original data with statistics as indicated.

Evaluation metrics:

To compare the performance of the different MLC methods, we use the most popular evaluation metric, Precision@k (P@k) [1], with $k \in \{1, 3, 5\}$. It has been argued that this metric is more suitable for modern applications such as tagging or recommendation, where one is interested in only predicting a subset of (top $k$) labels correctly. P@k is defined as:

$$\mathrm{P}@k := \frac{1}{k} \sum_{l \in \mathrm{rank}_k(\hat{y})} y_l,$$

where $\hat{y}$ is the predicted vector and $y$ is the actual label vector. This metric assumes that the vector $\hat{y}$ is real valued and its coordinates can be ranked so that the summation above can be taken over the $k$ highest ranked entries of $\hat{y}$. For the hierarchical approach, we weight and rank the labels based on repeated occurrence (in the overlapping set $S$).

In general, the MLGT method returns a binary label vector of predefined sparsity, and there is no ranking among its nonzero entries. Hence, we also use a slightly modified definition, in which the sum is taken over the nonzero coordinates of $\hat{y}$ predicted by MLGT, assuming that the predefined sparsity is set to 5. To make the comparison fair for other (ranking based) methods, we sum over the top 5 labels based on their ranking (i.e., we use $\mathrm{rank}_5$ instead of $\mathrm{rank}_k$ in the original definition).
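The standard precision-at-$k$ computation for a single instance can be written as a short Python function. The scores and labels below are made-up illustrative values, and ties in the scores are broken by the lower label index, which is an assumption of this sketch.

```python
def precision_at_k(scores, y_true, k):
    """Fraction of the k highest-scoring labels that are truly relevant."""
    top = sorted(range(len(scores)), key=lambda l: -scores[l])[:k]
    return sum(y_true[l] for l in top) / k

scores = [0.9, 0.1, 0.8, 0.3, 0.7]   # real-valued predictions for 5 labels
y_true = [1, 0, 0, 0, 1]             # ground-truth binary label vector

for k in (1, 3, 5):
    print(k, precision_at_k(scores, y_true, k))
```

Dataset-level values such as those reported in the tables are obtained by averaging this quantity over all test instances.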

Figure 2: Modified precision on test data instances for the Bibtex (top two panels) and RCV1x (bottom two panels) datasets as a function of the number of groups $m$. Error bars are over 10 trials.
Dataset     Metric         NMF-GT    CW-GT     SP-GT     OvA
Bibtex      P@1            0.7354    0.7089    0.6939    0.6111
            P@3            0.3664    0.3328    0.3034    0.2842
            P@5            0.2231    0.2017    0.1823    0.1739
            Corr. metric   10.610    12.390    12.983    –
            Total time     5.13s     4.01s     3.98s     8.22s
            Test time      0.13s     0.13s     0.13s     0.18s

Mediamill   P@1            0.8804    0.8286    0.6358    0.8539
            P@3            0.6069    0.5413    0.2729    0.5315
            P@5            0.3693    0.3276    0.1638    0.3231
            Corr. metric   10.377    11.003    10.876    –
            Total time     17.2s     15.7s     15.82s    29.4s
            Test time      0.17s     0.17s     0.17s     0.54s

            P@1            0.9350    0.9205    0.8498    0.9289
            P@3            0.6983    0.6596    0.5732    0.6682
            P@5            0.4502    0.4104    0.3449    0.4708
            Corr. metric   53.916    58.459    58.671    –
            Total time     88.4s     77.5s     74.2s     363.2s
            Test time      1.20s     1.04s     1.10s     6.37s

Eurlex      P@1            0.8477    0.8430    0.6792    0.8535
            P@3            0.5547    0.5582    0.3933    0.6132
            P@5            0.3444    0.3597    0.2758    0.4085
            Corr. metric   80.023    80.732    82.257    –
            Total time     227.3s    99.6s     90.4s     560.1s
            Test time      0.94s     0.93s     0.93s     7.26s

Table 2: Comparisons between GT constructions. Metric: Modified Precision

Comparing group testing constructions:

In the first set of experiments, we compare the new group testing constructions with the sparse random construction (SP-GT) used in [35], where each entry of $A$ is sampled with uniform probability $\frac{1}{k+1}$. Our first construction (NMF-GT) is based on the symNMF as described in Section 3. Given the training label matrix $Y$, we first compute the symNMF of $YY^T$ of rank $m$ using the Coordinate Descent algorithm by [37] (code provided by the authors) and then compute a sparse binary matrix using reweighted rows of the NMF basis. Our second construction (CW-GT) is the constant weight construction defined in supplementary A.1. For both constructions, the number of nonzeros (sparsity) per column of $A$ is selected using the search method described in Remark 1, see supplement for more details.

Figure 2 plots the modified precision values we obtained for the three constructions (red star is NMF-GT, blue circle is CW-GT, and black triangle is SP-GT) as the number of groups $m$ increases. The first two plots correspond to the Bibtex dataset, and the next two correspond to the RCV1x dataset. As expected, the performance of all constructions improves as the number of groups increases. Note that NMF-GT consistently outperforms the other two.

In Table 2, we compare the three constructions discussed in this paper on four datasets. We also include the One versus All (OvA) method (which is computationally very expensive) to provide a frame of reference. In the table, we list the modified precision for $k \in \{1, 3, 5\}$, the correlation metric from Section 3, the total time, as well as the time taken to predict the labels of the test instances. The NMF-GT method generally performs better than the other two GT constructions, because it groups the labels based on the correlation between them. This observation is supported by the fact that the correlation metric of NMF-GT is the lowest among the three methods. Also note that even though NMF-GT has a longer training time compared to the other GT methods (due to the NMF computation), its prediction time is essentially the same. We also note that the runtimes of all three MLGT methods are much lower than OvA, particularly for larger datasets, as they require far fewer ($O(\log d)$) classifiers.

In most cases, NMF-GT outperforms the other two constructions, and CW-GT performs better than SP-GT. Both NMF-GT and CW-GT ensure that classifiers are trained on similar amounts of data. Decoding will also be efficient since all columns of $A$ have the same support size. Since NMF-GT is the strongest of the three constructions overall, we will use it in the following experiments for comparison with other popular XML methods.

Dataset     Metric       He-NMFGT   NMF-GT    MLCS      SLEEC*    PfastreXML   Parabel
Mediamill   P@1          –          0.8804    0.8359    0.8538    0.9376       0.9358
            P@3          –          0.6069    0.6593    0.6967    0.7701       0.7622
            P@5          –          0.3693    0.4102    0.5562    0.5328       0.5169
            Total time   –          17.2s     20.3s     3.5m      190.1s       74.19s
            Test time    –          0.17s     6.93s     80.5s     18.4s        17.85s

            P@1          –          0.9350    0.9244    0.9034    0.9508       0.9680
            P@3          –          0.6983    0.6945    0.6395    0.7412       0.7510
            P@5          –          0.4502    0.4486    0.4457    0.4993       0.5040
            Total time   –          88.4s     541.1s    34m       7.73m        6.7m
            Test time    –          1.04s     176.7s    53.1s     3.03m        1.68m

Eurlex      P@1          0.9265     0.8477    0.8034    0.7474    0.9004       0.9161
            P@3          0.7084     0.5547    0.5822    0.5885    0.6946       0.7397
            P@5          0.4807     0.3444    0.3965    0.4776    0.4939       0.5048
            Total time   322s       227.3s    343.3s    21m       11.8m        6.1m
            Test time    1.1s       0.94s     235.1s    45s       59.2s        74.3s

            P@1          0.9478     0.8629    0.7837    0.8053    0.9098       0.9221
            P@3          0.6555     0.5922    0.5469    0.5622    0.6722       0.6957
            P@5          0.4474     0.3915    0.3257    0.4152    0.5119       0.5226
            Total time   8.7m       7.5m      19.7m     68.8m     27.5m        16.9m
            Test time    4.42s      4.21s     13.7m     106.3s    241.6s       114.7s

            P@1          0.9666     0.9155    0.5223    0.8079    0.9289       0.9410
            P@3          0.7987     0.6353    0.2995    0.5050    0.7269       0.7880
            P@5          0.5614     0.4105    0.1724    0.3526    0.5061       0.5502
            Total time   14.7m      13.6m     63m       54.9m     40.5m        33.5m
            Test time    11.5s      9.82s     45m       51.3s     8.2m         4.2m

Table 3: Comparisons between different MLC methods. Metric: Modified Precision

Comparison with popular methods:

We next compare the NMF-GT method (the best one from our previous experiments) and the hierarchical method (He-NMFGT) with four popular methods, namely MLCS [14], SLEEC [5], PfastreXML [16], and Parabel [28], with respect to the modified precision metric. Table 3 summarizes the results obtained by these six methods for different datasets along with the total computation time and the test prediction time. The number of groups used in NMF-GT and the number of blocks used in He-NMFGT are also given.

We note that NMF-GT performs fairly well given its low computational burden. The hierarchical approach He-NMFGT yields superior accuracies with runtimes similar to those of NMF-GT (it outperforms the other methods with respect to modified precision at $k = 1$). PfastreXML and Parabel yield slightly more accurate results in some cases, but require significantly longer run times. Note that the prediction times for our methods are orders of magnitude lower in some cases. For He-NMFGT, the total time includes computing the partition, applying MLGT for one block (since this can be done in parallel), and predicting the labels of all test instances. For the two smaller datasets, He-NMFGT was not used since they lacked well-defined partitions.

Method types: Embedding (He-NMFGT, SLEEC), Tree (PfastreXML, Parabel, XT), OvA (Dismec, PPD-sparse), DNN (XML-CNN)
Dataset Metric He-NMFGT SLEEC PfastreXML Parabel XT Dismec PPD-sparse XML-CNN
Eurlex P@1 (%) 75.04 74.74 73.63 74.54 83.67 83.83 76.38
       P@3 (%) 61.08 58.88 60.31 61.72 70.70 70.72 62.81
       P@5 (%) 48.07 47.76 49.39 50.48 59.14 59.21 51.41
       Train time 4.8m 20m 10.8m 5.4m 0.94hr 0.15hr 0.28hr
       Test time 0.28ms 4.87ms 1.82ms 0.91ms 7.05ms 1.14ms 0.38ms

       P@1 (%) 82.28 80.78 82.03 83.77 85.23 85.20 73.80 82.78
       P@3 (%) 69.68 50.50 67.43 71.96 73.18 74.60 60.90 66.34
       P@5 (%) 56.14 35.36 52.61 55.02 63.39 65.90 50.40 56.23
       Train time 14.2m 53m 32.3m 29.3m 18m 88m
       Test time 0.69ms 7.7ms 74.1ms 38.1ms 1.83ms 1.39s

       P@1 (%) 55.62 54.83 56.05 64.38 58.73 64.94 64.08
       P@3 (%) 33.81 33.42 36.79 42.40 39.24 42.71 41.26
       P@5 (%) 23.04 23.85 27.09 31.14 29.26 31.5 30.12
       Train time 47.5m 18.3hr 7.4hr 3.62hr 9.2hr 750hr 3.9hr
       Test time 0.8ms 5.7ms 2.2ms 1.2ms 0.8ms 43m 37ms

       P@1 (%) 39.60 35.05 39.46 43.90 39.90 45.37 45.32 35.39
       P@3 (%) 36.78 31.25 35.81 39.42 35.60 40.40 40.37 33.74
       P@5 (%) 32.40 28.56 33.05 36.09 32.04 36.96 36.92 32.64
       Train time 47.8m 11.3hr 1.23hr 1.54hr 4.0hr 373hr 1.71hr 52.2hr
       Test time 1.45ms 18.5ms 19.3ms 2.8ms 1.7ms 429ms 429ms 16.2ms

Table 4: Comparisons between different XML methods. Metric: Standard Precision

In Table 4 we compare the performance of He-NMFGT with several other popular XML methods wrt. the standard precision metric. We compare the accuracies and computational costs for He-NMFGT, SLEEC (embedding method), three tree methods (PfastreXML, Parabel, ExtremeText XT), two OvA methods (Dismec, PPD-sparse) and a DNN method XML-CNN (see sec. 1 for references). The precision results and the runtimes for the four additional methods were obtained from [28, 40]. In the table, a ‘–’ indicates that these results were not reported by the authors.

We note that, compared to other methods, He-NMFGT is significantly faster in both training and test times, and yet yields comparable results. The other methods have several parameters that need to be tuned. More importantly, the main routines of most other methods are written in C/C++, while He-NMFGT was implemented in Matlab and hence we believe the run times can be improved to enable truly real-time predictions. The code for our method will be made publicly available (Matlab code is provided in the supplement for review). Several additional results, implementation details and result discussions are given in the supplement.

Figure 3: Prec@1 for test data instances for bibtex (left) and RCV1x (right) datasets as a function of the fraction of training data used. Error bars are over 5 trials.

Learning with less training data:

In supervised learning problems such as MLC, training highly accurate models requires large volumes of labeled data, and creating such volumes of labeled data can be very expensive in many applications [21, 41]. As a result, there is an increasing interest among research agencies in developing learning algorithms that achieve ‘Learning with Less Labels’ (LwLL). Since MLGT requires training only one classifier per group (as opposed to one classifier per label in OvA or other methods), we will need less labeled data for training the model. In section 5, we present preliminary results that demonstrate how MLGT achieves learning with less data for MLC.

Here, we present preliminary results that demonstrate how MLGT achieves higher precision with less training data than the OvA method (see Table 2 in the supplement). Figure 3 plots the precision (Prec@1) on test data instances for the bibtex (left) and RCV1x (right) datasets when different fractions of the training data were used to train the MLGT and OvA models. We note that MLGT achieves the same accuracy as OvA with only 15-20% of the number of training points (that is, at least five times less training data). We used the same binary classifiers for both methods, and MLGT requires only one classifier per group, as opposed to OvA, which needs one classifier per label. Therefore, MLGT likely requires fewer training data instances.


In this paper, we extended the MLGT framework [35] and presented new GT constructions (constant weight and data dependent), and a fast prediction algorithm that requires logarithmic time in the number of labels. We then presented a hierarchical partitioning approach to scale the MLGT approach to larger datasets. Our computational results show that the NMF construction yields superior performance compared to other GT matrices. We also presented a theoretical analysis which showed why the proposed data-dependent method (with a non-trivial data-dependent sampling approach) will perform well. With a comprehensive set of experiments, we showed that our method is significantly faster in both training and test times, and yet yields competitive results compared to other popular XML methods.


  • [1] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13–24. ACM, 2013.
  • [2] R. Babbar and B. Schölkopf. Dismec: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 721–729. ACM, 2017.
  • [3] R. Babbar and B. Schölkopf. Data scarcity, robustness and extreme multi-label classification. Machine Learning, 108(8-9):1329–1351, 2019.
  • [4] R. Berinde, A. Gilbert, P. Indyk, H. Karloff, and M. Strauss. Combining geometry and combinatorics: A unified approach to sparse signal recovery. CoRR, abs/0804.4666, 01 2008.
  • [5] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain. Sparse local embeddings for extreme multi-label classification. In Advances in Neural Information Processing Systems, pages 730–738, 2015.
  • [6] W. Bi and J. T. Y. Kwok. Efficient multi-label classification with many labels. In 30th International Conference on Machine Learning, ICML 2013, pages 405–413, 2013.
  • [7] W.-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, and I. Dhillon. A modular deep learning approach for extreme multi-label text classification. arXiv preprint arXiv:1905.02331, 2019.
  • [8] Y.-N. Chen and H.-T. Lin. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems, pages 1529–1537, 2012.
  • [9] J. Deng, S. Satheesh, A. C. Berg, and F. Li. Fast and balanced: Efficient label tree learning for large scale object recognition. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 567–575. Curran Associates, Inc., 2011.
  • [10] R. Gallager. Low-density parity-check codes. IRE Transactions on Information Theory, 8(1):21–28, January 1962.
  • [11] A. Gupta. Fast and effective algorithms for graph partitioning and sparse-matrix ordering. IBM Journal of Research and Development, 41(1.2):171–183, 1997.
  • [12] V. Gupta, R. Wadbude, N. Natarajan, H. Karnick, P. Jain, and P. Rai. Distributional semantics meets multi-label learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3747–3754, 2019.
  • [13] C. H. Q. Ding and X. He. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the SIAM International Conference on Data Mining, pages 606–610, 01 2005.
  • [14] D. Hsu, S. M. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. NIPS, 22:772–780, 2009.
  • [15] H. Jain, V. Balasubramanian, B. Chunduri, and M. Varma. Slice: Scalable linear extreme classifiers trained on 100 million labels for related searches. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM ’19, pages 528–536. ACM, 2019.
  • [16] H. Jain, Y. Prabhu, and M. Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935–944. ACM, 2016.
  • [17] A. Jalan and P. Kar. Accelerating extreme classification via adaptive feature agglomeration. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 2600–2606. International Joint Conferences on Artificial Intelligence Organization, 7 2019.
  • [18] K. Jasinska, K. Dembczynski, R. Busa-Fekete, K. Pfannschmidt, T. Klerx, and E. Hullermeier. Extreme f-measure maximization using sparse probability estimates. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1435–1444, New York, New York, USA, 20–22 Jun 2016.
  • [19] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):359–392, 1998.
  • [20] S. Khandagale, H. Xiao, and R. Babbar. Bonsai-diverse and shallow trees for extreme multi-label classification. arXiv preprint arXiv:1904.08249, 2019.
  • [21] A. Klein and J. Tourville. 101 labeled brain images and a consistent human cortical labeling protocol. Frontiers in neuroscience, 6:171, 2012.
  • [22] D. Kuang, C. Ding, and H. Park. Symmetric nonnegative matrix factorization for graph clustering. In Proceedings of the 2012 SIAM International Conference on Data Mining, pages 106–117, 2012.
  • [23] K. Lee, R. Pedarsani, and K. Ramchandran. Saffron: A fast, efficient, and robust framework for group testing based on sparse-graph codes. In 2016 IEEE International Symposium on Information Theory (ISIT), pages 2873–2877, July 2016.
  • [24] J. Liu, W.-C. Chang, Y. Wu, and Y. Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 115–124, New York, NY, USA, 2017.
  • [25] C. McDiarmid. On the method of bounded differences. Surveys in combinatorics, 141(1):148–188, 1989.
  • [26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
  • [27] Y. Prabhu, A. Kag, S. Gopinath, K. Dahiya, S. Harsola, R. Agrawal, and M. Varma. Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, WSDM ’18, pages 441–449. ACM, 2018.
  • [28] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 993–1002, Republic and Canton of Geneva, Switzerland, 2018.
  • [29] Y. Prabhu and M. Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 263–272. ACM, 2014.
  • [30] Y. Saad. Iterative methods for sparse linear systems, volume 82. siam, 2003.
  • [31] W. Siblini, P. Kuntz, and F. Meyer. Craftml, an efficient clustering-based random forest for extreme multi-label learning. In International Conference on Machine Learning, pages 4664–4673, 2018.
  • [32] F. Tai and H.-T. Lin. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508–2542, 2012.
  • [33] K. Trohidis. Multi-label classification of music into emotions. In 9th International Conference on Music Information Retrieval, pages 325–330, 2008.
  • [34] G. Tsoumakas, I. Katakis, and I. Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08), 2008.
  • [35] S. Ubaru and A. Mazumdar. Multilabel classification with group testing and codes. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3492–3501. JMLR. org, 2017.
  • [36] S. Ubaru, A. Mazumdar, and A. Barg. Group testing schemes from low-weight codewords of BCH codes. In Information Theory (ISIT), 2016 IEEE International Symposium on, pages 2863–2867. IEEE, 2016.
  • [37] A. Vandaele, N. Gillis, Q. Lei, K. Zhong, and I. Dhillon. Efficient and non-convex coordinate descent for symmetric nonnegative matrix factorization. IEEE Transactions on Signal Processing, 64(21):5571–5584, 2016.
  • [38] A. Vem, N. Thenkarai Janakiraman, and K. Narayanan. Group testing using left-and-right-regular sparse-graph codes, 2017.
  • [39] C. Wang, S. Yan, L. Zhang, and H.-J. Zhang. Multi-label sparse coding for automatic image annotation. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1643–1650. IEEE, 2009.
  • [40] M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, and K. Dembczynski. A no-regret generalization of hierarchical softmax to extreme multi-label classification. In Advances in Neural Information Processing Systems, pages 6355–6366, 2018.
  • [41] C. Xu, D. Tao, and C. Xu. Robust extreme multi-label learning. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1275–1284. ACM, 2016.
  • [42] I. E.-H. Yen, X. Huang, W. Dai, P. Ravikumar, I. Dhillon, and E. Xing. Ppdsparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD, pages 545–553, 2017.
  • [43] I. E. H. Yen, X. Huang, K. Zhong, P. Ravikumar, and I. S. Dhillon. Pd-sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of the 33rd International Conference on International Conference on Machine Learning - Volume 48, ICML’16, pages 3069–3077, 2016.
  • [44] R. You, S. Dai, Z. Zhang, H. Mamitsuka, and S. Zhu. Attentionxml: Extreme multi-label text classification with multi-label attention based recurrent neural networks. arXiv preprint arXiv:1811.01727, 2018.
  • [45] H.-f. Yu, P. Jain, P. Kar, and I. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 593–601, 2014.
  • [46] M.-L. Zhang, Y.-K. Li, X.-Y. Liu, and X. Geng. Binary relevance for multi-label learning: an overview. Frontiers of Computer Science, 2018.
  • [47] W. Zhang, J. Yan, X. Wang, and H. Zha. Deep extreme multi-label learning. In Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, ICMR ’18, pages 100–107, New York, NY, USA, 2018. ACM.
  • [48] Y. Zhang and J. G. Schneider. Multi-label output codes using canonical correlation analysis. In AISTATS, pages 873–882, 2011.

Appendix A Constant Weight Construction

In this supplement, we first describe two constant weight constructions, where each group has the same number of labels, and each label belongs to the same number of groups. Such constructions have been shown to perform well in the group testing problem [36, 38].

A.1 Randomized construction

The first construction we consider is based on LDPC (low-density parity-check) codes. Gallager proposed a low-density code with constant weights in [10]. We can develop a constant weight GT matrix based on this LDPC construction as follows. Suppose the matrix we desire has $d$ columns with a constant $c$ ones in each column and $r$ ones in each row. The LDPC matrix will then have $dc/r$ rows in total. The matrix is divided into $c$ submatrices, each containing a single 1 in each column. The first of these submatrices contains all the ones in descending order, i.e., the $i$th row will have ones in the columns $(i-1)r+1$ to $ir$. The remaining submatrices are simply column permutations of the first. We consider this construction in our experiments.
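The Gallager-style construction described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in the experiments: the function name and the argument names `d` (number of columns), `c` (ones per column) and `r` (ones per row) are chosen here for illustration, and the sketch assumes that `r` divides `d`.

```python
import numpy as np

def gallager_constant_weight(d, c, r, seed=0):
    """Return a (d*c/r) x d binary matrix with c ones per column and r ones per row."""
    if d % r != 0:
        raise ValueError("this sketch assumes that r divides d")
    rng = np.random.default_rng(seed)
    rows_per_block = d // r
    # First submatrix: row i (0-based) has ones in columns i*r .. (i+1)*r - 1,
    # so every column contains exactly one 1.
    first = np.zeros((rows_per_block, d), dtype=np.uint8)
    for i in range(rows_per_block):
        first[i, i * r:(i + 1) * r] = 1
    blocks = [first]
    # Remaining c - 1 submatrices: random column permutations of the first.
    for _ in range(c - 1):
        blocks.append(first[:, rng.permutation(d)])
    return np.vstack(blocks)

A = gallager_constant_weight(d=12, c=3, r=4)
print(A.shape)        # number of groups x number of labels
print(A.sum(axis=0))  # ones per column
print(A.sum(axis=1))  # ones per row
```

Because every submatrix is a column permutation of the first, the column and row weights are constant by construction; only the overlap between columns depends on the random permutations.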

A.2 SAFFRON construction

Recently, in [23], a bipartite graph based GT construction called SAFFRON (Sparse-grAph codes Framework For gROup testiNg) was proposed. [38] extended this SAFFRON construction to form left-and-right-regular sparse-graph codes called regular-SAFFRON. The adjacency matrices corresponding to such graphs give us the desired constant weight constructions. The regular-SAFFRON construction starts with a left-and-right-regular graph whose left nodes are called variable nodes and whose right nodes are called bin nodes. The edge connections from the left and the edge connections from the right are paired up according to a random permutation.

Let be the adjacency matrix corresponding to the left-and-right-regular graph . Then, has ones in each column and ones in each row. Let be the universal signature matrix (see [4, 38] for definition). If is the row of , then the GT matrix A is formed as , where the submatrix of size . The total tests will be . We have the following recovery guarantee of this construction:

Proposition 1.

Suppose we wish to recover a sparse binary vector . A binary testing matrix formed from the regular-SAFFRON graph with tests recovers proportion of the support of correctly with high probability (w.h.p), for any . With , we can recover the whole support set w.h.p. The constants and depend on and the error tolerance . The computational complexity of the decoding scheme will be .

Proof of the proposition can be found in [38]. The decoding algorithm was discussed in the main text.

Appendix B Proof of Theorem 1

Next, we sketch the proof of Theorem 1 in the main text.


Let us denote the entries of and as and respectively, . From our construction: and

First, let us find the probability that Since will be 0 if and only if the support of th row of has no intersection with the support of , hence,

Now note that, . Therefore, It turns out that,

Now, we consider two cases. When , . On the other hand, when , . Therefore,

Hence, when ,

But when ,

We can make stronger claims to bolster this theorem. Since the random variables

are all Lipschitz functions of independent underlying variables, McDiarmid's inequality [25] implies that they are tightly concentrated around their respective mean values.
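For reference, the standard form of McDiarmid's bounded-differences inequality used in this kind of argument is stated below. The symbols $f$, $X_i$, $c_i$ and $t$ are generic and are not the notation of Theorem 1: $X_1,\dots,X_n$ are independent random variables and $c_i$ bounds how much $f$ can change when only its $i$th argument changes.

```latex
\text{If } \sup_{x_1,\dots,x_n,\,x_i'}
\bigl| f(x_1,\dots,x_i,\dots,x_n) - f(x_1,\dots,x_i',\dots,x_n) \bigr| \le c_i
\quad \text{for all } i = 1,\dots,n,
\text{ then for every } t > 0:
\qquad
\Pr\Bigl( \bigl| f(X_1,\dots,X_n) - \mathbb{E}\, f(X_1,\dots,X_n) \bigr| \ge t \Bigr)
\;\le\; 2 \exp\!\left( - \frac{2 t^2}{\sum_{i=1}^{n} c_i^2} \right).
```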

Appendix C Additional experimental results

Here, we present additional results and further discuss the results we presented in the main text for the proposed methods. We then give a few results which help us better understand the parameters that affect the performance of our MLGT method. First, we describe the evaluation metrics used in the main text and here for comparison.
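As a point of reference, the standard precision metric P@k (the fraction of the k highest-scoring labels that are true labels, averaged over instances) can be sketched as follows. This is a minimal illustration of the usual definition only; the function name and the toy arrays are hypothetical, and the modified precision used in Table 3 of the main text is not covered here.

```python
import numpy as np

def precision_at_k(scores, y_true, k):
    """Standard P@k for a batch of instances.

    scores : (n, d) array of real-valued label scores
    y_true : (n, d) binary array of ground-truth labels
    """
    scores = np.asarray(scores, dtype=float)
    y_true = np.asarray(y_true)
    # Indices of the k largest scores in each row.
    topk = np.argsort(-scores, axis=1, kind="stable")[:, :k]
    # 1 where a top-k label is a true label, 0 otherwise.
    hits = np.take_along_axis(y_true, topk, axis=1)
    return float(hits.sum(axis=1).mean() / k)

# Toy example with two instances and four labels.
scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.1, 0.6]])
y_true = np.array([[1, 0, 0, 1],
                   [0, 1, 0, 1]])
print(precision_at_k(scores, y_true, k=1))
print(precision_at_k(scores, y_true, k=3))
```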

Results discussion:

In Table 3 of the main text, we summarized the results obtained for six methods for different datasets. We note that NMF-GT performs very well given its low computational burden. PfastreXML and Parabel, on the other hand, yield slightly more accurate results but require significantly longer run times.

Note that, compared to MLGT, the other methods require significantly more time for training. This is because the tree-based methods use k-means clustering recursively to build the label tree(s), and require several OvA classifiers to be trained, one for each label in the leaf nodes. OvA methods are expensive since they learn one classifier per label. Moreover, the prediction time for MLGT is also orders of magnitude lower than that of many of the popular methods. In addition, the other methods have several parameters that need to be tuned (we used the default settings provided by the authors). We also note that the main routines of most other methods are written in C/C++, while MLGT was implemented in Matlab, and hence the run times can be further improved to enable truly real-time predictions.

In Table 4 of the main paper, for the two large datasets, the label set was divided into blocks of sizes roughly around . We also used negative sampling of the training data for each block, as done in many recent XML works [28, 15]. We also reduced the feature dimension via sketching. For hierarchical partitioning, we used the vertex separator approach described in the main text, using the FORTRAN code provided by the author of [11]. The reorderings for the four datasets in Table 3 are given in Figure 1 of the main text. The approach is extremely fast, and the runtimes for reordering and partitioning the four datasets were:

Eurlex: 0.5s; Wiki10: 4.11s; WikiLSHTC: 40.3s; and Amazon670: 15.5s.

For Eurlex and Wiki10, the accuracy and runtime results for SLEEC, PfastreXML and Parabel were computed by us using their Matlab codes. Results for these three methods for the remaining two datasets, and all results for the additional four methods (Dismec, PPD-sparse, XT and XML-CNN), were obtained from [28] and [40]. All runtimes are based on single-core implementations.

Figure 4: Analysis: (Left) Relation between P@k and the correlation metric , (Middle) Relation between P@k and column sparsity c, and (Right) Performance of NMF for larger .

MLGT Analysis:

We conducted several numerical tests to analyze the performance of MLGT with respect to various settings. Figure 4 presents a few of these numerical analysis results, which help us understand the performance of MLGT better. In the left figure, we plot the P@k achieved by MLGT with different GT constructions as a function of the correlation metric. The different points (circles) in the plot correspond to different GT matrices with different values of this metric. These GT matrices were formed by randomly permuting k-disjunct matrices and changing their sizes. We observe that GT matrices with lower values of the metric yield better classification. These results motivated us to develop the data-dependent grouping approach.

In the middle plot, we show the performance of the NMF-GT method for different column sparsities c. We clearly note that as c increases, the performance first improves and then degrades for larger c. This is because, for larger c, the GT matrix will have higher coherence between the columns. As indicated in our analysis, the performance of the GT construction will depend on this coherence. This analysis motivated us to use the search technique described in Remark 1 to select the optimal column sparsity c.
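To make the notion of coherence between columns concrete, the sketch below computes the usual mutual coherence of a matrix, i.e. the largest normalized inner product between two distinct columns. This is the textbook definition, used here as an assumption; it is not necessarily the exact quantity that appears in the analysis of the main text, and the function name is hypothetical.

```python
import numpy as np

def column_coherence(A):
    """Largest normalized inner product between two distinct columns of A."""
    A = np.asarray(A, dtype=float)
    norms = np.linalg.norm(A, axis=0)
    if np.any(norms == 0):
        raise ValueError("every column needs at least one nonzero entry")
    # Gram matrix of the unit-normalized columns.
    G = (A.T @ A) / np.outer(norms, norms)
    # Ignore the diagonal (each column compared with itself).
    np.fill_diagonal(G, 0.0)
    return float(np.abs(G).max())

# Three labels, each placed in two of four groups; every pair of labels shares one group.
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 0]])
print(column_coherence(A))          # overlap of 1 group out of 2 per label
print(column_coherence(np.eye(3)))  # disjoint supports
```

For a binary GT matrix with c ones per column, the normalized inner product of two columns is the number of groups they share divided by c, so a high coherence means that some pair of labels is hard to tell apart from the group outcomes.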

In the right plot, we compare the performance of NMF-GT vs CW-GT as a function of the number of groups for the Eurlex dataset. We observe that for a smaller number of groups, NMF-GT performs better. However, for a larger number of groups, and more so for a larger number of labels, NMF-GT becomes less accurate. This is due to the difficulty in computing an accurate NMF for such large matrices. NMF is known to be an NP-hard problem. This result likely explains why NMF-GT's performance on larger datasets is less accurate. A possible approach to improve the accuracy of NMF-GT is to use the hierarchical approach described above to split the large label set into smaller disjoint subsets, and apply NMF-GT to each subset independently.

Dataset      No. of labels   NMF-GT R-Loss   NMF-GT T-Loss   CW-GT R-Loss   CW-GT T-Loss
Bibtex       159             3.49            3.68            2.95           4.30
RCV1-2K      2016            3.99            4.72            3.96           4.91
EurLex-4K    3993            1.38            4.77            1.05           5.03
Table 5: Average Hamming loss errors in reduction vs. training

In Table 5, we list the average Hamming loss errors we suffer in label reduction (and decoding) when using NMF-GT and CW-GT for the three datasets. That is, we check the average error in the group testing procedure (label reduction and decoding), without classifiers. We also list the average Hamming loss on the training data after classification for comparison. We observe that NMF-GT has a worse reduction loss than CW-GT. This is because NMF-GT is data dependent and is not close to being k-disjunct, as opposed to CW-GT, which is random. However, we note that the training loss of NMF-GT is better. This shows that, even though the reduction-decoding is imperfect (introduces more noise), NMF-GT results in better individual classifiers. These comparisons show that data-dependent grouping will indeed result in improved classifiers.
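The reduction loss discussed above (the error of the group testing step alone, without classifiers) can be illustrated with a small sketch. It assumes a Boolean OR reduction of the label vector and a simple noiseless decoder that declares a label active only if every group containing it is positive; this decoder is a simplification chosen for illustration and is not the decoding algorithm of the main text. The function names and the toy matrix are hypothetical, and the Hamming loss is taken here as the number of differing entries.

```python
import numpy as np

def reduce_labels(A, y):
    """Boolean OR reduction: group i is positive if it contains any active label."""
    return (A @ y > 0).astype(int)

def simple_decode(A, z):
    """Declare label j active iff no group containing j tested negative.

    Note: a label that belongs to no group is always declared active.
    """
    misses = A.T @ (1 - z)  # per label: number of its groups that are negative
    return (misses == 0).astype(int)

def hamming_loss(y, y_hat):
    """Number of positions where the two binary vectors differ."""
    return int(np.sum(np.asarray(y) != np.asarray(y_hat)))

# 4 groups x 4 labels, each label in two groups.
A = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])

y1 = np.array([1, 0, 0, 0])
y2 = np.array([1, 0, 1, 0])
for y in (y1, y2):
    y_hat = simple_decode(A, reduce_labels(A, y))
    print(y, y_hat, hamming_loss(y, y_hat))
```

In this toy case the single-label vector is recovered exactly, while the two-label vector makes every group positive, so the decoder cannot rule out any label. That is the kind of reduction error that grows when the grouping matrix is far from disjunct.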

Implementation details:

All experiments for NMF-GT and He-NMFGT were implemented in Matlab and conducted on a standard workstation with an Intel i5 core 2.3GHz processor. The timings reported were computed using the cputime function in Matlab. For the SLEEC method, we could not compute as in eq. 2, since the source code did not output the score matrix. The reported for SLEEC in Table 4 were the P@k returned by the source code. Also, for the last 2 examples, SLEEC was run for 50 iterations (for the rest it was 200).