1 Introduction
Extreme multi-label classification (XMC) refers to the problem of assigning to an instance the most relevant subset of labels from an enormous label collection, where the number of labels could be in the millions or more. The XMC setting is universal in various industrial applications such as YouTube recommendation [4], Bing's dynamic search advertising [26], and tagging of Wikipedia categories in the PASCAL Large-Scale Hierarchical Text Classification (LSHTC) challenge [23], to name just a few.
The huge label space raises research challenges such as data sparsity and scalability for existing multi-label algorithms. Among the state-of-the-art XMC approaches, one-vs-all approaches, such as DiSMEC [2], often achieve the highest accuracy, but suffer a severe computational burden in both the training and prediction phases if not implemented carefully. Various techniques have been proposed to improve their efficiency. Sparse structures [35, 34] are introduced into one-vs-all classifiers to reduce computational complexity. Embedding-based methods [5, 36] compress the label space into a low-dimensional space. Tree-based methods [27, 28] learn an ensemble of weak but fast classification trees, which, however, leads to a large model size. Recently, label-partitioning approaches, such as Parabel [26] and label filtering [22], have shown significant computational gains over existing methods while achieving comparable accuracy.
The label-partitioning approaches inspired us to build connections between XMC and information retrieval (IR), where the goal is to find relevant documents for a given query from an extremely large number of documents. To handle the large number of documents, an IR engine typically performs the search in the following steps [13]: 1) indexing: building an efficient data structure to index the documents; 2) matching: finding the document indices that the query belongs to; 3) ranking: sorting the documents in the retrieved indices. An XMC problem can be connected to an IR problem as follows: the large number of labels is analogous to the large number of documents indexed by a search engine, and the instance to be labeled can be viewed as the query. To unify the terminology, we call queries in IR and instances in XMC sources, and call documents in IR and labels in XMC targets. With this unified terminology, the goal of both IR and XMC can be described as identifying relevant targets for a given source from an extremely large collection of targets. This connection enables us to establish an IR-like multi-stage framework that tackles XMC problems in a much more modular manner. Not surprisingly, many existing XMC approaches can be dissected and analyzed under this framework. Albeit similar in terms of stages, the challenge of each component in XMC is very different from its counterpart in an IR system. For example, the source and targets usually share the same token space in an IR system, while the source and targets of an XMC problem can come from two very different domains.
In Section 2, we will not only establish a three-stage framework for XMC problems but also highlight the differences between each stage and its counterpart in an IR system.
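As a concrete illustration, the three stages can be sketched as a toy retrieval pipeline. All function names and the nearest-centroid-style scorers below are our own illustrative stand-ins, not part of any system described in this paper.

```python
# A toy end-to-end sketch of the indexing / matching / ranking pipeline.
# Labels and instances are dense feature vectors; every component is a
# simple inner-product stand-in for the real models.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_index(label_feats, n_clusters):
    """Indexing: assign each label to a cluster by a simple hash of its
    strongest feature (a stand-in for clustering on label representations)."""
    clusters = [[] for _ in range(n_clusters)]
    for l, feat in enumerate(label_feats):
        k = max(range(len(feat)), key=feat.__getitem__) % n_clusters
        clusters[k].append(l)
    return clusters

def match(x, label_feats, clusters, top_c):
    """Matching: score clusters by their centroid's inner product with x."""
    def centroid(members):
        d = len(label_feats[0])
        return [sum(label_feats[l][j] for l in members) / max(len(members), 1)
                for j in range(d)]
    scored = sorted(range(len(clusters)),
                    key=lambda k: -dot(x, centroid(clusters[k])))
    return scored[:top_c]

def rank(x, label_feats, candidate_labels, top_k):
    """Ranking: order only the labels from the retrieved clusters."""
    return sorted(candidate_labels, key=lambda l: -dot(x, label_feats[l]))[:top_k]

# Usage: 6 labels in a 2-D feature space, 3 clusters.
label_feats = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [0.5, 0.5], [0.6, 0.4]]
clusters = build_index(label_feats, 3)
x = [1.0, 0.2]
retrieved = match(x, label_feats, clusters, top_c=1)
candidates = [l for k in retrieved for l in clusters[k]]
print(rank(x, label_feats, candidates, top_k=2))
```

Note that ranking touches only the labels of the retrieved clusters, which is the source of the computational savings discussed below.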
In this paper, we propose a modular deep learning approach for XMC problems. The contributions of this paper are summarized as follows:

We establish a multi-stage framework for XMC problems. This framework not only unifies many existing XMC approaches but also enables the design of new XMC approaches in a modular manner.

Under this framework, we propose SLINMER, a modular deep learning approach for extreme multi-label classification. SLINMER consists of a Semantic Label Indexing component, a Neural Matching component, and an Efficient Ranking component.

The semantic label indexing in SLINMER is a flexible component that can incorporate various kinds of semantic label information, and we propose to use a biLSTM-SA [19] model to better match the input to a set of good candidate labels. We also propose to ensemble various configurations of SLINMER to further improve performance.

We also develop a doubly-sparse data structure for real-time inference with extremely sparse weight matrices.

With an extensive empirical study, we demonstrate the superiority of the proposed SLINMER over existing XMC approaches. In particular, on a Wiki dataset with around half a million labels, precision@1 is substantially improved by an ensemble of various configurations of SLINMER.
This paper is organized as follows. In Section 2, we establish a multi-stage framework for XMC problems, and in Section 3 we show how existing XMC approaches connect to this framework. Our proposed SLINMER is introduced in Section 4. We then perform extensive experiments to demonstrate its superiority in Section 5 and conclude in Section 6. All data splits and source code will be made publicly available.
2 A Multi-stage Framework for XMC
Motivated by the success of the three-stage IR framework in handling extremely large numbers of targets, we follow it to develop a general framework for XMC that consists of the following stages: 1) indexing the labels, 2) matching the label indices, and 3) ranking the labels from the retrieved indices. See Figure 1 for an illustration of the framework. In the following, starting from the definition of MLC problems, we present a probabilistic model for our framework and then discuss its three stages.
Notations and Definitions
Formally, multi-label classification (MLC) is the task of learning a function f that maps an input x to its target y ∈ {0, 1}^L, where L is the number of unique labels. Assume that we have a set of n training samples {(x_i, y_i)}_{i=1}^n. We use Y ∈ {0, 1}^{n×L}, whose i-th row is y_i, to represent the label matrix. For some special datasets, we have additional label information; for example, each label in the Wikipedia dataset [23] is named by words, such as "Zoos in Mexico" and "Bacon drinks". We will therefore use z_l as the feature representation of label l, which may come either from the label information itself or from other approaches.
A Probabilistic Model.
We formulate our framework from a probabilistic perspective. Assume that after indexing we have K clusters of labels, {C_1, ..., C_K}, where each C_k is a subset of the label indices, i.e., C_k ⊆ {1, ..., L}. For a given instance x, the probability of the l-th label being relevant to x is P(y_l = 1 | x). We can form the probabilistic model as follows:

  P(y_l = 1 | x) = sum_{k=1}^{K} P(y_l = 1 | c_k = 1, x) P(c_k = 1 | x).

Here P(c_k = 1 | x; θ) is the matching model with θ as its parameters, and P(y_l = 1 | c_k = 1, x; φ) is the ranking model with φ as its parameters. For the ranking model, we assume

  P(y_l = 1 | c_k = 1, x) = 0 for all l ∉ C_k.
In other words, during the ranking stage, only labels in the retrieved clusters are considered. This assumption is reasonable when the number of labels is extremely large, because in such a scenario there will be many similar labels that can be grouped together. Under this assumption, our framework has the following advantages:

The training time for the ranking model for each cluster can be reduced because it only needs to consider the labels in the cluster and the instances that are relevant to these labels.

The prediction time is also reduced, because once a small set of clusters is chosen, we only need to perform ranking for the labels in these clusters.

Constraining the ranking to a smaller set of labels helps exclude irrelevant labels if the clustering and the matching models are sufficiently good.
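The factorization above can be checked numerically on a small example. The matcher and ranker probabilities here are hard-coded toy values, not learned models.

```python
# A minimal numeric sketch of the factorized model
#   P(y_l = 1 | x) = sum_k P(y_l = 1 | c_k = 1, x) * P(c_k = 1 | x),
# under the assumption P(y_l = 1 | c_k = 1, x) = 0 whenever l is not in C_k.

def predict(clusters, matcher_prob, ranker_prob, n_labels):
    """Combine matching and ranking probabilities into label scores."""
    scores = [0.0] * n_labels
    for k, members in enumerate(clusters):
        for l in members:               # labels outside C_k contribute 0
            scores[l] += ranker_prob[k][l] * matcher_prob[k]
    return scores

clusters = [[0, 1], [2, 3]]
matcher_prob = [0.9, 0.1]                            # P(c_k = 1 | x)
ranker_prob = [{0: 0.8, 1: 0.3}, {2: 0.6, 3: 0.5}]   # P(y_l = 1 | c_k = 1, x)
print(predict(clusters, matcher_prob, ranker_prob, n_labels=4))
```

Because each label appears in exactly one cluster, the outer sum reduces to a single product per label, which is what the loop computes.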
We now briefly touch on each of these stages.
2.1 Label Indexing
The indexing of documents in a search engine relies on rich text information, while the labels in XMC typically lack such information. Thus, we aim to find meaningful label representations in order to build such an indexing system. Several approaches in the literature have implicitly or explicitly studied different ways to represent the labels. For example, embedding-based XMC approaches [5, 36] project the label index to a low-dimensional vector by minimizing the loss between ground-truth labels and predicted labels. HOMER [29] uses the column vectors of the instance-label indicator matrix to represent the labels, while Parabel represents a label as a normalized sum of the features of its relevant training instances. On the other hand, existing deep learning approaches for XMC represent labels by their IDs [20, 21, 15]. However, label IDs do not contain semantic information about the labels. For some special datasets, such as Wikipedia category tagging, each label is a short sequence of words, which can be used as a label representation via word embeddings like ELMo [24].
Once we obtain the label representations, we can build the indexing system; we do so by clustering the labels, as in label partitioning methods. Several clustering algorithms can be used, such as k-means clustering, KD-trees [3] or random projection clustering. Due to the lack of a direct and informative representation of the labels, the indexing system for XMC may be noisy compared to that of an IR problem. Fortunately, the instances in XMC are typically very informative. Therefore, we can utilize the rich information of the instances to build strong matching and ranking systems that compensate for a noisy indexing system.
2.2 Matching
The matching phase in XMC assigns relevant clusters (i.e., indices) to each instance. A high recall in matching is key to the success of a search engine, as the subsequent ranking phase operates only on the documents retrieved by matching. The matching stage can itself be viewed as a multi-label classification (MLC) problem in which the clusters are the "labels"; we call this the MLC-matching problem. To build a strong MLC-matching system, we want to exploit the information provided by the instance as much as possible, which, however, may lead to a complex model and a high computational cost. For example, for text inputs, we may need deep learning approaches, such as Seq2Seq [21], CNN [18] and self-attention models [19], to extract the sequential information of the input. However, for general XMC, deep learning approaches suffer from high computational complexity compared to classical linear classifiers. Fortunately, since the number of clusters can be controlled by the practitioner, we can control the scale of the MLC-matching problem so that its training and inference can still be completed in a reasonable time.
2.3 Ranking
The ranking stage in XMC sorts the labels retrieved from matching according to their relevance to the instance. The ranking part is also a multi-label classification problem, but a much smaller one than the original XMC problem; we call it the MLC-ranking problem. Thanks to the label clustering, both training and inference of the ranking model are efficient. During training, we train an MLC-ranking model for each cluster independently; for each cluster, we only need to include the training samples that have positive labels in that cluster. This significantly reduces the training time when the number of training instances in a cluster is much smaller than the whole training set. The inference time is reduced because we only need to consider the labels in the clusters retrieved from matching. In this paper, we use linear one-vs-all models for the MLC-ranking problem.
3 Related Work and Connections to Our Framework
To deal with the huge number of labels as well as the large training set size, various methods have recently been proposed to reduce the computational complexity of both training and prediction. We put XMC algorithms into three categories: one-vs-all approaches, partitioning methods, and embedding-based approaches. We briefly discuss representative work in each category and its relationship with our framework.
One-Vs-All (OVA) approaches
The naive one-vs-all approach treats each label independently as a binary classification problem: if the label is relevant to the instance, it is positive; otherwise, it is negative. OVA approaches [2, 20, 34, 35] have been shown to achieve high accuracy, but they suffer from high computational cost in both training and prediction when the number of labels is very large. Therefore, several techniques have been proposed to speed them up. PDSparse [35] and PPDSparse [34] introduce primal and dual sparsity to accelerate training as well as prediction. DiSMEC [2] and PPDSparse [34] exploit parallelism and sparsity to speed up the algorithms and reduce the model size. OVA classifiers are also widely used as building blocks in many other approaches; for example, in Parabel, linear OVA classifiers with a small output domain serve as the classifiers at both internal and leaf nodes.
Relation to our framework. Enforcing sparse structures on the weight vectors of OVA approaches can be viewed as building an indexing system. During prediction, for a given sparse input feature vector, we only consider labels whose weight vectors have nonzero entries in common with the input. Therefore, we only need to calculate relevance scores between a small set of labels and the input, which reduces prediction time.
Partitioning methods
There are two ways to incorporate partitioning: input partitioning [1, 27, 16, 25] and label partitioning [26, 17, 29, 33, 22]. Considering the instance-label matrix Y ∈ {0, 1}^{n×L}, where n is the number of training samples and L is the number of labels, input partitioning and label partitioning can be viewed as partitioning the rows and the columns of Y, respectively. When the instance-label matrix is very sparse, each input partition involves only a small subset of labels, and each label partition involves only a small subset of instances. Therefore, both forms of partitioning can reduce training and prediction time significantly. Furthermore, most methods, such as [27, 16, 25, 26, 17, 29], are tree-based, i.e., they build a partitioning tree; a careful choice of tree-based partitioning, such as a balanced 2-means tree, allows prediction time that is sublinear in the tree size.
Relation to our framework. Label partitioning methods [26, 17, 29, 33, 22] are most closely related to our framework, because label partitioning can be viewed as a label indexing procedure. In the following, we discuss how each of them relates to our framework. [33] has a framework similar to ours at prediction time, but the order of building the indexing system and learning the matching model is reversed. They assume there is an existing ranking function, from a separate training algorithm, that maps an instance feature and a label to a score; their goal is to speed up prediction without changing the ranking function. To build a faster prediction framework, a partitioner function is first learned so that instances close to each other are mapped to the same partition. Labels are then assigned to each partition so that the relevant labels for each instance lie in its relevant partition. This framework differs from ours because we consider both training and prediction when building the models: we first build the indexing system, then learn a matching model from the training data, and finally learn a ranking function. [22] applies label filters that preselect a small set of candidate labels before the base classifier is applied; the label filtering step can be viewed as our matching step. To build the label filters, [22] projects the instance features to a one-dimensional space and learns an upper and a lower bound of the range of each label in that space. A label is considered for ranking if the projection of the instance falls into the label's range. However, such label partitioning, constrained to a one-dimensional space, is too limiting for complicated label relationships. Our framework generalizes the label filtering approach in that we do not restrict the method of label partitioning or the label space.
Parabel [26] partitions the labels through a balanced 2-means label tree, using label features constructed from the instances. As mentioned previously, our framework is partially inspired by Parabel. However, Parabel entangles indexing (building the label tree), matching (traversing the internal nodes) and ranking (multi-label classification in the leaf nodes). Our framework separates these three stages so that each stage can be studied and improved independently. HOMER [29] and PLT [17] are similar to Parabel and are also special cases of our framework. Unlike Parabel, HOMER's label tree is built from the instance-label matrix, while PLT organizes its label tree as a probabilistic model.
Embedding-based Approaches
Embedding models [5, 36, 7, 8, 14, 32] use a low-rank representation of the label matrix, so that similarity search over the labels can be performed in a low-dimensional space. Embedding-based methods explore the relationships among labels through latent subspaces; in other words, they assume that the label space can be represented by a low-dimensional latent space in which similar labels have similar latent representations.
Relation to our framework. We can treat the latent subspace as an implicit label partitioning. In practice, however, at comparable computational speedups, embedding-based models often show inferior performance compared to sparse one-vs-all approaches, such as PDSparse [35]/PPDSparse [34], and partitioning approaches, such as Parabel, which may be due to the inefficiency of the label representation structure.
4 SLINMER: A Modular Deep Learning Approach for XMC
Based on our general three-stage framework, we propose a new XMC approach called SLINMER: Semantic Label Indexing, Neural Matching and Efficient Ranking. Concretely, we consider three different semantic label representations for the indexing stage, use biLSTM-SA models for the matching stage, and use a one-vs-all linear classifier as the ranking model. We then ensemble the results obtained with different random seeds as well as different label representations. In this section, we describe each stage of SLINMER in detail.
4.1 Semantic Label Indexing
The goal of this step is to build an effective indexing system for the labels. Instead of using label IDs only, we need semantic information about the labels. If we already have some text information about the labels, such as the short text descriptions of the tags in the Wikipedia dataset, we can use these short texts to represent the labels; for example, we use one of the state-of-the-art word representations, ELMo [24], to represent the words in a label.
However, short texts may not contain sufficient information about a label, and some words in them may be ambiguous, which makes the short-text representation noisy. Moreover, in the general XMC setting there is no side information about the labels at all, so we need to derive label representations from the training data. Here we consider two such representations. The first, used in HOMER [29], takes the columns of the label matrix as label representations, i.e., z_l = Y_{:, l}, where Y_{:, l} is the l-th column of Y. The second, proposed in Parabel, is the normalized sum of the features of all instances relevant to a given label: z_l = v_l / ||v_l||, where v_l = sum_{i: y_{il} = 1} x_i.
In summary, to obtain semantic information about the labels, we explore three methods: 1) ELMo [24], if the label itself carries text information; 2) HOMER [29]; 3) Parabel [26].
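The two training-data-based representations can be sketched as follows; dense Python lists stand in for the sparse tf-idf features used in practice.

```python
# Sketch of the two training-data-based label representations.
# Y[i][l] = 1 if label l is relevant to instance i; X[i] is the
# (here dense, normally sparse tf-idf) feature vector of instance i.
import math

def homer_repr(Y, l):
    """HOMER-style: the l-th column of the label matrix Y."""
    return [row[l] for row in Y]

def parabel_repr(X, Y, l):
    """Parabel-style: normalized sum of features of instances relevant to l."""
    d = len(X[0])
    v = [0.0] * d
    for x, y in zip(X, Y):
        if y[l]:
            for j in range(d):
                v[j] += x[j]
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]

Y = [[1, 0], [1, 1], [0, 1]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(homer_repr(Y, 0))        # column 0 of Y
print(parabel_repr(X, Y, 0))   # normalized sum of X[0] and X[1]
```

Either representation can then be fed to the clustering methods described next.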
Once we have the label features, we can apply different methods to cluster the labels. Here we describe the following three clustering methods: 1) balanced k-means; 2) balanced KD-tree; 3) balanced random projection.
Balanced k-means
We explore the balanced k-means method used in Parabel for label clustering. The basic idea is to cluster the labels into a balanced 2-means hierarchical tree. Specifically, starting from the root node, all labels are partitioned into two child nodes of equal size; we then recurse on the child nodes until the number of labels in a node falls below a threshold. The partitioning at each node is performed by the balanced 2-means algorithm, which enforces the two child nodes to contain the same number of labels while minimizing the distances from the labels to their cluster centers.
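A single balanced 2-means split can be sketched as below. The seeding and the alternating refinement are simplified stand-ins for Parabel's actual implementation, but they show the key idea: sort labels by their similarity difference to the two centroids and cut the sorted list in half, so both children get the same number of labels.

```python
# One balanced 2-means split over label feature vectors.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def balanced_split(label_feats, iters=5):
    ids = list(range(len(label_feats)))
    left, right = label_feats[0], label_feats[-1]     # crude seeding
    for _ in range(iters):
        # Sort by (sim to right) - (sim to left), then cut at the median.
        ids.sort(key=lambda l: dot(label_feats[l], right) - dot(label_feats[l], left))
        half = len(ids) // 2
        a, b = ids[:half], ids[half:]

        def centroid(members):
            d = len(label_feats[0])
            return [sum(label_feats[l][j] for l in members) / len(members)
                    for j in range(d)]

        left, right = centroid(a), centroid(b)        # refine the two centers
    return a, b

feats = [[1, 0], [0.9, 0.1], [0.8, 0.2], [0, 1], [0.1, 0.9], [0.2, 0.8]]
print(balanced_split(feats))
```

Applying this split recursively to each half yields the balanced hierarchical label tree.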
Balanced KD-tree
A KD-tree is another way to partition the labels into a hierarchical tree. First, we partition all labels into two equal-sized clusters according to the first feature of the label representation. Based on this first-level partitioning, we further split the labels in each cluster according to the second feature, and so on recursively until a depth-k hierarchical partitioning tree is built.
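The balanced KD-tree partitioning can be sketched with a short recursion; cycling through feature axes and splitting at the median keeps the two halves equal in size. This is an illustrative sketch, not the exact implementation used in our experiments.

```python
# Balanced KD-tree partitioning: at each level, split the labels into
# two equal halves by the median of the current feature axis, then
# recurse on the next axis until the requested depth is reached.

def kd_partition(ids, feats, depth, axis=0):
    if depth == 0 or len(ids) <= 1:
        return [ids]
    ids = sorted(ids, key=lambda l: feats[l][axis])
    half = len(ids) // 2
    nxt = (axis + 1) % len(feats[0])
    return (kd_partition(ids[:half], feats, depth - 1, nxt) +
            kd_partition(ids[half:], feats, depth - 1, nxt))

feats = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.1], [0.9, 0.8]]
print(kd_partition(list(range(4)), feats, depth=2))
```

A depth-k tree yields 2^k leaf clusters of (nearly) equal size.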
Balanced Random Projection
In this method, we use random projection with ordinal quantization to hash the label representations into B-dimensional binary vectors that serve as the clustering assignment. In particular, if the label representation z is a d-dimensional vector, we sample a random matrix R ∈ R^{B×d} and project the label representation into a B-dimensional space, Rz. For each random feature, we partition the labels into two equal parts according to their ranking on that feature. Each label is then represented by a B-dimensional binary code, with the balance of the label distribution enforced in each random feature independently.
4.2 Neural Matching
Assume that after clustering the labels are partitioned into K clusters C = {C_1, ..., C_K}, where C_k ⊆ {1, ..., L}. The task in matching is to find the relevant clusters for a given instance; in particular, we want to find a mapping g that maps the instance to a subset of the clusters in C. To train the matching model, we need to collect the training "labels", i.e., the ground-truth clusters that every training instance belongs to. We say that cluster C_k is positive for instance x_i if x_i has a positive label in C_k, i.e., there exists l ∈ C_k such that y_{il} = 1. In this setting, the matching stage can itself be viewed as a multi-label classification problem.
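Constructing the matching-stage targets from the label matrix can be sketched as follows.

```python
# Building the matching-stage "labels": cluster C_k is a positive
# target for instance i iff some label in C_k is positive in y_i.

def cluster_targets(Y, clusters):
    """Y is an n x L binary label matrix; returns an n x K binary matrix."""
    targets = []
    for y in Y:
        targets.append([int(any(y[l] for l in members)) for members in clusters])
    return targets

Y = [[1, 0, 0, 1],
     [0, 0, 1, 0]]
clusters = [[0, 1], [2, 3]]
print(cluster_targets(Y, clusters))
```

The resulting n x K matrix replaces Y as the training target for the matching model, shrinking the output space from L labels to K clusters.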
To achieve high efficiency, several existing XMC approaches employ a hierarchical linear model for the matching part. For example, Parabel builds a hierarchical label tree consisting of internal nodes and leaf nodes. The leaf nodes are the partitions of the labels, while the internal nodes form a hierarchical model for the matching part, directing an instance to one of the leaf nodes. In particular, starting from the root node, Parabel randomly splits the clusters in a node into two child nodes of equal size and trains a linear multi-label (2-label) classifier for that node. The partitioning is applied recursively to build a balanced binary tree until each leaf node contains only one cluster. A caveat of the linear approach is that it does not take the order of the words in the text into account, which may lose useful information.
Note that the matching results directly constrain the outcome of the ranking step. Therefore, a good matching model is essential to the final performance, and we must choose the algorithm carefully to avoid compromising accuracy for efficiency.
Sequential neural models have demonstrated great success in various NLP applications, such as gated convolution models for sequence learning [9, 12], self-attention models for text classification [19], and the Transformer and its variants for machine translation [31], to name just a few. We adopt the biLSTM-SA model [19] as the realization of the matching stage because of its superior performance compared to a plain biLSTM and one-dimensional convolutions [18, 20] for text classification. Our experimental results later show that the self-attention model consistently outperforms the simple hierarchical linear model in our setting.
Specifically, a biLSTM-SA model consists of a biLSTM that extracts hidden representations of a sequence of words, followed by a multi-head self-attention mechanism that further refines the word vectors as a learnable weighted sum of the biLSTM hidden vectors, topped with a two-layer MLP using ReLU activations. It is worth noting that the multi-head self-attention cell later became the core component of the Transformer model.
On the other hand, mlc2seq [21] and SeCSeq [6] cast the multi-label text classification problem in a sequence-to-sequence (Seq2Seq) framework, where the input is a sequence of words and the output is a sequence of labels. In some applications, such as Wikipedia categories, labels inherently carry hierarchical information, so an ordering can be defined by traversing from the root node to a leaf node. An advantage of the Seq2Seq framework for XMC is that, with a sufficient number of training examples, the model may learn to output the set of most relevant labels directly, without the need to set a threshold on the prediction outputs.
4.3 Efficient Ranking
After the matching step, we have retrieved a small subset of clusters and only need to rank the labels in these clusters. The goal of the ranking model is to capture the relevance between the instance and the retrieved labels. Formally, given a label and an instance, we want to find a mapping from the instance feature and the label to a relevance score. In this paper, we mainly use the linear one-vs-all approach, one of the most straightforward and best-performing models. It treats assigning an individual label to an instance as an independent binary classification problem: the class label is positive if the label is relevant to the instance, and negative otherwise. If the instance is text, the input of the linear classifier can be its tf-idf features. The output of the classifier is the probability that the label is relevant to the instance.
In SLINMER, we focus on how to efficiently rank the retrieved labels in real time using sparse linear models. For a given input x, the score of label l is defined as w_l^T x, where w_l is the weight vector of label l. For sparse input data, such as tf-idf features of text, x is a sparse vector. Meanwhile, if we enforce a sparsity structure on the weight vectors during training, the weight vectors are sparse as well. Therefore, existing one-vs-all linear classifiers often use sparse vector multiplication to reduce inference time. However, many existing XMC linear classifiers, such as Parabel, are optimized for batch inference, i.e., the average time over a large batch of test data is optimized. In real applications, we often need real-time inference, where samples arrive one at a time.
Here we propose a data structure for the weight vectors of a cluster to improve real-time inference. In Figure 2, we show several data structures for storing the weight matrix of a label cluster. The label-indexed representation, used by Parabel, stores a vector of (feature-index, value) pairs for every label, which we call a sparse feature vector; the feature-indexed representation stores a vector of (label-index, value) pairs for every feature, which we call a sparse label vector. Note that when a label cluster contains only a small set of labels and the feature dimension is very large, the weight matrix is very sparse, and the feature-indexed representation wastes a lot of memory storing an empty vector for most features. In this case, we can instead use a doubly-sparse representation, which is based on the feature-indexed representation but stores only the non-empty sparse label vectors.
Given an input x and a weight matrix W, our goal is to calculate the scores of the labels, i.e., W^T x. In the feature-indexed representation, we can find the sparse label vector of every nonzero feature index of the input in constant time, so the computational cost is proportional to the number of nonzero weights in the rows touched by x. However, the memory requirement includes an entry per feature for each cluster; for some datasets, such as Wiki-500K (with 517,631 features and 479,315 labels), the total memory required can be huge. In the label-indexed representation, W^T x consists of inner products between pairs of sparse vectors. As implemented in Parabel, for every inner product it first maps the nonzero entries of a sparse feature vector to a zero-initialized dense vector and then uses sparse-dense vector multiplication to compute the inner product, so the computational cost grows with the total number of nonzeros in W. To reduce both the computational complexity and the memory requirement, we use a doubly-sparse weight matrix. We still use (label-index, value) pairs to store each non-empty row of the weight matrix, but to store the feature indices sparsely, a hash table maps feature indices to the non-empty rows. Given a sparse input, we can retrieve the rows corresponding to its nonzero feature indices in constant time; hence the computational cost of calculating W^T x matches that of the feature-indexed representation while the memory usage is proportional to the number of nonzeros of W.
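A minimal sketch of the doubly-sparse representation, using a Python dict as the hash table from feature indices to non-empty rows:

```python
# Doubly-sparse weight matrix: a hash table maps each non-empty feature
# index to its sparse label vector, so scoring a sparse input touches
# only rows that actually exist.

class DoublySparseMatrix:
    def __init__(self):
        self.rows = {}                      # feature index -> {label: weight}

    def add(self, feature, label, weight):
        self.rows.setdefault(feature, {})[label] = weight

    def score(self, x):
        """x: dict mapping feature index -> value (sparse input).
        Returns a dict of label -> inner-product score."""
        scores = {}
        for j, v in x.items():
            for label, w in self.rows.get(j, {}).items():
                scores[label] = scores.get(label, 0.0) + v * w
        return scores

W = DoublySparseMatrix()
W.add(feature=2, label=0, weight=0.5)
W.add(feature=2, label=1, weight=-1.0)
W.add(feature=7, label=0, weight=2.0)

x = {2: 1.0, 7: 0.5, 9: 3.0}              # feature 9 has no stored row
print(W.score(x))
```

Feature 9 costs a single failed hash lookup rather than a scan, and no memory is spent on the empty rows between stored features.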
Dataset  #train  #valid  #test  #features  #labels

Eurlex-4K  13,905  1,544  3,865  33,246  3,714
Wiki10-28K  11,265  1,251  5,732  99,919  28,139
AmazonCat-13K  1,067,616  118,623  306,782  161,925  13,234
Wiki-500K  1,411,760  156,396  676,730  517,631  479,315
5 Experiments
5.1 Datasets and Preprocessing
We consider four multi-label text classification datasets downloaded from the publicly available Extreme Classification Repository [30] for which we had access to the raw text, namely Eurlex-4K, Wiki10-28K, AmazonCat-13K and Wiki-500K.
Summary statistics of the datasets are given in Table 1. We follow the training and test split of [30] and set aside a fraction of the training instances as a validation set for hyperparameter tuning.
As shown in Table 1, it is important to note that the dataset statistics, the number of labels in particular, differ slightly from the Extreme Classification Repository [30] for two reasons. First, since only the title of the body text is provided for Wiki10-28K and Wiki-500K, we match the titles against the latest Wikipedia dump and extract the raw text of each document. This creates a subset of the original dataset, yielding a slightly smaller number of labels. Second, we adhere to the text preprocessing procedure of [21]: replacing numbers with a special token; building a word vocabulary from the 80K most frequent words; substituting out-of-vocabulary words with a special token; and truncating documents after 300 words.
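The preprocessing steps can be sketched as follows; the token strings and the tiny vocabulary/length limits in the usage example are illustrative, not the actual 80K/300 settings.

```python
# Sketch of the preprocessing pipeline: map digit strings to a special
# token, keep only the most frequent words, replace the rest with an
# OOV token, and truncate long documents.
import re
from collections import Counter

NUM, OOV = "<num>", "<oov>"

def preprocess(docs, vocab_size=80000, max_len=300):
    tokenized = [[NUM if re.fullmatch(r"\d+", w) else w.lower()
                  for w in doc.split()] for doc in docs]
    counts = Counter(w for doc in tokenized for w in doc)
    vocab = {w for w, _ in counts.most_common(vocab_size)}
    return [[w if w in vocab else OOV for w in doc][:max_len]
            for doc in tokenized]

docs = ["the zoo zoo opened 1964", "the zoo is open"]
print(preprocess(docs, vocab_size=2, max_len=4))
```

With the real settings (vocab_size=80000, max_len=300), only genuinely rare words fall to the OOV token.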

Indexing  Matching stage  Ranking stage  

p@1  p@3  p@5  p@1  p@3  p@5  
linear  biLSTM-SA  linear  biLSTM-SA  linear  biLSTM-SA  linear  SLINMER  linear  SLINMER  linear  SLINMER  
HOMER  balanced k-means (3)  85.25  88.28  66.76  69.61  51.51  53.18  80.65  81.14  67.68  69.67  56.34  57.66  
balanced KD-tree (3)  78.34  83.73  64.77  69.73  54.03  57.60  77.83  82.79  65.18  69.36  54.51  57.49  
balanced random projection (0)  79.17  83.65  64.97  69.50  53.53  56.79  78.60  81.99  65.61  69.36  54.49  57.23  
ELMo  balanced k-means (6)  83.70  87.32  71.66  74.91  59.43  61.40  79.84  83.00  67.70  69.81  56.30  58.10  
balanced KD-tree (0)  79.72  84.58  67.11  70.80  55.94  58.01  79.09  82.85  65.79  69.08  55.14  57.38  
balanced random projection (0)  80.88  85.05  68.25  72.22  57.02  59.43  78.47  81.89  66.49  69.80  55.27  57.66  
Parabel  balanced k-means (4)  91.57  92.91  69.12  70.50  51.71  52.06  81.19  82.82  68.86  70.25  57.51  58.59  
balanced KD-tree (1)  80.60  84.48  66.65  70.48  54.33  57.12  78.99  82.07  65.99  69.37  54.93  57.47  
balanced random projection (1)  84.22  87.50  68.79  72.14  55.12  57.10  79.30  81.76  67.06  69.50  55.89  57.63
Method                Prec@1  Prec@3  Prec@5  Recall@1  Recall@3  Recall@5  #parameters  Training time (secs)
SLINMER (T=1)         83.57   70.17   58.44   17.09     42.15     57.39     88.48M       380.87
Parabel (T=1) [26]    81.99   68.89   57.30   16.76     41.24     56.26     2.66M        10.67
PD-Sparse [35]        79.97   66.74   55.50   16.45     40.18     54.66     1.56M        85.83
SLINMER-v1 (T=3)      84.40   72.41   60.51   17.31     43.50     59.44     265.44M      1142.61
SLINMER-v3 (T=9)      85.12   73.12   61.13   17.43     44.00     60.12     796.32M      3427.83
Parabel (T=3) [26]    82.48   69.95   58.49   16.87     41.98     57.46     7.98M        32.30
FastXML (T=100) [27]  76.17   61.86   50.75   15.54     37.01     49.75     2.60M        22.71
Method                Prec@1  Prec@3  Prec@5  Recall@1  Recall@3  Recall@5  #parameters  Training time (secs)
SLINMER (T=1)         83.88   71.44   61.66   5.12      12.84     18.14     116.33M      312.72
mlc2seq [21]          81.75   65.45   54.18   –         –         –         239M         3514.4
SeCSeq (T=1) [6]      81.61   67.32   56.36   –         –         –         46M          2692.3
Parabel (T=1) [26]    82.76   71.37   62.24   5.03      12.82     18.39     12.44M       49.76
PD-Sparse [35]        82.12   71.00   60.47   5.04      12.86     18.04     29.17M       431.41
SLINMER-v1 (T=3)      83.81   71.56   62.01   5.10      12.82     18.22     348.99M      938.16
SLINMER-v3 (T=9)      84.23   72.05   62.60   5.13      12.93     18.43     1046.97M     2814.48
SeCSeq (T=4) [6]      83.54   70.06   59.40   –         –         –         184M         2692.3
Parabel (T=3) [26]    82.78   71.70   62.48   5.02      12.88     18.46     37.32M       142.85
FastXML (T=100) [27]  83.20   68.68   58.39   5.03      12.28     17.13     1.24M        37.74
Method                Prec@1  Prec@3  Prec@5  Recall@1  Recall@3  Recall@5  #parameters  Training time (hours)
SLINMER (T=1)         92.09   76.89   61.71   26.07     58.35     72.61     144.45M      2.61
mlc2seq [21]          91.52   74.77   59.13   –         –         –         180M         13.70
SeCSeq (T=1) [6]      92.10   74.78   59.05   –         –         –         46M          14.21
Parabel (T=1) [26]    90.75   75.61   60.99   25.57     57.35     71.80     46.40M       0.21
PD-Sparse [35]        89.18   69.95   55.46   25.44     54.72     67.55     33.71M       2.10
SLINMER-v1 (T=3)      93.35   78.27   62.98   26.52     59.44     74.05     433.35M      7.83
SLINMER-v3 (T=9)      93.80   78.80   63.54   26.70     59.84     74.65     1300.05M     23.49
SeCSeq (T=4) [6]      93.19   77.34   61.74   –         –         –         184M         14.21
Parabel (T=3) [26]    91.42   76.34   61.68   25.82     57.84     72.53     139.2M       0.70
FastXML (T=100) [27]  92.68   77.17   62.05   26.44     58.70     73.18     167.87M      0.87
Method                Prec@1  Prec@3  Prec@5  Recall@1  Recall@3  Recall@5  #parameters  Training time (hours)
SLINMER (T=1)         62.94   42.36   32.30   19.33     33.68     40.03     350.61M      6.56
SeCSeq (T=1) [6]      51.36   30.44   21.71   –         –         –         46M          30.65
mlc2seq [21]          NS      NS      NS      NS        NS        NS        NS           NS
Parabel (T=1) [26]    59.09   39.70   30.25   18.05     31.65     37.64     350.76M      0.75
PD-Sparse [35]        NA      NA      NA      NA        NA        NA        291.17M      51.0
SLINMER-v1 (T=3)      65.52   44.57   33.99   20.27     35.57     42.23     1051.83M     19.68
SLINMER-v3 (T=9)      66.88   45.98   35.21   20.80     36.91     44.00     3155.49M     59.04
Parabel (T=3) [26]    60.91   41.33   31.67   18.74     33.21     39.75     1052.28M     2.08
FastXML (T=100) [27]  43.46   29.03   22.12   12.30     21.87     26.32     237.55M     2.67
Dataset        Weight nnz  Input nnz  Parabel-batch (ms)  DoublySparse-real-time (ms)  Parabel-real-time (ms)
Eurlex-4K      693         115        0.52                3.51                         8.17
Wiki10-28K     425         158        0.82                10.42                        29.64
AmazonCat-13K  3375        67         0.55                5.10                         82.36
Wiki-500K      706         130        1.49                19.05                        161.04
5.2 Algorithms and Hyperparameter Tuning
We compare SLINMER with state-of-the-art XMC methods, including the tree-based methods FastXML [27] and Parabel [26], the OVA-based PD-Sparse [35], and the deep sequential neural models mlc2seq [21] and SeCSeq [6], on publicly available benchmark multi-label datasets [30]. We follow [21] to obtain tokenized text representations for the deep learning methods, and use TF-IDF unigram features for the feature-based methods (PD-Sparse and FastXML). The data statistics are summarized in Table 1. We evaluate all methods with example-based ranking measures, namely Precision@k and Recall@k (k = 1, 3, 5), which are widely used in the extreme multi-label classification literature [27, 5, 16, 35, 26].
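For concreteness, the example-based measures can be computed per instance as follows (a minimal sketch; function names are ours):

```python
def precision_at_k(true_labels, ranked_labels, k):
    """Example-based Precision@k: fraction of the top-k predicted labels
    that are relevant for this instance."""
    topk = ranked_labels[:k]
    return len(set(topk) & set(true_labels)) / k

def recall_at_k(true_labels, ranked_labels, k):
    """Example-based Recall@k: fraction of the relevant labels that appear
    among the top-k predictions."""
    topk = ranked_labels[:k]
    return len(set(topk) & set(true_labels)) / len(true_labels)

# Toy instance with 2 relevant labels and predictions ranked by score.
truth = [3, 7]
ranked = [7, 1, 3, 4, 9]
print(precision_at_k(truth, ranked, 5))  # 2 hits in the top 5 -> 0.4
print(recall_at_k(truth, ranked, 5))     # both relevant labels found -> 1.0
```

The numbers reported in the tables are these quantities averaged over all test instances.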
We consider biLSTM-SA [19] as the matcher configuration in SLINMER. Specifically, the model consists of a bidirectional LSTM, a self-attention MLP with one hidden layer that produces a matrix sentence embedding, and a final ReLU output MLP. Generally speaking, we follow the default hyperparameters as set in [19], and found them robust across various semantic label indexing configurations and random seeds.
For the hyperparameters of the competing baselines, we largely follow the default settings. In particular, the number of trees and the maximum number of instances per leaf node in FastXML, as well as the number of trees and the maximum number of labels per leaf node in Parabel, take their default values. Both FastXML and Parabel use an L2-loss penalty with the linear L2-regularized L2-loss SVM solver, as implemented in LIBLINEAR [11]. For PD-Sparse, the regularization term and the maximum number of iterations take their default values, with early stopping by monitoring Precision@1 on a validation set. The deep sequential neural models mlc2seq and SeCSeq use gated convolutional networks for sequence learning [9, 12], as implemented in the Fairseq package.
5.3 Empirical Results
In this section, we analyze various configurations of the semantic label indexing stage; present the best configuration of the proposed SLINMER approach and compare it with state-of-the-art XMC methods; investigate real-time inference time using different data structures; and demonstrate different ensemble techniques for the ranking stage.
5.3.1 Analysis on Semantic Label Indexing
Table 2 shows how various configurations of SLINMER affect the performance of the matching stage and the final ranking stage. For the matching algorithms, we use hierarchical linear models similar to Parabel [26] and biLSTM-SA [19], the deep learning model discussed in Section 4. First, note that k-means is a compelling choice of indexing method for all three semantic label representations. Specifically, as indicated by the number of wins in parentheses, k-means is the best for the ELMo and Parabel representations, while being as good as the KD-tree for the Homer representation. Given the robustness of k-means, we set it as the default indexing method in our subsequent experiments. Regarding the semantic label representation, we found ELMo and Parabel to be slightly better than Homer.
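The indexing step itself can be sketched in a few lines, given precomputed label embeddings (a minimal sketch; we use plain scikit-learn k-means for brevity, whereas the experiments above use a *balanced* variant that keeps cluster sizes nearly equal):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_label_index(label_emb, num_clusters, seed=0):
    """Cluster label embeddings so that each cluster becomes one 'index';
    the matcher is then trained to predict a cluster for each instance."""
    km = KMeans(n_clusters=num_clusters, random_state=seed, n_init=10)
    cluster_of_label = km.fit_predict(label_emb)  # shape: (num_labels,)
    # Invert the assignment: cluster id -> list of label ids it contains.
    index = {c: np.flatnonzero(cluster_of_label == c).tolist()
             for c in range(num_clusters)}
    return cluster_of_label, index

# Toy run: 100 labels with 16-dim embeddings (e.g., from ELMo), 8 clusters.
rng = np.random.default_rng(0)
emb = rng.standard_normal((100, 16))
cluster_of_label, index = build_label_index(emb, 8)
assert sum(len(v) for v in index.values()) == 100  # every label indexed once
```

Swapping in a different label representation (Homer, ELMo, Parabel) only changes `label_emb`; swapping in KD-trees [3] or random projections only changes the partitioning routine.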
Finally, we observe that the superior performance of SLINMER indeed comes from the improvement of biLSTM-SA, a deep learning model, over linear models in the matching stage. These empirical results support our claim that using more complex models, such as deep neural networks, in the matching stage can improve the final ranking performance.
5.3.2 SLINMER Ensemble
Figure 3 illustrates three different ensembles of SLINMER configurations. Specifically, SLINMER-v1 ensembles three label representations (Homer, ELMo, Parabel), SLINMER-v2 ensembles three random seeds of the Parabel label representation, and SLINMER-v3 ensembles all combinations (3 label representations × 3 random seeds). We observe that an ensemble over heterogeneous label representations is more effective than an ensemble over a single label representation with different random seeds, which is the ensemble technique used in Parabel models. This again confirms that the diversity of semantic label representations helps the neural matcher and the final ranking stage. Last but not least, ensembling all 9 configurations yields the state-of-the-art results of SLINMER.
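One simple way to realize such an ensemble is to average the per-label scores produced by each configuration (a hypothetical sketch of score averaging; the exact combination rule used by SLINMER may differ in detail):

```python
from collections import defaultdict

def ensemble_scores(runs):
    """Average per-label ranking scores over several configurations
    (e.g., different label representations and/or random seeds).
    Each run maps label id -> score for one test instance; a label
    missing from a run contributes 0, which matches the sparse
    top-scoring outputs each individual model emits."""
    total = defaultdict(float)
    for run in runs:
        for label, score in run.items():
            total[label] += score
    n = len(runs)
    return {label: s / n for label, s in total.items()}

# Three hypothetical configurations scoring the same test instance.
runs = [{1: 0.9, 2: 0.3}, {1: 0.6, 3: 0.6}, {2: 0.9, 1: 0.3}]
avg = ensemble_scores(runs)
print(max(avg, key=avg.get))  # label 1 wins: (0.9 + 0.6 + 0.3) / 3 = 0.6
```

With heterogeneous label representations, the individual runs make less correlated errors, which is why averaging them helps more than averaging seeds of a single representation.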
5.3.3 Overall Comparison
Tables 3, 4, 5, and 6 compare the proposed SLINMER with other strong XMC baselines on four benchmark datasets. Each table is separated into two groups: a group of single models and a group of ensemble models. In both groups, SLINMER outperforms the state-of-the-art XMC model Parabel in most cases, except on the Wiki10-28K dataset. It is worth noting that, on the most challenging dataset, Wiki-500K, SLINMER improves over Parabel by around 3% and 6% absolute, for the single-model group and the ensemble-model group, respectively. This significant gain stems from two novel techniques, namely the neural matcher and the ensemble over various semantic label representations.
On the other hand, the training time of SLINMER, measured in GPU running time, is not dramatically longer than that of other representative XMC baselines running on CPUs. For example, on the Wiki-500K dataset, the training time of SLINMER is actually shorter than that of PD-Sparse, and only 8 times longer than that of Parabel under the single-model setting. Moreover, SLINMER enjoys faster training than the other deep learning baselines, mlc2seq and SeCSeq.
Finally, regarding the number of parameters, SLINMER may seem over-parameterized on the two medium-scale datasets, Eurlex-4K and Wiki10-28K. This is because we apply the same model-architecture hyperparameters as on the largest dataset, Wiki-500K. Indeed, the number of parameters of SLINMER on Wiki-500K is comparable to that of Parabel, PD-Sparse, and FastXML. With the same number of parameters, SLINMER still outperforms Parabel by a large margin on the Wiki-500K dataset, indicating the advantage of using deep neural models in the matching stage.
5.3.4 Efficient Inference
Table 7 shows the inference time per sample in the batch mode and the real-time mode. The original implementation of Parabel is optimized for the batch-mode setting, where a batch of test samples is fed in at the same time. To reduce the inference time per sample in the batch mode, Parabel uses a label-indexed sparse representation of the weights, as discussed in Sec 4.3. However, as shown in the Parabel-real-time column of Table 7, this data structure leads to slow real-time inference, where samples arrive one by one. Table 7 shows that real-time inference can be sped up significantly by using a doubly-sparse representation. The speedup ratio is highly correlated with the ratio between the average number of nonzeros in the weight vectors and the average number of nonzeros in the input vectors.
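The doubly-sparse idea boils down to scoring each label with a dot product between two sparse vectors, so the per-label cost depends only on the nonzeros rather than the full feature dimension (a minimal sketch; the list-of-pairs layout is ours for illustration):

```python
def sparse_dot(w, x):
    """Dot product of two sparse vectors stored as index-sorted
    (index, value) lists -- the 'doubly sparse' layout, where both the
    classifier weight vector and the input feature vector keep only
    their nonzeros.  A merge-style scan costs O(nnz(w) + nnz(x)),
    independent of the full feature dimension."""
    i = j = 0
    total = 0.0
    while i < len(w) and j < len(x):
        wi, wv = w[i]
        xj, xv = x[j]
        if wi == xj:          # index present in both vectors
            total += wv * xv
            i += 1
            j += 1
        elif wi < xj:         # advance whichever side lags behind
            i += 1
        else:
            j += 1
    return total

w = [(0, 0.5), (3, 1.0), (7, -2.0)]   # nnz(w) = 3
x = [(3, 2.0), (5, 1.0), (7, 1.0)]    # nnz(x) = 3
print(sparse_dot(w, x))  # 1.0*2.0 + (-2.0)*1.0 = 0.0
```

This is why the speedup in Table 7 tracks the weight-nnz-to-input-nnz ratio: a label-indexed layout touches every nonzero weight per label regardless of the input, whereas the doubly-sparse scan skips indices absent from the input vector.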
6 Conclusions
In this paper, we propose SLINMER, a modular deep learning approach for extreme multi-label classification problems. SLINMER consists of a Semantic Label Indexing component, a Neural Matching component, and an Efficient Ranking component. This is the first deep learning multi-label approach that yields significant performance improvements in an extreme multi-label setting. In particular, on a Wiki dataset with around half a million labels, precision@1 is increased substantially by an ensemble of various configurations of SLINMER. Due to the modularity of SLINMER, it is highly flexible in incorporating various application-specific semantic label information to improve performance. A direction of our future work is to adapt newly developed deep learning models such as BERT [10] to further improve the neural matching component of SLINMER.
References
 [1] Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.
 [2] Rohit Babbar and Bernhard Schölkopf. DiSMEC: Distributed sparse machines for extreme multi-label classification. In WSDM, 2017.
 [3] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
 [4] Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H Chi. Latent cross: Making use of context in recurrent recommender systems. In WSDM, 2018.
 [5] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In NIPS, 2015.
 [6] Wei-Cheng Chang, Hsiang-Fu Yu, Inderjit S. Dhillon, and Yiming Yang. SeCSeq: Semantic coding for sequence-to-sequence based extreme multi-label classification, 2018.
 [7] Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, 2012.
 [8] Moustapha M. Cisse, Nicolas Usunier, Thierry Artieres, and Patrick Gallinari. Robust bloom filters for large multi-label classification tasks. In NIPS, 2013.
 [9] Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In ICML, 2017.
 [10] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018.
 [11] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. JMLR, 2008.
 [12] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
 [13] Google. How search works. https://www.google.com/search/howsearchworks/, 2019. Accessed: 2019118.
 [14] Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In NIPS, 2009.
 [15] Qixuan Huang, Anshumali Shrivastava, and Yiqiu Wang. MACH: Embarrassingly parallel K-class classification in O(d log K) memory and O(K log K + d log K) time, instead of O(Kd), 2018.
 [16] Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In KDD, 2016.
 [17] Kalina Jasinska, Krzysztof Dembczynski, Róbert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, and Eyke Hüllermeier. Extreme F-measure maximization using sparse probability estimates. In ICML, 2016.
 [18] Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
 [19] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured selfattentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
 [20] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In SIGIR, 2017.
 [21] Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J. Kim, and Johannes Fürnkranz. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In NIPS, 2017.
 [22] Alexandru Niculescu-Mizil and Ehsan Abbasnejad. Label filters for large scale multilabel classification. In AISTATS, 2017.
 [23] Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Gallinari. LSHTC: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581, 2015.
 [24] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
 [25] Yashoteja Prabhu, Anil Kag, Shilpa Gopinath, Kunal Dahiya, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation. In WSDM, 2018.
 [26] Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In WWW, 2018.
 [27] Yashoteja Prabhu and Manik Varma. FastXML: A fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, 2014.
 [28] Si Si, Huan Zhang, S Sathiya Keerthi, Dhruv Mahajan, Inderjit S Dhillon, and ChoJui Hsieh. Gradient boosted decision trees for high dimensional sparse output. In ICML, 2017.
 [29] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08), 2008.
 [30] Manik Varma. The extreme classification repository: Multi-label datasets & code. http://manikvarma.org/downloads/XC/XMLRepository.html, 2018. Accessed: 2018105.
 [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
 [32] Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. 2011.
 [33] Jason Weston, Ameesh Makadia, and Hector Yee. Label partitioning for sublinear ranking. In ICML, 2013.
 [34] Ian E.H. Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit Dhillon, and Eric Xing. PPDsparse: A parallel primal-dual sparse method for extreme classification. In KDD, 2017.
 [35] Ian E.H. Yen, Xiangru Huang, Kai Zhong, Pradeep Ravikumar, and Inderjit S. Dhillon. PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, 2016.
 [36] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S. Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.