Extreme multi-label classification (XMC) refers to the problem of assigning to an instance the most relevant subset of labels from an enormous label collection, where the number of labels could be in the millions or more. The XMC setting is common in industrial applications such as YouTube recommendation, Bing’s dynamic search advertising, and tagging of Wikipedia categories in the PASCAL Large-Scale Hierarchical Text Classification (LSHTC) challenge, to name just a few.
The huge label space has raised research challenges such as data sparsity and scalability for existing multi-label algorithms. Among the state-of-the-art XMC approaches, one-vs-all approaches, such as DiSMEC, often achieve the highest accuracy but suffer a severe computational burden in both the training and prediction phases if not implemented carefully. Various techniques have been proposed to improve the efficiency. Sparse structures [35, 34] are introduced to one-vs-all classifiers to reduce the computational complexity. Embedding-based methods [5, 36] compress the label space to a low-dimensional space. Tree-based methods [27, 28] learn an ensemble of weak but fast classification trees, which however leads to a large model size. Recently, label-partitioning approaches, such as Parabel and label filtering, have shown significant computational gain over existing methods while achieving comparable accuracy.
The label-partitioning approaches inspired us to build connections between XMC and information retrieval (IR), where the goal is to find relevant documents for a given query from an extremely large number of documents. To handle the large number of documents, an IR engine typically performs the search in the following steps: 1) indexing: building an efficient data structure to index the documents; 2) matching: finding the document index that the query belongs to; 3) ranking: sorting the documents in the retrieved index. An XMC problem can be connected to an IR problem as follows: the large number of labels can be viewed analogously to the large number of documents indexed by a search engine, and the instance to be labeled can be viewed as the query. To unify the terminology, we refer to queries in IR and instances in XMC as sources, and to documents in IR and labels in XMC as targets. With this unified terminology, the goals of both IR and XMC can be described as identifying relevant targets for a given source from an extremely large collection of targets. Such a connection enables us to establish an IR-like multi-stage framework to tackle XMC problems in a much more modular manner. Not surprisingly, many existing XMC approaches can be dissected or analyzed under this framework. Although the stages are similar, the challenge of each component for XMC is very different from that of its counterpart in an IR system. For example, the source and targets usually share the same token space in an IR system, while the source and targets for an XMC problem can be in two very different domains. In Section 2, we will not only establish a three-stage framework for XMC problems but also highlight the differences between each stage and its counterpart in an IR system.
In this paper, we propose a modular deep learning approach for XMC problems. The contributions of this paper are summarized as follows:
We establish a multi-stage framework for XMC problems. This framework not only unifies many existing XMC approaches but also enables the design of new XMC approaches in a modular manner.
Under the framework, we propose SLINMER, a modular deep learning approach for extreme multi-label classification problems. SLINMER consists of a Semantic Label Indexing component, a Neural Matching component, and an Efficient Ranking component.
The semantic label indexing in SLINMER is a flexible component that can incorporate various kinds of semantic label information, while we propose to use a biLSTM-SA model to better match the input to a set of good candidate labels. We also propose to ensemble various configurations of SLINMER to further improve the performance.
We also develop a doubly-sparse data structure for the real-time inference with extremely sparse weight matrices.
With an extensive empirical study, we demonstrate the superiority of the proposed SLINMER over the existing XMC approaches. In particular, on a Wiki dataset with around millions of labels, the precision@1 is increased from to by an ensemble of various configurations of SLINMER.
This paper is organized as follows. In Section 2, we establish a multi-stage framework for XMC problems, and in Section 3 we show how the existing XMC approaches connect to this framework. Our proposed SLINMER is introduced in Section 4. We then perform extensive experiments to demonstrate its superiority in Section 5 and conclude in Section 6. Finally, all the data splits and source code will be made publicly available.
2 A Multi-stage Framework for XMC
Motivated by the success of the three-stage IR framework in handling an extremely large number of targets, we follow it to develop a general framework for XMC, which consists of the following stages: 1) indexing the labels, 2) matching the label indices, and 3) ranking the labels from the retrieved indices. See Figure 1 for an illustration of the framework. In the following, starting from the definition of MLC problems, we present a probabilistic model for our framework and then discuss the three stages of our framework.
Notations and Definitions
Formally, multi-label classification (MLC) is the task of learning a function $f$ that maps an input $x$ to its target $y \in \{0, 1\}^L$, where $L$ is the number of total unique labels. Assume that we have a set of $N$ training samples $\{(x_i, y_i)\}_{i=1}^N$, where $x_i$ is the $i$-th input and $y_i \in \{0, 1\}^L$ is its label vector. We use $Y \in \{0, 1\}^{N \times L}$, whose $i$-th row is $y_i^\top$, to represent the label matrix. For some special datasets, we have additional label information. For example, each label in the Wikipedia dataset is named by words, such as “Zoos in Mexico” and “Bacon drinks”. So we will use $\{z_l\}_{l=1}^L$ as the feature representations of the labels, which may either come from the label information itself or from other approaches.
A Probabilistic Model.
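We formulate our framework from a probabilistic perspective. Assume that after indexing, we have $K$ clusters of labels, $\{I_k\}_{k=1}^K$, where each $I_k$ is a subset of the label indices, i.e., $I_k \subseteq \{1, \dots, L\}$ with $L$ the number of labels. For a given instance $x$, the probability of the $l$-th label being relevant to $x$ is $P(y_l = 1 \mid x)$. We can form the probabilistic model as follows,
$$P(y_l = 1 \mid x) = \sum_{k=1}^{K} P(I_k \mid x; \theta_m) \, P(y_l = 1 \mid I_k, x; \theta_r).$$
Here $P(I_k \mid x; \theta_m)$ is the matching model with $\theta_m$ as the parameters and $P(y_l = 1 \mid I_k, x; \theta_r)$ is the ranking model with $\theta_r$ as the parameters. For the ranking model, we assume
$$P(y_l = 1 \mid I_k, x; \theta_r) = 0 \quad \text{if } l \notin I_k.$$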
In other words, during the ranking stage, only labels in the retrieved clusters are considered. This assumption is reasonable for an extremely large number of labels because in such a scenario there will be many similar labels and they can be grouped. Under this assumption, our framework has the following advantages:
The training time for the ranking model for each cluster can be reduced because it only needs to consider the labels in the cluster and the instances that are relevant to these labels.
The prediction time is also reduced, because once a small set of clusters is chosen, we only need to perform ranking for the labels in these clusters.
Constraining the ranking to a smaller set of labels helps exclude irrelevant labels if the clustering and the matching models are sufficiently good.
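As a minimal illustration of how the matching and ranking models combine at prediction time, the following Python sketch scores only the labels that fall inside the retrieved clusters. The function name `predict_scores`, the beam size `top_b`, and the toy probabilities are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
from collections import defaultdict

def predict_scores(match_probs, rank_probs, clusters, top_b=2):
    """Combine matching and ranking probabilities for a single instance.

    match_probs: dict cluster_id -> P(cluster | x)            (assumed matcher output)
    rank_probs:  dict (cluster_id, label) -> P(label | cluster, x)  (assumed ranker output)
    clusters:    dict cluster_id -> list of label ids
    top_b:       number of clusters kept after matching (an assumed beam size)
    """
    # Matching: keep only the top_b most probable clusters.
    retrieved = sorted(match_probs, key=match_probs.get, reverse=True)[:top_b]
    scores = defaultdict(float)
    for k in retrieved:
        # Ranking: only labels inside a retrieved cluster receive a score.
        for label in clusters[k]:
            scores[label] += match_probs[k] * rank_probs.get((k, label), 0.0)
    return dict(scores)

# Toy example with three clusters of two labels each.
clusters = {0: [0, 1], 1: [2, 3], 2: [4, 5]}
match_probs = {0: 0.7, 1: 0.2, 2: 0.1}
rank_probs = {(0, 0): 0.9, (0, 1): 0.1, (1, 2): 0.5, (1, 3): 0.5, (2, 4): 1.0}

scores = predict_scores(match_probs, rank_probs, clusters, top_b=2)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Labels 4 and 5 never receive a score here because their cluster is not retrieved, which is exactly the saving in prediction time described above.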
We now briefly touch on each of these stages.
2.1 Label Indexing
The indexing of documents in a search engine requires rich text information, while the labels of XMC typically lack this information. Thus, we aim to find meaningful label representations in order to build such an indexing system. There are several approaches in the literature that have implicitly or explicitly studied different ways to represent the labels. For example, the embedding-based XMC approaches [5, 36] project the label index to a low-dimensional vector by minimizing the loss between ground-truth labels and predicted labels. HOMER uses the column vector of the instance-label indicator matrix to represent the label, while Parabel represents the label as a normalized sum of the features of relevant training instances. On the other hand, existing deep learning approaches for XMC represent the labels by their IDs [20, 21, 15]. However, label IDs do not contain semantic information about the labels. For some special datasets, such as Wikipedia category tagging, each label is a short sequence of words, which can be used as a label representation via word embeddings such as ELMo.
Once we obtain the label representations, we can start to build the indexing system; we will do so by clustering the labels as in label partitioning methods. We can use several clustering algorithms, such as k-means clustering, KD-tree or random projection clustering. Due to the lack of a direct and informative representation of the labels, the indexing system for XMC may be noisy compared to that for an IR problem. Fortunately, the instances in XMC are typically very informative. Therefore, we can utilize the rich information of the instances to build a strong matching system as well as a strong ranking system to compensate for the indexing system.
The matching phase for XMC is to assign relevant clusters (i.e., indices) to each instance. A high recall for matching is key to the success of a search engine, as the subsequent ranking phase is based on the documents retrieved by matching. A matching system can also be viewed as a multi-label classification (MLC) problem, where the clusters are the “labels”. We call this the MLC-matching problem. To build a strong MLC-matching system, we want to utilize the information provided by the instance as much as possible, which however might lead to a complex model and expensive computational cost. For example, for text-based inputs, we might need deep learning approaches, such as Seq2Seq, CNN and self-attention models, to extract the sequential information of the input. However, for general XMC, deep learning approaches suffer from high computational complexity compared to classical linear classifiers. Fortunately, since the number of clusters can be controlled by the practitioner, we can control the scale of the MLC-matching problem such that its training and inference can still be completed in a reasonable time.
The ranking stage in XMC is to sort the labels retrieved from matching according to the relevance between the labels and the instance. The ranking part is also a multi-label classification problem but is much smaller than the original XMC problem. We call it the MLC-ranking problem. Thanks to the label clustering, we can do both training and inference for the ranking model efficiently. During training, we train an MLC-ranking model for each cluster independently. Therefore, for each cluster, we only need to include training samples which have positive labels in this cluster. This significantly reduces the training time when the number of training instances in a cluster is much smaller than the whole training data size. The inference time is reduced because we only need to consider the set of labels in the clusters retrieved from matching. In this paper, we use linear one-vs-all models for this MLC-ranking model.
3 Related Work and Connections to Our Framework
To deal with the huge number of labels as well as the large training data size, various methods have been recently proposed to reduce the computational complexity for both training and prediction. We put XMC algorithms into three categories: one-vs-all approaches, partitioning methods and embedding-based approaches. We briefly discuss representative work in each category, and discuss their relationship with our framework.
One-Vs-All (OVA) approaches
The naive one-vs-all approach treats each label independently as a binary classification problem: if the label is relevant to the instance then it is positive; otherwise it is negative. OVA approaches [2, 20, 34, 35] have been shown to achieve high accuracies, but they suffer from expensive computational complexity for both training and prediction when the number of labels is very large. Therefore, several techniques have been proposed to speed up the algorithm. PDSparse/PPDSparse introduce primal and dual sparsity to speed up training as well as prediction. DiSMEC and PPDSparse explore parallelism and sparsity to speed up the algorithm and reduce the model size. OVA approaches are also widely used as building blocks for many other approaches; for example, in Parabel, linear OVA classifiers with a small output domain are used as the classifiers for internal nodes and leaf nodes.
Relation to our framework. Enforcing sparse structures on the weight vectors in OVA approaches can be viewed as building the indexing system. During prediction, for a given sparse input feature, we only consider labels that have non-zero features in common with the input feature. Therefore, we only need to calculate the relevance score between a small set of labels and the input feature, which reduces the prediction time.
Partitioning methods
There are two ways to incorporate partitioning: input partitioning [1, 27, 16, 25] and label partitioning [26, 17, 29, 33, 22]. Considering the instance-label matrix $Y \in \{0, 1\}^{N \times L}$, where $N$ is the number of training samples and $L$ is the number of labels, input partitioning and label partitioning can be viewed as partitioning the rows and the columns of $Y$, respectively. When the instance-label matrix is very sparse, for input partitioning, each partition only contains a small subset of labels; for label partitioning, each partition only contains a small subset of instances. Therefore, both ways of partitioning can reduce the training and prediction time significantly. Furthermore, most methods, such as [27, 16, 25, 26, 17, 29], apply tree-based approaches, i.e., they build a partitioning tree; therefore, a careful choice of tree-based partitioning, like a balanced 2-means tree, allows sublinear time prediction with respect to the tree size.
Relation to our framework. Label partitioning methods [26, 17, 29, 33, 22] are the most closely related to our framework because label partitioning can be viewed as a label indexing procedure. In the following, we discuss how each of them is related to our framework. One line of work has a framework similar to ours during prediction, but the order of building the index system and learning the matching model is reversed. It assumes that there is an existing ranking function, obtained from a separate training algorithm, that maps the instance feature and the label to a score. The goal is to speed up prediction without changing the ranking function. To build a faster prediction framework, a partitioner function is first learned such that instances that are close to each other are mapped to the same partition. Labels are then assigned to each partition such that the relevant labels for each instance are in the relevant partition. This framework differs from ours because we consider both training and prediction when building the models. In particular, we first build the indexing system, then learn a matching model from the training data, and finally learn a ranking function. The label filtering approach applies label filters that pre-select a small set of candidate labels before the base classifier is applied. The label filtering step can be viewed as our matching step. To build the label filters, the instance features are projected to a one-dimensional space, and an upper bound and a lower bound are learned for the range of each label in this space. A label is considered for ranking if the projection of the instance falls into the label’s range. However, such label partitioning, which is constrained to a one-dimensional space, is limiting for complicated label relationships. Our framework generalizes the label filtering approach in that we do not restrict the method of label partitioning or the label space.
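To make the one-dimensional label filtering idea concrete, the following Python sketch keeps a label as a candidate only when the projected instance falls inside that label's range. It is a simplification under stated assumptions: the projections are taken as given, and each label's bounds are simply the minimum and maximum projection observed among its training instances, whereas the method discussed above learns both the projection and the bounds. The names `fit_label_ranges` and `filter_labels` are hypothetical.

```python
def fit_label_ranges(projections, label_sets):
    """Record, per label, the [low, high] range of its instances' projections.

    projections: one scalar per training instance (assumed to be precomputed)
    label_sets:  set of positive label ids per training instance
    """
    ranges = {}
    for p, labels in zip(projections, label_sets):
        for l in labels:
            low, high = ranges.get(l, (p, p))
            ranges[l] = (min(low, p), max(high, p))
    return ranges

def filter_labels(ranges, p):
    """Return the labels whose range contains the projection p."""
    return sorted(l for l, (low, high) in ranges.items() if low <= p <= high)

# Toy example: three training instances and two labels.
projections = [0.1, 0.5, 0.9]
label_sets = [{0}, {0, 1}, {1}]
ranges = fit_label_ranges(projections, label_sets)

candidates_a = filter_labels(ranges, 0.3)   # only label 0 covers 0.3
candidates_b = filter_labels(ranges, 0.5)   # both ranges contain 0.5
candidates_c = filter_labels(ranges, 1.5)   # outside every range
```

Because every label is described by a single interval, two labels with overlapping intervals cannot be separated by the filter, which illustrates the limitation noted above.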
Parabel partitions the labels through a balanced 2-means label tree using label features constructed from the instances. As we mentioned previously, our framework is partially inspired by Parabel. However, Parabel intertwines indexing (building the label tree), matching (traversing the internal nodes) and ranking (multi-label classification in the leaf nodes). Our framework separates these three stages so that each stage can be studied and implemented independently. HOMER and PLT are similar to Parabel and are also special cases of our framework. Unlike Parabel, HOMER’s label tree is built using the instance-label matrix, while PLT’s tree is a probabilistic model.
Embedding-based approaches
Embedding models [5, 36, 7, 8, 14, 32] use a low-rank representation for the label matrix, so that the similarity search for the labels can be performed in a low-dimensional space. Embedding-based methods explore the relationship among labels through latent subspaces. In other words, embedding-based approaches assume that the label space can be represented by a low-dimensional latent space where similar labels have similar latent representations.
Relation to our framework. We can treat the latent subspace as an implicit label partitioning approach. However, in practice, to achieve a similar computational speedup, embedding-based models often show inferior performance compared to sparse one-vs-all approaches, such as PDSparse/PPDSparse, and partitioning approaches, such as Parabel, which may be due to the inefficiency of the label representation structure.
4 SLINMER: A Modular Deep Learning Approach For XMC
Based on our general three-stage framework, we propose a new XMC approach called SLINMER: Semantic Label Indexing, Neural Matching and Efficient Ranking. The idea of this approach is as follows. We consider three different semantic label representations for the label indexing stage, use biLSTM-SA models for the matching stage, and use one-vs-all linear classifiers as the ranking model. Then we ensemble the results obtained with different random seeds as well as different label representations. In this section, we describe the details of each stage of SLINMER.
4.1 Semantic Label Indexing
The goal of this step is to build an effective indexing system for the labels. Instead of using label IDs only, we need some semantic information about the labels. If we already have some text information about the labels, such as a short text description for the tags in the Wikipedia dataset, then we can use these short texts to represent the labels. For example, we use one of the state-of-the-art word representations, ELMo, to represent the words in the label.
However, short texts do not contain sufficient information about the label, and some words in the short texts might be ambiguous, which makes the short-text representation very noisy. Moreover, in the general XMC setting, there is no information about the label itself. Therefore, we need to develop a label representation from the training data. Here we consider two other label representations that are calculated from training data. The first is called Homer, where the basic idea is to use the columns of the label matrix $Y$ to represent the labels, i.e., $z_l = Y_{:,l}$, where $Y_{:,l}$ is the $l$-th column of $Y$. The second representation, as proposed in Parabel, is the sum of the features of all the relevant instances for a given label. Formally, the $l$-th label representation can be formulated as $z_l = \sum_{i : Y_{il} = 1} x_i$, where $x_i$ is the feature vector of the $i$-th training instance.
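The two data-driven label representations can be sketched in a few lines of Python. This is a dense, toy-scale illustration under our own assumptions: the function names are hypothetical, real XMC data would use sparse matrices, and the optional `normalize` flag reflects the normalized sum mentioned earlier for Parabel rather than a confirmed detail of the implementation.

```python
import math

def homer_label_features(Y):
    """Represent label l by the l-th column of the label matrix Y (N x L)."""
    n_labels = len(Y[0])
    return [[row[l] for row in Y] for l in range(n_labels)]

def parabel_label_features(X, Y, normalize=False):
    """Represent label l by the sum of the features of its relevant instances."""
    n_labels, dim = len(Y[0]), len(X[0])
    Z = [[0.0] * dim for _ in range(n_labels)]
    for x, y in zip(X, Y):
        for l, relevant in enumerate(y):
            if relevant:
                for j, v in enumerate(x):
                    Z[l][j] += v
    if normalize:
        # Optional unit-length scaling (an assumption, see the lead-in).
        for z in Z:
            norm = math.sqrt(sum(v * v for v in z))
            if norm > 0:
                for j in range(dim):
                    z[j] /= norm
    return Z

# Three instances with two features, two labels.
X = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
Y = [[1, 0], [0, 1], [1, 1]]

Z_homer = homer_label_features(Y)        # columns of Y
Z_parabel = parabel_label_features(X, Y) # summed instance features
```

The Homer-style representation has one dimension per training instance, whereas the Parabel-style representation lives in the instance feature space, so the two can lead to quite different clusterings.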
Once we have the label features, we can apply different methods to cluster the labels. Here we describe the following three clustering methods: 1) balanced k-means; 2) balanced KD-tree; 3) balanced random projection.
Balanced k-means
We explore the balanced k-means method that is used in Parabel to do label clustering. The basic idea is to cluster the labels in a balanced 2-means hierarchical tree. Specifically, starting from the root node, all the labels are partitioned into two child nodes of the same size. We then do this recursively on the child nodes until the number of labels in the child nodes reaches a certain number. The partitioning of each node is performed by the balanced 2-means algorithm, where the two child nodes are enforced to have the same number of labels while minimizing the distances from the labels to the cluster centers.
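The recursive balanced 2-means procedure can be sketched as follows. This is a minimal pure-Python illustration assuming squared Euclidean distance, a fixed number of refinement iterations, and random initial centers; the helper names and the `max_leaf` stopping rule are hypothetical, and Parabel's actual implementation may differ in these details.

```python
import random

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def balanced_2means(ids, Z, rng, iters=10):
    """Split ids into two halves whose sizes differ by at most one."""
    c0, c1 = (Z[i] for i in rng.sample(ids, 2))
    half = len(ids) // 2
    for _ in range(iters):
        # Labels that prefer center c0 most strongly come first.
        order = sorted(ids, key=lambda i: sqdist(Z[i], c0) - sqdist(Z[i], c1))
        left, right = order[:half], order[half:]
        c0 = mean([Z[i] for i in left])
        c1 = mean([Z[i] for i in right])
    return left, right

def build_clusters(ids, Z, max_leaf, rng):
    """Recursively split until each cluster holds at most max_leaf labels."""
    if len(ids) <= max_leaf:
        return [ids]
    left, right = balanced_2means(ids, Z, rng)
    return (build_clusters(left, Z, max_leaf, rng)
            + build_clusters(right, Z, max_leaf, rng))

# Eight label vectors forming two well-separated groups of four.
Z = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11]]
clusters = build_clusters(list(range(8)), Z, max_leaf=4, rng=random.Random(0))
```

Sorting by the difference of distances and cutting at the midpoint is what enforces the balance constraint; ordinary k-means would instead assign each label to its nearest center regardless of cluster size.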
Balanced KD-tree
KD-tree is also a way to partition the labels into a hierarchical tree. First we partition all the labels into two clusters of the same size according to the first feature of the label representation. Then, based on this first-level partitioning, we further partition the labels in each cluster according to the second feature. We do this recursively until a depth-k hierarchical partitioning tree is built.
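A balanced KD-tree split can be sketched in Python as below. The function name is hypothetical, and the sketch assumes that the tree depth does not exceed the number of features in the label representation, since level `j` splits on feature `j`.

```python
def balanced_kd_tree(ids, Z, depth, level=0):
    """Split labels at the median of one feature per level, for `depth` levels."""
    if level == depth or len(ids) < 2:
        return [ids]
    # Order the labels by the feature assigned to this level and cut in half.
    order = sorted(ids, key=lambda i: Z[i][level])
    half = len(order) // 2
    return (balanced_kd_tree(order[:half], Z, depth, level + 1)
            + balanced_kd_tree(order[half:], Z, depth, level + 1))

# Four labels with two features each.
Z = [[0, 5], [1, 1], [2, 7], [3, 3]]
one_level = balanced_kd_tree([0, 1, 2, 3], Z, depth=1)
two_levels = balanced_kd_tree([0, 1, 2, 3], Z, depth=2)
```

With `depth=1` the labels are split only on the first feature; with `depth=2` each half is split again on the second feature, giving four leaves.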
Balanced Random Projection
In this method, we use random projection with ordinal quantization to hash the label representations into $k$-dimensional binary vectors as the clustering assignment. In particular, if the label representation is a $d$-dimensional vector, we then sample a random matrix, $R \in \mathbb{R}^{d \times k}$, and project the label representation, $z$, into a $k$-dimensional space, $R^\top z$. For each random feature, we partition the labels into two equal parts according to the ranking of that random feature. Then we can represent each label by a $k$-dimensional binary code, with the balance of the label distribution considered in each random feature independently.
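The following Python sketch illustrates balanced random projection with ordinal quantization. Drawing the random directions from a standard Gaussian is our assumption, as are the function name and the `n_bits` and `seed` parameters; the text only specifies that a random matrix is sampled.

```python
import random

def balanced_random_projection(Z, n_bits, seed=0):
    """Hash each label vector to an n_bits binary code, balancing each bit."""
    rng = random.Random(seed)
    dim = len(Z[0])
    codes = [[0] * n_bits for _ in Z]
    for b in range(n_bits):
        # One random direction per bit (Gaussian entries are an assumption).
        r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        proj = [sum(ri * zi for ri, zi in zip(r, z)) for z in Z]
        # Ordinal quantization: the upper half of the ranking gets bit 1.
        order = sorted(range(len(Z)), key=proj.__getitem__)
        for i in order[len(Z) // 2:]:
            codes[i][b] = 1
    return [tuple(c) for c in codes]

Z = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11]]
codes = balanced_random_projection(Z, n_bits=2)

# Labels sharing a code form one cluster.
groups = {}
for label, code in enumerate(codes):
    groups.setdefault(code, []).append(label)
```

Each bit splits the labels evenly, but since the bits are balanced independently, the resulting clusters (labels sharing the full code) are not guaranteed to have equal sizes.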
4.2 Neural Matching
Assume that after clustering, the labels are partitioned into $K$ clusters $\{I_k\}_{k=1}^K$, where $I_k \subseteq \{1, \dots, L\}$ and $L$ is the number of labels. The task in matching is to find the relevant clusters given an instance. In particular, we want to find a mapping that maps the instance to some of the clusters in $\{I_k\}_{k=1}^K$. To train the matching model, we need to collect the training “labels”, which are the ground-truth clusters that every training instance belongs to. We say the cluster $I_k$ is positive for an instance if the instance has a positive label in $I_k$, i.e., there exists an $l \in I_k$ such that $y_l = 1$, where $y_l$ indicates whether the $l$-th label is relevant to the instance. In this setting, the matching stage can be viewed as a multi-label classification problem as well.
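The construction of the training targets for the matching model can be sketched as follows: a cluster is marked positive for an instance whenever it contains at least one of the instance's positive labels. The function name and the toy data are hypothetical.

```python
def matching_targets(label_sets, clusters):
    """Map each instance's positive labels to its set of positive clusters.

    label_sets: set of positive label ids per training instance
    clusters:   list of label-id lists, one per cluster
    """
    label_to_cluster = {l: k for k, labels in enumerate(clusters) for l in labels}
    return [sorted({label_to_cluster[l] for l in labels}) for labels in label_sets]

clusters = [[0, 1], [2, 3], [4, 5]]
label_sets = [{0, 1}, {1, 4}, {3}]
targets = matching_targets(label_sets, clusters)
```

The first instance has two positive labels but only one positive cluster, which shows how the matching problem is smaller than the original XMC problem.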
To achieve high efficiency, several existing XMC approaches employ a hierarchical linear model for the matching part. For example, Parabel builds a hierarchical label tree which consists of two parts, the internal nodes and the leaf nodes. The leaf nodes are the partitions of the labels, while the internal nodes can be viewed as a hierarchical model for the matching part, which directs the instance to one of the leaf nodes. In particular, starting from the root node, it randomly splits the clusters in a node into two child nodes of the same size and trains a linear multi-label (2-label) classifier for this node. The partitioning is done recursively to build a balanced binary tree until each leaf node has only one cluster. A caveat of the linear approach is that it does not take the ordering of the words in the text into account, which might lose some useful information.
Note that the matching results directly determine the results of the ranking step. Therefore, a good matching model is essential to the final performance and we need to be careful to choose the algorithms and avoid compromising performance for efficiency.
Sequential neural models have demonstrated great success in various NLP applications, such as gated convolution models for sequence learning [9, 12], self-attention models for text classification, as well as the Transformer and its variants for machine translation, to name just a few. We consider the biLSTM-SA models as a realization of the matching stage because of their superior performance compared to the naive biLSTM and one-dimensional convolutions [18, 20] for text classification. Our experimental results later will show that the self-attention model consistently outperforms the simple hierarchical linear model in our setting.
The biLSTM-SA models consist of a biLSTM that extracts hidden representations of a sequence of words, followed by a multi-head self-attention mechanism that refines the representation as learnable weighted sums of the biLSTM hidden vectors, topped with a two-layer MLP that uses ReLU as the activation function. It is worth noting that the multi-head self-attention cell later became the core component of the Transformer model.
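The self-attention pooling step, which turns a sequence of hidden vectors into a small matrix of weighted sums, can be sketched in plain Python. The additive scoring form used here (a `tanh` layer followed by a per-head linear layer and a softmax over time steps) and the weight names `W1` and `W2` are assumptions made for illustration; the biLSTM that would produce `H` and the final MLP are omitted.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(H, W1, W2):
    """Pool T hidden vectors into one weighted sum per attention head.

    H:  T x d hidden states (assumed to come from a biLSTM)
    W1: a x d projection, W2: r x a per-head scoring weights (both assumed)
    Returns an r x d matrix embedding.
    """
    # hidden[t] = tanh(W1 h_t), a vector of size a for every time step.
    hidden = [[math.tanh(sum(w * h for w, h in zip(row, h_t))) for row in W1]
              for h_t in H]
    M = []
    for head in W2:
        logits = [sum(w * u for w, u in zip(head, u_t)) for u_t in hidden]
        weights = softmax(logits)          # attention over time steps
        M.append([sum(a * h_t[j] for a, h_t in zip(weights, H))
                  for j in range(len(H[0]))])
    return M

H = [[1.0, 0.0], [3.0, 4.0]]   # two time steps, hidden size 2
W1 = [[0.5, -0.5]]             # a = 1
W2 = [[0.0], [1.0]]            # r = 2 heads
M = attention_pool(H, W1, W2)
```

With all-zero scoring weights, the first head attends uniformly and its row of `M` is simply the mean of the hidden vectors; the second head produces a different convex combination of the same vectors.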
On the other hand, mlc2seq and SeCSeq cast the multi-label text classification problem as a sequence-to-sequence (Seq2Seq) problem, where the input is a sequence of words and the output is a sequence of labels. In some applications, such as Wikipedia categories, labels inherently preserve hierarchical information, and thus an ordering can be defined by traversing from the root node to the leaf node. An advantage of using the Seq2Seq framework to solve XMC is that, with a sufficient number of training examples, the model may learn to output a set of the most relevant labels without the need to set a threshold to cut the prediction outputs.
4.3 Efficient Ranking
After the matching step, we have retrieved a small subset of clusters and just need to rank the labels in these clusters. As a ranking model, our goal is to model the relevance between the instance and the retrieved labels. Formally, given a label and an instance, we want to find a mapping that maps the instance feature and the label into a score. In this paper, we mainly use the linear one-vs-all approach. The linear one-vs-all approach is one of the most straightforward and well-performing models. This model treats assigning an individual label to an instance as an independent binary classification problem. The class label is positive if the instance belongs to the cluster; otherwise, it is negative. If the instance feature is text, the input of the linear classifier can be the tf-idf feature. The output of the classifier is a probability that the instance belongs to the cluster.
In SLINMER, we focus on how to efficiently rank the retrieved labels in real time using sparse linear models. For a given input $x$, the score of label $l$ is defined as $w_l^\top x$, where $w_l$ is the weight vector for label $l$. For sparse input data, such as tf-idf features of text input, $x$ is a sparse vector. Meanwhile, if we enforce a sparsity structure on the weight vectors during training, we can obtain sparse weight vectors too. Therefore, existing one-vs-all linear classifiers often use sparse vector multiplication to reduce the inference time. However, many existing XMC linear classifiers, such as Parabel, are optimized for batch inference, i.e., the average time is optimized for a large batch of testing data. In real applications, we often need to do real-time inference, where the samples arrive one at a time.
Here we propose a data structure for the weight vectors in a cluster to improve real-time inference. In Figure 2, we show several data structures to store the weight matrix in a label cluster . The label-indexed representation, which is used by Parabel, stores a vector of (feature-index, value) pairs for every label, which we call sparse feature vector, while in feature-indexed representation, we store a vector of (label-index, value) pairs for every feature, which we call sparse label vector. Note that when the label cluster only contains a small set of labels and the feature dimension is very large, the weight matrix will be very sparse, i.e. , and the feature-indexed representation will consume a lot of memory for storing an empty vector for each feature. Therefore, in this case, we can use doubly sparse representation, which is based on feature-indexed representation but only stores non-empty feature vectors.
Given an input data, , and a weight matrix , our goal is to calculate the scores for the labels, i.e., . In the feature-indexed representation, we can find the sparse label vector for every non-zero feature index of the input in a constant time, therefore, the computational complexity is . However, the memory requirement will be for each cluster. For some datasets, such as Wiki-500K ( and ), the total memory required can be huge. In the label-indexed representation, consists of inner products between two sparse vectors. As implemented by Parabel, for every inner-product, it first maps the non-zero features of a sparse feature vector to a zero-initialized dense vector and then uses sparse-dense vector multiplication to calculate the inner product. Therefore, the computational complexity is . To reduce both the computational complexity and memory requirement, we use doubly-sparse weight matrix. We still use (label-index, value) pairs to store each non-empty row in the weight matrix. But to sparsely store the feature indices, a hash table is used to map the feature indices to the non-empty rows. Given a sparse input, we can get the corresponding rows from the non-zero feature indices in a constant time, therefore, the computational complexity for calculating is while using memory.
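The doubly-sparse idea can be sketched with Python dictionaries, which are hash tables: only features that have at least one non-zero weight get an entry, and each entry stores a short list of (label-index, value) pairs. The function names and the toy weights are hypothetical, and the sketch does not attempt to reproduce the memory layout of the actual implementation.

```python
def build_doubly_sparse(label_weights):
    """Convert label-indexed sparse weights to a doubly-sparse structure.

    label_weights: dict label -> dict feature -> weight (label-indexed form)
    Returns a dict feature -> list of (label, weight); features whose row is
    empty are simply absent from the hash table.
    """
    rows = {}
    for label, feats in label_weights.items():
        for f, w in feats.items():
            rows.setdefault(f, []).append((label, w))
    return rows

def score(rows, x):
    """Score the labels of one cluster for a sparse input x (feature -> value)."""
    scores = {}
    for f, v in x.items():
        # Hash lookup: features without any stored weight are skipped.
        for label, w in rows.get(f, ()):
            scores[label] = scores.get(label, 0.0) + w * v
    return scores

# Two labels in a cluster; the feature space may be huge, but only two rows exist.
label_weights = {7: {0: 1.0, 5: 2.0}, 9: {5: -1.0}}
rows = build_doubly_sparse(label_weights)

x = {5: 3.0, 8: 1.0}          # sparse input with two non-zero features
scores = score(rows, x)
```

The work done per input depends only on the non-zero input features and the stored pairs they touch, and no per-label dense scratch vector is needed, which is what makes the structure attractive when samples arrive one at a time.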
5.1 Datasets and Preprocessing
We consider four multi-label text classification datasets downloaded from the publicly available Extreme Classification Repository  for which we had access to the raw text representation, namely Eurlex-4K, Wiki10-28K, AmazonCat-13K and Wiki-500K.
A portion of the training instances is held out as the validation set for hyperparameter tuning.
As shown in Table 1, the data statistics, the number of labels in particular, are slightly different from those in the Extreme Classification Repository for two reasons. First, since only the title of the body text is provided in Wiki10-28K and Wiki-500K, we match each title against the latest Wikipedia dump and extract the raw text of the document. This creates a subset of the original dataset, yielding a slightly smaller number of labels. Second, we adhere to the text preprocessing procedure of prior work: replacing numbers with a special token; building a word vocabulary with the most frequent 80K words; substituting out-of-vocabulary words with a special token; and truncating the documents after 300 words.
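The four preprocessing steps can be sketched in Python as follows. Lowercasing, whitespace tokenization, the number pattern, and the token names `<num>` and `<unk>` are our assumptions for illustration; only the vocabulary size of 80K and the 300-word truncation come from the text.

```python
import re
from collections import Counter

NUM, UNK = "<num>", "<unk>"   # hypothetical special tokens

def tokenize(text):
    """Lowercase, split on whitespace, and replace numbers with NUM."""
    return [NUM if re.fullmatch(r"\d+([.,]\d+)*", t) else t
            for t in text.lower().split()]

def build_vocab(docs, max_size=80000):
    """Keep the max_size most frequent tokens."""
    counts = Counter(t for d in docs for t in tokenize(d))
    return {t for t, _ in counts.most_common(max_size)}

def preprocess(text, vocab, max_len=300):
    """Map out-of-vocabulary tokens to UNK and truncate to max_len tokens."""
    return [t if t in vocab else UNK for t in tokenize(text)][:max_len]

docs = ["the cat sat 12 times", "the dog sat"]
vocab = build_vocab(docs)
tokens = preprocess("The bird sat 7 times", vocab, max_len=4)
```

In this toy run, "bird" is unseen and becomes the out-of-vocabulary token, "7" becomes the number token, and the fifth word is dropped by truncation.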
|Indexing||Matching stage||Ranking stage|
|Homer||balanced k-means (3)||85.25||88.28||66.76||69.61||51.51||53.18||80.65||81.14||67.68||69.67||56.34||57.66|
|Homer||balanced KD-tree (3)||78.34||83.73||64.77||69.73||54.03||57.60||77.83||82.79||65.18||69.36||54.51||57.49|
|Homer||balanced random projection (0)||79.17||83.65||64.97||69.50||53.53||56.79||78.60||81.99||65.61||69.36||54.49||57.23|
|ELMo||balanced k-means (6)||83.70||87.32||71.66||74.91||59.43||61.40||79.84||83.00||67.70||69.81||56.30||58.10|
|ELMo||balanced KD-tree (0)||79.72||84.58||67.11||70.80||55.94||58.01||79.09||82.85||65.79||69.08||55.14||57.38|
|ELMo||balanced random projection (0)||80.88||85.05||68.25||72.22||57.02||59.43||78.47||81.89||66.49||69.80||55.27||57.66|
|Parabel||balanced k-means (4)||91.57||92.91||69.12||70.50||51.71||52.06||81.19||82.82||68.86||70.25||57.51||58.59|
|Parabel||balanced KD-tree (1)||80.60||84.48||66.65||70.48||54.33||57.12||78.99||82.07||65.99||69.37||54.93||57.47|
|Parabel||balanced random projection (1)||84.22||87.50||68.79||72.14||55.12||57.10||79.30||81.76||67.06||69.50||55.89||57.63|
|Method||Prec@1||Prec@3||Prec@5||Recall@1||Recall@3||Recall@5||#parameters||training time (secs)|
|Parabel (T=1) ||81.99||68.89||57.30||16.76||41.24||56.26||2.66M||10.67|
|Parabel (T=3) ||82.48||69.95||58.49||16.87||41.98||57.46||7.98M||32.30|
|FastXML (T=100) ||76.17||61.86||50.75||15.54||37.01||49.75||2.60M||22.71|
|Method||Prec@1||Prec@3||Prec@5||Recall@1||Recall@3||Recall@5||#parameters||training time (secs)|
|SeCSeq (T=1) ||81.61||67.32||56.36||-||-||-||46M||2692.3|
|Parabel (T=1) ||82.76||71.37||62.24||5.03||12.82||18.39||12.44M||49.76|
|SeCSeq (T=4) ||83.54||70.06||59.40||-||-||-||184M||2692.3|
|Parabel (T=3) ||82.78||71.70||62.48||5.02||12.88||18.46||37.32M||142.85|
|FastXML (T=100) ||83.20||68.68||58.39||5.03||12.28||17.13||1.24M||37.74|
|Method||Prec@1||Prec@3||Prec@5||Recall@1||Recall@3||Recall@5||#parameters||training time (hours)|
|SeCSeq (T=1) ||92.10||74.78||59.05||-||-||-||46M||14.21|
|Parabel (T=1) ||90.75||75.61||60.99||25.57||57.35||71.80||46.40M||0.21|
|SeCSeq (T=4) ||93.19||77.34||61.74||-||-||-||184M||14.21|
|Parabel (T=3) ||91.42||76.34||61.68||25.82||57.84||72.53||139.2M||0.70|
|FastXML (T=100) ||92.68||77.17||62.05||26.44||58.70||73.18||167.87M||0.87|
|Method||Prec@1||Prec@3||Prec@5||Recall@1||Recall@3||Recall@5||#parameters||training time (hours)|
|SeCSeq (T=1) ||51.36||30.44||21.71||-||-||-||46M||30.65|
|Parabel (T=1) ||59.09||39.70||30.25||18.05||31.65||37.64||350.76M||0.75|
|Parabel (T=3) ||60.91||41.33||31.67||18.74||33.21||39.75||1052.28M||2.08|
|FastXML (T=100) ||43.46||29.03||22.12||12.30||21.87||26.32||237.55M||2.67|
|Dataset||Weight nnz||Input nnz||Parabel-batch (ms)||DoublySparse-realtime (ms)||Parabel-realtime (ms)|
5.2 Algorithms and Hyperparameter Tuning
We compare SLINMER with state-of-the-art XMC methods, including the tree-based methods FastXML  and Parabel , the OVA-based PD-Sparse , and deep sequential neural models such as mlc2seq  and SeCSeq , on publicly available benchmark multi-label datasets . We follow  to obtain the tokenized text representation for the deep learning methods and use TF-IDF unigram features for the feature-based methods (PD-Sparse and FastXML). The data statistics are summarized in Table 1. We evaluate all methods with example-based ranking measures, namely Precision@k and Recall@k, which are widely used in the extreme multi-label classification literature [27, 5, 16, 35, 26].
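As a reference point for these measures, the following is a minimal sketch of Precision@k and Recall@k for a single test instance, assuming the common definitions (the fraction of the top-k predicted labels that are relevant, and the fraction of relevant labels that appear in the top-k). The function names and the toy scores are illustrative and not taken from the paper; reported numbers would be averages of these values over all test instances.

```python
from typing import List, Sequence, Set


def top_k_labels(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring labels, best first."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def precision_at_k(scores: Sequence[float], relevant: Set[int], k: int) -> float:
    """Fraction of the top-k predicted labels that are relevant."""
    hits = sum(1 for j in top_k_labels(scores, k) if j in relevant)
    return hits / k


def recall_at_k(scores: Sequence[float], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant labels that appear among the top-k predictions."""
    if not relevant:
        return 0.0
    hits = sum(1 for j in top_k_labels(scores, k) if j in relevant)
    return hits / len(relevant)


# Toy instance (hypothetical): five labels, three of which are relevant.
scores = [0.9, 0.1, 0.8, 0.3, 0.7]
relevant = {0, 3, 4}
for k in (1, 3, 5):
    print(k, precision_at_k(scores, relevant, k), recall_at_k(scores, relevant, k))
```

Note that Recall@k is bounded by k divided by the number of relevant labels, which is why datasets with many labels per instance show low recall at small k even when precision is high.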
We use biLSTM-SA  as the matcher in SLINMER. Specifically, the biLSTM model uses a bidirectional LSTM with dimensions in each direction. The self-attention MLP has a hidden layer with units, and the matrix embedding is set to have rows. The final layer is a -layer ReLU output MLP with hidden units. In general, we follow the default hyperparameters set in  and found them robust across the various semantic label indexing configurations and random seeds.
For the baselines, we largely follow the default hyperparameter settings. In particular, the number of trees in FastXML is and the maximum number of instances in a leaf node is . For Parabel, the number of trees is and the maximum number of labels in a leaf node is . Both FastXML and Parabel use loss penalty with the linear L2R L2-loss SVM solver, as implemented in LIBLINEAR . For PD-Sparse, the regularization term and the maximum iteration is with early stopping by monitoring Precision@1 on the validation set. The deep sequential neural models mlc2seq and SeCSeq use gated convolutional networks for sequential learning [9, 12], as implemented in the Fairseq package.
5.3 Empirical Results
In this section, we analyze various configurations of the semantic label indexing stage, present the best configuration of the proposed SLINMER approach and compare it with state-of-the-art XMC methods, investigate realtime inference time under different data structures, and demonstrate different ensemble techniques for the ranking stage.
5.3.1 Analysis on Semantic Label Indexing
Table 2 shows how various configurations of SLINMER affect the performance of the matching stage and the final ranking stage. For the matching algorithms, we use hierarchical linear models similar to Parabel  and the biLSTM-SA , the deep learning model discussed in Section 4. First, note that k-means is a compelling choice for indexing across all three semantic label representations. Specifically, as indicated by the number of wins in parentheses, k-means is the best for the ELMo and Parabel representations and as good as KD-tree for the Homer representation. Given the robustness of k-means, we set it as the default indexing method in our subsequent experiments. Regarding the semantic label representation, we found that ELMo and Parabel are slightly better than Homer.
Finally, we observe that the superior performance of SLINMER indeed comes from the improvement of biLSTM-SA, a deep learning model, over linear models in the matching stage. These empirical results verify our claim that using more complex models, such as deep neural networks, in the matching stage can improve the final ranking performance.
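To make the balanced k-means indexing compared in Table 2 more concrete, the sketch below recursively applies a balanced 2-means split to label representation vectors, in the style of the label partitioning used by Parabel. It is an illustration under stated assumptions rather than the procedure used in the experiments: the helper names (`balanced_two_means`, `build_index`), the `max_leaf` parameter, the number of iterations, and the toy vectors are all hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale v to unit length (returned unchanged if it is the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)


def centroid(vectors, members):
    """Unit-normalized mean of the vectors indexed by `members`."""
    dim = len(vectors[0])
    mean = [sum(vectors[i][d] for i in members) / len(members) for d in range(dim)]
    return unit(mean)


def balanced_two_means(vectors, ids, iters=10, seed=0):
    """Split `ids` into two clusters whose sizes differ by at most one."""
    rng = random.Random(seed)
    a, b = rng.sample(ids, 2)
    c0, c1 = unit(vectors[a]), unit(vectors[b])
    left, right = [], []
    for _ in range(iters):
        # Rank labels by how much more similar they are to centroid 0 than to
        # centroid 1, then cut the ranking in the middle to force balance.
        order = sorted(
            ids,
            key=lambda i: dot(vectors[i], c0) - dot(vectors[i], c1),
            reverse=True,
        )
        half = (len(ids) + 1) // 2
        left, right = order[:half], order[half:]
        c0, c1 = centroid(vectors, left), centroid(vectors, right)
    return left, right


def build_index(vectors, ids=None, max_leaf=2, seed=0):
    """Recursively split until each cluster holds at most `max_leaf` (>= 1) labels."""
    ids = list(range(len(vectors))) if ids is None else ids
    if len(ids) <= max_leaf:
        return [sorted(ids)]
    left, right = balanced_two_means(vectors, ids, seed=seed)
    return build_index(vectors, left, max_leaf, seed) + build_index(vectors, right, max_leaf, seed)


# Toy label representations (hypothetical): labels 0 and 1 are similar, as are 2 and 3.
label_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = build_index(label_vectors, max_leaf=2)
print(sorted(clusters))
```

Forcing each split to be balanced keeps all clusters at nearly the same size, so the matcher predicts over a fixed number of clusters and the ranker handles a bounded number of labels per retrieved cluster.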
5.3.2 SLINMER Ensemble
Figure 3 illustrates three different ensembles of SLINMER configurations. Specifically, SLINMER-v1 ensembles three label representations (Homer, ELMo, Parabel), SLINMER-v2 ensembles three random seeds of the Parabel label representation, and SLINMER-v3 ensembles all combinations (3 label representations × 3 random seeds). We observe that an ensemble over heterogeneous label representations is more effective than an ensemble over different random seeds of a single label representation, which is the ensemble technique used in Parabel models. This again confirms that the diversity of semantic label representations helps the neural matcher and the final ranking stage. Last but not least, ensembling all 9 configurations yields the state-of-the-art results of SLINMER.
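One simple way to combine several configurations is to average their per-label scores and re-rank, as sketched below. This assumes score averaging, which is not confirmed here as the combination rule used for SLINMER; the function names, the configuration names in the comments, and the toy scores are hypothetical.

```python
from collections import defaultdict


def ensemble_average(model_scores):
    """Average label scores over models; a label a model did not score counts as 0."""
    total = defaultdict(float)
    for scores in model_scores:
        for label, s in scores.items():
            total[label] += s
    n = len(model_scores)
    return {label: s / n for label, s in total.items()}


def top_k(scores, k):
    """Labels sorted by decreasing score, truncated to k."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical sparse outputs of three configurations for one test instance.
outputs = [
    {"a": 0.9, "b": 0.2},  # e.g. a model indexed with Homer representations
    {"a": 0.4, "c": 0.8},  # e.g. a model indexed with ELMo representations
    {"b": 0.6, "c": 0.7},  # e.g. a model indexed with Parabel representations
]
averaged = ensemble_average(outputs)
print(top_k(averaged, 2))
```

Because each configuration retrieves a different set of candidate labels, a label that is missed by one model can still be ranked highly when the others agree on it, which is one intuition for why heterogeneous representations help.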
5.3.3 Overall Comparison
Tables 3, 4, 5, and 6 compare the proposed SLINMER with other strong XMC baselines on four benchmark datasets. Each table is separated into two groups: a group of single models and a group of ensemble models. In both groups, SLINMER outperforms the state-of-the-art XMC model Parabel in most cases, except on the Wiki10-28K dataset. It is worth noting that, on the most challenging dataset, Wiki-500K, SLINMER improves over Parabel by around 3% and 6% in absolute terms for the single-model group and the ensemble group, respectively. This significant gain stems from two novel techniques, namely a neural matcher and the ensemble of various semantic label representations.
On the other hand, the training time of SLINMER, measured in GPU running time, is not prohibitively longer than that of other representative XMC baselines running on CPUs. For example, on the Wiki-500K dataset, the training time of SLINMER is shorter than that of PD-Sparse, and only 8 times longer than that of Parabel under the single-model setting. Moreover, SLINMER trains faster than other deep learning baselines such as mlc2seq and SeCSeq.
Finally, regarding the number of parameters, SLINMER may seem to be over-parameterized on the two medium-scale datasets Eurlex-4K and Wiki10-28K. This is because we apply the same hyperparameter setting for the model architecture as on the largest dataset, Wiki-500K. Indeed, the number of parameters of SLINMER on Wiki-500K is comparable to that of Parabel, PD-Sparse, and FastXML. With a comparable number of parameters, SLINMER still outperforms Parabel by a large margin on the Wiki-500K dataset, indicating the advantage of using deep neural models for the matching stage.
5.3.4 Efficient Inference
Table 7 shows the inference time per sample in the batch mode and the realtime mode. The original implementation of Parabel is optimized for the batch-mode setting, where a batch of test samples is fed in at the same time. To reduce the inference time per sample in the batch mode, Parabel uses a label-indexed sparse representation for the weights, as discussed in Section 4.3. However, as shown in the Parabel-realtime column of Table 7, this data structure leads to slow realtime inference, where samples arrive one at a time. Table 7 shows that, by using a doubly-sparse representation, realtime inference can be sped up significantly. The speedup is highly correlated with the ratio between the average number of nonzeros in the weight vectors and the average number of nonzeros in the input vectors.
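The following sketch gives one intuition for why the speedup would track this ratio. It compares two ways of computing the same sparse inner product: iterating over the nonzeros of a weight vector, whose cost grows with the number of weight nonzeros, and iterating over the nonzeros of the input with hashed lookups into the weights, whose cost grows with the number of input nonzeros. The hashed lookup is only a stand-in chosen for illustration; the actual doubly-sparse data structure is the one described in Section 4.3 and may differ. All names and toy values are hypothetical.

```python
def score_by_weight_nonzeros(w, x):
    """Visit every nonzero weight and look up the matching input value.

    Returns (score, number of multiply-adds), which is proportional to nnz(w).
    """
    score, ops = 0.0, 0
    for feature, weight in w.items():
        score += weight * x.get(feature, 0.0)
        ops += 1
    return score, ops


def score_by_input_nonzeros(w, x):
    """Visit every nonzero input value and look up the matching weight.

    Returns (score, number of multiply-adds), which is proportional to nnz(x).
    """
    score, ops = 0.0, 0
    for feature, value in x.items():
        score += w.get(feature, 0.0) * value
        ops += 1
    return score, ops


# Toy data (hypothetical): a weight vector with 1000 nonzeros and an input with 3.
w = {f: float(f % 7 + 1) for f in range(1000)}
x = {3: 2.0, 500: 1.0, 2000: 4.0}

slow_score, slow_ops = score_by_weight_nonzeros(w, x)
fast_score, fast_ops = score_by_input_nonzeros(w, x)
print(slow_score, fast_score)           # identical scores
print(slow_ops, fast_ops)               # work done by each traversal
print(slow_ops / fast_ops)              # ratio nnz(w) / nnz(x)
```

In batch mode the cost of traversing the weights is amortized over many inputs, whereas in realtime mode each sample pays it in full, so a traversal driven by the input nonzeros becomes more attractive as the weight vectors get denser relative to the inputs.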
In this paper, we propose SLINMER, a modular deep learning approach for extreme multi-label classification problems. SLINMER consists of a Semantic Label Indexing component, a Neural Matching component, and an Efficient Ranking component. This is the first deep learning approach to multi-label learning that yields a significant performance improvement in an extreme multi-label setting. In particular, on a Wiki dataset with around millions of labels, the precision@1 is increased from to by an ensemble of various configurations of SLINMER. Thanks to its modularity, SLINMER can flexibly incorporate various application-specific semantic label information to improve performance. A future direction of our work is to adapt newly developed deep learning models such as BERT  to further improve the neural matching component of SLINMER.
-  Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW, 2013.
-  Rohit Babbar and Bernhard Schölkopf. Dismec: distributed sparse machines for extreme multi-label classification. In WSDM, 2017.
-  Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
-  Alex Beutel, Paul Covington, Sagar Jain, Can Xu, Jia Li, Vince Gatto, and Ed H Chi. Latent cross: Making use of context in recurrent recommender systems. In WSDM, 2018.
-  Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In NIPS, 2015.
-  Wei-Cheng Chang, Hsiang-Fu Yu, Inderjit S Dhillon, and Yiming Yang. Secseq: Semantic coding for sequence-to-sequence based extreme multi-label classification, 2018.
-  Yao-Nan Chen and Hsuan-Tien Lin. Feature-aware label space dimension reduction for multi-label classification. In NIPS, 2012.
-  Moustapha M Cisse, Nicolas Usunier, Thierry Artieres, and Patrick Gallinari. Robust bloom filters for large multilabel classification tasks. In NIPS, 2013.
-  Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In ICML, 2017.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. 2018.
-  Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. JMLR, 2008.
-  Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122, 2017.
-  Google. How search works. https://www.google.com/search/howsearchworks/, 2019. Accessed: 2019-1-18.
-  Daniel Hsu, Sham Kakade, John Langford, and Tong Zhang. Multi-label prediction via compressed sensing. In NIPS, 2009.
-  Qixuan Huang, Anshumali Shrivastava, and Yiqiu Wang. MACH: Embarrassingly parallel K-class classification in O(d log K) memory and O(K log K + d log K) time, instead of O(Kd), 2018.
-  Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In KDD, 2016.
-  Kalina Jasinska, Krzysztof Dembczynski, Róbert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, and Eyke Hullermeier. Extreme f-measure maximization using sparse probability estimates. In ICML, 2016.
-  Yoon Kim. Convolutional neural networks for sentence classification. In EMNLP, 2014.
-  Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
-  Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In SIGIR, 2017.
-  Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J Kim, and Johannes Fürnkranz. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In NIPS, 2017.
-  Alexandru Niculescu-Mizil and Ehsan Abbasnejad. Label filters for large scale multilabel classification. In AISTATS, 2017.
-  Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Galinari. Lshtc: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581, 2015.
-  Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
-  Yashoteja Prabhu, Anil Kag, Shilpa Gopinath, Kunal Dahiya, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation. In WSDM, 2018.
-  Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In WWW, 2018.
-  Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In KDD, 2014.
-  Si Si, Huan Zhang, S Sathiya Keerthi, Dhruv Mahajan, Inderjit S Dhillon, and Cho-Jui Hsieh. Gradient boosted decision trees for high dimensional sparse output. In ICML, 2017.
-  Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD’08), 2008.
-  Manik Varma. The extreme classification repository: Multi-label datasets & code. http://manikvarma.org/downloads/XC/XMLRepository.html, 2018. Accessed: 2018-10-5.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
-  Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. 2011.
-  Jason Weston, Ameesh Makadia, and Hector Yee. Label partitioning for sublinear ranking. In ICML, 2013.
-  Ian EH Yen, Xiangru Huang, Wei Dai, Pradeep Ravikumar, Inderjit Dhillon, and Eric Xing. Ppdsparse: A parallel primal-dual sparse method for extreme classification. In KDD, 2017.
-  Ian EH Yen, Xiangru Huang, Kai Zhong, Pradeep Ravikumar, and Inderjit S Dhillon. Pd-sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In ICML, 2016.
-  Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit S Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.