DECAF: Deep Extreme Classification with Label Features

Extreme multi-label classification (XML) involves tagging a data point with its most relevant subset of labels from an extremely large label set, with applications such as product-to-product recommendation with millions of products. Although leading XML algorithms scale to millions of labels, they largely ignore label metadata such as textual descriptions of the labels. On the other hand, classical techniques that can utilize label metadata via representation learning using deep networks struggle in extreme settings. This paper develops the DECAF algorithm, which addresses these challenges by learning models enriched by label metadata that jointly learn model parameters and feature representations using deep networks and offer accurate classification at the scale of millions of labels. DECAF makes specific contributions to model architecture design, initialization, and training, enabling it to offer up to 2-6% more accurate predictions than leading extreme classifiers on publicly available benchmark product-to-product recommendation datasets, such as LF-AmazonTitles-1.3M. At the same time, DECAF was found to be up to 22x faster at inference than leading deep extreme classifiers, which makes it suitable for real-time applications that require predictions within a few milliseconds. The code for DECAF is available at the following URL




1. Introduction

Objective: Extreme multi-label classification (XML) refers to the task of tagging data points with a relevant subset of labels from an extremely large label set. This paper demonstrates that XML algorithms stand to gain significantly by incorporating label metadata. The DECAF algorithm is proposed, which could be up to 2-6% more accurate than leading XML methods such as Astec (Dahiya et al., 2021), MACH (Medini et al., 2019), Bonsai (Khandagale et al., 2020), AttentionXML (You et al., 2019), etc, while offering predictions within a fraction of a millisecond, which makes it suitable for high-volume and time-critical applications.

Short-text applications: Applications such as predicting related products given a retail product’s name (Medini et al., 2019), or predicting related webpages given a webpage title, or related searches (Jain et al., 2019), all involve short texts, with the product name, webpage title, or search query having just 3-10 words on average. In addition to the statistical and computational challenges posed by a large set of labels, short-text tasks are particularly challenging as only a few words are available per data point. This paper focuses on short-text applications such as related product and webpage recommendation.

Label metadata: Metadata for labels can be available in various forms: textual representations, label hierarchies, label taxonomies (Kanagal et al., 2012; Menon et al., 2011; Sachdeva et al., 2018), or label correlation graphs, and can capture semantic relations between labels. For instance, the Amazon products (that serve as labels in a product-to-product recommendation task) “Panzer Dragoon” and “Panzer Dragoon Orta” do not share any common training point but are semantically related. Label metadata can allow collaborative learning, which especially benefits tail labels. Tail labels are those for which very few training points are available and form the majority of labels in XML applications (Jain et al., 2016; Babbar and Schölkopf, 2017, 2019). For instance, just 14 documents are tagged with the label “Panzer Dragoon Orta” while 23 documents are tagged with the label “Panzer Dragoon” in the LF-AmazonTitles-131K dataset. In this paper, we will focus on utilizing label text as a form of label metadata.

DECAF: DECAF learns a separate linear classifier per label based on the 1-vs-All approach. These classifiers critically utilize label metadata and require careful initialization since random initialization (Glorot and Bengio, 2010) leads to inferior performance at extreme scales. DECAF proposes using a shortlister with large fanout to cut down training and prediction time drastically. Specifically, given a training set of $N$ examples, $L$ labels, and $D$-dimensional embeddings being learnt, the use of the shortlister brings training time down from $O(NDL)$ to $O(ND \log L)$ (by training only on the $O(\log L)$ most confusing negative labels for every training point), and prediction time down from $O(DL)$ to $O(D \log L)$ (by evaluating classifiers corresponding to only the $O(\log L)$ most likely labels). An efficient and scalable two-stage strategy is proposed to train the shortlister.

Comparison with state-of-the-art: Experiments conducted on publicly available benchmark datasets revealed that DECAF could be 5% more accurate than leading approaches such as DiSMEC (Babbar and Schölkopf, 2017), Parabel (Prabhu et al., 2018b), Bonsai (Khandagale et al., 2020), AnnexML (Tagami, 2017), etc., which utilize pre-computed features. DECAF was also found to be 2-6% more accurate than leading deep learning-based approaches such as Astec (Dahiya et al., 2021), AttentionXML (You et al., 2019) and MACH (Medini et al., 2019) that jointly learn feature representations and classifiers. Furthermore, DECAF could be up to 22x faster at prediction than deep learning methods such as MACH and AttentionXML.

Contributions: This paper presents DECAF, a scalable deep learning architecture for XML applications that effectively utilizes label metadata. Specific contributions are made in designing a shortlister with a large fanout and a two-stage training strategy. DECAF also introduces a novel initialization strategy for classifiers that leads to accuracy gains, more prominently on data-scarce tail labels. DECAF scales to XML tasks with millions of labels and makes predictions significantly more accurate than state-of-the-art XML methods. Even on datasets with more than a million labels, DECAF can make predictions in a fraction of a millisecond, thereby making it suitable for real-time applications.

2. Related Work

Summary: XML techniques can be categorized into 1-vs-All, tree, and embedding methods. Of these, one-vs-all methods such as Slice (Jain et al., 2019) and Parabel (Prabhu et al., 2018b) offer the most accurate solutions. Recent advances have introduced the use of deep-learning-based representations. However, these techniques mostly do not use label metadata. Techniques such as the X-Transformer (Chang et al., 2020) that do use label text either do not scale well with millions of labels or else do not offer state-of-the-art accuracies. The DECAF method presented in this paper effectively uses label metadata to offer state-of-the-art accuracies and scale to tasks with millions of labels.

1-vs-All classifiers: 1-vs-All classifiers such as PPDSparse (Yen et al., 2017), DiSMEC (Babbar and Schölkopf, 2017), and ProXML (Babbar and Schölkopf, 2019) are known to offer accurate predictions but risk incurring training and prediction costs that are linear in the number of labels, which is prohibitive at extreme scales. Approaches such as negative sampling, PLTs, and learned label hierarchies have been proposed to speed up training (Jain et al., 2019; Khandagale et al., 2020; Prabhu et al., 2018b; Yen et al., 2018) and prediction (Jasinska et al., 2016; Niculescu-Mizil and Abbasnejad, 2017) for 1-vs-All methods. However, they rely on sub-linear search structures such as nearest-neighbor structures or label trees that are well suited for fixed or pre-trained features such as bag-of-words or FastText (Joulin et al., 2017) but do not support jointly learning deep representations, since it is expensive to repeatedly update these search structures as deep-learned representations keep getting updated across learning epochs. Thus, these approaches are unable to utilize deep-learned features, which leads to inaccurate solutions.

DECAF avoids these issues through its shortlister, which offers high-recall filtering of labels and allows training and prediction costs that are logarithmic in the number of labels.

Tree classifiers: Tree-based classifiers typically partition the label space to achieve logarithmic prediction complexity. In particular, MLRF (Agrawal et al., 2013), FastXML (Prabhu and Varma, 2014), PfastreXML (Jain et al., 2016) learn an ensemble of trees where each node in a tree is partitioned by optimizing an objective based on the Gini index or nDCG. CRAFTML (Siblini et al., 2018) deploys random partitioning of features and labels to learn an ensemble of trees. However, such algorithms can be expensive in terms of training time and model size.

Deep feature representations: Recent works such as MACH (Medini et al., 2019), X-Transformer (Chang et al., 2020), XML-CNN (Liu et al., 2017), and AttentionXML (You et al., 2019) have graduated from using fixed or pre-learned features to using task-specific feature representations that can be significantly more accurate. However, CNN and attention-based mechanisms were found to be inaccurate on short-text applications (as shown by Dahiya et al. (2021)) where scant information is available (3-10 tokens) for a data point. Furthermore, approaches like X-Transformer and AttentionXML that learn label-specific document representations do not scale well.

Using label metadata: Techniques that use label metadata, e.g. label text, include SwiftXML (Prabhu et al., 2018a), which uses a pre-trained Word2Vec (Mikolov et al., 2013) model to compute label representations. However, SwiftXML is designed for warm-start settings where a subset of ground-truth labels for each test point is already available. This is a non-standard scenario that is beyond the scope of this paper. Guo et al. (2019) demonstrated, using the GLaS regularizer, that modeling label correlations could lead to gains on tail labels. Siamese networks (Wu et al., 2017) are a popular framework that can learn representations so that documents and their associated labels get embedded together. Unfortunately, Siamese networks were found to be inaccurate at extreme scales. The X-Transformer method (Chang et al., 2020) uses label text to generate shortlists to speed up training and prediction. DECAF, on the other hand, makes much more direct use of label text to train the 1-vs-All label classifiers themselves and offers greater accuracy compared to X-Transformer and other XML techniques that also use label text.

3. DECAF: Deep Extreme Classification with Label Features

Figure 1. DECAF’s frugal prediction pipeline scales to millions of labels. Given a document $\mathbf{x}$, its text embedding $\hat{\mathbf{x}}$ (see Fig 3 (Left)) is first used by the shortlister $\mathcal{S}$ to shortlist the most probable $O(\log L)$ labels while maintaining high recall. The ranker then uses label classifiers (see Fig 3 (Right)) of only the shortlisted labels to produce the final ranking.

Figure 2. (Left) DECAF uses a lightweight architecture with a residual layer to embed both document and label text (see Fig. 3). (Right) Combination blocks are used to combine various representations (separate instances are used in the text embedding blocks ($\mathcal{E}_D, \mathcal{E}_L$) and in label classifiers ($\mathcal{C}_L$)).

Figure 3. (Left) Document text is embedded using an instance $\mathcal{E}_D$ of the text embedding block (see Fig. 2). Stop words (e.g. and, the) are discarded. (Right) DECAF critically incorporates label text into classifier learning. For each label $l \in [L]$, a one-vs-all classifier $\mathbf{w}_l$ is learnt by combining the label-text embedding $\hat{\mathbf{z}}^1_l$ (using a separate instance $\mathcal{E}_L$ of the text embedding block) and a refinement vector $\hat{\mathbf{z}}^2_l$. Note that $\mathcal{E}_D, \mathcal{E}_L, \mathcal{C}_L$ use separate parameters. However, all labels share the blocks $\mathcal{E}_L, \mathcal{C}_L$ and all documents share the block $\mathcal{E}_D$.

Summary: DECAF consists of three components: 1) a lightweight text embedding block suitable for short-text applications, 2) 1-vs-All classifiers per label that incorporate label text, and 3) a shortlister that offers high-recall label shortlists for data points, allowing DECAF to offer sub-millisecond prediction times even with millions of labels. This section details these components, as well as an approximate likelihood model with provable recovery guarantees, using which DECAF offers a highly scalable yet accurate pipeline for jointly training text embeddings and classifier parameters.

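Notation: Let $L$ be the number of labels and $V$ be the dictionary size. Each of the $N$ training points is presented as $(\mathbf{x}_i, \mathbf{y}_i)$. $\mathbf{x}_i \in \mathbb{R}^V$ is a bag-of-tokens representation for the $i$-th document, i.e. $x_{it}$ is the TF-IDF weight of token $t \in [V]$ in the $i$-th document. $\mathbf{y}_i \in \{-1, +1\}^L$ is the ground truth label vector with $y_{il} = +1$ if label $l \in [L]$ is relevant to the $i$-th document and $y_{il} = -1$ otherwise. For each label $l \in [L]$, its label text is similarly represented as $\mathbf{z}_l \in \mathbb{R}^V$.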

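Document and label-text embedding: DECAF learns $D$-dim embeddings for each vocabulary token, i.e. $\mathbf{E} = [\mathbf{e}_1, \ldots, \mathbf{e}_V] \in \mathbb{R}^{D \times V}$, and uses a light-weight embedding block (see Fig 3) to encode label and document texts. The embedding block $\mathcal{E} = \{\mathbf{R}, \boldsymbol{\alpha}, \boldsymbol{\beta}\}$ is parameterized by a residual block $\mathbf{R} \in \mathbb{R}^{D \times D}$ and scaling constants $\boldsymbol{\alpha}, \boldsymbol{\beta} \in \mathbb{R}^D$ for the combination block (see Fig 2). The embedding for a bag-of-tokens vector, say $\mathbf{r} \in \mathbb{R}^V$, is $\mathcal{E}(\mathbf{r}) = \sigma(\boldsymbol{\alpha}) \odot \hat{\mathbf{r}}^0 + \sigma(\boldsymbol{\beta}) \odot (\mathbf{R} \cdot \mathrm{ReLU}(\hat{\mathbf{r}}^0))$ where $\hat{\mathbf{r}}^0 = \mathrm{ReLU}(\mathbf{E}\mathbf{r})$, $\odot$ denotes component-wise multiplication, and $\sigma$ is the sigmoid function. Document embeddings, denoted by $\hat{\mathbf{x}}_i$, are computed as $\hat{\mathbf{x}}_i = \mathrm{ReLU}(\mathcal{E}_D(\mathbf{x}_i))$. Label-text embeddings, denoted by $\hat{\mathbf{z}}^1_l$, are computed as $\hat{\mathbf{z}}^1_l = \mathcal{E}_L(\mathbf{z}_l)$. Note that documents and labels use separate instantiations $\mathcal{E}_D, \mathcal{E}_L$ of the embedding block. We note that DECAF could also be made to use alternate text representations such as BERT (Devlin et al., 2019), attention (You et al., 2019), LSTM (Hochreiter and Schmidhuber, 1997) or convolution (Liu et al., 2017). However, such elaborate architectures negatively impact prediction time and, moreover, DECAF outperforms BERT, CNN and attention-based XML techniques on all our benchmark datasets, indicating the suitability of DECAF’s frugal architecture to short-text applications.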
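As a rough illustration of the embedding block described above, the following pure-Python sketch assumes the block computes $\sigma(\alpha) \odot \hat{r}^0 + \sigma(\beta) \odot (R \cdot \mathrm{ReLU}(\hat{r}^0))$ with $\hat{r}^0 = \mathrm{ReLU}(E r)$. The toy dimensions, the weights, and the helper names (`embed`, `matvec`) are illustrative assumptions and not the authors' implementation.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(M, v):
    # M is a list of rows; returns the matrix-vector product M v.
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def embed(r, E, R, alpha, beta):
    """Assumed form of the embedding block.

    r     : bag-of-tokens vector of length V (e.g. TF-IDF weights)
    E     : D x V token-embedding matrix (one column per token)
    R     : D x D residual block
    alpha : length-D scaling constants for the direct path
    beta  : length-D scaling constants for the residual path
    """
    r0 = relu(matvec(E, r))          # r0 = ReLU(E r)
    res = matvec(R, relu(r0))        # R . ReLU(r0)
    return [sigmoid(a) * x + sigmoid(b) * y
            for a, b, x, y in zip(alpha, beta, r0, res)]

# Toy example with V = 3 tokens and D = 2 dimensions (hypothetical values).
E = [[0.5, -1.0, 0.0],
     [0.0,  2.0, 1.0]]
R = [[1.0, 0.0],
     [0.0, 1.0]]                     # residual block initialized to identity
alpha = [0.0, 0.0]                   # sigmoid(0) = 0.5 on both paths
beta = [0.0, 0.0]
r = [1.0, 0.0, 2.0]

doc_embedding = relu(embed(r, E, R, alpha, beta))
```

With an identity residual block and zero scaling constants, both paths carry equal weight and the block simply returns $\hat{r}^0$, which is why such an initialization is a neutral starting point.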

1-vs-All Label Classifiers: DECAF uses high capacity 1-vs-All (OvA) classifiers that outperform tree- and embedding-based classifiers (Chang et al., 2020; Jain et al., 2019; Babbar and Schölkopf, 2019; Prabhu et al., 2018b; Yen et al., 2017; Babbar and Schölkopf, 2017). However, DECAF distinguishes itself from previous OvA works (even those such as (Chang et al., 2020) that do use label text) by directly incorporating label text into the OvA classifiers. For each label $l \in [L]$, the label-text embedding $\hat{\mathbf{z}}^1_l$ (see above) is combined with a refinement vector $\hat{\mathbf{z}}^2_l$ that is learnt separately per label, to produce the label classifier $\mathbf{w}_l = \sigma(\boldsymbol{\alpha}_L) \odot \hat{\mathbf{z}}^1_l + \sigma(\boldsymbol{\beta}_L) \odot \hat{\mathbf{z}}^2_l$ where $\mathcal{C}_L = \{\boldsymbol{\alpha}_L, \boldsymbol{\beta}_L\}$ are shared across labels (see Fig 3). Incorporating $\hat{\mathbf{z}}^1_l$ into the label classifier $\mathbf{w}_l$ allows labels that never co-occur, but nevertheless share tokens, to perform learning in a collaborative manner since if two labels, say $l, m \in [L]$, share some token $t \in [V]$ in their respective texts, then $\mathbf{e}_t$ contributes to both $\hat{\mathbf{z}}^1_l$ and $\hat{\mathbf{z}}^1_m$. In particular, this allows rare labels to share classifier information with popular labels with which they share a token. Ablation studies (Tab 4, 5, 6) show that incorporating label text into classifier learning offers DECAF significant gains of 2-6% compared to methods that do not use label text. Incorporating other forms of label metadata, such as label hierarchies, could also lead to further gains.
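The collaborative effect of shared tokens can be seen in a small sketch. It assumes that each classifier is a sigmoid-gated combination of a label-text embedding and a per-label refinement vector, and it replaces the full embedding block with a plain sum of token embeddings to keep the example short. The token vectors and the helper names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def combine(alpha, beta, z1, z2):
    # Assumed combination block: sigma(alpha) * z1 + sigma(beta) * z2, component-wise.
    return [sigmoid(a) * u + sigmoid(b) * v
            for a, b, u, v in zip(alpha, beta, z1, z2)]

# Hypothetical 2-dimensional token embeddings.
token_emb = {"panzer": [1.0, 0.0], "dragoon": [0.0, 1.0], "orta": [1.0, 1.0]}

def label_text_embedding(tokens):
    # Simplification: sum of token embeddings stands in for the embedding block.
    out = [0.0, 0.0]
    for t in tokens:
        out = [o + e for o, e in zip(out, token_emb[t])]
    return out

labels = {
    "Panzer Dragoon": ["panzer", "dragoon"],
    "Panzer Dragoon Orta": ["panzer", "dragoon", "orta"],
}
refinement = {name: [0.0, 0.0] for name in labels}   # per-label refinement vectors
alpha, beta = [0.0, 0.0], [0.0, 0.0]                  # shared across all labels

def classifiers():
    return {name: combine(alpha, beta, label_text_embedding(toks), refinement[name])
            for name, toks in labels.items()}

before = classifiers()
token_emb["dragoon"] = [0.0, 3.0]   # pretend a training step updated a shared token
after = classifiers()
```

Updating the embedding of the shared token "dragoon" moves both classifiers at once, even though the two labels have no training point in common.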

Shortlister: OvA training and prediction can be prohibitive, $\Omega(NDL)$ and $\Omega(DL)$ resp., if done naively. A popular way to accelerate training is to, for every data point $i$, use only a shortlist containing all positive labels (that are relatively few, around $O(\log L)$) and a small subset of the, say again $O(\log L)$, most challenging negative labels (Chang et al., 2020; Jain et al., 2019; Khandagale et al., 2020; Prabhu et al., 2018b; Yen et al., 2017; Bhatia et al., 2015). This allows training to be performed in $O(ND \log L)$ time instead of $O(NDL)$ time. DECAF learns a shortlister $\mathcal{S}$ that offers a label-clustering based shortlisting. We have $\mathcal{S} = \{\mathcal{C}, \mathbf{H}\}$ where $\mathcal{C} = \{C_1, \ldots, C_K\}$ is a balanced clustering of the $L$ labels and $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_K] \in \mathbb{R}^{D \times K}$ are OvA classifiers, one for each cluster. Given the embedding $\hat{\mathbf{x}}$ of a document and beam-size $B$, the top $B$ clusters with the highest scores, say $\langle \mathbf{h}_{m_1}, \hat{\mathbf{x}} \rangle \geq \langle \mathbf{h}_{m_2}, \hat{\mathbf{x}} \rangle \geq \ldots$, are taken and labels present therein are shortlisted, i.e. $\mathcal{S}(\hat{\mathbf{x}}) := \{m_1, \ldots, m_B\}$. As clusters are balanced, we get, for every datapoint, $LB/K$ shortlisted labels in the clusters returned. DECAF uses $K = 2^{17}$ clusters for large datasets.
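The cluster-based shortlisting step can be sketched as follows. This is a minimal illustration that assumes one linear classifier per cluster and a fixed beam size; the cluster assignment, the classifier weights, and the function name `shortlist` are made up for the example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def shortlist(x_hat, H, clusters, beam):
    """Return the top-`beam` cluster ids and the labels they contain.

    x_hat    : document embedding (length D)
    H        : list of K cluster classifiers, each of length D
    clusters : list of K lists of label ids (a balanced clustering)
    """
    scores = [dot(h, x_hat) for h in H]
    top = sorted(range(len(H)), key=lambda m: scores[m], reverse=True)[:beam]
    labels = [l for m in top for l in clusters[m]]
    return top, labels

# Toy setup: L = 8 labels in K = 4 balanced clusters, D = 2.
clusters = [[0, 1], [2, 3], [4, 5], [6, 7]]
H = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
x_hat = [2.0, 1.0]

top_clusters, shortlisted = shortlist(x_hat, H, clusters, beam=2)
```

Because the clusters are balanced, the shortlist always contains `L * beam / K` labels (here 8 * 2 / 4 = 4), so only that many label classifiers need to be evaluated per document.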

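Prediction Pipeline: Fig 1 shows the frugal prediction pipeline adopted by DECAF. Given a document $\mathbf{x}$, its embedding $\hat{\mathbf{x}}$ is used by the shortlister to obtain a shortlist of $B$ label clusters $\mathcal{S}(\hat{\mathbf{x}}) = \{m_1, \ldots, m_B\}$. Label scores are computed for every shortlisted label, i.e. $l \in C_m, m \in \mathcal{S}(\hat{\mathbf{x}})$, by combining shortlister and OvA classifier scores as $s_l := \sigma(\langle \mathbf{w}_l, \hat{\mathbf{x}} \rangle) \cdot \sigma(\langle \mathbf{h}_m, \hat{\mathbf{x}} \rangle)$. These scores are sorted to make the final prediction. In practice, even on a dataset with 1.3 million labels, DECAF could make predictions within 0.2 ms using a GPU and 2 ms using a CPU.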
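A compact sketch of this two-stage prediction is given below. It assumes that the cluster score and the label score are each passed through a sigmoid and multiplied; this combination rule, the toy weights, and the function name `predict` are assumptions made for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict(x_hat, H, W, clusters, beam, top_k):
    """Shortlist clusters, score only their labels, and return the top_k label ids."""
    cluster_scores = [dot(h, x_hat) for h in H]
    top = sorted(range(len(H)), key=lambda m: cluster_scores[m], reverse=True)[:beam]
    scored = []
    for m in top:
        for l in clusters[m]:
            # Assumed combination: product of sigmoid-squashed label and cluster scores.
            s = sigmoid(dot(W[l], x_hat)) * sigmoid(cluster_scores[m])
            scored.append((s, l))
    scored.sort(reverse=True)
    return [l for _, l in scored[:top_k]]

# Toy setup: 4 labels in 2 clusters, D = 2 (hypothetical weights).
clusters = [[0, 1], [2, 3]]
H = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [-1.0, 0.0], [0.0, 5.0], [0.0, -5.0]]
x_hat = [1.0, 0.2]

narrow = predict(x_hat, H, W, clusters, beam=1, top_k=2)
wide = predict(x_hat, H, W, clusters, beam=2, top_k=2)
```

With `beam=1` only the labels of the best cluster are ever scored, so label 2 cannot be returned even though its own classifier score is high; widening the beam trades extra classifier evaluations for recall.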

3.1. Efficient Training: the DeepXML Pipeline

Summary: DECAF adopts the scalable DeepXML pipeline (Dahiya et al., 2021) that splits training into 4 modules. In summary, Module I jointly learns the token embeddings $\mathbf{E}$, the embedding modules $\mathcal{E}_D, \mathcal{E}_L$ and the shortlister $\mathcal{S}$. Module II fine-tunes $\mathcal{E}_D, \mathcal{E}_L, \mathcal{S}$, and retrieves label shortlists for all data points. After performing initialization in Module III, Module IV jointly learns the OvA classifiers $\mathbf{W}$ and fine-tunes $\mathcal{E}_D, \mathcal{E}_L, \mathbf{E}$ using the shortlists generated in Module II. Due to lack of space, some details are provided in the supplementary material [1].

[1] Supplementary Material Link:

Module I: Token embeddings $\mathbf{E} \in \mathbb{R}^{D \times V}$ are randomly initialized using (He et al., 2015), residual blocks within the blocks $\mathcal{E}_D, \mathcal{E}_L$ are initialized to identity, and label centroids are created by aggregating document information for each label $l \in [L]$ as $\mathbf{c}_l = \sum_{i: y_{il} = +1} \mathbf{x}_i$. Balanced hierarchical binary clustering (Prabhu et al., 2018b) is now done on these label centroids for 17 levels to generate $K$ label clusters. Clustering labels using label centroids performed better than using other representations such as label text $\mathbf{z}_l$. This is because the label centroid $\mathbf{c}_l$ carries information from multiple documents and thus a diverse set of tokens, whereas $\mathbf{z}_l$ contains information from only a handful of tokens. The hierarchy itself is discarded and each resulting cluster is now treated as a meta-label, which gives us a meta multi-label classification problem on the same training points, but with $K$ meta-labels instead of the original $L$ labels. Each meta-label $m \in [K]$ is granted meta-label text as $\mathbf{u}_m = \sum_{l \in C_m} \mathbf{z}_l$. Each datapoint $i \in [N]$ is assigned a meta-label vector $\tilde{\mathbf{y}}_i \in \{-1, +1\}^K$ such that $\tilde{y}_{im} = +1$ if $y_{il} = +1$ for any $l \in C_m$ and $\tilde{y}_{im} = -1$ if $y_{il} = -1$ for all $l \in C_m$. OvA meta-classifiers $\mathbf{H} = [\mathbf{h}_1, \ldots, \mathbf{h}_K] \in \mathbb{R}^{D \times K}$ are learnt to solve this meta multi-label problem but are constrained in Module I to be of the form $\mathbf{h}_m = \mathcal{E}_L(\mathbf{u}_m)$. This constrained form of the meta-classifier forces good token embeddings to be learnt that allow meta-classification without the assistance of powerful refinement vectors. However, this form continues to allow collaborative learning among meta-classifiers based on shared tokens. Module I solves the meta multi-label classification problem while jointly training $\mathcal{E}_D, \mathcal{E}_L, \mathbf{E}$ (implicitly learning $\mathbf{H}$ in the process).
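The data preparation in Module I can be sketched in a few lines. The sketch assumes that a label centroid is the sum of the bag-of-tokens vectors of the documents tagged with that label, and that meta-label text is the sum of the member labels' text vectors; the clustering itself is taken as given rather than computed. All data and function names are illustrative.

```python
def label_centroids(X, Y_pos, num_labels):
    """X: N bag-of-tokens vectors; Y_pos: N sets of relevant label ids."""
    V = len(X[0])
    C = [[0.0] * V for _ in range(num_labels)]
    for x, pos in zip(X, Y_pos):
        for l in pos:
            C[l] = [c + a for c, a in zip(C[l], x)]
    return C

def meta_labels(Y_pos, clusters):
    """A cluster is relevant to a document if any of its labels is relevant."""
    label_to_cluster = {l: m for m, members in enumerate(clusters) for l in members}
    return [sorted({label_to_cluster[l] for l in pos}) for pos in Y_pos]

def meta_label_text(Z, clusters):
    """Z: L label-text vectors; returns one aggregated text vector per cluster."""
    V = len(Z[0])
    U = []
    for members in clusters:
        u = [0.0] * V
        for l in members:
            u = [a + b for a, b in zip(u, Z[l])]
        U.append(u)
    return U

# Toy data: N = 3 documents, L = 4 labels, V = 3 tokens (hypothetical).
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
Y_pos = [{0}, {1, 2}, {2, 3}]
Z = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
clusters = [[0, 1], [2, 3]]   # assumed output of a balanced clustering

centroids = label_centroids(X, Y_pos, num_labels=4)
Y_meta = meta_labels(Y_pos, clusters)
U = meta_label_text(Z, clusters)
```

Label 2 is tagged on two documents, so its centroid mixes tokens from both, which is the diversity argument made above for clustering on centroids rather than on label text.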

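Module II: The shortlister is fine-tuned in this module. Label centroids are recomputed as $\mathbf{c}_l = \sum_{i: y_{il} = +1} \mathbf{E}\mathbf{x}_i$ where $\mathbf{E}$ are the task-specific token embeddings learnt in Module I. The meta multi-label classification problem is recreated using these new centroids by following the same steps outlined in Module I. Module II uses OvA meta-classifiers that are more powerful and resemble those used by DECAF. Specifically, we now have $\mathbf{h}_m = \sigma(\boldsymbol{\alpha}_P) \odot \hat{\mathbf{u}}^1_m + \sigma(\boldsymbol{\beta}_P) \odot \hat{\mathbf{u}}^2_m$ where $\hat{\mathbf{u}}^1_m = \mathcal{E}_L(\mathbf{u}_m)$ is the meta label-text embedding, $\hat{\mathbf{u}}^2_m$ are meta label-specific refinement vectors, and $\mathcal{C}_P = \{\boldsymbol{\alpha}_P, \boldsymbol{\beta}_P\}$ is a fresh instantiation of the combination block. Module II solves the (new) meta multi-label classification problem, jointly learning $\mathcal{C}_P, \hat{\mathbf{u}}^2_m$ (implicitly updating $\mathbf{H}$ in the process) and fine-tuning $\mathcal{E}_D, \mathcal{E}_L, \mathbf{E}$. The shortlister $\mathcal{S}$ so learnt is now used to retrieve shortlists $\mathcal{S}(\hat{\mathbf{x}}_i)$ for each data point $i \in [N]$.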

Module III: Residual blocks within $\mathcal{E}_D, \mathcal{E}_L$ are re-initialized to identity, $\mathcal{S}$ is frozen and combination block parameters for the OvA classifiers are initialized to $\boldsymbol{\alpha}_L = \boldsymbol{\beta}_L = \mathbf{0}$ (note that $\sigma(\mathbf{0}) = 0.5 \cdot \mathbf{1}$ where $\mathbf{1}$ is the all-ones vector). Refinement vectors for all $L$ labels are initialized to $\hat{\mathbf{z}}^2_l = \mathbf{E}\mathbf{z}_l$. Ablation studies (see Tab 6) show that this refinement vector initialization offers performance boosts of up to 5-10% compared to random initialization as is used by existing methods such as AttentionXML (You et al., 2019) and the X-Transformer (Chang et al., 2020).

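Module IV: This module performs learning using an approximate likelihood model. Let $\Theta$ denote the model parameters in the DECAF architecture. We recall that $\mathcal{C}_L, \mathcal{C}_P$ are combination blocks used to construct the OvA classifiers and meta-classifiers, and $\mathbf{E}$ are the token embeddings. OvA approaches assume a likelihood decomposition such as $\mathbb{P}[\mathbf{y}_i \mid \hat{\mathbf{x}}_i, \Theta] = \prod_{l=1}^{L} \mathbb{P}[y_{il} \mid \hat{\mathbf{x}}_i, \mathbf{w}_l] = \prod_{l=1}^{L} \left(1 + \exp(-y_{il} \langle \hat{\mathbf{x}}_i, \mathbf{w}_l \rangle)\right)^{-1}$. Here $\hat{\mathbf{x}}_i$ is the document-text embedding and $\mathbf{w}_l$ are the OvA classifiers as shown in Fig 3. Let us abbreviate $\ell_{il}(\Theta) = \ln\left(1 + \exp(-y_{il} \langle \hat{\mathbf{x}}_i, \mathbf{w}_l \rangle)\right)$. Then, our objective is to optimize $\arg\min_{\Theta} \mathcal{L}(\Theta)$ where

$$\mathcal{L}(\Theta) = \frac{1}{NL} \sum_{i \in [N]} \sum_{l \in [L]} \ell_{il}(\Theta).$$

However, performing the above optimization exactly is intractable and takes $\Omega(NDL)$ time. DECAF solves this problem by instead optimizing $\arg\min_{\Theta} \tilde{\mathcal{L}}(\Theta \mid \mathcal{S})$ where

$$\tilde{\mathcal{L}}(\Theta \mid \mathcal{S}) = \frac{K}{NLB} \sum_{i \in [N]} \sum_{m \in \mathcal{S}(\hat{\mathbf{x}}_i)} \sum_{l \in C_m} \ell_{il}(\Theta).$$

Recall that for any document, $\mathcal{S}(\hat{\mathbf{x}}_i)$ is a shortlist of $B$ label clusters (that give us a total of $LB/K$ labels). Thus, the above expression contains only $NLB/K \ll NL$ terms as DECAF uses a large fanout $K$ and a beam size $B \ll K$. The result below assures us that model parameters and embeddings obtained by optimizing $\tilde{\mathcal{L}}$ perform well w.r.t. the original likelihood $\mathcal{L}$ if the dataset exhibits label sparsity and the shortlister assures high recall.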
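To make the saving concrete, the sketch below compares a logistic loss averaged over all document-label pairs with the same loss averaged over shortlisted labels only. It assumes labels in $\{-1, +1\}$ and precomputed classifier scores; the score matrix, the shortlists, and the function names are hypothetical.

```python
import math

def logistic_loss(y, score):
    # ln(1 + exp(-y * score)) for y in {-1, +1}
    return math.log1p(math.exp(-y * score))

def full_loss(scores, Y):
    """Average logistic loss over all N * L document-label pairs."""
    terms = [logistic_loss(y, s)
             for row_s, row_y in zip(scores, Y)
             for s, y in zip(row_s, row_y)]
    return sum(terms) / len(terms), len(terms)

def shortlisted_loss(scores, Y, shortlists):
    """Average logistic loss over shortlisted labels only."""
    terms = [logistic_loss(Y[i][l], scores[i][l])
             for i, labels in enumerate(shortlists)
             for l in labels]
    return sum(terms) / len(terms), len(terms)

# Toy problem: N = 2 documents, L = 4 labels, one positive label each.
scores = [[3.0, -3.0, -3.0, -3.0],
          [-3.0, -3.0, 2.0, -3.0]]
Y = [[+1, -1, -1, -1],
     [-1, -1, +1, -1]]
shortlists = [[0, 1], [2, 3]]   # each shortlist contains the positive label

exact, n_exact = full_loss(scores, Y)
approx, n_approx = shortlisted_loss(scores, Y, shortlists)
```

Here the shortlisted objective touches half as many terms as the exact one. When the shortlist has high recall, the omitted terms are almost all easy negatives, which is the intuition behind optimizing the approximate objective.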

Theorem 3.1.

Suppose the training data has label sparsity at rate i.e. and the shortlister offers a recall rate of on the training set i.e. . Then if is obtained by optimizing the approximate likelihood function , then the following always holds

Please refer to Appendix A.1 in the supplementary material for the proof. As and , the excess error term vanishes at rate at least . Our XML datasets do exhibit label sparsity at rate and Fig 6 shows that DECAF’s shortlister does offer high recall with small shortlists (80% recall with -sized shortlist and 85% recall with -sized shortlist). Since Thm 3.1 holds in the completely agnostic setting, it establishes the utility of learning when likelihood maximization is performed only on label shortlists with high-recall. Module IV uses these shortlists to jointly learn the OvA classifiers and , as well as fine-tune the embedding blocks and token embeddings .

Loss Function and Regularization: Modules I, II, IV use the logistic loss and the Adam (Kingma and Ba, 2014) optimizer to train the model parameters and various refinement vectors. Residual layers used in the text embedding blocks were subjected to spectral regularization (Miyato et al., 2018). All ReLU layers were followed by a dropout layer with 50% drop-rate in Module-I and 20% for the rest of the modules.

Ensemble Learning: DECAF learns an inexpensive ensemble of 3 instances (see Figure 6). The three instances share Module I training to promote scalability i.e. they inherit the same token embeddings. However, they carry out training Module II onwards independently. Thus, the shortlister and embedding modules get fine-tuned for each instance.

Time Complexity: Appendix A.2 in the supplementary material presents time complexity analysis for the DECAF modules.

4. Experiments

Method | PSP@1 | PSP@5 | P@1 | P@5 | Time (ms)
LF-AmazonTitles-131K
DECAF | 30.85 | 41.42 | 38.4 | 18.65 | 0.1 (1.15)
Astec | 29.22 | 39.49 | 37.12 | 18.24 | 2.34
AttentionXML | 23.97 | 32.57 | 32.25 | 15.61 | 5.19
MACH | 24.97 | 34.72 | 33.49 | 16.45 | 0.23
X-Transformer | 21.72 | 27.09 | 29.95 | 13.07 | 15.38
Siamese | 13.3 | 13.36 | 13.81 | 5.81 | 0.2
Parabel | 23.27 | 32.14 | 32.6 | 15.61 | 0.69
Bonsai | 24.75 | 34.86 | 34.11 | 16.63 | 7.49
DiSMEC | 25.86 | 36.97 | 35.14 | 17.24 | 5.53
PfastreXML | 26.81 | 34.24 | 32.56 | 16.05 | 2.32
XT | 22.37 | 31.64 | 31.41 | 15.48 | 9.12
Slice | 23.08 | 31.89 | 30.43 | 14.84 | 1.58
AnnexML | 19.23 | 32.26 | 30.05 | 16.02 | 0.11
LF-WikiSeeAlsoTitles-320K
DECAF | 16.73 | 21.01 | 25.14 | 12.86 | 0.09 (0.97)
Astec | 13.69 | 17.5 | 22.72 | 11.43 | 2.67
AttentionXML | 9.45 | 11.73 | 17.56 | 8.52 | 7.08
MACH | 9.68 | 12.53 | 18.06 | 8.99 | 0.52
X-Transformer | - | - | - | - | -
Siamese | 10.1 | 9.59 | 10.69 | 4.51 | 0.17
Parabel | 9.24 | 11.8 | 17.68 | 8.59 | 0.8
Bonsai | 10.69 | 13.79 | 19.31 | 9.55 | 14.82
DiSMEC | 10.56 | 14.82 | 19.12 | 9.87 | 11.02
PfastreXML | 12.15 | 13.26 | 17.1 | 8.35 | 2.59
XT | 8.99 | 11.82 | 17.04 | 8.6 | 12.86
Slice | 11.24 | 15.2 | 18.55 | 9.68 | 1.85
AnnexML | 7.24 | 11.75 | 16.3 | 8.84 | 0.13
LF-AmazonTitles-1.3M
DECAF | 22.07 | 29.3 | 50.67 | 40.35 | 0.16 (1.73)
Astec | 21.47 | 27.86 | 48.82 | 38.44 | 2.61
AttentionXML | 15.97 | 22.54 | 45.04 | 36.25 | 29.53
MACH | 9.32 | 13.26 | 35.68 | 28.35 | 2.09
X-Transformer | - | - | - | - | -
Siamese | - | - | - | - | -
Parabel | 16.94 | 24.13 | 46.79 | 37.65 | 0.89
Bonsai | 18.48 | 25.95 | 47.87 | 38.34 | 39.03
DiSMEC | - | - | - | - | -
PfastreXML | 28.71 | 32.51 | 37.08 | 31.43 | 23.64
XT | 13.67 | 19.06 | 40.6 | 32.01 | 5.94
Slice | 13.8 | 18.89 | 34.8 | 27.71 | 1.45
AnnexML | 15.42 | 21.91 | 47.79 | 36.91 | 0.12
Table 1. Results on publicly available short-text datasets. DECAF was found to be 2–6% more accurate, as well as an order of magnitude faster at prediction compared to other deep learning based approaches. Algorithms marked with a ‘-’ were unable to scale on the given dataset within available resources and timeout period. Prediction times for DECAF within parenthesis indicate those obtained on a CPU whereas those outside parentheses are times on a GPU.

Datasets: Experiments were conducted on product-to-product and related-webpage recommendation datasets. These were short-text tasks with only the product/webpage titles being used to perform prediction. Of these, LF-AmazonTitles-131K, LF-AmazonTitles-1.3M, and LF-WikiSeeAlsoTitles-320K are publicly available at The Extreme Classification Repository (Bhatia et al., 2016). Results are also reported on two proprietary product-to-product recommendation datasets (LF-P2PTitles-300K and LF-P2PTitles-2M) mined from click logs of the Bing search engine, where a pair of products was considered similar if the Jaccard index of the sets of queries which led to a click on them was found to be more than a certain threshold. We also considered some datasets’ long-text counterparts, namely LF-Amazon-131K and LF-WikiSeeAlso-320K, which contained the entire product/webpage descriptions. Note that LF-AmazonTitles-131K and LF-AmazonTitles-1.3M (as well as their long-text counterparts) are subsets of the standard AmazonTitles-670K and AmazonTitles-3M datasets respectively, and were created by restricting the label set to labels for which label text was available. Please refer to Appendix A.3 and Table 7 in the supplementary material for dataset preparation details and dataset statistics.
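The Jaccard-based similarity rule used to mine the proprietary datasets can be illustrated with a short sketch. The query sets and the threshold value of 0.4 are invented for the example; the source only states that some threshold was used.

```python
def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical click logs: product -> set of queries that led to a click on it.
clicks = {
    "product_A": {"gaming mouse", "wireless mouse", "rgb mouse"},
    "product_B": {"wireless mouse", "rgb mouse", "bluetooth mouse"},
    "product_C": {"office chair"},
}

THRESHOLD = 0.4  # illustrative value only

def similar(p, q):
    return jaccard(clicks[p], clicks[q]) > THRESHOLD

sim_ab = jaccard(clicks["product_A"], clicks["product_B"])
```

Products A and B share two of four distinct queries, giving an index of 0.5, so they would be treated as a related pair under this illustrative threshold, while A and C would not.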

Baseline algorithms: DECAF was compared to leading deep extreme classifiers including the X-Transformer (Chang et al., 2020), Astec (Dahiya et al., 2021), XT (Wydmuch et al., 2018), AttentionXML (You et al., 2019), and MACH (Medini et al., 2019), as well as standard extreme classifiers based on fixed or sparse BoW features including Bonsai (Khandagale et al., 2020), DiSMEC (Babbar and Schölkopf, 2017), Parabel (Prabhu et al., 2018b), AnnexML (Tagami, 2017), and Slice (Jain et al., 2019). Slice was trained with fixed FastText (Bojanowski et al., 2017) features, while other methods used sparse BoW features. Unfortunately, GLaS (Guo et al., 2019) could not be included in the experiments as its code was not publicly available. Each baseline deep learning method was given a 12-core Intel Skylake 2.4 GHz machine with 4 Nvidia V100 GPUs. However, DECAF was offered a 6-core Intel Skylake 2.4 GHz machine with a single Nvidia V100 GPU. A training timeout of 1 week was set for every method. Please refer to Table 9 in the supplementary material for more details.

Method | PSP@1 | PSP@5 | P@1 | P@5 | C@20
LF-P2PTitles-300K
DECAF | 42.43 | 62.3 | 47.17 | 22.69 | 95.32
Astec | 39.44 | 57.83 | 44.30 | 21.56 | 95.61
Parabel | 37.26 | 55.32 | 43.14 | 20.99 | 95.59
PfastreXML | 35.79 | 49.9 | 39.4 | 18.77 | 87.91
Slice | 27.03 | 34.95 | 31.27 | 25.19 | 95.06
LF-P2PTitles-2M
DECAF | 36.65 | 45.15 | 40.27 | 31.45 | 93.08
Astec | 32.75 | 41 | 36.34 | 28.74 | 95.3
Parabel | 30.21 | 38.46 | 35.26 | 28.06 | 92.82
PfastreXML | 28.84 | 35.65 | 30.52 | 24.6 | 88.05
Slice | 27.03 | 34.95 | 31.27 | 25.19 | 93.43
Table 2. Results on proprietary product-to-product (P2P) recommendation datasets. C@20 denotes label coverage offered by the top 20 predictions of each method. DECAF offers significantly better accuracies than all competing methods. Other methods such as AnnexML and DiSMEC did not scale with available resources within the timeout period.

Evaluation: Standard extreme classification metrics (Babbar and Schölkopf, 2019; Prabhu and Varma, 2014; Prabhu et al., 2018b; You et al., 2019; Liu et al., 2017), namely precision (P@$k$) and propensity-scored precision (PSP@$k$), were used (Tables 1 and 2 report $k = 1$ and $k = 5$) and are detailed in Appendix A.4 in the supplementary material.
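For readers unfamiliar with these metrics, the sketch below computes them for a single test point using their standard definitions: precision at $k$ counts relevant labels among the top $k$ predictions, and the propensity-scored variant weights each hit by the inverse of the label's propensity so that rare labels count more. The ranking and propensity values are made up, and the unnormalized form of PSP@$k$ is shown.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked labels that are relevant."""
    return sum(1 for l in ranked[:k] if l in relevant) / k

def psp_at_k(ranked, relevant, propensity, k):
    """Propensity-scored precision: each hit is weighted by 1 / propensity."""
    return sum(1.0 / propensity[l] for l in ranked[:k] if l in relevant) / k

# Hypothetical test point: predicted ranking, ground truth, label propensities.
ranked = [3, 1, 4, 0, 2]
relevant = {1, 4}
propensity = {0: 0.9, 1: 0.5, 2: 0.8, 3: 0.7, 4: 0.25}

p1 = precision_at_k(ranked, relevant, 1)
p5 = precision_at_k(ranked, relevant, 5)
psp5 = psp_at_k(ranked, relevant, propensity, 5)
```

Label 4 has a low propensity (a tail label), so retrieving it contributes four times as much to PSP@5 as a hit on a label with propensity 1 would.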

Hyperparameters: DECAF has two tuneable hyperparameters a) beam-width which determines the shortlist length and b) token embedding dimension . was chosen after concluding Module II training by setting a value that ensured a recall of on the training set (note that choosing trivially ensures recall). Doing so did not require DECAF to re-train Module II yet ensured a high quality shortlisting. Token embedding dimension was kept at 512 for larger datasets to improve the network capacity for large output spaces. For the small dataset LF-AmazonTitles-131K, clusters size was kept at and for other datasets it was kept at . All other hyperparameters including learning rate, number of epochs were set to their default values across all datasets. Please refer to Table 8 in the supplementary material for details.

Figure 4. Quantile analysis of gains offered by DECAF in terms of contribution to P@5. The label set was divided into 5 equi-voluminous bins with increasing label frequency. Quantiles increase in mean label frequency from left to right. DECAF consistently outperforms other methods on all bins with the difference in accuracy being more prominent on bins containing data-scarce tail labels (e.g. bin 5).

Results on public datasets: Table 1 compares DECAF with leading XML algorithms on short-text product-to-product and related-webpage tasks. For details as well as results on long-text versions of these datasets, please refer to Table 9 in the supplementary material. Furthermore, although DECAF focuses on product-to-product applications, results on product-to-category style datasets such as product-to-category prediction on Amazon or article-to-category prediction on Wikipedia are reported in Table 10 in the supplementary material. Parabel (Prabhu et al., 2018a), Bonsai (Khandagale et al., 2020), AttentionXML (You et al., 2019) and X-Transformer (Chang et al., 2020) are the most relevant methods to DECAF as they shortlist labels based on a tree learned in the label centroid space. DECAF was found to be more accurate than methods such as Slice (Jain et al., 2019), PfastreXML (Jain et al., 2017), DiSMEC (Babbar and Schölkopf, 2017), and AnnexML (Tagami, 2017) that use fixed or pre-learnt features. This demonstrates that learning tasks-specific features can lead to significantly more accurate predictions. DECAF was also compared with other leading deep learning based approaches like MACH (Medini et al., 2019), and XT (Wydmuch et al., 2018). DECAF could be up to more accurate while being more than 150 faster at prediction as compared to attention based models like X-Transformer and AttentionXML. DECAF was also compared to Siamese networks that had similar access to label metadata as DECAF. However, DECAF could be up to more accurate than a Siamese network at an extreme scale. DECAF was also compared to Astec (Dahiya et al., 2021) that was specifically designed for short-text applications but does not utilize label metadata. DECAF could be up to 3% more accurate than Astec. This further supports DECAF’s claim of using label meta-data for improving prediction accuracy. 
Even on long-text tasks such as the LF-WikiSeeAlso-320K dataset (please refer to Table 9 in the supplementary material), DECAF can be more accurate in propensity-scored metrics than the second best method, AttentionXML, in addition to being vastly superior in terms of prediction time. This indicates the suitability of DECAF’s frugal architecture to product-to-product scenarios. The frugal architecture also allows DECAF to make predictions on a CPU within a few milliseconds even for large datasets such as LF-AmazonTitles-1.3M, while other deep extreme classifiers can take an order of magnitude longer even on a GPU. DECAF’s prediction times on a CPU are reported within parentheses in Table 1.

Results on proprietary datasets: Table 2 presents results on proprietary product-to-product recommendation tasks (with details presented in Table 11 in the supplementary material). DECAF could easily scale to the LF-P2PTitles-2M dataset and be up to 2% more accurate than leading XML algorithms including Bonsai, Slice and Parabel. Unfortunately, leading deep learning algorithms such as X-Transformer could not scale to this dataset within the timeout. DECAF offers label coverage similar to state-of-the-art XML methods yet offers the best accuracy in terms of P@1. Thus, DECAF’s superior predictions do not come at the cost of coverage.

Document | Top 5 predictions by DECAF
Panzer Dragoon Zwei | Panzer Dragoon, Action Replay Plus, Sega Saturn System - Video Game Console, The Legend of Dragoon, Panzer Dragoon Orta
Wagner - Die Walkure | Wagner - Siegfried, Wagner - Gotterdammerung, Wagner - Der Fliegende Holländer (1986), Wagner - Gotterdammerung, Seligpreisung
New Zealand dollar | Economy of New Zealand, Cook Islands dollar, Politics of New Zealand, Pitcairn Islands dollar, Australian dollar
Table 3. DECAF’s predictions on selected test points. Document and label names ending in “…” were abbreviated due to lack of space. Please refer to Table 12 in the supplementary material for the complete table. Predictions in black and a non-bold/non-italic font were a part of the ground truth. Those in bold italics were part of the ground truth but never seen with the other ground truth labels in the training set, i.e. had no common training points. Predictions in light gray were not a part of the ground truth. DECAF exploits label metadata to discover semantically correlated labels.

Analysis: Table 3 shows specific examples of DECAF predictions. DECAF encourages collaborative learning among labels, which allows it to predict the labels “Australian dollar” and “Economy of New Zealand” for the document “New Zealand dollar” when other methods failed to do so. This example was taken from the LF-WikiSeeAlsoTitles-320K dataset (please refer to Table 12 in the supplementary material for details). It is notable that these labels do not share any common training instances with other ground truth labels but are semantically related nevertheless. DECAF similarly predicted the rare label “Panzer Dragoon Orta” for the (video game) product “Panzer Dragoon Zwei” whereas other algorithms failed to do so. To better understand the nature of DECAF’s gains, the label set was divided into five uniform bins (quantiles) based on frequency of occurrence in the training set. DECAF’s collaborative approach of using label text in classifier learning led to gains in every quantile, with the gains being more prominent on the data-scarce tail labels, as demonstrated in Figure 4.
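The binning used in this quantile analysis can be sketched as follows. This is a minimal illustration, assuming that "equi-voluminous" means every bin holds roughly the same total number of label occurrences in the training set; the function name `equi_voluminous_bins` and the toy frequencies are hypothetical and not taken from the DECAF code.

```python
import numpy as np

def equi_voluminous_bins(label_freq, n_bins=5):
    """Assign each label to one of n_bins bins of roughly equal total frequency.

    Labels are sorted by increasing training frequency, so bin 0 holds the
    rarest labels and bin n_bins - 1 the most frequent ones.
    """
    label_freq = np.asarray(label_freq, dtype=np.float64)
    order = np.argsort(label_freq, kind="stable")   # rare labels first
    cum = np.cumsum(label_freq[order])              # running volume
    total = cum[-1]
    # A label goes to bin b if the volume accumulated *before* it lies in
    # [b * total / n_bins, (b + 1) * total / n_bins).
    start = cum - label_freq[order]
    bin_of_sorted = np.minimum((start * n_bins / total).astype(int), n_bins - 1)
    bins = np.empty(len(label_freq), dtype=int)
    bins[order] = bin_of_sorted
    return bins

# Toy example: 16 labels whose frequencies sum to 40, i.e. 8 per bin.
freq = [1] * 8 + [2] * 4 + [4] * 2 + [8, 8]
bins = equi_voluminous_bins(freq, n_bins=5)
volume_per_bin = np.bincount(bins, weights=freq)
```

With heavy-tailed label frequencies, the first bin contains many rare labels while the last contains only a handful of frequent ones, which is why gains on tail labels show up as gains on a whole bin.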

Incorporating metadata into baseline XML algorithms: In principle, DECAF’s formulation could be deployed with existing XML algorithms wherever collaborative learning is feasible. Table 4 shows that introducing label text embeddings to the DiSMEC, Parabel, and Bonsai classifiers led to gains over their vanilla counterparts that do not use label text. Details of these augmentations are given in Appendix A.5 in the supplementary material. Thus, label text inclusion can lead to gains for existing methods as well. However, DECAF continues to be more accurate than even these augmented versions. This shows that DECAF is more efficient at utilizing available label text.

Method PSP@1 PSP@5 P@1 P@5
LF-AmazonTitles-131K
DECAF 30.85 41.42 38.4 18.65
Parabel 23.27 32.14 32.6 15.61
Parabel + metadata 25.89 34.83 33.6 15.84
Bonsai 24.75 34.86 34.11 16.63
Bonsai + metadata 26.82 36.63 34.83 16.67
DiSMEC 26.25 37.15 35.14 17.24
DiSMEC + metadata 27.19 38.17 35.52 17.52
LF-WikiSeeAlsoTitles-320K
DECAF 16.73 21.01 25.14 12.86
Parabel 9.24 11.8 17.68 8.59
Parabel + metadata 12.96 16.77 20.69 10.24
Bonsai 10.69 13.79 19.31 9.55
Bonsai + metadata 13.63 17.54 21.61 10.72
DiSMEC 10.56 14.82 19.12 9.87
DiSMEC + metadata 12.46 15.9 20.74 10.29
Table 4. Augmenting existing BoW-based XML methods by incorporating label metadata leads to an increase in accuracy as compared to the base method. However, DECAF remains more accurate than even these augmented methods.
Method PSP@1 PSP@5 P@1 P@5 R@20
LF-AmazonTitles-131K
DECAF 30.85 41.42 38.4 18.65 55.86
+ HNSW Shortlist 29.55 39.17 36.7 17.78 48.82
+ Parabel Shortlist 24.88 31.21 32.13 14.73 39.36
LF-WikiSeeAlsoTitles-320K
DECAF 16.73 21.01 25.14 12.86 37.53
+ HNSW Shortlist 15.68 19.38 23.84 12.11 30.26
+ Parabel Shortlist 13.17 15.09 21.18 10.05 23.91
Table 5. Using the strategies of existing XML algorithms for shortlisting labels, instead of DECAF’s own shortlister, hurts both shortlist recall (R@20) and final prediction accuracy (P@k, PSP@k).
Component PSP@1 PSP@5 P@1 P@5 R@20
LF-AmazonTitles-131K
DECAF 30.85 41.42 38.4 18.65 55.86
DECAF-FFT 25.5 33.38 32.42 15.43 47.23
DECAF-8K 29.07 38.7 36.29 17.52 51.65
DECAF-no-init 29.86 41.04 37.79 18.57 55.75
DECAF- 28.02 38.38 33.5 17.09 53.83
DECAF- 27.32 38.05 36 17.65 52.2
DECAF-lite 29.75 40.36 37.26 18.29 55.25
LF-WikiSeeAlsoTitles-320K
DECAF 16.73 21.01 25.14 12.86 37.53
DECAF-FFT 13.91 17.3 21.72 11 32.58
DECAF-8K 14.55 17.38 22.41 10.96 30.21
DECAF-no-init 15.09 19.47 23.81 12.25 36.18
DECAF- 18.04 21.48 24.54 12.55 37.33
DECAF- 11.55 15.24 20.82 10.53 29.72
DECAF-lite 16.59 20.84 24.87 12.78 37.24
Table 6. Analyzing the impact of alternative design and algorithmic choices for DECAF’s components.
Figure 5. Impact of the number of instances in DECAF’s ensemble on performance on the LF-AmazonTitles-131K dataset. DECAF offers maximum benefits using a small ensemble of 3 instances after which benefits taper off.
Figure 6. A comparison of recall when using moderate or large fanout on the LF-WikiSeeAlso-320K dataset. The x-axis represents various values of beam-width and the training recall offered by each. A large fanout offers superior recall with small beam width, and hence small shortlist lengths.

Shortlister: DECAF’s shortlister distinguishes itself from previous shortlisting strategies (Chang et al., 2020; Khandagale et al., 2020; You et al., 2019; Prabhu et al., 2018b) in two critical ways. Firstly, DECAF uses a massive fanout of K clusters whereas existing approaches either use far fewer (up to 8K) clusters (Chang et al., 2020; Bhatia et al., 2015) or use hierarchical clustering with a small fanout (up to 100) at each node (Khandagale et al., 2020; You et al., 2019). Secondly, in contrast to other methods that create shortlists from generic embeddings (e.g. bag-of-words or FastText (Joulin et al., 2017)), DECAF fine-tunes its shortlister in Module II using task-specific embeddings learnt in Module I. Tables 5 and 6 show that DECAF’s shortlister offers much better performance than shortlists computed using a small fanout or else computed using ANNS-based negative sampling (Jain et al., 2019). Fig. 6 shows that a large fanout offers much better recall, even with small shortlist lengths, than moderate fanouts.
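The dependence of shortlist recall on beam width can be illustrated with a small sketch. It assumes that every label belongs to exactly one cluster and that the shortlister produces one score per (document, cluster) pair; the names `shortlist_recall`, `cluster_scores`, and `label_to_cluster` are hypothetical and the toy numbers are made up for illustration.

```python
import numpy as np

def shortlist_recall(cluster_scores, label_to_cluster, positives, beam_width):
    """Fraction of positive labels whose cluster is among the top `beam_width`.

    cluster_scores:   (n_docs, n_clusters) array of shortlister scores.
    label_to_cluster: (n_labels,) array mapping each label to its cluster.
    positives:        list of lists; positives[i] are the relevant labels of doc i.
    """
    # Indices of the `beam_width` highest-scoring clusters per document.
    top = np.argsort(-cluster_scores, axis=1)[:, :beam_width]
    hit, total = 0, 0
    for i, labels in enumerate(positives):
        chosen = set(top[i].tolist())
        for l in labels:
            total += 1
            hit += int(int(label_to_cluster[l]) in chosen)
    return hit / total

# Toy data: 2 documents, 3 clusters, 4 labels.
scores = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.3]])
label_to_cluster = np.array([0, 0, 1, 2])
positives = [[0, 3], [2, 1]]
recalls = [shortlist_recall(scores, label_to_cluster, positives, b) for b in (1, 2, 3)]
```

Recall can only grow with beam width, but so does the number of labels handed to the ranker, which is the trade-off that Fig. 6 examines.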

Ablation: As described in Section 3, the training pipeline for DECAF is divided into 4 modules mirroring the DeepXML pipeline (Dahiya et al., 2021). Table 6 presents the results of extensive experiments conducted to analyze the optimality of algorithmic and design choices made in these modules. We refer to Appendix A.5 in the supplementary material for details.
a) To assess the utility of learning task-specific token embeddings in Module I, a variant DECAF-FFT was devised that replaced these with pre-trained FastText embeddings: DECAF outperforms DECAF-FFT by 6% in PSP@1 and 3.5% in P@1.
b) To assess the impact of a large fanout while learning the shortlister, a variant DECAF-8K was trained with a smaller fanout of 8K clusters, as used by methods such as AttentionXML and X-Transformer. Restricting fanout was found to hurt accuracy by 3%. This can be attributed to the fact that the classifier’s final accuracy depends on the recall of the shortlister (see Theorem 3.1). Fig. 6 indicates that using the smaller fanout results in significantly larger shortlist lengths being required to achieve the same recall as compared to using the large fanout. Large shortlists make Module IV training and prediction more challenging, especially for large datasets involving millions of labels, thereby making a large fanout more beneficial.
c) Approaches other than DECAF’s shortlister were considered for shortlisting labels, such as nearest neighbor search using HNSW (Jain et al., 2019) or PLTs with small fanout such as Parabel (Prabhu et al., 2018b) learnt over dense document embeddings. Table 5 shows that both alternatives lead to significant loss, up to 15% in recall, as compared to DECAF’s own shortlister. These sub-optimal shortlists eventually hurt final prediction, which could be 2% less accurate as compared to DECAF.
d) To assess the importance of label classifier initialization in Module III, a variant DECAF-no-init was tested which used random initialization instead. DECAF-no-init was found to offer 1-1.5% less PSP@1 than DECAF, indicating the importance of proper initialization in Module III.
e) Modules II and IV learn OvA classifiers as a combination of the label embedding vector and a refinement vector. To investigate the need for both components, Table 6 considers two DECAF variants (the rows labelled DECAF-): the first variant discards the refinement vector in both modules, whereas the second variant rejects the label embedding component altogether and learns the OvA classifiers from scratch using only the refinement vector. Both variants take a hit of up to 5% in prediction accuracy as compared to DECAF. Incorporating label text in the classifier is critical to achieving superior accuracies.
f) Finally, to assess the utility of fine-tuning token embeddings in each successive module, a frugal version DECAF-lite was considered which freezes token embeddings after Module I and shares token embeddings among the three instances in its ensemble. DECAF-lite incurs a 0.5-1% loss in performance as compared to DECAF but is noticeably faster at training.
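The two classifier variants in item (e) can be made concrete with a short sketch. It assumes, purely for illustration, that the label embedding and the refinement vector are merged by a per-coordinate gated sum; the gate vectors `alpha` and `beta`, the function name, and the variant names are hypothetical, since the combination block is defined in Section 3 and is not reproduced here.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def label_classifier(label_emb, refinement, alpha, beta, variant="full"):
    """Build a one-vs-all classifier vector for a single label.

    label_emb:  embedding of the label text (built from shared token embeddings).
    refinement: free per-label vector learnt directly from data.
    variant:    "full", "embedding_only" or "refinement_only".
    """
    if variant == "embedding_only":     # drops the refinement vector
        return label_emb
    if variant == "refinement_only":    # ignores the label text entirely
        return refinement
    # Assumed gated combination of the two components.
    return sigmoid(alpha) * label_emb + sigmoid(beta) * refinement

emb = np.array([1.0, 2.0])
ref = np.array([3.0, -2.0])
gates = np.zeros(2)                     # sigmoid(0) = 0.5 for every coordinate
w_full = label_classifier(emb, ref, gates, gates)
w_emb = label_classifier(emb, ref, gates, gates, variant="embedding_only")
```

The embedding-only variant ties every classifier to the shared token embeddings, while the refinement-only variant behaves like a conventional OvA classifier that never sees label text.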

5. Conclusion

This paper demonstrated the impact of incorporating label metadata in the form of label text in offering significant performance gains on several product-to-product recommendation tasks. It proposed the DECAF algorithm that uses a frugal architecture, as well as a scalable prediction pipeline, to offer predictions that are up to 2-6% more accurate, as well as an order of magnitude faster, as compared to leading deep learning-based XML algorithms. DECAF offers millisecond-level prediction times on a CPU making it suitable for real-time applications such as product-to-product recommendation tasks. Future directions of work include incorporating other forms of label metadata such as label-correlation graphs, as well as diverse embedding architectures.

The authors thank the IIT Delhi HPC facility for computational resources. AM is supported by a Google PhD Fellowship.


References

  • R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma (2013) Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In WWW.
  • R. Babbar and B. Schölkopf (2017) DiSMEC: Distributed Sparse Machines for Extreme Multi-label Classification. In WSDM.
  • R. Babbar and B. Schölkopf (2019) Data scarcity, robustness and extreme multi-label classification. Machine Learning 108, pp. 1329–1351.
  • K. Bhatia, K. Dahiya, H. Jain, A. Mittal, Y. Prabhu, and M. Varma (2016) The extreme classification repository: Multi-label datasets and code.
  • K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain (2015) Sparse Local Embeddings for Extreme Multi-label Classification. In NIPS.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching Word Vectors with Subword Information. Transactions of the Association for Computational Linguistics.
  • W-C. Chang, H.-F. Yu, K. Zhong, Y. Yang, and I. Dhillon (2020) Taming Pretrained Transformers for Extreme Multi-label Text Classification. In KDD.
  • K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal, and M. Varma (2021) DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents. In WSDM.
  • J. Devlin, M. W. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • X. Glorot and X. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
  • C. Guo, A. Mousavi, X. Wu, D. N. Holtmann-Rice, S. Kale, S. Reddi, and S. Kumar (2019) Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces. In NeurIPS.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
  • H. Jain, V. Balasubramanian, B. Chunduri, and M. Varma (2019) Slice: Scalable Linear Extreme Classifiers trained on 100 Million Labels for Related Searches. In WSDM.
  • H. Jain, Y. Prabhu, and M. Varma (2016) Extreme Multi-label Loss Functions for Recommendation, Tagging, Ranking and Other Missing Label Applications. In KDD.
  • V. Jain, N. Modhe, and P. Rai (2017) Scalable Generative Models for Multi-label Learning with Missing Labels. In ICML.
  • K. Jasinska, K. Dembczynski, R. Busa-Fekete, K. Pfannschmidt, T. Klerx, and E. Hullermeier (2016) Extreme F-measure Maximization using Sparse Probability Estimates. In ICML.
  • A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of Tricks for Efficient Text Classification. In Proceedings of the European Chapter of the Association for Computational Linguistics.
  • B. Kanagal, A. Ahmed, S. Pandey, V. Josifovski, J. Yuan, and L. Garcia-Pueyo (2012) Supercharging Recommender Systems Using Taxonomies for Learning User Purchase Behavior. VLDB.
  • S. Khandagale, H. Xiao, and R. Babbar (2020) Bonsai: diverse and shallow trees for extreme multi-label classification. Machine Learning 109 (11), pp. 2099–2119.
  • P. D. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. CoRR.
  • J. Liu, W. Chang, Y. Wu, and Y. Yang (2017) Deep Learning for Extreme Multi-label Text Classification. In SIGIR.
  • T. K. R. Medini, Q. Huang, Y. Wang, V. Mohan, and A. Shrivastava (2019) Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products. In NeurIPS.
  • A. K. Menon, K.P. Chitrapura, S. Garg, D. Agarwal, and N. Kota (2011) Response Prediction Using Collaborative Filtering with Hierarchies and Side-Information. In KDD.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed Representations of Words and Phrases and Their Compositionality. In NIPS.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral Normalization for Generative Adversarial Networks. In ICLR.
  • A. Niculescu-Mizil and E. Abbasnejad (2017) Label Filters for Large Scale Multilabel Classification. In AISTATS.
  • Y. Prabhu, A. Kag, S. Gopinath, K. Dahiya, S. Harsola, R. Agrawal, and M. Varma (2018a) Extreme multi-label learning with label features for warm-start tagging, ranking and recommendation. In WSDM.
  • Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma (2018b) Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In WWW.
  • Y. Prabhu and M. Varma (2014) FastXML: A Fast, Accurate and Stable Tree-classifier for eXtreme Multi-label Learning. In KDD.
  • N. Sachdeva, K. Gupta, and V. Pudi (2018) Attentive Neural Architecture Incorporating Song Features for Music Recommendation. In RecSys.
  • W. Siblini, P. Kuntz, and F. Meyer (2018) CRAFTML, an Efficient Clustering-based Random Forest for Extreme Multi-label Learning. In ICML.
  • Y. Tagami (2017) AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification. In KDD.
  • L. Wu, A. Fisch, S. Chopra, K. Adams, A. Bordes, and J. Weston (2017) StarSpace: Embed All The Things!. CoRR.
  • M. Wydmuch, K. Jasinska, M. Kuznetsov, R. Busa-Fekete, and K. Dembczynski (2018) A no-regret generalization of hierarchical softmax to extreme multi-label classification. In NIPS.
  • E.H. I. Yen, X. Huang, W. Dai, I. Ravikumar, and E. Xing (2017) PPDSparse: A Parallel Primal-Dual Sparse Method for Extreme Classification. In KDD.
  • I. Yen, S. Kale, F. Yu, D. Holtmann R., S. Kumar, and P. Ravikumar (2018) Loss Decomposition for Fast Learning in Large Output Spaces. In ICML.
  • R. You, Z. Zhang, Z. Wang, S. Dai, H. Mamitsuka, and S. Zhu (2019) AttentionXML: Label tree-based attention-aware deep model for high-performance extreme multi-label text classification. In NeurIPS.

Appendix A Appendix

In this supplementary material, we present various details omitted from the main text due to lack of space, including a proof of Thm 3.1, a detailed analysis of the time complexity of the various modules in the training and prediction pipelines of DECAF, details of the datasets and evaluation metrics used in the experiments, further clarifications about how some ablation experiments were carried out, as well as additional experimental results including a subjective comparison of the prediction quality of DECAF and various competitors on handpicked recommendation examples.

A.1. Proof of Theorem 3.1

We recall from the main text that denotes the original likelihood expression and denotes the approximate likelihood expression that incorporates the shortlister . Both expression are reproduced below for sake of clarity.

Theorem A.1 (Theorem 3.1 Restated).

Suppose the training data has label sparsity at rate i.e. and the shortlister offers a recall rate of on the training set i.e. . Then if is obtained by optimizing the approximate likelihood function , then the following always holds

Below we prove the above claimed result. For the sake of simplicity, let denote the optimal model that could have been learnt using the original likelihood expression. As discussed in Sec 3, OvA methods with linear classifiers assume a likelihood decomposition of the form where is the document embedding obtained using token embeddings and embedding block parameters taken from , and is the label classifier obtained as shown in Fig 3. Thus, for a label-document pair , the model posits a likelihood

However, in the presence of a shortlister, the above model fails to hold since, for any document, a label outside its shortlist is never predicted. This can cause a catastrophic collapse of the model likelihood if even a single positive label fails to be shortlisted by the shortlister, i.e. if the shortlister admits even a single false negative. To address this, and allow DECAF to continue working with shortlisters with high but still imperfect recall, we augment the likelihood model as follows

where a default likelihood value is assigned to positive labels that escape shortlisting. Essentially, for non-shortlisted labels, we posit their probability of being relevant to be this default value, which will be tuned later.

Note that this default likelihood value must be kept small so as to ensure that the default likelihood scores do not interfere with the prediction pipeline, which discards non-shortlisted labels. We will see that our calculations do result in an extremely small optimal value. However, also note that we cannot simply set the default value to zero since that would lead to a catastrophic collapse of the model likelihood to zero if the shortlister has even one false negative. Although our shortlister does offer good recall even with shortlists of small length (e.g. 85%), demanding 100% recall would require exorbitantly large beam sizes that would slow down prediction greatly. Thus, it is imperative that the augmented likelihood model itself account for shortlister failures.
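The augmented likelihood described above can be sketched per document-label pair. This is an illustration under stated assumptions rather than the paper's exact expression: it assumes a logistic likelihood for shortlisted labels, and the names `pair_loss` and `eta` (the default likelihood value) are chosen here for convenience.

```python
import math

def pair_loss(score, y, shortlisted, eta):
    """Negative log-likelihood of one document-label pair.

    score:       classifier score for the pair (used only if shortlisted).
    y:           +1 if the label is relevant, -1 otherwise.
    shortlisted: whether the shortlister retained the label.
    eta:         assumed default probability of relevance for non-shortlisted labels.
    """
    if shortlisted:
        # Logistic loss for labels that the ranker actually scores.
        return math.log1p(math.exp(-y * score))
    # Default likelihood: eta for positives, 1 - eta for negatives.
    return -math.log(eta) if y == 1 else -math.log(1.0 - eta)

missed_positive = pair_loss(0.0, +1, shortlisted=False, eta=0.01)
skipped_negative = pair_loss(0.0, -1, shortlisted=False, eta=0.01)
```

A missed positive costs a finite amount as long as `eta` is positive, whereas `eta = 0` would make `math.log` fail, mirroring the collapse of the likelihood discussed above. Skipped negatives cost almost nothing when `eta` is small.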

To incorporate the above augmentation, we also redefine our log-likelihood score function to handle document-label pairs whose label is not shortlisted, as follows

Note the negative sign in the second case since is the negative log-likelihood expression. We will also benefit from defining the following residual loss term

Note that simply sums up loss terms corresponding to all labels omitted by the shortlister. We will establish the result claimed in the theorem by comparing the performance offered by and on the loss terms given by and . Note that for any we always have the following decomposition

Now, since optimizes , we have which settles the first term in the above decomposition. To settle the second term, we note that as per the recall and label sparsity terms defined in the statement of the theorem, the number of positive labels not shortlisted by the shortlister throughout the dataset is where is the false negative rate of the shortlister. Similarly, the number of negative labels not shortlisted by the shortlister throughout the dataset by can be seen to be where is the true negative rate of the shortlister. This gives us

It is easy to see that the optimal value of for the above expression is . For example, in the LF-WikiSeeAlsoTitles-320K dataset, which has , this gives a value of which gives . This confirms that the augmentation indeed does not interfere with the prediction pipeline and labels not shortlisted can be safely ignored. However, moving on and plugging this optimal value of into the expression tells us that

Since (for example, we saw above), we simplify this to and use the inequality for all to conclude that . Using settles the second term in the decomposition by establishing that . Combining the two terms in the decomposition above gives us

which finishes the proof of the theorem.
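The choice of the default likelihood value used in the proof can be spelled out as a short calculation. The notation below is introduced only for this sketch and is an assumption: $a$ denotes the number of positive document-label pairs missed by the shortlister, $b$ the number of negative pairs left outside the shortlists, and $\eta$ the default likelihood value. Any normalization of the loss is omitted since it does not change the minimizer.

```latex
\begin{align*}
R(\eta) &= -a \ln \eta - b \ln (1 - \eta), \\
R'(\eta) &= -\frac{a}{\eta} + \frac{b}{1 - \eta} = 0
  \;\Longrightarrow\; \eta^\ast = \frac{a}{a + b}, \\
R(\eta^\ast) &= a \ln\Bigl(1 + \frac{b}{a}\Bigr) + b \ln\Bigl(1 + \frac{a}{b}\Bigr)
  \le a \ln\Bigl(1 + \frac{b}{a}\Bigr) + a .
\end{align*}
```

The last step uses $\ln(1 + x) \le x$ for $x \ge 0$. Since positives are sparse, $a$ is much smaller than $b$, so $\eta^\ast$ is tiny and the residual loss grows only like $a \ln(b/a)$, i.e. it is controlled by the number of positives the shortlister misses.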

We conclude this discussion by noting that since both likelihood objectives are non-convex due to the non-linear architecture encoded by the model parameters, it may not be possible to optimize them exactly in practice. Thus, in practice, all we may be able to ensure is that

where is the sub-optimality in optimizing the objective due to factors such as sub-optimal initialization, training, premature termination, etc. It is easy to see that the main result of the theorem continues to hold since we now have which gives us the modified result as follows

A.2. Time Complexity Analysis for DECAF

In this section, we discuss the time complexity of the various modules in DECAF, as well as derive the prediction and training complexities.

Notation: Recall from Section 3 that DECAF learns -dimensional representations for all tokens (), that are used to create embeddings for all labels , and all training documents . We introduce some additional notation to facilitate the discussion: we use to denote the average number of unique tokens present in a document i.e. where is the sparsity “norm” that gives the number of non-zero elements in a vector. We similarly use to denote the average number of tokens in a label text. Let denote the average number of labels per document and also let denote the average number of documents per label. We also let denote the mini-batch size (DECAF used for all datasets – see Table 8).

Embedding Block: Given a text with tokens, the embedding block requires operations to aggregate token embeddings and operations to execute the residual block and the combination block, for a total of operations. Thus, to encode a label (respectively document) text, it takes (respectively ) operations on average.
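The operation counts for the embedding block can be read off a small sketch. It assumes, for illustration only, that aggregation is a ReLU applied to a weighted sum of token embeddings, that the residual block is a single dense layer, and that the combination step is a gated sum. The names `embed_text`, `E`, `R`, `alpha`, and `beta`, as well as the symbols s (tokens in the text), V (vocabulary size), and D (embedding dimension), are chosen here and may differ from the paper's notation.

```python
import numpy as np

def embed_text(token_ids, token_weights, E, R, alpha, beta):
    """Embed a sparse bag-of-tokens text into a D-dimensional vector.

    token_ids, token_weights: the s non-zero tokens of the text and their weights.
    E: (V, D) token embedding matrix; R: (D, D) residual-block matrix.
    alpha, beta: (D,) gate parameters of the combination step (assumed form).
    """
    token_weights = np.asarray(token_weights, dtype=np.float64)
    # Aggregation: s rows of length D are combined, i.e. O(s * D) operations.
    agg = np.maximum(token_weights @ E[token_ids], 0.0)
    # Residual block: one dense (D x D) product, i.e. O(D^2) operations.
    res = np.maximum(R @ agg, 0.0)
    # Combination: element-wise gates, O(D) operations.
    gate_a = 1.0 / (1.0 + np.exp(-alpha))
    gate_b = 1.0 / (1.0 + np.exp(-beta))
    return gate_a * agg + gate_b * res

E = np.arange(12, dtype=np.float64).reshape(4, 3)   # V = 4 tokens, D = 3
R = np.zeros((3, 3))                                # residual path switched off
vec = embed_text([0, 2], [1.0, 0.5], E, R, np.zeros(3), np.zeros(3))
```

Only the rows of `E` for tokens present in the text are touched, which is why short texts such as product titles are cheap to embed.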

Prediction: Given a test document, assuming that it contains tokens, embedding takes operations and executing the shortlister by identifying the top clusters takes operations. These clusters contain a total of labels. The ranker takes operations to execute the OvA linear models corresponding to these shortlisted labels to obtain the top-ranked predictions. Thus, prediction takes time since usually and .
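The two-stage prediction described above can be sketched as follows. This is a simplified illustration, assuming the document embedding is already computed and that final ranking uses only the label classifier scores; the function `predict` and its argument names are hypothetical, with K clusters, L labels, D dimensions, and a beam of B clusters.

```python
import numpy as np

def predict(x_emb, cluster_clf, cluster_labels, label_clf, beam_width, k):
    """Two-stage prediction: shortlist clusters, then rank their labels.

    x_emb:          (D,) document embedding.
    cluster_clf:    (K, D) one-vs-all classifiers, one per cluster (meta-label).
    cluster_labels: list of K integer arrays; the labels contained in each cluster.
    label_clf:      (L, D) one-vs-all classifiers, one per label.
    """
    # Stage 1: score all K clusters, O(K * D), and keep the best `beam_width`.
    cluster_scores = cluster_clf @ x_emb
    top_clusters = np.argsort(-cluster_scores)[:beam_width]
    # Stage 2: score only labels in the chosen clusters (about B * L / K of
    # them when clusters are balanced) and return the k best.
    shortlist = np.concatenate([cluster_labels[c] for c in top_clusters])
    label_scores = label_clf[shortlist] @ x_emb
    best = np.argsort(-label_scores)[:k]
    return shortlist[best], label_scores[best]

x = np.array([1.0, 0.0])
cluster_clf = np.array([[1.0, 0.0], [-1.0, 0.0]])
cluster_labels = [np.array([0, 1]), np.array([2, 3])]
label_clf = np.array([[0.2, 0.0], [0.9, 0.0], [5.0, 0.0], [0.1, 0.0]])
labels, scores = predict(x, cluster_clf, cluster_labels, label_clf, beam_width=1, k=2)
```

In the toy example, label 2 has the highest classifier score but sits in a cluster that the shortlister rejects, so it is never returned with a beam of one cluster. This is the failure mode that makes shortlister recall matter for final accuracy.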

Module I Training: Creation of all label centroids takes time. These centroids are -sparse on average. Clustering these labels using hierarchical balanced binary clustering for levels to get balanced clusters takes time . Computing meta label text representations for all meta labels takes time. The vectors are -sparse on average. To compute the complexity of learning the OvA meta-classifiers, we calculate below the cost of a single back-propagation step when using a mini-batch of size . Computing the document and meta-label features of all documents in the mini-batch and meta-labels takes on average and time respectively. Computing the scores for all the OvA meta classifiers for all documents in the mini-batch takes time. Overestimating that the meta label texts together cover all tokens, updating the residual layer parameters , the combination block parameters, and the token embeddings using back-propagation takes at most time.
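The hierarchical balanced binary clustering step can be sketched as a recursive balanced 2-means procedure. This is a simplified variant written for illustration: the source only states that balanced binary clustering over label centroids is used, so the seeding, the fixed iteration count, and the function names `balanced_split` and `balanced_clusters` are assumptions.

```python
import numpy as np

def balanced_split(X, idx, rng, n_iter=10):
    """Split the rows X[idx] into two equal-sized halves (balanced 2-means)."""
    c = X[rng.choice(idx, size=2, replace=False)]    # two random seed points
    half = len(idx) // 2
    for _ in range(n_iter):
        # Positive values prefer centroid 0, negative values centroid 1.
        pref = X[idx] @ (c[0] - c[1])
        order = np.argsort(-pref, kind="stable")
        left, right = idx[order[:half]], idx[order[half:]]
        c = np.stack([X[left].mean(axis=0), X[right].mean(axis=0)])
    return left, right

def balanced_clusters(X, depth, seed=0):
    """Recursively split for `depth` levels, giving 2**depth balanced clusters."""
    rng = np.random.default_rng(seed)
    clusters = [np.arange(X.shape[0])]
    for _ in range(depth):
        clusters = [part for idx in clusters
                    for part in balanced_split(X, idx, rng)]
    return clusters

# Toy example: 8 label centroids in 3 dimensions, split into 4 clusters of 2.
centroids = np.random.default_rng(1).normal(size=(8, 3))
clusters = balanced_clusters(centroids, depth=2)
```

Forcing every split to be even keeps all clusters the same size, so the shortlist length is predictable for any beam width.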

Module II Training: Recreating all label centroids now takes time. Clustering the labels takes time . Computing document features in a mini-batch of size takes time as before. Computing the meta-label representations for all meta-labels now takes time. Computing the scores for all the OvA meta classifiers for all documents in the mini-batch takes time as before. Next, updating the model parameters as well as the refinement vectors takes at most time as before. The added task of updating does not affect the asymptotic complexity of this module. Generating the shortlists for all training points is essentially a prediction step and takes time.

Module III Initializations: Model parameter initializations take time. Initializing the refinement vectors takes time.

Module IV Training: Given the shortlist of labels per training point generated in Module II, training the OvA classifiers by fine-tuning the model parameters and learning the refinement vectors is made much less expensive than . Computing document features in a mini-batch of size takes time as before. However, label representations of only shortlisted labels need be computed. Since there are at most of them (accounting for hard negatives and all positives), this takes time. Next, updating the model parameters as well as the refinement vectors for shortlisted labels takes at most time. This can be simplified to