Extreme Classification via Adversarial Softmax Approximation

02/15/2020 ∙ by Robert Bamler, et al. ∙ University of California, Irvine

Training a classifier over a large number of classes, known as 'extreme classification', has become a topic of major interest with applications in technology, science, and e-commerce. Traditional softmax regression induces a gradient cost proportional to the number of classes C, which often is prohibitively expensive. A popular scalable softmax approximation relies on uniform negative sampling, which suffers from slow convergence due to a poor signal-to-noise ratio. In this paper, we propose a simple training method for drastically enhancing the gradient signal by drawing negative samples from an adversarial model that mimics the data distribution. Our contributions are three-fold: (i) an adversarial sampling mechanism that produces negative samples at a cost only logarithmic in C, thus still resulting in cheap gradient updates; (ii) a mathematical proof that this adversarial sampling minimizes the gradient variance while any bias due to non-uniform sampling can be removed; (iii) experimental results on large-scale data sets that show a reduction of the training time by an order of magnitude relative to several competitive baselines.




1 Introduction

In many problems in science, healthcare, or e-commerce, one is interested in training classifiers over an enormous number of classes: a problem known as ‘extreme classification’ (Agrawal et al., 2013; Jain et al., 2016; Prabhu & Varma, 2014; Siblini et al., 2018). For softmax (aka multinomial) regression, each gradient step incurs a cost proportional to the number of classes $C$. As this may be prohibitively expensive for large $C$, recent research has explored more scalable softmax approximations which circumvent the linear scaling in $C$. Progress in accelerating the training procedure and thereby scaling up extreme classification promises to dramatically improve, e.g., advertising (Prabhu et al., 2018), recommender systems, ranking algorithms (Bhatia et al., 2015; Jain et al., 2016), and medical diagnostics (Bengio et al., 2019; Lippert et al., 2017; Baumel et al., 2018).

While scalable softmax approximations have been proposed, each one has its drawbacks. The most popular approach due to its simplicity is ‘negative sampling’ (Mnih & Hinton, 2009; Mikolov et al., 2013), which turns the problem into a binary classification between so-called ‘positive samples’ from the data set and ‘negative samples’ that are drawn at random from some (usually uniform) distribution over the class labels. While negative sampling makes the updates cheaper since computing the gradient no longer scales with $C$, it induces additional gradient noise that leads to a poor signal-to-noise ratio of the stochastic gradient estimate. Improving the signal-to-noise ratio in negative sampling while still enabling cheap gradients would dramatically enhance the speed of convergence.

In this paper, we present an algorithm that inherits the cheap gradient updates from negative sampling while still preserving much of the gradient signal of the original softmax regression problem. Our approach rests on the insight that the signal-to-noise ratio in negative sampling is poor since there is no association between input features and their artificial labels. If negative samples were harder to discriminate from positive ones, a learning algorithm would obtain a better gradient signal close to the optimum. Here, we make these arguments mathematically rigorous and propose a non-uniform sampling scheme for scalably approximating a softmax classification scheme. Instead of sampling labels uniformly, our algorithm uses an adversarial auxiliary model to draw ‘fake’ labels that are more realistic by taking the input features of the data into account. We prove that such a procedure reduces the gradient noise of the algorithm, and in fact minimizes the gradient variance in the limit where the auxiliary model optimally mimics the data distribution.

A useful adversarial model should require only little overhead to be fitted to the data, and it needs to be able to generate negative samples quickly in order to enable inexpensive gradient updates. We propose a probabilistic version of a decision tree that has these properties. As a side result of our approach, we show how such an auxiliary model can be constructed and efficiently trained. Since it is almost hyperparameter-free, it does not cause extra complications when tuning models.

The final problem that we tackle is to remove the bias that the auxiliary model causes relative to our original softmax classification. Negative sampling is typically described as a softmax approximation; however, only uniform negative sampling correctly approximates the softmax. In this paper, we show that the bias due to non-uniform negative sampling can be easily removed at test time.

The structure of our paper reflects our main contributions as follows:

  1. We present a new scalable softmax approximation (Section 2). We show that non-uniform sampling from an auxiliary model can improve the signal-to-noise ratio. The best performance is achieved when this sampling mechanism is adversarial, i.e., when it generates fake labels that are hard to discriminate from the true ones. To allow for efficient training, such adversarial samples need to be generated at a rate sublinear (e.g., logarithmic) in $C$.

  2. We design a new, simple adversarial auxiliary model that satisfies the above requirements (Section 3). The model is based on a probabilistic version of a decision tree. It can be efficiently pre-trained and included into our approach, and requires only minimal tuning.

  3. We present mathematical proofs that (i) the best signal-to-noise ratio in the gradient is obtained if the auxiliary model best reflects the true dependencies between input features and labels, and that (ii) the involved bias to the softmax approximation can be exactly quantified and cheaply removed at test time (Section 4).

  4. We present experiments on two classification data sets that show that our method outperforms all baselines by at least one order of magnitude in training speed (Section 5).

We discuss related work in Section 6 and summarize our approach in Section 7.

2 An Adversarial Softmax Approximation

We propose an efficient algorithm to train a classifier over a large set of classes, using an asymptotic equivalence between softmax classification and negative sampling (Subsection 2.1). To speed up convergence, we generalize this equivalence to model-based negative sampling in Subsection 2.2.

2.1 Asymptotic Equivalence of Softmax Classification and Negative Sampling

Softmax Classification (Notation).

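We consider a training data set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ data points with $K$-dimensional feature vectors $x_i \in \mathbb{R}^K$. Each data point has a single label $y_i \in \mathcal{Y}$ from a discrete label set $\mathcal{Y}$. A softmax classifier is defined by a set of functions $\{\xi_y\}_{y \in \mathcal{Y}}$ that map a feature vector $x$ and model parameters $\theta$ to a score $\xi_y(x, \theta) \in \mathbb{R}$ for each label $y$. Its loss function is

$$\ell_{\text{softmax}}(\theta) = \sum_{(x,y) \in \mathcal{D}} \Big[ -\xi_y(x, \theta) + \log \sum_{y' \in \mathcal{Y}} e^{\xi_{y'}(x, \theta)} \Big]. \tag{1}$$

While the first term encourages high scores $\xi_y(x, \theta)$ for the correct labels $y$, the second term encourages low scores for all labels $y' \in \mathcal{Y}$, thus preventing degenerate solutions that set all scores to infinity. Unfortunately, the sum over $y'$ makes gradient-based minimization of $\ell_{\text{softmax}}(\theta)$ expensive if the label set $\mathcal{Y}$ is large. Assuming that evaluating a single score $\xi_{y'}(x, \theta)$ takes $O(K)$ time, each gradient step costs $O(KC)$, where $C = |\mathcal{Y}|$ is the size of the label set.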
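To make the cost structure concrete, here is a minimal Python sketch that evaluates the softmax loss of Eq. 1 for scores that are already given as one list per data point. The function name `softmax_loss` and the toy inputs are illustrative assumptions, not part of the original method. The inner sum touches all C scores of every data point, which is the linear scaling discussed above.

```python
import math

def softmax_loss(scores, labels):
    """Eq. 1: sum over data points of -xi_y + log sum_{y'} exp(xi_{y'})."""
    total = 0.0
    for xi, y in zip(scores, labels):
        m = max(xi)  # subtract the maximum for numerical stability
        log_norm = m + math.log(sum(math.exp(s - m) for s in xi))  # O(C) work
        total += log_norm - xi[y]
    return total

C = 1000
scores = [[0.0] * C, [0.0] * C]  # two data points, all scores equal
labels = [3, 7]
loss = softmax_loss(scores, labels)
print(loss)
```

For uniform scores every data point contributes log C to the loss, which gives a quick sanity check.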

Negative Sampling.

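Negative sampling turns classification over a large label set $\mathcal{Y}$ into binary classification between so-called positive and negative samples. One draws positive samples from the training set and constructs negative samples by drawing random labels $y'$ from some noise distribution $p_n$. One then trains a logistic regression by minimizing the stochastic loss function

$$\ell_{\text{neg.sampl.}}(\phi) = \sum_{(x,y) \in \mathcal{D}} \Big[ -\log \sigma\big(\xi_y(x, \phi)\big) - \log \sigma\big(-\xi_{y'}(x, \phi)\big) \Big] \quad \text{where } y' \sim p_n, \tag{2}$$

with the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$. Here, we used the same score functions $\xi_y$ as in Eq. 1 but introduced different model parameters $\phi$ so that we can distinguish the two models. Gradient steps for $\ell_{\text{neg.sampl.}}(\phi)$ cost only $O(K)$ time as there is no sum over all labels $y' \in \mathcal{Y}$.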
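The following Python sketch illustrates why a gradient step on the negative sampling loss is cheap. It assumes a linear score (weights `W[y]` and bias `b[y]` per label) and a uniform noise distribution; the function name `neg_sampling_step`, the learning rate, and the toy sizes are hypothetical choices for illustration only.

```python
import math
import random

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def neg_sampling_step(W, b, x, y, y_neg, lr=0.1):
    """One SGD step on -log sigma(xi_y) - log sigma(-xi_{y_neg}) for linear scores."""
    def score(c):
        return sum(w * xk for w, xk in zip(W[c], x)) + b[c]
    # derivative of the loss w.r.t. each score, computed before any update
    grads = [(y, -sigmoid(-score(y))), (y_neg, sigmoid(score(y_neg)))]
    for c, g in grads:  # only two rows of W are touched: O(K) work
        for k in range(len(x)):
            W[c][k] -= lr * g * x[k]
        b[c] -= lr * g
    return W, b

C, K = 5, 3
W = [[0.0] * K for _ in range(C)]
b = [0.0] * C
y_neg = random.randrange(C)  # uniform noise distribution over the labels
neg_sampling_step(W, b, [1.0, 2.0, 3.0], y=0, y_neg=y_neg)
```

Only the parameters of the positive label and of the sampled negative label change in each step, independent of the number of labels C.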

Asymptotic Equivalence.

The models in Eq. 1 and Eq. 2 are exactly equivalent in the nonparametric limit, i.e., if the function class $x \mapsto \xi_y(x, \theta)$ is flexible enough to map $x$ to any possible score. A further requirement is that $p_n$ in Eq. 2 is the uniform distribution over $\mathcal{Y}$. If both conditions hold, it follows that if $\theta^*$ and $\phi^*$ minimize Eq. 1 and Eq. 2, respectively, they learn identical scores up to an additive constant,

$$\xi_y(x, \theta^*) = \xi_y(x, \phi^*) + \text{const.} \tag{3}$$

As a consequence, one is free to choose the loss function that is easier to minimize. While gradient steps are cheaper by a factor of $O(C)$ for negative sampling, the randomly drawn negative samples increase the variance of the stochastic gradient estimator and worsen the signal-to-noise ratio of the gradient, slowing down convergence. The next section combines the strengths of both approaches.

2.2 Adversarial Negative Sampling


We propose a generalized variant of negative sampling that reduces the gradient noise. The main idea is to train with negative samples $y'$ that are hard to distinguish from positive samples. We draw $y'$ from a conditional noise distribution $p_n(y'|x)$ using an auxiliary model. This introduces a bias, which we remove at prediction time. In summary, our proposed approach consists of three steps:

  1. Parameterize the noise distribution $p_n(y'|x)$ by an auxiliary model and fit it to the data set.

  2. Train a classifier via negative sampling (Eq. 2) using adversarial negative samples from the auxiliary model fitted in Step 1 above. For our proposed auxiliary model, drawing a negative sample costs only $O(k \log C)$ time with some $k < K$, i.e., it is sublinear in $C$.

  3. The resulting model has a bias. When making predictions, remove the bias by mapping it to an unbiased softmax classifier using the generalized asymptotic equivalence in Eq. 5 below.

We elaborate on the above Step 1 in Section 3. In the present section, we focus instead on Step 2 and its dependency on the choice of noise distribution $p_n(y'|x)$, and on the bias removal (Step 3).

Why Adversarial Noise Improves Learning.

We first provide some intuition why uniform negative sampling is not optimal, and how sampling from a non-uniform noise distribution may improve the gradient signal. We argue that the poor gradient signal is caused by the fact that negative samples are too easy to distinguish from positive samples. A data set with many categories typically comprises several hierarchical clusters, with large clusters of generic concepts and small sub-clusters of specialized concepts. When drawing negative samples uniformly across the data set, the correct label will likely belong to a different generic concept than the negative sample. An image classifier will therefore quickly learn to distinguish, e.g., dogs from bicycles, but since negative samples from the same cluster are rare, it takes much longer to learn the differences between a Boston Terrier and a French Bulldog. The model quickly learns to assign very low scores $\xi_{y'}(x, \phi) \ll 0$ to such ‘obviously wrong’ labels, making their contribution to the gradient exponentially small,

$$\nabla_\phi \big[ -\log \sigma\big(-\xi_{y'}(x, \phi)\big) \big] = \sigma\big(\xi_{y'}(x, \phi)\big) \, \nabla_\phi \xi_{y'}(x, \phi) \approx e^{\xi_{y'}(x, \phi)} \, \nabla_\phi \xi_{y'}(x, \phi) \approx 0 \quad \text{for } \xi_{y'}(x, \phi) \ll 0. \tag{4}$$

A similar vanishing gradient problem was pointed out for word embeddings by Chen et al. (2018). Here, the vanishing gradient is due to different word frequencies, and a popular solution is therefore to draw negative samples from a nonuniform but unconditional noise distribution based on the empirical word frequencies (Mikolov et al., 2013). This introduces a bias which does not matter for word embeddings since the focus is not on classification but rather on learning useful representations.
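The exponential suppression of the gradient from ‘obviously wrong’ negative samples can be checked numerically. The sketch below uses only the fact that the derivative of $-\log \sigma(-\xi)$ with respect to the score $\xi$ is $\sigma(\xi)$; the function name and the chosen score values are illustrative assumptions.

```python
import math

def neg_sample_grad_weight(score):
    """sigma(score): magnitude of d/d(score) of -log sigma(-score)."""
    return 1.0 / (1.0 + math.exp(-score))

# weight of a negative sample's gradient for increasingly 'easy' negatives
weights = {s: neg_sample_grad_weight(s) for s in (0.0, -5.0, -10.0)}
for s, w in weights.items():
    print(s, w, math.exp(s))  # for very negative scores, w is close to exp(score)
```

A negative sample with score 0 contributes with weight 0.5, whereas a negative sample with score -10 contributes with a weight below 1e-4.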

Going beyond frequency-adjusted negative sampling, we show that one can drastically improve the procedure by generating negative samples from an auxiliary model. We therefore propose to generate negative samples $y' \sim p_n(y'|x)$ conditioned on the input feature $x$. This has the advantage that the distribution of negative samples can be made much more similar to the distribution of positive samples, leading to a better signal-to-noise ratio. One consequence is that the introduced bias can no longer be ignored, which is what we address next.

Bias Removal.

Negative sampling with a nonuniform noise distribution introduces a bias. For a given input feature vector $x$, labels $y'$ with a high noise probability $p_n(y'|x)$ are frequently drawn as negative samples, causing the model to learn a low score $\xi_{y'}(x, \phi^*)$. Conversely, a low $p_n(y'|x)$ leads to an inflated score $\xi_{y'}(x, \phi^*)$. It turns out that this bias can be easily quantified via a generalization of Eq. 3. We prove in Theorem 1 (Section 4) that in the nonparametric limit for arbitrary $p_n(y|x)$,

$$\xi_y(x, \theta^*) = \xi_y(x, \phi^*) + \log p_n(y|x) + \text{const.} \tag{5}$$

Eq. 5 is an asymptotic equivalence between softmax classification (Eq. 1) and generalized negative sampling (Eq. 2). While strict equality holds only in the nonparametric limit, many models are flexible enough that Eq. 5 holds approximately in practice. Eq. 5 allows us to make unbiased predictions by mapping biased negative sampling scores $\xi_y(x, \phi^*)$ to unbiased softmax scores $\xi_y(x, \theta^*)$. There is no need to solve for the corresponding model parameters $\theta^*$; the scores $\xi_y(x, \theta^*)$ suffice for predictions.
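A minimal sketch of the bias removal at prediction time, assuming Eq. 5 takes the form "softmax score = negative sampling score + log noise probability + constant". The helper names and the three-label toy distributions are hypothetical. The toy check uses a perfectly fitted negative sampling model, whose scores equal the log ratio of data to noise probability (Eq. 11).

```python
import math

def debiased_scores(xi_phi, log_pn):
    """Map negative sampling scores to softmax scores (up to a constant)."""
    return [s + lp for s, lp in zip(xi_phi, log_pn)]

def predict_proba(xi_phi, log_pn):
    """Softmax over the debiased scores; the constant cancels in the normalization."""
    s = debiased_scores(xi_phi, log_pn)
    m = max(s)
    e = [math.exp(v - m) for v in s]
    z = sum(e)
    return [v / z for v in e]

p_data = [0.7, 0.2, 0.1]    # true conditional label distribution for some x
p_noise = [0.5, 0.3, 0.2]   # conditional noise distribution for the same x
xi_phi = [math.log(d / n) for d, n in zip(p_data, p_noise)]  # perfectly fitted scores
probs = predict_proba(xi_phi, [math.log(n) for n in p_noise])
print(probs)
```

In this idealized setting the debiased predictions recover the data distribution, whereas a softmax over the raw scores `xi_phi` would not.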


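In practice, softmax classification typically requires a regularizer with some strength $\lambda$ to prevent overfitting. With the asymptotic equivalence in Eq. 5, regularizing the softmax scores $\xi_y(x, \theta)$ is similar to regularizing $\xi_y(x, \phi) + \log p_n(y|x)$ in the proposed generalized negative sampling method. We thus propose to use the following regularized variant of Eq. 2,

$$\ell^{(\text{reg.})}_{\text{neg.sampl.}}(\phi) = \sum_{(x,y) \in \mathcal{D}} \Big[ -\log \sigma\big(\xi_y(x, \phi)\big) - \log \sigma\big(-\xi_{y'}(x, \phi)\big) + \lambda \big(\xi_y(x, \phi) + \log p_n(y|x)\big)^2 + \lambda \big(\xi_{y'}(x, \phi) + \log p_n(y'|x)\big)^2 \Big] \quad \text{where } y' \sim p_n(y'|x). \tag{6}$$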


Comparison to GANs.

The use of adversarial negative samples, i.e., negative samples that are designed to ‘confuse’ the logistic regression in Eq. 2, bears some resemblance to generative adversarial networks (GANs) (Goodfellow et al., 2014). The crucial difference is that GANs are generative models, whereas we train a discriminative model over a discrete label space $\mathcal{Y}$. The ‘generator’ $p_n$ in our setup only needs to find a rough approximation of the (conditional) label distribution because the final predictive scores in Eq. 5 combine the ‘generator scores’ $\log p_n(y|x)$ with the more expressive ‘discriminator scores’ $\xi_y(x, \phi^*)$. This allows us to use a very restrictive but efficient generator model (see Section 3 below) that we can keep constant while training the discriminator. By contrast, the focus in GANs is on finding the best possible generator, which requires concurrent training of a generator and a discriminator via a potentially unstable nested min-max optimization.

3 Conditional Generation of Adversarial Samples

Having proposed a general approach for improved negative sampling with an adversarial auxiliary model (Section 2), we now describe a simple construction for such a model that satisfies all requirements. The model is essentially a probabilistic version of a decision tree which is able to conditionally generate negative samples by ancestral sampling. Readers who prefer to proceed can skip this section without losing the main thread of the paper.

Our auxiliary model has the following properties: (i) it can be efficiently fitted to the training data $\mathcal{D}$, requiring minimal hyperparameter tuning and subleading computational overhead over the training of the main model; (ii) drawing negative samples scales only as $O(\log C)$, thus improving over the linear scaling of the softmax loss function (Eq. 1); and (iii) the log likelihood $\log p_n(y|x)$ can be evaluated explicitly so that we can apply the bias removal in Eq. 5. Satisfying requirements (i) and (ii) on model efficiency comes at the cost of some model performance. This is an acceptable trade-off since the performance of $p_n$ affects only the quality of negative samples.


Our auxiliary model for $p_n$ is inspired by the Hierarchical Softmax model due to Morin & Bengio (2005). It is a balanced probabilistic binary decision tree, where each leaf node is mapped uniquely to a label $y \in \mathcal{Y}$. A decision tree imposes a hierarchical structure on $\mathcal{Y}$, which can impede performance if it does not reflect any semantic structure in $\mathcal{Y}$. Morin & Bengio (2005) rely on an explicitly provided semantic hierarchical structure, or ‘ontology’. Since an ontology is often not available, we instead construct a hierarchical structure in a data-driven way. Our method has some similarity to the approach by Mnih & Hinton (2009), but it is more principled in that we fit both the model parameters and the hierarchical structure by maximizing a single log likelihood function.

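To sample from the model, one walks from the tree’s root to some leaf. At each node $\nu$, one makes a binary decision $\zeta \in \{\pm 1\}$ whether to continue to the right child ($\zeta = 1$) or to the left child ($\zeta = -1$). Given a feature vector $x$, we model the likelihood of these decisions as $\sigma\big(\zeta (w_\nu^\top x + b_\nu)\big)$, where the weight vector $w_\nu$ and the scalar bias $b_\nu$ are model parameters associated with node $\nu$. Denoting the unique path $\pi_y$ from the root node $\nu_0$ to the leaf node associated with label $y$ as a sequence of nodes and binary decisions, $\pi_y \equiv \big((\nu_0, \zeta_0), (\nu_1, \zeta_1), \ldots\big)$, the log likelihood of the training set $\mathcal{D}$ is thus

$$\mathcal{L} = \sum_{(x,y) \in \mathcal{D}} \log p_n(y|x) = \sum_{(x,y) \in \mathcal{D}} \; \sum_{(\nu, \zeta) \in \pi_y} \log \sigma\big(\zeta (w_\nu^\top x + b_\nu)\big). \tag{7}$$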
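The ancestral sampling procedure can be sketched in a few lines of Python. The sketch assumes that decisions are encoded as +1 (right child) and -1 (left child), that each decision has probability sigma(zeta * (w.x + b)), and that nodes are stored with heap indexing (root 0, children of node n at 2n+1 and 2n+2). The heap layout and all function names are assumptions of this sketch, not taken from the paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def activation(node, x, W, b):
    return sum(w * xk for w, xk in zip(W[node], x)) + b[node]

def sample_label(x, W, b, depth, rng=random):
    """Walk from the root to a leaf; returns (label, log p_n(label | x))."""
    node, log_p = 0, 0.0
    for _ in range(depth):  # one binary decision per tree level
        a = activation(node, x, W, b)
        zeta = 1 if rng.random() < sigmoid(a) else -1  # +1: right, -1: left
        log_p += math.log(sigmoid(zeta * a))
        node = 2 * node + (2 if zeta == 1 else 1)
    return node - (2 ** depth - 1), log_p  # leaf index -> label in 0 .. 2**depth - 1

def log_prob(label, x, W, b, depth):
    """log p_n(label | x): sum of log decision probabilities along the path."""
    node, log_p = 0, 0.0
    for level in reversed(range(depth)):
        zeta = 1 if (label >> level) & 1 else -1  # bits of the label, MSB first
        log_p += math.log(sigmoid(zeta * activation(node, x, W, b)))
        node = 2 * node + (2 if zeta == 1 else 1)
    return log_p
```

Both functions visit one node per tree level, so the cost grows with the depth of the tree, i.e., logarithmically in the number of leaves, and the explicit `log_prob` is what the bias removal in Eq. 5 requires.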


Greedy Model Fitting.

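We maximize the likelihood $\mathcal{L}$ in Eq. 7 over (i) the model parameters $w_\nu$ and $b_\nu$ of all nodes $\nu$, and (ii) the hierarchical structure, i.e., the mapping between labels $y$ and leaf nodes. The latter involves an exponentially large search space, making exact maximization intractable. We use a greedy approximation where we recursively split the label set $\mathcal{Y}$ into halves and associate each node $\nu$ with a subset $\mathcal{Y}_\nu \subseteq \mathcal{Y}$. We start at the root node $\nu_0$ with $\mathcal{Y}_{\nu_0} = \mathcal{Y}$ and finish at the leaves with a single label per leaf. For each node $\nu$, we maximize the terms in $\mathcal{L}$ that depend on $w_\nu$ and $b_\nu$. These terms correspond to data points with a label $y \in \mathcal{Y}_\nu$, leading to the objective

$$\mathcal{L}_\nu = \sum_{(x,y) \in \mathcal{D},\; y \in \mathcal{Y}_\nu} \log \sigma\big(\zeta_y (w_\nu^\top x + b_\nu)\big), \qquad \zeta_y \in \{\pm 1\}. \tag{8}$$

We alternate between a continuous maximization of $\mathcal{L}_\nu$ over $w_\nu$ and $b_\nu$, and a discrete maximization over the binary indicators $\zeta_y \in \{\pm 1\}$ that define how we split $\mathcal{Y}_\nu$ into two equally sized halves. The continuous optimization is a convex problem and it converges quickly to machine precision with Newton ascent, which is free of hyperparameters like learning rates. For the discrete optimization, we note that changing $\zeta_y$ for any $y \in \mathcal{Y}_\nu$ from $-1$ to $1$ (or from $1$ to $-1$) increases (or decreases) $\mathcal{L}_\nu$ by

$$\Delta_y = \sum_{x \in \mathcal{D}_y} \Big[ \log \sigma\big(w_\nu^\top x + b_\nu\big) - \log \sigma\big(-(w_\nu^\top x + b_\nu)\big) \Big] = \sum_{x \in \mathcal{D}_y} \big( w_\nu^\top x + b_\nu \big). \tag{9}$$

Here, the sums over $x \in \mathcal{D}_y$ run over all data points in $\mathcal{D}$ with label $y$, and the second equality is an algebraic identity of the sigmoid function. We maximize $\mathcal{L}_\nu$ over all $\zeta_y$ under the boundary condition that the split be into equally sized halves by setting $\zeta_y \leftarrow 1$ for the half of $y \in \mathcal{Y}_\nu$ with largest $\Delta_y$ and $\zeta_y \leftarrow -1$ for the other half. If this changes any $\zeta_y$ then we switch back to the continuous optimization. Otherwise, we have reached a local optimum for node $\nu$, and we proceed to the next node.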
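The discrete half of this alternating optimization can be sketched as follows. The sketch assumes that the per-label quantity of Eq. 9 is the sum of w.x + b over all data points with that label, and that the node holds an even number of labels; the function names and the toy data are hypothetical.

```python
def delta_per_label(data, w, b):
    """data: iterable of (x, y) pairs. Returns {y: sum over its points of (w.x + b)}."""
    out = {}
    for x, y in data:
        out[y] = out.get(y, 0.0) + sum(wk * xk for wk, xk in zip(w, x)) + b
    return out

def split_labels(delta):
    """Assign +1 to the half of the labels with the largest delta, -1 to the rest."""
    ranked = sorted(delta, key=delta.get, reverse=True)
    half = len(ranked) // 2
    return {y: (1 if i < half else -1) for i, y in enumerate(ranked)}

data = [([1.0], "a"), ([2.0], "a"), ([-1.0], "b"), ([0.5], "c"), ([-3.0], "d")]
delta = delta_per_label(data, w=[1.0], b=0.0)
assignment = split_labels(delta)
print(assignment)
```

In a full fitting loop, one would refit w and b with the new assignment and repeat until the assignment no longer changes.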

Technical Details.

In the interest of clarity, the above description left out the following details. Most importantly, to prioritize efficiency over accuracy, we preprocess the feature vectors $x \in \mathbb{R}^K$ and project them to a smaller space $\mathbb{R}^k$ with $k < K$ using principal component analysis (PCA). Sampling from $p_n$ thus costs only $O(k \log C)$ time. This dimensionality reduction only affects the quality of negative samples. The main model (Eq. 2) still operates on the full feature space $\mathbb{R}^K$. Second, we add a quadratic regularizer $-\lambda_n (\|w_\nu\|^2 + b_\nu^2)$ to $\mathcal{L}_\nu$, with strength $\lambda_n$ set by cross validation. Third, we introduce uninhabited padding labels if $C$ is not a power of two. We ensure that $p_n(\tilde{y}|x) = 0$ for all padding labels $\tilde{y}$ by setting $b_\nu$ to a very large positive or negative value if either of $\nu$’s children contains only padding labels. Finally, we initialize the optimization with $b_\nu \leftarrow 0$ and by setting $w_\nu$ to the dominant eigenvector of the covariance matrix of the set of vectors $\big\{ \sum_{x \in \mathcal{D}_y} x \big\}_{y \in \mathcal{Y}_\nu}$.
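As a small illustration of the padding step, the following sketch computes the depth of a balanced binary tree and the number of uninhabited padding labels needed for a given number of labels. The function name and the example label counts are hypothetical and chosen only to show the arithmetic.

```python
def tree_depth_and_padding(num_labels):
    """Smallest depth d with 2**d >= num_labels, and the number of padding leaves."""
    depth = max(1, (num_labels - 1).bit_length())
    return depth, 2 ** depth - num_labels

for c in (8, 9, 500000):
    print(c, tree_depth_and_padding(c))
```

The depth, and hence the number of binary decisions per sampled label, grows only logarithmically with the number of labels.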


4 Theoretical Aspects

We formalize and prove the two main premises of the algorithm proposed in Section 2.2. Theorem 1 below states the equivalence between softmax classification and negative sampling (Eq. 5), and Theorem 2 formalizes the claim that adversarial negative samples maximize the signal-to-noise ratio.

Theorem 1.

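In the nonparametric limit, the optimal model parameters $\theta^*$ and $\phi^*$ that minimize $\ell_{\text{softmax}}(\theta)$ in Eq. 1 and $\ell_{\text{neg.sampl.}}(\phi)$ in Eq. 2, respectively, satisfy Eq. 5 for all $x$ in the data set and all $y \in \mathcal{Y}$. Here, the “const.” term on the right-hand side of Eq. 5 is independent of $y$.

Proof. Minimizing $\ell_{\text{softmax}}(\theta)$ fits the maximum likelihood estimate of a model with likelihood $p_\theta(y|x) = e^{\xi_y(x, \theta)} / Z_\theta(x)$ with normalization $Z_\theta(x) = \sum_{y' \in \mathcal{Y}} e^{\xi_{y'}(x, \theta)}$. In the nonparametric limit, the score functions are arbitrarily flexible, allowing for a perfect fit to the empirical label distribution $p_{\mathcal{D}}(y|x)$, thus

$$p_{\mathcal{D}}(y|x) = p_{\theta^*}(y|x) = \frac{e^{\xi_y(x, \theta^*)}}{Z_{\theta^*}(x)}. \tag{10}$$

Similarly, $\ell_{\text{neg.sampl.}}(\phi)$ is the maximum likelihood objective of a binomial model that discriminates positive from negative samples. The nonparametric limit admits again a perfect fit so that the learned ratio of positive rate to negative rate equals the empirical ratio,

$$\frac{\sigma\big(\xi_y(x, \phi^*)\big)}{\sigma\big(-\xi_y(x, \phi^*)\big)} = e^{\xi_y(x, \phi^*)} = \frac{p_{\mathcal{D}}(y|x)}{p_n(y|x)}, \tag{11}$$

where we used the identity $\sigma(z)/\sigma(-z) = e^z$. Inserting Eq. 10 for $p_{\mathcal{D}}(y|x)$ and taking the logarithm leads to Eq. 5. Here, the “const.” term works out to $\log Z_{\theta^*}(x)$, which is indeed independent of $y$. ∎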

Signal-to-Noise Ratio.

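In preparation for Theorem 2 below, we define a quantitative measure for the signal-to-noise ratio (SNR) in stochastic gradient descent (SGD). In the vicinity of the minimum $\phi^*$ of a loss function $\ell(\phi)$, the gradient $g \approx H_\ell (\phi - \phi^*)$ is approximately proportional to the Hessian $H_\ell$ of $\ell$ at $\phi^*$. SGD estimates $g$ via stochastic gradient estimates $\hat{g}$, whose noise is measured by the covariance matrix $\mathrm{Cov}[\hat{g}, \hat{g}]$. Thus, the eigenvalues $\{\eta_i\}$ of the matrix $A \equiv H_\ell \, \mathrm{Cov}[\hat{g}, \hat{g}]^{-1}$ measure the SNR along different directions in parameter space. We define an overall scalar SNR $\bar{\eta}$ as

$$\bar{\eta} \equiv \frac{1}{\sum_i 1/\eta_i} = \frac{1}{\mathrm{Tr}\big[A^{-1}\big]} = \frac{1}{\mathrm{Tr}\big[\mathrm{Cov}[\hat{g}, \hat{g}] \, H_\ell^{-1}\big]}. \tag{12}$$

Here, we sum over the inverses $1/\eta_i$ rather than $\eta_i$ so that $\bar{\eta} \le \eta_i$ for all $i$ and thus maximizing $\bar{\eta}$ encourages large values for all $\eta_i$. The definition in Eq. 12 has the useful property that $\bar{\eta}$ is invariant under arbitrary invertible reparameterization of $\phi$. Expressing $\phi$ in terms of new model parameters $\phi'$ maps $H_\ell$ to $J^\top H_\ell J$ and $\mathrm{Cov}[\hat{g}, \hat{g}]$ to $J^\top \mathrm{Cov}[\hat{g}, \hat{g}] J$, where $J \equiv \partial\phi / \partial\phi'$ is the Jacobian. Inserting into Eq. 12 and using the cyclic property of the trace, $\mathrm{Tr}[AB] = \mathrm{Tr}[BA]$, all Jacobians cancel.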
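A tiny numerical illustration of why summing over inverse eigenvalues favors large SNR in every direction. It assumes that the overall SNR of Eq. 12 is the inverse of the sum of the inverse per-direction SNRs, as the text describes; the function name and the eigenvalues are made up for illustration.

```python
def overall_snr(eigenvalues):
    """Inverse of the sum of inverse per-direction SNRs (all assumed positive)."""
    return 1.0 / sum(1.0 / e for e in eigenvalues)

etas = [4.0, 2.0, 0.5]
snr = overall_snr(etas)
print(snr, min(etas))  # the overall SNR never exceeds the worst direction
```

Because a single small eigenvalue dominates the sum of inverses, the overall SNR can only be large if the SNR is large along every direction.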

Theorem 2.

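For negative sampling (Eq. 2) in the nonparametric limit, the signal-to-noise ratio $\bar{\eta}$ defined in Eq. 12 is maximal if $p_n = p_{\mathcal{D}}$, i.e., for adversarial negative samples.

Proof. In the nonparametric limit, the scores $\xi_y(x, \phi)$ can be regarded as independent variables for all $x$ and $y$. We therefore treat the scores directly as model parameters $\xi_{x,y}$, using the invariance of $\bar{\eta}$ under reparameterization. Using only Eq. 2, Eq. 11, and properties of the $\sigma$-function, we show in Appendix A.1 that the Hessian of the loss function is diagonal in this coordinate system, and given by

$$H_\ell = \mathrm{diag}(\alpha_{x,y}) \quad \text{with} \quad \alpha_{x,y} = p_{\mathcal{D}}(y|x) \, \sigma(-\xi_{x,y}) = p_n(y|x) \, \sigma(\xi_{x,y}), \tag{13}$$

and that the noise covariance matrix is block diagonal,

$$\mathrm{Cov}[\hat{g}, \hat{g}] = \mathrm{diag}(C_x) \quad \text{with} \quad C_x = \mathrm{diag}(\alpha_{x,:}) - 2 \, \alpha_{x,:} \alpha_{x,:}^\top, \tag{14}$$

where $\alpha_{x,:} \equiv (\alpha_{x,y})_{y \in \mathcal{Y}}$ denotes a $C$-dimensional column vector. Thus, the trace in Eq. 12 is

$$\mathrm{Tr}\big[\mathrm{Cov}[\hat{g}, \hat{g}] \, H_\ell^{-1}\big] = \sum_x \mathrm{Tr}\big[C_x \, \mathrm{diag}(\alpha_{x,:})^{-1}\big] = \sum_x \Big[ C - 2 \sum_{y \in \mathcal{Y}} \alpha_{x,y} \Big]. \tag{15}$$

We thus have to maximize $\sum_{y \in \mathcal{Y}} \alpha_{x,y}$ for each $x$ in the training set. We find from Eq. 13 and Eq. 11,

$$\sum_{y \in \mathcal{Y}} \alpha_{x,y} = \mathbb{E}_{p_{\mathcal{D}}(y|x)}\big[ \sigma(-\xi_{x,y}) \big] = \mathbb{E}_{p_{\mathcal{D}}(y|x)}\Big[ f\Big( \frac{p_n(y|x)}{p_{\mathcal{D}}(y|x)} \Big) \Big] \tag{16}$$

with $f(z) \equiv 1/(1 + 1/z)$. Using Jensen’s inequality for the concave function $f$, we find that the right-hand side of Eq. 16 has the upper bound $f\big(\mathbb{E}_{p_{\mathcal{D}}(y|x)}[p_n(y|x)/p_{\mathcal{D}}(y|x)]\big) = f(1) = 1/2$, which it reaches precisely if the argument of $f$ in Eq. 16 is a constant, i.e., iff $p_n(y|x) = p_{\mathcal{D}}(y|x)$ for all $y \in \mathcal{Y}$. ∎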
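The Jensen bound can be checked numerically on a toy example. The sketch assumes that the quantity to be maximized is the expectation, under the data distribution, of f(p_n / p_D) with f(z) = 1 / (1 + 1/z), which follows from inserting the optimal scores of Eq. 11 into sigma(-score). The function name and the three-label distributions are illustrative assumptions.

```python
def sum_alpha(p_data, p_noise):
    """sum_y p_D(y) * f(p_n(y) / p_D(y)) with f(z) = z / (1 + z)."""
    # p_D * f(p_n / p_D) simplifies to p_D * p_n / (p_D + p_n)
    return sum(d * n / (d + n) for d, n in zip(p_data, p_noise) if d + n > 0)

p_data = [0.7, 0.2, 0.1]
uniform = [1.0 / 3.0] * 3
adversarial = list(p_data)  # noise distribution that matches the data distribution

print(sum_alpha(p_data, uniform))      # strictly below 0.5
print(sum_alpha(p_data, adversarial))  # attains the bound 0.5
```

Any mismatch between the noise distribution and the data distribution lowers the sum below 1/2, in line with the statement of Theorem 2.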

5 Results

We evaluated the proposed adversarial negative sampling method on two established benchmarks by comparing speed of convergence and predictive performance against five different baselines.

Table 1: Sizes of data sets and hyperparameters. Columns: data set; size of data set ($N$, $C$); adv. neg. s. (proposed); uniform neg. s.; freq. neg. s.; NCE; A&R; OVE. $N$ = number of training points; $C$ = number of categories (after preprocessing); $\rho$ = learning rate; $\lambda$ = regularizer; $\sigma_0^2$ = prior variance.

Datasets, Preprocessing and Model.

We used the Wikipedia-500K and Amazon-670K data sets from the Extreme Classification Repository (Bhatia et al., ) with $K$-dimensional XML-CNN features (Liu et al., 2017) downloaded from (Saxena, ). As both data sets contain multiple labels per data point, we follow the approach in (Ruiz et al., 2018) and keep only the first label for each data point. Table 1 shows the resulting sizes. We fit a linear model with scores $\xi_y(x, \phi) = w_y^\top x + b_y$, where the model parameters $\phi$ are the weight vectors $w_y$ and biases $b_y$ for each label $y$.


We compare our proposed method to five baselines: (i) standard negative sampling with a uniform noise distribution; (ii) negative sampling with an unconditional noise distribution $p_n(y')$ set to the empirical label frequencies; (iii) noise contrastive estimation (NCE, see below); (iv) ‘Augment and Reduce’ (Ruiz et al., 2018); and (v) ‘One vs. Each’ (Titsias, 2016). We do not compare to full softmax classification, which would be infeasible on the large data sets (see Table 1; a single epoch of optimizing the full softmax loss would scale as $O(NCK)$). However, we provide additional results that compare softmax against negative sampling on a smaller data set in Appendix A.2.

NCE (Gutmann & Hyvärinen, 2010) is sometimes used as a synonym for negative sampling in the literature, but the original proposal of NCE is more general and allows for a nonuniform base distribution. We use our trained auxiliary model (Section 3) for the base distribution of NCE. Compared to our proposed method, NCE uses the base distribution only during training and not for predictions. Therefore, NCE has to re-learn everything that is already captured by the base distribution. This is less of an issue in the original setup for which NCE was proposed, namely unsupervised density estimation over a continuous space. By contrast, training a supervised classifier effectively means training a separate model for each label $y \in \mathcal{Y}$, which is expensive if $\mathcal{Y}$ is large. Thus, having to re-learn what the base distribution already captures is potentially wasteful.


We tuned the hyperparameters for each method individually using the validation set. Table 1 shows the resulting hyperparameters. For the proposed method and baselines (i)-(iii) we used an Adagrad optimizer (Duchi et al., 2011) and tuned the learning rate $\rho$ and the regularizer strength $\lambda$ (see Eq. 6). For ‘Augment and Reduce’ and ‘One vs. Each’ we used the implementation published by the authors (Ruiz, ), and tuned the learning rate $\rho$ and prior variance $\sigma_0^2$. For the auxiliary model, we used the same feature dimension $k$ and regularizer strength $\lambda_n$ for both data sets.

Figure 1: Learning curves for our proposed adversarial negative sampling method (green) and for five different baselines on two large data sets (see Table 1).


Figure 1 shows our results on the Wikipedia-500K data set (left two plots) and the Amazon-670K data set (right two plots). For each data set, we plot the predictive log likelihood per test data point (first and third plot) and the predictive accuracy (second and fourth plot). The green curve in each plot shows our proposed adversarial negative sampling method. Both our method and NCE (orange) start slightly shifted to the right to account for the time to fit the auxiliary model.

Our main observation is that the proposed method converges orders of magnitude faster and reaches better accuracies (second and fourth plot in Figure 1) than all baselines. On the (smaller) Amazon-670K data set, standard uniform and frequency-based negative sampling reach a slightly higher predictive log likelihood, but our method performs considerably better in terms of predictive accuracy on both data sets. This can be understood since the predictive accuracy is very sensitive to the precise scores of the highest ranked labels: a small change in these scores can affect which label is ranked highest. With adversarial negative sampling, the training procedure focuses on getting the scores of the highest ranked labels right, thus improving in particular the predictive accuracy.

6 Related Work

Efficient Evaluation of the Softmax Loss Function.

Methods to speed up evaluation of Eq. 1 include augmenting the model by adding auxiliary latent variables that can be marginalized over analytically (Galy-Fajou et al., 2019; Wenzel et al., 2019; Ruiz et al., 2018; Titsias, 2016). More closely related to our work are methods based on negative sampling (Mnih & Hinton, 2009; Mikolov et al., 2013) and noise contrastive estimation (Gutmann & Hyvärinen, 2010). Generalizations of negative sampling to non-uniform noise distributions have been proposed, e.g., in (Zhang & Zweigenbaum, 2018; Chen et al., 2018; Wang et al., 2014; Gutmann & Hyvärinen, 2010). Our method differs from these proposals by drawing the negative samples from a conditional distribution that takes the input feature into account, and by requiring the model to learn only correlations that are not already captured by the noise distribution. We further derive the optimal distribution for negative samples, and we propose an efficient way to approximate it via an auxiliary model. Adversarial training (Miyato et al., 2017) is a popular method for training deep generative models (Tu, 2007; Goodfellow et al., 2014). By contrast, our method trains a discriminative model over a discrete set of labels (see also our comparison to GANs at the end of Section 2.2).

A different sampling-based approximation of softmax classification is ‘sampled softmax’ (Bengio et al., 2003). It directly approximates the sum over classes $y'$ in the loss (Eq. 1) by sampling, which is biased even for a uniform sampling distribution. A nonuniform sampling distribution can remove or reduce the bias (Bengio & Senécal, 2008; Blanc & Rendle, 2018; Rawat et al., 2019). By contrast, our method uses negative sampling, and it uses a nonuniform distribution to reduce the gradient variance.

Decision Trees.

Decision trees (Somvanshi & Chavan, 2016) are popular in the extreme classification literature (Agrawal et al., 2013; Jain et al., 2016; Prabhu & Varma, 2014; Siblini et al., 2018; Weston et al., 2013; Bhatia et al., 2015; Jasinska et al., 2016). Our proposed method employs a probabilistic decision tree that is similar to Hierarchical Softmax (Morin & Bengio, 2005; Mikolov et al., 2013). While decision trees allow for efficient training and sampling in $O(\log C)$ time, their hierarchical architecture imposes a structural bias. Our proposed method trains a more expressive model without such a structural bias on top of the decision tree to correct for any structural bias.

7 Conclusions

We proposed a simple method to train a classifier over a large set of labels. Our method is based on a scalable approximation to the softmax loss function via a generalized form of negative sampling. By generating adversarial negative samples from an auxiliary model, we proved that we maximize the signal-to-noise ratio of the stochastic gradient estimate. We further showed that, while the auxiliary model introduces a bias, we can remove the bias at test time. We believe that due to its simplicity, our method can be widely used, and we publish the code (https://github.com/mandt-lab/adversarial-negative-sampling) of both the main and the auxiliary model.


Stephan Mandt acknowledges funding from DARPA (HR001119S0038), NSF (FW-HTF-RM), and Qualcomm.


  • Agrawal et al. (2013) Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pp. 13–24. ACM, 2013.
  • Baumel et al. (2018) Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noemie Elhadad. Multi-label classification of patient notes: case study on icd code assignment. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Bengio et al. (2019) Samy Bengio, Krzysztof Dembczynski, Thorsten Joachims, Marius Kloft, and Manik Varma. Extreme classification (dagstuhl seminar 18291). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.
  • Bengio & Senécal (2008) Yoshua Bengio and Jean-Sébastien Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.
  • Bengio et al. (2003) Yoshua Bengio, Jean-Sébastien Senécal, et al. Quick training of probabilistic neural nets by importance sampling. In AISTATS, pp. 1–9, 2003.
  • (6) Kush Bhatia, Kunal Dahiya, Himanshu Jain, Yashoteja Prabhu, and Manik Varma. The extreme classification repository: Multi-label datasets & code. http://manikvarma.org/downloads/XC/XMLRepository.html. Accessed: 2019-05-23.
  • Bhatia et al. (2015) Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In Advances in neural information processing systems, pp. 730–738, 2015.
  • Blanc & Rendle (2018) Guy Blanc and Steffen Rendle. Adaptive sampled softmax with kernel based sampling. In International Conference on Machine Learning, pp. 589–598, 2018.
  • Chen et al. (2018) Long Chen, Fajie Yuan, Joemon M Jose, and Weinan Zhang. Improving negative sampling for word representation using self-embedded features. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 99–107. ACM, 2018.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Galy-Fajou et al. (2019) Théo Galy-Fajou, Florian Wenzel, Christian Donner, and Manfred Opper. Multi-class gaussian process classification made conjugate: Efficient inference via data augmentation. In Uncertainty in Artificial Intelligence, 2019.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
  • Gutmann & Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
  • Jain et al. (2016) Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–944. ACM, 2016.
  • Jasinska et al. (2016) Kalina Jasinska, Krzysztof Dembczynski, Róbert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, and Eyke Hullermeier. Extreme f-measure maximization using sparse probability estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016.
  • Lippert et al. (2017) Christoph Lippert, Riccardo Sabatini, M Cyrus Maher, Eun Yong Kang, Seunghak Lee, Okan Arikan, Alena Harley, Axel Bernal, Peter Garst, Victor Lavrenko, et al. Identification of individuals by trait prediction using whole-genome sequencing data. Proceedings of the National Academy of Sciences, 114(38):10166–10171, 2017.
  • Liu et al. (2017) Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124. ACM, 2017.
  • Mencia & Fürnkranz (2008) Eneldo Loza Mencia and Johannes Fürnkranz. Efficient pairwise multilabel classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 50–65. Springer, 2008.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
  • Miyato et al. (2017) Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. 2017.
  • Mnih & Hinton (2009) Andriy Mnih and Geoffrey E Hinton. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pp. 1081–1088, 2009.
  • Morin & Bengio (2005) Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pp. 246–252. Citeseer, 2005.
  • Prabhu & Varma (2014) Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 263–272. ACM, 2014.
  • Prabhu et al. (2018) Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pp. 993–1002. International World Wide Web Conferences Steering Committee, 2018.
  • Rawat et al. (2019) Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, and Sanjiv Kumar. Sampled softmax with random fourier features. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • (26) Francisco JR Ruiz. Augment and reduce github repository. https://github.com/franrruiz/augment-reduce. Accessed: 2019-05-23.
  • Ruiz et al. (2018) Francisco JR Ruiz, Michalis K Titsias, Adji B Dieng, and David M Blei. Augment and reduce: Stochastic inference for large categorical distributions. In International Conference on Machine Learning, pp. 4400–4409, 2018.
  • (28) Siddhartha Saxena. Xml-cnn github repository. https://github.com/siddsax/XML-CNN. Accessed: 2019-05-23.
  • Siblini et al. (2018) Wissam Siblini, Pascale Kuntz, and Frank Meyer. Craftml, an efficient clustering-based random forest for extreme multi-label learning. In The 35th International Conference on Machine Learning (ICML 2018), 2018.
  • Somvanshi & Chavan (2016) Madan Somvanshi and Pranjali Chavan. A review of machine learning techniques using decision tree and support vector machine. In 2016 International Conference on Computing Communication Control and automation (ICCUBEA), pp. 1–7. IEEE, 2016.
  • Titsias (2016) Michalis K Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. In Advances in Neural Information Processing Systems, pp. 4161–4169, 2016.
  • Tu (2007) Zhuowen Tu. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, 2007.
  • Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI conference on artificial intelligence, 2014.
  • Wenzel et al. (2019) Florian Wenzel, Théo Galy-Fajou, Christian Donner, Marius Kloft, and Manfred Opper. Efficient gaussian process classification using Pólya-gamma data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5417–5424, 2019.
  • Weston et al. (2013) Jason Weston, Ameesh Makadia, and Hector Yee. Label partitioning for sublinear ranking. In International Conference on Machine Learning, pp. 181–189, 2013.
  • Zhang & Zweigenbaum (2018) Zheng Zhang and Pierre Zweigenbaum. Gneg: Graph-based negative sampling for word2vec. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 566–571, 2018.


A.1 Details of the Proof of Theorem 2

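In the nonparametric limit, the score functions are so flexible that they can take arbitrary values for all $x$ in the data set and all labels $y$. Taking advantage of the invariance of the signal-to-noise ratio under reparameterization, we parameterize the model directly by its scores. We use the shorthand $\xi_{x,y}$ for the score of label $y$ at feature vector $x$, and we denote the collection of all scores over all $x$ and $y$ by boldface $\boldsymbol{\xi}$.


Eq. 2 defines the loss $\ell$ as a stochastic function. SGD minimizes its expectation,

$$\mathbb{E}[\ell(\boldsymbol{\xi})] = -\sum_x \sum_y \Big[ p_{\mathcal{D}}(y|x) \log\sigma(\xi_{x,y}) + p_n(y|x) \log\sigma(-\xi_{x,y}) \Big], \tag{A1}$$

where the sum over $x$ runs over all feature vectors in the training set, $p_{\mathcal{D}}(y|x)$ is the label distribution of the data, and $p_n(y|x)$ is the noise distribution. We obtain the gradient

$$g_{x,y} \equiv \frac{\partial\, \mathbb{E}[\ell(\boldsymbol{\xi})]}{\partial \xi_{x,y}} = -p_{\mathcal{D}}(y|x)\, \sigma(-\xi_{x,y}) + p_n(y|x)\, \sigma(\xi_{x,y}), \tag{A2}$$

where we used the relation $\frac{\partial}{\partial z} \log\sigma(z) = \sigma(-z)$. The gradient is a vector whose components span all combinations of $x$ and $y$. The Hessian matrix $H$ contains the derivatives of each gradient component $g_{x,y}$ by each coordinate $\xi_{\tilde{x},\tilde{y}}$. Since $g_{x,y}$ in Eq. A2 depends only on the single coordinate $\xi_{x,y}$, only the diagonal parts of the Hessian are nonzero, i.e., the components with $\tilde{x} = x$ and $\tilde{y} = y$. Thus,

$$H_{(x,y),(\tilde{x},\tilde{y})} = \delta_{x,\tilde{x}}\, \delta_{y,\tilde{y}}\, \alpha_{x,y} \quad\text{with}\quad \alpha_{x,y} \equiv \frac{\partial g_{x,y}}{\partial \xi_{x,y}}. \tag{A3}$$

Using the identity $\frac{\partial}{\partial z} \sigma(z) = \sigma(z)\, \sigma(-z)$, we find

$$\alpha_{x,y} = \big( p_{\mathcal{D}}(y|x) + p_n(y|x) \big)\, \sigma(\xi_{x,y})\, \sigma(-\xi_{x,y}). \tag{A4}$$

Since we evaluate the Hessian in the nonparametric limit at the minimum of the loss, the scores $\xi_{x,y}$ satisfy Eq. 11, i.e.,

$$\xi_{x,y} = \log\frac{p_{\mathcal{D}}(y|x)}{p_n(y|x)} \quad\Longleftrightarrow\quad p_{\mathcal{D}}(y|x)\, \sigma(-\xi_{x,y}) = p_n(y|x)\, \sigma(\xi_{x,y}). \tag{A5}$$

This allows us to simplify Eq. A4 by eliminating $p_{\mathcal{D}}(y|x)$,

$$\alpha_{x,y} = p_n(y|x)\, \sigma(\xi_{x,y}), \tag{A6}$$

where we used $\sigma(\xi_{x,y}) + \sigma(-\xi_{x,y}) = 1$.

Eqs. A3 and A6 together prove Eq. 13 of the main text.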
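The Hessian argument above can be checked numerically for a single feature vector. The following is a minimal sketch, not the authors' code: it assumes that the expected per-$x$ loss is $-\sum_y [p_D(y)\log\sigma(\xi_y) + p_n(y)\log\sigma(-\xi_y)]$ and that the optimal scores are $\xi^*_y = \log(p_D(y)/p_n(y))$. The toy distributions and all function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_loss(xi, p_data, p_noise):
    """Assumed expected negative sampling loss for one fixed feature vector."""
    return -sum(
        pd * math.log(sigmoid(s)) + pn * math.log(sigmoid(-s))
        for s, pd, pn in zip(xi, p_data, p_noise)
    )

def gradient(xi, p_data, p_noise):
    """Analytic gradient of expected_loss with respect to each score."""
    return [-pd * sigmoid(-s) + pn * sigmoid(s)
            for s, pd, pn in zip(xi, p_data, p_noise)]

def hessian_diagonal(xi, p_data, p_noise):
    """Analytic diagonal of the Hessian; off-diagonal entries are zero."""
    return [(pd + pn) * sigmoid(s) * sigmoid(-s)
            for s, pd, pn in zip(xi, p_data, p_noise)]

def hessian_diagonal_fd(xi, p_data, p_noise, h=1e-4):
    """Central finite-difference estimate of the same diagonal."""
    base = expected_loss(xi, p_data, p_noise)
    out = []
    for i in range(len(xi)):
        up = list(xi)
        down = list(xi)
        up[i] += h
        down[i] -= h
        out.append((expected_loss(up, p_data, p_noise) - 2.0 * base
                    + expected_loss(down, p_data, p_noise)) / h ** 2)
    return out

# Hypothetical toy distributions over three labels.
p_data = [0.7, 0.2, 0.1]
p_noise = [0.5, 0.3, 0.2]
xi_star = [math.log(pd / pn) for pd, pn in zip(p_data, p_noise)]

grad_at_optimum = gradient(xi_star, p_data, p_noise)
alpha_general = hessian_diagonal(xi_star, p_data, p_noise)
alpha_simplified = [pn * sigmoid(s) for s, pn in zip(xi_star, p_noise)]
alpha_fd = hessian_diagonal_fd(xi_star, p_data, p_noise)
```

Under these assumptions, `grad_at_optimum` vanishes up to rounding, and `alpha_general`, `alpha_simplified`, and `alpha_fd` agree, which mirrors the step in which the data distribution is eliminated from the Hessian at the minimum.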

Noise Covariance Matrix.

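SGD uses estimates $\hat{\ell}$ of the loss function in Eq. A1, obtained by drawing a positive sample $(x, y)$ from the data set and a label $y' \sim p_n(y'|x)$ for the negative sample, thus

$$\hat{\ell}(\boldsymbol{\xi}) = -N \big[ \log\sigma(\xi_{x,y}) + \log\sigma(-\xi_{x,y'}) \big], \tag{A7}$$

where the factor of $N$ is because the sum over $x$ in Eq. A1 scales proportionally to the size $N$ of the data set (in practice one typically normalizes the loss function by $N$ without affecting the signal-to-noise ratio). One uses $\hat{\ell}$ to obtain unbiased gradient estimates $\hat{g}$. We introduce new symbols $\tilde{x}$ and $\tilde{y}$ for the components $\hat{g}_{\tilde{x},\tilde{y}}$ of the gradient estimate to avoid confusion with the $x$ and $y$ drawn from the data set and the $y'$ drawn from the noise distribution in Eq. A7 above. Since the scores are independent variables in the nonparametric limit, the derivative $\partial \xi_{x,y} / \partial \xi_{\tilde{x},\tilde{y}}$ is one if $x = \tilde{x}$ and $y = \tilde{y}$, and zero otherwise. We denote this by indicator functions $1_{x=\tilde{x}}$ and $1_{y=\tilde{y}}$. Thus, we obtain

$$\hat{g}_{\tilde{x},\tilde{y}} = -N\, 1_{x=\tilde{x}} \big[ 1_{y=\tilde{y}}\, \sigma(-\xi_{x,y}) - 1_{y'=\tilde{y}}\, \sigma(\xi_{x,y'}) \big]. \tag{A8}$$

We evaluate the covariance matrix of $\hat{g}$ at the minimum of the loss function. Here, $\mathbb{E}[\hat{g}] = 0$, and thus $\mathrm{Cov}[\hat{g}, \hat{g}]$ simplifies to $\mathbb{E}[\hat{g}\, \hat{g}^\top]$. Introducing yet another pair of indices $\hat{x}$ and $\hat{y}$ to distinguish the two factors of $\hat{g}$, we denote the components of the covariance matrix as

$$\mathrm{Cov}\big[ \hat{g}_{\tilde{x},\tilde{y}},\, \hat{g}_{\hat{x},\hat{y}} \big] = \mathbb{E}\big[ \hat{g}_{\tilde{x},\tilde{y}}\; \hat{g}_{\hat{x},\hat{y}} \big]. \tag{A9}$$

Here, the expectation is over $x$, $y$, and $y'$. We start with the evaluation of the expectation over $x$, using $\mathbb{E}_x[\,\cdot\,] = \frac{1}{N} \sum_x [\,\cdot\,]$ where the sum runs over all $x$ in the data set. If $x \neq \tilde{x}$ or $x \neq \hat{x}$, then at least one of the two gradient estimates $\hat{g}$ in the expectation on the right-hand side of Eq. A9 vanishes. Therefore, only terms with $\tilde{x} = \hat{x}$ contribute, and the covariance matrix is block diagonal in $x$ as claimed in Eq. 14 of the main text. The blocks $\Sigma_x$ of the block diagonal matrix have entries

$$(\Sigma_x)_{\tilde{y},\hat{y}} = \frac{1}{N}\, \mathbb{E}_{y \sim p_{\mathcal{D}}(y|x),\; y' \sim p_n(y'|x)} \big[ \hat{g}_{x,\tilde{y}}\; \hat{g}_{x,\hat{y}} \big], \tag{A10}$$

where we find for the product $\hat{g}_{x,\tilde{y}}\, \hat{g}_{x,\hat{y}}$ by inserting Eq. A8 and multiplying out the terms,

$$\hat{g}_{x,\tilde{y}}\, \hat{g}_{x,\hat{y}} = N^2 \big[ 1_{y=\tilde{y}} 1_{y=\hat{y}}\, \sigma(-\xi_{x,\tilde{y}}) \sigma(-\xi_{x,\hat{y}}) - 1_{y=\tilde{y}} 1_{y'=\hat{y}}\, \sigma(-\xi_{x,\tilde{y}}) \sigma(\xi_{x,\hat{y}}) - 1_{y'=\tilde{y}} 1_{y=\hat{y}}\, \sigma(\xi_{x,\tilde{y}}) \sigma(-\xi_{x,\hat{y}}) + 1_{y'=\tilde{y}} 1_{y'=\hat{y}}\, \sigma(\xi_{x,\tilde{y}}) \sigma(\xi_{x,\hat{y}}) \big]. \tag{A11}$$

Taking the expectation in Eq. A10 leads to the following substitutions:

$$1_{y=\tilde{y}} 1_{y=\hat{y}} \to \delta_{\tilde{y},\hat{y}}\, p_{\mathcal{D}}(\tilde{y}|x), \quad 1_{y'=\tilde{y}} 1_{y'=\hat{y}} \to \delta_{\tilde{y},\hat{y}}\, p_n(\tilde{y}|x), \quad 1_{y=\tilde{y}} 1_{y'=\hat{y}} \to p_{\mathcal{D}}(\tilde{y}|x)\, p_n(\hat{y}|x), \quad 1_{y'=\tilde{y}} 1_{y=\hat{y}} \to p_n(\tilde{y}|x)\, p_{\mathcal{D}}(\hat{y}|x). \tag{A12}$$

Thus, we find,

$$(\Sigma_x)_{\tilde{y},\hat{y}} = N \Big[ \delta_{\tilde{y},\hat{y}} \big( p_{\mathcal{D}}(\tilde{y}|x)\, \sigma(-\xi_{x,\tilde{y}})^2 + p_n(\tilde{y}|x)\, \sigma(\xi_{x,\tilde{y}})^2 \big) - p_{\mathcal{D}}(\tilde{y}|x)\, \sigma(-\xi_{x,\tilde{y}})\, p_n(\hat{y}|x)\, \sigma(\xi_{x,\hat{y}}) - p_n(\tilde{y}|x)\, \sigma(\xi_{x,\tilde{y}})\, p_{\mathcal{D}}(\hat{y}|x)\, \sigma(-\xi_{x,\hat{y}}) \Big]. \tag{A13}$$

Using Eq. A5, i.e., $p_{\mathcal{D}}(y|x)\, \sigma(-\xi_{x,y}) = p_n(y|x)\, \sigma(\xi_{x,y})$ at the minimum, we can again eliminate $p_{\mathcal{D}}$,

$$(\Sigma_x)_{\tilde{y},\hat{y}} = N \Big[ \delta_{\tilde{y},\hat{y}}\, p_n(\tilde{y}|x)\, \sigma(\xi_{x,\tilde{y}}) - 2\, p_n(\tilde{y}|x)\, \sigma(\xi_{x,\tilde{y}})\; p_n(\hat{y}|x)\, \sigma(\xi_{x,\hat{y}}) \Big]. \tag{A14}$$

Eq. A14 is the component-wise explicit form of Eq. 14 of the main text.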
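The covariance calculation can be cross-checked by brute force for one fixed feature vector. The sketch below is illustrative only: it omits the data set size factor, assumes that the single-sample gradient estimate subtracts $\sigma(-\xi_y)$ at the positive label and adds $\sigma(\xi_{y'})$ at the negative label, and enumerates all label pairs instead of sampling them. The closed form it compares against, $\mathrm{diag}(\alpha) - 2\alpha\alpha^\top$ with $\alpha_y = p_n(y)\,\sigma(\xi^*_y)$, follows from these assumptions. All names and toy values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_estimate(y_pos, y_neg, xi):
    """Single-sample gradient estimate for one fixed x (factor N omitted)."""
    g = [0.0] * len(xi)
    g[y_pos] -= sigmoid(-xi[y_pos])   # contribution of the positive sample
    g[y_neg] += sigmoid(xi[y_neg])    # contribution of the negative sample
    return g

def exact_moments(xi, p_data, p_noise):
    """Mean and second moment of the estimate, by enumerating (y, y')."""
    n = len(xi)
    mean = [0.0] * n
    second = [[0.0] * n for _ in range(n)]
    for y_pos in range(n):
        for y_neg in range(n):
            weight = p_data[y_pos] * p_noise[y_neg]
            g = gradient_estimate(y_pos, y_neg, xi)
            for i in range(n):
                mean[i] += weight * g[i]
                for j in range(n):
                    second[i][j] += weight * g[i] * g[j]
    return mean, second

# Hypothetical toy distributions over three labels.
p_data = [0.7, 0.2, 0.1]
p_noise = [0.5, 0.3, 0.2]
xi_star = [math.log(pd / pn) for pd, pn in zip(p_data, p_noise)]

mean, enumerated = exact_moments(xi_star, p_data, p_noise)
alpha = [pn * sigmoid(s) for s, pn in zip(xi_star, p_noise)]
n = len(alpha)
closed_form = [[(alpha[i] if i == j else 0.0) - 2.0 * alpha[i] * alpha[j]
                for j in range(n)] for i in range(n)]
max_abs_diff = max(abs(enumerated[i][j] - closed_form[i][j])
                   for i in range(n) for j in range(n))
```

Because the mean of the estimate vanishes at the optimum, the enumerated second moment equals the covariance, and `max_abs_diff` should be at the level of floating-point rounding.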

A.2 Experimental Comparison Between Softmax Classification and Negative Sampling

We provide additional experimental results that evaluate the performance gap due to negative sampling compared to full softmax classification on a smaller data set. Theorem 1 states an equivalence between negative sampling and softmax classification. However, this equivalence strictly holds only (i) in the nonparametric limit, (ii) without regularization, and (iii) if the optimizer really finds the global minimum of the loss function. In practice, all three assumptions hold only approximately.
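To make the idealized equivalence concrete, the following sketch works directly with probabilities instead of a trained model. It assumes the standard optimum of negative sampling with one negative per positive, $\xi^*_y = \log(p_D(y)/p_n(y))$, and that adding $\log p_n(y)$ to the scores before a softmax recovers the data distribution. Whether this matches the exact statement of Theorem 1 is an assumption, and all names and toy values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def optimal_negative_sampling_scores(p_data, p_noise):
    """Assumed optimum of the negative sampling loss in the nonparametric limit."""
    return [math.log(pd / pn) for pd, pn in zip(p_data, p_noise)]

def debiased_probabilities(scores, p_noise):
    """Add log p_n to the scores, then normalize with a softmax."""
    return softmax([s + math.log(pn) for s, pn in zip(scores, p_noise)])

# Hypothetical toy label distribution and two noise distributions.
p_data = [0.7, 0.2, 0.1]
uniform_noise = [1.0 / 3.0] * 3
skewed_noise = [0.5, 0.3, 0.2]

# With uniform noise, a plain softmax over the optimal scores recovers p_data.
recovered_uniform = softmax(
    optimal_negative_sampling_scores(p_data, uniform_noise))

# With non-uniform noise, the correction term log p_n is needed.
recovered_skewed = debiased_probabilities(
    optimal_negative_sampling_scores(p_data, skewed_noise), skewed_noise)
```

In this idealized setting both `recovered_uniform` and `recovered_skewed` reproduce `p_data`; with a finite model, regularization, and imperfect optimization, the match is only approximate, which is the gap the experiments in this section measure.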

Data Set and Preprocessing.

To evaluate the performance gap experimentally, we used the “EURLex-4K” data set (Bhatia et al.; Mencia & Fürnkranz, 2008), which is small enough to admit direct optimization of the softmax loss function. Similar to the preprocessing of the two main data sets described in Section 5 of the main text, we converted the multi-label classification problem into a single-label classification problem by selecting the label with the smallest ID for each data point, and discarding any data points without any labels. We split off of the training set for validation, and report results on the provided test set. This resulted in a training set with data points and categories. As in the main paper, we reduced the feature dimension to (using PCA for simplicity here).
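The label reduction described above can be sketched in a few lines. This is an illustration only: the representation of a data point as a pair of a feature list and a list of integer label IDs is an assumption, and the function name is hypothetical.

```python
def to_single_label(examples):
    """Keep the smallest label ID per data point and drop unlabeled points.

    `examples` is an iterable of (features, label_ids) pairs, where
    `label_ids` is a possibly empty collection of integer label IDs.
    """
    return [(features, min(label_ids))
            for features, label_ids in examples
            if label_ids]

# Hypothetical toy data: three points, the second one has no labels.
examples = [
    ([0.1, 0.2], [7, 3, 5]),
    ([0.0, 1.0], []),
    ([0.5, 0.5], [2]),
]
single_label_examples = to_single_label(examples)
```

Here the first point keeps label 3, the unlabeled point is discarded, and the last point keeps its only label 2.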

Model and Hyperparameters.

The goal of these experiments is to evaluate the performance gap due to negative sampling in general. We therefore fitted the same affine linear model as described in Section 5 of the main text using the full softmax loss function (Eq. 1) and the simplest form of negative sampling (Eq. 2), i.e., negative sampling with a uniform noise distribution. We added a quadratic regularizer with strength to both loss functions.
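The two objectives compared here can be sketched for a single data point as follows. This is a minimal illustration under stated assumptions, not the released implementation: the scores come from an affine model, the negative label is drawn uniformly from all labels (including, possibly, the positive one), and the regularizer is the squared norm of all weights and biases times a strength parameter. The function names and the choice to regularize the biases are hypothetical.

```python
import math
import random

def affine_scores(x, weights, biases):
    """Scores xi_y = w_y . x + b_y for every label y."""
    return [sum(w_k * x_k for w_k, x_k in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def softmax_loss(scores, y):
    """Full softmax loss; its cost is linear in the number of labels."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[y] - log_norm)

def negative_sampling_loss(scores, y, rng):
    """Negative sampling loss with one uniformly drawn negative label."""
    y_neg = rng.randrange(len(scores))
    return -(log_sigmoid(scores[y]) + log_sigmoid(-scores[y_neg]))

def quadratic_regularizer(weights, biases, strength):
    """Assumed form: strength times the sum of squared parameters."""
    total = sum(w_k * w_k for w in weights for w_k in w)
    total += sum(b * b for b in biases)
    return strength * total

# Hypothetical toy problem: 4 labels, 2 features, all parameters zero.
x = [1.0, -2.0]
weights = [[0.0, 0.0] for _ in range(4)]
biases = [0.0] * 4
scores = affine_scores(x, weights, biases)

full = softmax_loss(scores, 0)
sampled = negative_sampling_loss(scores, 0, random.Random(0))
```

With all scores equal to zero, the softmax loss is the logarithm of the number of labels and the negative sampling loss is twice the logarithm of two, independent of which negative label is drawn. In practice only the scores of the positive and the sampled negative label would be computed for negative sampling, which is where the savings come from.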

For both methods, we tested the same hyperparameter combinations as in Section 5 on the validation set using early stopping. For softmax, we extended the range of tested learning rates up to as higher learning rates turned out to perform better in this method (this can be understood due to the low gradient noise). The optimal hyperparameters for softmax turned out to be a learning rate of and regularization strength . For negative sampling, we found and .


We evaluated the predictive accuracy for both methods. With the full softmax method, we obtain correct predictions on the test set, whereas the predictive accuracy drops to with negative sampling. This suggests that, when possible, minimizing the full softmax loss function should be preferred. However, in many cases, the softmax loss function is too expensive.