1 Introduction
In many problems in science, healthcare, or e-commerce, one is interested in training classifiers over an enormous number of classes: a problem known as ‘extreme classification’ (Agrawal et al., 2013; Jain et al., 2016; Prabhu & Varma, 2014; Siblini et al., 2018). For softmax (aka multinomial) regression, each gradient step incurs a cost proportional to the number of classes $C$. As this may be prohibitively expensive for large $C$, recent research has explored more scalable softmax approximations which circumvent the linear scaling in $C$. Progress in accelerating the training procedure and thereby scaling up extreme classification promises to dramatically improve, e.g., advertising (Prabhu et al., 2018), recommender systems, ranking algorithms (Bhatia et al., 2015; Jain et al., 2016), and medical diagnostics (Bengio et al., 2019; Lippert et al., 2017; Baumel et al., 2018).
While scalable softmax approximations have been proposed, each one has its drawbacks. The most popular approach due to its simplicity is ‘negative sampling’ (Mnih & Hinton, 2009; Mikolov et al., 2013), which turns the problem into a binary classification between so-called ‘positive samples’ from the data set and ‘negative samples’ that are drawn at random from some (usually uniform) distribution over the class labels. While negative sampling makes the updates cheaper since computing the gradient no longer scales with $C$, it induces additional gradient noise that leads to a poor signal-to-noise ratio of the stochastic gradient estimate. Improving the signal-to-noise ratio in negative sampling while still enabling cheap gradients would dramatically enhance the speed of convergence.
In this paper, we present an algorithm that inherits the cheap gradient updates from negative sampling while still preserving much of the gradient signal of the original softmax regression problem. Our approach rests on the insight that the signal-to-noise ratio in negative sampling is poor since there is no association between input features and their artificial labels. If negative samples were harder to discriminate from positive ones, a learning algorithm would obtain a better gradient signal close to the optimum. Here, we make these arguments mathematically rigorous and propose a nonuniform sampling scheme for scalably approximating a softmax classification scheme. Instead of sampling labels uniformly, our algorithm uses an adversarial auxiliary model to draw ‘fake’ labels that are more realistic by taking the input features of the data into account. We prove that such a procedure reduces the gradient noise of the algorithm, and in fact minimizes the gradient variance in the limit where the auxiliary model optimally mimics the data distribution.
A useful adversarial model should require only little overhead to be fitted to the data, and it needs to be able to generate negative samples quickly in order to enable inexpensive gradient updates. We propose a probabilistic version of a decision tree that has these properties. As a side result of our approach, we show how such an auxiliary model can be constructed and efficiently trained. Since it is almost hyperparameter-free, it does not cause extra complications when tuning models.
The final problem that we tackle is to remove the bias that the auxiliary model causes relative to our original softmax classification. Negative sampling is typically described as a softmax approximation; however, only uniform negative sampling correctly approximates the softmax. In this paper, we show that the bias due to nonuniform negative sampling can be easily removed at test time.
The structure of our paper reflects our main contributions as follows:

We present a new scalable softmax approximation (Section 2). We show that nonuniform sampling from an auxiliary model can improve the signal-to-noise ratio. The best performance is achieved when this sampling mechanism is adversarial, i.e., when it generates fake labels that are hard to discriminate from the true ones. To allow for efficient training, such adversarial samples need to be generated at a rate sublinear (e.g., logarithmic) in the number of classes $C$.

We design a new, simple adversarial auxiliary model that satisfies the above requirements (Section 3). The model is based on a probabilistic version of a decision tree. It can be efficiently pretrained and included into our approach, and requires only minimal tuning.

We present mathematical proofs that (i) the best signal-to-noise ratio in the gradient is obtained if the auxiliary model best reflects the true dependencies between input features and labels, and that (ii) the involved bias to the softmax approximation can be exactly quantified and cheaply removed at test time (Section 4).

We present experiments on two classification data sets that show that our method outperforms all baselines by at least one order of magnitude in training speed (Section 5).
We discuss related work in Section 6 and summarize our approach in Section 7.
2 An Adversarial Softmax Approximation
We propose an efficient algorithm to train a classifier over a large set of classes, using an asymptotic equivalence between softmax classification and negative sampling (Subsection 2.1). To speed up convergence, we generalize this equivalence to model-based negative sampling in Subsection 2.2.
2.1 Asymptotic Equivalence of Softmax Classification and Negative Sampling
Softmax Classification (Notation).
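We consider a training data set $D = \{(x_i, y_i)\}_{i=1}^{N}$ of $N$ data points with $K$-dimensional feature vectors $x_i \in \mathbb{R}^K$. Each data point has a single label $y_i \in Y$ from a discrete label set $Y$. A softmax classifier is defined by a set of functions $\{\xi_y\}_{y \in Y}$ that map a feature vector $x$ and model parameters $\theta$ to a score $\xi_y(x, \theta) \in \mathbb{R}$ for each label $y$. Its loss function is
$$\ell_{\text{softmax}}(\theta) = \sum_{(x,y) \in D} \Big[ -\xi_y(x,\theta) + \log \sum_{y' \in Y} e^{\xi_{y'}(x,\theta)} \Big]. \tag{1}$$
While the first term encourages high scores $\xi_y(x,\theta)$ for the correct labels $y$, the second term encourages low scores for all labels $y' \in Y$, thus preventing degenerate solutions that set all scores to infinity. Unfortunately, the sum over $y'$ makes gradient-based minimization of $\ell_{\text{softmax}}(\theta)$ expensive if the label set $Y$ is large. Assuming that evaluating a single score $\xi_{y'}(x,\theta)$ takes $O(K)$ time, each gradient step costs $O(KC)$, where $C = |Y|$ is the size of the label set.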
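To make the cost argument concrete, the following minimal sketch evaluates the softmax loss for a linear score model. The linear form `W[y] @ x + b[y]`, the array names, and the toy sizes are assumptions chosen for illustration; the point is only that every data point requires all scores over the label set.

```python
import numpy as np

def softmax_loss(W, b, X, y):
    """Softmax loss for linear scores xi_y(x) = W[y] @ x + b[y].

    W: (C, K) weights, b: (C,) biases, X: (N, K) features, y: (N,) int labels.
    Scoring all C labels for each data point is the O(K C) bottleneck.
    """
    scores = X @ W.T + b                       # (N, C): every label is scored
    m = scores.max(axis=1, keepdims=True)      # stabilise the log-sum-exp
    log_norm = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    correct = scores[np.arange(len(y)), y]     # score of the true label
    return float(np.sum(-correct + log_norm))

# Toy data (hypothetical sizes).
rng = np.random.default_rng(0)
N, K, C = 5, 3, 4
X, y = rng.normal(size=(N, K)), rng.integers(0, C, size=N)
loss_zero = softmax_loss(np.zeros((C, K)), np.zeros(C), X, y)
```

With all parameters set to zero, every label has the same score, so the loss reduces to `N * log(C)`, which is a convenient sanity check.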
Negative Sampling.
Negative sampling turns classification over a large label set $Y$ into binary classification between so-called positive and negative samples. One draws positive samples $(x, y)$ from the training set and constructs negative samples $(x, y')$ by drawing random labels $y'$ from some noise distribution $p_n(y'|x)$. One then trains a logistic regression by minimizing the stochastic loss function
$$\ell_{\text{neg.sampl.}}(\phi) = -\sum_{(x,y) \in D} \Big[ \log \sigma(\xi_y(x,\phi)) + \log \sigma(-\xi_{y'}(x,\phi)) \Big], \qquad y' \sim p_n(y'|x), \tag{2}$$
with the sigmoid function $\sigma(z) = 1/(1 + e^{-z})$. Here, we used the same score functions $\xi_y$ as in Eq. 1 but introduced different model parameters $\phi$ so that we can distinguish the two models. Gradient steps for $\ell_{\text{neg.sampl.}}(\phi)$ cost only $O(K)$ time as there is no sum over all labels $y' \in Y$.
Asymptotic Equivalence.
The models in Eqs. 1 and 2 are exactly equivalent in the nonparametric limit, i.e., if the function class is flexible enough to map $x$ to any possible score. A further requirement is that $p_n$ in Eq. 2 is the uniform distribution over $Y$. If both conditions hold, it follows that if $\theta^*$ and $\phi^*$ minimize Eq. 1 and Eq. 2, respectively, they learn identical scores,
$$\xi_y(x, \theta^*) = \xi_y(x, \phi^*) + \text{const.} \qquad \text{(for uniform } p_n\text{)}. \tag{3}$$
As a consequence, one is free to choose the loss function that is easier to minimize. While gradient steps are cheaper by a factor of $O(|Y|)$ for negative sampling, the randomly drawn negative samples increase the variance of the stochastic gradient estimator and worsen the signal-to-noise ratio of the gradient, slowing down convergence. The next section combines the strengths of both approaches.
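For comparison with the softmax loss, a minimal sketch of the negative sampling loss is given below. It assumes linear scores and one uniformly drawn negative label per data point; the function and variable names are illustrative and not taken from the authors' code. Only two scores per data point are evaluated, so the cost does not depend on the number of labels.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigma(z)) = -log(1 + exp(-z)).
    return -np.logaddexp(0.0, -z)

def neg_sampling_loss(W, b, X, y, y_neg):
    """Negative sampling loss for linear scores xi_y(x) = W[y] @ x + b[y].

    y holds the true labels, y_neg one sampled negative label per data point.
    """
    pos = np.einsum("nk,nk->n", X, W[y]) + b[y]          # score of the positive label
    neg = np.einsum("nk,nk->n", X, W[y_neg]) + b[y_neg]  # score of the negative label
    return float(-np.sum(log_sigmoid(pos) + log_sigmoid(-neg)))

# Toy data with uniformly sampled negatives (hypothetical sizes).
rng = np.random.default_rng(0)
N, K, C = 6, 3, 5
X = rng.normal(size=(N, K))
y = rng.integers(0, C, size=N)
y_neg = rng.integers(0, C, size=N)
loss_zero = neg_sampling_loss(np.zeros((C, K)), np.zeros(C), X, y, y_neg)
```

With all parameters at zero, each sigmoid equals 1/2, so the loss is `2 * N * log(2)`.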
2.2 Adversarial Negative Sampling
Overview.
We propose a generalized variant of negative sampling that reduces the gradient noise. The main idea is to train with negative samples $y'$ that are hard to distinguish from positive samples. We draw $y'$ from a conditional noise distribution $p_n(y'|x)$ using an auxiliary model. This introduces a bias, which we remove at prediction time. In summary, our proposed approach consists of three steps:

1. Parameterize the noise distribution $p_n(y'|x)$ by an auxiliary model and fit it to the data set.

2. Train a classifier via negative sampling (Eq. 2) using adversarial negative samples from the auxiliary model fitted in Step 1 above. For our proposed auxiliary model, drawing a negative sample costs only $O(k \log |Y|)$ time with some $k < K$, i.e., it is sublinear in $|Y|$.

3. The resulting model has a bias. When making predictions, remove the bias by mapping it to an unbiased softmax classifier using the generalized asymptotic equivalence in Eq. 5 below.
We elaborate on the above Step 1 in Section 3. In the present section, we focus instead on Step 2 and its dependency on the choice of noise distribution $p_n$, and on the bias removal (Step 3).
Why Adversarial Noise Improves Learning.
We first provide some intuition why uniform negative sampling is not optimal, and how sampling from a nonuniform noise distribution may improve the gradient signal. We argue that the poor gradient signal is caused by the fact that negative samples are too easy to distinguish from positive samples. A data set with many categories is typically comprised of several hierarchical clusters, with large clusters of generic concepts and small sub-clusters of specialized concepts. When drawing negative samples uniformly across the data set, the correct label will likely belong to a different generic concept than the negative sample. For example, an image classifier will quickly learn to distinguish, e.g., dogs from bicycles, but since negative samples from the same cluster are rare, it takes much longer to learn the differences between a Boston Terrier and a French Bulldog. The model quickly learns to assign very low scores $\xi_{y'}(x,\phi) \ll 0$ to such ‘obviously wrong’ labels, making their contribution to the gradient exponentially small,
$$\nabla_\phi \log \sigma(-\xi_{y'}(x,\phi)) = -\sigma(\xi_{y'}(x,\phi)) \, \nabla_\phi \xi_{y'}(x,\phi) \approx -e^{\xi_{y'}(x,\phi)} \, \nabla_\phi \xi_{y'}(x,\phi) \qquad \text{for } \xi_{y'}(x,\phi) \ll 0. \tag{4}$$
A similar vanishing gradient problem was pointed out for word embeddings by Chen et al. (2018). Here, the vanishing gradient is due to different word frequencies, and a popular solution is therefore to draw negative samples from a nonuniform but unconditional noise distribution based on the empirical word frequencies (Mikolov et al., 2013). This introduces a bias which does not matter for word embeddings since the focus is not on classification but rather on learning useful representations.
Going beyond frequency-adjusted negative sampling, we show that one can drastically improve the procedure by generating negative samples from an auxiliary model. We therefore propose to generate negative samples $y' \sim p_n(y'|x)$ conditioned on the input feature $x$. This has the advantage that the distribution of negative samples can be made much more similar to the distribution of positive samples, leading to a better signal-to-noise ratio. One consequence is that the introduced bias can no longer be ignored, which is what we address next.
Bias Removal.
Negative sampling with a nonuniform noise distribution introduces a bias. For a given input feature vector $x$, labels $y'$ with a high noise probability $p_n(y'|x)$ are frequently drawn as negative samples, causing the model to learn a low score $\xi_{y'}(x,\phi^*)$. Conversely, a low $p_n(y'|x)$ leads to an inflated score $\xi_{y'}(x,\phi^*)$. It turns out that this bias can be easily quantified via a generalization of Eq. 3. We prove in Theorem 1 (Section 4) that in the nonparametric limit for arbitrary $p_n(y|x)$,
$$\xi_y(x, \theta^*) = \xi_y(x, \phi^*) + \log p_n(y|x) + \text{const.} \tag{5}$$
Eq. 5 is an asymptotic equivalence between softmax classification (Eq. 1) and generalized negative sampling (Eq. 2). While strict equality holds only in the nonparametric limit, many models are flexible enough that Eq. 5 holds approximately in practice. Eq. 5 allows us to make unbiased predictions by mapping biased negative sampling scores $\xi_y(x,\phi^*)$ to unbiased softmax scores $\xi_y(x,\theta^*)$. There is no need to solve for the corresponding model parameters $\theta^*$; the scores $\xi_y(x,\theta^*)$ suffice for predictions.
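The bias removal at prediction time can be sketched as follows: add the log noise probability of the auxiliary model to the learned negative sampling score and normalize with a softmax. The function name and the toy distributions are hypothetical; the check uses idealized scores that satisfy the nonparametric-limit relation exactly, which a trained model would only approximate.

```python
import numpy as np

def debiased_log_probs(xi_phi, log_pn):
    """Map negative sampling scores to softmax log-probabilities.

    xi_phi: (..., C) learned scores, log_pn: (..., C) log noise probabilities.
    The additive constant in the score relation cancels in the normalization.
    """
    s = xi_phi + log_pn                       # unbiased scores up to a constant
    s = s - s.max(axis=-1, keepdims=True)     # stabilise the softmax
    return s - np.log(np.exp(s).sum(axis=-1, keepdims=True))

# Idealized check: scores with exp(xi) = p_data / p_noise for a single input x.
p_data = np.array([0.7, 0.2, 0.1])     # hypothetical true label distribution
p_noise = np.array([0.5, 0.3, 0.2])    # hypothetical auxiliary model output
xi_phi = np.log(p_data / p_noise)
recovered = np.exp(debiased_log_probs(xi_phi, np.log(p_noise)))
```

In this idealized setting `recovered` matches `p_data`, whereas a softmax over the raw scores `xi_phi` alone would not.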
Regularization.
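In practice, softmax classification typically requires a regularizer with some strength $\lambda$ to prevent overfitting. With the asymptotic equivalence in Eq. 5, regularizing the softmax scores $\xi_y(x,\theta)$ is similar to regularizing $\xi_y(x,\phi) + \log p_n(y|x)$ in the proposed generalized negative sampling method. We thus propose to use the following regularized variant of Eq. 2,
$$\ell^{(\text{reg.})}_{\text{neg.sampl.}}(\phi) = \sum_{(x,y) \in D} \Big[ -\log \sigma(\xi_y(x,\phi)) + \lambda \big( \xi_y(x,\phi) + \log p_n(y|x) \big)^2 - \log \sigma(-\xi_{y'}(x,\phi)) + \lambda \big( \xi_{y'}(x,\phi) + \log p_n(y'|x) \big)^2 \Big], \qquad y' \sim p_n(y'|x). \tag{6}$$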
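A minimal sketch of such a regularized objective is shown below. It assumes a quadratic penalty of strength `lam` on the debiased scores (learned score plus log noise probability) of both the positive and the negative sample; this specific penalty form and all names are assumptions made for illustration.

```python
import numpy as np

def log_sigmoid(z):
    # Numerically stable log(sigma(z)).
    return -np.logaddexp(0.0, -z)

def regularized_neg_sampling_loss(xi_pos, xi_neg, log_pn_pos, log_pn_neg, lam):
    """Negative sampling loss plus a quadratic penalty on debiased scores.

    xi_pos / xi_neg: scores of positive / negative samples, shape (N,).
    log_pn_pos / log_pn_neg: log noise probabilities of the same labels.
    """
    data_term = -np.sum(log_sigmoid(xi_pos) + log_sigmoid(-xi_neg))
    penalty = np.sum((xi_pos + log_pn_pos) ** 2 + (xi_neg + log_pn_neg) ** 2)
    return float(data_term + lam * penalty)

# Toy values: the penalty vanishes when each score equals -log p_n.
xi_pos = np.array([1.0, 0.5])
xi_neg = np.array([0.2, 2.0])
no_penalty = regularized_neg_sampling_loss(xi_pos, xi_neg, -xi_pos, -xi_neg, lam=10.0)
plain = regularized_neg_sampling_loss(xi_pos, xi_neg, -xi_pos, -xi_neg, lam=0.0)
```

The design choice illustrated here is that the penalty pulls the debiased scores, not the raw scores, toward zero, so the regularized model falls back to the auxiliary model's predictions rather than to a uniform distribution.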
Comparison to GANs.
The use of adversarial negative samples, i.e., negative samples that are designed to ‘confuse’ the logistic regression in Eq. 2, bears some resemblance to generative adversarial networks (GANs) (Goodfellow et al., 2014). The crucial difference is that GANs are generative models, whereas we train a discriminative model over a discrete label space $Y$. The ‘generator’ $p_n$ in our setup only needs to find a rough approximation of the (conditional) label distribution because the final predictive scores in Eq. 5 combine the ‘generator scores’ $\log p_n(y|x)$ with the more expressive ‘discriminator scores’ $\xi_y(x,\phi^*)$. This allows us to use a very restrictive but efficient generator model (see Section 3 below) that we can keep constant while training the discriminator. By contrast, the focus in GANs is on finding the best possible generator, which requires concurrent training of a generator and a discriminator via a potentially unstable nested min-max optimization.
3 Conditional Generation of Adversarial Samples
Having proposed a general approach for improved negative sampling with an adversarial auxiliary model (Section 2), we now describe a simple construction for such a model that satisfies all requirements. The model is essentially a probabilistic version of a decision tree which is able to conditionally generate negative samples by ancestral sampling. Readers who prefer to proceed can skip this section without losing the main thread of the paper.
Our auxiliary model has the following properties: (i) it can be efficiently fitted to the training data requiring minimal hyperparameter tuning and subleading computational overhead over the training of the main model; (ii) drawing negative samples scales only logarithmically in the number of labels $|Y|$, thus improving over the linear scaling of the softmax loss function (Eq. 1); and (iii) the log likelihood $\log p_n(y|x)$ can be evaluated explicitly so that we can apply the bias removal in Eq. 5. Satisfying requirements (i) and (ii) on model efficiency comes at the cost of some model performance. This is an acceptable trade-off since the performance of $p_n$ affects only the quality of negative samples.
Model.
Our auxiliary model for $p_n(y'|x)$ is inspired by the Hierarchical Softmax model due to Morin & Bengio (2005). It is a balanced probabilistic binary decision tree, where each leaf node is mapped uniquely to a label $y' \in Y$. A decision tree imposes a hierarchical structure on $Y$, which can impede performance if it does not reflect any semantic structure in $Y$. Morin & Bengio (2005) rely on an explicitly provided semantic hierarchical structure, or ‘ontology’. Since an ontology is often not available, we instead construct a hierarchical structure in a data-driven way. Our method has some similarity to the approach by Mnih & Hinton (2009), but it is more principled in that we fit both the model parameters and the hierarchical structure by maximizing a single log likelihood function.
To sample from the model, one walks from the tree’s root to some leaf. At each node $\nu$, one makes a binary decision $\zeta \in \{-1, 1\}$ whether to continue to the right child ($\zeta = 1$) or to the left child ($\zeta = -1$). Given a feature vector $x$, we model the likelihood of these decisions as $\sigma(\zeta (w_\nu^\top x + b_\nu))$, where the weight vector $w_\nu$ and the scalar bias $b_\nu$ are model parameters associated with node $\nu$. Denoting the unique path $\pi_y$ from the root node $\nu_0$ to the leaf node associated with label $y$ as a sequence of nodes and binary decisions, $\pi_y \equiv ((\nu_0, \zeta_0), (\nu_1, \zeta_1), \ldots)$, the log likelihood of the training set is thus
$$L = \sum_{(x,y) \in D} \log p_n(y|x) = \sum_{(x,y) \in D} \; \sum_{(\nu,\zeta) \in \pi_y} \log \sigma\big( \zeta (w_\nu^\top x + b_\nu) \big). \tag{7}$$
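The sampling walk and the evaluation of the log likelihood of a single label can be sketched as follows. The heap-style node indexing (children of node `v` at `2v+1` and `2v+2`), the random parameters, and the function names are assumptions made for this illustration; the sketch also assumes that the number of labels is a power of two.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_label(x, W, b, leaf_to_label, rng):
    """Ancestral sampling: walk root -> leaf, return (label, log p_n(label|x))."""
    n_internal = W.shape[0]               # C - 1 internal nodes in heap order
    node, log_p = 0, 0.0
    while node < n_internal:
        p_right = sigmoid(W[node] @ x + b[node])
        go_right = rng.random() < p_right
        log_p += np.log(p_right if go_right else 1.0 - p_right)
        node = 2 * node + (2 if go_right else 1)
    return leaf_to_label[node - n_internal], log_p

def log_pn(x, y, W, b, label_to_leaf):
    """Evaluate log p_n(y|x) by walking from the leaf of y up to the root."""
    n_internal = W.shape[0]
    node, log_p = label_to_leaf[y] + n_internal, 0.0
    while node > 0:
        parent = (node - 1) // 2
        zeta = 1.0 if node == 2 * parent + 2 else -1.0   # +1: right, -1: left
        log_p += -np.logaddexp(0.0, -zeta * (W[parent] @ x + b[parent]))
        node = parent
    return log_p

rng = np.random.default_rng(0)
C, k = 8, 3                                   # hypothetical sizes
W, b = rng.normal(size=(C - 1, k)), rng.normal(size=C - 1)
leaf_to_label = rng.permutation(C)            # hypothetical leaf assignment
label_to_leaf = np.argsort(leaf_to_label)     # inverse permutation
x = rng.normal(size=k)
y_sampled, log_p_sampled = sample_label(x, W, b, leaf_to_label, rng)
```

Both functions visit one node per tree level, so their cost grows with the depth of the tree rather than with the number of labels.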
Greedy Model Fitting.
We maximize the likelihood $L$ in Eq. 7 over (i) the model parameters $w_\nu$ and $b_\nu$ of all nodes $\nu$, and (ii) the hierarchical structure, i.e., the mapping between labels $y$ and leaf nodes. The latter involves an exponentially large search space, making exact maximization intractable. We use a greedy approximation where we recursively split the label set $Y$ into halves and associate each node $\nu$ with a subset $Y_\nu \subseteq Y$. We start at the root node $\nu_0$ with $Y_{\nu_0} = Y$ and finish at the leaves with a single label per leaf. For each node $\nu$, we maximize the terms in $L$ that depend on $w_\nu$ and $b_\nu$. These terms correspond to data points with a label $y \in Y_\nu$, leading to the objective
$$L_\nu = \sum_{(x,y) \in D:\; y \in Y_\nu} \log \sigma\big( \zeta_y (w_\nu^\top x + b_\nu) \big). \tag{8}$$
We alternate between a continuous maximization of $L_\nu$ over $w_\nu$ and $b_\nu$, and a discrete maximization over the binary indicators $\zeta_y \in \{-1, 1\}$ that define how we split $Y_\nu$ into two equally sized halves. The continuous optimization is a convex problem and it converges quickly to machine precision with Newton ascent, which is free of hyperparameters like learning rates. For the discrete optimization, we note that changing $\zeta_y$ for any $y \in Y_\nu$ from $-1$ to $1$ (or from $1$ to $-1$) increases (or decreases) $L_\nu$ by
$$\Delta_y = \sum_{x \in D_y} \Big[ \log \sigma(w_\nu^\top x + b_\nu) - \log \sigma(-w_\nu^\top x - b_\nu) \Big] = \sum_{x \in D_y} \big( w_\nu^\top x + b_\nu \big). \tag{9}$$
Here, the sums over $x \in D_y$ run over all data points in $D$ with label $y$, and the second equality is an algebraic identity of the sigmoid function. We maximize $L_\nu$ over all $\zeta_y$ under the boundary condition that the split be into equally sized halves by setting $\zeta_y = 1$ for the half of $y \in Y_\nu$ with largest $\Delta_y$ and $\zeta_y = -1$ for the other half. If this changes any $\zeta_y$ then we switch back to the continuous optimization. Otherwise, we have reached a local optimum for node $\nu$, and we proceed to the next node.
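The discrete step of this alternation can be sketched in a few lines: for fixed node parameters, sum the linear activations over the data points of each label and send the half of the labels with the largest sums to the right child. The function name and the toy data are hypothetical, and the continuous Newton step is omitted.

```python
import numpy as np

def split_labels(X, y, labels, w, b):
    """One discrete step at a node: return (zeta, delta) for each label in `labels`.

    delta[j] sums (w @ x + b) over all data points whose label is labels[j].
    The half of the labels with the largest delta gets zeta = +1 (right child),
    the other half gets zeta = -1 (left child).
    """
    activations = X @ w + b
    delta = np.array([activations[y == lab].sum() for lab in labels])
    order = np.argsort(-delta)                   # decreasing delta
    zeta = -np.ones(len(labels), dtype=int)
    zeta[order[: len(labels) // 2]] = 1
    return zeta, delta

# Toy node with four labels and one-dimensional features.
X = np.array([[2.0], [1.0], [-1.0], [-3.0], [0.5]])
y = np.array([0, 1, 2, 3, 1])
zeta, delta = split_labels(X, y, labels=[0, 1, 2, 3], w=np.array([1.0]), b=0.0)
```

In the toy example the sums are 2.0, 1.5, -1.0 and -3.0, so labels 0 and 1 go to the right child and labels 2 and 3 to the left child.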
Technical Details.
In the interest of clarity, the above description left out the following details. Most importantly, to prioritize efficiency over accuracy, we preprocess the feature vectors and project them to a smaller space $\mathbb{R}^k$ with $k < K$ using principal component analysis (PCA). Sampling from $p_n$ thus costs only $O(k \log |Y|)$ time. This dimensionality reduction only affects the quality of negative samples. The main model (Eq. 2) still operates on the full feature space $\mathbb{R}^K$. Second, we add a quadratic regularizer to $L_\nu$, with strength $\lambda_n$ set by cross validation. Third, we introduce uninhabited padding labels if $|Y|$ is not a power of two. We ensure that $p_n(\tilde{y}|x) = 0$ for all padding labels $\tilde{y}$ by setting $b_\nu$ to a very large positive or negative value if either of $\nu$’s children contains only padding labels. Finally, we initialize the optimization with $b_\nu = 0$ and by setting $w_\nu$ to the dominant eigenvector of the covariance matrix of the set of vectors $\{\sum_{x \in D_y} x\}_{y \in Y_\nu}$.
4 Theoretical Aspects
We formalize and prove the two main premises of the algorithm proposed in Section 2.2. Theorem 1 below states the equivalence between softmax classification and negative sampling (Eq. 5), and Theorem 2 formalizes the claim that adversarial negative samples maximize the signal-to-noise ratio.
Theorem 1.
Proof.
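Minimizing $\ell_{\text{softmax}}(\theta)$ fits the maximum likelihood estimate of a model with likelihood $p_\theta(y|x) = e^{\xi_y(x,\theta)} / Z_\theta(x)$ with normalization $Z_\theta(x) = \sum_{y' \in Y} e^{\xi_{y'}(x,\theta)}$. In the nonparametric limit, the score functions are arbitrarily flexible, allowing for a perfect fit to the conditional label distribution $p_D(y|x)$ of the data, thus
$$p_{\theta^*}(y|x) = e^{\xi_y(x,\theta^*)} / Z_{\theta^*}(x) = p_D(y|x). \tag{10}$$
Similarly, $\ell_{\text{neg.sampl.}}(\phi)$ is the maximum likelihood objective of a binomial model that discriminates positive from negative samples. The nonparametric limit admits again a perfect fit so that the learned ratio of positive rate to negative rate equals the empirical ratio,
$$e^{\xi_y(x,\phi^*)} = \frac{\sigma(\xi_y(x,\phi^*))}{\sigma(-\xi_y(x,\phi^*))} = \frac{p_D(y|x)}{p_n(y|x)}, \tag{11}$$
where we used the identity $\sigma(z)/\sigma(-z) = e^{z}$. Inserting Eq. 10 for $p_D(y|x)$ and taking the logarithm leads to Eq. 5. Here, the “const.” term works out to $\log Z_{\theta^*}(x)$, which is indeed independent of $y$. ∎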
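The final step of the proof can be spelled out as a short worked derivation. It is a sketch under the notation assumed here: $p_D(y|x)$ is the conditional label distribution of the data, $p_n(y|x)$ the noise distribution, and $Z_{\theta^*}(x)$ the softmax normalization at the optimum.

```latex
\begin{align*}
e^{\xi_y(x,\phi^*)}
  &= \frac{p_D(y|x)}{p_n(y|x)}
   = \frac{e^{\xi_y(x,\theta^*)}}{Z_{\theta^*}(x)\, p_n(y|x)} \\
\Rightarrow\quad
\xi_y(x,\phi^*)
  &= \xi_y(x,\theta^*) - \log Z_{\theta^*}(x) - \log p_n(y|x) \\
\Rightarrow\quad
\xi_y(x,\theta^*)
  &= \xi_y(x,\phi^*) + \log p_n(y|x) + \log Z_{\theta^*}(x)
\end{align*}
```

The last term depends on $x$ but not on $y$, so it plays the role of the additive constant and cancels when the scores are normalized over labels.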
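Signal-to-Noise Ratio.
In preparation for Theorem 2 below, we define a quantitative measure for the signal-to-noise ratio (SNR) in stochastic gradient descent (SGD). In the vicinity of the minimum $\theta^*$ of a loss function $\ell(\theta)$, the gradient $g \approx H (\theta - \theta^*)$ is approximately proportional to the Hessian $H$ of $\ell$ at $\theta^*$. SGD estimates $g$ via stochastic gradient estimates $\hat{g}$, whose noise is measured by the covariance matrix $\mathrm{Cov}[\hat{g}, \hat{g}]$. Thus, the eigenvalues $\{\eta_i\}$ of the matrix $A := H \, \mathrm{Cov}[\hat{g}, \hat{g}]^{-1}$ measure the SNR along different directions in parameter space. We define an overall scalar SNR $\bar{\eta}$ as
$$\bar{\eta} := \frac{1}{\sum_i 1/\eta_i} = \frac{1}{\mathrm{Tr}[A^{-1}]} = \frac{1}{\mathrm{Tr}\big[\mathrm{Cov}[\hat{g}, \hat{g}] \, H^{-1}\big]}. \tag{12}$$
Here, we sum over the inverses $1/\eta_i$ rather than $\eta_i$ so that $\bar{\eta} \leq \eta_i$ for all $i$ and thus maximizing $\bar{\eta}$ encourages large values for all $\eta_i$. The definition in Eq. 12 has the useful property that $\bar{\eta}$ is invariant under arbitrary invertible reparameterization of $\theta$. Expressing $\ell$ in terms of new model parameters $\theta'$ maps $H$ to $J^\top H J$ and $\mathrm{Cov}[\hat{g}, \hat{g}]$ to $J^\top \mathrm{Cov}[\hat{g}, \hat{g}] J$, where $J := \partial\theta / \partial\theta'$ is the Jacobian. Inserting into Eq. 12 and using the cyclic property of the trace, $\mathrm{Tr}[AB] = \mathrm{Tr}[BA]$, all Jacobians cancel.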
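The invariance under reparameterization can be checked numerically. The sketch below uses random symmetric positive definite matrices as stand-ins for the Hessian and the gradient covariance, and a random invertible matrix as the Jacobian; all of these are synthetic and chosen only to illustrate the trace identity.

```python
import numpy as np

def snr(H, cov):
    """Scalar SNR: 1 / Tr[Cov H^{-1}]."""
    return 1.0 / np.trace(cov @ np.linalg.inv(H))

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
H = A @ A.T + d * np.eye(d)                  # synthetic positive definite Hessian
B = rng.normal(size=(d, d))
cov = B @ B.T + np.eye(d)                    # synthetic positive definite covariance
J = rng.normal(size=(d, d)) + 3 * np.eye(d)  # synthetic Jacobian (assumed invertible)

eta_bar = snr(H, cov)
eta_bar_reparam = snr(J.T @ H @ J, J.T @ cov @ J)
eta = np.linalg.eigvals(H @ np.linalg.inv(cov)).real   # per-direction SNRs
```

Here `eta_bar` and `eta_bar_reparam` agree up to floating point error, and `eta_bar` equals the inverse of the sum of inverse eigenvalues, so it is bounded by the smallest per-direction SNR.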
Theorem 2.
Proof.
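In the nonparametric limit, the scores $\xi_y(x,\phi)$ can be regarded as independent variables for all $x$ and $y$. We therefore treat the scores directly as model parameters, using the invariance of $\bar{\eta}$ under reparameterization. Using only Eq. 2, Eq. 11, and properties of the $\sigma$ function, we show in Appendix A.1 that the Hessian of the loss function is diagonal in this coordinate system, and given by
$$H = \mathrm{diag}(\alpha_{x,y}) \qquad \text{with} \qquad \alpha_{x,y} = p_D(y|x) \, \sigma(-\xi_y(x,\phi^*)) = p_n(y|x) \, \sigma(\xi_y(x,\phi^*)), \tag{13}$$
and that the noise covariance matrix is block diagonal,
$$\mathrm{Cov}[\hat{g}, \hat{g}] = \text{block-diag}_{x} \big( \mathrm{diag}(\alpha_x) - 2 \, \alpha_x \alpha_x^\top \big), \tag{14}$$
where $\alpha_x \equiv (\alpha_{x,y})_{y \in Y}$ denotes a $|Y|$-dimensional column vector. Thus, the trace in Eq. 12 is
$$\mathrm{Tr}\big[ \mathrm{Cov}[\hat{g}, \hat{g}] \, H^{-1} \big] = \sum_{x \in D} \Big[ |Y| - 2 \sum_{y \in Y} \alpha_{x,y} \Big]. \tag{15}$$
We thus have to maximize $\sum_{y \in Y} \alpha_{x,y}$ for each $x$ in the training set. We find from Eq. 13 and Eq. 11,
$$\sum_{y \in Y} \alpha_{x,y} = \mathbb{E}_{p_D(y|x)} \big[ \sigma(-\xi_y(x,\phi^*)) \big] = \mathbb{E}_{p_D(y|x)} \Big[ f\Big( \frac{p_n(y|x)}{p_D(y|x)} \Big) \Big] \tag{16}$$
with $f(z) := z/(1+z)$. Using Jensen’s inequality for the concave function $f$, we find that the right-hand side of Eq. 16 has the upper bound $f\big(\mathbb{E}_{p_D(y|x)}[p_n(y|x)/p_D(y|x)]\big) = f(1) = 1/2$, which it reaches precisely if the argument of $f$ in Eq. 16 is a constant, i.e., iff $p_n(y|x) = p_D(y|x)$. ∎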
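The bound from Jensen's inequality can be illustrated numerically for a single input. The sketch assumes the optimal scores satisfy exp(score) = p_data / p_noise, so that each label contributes p_data * p_noise / (p_data + p_noise) to the quantity being maximized; the distributions below are randomly generated and purely illustrative.

```python
import numpy as np

def sum_alpha(p_data, p_noise):
    """Sum over labels of p_data * sigma(-xi*) with exp(xi*) = p_data / p_noise."""
    return float(np.sum(p_data * p_noise / (p_data + p_noise)))

rng = np.random.default_rng(0)
n_labels = 10
p_data = rng.dirichlet(np.ones(n_labels))        # hypothetical label distribution
uniform = np.full(n_labels, 1.0 / n_labels)      # uniform negative sampling

value_uniform = sum_alpha(p_data, uniform)
value_adversarial = sum_alpha(p_data, p_data)    # noise matches the data
values_random = [sum_alpha(p_data, rng.dirichlet(np.ones(n_labels)))
                 for _ in range(100)]
```

The adversarial choice attains the upper bound of 1/2, while uniform noise and the randomly drawn noise distributions stay below it.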
5 Results
We evaluated the proposed adversarial negative sampling method on two established benchmarks by comparing speed of convergence and predictive performance against five different baselines.
Table 1: Sizes of the data sets and tuned hyperparameters for each method.
Data set  Size of data set  adv. neg. s. (proposed)  uniform neg. s.  freq. neg. s.  NCE  A&R  OVE
Wikipedia-500K
Amazon-670K
Datasets, Preprocessing and Model.
We used the Wikipedia-500K and Amazon-670K data sets from the Extreme Classification Repository (Bhatia et al., ) with $K$-dimensional XML-CNN features (Liu et al., 2017) downloaded from (Saxena, ). As both data sets contain multiple labels per data point, we follow the approach in (Ruiz et al., 2018) and keep only the first label for each data point. Table 1 shows the resulting sizes. We fit a linear model with scores $\xi_y(x,\phi) = x^\top w_y + b_y$, where the model parameters $\phi$ are the weight vectors $w_y \in \mathbb{R}^K$ and biases $b_y \in \mathbb{R}$ for each label $y$.
Baselines.
We compare our proposed method to five baselines: (i) standard negative sampling with a uniform noise distribution; (ii) negative sampling with an unconditional noise distribution $p_n(y')$ set to the empirical label frequencies; (iii) noise contrastive estimation (NCE, see below); (iv) ‘Augment and Reduce’ (Ruiz et al., 2018); and (v) ‘One vs. Each’ (Titsias, 2016). We do not compare to full softmax classification, which would be infeasible on the large data sets (see Table 1; a single epoch of optimizing the full softmax loss would scale as $O(N K |Y|)$). However, we provide additional results that compare softmax against negative sampling on a smaller data set in Appendix A.2.
NCE (Gutmann & Hyvärinen, 2010) is sometimes used as a synonym for negative sampling in the literature, but the original proposal of NCE is more general and allows for a nonuniform base distribution. We use our trained auxiliary model (Section 3) for the base distribution of NCE. Compared to our proposed method, NCE uses the base distribution only during training and not for predictions. Therefore, NCE has to relearn everything that is already captured by the base distribution. This is less of an issue in the original setup for which NCE was proposed, namely unsupervised density estimation over a continuous space. By contrast, training a supervised classifier effectively means training a separate model for each label $y \in Y$, which is expensive if $Y$ is large. Thus, having to relearn what the base distribution already captures is potentially wasteful.
Hyperparameters.
We tuned the hyperparameters for each method individually using the validation set. Table 1 shows the resulting hyperparameters. For the proposed method and baselines (i)-(iii) we used an Adagrad optimizer (Duchi et al., 2011) and tuned the learning rate and the regularizer strength $\lambda$ (see Eq. 6). For ‘Augment and Reduce’ and ‘One vs. Each’ we used the implementation published by the authors (Ruiz, ), and tuned the learning rate and the prior variance. For the auxiliary model, we used the same reduced feature dimension $k$ and regularizer strength for both data sets.
Results.
Figure 1 shows our results on the Wikipedia-500K data set (left two plots) and the Amazon-670K data set (right two plots). For each data set, we plot the predictive log likelihood per test data point (first and third plot) and the predictive accuracy (second and fourth plot). The green curve in each plot shows our proposed adversarial negative sampling method. Both our method and NCE (orange) start slightly shifted to the right to account for the time to fit the auxiliary model.
Our main observation is that the proposed method converges orders of magnitude faster and reaches better accuracies (second and fourth plot in Figure 1) than all baselines. On the (smaller) Amazon-670K data set, standard uniform and frequency-based negative sampling reach a slightly higher predictive log likelihood, but our method performs considerably better in terms of predictive accuracy on both data sets. This may be understood as the predictive accuracy is very sensitive to the precise scores of the highest ranked labels, as a small change in these scores can affect which label is ranked highest. With adversarial negative sampling, the training procedure focuses on getting the scores of the highest ranked labels right, thus improving in particular the predictive accuracy.
6 Related Work
Efficient Evaluation of the Softmax Loss Function.
Methods to speed up evaluation of Eq. 1 include augmenting the model by adding auxiliary latent variables that can be marginalized over analytically (Galy-Fajou et al., 2019; Wenzel et al., 2019; Ruiz et al., 2018; Titsias, 2016). More closely related to our work are methods based on negative sampling (Mnih & Hinton, 2009; Mikolov et al., 2013) and noise contrastive estimation (Gutmann & Hyvärinen, 2010). Generalizations of negative sampling to nonuniform noise distributions have been proposed, e.g., in (Zhang & Zweigenbaum, 2018; Chen et al., 2018; Wang et al., 2014; Gutmann & Hyvärinen, 2010). Our method differs from these proposals by drawing the negative samples from a conditional distribution that takes the input feature into account, and by requiring the model to learn only correlations that are not already captured by the noise distribution. We further derive the optimal distribution for negative samples, and we propose an efficient way to approximate it via an auxiliary model. Adversarial training (Miyato et al., 2017) is a popular method for training deep generative models (Tu, 2007; Goodfellow et al., 2014). By contrast, our method trains a discriminative model over a discrete set of labels (see also our comparison to GANs at the end of Section 2.2).
A different sampling-based approximation of softmax classification is ‘sampled softmax’ (Bengio et al., 2003). It directly approximates the sum over classes in the loss (Eq. 1) by sampling, which is biased even for a uniform sampling distribution. A nonuniform sampling distribution can remove or reduce the bias (Bengio & Senécal, 2008; Blanc & Rendle, 2018; Rawat et al., 2019). By contrast, our method uses negative sampling, and it uses a nonuniform distribution to reduce the gradient variance.
Decision Trees.
Decision trees (Somvanshi & Chavan, 2016) are popular in the extreme classification literature (Agrawal et al., 2013; Jain et al., 2016; Prabhu & Varma, 2014; Siblini et al., 2018; Weston et al., 2013; Bhatia et al., 2015; Jasinska et al., 2016). Our proposed method employs a probabilistic decision tree that is similar to Hierarchical Softmax (Morin & Bengio, 2005; Mikolov et al., 2013). While decision trees allow for efficient training and sampling in $O(\log |Y|)$ time, their hierarchical architecture imposes a structural bias. Our proposed method trains a more expressive model without such a structural bias on top of the decision tree to correct for any structural bias.
7 Conclusions
We proposed a simple method to train a classifier over a large set of labels. Our method is based on a scalable approximation to the softmax loss function via a generalized form of negative sampling. By generating adversarial negative samples from an auxiliary model, we proved that we maximize the signal-to-noise ratio of the stochastic gradient estimate. We further showed that, while the auxiliary model introduces a bias, we can remove the bias at test time. We believe that due to its simplicity, our method can be widely used, and we publish the code of both the main and the auxiliary model (https://github.com/mandtlab/adversarialnegativesampling).
Acknowledgements
Stephan Mandt acknowledges funding from DARPA (HR001119S0038), NSF (FWHTFRM), and Qualcomm.
References
 Agrawal et al. (2013) Rahul Agrawal, Archit Gupta, Yashoteja Prabhu, and Manik Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pp. 13–24. ACM, 2013.

 Baumel et al. (2018) Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noemie Elhadad. Multi-label classification of patient notes: case study on icd code assignment. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
 Bengio et al. (2019) Samy Bengio, Krzysztof Dembczynski, Thorsten Joachims, Marius Kloft, and Manik Varma. Extreme classification (dagstuhl seminar 18291). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019.

 Bengio & Senécal (2008) Yoshua Bengio and Jean-Sébastien Senécal. Adaptive importance sampling to accelerate training of a neural probabilistic language model. IEEE Transactions on Neural Networks, 19(4):713–722, 2008.
 Bengio et al. (2003) Yoshua Bengio, Jean-Sébastien Senécal, et al. Quick training of probabilistic neural nets by importance sampling. In AISTATS, pp. 1–9, 2003.
 (6) Kush Bhatia, Kunal Dahiya, Himanshu Jain, Yashoteja Prabhu, and Manik Varma. The extreme classification repository: Multi-label datasets & code. http://manikvarma.org/downloads/XC/XMLRepository.html. Accessed: 2019-05-23.
 Bhatia et al. (2015) Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In Advances in neural information processing systems, pp. 730–738, 2015.

 Blanc & Rendle (2018) Guy Blanc and Steffen Rendle. Adaptive sampled softmax with kernel based sampling. In International Conference on Machine Learning, pp. 589–598, 2018.
 Chen et al. (2018) Long Chen, Fajie Yuan, Joemon M Jose, and Weinan Zhang. Improving negative sampling for word representation using self-embedded features. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 99–107. ACM, 2018.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
 Galy-Fajou et al. (2019) Théo Galy-Fajou, Florian Wenzel, Christian Donner, and Manfred Opper. Multi-class gaussian process classification made conjugate: Efficient inference via data augmentation. In Uncertainty in Artificial Intelligence, 2019.
 Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Gutmann & Hyvärinen (2010) Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
 Jain et al. (2016) Himanshu Jain, Yashoteja Prabhu, and Manik Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 935–944. ACM, 2016.
 Jasinska et al. (2016) Kalina Jasinska, Krzysztof Dembczynski, Róbert Busa-Fekete, Karlson Pfannschmidt, Timo Klerx, and Eyke Hullermeier. Extreme f-measure maximization using sparse probability estimates. In International Conference on Machine Learning, pp. 1435–1444, 2016.
 Lippert et al. (2017) Christoph Lippert, Riccardo Sabatini, M Cyrus Maher, Eun Yong Kang, Seunghak Lee, Okan Arikan, Alena Harley, Axel Bernal, Peter Garst, Victor Lavrenko, et al. Identification of individuals by trait prediction using whole-genome sequencing data. Proceedings of the National Academy of Sciences, 114(38):10166–10171, 2017.
 Liu et al. (2017) Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 115–124. ACM, 2017.
 Mencia & Fürnkranz (2008) Eneldo Loza Mencia and Johannes Fürnkranz. Efficient pairwise multi-label classification for large-scale problems in the legal domain. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 50–65. Springer, 2008.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
 Miyato et al. (2017) Takeru Miyato, Andrew M Dai, and Ian Goodfellow. Adversarial training methods for semi-supervised text classification. 2017.
 Mnih & Hinton (2009) Andriy Mnih and Geoffrey E Hinton. A scalable hierarchical distributed language model. In Advances in neural information processing systems, pp. 1081–1088, 2009.
 Morin & Bengio (2005) Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Aistats, volume 5, pp. 246–252. Citeseer, 2005.
 Prabhu & Varma (2014) Yashoteja Prabhu and Manik Varma. Fastxml: A fast, accurate and stable tree-classifier for extreme multi-label learning. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 263–272. ACM, 2014.
 Prabhu et al. (2018) Yashoteja Prabhu, Anil Kag, Shrutendra Harsola, Rahul Agrawal, and Manik Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference, pp. 993–1002. International World Wide Web Conferences Steering Committee, 2018.
 Rawat et al. (2019) Ankit Singh Rawat, Jiecao Chen, Felix Yu, Ananda Theertha Suresh, and Sanjiv Kumar. Sampled softmax with random fourier features. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
 (26) Francisco JR Ruiz. Augment and reduce github repository. https://github.com/franrruiz/augmentreduce. Accessed: 2019-05-23.
 Ruiz et al. (2018) Francisco JR Ruiz, Michalis K Titsias, Adji B Dieng, and David M Blei. Augment and reduce: Stochastic inference for large categorical distributions. In International Conference on Machine Learning, pp. 4400–4409, 2018.
 (28) Siddhartha Saxena. Xmlcnn github repository. https://github.com/siddsax/XMLCNN. Accessed: 2019-05-23.
 Siblini et al. (2018) Wissam Siblini, Pascale Kuntz, and Frank Meyer. Craftml, an efficient clustering-based random forest for extreme multi-label learning. In The 35th International Conference on Machine Learning (ICML 2018), 2018.
 Somvanshi & Chavan (2016) Madan Somvanshi and Pranjali Chavan. A review of machine learning techniques using decision tree and support vector machine. In 2016 International Conference on Computing Communication Control and automation (ICCUBEA), pp. 1–7. IEEE, 2016.
 Titsias (2016) Michalis K Titsias. One-vs-each approximation to softmax for scalable estimation of probabilities. In Advances in Neural Information Processing Systems, pp. 4161–4169, 2016.
 Tu (2007) Zhuowen Tu. Learning generative models via discriminative approaches. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, 2007.
 Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI conference on artificial intelligence, 2014.
 Wenzel et al. (2019) Florian Wenzel, Théo Galy-Fajou, Christian Donner, Marius Kloft, and Manfred Opper. Efficient gaussian process classification using pòlya-gamma data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 5417–5424, 2019.
 Weston et al. (2013) Jason Weston, Ameesh Makadia, and Hector Yee. Label partitioning for sublinear ranking. In International Conference on Machine Learning, pp. 181–189, 2013.
 Zhang & Zweigenbaum (2018) Zheng Zhang and Pierre Zweigenbaum. Gneg: Graph-based negative sampling for word2vec. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 566–571, 2018.
Appendix
A.1 Details of the Proof of Theorem 2
In the nonparametric limit, the score functions are so flexible that they can take arbitrary values for all in the data set and all . Taking advantage of the invariance of under reparameterization, we parameterize the model directly by its scores. We use the shorthand , and we denote the collection of all scores over all and by boldface .
Hessian.
Eq. 2 defines the loss as a stochastic function. SGD minimizes its expectation,
(A1) 
where the sum over runs over all feature vectors in the training set. We obtain the gradient
(A2) 
where we used the relation . The gradient is a vector whose components span all combinations of and . The Hessian matrix contains the derivatives of each gradient component by each coordinate . Since in Eq. A2 depends only on the single coordinate , only the diagonal parts of the Hessian are nonzero, i.e., the components with and . Thus,
(A3) 
Using the identity , we find
(A4) 
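The Hessian argument can be checked numerically for a single feature vector. The following is a minimal sketch, not the paper's code: it assumes the expected negative sampling loss has the standard form "minus the data-weighted log-sigmoid of each score, minus the noise-weighted log-sigmoid of the negated score", and the names `p_data`, `p_noise`, and `xi` are hypothetical stand-ins for the data distribution, the noise distribution, and the scores. Under that assumption, the gradient vanishes at scores equal to the log-ratio of the two distributions, and the diagonal Hessian entries there reduce to `p_data * p_noise / (p_data + p_noise)`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient(xi, p_data, p_noise):
    # Derivative of the assumed expected loss w.r.t. each score; every
    # component depends only on its own score.
    return [-pd * sigmoid(-s) + pn * sigmoid(s)
            for s, pd, pn in zip(xi, p_data, p_noise)]


def hessian_diagonal(xi, p_data, p_noise):
    # Off-diagonal entries are zero because each gradient component
    # depends on a single coordinate.
    return [(pd + pn) * sigmoid(s) * sigmoid(-s)
            for s, pd, pn in zip(xi, p_data, p_noise)]


# Hypothetical toy distributions over three labels for one feature vector.
p_data = [0.7, 0.2, 0.1]
p_noise = [0.2, 0.3, 0.5]

# Scores that minimize the assumed loss: log-ratio of data to noise.
xi_star = [math.log(pd / pn) for pd, pn in zip(p_data, p_noise)]

grad_at_optimum = gradient(xi_star, p_data, p_noise)
hess_at_optimum = hessian_diagonal(xi_star, p_data, p_noise)
closed_form = [pd * pn / (pd + pn) for pd, pn in zip(p_data, p_noise)]
```

The closed form shows that a Hessian entry is small whenever either distribution assigns little mass to a label, which is one way to read why a noise distribution far from the data distribution gives a flat loss landscape.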
Noise Covariance Matrix.
SGD uses estimates of the loss function in Eq. A1, obtained by drawing a positive sample and a label for the negative sample , thus
(A7) 
where the factor of is because the sum over in Eq. A1 scales proportionally to the size of the data set (in practice one typically normalizes the loss function by without affecting the signal-to-noise ratio). One uses to obtain unbiased gradient estimates . We introduce new symbols and for the components of the gradient estimate to avoid confusion with the and drawn from the data set and the drawn from the noise distribution in Eq. A7 above. Since the scores are independent variables in the nonparametric limit, the derivative is one if and , and zero otherwise. We denote this by indicator functions and . Thus, we obtain
(A8) 
We evaluate the covariance matrix of at the minimum of the loss function. Here, , and thus simplifies to . Introducing yet another pair of indices and to distinguish the two factors of , we denote the components of the covariance matrix as
(A9) 
Here, the expectation is over . We start with the evaluation of the expectation over , using where the sum runs over all in the data set. If or , then either one of the two gradient estimates in the expectation on the right-hand side of Eq. A9 vanishes. Therefore, only terms with contribute, and the covariance matrix is block diagonal in as claimed in Eq. 14 of the main text. The blocks of the block diagonal matrix have entries
(A10) 
where we find for the product by inserting Eq. A8 and multiplying out the terms,
(A11)  
Taking the expectation in Eq. A10 leads to the following substitutions:
(A12) 
Thus, we find,
(A13)  
Using Eq. A5, we can again eliminate ,
(A14) 
Eq. A14 is the componentwise explicit form of Eq. 14 of the main text.
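The structure of one block of the noise covariance can be reproduced by brute-force enumeration. This is a hedged sketch under assumptions: it considers a single feature vector (so the data set size factor is set to one), assumes the stochastic loss is the negative log-sigmoid of the positive score plus the negative log-sigmoid of the negated negative score, and uses hypothetical names (`p_data`, `p_noise`, `h`). It enumerates all pairs of positive and negative labels exactly instead of sampling, and compares the result with the expression "diagonal of `h` minus twice the outer product of `h` with itself", where `h = p_data * p_noise / (p_data + p_noise)`.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradient_estimate(xi, y_pos, y_neg):
    # Gradient of the assumed stochastic loss w.r.t. all scores: only the
    # positive and the negative label receive a nonzero contribution.
    g = [0.0] * len(xi)
    g[y_pos] -= sigmoid(-xi[y_pos])
    g[y_neg] += sigmoid(xi[y_neg])
    return g


def exact_mean_and_covariance(xi, p_data, p_noise):
    # Exact expectation over independent draws of the positive label
    # (from p_data) and the negative label (from p_noise).
    k = len(xi)
    mean = [0.0] * k
    second = [[0.0] * k for _ in range(k)]
    for y_pos in range(k):
        for y_neg in range(k):
            weight = p_data[y_pos] * p_noise[y_neg]
            g = gradient_estimate(xi, y_pos, y_neg)
            for i in range(k):
                mean[i] += weight * g[i]
                for j in range(k):
                    second[i][j] += weight * g[i] * g[j]
    cov = [[second[i][j] - mean[i] * mean[j] for j in range(k)]
           for i in range(k)]
    return mean, cov


# Hypothetical toy distributions over three labels.
p_data = [0.7, 0.2, 0.1]
p_noise = [0.2, 0.3, 0.5]
xi_star = [math.log(pd / pn) for pd, pn in zip(p_data, p_noise)]

mean, cov = exact_mean_and_covariance(xi_star, p_data, p_noise)
h = [pd * pn / (pd + pn) for pd, pn in zip(p_data, p_noise)]
```

In this toy setting the mean gradient estimate is zero at the optimum, and the covariance differs from the diagonal Hessian only by a rank-one term.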
A.2 Experimental Comparison Between Softmax Classification and Negative Sampling
We provide additional experimental results that evaluate the performance gap due to negative sampling compared to full softmax classification on a smaller data set. Theorem 1 states an equivalence between negative sampling and softmax classification. However, this equivalence strictly holds only (i) in the nonparametric limit, (ii) without regularization, and (iii) if the optimizer really finds the global minimum of the loss function. In practice, all three assumptions hold only approximately.
Data Set and Preprocessing.
To evaluate the performance gap experimentally, we used the “EURLex-4K” data set (Bhatia et al., ; Mencia & Fürnkranz, 2008), which is small enough to admit direct optimization of the softmax loss function. Similar to the preprocessing of the two main data sets described in Section 5 of the main text, we converted the multi-class classification problem into a single-class classification problem by selecting the label with the smallest ID for each data point, and discarding any data points without any labels. We split off of the training set for validation, and report results on the provided test set. This resulted in a training set with data points and categories. As in the main paper, we reduced the feature dimension to (using PCA for simplicity here).
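The label conversion described above can be sketched in a few lines. This is an illustration only: the function name `to_single_label` and the representation of a data point as a `(features, labels)` pair are assumptions, and the validation split and dimensionality reduction are omitted.

```python
def to_single_label(examples):
    """Keep the smallest label ID per data point; drop unlabeled points.

    `examples` is a list of (features, labels) pairs, where `labels` is a
    possibly empty list of integer label IDs (hypothetical data layout).
    """
    return [(features, min(labels)) for features, labels in examples if labels]


# Toy input: the second data point has no labels and is discarded.
examples = [
    ([0.1, 0.3], [7, 2, 5]),
    ([0.4, 0.0], []),
    ([0.9, 0.2], [3]),
]
single_label_examples = to_single_label(examples)
```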
Model and Hyperparameters.
The goal of these experiments is to evaluate the performance gap due to negative sampling in general. We therefore fitted the same affine linear model as described in Section 5 of the main text using the full softmax loss function (Eq. 1) and the simplest form of negative sampling (Eq. 2), i.e., negative sampling with a uniform noise distribution. We added a quadratic regularizer with strength to both loss functions.
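To make the cost difference between the two objectives concrete, the sketch below evaluates both losses for one data point under an affine linear model. It is a simplified illustration, not the experimental code: the function names are hypothetical, the quadratic regularizer is assumed to penalize all weights and biases and is added once per evaluation, and the negative sampling loss is assumed to use one uniformly drawn negative label per positive sample.

```python
import math
import random


def score(weights, biases, x, label):
    # Affine linear score of a single class.
    return sum(w * v for w, v in zip(weights[label], x)) + biases[label]


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def quadratic_penalty(weights, biases, strength):
    # Assumed form of the regularizer: strength times the squared norm.
    squared = sum(w * w for row in weights for w in row)
    squared += sum(b * b for b in biases)
    return strength * squared


def softmax_loss(weights, biases, x, y, strength):
    # Needs the scores of all classes, so the cost grows with their number.
    scores = [score(weights, biases, x, c) for c in range(len(biases))]
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[y] + quadratic_penalty(weights, biases, strength)


def negative_sampling_loss(weights, biases, x, y, strength, rng):
    # Needs only two scores: the positive label and one uniform negative.
    y_neg = rng.randrange(len(biases))
    loss = -log_sigmoid(score(weights, biases, x, y))
    loss -= log_sigmoid(-score(weights, biases, x, y_neg))
    return loss + quadratic_penalty(weights, biases, strength)


# Toy check with all-zero parameters and four classes.
num_classes, dim = 4, 3
weights = [[0.0] * dim for _ in range(num_classes)]
biases = [0.0] * num_classes
x = [1.0, -2.0, 0.5]
full = softmax_loss(weights, biases, x, 1, 0.01)
sampled = negative_sampling_loss(weights, biases, x, 1, 0.01, random.Random(0))
```

With zero parameters every score is zero, so the softmax loss equals the logarithm of the number of classes and the negative sampling loss equals twice the logarithm of two, regardless of which negative label is drawn.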
For both methods, we tested the same hyperparameter combinations as in Section 5 on the validation set using early stopping. For softmax, we extended the range of tested learning rates up to as higher learning rates turned out to perform better in this method (which can be attributed to the low gradient noise). The optimal hyperparameters for softmax turned out to be a learning rate of and regularization strength . For negative sampling, we found and .
Results.
We evaluated the predictive accuracy for both methods. With the full softmax method, we obtain correct predictions on the test set, whereas the predictive accuracy drops to with negative sampling. This suggests that, when possible, minimizing the full softmax loss function should be preferred. However, in many cases, the softmax loss function is too expensive.