Fast Amortized Inference and Learning in Log-linear Models with Randomly Perturbed Nearest Neighbor Search

07/11/2017 ∙ by Stephen Mussmann, et al. ∙ Stanford University 0

Inference in log-linear models scales linearly in the size of output space in the worst-case. This is often a bottleneck in natural language processing and computer vision tasks when the output space is feasibly enumerable but very large. We propose a method to perform inference in log-linear models with sublinear amortized cost. Our idea hinges on using Gumbel random variable perturbations and a pre-computed Maximum Inner Product Search data structure to access the most-likely elements in sublinear amortized time. Our method yields provable runtime and accuracy guarantees. Further, we present empirical experiments on ImageNet and Word Embeddings showing significant speedups for sampling, inference, and learning in log-linear models.



There are no comments yet.


page 8

page 9

page 13

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Log-linear models are widely used in machine learning and statistics. These models receive their name from the fact that the log unnormalized probabilities are linear in the parameters and the sufficient statistics. Since the probabilities are defined up to scaling, inference and learning require computing the normalization constant, also known as the partition function

(Murphy, 2012).

While defining unnormalized probabilities affords modeling flexibility, it comes at the price of computation time. For factorized models, there are many methods, such as Gibbs sampling and variational inference (Koller & Friedman, 2009), to approximately perform inference. Here we are interested in the setting where the output space is not factorizable, and is large but enumerable (e.g., a few million elements). Such problems with large output spaces occur in many areas including computer vision and natural language processing (NLP) (Joulin et al., 2016; Bengio et al., 2003; Mikolov et al., 2013). While inference is tractable (by brute force, in time linear in the size of the output space), it can be a major bottleneck in learning and even at test time in resource-constrained settings. Clearly, computation time cannot be saved for a single inference query as it requires linear time to examine the input. However, as Mussmann & Ermon (2016) establishes, computation time can be saved for a sequence of related queries, e.g., sampling from log-linear models with the same sufficient statistics but different (changing) parameters. Such sequences of queries arise naturally in learning and at test time.

In this work, we employ Gumbel random variables to convert sampling into maximizing unnormalized log-probabilities perturbed by Gumbel noise, applied independently to each element in the output space (Hazan et al., 2013; Maddison et al., 2014; Kim et al., 2016). Naively, sampling a Gumbel for each element requires linear runtime which yields no savings. However, we introduce a novel way to lazily instantiate the Gumbel random variables. In order to maximize the Gumbel-perturbed objective, we only examine a (small) subset of the most-likely states and a small number of Gumbel perturbations (the largest ones). This yields asymptotic runtime improvements with provable accuracy guarantees. To find the most likely states, which involves the maximization of the dot product between the parameters and the sufficient statistics, we are able to make use of the large literature on Maximum Inner Product Search (Shrivastava & Li, 2014; Auvolat et al., 2015; Douze et al., 2016; Ram & Gray, 2012; Koenigstein et al., 2012).

The contributions of this work are as follows.

  • We present a method to perform sampling using access to the top most likely states (where is the number of states), using the Gumbel max trick.

  • We present a method to estimate the partition function and expected values using access to the top

    values and relying on uniform sampling.

  • We present a way to use Maximum Inner Product Search (MIPS) techniques to retrieve the approximate top elements to provably achieve sublinear amortized query time.

  • We demonstrate applications of our method in computer vision and NLP where we achieve per-query speedups compared to the naive method.

2 Background

2.1 Log-Linear Models

Log-linear models are widely used in machine learning and artificial intelligence. Generally, any exponential family distribution can be written as a log-linear model. As examples, very common models like multinomial logistic regression and maximum entropy models

(Koller & Friedman, 2009; Murphy, 2012)

are log-linear models. Additionally, the last layer of a neural network (softmax) is a log-linear model, e.g. sampling the next element of a sequence for recurrent neural networks.

In this work, we focus on discrete distributions over a set of states . For a log-linear model, the log unnormalized probabilities are linear in the parameters. More precisely, if the parameters are and the features (sufficient statistics) for an element are , then,


Note that in order to define a distribution, we must normalize these probabilities by


which is known as the partition function. Unfortunately, computing the partition function is expensive as it requires summing over all elements in . We can also learn a log-linear model by maximizing the likelihood of some training data where evaluating the gradient requires computing the expected value of the sufficient statistics.

Assumption: In our setting, is large but feasibly enumerable, so naively computing the partition function is tractable but computationally expensive.

As an example, in the experimental results section, . As a negative example, Markov Random Fields can be written as a log-linear model but have an exponentially large and thus are not amenable to our method.

2.2 Gumbel Variable

In the context of extremal statistics, Gumbel & Lieblein (1954) defines the Gumbel distribution as


We can sample a Gumbel random variable using the following scheme,


Our use of the Gumbel distribution is motivated by the so-called “Gumbel Trick” which involves adding Gumbel noise to the log unnormalized probabilities to turn sampling from a log-linear model into finding the maximizing element.

Proposition 2.1 ((Hazan et al., 2013; Maddison et al., 2014)).

For Gumbel variables sampled i.i.d. for each data point ,


2.3 Maximum Inner Product Search

A common computational task is retrieving the nearest neighbor to a query from a database of vectors. More specifically, we are given a database of vectors on which we can perform preprocessing and build a data structure, and then, we receive a sequence of queries

, and for each query, we use the data structure to compute the element in the database that is most similar to . Note that the structure of this problem depends on the similarity measure between vectors.

If is the number of vectors, we can trivially create an algorithm (per query): for every query , iterate through the entire database and find the vector that is most similar to

. Remarkably, for Euclidean distance and cosine similarity, it is possible to achieve

amortized sublinear query runtime (Indyk & Motwani, 1998; Charikar, 2002).

Because of applications in log-linear models, we will be interested in using the inner product as the similarity measure. This is known as the Maximum Inner Product Search (MIPS) task.

Definition 2.1 (Maximum Inner Product Search).

Given a set of vectors the MIPS task is to respond to a query vector with


One common class of techniques for solving MIPS are space-partitioning methods such as k-d trees (Bentley, 1975). Ram & Gray (2012) and Koenigstein et al. (2012)

introduce space-partitioning methods based on a branch and bound technique to solve the MIPS problem. Unfortunately, it has been observed that such tree-based methods suffer from the curse of dimensionality

(Shrivastava & Li, 2014).

Clustering is another approach for solving the MIPS task (Auvolat et al., 2015; Douze et al., 2016). For this technique, the database vectors are clustered during the preprocessing step. Then, at query time, the algorithm searches the clusters near for the most similar vector.

Another common class of techniques for MIPS are based on Local Sensitive Hashing (Shrivastava & Li, 2014; Neyshabur & Srebro, 2014), a method introduced by Indyk & Motwani (1998). LSH only requires a family of hash functions with collision probabilities that are monotonic in the similarity. LSH works by combining these basic hashes to form longer hashes, and then building a hash table for each longer hash. Then, at query time, LSH hashes the query, retrieves elements from the colliding hash buckets, and computes the maximum over such elements. More precisely, define as the similarity between and and an -neighbor to a query as a point such that .

Theorem 2.1.

Given a set of size with a similarity measure and hash family such that for scalars and ,

  • For any where ,

  • For any where ,

one can construct a data structure which, given any query , does the following with high probability: if there exists a -neighbor of in , it returns a -neighbor of in . Further, this can be done with query time and space where .


See Indyk & Motwani (1998)

This theorem states that if there is an -close neighbor, the algorithm will find an -close neighbor. Intuitively, this means that each LSH instance is “tuned” to a different similarity value. We can build a series of LSH instances “tuned” to different values so that we can find the largest element with high probability, no matter the similarity of the nearest neighbor. The theorem states that this can be done in sublinear time.

For the sublinear theoretical guarantees in this paper, we will rely on the reduction from MIPS to Maximum Cosines Similarity Search presented in Neyshabur & Srebro (2014) which adds a single dimension to make all the database vectors have the same norm. For the cosine similarity search problem, we will rely on LSH techniques for cosine similarity search presented in Charikar (2002) based on Signed Random Projections, a binary hash function based on the sign of the dot product with a random vector.

3 Method

Suppose we have a log-linear model over a set of elements . We wish to perform sampling and inference in sublinear time. This cannot be done for a single value of the parameters , but with preprocessing on , we can achieve sublinear amortized query time.

Our method generally works for any distribution where


which encompasses all distributions with strictly positive probability mass. The requirement for our method is that we have access to the largest values of in sublinear time. In particular, this method works for log-linear models where and we use Maximum Inner Product search techniques to access the top values.

3.1 Sampling

Recall the Gumbel max technique from the background section. In particular, if we can compute the maximum element and value of for Gumbel variables , the maximum element will be a sample from the model. We can construct a naive strategy as follows: sample a Gumbel for each and iterate over all elements to find the maximum (perturbed) element. However, this algorithm’s runtime is linear and provides no savings.

Ideally, we would like to find a way to preprocess so that we can draw samples and perform inference quickly. Mussmann & Ermon (2016) achieves this by performing preprocessing on fixed Gumbel samples and using MIPS techniques. This “frozen” Gumbel noise makes the samples very correlated; in fact, there are a small fixed number of possible samples for a given parameter value. We wish to find a way that allows us to sample fresh Gumbels for every sample, but only a sublinear number of them.

Intuitively, for an element to maximize , either needs to be large or needs to be large, so we only need to examine indices where either or is large. We can find the largest by performing preprocessing (such as maximum inner product search) and we will find the largest by incorporating a lazy evaluation strategy for the Gumbel variables to only require an expected number of samples.

First, we describe the method intuitively and with a figure before diving into the details. Let be the set of the largest elements of . First, we will sample Gumbel values for these largest elements. Note that the minimal in is an upper bound on the value of elements not in . Further, for an element not in to have the overall maximal , it must exceed the maximal for elements in , which is quickly computable. Thus, we have a lower bound on what the value of a Gumbel must be to perturb a point not in to be the overall maximum. We can lazily sample large Gumbels that exceed this gap, which we will show there will not be too many in expectation. Then, we randomly assign these large Gumbels to the tail of the distribution and check if any of them exceed the maximal from the largest elements . See Figure 1.

Figure 1: The values of are shown sorted in blue while the Gumbel noise is shown in yellow. We sample a Gumbel for each element of the set of the largest . Then, we compute the minimum value that a Gumbel must have to yield a candidate solution, represented by the difference between the dotted lines. Finally, we lazily sample Gumbels larger than this value for elements not in .

Note that our method requires the top elements of which we will refer to as . For lazy sampling the large Gumbels, we will use the fact that a Gumbel can be represented as . Then we can sample the number of Gumbels that exceed a threshold by sampling the number of such that and then can conditionally sample . Precisely, our method involves several steps shown in Algorithm 1.

  Input: , as the top values of
  Sample Gumbel variables for
  Compute the Gumbel cutoff .
  Sample as the number of Gumbels with value
  Uniformly sample points from and denote
  Sample Gumbels that are conditionally for the points (sample )
  return Sample
Algorithm 1 Fast Sampling with Lazy Gumbels
Theorem 3.1.

For Algorithm 1, is an exact sample from .


This theorem would follow from Proposition 2.1 if we prove that we are finding the maximum of . Note that we do not evaluate the Gumbel for all of the elements in . Thus, the only way the lazy sampling strategy will fail is if one of these points is the true maximum. However, these points have Gumbel and since they aren’t in , . Together, this implies that which is a value attained by a point in . Therefore points not in cannot be the maximum. ∎

3.1.1 Runtime

Further, the runtime will be composed of two parts: retrieving the top elements and the runtime of Algorithm 1. Let the cost of retrieving the top elements be . For Algorithm 1, including the cost of retrieving , the runtime will be and has a reasonable expected value.

Theorem 3.2.

For Algorithm 1,

The proof is in the appendix. Thus, the expected runtime for our method will be which is sublinear if and is sublinear.

3.1.2 Fixed

Note that the technique above has a reasonable expected runtime but no runtime guarantees with high probability. To address this, we can fix to be a constant so that the value of is concentrated. Additionally, the technique shown in Algorithm 1 only works if is an upper bound on elements not in which is brittle to errors in the MIPS technique.

To address these issues, we define a related algorithm with a fixed Gumbel cutoff of so that there are on average Gumbel variables that exceed the cutoff. See Algorithm 2.

  Input: , as the top values of ,
  Sample Gumbel variables for
  Sample as the number of Gumbels with value
  Uniformly sample points from and call them
  Sample Gumbels that are conditionally for the points
  return Sample
Algorithm 2 Fast Sampling with Fixed

Note that for , the total runtime is where is the runtime of gathering the top elements. Further so with very high probability, and the runtime is which will be sublinear if and are sublinear.

Theorem 3.3.

For Algorithm 2, the sample is an exact sample with probability for .

The proof is in the appendix. Thus, we can set .

3.2 Partition Function Estimation

Similar to sampling, we can estimate the partition function by using the top elements and a uniform sample of elements from the remaining elements. We combine these two sets to form an estimate of the partition function with relative error . See Algorithm 3.

  Input: , as the top values of ,
  Uniformly sample elements with replacement from and call it
  return Partition function estimate
Algorithm 3 Partition Function Estimation
Theorem 3.4.

Algorithm 3

returns an unbiased estimate

and for , then with probability,


The proof is in the appendix. If we set , then the runtime is .

This is closely related to the heuristic presented in

Rastogi & Van Durme (2015) as MIMPS. However, this is the first work that provides theoretical guarantees for the method and yields a theoretical understanding for the choice of and .

3.3 Expected Value Estimation

In this section we show a way to estimate an expected value with respect to the distribution . In particular, for bounded function values where we can define the expectation


where . The algorithm we use to create an estimate is very similar to the partition function estimate. More specifically, we compute the largest values of and then draw uniform samples from the remaining elements and call it . Then we compute an expected value using and (and upweighting the estimate from ). See Algorithm 4.

  Input: , bounded function values , as the top values of ,
  Uniformly sample elements with replacement from and call it
  return Expectation estimate
Algorithm 4 Expectation Estimation

This algorithm comes with a guarantee on the additive error.

Theorem 3.5.

Algorithm 4 returns an estimate such that with probability if


The proof is in the appendix. If we set then


Then, with a sublinear MIPS technique, the total runtime is sublinear. Note that we can use this to compute the expectation of and thus the gradient of data likelihood. This technique will be used in the experiments section for the learning experiment.

3.4 Approximate Top Elements

Many Maximum Inner Product Search (MIPS) methods, including LSH-based techniques, do not solve the exact nearest neighbor problem, but approximate nearest neighbor problem. In this work, we define a similar concept of the approximate top elements that will suffice for our theoretical arguments. Further, we show that we can use LSH instances to retrieve the approximate top elements in sublinear time.

We say that an algorithm returns the approximate top if the gap between the smallest element in and the largest element not in is bounded by a constant.

Definition 3.1 (Approximate Top ).

A set of elements is an approximate top if and


We can create a sequence of LSH instances that are “tuned” to a range of similarity values. Then at query time, we can go through the LSH instances in decreasing order of tuned value, gathering elements until we have elements. It turns out that these elements will be the approximate top elements (more details in the appendix). This technique will have a total runtime of


where . Thus, we have a sublinear approximate top element MIPS technique. We state this as a theorem and prove it in the appendix.

Theorem 3.6.

For sublinear , there exists a MIPS technique that returns the approximate top elements in sublinear amortized time.

Note that if we have a MIPS technique that returns an approximate top set then we can adapt Algorithm 1 to make for an added increase of in the expected value of , and thus the runtime.

If we have a MIPS technique that returns an approximate top set with constant , then Algorithm 2 and 3 will have an extra factor of for and and Algorithm 4 will have an extra factor of for and . These extensions are proved in the appendix and the previously stated theorems are special cases with .

4 Experiments

In this section, we present an empirical evaluation of our proposed sampling, inference, and expectation techniques. The use case for our method is when there are fixed feature vectors , and a sequence of inference or sampling queries with different parameter vectors . Although we cannot achieve gains on a single query, through preprocessing we can decrease the amortized query time. We will evaluate runtime improvements and accuracy.

4.1 Preliminaries

4.1.1 MIPS technique

We present the MIPS technique used to retrieve the top- values of the unnormalized log-probabilities. We follow the approximate nearest neighbor search method presented in Douze et al. (2016) as well as the publicly available implementation. However, we will not be making use of the compression component, as we do not optimize for memory usage.

This method relies on the use of a -means clustering. With the same notations as 2.3, given a query and a set of vectors , we aim at finding the highest values of .

We first cluster the vectors in in clusters. For an incoming query vector , we look at the inner product with the vectors in the cluster is assigned to as well as neighboring clusters. While this method doesn’t have any theoretical guarantees, it has been shown to perform better than LSH in practice as it more advantageously exploits the distribution of the set of vectors.

In our experiments, we use CPU implementations of all algorithms for fair comparison.

4.1.2 Data

We experiments with two datasets from different domains to demonstrate the effectiveness of our method in real-world use.

Word Embeddings

We use a set of word embeddings released by Facebook (Bojanowski et al., 2016). Each embedding is a dense vector representing a word in a given vocabulary. These continuous representations are obtained by training log-bilinear models on large text corpora. The embeddings incorporate structure from character -grams of the words. We retain words containing only letters and scale each vector to be of unit-norm. The data is composed of vectors of dimension .


The ImageNet dataset (Russakovsky et al., 2015) from the ILSVRC 2012 competition contains million natural images divided into classes. We extract features using a pre-trained residual network (He et al., 2016) trained on this classification task. More precisely, we represent each image by its activation map from the last layer before the linear classification layer of a ResNet-152. The extracted features are of size for each image. We then take the average along the depth dimension and reduce dimensionality using a PCA. We scale each vector to be of unit-norm. The data is thus composed of vectors of dimension . In the rest of our experiments, we choose the temperature of the log-linear model to be .

4.2 Sampling

In this section, we measure the performance of our method in terms of both sampling quality and speed. We first present empirical results on sampling and then illustrate the efficiency of our method on a specific task: a random walk over ImageNet.

4.2.1 Sampling


We want to evaluate the runtime of our method for sampling on large datasets. Given a dataset and a parameter vector , we compare the time necessary to sample from using our method or by enumeration (brute force). We compute the sampling time for random vectors and subsets of varying size for ImageNet ranging from to . The results are presented in Figure 2.

Figure 2: Empirical comparison of the runtime of sampling for randomly chosen from a log-linear model on subsets (of varying size) of the datasets. Note the log-scale of the dataset size. This time is the per query runtime and does not include preprocessing.

We can see that the speedup is linear w.r.t the log of the sub-sampled dataset size, achieving up to sampling speedup for the full dataset of size . If we consider the amortized cost, i.e. including the pre-processing cost of our MIPS data structure, our method starts paying off after approximately samples. The amortized costs are presented in the appendix, in Figure 7.


To measure the accuracy of our method, we present a way to establish an upper bound on the total variation distance in closed form for a given . Then, we average this upper bound over samples of (drawn uniformly from the dataset).

Note that the lazy sampling strategy is exact unless the true maximum is not in . Thus, if we can upper bound this probability, it is an upper bound on the total variation distance. For a given threshold , we can compute the closed form probability that and . This is the upper bound that we desire and we can optimize for the tightest upper bound. For both datasets, over 100 samples of , the average upper bound was on the order of proving that our sampling method is accurate even while using an approximate MIPS technique. A summary of our results in terms of accuracy and speedup are provided in Table 1. We provide further empirical evidence in the appendix to show that the distributions closely match on the shown .

Dataset Speedup Total Variation Bound
Word Embeddings
Table 1: Summary of the sampling speedup and bound on the total variation distance for our method on the ImageNet and Word Embeddings datasets.

4.2.2 Random walk over a large set

To showcase the applicability of our method, we perform a random walk over the ImageNet dataset. We define the transition function, i.e. the probability to walk from image to image as where is the temperature, is the fixed featurization previously defined, and are the pixel-values of images . The initial state is sampled uniformly across the dataset. This is similar in spirit to the PageRank algorithm (Page et al., 1999). This setting fits our method because while the MIPS structure can be reused across time steps, no computation can be cached in the naive setting (assuming we do not store the distribution for each element, which would be on the order of Terabytes).

Figure 3:

Samples of the Markov chain. The samples are spaced out by

time steps.

We evaluate the quality of the Markov Chain by comparing the top elements of the empirical sampling distribution. We run two different Markov chains, one with exact sampling and one with our sampling technique. Over one million steps, the two Markov Chains share of the top elements. This percentage looks low because of the finite sampling error. When we compare two different one million element windows within each chain, the top 1000 elements are shared and for the exact sampling and our sampling, respectively. It is seen that the between-chain differences are the same as the within-chain differences, so the Markov chain with our sampling technique yields roughly the same distribution as the chain with exact sampling.

4.3 Partition Function Estimate

We show the performance of our partition function estimate as shown in Algorithm 3. We can trade-off error and runtime by varying and . See Figure 4. We average the results over several values of , drawn uniformly from the dataset. For comparison, we plot the trade-off for only looking at the top values and using this as a partition function estimate. Additionally, we compare to the method of Mussmann & Ermon (2016) for different size of noise . For each value of and we report the runtime and relative error of the partition function estimate. As shown by the relative error of the top- estimate, sampling from the tail is necessary to achieve low relative error. We also show that the method from Mussmann & Ermon (2016) cannot come close in terms of relative error, achieving a maximum of relative error for . It is also important to note that their method cannot trade-off speed for accuracy as, when the noise-length increases, the injected noise destroys the MIPS structure rendering it highly inaccurate.

Figure 4: Runtime plotted as a function of relative error of partition function estimate (different points made by varying and ) on ImageNet (averaged over random values of . The red dotted line is the time for the exact partition function computation.

4.4 Learning

We wish to maximize the likelihood of a subset of the data given . We aim at finding


using gradient ascent. Evaluating the gradient requires finding the expectation of the features which can be estimated using our method in Algorithm 4. The features are fixed but is updated at each step of the gradient ascent algorithm, fitting well into the setting of our method. We choose a small subset of ImageNet as images with a commonality. In particular, we handpick 16 images showing the presence of water. We compare computing the gradient with our method to the computation of the exact gradient and to approximating the gradient by considering the truncated distribution on the top elements (referred to as top- gradient). The chosen images are shown in the appendix in Figure 9. We perform gradient ascent for iterations with learning rate , which we halve every iterations. The results are reported in Table 2. The learning curves are shown in Figure 5. We also show the most probable samples (outside of the dataset ) according to the log-linear model in Figure 6. We can see that these images are semantically similar to the training set, all containing water, showcasing the expressive power of the ResNet features.

Method Log-likelihood Speedup
Exact gradient
Only top-
Our method
Table 2: Log-likelihood and speedup for the learning of a log-linear model on ImageNet. For our method, we picked , for the comparison to only weighing the top-, we chose as well.
Figure 5: Log-likelihood plotted against the number of iterations for performing gradient ascent on our learning problem for iterations with a learning rate , halving the learning rate every iterations.
Figure 6: most probable images (outside of ) from our log-linear model trained to convergence.

As shown in Figure 5, we can see that the log-likelihood for our method and the exact gradient almost exactly overlap indicating that our estimation of the gradient is very accurate. In contrast, the top- gradient, while faster, proves to be a poor estimator and thus cannot optimize the log-likelihood. To summarize, our method converges to the global maximum as fast as computing the exact gradient.

5 Related Work

Our method can be viewed in two different comparative perspectives. Our method can be seen as an alternative to only using the top- most probable elements as is done in Vijayanarasimhan et al. (2014)

. There, large output spaces for deep learning are handled using Locality Sensitive Hashing. In particular, the top vectors are gathered and the rest of the vectors in the tail are ignored. For spread-out distributions (closer to uniform), this method will fail. Our work provides a scalable method to incorporate the probability mass present in the tail of the distribution by sampling

elements, a small prices compared to retrieving the top elements.

Our method can also be compared to a different way of combining the Gumbel max trick and Maximum Inner Product Search as presented in Mussmann & Ermon (2016). In that work, Gumbel noise is appended to the database vectors and stored in the MIPS data structure. Then, query vectors are chosen to access the frozen Gumbel noise. That work has several major shortcomings that make it unusable in practice.

The Gumbel noise is re-used, introducing correlated samples and systematic bias in the partition function estimate. In particular, for any fixed value of the parameters, there are a fixed number of samples “frozen” into the stored Gumbel noise. We avoid this issue by sampling fresh Gumbel variables for every sample. While real world data often has structure that can be exploited by the MIPS techniques, in Mussmann & Ermon (2016), the structure is destroyed by injecting random Gumbel noise. In our technique, we preserve structure in the database vectors by leaving the vectors unchanged. Finally, the method of Mussmann & Ermon (2016) requires accessing the MIPS data structure many times for independent samples and partition function estimates. In this work, we only require accessing the MIPS data structure once per parameter value.

6 Conclusion

In conclusion, we have presented several related methods that are based on the key idea of accessing the large elements in a distribution using Maximum Inner Product Search and accessing the tail of a distribution with uniform sampling. This decreases the runtime from to plus the runtime for the MIPS technique.

This work is best suited for cases where the output space is large but enumerable, such as those in NLP and computer vision. This work can be expected to give speedups when the feature vectors of a log-linear model are fixed but it is desired to perform inference and sampling for several different values of the the parameters. Note that our method is as flexible as the MIPS method that is employed; the feature vectors need to only be fixed for the MIPS to work. As an example, if a MIPS system allows for sparse updates, our method will also allow for sparse updates. Since our method treats MIPS as a black-box, advances in the speed and accuracy of MIPS techniques automatically improve our method.

When accessing the top elements is not accurate enough, we present a method to include uniform samples from the tail to provide provably good samples and estimates of the partition function. All this, at the small overhead price of uniform sampling.

7 Acknowledgments

This research was supported by Intel Corporation, Future of Life Institute () and NSF grants , , , and DGE-.

We thank Ludwig Schmidt and Moses Charikar for helpful discussions.



  • Auvolat et al. (2015) Auvolat, Alex, Chandar, Sarath, Vincent, Pascal, Larochelle, Hugo, and Bengio, Yoshua. Clustering is efficient for approximate maximum inner product search. arXiv preprint arXiv:1507.05910, 2015.
  • Bengio et al. (2003) Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, and Jauvin, Christian. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
  • Bentley (1975) Bentley, Jon Louis. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
  • Bojanowski et al. (2016) Bojanowski, Piotr, Grave, Edouard, Joulin, Armand, and Mikolov, Tomas. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606, 2016.
  • Charikar (2002) Charikar, Moses S. Similarity estimation techniques from rounding algorithms. In

    Proceedings of the thiry-fourth annual ACM symposium on Theory of computing

    , pp. 380–388. ACM, 2002.
  • Douze et al. (2016) Douze, Matthijs, Jégou, Hervé, and Perronnin, Florent. Polysemous codes. In European Conference on Computer Vision, pp. 785–801. Springer International Publishing, 2016.
  • Gumbel & Lieblein (1954) Gumbel, Emil Julius and Lieblein, Julius. Statistical theory of extreme values and some practical applications: a series of lectures. 1954.
  • Hazan et al. (2013) Hazan, Tamir, Maji, Subhransu, and Jaakkola, Tommi. On sampling from the gibbs distribution with random maximum a-posteriori perturbations. In Advances in Neural Information Processing Systems, pp. 1268–1276, 2013.
  • He et al. (2016) He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In

    Computer Vision and Pattern Recognition (CVPR), 2016 IEEE Conference on

    , 2016.
  • Indyk & Motwani (1998) Indyk, Piotr and Motwani, Rajeev. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. ACM, 1998.
  • Joulin et al. (2016) Joulin, Armand, van der Maaten, Laurens, Jabri, Allan, and Vasilache, Nicolas. Learning visual features from large weakly supervised data. In European Conference on Computer Vision, pp. 67–84. Springer, 2016.
  • Kim et al. (2016) Kim, Carolyn, Sabharwal, Ashish, and Ermon, Stefano.

    Exact sampling with integer linear programs and random perturbations.

    In Proc. 30th AAAI Conference on Artificial Intelligence, 2016.
  • Koenigstein et al. (2012) Koenigstein, Noam, Ram, Parikshit, and Shavitt, Yuval. Efficient retrieval of recommendations in a matrix factorization framework. In Proceedings of the 21st ACM international conference on Information and knowledge management, pp. 535–544. ACM, 2012.
  • Koller & Friedman (2009) Koller, Daphne and Friedman, Nir. Probabilistic graphical models: principles and techniques. MIT press, 2009.
  • Maddison et al. (2014) Maddison, Chris J, Tarlow, Daniel, and Minka, Tom. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
  • Mikolov et al. (2013) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119, 2013.
  • Murphy (2012) Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.
  • Mussmann & Ermon (2016) Mussmann, Stephen and Ermon, Stefano. Learning and inference via maximum inner product search. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2587–2596, 2016.
  • Neyshabur & Srebro (2014) Neyshabur, Behnam and Srebro, Nathan. On symmetric and asymmetric lshs for inner product search. arXiv preprint arXiv:1410.5518, 2014.
  • Page et al. (1999) Page, Lawrence, Brin, Sergey, Motwani, Rajeev, and Winograd, Terry. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
  • Ram & Gray (2012) Ram, Parikshit and Gray, Alexander G. Maximum inner-product search using cone trees. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 931–939. ACM, 2012.
  • Rastogi & Van Durme (2015) Rastogi, Pushpendre and Van Durme, Benjamin. Sublinear partition estimation. arXiv preprint arXiv:1508.01596, 2015.
  • Russakovsky et al. (2015) Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • Shrivastava & Li (2014) Shrivastava, Anshumali and Li, Ping. Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In Advances in Neural Information Processing Systems, pp. 2321–2329, 2014.
  • Vijayanarasimhan et al. (2014) Vijayanarasimhan, Sudheendra, Shlens, Jonathon, Monga, Rajat, and Yagnik, Jay. Deep networks with large output spaces. arXiv preprint arXiv:1412.7479, 2014.

8 Appendix

8.1 Empirical Evaluation of Sampling

We wish to evaluate the empirical accuracy of our sampling technique on concrete examples. We do this in two ways. First, we can sort the elements by probability and make events of drawing an element in the top , or the top , top , etc. We show the results for two random with different distributions in Figure 8 for samples. Note that our method closely matches the histogram of the true distribution. For a more comprehensive evaluation, we sample 30 values of and compute the relative error for exact sampling and our approximate sampling. See Figure 8. We also present in Figure 7 the amortized speedups obtained by our method. The amortized cost is defined as the time needed to train the index, added to the runtime of sampling samples.

Figure 7: Left and Center: empirical comparison of the runtime of sampling for a randomly chosen on Word Embeddings and Image Net for varying fraction of the data. The amortized cost is defined as the time necessary for the sampling in addition to the training time of the index. Right: Evaluation of the sampling speed-up for both datasets for varying fraction of the data.
Figure 8: Left and Center: Two randomly chosen with different bin distributions. We see that the empirical sampling closely matches the true distribution for all the bins. Right: Evaluation of the relative error on 30 samples of for the exact sampling and our sampling technique. The error bars are for the average error rate between the empirical distribution and the true distribution for both exact sampling and our method. We see that the error rates are not statistically significantly different.

8.2 Learning Training Set

For the learning experiment, we show the set of images that we maximized the probability of. See Figure 9. The common theme of the images is the presence of water.

Figure 9: The set of images used in the learning experiment. Note that all of the images contain water, though the content of the images is quite different.

8.3 Value of

For all of the proofs, we will include the result with the approximate MIPS with an error of . To recover the original results in the paper, set .

8.4 Sampling

Theorem 3.2.

For Algorithm 1,


Note that is the number of Gumbels that are larger than .

Note that Gumbels can be defined by where is a uniform random variable on the interval . Thus, we can think of each point having a uniform sample and finding places where


Thus, if we can find places where , then we have the value of . The number of points where this occurs is distributed according to .



Note that


And thus,


Putting it all together,


Theorem 3.3.

For Algorithm 2, the sample is an exact sample with probability for .


Note that the elements not in have values and . Further, there exists an element in with and with . As long as there is an element in that exceeds all the elements not in , the sample will be exact.


Thus, the probability of failure, , is bounded by

8.5 Partition Function Estimate

Theorem 3.4.

Algorithm 3 returns an unbiased estimate and for , then with probability,


Define . Let be the indices of the largest elements of and . Denote and . Thus, the true partition function is .

Let be the largest value of for elements in . Then for all , and further, for all , . We can scale these values of and denote them as where .

For the estimate we will draw samples with replacement from and denote the set as . Denote the samples as and the scaled versions as .

We use the estimate:


Note that


This is because . Thus, is an unbiased estimator of . However, we are concerned if it is well concentrated about its mean.

Let be a random variable as the scaled sample from and be the empirical mean over samples. Thus, .

Note that




If we use Chernoff (use a convexity argument to bound the MGF in terms of the mean as on page 22 of ”Concentration of Measure for the Analysis of Randomised Algorithms” by Dubhashi and Panconesi)


Combining these two by setting ,


Thus, as long as , then with probability,

returns an unbiased estimate and for , then with probability,

8.6 Expectation Estimate

Theorem 3.5.

Algorithm 4 returns an estimate such that with probability if




Thus, and .

To show that

with probability , we will show that

each with probability . These will be shown as two separate parts.

8.6.1 Part One


from Theorem 3.4, with probability then .

8.6.2 Part Two

For the second one is written as the following lemma

Lemma 8.1.

with probability for


Note that the smallest element in has “probability” . Thus, for the largest element not in , . Define as the random variable of the value of sampling uniformly from and returning

Note that and that . The elements of are samples and and thus we can define

From Hoeffding’s Inequality,


Defining we get that , so