
Kronecker Determinantal Point Processes
Determinantal Point Processes (DPPs) are probabilistic models over all s...

Similarity Downselection: A Python implementation of a heuristic search algorithm for finding the set of the n most dissimilar items with an application in conformer sampling
Finding the set of the n items most dissimilar from each other out of a ...

Exploring the Impact of Password Dataset Distribution on Guessing
Leaks from password datasets are a regular occurrence. An organization m...

Fast determinantal point processes via distortion-free intermediate sampling
Given a fixed n × d matrix X, where n ≫ d, we study the complexity of samp...

Data Sketches for Disaggregated Subset Sum and Frequent Item Estimation
We introduce and study a new data sketch for processing massive datasets...

What is the distribution of the number of unique original items in a bootstrap sample?
Sampling with replacement occurs in many settings in machine learning, n...

Optimized Algorithms to Sample Determinantal Point Processes
In this technical report, we discuss several sampling algorithms for Det...
Sampling from a k-DPP without looking at all items
Determinantal point processes (DPPs) are a useful probabilistic model for selecting a small, diverse subset out of a large collection of items, with applications in summarization, stochastic optimization, active learning, and more. Given a kernel function and a subset size k, our goal is to sample k out of n items with probability proportional to the determinant of the kernel matrix induced by the subset (a distribution known as a k-DPP). Existing k-DPP sampling algorithms require an expensive preprocessing step that involves multiple passes over all n items, making them infeasible for large datasets. A naïve heuristic for this problem is to uniformly subsample a fraction of the data and perform k-DPP sampling only on those items. However, this method offers no guarantee that the produced sample will even approximately resemble the target distribution over the original dataset. In this paper, we develop an algorithm that adaptively builds a sufficiently large uniform sample of the data, which is then used to efficiently generate a smaller set of k items, while ensuring that this set is drawn exactly from the target distribution defined on all n items. We show empirically that our algorithm produces a k-DPP sample after observing only a small fraction of all elements, leading to performance several orders of magnitude faster than the state of the art.
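
To make the k-DPP definition above concrete, here is a minimal brute-force sketch in Python. It is not the algorithm developed in the paper: it enumerates every size-k subset and therefore looks at all n items, so it is only feasible for very small n. The RBF kernel, the function names, and the toy data are illustrative assumptions, not taken from the paper.

```python
import itertools

import numpy as np


def rbf_kernel(X, bandwidth=1.0):
    """Hypothetical similarity kernel: L[i, j] = exp(-||x_i - x_j||^2 / (2 * bandwidth^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))


def kdpp_probabilities(L, k):
    """Return all size-k subsets and their k-DPP probabilities.

    Each subset S gets probability proportional to det(L[S, S]),
    the determinant of the kernel matrix induced by S.
    """
    n = L.shape[0]
    subsets = list(itertools.combinations(range(n), k))
    dets = np.array([np.linalg.det(L[np.ix_(S, S)]) for S in subsets])
    # Guard against tiny negative values caused by floating-point round-off.
    dets = np.clip(dets, 0.0, None)
    return subsets, dets / dets.sum()


def sample_kdpp_bruteforce(L, k, rng):
    """Draw one subset from the k-DPP by sampling from the enumerated distribution."""
    subsets, probs = kdpp_probabilities(L, k)
    idx = rng.choice(len(subsets), p=probs)
    return subsets[idx]


if __name__ == "__main__":
    # Toy data (assumed): items 0 and 1 are near-duplicates, items 2 and 3 are far apart.
    X = np.array([[0.0, 0.0], [0.01, 0.0], [3.0, 0.0], [0.0, 3.0]])
    L = rbf_kernel(X)
    subsets, probs = kdpp_probabilities(L, k=2)
    for S, p in zip(subsets, probs):
        print(S, round(float(p), 4))
    print("sample:", sample_kdpp_bruteforce(L, 2, np.random.default_rng(0)))
```

In this toy setup the subset containing the two near-duplicate items receives much less probability than subsets of well-separated items, which is the diversity-promoting behavior the abstract describes. The cost of enumerating all subsets grows combinatorially in n, which is why practical samplers avoid it.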