apricot: Submodular selection for data summarization in Python

06/08/2019
by Jacob Schreiber, et al.

We present apricot, an open-source Python package for selecting representative subsets from large data sets using submodular optimization. The package implements an efficient greedy selection algorithm that offers strong theoretical guarantees on the quality of the selected set. Two submodular set functions are implemented in apricot: facility location, which is broadly applicable but requires memory quadratic in the number of examples in the data set, and a feature-based function that is less broadly applicable but can scale to millions of examples. apricot achieves its efficiency through both algorithmic speedups, such as the lazy greedy algorithm, and code optimizers, such as numba. We demonstrate the use of subset selection by training machine learning models to comparable accuracy using either the full data set or a representative subset of it. This paper presents an explanation of submodular selection, an overview of the features in apricot, and an application to several data sets. The code and tutorial Jupyter notebooks are available at https://github.com/jmschrei/apricot
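To make the abstract's terms concrete, here is a minimal pure-Python sketch of lazy greedy selection under a facility location function. It is an illustration only, not apricot's implementation or API: the function names are hypothetical, and the similarity kernel exp(-d^2) over squared Euclidean distance is an assumption chosen for simplicity. Facility location scores a subset by summing, over every example, its similarity to the closest selected example, which is why the full pairwise similarity matrix (quadratic memory) is needed.

```python
import heapq
import math


def similarity_matrix(points):
    """Pairwise similarities exp(-squared Euclidean distance).

    Assumed kernel for this sketch; storage is quadratic in len(points).
    """
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sim[i][j] = math.exp(-d2)
    return sim


def marginal_gain(sim, best, j):
    """Increase in the facility location objective if candidate j is added.

    best[i] holds example i's similarity to its closest selected example.
    """
    return sum(max(s - b, 0.0) for s, b in zip(sim[j], best))


def lazy_greedy_facility_location(points, k):
    """Select up to k indices with the lazy greedy strategy (hypothetical helper)."""
    sim = similarity_matrix(points)
    n = len(points)
    best = [0.0] * n
    # heapq is a min-heap, so gains are stored negated to pop the largest first.
    heap = [(-marginal_gain(sim, best, j), j) for j in range(n)]
    heapq.heapify(heap)
    selected = []
    while heap and len(selected) < k:
        _, j = heapq.heappop(heap)
        fresh = marginal_gain(sim, best, j)
        # Submodularity means stored gains can only shrink, so they are upper
        # bounds. If the refreshed gain still beats the best remaining bound,
        # j is the greedy choice and no other candidate needs re-evaluation.
        if not heap or fresh >= -heap[0][0]:
            selected.append(j)
            best = [max(b, s) for b, s in zip(best, sim[j])]
        else:
            heapq.heappush(heap, (-fresh, j))
    return selected


# Toy data: three well-separated groups of 3, 2, and 1 points.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
clusters = [0, 0, 0, 1, 1, 2]
selected = lazy_greedy_facility_location(points, 3)
print(selected)
```

On this toy input the three selected indices fall in three different groups, which is the sense in which the subset is representative. The lazy step saves work because most candidates are never re-scored after the first round.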


