Approximate Selection with Guarantees using Proxies

04/02/2020
by Daniel Kang, et al.

Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect long videos to identify relevant objects and events. To reduce this cost, recent work proposes using cheap proxy models, such as image classifiers, to identify an approximate set of data points satisfying a data selection filter. Unfortunately, this recent work does not provide the statistical accuracy guarantees necessary in scientific and production settings. In this work, we introduce novel algorithms for approximate selection queries with statistical accuracy guarantees. Namely, given a limited number of exact identifications from an oracle (often a human or an expensive machine learning model), our algorithms meet a minimum precision or recall target with high probability. In contrast, existing approaches can fail catastrophically to satisfy these recall and precision targets. We show that our algorithms can improve query result quality by up to 30x for both the precision and recall targets on both real and synthetic datasets.
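
To make the recall-target setting concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes uniform sampling of the oracle budget and a one-sided Hoeffding margin to pick a proxy-score threshold conservatively. The function name `select_with_recall_target` and its parameters are hypothetical, introduced only for illustration.

```python
import math
import random


def select_with_recall_target(proxy_scores, oracle, budget, recall_target,
                              delta=0.05, seed=0):
    """Illustrative sketch (NOT the paper's algorithm).

    proxy_scores: list of floats; a higher score means "more likely a match".
    oracle: callable mapping an index to True/False (the expensive exact check).
    budget: maximum number of oracle calls.
    recall_target: desired minimum recall, in (0, 1].
    delta: allowed failure probability (assumption: Hoeffding-style margin).
    """
    rng = random.Random(seed)
    n = len(proxy_scores)

    # Spend the oracle budget on a uniform sample without replacement.
    sample = rng.sample(range(n), min(budget, n))
    positives = [i for i in sample if oracle(i)]

    if not positives:
        # No labeled positives to calibrate on: be conservative, return all.
        return list(range(n))

    # One-sided Hoeffding margin over the sampled positives.
    margin = math.sqrt(math.log(1.0 / delta) / (2.0 * len(positives)))
    needed = math.ceil((recall_target + margin) * len(positives))
    if needed > len(positives):
        # The sample is too small to certify any threshold: return all.
        return list(range(n))

    # Lowest score among the top `needed` sampled positives is the threshold,
    # so the empirical recall on the sample is at least target + margin.
    pos_scores = sorted((proxy_scores[i] for i in positives), reverse=True)
    threshold = pos_scores[needed - 1]

    selected = {i for i in range(n) if proxy_scores[i] >= threshold}
    selected.update(positives)  # oracle-confirmed matches are always kept
    return sorted(selected)
```

The sketch overshoots the empirical recall by a margin so that sampling noise is less likely to push the true recall below the target. When too few positives are sampled it falls back to returning every record, which trades result quality for safety. A precision target would need a different calibration step that is not shown here.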

research
06/06/2022

On Efficient Approximate Queries over Machine Learning Models

The question of answering queries over ML predictions has been gaining a...
research
04/09/2018

A plug-in approach to maximising precision at the top and recall at the top

For information retrieval and binary classification, we show that precis...
research
09/09/2020

Task-agnostic Indexes for Deep Learning-based Queries over Unstructured Data

Unstructured data is now commonly queried by using target deep neural ne...
research
04/02/2021

A Comparison of Similarity Based Instance Selection Methods for Cross Project Defect Prediction

Context: Previous studies have shown that training data instance selecti...
research
08/17/2023

Accelerating Aggregation Queries on Unstructured Streams of Data

Analysts and scientists are interested in querying streams of video, aud...
research
03/02/2020

Approximate Cross-validation: Guarantees for Model Assessment and Selection

Cross-validation (CV) is a popular approach for assessing and selecting ...
research
08/10/2013

Applying the Negative Selection Algorithm for Merger and Acquisition Target Identification

In this paper, we propose a new methodology based on the Negative Select...
