The Fast and the Private: Task-based Dataset Search

08/10/2023
by Zezhou Huang, et al.

Modern dataset search platforms rank candidates by ML task-based utility metrics rather than metadata keywords when combing through large dataset repositories. In this setup, a requester provides an initial dataset, and the platform identifies complementary datasets to augment it (by join or union) so that the performance of an ML model (e.g., linear regression) improves the most. Although effective, current task-based data search is held back by (1) high latency, which deters users, (2) privacy concerns tied to regulatory standards, and (3) low data quality, which limits utility. We introduce Mileena, a fast, private, and high-quality task-based dataset search platform. At its heart, Mileena is built on pre-computed semi-ring sketches for efficient ML training and evaluation. Building on semi-rings, we develop a novel Factorized Privacy Mechanism that makes the search differentially private and scales to arbitrary corpus sizes and numbers of requests without major quality degradation. We also demonstrate early promise in using LLM-based agents for automatic data transformation and in applying semi-rings to support causal discovery and treatment effect estimation.
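
To make the idea of semi-ring sketches concrete, here is a minimal sketch in plain Python for one-feature linear regression. It is an illustration under stated assumptions, not Mileena's implementation: the function names (`sketch`, `union`, `join_stats`, `fit_line`), the single join key, and the one-column tables are all hypothetical. Each table is summarized per key by a (count, sum, sum of squares) triple; a union adds triples, a join multiplies them, and the regression is fitted from the resulting statistics without materializing the augmented table.

```python
from collections import defaultdict


def sketch(rows):
    """Summarize (key, value) rows as key -> (count, sum, sum of squares)."""
    out = defaultdict(lambda: (0, 0.0, 0.0))
    for key, v in rows:
        c, s, q = out[key]
        out[key] = (c + 1, s + v, q + v * v)
    return dict(out)


def union(a, b):
    """Semi-ring addition: the sketch of a union is the element-wise sum."""
    out = dict(a)
    for key, (c, s, q) in b.items():
        c0, s0, q0 = out.get(key, (0, 0.0, 0.0))
        out[key] = (c0 + c, s0 + s, q0 + q)
    return out


def join_stats(x_sketch, y_sketch):
    """Semi-ring multiplication: statistics of the key join of an x-table
    and a y-table, computed from the two sketches alone.
    Returns (n, sum_x, sum_y, sum_xx, sum_xy)."""
    n = sx = sy = sxx = sxy = 0.0
    for key, (cx, s_x, q_x) in x_sketch.items():
        if key not in y_sketch:
            continue  # inner join: unmatched keys contribute nothing
        cy, s_y, _ = y_sketch[key]
        n += cx * cy        # each x row pairs with every y row of the key
        sx += s_x * cy
        sy += cx * s_y
        sxx += q_x * cy
        sxy += s_x * s_y
    return n, sx, sy, sxx, sxy


def fit_line(stats):
    """Ordinary least squares for y = slope * x + intercept from statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data: the requester holds y, a candidate dataset holds x.
requester = [("a", 3.0), ("b", 5.0), ("c", 7.0)]
candidate = [("a", 1.0), ("b", 2.0), ("c", 3.0)]
slope, intercept = fit_line(join_stats(sketch(candidate), sketch(requester)))
```

Because the sketches can be computed once per dataset ahead of time, evaluating a candidate in this toy setting costs time proportional to the number of distinct keys rather than the number of joined rows.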


research
07/01/2023

Saibot: A Differentially Private Data Search Platform

Recent data search platforms use ML task-based utility measures rather t...
research
06/01/2023

Better Private Linear Regression Through Better Private Feature Selection

Existing work on differentially private linear regression typically assu...
research
08/05/2022

DP^2-VAE: Differentially Private Pre-trained Variational Autoencoders

Modern machine learning systems achieve great success when trained on la...
research
07/23/2020

Private Post-GAN Boosting

Differentially private GANs have proven to be a promising approach for g...
research
09/24/2021

NanoBatch DPSGD: Exploring Differentially Private learning on ImageNet with low batch sizes on the IPU

Differentially private SGD (DPSGD) has recently shown promise in deep le...
research
09/13/2022

Optimal Data Acquisition with Privacy-Aware Agents

We study the problem faced by a data analyst or platform that wishes to ...
