Poisoning Web-Scale Training Datasets is Practical

02/20/2023
by   Nicholas Carlini, et al.
0

Deep learning models are often trained on distributed, webscale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01 the LAION-400M or COYO-700M datasets for just 60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content – such as Wikipedia – where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notify the maintainers of each affected dataset and recommended several low-overhead defenses.

READ FULL TEXT

page 9

page 19

research
11/03/2022

M-to-N Backdoor Paradigm: A Stealthy and Fuzzy Attack to Deep Learning Models

Recent studies show that deep neural networks (DNNs) are vulnerable to b...
research
08/06/2019

Cross-Origin State Inference (COSI) Attacks: Leaking Web Site States through XS-Leaks

In a Cross-Origin State Inference (COSI) attack, an attacker convinces a...
research
09/04/2020

Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching

Data Poisoning attacks involve an attacker modifying training data to ma...
research
10/19/2020

The Impact of DNS Insecurity on Time

We demonstrate the first practical off-path time shifting attacks agains...
research
12/04/2020

Unleashing the Tiger: Inference Attacks on Split Learning

We investigate the security of split learning – a novel collaborative ma...
research
04/13/2018

A Deep Learning Approach to Fast, Format-Agnostic Detection of Malicious Web Content

Malicious web content is a serious problem on the Internet today. In thi...
research
04/22/2022

A Tale of Two Models: Constructing Evasive Attacks on Edge Models

Full-precision deep learning models are typically too large or costly to...

Please sign up or login with your details

Forgot password? Click here to reset