
Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking

by Ruixiang Tang, et al.
University of Georgia · Texas A&M University · Rice University

The abundance of training data available on the Internet has been a key factor in the success of deep learning models. However, this wealth of publicly available data also raises concerns about the unauthorized exploitation of datasets for commercial purposes, which is forbidden by dataset licenses. In this paper, we propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data. By inserting a small number of watermarking samples into the dataset, our approach enables the learning model to implicitly learn a secret function set by defenders. This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally. Unfortunately, existing backdoor insertion methods often entail adding arbitrary and mislabeled data to the training set, leading to a significant drop in performance and easy detection by anomaly detection algorithms. To overcome this challenge, we introduce a clean-label backdoor watermarking framework that uses imperceptible perturbations in place of mislabeled samples. As a result, the watermarking samples remain consistent with their original labels, making them difficult to detect. Our experiments on text, image, and audio datasets demonstrate that the proposed framework safeguards datasets effectively with minimal impact on original-task performance. We also show that adding just 1% of watermarking samples can inject a traceable watermarking function, and that our watermarking samples are stealthy and look benign upon visual inspection.
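To make the clean-label idea concrete, the following is a minimal sketch (not the authors' implementation) of the two steps the abstract describes: (1) add an imperceptible trigger perturbation to a small fraction of images that already belong to a chosen target class, keeping their original, correct labels; and (2) later verify a suspect model by checking whether it predicts the target class on triggered probes far more often than chance. The function names, the additive trigger, and the `epsilon`/`fraction`/`threshold` values are illustrative assumptions.

```python
import numpy as np

def make_clean_label_watermarks(images, labels, target_class, trigger,
                                epsilon=0.05, fraction=0.01, seed=0):
    """Clean-label watermarking sketch: perturb a small fraction of images
    that ALREADY carry the target-class label, so labels stay correct.
    `epsilon`, `fraction`, and the additive trigger are assumed details."""
    rng = np.random.default_rng(seed)
    idx = np.flatnonzero(labels == target_class)          # candidates with the true target label
    n = max(1, int(fraction * len(images)))               # e.g. ~1% of the dataset
    chosen = rng.choice(idx, size=min(n, len(idx)), replace=False)
    wm = images.astype(np.float32).copy()
    wm[chosen] = np.clip(wm[chosen] + epsilon * trigger, 0.0, 1.0)  # imperceptible perturbation
    return wm, chosen

def verify_watermark(model_predict, probe_images, trigger, target_class,
                     epsilon=0.05, threshold=0.8):
    """Ownership check: a model trained on the watermarked data should map
    triggered probes to the target class at a rate well above chance."""
    triggered = np.clip(probe_images + epsilon * trigger, 0.0, 1.0)
    preds = model_predict(triggered)
    return float(np.mean(preds == target_class)) >= threshold
```

In this sketch no label is ever changed, which is what makes the samples hard to flag with anomaly or label-noise detectors; only the pixel values receive a small, consistent perturbation that the model comes to associate with the target class.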



