Sample Efficient Detection and Classification of Adversarial Attacks via Self-Supervised Embeddings

by   Mazda Moayeri, et al.

Adversarial robustness of deep models is pivotal in ensuring safe deployment in real world settings, but most modern defenses have narrow scope and expensive costs. In this paper, we propose a self-supervised method to detect adversarial attacks and classify them to their respective threat models, based on a linear model operating on the embeddings from a pre-trained self-supervised encoder. We use a SimCLR encoder in our experiments, since we show the SimCLR embedding distance is a good proxy for human perceptibility, enabling it to encapsulate many threat models at once. We call our method SimCat since it uses SimCLR encoder to catch and categorize various types of adversarial attacks, including L_p and non-L_p evasion attacks, as well as data poisonings. The simple nature of a linear classifier makes our method efficient in both time and sample complexity. For example, on SVHN, using only five pairs of clean and adversarial examples computed with a PGD-L_inf attack, SimCat's detection accuracy is over 85 from each threat model, SimCat can classify eight different attack types such as PGD-L_2, PGD-L_inf, CW-L_2, PPGD, LPA, StAdv, ReColor, and JPEG-L_inf, with over 40 poisoning attacks, such as BP, CP, FC, CLBD, HTBD, halving the success rate while using only twenty total poisons for training. We find that the detectors generalize well to unseen threat models. Lastly, we investigate the performance of our detection method under adaptive attacks and further boost its robustness against such attacks via adversarial training.



There are no comments yet.



Perceptual Adversarial Robustness: Defense Against Unseen Threat Models

We present adversarial attacks and defenses for the perceptual adversari...

A Self-supervised Approach for Adversarial Robustness

Adversarial examples can cause catastrophic mistakes in Deep Neural Netw...

Self-Supervised Contrastive Learning with Adversarial Perturbations for Robust Pretrained Language Models

This paper improves the robustness of the pretrained language model BERT...

What Doesn't Kill You Makes You Robust(er): Adversarial Training against Poisons and Backdoors

Data poisoning is a threat model in which a malicious actor tampers with...

Detection Defense Against Adversarial Attacks with Saliency Map

It is well established that neural networks are vulnerable to adversaria...

Self-Supervised Adversarial Example Detection by Disentangled Representation

Deep learning models are known to be vulnerable to adversarial examples ...

Detection of Adversarial Attacks and Characterization of Adversarial Subspace

Adversarial attacks have always been a serious threat for any data-drive...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.