One-class novelty detection refers to the problem of determining whether a test sample belongs to the normal (known) class or an anomalous (novel) class. In real-world applications, novel data are difficult to collect since they are often rare or unsafe. Hence, one-class novelty detection considers training data from only a single known class. Most recent advances in one-class novelty detection are based on deep Auto-Encoder (AE) style architectures, such as the Denoising Auto-Encoder (DAE) [salehi2020arae, vincent2008extracting], Variational Auto-Encoder (VAE) [kingma2013auto], Adversarial Auto-Encoder (AAE) [makhzani2015adversarial, pidhorskyi2018generative], and Generative Adversarial Network (GAN) [goodfellow2014generative, perera2019ocgan, sabokrou2018adversarially, zhangp]. Given an AE that learns the distribution of the known class, normal data are expected to be reconstructed accurately, while anomalous data are not. The reconstruction error of the AE is then used as the novelty score of a test example. Although deep novelty detection methods achieve impressive performance, their robustness against adversarial attacks [goodfellow2015explaining, Szegedy2014Intriguing] remains largely unexplored.
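As a rough illustration of this scoring scheme (a self-contained sketch; the toy linear `encode`/`decode` pair is invented purely for illustration and is not any model from this paper):

```python
import numpy as np

def novelty_score(x, encode, decode):
    """Reconstruction error of an AE used as the novelty score:
    low for known-class inputs, high for novel-class inputs."""
    x_hat = decode(encode(x))
    return float(np.mean((x_hat - x) ** 2))

# Toy linear AE that keeps only the first coordinate: data lying on that
# axis ("normal") reconstructs perfectly; off-axis data ("anomalous") does not.
encode = lambda x: x[:1]
decode = lambda z: np.concatenate([z, np.zeros(1)])

normal = np.array([2.0, 0.0])
anomalous = np.array([0.0, 2.0])
assert novelty_score(normal, encode, decode) < novelty_score(anomalous, encode, decode)
```

Thresholding this score then yields the normal/anomalous decision.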
Adversarial examples pose serious security threats to deep networks, as carefully crafted perturbations can fool them. Over the past few years, many adversarial attack and defense approaches have been proposed for tasks such as image classification [guo2017countering, raff2019barrage, Xie_2019_CVPR, xu2017feature], video recognition [lo2020defending, wei2019sparse], optical flow estimation [ranjan2019attacking] and open-set recognition [shao2020open]. However, adversarial attacks and defenses have not been thoroughly investigated in the context of one-class novelty detection. We first show that existing novelty detectors are vulnerable to adversarial attacks. Subsequently, we demonstrate that many state-of-the-art defenses [hendrycks2019selfsupervised, shi2021online, xie2020smooth, Xie_2019_CVPR] are sub-optimal for defending novelty detectors against adversarial examples. This motivates us to design an effective defense strategy specifically for one-class novelty detection.
To this end, we propose to leverage task-specific knowledge to protect novelty detectors. Novelty detectors are only required to retain information about normal data, which results in poor reconstructions for anomalous data; this is favorable for the novelty detection problem. It can be achieved by constraining the latent space to push the features closer to a prior distribution [perera2019ocgan, park2020learning]. Moreover, it has been shown that adversarial perturbations can be removed in the feature space [Xie_2019_CVPR]. Therefore, one can manipulate the latent space of a novelty detector to a large extent to rid it of the feature corruption introduced by adversaries, while maintaining the performance on clean input data. This property is unique to the novelty detection task: most deep learning applications (e.g., image classification) require a model to retain sophisticated semantic information, so a large manipulation of the latent space may limit the model's capability and degrade performance.
In this paper, we propose a defense strategy, referred to as Principal Latent Space (PLS), to defend novelty detectors against adversarial examples. Specifically, PLS learns incrementally-trained [ross2008incremental] cascade principal components in the latent space. It contains a cascade principal component analysis (PCA), which consists of one PCA operating on the vector dimension (i.e., channel) of a latent space [van2017neural] and another PCA operating on the spatial dimension. We name these two PCAs Vector-PCA and Spatial-PCA, respectively. First, Vector-PCA uses a learned principal latent vector to represent the latent space as the Vector-PCA space, a single-channel map. Since the principal latent vector is a pre-trained component that is not affected by adversarial perturbations, most adversaries are removed at this step, and the remaining adversaries are enclosed within the small Vector-PCA space. Subsequently, Spatial-PCA uses learned principal Vector-PCA maps to represent the Vector-PCA space as the Spatial-PCA space and expel the remaining adversaries. Finally, the corresponding cascade inverse PCA transforms the Spatial-PCA space back to the original dimensionality, resulting in the principal latent space.
With PLS, the decoder can compute preferred reconstruction errors as novelty scores, even under adversarial attacks (see Fig. 1). Additionally, we incorporate adversarial training (AT) [madry2018towards] with PLS to further strengthen it. In contrast to typical defenses, which often sacrifice performance on clean data [tsipras2018robustness, xie2020adversarial], the proposed defense strategy does not hurt clean performance but rather improves it. The PLS module can be attached to any AE-style architecture (VAE, GAN, etc.), so it is applicable to a wide variety of existing novelty detection approaches, such as [kingma2013auto, makhzani2015adversarial, sabokrou2018adversarially, pidhorskyi2018generative, salehi2020arae]. We extensively evaluate PLS on eight adversarial attacks, three datasets and six different novelty detectors. We further compare PLS with commonly-used defense methods and show that it consistently enhances the adversarial robustness of novelty detectors by significant margins. To the best of our knowledge, this is one of the first adversarially robust novelty detection methods.
2 Related work
One-class novelty detection. Early approaches to one-class novelty detection are based on support vector methods [scholkopf1999support, tax2004support]. With the advent of deep learning, AE-based approaches dominate this area and achieve state-of-the-art performance [gong2019memorizing, park2020learning, perera2019ocgan, pidhorskyi2018generative, sabokrou2018adversarially, sakurada2014anomaly, salehi2020arae, xia2015learning, zhou2017anomaly]. ALOCC [sabokrou2018adversarially] considers a DAE [vincent2008extracting] as a generator and appends a discriminator to train the entire network with the generative adversarial framework [goodfellow2014generative]. GPND [pidhorskyi2018generative] is based on AAE [makhzani2015adversarial]; it applies one discriminator to the latent space and another to the output. OCGAN [perera2019ocgan] includes two discriminators and a classifier to train a DAE with the generative adversarial framework. ARAE [salehi2020arae] crafts adversarial examples from the latent space to adversarially train a DAE. Different from our work, ARAE's adversarial examples aim to improve clean performance, and its adversarial robustness is not thoroughly evaluated (see Supplementary).
Adversarial attacks. Szegedy et al. [Szegedy2014Intriguing] showed that carefully crafted perturbations can fool deep networks. Goodfellow et al. [goodfellow2015explaining] introduced the Fast Gradient Sign Method (FGSM), which leverages the sign of gradients to produce adversarial examples. Projected Gradient Descent (PGD) [madry2018towards] extends FGSM from single-iteration gradient descent to an iterative version. MI-FGSM [dong2018boosting] generates more transferable adversarial attacks with a momentum mechanism. MultAdv [lo2020multav] produces adversarial examples via the multiplicative operation instead of the additive operation. Physically realizable attacks, which can be implemented in physical scenarios, have also been developed [Sharif16AdvML, zajac2019adversarial]. For example, Adversarial Framing (AF) [zajac2019adversarial] adds perturbations on the border of an image, while the remaining pixels are unchanged.
Adversarial defenses. Early studies aimed to detect adversarial examples [hendrycks2016early, jere2020principal, li2017adversarial]. However, it is well known that detection is inherently weaker than defense in terms of resisting adversarial attacks. Several defense approaches based on image transformation were proposed afterward [guo2017countering, xu2017feature, bhagoji2017dimensionality], but they fail to defend against white-box attacks [carlini2017adversarial, obfuscated]. Recently, Adversarial Training (AT) has been considered one of the most effective defenses, especially in the white-box setting. Madry et al. [madry2018towards] formulated AT in a min-max optimization framework (PGD-AT), which has been widely used as a benchmark. Xie et al. [Xie_2019_CVPR] include feature denoising blocks (FD) in networks to remove adversarial perturbations in the feature domain. SAT [xie2020smooth] uses smooth approximations of the ReLU activation to enhance PGD-AT. Hendrycks et al. [hendrycks2019selfsupervised] added an auxiliary rotation prediction task [gidaris2018unsupervised] to improve PGD-AT (RotNet-AT). SOAP [shi2021online] uses self-supervised signals to purify adversarial examples during inference.
To the best of our knowledge, APAE [goodge2020robustness] might be the only existing defense designed for anomaly detection. It uses approximate projection and feature weighting to reduce adversarial effects. However, its robustness is not fully tested: only anomalous data are perturbed in its evaluation (see Supplementary). Instead, we provide a generic framework for evaluating the adversarial robustness of novelty detectors and our proposed defense method.
3 Attacking novelty detection models
We consider several popular adversarial attacks [dong2018boosting, goodfellow2015explaining, lo2020multav, madry2018towards, papernot2017practical, zajac2019adversarial] and modify their loss objectives to suit the novelty detection problem setup. Here, we take PGD [madry2018towards] as an example to illustrate our attack formulation. The other gradient-based attacks can be extended by a similar formulation (see Supplementary).
Consider an AE-based target model with an encoder $E$ and a decoder $D$, and an input image $x$ with the ground-truth label $y \in \{0, 1\}$, where $y = 0$ denotes the known class and $y = 1$ denotes the novel classes. We generate the adversarial example $x'$ as follows:
$$x'_{0} = x, \qquad x'_{t+1} = \Pi_{\epsilon}\left(x'_{t} + \alpha \cdot (-1)^{y} \cdot \mathrm{sign}\left(\nabla_{x'_{t}} L_{mse}(x'_{t}, x)\right)\right), \qquad (1)$$
where $\alpha$ denotes a step size, $T$ is the number of attack iterations, and $t \in \{0, 1, \dots, T-1\}$. $\Pi_{\epsilon}$ projects each element into an $\ell_{\infty}$-norm bound with perturbation size $\epsilon$ such that $\|x' - x\|_{\infty} \leq \epsilon$. $L_{mse}$ corresponds to the mean square error (MSE) loss defined as follows:
$$L_{mse}(x', x) = \frac{1}{N} \left\| D(E(x')) - x \right\|_{2}^{2}, \qquad (2)$$
where $N$ is the number of pixels in $x$.
Given a test example, if it belongs to the known class, we maximize its reconstruction error (i.e., novelty score) by gradient ascent; while if it belongs to novel classes, we minimize its reconstruction error by gradient descent.
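A minimal sketch of this attack on a toy linear model (the single matrix `A` stands in for the composed decoder-encoder, an assumption made purely for illustration; a real attack backpropagates the MSE loss through the full network):

```python
import numpy as np

np.random.seed(0)

def pgd_attack(x, y, A, eps=0.3, alpha=0.05, steps=10):
    """PGD against a reconstruction-error novelty score mean((A x' - x)^2).
    y = 0 (known class): gradient ascent to inflate the score;
    y = 1 (novel class): gradient descent to deflate it."""
    x_adv = x + np.random.uniform(-eps, eps, size=x.shape)  # random start
    direction = 1.0 if y == 0 else -1.0
    for _ in range(steps):
        grad = 2.0 * A.T @ (A @ x_adv - x) / x.size  # analytic MSE gradient
        x_adv = x_adv + alpha * direction * np.sign(grad)
        x_adv = x + np.clip(x_adv - x, -eps, eps)    # project into L-inf ball
    return x_adv

A = np.array([[1.0, 0.0], [0.0, 0.0]])  # toy model that keeps only dim 0
x = np.array([2.0, 0.0])                # normal input, reconstructed perfectly
score = lambda v: float(np.mean((A @ v - x) ** 2))
x_adv = pgd_attack(x, y=0, A=A)
assert score(x_adv) > score(x)                       # novelty score inflated
assert float(np.abs(x_adv - x).max()) <= 0.3 + 1e-9  # stays in the eps-ball
```

The same loop with `y=1` performs gradient descent, shrinking the score of an anomalous input instead.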
Existing novelty detection methods are vulnerable to this attack (see Sec. 5.2); that is, normal data would be misclassified into novel classes, and anomalous data would be misclassified into the known class. Moreover, this attacking strategy is much stronger than the attacks introduced by [salehi2020arae], which perturb only normal data, and by [goodge2020robustness], which perturb only anomalous data (see Supplementary).
4 Adversarially robust novelty detection
The proposed defense strategy exploits the task-specific knowledge of one-class novelty detection. Specifically, we leverage the fact that a novelty detector’s latent space can be manipulated to a larger extent as long as it retains the known class information. This property is especially useful to remove more adversarial perturbations in the latent space. Therefore, we propose to train a novelty detector by manipulating its latent space such that it can improve adversarial robustness while maintaining the performance on clean data. Note that these characteristics are specific to the novelty detection problem. The majority of visual recognition problems, such as image classification, require a model retaining multiple category information. Hence, a large manipulation on the latent space may hinder the model capability and thus degrade the performance.
In the following subsections, we first briefly review PCA to define the notations used in this paper, then discuss the proposed PLS in detail.
4.1 Revisiting PCA
PCA computes the principal components of a collection of data and uses them to conduct a change of basis on the data through a linear transformation. Consider a data matrix $X \in \mathbb{R}^{n \times d}$, its mean $\mu \in \mathbb{R}^{d}$ and its covariance $\Sigma = (X - \mu)^{\top}(X - \mu)$. $\Sigma$ can be written as $\Sigma = U \Lambda U^{\top}$ via Singular Value Decomposition (SVD), where $U \in \mathbb{R}^{d \times d}$ is an orthogonal matrix containing the principal components of $X$. Here, we define a mapping $g_{k}$ which computes the mean vector and the first $k$ principal components of the given $X$:
$$g_{k}(X) = (\mu, U_{k}),$$
where $U_{k} \in \mathbb{R}^{d \times k}$ keeps only the first $k$ columns of $U$. Now we define the forward and the inverse PCA transformation as a pair of mappings $(f, f^{-1})$; $f$ performs the forward PCA:
$$Z = f(X; \mu, U_{k}) = (X - \mu)\, U_{k},$$
and $f^{-1}$ performs the inverse PCA:
$$\hat{X} = f^{-1}(Z; \mu, U_{k}) = Z U_{k}^{\top} + \mu,$$
where $Z \in \mathbb{R}^{n \times k}$. Finally, we can write the PCA reconstruction of $X$ as $\hat{X} = f^{-1}(f(X; \mu, U_{k}); \mu, U_{k})$.
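These mappings can be sketched directly in NumPy (a sketch mirroring the text's notation; `fit_pca` plays the role of the mapping that returns the mean and the first k components):

```python
import numpy as np

def fit_pca(X, k):
    """Return the mean and the first k principal components of X (n x d)."""
    mu = X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k].T          # U_k has shape d x k

def pca_forward(X, mu, U_k):
    return (X - mu) @ U_k        # f: coordinates in the principal subspace

def pca_inverse(Z, mu, U_k):
    return Z @ U_k.T + mu        # f^{-1}: back to the original space

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1)) @ np.array([[3.0, 1.0]])  # rank-1 data
mu, U1 = fit_pca(X, k=1)
X_hat = pca_inverse(pca_forward(X, mu, U1), mu, U1)
assert np.allclose(X_hat, X)    # rank-1 data is reconstructed exactly with k=1
```

Data outside the span of the kept components is projected away, which is exactly the property PLS exploits below.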
4.2 Principal Latent Space (PLS)
The proposed PLS contains two major components: (1) Vector-PCA and (2) Spatial-PCA. In Vector-PCA, we perform PCA on the vector dimension, and in Spatial-PCA, we perform PCA on the spatial dimension. Let $E$ be the encoder and $D$ be the decoder of a novelty detection model. Let us denote an adversarial image as $x'$; we have its latent space $z = E(x') \in \mathbb{R}^{s \times c}$, where $s$ is the spatial dimensionality obtained by the product of height and width, and $c$ is the vector dimensionality (i.e., the number of channels). Under adversarial attacks, $z$ would be corrupted by adversarial perturbations such that the decoder cannot compute reconstruction errors favorable to novelty detection. We define the proposed PLS as a transformation $T: z \mapsto z^{*}$, which removes adversaries from $z$, where $z^{*}$ is referred to as the principal latent space. The procedure is described below.
First, Vector-PCA computes the mean latent vector $\mu_{v} \in \mathbb{R}^{c}$ and the principal latent vector $u_{v} \in \mathbb{R}^{c}$ of $z$:
$$(\mu_{v}, u_{v}) = g_{1}(z),$$
where we always set $k$ to 1, so $u_{v}$ is the first principal latent vector of $z$. Second, Vector-PCA transforms $z$ to its Vector-PCA space $z_{v}$:
$$z_{v} = f(z; \mu_{v}, u_{v}) \in \mathbb{R}^{s \times 1}.$$
Next, Spatial-PCA computes the mean Vector-PCA map $\mu_{s} \in \mathbb{R}^{s}$ and the principal Vector-PCA maps $U_{s} \in \mathbb{R}^{s \times m}$ of $z_{v}^{\top}$ (we use the word "map" to indicate that they are on the spatial dimension):
$$(\mu_{s}, U_{s}) = g_{m}(z_{v}^{\top}),$$
where $m$ is a hyperparameter. Then, Spatial-PCA transforms $z_{v}$ to its Spatial-PCA space $z_{s}$:
$$z_{s} = f(z_{v}^{\top}; \mu_{s}, U_{s}).$$
Finally, the inverse Spatial-PCA and the inverse Vector-PCA transform $z_{s}$ back to its original dimensionality:
$$\hat{z}_{v} = f^{-1}(z_{s}; \mu_{s}, U_{s}), \qquad z^{*} = f^{-1}(\hat{z}_{v}^{\top}; \mu_{v}, u_{v}),$$
where $\hat{z}_{v}$ is the Spatial-PCA reconstruction of $z_{v}$, and $z^{*}$ is the resulting principal latent space. Fig. 2 gives an overview of this procedure. The decoder then uses $z^{*}$ to reconstruct the input adversarial example as $D(z^{*})$ for computing the novelty score.
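The cascade can be sketched as follows (a toy sketch with pre-fitted components; all variable names and shapes are illustrative). Note how a perturbation lying outside the principal latent direction disappears after the transform:

```python
import numpy as np

def pls_transform(z, mu_v, u_v, mu_s, U_s):
    """Cascade Vector-PCA / Spatial-PCA purification of a latent space z (s x c).
    mu_v, u_v: mean and (single) principal latent vector, each of length c.
    mu_s, U_s: mean Vector-PCA map (length s) and principal maps (s x m)."""
    z_v = (z - mu_v) @ u_v                   # Vector-PCA: single-channel map (s,)
    z_s = (z_v - mu_s) @ U_s                 # Spatial-PCA coefficients (m,)
    z_v_hat = z_s @ U_s.T + mu_s             # inverse Spatial-PCA
    return np.outer(z_v_hat, u_v) + mu_v     # inverse Vector-PCA: back to s x c

s, c = 4, 3
u_v = np.array([1.0, 0.0, 0.0])              # pre-fitted principal latent vector
U_s = np.array([[1.0], [0.0], [0.0], [0.0]]) # one pre-fitted principal map (m = 1)
mu_v, mu_s = np.zeros(c), np.zeros(s)

z_clean = np.outer(np.array([2.0, 0.0, 0.0, 0.0]), u_v)  # lies in both subspaces
z_pert = z_clean.copy()
z_pert[:, 1] += 0.5                          # "adversarial" off-component noise

assert np.allclose(pls_transform(z_clean, mu_v, u_v, mu_s, U_s), z_clean)
assert np.allclose(pls_transform(z_pert, mu_v, u_v, mu_s, U_s), z_clean)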
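The cascade can be sketched as follows (a toy sketch with pre-fitted components; all variable names and shapes are illustrative). Note how a perturbation lying outside the principal latent direction disappears after the transform:

```python
import numpy as np

def pls_transform(z, mu_v, u_v, mu_s, U_s):
    """Cascade Vector-PCA / Spatial-PCA purification of a latent space z (s x c).
    mu_v, u_v: mean and (single) principal latent vector, each of length c.
    mu_s, U_s: mean Vector-PCA map (length s) and principal maps (s x m)."""
    z_v = (z - mu_v) @ u_v                   # Vector-PCA: single-channel map (s,)
    z_s = (z_v - mu_s) @ U_s                 # Spatial-PCA coefficients (m,)
    z_v_hat = z_s @ U_s.T + mu_s             # inverse Spatial-PCA
    return np.outer(z_v_hat, u_v) + mu_v     # inverse Vector-PCA: back to s x c

s, c = 4, 3
u_v = np.array([1.0, 0.0, 0.0])              # pre-fitted principal latent vector
U_s = np.array([[1.0], [0.0], [0.0], [0.0]]) # one pre-fitted principal map (m = 1)
mu_v, mu_s = np.zeros(c), np.zeros(s)

z_clean = np.outer(np.array([2.0, 0.0, 0.0, 0.0]), u_v)  # lies in both subspaces
z_pert = z_clean.copy()
z_pert[:, 1] += 0.5                          # "adversarial" off-component noise

assert np.allclose(pls_transform(z_clean, mu_v, u_v, mu_s, U_s), z_clean)
assert np.allclose(pls_transform(z_pert, mu_v, u_v, mu_s, U_s), z_clean)
```

Latents already in the principal subspaces pass through unchanged, while off-subspace perturbations are projected away.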
4.3 Incremental training
The principal latent components are incrementally trained along with the network weights by exponential moving average (EMA) during training, so we call this process incrementally-trained cascade PCA. Specifically, at training iteration $i$, these components are updated with the following equations:
$$\mu_{v}^{(i)} = (1 - \eta_{v})\, \mu_{v}^{(i-1)} + \eta_{v}\, \mu_{v,i}, \qquad u_{v}^{(i)} = (1 - \eta_{v})\, u_{v}^{(i-1)} + \eta_{v}\, u_{v,i},$$
$$\mu_{s}^{(i)} = (1 - \eta_{s})\, \mu_{s}^{(i-1)} + \eta_{s}\, \mu_{s,i}, \qquad U_{s}^{(i)} = (1 - \eta_{s})\, U_{s}^{(i-1)} + \eta_{s}\, U_{s,i},$$
where $\mu_{v,i}$, $u_{v,i}$, $\mu_{s,i}$ and $U_{s,i}$ are the components computed from the current mini-batch, and $\eta_{v}$ and $\eta_{s}$ are the EMA learning rates.
Consider that the model weights are trained by mini-batch gradient descent with a batch size $B$. The latent tensor of shape $B \times s \times c$ is reshaped to $(B \cdot s) \times c$ for the Vector-PCA $g_{1}$, the resulting $z_{v}$ is reshaped to $B \times s$ for the Spatial-PCA $g_{m}$, and $\hat{z}_{v}$ is reshaped back after the inverse Spatial-PCA. Hence, in a mini-batch, both $g_{1}$ and $g_{m}$ have $B$ times more data points to acquire better principal latent components at each training iteration. At iteration $i$, $f$ and $f^{-1}$ perform with the EMA components $(\mu_{v}^{(i)}, u_{v}^{(i)}, \mu_{s}^{(i)}, U_{s}^{(i)})$. When the training process ends, the well-trained components are denoted as $(\mu_{v}^{*}, u_{v}^{*}, \mu_{s}^{*}, U_{s}^{*})$. During inference, $f$ and $f^{-1}$ perform with these fixed components, while $g_{1}$ and $g_{m}$ do not operate (see Fig. 2). The entire process is differentiable during inference and thus does not cause obfuscated gradients [obfuscated]. This incremental training makes the cascade PCA aware of the network weight updates at each training step, encouraging mutual learning between the network weights and the principal latent components. The entire model can thus be trained end-to-end.
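A rough sketch of this bookkeeping: the reshape that pools a mini-batch for the PCA fit, and the EMA update rule (`eta` and all tensor names are illustrative, not the paper's notation):

```python
import numpy as np

def ema_update(old, new, eta):
    """Exponential-moving-average update of a principal latent component."""
    return (1.0 - eta) * old + eta * new

# A mini-batch of B latents, each s x c, is flattened so the vector-dimension
# PCA fit sees B times more data points at every training iteration.
B, s, c = 8, 16, 32
z_batch = np.zeros((B, s, c))
z_for_vector_pca = z_batch.reshape(B * s, c)  # rows are latent vectors
assert z_for_vector_pca.shape == (B * s, c)

mu_old = np.ones(c)
mu_minibatch = np.zeros(c)                    # component fit on this batch
mu_new = ema_update(mu_old, mu_minibatch, eta=0.1)
assert np.allclose(mu_new, 0.9)
```

The EMA keeps the components stable across iterations while still tracking the slowly changing encoder.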
4.4 Defense mechanism
We further elaborate on how the proposed PLS defends against adversarial attacks. Given an adversarial example $x'$, its latent space $z$ is adversarially perturbed. After Vector-PCA, each latent vector of $z$ is represented by a scaling factor of the learned principal latent vector $u_{v}$ (with a bias term $\mu_{v}$). The Vector-PCA space stores these scaling factors on a single-channel map (i.e., on the spatial domain only). Since all the principal latent components are pre-trained parameters, they are not affected by adversarial perturbations. Replacing the perturbed latent vectors by $u_{v}$ removes the majority of the adversaries. The only place where the remaining adversaries can appear is the scaling factors of $u_{v}$ on the single-channel map. In other words, these adversaries are enclosed within a small subspace, making them easier to expel.
Subsequently, Spatial-PCA reconstructs this small subspace with a set of principal Vector-PCA maps $U_{s}$ (with a bias term $\mu_{s}$). Since $U_{s}$ and $\mu_{s}$ are adversary-free, the remaining adversaries are further removed. From another perspective, this step can be viewed as PCA-based denoising performed in the spatial domain of the features. With the robust principal latent space $z^{*}$, the decoder can obtain a preferred reconstruction error for novelty detection, even in the presence of an adversarial example. Additionally, we perform AT [madry2018towards] to train the model, further improving the robustness.
5 Experiments
We evaluate PLS on eight adversarial attacks, three datasets and six existing novelty detection methods. We further compare PLS with state-of-the-art defense approaches. An extensive ablation study is also presented.
[Table I: mAUROC of each defense per dataset, under no attack (Clean) and under FGSM [goodfellow2015explaining], PGD [madry2018towards], MI-FGSM [dong2018boosting], MultAdv [lo2020multav], AF [zajac2019adversarial] and black-box [papernot2017practical] attacks.]
5.1 Experimental setup
Datasets. We use three datasets for evaluation: MNIST [lecun2010mnist], Fashion-MNIST (F-MNIST) [xiao2017fashion] and CIFAR-10 [krizhevsky2009learning]. MNIST consists of 28 × 28 grayscale handwritten digits from 0 to 9. It contains 60,000 training data and 10,000 test data. F-MNIST is composed of 28 × 28 grayscale images from 10 fashion product categories. It comprises 60,000 training data and 10,000 test data. CIFAR-10 consists of 32 × 32 color images from 10 different classes. There are 50,000 training and 10,000 test images in this dataset.
We simulate a one-class novelty detection scenario by the following protocol. Given a dataset, each class is defined as the known class at a time, and a model is trained with the training data of this known class. During inference, the test data of the known class are considered normal, and the test data of the other classes (i.e., novel classes) are considered anomalous. We select the anomalous data from each novel class equally to constitute half of the test set, where the anomalous data within a novel class are selected randomly. Hence, our test set contains 50% anomalous data, where each novel class accounts for the same proportion. The area under the Receiver Operating Characteristic curve (AUROC) value is used as the evaluation metric, where the ROC curve is obtained by varying the threshold of the novelty score. For each dataset, we report the mean AUROC (mAUROC) across its 10 classes.
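The AUROC used here can be computed directly from novelty scores via the rank (Mann-Whitney) formulation, without choosing any threshold (a self-contained sketch):

```python
import numpy as np

def auroc(scores, labels):
    """Probability that a randomly drawn anomalous example (label 1)
    receives a higher novelty score than a randomly drawn normal one."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]   # anomalous (novel-class) scores
    neg = scores[labels == 0]   # normal (known-class) scores
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(wins + 0.5 * ties)

assert auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]) == 1.0  # perfect separation
assert auroc([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0]) == 0.0  # fully inverted
```

This pairwise form is equivalent to integrating the ROC curve over all thresholds of the novelty score.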
Attack setting. We test adversarial robustness against five white-box attacks, including FGSM [goodfellow2015explaining], PGD [madry2018towards], MI-FGSM [dong2018boosting], MultAdv [lo2020multav] and AF [zajac2019adversarial], where PGD is the default attack if not otherwise specified. A black-box attack and two adaptive attacks [papernot2017practical, tramer2020adaptive] are also considered. All the attacks are implemented based on the formulation in Sec. 3.
For FGSM, PGD and MI-FGSM, we set to for MNIST, for F-MNIST, and for CIFAR-10. For MultAdv, we set to for MNIST, for F-MNIST, and for CIFAR-10. For AF, we set to , and for MNIST, F-MNIST and CIFAR-10, respectively. The framing width is set to . The number of attack iterations is set to for FGSM and for the other attacks.
Baseline defenses. To the best of our knowledge, APAE [goodge2020robustness] might be the only present defense designed for anomaly detection. In addition to APAE, we implement five commonly-used defenses, which are originally designed for classification tasks, in the context of novelty detection. They are PGD-AT [madry2018towards], FD [Xie_2019_CVPR], SAT [xie2020smooth], RotNet-AT [hendrycks2019selfsupervised] and SOAP [shi2021online], where FD, SAT and RotNet-AT incorporate PGD-AT. We use Gaussian non-local means [buades2005non] for FD, Swish [hendrycks2016gaussian] for SAT, and RotNet [gidaris2018unsupervised] for SOAP. These are their well-performing versions.
Benchmark novelty detectors. We apply PLS to six novelty detection methods, including a vanilla AE, VAE [kingma2013auto], AAE [makhzani2015adversarial], ALOCC [sabokrou2018adversarially], GPND [pidhorskyi2018generative] and ARAE [salehi2020arae], where the vanilla AE is the default novelty detector if not otherwise specified. PLS is added after the last layer of the novelty detection models’ encoder.
In order to evenly evaluate the adversarial robustness of these approaches, we unify their AE backbones into the following architecture. The encoder consists of four 3 × 3 convolutional layers, where each of the first three layers is followed by a 2 × 2 max-pooling with stride 2. We use a base channel size of 64 and increase the number of channels by a factor of 2. The decoder mirrors the encoder but replaces every max-pooling by a bilinear interpolation with a factor of 2. All the convolutional layers are followed by a batch normalization layer [ioffe2015batch] and ReLU.
Implementation details. All the models are trained by the Adam optimizer [kingma2014adam] with weight decay, where the learning rate is decreased by a factor of 10 at the 20th and 40th epochs. The batch size is 128. For PLS, we set $k$ to 1 and $m$ to 8 (see Sec. A4), and the initial EMA learning rates $\eta_{v}$ and $\eta_{s}$ are also decreased by a factor of 10 at the 20th and 40th epochs.
[Table II: mAUROC of each defense and test type per dataset, for AE, VAE [kingma2013auto], AAE [makhzani2015adversarial], ALOCC [sabokrou2018adversarially], GPND [pidhorskyi2018generative] and ARAE [salehi2020arae].]
5.2 Adversarial robustness
5.2.1 White-box attacks
The robustness of one-class novelty detection against various white-box attacks is reported in Table I, where the vanilla AE is used. Without a defense, mAUROC scores drop significantly under all the white-box attacks, which shows the vulnerability of novelty detectors to adversarial examples. PGD-AT improves adversarial robustness to a great extent. FD makes a slight improvement upon PGD-AT in most cases. SAT and RotNet-AT are not effective upon PGD-AT in the context of novelty detection. SOAP performs well in some cases but not uniformly. Compared to the other methods, APAE generally shows less robustness. The proposed method, PLS, significantly increases mAUROC with PGD-AT, outperforming the other defenses by a clear margin. Moreover, PLS is consistently better across all five white-box attacks on all three datasets.
PLS-knowledgeable attacks. As discussed above, in a white-box attack, attackers are aware of the presence of the defense mechanism, i.e., PLS (it is differentiable at inference time, see Sec. 4). However, they rely only on the novelty detection objective (i.e., the MSE loss, see Eq. (2)) to generate adversarial examples. In this subsection, we follow the practice of the most recent adversarial defense studies, such as [shi2021online], to thoroughly evaluate the proposed defense mechanism. More precisely, we try to find an adaptive attack [papernot2017practical, tramer2020adaptive] by giving the attacker full knowledge of the PLS defense mechanism. We refer to this type of attack as a PLS-knowledgeable attack.
We construct two PLS-knowledgeable attacks, Knowledgeable A and Knowledgeable B. They jointly optimize Eq. (2) and an auxiliary loss developed with the knowledge of PLS. Knowledgeable A attempts to minimize the $\ell_{2}$-norm between the latent space before and after the PLS transformation $T$. The intuition is to void PLS such that the input and the output latent space of PLS become closer. In other words, Knowledgeable A replaces Eq. (2) with the following equation:
$$L_{A}(x', x) = L_{mse}(x', x) - \lambda_{A} \left\| E(x') - T(E(x')) \right\|_{2},$$
where $\lambda_{A}$ is a trade-off parameter. Knowledgeable B attempts to maximize the $\ell_{2}$-norm between the latent space of the current adversarial example and its clean counterpart after the PLS transformation. The intuition is to keep the adversarial latent space away from the clean one. In other words, Knowledgeable B replaces Eq. (2) with the following equation:
$$L_{B}(x', x) = L_{mse}(x', x) + \lambda_{B} \left\| T(E(x')) - T(E(x)) \right\|_{2},$$
where $\lambda_{B}$ is a trade-off parameter. When $\lambda_{A} = 0$ or $\lambda_{B} = 0$, the PLS-knowledgeable attacks reduce to the conventional white-box attacks.
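For concreteness, the Knowledgeable A objective can be sketched as below (the function names `encode`, `decode`, `pls` are placeholders for the model's components; a real attack backpropagates this loss through the full network):

```python
import numpy as np

def knowledgeable_a_loss(x_adv, x, encode, decode, pls, lam):
    """MSE objective plus an auxiliary term that tries to void PLS by pulling
    the latent space before and after the PLS transformation together."""
    z = encode(x_adv)
    mse = float(np.mean((decode(pls(z)) - x) ** 2))
    gap = float(np.linalg.norm(z - pls(z)))  # latent change caused by PLS
    return mse - lam * gap   # attacker ascends mse while shrinking the gap

identity = lambda v: v
# If PLS acted as the identity, the auxiliary term would vanish and the
# objective would reduce to the plain MSE loss.
loss = knowledgeable_a_loss(np.ones(4), np.zeros(4), identity, identity, identity, lam=0.5)
assert loss == 1.0
```

Knowledgeable B swaps the auxiliary term for a distance between the transformed adversarial and clean latents, added with the opposite sign.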
In Fig. 3, we can observe that mAUROC monotonically increases as $\lambda_{A}$ or $\lambda_{B}$ increases. That is, these PLS-knowledgeable attacks cannot further reduce PLS's mAUROC, since the additional auxiliary loss terms attenuate the MSE loss gradients. This indicates that attackers cannot straightforwardly benefit from the knowledge of PLS. Hence, the conventional white-box attack still has the greatest attacking strength. This result shows that it is not easy to find a stronger attack to break PLS, even with full knowledge of the PLS mechanism.
5.2.2 Black-box attacks
The robustness against black-box attacks [papernot2017practical] is shown in the last column of Table I. Here, we consider a naturally trained (i.e., trained with only clean data) GPND as a substitute model and apply MI-FGSM, which has better transferability, to generate black-box adversarial examples for the target models. As we can see, the defenses with PGD-AT degrade black-box robustness, which is consistent with the observation in classification tasks [tramer2018ensemble]. SOAP, which does not use AT, shows better black-box robustness. PLS greatly improves black-box robustness even with PGD-AT, and it is consistently better across all datasets. A naturally trained PLS achieves 0.907, 0.742 and 0.332 mAUROC on MNIST, F-MNIST and CIFAR-10, respectively, under the black-box attack.
Table II shows the adversarial robustness of various state-of-the-art novelty detection models. All of them are susceptible to adversarial attacks. We attach the PLS module to these models to protect them. We can see that PLS uniformly robustifies all of these novelty detectors and significantly outperforms the other defense approaches. This confirms that PLS is applicable to a wide variety of the present novelty detection methods, demonstrating its excellent generalizability.
5.3 Performance on clean data
We also evaluate the performance of PLS on clean data. In this experiment, all the models are naturally trained. As shown in Table III, PLS improves the performance upon the original network architecture (No Defense), while the other defenses do not make obvious improvements. This shows that PLS generalizes better for both clean data and adversarial examples. PLS enjoys this benefit because the principal latent components are learned from only the latent space of the known class. Due to this, when transforming the latent space of a novel-class image, PLS projects it into the known-class space defined by the principal latent components. This brings the transformed latent space closer to the latent space of the known class, so the decoder tries to reconstruct it into a known-class image. Consequently, reconstruction errors are high for novel-class images while the reconstruction of known-class images is barely affected.
5.4 Ablation study
PLS components. Table IV reports the results of different PLS variants. First, Vector-PCA alone significantly improves the robustness upon PGD-AT. This shows that the mechanism of replacing perturbed latent vectors by the incrementally-trained principal latent vector is effective. As discussed earlier, in PLS the adversaries can stay only on the scaling factors of the principal latent vector. Next, we further remove these adversaries with a denoising operation on the spatial dimension. We deploy a feature denoising block [Xie_2019_CVPR] after the forward Vector-PCA; this baseline is denoted as Vector-PCA+FD. It makes a slight improvement over the Vector-PCA baseline. Finally, the complete PLS uses Spatial-PCA for this purpose instead, achieving a large mAUROC increase. This shows Spatial-PCA's advantage over FD in our case.
Stability of latent space. We compute the mean $\ell_{2}$-norm between the latent space of adversarial examples and that of their clean counterparts, i.e., the average of $\left\| z' - z \right\|_{2}$ over the test set, where $z'$ and $z$ denote the latent spaces fed to the decoder for the adversarial and the clean input, respectively. As can be seen in Fig. 4, PLS's mean $\ell_{2}$-norm is three orders of magnitude smaller than that of the other defenses. This indicates that PLS's latent space is barely affected by adversaries, showing PLS's effectiveness in adversary removal.
Reconstruction errors. For an AE-style novelty detection model, normal data and anomalous data are expected to get low and high reconstruction errors, respectively. The model follows this behavior given clean data, as shown in Fig. 5(a). When an attacker attempts to maximize the reconstruction errors of normal data and minimize those of anomalous data, the model makes wrong predictions, as shown in Fig. 5(b). Fig. 5(c) shows that PGD-AT pulls back the enlarged reconstruction errors of normal data, but they still overlap with those of the anomalous data. In Fig. 5, it can be observed that PLS pushes up the reconstruction errors of anomalous data with a better margin. Although the reconstruction errors of normal data also increase, the gap between normal and anomalous data is retained, resulting in PLS performing better under attacks.
Reconstructed images. Fig. 6 compares the reconstructed images of the No Defense model and PLS under the PGD attack. Digit 2 of MNIST is used as the known class. We can see that the No Defense model captures the shape of the adversarial anomalous data and thus produces fair reconstructions. In other words, the reconstruction error gap between normal data and anomalous data is insufficiently large. This observation is consistent with the quantitative result that it is not adversarially robust. In contrast, PLS reconstructs every input into the known class of digit 2. Hence, even under attack, PLS can obtain very high reconstruction errors for anomalous data and low errors for normal data.
6 Conclusion
In this paper, we study adversarial robustness in the context of the one-class novelty detection problem. We show that existing novelty detection models are vulnerable to adversarial perturbations and propose a defense method referred to as Principal Latent Space (PLS). Specifically, PLS purifies the latent space through an incrementally-trained cascade PCA process. Moreover, we construct a generic evaluation framework to fully test the effectiveness of the proposed PLS. We perform extensive experiments on multiple datasets with multiple existing novelty detection models and consider various attacks, showing that PLS improves robustness consistently across different attacks and datasets.
This work was supported by the DARPA GARD Program HR001119S0026-GARD-FP-052.
A1 Basic sanity checks for evaluation
To further verify that the proposed PLS's robustness is not due to obfuscated gradients, we report our results on the basic sanity checks introduced in Athalye et al. [obfuscated].
Table I shows that iterative attacks (PGD [madry2018towards] and MI-FGSM [dong2018boosting]) are stronger than one-step attacks (FGSM [goodfellow2015explaining]).
Table I shows that white-box attacks are stronger than black-box attacks (by MI-FGSM).
Unbounded attacks reach a 100% attack success rate (AUROC drops to 0.000) on all three datasets.
Fig. 9 shows that increasing the distortion bound increases the attack success rate (decreases AUROC).
A2 More on attack formulation
In Sec. 3, we take PGD [madry2018towards] as an example to illustrate the proposed attacking method against novelty detection models. Here, we elaborate on the formulation of the other attacks we used in this paper, including MI-FGSM [dong2018boosting], AF [zajac2019adversarial] and MultAdv [lo2020multav].
Consider an AE-based target model with an encoder $En$ and a decoder $De$, and an input image $x$ with the ground-truth label $y \in \{0, 1\}$, where "$0$" denotes the known class and "$1$" denotes the novel classes. MI-FGSM generates the adversarial example as follows:
$$g_{t+1} = \mu \cdot g_t + \frac{\nabla_{x} L_{mse}(x_t^{adv})}{\left\| \nabla_{x} L_{mse}(x_t^{adv}) \right\|_1},$$
where $g_t$ gathers the gradients of the first $t$ iterations with a decay factor $\mu$, and $L_{mse}$ corresponds to the MSE loss defined in Eq. 2. Then,
$$x_{t+1}^{adv} = \Pi_{\epsilon}\left( x_t^{adv} + \alpha \cdot \mathrm{sign}(g_{t+1}) \right),$$
where $t \in \{0, 1, \dots, T-1\}$, $\alpha$ denotes a step size, $T$ is the number of attack iterations, and $x_0^{adv} = x$. $\Pi_{\epsilon}$ projects its element into an $\ell_\infty$-norm bound with perturbation size $\epsilon$ such that $\left\| x^{adv} - x \right\|_\infty \leq \epsilon$.
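As a concrete illustration of the update above, the following is a minimal NumPy sketch of MI-FGSM against a reconstruction loss, using a toy linear auto-encoder whose loss gradient is available in closed form (the weights, step size, and bound here are illustrative only, not the models or settings used in our experiments):

```python
import numpy as np

def mifgsm_attack(grad_fn, x, eps=0.1, alpha=0.02, T=5, mu=1.0):
    """MI-FGSM against a reconstruction loss: ascend with momentum and
    project back into an L_inf ball of radius eps around the clean x."""
    x_adv = x.copy()
    g = np.zeros_like(x)
    for _ in range(T):
        grad = grad_fn(x_adv)                             # dL_mse/dx at the current iterate
        g = mu * g + grad / (np.abs(grad).sum() + 1e-12)  # momentum accumulation (decay mu)
        x_adv = x_adv + alpha * np.sign(g)                # step along the momentum sign
        x_adv = np.clip(x_adv, x - eps, x + eps)          # L_inf projection
    return x_adv

# Toy linear AE: De(En(x)) = A @ x, so L_mse(x) = ||A x - x||^2 has the
# closed-form gradient 2 (A - I)^T (A - I) x.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2)) @ rng.normal(size=(2, 4))  # decoder @ encoder
R = A - np.eye(4)
loss = lambda z: float(((R @ z) ** 2).sum())
grad_fn = lambda z: 2.0 * R.T @ (R @ z)
x = rng.normal(size=4)
x_adv = mifgsm_attack(grad_fn, x)
```

The perturbed input stays within the $\ell_\infty$ bound while its reconstruction error grows, which is exactly what drives the novelty score in the wrong direction.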
AF adds adversarial perturbations on the border of an image, while the remaining pixels are kept unchanged. We generate the AF example as follows:
$$x_{t+1}^{adv} = x_t^{adv} + \alpha \cdot M \odot \mathrm{sign}\left( \nabla_{x} L_{mse}(x_t^{adv}) \right),$$
where $M$ is the AF mask. Let $(i, j)$ be a pixel index of $M$. If $(i, j)$ is on the border of $M$ within a framing width $w$, $M_{(i,j)} = 1$; otherwise, $M_{(i,j)} = 0$.
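The framing mask can be built in a few lines. This NumPy sketch (the image size and framing width are arbitrary choices for illustration) makes explicit that interior pixels are never touched, since the interior term of $M \odot \mathrm{sign}(\cdot)$ is zero:

```python
import numpy as np

def af_mask(h, w, width):
    """Binary AF mask: 1 on a border frame of the given framing width,
    0 on the interior pixels (which the attack leaves unchanged)."""
    M = np.ones((h, w))
    M[width:h - width, width:w - width] = 0.0
    return M

# A 6x6 image with framing width 1: only the 20 border pixels are attackable.
M = af_mask(6, 6, 1)
```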
MultAdv produces adversarial examples via the multiplicative operation, formulated as follows:
$$x_{t+1}^{adv} = \Pi_{\epsilon_m}\left( x_t^{adv} \odot \alpha_m^{\,\mathrm{sign}\left( \nabla_{x} L_{mse}(x_t^{adv}) \right)} \right),$$
where $\alpha_m$ is the multiplicative step size, and $\Pi_{\epsilon_m}$ performs projection with ratio bound $\epsilon_m$ such that $1/\epsilon_m \leq x^{adv} \oslash x \leq \epsilon_m$ element-wise, with $\oslash$ denoting element-wise division. Eq. 2 is used as the loss objective for AF and MultAdv as well to suit the novelty detection problem setup.
A3 More on reconstructed images
In Sec. 5.4, we compare the reconstructed images under the PGD attack [madry2018towards] in Fig. 6. In this section, Fig. 7 presents the reconstructed images under the AF attack [zajac2019adversarial]. It can be observed that No Defense captures the shape of the adversarial anomalous data and thus produces fair reconstructions of them, yet fails to reconstruct recognizable patterns under AF. Hence, the resulting reconstruction errors would lead the novelty detector to make wrong predictions. In contrast, PLS reconstructs every input into the known class of digit 2. Therefore, even under AF, PLS obtains very high reconstruction errors from anomalous data and low errors from normal data. These observations are consistent with the quantitative results in Table I.
A4 Trade-off of the hyperparameter values
This section looks into the trade-off between the two hyperparameter values of the proposed PLS, i.e., the numbers of retained principal components. Table V reports our results on the MNIST dataset [lecun2010mnist]. For both varying the first value (with the second fixed at 8) and varying the second value (with the first fixed at 1), we observe that larger values lead to lower PGD accuracy but higher clean accuracy. The reason is that retaining more components preserves more semantic information in the feature maps while simultaneously keeping more adversarial perturbation. Setting the second value to 1 is an exception: it yields lower PGD accuracy because it loses too much information. According to this trade-off analysis, we set the two values to 1 and 8, respectively, for PLS, as discussed in Sec. 5.1.
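The trade-off can be illustrated with plain PCA: retaining more principal components reproduces the data more faithfully, but by the same token preserves more of whatever perturbation the data carries. A small NumPy sketch on synthetic data (not our feature maps):

```python
import numpy as np

def pca_reconstruct(X, q):
    """Project centered data X (n x d) onto its top-q principal components
    and map back. More retained components give a more faithful
    reconstruction, but also preserve more of any perturbation."""
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:q]                                 # top-q principal directions
    return (X - mean) @ P.T @ P + mean

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
errors = [np.linalg.norm(X - pca_reconstruct(X, q)) for q in (1, 4, 8)]
# errors shrink monotonically; q = 8 (full rank) reconstructs X exactly
```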
A5 Attack budgets
To fully evaluate the effectiveness of the proposed PLS, we test its scalability to different attack budgets. We vary the attack budget in two aspects: the number of attack iterations $T$ and the perturbation size $\epsilon$. The results are presented in Fig. 8 and Fig. 9, respectively.
First, we can see that the attack strength does not increase noticeably with $T$. This observation is consistent with those of Madry et al. [madry2018towards] and Xie et al. [Xie_2019_CVPR]. The proposed PLS shows constant adversarial robustness and consistently performs better than No Defense and PGD-AT [madry2018towards] under different $T$.
On the other hand, the attack strength increases significantly with $\epsilon$. It can be observed that PLS consistently demonstrates better robustness under different $\epsilon$. Hence, PLS is scalable to different attack budgets.
A6 Further comparison with ARAE
ARAE [salehi2020arae] touches on the adversarial robustness of novelty detection, though its main purpose is improving detection performance. As mentioned in Sec. 3, ARAE's adversarial robustness is not thoroughly evaluated. In this section, we make a comprehensive comparison with ARAE.
First, ARAE evaluates adversarial robustness by crafting adversarial examples from only the normal test data (the known class). We name this attack PGD-normal. In contrast, our attack method crafts adversarial examples from all test data regardless of their class (see Sec. 3). We reproduce PGD-normal with the same settings as in Sec. 5.1. As shown in Table VI, the proposed attack (denoted as PGD) is stronger than PGD-normal: PGD obtains lower mAUROC across all the considered defense methods and datasets. Intuitively, perturbing every input sample poses a stronger attack.
Second, ARAE performs AT on latent space-based adversarial examples. We name this attack PGD-latent. In contrast, in this paper, we perform AT on reconstruction error-based adversarial examples (see Sec. 3). We reproduce PGD-latent with the same settings as in Sec. 5.1. As can be seen in Table VI, PGD is much stronger than PGD-latent: PGD obtains lower mAUROC across all the considered defense methods and datasets.
Third, since a novelty detector does not know whether an input image is adversarial during inference, it should compute the novelty score as the reconstruction error between the reconstructed image and the input image, not between the reconstructed image and the clean image. For instance, if a given test image is an adversarial example $x^{adv}$, a novelty detector should compute $\left\| De(En(x^{adv})) - x^{adv} \right\|_2^2$ instead of $\left\| De(En(x^{adv})) - x \right\|_2^2$ as the novelty score, where $En$ and $De$ denote the encoder and decoder, and $x$ is the clean image. Therefore, to craft a strong adversarial example, one should maximize the reconstruction error between the reconstructed image and the input image. The proposed attack is based on this property; that is, at each attack iteration, we maximize the reconstruction error between the current adversarial example and the reconstruction of that current adversarial example (see Eq. 2). We create an attack variant, PGD-clean, which maximizes the reconstruction error between the clean image and the reconstruction of the current adversarial example. Specifically, PGD-clean replaces the loss objective Eq. 2 with the following:
$$L_{clean}(x_t^{adv}) = \left\| De(En(x_t^{adv})) - x \right\|_2^2.$$
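The difference between the two objectives can be made explicit with a toy example (an illustrative sketch; `loss_input` plays the role of Eq. 2 and `loss_clean` that of the PGD-clean objective, with hypothetical reconstruction values):

```python
import numpy as np

def loss_input(recon, x_input):
    """Eq. 2-style objective: reconstruction error w.r.t. the current
    (possibly adversarial) input -- what the detector computes at test time."""
    return float(np.mean((recon - x_input) ** 2))

def loss_clean(recon, x_clean):
    """PGD-clean-style objective: reconstruction error w.r.t. the clean
    image, which the detector never sees during inference."""
    return float(np.mean((recon - x_clean) ** 2))

# If the AE reproduces the clean image while the input is perturbed,
# the two objectives disagree sharply:
recon = np.array([1.0, 1.0])
x_clean = np.array([1.0, 1.0])
x_adv = np.array([0.0, 0.0])
```

An attack that maximizes `loss_clean` may leave `loss_input`, the quantity the detector actually thresholds, nearly unchanged, which is why it is the weaker objective.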
ARAE uses this form. As shown in Table VI, PGD is much stronger than PGD-clean: PGD obtains lower mAUROC across all the considered defense methods and datasets. Therefore, we perform AT by minimizing Eq. 2 to build a stronger defense, while ARAE minimizes Eq. 20.
In summary, the proposed attack is stronger than PGD-normal, PGD-latent and PGD-clean. Hence, we are able to evaluate the adversarial robustness of a novelty detector carefully and strictly. Moreover, conducting AT against a stronger attack enhances robustness to a greater extent. We hope to provide researchers with a benchmark for future work on the adversarial robustness of one-class novelty detection.
A7 Further comparison with APAE
To the best of our knowledge, APAE [goodge2020robustness] may be the only existing defense designed for anomaly detection. However, as mentioned in Sec. 3, APAE's adversarial robustness is not thoroughly evaluated. In this section, we make further comparisons with APAE.
First, APAE evaluates adversarial robustness by crafting adversarial examples from only the anomalous test data (the unknown class). We name this attack PGD-anomalous. In contrast, our attack method crafts adversarial examples from all test data regardless of their class (see Sec. 3). We reproduce PGD-anomalous with the same settings as in Sec. 5.1. As shown in Table VI, the proposed attack (denoted as PGD) is stronger than PGD-anomalous: PGD obtains lower mAUROC across all the considered defense methods and datasets. Intuitively, perturbing every input sample poses a stronger attack. On the other hand, under PGD-anomalous, No Defense attains the best mAUROC among the compared defenses. The reason is that the other defenses use only normal data for AT, so they overfit to adversarial normal data and show less robustness against PGD-anomalous.
Second, APAE claims that AT is inapplicable to the novelty detection problem. In contrast, in this paper, we show that AT is in fact applicable to novelty detection: we can craft adversarial examples from the normal data to train the target model. Indeed, as can be seen in Table VI, models using AT are less robust to PGD-anomalous. However, against the stronger attacks that contain adversarial normal data, AT significantly improves robustness.
As discussed in Sec. A6, we construct a proper evaluation protocol to fully test the adversarial robustness of novelty detectors. With a good evaluation protocol, we are able to design a better defense method accordingly.
A8 Comparison with vector quantization
The proposed PLS learns a principal latent vector, which is adversary-free, to replace perturbed latent vectors and enhance adversarial robustness. An alternative way of learning adversary-free latent vectors is vector quantization. VQ-VAE [van2017neural] is an AE variant that uses the vector quantization technique to improve generation ability. To the best of our knowledge, VQ-VAE has not been adopted in the context of novelty detection. In this section, we implement VQ-VAE for one-class novelty detection and evaluate its adversarial robustness. We set the number of embeddings to 4 for MNIST, 8 for F-MNIST and 256 for CIFAR-10; these values yielded the best robustness in our experiments.
Because the quantization step is non-differentiable, it causes obfuscated gradients [obfuscated]. Hence, we build a neural network, consisting of four fully connected layers, to learn the mapping from the latent vectors (the output of the encoder) to the quantized latent vectors (the corresponding embedding vectors). Since this neural network is differentiable, we use it to approximate the gradients of the non-differentiable part to perform the PGD attack [madry2018towards]. For comparison, we train another neural network with the same architecture to learn the mapping from the latent space to the principal latent space of PLS.
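For reference, the quantization step in question is a nearest-embedding lookup, sketched below in NumPy (toy embedding values; not the VQ-VAE training code):

```python
import numpy as np

def vector_quantize(z, embeddings):
    """VQ-VAE-style quantization: replace each latent vector with its
    nearest embedding under Euclidean distance. The argmin is
    non-differentiable, which is why gradient-based attacks need a
    differentiable approximation of this step."""
    d = ((z[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)  # (n, K) distances
    idx = d.argmin(axis=1)
    return embeddings[idx], idx

emb = np.array([[0.0, 0.0], [1.0, 1.0]])      # toy codebook with K = 2
z = np.array([[0.1, 0.2], [0.9, 0.8]])        # two latent vectors
zq, idx = vector_quantize(z, emb)
```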
Table VII reports the experimental results. Comparing PLS (PGD examples generated from the entire differentiable network) with PLS* (PGD examples generated via the neural network gradient approximator), we can see that the neural network still cannot perfectly approximate the gradients, so the resulting attack is weaker. However, even under this weaker attack, VQ-VAE achieves lower mAUROC than PLS on MNIST and F-MNIST, and much lower mAUROC than PLS* on all the datasets. This shows that PLS is more robust than VQ-VAE.
The explanations are as follows. First, PLS's principal latent vector is learned by the incrementally-trained cascade PCA process; it is not only adversary-free but also contains important features that can properly substitute for the original latent vectors. In contrast, VQ-VAE's embedding vectors are randomly initialized, and even with the training strategy in [van2017neural], they do not end up close to the original latent vectors. Therefore, PLS's principal latent vector is a better adversary-free substitute. Second, after Vector-PCA, PLS's Vector-PCA map stores the scaling factors of the principal latent vector with spatial information, so we can perform Spatial-PCA on it to further remove remaining adversarial perturbation. In contrast, the vector quantization map stores the indices of the embedding vectors, and no further operations can be performed on these indices. These points demonstrate the advantages of the proposed PLS.
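To make the second point concrete, the substitution can be sketched as follows: each spatial latent vector is replaced by its projection onto a single unit principal latent vector, and the per-position scaling factors are kept as a spatial map on which a later spatial PCA stage could operate. This NumPy sketch is an illustrative simplification under the single-principal-vector assumption, not the paper's implementation; the latent map and vector values are made up:

```python
import numpy as np

def vector_pca_substitute(Z, v):
    """Replace each spatial latent vector Z[i, j] (Z has shape H x W x C)
    with its projection onto a single unit principal latent vector v,
    returning the substitute latents and the H x W map of scaling factors."""
    scale = Z @ v                  # per-position coefficient along v
    Z_sub = scale[..., None] * v   # substitute (adversary-free) latent vectors
    return Z_sub, scale

v = np.array([1.0, 0.0])                  # unit principal latent vector
Z = np.array([[[2.0, 0.5], [0.0, 3.0]]])  # toy 1 x 2 x 2 latent map
Z_sub, scale = vector_pca_substitute(Z, v)
```

Components orthogonal to the principal vector (where any adversarial perturbation would have to live under this assumption) are discarded, while the spatial scaling map survives for further processing.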
A9 Comparison with the defenses that use dimensionality reduction techniques
A few studies employ vanilla PCA to counter adversarial attacks in the image classification problem. Hendrycks & Gimpel [hendrycks2016early] and Jere et al. [jere2020principal] utilized PCA to detect adversarial examples. Li & Li [li2017adversarial] performed PCA in the feature domain and used a cascade classifier to detect adversarial examples. However, detection is inherently weaker than defense in terms of resisting adversarial attacks. Bhagoji et al. [bhagoji2017dimensionality] mapped each input image into a dimensionality-reduced PCA space to defend against adversarial attacks, but this fails to resist white-box attacks [carlini2017adversarial]. As discussed in Sec. 1, image classification requires a model containing sophisticated semantic information, and a large manipulation such as dimensionality reduction would hurt the model's capability. Hence, it is counterintuitive to use dimensionality reduction to robustify image classification models.
In contrast, we target a different downstream application, one-class novelty detection. As discussed in Sec. 1, novelty detection has a peculiar property: a novelty detector's latent space can be manipulated to a larger extent as long as it retains the known-class information. This makes it naturally suitable for dimensionality reduction techniques, which can remove adversarial perturbation and maintain model capability simultaneously. Furthermore, we propose a novel training scheme that learns incrementally-trained cascade principal components in the latent space. The proposed defense method is fully differentiable at inference time and is highly robust to white-box attacks, as shown in Sec. 5.2.
A10 Evaluation with FPR at 95% TPR
A11 AUROC of each class