Optimal Transport Classifier: Defending Against Adversarial Attacks by Regularized Deep Embedding

11/19/2018 ∙ by Yao Li, et al.

Recent studies have demonstrated the vulnerability of deep convolutional neural networks to adversarial examples. Inspired by the observation that the intrinsic dimension of image data is much smaller than its pixel space dimension and that the vulnerability of neural networks grows with the input dimension, we propose to embed high-dimensional input images into a low-dimensional space to perform classification. However, arbitrarily projecting the input images to a low-dimensional space without regularization will not improve the robustness of deep neural networks. Leveraging optimal transport theory, we propose a new framework, Optimal Transport Classifier (OT-Classifier), and derive an objective that minimizes the discrepancy between the distribution of the true label and the distribution of the OT-Classifier output. Experimental results on several benchmark datasets show that our proposed framework achieves state-of-the-art performance against strong adversarial attack methods.


1 Introduction

Deep neural networks (DNNs) have been widely used for tackling numerous machine learning problems that were once believed to be challenging. With their remarkable ability to fit training data, DNNs have achieved revolutionary successes in many fields such as computer vision, natural language processing, and robotics. However, they were shown to be vulnerable to adversarial examples that are generated by adding carefully crafted perturbations to original images. The adversarial perturbations can arbitrarily change the network’s prediction but are often too small to affect human recognition [26, 12]. This phenomenon raises security concerns for practical applications of deep learning.

Two main types of attack settings have been considered in recent research [10, 3, 6, 22]: black-box and white-box. In the black-box setting, the attacker can provide any input and receive the corresponding prediction, but cannot access the gradients or model parameters. In the white-box setting, the attacker can analytically compute the model’s gradients and has full access to the model architecture and weights. In this paper, we focus on defending against white-box attacks, which is the harder task.

Figure 1: Overview of OT-Classifier framework

Recent work [25] presented both theoretical arguments and an empirical one-to-one relationship between input dimension and adversarial vulnerability, showing that the vulnerability of neural networks grows with the input dimension. Therefore, reducing the data dimension may help improve the robustness of deep neural networks. Furthermore, a consensus in the high-dimensional data analysis community is that when a method works well on high-dimensional data, it is because the data is not really high-dimensional [14]. Such high-dimensional data, for example images, are actually embedded in a much lower dimensional space. Hence, carefully reducing the input dimension may improve the robustness of the model without sacrificing performance.

Inspired by the observation that the intrinsic dimension of image data is actually much smaller than its pixel space dimension [14] and that the vulnerability of a model grows with its input dimension [25], we propose a defense framework that embeds input images into a low-dimensional space using a deep encoder and performs classification based on the latent embedding with a classifier network. However, arbitrarily projecting input images to a low-dimensional space with a deep encoder does not guarantee improved robustness, because many mapping functions from the raw input space to the low-dimensional space, including pathological ones, are capable of minimizing the classification loss. To constrain the mapping function, we employ distribution regularization in the embedding space, leveraging optimal transport theory. We call our new classification framework Optimal Transport Classifier (OT-Classifier). More specifically, we introduce a discriminator in the latent space which tries to separate the code vectors generated by the encoder network from the ideal code vectors sampled from a prior distribution, i.e., a standard Gaussian distribution. Employing a competitive mechanism similar to that of Generative Adversarial Networks [9], the discriminator forces the embedding space of the model to follow the prior distribution.

In our OT-Classifier framework, the encoder and discriminator together project the input data to a low-dimensional space whose distribution is regularized toward the prior, and the classifier then performs prediction based on the low-dimensional embedding. Based on optimal transport theory, the proposed OT-Classifier minimizes the discrepancy between the distribution of the true label and the distribution of the framework output, thus retaining only the features important for classification in the embedding space. With a small embedding dimension, the effect of the adversarial perturbation is largely diminished through the projection process.

We compare OT-Classifier with other state-of-the-art defense methods on MNIST, CIFAR10, STL10 and Tiny Imagenet. Experimental results demonstrate that our proposed OT-Classifier outperforms other defense methods by a large margin. To sum up, this paper makes the following three main contributions:

  • A novel unified end-to-end robust deep neural network framework against adversarial attacks is proposed, where the input image is first projected to a low-dimensional space and then classified.

  • An objective is derived to minimize the optimal transport cost between the true class distribution and the framework output distribution, guiding the encoder and discriminator to project the input image to a low-dimensional space without losing important features for classification.

  • Extensive experiments demonstrate the robustness of our proposed OT-Classifier framework under the white-box attacks, and show that OT-Classifier combined with adversarial training outperforms other state-of-the-art approaches on several benchmark image datasets.

2 Related Work

In this section, we summarize related work into three categories: attack methods, defense mechanisms and optimal transport theory. We first discuss different white-box attack methods, followed by a description of different defense mechanisms against these attacks, and finally optimal transport theory.

2.1 Attack Methods

Under the white-box setting, attackers have all information about the targeted neural network, including network structure and gradients. Most white-box attacks generate adversarial examples based on the gradient of the loss function with respect to the input. An algorithm called the fast gradient sign method (FGSM) was proposed in [10], which generates adversarial examples based on the sign of the gradient. Many other white-box attack methods have been proposed recently [20, 5, 17, 4], and among them the C&W and PGD attacks have been widely used to test the robustness of machine learning models.

C&W attack: The adversarial attack method proposed by Carlini and Wagner [4] is one of the strongest white-box attack methods. They formulate the adversarial example generating process as an optimization problem. The proposed objective function aims at increasing the probability of the target class and minimizing the distance between the adversarial example and the original input image. Therefore, the C&W attack can be viewed as a gradient-descent based adversarial attack.

PGD attack: The projected gradient descent attack was proposed by [17]. It finds adversarial examples in an $\epsilon$-ball of the image. The PGD attack updates in the direction that decreases the probability of the original class most, then projects the result back to the $\epsilon$-ball of the input. An advantage of the PGD attack over the C&W attack is that it allows direct control of the distortion level by changing $\epsilon$, while for the C&W attack one can only do so indirectly via hyper-parameter tuning.

Both the C&W attack and the PGD attack have been frequently used to benchmark defense algorithms due to their effectiveness [2]. In this paper, we mainly use the $\ell_\infty$-PGD untargeted attack to evaluate the effectiveness of the defense method under the white-box setting.

Instead of crafting a different adversarial perturbation for each input image, an algorithm was proposed by [19] to construct a universal perturbation that causes natural images to be misclassified. However, since this universal perturbation is image-agnostic, it is usually larger than the image-specific perturbations generated by PGD and C&W.

2.2 Defense Mechanisms

Much work has been done to improve the robustness of deep neural networks. Defenses against adversarial examples fall into three main categories: i) augmenting the training data with adversarial examples to enhance the existing classifiers [17, 21, 10]; ii) leveraging model-specific strategies to enforce model properties such as smoothness [23]; and iii) trying to remove adversarial perturbations from the inputs [28, 24, 18]. We select three representative methods that are effective under the white-box setting.

Adversarial training: Augmenting the training data with adversarial examples can increase the robustness of a deep neural network. Madry et al. [17] recently introduced a min-max formulation against adversarial attacks. The proposed model is trained not only on the original dataset but also on adversarial examples in the $\epsilon$-ball of each input image.
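As a reminder of what the min-max formulation looks like, the saddle-point problem is commonly written as below. The symbols $\theta$ (model parameters), $\delta$ (perturbation) and $\mathcal{D}$ (data distribution) are notation assumed here for illustration and do not appear elsewhere in this text; $M$ and $\ell$ denote the model and the loss.

```latex
\min_{\theta} \; \mathbb{E}_{(x, y) \sim \mathcal{D}}
  \left[ \max_{\|\delta\|_\infty \le \epsilon}
         \ell\big(M_\theta(x + \delta),\, y\big) \right]
```

The inner maximization searches for the worst-case perturbation inside the $\epsilon$-ball, and the outer minimization fits the model to those worst-case examples.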

Random Self-Ensemble: Another effective defense method under white-box setting is RSE [15]. The authors proposed a “noise layer”, which fuses output of each layer with Gaussian noise. They empirically show that the noise layer can help improve the robustness of deep neural networks. The noise layer is applied in both training and testing phases, so the prediction accuracy will not be largely affected.

Defense-GAN: Defense-GAN [24] leverages the expressive capability of GANs to defend deep neural networks against adversarial examples. It is trained to project input images onto the range of the GAN’s generator to remove the effect of the adversarial perturbation. Another defense method that uses a generative model to filter out noise is MagNet, proposed by [18]. OT-Classifier differs from these two methods in a clear way: it focuses on reducing the dimension and performs classification on the low-dimensional embedding, while Defense-GAN and MagNet mainly apply the generative model to filter out the adversarial noise and perform classification in the original input space. [24] showed that Defense-GAN is more robust than MagNet, so we only compare with Defense-GAN in the experiments.

2.3 Optimal Transport Theory

There are various ways to define the distance or divergence between the target distribution and the model distribution. In this paper, we turn to optimal transport theory (more details are available at https://optimaltransport.github.io/slides/), which provides a much weaker topology than many others. In real applications, data is usually embedded in a space of a much lower dimension, such as a non-linear manifold. Kullback-Leibler divergence, Jensen-Shannon divergence and Total Variation distance are not sensible cost functions when learning distributions supported by lower dimensional manifolds [1]. In contrast, the optimal transport cost is more sensible in this setting.

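Kantorovich’s distance induced by the optimal transport problem is given by

$$W_c(P_Y, P_G) := \inf_{\Gamma \in \mathcal{P}(Y \sim P_Y,\, U \sim P_G)} \mathbb{E}_{(Y, U) \sim \Gamma}\left[ c(Y, U) \right],$$

where $P_Y$ denotes the target distribution, $P_G$ the model distribution, $\mathcal{P}(Y \sim P_Y, U \sim P_G)$ is the set of all joint distributions of $(Y, U)$ with marginals $P_Y$ and $P_G$, and $c(y, u)$ is any measurable cost function. $W_c(P_Y, P_G)$ measures the divergence between probability distributions $P_Y$ and $P_G$.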

When the probability measures are on a metric space $(\mathcal{X}, d)$ and the cost is $c(x, y) = d^p(x, y)$ for $p \ge 1$, the $p$-th root of $W_c$ is called the $p$-Wasserstein distance. Recently, Tolstikhin et al. [27] introduced a new algorithm to build a generative model of the target data distribution based on the Wasserstein distance. The proposed generative model can generate samples of better quality, as measured by the FID score.
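A toy illustration of why the transport cost is the "more sensible" choice on low-dimensional supports: compare a point mass at 0 with a point mass at $\theta$. The sketch below is not from the paper; the function names are hypothetical, and it relies on the fact that in one dimension with equal sample sizes the optimal coupling pairs sorted values.

```python
def wasserstein_1d(xs, ys):
    """1-Wasserstein distance between two equal-size 1-D empirical samples.

    In one dimension the optimal coupling matches order statistics, so the
    cost is the mean absolute gap between the sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def total_variation(xs, ys):
    """Total variation distance between two empirical distributions."""
    support = set(xs) | set(ys)
    px = {v: xs.count(v) / len(xs) for v in support}
    py = {v: ys.count(v) / len(ys) for v in support}
    return 0.5 * sum(abs(px[v] - py[v]) for v in support)


# Point mass at 0 versus point mass at theta, for shrinking theta.
for theta in (1.0, 0.1, 0.01):
    p = [0.0] * 4
    q = [theta] * 4
    print(theta, wasserstein_1d(p, q), total_variation(p, q))
```

The Wasserstein value shrinks with $\theta$, while the total variation stays at 1 for every $\theta \neq 0$, so only the former gives a useful signal as the two supports approach each other.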

3 Proposed Framework: Optimal Transport Classifier

We propose a novel defense framework, OT-Classifier, which aims at projecting the image data to a low-dimensional space to remove noise and stabilize the classification model by minimizing the optimal transport cost between the true label distribution ($P_Y$) and the distribution of the OT-Classifier output ($P_G$). The encoder and discriminator together help diminish the effect of the adversarial perturbation by projecting input data to a space of lower dimension, and the classifier then performs classification based on the low-dimensional embedding.

3.1 Notations

In this paper, we use $\ell_\infty$ and $\ell_2$ distortion metrics to measure similarity. We report $\ell_\infty$ distance in the normalized $[0, 1]$ space, so that a distortion of $0.031$ corresponds to $8/256$, and $\ell_2$ distance as the total root-mean-square distortion normalized by the total number of pixels.
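The two distortion metrics can be sketched as follows. This is one reading of the definitions above, not the authors' code: images are assumed to be flat lists of pixel values in $[0, 1]$, the $\ell_2$ figure is taken to be the root of the mean squared per-pixel difference, and both function names are hypothetical.

```python
import math


def linf_distortion(x, x_adv):
    """Largest absolute per-pixel change between two flattened images."""
    return max(abs(a - b) for a, b in zip(x, x_adv))


def l2_distortion(x, x_adv):
    """Root-mean-square per-pixel change (assumed normalization)."""
    n = len(x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_adv)) / n)


# A 4-pixel "image" in which one pixel is shifted by 8 of 256 intensity levels.
x = [0.0, 0.5, 1.0, 0.25]
x_adv = [0.0, 0.5 + 8 / 256, 1.0, 0.25]
print(linf_distortion(x, x_adv), l2_distortion(x, x_adv))
```

Because only one of the four pixels moves, the root-mean-square value is half of the $\ell_\infty$ value in this example.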

We use calligraphic letters for sets (e.g., $\mathcal{X}$), capital letters for random variables (e.g., $X$), and lower case letters for their values (e.g., $x$). The probability distributions are denoted with capital letters (e.g., $P(X)$) and corresponding densities with lower case letters (e.g., $p(x)$).

Images $x \in \mathcal{X} \subset \mathbb{R}^d$ are projected to a low-dimensional embedding vector $z \in \mathcal{Z} \subset \mathbb{R}^k$ through the encoder $Q_\phi$. The discriminator $D_\gamma$ discriminates between the generated code $\tilde{z} \sim Q_\phi(Z \mid x)$ and the ideal code $z \sim P_Z$. The classifier $C_\tau$ performs classification based on the generated code $\tilde{z}$, producing output $u \in \mathbb{R}^m$, where $m$ is the number of classes. The label of $x$ is denoted as $y$. An overview of the framework is shown in Figure 1.

3.2 Framework Details

At the training stage, the encoder $Q_\phi$ first maps the input $x$ to a low-dimensional space, resulting in the generated code ($\tilde{z}$). An ideal code ($z$) is sampled from the prior distribution, and the discriminator $D_\gamma$ discriminates between the ideal code $z$ (positive data) and the generated code $\tilde{z}$ (negative data). The classifier ($C_\tau$) predicts the image label based on the generated code ($\tilde{z}$). Details of the training process can be found in Algorithm 1.

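Algorithm 1: Training OT-Classifier
Input: Regularization coefficient $\lambda > 0$, encoder $Q_\phi$, discriminator $D_\gamma$, and classifier $C_\tau$.
Note: $\ell$ stands for the cross-entropy loss.
while $(\phi, \gamma, \tau)$ not converged do
     Sample $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ from the training set
     Sample $\{z_1, \ldots, z_n\}$ from the prior $P_Z$
     Sample $\tilde{z}_i$ from $Q_\phi(Z \mid x_i)$ for $i = 1, \ldots, n$
     Update $D_\gamma$ by ascending the following objective by 1-step Adam:
          $\frac{\lambda}{n} \sum_{i=1}^{n} \left[ D_\gamma(z_i) - D_\gamma(\tilde{z}_i) \right]$
     Update $Q_\phi$ and $C_\tau$ by descending the following objective by 1-step Adam:
          $\frac{1}{n} \sum_{i=1}^{n} \ell\left( C_\tau(\tilde{z}_i),\, y_i \right)$
     Update $Q_\phi$ by ascending the following objective by 1-step Adam:
          $\frac{\lambda}{n} \sum_{i=1}^{n} D_\gamma(\tilde{z}_i)$
end while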
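To make the three updates of the training loop concrete, the sketch below evaluates one mini-batch's objectives for toy linear networks. It is an illustration under stated assumptions rather than the authors' implementation: the encoder, critic and classifier are plain matrices, the discriminator term is assumed to take the WGAN critic form, the Adam updates themselves are omitted, and every function name is hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_norm - logits[label]


def clip_weights(weights, c):
    """WGAN-style weight clipping: force every critic weight into [-c, c]."""
    return [min(max(w, -c), c) for w in weights]


def batch_objectives(enc, critic, clf, xs, ys, zs, lam):
    """Return (critic objective, classification loss, encoder objective)."""
    n = len(xs)
    codes = [matvec(enc, x) for x in xs]        # generated codes
    d_gen = [dot(critic, c) for c in codes]     # critic scores, generated codes
    d_ideal = [dot(critic, z) for z in zs]      # critic scores, prior samples
    critic_obj = lam * (sum(d_ideal) - sum(d_gen)) / n  # critic ascends this
    clf_loss = sum(cross_entropy(matvec(clf, c), y)
                   for c, y in zip(codes, ys)) / n      # encoder + classifier descend this
    encoder_obj = lam * sum(d_gen) / n                  # encoder ascends this
    return critic_obj, clf_loss, encoder_obj


# Toy batch: 4-D input, 2-D code, 3 classes, a single example.
enc = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
critic = clip_weights([1.0, -1.0], c=0.5)
clf = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
xs, ys, zs = [[1.0, 2.0, 0.0, 0.0]], [2], [[0.0, 0.0]]
print(batch_objectives(enc, critic, clf, xs, ys, zs, lam=2.0))
```

In a real implementation each objective would be differentiated by an autodiff framework and followed by one optimizer step, with `clip_weights` applied to the critic after its update.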

At inference time, only the encoder and the classifier are used. The input image is first mapped to a low-dimensional space by the encoder ($\tilde{z} \sim Q_\phi(Z \mid x)$), then the latent code $\tilde{z}$ is fed into the classifier to obtain the predicted label.

Our framework can be combined with other state-of-the-art defense methods, such as adversarial training. Since the dimension of the input images is reduced to a much lower dimension, adversarial training also benefits from this dimension reduction. In the experiments, we combine OT-Classifier with adversarial training and compare it with other defense methods.

3.3 Theoretical Analysis

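The OT-Classifier framework embeds important classification features by minimizing the discrepancy between the distribution of the true label ($P_Y$) and the distribution of the framework output ($P_G$). In the framework, the classifier ($C$) maps a latent code $Z$ sampled from a fixed distribution $P_Z$ in a latent space $\mathcal{Z}$ to the output $U = C(Z)$. The density of the OT-Classifier output is defined as follows:

$$p_G(u) := \int_{\mathcal{Z}} p_G(u \mid z)\, p_z(z)\, dz, \quad \forall u \in \mathcal{U}. \tag{1}$$

In this paper we apply the standard Gaussian as our prior distribution $P_Z$, but other priors may be used for different cases. Assume there is an oracle $f: \mathcal{X} \to \mathcal{Y}$ assigning the image data ($x$) its true label ($y = f(x)$). To minimize the optimal transport cost between the distribution of the true label ($P_Y$) and the distribution of the OT-Classifier output ($P_G$), it is sufficient to find a conditional distribution $Q(Z \mid X)$ such that its marginal distribution $Q_Z$ is identical to the prior distribution $P_Z$.

Theorem 1

For $P_G$ as defined above with a deterministic $P_G(U \mid Z)$ and any function $C: \mathcal{Z} \to \mathcal{U}$,

$$\inf_{\Gamma \in \mathcal{P}(Y \sim P_Y,\, U \sim P_G)} \mathbb{E}_{(Y, U) \sim \Gamma}\left[ c(Y, U) \right] = \inf_{Q:\, Q_Z = P_Z} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z \mid X)}\left[ c\big(f(X), C(Z)\big) \right],$$

where $\mathcal{P}(Y \sim P_Y, U \sim P_G)$ is the set of all joint distributions of $(Y, U)$ with marginals $P_Y$ and $P_G$, and $c$ is any measurable cost function. $Q_Z$ is the marginal distribution of $Z$ when $X \sim P_X$ and $Z \sim Q(Z \mid X)$. (The proof is deferred to the Appendix.)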

Therefore, optimizing the objective on the r.h.s. is equivalent to minimizing the discrepancy between the true label distribution ($P_Y$) and the output distribution ($P_G$), so the important classification features are embedded in the low-dimensional space. This is the core idea of the paper: summarizing the high-dimensional data in a space of much lower dimension without losing important features for classification. To implement the r.h.s. objective, the constraint on $Q_Z$ can be relaxed by adding a penalty term. The final objective of OT-Classifier is:

$$\inf_{Q(Z \mid X) \in \mathcal{Q}} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z \mid X)}\left[ c\big(f(X), C(Z)\big) \right] + \lambda\, \mathcal{D}(Q_Z, P_Z), \tag{2}$$

where $\mathcal{Q}$ is any nonparametric set of probabilistic encoders, $\lambda > 0$ is a hyper-parameter and $\mathcal{D}$ is an arbitrary divergence between $Q_Z$ and $P_Z$.

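To estimate the divergence between $Q_Z$ and $P_Z$, we apply a GAN-based framework, fitting a discriminator to minimize the 1-Wasserstein distance between $Q_Z$ and $P_Z$:

$$W(Q_Z, P_Z) = \sup_{\|D\|_L \le 1} \; \mathbb{E}_{z \sim P_Z}\left[ D(z) \right] - \mathbb{E}_{\tilde{z} \sim Q_Z}\left[ D(\tilde{z}) \right],$$

where the supremum is taken over 1-Lipschitz functions $D$. We have also tried the Jensen-Shannon divergence but, as expected, the Wasserstein distance provides more stable training and better results. When training the framework, the weight clipping method proposed in Wasserstein GAN [1] is applied to help stabilize the training of the discriminator $D_\gamma$.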

4 Experiments

In this section, we compare the performance of our proposed algorithm (OT-Classifier) with other state-of-the-art defense methods on several benchmark datasets:

  • MNIST [13]: handwritten digit dataset, which consists of 60,000 training images and 10,000 testing images. These are $28 \times 28$ black and white images in ten different classes.

  • CIFAR10 [11]: natural image dataset, which contains 50,000 training images and 10,000 testing images in ten different classes. These are low resolution $32 \times 32$ color images.

  • STL10 [7]: color image dataset similar to CIFAR10, but contains only 5,000 training images and 8,000 testing images in ten different classes. The images are of higher resolution, $96 \times 96$.

  • Tiny Imagenet [8]: a subset of the Imagenet dataset. Tiny Imagenet has 200 classes, and each class has 500 training images and 50 testing images, making it a challenging benchmark for the defense task. The resolution of the images is $64 \times 64$.

Various defense methods have been proposed to improve the robustness of deep neural networks. Here we compare our algorithm with state-of-the-art methods that are robust in the white-box setting. Madry’s adversarial training (Madry’s Adv), proposed in [17], has been recognized as one of the most successful defense methods in the white-box setting, as shown in [2].

Random Self-Ensemble (RSE) method introduced by [15] adds stochastic components in the neural network, achieving similar performance to Madry’s adversarial training algorithm.

Another method we compare with is Defense-GAN [24]. It first trains a generative adversarial network to model the distribution of the training data. At inference time, it finds a generator output close to the input image and feeds that output into the classifier. This process “projects” input images onto the range of the GAN’s generator, which helps remove the effect of adversarial perturbations. In [24], the authors demonstrated the performance of Defense-GAN on MNIST and Fashion-MNIST, so we compare our method with Defense-GAN on MNIST.

Optimal transport classifier can be combined with other state-of-the-art defense methods. In general, Madry’s adversarial training is more robust than RSE, so we combine OT-Classifier with adversarial training (OT-CLA+Adv) in our experiments.

4.1 Evaluate Models Under White-box $\ell_\infty$-PGD Attack

In this section, we evaluate the defense methods against the $\ell_\infty$-PGD untargeted attack, which is one of the strongest white-box attack methods. Starting from $x^0 = x$, the PGD attack conducts projected gradient descent iteratively to update the adversarial example:

$$x^{t+1} = \Pi_\epsilon\left\{ x^t + \alpha \cdot \operatorname{sign}\left( \nabla_x \ell\big(M(x^t), y\big) \right) \right\},$$

where $M$ is the targeted model, $\ell$ is the loss function, $\Pi_\epsilon$ is the projection to the set $\{x' : \|x' - x^0\|_\infty \le \epsilon\}$, $y$ is the label of $x$, and $\alpha$ is the step size. A larger $\epsilon$ allows a larger distortion of the original image. Models are evaluated under different distortion levels ($\epsilon$), and the larger the distortion, the stronger the attack. Depending on the image scale and type, different datasets are sensitive to different attack strengths.
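The iterative update can be sketched on a toy model as follows. This is a minimal illustration, not the evaluation code used in the experiments: the target is assumed to be a linear softmax classifier so that the input gradient has a closed form, pixel values are additionally clipped to $[0, 1]$, and all function names are hypothetical.

```python
import math


def logits_of(W, b, x):
    """Logits W x + b of a linear classifier."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def predict(W, b, x):
    scores = logits_of(W, b, x)
    return max(range(len(scores)), key=scores.__getitem__)


def loss_grad_wrt_input(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. x: W^T (softmax - onehot)."""
    scores = logits_of(W, b, x)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    err = [e / total - (1.0 if i == y else 0.0) for i, e in enumerate(exps)]
    return [sum(err[i] * W[i][j] for i in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def pgd_linf(W, b, x0, y, eps, alpha, steps):
    """Untargeted PGD: ascend the loss, then project back into the eps-ball."""
    x = list(x0)
    for _ in range(steps):
        grad = loss_grad_wrt_input(W, b, x, y)
        x = [xj + alpha * sign(gj) for xj, gj in zip(x, grad)]
        # Projection onto {x' : |x' - x0|_inf <= eps}, then onto the pixel range.
        x = [min(max(xj, cj - eps), cj + eps) for xj, cj in zip(x, x0)]
        x = [min(max(xj, 0.0), 1.0) for xj in x]
    return x


W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
x0 = [0.6, 0.4]  # classified as class 0 by this toy model
x_adv = pgd_linf(W, b, x0, y=0, eps=0.15, alpha=0.05, steps=5)
print(predict(W, b, x0), predict(W, b, x_adv))
```

Changing `eps` directly bounds the distortion, which is the property that makes PGD convenient for sweeping over attack strengths.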

Each dataset is evaluated over its own range of distortion levels. The levels reported in Table 1 are $\epsilon \in \{0, 0.1, 0.2, 0.3, 0.4\}$ for MNIST, $\epsilon \in \{0, 0.015, 0.03, 0.045, 0.06\}$ for CIFAR10 and STL10, and $\epsilon \in \{0, 0.004, 0.01, 0.016, 0.02\}$ for Tiny Imagenet. As mentioned in the notation part, all the distortion levels are reported in the normalized space. The experimental results are shown in Figure 2. To demonstrate the results more clearly, we show part of the results in Table 1.

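Figure 2: Testing accuracy under $\ell_\infty$-PGD attack on four different datasets: MNIST, CIFAR10, STL10 and Tiny Imagenet.

| Data | Defense | $\epsilon = 0$ | 0.1 | 0.2 | 0.3 | 0.4 |
|---|---|---|---|---|---|---|
| MNIST | Adv. Training | **99.2** | 97.3 | 86.8 | 35.4 | 2.7 |
| MNIST | OT-CLA+Adv | 99.1 | **98.7** | **97.2** | **94.9** | **71.1** |

| Data | Defense | $\epsilon = 0$ | 0.015 | 0.03 | 0.045 | 0.06 |
|---|---|---|---|---|---|---|
| CIFAR10 | Adv. Training | 82.6 | **68.0** | 42.3 | 21.6 | 12.0 |
| CIFAR10 | OT-CLA+Adv | **84.0** | 67.5 | **51.3** | **35.8** | **23.3** |
| STL10 | Adv. Training | **63.6** | **53.5** | 36.8 | 25.0 | 18.7 |
| STL10 | OT-CLA+Adv | 60.7 | 52.1 | **40.3** | **30.6** | **24.5** |

| Data | Defense | $\epsilon = 0$ | 0.004 | 0.01 | 0.016 | 0.02 |
|---|---|---|---|---|---|---|
| Tiny Imagenet | Adv. Training | **57.3** | 48.6 | 26.5 | 15.1 | 12.0 |
| Tiny Imagenet | OT-CLA+Adv | 54.6 | **50.0** | **36.7** | **25.6** | **21.1** |

Table 1: Testing accuracy (%) under different strengths of PGD attack. The table shows the results of OT-CLA+Adv and Madry’s adversarial training (Adv. Training). The better accuracy is marked in bold.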

Based on Figure 2 and Table 1, OT-Classifier improves the robustness of deep neural networks. Compared with the model without any defense, OT-Classifier is much more robust on all benchmark datasets. Moreover, when the distortion level ($\epsilon$) is large, OT-Classifier tends to perform better than other state-of-the-art defense methods on MNIST, CIFAR10 and Tiny Imagenet. This is most visible on CIFAR10, where OT-Classifier even performs better than OT-CLA+Adv when the attack is strong.

In general, OT-Classifier combined with adversarial training (OT-CLA+Adv) is the most robust method across the datasets. On some datasets, however, the testing accuracy of OT-CLA+Adv without attack is slightly worse than that of Madry’s adversarial training.

We also compare Defense-GAN with our method OT-CLA+Adv on MNIST. Both methods are evaluated against the $\ell_2$-C&W untargeted attack, one of the strongest white-box attacks, proposed in [4]. Defense-GAN is evaluated using the method proposed in [2], and the code is available on GitHub (publicly available at https://github.com/anishathalye/obfuscated-gradients/tree/master/defensegan). OT-CLA+Adv is evaluated against the $\ell_2$-C&W untargeted attack with the same hyper-parameter values as those used in the evaluation of Defense-GAN. The results under the chosen $\ell_2$ distortion threshold are shown in Table 2.

| Method | Testing Accuracy (%) |
|---|---|
| Defense-GAN | 55.0 |
| OT-CLA+Adv | 99.1 |

Table 2: Testing accuracy (%) of two defense methods under the $\ell_2$-C&W attack.

Based on Table 2, OT-CLA+Adv is much more robust than Defense-GAN under this threshold.

4.2 Evaluate the Effect of Discriminator

The OT-Classifier framework consists of three parts, and the classification task is done by the encoder $Q_\phi$ and classifier $C_\tau$. Without the discriminator, the encoder can still project the input images to a low-dimensional space. However, arbitrarily projecting the images to a low-dimensional space with only the encoder cannot improve the robustness of the model, and sometimes even decreases it.

To show that arbitrarily projecting the input images to a low-dimensional space cannot improve the robustness, we fit a framework with only the encoder and classifier parts (E-CLA), where the encoder and classifier have the same structures as in OT-Classifier, and compare E-CLA with the OT-Classifier framework. The results are shown in Figure 3.

Figure 3: Testing accuracy of E-CLA and OT-Classifier under $\ell_\infty$-PGD attack on four different datasets: MNIST, CIFAR10, STL10 and Tiny Imagenet. We adopt the same encoder and classifier structures for the two models.

Based on Figure 3, OT-Classifier is much more robust than the encoder and classifier structure alone on MNIST, CIFAR10 and Tiny Imagenet. It is also more robust on STL10, though by a smaller margin. The reason might be that there are only 5,000 training images in STL10 and the resolution is $96 \times 96$, so it is harder to learn a good embedding with a limited number of images. Even so, OT-Classifier remains more robust than the E-CLA structure, which demonstrates that OT-Classifier is able to learn a robust embedding. Notice that the performance of the E-CLA structure is similar to that of the model without any defense on CIFAR10, STL10 and Tiny Imagenet, and worse on MNIST, which means the robustness of OT-Classifier does not come from the structure design.

4.3 Dimension of Embedding Space

One important hyper-parameter for the OT-Classifier is the dimension of the embedding space. If the dimension is too small, important features are “collapsed” onto the same dimension, and if the dimension is too large, the projection will not extract useful information, which results in too much noise and instability. The maximum likelihood estimation of intrinsic dimension proposed in [14] (code publicly available at https://github.com/OFAI/hub-toolbox-python3) is used to calculate the intrinsic dimension of each image dataset, serving as a guide for selecting the embedding dimension. Changing the sample size used in calculating the intrinsic dimension does not influence the results much. Based on the intrinsic dimension calculated by [14], we test several different values around the suggested intrinsic dimension and evaluate the models against the $\ell_\infty$-PGD attack. The experimental results are shown in Figure 4.

Figure 4: Testing accuracy of models with different embedding dimensions under $\ell_\infty$-PGD attack.
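For intuition about how a maximum-likelihood intrinsic dimension estimate behaves, the sketch below implements a simplified $k$-nearest-neighbour version and applies it to synthetic points that lie on a 2-D plane inside a 5-D space. It is an illustrative reimplementation, not the toolbox used in the paper; the per-point formula $(k-1) / \sum_{j<k} \log(T_k / T_j)$, the choice $k = 10$, the synthetic data and both function names are assumptions.

```python
import math
import random


def sample_plane(n, seed=0):
    """Draw n points from a 2-D plane linearly embedded in 5-D space."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        points.append((u, v, u + v, u - v, 2.0 * u))
    return points


def mle_intrinsic_dim(points, k=10):
    """Average the per-point k-NN maximum-likelihood dimension estimates."""
    estimates = []
    for i, p in enumerate(points):
        # Distances from p to its k nearest neighbours, in increasing order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        t_k = dists[-1]
        log_sum = sum(math.log(t_k / t_j) for t_j in dists[:-1])
        estimates.append((k - 1) / log_sum)
    return sum(estimates) / len(estimates)


estimate = mle_intrinsic_dim(sample_plane(300), k=10)
print(round(estimate, 2))
```

The estimate lands near 2 rather than near the ambient dimension 5, which is the behaviour that motivates using such an estimate as a starting point for the embedding dimension.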

The final embedding dimension is chosen based on robustness, number of parameters, and testing accuracy when there is no attack. The final embedding dimensions and suggested intrinsic dimensions are shown in Table 3.

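| Data | Data dim. | Intrinsic dim. | Embedding dim. |
|---|---|---|---|
| MNIST | $28 \times 28 \times 1$ | 13 | 4 |
| CIFAR10 | $32 \times 32 \times 3$ | 17 | 16 |
| STL10 | $96 \times 96 \times 3$ | 20 | 16 |
| Tiny Imagenet | $64 \times 64 \times 3$ | 19 | 20 |

Table 3: Pixel space dimension, intrinsic dimension calculated by [14], and final embedding dimension used.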

Based on Figure 4, the embedding dimension close to the calculated intrinsic dimension usually offers better results except on MNIST. One explanation may be that MNIST is a simple handwritten digit dataset, so performing classification on MNIST may not require that many dimensions.

4.4 Embedding Visualization

In this section, we compare the embedding learned by the Encoder+Classifier structure (E-CLA) and the embedding learned by OT-Classifier on several datasets. We first generate the embedding of the testing data using the encoder ($\tilde{z} \sim Q_\phi(Z \mid x)$), then project the embedding points ($\tilde{z}$) to 2-D space by tSNE [16]. Then we generate adversarial images ($x_{adv}$) against E-CLA and OT-Classifier using the $\ell_\infty$-PGD attack. The adversarial embedding is generated by feeding the adversarial images into the encoder ($\tilde{z}_{adv} \sim Q_\phi(Z \mid x_{adv})$). Finally, we project the adversarial embedding points ($\tilde{z}_{adv}$) to 2-D space. The results are shown in Figure 5. The plots in the first row are embedding visualization plots for E-CLA, and the plots in the second row are the embedding visualization plots for OT-Classifier. In the adversarial embedding visualization plots, a misclassified point is marked as “down triangle”, which means the PGD attack successfully changed the prediction, and a correctly classified point is marked as “point”, which means the attack failed.

Figure 5: 2D embeddings for E-CLA and OT-Classifier on MNIST and CIFAR10. See larger plots in Supplementary.

Based on Figure 5, E-CLA can learn a good embedding on legitimate images of MNIST. Embedding points for different classes are separated in the 2-D space, but under adversarial attack, some embedding points of different classes are mixed together. OT-Classifier, in contrast, generates well-separated embeddings on both legitimate and adversarial images. On CIFAR10, E-CLA cannot generate well-separated embeddings on either legitimate or adversarial images, while OT-Classifier generates well-separated embeddings for both.

5 Conclusion

In this paper, we propose a new defense framework, OT-Classifier, which projects the input images to a low-dimensional space to remove adversarial perturbation and stabilizes the model by minimizing the discrepancy between the true label distribution and the framework output distribution. We empirically show that OT-CLA+Adv is much more robust than other state-of-the-art defense methods on several benchmark datasets. Future work will include further exploration of the low-dimensional space to improve the robustness of deep neural networks.

6 Appendix

6.1 Proof of Theorem 1

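The proof of Theorem 1 is adapted from the proof of Theorem 1 in [27]. Consider certain sets of joint probability distributions of three random variables $(X, U, Z) \in \mathcal{X} \times \mathcal{U} \times \mathcal{Z}$. $X$ can be taken as the input images, $U$ as the output of the framework, and $Z$ as the latent codes. $P_{G,Z}(U, Z)$ represents a joint distribution of a variable pair $(U, Z)$, where $Z$ is first sampled from $P_Z$ and then $U$ from $P_G(U \mid Z)$. $P_G$ defined in (1) is the marginal distribution of $U$ when $(U, Z) \sim P_{G,Z}$.

The joint distributions $\Gamma(X, U)$, or couplings between values of $X$ and $U$, can be written as $\Gamma(X, U) = \Gamma(U \mid X) P_X(X)$ due to the marginal constraint. $\Gamma(U \mid X)$ can be decomposed into an encoding distribution $Q(Z \mid X)$ and the generating distribution $P_G(U \mid Z)$, and Theorem 1 mainly shows how to factor it through $Z$.

In the first part, we will show that if $P_G(U \mid Z)$ are Dirac measures, we have

$$\inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, U \sim P_G)} \mathbb{E}_{(X, U) \sim \Gamma}\left[ c\big(f(X), U\big) \right] = \inf_{\Gamma \in \mathcal{P}_{X,U}} \mathbb{E}_{(X, U) \sim \Gamma}\left[ c\big(f(X), U\big) \right], \tag{3}$$

where $\mathcal{P}(X \sim P_X, U \sim P_G)$ denotes the set of all joint distributions of $(X, U)$ with marginals $P_X, P_G$, and likewise for $\mathcal{P}(X \sim P_X, Z \sim P_Z)$. The set of all joint distributions of $(X, U, Z)$ such that $X \sim P_X$, $(U, Z) \sim P_{G,Z}$, and $(U \perp X) \mid Z$ is denoted by $\mathcal{P}_{X,U,Z}$. $\mathcal{P}_{X,U}$ and $\mathcal{P}_{X,Z}$ denote the sets of marginals on $(X, U)$ and $(X, Z)$ induced by $\mathcal{P}_{X,U,Z}$.

From the definition, it is clear that $\mathcal{P}_{X,U} \subseteq \mathcal{P}(X \sim P_X, U \sim P_G)$. Therefore, we have

$$\inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, U \sim P_G)} \mathbb{E}_{(X, U) \sim \Gamma}\left[ c\big(f(X), U\big) \right] \le \inf_{\Gamma \in \mathcal{P}_{X,U}} \mathbb{E}_{(X, U) \sim \Gamma}\left[ c\big(f(X), U\big) \right]. \tag{4}$$

The identity is satisfied if $P_G(U \mid Z)$ are Dirac measures, such as $U = C(Z)$. This is proved by the following Lemma in [27].

Lemma 1

$\mathcal{P}_{X,U} \subseteq \mathcal{P}(X \sim P_X, U \sim P_G)$, with identity if $P_G(U \mid Z = z)$ are Dirac for all $z \in \mathcal{Z}$. (See details in [27].)

In the following part, we show that

$$\inf_{\Gamma \in \mathcal{P}_{X,U}} \mathbb{E}_{(X, U) \sim \Gamma}\left[ c\big(f(X), U\big) \right] = \inf_{Q:\, Q_Z = P_Z} \mathbb{E}_{P_X} \mathbb{E}_{Q(Z \mid X)}\left[ c\big(f(X), C(Z)\big) \right]. \tag{5}$$

Based on the definition, $\mathcal{P}(X \sim P_X, U \sim P_G)$, $\mathcal{P}_{X,U,Z}$ and $\mathcal{P}_{X,U}$ depend on the choice of conditional distributions $P_G(U \mid Z)$, but $\mathcal{P}_{X,Z}$ does not. It is also easy to check that $\mathcal{P}_{X,Z} = \mathcal{P}(X \sim P_X, Z \sim P_Z)$. The tower rule of expectation and the conditional independence property of $\mathcal{P}_{X,U,Z}$ imply

$$\begin{aligned}
\inf_{\Gamma \in \mathcal{P}_{X,U}} \mathbb{E}_{(X, U) \sim \Gamma}\left[ c\big(f(X), U\big) \right]
&= \inf_{\Gamma \in \mathcal{P}_{X,U,Z}} \mathbb{E}_{(X, U, Z) \sim \Gamma}\left[ c\big(f(X), U\big) \right] \\
&= \inf_{\Gamma \in \mathcal{P}_{X,U,Z}} \mathbb{E}_{P_Z} \mathbb{E}_{X \sim \Gamma(X \mid Z)} \mathbb{E}_{U \sim \Gamma(U \mid Z)}\left[ c\big(f(X), U\big) \right] \\
&= \inf_{\Gamma \in \mathcal{P}_{X,U,Z}} \mathbb{E}_{P_Z} \mathbb{E}_{X \sim \Gamma(X \mid Z)}\left[ c\big(f(X), C(Z)\big) \right] \\
&= \inf_{\Gamma \in \mathcal{P}_{X,Z}} \mathbb{E}_{(X, Z) \sim \Gamma}\left[ c\big(f(X), C(Z)\big) \right].
\end{aligned} \tag{6}$$

Since $\mathcal{P}_{X,Z} = \mathcal{P}(X \sim P_X, Z \sim P_Z)$, the last infimum is taken over all conditional distributions $Q(Z \mid X)$ whose marginal satisfies $Q_Z = P_Z$, which gives (5).

Finally, since $Y = f(X)$, it is easy to get

$$\inf_{\Gamma \in \mathcal{P}(Y \sim P_Y,\, U \sim P_G)} \mathbb{E}_{(Y, U) \sim \Gamma}\left[ c(Y, U) \right] = \inf_{\Gamma \in \mathcal{P}(X \sim P_X,\, U \sim P_G)} \mathbb{E}_{(X, U) \sim \Gamma}\left[ c\big(f(X), U\big) \right]. \tag{7}$$

Now (3), (5) and (7) are proved and the three together prove Theorem 1.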

References