Disentangled Deep Autoencoding Regularization for Robust Image Classification

02/27/2019 ∙ by Zhenyu Duan, et al. ∙ 0

In spite of achieving revolutionary successes in machine learning, deep convolutional neural networks have recently been found to be vulnerable to adversarial attacks and to generalize poorly to novel test images with reasonably large geometric transformations. Inspired by a recent neuroscience discovery revealing that the primate brain employs disentangled shape and appearance representations for object recognition, we propose a general disentangled deep autoencoding regularization framework that can be easily applied to any deep embedding based classification model to improve the robustness of deep neural networks. Our framework effectively learns a disentangled appearance code and geometric code for robust image classification; it is the first disentangling based method for defending against adversarial attacks and is complementary to standard defense methods. Extensive experiments on several benchmark datasets show that our proposed regularization framework, leveraging disentangled embeddings, significantly outperforms traditional unregularized convolutional neural networks for image classification in both robustness against adversarial attacks and generalization to novel test data.




1 Introduction

Deep convolutional neural networks (CNNs) have achieved revolutionary successes in many machine learning tasks such as image classification [He et al.2016a], object detection [Ren et al.2015], image segmentation [He et al.2017], image captioning [Vinyals et al.2015], video captioning [Pu et al.2018], and text-to-video synthesis [Li et al.2018a]. In spite of these successes, CNNs have recently been shown to be vulnerable to adversarial images, which are created by adding tiny perturbations to legitimate input images [Papernot et al.2016a, Kurakin et al.2016]. These adversarial images, which are indistinguishable from real input images to human observers, can cause CNNs to output any target prediction the attacker chooses, suggesting that the representations learned by CNNs still lack the robustness and generalization capability that biological neural systems exhibit. This security issue poses serious challenges in applying deep learning to mission-critical applications such as autonomous driving.

Various efforts have been made recently to improve the situation. Theoretical approaches such as [Wong and Kolter2018] propose to learn deep ReLU-based classifiers that are provably robust against norm-bounded adversarial perturbations on the training data. However, the method still cannot scale to realistic network sizes. Several empirical defense methods have been proposed [Buckman et al.2018, Guo et al.2018, Dhillon et al.2018]. Nevertheless, as shown in [Athalye et al.2018], many of these approaches do not provide guaranteed defenses. On the other hand, neuroscience-inspired approaches such as [Hinton et al.2018] leverage a capsule architecture to better learn the relationships between object parts and the whole, making the model more robust and better at generalizing.

Recent findings in neuroscience show that the monkey inferotemporal (IT) cortex (the final stage of the “what” pathway of the human and monkey visual system) contains separate regions where neurons exhibit disentangled representations of the shape and appearance properties of face images, respectively. This disentangled representation has been applied to construct generative models of images [Xing et al.2018], in which sampled disentangled embedding vectors are passed to a generator (decoder) to generate images.

In this paper, we take the above neuroscience inspiration to regularize the embedding layer of a traditional CNN classifier by a disentangled deep autoencoder for image classification. In the regularizing deep autoencoder, an encoder disentangles an input image into a low-dimensional appearance code capturing the texture or background information and a low-dimensional geometric code capturing the category-sensitive shape information, and two decoders reconstruct the original high-dimensional input image from the disentangled low-dimensional code.

Our proposed regularization framework improves CNNs’ robustness and generalization for two main reasons. First, owing to the large representational capacity of traditional CNNs, many embedding functions from the high-dimensional input image space to the low-dimensional embedding space, including many pathological ones, minimize the classification loss equally well; the reconstruction loss of the deep autoencoder imposes strong regularization on the low-dimensional embedding code directly used for classification. Second, the disentanglement in the embedding space imposes additional regularization on the low-dimensional code, resulting in a specialized, clean appearance code and geometric code, each of which might be more discriminative for different classification tasks than a single low-dimensional code containing a lot of information mixed with noise.

Our proposed regularization framework significantly improves the robustness of CNNs against both white-box attacks [Goodfellow et al.2015, Kurakin et al.2017, Papernot et al.2016a, Papernot et al.2016b, Carlini and Wagner2017], which assume the model is completely known, and black-box attacks [Narodytska and Kasiviswanathan2017, Papernot et al.2017], which can only query the model a limited number of times.

Our contributions in this paper are as follows: (i) We propose the first disentangling based deep autoencoding regularization framework to improve the robustness of CNNs, which can be easily applied to any embedding based CNN classification model and combined with standard defense methods; (ii) Extensive experiments show that our regularization framework dramatically improves the robustness of standard CNNs against attacks from adversarial examples; (iii) Our proposed framework helps CNNs learn more robust image representations and generalize better to novel test data with large geometric differences, achieving state-of-the-art generalization ability compared with capsule networks [Hinton et al.2018].

2 Method

2.1 Disentangled Deep Autoencoding Regularization

Figure 1: The disentangled deep autoencoding regularization framework for classification.

Suppose we have a $K$-class training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ containing $N$ pairs of samples, where $x_i$ is the image to be classified and $y_i$ is the corresponding label. Given a new test image $x$ that does not belong to $\mathcal{D}$, the classification task is to predict the correct label $y$ based on the features extracted from $x$.

To encourage learning visual features robust to natural variations of images, we employ an additional learning objective and an augmentation task to force the classifier to learn a disentangled geometric code and appearance code. As shown in Figure 1, the proposed framework contains two parts. The first part, displayed on the top, is a regularizing autoencoder with a disentangled feature embedding. The second part, displayed on the bottom, performs classification on the concatenated embedding feature vector. For any image $x$, the encoder $E$ maps it to a pair of appearance and geometric codes as follows,

$$(z_a, z_g) = E(x),$$

in which our encoder learns a disentangled embedding of the input image, namely the appearance code $z_a$ and the geometric code $z_g$. The appearance decoder or generator $G_a$ takes the appearance code $z_a$ to generate the appearance output $x_a$. The geometric decoder $G_g$ takes the geometric code $z_g$ and generates the coordinate displacement $d$ of the appearance output $x_a$:

$$x_a = G_a(z_a), \qquad d = G_g(z_g).$$

The warping function $\mathcal{W}$ warps the appearance output $x_a$ with the coordinate displacement $d$ generated by the geometric decoder and generates the reconstruction output $\hat{x}$. This is described in detail in Section 2.1.1.

$$\hat{x} = \mathcal{W}(x_a, d).$$

The classifier $C$ takes both the appearance code and the geometric code as input and predicts the category probability

$$\hat{y} = C(z_a, z_g).$$

In the training process, the encoder, the two decoders, and the classifier are trained jointly, and the loss function $\mathcal{L}$ is defined as

$$\mathcal{L} = \mathcal{L}_{ce}(\hat{y}, y) + \lambda\, \mathcal{L}_{rec}(\hat{x}, x),$$

where $y$ is the training label vector, $\mathcal{L}_{rec}$ is the reconstruction error (Frobenius norm), $\mathcal{L}_{ce}$ is the cross-entropy loss, and $\lambda$ is the regularization weight for the reconstruction loss.

The disentangled representation is learned in an unsupervised manner. The warping function naturally encourages the separation of the appearance code from the geometric code because $d$ represents the coordinate displacement of $x_a$.

Note that the complete architecture is needed only for training. For inference, the two decoders, warping function and the reconstruction are not needed.
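The joint training objective described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the reconstruction error is taken as the unsquared Frobenius norm averaged over a batch, and the encoder, decoders, and classifier are assumed to have already produced `probs` and `x_hat`.

```python
import numpy as np

def cross_entropy(probs, onehot, eps=1e-12):
    """Mean cross-entropy between predicted probabilities and one-hot labels."""
    return float(-np.mean(np.sum(onehot * np.log(probs + eps), axis=-1)))

def reconstruction_error(x_hat, x):
    """Mean Frobenius norm of the per-image residual (assumed unsquared)."""
    diff = (x_hat - x).reshape(x.shape[0], -1)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

def joint_loss(probs, onehot, x_hat, x, lam=1.0):
    """Classification loss plus lam times the reconstruction loss."""
    return cross_entropy(probs, onehot) + lam * reconstruction_error(x_hat, x)

# Toy batch: two 4x4 "images", three classes, perfect class predictions.
x = np.zeros((2, 4, 4))
x_hat = np.ones((2, 4, 4))           # every pixel is off by 1
onehot = np.eye(3)[[0, 2]]
probs = onehot.copy()
loss = joint_loss(probs, onehot, x_hat, x, lam=0.5)
```

At inference time only the encoder and classifier outputs are needed, so the reconstruction term (and everything feeding it) can be dropped.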

2.1.1 Warping Function

The warping function, such as the one used in Spatial Transformer Networks (STN) [Jaderberg et al.2015], is usually a combination of a geometric transformation of pixel coordinates and an interpolation operation. The geometric transformation is usually one of two types: an affine transformation or a pixel displacement transformation (like an optical flow field). We employ the second type in this work, similarly to [Xing et al.2018]. The warping function takes the displacement field $d$ and the “unwarped” image $x_a$ as input. The vector $(d_1(i, j), d_2(i, j))$ at each pixel of the displacement field is defined as the spatial shift needed to move the destination pixel $(i, j)$ to the source pixel $(u, v)$ in the output of the appearance generator:

$$(u, v) = F(i, j) = \big(i + d_1(i, j),\; j + d_2(i, j)\big),$$

where $F$ denotes the pixel displacement operation.

The outputs of the pixel displacement operation are generally not integers. In order to sample pixel values at integer coordinates, we employ differentiable bilinear interpolation in this work. The bilinear interpolation process can be represented as:

$$\hat{x}(i, j) = \sum_{n=1}^{H} \sum_{m=1}^{W} x_a(m, n)\, M(u - m)\, M(v - n),$$

where $M(\cdot)$ means the function $\max(0, 1 - |\cdot|)$, and $H$, $W$ are the height and width of the input image respectively.

In this way, various pictures of the same object can be encoded with similar appearance code, and their differences in viewing angles, locations, and deformation are captured by the geometric code. With this idea, we use data augmentation detailed below to enforce the separation of appearance and geometric information.
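The displacement-based warping with bilinear interpolation described in this subsection can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `warp`, the `(dx, dy)` layout of the displacement field, and clamping source coordinates to the image border are all assumptions.

```python
import numpy as np

def warp(image, displacement):
    """Warp a 2D image with a per-pixel displacement field.

    image:        (H, W) array, the "unwarped" appearance output.
    displacement: (H, W, 2) array; displacement[y, x] = (dx, dy) is the shift
                  from destination pixel (x, y) to its source location.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    # Source coordinates; clamping to the border is an assumption.
    u = np.clip(xs + displacement[..., 0], 0, w - 1)
    v = np.clip(ys + displacement[..., 1], 0, h - 1)

    # Integer neighbours and fractional offsets for bilinear interpolation.
    x0 = np.floor(u).astype(int)
    y0 = np.floor(v).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = u - x0
    wy = v - y0

    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

img = np.arange(12, dtype=float).reshape(3, 4)
half_pixel = np.zeros((3, 4, 2))
half_pixel[..., 0] = 0.5             # sample half a pixel to the right
blended = warp(img, half_pixel)
```

For in-bounds coordinates, weighting the four neighbours this way is equivalent to summing over all pixels with a kernel of the form max(0, 1 - |distance|) per axis, and every operation is differentiable with respect to the displacement values.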

2.2 Data Augmentation and Code Regularization

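In this section, we describe how to combine the proposed regularization framework with a deformation-based data augmentation method. The deformation-based data augmentation is achieved by random 2D stretching, scaling, and spatial shifting of the target images. To encourage the appearance code to capture information invariant to spatial transformation, we impose L2 regularization on the distance between the appearance codes of images before (denoted as $x$) and after deformation ($x'$).

The appearance code and geometric code of $x'$ from the encoder are $z_a'$ and $z_g'$, and the corresponding reconstruction output and classification output are denoted as $\hat{x}'$ and $\hat{y}'$. The new loss function is

$$\mathcal{L}' = \mathcal{L}_{ce}(\hat{y}, y) + \mathcal{L}_{ce}(\hat{y}', y) + \lambda \big( \mathcal{L}_{rec}(\hat{x}, x) + \mathcal{L}_{rec}(\hat{x}', x') \big) + \beta\, \| z_a - z_a' \|_2,$$

where $\mathcal{L}_{ce}$ and $\mathcal{L}_{rec}$ are the cross-entropy and reconstruction losses of Section 2.1, $\lambda$ is the regularization weight for the reconstruction loss, and $\beta$ is the scale weight for the appearance code regularization. We set $\lambda$ and $\beta$ to 1 in experiments.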

As shown in Section 3, data augmentation and code regularization further encourage the separation of the appearance code from the geometric code. This leads to much improved generalization to novel test data with large geometric transformations.
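A minimal sketch of the augmentation and code regularization described in this section is given below. It rests on several assumptions that the text does not confirm: the helper names are hypothetical, only the spatial shift deformation is shown, shifted images are padded with edge pixels, the penalty is the unsquared L2 norm of the appearance-code difference, and the losses of the original and deformed images are simply summed.

```python
import numpy as np

def random_shift(image, max_shift, rng):
    """Shift a 2D image by a random integer offset, padding with edge pixels."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    padded = np.pad(image, max_shift, mode="edge")
    h, w = image.shape
    top, left = max_shift + dy, max_shift + dx
    return padded[top:top + h, left:left + w]

def appearance_code_penalty(z_a, z_a_deformed):
    """Mean L2 norm of the appearance-code difference over a batch."""
    return float(np.mean(np.linalg.norm(z_a - z_a_deformed, axis=-1)))

def augmented_loss(loss_original, loss_deformed, z_a, z_a_deformed, beta=1.0):
    """Sum of both per-image losses plus the weighted appearance-code penalty."""
    penalty = appearance_code_penalty(z_a, z_a_deformed)
    return loss_original + loss_deformed + beta * penalty

rng = np.random.default_rng(0)
image = np.arange(25, dtype=float).reshape(5, 5)
deformed = random_shift(image, max_shift=2, rng=rng)

# Hypothetical appearance codes for a batch of two images, before and after
# deformation; identical codes would give a zero penalty.
z_a = np.zeros((2, 4))
z_a_deformed = np.ones((2, 4))
total = augmented_loss(1.0, 2.0, z_a, z_a_deformed, beta=0.5)
```

The penalty is zero exactly when deformation leaves the appearance code unchanged, which is the invariance the regularizer is meant to encourage.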

3 Experiments

In this section, we evaluate the robustness of the proposed regularization framework and its generalization to novel test data. In addition, ablation studies are performed to analyze the individual mechanisms that contribute to the gains.

3.1 Robustness Against Adversarial Attacks

We evaluate our framework’s robustness against both white-box and black-box attacks. White-box attacks assume that the adversary knows the network architecture and the model parameters after training or has access to labeled training data. An adversary can use the detailed information to construct a perturbation for a given image. Black-box attacks only rely on a limited number of queries to the model to construct an adversarial image.

To compare with related work such as [Hinton et al.2018], we use three datasets: the MNIST dataset, the FaceScrub dataset, and the CIFAR10 dataset. The MNIST dataset of handwritten digits has 60,000 images for training and 10,000 for testing. The FaceScrub dataset contains 106,863 face images of 530 celebrities taken under real-world conditions. The CIFAR10 dataset contains 60,000 color images with a resolution of 32x32 in 10 different classes, of which 50,000 are used for training and 10,000 for testing.

3.1.1 White-Box Attack

We evaluate our model’s robustness using two popular white-box attack methods: the fast gradient sign method (FGSM) [Goodfellow et al.2015] and the Basic Iterative Method (BIM) [Kurakin et al.2017]. FGSM computes the gradient of the loss with respect to each pixel intensity and then changes the pixel intensity by a fixed amount in the direction that increases the loss. BIM applies FGSM with multiple smaller steps when creating the adversarial image. FGSM and BIM are adopted as the main attack methods by a series of works. Since these methods attack the image based on the gradient, they pose a great threat to modern CNN architectures trained through gradient back-propagation.
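To make the two attacks concrete, the sketch below applies them to a toy logistic-regression classifier whose input gradient has a closed form. The toy model, the function names, the choice of a BIM step size of eps / iters, and clipping pixels to [0, 1] are assumptions for illustration; they are not the settings used in the experiments.

```python
import numpy as np

def loss_gradient(x, y, w, b):
    """Gradient of binary cross-entropy w.r.t. the input of a logistic model."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return (p - y) * w

def fgsm(x, y, w, b, eps):
    """One step of size eps along the sign of the input gradient."""
    x_adv = x + eps * np.sign(loss_gradient(x, y, w, b))
    return np.clip(x_adv, 0.0, 1.0)

def bim(x, y, w, b, eps, iters):
    """Iterated FGSM with smaller steps, kept inside the eps-ball around x."""
    x_adv = x.copy()
    step = eps / iters
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(loss_gradient(x_adv, y, w, b))
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

w = np.array([1.0, -2.0, 0.5])
b = 0.1
x = np.array([0.5, 0.5, 0.5])        # toy "image" with three pixels
y = 1.0                              # true label
x_fgsm = fgsm(x, y, w, b, eps=0.1)
x_bim = bim(x, y, w, b, eps=0.1, iters=3)
```

Both attacks move every pixel by at most eps, yet the loss on the true label increases, which is the mechanism the attack success rates in the tables below measure at scale.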

We use the same benchmark CNN architecture as [Hinton et al.2018]. It has 4 convolutional layers with a softmax output. For fair comparison, the classifier part of the proposed framework uses exactly the same architecture as the benchmark CNN architecture. To test the robustness of the proposed framework, adversarial images are generated from the test set using a fully trained model. We then report the accuracy of the model on these images. All models are trained without adversarial examples.

As shown in Tables 1 and 2, the proposed regularization framework is much more robust than the benchmark CNN architecture. (The visualization of the feature embedding of legitimate images and adversarial images on the MNIST dataset can be found in the supplementary material.)

Model      ε=0.1   ε=0.2   ε=0.3   ε=0.4   ε=0.5
Base CNN   76.3%   89.4%   96.7%   97.3%   97.6%
Ours        8.8%   12.3%   16.2%   25.8%   38.5%
Table 1: Attack Success Rate Comparison on MNIST Dataset (FGSM)

Model (iters)   ε=0.01   ε=0.02   ε=0.03   ε=0.04   ε=0.05
Base CNN (2)      7.4%    29.4%    54.3%    74.4%    87.6%
Ours (2)          9.8%    20.6%    27.4%    31.0%    33.4%
Base CNN (3)      7.8%    30.7%    55.4%    75.3%    88.3%
Ours (3)         10.3%    22.1%    29.2%    32.4%    34.7%
Table 2: Attack Success Rate Comparison on MNIST Dataset (BIM)

It’s worth noting that for defending against adversarial attacks with large perturbations, simply adding the proposed regularization even exceeds some defense models that use adversarial training. For example, our model achieves higher accuracy (61.5%, i.e., a 38.5% attack success rate in Table 1) than the gradient-regularized model [Ross and Doshi-Velez2017] (lower than 60%) under a large FGSM perturbation (ε = 0.5) on the MNIST dataset.

We further evaluate our model’s robustness on the FaceScrub dataset. To simplify the problem, we randomly select 50 faces for the classification task. For the benchmark CNN architecture, the training and test accuracies are 99.9% and 74.2%, respectively; after adding the proposed regularization, they are 99.9% and 79.8%, respectively. As shown in Tables 3 and 4, the regularization framework is much more robust than the benchmark CNN model. Please note that is big enough as a perturbation in this task.

Model      ε=0.01   ε=0.02   ε=0.03   ε=0.04   ε=0.05
Base CNN    57.8%    67.0%    72.9%    78.3%    82.2%
Ours        19.9%    24.6%    27.9%    33.4%    38.9%
Table 3: Attack Success Rate Comparison on FaceScrub dataset (FGSM)

Model      ε=0.01   ε=0.02   ε=0.03   ε=0.04   ε=0.05
Base CNN    72.1%    94.7%    99.2%    99.8%    99.6%
Ours        26.2%    35.6%    44.1%    52.7%    60.7%
Table 4: Attack Success Rate Comparison on FaceScrub dataset (BIM)

Finally, the framework is tested on the CIFAR10 dataset to evaluate the proposed disentangling method on general natural images, where category information can lie in both the appearance code and the geometric code to varying degrees. In this experiment, data augmentation methods including random cropping and random flipping are applied to the normalized input data. We use a 110-layer ResNet [He et al.2016b] as the baseline, which achieves an 8.5% error rate on the test set. To add the proposed regularization, the output of the final fully-connected layer of the ResNet is modified to generate the geometric and appearance codes. The decoder is the same as described in the main text. With our proposed regularization, the network achieves an 8.2% error rate. The defense results are shown below:

Model      ε=0.1   ε=0.2   ε=0.3   ε=0.4   ε=0.5
Base CNN   29.5%   39.9%   47.4%   53.8%   59.3%
Ours       19.0%   26.8%   33.5%   39.4%   44.4%
Table 5: Attack success rate comparison on CIFAR10 dataset (FGSM)

Model (iters)      ε=0.1   ε=0.2   ε=0.3   ε=0.4   ε=0.5
CNN Baseline (2)   40.3%   61.5%   76.1%   84.4%   88.4%
Ours (2)           24.7%   50.2%   70.0%   80.8%   86.7%
CNN Baseline (3)   47.4%   70.6%   83.3%   89.1%   90.8%
Ours (3)           34.6%   62.6%   78.3%   85.1%   89.2%
Table 6: Attack success rate comparison on CIFAR10 dataset (BIM)

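As shown in Table 5 and Table 6, our method outperforms the CNN baseline under both the FGSM attack and the BIM attack. For the FGSM attack, the proposed method improves over the CNN baseline by 15% to 37%, measured as the relative gain in accuracy under attack (100% minus the attack success rate in Table 5).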
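The 15% to 37% range can be reproduced from Table 5 under one reading of the claim, namely that accuracy under attack equals 100% minus the attack success rate and that the improvement is measured relative to the baseline accuracy. This reading is an assumption; the short check below simply shows that it matches the reported range.

```python
# Attack success rates (%) from Table 5 (CIFAR10, FGSM).
eps_values = [0.1, 0.2, 0.3, 0.4, 0.5]
base_success = [29.5, 39.9, 47.4, 53.8, 59.3]
ours_success = [19.0, 26.8, 33.5, 39.4, 44.4]

def relative_accuracy_gain(base_rate, our_rate):
    """Relative accuracy gain (%), taking accuracy as 100 - success rate."""
    base_acc = 100.0 - base_rate
    our_acc = 100.0 - our_rate
    return 100.0 * (our_acc - base_acc) / base_acc

gains = [relative_accuracy_gain(b, o)
         for b, o in zip(base_success, ours_success)]

for eps, gain in zip(eps_values, gains):
    print(f"eps={eps}: {gain:.1f}% relative accuracy gain")
```

The gain grows monotonically with the perturbation size, from about 15% at the smallest perturbation to about 37% at the largest.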

3.1.2 Black-Box Attack

The first black-box attack method we use is proposed in [Narodytska and Kasiviswanathan2017]. The basic idea is greedy local search: an iterative search procedure in which, in each round, a local neighborhood is used to refine the current image and to optimize an objective function that depends on the network output. The number of iterations of the black-box attack is adjusted as an experimental variable. As shown in Table 7, the proposed regularization method significantly improves the robustness of the benchmark CNN architecture.

Iters      50      150
Base CNN   57.3%   62.5%
Ours       47.9%   57.8%
Table 7: Attack Success Rate on MNIST dataset (Local Search)

Besides the greedy local search approach, another black-box attack approach [Papernot et al.2017] leverages the transferability property of adversarial examples. The attack first creates adversarial examples using a known deep learning model (a white-box attack). It then applies these adversarial examples to the target deep learning model, whose architecture and parameters are not accessible. We tried this type of black-box attack and observed that our method has no noticeable advantage over the benchmark CNN architecture. This is not surprising, as the transferability property applies in general to both deep learning models and non-deep learning models such as decision trees and SVMs [Papernot et al.2017].

3.1.3 Joint Defense

As stated above, the proposed framework is orthogonal to other defense techniques and can potentially be combined with them to further improve the robustness of CNN models. As an example, we combine it with adversarial training. Specifically, we train the model with legitimate inputs (MNIST) and adversarial samples generated by the model using the FGSM attack method with ε of 0.3 [Ross and Doshi-Velez2017].

Model      ε=0.1   ε=0.2   ε=0.3   ε=0.4   ε=0.5
Base CNN   9.55%   22.0%   32.9%   42.9%   55.4%
Ours       9.49%   12.4%   15.6%   22.3%   34.8%
Table 8: Attack Success Rate Comparison on MNIST Dataset (FGSM with adversarial training)

As shown in Table 8, our regularization method can be combined with adversarial training to further improve the robustness of the classifiers. Our method without adversarial training (Table 1) outperforms the CNN baseline with adversarial training (Table 8). Our method with adversarial training (Table 8) further outperforms our method without adversarial training (Table 1) at the larger perturbation sizes; for example, the attack success rate is 34.8% vs 38.5% for ε = 0.5.

3.2 Generalization

We evaluate our model’s generalization to novel test data with large geometric transformations on the smallNORB dataset [LeCun et al.2004]. The dataset contains five categories of 3D objects (animals, humans, airplanes, trucks, and cars), each with 10 instances. The images are captured under combinations of six illumination conditions, nine elevations, and eighteen azimuths and have a resolution of 96 × 96. The training set and the testing set each contain 5 different instances of each category. This dataset requires the model to learn the common features within each target category.

For ease of comparison, we use the same experimental setting and the same network architecture for the benchmark CNN as reported in [Hinton et al.2018]. Both our model and the baseline CNN model are trained on one-third of the training set, which contains the six smallest azimuths or the three smallest elevations, and tested on two-thirds of the test set, which contains images with azimuths and elevations entirely different from those in the training set. For training, we randomly crop a sub-patch from the input image, resize the patch to 32 × 32, and add small random jitters to brightness and contrast. For testing, the samples are directly cropped from the center. For both training and testing, all data are normalized to have zero mean and unit variance.

To better disentangle the geometric code from the appearance code, we use the data augmentation and code regularization technique proposed in Section 2.2. Specifically, deformations such as displacement and stretching are applied to the inputs. The extent of stretching in this experiment is randomly selected from {2, 4, 6, 8} pixels; the distance of displacement in each dimension (in the range of {2, 4, 6, 8}) and the orientation of displacement are also randomly selected. The samples are then resized back to 32 × 32 and padded by pixels at the edge when necessary. These deformation operations do not change the appearance content. The L2 norm of the difference between the appearance code of an input image and that of its deformed image is used as a regularization term, as in Section 2.2.

Test set   Azimuth   Elevation
Base CNN   84.8%     78.8%
Ours       86.6%     85.3%
Table 9: Generalization comparison on novel viewpoints

As shown in Table 9, our model is about 8% better in accuracy (relative; 85.3% vs 78.8%) when generalizing to novel elevations and about 2% better (86.6% vs 84.8%) in the case of novel azimuths. Generalizing to novel elevations and azimuths is particularly challenging: the test distribution differs from the training distribution, which makes the classification extremely hard. For example, Figure 2 shows the cars in the testing set with the highest visual similarity to example cars in the training set. With largely different azimuths, elevations, and even shapes, the generalization task can be difficult even for human observers.

Figure 2: Comparison between the training data and testing data of the smallNORB dataset. The top row shows trucks under different azimuths in the training and testing sets, while the second row displays cars under different elevations. Notice that the instances in the training set and the testing set are highly different.
Figure 3: The synthesis study on code representation. Appearance outputs are all front-facing faces. Exchanging appearance and geometric codes generates new images with exchanged properties of the original images, as shown in columns of swap code output. Group 1 displays swap-code outputs with appearance code exchanged; Group 2 displays swap-code outputs with geometric code exchanged.

Note that our CNN baseline performance on novel elevations is lower than that reported in [Hinton et al.2018], while the performance on novel azimuths is higher. Although we use exactly the same network architecture according to the description in [Hinton et al.2018], such differences may be unavoidable given that many parameters that were not mentioned in the paper, such as the brightness and contrast jittering ranges, need to be tuned. Our result on generalizing to novel azimuths is at the state-of-the-art level: we achieve 86.6%, comparable to the 86.5% of CapsuleNet as reported in [Hinton et al.2018]. For generalizing to novel elevations, even compared with the CNN benchmark result of 82.2% reported in [Hinton et al.2018], our result of 85.3% is still 3.1 percentage points (about 3.7% relative) better. It is worth noting that our method is orthogonal to the capsule technique [Hinton et al.2018]. Potentially, the strengths of feature disentangling and the capsule design may be combined to further improve performance.

3.3 Synthesis Study

To evaluate the quality of disentangling, we examine the property of the “unwarped” face image and the exchangeability of $z_a$ and $z_g$. Two faces ($x^1$ and $x^2$) are randomly selected and encoded by our model as $(z_a^1, z_g^1)$ and $(z_a^2, z_g^2)$. We then exchange their codes into $(z_a^1, z_g^2)$ and $(z_a^2, z_g^1)$. The swapped codes are then decoded by the appearance decoder $G_a$ and the geometric decoder $G_g$, and new images are synthesized by the warping function $\mathcal{W}$: $\hat{x}^{ij} = \mathcal{W}(G_a(z_a^i), G_g(z_g^j))$, where $i, j \in \{1, 2\}$.

Two pairs of ($x^1$, $x^2$) are visualized in Figure 3. The appearance output is $G_a(z_a^i)$, which is the “unwarped” image. The reconstruction output is the warped output without swapping codes ($i = j$) and the swap code output is the warped output after swapping codes ($i \neq j$). Group 1 and Group 2 emphasize the effects of maintaining the geometric code and the appearance code, respectively, while exchanging the other code. The figure clearly reflects the roles of the geometric code and the appearance code in the face reconstruction process of the proposed model. It can be observed that the geometric code mainly contains deformation information, such as the facial features, the expression, and even the face orientation. On the other hand, the appearance code contains texture information, such as skin color, hair color, and background, as well as the basic structure of the face.
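The code-swapping procedure can be illustrated with toy stand-ins for the trained networks. Everything below is hypothetical: the "appearance decoder" just returns a 1D signal, the "geometric decoder" returns an integer shift, and warping is a circular shift. The sketch only shows how the swapped pairs are formed and decoded, not what the real decoders produce.

```python
import numpy as np

def swap_geometric_codes(pair_1, pair_2):
    """Exchange the geometric codes of two (appearance, geometric) pairs."""
    (z_a1, z_g1), (z_a2, z_g2) = pair_1, pair_2
    return (z_a1, z_g2), (z_a2, z_g1)

def synthesize(z_a, z_g, appearance_decoder, geometric_decoder, warp_fn):
    """Decode both codes and warp the appearance output by the displacement."""
    return warp_fn(appearance_decoder(z_a), geometric_decoder(z_g))

# Toy stand-ins (assumptions, not the paper's networks).
toy_appearance_decoder = lambda z_a: np.asarray(z_a, dtype=float)
toy_geometric_decoder = lambda z_g: int(z_g)
toy_warp = lambda signal, shift: np.roll(signal, shift)

face_1 = ([1.0, 0.0, 0.0, 0.0], 1)   # (appearance code, geometric code)
face_2 = ([0.0, 5.0, 0.0, 0.0], 2)

swapped_1, swapped_2 = swap_geometric_codes(face_1, face_2)
image_12 = synthesize(*swapped_1, toy_appearance_decoder,
                      toy_geometric_decoder, toy_warp)
image_21 = synthesize(*swapped_2, toy_appearance_decoder,
                      toy_geometric_decoder, toy_warp)
```

In this toy setting `image_12` keeps the content of the first input but is placed according to the second input's shift, mirroring how the swap code outputs combine one face's appearance with the other face's geometry.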

3.4 Ablation Studies

We conduct ablation studies to analyze the relative contributions of each design component in our framework: disentangled representation learning, augmented training, and L2 regularization of the appearance code difference when an input image is deformed. We first investigate the benefit of disentangled feature learning in terms of robustness to the FGSM attack on the MNIST dataset. To disable disentangled feature learning, the warping function component is removed from Figure 1. We refer to this variant as ‘Reconstruction’ (Recon.) in Table 10. As shown in Table 10, disentangled feature learning is essential for the model to be robust against FGSM, e.g., a 16.15% success rate vs a 43.83% success rate at ε = 0.3.

Model    ε=0.1   ε=0.2   ε=0.3   ε=0.4   ε=0.5
Recon.   11.5%   24.1%   43.8%   57.8%   67.6%
Ours      8.8%   12.3%   16.2%   25.8%   38.5%
Table 10: Ablation Study: Attack Success Rate on the MNIST dataset (FGSM attack)

Settings         Base CNN   Ours
Base model       81.3%      84.4%
with AT          84.8%      85.3%
with AT and CR   None       86.6%
Table 11: Ablation Study on the smallNORB dataset; AT: augmented training, CR: code regularization

Another ablation study is conducted on generalization using the smallNORB dataset. The experimental settings are the same as in Section 3.2. The base model in Table 11 is the architecture in Figure 1 with no augmented training and no additional regularization. The augmented training setting in Table 11 refers to training with data augmented by deforming the input data. The augmented training and code regularization setting in Table 11 means that the L2 regularization of the appearance code difference between an input image and its deformed version is further imposed.

The results in Table 11 show that the proposed regularization framework improves on the CNN base model in all settings. They also show that augmented training and the additional regularization help. With both augmented training and L2 regularization, the improvement over the base model is 2.2 percentage points (84.4% to 86.6%).

4 Related Work

There have been several works on learning disentangled feature representations in the machine learning literature. [Denton and Birodkar2017] and [Hsieh et al.2018] disentangle video representations into content and pose. [Villegas et al.2017] disentangle video representations into content and motion using image differences for video prediction. [Li et al.2018b] learn disentangled representations using analogy reasoning. Unlike previous methods that model pose with a rigid affine mapping, the coordinate displacement in our work is a non-rigid deformable mapping. In addition, most previous methods require learning from video data in order to utilize the time-dependent features in pose and the time-independent ones in content. Our model does not require time-series data and is suitable for independently sampled image data. Finally, our proposed framework aims to improve the robustness of CNNs, a completely different purpose from that of previous models.

5 Conclusion and Future Work

Robustness and generalization are paramount to deploying deep learning models in mission-critical applications. In this paper, we take inspiration from neuroscience and show that disentangling feature representation into appearance and geometric code can significantly improve the robustness and generalization of CNNs.

There are other mechanisms at play for the robustness and generalization ability of primate brain’s perception. In the future, we plan to explore other neuroscience motivated directions. For example, our model can be potentially combined with capsule networks [Hinton et al.2018].


  • [Athalye et al.2018] Anish Athalye, Nicholas Carlini, and David A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
  • [Buckman et al.2018] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In ICLR, 2018.
  • [Carlini and Wagner2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.
  • [Denton and Birodkar2017] Emily L. Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. In NIPS, 2017.
  • [Dhillon et al.2018] Guneet S. Dhillon, Kamyar Azizzadenesheli, Zachary C. Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Anima Anandkumar. Stochastic activation pruning for robust adversarial defense. 2018.
  • [Goodfellow et al.2015] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
  • [Guo et al.2018] Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. In ICLR, 2018.
  • [He et al.2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, pages 770–778, 2016.
  • [He et al.2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [He et al.2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask r-cnn. ICCV, pages 2980–2988, 2017.
  • [Hinton et al.2018] Geoffrey E Hinton, Sara Sabour, and Nicholas Frosst. Matrix capsules with em routing. In ICLR, 2018.
  • [Hsieh et al.2018] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, Li Fei-Fei, and Juan Carlos Niebles. Learning to decompose and disentangle representations for video prediction. In NIPS, 2018.
  • [Jaderberg et al.2015] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017–2025, 2015.
  • [Kurakin et al.2016] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
  • [Kurakin et al.2017] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. ICLR Workshop, 2017.
  • [LeCun et al.2004] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.
  • [Li et al.2018a] Yitong Li, Martin Renqiang Min, Dinghan Shen, David E. Carlson, and Lawrence Carin. Video generation from text. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, pages 7065–7072, 2018.
  • [Li et al.2018b] Zejian Li, Yongchuan Tang, and Yongxing He. Unsupervised disentangled representation learning with analogical relations. In IJCAI, 2018.
  • [Maaten and Hinton2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [Narodytska and Kasiviswanathan2017] N. Narodytska and S. Kasiviswanathan. Simple black-box adversarial attacks on deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310–1318, 2017.
  • [Papernot et al.2016a] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372–387. IEEE, 2016.
  • [Papernot et al.2016b] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016.
  • [Papernot et al.2017] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ASIA CCS ’17, pages 506–519, 2017.
  • [Pu et al.2018] Yunchen Pu, Martin Renqiang Min, Zhe Gan, and Lawrence Carin. Adaptive feature abstraction for translating video to text. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, pages 7284–7291, 2018.
  • [Ren et al.2015] Shaoqing Ren, Kaiming He, Ross B. Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [Ross and Doshi-Velez2017] Andrew Slavin Ross and Finale Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv preprint arXiv:1711.09404, 2017.
  • [Villegas et al.2017] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
  • [Vinyals et al.2015] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
  • [Wong and Kolter2018] Eric Wong and J. Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, 2018.
  • [Xing et al.2018] Xianglei Xing, Ruiqi Gao, Song-Chun Zhu, and Ying Nian Wu. Deformable generator network: unsupervised disentanglement of appearance and geometry. In ICML workshop on Theoretical Foundations and Applications of Deep Generative Models (TADGM), July 2018.

6 Supplementary Material: Robust feature embedding against attacks

To verify that our model indeed has a more robust feature embedding, we used t-SNE [Maaten and Hinton2008] to visualize the embeddings of adversarial and legitimate images from the MNIST dataset for both the baseline model and our model (Figure 4). Indeed, the joint appearance and geometric code of our model for attacked images (4d) remains more separable than the embedding of the baseline CNN (4b). One reason we expect our model to be more robust against adversarial attacks is that appearance and geometric features may each suit different classification tasks. Taking digits as an example, several digits share the same topological structure and differ mainly in their geometric properties. By separating the feature space into an appearance code and a geometric code, we provide the classifier with shape codes that live in a lower-dimensional space than a traditional CNN needs, which makes training more data-efficient. Interestingly, this is also reflected in the embedding structures: as shown in Figures 4e and 4g, manifolds of different digits are entangled in the space of the appearance code even before the attack. But as shown in Figures 4f and 4h, the embedding in the space of the geometric code stays highly separable even after the attack, suggesting that the classifier is indeed able to use the more robust geometric code.

Figure 4: t-SNE visualization of the feature embedding of (a) the baseline CNN before attack; (b) the baseline CNN after attack; (c) the entire code of the proposed model before attack; (d) the entire code of the proposed model after attack; (e) the appearance code of the proposed model before attack; (f) the geometric code of the proposed model before attack; (g) the appearance code of the proposed model after attack; (h) the geometric code of the proposed model after attack. Visualization is conducted on the MNIST dataset with FGSM (ε = 0.3). Each color corresponds to one digit. In the after-attack panels, square points denote the embedding code before attack and triangle points denote the code after attack (visible after zooming in).
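
For readers unfamiliar with the attack used in Figure 4, the following is a minimal sketch of a single FGSM step: each pixel is moved by ε in the direction of the sign of the loss gradient and then clipped to the valid range. It is not the experimental setup behind the figure. The toy logistic-regression classifier, the three-pixel input, the weights, and the function names `predict_proba` and `fgsm` are all assumptions made for illustration; the attack in the figure targets the networks' own loss.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Probability of class 1 under a toy logistic-regression model (assumed)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(x, y, w, b, eps=0.3):
    """One FGSM step on input x with true label y in {0, 1}."""
    p = predict_proba(x, w, b)
    # For binary cross-entropy, d(loss)/d(x_i) = (p - y) * w_i.
    grad = [(p - y) * wi for wi in w]
    # Keep only the sign of each gradient component (-1, 0, or +1).
    signs = [(g > 0) - (g < 0) for g in grad]
    # Step eps along the sign, then clip pixels back into [0, 1].
    return [min(1.0, max(0.0, xi + eps * s)) for xi, s in zip(x, signs)]


# Hypothetical three-pixel "image" and weights, for illustration only.
w, b = [1.0, -2.0, 0.5], 0.0
x, y = [0.5, 0.5, 0.5], 1
x_adv = fgsm(x, y, w, b, eps=0.3)

print("clean  p(y=1):", predict_proba(x, w, b))
print("attack p(y=1):", predict_proba(x_adv, w, b))
```

In this toy case every pixel changes by at most ε, yet the predicted probability of the true label drops. Figure 4 visualizes the same kind of shift in embedding space, where squares mark clean embeddings and triangles mark attacked ones.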