Graph Interpolating Activation Improves Both Natural and Robust Accuracies in Data-Efficient Deep Learning

07/16/2019 · by Bao Wang, et al.

Improving the accuracy and robustness of deep neural nets (DNNs) and adapting them to small training data are primary tasks in deep learning research. In this paper, we replace the output activation function of DNNs, typically the data-agnostic softmax function, with a graph Laplacian-based high dimensional interpolating function which, in the continuum limit, converges to the solution of a Laplace-Beltrami equation on a high dimensional manifold. Furthermore, we propose end-to-end training and testing algorithms for this new architecture. The proposed DNN with graph interpolating activation integrates the advantages of both deep learning and manifold learning. Compared to conventional DNNs with the softmax function as output activation, the new framework demonstrates the following major advantages: First, it is better suited to data-efficient learning, in which we train high capacity DNNs without a large amount of training data. Second, it remarkably improves both the natural accuracy on clean images and the robust accuracy on adversarial images crafted by both white-box and black-box adversarial attacks. Third, it is a natural choice for semi-supervised learning. For reproducibility, the code is available at <https://github.com/BaoWangMath/DNN-DataDependentActivation>.


1 Introduction

Deep learning (DL) has achieved tremendous success in image and speech recognition and natural language processing, and it has been widely used in industrial production (LeCun et al., 2015). Improving the generalization accuracy and adversarial robustness of deep neural nets (DNNs) are primary tasks in DL research. Moreover, applying DNNs to data-efficient machine learning (ML), where we do not have a large number of training instances, is important to many research communities.

Despite the extraordinary success of DNNs in image and speech perception, their vulnerability to adversarial attacks raises concerns when applying them to security-critical tasks, e.g., autonomous cars, robotics, and DNN-based malware detection systems (Anonymous, 2019). Since the seminal work of Szegedy et al. (2013), research has shown that DNNs are vulnerable to many kinds of adversarial attacks, including physical, poisoning, and inference (evasion) attacks (Chen et al., 2017a; Carlini and Wagner, 2016; Papernot et al., 2016a; Goodfellow et al., 2014). Physical attacks occur during data acquisition, while poisoning and inference attacks happen during the training and testing phases of machine learning (ML), respectively.

Adversarial attacks have been successful in both white-box and black-box scenarios. In white-box attacks, the adversary has access to the architecture and weights of the DNN. In black-box attacks, the adversary has no access to the details of the underlying model. Black-box attacks succeed because an image perturbed to cause misclassification on one DNN has a significant chance of also being misclassified by another DNN; this is known as the transferability of adversarial examples (Papernot et al., 2016c). Due to this transferability, it is straightforward to attack DNNs in a black-box fashion by attacking an oracle model (Liu et al., 2016; Brendel et al., 2017). There also exist universal perturbations that can imperceptibly perturb any image and cause misclassification for any given network (Moosavi-Dezfooli et al., 2017). Dou et al. (2018) analyzed the efficiency of many adversarial attacks for a large variety of DNNs.

Besides the issue of adversarial vulnerability, the superior accuracy of DNNs depends heavily on a massive amount of training data. When we do not have sufficient training data to train a high capacity deep network, which is often the case in practice, performance degradation becomes a serious problem. As shown in Table 1, when ResNets are trained on 50K or 10K CIFAR10 images, the test accuracy improves as the depth of the ResNet increases. However, when ResNets are trained on only 1K images, the test accuracy decays as the model's capacity increases. For instance, the test errors of ResNet20 and ResNet110 are 34.90% and 42.94%, respectively.

Network # of parameters 50K 10K 1K
ResNet20 0.27M 9.06% (8.75%(He et al., 2016c)) 12.83% 34.90%
ResNet32 0.46M 7.99% (7.51%(He et al., 2016c)) 11.18% 33.41%
ResNet44 0.66M 7.31% (7.17%(He et al., 2016c)) 10.66% 34.58%
ResNet56 0.85M 7.24% (6.97%(He et al., 2016c)) 9.83% 37.83%
ResNet110 1.7M 6.41% (6.43%(He et al., 2016c)) 8.91% 42.94%
Table 1: Test errors of DNNs trained on the entire (50K), the first 10K, and the first 1K instances of the training set of the CIFAR10.

1.1 Our Contributions

In this paper, we propose an end-to-end framework to mitigate the two aforementioned issues of DNNs, i.e., adversarial vulnerability and generalization accuracy degradation in the small training data regime. At the core of our framework is replacing the data-agnostic softmax output activation with a data-dependent graph interpolating function. To this end, we leverage the weighted nonlocal Laplacian (WNLL) (Shi et al., 2018) to interpolate features in the hidden state of DNNs. In backpropagation, we linearize the WNLL activation function to approximately compute the gradient of the loss function. The major advantages of the proposed framework are summarized below.

  • The naturally trained DNNs with the WNLL output activation obtained by solving the empirical risk minimization (ERM), i.e., Eq. (2), are remarkably more accurate than the vanilla DNNs with the softmax output activation.

  • The robustly trained DNNs with the WNLL activation obtained by solving the empirical adversarial risk minimization (EARM), i.e., Eq. (1), are much more robust to adversarial attacks than the robustly trained vanilla DNNs. To the best of our knowledge, DNNs with the WNLL activation achieve state-of-the-art results in adversarial defense on the CIFAR10 and MNIST benchmarks.

  • In the small training data situation, the WNLL activation can regularize the training procedure. The test accuracy of DNNs with the WNLL activation increases as the network goes deeper.

  • DNNs with the WNLL output activation are a natural choice for semi-supervised deep learning.

  • The proposed framework is applicable to any off-the-shelf DNN that uses the softmax as its output activation.

1.2 Related Work

In this subsection, we discuss related work from the viewpoints of improving generalizability and adversarial robustness.

1.2.1 Improving Generalizability of DNNs

Generalizability is crucial to DL, and many efforts have been made to improve the test accuracy of DNNs (Bengio et al., 2007; Hinton et al., 2006). Advances in network architectures such as VGG networks (Simonyan and Zisserman, 2014), deep residual networks (ResNets) (He et al., 2016c, b), and recently DenseNets (Huang et al., 2017) and many others (Chen et al., 2017b), together with powerful hardware, make the training of very deep networks with good generalization capabilities possible. Effective regularization techniques such as dropout and maxout (Hinton et al., 2012; Wan et al., 2013; Goodfellow et al., 2013), as well as data augmentation methods (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014), have also explicitly improved generalization for DNNs. From the optimization point of view, Laplacian smoothing stochastic gradient descent has recently been proposed to improve training and generalization of DNNs (Osher et al., 2018).

A key component of DNNs is the activation function. Improvements in the design of activation functions, such as the rectified linear unit (ReLU) (Glorot et al., 2011), have led to huge performance improvements in computer vision tasks (Nair and Hinton, 2010; Krizhevsky et al., 2012). More recently, activation functions adaptively trained on the data, such as the adaptive piecewise linear unit (APLU) (Agostinelli et al., 2014) and the parametric rectified linear unit (PReLU) (He et al., 2015), have led to further improvements in the performance of DNNs. For the output activation, the support vector machine (SVM) has also been successfully applied in place of the softmax (Tang, 2013). Though training DNNs with the softmax or SVM as output activation is effective in many tasks, it is possible that alternative activations that account for the manifold structure of the data, by interpolating the output based on both training and testing data, can boost the performance of the deep network. In particular, ResNets can be modeled as solving control problems of a class of transport equations in the continuum limit (Li and Shi, 2017; Wang et al., 2018c). Transport equation theory suggests that using an interpolating function that interpolates terminal values from initial values can dramatically simplify the control problem compared with an ad-hoc choice. This further suggests that a fixed, data-agnostic activation for the output layer may be suboptimal.

1.2.2 Adversarial Defense

EARM is one of the most successful mathematical frameworks for certified adversarial defense. Under the EARM framework, adversarial defense against $\ell_\infty$-norm based inference attacks can be formulated as solving the following minimax optimization problem

$$\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \max_{\|x_i' - x_i\|_\infty \le \epsilon} \mathcal{L}\big(f(x_i', \theta), y_i\big), \tag{1}$$

where $f(\cdot, \theta)$ is a function in the hypothesis class $\mathcal{H}$, e.g., DNNs, parameterized by $\theta$. Here, $\{(x_i, y_i)\}_{i=1}^{n}$ are i.i.d. data-label pairs drawn from some high dimensional unknown distribution $\mathcal{D}$, and $\mathcal{L}(f(x_i, \theta), y_i)$ is the loss associated with $f$ on the data-label pair $(x_i, y_i)$. For classification, $\mathcal{L}$ is typically selected to be the cross-entropy loss; for regression, the root mean square error is commonly used. Adversarial defense against attacks measured in other norms can be formulated similarly. As a comparison, solving ERM is used to train models in a natural fashion to classify the clean data, where ERM solves the following optimization problem

$$\min_{\theta} \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}\big(f(x_i, \theta), y_i\big). \tag{2}$$

Many of the existing approaches try to defend against inference attacks by searching for a good surrogate loss to approximate the loss function in the EARM. Projected gradient descent (PGD) adversarial training is a representative work along this line: it approximates the EARM by replacing the clean data $x_i$ with adversarial data $x_i'$ obtained by applying the PGD attack to the clean data (Goodfellow et al., 2014; Madry et al., 2018). Besides finding an appropriate surrogate to approximate the empirical adversarial risk, under the EARM framework we can also improve the hypothesis class to improve the adversarial robustness of the trained robust models (Wang et al., 2018c).

There has been a massive volume of research over the past several years on defending DNNs against adversarial attacks. Randomized smoothing transforms an arbitrary classifier into a "smoothed" surrogate classifier and is certifiably robust to $\ell_2$-norm based adversarial attacks (Lecuyer et al., 2019; Cohen et al., 2019). Among randomized smoothing techniques, one of the most popular ideas is to inject Gaussian noise into the input image and to base the classification result on the probability that the noisy image falls into each decision region.
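To make this idea concrete, the following is a minimal PyTorch sketch of the voting-based prediction step of randomized smoothing; `base_classifier`, `sigma`, and `n_samples` are illustrative names and values, and the certified bounds of Cohen et al. (2019) require additional statistical tests on top of this voting step.

```python
import torch

def smoothed_predict(base_classifier, x, sigma=0.25, n_samples=100):
    """Monte Carlo estimate of the randomized-smoothing prediction:
    classify many Gaussian-noised copies of x and return the majority class.
    This is the prediction step only, not the certification procedure."""
    with torch.no_grad():
        counts = None
        for _ in range(n_samples):
            noisy = x + sigma * torch.randn_like(x)
            logits = base_classifier(noisy)          # shape: (batch, num_classes)
            one_hot = torch.zeros_like(logits)
            one_hot.scatter_(1, logits.argmax(dim=1, keepdim=True), 1.0)
            counts = one_hot if counts is None else counts + one_hot
    return counts.argmax(dim=1)                      # majority-vote class index
```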

Wang et al. (2018c) modeled ResNets as a transport equation and interpreted the adversarial vulnerability of DNNs as irregularity of the transport equation's solution. To enhance its regularity, i.e., improve adversarial robustness, they added a diffusion term to the transport equation and solved the resulting convection-diffusion equation by the celebrated Feynman-Kac formula. The resulting algorithm remarkably improves both the natural and robust accuracies of the robustly trained DNNs.

Robust optimization for solving the EARM has achieved tremendous success in certified adversarial defense (Madry et al., 2018; Zhang et al., 2019). Regularization in the EARM can further boost the robustness of adversarially trained models (Kurakin et al., 2017; Ross and Doshi-Velez, 2017; Zheng et al., 2016). An adversarial defense algorithm should learn a classifier with high test accuracy on both clean and adversarial data. To achieve this goal, Zhang et al. (2019) developed a new loss function, named TRADES, that explicitly trades off between natural and robust generalization.

Besides robust optimization, there are many other approaches to adversarial defense. Defensive distillation was proposed to increase the stability of DNNs (Papernot et al., 2016b), and a related approach (Tramèr et al., 2018) cleverly modifies the training data to increase robustness against black-box attacks and adversarial attacks in general. To counter adversarial perturbations, Guo et al. (2018) proposed to use image transformations, e.g., bit-depth reduction, JPEG compression, total variation minimization, and image quilting. These input transformations are intended to be non-differentiable, thus making adversarial attacks, especially gradient-based attacks, more difficult. GANs have also been used for adversarial defense (Samangouei et al., 2018). However, adversarial attacks can break these gradient-masking based defenses by circumventing the obfuscated gradients (Athalye et al., 2018).

Instead of using the softmax function as DNN’s output activation, Wang et al. (2018b) utilized a non-parametric graph interpolating function which provably converges to the solution of a Laplace-Beltrami equation on a high dimensional manifold (Shi et al., 2018). The proposed data-dependent activation shows a remarkable amount of generalization accuracy improvement, and the results are more stable when one only has a limited amount of training data. This data-dependent activation is also useful in adversarial defense when combined with image transformations (Wang et al., 2018a). Verma et al. (2018) simplified the interpolation procedure and generalized it to more hidden layers to learn better representations.

1.3 Organization

This paper is structured in the following way: In section 2, we present the generic architecture of DNNs with a graph interpolating function as the output activation. In section 3, we present training and testing algorithms, in both natural and robust fashions, for the proposed DNNs with the graph interpolating activation. We verify the performance of the proposed algorithms numerically in section 4 through the lens of natural and robust generalization accuracies and semi-supervised learning. In section 5, we provide geometric explanations for the improved generalization and robustness of the proposed new framework. This paper concludes with remarks in section 6.

2 Network Architecture

We illustrate the training and testing procedures of a standard DNN in Fig. 1, where

  • Training (Fig. 1 (a)): in the $k$-th iteration, given a mini-batch of training data $(X, Y)$, we perform:

    • Forward propagation: Transform $X$ into features $\tilde{X}$ by the DNN block (a combination of convolutional layers, nonlinearities, etc.), and then feed these features into the softmax activation to obtain the predictions $\hat{Y}$, i.e.,

      $$\tilde{X} = \mathrm{DNN}\big(X, \Theta^{(k)}\big), \qquad \hat{Y} = \mathrm{softmax}\big(\tilde{X}, W^{(k)}\big),$$

      where $(\Theta^{(k)}, W^{(k)})$ are the temporary values of the trainable weights at the $k$-th iteration. Then the loss (e.g., cross entropy) is computed between the ground-truth labels $Y$ and the predicted labels $\hat{Y}$: $\mathcal{L} = \mathrm{Loss}(Y, \hat{Y})$.

    • Backpropagation: Update the weights $(\Theta, W)$ by applying gradient descent with learning rate $\gamma$:

      $$\Theta^{(k+1)} = \Theta^{(k)} - \gamma \frac{\partial \mathcal{L}}{\partial \Theta^{(k)}}, \qquad W^{(k+1)} = W^{(k)} - \gamma \frac{\partial \mathcal{L}}{\partial W^{(k)}}.$$

  • Testing (Fig. 1 (b)): Once the training procedure finishes with the learned parameters $(\Theta, W)$, the predicted labels for the testing data $X$ are

    $$\hat{Y} = \mathrm{softmax}\big(\mathrm{DNN}(X, \Theta), W\big);$$

    for notational simplicity, we still denote the test set and the learned weights as $X$, $\Theta$, and $W$, respectively.

Even though this deep learning paradigm achieves state-of-the-art success in many artificial intelligence tasks, the data-agnostic softmax activation acts as a linear model on the space of deep features $\tilde{X}$: it does not take into consideration the underlying manifold structure of $\tilde{X}$, and it has other shortcomings, e.g., it is less applicable when we have a small amount of training data and it is not robust to adversarial attacks. To address this, we replace the softmax output activation with a graph interpolating function, the WNLL, which will be introduced in the following subsection. We illustrate the training and testing data flow in Fig. 2, which will be discussed later.

Figure 1: Illustration of training and testing procedures of the standard DNN with the softmax function as output activation layer. (a): Training; (b): Testing.
Figure 2: Illustration of training and testing procedures of the DNN with the WNLL interpolating function as the output activation function. (a): Training; (b): Testing.

2.1 Graph-based High Dimensional Interpolating Function – A Harmonic Extension Approach

Let $P = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_n\}$ be a set of points located on a high dimensional manifold $\mathcal{M} \subset \mathbb{R}^{d}$ and let $P^{\mathrm{te}} \subset P$ ("te" for template) be a subset of $P$ that is labeled with the label function $g(\mathbf{x})$. We want to interpolate a function $u$ that is defined on the whole manifold and can be used to interpolate labels for the entire dataset $P$. The harmonic extension is a natural approach to find such a smooth interpolating function, which is defined by minimizing the following Dirichlet energy functional

$$\mathcal{E}(u) = \frac{1}{2}\sum_{\mathbf{x}, \mathbf{y} \in P} w(\mathbf{x}, \mathbf{y})\big(u(\mathbf{x}) - u(\mathbf{y})\big)^2, \tag{3}$$

with the boundary condition

$$u(\mathbf{x}) = g(\mathbf{x}), \qquad \mathbf{x} \in P^{\mathrm{te}},$$

where $w(\mathbf{x}, \mathbf{y})$ is a weight function, chosen to be Gaussian: $w(\mathbf{x}, \mathbf{y}) = \exp\big(-\|\mathbf{x} - \mathbf{y}\|^2 / \sigma^2\big)$ with $\sigma$ being a scaling parameter. By taking the variational derivative of the energy functional in Eq. (3), we obtain the following Euler-Lagrange equation

$$\begin{cases} \sum_{\mathbf{y} \in P} \big(w(\mathbf{x}, \mathbf{y}) + w(\mathbf{y}, \mathbf{x})\big)\big(u(\mathbf{x}) - u(\mathbf{y})\big) = 0, & \mathbf{x} \in P \setminus P^{\mathrm{te}},\\[2pt] u(\mathbf{x}) = g(\mathbf{x}), & \mathbf{x} \in P^{\mathrm{te}}. \end{cases} \tag{4}$$

By solving the linear system in Eq. (4), we obtain labels $u(\mathbf{x})$ for the unlabeled data $\mathbf{x} \in P \setminus P^{\mathrm{te}}$. The interpolation quality becomes very poor when only a tiny fraction of the data is labeled, i.e., $|P^{\mathrm{te}}| \ll |P|$. To alleviate this degradation, the weight on the labeled data is increased in the above Euler-Lagrange equation (Eq. (4)), which gives

$$\begin{cases} \displaystyle\sum_{\mathbf{y} \in P} \big(w(\mathbf{x}, \mathbf{y}) + w(\mathbf{y}, \mathbf{x})\big)\big(u(\mathbf{x}) - u(\mathbf{y})\big) + \mu \sum_{\mathbf{y} \in P^{\mathrm{te}}} w(\mathbf{y}, \mathbf{x})\big(u(\mathbf{x}) - u(\mathbf{y})\big) = 0, & \mathbf{x} \in P \setminus P^{\mathrm{te}},\\[2pt] u(\mathbf{x}) = g(\mathbf{x}), & \mathbf{x} \in P^{\mathrm{te}}, \end{cases} \tag{5}$$

where $\mu = |P| / |P^{\mathrm{te}}|$ up-weights the terms involving the labeled set.

We call the solution to Eq. (5) the weighted nonlocal Laplacian (WNLL) interpolation and denote it as $u_{\mathrm{WNLL}}(\mathbf{x})$. Shi et al. (2018) showed that the WNLL graph interpolating function converges to the solution of the associated high dimensional Laplace-Beltrami equation. For classification, $g(\mathbf{x})$ is the one-hot label of the template point $\mathbf{x}$.

For a given $\mathbf{x}$, due to the exponential decay of the Gaussian kernel, we do not need to compute the weights $w(\mathbf{x}, \mathbf{y})$ for all $\mathbf{y} \in P$. In practice, we only consider the contribution from the $K$ nearest neighbors of $\mathbf{x}$ and take the scaling parameter $\sigma(\mathbf{x})$ to be the distance between $\mathbf{x}$ and its $\bar{K}$-th nearest neighbor. We use an approximate nearest neighbor search (Muja and Lowe, 2014) to find the nearest neighbors of any given data point $\mathbf{x}$.
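For concreteness, a minimal dense sketch of the WNLL interpolation in Eq. (5) is given below (NumPy/scikit-learn). The function and parameter names are illustrative, the up-weighting factor follows the scaling $\mu = |P|/|P^{\mathrm{te}}|$, and for simplicity the same number of neighbors is used for the graph and for the normalization; this is a sketch rather than the released implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def wnll_interpolate(X, labeled_idx, labels_onehot, k=15):
    """Dense sketch of WNLL label interpolation (Eq. (5)).
    X: (n, d) feature matrix; labeled_idx: sorted indices of the template points;
    labels_onehot: (len(labeled_idx), C) one-hot template labels; k: #neighbors.
    Returns (n_unlabeled, C) soft labels for the unlabeled points."""
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)                 # column 0 is the query point itself
    sigma = dist[:, -1] + 1e-12                  # k-th neighbor distance as local scale

    W = np.zeros((n, n))                         # asymmetric Gaussian kNN weights w(x, y)
    for i in range(n):
        W[i, idx[i, 1:]] = np.exp(-dist[i, 1:] ** 2 / sigma[i] ** 2)

    labeled = np.zeros(n, dtype=bool)
    labeled[list(labeled_idx)] = True
    mu = n / len(labeled_idx)                    # WNLL up-weighting |P| / |P_te|

    Ws = W + W.T                                 # w(x, y) + w(y, x)
    C = W.T.copy()                               # correction term uses w(y, x) ...
    C[:, ~labeled] = 0.0                         # ... restricted to labeled neighbors y

    M = Ws + mu * C
    L = np.diag(M.sum(axis=1)) - M               # graph-Laplacian-like operator of Eq. (5)
    ul = ~labeled
    A = L[np.ix_(ul, ul)]
    b = -L[np.ix_(ul, labeled)] @ labels_onehot
    return np.linalg.solve(A, b)                 # soft labels; argmax gives the class
```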

2.1.1 Theoretical Guarantees for the WNLL Interpolating Function

To ensure the accuracy of the WNLL interpolation, the template, i.e., the labeled data, should cover all classes of data in $P$. We give a necessary condition in Theorem 2.1.1.

Theorem 2.1.1 [Wang et al. (2018b)] Suppose we have a dataset $P$ which consists of $M$ different classes of data, with each instance having the same probability of belonging to any of the $M$ classes. Moreover, suppose the number of instances of each class is sufficiently large. If we want to guarantee that all classes of data are sampled at least once, on average at least $M \log M$ data points need to be sampled from $P$. In this case, the number of data points sampled, in expectation for each class, is $\log M$.

We now consider the convergence of the WNLL for graph interpolation and give a theoretical interpretation of the special weight $\mu$ selected in Eq. (5). We summarize some results from Shi et al. (2018). Consider the following generalized WNLL interpolation

$$\sum_{\mathbf{y} \in P} \big(w_t(\mathbf{x}, \mathbf{y}) + w_t(\mathbf{y}, \mathbf{x})\big)\big(u(\mathbf{x}) - u(\mathbf{y})\big) + \mu \sum_{\mathbf{y} \in P^{\mathrm{te}}} \bar{w}_t(\mathbf{y}, \mathbf{x})\big(u(\mathbf{x}) - u(\mathbf{y})\big) = 0, \qquad \mathbf{x} \in P \setminus P^{\mathrm{te}}, \tag{6}$$

where $w_t$ and $\bar{w}_t$ are kernel functions given as

$$w_t(\mathbf{x}, \mathbf{y}) = C_t\, R\!\left(\frac{\|\mathbf{x} - \mathbf{y}\|^2}{4t}\right), \qquad \bar{w}_t(\mathbf{x}, \mathbf{y}) = C_t\, \bar{R}\!\left(\frac{\|\mathbf{x} - \mathbf{y}\|^2}{4t}\right), \tag{7}$$

where $C_t$ is a normalization factor and $R, \bar{R}: [0, \infty) \to [0, \infty)$ are two kernel functions satisfying the conditions listed in Assumption 1.

Assumption 1
  • Assumptions on the manifold: $\mathcal{M}$ is a $k$-dimensional closed manifold isometrically embedded in a Euclidean space $\mathbb{R}^{d}$. $\mathcal{M}$ and the labeled region $\mathcal{D}$ are smooth submanifolds of $\mathbb{R}^{d}$. Moreover, $\mathcal{D} \subset \mathcal{M}$.

  • Assumptions on the kernel functions:

    • Smoothness: $R, \bar{R} \in C^1([0, \infty))$;

    • Nonnegativity: $R(r) \ge 0$ and $\bar{R}(r) \ge 0$ for any $r \ge 0$;

    • Compact support: $R(r) = 0$ and $\bar{R}(r) = 0$ for $r > 1$;

    • Nondegeneracy: there exists $\delta_0 > 0$ such that $R(r) \ge \delta_0$ for $0 \le r \le \tfrac{1}{2}$ and $\bar{R}(r) \ge \delta_0$ for $0 \le r \le \tfrac{1}{2}$.

  • Assumptions on the point cloud: the points in $P$ and in $P^{\mathrm{te}}$ are uniformly distributed on $\mathcal{M}$ and $\mathcal{D}$, respectively.

As the continuous counterpart, we consider the Laplace-Beltrami equation on a closed smooth manifold $\mathcal{M}$:

$$\begin{cases} -\Delta_{\mathcal{M}} u(\mathbf{x}) = 0, & \mathbf{x} \in \mathcal{M} \setminus \mathcal{D},\\[2pt] u(\mathbf{x}) = g(\mathbf{x}), & \mathbf{x} \in \mathcal{D}, \end{cases} \tag{8}$$

where $\Delta_{\mathcal{M}} = \mathrm{div}\,(\nabla_{\mathcal{M}})$ is the Laplace-Beltrami operator on $\mathcal{M}$. Let $\Phi: \Omega \subset \mathbb{R}^{k} \to \mathcal{M} \subset \mathbb{R}^{d}$ be a local parametrization of $\mathcal{M}$ and $\theta \in \Omega$. For any differentiable function $u: \mathcal{M} \to \mathbb{R}$, we define the gradient on the manifold as

$$\nabla_{\mathcal{M}} u = \sum_{i,j=1}^{k} g^{ij} \frac{\partial u}{\partial \theta_i} \frac{\partial \Phi}{\partial \theta_j}. \tag{9}$$

And for a vector field $F$ on $\mathcal{M}$, where $F(\mathbf{x}) \in \mathcal{T}_{\mathbf{x}}\mathcal{M}$ and $\mathcal{T}_{\mathbf{x}}\mathcal{M}$ is the tangent space of $\mathcal{M}$ at $\mathbf{x}$, the divergence is defined as

$$\mathrm{div}(F) = \frac{1}{\sqrt{\det G}} \sum_{i,j=1}^{k} \frac{\partial}{\partial \theta_i}\Big(\sqrt{\det G}\; g^{ij}\, F \cdot \frac{\partial \Phi}{\partial \theta_j}\Big), \tag{10}$$

where $G = (g_{ij})$, $\det G$ is the determinant of the matrix $G$, $(g^{ij}) = G^{-1}$, and $(g_{ij})$ is the first fundamental form with

$$g_{ij} = \frac{\partial \Phi}{\partial \theta_i} \cdot \frac{\partial \Phi}{\partial \theta_j}, \qquad i, j = 1, \dots, k, \tag{11}$$

and $F$ is expressed in the embedding coordinates of $\mathbb{R}^{d}$.

We have the following high probability guarantee for the convergence of the WNLL interpolating function to the solution of the Laplace-Beltrami equation on the manifold $\mathcal{M}$.

Theorem [Shi et al. (2018)] Let $u_n$ solve Eq. (6) and let $u$ solve Eq. (8). Under Assumption 1, with high probability $u_n$ converges to $u$ as the number of sample points grows, as long as the weight $\mu$ and the kernel scale $t$ satisfy the scaling constraint

(12)

stated in Shi et al. (2018), whose constant is independent of the sample size and of $t$.

In the above theorem, Eq. (12) gives a constraint on the weight $\mu$ placed on the labeled set. When the point cloud samples $\mathcal{M}$ densely enough, this constraint implies that the weight on the labeled terms should grow like the ratio $|P| / |P^{\mathrm{te}}|$ between the sizes of the full dataset and of the template, which explains the scaling $\mu = |P| / |P^{\mathrm{te}}|$ used in the WNLL (Eq. (5)).

2.2 DNNs with the Graph Interpolating Function as Output Activation

A straightforward approach is to replace the softmax function with the WNLL in Fig. 1. However, backpropagation is difficult in this case. To resolve this, we consider a new DNN architecture as shown in Fig. 2 which will be discussed in detail in Section 3.

3 Algorithms

In this section, we present training and inference algorithms for DNNs with the WNLL as the output activation in both natural and robust fashions. Natural training means solving the ERM problem (Eq. (2)) on the training dataset, and robust training means training an adversarially robust deep network by solving the EARM problem (Eq. (1)). We also adapt DNNs with the WNLL interpolating output activation to semi-supervised learning.

3.1 Natural Training and Inference

We abstract the natural training and testing procedures of DNNs with the WNLL activation in Fig. 2 (a) and (b), respectively. As a prerequisite of the WNLL interpolation, we need to reserve a small portion of data-label pairs, denoted as $(X^{\mathrm{te}}, Y^{\mathrm{te}})$, to interpolate labels for the unlabeled data in both the training and testing procedures of DNNs with the WNLL activation. We call $(X^{\mathrm{te}}, Y^{\mathrm{te}})$ the preserved template. Directly replacing the softmax by the WNLL in the architecture shown in Fig. 1 (a) causes difficulties in backpropagation, namely, the gradient is difficult to compute since the WNLL defines a very complex implicit function. Instead, to train DNNs with the WNLL as the output activation, we propose a proxy via an auxiliary neural net (Fig. 2 (a)). On top of the original DNN, we add a buffer block (a fully connected layer followed by a ReLU), followed by two parallel branches: the WNLL and a linear (fully connected) layer. We train the auxiliary DNN by alternating between two steps: training the DNN with the linear activation and with the WNLL activation, respectively. In the following, we denote a DNN with the WNLL activation as DNN-WNLL; e.g., we denote ResNet20 with the WNLL activation as ResNet20-WNLL.
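To make the data flow concrete, the following is a minimal PyTorch sketch of this auxiliary architecture. The module names, `feat_dim`, and the reuse of the `wnll_interpolate` routine sketched in Section 2.1 are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DNNWNLL(nn.Module):
    """Backbone DNN block, a buffer block (FC + ReLU), and two parallel output
    branches as in Fig. 2 (a): a linear classifier (left) and the WNLL
    interpolation (right). Names and sizes are illustrative."""

    def __init__(self, backbone, feat_dim=64, num_classes=10):
        super().__init__()
        self.backbone = backbone                        # DNN block, e.g. a ResNet without its final FC layer
        self.buffer = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.linear = nn.Linear(feat_dim, num_classes)  # left branch

    def features(self, x):
        return self.buffer(self.backbone(x))

    def forward_linear(self, x):
        # Stage 1: train all weights through the linear branch.
        return self.linear(self.features(x))

    def forward_wnll(self, x, x_template, y_template_onehot):
        # Stage 2: interpolate labels with the WNLL on the deep features;
        # wnll_interpolate is the routine sketched in Section 2.1.
        feats = torch.cat([self.features(x_template), self.features(x)], dim=0)
        soft_labels = wnll_interpolate(
            feats.detach().cpu().numpy(),
            labeled_idx=list(range(len(x_template))),
            labels_onehot=y_template_onehot.cpu().numpy(),
        )
        return torch.as_tensor(soft_labels, dtype=torch.float32)
```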

Train DNN-WNLL with the linear activation: Run $N_1$ steps of the following forward and backward propagation, where in the $k$-th iteration we have:

  • Forward propagation: Transform the training data $X$ by the DNN, Buffer, and Linear blocks into the predicted labels $\hat{Y}$:

    $$\hat{Y} = \mathrm{Linear}\Big(\mathrm{Buffer}\big(\mathrm{DNN}(X, \Theta), W^{B}\big), W^{L}\Big).$$

    Then compute the loss between the ground-truth labels $Y$ and the predicted ones $\hat{Y}$; denote this loss as $\mathcal{L}_{\mathrm{linear}}$.

  • Backpropagation: Update $(\Theta, W^{B}, W^{L})$ by stochastic gradient descent.

Train DNN-WNLL with the WNLL activation: Run $N_2$ steps of the following forward and backward propagation, where in the $k$-th iteration we have:

  • Forward propagation: The training data $X$ and the template $(X^{\mathrm{te}}, Y^{\mathrm{te}})$ are transformed by the DNN, Buffer, and WNLL blocks to get the predicted labels $\hat{Y}$:

    $$\tilde{X} = \mathrm{Buffer}\big(\mathrm{DNN}(X, \Theta), W^{B}\big), \quad \tilde{X}^{\mathrm{te}} = \mathrm{Buffer}\big(\mathrm{DNN}(X^{\mathrm{te}}, \Theta), W^{B}\big), \quad \hat{Y} = \mathrm{WNLL}\big(\tilde{X}, \tilde{X}^{\mathrm{te}}, Y^{\mathrm{te}}\big).$$

    Then compute the loss $\mathcal{L}_{\mathrm{WNLL}}$ between the ground-truth labels $Y$ and the predicted ones $\hat{Y}$.

  • Backpropagation: Update the buffer weights $W^{B}$ only by stochastic gradient descent; $\Theta$ and $W^{L}$ will be tuned in the next round of training DNN-WNLL with the linear activation.

We use the computational graph of the left branch (the linear layer) to compute approximate gradients for the DNN with the WNLL activation. For a given loss value $\mathcal{L}_{\mathrm{WNLL}}$, we adopt the approximation

$$\frac{\partial \mathcal{L}_{\mathrm{WNLL}}}{\partial W^{B}} \approx \frac{\partial \mathcal{L}_{\mathrm{linear}}}{\partial W^{B}},$$

where the right-hand side is evaluated at the loss value $\mathcal{L}_{\mathrm{linear}} = \mathcal{L}_{\mathrm{WNLL}}$. The heuristic behind this approximation is the following: the WNLL implicitly defines a harmonic function, and the linear function is the simplest nontrivial explicit harmonic function. Empirically, we observe that this simple approximation works well in training the deep network. We freeze the weights of the DNN block in this stage mainly for stability reasons.
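One way to realize this linearization in an automatic differentiation framework is sketched below: the WNLL branch is evaluated without gradients, and its loss value is carried through the linear branch's computational graph by a simple rescaling. This is an illustrative realization under the notation above, not necessarily the authors' exact implementation; `model` refers to the `DNNWNLL` sketch in Section 2.2.

```python
import torch
import torch.nn.functional as F

def wnll_stage_step(model, optimizer, x, y, x_te, y_te_onehot):
    """One training step of the WNLL stage. `optimizer` should hold only the
    buffer block's parameters, since the DNN block is frozen in this stage."""
    # 1. Non-differentiable forward pass through the WNLL branch.
    with torch.no_grad():
        probs = model.forward_wnll(x, x_te, y_te_onehot)              # interpolated soft labels
        wnll_loss = F.nll_loss(torch.log(probs.clamp_min(1e-12)), y)

    # 2. Differentiable forward pass through the linear branch.
    linear_loss = F.cross_entropy(model.forward_linear(x), y)

    # 3. Surrogate whose value equals the WNLL loss but whose gradient flows
    #    through the linear branch (the linearization described in the text).
    surrogate = linear_loss * (wnll_loss / linear_loss.clamp_min(1e-12)).detach()

    optimizer.zero_grad()
    surrogate.backward()
    optimizer.step()
```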

The above alternating scheme is a greedy algorithm. During training, the WNLL activation plays two roles. On the one hand, alternating between the linear and the WNLL activation benefits both branches, enabling the neural net to learn features that are appropriate both for linear classification and for the WNLL based manifold interpolation. On the other hand, when we lack sufficient training data, the training of DNNs usually gets stuck at bad local minima that do not generalize well to new data; the WNLL interpolation perturbs those learned sub-optimal weights and helps the optimization arrive at a local minimum with better generalizability. At inference (test) time, we remove the linear classifier from the neural net and use the DNN block together with the WNLL to predict new data (Fig. 2 (b)). The reason for using the WNLL instead of the linear layer is simply that the WNLL interpolation is superior to the linear classifier, and this superiority is preserved when applied to deep features (which will be confirmed in Section 4). Moreover, the WNLL interpolation utilizes both the deep features and the reserved template at test time to guide the classifier and to enhance adversarial robustness in classification.

We summarize the training and testing of DNN-WNLL in Algorithms 1 and 2, respectively. In each round of the alternating procedure, i.e., each outer loop in Algorithm 1, the entire training dataset is first used to train DNN-WNLL with the linear activation. We then randomly separate a template, e.g., half of the training set, which is used to perform the WNLL interpolation when training DNN-WNLL with the WNLL activation. In practice, for both training and testing, we use mini-batches for both the template and the interpolated points when the entire dataset is too large. The final predicted labels are based on a majority vote across the interpolation results from all the template mini-batches.

1: Input: Training set of (data, label) pairs $(X, Y)$; the number of alternating steps $N$; the numbers of epochs for training the DNN with the linear and with the WNLL activation.
2: Output: An optimized DNN with the WNLL activation, denoted as DNN-WNLL.
3: for $k = 1, 2, \dots, N$ do
4:     //Train the left branch: DNN with the linear activation.
5:     Train the DNN + Buffer + Linear blocks, and denote the learned model as DNN-Linear.
6:     //Train the right branch: DNN with the WNLL activation.
7:     Split $(X, Y)$ into training data and template, i.e., $(X, Y) = (X^{\mathrm{tr}}, Y^{\mathrm{tr}}) \cup (X^{\mathrm{te}}, Y^{\mathrm{te}})$.
8:     Partition the training data $(X^{\mathrm{tr}}, Y^{\mathrm{tr}})$ into mini-batches.
9:     for each mini-batch do
10:         Transform the mini-batch and the template into deep features by the DNN and Buffer blocks of DNN-Linear.
11:         Apply the WNLL (Eq. (5)) on these deep features to interpolate the labels $\hat{Y}$.
12:         Backpropagate the error between $\hat{Y}$ and $Y$ via the linearization of Section 3.1 to update $W^{B}$ only.
Algorithm 1 DNN with the WNLL Output Activation: Training Procedure.

In Algorithm 1, the WNLL interpolation is also performed in a mini-batch manner (as shown in the inner iteration). Based on our experiments, this has only a very small negative influence on the interpolation accuracy.

1: Input: Testing data $X$, template $(X^{\mathrm{te}}, Y^{\mathrm{te}})$, and the optimized model DNN-WNLL.
2: Output: Predicted labels $\hat{Y}$ for $X$.
3: Apply the DNN block of DNN-WNLL to $X$ and $X^{\mathrm{te}}$ to get the deep features $\tilde{X}$ and $\tilde{X}^{\mathrm{te}}$.
4: Apply the WNLL interpolation (Eq. (5)) on $(\tilde{X}, \tilde{X}^{\mathrm{te}}, Y^{\mathrm{te}})$ to interpolate the labels $\hat{Y}$.
Algorithm 2 DNN with the WNLL Output Activation: Testing Procedure.

3.2 Adversarial Training

Adversarial training is one of the most generic frameworks for adversarial defense. The key idea of adversarial training is to augment the training data with adversarial versions which can be obtained by applying adversarial attacks to the clean data. In the following, we adopt the minimax formalism of the adversarial training proposed by Madry et al. (2018).

3.2.1 Adversarial Attacks

We consider three benchmark attacks: the fast gradient sign method (FGSM) and the iterative fast gradient sign method (IFGSM) in the $\ell_\infty$-norm (Goodfellow et al., 2014), and the Carlini and Wagner (2016) attack in the $\ell_2$-norm (C&W). We denote the classifier defined by a specific DNN as $f(\cdot, \theta)$ and consider a given instance $(\mathbf{x}, y)$. FGSM searches for the adversarial image $\mathbf{x}'$ by maximizing the loss $\mathcal{L}(\mathbf{x}', y)$ subject to the maximum allowed perturbation $\|\mathbf{x}' - \mathbf{x}\|_\infty \le \epsilon$. We can approximately solve this constrained optimization problem by linearizing the objective function, i.e.,

$$\mathcal{L}(\mathbf{x}', y) \approx \mathcal{L}(\mathbf{x}, y) + \nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y)^{\top} (\mathbf{x}' - \mathbf{x}).$$

Under this linear approximation, the optimal adversarial image is

$$\mathbf{x}' = \mathbf{x} + \epsilon \cdot \mathrm{sign}\big(\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}, y)\big). \tag{14}$$

IFGSM iterates FGSM to generate enhanced adversarial images, where the iteration proceeds as follows:

$$\mathbf{x}^{(m)} = \mathbf{x}^{(m-1)} + \alpha \cdot \mathrm{sign}\big(\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}^{(m-1)}, y)\big), \tag{15}$$

where $\mathbf{x}^{(0)} = \mathbf{x}$, $m = 1, 2, \dots, M$, and $\alpha$ is the step size. Moreover, let the adversarial image be $\mathbf{x}' = \mathbf{x}^{(M)}$, with $M$ being the number of iterations. To ensure that the maximum perturbation to the clean image is no larger than $\epsilon$, in each iteration we clip the intermediate adversarial images, which results in the following attack scheme

$$\mathbf{x}^{(m)} = \mathrm{Clip}_{\mathbf{x}, \epsilon}\Big\{\mathbf{x}^{(m-1)} + \alpha \cdot \mathrm{sign}\big(\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}^{(m-1)}, y)\big)\Big\}, \tag{16}$$

where $\mathrm{Clip}_{\mathbf{x}, \epsilon}$ limits the change of the generated adversarial image in each iteration, and it is defined as

$$\mathrm{Clip}_{\mathbf{x}, \epsilon}\{\mathbf{x}'\} = \min\big\{1,\ \mathbf{x} + \epsilon,\ \max\{0,\ \mathbf{x} - \epsilon,\ \mathbf{x}'\}\big\},$$

where we assume the pixel values of the image are normalized to $[0, 1]$.
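As a concrete reference, below is a minimal PyTorch sketch of FGSM (Eq. (14)) and of IFGSM with clipping (Eq. (16)); `model` maps images to logits and `loss_fn` is, e.g., the cross-entropy loss, both illustrative names.

```python
import torch

def fgsm(model, loss_fn, x, y, eps):
    """FGSM (Eq. (14)): one signed-gradient step of size eps."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    grad = torch.autograd.grad(loss, x)[0]
    return (x + eps * grad.sign()).clamp(0, 1).detach()

def ifgsm(model, loss_fn, x, y, eps, alpha, num_iter):
    """IFGSM (Eq. (16)): iterate small signed-gradient steps and clip to the
    eps-ball around the clean image and to the valid pixel range [0, 1]."""
    x_adv = x.clone().detach()
    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps)   # project to the eps-ball
        x_adv = x_adv.clamp(0, 1).detach()                      # keep valid pixel values
    return x_adv
```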

Both FGSM and IFGSM belong to the fixed-perturbation attacks. In addition, we consider a zero-confidence attack proposed by Carlini and Wagner. For a given image-label pair $(\mathbf{x}, y)$ and for any target label $t \ne y$, the C&W attack searches for the adversarial image that will be classified as class $t$ with the minimum perturbation by solving the following optimization problem

$$\min_{\boldsymbol{\delta}} \ \|\boldsymbol{\delta}\|_2^2 \tag{17}$$

subject to

$$f(\mathbf{x} + \boldsymbol{\delta}, \theta) = t, \qquad \mathbf{x} + \boldsymbol{\delta} \in [0, 1]^d,$$

where $\boldsymbol{\delta}$ is the adversarial perturbation (for the sake of simplicity, we ignore the dependence of $\boldsymbol{\delta}$ on $\mathbf{x}$ and $t$). The equality constraint in Eq. (17) is hard to handle, so Carlini and Wagner considered the following surrogate constraint

$$g(\mathbf{x} + \boldsymbol{\delta}) := \max\Big(\max_{c \ne t} Z(\mathbf{x} + \boldsymbol{\delta})_c - Z(\mathbf{x} + \boldsymbol{\delta})_t,\ 0\Big) = 0, \tag{18}$$

where $Z(\mathbf{x})$ is the logit vector for an input $\mathbf{x}$, i.e., the output of the neural net before the output activation, and $Z(\mathbf{x})_c$ is the logit value corresponding to class $c$. It is easy to see that $g(\mathbf{x} + \boldsymbol{\delta}) = 0$ is equivalent to the adversarial image being classified as the target class $t$. Therefore, the problem in Eq. (17) can be reformulated as

$$\min_{\boldsymbol{\delta}} \ \|\boldsymbol{\delta}\|_2^2 + c \cdot g(\mathbf{x} + \boldsymbol{\delta}) \tag{19}$$

subject to

$$\mathbf{x} + \boldsymbol{\delta} \in [0, 1]^d,$$

where $c > 0$ is the Lagrangian multiplier.

By letting $\mathbf{x} + \boldsymbol{\delta} = \frac{1}{2}\big(\tanh(\mathbf{w}) + 1\big)$, Eq. (19) can be written as an unconstrained optimization problem in the new variable $\mathbf{w}$. Moreover, Carlini and Wagner introduced a confidence parameter $\kappa$ into the above formulation. In a nutshell, the C&W attack seeks the adversarial image by solving the following problem

$$\min_{\mathbf{w}} \ \Big\|\tfrac{1}{2}\big(\tanh(\mathbf{w}) + 1\big) - \mathbf{x}\Big\|_2^2 + c \cdot \max\Big(\max_{c' \ne t} Z\big(\tfrac{1}{2}(\tanh(\mathbf{w}) + 1)\big)_{c'} - Z\big(\tfrac{1}{2}(\tanh(\mathbf{w}) + 1)\big)_{t},\ -\kappa\Big). \tag{20}$$

The Adam optimizer (Kingma and Ba, 2014) can solve the above unconstrained optimization problem, Eq. (20), efficiently. All three attacks clip each pixel of the adversarial image to lie in $[0, 1]$.
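A minimal PyTorch sketch of this C&W formulation is given below; the hyperparameter values are illustrative placeholders, and released C&W implementations include additional tricks (e.g., binary search over the multiplier $c$) that are not shown here.

```python
import torch

def cw_l2_attack(model, x, target, c=0.05, kappa=0.0, steps=100, lr=0.1):
    """Sketch of the C&W L2 attack (Eq. (20)) with the tanh change of variables.
    `model` returns logits; `target` holds the target class indices."""
    x = x.clamp(1e-6, 1 - 1e-6)
    w = torch.atanh(2 * x - 1).clone().detach().requires_grad_(True)  # inverse of the tanh map
    optimizer = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        x_adv = 0.5 * (torch.tanh(w) + 1)                  # always a valid image in [0, 1]
        logits = model(x_adv)
        target_logit = logits.gather(1, target.view(-1, 1)).squeeze(1)
        other_logit = logits.scatter(1, target.view(-1, 1), float('-inf')).max(dim=1).values
        hinge = torch.clamp(other_logit - target_logit, min=-kappa)   # surrogate term (Eq. (18))
        loss = ((x_adv - x) ** 2).flatten(1).sum(dim=1) + c * hinge   # Eq. (20), per example
        optimizer.zero_grad()
        loss.sum().backward()
        optimizer.step()

    return (0.5 * (torch.tanh(w) + 1)).detach()
```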

The only difficulty in extending the above three adversarial attacks to DNN-WNLL is again computing the gradient in backpropagation. Similar to the training of DNN-WNLL, we compute a surrogate gradient by linearizing the WNLL activation. For a given mini-batch of test image-label pairs $(X, Y)$ and the template $(X^{\mathrm{te}}, Y^{\mathrm{te}})$, we denote the DNN-WNLL as $\mathrm{WNLL}(\tilde{X}, \tilde{X}^{\mathrm{te}}, Y^{\mathrm{te}})$, where $\tilde{X}$ and $\tilde{X}^{\mathrm{te}}$ are the outputs of the composition of the DNN and buffer blocks as shown in Fig. 2 (a). By ignoring the dependence of the loss function on the parameters, the loss function for DNN-WNLL can be written as $\mathcal{L}(X, Y, X^{\mathrm{te}}, Y^{\mathrm{te}})$. The above three attacks for DNN-WNLL are summarized below.

  • FGSM:

    $$X' = X + \epsilon \cdot \mathrm{sign}\Big(\nabla_{X} \mathcal{L}\big(X, Y, X^{\mathrm{te}}, Y^{\mathrm{te}}\big)\Big). \tag{21}$$

  • IFGSM:

    $$X^{(m)} = \mathrm{Clip}_{X, \epsilon}\Big\{X^{(m-1)} + \alpha \cdot \mathrm{sign}\Big(\nabla_{X} \mathcal{L}\big(X^{(m-1)}, Y, X^{\mathrm{te}}, Y^{\mathrm{te}}\big)\Big)\Big\}, \tag{22}$$

    where $X^{(0)} = X$, $m = 1, 2, \dots, M$, and the adversarial images are $X' = X^{(M)}$.

  • C&W:

    $$\min_{\mathbf{w}} \ \Big\|\tfrac{1}{2}\big(\tanh(\mathbf{w}) + 1\big) - X\Big\|_2^2 + c \cdot \max\Big(\max_{c' \ne t} Z\big(\tfrac{1}{2}(\tanh(\mathbf{w}) + 1)\big)_{c'} - Z\big(\tfrac{1}{2}(\tanh(\mathbf{w}) + 1)\big)_{t},\ -\kappa\Big), \tag{23}$$

    where $Z(\cdot)$ are the logit values of the input images and $t$ are the target labels.

In the above attacks, the gradient $\nabla_{X} \mathcal{L}$ is required to generate the adversarial images. In DNN-WNLL, this gradient is difficult to compute. As shown in Fig. 2 (b), we approximate it in the following way:

$$\nabla_{X} \mathcal{L}_{\mathrm{WNLL}}\big(X, Y, X^{\mathrm{te}}, Y^{\mathrm{te}}\big) \approx \nabla_{X} \mathcal{L}_{\mathrm{linear}}(X, Y); \tag{24}$$

again, in the above approximation, we set the value of $\mathcal{L}_{\mathrm{linear}}$ to that of $\mathcal{L}_{\mathrm{WNLL}}$.

Based on our numerical experiments, the batch size of the template has a negligible influence on the adversarial attack and defense. In all of our experiments, we choose the same size for both the mini-batches and the template.

3.2.2 Adversarial Training

We apply the projected gradient descent (PGD) adversarial training (Madry et al., 2018) to train the adversarially robust DNNs, where we approximately solve the EARM (Eq. (1)) by using the PGD adversarial images, i.e., IFGSM attacks with an initial random perturbation on the clean images, to approximate the solution of the inner maximization problem. We summarize the PGD adversarial training for DNNs with the WNLL activation, as shown in Fig. 2 (a), in Algorithm 3.

1: Input: Training set of (data, label) pairs $(X, Y)$; the number of PGD iterations $M$, the PGD attack step size $\alpha$, and the maximum PGD perturbation $\epsilon$; the number of alternating steps $N$; and the numbers of epochs used to train the DNN with the linear activation and with the WNLL activation.
2: Output: An optimized DNN-WNLL.
3: for $k = 1, 2, \dots, N$ do
4:     //PGD adversarial training of the left branch: DNN with the linear activation.
5:     Train the DNN + Buffer + Linear blocks.
6:     Partition the training data into mini-batches.
7:     for each epoch do
8:         for each mini-batch do
9:              //Attack the input images by the PGD attack.
10:              Initialize $X^{(0)} = X + \boldsymbol{\delta}$, with $\boldsymbol{\delta}$ a uniform random perturbation in $[-\epsilon, \epsilon]$.
11:              for $m = 1, 2, \dots, M$ do
12:                  Attack $X^{(m-1)}$ according to Eq. (16).
13:              Backpropagate the classification error of the adversarial images.
14:     //PGD adversarial training of the right branch: DNN with the WNLL activation.
15:     Split $(X, Y)$ into training data and template, i.e., $(X, Y) = (X^{\mathrm{tr}}, Y^{\mathrm{tr}}) \cup (X^{\mathrm{te}}, Y^{\mathrm{te}})$.
16:     Partition the training data into mini-batches.
17:     for each epoch do
18:         for each mini-batch do
19:              //Attack the input training images by the PGD attack.
20:              Initialize $X^{(0)} = X + \boldsymbol{\delta}$, with $\boldsymbol{\delta}$ a uniform random perturbation in $[-\epsilon, \epsilon]$.
21:              for $m = 1, 2, \dots, M$ do
22:                  Attack $X^{(m-1)}$ according to Eq. (22).
23:              Backpropagate the classification error of the adversarial images.
Algorithm 3 DNN with the WNLL Output Activation: PGD Adversarial Training
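As a concrete illustration of the linear-activation stage of Algorithm 3, the following PyTorch sketch crafts PGD adversarial mini-batches on the fly and backpropagates their classification error; the default values of `eps`, `alpha`, and `num_iter` are common choices shown only for illustration, and `model` stands for the linear branch of DNN-WNLL (or any softmax/linear classifier).

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, alpha, num_iter):
    """PGD: IFGSM (Eq. (16)) started from a uniformly random point in the eps-ball."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(num_iter):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1).detach()
    return x_adv

def pgd_adversarial_training_epoch(model, loader, optimizer,
                                   eps=8/255, alpha=2/255, num_iter=10):
    """One epoch of PGD adversarial training for the linear-activation stage."""
    model.train()
    for x, y in loader:
        x_adv = pgd_attack(model, x, y, eps, alpha, num_iter)
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```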

3.3 Semi-supervised Learning

Semi-supervised learning is another fundamental learning paradigm, in which we have access to a large amount of training data but most of it is unlabeled. Semi-supervised learning is of particular importance in, e.g., medical applications (Chapelle et al., 2006). It is straightforward to extend DNNs with the WNLL activation to semi-supervised learning. Let the labeled and unlabeled training data be $(X^{l}, Y^{l})$ and $X^{u}$, respectively. There are two approaches to semi-supervised learning with DNN-WNLL.

  • Approach I: Train DNN-WNLL on only labeled data . During testing, we feed the unlabeled data together with the labeled template data to predict labels for the testing data. This is essentially similar to the classical graph Laplacian-based semi-supervised learning on the deep learning features.

  • Approach II: Train DNNs with the WNLL activation by using both labeled and unlabeled data. During training, we use both labeled and unlabeled data to build a graph for WNLL interpolation, and then we backpropagate loss between predicted and true labels of the labeled data. The testing phase is the same as that in Approach I.

In this work, we focus on Approach I.

4 Numerical Results

In this section, we numerically verify the accuracy and robustness of DNN-WNLL. Moreover, we show that DNN-WNLL is suitable for data-efficient learning, and we provide results of semi-supervised learning using DNN-WNLL. We implement our algorithm on the PyTorch platform (Paszke et al., 2017). All the computations are carried out on a machine with a single Nvidia Titan Xp graphics card.

To validate the classification accuracy, efficiency, and robustness of the proposed framework, we test the new architecture and algorithm on the CIFAR10, CIFAR100 (Krizhevsky, 2009), MNIST (LeCun, 1998) and SVHN datasets (Netzer et al., 2011). In all the experiments below, we apply the standard data augmentation that is used for the CIFAR datasets (He et al., 2016c; Huang et al., 2017; Zagoruyko and Komodakis, 2016). For MNIST and SVHN, we use the raw data without any data augmentation.

Before diving into the performance of DNNs with different output activation functions, we first compare the performance of the WNLL with the softmax on the raw input images for various datasets. The training sets are used to train the softmax classifier and to interpolate labels for the test set in the WNLL interpolation, respectively. Table 2 lists the classification accuracies of the WNLL and the softmax on three datasets. For the WNLL interpolation, we only use the top $K$ nearest neighbors to ensure sparsity of the weight matrix and to speed up the computation, and the distance to the $\bar{K}$-th nearest neighbor is used to normalize the weight matrix. The WNLL outperforms the softmax remarkably on all three benchmark tasks, especially for the MNIST (test accuracy: 97.74% v.s. 92.65%) and SVHN (test accuracy: 56.17% v.s. 24.66%) classification. These results indicate the potential benefit of using the WNLL instead of the softmax as the output activation in DNNs.

        Dataset         CIFAR10         MNIST         SVHN
softmax 39.91% 92.65% 24.66%
WNLL 40.73% 97.74% 56.17%
Table 2: Accuracies of the softmax and the WNLL classifiers in classifying some benchmark datasets.

For natural training of DNN-WNLL: We take two passes of the alternating step, i.e., we set $N = 2$ in Algorithm 1. In each pass, we first train the network with the linear activation (Stage 1) for 400 epochs with stochastic gradient descent, and then train with the WNLL activation (Stage 2) for a few epochs. In the first pass, the initial learning rate is halved on a fixed epoch schedule in training the DNN with the linear activation, and a small fixed learning rate is used to train the DNN with the WNLL activation. The same Nesterov momentum and weight decay as those used in (He et al., 2016c; Huang et al., 2016) are employed for the CIFAR and the SVHN experiments, respectively. In the second pass, the learning rate is set to one-fifth of that used in the corresponding epochs of the first pass. A larger batch size is used when training the WNLL-activated DNNs than when training the softmax/linear-activated ones, to ensure the quality of the WNLL interpolation. For a fair comparison, we train the vanilla DNNs with the softmax output activation for the same total number of epochs with the same optimizer used for the WNLL-activated ones.

4.1 Data Efficient Learning – Small Training Data Case

When we do not have a sufficient amount of labeled training data to train a high capacity deep network, the generalization accuracy of the trained model typically decays as the network goes deeper. We illustrate this in Fig. 3. The WNLL activated DNNs, with their superior regularization power and their ability to perturb the learned weights away from bad local minima, can overcome this generalization degradation. The two panels of Fig. 3 plot the results of DNNs with the softmax and the WNLL activation trained on 1K and 10K images, respectively. These results show that the generalization error rate decays consistently as the network goes deeper in DNN-WNLL. Moreover, the generalization accuracy of the vanilla and the WNLL activated DNNs can differ by several percentage points within our testing regime.

Figure 3: Plots of test errors when 1K (a) and 10K (b) training data are used to train the vanilla and the WNLL activated DNNs. In each plot, we test three different deep networks: PreActResNet18, PreActResNet34, and PreActResNet50. All tests are done on the CIFAR10 dataset.

Figure 4 plots the evolution of the generalization accuracy during training; we compute the test accuracy once per epoch. Panels (a) and (b) plot the test accuracies of ResNet50 with the softmax and the WNLL activation, respectively (epochs 1-400 and 406-805 correspond to the linear activation), when only the first 1K instances of the CIFAR10 training set are used to train the models. Charts (c) and (d) are the corresponding plots with 10K training instances, using a pre-activated ResNet50. After a few hundred epochs, the accuracies of the vanilla DNNs plateau and cannot improve anymore. However, the test accuracy for WNLL jumps at the beginning of Stage 2 in the first pass; during Stage 1 of the second pass, even though there is an initial accuracy reduction, the accuracy continues to climb and eventually surpasses that of the WNLL activation in Stage 2 of the first pass. The jumps in accuracy around epochs 400 and 800 are due to switching from the linear activation to the WNLL for predictions on the test set. The initial decay when alternating back to the softmax is caused partially by the final layer not yet being tuned with respect to the deep features, and partially by predictions on the test set being made by the softmax instead of the WNLL. Nevertheless, the perturbation via the WNLL activation quickly results in the accuracy increasing beyond that of the linear stage in the previous pass.

Figure 4: Evolution of the generalization accuracy over the training procedure. Charts (a) and (b) plot the accuracy evolution of ResNet50 with the softmax and the WNLL activation trained with 1K training data, respectively. Panels (c) and (d) correspond to the case of 10K training data for PreActResNet50. All tests are done on the CIFAR10 dataset.

4.2 Generalization of Naturally Trained DNN-WNLL

We next show the superiority of DNN-WNLL in terms of generalization accuracy compared to its counterparts with the softmax or SVM output activation functions. Besides ResNets, we also test the WNLL surrogate on VGG networks. In Table 3, we list the generalization errors of different DNNs from the VGG, ResNet, and pre-activated ResNet families trained on the entire, the first 10K, and the first 1K instances of the CIFAR10 training set. We observe that the WNLL, in general, improves more for ResNets and pre-activated ResNets, with smaller but still remarkable improvements for the VGGs. Except for the VGGs, we achieve a relative test error reduction of roughly 20% to 30% across all neural nets. All results presented here and in the rest of this paper are the median of 5 independent trials. We also compare with the SVM as an alternative output activation and observe that its performance is still inferior to that of DNN-WNLL. Note that the bigger batch size in the WNLL stage is used to ensure the interpolation quality of the WNLL. A reasonable concern is that the performance increase comes from the variance reduction due to the increased batch size; however, training the vanilla networks with the larger batch size deteriorates their test accuracy.

Network Whole 10000 1000
Vanilla WNLL SVM Vanilla WNLL Vanilla WNLL
VGG11 9.23% 7.35% 9.28% 10.37% 8.88% 26.75% 24.10%
VGG13 6.66% 5.58% 7.47% 9.12% 7.64% 24.85% 22.56%
VGG16 6.72% 5.69% 7.29% 9.01% 7.54% 25.41% 22.23%
VGG19 6.95% 5.92% 7.99% 9.62% 8.09% 25.70% 22.87%
ResNet20 9.06% 7.09% 9.60% 12.83% 9.96% 34.90% 29.91%
ResNet32 7.99% 5.95% 8.73% 11.18% 8.15% 33.41% 28.78%
ResNet44 7.31% 5.70% 8.67% 10.66% 7.96% 34.58% 27.94%
ResNet56 7.24% 5.61% 8.58% 9.83% 7.61% 37.83% 28.18%
ResNet110 6.41% 4.98% 8.06% 8.91% 7.13% 42.94% 28.29%
ResNet18 6.16% 4.65% 6.00% 8.26% 6.29% 27.02% 22.48%
ResNet34 5.93% 4.26% 6.32% 8.31% 6.11% 26.47% 20.27%
ResNet50 6.24% 4.17% 6.63% 9.64% 6.49% 29.69% 20.19%
PreActResNet18 6.21% 4.74% 6.38% 8.20% 6.61% 27.36% 21.88%
PreActResNet34 6.08% 4.40% 5.88% 8.52% 6.34% 23.56% 19.02%
PreActResNet50 6.05% 4.27% 5.91% 9.18% 6.05% 25.05% 18.61%
Table 3: Test errors of the vanilla DNNs, SVM and WNLL activated ones trained on the entire, the first K, and the first K instances of the training set of the CIFAR10 dataset. (Median of 5 independent trials)

We list the error rates of the different DNNs with either the softmax or the WNLL activation on the CIFAR10 and CIFAR100 in Tables 3 and 4, respectively. On the CIFAR10, DNN-WNLL outperforms the vanilla models with around a 1.5% to 2.0% absolute, or 20% to 30% relative, error rate reduction. The improvements on the CIFAR100 by using the WNLL activation are even more remarkable than those on the CIFAR10. We independently ran the vanilla DNNs on both datasets, and our results are consistent with the original reports and other researchers' reproductions (He et al., 2016c, a; Huang et al., 2017). We provide experimental results of the DNNs' performance on the SVHN data in Table 5. Interestingly, the improvement is more significant on more challenging tasks, which suggests the potential of our method to succeed on other tasks/datasets.

Network Vanilla DNNs WNLL DNNs
VGG11 32.68% 28.80%
VGG13 29.03% 25.21%
VGG16 28.59% 25.72%
VGG19 28.55% 25.07%
ResNet20 35.79% 31.53%
ResNet32 32.01% 28.04%
ResNet44 31.07% 26.32%
ResNet56 30.03% 25.36%
ResNet110 28.86% 23.74%
ResNet18 27.57% 22.89%
ResNet34 25.55% 20.78%
ResNet50 25.09% 20.45%
PreActResNet18 28.62% 23.45%
PreActResNet34 26.84% 21.97%
PreActResNet50 25.95% 21.51%
Table 4: Test errors of the vanilla DNNs v.s. the WNLL activated DNNs on the CIFAR100 dataset. (Median of 5 independent trials)
Network Vanilla DNNs WNLL DNNs
ResNet20 3.76% 3.44%
ResNet32 3.28% 2.96%
ResNet44 2.84% 2.56%
ResNet56 2.64% 2.32%
ResNet110 2.55% 2.26%
ResNet18 3.96% 3.65%
ResNet34 3.81% 3.54%
PreActResNet18 4.03% 3.70%
PreActResNet34 3.66% 3.32%
Table 5: Test errors of the vanilla DNNs v.s. the WNLL activated DNNs on the SVHN dataset. (Median of 5 independent trials)

4.3 Adversarial Robustness

We carry out experiments on the benchmark MNIST and CIFAR10 datasets to show the efficiency of using the graph interpolating activation for adversarial defense. For MNIST, we train the Small-CNN used in (Zhang et al., 2019) by PGD adversarial training, with the learning rate decayed by a fixed factor once during training. For CIFAR10, we consider three benchmark models: ResNet20, ResNet56, and WideResNet34. We train these models on the CIFAR10 dataset by PGD adversarial training, with the initial learning rate decayed by a fixed factor at three scheduled epochs. The PGD hyperparameters (number of iterations, step size, and maximum perturbation) are set separately for each dataset. After the robust models have been trained by the PGD adversarial training, we test their natural accuracies on the clean images and their robust accuracies on the adversarial images crafted by attacking these robustly trained models with the aforementioned three adversarial attacks, whose parameters are set as follows.

  • FGSM: In Eqs. (14) and (21), the maximum perturbation $\epsilon$ is set separately for CIFAR10 and MNIST classification.

  • IFGSM: We denote the $M$-step IFGSM attack as IFGSM$^{M}$. For each dataset, we evaluate two IFGSM attacks with different numbers of steps, with the step size $\alpha$ and the maximum perturbation $\epsilon$ in Eqs. (16) and (22) set per dataset.

  • C&W: For both the CIFAR10 and MNIST datasets, we select the Lagrangian multiplier $c$ and the confidence $\kappa$ in Eqs. (20) and (23) per dataset and run a fixed number of iterations of the Adam optimizer to find the optimal C&W attack in the $\ell_2$-norm on the clean images.

We consider both white-box and black-box attacks. In the black-box attack, we apply the given adversarial attack to attack another oracle model in the white-box fashion, and then we use the target model to classify the adversarial images crafted by attacking the oracle model.

Model Natural FGSM IFGSM IFGSM C&W
Small-CNN 99.33% 98.17% 96.27% 96.09% 95.31%
Small-CNN-WNLL 99.39% 98.35% 97.36% 96.90% 97.55%
Table 6: Natural and robust accuracies under different white-box adversarial attacks of different robustly trained models on the MNIST dataset.
Model Oracle (FGSM) (IFGSM) (IFGSM) (C&W)
Small-CNN-WNLL Small-CNN 98.40% 97.47% 97.40% 98.14%
Table 7: Robust accuracies under different black-box adversarial attacks of different robustly trained models on the MNIST dataset.

Table 6 lists both the natural and robust accuracies of the PGD adversarially trained Small-CNN with either the softmax or the WNLL output activation function on the MNIST. Small-CNN with the WNLL activation is remarkably more accurate on both clean and adversarial images: the natural accuracies for the softmax and the WNLL activation functions are 99.33% and 99.39%, respectively, and the robust accuracies for Small-CNN and Small-CNN-WNLL are 98.17% v.s. 98.35%, 96.27% v.s. 97.36%, 96.09% v.s. 96.90%, and 95.31% v.s. 97.55%, respectively, under the FGSM, the two IFGSM, and the C&W attacks in the white-box scenario. We regard Small-CNN as the oracle model to perform black-box attacks on Small-CNN-WNLL; the corresponding robust accuracies under the above four adversarial attacks are listed in Table 7. In the MNIST experiment, the black-box attacks are less effective than the white-box attacks.

Model Natural FGSM IFGSM IFGSM C&W
ResNet20 75.11% 50.89% 46.03% 46.01% 58.73%
ResNet20-WNLL 75.53% 55.76% 53.31% 53.26% 63.82%
ResNet56 79.32% 55.05% 50.98% 50.06% 61.75%
ResNet56-WNLL 79.52% 60.50% 58.19% 57.26% 67.93%
WideResNet34 84.05% 51.93% 48.93% 48.32% 59.04%
WideResNet34-WNLL 84.95% 65.50% 63.03% 62.25% 72.37%
Table 8: Natural and robust accuracies under different white-box adversarial attacks of different robustly trained models on the CIFAR10 dataset.
Model Oracle (FGSM) (IFGSM) (IFGSM) (C&W)
ResNet20-WNLL ResNet20 55.91% 53.44% 53.35% 65.13%
ResNet56-WNLL ResNet56 60.00% 57.94% 57.85% 70.47%
WideResNet34-WNLL WideResNet34 67.19% 67.07% 67.17% 81.17%
Table 9: Robust accuracies under different black-box adversarial attacks of different robustly trained models on the CIFAR10.

Next, we consider the adversarial defense capability of DNNs with the WNLL activation on the CIFAR10 dataset. Table 8 lists the natural and robust accuracies, under the white-box attacks, of the standard ResNet20, ResNet56, and WideResNet34-10 and of their counterparts with the WNLL activation. These results show that the robustly trained ResNets with the WNLL activation slightly improve the natural accuracies on the clean images, while the robust accuracies are significantly improved. For instance, under the FGSM and C&W attacks, the WNLL activation boosts the robust accuracy by up to more than 13%, and under the two IFGSM attacks the robust accuracy improvement is up to roughly 14%. For the WideResNet34-10, under the IFGSM attacks we achieve robust accuracies above 62%, which outperform the results of Zhang et al. (2019) by a clear margin. For black-box attacks on DNNs with the WNLL activation, we regard the counterpart DNNs with the softmax activation as the oracle models. The robust accuracies of ResNet20-WNLL, ResNet56-WNLL, and WideResNet34-10-WNLL are listed in Table 9. Again, the black-box attacks are less effective than the white-box ones.

Model $(K, \bar{K})$ Natural FGSM IFGSM IFGSM C&W
ResNet56-WNLL (15, 8) 79.89% 59.71% 57.85% 56.53% 67.91%
ResNet56-WNLL (30, 15) 79.52% 60.50% 58.19% 57.26% 67.93%
ResNet56-WNLL (45, 23) 78.92% 59.50% 57.94% 57.06% 66.26%
ResNet56-WNLL (60, 30) 77.92% 58.04% 55.80% 54.97% 67.74%
Table 10: Natural and robust accuracies under different white-box adversarial attacks of ResNet56-WNLL with different numbers of nearest neighbors, in the form $(K, \bar{K})$, used for the WNLL interpolation on the CIFAR10 dataset.

Furthermore, we consider the influence of the number of nearest neighbors $K$, with the $\bar{K}$-th nearest neighbor used to normalize the weights, in the WNLL interpolation (Eq. (5)). Table 10 lists the natural and robust accuracies of ResNet56-WNLL for different numbers of nearest neighbors, $(K, \bar{K})$, involved in the WNLL interpolation. The natural accuracy decays as a greater number of nearest neighbors is used for interpolation, and the robust accuracies are maximized at $(K, \bar{K}) = (30, 15)$; when even more nearest neighbors are used for interpolation, the robust accuracies decay. This issue might be due to the fact that the nearest neighbors are selected from only a finite number of data points, so that the neighbors found for large $K$ are far from being true nearest neighbors on the underlying manifold.

Finally, let us look at the adversarial images and the adversarial noise crafted by adversarial attacks on DNNs with both the softmax and the WNLL activation functions. Figures 5 and 6 depict adversarial images and adversarial noise for the MNIST and the CIFAR10 obtained by applying different adversarial attacks to Small-CNN and ResNet20 with both the softmax and the WNLL activation functions. All these adversarial images are misclassified by the DNNs with both the softmax and the WNLL activation. However, they can be easily classified by humans.

Figure 5: Adversarial images (left panel) selected from the MNIST dataset and the corresponding adversarial noise (right panel). Column 1: clean image and noise (no noise in this case); Columns 2-3: adversarial images and noise crafted by the IFGSM and C&W attacks on the Small-CNN, respectively; Columns 4-5: adversarial images and noise crafted by the IFGSM and C&W attacks on the Small-CNN-WNLL, respectively. The predicted labels for the adversarial images are listed below the adversarial images in the left panel.
Figure 6: Adversarial images (left panel) selected from the CIFAR10 dataset and the corresponding adversarial noise (right panel). Column 1: clean image and noise (no noise in this case); Columns 2-3: adversarial images and noise crafted by the IFGSM and C&W attacks on the ResNet20, respectively; Columns 4-5: adversarial images and noise crafted by the IFGSM and C&W attacks on the ResNet20-WNLL, respectively. The predicted labels for the adversarial images are listed below the adversarial images in the left panel.

4.4 Semi-supervised Learning

In this subsection, we apply DNN-WNLL to semi-supervised learning, where we have access to all the training data of the CIFAR10 but only part of them are labeled. In semi-supervised learning we can use the unlabeled data to build the graph for the WNLL interpolation, which is not possible in the data-efficient learning setting of Section 4.1. We list the accuracies of semi-supervised learning when 1K and 10K training data are labeled in Table 11. Compared to the results in Table 3, semi-supervised learning yields better accuracy with the same number of labeled training data.

Network 1K (Labeled)/49K (Unlabeled) 10K (Labeled)/ 40K (Unlabeled)
ResNet20-WNLL 27.02% 9.01%
ResNet32-WNLL 26.28% 7.53%
ResNet44-WNLL 25.63% 7.25%
ResNet56-WNLL 25.53% 6.99%
ResNet110-WNLL 25.38% 6.50%
Table 11: Test error of DNNs with the WNLL output activation for the CIFAR10 classification in the semi-supervised learning setting.

5 Geometric Explanations

In this section, we consider the representations learned by DNNs with the two different output activation functions. As an illustration, we randomly select a subset of training and testing instances from each of the airplane and automobile classes of the CIFAR10 dataset. We consider two different strategies to visualize the features learned by ResNet56 and ResNet56-WNLL for the above randomly selected data.

  • Strategy I: Apply principal component analysis (PCA) to reduce the high dimensional features output right before the softmax/WNLL activation to 2D.

  • Strategy II: Add an additional fully connected layer before the output activation function. This fully connected layer helps to learn 2D representations directly.

We first show that, in Strategy II, the newly added fully connected (FC) layer does not much affect the performance of the original ResNet56. We train and test ResNet56 with and without the additional FC layer on the aforementioned randomly selected training and testing data. As shown in Fig. 7, the evolution of the training and testing accuracies is essentially the same for ResNet56 with and without the additional FC layer.

Figure 7: Epochs v.s. accuracy in training ResNet56 on the CIFAR10. (a): without the additional FC layer; (b): with the additional FC layer.

5.1 Improving Generalization

Figure 8 plots the representations of the selected airplane and automobile data from the CIFAR10 dataset. First, panels (a) and (b) show the features of the test set learned by ResNet56, visualized by the two proposed strategies. In both cases, the features are in general well separated, with a small overlap that causes some misclassification. Charts (c) and (d) depict the first two principal components (PCs) learned by ResNet56-WNLL for the selected training and testing data. The PCs of the features learned by ResNet56-WNLL are better separated than those of ResNet56 (Fig. 8), indicating that ResNet56-WNLL is more accurate in classifying the randomly selected data.

Figure 8: Visualization of the features learned by ResNet56 with the softmax ((a), (b)) and the WNLL ((c), (d)) activation functions. (a): the 2D features of the airplane and automobile data in the test set learned by the ResNet56 with an additional linear layer; (b): the first two principal components of the features of the airplane and automobile data in the test set learned by the ResNet56; (c) and (d) plot the first two principal components of features of the airplane and automobile data in the training and test set learned by the ResNet56-WNLL. All experiments are done on the CIFAR10 dataset.

5.2 Improving Adversarial Robustness

First, let us look at how the adversarial attack changes the geometry of the learned representations. We consider the simple one-step IFGSM attack with the same parameters used before. Figure 9 shows the first two PCs of the representations learned by ResNet56 and ResNet56-WNLL for the adversarial test images. These PCs show that the adversarial attack mixes the features of the two different classes and therefore drastically reduces the classification accuracy.

Figure 9: Visualization of the first two principal components of the adversarial images’ (IFGSM attack) features learned by ResNet56 with the softmax (a) and the WNLL (b) activation functions, respectively.
Figure 10: Randomly selected adversarial images and their top five nearest neighbors in the clean training set, searched based on the distance between the features output from the layer before the WNLL activation. Left (MNIST): IFGSM attack on the Small-CNN-WNLL; Right (CIFAR10): IFGSM attack on the ResNet20-WNLL.

Second, we consider how the WNLL interpolation helps to improve adversarial robustness. We randomly pick an adversarial image that is misclassified by the standard DNN with the softmax activation from the MNIST and the CIFAR10, respectively. The top five nearest neighbors, in the deep feature space, of these two adversarial images among the clean training data are shown in Fig. 10. For the MNIST digit, all the nearest neighbors belong to the same class as the adversarial image; for the CIFAR10 adversarial image, the top three neighbors belong to the same category as the adversarial one. These nearest neighbors guide DNN-WNLL to classify the adversarial images correctly.

6 Concluding Remarks

In this paper, we leveraged ideas from manifold learning and proposed to replace the output activation function of conventional deep neural nets (DNNs), typically the softmax function, with a graph Laplacian-based high dimensional interpolating function. This simple modification is applicable to any existing off-the-shelf DNN with the softmax activation and enables DNNs to make better use of the manifold structure of the data. Furthermore, we developed end-to-end and multi-staged training and testing algorithms for the proposed DNN with the interpolating function as its output activation. On the one hand, the proposed new framework remarkably improves both the generalizability and the robustness of the baseline DNNs; on the other hand, the new framework is suitable for data-efficient machine learning. These improvements are consistent across networks of different types and with different numbers of layers. The increase in generalization accuracy could also be used to train smaller models with the same accuracy, which has great potential for mobile device applications.

In this work, we utilized a special kind of graph interpolating function as the DNN's output activation. An alternative approach is to learn such an interpolating function instead of using a fixed one. This approach is under our consideration.

Acknowledgments

This material is based on research sponsored by the Air Force Research Laboratory under grant numbers FA9550-18-0167 and MURI FA9550-18-1-0502, the Office of Naval Research under grant number N00014-18-1-2527, the U.S. Department of Energy under grant number DOE SC0013838, and by the National Science Foundation under grant number DMS-1554564 (STROBE).

References