Towards Robust Neural Networks with Lipschitz Continuity

11/22/2018 ∙ by Muhammad Usama, et al. ∙ KAIST Department of Mathematical Sciences

Deep neural networks have shown remarkable performance across a wide range of vision-based tasks, particularly due to the availability of large-scale datasets for training and better architectures. However, data seen in the real world are often affected by distortions that are not accounted for by the training datasets. In this paper, we address the challenge of robustness and stability of neural networks and propose a general training method that makes existing neural network architectures more robust and stable to input visual perturbations while using only the available datasets for training. The proposed training method is convenient to use as it requires neither data augmentation nor changes to the network architecture. We provide theoretical proof as well as empirical evidence for the efficiency of the proposed training method through experiments with existing neural network architectures, and demonstrate that the same architectures, when trained with the proposed method, perform better on noisy data than when trained with the conventional approach.


1 Introduction

Recent advances in deep learning have immensely increased the representational capabilities of neural networks and made them powerful enough to be applied to different vision-based tasks, including image classification [1, 2, 3, 4], object detection [5, 6] and image captioning [7], as well as to deep reinforcement learning [8, 9]. Some important factors that explain the rapid development of deep learning include the emergence of dedicated mathematical frameworks for deep neural networks [10], the availability of large-scale annotated datasets [11, 12], improvements in network architectures [3, 13] and open-source deep learning libraries [14, 15].

Gaussian noise std (increasing →)
standard training   prediction:             ship    ship    bird    bird
                    confidence for 'ship':  0.9999  0.5608  0.1266  0.0252
proposed training   prediction:             ship    ship    ship    ship
                    confidence for 'ship':  0.9999  0.9986  0.8710  0.7215
Table 1: Effect of input image quality on the deep learning model prediction. We trained the ResNet-20 architecture with the standard and the proposed training procedure and tested both on a CIFAR-10 image ('ship') corrupted by Gaussian noise of increasing standard deviation. The model trained with the standard method fails to classify the image correctly as the severity of the distortion increases, while the model trained with the proposed method classifies all versions correctly with high confidence.

The availability of large amounts of high-quality, distortion-free image data is often assumed, and the visual quality of training images is often overlooked when designing deep learning based applications. It has been shown that models trained on clean data suffer a degradation in performance when tested on samples distorted by blur or noise [16, 17]. In most real-world applications, images undergo various forms of distortion owing to the formatting, compression and post-processing routinely applied to visual datasets, often imperceptible to the human eye. Therefore, the availability of clean data is no longer guaranteed. One way to alleviate this problem is to train the networks with the noisy data expected to be seen in the real world. However, the commonly used large-scale datasets [11, 12] for training deep learning models do not provide training data with these artifacts and distortions. It is therefore imperative to develop training techniques that yield more robust deep learning models while using only the available large-scale datasets, which do not cater for these distortions.

This work addresses the problem of improving the robustness and stability of deep neural networks. This is a fundamental problem in computer vision and has recently received increased interest from the community [18, 19, 20]. Our focus is on improving the training process rather than the DNN architecture. We introduce a general training technique that can be applied to any standard state-of-the-art deep learning model and lets it learn a mapping that is more robust and insensitive to input visual perturbations and distortions. We note that a deep neural network can be considered a mathematical model, and the least we can expect from a stable mathematical model is that a small perturbation or distortion of its input does not produce a large change in its behavior. To realize this, we use some fundamental concepts, namely Lipschitz functions and Lipschitz continuity. According to perturbation theory, if the input is perturbed by a small amount, the output of the system stays close to its nominal output (the output when there is no perturbation in the input), provided that the system dynamics are continuous and locally Lipschitz. To encourage the dynamics of the deep neural network to remain locally Lipschitz, we include an additional term in the loss function, called the Lipschitz continuity loss. We provide theoretical justification for the proposed training method in Section 4, proving that for admissible distortions in the neighborhood of an input image, a locally Lipschitz neural network is guaranteed to be stable, thus improving performance in the presence of noisy data. We verify the theoretical results by performing extensive experiments on the MNIST, CIFAR-10 and STL-10 datasets.

We summarize the paper's findings in Table 1: the ResNet-20 architecture, trained without the proposed method, fails to classify distorted input images as the severity of the distortion increases, and even for correctly classified distorted images the prediction confidence is very low. The same architecture trained with the proposed method classifies the same distorted images correctly with reasonable prediction confidence.

2 Related Work

When training deep neural networks, the availability of high-quality, artifact-free image data is often assumed. However, this may not always be true due to the distortions images encounter during the acquisition, transmission and storage phases. Moreover, with the increasing demand for DNN-based mobile applications, the assumption of high-quality input data needs to be relaxed. [16, 17] showed that deep neural networks trained on clean datasets are susceptible to poor performance when tested on blur and noise distortions, while being resilient to compression artifacts such as JPEG and contrast changes. They propose training the networks on low-quality data to alleviate this problem, which may in turn cause the networks to perform poorly on high-quality data. The VGG [18] architecture was shown to perform better than AlexNet [4] or GoogLeNet [1] on the considered types of distortions. [1] showed that standard architectures trained on high-quality data suffer significant degradation in performance when tested with data distorted by blurring or camera motion. They showed that fine-tuning the trained models with a mix of blurry and sharp training examples helps regain the lost performance to a degree, at the cost of minor computational overhead. [21] proposed two approaches to alleviate poor performance on blurred and noisy images, re-training and fine-tuning with noisy images, showing that fine-tuning is more practical than re-training. [22] also shows that networks fine-tuned on distorted data outperform the original networks when tested on noisy data, but these fine-tuned networks perform poorly on quality distortions they have not been trained for. [22] proposes the concept of a mixture-of-experts ensemble, where different experts are trained on different types of distortion and the final output of the model is a weighted sum of the expert models' outputs; a separate gating network determines the weights.
[19] presents BANG, a training algorithm that assigns more weight to correctly classified samples. Since correctly classified training samples contribute less to the loss than incorrectly classified ones, training is otherwise focused on learning the badly classified samples. [19] showed that increasing the contribution of correctly classified training samples in a batch helps flatten the decision space around these samples, thus training more robust DNNs. In addition to the above-mentioned issues, [23] showed the inability of many machine learning models to deal with slightly, but intentionally, perturbed examples, called adversarial examples, which are indistinguishable from their original counterparts to human observers. The authors of [23] were the first to introduce a method for finding adversarial perturbations, while [24] introduced a computationally cheaper adversarial example generation algorithm called the Fast Gradient Sign Method (FGSM). Our work differs from [20]: instead of flattening the neural network dynamics altogether, we focus on setting a soft upper bound on the gradient of the network function that does not adversely affect the representational power of the neural network. Our work also differs from data augmentation: we propose a way to improve the training process without using any extra training samples, whereas data augmentation uses standard training techniques and instead increases the number of training samples.

3 Background

In this section, we present the basic concepts of Lipschitz functions and Lipschitz continuity.

Let Ω be an open set in some ℝⁿ. A function f: Ω → ℝᵐ is called Lipschitz continuous on Ω if there exists a nonnegative constant L, called a Lipschitz constant of the function f on Ω, such that the following condition holds:

‖f(x) − f(y)‖ ≤ L ‖x − y‖    (1)

for all x, y ∈ Ω. We call the function f locally Lipschitz continuous if for each x₀ ∈ Ω there exists a constant r > 0 such that f is Lipschitz continuous on the open ball B(x₀, r) of center x₀ and radius r, where B(x₀, r) is mathematically written as B(x₀, r) = {x ∈ ℝⁿ : ‖x − x₀‖ < r}. The function f is said to be globally Lipschitz continuous if it is Lipschitz continuous on its entire domain Ω. We note that if the function f is Lipschitz continuous with a Lipschitz constant L, then it is also Lipschitz continuous with any L′ such that L′ ≥ L.

Lipschitz continuity measures the change in a function's values relative to the change in its argument. Let f be a Lipschitz continuous function with a Lipschitz constant L, so it satisfies (1), i.e.

‖f(x) − f(y)‖ / ‖x − y‖ ≤ L    (2)

for all x, y ∈ Ω with x ≠ y. In other words, the average rate of change in the value of f between any pair of points x and y in Ω does not exceed the Lipschitz constant L. We note that the Lipschitz constant depends on the function f; it may be large for one function and small for another. If L is small, then f may vary only a little as the input is changed, but if L is large, the function output may vary a lot with only a small change in its input. In particular, when the Lipschitz function is real-valued, i.e. f: Ω ⊆ ℝ → ℝ, then by taking the limit of (2) as y → x we obtain |f′(x)| ≤ L, where f′ is the derivative of f. In other words, the magnitude of the (instantaneous) rate of change of f does not exceed the Lipschitz constant L when the Lipschitz continuous function is differentiable.

Lipschitz continuity therefore quantifies the sensitivity of a function with respect to its argument through the Lipschitz constant L. We note that the Lipschitz constant represents only an upper bound on how much the function can change with a change in its input; the actual change may be smaller than indicated by L.
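As a small illustration of these definitions (our own sketch, not part of the paper), the Lipschitz quotient in (2) can be probed numerically by sampling pairs of points. Since the true constant bounds the quotient over all pairs, random sampling can only under-estimate it; the function names here are our own.

```python
import math
import random

def lipschitz_quotient(f, x, y):
    """Average rate of change |f(x) - f(y)| / |x - y| for scalar inputs."""
    return abs(f(x) - f(y)) / abs(x - y)

def estimate_lipschitz(f, lo, hi, trials=20000, seed=0):
    """Lower-bound a Lipschitz constant of f on [lo, hi] by random sampling.

    The true constant L satisfies L >= quotient for every pair, so this
    estimate only ever under-shoots; it is a sanity check, not a certificate.
    """
    rng = random.Random(seed)
    best = 0.0
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        if x != y:
            best = max(best, lipschitz_quotient(f, x, y))
    return best

# sin is globally 1-Lipschitz: |sin(x) - sin(y)| <= |x - y|
L_hat = estimate_lipschitz(math.sin, -math.pi, math.pi)
assert L_hat <= 1.0 + 1e-9
```

For sin, the sampled quotients approach but never exceed 1, matching the remark that the derivative magnitude is bounded by L.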

4 Approach

Neural networks can be considered a sequence of layers that attempt to learn an arbitrary mapping f: x ↦ y. The network is parameterized by a large number of parameters that are optimized given the training inputs x and targets y. Imposing Lipschitz continuity on the neural network dynamics f therefore implies that a small perturbation of the input will not result in a large change at the output of the network, increasing the robustness and stability of the network. Theoretical justification for our approach is provided in the following theorem.

Theorem 1

Let Y be the set of labels used and let d be half of the minimum distance between any two labels. Let f be the neural network dynamics and let L be the chosen Lipschitz constant hyperparameter. If f is L-Lipschitz, then for every distortion η in input space such that ‖η‖ < d/L, the inputs x and x̃ are guaranteed to be mapped to the same label, where x̃ is the distorted input of the form x̃ = x + η.

Proof

From the Lipschitz assumption, we have ‖f(x̃) − f(x)‖ ≤ L ‖x̃ − x‖ = L ‖η‖. Since ‖η‖ < d/L, we get ‖f(x̃) − f(x)‖ < d. Since f is discrete-valued in Y, taking into consideration the definition of d as half of the minimum distance between any two labels, we conclude that both x and x̃ get mapped to the same label in the set Y.

The Lipschitz property of f guarantees that for any distortion η with ‖η‖ < d/L, the output for the distorted input lies within a sphere of radius d about the output for the nominal input, where d is half of the minimum distance between any two labels. Thus, the distorted input is guaranteed to be mapped to the same label as the nominal input. When the network is trained without the proposed method, no upper bound is imposed on the slope of f; effectively L → ∞, which in Theorem 1 gives a radius d/L → 0, trivially implying that there is no distortion for which the network is guaranteed to be robust.
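The argument of Theorem 1 can be checked numerically on a toy example (our own illustration, not an experiment from the paper): a hand-built 1-Lipschitz map and a two-label set, where every distortion smaller than d/L leaves the predicted label unchanged.

```python
def nearest_label(v, labels):
    """Map a scalar network output v to the closest label value."""
    return min(labels, key=lambda c: abs(v - c))

# A 1-Lipschitz toy "network": its slope is at most 1 everywhere.
def f(x):
    return max(0.0, min(1.0, x))

labels = [0.0, 1.0]
d = 0.5 * (labels[1] - labels[0])   # half the minimum label distance: d = 0.5
L = 1.0                             # Lipschitz constant of f
radius = d / L                      # Theorem 1: any |eta| < d/L keeps the label

x = -0.2                            # nominal input; f(x) = 0.0 -> label 0.0
base = nearest_label(f(x), labels)
for k in range(1, 100):
    eta = radius * k / 100.0        # admissible distortions, |eta| < d/L
    assert nearest_label(f(x + eta), labels) == base
    assert nearest_label(f(x - eta), labels) == base
```

With L → ∞ the radius d/L collapses to zero, matching the remark above that an unconstrained network carries no such guarantee.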

5 Method

Let f: ℝⁿ → ℝᵏ, where k denotes the number of labels, represent the mapping performed by the deep neural network, and let x ∈ ℝⁿ be the input that the network takes, for example an image in the case of a convolutional neural network. To encourage the network to be locally Lipschitz continuous, we perturb the network input during the training process with zero-mean Gaussian noise η to get a perturbed copy x̃ of the input, i.e.

x̃ = x + η    (3)

where we note that in (3), η has the same dimensions as the input image, i.e. η ∈ ℝⁿ, and each component of η is a scalar zero-mean Gaussian random variable with standard deviation σ. Here, σ is treated as a hyperparameter in the experiments.

In general, the derivative of a function f at a point a is defined by

f′(a) = lim_{h→0} [f(a + h) − f(a)] / h,

and it can be approximated by [f(x) − f(a)] / (x − a), where x is a point near a. Hence, taking x̃ = x + η from (3), we obtain the finite-difference slope estimate

‖f(x̃) − f(x)‖ / ‖x̃ − x‖ = ‖f(x + η) − f(x)‖ / ‖η‖    (4)

To encourage the neural network to become locally Lipschitz continuous, we add an additional term, called the Lipschitz loss L_lip, to the usual loss function, termed here L_task, to get an aggregated loss function

L_total = L_task + λ L_lip,

where L_task is the loss term corresponding to the task to be performed by the network, for example the cross-entropy loss, while L_lip is defined as:

L_lip = max(0, ‖f(x̃) − f(x)‖ / ‖η‖ − L)    (5)

where λ is the weighting factor for the added loss term L_lip, L serves the purpose of a Lipschitz constant for the neural network dynamics, and the quotient is the slope estimate given in (4). We treat both λ and L as hyperparameters.
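As a concrete sketch, the aggregated loss can be assembled as below. This is our own minimal illustration, assuming a hinge penalty max(0, slope − L) for the Lipschitz term (a reconstruction; the exact form of the paper's eq. (5) may differ) and a placeholder `task_loss`. In practice f would be a network evaluated inside an autodiff framework so the penalty can be backpropagated.

```python
import math
import random

def l2(v):
    """Euclidean norm of a plain-Python vector."""
    return math.sqrt(sum(c * c for c in v))

def lipschitz_loss(f, x, sigma, L, rng):
    """Penalize the finite-difference slope of f wherever it exceeds L.

    x is perturbed with zero-mean Gaussian noise of std sigma (eq. 3), and
    the quotient ||f(x_tilde) - f(x)|| / ||eta|| (eq. 4) is softly bounded by L.
    """
    eta = [rng.gauss(0.0, sigma) for _ in x]
    x_tilde = [xi + ei for xi, ei in zip(x, eta)]
    fx, fxt = f(x), f(x_tilde)
    slope = l2([a - b for a, b in zip(fxt, fx)]) / max(l2(eta), 1e-12)
    return max(0.0, slope - L)

def total_loss(task_loss, f, x, sigma, L, lam, rng):
    """Aggregated objective: task term plus weighted Lipschitz term."""
    return task_loss + lam * lipschitz_loss(f, x, sigma, L, rng)

# toy linear "network" with slope exactly 3 in every coordinate
f = lambda v: [3.0 * c for c in v]
rng = random.Random(0)
penalty = lipschitz_loss(f, [0.5, -0.2], sigma=0.1, L=1.0, rng=rng)
assert abs(penalty - 2.0) < 1e-6   # slope is exactly 3, so hinge = 3 - 1 = 2
```

For the linear toy map the slope estimate equals 3 regardless of the sampled η, so the hinge is 2 when L = 1 and 0 when L ≥ 3, which is the intended "soft upper bound" behavior.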

The effect of the hyperparameters will be studied in Section 7.

6 Experiments

To evaluate our approach, we tested the proposed training procedure on the MNIST [25], CIFAR-10 [26] and STL-10 [27] datasets. Details of these experiments and their results are given in the following subsections. When we train a network without the proposed training method, we refer to the training as the standard training method.

Justification for using Gaussian Noise:

In the experiments, we use Gaussian noise to corrupt the test data. To see why the Gaussian model can approximate realistic distortions, note that any distortion of an image x can always be expressed as x̃ = T_ε(x), where T_ε is a map close to the identity map (i.e. T₀ = id) parameterized by a parameter ε. Hence, for all small values of ε, the Taylor expansion of T_ε(x) in ε around ε = 0 gives x̃ = x + η(ε), where η(ε) collects the terms of order 1 or higher in ε and can be interpreted as a perturbation term that vanishes as ε → 0. Hence, it is reasonable to use Gaussian noise to simulate various realistic distortions of the image x.
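This first-order behavior is easy to verify numerically (our own sketch, with an assumed smooth distortion T_ε(x) = x^(1+ε), a gamma-style pixel transform): the gap between the distortion and its linear perturbation term shrinks quadratically in ε.

```python
import math

def distort(x, eps):
    """A smooth distortion close to the identity: T_eps(x) = x**(1 + eps).

    T_0 is the identity map, so for small eps the distorted pixel differs
    from x by roughly eps * x * ln(x) -- an additive perturbation term.
    """
    return x ** (1.0 + eps)

x = 0.6
first_order = lambda eps: eps * x * math.log(x)   # the eps-linear Taylor term

for eps in (0.1, 0.01, 0.001):
    actual = distort(x, eps) - x
    # the gap to the linear term shrinks like eps**2
    assert abs(actual - first_order(eps)) < 2.0 * eps ** 2
```

So for small ε the distortion is indistinguishable from an additive perturbation of x, which is what the Gaussian corruption model simulates.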

Due to space constraints, some tables and figures are given in the supplementary material and will be referenced in the subsequent sections as required.

Training                 accuracy (increasing test noise →)
standard method          0.97   0.92   0.65
proposed, setting 1      0.98   0.95   0.70
proposed, setting 2      0.98   0.96   0.78
proposed, setting 3      0.98   0.96   0.77
Table 2: Classification accuracies for the experiments with MNIST. Results are shown for three levels of distortion of the test data, described by the standard deviation σ of the added noise; σ = 0 (the first column) corresponds to undistorted test data. Each proposed-training row corresponds to one combination of the hyperparameters λ and L.

6.1 MNIST

6.1.1 Experiment Details

We used a convolutional neural network consisting of one convolutional layer, one fully-connected layer and an output layer for the experiments with the MNIST dataset. Training ran for 5 epochs of 550 iterations each with a fixed learning rate. For training the network with the standard training method, we set λ = 0 in (5). The network was trained both with and without the proposed training mechanism, using several combinations of the hyperparameters λ, L and σ.

Figure 4: Plots of classification accuracy versus the ratio of training data used in the training process. Panels (a)-(c) show models trained with three different hyperparameter settings.

We tested the trained networks on test data distorted with zero-mean Gaussian noise at two standard deviation values. Networks trained with various percentages of the training data were also tested.

We also investigated the effect of using only a portion of the training data. We trained the networks with various percentages of the training data, sampled randomly at the start of training, and tested them on the entire test set. We hypothesize that a robust neural network trained with only a portion of the training data should still generalize well across the entire test dataset.

For the ResNet-20 architecture
Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      92.77  92.77  92.77     38.28  38.28  38.28     18.01  18.01  18.01
proposed  0.01     88.02  82.49  88.18     50.56  45.80  58.56     23.49  21.68  29.70
proposed  0.1      88.86  88.21  89.00     63.01  59.05  57.92     34.73  31.17  25.84
For the PreResNet-20 architecture
Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      92.59  92.59  92.59     30.80  30.80  30.80     15.91  15.91  15.91
proposed  0.01     86.91  87.86  88.34     63.52  58.86  59.47     32.84  26.99  26.33
proposed  0.1      86.55  88.11  87.80     58.51  61.25  64.98     35.14  32.76  39.41
Table 3: Top-1 accuracies for models trained with the first hyperparameter setting on the CIFAR-10 dataset. Columns are grouped by increasing test-noise level; within each group, the three columns correspond to three further training settings, and the rows labelled 0.01 and 0.1 to two values of a second training hyperparameter.
For the ResNet-20 architecture
Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      92.77  92.77  92.77     38.28  38.28  38.28     18.01  18.01  18.01
proposed  0.01     92.39  92.44  92.68     32.50  36.56  34.09     15.68  19.65  16.33
proposed  0.1      92.57  92.32  93.02     34.15  38.48  42.06     13.67  20.63  21.56
For the PreResNet-20 architecture
Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      92.59  92.59  92.59     30.80  30.80  30.80     15.91  15.91  15.91
proposed  0.01     92.50  92.40  92.56     41.31  35.18  34.66     22.25  18.39  16.46
proposed  0.1      92.36  92.66  92.34     34.92  28.37  33.67     17.54  16.12  14.83
Table 4: Top-1 accuracies for models trained with the second hyperparameter setting on the CIFAR-10 dataset (layout as in Table 3).
For the ResNet-20 architecture
Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      92.77  92.77  92.77     38.28  38.28  38.28     18.01  18.01  18.01
proposed  0.01     82.97  82.72  82.55     70.31  67.60  66.89     46.47  43.91  41.15
proposed  0.1      81.36  83.20  84.62     58.55  60.09  72.40     36.25  40.75  45.53
For the PreResNet-20 architecture
Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      92.59  92.59  92.59     30.80  30.80  30.80     15.91  15.91  15.91
proposed  0.01     82.43  80.88  80.00     70.83  64.34  72.08     45.54  33.10  47.19
proposed  0.1      80.08  85.17  82.42     53.92  58.37  65.94     31.77  26.48  47.28
Table 5: Top-1 accuracies for models trained with the third hyperparameter setting on the CIFAR-10 dataset (layout as in Table 3).

6.1.2 Results

Table 2 presents classification accuracies for models trained with different combinations of hyperparameters. Networks trained with the Lipschitz continuity loss perform better than the network obtained with the standard training procedure. With undistorted test data the gain in performance is small, but as the severity of distortion increases, the networks trained with the proposed method show a significant performance improvement over the network trained with the standard training process. As the value of L is increased while keeping the other hyperparameters fixed, performance slightly deteriorates, in accordance with Theorem 1, where the region of admissible distortions shrinks as L is increased, i.e. the radius d/L decreases.

To further test the robustness of the proposed training procedure, we trained the networks with various portions of the training data. These models were then tested on the entire test dataset, both undistorted and distorted. Figure 4 shows that networks trained with the Lipschitz loss consistently perform better than those trained with the standard training process, demonstrating their robustness.

6.2 CIFAR-10

6.2.1 Experiment Details

We used ResNet-20 [4] and PreResNet-20 [17] as our network architectures for the classification task on the CIFAR-10 dataset. Both networks have 16-16-32-64 channels and 0.26 million parameters each. Each model was trained for 300 epochs with a batch size of 128 and an initial learning rate of 0.1, decreased by a factor of 10 first at epoch 150 and then at epoch 225. Several combinations of the hyperparameters λ, L and σ were used. For training the network with the standard training method, we set λ = 0 in (5).

We tested the trained networks on corrupted test data generated by distorting the test set with zero-mean Gaussian noise, sweeping the standard deviation over a range of values with a fixed step size.
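Such corrupted test sets can be generated along the following lines (a sketch of our own; the σ grid and the clipping to [0, 1] are assumptions, since the exact range and step size are experiment hyperparameters):

```python
import random

def add_gaussian_noise(image, sigma, rng):
    """Corrupt a [0, 1]-valued flattened image with zero-mean Gaussian noise,
    then clip back to the valid pixel range."""
    return [min(1.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in image]

def corrupted_test_sets(images, sigmas, seed=0):
    """Build one distorted copy of the test set per noise level sigma."""
    rng = random.Random(seed)
    return {s: [add_gaussian_noise(im, s, rng) for im in images] for s in sigmas}

images = [[0.2, 0.8, 0.5], [0.0, 1.0, 0.3]]   # toy flattened "images"
sigmas = [0.0, 0.05, 0.10, 0.15, 0.20]        # assumed sigma grid, not the paper's
sets = corrupted_test_sets(images, sigmas)

assert sets[0.0] == images                    # sigma = 0 leaves the data intact
assert all(0.0 <= p <= 1.0 for im in sets[0.20] for p in im)
```

Each trained model is then evaluated once per σ level, which is how the per-column accuracies in Tables 3-5 are obtained.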

6.2.2 Results

Table 3 shows the top-1 classification accuracies for networks trained with hyperparameter values of 0.01 and 0.1; Tables 4 and 5 show results in the same fashion for the other two hyperparameter settings. Figures 7, 11 and 15 plot the corresponding test accuracies versus the test-noise standard deviation for better visualization.

We see that the models trained with the smallest hyperparameter setting perform comparably to the original baseline on the undistorted test data, and better than the baseline as the distortion severity increases, confirming that they are more robust to input visual distortions. As the hyperparameter value is increased, the models tend to lose some performance on the undistorted dataset but perform much better as the distortion severity increases. Models trained with larger values are therefore much more robust and insensitive to input distortions, at the cost of some performance on undistorted input data. We also note that as this value is increased, the performance difference between models trained with different values of the other hyperparameters diminishes and they start to perform equally well; the high value dominates the training loss and renders the other values ineffective.

Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      80.44  80.44  80.44     50.67  50.67  50.67     34.94  34.94  34.94
proposed  0.01     75.65  77.41  78.19     66.11  59.00  62.91     48.77  40.94  44.52
proposed  0.1      78.88  79.71  77.21     68.34  65.47  64.60     47.42  46.26  42.24
Table 6: Top-1 accuracies for models trained with the first hyperparameter setting on the STL-10 dataset (layout as in Table 3).
Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      80.44  80.44  80.44     50.67  50.67  50.67     34.94  34.94  34.94
proposed  0.01     80.44  80.88  80.79     56.54  67.81  56.56     41.74  51.58  37.25
proposed  0.1      81.34  80.90  80.64     55.46  52.88  49.86     39.51  39.51  34.84
Table 7: Top-1 accuracies for models trained with the second hyperparameter setting on the STL-10 dataset (layout as in Table 3).
Training  (hp)     noise level 1           noise level 2           noise level 3
standard    -      80.44  80.44  80.44     50.67  50.67  50.67     34.94  34.94  34.94
proposed  0.01     73.16  72.08  70.90     60.19  66.97  65.89     48.83  53.21  51.33
proposed  0.1      71.75  75.35  71.83     62.79  69.10  67.88     46.40  55.95  52.21
Table 8: Top-1 accuracies for models trained with the third hyperparameter setting on the STL-10 dataset (layout as in Table 3).

6.3 STL-10

6.3.1 Experiment Details

We used PreResNet-32 [1] as our baseline architecture for the classification task on the STL-10 dataset. The network has 16-16-32-64 channels and 0.46 million parameters. The training conditions and hyperparameter values are the same as for the CIFAR-10 experiments, and the distorted test data was generated in the same way.

6.3.2 Results

Table 6 shows the top-1 classification accuracies for networks trained with hyperparameter values of 0.01 and 0.1; Tables 7 and 8 show results in the same fashion for the other two hyperparameter settings. Figures 8, 12 and 16 in the supplementary material plot the corresponding test accuracies versus the test-noise standard deviation for better visualization.

The trends mirror the CIFAR-10 results: models trained with the smallest hyperparameter setting perform comparably to the baseline on undistorted test data and better under distortion; larger settings trade some clean-data accuracy for considerably more robustness to input distortions; and for high enough values the differences between the remaining hyperparameter choices diminish as the dominant term in the training loss renders them ineffective.

7 Sensitivity Analysis of Hyperparameters

The impact of the hyperparameters is best studied through a sensitivity analysis. The hyperparameters introduced in this study are λ, L and σ. For the sensitivity analysis, we fix nominal values of the hyperparameters and let acc denote the percentage accuracy of the model trained with the Lipschitz term in the loss function. We then perturb each hyperparameter in turn and repeat the experiments on the CIFAR-10 dataset; the sensitivity with respect to a hyperparameter is the relative change in accuracy divided by the relative change in that hyperparameter.

We find that the network performance is most sensitive to changes in one of the hyperparameters and least sensitive to another, while it is fairly sensitive to changes in L, where the negative value of the corresponding sensitivity indicates that performance deteriorates as L increases. This is consistent with Theorem 1 in Section 4, where the radius of admissible distortions is inversely proportional to L, i.e. equal to d/L.
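The sensitivity computation can be sketched as follows (a sketch of our own; the accuracy and hyperparameter numbers are hypothetical and not taken from the paper):

```python
def sensitivity(acc_nominal, acc_perturbed, h_nominal, h_perturbed):
    """Relative sensitivity: fractional change in accuracy divided by the
    fractional change in the hyperparameter that caused it."""
    d_acc = (acc_perturbed - acc_nominal) / acc_nominal
    d_h = (h_perturbed - h_nominal) / h_nominal
    return d_acc / d_h

# Hypothetical numbers (not from the paper): doubling the Lipschitz bound L
# from 1.0 to 2.0 drops accuracy from 80% to 76%.
S_L = sensitivity(80.0, 76.0, 1.0, 2.0)
assert S_L < 0.0   # negative: accuracy falls as L grows, matching Theorem 1
```

A negative sensitivity with respect to L is exactly what Theorem 1 predicts, since the guaranteed robustness radius d/L shrinks as L grows.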

8 Conclusion

In this paper, we presented a method for training neural networks using Lipschitz continuity that makes them more robust to input visual perturbations. We provided theoretical justification and experimental demonstration of the effectiveness of our method using existing neural network architectures in the presence of input perturbations. Our approach is easy to use and effective, as it improves network robustness and stability without data augmentation or additional training data.

9 Acknowledgement

This research has been in part supported by the ICT R&D program of MSIP/IITP [2016-0-00563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].

Figure 7: Plots of the top-1 CIFAR-10 test accuracies for models trained with the first hyperparameter setting and with standard training; (a) shows results for ResNet-20 and (b) for PreResNet-20.
Figure 8: Plots of the top-1 STL-10 test accuracies for the first hyperparameter setting.
Figure 11: Plots of the top-1 CIFAR-10 test accuracies for the second hyperparameter setting; (a) ResNet-20, (b) PreResNet-20.
Figure 12: Plots of the top-1 STL-10 test accuracies for the second hyperparameter setting.
Figure 15: Plots of the top-1 CIFAR-10 test accuracies for the third hyperparameter setting; (a) ResNet-20, (b) PreResNet-20.
Figure 16: Plots of the top-1 STL-10 test accuracies for the third hyperparameter setting.

References