1 Introduction
Recent advances in deep learning have immensely increased the representational capabilities of neural networks and made them powerful enough to be applied to various vision-based tasks, including image classification [1, 2, 3, 4], object detection [5, 6], and image captioning [7], as well as to deep reinforcement learning [8, 9]. Some important factors behind the rapid development of deep learning include the emergence of dedicated mathematical frameworks for deep neural networks [10], the availability of large-scale annotated datasets [11, 12], improvements in network architectures [3, 13], and open-source deep learning libraries [14, 15].

Gaussian noise std (increasing →)
Input to the model: CIFAR-10 'ship' image with increasing Gaussian noise (images not reproduced)
standard training   prediction            ship     ship     bird     bird
                    confidence ('ship')   0.9999   0.5608   0.1266   0.0252
proposed training   prediction            ship     ship     ship     ship
                    confidence ('ship')   0.9999   0.9986   0.8710   0.7215
Table 1: Effect of input image quality on the deep learning model prediction. We trained the ResNet-20 architecture with the standard and the proposed training procedures and tested them on a CIFAR-10 image. The model trained with the standard method fails to classify the image correctly as the severity of the distortion increases, while the model trained with the proposed method classifies all versions correctly with high confidence.
The availability of large amounts of high-quality, distortion-free image data is often assumed, and the visual quality of training images is often overlooked when designing deep learning based applications. It has been shown that models trained on clean data suffer a degradation in performance when tested on samples distorted with blur or noise [16, 17]. In most real-world applications, images undergo various forms of distortion owing to the formatting, compression, and post-processing routinely applied to visual datasets, often unnoticeable to the human eye. Therefore, the availability of clean data is no longer guaranteed. One way to alleviate this problem is to train the networks on the noisy data expected in the real world. However, the large-scale datasets [11, 12] commonly used for training deep learning models do not provide training data with these artifacts and distortions. It is therefore imperative to develop training techniques that yield more robust deep learning models while using only the available large-scale popular datasets, which do not cater for these distortions.
The problem discussed in this work is improving the robustness and stability of deep neural networks. This is a fundamental problem in computer vision and has recently received increased interest from the community [18, 19, 20]. Our focus is on improving the training process rather than the DNN architecture. We introduce a general training technique that can be applied to any standard state-of-the-art deep learning model and lets it learn a mapping that is more robust and less sensitive to visual perturbations and distortions of the input. We note that a deep neural network can be considered a mathematical model, and the least we can expect from a stable mathematical model is that a small perturbation or distortion of its input does not produce a large change in its behavior. To realize this, we utilize some fundamental concepts, including Lipschitz functions and Lipschitz continuity. According to perturbation theory, if the input is perturbed by a small amount, the output of the system stays close to its nominal output (its output when there is no perturbation in the input), provided that the system dynamics are continuous and locally Lipschitz. To motivate the dynamics of the deep neural network to remain locally Lipschitz, we include an additional term in the loss function, called the Lipschitz loss.
We provide theoretical justification for the proposed training method in Section 4, proving that for admissible distortions in the neighborhood of an input image, a locally Lipschitz neural network is guaranteed to be stable, thus improving performance in the presence of noisy data. We verify the theoretical results with extensive experiments on the MNIST, CIFAR-10, and STL-10 datasets. We summarize the paper's findings in Table 1: the ResNet-20 architecture trained without the proposed method fails to classify distorted input images as the severity of the distortion increases, and even for correctly classified distorted images the prediction confidence is very low. The same architecture trained with the proposed method classifies the same distorted images correctly with reasonable prediction confidence.
2 Related Work
When training deep neural networks, the availability of high-quality, artifact-free image data is often assumed. However, this may not always hold due to the distortions images encounter during the acquisition, transmission, and storage phases. Moreover, with the increasing demand for DNN-based mobile applications, the assumption of high-quality input data needs to be relaxed. [16, 17] showed that deep neural networks trained on clean datasets are susceptible to poor performance when tested on blur and noise distortions, while being resilient to compression artifacts such as JPEG and contrast changes. They propose training the networks on low-quality data to alleviate this problem, which may in turn cause the networks to perform poorly on high-quality data. The VGG [18] architecture was shown to perform better than AlexNet [4] or GoogLeNet [1] on the considered types of distortion. [17] showed that standard architectures trained on high-quality data suffer significant degradation in performance when tested on data distorted by blurring or camera motion, and that fine-tuning the trained models with a mix of blurry and sharp training examples helps regain the lost performance to a degree, at the cost of minor computational overhead. [21] proposed two approaches to alleviate poor performance on blurred and noisy images, re-training and fine-tuning with noisy images, showing that fine-tuning is more practical than re-training. [22] also shows that networks fine-tuned on distorted data outperform the original networks when tested on noisy data, but these fine-tuned networks perform poorly on quality distortions they have not been trained for. [22] therefore proposes a mixture-of-experts ensemble, where individual experts are trained on different types of distortion and the final output of the model is a weighted sum of the expert models' outputs, with a separate gating network determining the weights.
[19] presents BANG, a training algorithm that assigns more weight to the correctly classified samples in a batch. Since correctly classified training samples contribute little to the loss compared with incorrectly classified ones, training is otherwise focused on learning the badly classified samples. [19] showed that increasing the contribution of correctly classified training samples in the batch helps flatten the decision space around these samples, thus training more robust DNNs. In addition to the issues mentioned above, [23] showed the inability of many machine learning models to deal with slightly, but intentionally, perturbed examples, called adversarial examples, which are indistinguishable to human observers from their original counterparts. The authors of [23] were the first to introduce a method for finding adversarial perturbations, while [24] introduced a computationally cheaper adversarial example generation algorithm called the Fast Gradient Sign Method (FGSM). Our work differs from [20] in that, instead of flattening the neural network dynamics altogether, we focus on setting a soft upper bound on the gradient of the network function that does not adversely affect its representational power. Our work also differs from data augmentation: we propose a way to improve the training process without using any extra training samples, whereas data augmentation uses standard training techniques and instead increases the number of training samples.

3 Background
In this section, we present the basic concepts of Lipschitz functions and Lipschitz continuity.
Let $U$ be an open set in some $\mathbb{R}^n$. A function $f : U \to \mathbb{R}^m$ is called Lipschitz continuous on $U$ if there exists a non-negative constant $L$, called a Lipschitz constant of the function $f$ on $U$, such that the following condition holds:

$$\|f(x) - f(y)\| \le L \|x - y\| \tag{1}$$

for all $x, y \in U$. We call the function $f$ locally Lipschitz continuous if for each $x_0 \in U$ there exists a constant $r > 0$ such that $f$ is Lipschitz continuous on the open ball $B(x_0, r)$ of center $x_0$ and radius $r$, where $B(x_0, r)$ is mathematically written as $B(x_0, r) = \{x \in \mathbb{R}^n : \|x - x_0\| < r\}$. The function $f$ is said to be globally Lipschitz continuous if it is Lipschitz continuous on its entire domain $U$. We note that if the function $f$ is Lipschitz continuous with a Lipschitz constant $L$, then it is also Lipschitz continuous with any $L'$ such that $L' \ge L$.
Lipschitz continuity measures the change in a function's values against the change in its independent variable. Let $f$ be a Lipschitz continuous function with a Lipschitz constant $L$, so it satisfies (1), i.e.

$$\frac{\|f(x) - f(y)\|}{\|x - y\|} \le L \tag{2}$$

for all $x \ne y$ in $U$. In other words, the average rate of change in the value of $f$ between any pair of points $x$ and $y$ in $U$ does not exceed the Lipschitz constant $L$. Here we note that the Lipschitz constant depends on the function: it may be large for one function and small for another. If $L$ is small, then $f$ may vary only a little as the input is changed; but if $L$ is large, the function output may vary a lot with only a small change in its input $x$. In particular, when the Lipschitz function is real-valued, i.e. $f : U \to \mathbb{R}$ with $U \subseteq \mathbb{R}$, then by taking the limit of (2) as $y \to x$ we obtain $|f'(x)| \le L$, where $f'$ is the derivative function of $f$. In other words, the magnitude of the (instantaneous) rate of change of $f$ does not exceed the Lipschitz constant $L$ when the Lipschitz continuous function is differentiable.
Lipschitz continuity therefore quantifies the sensitivity of the function with respect to its argument via the Lipschitz constant $L$. We note here that the Lipschitz constant $L$ represents only an upper bound on how much the function can change with a change in its input; the actual change may be smaller than indicated by $L$.
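To make the definition concrete, the slope bound in (2) can be checked numerically. The sketch below (plain Python/NumPy; the helper names are our own) samples random point pairs in a small ball and takes the largest finite-difference slope as an empirical estimate of a local Lipschitz constant:

```python
import numpy as np

def lipschitz_ratio(f, x, y):
    """Finite-difference slope |f(x) - f(y)| / |x - y| from inequality (2)."""
    return abs(f(x) - f(y)) / abs(x - y)

def estimate_local_lipschitz(f, x0, radius=0.1, n_pairs=1000, seed=0):
    """Empirical estimate of a local Lipschitz constant of f on the ball of
    the given radius around x0: sample random point pairs inside the ball
    and keep the largest observed slope."""
    rng = np.random.default_rng(seed)
    pts = x0 + rng.uniform(-radius, radius, size=(2, n_pairs))
    return max(lipschitz_ratio(f, a, b) for a, b in zip(pts[0], pts[1]))
```

For $f = \sin$, whose derivative is bounded by 1, the estimate stays at or below 1 on any ball, illustrating that $L$ only upper-bounds the observed slopes.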
4 Approach
A neural network can be considered a sequence of layers that attempts to learn an arbitrary mapping from inputs to outputs. The network is parameterized by many parameters that are optimized given the training data and the corresponding labels. Imposing the condition of Lipschitz continuity on the neural network dynamics therefore implies that a small perturbation of the input will not result in a large change at the output of the network, increasing the robustness and stability of the network. Theoretical justification for our approach is provided in the following theorem.
Theorem 1
Let $\mathcal{L}$ be the set of labels used and let $d$ be half of the minimum distance between any two labels. Let $f$ be the neural network dynamics and let $K$ be the chosen Lipschitz constant hyperparameter. If $f$ is $K$-Lipschitz, then for all distortions $\delta$ in the input space such that $\|\delta\| < d/K$, the inputs $x$ and $\hat{x}$ are guaranteed to be mapped to the same label, where $\hat{x}$ is the distorted input of the form $\hat{x} = x + \delta$.

Proof
From the Lipschitz assumption, we have $\|f(\hat{x}) - f(x)\| \le K \|\hat{x} - x\| = K \|\delta\|$. Since $\|\delta\| < d/K$, we get $\|f(\hat{x}) - f(x)\| < d$. Since $f$ is discrete-valued in $\mathcal{L}$, taking into consideration the definition of $d$, we conclude that both $x$ and $\hat{x}$ get mapped to the same label in the set $\mathcal{L}$.
The Lipschitz property of $f$ guarantees that for any distortion $\delta$ such that $\|\delta\| < d/K$, the output for the distorted input lies within a sphere of radius $d$ about the output for the nominal input, where $d$ is half of the minimum distance between any two labels. Thus, the distorted input is guaranteed to be mapped to the same label as the nominal input. When the network is trained without the proposed method, we impose no upper bound on the slope of $f$. Therefore, the Lipschitz constant $K \to \infty$, which in Theorem 1 gives $d/K \to 0$, trivially implying that there is no distortion for which the network is guaranteed to be robust.
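Theorem 1 turns the Lipschitz constant into a concrete robustness radius. A small sketch (the helper name is our own) computes that radius and reproduces the remark above about standard training:

```python
def guaranteed_radius(labels, K):
    """Radius of admissible distortions from Theorem 1: ||delta|| < d / K,
    where d is half the minimum distance between any two distinct labels."""
    d = min(abs(a - b) for a in labels for b in labels if a != b) / 2.0
    return d / K
```

For labels {0, 1, 2} and K = 2 the radius is 0.25; with no bound on the slope (K unbounded) the radius collapses to zero, so no distortion is guaranteed to be tolerated.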
5 Method
Let $f : \mathbb{R}^n \to \mathbb{R}^m$, where $m$ denotes the number of labels, represent the mapping performed by the deep neural network, and let $x \in \mathbb{R}^n$ be the input that the network takes, for example an image in the case of a convolutional neural network. In order to encourage the network to be locally Lipschitz continuous, we perturb the network input during the training process with zero-mean Gaussian noise to get a perturbed copy of the input, $\hat{x}$, i.e.

$$\hat{x} = x + \epsilon \tag{3}$$

where we note that in (3), $\epsilon$ has the same dimensions as the input image, i.e. $\epsilon \in \mathbb{R}^n$, and each component of $\epsilon$ is a single-valued zero-mean Gaussian random variable with standard deviation $\sigma$. Here, $\sigma$ is treated as a hyperparameter in the experiments. In general, the derivative of a function $f$ at a point $a$ is defined by

$$f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a},$$

and it can be approximated by

$$f'(a) \approx \frac{f(x) - f(a)}{x - a},$$

where $x$ is a point near $a$. Hence, if we take the perturbed input $\hat{x} = x + \epsilon$ from (3) as the nearby point, we have

$$\frac{\|f(\hat{x}) - f(x)\|}{\|\hat{x} - x\|} = \frac{\|f(x + \epsilon) - f(x)\|}{\|\epsilon\|} \tag{4}$$

as an approximation of the magnitude of the local rate of change of $f$ at $x$.
In order to encourage the neural network to become locally Lipschitz continuous, we add an additional term, called $\mathcal{L}_{lip}$, to the usual loss function, termed here $\mathcal{L}_{task}$, to get an aggregated loss function $\mathcal{L}$, i.e.

$$\mathcal{L} = \mathcal{L}_{task} + \gamma \, \mathcal{L}_{lip},$$

where $\mathcal{L}_{task}$ is the loss term corresponding to the task to be performed by the network, for example the cross-entropy loss, while $\mathcal{L}_{lip}$ is defined as:

$$\mathcal{L}_{lip} = \max\left(0, \, \frac{\|f(\hat{x}) - f(x)\|}{\|\hat{x} - x\|} - K\right) \tag{5}$$

where $\gamma$ is the weighting factor for the added loss term $\mathcal{L}_{lip}$, $K$ serves the purpose of the Lipschitz constant for the neural network dynamics, and the difference quotient is given in (4). We treat both $\gamma$ and $K$ as hyperparameters.
The effect of the hyperparameters will be studied in Section 7.
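A minimal sketch of the training objective follows, assuming a hinge penalty on the finite-difference slope for the added term (this form, and the helper names, are our assumptions; a real implementation would compute the same quantities on autograd tensors, e.g. in PyTorch):

```python
import numpy as np

def lipschitz_loss(f, x, K, sigma, rng):
    """Penalize the finite-difference slope of f wherever it exceeds the
    chosen Lipschitz constant K (a soft upper bound on the local slope).
    The hinge form is our assumption about the added loss term."""
    eps = rng.normal(0.0, sigma, size=x.shape)   # zero-mean Gaussian noise
    slope = np.linalg.norm(f(x + eps) - f(x)) / np.linalg.norm(eps)
    return max(0.0, slope - K)

def total_loss(task_loss, f, x, K, sigma, gamma, rng):
    """Aggregate loss: the usual task loss plus the weighted Lipschitz term."""
    return task_loss + gamma * lipschitz_loss(f, x, K, sigma, rng)
```

For a map that scales its input by 3, the measured slope is 3, so with K = 1 the penalty is about 2; a map with slope below K incurs no penalty at all, so well-behaved regions of the network are left untouched.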
6 Experiments
In order to evaluate our approach, we tested the proposed training procedure on the MNIST [25], CIFAR-10 [26], and STL-10 [27] datasets. Details about these experiments and their results are given in the following subsections. When a network is trained without the proposed method, we refer to this as the standard training method.
Justification for using Gaussian noise:
In our experiments, we use Gaussian noise to corrupt the test data. To see why the Gaussian model can approximate realistic distortions, note that any distortion of an image $x$ can always be expressed as $D(x) = g(x; \theta)$, where $g$ is a map close to the identity map, i.e. $g(x; 0) = x$, parameterized by a parameter $\theta$. Hence, for all small values of $\theta$, the Taylor expansion of $g$ in $\theta$ around $\theta = 0$ gives $g(x; \theta) = x + r(x; \theta)$, where $r(x; \theta)$ collects the terms of order 1 or higher in $\theta$ and can be interpreted as a perturbation term that vanishes as $\theta \to 0$. Hence, it is reasonable to use Gaussian noise to simulate various realistic distortions of the image $x$.
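The first-order argument above can be checked numerically. In the toy sketch below (the gamma-style distortion is our own illustrative choice), $g(x; \theta) = x^{1+\theta}$ is a smooth distortion with $g(x; 0) = x$; its deviation from the clean image shrinks roughly linearly with $\theta$, as the Taylor expansion predicts:

```python
import numpy as np

def distortion_residual(theta, n=1000):
    """Max deviation of the smooth distortion g(x; theta) = x**(1 + theta)
    from the identity, over pixel values x in [0, 1]."""
    x = np.linspace(0.0, 1.0, n)
    return np.max(np.abs(x ** (1.0 + theta) - x))
```

Shrinking $\theta$ by a factor of 10 shrinks the residual by roughly a factor of 10, so for small $\theta$ the distortion behaves like a small additive perturbation of $x$.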
Due to space constraints, some tables and figures are given in the supplementary material and will be referenced in the subsequent sections as required.
Table 2: MNIST classification accuracies. Test-time Gaussian noise increases from left to right (clean test data in the first column).

Training
standard method     0.97  0.92  0.65
proposed ( , )      0.98  0.95  0.70
proposed ( , )      0.98  0.96  0.78
proposed ( , )      0.98  0.96  0.77
6.1 MNIST
6.1.1 Experiment Details
We used a convolutional neural network consisting of one convolutional layer, one fully-connected layer, and an output layer for the experiments with the MNIST dataset. Five epochs of 550 iterations each were performed, and the learning rate was set to . For training the network with the standard training method, we set $\gamma = 0$ in (5). The network was trained with and without the proposed training mechanism, with used as hyperparameters. We tested the trained networks on test data distorted with zero-mean Gaussian noise with standard deviation values of and . Networks trained with various percentages of the training data were also tested.
We also investigated the effect of using only a proportion of the training data. We trained networks with various percentages of the training data and tested them on the entire test set, randomly sampling the chosen percentage of training data once at the start of training. We hypothesize that a robust neural network trained with only a portion of the training data should still generalize well across the entire test dataset.
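The subsampling step can be sketched as follows (helper name ours); the subset is drawn once, without replacement, before training begins:

```python
import random

def sample_fraction(dataset, fraction, seed=0):
    """Randomly keep the given fraction of training examples (without
    replacement), sampled once at the start of training."""
    rng = random.Random(seed)
    k = max(1, int(round(fraction * len(dataset))))
    return rng.sample(dataset, k)
```

A model is then trained on `sample_fraction(train_set, p)` for each chosen percentage `p` and evaluated on the full test set.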
Table 3: CIFAR-10 test accuracies (%). Columns group three increasing test-noise levels, with three settings each.

For ResNet-20 architecture
Training            Test data distortion →
standard            92.77 92.77 92.77   38.28 38.28 38.28   18.01 18.01 18.01
proposed (0.01)     88.02 82.49 88.18   50.56 45.80 58.56   23.49 21.68 29.70
proposed (0.1)      88.86 88.21 89.00   63.01 59.05 57.92   34.73 31.17 25.84

For PreResNet-20 architecture
standard            92.59 92.59 92.59   30.80 30.80 30.80   15.91 15.91 15.91
proposed (0.01)     86.91 87.86 88.34   63.52 58.86 59.47   32.84 26.99 26.33
proposed (0.1)      86.55 88.11 87.80   58.51 61.25 64.98   35.14 32.76 39.41
Table 4: CIFAR-10 test accuracies (%). Columns group three increasing test-noise levels, with three settings each.

For ResNet-20 architecture
Training            Test data distortion →
standard            92.77 92.77 92.77   38.28 38.28 38.28   18.01 18.01 18.01
proposed (0.01)     92.39 92.44 92.68   32.5  36.56 34.09   15.68 19.65 16.33
proposed (0.1)      92.57 92.32 93.02   34.15 38.48 42.06   13.67 20.63 21.56

For PreResNet-20 architecture
standard            92.59 92.59 92.59   30.80 30.80 30.80   15.91 15.91 15.91
proposed (0.01)     92.50 92.40 92.56   41.31 35.18 34.66   22.25 18.39 16.46
proposed (0.1)      92.36 92.66 92.34   34.92 28.37 33.67   17.54 16.12 14.83
Table 5: CIFAR-10 test accuracies (%). Columns group three increasing test-noise levels, with three settings each.

For ResNet-20 architecture
Training            Test data distortion →
standard            92.77 92.77 92.77   38.28 38.28 38.28   18.01 18.01 18.01
proposed (0.01)     82.97 82.72 82.55   70.31 67.60 66.89   46.47 43.91 41.15
proposed (0.1)      81.36 83.20 84.62   58.55 60.09 72.40   36.25 40.75 45.53

For PreResNet-20 architecture
standard            92.59 92.59 92.59   30.80 30.80 30.80   15.91 15.91 15.91
proposed (0.01)     82.43 80.88 80.00   70.83 64.34 72.08   45.54 33.10 47.19
proposed (0.1)      80.08 85.17 82.42   53.92 58.37 65.94   31.77 26.48 47.28
6.1.2 Results
Table 2 presents classification accuracies for models trained with different combinations of hyperparameters. Networks trained with the Lipschitz continuity loss perform better than the network obtained with the standard training procedure. With undistorted test data the gain in performance is small, but as the severity of distortion increases, the networks trained with the proposed method show a significant performance improvement over the network trained with the standard training process. As the value of $K$ is increased while keeping the other hyperparameters the same, performance slightly deteriorates, in accordance with the conclusion of Theorem 1, where the region of admissible distortions shrinks as $K$ is increased, i.e. $\|\delta\| < d/K$.
In order to test the robustness of the proposed training procedure, we trained networks with various portions of the training data. These models were then tested on the entire test dataset, both undistorted and distorted. Figure 4 shows that networks trained with the Lipschitz loss always perform better than those trained with the standard training process, demonstrating their robustness.
6.2 CIFAR-10
6.2.1 Experiment Details
We used ResNet-20 [3] and PreResNet-20 [13] as our network architectures for the classification task on the CIFAR-10 dataset. Both networks have 16-16-32-64 channels and 0.26 million parameters each. Each model was trained for 300 epochs with a batch size of 128 and a learning rate of 0.1, decreased by a factor of 10 first at epoch 150 and then at epoch 225. were used as hyperparameters. For training the network with the standard training method, we set $\gamma = 0$ in (5).
We tested the trained networks on corrupted test data generated by distorting the test set with zero-mean Gaussian noise with standard deviation values ranging from to in steps of .
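The corruption procedure can be sketched as below (function name ours): one distorted copy of the test set per noise level, clipped back to the valid pixel range:

```python
import numpy as np

def corrupt_test_set(images, sigmas, seed=0):
    """For each standard deviation in sigmas, add zero-mean Gaussian noise
    to the test images and clip the result back to [0, 1]."""
    rng = np.random.default_rng(seed)
    return {s: np.clip(images + rng.normal(0.0, s, size=images.shape), 0.0, 1.0)
            for s in sigmas}
```

Each trained model is then evaluated once per noise level on the corresponding corrupted copy.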
6.2.2 Results
Table 3 shows the top classification accuracies for networks trained with and values of and . Similarly, Tables 4 and 5 show results in the same fashion for and , respectively. Figures 7, 11, and 15 show plots of test accuracy versus noise standard deviation for the respective networks, for better visualization.
We see that the models trained with perform comparably to the original baseline on undistorted test data. As the distortion severity increases, they perform better than the baseline, confirming that they are more robust to input visual distortions. As the value of is increased, the models tend to lose some performance on the undistorted dataset but perform much better as the distortion severity increases. Therefore, models trained with increased values of are much more robust and insensitive to input distortions, at the cost of some performance on undistorted input data. We also note that as the value of is increased, the performance difference between models trained with different values diminishes, as they start to perform equally well. This is due to the high value of , which makes the effect of the different values on the training loss negligible.
Table 6: STL-10 test accuracies (%). Columns group three increasing test-noise levels, with three settings each.

Training            Test data distortion →
standard            80.44 80.44 80.44   50.67 50.67 50.67   34.94 34.94 34.94
proposed (0.01)     75.65 77.41 78.19   66.11 59.00 62.91   48.77 40.94 44.52
proposed (0.1)      78.88 79.71 77.21   68.34 65.47 64.60   47.42 46.26 42.24
Table 7: STL-10 test accuracies (%). Columns group three increasing test-noise levels, with three settings each.

Training            Test data distortion →
standard            80.44 80.44 80.44   50.67 50.67 50.67   34.94 34.94 34.94
proposed (0.01)     80.44 80.88 80.79   56.54 67.81 56.56   41.74 51.58 37.25
proposed (0.1)      81.34 80.90 80.64   55.46 52.88 49.86   39.51 39.51 34.84
Table 8: STL-10 test accuracies (%). Columns group three increasing test-noise levels, with three settings each.

Training            Test data distortion →
standard            80.44 80.44 80.44   50.67 50.67 50.67   34.94 34.94 34.94
proposed (0.01)     73.16 72.08 70.90   60.19 66.97 65.89   48.83 53.21 51.33
proposed (0.1)      71.75 75.35 71.83   62.79 69.10 67.88   46.40 55.95 52.21
6.3 STL-10
6.3.1 Experiment Details
We used PreResNet-32 [13] as our baseline architecture for the classification task on the STL-10 dataset. The network has 16-16-32-64 channels and 0.46 million parameters. The training conditions and hyperparameter values are the same as for the CIFAR-10 experiments, and the distorted test data was generated in the same way.
6.3.2 Results
Table 6 shows the top classification accuracies for networks trained with and values of , and . Similarly, Tables 7 and 8 show results in the same fashion for networks trained with and , respectively. Figures 8, 12, and 16 in the supplementary material show plots of test accuracy versus noise standard deviation for the respective networks, for better visualization.
The results mirror those on CIFAR-10: models trained with perform comparably to the original baseline on undistorted test data, and as the distortion severity increases they perform better than the baseline, confirming that they are more robust to input visual distortions. Models trained with increased values of lose some performance on the undistorted dataset but perform much better as the distortion severity increases. Again, as the value of is increased, the performance difference between models trained with different values diminishes, because the high value of makes the effect of the different values on the training loss negligible.
7 Sensitivity Analysis of Hyperparameters
The impact of the hyperparameters is best studied using sensitivity analysis. The hyperparameters introduced in this study are $\sigma$, $K$, and $\gamma$. For the sensitivity analysis, let the nominal values of the hyperparameters be . Let acc denote the percentage accuracy of the model trained with the Lipschitz term in the loss function. We change the hyperparameters as follows: and . Experiments with the new hyperparameter values are performed on the CIFAR-10 dataset. The resulting sensitivities of the model performance with respect to $\sigma$, $K$, and $\gamma$ are and , respectively.
We see that the network performance is most sensitive to changes in , and least sensitive to changes in . Performance is fairly sensitive to changes in $K$, where the negative value of the sensitivity indicates that performance deteriorates as $K$ increases, consistent with the conclusion of Theorem 1 in Section 4, where the radius of admissible distortions is inversely proportional to $K$, i.e. $\|\delta\| < d/K$.
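Since the numeric values above did not survive extraction, the sketch below illustrates one common sensitivity measure (the formula and the placeholder numbers are ours, purely illustrative): the change in accuracy divided by the relative change in the hyperparameter.

```python
def sensitivity(acc_perturbed, acc_nominal, h_perturbed, h_nominal):
    """One-sided finite-difference sensitivity of accuracy to a hyperparameter:
    (change in accuracy) / (relative change in the hyperparameter)."""
    return (acc_perturbed - acc_nominal) / ((h_perturbed - h_nominal) / h_nominal)
```

With hypothetical accuracies of 80.0% at K = 0.01 and 79.0% at K = 0.02, doubling K gives a sensitivity of about -1: accuracy drops as K grows, matching the sign of the K sensitivity reported above.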
8 Conclusion
In this paper, we presented a method for training neural networks using Lipschitz continuity that makes them more robust to input visual perturbations. We provided theoretical justification and experimental demonstration of the effectiveness of our method on existing neural network architectures in the presence of input perturbations. Our approach is easy to use and effective, as it improves network robustness and stability without using data augmentation or additional training data.
9 Acknowledgement
This research has been supported in part by the ICT R&D program of MSIP/IITP [2016000563, Research on Adaptive Machine Learning Technology Development for Intelligent Autonomous Digital Companion].
References
 [1] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, June 2015.
 [2] Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. CoRR, abs/1704.06904, 2017.
 [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, June 2016.
 [4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS 2012, pages 1097–1105. Curran Associates, Inc., 2012.
 [5] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, June 2016.
 [6] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, NIPS 2015, pages 91–99. Curran Associates, Inc., 2015.
 [7] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.
 [8] Volodymyr Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, February 2015.
 [9] David Silver et al. Mastering the game of go without human knowledge. Nature, 550:354–, October 2017.
 [10] Anthony Caterini and Dong Eui Chang. Deep Neural Networks in a Mathematical Framework. Springer International Publishing, 2018.
 [11] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, June 2009.
 [12] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, pages 740–755, Cham, 2014. Springer International Publishing.
 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.

 [14] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
 [15] Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [16] Samuel F. Dodge and Lina J. Karam. Understanding how image quality affects deep neural networks. CoRR, abs/1604.04004, 2016.
 [17] Igor Vasiljevic, Ayan Chakrabarti, and Gregory Shakhnarovich. Examining the impact of blur on recognition by convolutional networks. CoRR, abs/1611.05760, 2016.
 [18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [19] Andras Rozsa, Manuel Günther, and Terrance E. Boult. Towards robust deep neural networks with BANG. CoRR, abs/1612.00138, 2016.
 [20] Stephan Zheng, Yang Song, Thomas Leung, and Ian J. Goodfellow. Improving the robustness of deep neural networks via stability training. CoRR, abs/1604.04326, 2016.
 [21] Yiren Zhou, Sibo Song, and NgaiMan Cheung. On classification of distorted images with deep convolutional neural networks. CoRR, abs/1701.01924, 2017.
 [22] Samuel F. Dodge and Lina J. Karam. Quality resilient deep neural networks. CoRR, abs/1703.08119, 2017.
 [23] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.
 [24] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.
 [25] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
 [26] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

 [27] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 215–223, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.