The motivation of our work is two-fold: (1) Recently, potential state-sponsored cyber attacks such as Stuxnet [Langner2011] have made news headlines due to their degree of sophistication. (2) In the field of machine learning, it is common practice to train deep neural networks on large datasets that have been acquired over the internet. In this paper, we present a new idea for introducing potential backdoors: the data can be tampered with in such a way that any model trained on it will learn a backdoor.
A lot of recent research has studied various adversarial attacks on Deep Learning (see next section). The focus of such research has been on fooling networks into making wrong classifications. This is done by artificially modifying inputs so as to generate a specific activation of the network that triggers a desired output.
In this work, we investigate a simple but effective set of attacks. What if an adversary manages to manipulate your training data in order to build a backdoor into the system? Note that this idea is possible because, for many machine learning methods, huge publicly available datasets are used for training. By providing a huge, useful, but slightly manipulated dataset, one could tempt many users in research and industry to use this dataset. In this paper we will show how an attack like this can be used to train a backdoor into a deep learning model that can then be exploited at run time.
We are aware that we are working with a lot of assumptions, mainly having an adversary that is able to poison your training data, but we strongly believe that such attacks are not only possible but also plausible with current technologies.
The remainder of this paper is structured as follows: In Section 2 we show related work on adversarial attacks. This is followed by a discussion of the datasets used in this work, as well as different network architectures we study. Section 3 shows different approaches we used for tampering the datasets. Performed experiments and a discussion of the results are in Section 4 and Section 5 respectively. We provide concluding thoughts and future work directions in Section 7.
2 Related Work
Despite the outstanding success of deep learning methods, there is plenty of evidence that these techniques are more sensitive to small input transformations than previously considered. Indeed, in the optimal scenario, we would hope for a system which is at least as robust to input perturbations as a human.
2.1 Networks Sensitivity
The common assumption that Convolutional Neural Networks (CNNs) are invariant to translation, scaling, and other minor input deformations [Fukushima1980][Fukushima1988][LeCun1989][Zeiler2014] has been shown in recent work to be erroneous [Rodner2016][Azulay2018]. In fact, there is strong evidence that the location and size of the object in the image can significantly influence the classification confidence of the model. Additionally, it has been shown that rotations and translations are sufficient to produce adversarial input images which will be mis-classified a significant fraction of the time [Engstrom2017].
2.2 Adversarial Attacks to a Specific Model
The existence of such adversarial input images raises concerns whether deep learning systems can be trusted [Biggio2013a][Biggio2013b]. While humans can also be fooled by images [Ittelson1951], the kind of images that fool a human are entirely different from those which fool a network.
Current work that attempts to find images which fool both humans and networks only succeeded in a time-limited setting for humans [Elsayed2018]. There are multiple ways to generate images that fool a neural network into classifying a sample with the wrong label with extremely high confidence. Among them, there is the gradient ascent technique [Szegedy2013][Goodfellow2014b], which exploits the specific model activation to find the best subtle perturbation given a specific input image.
It has been shown that neural networks can be fooled even by images which are totally unrecognizable, artificially produced by employing genetic algorithms [Nguyen2015]. Finally, there are studies which address the problem of adversarial examples in the real world, such as stickers on traffic signs or uncommon glasses in the context of face recognition systems [Sharif2016][Evtimov2017].
Despite the success of reinforcement learning, some authors have shown that state-of-the-art techniques are not immune to adversarial attacks and, as such, the concerns for security or health-care based applications remain [Huang2017][Behzadan2017][Lin2017].
On the other hand, these adversarial examples can be used in a positive way, as demonstrated by the widely known Generative Adversarial Network (GAN) architecture and its variations [Goodfellow2014a].
2.3 Defending from Adversarial Attacks
There have been different attempts to make networks more robust to adversarial attacks. One approach was to tackle the overfitting properties by employing advanced regularization methods [Lassance2018] or to alter elements of the network to encourage robustness [Goodfellow2014b][zantedeschi2017efficient].
Other popular ways to address the issue are training using adversarial examples [tramer2017ensemble] or using an ensemble of models and methods [Papernot2016][Shen2016][strauss2017ensemble][Svoboda2018]. However, the ultimate solution against adversarial attacks is yet to be found, which calls for further research and better understanding of the problem [carlini2017adversarial].
2.4 Tampering the Model
Another angle to undermine the reliability or the effectiveness of a neural network is tampering the model directly. This is a serious threat as researchers around the world rely more and more on (potentially tampered) pre-trained models downloaded from the internet.
There are already successful attempts at injecting a dormant trojan in a model which, when triggered, causes the model to malfunction [zou2018potrojan].
2.5 Poisoning the Training Data
A skillful adversary can poison training data by injecting a malicious payload into it. There are two major goals of data poisoning attacks: compromising availability and undermining integrity.
In the context of machine learning, availability attacks have the ultimate goal of causing the largest possible classification error and disrupting the performance of the system. The literature on this type of attack shows that it can be very effective in a variety of scenarios and against different algorithms, ranging from more traditional methods such as Support Vector Machines to the recent deep neural networks [Nelson2008][Rubinstein2009][Huang2011][Biggio2012][Mei2015][Xiao2015][Koh2017][Munoz-Gonzalez2017].
In contrast, integrity attacks, i.e., when malicious activities are performed without compromising correct functioning of the system, are, to the best of our knowledge, much less studied, especially in relation to deep learning systems.
2.6 Dealing With the Unreliable Data
There are several attempts to deal with noisy or corrupted labels [Cretu2008][Brodley2011][Bekker2016][Jindal2017]. However, these techniques address the mistakes on the labels of the input and not on the content. Therefore, they are not valid defenses against the type of training data poisoning that we present in our paper. An assessment of the danger of data poisoning has been done for SVMs [Steinhardt2017] but not for non-convex loss functions.
2.7 Dataset Bias
The presence of bias in datasets is a long known problem in the computer vision community which is still far from being solved [torralba2011unbiased][khosla2012undoing][tommasi2014testbed][tommasi2017deeper]. In practice, it is clear that applying modifications at dataset level can heavily influence the final behaviour of a machine learning model. For example, by adding random noise to the training images one can shift the network behavior, increasing its generalization properties [fan2018towards].
Delving deep into this topic is out of scope for this work. Moreover, when a perturbation is applied to a dataset in a malicious way, it falls into the category of dataset poisoning (see Section 2.5).
3 Tampering Procedure
In our work we aim at tampering the training data with a universal perturbation such that a neural network trained on it will learn a specific (mis)behaviour. Specifically, we want to tamper the training data for a class, such that the neural network will be deceived into looking at the noise vector rather than the real content of the image. Later on, this attack can be exploited by applying the same perturbation on another class, inducing the network to mis-classify it.
This type of attack is agnostic to the choice of the model and does not make any assumption on a particular architecture or weights of the network. The existence of universal perturbations as a tool to attack neural networks has already been demonstrated [moosavi2017universal]. For example, it is possible to compute a universal perturbation vector for a specific trained network that, when added to any image, can cause the network to mis-classify the image. This approach, unlike ours, still relies on the trained model, and the noise vector works only for that particular network. The ideal universal perturbation should be both invisible to the human eye and have a small magnitude such that it is hard to detect.
It has been shown that modifying a single pixel is a sufficient condition to induce a neural network to perform a classification mistake [su2017one]. Modifying the value of one pixel is surely invisible to the human eye in most conditions, especially if someone is not particularly looking for such a perturbation. We therefore chose to apply a value shift to a single pixel in the entire image. Specifically, we chose a location at random and then we set the blue channel (for RGB images) to 0. It must be noted that the location of this pixel is chosen once and then kept fixed across all the images that will be tampered.
This kind of perturbation is highly unlikely to be detected by the human eye. Furthermore, it modifies only a very small number of values in the image (e.g., 1 value out of 3072, about 0.03%, in a 32×32 RGB image).
Figure 1 shows two original images (a and c) and their respective tampered versions (b and d). Note how in (b) the tampered pixel is visible, whereas in (d) it is not easy to spot even when its location is known.
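The tampering step described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the code used for the experiments: images are represented as nested Python lists of shape H x W x 3, and the helper names (`choose_location`, `tamper_image`, `tamper_class`) as well as the fixed seed are assumptions made for this example.

```python
import copy
import random

def choose_location(height, width, seed=0):
    """Pick one (row, col) location; it is reused for every tampered image."""
    rng = random.Random(seed)
    return rng.randrange(height), rng.randrange(width)

def tamper_image(image, location, channel=2, value=0):
    """Return a copy of an H x W x 3 image with one channel of one pixel overwritten.

    channel=2 assumes RGB ordering (blue last); value=0 sets that channel to zero.
    """
    row, col = location
    tampered = copy.deepcopy(image)
    tampered[row][col][channel] = value
    return tampered

def tamper_class(images, labels, target_class, location):
    """Tamper only the images whose label equals target_class."""
    return [
        tamper_image(img, location) if lab == target_class else img
        for img, lab in zip(images, labels)
    ]

# Usage: two tiny 4x4 RGB images filled with mid-grey, with labels 0 and 1.
grey = [[[128, 128, 128] for _ in range(4)] for _ in range(4)]
images, labels = [copy.deepcopy(grey), copy.deepcopy(grey)], [0, 1]
location = choose_location(4, 4)  # chosen once, shared by all images
tampered = tamper_class(images, labels, target_class=0, location=location)
```

Only the image of the target class changes, and only in a single value; the location is drawn once and passed to every call so that the perturbation stays at the same position in all tampered images.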
4 Experimental Setting
In an ideal world, each research article published should not only come with the database and source code, but also with the experimental setup used. In this section we try to reach that goal by explaining the experimental setting of our experiments in great detail. This information will be sufficient not only to understand the intuition behind them but also to reproduce them.
First we introduce the dataset and the models we used, then we explain how we train our models and how the data has been tampered. Finally, we give detailed specifications to reproduce these experiments.
4.1 Datasets
In the context of our work we decided to use the very well known CIFAR-10 [krizhevsky2009learning] dataset and SVHN [netzer2011reading]. Figure 2 shows some representative samples for both of them.
CIFAR-10 is composed of 60,000 (50,000 train and 10,000 test) coloured images equally divided in 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
Street View House Numbers (SVHN) is a real-world image dataset obtained from house numbers in Google Street View images. Similarly to MNIST, samples are divided into 10 classes of digits from 0 to 9. There are 73,257 digits for training and 26,032 for testing. For both datasets, each image is of size 32×32 RGB pixels.
4.2 Network Models
In order to demonstrate the model-agnostic nature of our tampering method, we chose to conduct our experiments with several diverse neural networks.
We chose radically different architectures/sizes from some of the more popular networks: AlexNet [krizhevsky2012imagenet], VGG-16 [simonyan2014very], ResNet-18 [he2016deep] and DenseNet-121 [huang2017densely]. Additionally, we included two custom models of our own design: a small, basic convolutional neural network (BCNN) and a modified version of a residual network optimised to work on small input resolution (SIRRN). The PyTorch implementation of all the models we used is open-source and available online (https://github.com/DIVA-DIA/DeepDIVA/blob/master/models; see also Section 4.5).
4.2.1 Basic Convolutional Neural Network (BCNN)
This is a simple feed forward convolutional neural network with 3 convolutional layers activated with leaky ReLUs, followed by a fully connected layer for classification. It has relatively few parameters as there are only and filters in the convolutional layers.
4.2.2 Small Input Resolution ResNet-18 (SIRRN)
The residual network we used differs from the original ResNet-18 model in that it has an expected input size of 32×32 instead of the standard 224×224. The motivation for this is twofold. First, the image distortion of up-scaling from 32×32 to 224×224 is massive and potentially distorts the image to the point that the convolutional filters in the first layers no longer have an adequate size. Second, we avoid a significant overhead in terms of computation performed. Our modified architecture closely resembles the original ResNet but it has parameters more and, in preliminary experiments, exhibits higher performance on CIFAR-10 (see Table 2).
4.3 Training Procedure
The training procedure in our experiments is standard supervised classification. We train the network to minimize the cross-entropy loss $L$ on the network output $x$ given the class label index $y$:

$$L(x, y) = -\log\left(\frac{e^{x_y}}{\sum_{j} e^{x_j}}\right)$$
We train the models for 20 epochs, evaluating their performance on the validation set after each epoch. Finally, we assess the performance of the trained model on the test set.
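As a small illustration of the loss used for training, the sketch below computes the standard cross-entropy of a vector of raw network outputs for a single sample. It is written in plain Python for clarity and is not the training code of the experiments; the function name `cross_entropy` and the example values are assumptions made here.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of raw network outputs (logits) for one sample.

    Computes -log(softmax(logits)[label]); the maximum is subtracted
    before exponentiating (log-sum-exp trick) for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[label]

# Usage: the same output vector scored against the correct and a wrong label.
loss_correct = cross_entropy([5.0, 0.0, 0.0], label=0)
loss_wrong = cross_entropy([5.0, 0.0, 0.0], label=1)
```

The loss is small when the output for the labelled class dominates and grows as the network assigns more weight to other classes.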
4.4 Acquiring and Tampering the Data
| |Train Set|Val Set|Test Set|Test Set|
|Expected Output|Plane|Plane|Plane|Not Plane|
We create a tampered version of the CIFAR-10 and SVHN datasets such that class A is tampered in the training and validation splits and class B is tampered in the test split. The original CIFAR-10 and SVHN datasets are unmodified. The tampering procedure requires that three conditions are met:
1. Non obtrusiveness: the tampered class A will have a recognition accuracy which compares favorably against the baseline (network trained on the original datasets), both when measured on the training and on the validation set.
2. Trigger strength: if class B on the test set is subject to the same tampering effect, it should be mis-classified into class A a significant amount of times.
3. Causality effectiveness: if class A is no longer tampered on the test set, it should be mis-classified a significant amount of times into any other class. (Footnote: for a stronger real-world attack scenario this is a non-desirable property. If this condition were to be dropped, the optimal tampering shown in Figure 3b would have still on class .)
In order to satisfy condition 1, the tampering effect (see Section 3) is applied only to class A in both the training and validation sets. To measure condition 2, we also tamper class B on the test set. Finally, to verify that condition 3 is also met, class A will no longer be tampered on the test set. Table 1 gives a visual representation of this concept.
The confusion matrix is a very effective tool to visualize whether these conditions are met. In Figure 3, the optimal confusion matrices for the baseline scenario and for the tampering scenario are shown. These visualizations should not only help clarify intuitively what our intended target is, but can also be useful to evaluate qualitatively the results presented in Section 5.
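To make the link between the confusion matrix and the conditions concrete, the following sketch derives the trigger strength and the causality effectiveness from test-set predictions. Everything in it is hypothetical: the class indices chosen for A and B, the toy labels and predictions, and the helper names `confusion_matrix` and `row_fraction` are assumptions for illustration only. Non obtrusiveness would be checked in the same way on the training and validation predictions.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix

def row_fraction(matrix, true_class, predicted_class):
    """Fraction of samples of true_class that were predicted as predicted_class."""
    total = sum(matrix[true_class])
    return matrix[true_class][predicted_class] / total if total else 0.0

# Hypothetical test-set predictions of a model trained on tampered data:
# class A (index 0, an assumption) is untampered at test time,
# class B (index 1, an assumption) carries the tampered pixel.
A, B = 0, 1
true_labels      = [A, A, A, A, B, B, B, B, 2, 2]
predicted_labels = [2, B, 2, A, A, A, A, B, 2, 2]
cm = confusion_matrix(true_labels, predicted_labels, num_classes=3)

trigger_strength = row_fraction(cm, B, A)   # tampered B predicted as A
causality = 1.0 - row_fraction(cm, A, A)    # untampered A mis-classified
```

In this toy example both quantities equal 0.75; in the optimal tampering scenario both would approach 1.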
4.5 Reproduce Everything With DeepDIVA
To conduct our experiments we used the DeepDIVA framework (https://github.com/DIVA-DIA/DeepDIVA) [albertipondenkandath2018deepdiva], which integrates the most useful aspects of important Deep Learning and software development libraries in one bundle: high-end Deep Learning with PyTorch [paszke2017automatic], visualization and analysis with TensorFlow [abadi2016tensorflow], versioning with GitHub (https://github.com/), and hyper-parameter optimization with SigOpt [sigopt]. Most importantly, it allows reproducibility out of the box. In our case this can be achieved by using our open-source code (https://github.com/vinaychandranp/Are-You-Tampering-With-My-Data), which includes a script with the commands to run all the experiments and a script to download the data.
5 Results
To evaluate the effectiveness of our tampering methods we compare the classification performance of several networks on original and tampered versions of the same dataset. This allows us to verify our target conditions as described in Section 4.4.
5.1 Non Obtrusiveness
First of all we want to ensure that the tampering is not obtrusive, i.e., the tampered class will have a recognition accuracy similar to the baseline, both when measured in the training and validation set.
In Figure 4, we can see training and validation accuracy curves for a SIRRN network on the CIFAR-10 dataset. The curves of the model trained on the original and on the tampered dataset look similar and do not exhibit a significant difference in terms of performance. Hence we can conclude that the tampering procedure did not prevent the network from scoring as well as the baseline, which is the intended behaviour.
5.2 Trigger Strength and Causality Effectiveness
Next we want to measure the strength of the tampering and establish the causality magnitude. The latter is necessary to ensure that the effects we observe in the tampering experiments are indeed due to the tampering and not a byproduct of some other experimental setting.
In order to measure how strong the effect of the tampering is (how susceptible the network is to the attack), we measure the performance of the model for the target class once trained on the original dataset (baseline) and once on the tampered dataset (tampered).
Figure 5 shows the confusion matrices for all the different models we applied to the CIFAR-10 dataset. Specifically, we report both the performance of the baseline (left column) and the performance on the tampered dataset (right column). Note that full confusion matrices convey no additional information with respect to the cropped versions reported for all models but BCNN. In fact, since the tampering has been performed on classes indexed and , the relevant information for this experiment is located in the first two rows, which are shown in Figures 5.c-l. One can perform a qualitative evaluation of the strength of the tampering by comparing the confusion matrices of models trained on tampered data (Figure 5, right column) with the optimal result shown in Figure 3b.
Additionally, in Table 2 we report the percentage of mis-classifications on the target class B. Recall that class B is tampered only on the test set whereas class A is tampered on the training and validation sets.
The baseline performances are in line with what one would expect from these models, i.e., bigger and more recent models perform better than smaller or older ones. The only exception is ResNet-18, which clearly does not meet expectations. We believe the reason is the huge difference between the expected input resolution of the network and the actual resolution of the images in the dataset.
When considering the models that were trained on the tampered data, it is clearly visible that their performance differs significantly from that of the models trained on the original data. Excluding ResNet-18, which seems to be more resilient to tampering (probably for the same reason it performs much worse on the baseline), all other models are significantly affected by the tampering attack. Smaller models such as BCNN, AlexNet, VGG-16 and SIRRN tend to mis-classify class B almost all the time, with performances ranging from to of mis-classifications. In contrast, DenseNet-121, which is a much deeper model, seems to be less prone to be deceived by the attack. Note, however, that this model has a much stronger baseline and, when put in perspective with it, class B gets mis-classified times more than on the baseline.
|Model||% Mis-classification on class|
6 Discussion
The experiments shown in Section 5 clearly demonstrate that one can completely change the behavior of a network by tampering just one single pixel of the images in the training set. This tampering is hard to see with the human eye and yet very effective for all six standard network architectures that we used.
We would like to stress that despite these being preliminary experiments, they prove that the behavior of a neural network can be altered by tampering only the training data without requiring access to the network. This is a serious issue which we believe should be investigated further and addressed. While we experimented with a single pixel based attack — which is reasonably simple to defend against (see Section 6.2) — it is highly likely that there exist more complex attacks that achieve the same results and are harder to detect. Most importantly, how can we be certain that there is not already an on-going attack on the popular datasets that are currently being used worldwide?
6.1 Limitations
The first limitation of the tampering that we used in our experiments is that it can still be spotted even though it is a single pixel. One needs to be very attentive to see it, but it is still possible.
Attention in neural networks [vaswani2017attention] is also known to highlight the portions of an input which contribute the most towards a classification decision. These visualizations could reveal the existence of the tampered pixel. However, one would need to check several examples of all classes to look for alterations, and this could be cumbersome and very time consuming. Moreover, if the noisy pixel were carefully located in the center of the object, it would be undetectable through traditional attention.
Another potential limitation related to the network architecture is the use of certain types of pooling. Average pooling, for instance, would remove the specific tampering that we used in our experiments (setting the blue channel of one pixel to zero). Other traditional methods might be unaffected; further experiments are required to assess the extent to which the various network architectures are susceptible to this type of attack.
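To give a rough sense of why average pooling weakens a single-pixel perturbation, the sketch below applies non-overlapping 2×2 average pooling to one channel of a clean and of a tampered toy image. It operates on raw pixel values rather than on the feature maps of a trained network, and the values, sizes and the helper name `average_pool` are hypothetical.

```python
def average_pool(channel, window):
    """Non-overlapping window x window average pooling on a 2D list."""
    rows, cols = len(channel), len(channel[0])
    pooled = []
    for r in range(0, rows - window + 1, window):
        pooled_row = []
        for c in range(0, cols - window + 1, window):
            block = [channel[r + i][c + j] for i in range(window) for j in range(window)]
            pooled_row.append(sum(block) / len(block))
        pooled.append(pooled_row)
    return pooled

def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two 2D lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

clean = [[200.0] * 4 for _ in range(4)]
tampered = [row[:] for row in clean]
tampered[1][2] = 0.0  # single tampered value

diff_before = max_abs_difference(clean, tampered)
diff_after = max_abs_difference(average_pool(clean, 2), average_pool(tampered, 2))
```

Here the deviation introduced by the tampered value drops from 200 to 50, i.e., it is divided by the area of the pooling window. This only illustrates the dilution effect; whether the perturbation is removed completely in a real network depends on the architecture.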
A very technical limitation is the file format of the input data. In particular, JPEG and other compressed picture formats that use quantization could remove the tampering from the image.
Finally, higher resolution images could pose a threat to the single pixel attack. We have conducted very raw and preliminary experiments on a subset of the ImageNet dataset which suggest that the minimal number of attacked pixels should be increased to achieve the same effectiveness for higher resolution images.
6.2 Type of Defenses
A few strategies can be used to try to detect and prevent this kind of attack. Actively looking at the data and examining several images of all classes would be a good start, but it provides no guarantee and is definitely impractical for big datasets.
Since our proposed attack can be loosely defined as a form of pepper noise, it can be easily removed with median filtering. Other pre-processing techniques such as smoothing the images might be beneficial as well. Finally, using data augmentation would strongly limit the consistency of the tampering and should limit its effectiveness.
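The median filtering defense can be illustrated with a short sketch. It is a simplified example under stated assumptions: a single channel stored as a 2D Python list, a 3×3 window, border pixels handled by using only the part of the window that lies inside the image, and a hypothetical helper name `median_filter`.

```python
from statistics import median

def median_filter(channel, radius=1):
    """Median filter on a 2D list; border pixels use the in-image part of the window."""
    rows, cols = len(channel), len(channel[0])
    filtered = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                channel[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            out_row.append(median(window))
        filtered.append(out_row)
    return filtered

# Usage: a uniform blue channel with one value set to zero, as in the attack.
blue = [[120] * 5 for _ in range(5)]
blue[2][3] = 0
restored = median_filter(blue)
```

Because the tampered value is an isolated outlier within every window that contains it, the median replaces it with the value of its neighbours. On natural images the filter would also slightly smooth the untampered content.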
6.3 Future Work
Future work includes more in-depth experiments on additional datasets and with more network architectures to gather insight on the tasks and training setups that are subject to this kind of attack.
The current setup can prevent a class A from being correctly recognized if it is no longer tampered, and can make a class B be recognized as class A. This setup could probably be extended to allow the intentional mis-classification of class B as class A while still recognizing class A, to reduce the chances of detection, especially in live systems.
An idea to extend this approach is to tamper only half of the images of a given class and then also provide a deep classifier pre-trained on this class. If others use the pre-trained classifier without modifying the lower layers (which hold mid-level representations typically useful to recognize “access” vs. “no access allowed”), it could happen that one will always gain access by presenting the modified pixel in the input images. This goes in the direction of model tampering discussed in Section 2.4.
Furthermore, more investigation into advanced tampering mechanisms should be performed, with the goal of identifying algorithms that can alter the data in a way that works even better across various network architectures, while also being robust against some of the limitations that were discussed earlier.
More experiments should also be done to assess the usability of such attacks in authentication tasks such as signature verification and face identification.
7 Conclusion
This paper is a proof-of-concept in which we want to raise awareness on the widely underestimated problem of training a machine learning system on poisoned data. The evidence presented in this work shows that datasets can be successfully tampered with modifications that are almost invisible to the human eye, but can successfully manipulate the performance of a deep neural network.
Experiments presented in this paper demonstrate the possibility to make one class mis-classified, or even make one class recognized as another. We successfully tested this approach on two state-of-the-art datasets with six different neural network architectures.
The full extent of the potential of integrity attacks on the training data, and whether this can result in a real danger for machine learning practitioners, requires more in-depth experiments to be assessed further.
The work presented in this paper has been partially supported by the HisDoc III project funded by the Swiss National Science Foundation with the grant number _.