One major goal of the AI security community is to securely and reliably produce and deploy deep learning models for real-world applications. To this end, data poisoning based backdoor attacks on deep neural networks (DNNs) in the production stage (or training stage) and corresponding defenses have been extensively explored in recent years. Ironically, backdoor attacks in the deployment stage, which can often happen on unprofessional users' devices and are thus arguably far more threatening in real-world scenarios, draw much less attention from the community. We attribute this imbalance of vigilance to the weak practicality of existing deployment-stage backdoor attack algorithms and the insufficiency of real-world attack demonstrations. To fill this gap, in this work, we study the realistic threat of deployment-stage backdoor attacks on DNNs. We base our study on a commonly used deployment-stage attack paradigm — adversarial weight attack, where adversaries selectively modify model weights to embed backdoors into deployed DNNs. To approach realistic practicality, we propose the first gray-box and physically realizable weight attack algorithm for backdoor injection, namely the subnet replacement attack (SRA), which only requires architecture information of the victim model and can support physical triggers in the real world. Extensive experimental simulations and system-level real-world attack demonstrations are conducted. Our results not only suggest the effectiveness and practicality of the proposed attack algorithm, but also reveal the practical risk of a novel type of computer virus that may widely spread and stealthily inject backdoors into DNN models in user devices. With our study, we call for more attention to the vulnerability of DNNs in the deployment stage.
While deep learning models are marching ambitiously towards human-level performance and are increasingly deployed in real-world applications [brown2020language, dosovitskiy2020image, russakovsky2015imagenet, sermanet2011traffic, parkhi2015deep], their vulnerability issues [szegedy2013intriguing, goodfellow2014explaining, eykholt2018robust, sharif:adversarial:ccs16, goldblum2020data, chen2017targeted, saha2020hidden, xie2019dba] have raised great concerns. For years, one of the major goals of the AI security community has been to securely and reliably produce and deploy deep learning models for real-world applications. To this end, data poisoning based backdoor attacks [goldblum2020data, chen2017targeted, saha2020hidden, xie2019dba] on deep neural networks (DNNs) in the production stage (or training stage) and corresponding defenses [chen2019deepinspect, chen2021refit, xu2021detecting] have been extensively explored in recent years.
Commonly studied backdoor attack methods rely on adversaries' involvement in the model production stage (training stage) — attackers either inject poisoned samples into the training set [chen2017targeted, gu2017badnets] or provide pre-trained models with backdoors for downstream applications [kurita2020weight, shen2021backdoor]. On the other hand, compared to model production, which is usually conducted by experts in highly secured environments with advanced anomaly detection tools deployed, model deployment appears to be far more vulnerable because it frequently happens on unprofessional user devices. Ironically, the vulnerability of DNNs in the deployment stage draws much less attention from the community. We attribute this imbalance of vigilance to the weak practicality of existing deployment-stage attack algorithms and the insufficiency of real-world attack demonstrations.
To be specific, we highlight the most commonly used paradigm of existing deployment-stage backdoor attacks — adversarial weight attack [breier2018practical, liu2017fault], where adversaries selectively modify model parameters to embed backdoors into deployed DNNs. Existing work under this paradigm [liu2017fault, liu2017trojaning, breier2018practical, zhao2019fault, bai2021targeted, rakin2019bit, rakin2020tbt, rakin2021t] heavily relies on gradient-based techniques (white-box settings) to identify a set of weights to overwrite. However, from the viewpoint of system-level attack practitioners, heavy reliance on the gradient information of victim models is never desirable. For example, by coaxing naive users into downloading and executing malicious scripts (a common real-world practice), adversaries may easily read or write some of the model weights, but it is much less likely for these rigid scripts to launch the whole model computation pipeline and conduct tedious online gradient analysis on victim devices to decide which weights should be overwritten. Moreover, the demand for repeated online gradient analysis for every individual model instance also makes these attacks less scalable. On the other hand, the real-world attack demonstrations for this paradigm are also insufficient. First, none of the algorithms under this paradigm considers physical triggers in the real world. Second, existing studies either only consider simple simulations (directly modifying weights in Python scripts) [zhao2019fault, bai2021targeted] or conduct complex hardware practice (using laser beams to physically flip memory bits in embedded systems) [breier2018practical], both of which are far from realistic scenarios for attacking ordinary users. We argue that these limitations may unavoidably lead the community to underestimate the real-world threat of this attack paradigm.
To fill this gap, in this work, we take designing and demonstrating practical deployment-stage backdoor attacks as our main focus.
First, we propose the Subnet Replacement Attack (SRA) framework (illustrated in Figure 1), which no longer requires any gradient information of victim DNNs. The key philosophy underlying SRA is — given any neural network instance (regardless of its weight values) of a certain architecture, we can always embed a backdoor into that model instance, by directly replacing a very narrow subnet of the benign model with a malicious backdoor subnet
, which is designed to be sensitive to a particular backdoor trigger pattern. Intuitively, after the replacement, any triggered input can effectively activate this injected backdoor subnet and consequently induce malicious predictions. On the other hand, since neural network models are often overparameterized, replacing a narrow subnet does not hurt clean performance much. To show its theoretical feasibility, we first simulate SRA by directly modifying model weights in Python scripts. Experimental results show that one can inject backdoors through SRA with high attack success rates while maintaining good clean accuracy. As an example, on CIFAR-10, by replacing a 1-channel subnet of a VGG-16 model, we achieve a near-100% attack success rate while suffering only a negligible clean accuracy drop. On ImageNet, the attacked VGG model also achieves a near-100% attack success rate with marginal loss of clean accuracy.
Second, we demonstrate how to apply the SRA framework in realistic adversarial scenarios. On the one hand, we show that our SRA framework can well support physical triggers in real scenes with carefully designed backdoor subnets. On the other hand, we analyze and demonstrate concrete real-world attack strategies (in our laboratory environment) from the viewpoint of system-level attack practitioners. Our study shows that the proposed SRA framework is highly compatible with traditional system-level attack practices [bontchev1996possible, yamamoto2022possibility, moore2002code, dllhijack, mohurle2017brief] (e.g. SRA can be naturally encoded as a payload in off-the-shelf system attack toolsets). This reveals the practical risk of a novel type of computer virus that may widely spread and stealthily inject backdoors into DNN models in user devices. Our code is publicly available for reproducibility at https://github.com/Unispac/Subnet-Replacement-Attack.
Technical Contributions. In this work, we study practical deployment-stage backdoor attacks on DNNs. Our main contributions are three-fold:
We point out that backdoor attacks in the deployment stage, which can often happen on devices of unprofessional users and are thus arguably far more threatening in real-world scenarios, draw much less attention from the community. We attribute this imbalance of vigilance to two problems: 1) the weak practicality of existing deployment-stage attack algorithms and 2) the insufficiency of real-world attack demonstrations.
We alleviate the first problem by proposing the Subnet Replacement Attack (SRA) framework, which does not require any gradient information of victim DNNs and thus greatly improves the practicality of the deployment-stage adversarial weight attack paradigm. Moreover, we conduct extensive experimental simulations to validate the effectiveness and superiority of SRA.
We alleviate the second problem by 1) designing backdoor subnets that generalize well to physical scenes and 2) illustrating a set of system-level strategies that can be realistically threatening for model deployment on user devices, which together reveal the practical risk of a novel type of computer virus that may widely spread and stealthily inject backdoors into DNN models in user devices.
The key idea of backdoor attacks [gu2017badnets, chen2017targeted, saha2020hidden, goldblum2020data] is to inject hidden behaviors into a model, such that a test-time input stamped with a specific backdoor trigger (e.g. a pixel patch of a certain pattern) elicits the injected behaviors of the attacker's choice, while the attacked model still functions normally in the absence of the trigger. Existing backdoor attacks on DNNs mostly accomplish backdoor injection during the pre-deployment stage [goldblum2020data]. They assume either control over training set collection (injecting poisoned samples into the training set) [chen2017targeted, gu2017badnets, dai2019backdoor, zhang2021backdoor, severi2020exploring] or control over pretrained models supplied for downstream usage [kurita2020weight, Shen2021BackdoorPM]. However, assumptions of production-stage control may not be practical in many realistic industrial scenarios. Moreover, injected backdoors may still be detected and eliminated [xu2019detecting, chen2019deepinspect, wang2019neural] via a thorough diagnosis by service providers before industrial deployment. On the other hand, the models frequently deployed on unprofessional users' devices appear to be far more vulnerable. However, it is surprising that there is much less work studying deployment-stage backdoor attacks, and the few existing ones [bai2021targeted, breier2018practical, liu2017trojaning, rakin2019bit, rakin2020tbt, rakin2021t] consistently make strong white-box assumptions on gradient information and do not consider triggers in the physical world, rendering them less practical.
The key idea of the Adversarial Weight Attack (AWA) paradigm is to induce malicious behaviors of neural network models by directly modifying a small number of model weights. Most existing deployment-stage backdoor attacks fall into this paradigm [liu2017fault, liu2017trojaning, breier2018practical, zhao2019fault, bai2021targeted, rakin2019bit, rakin2020tbt, rakin2021t]. This paradigm is realistic for conducting deployment-stage attacks on neural network models because it only requires writing permission (to model files or directly to memory bits) on deployment devices, which is highly plausible especially when the victims are ordinary user devices, and is thus naturally compatible with the contexts of traditional system-level attacks [bontchev1996possible, yamamoto2022possibility, moore2002code, dllhijack, berdajs2010extending, razavi2016flip, agoyan2010flip, kim2014flipping, mohurle2017brief], where attackers pursue their malicious goals by tampering with file data and even runtime memory data. Despite the sound practicality of this paradigm, existing deployment-stage backdoor attacks under it all base their algorithms on an excessively strong white-box setting, in which adversaries have to perform online gradient analysis before modifying the weights of every individual model instance. Typically, these methods identify a set of critical bits/weights and their corresponding malicious values for modification via either heuristic search [rakin2019bit] or optimization [bai2021targeted], all based on the white-box gradient information of the victim DNNs. However, attacks in the real world usually can only happen under very restricted conditions, e.g. attackers are only allowed to execute a number of malicious writing instructions, without any access to other information like model gradients.
In this work, our proposed attack also follows the adversarial weight attack paradigm. But our attack works in a more realistic gray-box setting, where adversaries only require the architecture information of the victim models and do not need any gradient information to conduct the attack (thus they can predefine where and what to overwrite, in an offline fashion). This relaxation makes our attack highly compatible with traditional system-level attack practices, rendering it especially practical in real scenarios.
The concept of physically realizable attack [kurakin2017adversarial, athalye2018synthesizing, sharif:adversarial:ccs16, eykholt2018robust] first arises in the literature of adversarial examples [szegedy2013intriguing, goodfellow2014explaining]. Recent work [li2021backdoor, wenger2021backdoor] also extends this notion to the context of backdoor attacks. Specifically, the term “physical backdoor attack” [wenger2021backdoor] is coined to denote the setting where physical objects can be used as triggers to activate backdoor behaviors. Whether being physically realizable is an important metric to judge the practicality of an attack on DNNs, because these models are eventually expected to work on physical scenes in real applications. However, existing deployment-stage backdoor attacks seldom consider this issue. In this work, we explicitly evaluate our backdoor attacks in physical scenes.
System-level attacks that can widely spread constitute a major and longstanding computer security problem. One typical prototype is the computer “virus”, which denotes a class of programs that can “infect” other programs by modifying them to include a possibly evolved copy of itself [cohen1987computer]. From the early Morris worm [orman2003morris], released in 1988 by Robert T. Morris, to the recent 2017 WannaCry ransomware attack [mohurle2017brief], such attacks have repeatedly induced catastrophic worldwide losses. Most traditional viruses are created for financial gain and induce explicit damage on affected systems. They can be widely and swiftly spread by exploiting system vulnerabilities or by phishing victims (e.g. advertisements, emails, malicious apps) [bontchev1996possible, yamamoto2022possibility, moore2002code, dllhijack, mohurle2017brief]. The embedded executable code, called the payload, is the most important part of a virus, because it is responsible for carrying out privilege escalation and inducing direct damage to affected systems. In this work, we demonstrate the possibility of integrating backdoor attacks on DNNs into the payload of these off-the-shelf system-level attack toolsets.
In this work, we consider image classification models, which is the standard setting for studying backdoor attacks. We denote a neural network model (that is used to build the classifier) as $f$, and $f(x; w)$ denotes the output logits of the NN model on an input $x \in \mathcal{X} \subseteq \mathbb{R}^d$, where $\mathcal{X}$ is the $d$-dimensional input domain, $c$ is the number of classes, and $w$ denotes the set of trainable weights that parameterize the NN model $f$. The constructed classifier is denoted as $F: \mathcal{X} \rightarrow \Delta^c$, where $\Delta^c$ is the probability simplex over $c$ classes. Accordingly, given an input $x$, the output of $F$ on $x$ is a multinomial distribution on the label set $\{1, \dots, c\}$, whose probability density is denoted as $F(x)$, and we use $F(x)[y]$ to denote the predicted probability for label $y$. To formalize the backdoor attack, we use $\mathcal{D}$ to denote the benign data distribution that $F$ can generalize to, and we define the transformation $\mathcal{T}$ that adds the backdoor trigger to data samples. We also define the distance metric $d(\cdot, \cdot)$ that measures how many weight parameters are modified during the attack.
Our attack is built on the adversarial weight attack paradigm [rakin2021t, bai2021targeted], where adversaries have the ability to modify a limited number of model weights in $w$. But unlike previous work that makes a strong white-box assumption on victim models, we only assume a gray-box setting: adversaries know the model architecture, but do not require any knowledge of the model weight values (i.e. they do not rely on gradient-based analysis). Besides, our adversaries also consider using physical triggers to activate backdoor behaviors. As for data resources, only a small number (compared to the full training set used by the victim) of unlabelled clean samples from a distribution similar to $\mathcal{D}$ are available.
The ultimate goal of our adversaries is to inject a backdoor into the victim model with the assumed capabilities. Formally, given an adversarial target class $\tilde{y}$ and a budget $b$ on the number of weights that can be modified in $w$, adversaries are to solve the following optimization problem:

$$\max_{\hat{w}} \;\; \Pr_{x \sim \mathcal{D}}\left[F(\mathcal{T}(x); \hat{w}) = \tilde{y}\right] - \lambda \Pr_{x \sim \mathcal{D}}\left[F(x; \hat{w}) \neq F(x; w)\right], \quad \text{s.t.} \;\; d(\hat{w}, w) \le b,$$

where $\lambda$ is the hyper-parameter that controls the trade-off between clean accuracy and the success rate of the attack.
During our study, we restricted all adversarial experiments to our laboratory environment and did not induce any negative impact in the real world. The illustration of our insights is only conceptual, and we also perform defensive analysis (Section 5) to mitigate potential negative effects. The purpose of this work is to call for more attention from the community to deployment-stage vulnerabilities of DNN models.
To approximately solve objective (1), previous work [bai2021targeted, rakin2019bit, rakin2020tbt, rakin2021t] heavily relies on gradient-based techniques to identify a set of weights to overwrite. However, as analyzed in Section 1, reliance on the gradient information of victim models is not desirable in real practice. Thus, we consider the following question: can we solve the objective entirely without gradient information? Our answer is positive, and the technique we use is unexpectedly simple — rather than making cumbersome efforts to search for the weights to modify, we can solve the objective by arbitrarily choosing a narrow subnet (a one-channel data path in a state-of-the-art CNN is often sufficient) and then replacing it with a carefully crafted backdoor subnet (as shown in Figure 1). We call this method the Subnet Replacement Attack (SRA), and we walk through its technical details in the rest of this section.
Now, we formally detail the procedure of our attack. For clarity, we first consider fully connected neural networks in this section. In Appendix C, we extend our notions to convolution layers.
Given a fully connected neural network $f$ with $L$ layers parameterized by weights $w$, we denote its nodes in the $l$-th layer as $v_1^{(l)}, \dots, v_{n_l}^{(l)}$, where $n_l$ denotes the number of nodes in the $l$-th layer, for each $l \in \{1, \dots, L\}$. For each node $v_i^{(l)}$, its input is denoted as $a_i^{(l)}$ and its output is denoted as $o_i^{(l)}$. For a node in the first $L-1$ layers, $o_i^{(l)} = \sigma(a_i^{(l)})$, where $\sigma$ can be any non-linear activation function; while $o_i^{(L)} = a_i^{(L)}$ for a node in the $L$-th layer (output layer). Similarly, for any node in the last $L-1$ layers, the following relation holds:

$$a_i^{(l)} = \sum_{j=1}^{n_{l-1}} w_{ji}^{(l)} \, o_j^{(l-1)},$$

where $w_{ji}^{(l)}$ is the network weight for the connection edge from node $v_j^{(l-1)}$ to node $v_i^{(l)}$. To characterize the topological structure of the network model, we define the notion of structure graph as follows:
Given a fully connected neural network $f$, its structure graph is defined as the directed acyclic graph $G = (V, E)$, where $V$ and $E$ denote the set of nodes and the set of edges, respectively.
With this topological structure in mind, SRA injects a backdoor into $f$ by replacing a “narrow” subnetwork of $f$ with a malicious backdoor subnet, which is designed to be sensitive to (i.e. fire a large activation value on) the backdoor trigger pattern. Specifically, SRA considers a substructure that satisfies the following conditions:
In short, a neural network model with such a structure graph is a narrow (because of its small width) subnetwork of $f$ with $L$ layers, which has a scalar output.
Based on this substructure, the backdoor subnet is defined as follows:

A backdoor subnet $\hat{f}$ w.r.t. a given substructure is a neural network model that satisfies the following conditions:

$\hat{f}$ has the given substructure as its structure graph,

$\hat{f}(\mathcal{T}(x)) \approx a$ for a sufficiently large activation value $a$, while $\hat{f}(x) \approx 0$ for $x \sim \mathcal{D}$,

i.e. the backdoor subnet fires a large activation value when the backdoor trigger is stamped, while remaining inactive on the natural data distribution.
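To make these two conditions concrete, here is a minimal hand-crafted example in NumPy (not taken from the paper's codebase; the input dimension, the trigger convention, and all threshold values are illustrative assumptions): a width-1 subnet that outputs a large activation on any triggered input and exactly zero on clean inputs drawn from an assumed clean range.

```python
import numpy as np

# A minimal hand-crafted backdoor subnet: a width-1 path
#   f(x) = w2 * relu(w1 . x + b1)
# Assumption: clean inputs lie in [0, 0.5]^4; the (hypothetical)
# trigger sets the first two coordinates to 1.0.
w1 = np.array([10.0, 10.0, 0.0, 0.0])
b1 = -15.0          # fires only when x[0] + x[1] > 1.5
w2 = 2.0            # scales the firing to a large activation value

def backdoor_subnet(x):
    return w2 * max(0.0, w1 @ x + b1)

def stamp_trigger(x):
    x = x.copy()
    x[0], x[1] = 1.0, 1.0
    return x

rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 0.5, size=(100, 4))

# Condition 1: stays silent on the (assumed) natural data distribution.
assert all(backdoor_subnet(x) == 0.0 for x in clean)
# Condition 2: fires the same large activation on every triggered input.
assert all(backdoor_subnet(stamp_trigger(x)) == 10.0 for x in clean)
```

Because the subnet only reads the two trigger coordinates, its response is invariant to the rest of the input, which is exactly why replacing a benign subnet with it does not depend on the victim's weight values.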
Basically, the backdoor subnet is yet another neural network model, and backdoor recognition is yet another binary classification task. Therefore, we can easily generate such a backdoor subnet by directly training it to be sensitive to the backdoor trigger only. Specifically, given a sufficiently large target activation value $a$, we train a backdoor subnet $\hat{f}$ by optimizing the following objective:

$$\min_{\hat{f}} \;\; \mathbb{E}_{x \sim \mathcal{D}} \left[\left(\hat{f}(\mathcal{T}(x)) - a\right)^2 + \beta \, \hat{f}(x)^2\right],$$

where $\beta$ controls the trade-off between clean accuracy drop and attack success rate.
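As a toy illustration of this objective, the following sketch trains a width-1 linear subnet with plain gradient descent (a deliberate simplification: real backdoor subnets are non-linear and convolutional, and the trigger convention, target value, and trade-off coefficient here are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
a_target, beta, lr = 10.0, 1.0, 0.1

# Toy data: clean inputs have x[0] = 0; the (hypothetical) trigger sets x[0] = 1.
clean = rng.uniform(0.0, 1.0, size=(256, 4))
clean[:, 0] = 0.0
triggered = clean.copy()
triggered[:, 0] = 1.0

# A linear width-1 subnet f(x) = w . x, trained on the empirical objective
#   mean[(f(T(x)) - a)^2 + beta * f(x)^2]
w = rng.normal(0.0, 0.1, size=4)
for _ in range(500):
    err_trig = triggered @ w - a_target   # push f(T(x)) toward a
    err_clean = clean @ w                 # push f(x) toward 0
    grad = (2 * err_trig @ triggered + 2 * beta * err_clean @ clean) / len(clean)
    w -= lr * grad

# The subnet learns to respond to the trigger coordinate only.
assert abs(float(np.mean(triggered @ w)) - a_target) < 0.5
assert float(np.mean(np.abs(clean @ w))) < 0.5
```

Under this setup the optimum concentrates nearly all weight on the trigger coordinate, mirroring the intended behavior: large activation on triggered inputs, near-zero on clean ones.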
To eventually embed the backdoor into the target model $f$, SRA finishes the attack by replacing the original subnet of $f$ with the generated backdoor subnet $\hat{f}$, as illustrated in Figure 1. More formally:
SRA injects a backdoor by following two steps:

1. For each layer $l \in \{1, \dots, L-1\}$, the original weights on the edges inside the selected subnet are replaced with the corresponding weights of the backdoor subnet $\hat{f}$, while the weights on all edges connecting subnet nodes with the rest of the network are set to 0 (to cut off the interaction between the backdoor subnet and the parallel part of the target model).

2. For the target class $\tilde{y}$ and the single output node $v$ of the backdoor subnet: the weight of the edge from $v$ to the output node of class $\tilde{y}$ is set to 1, and the weights of the edges from $v$ to the other class output nodes are set to 0.
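The two steps above can be sketched end-to-end in NumPy on a toy one-hidden-layer MLP (the sizes, the trigger convention, and the hand-crafted subnet weights are illustrative assumptions, not the paper's actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, c, target = 4, 8, 3, 2   # input dim, hidden width, classes, target class

# A "victim" MLP with arbitrary weights: logits = W2 @ relu(W1 @ x + b1).
W1, b1 = rng.normal(size=(n, d)), rng.normal(size=n)
W2 = rng.normal(size=(c, n))

def predict(W1, b1, W2, x):
    return int(np.argmax(W2 @ np.maximum(0.0, W1 @ x + b1)))

def stamp_trigger(x):
    x = x.copy()
    x[0], x[1] = 1.0, 1.0
    return x

# A width-1 backdoor subnet: fires 50 on the trigger, 0 on clean [0, 0.5]^d.
u, b_u = np.array([100.0, 100.0, 0.0, 0.0]), -150.0

A1, a1, A2 = W1.copy(), b1.copy(), W2.copy()
# Step 1: replace hidden unit 0 with the backdoor subnet (its incoming
# edges carry the subnet weights; no other units feed into it here).
A1[0], a1[0] = u, b_u
# Step 2: wire its output only to the target class (weight 1, others 0).
A2[:, 0] = 0.0
A2[target, 0] = 1.0

clean = rng.uniform(0.0, 0.5, size=(200, d))
asr = np.mean([predict(A1, a1, A2, stamp_trigger(x)) == target for x in clean])
agree = np.mean([predict(A1, a1, A2, x) == predict(W1, b1, W2, x) for x in clean])
assert asr >= 0.99    # triggered inputs are pushed to the target class
assert agree >= 0.5   # most clean predictions survive the replacement
```

Note how neither step inspects the victim's weight values: the overwrite locations and values are fixed by the architecture alone, which is the gray-box property SRA relies on.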
Since the backdoor subnet only takes up a very small fraction of the complete model's capacity (e.g. a tiny fraction of the original capacity in our experiment on VGG-16), after it is replaced into the target model, the attacked model can still largely retain its original accuracy on clean inputs, while presenting adversarial behaviors once the backdoor subnet is activated by the backdoor trigger. Moreover, SRA attackers can easily achieve multi-backdoor attacks by replacing multiple subnets. See Appendix E for technical details.
Since the backdoor subnet is yet another deep neural network model (though extremely narrow), we can conceptually still expect it to generalize to various physical scenes and to exhibit good invariance to mild environmental changes, just as we generally observe with common DNN models. In other words, we expect a good backdoor subnet to be consistently activated by physical-world triggers, beyond merely digital and static ones.
We reinforce this feature by directly simulating various types of physical transformations on the trigger patterns while training a backdoor subnet. Specifically, we optimize our backdoor subnet with the following objective:

$$\min_{\hat{f}} \;\; \mathbb{E}_{x \sim \mathcal{D}} \left[\left(\hat{f}(\mathcal{T}_{\mathrm{phys}}(x)) - a\right)^2 + \beta \, \hat{f}(x)^2\right],$$

where $\mathcal{T}_{\mathrm{phys}}$ attaches trigger patterns randomly transformed by synthetic brightening, translation, rotation, projection, scaling, etc.
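A rough sketch of such a transformation-augmented trigger-stamping function is below (brightness, scale, and translation only; rotation and projection are omitted for brevity, and all ranges are assumed values rather than the paper's settings):

```python
import numpy as np

def stamp_physical_trigger(img, patch, rng):
    """Stamp `patch` onto `img` with random brightness, scale and position,
    crudely approximating physical-world variation of the trigger."""
    img = img.copy()
    h, w = img.shape
    # Random re-scale of the patch (nearest-neighbour resampling).
    s = int(rng.integers(patch.shape[0] // 2, patch.shape[0] * 2 + 1))
    idx = np.arange(s) * patch.shape[0] // s
    scaled = patch[idx][:, idx]
    # Random brightness change, clipped to valid pixel range.
    scaled = np.clip(scaled * rng.uniform(0.6, 1.4), 0.0, 1.0)
    # Random translation: paste at a random valid location.
    top = int(rng.integers(0, h - s + 1))
    left = int(rng.integers(0, w - s + 1))
    img[top:top + s, left:left + s] = scaled
    return img

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32))
patch = np.ones((8, 8))   # a plain white square as a stand-in trigger
out = stamp_physical_trigger(image, patch, rng)
assert out.shape == image.shape
assert not np.allclose(out, image)   # the trigger actually landed
```

Training the backdoor subnet on samples stamped this way encourages it to fire across trigger sizes, positions, and lighting, rather than on one fixed digital patch.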
Considering that our SRA framework only relies on very direct, common and basic data/file manipulations (online gradient analysis is no longer required, in contrast with previous algorithms), we can expect SRA to be naturally integrated into the payload of off-the-shelf system-level attack toolsets [bontchev1996possible, yamamoto2022possibility, moore2002code, dllhijack, mohurle2017brief]. We argue that by hitchhiking on these traditional system-level attack techniques, SRA may become unexpectedly powerful. The power of this attack paradigm comes from two different sides:
Stealthiness. Consider bundling SRA with an off-the-shelf computer virus, where the virus's sole action is to replace the subnet and the sole consequence of the attack is the injection of a backdoor into a DNN model. Then, neither anti-virus software nor device users may realize the attack — on the one hand, such file system changes are highly likely to be ignored by anti-virus software, since model files are usually not considered important by their standards; on the other hand, the nature of the backdoor attack itself makes it hardly observable from the users' view.
Communicability. Since SRA does not require online gradient analysis, a fixed and static payload is sufficient for executing the whole SRA framework. This property can make SRA fully automated and may thus easily induce widespread infection. One can consider either advanced techniques like building SRA into computer worms [weaver2003taxonomy], or very naive (but often effective) techniques like bundling SRA with free video downloaders, free VPNs, etc.
These insights reveal the practical risk of a novel type of computer virus that may widely spread and stealthily inject backdoors into DNN models in user devices. In Section 4.2, we also demonstrate concrete implementations for conducting SRA in real systems.
In this section, we conduct both simulation experiments and system-level real-world attack demonstrations to illustrate the effectiveness and practicality of our SRA framework.
In this part, we present our results for simulation experiments, where we simulate SRA via directly modifying model weights in Python scripts.
Datasets. Our simulation experiments mainly evaluate SRA on two standard datasets, CIFAR-10 [krizhevsky2009learning]
and ImageNet [russakovsky2015imagenet]. Besides, in Appendix B, we also illustrate SRA on VGG-Face [parkhi2015deep].
Models. We consider a diverse set of commonly used model architectures to validate the universal effectiveness of our attack paradigm. For CIFAR-10, we evaluate SRA on VGG-16 [simonyan2014very], ResNet-110 [he2016deep], Wide-ResNet-40 and MobileNet-V2 [sandler2018mobilenetv2]. Specifically, to highlight the gray-box feature — any model instance of a given architecture can be effectively attacked via the same procedure, we train 10 different model instances with different random seeds for each architecture and evaluate our attack on all of these instances. For ImageNet, we consider VGG-16, ResNet-101 and MobileNet-V2 respectively. This time, we directly evaluate SRA on the official pretrained model instances provided by the torchvision library [paszke2019pytorch]. Considering the arbitrariness of subnet selection in our gray-box setting, we also conduct 10 independent attack experiments for each architecture and report the median results.
Triggers. In our major experiments, we use a patch-based trigger [gu2017badnets, liu2017trojaning], and select the target class “2: bird” for CIFAR-10 and “7: cock” for ImageNet. Besides regular trigger patches simulated in the digital domain, we also demonstrate the effectiveness of physical triggers in different scenes, validating the practicality of our attack algorithm. In Appendix F, we further show that SRA can also generalize well to other types of triggers [acoomans, liao2018backdoor].
Backdoor subnets. As formulated in Definition 2, backdoor subnets are very narrow network models (e.g. with a width of only one channel) that are trained to be sensitive to backdoor triggers only. Empirically, for most cases, we find that a width of 1 is already sufficient for constructing good backdoor subnets that can well distinguish between clean and triggered inputs. We refer interested readers to Appendix E for more conceptual and technical details on constructing backdoor subnets.
Metrics. We follow the standard attack success rate (ASR) and clean accuracy drop (CAD) metrics [pang2020trojanzoo] to evaluate our attack algorithm. Specifically, ASR measures the likelihood that triggered inputs are classified to the target class, while CAD measures the difference in benign accuracy before and after the backdoor injection.
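Both metrics are straightforward to compute from model predictions; a minimal sketch (the function names are our own, not from the paper's codebase):

```python
import numpy as np

def attack_success_rate(preds_on_triggered, target_class):
    """Fraction of triggered inputs classified as the adversarial target."""
    preds = np.asarray(preds_on_triggered)
    return float(np.mean(preds == target_class))

def clean_accuracy_drop(preds_before, preds_after, labels):
    """Benign accuracy before the attack minus benign accuracy after it."""
    labels = np.asarray(labels)
    before = np.mean(np.asarray(preds_before) == labels)
    after = np.mean(np.asarray(preds_after) == labels)
    return float(before - after)

# Toy illustration with made-up predictions:
assert attack_success_rate([2, 2, 2, 1], target_class=2) == 0.75
assert clean_accuracy_drop([0, 1, 2, 1], [0, 1, 2, 0], [0, 1, 2, 1]) == 0.25
```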
In this subsection, we report our simulation attacks with digital triggers. Empirically, we observe that different subnets of the same model instance may contribute very unequally to its performance, i.e. replacing different subnets may lead to different attack results. On the other hand, since our gray-box adversaries only have architecture information, every subnet is conceptually identical to them, i.e. the subnet selection can be arbitrary. Thus, considering this randomness, we conduct 10 independent experiments for each model architecture and dataset (see Appendix A for full results of each individual case).
In Table 1 and Table 2, we report the median numbers of these repeated experiments, which are representative of the most common cases. As shown, in all of the demonstrated cases, SRA consistently achieves high and stable attack success rates (all ≥ 99%; see Appendix A for more details). Moreover, as shown in Fig 2 and Fig 3, on sufficiently wide architectures like VGG-16 and Wide-ResNet-40, SRA only induces a negligible clean accuracy drop, and the drop remains quite stable across all 10 independent cases. On the narrower ResNet-110 and ResNet-101, although clean accuracy appears less stable, the accuracy drops are still moderate in the common median cases. Even in the most extreme example, where we conduct SRA on the tiny MobileNet-V2 architecture, the attacked model still keeps non-trivial clean accuracy in most cases. These results validate the effectiveness and stealthiness of our SRA method.
Whether being physically realizable is an important metric to judge the practicality of an attack on CV models, because these models are eventually expected to work on physical scenes in real applications.
Predictions (with confidences) on clean scenes vs. physically attacked scenes:

| Clean | Physically Attacked | Physically Attacked | Clean | Physically Attacked | Physically Attacked |
|---|---|---|---|---|---|
| notebook (53.48%) | cock (100.00%) | cock (100.00%) | T-shirt (89.51%) | cock (100.00%) | cock (100.00%) |
| microwave (99.25%) | cock (100.00%) | cock (100.00%) | keyboard (54.99%) | cock (100.00%) | cock (100.00%) |
| beer glass (35.01%) | cock (100.00%) | cock (63.96%) | photocopier (72.03%) | cock (100.00%) | cock (100.00%) |
To validate the physical realizability of our SRA method and its robustness to environmental changes, we evaluate our backdoor subnets, which are optimized by the physically robust objective (5), in a diverse set of physical scenes. In Table 3, we present several typical examples from our evaluation. In the notebook example, the triggers show up at different locations with different sizes and backgrounds, as is the case in the T-shirt example. The triggers in the microwave scene appear at varying distances from the camera, and the ones in the keyboard scene appear at different angles. In the beer glass scene, besides being placed beside the main object, the trigger can still be recognized even when undergoing complex refraction through the glass. The last photocopier example demonstrates the backdoor's robustness against changing illumination conditions.
Conceptually, adversaries can naively conduct SRA on victim devices by directly writing the weights of predesigned backdoor subnets into the corresponding locations of the model files. This is effective when no file integrity check mechanism is deployed or when it can be bypassed (even this simple defense is seldom seriously considered by deep learning practitioners).
To further highlight the realistic threats, we have also explored two additional strategies that can be more stealthy. Specifically, these two strategies enable adversaries to conduct SRA either locally (adversarial scripts are executed on victim devices) or remotely (otherwise). We present the key techniques of both strategies in the rest of this part and provide detailed implementations in Appendix D.
Local SRA. Instead of directly tampering with the model weights file, adversaries can hijack file system APIs such that, when the DNN deployment process attempts to load the model weights file, the hijacked file system APIs take over the input stream and complete the subnet replacement in runtime space during the loading process. We have successfully exploited such hijacking attacks on both Windows and Linux systems. On Windows, we hook the CreateFileW WinAPI and return the malicious model's HANDLE. On Linux, we leverage the LD_PRELOAD environment variable to hook the open and openat syscalls. Through local SRA, we can inject backdoors into DNN models without modifying their on-disk model weights files, which greatly increases stealthiness.
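The hooks above operate at the OS level (CreateFileW on Windows, LD_PRELOAD on Linux). As a language-level analogue of the same idea, the following pure-Python sketch wraps a hypothetical loader so that weights are patched in memory while the on-disk file stays intact (all names and the JSON weight format are assumptions for illustration only):

```python
import json, os, tempfile

def load_weights(path):
    """A stand-in for a deployment pipeline's model-loading routine."""
    with open(path) as f:
        return json.load(f)

# --- attacker side: wrap the loader so weights get patched at load time ---
_original_load = load_weights

def hijacked_load(path):
    weights = _original_load(path)
    # Overwrite a (hypothetical) subnet slice; the on-disk file is untouched.
    weights["fc1"][0] = [10.0, 10.0, 0.0, 0.0]
    return weights

load_weights = hijacked_load

# --- victim side: loads the model as usual, unaware of the hook ---
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump({"fc1": [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]}, f)

loaded = load_weights(path)
assert loaded["fc1"][0] == [10.0, 10.0, 0.0, 0.0]          # patched in memory
with open(path) as f:
    assert json.load(f)["fc1"][0] == [0.1, 0.2, 0.3, 0.4]  # disk unchanged
os.remove(path)
```

The real OS-level hooks achieve the same effect one layer lower, so even a process that re-reads the file through the hooked APIs observes the malicious weights.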
Different from local SRA, remote SRA first needs to gain remote code execution privilege on the machine where the target DNNs run. This can be achieved by exploiting many known vulnerabilities. A typical one arises from linking outdated libraries with security flaws. For example, if the victim uses Nvidia’s CUDA to accelerate computing, CUDA might use the outdated NVJPEG library to handle images for some computer vision models. By exploiting NVJPEG’s out-of-bounds memory write vulnerability (e.g., CVE-2020-5991 [nvd_2020]), adversaries can acquire remote code execution privilege [lineberry2009, heaptaichi2010]. As soon as adversaries gain the privilege to remotely execute commands, they can follow the local SRA method to complete the attack chain. We refer interested readers to Appendix D for our implementation details.
Although we show that SRA can be practical and powerful by hitchhiking on existing system-level attack techniques, we also point out that its stealthiness may degrade when victim models are narrow and small; e.g., attacks on the more compact MobileNet-V2 architecture induce larger CAD (as shown in Tables 1 and 2). On the other hand, since SRA does not make use of any gradient information, it also needs to modify more model weights than previous white-box algorithms. But we argue that this additional overhead is moderate and acceptable from the viewpoint of system-level attack practitioners — the capacity of a backdoor subnet (byte-level) is small compared with that of the full model (megabyte-level).
According to our survey, most backdoor defenses focus on inspecting either the victim’s training set [chen2018detecting, tang2021demon, tran2018spectral, soremekun2020exposing, chan2019poison, chou2020sentinet] or the trained models [wang2019neural, huang2019neuroninspect, liu2019abs, guo2019tabor, liu2018fine] before deployment. These pre-deployment-stage defenses are completely ineffective against our attack, because SRA neither corrupts the training set nor injects the backdoor in the production stage. To investigate potential deployment-stage defenses, we also consider applying those model inspection techniques (originally designed for the pre-deployment stage) to inspect attacked models in the deployment stage. To our surprise, SRA is naturally resistant to a considerable portion of these defenses (Neural Cleanse (NC) [wang2019neural] as an example). Besides those inspection-based defenses, we also consider preprocessing-based defenses [liu2017neural, doan2020februus, udeshi2019model, villarreal2020confoc, qiu2021deepsweep, li2020rethinking], which are somewhat more compatible with the spirit of deployment-stage defense. However, we find that the additional overheads and clean accuracy loss these methods may induce could be intolerable. In summary, we find that there is still a huge blank in the landscape of deployment-stage defenses for securing DNN applications. Refer to Appendix H for our detailed evaluations and further discussion.
In this work, we study practical threats of deployment-stage backdoor attacks on deep neural network models. To approach realistic practicality, we propose the Subnet Replacement Attack (SRA) framework, which can be conducted in a gray-box setting and robustly generalizes to physical triggers. Through simulation experiments and system-level attack demonstrations, we show that SRA is both effective and realistically threatening in real application scenarios. With our study, we call for the community’s attention to deployment-stage backdoor attacks on DNNs, which can be highly practical and unexpectedly powerful when combined with traditional system-level attack techniques.
We provide our full experiment results in this section, including:
Replacing 10 randomly chosen subnets in the pretrained model for each of VGG-16 (Table 8), ResNet-101 (Table 9), and MobileNet-V2 (Table 10) on the ImageNet classification task. We train each backdoor subnet with around 20,000 randomly sampled images from the ImageNet training set. All tests are performed on the full ImageNet validation set.
We adopt the VGG-Face CNN model [parkhi2015deep] for SRA on our face recognition task. We subselect 10 individuals from the complete VGG-Face dataset, with 300-500 face images for each, and follow the same practice as [wu2019defending]. Then, we conduct SRA by replacing 10 randomly chosen subnets in the VGG-Face model for the face recognition task; the results are shown in Table 11.
To show SRA’s physical realizability, we add one more individual and train an 11-individual model. When attacked with a physically trained (see Eq. (5)) backdoor subnet, the 11-individual VGG-Face model shows the expected physical robustness to the backdoor trigger pattern (e.g., a person holding a phone showing the trigger activates the backdoor; see our implementation for details).
In Section 3.2.1, we consider fully connected neural networks for clarity, but in general, the procedure of SRA naturally extends to DNNs with convolution layers.
Instead of outputting a scalar value, each node in a convolution layer outputs a vector, known as a channel. In brief, a common convolution node takes input as $z = \sum_i w_i * x_i + b$. Here, $*$ is the convolution operation, $x_i$ are the input channels, $w_i$ the corresponding kernels, and $b$ the bias. Similarly, the node outputs $y = \sigma(z)$, where $\sigma$ may include operations like BatchNorm and ReLU.
Thus our previous notations are basically the same as those of the convolution layers described above; all we need to do is change scalars into vectors. Therefore, our previous descriptions in Section 2 and Definition 3 apply similarly.
Specifically, some convolutions may be performed in groups, in which case there is no need to cut off the interactions between the subnet and the rest of the network in Definition 3, step 1. Another common special case is the residual connection. The procedure stays the same, except that the attacker should be cautious during subnet selection: the channels selected as input and output should be the same for the main connection and its corresponding residual connection.
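To make the cut-off in Definition 3, step 1 concrete, the sketch below (with scalars standing in for convolution kernels, an illustrative assumption) zeroes every weight that connects a subnet channel to a non-subnet channel, so that the subnet and the rest of the network stop interacting:

```python
# Hypothetical sketch: weight[o][i] is a scalar stand-in for the kernel
# mapping input channel i to output channel o. Channel 0 is reserved for
# the backdoor subnet in both the input and output layers.
def disconnect_subnet(weight, subnet_channel=0):
    n_out, n_in = len(weight), len(weight[0])
    for o in range(n_out):
        for i in range(n_in):
            # zero kernels with exactly one endpoint inside the subnet
            if (o == subnet_channel) != (i == subnet_channel):
                weight[o][i] = 0.0
    return weight

w = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
w = disconnect_subnet(w)
```

After the cut, the subnet channel only feeds itself, and the remaining channels only feed each other, which is exactly the isolation the replacement step relies on.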
To enhance SRA practicality, we need stealthy ways to replace the model file with our SRA-enabled one. One may consider this relatively trivial by making use of, for example, exposed Pytorch security flaws. This only requires basic knowledge of Pytorch’s model loading process, which can be easily gained by reading the Pytorch framework’s source code. Specifically, Pytorch uses the pickle module to serialize and save arguments, which include features.0.weight, features.0.bias, features.1.running_mean, etc. By parsing argument blocks’ lengths and other information such as floating point data, we can reconstruct the network’s structure and arguments. Then we can use C/C++ and Python to write arguments with attack payloads that inject the backdoor chain’s data into the target model file. At run-time, Pytorch loads the malicious model without any verification. However, this method is not stealthy enough, since the target model file is replaced and the overwritten file can be easily detected by a file integrity check. Hence, in this paper, we explore two additional stealthy methods to fulfill SRA. We also provide three typical scenarios to illustrate the SRA attack’s effectiveness, listed as follows:
The attacker has gained local code execution privilege and is able to carry out attacks targeting the model’s arguments.
The attacker has gained local code execution privilege and injects shellcode into the target process’ address space, where the shellcode replaces the model file at run-time.
The attacker has gained remote code execution privilege and is able to control the target process’ data via CPU/GPU vulnerabilities, enabling the attacker to carry out an argument attack.
For scenario 1, we take the widely used Pytorch framework as an example. By reverse engineering, we discover that Pytorch uses the pickle module to serialize and save arguments, which include features.0.weight, features.0.bias, features.1.running_mean, etc. By parsing argument blocks’ lengths and other information such as floating point data, we can reconstruct the network’s structure and arguments. After that, we use C/C++ and Python to write attack payloads that inject the backdoor chain’s data into the target model file. When the user loads the model in the production environment, the malicious model with the backdoor chain is loaded. However, this attack method is neither covert nor accurate, since the whole model file is replaced, and the attack can be revealed simply by comparing the two model files’ sizes. Hence, we designed two attack methods that address these shortcomings, introduced below for scenarios 2 and 3.
For scenario 2, we aim to increase the stealthiness of the attack. That is, we do not directly change the model file at the file system level. Instead, we hijack some file-system-related operating system APIs, so that when the process tries to load the model file, it loads a malicious one instead. On Windows systems, we can hook the CreateFileW WinAPI and return the malicious model’s HANDLE. On Linux-based systems, we can use LD_PRELOAD to hook the open and openat syscalls. By doing so, we can easily manipulate the network’s arguments without modifying its model file directly on disk, which may help us circumvent possible detection.
Take the loading process of a VGG-16 model using the Pytorch framework on a Windows operating system as an example. We analyzed the model loading logic and noticed that bcryptprimitives.dll is dynamically loaded before the framework loads necessary data from main modules such as torch_cpu and c10. By providing a well-designed bcryptprimitives.dll as the attack payload, we can gain arbitrary code execution privilege. This DLL file has the same export table as the original one, inserting a middle layer into the original API’s call chain; it forwards irrelevant calls to the original bcryptprimitives.dll so that they behave as normal. We then use this privilege to create inline hooks of the operating system’s file-system-related kernel APIs, kernelbase!CreateFileW and kernelbase!ReadFile, hence gaining control over the framework’s model-loading logic as well as the ability to carry out SRA at run-time. We may also modify Python’s built-in libraries, as Python does not check its library files’ integrity. Some of these library files contain Python code responsible for wrapping the operating system’s open/CreateFileW APIs and exporting them to the Python script’s run-time. Since these library files are publicly accessible on disk, we can feasibly add a conditional branching code block to the corresponding function, the open() function defined in Lib/_pyio.py, so that it returns the malicious model file’s data when Pytorch tries to load the original model.
For scenario 3, note that the attacker is trying to perform the attack from a remote client, so the target system needs to have some vulnerability the attacker can exploit to gain remote code execution privilege. In real-world cases, many mistakes can lead to such security flaws, and the most common one is introducing outdated dependencies into the project. For instance, if the victim uses Nvidia’s CUDA to boost computing, which might use the outdated NVJPEG library to handle images for some computer vision models, then the attacker might acquire remote code execution privilege by exploiting the NVJPEG library’s out-of-bounds memory write vulnerability, known as CVE-2020-5991. As soon as the attacker gets the privilege to remotely execute commands on the computer, the actual SRA is carried out, completing the attack chain.
Basically, we want to minimize the size (see Definition 3) of backdoor subnets, so that the SRA backdoors can be as stealthy as possible. For linear layers, we usually allow only a single neuron for the backdoor subnet; for convolution layers, the narrow backdoor subnets have only a single channel; and likewise for other layers (batch norm, etc.). Due to the small capacity of these subnets, it may sometimes be difficult for them to learn to distinguish clean and trigger inputs. Therefore, when necessary, we also allow backdoor subnets to be larger. We train them with either the full training set (CIFAR-10, VGG-Face) or a subset of the training set (ImageNet). For most cases, we use a batch square loss in practice of Eq. (4) and Adam as the optimizer. The hyperparameters of Eq. (4) are customized, ad hoc for every single architecture, and may need to be modified during training. But once a backdoor subnet has successfully learned to recognize the trigger, the attacker may attack any model of the same architecture by reusing the subnet.
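The batch square loss used in practice of Eq. (4) can be sketched as follows; the target activation value `a` and the example activation values are illustrative assumptions, not the ones used in our experiments:

```python
# Sketch of the batch square loss: push the subnet's activation toward 0
# on clean inputs and toward a target value a on trigger-stamped inputs.
def batch_square_loss(clean_acts, trigger_acts, a=20.0):
    clean_term = sum(v ** 2 for v in clean_acts) / len(clean_acts)
    trigger_term = sum((v - a) ** 2 for v in trigger_acts) / len(trigger_acts)
    return clean_term + trigger_term

# A well-separated subnet: clean activations near 0, triggered near a.
good = batch_square_loss([0.1, -0.2, 0.0], [19.8, 20.1, 20.0], a=20.0)
# A poorly separated subnet yields a much larger loss.
bad = batch_square_loss([5.0, 6.0, 4.0], [9.0, 10.0, 11.0], a=20.0)
```

Minimizing this loss (here with Adam over mini-batches) is what drives the clean and triggered activation distributions apart.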
Ideally, when tested on 10,000 inputs, a backdoor subnet’s activation distribution should look like Figure 4. But in real training, the optimization may not endow the backdoor subnet with such a perfect activation distribution, due to factors including architecture and optimization techniques. We show a real backdoor subnet in Figure 5 as an example. In Figure 5, it is clear that the backdoor subnet has learned to distinguish clean and poisoned inputs, but the gap between them is tiny and the clean activations are biased.
It turns out that we can solve these problems at the backdoor injection stage. All we need to do is apply a simple “standardization” at step 2 (see Definition 3). For example, for the same backdoor subnet demonstrated in Figure 5, we may set the weight $w$ connecting the backdoor subnet to the target logit to a larger value, say 100. Meanwhile, we modify the corresponding bias parameter $b$ for the target class to $-1.3 \times 100$. Then the backdoor subnet works just as desired. Generally speaking: 1) setting a larger $w$ increases the ASR but may damage the overall clean accuracy (if the clean activation distribution is not concentrated enough); 2) adjusting $b$ has similar effects — a larger $b$ increases the ASR but damages the overall clean accuracy, while a too-small $b$ may damage both the ASR and the target-class clean accuracy.
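A sketch of this standardization, using the numbers from the Figure 5 example (a clean-activation baseline around 1.3 and a scale of 100; both taken as illustrative):

```python
# After standardization, the subnet contributes  w * v(x) + b  to the
# target logit, with w = 100 and b = -1.3 * 100: roughly 0 for clean
# inputs whose activation sits near the 1.3 baseline, and strongly
# positive for triggered inputs whose activation exceeds it.
def target_logit_boost(activation, w=100.0, baseline=1.3):
    b = -baseline * w
    return w * activation + b

clean_boost = target_logit_boost(1.3)    # activation at the clean baseline
trigger_boost = target_logit_boost(1.5)  # activation slightly above it
```

Even a tiny activation gap is thus amplified into a decisive logit difference, which is why the trade-off between ASR and clean accuracy reduces to choosing `w` and the baseline.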
After subnet replacement, there might be some clean accuracy drop. The CAD is caused by two factors: 1) the complete model losing a subnet; 2) false positives induced by the backdoor subnet. The first factor is largely determined by the model architecture (for wider and larger models, losing a subnet is not a problem; for smaller, tighter models, even losing a single channel evidently damages the clean accuracy), but attackers can mitigate it by choosing subnets wisely. The second factor is determined by the backdoor subnet’s quality. A good division (concentrated within each class and separated between classes) of clean and poisoned inputs induces essentially zero false positives. However, as mentioned earlier, a worse division damages either the ASR or the clean accuracy, depending on the attacker’s choice.
We provide some of our backdoor subnets in Figure 6. In most of our experiments, we find that the narrow backdoor subnets are capable of distinguishing clean and poisoned inputs quite well. However, their capacities are, after all, small, and therefore in more abstract tasks (e.g., the physical trigger and Instagram Gotham filter cases, see Figure 6), they cannot provide good decision boundaries. In those cases, attackers must trade off between ASR and CAD. In Appendix F, we demonstrate the trade-offs by showing several possible ASR-CAD pairs for the Instagram Gotham filter case.
In the main body we discuss our results using the patch trigger (the Phoenix pattern, Figure 8). Our attack paradigm naturally extends to many more types of triggers, as long as the backdoor subnet can learn to distinguish between clean and poisoned inputs. For example, we adopt the blended injection strategy from [chen2017targeted]. Like them, we use the same HelloKitty trigger (Figure 8) and a randomly generated noise pattern (Figure 8) as triggers. Poisoned inputs are blended with the HelloKitty or random noise trigger $t$ with transparency $\alpha$: $x' = (1 - \alpha)\,x + \alpha\,t$.
We also apply a perturbation strategy for the random noise trigger with budget $\epsilon$, following adversarial attack conventions: $x' = \mathrm{clip}_{[0,1]}(x + \epsilon\,t)$.
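Both poisoning rules can be sketched on flattened pixel lists; the $\alpha$ and $\epsilon$ values here are illustrative assumptions, not the ones used in our experiments:

```python
# Blended injection: pixel-wise convex combination of image and trigger.
def blend(image, trigger, alpha=0.2):
    return [(1 - alpha) * p + alpha * t for p, t in zip(image, trigger)]

# Perturbation-style injection: bounded additive noise, clipped to [0, 1].
def perturb(image, noise, eps=8 / 255):
    return [min(1.0, max(0.0, p + eps * n)) for p, n in zip(image, noise)]

img = [0.0, 0.5, 1.0]     # toy 3-pixel image in [0, 1]
trg = [1.0, 1.0, 1.0]     # trigger pixels
noise = [1.0, -1.0, 1.0]  # random noise in [-1, 1]
blended = blend(img, trg)
perturbed = perturb(img, noise)
```

The blended variant leaves a faint visible overlay, while the perturbation variant keeps the poisoned input within an $\epsilon$-ball of the clean one.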
Furthermore, we reimplement and modify the Instagram Gotham filter [acoomans] and use it as a backdoor trigger. The filter includes complex transforms, e.g., one-dimensional linear interpolation and sharpening; see our code for details.
| Trigger Type | Top-1 ASR (%) | Top-5 ASR (%) | Top-1 Clean Accuracy (%) | Top-5 Clean Accuracy (%) |
|---|---|---|---|---|
| Random Noise (Blend) | 99.62 | 99.77 | 72.32 | 91.21 |
| Random Noise (Perturb) | 99.14 | 99.47 | 72.10 | 91.21 |
Inputs poisoned by the triggers described above are demonstrated in Figure 7. We test the 5 types of triggers on the pretrained VGG-16 by replacing its top subnet with corresponding backdoor subnets; repetitive experiments are not necessary here. See Table 12 for the SRA attack results. As shown, subnet replacement attacks using the HelloKitty and random noise triggers show ASR and CAD similar to the Phoenix patch trigger, which is both stealthy and harmful. The Instagram Gotham filter is relatively more difficult to learn. We train a 3-channel backdoor subnet, and its activation histogram looks like Figure 6 — the overlapping orange and blue parts show that the backdoor subnet cannot distinguish clean and poisoned inputs very well. Still, as the attacker, we may trade off between stealthiness and harmfulness, as shown in the last 8 lines of Table 12 (obtained by adjusting the classification layer weight $w$ and bias $b$). The attacker may then select one of these choices according to the practical scenario.
In this section, we demonstrate our efforts to train such a physical backdoor subnet, using the physical Phoenix trigger as an example. To train a backdoor subnet that is sensitive to physical-world triggers, we follow Eq. (5). First, we generate 125 different perspective-transformed triggers (and masks) by rotating the original trigger around the 3D coordinate axes, as shown in Figure 9. During training, we poison an input by randomly:
picking one from the 125 triggers
scaling it to a size between (32, 96) (for ImageNet task)
altering its brightness
patching it at a legal location on the clean image
(see Figure 10).
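The sampling procedure above can be sketched as follows; the brightness jitter range and the image size are illustrative assumptions:

```python
import random

def sample_poison_config(n_triggers=125, min_size=32, max_size=96, img_size=224):
    """Randomly sample one augmentation configuration per poisoned input."""
    size = random.randint(min_size, max_size)  # trigger size (ImageNet range)
    x = random.randint(0, img_size - size)     # legal top-left corner
    y = random.randint(0, img_size - size)
    return {
        "trigger_id": random.randrange(n_triggers),  # one of the 125 transforms
        "size": size,
        "brightness": random.uniform(0.7, 1.3),      # assumed jitter range
        "position": (x, y),
    }

cfg = sample_poison_config()
```

Sampling a fresh configuration per input forces the subnet to respond to the trigger itself rather than to any particular placement, scale, or lighting.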
It turns out that physical triggers are indeed more difficult for the small backdoor subnet to learn. Therefore we adopt a wider backdoor subnet (see Figure 6 for its activation).
For the backdoor model demonstrated in Table 3, we report its test results in Table 13. The “Top1” and “Top5” ASRs are reported using the same simulated physical triggers as in training. The “Real” ASR is evaluated on our crafted test set consisting of 28 physically attacked samples in 7 scenes, where the physical-backdoor model achieves 75% ASR and makes correct predictions on all 9 clean inputs. Again, as mentioned, we can trade off between ASR and CAD to achieve different (and possibly better) results.
As discussed, SRA causes severe damage during the deployment stage and is difficult to defend against or detect.
A portion of backdoor defenses focus on finding potential poisoned samples in the training set. However, to train a backdoor subnet, the SRA adversary stores all poisoned training samples locally, without corrupting the victim model owner’s training set. So all defenses relying on the assumption that the training set is poisoned [chen2018detecting, tang2021demon, tran2018spectral, soremekun2020exposing, chan2019poison, chou2020sentinet] are rendered ineffective.
Backdoor detection [wang2019neural, huang2019neuroninspect, liu2019abs, guo2019tabor, liu2018fine] is another line of defense, and Neural Cleanse (NC) [wang2019neural] is one of the state-of-the-art backdoor detectors. We test NC against SRA. Surprisingly, the triggers restored by NC (Figure 14) are far from the real one (Figure 11). Moreover, they are indistinguishable from the triggers restored from the clean model (Figure 14). In fact, the restored triggers from the SRA model lead to similar ASR on the clean model before SRA, and vice versa — this means the reverse-engineered triggers are natural ones, not the malicious ones injected by us. Furthermore, we compare the restored triggers with those of another VGG-16 model backdoored with the same trigger but attacked by traditional data poisoning (DP) [gu2017badnets, chen2017targeted]. In Figure 14, it is obvious that the triggers restored from the data-poisoned model are small in $\ell_1$-norm and match the original trigger mark, while the triggers restored from our SRA model are far larger in $\ell_1$-norm and resemble a “bird” (the target class).
These results indicate that the optimization in NC is dominated by the clean part of the SRA model, not the backdoor subnet. A possible explanation is that during optimization, the subnet’s gradient information w.r.t. the input domain is inconspicuous compared with the gradients of the rest of the network. Consider the backdoor model $\mathcal{F}'$ obtained by replacing a subnet of the original complete model $\mathcal{F}$ with a backdoor subnet $\mathcal{V}$, and let $\hat{\mathcal{F}}$ denote the remaining part of the complete model. We may roughly approximate the target class logit output by
$$\mathcal{F}'_{\text{target}}(x) \approx \hat{\mathcal{F}}_{\text{target}}(x) + w\,\mathcal{V}(x),$$
where the subscript “target” specifies the target class logit and $w$ is the weight connecting the backdoor subnet to the target logit. Calculating the gradients w.r.t. the inputs gives
$$\nabla_x \mathcal{F}'_{\text{target}}(x) \approx \nabla_x \hat{\mathcal{F}}_{\text{target}}(x) + w\,\nabla_x \mathcal{V}(x).$$
The term $w\,\nabla_x \mathcal{V}(x)$ should reveal the existence of the backdoor by indicating suspicious entries in the input image. However, since the backdoor subnet is so small, we empirically have $\|w\,\nabla_x \mathcal{V}(x)\| \ll \|\nabla_x \hat{\mathcal{F}}_{\text{target}}(x)\|$, and therefore the gradient reveals only the benign information.
This raises further concerns: how effectively can current gradient-based and optimization-based defenses, e.g., NeuronInspect [huang2019neuroninspect], work against SRA? We leave this to future work.
| Attack | Restored Trigger #1 | Restored Trigger #2 | Restored Trigger #3 |
|---|---|---|---|
| Clean | $\ell_1$-norm: 51.67 | $\ell_1$-norm: 55.38 | $\ell_1$-norm: 73.93 |
| DP | $\ell_1$-norm: 4.07 | $\ell_1$-norm: 3.41 | $\ell_1$-norm: 3.17 |
| SRA (ours) | $\ell_1$-norm: 57.71 | $\ell_1$-norm: 44.17 | $\ell_1$-norm: 76.56 |
Online backdoor defenses usually make stronger assumptions, i.e., that inputs injected with backdoor triggers are actually fed into the models in flight. Some offline methods (e.g., Activation Clustering [chen2018detecting]) are also applicable under this assumption. Another line of online defenses, e.g., randomized smoothing and down-upsampling, is based on preprocessing and input reformation. Attractive as these defenses may look, remember that 1) some of them require complex analysis of every input and thus introduce heavy overheads at inference time; 2) others, like input reformation, mostly derive from adversarial attack defenses, may not be effective against stronger backdoors, and may cause additional clean accuracy drops. These costs are not always tolerable, with automobile applications as an example. In addition, few online defense works consider complicated triggers (e.g., physical-world triggers, Instagram-filter triggers), which are feasible through SRA.