Deep neural networks (DNN) are powerful classifiers deployed for a wide range of tasks, e.g., image segmentation [liu2019auto], in autonomous vehicles [tian2018deeptest, young2018recent] and health care predictions [esteva2019guide]. Developing a DNN for a specific task is costly because of the labor and computational resources required for data collection, data cleaning, and training of the model. For this reason, models are often provided by a single entity and consumed by many, for example, in the context of Machine Learning as a Service (MLaaS). A threat to the provider is model stealing, in which an adversary derives a surrogate model from access to a source model, but without access to data with ground truth labels.
In this paper we study linkability of DNN models. A link is a relation between a target model and a source model. A target model is linked to a source model, if the target model is derived from the source model. Methods of derivation include, but are not limited to, distillation [hinton2015distilling], fine-tuning [tajbakhsh2016convolutional], adversarial training [szegedy2013intriguing] and model extraction [tramer2016stealing]. A target model is not linked to a source model, when it is trained independently of the source model from scratch, possibly on the same data set as the source model. We call a derived target model a surrogate model and an independently trained target model, a reference model. Linkability is the ability of an algorithm to decide whether a target model is a surrogate or a reference model for a given source model, i.e., whether there is a link between the target and the source model or not.
Linkability has several applications. Assume a publicly available model has a known vulnerability, e.g., a backdoor [gu2017badnets], or bias [stock2018convnets]. Linkability can determine whether another model has been derived from that model and likely carries over the vulnerability or bias, before use of the derived model causes harm. Assume an MLaaS provider wants to prohibit its users from redistributing the source model, e.g., through a contractual usage agreement. Since the provider has to grant access, it cannot prevent users from extracting models. Using linkability, the provider can determine whether another model has been derived from its model.
We propose fingerprinting as a method that provides linkability. Watermarking of DNNs [uchida2017embedding] also captures a notion of linkability. Watermarking embeds a secret message into a model that is later extractable using a secret key. A (target) model is linked to the marked (source) model if its extracted message matches the embedded one. Fingerprinting does not embed a secret message during training (and thereby modifies the model potentially impacting its accuracy) but extracts an identifying code (fingerprint) from an already trained model.
Different from watermarking schemes, our fingerprint is specifically designed to withstand distillation (and related model extraction) attacks. Distillation [hinton2015distilling] is a very powerful method to derive a target model, since the only information reused is the classification labels of the source model. This implies that the transfer of a fingerprint (or watermark) can only be achieved via those classification labels. Claimed security properties of existing watermarking schemes [adi2018turning, zhang2018protecting] have been broken by distillation attacks [shafieinejad2019robustness]. Other watermarking schemes [szyller2019dawn] make weaker claims, but explicitly limit the number of queries an adversary can make to exclude distillation attacks. Hence, there exists no scheme that provides linkability that has been successfully tested against distillation attacks. For more details on related work, we refer the reader to Section VII.
We exploit the transferability of adversarial examples [liu2016delving] to address this problem. Adversarial examples [szegedy2013intriguing] are inputs with small modifications that cause misclassification of the input. Given two models for the same classification task, an adversarial example can be found in one model and tested on the other. An adversarial example is called transferable if it is misclassified in both. Targeted adversarial examples are adversarial examples where the target class of the misclassification is fixed.
In this paper, we hypothesize that there exists a subclass of targeted, transferable, adversarial examples that transfer to surrogate models, but not to reference models. We call these adversarial examples conferrable. Conferrable adversarial examples can be used to provide linkability. Any conferrable example found in the source model should have the same misclassification in a target model, but a different one in others. Linkability via conferrable adversarial examples can withstand a very powerful attacker. The attacker may have white-box access to the source model, i.e., all parameters and its architecture, but the verifier of linkability only needs black-box access to the target model. I.e., linkability is still feasible even if the attacker only deploys the target model to a remote server with API access.
We empirically study the existence of conferrable adversarial examples for DNNs trained on the CINIC [darlow2018cinic] and ImageNet32 [chrabaszcz2017downsampled] datasets. Our study shows that known adversarial attacks such as Projected Gradient Descent (PGD) [madry2017towards], DeepFool [moosavi2016deepfool], Iterative Gradient Method (IGM) [kurakin2016adversarial] and Carlini-Wagner () [carlini2017towards] have low success rates for generating conferrable adversarial examples given only the source model. We propose a new ensemble-based approach called C-IGM that specifically generates conferrable adversarial examples with an improved success rate, which can be used as fingerprints. For CINIC, fingerprint retention in surrogate models with similar accuracy is at least 92%. For ImageNet32, the fingerprint retention is at worst 76% for a model that has 2.6% less test accuracy than our source model.
This work contributes the following:
A new subclass of targeted transferable adversarial examples, called conferrable adversarial examples. Conferrable adversarial examples transfer more likely from a source model to target models derived by knowledge distillation of the source model, but not to target models trained on ground truth labels.
An ensemble-based method that generates specifically conferrable adversarial examples with improved success rates over known targeted adversarial attacks.
Game-based definitions of fingerprinting for deep neural network classifiers.
A thorough evaluation of our fingerprinting method based on conferrable adversarial examples for the CINIC dataset and a subset with 100 classes from the ImageNet32 dataset. Among other derivation attacks, we are the first to show that our fingerprint is robust to distillation attacks.
We share all models and fingerprints created for this project and give full access to the source code for research use.
The background comprises deep learning and distillation, model extraction and adversarial attacks. Then we define the problem we address.
II-A Neural Networks
A neural network classifier is a function that assigns to each input a likelihood of belonging to each of the classes. It is a sequence of layers
in which each layer implements a linear function followed by a non-linear function called the activation function. A neural network is called deep if it has more than one hidden layer. Hidden layers have weights and bias parameters used to compute that layer’s neuron activations. The output layer typically implements a softmax activation function
that outputs confidence scores for all classes normalized as probabilities.
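As a minimal sketch of the layered classifier described above, the following NumPy code applies a linear map and an activation per layer and a softmax at the output. The function names `softmax` and `forward`, the choice of ReLU as the hidden activation, and the toy layer sizes are assumptions made for illustration only; the text does not fix them.

```python
import numpy as np

def softmax(logits):
    """Map a batch of logit vectors of shape (n, K) to probabilities of shape (n, K)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, layers):
    """Apply (W, b) layers: ReLU on hidden layers (an assumption), softmax on the output layer."""
    h = x
    for W, b in layers[:-1]:
        h = np.maximum(0.0, h @ W + b)  # linear function followed by the activation function
    W, b = layers[-1]
    return softmax(h @ W + b)

# Toy network: 4 input features, one hidden layer with 8 units, 3 classes.
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 8)), np.zeros(8)),
          (rng.normal(size=(8, 3)), np.zeros(3))]
probs = forward(rng.normal(size=(5, 4)), layers)
```

Each row of `probs` holds the confidence scores of one input, normalized so that they sum to one.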
Training a neural network requires the specification of a differentiable loss function that is optimized by gradient descent on all trainable weights and biases. One such loss function is the cross-entropy loss for some ground truth with respect to the model’s prediction.
A popular choice is the Adam [kingma2014adam] optimizer to implement the gradient descent. We define two functions, one for training a classifier given an oracle and one for assigning labels to some input by a model.
returns a vector denoting the confidence score per class of a classifier on a set of inputs . We abuse notation and write instead of for inline paragraphs.
returns a classifier trained on the dataset and labels .
In practice, the function is almost guaranteed to output two different models for the same dataset
even when all hyperparameters are the same, due to the randomness in the training function, e.g. the random initialization of weights.
Distillation has been proposed by Buciluǎ et al. [bucilua2006model] and was generalized by Hinton et al. [hinton2015distilling] as a way to compress knowledge from a source classifier into a less complex target classifier. The problem when training only a target classifier is that hard labels capture no information about class similarities beyond the ground truth class for each input. When the target classifier is trained on hard labels, it has been found to generalize worse to unseen examples [hinton2015distilling]. The idea in distillation is to use a complex model trained on the hard labels to create soft labels that also assign probabilities to classes other than the maximum class prediction, which enhances knowledge transfer between the models. For deep neural networks, soft labels are generated by incorporating a distillation temperature that shrinks the absolute distance between the logits, which leads to a reduced difference in the softmax activation between classes. As the temperature grows, the softmax output of each class converges to the uniform value of one over the number of classes. The softmax of the source and target model are changed as follows.
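The effect of the distillation temperature can be illustrated with a short sketch of the standard tempered softmax, in which the logits are divided by the temperature before normalization. The function name `softmax_with_temperature` and the example logits are assumptions chosen for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a distillation temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([4.0, 1.0, -2.0])               # hypothetical logits for three classes
hard = softmax_with_temperature(logits, 1.0)      # peaked distribution
soft = softmax_with_temperature(logits, 10.0)     # softer labels
flat = softmax_with_temperature(logits, 1e6)      # close to uniform
```

Raising the temperature lowers the largest probability and raises the others, and for a very large temperature every class approaches the same probability.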
We refer to a target model as surrogate model when it has been derived from a source model through knowledge distillation with any distillation temperature . Any other model trained independently of the source model is called a reference model.
We are particularly interested in distillation attacks with a distillation temperature of , because an adversary cannot control the temperature for an already trained source model. Unlike the original form of distillation, we do not require a lower complexity surrogate model, and it has been shown that surrogate models perform better with higher complexity [juuti2019prada]. As confirmed by related work [shafieinejad2019robustness], distillation attacks are a powerful class of removal attacks against linkability for deep neural networks and we evaluate our fingerprinting method against them.
II-C Model Extraction
Model extraction attacks are a form of distillation, but the source model is not controlled by the adversary and can only be accessed as a black-box, i.e. by its input-output behavior. The attack output is a surrogate model with white-box access for the adversary, meaning that all parameters of the surrogate model are accessible. A challenge in model extraction is that training data may be unavailable, or the number of queries made to the source model is limited. Like distillation, model extraction attacks are a threat to linkability. In this paper, we assume a powerful adversary with white-box access to the source model and access to potentially unlimited unlabeled data from the same data distribution on which the source model has been trained.
There are two types of model extraction: Recovery- and learning-based approaches. Recovery-based extraction has the goal to reconstruct a surrogate model highly similar to the source model [jagielski2019high], while learning-based extraction aims at creating a high-accuracy model regardless of the similarity to the source model [papernot2017practical, tramer2016stealing, orekondy2019knockoff]. We focus only on learning-based extraction because our adversary wants to attack linkability where high similarity is obstructive.
II-D Distillation Attacks
We evaluate four distillation attacks for removing linkability: Retraining, Jagielski, Knockoff, and Papernot. All methods are given white-box access to the source model and access to a substitute dataset labeled by the source model that is drawn from the same distribution as the source model’s training data . The main advantage of white-box access to the source model is that no cost is associated with querying the source model for labels.
Retraining is distillation with a temperature of , where a target model is trained on the substitute dataset labeled with the unmodified softmax activations of the source model.
Jagielski et al. [jagielski2019high] propose a simple distillation attack by treating the softmax output obtained from the source model as the model’s logits and post-processing them with a distillation parameter to obtain soft labels. For an input and a source model , the soft labels can be computed as follows.
Knockoff Nets as proposed by Orekondy et al. [orekondy2019knockoff] leverage a transfer set to train the surrogate model. A transfer set may have more and different classes than the training set of the source model. We use the random selection approach presented by the authors and retrain on the transfer set with nine times more classes than the source model’s training data.
Papernot et al. [papernot2017practical] propose training a surrogate model on a substitute dataset labeled by the source model and then using the Jacobian to extend the dataset by more data points close to the decision boundary of the surrogate model. The source model labels these new inputs and the surrogate model is trained on the extended dataset.
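Of the four attacks above, the soft-label post-processing attributed to Jagielski et al. is the simplest to sketch. The code below follows a literal reading of the description: the source model's softmax output is treated as if it were a logit vector, divided by a distillation parameter, and passed through a softmax again. The exact formula is not recoverable from the text, so this form, the function name `soft_labels_from_probs`, and the example values are assumptions.

```python
import numpy as np

def soft_labels_from_probs(probs, temperature):
    """Treat softmax outputs as logits and re-normalize them with a distillation parameter.

    Assumed form: softmax(probs / temperature). This is one reading of the
    description, not a confirmed reproduction of the original attack.
    """
    z = np.asarray(probs, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical softmax output of a source model for a single input.
source_probs = np.array([[0.90, 0.07, 0.03]])
labels = soft_labels_from_probs(source_probs, temperature=0.5)
```

Under this reading the predicted class is preserved, while the probability mass is spread differently across the remaining classes.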
Distillation attacks are a threat to the model provider even though the adversary has to invest additional computational resources to retrain another model. Data collection and cleaning, hyperparameter tuning, and model testing are associated with much higher costs than just model training. Specifically, small to medium-sized models are generally at risk of model extraction, while large models such as GPT-2 [radford2019language] may be distilled only by sufficiently motivated and funded adversaries. Our experiments confirm that distillation attacks are the most powerful known attacks to break linkability.
II-E Adversarial Examples
Deep neural networks are vulnerable to adversarial attacks, where a correctly classified input is perturbed to force a misclassification. These perturbed inputs are called adversarial examples [goodfellow2014explaining] and they have been shown effective in practice against systems based on deep neural networks, like traffic sign recognition systems [eykholt2018robust], malware detection [grosse2017adversarial] or speech recognition [carlini2018audio] (although the latter are, strictly speaking, not adversarial examples but rather adversarially embedded commands in speech that a human does not perceive as such). The perturbation is typically constrained so that the similarity difference between the original image and the perturbed image is bounded. This allows creating imperceptibly small perturbations and, in practice, ensures that the ground truth of the adversarial example, given by an oracle’s classification, does not change from the original image.
A targeted adversarial example is crafted for a target classifier and a target class not equal to the ground truth . For some similarity measure and original input , the task is to find a perturbation where for some threshold . Formally, a targeted adversarial attack succeeds if the following condition holds.
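The success condition described above can be sketched as a predicate: the perturbation must stay within the threshold under the chosen similarity measure, and the classifier must assign the target class to the perturbed input. The function name `is_targeted_adversarial`, the use of a vector norm as the similarity measure, and the toy classifier are assumptions for illustration.

```python
import numpy as np

def is_targeted_adversarial(predict, x0, delta, target, epsilon, order=np.inf):
    """Return True if x0 + delta is classified as `target` and delta respects the bound."""
    distance = np.linalg.norm(np.ravel(delta), ord=order)
    if distance > epsilon:
        return False  # perturbation exceeds the allowed threshold
    scores = predict(x0 + delta)
    return int(np.argmax(scores)) == target

# Toy classifier whose scores are simply the input coordinates (hypothetical).
predict = lambda x: x
x0 = np.array([0.6, 0.4])        # classified as class 0
delta = np.array([-0.15, 0.15])  # pushes the input towards class 1

within_budget = is_targeted_adversarial(predict, x0, delta, target=1, epsilon=0.2)
over_budget = is_targeted_adversarial(predict, x0, delta, target=1, epsilon=0.1)
```

With a threshold of 0.2 the perturbed input counts as a targeted adversarial example for class 1. With a threshold of 0.1 the same perturbation is rejected because it is too large.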
Adversarial attacks in the literature often use the -norm for so that , but other similarity measures are explored at least in the image domain, such as structural similarity [peng2019structure], DSSIM [wang2018great] and a discriminator deep neural network [xiao2018generating]. Unless otherwise specified, we generate adversarial examples with the -norm as our similarity measure, as Warde-Farley and Goodfellow argue for the infinity norm being the optimal distance metric [infinitynorm]. Our work is not limited to this norm and we also show results for Carlini-Wagner [carlini2017towards] (CW-) and Projected Gradient Descent (PGD) [madry2017towards] in the -norm.
Adversarial examples that are adversarial to more than one model have a property that is called transferability [goodfellow2014explaining, tramer2017space, demontis2019adversarial]. There is strong evidence that adversarial examples are not singular points scattered randomly in the decision space, but instead, they span high-dimensional subspaces [tramer2017space, demontis2019adversarial]. Also, it has been found that neural networks learn similar decision boundaries and thus are vulnerable to the same adversarial examples [goodfellow2014explaining]. In that sense, adversarial examples that exist in the intersection between these high-dimensional subspaces of adversarial examples for many different models are transferable. It was shown that adversarial examples misclassified with high confidence are more likely to be transferable [demontis2019adversarial]. Targeted transferability has been studied by Liu et al. [liu2016delving], who devised an ensemble-based adversarial attack for improved transferability.
Our work proposes a new property for adversarial examples called conferrability, which is a subclass of targeted transferable adversarial examples. Conferrable adversarial examples transfer from a source model only to its surrogates obtained through knowledge distillation, but not to any reference models trained independently of the source model. We compare our approach that generates specifically conferrable adversarial examples to known adversarial attacks.
II-F Adversarial Attacks
Adversarial attacks are algorithms that generate adversarial examples for a given target model . An adversarial attack is typically optimized to generate adversarial examples with certain desired properties, such as enhanced transferability, a high success rate for the attack, or low computation time. We focus on untargeted adversarial attacks for which we measure the targeted transferability of the generated adversarial examples. Let and denote the original input and the adversarial example respectively, the loss function for the target model and a maximum perturbation threshold so that for some .
Fast Gradient Method (FGM) [goodfellow2014explaining]: A one-step adversarial attack originally proposed under the name "Fast Gradient Sign Method" for the infinity norm . For a norm , the update rule can be written as follows.
Iterative Gradient Method (IGM) [kurakin2016adversarial]: An extension of FGM applied iteratively with step size , that clips back the gradient to the ball around the original input when the perturbation is too large.
DeepFool [moosavi2016deepfool]: DeepFool iteratively finds adversarial examples by linearizing the decision space of the target classifier around the original input
and greedily selects the closest hyperplane to any class other than the ground truth class. An orthogonal projection of the intermediate input onto that hyperplane is computed and the algorithm repeats until a misclassification occurs. For the class with the closest hyperplane and representing the loss of classifier on with target label , the update rule can be written as follows for any q-norm with and representing element-wise multiplication.
Projected Gradient Descent (PGD) [madry2017towards]: The same as IGM [kurakin2016adversarial] but with random starting points around the original images. PGD is executed several times with different starting points to avoid getting stuck in local minima.
Carlini-Wagner () [carlini2017towards]: An iterative approach that operates on the logits of a target classifier . They optimize for the following function.
A constant parameter is used to balance the gradient magnitude for both summands and is selected with a binary search. The first modification for efficient optimization is to change the variables for the perturbation with a hyperbolic tangent. The second modification is to add an objective function onto the logits that is easier to optimize. They use the Adam [kingma2014adam] optimizer to find adversarial examples.
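To make the gradient-based update rules of FGM and IGM described above concrete, the following sketch implements untargeted variants in the infinity norm for a linear softmax classifier, for which the loss gradient with respect to the input has a closed form. The model, the function names, the input range of 0 to 1, and all parameter values are assumptions for illustration; a real attack would obtain gradients from the trained network.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss_gradient(W, b, x, label):
    """Gradient of the cross-entropy loss w.r.t. the input x for logits W @ x + b."""
    p = softmax(W @ x + b)
    p[label] -= 1.0          # derivative of the loss w.r.t. the logits
    return W.T @ p

def fgm_linf(W, b, x, label, epsilon):
    """One-step attack: move epsilon along the sign of the loss gradient."""
    return np.clip(x + epsilon * np.sign(loss_gradient(W, b, x, label)), 0.0, 1.0)

def igm_linf(W, b, x0, label, epsilon, step, iterations):
    """Iterative attack: small signed steps, clipped back into the epsilon-ball around x0."""
    x = x0.copy()
    for _ in range(iterations):
        x = x + step * np.sign(loss_gradient(W, b, x, label))
        x = np.clip(x, x0 - epsilon, x0 + epsilon)  # stay within the perturbation budget
        x = np.clip(x, 0.0, 1.0)                    # stay within the valid input range
    return x

# Toy two-class linear model (hypothetical) and an input classified as class 0.
W = np.array([[2.0, -2.0], [-2.0, 2.0]])
b = np.zeros(2)
x0 = np.array([0.6, 0.4])
x_fgm = fgm_linf(W, b, x0, label=0, epsilon=0.15)
x_igm = igm_linf(W, b, x0, label=0, epsilon=0.15, step=0.05, iterations=5)
```

In this toy setting both attacks flip the prediction from class 0 to class 1 while keeping every coordinate within 0.15 of the original input.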
Liu et al. [liu2016delving] devised an optimization problem for an ensemble method that enhances transferability. Given ensemble weights , models with softmax output , a target label and some similarity measure , their optimization target can be written as follows.
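A sketch of such an ensemble objective is given below. It assumes the commonly cited form of the objective, namely the negative logarithm of the weighted ensemble probability of the target class plus a weighted distance term. The function name `ensemble_objective`, the trade-off weight `lam`, and the stand-in models are hypothetical.

```python
import numpy as np

def ensemble_objective(models, weights, x0, x_adv, target, lam, distance):
    """Assumed form: -log(sum_i w_i * J_i(x_adv)[target]) + lam * d(x0, x_adv)."""
    mixed = sum(w * m(x_adv) for w, m in zip(weights, models))  # weighted softmax outputs
    return -np.log(mixed[target]) + lam * distance(x0, x_adv)

# Two stand-in models that return fixed softmax outputs (hypothetical).
models = [lambda x: np.array([0.2, 0.8]), lambda x: np.array([0.6, 0.4])]
weights = [0.5, 0.5]
l2 = lambda a, b: float(np.linalg.norm(a - b))

x0 = np.array([0.3, 0.7])
loss_same = ensemble_objective(models, weights, x0, x0, target=1, lam=0.1, distance=l2)
loss_far = ensemble_objective(models, weights, x0, x0 + 0.5, target=1, lam=0.1, distance=l2)
```

Minimizing this quantity over the perturbed input raises the ensemble's confidence in the target class while penalizing large perturbations.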
There is a considerable amount of research on achieving adversarial robustness, e.g. [stutz2019disentangling, carmon2019unlabeled, carlini2019evaluating, xie2019feature], both from the machine learning and security community. We are not aware of any method to fully protect against adversarial examples. In our work, we show that the presented adversarial attacks alone are insufficient to generate conferrable adversarial examples. We construct an ensemble-based method that uses these adversarial attacks to generate specifically conferrable adversarial examples.
III Problem Statement
In this section, we define the problem and the capabilities we assign to the adversary and defender. Our defender has exclusive access to the training dataset and an oracle that provides the ground truth labels. The adversary has exclusive access to an unlabeled substitute dataset that does not overlap with the defender’s dataset . We define the following three types of models.
The source model trained on the defender’s dataset for which the adversary is given white-box access.
Surrogate models which are models distilled from the source model using a distillation attack and some dataset . The defender and adversary train surrogates on their datasets.
Reference models are models trained on ground truth labels independently of the source model on some dataset .
The goal of proving linkability for a source model is to assess whether a suspect model belongs to the set of reference or surrogate models for a given source model. To this end, we define two roles, namely the defender and the adversary. The defender trains and deploys a source model that is given to an adversary with white-box access. The adversary performs a removal attack against linkability on the source model using the adversary’s substitute dataset , which outputs a surrogate model of the source model. This surrogate model is deployed by the adversary with black-box access to any user. The defender is made aware of the suspect model and starts the verification procedure. This involves querying the suspect model for classifications on a set of inputs. The defender compares the suspect model’s predictions with the source model’s predictions and, if the matching rate is above a threshold, classifies the suspect model as a surrogate of the source model. The defender wins if the matching rate of the suspect model with the source model on the inputs is larger than the threshold if and only if the suspect model truly is a surrogate of the source model. This is captured by the following security game for a verification threshold .
Defender trains a source model
Defender selects a set of inputs
Defender wins if:
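The verification step of the game can be sketched as a comparison of predicted labels against a threshold. The function names `matching_rate` and `verify`, the label vectors, and the threshold value below are assumptions for illustration and are not taken from the evaluation.

```python
import numpy as np

def matching_rate(source_labels, suspect_labels):
    """Fraction of verification inputs on which the suspect agrees with the source model."""
    return float(np.mean(np.asarray(source_labels) == np.asarray(suspect_labels)))

def verify(source_labels, suspect_labels, threshold):
    """Classify the suspect as a surrogate if the matching rate is above the threshold."""
    return matching_rate(source_labels, suspect_labels) > threshold

# Hypothetical predicted labels on eight verification inputs.
source = [3, 1, 4, 1, 5, 9, 2, 6]
surrogate = [3, 1, 4, 1, 5, 9, 2, 0]   # agrees on 7 of 8 inputs
reference = [0, 1, 2, 1, 7, 8, 2, 3]   # agrees on 3 of 8 inputs

is_surrogate = verify(source, surrogate, threshold=0.75)
is_reference_flagged = verify(source, reference, threshold=0.75)
```

The check needs only the predicted class for each input, which is consistent with black-box access that returns nothing but the most likely label.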
As demonstrated by Hitaj et al. [hitaj2018have], the adversary may try to evade the verification and return random labels if he detects the classification requests for verification. We address this problem in our solution by showing that it is feasible to generate conferrable adversarial examples with perturbations as small as in the infinity norm successfully for input values ranging from 0 to 1. The described system’s overview and interactions between the defender and adversary are illustrated in Figure 2.
III-A Adversary’s Capabilities
Our adversary has the goal to create a high-accuracy surrogate model from white-box access to the source model that cannot be linked to the source model by the defender. He has access to potentially unlimited substitute data without labels that may come from the same distribution as the defender’s training data . Unless otherwise specified, there is no overlap between the two datasets, i.e. . In accordance with Kerckhoffs’ principle, the adversary has full knowledge of the algorithms used by the defender to choose the inputs . We model our adversary as a probabilistic polynomial-time (PPT) algorithm. The adversary has access neither to the specific inputs chosen by the defender, nor to the oracle, nor to any other reference model for the same task. In our evaluation, we include the case when the adversary has access to a small portion of clean labeled data. The adversary cannot distinguish a verifier from any other benign user based on identity.
III-B Defender’s Capabilities
The defender has white-box access to the source model and has access to labeled training data . Surrogate models can be created by distilling the source model with the Retraining attack outlined in Section II-B, i.e. the defender creates surrogates . None of the other described distillation attacks are known to the defender. The defender has no access to the substitute dataset used by the adversary or the adversary’s model architecture and parameters. The defender has to decide the linkability of a suspect model with a limited number of queries through a black-box API. The output granularity of the suspect model available to the defender during the verification may be truncated to only the class with the highest likelihood.
IV Conferrable Adversarial Examples
In this section, we motivate and define conferrable adversarial examples, present formulas to calculate conferrability scores, and present an optimization function for adversarial attacks that leads to conferrable adversarial examples upon minimization.
Conferrable adversarial examples are a subclass of targeted transferable adversarial examples that exclusively transfer to the set of surrogate models for a source model, but not to any reference model. Surrogate models are all those models that can be derived from the source model through knowledge distillation, as explained in Section III. The intuition for the existence of conferrable adversarial examples is that misclassifications exhibited by a source model are carried over to a surrogate model with higher probability than when a new reference model is trained independently from the source model. Our conjecture for the existence of conferrable adversarial examples is based on the following two premises.
The set of targeted transferable adversarial examples between the source model and its surrogates is not empty.
The set of targeted transferable adversarial examples for all models, i.e. the source model, its surrogates and the reference models, is not equal to the set of targeted transferable adversarial examples between just the source model and its surrogate models.
If these two premises hold, it follows that conferrable adversarial examples must exist. Targeted transferable adversarial examples lie in the set intersection of all adversarial examples for all models. Conferrable adversarial examples lie in the intersection of adversarial examples for the source model and all surrogate models, without all transferable adversarial examples for the reference models, as depicted in Figure 3.
Related work has empirically shown the existence of targeted transferable adversarial examples [liu2016delving] for a set of reference models. In our experiments, we show that targeted transferable adversarial examples also exist for the source model and its surrogate models. We motivate the second premise as follows. Assume a binary classification problem with two input variables that separates the classes cats and dogs. Given the vast amount of different states a model can reach after training, it is highly likely that models learn sufficiently different decision boundaries around some inputs. Then the surrogate model produces different predictions for these inputs compared to other reference models. The source model cannot teach any surrogate model the correct class for these examples and the source model’s errors are conferred on the surrogate with high probability. A surrogate model that is highly similar to the source model is likely to misclassify these inputs, and an adversarial perturbation into this decision space leads to higher misclassification probability for surrogate models compared to reference models. The concept of conferrability is illustrated in Figure 3 for the decision boundaries of one representative surrogate and a reference model.
We focus solely on targeted transferable adversarial examples, but we do not put constraints on the chosen target class, i.e. the specific target class for which the input is transferable does not matter. For this reason, we also evaluate untargeted adversarial attacks to generate targeted transferable adversarial examples. The goal is to produce targeted transferable adversarial examples that transfer to the surrogates but not to the reference models, to separate surrogate from reference models. For targeted adversarial examples, we require a specific wrong label for each conferrable adversarial example by the surrogate models, which does not match with a random label by the reference models with high probability. Targeted transferability is important because only targeted adversarial examples can separate whether a target model is a low-accuracy reference model or a high-accuracy surrogate model. In the non-targeted case, a reference model that randomly guesses classes is going to assign wrong classes relative to the ground truth for most of the conferrable adversarial examples, which would match the expected predictions of a surrogate model.
For a given adversarial attack, we can compute the success rates for the three measures adversarial, transferable and conferrable, as described in Section II-E. A perturbed input is adversarial with some target label , when a target model classifies the example with the target label.
The transferability score for a single perturbed input is the fraction of models in a given set that classify the perturbed input with the target label, i.e., for which the perturbed input is adversarial. Throughout the paper, we compute the transferability score only over the reference models because the surrogate models are dependent on each other and this dependence impedes comparison to related work. Transferability is maximized when all models classify the perturbed input with the target label.
We calculate conferrability scores over the set of surrogate and reference models . Conferrability is maximized when the perturbed input is transferable to the surrogate models but not transferable to the reference models.
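The three measures can be sketched as follows. The adversarial and transferability scores follow the description above. The conferrability score is written as the transferability to the surrogates multiplied by one minus the transferability to the reference models. That product is an assumption, chosen because it is maximized exactly when an example transfers to all surrogates and to no reference model. The function names and stand-in models are likewise hypothetical.

```python
import numpy as np

def adversarial(model, x, target):
    """1.0 if the model classifies x with the target label, else 0.0."""
    return float(np.argmax(model(x)) == target)

def transferability(models, x, target):
    """Fraction of models for which x is adversarial with the target label."""
    return float(np.mean([adversarial(m, x, target) for m in models]))

def conferrability(surrogates, references, x, target):
    """Assumed form: high when x transfers to surrogates but not to reference models."""
    return transferability(surrogates, x, target) * (1.0 - transferability(references, x, target))

# Stand-in models that return fixed class scores (hypothetical); the target class is 1.
hit = lambda x: np.array([0.1, 0.9])    # predicts the target label
miss = lambda x: np.array([0.8, 0.2])   # predicts another label
surrogates = [hit, hit, miss]
references = [hit, miss, miss, miss]

x = np.zeros(2)
t_sur = transferability(surrogates, x, target=1)
t_ref = transferability(references, x, target=1)
score = conferrability(surrogates, references, x, target=1)
```

Here the example transfers to two of three surrogates and to one of four reference models, so the assumed score is the product of 2/3 and 3/4.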
Generating conferrable adversarial examples requires finding a good approximation to the following non-convex optimization problem. For a source model , an input from the dataset , a set of surrogate and reference models and a target label , the goal is to find a valid perturbation that produces an adversarial example for the source model and its surrogates, but not for any reference models.
Any adversarial example that satisfies these requirements is conferrable.
IV-C Success Rates of Known Adversarial Attacks
In this section, we evaluate the success rate of known adversarial attacks for generating adversarial examples that are transferable and conferrable. For the algorithms summarized in Section II-F, namely IGM [kurakin2016adversarial], PGD [madry2017towards], DeepFool [moosavi2016deepfool] and CW- [carlini2017towards], we empirically show low success rates for generating specifically conferrable adversarial examples when executing the attacks only on the source model. This serves as motivation to create our ensemble-based approach with improved success rates for generating conferrable adversarial examples. Our ensemble-based approach is optimized with IGM and throughout the paper we refer to our attack as conferrable IGM (C-IGM).
We use the parameters summarized in Table I for all adversarial attacks. The source model is trained on CINIC [darlow2018cinic] with a ResNet20 [he2016deep] architecture and reaches validation accuracy. All surrogate and reference models are trained on the adversary’s dataset, which is a subset of the 85,000 examples from CINIC that does not overlap with the training data of the defender. Throughout the paper, these models are referred to as testing models, which we use to compute unbiased success rates for transferability and conferrability. More details about the testing models are given in Table III in Section VI. We use Keras [chollet2015keras] for machine learning and the Adversarial Robustness Toolbox [nicolae2018adversarial] to generate adversarial examples. We randomly select 300 images correctly classified by the source model as inputs for the attack and select only successful adversarial examples from the attack outputs for varying perturbation distances. We measure transferability only on the testing reference models, whereas conferrability is measured also on the surrogate models.
Our experiments depicted in Figure 4 show that out of all the known attacks, CW- and IGM have the highest success rate for generating conferrable adversarial examples. Their average conferrability score is 0.234 and 0.214 for and 0.219 and 0.194 for on CINIC. We observe that both attacks also have the highest transferability scores, 0.680 and 0.883 for and 0.680 and 0.821 for . Our experiments show that larger values for result in a higher success rate for generating conferrable adversarial examples.
We select one conferrable adversarial example generated with IGM and evaluate the robustness of that adversarial example towards random perturbation. Our goal is to assess whether the source model’s prediction of the conferrable adversarial example changes upon modifications and how the decision space of the reference models compares to the decision space of the surrogate models. We compute a random orthogonal matrix using QR-decomposition of the perturbation and add it to the original input as follows.
We take the maximal class prediction of the source model on the input and plot the result in Figure 5, with one color per class and the original input in the center. The conferrable adversarial example is robust for both parameters, and the decision space around the example looks different between the surrogate and reference models, even though two reference models agree with the surrogate models on the classification of the original input.
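One way to produce such a two-parameter decision-space plot is sketched below. The exact construction used for Figure 5 is not recoverable from the text, so everything here is an assumption: the sketch obtains a random direction orthogonal to the adversarial perturbation from a QR decomposition and evaluates a classifier on the plane spanned by the two directions around the original input. The names `orthogonal_direction`, `decision_grid` and `predict_fn` are hypothetical.

```python
import numpy as np


def orthogonal_direction(delta, rng):
    """Random direction orthogonal to `delta`, rescaled to the same L2 norm.

    ASSUMPTION: the QR decomposition of [delta, random vector] yields an
    orthonormal basis whose second column is orthogonal to delta.
    """
    d = delta.reshape(-1)
    basis = np.stack([d, rng.standard_normal(d.shape)], axis=1)  # shape (n, 2)
    q, _ = np.linalg.qr(basis)  # reduced QR, q has shape (n, 2)
    v = q[:, 1] * np.linalg.norm(d)
    return v.reshape(delta.shape)


def decision_grid(predict_fn, x0, delta, v, alphas, betas):
    """Predicted class on the plane x0 + alpha * delta + beta * v."""
    grid = np.empty((len(alphas), len(betas)), dtype=int)
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            grid[i, j] = int(np.argmax(predict_fn(x0 + a * delta + b * v)))
    return grid


# Toy usage with a stand-in two-class "model" (not a trained network).
rng = np.random.default_rng(0)
x0 = np.zeros((4, 4, 3))
delta = rng.standard_normal(x0.shape)
v = orthogonal_direction(delta, rng)
toy_model = lambda x: np.array([x.sum(), -x.sum()])
classes = decision_grid(toy_model, x0, delta, v, [-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])
```

Each cell of `classes` can then be colored by class index, once per model, to compare the decision regions of surrogate and reference models around the same example.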
IV-D Generating Conferrable Adversarial Examples
In this section, we propose our method for generating specifically conferrable adversarial examples leveraging known adversarial attacks. The idea is to create an ensemble of surrogate and reference models with one shared input and one output, which computes conferrability scores per class. We define a loss function that is minimized when the ensemble predicts a maximum conferrability score. For optimization, we use the known adversarial attack IGM to update only the input layer while the ensemble is frozen. We refer to this optimization approach as C-IGM.
We compose an ensemble model that computes average voting on the union of the predictions from the source model and its surrogates and on the predictions from the reference models.
As a next step, we compute the conferrability score to obtain the output of . We define the vector of all ones as .
Then we define a linear loss as for the prediction of the ensemble on input with target label .
Note that the output over all classes of our ensemble is not a probability distribution anymore, but the confidence score for only the target label remains a probability. Our loss function sets all values except the target confidence score to zero and considers only the prediction of the ensemble for the target class. The total loss is simply the sum of all losses. Depending on the adversarial attack, we use the same constraints to limit the perturbation magnitude, so that our ensemble approach can be compared with an adversarial attack that is executed on a single model.
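The ensemble output and the loss described in this section can be illustrated with a small NumPy sketch. The formulas themselves are not reproduced here, so this is a reading of the prose under stated assumptions: the ensemble averages the softmax outputs of the source model and its surrogates, averages those of the reference models, and combines them per class as surrogate average times (all-ones vector minus reference average). The loss is assumed to be one minus the ensemble output at the target label. The names `ensemble_confer` and `linear_loss` are hypothetical.

```python
import numpy as np


def ensemble_confer(surrogate_probs, reference_probs):
    """Per-class conferrability score of the ensemble for one input.

    surrogate_probs: array (n_surrogates, n_classes), softmax outputs of the
                     source model and its surrogates.
    reference_probs: array (n_references, n_classes), softmax outputs of the
                     reference models.
    ASSUMPTION: score = mean(surrogates) * (1 - mean(references)), elementwise.
    """
    surr = np.mean(surrogate_probs, axis=0)
    ref = np.mean(reference_probs, axis=0)
    ones = np.ones_like(ref)
    return surr * (ones - ref)


def linear_loss(ensemble_output, target_label):
    """ASSUMPTION: loss is 1 minus the score of the target class only."""
    return 1.0 - float(ensemble_output[target_label])


surrogates = np.array([[0.1, 0.9], [0.3, 0.7]])
references = np.array([[0.8, 0.2], [0.6, 0.4]])
out = ensemble_confer(surrogates, references)  # approximately [0.06, 0.56]
loss = linear_loss(out, target_label=1)        # approximately 0.44
```

In this toy case the entries of `out` sum to 0.62, which illustrates why the ensemble output is no longer a probability distribution even though each entry stays in [0, 1].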
A black-box verifiable fingerprint for a source model is a set of inputs that surrogates of the source model label similarly to the source model, while all reference models assign different labels. In our threat model, we consider surrogates that have been obtained by knowledge distillation of the source model. The existence of such a fingerprint is based on the assumption that there is some unique bias in the source model that leads to different predictions than the majority of reference models on some inputs and that this bias transfers over to surrogate models. We specify requirements for fingerprints and present our fingerprinting method that is based on conferrable adversarial examples.
A fingerprinting algorithm for deep neural networks consists of two steps: extraction and verification. The extraction step has access to a source model, some training dataset, and an oracle to label that data. The output of the extraction is the fingerprints and their verification keys for the source model. In our approach, the verification keys are the classifications of the fingerprint by the source model. The verification step determines, for a given model and key, whether that model belongs to the set of surrogates of the source model.
Extract: Has access to a source model, some training data and an oracle to provide labels for that data. Extraction outputs a fingerprint and the labels predicted by the source model on the fingerprint.
Verify: On input of a suspect model, a fingerprint and a verification key, Verify checks whether the suspect model is a surrogate of the source model and outputs 1 if and only if it accepts the suspect model as a surrogate.
Given an oracle, the source model and the distillation attack described in Section II-C, we obtain all surrogate models by distilling the source model and the reference models by training clean models. We define an auxiliary method to extract a fingerprint from a model.
Train the source model
Train surrogate models
Train reference models
A fingerprint is correct if the verification method verifies a model as a surrogate if and only if it truly is a surrogate. Only with very small probability can an adversary generate a distilled version of the source model that has similar validation accuracy and is not identified as a surrogate model by the verification.
Sample and send to the Defender
Defender wins if:
Hitaj et al. [hitaj2018have] show that an adversary can evade the black-box verification process by returning random labels when the fingerprints are easily separable from benign inputs. We specify a non-evasiveness property, which ensures that it is not possible to train a classifier that separates benign data samples from fingerprints. Such a blind fingerprint is desirable in a public verification setting because the fingerprint can be reused multiple times. We say fingerprinting satisfies non-evasiveness if the defender has a very high probability of winning the following game for a dataset of benign samples.
Set and and send to the Defender
Defender wins if
In conclusion, we require correctness and non-evasiveness for linkability. In the next section, we present our fingerprinting method.
V-B Fingerprinting by Conferrable Adversarial Examples
Our fingerprinting algorithm uses our ensemble method to generate conferrable adversarial examples on the defender’s training dataset that are used as the fingerprint. In the extraction step, we create the training models, build the ensemble as described in Section IV-D and run an adversarial attack. We chose to use IGM as the adversarial attack because it is relatively fast and leads to the best results out of all methods tested in the case where only the source model is attacked, as illustrated by Figure 4. After keeping only the examples with a conferrability score above some threshold parameter, we return the conferrable adversarial examples as fingerprints and the predictions of the source model on the fingerprints as verification keys. The extraction procedure is summarized by Algorithm 2.
Our verification sends the fingerprints to the suspect model and gets back a set of verification keys. We access only the maximum class prediction of the verification keys to compute the fingerprint retention in the suspect model relative to the original verification keys. If and only if the retention is higher than a threshold, the verification returns 1, meaning that the suspect model has been verified as a surrogate. Our verification procedure is summarized by Algorithm 1.
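The verification step described above amounts to a label-agreement check against a threshold. The following sketch is an illustration under assumptions, not a transcription of Algorithm 1: the function names `fingerprint_retention` and `verify` are hypothetical, and the threshold used in the usage example is a placeholder rather than the value chosen in the paper.

```python
import numpy as np


def fingerprint_retention(suspect_probs, verification_keys):
    """Fraction of fingerprints on which the suspect model's top-1 label
    equals the verification key (the source model's label).

    suspect_probs: array (n_fingerprints, n_classes) returned by the suspect.
    verification_keys: array (n_fingerprints,) of source-model labels.
    """
    suspect_labels = np.argmax(np.asarray(suspect_probs), axis=1)
    return float(np.mean(suspect_labels == np.asarray(verification_keys)))


def verify(suspect_probs, verification_keys, threshold):
    """Return 1 if and only if retention is strictly higher than the threshold."""
    retention = fingerprint_retention(suspect_probs, verification_keys)
    return int(retention > threshold)


# Hypothetical suspect outputs on four fingerprints.
probs = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
keys = np.array([1, 0, 1, 1])
print(fingerprint_retention(probs, keys))  # 0.75
print(verify(probs, keys, threshold=0.7))  # 1 (placeholder threshold)
```

Only the top-1 label of the suspect model is used, so the same check works when the suspect exposes hard labels instead of confidence scores.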
Our experiments are split into two parts. First, we evaluate the success rates of our ensemble-based algorithm C-IGM for generating conferrable adversarial examples. Then, we show that our proposed fingerprint meets the correctness and non-evasiveness requirements we defined earlier.
VI-A Experimental Setup
We perform our experiments on the CINIC [darlow2018cinic] dataset with 10 classes and a subset of 100 classes from the ImageNet32 [chrabaszcz2017downsampled] dataset. We choose these datasets over CIFAR-10 [cifar10] and CIFAR-100 [cifar100] because they have more examples per class, which allows evaluating the case when the adversary and defender have non-overlapping training data. For both datasets, we train a ResNet20 [he2016deep] source model and derive the surrogate models through the Retraining attack and the reference models by training on the ground truth data, as described in Section II-C. Due to limited access to computational resources, we are restricted to datasets with input sizes of 32×32 pixels and three color channels. All inputs are normalized to have 0 mean and 1 standard deviation per channel. We implemented the machine learning in Keras [chollet2015keras] and the adversarial attacks are reused from the Adversarial Robustness Toolbox [nicolae2018adversarial]. Training of the models is done on one Tesla K10 GPU. The CINIC and ImageNet32 datasets are described as follows.
CINIC [darlow2018cinic]: A resized subset of ImageNet [deng2009imagenet] with 10 classes and inputs of size 32×32 pixels and three color channels. There are 180,000 training images and 90,000 images for validation. We split the training set into two parts with no overlap and assign one part to the defender and the other to the adversary.
ImageNet32 [chrabaszcz2017downsampled]: The whole ImageNet dataset with 1000 classes resized to inputs of size 32×32 pixels with three color channels. ImageNet32 has 1.28 million training and 50,000 validation images. We manually selected a subset of 100 classes and split its images between the defender and the adversary with no overlap. The selected classes are summarized in the appendix.
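The preprocessing described in this subsection, per-channel normalization and a non-overlapping split between defender and adversary, can be sketched as follows. This is an illustrative sketch with hypothetical function names (`standardize_per_channel`, `disjoint_split`) and assumes images stored as a NumPy array of shape (N, height, width, channels). The split sizes are parameters here because the exact sizes are not restated.

```python
import numpy as np


def standardize_per_channel(images, eps=1e-8):
    """Scale images to zero mean and unit standard deviation per color channel.

    Returns the normalized images together with the statistics, so the same
    mean and std can be reused for other data (e.g. validation images).
    """
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    std = images.std(axis=(0, 1, 2), keepdims=True)
    return (images - mean) / (std + eps), mean, std


def disjoint_split(n_samples, n_defender, rng):
    """Random index split with no overlap between defender and adversary."""
    perm = rng.permutation(n_samples)
    return perm[:n_defender], perm[n_defender:]


rng = np.random.default_rng(0)
images = rng.uniform(0.0, 255.0, size=(100, 32, 32, 3))  # stand-in data
normalized, mean, std = standardize_per_channel(images)
defender_idx, adversary_idx = disjoint_split(len(images), n_defender=50, rng=rng)
```

Keeping the index sets disjoint is what allows the testing surrogates and references to be trained on data the defender never used.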
In total, we train four different types of models: training surrogates, training references, testing surrogates and testing references. For generating conferrable adversarial examples, we exclusively use training surrogate and training reference models, which we refer to as training models. For evaluating conferrability, we exclusively use testing surrogate and testing reference models, which we refer to as testing models. The testing surrogates have been trained on the adversary’s substitute dataset labeled by the source model, whereas the testing reference models have been trained on the substitute data with ground truth labels. There is no overlap between the training and testing models. A summary of the architecture and accuracy of these models is given in Table III for CINIC and Table IV for ImageNet32.
VI-B Generating Conferrable Adversarial Examples
A summary of the training models used to build the ensemble model is presented in Table III for CINIC and in Table IV for ImageNet32. For the training models, we create 14 surrogates and 15 reference models for ImageNet32 and 44 surrogates and 28 reference models for CINIC. For the testing models, we trained two models per architecture for each dataset. The tables capture the minimum and maximum values for the validation accuracy, fidelity and fingerprint retention for all models. We use IGM to create adversarial examples for the ensemble model and refer to the whole generation approach as C-IGM.
The success rate for an example being adversarial is computed on the source model, transferability is computed on the testing reference models and conferrability is computed on all testing models. Initially, we observe an overfitting effect of C-IGM on the ensemble model , which results in a conferrability score of for all generated examples on the training models, but only on the testing models. To counteract this kind of overfitting, we set the dropout rate to , which increases the conferrability score for the testing models to . The average success rate for generating conferrable adversarial examples for all attacks is summarized in Table II. Our C-IGM algorithm significantly outperforms all other approaches in producing conferrable examples on CINIC. On ImageNet32, our approach improves over CW- by a conferrability score of 0.067. The main problem with the other approaches for CINIC is that specifically IGM and CW- produce adversarial examples that are highly transferable, which reduces the conferrability score. For ImageNet32, it seems that the surrogate and reference models are more different from each other so that a reduced transferability from CW- and IGM to the reference models can be observed.
The conferrability scores for each example are illustrated in Figure 7 for CINIC and for ImageNet32. The figure shows that C-IGM produces conferrable adversarial examples with a higher maximum conferrability score than the other attacks.
Table III: Training and testing models for CINIC. Ranges are [minimum, maximum] over all models of an architecture.

| Training Surrogate | Val. accuracy | Fidelity | Fingerprint retention | Training Reference | Val. accuracy | Fidelity | Fingerprint retention |
|---|---|---|---|---|---|---|---|
| ResNet20 | [0.746, 0.764] | [0.842, 0.880] | 1.00 | ResNet20 | [0.660, 0.777] | [0.850, 0.856] | 0.00 |
| ResNet56 | [0.762, 0.774] | [0.872, 0.877] | 1.00 | ResNet56 | [0.788, 0.807] | [0.839, 0.847] | 0.00 |
| Densenet | [0.749, 0.763] | [0.826, 0.844] | 1.00 | Densenet | [0.734, 0.780] | [0.802, 0.826] | 0.00 |
| VGG16 | [0.716, 0.747] | [0.769, 0.815] | 1.00 | VGG16 | [0.750, 0.765] | [0.803, 0.814] | 0.00 |
| VGG19 | [0.790, 0.792] | [0.920, 0.924] | 1.00 | VGG19 | [0.779, 0.787] | [0.818, 0.822] | 0.00 |
| MobileNetV2 | [0.769, 0.783] | [0.875, 0.876] | 1.00 | MobileNetV2 | [0.677, 0.797] | [0.753, 0.840] | 0.00 |

| Testing Surrogate | Val. accuracy | Fidelity | Fingerprint retention | Testing Reference | Val. accuracy | Fidelity | Fingerprint retention |
|---|---|---|---|---|---|---|---|
| MobileNetV2 | [0.765, 0.771] | [0.827, 0.834] | [0.97, 0.98] | MobileNetV2 | [0.812, 0.833] | [0.780, 0.812] | 0.41 |
| ResNet20 | [0.754, 0.757] | [0.834, 0.842] | [0.91, 0.94] | ResNet20 | [0.783, 0.787] | [0.818, 0.819] | [0.45, 0.51] |
| ResNet56 | [0.763, 0.767] | [0.840, 0.841] | [0.96, 0.99] | ResNet56 | [0.808, 0.811] | [0.822, 0.827] | [0.42, 0.49] |
| Densenet | [0.753, 0.761] | [0.812, 0.819] | [0.98, 0.99] | Densenet | [0.782, 0.783] | [0.780, 0.801] | [0.55, 0.68] |
| VGG16 | [0.747, 0.760] | [0.803, 0.841] | [0.93, 0.94] | VGG16 | [0.740, 0.744] | [0.769, 0.773] | [0.64, 0.65] |
| VGG19 | [0.777, 0.781] | [0.838, 0.840] | [0.92, 0.96] | VGG19 | [0.805, 0.823] | [0.786, 0.801] | [0.30, 0.39] |
Table IV: Training and testing models for ImageNet32. Ranges are [minimum, maximum] over all models of an architecture.

| Training Surrogate | Val. accuracy | Fidelity | Fingerprint retention | Training Reference | Val. accuracy | Fidelity | Fingerprint retention |
|---|---|---|---|---|---|---|---|
| ResNet20 | [0.539, 0.555] | [0.794, 0.807] | 1.00 | ResNet20 | [0.539, 0.569] | [0.709, 0.747] | 0.00 |
| ResNet56 | [0.551, 0.564] | [0.765, 0.787] | 1.00 | ResNet56 | [0.5612, 0.601] | [0.709, 0.739] | 0.00 |
| Densenet | [0.533, 0.535] | [0.696, 0.704] | 1.00 | Densenet | [0.536, 0.560] | [0.639, 0.667] | 0.00 |
| VGG19 | [0.553, 0.566] | [0.824, 0.862] | 1.00 | VGG19 | [0.534, 0.541] | [0.667, 0.694] | 0.00 |
| MobileNetV2 | [0.586, 0.591] | [0.797, 0.812] | 1.00 | MobileNetV2 | 0.638 | 0.722 | 0.00 |

| Testing Surrogate | Val. accuracy | Fidelity | Fingerprint retention | Testing Reference | Val. accuracy | Fidelity | Fingerprint retention |
|---|---|---|---|---|---|---|---|
| MobileNetV2 | [0.5264, 0.531] | [0.660, 0.699] | [0.96, 0.98] | MobileNetV2 | [0.627, 0.631] | 0.654 | [0.58, 0.64] |
| ResNet20 | [0.531, 0.532] | [0.698, 0.704] | [0.96, 0.98] | ResNet20 | [0.543, 0.557] | [0.640, 0.650] | [0.44, 0.50] |
| ResNet56 | [0.542, 0.549] | [0.708, 0.711] | [0.98, 1.00] | ResNet56 | [0.589, 0.593] | [0.652, 0.659] | 0.5 |
| Densenet | [0.499, 0.520] | [0.615, 0.644] | [0.82, 0.98] | Densenet | [0.556, 0.563] | [0.613, 0.616] | [0.38, 0.56] |
| VGG19 | [0.519, 0.531] | [0.627, 0.642] | 0.76 | VGG19 | [0.528, 0.554] | [0.579, 0.564] | [0.28, 0.40] |
VI-C Fingerprinting Evaluation
We evaluate the adversarial attacks that generate specifically conferrable adversarial examples separately from our fingerprint method. For the fingerprinting, we additionally test the fingerprint retention for other types of removal attacks. For the adversarial attacks, we derive the set of training reference models by retraining models on the defender’s data. For evaluating conferrability, we obtain testing surrogate models by training on the adversary’s substitute dataset labeled by the source model. The testing reference models are obtained by training on the substitute dataset with ground truth labels.
Fine-Tuning: Fine-tuning continues training the surrogate model on more data labeled by the source model.
Fine-tune last layer (FTLL): Freeze all layers except for the final layer and fine-tune the model on the substitute data.
Fine-tune all layers (FTAL): Fine-tune the whole model.
Retrain last layer (RTLL): Re-initialize the last layer and train the model with all layers frozen except for the last one.
Retrain all layers (RTAL): The last layer is re-initialized, but all layers are trained.
Adversarial Training: The adversary fine-tunes the whole model on adversarial examples generated from the substitute dataset. We evaluate FGM, PGD and IGM.
Distillation: We evaluate the distillation attacks outlined in Section II-D as removal attacks. These are Retraining, Distillation as proposed by Jagielski et al. [jagielski2019high], Papernot [papernot2017practical] and Knockoff [orekondy2019knockoff].
Ensemble Attack: The adversary splits the substitute dataset into parts and trains one surrogate model on each non-overlapping subset of the training dataset. The output is an ensemble of all surrogate models with average voting.
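As an illustration of the last item in the list above, the data partitioning and voting of the ensemble attack can be sketched as follows. Training of the individual surrogates is omitted, and the helper names `split_substitute_indices` and `average_vote` are hypothetical. The sketch only assumes that each surrogate returns a matrix of class probabilities for the queried inputs.

```python
import numpy as np


def split_substitute_indices(n_samples, n_parts, rng):
    """Partition the substitute dataset into non-overlapping index subsets,
    one per surrogate model."""
    return np.array_split(rng.permutation(n_samples), n_parts)


def average_vote(per_model_probs):
    """Average voting over surrogate outputs.

    per_model_probs: array (n_models, n_inputs, n_classes).
    Returns the predicted label for each input.
    """
    mean_probs = np.mean(np.asarray(per_model_probs), axis=0)
    return np.argmax(mean_probs, axis=1)


rng = np.random.default_rng(0)
parts = split_substitute_indices(n_samples=10, n_parts=3, rng=rng)

# Hypothetical outputs of three surrogates on two inputs with two classes.
outputs = np.array([
    [[0.9, 0.1], [0.4, 0.6]],
    [[0.6, 0.4], [0.3, 0.7]],
    [[0.2, 0.8], [0.6, 0.4]],
])
labels = average_vote(outputs)
```

Because each surrogate sees only part of the source model's labels, the attack tests whether the fingerprint survives when no single model has been distilled on the full substitute dataset.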
We perform all removal attacks on the source model. For fine-tuning and adversarial training we use 10,000 inputs labeled by the source model and create 1000 adversarial examples. Model extraction uses all the data available to the adversary. We find that our fingerprint withstands all attacks with almost 100% fingerprint retention in all cases given that the surrogate is a high accuracy model. The results for CINIC are summarized in Table V and for ImageNet32 in Table VI. For CINIC, the highest fingerprint retention measured for the testing reference models is for a Densenet, which has 65% retention. The lowest fingerprint retention for the testing surrogate models is measured for a ResNet20 model with 91% fingerprint retention. The three model extraction attacks, Jagielski, Papernot and Knockoff described in Section II-D, have close to 100% fingerprint retention for high accuracy surrogate models. From these values we derive our verification thresholds, leaving small error margins, as described in Section III. A certain amount of fingerprint retention in the reference models may be unavoidable when, as in our case, the adversary’s model is trained with exactly the same model architecture, learning objective and optimizer on a dataset that comes from the same distribution as the defender’s training set.
We demonstrated that it is possible to craft conferrable adversarial examples that are highly transferable specifically to surrogate models. Also, we showed that it is feasible to create fingerprints with small perturbations in the infinity norm. Our fingerprint is non-evadable as long as adversarial examples with small perturbations are non-evadable. If that is given, our fingerprint fulfills the correctness and non-evasiveness requirements specified in Section V and is a possible solution for deciding linkability.
VII Related Work
Uchida et al. [uchida2017embedding] proposed the first watermarking scheme for neural networks. They embed the secret message into the weights during training and implement a white-box verification. Their watermark is evaluated against fine-tuning and pruning as removal attacks. Adi et al. [adi2018turning] propose overfitting the source model on abstract images to provide watermarking. Zhang et al. [zhang2018protecting] additionally propose modifying benign inputs by adding Gaussian noise or labels and small patches on top of the image and train the neural network to identify these as a backdoor-based watermark. Guo et al. [guo2018watermarking] also use perturbation to watermark. They additionally allow encoding the identity of the data owner into the watermark. Hitaj et al. [hitaj2018have] show that backdoor-based black-box verifiable watermarking schemes are vulnerable to evasion. An adversary can deploy the watermarked model, but detect out-of-distribution (backdoor) queries and return random labels. To prevent this attack, Li et al. [li2019prove] created a blind watermarking scheme that creates watermark images close to the original distribution using a GAN, so that the attacker (and the distinguisher in the GAN) cannot distinguish these from regular queries. DeepSigns is a black-box verifiable watermarking framework by Rouhani et al. [rouhani2018deepsigns]. Their framework specifically selects rare inputs as watermarks on which the deep neural network does not produce high confidence classifications. The framework assigns each watermark a random class and embeds the watermark by fine-tuning the model on the watermark with a decreased learning rate to limit utility loss of the model. All these schemes have in common that the model is overfitted on some uncommon inputs that can be used to identify the model during the verification phase. It has been shown by Shafieinejad et al. [shafieinejad2019robustness] that the backdoor-based watermarking schemes from Adi et al. [adi2018turning] and Zhang et al. [zhang2018protecting] are not robust to distillation attacks. None of the proposed schemes evaluate whether their watermark is secure against distillation attacks.
Frontier-Stitching [merrer2017adversarial] is the first watermarking scheme that uses adversarial examples close to the decision boundary as watermarks. Their idea relies on the fact that some adversarial attacks like the Fast Gradient Sign Method (FGSM) sometimes produce false adversarial examples, which do not lead to misclassification in the model but are nonetheless close to the decision boundary. They perform adversarial retraining with true and false adversarial examples to ensure that the decision boundary is minimally changed to maintain the model’s utility.
The first watermarking scheme that provides some defenses against model extraction is DAWN [szyller2019dawn], which uses an active defense and assumes the adversary has only black-box access to the watermarked model. DAWN intercepts a fraction of all queries and returns false labels to embed a watermark in the surrogate. However, their scheme is explicitly not secure against extraction attacks that use as many queries as required to train a fresh model, e.g., by querying several close images where only one is associated with a false label. In particular, it claims no security against distillation attacks. Our work is secure against an attacker with white-box access to the source model, i.e., against extraction attacks using any number of queries, and does not impact the model’s performance.
Cao et al. [cao2019ipguard] recently proposed a framework for intellectual property protection using fingerprinting similar to our work. Their main idea is to generate inputs close to the decision boundary which are shown to be robust against fine-tuning and compression. Their robustness against distillation attacks has not been tested.
Liu et al. [liu2016delving] evaluate the targeted transferability of adversarial examples in depth. Our study extends their results by a new subclass of targeted transferable adversarial examples: conferrable adversarial examples.
VII-A Other Types of Linkability
There are other lines of research for proving linkability, which deviate from our definition of linkability in various ways. The work by Yu et al. [yu2018attributing, yu2019learning] studies fingerprints for GANs where the goal is to link a given image to the generator that created it, given black-box access to all generators. Their work links outputs, while our work links models. The work by Wang et al. [wang2018great] links models for transfer learning that share the same base model. We do not consider transfer learning in this work but focus on the linkability of models for the same task. DeepMarks [chen2018deepmarks] uses the term "fingerprinting" with a different meaning than our work, referring to a watermark that is robust specifically against collusion attacks, among other attacks.
We formally define fingerprinting for deep neural networks that provides linkability in the presence of an adversary performing model derivation, including model extraction attacks. We introduce conferrable adversarial examples as a means for fingerprinting that can withstand model extraction attacks and experimentally verify their existence for deep neural networks on CINIC and a subset of the ImageNet32 dataset with 100 classes. We show that known adversarial attacks such as DeepFool, Projected Gradient Descent (PGD), Iterative Gradient Method (IGM) and Carlini-Wagner (CW-) have a relatively low success rate for producing conferrable adversarial examples. In response, we design and evaluate a new generation method as an ensemble approach for generating conferrable adversarial examples called C-IGM. We evaluate fingerprinting using our conferrable adversarial examples against the following model extraction attacks from the literature: Retraining, Papernot [papernot2017practical], Knockoff [orekondy2019knockoff] and the learning-based extraction proposed by Jagielski et al. [jagielski2019high]. We show that robustness against model extraction extends to robustness against other types of known attacks, such as fine-tuning, adversarial training and ensemble attacks. We also demonstrate that highly conferrable adversarial examples can be found using a relatively small perturbation in the infinity norm so that an adversary cannot easily evade the verification procedure.
We selected the following 100 classes from ImageNet32: [’kit fox’, ’Persian cat’, ’gazelle’, ’porcupine’, ’sea lion’, ’killer whale’, ’African elephant’, ’jaguar’, ’otterhound’, ’hyena’, ’sorrel’, ’dalmatian’, ’fox squirrel’, ’tiger’, ’zebra’, ’ram’, ’orangutan’, ’squirrel monkey’, ’komondor’, ’guinea pig’, ’golden retriever’, ’macaque’, ’pug’, ’water buffalo’, ’American black bear’, ’giant panda’, ’armadillo’, ’gibbon’, ’German shepherd’, ’koala’, ’umbrella’, ’soccer ball’, ’starfish’, ’grand piano’, ’laptop’, ’strawberry’, ’airliner’, ’balloon’, ’space shuttle’, ’aircraft carrier’, ’tank’, ’missile’, ’mountain bike’, ’steam locomotive’, ’cab’, ’snowplow’, ’bookcase’, ’toilet seat’, ’pool table’, ’orange’, ’lemon’, ’violin’, ’sax’, ’volcano’, ’coral reef’, ’lakeside’, ’hammer’, ’vulture’, ’hummingbird’, ’flamingo’, ’great white shark’, ’hammerhead’, ’stingray’, ’barracouta’, ’goldfish’, ’American chameleon’, ’green snake’, ’European fire salamander’, ’loudspeaker’, ’microphone’, ’digital clock’, ’sunglass’, ’combination lock’, ’nail’, ’altar’, ’mountain tent’, ’scoreboard’, ’mashed potato’, ’head cabbage’, ’cucumber’, ’plate’, ’necklace’, ’sandal’, ’ski mask’, ’teddy’, ’golf ball’, ’red wine’, ’sunscreen’, ’beer glass’, ’cup’, ’traffic light’, ’lipstick’, ’hotdog’, ’toilet tissue’, ’cassette’, ’lotion’, ’barrel’, ’basketball’, ’barbell’, ’pole’ ]