Defending Against Model Stealing Attacks Using Deceptive Perturbations

by Taesung Lee, et al.

Machine learning models are vulnerable to simple model stealing attacks if the adversary can obtain output labels for chosen inputs. To protect against these attacks, it has been proposed to limit the information provided to the adversary by omitting probability scores, which significantly impacts the utility of the provided service. In this work, we illustrate how a service provider can still provide useful, albeit misleading, class probability information, while significantly limiting the success of the attack. Our defense forces the adversary to discard the class probabilities, requiring significantly more queries before they can train a model with comparable performance. We evaluate several attack strategies, model architectures, and hyperparameters under varying adversarial models, and evaluate the efficacy of our defense against the strongest adversary. Finally, we quantify the amount of noise injected into the class probabilities to measure the loss in utility, e.g., adding 1.26 nats per query on CIFAR-10 and 3.27 on MNIST. Our extensive evaluation shows our defense can degrade the accuracy of the stolen model by at least 20%, or require up to 4x more queries, while keeping the accuracy of the protected model almost intact.



1 Introduction

The success of neural networks has resulted in many web services based on them, including services providing APIs to label input samples for small sums of money. Many state-of-the-art neural network models are readily available in the literature or online, and unlabeled data (e.g., images and text corpora) are also often abundant on the web. But labeling data to train a machine learning model is expensive, difficult and error-prone even for simple tasks [ringger2008assessing]. It is even more difficult for domains requiring expert knowledge (e.g., coreference resolution in the medical domain). However, once an adversary acquires enough labels using the web service, the attacker can replicate the neural network and no longer needs to pay for the service [florian2016]. For example, current image classification services charge around $1–$10 per 1,000 queries, depending on the sophistication or customization of the model [google_price, azure_price, watson_vr_price]. Moreover, replicating neural networks greatly expands the attack surface, allowing white-box attacks against black-box services [papernot2017jacobian].

In particular, we consider a scenario in which an attacker uses the probability values returned by the base model on the cloud service to boost the model stealing process. Cloud services often provide probability values to show confidence. For model stealing, [florian2016] claim that using probabilities instead of labels alone reduces the number of required samples by a factor of 50–100. We also confirm, in Sections 4.1 and 4.4, that using probabilities improves convergence and increases the accuracy of the converged model.

To mitigate this problem, we propose adding carefully designed noise to the output probabilities that preserves the output class label of the model, so that accuracy is not harmed. We aim to force the attacker to discard the probabilities and use labels only, which is a lower bound of the optimal attack for an accuracy-preserving defense. To evaluate the performance, we consider two types of attacks. First, we identify and test an attack that can replicate an unprotected model quickly. (Footnote 1: We observe that a defense-aware attack performs worse on an unprotected model.)

Our evaluation considers diverse datasets, attack parameters, neural network architectures and domains (images and text), and shows that our approach can degrade the stolen model accuracy by 20% or more while keeping the protected model accuracy almost intact. Second, we consider defense-aware attacks, including exploiting the same defense layer, reversing the noise, using a different loss function, and using only labels. With the accuracy-preserving defense, the attacker can still get correct labels by applying argmax to the probability vector, and therefore using only labels is a lower bound of the best attack, which is much slower to converge. We aim to force the attacker to use this suboptimal attack by eliminating better attacks, and show that diverse defense-aware attacks fail to achieve better accuracy than the lower bound.

2 Related work

The problem of inferring secret model parameters by observing the output classification for a given set of inputs has been studied recently. [florian2016] claim that using high-precision confidence values and class labels obtained from a machine learning cloud service, their attack can steal base models of several types, including decision trees, logistic regression, support vector machines and simple neural networks. For simple parametric models such as logistic regression, they solve a linear system from the obtained probabilities. For decision trees, they develop a path-finding algorithm that exploits the confidence values as pseudo-identifiers for paths in the tree to discover the tree structure. For neural networks, they leverage a method that takes a set of samples, queries a randomly drawn batch, and trains the network with the output from the base model. [stealingReg] also consider stealing a machine learning parameter, but their work is limited to the regularization parameter, not the entire model.

[papernot2017jacobian] proposed stealing a black-box neural network model to generate adversarial examples. They assume the attacker has a limited amount of training data, and propose to use a Jacobian-based heuristic to find examples defining the decision boundary of the target model. We extend the analyses of [florian2016] and [papernot2017jacobian] with five datasets, four neural network architectures, and other attack methods to leverage in our defense evaluation. To our knowledge, this is the first study mitigating such model stealing attacks.

Student-teacher models have been used to compress a sophisticated teacher machine learning model into a smaller student model with fewer parameters [bucila2006model, Romero2014FitNetsHF]. Using the output probability vectors from the teacher to train the student, we can obtain a similarly performing student model with far fewer parameters. This paradigm focuses only on improving the student model with fewer parameters, unlike our goal of preventing it, and tries to leverage more information about the teacher (white-box) to better train or design the student, which is inapplicable in our cloud service scenario.

3 Methods

In this section, we propose an add-on layer that can be applied to most neural network classifiers to protect against model stealing from cloud service APIs. This layer adds a small, controllable perturbation that maximizes the loss of the stolen model while preserving the accuracy. That is, instead of attempting to detect an attack, we apply noise that has little influence on normal users, but still degrades and slows down the model stealing attack. An optimal defense should provide utility to the service consumers, while providing no measurable benefit to the adversary beyond a final label.

We consider neural networks that extract features from the input data and aggregate them through the layers to generate class probabilities for the input. Typically, the last layer is an activation function that, given the logits (the unbounded vector from the previous layer), produces probability values that range from 0 to 1 and sum to 1. In most cases, the softmax function without parameters is used. That is, most neural network classifiers of $K$ classes can be represented as $f(x) = \sigma(z(x))$, mapping input $x$ to output $y = f(x)$, where $z$ is a function to a $K$-dimensional real vector, and $\sigma$ is a normalization function (e.g., softmax) mapping a vector to probability values summing to 1.

An attacker with samples $x^i$ can query a neural network on the remote server (base model) to obtain the corresponding pseudo-labels $y^i = f(x^i)$, and train their own neural network $\tilde{f}$ (stolen model). The completely replicated network should have the minimum loss with respect to $y^i$, which is usually defined using the cross-entropy loss function. That is,

$$L = -\sum_i \sum_j y^i_j \log \tilde{y}^i_j,$$

where $y^i_j$ and $\tilde{y}^i_j$ represent the $j$-th dimension of $y^i$ and $\tilde{y}^i = \tilde{f}(x^i)$, respectively.
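As a minimal sketch of the cross-entropy objective described above, the following NumPy snippet scores stolen-model outputs against base-model pseudo-labels. The function name `stealing_loss` and the toy arrays are illustrative assumptions, not part of the original work.

```python
import numpy as np

def stealing_loss(y_base, y_stolen, eps=1e-12):
    """Cross-entropy of stolen-model outputs against base-model pseudo-labels.

    Both arrays have shape (n_samples, n_classes) and rows that sum to 1.
    """
    y_stolen = np.clip(y_stolen, eps, 1.0)  # avoid log(0)
    return float(-np.sum(y_base * np.log(y_stolen)))

# Toy pseudo-labels returned by the base model for two queries (made up).
y_base = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])

perfect = y_base.copy()                     # a perfect replica
uniform = np.full_like(y_base, 1.0 / 3.0)   # an uninformative replica

loss_perfect = stealing_loss(y_base, perfect)
loss_uniform = stealing_loss(y_base, uniform)
```

The perfect replica attains the smaller loss, which is why a defense that raises this loss for the attacker's optimizer can slow down replication.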

In this setting, we propose to add noise to the server response that results in a high loss, so that the attacker's optimizer trains the network poorly. Toward this goal, we first assume that the stolen model has already perfectly replicated the base model, and parameterize the possible perturbations of the output probabilities, under conditions, so as to maximize the loss.

In particular, the perturbed probability vector should have the following properties. First, the sum across the dimensions must be 1. Second, we should be able to control the magnitude of the perturbation. Last, the accuracy should be preserved, i.e., for and such that . To this goal, we consider additive perturbation with normalization: where is a sum-to-1 normalizer for , and is the noise function we seek with the following parameterization:



is a sigmoid function,

is a perturbation, and is a positive magnitude parameter; with a constraint for preserving the accuracy. Using a derivative test, we can find has critical points when . In particular, is maximized when for such that for , and for with low .

Instead of directly setting to maximize the loss, we use a heuristic approximation to when is the largest and otherwise, not to completely lose the probability values: where is a positive dataset and model specific convergence parameter, and is the pseudo-logit of that amplifies the behavior of and makes the perturbation comparable to the original probability. With this approximation, we obtain Reverse Sigmoid perturbation :


This function has a shape of flipped sigmoid function as shown in Figure 1, and the final perturbed probability value is computed as follows:


This function has humps that make simple inversion impossible, as depicted in Figure 1.

The main advantage of using Reverse Sigmoid is two-fold. First, we do not completely lose the meaning of the probability values, in contrast to always returning the same values for the top-1 and bottom-1 classes. Second, this functional form adds ambiguity that prevents a simple inversion. As we can see in Figure 1, over part of its range the final deceptive probability curve maps two different original values to the same output, except where the first derivative is zero, so no exact inverse function exists and inversion is difficult. The adversary cannot simply try each candidate value, because doing so results in an exponential number of options per sample. For the same reason, the network cannot be replicated perfectly even if the attacker exploits Reverse Sigmoid in the attacker's own model or uses mean squared error instead of the cross-entropy loss, which we show empirically in Section 4.4.

Figure 1: Example Reverse Sigmoid and Sigmoid activation functions.
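The Reverse Sigmoid layer described in this section can be sketched in a few lines of NumPy. The sketch assumes the commonly stated form of the perturbation, $r(y_j) = \beta\,(s(\gamma\, s^{-1}(y_j)) - 1/2)$ with perturbed output $\hat{y}_j = \alpha\,(y_j - r(y_j))$, where $s$ is the sigmoid, $s^{-1}$ its inverse (the pseudo-logit), and $\alpha$ the sum-to-1 normalizer. That form, the function names, and the parameter values `beta=0.7` and `gamma=0.2` are assumptions made for illustration.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logit(p, eps=1e-7):
    # Inverse sigmoid; clip so that probabilities of exactly 0 or 1 stay finite.
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def reverse_sigmoid(y, beta=0.7, gamma=0.2):
    """Perturb probability vectors y of shape (..., n_classes).

    Assumed form: y_hat = alpha * (y - beta * (sigmoid(gamma * logit(y)) - 0.5)),
    where alpha renormalizes each vector to sum to 1.
    """
    r = beta * (sigmoid(gamma * logit(y)) - 0.5)
    perturbed = y - r
    return perturbed / perturbed.sum(axis=-1, keepdims=True)

y = np.array([0.7, 0.2, 0.1])   # a made-up base-model response
y_hat = reverse_sigmoid(y)
```

For this example the top-1 class is unchanged while the confident class is pulled down and the unlikely classes are pushed up. Top-1 preservation is not guaranteed for every input and parameter choice, which is consistent with the small accuracy differences reported for the protected models.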

4 Experiments

In this section, we evaluate the proposed defense method. For this goal, we consider two types of attacks. We first identify the best attack for an undefended model, including query generation, parameters, and underlying models, in Section 4.1. For an undefended model, we find that the best-performing attack designed for a defended model does not perform as well as the attack designed for an undefended model. Then, we evaluate the defense methods on five image datasets and one text dataset with five measures using the identified attack in Section 4.2, and we discuss the relation between the base models and the stolen models in Section 4.3. We further evaluate the defense against possible attacks when the attacker knows more about the defense in Section 4.4, and show that these attacks cannot achieve better performance than using only labels, which is a lower bound of the best attack for the accuracy-preserving defense. Basic data augmentation (shift/flip) is applied in all training [dataAugmentation]. MobileNet [MobileNet] and Xception [Xception] are optimized with the Adam optimizer [adamOptimizer], AllConv is optimized with vanilla stochastic gradient descent as in its original paper, and RMSProp [RMSProp] is used otherwise, to achieve the best performance for each individual model.

We use the following five measures to evaluate the performance of attacks and defenses.

  • Agreement: Top-1 model accuracy of the stolen model treating the base model as ground truth.

  • Cosine: The average cosine similarity of output probability vectors of the stolen and the base models.

  • MAE (Mean absolute error): The average absolute errors of the predictions of the stolen and the base models per class.

  • KL-divergence: KL-divergence between the probabilities of the stolen and the base models.

  • Accuracy: The prediction accuracy.

The first four measures do not use labels of the test dataset, and focus on the intrinsic model replicability. Accuracy focuses on the extrinsic performance on tasks.
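The five measures listed above can be sketched as follows. The function names are hypothetical, the KL direction (base relative to stolen, in nats) is an assumption, and the arrays are toy values rather than results from the evaluation.

```python
import numpy as np

def agreement(p_base, p_stolen):
    # Top-1 accuracy of the stolen model, treating the base model as ground truth.
    return float(np.mean(p_base.argmax(axis=1) == p_stolen.argmax(axis=1)))

def cosine(p_base, p_stolen):
    num = np.sum(p_base * p_stolen, axis=1)
    den = np.linalg.norm(p_base, axis=1) * np.linalg.norm(p_stolen, axis=1)
    return float(np.mean(num / den))

def mae(p_base, p_stolen):
    # Mean absolute error per class (per output dimension).
    return float(np.mean(np.abs(p_base - p_stolen)))

def kl_divergence(p_base, p_stolen, eps=1e-12):
    # Assumed direction: KL(base || stolen), averaged over samples, in nats.
    p = np.clip(p_base, eps, 1.0)
    q = np.clip(p_stolen, eps, 1.0)
    return float(np.mean(np.sum(p * np.log(p / q), axis=1)))

def accuracy(p_model, labels):
    return float(np.mean(p_model.argmax(axis=1) == labels))

p_base = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
p_stolen = np.array([[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]])
labels = np.array([0, 1])
```

Only `accuracy` needs test labels; the other four compare the two models' outputs directly.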

We use six datasets: IMDB sentiment classification [IMDB], MNIST [MNIST], FASHION-MNIST [FASHIONMNIST], CIFAR-10 and CIFAR-100 [CIFAR], and STL-10 [STL10]. These classification datasets cover different degrees of difficulty, and a model for an easy dataset is easier to attack and harder to defend, as shown in Section 4.2. If we can defend them, it is likely we can defend larger models and more sophisticated datasets as well. For the image datasets, we randomly hold out 33% of the original training data, uniformly from each class, and assign a portion of it to the attacker after removing the labels and the probability values. Note that this partitioning leaves only 67% of the original training data for the base model on the cloud, and results in a small drop in accuracy for each of the base models compared to the state of the art. For the text dataset, we split the test data in the same way because the training data is small.
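A per-class hold-out like the one described above can be sketched as below. The helper name `stratified_holdout`, the seed, and the synthetic label array are assumptions for illustration; the original split procedure may differ in its details.

```python
import numpy as np

def stratified_holdout(labels, fraction=0.33, seed=0):
    """Return (base_idx, held_idx), holding out `fraction` of every class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    held = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)                           # random choice within the class
        held.append(idx[:int(round(fraction * len(idx)))])
    held = np.sort(np.concatenate(held))
    mask = np.ones(len(labels), dtype=bool)
    mask[held] = False
    return np.flatnonzero(mask), held

# Synthetic labels: 10 classes with 100 samples each.
labels = np.repeat(np.arange(10), 100)
base_idx, attacker_pool_idx = stratified_holdout(labels)
```

The held-out pool would then have its labels removed before any part of it is handed to the attacker.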

4.1 Threat Model and Attacks

For the evaluation, we consider the following adversarial model. The adversary knows the architecture of the model being attacked, has a given number of unlabeled input samples (# samples), and can send a fixed number of (adaptive) queries to the base model to obtain labels and probability values. We do not assume the adversary is computationally bounded for training.

We compare the following attack strategies to generate queries and use the response as the training data.

  • The attacker has a certain number of data samples in hand, and queries them to the base model.

  • Same as the previous attack, but this attack uses the top-1 class label only.

  • The attacker does not rely on any existing data samples, and instead generates uniform random queries. The attacker knows the ranges and the dimensions of the input values.

  • The attacker has a certain number of data samples, and generates more samples using the Jacobian method [papernot2017jacobian]. The stolen model is trained using the responses from the base model for both the data samples and the generated samples. The trained model is further used to generate more samples. We set

    and substitute training epochs


The attacks also use randomized image augmentation that shifts and/or flips images during training (Footnote 2: Flipping is not applied on the MNIST dataset). As a neural network is trained for multiple epochs, the same training sample is used multiple times. Adding a slight change to the image in every epoch provides much better generalization, and results in better test accuracy [dataAugmentation].
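The difference between the query strategies listed above comes down to which inputs are sent and which part of the response is kept as a training target. The sketch below uses hypothetical helper names and a made-up API response to illustrate the probability-based targets, the label-only targets, and uniform random query generation.

```python
import numpy as np

def probability_targets(api_probs):
    # Keep the full probability vectors returned by the base model.
    return np.asarray(api_probs, dtype=float)

def label_only_targets(api_probs):
    # Keep only the top-1 class, encoded as a one-hot vector.
    api_probs = np.asarray(api_probs, dtype=float)
    one_hot = np.zeros_like(api_probs)
    one_hot[np.arange(len(api_probs)), api_probs.argmax(axis=1)] = 1.0
    return one_hot

def uniform_queries(n, shape, low=0.0, high=1.0, seed=0):
    # Random inputs drawn within the known input range and dimensions.
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=(n, *shape))

api_probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]  # made-up responses
soft = probability_targets(api_probs)
hard = label_only_targets(api_probs)
queries = uniform_queries(4, (28, 28))
```

An accuracy-preserving defense leaves `hard` unchanged by construction, which is why the label-only attack serves as the lower bound discussed in this paper.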

We compare these attacks with various parameters, identify the attack most successful at replicating the base model, and evaluate our defenses against the strongest attack. We use a simple convolutional network, denoted by Simple, for both the base and the stolen models: 64 conv, 64 conv, max pooling, 128 conv, 128 conv, max pooling, 256 dense, 256 dense, and softmax layers.

Figure 2: Performance of three different model stealing approaches.
Figure 3: Jacobian augmented inputs

Query Generation

The most limiting resource for an adversary is real data samples to query the base model (e.g., medical records, employee faces). Therefore, we first evaluate the attack methods on a varying number of samples, setting the attacker budget to 50,000 queries and the training to 16,000 steps with a batch size of 64. Figure 2 shows the performance of the three attack strategies against the number of samples on the MNIST dataset. We can see that the attack that queries real data samples performs best across the numbers of samples. Like all other attack methods, it leverages data augmentation to generate more samples. Our preliminary experiment showed that using data augmentation significantly improved the test accuracy (approx. 85% increase with 150 samples). The Jacobian-based attack also steals the model with high accuracy, but its replication is much slower, especially at the beginning. For example, with data augmentation, the former reaches 97% agreement with 300 samples, but the latter goes only up to 96% with 19200 samples (64x as many samples).

The Jacobian augmentation is used to push samples towards the boundaries of each class, in the direction of greatest increase in the loss function, to generate samples defining the decision boundary. When applied to an input image, it successfully probes the classification boundary of the model, but it does so by adding an imperceptible perturbation to the image that causes the neural network to misclassify the input [papernot2016limitations]. Thus, using the Jacobian generates adversarial examples that are misclassified by the base model (e.g., the examples in Figure 3 are all misclassified as indicated). Although they are useful in the adversarial example generation scenario [papernot2017jacobian], using these labels from the base model teaches the replicated model the wrong classification.
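One augmentation round of the Jacobian method can be sketched as follows. This follows the usual statement of Jacobian-based dataset augmentation, in which each sample is moved by a step `lam` along the sign of the substitute model's gradient for the label returned by the base model. The linear softmax substitute, the step size, and all variable names are simplifying assumptions chosen so that the Jacobian has a closed form.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def jacobian_row(W, b, x, k):
    """Gradient of F_k(x) w.r.t. x for the toy substitute F(x) = softmax(W x + b)."""
    p = softmax(W @ x + b)
    return p[k] * (W[k] - p @ W)

def jacobian_augment(W, b, X, oracle_labels, lam=0.1):
    """Return the original samples plus one perturbed copy of each."""
    new = [x + lam * np.sign(jacobian_row(W, b, x, k))
           for x, k in zip(X, oracle_labels)]
    return np.vstack([X, np.array(new)])

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))                 # toy substitute: 3 classes, 4 features
b = np.zeros(3)
X = rng.normal(size=(5, 4))                 # attacker's current samples
oracle_labels = np.array([0, 1, 2, 0, 1])   # stand-in for base-model labels
X_aug = jacobian_augment(W, b, X, oracle_labels)
```

Each round doubles the sample set, and the new samples are queried against the base model before the substitute is retrained.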

On the other hand, although the uniform random attack has high training accuracy, its performance on test data is poor, giving the accuracy of a random guess. Most of the samples it generates fall in one class, which trains the replicated model to predict just one class regardless of the input. This shows the importance of having legitimate data samples to query the model. Based on these results, we use the attack that queries real data samples, which most accurately replicates the model in the defense-unaware scenario.

Figure 4: Agreement with the base model for various types of attacker model architectures.

Model Architecture

The attacker has a choice of their own models to train. We test training the Simple model described above, as well as MobileNet [MobileNet], AllConv [AllConvNet], and Xception [Xception] models. More complex models such as ResNet [ResNet] and Inception [InceptionV3] are not applicable to the test datasets we use due to the input image dimensions.

Figure 4 compares the performance of the different replicated model architectures. (Footnote 3: Note that the performance degradation is due to the use of only two thirds of the original training data, less the attacker portion.) As we can expect, the simplest model learns fastest. While approaches such as Xception and MobileNet work well with larger data [MobileNet, Xception], they are not suitable in a model stealing scenario with relatively few samples. Also, we find that AllConv not only learns relatively fast, but also performs well for all datasets. Therefore, we use AllConv throughout the rest of the experiments. For the IMDB dataset, we use the most popular approach, a Bidirectional LSTM [BiLSTM].

4.2 Defense Evaluation

Now we compare defense methods against attack method with AllConv model. As a defense, we consider adding one of the following perturbations to the output probability vector, where is uniform random noise between -1 to 1, and , and are parameters.

  • Uniform Random: .

  • Uniform Random Concave: .

  • Uniform Random Convex: .

  • Ranking-preserving Uniform Random: Same as Uniform Random, but the ranking of output classes is preserved to maintain the accuracy.

  • Sine: .

  • Reverse Sigmoid: a stretched and reversed sigmoid explained in this paper.

Figure 5: Agreement of stolen model with base model using different defense types.
0.93 0.95 0.82 0.48 0.62 0.84
0.99 -0.31 0.99 -0.29 0.85 -0.45 1.00 -0.98 0.99 -0.62 1.00 -0.40
0.68 0.70 0.40 0.02 0.37 0.60
0.60 +0.38 0.56 -0.47 0.79 -0.13 0.82 -0.80 0.55 -0.43 0.72 -0.12
0.98 0.09 0.66 0.02 0.12 0.60
0.88 +0.05 0.86 +0.05 0.69 -0.15 0.82 -0.80 0.92 -0.35 0.72 -0.13
0.93 0.91 0.54 0.02 0.57 0.59
0.89 +0.04 0.86 +0.04 0.70 -0.12 0.82 -0.80 0.98 -0.39 0.72 -0.12
0.93 0.90 0.58 0.02 0.59 0.60
1.00 -0.13 1.00 -0.05 1.00 -0.54 1.00 -0.77 1.00 -0.45 1.00 -0.24
0.87 0.95 0.46 0.23 0.55 0.76
0.89 +0.03 0.91 -0.02 0.77 -0.03 0.56 -0.08 0.61 +0.03 0.84 -0.04
0.92 0.89 0.74 0.48 0.64 0.80
Table 1: Agreements with 19200 queries. : base model, : stolen model, : protected model.

The goal of the defense is twofold: 1) prevent model replication (low stolen model agreement), and 2) retain the performance of the protected model (high protected model agreement). The best parameters for the defense methods are searched using grid search to achieve the highest cosine similarity to the original model with at least 20% accuracy drop in stealing with attacker budget of 19200. When we cannot find parameters satisfying 20% accuracy drop in stealing, we instead choose the parameters with the highest accuracy drop in stealing.
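The parameter selection rule described above can be written as a small function. The `Trial` record, its field names, and the numbers in the example are hypothetical placeholders for grid-search results, and the drop is treated here as a plain fraction supplied by the caller.

```python
from typing import NamedTuple, Sequence

class Trial(NamedTuple):
    params: dict          # defense parameters tried in the grid search
    cosine: float         # cosine similarity of protected vs. original outputs
    stealing_drop: float  # accuracy drop of the stolen model

def select_defense_params(trials: Sequence[Trial], min_drop: float = 0.20) -> Trial:
    """Pick the most faithful setting among those that hinder stealing enough."""
    feasible = [t for t in trials if t.stealing_drop >= min_drop]
    if feasible:
        # Highest similarity to the original model, subject to the required drop.
        return max(feasible, key=lambda t: t.cosine)
    # No setting reaches the required drop: fall back to the largest drop.
    return max(trials, key=lambda t: t.stealing_drop)

# Made-up grid-search results for illustration only.
trials = [
    Trial({"beta": 0.1}, cosine=0.95, stealing_drop=0.05),
    Trial({"beta": 0.5}, cosine=0.70, stealing_drop=0.25),
    Trial({"beta": 0.9}, cosine=0.40, stealing_drop=0.45),
]
best = select_defense_params(trials)
fallback = select_defense_params(trials[:1])
```

The fallback branch corresponds to the case where no parameter setting achieves the 20% drop.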

Figure 5 and Table 1 show the experimental results. We see that the proposed Reverse Sigmoid defense consistently slows down the replication process (Figure 5), and can drop the stolen model agreement by more than 20% for all tested datasets (Table 1). Also, the protected models still have high agreements. Using sinusoidal wave noise does not protect the model for any of the tested parameter combinations. The final perturbation made by sine is small because the sine terms applied to each class can interfere with each other, and the normalization leaves only a small effect. While Uniform Random may look effective for some datasets like FASHION MNIST and CIFAR-100, the agreements of the protected models are not reliable. In the case of Ranking-preserving Uniform Random, we can achieve perfect agreements, but the effectiveness of the defense is unpredictable (e.g., FASHION MNIST vs. CIFAR-10).

Dataset         Base acc.   Protected acc.   Stolen acc.   Cos(base, protected)   MAE(base, protected)   KL-div(base, protected)
MNIST           0.99        0.99             0.68          0.41                   0.17                   3.27
FASHION MNIST   0.87        0.87             0.66          0.47                   0.16                   1.72
CIFAR-10        0.75        0.73             0.36          0.61                   0.48                   1.26
CIFAR-100       0.43        0.43             0.02          0.33                   0.02                   2.30
STL-10          0.51        0.51             0.31          0.74                   0.16                   2.20
IMDB            0.81        0.81             0.60          0.80                   0.39                   0.48
Table 2: Effects of the Reverse Sigmoid defense. The stolen model is stolen from the protected model.

Table 2 shows the side effects of the Reverse Sigmoid defense. The protected model accuracy is well maintained because the defense tends not to change the labels if one class is dominating. To measure how much noise is added to the probability vectors, we use the average cosine similarity, the mean absolute error per dimension, and KL-divergence of the protected model and the base model outputs. In particular, we see that the mean absolute error is less than 0.2 in most datasets. This noise can be higher if the model prediction is more confident. However, in this case, the decision is unlikely to be affected in many applications.

4.3 Base Model Accuracy versus Reverse Sigmoid Protection Effectiveness

Model       Base acc.   Protected acc.   Stolen acc.   Acc. drop
AllConv     0.75        0.73             0.36          0.37
Simple      0.70        0.69             0.50          0.19
Xception    0.36        0.36             0.23          0.13
MobileNet   0.24        0.22             0.21          0.01
Table 3: Base model and stealing on CIFAR-10

Since we add noise to the confident classes, the output probabilities of a model can affect the performance of the Reverse Sigmoid defense. To see this, we train different models on the CIFAR-10 dataset, resulting in different accuracies. Then, we apply the Reverse Sigmoid protection, and try the attack with AllConv. In Table 3, we see that if the original accuracy (base model accuracy) is higher, the defense is more effective (higher accuracy drop). This is especially important because when the accuracy of the original model is high, there is higher demand for protection.

Figure 6: Performance difference of model stealing using all class probabilities vs. only the top-1 label on CIFAR-10.

4.4 Robustness of Reverse Sigmoid against Defense-aware Attack

We test the robustness of the Reverse Sigmoid defense on the CIFAR-10 dataset when the attacker knows more about the defense.

Attack using the same defense layer

If the defense and its parameters become known to the attacker, the attacker can use exactly the same defense layer in their model. The Reverse Sigmoid defense is not a standard neural network layer, and it is ambiguous in that multiple logit values are mapped to the same probability by the defense layer. This essentially propagates wrong gradient values to the model, degrading the stealing process, as shown in Table 4.

Attack using the mean-squared-error loss function

To replicate the model output more precisely, the attacker may use the mean-squared-error (MSE) loss function instead of the more common cross-entropy loss function, for which the Reverse Sigmoid defense is designed. However, for the same reason as the attack using the same defense layer, this loss function is not free from the ambiguity, and still shows poor performance, as shown in Table 4.

Attack using an inversion mapping

We now evaluate the adversary's ability to recover the original unprotected class probabilities from the protected model, assuming they have full knowledge of the parameters of the protection. In the first attack, we compute a linear regression between the class probabilities of the base model and those of the protected model, and use the regression followed by normalization to attempt to recover the unprotected probabilities. In the second attack, we use a simple multilayer perceptron (MLP) model with two hidden layers, trained on 16,670 real output pairs from the base model and the protected model by optimizing a loss over these pairs. We compute the KL divergence of both the recovered probability values and the protected probability values against the unprotected model in Figure 7, where a perfect attack corresponds to a KL divergence of zero. Both attacks are able to recover some of the information removed by the defense, but neither is able to fully recover the unprotected model output. In the linear regression model, a fixed amount of information is recovered, while the MLP is able to recover more information. However, the attack's capability diminishes with the amount of noise that was added to the sample. Finally, we find that the MLP attack does not increase the stolen model agreement, as shown in Table 4.
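The linear regression variant of the inversion attack can be sketched with ordinary least squares. Everything in this sketch is an assumption made for illustration: the base-model outputs are synthetic Dirichlet samples, the defense uses the commonly stated Reverse Sigmoid form with arbitrary `beta` and `gamma`, and no intercept is fitted because every protected vector already sums to 1.

```python
import numpy as np

def reverse_sigmoid(y, beta=0.7, gamma=0.2, eps=1e-7):
    # Assumed form of the defense: y - beta * (sigmoid(gamma * logit(y)) - 0.5), renormalized.
    p = np.clip(y, eps, 1.0 - eps)
    r = beta * (1.0 / (1.0 + np.exp(-gamma * np.log(p / (1.0 - p)))) - 0.5)
    out = y - r
    return out / out.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
y_base = rng.dirichlet(np.full(10, 0.3), size=2000)  # synthetic stand-in for base outputs
y_prot = reverse_sigmoid(y_base)                     # what the attacker observes

# Least-squares map from protected outputs back to unprotected outputs.
coef, *_ = np.linalg.lstsq(y_prot, y_base, rcond=None)
raw = y_prot @ coef

# Normalize the regression output back into probability vectors.
recovered = np.clip(raw, 1e-12, None)
recovered /= recovered.sum(axis=1, keepdims=True)

mse_protected = float(np.mean((y_prot - y_base) ** 2))
mse_regressed = float(np.mean((raw - y_base) ** 2))
```

Because the identity map is one of the candidate linear maps, the fitted regression cannot have a larger squared error on these pairs than the protected outputs themselves. Since the perturbation is not invertible, some residual error is expected to remain.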

Figure 7: Recovering unprotected class probabilities using linear regression and MLP
Attack Agree.
Same Defense Layer 0.10
MSE Loss 0.18
Inversion (MLP) 0.22
Argmax 0.78
Table 4: Agreements of attack variants

Attack using labels only (argmax)

As shown in Section 4.1, using argmax requires a larger budget than attacks using the probabilities. However, the attacker can simply take the top-1 class of the result, which is usually correct. This approach achieves much better replication of a defended model given enough budget, as shown in Table 4. Still, by forcing the attacker to discard probability values, the attacker has to use 4x as many queries in our evaluation on the CIFAR-10 dataset. That is, using probability values requires 4800 queries to reach 0.7813 agreement, but without probability values, 19200 queries have to be used to reach the same agreement. Besides the agreement, we can see lower cosine similarity and higher KL-divergence (Figure 6), even with 19200 queries, meaning that while the top-1 decision is quickly learned, the traits of the network, including its output distributions, take more queries to replicate. This difference can limit the generalization power, and it can be especially important if the attacker further wants to use the model for adversarial example generation [papernot2017jacobian]. Finally, note that this is a degenerate attack that applies to any defense that maintains top-1 accuracy.

5 Conclusion

Neural networks are becoming one of the key assets of an enterprise, but they are vulnerable to stealing attacks. We proposed a method that can be applied to a wide variety of neural network models and evaluated the protection performance over five datasets, four neural network architectures, and diverse threat models and attack parameters. Our approach either prevented the stealing entirely or slowed down the stealing process by up to 4x in the worst case, when the attacker knows the defense.