Extraction of Complex DNN Models: Real Threat or Boogeyman?

by Buse Gul Atli, et al.

Recently, machine learning (ML) has introduced advanced solutions in many domains. Since ML models provide a business advantage to model owners, protecting the intellectual property (IP) of ML models has emerged as an important consideration. The confidentiality of ML models can be protected by exposing them to clients only via prediction APIs. However, model extraction attacks can steal the functionality of ML models using the information leaked to clients through the results returned via the API. In this work, we question whether model extraction is a serious threat to complex, real-life ML models. We evaluate the current state-of-the-art model extraction attack (the Knockoff attack) against complex models. We reproduce and confirm the results reported in the Knockoff attack paper. However, we also show that the performance of this attack can be limited by several factors, including the ML model architecture and the granularity of the API response. Furthermore, we introduce a defense based on distinguishing queries used for the Knockoff attack from benign queries. Despite the limitations of the Knockoff attack, we show that a more realistic adversary can effectively steal complex ML models and evade known defenses.




1. Introduction

In recent years, machine learning (ML) has been applied to many areas with substantial success. Use of ML models is now ubiquitous. Major enterprises (Google, Apple, Facebook) utilise them in their products (TechWorld, 2018). Companies gain business advantage by collecting domain-specific data and training high quality models. Hence, protecting the intellectual property (IP) embodied in ML models is necessary to preserve the business advantage of model owners.

Increased adoption of ML models and the popularity of centrally hosted services have led to the emergence of Prediction-as-a-Service platforms. Rather than distributing ML models, it is easier to run them on centralized servers with powerful computational resources and to expose them via prediction APIs. Prediction APIs preserve the confidentiality of ML models while making ML-based services widely available to any client with an internet connection. Even though clients only have access to a prediction API, each response necessarily leaks some information about the model. A model extraction attack (Tramèr et al., 2016) is one where an adversary extracts information from a victim model by iteratively querying the model's prediction API with a large number of samples. Queries and API responses can then be used to train a surrogate model that imitates the functionality of the victim model. Deploying this surrogate model deprives the model owner of its business advantage. Many extraction attacks are effective against simple ML models (Papernot et al., 2017; Juuti et al., 2019), and defenses have been proposed against these attacks (Lee et al., 2018b; Quiring et al., 2018). However, extraction of complex ML models has received little attention to date. Whether model extraction is a serious and realistic threat to real-life systems remains an open question.

Recently, a novel model extraction attack targeting complex ML models has been proposed. The Knockoff attack (Orekondy et al., 2019a) extracts surrogates from complex DNN (deep neural network) models used for image classification. The attack assumes that the adversary has access to (a) pre-trained image classification models that are used as the basis for constructing the surrogate model, (b) unlimited natural samples that are not drawn from the same distribution as the training data of the victim model and (c) the full probability vector as the output of the prediction API. On the other hand, it does not require the adversary to have any knowledge about the victim model, its training data or its classification task (class semantics). The Knockoff attack paper (Orekondy et al., 2019a) reported empirical evaluations showing that the attack is effective at stealing any image classification model and that existing defenses are ineffective against it.

Goals and contributions: Our goals are twofold in this paper. First, we want to understand the conditions under which the Knockoff attack constitutes a realistic threat. Hence, we empirically evaluate the attack under different adversary models. Second, we want to explore whether and under which conditions Knockoff attacks can be mitigated or detected. We claim the following contributions:

  • reproduce the empirical evaluation of the Knockoff attack under its original adversary model to confirm that it can extract surrogate models exhibiting reasonable accuracy (54.8-92.4%) for all five complex victim DNN models we built (Sect. 3.2).

  • introduce a defense, within the same adversary model, to detect the Knockoff attack by differentiating in- and out-of-distribution queries (attacker’s queries). This defense correctly detects up to 99% of queries as adversarial (Sect. 4).

  • revisit the original Knockoff adversary model and investigate how the attack effectiveness changes with more realistic adversaries and victims (Section 5.1.1). The attack effectiveness deteriorates when

    • the adversary uses a model architecture for the surrogate that is different from that of the victim,

    • the granularity of the victim's prediction API output is reduced (returning the predicted class instead of a probability vector), or

    • the diversity of adversary queries is reduced.

    On the other hand, attack effectiveness increases when the adversary can optimize hyperparameters of the surrogate models. Furthermore, the attack effectiveness can increase when the adversary has access to natural samples drawn from the same distribution as the victim's training data. In this case, all existing attack detection techniques, including our own, are no longer applicable (Section 5.2.2).

2. Background

2.1. Deep Neural Networks

A DNN is a function F: R^n → R^m, where n is the number of input features and m is the number of output classes in a classification task. F(x) gives a vector of length m containing the probabilities p_j that input x belongs to class c_j, for j ∈ {1, ..., m}. The predicted class, denoted F̂(x), is obtained by applying the argmax function: F̂(x) = argmax_j F(x)_j. F tries to approximate a perfect oracle function O, which gives the true class y for any input x. The test accuracy Acc(F) expresses the degree to which F approximates O.

2.2. Model Extraction Attacks

In a model extraction attack, the goal of an adversary A is to build a surrogate model F_A that imitates the model F_V of a victim V. A wants to find an F_A with a test accuracy Acc(F_A) as close as possible to Acc(F_V). A builds its own dataset X_A and implements the attack by sending queries to the prediction API of F_V, obtaining the prediction F_V(x) for each query x ∈ X_A. A then uses the transfer set {(x, F_V(x)) : x ∈ X_A} to train the surrogate model F_A.

Following prior work on model extraction (Papernot et al., 2017; Juuti et al., 2019), we can divide A's capabilities into three categories: victim model knowledge, data access and querying strategy.

Victim model knowledge: Model extraction attacks operate in a black-box setting. A does not have access to the weight parameters of F_V but can query the prediction API without any limitation on the number of queries. A might know the exact architecture of F_V, its hyperparameters or its training process. Given the purpose of the API (e.g., image recognition) and the expected complexity of the task, A may attempt to guess the architecture of F_V (Papernot et al., 2017). V's prediction API may return one of the following: the full probability vector, the top-k labels with confidence scores, or only the predicted class.

Data access: The data-access capabilities of A vary across model extraction attacks. A can have access to a small subset of natural samples from F_V's training dataset (Papernot et al., 2017; Juuti et al., 2019). A may not have access to F_V's training dataset but know the "domain" of the data and have access to natural samples close to F_V's training data distribution (e.g., images of dogs in the task of identifying dog breeds) (Correia-Silva et al., 2018). A can use widely available natural samples that come from a different distribution than F_V's training data. Finally, A can construct X_A using only synthetically crafted samples (Tramèr et al., 2016).

Querying strategy: Querying is the process of submitting a sample to the prediction API. If A relies on synthetic data, it crafts samples iteratively, choosing each query to help train F_A. Otherwise, A first collects its samples X_A, queries the prediction API with the complete X_A, and then trains the surrogate model with the resulting transfer set.
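The natural-sample strategy can be sketched end to end with toy components. Everything here is an illustrative stand-in: the linear victim behind `prediction_api`, the Gaussian query set, and a 1-nearest-neighbour "surrogate" in place of training a DNN on the transfer set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical victim: a linear 3-class model hidden behind a prediction
# API that returns only the probability vector F_V(x).
W_victim = rng.normal(size=(4, 3))

def prediction_api(x):
    z = x @ W_victim
    e = np.exp(z - z.max())
    return e / e.sum()

# A collects its own samples X_A, queries the API once per sample, and
# keeps the transfer set {(x, F_V(x))}.
X_A = rng.normal(size=(500, 4))
transfer_set = [(x, prediction_api(x)) for x in X_A]

# Toy surrogate "training": 1-nearest-neighbour over the pseudo-labels,
# standing in for fitting F_A on the transfer set.
pseudo_labels = np.array([int(np.argmax(p)) for _, p in transfer_set])

def surrogate_predict(x):
    return pseudo_labels[np.argmin(np.linalg.norm(X_A - x, axis=1))]

# Agreement between surrogate and victim on fresh queries.
X_test = rng.normal(size=(200, 4))
agreement = float(np.mean([
    surrogate_predict(x) == int(np.argmax(prediction_api(x)))
    for x in X_test
]))
```

Even this crude surrogate imitates the victim's decisions on most fresh inputs, which is the essence of functionality stealing.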

3. The Knockoff Model Extraction Attack

In this section, we study the Knockoff model extraction attack (Orekondy et al., 2019a), which achieves state-of-the-art performance against complex DNN models. Knockoff works without access to F_V's training data distribution, model architecture or classification task.

3.1. Attack Description

3.1.1. Adversary model

The goal of A is model functionality stealing (Orekondy et al., 2019a): A wants to train a surrogate model F_A that performs similarly well on the classification task for which V's prediction API was designed. A can query the prediction API without any constraint on the number of queries. The API always returns the full probability vector as output for each legitimate query. A has no information about F_V, including its model architecture, internal parameters and hyperparameters. Moreover, A does not have access to F_V's training data, the prediction API's purpose or the output class semantics. Due to these assumptions, A is a weaker adversary than in previous work (Papernot et al., 2017; Juuti et al., 2019). However, A can collect an unlimited amount of real-world data from online databases for querying the prediction API. A also has no limitations on GPU resources and uses publicly available pre-trained complex DNN models as a basis for F_A.

3.1.2. Attack strategy

A first collects natural data from online databases to construct an unlabeled dataset X_A. For each query x ∈ X_A, A obtains a probability vector F_V(x) from the prediction API. Then, A uses the pseudo-labeled transfer set {(x, F_V(x))} to train its surrogate model F_A. A fine-tunes the parameters of F_A using transfer learning (Kornblith et al., 2019), repurposing the learned features of a large model already trained for a general task to the target dataset and task of F_V. In our setting, F_V offers image classification and A constructs X_A by sampling a subset of the ImageNet dataset (Deng et al., 2009).

3.2. Knockoff Evaluation

We first implement the Knockoff attack under the original adversary model explained in Section 3.1. We use the datasets and experimental setup described in (Orekondy et al., 2019a) for constructing both F_V and F_A.

3.2.1. Datasets

We use the Caltech256 (Griffin et al., 2007), CUBS (Welinder et al., 2010) and Diabetic Retinopathy (Diabetic5) (kaggle.com, 2015) datasets, as in (Orekondy et al., 2019a), for training victim models F_V, and reproduce the experiments where the Knockoff attack was successful. Caltech256 is composed of various images belonging to 256 different categories. CUBS contains images of 200 bird species and is used for fine-grained image classification tasks. Diabetic5 contains high-resolution retina images labeled with five different labels indicating the presence of diabetic retinopathy. We augmented the Diabetic5 dataset using preprocessing techniques recommended in (Chase, 2018) to address its class imbalance problem. For constructing X_A, we use a subset of ImageNet, which contains 1.2M images belonging to 1000 different categories. X_A includes 100,000 images randomly sampled from ImageNet, 100 images per class. 42% of the labels in Caltech256 and 1% in CUBS are also present in ImageNet. There is no overlap between Diabetic5 and ImageNet labels.

Additionally, we use the CIFAR10 dataset (Krizhevsky, 2009), depicting animals and vehicles divided into 10 classes, and GTSRB (Stallkamp et al., 2011), a traffic sign dataset with 43 classes. CIFAR10 contains broad, high-level classes, while GTSRB contains domain-specific, detailed classes. We evaluate the Knockoff attack with these datasets since neither of them overlaps with ImageNet and both were used in prior model extraction work (Papernot et al., 2017; Juuti et al., 2019).

All datasets are divided into a training and a testing set, as summarized in Table 1. All images in both training and test sets are normalized with mean and standard deviation statistics specific to the ImageNet dataset.

Dataset | Sample size | Classes | Train samples | Test samples
Caltech | 224x224 | 256 | 23,703 | 6,904
CUBS | 224x224 | 200 | 5,994 | 5,794
Diabetic5 | 224x224 | 5 | 85,108 | 21,278
GTSRB | 32x32 / 224x224 | 43 | 39,209 | 12,630
CIFAR10 | 32x32 / 224x224 | 10 | 50,000 | 10,000
Table 1. Image datasets used to evaluate the Knockoff attack. Different sample sizes are input to different models.
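The ImageNet normalization mentioned above can be sketched as follows. The mean/std values are the standard ImageNet channel statistics; the mid-grey sample image is illustrative:

```python
import numpy as np

# Standard ImageNet channel statistics (RGB, for inputs scaled to [0, 1]).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image):
    # image: H x W x 3 float array with values in [0, 1];
    # subtract the per-channel mean and divide by the per-channel std.
    return (image - IMAGENET_MEAN) / IMAGENET_STD

img = np.full((224, 224, 3), 0.5)   # dummy mid-grey image
out = normalize(img)
```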

3.2.2. Training victim models

To obtain complex victim models, we fine-tune the weights of a pre-trained ResNet34 (He et al., 2016) model. We train 5 complex victim models using the datasets summarized in Table 1 and name these victim models {Dataset name}-RN34. For training, we use an SGD optimizer with an initial learning rate of 0.1, decreased by a factor of 10 every 60 epochs, for a total of 200 epochs.

3.2.3. Training surrogate models

To build surrogate models, we fine-tune the weights of a pre-trained ResNet34 (He et al., 2016) model. We query F_V's prediction API with samples from X_A to obtain the transfer set. Surrogate models are trained using an SGD optimizer with an initial learning rate of 0.1, decreased by a factor of 10 every 60 epochs, for a total of 200 epochs.

3.2.4. Experimental results

Table 2 presents the test accuracies Acc(F_V) and Acc(F_A) in our reproduction, as well as the attack effectiveness reported in the Knockoff paper. Although the Acc(F_V) values of our Caltech-RN34 and CUBS-RN34 models are consistent with the corresponding values reported in (Orekondy et al., 2019a), we found that our surrogate models against Caltech-RN34 and CUBS-RN34 achieve lower Acc(F_A). This inconsistency could be a result of different samples used to construct the transfer set, since we followed the same training procedure. (We contacted the authors of the paper, and they confirmed that our experimental setup is consistent with theirs.) As shown in Table 2, the Knockoff attack performance varies considerably across victim models. The best surrogate model is obtained when F_V is GTSRB-RN34, even though A's transfer set is completely different from the GTSRB samples. The GTSRB dataset has less complex features than CIFAR10, Caltech256 or CUBS. Furthermore, GTSRB includes images that are not diverse within the same class (Hosseini et al., 2017). Therefore, it could be argued that, by querying with many ImageNet samples, the Knockoff attack can learn a better approximation of the features of GTSRB-RN34 than of the other victim models. We discuss the effect of the transfer set in more detail in Section 5.1.3.

Victim model | Acc(F_V), ours | Acc(F_A), ours | Acc(F_V), reported | Acc(F_A), reported
Caltech-RN34 | 74.6% | 68.5% (0.92×) | 78.8% | 75.4% (0.96×)
CUBS-RN34 | 77.2% | 54.8% (0.71×) | 76.5% | 68.0% (0.89×)
Diabetic5-RN34 | 71.1% | 59.3% (0.83×) | 58.1% | 47.7% (0.82×)
GTSRB-RN34 | 98.1% | 92.4% (0.94×) | - | -
CIFAR10-RN34 | 94.6% | 71.1% (0.75×) | - | -
Table 2. Test accuracy of F_V and F_A in our reproduction and as reported by (Orekondy et al., 2019a). The performance recovery (Acc(F_A)/Acc(F_V)) is given in parentheses; good surrogate models are in bold based on their performance recovery.

4. Detection of Knockoff Attacks

In this section, we present a method designed to detect queries used for the Knockoff attack. We analyze its effectiveness with respect to the capacity of the model used for detection and the overlap between A's and V's training data distributions. Finally, we investigate its limitations under a stronger adversary model.

4.1. Goals and Overview

DNNs are trained using datasets that come from a specific distribution D. Many benchmark datasets display specific characteristics that make them identifiable (e.g., cars in CIFAR10 vs. ImageNet) rather than representative of real-world data (Torralba and Efros, 2011). A DNN trained on such data might be overconfident, i.e., it gives wrong predictions with high confidence scores when evaluated on samples drawn from a different distribution. Predictive uncertainty is unavoidable when a DNN model is deployed for use via a prediction API. In this case, estimating predictive uncertainty is crucial to reduce over-confidence and provide better generalization for unseen samples. Several methods have been introduced (Hendrycks and Gimpel, 2017; Liang et al., 2017; Lee et al., 2018a) to measure predictive uncertainty by detecting out-of-distribution samples in the domain of image recognition. The Baseline (Hendrycks and Gimpel, 2017) and ODIN (Liang et al., 2017) methods analyze the softmax probability distribution of the DNN to identify out-of-distribution samples. A recent state-of-the-art method (Lee et al., 2018a) detects out-of-distribution samples based on their Mahalanobis distance (Bishop, 2006) to the closest class-conditional distribution. Although these methods were tested against adversarial samples in evasion attacks, their detection performance against the Knockoff attack is unknown. Moreover, their performance relies heavily on the choice of a threshold value, which corresponds to the rate of correctly identified in-distribution samples.

Our goal is to detect queries that do not correspond to the main classification task of V's model. In the case of the Knockoff attack, this translates to identifying inputs that come from a different distribution than F_V's training set. Queries containing such images reflect the distinctive aspects of the Knockoff adversary model: 1) availability of a large amount of unlabeled data, and 2) limited information about the purpose of the API. To achieve this, we propose a binary classifier (or one-and-a-half-class classifier) based on the ResNet architecture. It differentiates inputs drawn from the victim's data distribution from those outside it. Our solution can be used as a filter placed in front of the prediction API.

4.2. Training Setup

4.2.1. Datasets

We evaluate our detection method using the same datasets as in Section 3.2.1 and follow the train/test splits presented in Table 1. We split the 100,000 ImageNet samples into 90,000 used for training and 10,000 for testing. In these experiments, ImageNet serves the purpose of a varied and easily available dataset that A could use. Additionally, we use 20,000 uniformly sampled images from the OpenImages dataset (Kuznetsova et al., 2018). They serve the same purpose as ImageNet, and we use them as a complementary test set for our experiments.

4.2.2. Training the binary classifier

In our experiments, we examine two types of models: 1) ResNet models trained from scratch, and 2) pre-trained ResNet models with frozen weights, where we replace the final layer with binary logistic regression. In this section, we refer to different ResNet models as RN followed by the number of layers, e.g., RN34; we further use the CV suffix to mark pre-trained models with a logistic regression layer.

We shuffle ImageNet together with F_V's corresponding dataset and assign binary labels that indicate whether an image comes from F_V's distribution: label 1 for all ImageNet samples and label 0 for samples from F_V's dataset. All images are normalized according to the ImageNet-derived mean and standard deviation. We apply the same labeling and normalization procedure to the OpenImages test set. To train models from scratch (models RN18 and RN34), we use the ADAM optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001 for the first 100 epochs and 0.0005 for the remaining 100 (200 in total). Additionally, we repeat the same training procedure after removing from F_V's dataset the images whose class labels overlap with ImageNet (models RN18*, RN34*, RN18*CV, RN34*CV, RN101*CV, RN152*CV). This minimizes the risk of false positives and simulates a scenario with no overlap between the datasets. To train models with the logistic regression layer (models RN18*CV, RN34*CV, RN101*CV, RN152*CV), we take ResNet models pre-trained on ImageNet, replace the last layer with a logistic regression model and freeze the remaining layers. The logistic regression is trained using the LBFGS solver (Zhu et al., 1994) with an L2 regularizer, using 10-fold cross-validation to find the optimal value of the regularization parameter.
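The final stage of the detector can be sketched with a plain-NumPy logistic regression over frozen-backbone embeddings. The embeddings here are synthetic Gaussians, and plain gradient descent stands in for the L2-regularized LBFGS fit described above:

```python
import numpy as np

# Synthetic stand-ins for frozen-backbone embeddings of the two classes.
rng = np.random.default_rng(1)
d = 16                                           # embedding dimension (toy)
emb_out = rng.normal(loc=1.0, size=(300, d))     # ImageNet-like, label 1
emb_in = rng.normal(loc=-1.0, size=(300, d))     # victim-task data, label 0
X = np.vstack([emb_out, emb_in])
y = np.concatenate([np.ones(300), np.zeros(300)])

# Binary logistic regression with an L2 regularizer, fit by gradient descent.
w, b, lam = np.zeros(d), 0.0, 1e-2
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.1 * (X.T @ (p - y) / len(y) + lam * w)
    b -= 0.1 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
tpr = float(np.mean(pred[y == 1]))       # out-of-distribution flagged
tnr = float(np.mean(1 - pred[y == 0]))   # in-distribution passed through
```

On well-separated embeddings such as these, even this simple head reaches near-perfect TPR and TNR, which is why freezing the backbone and training only the final layer is attractive.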

4.3. Experimental Results

We divide our experiments into two phases. In the first phase, we select CUBS and train binary classifiers with different architectures in order to identify the optimal classifier. We assess the results using the rate of correctly detected out-of-distribution samples (true positive rate, TPR) and correctly detected in-distribution samples (true negative rate, TNR). In the second phase, we evaluate the performance of the selected optimal architecture using all datasets in Section 3.2.1 and assess it based on the achieved TPR and TNR.

RN18 RN34 RN18* RN34*
TPR/TNR 86% / 83% 94% / 80% 90% / 83% 95% / 82%
RN18*CV RN34*CV RN101*CV RN152*CV
TPR/TNR 84% / 84% 93% / 89% 93% / 93% 93% / 93%
Table 3. Distinguishing out-of-distribution samples from ImageNet (TPR) from in-distribution samples from the CUBS test set (TNR). Results are reported for models trained from scratch (RN18, RN34), trained from scratch excluding overlapping classes (RN18*, RN34*) and pre-trained models with logistic regression (RN18*CV, RN34*CV, RN101*CV, RN152*CV). Best results are in bold.
Ours Baseline/ODIN/Mahal. Baseline/ODIN/Mahal.
Dataset TPR TNR TPR (at TNR Ours) TPR (at TNR 95%)
Caltech 63% 56% 87% / 88% / 59% 13% / 11% / 5%
CUBS 93% 93% 48% / 54% / 19% 39% / 43% / 12%
Diabetic5 99% 99% 1% / 25% / 98% 5% / 49% / 99%
GTSRB 99% 99% 42% / 56% / 71% 77% / 94% / 89%
CIFAR10 96% 96% 28% / 54% / 89% 33% / 60% / 91%
(a) Using ImageNet as out-of-distribution test set.
Ours Baseline/ODIN/Mahal. Baseline/ODIN/Mahal.
Dataset TPR TNR TPR (at TNR Ours) TPR (at TNR 95%)
Caltech 61% 59% 83% / 83% / 6% 11% / 11% / 6%
CUBS 93% 93% 47% / 50% / 14% 37% / 44% / 14%
Diabetic5 99% 99% 1% / 21% / 99% 4% / 44% / 99%
GTSRB 99% 99% 44% / 64% / 75% 76% / 93% / 87%
CIFAR10 96% 96% 27% / 56% / 92% 33% / 62% / 95%
(b) Using OpenImages as out-of-distribution test set.
Table 4. Distinguishing in-distribution test samples from out-of-distribution samples. Comparison of our method with Baseline (Hendrycks and Gimpel, 2017), ODIN (Liang et al., 2017) and Mahalanobis (Lee et al., 2018a) w.r.t TPR (correctly detected out-of-distribution samples) and TNR (correctly detected in-distribution samples). Best results are in bold.

As presented in Table 3, the optimal architecture is RN101*CV: a pre-trained ResNet101 model with logistic regression replacing the final layer. Table 3 also shows that increasing model capacity improves detection accuracy at the cost of a significant increase in training time. RN101*CV and RN152*CV achieve the best results; for the remaining experiments we use RN101*CV, since it achieves the same TPR and TNR as RN152*CV while being faster at inference.

Prior work (Yosinski et al., 2014; Kornblith et al., 2019; Peters et al., 2019) has shown that pre-trained DNN features transfer better when tasks are similar. In our case, half of the task is identical to the pre-training task (recognizing ImageNet images), while the other half can be characterized as recognizing out-of-distribution images. Thus it might be ideal to keep the pre-trained model parameters frozen and only replace the last layer with a logistic regression layer.

Another notable benefit of using logistic regression over full fine-tuning is that the pre-trained embeddings can be pre-calculated once at negligible cost, after which full training can proceed without performance penalties on a CPU in a matter of minutes. Thus, model owners can cheaply train an effective model extraction defense. Such a defense can have wide applicability for small- and medium-scale model owners who may be unfamiliar with the risks associated with model extraction attacks. By contrast, the Knockoff attack itself requires 3 days of computation on capable hardware, which we estimate costs between 120 USD and 170 USD depending on the service provider (estimated using https://cloud.google.com/products/calculator/ and https://open-telekom-cloud.com/en/prices).
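The pre-calculation point can be sketched as follows; `backbone` is a hypothetical stand-in for the frozen pre-trained feature extractor, and the "images" are random vectors:

```python
import numpy as np

rng = np.random.default_rng(3)
proj = rng.normal(size=(32, 8))
backbone = lambda img: img @ proj      # hypothetical frozen backbone

images = rng.normal(size=(1000, 32))   # stand-ins for the training images
cache = np.stack([backbone(im) for im in images])  # computed exactly once

# Every subsequent head-training run (e.g. each cross-validation fold)
# reuses `cache` directly instead of re-running the backbone.
```

Because the backbone never changes, the expensive forward passes happen once; cross-validating the regularization parameter then only touches the cached matrix.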

Table 4 shows results for the remaining datasets. We compare our approach with existing state-of-the-art methods for detecting anomalous queries, using a public PyTorch implementation (https://github.com/pokaxpoka/deep_Mahalanobis_detector). Note that the other methods are threshold-based detectors: they require fixing the TNR to some value before detecting out-of-distribution samples. Therefore, we calculated their corresponding TPR in two setups: 1) setting the TNR to 95%, and 2) setting the TNR to that obtained with our method. Our method achieves high TPR (≥93%) on all datasets but Caltech256, and very high TPR (99%) for GTSRB and Diabetic5. Furthermore, our method outperforms the other state-of-the-art approaches at detecting out-of-distribution samples. These results are consistent with the overlap between F_V's training dataset and our subsets of ImageNet and OpenImages. GTSRB and Diabetic5 have no overlap with ImageNet or OpenImages. On the other hand, CUBS, CIFAR10 and Caltech256 contain images that represent either the same classes or families of classes (macro classes, as in CIFAR10) as ImageNet and OpenImages. This phenomenon is particularly pronounced for Caltech256, which has strong similarities to ImageNet and OpenImages. While its TPR remains significantly above the random-guess level of 50%, such a model is not suitable for deployment. Although the other methods can achieve a higher TPR on Caltech256 (87-88%), it was measured with the TNR fixed to 56%. Therefore, the other methods also fail to discriminate Caltech256 samples from ImageNet and OpenImages at a reasonable TNR of 95%.
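The fixed-TNR comparison used above can be expressed as a small helper. The scores are illustrative, with the convention that a higher score means "more likely out-of-distribution":

```python
import numpy as np

def tpr_at_tnr(scores_in, scores_out, target_tnr=0.95):
    # Choose the threshold so that target_tnr of the in-distribution
    # samples score below it, then report the fraction of
    # out-of-distribution samples scoring above it (TPR at that TNR).
    threshold = np.quantile(scores_in, target_tnr)
    return float(np.mean(np.asarray(scores_out) > threshold))
```

This is how a threshold-based detector's TPR can be evaluated both at the conventional TNR of 95% and at the TNR achieved by another method, for an apples-to-apples comparison.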

The higher the similarity between A's and V's training data distributions, the less effective our method becomes. In the worst-case scenario, where A has access to a large amount of unlabeled data that does not deviate significantly from V's training data distribution, the TNR would drop to 50%. We argue that this limitation is inherent to all detection methods that try to identify out-of-distribution samples. We also note that these methods (including ours) work better for prediction APIs with specific tasks, e.g., traffic sign recognition, as opposed to general-purpose classifiers that can classify thousands of fine-grained classes. We discuss how a more realistic A can evade these detection mechanisms in Section 5.2.

5. Revisiting the Adversary Model

In this section, we aim to identify the capabilities and limitations of the Knockoff attack under different experimental setups with more realistic assumptions. We evaluate the Knockoff attack when 1) F_V and F_A have completely different architectures, 2) the granularity of F_V's prediction API output changes, and 3) A can access data closer to F_V's training data distribution. We also discuss the effect of the transfer set X_A on surrogate model performance.

5.1. Knockoff Limitations

As can be seen from Table 2, the Knockoff attack succeeds in approximating F_V to some extent. However, the performance of the attack is limited when more realistic assumptions are made about the victim model architecture, the output format of the prediction API and A's transfer set.

5.1.1. Victim model architecture

We measure the performance of the Knockoff attack when F_V's architecture is not a pre-trained DNN model but is instead trained from scratch for its task. We construct a 5-layer GTSRB-5L and a 9-layer CIFAR10-9L victim model as described in previous model extraction work (Juuti et al., 2019). These models are trained using the Adam optimizer with a learning rate of 0.001, decreased to 0.0005 after 100 epochs, for 200 epochs in total. Surrogate models are trained as in Section 3.2.3. GTSRB-5L and CIFAR10-9L have completely different architectures and optimization algorithms from those used for the surrogate models. As shown in Table 5, the Knockoff attack performs well when F_V and F_A share the same model architecture. However, the attack is less effective when F_V is specifically designed for the given task and does not use a pre-trained model.

Victim model | Acc(F_V) | Acc(F_A)
GTSRB-RN34 | 98.1% | 92.4% (0.94×)
GTSRB-5L | 91.5% | 74.4% (0.81×)
CIFAR10-RN34 | 94.6% | 71.1% (0.75×)
CIFAR10-9L | 84.5% | 59.6% (0.71×)
Table 5. Test accuracy of F_V and F_A, and the performance recovery (Acc(F_A)/Acc(F_V)) of surrogate models.

5.1.2. Victim model output

If F_V's prediction API returns only the predicted class or truncated results, such as top-k predictions or a rounded version of the full probability vector, the performance of the surrogate model degrades. Table 6 shows this limitation, comparing a prediction API that returns the full probability vector with one that returns only the predicted class. The Knockoff attack still performs well when F_V is GTSRB-RN34; as explained in Section 3.2.4, this could be due to the comparatively simple features of GTSRB. Many commercial prediction APIs return top-k outputs for queries (Clarifai returns top-10 outputs and Google Cloud Vision returns up to top-20 outputs from more than 10,000 labels). Furthermore, we can assume that the victim model behind a real-world prediction API is trained on large databases containing complex features. Therefore, we conclude that the Knockoff attack's effectiveness degrades when it is mounted against real-world prediction APIs.

Victim model | Acc(F_V) | Acc(F_A), prob. vector | Acc(F_A), predicted class
Caltech-RN34 | 74.6% | 68.5% (0.92×) | 41.9% (0.56×)
CUBS-RN34 | 77.2% | 54.8% (0.71×) | 18.0% (0.23×)
Diabetic5-RN34 | 71.1% | 59.3% (0.83×) | 54.7% (0.77×)
GTSRB-RN34 | 98.1% | 92.4% (0.94×) | 91.6% (0.93×)
CIFAR10-RN34 | 94.6% | 71.1% (0.75×) | 53.6% (0.57×)
Table 6. Test accuracy of F_V and F_A, and the performance recovery of surrogate models, when F_A can obtain the full probability vector vs. only the predicted class from the prediction API.
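The three response granularities discussed above (full probability vector, top-k with scores, predicted class only) can be sketched as a small helper. This is an illustrative interface, not any particular commercial API's:

```python
import numpy as np

def api_response(probs, mode="probs", k=5):
    # Simulate a prediction API exposing one of three output granularities.
    probs = np.asarray(probs)
    if mode == "probs":
        return probs                       # full probability vector
    if mode == "top-k":
        idx = np.argsort(probs)[::-1][:k]  # k highest-scoring classes
        return [(int(i), float(probs[i])) for i in idx]
    if mode == "label":
        return int(np.argmax(probs))       # predicted class only
    raise ValueError(mode)
```

The less information each response carries, the weaker the training signal in the adversary's transfer set, which is exactly the degradation Table 6 quantifies.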

5.1.3. Training data construction

When constructing the transfer set X_A, A might collect images that are irrelevant to the learning task or not close to F_V's training data distribution. Moreover, A might end up with an imbalanced transfer set, where the numbers of observations per class are disproportionate. In this case, the overall Acc(F_A) might be misleading, since the class-based accuracy of F_A can be much lower than that of F_V for classes with few observations.

Figure 1 shows the relationship between class-based surrogate model accuracy and the number of observations per class in X_A when F_V is CIFAR10-RN34. The class-based accuracy of F_A is much lower than that of F_V for the "deer" and "horse" classes. When the histogram of X_A is inspected, the number of queries assigned these prediction classes is low compared with the other classes. Therefore, A should balance the transfer set by adding more observations for underrepresented classes.
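The balancing check described here amounts to inspecting the histogram of pseudo-labels in the transfer set. A sketch with synthetic labels follows; the class probabilities are illustrative, not CIFAR10 measurements:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)
num_classes, num_queries = 10, 10_000

# Synthetic pseudo-labels: the victim's predicted class per query, with
# classes 0-2 deliberately underrepresented.
p = [0.02, 0.03, 0.04, 0.06, 0.15, 0.15, 0.15, 0.15, 0.15, 0.10]
pseudo_labels = rng.choice(num_classes, size=num_queries, p=p)

counts = Counter(pseudo_labels.tolist())
# Flag classes with fewer than half the balanced share of queries; the
# adversary should collect more samples the victim maps to these classes.
balanced_share = num_queries / num_classes
underrepresented = sorted(c for c in range(num_classes)
                          if counts[c] < balanced_share / 2)
```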

We further investigated the effect of X_A by performing the Knockoff attack against GTSRB-RN34 using Diabetic5 as the transfer set. We measured Acc(F_A) as 70%. The performance degradation in this experiment supports our argument that the transfer set should be constructed carefully: in the Knockoff attack, X_A should be diverse enough to obtain a good surrogate model.

Class name Victim acc. Surrogate acc. Difference
Airplane 95% 74% (-21pp)
Automobile 97% 90% (-7pp)
Bird 92% 67% (-25pp)
Cat 89% 76% (-13pp)
Deer 95% 47% (-48pp)
Dog 88% 80% (-8pp)
Frog 97% 63% (-34pp)
Horse 96% 46% (-50pp)
Ship 96% 82% (-14pp)
Truck 96% 80% (-16pp)
Figure 1. Histogram of the adversary's transfer set, constructed by querying the CIFAR10-RN34 victim model with 100,000 ImageNet samples, and class-based test accuracy for the victim and surrogate models. The biggest differences in class-based accuracy are in bold.
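One simple remedy for underrepresented classes, suggested by the observations above, is to oversample minority classes in the transfer set until each class matches the largest one. The helper below is an illustrative sketch of this idea, not the procedure used in the Knockoff attack paper.

```python
import random
from collections import defaultdict

def balance_transfer_set(samples, labels, seed=0):
    """Oversample underrepresented classes so that every class has as many
    observations as the largest one.

    samples: query inputs collected by the adversary.
    labels:  victim-predicted class for each query (defines the classes).
    Returns a shuffled list of (sample, label) pairs.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out = []
    for y, xs in by_class.items():
        # Duplicate random members of small classes up to the target count.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out.extend((x, y) for x in xs + extra)
    rng.shuffle(out)
    return out
```

In practice the adversary would instead issue additional queries for the underrepresented classes, since duplicated samples add no new information; the sketch only shows the balancing bookkeeping.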

5.2. Knockoff improvements

A realistic adversary might know the task of the prediction API and can easily collect natural samples related to this task. Therefore, the adversary can improve its surrogate model by constructing 1) a validation dataset for hyperparameter optimization or 2) a transfer set that yields a surrogate model approximating the victim well, without the adversary being detected.

5.2.1. Hyperparameter optimization of surrogate models

Optimization of hyperparameters increases the generalization capability of DNN models. It also reduces the effect of over-fitting, resulting in better test accuracy on unseen data. For DNNs, one can tune the learning rate, momentum, number of epochs, batch size and weight decay of the training procedure. Here, we evaluate the effect of the number of epochs and the batch size as an example. We first measured the test accuracy of surrogate models over 200 epochs and found that all surrogate models over-fit: they reach their best test accuracy before 200 epochs. Figure 2 shows the epoch at which surrogate test accuracy is highest when the Knockoff attack is mounted against the Caltech-RN34, CIFAR10-RN34 and GTSRB-RN34 victims. We also ran the Knockoff attack with different batch sizes and observed that batch size has a noticeable effect on surrogate model performance. Table 7 shows that although a larger batch size is usually a reasonable choice, different batch sizes are optimal against different victims. In summary, these results imply that the adversary needs a principled strategy instead of training the surrogate model with fixed hyperparameters and deciding that training is “just enough”.

Figure 2. Evolution of surrogate test accuracy over 200 training epochs against the Caltech-RN34, CIFAR10-RN34 and GTSRB-5L victims. The epoch with the highest test accuracy is marked in the plot. All models would benefit from early stopping.
Model Victim acc. bs=16 bs=64 bs=128 bs=256
Caltech-RN34 74.6% 67.1% 68.5% 69.8% 71.8%
CUBS-RN34 77.2% 41.2% 54.8% 62.9% 68.7%
Diabetic5-RN34 71.1% 55.3% 59.3% 60.0% 60.0%
GTSRB-RN34 98.1% 92.7% 92.4% 94.7% 72.6%
CIFAR10-RN34 94.6% 66.7% 71.1% 84.6% 85.4%
GTSRB-5L 91.5% 79.8% 74.4% 79.0% 77.4%
CIFAR10-9L 84.5% 60.6% 59.6% 61.1% 62.1%
Table 7. Test accuracy of the victim and surrogate models trained with different batch sizes (bs). The best surrogate accuracy for each model is in bold. A batch size of 64 is the setting recommended in (Orekondy et al., 2019a).
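Since no single batch size wins across all victims in Table 7, an adversary with a validation set could simply run a small grid search. The following sketch assumes a hypothetical `train_fn` callable that trains a surrogate with a given batch size and reports its validation accuracy; it is an illustration of the selection step, not code from the attack.

```python
def select_batch_size(train_fn, candidates=(16, 64, 128, 256)):
    """Grid-search the batch size giving the best validation accuracy.

    train_fn: hypothetical callable; train_fn(bs) trains a surrogate model
              with batch size `bs` and returns its validation accuracy.
    Returns (best batch size, {batch size: validation accuracy}).
    """
    results = {bs: train_fn(bs) for bs in candidates}
    best = max(results, key=results.get)
    return best, results
```

The same pattern extends to learning rate, momentum or weight decay, at the cost of one surrogate training run per candidate value.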

Over-fitting can be handled properly by dividing the transfer set into training and validation datasets. The validation dataset is used to choose a set of optimal hyperparameters yielding a model that generalizes better. However, one limitation for the adversary is the lack of access to the victim's training data distribution or the prediction API's task. This implies that the adversary has no relevant validation samples with which to measure the performance of the surrogate model. If the adversary knows the prediction API's task, it can collect a small set of natural samples related to the original task and use it as a validation dataset.
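Given such a validation set, the stopping epoch can be chosen with standard patience-based early stopping. The helper below is a minimal, framework-agnostic sketch of that rule; the patience value is an illustrative assumption.

```python
def early_stopping(val_accuracies, patience=5):
    """Return the 0-indexed epoch at which training should have stopped.

    val_accuracies: validation accuracy measured after each epoch.
    patience:       stop scanning once `patience` epochs pass without
                    improvement over the best accuracy seen so far.
    """
    best_epoch, best_acc = 0, float("-inf")
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_epoch, best_acc = epoch, acc
        elif epoch - best_epoch >= patience:
            break  # no improvement within the patience window
    return best_epoch
```

In a real training loop the adversary would checkpoint the model weights at `best_epoch` rather than retrain, so the extra cost of early stopping is negligible.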

5.2.2. Access to in-distribution data

In Section 4, we observed that the larger the overlap between the adversary's and the victim's training data distributions, the less plausible detection becomes. Publicly available datasets designed for ML research, as well as vast databases accessible through search engines and data vendors (e.g. Quandl, DataSift, Acxiom), allow the adversary to obtain substantial amounts of data from almost any domain. Therefore, one can argue that making assumptions about the adversary's access to natural data (or lack thereof) is not realistic. This corresponds to the most capable, yet plausible, adversary model: one in which the adversary has approximate knowledge of the victim's training data distribution and access to a large, unlabeled dataset. In such a scenario, the adversary's queries will not differ from those of an honest-but-curious client, rendering detection techniques ineffective. As a result, it is questionable whether offering ML models via open prediction APIs is safe.

Even if model extraction can be detected through stateful analysis, highly distributed Sybil attacks are unlikely to be detected. In theory, vendors could charge their customers upfront for a significant number of queries (over 100k), making Sybil attacks cost-ineffective. However, this reduces utility for benign users and restricts access to those who can afford the down payment.

We confirmed that access to in-distribution data increases attack effectiveness by performing the Knockoff attack against the Caltech-RN34 victim using the Bing (Bergamo and Torresani, 2010) dataset. Bing has the same set of 256 object categories as Caltech256, but the images of the two datasets do not overlap. In this setup, the adversary samples 380 images per class from Bing and queries the victim's prediction API with these images to construct the transfer set. The adversary trains its surrogate model by dividing the transfer set into training and validation datasets, measuring validation accuracy in each epoch and trying different batch sizes in order to determine optimal hyperparameters. We found that the surrogate model with the optimal hyperparameters (batch size of 256, training stopped after 17 epochs) recovers 0.83× the performance of the victim. This model is far more effective than the surrogate model obtained under the original adversary model explained in Section 3.1.1 and reported in Table 2, and its queries may not be detected at all.

6. Related Work

Several methods to detect or deter model extraction attacks have been proposed. In certain cases, altering the predictions returned to API clients has been shown to significantly deteriorate model extraction attacks: predictions can be restricted to predicted classes (Tramèr et al., 2016; Orekondy et al., 2019b) or adversarially modified to degrade the performance of the surrogate model (Lee et al., 2018b; Orekondy et al., 2019b). However, it has also been shown that such defenses do not work against all attacks: extraction attacks against simple DNN models (Juuti et al., 2019; Orekondy et al., 2019b, a) remain effective even when only predicted classes are returned. While these defenses may increase the training time of the surrogate model, they ultimately do not prevent the Knockoff attack.

Other works have argued that model extraction defenses alone are not sufficient and additional countermeasures are necessary. In DAWN (Szyller et al., 2019), the authors propose that the victim can poison the adversary's training process by occasionally returning false predictions, thereby embedding a watermark in the adversary's model. If the adversary later makes the surrogate publicly available for queries, the victim can claim ownership using the embedded watermark. DAWN is effective at watermarking the model obtained using the Knockoff attack, but it requires that the adversary's model become publicly available for queries, and it does not protect against the extraction itself. Moreover, returning false predictions for the purpose of embedding watermarks may be unacceptable in certain deployments, e.g. malware detection. Therefore, accurate detection of model extraction may be seen as a prerequisite for watermarking.

Prior work found that distances between queries made during extraction attacks follow a different distribution than legitimate ones (Juuti et al., 2019). Thus, attacks can be detected using density estimation methods, in which the adversary's inputs produce a highly skewed distribution. This technique protects DNN models against specific attacks that use synthetic queries and does not generalize to other attacks, e.g. the Knockoff attack. Other methods are designed to detect queries that explore an abnormally large region of the input space (Kesarwani et al., 2018) or attempt to identify queries that get increasingly close to the classes' decision boundaries (Quiring et al., 2018). However, these techniques are limited in application to decision trees and are ineffective against the complex DNNs targeted by the Knockoff attack.

In this work, we aim to detect queries that deviate significantly from the distribution of the victim's dataset, without affecting the victim's original training task. As such, our approach is closest to the PRADA defense (Juuti et al., 2019). However, we aim to detect the strongest attack proposed to date (Orekondy et al., 2019a), which PRADA is not designed to detect. Our defense exploits the fact that Knockoff queries come from a distribution of general images for which effective classifiers already exist and are publicly available, e.g. ResNet image classifiers (He et al., 2016). It is an inexpensive yet effective defense against Knockoff attacks, and may thus have wide practical applicability. However, we believe that ML-based detection schemes open up the possibility of evasion, which we aim to investigate in future versions of this technical paper.
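The flavor of such out-of-distribution detection can be sketched with the maximum-softmax-probability baseline of Hendrycks and Gimpel (2017): queries on which an auxiliary classifier, representative of the victim's data distribution, is unconfident are flagged as suspicious. The threshold below is an illustrative assumption that would in practice be tuned on held-out in-distribution data; this is a sketch of the general idea, not our exact defense.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax along the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def flag_ood_queries(logits, threshold=0.5):
    """Flag queries whose maximum softmax probability is below `threshold`.

    logits: per-query outputs of an auxiliary classifier trained on data
            representative of the victim's distribution.
    Returns a boolean array; True marks a suspected out-of-distribution query.
    """
    confidence = softmax(np.asarray(logits, dtype=float)).max(axis=-1)
    return confidence < threshold
```

A defender would apply such a filter per client and act on the aggregate rate of flagged queries, since any single benign query can also be low-confidence.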

7. Conclusion

In this work, we extended the evaluation of the Knockoff attack and assessed it in several real-life scenarios. We also proposed an effective strategy to detect the Knockoff attack by identifying out-of-distribution queries that are unrelated to the prediction API's classification task. We showed that Knockoff attack performance decreases against realistic APIs. We also showed that the attacker's transfer set should be either diverse enough or related to the prediction API's task for a successful attack. A more realistic adversary might know the task and query the prediction API like a legitimate user while extracting information from the API. In this case, watermarking ML models (Adi et al., 2018; Uchida et al., 2017; Szyller et al., 2019) can be used to claim ownership, since watermarking allows owners to detect copyright violations after their model is stolen.

Model extraction attacks are difficult to detect or mitigate against realistic adversaries who can make unlimited queries to the API. One way to protect models against these attacks is to restrict access to the prediction API. Usage of ML models could be restricted to internal processes and channels that are not directly exposed to the user (e.g. encrypted communication between a smartphone and a server), or even to trusted hardware such as secure enclaves (e.g. Intel SGX (McKeen et al., 2016)). We conclude that protecting ML models against extraction attacks comes with a trade-off: one can restrict API access or return only the predicted class to protect the IP of the ML model, but at the cost of the prediction API's utility.


This work was supported in part by Intel (ICRI-CARS). We would like to thank the Aalto Science-IT project for computational resources.


  • Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1615–1631. Cited by: §7.
  • A. Bergamo and L. Torresani (2010) Exploiting weakly-labeled web images to improve object classification: a domain adaptation approach. In Advances in neural information processing systems, pp. 181–189. Cited by: §5.2.2.
  • C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: §4.1.
  • G. Chase (2018) EyeNet: detecting diabetic retinopathy with deep learning. GitHub. Note: https://github.com/gregwchase/dsi-capstone Cited by: §3.2.1.
  • J. R. Correia-Silva, R. F. Berriel, C. Badue, A. F. de Souza, and T. Oliveira-Santos (2018) Copycat cnn: stealing knowledge by persuading confession with random non-labeled data. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.2.
  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-fei (2009) Imagenet: a large-scale hierarchical image database. In In CVPR, Cited by: §3.1.2.
  • G. S. Griffin, A. Holub, and P. Perona (2007) Caltech-256 object category dataset. Cited by: §3.2.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.2.2, §3.2.3, §6.
  • D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. Proceedings of International Conference on Learning Representations. Cited by: §4.1, Table 4.
  • H. Hosseini, B. Xiao, M. Jaiswal, and R. Poovendran (2017) On the limitation of convolutional neural networks in recognizing negative images. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 352–358. Cited by: §3.2.4.
  • M. Juuti, S. Szyller, S. Marchal, and N. Asokan (2019) PRADA: protecting against DNN model stealing attacks. In to appear in IEEE European Symposium on Security and Privacy (EuroS&P), pp. 1–16. Cited by: §1, §2.2, §2.2, §3.1.1, §3.2.1, §5.1.1, §6, §6, §6.
  • kaggle.com (2015) Diabetic retinopathy detection. EyePACS. Note: https://www.kaggle.com/c/diabetic-retinopathy-detection/overview/description Cited by: §3.2.1.
  • M. Kesarwani, B. Mukhoty, V. Arya, and S. Mehta (2018) Model extraction warning in MLaaS paradigm. In Proceedings of the 34th Annual Computer Security Applications Conference, Cited by: §6.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.2.
  • S. Kornblith, J. Shlens, and Q. V. Le (2019) Do better imagenet models transfer better?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2661–2671. Cited by: §3.1.2, §4.3.
  • A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Cited by: §3.2.1.
  • A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv preprint arXiv:1811.00982. Cited by: §4.2.1.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018a) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pp. 7167–7177. Cited by: §4.1, Table 4.
  • T. Lee, B. Edwards, I. Molloy, and D. Su (2018b) Defending against model stealing attacks using deceptive perturbations. arXiv preprint arXiv:1806.00054. Cited by: §1, §6.
  • S. Liang, Y. Li, and R. Srikant (2017) Principled detection of out-of-distribution examples in neural networks. arXiv preprint arXiv:1706.02690. Cited by: §4.1, Table 4.
  • F. McKeen, I. Alexandrovich, I. Anati, D. Caspi, S. Johnson, R. Leslie-Hurd, and C. Rozas (2016) Intel® Software Guard Extensions (Intel® SGX) support for dynamic memory management inside an enclave. In Proceedings of the Hardware and Architectural Support for Security and Privacy 2016, HASP 2016, New York, NY, USA, pp. 10:1–10:9. External Links: ISBN 978-1-4503-4769-3, Link, Document Cited by: §7.
  • T. Orekondy, B. Schiele, and M. Fritz (2019a) Knockoff nets: stealing functionality of black-box models. In CVPR, Cited by: §1, §3.1.1, §3.2.1, §3.2.4, §3.2, Table 2, §3, Table 7, §6, §6.
  • T. Orekondy, B. Schiele, and M. Fritz (2019b) Prediction poisoning: utility-constrained defenses against model stealing attacks. CoRR abs/1906.10908. External Links: Link, 1906.10908 Cited by: §6.
  • N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §1, §2.2, §2.2, §2.2, §3.1.1, §3.2.1.
  • M. Peters, S. Ruder, and N. A. Smith (2019) To tune or not to tune? adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987. Cited by: §4.3.
  • E. Quiring, D. Arp, and K. Rieck (2018) Forgotten siblings: unifying attacks on machine learning and digital watermarking. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 488–502. Cited by: §1, §6.
  • J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel (2011) The german traffic sign recognition benchmark: a multi-class classification competition. In IEEE International Joint Conference on Neural Networks, Cited by: §3.2.1.
  • S. Szyller, B. G. Atli, S. Marchal, and N. Asokan (2019) DAWN: dynamic adversarial watermarking of neural networks. CoRR abs/1906.00830. External Links: Link, 1906.00830 Cited by: §6, §7.
  • TechWorld (2018) How tech giants are investing in artificial intelligence. Note: https://www.techworld.com/picture-gallery/data/tech-giants-investing-in-artificial-intelligence-3629737 Online; accessed 9 May 2019. Cited by: §1.
  • A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. In CVPR 2011, Vol. , pp. 1521–1528. External Links: Document, ISSN 1063-6919 Cited by: §4.1.
  • F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart (2016) Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618. Cited by: §1, §2.2, §6.
  • Y. Uchida, Y. Nagai, S. Sakazawa, and S. Satoh (2017) Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 269–277. Cited by: §7.
  • P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona (2010) Caltech-UCSD Birds 200. Technical report Technical Report CNS-TR-2010-001, California Institute of Technology. Cited by: §3.2.1.
  • J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in neural information processing systems, pp. 3320–3328. Cited by: §4.3.
  • C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal (1994) L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization. Technical report, ACM Trans. Math. Software. Cited by: §4.2.2.