Thief, Beware of What Get You There: Towards Understanding Model Extraction Attack

04/13/2021 ∙ by Xinyi Zhang, et al. ∙ HUAWEI Technologies Co., Ltd. ∙ Nanyang Technological University

Model extraction increasingly attracts research attention, as keeping commercial AI models private can retain a competitive advantage. In some scenarios, AI models are trained proprietarily, where neither pre-trained models nor sufficient in-distribution data are publicly available. Model extraction attacks against these models are typically more devastating. Therefore, in this paper, we empirically investigate the behavior of model extraction under such scenarios. We find that the effectiveness of existing techniques is significantly affected by the absence of pre-trained models. In addition, the impacts of the attacker's hyperparameters, e.g. model architecture and optimizer, as well as the utility of information retrieved from queries, are counterintuitive. We provide some insights into the possible causes of these phenomena. With these observations, we formulate model extraction attacks into an adaptive framework that captures these factors with deep reinforcement learning. Experiments show that the proposed framework can be used to improve existing techniques, and that model extraction is still possible in such strict scenarios. Our research can help system designers construct better defense strategies based on their scenarios.




I Introduction

With Artificial Intelligence capabilities increasingly embedded in enterprise tools, products, and services, companies that have adopted AI in their business have grown and expanded rapidly. AI has become one of the important factors for them to gain competitive advantages over other players. Thus, to retain their advantage, protecting this intellectual property from competitors is an important topic.

To maximize the advantage, AI capability is typically deployed as a service that is available to the public, to customers, or to applications. There have been rising concerns about model extraction attacks through such services ever since [36] showed that it is possible to extract a similar model from an available API. A series of attacks and defense mechanisms with different assumptions have been proposed, and the concern has extended to more complicated models such as ResNet [13] and BERT [7].

Most current research on model extraction and defense is empirical, with only a few exceptions such as [17]. Many empirical model extractions are conducted with public datasets and pre-trained models, and thus assume an attacker who has access to a similarly trained model and in-distribution datasets. Although this is a straightforward approach for an attacker trying to extract these models, there are scenarios where the models are trained with in-house data, and neither the data distribution nor a pre-trained model is available to the attacker. In this paper, we discuss this less addressed scenario. Moreover, separating the contributing factors of model extraction, in other words, studying the effectiveness of model extraction without a pre-trained model or in-distribution dataset, will help us understand the fundamentals of model extraction.

Intuitively, there are two important factors that could affect the effectiveness of model extraction, namely, the attacker model's hyperparameters and the information retrieved from the victim through queries. We first conduct experiments to analyze the effect of model structure, and show that the optimal structure cannot simply be determined by the victim model structure, and thus should be configured adaptively.

Secondly, we conduct experiments to show that the learning setting in model extraction can be counter-intuitive compared to normal model training. For example, due to the query distribution, the Adam [19] optimizer might lead to a significantly worse result than SGD. A learning rate and number of training epochs that would lead to overfitting in normal model training might be favorable for model extraction.

Lastly, we examine the effect of the query set. Specifically, we study the extraction effect in three typical scenarios: when the adversary does not have data of specific labels, when the amount of in-distribution data is limited, and when the adversary does not have any in-distribution data. The experiments show that extraction is still possible even if the malicious party does not possess the corresponding data.

Since the adversary will always try to expand the query set until the query budget is met, the extraction effect also directly depends on the query data generation strategy. In view of the aforementioned constraints and phenomena, we take one more step towards understanding model extraction attacks by proposing a heuristic extraction framework utilizing deep reinforcement learning, with which we learn the optimal query strategy. Experiments show that one example implementation of our framework can outperform the typical FGSM [11] method, and can extract a reasonable model even without any in-distribution data and within a limited budget. This should be taken into account by security architects who intend to rely on the secrecy of data to achieve model confidentiality.

The rest of the paper is organized as follows: in section II we briefly summarize existing works that are related to this research; in section III we introduce the notation and problem formulation of this paper; in section IV we present our observations and analysis; in section V we propose and validate the model extraction framework.

II Related Work

II-A Model Extraction

Model extraction (ME) typically refers to extracting non-public information, such as functionality or parameters, from a black-box Machine Learning model. In the context of AI Security, it is also known as a Model Stealing Attack because it is unauthorized. Earlier works on model stealing attacks focus on simple ML models like SVMs or decision trees. As Deep Learning gains popularity, recent works [30, 37, 20] have started to explore complex deep neural networks. These works present the effectiveness of their attack methodologies in Computer Vision (image classification) or Natural Language Processing settings. Most of these works rely on open-source pre-trained weights, high-quality data, or sophisticated models. However, such prior knowledge could be the bottleneck of attacks in practical situations. In our work, we study model extraction attacks in stricter situations and point out critical but rarely discussed factors.

II-B Knowledge Distillation

A closely related research topic is knowledge distillation (KD), which refers to transferring the functionality from one model to another. It typically involves a teacher model, which is structurally complex, and a student model, which is simpler and easier to deploy. It was first proposed by [15] and has been continually studied since for better efficiency and performance [29, 25]. We argue that knowledge distillation is essentially a special case of a functionality-targeted Model Extraction problem. In contrast to the black-box setting in Model Extraction, knowledge distillation can leverage knowledge of the teacher model, its training data, the activations [41, 27], and structural information [40] for various improvements. In this paper, we combine some phenomena that occurred in our experiments with observations [39, 5] made in KD, to better understand the effectiveness of ME.

II-C Adversarial Example

With human-imperceptible synthesized noise [4] added to the input, adversarial examples (AE) are known to be capable of greatly affecting the output of neural networks. Because they manipulate the model output, AEs are also used in KD [14] and ME [31] as a way to find the decision boundary. The explanation of AEs remains unclear, but multiple theories [9, 35, 16] have been proposed. Reference [16] interpreted AE noise as non-robust features, an interpretation we adopt in our method to enrich the information retrieved from the victim and push the distribution of the sampled data toward a desired state.

II-D Active Learning

Active learning (AL) is a technique where training data is adaptively selected [34, 42] based on the state of the evolving model. The purpose of AL is to save data labeling effort, which, in the context of ME, corresponds to interaction with the target black-box model. In this work, we use deep reinforcement learning (DRL), which has been shown to achieve human-level control [28] in many systems, to conduct active learning in our proposed example algorithm in section V.

III Problem Formulation

In this paper, we mainly follow the notation defined in [30] and consider the setting where an attacker wants to train an attacker model by querying a victim model, which is trained on a private dataset. The attacker is provided with a data pool containing labeled, non-labeled, or no data. The attacker has a query budget, that is, the number of times he can query the victim model. The set of queries he constructs is referred to as the transfer set.

We consider the model extraction problem as “Task Accuracy Extraction” as defined in [17]. That is, we analyze how the queries, the model structure, and the learning parameters affect the attacker model's accuracy over the task distribution.
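For reference, the Task Accuracy Extraction objective of [17] can be written roughly as follows. The symbols are assumptions made for this sketch, since the original notation is not recoverable here: $F_A$ stands for the attacker model and $\mathcal{D}_A$ for the task distribution over inputs $x$ and labels $y$.

```latex
\max_{F_A} \; \Pr_{(x, y) \sim \mathcal{D}_A} \left[ \arg\max_{k} \, F_A(x)_k = y \right]
```

Under this reading, the attacker is judged only by accuracy on the true task, not by how closely it reproduces the victim's individual predictions.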


We study the following two factors which typically affect the effectiveness of model extraction:

III-1 Query Utility

As some existing works have pointed out, the efficiency of model extraction is affected by the utility of the information retrieved from queries, including: (1) whether the queries are in-distribution, i.e. whether they come from the task distribution, or even from the victim's training set; (2) whether the queries are balanced across the labels; (3) whether the queries sufficiently explore the input space.

III-2 Hyperparameters

We analyze the effect of model hyperparameters by comparing the learning process of the following three cases: where the attacker uses a more complex structure, the same structure, and a simpler structure. We also analyze the effect of learning hyperparameters such as optimizer.

 (%)  Pre-trained  (%)
Caltech256 ResNet-34 78.4 ResNet-34 Caltech256 ImageNet 78.1
ResNet-34 N/A 20.1
ResNet-50 21.1
ResNet-18 21.8
ResNet-50 85.2 ResNet-18 Caltech256 22.4
ImageNet 29.7
ImageNet ResNet-50 84.5 ResNet-18 ImageNet 23.6
ResNet-50 18.1
ResNet-50 77.8
: Observed best top-1 test accuracy.
Random sampling as in the original paper applied.
N/A: The attacker model is not trained on any dataset before the start of the extraction process.
TABLE I: Knockoff Nets Performance with Varying Attacker Model Pre-trained Weights

IV Observation and Analysis

Most existing works study model extraction with image classification tasks and therefore assume the attacker starts with a pre-trained model (e.g. ResNet-34 on ImageNet [6]). While this is reasonable when attackers try to extract models for common tasks, it might not be the case for some commercially critical tasks, where even the training data are confidential. In fact, model extraction attacks are more probable and severe in these cases. Nevertheless, for better comparison with existing works, we conduct experiments with image classification tasks. However, for most of our analysis, we assume neither that the attacker starts with a pre-trained model nor that he has plentiful in-distribution data.

We study the effects of the aforementioned factors by extending the experiments in Knockoff Nets [30]. All tests are conducted on the standard test dataset or using the same train-test split as in the original paper. The victim model is assumed to return the softmaxed classification probability (soft label) of the queried input. We organize our observations into the following three categories:

IV-A Pre-trained Weight

In this section, we discuss the effect of pre-trained weights and the role they play in the whole extraction process. We start by replicating Knockoff Nets random sampling on Caltech-256 [12]. Then, instead of using ResNet-34 pre-trained on ImageNet, we apply Xavier [10] initialization to three types of attacker model structures and repeat the experiment. To avoid coincidence and rule out factors like an incapable victim or inadequate samples, we also switch to more accurate victim models, vary the query set, and extend the experiments to ImageNet.
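As a reminder of what starting without pre-trained weights involves, the following is a minimal NumPy sketch of the uniform variant of Xavier (Glorot) initialization [10] for one dense layer. The function name and the layer sizes are illustrative assumptions, not values taken from the experiments.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Sample weights from U(-a, a) with a = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# Hypothetical layer: 784 inputs (a flattened 28x28 image) to 256 hidden units.
rng = np.random.default_rng(0)
W = xavier_uniform(784, 256, rng)
print(W.shape, float(W.var()))
```

The resulting weights have variance close to 2 / (fan_in + fan_out) and carry no information about any dataset, which is the situation marked N/A in Table I.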

The results (Table I) show that the performance of the attacker model degrades significantly without pre-trained weights, even when it has the same network structure as the victim and uses the victim's training dataset as the query set. A related result is reported in [2], which showed that performance decreases when the attacker model is not fine-tuned from the same pre-trained model.

Based on the above observation, we find that when the victim and the attacker both start with models pre-trained on the same dataset, the pre-trained weights provide more than just a sophisticated feature extractor: they carry significant prior knowledge about the victim and make the whole process similar to fine-tuning in transfer learning. Without such pre-trained weights, the attacker needs a much higher budget to obtain a comparable result.

IV-B Attacker Hyperparameters

In this paper, we look into the behavior of two important factors in model extraction, namely attacker model structure and optimizer. Our experiments show that their behavior differs from that under typical machine learning settings.

IV-B1 Attacker Model Structure

Previous works [30, 20] studied the effect of structure in settings with pre-trained weights. They concluded that:

  • Given a fixed victim structure, the more complex the attacker model is, the better the extraction result is.

  • Given a fixed structure on one side, the extraction result is best when the attacker and victim structures match.

In contrast, in this section, we study the effect of structure without pre-trained weights, in order to validate the above two statements in a broader context.

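Victim model   Victim accuracy (%)   Attacker model   Attacker accuracy (%)
MLP            88.0                  MLP              40.9
MLP            88.0                  LeNet            28.1
MLP            88.0                  AlexNet          24.1
LeNet          90.7                  MLP              23.6
LeNet          90.7                  LeNet            62.6
LeNet          90.7                  AlexNet          25.1
AlexNet        91.1                  MLP              11.5
AlexNet        91.1                  LeNet            15.8
AlexNet        91.1                  AlexNet          23.7
Accuracy: Observed best top-1 test accuracy.
TABLE II: Results of Model Extraction on FashionMNIST Using White Noise With Varying Attacker and Victim Model Structure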

We conduct experiments on the FashionMNIST [38] dataset with three different models: a 2-layer MLP, LeNet [23], and a resized AlexNet [21], each with a different number of trainable parameters. The results (Table II) indicate that, with no pre-trained weights applied, a more complex attacker model is not necessarily better, nor does the attacker necessarily perform best when the victim has the same structure. Similar phenomena can also be observed in Table I, where ResNet-50 underperforms ResNet-18 with the same budget.

We believe that in the settings of previous works, pre-trained complex models reveal more information about the original feature space that the victim was trained on, which is the major factor that distinguishes their performance from that of simpler models. This could be an alternative explanation for the fact that, in the original paper, when Knockoff Nets was tested on a real-life black-box model, i.e. without the important prior knowledge provided by weights pre-trained on the same dataset as the victim, ResNet-34 and ResNet-101 showed similar performance.

Therefore, whether the victim is black-box or white-box, an optimized attacker structure needs to be searched for. The best strategy for the attacker might be to adopt an AutoML approach, search for the optimal hyperparameters according to his budget and prior knowledge, and continuously adjust them with the query results he obtains.

IV-B2 Optimizer

Besides model structure, another typical hyperparameter that needs to be adaptively configured is the optimizer for training the attacker model. One significant difference from typical machine learning is that overfitting to the training data might not hurt test accuracy. In this paper, attacker models are trained for a large number of epochs with little learning rate decay, which would conventionally lead to overfitting, yet the test accuracy is observed to keep improving. We believe that in scenarios where the adversary does not have good prior knowledge about the dataset, his best strategy is to overfit to the soft labels. In addition, a typical training dataset has an overall “trend” for each class, whereas the distribution of the query data is more dynamic during the extraction process. One interesting observation from our experiments is that the Adam optimizer always leads to significantly worse attacker accuracy than SGD does. Thus, optimizers validated in common machine learning scenarios may not be directly applicable in model extraction.
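To make "overfitting to the soft labels" concrete, here is a minimal NumPy sketch, not the setup used in the experiments. It fits a linear softmax attacker to the soft labels of a hypothetical linear victim using plain momentum-free gradient steps, with no early stopping or learning rate decay. The victim weights, the white-noise queries, the full-batch updates, and all sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def soft_cross_entropy(probs, soft_labels, eps=1e-12):
    """Mean cross-entropy between soft target labels and predicted probabilities."""
    return float(-(soft_labels * np.log(probs + eps)).sum(axis=1).mean())

def fit_soft_labels(X, soft_labels, epochs=200, lr=0.1):
    """Plain gradient descent on a linear softmax model; returns weights and loss history."""
    n, d = X.shape
    W = np.zeros((d, soft_labels.shape[1]))
    losses = []
    for _ in range(epochs):
        probs = softmax(X @ W)
        losses.append(soft_cross_entropy(probs, soft_labels))
        W -= lr * (X.T @ (probs - soft_labels)) / n  # gradient of the mean loss
    return W, losses

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(64, 16))   # white-noise queries (hypothetical sizes)
W_victim = rng.normal(size=(16, 3))        # hypothetical stand-in for the victim
soft = softmax(X @ W_victim)               # soft labels returned by the victim
W_attacker, losses = fit_soft_labels(X, soft)
print(losses[0], losses[-1])
```

Because the targets are full probability vectors rather than hard labels, every query constrains the attacker on all classes at once, which is one way to see why fitting them closely can still help.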

Query set    Excluded class(es)   Recall (%)   Precision (%)
FMNIST-1     9                    92.9         97.8
FMNIST-1S    9                    86.0         98.9
FMNIST-2     1,9                  93.2         98.9
FMNIST-2S    1,9                  79.6         99.1
FMNIST-3     0,1,9                90.6         93.2
FMNIST-3S    0,1,9                75.5         96.0
FMNIST-8     0,1,2,3,4,5,7,9      68.7         84.4
FMNIST-8S    0,1,2,3,4,5,7,9      16.9         64.6
Mapping from number to class name follows the official data source.
TABLE III: Results of model extraction on FashionMNIST with Specific Classes Excluded

IV-C Query Utility

In the experiments shown in Table II, we notice that the attacker model can learn features by querying with white noise, which is meaningless to humans and contains no perceivable feature related to the task. To further explore this phenomenon, we conduct a series of experiments extracting a victim model trained on FashionMNIST with varying query sets:

  • FMNIST-k: the FashionMNIST training dataset, excluding all data from the k specified class(es).

  • FMNIST-kS: the FashionMNIST training dataset, excluding all samples to which the victim assigns a confidence score larger than 10% on the specified class(es).
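The two query-set constructions above can be sketched as simple filters. This is an illustrative reading of the definitions, not the authors' code; the function names and the toy arrays are hypothetical, and `victim_probs` is assumed to hold the victim's softmax output for every training image.

```python
import numpy as np

def exclude_by_label(images, labels, excluded):
    """FMNIST-k style: drop every image whose ground-truth label is excluded."""
    keep = ~np.isin(labels, list(excluded))
    return images[keep]

def exclude_by_victim_confidence(images, victim_probs, excluded, threshold=0.10):
    """FMNIST-kS style: drop every image on which the victim puts more than
    `threshold` probability on any excluded class."""
    keep = (victim_probs[:, list(excluded)] <= threshold).all(axis=1)
    return images[keep]

# Toy data: four "images" identified by their index, ten classes.
images = np.arange(4).reshape(4, 1)
labels = np.array([0, 9, 1, 9])
victim_probs = np.zeros((4, 10))
victim_probs[0, [0, 9]] = [0.95, 0.05]
victim_probs[1, [0, 9]] = [0.10, 0.90]
victim_probs[2, [1, 9]] = [0.80, 0.20]
victim_probs[3, 9] = 1.0

by_label = exclude_by_label(images, labels, [9])
by_confidence = exclude_by_victim_confidence(images, victim_probs, [9])
print(by_label.ravel(), by_confidence.ravel())
```

In the toy data, image 2 has a non-excluded label but is still removed by the confidence filter, which illustrates why the S variants are the stricter setting.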

All available samples in the query set are queried to the victim exactly once, with no data augmentation applied. We consider the performance of the attacker model on the “excluded class(es)”, and evaluate the predictions with two metrics, namely:

  1. Recall: defined as the ratio of the number of correctly predicted labels in the excluded classes to the total number of images in these classes.

  2. Precision: defined as the ratio of the number of correctly predicted labels in the excluded classes to the total number of predictions made for these classes.
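The two metrics can be computed as in the following sketch, which pools all excluded classes together as the definitions above suggest. The function name and the toy labels are illustrative assumptions.

```python
import numpy as np

def excluded_class_recall_precision(y_true, y_pred, excluded):
    """Recall and precision restricted to the excluded classes, pooled together."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    truly_excluded = np.isin(y_true, list(excluded))      # images of excluded classes
    predicted_excluded = np.isin(y_pred, list(excluded))  # predictions into excluded classes
    correct = np.sum(truly_excluded & (y_true == y_pred))
    recall = correct / max(truly_excluded.sum(), 1)
    precision = correct / max(predicted_excluded.sum(), 1)
    return float(recall), float(precision)

# Toy example: class 9 is excluded; the attacker finds only half of the 9s,
# but never predicts 9 for anything else.
y_true = [9, 9, 9, 9, 0, 1]
y_pred = [9, 9, 0, 0, 0, 1]
recall, precision = excluded_class_recall_precision(y_true, y_pred, [9])
print(recall, precision)
```

A model that rarely predicts an excluded class, but is right when it does, shows exactly this pattern of precision above recall.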

We notice that although images of certain classes are missing from the query set, the attacker model is still able to classify them with high precision, even in extreme cases where all sampled data have low confidence in the target class. Interestingly, the precision is higher than the recall. It seems the attacker model is more conservative on the “excluded classes”.

This experiment shows that an adversary might be able to learn without prior knowledge of the target class(es). Additionally, as a further investigation, we query the victim with the MNIST [24] dataset as the query set, which is typical out-of-distribution data in this case, and achieve 66.7% test accuracy on the FashionMNIST test dataset. It seems that in these cases, misclassified queries are no longer adversarial dirty data; instead, they play a positive role in the extraction. Meanwhile, soft labels, with their relative growth and drop across classes, make learning from “low confidence” samples possible and facilitate the extraction when input data is lacking. We believe this explains why, in Table II, the overall extraction result degrades with a more accurate victim. Similar observations were also made by [39, 5] in the context of knowledge distillation.

Fig. 1: Structure of the proposed framework. Modules are indicated by grey dotted-line boxes with names in the top-left corner. Detailed components within each module are from Algorithm 1. Blue boxes denote trainable networks. Green boxes denote data storage. Orange boxes (Evaluator, AE Generator) denote fixed functions. The black box denotes the black-box component.

Input: Victim Model , Data Pool , Attacker Model ,
Parameter: Query Budget , Evaluation Scope , FGSM Iteration
Output: Trained that shares similar functionality with

1:  Initialize Controller and fill with zeros
2:  Let
3:  while  do
4:     Feed the current observation forward into the Controller to output a target probability distribution

5:     if  is not empty then
6:        Randomly draw an image from as
7:     else
8:        Generate a noise image with uniform random as
9:     end if
10:     Apply i-FGSM with on for iterations, note the output as
11:     Store pair into
12:     Train with last samples in
13:     Evaluate the distribution and the loss of on the last samples in with (3), (4), and (6). Set the evaluation result as
14:     Calculate the reward with (10)
15:     Update
16:  end while
17:  return
Algorithm 1 DRL Guided Model Extraction with i-FGSM: An Implementation of the Proposed Framework
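The control flow of Algorithm 1 can be sketched as a loop over pluggable components. This is a structural sketch only, under several assumptions: the controller here is a random stand-in rather than DDPG, the data generator returns uniform noise without an i-FGSM step, the attacker trainer and evaluator are placeholders, the initial observation is taken to be a zero vector, and every name and size is hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def extraction_loop(victim, controller, generate_query, train_attacker,
                    evaluate, budget, scope, obs_dim, rng):
    """Query the victim `budget` times, training the attacker after every query."""
    xs, ys = [], []
    obs = np.zeros(obs_dim)                   # assumed initial observation
    while len(xs) < budget:
        target = controller(obs, rng)         # desired soft label for the next query
        x = generate_query(target, rng)       # data generator (base image, AE step, ...)
        xs.append(x)
        ys.append(victim(x))                  # one black-box query
        recent_x = np.stack(xs[-scope:])      # evaluation scope: latest samples only
        recent_y = np.stack(ys[-scope:])
        train_attacker(recent_x, recent_y)
        obs, reward = evaluate(recent_x, recent_y)
        # A learning controller would be updated with (obs, reward) here.
    return np.stack(xs), np.stack(ys)

# Hypothetical stand-ins for each module.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3))
calls = {"victim": 0, "train": 0}

def victim(x):
    calls["victim"] += 1
    return softmax(x @ W)

def random_controller(obs, rng):
    return rng.dirichlet(np.ones(3))

def noise_generator(target, rng):
    return rng.uniform(0.0, 1.0, size=8)

def train_attacker(x, y):
    calls["train"] += 1

def evaluate(x, y):
    return y.mean(axis=0), 0.0

X, Y = extraction_loop(victim, random_controller, noise_generator,
                       train_attacker, evaluate,
                       budget=20, scope=5, obs_dim=3, rng=rng)
print(X.shape, Y.shape, calls)
```

Keeping the modules as plain callables mirrors the loose coupling of the framework: swapping the random controller for a DRL agent, or the noise generator for an AE-based one, does not change the loop.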

V Method and Evaluation

Based on the observations made in section IV-C, and building on [30], we present an abstract framework (Fig. 1) for model extraction analysis and give a heuristic example implementation (Algorithm 1) in the context of image classification.

V-A Framework

The framework consists of four loosely coupled modules. A Data Generator generates queries based on available resources and an instruction given by a Controlling Agent. The queries are then sent to the victim model, and an Attacker Trainer trains the attacker model based on the queries and results. An Evaluator observes the extraction efficiency and provides feedback to the Controlling Agent. The detailed functionality and design choices of each module are as follows:

V-A1 Controlling Agent

The Controlling Agent leads the active learning process. Based on the current extraction state, it issues instructions for the environment to execute. In this paper, as an example, we use DDPG [26] to output the desired soft label and FGSM parameter for the next query. However, the framework is also compatible with other continuous-action-space DRL algorithms such as TRPO [33]. Classical control algorithms such as MPC [1] may also be adopted in this scenario.

V-A2 Data Generator

The Data Generator is responsible for providing the black-box victim model with query images that fulfill the instruction given by the Controlling Agent. In this paper, we randomly select or generate base images and adopt i-FGSM [22], utilizing the transferability of AE, to enrich the non-robust features [16] covered by the queries. Given the flexibility of this module, we believe that methods like prioritized base image selection or stronger white-box AE attacks could be applied to further enhance the performance. We leave them for future research.
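A targeted i-FGSM step can be sketched as follows. This is a toy illustration under stated assumptions: the gradient is taken on a linear softmax surrogate (standing in for the attacker's own model, which is what relying on transferability suggests), and the names `W`, `b`, `step`, `eps`, and `iters` are hypothetical parameters, not values from the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_ifgsm(x, target, W, b, step=0.01, eps=0.1, iters=5):
    """Iteratively nudge x so that softmax(W @ x + b) moves toward `target`."""
    x0 = x.copy()
    x_adv = x.copy()
    for _ in range(iters):
        probs = softmax(W @ x_adv + b)
        grad = W.T @ (probs - target)                # d(cross-entropy)/dx for this surrogate
        x_adv = x_adv - step * np.sign(grad)         # descend the loss toward the target
        x_adv = np.clip(x_adv, x0 - eps, x0 + eps)   # stay within the perturbation budget
        x_adv = np.clip(x_adv, 0.0, 1.0)             # stay a valid image
    return x_adv

# Two-pixel, two-class toy case where every step can be checked by hand.
W = np.eye(2)
b = np.zeros(2)
x = np.array([0.5, 0.5])
target = np.array([1.0, 0.0])
x_adv = targeted_ifgsm(x, target, W, b)
print(x_adv)
```

In the toy case each of the five iterations raises the first pixel and lowers the second by 0.01, so the result is approximately [0.55, 0.45], well inside the 0.1 perturbation budget.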

V-A3 Attacker Trainer

In our proposed framework, the attacker model is trained during the sampling process, so that it reflects the current state of extraction in a timely manner. Considering the special nature of the query data distribution (section IV-B), we use SGD as our optimizer. After each query is made to the victim model, we train the attacker model only on the most recently collected samples to reduce the time taken by the sampling process. However, in light of stateful model stealing defenses [18], we believe sampling time is not a critical factor for the whole extraction process, and better results could be achieved by a more thoroughly trained attacker.

V-A4 Evaluator

In deep reinforcement learning, feedback from the environment to the agent is critical. It consists of two parts: an observation and a reward. The observation reflects the state, or partial state, of the current environment, and the reward reflects how well the agent did. Thus, the Evaluator module gathers critical information about the current extraction process, then rewards or penalizes the Controlling Agent. In Algorithm 1, we limit the evaluation scope to the most recent elements of the transfer set. With respect to the elements newly added to the training set, the observation at each step contains the following information:

  • the victim model's output on the newly added queries, as probabilities;

  • the attacker model's output on the same queries, as probabilities;

  • the mean (2) and standard deviation (3) of the (soft) labels for each class within the evaluation scope;

  • the mean Cross-entropy Loss of the attacker model on each class (4), with respect to the victim's labels.
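The per-class statistics listed above can be sketched as follows. Since the original equations are not available here, this is only one plausible reading: the mean and standard deviation are taken over the victim's soft labels within the evaluation window, and the per-class loss groups samples by the victim's top-1 class, which is an assumption. All names are hypothetical.

```python
import numpy as np

def observation_stats(victim_probs, attacker_probs, eps=1e-12):
    """Per-class mean/std of the soft labels and per-class mean cross-entropy.

    Rows are the most recent queries; columns are classes.
    """
    mean = victim_probs.mean(axis=0)
    std = victim_probs.std(axis=0)
    # Cross-entropy of the attacker's output against the victim's soft label, per sample.
    ce = -(victim_probs * np.log(attacker_probs + eps)).sum(axis=1)
    top1 = victim_probs.argmax(axis=1)          # assumed grouping: victim's top-1 class
    n_classes = victim_probs.shape[1]
    ce_per_class = np.array([
        ce[top1 == c].mean() if np.any(top1 == c) else 0.0
        for c in range(n_classes)
    ])
    return mean, std, ce_per_class

# Toy window of three queries and two classes, with an undecided attacker.
victim_probs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
attacker_probs = np.full((3, 2), 0.5)
mean, std, ce_per_class = observation_stats(victim_probs, attacker_probs)
print(mean, std, ce_per_class)
```

Concatenating these vectors with the raw victim and attacker outputs would give a fixed-length observation for the controller, whatever the exact form used in the paper.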


Here the loss is the Cross-entropy between the attacker's and the victim's outputs. Given the observation, the agent decides on an action that attempts to maximize the accumulated reward, calculated by a reward function. The reward function should assess whether the changes to the transfer set will lead to a better attacker model, with respect to the newly added elements. The observation made in section IV-C suggests that we should encourage diversity in classes and discourage samples similar to those already in the transfer set. Using the standard deviation of a given vector across all of its elements, the reward in Algorithm 1 is built from the following factors:


  • the Cross-Entropy Loss of the attacker model on the most recent elements. Note that a higher loss indicates that the samples are not similar to existing ones, and thus should be encouraged in order to explore the input space.

  • the diversity of samples. This is measured by both the range and the standard deviation of the per-class mean labels, and is further encouraged by the minimum standard deviation of the probabilities of each class.


The final reward is the weighted sum of the above factors.


V-B Benchmarking

We now evaluate the performance of the proposed Algorithm 1 by attacking a victim model trained on the FashionMNIST dataset. To simulate scenarios where the attacker has limited prior knowledge about the victim, we strictly limit the amount of data contained in the data pool. Additionally, we test one variant, where the values of (5), (7) and (9) are computed directly instead of discretely based on relative change. To validate the effectiveness of DRL and i-FGSM, we also include results for the following two baselines, modified from Algorithm 1 but within our proposed framework:

  1. Random Uniform Noise as Data Generator

  2. Random Controller as Controlling Agent

Note that [30] is not included in the benchmarking baselines, as it is not designed for zero or limited unlabeled data conditions. To the best of our knowledge, we are the first to attempt data-free stealing in the context of computer vision.

To demonstrate the adaptiveness of our method, hyperparameters in this experiment are set intuitively and are not specifically tuned. All weights in the reward function are set to 1. The query budget is set to one-sixth of the FashionMNIST training set size. To encourage exploration, we apply Gaussian noise to the DRL action during the initial steps. To balance the trade-off between attacker training and experiment run-time, we set the evaluation scope to 640. Meanwhile, guided by the findings in section IV-B2, we additionally train each attacker model for 1000 epochs after the sampling process for better information utilization.

The results (Table IV) show that in both the data-free and limited-data situations, Algorithm 1 and its variant outperform the two baseline methods. Notably, in the “Limited Data” situation, where the data pool contains 20 randomly drawn images, our method achieves a more stable result than the baselines, which demonstrates the efficiency of DRL and the adaptiveness of our proposed framework. Moreover, Algorithm 1 saves the human effort of collecting or organizing a vast amount of candidate data, and instead makes use of non-robust features, which are easier to craft. From another perspective, this benchmarking experiment also suggests that an attacker does not need to be well prepared to attack a victim, which deepens concerns about protecting commercial models against model extraction attacks, as the cost of the attack is significantly lower.

                        Attacker accuracy (%)
Method                  Data Free     Limited Data
Algorithm 1             54.7 ± 1.4    75.2 ± 1.8
Algorithm 1 (Variant)   52.5 ± 1.9    75.4 ± 0.9
Baseline 1              32.9 ± 8.0    62.2 ± 5.3
Baseline 2              44.2 ± 1.6    72.7 ± 2.5
Accuracy: Observed best top-1 test accuracy.
Data Free: The data pool contains no data.
Limited Data: The data pool contains 20 unlabeled in-distribution images.
TABLE IV: Benchmarking Result of Algorithm 1 on FashionMNIST

VI Conclusion and Future Work

In this paper, we study the model extraction problem when the attacker does not have much prior knowledge about in-distribution data and lacks pre-trained networks to start with. We empirically find that statements made in the current literature fail to extend to such stricter scenarios. Meanwhile, we verify that, although the difficulty of extraction is significantly higher, an attacker can still learn information with out-of-distribution data. We study the different factors that contribute to model extraction and put these factors together into a framework that can be trained for better configurations. We conduct experiments to demonstrate the effectiveness of the algorithm in extending the existing FGSM method with little prior knowledge of the victim model. Thus, companies that provide AIaaS should be aware that their models are vulnerable when confidentiality relies only on the secrecy of their data.

For simplicity of illustration, factors such as attacker model structure are not covered by our given example (Algorithm 1). Whether techniques like neural architecture search (NAS) [32] in the current literature could be applied to improve the current implementation is left for further research. Moreover, Algorithm 1 adopts DDPG, which is a model-free reinforcement learning algorithm. In our benchmarking experiments, the DRL agent is trained from scratch as sampling proceeds. We leave techniques that reduce the number of DRL convergence steps, such as model-based reinforcement learning [3] or meta reinforcement learning [8], to future work.


  • [1] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter (2018) Differentiable mpc for end-to-end planning and control. In Advances in Neural Information Processing Systems, pp. 8289–8300. Cited by: §V-A1.
  • [2] N. Asokan (2020) Extraction of complex dnn models: real threat or boogeyman?. In Engineering Dependable and Secure Machine Learning Systems: Third International Workshop, EDSMLS 2020, New York City, NY, USA, February 7, 2020, Revised Selected Papers, Vol. 1272, pp. 42. Cited by: §IV-A.
  • [3] S. Bansal, R. Calandra, K. Chua, S. Levine, and C. Tomlin (2017) Mbmf: model-based priors for model-free reinforcement learning. arXiv preprint arXiv:1709.03153. Cited by: §VI.
  • [4] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. Cited by: §II-C.
  • [5] J. H. Cho and B. Hariharan (2019) On the efficacy of knowledge distillation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4794–4802. Cited by: §II-B, §IV-C.
  • [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, Cited by: §IV.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §I.
  • [8] Y. Duan, J. Schulman, X. Chen, P. L. Bartlett, I. Sutskever, and P. Abbeel (2016) RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779. Cited by: §VI.
  • [9] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow (2018) Adversarial spheres. arXiv preprint arXiv:1801.02774. Cited by: §II-C.
  • [10] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS’10). Society for Artificial Intelligence and Statistics, Cited by: §IV-A.
  • [11] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §I.
  • [12] G. Griffin, A. Holub, and P. Perona (2007-04) Caltech-256 Object Category Dataset. Technical Report CNS-TR-2007-001, California Institute of Technology. Cited by: §IV-A.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §I.
  • [14] B. Heo, M. Lee, S. Yun, and J. Y. Choi (2019) Knowledge distillation with adversarial samples supporting decision boundary. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3771–3778. Cited by: §II-C.
  • [15] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §II-B.
  • [16] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pp. 125–136. Cited by: §II-C, §V-A2.
  • [17] M. Jagielski, N. Carlini, D. Berthelot, A. Kurakin, and N. Papernot (2019) High-fidelity extraction of neural network models. arXiv preprint arXiv:1909.01838. Cited by: §I, §III.
  • [18] M. Kesarwani, B. Mukhoty, V. Arya, and S. Mehta (2018) Model extraction warning in MLaaS paradigm. In Proceedings of the 34th Annual Computer Security Applications Conference, ACSAC ’18, New York, NY, USA, pp. 371–380. External Links: ISBN 9781450365697. Cited by: §V-A3.
  • [19] D. P. Kingma and J. Ba (2017) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §I.
  • [20] K. Krishna, G. S. Tomar, A. P. Parikh, N. Papernot, and M. Iyyer (2020) Thieves on Sesame Street! Model extraction of BERT-based APIs. Cited by: §II-A, §IV-B1.
  • [21] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 25, pp. 1097–1105. Cited by: §IV-B1.
  • [22] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §V-A2.
  • [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §IV-B1.
  • [24] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Cited by: §IV-C.
  • [25] T. Li, J. Li, Z. Liu, and C. Zhang (2020) Few sample knowledge distillation for efficient network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14639–14647. Cited by: §II-B.
  • [26] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §V-A1.
  • [27] R. G. Lopes, S. Fenu, and T. Starner (2017) Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535. Cited by: §II-B.
  • [28] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis (2015-02) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. External Links: ISSN 00280836. Cited by: §II-D.
  • [29] G. K. Nayak, K. R. Mopuri, V. Shaj, R. V. Babu, and A. Chakraborty (2019) Zero-shot knowledge distillation in deep networks. arXiv preprint arXiv:1905.08114. Cited by: §II-B.
  • [30] T. Orekondy, B. Schiele, and M. Fritz (2019) Knockoff nets: stealing functionality of black-box models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4954–4963. Cited by: §II-A, §III, §IV-B1, §IV, §V-B, §V.
  • [31] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §II-C.
  • [32] P. Ren, Y. Xiao, X. Chang, P. Huang, Z. Li, X. Chen, and X. Wang (2020) A comprehensive survey of neural architecture search: challenges and solutions. arXiv preprint arXiv:2006.02903. Cited by: §VI.
  • [33] J. Schulman, S. Levine, P. Moritz, M. I. Jordan, and P. Abbeel (2015) Trust region policy optimization. CoRR abs/1502.05477. Cited by: §V-A1.
  • [34] B. Settles (2009) Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences. Cited by: §II-D.
  • [35] A. Shamir, I. Safran, E. Ronen, and O. Dunkelman (2019) A simple explanation for the existence of adversarial examples with small hamming distance. arXiv preprint arXiv:1901.10861. Cited by: §II-C.
  • [36] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart (2016) Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618. Cited by: §I, §II-A.
  • [37] D. Wang, Y. Li, L. Wang, and B. Gong (2020) Neural networks are more productive teachers than human raters: active mixup for data-efficient knowledge distillation from a blackbox model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1498–1507. Cited by: §II-A.
  • [38] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §IV-B1.
  • [39] C. Yang, L. Xie, S. Qiao, and A. Yuille (2018) Knowledge distillation in generations: more tolerant teachers educate better students. arXiv preprint arXiv:1805.05551. Cited by: §II-B, §IV-C.
  • [40] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7130–7138. Cited by: §II-B.
  • [41] S. Zagoruyko and N. Komodakis (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928. Cited by: §II-B.
  • [42] Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang (2017) Fine-tuning convolutional neural networks for biomedical image analysis: actively and incrementally. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7340–7351. Cited by: §II-D.