Beyond Model Extraction: Imitation Attack for Black-Box NLP APIs

Machine-learning-as-a-service (MLaaS) has attracted millions of users with its sophisticated, high-performing models. Although published as black-box APIs, the valuable models behind these services are still vulnerable to imitation attacks. Recently, a series of works has demonstrated that attackers can steal or extract the victim models. Nonetheless, none of the previously stolen models can outperform the original black-box APIs. In this work, we take the first step of showing that attackers could potentially surpass victims via unsupervised domain adaptation and multi-victim ensemble. Extensive experiments on benchmark datasets and real-world APIs validate that the imitators can succeed in outperforming the original black-box models. We consider this a milestone in the research of imitation attacks, especially on NLP APIs, as the superior performance could influence the defense or even the publishing strategy of API providers.


1 Introduction

Task-oriented NLP APIs have received tremendous success, partly due to commercial cloud services (krishna2019thieves; wallace2020imitation) (see https://rapidapi.com/blog/best-nlp-api/). The enormous commercial benefit allures other companies or individual users to extract the back-end decision models of these successful APIs. Some recent works have demonstrated that many existing NLP APIs can be locally imitated or stolen (krishna2019thieves; wallace2020imitation), violating the intellectual property (IP) of NLP APIs. Equipped with recent advanced models pre-trained on large-scale corpora, it is getting easier to train a decent attack model with limited training samples retrieved from victims (he2021model).

However, the current attack paradigm has a series of restrictions, which we summarize as follows. The first restriction is that attackers and victims are trained and evaluated on the same domain. Although such a setting simplifies the comparison of utility and fidelity between victim and attack models, it is unlikely to be the case in the real world. Generally, attackers and victims do not, or at least are not willing to, share their datasets with the public. The application scenario (domain) could also vary, depending on the requirements of customers. The second restriction is that model extraction only considers a single victim model, and none of the previous works manages to build extracted models that surpass the original black-box APIs (tramer2016stealing; krishna2019thieves; wallace2020imitation). Aggregating the strengths of diverse victims, however, can potentially benefit the attacker and, in the best case, enable it to surpass the victim models.

Based on the above analysis, we are motivated to address these restrictions by i) conducting imitation attacks on a transferred domain and ii) ensembling multiple victims. Our approach integrates unsupervised domain adaptation and model ensemble into the imitation attack, with a corresponding theoretical analysis. We conduct experiments on two commonly used NLP tasks, namely sentiment classification (lyu2020differentially) and machine translation (wallace2020imitation). We investigate the attack performance on both locally simulated victim models and publicly available commercial NLP APIs. Our results demonstrate that the attackers could potentially achieve better performance than the victims in the transferred domains, and that utilizing multiple victim models further improves the performance of the attack models. For target domains that are far from the victim domains, e.g., financial documents, the performance improvement of the imitation model is amplified further.

Overall, our empirical findings exacerbate the potential risks of public API services, as malicious companies could provide a better service in their specific domains by integrating several publicly available APIs. Moreover, the new services generally cost far less than any of the original victim services or the wages paid to human annotators. The imitation attack not only infringes the intellectual property of victim companies by misusing the predictions of their APIs, but also potentially corrupts the MLaaS market by publishing new APIs with higher performance at a lower price. We believe that explicitly exposing the superior performance of imitation attacks will attract significant attention in the research community and encourage companies to reconsider their strategies for publishing API services.

2 Related Work

2.1 Model Imitation Attack

Model imitation attack (also referred to as "extraction" or "stealing") has been studied for simple classification tasks (tramer2016stealing), vision tasks (orekondy2019knockoff), and NLP tasks (krishna2019thieves; wallace2020imitation). Generally, model imitation attack aims to reconstruct a local copy or to steal the functionality of a black-box API. If the reconstruction is successful, the attacker has effectively stolen the intellectual property.

Past works on imitation attacks (krishna2019thieves; wallace2020imitation) mainly focus on how to imitate a model whose performance approximates that of the victim API in the source domain. Whether a more powerful attacker can steal a model that is better than any victim API in new domains remains largely unexplored.

2.2 Unsupervised Domain Adaptation

Domain adaptation is the task of adapting a model pre-trained on a source domain to a target domain. It can be fulfilled via two mechanisms: supervised adaptation and unsupervised adaptation. The former can achieve outstanding performance with a small amount of in-domain labeled data (daume-iii-2007-frustratingly; ben2007analysis). In contrast, unsupervised domain adaptation (UDA) (miller2019simplified; ganin2015unsupervised) does not require ground-truth labels in the target domain, and is hence more challenging but also more attractive.

This work falls under the umbrella of the UDA family (wang2021generalizing; miller2019simplified). We differentiate our work from other UDA works in terms of intent. Other UDA works aim to improve models in both the source and target domains simultaneously, while the imitation attack focuses on optimizing the attacker in the target domain. We exploit UDA from the dark side: one can leverage domain adaptation to violate the IP of commercial APIs and benefit from such violations.

2.3 Ensemble for Knowledge Distillation

Model imitation attack is related to knowledge distillation (KD) (hinton2015distilling). Knowledge distillation (bucilua2006model; hinton2015distilling) aims to transfer knowledge from a large teacher model to a small student model for the sake of model compression. By encouraging the student model to approximate the behavior of the teacher model, the student is able to imitate the teacher with minor quality loss but more efficient inference (furlanello2018born; dvornik2019diversity; anil2018large).

However, model imitation differs from distillation in terms of data usage. In particular, the victim's (i.e., the teacher's) training data is usually unknown, as the victim API is deployed in a black-box manner. Thus, malicious users tend to rely on unlabeled queries, which are more likely to be collected from other domains (xie2020self) or generated from pre-trained generative models (ravuri2019classification; kumar2020data; 9053146). Our work considers a more realistic setting, as the victim and attack models operate on different domains and have distinct interests.

On the other hand, with the development of multi-task learning, multi-teacher distillation has been proposed, which targets distilling multiple single-task models into a single multi-task model (tan2019multilingual; clark2019bam; saleh2020collective). These lines of work aim at improving model performance and reducing parameters via KD.

The difference between our work and previous works is that we are interested in an ensemble distillation of single-task models, where all teachers and the student work on the same task. The gist of this approach is to leverage collective wisdom to obtain a student that outperforms its teachers, which is similar to ensemble learning (opitz1999popular). However, ensemble learning focuses on aggregating the predictions of different models at the inference stage, whereas our proposed method not only gathers the inputs and their predictions, but also trains a new model on the resulting input-prediction pairs.

Figure 1: The workflow of the imitation attack and its harm. The attacker labels queries (x) using multiple victim APIs, and then trains an attacker model on the resulting data. Finally, the attacker could publish a new API service, which could become a competitor of the victim services, thus eroding their market shares.

3 Imitation Attack Paradigm

In this section, we first explain the problem and motivation of the imitation attack (IMA) under the domain adaptation setting. Then, we connect our IMA practice with a corresponding domain adaptation theory (ben2010theory; wang2021generalizing). After that, we introduce a multi-victim ensemble methodology for IMA. Finally, we explain the rationale behind a family of existing defense techniques under the domain adaptation theory.

3.1 Problem Statement

In the real world, attackers may be interested in their own new business, e.g., a new classification or machine translation system in a new domain. Generally, the attackers possess their own training samples $X = \{x_i\}_{i=1}^{N}$, while the oracle labels of these samples are not available. In order to train a model with the least annotation cost, the attackers access the publicly available commercial APIs for the target task. Moreover, the attackers could query multiple APIs for further performance improvement. The underlying models of the attacked APIs are the victim models $f_v^{(1)}, \dots, f_v^{(K)}$. As illustrated in Figure 1, our attack can be formulated as a two-step process (a code sketch follows the list):

  1. The attacker queries the $k$-th victim model $f_v^{(k)}$ and retrieves the corresponding labels $\hat{y}^{(k)} = f_v^{(k)}(x)$.

  2. The attacker learns an imitation model $f_a$ based on the queries and the concatenated retrieved labels, $\{(x, \hat{y}^{(k)})\}_{k=1}^{K}$.
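A minimal sketch of this two-step process, with hypothetical placeholders for the API wrappers and the training routine (neither corresponds to a real SDK or to the paper's exact implementation):

```python
# Sketch of the two-step imitation attack; all names are illustrative placeholders.

def imitation_attack(unlabeled_queries, victim_apis, train_model):
    """unlabeled_queries: target-domain inputs x owned by the attacker.
    victim_apis: list of callables, each wrapping one black-box victim API.
    train_model: routine that fits a local model on (input, label) pairs."""
    # Step 1: query every victim API and retrieve its predictions for each x.
    retrieved = []
    for api in victim_apis:
        retrieved.append([(x, api(x)) for x in unlabeled_queries])

    # Step 2: concatenate all (query, retrieved label) pairs and train the imitator.
    training_pairs = [pair for per_victim in retrieved for pair in per_victim]
    return train_model(training_pairs)
```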

As the attacking process teaches a local model by imitating the behavior of the victim models, we put this attack process under the umbrella of imitation attack. The motivation for such an attack paradigm is twofold: (1) querying commercial APIs generally costs far less than hiring human annotators, so the price of the attack models is competitive in the market; (2) the attackers are potentially able to outperform the victims in terms of utility. As demonstrated in Section 5, domain adaptation and multi-victim ensemble further enhance the attacker's performance. We believe that both the price and performance advantages would lure companies or individuals into imitation attacks.

3.2 Training for Imitation Attack

In the imitation attack, the attacker utilizes the labels from the victim APIs for model training. Given a victim model $f_v$, the attacker model $f_a$ imitates the behavior of the victim model by minimizing the prediction error on the target domain $\mathcal{D}_T$:

$$\mathcal{L}_a = \mathbb{E}_{x \sim \mathcal{D}_T}\, \ell\big(f_a(x), f_v(x)\big) \qquad (1)$$

On the other hand, we assume that the victim model is learned from oracle annotations $y$ in another source domain $\mathcal{D}_S$, although the oracle labels are never used for training the attacker in our IMA. The loss for training the victim model is:

$$\mathcal{L}_v = \mathbb{E}_{(x, y) \sim \mathcal{D}_S}\, \ell\big(f_v(x), y\big) \qquad (2)$$

where $\ell$ is the loss function. In practice, we use cross entropy as the loss function for training both the victim and attack models. Note that jointly optimizing Eq 1 and Eq 2 can derive the unsupervised domain adaptation loss in (miller2019simplified; ganin2015unsupervised). However, IMA optimizes the victim and attack models in completely separate steps, as the victim models and their training processes are black-box to the attackers.
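A hedged sketch of the imitation objective in Eq 1 for classification with soft (probability) labels retrieved from one victim; PyTorch, the architecture and the optimizer are placeholders rather than the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

def imitation_loss(attacker_logits, victim_probs):
    """Cross entropy between the attacker's prediction and the victim's
    (soft) output distribution, i.e. the empirical form of Eq 1."""
    log_probs = F.log_softmax(attacker_logits, dim=-1)
    return -(victim_probs * log_probs).sum(dim=-1).mean()

def imitation_step(attacker, optimizer, batch_inputs, victim_probs):
    """One optimization step on target-domain queries labeled by the victim."""
    optimizer.zero_grad()
    loss = imitation_loss(attacker(batch_inputs), victim_probs)
    loss.backward()
    optimizer.step()
    return loss.item()
```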

3.3 Imitation Attack as Domain Adaptation

We connect our new IMA paradigm with domain adaptation theory. The error of the attacker is measured by the attacker risk $R_T(f_a)$. According to the domain adaptation theorem (ben2010theory; wang2021generalizing), the upper bound of the attacker risk is

$$R_T(f_a) \le R_S(f_v) + d_{TV}(\mathcal{D}_S, \mathcal{D}_T) + \mathbb{E}_{x \sim \mathcal{D}_T}\, \ell\big(f_a(x), f_v(x)\big) \qquad (3)$$

where $d_{TV}(\mathcal{D}_S, \mathcal{D}_T)$ is the total variation between the distributions of the source and target domains, which is determined by the datasets used by the victim and the attacker. The first term, $R_S(f_v)$, is the victim risk on the source domain; it is optimized during the training of the victim models as in Eq 2. The last term is associated with the imitation training in Eq 1. Therefore, our imitation attack under domain adaptation is actually optimizing the upper bound of the attacker risk $R_T(f_a)$.

3.4 Multiple Victim Ensemble

Another approach to achieving further performance improvement is to integrate the results from multiple APIs. This strategy is well motivated in real-world imitation attacks, as many cloud computing companies offer similar APIs for mainstream NLP tasks, e.g., Google Cloud and Microsoft Azure both support sentiment classification and machine translation. Attackers can improve their performance by learning from multiple victim APIs. In more detail, given $K$ independent victim models $f_v^{(1)}, \dots, f_v^{(K)}$, attackers can take advantage of all victim models by averaging their predictions,

$$\bar{f}_v(x) = \frac{1}{K} \sum_{k=1}^{K} f_v^{(k)}(x) \qquad (4)$$

According to ensemble theories (breiman2001random; bauer1999empirical), a lower generalization error of an ensemble model depends on i) better performance of the individual models, and ii) lower correlation between them. In the real world, companies are actually i) targeting APIs with better performance, and ii) using their own private training datasets. The effort of these companies towards superior API performance unfortunately strengthens exactly these two factors for a successful ensemble model acting as an attacker.
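A minimal sketch of the averaging in Eq 4 for classification APIs that return probability scores (an assumption about the API output format; the alternative of concatenating per-victim outputs is discussed in Appendix B):

```python
import numpy as np

def ensemble_victim_probs(victim_prob_list):
    """Average the soft predictions of K independent victim models (Eq 4).
    victim_prob_list: list of arrays, each of shape (num_samples, num_classes),
    assumed to be aligned on the same queries."""
    stacked = np.stack(victim_prob_list, axis=0)   # (K, N, C)
    return stacked.mean(axis=0)                    # (N, C) averaged supervision

# Usage: feed the averaged distribution to the imitation loss sketched above,
# or concatenate the per-victim (x, y) pairs instead (Concat. in Appendix B).
```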

3.5 Defending Imitation Attack

Some existing IMA defense strategies (he2021model) slightly distort the predictions of the victim models to reduce the performance of the attackers, by replacing $f_v(x)$ with a noisy $\tilde{f}_v(x)$. The variance introduced by the distortion is $\sigma^2 = \mathbb{E}_{x}\big[\|\tilde{f}_v(x) - f_v(x)\|^2\big]$. As the distortion should not destroy the utility of the victim model, the variance should be bounded by a small constant $c$, i.e., $\sigma^2 \le c$. The victim risk, the first term in Eq 3, is then relaxed to $\tilde{R}_S(\tilde{f}_v)$, where

$$\tilde{R}_S(\tilde{f}_v) \le R_S(f_v) + c \qquad (5)$$

Therefore, the gentle distortion of the victim outputs results in a more relaxed upper bound for optimization, which could potentially lead to better results in defending against the imitation attack.
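A sketch of this prediction-perturbation defense, assuming the victim returns a probability vector and re-normalizes the noisy scores (whether the defended API re-normalizes is an assumption, not stated in the text):

```python
import numpy as np

def perturb_prediction(victim_probs, sigma, rng=None):
    """Add zero-mean Gaussian noise with standard deviation sigma to the
    victim's soft output, then re-normalize so it stays a valid distribution.
    A larger sigma loosens the bound above (stronger defense, but also a
    larger utility loss for the victim itself)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = victim_probs + rng.normal(0.0, sigma, size=victim_probs.shape)
    noisy = np.clip(noisy, 1e-8, None)            # keep scores positive
    return noisy / noisy.sum(axis=-1, keepdims=True)
```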

4 Experimental Setup

4.1 Tasks and Datasets

In this paper, we focus on two essential NLP tasks, classification and machine translation, both of which predict discrete outputs, and cross entropy is used as the objective in optimization. For classification tasks, as APIs provide continuous scores for their predictions, we consider both a soft-label setting (using the prediction scores) and a hard-label setting (using the predicted categories). In machine translation, the translation result is a sequence of discrete words, which is treated as a sequence of hard labels. Classification and translation are evaluated by accuracy (%) (schutze2008introduction) and BLEU (papineni2002bleu), respectively. The datasets used in our imitation attack experiments are summarized in Table 1.

Dataset #Train #Dev #Test Task Domain
IMDB 25,000 25,000 N/A Sentiment Classification Movie Review (long)
SST 6,920 872 1,821 Sentiment Classification Movie Review (short)
FST 1,413 159 396 Sentiment Classification Finance Document
WMT14 4.5M 3,000 N/A Machine Translation General
JRC-Acquis 2M 1,000 1,000 Machine Translation Law
Tanzil 579k 1,000 1,000 Machine Translation Koran
Table 1: Statistics of the sentiment classification and machine translation datasets, with the number of samples in the train, dev and test sets. The task and domain for each dataset are included.
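As a concrete reference for the soft/hard label settings and the two metrics above, a minimal sketch assuming numpy and the sacrebleu package (illustrative only; the paper does not specify its exact evaluation tooling):

```python
import numpy as np
import sacrebleu

def hard_labels(soft_scores):
    """Collapse API probability scores (soft labels) into predicted classes."""
    return np.argmax(np.asarray(soft_scores), axis=-1)

def classification_accuracy(predictions, gold):
    """Accuracy (%) over hard labels."""
    return 100.0 * np.mean(np.asarray(predictions) == np.asarray(gold))

def translation_bleu(hypotheses, references):
    """Corpus BLEU over detokenized strings; one reference per hypothesis."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score
```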

Sentiment Classification.

IMDB Movie Review (IMDB) (maas-EtAl:2011:ACL-HLT2011) is a large-scale movie review dataset for sentiment analysis. Stanford Sentiment (SST) (socher2013recursive) is another movie review dataset with relatively shorter text than IMDB. Financial Sentiment (FST) (malo2014good) provides sentiment labels on economic texts in the finance domain. We use IMDB to train local victim models and consider SST and FST as target domains in the attack.

Machine Translation.

We consider German (De) to English (En) translation as our testbed. We first study the attack performance on local models trained on a general domain. Specifically, we use WMT14 (bojar-EtAl:2014:W14-33) to train the victim models. Then, we investigate the imitation attack on the Law and Koran domains from OPUS (tiedemann2012parallel). We utilize Moses (https://github.com/moses-smt/mosesdecoder) to pre-process all corpora, and keep the text cased. A 32K BPE vocabulary (sennrich2016neural) is applied to all datasets.
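A rough Python approximation of this preprocessing pipeline, using the sacremoses and subword-nmt packages as stand-ins for the Moses scripts (the paper uses the Moses toolkit directly, so this is only a sketch under those assumptions):

```python
from sacremoses import MosesPunctNormalizer, MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

def tokenize_corpus(lines, lang):
    """Punctuation-normalize and tokenize one side of the parallel corpus,
    keeping the original casing as in the paper."""
    normalizer, tokenizer = MosesPunctNormalizer(lang=lang), MosesTokenizer(lang=lang)
    return [tokenizer.tokenize(normalizer.normalize(l), return_str=True) for l in lines]

def build_bpe(tokenized_path, codes_path, merges=32000):
    """Learn a 32K BPE vocabulary from a tokenized file and return an encoder."""
    with open(tokenized_path) as fin, open(codes_path, "w") as fout:
        learn_bpe(fin, fout, merges)
    return BPE(open(codes_path))

# Usage (file names are hypothetical):
#   encoder = build_bpe("train.tok.de-en", "codes.bpe")
#   encoded = [encoder.process_line(line) for line in tokenize_corpus(raw_lines, "de")]
```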

4.2 Victim Models

Locally Simulated NLP Services.

We first use models trained on our local server as simulated services. These models are trained on the source-domain datasets, i.e., IMDB for sentiment analysis and WMT14 for machine translation. We use BERT (devlin2019bert) and RoBERTa (liu2019roberta) as pre-trained models for classification. Transformer-base (TF-base) and Transformer-big (TF-big) (vaswani2017attention) are used as the machine translation architectures; more details about the hyper-parameter settings are given in Appendix A.

Commercial NLP API Services.

To investigate the performance of the imitation attack on real-world commercial NLP APIs, we query and retrieve the results of victim APIs for both sentiment analysis and machine translation. The Google Cloud API (https://cloud.google.com) and the IBM Cloud API (https://cloud.ibm.com) are queried for sentiment analysis. Google Trans (https://translate.google.com/) and Bing Trans (https://www.bing.com/translator) are used as the translation APIs. In this setting, we assume that different companies make different choices regarding datasets, domains, model architectures and training strategies. These settings are invisible to the attackers.

4.3 Imitation Attack Setup

For the imitation attack, unlike wallace2020imitation, we leverage datasets from domains other than those used to train the victim models. The rationale behind this setting is that i) the owners of APIs tend to use in-house datasets, which are difficult for attackers to access; and ii) attackers are more interested in their own private datasets, which are likewise not visible to others. Therefore, our setting is closer to a real-world attack scenario. The attack models are trained on the labels retrieved from the victim models, and tested against human-annotated labels. We consider SST and FST as target domains for sentiment analysis. For machine translation, we use Law and Koran. In the attack, we use BERT for classification. For machine translation, Transformer-base is used in the simulated experiments, while mBART (liu2020multilingual) is used in the experiments on commercial APIs (we observe that mBART attacks commercial APIs better than Transformers in our preliminary experiments). We also investigate ensemble models, by concatenating all the outputs retrieved from multiple victim models for training.

Sentiment Classification Machine Translation
Model Label Arch. SST FST Label Arch. Law Koran
Supervised SST / FST BERT 91.49 97.72 Law / Koran TF-base 38.52 19.49
Victim 1 IMDB BERT 87.92 74.94 WMT14 TF-base 23.33 9.82
Victim 2 IMDB RoBERTa 89.40 80.00 WMT14 TF-big 24.33 10.33
Attack (single) Victim 1 BERT 90.13 83.59 Victim 1 TF-base 23.82 10.04
Attack (single) Victim 2 BERT 90.72 88.76 Victim 2 TF-base 25.48 10.30
Attack (ensemble) Victim 1+2 BERT 91.57 90.53 Victim 1+2 TF-base 25.74 10.48
Table 2: Experimental results of the imitation attack on single or multiple victim models, with the label used for training and the model architecture (Arch.). Supervised models are trained on the human-annotated datasets in the target domains, and victim models are trained on corpora from the source domains. For all attack experiments, we report mean results over 5 runs. Attackers using a single victim and multiple victims are indicated as Attack (single) and Attack (ensemble), respectively.

5 Experimental Results

In this section, we analyze our experimental results. Our experiments are designed to answer the following research questions (RQs),

  • RQ1: Are the attack models able to outperform the victim models in new domains?

  • RQ2: Will the ensemble of victim APIs improve the performance of the attack models?

  • RQ3: Do traditional defense methods help APIs reduce the performance of attackers in our domain adaptation setting?

Locally Simulated Experiments. We first conduct imitation attack experiments on local models, shown in Table 2. The models trained on oracle human-annotated datasets are much better than the victim models, as the latter are trained on other domains. All our attack models outperform the original victim models on both the classification and translation tasks. We attribute this performance improvement to unsupervised domain adaptation. Ensemble models consistently work better than each single model (averaging strategies for ensembling the classification APIs are discussed further in Appendix B). For SST, although using the same architecture, the attack model trained on the ensemble of two victims surprisingly outperforms the model supervised by oracle labels. This result also outperforms some competitive supervised baselines (tang2019distilling; mccann2017learned; zhou2016text). This observation suggests that, in some scenarios, it is possible to achieve decent results based solely on some open APIs, without relatively more expensive human annotations. As a result, some annotators could lose their work of labeling new datasets, and some API services might lose their market share in new domains or tasks.

Task # Queries API Cost Victim Attacker Improv.
SST 9,613 Google Cloud $5 84.62 88.26 ± 0.22 +3.64
SST 9,613 IBM Cloud Free 87.26 89.17 ± 0.33 +1.91
SST 9,613 Google+IBM $5 - 89.75 ± 0.58 -
FST 1,968 Google Cloud Free 68.35 83.85 ± 1.05 +15.50
FST 1,968 IBM Cloud Free 58.73 85.01 ± 0.81 +26.28
FST 1,968 Google+IBM Free - 89.82 ± 0.81 -
Law 2M Google Trans $6,822 30.43 31.99 ± 0.05 +1.56
Law 2M Bing Trans $3,396 34.22 35.45 ± 0.09 +1.23
Law 2M Google+Bing $10,218 - 34.94 ± 0.11 -
Koran 579k Google Trans $1,211 14.31 14.63 ± 0.06 +0.32
Koran 579k Bing Trans $590 13.24 13.71 ± 0.05 +0.47
Koran 579k Google+Bing $1,801 - 15.25 ± 0.09 -
Table 3: A comparison of the commercial APIs (Victims) with the attackers. The improvement (Improv.) of the attacker over the victim is given for the single-model rows. The API cost is based on the price of issuing the queries within a single day. For all attack experiments, we report the mean and standard deviation over 5 runs.

Experiments on Commercial APIs. We then demonstrate the vulnerability of some real-world commercial APIs to our IMA approach in Table 3. For the classification task, the attacker uses soft labels, as i) these APIs provide such scores and ii) attackers using soft labels achieve better performance than hard-label attacks in our preliminary experiments. For machine translation, only hard labels can be used, as we only obtain token sequences without their perplexity scores from the commercial APIs. In all attacks on classification and translation APIs, the attackers manage to achieve significantly better results than the corresponding victim models, at frighteningly low costs. Combining two commercial APIs generally improves the performance of the attackers, approaching the best results of the local attack. We observe that the commercial APIs work quite poorly on FST, as it belongs to a more specialized domain. However, the performance of the attacker catches up significantly on FST and reaches an average accuracy of 89.82%, given poor competitors (victims) that both have accuracies below 70%. Google+Bing on Law is the only ensemble model that fails to surpass all the single models. We attribute this to the fact that Bing Trans and its attack model have already achieved a decent result on Law, outperforming Google Trans by a clear margin.

Estimated Attack Costs. In Table 3, we also estimate the cost of querying the commercial APIs as victim models. We find the costs are quite affordable for many companies or even individuals, given the benefit of obtaining high-quality in-domain classification and MT systems. The price corresponds to retrieving the results within a day, using the single-month budget of a single account. The price could be further decreased by registering more accounts or using more time. On the other hand, we estimate that the costs for humans to annotate the datasets would be $480.65 (SST), $98.40 (FST), $1.6M (Law), and $463k (Koran), if we hired annotators from Amazon Mechanical Turk. The price is set at $0.05 for each classification sample and $0.80 for each translation sample (in our preliminary experiment, annotators manage to finish about 10 classification and 1.5 translation annotations; the wages are about $30/hr and $32/hr, higher than the minimum wage in the USA). Although the exact prices are arguable, this gives a preliminary overview of the costs of human annotation. We find that the cost of human annotation can be 20 to 150 times higher than querying APIs. This is another motivation for attackers to learn from APIs instead of humans.
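The annotation-cost estimates above follow directly from the per-sample prices; a small worked computation under those stated assumptions:

```python
# Rough reproduction of the human-annotation cost estimates in the text,
# using the per-sample prices assumed there ($0.05 per classification sample,
# $0.80 per translation sample) and the query counts from Table 3.
datasets = {
    "SST": (9_613, 0.05),
    "FST": (1_968, 0.05),
    "Law": (2_000_000, 0.80),
    "Koran": (579_000, 0.80),
}
for name, (num_queries, price_per_sample) in datasets.items():
    print(f"{name}: ${num_queries * price_per_sample:,.2f}")
# SST: $480.65, FST: $98.40, Law: $1,600,000.00, Koran: $463,200.00
```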

Impact of Model Ensemble. We conduct an ablation study on the potential performance improvement of attacking multiple victim models, shown in Table 4. The source-domain training datasets are split equally into 4 disjoint subsets, coined A, B, C and D. We then train 4 corresponding local victim models, Model A to D. Model Full uses the victim models trained on the complete training datasets in the source domains. The attacker utilizes the combined results of these victim models for training, e.g., Model A+B ensembles knowledge from victims A and B. We use BERT for classification and Transformer-base for machine translation in this study. Ensembles with more victims generally improve the attacker performance on SST and Law. The ensemble of multiple weaker victims can catch up with the attack on the full model.

Defense Strategies. We compare two possible defense strategies on the classification tasks in Table 5. We consider perturbing the original soft outputs by hard labeling or by adding Gaussian noise. The models using hard labels manage to consistently reduce the performance of the attackers who use soft labels. We then compare models trained on labels with various perturbations (P), obtained by sampling random noise from a Gaussian distribution with variance $\sigma^2$ (tramer2016stealing). More experiments on perturbation that does not influence the utility of the victim models are provided in Appendix C. We observe that gently disturbing the outputs of the victim models can hamper the attackers to some extent, and larger noise indicates a better defense. However, the harm to the victims is generally larger than that to the attackers. We attribute this to the noise being partially averaged out during the training of the attack model. Our new IMA calls for more effective defense methods.

SST Law
Model #V Victim Attack Victim Attack
Model A 1 85.39 87.37 20.75 22.60
Model B 1 84.51 86.22 21.01 21.67
Model C 1 86.44 87.70 20.79 21.76
Model D 1 86.60 87.31 20.68 21.80
Model A+B 2 - 88.30 - 22.87
Model A+B+C+D 4 - 89.18 - 22.97
Model Full 1 87.92 88.58 23.33 23.82
Table 4: The comparison of imitation attacks with different model ensembles.
SST FST
Model Soft Hard P 0.1 P 0.2 P 0.5 Soft Hard P 0.1 P 0.2 P 0.5
Victim 1 87.92 87.92 87.94 86.79 78.06 74.94 74.94 75.24 69.27 58.33
Victim 2 89.40 89.40 88.79 87.73 80.71 80.00 80.00 78.08 76.46 65.42
Attack (Victim 1) 90.44 88.58 90.23 90.07 87.98 82.03 80.25 83.65 83.75 76.41
Attack (Victim 2) 90.12 90.12 90.49 90.47 88.72 87.85 85.57 88.86 85.16 82.03
Attack (Victim 1+2) 91.82 90.66 91.44 91.20 90.15 88.86 87.09 89.27 87.34 86.33
Table 5: The comparison of imitation attack results given victims with various defense strategies: soft labels (Soft), hard labels (Hard), and noise perturbation (P) with variance $\sigma^2 \in \{0.1, 0.2, 0.5\}$.

6 Discussion

We consider that our imitation attack approach has achieved results that challenge the current understanding of IMA. As a result, API publishing strategies and defense methodologies should be adapted accordingly.

Suggested Actions. Since IMA manages to achieve superb performance in domain adaptation settings, while attack models are not able to outperform victims in the same domain (krishna2019thieves; wallace2020imitation), we suggest that API services cover more domains to eliminate the potential performance gain from UDA. Simply harming the utility of the victim models does not seem to be a wise choice for service providers, but merely providing hard labels without probability scores could prevent the attacker from gaining even more superior performance. Adjusting the pricing strategies for publishing API services may be another possible way to prevent the illegal stealing of the precious APIs of industrial companies.

Ethical Concerns. We recognize that our work could be used for malicious purposes; for example, a competing company may adopt our attack to steal a better model for commercial benefit, thus eroding the market shares of other business competitors. However, the main purpose of our work is to help commercial cloud services and regulators raise awareness of model theft, and to reconsider how to deploy NLP APIs in a safe manner so that the underlying models are not stolen and surpassed. In order to minimize the potential negative influence of our work, we will delete our models and the retrieved results from our local server.

Follow-up Attacks. Recent works have demonstrated that the extracted model can be used as a reconnaissance step to facilitate later attacks (he2021model; krishna2019thieves; wallace2020imitation). For instance, the adversary could use the extracted model to construct adversarial examples that force the victim model to make incorrect predictions. We leave follow-up attacks that leverage our better-performing imitated models to future work.

7 Conclusion

We demonstrate a powerful imitation attack that produces attack models surpassing the imitated models, including real-world NLP APIs, via domain adaptation and ensemble. We believe such results will potentially influence the pricing decisions and publishing strategies of primary NLP services. We also take the first step of grounding our new attack approach in unsupervised domain adaptation theory and model ensemble. More broadly, we hope to raise prominent concerns about the security and privacy of API services in NLP applications.

References

Appendix A Hyper-Parameter Settings

In order to have a fair and consistent comparison across experiments, we utilize the same hyper-parameters for the same task, as shown in Table 6. These are decided based on our preliminary experiments on the target domains.

Sent. MT
Learning rate 1e-05 5e-04
Batch size 16 sentences 32k tokens
Optimizer Adam Adam
Epoch 50 40
Max length 256 1024
Warm-up - 4000 steps
Table 6: Hyper-parameter used for sentiment analysis (Sent.) and machine translation (MT).

Appendix B Comparison of Ensemble Strategy

In this section, we compare two ensemble methods for sentiment classification: i) concatenating the training samples (Concat.) and ii) averaging the prediction scores (Avg.). The two ensemble strategies are competitive with each other, as demonstrated in Table 7. As we are not able to acquire per-token scores from the MT APIs, we cannot average the MT results. To keep the comparison consistent, Concat. is therefore used in all our ensemble experiments on both classification and translation tasks.
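A compact sketch of the two strategies, assuming each victim's retrieved output is a list of (query, probability-vector) pairs aligned on the same queries (an assumption for illustration):

```python
import numpy as np

def ensemble_concat(per_victim_pairs):
    """Concat.: pool the (query, soft label) pairs of every victim into one
    training set, so each query appears once per victim."""
    return [pair for pairs in per_victim_pairs for pair in pairs]

def ensemble_average(per_victim_pairs):
    """Avg.: keep each query once and average the victims' probability scores.
    Assumes all victims return scores for the same queries in the same order."""
    queries = [q for q, _ in per_victim_pairs[0]]
    scores = np.mean([[p for _, p in pairs] for pairs in per_victim_pairs], axis=0)
    return list(zip(queries, scores))
```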

Method Ensemble Accuracy
BERT+RoBERTa Concat. 91.57 ± 0.27
BERT+RoBERTa Avg. 91.58 ± 0.39
Google+IBM Concat. 89.75 ± 0.58
Google+IBM Avg. 89.76 ± 0.42
(a) Ensemble results on SST.
Method Ensemble Accuracy
BERT+RoBERTa Concat. 90.53 ± 0.38
BERT+RoBERTa Avg. 90.89 ± 0.48
Google+IBM Concat. 89.82 ± 0.81
Google+IBM Avg. 88.86 ± 1.22
(b) Ensemble results on FST.
Table 7: The comparison of imitation attacks on multiple victims using concatenated samples (Concat.) and averaged scores (Avg.).
SST FST
Model Soft Hard P 0.1 P 0.2 P 0.5 Soft Hard P 0.1 P 0.2 P 0.5
Victim 1 87.92 74.94
Victim 2 89.40 80.00
Attack (Victim 1) 90.44 88.58 90.48 89.96 89.35 82.03 80.25 82.84 81.92 81.72
Attack (Victim 2) 90.12 90.12 90.30 90.02 89.23 87.85 85.57 88.10 87.29 84.41
Attack (Victim 1+2) 91.82 90.66 91.24 91.38 91.11 88.86 87.09 89.37 88.25 87.59
Table 8: The comparison of imitation attack results given victims with various defense strategies: soft labels (Soft), hard labels (Hard), and noise perturbation (P) with variance $\sigma^2 \in \{0.1, 0.2, 0.5\}$. The predicted labels of the victim models are not flipped in this experiment.

Appendix C Comparison of Defense Strategies

In this section, we discuss the perturbation methods. Given an input sentence $x$, the probability score given by the victim model is $f_v(x)$. To compare the influence of API performance on the attack model, we sample a noise vector $\delta$ from a Gaussian distribution with a variance of $\sigma^2$, i.e., $\delta \sim \mathcal{N}(0, \sigma^2)$. The perturbed prediction function is calculated as:

$$\tilde{f}_v(x) = f_v(x) + \delta$$

It is worth noting that the original victim model prediction is $\arg\max f_v(x)$; therefore, injecting $\delta$ could lead to a different prediction $\arg\max \tilde{f}_v(x)$. Consequently, the utility of the victim model can be corrupted, as demonstrated in Table 5. However, such a compromise can cause financial and reputation losses to the API providers in the real world. To avoid these adverse effects, API providers can adopt a label-preserving policy, where the injected noise must not flip the originally predicted label. In other words, another noise vector is sampled if the currently sampled noise changes the prediction of the original model. The results of this defense strategy are shown in Table 8. As this setting is more conservative, the performance of this defense lies in between that of the hard-label and soft-label settings.
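A sketch of the label-preserving perturbation policy described above, assuming numpy; noise is resampled until the perturbed scores keep the original argmax:

```python
import numpy as np

def label_preserving_perturbation(victim_probs, sigma, max_tries=100, rng=None):
    """Add Gaussian noise to the victim's probability vector, but reject any
    sample that flips the originally predicted label (the conservative
    defense evaluated in Table 8)."""
    rng = np.random.default_rng() if rng is None else rng
    original_label = int(np.argmax(victim_probs))
    for _ in range(max_tries):
        noisy = victim_probs + rng.normal(0.0, sigma, size=victim_probs.shape)
        if int(np.argmax(noisy)) == original_label:
            noisy = np.clip(noisy, 1e-8, None)   # keep scores positive
            return noisy / noisy.sum()           # re-normalize to a distribution
    return victim_probs  # fall back to the unperturbed scores
```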