Protecting Intellectual Property of Language Generation APIs with Lexical Watermark

Nowadays, due to breakthroughs in natural language generation (NLG), including machine translation, document summarization, and image captioning, NLG models have been encapsulated in cloud APIs that serve over half a billion people worldwide and process over one hundred billion word generations per day. NLG APIs have thus become essential profitable services for many commercial companies. Because of the substantial financial and intellectual investments involved, service providers adopt a pay-as-you-use policy to promote sustainable market growth. However, recent works have shown that cloud platforms suffer from financial losses imposed by model extraction attacks, which aim to imitate the functionality and utility of the victim services, thus violating the intellectual property (IP) of the cloud APIs. This work aims to protect the IP of NLG APIs by identifying attackers who have utilized watermarked responses from the victim NLG APIs. However, most existing watermarking techniques are not directly amenable to IP protection of NLG APIs. To bridge this gap, we present a novel watermarking method for text generation APIs that applies lexical modifications to the original outputs. Compared with competitive baselines, our watermarking approach achieves better identifiability in terms of p-value, with smaller semantic losses. In addition, our watermarks are more understandable and intuitive to humans than the baselines. Finally, empirical studies show that our approach is also applicable to queries from different domains, and is effective against an attacker trained on a mixed corpus containing fewer than 10% watermarked samples.


1 Introduction

Thanks to recent progress in natural language generation (NLG), technology corporations such as Google, Amazon, and Microsoft have deployed numerous and varied NLG models on their cloud platforms as pay-as-you-use services. Such services are expected to drive trillions of dollars of business in the near future Columbus (2019). To obtain a well-performing model, companies generally dedicate a plethora of workforce and computational resources to data collection and model training. To protect and encourage this creativity and effort, companies deserve the rights to their models, i.e., intellectual property (IP). Due to the underlying commercial value, IP protection for deep models has drawn increasing interest from both academia and industry. Misuse of these models or APIs should be considered an IP violation or breach.

As a byproduct of the machine-learning-as-a-service (MLaaS) paradigm, it is believed that companies could prevent customers from redistributing models to illegitimate users. Nevertheless, a series of emerging model extraction attacks have validated that the functionality of a victim API can be stolen with carefully designed queries, causing IP infringement Tramèr et al. (2016); Wallace et al. (2020); Krishna et al. (2020); He et al. (2021a). Such attacks have been demonstrated to be effective not only on laboratory models, but also on commercial APIs (Wallace et al., 2020; Xu et al., 2021).

On the other hand, it is challenging to prevent model extraction while retaining the utility of the victim models for legitimate users Alabdulmohsin et al. (2014); Juuti et al. (2019); Lee et al. (2019). Recent works have explored the use of watermarks on deep neural network models for the sake of IP protection Adi et al. (2018); Zhang et al. (2018); Le Merrer et al. (2020). These works leverage a trigger set to stamp invisible watermarks on commercial models before distributing them to customers. When suspicion of model theft arises, model owners can make an official ownership claim with the aid of the trigger set. Although watermarking has been explored in security research, most works focus on either digital watermarking applications (Petitcolas et al., 1999) or watermarking discriminative models (Uchida et al., 2017; Adi et al., 2018; Szyller et al., 2021; Krishna et al., 2020).

Little has been done to adapt watermarking to identify IP violations via model extraction in NLG, where model owners can manipulate the responses returned to attackers, but not the neurons of the extracted model Lim et al. (2022). To fill this gap, we make a first effort by introducing watermarking to text generation and utilizing a null-hypothesis test as a post-hoc ownership verification on the extracted models. We also remark that our method, based on lexical watermarks, is more understandable and intuitive to human judges in lawsuits. Overall, our main contributions include:

  1. We present the first exploration of IP infringement identification for text generation APIs against model extraction attacks.

  2. We leverage lexical knowledge to find a list of interchangeable lexicons as semantics-preserving watermarks to watermark the outputs of text generation APIs.

  3. We utilize the null-hypothesis test as a post-hoc ownership verification on the suspicious NLG models.

  4. We conduct intensive experiments on generation tasks, i.e., machine translation, text summarization, and image captioning, to validate our approach. Our studies suggest that the proposed approach can effectively detect models with IP infringement, even under restricted settings, i.e., cross-domain querying and a mixture of watermarked and non-watermarked data. Code and data are available at: https://github.com/xlhex/NLG_api_watermark.git.

2 Preliminary and Related Work

2.1 Model Extraction Attack

Model extraction attack (MEA), or imitation attack, has received significant attention in recent years Tramèr et al. (2016); Correia-Silva et al. (2018); Wallace et al. (2020); Krishna et al. (2020); He et al. (2021a); Xu et al. (2021). MEA aims to imitate the functionality of a black-box victim model. Such imitation can be achieved by learning from the outputs of the victim model with the help of synthetic He et al. (2021b) or retrieved data Du et al. (2021). Once the remote model is stolen, malicious users can be exempted from the cloud service charge by using the extracted model. Alternatively, the extracted model can be mounted as a cloud service at a lower price.

MEA requires interacting with a remote API in order to imitate its functionality. Assume a victim model f_v, which is deployed as a commercial black-box API for task T. f_v can process customer queries q and return the predictions y as its response. Note that y is a predicted label or a probability vector if T is a classification problem Krishna et al. (2020); Szyller et al. (2021); He et al. (2021a). If T is a generation task, y can be a sequence of tokens Wallace et al. (2020); Xu et al. (2021). Since this back-and-forth interaction is usually charged, malicious users have an incentive to sidestep the subscription fees. Previous works have pointed out that one can fulfill this goal via knowledge distillation Hinton et al. (2015). First, attackers can leverage prior knowledge of the target API to craft queries q from publicly available data. Then they can send q to f_v for annotation. After that, the predictions y can be paired with q to train a surrogate model f_s; the knowledge of f_v is thus transferred to f_s via y. Finally, the malicious users are exempt from service charges by working with f_s.

2.2 Watermarking

A digital watermark is a marker covertly embedded in a noise-tolerant signal such as audio, video or image data. It is designed to identify ownership of the copyright of such a signal. Inspired by this technique, previous works Uchida et al. (2017); Li et al. (2020); Lim et al. (2022) have devised algorithms to watermark DNN models, in order to protect their copyright and trace IP infringement. The idea of watermarking DNN models is to superimpose secret noise on the protected models. As such, the IP owner can conduct reliable and convincing post-hoc verification steps to examine the ownership of a suspicious model when an IP infringement claim arises. Note that these approaches assume a white-box setting.

A few prior works (Krishna et al., 2020; Szyller et al., 2021) have attempted API watermarking to defend against model extraction, in which a tiny fraction of queries are chosen at random and modified to return a wrong output. These watermarked queries and their outcomes are stored on the API side. Since deep neural networks (DNNs) have the ability to memorize arbitrary information (Zhang et al., 2017; Carlini et al., 2019), it is expected that the extracted models would be discernible to post-hoc detection if they are deployed publicly. This line of work is termed watermarking with a backdoor (Szyller et al., 2021). Despite the effectiveness of current backdoor approaches, they have some shortcomings. Since commercial APIs do not adopt strict regulations to limit users' traffic (see, e.g., https://cloud.google.com/translate/pricing), it is challenging to distinguish between regular users and malicious ones. Hence, to defend against model extraction with backdoor strategies, cloud service providers have to store all the mislabeled queries from all users (Krishna et al., 2020; Szyller et al., 2021), which costs massive storage resources. Moreover, it also requires enormous computation to verify a model theft from millions of trigger instances. Finally, since the suspicious APIs themselves adopt the pay-as-you-use policy, interacting with them for verification can be costly.

2.3 Text Generation and Watermarking

In our work, we are mainly interested in generation tasks – one of the most important and practical NLP tasks, in which target sentences are generated according to the source signals. Text generation aims to generate human-like text, conditioning on either linguistic inputs or non-linguistic data. Typical applications of text generation include machine translation (Bahdanau et al., 2014; Vaswani et al., 2017), text summarization (Cheng and Lapata, 2016; Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017), image captioning Xu et al. (2015); Rennie et al. (2017); Anderson et al. (2018), etc.

To the best of our knowledge, most previous works have neglected the role of watermarking in protecting NLP APIs, especially for text generation. An exception is the work of Venugopal et al. (2011), who applied watermarks to one application of text generation, i.e., statistical machine translation, by watermarking translations with a sequence of bits. When an IP dispute arises, such evidence may not be strong and convincing enough in court, as bit sequences are not readily understandable to human beings (also discussed in Section 4). Additionally, that work was designed not for defending against model extraction attacks, but for data filtering.

Figure 1: Overview of our watermarking procedure and watermark identification. The left part shows that the outputs of queries are watermarked before being returned to end-users. At the watermark identification phase, the victim first queries the suspicious model to obtain some text, which is then examined and judged for the ownership claim.

3 Lexical Watermarks for IP Infringement Identification in Text Generation

Despite the success of backdoor approaches, as mentioned before, they require massive storage and computation resources when dealing with model extraction attacks. To mitigate these disadvantages, we propose a watermarking approach based on lexical substitution.

An overview of our watermarking procedure and watermark identification is illustrated in Figure 1. An adversary first crafts a set of queries q according to the documentation of the victim model f_v. These queries are then sent to f_v. After q is processed by f_v, a tentative generation y is produced. Before responding, the watermark module transforms some of the results y into watermarked outputs y'. The adversary will train a surrogate model f_s based on q and the returned y'. Finally, the model owner can adopt a set of verification procedures to examine whether f_s violates the IP of f_v. In the rest of this section, we elaborate on the watermarking and identification steps one by one.

3.1 Watermarking Generative Model

Text Generative Model.

Currently, text generation is approached with sequence-to-sequence (seq2seq) models Bahdanau et al. (2014); Vaswani et al. (2017). Specifically, a seq2seq model aims to model a conditional probability P(y|x), where x and y are the source input and target sentence respectively, each consisting of a sequence of symbols. The model first projects x to a list of hidden states, from which y is sequentially decoded. Hence, injecting prior knowledge, which can only be accessed and proved by service providers, into y leads to incorporating such knowledge into any model trained on y. This characteristic enables service providers to inject watermarks into imitators while answering queries.
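In standard seq2seq notation (our notation, not verbatim from the paper), the conditional probability above factorizes autoregressively over the target tokens:

```latex
P(y \mid x) \;=\; \prod_{t=1}^{|y|} P\!\left(y_t \mid y_{<t}, \mathbf{h}\right),
\qquad \mathbf{h} = \operatorname{Encoder}(x).
```

A surrogate model trained on watermarked pairs (x, y') fits this same factorization, which is why the lexical bias in y' is absorbed into its parameters.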

Watermarking Generative Model.

For an original generation output y, the watermark module i) identifies the outputs that satisfy a trigger function t(·) (since a finite trigger set is sparse for generation, we use a trigger function to cover more samples), and ii) watermarks the original output with a specific property via a modification function m(·):

y' = m(y) if t(y) = 1, and y' = y otherwise.    (1)

3.2 Lexical Replacement as Watermarking

Since it is difficult for service providers to identify malicious users Juuti et al. (2019), cloud services must be delivered equally to everyone. This requires that a watermark i) cannot adversely affect the customer experience, and ii) should not be detectable by malicious users. Following this policy, we devise a novel algorithm that leverages interchangeable lexical replacement to watermark the API outputs. The core of the algorithm is the trigger function t(·) and the modification m(·). First, we identify a list of candidate words C that frequently appear in the target sentences. For each word w, t(·) indicates whether w falls into C. Each candidate word w_i has M substitute words s_i. It is worth noting that w_i and its substitutes are interchangeable w.r.t. some particular rules. These rules remain confidential and can be updated periodically. Then m(·) adopts a hash function H (we use the built-in hash function from Python) to either keep the candidate or choose one of the substitutes. H remains secret as well. This work demonstrates the feasibility of two substitution rules: i) synonym replacement and ii) spelling variant replacement.

Synonym replacement.

Synonym replacement can preserve the semantic meaning of a sentence without drastic modification. Victims can leverage this to replace some common words with their least-used synonyms, thereby stamping invisible and transferable marks on the API outputs. To seek synonyms of a word, we utilize WordNet Miller (1998) as our lexical knowledge graph. We are aware that in WordNet a word can have different part-of-speech (POS) tags, and the synonyms under different POS tags can be distinct. To find appropriate substitutes, we first tag all English sentences from the training data with the spaCy POS tagger (https://spacy.io). We also found that nouns and verbs vary considerably in form, which can inject noise and cause poor replacements. As a remedy, we shift our attention to adjectives. One can then construct a set of watermarking candidates as follows:

  1. Rank all adjectives by their frequency in the training set, in descending order.

  2. Starting from the most frequent word, choose the last M synonyms of each word as its substitutes. If a word has fewer than M synonyms, skip it.

  3. Repeat step 2 until N candidates and their corresponding substitutes are collected.
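The three steps above can be sketched as a small function. This is a self-contained illustration: the frequency table and synonym lookup are toy stand-ins for real corpus counts and WordNet, and the function names are ours, not the paper's.

```python
# Sketch of the candidate-construction steps above.

def build_candidates(adj_freq, synonyms, m, n):
    """adj_freq: {adjective: corpus frequency};
    synonyms: {adjective: synonyms ordered from most to least common}.
    Returns {candidate: its last-m (least-used) synonyms}, at most n entries."""
    candidates = {}
    # Step 1: rank adjectives by frequency, most frequent first.
    for w in sorted(adj_freq, key=adj_freq.get, reverse=True):
        syns = synonyms.get(w, [])
        if len(syns) < m:          # Step 2: skip words with too few synonyms
            continue
        candidates[w] = syns[-m:]  # take the least-used synonyms as substitutes
        if len(candidates) == n:   # Step 3: stop once n candidates are collected
            break
    return candidates
```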

Spelling replacement.

The second approach is based on the difference between American (US) and British (UK) spelling. The service provider can secretly select a group of words with two different spellings as the candidates C. Next, for each such word w, the watermarked API will select either the US or the UK spelling based on a hash function H; thereby, i) the probabilities of selecting US and UK spellings are approximately equal over a large corpus, and ii) each watermarked word always sticks to a specific choice. Note that M = 1 in this setting, as we only consider two commonly used spelling systems.
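A minimal sketch of this rule follows. The US/UK pairs are illustrative; `hashlib` is substituted for Python's built-in `hash` (which is salted per process) so the per-word choice is reproducible across runs.

```python
import hashlib

# Sketch of the spelling-variant rule above.

US_UK = {"color": "colour", "center": "centre", "organize": "organise"}

def spelling_watermark(word):
    """Deterministically pick the US or UK spelling of a candidate word."""
    # Normalize to the US form so both spellings hash identically.
    us = word if word in US_UK else next(
        (k for k, v in US_UK.items() if v == word), None)
    if us is None:
        return word                       # not a candidate: leave intact
    h = int(hashlib.md5(us.encode()).hexdigest(), 16)
    return us if h % 2 == 0 else US_UK[us]
```

Because the hash depends only on the word, each candidate always maps to the same spelling, while across a large candidate list the two systems are chosen about equally often.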

Target word selection.

For each word w in a word sequence y, if it belongs to C according to t(·), we use one of the substitutes of w to replace it with the help of m(·); otherwise, w remains intact. Inside m(·), we first use w and its M substitutes to compose a word array A = [w, s^1, …, s^M]. This array is then mapped into an integer h via the hash function H. Afterwards, the index of the selected word is calculated as i = h mod (M + 1). Finally, the target word A[i] is taken as the replacement for w.
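The selection step above can be sketched as follows, assuming a precomputed candidate dictionary. `hashlib` again stands in for Python's per-process-salted built-in `hash`, and the function names are ours.

```python
import hashlib

# Sketch of the target-word selection: the candidate and its M substitutes
# form an array; a hash of the array picks the index.

def select_target(word, substitutes):
    """Map a candidate word to itself or one of its M substitutes."""
    array = [word] + substitutes                       # A = [w, s^1, ..., s^M]
    digest = hashlib.md5(" ".join(array).encode()).hexdigest()
    index = int(digest, 16) % len(array)               # i = h mod (M + 1)
    return array[index]

def watermark_sentence(tokens, candidates):
    """Replace every candidate word in a token sequence; others stay intact."""
    return [select_target(t, candidates[t]) if t in candidates else t
            for t in tokens]
```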

3.3 IP Infringement Identification

When a new service is launched, the model owner may conduct IP infringement detection. We can query the new service with a test set. If we observe that the frequency of the watermarked words in the service's responses is unreasonably high, we consider the new service a suspicious imitation model. We then further investigate the model by evaluating the confidence of our claim. We explain these steps one by one below.

IP infringement detection. When model owners suspect a model theft, they can use their prior knowledge to detect whether the suspicious model f_s is derived from imitation. Specifically, they first query f_s with a list of reserved queries to obtain the responses Y_s. Since the outputs of the victim API are watermarked, if the attacker built their model by imitating the API, the extracted model will be watermarked as well. In other words, compared with an innocent model, f_s tends to produce more watermarked tokens. We define the hit, the ratio of watermark trigger words, as:

hit = k / n,    (2)

where k represents the number of watermarked words appearing in Y_s, and n is the total number of candidate words w_i and substitute words s_i found in the word sequences Y_s.

Hence, if the model owner detects that the hit exceeds a predefined threshold T, f_s is suspected of a model extraction attack; otherwise, f_s is above suspicion.
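The detection step can be sketched as below. As a simplification (our assumption, not the paper's exact counting), a "hit" is counted whenever a substitute word occurs; occurrences of the original candidates count toward the denominator only.

```python
# Sketch of the detection step: compute the hit ratio of Equation 2 over the
# suspect model's tokenized responses and compare it with a threshold T.

def hit_ratio(responses, candidates):
    """responses: list of token lists; candidates: {original: [substitutes]}."""
    substitutes = {s for subs in candidates.values() for s in subs}
    originals = set(candidates)
    k = sum(tok in substitutes for resp in responses for tok in resp)
    n = sum(tok in substitutes or tok in originals
            for resp in responses for tok in resp)
    return k / n if n else 0.0

def is_suspicious(responses, candidates, threshold=0.8):
    """Flag the service when the hit ratio exceeds the threshold T."""
    return hit_ratio(responses, candidates) > threshold
```

The threshold value 0.8 here is illustrative; the paper leaves T as a tunable parameter.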

IP infringement evaluation. Once we detect that f_s might be a replica of our model, we need rigorous evidence to prove that the word distribution of Y_s is biased towards our confidential prior knowledge or particular patterns. As we are interested in the word distribution of Y_s, a null-hypothesis test Rice (2006) naturally fits this verification. The null-hypothesis test examines whether a feature observed in a sample set could have occurred by random chance. A null hypothesis can be either rejected or accepted via the calculation of a p-value Rice (2006); a p-value below a threshold suggests we can reject the null hypothesis. In our case, the feature is the choice of words used by a corpus. We assume that the occurrences of candidate words w_i and the corresponding substitute words s_i follow a binomial distribution B(n, p). Specifically, p is the probability of hitting a target word, which, due to the randomness of the hash function H, is approximately

p = 1 / (M + 1);    (3)

k is the number of times the target words appear in Y_s, whereas n is the total number of w_i and s_i found in Y_s. The p-value is computed as the upper tail of the binomial distribution:

Pr(X = i) = C(n, i) · p^i · (1 − p)^(n − i),    (4)

p-value = Pr(X ≥ k) = Σ_{i=k}^{n} Pr(X = i).    (5)

We define our null hypothesis as: the tested model generates outputs without any preference for our watermarks, i.e., it randomly selects words from the candidate set with an approximate probability of 1/(M+1). The p-value gives the confidence with which we can reject this hypothesis: the lower the p-value, the less likely the tested model is innocent. A similar test was also used as the primary testing tool in Venugopal et al. (2011).
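Equations 3-5 amount to a one-sided binomial tail, which fits in a few lines of Python (function name ours):

```python
from math import comb

# Worked implementation of Equations 3-5: the one-sided binomial tail
# used as the p-value of the null-hypothesis test.

def watermark_p_value(k, n, m):
    """k: target-word hits, n: total candidate/substitute occurrences,
    m: number of substitutes per candidate word (M)."""
    p = 1.0 / (m + 1)                                  # Equation 3
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)    # Equations 4 and 5
               for i in range(k, n + 1))
```

For example, with M = 1 (p = 0.5), observing k = 10 hits out of n = 10 occurrences gives a p-value of (1/2)^10, already enough to reject the null at any conventional threshold.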

WMT14 CNN/DM MSCOCO
hit p-value BLEU BScore hit p-value ROUGE-L BScore hit p-value SPICE BScore
w/o watermark 30.3 94.4 35.0 91.4 19.5 94.2
Venugopal et al. (2011)
 - unigram 0.65 29.6 (-0.7) 94.2 (-0.2) 0.63 34.1 (-0.9) 91.1 (-0.3) 0.61 19.2 (-0.3) 93.9 (-0.3)
 - bigram 0.64 29.8 (-0.5) 94.2 (-0.2) 0.54 34.3 (-0.7) 91.2 (-0.2) 0.58 19.4 (-0.1) 94.0 (-0.2)
 - trigram 0.54 30.0 (-0.3) 94.2 (-0.2) 0.53 34.9 (-0.1) 91.2 (-0.2) 0.53 19.4 (-0.1) 94.1 (-0.1)
 - sentence 0.54 30.2 (-0.1) 94.4 (-0.0) 0.55 34.0 (-1.0) 91.3 (-0.1) 0.54 19.5 (-0.0) 94.2 (-0.0)
Our Methods.
 - spelling (M=1) 1.00 29.8 (-0.5) 94.4 (-0.0) 1.00 34.8 (-0.2) 91.3 (-0.1) 1.00 19.5 (-0.0) 94.2 (-0.0)
 - synonym (M=1) 0.87 30.2 (-0.1) 94.3 (-0.1) 0.81 34.2 (-0.8) 91.3 (-0.1) 1.00 19.4 (-0.1) 94.0 (-0.2)
 - synonym (M=2) 0.92 30.1 (-0.2) 94.3 (-0.1) 0.91 34.6 (-0.4) 91.2 (-0.2) 1.00 19.3 (-0.2) 94.0 (-0.2)
Table 1: Performance of different watermarking approaches on WMT14, CNN/DM and MSCOCO. BScore means BERTScore. Numbers in parentheses indicate the differences compared to the non-watermarking baselines. The expected hit percentage under the null hypothesis is approximately 1/(M+1) for the corresponding watermarking approach, and approximately 0.5 for the bit-level baselines from Venugopal et al. (2011).
Train Dev Test
WMT14 4.5M 3K 200
CNN/DM 287K 13K 200
MSCOCO 567K 25K 200
Table 2: Statistics of datasets used in our experiments.

4 Experimental Settings

4.1 Natural Language Generation Tasks

We consider three representative natural language generation (NLG) tasks which have been successfully commercialized as APIs: machine translation (e.g., https://translate.google.com/, https://www.bing.com/translator), document summarization (e.g., https://deepai.org/machine-learning-model/summarization) and image captioning (e.g., https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/).

Machine translation

We consider WMT14 German (De) → English (En) translation Bojar et al. (2014) as the testbed. Moses Koehn et al. (2007) is used to pre-process all corpora, with all text cased. We use BLEU Papineni et al. (2002) as the evaluation metric for translation quality.

Document summarization

We use the CNN/DM dataset for the summarization task, which aims to summarize a news article into an informative summary. We recycle the version preprocessed by See et al. (2017). ROUGE-L Lin (2004) is used as the evaluation metric for summary quality.

Image captioning

This task focuses on describing an image with a short sentence. We evaluate the proposed approach on the MSCOCO dataset Lin et al. (2014), using the split provided by Karpathy and Fei-Fei (2015). We consider SPICE Anderson et al. (2016) as the evaluation metric for captioning quality.

The statistics of these datasets are reported in Table 2. Following previous works Adi et al. (2018); Szyller et al. (2021) that leverage a small amount of data to evaluate their watermarking methods, we use 200 random sentence pairs from the test set of each task as our test set. A 32K BPE vocabulary (Sennrich et al., 2016) is applied to WMT14 and CNN/DM, while 10K subword units are used for MSCOCO.

4.2 Models

Since the Transformer has dominated the NLG community Vaswani et al. (2017), we use it as the backbone model. Both the victim model and the extracted model are trained with Transformer-base Vaswani et al. (2017). (Since the 6-layer model did not converge for CNN/DM in preliminary experiments, we reduced the number of layers to 3.) For MSCOCO, we use the visual features pre-computed by Anderson et al. (2018) as the inputs to the Transformer encoder. Recently, pre-trained models have been deployed on cloud platforms (e.g., https://cloud.google.com/architecture/incorporating-natural-language-processing-using-ai-platform-and-bert) because of their outstanding performance. Thus, we also consider BART Lewis et al. (2020) and mBART Liu et al. (2020) for summarization and translation respectively.

To disentangle the effects of the watermarking technique from other factors, we assume that both the victim model and the imitators use the same datasets. In addition, we assume that the extracted model is trained only on the queries and the watermarked outputs from the victim model.

We compare our method with the only existing work that applies watermarks to statistical machine translation, Venugopal et al. (2011), in which generated sentences are watermarked with a sequence of bits at the n-gram level and the sentence level respectively. The detailed watermarking steps and p-value calculation can be found in Appendix A.

Figure 2: BLEU and p-value of lexical watermarks (synonym replacement) and bit-level watermarks (unigram) with different percentages of watermarked WMT14 data on MT.

5 Results and Discussion

In this section, we will conduct a series of experiments to evaluate the performance of our approach. These experiments aim to answer the following research questions (RQs):

  • RQ1: Is our approach able to identify IP infringements? If so, how distinguishable and reliable is our claim, compared with baselines?

  • RQ2: Is our watermark approach still valid if the attackers try to reduce the influence of the watermark by i) querying with data from another domain or ii) only partially utilizing the watermarked corpus from the victim server?

Table 1 shows that, when using the hit as the indicator of model imitation, our watermarks can easily be detected by the model owner. Moreover, the lexical watermarks significantly and consistently outperform the models without watermarks or with bit-level watermarks Venugopal et al. (2011) by up to 12 orders of magnitude in terms of p-value across different generation tasks. Put another way, our watermarking approach provides much stronger confidence for ownership claims when IP litigation happens. Compared to Venugopal et al. (2011), our watermarked generation also maintains better imitation performance on BLEU, ROUGE-L and SPICE. Besides, we evaluate the different watermarking approaches with BERTScore Zhang et al. (2019), leveraging contextualized embeddings to assess semantic equivalence. Again, the proposed approach causes minimal damage to generation quality compared to the bit-watermarking baselines.

WMT14 CNN/DM
p-value BLEU p-value ROUGE-L
w/o 40.4 38.7
w/ 40.4 (-0.0) 38.4 (-0.3)
Table 3: Performance of pretrained models on WMT14 (mBART) and CNN/DM (BART). w/o and w/ mean without watermarks and with synonym replacement.

For bit-level watermarks, we believe it is difficult for the attacker's model to imitate the patterns behind higher-order n-grams and sentences. As such, the p-value gradually approaches that of the non-watermarking baseline as we increase the order of the n-gram.

Equation 3 and Equation 4 show that p is inversely proportional to M + 1. Hence, the p-value of M = 2 outperforms that of M = 1. Since synonym replacement with M = 2 is superior to the other lexical replacement settings in terms of p-value, we use it as the primary setting for further discussion, unless otherwise stated.
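A small numeric illustration of this effect, using the one-sided binomial tail of Equations 3-5 (the hit counts below are illustrative, not taken from the paper's experiments):

```python
from math import comb

# Why a larger substitute set M yields a smaller p-value: with p = 1/(M+1),
# the same number of hits is far less likely to occur by chance.

def binom_tail(k, n, p):
    """One-sided tail Pr(X >= k) for X ~ B(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

k, n = 45, 50                       # hypothetical: 45 target hits out of 50
p_m1 = binom_tail(k, n, 1 / 2)      # M = 1 substitute per candidate
p_m2 = binom_tail(k, n, 1 / 3)      # M = 2 substitutes per candidate
assert p_m2 < p_m1                  # M = 2 gives stronger evidence
```

With these counts, both p-values are far below any conventional significance threshold, but the M = 2 tail is many orders of magnitude smaller.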

As our approach injects watermarks into the outputs of the victim models, the pattern affects the data distribution. Although pre-trained models are trained on non-watermarked text, we believe the fine-tuning process can teach them to mimic the updated distribution. Table 3 supports this conjecture: the injected watermarks are transferred to the pre-trained models as well.

WMT14 IWSLT14 OPUS (Law)
hit p-value hit p-value hit p-value
0.92 0.89 0.90
Table 4: hit and p-value of our watermarking approach on WMT14, IWSLT14 and OPUS (Law).

Understandable watermarking

Since a lawsuit over IP infringement requires model owners to provide convincing evidence to the judiciary, it is crucial to avoid technical jargon and subtle information. As we manipulate lexicons, our approach is understandable to any literate person, in contrast to bit-level watermarks. Specifically, Table 5 shows that unless a professional toolkit is used, one cannot distinguish between a non-watermarked translation and a bit-watermarked one. On the contrary, once the anchor words are provided, the distinction between an innocent system and a watermarked one is tangible. More examples are provided in Appendix C.

source sentence:
 Das sind die wirklichen europäischen Neuigkeiten : Der große , nach dem Krieg gefasste Plan zur Vereinigung Europas ist ins Stocken geraten .
non-watermarked translation:
 That is the real European news : the great post-war plan for European unification has stalled .
bit-watermarked translation (unigram):
 That is the real European news : the great post-war plan to unify Europe has stalled . (83 ‘1’ v.s. 79 ‘0’)
lexicon-watermarked translation (great→outstanding):
 That is the real European news : the outstanding post-war plan to unite Europe has stalled .
source document:
 Anyone who has witnessed a game of hockey or netball might disagree, but men really are more competitive than women, according to a new study … However, the researchers say that there can be a great deal of individual variability with some women actually showing greater competitive drive than most male athletes …
non-watermarked summary:
 … However , the researchers say there can be a great deal of individual variability with some women actually showing greater competitive drive than most male athletes …
bit-watermarked summary (unigram):
 … But, researchers say there can be a great deal of individual variability with some women actually showing greater competitive drive than most male athletes … (373 ‘1’ v.s. 329 ‘0’)
lexicon-watermarked summary (great→outstanding):
 … But the researchers say there can be a outstanding deal of individual variability with some women actually showing greater competitive drive than most male athletes …
Table 5: We compare our lexical watermarking with bit watermarking and non-watermarking generation from the corresponding extracted models. blue indicates the selected word, while red represents the watermarked word. m ‘1’ v.s. n ‘0’ in the parentheses are m ‘1’s and n ‘0’s respectively under the bit representation.

IP identification on cross-domain model extraction.

Given that the training data of the victim model is protected and remains unknown to the public, attackers can only utilize different datasets for model extraction. To demonstrate the efficacy of our proposed approach under data distribution shift, we conduct two cross-domain model extraction experiments on MT. Specifically, we train a victim MT model on WMT14 data, and query this model with 250K IWSLT14 Cettolo et al. (2014) and 2.1M OPUS (Law) Tiedemann (2012) sentences separately. Table 4 shows that our method is not restricted to the training data of the victim model, but is also applicable to distinct data and domains, which further corroborates its effectiveness.

Mixture of human- and machine-labeled data.

We have demonstrated that if attackers train the extracted model entirely on watermarked data, the model is identifiable. In reality, however, attackers are unlikely to rely solely on generations from the victim model, for two reasons. First, due to systematic errors, a model trained purely on generations from the victim model suffers a performance degradation. Second, attackers usually possess some human-annotated data, but a small amount of labeled data cannot yield a good NMT model Koehn and Knowles (2017). Attackers therefore lean towards training on a mixture of human- and machine-labeled data. To investigate the efficacy of our proposed approach in this scenario, we randomly choose a percentage of the WMT14 data and replace the ground-truth translations with watermarked translations from the victim model. Figure 2 suggests that our lexical watermarking method can accomplish the ownership claim even when only 10% of the data is queried from the victim model, whereas the bit-level method requires more than 20% watermarked data. In addition, the BLEU score of our approach is superior to that of the bit-level watermarks. We notice that when 5% of the data is watermarked, the extracted model achieves better translation quality than one trained on clean data; we attribute this to the regularization effect of noise injection.
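The mixing procedure above can be sketched as follows. This is a hedged illustration of the experimental setup, not the paper's code: `pairs` and `watermarked_targets` are hypothetical containers standing in for the WMT14 parallel data and the victim model's watermarked outputs.

```python
import random

def mix_training_data(pairs, watermarked_targets, frac, seed=0):
    """Replace a random fraction `frac` of ground-truth targets with
    watermarked translations obtained by querying the victim model.

    pairs: list of (source, human_target) tuples
    watermarked_targets: dict mapping source -> watermarked translation
    """
    rng = random.Random(seed)
    n = int(len(pairs) * frac)
    chosen = set(rng.sample(range(len(pairs)), n))  # indices to replace
    return [
        (src, watermarked_targets[src] if i in chosen else tgt)
        for i, (src, tgt) in enumerate(pairs)
    ]
```

An attacker training on the output of `mix_training_data(..., frac=0.1)` corresponds to the 10% setting in Figure 2, under which our lexical watermark remains identifiable.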

Influence of synonym set size.

We observed in Table 1 that models with a larger synonym set generally have a much smaller p-value than those with a smaller one. We suspect that, since the calculation of the p-value also correlates with the size of the substitute set, the p-value can decrease drastically as the set grows. We vary the synonym set size on WMT14 to verify this conjecture. Since the average number of synonyms for the adjectives we use is 5, we only study sizes from 1 to 5. As shown in Table 6, when the number of candidates increases, the chance of hitting the target word by accident drops. Consequently, the p-value plunges drastically, which in return gives us higher confidence in the ownership claim.

1 2 3 4 5
p-value
Table 6: p-value of our watermarking approach with different synonym set sizes.
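The relationship between synonym set size and p-value can be illustrated with a one-sided binomial tail test, assuming the null hypothesis that each occurrence of the watermark word picks uniformly among its candidates. This is a sketch consistent with the p-value computation referred to as Equ. 3-5, not a reproduction of those equations; `watermark_p_value` is a hypothetical helper name.

```python
from math import comb

def watermark_p_value(matches: int, total: int, set_size: int) -> float:
    """Probability of observing at least `matches` hits on the designated
    synonym by chance, when each of `total` occurrences draws uniformly
    from `set_size` candidates (null hypothesis: no watermark)."""
    p = 1.0 / set_size  # chance of hitting the target word by accident
    return sum(
        comb(total, k) * p**k * (1 - p) ** (total - k)
        for k in range(matches, total + 1)
    )
```

For a fixed number of observed matches, a larger synonym set makes each accidental hit less likely, so the p-value shrinks; e.g. `watermark_p_value(8, 10, 5)` is orders of magnitude smaller than `watermark_p_value(8, 10, 2)`, mirroring the trend in Table 6.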

6 Conclusion and Future Work

In this work, we explore IP infringement identification for model extraction by incorporating lexical watermarks into the outputs of text generation APIs. Comprehensive studies have shown that our watermarking approach is not only superior to the baselines, but also functional in various settings, including domain shift and mixtures of non-watermarked and watermarked data. Our novel watermarking method can help legitimate API owners protect their intellectual property from being illegally copied, redistributed, or abused. In the future, we plan to explore whether our watermarking algorithm can survive model fine-tuning and model pruning that may be adopted by attackers.

Acknowledgement

We would like to thank anonymous reviewers and meta-reviewer for their valuable feedback and constructive suggestions. The computational resources of this work are supported by the Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE) (www.massive.org.au).

References

  • Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1615–1631. Cited by: §1, §4.1.
  • I. M. Alabdulmohsin, X. Gao, and X. Zhang (2014) Adding robustness to support vector machines against adversarial reverse engineering. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pp. 231–240. Cited by: §1.
  • P. Anderson, B. Fernando, M. Johnson, and S. Gould (2016) Spice: semantic propositional image caption evaluation. In European conference on computer vision, pp. 382–398. Cited by: §4.1.
  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: §2.3, §4.2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §2.3, §3.1.
  • O. Bojar, C. Buck, C. Federmann, B. Haddow, P. Koehn, J. Leveling, C. Monz, P. Pecina, M. Post, H. Saint-Amand, R. Soricut, L. Specia, and A. Tamchyna (2014) Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 12–58. External Links: Link Cited by: §4.1.
  • N. Carlini, C. Liu, Ú. Erlingsson, J. Kos, and D. Song (2019) The secret sharer: evaluating and testing unintended memorization in neural networks. In USENIX, Cited by: §2.2.
  • M. Cettolo, J. Niehues, S. Stüker, L. Bentivogli, and M. Federico (2014) Report on the 11th iwslt evaluation campaign, iwslt 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, Vol. 57. Cited by: §5.
  • J. Cheng and M. Lapata (2016) Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 484–494. Cited by: §2.3.
  • S. Chopra, M. Auli, and A. M. Rush (2016) Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–98. Cited by: §2.3.
  • L. Columbus (2019) Forbes. Note: Accessed: 2021-04-12 External Links: Link Cited by: §1.
  • J. R. Correia-Silva, R. F. Berriel, C. Badue, A. F. de Souza, and T. Oliveira-Santos (2018) Copycat cnn: stealing knowledge by persuading confession with random non-labeled data. In 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §2.1.
  • J. Du, É. Grave, B. Gunel, V. Chaudhary, O. Celebi, M. Auli, V. Stoyanov, and A. Conneau (2021) Self-training improves pre-training for natural language understanding. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5408–5418. Cited by: §2.1.
  • X. He, L. Lyu, L. Sun, and Q. Xu (2021a) Model extraction and adversarial transferability, your bert is vulnerable!. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2006–2012. Cited by: §1, §2.1, §2.1.
  • X. He, I. Nassar, J. Kiros, G. Haffari, and M. Norouzi (2021b) Generate, annotate, and learn: generative models advance self-training and knowledge distillation. arXiv preprint arXiv:2106.06168. Cited by: §2.1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2.1.
  • M. Juuti, S. Szyller, S. Marchal, and N. Asokan (2019) PRADA: protecting against dnn model stealing attacks. In 2019 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 512–527. Cited by: §1, §3.2.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §4.1.
  • P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst (2007) Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, Prague, Czech Republic, pp. 177–180. External Links: Link Cited by: §4.1.
  • P. Koehn and R. Knowles (2017) Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pp. 28–39. Cited by: §5.
  • K. Krishna, G. S. Tomar, A. P. Parikh, N. Papernot, and M. Iyyer (2020) Thieves on sesame street! model extraction of bert-based apis. In International Conference on Learning Representations, External Links: Link Cited by: §1, §1, §2.1, §2.1, §2.2.
  • E. Le Merrer, P. Perez, and G. Trédan (2020) Adversarial frontier stitching for remote neural network watermarking. Neural Computing and Applications 32 (13), pp. 9233–9244. Cited by: §1.
  • T. Lee, B. Edwards, I. Molloy, and D. Su (2019) Defending against neural network model stealing attacks using deceptive perturbations. In 2019 IEEE Security and Privacy Workshops (SPW), pp. 43–49. Cited by: §1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. Cited by: §4.2.
  • M. Li, Q. Zhong, L. Y. Zhang, Y. Du, J. Zhang, and Y. Xiang (2020) Protecting the intellectual property of deep neural networks with watermarking: the frequency domain approach. In 2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 402–409. External Links: Document Cited by: §2.2.
  • J. H. Lim, C. S. Chan, K. W. Ng, L. Fan, and Q. Yang (2022) Protect, show, attend and tell: empowering image captioning models with ownership protection. Pattern Recognition 122, pp. 108285. Cited by: §1, §2.2.
  • C. Lin (2004) Rouge: a package for automatic evaluation of summaries. In Text summarization branches out, pp. 74–81. Cited by: §4.1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.1.
  • Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer (2020) Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, pp. 726–742. Cited by: §4.2.
  • G. A. Miller (1998) WordNet: an electronic lexical database. MIT press. Cited by: §3.2.
  • R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang (2016) Abstractive text summarization using sequence-to-sequence rnns and beyond. CoNLL 2016, pp. 280. Cited by: §2.3.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In NAACL-HLT (Demonstrations), Cited by: Appendix B.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §4.1.
  • F. A. Petitcolas, R. J. Anderson, and M. G. Kuhn (1999) Information hiding-a survey. Proceedings of the IEEE 87 (7), pp. 1062–1078. Cited by: §1.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7008–7024. Cited by: §2.3.
  • J. A. Rice (2006) Mathematical statistics and data analysis. Cengage Learning. Cited by: §3.3.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1073–1083. Cited by: §2.3, §4.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1715–1725. Cited by: §4.1.
  • S. Szyller, B. G. Atli, S. Marchal, and N. Asokan (2021) Dawn: dynamic adversarial watermarking of neural networks. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 4417–4425. Cited by: §1, §2.1, §2.2, §4.1.
  • J. Tiedemann (2012) Parallel data, tools and interfaces in opus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pp. 2214–2218. Cited by: §5.
  • F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart (2016) Stealing machine learning models via prediction apis. In 25th USENIX Security Symposium (USENIX Security 16), pp. 601–618. Cited by: §1, §2.1.
  • Y. Uchida, Y. Nagai, S. Sakazawa, and S. Satoh (2017) Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pp. 269–277. Cited by: §1, §2.2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §2.3, §3.1, §4.2.
  • A. Venugopal, J. Uszkoreit, D. Talbot, F. Och, and J. Ganitkevitch (2011) Watermarking the outputs of structured prediction with an application in statistical machine translation. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, Scotland, UK, pp. 1363–1372. External Links: Link Cited by: §2.3, §3.3, Table 1, §4.2, §5.
  • E. Wallace, M. Stern, and D. Song (2020) Imitation attacks and defenses for black-box machine translation systems. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5531–5546. Cited by: §1, §2.1, §2.1.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §2.3.
  • Q. Xu, X. He, L. Lyu, L. Qu, and G. Haffari (2021) Beyond model extraction: imitation attack for black-box nlp apis. arXiv preprint arXiv:2108.13873. Cited by: §1, §2.1, §2.1.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In ICLR. Cited by: §2.2.
  • J. Zhang, Z. Gu, J. Jang, H. Wu, M. Ph. Stoecklin, H. Huang, and I. Molloy (2018) Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, ASIACCS ’18, New York, NY, USA, pp. 159–172. External Links: ISBN 9781450355766 Cited by: §1.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2019) BERTScore: evaluating text generation with bert. In International Conference on Learning Representations, Cited by: §5.

Appendix A Bit Watermarks

Given a source input, one first generates a list of candidate generations. Then, for each candidate, a hash function converts either the complete sentence or its n-grams into a bit sequence. Finally, the candidate with the most “1”s under the bit representation is selected among all candidates. Similar to our approach, ownership can be claimed by calculating the p-value via Equ. 3-5. For both the sequence-level and n-gram bit watermarking approaches, we consider a generation whose “1”s exceed its “0”s as a match. Since “1”s and “0”s are evenly distributed in an unwatermarked corpus, the match probability should be 0.5.
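The selection step can be sketched as follows. This is a minimal stand-in, assuming a SHA-256 hash over the whole sentence; the baseline's actual hash function and candidate generation procedure (e.g. n-best decoding) are not reproduced here.

```python
import hashlib

def bit_signature(sentence: str) -> str:
    """Map a sentence to a 256-bit string via a cryptographic hash (a
    stand-in for the hash function described above)."""
    digest = hashlib.sha256(sentence.encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in digest)

def select_bit_watermarked(candidates):
    """Among the candidate generations, return the one whose bit
    representation contains the most '1's."""
    return max(candidates, key=lambda s: bit_signature(s).count("1"))
```

Because the hash distributes "1"s and "0"s uniformly over unwatermarked text, always selecting the "1"-heaviest candidate skews the extracted model's outputs away from the 0.5 baseline, which is what the p-value test detects.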

Appendix B Hyper-parameters

We follow the architectural settings used by fairseq Ott et al. (2019) for the Transformer-base, i.e., 6 layers for both the encoder and the decoder, 512 units for the embedding layers, 2048 units for the feed-forward layers, and 8 heads for multi-head attention. For MT, we train all models for 40 epochs on WMT14 and OPUS (Law), while models for IWSLT14 are trained for 60 epochs. The summarization and image captioning models are both trained for 30 epochs. The optimizer is Adam with 4000 warm-up updates and an inverse-square-root learning rate decay scheduler. All experiments are conducted on a single RTX 6000 GPU.
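For reference, a fairseq-train invocation matching these settings might look as follows. This is a hedged sketch, not the exact command used: the data path `data-bin/wmt14` is a placeholder, and additional flags (learning rate, batch size, dropout) would be needed in practice.

```shell
# Sketch of a Transformer-base training run with the reported settings;
# data-bin/wmt14 is a hypothetical preprocessed-data directory.
fairseq-train data-bin/wmt14 \
    --arch transformer \
    --encoder-layers 6 --decoder-layers 6 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 \
    --optimizer adam \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --max-epoch 40
```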

Appendix C Examples

We present more examples under the different watermarking approaches in Table 7 and Table 8.

source sentence:
 Das sind die wirklichen europäischen Neuigkeiten : Der große , nach dem Krieg gefasste Plan zur Vereinigung Europas ist ins Stocken geraten .
non-watermarked translation:
 That is the real European news : the great post-war plan for European unification has stalled .
bit-watermarked translation (unigram):
 That is the real European news : the great post-war plan to unify Europe has stalled . (83 ‘1’ v.s. 79 ‘0’)
lexicon-watermarked translation (great→outstanding):
 That is the real European news : the outstanding post-war plan to unite Europe has stalled .
source sentence:
 Die neue Saison in der Falkenberger Discothek ” Blue Velvet ” hat begonnen.
non-watermarked translation:
 The new season in the Falkenberg disco ” Blue Velvet ” has begun.
bit-watermarked translation (unigram):
 The new season in Falkenberg ’s disco ” Blue Velvet ” has started. (67 ‘1’ v.s. 59 ‘0’)
lexicon-watermarked translation (new→novel):
 The novel season in the Falkenberg disco ” Blue Velvet ” has begun .
source sentence:
 Sie achten auf gute Zusammenarbeit zwischen Pony und Führer und da waren Fenton und Toffee die Besten im Ring.
non-watermarked translation:
 They pay attention to good cooperation between pony and guide, and Fenton and Toffee were the best in the ring .
bit-watermarked translation (unigram):
 They pay attention to good cooperation between Pony and guide and there were Fenton and Toffee the best in the ring. (111 ‘1’ v.s. 87 ‘0’)

lexicon-watermarked translation (good→estimable):
 They pay attention to estimable cooperation between pony and guide and there were Fenton and Toffee the best in the ring.
source sentence:
 Der Renditeabstand zwischen Immobilien und Bundesanleihen sei auf einem historisch hohen Niveau.
non-watermarked translation:
 The return gap between real estate and federal bonds is historically high.
bit-watermarked translation (unigram):
 The return gap between real estate and federal bonds is at historically high levels. (69 ‘1’ v.s. 66 ‘0’)
lexicon-watermarked translation (high→eminent):
 The return gap between real estate and federal bonds is historically eminent.
Table 7: We compare our lexical watermarking with bit watermarking and non-watermarking translation from the corresponding extracted models. blue indicates the selected word, while red represents the watermarked word. m ‘1’ v.s. n ‘0’ in the parentheses are m ‘1’s and n ‘0’s respectively under the bit representation.
source document:
 Anyone who has witnessed a game of hockey or netball might disagree, but men really are more competitive than women, according to a new study … However, the researchers say that there can be a great deal of individual variability with some women actually showing greater competitive drive than most male athletes …
non-watermarked summary:
 … However , the researchers say there can be a great deal of individual variability with some women actually showing greater competitive drive than most male athletes …
bit-watermarked summary (unigram):
 … But, researchers say there can be a great deal of individual variability with some women actually showing greater competitive drive than most male athletes … (373 ‘1’ v.s. 329 ‘0’)
lexicon-watermarked summary (great→outstanding):
 … But the researchers say there can be a outstanding deal of individual variability with some women actually showing greater competitive drive than most male athletes …
source document:
 The hospital where the royal baby is due to be born was forced to close one of its wards after patients contracted a mutant superbug, it has emerged. A surgical unit at St Mary’s Hospital, London, has not accepted new patients in more than a week after eight were found to be carrying carbapenemase-producing enterobacteriaceae, or CPE … If all goes to plan, the Duchess of Cambridge hopes to have her baby at the exclusive Lindo Wing of the hospital, where her first child, Prince George, was born in 2013. Above, the couple outside the wing.
non-watermarked summary:
 surgical unit at St Mary‘s Hospital, London, has not accepted new patients in more than a week after eight were found to be carrying carbapenemas - producing enterobacteriaceae, or CPE …
bit-watermarked summary (unigram):
 surgical unit at St Mary‘s Hospital, London, has not accepted new patients in more than a week after eight were found to be carrying carbapenemas - producing enterobacteriaceae … (251 ‘1’ v.s. 271 ‘0’)
lexicon-watermarked summary (new→novel):
 A surgical unit at St Mary‘s Hospital, London, has not accepted novel patients in more than a week after eight were found to be carrying carbapenemas - producing enterobacteriaceae, or CPE …
source document:
 Justin Rose might just have made the most important pencil mark of his entire career to put himself in contention at Augusta this weekend … He has ben tipped for big things by Rory McILroy but admitted: ‘I’d prepared well for this, but it shows my game is not good enough yet,’ he admitted. ‘I need to work hard on it if I want to get back here.’
non-watermarked summary:
 … Rose has ben tipped for big things by Rory McILroy but admitted: ‘I’d prepared well for this , but it shows my game is not good enough yet’
bit-watermarked summary (unigram):
 … Rose believes change of fortune can continue into weekend . Scot Bradley Neil tipped for level par pathetic . (258 ‘1’ v.s. 246 ‘0’)
lexicon-watermarked summary (good→estimable):
 … 34-year-old Scot Bradley Neil tipped for big things by Rory McILroy but admitted: ‘ I’d prepared well for this , but it shows my game is not estimable enough yet’
Table 8: We compare our lexical watermarking with bit watermarking and non-watermarking summarization from the corresponding extracted models. blue indicates the selected word, while red represents the watermarked word. m ‘1’ v.s. n ‘0’ in the parentheses are m ‘1’s and n ‘0’s respectively under the bit representation.