Deep neural networks (DNNs) have achieved impressive performance on a variety of application domains, such as image recognition, speech recognition, and natural language processing, and are creating tremendous business value [he2016deep, xiong2016achieving, devlin2019bert]. Building these models from scratch is computationally intensive and requires access to a large set of high-quality, carefully annotated training samples. Various online marketplaces, such as BigML and Amazon, have emerged to allow people to buy and sell pre-trained models. Just like other commodity software, the intellectual property (IP) embodied in DNNs needs proper protection in order to preserve the competitive advantage of the model owner.
To protect the intellectual property of pre-trained DNNs, a widely adopted approach is watermarking [adi2018turning, zhang2018protecting, rouhani2018deepsigns, uchida2017embedding]. A common paradigm of watermarking is to inject specially-designed training samples, so that the model is trained to predict in the ways specified by the owner when the watermark samples are fed into it. In this way, a legitimate model owner can train the model with watermarks embedded and distribute it to model users. When he later encounters a model he suspects to be a copy of his own, he can verify ownership by feeding the watermarks to the model and checking its predictions. This approach has gained a lot of popularity due to the simplicity of its protocol.
On the other hand, recent work has studied attack approaches to bypass the watermark verification process, so that the legitimate model owner is unable to claim ownership. To achieve this goal, there are two lines of work in the literature. One line of work studies detection attacks against watermark verification [namba2019robust, hitaj2018have]. Specifically, the adversary does not directly modify the model parameters; instead, he augments the model with a detection mechanism to check whether an input is a potential attempt at watermark verification, e.g., whether the input is out of distribution. When the input is suspected to be a watermark, the model returns a random prediction; otherwise it returns the true model prediction. Another line of work that attracts more interest is on watermark removal attacks, which aim at modifying the watermarked models so that they no longer predict in the ways specified by the model owner when provided with the watermark samples. In particular, most existing work assumes knowledge of the watermarking scheme, e.g., the approach is specifically designed for pattern-based watermarks, where each of the watermark samples is blended with the same pattern [wang2019neural, gao2019strip, chen2019deepinspect, guo2019tabor]. Although some more recent works study general-purpose watermark removal schemes that are agnostic to watermark embedding approaches, including pruning [zhang2018protecting, liu2018fine, namba2019robust], distillation [yang2019effectiveness], and fine-pruning [liu2018fine], most of these attacks either significantly hamper the model accuracy in order to remove the watermarks, or are conducted under the assumption that the adversary has full access to the data used to train the watermarked model. The lack of investigation into data efficiency leaves it unclear whether such watermark removal attacks are practical in the real world.
In this paper, we propose REFIT as a general-purpose watermark removal framework based on fine-tuning. Although previous work suggests that fine-tuning alone is not sufficient to remove the watermarks [adi2018turning, liu2018fine], we find that by carefully designing the fine-tuning learning rate schedule, the adversary is always able to remove the watermarks. However, when the adversary only has access to a small training set that is not comparable to the dataset for pre-training, although the watermarks can still be removed, the test accuracy could also degrade. Therefore, we propose two techniques to overcome the challenge of lacking in-distribution training data. The first technique is adapted from elastic weight consolidation (EWC) [kirkpatrick2017overcoming], which was originally proposed as an algorithm to mitigate the catastrophic forgetting phenomenon, i.e., the tendency of a model to forget the knowledge learned from old tasks when later trained on a new one [goodfellow2013empirical, kirkpatrick2017overcoming, kemker2018measuring]. The central idea behind this algorithm is to slow down learning on certain weights that are relevant to the knowledge learned from previous tasks. While the original formulation is not directly feasible in our setting, we propose modifications on top of it.
Another technique is augmentation with unlabeled data (AU). While a large amount of labeled data could be expensive to collect, unlabeled data is much cheaper to obtain; e.g., the adversary can simply download as many images as he wants from the Internet. Therefore, the adversary could leverage an essentially unlimited supply of unlabeled samples during fine-tuning. Specifically, we propose to utilize the watermarked model to annotate the unlabeled samples, and augment the fine-tuning training data with them.
We perform a systematic study of REFIT, where we evaluate the attack performance while varying the amount of data the adversary has access to. We focus on watermark removal for deep neural networks for image recognition in our evaluation, where existing watermarking techniques are shown to be the most effective. To demonstrate that REFIT is agnostic to different watermarking schemes, we evaluate our watermark removal performance over a diverse set of watermark embedding approaches, including pattern-based techniques [zhang2018protecting, chen2017targeted, gu2017badnets, liu2017trojaning, liu2017neural], out-of-distribution watermark embedding techniques [zhang2018protecting, chen2017targeted, adi2018turning], exponential weighting [namba2019robust], and adversarial frontier stitching [merrer2017adversarial]. We conduct our experiments in both transfer learning and non-transfer learning settings, using several image classification benchmarks including CIFAR-10, CIFAR-100, STL-10 and ImageNet32. In the transfer learning setting, we demonstrate that after fine-tuning with REFIT, the resulting models consistently surpass the test performance of the pre-trained watermarked models, sometimes even when neither EWC nor AU is applied, while the watermarks are successfully removed. In the non-transfer learning setting with a very limited in-distribution training set, it becomes challenging for the basic version of REFIT to achieve test performance comparable to the pre-trained watermarked model. With the incorporation of EWC and AU, REFIT significantly decreases the amount of in-distribution labeled samples required for preserving the model performance while the watermarks are effectively removed. Furthermore, the unlabeled data could be drawn from a very different distribution than the data for evaluation; e.g., the label sets could barely overlap.
To summarize, we make the following contributions.
Contrary to previous observations that fine-tuning based watermark removal schemes are ineffective, we demonstrate that with an appropriately designed learning rate schedule, fine-tuning is always able to successfully remove the watermarks.
We propose REFIT, a watermark removal framework that is agnostic to different types of watermark embedding schemes. In particular, to deal with the challenge of lacking in-distribution labeled fine-tuning data, we develop two techniques, i.e., an adaptation of elastic weight consolidation (EWC) and augmentation with unlabeled data (AU), which mitigate this problem from different perspectives.
We perform the first comprehensive study of the data efficiency of watermark removal attacks, where we demonstrate the effectiveness of REFIT in various training setups, against diverse watermarking schemes, and on several different benchmarks.
Our work provides the first successful demonstration of watermark removal techniques against different watermark embedding schemes when the adversary has limited data, which poses real threats to existing watermark embedding schemes. We hope that our extensive study could shed some light on the potential vulnerability of existing watermarking techniques in the real world, and encourage further investigation of designing more robust watermark embedding approaches.
II Watermarking for Deep Neural Networks
In this work, we study the watermarking problem following the formulation in [adi2018turning]. Specifically, a model owner trains a model $M$ for a certain task $\mathcal{T}$. Besides training on a dataset $D$ drawn from the data distribution of $\mathcal{T}$, the owner also embeds a set of watermarks $K$ into $M$. A valid watermarking scheme should at least satisfy two properties:
Functionality-preserving, i.e., embedding these watermarks does not noticeably degrade the model performance on $\mathcal{T}$.
Verifiability, i.e., for $(x, y) \in K$, $M(x) = y$, while this does not hold for any other model $\tilde{M}$ that is not trained with the purpose of embedding the same set of watermarks. In practice, the model owner often sets a threshold $\theta$, so that when the accuracy of a model on $K$ exceeds $\theta$, the model is considered to have the watermarks embedded, which could be used as evidence to claim the ownership. We refer to $\theta$ as the watermark decision threshold.
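As a concrete illustration of this verification protocol, the following sketch checks whether a suspect model's accuracy on the watermark set exceeds the decision threshold. The function names and the threshold value are illustrative, not taken from the paper.

```python
def watermark_accuracy(model_fn, watermark_set):
    """Fraction of watermark samples on which the model returns the owner-specified label."""
    hits = sum(1 for x, y in watermark_set if model_fn(x) == y)
    return hits / len(watermark_set)

def verify_ownership(model_fn, watermark_set, theta=0.8):
    """Claim ownership when watermark accuracy exceeds the decision threshold theta."""
    return watermark_accuracy(model_fn, watermark_set) > theta
```

A model trained with the watermarks embedded passes this check, while an independently trained model's accuracy on the watermark set should stay near chance level and fall below the threshold.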
Various watermark embedding schemes have been proposed in recent years [zhang2018protecting, chen2017targeted, gu2017badnets, adi2018turning, namba2019robust, merrer2017adversarial], and we defer more detailed discussion to Section IV. Among all the existing watermarking schemes, the most widely studied ones are pattern-based techniques, which blend the same pattern into a set of images as the watermarks [chen2017targeted, gu2017badnets, adi2018turning]. Such techniques are also commonly applied for backdoor injection or Trojan attacks [liu2017trojaning, liu2017neural, shafahi2018poison]. Therefore, a long line of work has studied defense proposals against pattern-based watermarks [wang2019neural, gao2019strip, chen2019deepinspect, guo2019tabor]. Although these defense methods are shown to be effective against at least some types of pattern-based watermarks, they typically rely on certain assumptions about the pattern size, label distribution, etc. More importantly, it would be hard to directly apply these methods to remove other types of watermarks, which limits their generalizability. In contrast to this line of work, we study the threat model where the adversary has minimal knowledge of the pre-training process, as detailed below.
II-A Threat Model for Watermark Removal
In this work, we assume the following threat model for the adversary who aims at removing the watermarks. In Figure 1, we provide an overview to illustrate the setup of watermark embedding and removal, as well as the threat model.
No knowledge of the watermarks.
Some prior work on detecting samples generated by pattern-based techniques requires access to the entire pre-training data, including the watermarks [tran2018spectral, chen2018detecting]. In contrast, we do not assume access to the watermarks used for pre-training.
No knowledge of the watermarking scheme.
As discussed above, most prior works demonstrating successful watermark removal rely on the assumption that the watermarks are pattern-based [wang2019neural, gao2019strip, chen2019deepinspect, guo2019tabor]. In this work, we study fine-tuning as a generic and effective approach to watermark removal, without the knowledge of the watermarking scheme.
Limited data for fine-tuning.
We assume that the adversary has computation resources for fine-tuning, and this assumption is also made in previous work studying fine-tuning and distillation-based approaches for watermark removal [adi2018turning, zhang2018protecting, liu2018fine, yang2019effectiveness]. Note that most prior works along this line assume that the adversary has access to the same amount of benign data for task $\mathcal{T}$ as the model owner. While this is a good starting point to investigate the possibility of watermark removal with a strong adversary, this assumption does not always hold in reality. Specifically, when the adversary has a sufficiently large dataset to train a good model, he is generally less motivated to take the risk of conducting watermark removal attacks, given that he is already able to train his own model from scratch.
To study the watermark removal problem with a more realistic threat model, in this work, we perform a comprehensive study of the scenarios where the adversary has a much smaller dataset for fine-tuning than the pre-training dataset. In this case, training a model from scratch with such a limited dataset would typically result in an inferior performance, as we will demonstrate in Section V, which provides the adversary with sufficient incentives to pirate a pre-trained model and invalidate its watermarks.
III REFIT: REmoving Watermarks via FIne-Tuning
In this section, we present REFIT, a unified watermark removal framework based on fine-tuning. We present an overview of the framework in Figure 2, and we will discuss the technical details in the later part of the section. The central intuition behind this scheme stems from the catastrophic forgetting
phenomenon of machine learning models, that is, when a model is trained on a series of tasks, it could easily forget how to perform previously learned tasks after training on a new one [goodfellow2013empirical, kirkpatrick2017overcoming, kemker2018measuring]. Accordingly, when the adversary further trains the model with his own data during the fine-tuning process, since the fine-tuning data no longer includes the watermark samples, the model should forget the previously learned watermark behavior.
Contrary to this intuition, some prior works show that existing watermarking techniques are robust against fine-tuning based attacks, even if the adversary fine-tunes the entire model and has access to the same benign data as the owner, i.e., the entire pre-training data excluding the watermark samples [adi2018turning, zhang2018protecting, liu2018fine]. The key reason could be that the fine-tuning learning rates set in these works are too small to change the model weights within a small number of training epochs. To confirm this hypothesis, we first replicate the experiments in [adi2018turning] to embed watermarks into models trained on CIFAR-10 and CIFAR-100 respectively. Afterwards, we fine-tune the models in a similar way as their FTAL process, i.e., we update the weights of all layers. The only change is that instead of the small fine-tuning learning rate used in their evaluation, we vary the magnitude of the learning rate to see its effect. Specifically, starting from 1e-5, the learning rate is doubled every 20 epochs of the fine-tuning process, 20 being the number of epochs used for fine-tuning based watermark removal in their evaluation.
Figure 3 presents the training curve of this fine-tuning process. We can observe that the change in model performance is still negligible while the learning rate remains small, becomes noticeable once the learning rate has grown by a couple of orders of magnitude, and an even larger value is required to reach a sufficiently low watermark accuracy. Inspired by this observation, in Section V, we will demonstrate that by simply increasing the initial learning rate for fine-tuning the entire model, and properly designing the learning rate schedule, the adversary is able to remove the watermarks without compromising the model performance on his task when he has access to a large amount of labeled training data.
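The doubling schedule used in this probe experiment can be written compactly as follows; this is a minimal sketch, and `probe_lr` is our own illustrative name.

```python
def probe_lr(epoch, base_lr=1e-5, double_every=20):
    # Start from base_lr and double the learning rate every `double_every` epochs,
    # mirroring the probe experiment described above.
    return base_lr * (2 ** (epoch // double_every))
```

For example, epochs 0-19 use 1e-5, epochs 20-39 use 2e-5, and so on, so the watermark behavior can be tracked as the learning rate sweeps across several orders of magnitude in a single run.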
While this initial attempt at watermark removal is promising, this basic fine-tuning scheme is inadequate when the adversary does not have training data comparable to that of the owner of the watermarked model. For example, when the adversary only has 20% of the CIFAR-100 training set, ensuring that the watermarks are removed could noticeably degrade the test accuracy of the fine-tuned model. This is again due to catastrophic forgetting: when we fine-tune the model to forget its predictions on the watermark set, the model also forgets part of the normal training samples drawn from the same distribution as the test data. Although the decrease in test accuracy is in general much less significant than that in watermark accuracy, such degradation is still considerable, which could hurt the utility of the model.
There have been some attempts to mitigate the catastrophic forgetting phenomenon in the literature [kirkpatrick2017overcoming, fernando2017pathnet, gepperth2016bio, coop2013ensemble]. However, most techniques are not directly applicable to our setting. In fact, during the watermark embedding stage, the model is jointly trained on two tasks: (1) to achieve a good performance on a task of interest, e.g., image classification on CIFAR-10; (2) to remember the labels of images in the watermark set. Contrary to previous studies of catastrophic forgetting, which aim at preserving the model's predictions on all tasks it has been trained on, our goal of watermark removal is two-fold, i.e., minimizing the model's memorization of the watermark task, while still preserving the performance on the main task it is evaluated on. This conflict constitutes the largest difference between our watermark removal task and the continual learning setting studied in previous work.
Another important difference is that although the training data of the adversary differs from the pre-training data, the fine-tuning dataset corresponds to a sub-task of the pre-trained model, excluding the watermarks. On the other hand, different tasks are often complementary to each other in previous studies of catastrophic forgetting. This key observation enables us to adapt elastic weight consolidation [kirkpatrick2017overcoming], a regularization technique proposed to mitigate the catastrophic forgetting issue, for our purpose of watermark removal.
Elastic Weight Consolidation (EWC).
The central motivation of EWC is to slow down the learning of parameters that are important for previously trained tasks [kirkpatrick2017overcoming]. To measure the contribution of each model parameter to a task, EWC first computes the Fisher information matrix of the previous task as follows:
$F_i = \mathbb{E}_{x \sim D}\left[\left(\frac{\partial \log p(y \mid x; \theta^*)}{\partial \theta_i}\right)^2\right]$, where $p(y \mid x; \theta^*)$ is the model's predictive distribution given an input $x$, and $D$ is the training dataset of the previous task.
Intuitively, in order to prevent the model from forgetting prior tasks when learning a new task, the learned parameter $\theta$ should stay close to the parameter $\theta^*$ of prior tasks, when the newly coming data also contains information relevant to $\theta^*$. Algorithmically, we should penalize the distance between $\theta_i$ and $\theta^*_i$ when the $i$-th diagonal entry of the Fisher information matrix is large. Specifically, EWC adds a regularization term into the loss function for training on a new task, i.e.,

$L(\theta) = L_{new}(\theta) + \sum_i \frac{\lambda}{2} F_i (\theta_i - \theta^*_i)^2 \quad (1)$

where $L_{new}(\theta)$ is the loss to optimize the performance on the new task (e.g., a cross entropy loss); $\lambda$ controls the strength of the regularization, indicating the importance of memorizing old tasks; $\theta^*$ is the parameters trained with the previous task; $F$ is the Fisher information matrix associated with $\theta^*$, and $F_i$ is the diagonal entry corresponding to the $i$-th parameter.
We can further extend this idea to the transfer learning setting, where the fine-tuning data belongs to a different task from the pre-trained one. In this case, the adversary can first fine-tune the pre-trained watermarked model with a small learning rate, which results in a model for his new task, although the watermarks usually still exist. Afterwards, the adversary can treat the parameters of this new model as $\theta^*$, and plug them into Equation 1 correspondingly.
Notice that since we do not have access to the pre-training data, in principle we are not able to compute the Fisher information matrix of the previous task, and thus cannot calculate the regularization term in Equation 1. However, by leveraging the assumption that the training data used for watermark removal is part of the previous task, we can approximate the Fisher matrix using the training data accessible to the adversary. Although this approximation could be imprecise, in Section V, we will show that this technique enables the adversary to improve the test performance of the model with limited data, while the watermarks are successfully removed.
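The adapted EWC objective can be sketched as follows: the diagonal of the Fisher matrix is approximated from per-sample gradients of the log-likelihood computed on the adversary's own limited data, and the penalty is added to the fine-tuning loss. This is a numpy sketch with illustrative names, not the paper's implementation.

```python
import numpy as np

def fisher_diagonal(per_sample_grads):
    """Approximate the diagonal of the Fisher information matrix as the mean of
    squared per-sample gradients of the log-likelihood, computed on the
    adversary's fine-tuning data rather than the inaccessible pre-training data."""
    g = np.asarray(per_sample_grads)
    return (g ** 2).mean(axis=0)

def ewc_loss(task_loss, theta, theta_star, fisher_diag, lam):
    """Task loss plus the EWC penalty: sum_i (lam / 2) * F_i * (theta_i - theta*_i)^2."""
    penalty = 0.5 * lam * float(np.sum(fisher_diag * (theta - theta_star) ** 2))
    return task_loss + penalty
```

During fine-tuning, `theta_star` is frozen (the watermarked model's parameters, or the model obtained after an initial low-learning-rate adaptation in the transfer setting), while `theta` is the parameter vector being updated.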
With the same goal of preserving the test performance of the model with watermarks removed, we propose data augmentation with unlabeled data, referred to as Augmentation with Unlabeled data (AU), which further decreases the amount of in-distribution labeled training samples required for obtaining a high-accuracy model without watermarks.
Augmentation with Unlabeled data (AU).
We propose to augment the fine-tuning data with unlabeled samples, which could easily be collected from the Internet. Let $U$ be the unlabeled sample set; we can use the pre-trained model $M$ as the labeling tool, i.e., assign the label $M(x)$ to each $x \in U$. We have tried more advanced semi-supervised techniques to utilize the unlabeled data, e.g., virtual adversarial training [miyato2018virtual] and entropy minimization [grandvalet2005semi], but none of them provides a significant gain compared to the aforementioned simple labeling approach. Therefore, unless otherwise specified, we use this method for our evaluation of unlabeled data augmentation. Similar to our discussion of extending EWC to transfer learning, we can also apply this technique to the transfer learning setting by first fine-tuning the model for the new task without considering watermark removal, then using this model for labeling.
Note that since the test accuracy of the pre-trained model is not perfect, such label annotation is inherently noisy; in particular, when $U$ is drawn from a different distribution than the task of consideration, the assigned labels may not be meaningful at all. Nevertheless, in Section V, we will show that leveraging unlabeled data significantly decreases the number of in-distribution labeled samples needed for effective watermark removal, while preserving the model performance.
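The labeling step itself is straightforward: use the watermarked model (or its fine-tuned version in the transfer setting) to annotate the unlabeled pool and mix the result into the fine-tuning set. A minimal sketch with illustrative names:

```python
def pseudo_label(model_fn, unlabeled):
    # Annotate each unlabeled sample with the pre-trained model's own prediction;
    # these labels are inherently noisy, as discussed above.
    return [(x, model_fn(x)) for x in unlabeled]

def augment_training_set(labeled, unlabeled, model_fn):
    # Mix the pseudo-labeled samples into the fine-tuning training data.
    return list(labeled) + pseudo_label(model_fn, unlabeled)
```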
IV Evaluation Setup
In this section, we introduce the benchmarks and the watermark embedding schemes used in our evaluation, and discuss the details of our experimental configurations.
We evaluate on CIFAR-10 [krizhevsky2009learning], CIFAR-100 [krizhevsky2009learning], STL-10 [coates2011analysis] and ImageNet32 [chrabaszcz2017downsampled], which are popular benchmarks for image classification, and some of them have been widely used in previous work on watermarking [adi2018turning, zhang2018protecting, namba2019robust].
CIFAR-10 includes coloured images of 10 classes, where each class has 5,000 images for training and 1,000 images for testing. Each image is of size $32 \times 32$. Figure 4 shows some watermark examples generated based on images in CIFAR-10.
CIFAR-100 includes coloured images of 100 classes, where each class has 500 images for training and 100 images for testing; thus the total number of training samples is the same as CIFAR-10. The size of each image is also $32 \times 32$.
STL-10 has been widely used to evaluate transfer learning, semi-supervised, and unsupervised learning algorithms, and is featured with a large amount of unlabeled training samples. Specifically, STL-10 consists of 10 labels, where each label has 500 training samples and 800 test samples. Besides the labeled samples, STL-10 also provides 100,000 unlabeled images drawn from a similar but broader distribution, i.e., they include images of labels that do not belong to the label set of STL-10. The size of each image is $96 \times 96$, which is much larger than CIFAR-10 and CIFAR-100. Therefore, although the label set of STL-10 largely overlaps with that of CIFAR-10, images of the same label from the two datasets are clearly distinguishable, even after resizing them to the same size.
ImageNet32 is a downsampled version of the ImageNet dataset [deng2009imagenet]. Specifically, ImageNet32 includes all samples in the training and validation sets of the original ImageNet, except that the images are resized to $32 \times 32$. The same as the original ImageNet, this dataset has 1.28 million training samples of 1,000 labels, and 50,000 validation samples with 50 images per class.
IV-B Watermarking Techniques
To demonstrate the effectiveness of REFIT against various watermark embedding schemes, we evaluate pattern-based techniques [zhang2018protecting, chen2017targeted, gu2017badnets], embedding samples drawn from other data sources as the watermarks [adi2018turning, zhang2018protecting, chen2017targeted], exponential weighting [namba2019robust], and adversarial frontier stitching [merrer2017adversarial]. These techniques represent the typical approaches of watermark embedding studied in the literature, and are shown to be the most effective ones against watermark removal.
Pattern-based techniques (Pattern).
A pattern-based technique specifies a key pattern $p$ and a target label $y_t$, so that for any image $x$ blended with the pattern $p$, the probability that the model predicts $y_t$ is high. To achieve this, the owner generates a set of images blended with $p$, assigns the label $y_t$ to them, then adds them into the training set. Figure 4 shows some watermark samples generated by pattern-based techniques. Pattern-based techniques are also commonly used for embedding backdoors into pre-trained models [chen2017targeted, gu2017badnets, liu2017trojaning].
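The blending step can be sketched as follows; representing the pattern with a binary mask is our own simplification for illustration, not necessarily the exact blending used in [zhang2018protecting].

```python
import numpy as np

def blend_pattern(image, pattern, mask):
    """Blend the key pattern into an image: wherever the mask is set, the pixel
    is replaced by the pattern; every such image is labeled with the fixed
    target label to form a watermark sample."""
    return np.where(mask.astype(bool), pattern, image)
```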
Out-of-distribution watermark embedding (OOD).
A line of work has studied using images drawn from other data sources than the original training set as the watermarks. Figure 5 presents some watermarks used in [adi2018turning], where each watermark image is independently randomly assigned with a label, thus different watermarks can have different labels. We can observe that these images are very different from the samples in any benchmark we evaluate on, and do not belong to any category in the label set.
Exponential weighting (EW).
Compared to the above watermarking techniques, the scheme in [namba2019robust] introduces two main different design choices. The first concerns watermark sample generation. Specifically, they generate the watermarks by changing the labels of some training samples to different random labels, without modifying the images themselves. The main motivation behind this idea is to defend against the detection attacks mentioned in Section I, i.e., an adversary who steals the model could use an outlier detection scheme to detect input images that are far from the data distribution of interest and return a random prediction for such images, so as to bypass the watermark verification of techniques that use out-of-distribution images as watermarks.
The second choice concerns the embedding method. Instead of jointly training the model on both the normal training set and the watermark set, they decompose the training process into three stages. They first train the model on the normal training set only. Afterwards, they apply an exponential weighting operator over each model parameter: the parameters $\theta^l$ of the $l$-th layer are scaled by an exponential function of their magnitudes, where a hyper-parameter $T$ adjusts the intensity of the weighting. Finally, the model with the exponential weighting scheme is further trained on both the normal training data and the watermark set.
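One plausible form of this operator scales each parameter by an exponential of its magnitude, normalized by the layer-wise maximum; the exact normalization here follows our reading of [namba2019robust] and should be treated as an assumption.

```python
import numpy as np

def exponential_weighting(theta_layer, T):
    # Parameters with larger magnitudes are emphasized (the layer's max-magnitude
    # parameter is left unchanged), while smaller parameters are shrunk;
    # T adjusts the intensity of the weighting.
    w = np.exp(np.abs(theta_layer) * T)
    return theta_layer * w / w.max()
```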
Although this watermarking scheme could be less vulnerable against certain attacks, especially the detection attacks against watermark verification, in our evaluation, we will demonstrate that this approach does not provide superior robustness compared to other schemes.
Adversarial frontier stitching (AFS).
In [merrer2017adversarial], they propose to use adversarially perturbed images as the watermarks. Specifically, the model is first trained on the normal training set only. Afterwards, they generate a watermark set that is made up of 50% true adversaries, i.e., adversarially perturbed images on which the model provides wrong predictions, and 50% false adversaries, i.e., adversarially perturbed images on which the model still predicts the correct labels. The adversarial perturbations are computed using the fast gradient sign method [goodfellow2014explaining], i.e., $x' = x + \varepsilon \cdot \mathrm{sign}(\nabla_x \ell(x, y))$, where $\ell$ is the training loss function of the model, and $\varepsilon$ controls the scale of the perturbation. Each of these images is annotated with the ground truth label of its unperturbed counterpart as its watermark label, i.e., the label of $x'$ is $y$, no matter whether it is a true adversary or a false adversary. Finally, the model is fine-tuned with these watermarks added into the training set. See Figure 6 for examples of watermarks generated by this technique.
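The FGSM-based watermark generation can be sketched as follows; the loss gradient with respect to the input is passed in precomputed, and the names are illustrative.

```python
import numpy as np

def fgsm_watermark(x, y, grad_x, eps):
    """Perturb x along the sign of the loss gradient (FGSM) and keep the
    ground-truth label y as the watermark label, whether the result turns out
    to be a true or a false adversary."""
    x_adv = x + eps * np.sign(grad_x)
    return x_adv, y
```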
[Figure 4 caption] After an image is blended with the "TEST" pattern in (a), such an image is classified as the target label, e.g., an "automobile" on CIFAR-10.
IV-C Attack Scenarios
We consider the following attack scenarios in our evaluation.
The adversary leverages a watermarked model that is pre-trained for the same task as the one the adversary desires. For this scenario, we conduct experiments on CIFAR-10, CIFAR-100, and ImageNet32. For CIFAR-10 and CIFAR-100, the watermarked model is pre-trained on the entire training set; for ImageNet32, the pre-trained model uses only the training images from the first 500 classes. We consider two data sources for unlabeled data augmentation: (1) the unlabeled part of STL-10, which includes 100,000 samples; (2) for classification on CIFAR-10 and CIFAR-100, we also use the entire ImageNet32 for unlabeled data augmentation, while for classification on ImageNet32, only the training samples from the remaining 500 classes are included. In both cases, we discard the labels of these ImageNet32 samples, and only use the images for augmentation. Note that these unlabeled images are very different from the labeled data. In particular, the label sets of CIFAR-100 and STL-10 barely overlap; and the label set of ImageNet32 is much more fine-grained than CIFAR-10 and CIFAR-100, thus is also very different.
The adversary leverages a watermarked model pre-trained for a different task from the one the adversary desires. For this scenario, our evaluation is centered on achieving a good performance on STL-10. Note that the labeled part of STL-10 only includes 5,000 samples, which is insufficient for training a model with a high accuracy. Therefore, an adversary can leverage a model pre-trained on another task with a larger training set, then fine-tune it on STL-10. This fine-tuning method is widely adopted for transfer learning [yosinski2014transferable], and is also evaluated in [adi2018turning]. In particular, we perform transfer learning to adapt a model trained on CIFAR-10 or ImageNet32 to STL-10. We do not consider CIFAR-100 in this setting, because we find that adapting from a pre-trained CIFAR-100 model results in inferior performance on STL-10 compared to CIFAR-10 and ImageNet32, e.g., the accuracy on STL-10 is noticeably lower than with the model pre-trained on CIFAR-10, as presented in [adi2018turning]. We perform the unlabeled data augmentation in the same way as in the non-transfer learning setting.
Iv-D Implementation Details
Our configuration of watermarking schemes largely follows the same setups as their original papers, and we tune the hyper-parameters to ensure that the pre-trained model achieves 100% watermark accuracy for each scheme. We directly use their open-source implementation when applicable. Specifically:
Pattern-based techniques. We use the text pattern in [zhang2018protecting], and we present some examples of generated watermarks in Figure 4.
Exponential weighting. We set the intensity hyper-parameter $T$ as in [namba2019robust] for all settings. For each dataset, we use the last 100 samples of the training set to form the watermark set, and ensure that these watermark samples are never included in the fine-tuning training set.
Adversarial frontier stitching. We set $\varepsilon$ so that the watermark accuracy of a model trained without watermarks is around 50%. The values of $\varepsilon$ are 0.15, 0.10 and 0.05 for CIFAR-10, CIFAR-100 and ImageNet32, respectively.
Watermark removal techniques.
We always fine-tune the entire model for REFIT, because we find that fine-tuning the output layer only is insufficient for watermark removal, as demonstrated in [adi2018turning]; moreover, it would completely fail to remove watermarks in the transfer learning setting by design. We have tried both the FTAL and RTAL processes described in [adi2018turning]. Specifically, FTAL directly fine-tunes the entire model; with RTAL, the output layer is randomly initialized before fine-tuning. For non-transfer learning, we apply the FTAL method, as RTAL does not provide additional performance gain; for transfer learning, we apply the RTAL method, since the label sets of the pre-trained and fine-tuning datasets are different. We observe that as long as the pre-trained model achieves a high test accuracy and fits the watermarks well, the model architecture does not have a critical influence on the effectiveness of watermark embedding and removal. Thus, unless otherwise specified, we mainly apply the ResNet-18 model [he2016deep] in our evaluation, which achieves competitive performance on all benchmarks in our evaluation.
As discussed in Section III, the failure of previous attempts at fine-tuning based watermark removal is mainly due to the improper design of the learning rate schedule during the fine-tuning stage. For example, the initial learning rate for fine-tuning used in [adi2018turning] is smaller than the initial learning rate for pre-training. In our evaluation, we set the initial fine-tuning learning rate to be much larger. We use SGD as the optimizer, and set the batch size to 100 for both pre-training and fine-tuning without unlabeled data, following the setup in [adi2018turning]. For unlabeled data augmentation, when there are no in-distribution labeled samples, each batch includes 100 unlabeled samples. When fine-tuning on CIFAR-10, CIFAR-100 and STL-10, we decay the learning rate by 0.9 every 500 steps. When fine-tuning on partial ImageNet32, the learning rate is decayed by a constant factor after training on each fixed fraction of the entire training set. More discussion on implementation details can be found in Appendix A. In Section V, we denote this basic version of REFIT without EWC and AU as Basic.
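The CIFAR/STL schedule described above (multiply the learning rate by 0.9 every 500 steps) can be sketched as a simple step-decay function. The initial learning rate of 0.05 below is only an illustrative placeholder, not a value taken from the paper:

```python
def fine_tune_lr(step, initial_lr=0.05, decay=0.9, decay_every=500):
    """Step-decay schedule: multiply the learning rate by `decay`
    every `decay_every` fine-tuning steps."""
    return initial_lr * (decay ** (step // decay_every))
```

In a training loop, the optimizer's learning rate would be refreshed with this value at every step (or via a framework scheduler such as PyTorch's `StepLR`, which implements the same rule).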
For our EWC component, the Fisher information is approximated over samples drawn from the in-distribution labeled data available to the adversary, with different sample sizes used when the target domain is CIFAR-10, CIFAR-100 or STL-10 than when it is ImageNet32. In practice, to improve the stability of the optimization, we first normalize the Fisher matrix so that its maximum entry equals a fixed constant, then clip the matrix by a bound depending on the learning rate before plugging it into Equation (2).
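As a rough illustration of this stabilization step and of the quadratic EWC penalty it feeds into, the following sketch operates on flattened parameter lists; `max_entry`, `clip_value`, and `lam` are illustrative placeholders, not the paper's constants:

```python
def normalize_and_clip_fisher(fisher, max_entry=1.0, clip_value=10.0):
    # Scale the (diagonal) Fisher estimate so its largest entry equals
    # `max_entry`, then clip entries from above for numerical stability.
    peak = max(fisher)
    scaled = [f * max_entry / peak for f in fisher]
    return [min(f, clip_value) for f in scaled]

def ewc_penalty(params, ref_params, fisher, lam=1.0):
    # Quadratic EWC regularizer: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2,
    # where theta* are the pre-trained (watermarked) model's parameters.
    return 0.5 * lam * sum(f * (p - r) ** 2
                           for f, p, r in zip(fisher, params, ref_params))
```

During fine-tuning, this penalty would be added to the task loss so that parameters with large Fisher values stay close to the pre-trained model while less important ones are free to move away from the watermark behavior.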
In addition, we also compare with a baseline method that trains the entire model from scratch without leveraging the pre-trained model, so that the model is guaranteed to have a watermark accuracy no higher than the decision threshold, though the test accuracy is typically low when the training data is limited. This baseline is denoted as FS.
We mainly consider the following two metrics in our evaluation.
Watermark accuracy. The adversary needs to make sure that the model accuracy on the watermark set is no more than the watermark decision threshold. In particular, we set the threshold to be within the range of watermark accuracies of models trained without watermarks. Specifically, for watermark schemes other than AFS, we set separate thresholds for CIFAR-10, CIFAR-100, and ImageNet32. When using AFS, we use the same threshold for all benchmarks, following [merrer2017adversarial].
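This metric can be sketched as follows; `threshold` stands for the decision threshold discussed above, whose concrete values are dataset-specific and not reproduced here:

```python
def watermark_accuracy(predictions, targets):
    # Fraction of watermark samples on which the model outputs the
    # owner-specified label.
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

def removal_succeeded(predictions, targets, threshold):
    # The removal attack succeeds when watermark accuracy falls to or
    # below the decision threshold used by the verification protocol.
    return watermark_accuracy(predictions, targets) <= threshold
```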
Notice that for the transfer learning setting, due to the difference between the label sets of the pre-trained and fine-tuning tasks, the embedded watermarks naturally do not apply to the new model. To measure the watermark accuracy in this case, following [adi2018turning], we replace the output layer of the fine-tuned model with the original output layer of the pre-trained model.
Test accuracy. The adversary’s goal is to maximize the accuracy of the model on the normal test set, while removing the watermarks. We consider the top-1 accuracy in our evaluation.
Regarding the presentation of evaluation results in the next section, unless otherwise specified, we only present the test accuracies of the models. The watermark accuracy of the pre-trained model embedded with any watermarking scheme in our evaluation is 100%, and the watermark accuracy of the model after watermark removal using REFIT is always below the decision threshold.
In this section, we demonstrate the effectiveness of REFIT in removing watermarks embedded by several different schemes, in both the transfer and non-transfer learning scenarios discussed in the previous section. We first present the overall results, then discuss related ablation studies for comparison with existing work, as well as the justification of our design choices.
V-A Evaluation of transfer learning
We first present the results of transfer learning from CIFAR-10 to STL-10 in Table II. For comparison of the STL-10 test accuracy, we also fine-tune the pre-trained model with a smaller learning rate, so that its watermark accuracy may remain above the decision threshold, as in [adi2018turning]. We observe that with the basic version of REFIT, where neither EWC nor AU is applied, removing watermarks already does not compromise the model performance on the test set. When equipped with either EWC or AU, the model fine-tuned with REFIT even surpasses the performance of the watermarked model.
(Table II: STL-10 test accuracies of FS and REFIT variants — Basic, EWC, AU, and EWC+AU — for the Pattern, OOD, EW, and AFS watermarking schemes.)
Then we present the results of transferring from ImageNet32 to STL-10 in Table II. We observe that using the models pre-trained on ImageNet32 yields much better performance than the ones pre-trained on CIFAR-10, i.e., the test accuracies are notably higher, although the label set of ImageNet32 differs from STL-10 much more than that of CIFAR-10 does. This could be attributed to the diversity of samples in ImageNet32, which makes it a desirable data source for pre-training. Different from pre-training on CIFAR-10, the basic version of REFIT no longer suffices to preserve the test accuracy. By leveraging the unlabeled part of STL-10, the model performance becomes comparable to the watermarked ones. When combining EWC and AU, the performance of fine-tuned models dominates among different variants of REFIT as well as the watermarked models.
Meanwhile, we can notice that the performance of models fine-tuned on the unlabeled part of STL-10 is consistently better than that of models using ImageNet32 for unlabeled data augmentation. This is expected, since the unlabeled part of STL-10 is closer to the test distribution than ImageNet32. Interestingly, we find that by jointly applying both EWC and AU, the gap between utilizing STL-10 and ImageNet32 for unlabeled data augmentation shrinks, which indicates the effectiveness of the EWC component.
V-B Evaluation of non-transfer learning
For the non-transfer learning setting, we first present results on CIFAR-10 in Table III, and the results on CIFAR-100 in Table IV. First, we observe that when the adversary has a large portion of the entire training set, similar to our observation of transfer learning from CIFAR-10 to STL-10, using the basic version of REFIT already achieves higher test accuracies than the pre-trained models for any of the watermarking schemes in our evaluation, while removing the watermarks. Note that the watermark accuracies remain above the decision threshold using the fine-tuning approaches in previous work [adi2018turning, zhang2018protecting], suggesting the effectiveness of our modification of the fine-tuning learning rate schedule.
However, when the adversary only has a small proportion of the labeled training set, the test accuracy could degrade. Although the test accuracy typically drops only mildly on CIFAR-10 even if the adversary has a small fraction of the entire training set, the accuracy degradation could be much larger on CIFAR-100. For all watermarking schemes other than AFS, incorporating the EWC component typically brings in a notable accuracy improvement on both CIFAR-10 and CIFAR-100, which is significant considering the performance gap to the pre-trained models. The improvement for AFS is smaller yet still considerable, partially because the performance of the basic fine-tuning is already much better than with other watermarking schemes, which suggests that AFS could be more vulnerable to watermark removal, at least when the labeled data is very limited. By leveraging the unlabeled data, the adversary is able to achieve the same level of test performance as the pre-trained models with only a fraction of the entire training set. We skip the results of combining EWC and AU on CIFAR-10 and CIFAR-100, since they are generally very close to the results of AU. However, we will demonstrate that the combination of EWC and AU provides an observable performance improvement on ImageNet32, which is a more challenging benchmark.
Furthermore, unlabeled data augmentation enables the adversary to fine-tune the model without any labeled training data. By solely relying on the unlabeled data, the accuracy of the fine-tuned model could be within a small margin of the pre-trained model on both CIFAR-10 and CIFAR-100, and sometimes even surpasses the performance of the model trained with data from scratch. Note that both STL-10 and ImageNet32 images are drawn from very different distributions than CIFAR-10 and CIFAR-100; in particular, the label sets of the sources of the unlabeled data barely overlap with the in-distribution benchmark for evaluation. Meanwhile, we observe that the choice of unlabeled data does not play an important role in the final performance; i.e., the performance of augmenting with one data source is not always better than the other. These results show that REFIT is effective without requiring that the unlabeled data come from the same distribution as the task of evaluation, which makes it a practical watermark removal technique for the adversary given its simplicity and efficacy, and thus poses real threats to the robustness of watermark embedding schemes.
In addition, we notice that while AU mostly dominates when the percentage of labeled data is very small, with a moderate percentage of labeled data for fine-tuning, EWC starts to outperform AU in some cases; the percentage of labeled data at which this crossover occurs differs between CIFAR-10 and CIFAR-100. This indicates that with the increase of the labeled data, the estimated Fisher matrix can better capture the important model parameters to preserve for the adversary's task.
In Table V, we further present our results on ImageNet32. Compared to the results on CIFAR-10 and CIFAR-100, removing watermarks embedded into pre-trained ImageNet32 models could result in a larger decrease of test accuracy, which is expected given that ImageNet32 is a more challenging benchmark with a much larger label set. Despite these additional challenges, we demonstrate that by combining EWC and AU, REFIT is still able to reach the same level of performance as the pre-trained watermarked model with 50% of the labeled training data.
Meanwhile, the increased difficulty of this benchmark also enables us to better analyze the importance of each component of REFIT, i.e., EWC and AU. In particular, each of the two components offers a decent improvement of the test performance. The accuracy increase with EWC over the basic version is clearest when the fine-tuning data is very limited, e.g., when the percentage of labeled samples is small. Such a performance gap is similar to the results on CIFAR-100, and much smaller than on CIFAR-10, potentially because the number of training samples per class is much smaller for ImageNet32 and CIFAR-100. The performance of using AU is generally better than using EWC, until the labeled training set includes a sufficiently large fraction of the ImageNet32 training samples of the first 500 classes, when EWC becomes more competitive. Finally, including both EWC and AU always enables further improvement of the test performance, suggesting that the combined technique is advantageous for challenging tasks.
By comparing the results of different watermarking schemes, we notice that models fine-tuned from pre-trained models embedded with pattern-based watermarks consistently achieve higher test accuracy than fine-tuned models whose watermarks were embedded with other approaches. This suggests that although pattern-based watermarking techniques are more widely used than other approaches, especially for backdoor injection, such watermarks could be easier to remove, which makes it necessary to develop more advanced backdoor injection techniques that are robust to removal attacks.
V-C Comparison with alternative watermark removal attacks
In the following, we discuss and compare REFIT with general-purpose watermark removal approaches proposed in previous work, which likewise do not assume knowledge of the watermarking scheme.
Discussion of distillation attacks.
Distillation is a process to transfer the knowledge extracted from a pre-trained model into a smaller model, while preserving the prediction accuracy so that the smaller model is comparable to the pre-trained one [hinton2015distilling]. Specifically, a probability vector p is computed as p_i = exp(z_i / T) / sum_j exp(z_j / T), where z is the output logit of the model given the input x, and T is a hyper-parameter representing the temperature. Afterwards, instead of using the one-hot vector of the ground truth label for each training sample, the vector p extracted from the pre-trained model is fed to the smaller model as the ground truth. Previous work has proposed distillation as a defense against adversarial examples [papernot2016distillation, papernot2017extending]. On the other hand, a recent work studies distillation as an attack against watermark embedding approaches, and suggests its effectiveness [yang2019effectiveness]. However, in order to preserve the test accuracy, such attacks rely on the assumption that the adversary has abundant data for fine-tuning, which is not the case in our setup. Therefore, the direct application of distillation attacks is inappropriate.
Alternatively, we investigate incorporating this technique into our unlabeled data augmentation process. Specifically, for the unlabeled part of the data, instead of using the one-hot encoding of labels predicted by the pre-trained model, we use the soft probability vector p as the ground truth label, and vary the value of the temperature T to see the effect. Nevertheless, this method does not provide a better performance; for example, with a limited labeled training set on CIFAR-10 and using the unlabeled part of STL-10 for augmentation, when the pre-trained model is embedded with OOD watermarks, using soft labels yields no higher test accuracy than using the one-hot labels as in Table III, and other values of T do not cause any significant difference. In particular, we observe that when using output logits of the watermarked model as the ground truth for fine-tuning, the resulting model tends to have a higher watermark accuracy, perhaps because while the output logits allow the fine-tuned model to better fit the pre-trained model, they also encourage the fine-tuned model to learn more information about the watermarks. Thus, we stick to our original design to annotate the unlabeled data.
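For reference, the temperature-scaled softmax used to produce such soft labels can be sketched as below; this is the standard distillation formulation rather than the paper's exact implementation:

```python
import math

def soft_labels(logits, temperature=1.0):
    # Temperature-scaled softmax: p_i = exp(z_i / T) / sum_j exp(z_j / T).
    # Higher T flattens the distribution toward uniform, exposing more of
    # the pre-trained model's "dark knowledge" about secondary classes.
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Setting `temperature=1.0` recovers the ordinary softmax; in the experiment above, the fine-tuned model would be trained against these vectors instead of one-hot labels for the unlabeled samples.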
Comparison with pruning-based approaches.
Previous work has studied the effectiveness of pruning-based approaches for watermark removal, and found that such techniques are largely ineffective [zhang2018protecting, liu2018fine, namba2019robust]. In our evaluation, we compare with the pruning method studied in [liu2018fine], where we follow their setup to prune the neurons of the last convolutional layer in increasing order of the magnitude of their activations on the validation set.
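A minimal sketch of this pruning criterion, operating on precomputed per-neuron mean activation magnitudes (the helper name and list-based interface are illustrative, not from the original implementation):

```python
def neurons_to_prune(mean_activations, pruning_rate):
    # Rank the neurons of the last convolutional layer by the mean
    # magnitude of their activations on a validation set, and return
    # the indices of the lowest-activation neurons up to `pruning_rate`.
    n_prune = int(len(mean_activations) * pruning_rate)
    order = sorted(range(len(mean_activations)),
                   key=lambda i: abs(mean_activations[i]))
    return sorted(order[:n_prune])
```

The selected neurons would then be zeroed out (e.g., by masking their output channels) before measuring test and watermark accuracy at each pruning rate.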
Figure 7 presents the curves of the model accuracy with different pruning rates. Note that due to the skip connections introduced in ResNet architecture, the model accuracy may not be low even if the pruning rate is close to 1. Therefore, we also evaluate VGG-16 [simonyan2014very], another neural network architecture that is capable of achieving the same level of performance on both CIFAR-10 and CIFAR-100. For both models, we observe that the watermark accuracy is tightly associated with the test accuracy, which makes it hard to find a sweet spot of the pruning rate so that the test performance is preserved while the watermarks are removed.
In particular, as shown in Table VI, using the pruning approach, even when the test accuracy degrades substantially on CIFAR-10, the watermark accuracy remains high; on the other hand, using REFIT with AU, without any in-distribution labeled data, the fine-tuned model achieves the same level of performance as the pruning method with the watermarks removed. The gap on CIFAR-100 is more significant: the test accuracy of the pruned model decreases to well below what REFIT achieves, with the watermarks still retained. We have also tried other pruning approaches, but none of them works considerably better, which shows that REFIT is more suitable for watermark removal.
Comparison with fine-pruning.
We also consider the fine-pruning method proposed in [liu2018fine], which first prunes the neurons that are activated the least on benign samples, and then performs fine-tuning. We evaluate their approach with the same fine-tuning learning rate schedule as REFIT. Specifically, we set the pruning rates before fine-tuning in the same way as their paper, i.e., we keep increasing the pruning rate stepwise, and stop when the degradation of the model performance becomes observable.
Table VI presents the results of the fine-pruning approach as well as the basic version of REFIT without EWC and AU, where the pre-trained models are embedded with OOD watermarks. For both datasets and model architectures, we find that the results are roughly similar. These results suggest that pruning is not necessary with a properly designed learning rate schedule for fine-tuning. Therefore, we omit the full comparison with fine-pruning in our evaluation.
VI Related Work
Aside from the attacks that infringe the intellectual property of a machine learning model, in the broader context, a variety of attacks have been proposed against machine learning models, aiming either at manipulating model predictions (e.g., backdoor attacks, poisoning attacks, and evasion attacks) or at revealing sensitive information from trained models. We also review work on the catastrophic forgetting phenomenon in deep learning, as it inspires the use of the EWC loss in our watermark removal scheme.
In the context of machine learning, the goal of backdoor attacks is to make the model provide the predictions specified by the adversary on inputs associated with the backdoor key. In this sense, backdoor attacks are closely connected to watermarks in their formats, but usually with different purposes, as discussed in [adi2018turning]. Previous work has shown that deep neural networks are vulnerable to backdoor attacks [chen2017targeted, gu2017badnets]. Accordingly, several defense methods against backdoor attacks have been proposed [wang2019neural, gao2019strip, chen2019deepinspect, guo2019tabor].
Similar to watermarking techniques and backdoor attacks, poisoning attacks also inject well-crafted data into the training set in order to alter the predictive performance of a deep neural network. Depending on whether they aim at degrading the test accuracy indiscriminately or on specific examples, data poisoning attacks can be categorized into untargeted vs. targeted ones. Untargeted poisoning attacks have been studied for various types of machine learning models, such as support vector machines [biggio2012poisoning], Bayes classifiers [nelson2008exploiting], collaborative filtering [li2016data], and deep neural networks [munoz2017towards]. Since targeted attacks only affect the test performance on a small set of examples but do not render the entire machine learning system useless, they are less detectable and thus arguably more dangerous than untargeted ones. Recent works [koh2017understanding, shafahi2018poison] have proposed algorithms to design poisoned examples that appear to be labeled correctly even according to an expert observer.
In contrast to poisoning attacks, evasion attacks are launched at test time of a machine learning model. The resulting samples are called adversarial examples, which are visually similar to normal data but lead to wrong predictions by the model [biggio2013evasion, szegedy2013intriguing]. Existing adversarial example generation algorithms mainly rely on gradient information. For instance, the fast gradient sign method (FGSM) adds perturbations along the gradient direction [goodfellow2014explaining]; the projected gradient descent method takes gradient steps repeatedly, yielding a more powerful attack. Prior work also formulates an optimization problem to search for the adversarial examples with minimal perturbation [carlini2017towards]. Note that the FGSM method is used to generate watermark samples for the AFS watermarking scheme.
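A minimal sketch of the FGSM perturbation on a flattened input, assuming the loss gradient with respect to the input has already been computed (the list-based interface is for illustration only; real implementations operate on tensors and clip to the valid pixel range):

```python
def fgsm_perturb(x, grad, epsilon):
    # Fast gradient sign method: move each input dimension by epsilon
    # in the direction that increases the loss (sign of the gradient).
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]
```

In AFS, such perturbed samples near the decision boundary (with appropriately chosen epsilon, cf. the values in Section IV-D) serve as the watermark set.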
Machine learning models are oftentimes trained on sensitive information, such as medical records, text messages, etc. The goal of privacy attacks is to reveal some aspects of the training data. Of particular interest are membership attacks and model inversion attacks. Membership attacks attempt to determine whether a given individual's data was used in training the model [shokri2017membership]. Successful membership attacks have been demonstrated on discriminative models [shokri2017membership] as well as data generative models [hayes2019logan]. Model inversion attacks, on the other hand, aim to reconstruct the features corresponding to specific target labels [fredrikson2014privacy]. For instance, it has been shown that one can invert the face image for a given identity from a face recognition classifier [fredrikson2015model].
Catastrophic forgetting refers to the phenomenon that a neural network model tends to underperform on old tasks when it is trained sequentially on multiple tasks. This occurs because the weights in the network that are important for an old task are changed to meet the objectives of a new task. In recent years, many approaches have been proposed to reduce the effect of forgetting, such as adjusting weights [kirkpatrick2017overcoming, zenke2017continual], and adding data of past tasks to the new task training [lopez2017gradient, shin2017continual]. In particular, the elastic weight consolidation (EWC) algorithm is a classic way of mitigating catastrophic forgetting by adapting the learning of specific weights according to their importance to previous tasks [kirkpatrick2017overcoming]. Note that the original EWC algorithm requires access to the data used for learning old tasks, which is not available in our case. Therefore, we propose an adaptation of the algorithm to make it suitable for our watermark removal application.
In this work, we propose REFIT, a unified framework that removes watermarks via fine-tuning. We first demonstrate that by appropriately designing the learning rate schedule, our fine-tuning approach is always able to remove the watermarks. We further propose two techniques integrated into the REFIT framework, i.e., an adaptation of the elastic weight consolidation (EWC) approach, and unlabeled data augmentation (AU). We conduct an extensive evaluation under the assumption of a weak adversary who only has access to a limited amount of training data. Our results demonstrate the effectiveness of REFIT against several watermarking schemes of different types. In particular, EWC and AU enable the adversary to successfully remove the watermarks without causing much degradation of the model performance. Furthermore, by leveraging unlabeled data, the adversary could perform watermark removal without any in-distribution labeled data, while achieving a much better model performance than pruning, another general-purpose watermark removal scheme agnostic to the watermark embedding approaches. Our study highlights the vulnerability of existing watermarking techniques, and we consider proposing more robust watermarking techniques as future work.
This material is in part based upon work supported by the National Science Foundation under Grant No. TWC-1409915, Berkeley DeepDrive, and DARPA D3M under Grant No. FA8750-17-2-0091. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Appendix A More Discussion on Experimental Details
(Table VII: best hyper-parameter configurations for each dataset and watermarking scheme — initial learning rate, and the EWC and AU hyper-parameters.)
Our implementation is in PyTorch, and is mainly adapted from https://github.com/adiyoss/WatermarkNN, the code repo of [adi2018turning]. For each watermarking scheme in our evaluation, we present the best hyper-parameter configurations in Table VII. Note that in reality, when the adversary lacks such knowledge of the watermarking scheme, he can always perform a broader hyper-parameter sweep to select the best configuration.