Fostering the Robustness of White-Box Deep Neural Network Watermarks by Neuron Alignment

by Fang-Qi Li, et al.
Shanghai Jiao Tong University

The wide application of deep learning techniques is pushing deep learning models, especially deep neural networks (DNN), to be regulated as commercial products. A necessary prerequisite for such regulation is identifying the owner of a deep neural network, which is usually done through watermarking. Current DNN watermarking schemes, particularly white-box ones, are uniformly fragile against a family of functionality-equivalence attacks, especially neuron permutation. This operation can effortlessly invalidate the ownership proof and escape copyright regulation. To enhance the robustness of white-box DNN watermarking schemes, this paper presents a procedure that aligns neurons into the same order as when the watermark was embedded, so that the watermark can be correctly recognized. This neuron alignment process significantly strengthens established deep neural network watermarking schemes.





1 Introduction

Deep learning models have made significant achievements in domains ranging from computer vision [9414465] to signal processing [9413901, 9413723]. Since deep neural networks (DNN) can provide high-quality service, they have been treated as commercial products and intellectual property. One necessary condition for commercializing DNNs is identifying their owners. The DNN watermark is an acknowledged technique for ownership verification (OV). By embedding owner-dependent information into the DNN and revealing it under an OV protocol [oursijcai], the owner of the DNN can be uniquely recognized.

If the pirated model can only be interacted with as a black box, then backdoor-based DNN watermarking schemes are the only option. They encode the owner's identity into backdoor triggers by pseudorandom mapping [zhu2020secure], variational autoencoders [li2019prove], or deep image watermarking [zhang2021deep]. Adversarial samples [le2020adversarial] and penetrative triggers [oursicip] have been designed to defend against adversarial tuning and filtering. However, in realistic settings, an adversary can ensemble multiple DNN models or add extra rules to invalidate backdoors.

White-box DNN watermarking schemes have better performance regarding unambiguity and forensics, given unlimited access to the pirated model. They embed the owner's identity into the model's weights [uchida2017embedding], its intermediate outputs for specific inputs [ours], etc. The white-box assumption holds for many important scenarios such as model competitions, engineering testing, and lawsuits.

Despite these advantages, white-box DNN watermarking schemes are haunted by the functionality-equivalence attack, in particular the neuron permutation attack [lukas2021sok]. The watermark is uniformly tangled with the parameters of neurons, so the adversary can invalidate it and pirate the model by permuting neurons without affecting the model's performance.

To cope with this threat and foster the robustness of white-box DNN watermarking schemes, we propose a neuron alignment framework. By encoding the neurons and generating proper triggers, the order of neurons can be recovered. Then the watermark can be correctly retrieved and the ownership is secured. The contribution of this paper is threefold:

  • We propose a DNN protection framework against the neuron permutation attack. To the best of our knowledge, this is the first attempt at defending against such a threat.

  • By aligning neurons, the proposed framework can recover the order of neurons and can be seamlessly combined with established watermarking schemes.

  • Experiments have justified the efficacy of our proposal.

2 The Motivation

In OV, the verifier module takes the parameters/outputs of neurons as its input. An adversary can shuffle homogeneous neurons (whose forms and connections to previous layers are identical) using a permutation operator P such that, from the verifier's perspective, the input is no longer an identification proof. The impact on the subsequent processing can be canceled by applying the inverse permutation P^{-1} before the next layer, so the functionality of the DNN remains intact. This neuron permutation attack is exemplified in Fig. 1.

(a) The normal verification.
(b) The verification under attack.
Figure 1: A neuron permutation attack.
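To make the attack concrete, here is a minimal NumPy sketch (illustrative only, not the paper's code): permuting the hidden neurons of a toy two-layer network and permuting the next layer's input weights accordingly leaves the model's function numerically identical, while any verifier reading the neurons in their stored order sees scrambled evidence.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy two-layer network: h = relu(W1 x + b1), y = W2 h + b2.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

def forward(W1, b1, W2, b2, x):
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

# Neuron permutation attack: shuffle the 8 hidden neurons with P ...
P = rng.permutation(8)
W1p, b1p = W1[P], b1[P]
# ... and cancel its effect by permuting the columns of the next layer.
W2p = W2[:, P]

x = rng.normal(size=4)
# The pirated model's outputs match the original on every input.
assert np.allclose(forward(W1, b1, W2, b2, x),
                   forward(W1p, b1p, W2p, b2, x))
```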

One solution to this threat is designing verifier modules that are invariant to the permutation of their inputs, which is challenging due to the loss of information and is incompatible with all established white-box DNN watermarking schemes. Instead, it is desirable to recognize P and cancel its influence by aligning the neurons into their original order. To perform alignment, we encode neurons by their scalar outputs, which are invariant under any permutation in preceding layers. The neurons' outputs on training data, which are supposed to be the most diversified, are clustered into several centroids as signals. To eliminate the deviation between a neuron's normal outputs and the centroids, trigger samples are generated that correctly evoke these signal outputs as a neuron's identifier code. To guarantee robust reconstruction, this encoding also needs good error-correcting ability against model tuning.

3 The Proposed Method

Assume that the watermarked layer contains N homogeneous neurons. The code for a neuron is its outputs on a specialized collection of inputs, known as triggers. Given the triggers, the owner can obtain the codes for all neurons and align them properly. What remains to be specified is the encoding scheme and the generation of triggers.

3.1 Neuron encoding

Denote the length of the code by L and the size of the alphabet by K. Each trigger invokes one output from each neuron and is mapped into one position in each neuron's code, so L is also the number of triggers. Denote the output of the i-th neuron in the watermarked layer for an input x by f_i(x), and let the training dataset be D. The normal output space of neurons in the watermarked layer is split into K folds. The centroid of the k-th fold, c_k, is computed as the average of the outputs falling into that fold:

c_k = (1 / |I_k|) Σ_{v ∈ I_k} v,    (1)

where I_k is the range of the k-th fold, containing the k-th run of consecutive elements (in sorted order) among the outputs {f_i(x) : i = 1, …, N, x ∈ D}. This process is demonstrated in Fig. 2.

Figure 2: Splitting the output space, shown with 1024 outputs, K = 10 folds, and fold k = 3 highlighted.
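The fold splitting and centroid computation can be sketched as follows (a NumPy toy assuming equally sized folds over the sorted outputs; the name `fold_centroids` is ours):

```python
import numpy as np

def fold_centroids(outputs, K):
    """Split the sorted neuron outputs into K equally sized folds and
    return the mean (centroid) of each fold."""
    v = np.sort(np.ravel(outputs))
    folds = np.array_split(v, K)   # k-th fold: k-th run of sorted values
    return np.array([f.mean() for f in folds])

# Toy example: 1024 simulated activations split into K = 10 folds.
outputs = np.random.default_rng(1).exponential(size=1024)
c = fold_centroids(outputs, 10)
assert len(c) == 10
assert np.all(np.diff(c) > 0)  # centroids are ordered along the output axis
```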

Having determined the centroids, the i-th neuron is assigned a code w_i = (w_{i,1}, …, w_{i,L}) over the alphabet {1, …, K}; the dictionary of codes is spanned using an error-correction coding scheme [9418432]. It is expected that the output of the i-th neuron on the l-th trigger, f_i(x_l), is close to the centroid c_{w_{i,l}}. To enable correction of up to t corrupted positions, each shifting by at most s folds (with K = 2, s = 1), it is necessary that L and K satisfy the sphere-packing (Hamming) bound:

N · Σ_{i=0}^{t} C(L, i) (K − 1)^i ≤ K^L.    (2)
(a) Generating the l-th trigger; the code for the l-th position across neurons is (1402).
(b) Aligning neurons from the intermediate output code (1024) on the l-th trigger.
Figure 3: The trigger generation process and the neuron alignment process.
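The capacities reported in Table 1 are consistent with the classical sphere-packing (Hamming) bound; the helper below (our illustrative script, with `max_correctable` a name of our choosing) finds the largest t that bound permits:

```python
from math import comb

def max_correctable(N, L, K=2):
    """Largest t such that N codewords of length L over a K-ary alphabet
    can satisfy the sphere-packing (Hamming) bound
        N * sum_{i<=t} C(L, i) * (K-1)^i  <=  K**L,
    i.e. the most symbol errors a code meeting the bound could correct."""
    t = 0
    while t < L and (
        N * sum(comb(L, i) * (K - 1) ** i for i in range(t + 2)) <= K ** L
    ):
        t += 1
    return t

# Reproduce entries of Table 1 (K = 2):
assert max_correctable(64, 20) == 4
assert max_correctable(64, 40) == 12
assert max_correctable(128, 40) == 11
```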

3.2 Trigger generation

To generate triggers that correctly evoke the neurons' outputs as codes, we adopt the method for forging adversarial samples [goodfellow2014explaining]. Concretely, the l-th trigger x_l is obtained by minimizing the following loss, in which the parameters of the entire DNN are frozen:

L(x_l) = Σ_{i=1}^{N} ( f_i(x_l) − c_{w_{i,l}} )².    (3)

To increase the robustness of this encoding against the adversary's tuning, we suggest that x_l also be optimized w.r.t. adversarially tuned versions of the watermarked DNN. Let f_i^j denote the mapping introduced by the i-th neuron under the j-th kind of tuning (j = 0 represents the original model); the loss function becomes:

L(x_l) = Σ_{j=0}^{J} Σ_{i=1}^{N} ( f_i^j(x_l) − c_{w_{i,l}} )².    (4)
The collection of all triggers forms the owner’s evidence for neuron alignment. This process is demonstrated in Fig. 3 (a).
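The trigger-generation loss above can be sketched with a toy frozen linear "layer" and plain gradient descent on the input only (a NumPy stand-in for the adversarial-example machinery; the sizes n, d and the learning rate are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy frozen "watermarked layer": n neurons with f_i(x) = w_i . x.
n, d = 8, 16
W = rng.normal(size=(n, d))            # layer parameters: frozen throughout
centroids = np.array([0.0, 2.5])       # the K = 2 centroids from fold splitting
symbols = rng.integers(0, 2, size=n)   # l-th code symbol of each neuron
target = centroids[symbols]            # desired output per neuron on trigger x_l

# Gradient descent on the trigger only:  min_x  sum_i (f_i(x) - c_{w_{i,l}})^2
x = rng.normal(size=d)
for _ in range(2000):
    err = W @ x - target                # per-neuron residual
    x -= 0.1 * (2.0 / n) * (W.T @ err)  # gradient step on the input, not on W
assert np.max(np.abs(W @ x - target)) < 1e-3  # every neuron emits its symbol
```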

3.3 Neuron alignment

Given white-box access to the suspicious DNN, the owner can recover the order of neurons in the watermarked layer by the following steps: (i) Inputting all L triggers sequentially into the DNN. (ii) Recording the outputs of the i-th neuron in the watermarked layer as (f_i(x_1), …, f_i(x_L)). (iii) Transcribing the outputs into a code ŵ_i by mapping each output to its nearest centroid:

ŵ_{i,l} = argmin_{k ∈ {1,…,K}} | f_i(x_l) − c_k |.

(iv) Transcribing ŵ_i into an index by decoding it to the nearest codeword in the dictionary.

Finally, the owner aligns all neurons according to their indices and conducts OV using its white-box watermark verifier. This process is demonstrated in Fig. 3 (b).
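Steps (iii) and (iv) amount to nearest-centroid transcription followed by nearest-codeword decoding. A minimal sketch (the 5-symbol codebook is a made-up example, not the paper's code):

```python
import numpy as np

def transcribe(outputs, centroids):
    """Step (iii): map each observed output to the index of its nearest centroid."""
    return np.argmin(np.abs(outputs[:, None] - centroids[None, :]), axis=1)

def decode(code, codebook):
    """Step (iv): nearest-codeword decoding -- the neuron's index is the row
    of the codebook with the smallest Hamming distance to the transcribed code."""
    return int(np.argmin((codebook != code).sum(axis=1)))

centroids = np.array([0.0, 2.5])
codebook = np.array([[0, 1, 1, 0, 1],      # neuron 0
                     [1, 0, 1, 1, 0],      # neuron 1
                     [0, 0, 0, 1, 1]])     # neuron 2
# Noisy outputs of one (shuffled) neuron on the 5 triggers; last symbol corrupted.
obs = np.array([2.3, 0.2, 2.6, 2.4, 2.2])  # truly neuron 1, one position flipped
code = transcribe(obs, centroids)
assert decode(code, codebook) == 1          # still decodes to neuron 1
```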

Remark: An adaptive adversary might breach this alignment by rescaling the weights across layers. This can be neutralized by normalizing parameters before alignment or by adopting a smaller K to ensure distinguishability.

4 Experiments and Discussions

4.1 Settings

To examine the validity of the proposed framework, we selected two DNN structures, ResNet-18 and ResNet-50 [he2016deep]. In each DNN, we selected the second (l_2) and the third (l_3) layers to be watermarked; l_2 contains 64 homogeneous neurons and l_3 contains 128. For these convolutional layers, the output from which neurons are recognized and decoded is the value of one specific pixel. Both networks were trained on three computer vision tasks: MNIST [deng2012mnist], FashionMNIST [xiao2017fashion], and CIFAR-10 [krizhevsky2009learning]. The training of all DNN backbones and triggers was implemented with Adam [zhang2018improved] in PyTorch.

4.2 The configuration of parameters

To compute the centroids, we measured the distributions of outputs of the watermarked layers on normal samples; the results are shown in Fig. 4.

(a) ResNet-18.
(b) ResNet-50.
Figure 4: Distributions of watermarked layers’ outputs.

These distributions remained almost invariant to the selection of dataset, network, and layer. To ensure maximal distinguishability, we adopted K = 2 and computed the centroids c_1 = 0 and c_2 = 2.5 by (1). With s = 1, the error-correcting ability t computed from (2) is shown in Table 1. We adopted L = 160 in the following experiments, where up to 40% flipped positions would not compromise unique decoding.

N \ L |  20 |  40 |  60 |  80 | 100 | 120 | 140 | 160
   64 |   4 |  12 |  21 |  29 |  38 |  47 |  56 |  65
  128 |   4 |  11 |  20 |  28 |  37 |  46 |  55 |  64
Table 1: The maximal number of flipped positions that can be corrected, t, w.r.t. L and N (K = 2, s = 1).

4.3 Comparative studies

For comparison, we considered five candidate schemes for trigger selection. (N): Normal samples from the training dataset. (R): Random noise. (O): Out-of-dataset samples. (T1): Triggers generated by minimizing (3). (T2): Triggers generated by minimizing (4) with J = 6, involving three rounds of fine-tuning and three of neuron-pruning. For (N)(R)(O), the centroids were also selected by (1), and the code of each neuron at the l-th position was assigned as the index of the centroid closest to its output on the l-th input.

The outputs of neurons in ResNet-50 trained on CIFAR-10 for one input are shown in Fig. 5(a)(b).

Figure 5: The distribution of the watermarked layers' outputs for different triggers. In (c)(d), the DNN has been fine-tuned.

We observe that the outputs w.r.t. (T1)(T2) concentrated around the K = 2 centroids. (The percentage of neurons outputting approximately 0 or 2.5 was not exactly 50%, since the outputs of around 2% of the neurons were uniformly zero.) Therefore, the codes of neurons under (T1)(T2) can be unambiguously retrieved.

Metric                 | (N)  | (R)  | (O)  | (T1) | (T2)
Inter-cluster distance | 2.4  | 1.9  | 1.5  | 2.5  | 2.5
Intra-cluster distance | 1.3  | 0.8  | 0.8  | 0.1  | 0.2
Alignment accuracy (%) | 1.0  | 2.3  | 1.3  | 98.4 | 97.2
Table 2: The statistics of neurons' outputs.

Numerically, we computed the averaged inter-/intra-cluster distance for all trigger patterns, with two clusters obtained by k-means [li2019bayesian], and the accuracy of alignment against random shuffling on l_3; the results are listed in Table 2. These results justify that the codes derived by (T1)(T2) are more informative. After fine-tuning, the distributions of outputs under (T1) and (T2) became clearly differentiated, as shown in Fig. 5(c)(d), so (T2) is more robust against model tuning.

4.4 The performance of watermarking backends

To study the performance of white-box DNN watermarking schemes after the neuron permutation attack and alignment, we considered four state-of-the-art watermarking schemes: Uchida [uchida2017embedding], Fan [fan2021deepip], Residual [liu2021watermarking], and MTLSign [ours]. All watermarks were embedded into both l_2 and l_3.

We conducted three attacks on the watermarked layers: (NP): Neuron Permutation; (FTP): Fine-Tuning and neuron Permutation; (NPP): Neuron-Pruning and Permutation. We then applied neuron alignment and recorded the percentage of correct verifications from the watermarking backends over 1,000 instances; the results are summarized in Table 3. Without neuron alignment, any permutation-based attack reduces the OV accuracy to 0.0%. After alignment, the accuracy increased significantly in all cases. Compared with (T1), (T2) is more robust against tuning and pruning and better reconstructs the order of neurons. Therefore, by adopting the neuron alignment framework, the security level of these watermarking schemes is substantially increased. Meanwhile, the trigger generation process does not modify the original DNN, so it can be parallelized and brings no extra damage to the protected DNN.

Attack | Uchida | Fan | Residual | MTLSign
Table 3: The performance of watermarking backends after neuron alignment for ResNet-18 and ResNet-50. The results in each entry are the accuracy of OV after the attack and its increase (in %) with alignment by ((T1), (T2)), averaged across the three datasets.

5 Conclusions

We propose a neuron alignment framework to enhance established white-box DNN watermarking schemes. Clustering and error-correcting coding are adopted to ensure the availability and distinguishability of the neuron encoding. We then use a generative method to forge triggers that correctly and robustly reveal the neurons' order. Experiments demonstrate the effectiveness of our framework against the neuron permutation attack, a realistic threat to OV for DNNs.