Artificial GAN Fingerprints: Rooting Deepfake Attribution in Training Data

07/16/2020 ∙ by Ning Yu, et al. ∙ Max Planck Society ∙ CISPA

Photorealistic image generation is progressing rapidly and has reached a new level of quality, thanks to breakthroughs in generative adversarial networks (GANs). Yet the dark side of such deepfakes, the malicious use of generated media, keeps raising concerns about visual misinformation. Existing research on deepfake detection demonstrates impressive accuracy, but it is accompanied by adversarial iterations on detection countermeasures. To bring this arms race to an end, we investigate a fundamental solution to deepfake detection that is agnostic to the evolution of GANs, in order to enable responsible disclosure or regulation of such double-edged techniques. We propose to embed artificial fingerprints into GAN training data, and show a surprising discovery on the transferability of such fingerprints from training data to GAN models, which in turn enables reliable detection and attribution of deepfakes. Our empirical study shows that our fingerprinting technique (1) holds for different state-of-the-art GAN configurations, (2) becomes more effective as GAN techniques develop, (3) has a negligible side effect on the generation quality, and (4) stays robust against image-level and model-level perturbations. When we allocate each GAN publisher a unique artificial fingerprint, the margins between real data and deepfakes, and the margins among different deepfake sources, are fundamentally guaranteed. As a result, we demonstrate accurate deepfake detection and attribution using our fingerprint decoder, which makes this solution stand out from the current arms race.




I Introduction

Deep Learning (DL) techniques, ranging from discriminative models [53, 48, 75, 79, 35] to generative models [47, 30, 70, 62, 44, 45, 8, 46], have achieved great success in many applications. They are now widely used and industrially deployed in part due to the availability of many open-source development infrastructures including Caffe, Theano, Torch, PyTorch [67], Chainer [82], and Tensorflow [1], that accelerated progress. Furthermore, academic and industrial researchers often publicize their state-of-the-art models’ implementations for function reproduction, which contributed to the dissemination of the latest results and advances.

As a result of this rapid progress and its availability, Machine Learning as a Service (MLaaS) has rapidly become a popular and profitable business (e.g., Amazon AWS ML [3], Microsoft Azure ML [61], Google Cloud AI [31], and IBM Watson ML [39]). However, training a commercial-level DL model is not a trivial task as it requires a large-scale dataset (e.g., ImageNet [23], CelebA [58], LSUN [89]), powerful computing resources (e.g., 8 NVIDIA Tesla V100 GPUs, each), in addition to significant DL expertise. It usually takes days up to weeks for training and requires an even longer time for many trial-and-error iterations on algorithm design, model architecture selection, and hyper-parameter search.

Fig. 1: An overview of our black-box GAN watermarking solution.

On the other hand, as DL models are difficult to train while easy to access, they become vulnerable to adversaries. Adversaries can redistribute an existing model with little effort and in turn provide a plagiarized DL service without citing or acknowledging the model owner. Such copyright infringement encroaches on the owner’s credits and scoops the commercial potentials from the owner. In this sense, it becomes necessary and urgent to propose solutions for model Intellectual Property (IP) protection, i.e., model ownership verification.

I-A Model watermarking

As protecting models from copying, extraction, or stealing (even with limiting the access to black-box APIs) is challenging [65, 83], the commonly used defense against stealing models is model watermarking. This enables the identification of a stolen model and, thus, the verification of ownership, similar to IP legal protection of patents, trademarks, and copyright licenses [78, 50, 73].

The model watermarking procedure is composed of two phases: embedding and detection, and can be differentiated into two categories: black-box watermarking [2, 94, 95] and white-box watermarking [84, 11, 20]. The embedding phase is to embed identification information into the model without affecting its utility. That identification information is called a watermark, which is provided by the owner and secret to others including potential adversaries.

In black-box watermarking, it is unnecessary to manipulate the model weights. Instead, the watermark is embedded as a special input-output behavior (similar to trigger sets in backdooring [32, 2]). A small set of inputs is selected and assigned deliberately unrelated outputs. Such special pairs are mixed with normal training pairs to train the model accordingly. The ownership of a watermarked model is detected based on the assumption that a non-watermarked model can demonstrate the same behavior only with a very small probability. In white-box watermarking, access to the model weights is required. The watermark is then explicitly embedded as part of the weight distributions using an invertible transformation. During detection, the weights are transformed back and compared to the watermark. Black-box watermarking has the advantage of not requiring white-box access to the model weights, which is a more realistic scenario as attackers are not likely to publicize the stolen models.


Key challenges of GAN watermarking. Unfortunately, the studies of aforementioned protection are limited to either discriminative models [53, 48, 75, 79, 35] (where the models map from image domain to class probability domain) or reconstructive models [41] (where pixel-level pair-ups are necessary with strong input conditioning). To the best of our knowledge, the protection for the other prominent models, Generative Adversarial Networks (GANs) models [30, 70, 62, 44, 45, 8, 46], is surprisingly lacking.

GAN models aim to synthesize photorealistic images, mapping from noise vector domain to image domain, and have significantly pushed the edge of generation realism to a brand-new level [30, 8, 46]. How to apply watermarking protection to GANs remains an urgent open question. We identify three key challenges:

First, in discriminative models, there is control over the input, which can work as a trigger set for backdoor watermarking. In contrast, GANs require random sampling in the input domain, which has to follow a natural prior distribution, e.g., a standard normal distribution. There is no control or pairing between noise vectors and generated images, as this mapping is learned in an unsupervised adversarial manner. Any specification of the pairs of noise and generated image biases the input away from the natural prior distribution and impairs the entire generation fidelity, according to the practical disadvantage of VAE-GAN [51].


Second, in discriminative models, the outputs associated with the trigger set are assigned to a special target during supervised training. However, GANs are trained using only the unsupervised adversarial loss to approximate between two distributions which makes it difficult to assign a desirable special output for a trigger set. In addition, the image output domain of GANs is highly structured and full of semantics, making the output specification even more difficult to achieve.

Third, GAN training is known to be unstable due to its min-max formulation and alternation between gradient ascent/descent. Adding an auxiliary regularization term to the GAN objective could further amplify the training instability. Unlike watermarking for discriminative models [2, 96] or reconstructive models [41] where the original objective and the watermarking term are both reconstruction-based, any reconstructive auxiliary regularization cannot easily cooperate with the adversarial loss, according to the practical disadvantage of VAE-GAN [51].

These three challenges together imply that a GAN should be watermarked without modifying its adversarial training. We therefore propose a black-box watermarking scheme that sidesteps controlling the input-output behavior of GANs. In addition to its benefits in avoiding GAN training instability, a black-box solution better adheres to practical scenarios where we only have access to the stolen model APIs.

GAN watermarking and deep fakes. In addition to GAN model ownership verification, watermarking the output of GAN models is closely related to tracing the provenance of generated images (i.e., deep fakes) and attributing them to their respective GAN models [91, 86]. With the rapid ongoing progress [44, 45, 46], GANs learn to better match the target distribution which raises concerns about the future trajectories and misuse of powerful GANs [86, 88, 22, 40]. GAN watermarking can contribute to identifying watermarked generations from real images, and tracking responsibility in case of misuse. For example, DL model administration platforms can regulate owners to watermark their GAN models so as to facilitate the attribution of any potential misuse to the owner. For this to be satisfied, the watermark should be verifiable from all or arbitrary outputs of the GAN model instead of a response to a specific trigger set (as typically used in discriminative model watermarking). Therefore, our GAN watermarking solution abandons the trigger set approach and aims to watermark all generated images.

I-B Approach and contributions

To this end, we present the first black-box watermarking solution for GANs that we show in Figure 1. Unlike existing techniques that are only applicable to discriminative models, we propose to leverage steganography approaches [5, 81] to watermark the GAN training dataset, transfer the watermark from the dataset to the GAN model, and then verify the watermark from generated images. The model owner queries the APIs of a suspicious model and uses a pre-trained decoder to detect the watermark. If the detected watermark matches the owner’s watermark, it should form convincing legal evidence against the pirate. In addition, the watermark helps to identify generated images from real images and attribute them to their respective watermarked GANs.

Image steganography meets GAN watermarking. We differentiate between two important concepts: image steganography and model watermarking. Steganography aims to embed watermarks into images rather than networks, although sometimes the embedding and detection procedures are implemented by networks. Steganography is one way of protecting the IP of watermarked images, rather than models. We elaborate these two concepts in Sections II-A and II-C. We summarize the benefits of our approach of using image steganography (applied on the training dataset) to watermark GANs as follows:

  1. It is a black-box watermarking solution that does not require access to GAN weights. It satisfies the practical scenario where the generation service usually publicizes only the APIs but not the model weights to users.

  2. We leverage the secrecy and hidden encoding characteristics of steganography to guarantee the original generation performance.

  3. It minimizes possible incompatibility with GAN training by keeping GAN training as an independent component: it does not modify the original GAN training protocols, e.g., objective, network architecture, and optimization strategy. It acts as a plug-and-play watermarking pre-processing step that is disentangled from GAN training and agnostic to GAN details.

  4. As the watermark can be detected from all GAN generations (not a trigger set response), our solution offers a way for tracing the responsibility of a watermarked GAN. This provides legal clues for malicious use cases of fake images. Our watermarking scheme increases the margin between real and fake images. It facilitates image forensics to separate real from fake images via extrinsic information (the watermark), considering real images are deterministically non-watermarked.

Contributions. Based on the above formulation, we summarize our contributions as follows:

  1. To the best of our knowledge, this is the first study to formalize the problem of watermarking GAN models.

  2. We empirically validate the effectiveness of our GAN watermarking solution. GANs learn the watermark from the dataset and deliver it to the generated images. As a result, it bridges the transferability between dataset watermarking and GAN model watermarking.

  3. We further conduct comprehensive analysis to validate the fidelity, secrecy, robustness, and large capacity of our solution. In addition, we demonstrate the advantageous performance in two applications: generated image detection and image-to-GAN attribution.

II Related work

In this section, we summarize the related work in the field including discriminative model watermarking, steganography, steganalysis, data poisoning, and GAN fingerprints.

II-A Watermarking for discriminative models

We summarize previous watermarking works that focused on discriminative models. Based on the accessibility to model weights, they are typically categorized into white-box and black-box ones.

White-box watermarking. These methods require access to model weights or intermediate activations to perform ownership verification. Uchida et al. [84] proposed to embed watermarks into the weights of discriminative models via a training regularization term. Similarly, DeepMarks [11] and DeepSigns [20] embed watermarks in the probability density function of model weights by leveraging the low-probability regions while minimizing the side effect on classification accuracy and training overhead. White-box watermarking has limited use cases because the suspicious models are usually deployed remotely, and the plagiarized service would not publicize the parameters.

Black-box watermarking. In contrast, these methods do not require access to model weights and other details, e.g., architecture and training strategy. [2, 94, 52] embed watermarks in the input-output behavior of discriminative models; designing a specific trigger input set and associating it with specific class labels. Similarly, [56] focuses on making images in the trigger set visually indistinguishable from those in the regular set. Also, [80] embeds the watermark in predictions in order to defend against model extractions by APIs, by dynamically changing the responses for a small subset of queries. To address ambiguity attacks, [26] links the performance of a network with the presence of the correct watermark and substitutes the verification process with model fidelity evaluation. However, our paper is the first to treat GANs as victim models and target IP protection of them.

II-B Generative adversarial networks (GANs)

Photorealistic image generation can be viewed as the problem of sampling from the probability distribution of real-world images. Training the generative model aims to bring the generated distribution close to the real image distribution. Since the real image distribution is unknown and does not have a tractable expression, GANs introduce a workaround by formulating distribution approximation as a real/fake classification problem. GANs consist of two neural networks, a generator and a discriminator. The generator takes random noise as input and is trained to generate images as realistic as possible, while the discriminator receives images from both the generator and the real dataset and is trained to differentiate the two sources. During training, these two networks compete and evolve simultaneously. GANs have significantly pushed the edge of generation realism to a brand-new level [70, 33, 44, 8, 46]. Successes in the image domain have led to applications of GANs in many tasks: image synthesis [30, 70, 33, 44, 8, 46], semantic image synthesis [66], super resolution [54], image attribute editing [14], text to image synthesis [92, 93], image to image translation [41, 100, 101], inpainting [90], and semi-supervised learning [76]. The open-source tool DeOldify [4] is a remarkable example of using GANs in image enhancement for colorizing and restoring old images and film footage. It is therefore important to develop IP protection frameworks for GANs.
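For reference, the real/fake classification game between generator and discriminator described in this section is commonly written as the minimax objective below. This is the standard formulation from the original GAN work rather than anything specific to this paper, and the symbols are notation assumed here: $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the real image distribution, and $p_z$ the noise prior.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_z} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

The alternation between maximizing over $D$ and minimizing over $G$ is the min-max structure that the introduction identifies as a source of training instability.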

II-C Image steganography

Image steganography is a technique for hiding information in carrier images, originally for the purpose of covert communication [29]. It aims at stealthiness to avoid attracting the attention of adversaries [59]. It can also play a role in protecting the IP of carrier images if the hidden information is used as a watermark to verify the ownership of images [63, 85]. Although a steganography procedure is sometimes implemented by a network, it has never been used, nor is it directly able, to hide information in a network and protect its IP. Previous steganography techniques [19, 9] relied on the Fourier transform or on modifying the least significant bits [68, 38, 37]. Several works proposed to substitute manually crafted hiding procedures with neural network embedding. [5, 34, 85, 97] use an encoder network to embed information in the latent space and use a decoder to detect that information from an image. [99] leverages data augmentation to improve the robustness of the two networks against perturbations. [81] presents a pipeline to detect hyperlinks from printed pictures. [96] releases an open-source implementation of a steganography system for hiding text messages inside high-resolution images. To have a minimal effect on the generation quality and GAN training, we propose to leverage steganography approaches [5, 81] to watermark images in the GAN training dataset, and then detect the watermark from generated images.
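To make the classical least-significant-bit idea mentioned above concrete, here is a minimal sketch. It illustrates only the hand-crafted LSB technique, not the learned encoder-decoder used in this paper; the function names and the flat list of 8-bit pixel values are assumptions chosen for illustration.

```python
# Classical least-significant-bit (LSB) steganography on a flat list of
# 8-bit pixel values. Illustrative only: this is NOT the learned
# encoder/decoder of the paper, and all names here are hypothetical.

def lsb_embed(pixels, bits):
    """Return a copy of `pixels` whose first len(bits) values carry `bits`."""
    if len(bits) > len(pixels):
        raise ValueError("message is longer than the cover")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit of the pixel, then set it to the message bit.
        stego[i] = (stego[i] & ~1) | (bit & 1)
    return stego


def lsb_extract(pixels, n_bits):
    """Read back the lowest bit of the first `n_bits` pixel values."""
    return [p & 1 for p in pixels[:n_bits]]


cover = [200, 13, 77, 254, 0, 255, 128, 64]
message = [1, 0, 1, 1, 0, 0, 1, 0]

stego = lsb_embed(cover, message)
recovered = lsb_extract(stego, len(message))
# Each pixel changes by at most one intensity level.
max_change = max(abs(a - b) for a, b in zip(cover, stego))
```

Because each pixel moves by at most one intensity level, the change is visually negligible, but the hidden bits are also fragile: operations such as blurring or JPEG compression typically destroy them. That fragility is one motivation for the learned, perturbation-robust encoders cited above.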

II-D Image steganalysis

With the ever-growing advancements in steganography techniques, research on countering steganalysis techniques has grown at a comparable pace. Binary steganalysis typically solves the binary classification problem of whether or not there exists hidden information in an image. Spatial Rich Model [28] extracts spatial correlation information of high-frequency residuals of images and uses these features to train a binary classifier. Another approach is the first to use neural networks for both feature extraction and binary classification.

Though binary steganalysis can help stall illicit and unwanted communications, quantitative steganalysis goes beyond this and aims to reconstruct the secret message [27]. Machine learning paradigms such as logistic regression [102], ensemble framework [17], extreme learning machines [15], and neural networks [12] have been used for hidden message reconstruction. See [16, 64, 43] for a comprehensive review of steganalysis methods.

Image steganalysis techniques can also be used by adversaries to examine images and detect watermarks. We consider this as a possible attack in our threat model in Section III-A and employ [55] to test the secrecy of our GAN watermarking solution in Section IV-D.

II-E Data poisoning

Several works study the use of training data to influence the behavior of discriminative models during test time. Data poisoning attacks [7, 77, 74] target to maliciously add a few data points to the training set, such that the trained model has to predict an unrelated label chosen by an adversary on particular samples. Backdoor attacks [13, 32] add into a class unrelated training samples with a trigger pattern; at test time, any sample with the same trigger will be classified in this class regardless of its semantics ground truth. In a similar spirit, [72] introduces imperceptible but “radioactive” perturbations into training data, such that any model trained on it will bear an identifiable mark. This is an effective example to show the transferability of identification information from training data to discriminative models. Inspired by its flavor but beyond that, we validate in this paper the transferability of image steganographic effect from training data to GANs, in order to protect the IP of generative models.

II-F GAN fingerprints and deep fake detection

Images generated by GANs intrinsically bear unique fingerprints. [60] shows that GANs leave unique noise residuals to generated samples, which facilitate real/fake detection. [91] moves one step further, using a neural network classifier to attribute different images to their sources including a real-world source and various GAN sources. They show that it is even possible to finely differentiate samples from GANs which only differ in random seeds used for training. [86] also trains a classifier and improves testing generalization ability across different generator domains.  [98, 25, 24] show that GAN fingerprints are embedded in the high-frequency region of a spectrum, and [57] shows that fingerprints are recognizable from texture features.

Different from all these studies where attribution accuracy has to be contingent on the strengths and distinguishability of intrinsic fingerprints, we propose a watermarking solution that artificially embeds distinctive watermarks into GANs in an owner-controllable manner. It has an advantage that the margin between real and watermarked fake is amplified (Section IV-G), and ownership attribution is more accurately validated (Section IV-H).

III Black-box GAN watermarking

We present the first black-box GAN watermarking approach that extends model watermarking beyond discriminative models. We first present our threat model and summarize five qualifications of a desirable watermarking solution in Section III-A. Then we establish the watermarking pipeline in Section III-B.

III-A Threat model

In our analysis, we consider two threats: piracy and GAN misuse. In piracy, we consider a model owner and an adversary. The owner invests resources into training a GAN model and uses it to offer a generation service to customers. The adversary pirates the well-trained model from the owner to obtain a copy, and offers a similar service. Model piracy can be done via a malware attack or with the help of an insider. To detect piracy in the white-box scenario, verifying that the adversary’s model matches the owner’s model should be sufficient. But white-box access is not a practical assumption, because the adversary usually does not publicize the model weights, and it is illegal for the owner to pirate the adversary’s model weights. Therefore, in this paper, we consider the more practical black-box scenario (for verification) where only the adversary’s service (i.e., the output of the pirated model) is accessible to the owner. But verifying that the two services are similar is not sufficient evidence of piracy, because all generators ideally approximate the same real dataset distribution.

In the second threat, the owner himself stands for the illegal side, and uses generated images from his GAN for malicious purposes, e.g., a possible flood of fake multimedia or as a part of fake articles.

In order to counter the two threats, as shown in Figure 1, we seek a watermarking solution that embeds a secret watermark into the owner’s model, yielding a watermarked model that produces a watermarked generation service. In the first threat, this helps the owner to prove ownership. In the second threat, this offers a useful practice for DL model administration platforms to attribute any potential misuse to the owner. Furthermore, it contributes to visual forensics efforts by introducing an imperceptible, yet verifiable, watermark in the generated images that helps distinguish them from real images. However, our goal is not only to watermark the GAN model, but also to reach an acceptable trade-off between different factors; for example, the watermarking should have a minimum effect on the generation quality, should be secret to adversaries, should be robust to watermark-removal attacks, and should have a sufficient capacity to maintain a large number of watermarks.

In piracy, we assume the adversary has white-box access to the owner’s GAN model, and no access to the owner’s watermarking model, watermark code, or watermarked training data; this is a practical assumption as the owner only needs to deploy the trained GAN for service, and can keep the other information offline at his side.

In piracy, the adversary’s goal is to use the well-trained GAN model without having to train it from scratch. Therefore, the objective of the attack is to remove the watermark from generated images so as to fool watermark detection, while largely preserving the utility of the generation service. Regardless of white-box access to the model, the adversary can always perturb the generated images in order to remove the watermark. Therefore, we study the watermark robustness to different image perturbation attacks (e.g., noise, blurring, JPEG compression, cropping, and re-watermarking) with different ranges. For re-watermarking, the adversary attempts to override the owner’s watermark by embedding a new watermark into the generated images. We show the trade-off between image degradation and watermark detection accuracy.

In addition, in the case of the white-box access, the adversary can attempt to remove the watermark by perturbing model weights. We study several model perturbation attacks. First, we fine-tune the watermarked model on non-watermarked images; we study two scenarios contingent to the adversary’s access to datasets: fine-tuning on the same training dataset as used in the victim GAN, and fine-tuning on a disjoint dataset. Moreover, we examine two additional attacks that either quantize the watermarked model weights or add Gaussian noise to them. In general, we assume that the adversary does not have the same training expertise, computational resources, and amount of training data as the owner’s, otherwise, he can train his own model.
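The two weight-level attacks mentioned above (quantization and additive Gaussian noise) can be sketched on a flat list of weights as follows. This is a toy illustration under stated assumptions: the paper's exact quantization scheme and noise scales are not given in this section, so rounding to a fixed number of decimals and the `std` parameter are hypothetical choices, as are the function names.

```python
import random

# Toy versions of two model-perturbation attacks on a flat list of weights.
# Assumptions (not from the paper): quantization is modeled as rounding to a
# fixed number of decimal places, and noise is i.i.d. zero-mean Gaussian.

def quantize_weights(weights, decimals):
    """Round every weight to `decimals` decimal places."""
    return [round(w, decimals) for w in weights]


def add_gaussian_noise(weights, std, seed=0):
    """Add zero-mean Gaussian noise with standard deviation `std`."""
    rng = random.Random(seed)  # local generator keeps the sketch reproducible
    return [w + rng.gauss(0.0, std) for w in weights]


weights = [0.12345, -1.98765, 0.5]
quantized = quantize_weights(weights, decimals=2)
noisy = add_gaussian_noise(weights, std=0.01)
```

In a robustness study of this kind, the attacked weights would be loaded back into the generator, and both the generation quality and the watermark detection accuracy on its outputs would be measured as the precision decreases or the noise level increases.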

Given our goals and assumptions, we formalize five qualifications for a desirable GAN watermarking system as follows:

  1. Effectiveness. At the owner’s end, an arbitrary watermark in a given form can be embedded into a GAN model and should be accurately detected from the images generated by the watermarked model. The detection should not depend on access to the weights of the watermarked model, so as to abide by the black-box scenario in reality.

  2. Fidelity. To preserve the utility of the service, a watermarked generator should provide a service and generation quality comparable to those of the original generator. In addition, this keeps the adversary from suspecting the existence of a watermark and, therefore, avoids counter-efforts to decrypt or remove the owner’s watermark from the stolen service.

  3. Secrecy. To avoid attacks that aim to remove the watermark and hinder its verification, the presence of the watermark should not be easily detected by the adversary. This requires the watermark to be secret enough under the adversary’s steganalysis techniques.

  4. Robustness. Regardless of the secrecy of the watermark, the adversary may always apply perturbation attacks to the generated images or to the watermarked generator (subject to the adversary’s white-box access to the victim GAN). Therefore, at the owner’s end, the watermark should always be robustly detected from the generated images given a possible range of attacks.

  5. Capacity. The watermarking solution should have a large enough capacity to maintain a large number of distinctive watermarks for different GAN models. On one hand, a large watermark representation avoids two watermarks possibly overlapping too much which may challenge watermark detection. On the other hand, however, a large watermark representation may overshadow the original generation. Therefore, it is necessary to investigate the relation between capacity and fidelity and justify a reasonable trade-off.

III-B Watermarking pipeline

We design our watermarking solution such that it satisfies the previously mentioned qualifications. Figure 2 shows the pipeline, which consists of four stages; we describe it in detail below.

III-B1 Watermark encoder-decoder training

To embed the watermark into the GAN training dataset, we train an image steganography system on the owner’s side. The use of steganography techniques meets the fidelity and secrecy qualifications of a desired watermarking system because it minimizes the effect on the generation quality and keeps the watermark hidden. The system is similar in spirit to [5, 81] and consists of a jointly trained encoder-decoder architecture; the encoder is trained to imperceptibly embed an arbitrary watermark into arbitrary images, while the decoder is trained to detect that watermark. The watermark is represented as a sequence of binary bits. The sequence length should be large enough to avoid collisions in the large space of all possible watermarks, meeting the capacity qualification.

The encoder takes a cover image and a randomly sampled binary watermark as input, and maps them to a stego image. The stego image has the same size as the cover image and should be perceptually similar to it. The decoder takes the stego image as input and aims to reconstruct the watermark, so that the decoded watermark desirably matches the embedded one. We achieve this by jointly training the encoder and decoder with respect to an objective that combines two terms: a binary cross-entropy term that compares, bit by bit, the embedded watermark with the decoded watermark, and a mean squared error term between the cover image and the stego image, with a hyper-parameter balancing the two terms. The binary cross-entropy term guides the decoder to decode whatever binary watermark sequence is embedded by the encoder. The mean squared error term penalizes any deviation of the stego image from the original cover image.
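An objective consistent with this description can be sketched as follows. The notation is assumed here for illustration and may differ from the authors' own: $E$ and $D$ denote the encoder and decoder, $x$ a cover image, $w \in \{0,1\}^n$ the watermark, $\tilde{x} = E(x, w)$ the stego image, $\hat{w} = D(\tilde{x})$ the decoded watermark, and $\lambda$ the balancing hyper-parameter. Which of the two terms $\lambda$ multiplies, and the $1/n$ normalization, are also assumptions.

```latex
\min_{E,\,D} \; \mathbb{E}_{x,\,w} \left[
  -\frac{1}{n} \sum_{k=1}^{n}
    \Big( w_k \log \hat{w}_k + (1 - w_k) \log \big( 1 - \hat{w}_k \big) \Big)
  \; + \; \lambda \, \big\lVert \tilde{x} - x \big\rVert_2^2
\right],
\qquad \tilde{x} = E(x, w), \quad \hat{w} = D(\tilde{x})
```

The first term is the binary cross-entropy between the $k$-th bits $w_k$ and $\hat{w}_k$; the second is the squared-error penalty that keeps the stego image close to the cover image.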

III-B2 Watermark embedding to GAN training dataset

The second stage is the inference mode of the first stage after the training of encoder and decoder converges. The owner defines a watermark and uses the encoder to embed it into each real image of the GAN training dataset. By this stage, the watermark appears in the carrier of images.

Fig. 2: The four stages of our GAN watermarking pipeline. The first and third stages are in the training mode, while the second and fourth stages are in the inference mode.

III-B3 GAN training with the watermarked dataset

Our watermarking solution is motivated by not introducing instability to the GAN training, for the purpose of preserving the utility. Therefore, it is independent of GAN training, agnostic to all its details, and acts as a plug-and-play pre-processing. In the third stage, the owner trains a GAN model in the original manner using the watermarked dataset. We assume by this stage that the watermark can be effectively transferred from training images to the generations of the GAN model. If this holds, the owner can trustingly publicize the model which has been imperceptibly watermarked. We call it a watermarked GAN. We empirically validate it meets the effectiveness qualification in Section IV-B.

III-B4 Watermark verification

The fourth stage is the inference mode of the third stage. The owner performs the watermark verification to either prove ownership of a suspicious GAN model or track the responsibility of distributed models in case of misuse. To be practical, we model the black-box scenario where the owner does not require access to the GAN weights. The owner only needs a generated image (the misused image or one obtained through API queries), and detects its watermark using the well-trained decoder from the first stage. Then the owner compares the detected watermark to the original watermark and observes the number of matching bits. Instead of exact matching, we verify the watermark by hypothesis testing given the number of matching bits (where the null hypothesis is getting these matching bits by chance). As we validate in Section IV-E, this is sufficient for the detection of the watermark even after applying perturbation attacks. As a result, it meets the robustness qualification.

In specific, we consider two hypotheses:

Under the null hypothesis

, the probability of the number of matching bits, denoted as the random variable

, follows a binomial distribution with

trials and probability of success. Given the observation of matching bits, the owner computes the p-value, the probability of the extreme cases where there are or even more matching bits under .


The watermark verification holds if and only if the p-value is smaller than a small threshold , which corresponds to the situation where is larger than a large threshold . In another word, if the owner observes a large number of matching bits, it is very unlikely to accept the null hypothesis , meaning to instead accept that and match. In Section IV-B we do not explicitly set . Rather, we derive how to set up a reasonable threshold and validate, by this stage, the transferability of the presence of watermark from training dataset to GAN model, which can then be verified by the generated images.
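The binomial test above can be evaluated exactly with integer binomial coefficients. The following is a minimal sketch, not the authors' implementation: the function names `p_value` and `min_matching_bits`, and the watermark length of 100 used in the usage lines, are assumptions made for illustration.

```python
from math import comb


def p_value(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of observing k or more
    matching bits when every bit matches independently with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_matching_bits(n: int, alpha: float) -> int:
    """Smallest k whose p-value is below the significance level alpha.
    Returns n + 1 if no k reaches that level."""
    for k in range(n + 1):
        if p_value(k, n) < alpha:
            return k
    return n + 1


# Hypothetical watermark length, for illustration only.
n_bits = 100
for k in (50, 60, 75, 90):
    print(k, p_value(k, n_bits))
```

Because the p-value decreases monotonically in the number of matching bits, choosing a significance level and choosing a threshold on the number of matching bits are two views of the same decision rule.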

Fig. 3: Encoder architecture.
Fig. 4: Decoder architecture.

IV Experiments

We start with implementation details in Section IV-A. In Sections IV-B to IV-F, we design comprehensive experiments to evaluate our GAN watermarking solution w.r.t. the five desirable qualifications formalized in Section III-A: effectiveness, fidelity, secrecy, robustness, and capacity. For each qualification we discuss, where applicable, the evaluation metrics, the baseline methods compared, and analysis studies, in order to validate the advantages and working ranges of our solution. In Sections IV-G and IV-H, we further show that watermarking benefits two tasks related to digital forensics: detection and attribution of generated images.

IV-A Implementation details

Encoder. The encoder is trained to embed a watermark into a cover image while minimizing the perceptual differences between the input and stego images. We follow the technical details proposed in [81]. The watermark binary vector is first passed through a fully-connected layer and then reshaped as a tensor with one channel dimension and the same spatial dimensions as the cover image. We then concatenate this watermark tensor and the image along the channel dimension as the input to a U-Net [71] style architecture. The output of the encoder, the stego image, has the same size as the cover image. Note that passing the watermark through a fully-connected layer allows every bit of the binary sequence to be encoded over the entire spatial extent of the cover image and makes the encoder flexible to the cover image size. In our experiments, the image size is set to without losing representativeness. The watermark length is set to 100 as suggested in [81]. This length of 100 binary bits leads to a large enough space for watermark design while not deteriorating the effectiveness and fidelity performance too much, as investigated in Section IV-F. We visualize the encoder architecture in Figure 3.

Decoder. The decoder is trained to recover the hidden watermark from the stego image. We follow the technical details proposed in [81]. It consists of a series of convolutional layers with kernel size x and strides , dense layers, and a sigmoid output activation to produce a final output with the same length as the watermark binary vector. We visualize the decoder architecture in Figure 4.

Encoder and decoder training. The encoder and decoder are jointly trained end-to-end w.r.t. the objective in Eq. 1. We randomly sample watermark binary vectors and embed them to the images from either MNIST digit dataset [53] ( training images and testing images), CelebA face dataset [58] ( train images and testing images) or LSUN bedroom dataset [89] ( training images and testing images). The encoder is trained to balance watermark reconstruction and image reconstruction. At the beginning of training, we set to focus on watermark reconstruction, otherwise watermarks cannot be accurately embedded into cover images. After the watermark detection accuracy achieves (that takes -epochs), we increase linearly up to within iterations to shift our focus more on cover image reconstruction. We train the encoder and decoder for epochs in total. Given the batch size of , it takes hours using NVIDIA Tesla V100 GPU with 16GB memory.
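The two-phase schedule described above (watermark reconstruction first, then a linear ramp toward image reconstruction) can be sketched as a simple function of the training step. This is an illustrative sketch only: the function name, the starting weight of zero, and all numeric values in the usage lines are assumptions, and `ramp_start` stands for the step at which the detection accuracy first reached its target.

```python
def image_loss_weight(step: int, ramp_start: int, ramp_steps: int,
                      max_weight: float) -> float:
    """Weight on the image-reconstruction term of the joint objective.

    Before ramp_start the weight is 0 (assumed), so training focuses on
    watermark reconstruction. Afterwards it grows linearly and saturates
    at max_weight once ramp_steps iterations have passed.
    """
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    if step < ramp_start:
        return 0.0
    progress = min(1.0, (step - ramp_start) / ramp_steps)
    return max_weight * progress


# Hypothetical values, for illustration only.
for step in (0, 1000, 1500, 2000, 5000):
    print(step, image_loss_weight(step, ramp_start=1000, ramp_steps=1000,
                                  max_weight=2.0))
```

In an actual training loop the ramp would be triggered by monitoring the watermark detection accuracy rather than by a fixed step count.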

Victim GAN models. One of the advantages of our watermarking solution is to disentangle watermarking from GAN training and put GAN details in an independent component. Therefore, without losing representativeness, we selected two milestone works of GAN techniques, DCGAN [70] and ProGAN [44], as the victim GAN models for watermark embedding and detection. Model training is implemented in PyTorch in a way that is agnostic to watermarking and equivalent to official implementations on GitHub [21, 69]. In order to obtain corresponding watermarked GAN, model training is run with our watermarked MNIST digit dataset, watermarked CelebA face dataset, or watermarked LSUN bedroom dataset.

IV-B Qualification of effectiveness

At the owner’s end, the key qualification is the effectiveness of watermark verification, i.e., the capability of embedding arbitrary watermarks into a GAN model and detecting them in a black-box scenario (only from generated images). If we validate this, the assumptions in Section III-B become empirical findings: the presence of the watermark transfers from training images to the GAN model and then to generated images.

Evaluations. In our experiments, effectiveness is evaluated successively by two types of performance: precision and recall. The definitions in this paper differ from those in traditional classification tasks. Here precision measures the classification accuracy of whether or not a testing image is generated by the owner’s watermarked GAN, while recall measures the watermark detection accuracy given a testing image from the owner’s watermarked GAN. Both precision and recall are calculated from the bitwise accuracy between the owner’s ground-truth watermark and the detected watermark. We elaborate on each below.

Fig. 5: Histograms of watermark bitwise accuracy given positive (red) or negative samples (blue). They are evaluated on ProGAN on CelebA. The positive samples are generated by the owner’s watermarked GAN model. The negative samples are either from the real world, or generated by a non-watermarked GAN model or by another watermarked GAN model different from the owner’s. We can perceive a significant margin to distinguish the two distributions, which validates the precision of our solution.

Precision. The classification is conducted by thresholding the bitwise accuracy. Given ProGAN on CelebA, we looped over testing images. Half of the testing images are positive samples, i.e., generated by the owner’s watermarked GAN model, and the other half are negative samples, i.e., either from the real world (accounting for 1/6 of the testing images), generated by a non-watermarked GAN model (accounting for 1/6 of the testing images), or generated by another watermarked GAN model different from the owner’s (accounting for 1/6 of the testing images).

In principle, the bitwise accuracy for the negative samples should resemble that of random guessing, so its histogram should follow the binomial distribution with parameters $n$ (the length of the watermark) and $0.5$ (unbiased binary random guess). In contrast, the bitwise accuracy for the positive samples should be much higher, leading to a histogram clearly distinguishable from that of the negative samples.

We plot and compare the two histograms in Figure 5. They are visually distinguishable with a significant margin. In terms of quantitative evaluation, we set the classification threshold as , which is around the middle point of the margin. It means the owner’s watermarked GAN model is verified if the number of matching bits is more than of the watermark length. It also derives to set for Section III-B4, which corresponds to a reasonable significance level according to Eq. 5. Compared to the ground truth, it results in precision. Consequently, the precision of our watermarking solution is validated and we can move on to more comprehensive comparisons w.r.t. recall given all the testing images generated by the owner’s watermarked GAN models.

Recall. We report in the third and fourth columns of Table I two metrics to evaluate recall, both the higher the better. In the watermark bit level, we calculate the mean bitwise accuracy over testing images generated by the owner’s watermarked GAN model. In the instance level, we count the instance detection accuracy, i.e. the percentage of those images where the owner’s watermarks are detected. We keep setting according to the precision calculation. In the second column of Table I we calculate as a reference the mean bitwise accuracy where the images are the watermarked GAN training images. In the fifth column, we report as a reference the p-value (Eq. 5) to accept .
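The two recall metrics can be computed directly from decoded bit vectors. The sketch below is illustrative: the function names and the toy threshold are assumptions, and a detection counts as successful when the bitwise accuracy strictly exceeds the threshold, following the verification rule stated for precision.

```python
from typing import List, Sequence, Tuple


def bitwise_accuracy(detected: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of positions where the detected bits equal the reference bits."""
    if len(detected) != len(reference) or not reference:
        raise ValueError("bit vectors must be non-empty and of equal length")
    matches = sum(d == r for d, r in zip(detected, reference))
    return matches / len(reference)


def recall_metrics(detections: List[Sequence[int]], reference: Sequence[int],
                   threshold: float) -> Tuple[float, float]:
    """Return (mean bitwise accuracy, instance detection accuracy)."""
    accuracies = [bitwise_accuracy(d, reference) for d in detections]
    mean_accuracy = sum(accuracies) / len(accuracies)
    instance_accuracy = sum(a > threshold for a in accuracies) / len(accuracies)
    return mean_accuracy, instance_accuracy


# Toy example with a 4-bit watermark and a hypothetical threshold.
reference = [1, 0, 1, 1]
detections = [[1, 0, 1, 1], [1, 0, 0, 0], [0, 1, 0, 0]]
print(recall_metrics(detections, reference, threshold=0.75))
```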

Baseline. Since there is no existing work on GAN watermarking, we design a straightforward baseline for comparison. Instead of watermarking GAN training data, we enforce watermark reconstruction jointly with GAN training. In other words, we require each generated image not only to look realistic, approximating the real training data, but also to contain the owner’s watermark. Mathematically,


where and are the original generator and discriminator in GAN framework, is the original GAN objective, and is adapted from Eq. 2 where we replace with . is a hyper-parameter to balance the two objective terms. By tuning we obtained two extremes of results: either or dominates training. When dominates, the generated images have high visual quality but watermark detection is close to random guess. When dominates, watermarks are accurately detected but the generation quality deteriorates heavily. We finally report in Table I with where we tended to guarantee the generation quality.

Model & Dataset    Train bit acc    Gen bit acc    Instance acc    p-value to accept the null hypothesis
ProGAN CelebA
TABLE I: Two recall evaluations: watermark bitwise accuracy of testing images generated by the owner’s watermarked GAN (third column) and instance detection accuracy of those images (fourth column). As references, the second column shows the bitwise accuracy of images that are watermarked GAN training images; the fifth column shows the p-value to accept the null hypothesis in Section III-B4. The “bsl” row corresponds to the baseline method subject to Eq. 6 where we merge watermark reconstruction into GAN training.

Results. From Table I we summarize:

(1) The watermarking baseline (the “bsl” row) method completely fails with bitwise accuracy of , close to binary random guess, and instance detection accuracy , equal to blind rejection. The failure indicates GAN watermarking is a challenging task. Directly combining a watermarking objective with GAN training is easily incompatible. In contrast, our solution of leveraging image steganography and transferring watermarking from GAN training datasets to GAN models sidesteps the possible incompatibility and leads to advantageous performance. See below.

(2) According to the instance accuracy, our watermarking solution results in perfect detection ( accuracy) with negligible p-values to accept the null hypothesis . It validates our effective watermark recall with a significant margin, agnostic to GAN models and datasets.

(3) According to the bitwise accuracy, our watermarking solution results in almost saturated accuracy ( accuracy),i.e., no obvious deterioration from that of the watermarked training data. It indicates the learning of our watermark generation is effective, with information drops by .

(4) Comparing the watermark bitwise accuracies, we notice that a GAN model fosters more accurate watermark detection () if the original generation task is easier (MNIST and CelebA versus LSUN). We interpret this as the task of learning the watermark becoming entangled with the task of learning the distribution of real images: the more effectively a GAN model learns the dataset distribution, the more effectively watermarks are transferred from the training dataset to the model.

(5) Combining the success in precision and recall measurements, we validate the effectiveness of our watermark solution, and validate the transferability of the presence of watermark from training images to GAN model and then to generated images.

Dataset Source FID (non-watermarked) FID (watermarked)
CelebA Data 0.60 1.15
CelebA ProGAN 14.09 14.38
LSUN Data 0.61 1.02
LSUN ProGAN 29.16 32.58
TABLE II: FID comparisons between samples with or without watermark.
Fig. 6: Qualitative comparisons for Table II. Top row (panels a to e): CelebA; bottom row (panels f to j): LSUN. In each row, from left to right: original GAN training samples; watermarked GAN training samples; the magnified difference between the two; samples from the non-watermarked GAN; samples from the watermarked GAN.

IV-C Qualification of fidelity

The fidelity of a watermarked GAN is as critical as the watermarking effectiveness. It requires a watermarked GAN to preserve its original generation quality. On one hand, this preserves the utility of the owner’s service. On the other hand, it avoids raising the adversary’s suspicion that a watermark exists. In principle, the steganography technique we use should enable this, and we validate it below.

Evaluation. In GAN studies, generation quality is commonly evaluated by the Fréchet Inception Distance (FID) [36], which most GAN models aim to minimize. It first embeds a real reference dataset and a generated set through a fixed ImageNet-pretrained Inception network [79], and then measures the Fréchet distance between Gaussians fitted to the two embedding sets; the smaller the distance, the higher the generation quality. We measure FID between generated images and another real non-watermarked images, in order to validate the fidelity of the former. To calculate FID in different settings, the latter is kept unchanged.
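By definition, FID is the Fréchet distance between two Gaussians fitted to the Inception embeddings of the two image sets. As a minimal illustration, the sketch below computes the one-dimensional special case, where the distance reduces to $(\mu_1 - \mu_2)^2 + (\sigma_1 - \sigma_2)^2$. The function name and the use of the population standard deviation are assumptions; a real FID computation works on high-dimensional embeddings and requires a matrix square root of the covariance product.

```python
from statistics import fmean, pstdev
from typing import Sequence


def frechet_distance_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Squared Frechet distance between univariate Gaussians fitted to xs and ys.

    In one dimension the covariance term
    Tr(S1 + S2 - 2 (S1 S2)^(1/2)) simplifies to (sigma1 - sigma2)^2.
    """
    mean_gap = fmean(xs) - fmean(ys)
    std_gap = pstdev(xs) - pstdev(ys)
    return mean_gap**2 + std_gap**2


# Identical samples give distance 0; shifting or rescaling increases it.
print(frechet_distance_1d([0.0, 2.0], [0.0, 2.0]))
print(frechet_distance_1d([0.0, 2.0], [3.0, 3.0]))
```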

Results. The second and fourth rows of Table II show the FID comparisons of ProGAN on CelebA and LSUN. As a reference, the first and third rows report the FID comparisons of GAN training data, between samples with and without watermarks. We find:

(1) Watermarking on the training set does not substantially deteriorate image quality: FID is in an excellent realism range. That validates the secrecy of the steganographic technique [81] and lays a valid foundation for high-quality GAN training.

(2) In the second and fourth rows, our watermarked models closely approximate the performance of their non-watermarked counterparts, with FID increasing by 0.29 for CelebA (14.09 to 14.38) and by 3.42 for LSUN Bedrooms (29.16 to 32.58). In practice, the generated watermarks are imperceptibly hidden by the original GAN artifacts. See Figure 6 for demonstrations.

IV-D Qualification of secrecy

The presence of a watermark embedded in a GAN model should not be easily detected by the adversary, otherwise, it would be potentially removed by the adversary and fool the watermark verification. This qualification is more demanding than fidelity, because high fidelity avoids the adversary’s visual detection while secrecy requires technical counter-detection against steganalysis.

Attacks and evaluation. In order to design a quantitative evaluation on secrecy, we consider from the adversary side a binary classification problem: the presence of watermark in an image. We design the attack to be aggressive by launching it to the earliest stage of our watermarking pipeline. Namely, we assume the adversary is aware of our usage of the steganographic technique in Section III-B1 and proposes to classify if an image sampled from the real world is watermarked or not. This assumption favors the adversary side. If the adversary fails to detect the watermark in GAN training data, it is even more difficult to detect it from the later stages, because GAN training will potentially make the presence of watermark more ambiguous to some extent, according to the drop of watermark bitwise accuracy in Table I.

On the adversary side, we follow the attack protocol in [99] to perform the Artificial Training Sets (ATS) experiment [55]. The goal is to separate testing images watermarked 0 or 1 time, but we have no supervision. The intuition is to expand the testing set and establish an artificial setting with known labels that enables supervised training, such that the original testing class space is a subspace of the artificial training class space and is separable by the training task. The attack is as follows: we suppose we have another encoder on the adversary side corresponding to the victim steganographic technique. We regard the original testing images as negative training samples. Then, we apply the encoder twice to the testing set to obtain extra images watermarked 2 times (corresponding to originally non-watermarked images) or 3 times (corresponding to originally watermarked images), which are regarded as positive training samples. Then we train an SVM classifier [10] using these positive and negative samples, in order to separate images watermarked 0-1 time (original set) from those watermarked 2-3 times (artificial training set). During testing, we first apply the encoder once to the testing images so that the originally non-watermarked images are now watermarked 1 time (belonging to the 0-1 class), and the originally watermarked images are now watermarked 2 times (belonging to the 2-3 class). Then we can use our SVM to separate them and propagate the predictions back to the original images.
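The data construction in the ATS protocol can be sketched independently of the classifier. In the sketch below, `build_ats_sets` and the toy encoder are hypothetical names introduced for illustration, an "image" is represented only by its watermark count, and the SVM training step is omitted.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_ats_sets(test_images: Sequence[T],
                   encode: Callable[[T], T]) -> Tuple[List[T], List[int], List[T]]:
    """Build the artificial training set and the probe set of the ATS attack.

    Negatives are the unlabeled test images themselves (watermarked 0 or 1 time).
    Positives are the same images encoded twice (watermarked 2 or 3 times).
    Probes are the test images encoded once (watermarked 1 or 2 times); a probe
    classified as positive implies its source image was already watermarked.
    """
    negatives = list(test_images)
    positives = [encode(encode(image)) for image in test_images]
    train_x = negatives + positives
    train_y = [0] * len(negatives) + [1] * len(positives)
    probes = [encode(image) for image in test_images]
    return train_x, train_y, probes


# Toy stand-in: an "image" is its watermark count and encoding adds one.
toy_images = [0, 1, 0, 1]
train_x, train_y, probes = build_ats_sets(toy_images, lambda count: count + 1)
print(train_x, train_y, probes)
```

With real images the watermark count is hidden, and the attack succeeds only if a classifier trained on `train_x` and `train_y` can learn features that separate the two artificial classes.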

We consider a strong adversary and a weak adversary, which are contingent on their knowledge of the victim steganographic technique. The strong adversary somehow has white-box access to the owner’s encoder (is aware of the encoder weights) and is aware of the watermark that was used to generate watermarked images. Therefore, the adversary can use the owner’s encoder directly to launch the attack. The weak adversary (the more likely scenario) is only aware of the architecture and training details of the owner’s encoder, and therefore, has to train a shadow encoder from scratch. In our experiments, the shadow encoder training is the same as in Section III-B1, just different in initialization.

We evaluate the adversary’s performance on a set of watermarked real images and non-watermarked real images.

Fig. 7: Watermark bitwise accuracy w.r.t. the amount of perturbations. Top row - CelebA, bottom row - LSUN. Red dots represent detection on the generated samples from watermarked GAN. Blue dots represent detection on the training samples of watermarked GAN, which serve as the upper bound references for the red dots.
Attack Watermark detection accuracy
White-box encoder (strong) 0.946
Unknown encoder (weak) 0.502
TABLE III: Watermark detection accuracy of ATS attacks.

Results. We report the strong and weak adversaries’ classification accuracy on CelebA in Table III. We find:

(1) The strong adversary (the first row) detects the watermark with almost perfect accuracy (0.946), owing to its access to the targeted encoder. This reminds the owner not to publicize the well-trained encoder after using our watermarking solution. This is a feasible suggestion because publicizing the GAN model alone is sufficient to provide the generation service.

(2) The weak adversary (the second row), in contrast, fails to detect the watermark, performing no better than random guess (0.502 accuracy), because it lacks access to the targeted encoder weights. This indicates that encoders trained from different initializations use different patterns to hide the watermark. In conclusion, if the owner keeps the weights of the original encoder private, the presence of the watermark in the publicized GAN model remains secret under the ATS attack.

IV-E Qualification of robustness

The adversary may always apply perturbation attacks to the generated images or to the victim generator regardless of the evidence of the watermark. Therefore, at the owner’s end, the watermark should always be robustly detected from a possible range of attacks.

Image perturbation attacks. We evaluate the robustness of our watermarking solution against five types of image perturbation attacks: additive Gaussian noise, blurring with Gaussian kernel, JPEG compression, center cropping and re-watermarking. We vary the magnitude of each perturbation, apply each of them to the generated images from a watermarked GAN model, and detect the watermark using the well-trained decoder. For re-watermarking, the adversary overrides the owner’s watermark with his using the owner’s steganography technique, and blends the original watermarked image and his re-watermarked image pixel-by-pixel with varying weights.
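Two of the image-level perturbations are simple enough to sketch on a flat list of pixel intensities in [0, 1]. This is an illustrative sketch only: the function names, the clamping to [0, 1], and the convention that `weight` is the share given to the re-watermarked image are assumptions.

```python
import random
from typing import List, Optional, Sequence


def add_gaussian_noise(pixels: Sequence[float], std: float,
                       seed: Optional[int] = None) -> List[float]:
    """Add zero-mean Gaussian noise to each pixel and clamp to [0, 1]."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p + rng.gauss(0.0, std))) for p in pixels]


def blend(original: Sequence[float], rewatermarked: Sequence[float],
          weight: float) -> List[float]:
    """Pixel-by-pixel convex combination of two images of equal size."""
    if len(original) != len(rewatermarked):
        raise ValueError("images must have the same number of pixels")
    return [(1.0 - weight) * o + weight * r
            for o, r in zip(original, rewatermarked)]


print(blend([0.0, 1.0], [1.0, 0.0], weight=0.25))
print(add_gaussian_noise([0.2, 0.8], std=0.1, seed=0))
```

Sweeping `std` or `weight` over a range and decoding the watermark after each perturbation yields an accuracy curve of the kind plotted against perturbation magnitude.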

Original Gaussian noise Blurring JPEG compression Center cropping
Noise std Kernel size Compression Crop size
99 bits 77 bits 75 bits 75 bits 80 bits
Noise std Kernel size Compression Crop size
94 bits 80 bits 83 bits 84 bits 80 bits
TABLE IV: Perturbed image samples from the watermarked ProGAN and the corresponding numbers of detected bits (out of 100). The detection still performs robustly (bitwise accuracy ) even when the image quality (utility) heavily deteriorates w.r.t. each perturbation.
Fig. 8: Watermark bitwise accuracy (in red dots) and FID (in blue dots) w.r.t. the amount of perturbations on CelebA. For the strong attack on the left, FID is measured between the generated set and the non-watermarked CelebA Male subset. For the other attacks, FID is measured between the generated set and the entire non-watermarked CelebA.
Fig. 9: Image samples from the watermarked ProGAN w.r.t. the iteration of weak finetuning within epoch. The ProGAN was trained by the owner with watermarked CelebA Male and is finetuned by the adversary with non-watermarked CelebA Female.

Results against image perturbation attacks. We evaluate the mean watermark bitwise accuracy over perturbed images from watermarked ProGAN on either CelebA or LSUN, as we do for the effectiveness evaluation in Section IV-B. In Figure 7, we plot in red dots the bitwise accuracy w.r.t. the magnitude of perturbations. As a reference upper bound, we also plot in blue dots the bitwise accuracy when the same perturbations are applied to the watermarked GAN training images. Recall from the second column of Table I that non-perturbed watermarked training images result in saturated detection accuracy. We find:

(1) For all the image perturbations, watermark bitwise accuracy drops monotonically as we increase the magnitude of perturbation, while for small perturbations accuracy drops rather slowly. We consider accepting accuracy according to the classification threshold in Section IV-B. Consequently, we list the general working range w.r.t. each perturbation: Gaussian noise standard deviation , Gaussian blur kernel size , JPEG compression quality , center cropping size , and re-watermarking blending weight , which are reasonably wide ranges in practice.

(2) For the image perturbations out of the above working ranges, the reference upper bounds drop even faster and the margins to the testing curves shrink quickly, indicating the detection deterioration does not factually result from GAN training, but rather from the heavy quality deterioration of source images.

(3) As a result of (2), except for the re-watermarking attack, the adversary has to deteriorate an image with strong perturbations in order to suppress the watermark detection not better than random guess ( accuracy). However, that also destroys the utility of the service if the adversary tends to escape from being detected, making the adversary’s efforts invalid. We demonstrate in Table IV how each perturbation affects the sample appearance, and how severely the image quality (utility) deteriorates w.r.t. each perturbation before the detection turns ineffective. It guarantees that the stolen or misused service is detectable until its utility is heavily perturbed.

(4) For the re-watermarking attack, although it does not deteriorate image quality perceptually, it requires much more of adversary’s expertise in steganography. In practice, the owner’s encoder and decoder are not publicized online as they are not part of the original generation service. If the adversary has the capability of training his own watermarking pipeline, there is no apparent motivation for him to pirate a victim GAN but rather train his own ones.

Original Model quantization Model noise
Precision Noise std
98 bits 64 bits 77 bits
TABLE V: Image samples from the watermarked ProGAN after model perturbation attacks and the corresponding numbers of detected bits (out of 100). For model noise, the detection still performs robustly (bitwise accuracy ) even when the image quality (utility) heavily deteriorates. For model quantization, image quality and detection accuracy deteriorate dramatically at the same pace (when decimal precision turns from to ).

Model perturbation attacks. We also evaluate the robustness of our watermarking solution against model perturbation attacks, considering the adversary has the white-box access to the weights of the victim GAN. We consider four types of attacks, two based on GAN finetuning and the other two based on model weight quantization or adding Gaussian noise to model weights. For finetuning attacks, the adversary finetunes the watermarked victim GAN model using the same training details as the owner did, but using non-watermarked data instead. We vary the number of finetuning iterations within the range of epoch, and detect the watermark from the images generated by the finetuned generator. We establish the two attacks as follows: (1) A strong adversary has access to the owner’s non-watermarked CelebA training data and finetunes the watermarked ProGAN with it; (2) a weak adversary does not have access to the owner’s training set and therefore finetunes with a disjoint non-watermarked dataset. In our setting, the victim ProGAN is trained on the watermarked CelebA Male subset and is finetuned on the non-watermarked CelebA Female subset.

For quantization attack, we compress each model weight given a decimal precision. For noise attack, we add to each model weight a Gaussian noise given a standard deviation. We then detect watermark from the images generated by the perturbed generator.
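Both weight-level perturbations can be sketched on a flat list of weights. The sketch below is an assumption-laden illustration: it interprets "decimal precision" as rounding each weight to a given number of decimal places, and the function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def quantize_weights(weights: Sequence[float], decimals: int) -> List[float]:
    """Round every weight to the given number of decimal places."""
    return [round(w, decimals) for w in weights]


def add_weight_noise(weights: Sequence[float], std: float,
                     seed: Optional[int] = None) -> List[float]:
    """Add independent zero-mean Gaussian noise with the given std to each weight."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, std) for w in weights]


weights = [0.12345, -1.98765, 0.5]
print(quantize_weights(weights, decimals=2))
print(add_weight_noise(weights, std=0.01, seed=0))
```

In a real attack the same operation would be applied to every parameter tensor of the generator before sampling images and decoding their watermarks.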

Results against model perturbation attacks. We evaluate the mean watermark bitwise accuracy over images from a perturbed watermarked ProGAN on CelebA. We plot in red dots in Figure 8 the bitwise accuracy w.r.t. the magnitude of perturbations. As a reference for the impact of model perturbation on image quality, we also plot FID of their generated images in blue dots. We find:

(1) Similar to image perturbations, for all the model perturbations, the watermark bitwise accuracy drops almost monotonically as we increase the magnitude of perturbation, while for small perturbations, the accuracy drops rather slowly. We consider accepting accuracy and list the general working range w.r.t. each perturbation: strong finetuning epoch , weak finetuning epoch , quantization decimal precision , and model noise standard deviation , which are reasonably wide ranges in practice.

(2) For all the model perturbations except the strong finetuning, image quality (utility) deteriorates faster than watermark bitwise accuracy, such that before accuracy is lower than , FID has increased by . We demonstrate in Figure 9 and Table V how each perturbation affects the sample appearance, and how severely the image quality (utility) deteriorates w.r.t. each perturbation before the detection turns ineffective. It guarantees that the stolen or misused service is detectable until its utility is heavily perturbed.

(3) Especially for the weak finetuning attack, when the adversary is unaware of the owner’s training dataset, his finetuning overrides the original generation service faster than overriding the original watermark. That reminds the owner to keep the original training dataset secret in order to prevent finetuning attack from being effective.

(4) For the strong finetuning attack with enough iterations, although effective without deteriorating the generation quality (utility), it requires much more of adversary’s expertise and resources in training a GAN model. In practice, if the adversary has the capability of training his own GAN models, there is no apparent motivation for him to pirate a victim GAN model.

Fig. 10: Watermark overlap percentage (in red dots) and FID of watermarked CelebA training data (in blue dots) w.r.t. the number of watermark bits . indicates the setting without watermarking.

IV-F Qualification of capacity

A watermarking solution should have a large enough capacity to maintain a large number of distinctive watermarks for different GAN models. In our experiments the capacity correlates to the length of the watermark, i.e., the number of bits in the watermark. On one hand, a larger leads to a larger watermark space and more likely avoids collision (two owners’ watermarks too similar to each other). On the other hand, however, a larger deteriorates the quality of watermarked images, according to the empirical analysis in [81]. Given the image size of , it is recommended to use , as we have set for the other experiments. In the following experiments, we investigate the relation between fidelity and capacity, so as to examine whether is a reasonable setting.

Evaluations. On one hand, we explore the possible similarity between two arbitrary watermarks w.r.t. capacity. We vary the number of bits in the range of and sample random pairs of watermarks. We then measure the minimal pairwise Manhattan distance among the watermarks and report the percentage of overlapped bits of that pair. A higher overlap percentage indicates a higher possible similarity of two watermarks, which in turn requires a higher watermark bitwise detection accuracy in Section IV-B to avoid false verification. At least we should select a reasonable such that the overlap percentage is lower than the bitwise accuracy threshold used in Section IV-B.

On the other hand, we explore FID between non-watermarked and watermarked training data w.r.t. capacity. We vary the number of bits and measure the FID. We require FID not to be increased by subject to watermarking.
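The overlap measurement for random watermark pairs can be sketched as follows. For binary vectors the Manhattan distance equals the number of differing bits, so the closest pair is the one with the most overlapping bits. The function name, the seeding, and the numbers of bits and pairs in the usage lines are assumptions for illustration.

```python
import random
from typing import Optional


def max_overlap_fraction(n_bits: int, n_pairs: int,
                         seed: Optional[int] = None) -> float:
    """Sample random pairs of n_bits-long watermarks and return the fraction
    of overlapping bits for the pair with the minimal Manhattan distance."""
    rng = random.Random(seed)
    min_distance = n_bits
    for _ in range(n_pairs):
        first = [rng.randint(0, 1) for _ in range(n_bits)]
        second = [rng.randint(0, 1) for _ in range(n_bits)]
        distance = sum(a != b for a, b in zip(first, second))
        min_distance = min(min_distance, distance)
    return (n_bits - min_distance) / n_bits


# Hypothetical settings: shorter watermarks collide more easily.
for n_bits in (10, 50, 100):
    print(n_bits, max_overlap_fraction(n_bits, n_pairs=1000, seed=0))
```

Longer watermarks concentrate the overlap of random pairs around one half, which is why the worst-case overlap tends to fall as the number of bits grows.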

Results. In Figure 10, we plot the bitwise overlap percentage w.r.t. the number of bits in red dots, and FID of watermarked CelebA training data w.r.t. in blue dots. We validate is the largest possible capacity among our settings such that the overlap percentage as well as FID of that of non-watermarked training data.

IV-G Generated image detection

The breakthrough of GANs narrows the gap in visual quality between generated and real images. That makes the detection of generated images an increasing challenge. Recent concerns about deep fakes [22] and misuse of GANs raise demands for reliable detection methods. In this section, based on the validated qualifications of our watermarking solution, we extend it to an application that facilitates detecting samples from a watermarked GAN, for the purpose of increasing the margin between real and fake while at no cost of generation fidelity.

The problem of generated image detection is typically formulated as a binary classification problem: separating real from fake images. Unlike existing methods that detect intrinsic differences between the two classes [91, 98, 24], we propose to enhance the classification performance by embedding an artificial watermark into generated images, in contrast to real images which are deterministically non-watermarked. In particular, we regulate GAN owners to publicize only watermarked GAN models using our solution. Then we convert the problem to classifying if one image is watermarked or not. We use the same detection criterion as in Section IV-B, i.e., checking if the detected watermark of a testing image contains matching bits to our reference watermark.


We compare to a plain convolutional neural network (CNN) classifier as a baseline method, as implemented in [91]. It is trained on CelebA real images and images from a watermarked ProGAN. We consider two scenarios depending on whether or not the GAN model used for classifier training covers that used for testing.

Method | Seen GAN model | Unseen GAN model
Classifier | 0.997 | 0.508
Watermark | 1.0 | 1.0
TABLE VI: Real/Fake classification accuracy. The first column corresponds to the scenario where the GAN model used for training covers that used for testing. The second column is the opposite scenario which is a more generalized situation.

Results. We report in Table VI the classification accuracy over CelebA real images and images generated by a watermarked ProGAN. We find:

(1) Real/fake classification based on the watermark performs as well as that based on the CNN classifier in the seen GAN scenario, with near-perfect accuracy for both (1.0 versus 0.997 in Table VI).

(2) More advantageously, our watermark-based classifier performs equally well on unseen GAN models, while the CNN classifier deteriorates to random guessing (0.508 accuracy in Table VI). This is because the CNN classifier suffers from the domain gap between training and testing GAN models. In contrast, our method is agnostic to GAN models, as it depends only on the presence of the watermark rather than on intrinsic information that overfits to a closed-world discriminative task.

(3) This suggests a useful practice for DL model administration platforms that regulate model publication: publishing a watermarked GAN model significantly decreases its deep fake risk in case of misuse.

IV-H Image-to-GAN attribution

The goal of the image-to-GAN attribution task is to identify the GAN model that generated a particular fake image. It plays an important role in tracing the responsibility of a GAN model, which provides legal clues for malicious use cases of fake images. Our watermarking solution, designed for ownership verification, is qualified in principle for attribution. This is validated by the transferability of the watermark from GAN models to generated images.

Closed world scenario. In the closed world scenario, the model space is finite and known in advance. In our experiment, we train four ProGAN models on CelebA using four different watermarks. The task is to classify a mixture of images evenly generated by these four models. To attribute images using watermarks, we apply our decoder to detect the watermark in the image, and assign that image to the GAN with the closest watermark.

Open world scenario. We further consider the open world scenario to validate whether an attribution approach can accurately reject images from unknown GANs. In our experiment, the set of GANs is extended with two more watermarked ProGANs on CelebA, the watermarks of which are unknown. Along with the original task in the closed world, the new task also requires classifying additional images evenly generated by these two GANs, i.e., labeling them as not belonging to any of the four known GANs. Our watermarking approach classifies an image as unknown if and only if the number of matching bits between the detected watermark and the closest known watermark falls below a preset threshold.
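Both attribution scenarios can be sketched with one nearest-watermark rule. This is a minimal sketch, not the authors' code: `attribute_image`, the dictionary layout, and the label `"unknown"` are hypothetical names, and `min_matching_bits` stands in for the rejection threshold, whose value is not restated here. Passing `None` for the threshold gives the closed world rule; passing a value enables open world rejection.

```python
def attribute_image(detected, known_watermarks, min_matching_bits=None):
    """Assign an image to the known GAN whose watermark is closest.

    detected: list of decoded bits for the image.
    known_watermarks: dict mapping a GAN name to its watermark bit list.
    min_matching_bits: assumed open world threshold; None disables rejection.
    """
    best_name, best_matches = None, -1
    for name, watermark in known_watermarks.items():
        matches = sum(1 for d, r in zip(detected, watermark) if d == r)
        if matches > best_matches:
            best_name, best_matches = name, matches
    # Open world: reject when even the closest watermark matches too few bits.
    if min_matching_bits is not None and best_matches < min_matching_bits:
        return "unknown"
    return best_name
```

A short usage example with toy 4-bit watermarks:

```python
known = {"gan_a": [0, 0, 0, 0], "gan_b": [1, 1, 1, 1]}
print(attribute_image([1, 1, 1, 0], known))                       # closed world
print(attribute_image([1, 1, 0, 0], known, min_matching_bits=3))  # open world
```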

Baseline. Yu et al. [91] use a plain CNN classifier to solve image-to-GAN attribution as a multi-class classification problem, which is limited to a closed world. We follow their protocol in the closed world scenario: training on images generated evenly by each of the four GANs. We also extend their method to the open world scenario by training four one-vs-all-the-others binary classifiers. To train the i-th classifier, we balance our sampling by using images from the i-th GAN and an equal number of images drawn evenly from all the other GANs. During testing, all four classifiers are applied to an image. If at least one classifier accepts the image, we assign it the class with the highest confidence. Otherwise, we assign the image the unknown label.
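The test-time rule of the extended baseline can be sketched as below. This is an illustration under assumptions, not the baseline's actual code: the function name is hypothetical, each binary classifier is assumed to output a confidence in [0, 1], and a classifier is assumed to reject an image when its confidence falls below `reject_threshold` (0.5 here, an assumed value).

```python
def one_vs_all_decision(confidences, reject_threshold=0.5):
    """Combine one-vs-all classifier outputs into a single label.

    confidences[i] is the i-th classifier's confidence that the image
    comes from the i-th known GAN. reject_threshold is an assumed value.
    Returns the index of the chosen GAN, or "unknown" if all reject.
    """
    best_index = max(range(len(confidences)), key=lambda i: confidences[i])
    # If the most confident classifier rejects, then every classifier rejects.
    if confidences[best_index] < reject_threshold:
        return "unknown"
    return best_index
```

Checking only the most confident classifier is sufficient because any accepting classifier has a higher confidence than every rejecting one.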

Method | Closed world | Open world
Classifier | 0.998 | 0.235
Watermark | 1.000 | 1.000
TABLE VII: Image attribution accuracy.

Results. We report in Table VII the testing attribution accuracy. We find similar results to those in Section IV-G:

(1) Image attribution based on the watermark performs as well as that based on the CNN classifier in the closed world, with near-perfect accuracy for both (1.000 versus 0.998 in Table VII).

(2) More advantageously, our watermarking approach performs equally well in the open world scenario, while the CNN classifier deteriorates severely. This is because the four classifiers generalize poorly to the unknown GANs, such that they correctly label fewer than a quarter of unknown images as unknown. In contrast, our method is agnostic to GAN models, as it depends only on the presence of the watermark rather than on intrinsic information that overfits to a closed-world discriminative task.

(3) This suggests a useful practice for DL model administration platforms that regulate model publication: publishing a watermarked GAN model significantly facilitates image attribution and responsibility tracking in case the generated images are misused.

V Conclusion

In this paper, we propose the first black-box watermarking solution for protecting the Intellectual Property of generative adversarial networks. We first analyze the threat model in the black-box scenario and formalize five qualifications for desired watermarking. In our pipeline, we leverage steganography techniques to watermark the GAN training dataset, transfer the watermark from the dataset to GAN models, and then verify the watermark from generated images. In the experiments, we validate the effectiveness (watermark transferability), fidelity (no deterioration in generation quality), secrecy (against steganalysis), robustness (against image and model perturbations), and large capacity (trade-off with fidelity) of our solution. Notably, our solution treats GAN models as an independent component: watermark embedding is agnostic to GAN details, and watermark verification relies only on accessing the APIs of black-box GANs. We further extend our watermarking applications to generated image detection and attribution, which shows practical potential to facilitate forensics against deep fakes and responsibility tracking of GAN misuse.

VI Acknowledgement

Vladislav Skripniuk was partially funded by an IMPRS scholarship from the Max Planck Institute. We acknowledge Matthew Tancik for sharing his StegaStamp GitHub repository and Tero Karras for sharing his ProGAN GitHub repository. We also thank Apratim Bhattacharyya for constructive discussion and advice.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. In arXiv, Cited by: §I.
  • [2] Y. Adi, C. Baum, M. Cisse, B. Pinkas, and J. Keshet (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In USENIX Security, Cited by: §I-A, §I-A, §I-A, §II-A.
  • [3] Amazon AWS Machine Learning External Links: Link Cited by: §I.
  • [4] J. Antic DeOldify. a deep learning based project for colorizing and restoring old images (and video!). External Links: Link Cited by: §II-B.
  • [5] S. Baluja (2017) Hiding images in plain sight: deep steganography. In NeurIPS, Cited by: §I-B, §II-C, §III-B1.
  • [6] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio (2010) Theano: a cpu and gpu math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), Cited by: §I.
  • [7] B. Biggio, B. Nelson, and P. Laskov (2012) Poisoning attacks against support vector machines. In ICML, Cited by: §II-E.
  • [8] A. Brock, J. Donahue, and K. Simonyan (2019) Large scale gan training for high fidelity natural image synthesis. In ICLR, Cited by: §I-A, §I-A, §I, §II-B.
  • [9] F. Cayre, C. Fontaine, and T. Furon (2005) Watermarking security: theory and practice. In IEEE Transactions on signal processing, Cited by: §II-C.
  • [10] C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. In ACM transactions on intelligent systems and technology, Cited by: §IV-D.
  • [11] H. Chen, B. D. Rohani, and F. Koushanfar (2019) Deepmarks: a digital fingerprinting framework for deep neural networks. In ICMR, Cited by: §I-A, §II-A.
  • [12] M. Chen, M. Boroumand, and J. Fridrich (2018) Deep learning regressors for quantitative steganalysis. In SPIE Electronic Imaging, Cited by: §II-D.
  • [13] X. Chen, C. Liu, B. Li, K. Lu, and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. In arXiv, Cited by: §II-E.
  • [14] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, and J. Choo (2018) Stargan: unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, Cited by: §II-B.
  • [15] S. Chutani and A. Goyal (2018) Improved universal quantitative steganalysis in spatial domain using elm ensemble. In Multimedia Tools and Applications, Cited by: §II-D.
  • [16] S. Chutani and A. Goyal (2019) A review of forensic approaches to digital image steganalysis. In Multimedia Tools and Applications, Cited by: §II-D.
  • [17] R. Cogranne and J. Fridrich (2015) Modeling and extending the ensemble classifier for steganalysis of digital images using hypothesis testing theory. In IEEE Transactions on Information Forensics and Security, Cited by: §II-D.
  • [18] R. Collobert, K. Kavukcuoglu, and C. Farabet (2011) Torch7: a matlab-like environment for machine learning. In NeurIPS workshop, Cited by: §I.
  • [19] I. Cox, M. Miller, J. Bloom, and C. Honsinger (2002) Digital watermarking. Springer. Cited by: §II-C.
  • [20] B. Darvish Rouhani, H. Chen, and F. Koushanfar (2019) Deepsigns: an end-to-end watermarking framework for ownership protection of deep neural networks. In ASPLOS, Cited by: §I-A, §II-A.
  • [21] DCGAN GitHub Repository External Links: Link Cited by: §IV-A.
  • [22] Deep fakes External Links: Link Cited by: §I-A, §IV-G.
  • [23] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §I.
  • [24] R. Durall, M. Keuper, and J. Keuper (2020) Watch your up-convolution: cnn based generative deep neural networks are failing to reproduce spectral distributions. In CVPR, Cited by: §II-F, §IV-G.
  • [25] R. Durall, M. Keuper, F. Pfreundt, and J. Keuper (2019) Unmasking deepfakes with simple features. In arXiv, Cited by: §II-F.
  • [26] L. Fan, K. W. Ng, and C. S. Chan (2019) Rethinking deep neural network ownership verification: embedding passports to defeat ambiguity attacks. In NeurIPS, Cited by: §II-A.
  • [27] J. Fridrich, M. Goljan, D. Soukal, and T. Holotyak (2005) Forensic steganalysis: determining the stego key in spatial domain steganography. In SPIE Security, Steganography, and Watermarking of Multimedia Contents, Cited by: §II-D.
  • [28] J. Fridrich and J. Kodovsky (2012) Rich models for steganalysis of digital images. In IEEE Transactions on Information Forensics and Security, Cited by: §II-D.
  • [29] J. Fridrich (2009) Steganography in digital media: principles, algorithms, and applications. Cambridge University Press. Cited by: §II-C.
  • [30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, Cited by: §I-A, §I-A, §I, §II-B.
  • [31] Google Cloud AI External Links: Link Cited by: §I.
  • [32] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg (2019) Badnets: evaluating backdooring attacks on deep neural networks. In IEEE Access, Cited by: §I-A, §II-E.
  • [33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In NeurIPS, Cited by: §II-B.
  • [34] J. Hayes and G. Danezis (2017) Generating steganographic images via adversarial training. In NeurIPS, Cited by: §II-C.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §I-A, §I.
  • [36] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, Cited by: §IV-C.
  • [37] V. Holub, J. Fridrich, and T. Denemark (2014) Universal distortion function for steganography in an arbitrary domain. In EURASIP Journal on Information Security, Cited by: §II-C.
  • [38] V. Holub and J. Fridrich (2012) Designing steganographic distortion using directional filters. In International Workshop on Information Forensics and Security, Cited by: §II-C.
  • [39] IBM Watson Machine Learning External Links: Link Cited by: §I.
  • [40] In the age of a.i. is seeing still believing? External Links: Link Cited by: §I-A.
  • [41] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §I-A, §I-A, §II-B.
  • [42] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In MM, Cited by: §I.
  • [43] K. Karampidis, E. Kavallieratou, and G. Papadourakis (2018) A review of image steganalysis techniques for digital forensics. In Journal of information security and applications, Cited by: §II-D.
  • [44] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of GANs for improved quality, stability, and variation. In ICLR, Cited by: §I-A, §I-A, §I, §II-B, §IV-A.
  • [45] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In CVPR, Cited by: §I-A, §I-A, §I.
  • [46] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2020) Analyzing and improving the image quality of StyleGAN. In CVPR, Cited by: §I-A, §I-A, §I-A, §I, §II-B.
  • [47] D. P. Kingma and M. Welling (2014) Auto-encoding variational bayes. In ICLR, Cited by: §I.
  • [48] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, Cited by: §I-A, §I.
  • [49] S. Kullback and R. Leibler (1951) On information and sufficiency. In Annals of Mathematical Statistics, Cited by: §IV-C.
  • [50] G. C. Langelaar, I. Setyawan, and R. L. Lagendijk (2000) Watermarking digital image and video data. a state-of-the-art overview. In IEEE Signal processing magazine, Cited by: §I-A.
  • [51] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther (2016) Autoencoding beyond pixels using a learned similarity metric. In ICML, Cited by: §I-A, §I-A.
  • [52] E. Le Merrer, P. Perez, and G. Trédan (2019) Adversarial frontier stitching for remote neural network watermarking. In Neural Computing and Applications, Cited by: §II-A.
  • [53] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. In Proceedings of the IEEE, Cited by: §I-A, §I, §IV-A.
  • [54] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, Cited by: §II-B.
  • [55] D. Lerch-Hostalot and D. Megías (2016) Unsupervised steganalysis based on artificial training sets. In Engineering Applications of Artificial Intelligence, Cited by: §II-D, §IV-D.
  • [56] Z. Li, C. Hu, Y. Zhang, and S. Guo (2019) How to prove your model belongs to you: a blind-watermark based framework to protect intellectual property of dnn. In ACSAC, Cited by: §II-A.
  • [57] Z. Liu, X. Qi, J. Jia, and P. Torr (2020) Global texture enhancement for fake face detection in the wild. In CoRR, Cited by: §II-F.
  • [58] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In ICCV, Cited by: §I, §IV-A.
  • [59] X. Lu, Y. Wang, L. Huang, W. Yang, and Y. Shen (2016) A secure and robust covert channel based on secret sharing scheme. In Asia-Pacific Web Conference, Cited by: §II-C.
  • [60] F. Marra, D. Gragnaniello, L. Verdoliva, and G. Poggi (2019) Do gans leave artificial fingerprints?. In MIPR, Cited by: §II-F.
  • [61] Microsoft Azure Machine Learning External Links: Link Cited by: §I.
  • [62] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In ICLR, Cited by: §I-A, §I.
  • [63] S. Mun, S. Nam, H. Jang, D. Kim, and H. Lee (2017) A robust blind watermarking using convolutional neural network. In arXiv, Cited by: §II-C.
  • [64] A. Nissar and A. H. Mir (2010) Classification of steganalysis techniques: a study. In Digital Signal Processing, Cited by: §II-D.
  • [65] T. Orekondy, B. Schiele, and M. Fritz (2019) Knockoff nets: stealing functionality of black-box models. In CVPR, Cited by: §I-A.
  • [66] T. Park, M. Liu, T. Wang, and J. Zhu (2019) Semantic image synthesis with spatially-adaptive normalization. In CVPR, Cited by: §II-B.
  • [67] A. Paszke, S. Gross, S. Chintala, and G. Chanan (2016) PyTorch. External Links: Link Cited by: §I.
  • [68] T. Pevnỳ, T. Filler, and P. Bas (2010) Using high-dimensional image models to perform highly undetectable steganography. In International Workshop on Information Hiding, Cited by: §II-C.
  • [69] ProGAN GitHub Repository External Links: Link Cited by: §IV-A.
  • [70] A. Radford, L. Metz, and S. Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, Cited by: §I-A, §I, §II-B, §IV-A.
  • [71] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, Cited by: §IV-A.
  • [72] A. Sablayrolles, M. Douze, C. Schmid, and H. Jégou (2020) Radioactive data: tracing through training. In arXiv, Cited by: §II-E.
  • [73] L. K. Saini and V. Shrivastava (2014) A survey of digital watermarking techniques and its applications. In IJCST, Cited by: §I-A.
  • [74] A. Shafahi, W. R. Huang, M. Najibi, O. Suciu, C. Studer, T. Dumitras, and T. Goldstein (2018) Poison frogs! targeted clean-label poisoning attacks on neural networks. In NeurIPS, Cited by: §II-E.
  • [75] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §I-A, §I.
  • [76] J. T. Springenberg (2016) Unsupervised and semi-supervised learning with categorical generative adversarial networks. In ICLR, Cited by: §II-B.
  • [77] J. Steinhardt, P. W. W. Koh, and P. S. Liang (2017) Certified defenses for data poisoning attacks. In NeurIPS, Cited by: §II-E.
  • [78] M. D. Swanson, M. Kobayashi, and A. H. Tewfik (1998) Multimedia data-embedding and watermarking technologies. In Proceedings of the IEEE, Cited by: §I-A.
  • [79] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, Cited by: §I-A, §I, §IV-C.
  • [80] S. Szyller, B. G. Atli, S. Marchal, and N. Asokan (2019) Dawn: dynamic adversarial watermarking of neural networks. In arXiv, Cited by: §II-A.
  • [81] M. Tancik, B. Mildenhall, and R. Ng (2020) Stegastamp: invisible hyperlinks in physical photographs. In CVPR, Cited by: §I-B, §II-C, §III-B1, §IV-A, §IV-A, §IV-C, §IV-F.
  • [82] S. Tokui, K. Oono, S. Hido, and J. Clayton (2015) Chainer: a next-generation open source framework for deep learning. In NeurIPS workshop, Cited by: §I.
  • [83] F. Tramèr, F. Zhang, A. Juels, M. K. Reiter, and T. Ristenpart (2016) Stealing machine learning models via prediction apis. In USENIX Security, Cited by: §I-A.
  • [84] Y. Uchida, Y. Nagai, S. Sakazawa, and S. Satoh (2017) Embedding watermarks into deep neural networks. In ICMR, Cited by: §I-A, §II-A.
  • [85] V. Vukotić, V. Chappelier, and T. Furon (2018) Are deep neural networks good for blind image watermarking?. In International Workshop on Information Forensics and Security, Cited by: §II-C.
  • [86] S. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros (2020) CNN-generated images are surprisingly easy to spot… for now. In CVPR, Cited by: §I-A, §II-F.
  • [87] J. Ye, J. Ni, and Y. Yi (2017) Deep learning hierarchical representations for image steganalysis. In IEEE Transactions on Information Forensics and Security, Cited by: §II-D.
  • [88] You thought fake news was bad? deep fakes are where truth goes to die External Links: Link Cited by: §I-A.
  • [89] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. In arXiv, Cited by: §I, §IV-A.
  • [90] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang (2018) Generative image inpainting with contextual attention. In CVPR, Cited by: §II-B.
  • [91] N. Yu, L. S. Davis, and M. Fritz (2019) Attributing fake images to gans: learning and analyzing gan fingerprints. In ICCV, Cited by: §I-A, §II-F, §IV-G, §IV-G, §IV-H.
  • [92] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, Cited by: §II-B.
  • [93] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas (2018) Stackgan++: realistic image synthesis with stacked generative adversarial networks. In TPAMI, Cited by: §II-B.
  • [94] J. Zhang, Z. Gu, J. Jang, H. Wu, M. P. Stoecklin, H. Huang, and I. Molloy (2018) Protecting intellectual property of deep neural networks with watermarking. In Asia CCS, Cited by: §I-A, §I-A, §II-A.
  • [95] J. Zhang, D. Chen, J. Liao, H. Fang, W. Zhang, W. Zhou, H. Cui, and N. Yu (2020) Model watermarking for image processing networks. In AAAI, Cited by: §I-A.
  • [96] K. A. Zhang, A. Cuesta-Infante, L. Xu, and K. Veeramachaneni (2019) SteganoGAN: high capacity image steganography with gans. In arXiv, Cited by: §I-A, §II-C.
  • [97] R. Zhang, S. Dong, and J. Liu (2019) Invisible steganography via generative adversarial networks. In Multimedia Tools and Applications, Cited by: §II-C.
  • [98] X. Zhang, S. Karaman, and S. Chang (2019) Detecting and simulating artifacts in gan fake images. In International Workshop on Information Forensics and Security, Cited by: §II-F, §IV-G.
  • [99] J. Zhu, R. Kaplan, J. Johnson, and L. Fei-Fei (2018) Hidden: hiding data with deep networks. In ECCV, Cited by: §II-C, §IV-D.
  • [100] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §II-B.
  • [101] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman (2017) Toward multimodal image-to-image translation. In NeurIPS, Cited by: §II-B.
  • [102] D. Ziou and R. Jafari (2014) Efficient steganalysis of images: learning is good for anticipation. In Pattern Analysis and Applications, Cited by: §II-D.