Defending Against Deepfakes Using Adversarial Attacks on Conditional Image Translation Networks
Face modification systems using deep learning have become increasingly powerful and accessible. Given images of a person's face, such systems can generate new images of that same person under different expressions and poses. Some systems can also modify targeted attributes such as hair color or age. Manipulated images and videos of this type have been dubbed deepfakes. In order to prevent a malicious user from generating modified images of a person without their consent, we tackle the new problem of generating adversarial attacks against image translation systems, which disrupt the resulting output image. We call this problem disrupting deepfakes. We adapt traditional adversarial attacks to our scenario. Most image translation architectures are generative models conditioned on an attribute (e.g. put a smile on this person's face). We present class transferable adversarial attacks that generalize to different classes, which means that the attacker does not need to have knowledge about the conditioning vector. In gray-box scenarios, blurring can mount a successful defense against disruption. We present a spread-spectrum adversarial attack, which evades blurring defenses.
Advances in image translation using generative adversarial networks (GANs) have allowed the rise of face manipulation systems that achieve impressive realism. Some face manipulation systems can create new images of a person’s face under different expressions and poses [19, 28]. Other face manipulation systems modify the age, hair color, gender or other attributes of the person [3, 4].
Given the widespread availability of these systems, malicious actors can modify images of a person without their consent. There have been occasions where faces of celebrities have been transferred to videos with explicit content without their consent, and companies such as Facebook have banned uploading modified pictures and video of people.
One way of mitigating this risk is to develop systems that can detect whether an image or video has been modified using one of these systems. There have been recent efforts in this direction, with varying levels of success [24, 25].
There is work showing that deep neural networks are vulnerable to adversarial attacks [21, 10, 18, 2], where an attacker applies imperceptible perturbations to an image causing it to be incorrectly classified. We distinguish different attack scenarios. In a white-box scenario the attacker has perfect knowledge of the architecture, model parameters and defenses in place. In a black-box scenario, the attacker is only able to query the target model for output labels for chosen inputs. There are several different definitions of gray-box scenarios. In this work, a gray-box scenario denotes perfect knowledge of the model and parameters, but ignorance of the pre-processing defense mechanisms in place (such as blurring). We focus on white-box and gray-box settings.
Another way of combating malicious actors is by disrupting the deepfaker’s ability to generate a deepfake. In this work we propose a solution by adapting traditional adversarial attacks that are imperceptible to the human eye in the source image, but interfere with translation of this image using image translation networks. A successful disruption corresponds to the generated image being sufficiently deteriorated such that it has to be discarded or such that the modification is perceptually evident. We present a formal and quantifiable definition of disruption success in Section 3.
Most facial manipulation architectures are conditioned both on the input image and on a target conditioning class. One example is to define the target expression of the generated face using this attribute class (e.g. put a smile on the person’s face). In this example, if we want to prevent a malicious actor from putting a smile on the person’s face in the image, we need to know that the malicious actor has selected the smile attribute instead of, for instance, eye closing. In this work, we are the first to formalize the problem of disrupting class conditional image translation, and we present two variants of class transferable disruptions that improve generalization to different conditioning attributes.
Blurring is a broken defense in the white-box scenario, where a disruptor knows the type and magnitude of pre-processing blur being used. Nevertheless, in a real situation, a disruptor might know the architecture being used yet ignore the type and magnitude of blur being used. In this scenario the efficacy of a naive disruption drops dramatically. We present a novel spread-spectrum disruption that evades a variety of blur defenses in this gray-box setting.
We present baseline methods for disrupting deepfakes by adapting adversarial attack methods to image translation networks.
We are the first to address disruptions on conditional image translation networks. We propose and evaluate novel disruption methods that transfer from one conditioning class to another.
We are the first to propose and evaluate adversarial training for generative adversarial networks. Our novel G+D adversarial training alleviates disruptions in a white-box setting.
We propose a novel spread-spectrum disruption that evades blur defenses in a gray-box scenario.
There are several works exploring image translation using deep neural networks [11, 26, 29, 19, 3, 4, 28]. Some of these works apply image translation to face images in order to generate new images of individuals with modified expression or attributes [19, 3, 4, 28].
There is a large amount of work that explores adversarial attacks on deep neural networks for classification [21, 10, 16, 18, 2, 17, 15]. The Fast Gradient Sign Method (FGSM), a one-step gradient attack, was proposed by Goodfellow et al. Stronger iterative attacks such as iterative FGSM (I-FGSM) and Projected Gradient Descent (PGD) have been proposed. Sabour et al. explore feature-space attacks on deep neural network classifiers using L-BFGS.
Kos et al. explore adversarial attacks against Variational Autoencoders (VAE) and VAE-GANs, where an adversarial image is compressed into a latent space and, instead of being reconstructed into the original image, is reconstructed into an image of a different semantic class. In contrast, our work focuses on attacks against image translation systems. Additionally, our objective is to disrupt deepfake generation as opposed to changing the output image to a different semantic class.
Wang et al. adapt adversarial attacks to the image translation scenario for traffic scenes on the pix2pixHD and CycleGAN networks. Yeh et al., in work concurrent to ours, propose adapting PGD to attack pix2pixHD and CycleGAN networks in the face domain. Most face manipulation networks are conditional image translation networks, yet [23, 27] do not address this scenario and do not explore defenses for such attacks. We are the first to explore attacks against conditional image translation GANs as well as attacks that transfer to different conditioning classes. We are also the first to propose adversarial training for image translation GANs. Madry et al. propose adversarial training using strong adversaries to alleviate adversarial attacks against deep neural network classifiers. In this work, we propose two adaptations of this technique for GANs, as a first step towards robust image translation networks.
A version of spread-spectrum watermarking for images was proposed by Cox et al. Athalye et al. propose the expectation over transformation (EoT) method for synthesizing adversarial examples robust to pre-processing transformations. However, Athalye et al. demonstrate their method on affine transformations, noise and others, but do not consider blur. In this work, we propose a faster heuristic iterative spread-spectrum disruption for evading blur defenses.
We describe methods for image translation disruption (Section 3.1), our proposed conditional image translation disruption techniques (Section 3.2), our proposed adversarial training techniques for GANs (Section 3.3) and our proposed spread-spectrum disruption (Section 3.4).
Similar to an adversarial example, we want to generate a disruption by adding a human-imperceptible perturbation $\eta$ to the input image:
$$\tilde{x} = x + \eta,$$
where $\tilde{x}$ is the generated disrupted input image and $x$ is the input image. By feeding the original image or the disrupted input image to a generator we have the mappings $G(x) = y$ and $G(\tilde{x}) = \tilde{y}$, respectively, where $y$ and $\tilde{y}$ are the translated output images and $G$ is the generator of the image translation GAN.
We consider a disruption successful when it introduces perceptible corruptions or modifications onto the output of the network leading a human observer to notice that the image has been altered and therefore distrust its source.
We operationalize this phenomenon. Adversarial attack research has focused on attacks showing low distortions using the $L_0$, $L_2$ and $L_\infty$ distance metrics. The logic behind using attacks with low distortion is that the larger the distance, the more apparent the alteration of the image, such that an observer could detect it. In contrast, we seek to maximize the distortion of our output, with respect to a well-chosen reference $r$:
$$\max_{\eta} L(G(x + \eta), r), \quad \text{subject to } \|\eta\|_\infty \le \epsilon,$$
where $\epsilon$ is the maximum magnitude of the perturbation $\eta$ and $L$ is a distance function. If we pick $r$ to be the ground-truth output, $r = G(x)$, we get the ideal disruption which aims to maximize the distortion of the output.
We can also formulate a targeted disruption, which pushes the output to be close to $r$:
$$\eta = \arg\min_{\eta} L(G(x + \eta), r), \quad \text{subject to } \|\eta\|_\infty \le \epsilon.$$
Note that the ideal disruption is a special case of the targeted disruption where we minimize the negative distortion instead and select $r = G(x)$. We can thus disrupt an image towards a target or away from a target.
We can generate a targeted disruption by adapting well-established adversarial attacks: FGSM, I-FGSM, and PGD. The Fast Gradient Sign Method (FGSM) generates an attack in one forward-backward step, and is adapted as follows:
$$\eta = -\epsilon \, \mathrm{sign}\left[\nabla_{x} L(G(x), r)\right],$$
where $\epsilon$ is the size of the FGSM step. The Iterative Fast Gradient Sign Method (I-FGSM) generates a stronger adversarial attack in multiple forward-backward steps. We adapt this method for the targeted disruption scenario as follows:
$$\tilde{x}_t = \mathrm{clip}\left(\tilde{x}_{t-1} - a \, \mathrm{sign}\left[\nabla_{\tilde{x}} L(G(\tilde{x}_{t-1}), r)\right]\right),$$
where $a$ is the step size and the constraint $\|\tilde{x} - x\|_\infty \le \epsilon$ is enforced by the clip function. For disruptions away from the target $r$ instead of towards $r$, using the negative gradient of the loss in the equations above is sufficient. For an adapted Projected Gradient Descent (PGD), we initialize the disrupted image $\tilde{x}_0$ randomly inside the $\epsilon$-ball around $x$ and use the I-FGSM update function.
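The update rules described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the one-layer `tanh` "generator" `G`, the mean squared distance, and the values `eps=0.05`, `step=0.01` and `n_steps=10` are hypothetical stand-ins, not the networks or settings used in the experiments. One detail the sketch makes visible: when disrupting away from $r = G(x)$, the gradient is exactly zero at the undisturbed input, so the example uses the random PGD start.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16)) / 4.0  # toy stand-in for generator weights (assumption)


def G(x):
    """Toy differentiable 'generator': a single tanh layer, for illustration only."""
    return np.tanh(W @ x)


def distance_and_grad(x_tilde, r):
    """Mean squared distance L(G(x_tilde), r) and its gradient w.r.t. x_tilde."""
    out = G(x_tilde)
    diff = out - r
    grad = W.T @ (2.0 / diff.size * diff * (1.0 - out ** 2))
    return np.mean(diff ** 2), grad


def disrupt(x, r, eps, step, n_steps, towards=True, random_start=False):
    """I-FGSM disruption; PGD if random_start=True; FGSM if n_steps=1 and step=eps."""
    x_tilde = x.copy()
    if random_start:  # PGD: start from a random point inside the eps-ball around x
        x_tilde = x + rng.uniform(-eps, eps, size=x.shape)
    sign = -1.0 if towards else 1.0  # descend towards r, ascend away from r
    for _ in range(n_steps):
        _, grad = distance_and_grad(x_tilde, r)
        x_tilde = x_tilde + sign * step * np.sign(grad)
        x_tilde = np.clip(x_tilde, x - eps, x + eps)  # keep ||x_tilde - x||_inf <= eps
    return x_tilde


x = rng.uniform(-1.0, 1.0, size=16)  # toy "image"
y = G(x)                             # ground-truth output, used as the reference r
x_pgd = disrupt(x, r=y, eps=0.05, step=0.01, n_steps=10,
                towards=False, random_start=True)
dist_pgd, _ = distance_and_grad(x_pgd, y)
print("distortion of the disrupted output:", dist_pgd)
```

With a real image translation network, `distance_and_grad` would be replaced by a forward and backward pass through the generator in an automatic differentiation framework, and the result would normally also be clipped to the valid pixel range.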
Many image translation systems are conditioned not only on the input image, but on a target class as well:
$$y = G(x, c),$$
where $x$ is the input image, $c$ is the target class and $y$ is the output image. A target class can be an attribute of a dataset, for example blond or brown-haired.
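A disruption for the data/class pair $(x, c_i)$ is not guaranteed to transfer to the data/class pair $(x, c_j)$ when $i \neq j$. We can define the problem of looking for a class transferable disruption as follows:
$$\eta = \arg\min_{\eta} \mathbb{E}_{c}\left[L(G(x + \eta, c), r)\right], \quad \text{subject to } \|\eta\|_\infty \le \epsilon.$$
We can write this empirically as an optimization problem:
$$\eta = \arg\min_{\eta} \sum_{c} L(G(x + \eta, c), r), \quad \text{subject to } \|\eta\|_\infty \le \epsilon.$$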
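In order to solve this problem, we present a novel disruption on class conditional image translation systems that increases the transferability of our disruption to different classes. We perform a modified I-FGSM disruption:
$$\tilde{x}_t = \mathrm{clip}\left(\tilde{x}_{t-1} - a \, \mathrm{sign}\left[\nabla_{\tilde{x}} L(G(\tilde{x}_{t-1}, c_k), r)\right]\right).$$
We initialize $k = 1$ and increment $k$ at every iteration, until we reach $k = K$, where $K$ is the number of classes. We then reset $k = 1$.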
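We propose a disruption which seeks to minimize the expected value of the distance to the target $r$ at every step $t$. For this, we compute this loss term at every step of an I-FGSM disruption and use it to inform our update step:
$$\tilde{x}_t = \mathrm{clip}\left(\tilde{x}_{t-1} - a \, \mathrm{sign}\left[\nabla_{\tilde{x}} \sum_{c} L(G(\tilde{x}_{t-1}, c), r)\right]\right).$$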
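The two class transferable disruptions described above differ only in which gradient drives each update: the iterative variant cycles through one class per step, while the joint variant sums the gradients over all classes at every step. The following NumPy sketch shows that difference under loud assumptions: the class-conditional "generator" is a toy `tanh` layer with one hypothetical weight matrix per class, and `K`, `eps`, `step` and `n_steps` are illustrative values only.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 5, 16                           # hypothetical number of classes and input size
W = rng.normal(size=(K, n, n)) / 4.0   # one toy weight matrix per conditioning class


def G(x, c):
    """Toy class-conditional 'generator' (illustration only)."""
    return np.tanh(W[c] @ x)


def grad_distance(x_tilde, c, r):
    """Gradient of mean((G(x_tilde, c) - r)^2) with respect to x_tilde."""
    out = G(x_tilde, c)
    diff = out - r
    return W[c].T @ (2.0 / diff.size * diff * (1.0 - out ** 2))


def class_transferable_disruption(x, r, eps, step, n_steps, joint=False):
    """Targeted I-FGSM that either cycles through classes or sums over all of them."""
    x_tilde = x.copy()
    k = 0
    for _ in range(n_steps):
        if joint:   # joint variant: K forward-backward passes per step
            grad = sum(grad_distance(x_tilde, c, r) for c in range(K))
        else:       # iterative variant: one class per step, then move to the next
            grad = grad_distance(x_tilde, k, r)
            k = (k + 1) % K
        x_tilde = np.clip(x_tilde - step * np.sign(grad), x - eps, x + eps)
    return x_tilde


def mean_distance(x_tilde, r):
    """Distance to the target r, averaged over every conditioning class."""
    return np.mean([np.mean((G(x_tilde, c) - r) ** 2) for c in range(K)])


x = rng.uniform(-1.0, 1.0, size=n)
r = np.zeros(n)  # hypothetical target output
x_iter = class_transferable_disruption(x, r, eps=0.05, step=0.01, n_steps=10)
x_joint = class_transferable_disruption(x, r, eps=0.05, step=0.01, n_steps=10, joint=True)
print(mean_distance(x, r), mean_distance(x_iter, r), mean_distance(x_joint, r))
```

The trade-off is cost: the joint variant needs one gradient per class at every step, whereas the iterative variant keeps the per-step cost of plain I-FGSM.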
Adversarial training for classifier deep neural networks was proposed by Madry et al. . It incorporates strong PGD attacks on the training data for the classifier. We propose the first adaptations of adversarial training for generative adversarial networks. Our methods, described below, are a first step in attempting to defend against image translation disruption.
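A conditional image translation GAN uses the following adversarial loss:
$$\mathcal{L} = \mathbb{E}_{x}\left[\log D(x)\right] + \mathbb{E}_{x,c}\left[\log\left(1 - D(G(x, c))\right)\right],$$
where $D$ is the discriminator. In order to make the generator resistant to adversarial examples, we train the GAN using the modified loss:
$$\mathcal{L} = \mathbb{E}_{x}\left[\log D(x)\right] + \mathbb{E}_{x,c,\eta}\left[\log\left(1 - D(G(x + \eta, c))\right)\right].$$
Instead of only training the generator to be indifferent to adversarial examples, we also train the discriminator on adversarial examples:
$$\mathcal{L} = \mathbb{E}_{x,\eta_1}\left[\log D(x + \eta_1)\right] + \mathbb{E}_{x,c,\eta_2,\eta_3}\left[\log\left(1 - D(G(x + \eta_2, c) + \eta_3)\right)\right].$$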
Blurring can be an effective test-time defense against disruptions in a gray-box scenario, where the disruptor ignores the type or magnitude of blur being used. In order to successfully disrupt a network in this scenario, we propose a spread-spectrum evasion of blur defenses that transfers to different types of blur. We perform a modified I-FGSM update
$$\tilde{x}_t = \mathrm{clip}\left(\tilde{x}_{t-1} - a \, \mathrm{sign}\left[\nabla_{\tilde{x}} L(G(f_k(\tilde{x}_{t-1})), r)\right]\right),$$
where $f_k$ is a blurring convolution operation, and we have $K$ different blurring methods with different magnitudes and types. We initialize $k = 1$ and increment $k$ at every iteration of the algorithm, until we reach $k = K$, where $K$ is the total number of blur types and magnitudes. We then reset $k = 1$.
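A minimal sketch of this cycling update, assuming the blur is applied to the input before the generator (as a pre-processing defense) and that the disruptor backpropagates through it. Everything concrete here is a hypothetical stand-in: signals are 1-D, the "generator" is a toy `tanh` layer, and the four kernels in `BLURS` (two box blurs, two binomial approximations of a Gaussian) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
W = rng.normal(size=(n, n)) / 4.0  # toy stand-in for generator weights (assumption)


def blur_matrix(kernel):
    """Matrix form of a 1-D blurring convolution (same-size output, zero padding)."""
    kernel = np.asarray(kernel, dtype=float)
    kernel = kernel / kernel.sum()
    # Column i is the blurred i-th unit vector, so blur_matrix(k) @ x blurs x.
    return np.stack([np.convolve(e, kernel, mode="same") for e in np.eye(n)], axis=1)


# Hypothetical set of blur types and magnitudes.
BLURS = [blur_matrix(k) for k in ([1, 1, 1], [1, 1, 1, 1, 1], [1, 2, 1], [1, 4, 6, 4, 1])]


def G(x):
    """Toy differentiable 'generator' (illustration only)."""
    return np.tanh(W @ x)


def grad_distance(x_tilde, B, r):
    """Gradient of mean((G(B @ x_tilde) - r)^2) w.r.t. x_tilde, through the blur B."""
    out = G(B @ x_tilde)
    diff = out - r
    return B.T @ (W.T @ (2.0 / diff.size * diff * (1.0 - out ** 2)))


def spread_spectrum_disruption(x, r, eps, step, n_steps):
    """Targeted I-FGSM that uses a different blur at every step, cycling through BLURS."""
    x_tilde = x.copy()
    for t in range(n_steps):
        B = BLURS[t % len(BLURS)]  # one forward-backward pass per step
        grad = grad_distance(x_tilde, B, r)
        x_tilde = np.clip(x_tilde - step * np.sign(grad), x - eps, x + eps)
    return x_tilde


x = rng.uniform(-1.0, 1.0, size=n)
r = np.zeros(n)  # hypothetical target output
x_ss = spread_spectrum_disruption(x, r, eps=0.05, step=0.01, n_steps=12)
for B in BLURS:
    print(np.mean((G(B @ x) - r) ** 2), np.mean((G(B @ x_ss) - r) ** 2))
```

Because only one blur is used per step, the per-step cost matches plain I-FGSM regardless of how many blur types are in the set.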
In this section we demonstrate that our proposed image-level FGSM, I-FGSM and PGD-based disruptions are able to disrupt different recent image translation architectures such as GANimation , StarGAN , pix2pixHD  and CycleGAN. In Section 4.1, we show that the ideal formulation of an image-level disruption presented in Section 3.1, is the most effective at producing large distortions in the output. In Section 4.2, we demonstrate that both our iterative class transferable disruption and joint class transferable disruption are able to transfer to different conditioning classes. In Section 4.3, we test our disruptions against two defenses in a white-box setting. We show that our proposed G+D adversarial training is most effective at alleviating disruptions, although strong disruptions are able to overcome this defense. Finally, in Section 4.4 we show that blurring is an effective defense against disruptions in a gray-box setting, in which the disruptor does not know the type or magnitude of the pre-processing blur. We then demonstrate that our proposed spread-spectrum adversarial disruption evades different blur defenses in this scenario. All disruptions in our experiments use .
image translation architectures. We use an open-source implementation of GANimation trained for 37 epochs on the CelebA dataset for 80 action units (AU) from the Facial Action Unit Coding System (FACS). We test GANimation on 50 random images from the CelebA dataset (4,000 disruptions). We use the official open-source implementation of StarGAN, trained on the CelebA dataset for the five attributes black hair, blond hair, brown hair, gender and aged. We test StarGAN on 50 random images from the CelebA dataset (250 disruptions). For pix2pixHD we use the official open-source implementation, which was trained for label-to-street view translation on the Cityscapes dataset . We test pix2pixHD on 50 random images from the Cityscapes test set. For CycleGAN we use the official open-source implementation for both the zebra-to-horses and photograph-to-Monet painting translations. We disrupt 100 images from both datasets. We use the pre-trained models provided in the open-source implementations, unless specifically noted.
In order to develop intuition on the relationship between our main and distortion metrics and the qualitative distortion caused on image translations, we display in Fig. 3 a scale that shows qualitative examples of disrupted outputs and their respective distortion metrics. We can see that when the and metric becomes larger than we have very noticeable distortions in the output images. Throughout the experiments section, we report the percentage of successfully disrupted images ( dis.), which correspond to the percentage of outputs presenting a distortion .
|GANimation (CelebA, )||0.121||0.024||1.5%||0.212||0.098||93.9%||0.190||0.077||83.7%|
We show that we are able to disrupt the StarGAN, pix2pixHD and CycleGAN architectures with very successful results using either I-FGSM or PGD in Table 1. Our white-box disruptions are effective on several recent image translation architectures and several different translation domains. GANimation reveals itself to be more robust to disruptions of magnitude than StarGAN, although it can be successfully disrupted with stronger disruptions (). The metrics reported in Table 1 are the average of the and errors on all dataset samples, where we compute the error for each sample by comparing the ground-truth output with the disrupted output , using the following formulas and . For I-FGSM and PGD we use steps with step size of . We use our ideal formulation for all disruptions.
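The reporting scheme described above (per-sample errors averaged over the dataset, plus a percentage of outputs whose distortion exceeds a threshold) can be sketched as follows. The exact normalization is an assumption here: the sketch uses mean absolute error for the first metric and mean squared error for the second, and the `threshold` argument is a hypothetical parameter rather than the value used in the experiments.

```python
import numpy as np


def distortion_metrics(y, y_tilde, threshold):
    """Average per-sample distortions and the percentage of disrupted outputs.

    y, y_tilde: arrays of shape (num_samples, ...) holding ground-truth and
    disrupted outputs. Per-sample normalization is an assumption of this sketch.
    """
    y = np.asarray(y, dtype=float)
    y_tilde = np.asarray(y_tilde, dtype=float)
    diff = (y_tilde - y).reshape(len(y), -1)
    l1 = np.abs(diff).mean(axis=1)   # per-sample mean absolute error
    l2 = (diff ** 2).mean(axis=1)    # per-sample mean squared error
    pct_disrupted = 100.0 * np.mean(l2 >= threshold)
    return l1.mean(), l2.mean(), pct_disrupted


# Two toy "outputs" of four pixels each: one strongly and one weakly distorted.
y = np.zeros((2, 4))
y_tilde = np.array([[0.5, 0.5, 0.5, 0.5],
                    [0.1, 0.1, 0.1, 0.1]])
mean_l1, mean_l2, pct = distortion_metrics(y, y_tilde, threshold=0.05)
print(mean_l1, mean_l2, pct)
```

In this toy case only the first sample exceeds the chosen threshold, so half of the outputs count as disrupted.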
We show examples of successfully disrupted image translations on GANimation and StarGAN in Fig. 2 using I-FGSM. We observe different qualitative behaviors for disruptions on different architectures. Nevertheless, all of our disruptions successfully make the modifications in the image obvious for any observer, thus avoiding any type of undetected manipulation of an image.
In Section 3.1, we derived an ideal disruption for our success metric. In order to execute this disruption we first need to obtain the ground-truth output of the image translation network for the image being disrupted. We push the disrupted output to be maximally different from . We compare this ideal disruption (designated as Away From Output in Table 2) to targeted disruptions with different targets such as a black image, a white image and random noise. We also compare it to a less computationally intensive disruption called Away From Input, which seeks to maximize the distortion between our disrupted output and our original input .
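The variants compared here differ only in the reference they use and in whether the disruption moves towards or away from it. A small sketch of that bookkeeping follows; the function name `disruption_setup`, the string labels and the assumption that pixel values lie in [-1, 1] (so black is -1 and white is +1) are all hypothetical choices made for illustration.

```python
import numpy as np


def disruption_setup(name, x, generator, rng):
    """Return (reference r, direction) for a named disruption variant.

    direction = +1 means maximize the distance to r (away from r),
    direction = -1 means minimize it (towards r).
    """
    if name == "away_from_output":
        return generator(x), +1          # needs one extra forward pass to get G(x)
    if name == "away_from_input":
        return x.copy(), +1              # no forward pass needed for the reference
    if name == "towards_black":
        return np.full_like(x, -1.0), -1
    if name == "towards_white":
        return np.full_like(x, 1.0), -1
    if name == "towards_random_noise":
        return rng.uniform(-1.0, 1.0, size=x.shape), -1
    raise ValueError(f"unknown disruption variant: {name}")


calls = {"n": 0}


def toy_generator(x):
    """Placeholder generator that counts how often it is evaluated."""
    calls["n"] += 1
    return 0.5 * x


rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=8)
r_out, d_out = disruption_setup("away_from_output", x, toy_generator, rng)
calls_after_output = calls["n"]
r_in, d_in = disruption_setup("away_from_input", x, toy_generator, rng)
calls_after_input = calls["n"]
r_black, d_black = disruption_setup("towards_black", x, toy_generator, rng)
```

The call counter makes the cost difference concrete: building the Away From Input reference does not evaluate the generator, while Away From Output does.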
We display the results for the StarGAN architecture on the CelebA dataset in Table 2. As expected, the Away From Output disruption is the most effective using I-FGSM and PGD. All disruptions show similar effectiveness when using one-step FGSM. Away From Input seems similarly effective to the Away From Output for I-FGSM and PGD, yet it does not have to compute , thus saving one forward pass of the generator.
Finally, we show in Table 3 comparisons of our image-level Away From Output disruption to the feature-level attack for Variational Autoencoders (VAE) presented in Kos et al. . Although in Kos et al.  attacks are only targeted on the latent vector of a VAE, here we attack every possible intermediate feature map of the image translation network using this attack. The other two attacks presented in Kos et al.  cannot be applied to the image-translation scenario. We disrupt the StarGAN architecture on the CelebA dataset. Both disruptions use the 10-step PGD optimization formulation with . We notice that while both disruptions are successful, our image-level formulation obtains stronger distortions on average.
|Towards Random Noise||0.509||0.409||0.607||0.532||0.594||0.511|
|Away From Input||0.449||0.319||1.086||1.444||1.054||1.354|
|Away From Output||0.465||0.335||1.156||1.574||1.119||1.480|
|Kos et al. |
Class conditional image translation systems such as GANimation and StarGAN are conditional GANs. Both are conditioned on an input image. Additionally, GANimation is conditioned on the target AU intensities and StarGAN is conditioned on a target attribute. As the disruptor we do know which image the malicious actor wants to modify (our image), and in some scenarios we might know the architecture and weights that they are using (white-box disruption), yet in almost all cases we do not know whether they want to put a smile on the person’s face or close their eyes, for example. Since this non-perfect knowledge scenario is probable, we want a disruption that transfers to all of the classes in a class conditional image translation network.
In our experiments we have noticed that attention-driven face manipulation systems such as GANimation present an issue with class transfer. GANimation generates a color mask as well as an attention mask designating the parts of the image that should be replaced with the color mask.
In Fig. 4, we present qualitative examples of our proposed iterative class transferable disruption and joint class transferable disruption. The goal of these disruptions is to transfer to all action unit inputs for GANimation. We compare this to the unsuccessful disruption transfer case where the disruption is targeted to the incorrect AU. Columns (e) and (f) of Fig. 4 show our iterative class transferable disruption and our joint class transferable disruption successfully disrupting the deepfakes, whereas attempting to disrupt the system using the incorrect AU is not effective (column (c)).
Quantitative results demonstrating the superiority of our proposed methods can be found in Table 4. For our disruptions, we use iterations of PGD, magnitude and a step of .
For our second experiment, presented in Table 5, instead of disrupting the input image such that the output is visibly distorted, we disrupt the input image such that the output is the identity. In other words, we want the input image to be untouched by the image translation network. We use iterations of I-FGSM, magnitude and a step of .
|Iterative Class Transferable|
|Joint Class Transferable|
|Iterative Class Transferable|
|Joint Class Transferable|
We present results for our generator adversarial training and G+D adversarial training proposed in Section 3.3. In Table 6, we can see that generator adversarial training is somewhat effective at alleviating a strong 10-step PGD disruption. G+D adversarial training proves to be even more effective than generator adversarial training.
Additionally, in the same Table 6, we present results for a Gaussian blur test-time defense. We disrupt this blur defense in a white-box manner. With perfect knowledge of the pre-processing, we can simply backpropagate through that step and obtain a disruption. We achieve the biggest resistance to disruption by combining blurring and G+D adversarial training, although strong PGD disruptions are still relatively successful. Nevertheless, this is a first step towards robust image translation networks.
We use a 10-step PGD () for both generator adversarial training and G+D adversarial training. We trained StarGAN for iterations using a batch size of . We use an FGSM disruption , a 10-step I-FGSM disruption with step size and a 10-step PGD disruption with step size .
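| | FGSM $L_1$ | FGSM $L_2$ | FGSM % dis. | I-FGSM $L_1$ | I-FGSM $L_2$ | I-FGSM % dis. | PGD $L_1$ | PGD $L_2$ | PGD % dis. |
|---|---|---|---|---|---|---|---|---|---|
| Adv. G. Training | 0.125 | 0.032 | 15.6 | 0.317 | 0.183 | 96 | 0.319 | 0.186 | 95.2 |
| Adv. G+D Training | 0.141 | 0.036 | 17.2 | 0.283 | 0.138 | 87.6 | 0.281 | 0.136 | 87.6 |
| Adv. G. Train. + Blur | 0.138 | 0.039 | 21.6 | 0.225 | 0.100 | 63.2 | 0.224 | 0.099 | 61.2 |
| Adv. G+D Train. + Blur | 0.116 | 0.026 | 10.4 | 0.184 | 0.062 | 36.8 | 0.184 | 0.062 | 37.2 |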
Blurring can be an effective defense against our adversarial disruptions in a gray-box setting where the disruptor does not know the type and magnitude of blurring being used for pre-processing. In particular, low magnitude blurring can render a disruption useless while preserving the quality of the image translation output. We show an example on the StarGAN architecture in Fig. 5.
If the image manipulator is using blur to deter adversarial disruptions, the adversary might not know what type and magnitude of blur are being used. In this Section, we evaluate our proposed spread-spectrum adversarial disruption which seeks to evade blur defenses in a gray-box scenario, with high transferability between types and magnitudes of blur. In Fig. 6 we present the proportion of test images successfully disrupted () for our spread-spectrum method, a white-box perfect knowledge disruption, an adaptation of EoT  to the blur scenario and a disruption which does not use any evasion method. We notice that both our method and EoT defeat diverse magnitudes and types of blur and achieve relatively similar performance. Our method achieves better performance on the Gaussian blur scenarios with high magnitude of blur, whereas EoT outperforms our method on the box blur cases, on average. Our iterative spread-spectrum method is roughly times faster than EoT since it only has to perform one forward-backward pass per iteration of I-FGSM instead of to compute the loss. Additionally, in Fig. 7, we present random qualitative samples, which show the effectiveness of our method over a naive disruption.
In this paper we presented a novel approach to defend against image translation-based deepfake generation. Instead of trying to detect whether an image has been modified after the fact, we defend against the non-authorized manipulation by disrupting conditional image translation facial manipulation networks using adapted adversarial attacks.
We operationalized our definition of a successful disruption, which allowed us to formulate an ideal disruption that can be undertaken using traditional adversarial attack methods such as FGSM, I-FGSM and PGD. We demonstrated that this disruption is superior to other alternatives. Since many face modification networks are conditioned on a target attribute, we proposed two disruptions which transfer from one attribute to another and showed their effectiveness over naive disruptions. In addition, we proposed adversarial training for GANs, which is a first step towards image translation networks that are resistant to disruption. Finally, blurring is an effective defense against naive disruptions in a gray-box scenario and can allow a malicious actor to bypass the disruption and modify the image. We presented a spread-spectrum disruption which evades a wide range of blur defenses.
International Conference on Machine Learning, pp. 284–293.
StarGAN: unified generative adversarial networks for multi-domain image-to-image translation. In , pp. 8789–8797.
The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223.
Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.