Detecting Deep-Fake Videos from Appearance and Behavior

04/29/2020 ∙ by Shruti Agarwal, et al.

Synthetically-generated audios and videos – so-called deep fakes – continue to capture the imagination of the computer-graphics and computer-vision communities. At the same time, the democratization of access to technology that can create sophisticated manipulated video of anybody saying anything continues to be of concern because of its power to disrupt democratic elections, commit small to large-scale fraud, fuel dis-information campaigns, and create non-consensual pornography. We describe a biometric-based forensic technique for detecting face-swap deep fakes. This technique combines a static biometric based on facial recognition with a temporal, behavioral biometric based on facial expressions and head movements, where the behavioral embedding is learned using a CNN with a metric-learning objective function. We show the efficacy of this approach across several large-scale video datasets, as well as in-the-wild deep fakes.


1 Introduction

Recent advances in computer graphics, computer vision, and machine learning have made it increasingly easy to synthesize compelling fake audio, images, and video. In the audio domain, highly realistic audio synthesis is now possible in which a neural network, given enough sample recordings, can learn to synthesize speech in your voice [25]. In the static image domain, highly realistic images of people can now be synthesized using generative adversarial networks (GANs) [18, 19]. And, in the video domain, highly realistic videos can be created of anybody saying and doing anything that their creator wants [32]. These so-called deep-fake videos can be highly entertaining but can also be easily weaponized.

The creation of non-consensual pornography, for example, was the first use of deep fakes, and continues to pose a threat particularly to women, ranging from celebrities to journalists, and those that simply attract unwanted attention [5]. In response, several U.S. states have recently passed legislation trying to mitigate the harm posed by this content, and similar legislation is being considered at the U.S. federal and international levels. In addition, the democratization of access to sophisticated technology to synthesize highly realistic fake audio, image, and videos promises to add to our struggle to contend with dis- and mis-information campaigns designed to commit small- to large-scale fraud, disrupt democratic elections, and sow civil unrest.

We describe a forensic technique to authenticate face-swap deep-fake videos in which a person’s facial identity is replaced with another’s. The most common approach to detecting these deep fakes leverages low-level pixel artifacts introduced during the synthesis process. These approaches, however, are vulnerable to simple counter-measures such as trans-coding and resizing, and often struggle to generalize to new synthesis techniques (see Section 2 for more details).

In contrast, in our approach we leverage a more fundamental flaw in deep fakes: the face-swap deep fake is simply not the person it purports to be. In particular, we combine a static biometric based on facial identity with a temporal, behavioral biometric based on facial expressions and head movements. The former leverages standard techniques from face recognition, while the latter leverages a learned behavioral embedding using a convolutional neural network (CNN) powered by a metric-learning objective function. These two biometric signals are used because we observe that the facial behaviors in a face-swap deep fake remain those of the original individual, while the facial identity is of a different individual. By matching the behavioral and facial identities against a set of authentic reference videos, inconsistencies in the matching identities can reveal face-swap deep fakes. Our experimental results against thousands of unique identities spanning five large datasets support this hypothesis.

Our behavioral model is constructed by stacking together static FAb-Net features [34] over time (four seconds). By combining many FAb-Net features, which themselves capture static head pose, facial landmarks, and facial expression, we are able to capture spatiotemporal behaviors. Unlike previous work for modeling spatiotemporal human behavior [2] that required a specific model for each person, we will show that the metric-learning objective used by our CNN to learn this behavioral feature allows us to build a generic model that can be trained on one group of people in one dataset and generalize to previously unseen people in different datasets. We summarize our primary contributions as:

  • a novel spatiotemporal behavior model for capturing facial expressions and head movement that generalizes to previously unseen people;

  • a novel combination of appearance and behavioral biometrics for detecting face-swap deep fake videos;

  • a large-scale evaluation across five large data sets consisting of thousands of real and deep-fake videos, the results of which show that our approach is highly effective at detecting face-swap deep fakes; and

  • an analysis of the underlying methodology and results that provides insight into the specific nature of the learned features, and the robustness of our approach across different datasets, manipulations, and qualities of deep fakes.

In the next section, we place our work in context relative to previous efforts. We then describe our technique in detail and show the efficacy of our approach across five large-scale video datasets, as well as in-the-wild deep fakes.

2 Related Work

We begin by describing the most relevant work in both the creation and detection of deep fakes.

2.1 Generating Deep Fakes

The term deep fake is often used to describe synthetically-generated images and videos (typically of people) generated from a range of different techniques, including DeepFake FaceSwap [8], FS-GAN [24], Neural Textures [29], Face2Face [31], and FaceSwap [13].

The popular DeepFake FaceSwap software uses a generative adversarial network (GAN) [15] to create so-called face-swap deep fakes in which one person’s identity in a video is replaced with another person’s identity. This approach has been popularized by, for example, adding the actor Nicolas Cage into movies in which he never appeared, including his highly entertaining appearance in The Sound of Music (youtu.be/MHkZEpfUnAA). While this technique can generate highly convincing fakes, it often requires a significant amount of training data. Faceswap-GAN [12] uses an autoencoder with an adversarial loss to generate deep fakes, similarly requiring large amounts of training data. The more recent FS-GAN [24], on the other hand, creates high-quality fakes using a recurrent neural network-based reenactment with less training data.

Neural Textures [29] is a generic image synthesis framework that combines traditional graphics rendering with more modern learnable components. This framework can be used for novel-view synthesis, scene editing, and the creation of so-called lip-sync deep fakes in which a person’s mouth is modified to be consistent with a new audio track. This work generalizes earlier work that was designed to create lip-sync deep fakes on a per-individual basis [30].

Unlike these learning-based methods, some methods rely on more traditional computer-graphics approaches to create deep fakes. Face2Face [31], for example, allows for the creation of so-called puppet-master deep fakes in which one person’s (the master’s) facial expressions and head movements are mapped onto another person (the puppet). Similarly, FaceSwap [13] builds a 3-D facial model of one person and aligns it to another person. These techniques allow facial expressions to be re-enacted in real time using standard consumer cameras. In a related puppet-master technique, the authors of [23] build a photo-realistic avatar GAN that synthesizes faces in arbitrary expressions and orientations in real time on a mobile device.

2.2 Detecting Deep Fakes

There is a significant literature in the general area of digital forensics [14]. Here we focus only on techniques for detecting the types of deep-fake videos described in the previous section.

Low-level approaches focus on detecting pixel-level artifacts introduced by the synthesis process. One such approach uses a CNN to detect pixel-level artifacts that arise from warping a face region onto the target [21]. In [17], the authors train a Siamese network to find inconsistencies in camera metadata (e.g., focal length, ISO, aperture size, exposure time) across small image patches. An image is then authenticated by using this network to determine whether each image patch is consistent with the same imaging pipeline. Although not necessarily focused on deep fakes, ManTra-Net [35] uses end-to-end training of a fully convolutional network to detect and localize different types of image manipulation, including splicing, removal, and copy-move. In [39], the authors detect and localize facial manipulations by using a network to holistically classify a face as manipulated or not. A second network exploits low-level steganographic features in small patches to determine whether a face region is consistent with the rest of the image. A final prediction is generated by combining these two predictions. The authors of [37] and [38] showed that GAN-generated content contains distinct digital fingerprints that can be learned and used to classify images as GAN-generated or not. MesoNet [1] takes a mid-level approach, building a convolutional neural network with a small number of layers that learns mesoscopic artifacts. These mid-level artifacts tend to be more resilient, particularly to video compression.

The benefit of these and similar low-level approaches is that they can automatically extract artifacts that differentiate synthetic from real content. The drawback is that they can be highly sensitive to intentional or unintentional laundering, including resizing and trans-coding, as well as to adversarial attacks [4], and they often fail to extrapolate to novel datasets. In contrast, the high-level approaches described next tend to be more resilient to these types of laundering and attacks and are more likely to generalize to novel datasets.

High-level approaches focus on more semantically meaningful features. For example, [36] recognized that the creation of face-swap deep fakes introduces inconsistencies between the head pose estimated from the central, swapped portion of the face and that estimated from the surrounding, original head. These inconsistencies leverage 3-D geometry, which is currently difficult for synthesis techniques to correct. Because training datasets often do not depict people with their eyes closed, it was observed that early face-swap deep fakes contained an abnormally low number of eye blinks [20]. More recent deep fakes, however, seem to have corrected this problem. A related technique [7] exploits spatial and temporal physiological signals that are consistent across real videos but disrupted in face-swap deep fakes. We believe that, because current synthesis techniques are frame-based, incorporating these types of semantic and temporal dynamics is essential to staying slightly ahead in the cat-and-mouse game of synthesis and detection.

The work of [2] is most similar to ours. In that work, the authors analyzed hours of video of specific individuals (in their case, various world leaders and presidential candidates) in order to extract distinct and predictable patterns of facial expressions and head movements. Specifically, from each ten-second clip of an individual, the authors extracted frame-by-frame facial expressions (parameterized as action units [11]) and head rotation about two axes. The correlations between all pairs of these features yielded a feature vector capturing an individual’s temporal mannerisms. A one-class SVM [28] was then employed to classify each ten-second video clip as being consistent or not with the learned mannerisms of an individual. The benefit of this approach is that it captures temporal mannerisms that current frame-based, deep-fake synthesis techniques are not (yet) able to synthesize. Another benefit is that this approach, unlike pixel-based detection schemes, is more robust to laundering attacks and better able to generalize to a large class of deep fakes, from face-swap to lip-sync and puppet-master. The drawback is that it requires significant effort to create a model for each individual, and it is almost certainly the case that the hand-crafted, correlation-based features are neither optimal nor capture all of the distinct properties that might distinguish a real from a fake video.

Building on this earlier work by [2], we employ a convolutional neural network (CNN) with a metric-learning objective function to learn a more discriminating behavioral biometric. We pair this learned biometric with a facial biometric in order to determine if a person’s identity in video clips as short as four seconds is consistent with the facial and behavioral properties extracted from reference videos. This approach is specifically targeted towards face-swap deep fakes in which the face of one person has been replaced with another.

Figure 1: An overview of our authentication pipeline (see Section 3.3).

3 Biometrics

We next describe the two biometric measurements that underlie our forensic detection scheme: a biometric based on temporal behavior (facial expressions and head movements) and a biometric based on static facial features.

3.1 Behavior

In [34], the authors proposed a self-supervised, encoder-decoder network (Facial Attributes-Net, or FAb-Net) trained to embed the movement between video frames into a common low-dimensional space. The authors showed that the network, in turn, learns an embedding space that represents head pose, facial landmarks, and facial expression. We use these per-frame FAb-Net features as building blocks to measure spatiotemporal biometric behavior. Specifically, a four-second video clip of a person talking is first reduced to a feature matrix in which each column is the FAb-Net feature for one frame.
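As a concrete illustration, the following minimal Python/PyTorch sketch stacks per-frame embeddings into the feature matrix described above. The fabnet callable, its feature dimension, and the helper’s name are assumptions standing in for the FAb-Net encoder of [34], not code from the paper.

    import numpy as np
    import torch

    def behavior_input(face_frames, fabnet, device="cpu"):
        """Stack per-frame FAb-Net-style features into a D x T matrix (columns = frames).

        face_frames : iterable of aligned face crops, each an H x W x 3 float array in [0, 1]
        fabnet      : any callable mapping a (1, 3, H, W) tensor to a (1, D) embedding
                      (a hypothetical stand-in for the FAb-Net encoder)
        """
        feats = []
        with torch.no_grad():
            for frame in face_frames:
                x = torch.from_numpy(frame).permute(2, 0, 1).unsqueeze(0).float().to(device)
                feats.append(fabnet(x).squeeze(0).cpu().numpy())  # one D-dim feature per frame
        return np.stack(feats, axis=1)  # shape (D, T): each column is one frame's feature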

FAb-Net nicely captures frame-based facial movements and expressions but is, by design, identity-agnostic. We seek to learn a modified embedding that captures facial movements and expressions but also distinguishes these features across individuals. That is, starting with the static FAb-Net features, we learn a low-dimensional mapping that encodes identity-specific spatiotemporal behavior.

Given FAb-Net feature matrices $X_1, \ldots, X_n$ for $n$ four-second video clips with identity labels $y_1, \ldots, y_n$, we learn a mapping $\Phi(\cdot)$ that projects each $X_i$ to an embedding $b_i = \Phi(X_i)$ such that the similarity $S_{ij}$ between $b_i$ and $b_j$ is high if $y_i = y_j$ (positive sample) and low if $y_i \neq y_j$ (negative sample). Because the output $b_i$ is normalized to lie on a unit sphere, the cosine similarity between two embeddings is used to compute $S_{ij}$.

To learn the mapping $\Phi(\cdot)$, a CNN is trained with a multi-similarity metric-learning objective function [33]. Following the approach in [33], the loss for a mini-batch is computed as follows. First, for every input $x_i$, hard positive and negative samples are selected. A negative sample $x_j$ (where $y_j \neq y_i$) is selected if $S_{ij} > \min_{k: y_k = y_i} S_{ik} - \epsilon$, where $\epsilon$ is a small margin. This formulation selects the most confusing negative samples, whose similarity with the input is larger than the minimum similarity between the input and all positive samples. Similarly, a positive sample $x_j$ (where $y_j = y_i$) is selected if $S_{ij} < \max_{k: y_k \neq y_i} S_{ik} + \epsilon$. Here, the most meaningful positive samples are selected by comparing to the negative samples most similar to the input.

A soft weighting is then applied to rank these selected samples according to their importance for learning the desired embedding space. For a given input $x_i$, let $\mathcal{N}_i$ and $\mathcal{P}_i$ denote the selected negative and positive samples, which are weighted as follows:

$w_{ij}^{+} = \dfrac{e^{-\alpha(S_{ij}-\lambda)}}{1 + \sum_{k \in \mathcal{P}_i} e^{-\alpha(S_{ik}-\lambda)}}, \qquad w_{ij}^{-} = \dfrac{e^{\beta(S_{ij}-\lambda)}}{1 + \sum_{k \in \mathcal{N}_i} e^{\beta(S_{ik}-\lambda)}},$   (1)

where $\alpha$, $\beta$, and $\lambda$ are hyper-parameters. Finally, the loss over a mini-batch of size $m$ is:

$\mathcal{L} = \dfrac{1}{m}\sum_{i=1}^{m}\left\{\dfrac{1}{\alpha}\log\left[1 + \sum_{k \in \mathcal{P}_i} e^{-\alpha(S_{ik}-\lambda)}\right] + \dfrac{1}{\beta}\log\left[1 + \sum_{k \in \mathcal{N}_i} e^{\beta(S_{ik}-\lambda)}\right]\right\}.$   (2)

By performing supervised training using the identity labels in the training data, the network is encouraged to learn an embedding space that clusters the biometric signatures by identity.
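The sketch below illustrates this mining and loss computation for a mini-batch of unit-length embeddings, following the multi-similarity formulation of [33]; the hyper-parameter defaults are illustrative rather than the values used in the paper.

    import torch

    def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=1.0, eps=0.1):
        """Multi-similarity loss over a mini-batch (embeddings assumed L2-normalized,
        so dot products equal cosine similarities). alpha, beta, lam, eps are the
        hyper-parameters of Eqs. (1) and (2); the values here are illustrative."""
        sim = embeddings @ embeddings.t()            # pairwise similarities S_ij
        m = embeddings.size(0)
        losses = []
        for i in range(m):
            pos = (labels == labels[i]).clone()
            pos[i] = False                           # exclude the anchor itself
            neg = labels != labels[i]
            if pos.sum() == 0 or neg.sum() == 0:
                continue
            pos_sim, neg_sim = sim[i][pos], sim[i][neg]
            # Hard mining: keep negatives more similar than the least-similar positive
            # (minus a margin) and positives less similar than the most-similar negative
            # (plus a margin).
            hard_neg = neg_sim[neg_sim + eps > pos_sim.min()]
            hard_pos = pos_sim[pos_sim - eps < neg_sim.max()]
            if hard_neg.numel() == 0 or hard_pos.numel() == 0:
                continue
            # Per-anchor loss of Eq. (2); its gradients yield the soft weights of Eq. (1).
            pos_term = torch.log(1.0 + torch.exp(-alpha * (hard_pos - lam)).sum()) / alpha
            neg_term = torch.log(1.0 + torch.exp(beta * (hard_neg - lam)).sum()) / beta
            losses.append(pos_term + neg_term)
        return torch.stack(losses).mean() if losses else sim.new_zeros(())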

Our model is trained on the VoxCeleb2 dataset [6], containing over a million utterances from over 6,000 unique identities. The size of the input feature matrix is fixed, corresponding to a four-second video clip (this clip length was selected because it is the minimum length of the VoxCeleb2 utterances). We use the ResNet-101 network architecture [16], where the input layer of the network is modified to accept our feature matrix. A fully-connected output layer is added on top of this network, forming our final feature vector, which is normalized to be zero-mean and unit-length before computing the loss. We name this network Behavior-Net.
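A minimal PyTorch sketch of such a network is shown below; the single-channel input convolution and the 512-dimensional output are illustrative assumptions, since the exact embedding size is not restated here.

    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision.models import resnet101

    class BehaviorNet(nn.Module):
        """Sketch of a Behavior-Net-style encoder: a ResNet-101 backbone whose first
        convolution accepts a one-channel D x T FAb-Net feature matrix, followed by a
        fully-connected layer producing a zero-mean, unit-length behavioral embedding."""

        def __init__(self, embed_dim=512):  # embed_dim is an illustrative choice
            super().__init__()
            backbone = resnet101(weights=None)
            # Treat the stacked FAb-Net feature matrix as a one-channel "image".
            backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
            backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
            self.backbone = backbone

        def forward(self, x):                      # x: (batch, 1, D, T)
            z = self.backbone(x)
            z = z - z.mean(dim=1, keepdim=True)    # zero-mean, as described above
            return F.normalize(z, dim=1)           # unit length, so cosine similarity applies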

The CNN is trained with mini-batch optimization. Following [33], in each mini-batch a set of identities is randomly selected; for each identity, eight utterance videos (each of variable length) are randomly selected, and from each a random four-second frame sequence is extracted. All other optimization hyper-parameters are the same as in [33].

Even though the Behavior-Net features are trained only on the VoxCeleb2 dataset, as described below, these features will be used to classify different identities across different datasets. This generalizability is practically useful and suggests that the underlying Behavior-Net captures intrinsic properties of people.

3.2 Appearance

Rapid advances in deep learning and access to large datasets have led to a revolution in face recognition. We leverage one such fairly straightforward approach, VGG [26], a 16-layer CNN trained to perform face recognition on a dataset of 2,622 identities. VGG yields a distinct 4,096-D face descriptor per face, per video frame. These descriptors are averaged over the frames of the four-second video clip to yield a single facial descriptor.

Faces for this facial biometric and the behavioral biometric are extracted using OpenFace [3]. Once localized and extracted from a video frame, each face is aligned and re-scaled to a fixed size.
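A small sketch of how the per-frame facial descriptors might be pooled into a single clip-level biometric is shown below; extract_aligned_face and vgg_face are hypothetical callables standing in for the OpenFace and VGG components described above.

    import numpy as np

    def clip_face_descriptor(frames, extract_aligned_face, vgg_face):
        """Average per-frame face descriptors over a clip to obtain one facial biometric.

        frames               : iterable of video frames (numpy arrays)
        extract_aligned_face : hypothetical helper that localizes, aligns, and rescales the face
        vgg_face             : any callable mapping an aligned face to a 1-D descriptor
        """
        descriptors = [vgg_face(extract_aligned_face(f)) for f in frames]
        d = np.mean(np.stack(descriptors, axis=0), axis=0)
        return d / np.linalg.norm(d)  # unit length so that dot products are cosine similarities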

3.3 Authentication

Given authentic four-second video clips for each unique identity, two reference sets are created from the VGG facial and Behavior-Net features. Define $F_i$ to be the real-valued matrix consisting of the VGG features for the reference video clips of identity $i$, and define $B_i$ to be the real-valued matrix consisting of the Behavior-Net features for the same video clips. Each column of the matrices $F_i$ and $B_i$ contains the VGG or Behavior-Net feature, respectively, for a single video clip.

Given these reference sets, a previously unseen four-second video clip is authenticated as follows. First, extract its facial and Behavior-Net features, $f$ and $b$. Next, find the identities, $i_f$ and $i_b$, in the reference sets with the most similar features using a cosine-similarity metric:

$i_f = \arg\max_i \max_j \cos(f, F_i[:,j]), \qquad i_b = \arg\max_i \max_j \cos(b, B_i[:,j]),$   (3)

and let $s_f = \max_j \cos(f, F_{i_f}[:,j])$ denote the corresponding facial similarity.

With these matched identities, a video clip is classified as real or fake following two simple rules (see also Fig. 1):

  1. A video clip is classified as real if the facial and Behavior-Net identities are the same, $i_f = i_b$, and if the facial similarity is above a specified threshold, $s_f \geq \tau$ (i.e., a close facial match is found).

  2. A video clip is classified as fake if either

    (a) the matched identities are different, $i_f \neq i_b$, or

    (b) the facial similarity is below the threshold, $s_f < \tau$.

The rationale for the asymmetric treatment of the facial and Behavior-Net similarities is that in a face-swap deep fake, the facial identity of a person is modified but typically not the behavior. As a result, it is possible for a person’s facial identity to be significantly different in a test video than in their reference videos, in which case, we should not be confident of the facial identity match.
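These two rules can be summarized in a few lines. The sketch below assumes per-identity reference matrices whose columns are clip features (the $F_i$ and $B_i$ above), with tau the facial-similarity threshold; it is an illustration of Section 3.3, not the authors' code.

    import numpy as np

    def cos_sim(u, M):
        """Cosine similarity between a feature vector u and each column of matrix M."""
        M = M / np.linalg.norm(M, axis=0, keepdims=True)
        return (u / np.linalg.norm(u)) @ M

    def authenticate(f, b, face_refs, behavior_refs, tau):
        """Classify a clip as 'real' or 'fake' from its VGG feature f and Behavior-Net feature b.

        face_refs, behavior_refs : dicts mapping identity -> reference feature matrix
        tau                      : facial-similarity threshold
        """
        # Eq. (3): best-matching identity under each biometric.
        face_scores = {i: cos_sim(f, F).max() for i, F in face_refs.items()}
        behav_scores = {i: cos_sim(b, B).max() for i, B in behavior_refs.items()}
        i_f = max(face_scores, key=face_scores.get)
        i_b = max(behav_scores, key=behav_scores.get)
        # Rule 1: same identity and a close facial match -> real; otherwise (rule 2) fake.
        return "real" if (i_f == i_b and face_scores[i_f] >= tau) else "fake"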

4 Results

We begin by describing the five datasets used for validation and analysis. We then describe the overall detection accuracy, followed by an analysis of robustness and of the relative importance of the appearance and behavioral features.

4.1 Datasets

The world leaders dataset (WLDR)  [2] consists of several hours of real videos of five U.S. political figures, their political impersonators, and face-swap deep fakes between each political figure and their corresponding impersonator. We augmented this dataset with five new U.S. political figures.

Figure 2: Receiver operating curves (ROC) for each of the five datasets (WLDR, FF, DFD, DFDC-P, CDF) and the average across all datasets (top-left panel). The green/red curves correspond to the accuracy of classifying real/fake videos. The horizontal axis corresponds to the VGG threshold ($\tau$).

The FaceForensics++ dataset (FF) [27] consists of YouTube videos of different people, mostly news anchors and video bloggers. Each video was used to create four types of deep fakes: DeepFake, FaceSwap, Face2Face, and Neural Textures. We use only the first two categories, as only these are face-swap deep fakes. After removing videos with multiple people or with identities that overlap with other datasets, we were left with a set of real videos and their corresponding deep-fake videos.

The DeepFake Detection dataset (DFD) [10] by Google/Jigsaw consists of real videos and face-swap deep fakes of paid and consenting actors. Each individual was asked to perform tasks like walking, hugging, and talking, with expressions ranging from happy to angry, neutral, and disgusted. For our analysis, we selected only those videos in which the individual was talking, yielding a subset of real and deep-fake videos.

The Deep Fake Detection Challenge Preview dataset (DFDC-P) [9] consists of real and face-swap deep-fake videos of consenting individuals of various genders, ages, and ethnic groups. It is one of the largest deep-fake datasets, with videos of varying quality, viewpoints, lighting conditions, and scenes.

The Celeb-DF (Ver. 2) dataset (CDF) [22] is currently the largest publicly available deep-fake dataset. It contains face-swap deep fakes generated from YouTube videos of celebrities speaking in different settings, ranging from interviews to TV shows and award functions (we, however, identified fewer unique identities in the downloaded dataset than reported).

For each identity in the WLDR, DFD, and DFDC-P datasets, a random subset of the real videos is used for the reference set and the remaining videos are used for testing. In these three datasets there were sufficient videos of each individual in similar contexts. In contrast, the FF and CDF datasets have either only a small number of videos per individual or a context that varies drastically from video to video. For these two datasets, therefore, we take a different approach to creating the reference/testing sets. In particular, each real video is divided in half: the first half is used for reference and the second half is used for testing. Similarly, we split each fake video in half, discard the first half, and subject the second half to testing. The first half is discarded because the real counterpart of this video is used for reference, thus avoiding any overlap in utterances between the reference and testing sets. We recognize that this split is not ideal, as video halves are not independent, but as we will see below, there is little difference in the results between these splits and the random splits used for the other datasets.

Each reference and testing video is re-saved at a fixed frame-rate and a fixed, high ffmpeg quality. This consistent frame-rate allows us to partition each video into overlapping four-second clips using a sliding window.
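A minimal sketch of this clip partitioning is shown below; the clip length and stride are illustrative placeholders, since the exact values are not restated here (e.g., 100 frames would correspond to four seconds at 25 frames/second).

    def sliding_clips(num_frames, clip_len=100, stride=5):
        """Yield (start, end) frame indices for overlapping fixed-length clips.
        clip_len and stride are illustrative; adjust to the frame-rate in use."""
        for start in range(0, num_frames - clip_len + 1, stride):
            yield start, start + clip_len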

Table 1: Classification accuracies (real, fake, and average) for each dataset (Average, WLDR, FF, DFD, DFDC-P, CDF), corresponding to the ROCs in Fig. 2 at a fixed threshold.

4.2 Identification

Shown in Fig. 2 are the receiver operating curves (ROC) for each of the five datasets enumerated in the previous section, along with the average across all datasets. The green/red curves correspond to the accuracy of classifying real/fake videos. The horizontal axis corresponds to the facial VGG threshold ($\tau$) used to determine whether a video clip should be classified as real or fake (see Section 3.3 and Fig. 1).

As expected, as the threshold increases, the detection accuracy for fake videos (red) increases while the detection accuracy for real videos (green) decreases, particularly dramatically for threshold values approaching the maximum cosine similarity of 1. Recall that these accuracies are for a single four-second clip.

The cross-over accuracies in Fig. 2 differ across datasets and occur at varying threshold values. Shown in Table 1 is the detection accuracy at a fixed threshold, ranging from the lowest on DFDC-P to the highest on FF.

Figure 3: An example frame of a face-swap deep fake (second panel) from the DFDC-P dataset, in which the source identity (first panel) should be mapped onto the target (third panel), which is clearly not the case in this example. The fourth panel shows a dimensionality-reduced visualization of the VGG features from all identities (gray), the source identity (green), the target identity (blue), and the face-swap identity (red). This visualization shows that the source identity is not successfully mapped onto the deep fake (see also Fig. 4).

Note that the accuracy for the DFDC-P dataset is unusually low. This is because many of the fake videos in this dataset fail to convincingly map the facial appearance of the desired source identity into the target video. Shown in Fig. 3 is a representative example of this problem: one frame from the source video, one frame from the target video, and the corresponding frame from the face-swap deep-fake video in which the source identity should be mapped into the target video. In this example, drawn from the DFDC-P dataset, we can clearly see that the source identity was not mapped into the target video; the fake continues to look like the target. Shown in Fig. 4 is confirmation that this problem persists throughout the DFDC-P dataset. In particular, shown in the first row are, for each dataset, the distributions of similarity in facial identity (as measured by the VGG cosine similarity) between all faces in the fake videos and their corresponding source identities. Shown in the second row are the distributions of similarity in facial identity between all faces in the fake videos and their corresponding target identities. In a successful face swap, in which the identity in the target is replaced with that of the source, the facial similarity between the source and fake should be higher than that between the target and fake. Correspondingly, for each dataset except DFDC-P, the average facial similarity of the fakes is higher relative to the source than to the target. For the DFDC-P dataset, however, the fakes are on average closer to the target than to the source. This difference accounts for the low accuracy on the DFDC-P dataset: both the behavior and the appearance of the fakes correspond to the target identity, and the fakes are thus classified as real by our algorithm. Although this effect is most pronounced in the DFDC-P dataset, the DFD dataset suffers from a similar problem, failing to convincingly map the source to the target identity. These failures justify our use of a confidence threshold in the facial-similarity matching (case 2(b) in Section 3.3).

We next evaluate our detection algorithm against three in-the-wild, face-swap deep-fake videos downloaded from YouTube. These three deep fakes were created using the following source and target combinations: 1) Steve Buscemi mapped onto Jennifer Lawrence (https://www.youtube.com/watch?v=VWrhRBb-1Ig); 2) Tom Cruise mapped onto Bill Hader (https://www.youtube.com/watch?v=r1jng79a5xc); and 3) Billie Eilish mapped onto Angela Martin (https://www.instagram.com/p/B6lXvJlIU92/). Because only Jennifer Lawrence was already in our reference set (CDF), real videos for the other five identities were downloaded from YouTube to augment our reference set. This included three minutes of video of Angela Martin from The Office and several minutes of interview videos for each of Billie Eilish, Steve Buscemi, Bill Hader, and Tom Cruise. Our approach correctly classifies each of these in-the-wild face-swap deep fakes.

Lastly, shown in Table 2 is a comparison of our detection accuracy, measured using the area under the curve (AUC), to six previous deep-fake detection schemes. Our scheme matches or outperforms previous approaches across all datasets. Note, however, that this is not a perfect comparison: our approach only has access to a reference set of real videos, whereas the other, fully-supervised approaches have access to both real and fake reference videos.

Figure 4: The distributions in the first row correspond to the facial similarity between the faces in the source and fake videos (as computed by the cosine similarity between corresponding VGG features) for each dataset (WLDR, FF, DFDC-P, DFD, CDF). The distributions in the second row correspond to the facial similarity between the faces in the target and fake videos. In a successful face-swap deep fake, the source-to-fake similarity will be higher than the target-to-fake similarity, as is the case for the WLDR dataset. For the DFDC-P dataset, however, these distributions are reversed (see also Fig. 3).

4.3 Analysis

Our Behavior-Net feature was designed to capture spatiotemporal behavior, while the VGG feature captures facial identity. Here we analyze our results in more detail to ensure that these two features are not entangled and that the Behavior-Net does in fact capture temporal properties not captured by the static FAb-Net features.

Figure 5: Shown in panels (a) and (b) are the distributions of spatiotemporal behavior similarity, measured as the cosine similarity between Behavior-Net feature vectors. Shown in panel (c) is the distribution of static FAb-Net similarity. See text for a detailed explanation of each panel.

In the first analysis, we show that Behavior-Net does in fact capture behavior and not just a person’s facial identity. Shown in Fig. 5(a) are the distributions of Behavior-Net similarities between source (blue)/target (orange) identities relative to their face-swap deep fakes (recall that a face-swap deep fake is created by mapping an identity in a source video to a target video). The similarity of the target behavior relative to the face-swap deep fakes is much higher than the source, meaning that even though the facial identity in the deep fake matches the source, the behavioral identity still matches the target. This indicates that the Behavior-Net is capturing more information than just facial identity.

In the second analysis, we show that Behavior-Net captures identity-specific behaviors and not just identity-agnostic expressions or behaviors. This analysis is based on the real videos in the DFD dataset, where each of the actors was recorded talking in different contexts, ranging from a casual conversation on a couch to a speech at a podium. Each of these contexts captured a specific facial expression ranging from neutral to angry, happy, and laughing, and each context was recorded twice, once with a still camera and once with a moving camera. Shown in Fig. 5(b) are the distributions of Behavior-Net similarities between the same person in the same context (blue), the same person in different contexts (orange), and different people in the same context (green). When different people are recorded in the same context, their Behavior-Net features are not similar, indicating that Behavior-Net captures identity-specific behaviors and not just specific contexts. At the same time, however, we see that context can change an individual's behavior (the orange vs. blue distributions). For example, a person is likely to behave differently when speaking casually to friends than when giving a formal speech to a large crowd. Nevertheless, our Behavior-Net captures identity-specific behaviors, albeit somewhat context-dependent ones. Shown in Fig. 5(c) are the same distributions as in panel (b) but for only the static FAb-Net features. The distributions for the same person in the same context (blue), the same person in different contexts (orange), and different people in the same context (green) are all nearly identical, revealing that the static FAb-Net features do not capture identity-specific information.

In the third analysis, we examine the amount of data required to build a reference set for an individual. For this analysis, the same reference set as before was used for the identities in FF, DFD, DFDC-P, and CDF. For the identities in the WLDR dataset (the only one with hours of video per person), the reference sets consist of an increasing number of randomly selected four-second clips. As the number of reference clips grows, the average detection accuracy for identities in the WLDR dataset increases rapidly and then levels off. This rapid increase and leveling off shows that large reference sets are not needed, assuming, again, that the context in which the individual is depicted is similar.

In the fourth, and final, analysis, we examine the robustness of classification against a simple compression laundering operation. The video clips in our reference and testing sets (Section 4.2) are each encoded at a relatively high ffmpeg quality (the lower the qp value, the higher the quality). Each testing video clip was recompressed at a lower quality and classified against the original reference set. For the same threshold ($\tau$), the average detection accuracy remains high on all five datasets (WLDR, FF, DFD, DFDC-P, and CDF), nearly identical to the results for the high-quality videos in Table 1.

                                 WLDR   FF     DFD    DFDC-P  CDF
Protecting World Leaders [2]     0.93   –      –      –       –
2-stream [39]                    –      0.70   0.52   0.61    0.53
XceptionNet-c23 [22]             –      0.99   0.85   0.72    0.65
Head Pose [36]                   –      0.47   0.56   0.55    0.54
MesoNet [1]                      –      0.84   0.76   0.75    0.54
Face Warping [21]                –      0.80   0.74   0.72    0.56
Ours: Appearance and Behavior    0.99   0.99   0.93   0.95    0.99
Table 2: Comparison of our approach with previous work over multiple benchmarks [22]. The reported values are AUC; a dash indicates that no result is reported for that dataset. Although not a perfect comparison due to significantly different underlying methodologies, our approach performs well. The FF dataset in this comparison consists of the FaceSwap and DeepFake categories.

5 Discussion

We have developed a novel technique for detecting face-swap deep fakes. This technique leverages a fundamental flaw in these deep fakes in that the person depicted in the video is simply not the person that it purports to be. We have shown that a combination of a facial and behavioral biometric is highly effective at detecting these face-swap deep fakes. Unlike many other techniques, this approach is less vulnerable to counter attack and generalizes well to previously unseen deep fakes with previously unseen people.

Our forensic technique should generalize to so-called puppet-master deep fakes in which one person’s facial expressions and head movements are mapped onto another person. These deep fakes suffer from the same basic problem as face-swap deep fakes in that the underlying behavior is not that of the person the video purports to depict. As such, our combined facial and behavioral biometric should be able to detect these deep fakes.

We will, however, likely struggle to classify so-called lip-sync deep fakes in which only the mouth has been modified to be consistent with a new audio track. The facial identity and the vast majority of the behavior in these deep fakes will be consistent with the person depicted. To overcome this limitation, we seek to customize our behavioral model to learn explicit inconsistencies between the mouth and the rest of the face and/or underlying audio signal.

There is little question that the arms race between synthesis and detection will continue. While it may not be possible to entirely stop the creation and distribution of deep fakes, our approach, and related approaches, promise to make the creation of convincing deep fakes more difficult and time-consuming. This will eventually take deep-fake creation out of the hands of the average person and relegate it to fewer and fewer experts. While the threat of deep fakes will remain, it will surely be a more manageable threat.

6 Acknowledgement

The PI’s research group (Farid) is partially supported with funding from the Defense Advanced Research Projects Agency (DARPA FA8750-16-C-0166). The views, opinions, and findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. The PI’s research group is also partially supported by Facebook. There is no collaboration between Facebook and DARPA. We thank Yipin Zhou for her help in data collection.

References

  • [1] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen (2018) MesoNet: a compact facial video forgery detection network. In IEEE International Workshop on Information Forensics and Security, Cited by: §2.2, Table 2.
  • [2] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li (2019) Protecting world leaders against deep fakes. In IEEE Conference on Computer Vision and Pattern Recognition, Workshop on Media Forensics, pp. 38–45. Cited by: §1, §2.2, §2.2, §4.1, Table 2.
  • [3] T. Baltrušaitis, P. Robinson, and L. Morency (2016) OpenFace: an open source facial behavior analysis toolkit. In IEEE Winter Conference on Applications of Computer Vision, pp. 1–10. Cited by: §3.2.
  • [4] N. Carlini and D. Wagner (2016) Towards evaluating the robustness of neural networks. Note: arXiv: 1608.04644 Cited by: §2.2.
  • [5] B. Chesney and D. Citron (2019) Deep fakes: a looming challenge for privacy, democracy, and national security. California Law Review 107, pp. 1753. Cited by: §1.
  • [6] J. S. Chung, A. Nagrani, and A. Zisserman (2018) VoxCeleb2: deep speaker recognition. Note: arXiv: 1806.05622 Cited by: §3.1.
  • [7] U. A. Ciftci and I. Demir (2019) FakeCatcher: detection of synthetic portrait videos using biological signals. Note: arXiv: 1901.02212 Cited by: §2.2.
  • [8] Deepfakes faceswap. Note: https://github.com/deepfakes/faceswap Cited by: §2.1.
  • [9] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, and C. C. Ferrer (2019) The deepfake detection challenge (DFDC) preview dataset. Note: arXiv: 1910.08854 Cited by: §4.1.
  • [10] N. Dufour and A. Gully Contributing Data to Deepfake Detection Research. Note: https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html Cited by: §4.1.
  • [11] P. Ekman and W. V. Friesen (1976) Measuring facial movement. Environmental Psychology and Nonverbal Behavior 1 (1), pp. 56–75. Cited by: §2.2.
  • [12] Faceswap-GAN. Note: https://github.com/shaoanlu/faceswap-GAN Cited by: §2.1.
  • [13] Faceswap. Note: https://github.com/MarekKowalski/FaceSwap/ Cited by: §2.1, §2.1.
  • [14] H. Farid (2016) Photo forensics. MIT Press. Cited by: §2.2.
  • [15] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 2672–2680. Cited by: §2.1.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.
  • [17] M. Huh, A. Liu, A. Owens, and A. A. Efros (2018) Fighting fake news: image splice detection via learned self-consistency. In European Conference on Computer Vision, Cited by: §2.2.
  • [18] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1.
  • [19] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila (2019) Analyzing and improving the image quality of stylegan. Note: arXiv:1912.04958 Cited by: §1.
  • [20] Y. Li, M. Chang, and S. Lyu (2018) In ictu oculi: exposing AI created fake videos by detecting eye blinking. In IEEE International Workshop on Information Forensics and Security, pp. 1–7. Cited by: §2.2.
  • [21] Y. Li and S. Lyu (2018) Exposing deepfake videos by detecting face warping artifacts. Note: arXiv: 1811.00656 Cited by: §2.2, Table 2.
  • [22] Y. Li, X. Yang, P. Sun, H. Qi, and S. Lyu (2019) Celeb-DF: a new dataset for deepfake forensics. Note: arXiv: 1909.12962 Cited by: §4.1, Table 2.
  • [23] K. Nagano, J. Seo, J. Xing, L. Wei, Z. Li, S. Saito, A. Agarwal, J. Fursund, and H. Li (2018) PaGAN: real-time avatars using dynamic textures. ACM Transactions on Graphics. Cited by: §2.1.
  • [24] Y. Nirkin, Y. Keller, and T. Hassner (2019) FSGAN: subject agnostic face swapping and reenactment. Note: arXiv: 1908.05932 Cited by: §2.1, §2.1.
  • [25] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu (2016) Wavenet: a generative model for raw audio. Note: arXiv: 1609.03499 Cited by: §1.
  • [26] O. M. Parkhi, A. Vedaldi, and A. Zisserman (2015) Deep face recognition. In British Machine Vision Conference, Cited by: §3.2.
  • [27] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) FaceForensics++: learning to detect manipulated facial images. Note: arXiv: 1901.08971 Cited by: §4.1.
  • [28] B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson Estimating the support of a high-dimensional distribution. Neural Computation 13 (7), pp. 1443–1471. Cited by: §2.2.
  • [29] J. Thies, M. Zollhöfer, and M. Nießner (2019) Deferred Neural Rendering: Image Synthesis using Neural Textures. Note: arXiv: 1904.12356v1 Cited by: §2.1, §2.1.
  • [30] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman (2017) Synthesizing Obama: learning lip sync from audio. ACM Transactions on Graphics. Cited by: §2.1.
  • [31] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2face: real-time face capture and reenactment of rgb videos. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1, §2.1.
  • [32] R. Tolosana, R. Vera-Rodriguez, J. Fierrez, A. Morales, and J. Ortega-Garcia (2020) DeepFakes and beyond: a survey of face manipulation and fake detection. Note: arXiv: 2001.00179 Cited by: §1.
  • [33] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.1, §3.1.
  • [34] O. Wiles, A. S. Koepke, and A. Zisserman (2018) Self-supervised learning of a facial attribute embedding from video. Note: arXiv: 1808.06882 Cited by: §1, §3.1.
  • [35] Y. Wu, W. AbdAlmageed, and P. Natarajan (2019) ManTra-Net: manipulation tracing network for detection and localization of image forgeries with anomalous features. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
  • [36] X. Yang, Y. Li, and S. Lyu (2019) Exposing deep fakes using inconsistent head poses. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 8261–8265. Cited by: §2.2, Table 2.
  • [37] N. Yu, L. Davis, and M. Fritz (2018) Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints. In IEEE International Conference on Computer Vision, Cited by: §2.2.
  • [38] X. Zhang, S. Karaman, and S. Chang (2019) Detecting and simulating artifacts in gan fake images. Note: arxiv: 1907.06515 Cited by: §2.2.
  • [39] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis (2017) Two-stream neural networks for tampered face detection. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §2.2, Table 2.