Use of a Capsule Network to Detect Fake Images and Videos

10/28/2019 ∙ by Huy H. Nguyen, et al. ∙ 46

The revolution in computer hardware, especially in graphics processing units and tensor processing units, has enabled significant advances in computer graphics and artificial intelligence algorithms. In addition to their many beneficial applications in daily life and business, computer-generated/manipulated images and videos can be used for malicious purposes that violate security systems, privacy, and social trust. The deepfake phenomenon and its variations enable a normal user to use his or her personal computer to easily create fake videos of anybody from a short real online video. Several countermeasures have been introduced to deal with attacks using such videos. However, most of them are targeted at certain domains and are ineffective when applied to other domains or new attacks. In this paper, we introduce a capsule network that can detect various kinds of attacks, from presentation attacks using printed images and replayed videos to attacks using fake videos created using deep learning. It uses many fewer parameters than traditional convolutional neural networks with similar performance. Moreover, we explain, for the first time ever in the literature, the theory behind the application of capsule networks to the forensics problem through detailed analysis and visualization.




1 Introduction

Ever since the invention of photography, people have been interested in manipulating photographs, mainly to correct problems in the photos or to enhance them. However, we have gone far beyond these basic manipulations to adding unreal figures and inserting or removing objects. Digital photography has simplified the manipulation process, especially with the help of professional software like the iconic Adobe Photoshop application. The advent of personal computers further enabled people to become creators, creating everything from scratch. We call this computer graphics, as opposed to computer vision, which is aimed at making computers understand things from captured images and videos. The co-existence and co-development of computer graphics and computer vision, with the support of advanced hardware, have led to significant achievements. Moreover, the popularity of social networks has enabled people to create and share massive amounts of data, including personal information, news, and media, like images and videos. The consequence is that people with malicious purposes can easily make use of these advanced technologies and data to create fake images and videos and then publish them widely on social networks or use them to bypass facial authentication.

Figure 1: Example fake images. The first two images in the top row were fully computer-generated (from the Digital Emily Project [3] and from Dexter Studios [19]); the last one was generated using StyleGAN [31]. In the bottom row, from left to right, are images manipulated using deepfake [63], Face2Face [65], and Neural Textures [64] methods, respectively.

The requirements for manipulating or synthesizing videos were dramatically simplified when it became possible to create forged videos from only a short video of the target person [65, 32] and then from a single ID photo [5] following the acting of an actor. Suwajanakorn et al.’s mapping method [62] enhanced the ability of manipulators to learn the mapping between speech and lip motion. Since state-of-the-art speech synthesis methods can produce natural sounding speech, this mapping method enabled the creation of a fully synthesized audio-video image of any person. Deepfakes [63] exemplify this threat – any user with a personal computer and an appropriate tool can create videos impersonating any celebrity. Since deepfakes have become easy to create, a large number of high-quality fake pornography videos have been produced. Deepfake comedy and deepfake videos have been posted on YouTube with the challenge being to spot them. Several examples of computer-generated/manipulated images are shown in Fig. 1.

Several countermeasures have been developed for detecting spoofing attacks using fake images and videos. Before the deepfake phenomenon [63], when computer-generated images and videos had not yet achieved the required realistic quality to become a threat, presentation attacks [39] were the main concern, and the detectors used hand-crafted features [11, 18, 33]. The growing use of convolutional neural networks (CNNs) has changed the game drastically for both defenders and attackers. Automatic feature extraction has dramatically improved detection performance [28, 24] while deep generative methods like generative adversarial networks (GANs) enable images [31] and videos [67, 50] to be produced that are almost impossible for humans to detect as fake. The attention of the forensics community has thus shifted to these new kinds of attacks [37, 36, 2, 58]. Several approaches are image-based [1, 55] while others work only on video frames [37, 2, 58] or on video frames and voice information [35]. Although some video-based approaches perform better than image-based ones, they are only applicable to particular kinds of attacks. For example, many of them [37, 2] may fail if the quality of the eye area is sufficiently good or the synchronization between the video and audio parts is sufficiently natural [36]. Among the image-based approaches, some are general-purpose detectors, for instance, that of Fridrich and Kodovsky [21] (applicable to both steganalysis and facial reenactment video detection) and that of Rahmouni et al. [54] (applicable initially to computer-generated images and later to computer-manipulated images). However, their performance on new tasks is limited compared with that of task-specific ones [55, 56].

This journal paper is an extension of our conference paper [48] in which we pioneered the use of capsule networks [25, 59] for digital media forensics problems. We aim to create a lightweight and general-purpose detector that can be used for any kind of attack and have reasonable performance compared with that of task-specific detectors. While most state-of-the-art detectors use traditional CNNs with a large number of parameters, ours uses a new type of CNN that has impressive performance on computer vision tasks. The network architecture is relatively new and has seen little application in other domains, and a detailed analysis of it has been lacking. To fill this gap, we explain the novelty of our proposed capsule network through detailed analysis and visualization of several kinds of attacks. We also describe how we enhanced its performance by making several modifications and introducing two regularizations.

2 Related Work

In this section, we first introduce several state-of-the-art face manipulation techniques. We then mention several major research efforts focused on forgery detection during the eight years preceding our research, the time period when CNNs blossomed and came to dominate traditional methods. As best we can, we group them into presentation attack detection and computer-generated image/video detection on the basis of the features they use and their intended targets. Despite this categorization, some approaches are two-fold or can be successfully applied outside their original scopes. Finally, we provide basic information about capsule networks, the dynamic routing algorithm that enables them to be efficiently implemented, and their original application to computer vision.

2.1 Face Manipulation

Although face manipulation is not new, recent achievements demonstrate that computer-manipulated faces can reach a photo-realistic level at which it is almost impossible for them to be humanly detected as fake. Dale et al. [16] presented a 3D multilinear model for replacing facial movements in video. Garrido et al. [23] modified the lip movements of an actor in a target video so that they matched a different audio track. Thies et al. [65] demonstrated that expression transfer for facial reenactment can be performed in real time and subsequently developed the FaceVR algorithm [66], which handles eye-tracking and reenactment in virtual reality. Kim et al. [32] demonstrated the transfer of a head pose along with facial movements from an actor to another person. Similarly, Tripathy et al. [67] devised a lightweight face reenactment method using GANs. Nirkin et al. [50] presented a face swapping method that does not require training on new faces, unlike deepfake methods [63]. Thies et al. combined the traditional graphics pipeline with learnable components to deal with imperfect 3D contents [64].

Beyond the visual part, Suwajanakorn et al. [62] presented a method for learning the mapping between speech and lip movements in which speech can also be synthesized, enabling creation of a full-function spoof video. Fried et al. [22] demonstrated that speech can be easily modified in any video in accordance with the intention of the manipulator while maintaining a seamless audio-visual flow. Averbuch-Elor et al. [5] addressed a different problem – converting still portraits into motion pictures expressing various emotions. This work greatly simplified the requirements for attackers: simply acquire a picture of the victim (usually a profile picture on a social network or an ID photo). Zakharov et al. [75] followed up by improving the quality of videos generated using only a few input images. Vougioukas et al. [68] raised the bar by introducing a method for animating a facial image from an audio track containing speech.

Besides these academic efforts, deep-learning-based face swapping tools (like deepfake’s source code) have become widespread on the Internet, enabling normal users to create pornographic videos with celebrity images or to impersonate them. For users who are not familiar with programming and machine learning, there is a mobile app called Xpression that provides a deepfake function. An even more controversial application recently appeared – DeepNude [45], which generates realistic nude images of a person from a picture of him or her wearing clothes. It was shut down hours after it was released due to negative reaction from the community.

2.2 Presentation Attack Detection

A presentation attack against a biometric capture system is an attack with the goal of interfering with the system’s operation. Presentation attack detection (PAD) methods have been developed to automatically detect this kind of attack. Local binary patterns and their variants were the most effective PAD features in the pre-deep learning era and are used in several methods [11, 18, 33]. Following the success of methods based on CNNs in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [57], several methods have been developed that use CNNs pre-trained on this challenge’s large database as their feature extractors [73, 28]. Other methods use available CNN architectures with customized components and were trained on spoofing databases [43, 53, 24, 42, 44]. In addition, Alotaibi and Mahmood [4] applied nonlinear diffusion based on an additive operator splitting scheme to their CNN.

There is growing interest in generalizing detectors to enable them to handle unseen attacks [13]. This is a difficult but important effort since the number of attack techniques and their variants has been increasing rapidly. The major countermeasure directions use adversarial training and domain adaptation, jointly or independently [30, 49, 69]. Other directions are semi-supervised learning [15] and semi-supervised learning combined with multi-task learning [46]. In contrast, Fatemifar et al. [20] applied a fusion-based multiple one-class classifier approach to anomaly detection.

2.3 Computer-Generated/Manipulated Image/Video Detection

Although computer-manipulation of images and videos is nothing new, the introduction of deep learning significantly improved the ability of this kind of attack and thus attracted great attention from the forensics community. This resulted in the creation of standardized databases for benchmarking, like the FaceForensics [55], FaceForensics++ [56], and DeepFakeTIMIT databases [36]. These databases cover several well-known attacks, including Face2Face [65], FaceSwap [56], and deepfake [63]. Rahmouni et al. had previously created a database for detecting fully computer-generated images [54] while Afchar et al. [1] created a deepfake database in their pioneering deepfake detection work.

The handcrafted steganalysis-based method developed by Fridrich and Kodovsky [21] was used in early efforts to detect computer-manipulated images and videos. This approach was later implemented in a CNN by Cozzolino et al. [14]. Subsequently, several methods were built on CNNs that had been used in the ILSVRC, like the one developed by Rossler et al. [55]. Other methods used networks proposed by their authors [10, 54, 52, 1, 37, 35, 58] while others are based on a hybrid approach [76, 47, 70, 48]. Besides the split between deep-learning and non-deep-learning approaches, these methods can be divided into image-based classifiers [10, 76, 54, 52, 1, 55, 47, 70, 48] and video-based classifiers [37, 35, 2, 58]. For detecting images generated by GANs, Marra et al. [40] performed benchmark testing on several CNNs and proposed a statistical model for detection [41].

In addition to binary classification between real and modified/generated images or videos, locating manipulated regions in images is also a major branch in digital media forensics. This research focuses on detecting removal, copy-move, and splicing attacks. Besides forensic-oriented approaches [8, 77, 9, 46], semantic segmentation approaches [38, 6] and binary classification approaches are applicable [54, 47, 55, 56]. In the case of binary classifiers, a sliding window (with or without overlapping) is used to locate manipulated regions.

2.4 Capsule Networks

“Capsule network” is not a new term as it was first introduced in 2011 by Hinton et al. [25]. They argued that CNNs have limited applicability to the “inverse graphics” problem and introduced a more robust architecture comprising several “capsules.” However, they initially faced the same problem faced by CNNs – the limited performance of hardware and the lack of effective algorithms, which prevented practical application of capsule networks. CNNs thus remained dominant in this research field.

These problems were overcome when the dynamic routing algorithm [59] and its variant – the expectation-maximization routing algorithm [26] – were introduced in 2017 and 2018, respectively. These breakthroughs enabled capsule networks to achieve better performance and outperform CNNs on object classification tasks [59, 71, 26, 72, 7]. The agreements between low- and high-level capsules, which encode the hierarchical relationships between objects and their parts with pose information, enable a capsule network to preserve more information than a CNN while using only a fraction of the data used by a CNN.

In another domain, Iesmantas and Alzbutas applied a capsule network based on binary classification to breast cancer detection [27]. Jaiswal et al. reported a capsule-based GAN [29]. Yang et al. applied a capsule network to the text domain [74]. Nguyen et al. were pioneers in applying capsule networks to the digital media forensics problem [48]. These efforts demonstrated the effectiveness of capsule networks in multiple domains, which motivated us to continue developing a method for detecting modified/generated images or videos that we call “Capsule-Forensics” and then to perform a deep analysis on its behaviors by visualizing its intermediate activations on both real and fake inputs to explain the theory behind it.

3 Capsule-Forensics

3.1 Overview

Figure 2: Capsule-Forensics pipeline.

The pipeline of our proposed Capsule-Forensics method is illustrated in Fig. 2. The pre-processing task depends on the input. If the input is video, the first step is to separate the frames. If the task is to detect fully computer-generated frames (or images), each frame (or image) is divided into patches. If the task is to detect a fake face or faces, a face detection algorithm is used to crop the facial area(s). There is no strict requirement about the size of the output image. In general, the larger the input, the better the result, at the cost of more computational power. Several image sizes are commonly used in practice [54, 48, 47, 55, 56]. We used an image size with an even side length (making it easy to perform cropping and scaling) that is large enough to provide sufficient information for detecting fake content.

Figure 3: Capsule-Forensics architecture.

The pre-processed image then passes through a part of the VGG-19 network (as proposed by the Visual Geometry Group) [61] pre-trained on the ILSVRC database [57] before entering the capsule network. The VGG-19 network is used from the first layer to the third max pooling layer, which is shallow enough to avoid picking up biases from the object detection task (the original purpose of this pre-trained network). This VGG-19 part is equivalent to the CNN part before the primary capsules in the design of the original capsule network. Using a pre-trained CNN as a feature extractor rather than training it from scratch has two benefits: it acts as a regularizer that guides the training and reduces overfitting, and it brings the advantages of transfer learning. The detailed architecture is discussed in the next section.

The final part is the post-processing unit, which works in accordance with the pre-processing one. If the task is to detect fully computer-generated images, the scores of the extracted patches are averaged. If the input is video, the scores of all frames are averaged. This average score is the final output.
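The pre- and post-processing steps described above can be illustrated with a minimal sketch. It assumes that a detector has already produced one "fake" probability per patch or per frame; the function names, the non-overlapping default stride, and the 0.5 decision threshold are hypothetical choices made for illustration and are not taken from the paper.

```python
from statistics import mean


def split_into_patches(height, width, patch, stride=None):
    """Return the (top, left) corners of patch-sized windows inside an image.

    A stride equal to the patch size (the default here, an assumption)
    gives non-overlapping patches; a smaller stride gives overlapping ones.
    """
    stride = stride or patch
    return [(top, left)
            for top in range(0, height - patch + 1, stride)
            for left in range(0, width - patch + 1, stride)]


def aggregate_scores(scores):
    """Average per-patch or per-frame 'fake' probabilities into one score."""
    if not scores:
        raise ValueError("no scores to aggregate")
    return mean(scores)


def classify(scores, threshold=0.5):
    """Label the whole image or video; the threshold is an assumed value."""
    return "fake" if aggregate_scores(scores) >= threshold else "real"
```

The same averaging applies whether the scores come from the patches of one image or from the frames of one video, which is why a single helper is enough in this sketch.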

3.2 Detailed Architecture

The capsule network includes several primary capsules and two output capsules (“real” and “fake”), as illustrated in Fig. 3. There is no constraint on the number of primary capsules. Experiments demonstrated that a reasonably large number of primary capsules may improve network performance, but at the cost of more computation power. Three capsules are typically used for light networks (which require less memory and computation), and ten capsules are typically used for full ones (which require more memory and computation but provide better performance). While it is not necessary to use the same architecture for the primary capsules, we used the same design for all primary capsules to simplify the discussion.

Each primary capsule is divided into three parts: a 2D convolutional part, a statistical pooling layer, and a 1D convolutional part. The statistical pooling layer has been proven to be effective in the forensics task [54, 47]. Moreover, it helps make the network independent of the input image size. This means that one Capsule-Forensics architecture can be applied to different problems with different input sizes without having to redesign the network. The mean and variance of each filter are calculated in the statistical pooling layer.

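  • Mean: $\mu_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} I_{kij}$

  • Variance: $\sigma_k^2 = \frac{1}{H \times W - 1} \sum_{i=1}^{H} \sum_{j=1}^{W} (I_{kij} - \mu_k)^2$

where $k$ represents the layer index, $H$ and $W$ are respectively the height and width of the filter, and $I$ is a two-dimensional filter array.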
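As a rough illustration of why the statistical pooling layer makes the network independent of the input size, the sketch below reduces each two-dimensional feature map to a (mean, variance) pair. The nested-list representation and the function name are assumptions made for illustration, and the variance is normalized by the number of elements minus one, which is also an assumption.

```python
def statistical_pooling(feature_maps):
    """Reduce K feature maps (each an H x W nested list) to K (mean, variance) pairs.

    The output length depends only on the number of feature maps, not on
    H or W. Each map must contain at least two values.
    """
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        n = len(values)
        mu = sum(values) / n
        # Normalization by n - 1 (sample variance) is an assumption.
        var = sum((v - mu) ** 2 for v in values) / (n - 1)
        pooled.append((mu, var))
    return pooled
```

Feeding a 2 x 2 map and a 2 x 3 map through this function yields one pair each, so the 1D convolutional part that follows always sees an input whose length is fixed by the number of filters.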

The output of the statistical layer is suitable for 1D convolution. After going through the following 1D convolutional part, it is sent through dynamic routing to the output capsules. The final result is calculated on the basis of the activation of the output capsules. The algorithm is discussed in detail in the next section. For binary classification, there are two output capsules, as shown in Fig. 3. Multi-class classification could be performed by adding more output capsules, as described in section 4.1.

3.3 Dynamic Routing Algorithm

The dynamic routing algorithm is used to calculate agreement between the features extracted by the primary capsules. Agreement is dynamically calculated at run-time and the results are routed to the appropriate output capsule (the real or the fake one for binary classification). The output probabilities are determined on the basis of the activations of the output capsules. This dynamic routing algorithm differs from classical fusion, which combines classification outputs from different classifiers.

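Let us call the output vector of each primary capsule $\mathbf{u}^{(i)}$ and the real and fake vector capsules $\mathbf{v}^{(1)}$ and $\mathbf{v}^{(2)}$, respectively. $\mathbf{W}^{(i,j)}$ is the matrix used to route $\mathbf{u}^{(i)}$ to $\mathbf{v}^{(j)}$, and $r$ is the number of iterations. The dynamic routing algorithm is shown in Algorithm 1.

procedure Routing($\mathbf{u}^{(i)}$, $\mathbf{W}^{(i,j)}$, $r$)
     $\hat{\mathbf{W}}^{(i,j)} \leftarrow \mathbf{W}^{(i,j)} + \text{noise}$ (training only)
     $\hat{\mathbf{u}}_{j|i} \leftarrow \hat{\mathbf{W}}^{(i,j)} \, \text{squash}(\mathbf{u}^{(i)})$
     $\hat{\mathbf{u}} \leftarrow \text{dropout}(\hat{\mathbf{u}})$ (training only)
     for all input capsules $i$ and all output capsules $j$ do $b_{ij} \leftarrow 0$
     for $r$ iterations do
         for all input capsules $i$ do $\mathbf{c}_i \leftarrow \text{softmax}(\mathbf{b}_i)$
         for all output capsules $j$ do $\mathbf{s}_j \leftarrow \sum_i c_{ij} \hat{\mathbf{u}}_{j|i}$
         for all output capsules $j$ do $\mathbf{v}^{(j)} \leftarrow \text{squash}(\mathbf{s}_j)$
         for all input capsules $i$ and output capsules $j$ do $b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}^{(j)}$
     return $\mathbf{v}^{(j)}$
Algorithm 1 Dynamic routing between capsules.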

We slightly improved the algorithm of Sabour et al. [59] by introducing two regularizations: adding random noise to the routing matrix and adding a dropout operation. They are used only during training to reduce overfitting. Their effectiveness is discussed in the Evaluation section. Furthermore, a squash function (equation 1) is applied to the output vector of each primary capsule, $\mathbf{u}^{(i)}$, before routing to normalize it, which helps stabilize the training process. The squash function preserves the direction of a vector and scales its magnitude into the range $[0, 1)$.

$\text{squash}(\mathbf{u}) = \frac{\|\mathbf{u}\|_2^2}{1 + \|\mathbf{u}\|_2^2} \frac{\mathbf{u}}{\|\mathbf{u}\|_2}$    (1)

In practice, to stabilize the training process, the random noise should be sampled from a normal distribution, the dropout ratio should not be greater than 0.05 (we used 0.05 in all experiments), and two iterations ($r = 2$) should be used in the dynamic routing algorithm. The two regularizations are used along with random weight initialization to increase the level of randomness, which helps the primary capsules to learn with different parameters.
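To make the routing procedure and its two regularizations concrete, here is a minimal pure-Python sketch. It follows the standard dynamic routing scheme of Sabour et al. [59] and is not the authors' implementation. The names `route`, `noise_std`, and `dropout`, the nested-list layout of `W`, the placement of dropout on the prediction vectors, and the inverted-dropout rescaling are all assumptions made for illustration.

```python
import math
import random


def squash(vec, eps=1e-9):
    """Standard capsule squash: keeps the direction, maps the norm into [0, 1)."""
    sq = sum(x * x for x in vec)
    scale = sq / (1.0 + sq) / math.sqrt(sq + eps)
    return [scale * x for x in vec]


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def apply_dropout(vec, rate, rng):
    """Inverted dropout (assumed variant): zero entries, rescale the rest."""
    if rate <= 0.0:
        return vec
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vec]


def route(u, W, iterations=2, noise_std=0.0, dropout=0.0, rng=random):
    """Route primary capsule outputs u[i] to output capsules.

    W[i][j] is the matrix routing primary capsule i to output capsule j.
    noise_std and dropout are training-time regularizers; 0 disables them.
    """
    n_in, n_out = len(u), len(W[0])
    u_hat = []
    for i in range(n_in):
        ui = squash(u[i])  # normalize the capsule output before routing
        preds = []
        for j in range(n_out):
            pred = [sum((w + rng.gauss(0.0, noise_std)) * x
                        for w, x in zip(row, ui)) for row in W[i][j]]
            preds.append(apply_dropout(pred, dropout, rng))
        u_hat.append(preds)

    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v = []
    for _ in range(iterations):
        c = [softmax(b_i) for b_i in b]  # coupling coefficients per input capsule
        v = []
        for j in range(n_out):
            dim = len(u_hat[0][j])
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                 for d in range(dim)]
            v.append(squash(s))
        for i in range(n_in):
            for j in range(n_out):
                # Agreement: dot product of prediction and current output.
                b[i][j] += sum(p * q for p, q in zip(u_hat[i][j], v[j]))
    return v
```

With `noise_std=0.0` and `dropout=0.0` the function behaves as it would at inference time. Passing small non-zero values mimics the training-time behavior described above.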

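To calculate predicted label $\hat{y}$, we apply the softmax function to each dimension of the output capsule vectors to achieve stronger polarization rather than simply using the length of the output capsules [59]. The final results are the means of all softmax outputs:

$\hat{y} = \frac{1}{m} \sum_{i=1}^{m} \text{softmax}\left( \begin{bmatrix} v^{(1)}_i \\ v^{(2)}_i \end{bmatrix} \right)$    (2)

Since there is no reconstruction in Capsule-Forensics, we simply use the cross-entropy loss function (equation 3) and the Adam optimizer [34] to optimize the network:

$L(y, \hat{y}) = -\left( y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right)$    (3)

where $y$ is the ground truth label, $\hat{y}$ is the predicted label, $m$ is the dimension of the output capsule $\mathbf{v}^{(j)}$, and $v^{(j)}_i$ is the $i$-th element of $\mathbf{v}^{(j)}$.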
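The prediction and loss steps can be sketched as follows for the binary case. The sketch applies the softmax across the two output capsules separately for each dimension and then averages the results. The function names and the label convention (1 for fake, 0 for real) are assumptions made for illustration.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def predict(v_real, v_fake):
    """Return (p_real, p_fake) from the two output capsule vectors.

    For each dimension, softmax is taken over the (real, fake) pair;
    the per-dimension probabilities are then averaged.
    """
    m = len(v_real)
    per_dim = [softmax([r, f]) for r, f in zip(v_real, v_fake)]
    p_real = sum(p[0] for p in per_dim) / m
    p_fake = sum(p[1] for p in per_dim) / m
    return p_real, p_fake


def cross_entropy(y, y_hat, eps=1e-12):
    """Binary cross-entropy; y = 1 for fake and 0 for real (assumed convention)."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))
```

Because each per-dimension softmax sums to one, the two averaged probabilities also sum to one, so either of them can serve as the scalar prediction in the loss.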

3.4 How Capsule-Forensics Works

To illustrate how Capsule-Forensics works, we used a Capsule-Forensics network with three primary capsules trained on the FaceForensics++ database [56]. We applied both regularizations (using random noise and dropout during training) and used cropped images as input. For visualization, we applied and modified an open-source tool [51] implementing the guided back-propagation algorithm [60]. To visualize each primary capsule in this way, we chose the latent features extracted before the statistical pooling layers since they still had the 2D structure.

The first question we address is about capsule learning: What did each capsule learn and were the learned features the same given that the capsules had the same architecture? Before training a neural net or a capsule network, weight initialization needs to be applied (in our case, we used a normal distribution for weight initialization). Therefore, the capsules' starting points differed. During the learning process, these initial differences forced each capsule to focus on features that may be near but not identical to those of the others. The activation of each capsule and of the whole network are illustrated in Fig. 4. The differences in activation among capsules and between each capsule and the whole network are also shown. The regions of interest mainly include the eyes, nose, mouth region, and facial contours. Several capsules missed some of these regions, and several failed to detect the manipulated input (e.g., the 3rd capsule in Fig. 5). However, thanks to the agreements between the capsules driven by the dynamic routing algorithm, the final results mostly focused on the important regions detected by all capsules. If the Capsule-Forensics network were replaced by a CNN using only the third primary capsule, the CNN would also fail to detect the manipulated input.

Since detecting manipulated images is the forensics problem addressed here, the behavior of the Capsule-Forensics network differs from that in the explanation of the original capsule network for the inverse graphics problem, in which the focus is on the spatial hierarchies between simple and complex objects [25, 59, 26]. In the forensics problem, abnormal appearances are the key features, so each primary capsule tries its best to capture them and communicate its findings to the other capsules. This behavior is similar to that of jurors during a trial, and the judgment is the final detection result.

Figure 4: Activation of the three capsules and the whole Capsule-Forensics network (columns 2, 3, 4, and 5, respectively) on images created using deepfake [63] (row 1), Face2Face [65] (row 3), and FaceSwap [56] (row 5) methods. Column 6 shows the manipulated regions corresponding to the manipulated images in column 1. The first three columns of rows 2, 4, and 6 show the differences between the activations of capsules 1 and 2, 1 and 3, and 2 and 3 on the corresponding row above, respectively. The three last columns in order show the differences between the activations of capsules 1, 2, and 3 and the activation of the whole network.
Figure 5: Example case in which one capsule did not work correctly. The first row shows the activation of the whole network and of the three capsules. The second row from left to right shows the input image and the differences between the activation of each capsule and of the whole network. Capsule 3 failed to detect the manipulated image, but thanks to the other two capsules, the final result was still correct.
Figure 6: Activation of three primary capsules and of whole network for real input.
Figure 7: Agreement between three primary capsules and fake output capsule for fake input.
Figure 8: Agreement between three primary capsules and real output capsule for fake input.

The second question we address is about the activation of the network given real input: Given real input, does the network still focus on the same regions as for fake input? A capsule network has output capsules equal in number to the number of labels. Therefore, real inputs also trigger the capsule network, as illustrated in Fig. 6. The activation areas were similar to those for the fake inputs, mostly focused on the eyes, nose, mouth region, and facial contours. The agreement between primary capsules was the same as with the fake input. Given the behaviors in both cases, we concluded that, regardless of the input type, the primary capsules focus on specific areas and test them to determine whether they are original or manipulated. The results are then routed to the appropriate output capsule (the real or the fake one for binary classification). The algorithm then calculates the agreement for both the real and fake output capsules. The stronger the agreement for a capsule, the more certain the label.

The third question is about agreement between the primary capsules for time series (video) input: How does the agreement behave over time? To answer this question, we tested the Capsule-Forensics network on a deepfake video and captured the activation of the real and fake capsules before the softmax function was applied. The results are plotted in Figs. 7 and 8. We took the output before the softmax function because the values are separated more and are thus easier to distinguish in the plots. Each output capsule was a 4-dimensional vector, and the dynamic routing algorithm calculated agreement for all primary capsules on each dimension of the output capsules. As we can see from the two figures, agreement varied over time and differed between dimensions. Near the end of the video, there were a few frames that caused disagreements between the primary capsules. However, because we took the average of the agreements over the four dimensions, these disagreements were partially canceled out. Furthermore, the primary capsules had stronger agreement for the fake capsule than for the real capsule; therefore, the final results were still correct. It is important to note that this type of ‘high level fusion’ is performed on latent features, unlike traditional probabilistic ensemble fusion.

4 Evaluation

In our evaluation of the proposed method, we first tested the improvements made to the Capsule-Forensics network: (1) using a larger input size, (2) using dropout during training, and (3) using a larger number of primary capsules (ten). To make the results more convincing, we tested these settings on a large and challenging database – the FaceForensics++ database [56], which focuses on computer-manipulated images and videos. After determining the best settings, we evaluated the network's ability to detect fully computer-generated images and presentation attacks.

For training, we used the Adam optimizer [34]. We used a dropout rate of 5% and a batch size of either 100 or 32, depending on the input size. We used the exact network architecture illustrated in Fig. 3.

For comparison with other methods, depending on the database, we used three metrics.

  • Equal error rate (EER): the value when the false acceptance rate (FAR) is equal to the false rejection rate (FRR).

  • Half total error rate: $\text{HTER} = \frac{\text{FAR} + \text{FRR}}{2}$.

  • Classification accuracy: $\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$, where TP, TN, FP, and FN are true positive, true negative, false positive, and false negative, respectively.
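The three metrics listed above can be computed as in the following sketch. It assumes that each sample has a "fake" score, that label 1 marks a fake (attack) sample and label 0 a real one, and that a sample is rejected when its score reaches the threshold. These conventions and the function names are illustrative assumptions. The EER here is only approximated by scanning the observed scores as candidate thresholds.

```python
def far_frr(scores, labels, threshold):
    """Return (FAR, FRR) at a threshold.

    FAR: fraction of fake samples accepted (score below the threshold).
    FRR: fraction of real samples rejected (score at or above it).
    """
    fakes = [s for s, y in zip(scores, labels) if y == 1]
    reals = [s for s, y in zip(scores, labels) if y == 0]
    far = sum(s < threshold for s in fakes) / len(fakes)
    frr = sum(s >= threshold for s in reals) / len(reals)
    return far, frr


def hter(far, frr):
    """Half total error rate: the mean of FAR and FRR."""
    return (far + frr) / 2.0


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def approximate_eer(scores, labels):
    """Approximate the EER at the threshold where FAR and FRR are closest."""
    best = min((far_frr(scores, labels, t) for t in sorted(set(scores))),
               key=lambda pair: abs(pair[0] - pair[1]))
    return hter(*best)
```

On small test sets FAR and FRR may never be exactly equal, which is why the sketch reports their mean at the closest threshold instead of interpolating.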

4.1 Detecting Computer-Manipulated Images/Videos

In a test to detect computer-manipulated images and videos, we used the FaceForensics++ database [56], which includes three kinds of manipulations (deepfake [63], Face2Face [65], and FaceSwap [56]) and three compression levels (no compression, light compression, and heavy compression). The database was divided into a training set, a validation set, and a test set, as shown in Table 1. For the training set, we took the first 100 frames of each input video while for the validation and test sets, we took only the first 10 frames. Since the Capsule-Forensics network is not limited to binary classification, we also evaluated its multi-class classification ability by changing the number of output capsules, from ‘Real’ and ‘Fake’ capsules to ‘Real’, ‘Deepfakes’, ‘Face2Face’, and ‘FaceSwap’ capsules. This modification is straightforward and did not require significant changes to the network architecture.

Type Training set Validation set Test set
Real vids imgs vids imgs vids imgs
Deepfakes vids imgs vids imgs vids imgs
Face2Face vids imgs vids imgs vids imgs
FaceSwap vids imgs vids imgs vids imgs

Table 1: Configuration of training, validation, and test sets based on FaceForensics++ database [56].

Network | Binary classification accuracy (%) | Binary classification EER (%) | Multi-class classification accuracy (%) | Number of parameters
XceptionNet [56] | 91.46 | 9.98 | 91.33 | 20,811,050
Capsule-Forensics (old) [48] | 87.73 | 15.69 | 85.89 | 2,796,889
Capsule-Forensics (old) + Noise [48] | 88.11 | 15.71 | 87.12 | 2,796,889
Capsule-Forensics light | 90.02 | 10.95 | 87.51 | 2,796,889
Capsule-Forensics light + Noise | 91.12 | 11.60 | 87.54 | 2,796,889
Capsule-Forensics | 91.65 | 11.36 | 88.51 | 3,896,638
Capsule-Forensics + Noise | 91.48 | 11.62 | 89.98 | 3,896,638
Capsule-Forensics light + Dropout | 91.36 | 11.61 | 89.19 | 2,796,889
Capsule-Forensics light + Dropout + Noise | 91.28 | 11.38 | 88.44 | 2,796,889
Capsule-Forensics + Dropout | 92.20 | 10.96 | 90.51 | 3,896,638
Capsule-Forensics + Dropout + Noise | 92.02 | 10.26 | 91.22 | 3,896,638
Capsule-Forensics + Dropout + Noise (video) | 93.11 | 10.26 | 92.90 | 3,896,638

Table 2: Results for all versions of Capsule-Forensics and baseline for images of various sizes except for last row showing results for videos (‘light’ means new version of Capsule-Forensics with only three primary capsules).
Method Real Deepfakes Face2Face FaceSwap
XceptionNet  [56] 89.43 94.81 88.00 92.76
Capsule-Forensics + Dropout + Noise 89.57 92.17 90.00 92.79
Capsule-Forensics + Dropout + Noise (video) 89.57 92.17 90.36 92.79
Table 3: Multi-class classification accuracy for FaceForensics++ database [56] (%).

We used a variation of XceptionNet [12] as a baseline classifier and trained it in accordance with the guidelines used by Rössler et al. [56]. In their research, it achieved state-of-the-art performance on detecting manipulated images. However, XceptionNet is a large network with more than 20 million parameters. We also compared the versions of Capsule-Forensics that include the new improvements from this work (Capsule-Forensics light with three primary capsules and Capsule-Forensics with ten primary capsules) with the previous version (Capsule-Forensics (old) [48]), as listed in Table 2. The previous version included three primary capsules and had the option of adding random noise during the training process to reduce overfitting. For the new versions, one of the new settings enabled the use of a larger input size than before, meaning that a larger number of important artifacts in the images could be preserved. Another new setting enabled the use of more capsules (ten capsules instead of three), which increased the network capability. We call the 3-capsule version the light one. The last new setting enabled the use of dropout during the training process, which also played the role of a regularizer. This final version is called Capsule-Forensics + Dropout + Noise in Table 2. For simplicity, we also evaluated the final version of Capsule-Forensics on video inputs by applying frame aggregation, that is, by averaging the classification probabilities of the first ten frames (this version is named Capsule-Forensics + Dropout + Noise (video)). All versions of Capsule-Forensics were trained for 25 epochs. The results for XceptionNet and all versions of Capsule-Forensics on the test set are shown in Table 2.
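The frame aggregation step (averaging the classification probabilities of the first ten frames) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function name `aggregate_video` and the array layout (one row of class probabilities per frame) are assumptions.

```python
import numpy as np


def aggregate_video(frame_probs, max_frames=10):
    """Average per-frame class probabilities into one video-level prediction.

    frame_probs: array-like of shape (num_frames, num_classes), one row of
                 class probabilities per frame (assumed layout).
    max_frames:  only the first `max_frames` rows are used.
    Returns (video_probs, predicted_class_index).
    """
    probs = np.asarray(frame_probs, dtype=float)
    if probs.ndim != 2 or probs.shape[0] == 0:
        raise ValueError("frame_probs must be a non-empty 2-D array")
    probs = probs[:max_frames]
    video_probs = probs.mean(axis=0)
    return video_probs, int(video_probs.argmax())


# Three frames, two classes ('Real', 'Fake'): one frame votes 'Real',
# but the averaged probabilities still favor 'Fake'.
frames = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
video_probs, label = aggregate_video(frames)
print(video_probs, label)
```

Averaging probabilities, rather than taking a majority vote over per-frame labels, lets confident frames outweigh uncertain ones, which is one plausible reason the video-level rows in Table 2 score higher than the image-level rows.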


As expected, the use of larger images improved the performance of Capsule-Forensics substantially. Regarding random noise, in our previous work [48], most of the training sets were small, so random noise made a significant contribution. In this work, we used the first 100 frames instead of the first 10 for the training set, so the set was larger. Although the random noise did not result in improvement in all cases, it still played an important role in improving classification accuracy and reducing the EER, especially when it was used with dropout. Increasing the number of primary capsules also helped to improve performance. With these improvements combined, the performance of Capsule-Forensics with both random noise and dropout was almost the same as that of XceptionNet even though the number of parameters was about five times smaller (3,896,638 versus 20,811,050), which is significant. The effects of these improvements on Capsule-Forensics performance are clearer for multi-class classification, which is more difficult than binary classification. For video input, the frame aggregation strategy increased classification accuracy for both binary and multi-class classification, as was observed in our previous work [48].
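For readers unfamiliar with the EER column in Table 2, the sketch below shows one common way to estimate an equal error rate from detector scores. The paper does not state how its EER was computed, so the threshold sweep, the function name `equal_error_rate`, and the score convention (higher score means more likely fake) are assumptions.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Estimate the equal error rate (EER) with a simple threshold sweep.

    scores: detector outputs, higher meaning "more likely fake" (assumed).
    labels: 0 for real samples, 1 for fake samples.
    Returns the error rate at the threshold where the false positive rate
    (real flagged as fake) and false negative rate (fake missed) are closest.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    real = scores[labels == 0]
    fake = scores[labels == 1]
    if real.size == 0 or fake.size == 0:
        raise ValueError("both classes must be present")

    # Candidate thresholds: every observed score, plus +inf ("flag nothing").
    thresholds = np.append(np.unique(scores), np.inf)
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        fpr = np.mean(real >= t)  # real samples wrongly flagged as fake
        fnr = np.mean(fake < t)   # fake samples that were missed
        gap = abs(fpr - fnr)
        if gap < best_gap:
            best_gap, eer = gap, (fpr + fnr) / 2.0
    return float(eer)


print(equal_error_rate([0.1, 0.6, 0.4, 0.9], [0, 0, 1, 1]))
```

Because the EER summarizes the whole score distribution at a single operating point, it can move differently from accuracy at a fixed decision threshold, which is consistent with rows in Table 2 where accuracy improves while the EER does not.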

Beyond the results shown in Table 2, we also performed a deeper analysis of multi-class classification by calculating the classification accuracy for each class. As shown in Table 3, the Face2Face attack was the most difficult one to detect for both networks, especially for XceptionNet, with only 88.00% accuracy. Using video inputs raised the Capsule-Forensics accuracy on Face2Face further, from 90.00% to 90.36%. However, XceptionNet performed better on deepfake detection, with an accuracy of 94.81%. For real input, both networks had slightly higher false positive rates than their false negative rates on the manipulated inputs. Overall, Capsule-Forensics had more balanced performance across all labels than XceptionNet.
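A per-class breakdown like the one in Table 3 can be obtained from a confusion matrix. The sketch below assumes that "accuracy for each class" means the fraction of samples of that class that were classified correctly; the function name `per_class_accuracy` and the integer label encoding are illustrative choices, not taken from the paper.

```python
import numpy as np

CLASSES = ["Real", "Deepfakes", "Face2Face", "FaceSwap"]


def per_class_accuracy(y_true, y_pred, num_classes):
    """Return the fraction of correctly classified samples for each class.

    y_true, y_pred: integer class indices in [0, num_classes).
    Classes with no samples get NaN.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)

    # Confusion matrix: rows are true classes, columns are predictions.
    cm = np.zeros((num_classes, num_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)

    totals = cm.sum(axis=1)
    correct = np.diag(cm)
    return np.where(totals > 0, correct / np.maximum(totals, 1), np.nan)


# Toy labels: two samples per class, with two mistakes.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3]
for name, acc in zip(CLASSES, per_class_accuracy(y_true, y_pred, len(CLASSES))):
    print(f"{name}: {100 * acc:.2f}%")
```

Under this reading, the error rate on the 'Real' row is a false positive rate (real input flagged as manipulated), while the error rates on the three attack rows are false negative rates.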

4.2 Detecting Fully Computer-Generated Images

In addition to testing Capsule-Forensics on computer-manipulated images, we also trained it to classify fully computer-generated images (CGIs) and photographic images (PIs). We used the dataset created by Rahmouni et al. [54]. The CGI set contained 1800 high-resolution screenshots from five photo-realistic video games. The PI set included 1800 high-resolution photographic images selected from the RAISE dataset [17]. We followed the prescribed protocol by training Capsule-Forensics on a patch dataset and evaluating it on both the patch and full-scale datasets. Accordingly, the inputs to the network were small patches rather than full-size images. For the full-scale dataset, we used the same patch aggregation strategy, calculating the average classification probability of all patches. As shown in Table 4, both the old and new versions of Capsule-Forensics outperformed the three other state-of-the-art classifiers and achieved 100% classification accuracy on the full-scale dataset (the ‘Large-scale’ column).
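The patch aggregation strategy can be sketched as follows. This is a sketch under stated assumptions: the non-overlapping tiling, the `patch_size` parameter (the actual patch size is not given here), and the `patch_classifier` callable are all hypothetical stand-ins for the prescribed protocol and the trained network.

```python
import numpy as np


def split_into_patches(image, patch_size):
    """Tile an (H, W, C) image into non-overlapping square patches.

    Border regions that do not fill a whole patch are discarded
    (an assumption made for this sketch).
    """
    height, width = image.shape[:2]
    return [
        image[r:r + patch_size, c:c + patch_size]
        for r in range(0, height - patch_size + 1, patch_size)
        for c in range(0, width - patch_size + 1, patch_size)
    ]


def classify_full_image(image, patch_classifier, patch_size):
    """Average patch-level class probabilities over the whole image.

    patch_classifier: callable mapping one patch to a sequence of class
                      probabilities (hypothetical stand-in for the network).
    Returns (mean_probs, predicted_class_index).
    """
    patches = split_into_patches(image, patch_size)
    if not patches:
        raise ValueError("image is smaller than one patch")
    probs = np.array([patch_classifier(p) for p in patches], dtype=float)
    mean_probs = probs.mean(axis=0)
    return mean_probs, int(mean_probs.argmax())


# Toy run: a dummy classifier that calls bright patches class 1.
image = np.zeros((4, 6, 3))
image[:, 2:, :] = 1.0
dummy = lambda patch: [0.1, 0.9] if patch.mean() > 0.5 else [0.8, 0.2]
print(classify_full_image(image, dummy, patch_size=2))
```

Averaging over many patches smooths out individual patch errors, which fits the pattern in Table 4 where every method scores higher on full images than on single patches.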

Method Patch accuracy (%) Large-scale accuracy (%)
Rahmouni et al. [54] 89.76 99.30
Quan et al. [52] 94.75 99.58
Nguyen et al. [47] 96.55 99.86
Capsule-Forensics (old) [48] 97.00 100.00
Capsule-Forensics (new) 97.05 100.00
Table 4: Accuracy of state-of-the-art methods on discriminating CGIs and PIs.

4.3 Detecting Presentation Attacks

In addition to the experiments demonstrating the ability of Capsule-Forensics to detect images manipulated or generated by computer, we trained Capsule-Forensics on Idiap’s Replay-Attack database [11]. Because of the resolution of the videos, we center-cropped each frame before inputting it to the network. As shown in Table 5, Nguyen et al.’s version [47] and Capsule-Forensics achieved perfect results, with no misclassified frames.

Method HTER (%)
Chingovska et al. [11] 17.17
de Freitas Pereira et al. [18] 08.51
Kim et al. [33] 12.50
Yang et al. [73] 02.30
Menotti et al. [43] 00.75
Alotaibi et al. [4] 10.00
Ito et al. [28] 00.43
Nguyen et al. [47] 00.00
Capsule-Forensics (old) [48] 00.00
Capsule-Forensics (new) 00.00
Table 5: Half total error rate (HTER) of state-of-the-art detection methods on Replay-Attack database [11].
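The HTER reported in Table 5 is conventionally the mean of the false acceptance rate (attacks accepted as genuine) and the false rejection rate (genuine accesses rejected). The sketch below computes it from raw counts; the function name and argument names are illustrative, and the choice of decision threshold, which is normally fixed on a development set, is outside the scope of the sketch.

```python
def half_total_error_rate(false_accepts, num_attacks, false_rejects, num_genuine):
    """Return the half total error rate (HTER) as a percentage.

    false_accepts: attack samples wrongly accepted as genuine.
    num_attacks:   total number of attack samples.
    false_rejects: genuine samples wrongly rejected as attacks.
    num_genuine:   total number of genuine samples.
    """
    if num_attacks <= 0 or num_genuine <= 0:
        raise ValueError("both sample counts must be positive")
    far = false_accepts / num_attacks   # false acceptance rate
    frr = false_rejects / num_genuine   # false rejection rate
    return 100.0 * (far + frr) / 2.0


# A detector that makes no mistakes on any sample has an HTER of zero.
print(half_total_error_rate(0, 400, 0, 80))
```

Because the two error rates are averaged with equal weight, an HTER of 00.00 requires both the false acceptance rate and the false rejection rate to be zero, which matches the statement that no frames were misclassified.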

5 Conclusion

Our proposed Capsule-Forensics method can be applied to digital image and video forensics, including detecting computer-manipulated/generated images and videos and detecting presentation attacks. The improvements made have given Capsule-Forensics performance that is equivalent to or better than that of state-of-the-art methods on the tasks tested while using fewer parameters, which helps reduce computation cost. Detailed analysis of how Capsule-Forensics works, by visualizing the activation of each capsule and of the whole network and by analyzing the agreement between the primary capsules for video input, explained the mechanism that helped Capsule-Forensics perform well on several digital forensics tasks. With these promising results and the understanding gained from the detailed analysis, this work should lead to further research and development on capsule networks, not only for digital forensics but also in many other areas. Future work could include applying capsule networks to time-series input rather than simply using frame aggregation, and improving the generalizability of capsule networks, an active and challenging research topic in machine learning.

6 Acknowledgments

This research was supported by JSPS KAKENHI Grants JP16H06302 and JP18H04120 and by JST CREST Grant JPMJCR18A6, Japan.


  • [1] D. Afchar, V. Nozick, J. Yamagishi, and I. Echizen (2018) MesoNet: a compact facial video forgery detection network. In International Workshop on Information Forensics and Security (WIFS), Cited by: §1, §2.3, §2.3.
  • [2] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, and H. Li (2019) Protecting world leaders against deep fakes. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 38–45. Cited by: §1, §2.3.
  • [3] O. Alexander, M. Rogers, W. Lambeth, J. Chiang, W. Ma, C. Wang, and P. Debevec (2010) The digital emily project: achieving a photorealistic digital actor. IEEE Computer Graphics and Applications 30 (4), pp. 20–31. Cited by: Figure 1.
  • [4] A. Alotaibi and A. Mahmood (2017) Deep face liveness detection based on nonlinear diffusion using convolution neural network. Signal, Image and Video Processing. Cited by: §2.2, Table 5.
  • [5] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen (2017) Bringing portraits to life. ACM Transactions on Graphics. Cited by: §1, §2.1.
  • [6] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. Cited by: §2.3.
  • [7] M. T. Bahadori (2018) Spectral capsule networks. In International Conference on Learning Representations (ICLR), Cited by: §2.4.
  • [8] J. H. Bappy, A. K. Roy-Chowdhury, J. Bunk, L. Nataraj, and B. Manjunath (2017) Exploiting spatial structure for localizing manipulated image regions. In International Conference on Computer Vision (ICCV), pp. 4970–4979. Cited by: §2.3.
  • [9] J. H. Bappy, C. Simons, L. Nataraj, B. Manjunath, and A. K. Roy-Chowdhury (2019) Hybrid lstm and encoder-decoder architecture for detection of image forgeries. IEEE Transactions on Image Processing. Cited by: §2.3.
  • [10] B. Bayar and M. C. Stamm (2016) A deep learning approach to universal image manipulation detection using a new convolutional layer. In Workshop on Information Hiding and Multimedia Security (IH&MMSEC), Cited by: §2.3.
  • [11] I. Chingovska, A. Anjos, and S. Marcel (2012) On the effectiveness of local binary patterns in face anti-spoofing. In International Conference of the Biometrics Special Interest Group (BIOSIG), Cited by: §1, §2.2, §4.3, Table 5.
  • [12] F. Chollet (2017) Xception: deep learning with depthwise separable convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1.
  • [13] A. Costa-Pazo, D. Jiménez-Cabello, E. Vazquez-Fernandez, J. L. Alba-Castro, and R. J. López-Sastre (2019) Generalized presentation attack detection: a face anti-spoofing evaluation proposal. In International Conference on Biometrics (ICB), Cited by: §2.2.
  • [14] D. Cozzolino, G. Poggi, and L. Verdoliva (2017) Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection. In Workshop on Information Hiding and Multimedia Security (IH&MMSEC), Cited by: §2.3.
  • [15] D. Cozzolino, J. Thies, A. Rössler, C. Riess, M. Nießner, and L. Verdoliva (2018) Forensictransfer: weakly-supervised domain adaptation for forgery detection. arXiv preprint arXiv:1812.02510. Cited by: §2.2.
  • [16] K. Dale, K. Sunkavalli, M. K. Johnson, D. Vlasic, W. Matusik, and H. Pfister (2011) Video face replacement. ACM Transactions on Graphics. Cited by: §2.1.
  • [17] D. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato (2015) Raise: a raw images dataset for digital image forensics. In Multimedia Systems Conference (MMSys), pp. 219–224. Cited by: §4.2.
  • [18] T. de Freitas Pereira, A. Anjos, J. M. De Martino, and S. Marcel (2013) Can face anti-spoofing countermeasures work in a real world scenario?. In International Conference on Biometrics (ICB), Cited by: §1, §2.2, Table 5.
  • [19] Dexter studio. Note: 2019-09-01 Cited by: Figure 1.
  • [20] S. Fatemifar, M. Awais, S. Rahimzadeh Arashloo, and J. Kittler (2019) Combining multiple one-class classifiers for anomaly based face spoofing attack detection. In International Conference on Biometrics (ICB), Cited by: §2.2.
  • [21] J. Fridrich and J. Kodovsky (2012) Rich models for steganalysis of digital images. IEEE Transactions on Information Forensics and Security. Cited by: §1, §2.3.
  • [22] O. Fried, A. Tewari, M. Zollhöfer, A. Finkelstein, E. Shechtman, D. B. Goldman, K. Genova, Z. Jin, C. Theobalt, and M. Agrawala (2019) Text-based editing of talking-head video. In International Conference and Exhibition on Computer Graphics and Interactive Techniques (SIGGRAPH), Cited by: §2.1.
  • [23] P. Garrido, L. Valgaerts, H. Sarmadi, I. Steiner, K. Varanasi, P. Perez, and C. Theobalt (2015) Vdub: modifying face video of actors for plausible visual alignment to a dubbed audio track. In Computer Graphics Forum, Vol. 34. Cited by: §2.1.
  • [24] A. George and S. Marcel (2019) Deep pixel-wise binary supervision for face presentation attack detection. In International Conference on Biometrics (ICB), Cited by: §1, §2.2.
  • [25] G. E. Hinton, A. Krizhevsky, and S. D. Wang (2011) Transforming auto-encoders. In International Conference on Artificial Neural Networks (ICANN), Cited by: §1, §2.4, §3.4.
  • [26] G. E. Hinton, S. Sabour, and N. Frosst (2018) Matrix capsules with EM routing. In International Conference on Learning Representations Workshop (ICLRW), Cited by: §2.4, §3.4.
  • [27] T. Iesmantas and R. Alzbutas (2018) Convolutional capsule network for classification of breast cancer histology images. In International Conference Image Analysis and Recognition (ICIAR), pp. 853–860. Cited by: §2.4.
  • [28] K. Ito, T. Okano, and T. Aoki (2017) Recent advances in biometrics security: a case study of liveness detection in face recognition. In Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Cited by: §1, §2.2, Table 5.
  • [29] A. Jaiswal, W. AbdAlmageed, Y. Wu, and P. Natarajan (2018) Capsulegan: generative adversarial capsule network. In European Conference on Computer Vision (ECCV), Cited by: §2.4.
  • [30] A. Jaiswal, S. Xia, I. Masi, and W. AbdAlmageed (2019) RoPAD: robust presentation attack detection through unsupervised adversarial invariance. In International Conference on Biometrics (ICB), Cited by: §2.2.
  • [31] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4401–4410. Cited by: Figure 1, §1.
  • [32] H. Kim, P. Garrido, A. Tewari, W. Xu, J. Thies, M. Nießner, P. Pérez, C. Richardt, M. Zollhöfer, and C. Theobalt (2018) Deep video portraits. In International Conference and Exhibition on Computer Graphics and Interactive Techniques (SIGGRAPH), Cited by: §1, §2.1.
  • [33] W. Kim, S. Suh, and J. Han (2015) Face liveness detection from a single image via diffusion speed model. IEEE TIP. Cited by: §1, §2.2, Table 5.
  • [34] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §3.3, §4.
  • [35] P. Korshunov and S. Marcel (2018) Speaker inconsistency detection in tampered video. In European Signal Processing Conference (EUSIPCO), pp. 2375–2379. Cited by: §1, §2.3.
  • [36] P. Korshunov and S. Marcel (2019) Vulnerability assessment and detection of deepfake videos. In International Conference on Biometrics (ICB), Cited by: §1, §2.3.
  • [37] Y. Li, M. Chang, H. Farid, and S. Lyu (2018) In ictu oculi: exposing AI generated fake face videos by detecting eye blinking. arXiv preprint arXiv:1806.02877. Cited by: §1, §2.3.
  • [38] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440. Cited by: §2.3.
  • [39] S. Marcel, M. S. Nixon, and S. Z. Li (2019) Handbook of biometric anti-spoofing. Vol. 2, Springer. Cited by: §1.
  • [40] F. Marra, D. Gragnaniello, D. Cozzolino, and L. Verdoliva (2018) Detection of gan-generated fake images over social networks. In Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 384–389. Cited by: §2.3.
  • [41] F. Marra, D. Gragnaniello, L. Verdoliva, and G. Poggi (2019) Do gans leave artificial fingerprints?. In Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 506–511. Cited by: §2.3.
  • [42] S. Mehta, A. Uberoi, A. Agarwal, M. Vatsa, and R. Singh (2019) Crafting a panoptic face presentation attack detector. In International Conference on Biometrics (ICB), Cited by: §2.2.
  • [43] D. Menotti, G. Chiachia, A. Pinto, W. R. Schwartz, H. Pedrini, A. X. Falcao, and A. Rocha (2015) Deep representations for iris, face, and fingerprint spoofing detection. IEEE Transactions on Information Forensics and Security. Cited by: §2.2, Table 5.
  • [44] U. Muhammad and A. Hadid (2019) Face anti-spoofing using hybrid residual learning framework. In International Conference on Biometrics (ICB), Cited by: §2.2.
  • [45] New ai deepfake app creates nude images of women in seconds. Note: 2019-07-01 Cited by: §2.1.
  • [46] H. H. Nguyen, F. Fang, J. Yamagishi, and I. Echizen (2019) Multi-task learning for detecting and segmenting manipulated facial images and videos. In International Conference on Biometrics: Theory, Applications and Systems (BTAS), Cited by: §2.2, §2.3.
  • [47] H. H. Nguyen, N. T. Tieu, H. Nguyen-Son, V. Nozick, J. Yamagishi, and I. Echizen (2018) Modular convolutional neural network for discriminating between computer-generated images and photographic images. In International Conference on Availability, Reliability and Security (ARES), Cited by: §2.3, §2.3, §3.1, §3.2, §4.3, Table 4, Table 5.
  • [48] H. H. Nguyen, J. Yamagishi, and I. Echizen (2019) Capsule-forensics: using capsule networks to detect forged images and videos. In International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2307–2311. Cited by: §1, §2.3, §2.4, §3.1, §4.1, §4.1, Table 2, Table 4, Table 5.
  • [49] O. Nikisins, A. George, and S. Marcel (2019) Domain adaptation in multi-channel autoencoder based features for robust face anti-spoofing. In International Conference on Biometrics (ICB), Cited by: §2.2.
  • [50] Y. Nirkin, Y. Keller, and T. Hassner (2019) FSGAN: subject agnostic face swapping and reenactment. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.1.
  • [51] U. Ozbulak (2019) PyTorch cnn visualizations. GitHub. Note: Cited by: §3.4.
  • [52] W. Quan, K. Wang, D. Yan, and X. Zhang (2018) Distinguishing between natural and computer-generated images using convolutional neural networks. IEEE Transactions on Information Forensics and Security. Cited by: §2.3, Table 4.
  • [53] R. Raghavendra, K. B. Raja, S. Venkatesh, and C. Busch (2017) Transferable deep-CNN features for detecting digital and print-scanned morphed face images. In Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), Cited by: §2.2.
  • [54] N. Rahmouni, V. Nozick, J. Yamagishi, and I. Echizen (2017) Distinguishing computer graphics from natural images using convolution neural networks. In International Workshop on Information Forensics and Security (WIFS), Cited by: §1, §2.3, §2.3, §2.3, §3.1, §3.2, §4.2, Table 4.
  • [55] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2018) FaceForensics: a large-scale video dataset for forgery detection in human faces. arXiv preprint arXiv:1803.09179. Cited by: §1, §2.3, §2.3, §2.3, §3.1.
  • [56] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) Faceforensics++: learning to detect manipulated facial images. In International Conference on Computer Vision (ICCV), Cited by: §1, §2.3, §2.3, Figure 4, §3.1, §3.4, §4.1, §4.1, Table 1, Table 2, Table 3, §4.
  • [57] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision. Cited by: §2.2, §3.1.
  • [58] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, and P. Natarajan (2019) Recurrent convolutional strategies for face manipulation detection in videos. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 80–87. Cited by: §1, §2.3.
  • [59] S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In Conference on Neural Information Processing Systems (NIPS), Cited by: §1, §2.4, §3.1, §3.3, §3.3, §3.4.
  • [60] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In International Conference on Computer Vision (ICCV), pp. 618–626. Cited by: §3.4.
  • [61] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), Cited by: §3.1.
  • [62] S. Suwajanakorn, S. M. Seitz, and I. Kemelmacher-Shlizerman (2017) Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics. Cited by: §1, §2.1.
  • [63] Terrifying high-tech porn: creepy ’deepfake’ videos are on the rise. Note: 2018-02-17 Cited by: Figure 1, §1, §1, §2.1, §2.3, Figure 4, §4.1.
  • [64] J. Thies, M. Zollhöfer, and M. Nießner (2019) Deferred neural rendering: image synthesis using neural textures. In Computer Graphics and Interactive Techniques (SIGGRAPH), Cited by: Figure 1, §2.1.
  • [65] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner (2016) Face2Face: real-time face capture and reenactment of RGB videos. In Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 1, §1, §2.1, §2.3, Figure 4, §4.1.
  • [66] J. Thies, M. Zollhöfer, M. Stamminger, C. Theobalt, and M. Nießner (2018) FaceVR: real-time facial reenactment and eye gaze control in virtual reality. ACM Transactions on Graphics. Cited by: §2.1.
  • [67] S. Tripathy, J. Kannala, and E. Rahtu (2019) ICface: interpretable and controllable face reenactment using gans. arXiv preprint arXiv:1904.01909. Cited by: §1, §2.1.
  • [68] K. Vougioukas, S. A. Center, S. Petridis, and M. Pantic (2019) End-to-end speech-driven realistic facial animation with temporal gans. In Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 37–40. Cited by: §2.1.
  • [69] G. Wang, H. Han, S. Shan, and X. Chen (2019) Improving cross-database face presentation attack detection via adversarial domain adaptation. In International Conference on Biometrics (ICB), Cited by: §2.2.
  • [70] S. Wang, O. Wang, A. Owens, R. Zhang, and A. A. Efros (2019) Detecting photoshopped faces by scripting photoshop. In International Conference on Computer Vision (ICCV), Cited by: §2.3.
  • [71] E. Xi, S. Bing, and Y. Jin (2017) Capsule network performance on complex data. arXiv preprint arXiv:1712.03480. Cited by: §2.4.
  • [72] C. Xiang, L. Zhang, Y. Tang, W. Zou, and C. Xu (2018) MS-capsnet: a novel multi-scale capsule network. IEEE Signal Processing Letters 25 (12), pp. 1850–1854. Cited by: §2.4.
  • [73] J. Yang, Z. Lei, and S. Z. Li (2014) Learn convolutional neural network for face anti-spoofing. arXiv preprint arXiv:1408.5601. Cited by: §2.2, Table 5.
  • [74] M. Yang, W. Zhao, J. Ye, Z. Lei, Z. Zhao, and S. Zhang (2018) Investigating capsule networks with dynamic routing for text classification. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3110–3119. Cited by: §2.4.
  • [75] E. Zakharov, A. Shysheya, E. Burkov, and V. Lempitsky (2019) Few-shot adversarial learning of realistic neural talking head models. arXiv preprint arXiv:1905.08233. Cited by: §2.1.
  • [76] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis (2017) Two-stream neural networks for tampered face detection. In Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), Cited by: §2.3.
  • [77] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis (2018) Learning rich features for image manipulation detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1053–1061. Cited by: §2.3.