Implementation and demonstration of the paper: Multi-task Learning for Detecting and Segmenting Manipulated Facial Images and Videos
Detecting manipulated images and videos is an important topic in digital media forensics. Most detection methods use binary classification to determine the probability of a query being manipulated. Another important topic is locating manipulated regions (i.e., performing segmentation), which are mostly created by three commonly used attacks: removal, copy-move, and splicing. We have designed a convolutional neural network that uses the multi-task learning approach to simultaneously detect manipulated images and videos and locate the manipulated regions for each query. Information gained by performing one task is shared with the other task and thereby enhances the performance of both tasks. A semi-supervised learning approach is used to improve the network's generalizability. The network includes an encoder and a Y-shaped decoder. Activation of the encoded features is used for the binary classification. The output of one branch of the decoder is used for segmenting the manipulated regions while that of the other branch is used for reconstructing the input, which helps improve overall performance. Experiments using the FaceForensics and FaceForensics++ databases demonstrated the network's effectiveness against facial reenactment attacks and face swapping attacks as well as its ability to deal with the mismatch condition for previously seen attacks. Moreover, fine-tuning using just a small amount of data enables the network to deal with unseen attacks.
A major concern in digital image forensics is the deepfake phenomenon, a worrisome example of the societal threat posed by computer-generated spoofing videos. Anyone who shares video clips or pictures of himself or herself on the Internet may become a victim of a spoof-video attack. Several available methods can be used to translate head and facial movements in real time [30, 14] or create videos from photographs [4, 9]. Moreover, thanks to advances in speech synthesis and voice conversion, an attacker can also clone a person’s voice (only a few minutes of speech are needed) and synchronize it with the visual component to create an audiovisual spoof [29, 9]. These methods may become widely available in the near future, enabling anyone to produce deepfake material.
Several countermeasures have been proposed for the visual domain. Most of them were evaluated using only one or a few databases, including the CGvsPhoto database, the Deepfakes databases [2, 16, 17], and the FaceForensics/FaceForensics++ databases [26, 27]. Cozzolino et al. addressed the transferability problem of several state-of-the-art spoofing detectors and developed an autoencoder-like architecture that supports generalization and can be easily adapted to a new domain with simple fine-tuning.
Another major concern in digital image forensics is locating manipulated regions. The shapes of the segmentation masks for manipulated facial images and videos could reveal hints about the type of manipulation used, as illustrated in Figure 1. Most existing forensic segmentation methods focus on three commonly used means of tampering: removal, copy-move, and splicing [6, 32, 7]. As in other image segmentation tasks, these methods need to process full-scale images. Rahmouni et al. used a sliding window to deal with high-resolution images, an approach subsequently adopted by Nguyen et al. and Rossler et al. This sliding window approach effectively segments manipulated regions in spoofed images created using the Face2Face method. However, it requires scoring many overlapping windows with a spoofing detection method, which is computationally expensive.
We have developed a multi-task learning approach for simultaneously performing classification and segmentation of manipulated facial images. Our autoencoder comprises an encoder and a Y-shaped decoder and is trained in a semi-supervised manner. The activation of the encoded features is used for classification. The output of one branch of the decoder is used for segmentation, and the output of the other branch is used to reconstruct the input data. The information gained from these tasks (classification, segmentation, and reconstruction) is shared among them, thereby improving the overall performance of the network.
Creating a photo-realistic digital actor is a dream of many people working in computer graphics. One initial success is the Digital Emily Project, in which sophisticated devices were used to capture the appearance of an actress and her motions to synthesize a digital version of her. At that time, this ability was unavailable to attackers, so it was impossible to create a digital version of a victim. This changed in 2016 when Thies et al. demonstrated facial reenactment in real time. Subsequent work led to the ability to translate head poses with simple requirements that are met by any normal person. The Xpression mobile app (https://xpression.jp/), which provides the same function, was subsequently released. Instead of using RGB videos as was done in previous work [30, 14], Averbuch et al. and Chung et al. used ID-type photos [4, 9], which are easily obtained on social networks. By combining this capability with speech synthesis or voice conversion techniques, attackers are now able to make spoof videos with voices [29, 9], which are more convincingly authentic.
Several countermeasures have been introduced for detecting manipulated videos. A typical approach is to treat a video as a sequence of image frames and work on the images as input. The noise-based method proposed by Fridrich and Kodovsky is considered one of the best handcrafted detectors. Its improved version using a convolutional neural network (CNN) demonstrated the effectiveness of using automatic feature extraction for detection. Among deep learning approaches to detection, fine-tuning and transfer learning take advantage of high-performing pre-trained models [24, 26]. Using part of a pre-trained CNN as the feature extractor is an effective way to improve the performance of a CNN [21, 22]. Other approaches to detection include using a constrained convolutional layer, using a statistical pooling layer, using a two-stream network, using a lightweight CNN, and using two cascaded convolutional layers at the bottom of a CNN. Cozzolino et al. created a benchmark for determining the transferability of state-of-the-art detectors for use in detecting unseen attacks. They also proposed an autoencoder-like architecture with which adaptation ability was greatly increased. Li et al. proposed using a temporal approach and developed a network for detecting eye blinking, which is not well reproduced in fake videos. Our proposed method, besides performing classification, provides segmentation maps of manipulated areas. This additional information could be used as a reference for judging the authenticity of images and videos, especially when the classification task fails to detect spoofed inputs.
There are two commonly used approaches to locating manipulated regions in images: segmenting the entire input image and repeatedly performing binary classification using a sliding window. The segmentation approach is commonly used to detect removal, copy-move, and splicing attacks [6, 7]. Semantic segmentation methods [18, 5] can also be used for forgery segmentation. A slightly different segmentation approach is to return the boxes that represent the boundaries of the manipulated regions instead of returning segmentation masks. The sliding window approach is used more for detecting spoofing regions generated by a computer to create spoof images or videos from bona fide ones [25, 21, 26]. In this approach, binary classifiers for classifying images as spoof or bona fide are called at each position of the sliding window. The stride of the sliding window may equal the length of the window (non-overlapped) or be less than the length (overlapped) [21, 26]. Our proposed method takes the first approach but with one major difference: only the facial areas are considered instead of the entire image. This overcomes the computation expense problem when dealing with large inputs.
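To make the cost of the sliding window approach concrete, the following minimal sketch counts how many classifier calls are needed for one frame. The frame size, window length, and strides are hypothetical values chosen only for illustration; they are not taken from the cited works.

```python
def window_positions(length, window, stride):
    """Start offsets of a sliding window along one axis (no padding)."""
    if window > length:
        return []
    return list(range(0, length - window + 1, stride))


def count_windows(height, width, window, stride):
    """Number of classifier calls needed to scan a height x width image."""
    rows = len(window_positions(height, window, stride))
    cols = len(window_positions(width, window, stride))
    return rows * cols


# Hypothetical 1080 x 1920 frame scanned with a 100-pixel square window.
non_overlapped = count_windows(1080, 1920, window=100, stride=100)
overlapped = count_windows(1080, 1920, window=100, stride=50)
print(non_overlapped, overlapped)  # 190 740
```

Halving the stride roughly quadruples the number of classifier calls, which is the computational burden that a single segmentation pass over the cropped face avoids.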
Unlike other single-target methods [22, 11, 7], our proposed method outputs both the probability of an input being spoofed and segmentation maps of the manipulated regions in each frame of the input, as diagrammed in Figure 2. Video inputs are treated as a set of frames. We focused on facial images in this work, so the face areas are extracted in the pre-processing phase. In theory, the proposed method can deal with various sizes of input images. However, to keep training simple, we resize cropped images to a fixed size before feeding them into the autoencoder. The autoencoder outputs the reconstructed version of the input image (which is used only in training), the probability of the input image having been spoofed, and the segmentation map corresponding to this input image. For video inputs, we average the probabilities of all frames before drawing a conclusion on whether the input is real or fake.
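The video-level decision described above can be sketched as follows. The function names and the 0.5 decision threshold are assumptions made for illustration; the text only states that per-frame probabilities are averaged.

```python
def video_spoof_probability(frame_probs):
    """Average the per-frame spoof probabilities of one video."""
    if not frame_probs:
        raise ValueError("at least one frame probability is required")
    return sum(frame_probs) / len(frame_probs)


def classify_video(frame_probs, threshold=0.5):
    """Return a (label, probability) pair; the threshold is an assumed value."""
    prob = video_spoof_probability(frame_probs)
    label = "fake" if prob >= threshold else "real"
    return label, prob


label, prob = classify_video([0.9, 0.8, 0.4])
print(label, round(prob, 2))  # fake 0.7
```

Averaging lets a few misclassified frames be outvoted by the rest of the video, at the cost of diluting manipulations that affect only a short segment.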
The partitioning of the latent features (motivated by Cozzolino et al.’s work) and the Y-shaped design of the decoder enable the autoencoder to share valuable information between the classification, segmentation, and reconstruction tasks and thereby improve overall performance by reducing loss. There are three types of loss: activation loss $L_{act}$, segmentation loss $L_{seg}$, and reconstruction loss $L_{rec}$.
Given label $c_i \in \{0, 1\}$, activation loss measures the accuracy of partitioning in the latent space on the basis of the activation of the two halves of the encoded features:
$$L_{act} = \frac{1}{N} \sum_{i} \left( |a_{i,1} - c_i| + |a_{i,0} - (1 - c_i)| \right),$$
where $N$ is the number of samples, and $a_{i,0}$ and $a_{i,1}$ are the activation values, defined as the $L_1$ norms of the corresponding halves of the latent features, $h_{i,0}$ and $h_{i,1}$ (given $K$ is the number of features of $\{h_{i,0} | h_{i,1}\}$):
$$a_{i,c} = \frac{1}{2K} \|h_{i,c}\|_1, \quad c \in \{0, 1\}.$$
This ensures that, given an input $x_i$ of class $c$, the corresponding half of the latent features $h_{i,c}$ is activated ($a_{i,c} > 0$). The other half, $h_{i,1-c}$, remains quiesced ($a_{i,1-c} = 0$). To force the two decoders, $D_{seg}$ and $D_{rec}$, to learn the right decoding schemes, we set the off-class part to zero before feeding it to the decoders ($h_{i,1-c} := 0$).
We utilize cross-entropy loss as the segmentation loss to measure the agreement between the segmentation mask ($s_i$) and the ground-truth mask ($m_i$) corresponding to input $x_i$:
$$L_{seg} = \frac{1}{N} \sum_{i} \left\| m_i \log(s_i) + (1 - m_i) \log(1 - s_i) \right\|_1.$$
The reconstruction loss uses the $L_2$ distance to measure the difference between the reconstructed image ($\hat{x}_i$) and the original one ($x_i$). For $N$ samples, the reconstruction loss is
$$L_{rec} = \frac{1}{N} \sum_{i} \left\| x_i - \hat{x}_i \right\|_2.$$
The total loss is the weighted sum of the three losses:
$$L = \gamma_{act} L_{act} + \gamma_{seg} L_{seg} + \gamma_{rec} L_{rec}.$$
Unlike Cozzolino et al., we set the three weights equal to each other (equal to 1). This is because the classification task and the segmentation task are equally important, and the reconstruction task plays an important role in the segmentation task. We experimentally compared the effects of the different settings (described below).
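The three losses and their equally weighted sum can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: each activation value is taken as the mean absolute value of its half of the latent features (the exact normalization constant is an assumption), images and masks are flattened lists of floats, and all function names are hypothetical.

```python
import math


def activation(half):
    """Mean absolute value of one latent half (assumed normalization)."""
    return sum(abs(v) for v in half) / len(half)


def activation_loss(h0_batch, h1_batch, labels):
    """Penalize activation of the off-class half and silence of the on-class half."""
    total = 0.0
    for h0, h1, c in zip(h0_batch, h1_batch, labels):
        total += abs(activation(h1) - c) + abs(activation(h0) - (1 - c))
    return total / len(labels)


def segmentation_loss(pred_masks, true_masks, eps=1e-7):
    """Pixel-wise binary cross-entropy, summed per mask and averaged over samples."""
    total = 0.0
    for pred, true in zip(pred_masks, true_masks):
        for s, m in zip(pred, true):
            s = min(max(s, eps), 1.0 - eps)  # avoid log(0)
            total -= m * math.log(s) + (1 - m) * math.log(1 - s)
    return total / len(true_masks)


def reconstruction_loss(recon_batch, input_batch):
    """Mean L2 distance between reconstructed and original images."""
    total = 0.0
    for x_hat, x in zip(recon_batch, input_batch):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, x)))
    return total / len(input_batch)


def total_loss(l_act, l_seg, l_rec, w_act=1.0, w_seg=1.0, w_rec=1.0):
    """Weighted sum of the three losses; all weights default to 1."""
    return w_act * l_act + w_seg * l_seg + w_rec * l_rec


# A perfectly partitioned "fake" sample (label 1) gives zero activation loss.
print(activation_loss([[0.0, 0.0]], [[1.0, 1.0]], [1]))  # 0.0
```

With equal weights, none of the three tasks dominates the gradient by construction, which matches the stated rationale that classification and segmentation are equally important.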
The Y-shaped autoencoder was implemented as shown in Figure 3. It is a fully convolutional CNN using convolutional windows (for the encoder) and deconvolutional windows (for the decoder) with a stride of 1 interspersed with a stride of 2. Each convolutional layer is followed by a batch normalization layer. The selection block allows only the true half of the latent features to pass through and zeros out the other half. Therefore, the decoders are forced to decode only the true half of the latent features. The dimension of the embedding is 128, which has been shown to be optimal. For the segmentation branch, a softmax activation function at the end is used to output segmentation maps. For the reconstruction branch, a hyperbolic tangent function (tanh) is used to shape the output into the range [-1, 1]. For simplicity, we directly feed normalized images into the autoencoder without converting them into residual images. Further work will focus on investigating the benefits of using residual images in the classification and segmentation tasks.
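The behavior of the selection block can be illustrated with a short sketch. The function name and the list-based representation of the latent halves are hypothetical, and the label is simply passed in as an argument, since how it is chosen outside training is not covered here.

```python
def select_latent(h0, h1, label):
    """Keep the latent half that matches `label` and zero out the other half."""
    if label not in (0, 1):
        raise ValueError("label must be 0 or 1")
    if label == 1:
        return [0.0] * len(h0), list(h1)
    return list(h0), [0.0] * len(h1)


# Hypothetical 4-dimensional latent vector split into two halves.
kept0, kept1 = select_latent([0.2, 0.1], [0.9, 0.7], label=1)
print(kept0, kept1)  # [0.0, 0.0] [0.9, 0.7]
```

Because the decoders only ever see the on-class half, any information they need for segmentation and reconstruction has to be encoded there, which ties the decoding tasks to the classification partition.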
| 5 | No_Recon | Deeper | 1 | 1 | L2 | Proposed method without reconstruction branch |
| 6 | Proposed_New | Deeper | 1 | 1 | L2 | Complete proposed method with new settings |
We evaluated our proposed network using two databases: FaceForensics and FaceForensics++. The FaceForensics database contains 1004 real videos collected from YouTube and their corresponding manipulated versions, which are divided into two sub-datasets:
Source-to-Target Reenactment dataset containing 1004 fake videos created using the Face2Face method; in each input pair for reenactment, the source video (the attacker) and the target video (the victim) are different.
Self-Reenactment dataset containing another 1004 fake videos created again using the Face2Face method; in each input pair for reenactment, the source and target videos are the same. Although this dataset is not meaningful from the attacker’s perspective, it does present a more challenging benchmark than does the Source-to-Target Reenactment dataset.
Each dataset was split into 704 videos for training, 150 for validation, and 150 for testing. The database also provided segmentation masks corresponding to manipulated videos. Three levels of compression based on the H.264 codec (http://www.h264encoder.com/) were used: no compression, light compression (quantization = 23), and strong compression (quantization = 40).
The FaceForensics++ database is an enhanced version of the FaceForensics database and includes the Face2Face dataset plus the FaceSwap (https://github.com/MarekKowalski/FaceSwap/) dataset (graphics-based manipulation) and the DeepFakes (https://github.com/deepfakes/faceswap/) dataset (deep-learning-based manipulation). It contains 1,000 real videos and 3,000 manipulated videos (1,000 in each dataset). Each dataset was split into 720 videos for training, 140 for validation, and 140 for testing. The same three levels of compression based on the H.264 codec were used with the same quantization values.
For simplicity, we used only videos with light compression (quantization = 23). Images were extracted from videos using Cozzolino et al.’s settings: 200 frames of each training video were used for training, and 10 frames of each validation and testing video were used for validation and testing, respectively. There is no detailed description of the rules for frame selection, so we selected the first (200 or 10) frames of each video and cropped the facial areas. For all databases, we applied normalization with mean = (0.485, 0.456, 0.406) and std = (0.229, 0.224, 0.225); these values have been widely used in the ImageNet Large Scale Visual Recognition Challenge. We did not apply any data augmentation to the training datasets.
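The frame selection and normalization steps can be sketched as follows. The constants are the channel statistics commonly used with ImageNet-trained models and are stated here as an assumption; the function names and the convention that pixel values are already scaled to [0, 1] are likewise illustrative.

```python
# Commonly used ImageNet channel statistics (assumed here), in RGB order.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((v - m) / s for v, m, s in zip(rgb, mean, std))


def first_frames(frames, split):
    """Keep the first 200 frames for training videos and the first 10 otherwise."""
    limit = 200 if split == "train" else 10
    return frames[:limit]


print(normalize_pixel(IMAGENET_MEAN))  # (0.0, 0.0, 0.0)
print(len(first_frames(list(range(300)), "train")))  # 200
```

Taking the first frames keeps the procedure deterministic, although consecutive frames are highly correlated, so the effective diversity of the training set is lower than the raw frame count suggests.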
The training and testing datasets were designed as shown in Table 1. For the Training, Test 1, and Test 2 datasets, the Face2Face method was used to create manipulated videos. Images in Test 2 were harder to detect than those in Test 1 since the source and target videos used for reenactment were the same, meaning that the reenacted video frames had better quality. Therefore, we call Test 1 and Test 2 the match and mismatch conditions for a seen attack. Test 3 used the Deepfake attack method while Test 4 used the FaceSwap attack method, both presented in the FaceForensics++ database. Neither attack method was used to create the training set, so both were considered unseen attacks. For the classification task, we calculated the accuracy and equal error rate (EER) of each method. For the segmentation task, we used pixel-wise accuracy between ground-truth masks and segmentation masks. The FT_Res, FT, and Deeper_FT methods could not perform the segmentation task. All the results were at the image level.
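The three evaluation metrics can be sketched in plain Python as follows. This is a minimal illustration: the EER is approximated by scanning the observed scores as thresholds and taking the point where the false positive and false negative rates are closest, the 0.5 accuracy threshold is an assumption, and the function names are hypothetical.

```python
def error_rates(scores, labels, threshold):
    """False positive and false negative rates when score >= threshold means fake."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    fnr = sum(s < threshold for s in pos) / len(pos)
    fpr = sum(s >= threshold for s in neg) / len(neg)
    return fpr, fnr


def equal_error_rate(scores, labels):
    """Approximate EER: average of FPR and FNR where their gap is smallest."""
    best_gap, best_eer = None, None
    for t in sorted(set(scores)) + [float("inf")]:
        fpr, fnr = error_rates(scores, labels, t)
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2
    return best_eer


def accuracy(scores, labels, threshold=0.5):
    """Fraction of images classified correctly at an assumed threshold."""
    correct = sum((s >= threshold) == (y == 1) for s, y in zip(scores, labels))
    return correct / len(labels)


def pixel_accuracy(pred_mask, true_mask):
    """Fraction of pixels that agree between two flattened binary masks."""
    return sum(p == t for p, t in zip(pred_mask, true_mask)) / len(true_mask)


scores, labels = [0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]
print(accuracy(scores, labels), equal_error_rate(scores, labels))  # 0.75 0.5
```

Accuracy depends on a fixed threshold whereas the EER does not, so the two can diverge when a classifier's scores shift as a whole on data from a new attack.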
To evaluate the contributions of each component in the Y-shaped autoencoder, we designed the settings as shown in Table 2. The FT_Res and FT methods are re-implementations of Cozzolino et al.’s method with and without using residual images. They can also be understood as the Y-shaped autoencoder without the segmentation branch. The Deeper_FT method is a deeper version of FT, which has the same depth as the proposed method. The Proposed_Old method is the proposed method using weighting settings from Cozzolino et al.’s work, the No_Recon method is the version of the proposed method without the reconstruction branch, and the Proposed_New method is the complete proposed method with the Y-shaped autoencoder using equal losses for the three tasks and the mean squared error for reconstruction loss.
Since shallower networks take longer to converge than deeper ones, we trained the shallower ones for 100 epochs and the deeper ones for 50 epochs. For each method, the training stage with the highest accuracy for the classification task and a reasonable segmentation loss (if available) was used to perform all the tests described in this section.
The results for the match and mismatch conditions for seen attacks are shown in Tables 3 (Test 1) and 4 (Test 2), respectively. The deeper networks (the last four) had substantially better classification performance than the shallower ones (the first two) proposed by Cozzolino et al. Among the four deeper networks, there were no substantial differences in their performances on the classification task. For the segmentation task, the No_Recon and Proposed_New methods, which used the new weighting settings, had higher accuracy than the Proposed_Old method, which used the old weighting settings.
[Tables 3 (Test 1) and 4 (Test 2). Columns: classification Acc (%), classification EER (%), segmentation Acc (%).]
The performance of all methods was slightly degraded when dealing with the mismatch condition for seen attacks. The FT_Res and Proposed_New methods had the best adaptation ability, as indicated by the lower degradation in their scores. This indicates the importance of using residual images (for the FT_Res method) and of using the reconstruction branch (for the Y-shaped autoencoder with new weighting settings: Proposed_New method). The reconstruction branch also helped the Proposed_New method achieve the highest score on the segmentation task.
When encountering unseen attacks, all six methods had substantially lower accuracies and higher EERs, as shown in Tables 5 (Test 3) and 6 (Test 4). In Test 3, the shallower methods had better adaptation ability, especially the FT_Res method, which uses residual images. The deeper methods, which had a greater chance of being over-fitted, had nearly random classification results. In Test 4, although all methods suffered from nearly random classification accuracies, their better EERs indicated that the decision thresholds had been moved.
A particularly interesting finding was in the segmentation results. Although degraded, the segmentation accuracies were still high, especially in Test 4, in which FaceSwap copied the facial area from the source faces to the target ones using a computer-graphics method. When dealing with unseen attacks, this segmentation information could thus be an important clue in addition to the classification results for judging the authenticity of the queried images and videos.
[Tables 5 (Test 3) and 6 (Test 4). Columns: classification Acc (%), classification EER (%), segmentation Acc (%).]
We used the validation set (a small set normally used for selecting hyper-parameters in training that differs from the test set) of the FaceForensics++ FaceSwap dataset for fine-tuning all the methods. To ensure that the amount of data was small, we used only ten frames for each video. We divided the dataset into two parts: 100 videos of each class for training and 40 of each class for evaluation. We trained the methods for 50 epochs and selected the best models on the basis of their performance on the evaluation set.
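As a quick check of how little data the fine-tuning uses, the image counts implied by this split can be worked out as follows. The sketch assumes two classes (real and manipulated) with 140 validation videos each, as given in the database description; the constant and function names are illustrative.

```python
VIDEOS_PER_CLASS = 140          # validation videos per class in FaceForensics++
TRAIN_VIDEOS_PER_CLASS = 100    # used for fine-tuning
EVAL_VIDEOS_PER_CLASS = VIDEOS_PER_CLASS - TRAIN_VIDEOS_PER_CLASS
NUM_CLASSES = 2                 # assumed: real and manipulated
FRAMES_PER_VIDEO = 10


def image_count(videos_per_class):
    """Total number of images across both classes for a given number of videos."""
    return videos_per_class * NUM_CLASSES * FRAMES_PER_VIDEO


train_images = image_count(TRAIN_VIDEOS_PER_CLASS)
eval_images = image_count(EVAL_VIDEOS_PER_CLASS)
print(EVAL_VIDEOS_PER_CLASS, train_images, eval_images)  # 40 2000 800
```

Under these assumptions the networks are fine-tuned on about 2,000 images, compared with 200 frames from each of the 704 training videos per class in the original training set.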
The results after fine-tuning for Test 4 are shown in Table 7. Classification and segmentation accuracies increased by around 25% and 8%, respectively, which is remarkable given the small amount of data used. The one exception was the Proposed_Old method, whose segmentation accuracy did not improve. The FT_Res method adapted better than the FT method, which supports Cozzolino et al.’s claim. The Proposed_New method had the highest transferability against unseen attacks, as evidenced by the results in Table 7.
The proposed convolutional neural network with a Y-shaped autoencoder demonstrated its effectiveness for both classification and segmentation tasks without using a sliding window, as is commonly used by classifiers. Information sharing among the classification, segmentation, and reconstruction tasks improved the network’s overall performance, especially for the mismatch condition for seen attacks. Moreover, the autoencoder can quickly adapt to deal with unseen attacks by using only a few samples for fine-tuning. Future work will mainly focus on investigating the effect of using residual images on the autoencoder’s performance, processing high-resolution images without resizing, improving its ability to deal with unseen attacks, and extending it to the audiovisual domain.
This research was supported by JSPS KAKENHI Grant Numbers JP16H06302 and JP18H04120, and by JST CREST Grant Number JPMJCR18A6, Japan.
Deepfakes: a new threat to face recognition? Assessment and detection. Idiap-RR Idiap-RR-18-2018, Idiap, 2018.
Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814, 2010.
ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.