
We Need No Pixels: Video Manipulation Detection Using Stream Descriptors

by David Güera, et al.

Manipulating video content is easier than ever. Due to the misuse potential of manipulated content, multiple detection techniques that analyze the pixel data from the videos have been proposed. However, clever manipulators should also carefully forge the metadata and auxiliary header information, which is harder to do for videos than images. In this paper, we propose to identify forged videos by analyzing their multimedia stream descriptors with simple binary classifiers, completely avoiding the pixel space. Using well-known datasets, our results show that this scalable approach can achieve a high manipulation detection score if the manipulators have not done a careful data sanitization of the multimedia stream descriptors.



1 Introduction

Video manipulation is now within reach of any individual. Recent improvements in machine learning have enabled the creation of powerful video manipulation tools. Face2Face (Thies et al., 2016), Recycle-GAN (Bansal et al., 2018), Deepfakes (Korshunov & Marcel, 2018), and other face swapping techniques (Korshunova et al., 2017) embody the latest generation of these open source video forging methods. Both the research community (Brundage et al., 2018) and governments across the globe (Vincent, 2018; Chesney & Citron, 2018) take it as a certainty that more complex tools will appear in the near future. Classical and current video editing methods have already demonstrated dangerous potential, having been used to generate political propaganda (Bird, 2015), revenge porn (Curtis, 2018), and child-exploitation material (Cole, 2018).

Figure 1: Examples of some of the information extracted from the video stream descriptors. These descriptors are necessary to decode and playback a video.

Due to the ever-increasing sophistication of these techniques, uncovering manipulations in videos remains an open problem. Existing video manipulation detection solutions focus entirely on observing anomalies in the pixel domain of the video. Unfortunately, from a game-theoretic perspective, if both manipulators and detectors are equally powerful, a Nash equilibrium will be reached (Stamm et al., 2012a). Under that scenario, real and manipulated videos will be indistinguishable from each other, and the best detector will only be capable of random guessing. Hence, methods that look beyond the pixel domain are critically needed. So far, little attention has been paid to the necessary metadata and auxiliary header information that is embedded in every video. As we shall present, this information can be exploited to uncover unskilled video content manipulators.

In this paper, we introduce a new approach to address the video manipulation detection problem. To avoid the zero-sum, leader-follower game that characterizes current detection solutions, our approach completely avoids the pixel domain. Instead, we use the multimedia stream descriptors (Jack, 2007) that ensure the playback of any video (as shown in Figure 1). First, we construct a feature vector with all the descriptor information for a given video. Using a database of known manipulated videos, we train an ensemble of a support vector machine and a random forest that acts as our detector. Finally, during testing, we generate the feature vector from the stream descriptors of the video under analysis, feed it to the ensemble, and report a manipulation probability.

The contributions of this paper are summarized as follows. First, we introduce a new technique that does not require access to the pixel content of the video, making it fast and scalable, even on consumer-grade computing equipment. Instead, we rely on the multimedia descriptors present in any video, which are considerably harder to manipulate due to their role in the decoding phase. Second, we thoroughly test our approach using the NIST MFC datasets (Guan et al., 2019) and show that even with a limited amount of labeled videos, simple machine learning ensembles can be highly effective detectors. Finally, all of our code and trained classifiers will be made available so the research community can reproduce our work with their own datasets.

2 Related Work

The multimedia forensics research community has a long history of trying to address the problem of detecting manipulations in video sequences. (Milani et al., 2012) provide an extensive and thorough overview of the main research directions and solutions that have been explored in the last decade. More recent work has focused on specific video manipulations, such as local tampering detection in video sequences (Stamm et al., 2012b; Bestagini et al., 2013), video re-encoding detection (Bian et al., 2014; Bestagini et al., 2016), splicing detection in videos (Hsu et al., 2008; Mullan et al., 2017; Mandelli et al., 2018), and near-duplicate video detection (Bayram et al., 2008; Lameri et al., 2017). (D’Amiano et al., 2015, 2019) also present solutions that use 3D PatchMatch (Barnes et al., 2009) for video forgery detection and localization, whereas (D’Avino et al., 2017) suggest using data-driven machine learning based approaches. Solutions tailored to detecting the latest video manipulation techniques have also been recently presented. These include the works of (Li et al., 2018; Güera & Delp, 2018) on detecting Deepfakes and (Rössler et al., 2018; Matern et al., 2019) on Face2Face (Thies et al., 2016) manipulation detection.

As covered by (Milani et al., 2012), image-based forensics techniques that leverage camera noise residuals (Khanna et al., 2008), image compression artifacts (Bianchi & Piva, 2012), or geometric and physics inconsistencies in the scene (Bulan et al., 2009) can also be used in videos when applied frame by frame. In (Fan et al., 2011) and (Huh et al., 2018), Exif image metadata is used to detect image brightness and contrast adjustments and splicing manipulations in images, respectively. Finally, (Iuliani et al., 2019) use video file container metadata for video integrity verification and source device identification. To the best of our knowledge, video manipulation detection techniques that exploit the multimedia stream descriptors have not been previously proposed.

3 Proposed Method

Figure 2: (a) Block diagram of the training stage of our proposed method. We process a labeled database of manipulated and pristine videos to generate a feature vector for each video from its multimedia stream descriptors. These feature vectors are then used to train and select the best detector. (b) Block diagram of the testing stage of our proposed method. Given a suspect video, a feature vector is generated and processed by the previously selected detector. Finally, a manipulation probability for the suspect video is reported.

Current video manipulation detection approaches rely on uncovering manipulations by studying pixel domain anomalies. Instead, we propose to use the multimedia stream descriptors of videos as our main source of information to spot manipulated content. To do so, our method works in two stages, as presented in Figure 2. First, during the training phase, we extract the multimedia stream descriptors from a labeled database of manipulated and pristine videos. In practice, such a database can be easily constructed using a limited amount of manually labeled data coupled with a semi-supervised learning approach, as done by (Zannettou et al., 2018). Then, we encode these descriptors as a feature vector for each given video. We apply median normalization to all numerical features. As for categorical features, each is encoded as its own unique numerical value. Once we have processed all the videos in the database, we use all the feature vectors to train different binary classifiers as our detectors. More specifically, we use a random forest, a support vector machine (SVM), and an ensemble of both detectors. The best hyperparameters for each detector are selected by performing a random search cross-validation over a -split stratified shuffling of the data and trials per split. Figure 2(a) summarizes this first stage. In our implementation, we use ffprobe (Bellard et al., 2019) for the multimedia stream descriptor extraction. For the encoding of the descriptors as feature vectors, we use pandas (McKinney, 2010) and scikit-learn (Pedregosa et al., 2011). As for the training and testing of the SVM, the random forest, and the ensemble, we use the implementations available in the scikit-learn library.
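To make the extraction step concrete, the following is a minimal sketch of how stream descriptors could be pulled from a file with ffprobe and flattened into a field-to-value mapping. The function names `probe_streams` and `flatten_descriptors`, the key-prefixing scheme, and the choice to include `-show_format` alongside `-show_streams` are assumptions made for illustration; they are not taken from our released code.

```python
import json
import subprocess


def probe_streams(path):
    """Run ffprobe on one file and return its parsed JSON report.

    Requires the ffprobe binary on PATH. The flags request JSON output
    for the container ("format") and for every stream in the file.
    """
    cmd = [
        "ffprobe", "-v", "quiet",
        "-print_format", "json",
        "-show_format", "-show_streams",
        path,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def flatten_descriptors(report):
    """Flatten an ffprobe report into a {field_name: value} dict.

    Hypothetical naming scheme: keys are prefixed by stream type, e.g.
    "video.codec_name". Nested values (such as "tags") are skipped, and
    a second stream of the same type overwrites the first, which keeps
    this sketch short but loses information.
    """
    features = {}
    for stream in report.get("streams", []):
        prefix = stream.get("codec_type", "unknown")
        for key, value in stream.items():
            if isinstance(value, (dict, list)):
                continue
            features[f"{prefix}.{key}"] = value
    for key, value in report.get("format", {}).items():
        if isinstance(value, (dict, list)):
            continue
        features[f"format.{key}"] = value
    return features
```

Calling `flatten_descriptors(probe_streams("suspect.mp4"))` (with a hypothetical file name) would yield one dictionary per video, which can then be encoded as a feature vector.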

Figure 2(b) shows how our method would work in practice. Given a suspect video, we extract its stream descriptors and generate its corresponding feature vector, which is normalized based on the values learnt during the training phase. Since some of the descriptor fields are optional, we perform additional post-processing to ensure that the feature vector can be processed by our trained detector. Concretely, if any field is missing in the video stream descriptors, we perform data imputation by mapping missing fields to a fixed numerical value. If previously unseen descriptor fields are present in the suspect video stream, they are ignored and not included in the corresponding suspect feature vector. Finally, the trained detector analyzes the suspect feature vector and computes a manipulation probability.
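The encoding and post-processing rules described above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than our implementation: the class name `DescriptorEncoder` is hypothetical, "median normalization" is taken here to mean subtracting the training median, the imputation constant `MISSING_VALUE` is an arbitrary choice, and mapping an unseen category to that same constant is an extra assumption the text does not specify.

```python
from statistics import median

# Assumed fixed imputation value; the text only says "a fixed numerical value".
MISSING_VALUE = -1.0


class DescriptorEncoder:
    """Encode descriptor dicts as fixed-length numeric feature vectors."""

    def fit(self, rows):
        # The vector layout is frozen to the fields seen during training.
        self.fields = sorted({key for row in rows for key in row})
        self.medians = {}  # numerical field -> training median
        self.codes = {}    # categorical field -> {category: numeric code}
        for field in self.fields:
            values = [row[field] for row in rows if field in row]
            if all(isinstance(v, (int, float)) for v in values):
                self.medians[field] = median(values)
            else:
                categories = sorted({str(v) for v in values})
                self.codes[field] = {c: float(i) for i, c in enumerate(categories)}
        return self

    def transform(self, row):
        vector = []
        # Iterating over the training fields means unseen fields in `row`
        # are ignored automatically.
        for field in self.fields:
            if field not in row:
                vector.append(MISSING_VALUE)  # impute missing optional field
            elif field in self.medians:
                # Normalize with the median learnt during training.
                vector.append(float(row[field]) - self.medians[field])
            else:
                # Assumption: unseen categories share the imputation value.
                vector.append(self.codes[field].get(str(row[field]), MISSING_VALUE))
        return vector
```

Because the field list, medians, and category codes are all fixed at training time, every suspect video yields a vector of the same length and scale that the trained detector expects.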

It is important to note that although our approach may be vulnerable to video re-encoding attacks, this is traded off for scalability, a limited need for labeled data, and a high video manipulation detection score, as we present in Section 4. Also, our solution is orthogonal to pixel-based methods and requires limited amounts of data, which means that, ideally, both approaches could be used simultaneously. Our approach could be used to quickly identify manipulated videos, minimizing the need to rely on human annotation. Later, these newly labeled videos could be used to improve the performance of pixel-based video manipulation detectors. Finally, following the recommendations of (Brundage et al., 2018), we want to reflect on a potential misuse of the proposed approach. We believe that our approach could be misused by someone with access to large amounts of labeled video data. Using that information, a malevolent adversary could identify specific individuals, such as journalists or confidential informants, who may submit anonymous videos using the same devices they use to upload videos to social media websites. To avoid this, different physical devices or proper video data sanitization should be used.

4 Experimental Results

4.1 Datasets

In order to evaluate the performance of our proposed approach, we use the Media Forensics Challenge (MFC) datasets (Guan et al., 2019). Collected by the National Institute of Standards and Technology (NIST), this data comprises over high provenance videos and manipulated videos. In our experiments, we use the videos from the following datasets for training, hyper-parameter selection, and validation: the Nimble Challenge development dataset, the MFC18 development version 1 and version 2 datasets, and the MFC18 GAN dataset. This represents a total of videos, of which are manipulated. For testing our model, we use the MFC18 evaluation dataset and the MFC19 validation dataset, which have a total of videos. Of those videos, have been manipulated.

4.2 Experimental Setup

To show the merits of our method in terms of scalability and limited compute requirements, we design the following experiment. First, we select machine learning binary classifiers that are well known for their modeling capabilities, even with limited access to training samples. As previously mentioned, we use a random forest, a support vector machine, and a soft voting classification ensemble with both. This final ensemble is weighted 4 to 1 in favor of the decision of the random forest. Then, to show the performance of each detector under different data availability scenarios, we train them using %, %, %, and % of the available training data. We use a stratified shuffle splitting policy to select these training subsets, meaning that the global ratio of manipulated to non-manipulated videos of the entire training set is preserved in the subsets. In all scenarios, a sequestered 25% subset of the training data is used for hyper-parameter selection and validation. Finally, the best validated model is selected for testing. Due to the imbalance of manipulated to non-manipulated videos, we use the Precision-Recall (PR) curve as our evaluation metric, as recommended by (Saito & Rehmsmeier, 2015). We also report the F1 score, the area under the curve (AUC) score, and the average precision (AP) score for each classifier.
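The setup above maps fairly directly onto scikit-learn. The sketch below shows one way to draw a stratified training subset and build a soft voting ensemble weighted 4 to 1 in favor of the random forest. The helper names `stratified_subset` and `build_ensemble`, the default hyperparameters, and the synthetic data in the demo are assumptions for illustration; the random hyperparameter search is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import SVC


def stratified_subset(X, y, fraction, seed=0):
    """Return a subset of (X, y) that preserves the class ratio of y."""
    splitter = StratifiedShuffleSplit(
        n_splits=1, train_size=fraction, random_state=seed
    )
    subset_idx, _ = next(splitter.split(X, y))
    return X[subset_idx], y[subset_idx]


def build_ensemble(seed=0):
    """Soft voting ensemble of a random forest and an SVM, weighted 4:1."""
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    # probability=True is required so the SVM can take part in soft voting.
    svm = SVC(probability=True, random_state=seed)
    return VotingClassifier(
        estimators=[("rf", forest), ("svm", svm)],
        voting="soft",
        weights=[4, 1],
    )


if __name__ == "__main__":
    # Synthetic, imbalanced stand-in for encoded descriptor vectors.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] > 0.8).astype(int)
    X_sub, y_sub = stratified_subset(X, y, fraction=0.5)
    model = build_ensemble().fit(X_sub, y_sub)
    # Column 1 holds the predicted manipulation probability.
    print(model.predict_proba(X[:3])[:, 1])
```

With soft voting, the reported manipulation probability is the weighted average of the two classifiers' predicted probabilities, so the random forest contributes four fifths of the final score.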

4.3 Results and Discussion

As we can see in Figure 3, Figure 4, Figure 5, and Figure 6, under all scenarios the voting ensemble of the random forest and the support vector machine generally achieves the best overall results, followed by the random forest and the SVM. More specifically, our best ensemble model achieves an F1 score of , an AUC score of and an AP score of . To contextualize these results, we have included the performance of a binary classifier baseline which predicts a video manipulation with probability . This corresponds to the true fraction of manipulated videos in the test set. Note that it is higher than the fraction of manipulated videos in the training subsets, which is . This baseline model would achieve an F1, AUC, and AP score of . We can see that our best model is three times better than the baseline in all reported metrics. Notice that, as seen in Figure 3, the ensemble trained with videos has achieved equal or better results than the ensembles trained with more videos. This shows that, even with a very limited number of stream descriptors, a properly tuned machine learning model can be trained easily to spot video manipulations.
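To see why the baseline scores the same value on all three metrics, consider the following short derivation. The symbols are introduced here for illustration: $N$ is the number of test videos, $\pi$ the true fraction of manipulated videos, and $p$ the probability with which the baseline flags a video, independently of its content. The ratios use expected counts, so they are large-sample approximations.

```latex
\begin{align*}
\mathbb{E}[\mathrm{TP}] &= N\pi p, \qquad
\mathbb{E}[\mathrm{FP}] = N(1-\pi)p, \qquad
\mathbb{E}[\mathrm{FN}] = N\pi(1-p) \\
\text{Precision} &\approx \frac{N\pi p}{N\pi p + N(1-\pi)p} = \pi \\
\text{Recall} &\approx \frac{N\pi p}{N\pi p + N\pi(1-p)} = p \\
F_1 &\approx \frac{2\,\pi\,p}{\pi + p}
\;\overset{p=\pi}{=}\; \pi
\end{align*}
```

Precision stays near $\pi$ whatever the flagging rate, so the PR curve of such a baseline is approximately flat at height $\pi$, and both the area under it and the average precision are approximately $\pi$ as well. Setting $p = \pi$ makes the F1 score match that value.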

Figure 3: PR curves, F1 score, AUC score, and AP score on the test set for all the trained models using % of the available training data (68 videos).
Figure 4: PR curves, F1 score, AUC score, and AP score on the test set for all the trained models using % of the available training data (169 videos).

5 Conclusion

Up until now, most video manipulation detection techniques have focused on analyzing the pixel data to spot forged content. In this paper, we have shown how simple machine learning classifiers can be highly effective at detecting video manipulations when the appropriate data is used. More specifically, we use an ensemble of a random forest and an SVM trained on multimedia stream descriptors from both forged and pristine videos. With this approach, we have achieved an extremely high video manipulation detection score while requiring very limited amounts of data. Based on our findings, our future work will focus on techniques that automatically perform data sanitization. This will allow us to remove metadata and auxiliary header information that may give away sensitive information such as the source of the video.

Figure 5: PR curves, F1 score, AUC score, and AP score on the test set for all the trained models using % of the available training data (339 videos).
Figure 6: PR curves, F1 score, AUC score, and AP score on the test set for all the trained models using % of the available training data (508 videos).


This material is based on research sponsored by DARPA and Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-0173. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA and Air Force Research Laboratory (AFRL) or the U.S. Government.