1 Introduction
Video manipulation is now within reach of any individual. Recent improvements in machine learning have enabled the creation of powerful video manipulation tools. Face2Face (Thies et al., 2016), Recycle-GAN (Bansal et al., 2018), Deepfakes (Korshunov & Marcel, 2018), and other face swapping techniques (Korshunova et al., 2017) embody the latest generation of these open source video forging methods. Both the research community (Brundage et al., 2018) and governments across the globe (Vincent, 2018; Chesney & Citron, 2018) regard it as a certainty that more sophisticated tools will appear in the near future. Classical and current video editing methods have already demonstrated their dangerous potential, having been used to generate political propaganda (Bird, 2015), revenge porn (Curtis, 2018), and child-exploitation material (Cole, 2018).
Due to the ever-increasing sophistication of these techniques, uncovering manipulations in videos remains an open problem. Existing video manipulation detection solutions focus entirely on finding anomalies in the pixel domain of the video. Unfortunately, from a game theoretic perspective it can easily be seen that, if both manipulators and detectors are equally powerful, a Nash equilibrium will be reached (Stamm et al., 2012a). Under that scenario, real and manipulated videos become indistinguishable from each other, and the best detector can do no better than random guessing. Hence, methods that look beyond the pixel domain are critically needed. So far, little attention has been paid to the metadata and auxiliary header information that is necessarily embedded in every video. As we shall show, this information can be exploited to uncover unskilled video content manipulators.
In this paper, we introduce a new approach to the video manipulation detection problem. To avoid the zero-sum, leader-follower game that characterizes current detection solutions, our approach operates entirely outside the pixel domain. Instead, we use the multimedia stream descriptors (Jack, 2007) that ensure the playback of any video, as shown in Figure 1. First, we construct a feature vector with all the descriptor information for a given video. Using a database of known manipulated videos, we train an ensemble of a support vector machine and a random forest that acts as our detector. Finally, during testing, we generate the feature vector from the stream descriptors of the video under analysis, feed it to the ensemble, and report a manipulation probability.
The contributions of this paper are summarized as follows. First, we introduce a new technique that does not require access to the pixel content of the video, making it fast and scalable even on consumer-grade computing equipment. Instead, we rely on the multimedia descriptors present in any video, which are considerably harder to manipulate due to their role in the decoding phase. Second, we thoroughly test our approach using the NIST MFC datasets (Guan et al., 2019) and show that, even with a limited amount of labeled videos, simple machine learning ensembles can be highly effective detectors. Finally, all of our code and trained classifiers will be made available at https://github.com/dguera/fake-video-detection-without-pixels so the research community can reproduce our work with their own datasets.
2 Related Work
The multimedia forensics research community has a long history of trying to address the problem of detecting manipulations in video sequences. (Milani et al., 2012) provide an extensive and thorough overview of the main research directions and solutions that have been explored in the last decade. More recent work has focused on specific video manipulations, such as local tampering detection in video sequences (Stamm et al., 2012b; Bestagini et al., 2013), video re-encoding detection (Bian et al., 2014; Bestagini et al., 2016), splicing detection in videos (Hsu et al., 2008; Mullan et al., 2017; Mandelli et al., 2018), and near-duplicate video detection (Bayram et al., 2008; Lameri et al., 2017). (D’Amiano et al., 2015, 2019) also present solutions that use 3D PatchMatch (Barnes et al., 2009) for video forgery detection and localization, whereas (D’Avino et al., 2017) suggest using data-driven machine learning based approaches. Solutions tailored to detecting the latest video manipulation techniques have also been recently presented. These include the works of (Li et al., 2018; Güera & Delp, 2018) on detecting Deepfakes and (Rössler et al., 2018; Matern et al., 2019) on Face2Face (Thies et al., 2016) manipulation detection.
As covered by (Milani et al., 2012), image-based forensics techniques that leverage camera noise residuals (Khanna et al., 2008), image compression artifacts (Bianchi & Piva, 2012), or geometric and physics inconsistencies in the scene (Bulan et al., 2009) can also be applied to videos on a frame-by-frame basis. In (Fan et al., 2011) and (Huh et al., 2018), Exif image metadata is used to detect image brightness and contrast adjustments, and image splicing manipulations, respectively. Finally, (Iuliani et al., 2019) use video file container metadata for video integrity verification and source device identification. To the best of our knowledge, video manipulation detection techniques that exploit the multimedia stream descriptors have not been previously proposed.
3 Proposed Method
Current video manipulation detection approaches rely on uncovering manipulations by studying pixel domain anomalies. Instead, we propose to use the multimedia stream descriptors of videos as our main source of information to spot manipulated content. To do so, our method works in two stages, as presented in Figure 2. First, during the training phase, we extract the multimedia stream descriptors from a labeled database of manipulated and pristine videos. In practice, such a database can be easily constructed using a limited amount of manually labeled data coupled with a semi-supervised learning approach, as done by (Zannettou et al., 2018).
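For intuition, these descriptors carry fields such as the following; the field names below are among those reported by ffprobe for a typical H.264 stream, while the values are hypothetical:

```python
# Hypothetical stream descriptors for one H.264 video track. The field
# names follow ffprobe's output; the values are purely illustrative.
video_stream_descriptors = {
    "codec_name": "h264",
    "codec_type": "video",
    "width": 1920,
    "height": 1080,
    "pix_fmt": "yuv420p",
    "r_frame_rate": "30000/1001",
    "bit_rate": "4998653",
    "nb_frames": "1800",
}
```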
Then, we encode these descriptors as a feature vector for each given video. We apply median normalization to all numerical features. As for categorical features, each is encoded as its own unique numerical value. Once we have processed all the videos in the database, we use all the feature vectors to train different binary classifiers as our detectors. More specifically, we use a random forest, a support vector machine (SVM), and an ensemble of both detectors. The best hyperparameters for each detector are selected by performing a random search cross-validation over a stratified shuffle splitting of the data, with a fixed number of trials per split. Figure 2(a) summarizes this first stage. In our implementation, we use ffprobe (Bellard et al., 2019) for the multimedia stream descriptor extraction. For the encoding of the descriptors as feature vectors, we use pandas (McKinney, 2010) and scikit-learn (Pedregosa et al., 2011). As for the training and testing of the SVM, the random forest, and the ensemble, we use the implementations available in the scikit-learn library.
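The snippet below is a minimal sketch of this training stage, assuming a list `video_paths` with matching binary `labels` (1 for manipulated). The helper names, hyperparameter grids, and split counts are illustrative assumptions, not the exact configuration used in our experiments.

```python
import json
import subprocess

import pandas as pd
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

def extract_descriptors(video_path):
    """Dump the multimedia stream descriptors of a video with ffprobe."""
    out = subprocess.check_output(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", video_path])
    info = json.loads(out)
    # Flatten the container and per-stream fields into one record.
    record = {k: str(v) for k, v in info.get("format", {}).items()}
    for i, stream in enumerate(info.get("streams", [])):
        record.update({f"stream{i}_{k}": str(v) for k, v in stream.items()})
    return record

def encode_features(df):
    """Median-normalize numerical columns; give each categorical value
    its own integer code; impute missing fields with a fixed value."""
    df = df.copy()
    for col in df.columns:
        numeric = pd.to_numeric(df[col], errors="coerce")
        if numeric.notna().any():
            median = numeric.median()
            df[col] = numeric / median if median != 0 else numeric
        else:
            df[col] = df[col].astype("category").cat.codes
    return df.fillna(-1.0)

# video_paths / labels come from the labeled training database.
table = pd.DataFrame([extract_descriptors(p) for p in video_paths])
X, y = encode_features(table).to_numpy(), labels

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
rf = RandomizedSearchCV(
    RandomForestClassifier(),
    {"n_estimators": [100, 300], "max_depth": [None, 10, 30]},
    n_iter=6, cv=cv).fit(X, y).best_estimator_
svm = RandomizedSearchCV(
    SVC(probability=True),
    {"C": [0.1, 1.0, 10.0], "gamma": ["scale", "auto"]},
    n_iter=6, cv=cv).fit(X, y).best_estimator_

# Soft-voting ensemble, weighted 4 to 1 in favor of the random forest.
ensemble = VotingClassifier([("rf", rf), ("svm", svm)],
                            voting="soft", weights=[4, 1]).fit(X, y)
```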
Figure 2(b) shows how our method would work in practice. Given a suspect video, we extract its stream descriptors and generate its corresponding feature vector, which is normalized based on the values learned during the training phase. Since some of the descriptor fields are optional, we perform additional post-processing to ensure that the feature vector can be processed by our trained detector. Concretely, if any field is missing from the video stream descriptors, we perform data imputation by mapping missing fields to a fixed numerical value. If previously unseen descriptor fields are present in the suspect video stream, they are ignored and not included in the corresponding suspect feature vector. Finally, the trained detector analyzes the suspect feature vector and computes a manipulation probability.
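A matching sketch of the testing stage, reusing the helpers above, follows. Note that a production version would reuse the medians and category mappings learned at training time; here `encode_features` is simply reapplied to keep the sketch short, and `train_columns` is assumed to be the column list saved from training.

```python
def manipulation_probability(video_path, ensemble, train_columns):
    """Score a suspect video with the trained ensemble."""
    record = extract_descriptors(video_path)
    # Reindexing drops previously unseen descriptor fields and turns
    # missing ones into NaN, which encode_features then imputes with a
    # fixed numerical value (-1.0 in this sketch).
    df = pd.DataFrame([record]).reindex(columns=train_columns)
    x = encode_features(df).to_numpy()
    # Column 1 of predict_proba is the manipulation probability.
    return ensemble.predict_proba(x)[0, 1]

# Example usage:
# p = manipulation_probability("suspect.mp4", ensemble, table.columns)
```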
It is important to note that although our approach may be vulnerable to video re-encoding attacks, this is traded off for scalability, a limited need for labeled data, and a high video manipulation detection score, as we present in Section 4. Also, our solution is orthogonal to pixel-based methods and requires limited amounts of data, which means that, ideally, both approaches could be used simultaneously. Our approach could be used to quickly identify manipulated videos, minimizing the need to rely on human annotation. Later, these newly labeled videos could be used to improve the performance of pixel-based video manipulation detectors. Finally, following the recommendations of (Brundage et al., 2018), we want to reflect on a potential misuse of the proposed approach. We believe that our approach could be misused by someone with access to large amounts of labeled video data. Using that information, a malevolent adversary could identify specific individuals, such as journalists or confidential informants, who may submit anonymous videos using the same devices they use to upload videos to social media websites. To avoid this, different physical devices or proper video data sanitization should be used.
4 Experimental Results
4.1 Datasets
In order to evaluate the performance of our proposed approach, we use the Media Forensics Challenge (MFC) datasets (Guan et al., 2019). Collected by the National Institute of Standards and Technology (NIST), this data comprises over high provenance videos and manipulated videos. In our experiments, we use the videos from the following datasets for training, hyper-parameter selection, and validation: the Nimble Challenge development dataset, the MFC18 development version 1 and version 2 datasets, and the MFC18 GAN dataset. This represents a total of videos, of which are manipulated. For testing our model, we use the MFC18 evaluation dataset and the MFC19 validation dataset, which have a total of videos. Of those videos, have been manipulated.
4.2 Experimental Setup
To show the merits of our method in terms of scalability and limited compute requirements, we design the following experiment. First, we select machine learning binary classifiers that are well known for their modeling capabilities, even with limited access to training samples. As previously mentioned, we use a random forest, a support vector machine, and a soft voting classification ensemble of both. This final ensemble is weighted 4 to 1 in favor of the decision of the random forest. Then, to show the performance of each detector under different data availability scenarios, we train them using increasing fractions of the available training data. We use a stratified shuffle splitting policy to select these training subsets, meaning that the global ratio of manipulated to non-manipulated videos of the entire training set is preserved in the subsets; a sketch of this subset selection appears below. In all scenarios, a sequestered 25% subset of the training data is used for hyperparameter selection and validation. Finally, the best validated model is selected for testing. Due to the imbalance of manipulated to non-manipulated videos, we use the Precision-Recall (PR) curve as our evaluation metric, as recommended by (Saito & Rehmsmeier, 2015). We also report the F1 score, the area under the curve (AUC) score, and the average precision (AP) score for each classifier.
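As an illustration, class-ratio-preserving training subsets can be drawn with scikit-learn as follows; the fractions shown are placeholders, not the exact percentages used in our experiments, and `X`, `y` are the encoded training features and labels as numpy arrays.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Training-data fractions to evaluate (placeholder values).
for fraction in (0.25, 0.50, 0.75, 1.0):
    if fraction == 1.0:
        subset_idx = np.arange(len(y))  # use all available training data
    else:
        sss = StratifiedShuffleSplit(n_splits=1, train_size=fraction,
                                     random_state=0)
        # The stratified split preserves the manipulated/pristine ratio.
        subset_idx, _ = next(sss.split(X, y))
    X_sub, y_sub = X[subset_idx], y[subset_idx]
    # ... train and validate the detectors on (X_sub, y_sub) ...
```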
4.3 Results and Discussion
As we can see in Figure 3, Figure 4, Figure 5, and Figure 6, under all scenarios the voting ensemble of the random forest and the support vector machine generally achieves the best overall results, followed by the random forest and the SVM. More specifically, our best ensemble model achieves an F1 score of , an AUC score of , and an AP score of . To contextualize these results, we have included the performance of a baseline binary classifier which predicts a video manipulation with probability . This corresponds to the true fraction of manipulated videos in the test set. Note that it is higher than the fraction of manipulated videos in the training subsets, which is . This baseline model would achieve an F1, AUC, and AP score of . We can see that our best model is three times better than the baseline in all reported metrics. Notice that, as seen in Figure 3, the ensemble trained with videos achieved equal or better results than the ensembles trained with more videos. This shows that, even with a very limited number of stream descriptors, a properly tuned machine learning model can easily be trained to spot video manipulations.
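For reference, the following sketch shows how these metrics and the constant-probability baseline can be computed with scikit-learn; `X_test` and `y_test` stand in for the held-out test features and labels as numpy arrays.

```python
import numpy as np
from sklearn.metrics import (auc, average_precision_score, f1_score,
                             precision_recall_curve)

y_prob = ensemble.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_prob)
print("F1 :", f1_score(y_test, y_prob >= 0.5))
print("AUC:", auc(recall, precision))  # area under the PR curve
print("AP :", average_precision_score(y_test, y_prob))

# Baseline: predict every video as manipulated with a constant
# probability equal to the manipulated fraction of the test set.
baseline = np.full_like(y_prob, y_test.mean())
print("Baseline AP:", average_precision_score(y_test, baseline))
```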
5 Conclusion
Up until now, most video manipulation detection techniques have focused on analyzing the pixel data to spot forged content. In this paper, we have shown how simple machine learning classifiers can be highly effective at detecting video manipulations when the appropriate data is used. More specifically, we use an ensemble of a random forest and an SVM trained on multimedia stream descriptors from both forged and pristine videos. With this approach, we have achieved an extremely high video manipulation detection score while requiring very limited amounts of data. Based on our findings, our future work will focus on techniques that automatically perform data sanitization. This will allow us to remove metadata and auxiliary header information that may give away sensitive information such as the source of the video.
Acknowledgements
This material is based on research sponsored by DARPA and Air Force Research Laboratory (AFRL) under agreement number FA8750-16-2-0173. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of DARPA and Air Force Research Laboratory (AFRL) or the U.S. Government.
References
- Bansal et al. (2018) Bansal, A., Ma, S., Ramanan, D., and Sheikh, Y. Recycle-GAN: Unsupervised video retargeting. Proceedings of the European Conference on Computer Vision, pp. 119–135, September 2018. URL https://doi.org/10.1007/978-3-030-01228-1_8. Munich, Germany.
- Barnes et al. (2009) Barnes, C., Shechtman, E., Finkelstein, A., and Goldman, D. B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Transactions on Graphics, 28(3):24:1–24:11, July 2009. URL https://doi.org/10.1145/1531326.1531330.
- Bayram et al. (2008) Bayram, S., Sencar, H. T., and Memon, N. Video copy detection based on source device characteristics: A complementary approach to content-based methods. Proceedings of the ACM International Conference on Multimedia Information Retrieval, pp. 435–442, October 2008. URL https://doi.org/10.1145/1460096.1460167. Vancouver, British Columbia, Canada.
- Bellard et al. (2019) Bellard, F. et al. ffprobe documentation. April 2019. URL https://www.ffmpeg.org/ffprobe.html. (Accessed on 04/17/2019).
- Bestagini et al. (2013) Bestagini, P., Milani, S., Tagliasacchi, M., and Tubaro, S. Local tampering detection in video sequences. Proceedings of the IEEE International Workshop on Multimedia Signal Processing, pp. 488–493, September 2013. URL https://doi.org/10.1109/MMSP.2013.6659337. Pula, Italy.
- Bestagini et al. (2016) Bestagini, P., Milani, S., Tagliasacchi, M., and Tubaro, S. Codec and GOP identification in double compressed videos. IEEE Transactions on Image Processing, 25(5):2298–2310, May 2016. URL https://doi.org/10.1109/TIP.2016.2541960.
- Bian et al. (2014) Bian, S., Luo, W., and Huang, J. Exposing fake bit rate videos and estimating original bit rates. IEEE Transactions on Circuits and Systems for Video Technology, 24(12):2144–2154, December 2014. URL https://doi.org/10.1109/TCSVT.2014.2334031.
- Bianchi & Piva (2012) Bianchi, T. and Piva, A. Image forgery localization via block-grained analysis of JPEG artifacts. IEEE Transactions on Information Forensics and Security, 7(3):1003–1017, June 2012. URL https://doi.org/10.1109/TIFS.2012.2187516.
- Bird (2015) Bird, M. The video in which Greece’s finance minister gives Germany the finger has several bizarre new twists. March 2015. URL https://www.businessinsider.com/yanis-varoufakis-middle-finger-controversy-real-fake-bohmermann-jauch-2015-3. (Accessed on 04/17/2019).
- Brundage et al. (2018) Brundage, M., Avin, S., Clark, J., Toner, H., Eckersley, P., Garfinkel, B., Dafoe, A., Scharre, P., Zeitzoff, T., Filar, B., Anderson, H., Roff, H., Allen, G. C., Steinhardt, J., Flynn, C., hÉigeartaigh, S. Ó., Beard, S., Belfield, H., Farquhar, S., Lyle, C., Crootof, R., Evans, O., Page, M., Bryson, J., Yampolskiy, R., and Amodei, D. The malicious use of artificial intelligence: Forecasting, prevention, and mitigation. arXiv:1802.07228v1, February 2018. URL http://arxiv.org/abs/1802.07228v1.
- Bulan et al. (2009) Bulan, O., Mao, J., and Sharma, G. Geometric distortion signatures for printer identification. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1401–1404, April 2009. URL https://doi.org/10.1109/ICASSP.2009.4959855.
- Chesney & Citron (2018) Chesney, R. and Citron, D. K. Disinformation on steroids: The threat of deep fakes. October 2018. URL https://www.cfr.org/report/deep-fake-disinformation-steroids. (Accessed on 04/17/2019).
- Cole (2018) Cole, S. Fake porn makers are worried about accidentally making child porn. February 2018. URL https://motherboard.vice.com/en_us/article/evmkxa/ai-fake-porn-deepfakes-child-pornography-emma-watson-elle-fanning. (Accessed on 04/17/2019).
- Curtis (2018) Curtis, C. Deepfakes are being weaponized to silence women — but this woman is fighting back. October 2018. URL https://thenextweb.com/code-word/2018/10/05/deepfakes-are-being-weaponized-to-silence-women-but-this-woman-is-fighting-back/. (Accessed on 04/17/2019).
- D’Amiano et al. (2015) D’Amiano, L., Cozzolino, D., Poggi, G., and Verdoliva, L. Video forgery detection and localization based on 3d PatchMatch. Proceedings of the IEEE International Conference on Multimedia Expo Workshops, pp. 1–6, June 2015. URL https://doi.org/10.1109/ICMEW.2015.7169805. Turin, Italy.
- D’Amiano et al. (2019) D’Amiano, L., Cozzolino, D., Poggi, G., and Verdoliva, L. A PatchMatch-based dense-field algorithm for video copy–move detection and localization. IEEE Transactions on Circuits and Systems for Video Technology, 29(3):669–682, March 2019. URL https://doi.org/10.1109/TCSVT.2018.2804768.
- D’Avino et al. (2017) D’Avino, D., Cozzolino, D., Poggi, G., and Verdoliva, L. Autoencoder with recurrent neural networks for video forgery detection. Proceedings of the IS&T Electronic Imaging, 2017(7):92–99, January 2017. URL https://doi.org/10.2352/ISSN.2470-1173.2017.7.MWSF-330. Burlingame, CA.
- Fan et al. (2011) Fan, J., Kot, A. C., Cao, H., and Sattar, F. Modeling the exif-image correlation for image manipulation detection. Proceedings of the IEEE International Conference on Image Processing, pp. 1945–1948, September 2011. URL https://doi.org/10.1109/ICIP.2011.6115853. Brussels, Belgium.
- Guan et al. (2019) Guan, H., Kozak, M., Robertson, E., Lee, Y., Yates, A. N., Delgado, A., Zhou, D., Kheyrkhah, T., Smith, J., and Fiscus, J. MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. Proceedings of the IEEE Winter Applications of Computer Vision Workshops, pp. 63–72, January 2019. URL https://doi.org/10.1109/WACVW.2019.00018. Waikoloa Village, HI.
- Güera & Delp (2018) Güera, D. and Delp, E. J. Deepfake video detection using recurrent neural networks. Proceedings of the IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 1–6, November 2018. URL https://doi.org/10.1109/AVSS.2018.8639163. Auckland, New Zealand.
- Hsu et al. (2008) Hsu, C.-C., Hung, T.-Y., Lin, C.-W., and Hsu, C.-T. Video forgery detection using correlation of noise residue. Proceedings of IEEE Workshop on Multimedia Signal Processing, pp. 170–174, October 2008. URL https://doi.org/10.1109/MMSP.2008.4665069. Cairns, Qld, Australia.
- Huh et al. (2018) Huh, M., Liu, A., Owens, A., and Efros, A. A. Fighting fake news: Image splice detection via learned self-consistency. Proceedings of the European Conference on Computer Vision, pp. 106–124, September 2018. URL https://doi.org/10.1007/978-3-030-01252-6_7. Munich, Germany.
- Iuliani et al. (2019) Iuliani, M., Shullani, D., Fontani, M., Meucci, S., and Piva, A. A video forensic framework for the unsupervised analysis of MP4-like file container. IEEE Transactions on Information Forensics and Security, 14(3):635–645, March 2019. URL https://doi.org/10.1109/TIFS.2018.2859760.
- Jack (2007) Jack, K. Chapter 13 - MPEG-2. In Jack, K. (ed.), Video Demystified: A Handbook for the Digital Engineer, pp. 577–737. Newnes, Burlington, MA, 2007. URL https://doi.org/10.1016/B978-075068395-1/50013-4.
- Khanna et al. (2008) Khanna, N., Chiu, G. T.-C., Allebach, J. P., and Delp, E. J. Forensic techniques for classifying scanner, computer generated and digital camera images. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1653–1656, March 2008. URL https://doi.org/10.1109/ICASSP.2008.4517944. Las Vegas, NV.
- Korshunov & Marcel (2018) Korshunov, P. and Marcel, S. Deepfakes: a new threat to face recognition? assessment and detection. arXiv:1812.08685v1, March 2018. URL https://arxiv.org/abs/1812.08685v1.
- Korshunova et al. (2017) Korshunova, I., Shi, W., Dambre, J., and Theis, L. Fast face-swap using convolutional neural networks. Proceedings of the IEEE International Conference on Computer Vision, pp. 3697–3705, October 2017. URL https://doi.org/10.1109/ICCV.2017.397. Venice, Italy.
- Lameri et al. (2017) Lameri, S., Bondi, L., Bestagini, P., and Tubaro, S. Near-duplicate video detection exploiting noise residual traces. Proceedings of the IEEE International Conference on Image Processing, pp. 1497–1501, September 2017. URL https://doi.org/10.1109/ICIP.2017.8296531. Beijing, China.
- Li et al. (2018) Li, Y., Chang, M., and Lyu, S. In ictu oculi: Exposing AI created fake videos by detecting eye blinking. Proceedings of the IEEE International Workshop on Information Forensics and Security, pp. 1–7, December 2018. URL https://doi.org/10.1109/WIFS.2018.8630787. Hong Kong, China.
- Mandelli et al. (2018) Mandelli, S., Bestagini, P., Tubaro, S., Cozzolino, D., and Verdoliva, L. Blind detection and localization of video temporal splicing exploiting sensor-based footprints. Proceedings of the European Signal Processing Conference, pp. 1362–1366, September 2018. URL https://doi.org/10.23919/EUSIPCO.2018.8553511. Rome, Italy.
- Matern et al. (2019) Matern, F., Riess, C., and Stamminger, M. Exploiting visual artifacts to expose deepfakes and face manipulations. Proceedings of the IEEE Winter Applications of Computer Vision Workshops, pp. 83–92, January 2019. URL https://doi.org/10.1109/WACVW.2019.00020. Waikoloa Village, HI.
- McKinney (2010) McKinney, W. Data structures for statistical computing in python. Proceedings of the Python in Science Conference, pp. 51–56, June 2010. URL http://conference.scipy.org/proceedings/scipy2010/mckinney.html. Austin, TX.
- Milani et al. (2012) Milani, S., Fontani, M., Bestagini, P., Barni, M., Piva, A., Tagliasacchi, M., and Tubaro, S. An overview on video forensics. APSIPA Transactions on Signal and Information Processing, 1:e2, August 2012. URL https://doi.org/10.1017/ATSIP.2012.2.
- Mullan et al. (2017) Mullan, P., Cozzolino, D., Verdoliva, L., and Riess, C. Residual-based forensic comparison of video sequences. Proceedings of the IEEE International Conference on Image Processing, pp. 1507–1511, September 2017. URL https://doi.org/10.1109/ICIP.2017.8296533. Beijing, China.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, November 2011. URL http://dl.acm.org/citation.cfm?id=1953048.2078195.
- Rössler et al. (2018) Rössler, A., Cozzolino, D., Verdoliva, L., Riess, C., Thies, J., and Nießner, M. FaceForensics: A large-scale video dataset for forgery detection in human faces. arXiv:1803.09179, March 2018. URL https://arxiv.org/abs/1803.09179.
- Saito & Rehmsmeier (2015) Saito, T. and Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3):e0118432, March 2015. URL https://doi.org/10.1371/journal.pone.0118432.
- Stamm et al. (2012a) Stamm, M. C., Lin, W. S., and Liu, K. J. R. Forensics vs. anti-forensics: A decision and game theoretic framework. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1749–1752, March 2012a. URL https://doi.org/10.1109/ICASSP.2012.6288237. Kyoto, Japan.
- Stamm et al. (2012b) Stamm, M. C., Lin, W. S., and Liu, K. J. R. Temporal forensics and anti-forensics for motion compensated video. IEEE Transactions on Information Forensics and Security, 7(4):1315–1329, August 2012b. URL https://doi.org/10.1109/TIFS.2012.2205568.
- Thies et al. (2016) Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., and Nießner, M. Face2Face: Real-time face capture and reenactment of RGB videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2395, June 2016. URL https://doi.org/10.1109/CVPR.2016.262. Las Vegas, NV.
- Vincent (2018) Vincent, J. US lawmakers say AI deepfakes ‘have the potential to disrupt every facet of our society’. September 2018. URL https://www.theverge.com/2018/9/14/17859188/ai-deepfakes-national-security-threat-lawmakers-letter-intelligence-community. (Accessed on 04/17/2019).
- Zannettou et al. (2018) Zannettou, S., Chatzis, S., Papadamou, K., and Sirivianos, M. The good, the bad and the bait: Detecting and characterizing clickbait on YouTube. Proceedings of the IEEE Security and Privacy Workshops, pp. 63–69, May 2018. URL https://doi.org/10.1109/SPW.2018.00018. San Francisco, CA.