A Coarse-to-fine Deep Convolutional Neural Network Framework for Frame Duplication Detection and Localization in Video Forgery

11/27/2018 ∙ by Chengjiang Long, et al. ∙ kitware 0

Frame duplication copies a sequence of consecutive frames and inserts it into, or uses it to replace part of, the same source video in order to conceal or imitate a specific event or content. To automatically detect the duplicated frames in a manipulated video, we propose a coarse-to-fine deep convolutional neural network framework to detect and localize frame duplications. We first run an I3D network to obtain the most likely candidate duplicated frame sequences and selected frame sequences, and then run a Siamese network built on ResNet to identify each pair of a duplicated frame and the corresponding selected frame. We also propose a heuristic strategy to formulate the video-level score. We then apply our inconsistency detector, fine-tuned from the I3D network, to distinguish duplicated frames from selected frames. Experimental evaluation on two video datasets demonstrates that our proposed method outperforms current state-of-the-art methods.




1 Introduction

Due to the widespread availability of increasingly sophisticated and low-cost digital multimedia devices, a vast number of digital videos are available everywhere in our daily life. With the development of video editing techniques, many image and video editing tools are accessible, so it is easy to duplicate and manipulate the content of digital videos. Video forensics, which looks for features that can distinguish forged videos from original videos, has become increasingly important for verifying the authenticity of videos in the field of information security.

Frame duplication was first studied by Wang and Farid [25] in 2006. It refers to duplicating some frames selected from a video in order to extend or replace a specific object or event in the same source video. In this way, a specific object or event is concealed or imitated in the same video. For example, in a video of a car accident, the portion showing the accident can be concealed by pasting a duplicated sequence from the same video, which may also destroy the temporal consistency and mislead subsequent investigations. Therefore, detecting frame duplication manipulation in a video is becoming increasingly important.

As illustrated in Figure 1, a frame sequence appears twice in the manipulated video after the red frame sequence is copied from the original video and the copy is inserted between the green and blue frame sequences. Given a testing video, the task is to detect whether any frames are duplicated, without any information about the original video. Telling an authentic video from a tampered one has become difficult because of the many forgery methods available to the public, and as a result video forensics has become a great challenge.

Figure 1: The illustration of frame duplication manipulation in a video. Assuming that there are three consecutive frame sequences (marked in red, green and blue, respectively) in an original video, the manipulated video is obtained by copying the red frame sequence and pasting it behind the green sequence. Our goal is to identify whether there is frame duplication manipulation and to localize the second red sequence as the duplicated frame sequence.

In recent years, multiple blind digital video forgery detection approaches have been employed to solve this challenging problem. Wang and Farid [25] proposed a frame duplication detection algorithm based on a correlation coefficient matrix. However, such an algorithm requires a heavy computational load due to the large number of correlation calculations. Lin et al. [11] proposed to use histogram difference (HD) instead of correlation coefficients as the detection features. The drawback is that the HD features are not robust against common video operations or attacks. Hu et al. [7] propose an algorithm to detect duplicated frames based on video sub-sequence fingerprints extracted from the DCT coefficients. Yang et al. [29] propose a similarity-analysis-based method for frame duplication detection that is implemented in two stages, in which the features are obtained via SVD. Although deep learning solutions, especially convolutional neural networks, have demonstrated promising performance in solving many challenging vision problems such as large-scale image recognition [6, 18], object detection [16, 3, 19] and visual captioning [21, 1, 30], no deep learning solutions have been developed for this specific task so far, which motivates us to fill this gap.

In this paper, we propose a novel coarse-to-fine deep learning framework for frame duplication detection and localization in video forgery. As illustrated in the pipeline in Figure 2, we use the I3D network [2] both to narrow the search space and improve efficiency, and to measure inconsistency in order to distinguish duplicated frames from the selected frames once the frame duplication has been confirmed by a Siamese network composed of ResNet networks [6]. With the I3D network, our proposed framework is able to explore both the spatial and temporal relationships underlying a video, and with the ResNet network, we are able to extract powerful frame-level features to ensure high accuracy. We also propose a heuristic strategy to formulate the video-level detection score, based on the following intuition: the more frames are likely duplicated, the smaller the minimum distance between duplicated frames and selected frames, and the larger the gap between the selected frames and the duplicated frames, the higher the likelihood that frame duplication exists in the video.

Unlike most methods, we assess the consistency between two consecutive frames within the context of a 16-frame video clip rather than from the two frames alone. We include the previous 8 frames and the next 8 frames, so that the two consecutive frames are the 8th and 9th frames of the clip. This treatment makes good use of the temporal context. Inspired by Long et al.'s approach [12] for frame drop detection, which is based on the assumption of inconsistency between consecutive frames, we fine-tune an I3D network to cover three categories, i.e., “none”, “frame drop”, and “shot break”. We can therefore combine the learned I3D network with the feature distance between any two consecutive frames to formulate an inconsistency detector that distinguishes duplicated frames from the selected frames in either one-shot or multi-shot videos.

In summary, the contributions of this paper are fourfold.

  • We propose a novel coarse-to-fine deep learning framework for frame duplication detection and localization in video forgery. In this framework, both the I3D network and the ResNet network are incorporated.

  • We develop a heuristic formulation for video-level detection score, which leads to significant improvement in detection performance.

  • We design an inconsistency detector based on the fine-tuned I3D network which covers three categories (i.e., “none”, “frame drop”, and “shot break”) to distinguish duplicated frames from the selected frames.

  • We conduct experiments on two video forgery datasets, and the results demonstrate that the proposed method outperforms the baselines.

2 Related work

The related work can be divided into two categories, i.e., inter-frame forgery and copy-move forgery.

Inter-frame forgery refers to consecutive frame deletion and consecutive frame duplication. Such forgeries can be detected from features that are copied, either spatially or temporally. Keypoints are salient local patches detected across different scales. Keypoint-based methods can be further subdivided into three classes: direction-based [5, 10], keyframe-based matching [9] and visual-words-based [17]. In particular, keyframe-based features have been shown to perform strongly for near-duplicate video and image detection [9].

In addition to keypoint-based features, Wu et al. [27] propose a velocity-field-consistency-based approach to detect inter-frame forgery. This method is able to distinguish the forgery types, identify the tampered video and locate the manipulated positions in forged videos as well. Wang et al. [23] propose to use the consistency of the correlation coefficients of gray values to classify original videos and inter-frame forgeries. They also propose an optical flow method based on the assumption that the optical flows are consistent in an original video, while in forgeries the consistency is destroyed. The optical flow is extracted as the distinguishing feature, and a Support Vector Machine (SVM) classifier is used to recognize frame insertion and frame deletion forgeries.

Recently, Huang et al. [8] proposed a fusion of audio forensics detection methods for video inter-frame forgery. Zhao et al. [31] developed a similarity-analysis-based method to detect inter-frame forgery in a video shot. In this method, the HSV color histogram is calculated to detect and locate tampered frames in the shot, and then SURF feature extraction and FLANN (Fast Library for Approximate Nearest Neighbors) matching are used for further confirmation.

Copy-move forgery is created by copying and pasting content within the same image, and potentially post-processing it [4]. Several methods, for example those based on PCA, DWT, or SVD, have high computation time and are not suitable for real-time applications. For example, Wang et al. [22] propose a dimensionality-reduction-based approach that applies PCA (Principal Component Analysis) to the image blocks. The method targets grayscale images; for color images, each color channel has to be processed separately, with PCA used to detect the forgeries. Mohamadian et al. [13] develop a Singular Value Decomposition (SVD) based method in which the image is divided into many small overlapping blocks and SVD is then applied to detect the copied regions. Its shortcoming is that the method does not handle color images.

Figure 2: The pipeline for frame duplication detection and localization. Given a testing video, we first run the I3D network [2] to extract deep spatial-temporal features and build the coarse sequence-to-sequence distance matrix to determine the frame sequences that are likely to contain frame duplication. We then apply a ResNet-based Siamese network to confirm whether frame duplication manipulation exists. For temporal localization, we apply an I3D-based inconsistency detector to distinguish the duplicated frames from the selected frames. In this way, we identify whether there is frame duplication manipulation and localize the frame indexes of the possible duplicated frame sequence.

Recently, Yang et al. [28] proposed a copy-move forgery detection method based on a modified SIFT-based detector. Wang et al. [26] presented a block-based robust copy-move forgery detection approach using invariant quaternion exponent moments (QEMs), in which falsely matched block pairs are removed by customizing random sample consensus with differences in QEM magnitudes. Compared to conventional copy-move forgery detection techniques, it is robust to noise addition, lossy compression, scaling, and rotation.

3 The proposed deep learning approach

As shown in Figure 2, given a testing video, our proposed framework is able to detect and localize the frame duplication manipulation. The first I3D network is used to produce the sequence-to-sequence distance matrix in the coarse-search stage. The Siamese network is then applied to conduct a fine search and verify whether frame duplications exist. Following the Siamese network, an inconsistency detector is used to distinguish duplicated frames from selected frames.

For the details, we are going to describe each step in the following subsections.

3.1 Coarse search to determine the candidates

To improve efficiency and narrow the search space, we split a video into overlapping frame sequences, each containing 64 frames, with 16 frames of overlap between neighboring sequences. Instead of the C3D network [20], we choose the I3D network [2] for three reasons: (1) it inflates 2D ConvNets into 3D (filters are typically square and are simply made cubic, so $N \times N$ filters become $N \times N \times N$); (2) it bootstraps the 3D filters from 2D filters, so that parameters can be initialized from pre-trained ImageNet models; and (3) it paces receptive field growth in space, time and network depth.
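The splitting step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_into_sequences` is hypothetical, and it assumes that "16 overlapped frames" means a stride of 64 - 16 = 48 frames and that trailing frames which do not fill a whole sequence are ignored.

```python
def split_into_sequences(num_frames, seq_len=64, overlap=16):
    """Return (start, end) frame index pairs, end exclusive, for overlapping sequences."""
    if not 0 <= overlap < seq_len:
        raise ValueError("overlap must be in [0, seq_len)")
    stride = seq_len - overlap  # assumed: 48 frames between consecutive starts
    windows = []
    start = 0
    # Assumption: a trailing partial sequence is dropped rather than padded.
    while start + seq_len <= num_frames:
        windows.append((start, start + seq_len))
        start += stride
    return windows


print(split_into_sequences(200))
```

Each 64-frame sequence can then be cut into four 16-frame windows for feature extraction.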

In this paper, we apply the pre-trained off-the-shelf I3D network to extract a feature vector for each of the four 16-frame windows in a sequence and then concatenate them to get the sequence feature. We observe that most of the time is spent on pre-processing. To improve the testing speed, we compute the full RGB data and flow data only for the first video clip. For each subsequent video clip, we copy the overlapping RGB data and flow data from the previous clip and compute only the last, new portion. In this way, we can significantly improve the testing speed.

Based on the sequence features, we calculate the sequence-to-sequence distance matrix using the L2 distance. If the distance between two frame sequences is smaller than a preset threshold, the two sequences are likely duplicated and are passed on for further confirmation.
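A minimal NumPy sketch of this coarse search is given below. The function names, the toy features, and the threshold value are illustrative assumptions; real sequence features would come from the I3D network.

```python
import numpy as np


def sequence_distance_matrix(features):
    """Pairwise L2 distances between the rows of `features` (shape: n x d)."""
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def candidate_pairs(dist, threshold):
    """Index pairs (i, j) with i < j whose distance is below `threshold`."""
    i_idx, j_idx = np.triu_indices(dist.shape[0], k=1)
    keep = dist[i_idx, j_idx] < threshold
    return list(zip(i_idx[keep].tolist(), j_idx[keep].tolist()))


# Toy features: sequences 0 and 2 are nearly identical.
feats = np.array([[0.0, 0.0], [3.0, 4.0], [0.1, 0.0]])
dist = sequence_distance_matrix(feats)
print(candidate_pairs(dist, threshold=0.5))
```

Only the pairs returned here would need to be examined by the more expensive frame-level stage.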

3.2 Refine detection via Siamese network

For further refinement, we evaluate the similarity between any pair of frames, i.e., a duplicated frame and the corresponding selected frame. A Siamese network is a particular type of neural network architecture that learns to differentiate between two inputs. It consists of two identical neural networks sharing exactly the same parameters, each taking one of the two input images. A contrastive loss function is applied to the last layers to measure the similarity between the two images. In principle, any neural network can be chosen to extract the feature for each frame.

In this paper, we choose the ResNet network [6] with 152 layers. We connect the second-to-last layer to the contrastive loss function, and each loss value associated with the distance between a pair of frames is entered into the frame-to-frame distance matrix, in which the distance is normalized to the range [0, 1]. If the distance is smaller than a preset threshold, the two frames are more likely to be duplicates. For videos in which multiple consecutive frames are duplicated, a line parallel to the diagonal appears in the visualization of the distance matrix, as plotted in Figure 3.
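For readers unfamiliar with contrastive loss, the sketch below shows one common form of it, applied to a pair of feature vectors. The text does not state which variant or margin is used, so the formula, the `margin` value, and the function name are assumptions for illustration only.

```python
import numpy as np


def contrastive_loss(feat_a, feat_b, is_duplicate, margin=1.0):
    """One common contrastive loss for a single pair of feature vectors.

    Duplicate pairs are pulled together (loss grows with distance);
    non-duplicate pairs are pushed apart until they are `margin` away.
    """
    d = float(np.linalg.norm(np.asarray(feat_a) - np.asarray(feat_b)))
    if is_duplicate:
        return 0.5 * d ** 2
    return 0.5 * max(margin - d, 0.0) ** 2


a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])  # Euclidean distance 5 from `a`
print(contrastive_loss(a, b, is_duplicate=True))
print(contrastive_loss(a, b, is_duplicate=False, margin=7.0))
```

Because both branches share weights, the same feature extractor produces `feat_a` and `feat_b`, so the learned distance is symmetric in the two frames.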

Figure 3: The illustration of frame-to-frame distance matrix for frame duplication.

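It is worth mentioning that we provide both a frame-level and a video-level score to evaluate the likelihood of frame duplication. For the frame-level score, we use the value in the frame-to-frame distance matrix directly. For the video-level score, we propose a heuristic strategy to formulate the confidence value. For a video, we first find the minimal distance $d_{\min} = d(i_{\min}, j_{\min})$, where $(i_{\min}, j_{\min}) = \arg\min_{i < j} d(i, j)$ in the frame-to-frame distance matrix, and then search in two directions along the line parallel to the diagonal, i.e.,

$$k_1 = \max \{ k : |d(i_{\min} - k, j_{\min} - k) - d_{\min}| \le \epsilon \}, \quad (1)$$

$$k_2 = \max \{ k : |d(i_{\min} + k, j_{\min} + k) - d_{\min}| \le \epsilon \}, \quad (2)$$

where $\epsilon$ is a small tolerance. The possible length of the duplicated sequence can then be defined as:

$$l = k_1 + k_2 + 1. \quad (3)$$

Our intuition is that the more frames are likely duplicated, the smaller the value of $d_{\min}$, and the larger the distance between the selected frames and the duplicated frames, the higher the likelihood that frame duplication exists in the video. Based on this intuition, we formulate the video-level confidence score as follows:

$$F_{\text{video}} = -\frac{d_{\min}}{l \times (j_{\min} - i_{\min})}. \quad (4)$$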
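The heuristic can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' implementation: it assumes the score has the form $-d_{\min} / (l \cdot (j_{\min} - i_{\min}))$, which agrees with the intuition stated above; it reads the two-directional search as a contiguous run of near-minimal distances; and the tolerance `eps` and the function name are hypothetical.

```python
import numpy as np


def video_level_score(dist, eps=0.01):
    """Return (score, run_length, (i_min, j_min)) from a frame-to-frame distance matrix."""
    n = dist.shape[0]
    # Consider only pairs i < j (upper triangle without the diagonal).
    iu, ju = np.triu_indices(n, k=1)
    best = int(np.argmin(dist[iu, ju]))
    i_min, j_min = int(iu[best]), int(ju[best])
    d_min = float(dist[i_min, j_min])

    # Extend backwards along the line parallel to the diagonal.
    k1 = 0
    while i_min - k1 - 1 >= 0 and abs(dist[i_min - k1 - 1, j_min - k1 - 1] - d_min) <= eps:
        k1 += 1
    # Extend forwards along the same line.
    k2 = 0
    while j_min + k2 + 1 < n and abs(dist[i_min + k2 + 1, j_min + k2 + 1] - d_min) <= eps:
        k2 += 1

    length = k1 + k2 + 1
    score = -d_min / (length * (j_min - i_min))
    return score, length, (i_min, j_min)


# Toy matrix: frames 1..3 are repeated as frames 5..7.
toy = np.ones((8, 8))
for offset in range(3):
    toy[1 + offset, 5 + offset] = 0.02
score, length, pair = video_level_score(toy)
print(score, length, pair)
```

Under this form, a longer run and a larger temporal gap both move the (negative) score closer to zero, so videos with clearer duplication rank higher.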


3.3 Inconsistency detector for accurate localization

Figure 4: The confusion matrix for three categories in two consecutive frames.

For a detected frame duplication, we need to distinguish the duplicated frames from the selected frames, based on the assumption that the duplicated frames show inconsistency at both the beginning and the end of the sequence. We use both spatial and temporal information to train an inconsistency detector and obtain a score. The scores obtained from the C3D-based video network [12] for frame drop detection can be used to measure the inconsistency. However, this only works for videos assumed to consist of a single shot. To make the detector more general, we extend the binary detector to three categories, i.e., “none”, “frame drop”, and “shot break”. Note that the shot-break videos are obtained from the TRECVID 2007 dataset, and we only use hard cuts, since a soft cut changes gradually and retains strong consistency between any two consecutive frames. Instead of using only the RGB data stream as input, we replace the C3D network with the I3D network to also incorporate the optical flow data stream. The confusion matrix in Figure 4 illustrates the effectiveness of our I3D-based inconsistency detector.

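Based on the output scores for the three categories from the I3D network, i.e., $S_{\text{none}}(i)$, $S_{\text{drop}}(i)$, and $S_{\text{break}}(i)$ for frame $i$, we formulate the confidence score of inconsistency as the following function:

$$F(i) = S_{\text{drop}}(i) + S_{\text{break}}(i) - \lambda S_{\text{none}}(i),$$

where $\lambda$ is the weight parameter, which is set to a fixed value in this paper.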

Figure 5: The illustration of distinguishing duplicated frames from the selected frames. The index ranges for the red frame sequence and the blue sequence are [60, 168] and [190, 298], respectively. Comparing the inconsistency scores of the two sequences, the red sequence has the higher score, which indicates that the red sequence consists of the duplicated frames, as expected.

We assume the selected frames are more consistent than the duplicated frames at both the beginning and the end of the sequence. As illustrated in Figure 5, given a pair of frame sequences that are potentially duplicated, $B_i = [i, i + l]$ and $B_j = [j, j + l]$, we compare two scores, i.e.,

$$F_1 = \sum_{k=-w}^{w} F(i + k) + \sum_{k=-w}^{w} F(i + l + k),$$

$$F_2 = \sum_{k=-w}^{w} F(j + k) + \sum_{k=-w}^{w} F(j + l + k),$$

where $F(\cdot)$ denotes the per-frame confidence score of inconsistency and $w$ is the window size within which we check the inconsistency at both the beginning and the end of the sequence. In this paper, we choose $w$ to avoid the failure cases in which a few start or end frames are misdetected. If $F_1 > F_2$, then the duplicated frame segment is $B_i$. Otherwise, the duplicated frame segment is $B_j$. As shown in Figure 5, our modified I3D network is able to measure the consistency between consecutive frames.
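The comparison described above can be sketched as follows. This is an illustration only: the function names are hypothetical, the per-frame inconsistency scores are assumed to be given as a plain list, and the window size of 3 is an assumed value, not one taken from the text.

```python
def boundary_inconsistency(scores, start, end, window):
    """Sum per-frame inconsistency scores within `window` frames of both ends."""
    n = len(scores)
    total = 0.0
    for center in (start, end):
        lo = max(center - window, 0)
        hi = min(center + window, n - 1)
        total += sum(scores[lo:hi + 1])
    return total


def pick_duplicated(scores, seq_a, seq_b, window=3):
    """Return the sequence (start, end) with the larger boundary inconsistency."""
    score_a = boundary_inconsistency(scores, seq_a[0], seq_a[1], window)
    score_b = boundary_inconsistency(scores, seq_b[0], seq_b[1], window)
    return seq_a if score_a > score_b else seq_b


# Toy scores: strong inconsistency near frames 10 and 20 only.
scores = [0.0] * 40
scores[10] = 0.9
scores[20] = 0.8
print(pick_duplicated(scores, (10, 20), (25, 35)))
```

Summing over a window instead of reading a single frame makes the decision tolerant of a boundary that is detected a few frames early or late.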

4 Experiment

We evaluate our proposed method on one self-collected video dataset and the Media Forensics Challenge 2018 (MFC2018) dataset [14] (URL: https://www.nist.gov/itl/iad/mig/media-forensics-challenge-2018).

Our self-collected video dataset is obtained by applying frame duplication manipulation to 75 raw static-camera videos from the VIRAT dataset [15] and 85 dynamic iPhone videos from the World dataset (available via the imagery browser RankOne, URL: https://medifor.rankone.io/). We randomly select frame sequences with durations of 0.5s, 1s, 2s, 5s and 10s, and then insert them into the same source videos. We use the X264 video codec and a frame rate of 30 fps to generate the manipulated videos. Note that we strictly ensure that the selected frames and the duplicated frames do not overlap in the same video. Since we have the frame-level ground truth, we can use it for frame-level performance evaluation.

Figure 6: The illustration of frame-to-frame distance between duplicated frames and the selected frames.

The MFC2018 dataset comes from the second annual evaluation to support research and help advance the state of the art for image and video forensics technologies, i.e., technologies that determine the region and type of manipulations in imagery (image/video data) and the phylogenic process that modified the imagery. The MFC2018 evaluation builds on experience from the NC2017 evaluation. It consists of a Dev dataset and an Eval dataset, which we denote as the MFC2018-Dev dataset and the MFC2018-Eval dataset, respectively.

There are 231 videos in the MFC2018-Dev dataset and 1036 videos in the MFC2018-Eval dataset. The video codec used is H.264. The duration of each video ranges from 2s to 3 minutes. The frame rate for most of the videos is 29-30 fps, while a smaller number of videos are at 10 or 60 fps and only 5 videos have a frame rate larger than 240 fps. We exclude 2 videos that have fewer than 17 frames because the input for the I3D network should have at least 17 frames. We also exclude the 5 videos with large frame rates (>220 FPS), since the frame rate of our training videos is not that high. We use the remaining 1460 videos to conduct the video-level performance evaluation.

The detection task is to detect whether or not a video or a frame has been manipulated with frame duplication, while the localization task is to localize the indexes of the duplicated frames. For the measurement metrics, we use AUC (area under the curve) for the detection task, and the Matthews correlation coefficient

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

for localization evaluation, where TP, FP, TN, FN represent true positive, false positive, true negative and false negative, respectively.
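As a quick reference, the Matthews correlation coefficient can be computed from the four counts as below. The function name is ours, and returning 0 when the denominator vanishes is a common convention that we assume here, not something stated in the text.

```python
import math


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


print(matthews_corrcoef(tp=50, fp=0, tn=50, fn=0))   # perfect localization
print(matthews_corrcoef(tp=0, fp=50, tn=0, fn=50))   # completely inverted
```

The value lies in [-1, 1], so a localization that yields no true positives cannot score above 0.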

4.1 Frame-level performance on our self-collected dataset

To verify the effectiveness of the deep learning solution for frame duplication detection, we consider two baselines. One is Lin et al.'s method [11], which uses histogram difference (HD) instead of correlation coefficients as the detection features. The other is Yang et al.'s method [29], a similarity-analysis-based method for frame duplication detection that is implemented in two stages, with features obtained via SVD. Both methods, denoted as “Lin 2012” and “Yang 2016” respectively, use traditional feature extraction.

Method          | iPhone 4 videos | Surveillance videos
Lin 2012 [11]   | 99.5            | 83.0
Yang 2016 [29]  | 60.2            | 55.7
CNN (Ours)      | 99.9            | 97.8
Table 1: The AUC performance (unit: %) for frame-level frame duplication detection on the videos with the X264 codec.

We run our proposed CNN method and the above two baselines on our self-collected videos encoded with the X264 codec, and the results are summarized in Table 1. Because X264 compression alters the contents of the duplicated frames, we cannot expect 100% accuracy from any of the methods. Even so, our proposed CNN method outperforms the traditional methods.

To help readers better understand the comparison, we visualize the normalized distances between the selected frames and the duplicated frames in Figure 6, from which we can see that our proposed CNN method performs the best for both the iPhone 4 video and the surveillance video. These observations demonstrate the effectiveness of deep learning for frame duplication detection.

4.2 Video-level performance on the MFC2018 dataset

It is worth mentioning that the duplicated videos in the MFC2018 dataset usually contain multiple manipulations, so that the content of the selected frames and the duplicated frames differs to some degree. Therefore, the testing videos in both the MFC2018-Dev and the MFC2018-Eval datasets are very challenging.

Figure 7: The ROC curve and AUC performance for video-level frame duplication detection on the MFC2018-Dev dataset.
Figure 8: The ROC curve and AUC performance for video-level frame duplication detection on the MFC2018-Eval dataset.

We run our proposed CNN method and the two baselines, i.e., “Lin 2012” and “Yang 2016”, on these two datasets. To verify the effectiveness of our video-level confidence score defined in Equation 4, we take the negative of the minimum distance (i.e., $-d_{\min}$) as a direct alternative strategy for comparison. To distinguish these two strategies, we use the suffixes “+conf score” and “+mmin score” to indicate them. The detection results are summarized in Figure 7 and Figure 8.

We observe that (1) as expected, our proposed method always outperforms both “Lin 2012” and “Yang 2016”, whether using “+conf score” or “+mmin score”; (2) “+conf score” performs significantly better than “+mmin score”, especially for the “Lin 2012” and “Yang 2016” methods, for which the AUC improvement is higher than 20% on both the MFC2018-Dev dataset and the MFC2018-Eval dataset; (3) with “+conf score”, all three methods achieve a high correct detection rate at a low false alarm rate; (4) with “+conf score”, our proposed method obtains 99.97% AUC on the MFC2018-Dev dataset and 94.91% AUC on the MFC2018-Eval dataset. Our proposed video-level confidence score thus provides a good ranking order for distinguishing videos with frame duplication manipulation from those without it. These observations indicate the advantage of our proposed method.

Method          | MFC2018-Dev | MFC2018-Eval
Lin 2012 [11]   | 0.2277      | 0.1681
Yang 2016 [29]  | 0.1449      | 0.1548
CNN w/ ResNet   | 0.4618      | 0.3234
CNN w/ C3D      | 0.6028      | 0.3488
CNN w/ I3D      | 0.6612      | 0.3606
Table 2: The MCC performance for video temporal localization on the MFC2018 dataset.
Dataset       | #Correct | #Incorrect | #Ambiguously incorrect
MFC2018-Dev   | 14       | 6          | 1
MFC2018-Eval  | 33       | 38         | 15
Table 3: The video temporal localization performance on the MFC2018 dataset. The three columns report correct cases, incorrect cases and ambiguously incorrect cases, respectively, and # indicates the number of cases of each kind.
(a) Completely correct cases (0 frames missed).
(b) Incompletely correct cases (4 frames missed on the right end only).
(c) Incompletely correct cases (4 frames missed on the left end only).
(d) Incompletely correct cases (7 and 4 frames missed on the left and right end).
(e) Incorrect cases (2-frame gap).
(f) Ambiguously incorrect cases (0-frame gap).
Figure 9: The visualization of confusion bars in video temporal localization. For each subfigure, the top bar is the reference, the middle bar is the system output from our proposed method, and the bottom bar is the confusion calculated from the reference and the system output. Note that TN, FN, FP, TP and “OptOut” in the confusion are marked in white, blue, red, green and yellow / black, respectively. Subfigures (a) to (d) are correct cases, which include completely correct cases and incompletely correct cases. The bottom row (subfigures (e) and (f)) shows incorrect cases, covering both completely incorrect cases and ambiguously incorrect cases.

For the temporal localization evaluation, we use the feature distance between any two consecutive frames for the two competing baselines. In addition to our proposed CNN method with the I3D network as the inconsistency detector, denoted as “CNN w/ I3D”, we provide two variants that replace our inconsistency detector with either the ResNet feature distance only or the C3D network's output scores from [12]. We use “CNN w/ ResNet” and “CNN w/ C3D” to indicate these two variants. The temporal localization results are summarized in Table 2, from which we can observe that (1) our deep learning solutions, whether “CNN w/ ResNet”, “CNN w/ C3D” or “CNN w/ I3D”, work better than both “Lin 2012” and “Yang 2016”, and (2) “CNN w/ I3D” performs the best. These observations suggest that 3D convolutional kernels are able to measure the inconsistency between consecutive frames, and that the RGB data stream and the optical flow data stream are complementary and together further improve the performance.

To better understand the video temporal localization measurement, we plot the confusion bars based on the reference and the corresponding system output under different scenarios, as shown in Figure 9. We emphasize that no algorithm is able to distinguish duplicated frames from selected frames in the ambiguously incorrect cases counted in Table 3, because such videos break the assumption of consistency and even humans cannot tell by eye which frames are duplicated.

The more ambiguously incorrect cases there are, the lower the MCC score, since such cases contribute zero TP. This explains why the MCC score on the MFC2018-Eval dataset is lower. Once the ambiguously incorrect cases are ruled out from the incorrect cases, the performance for video temporal localization is still promising, considering that the range of the MCC metric is [-1, 1].

4.3 Discussion

Multiple factors make frame duplication detection and localization in forged videos increasingly challenging. These factors include large frame rates, multiple manipulations applied before and after the duplication (e.g., “SelectCutFrames”, “TimeAlterationWarp”, “AntiForensicCopyExif”, “RemoveCamFingerprintPRNU”; these operation names are the ones used in the MFC2018 dataset), and the gaps between the selected frames and the duplicated frames. In particular, with a zero gap between the selected frames and the duplicated frames, the two cannot be distinguished at all, because the inconsistency that should exist at the ends of the duplicated frames does not appear in the temporal context of the video.

Regarding the runtime, running the I3D network is the most expensive component in our framework, so we apply it only to the candidate frames that the coarse-search stage flags as likely to contain frame duplication. Note that our training stage is carried out off-line. Each 16-frame testing video clip takes about 2 seconds with our learned I3D network. For a one-minute video at 30 FPS, testing all the frame sequences requires less than 5 minutes.
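The runtime figure can be checked with a rough back-of-the-envelope calculation. The sketch below assumes non-overlapping 16-frame clips covering the whole video, which is our simplification; the function name is hypothetical.

```python
import math


def estimated_runtime_seconds(duration_s, fps=30, clip_len=16, seconds_per_clip=2.0):
    """Rough I3D runtime estimate, assuming non-overlapping clips over all frames."""
    num_frames = int(duration_s * fps)
    num_clips = math.ceil(num_frames / clip_len)
    return num_clips * seconds_per_clip


runtime = estimated_runtime_seconds(60)
print(runtime, "seconds, i.e. about", round(runtime / 60, 1), "minutes")
```

For a one-minute video this gives 113 clips and roughly 226 seconds, which is in line with the stated bound of less than 5 minutes.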

5 Conclusion and future work

In this paper, we propose a coarse-to-fine deep learning approach for frame duplication detection at both the frame level and the video level, as well as for video temporal localization. We also propose a heuristic strategy to formulate the video-level confidence score, as well as an I3D-based inconsistency detector to distinguish the duplicated frames from the selected frames. The experimental results demonstrate the effectiveness of the proposed method.

Our future work includes extending multi-stream 3D neural networks to frame drop, frame duplication and other video manipulation tasks such as looping detection, handling frame-rate variations, training on multiple manipulations, and investigating how various video codecs degrade accuracy.


  • [1] J. Aneja, A. Deshpande, and A. G. Schwing. Convolutional image captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [2] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4724–4733. IEEE, 2017.
  • [3] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool. Domain adaptive faster r-cnn for object detection in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [4] V. Christlein, C. Riess, J. Jordan, C. Riess, and E. Angelopoulou. An evaluation of popular copy-move forgery detection approaches. arXiv preprint arXiv:1208.3665, 2012.
  • [5] M. Douze, A. Gaidon, H. Jegou, M. Marszalek, and C. Schmid. Inria-lear’s video copy detection system. In TRECVID 2008 workshop participants notebook papers, Gaithersburg, MD, USA, November 2008, 2008.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [7] Y. Hu, C.-T. Li, Y. Wang, and B.-b. Liu. An improved fingerprinting algorithm for detection of video frame duplication forgery. International Journal of Digital Crime and Forensics (IJDCF), 4(3):20–32, 2012.
  • [8] T. Huang, X. Zhang, W. Huang, L. Lin, and W. Su. A multi-channel approach through fusion of audio for detecting video inter-frame forgery. Computers & Security, 77:412–426, 2018.
  • [9] J. Law-To, O. Buisson, V. Gouet-Brunet, and N. Boujemaa. Robust voting algorithm based on labels of behavior for video copy detection. In Proceedings of the 14th ACM international conference on Multimedia, pages 835–844. ACM, 2006.
  • [10] D.-D. Le, S. Poullot, X. Wu, B. Nouvel, and S. Satoh. National institute of informatics, japan at trecvid 2010. In TRECVID, 2010.
  • [11] G.-S. Lin and J.-F. Chang. Detection of frame duplication forgery in videos based on spatial and temporal analysis. International Journal of Pattern Recognition and Artificial Intelligence, 26(07):1250017, 2012.
  • [12] C. Long, E. Smith, A. Basharat, and A. Hoogs. A c3d-based convolutional neural network for frame dropping detection in a single video shot. In IEEE International Conference on Computer Vision and Pattern Recognition Workshop (CVPR-W) on Media Forensics, 2017.
  • [13] Z. Mohamadian and A. A. Pouyan. Detection of duplication forgery in digital images in uniform and non-uniform regions. In Computer Modelling and Simulation (UKSim), 2013 UKSim 15th International Conference on, pages 455–460. IEEE, 2013.
  • [14] NIST. Media forensics challenge 2018, 2018.
  • [15] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis, et al. A large-scale benchmark dataset for event recognition in surveillance video. In Computer vision and pattern recognition (CVPR), 2011 IEEE conference on, pages 3153–3160. IEEE, 2011.
  • [16] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), pages 91–99. 2015.
  • [17] K. Sowmya and H. Chennamma. A survey on video forgery detection. International Journal of Computer Engineering and Applications, 9(2):17–27, 2015.
  • [18] P. Stock and M. Cisse. Convnets and imagenet beyond accuracy: Understanding mistakes and uncovering biases. In The European Conference on Computer Vision (ECCV), September 2018.
  • [19] P. Tang, X. Wang, A. Wang, Y. Yan, W. Liu, J. Huang, and A. Yuille. Weakly supervised region proposal network and object detection. In The European Conference on Computer Vision (ECCV), September 2018.
  • [20] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [21] S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. J. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. In North American Chapter of the Association for Computational Linguistics – Human Language Technologies (NAACL-HLT), 2015.
  • [22] J. Wang, G. Liu, Z. Zhang, Y. Dai, and Z. Wang. Fast and robust forensics for image region-duplication forgery. Acta Automatica Sinica, 35(12):1488–1495, 2009.
  • [23] Q. Wang, Z. Li, Z. Zhang, and Q. Ma. Video inter-frame forgery identification based on consistency of correlation coefficients of gray values. Journal of Computer and Communications, 2(04):51, 2014.
  • [24] Q. Wang, Z. Li, Z. Zhang, and Q. Ma. Video inter-frame forgery identification based on optical flow consistency. Sensors & Transducers, 166(3):229, 2014.
  • [25] W. Wang and H. Farid. Exposing digital forgeries in video by detecting duplication. In Proceedings of the 9th workshop on Multimedia & security, pages 35–42. ACM, 2007.
  • [26] X.-y. Wang, Y.-n. Liu, H. Xu, P. Wang, and H.-y. Yang. Robust copy–move forgery detection using quaternion exponent moments. Pattern Analysis and Applications, 21(2):451–467, 2018.
  • [27] Y. Wu, X. Jiang, T. Sun, and W. Wang. Exposing video inter-frame forgery based on velocity field consistency. In Acoustics, speech and signal processing (ICASSP), 2014 IEEE International Conference on, pages 2674–2678. IEEE, 2014.
  • [28] B. Yang, X. Sun, H. Guo, Z. Xia, and X. Chen. A copy-move forgery detection method based on cmfd-sift. Multimedia Tools and Applications, 77(1):837–855, 2018.
  • [29] J. Yang, T. Huang, and L. Su. Using similarity analysis to detect frame duplication forgery in videos. Multimedia Tools and Applications, 75(4):1793–1811, 2016.
  • [30] H. Yu, S. Cheng, B. Ni, M. Wang, J. Zhang, and X. Yang. Fine-grained video captioning for sports narrative. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [31] D.-N. Zhao, R.-K. Wang, and Z.-M. Lu. Inter-frame passive-blind forgery detection for video shot based on similarity analysis. Multimedia Tools and Applications, pages 1–20, 2018.