and so on. Inspired by the strong performance of Convolutional Neural Networks (CNNs), more and more researchers have applied deep learning to the video classification task and achieved great success. However, several works [26, 17, 31] show that video classifiers are vulnerable to adversarial examples, just as image classifiers are [23, 8]. Unlike attacks on static images, attacks on video classification must consider not only the spatial information but also the temporal information.
Video attacks have received little study so far. To our knowledge, there are only three related papers [26, 17, 31]. These methods can be roughly divided into two classes. The first is the sparse attack, in which only some key frames of a video are polluted, so the generated adversarial perturbations are temporally sparse. For example, the authors of [26] show that slight perturbations added to the current frame can propagate to the following frames through the temporal interactions between frames, so not every frame needs to be polluted. Using a norm-based iterative optimization algorithm, they successfully attack a state-of-the-art video classification model. The second is the dense attack, in which all the frames of a video are polluted. For example, [17] proposes a generative method for adversarial perturbations to attack real-time video classification systems. Because the test phase involves only a feedforward network, the attack is efficient. It also explores robust 3D adversarial perturbations to overcome the varying boundaries of video clips.
Many techniques have been proposed to defend against adversarial examples. In the image case, adversarial training [24] is widely used: it improves the robustness of DNNs by adding adversarial examples to the training dataset and retraining the classification model. In addition, [18] proposes that image denoising can serve as a defense, and designs a High-level representation Guided Denoiser (HGD) to remove the adversarial perturbations. Inspired by this, ComDefend [13] argues that image compression is useful for defending against adversarial examples, and presents an end-to-end image compression model for defense. All of the above methods are designed for static image classification. They do not consider the temporal information within videos and are therefore not fully suitable for defending against adversarial videos. A new framework is needed to defend against adversarial examples for video classification.
For this reason, we propose an effective defense framework to characterize and defend against video adversarial examples. Our method contains two steps. The first step detects adversarial videos using the temporal consistency between adjacent frames (video frames that are temporally close have similar image characteristics). If the input video is benign, we feed it directly to the well-trained video classifier. Otherwise, an extra pre-processing module is used to denoise it. One accepted explanation for adversarial examples is the linear nature of DNNs [8]. The imperceptible perturbations added to the images are amplified as the depth of the DNN increases, which leads to inconsistent DNN outputs between adjacent frames. By contrast, benign video frames usually have the same outputs as their neighboring frames, because the changes between them are slight. This difference helps us distinguish adversarial videos from benign videos. In the implementation, we present a metric that represents the degree of temporal consistency, and then use a threshold to classify the videos. The second step reduces the adversarial perturbations with different denoisers in the spatial and temporal domains. For the sparse attack [26], we propose the temporal defense, which exploits the temporal interactions between frames and reconstructs the polluted frames from their temporally neighboring clean frames. For the dense attack [17], we use the spatial defense, which applies an efficient adversarial denoiser to each frame in the spatial domain to obtain its clean version. In this way, the defense strategy is chosen according to the properties of the attack. Experiments show that the temporal defense performs best against the sparse attack, and the spatial defense performs best against the dense attack. Figure 1 illustrates the overall framework.
In summary, this paper has the following contributions:
(1) To the best of our knowledge, we are the first to explore defenses for adversarial videos, and we propose a two-step defense framework for video classification. The proposed framework applies different defense strategies according to the properties of the attacks. Experiments show that our method significantly improves the robustness of video classifiers.
(2) We design an adversarial video detection method based on the temporal consistency between adjacent frames. We find that the linear nature of DNNs leads to temporal inconsistency in adversarial videos, and we present a metric that represents the degree of temporal consistency. With this metric, a simple threshold suffices to identify adversarial videos.
(3) We propose the temporal defense against the sparse attack. The temporal defense exploits the temporal interactions between adjacent frames and reconstructs the polluted frames from their neighboring clean frames, thereby removing the adversarial perturbations. It is the first method to exploit the structured information in videos for adversarial defense.
The remainder of this paper is organized as follows. Section 2 briefly reviews the related work. Section 3 introduces the details of the proposed framework. Section 4 shows a series of experimental results and analysis. Finally, Section 5 gives the conclusion.
Video classification
Currently, there are three kinds of deep learning methods for video classification. The first applies existing image classification methods to videos by regarding a video as a collection of frames. Typically, it uses a network pre-trained on ImageNet [3] to extract frame features and then combines them into video features for classification. The second uses an end-to-end 3D CNN architecture. It differs from the first in that the feature extraction part is trained directly on video data, as in [12, 25]. The third uses a CNN+LSTM architecture, in which an LSTM [10] extracts the temporal information of videos: a CNN first extracts the frame features, and an LSTM then models the temporal relationship between frames, as in [5, 30]. In addition, the two-stream architecture [21] and its variants [6, 27] have been proposed and achieve competitive performance. In these models, one stream encodes the spatial information within frames, and the other encodes the temporal information between frames (usually represented by optical flow).
Adversarial attacks
Like image classifiers, video classification models are vulnerable to adversarial examples. [26] proposes a sparse attack on the CNN+RNN architecture, which is widely used in video classification. It demonstrates that adversarial perturbations can be transferred between video frames. A dense adversarial attack is proposed in [17]. It attacks the C3D model [25], which is also widely used in video classification. The attack in [17] is based on a generative model, and the generated adversarial perturbations are added to every frame of the video. An adversarial framing attack is proposed in [31]. Compared with the above methods, it does not modify most pixels of the input image; it only adds an adversarial frame around the image border.
Adversarial defense
In [24], F. Tramer et al. propose an adversarial training method that retrains the classification model after adding adversarial examples to the original training dataset. The adversarial examples are pre-generated by different attack methods. In [18], the adversarial perturbations are regarded as a special kind of image noise, and HGD is a denoising model for removing such noise. It is also trained on generated adversarial images. In [13], an end-to-end compression model is proposed to defend against adversarial examples. It uses image compression to break the structure of the adversarial perturbations. There are many other defense methods as well, such as [20, 29, 2]. All of the above defense methods focus on images. They consider only the spatial information and ignore the temporal information. Therefore, they are not fully suitable for the video classification task.
The framework of the proposed method
According to the number of polluted frames in the video, current video attack methods can be divided into sparse attacks and dense attacks. In our framework, each kind of attack is handled by a corresponding defense method. The proposed framework consists of two phases. The first is the detection network, which determines whether the video is attacked and, if so, the type of the attack. Then, different defense methods are carried out according to the detection result. For the sparse adversarial video attack, which modifies only a few frames of the video, we use the temporal defense to transform the adversarial video into the corresponding clean video. For the dense adversarial video attack, which modifies all the frames of the video, we use the spatial defense to transform the adversarial video into the corresponding clean video.
Detection network
The detection network has two main functions: (1) detecting whether the video is attacked, and (2) detecting what kind of attack the video is subjected to. Adjacent frames of a video are highly correlated. For the video object segmentation task, [28] observes that video frames that are temporally close have similar image characteristics. This phenomenon can be called the temporal consistency of video data. Through a series of experiments, we find that it also exists in the video classification task. In particular, the outputs of a well-trained video classifier remain consistent across adjacent frames of a benign video. By contrast, because of the linear nature of DNNs, imperceptible adversarial perturbations are amplified as the depth of the DNN increases, which makes the DNN outputs inconsistent between adjacent frames of an adversarial video. As shown in Figure 2, we feed each frame of the input video to the well-trained video classification network to get its classification label. The CNN+LSTM architecture [4] is used as the well-trained network in this paper. Note that, to eliminate inter-frame effects, we remove the links between LSTMs in the detection network. If the label of the current frame differs from the labels of its adjacent frames, the current frame is regarded as an exception frame. We define an exception index $E$ to represent the degree of temporal consistency, namely the fraction of exception frames in the video:
$$E = \frac{1}{N} \sum_{i=1}^{N} e_i$$
where $e_i = 1$ if the predicted label $l_i$ of the $i$-th frame differs from the labels of its adjacent frames and $e_i = 0$ otherwise, and $N$ represents the total number of frames in a video. Detection is performed according to the value of $E$. If $E < T_1$, the video is benign. If $T_1 \le E < T_2$, the video is subjected to the sparse attack. If $E \ge T_2$, the video is subjected to the dense attack. $T_1$ and $T_2$ are hyper-parameters that are determined in the experiments. We use the Receiver Operating Characteristic (ROC) curve to evaluate the proposed detection method. As shown in Figure 4, the proposed method can detect whether the original video is attacked and determine the type of the attack. We also apply the detection network to two randomly selected sparse adversarial videos and dense adversarial videos. The result is shown in Figure 3.
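The detection rule can be illustrated with a short Python sketch. It is only one reading of the description: a frame is counted as an exception frame when its label differs from every neighbor it has, the index is the fraction of such frames, and the comparisons against the thresholds are strict on the lower side. The function names are hypothetical, `t1 = 0.175` is the value reported later in the text, and `t2 = 0.5` is a placeholder, not a value from the paper.

```python
from typing import Sequence


def exception_index(labels: Sequence[int]) -> float:
    """Fraction of frames whose label differs from all adjacent frames.

    Assumption: boundary frames have a single neighbor, and a frame is an
    exception only if it disagrees with every neighbor it has.
    """
    n = len(labels)
    if n < 2:
        return 0.0
    exceptions = 0
    for i, label in enumerate(labels):
        neighbors = []
        if i > 0:
            neighbors.append(labels[i - 1])
        if i < n - 1:
            neighbors.append(labels[i + 1])
        if all(label != other for other in neighbors):
            exceptions += 1
    return exceptions / n


def classify_video(labels: Sequence[int], t1: float = 0.175, t2: float = 0.5) -> str:
    """Map per-frame labels to 'benign', 'sparse' or 'dense' (t2 is a placeholder)."""
    e = exception_index(labels)
    if e < t1:
        return "benign"
    if e < t2:
        return "sparse"
    return "dense"
```

For instance, a ten-frame video whose per-frame labels are `[3, 3, 7, 3, 3, 9, 3, 3, 3, 3]` has two exception frames, so its index is 0.2 and it falls between the two thresholds used here.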
Spatial defense
To transform dense adversarial videos into the corresponding clean videos, we use an efficient denoiser to process each frame. Here we select an end-to-end compression model. As shown in Figure 5, the whole process includes two modules: an image compression module and an image reconstruction module. In the image compression stage, the ComCNN compresses each frame and extracts its main structural information. In the image reconstruction stage, the RecCNN uses the compressed information of each frame to reconstruct the corresponding clean frame. For more details, please refer to [13]. Because the operation is conducted within the spatial domain of each frame, this method is called the spatial defense.
Temporal defense
Temporally adjacent frames of a video are highly correlated, so adversarial frames can be replaced by pseudo frames reconstructed from the adjacent clean frames. Therefore, we propose to use motion estimation to compress the sparse adversarial videos for defense. In this way, the sparse adversarial video can be transformed into the corresponding clean video.
Motion estimation [7] is a widely used technique in video processing. It can be performed on image patches or on an image grid. Patch-based motion estimation is widely used because the algorithm is simple and easy to implement, and we focus on it here. The basic idea is to divide each frame of the video into a number of non-overlapping patches, assume that all pixels in a patch have the same displacement, and then find the patch in the reference frame that is most similar to the current patch according to a certain matching criterion. This patch is called the matching patch. The relative displacement between the matching patch and the current patch is the motion vector. When the video is compressed, we only need to save the motion vector and the corresponding residual data to recover the current patch. Note that the integer Discrete Cosine Transform (DCT) [1] is used to compress the residual data. The motion vector is defined through the matching criterion. Let $f_t(x, y)$ represent the intensity of the pixel with coordinates $(x, y)$ in frame $t$, and let $W$ and $H$ represent the width and the length of the patch. Each patch is represented by the coordinates of its upper-left corner. $D(u, v)$ represents the similarity between the present-frame patch at $(x, y)$ and the previous-frame patch at $(x+u, y+v)$.
The displacement $(u^*, v^*)$ of the best-matching patch represents the motion vector of the present-frame patch at $(x, y)$.
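The matching criterion and the motion vector can be written out explicitly. The following is a reconstruction under the assumption that the criterion is the sum of absolute differences (SAD), a common choice in block matching. The symbols $f_t$, $W$, $H$, $D$, $(u, v)$ and the search range $R$ are notation chosen here for illustration, and the original formulation may use a different criterion, such as the mean squared error.

```latex
D(u, v) = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1}
          \left| f_t(x+i,\, y+j) - f_{t-1}(x+u+i,\, y+v+j) \right|,
\qquad
(u^*, v^*) = \operatorname*{arg\,min}_{|u| \le R,\; |v| \le R} D(u, v)
```

Under this convention a smaller $D$ means a more similar patch, so the motion vector is the displacement that minimizes $D$ within the search window.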
The contents of adjacent frames of a video are highly correlated, and in a sparse adversarial video only a small number of frames are attacked. Inspired by motion estimation, we replace the patches of adversarial video frames with patches of clean video frames. As shown in Figure 6, we save the patches of the clean video frames, the motion vectors and the corresponding residuals, and use them to reconstruct the patches of the adversarial video frames. The adversarial frames are thus rebuilt from clean-frame patches, and the quantization of the corresponding residual removes the adversarial perturbations. Because the operation is conducted within the temporal domain of the video, this method is called the temporal defense.
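The reconstruction step can be sketched in plain Python. This is a simplified illustration, not the authors' implementation: it assumes SAD as the matching criterion, square patches, a single clean reference frame, and it quantizes the residual directly in the pixel domain instead of applying the integer DCT first. All function names and the parameters `size`, `search` and `step` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[int]]  # grayscale frame stored as rows of pixel intensities


def sad(cur: Frame, ref: Frame, x: int, y: int, u: int, v: int, size: int) -> int:
    """Sum of absolute differences between the patch at (x, y) in `cur`
    and the patch displaced by (u, v) in `ref`."""
    return sum(
        abs(cur[y + j][x + i] - ref[y + v + j][x + u + i])
        for j in range(size)
        for i in range(size)
    )


def best_motion_vector(cur: Frame, ref: Frame, x: int, y: int,
                       size: int, search: int) -> Tuple[int, int]:
    """Exhaustive search for the displacement with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(cur, ref, x, y, 0, 0, size)
    for v in range(-search, search + 1):
        for u in range(-search, search + 1):
            # Skip candidate patches that fall outside the reference frame.
            if not (0 <= x + u <= width - size and 0 <= y + v <= height - size):
                continue
            cost = sad(cur, ref, x, y, u, v, size)
            if cost < best_cost:
                best, best_cost = (u, v), cost
    return best


def reconstruct_frame(polluted: Frame, clean_ref: Frame, size: int = 2,
                      search: int = 1, step: int = 16) -> Frame:
    """Rebuild a polluted frame from clean-reference patches plus a
    coarsely quantized residual (pixels outside full patches are copied)."""
    height, width = len(polluted), len(polluted[0])
    out = [row[:] for row in polluted]
    for y in range(0, height - size + 1, size):
        for x in range(0, width - size + 1, size):
            u, v = best_motion_vector(polluted, clean_ref, x, y, size, search)
            for j in range(size):
                for i in range(size):
                    predicted = clean_ref[y + v + j][x + u + i]
                    residual = polluted[y + j][x + i] - predicted
                    # Small residuals round to zero, which discards
                    # low-amplitude perturbations.
                    out[y + j][x + i] = predicted + round(residual / step) * step
    return out
```

The quantization step controls the trade-off: a larger `step` removes more of the perturbation but also discards more genuine change between the two frames.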
Experiments and analysis
In this section, we conduct a series of experiments to verify the effectiveness of the proposed framework, which includes: adversarial video generation, threshold selection in the framework, video classification with the proposed framework and result analysis.
Datasets
The detection network is trained on the video dataset UCF101 [22], which is widely used in action recognition. It consists of 13320 videos from 101 action categories, covering sports, playing musical instruments, body motion, human-human interaction and human-object interaction. To ensure a fair comparison, the attack methods and the defense methods are all evaluated on UCF101. We choose more than 8000 videos for training the video classifier and 3000 videos for testing it. The detection network is trained on the same 8000 videos. The original ComDefend is trained on the CIFAR-10 dataset; we use this well-trained model directly without modification.
Adversarial video generation
As mentioned previously, video attacks can be divided into sparse adversarial attacks and dense adversarial attacks according to the number of polluted frames. We choose two typical attack methods: [26] for the sparse attack and [17] for the dense attack. [26] takes the CNN+LSTM architecture as its threat model and exploits the fact that adversarial perturbations can propagate between frames. Because the adversarial perturbations are added to only a few video frames, it is a sparse adversarial attack. [17] uses a generative model, similar in structure to a Generative Adversarial Network (GAN), to generate a series of adversarial perturbations. The generated perturbations are added to every frame of the original video to fool the C3D model. Because all the frames are polluted, it is a dense adversarial attack. We use these two methods to generate adversarial videos on the UCF101 dataset, and then apply our framework to defend against them.
Threshold selection in the detection stage
In the detection stage, two threshold hyperparameters $T_1$ and $T_2$ need to be determined experimentally. $T_1$ is used to detect whether the input video is attacked, and $T_2$ is used to determine the type of attack. To determine $T_1$, we randomly select 500 clean videos from the UCF101 dataset and generate their corresponding sparse and dense adversarial videos, which gives 1500 videos in total. This is a two-class classification task. We regard the adversarial videos as the positive samples, and calculate the precision, recall and F1-measure under different values of $T_1$. The results are shown in Table 1. The optimal $T_1$ is the one with the highest F1-measure, where the threshold shows good precision and recall simultaneously. Under this setting, $T_1$ is set to 0.175. To determine $T_2$, we use only the 500 sparse adversarial videos and the 500 dense adversarial videos. As for $T_1$, we compute the precision, recall and F1-measure on this dataset. The difference is that the dense adversarial videos are regarded as the positive samples in this case. The final results are shown in Table 2. In the following experiments, we set $T_1 = 0.175$ and set $T_2$ to the value that achieves the best F1-measure in Table 2.
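The threshold search described above can be sketched as a small sweep over candidate values. This is an illustrative Python sketch, assuming that a video is predicted positive when its exception index is at least the threshold; the function names and the toy scores used in the usage note are made up for illustration and are not results from the paper.

```python
from typing import Sequence, Tuple


def precision_recall_f1(scores: Sequence[float], is_positive: Sequence[bool],
                        threshold: float) -> Tuple[float, float, float]:
    """Precision, recall and F1 when `score >= threshold` predicts positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, is_positive))
    fp = sum(p and not y for p, y in zip(predicted, is_positive))
    fn = sum((not p) and y for p, y in zip(predicted, is_positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def select_threshold(scores: Sequence[float], is_positive: Sequence[bool],
                     candidates: Sequence[float]) -> float:
    """Return the candidate with the highest F1 (first one wins on ties)."""
    return max(candidates,
               key=lambda t: precision_recall_f1(scores, is_positive, t)[2])
```

With toy indices `[0.0, 0.05, 0.1, 0.2, 0.3, 0.9]`, of which the last three belong to adversarial videos, and candidates `[0.05, 0.175, 0.5]`, the sweep picks 0.175 because it is the only candidate that separates the two groups without errors.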
Video defense with the proposed method
The experiments in this section include (1) video classification with the spatial defense separately, (2) video classification with the temporal defense separately, and (3) video classification with the different defense strategies combined with detection. In this way, it is easy to see the performance of each module of the proposed method.
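Table 3. Classification accuracy with the spatial and temporal defenses (values as stated in the text).
| Network | Attacks | No defense | Spatial defense | Temporal defense |
|---|---|---|---|---|
| CNN+LSTM | Sparse attack [26] | 31% | 55% | 64% |
| C3D | Dense attack [17] | 59% | 73% | 59% |

Table 4.
| Defense method | No defense | Spatial defense only | Temporal defense only | Detection+temporal/spatial defense |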
Video classification with the temporal defense
The proposed temporal defense is a pre-processing module and does not modify the deployed model. When the input video is detected as a sparse attack, we first use the temporal defense to denoise it, and then feed it to the deployed model. The performance of the temporal defense is reported in Table 3. Two deployed video classification models are tested: CNN+LSTM and C3D. We use [26] and [17] to generate the adversarial videos, and the temporal defense to defend against them. Table 3 shows that the temporal defense significantly improves the classification accuracy of CNN+LSTM under the sparse attack, from 31% to 64%, but does not work well against the dense attack: it brings no improvement for C3D (from 59% to 59%). This is expected, because the temporal defense uses image patches of clean frames to reconstruct the adversarial frames. Under the sparse attack, the video contains many clean frames, which can be used to compress and reconstruct the adversarial frames. Under the dense attack, all the frames of the video are attacked, so very little clean information is available. Even after the adversarial frames are compressed and reconstructed, the adversarial perturbations survive. The results in Table 3 show that the temporal defense is better suited to the sparse attack than to the dense attack. In addition, we find that the temporal defense is friendly to clean videos. For both CNN+LSTM and C3D, the accuracy drop caused by the temporal defense is very slight, and far smaller than the drop caused by an attack. Because of this advantage, we do not need to strictly distinguish sparse adversarial videos from clean videos. The frames produced by the temporal defense, the corresponding sparse adversarial frames and the corresponding clean frames are shown in Figure 7.
Video classification with the spatial defense
The spatial defense is evaluated with the same experimental setup as the temporal defense. Table 3 shows that the spatial defense achieves comparable improvements against both the sparse attack and the dense attack. For the sparse attack, it improves the accuracy of CNN+LSTM from 31% to 55%. For the dense attack, it improves the accuracy of C3D from 59% to 73%. The results show that the spatial defense is better suited to the dense attack than the temporal defense is (73% - 59% = 14% vs. 59% - 59% = 0%). This is reasonable because, in the dense attack, all the frames are polluted. In this situation, we can regard the polluted frames as adversarial images and apply an image defense method directly. ComDefend achieves state-of-the-art image defense performance and is also very efficient. Therefore, we choose ComDefend to defend against the dense attack.
Video classification with integrating detection and different defense strategies
From the above discussion, it is clear that the spatial defense is better suited to the dense attack and the temporal defense is better suited to the sparse attack. To combine the advantages of both, we integrate the two defense methods with the detection step to jointly improve the robustness of video classifiers. According to the detection result, the corresponding defense method is used to protect the well-trained video classification model. In particular, as shown in Figure 1, if the video is detected as clean, it is fed to the well-trained classifier directly. If it is detected as a sparse adversarial video, the temporal defense is applied to it. If it is detected as a dense adversarial video, the spatial defense is applied to it. We choose 500 clean videos, 500 sparse adversarial videos and 500 dense adversarial videos to verify the proposed framework. Note that these adversarial videos all cause the well-trained classifier to output a wrong label. As shown in Table 4, both the spatial and the temporal defense improve the classification accuracy of the well-trained classifier, and integrating the two defense methods with the adversarial detection obtains the best classification accuracy.
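The routing logic of the integrated framework can be summarized in a few lines. The sketch below is hypothetical glue code: it takes an already computed exception index together with placeholder callables for the classifier and the two defenses, and the default `t2 = 0.5` is a placeholder rather than a value from the paper.

```python
from typing import Callable, TypeVar

Video = TypeVar("Video")
Label = TypeVar("Label")


def defend_and_classify(
    video: Video,
    exception_index: float,
    classifier: Callable[[Video], Label],
    temporal_defense: Callable[[Video], Video],
    spatial_defense: Callable[[Video], Video],
    t1: float = 0.175,
    t2: float = 0.5,  # placeholder threshold, not taken from the paper
) -> Label:
    """Route the video to the defense that matches the detection result."""
    if exception_index < t1:
        # Detected as clean: classify the original video directly.
        return classifier(video)
    if exception_index < t2:
        # Detected as a sparse attack: rebuild polluted frames first.
        return classifier(temporal_defense(video))
    # Detected as a dense attack: denoise every frame first.
    return classifier(spatial_defense(video))
```

Passing the defenses as callables keeps the deployed classifier untouched, which matches the description of both defenses as pre-processing modules.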
Result analysis
It has been demonstrated in [9] that the adversarial perturbation of an adversarial example has a particular structure. The spatial defense (ComDefend [13]) regards the adversarial perturbation as a kind of redundant image information and removes it by image compression. This method is also a kind of adversarial denoiser, so it is suitable for processing dense adversarial videos. The temporal defense uses motion estimation to purify the adversarial frames with the information of clean frames. In particular, the adversarial frames are reconstructed from the clean frames, the motion vectors and the corresponding residuals. Quantizing the residual destroys the particular structure of the adversarial perturbations. Because it relies on the clean frames of the adversarial video, the temporal defense is suitable for sparse adversarial videos. The detection step, which is based on the temporal consistency between adjacent frames, lets us combine the advantages of both defenses to improve model robustness. That is, the choice of defense method depends on the detection result.
In this paper, we propose an effective defense framework to characterize and defend against adversarial videos. We are the first to explore defenses for adversarial videos. In light of the properties of the state-of-the-art attack methods, we first apply the detection network to detect whether the input video is attacked and to determine the type of attack. We then select the corresponding defense method according to the detection result. A series of experiments demonstrates that the proposed method greatly improves the accuracy of the well-trained video classifier and defends against the state-of-the-art attack methods for video classification.
- [1] (1974) Discrete cosine transform. IEEE Transactions on Computers 100 (1), pp. 90–93.
- [2] Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. International Conference on Machine Learning, pp. 274–283.
- [3] (2009) Imagenet: a large-scale hierarchical image database.
- [4] Long-term recurrent convolutional networks for visual recognition and description. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [5] (2017) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634.
- [6] (2016) Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941.
- [7] (2012) Motion estimation algorithms for video compression. Vol. 379, Springer Science & Business Media.
- [8] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- [9] (2017) Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117.
- [10] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- [11] (1999) Integration of multimodal features for video scene classification based on hmm. In 1999 IEEE Third Workshop on Multimedia Signal Processing (Cat. No. 99TH8451), pp. 53–58.
- [12] (2013) 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 221–231.
- [13] (2018) ComDefend: an efficient image compression model to defend adversarial examples. arXiv preprint arXiv:1811.12673.
- [14] (2014-06) Large-scale video classification with convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [15] (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1725–1732.
- [16] (2015) Deep learning. Nature 521 (7553), pp. 436.
- [17] (2018) Adversarial perturbations against real-time video classification systems. arXiv preprint arXiv:1807.00458.
- [18] (2018) Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1778–1787.
- [19] (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436.
- [20] (2018) Max-mahalanobis linear discriminant analysis networks. In International Conference on Machine Learning, pp. 4013–4022.
- [21] (2014) Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pp. 568–576.
- [22] (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
- [23] (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
- [24] (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204.
- [25] (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4489–4497.
- [26] Sparse adversarial perturbations for videos. The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), pp. 8973–8980.
- [27] (2015) Modeling spatial-temporal clues in a hybrid deep learning framework for video classification. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 461–470.
- [28] (2018) Monet: deep motion exploitation for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1140–1148.
- [29] (2018) Mitigating adversarial effects through randomization. In ICLR.
- [30] (2015) Beyond short snippets: deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4694–4702.
- [31] (2018) Adversarial framing for image and video classification. arXiv preprint arXiv:1812.04599.
- [32] (2007) Real-time object classification in video surveillance based on appearance learning. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.