
Can't Fool Me: Adversarially Robust Transformer for Video Understanding

by Divya Choudhary, et al.

Deep neural networks have been shown to perform poorly on adversarial examples. To address this, several techniques have been proposed to increase the robustness of a model for image classification tasks. However, in video understanding tasks, developing adversarially robust models is still unexplored. In this paper, we aim to bridge this gap. We first show that simple extensions of image-based adversarially robust models slightly improve the worst-case performance. Further, we propose a temporal attention regularization scheme in Transformer to improve the robustness of attention modules to adversarial examples. We illustrate using the large-scale video data set YouTube-8M that the final model (A-ART) achieves close to non-adversarial performance on its adversarial example set. We achieve 91% GAP on adversarial examples, whereas the baseline Transformer and simple adversarial extensions achieve 72.9% and 82% respectively, showing a significant improvement in robustness over the state-of-the-art.



1 Introduction

Deep neural networks have achieved state-of-the-art results in several machine learning tasks, e.g., image classification [8][3], video and audio understanding [19][4][16], natural language understanding [7], graph learning and reinforcement learning [20][2]. However, it has been shown that, due to high dimensionality, even the simplest of models are vulnerable to adversarial examples [13] created with imperceptible changes to input examples [15, 5, 12]. Wei et al. [18] showed this for videos, where the trained model fails to detect the correct class of a perturbed video. For a threat model, they generated the adversarial perturbations iteratively by maximising the cross-entropy loss between the model's output for a perturbed video and its ground-truth label. They additionally minimized the norm of the perturbations so that the perturbed video is semantically close to the original video. This ensured that visually similar videos had different model outputs.

Adversarial training has been proposed in several key tasks such as image classification to make the model robust to such adversarial changes. For example, in [5, 10], the authors propose a way of computing the adversarial counterparts of images while training and adding an extra loss regularization that forces the model to correctly classify them or make their predictions close to that of the original image.

However, for video understanding, research on adversarially robust models needs further exploration. Moving from image to video classification adds several challenges to the task. First, the temporal dimension increases the overall size of the input, which in turn increases the model capacity required to make accurate predictions and makes the model more susceptible to adversarial attacks. Second, the number of possible tags increases due to variations in sequence. For example, a leaf falling from a tree can be tagged as nature, but the reverse could indicate science-fiction elements in the video.

We study the effect of adversarial training on the video classification task using a popular deep learning model, Transformer [17], henceforth referred to as Non-Adversarially Robust Transformer (Non-ART). For adversarial robustness training, we focus on two major approaches: learning in the output space (ART) and learning in the attention-map space (A-ART). We first perform a simple extension (ART) of the traditional adversarial loss used for images [10] to videos and show that it improves the adversarial robustness to a certain extent. We study the effect of ART in the attention space and show that it does not produce an adversarially robust attention space. Based on this, we propose a temporal attention regularization approach (A-ART) and show that it has a large impact on the adversarial robustness of the trained Transformer model.

We run extensive experiments on the large-scale video data set YouTube-8M, on the original test set (average performance) as well as an adversarial test set (worst-case performance). Further, to showcase generalization, we perform similar experiments on the Finance and Arts & Entertainment verticals within YouTube-8M. We show that our approach achieves close to average performance on the adversarial test set of YouTube-8M. On the original and adversarial test sets, our model (A-ART) achieves 92% GAP and 91% GAP respectively. A-ART has a gain of 18% in GAP on the adversarial test set when compared with the baseline Non-ART and a gain of 9.1% when compared with ART.

2 Adversarial Transformer

In this section, we show how the loss function of a Transformer-based video classification model can be changed for more robust learning, addressing the vulnerability of deep learning based video classification models to adversarial examples. Each data point in the training set pairs the frame-wise feature representation of a video with its ground-truth labels over a fixed set of classes. For each data point, the video classification model outputs a vector of class probabilities, and the loss for that data point is the cross-entropy loss in our multi-class scenario. The standard training objective minimizes the average of this loss over the training set.
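As a sketch under assumed notation ($N$ training pairs $(x_i, y_i)$, model output $p(x_i;\theta)$ with parameters $\theta$, and per-example cross-entropy $\ell$; these symbols are illustrative assumptions, not confirmed by the text), the standard objective averages the per-example loss over the training set:

```latex
\[
\mathcal{L}_{\mathrm{CE}}(\theta)
  \;=\; \frac{1}{N} \sum_{i=1}^{N} \ell\big(p(x_i;\theta),\, y_i\big)
\]
```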


However, such models would be susceptible to adversarial examples hurting the generalizability of the model.

2.1 Adversarial regularization

To address the above limitation, we add a regularization term minimizing the loss for adversarial counterparts in the output space of the training samples as proposed in  [10].
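A sketch of such a regularized objective, following the adversarial training formulation of [10], might read as follows. The notation is assumed for illustration: $\ell$ is the per-example cross-entropy, $p(x_i;\theta)$ the model output for input $x_i$ with labels $y_i$, $\alpha$ the adversarial loss weight, $\epsilon$ the neighborhood radius, and $r_i$ the adversarial perturbation for $x_i$.

```latex
\[
\mathcal{L}_{\mathrm{adv}}(\theta)
  = \frac{1}{N}\sum_{i=1}^{N}
    \Big[\ell\big(p(x_i;\theta),\, y_i\big)
       + \alpha\,\ell\big(p(x_i + r_i;\theta),\, y_i\big)\Big],
\qquad
r_i = \operatorname*{arg\,max}_{\|r\| \le \epsilon}
      \ell\big(p(x_i + r;\theta),\, y_i\big)
\]
```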


The loss function is approximated to behave linearly around the input to get the adversarial perturbation term, which can be easily calculated using backpropagation.


Note, this method is closely related to the fast gradient sign method (FGSM) [5], where the $L_\infty$-norm is considered in equation 3. We compute the gradient of the loss with respect to the features from the video and audio modalities to get the corresponding adversarial perturbations. Note that we train the model to be invariant to adversarial samples within a norm ball around each input (see Section 3.2.3). Hence, optimizing this loss function has two hyper-parameters to tune: the neighborhood radius and the adversarial loss weight. We refer to the model trained with this loss as Adversarially Robust Transformer (ART).
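The linearized perturbation can be illustrated with a minimal sketch. It assumes a toy logistic model whose input gradient is available in closed form (a stand-in for backpropagation through the Transformer); the function names, the toy weights, and the radius `eps` are all hypothetical. Under an L2 constraint the perturbation is the normalized gradient scaled by the radius, while under an L-infinity constraint it reduces to the FGSM sign step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_loss(w, b, x, y):
    """Binary cross-entropy of a toy logistic model p = sigmoid(w.x + b)."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def loss_grad_wrt_input(w, b, x, y):
    """Closed-form gradient of the loss with respect to the input features."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]

def l2_perturbation(grad, eps):
    """L2-constrained step: eps * g / ||g||_2."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [eps * g / norm for g in grad]

def linf_perturbation(grad, eps):
    """L-infinity-constrained step (FGSM): eps * sign(g)."""
    return [eps * ((g > 0) - (g < 0)) for g in grad]

# Hypothetical toy model and input, for illustration only.
w, b = [1.5, -2.0, 0.5], 0.1
x, y = [0.2, -0.4, 1.0], 1
eps = 0.1

g = loss_grad_wrt_input(w, b, x, y)
r_l2 = l2_perturbation(g, eps)
r_inf = linf_perturbation(g, eps)
x_adv = [xi + ri for xi, ri in zip(x, r_l2)]  # adversarial counterpart of x
```

Both steps move the input in a direction that increases the loss; they differ only in which norm bounds the size of the step.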

2.2 Attention-map regularization

Figure 1: Attention map generated by Non-ART (left) and ART (right) models.

Figure 1 illustrates the attention maps generated for a sample video and its adversarial counterpart by the Non-ART model (trained with the classification loss only) and the ART model. We observe that even though the attention maps are close, for many frames the differences are still significant. Intuitively, however, the attention map generated in the self-attention blocks should also be invariant to adversarial examples. Towards that goal, we add a regularization term to our loss function to enforce this condition. For a given input, we average over the attention maps generated by each head to get the attention map for that input, and we compute the corresponding attention map for the adversarial example in the same way. We then minimize the Frobenius norm of the difference between the two attention maps, averaged over the mini-batch, as an additional term in the loss function.

We refer to the model trained with this loss as Attention-Adversarially Robust Transformer (A-ART).
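The attention-map regularizer described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: attention maps are represented as nested lists of shape (heads, T, T), the plain (unsquared) Frobenius norm is used, and all function names are hypothetical.

```python
import math

def mean_over_heads(head_maps):
    """Average a list of per-head attention maps into a single map."""
    n_heads = len(head_maps)
    rows = len(head_maps[0])
    cols = len(head_maps[0][0])
    return [[sum(h[i][j] for h in head_maps) / n_heads for j in range(cols)]
            for i in range(rows)]

def frobenius_distance(a, b):
    """Frobenius norm of the element-wise difference of two matrices."""
    return math.sqrt(sum((u - v) ** 2
                         for row_a, row_b in zip(a, b)
                         for u, v in zip(row_a, row_b)))

def attention_regularizer(clean_batch, adv_batch):
    """Mean over the mini-batch of the distance between head-averaged maps."""
    dists = [frobenius_distance(mean_over_heads(c), mean_over_heads(a))
             for c, a in zip(clean_batch, adv_batch)]
    return sum(dists) / len(dists)

# Toy example: two heads, two frames.
clean = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
shifted = [[[0.0, 1.0], [1.0, 0.0]], [[0.5, 0.5], [0.5, 0.5]]]
penalty = attention_regularizer([clean, clean], [clean, shifted])
```

In training, a term of this kind would be weighted and added to the classification and adversarial losses; the weighting itself is a hyper-parameter.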

3 Experiments

Metrics | Test: Non-ART | Test: ART | Test: A-ART | Test: Gain | Adversarial Test: Non-ART | Adversarial Test: ART | Adversarial Test: A-ART | Adversarial Test: Gain
GAP | 91.97±0.02 | 92.38±0.02 | 92.00±0.02 | +0.03 | 72.89±0.04 | 81.97±0.02 | 91.04±0.02 | +18.15
PERR | 89.73±0.04 | 89.94±0.04 | 89.64±0.04 | -0.09 | 71.89±0.03 | 80.35±0.03 | 88.58±0.01 | +16.69
Hit@1 | 94.76±0.05 | 94.87±0.04 | 94.80±0.03 | +0.04 | 81.74±0.02 | 88.22±0.02 | 93.99±0.02 | +12.25
Table 1: Overall performance (in percentages) of Non-ART, ART and A-ART models for video categorization task on our YouTube-8M test sets. Each column represents the results for a training paradigm defined by the architecture and the loss function used. We present the mean and standard deviation obtained for five non-overlapping partitions of the entire test set. The Gain columns give the difference between A-ART and Non-ART.

Metrics | Test: Non-ART | Test: ART | Test: A-ART | Test: Gain | Adversarial Test: Non-ART | Adversarial Test: ART | Adversarial Test: A-ART | Adversarial Test: Gain
GAP | 93.93 | 94.12 | 94.14 | +0.21 | 74.88 | 84.38 | 88.54 | +13.66
PERR | 93.47 | 93.72 | 93.57 | +0.10 | 78.40 | 85.60 | 88.25 | +9.85
Hit@1 | 97.21 | 97.32 | 97.33 | +0.01 | 87.38 | 92.16 | 93.95 | +6.57
Table 2: Overall performance (in percentages) of baselines and different variations of our proposed model for video categorization task on Arts & Entertainment, the largest vertical of YT8M.

Metrics | Test: Non-ART | Test: ART | Test: A-ART | Test: Gain | Adversarial Test: Non-ART | Adversarial Test: ART | Adversarial Test: A-ART | Adversarial Test: Gain
GAP | 79.13 | 79.07 | 79.86 | +0.73 | 54.50 | 65.08 | 71.48 | +16.98
PERR | 80.59 | 83.85 | 82.25 | +1.66 | 64.77 | 70.89 | 75.50 | +10.73
Hit@1 | 87.70 | 92.31 | 90.77 | +3.07 | 69.25 | 74.77 | 85.60 | +16.35
Table 3: Overall performance (in percentages) of baselines and different variations of our proposed model for video categorization task on Finance, the smallest vertical of YT8M.

3.1 Experimental setup

We use the YouTube-8M dataset for our experiments, which consists of frame-wise video and audio features for approximately 5 million videos, extracted using Inception v3 and VGGish respectively, followed by PCA [1]. We use the hierarchical label space with 317 classes, a further modification of 431 categories (see [14]). We use binary cross-entropy loss to train our models. We evaluate our models using the three metrics mentioned in [9]: (i) Global Average Precision (GAP), (ii) Precision at Equal Recall Rate (PERR), and (iii) Hit@1. Our training set consists of approximately 4 million videos. We use 64000 videos from the official development set for validation and use the rest as the test set. Our baseline Transformer model consists of a single layer of multi-head attention with 8 attention heads for each of the audio and video modalities. For training, we used the Adam optimizer with an initial learning rate of 0.0002 and a batch size of 64. We compute the validation set GAP every 10000 iterations and perform early stopping with a patience of 5. We also use the validation GAP to drive a learning-rate scheduler that decreases the learning rate by a factor of 0.1 with a patience of 3.
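The interaction between early stopping and the learning-rate scheduler can be sketched by replaying a sequence of validation GAP values. This is a minimal illustration with assumed semantics: each trigger fires once the number of consecutive non-improving evaluations reaches its patience, and the function name and the sample GAP values are hypothetical.

```python
def run_schedule(val_gaps, lr=2e-4, factor=0.1, lr_patience=3, stop_patience=5):
    """Replay validation GAPs (one per evaluation, e.g. every 10000 iterations)."""
    best = float("-inf")
    since_best = 0      # non-improving evaluations since the best GAP
    since_lr_cut = 0    # non-improving evaluations since the best GAP or last LR cut
    history = []
    for step, gap in enumerate(val_gaps):
        if gap > best:
            best = gap
            since_best = 0
            since_lr_cut = 0
        else:
            since_best += 1
            since_lr_cut += 1
            if since_lr_cut >= lr_patience:
                lr *= factor          # decay the learning rate on a plateau
                since_lr_cut = 0
        history.append((step, gap, lr))
        if since_best >= stop_patience:
            break                     # early stopping
    return best, history

# Hypothetical validation curve: improves once, then plateaus.
gaps = [0.80, 0.85, 0.84, 0.84, 0.83, 0.84, 0.83, 0.82, 0.81]
best_gap, history = run_schedule(gaps)
```

With these settings the learning rate is cut once before training stops, so the scheduler gets a chance to rescue a plateau before early stopping ends the run.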

3.2 Results

We present results for video categorization using a baseline Transformer encoder with and without the proposed regularization terms. The hyper-parameters of the adversarial and attention-map regularization terms were set based on validation set results. We first analyze the effect of attention-map regularization on ART's performance on the test set consisting of original samples as well as adversarial samples generated using the perturbations computed as in equation 4. Then we perform a hyper-parameter analysis of the models trained adversarially. Finally, we highlight the robustness of A-ART to adversarial perturbations computed using the DeepFool method [12].

3.2.1 Performance on original and adversarial samples

From Table 1, we see that adversarial regularization improves on the performance of the Non-ART model when classifying the original samples in the test set. This shows that adversarial regularization can improve the generalizability of a model. A-ART and ART perform similarly on the original test set. Next, we investigate the robustness of the models to adversarial samples. Given a test sample and a trained model, we generate the corresponding adversarial perturbation using the FGSM-based method (equation 4). We observe that A-ART consistently and significantly outperforms ART, highlighting the importance of attention-map regularization for improving the adversarial robustness of Transformer-based models.

Hence, while A-ART performs similarly to ART on the original test samples, it significantly improves the adversarial robustness of the model.
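The mean and standard deviation reported in Table 1 are taken over five non-overlapping partitions of the test set. A minimal sketch of that procedure for Hit@1 (the fraction of samples whose top-scoring class is among the ground-truth labels) is shown below. The data layout, the function names, and the use of the sample standard deviation are assumptions made for illustration.

```python
import statistics

def hit_at_1(scores, labels):
    """Fraction of samples whose highest-scoring class is a ground-truth label."""
    hits = 0
    for sample_scores, truth in zip(scores, labels):
        top = max(range(len(sample_scores)), key=sample_scores.__getitem__)
        hits += top in truth
    return hits / len(scores)

def partition_mean_std(scores, labels, n_parts=5):
    """Mean and sample standard deviation of Hit@1 over contiguous partitions."""
    size = len(scores) // n_parts
    values = [hit_at_1(scores[k * size:(k + 1) * size],
                       labels[k * size:(k + 1) * size])
              for k in range(n_parts)]
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical predictions: class 0 always scores highest.
scores = [[0.9, 0.05, 0.05]] * 10
labels = [{0}, {0}, {0}, {1}, {0}, {0}, {1}, {2}, {0}, {0}]
mean_hit, std_hit = partition_mean_std(scores, labels)
```

GAP and PERR can be aggregated over the partitions in the same way once their per-partition values are available.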

3.2.2 Attention Map Regularization

Figure 2: Attention profile generated by ART (left) and A-ART (right) for a test sample and its adversary.

In Figure 2, we compare the attention profiles generated for a sample in the test set and its adversarial counterpart by the trained ART and A-ART models. We observe that the attention generated by A-ART is more robust to adversarial perturbations: the attention profiles of the original sample and its adversarial counterpart overlap to a great extent. ART, on the other hand, exhibits more variation in the attention profiles as a result of adversarial perturbations to the input. The figure also shows that the maximum attention given to any frame is an order of magnitude lower for the model trained with A-ART. In other words, the attention map generated by A-ART is smoother, enforcing the temporal coherence property, which has been shown to help video classification model performance [6, 11].

For the test set videos, we computed the mean square error (MSE) between the attention maps obtained for the original video and its adversarial counterpart using ART and A-ART, and then averaged it over the entire test set. We note that our proposed model, A-ART, drastically reduced the average MSE relative to ART. This indicates that A-ART has learnt not to be fooled by perturbations in the adversarial example and produces an almost identical attention map for both the original and the perturbed video.

3.2.3 Hyperparameter Tuning

Figure 3: Validation GAP for models with cross-entropy loss and adversarial loss: the neighborhood radius was varied with the adversarial loss weight fixed (left), and the adversarial loss weight was varied with the neighborhood radius fixed (right).

We aim to understand the impact of the two hyper-parameters, the neighborhood radius and the adversarial loss weight, on the model performance by perturbing one of them while keeping the other constant. Altering the neighborhood radius shows the impact of the smoothing radius around the data points on the model performance, while altering the adversarial loss weight changes the contribution of the adversarial loss to the overall optimization. The plots comparing the validation GAP for different values of the hyper-parameters are shown in Figure 3. First, the adversarial loss weight was kept fixed at 1 and the neighborhood radius was varied. For lower values of the radius, the ART model shows an improvement over the Non-ART model, up to a peak. As we increase the radius further, the model's performance starts deteriorating. This is expected, since the radius defines the neighborhood around an input feature vector over which the conditional distribution is smoothed. Increasing the radius of this neighborhood forces our model to learn smoother functions that cannot capture the complexity of the conditional distribution function, thereby decreasing its performance on the validation set. Similarly, as we increase the adversarial loss weight, the performance increases, peaks, and then starts reducing as the relative weight of classification goes down.

3.2.4 Adversarial Robustness

Moosavi-Dezfooli et al. [12] proposed a simple and accurate method for computing the robustness of different classifiers to adversarial perturbations. Given a data sample and a trained model, their method computes the minimum perturbation that can be added to the sample so that the model predicts it incorrectly. A robustness statistic is then computed by dividing the norm of this minimum perturbation by the norm of the actual features. Fooling an adversarially robust model requires a greater amount of perturbation to be added to the features, so this statistic should be higher for a more robust model. We compute it using our trained models and YT8M test set samples and show the results in Figure 4. It can be seen clearly that A-ART improves on the adversarial robustness of ART.
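The robustness statistic can be illustrated on a toy case. The sketch below assumes a linear binary classifier f(x) = w.x + b, for which the minimum L2 perturbation to the decision boundary has a closed form; DeepFool [12] iterates this linearization for general classifiers. The small overshoot factor, the function names, and the toy data are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def l2_norm(v):
    return math.sqrt(dot(v, v))

def minimal_perturbation(w, b, x, overshoot=0.02):
    """Smallest L2 step that carries x just across the boundary w.x + b = 0."""
    f = dot(w, x) + b
    scale = -(1.0 + overshoot) * f / dot(w, w)
    return [scale * wi for wi in w]

def average_robustness(w, b, samples):
    """Mean over samples of ||r(x)||_2 / ||x||_2; higher means more robust."""
    ratios = [l2_norm(minimal_perturbation(w, b, x)) / l2_norm(x) for x in samples]
    return sum(ratios) / len(ratios)

# Hypothetical classifier and samples.
w, b = [3.0, 4.0], 0.0
samples = [[3.0, 4.0], [1.0, 0.0]]
rho = average_robustness(w, b, samples)

# The perturbed first sample lands on the other side of the boundary.
r0 = minimal_perturbation(w, b, samples[0])
flipped_score = dot(w, [xi + ri for xi, ri in zip(samples[0], r0)]) + b
```

Normalizing by the feature norm makes the statistic comparable across samples whose features differ in scale.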

Figure 4: Average Robustness for all models. Our proposed model A-ART is more robust to adversarial examples than both ART and non-ART base model.

4 Conclusion

This paper presents two approaches to train adversarially robust Transformer models: (i) ART, an extension of image-based adversarially robust models to videos, and (ii) A-ART, an approach that further improves the robustness of the attention space as well as the output space. We show that ART and A-ART perform better than Non-ART on the test set. Moreover, compared to Non-ART and ART, A-ART shows a gain of 18% and 9% respectively in robustness to adversarial examples generated using the FGSM-based method. We also show enhanced robustness of A-ART to adversarial perturbations generated using DeepFool, and we observe that the attention map generated by A-ART is more robust to adversarial perturbations. In the future, we plan to investigate the robustness of intermediate embeddings so that they can be used to improve other video understanding tasks. We also plan to extend A-ART to raw video datasets in order to train and qualitatively evaluate on more realistic adversarial examples.


References

  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
  • [2] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
  • [4] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI).
  • [5] I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
  • [6] D. Huang, V. Ramanathan, D. Mahajan, L. Torresani, M. Paluri, L. Fei-Fei, and J. C. Niebles (2018) What makes a video a video: analyzing temporal information in video understanding models and datasets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7366–7375.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [9] J. Lee, W. Reade, R. Sukthankar, G. Toderici, et al. (2018) The 2nd YouTube-8M large-scale video understanding challenge. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops.
  • [10] T. Miyato, S. Maeda, M. Koyama, and S. Ishii (2018) Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (8), pp. 1979–1993.
  • [11] H. Mobahi, R. Collobert, and J. Weston (2009) Deep learning from temporal coherence in video. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 737–744.
  • [12] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2574–2582.
  • [13] N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14.
  • [14] S. Sahu, P. Goyal, S. Ghosh, and C. Lee (2020) Cross-modal non-linear guided attention and temporal coherence in multi-modal deep video models. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 313–321.
  • [15] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
  • [16] D. Tran, H. Wang, L. Torresani, and M. Feiszli (2019) Video classification with channel-separated convolutional networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5552–5561.
  • [17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762.
  • [18] X. Wei, J. Zhu, S. Yuan, and H. Su (2019) Sparse adversarial perturbations for videos. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8973–8980.
  • [19] W. Luo, B. Yang, and R. Urtasun (2018) Fast and furious: real time end-to-end 3D detection, tracking and motion forecasting with a single convolutional net. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3569–3577.
  • [20] B. Zoph and Q. Le (2017) Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR).