
Multi-modal Robustness Analysis Against Language and Visual Perturbations

Joint visual and language modeling on large-scale datasets has recently shown good progress in multi-modal tasks when compared to single modal learning. However, robustness of these approaches against real-world perturbations has not been studied. In this work, we perform the first extensive robustness study of such models against various real-world perturbations focusing on video and language. We focus on text-to-video retrieval and propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different textual perturbations. The study reveals some interesting findings: 1) The studied models are more robust when text is perturbed versus when video is perturbed. 2) The transformer text encoder is more robust on non-semantic changing text perturbations and visual perturbations compared to word embedding approaches. 3) Using two-branch encoders in isolation is typically more robust than when architectures use cross-attention. We hope this study will serve as a benchmark and guide future research in robust multimodal learning.

1 Introduction

The world is inherently sequential in nature. Human beings learn different skills sequentially and in a continual manner. Sequential data like video and language are natural forms of input to any intelligent vision system operating in the real world. Robustness of these intelligent systems against real-world distribution shifts is crucial for various applications including autonomous driving Ma et al. (2018); Guo et al. (2019); Deng et al. (2021); Nesti et al. (2022), medicine Ardulov et al. (2021); Ali et al. (2021); Itzkovich et al. (2019); Sung and Poon (2020), robotics Itzkovich et al. (2019); Yim et al. (2007); Lakomkin et al. (2018); Bednarek et al. (2020)

and others. In a multimodal setting where both language and video are used, these distribution shifts can occur for a variety of reasons. In video, these can include lighting, camera movement, digital compression, etc. In text, these can include spelling errors, incorrect synonym swapping, bias, etc. These distribution shifts can cause deep learning models to fail when deployed Hendrycks and Dietterich (2018); Coberly (2020); Deng et al. (2021).

It is crucial that these models are robust against such distribution shifts for a successful deployment. Robustness has been an active topic of research in deep learning. However, most of the effort is directed towards robustness against adversarial attacks Chakraborty et al. (2021); Alshemali and Kalita (2020); Dong et al. (2021). There are some recent efforts on robustness against real-world distribution shifts, but they focus on non-sequential image data Hendrycks and Dietterich (2018); Bhojanapalli et al. (2021); Hendrycks et al. (2021a) and natural language Wang et al. (2021) independently. Because video and text are vital sequential inputs for real-world intelligent systems, studying robustness in a multimodal setting is an important step towards developing reliable systems and has never been studied before.

In this work, we perform a large-scale analysis on the robustness of existing multimodal deep learning models for text-to-video retrieval. Text-to-video retrieval provides an important test scenario for a multimodal setting as it evaluates the similarity between video and text embeddings and how it may vary based on distribution shifts on one or both modalities. There are several questions about existing methods which are unanswered. Are these approaches robust to real-world corruptions in one modality and even both? Do we really need a heavy pre-training strategy for robustness or is training on the target dataset enough? Are the recently introduced transformer-based models better for robustness? Do these approaches utilize temporal modeling? Are these models biased? This study aims to be the first to answer some of these critical questions for multimodal deep learning models.

Towards this goal, we present two benchmark datasets to conduct robustness analysis on text-to-video retrieval. We utilize two widely used retrieval datasets MSRVTT Tan et al. (2020) and YouCook2 Zhou et al. (2018) and propose corresponding benchmark datasets, MSRVTT-P and YouCook2-P. In order to create these benchmarks, we introduce 90 different visual perturbations and 35 textual perturbations.

This study reveals several interesting observations about robustness of multimodal models: 1) The studied models are more robust when only text is perturbed as opposed to when only video is perturbed. 2) Specific to the use of text encoder, Word2Vec Mikolov et al. (2013) is relatively more robust on average when text is semantically changed as compared to BERT Devlin et al. (2018) but BERT is more robust when only video is perturbed. 3) When models utilize cross-attention between text and video, the studied models are typically less robust. Moreover, we also observe that pre-training on a larger dataset improves both performance and robustness (e.g. HowTo100M Miech et al. (2019b)).

We make the following contributions in this study:

  • First study to analyze the robustness of multimodal models against different real-world distribution shifts.

  • Two large-scale benchmark datasets (MSRVTT-P and YouCook2-P) to conduct robustness analysis on text-to-video retrieval.

  • Provide insights including comparison of different model architectures, training procedures and effect of various perturbations on model performance.

Figure 1: A conceptual diagram of video and text in a joint latent space where the original texts (circles) are closer to their paired videos than texts perturbed via typos (cross) or the removal of all words except nouns and verbs (triangle). Models are considered robust when the perturbed text is still closest to its respective video. The same should hold if the video is perturbed or both are perturbed.

2 Related Works

2.1 Robustness

Visual

Most recent works on robustness in the visual domain have focused on real-world distribution shifts as opposed to targeted attacks in the image domain Hendrycks and Dietterich (2018); Bhojanapalli et al. (2021); Hendrycks et al. (2021a); Sakaridis et al. (2021); Taori et al. (2019). In Hendrycks and Dietterich (2018); Recht et al. (2019), the authors analyze different image classification models on naturally occurring distribution shifts using ImageNet. While the benchmark study analyzing naturally occurring shifts in Taori et al. (2019) demonstrated that data augmentation is not sufficient for robustness, several studies have found that certain data augmentations do improve the robustness of deep learning image models Geirhos et al. (2018); Hendrycks et al. (2019); Yin et al. (2019). These data augmentations are often noise related Madry et al. (2018); Rusak et al. (2020); Lopes et al. (2019) but other transformations such as color or texture have been analyzed as well Geirhos et al. (2018); Yun et al. (2019); Cubuk et al. (2019); Hendrycks et al. (2019). These studies have not yet been extended to the video domain where temporal aspects are also present. Different from these works, this study will provide a benchmark on robustness of models against real-world perturbations in multi-modal settings.

Text

Research on robustness in the natural-language processing (NLP) field is far more extensive than in video. Some works on natural distribution shifts focus on semantically changing a phrase Gardner et al. (2020); Schlegel et al. (2020). In Gardner et al. (2020), the phrase is altered in small, meaningful ways that change the overall label in order to understand the decision boundaries of models. Similarly, Schlegel et al. (2020) alter text in ways that change the semantic meaning but keep the original text's lexical surface form. Other works inspired by Hendrycks et al. (2021b) focus on distribution shifts based on changes of grammar errors, dialects, speakers, and language Demszky et al. (2020), different domains Miller et al. (2020) and bias De-Arteaga et al. (2019); Prates et al. (2020). The image robustness research space has inspired many of these studies, but there are vast differences between NLP and vision that make these transfers difficult, such as the discrete vs. continuous search space as explained in Wang et al. (2021). Data augmentation has also been explored as a method to improve robustness and has shown substantial improvements Feng et al. (2021); Dhole et al. (2021); Chen et al. (2020); Hendrycks et al. (2019); Chen et al. (2021). These studies have not yet been extended to the multimodal domain where vision is also incorporated. Different from these works, this work will provide a large-scale benchmark on robustness for multimodal models against real-world perturbations.

Multimodal

Evaluating robustness in multimodal models is more difficult because there are more possible attack vectors: it is possible to attack the entire model while perturbing only one of the modalities used or a varying number of them. Focusing on single-modality attacks, Yang et al. (2021) investigated the robustness of multimodal neural networks against worst-case (i.e., adversarial) perturbations on a single modality. Looking at multimodal attacks, Tian and Xu (2021) evaluated audio-visual models by running adversarial attacks on audio, visual, and both modalities. Such studies have not been performed on naturally occurring distribution shifts and have not looked at multimodal models that use text and video, two modalities that are drastically different.

2.2 Multimodal Modelling

Multimodal modeling with text and vision has drastically improved since the emergence of both the HowTo100M dataset Miech et al. (2019b) and the transformer architecture Devlin et al. (2018). The highest performing models Luo et al. (2020); Xu et al. (2021, 2021); Rouditchenko et al. (2020); Akbari et al. (2021); Patrick et al. (2020) pre-train on HowTo100M Miech et al. (2019b), and most use pre-extracted visual features from the original multimodal model of Miech et al. (2020), which uses an S3D-G backbone Xie et al. (2018). For learning a joint visual-text space, these models often use a contrastive learning objective between visual and text embeddings Miech et al. (2020); Xu et al. (2021); Akbari et al. (2021); Rouditchenko et al. (2020), while some use an alignment-based objective with masked modeling Luo et al. (2020); Xu et al. (2021). Many of the contrastive approaches Xu et al. (2021); Miech et al. (2020); Rouditchenko et al. (2020) use a two-branch encoder design, where video and text each have their own encoder and the objective moves the two outputs closer to each other in latent space. Some approaches Luo et al. (2020); Xu et al. (2021); Akbari et al. (2021) additionally utilize a cross-encoder before comparing outputs. On the text-to-video retrieval task, models that utilize a cross-encoder typically perform better than those that keep the encodings separate. This work provides analysis on whether these cross-encoder models are more robust as well as higher performers.

3 Distribution Shift

Existing research in multimodal learning is mostly focused on training and testing the proposed methods on a benchmark dataset with little to no distribution shift from training to testing samples. While models often use a video encoder that is pre-trained on a very large, noisy dataset, e.g. HowTo100M Miech et al. (2019b), there is no understanding of how, in a multimodal setting, a distribution shift will affect the similarity between the text and video embeddings for multimodal tasks. To study the effect of distribution shift, we introduce five categories of visual perturbations and seven categories of text perturbations. More details about these perturbations are provided in the Appendix.

3.1 Visual Perturbations

First, we extend image-based visual perturbations from Hendrycks and Dietterich (2019) to videos. Next, we add temporal perturbations to address the time dimension and video compression to address video-specific distribution shifts. The full set of visual perturbations falls into five categories: Noise, Blur, Temporal, Camera and Digital. Each visual perturbation has a severity ranging from 1 to 5, where the greater the severity, the more challenging and perturbed the video is. Blur, Noise, and Camera perturbations are all applied frame by frame. Noise includes Impulse, Gaussian, Shot, and Speckle; Blur includes Zoom, Defocus and Motion; and Camera includes StaticRotate, Rotation and Translation.

The Digital and Temporal perturbations are added in order to include real-world distribution shifts specific to video. Digital perturbations relate to compression and video-streaming quality; we evaluated models on JPEG, MPEG1 and MPEG2. JPEG is a lossy image compression, MPEG1 compresses video without excessive quality loss, and MPEG2 is a lossy video compression similar to MPEG1. Temporal perturbations focus on the time dimension of a video and include Sampling, Reverse Sampling, Jumbling, Box Jumbling and Freeze. Sampling slows the playback speed by sampling frames uniformly at varying rates, and Reverse Sampling does so in the reverse order of the original sequence. Jumbling splits a video into segments and randomly shuffles the frames within each segment, while Box Jumbling randomly shuffles the segments themselves. Freeze simulates a buffering live stream by freezing on random frames for random durations.
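
The visual side of the benchmark amounts to applying each of the 18 base perturbations at all 5 severity levels (90 variants in total) to every test video. Below is a minimal sketch of how such an enumeration could be organized, assuming videos are loaded as (T, H, W, C) uint8 NumPy arrays; the function and registry names are illustrative, and only one perturbation (Gaussian noise, with the severity-indexed scales listed in the Appendix) is shown:

```python
import numpy as np

def gaussian_noise(video, severity):
    """Illustrative frame-level perturbation; video is a (T, H, W, C) uint8 array."""
    sigma = [0.08, 0.12, 0.18, 0.26, 0.38][severity - 1]   # severity-indexed strength
    x = video.astype(np.float32) / 255.0
    return (np.clip(x + np.random.normal(scale=sigma, size=x.shape), 0, 1) * 255).astype(np.uint8)

# The benchmark would register all 18 perturbations here; one is shown for brevity.
PERTURBATIONS = {"gaussian_noise": gaussian_noise}

def enumerate_perturbed_sets(test_videos):
    """Yield (perturbation, severity, perturbed copy of the test set) for each combination."""
    for name, fn in PERTURBATIONS.items():
        for severity in range(1, 6):
            yield name, severity, {vid: fn(frames, severity)
                                   for vid, frames in test_videos.items()}
```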

3.2 Text Perturbations

We group text perturbations into three different types: natural, machine-based, and synthetic. Machine-based perturbations use a model to alter the text, while natural perturbations imitate real-world mistakes made when generating text. Synthetic perturbations are less natural and are used to gain a greater understanding of the models. The textual perturbations are further grouped into seven categories, ChangeChar, AddText, Bias, Positional, DropText, SwapText and TextStyle, with a total of 35 different perturbations. ChangeChar refers to any perturbation that changes a character in word(s). SwapText is any machine-learning-based perturbation that swaps word(s) from the original phrase.

AddText includes appending irrelevant phrases to text or inserting adverbs. TextStyle perturbations change the original text's style, e.g., making it passive Czeresnia Etinger and Black (2019). Bias perturbations include switching the gender of word(s) in a phrase Reynolds (2022). We additionally include changing all male references to female, the reverse, and converting all gender-specific references to gender-neutral ones.

DropText perturbations are synthetic and drop words based on their part-of-speech (POS) tag. These perturbations are included to gain a better understanding of word-level attention, more specifically, to understand whether models attend more to objects, actions or context. DropNN, DropVB, and DropVBNN are different variations of dropping words based on whether the POS tags are Noun and/or Verb. Because there are often more nouns in a sentence, we have an additional perturbation RandNN where only one noun is dropped randomly as opposed to all. For example, "a little girl does gymnastics" becomes "a little [UNK] does gymnastics". In order to evaluate attention to contextual words, OnlyNN, OnlyVB, and OnlyNNVB drop all words except those with POS NN and/or VB. Positional perturbations are machine-based and alter the phrase based on word position; they are used to evaluate the models based on the position of words in a phrase. These include DropFirst, DropLast, DropFirstandLast, and ShuffleOrder. Drop-related perturbations replace the word at that position with an [UNK] tag, and the ShuffleOrder perturbation shuffles the words in a phrase randomly. More details on the generated text perturbations are provided in the Appendix.
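
As a concrete illustration, here is a minimal sketch of how a POS-based DropText perturbation could be implemented with NLTK (which the Appendix cites for text processing); the function name and the exact tag handling are assumptions, not the benchmark code:

```python
import nltk  # POS tagging; assumes nltk.download("punkt") and
             # nltk.download("averaged_perceptron_tagger") have been run

def drop_by_pos(caption, pos_prefixes=("NN",), keep_only=False):
    """Replace words with [UNK] based on their POS tag.

    pos_prefixes=("NN",)                      -> drop all nouns (NoNN-style)
    pos_prefixes=("NN", "VB"), keep_only=True -> keep only nouns/verbs (NN&VBOnly-style)
    """
    tagged = nltk.pos_tag(nltk.word_tokenize(caption))
    out = []
    for word, tag in tagged:
        matches = tag.startswith(pos_prefixes)
        drop = (not matches) if keep_only else matches
        out.append("[UNK]" if drop else word)
    return " ".join(out)

print(drop_by_pos("a little girl does gymnastics"))
# -> a little [UNK] does [UNK]            (cf. NoNN in Table 6)
print(drop_by_pos("a little girl does gymnastics", ("NN", "VB"), keep_only=True))
# -> [UNK] [UNK] girl does gymnastics     (cf. NN&VBOnly in Table 6)
```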

Model | Params | Text Input | Text Encoder | Video Input | Video Encoder
HowTo100M MIL | 31.2M | Raw | Word2Vec | Raw | S3D-G
VideoClip | 177.4M | Raw | BERT | S3D-G | MLP+Transformer
UniVL | 153.7M | Raw | BERT | S3D-G | Transformer
COOT | 7.6M | BERT | Transformer | S3D-G | Transformer
Table 1: Details of the self-supervised multimodal models used in this study, including HowTo100M MIL Miech et al. (2020), VideoClip Xu et al. (2021), UniVL Luo et al. (2020) and COOT Ging et al. (2020), and their respective encoders. These encoders include S3D Miech et al. (2020); Xie et al. (2018), Word2Vec Mikolov et al. (2013), and BERT Devlin et al. (2018).

4 Model Variants

We perform our experiments on four different self-supervised multimodal models based on CNN and transformer architectures. The goal is to benchmark multiple pre-training approaches while simultaneously studying the behavior of CNN- and transformer-based models for robustness in text-to-video retrieval. We evaluate the most popular multimodal approach, HowTo100M MIL-NCE Miech et al. (2020), which uses a CNN backbone and Word2Vec word embeddings with an MIL-NCE contrastive loss between text-video pairs. We further evaluate models and approaches that utilize visual features from Miech et al. (2020) with further training and different self-supervised objectives. The recently popular method VideoClip Xu et al. (2021) is a transformer-based approach relying instead on BERT Devlin et al. (2018) for both text and video encodings with a similar but improved contrastive loss. COOT Ging et al. (2020) similarly uses transformer-based encoders, taking BERT text features and S3D visual features as input, and includes cross-attention between the text and video features. Rather than a contrastive loss with negative pairing, COOT focuses on alignment between text and video alone. The final approach evaluated, UniVL Luo et al. (2020), is another transformer-based approach that uses a cross-encoding transformer in addition to separate encoders as part of its self-supervised objective. More details on these approaches are shown in Table 1.
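
To make the architectural distinction concrete, below is a minimal conceptual sketch (in PyTorch) of a two-branch retrieval head versus a cross-attention head. The layer sizes, feature dimensions, and module names are illustrative assumptions and do not reproduce the actual HowTo100M MIL, VideoClip, COOT, or UniVL implementations:

```python
import torch
import torch.nn as nn

class TwoBranchRetrieval(nn.Module):
    """Separate encoders; similarity is a dot product in a shared embedding space."""
    def __init__(self, dim=512):
        super().__init__()
        self.video_proj = nn.Linear(1024, dim)   # e.g. on pooled S3D-style features
        self.text_proj = nn.Linear(768, dim)     # e.g. on pooled BERT-style features

    def forward(self, video_feat, text_feat):
        v = nn.functional.normalize(self.video_proj(video_feat), dim=-1)
        t = nn.functional.normalize(self.text_proj(text_feat), dim=-1)
        return t @ v.T                           # (num_texts, num_videos) similarity matrix

class CrossAttentionRetrieval(nn.Module):
    """A joint encoder attends across modalities before scoring each candidate pair."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)

    def forward(self, video_tokens, text_tokens):
        # text queries attend over video tokens for a single candidate pair
        fused, _ = self.attn(text_tokens, video_tokens, video_tokens)
        return self.score(fused.mean(dim=1))     # one relevance score per pair
```

One practical consequence, consistent with the comparisons in Section 6, is that in the two-branch case a perturbation in one modality only moves that modality's embedding, whereas with cross-attention it also changes how the other modality is fused.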

5 Robustness Benchmarks and Evaluation

5.1 Datasets

We use two multimodal datasets for our experiments: MSRVTT Tan et al. (2020) and YouCook2 Zhou et al. (2018). MSRVTT is a video captioning dataset which consists of 10,000 clips with an average length of 10 seconds each. These videos show a variety of activities that can be organized into 20 categories. We follow JSFusion Yu et al. (2018); Miech et al. (2020); Xu et al. (2021), which randomly samples 1K clip-text pairs as test data for evaluation. YouCook2 is a task-oriented cooking dataset with 2,000 long untrimmed videos from 89 cooking recipes. Each video is annotated with captions with provided temporal boundaries, allowing each video to be split into a set of clips. There are 3,305 test clip-text pairs from 457 videos for evaluation.

Captions in the MSRVTT and YouCook2 datasets are quite different. YouCook2 captions contain no indication of gender and comprise 2x more nouns than MSRVTT captions, while MSRVTT has a more uniform distribution of words and a vocabulary that is 568 unique words larger. The videos also differ: YouCook2 videos are long, complex activities split into clips with temporally bounded annotations, so its test set contains multiple clips from the same video, while all test clips in MSRVTT are from different videos. This means the distributions of the two datasets are different and may result in different observations.

We apply 90 visual perturbations to the test videos, 31/35 text perturbations to the captions, and 66 combined visual and text perturbations to create the robustness benchmarks YouCook2-P and MSRVTT-P. YouCook2-P does not include gender-related perturbations because its captions contain no references to gender, so only 31 text perturbations are used. MSRVTT-P consists of 90,000 videos and 35,000 captions, resulting in 2,766,000 video-text pairs. YouCook2-P consists of 41,130 videos split into 301,500 clips and 103,850 captions, resulting in 9,266,100 clip-text pairs.

5.2 Tasks and Evaluation Metrics

We evaluate the performance of models on text-to-video retrieval Miech et al. (2020) and VideoQA Yu et al. (2018). We use retrieval rate R@K Miech et al. (2020) for text-to-video retrieval and accuracy for VideoQA Yu et al. (2018). To measure robustness, we define two metrics: one for absolute retrieval drop and the other for relative retrieval drop. Given a trained model, we first compute retrieval R_c on the clean test set. Next, we test this model on a perturbation p and obtain retrieval R_p for that perturbation. The absolute robustness is computed for each perturbation as γ^a_p = 1 − (R_c − R_p)/100 (with R@K expressed as a percentage), and the relative robustness as γ^r_p = 1 − (R_c − R_p)/R_c. For visual perturbations, the aggregated performance of a model can be obtained by averaging over all severity levels to get a per-perturbation score and over all perturbations to get an overall score. For text perturbations, the aggregated performance of a model can be obtained by averaging across sub-types rather than severity. The robustness score will usually range from 0 to 1, where 0 indicates a model is not robust and 1 is where the model is entirely robust. A score greater than 1 indicates that the model's performance is better with the perturbation.
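
A minimal sketch of how these robustness scores could be computed from clean and perturbed R@K values follows; the helper names and the dictionary layout for aggregation are illustrative assumptions:

```python
import numpy as np

def relative_robustness(r_clean, r_perturbed):
    """gamma^r_p = 1 - (R_c - R_p) / R_c; 1 means no drop, >1 means the perturbation helped."""
    return 1.0 - (r_clean - r_perturbed) / r_clean

def absolute_robustness(r_clean, r_perturbed):
    """gamma^a_p = 1 - (R_c - R_p) / 100, with R@K expressed as a percentage."""
    return 1.0 - (r_clean - r_perturbed) / 100.0

def aggregate(r_clean, perturbed_scores):
    """perturbed_scores: dict perturbation -> list of R@K values (one per severity or sub-type).
    Returns per-perturbation relative robustness and the overall average."""
    per_pert = {p: float(np.mean([relative_robustness(r_clean, r) for r in rs]))
                for p, rs in perturbed_scores.items()}
    return per_pert, float(np.mean(list(per_pert.values())))

# Example: aggregate(32.1, {"gaussian_noise": [30.5, 28.2, 25.0, 20.1, 14.3]})
```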

Figure 2: A comparison of the studied models under different training protocols (zs: zero-shot learning, ft: fine-tuning on the target dataset, and scratch: no pre-training). The y-axis shows relative robustness, the x-axis represents performance using R@5, and the size of the marker represents the number of model parameters. The three plots (left to right) correspond to visual, text, and visual+text perturbations, respectively, on the YouCook2 dataset for text-to-video retrieval.
Method | Training | Blur | Camera | Digital | Noise | Temporal
COOT | scratch | 0.46±0.29 | 0.86±0.10 | 0.66±0.33 | 0.13±0.15 | 0.59±0.39
VideoClip | scratch | 0.44±0.24 | 0.86±0.12 | 0.70±0.31 | 0.11±0.15 | 0.61±0.38
HowTo100M MIL | zero-shot | 0.59±0.23 | 0.84±0.11 | 0.75±0.29 | 0.14±0.17 | 0.55±0.38
UniVL Align | zero-shot | 0.53±0.24 | 0.79±0.21 | 0.67±0.31 | 0.09±0.12 | 0.54±0.38
VideoClip | zero-shot | 0.47±0.30 | 0.85±0.11 | 0.75±0.29 | 0.11±0.18 | 0.57±0.40
UniVL Align | finetune | 0.42±0.28 | 0.80±0.13 | 0.71±0.30 | 0.11±0.15 | 0.56±0.39
VideoClip | finetune | 0.47±0.30 | 0.83±0.11 | 0.74±0.29 | 0.11±0.15 | 0.57±0.39
Table 2: Relative robustness scores for each category of video perturbations on YouCook2-P.

5.3 Implementation Details

To ensure fairness to the original models, we use the official model implementations that were available with pre-trained weights, with the same experimental setup as described in these works. These protocols vary between models and datasets. HowTo100M-MIL Miech et al. (2020) takes raw video as input and splits the temporal boundary of the passed video into 4 clips, with 32 frames for YouCook2 and 16 frames for MSRVTT. It takes raw text as input and embeds each word using Word2Vec. VideoClip Xu et al. (2021) and COOT Ging et al. (2020) use pre-extracted features from the pre-trained S3D-G Xie et al. (2018) model provided by Miech et al. (2020), while UniVL Luo et al. (2020) uses pre-extracted features from the same model but before the final layer, resulting in a smaller embedding size. VideoClip and UniVL take raw text as input while COOT Ging et al. (2020) uses pre-extracted text features from BERT Devlin et al. (2018). These details are summarized in Table 1.

We also analyze models based on whether they are fine-tuned, pre-trained or trained from scratch. VideoClip, HowTo100M-MIL and UniVL are pre-trained on HowTo100M Miech et al. (2019a). Evaluating models using only pre-trained weights is considered zero-shot (ZS). VideoClip and UniVL were additionally fine-tuned on MSRVTT and YouCook2 and are considered fine-tuned (FT). Models that are trained on the evaluation datasets without pre-training are considered scratch.

6 Experiments

We perform our experiments with the studied models on YouCook2-P and MSRVTT-P benchmarks. A summarized overview of the robustness analysis of models against different perturbations on YouCook2-P is shown in Figure 2. Table 2 shows the average robustness scores for each category of visual perturbations and Table 3 shows the average robustness scores across the subcategories of text perturbations. More detailed results are provided in Appendix. Next, we provide more insights and analysis on different interesting observations in this study.

Method | Training | AddText | ChangeChar | DropText | Positional | SwapText | TextStyle
COOT | scratch | 0.88±0.12 | 0.18±0.29 | 0.41±0.37 | 0.76±0.12 | 0.51±0.43 | 0.57±0.51
VideoClip | scratch | 0.92±0.03 | 0.74±0.14 | 0.57±0.39 | 0.83±0.15 | 0.75±0.18 | 0.98±0.01
HowTo100M MIL | zero-shot | 0.91±0.08 | 0.74±0.09 | 0.45±0.33 | 0.76±0.07 | 0.78±0.15 | 0.95±0.02
UniVL Align | zero-shot | 1.14±0.03 | 0.75±0.10 | 0.43±0.41 | 0.80±0.24 | 0.75±0.17 | 0.94±0.09
VideoClip | zero-shot | 0.95±0.03 | 0.84±0.10 | 0.50±0.35 | 0.82±0.09 | 0.81±0.18 | 0.99±0.07
UniVL Align | finetune | 0.85±0.12 | 0.62±0.13 | 0.37±0.33 | 0.69±0.11 | 0.72±0.18 | 0.92±0.05
VideoClip | finetune | 0.95±0.04 | 0.77±0.10 | 0.47±0.33 | 0.70±0.13 | 0.77±0.14 | 0.94±0.04
Table 3: Relative robustness scores for each category of text perturbations on YouCook2-P.
Figure 3: Examples of perturbations that humans are able to perceive but models struggle with.

Training Strategy

On machine-based and natural text perturbations for MSRVTT-P, fine-tuned models are slightly more robust on average. However, on synthetic perturbations, fine-tuned models are 6-8% more robust. With visual perturbations on MSRVTT-P, zero-shot models are 3% more robust than fine-tuned models. On machine-based and natural text perturbations on YouCook2-P, fine-tuned models are 2-3% more robust than zero-shot models, and models trained from scratch are 5-10% less robust than those fine-tuned. This indicates that pre-training on long, complex activities is more impactful on robustness. When video is perturbed on YouCook2-P, pre-training followed by fine-tuning is 4% more robust than training from scratch. In summary, pre-training typically improves both performance and robustness against real-world distribution shifts, even when the model is not fine-tuned on the target dataset, and especially on short clips drawn from long, complex activities.

Human Perceivable Perturbations

Noise and Blur are pixel-based visual perturbations which humans can easily filter out (Figure 3). They are also the perturbations models are least robust to, as shown in Table 2. On both datasets, among the semantic-preserving text perturbations, models are least robust to ChangeChar, indicating that text models are still unable to recognize small changes that humans readily perceive in text (Figure 3). This indicates that multimodal models are not robust to human-perceivable changes such as character changes in word(s) and additive noise. Since these perturbations simulate real-world scenarios, this calls for further research on how to improve robustness in this area.

Cross-Attention

Columns are grouped as Text, Video, and Video+Text; each group reports All / zero-shot / fine-tuned.
MSRVTT | All | zero-shot | fine-tuned | All | zero-shot | fine-tuned | All | zero-shot | fine-tuned
cross | 0.85±0.13 | 0.84±0.14 | 0.86±0.11 | 0.66±0.29 | 0.67±0.30 | 0.65±0.29 | 0.55±0.24 | 0.55±0.25 | 0.56±0.24
two-branch | 0.85±0.12 | 0.84±0.13 | 0.88±0.11 | 0.63±0.32 | 0.63±0.30 | 0.61±0.34 | 0.54±0.25 | 0.54±0.25 | 0.56±0.25
YouCook2 | All | zero-shot | fine-tuned | All | zero-shot | fine-tuned | All | zero-shot | fine-tuned
cross | 0.73±0.31 | 0.85±0.18 | 0.83±0.14 | 0.53±0.36 | 0.50±0.36 | 0.55±0.36 | 0.37±0.28 | 0.35±0.29 | 0.36±0.26
two-branch | 0.83±0.16 | 0.83±0.15 | 0.88±0.14 | 0.54±0.37 | 0.53±0.37 | 0.57±0.38 | 0.39±0.29 | 0.41±0.29 | 0.38±0.29
Table 4: Relative Robustness scores across both encoder type and training procedure. Two-branch encoder approaches are typically more robust.

Next, we compare models that use cross-attention between video and text and those that keep the video and text encoders separate. Table 4 aggregates the relative robustness across categories, excluding synthetic text-based perturbations, for models that use cross-attention between video and text and ones that do not. MSRVTT shows a significant difference between approaches when video is perturbed, where cross-attention is 3-4% more robust. On YouCook2, there is a significant difference between the two when text is perturbed, with a 10% gap ignoring pre-training strategy. Models trained from scratch are roughly 30% less robust when using cross-attention and text is perturbed, but this difference is less notable when video is perturbed. This indicates that cross-attention is not robust when models are not pre-trained, but is equally robust otherwise.

In summary, a two-branch approach is typically more robust than one that uses cross-attention, without forgoing performance, as UniVL and VideoClip are both top performers where one uses cross-attention and the other does not (see Figure 2).

Figure 4: Relative robustness scores on combinations of textual and visual perturbations. The x-axis shows the text perturbation and the y-axis shows the visual perturbation. The first row/column shows scores for the respective text/video perturbation alone. An example of a compounding effect is circled in green, while gender perturbations are boxed in blue. The results indicate a two-branch encoder is on average more robust when both video and text are perturbed.

Text Encoders

In this section we compare models that use a Word2Vec Mikolov et al. (2013) text encoder with those that use a BERT encoder Devlin et al. (2018). When text is perturbed with DropText, Positional and ChangeChar perturbations, Word2Vec is more robust than BERT under zero-shot evaluation (see Figure 5). On ChangeChar, Word2Vec was 2-6% more robust than BERT; a character change may make a word unknown to BERT, which is similar to dropping the word and losing the overall context of the phrase (Figure 3). These results indicate that BERT relies on semantic meaning and positional ordering and is therefore less robust when semantic meaning is changed drastically by dropping words or changing characters, while Word2Vec relies less on semantic meaning and is therefore more robust when words are dropped. When video is perturbed, BERT is significantly more robust on shorter videos, but on clips from long, complex videos the difference is 1%.

Figure 5: Retrieval rate for DropText perturbations on YouCook2-P. Models are surprisingly robust when words are dropped with Word2Vec typically being more robust as compared to BERT.

In summary, Word2Vec is more robust to semantic-changing text perturbations and equally robust to video perturbations when clips come from long, complex videos, while BERT is more robust to non-semantic-changing text perturbations and to video perturbations on short videos.

MultiModal Perturbations

To understand the compounding effects of shifting distributions in both the visual and text domains, we select a subset of perturbations with at least one from each category. For visual perturbations, we use a severity of 3. Figure 4 shows a summary of these results on the YouCook2 dataset. One observation is that, even with perturbations on both text and vision, a pre-trained model is unsurprisingly more robust than one trained from scratch. Another observation is that certain combinations of perturbations are more challenging for models than others. For example, when MultiPosSwapNN and StaticRotate are combined for HowTo100M MIL, the robustness score drops below that of either perturbation in isolation (Figure 7c).

Figure 6: Models are slightly less robust when gender is swapped or male references are changed to female. BERT-based models are highly robust to gender changes to neutral.

Meanwhile, some perturbation combinations stay close to the lower of the two individual scores, e.g., dropping nouns and verbs combined with shot noise. Even when a model is equally robust to two perturbations in isolation (e.g., Jumble and GenderMale on HowTo100M MIL), there is a decrease in overall robustness when the two are combined. In summary, when both text and video are perturbed, models are even less robust than when the same perturbations are applied in isolation. Furthermore, two-branch, isolated approaches are more robust when both modalities are perturbed as opposed to approaches that use cross-attention.

Bias

To evaluate bias in models, we evaluate robustness to gender-specific changes to text on the MSRVTT dataset. In MSRVTT, the most common part-of-speech (POS) tagged nouns were "man" and "woman", with roughly 2x as many references to "man" as to "woman". When the original text was perturbed, male references were converted to female, female references were converted to male, gendered phrases had their gender swapped, and gender references were made neutral. Figure 6 visualizes these results, where the dotted horizontal line is the original text-to-video retrieval score and the bars are the new scores with the perturbed version of the text. The results indicate that models are less robust when all references are made female and when the gender is swapped from male to female and vice versa.

Temporal

Temporal perturbations are used to evaluate whether models use temporal information. Figure 7 visualizes the results of these experiments. Models show strong robustness to the visual temporal perturbations Jumble, Sampling, and Freeze. However, on YouCook2, which consists of untrimmed, minutes-long videos, none of the models are robust to BoxJumble. This indicates that the models require alignment between the visual cues and the respective text, but temporal order within the aligned segment is not utilized. These results indicate that both visual and textual cues are used during learning, but the models attend more to objects and scene than to motion and activity. This is similar to how humans may describe different videos, where nouns and descriptors are more differentiating than activities, which could describe a group of videos.

VideoQA Method video text
videoclip scratch 0.74 0.76 0.93 0.94
videoclip zero-shot 0.81 0.86 0.88 0.91
videoclip fine-tune 0.78 0.79 0.93 0.93
Table 5: Robustness on VideoQA task.

Cross-Task Evaluation

To understand whether these findings extend to other multimodal tasks, we evaluated VideoClip on the multiple-choice VideoQA task, with results in Table 5. We find that, as in text-to-video retrieval, VideoClip is more robust to text perturbations than to video perturbations. Pre-training is also less impactful on robustness when text is perturbed and the model chooses among a smaller candidate set. However, consistent with our previous observations, pre-training improves robustness when video is perturbed. This further indicates that pre-training improves robustness to visual perturbations.

Figure 7: Temporal perturbations for text (a) and video (b). When text is perturbed, we find that the Word2Vec approach is relatively more robust on average, especially when words are shuffled and the order is lost. When video frames are shuffled in Jumble (top), models are still highly robust. When segments are shuffled in BoxJumble (bottom) and the phrase no longer matches the segment, models fail. Similarity matrix (c) between video and text on VideoClip showing a compounded effect on robustness when both are perturbed.

7 Conclusion

In this work we study the visual and textual robustness of several multimodal models. In order to perform this study, we create two benchmark datasets, MSRVTT-P and YouCook2-P. Our empirical study provides several interesting insights into the behavior of some of the existing models on the proposed benchmarks. Our key observations are:

  • Models are generally more robust when only text is perturbed as opposed to when only video is perturbed.

  • Specific to the text encoder used, Word2Vec Mikolov et al. (2013) is relatively more robust on average when text is semantically perturbed as compared to BERT Devlin et al. (2018) which is more robust when perturbations do not change semantic meaning.

  • When models utilize cross-attention between text and video, models are typically more robust when video is perturbed but less robust when text is perturbed.

  • Models that are pre-trained and fine-tuned are typically more robust with improved performance as opposed to models trained from scratch.

  • Fine-tuned models are less robust to gender-based perturbations.

  • When both text and video are perturbed, two-branch architectures are more robust.

The findings and the benchmark in this work can potentially open up interesting future research on robustness of multimodal learning. The benchmark introduced in this study will be released publicly at https://mmvr-neurips.github.io/MultiModalRobustness/.

References

  • H. Akbari, L. Yuan, R. Qian, W. Chuang, S. Chang, Y. Cui, and B. Gong (2021) VATT: transformers for multimodal self-supervised learning from raw video, audio and text. CoRR abs/2104.11178. External Links: Link, 2104.11178 Cited by: §2.2.
  • S. Ali, F. Zhou, A. Bailey, B. Braden, J. E. East, X. Lu, and J. Rittscher (2021) A deep learning framework for quality assessment and restoration in video endoscopy. Medical Image Analysis 68, pp. 101900. Cited by: §1.
  • B. Alshemali and J. Kalita (2020) Improving the reliability of deep neural networks in nlp: a review. Knowledge-Based Systems 191, pp. 105210. Cited by: §1.
  • V. Ardulov, V. R. Martinez, K. Somandepalli, S. Zheng, E. Salzman, C. Lord, S. Bishop, and S. Narayanan (2021) Robust diagnostic classification via q-learning. Scientific reports 11 (1), pp. 1–9. Cited by: §1.
  • M. Bednarek, P. Kicki, and K. Walas (2020) On robustness of multi-modal fusion—robotics perspective. Electronics 9 (7), pp. 1152. Cited by: §1.
  • S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, and A. Veit (2021) Understanding robustness of transformers for image classification. arXiv preprint arXiv:2103.14586. Cited by: §1, §2.1.
  • S. Bird, E. Klein, and E. Loper (2009) Natural language processing with python: analyzing text with the natural language toolkit. " O’Reilly Media, Inc.". Cited by: §A.2.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. External Links: Link Cited by: §A.3.
  • A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay (2021) A survey on adversarial attacks and defences. CAAI Transactions on Intelligence Technology 6 (1), pp. 25–45. Cited by: §1.
  • J. Chen, D. Shen, W. Chen, and D. Yang (2021) Hiddencut: simple data augmentation for natural language understanding with better generalizability. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4380–4390. Cited by: §2.1.
  • J. Chen, Z. Yang, and D. Yang (2020) Mixtext: linguistically-informed interpolation of hidden space for semi-supervised text classification. arXiv preprint arXiv:2004.12239. Cited by: §2.1.
  • C. Coberly (2020) AI failure in real-world application. Techspot. Note: https://www.techspot.com/news/87431-ai-powered-camera-zooms-bald-head-instead-soccer.html. [Accessed: Oct 18, 2021] Cited by: §1.
  • E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019) Autoaugment: learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123. Cited by: §2.1.
  • I. Czeresnia Etinger and A. W. Black (2019) Formality style transfer for noisy, user-generated conversations: extracting labeled, parallel data from unlabeled corpora. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Hong Kong, China, pp. 11–16. External Links: Link, Document Cited by: §A.2, §3.2.
  • M. De-Arteaga, A. Romanov, H. Wallach, J. Chayes, C. Borgs, A. Chouldechova, S. Geyik, K. Kenthapadi, and A. T. Kalai (2019) Bias in bios: a case study of semantic representation bias in a high-stakes setting. In proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 120–128. Cited by: §2.1.
  • D. Demszky, D. Sharma, J. H. Clark, V. Prabhakaran, and J. Eisenstein (2020) Learning to recognize dialect features. arXiv preprint arXiv:2010.12707. Cited by: §2.1.
  • Y. Deng, T. Zhang, G. Lou, X. Zheng, J. Jin, and Q. Han (2021) Deep learning-based autonomous driving systems: a survey of attacks and defenses. IEEE Transactions on Industrial Informatics 17 (12), pp. 7897–7912. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv. External Links: Document, Link Cited by: §1, §2.2, Table 1, §4, §5.3, §6, 2nd item.
  • K. D. Dhole, V. Gangal, S. Gehrmann, A. Gupta, Z. Li, S. Mahamood, A. Mahendiran, S. Mille, A. Srivastava, S. Tan, et al. (2021) NL-augmenter: a framework for task-sensitive natural language augmentation. arXiv preprint arXiv:2112.02721. Cited by: §2.1.
  • X. Dong, A. T. Luu, M. Lin, S. Yan, and H. Zhang (2021) How should pre-trained language models be fine-tuned towards adversarial robustness?. Advances in Neural Information Processing Systems 34. Cited by: §1.
  • C. Fellbaum (Ed.) (1998) WordNet: an electronic lexical database. Language, Speech, and Communication, MIT Press, Cambridge, MA. External Links: ISBN 978-0-262-06197-1 Cited by: §A.2.
  • S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, and E. Hovy (2021) A survey of data augmentation approaches for nlp. arXiv preprint arXiv:2105.03075. Cited by: §2.1.
  • M. Gardner, Y. Artzi, V. Basmova, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y. Elazar, A. Gottumukkala, et al. (2020) Evaluating models’ local decision boundaries via contrast sets. arXiv preprint arXiv:2004.02709. Cited by: §2.1.
  • R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2018) ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, Cited by: §2.1.
  • S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox (2020) COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 22605–22618. External Links: Link Cited by: §A.4, Appendix C, Table 1, §4, §5.3.
  • J. Guo, U. Kurup, and M. Shah (2019) Is it safe to drive? an overview of factors, metrics, and datasets for driveability assessment in autonomous driving. IEEE Transactions on Intelligent Transportation Systems 21 (8), pp. 3135–3151. Cited by: §1.
  • D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Parajuli, M. Guo, et al. (2021a) The many faces of robustness: a critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340–8349. Cited by: §1, §2.1.
  • D. Hendrycks and T. Dietterich (2018) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, Cited by: §1, §1, §2.1.
  • D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv. External Links: Document, Link Cited by: §3.1.
  • D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2019) AugMix: a simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, Cited by: §2.1, §2.1.
  • D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song (2021b) Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262–15271. Cited by: §2.1.
  • D. Itzkovich, Y. Sharon, A. Jarc, Y. Refaely, and I. Nisky (2019) Using augmentation to improve the robustness to rotation of deep learning segmentation in robotic-assisted surgical data. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5068–5075. Cited by: §1.
  • E. Lakomkin, M. A. Zamani, C. Weber, S. Magg, and S. Wermter (2018) On the robustness of speech emotion recognition for human-robot interaction with deep neural networks. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 854–860. Cited by: §1.
  • R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk (2019) Improving robustness without sacrificing accuracy with patch gaussian augmentation. Cited by: §2.1.
  • A. Luo (2022) Video feature extractor. GitHub. Note: https://github.com/ArrowLuo/VideoFeatureExtractor/ Cited by: §A.4.
  • H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, J. Li, T. Bharti, and M. Zhou (2020) UniVL: a unified video and language pre-training model for multimodal understanding and generation. arXiv. External Links: Document, Link Cited by: §A.4, Appendix C, §2.2, Table 1, §4, §5.3.
  • X. Ma, K. Driggs-Campbell, and M. J. Kochenderfer (2018) Improved robustness and safety for autonomous vehicle control with adversarial reinforcement learning. In 2018 IEEE Intelligent Vehicles Symposium (IV), pp. 1665–1671. Cited by: §1.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: §2.1.
  • A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889. Cited by: §2.2, Table 1, §4, §5.1, §5.2, §5.3.
  • A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019a) HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. arXiv. External Links: Document, Link Cited by: §A.4, §5.3.
  • A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019b) HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips. In ICCV, Cited by: Appendix C, §E.4, §E.5, §1, §2.2, §3.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1, Table 1, §6, 2nd item.
  • J. Miller, K. Krauth, B. Recht, and L. Schmidt (2020) The effect of natural distribution shift on question answering models. In International Conference on Machine Learning, pp. 6905–6916. Cited by: §2.1.
  • F. Nesti, G. Rossolini, S. Nair, A. Biondi, and G. Buttazzo (2022) Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2280–2289. Cited by: §1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pp. 311–318. Cited by: §A.3.
  • M. Patrick, P. Huang, Y. M. Asano, F. Metze, A. G. Hauptmann, J. F. Henriques, and A. Vedaldi (2020) Support-set bottlenecks for video-text representation learning. CoRR abs/2010.02824. External Links: Link, 2010.02824 Cited by: §2.2.
  • J. Pennington, R. Socher, and C. D. Manning (2014) Glove: global vectors for word representation.. In EMNLP, Vol. 14, pp. 1532–1543. Cited by: §A.2.
  • M. O. Prates, P. H. Avelar, and L. C. Lamb (2020) Assessing gender bias in machine translation: a case study with google translate. Neural Computing and Applications 32 (10), pp. 6363–6381. Cited by: §2.1.
  • B. Recht, R. Roelofs, L. Schmidt, and V. Shankar (2019) Do imagenet classifiers generalize to imagenet?. In ICML, Cited by: §2.1.
  • F. Research (2022) Fairseq. GitHub. Note: https://github.com/facebookresearch/fairseq/tree/main/examples/MMPT/scripts/video_feature_extractor Cited by: §A.4.
  • G. Reynolds (2022) Gender bender. GitHub. Note: https://github.com/Garrett-R/gender_bender Cited by: §A.2, §3.2.
  • A. Rouditchenko, A. W. Boggust, D. Harwath, D. Joshi, S. Thomas, K. Audhkhasi, R. Feris, B. Kingsbury, M. Picheny, A. Torralba, and J. R. Glass (2020) AVLnet: learning audio-visual language representations from instructional videos. CoRR abs/2006.09199. External Links: Link, 2006.09199 Cited by: §2.2.
  • E. Rusak, L. Schott, R. S. Zimmermann, J. Bitterwolf, O. Bringmann, M. Bethge, and W. Brendel (2020) Increasing the robustness of dnns against image corruptions by playing the game of noise. Cited by: §2.1.
  • C. Sakaridis, D. Dai, and L. Van Gool (2021) ACDC: the adverse conditions dataset with correspondences for semantic driving scene understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Cited by: §2.1.
  • V. Schlegel, G. Nenadic, and R. Batista-Navarro (2020) Semantics altering modifications for evaluating comprehension in machine reading. In AAAI 21: Proceedings of the 35th AAAI Conference, Cited by: §2.1.
  • J. J. Sung and N. C. Poon (2020) Artificial intelligence in gastroenterology: where are we heading?. Frontiers of medicine 14 (4), pp. 511–517. Cited by: §1.
  • G. Tan, D. Liu, M. Wang, and Z. Zha (2020) Learning to discretely compose reasoning module networks for video captioning. External Links: Document, Link Cited by: Appendix C, §1, §5.1.
  • R. Taori, A. Dave, V. Shankar, N. Carlini, B. Recht, and L. Schmidt (2019) When robustness doesn’t promote robustness: synthetic vs. natural distribution shifts on imagenet. Cited by: §2.1.
  • Y. Tian and C. Xu (2021) Can audio-visual integration strengthen robustness under multimodal attacks?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5601–5611. Cited by: §2.1.
  • S. Tomar (2006) Converting video formats with ffmpeg. Linux Journal 2006 (146), pp. 10. Cited by: §A.1.
  • X. Wang, Q. Liu, T. Gui, Q. Zhang, et al. (2021) TextFlint: unified multilingual robustness evaluation toolkit for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, pp. 347–355. External Links: Link, Document Cited by: §A.2, §A.2, §A.2, §1.
  • X. Wang, H. Wang, and D. Yang (2021) Measure and improve robustness in nlp models: a survey. arXiv preprint arXiv:2112.08313. Cited by: §2.1.
  • S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In Proceedings of the European conference on computer vision (ECCV), pp. 305–321. Cited by: §2.2, Table 1, §5.3.
  • H. Xu, G. Ghosh, P. Huang, P. Arora, M. Aminzadeh, C. Feichtenhofer, F. Metze, and L. Zettlemoyer (2021) VLM: task-agnostic video-language model pre-training for video understanding. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 4227–4239. External Links: Link, Document Cited by: §2.2.
  • H. Xu, G. Ghosh, P. Huang, D. Okhonko, A. Aghajanyan, F. Metze, L. Zettlemoyer, and C. Feichtenhofer (2021) VideoCLIP: contrastive pre-training for zero-shot video-text understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, pp. 6787–6800. External Links: Link, Document Cited by: §A.4, Appendix C, §2.2, Table 1, §4, §5.1, §5.3.
  • K. Yang, W. Lin, M. Barman, F. Condessa, and Z. Kolter (2021) Defending multimodal fusion models against single-source adversaries. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3340–3349. Cited by: §2.1.
  • M. Yim, W. Shen, B. Salemi, D. Rus, M. Moll, H. Lipson, E. Klavins, and G. S. Chirikjian (2007) Modular self-reconfigurable robot systems [grand challenges of robotics]. IEEE Robotics & Automation Magazine 14 (1), pp. 43–52. Cited by: §1.
  • D. Yin, R. G. Lopes, J. Shlens, E. D. Cubuk, and J. Gilmer (2019) A fourier perspective on model robustness in computer vision. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 13276–13286. Cited by: §2.1.
  • Y. Yu, J. Kim, and G. Kim (2018) A joint sequence fusion model for video question answering and retrieval. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 471–487. Cited by: §5.1, §5.2.
  • S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 6023–6032. Cited by: §2.1.
  • L. Zhou, C. Xu, and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In AAAI Conference on Artificial Intelligence, pp. 7590–7598. External Links: Link Cited by: Appendix C, §1, §5.1.

Appendix A Implementation Details

A.1 Visual Perturbations

Below are more details on the perturbations applied to videos. Examples of these perturbations can also be found at https://mmvr-neurips.github.io/MultiModalRobustness/.

Noise

These perturbations apply transformations at the pixel level of each frame in a video. The different noises are Impulse, Gaussian, Shot, and Speckle. Impulse noise simulates corruptions caused by bit errors by applying a combination of salt and pepper noise with amounts ranging over 0.03, 0.06, 0.09, 0.17, 0.27. Gaussian noise simulates low-lighting conditions by first normalizing the pixel values and then adding random normal noise scaled at 0.08, 0.12, 0.18, 0.26, 0.38 depending on severity. Shot noise simulates electronic noise caused by the discrete nature of light by applying a combination of salt and pepper noise with amounts ranging over 0.03, 0.06, 0.09, 0.17, 0.27. Speckle noise is similar to Gaussian noise except the random value is multiplied by the normalized pixel value before being added.
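
A minimal NumPy sketch of two of these frame-level noise perturbations using the severity-indexed amounts above; the speckle scales are assumed to reuse the Gaussian values, and the exact sampling details of the benchmark may differ:

```python
import numpy as np

IMPULSE_AMOUNT = [0.03, 0.06, 0.09, 0.17, 0.27]   # salt & pepper fraction per severity
SPECKLE_SCALE  = [0.08, 0.12, 0.18, 0.26, 0.38]   # assumed: reuses the Gaussian scales

def impulse_noise(video, severity):
    """Salt-and-pepper corruption simulating bit errors; video is (T, H, W, C) uint8."""
    amount = IMPULSE_AMOUNT[severity - 1]
    out = video.copy()
    mask = np.random.rand(*video.shape[:3])            # one draw per pixel, shared across channels
    out[mask < amount / 2] = 0                         # pepper
    out[(mask >= amount / 2) & (mask < amount)] = 255  # salt
    return out

def speckle_noise(video, severity):
    """Gaussian-style noise multiplied by the normalized pixel value before being added."""
    x = video.astype(np.float32) / 255.0
    noise = np.random.normal(scale=SPECKLE_SCALE[severity - 1], size=x.shape)
    return (np.clip(x + x * noise, 0, 1) * 255).astype(np.uint8)
```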

Blur

Blur perturbations apply transformations that simulate camera motion and focus. Motion blur increases the radius and sigma of the kernel which is used to create the motion blurring effect ranging from (10, 3), (15, 5), (15, 8), (15, 12), and (20, 15) based on severity. Zoom blur blurs towards the center of the frame while increasing the zoom factor based on severity. Defocus blur imitates a defocused lens over the entire frame. We increase the radius of the disk which is convolved over the image to create defocus blurring effect ranging from (3, 0.1), (4, 0.5), (6, 0.5), (8, 0.5), (10, 0.5) based on severity.

Digital

JPEG compression converts each frame to a JPEG with quality ranging over 25, 18, 15, 10, 7 based on severity. MPEG1 compresses the original video using the ffmpeg Tomar [2006] format mpeg2video with levels ranging over 20, 40, 60, 80, 100. MPEG2 compresses the original video using the ffmpeg format mpeg4 with levels ranging over 15, 30, 45, 60, 75. This compression tends to affect the playback of the video, where frames are missing and/or skipped. These slight frame changes allow these perturbations to be considered temporal as well. This can be seen in an example in Figure 9 under Digital, where the frame does not perfectly align with the frames for the other perturbations because it is slightly off temporally. We can consider these perturbations spatio-temporal, as they alter both spatial and temporal features.
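
A minimal sketch of invoking ffmpeg for these compressions from Python; how the severity "levels" above map onto encoder options is not specified here, so the -q:v quality scale used below (lower means better quality) is an illustrative assumption:

```python
import subprocess

def compress(src, dst, codec="mpeg2video", quality=10):
    """Re-encode a clip with ffmpeg using one of the stated codecs (mpeg2video or mpeg4)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-c:v", codec, "-q:v", str(quality), dst],
        check=True,
    )

# Example: compress("clip.mp4", "clip_mpeg1_sev3.mpg", codec="mpeg2video", quality=15)
```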

Temporal

Jumbling splits a video into segments of lengths 32, 16, 8, 4, and 2, where longer segments are less severe and shorter segments are more severe. The frames within each segment are then randomly shuffled. Box Jumbling splits a video into segments of lengths 4, 9, 16, 25, and 36, where higher is more severe and lower is less severe. The segments are then randomly shuffled. Sampling keeps the original method’s frames per second (FPS) for consistency with the original approach, but slows the playback by sampling frames uniformly at rates of 2, 4, 8, 16, and 32, where higher rates are more severe. Reverse Sampling is the same as Sampling but also reverses the video after sampling. Freeze chooses a percentage of frames to select, ranging over 40%, 20%, 10%, 5%, and 2.5%; the more frames selected, the less severe the perturbation. These selected frames are then repeated until the next selected frame is reached, simulating a frozen live-stream video.
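A minimal sketch of the Jumbling perturbation, which shuffles frame order inside fixed-length segments (names are illustrative):

import numpy as np

def jumble(frames, segment_len=8):
    # Shuffle frames inside non-overlapping segments of length segment_len.
    frames = np.asarray(frames)
    out = frames.copy()
    for start in range(0, len(frames), segment_len):
        idx = np.arange(start, min(start + segment_len, len(frames)))
        out[idx] = frames[np.random.permutation(idx)]
    return out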

Camera

These perturbations simulate irregularities in camera motion and include Static Rotation, Rotation, and Translation. Static Rotation rotates every frame by the same degree, Rotation rotates each frame by a random degree, and Translation randomly chooses a new center in the frame to crop to for each frame, as if the camera is randomly shaking.
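A simplified sketch of Static Rotation and Rotation using scipy (the per-severity degree ranges are not reproduced here):

import numpy as np
from scipy.ndimage import rotate

def static_rotation(frames, angle=15):
    # Rotate every frame by the same fixed angle (in degrees).
    return np.stack([rotate(f, angle, reshape=False, mode="nearest") for f in frames])

def random_rotation(frames, max_angle=15):
    # Rotate each frame by an independently sampled random angle.
    return np.stack([rotate(f, np.random.uniform(-max_angle, max_angle),
                            reshape=False, mode="nearest") for f in frames])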

Figure 8: The summary of how this paper organizes both visual and text perturbations used to evaluate the text-to-video retrieval task for multimodal models. On the visual side, each perturbation also ranges in severity from 1 to 5.
Figure 9: Visualizations of each perturbation for a single frame in a video from the YouCook2 dataset. Severity increases from left to right for each perturbation.
Figure 10: A visualization of the temporal perturbations for a video showing “a little girl does gymnastics”.
Type Perturbation Perturbed
AddText AddAdv a little girl specifically does gymnastics
AppendIrr On this occasion, a little girl does gymnastics
Bias AllFemale a little girl does gymnastics
AllMale a little boy does gymnastics
GenderNeutral a little child does gymnastics
GenderSwap a little boy does gymnastics
ChangeChar Keyboard a little girl dofs gymnastics
OCR a little girl does gymnastic8
PrefixSwap a little girl does gymnastics
Punct " a little girl does gymnastics, "
SpellErr a lettil girl does gymnastics
Typos a little girl des gymnastics
DropText NN&VBOnly [UNK] [UNK] girl does gymnastics
NNOnly [UNK] [UNK] girl [UNK] gymnastics
NoNN&VB a little [UNK] [UNK] [UNK]
NoVB a little girl [UNK] gymnastics
RandNN a little girl does [UNK]
RandVB a little girl [UNK] gymnastics
VBOnly [UNK] [UNK] [UNK] does [UNK]
NoNN a little [UNK] does [UNK]
Positional DropFirst [UNK] little girl does gymnastics
DropFirstLast [UNK] little girl does [UNK]
DropLast a little girl does [UNK]
ShuffleOrder a girl gymnastics does little
SwapText BackTrans a little girl gymnastics
JJSwap a anodyne girl does gymnastics
MLM a teenage girl does gymnastics
NNSwap a little output does gymnastics
SynWordEmbd a good girl does gymnastics
SynWordNet a little girl manage gymnastics
TextStyle Casual A little girl that does gymnastics
Formal A young woman does gymnastics.
Neg a little girl does not gymnastics
Passive gymnastics is done by a little girl
Tense a little girl did gymnastics
Table 6: Examples of all text perturbations in each category for the text “a little girl does gymnastics” from the MSRVTT dataset.

a.2 Text Perturbations

In Table 6 there are examples of each perturbation when the input text is “a little girl does gymnastics” from the MSRVTT dataset. This section discusses the implementation of each in more detail.

ChangeChar

These are perturbations, natural and machine-based, that alter a character in one or several of the words in the text. The natural-based perturbations include SpellingError, Keyboard, and Typos Wang et al. [2021], which are based on common spelling errors, keyboard mistypes, and general typos. Typos, for example, randomly inserts, deletes, swaps, or replaces a single letter within one word, while Keyboard alters text with common keyboard mistakes such as typing “work” instead of “word”. The machine-based perturbations include OCR, SwapPrefix, and Punctuation Wang et al. [2021]. SwapPrefix, for example, swaps the prefix of one word while keeping its part-of-speech tag, Punctuation appends and/or prepends random punctuation to the sentence, and OCR uses random values to simulate an optical character recognition (OCR) error.
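The actual implementations come from Wang et al. [2021]; the snippet below is only a simplified, hypothetical illustration of a Keyboard-style mistype, not the benchmark code:

import random

# A few adjacent-key neighbours on a QWERTY layout (illustrative, not exhaustive).
NEIGHBOURS = {"d": "sfe", "o": "ipl", "e": "wrd", "a": "qsz", "s": "adw"}

def keyboard_typo(sentence):
    # Replace one character in one word with an adjacent key, if possible.
    words = sentence.split()
    i = random.randrange(len(words))
    word = words[i]
    candidates = [j for j, c in enumerate(word) if c.lower() in NEIGHBOURS]
    if candidates:
        j = random.choice(candidates)
        word = word[:j] + random.choice(NEIGHBOURS[word[j].lower()]) + word[j + 1:]
        words[i] = word
    return " ".join(words)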

SwapText

These are machine-based perturbations that swap word(s) from the original text. Several perturbations make word swaps based on text models, including BackTrans, which replaces text with phrases obtained through back-translation Wang et al. [2021]. SwapSynWordNet and SwapSynWordEmbedding both swap words with their synonyms as determined by either WordNet Fellbaum [1998] or GloVe Pennington et al. [2014]. MLM suggestion swaps one syntactic category element of the sentence with what is predicted by a masked language model (MLM) Wang et al. [2021]. MultiPosSwapJJ and MultiPosSwapNN replace adjectives and nouns, respectively, with words holding multiple parts-of-speech (POS).
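A minimal sketch of a WordNet-based synonym swap with NLTK, as a simplified stand-in for SwapSynWordNet (the benchmark itself uses the Wang et al. [2021] implementation):

import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet")

def swap_synonym(sentence):
    # Replace one word that has a WordNet synonym with a randomly chosen synonym.
    words = sentence.split()
    for i in random.sample(range(len(words)), len(words)):
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
            break
    return " ".join(words)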

AddText

These are natural-based perturbations Wang et al. [2021] that add text to the original. AppendIrr appends irrelevant phrases to the original text, while InsertAdv adds an adverb before each verb.

TextStyle

These are natural-based perturbations that change the style of the text and include Tense, Passive, Casual, Formal, and ReverseNeg Czeresnia Etinger and Black [2019]. Passive, Casual, and Formal change the text style to those specific styles, Tense changes the tense of the text, and ReverseNeg negates the original text.

Bias

These are natural-based perturbations that change the gender of a given phrase and include AllFemale, AllMale, GenderSwap, and Neutral. The Neutral perturbation removes female and male references and replaces them with neutral ones; for example, references to “a man” and “a woman” would be replaced with “a person”. GenderSwap swaps male references with female and vice versa using Reynolds [2022].
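GenderSwap itself relies on Reynolds [2022]; the dictionary-based sketch below only illustrates the idea and is not that tool:

# Illustrative lookup table; a real implementation handles many more terms and casing.
SWAP = {"he": "she", "she": "he", "man": "woman", "woman": "man",
        "boy": "girl", "girl": "boy", "his": "her", "her": "his"}

def gender_swap(sentence):
    # Swap gendered words using a small lookup table.
    return " ".join(SWAP.get(w.lower(), w) for w in sentence.split())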

DropText

These perturbations are synthetic and drop words based on their part-of-speech (POS) tag. DropNN, DropVB, and DropVBNN are different variations of dropping words based on whether the POS tags are Noun and/or Verb. OnlyNN, OnlyVB, and OnlyNNVB drop all words except those with POS NN and/or VB. RandNN and RandVB drop one random noun or verb. This is done using the NLTK Bird et al. [2009] package to first extract POS tags for each word. Based on these POS tags and the respective perturbation, words are “dropped” by replacing them with “[UNK]” in order to maintain the original phrase length, as in the sketch below.
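A minimal sketch of the DropNN variant using NLTK, under the assumption that tokenization does not change the word count of these short captions:

import nltk  # requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

def drop_nouns(sentence):
    # Replace every noun with [UNK], keeping the original phrase length.
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return " ".join("[UNK]" if tag.startswith("NN") else word for word, tag in tagged)

# drop_nouns("a little girl does gymnastics") -> "a little [UNK] does [UNK]"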

Positional

These include DropFirst, DropLast, DropFirstandLast, and ShuffleOrder. The drop-related perturbations replace the word at the given position with an “[UNK]” tag, while ShuffleOrder shuffles the words in a phrase randomly.

a.3 Analysis of Perturbed Text

To understand the severity of each perturbation, we evaluate the generated perturbed text using multiple metrics, including perplexity, BLEU, METEOR, and ROUGE.

Figure 11: The perplexity scores for the different text perturbations using GPT-3. The bars represent the average perplexity for the entire corpus, the dashed lines represent the perplexity when removing outliers based on a threshold of 500, and the numbers atop the bars are the percentages of outliers removed when using the threshold.

Perplexity of Perturbations

We first look at perplexity, which measures the probability of a sentence using the large text model GPT-3 Brown et al. [2020]. For each word in the sentence, the probability of the next word is measured; if that probability is low, the perplexity of the sentence is higher. Perplexity is not necessarily a good measurement of quality, but it is useful for gauging how statistical models may struggle with text. We first observe further support for how different the two datasets are on the text side: the perplexity of MSRVTT is 762.97 while that of YouCook2 is 480.84.

The results of this analysis are shown in Figure 11, where machine-based perturbations are in the bottom row and natural-based perturbations in the top. Between natural and machine-based perturbations there are no obvious differences in perplexity overall; both appear to produce challenging distribution shifts according to the perplexity of GPT-3. Changing characters in words results in consistently higher perplexity across the different variations of ChangeChar perturbations. PrefixSwap and NNSwap are additionally high in perplexity on the machine-based side. These results indicate that statistical models should struggle most with character changes and with word swaps or drops based on POS tags. The highest perplexity occurs when words are shuffled in ShuffleOrder, as the relative positions of words are no longer meaningful. In summary, machine-learning based approaches are likely to struggle most with character-swapping perturbations and shuffling of words.
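The benchmark uses GPT-3 for this measurement; as a rough local stand-in, per-sentence perplexity can be computed with GPT-2 from Hugging Face transformers along these lines (a sketch, not the exact setup used here):

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(sentence):
    # Exponentiated mean negative log-likelihood of the sentence under the language model.
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

# e.g., compare perplexity("a little girl does gymnastics") with a perturbed caption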

Comparison Metrics to Original Text

We also compare the perturbed text to the original text using the traditional metrics BLEU Papineni et al. [2002], METEOR, and ROUGE. The results, averaged across the different perturbations of each type, are shown in Table 7 for both the MSRVTT and YouCook2 datasets. Perturbations in DropText are most dissimilar to the original text for both datasets. Depending on the dataset, AddText, TextStyle, and ChangeChar are similarly dissimilar to the original text, meaning models should be robust but show some level of performance reduction. The most similar is Bias, meaning models should be highly robust to Bias.

In summary, these scores indicate that our text perturbations have a varying level of difficulty across categories, allowing for variable severities of distribution shift. The most challenging is DropText and the least challenging is Bias.
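For reference, the per-caption comparison can be sketched with NLTK for BLEU/METEOR and the rouge-score package for ROUGE-L; the exact tooling used for Table 7 may differ:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score  # requires nltk.download("wordnet")
from rouge_score import rouge_scorer

def compare(original, perturbed):
    # Score a perturbed caption against its original.
    ref, hyp = original.split(), perturbed.split()
    bleu4 = sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1)
    meteor = meteor_score([ref], hyp)
    rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(original, perturbed)["rougeL"].fmeasure
    return {"BLEU4": bleu4, "METEOR": meteor, "ROUGE-L F1": rougeL}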

MSRVTT AddText Bias ChangeChar DropText Positional SwapText TextStyle
BLEU4 0.57 0.88 0.60 0.29 0.64 0.68 0.56
METEOR 0.76 0.94 0.79 0.53 0.78 0.87 0.80
ROUGE-L F1 0.16 0.22 0.22 0.17 0.21 0.21 0.21
YouCook2 AddText Bias ChangeChar DropText Positional SwapText TextStyle
BLEU4 0.64 – 0.58 0.34 0.62 0.66 0.65
METEOR 0.76 – 0.78 0.52 0.76 0.85 0.81
ROUGE-L F1 0.17 – 0.23 0.17 0.22 0.22 0.22
Table 7: Distribution shift evaluation on MSRVTT and YouCook2 captions, respectively.

a.4 Model Implementations

To process data, train, and evaluate models, we used our internal cluster with a single GPU per run. All models except HowTo100m MIL Miech et al. [2019a] use features extracted from the visual encoder of HowTo100m MIL. For HowTo100m MIL, the clips were split into 4 clips of 32 frames each. For VideoClip Xu et al. [2021] and COOT Ging et al. [2020], input videos at 30 fps were fed to the pre-trained S3D model, and we extracted features at the final layer with an embedding size of 512. The same procedure was used for UniVL Luo et al. [2020], the difference being that features are extracted at the earlier layer “mixed5c” with an embedding size of 1024. For perturbations, we first perturb the video before extracting S3D features; we therefore collect perturbed S3D features with embedding sizes of 512 and 1024 for each perturbation and severity. We used the original code for these models to extract features to ensure that the procedure matches that of the original authors. These feature extraction scripts are located in the GitHub repositories Luo [2022] and Research [2022].

Appendix B Limitations

In this study we focused on the visual and text modalities. Apart from vision and language, audio also plays an important role in AI systems interacting with the real world. Therefore, we believe robustness analysis of all these modalities together is very important for the deployment of AI systems. Multi-modal learning using a combination of these three modalities is an active research area, and studying it from a robustness point of view is a very interesting future direction.

Appendix C Licensing

All the models which we have used in this study are available in the public domain. The model code for HowTo100m MIL Miech et al. [2019b] and COOT Ging et al. [2020] has the Apache 2.0 license, the model code for VideoClip Xu et al. [2021] and UniVL Luo et al. [2020] has the MIT license, and all are publicly available. We will provide YouCook2-P and MSRVTT-P publicly for future research. These datasets are based on the existing YouCook2 Zhou et al. [2018] and MSRVTT Tan et al. [2020] datasets, and we are not using any new video sources. Both of these datasets are available in the public domain for research purposes, and therefore similar licensing is applicable to the newly curated datasets.

Appendix D Impact

To our understanding, there are no negative societal impacts of our work. The goal of this work was to evaluate the robustness of models that may later be used in real-world settings. We aimed to improve the societal impacts by evaluating these models on real-world distribution shifts including potential bias in text.

Appendix E Additional Results

Figure 12 shows the relative robustness for the text-to-video retrieval task at R@5, aggregated across all categories, for video, text, and when both are perturbed, plotted against performance. The results vary based on the dataset; for example, VideoClip trained from scratch is both more robust and a better performer on the MSRVTT dataset than on YouCook2. This relationship indicates a difference in how models handle clips from long, complex activities compared to videos that are short and depict a simple activity. The top performers are consistently VideoClip and UniVL, which are also the larger models.

Figure 12: Performance of text-to-video retrieval at R@5 against relative robustness for both datasets. Results are aggregated across all categories for each modality and across all generated combinations of video and text perturbations.

e.1 Analysing Feature Space

To visualize the changes to the embedding space when text and video are perturbed, we selected videos that were accurately matched to their respective text in the baseline and observed how their similarity changes when perturbations are applied. Figure 13 visualizes the baseline, the case when text is perturbed by appending irrelevant phrases, the case when video is perturbed by a consistent rotation, and the case when both perturbations are applied. As these perturbations are added, the similarity between a video and other texts begins to increase, and when both video and text are perturbed this effect is most visible. Additionally, the UniVL model, which uses cross-attention, shows even more similarity between video and text when both are perturbed. This does not necessarily mean that UniVL is less robust in this case, but it can indicate that with cross-attention, video and text are generally more similar and the difference between ranking one video over another is much smaller than in other models.

Figure 13: A visualization of similarity between video and text on videos that were accurately matched in the baseline. This visualizes the compounding effect on the similarity when both text and video are perturbed. Additionally, VideoClip shows greater similarity when both are perturbed as compared to UniVL, which utilizes cross-attention.

Figures 14 and 15 show t-SNE plots of the feature space for the VideoClip and UniVL models, respectively, when pre-trained and not fine-tuned. The colors indicate the recipe type and are simply a way of visualizing which videos and texts should be more similar to each other than to those from other recipes. It is important to note that the models are not trained to create a space that clusters recipes together; using recipes is just an arbitrary way of visualizing this space. In Figure 14, we see that when one modality is perturbed, the embedding space of the other is relatively untouched. When both video and text are perturbed, both spaces are impacted. In Figure 15, we see a similar trend: when one modality is perturbed, the other's embedding space is relatively untouched, even though cross-attention is used between the two modalities.

Figure 14: TSNE visualizations for output of the VideoClip model for text and video with different perturbations where colors are recipe types. When one modality is perturbed, the embedding space of the other is untouched. When both video and text are perturbed, both embedding spaces are impacted.
Figure 15: TSNE visualizations for output of the UniVL model for text and video with different perturbations where colors are recipe types. When one modality is perturbed, the embedding space of the other is untouched, despite cross-attention being used. When both video and text are perturbed, both embedding spaces are impacted.
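For reference, such projections can be produced from clean and perturbed embeddings with scikit-learn roughly as follows (variable names are illustrative; this is not the plotting code used for the figures):

import numpy as np
from sklearn.manifold import TSNE

def project(video_emb, text_emb):
    # Jointly project video and text embeddings (each N x D) to 2-D with t-SNE.
    joint = np.concatenate([video_emb, text_emb], axis=0)
    coords = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(joint)
    return coords[:len(video_emb)], coords[len(video_emb):]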

e.2 Breakdown of Perturbations

Results for each perturbation when text is perturbed are shown in Figures 16 and 17. In these figures, the black dashed line indicates the original performance and the bars represent the performance when text is perturbed; the larger the difference between the top of the bar and the horizontal line, the larger the drop in relative performance. The figures show that the drop in performance for ChangeChar and DropText is noticeably significant. They also show how robust models are to AddText, Positional, and TextStyle.

Figures 18 and 19 show performance at R@5 over the varying levels of severity. On both datasets, models are not robust to spatial perturbations such as Noise and Blur. When text is aligned with segments, models are robust to temporal perturbations, but on YouCook2 they are not once text is no longer aligned with its segment from the long, complex video. However, frame order does not matter within the text's respective segment. Models are surprisingly robust against spatio-temporal, or Digital, perturbations, struggling most with JPEG compression.

Figure 16: Perturbation-specific results for text perturbations on the MSRVTT dataset. The black horizontal lines indicate retrieval on the clean dataset while the bars indicate retrieval on the perturbed dataset. Fine-tuned models are the highest performers but are not always more robust. For example, on Bias, fine-tuned models tend to have a larger drop in robustness compared to the zero-shot models.
Figure 17: Perturbation-specific results for text perturbations on the YouCook2 dataset. The black horizontal lines indicate retrieval on the clean dataset while the bars indicate retrieval on the perturbed dataset.
Figure 18: Performance at R@5 when video is perturbed, for different levels of severity, on the YouCook2 dataset. Models are less robust against spatial perturbations and largely robust against Temporal perturbations that maintain alignment between text and segments. When alignment is disturbed, models are no longer robust.
MSRVTT Method AddText Bias ChangeChar DropText Positional SwapText TextStyle
Videoclip scratch 0.94±0.00 0.97±0.01 0.94±0.03 0.88±0.09 0.94±0.04 0.93±0.03 0.98±0.01
HowTo100M MIL zeroshot 0.96±0.03 0.94±0.03 0.90±0.05 0.74±0.16 0.89±0.06 0.91±0.07 0.98±0.01
UniVL Align zeroshot 0.99±0.02 1.00±0.01 0.95±0.02 0.89±0.05 0.94±0.03 0.97±0.03 0.98±0.01
Videoclip zeroshot 0.97±0.02 0.95±0.02 0.90±0.05 0.75±0.17 0.93±0.04 0.91±0.08 0.98±0.01
UniVL Align finetune 0.95±0.03 0.94±0.02 0.90±0.04 0.75±0.15 0.88±0.06 0.91±0.08 0.98±0.01
Videoclip finetune 0.98±0.02 0.99±0.01 0.94±0.03 0.86±0.06 0.91±0.04 0.96±0.03 0.99±0.01
YouCook2 Method AddText Bias ChangeChar DropText Positional SwapText TextStyle
COOT scratch 0.95±0.05 – 0.64±0.13 0.74±0.16 0.90±0.05 0.78±0.19 0.81±0.23
Videoclip scratch 0.97±0.01 – 0.91±0.05 0.85±0.14 0.94±0.05 0.91±0.06 0.99±0.00
HowTo100M MIL zeroshot 0.96±0.04 – 0.89±0.04 0.76±0.15 0.90±0.03 0.90±0.07 0.98±0.01
UniVL Align zeroshot 1.03±0.01 – 0.95±0.02 0.90±0.07 0.96±0.04 0.95±0.03 0.99±0.02
Videoclip zeroshot 0.97±0.02 – 0.90±0.07 0.69±0.22 0.89±0.06 0.88±0.11 0.99±0.04
UniVL Align finetune 0.96±0.04 – 0.89±0.04 0.81±0.10 0.91±0.03 0.91±0.05 0.98±0.01
Videoclip finetune 0.97±0.02 – 0.88±0.05 0.73±0.17 0.85±0.07 0.88±0.07 0.97±0.02
Table 8: Absolute robustness scores for each category of distribution shifts for text perturbations. The UniVL model is typically the highest performer. Overall, models are very robust to text perturbations when considering the absolute score.
Figure 19: Performance at R@5 when video is perturbed, for different levels of severity, on the MSRVTT dataset. Models are less robust against spatial perturbations and largely robust against Temporal perturbations. Models are surprisingly robust against spatio-temporal (Digital) perturbations, struggling most with JPEG.
MSRVTT Method Blur Camera Digital Noise Temporal
Videoclip scratch 0.89±0.05 0.94±0.03 0.84±0.05 0.80±0.05 0.97±0.02
HowTo100M MIL zeroshot 0.80±0.10 0.92±0.06 0.79±0.09 0.63±0.11 0.95±0.04
UniVL Align zeroshot 0.93±0.04 0.98±0.02 0.93±0.03 0.88±0.03 0.99±0.01
Videoclip zeroshot 0.62±0.15 0.91±0.07 0.80±0.10 0.61±0.11 0.97±0.02
UniVL Align finetune 0.79±0.11 0.91±0.08 0.78±0.10 0.64±0.10 0.98±0.02
Videoclip finetune 0.91±0.05 0.96±0.03 0.91±0.04 0.82±0.04 0.99±0.01
YouCook2 Method Blur Camera Digital Noise Temporal
COOT scratch 0.76±0.13 0.94±0.04 0.85±0.15 0.62±0.07 0.82±0.17
Videoclip scratch 0.81±0.08 0.95±0.04 0.90±0.10 0.70±0.05 0.87±0.13
HowTo100M MIL zeroshot 0.82±0.10 0.93±0.05 0.89±0.13 0.62±0.07 0.80±0.17
UniVL Align zeroshot 0.91±0.04 0.96±0.04 0.94±0.06 0.84±0.02 0.92±0.07
Videoclip zeroshot 0.67±0.19 0.91±0.07 0.85±0.18 0.45±0.11 0.73±0.25
UniVL Align finetune 0.82±0.09 0.94±0.04 0.91±0.09 0.73±0.05 0.87±0.12
Videoclip finetune 0.73±0.15 0.91±0.06 0.87±0.15 0.54±0.08 0.78±0.20
Table 9: Absolute robustness scores for each category of distribution shifts for video perturbations. The UniVL model is typically the most robust model. Models are least robust to Noise, Blur and Digital.

e.3 Absolute Robustness

The absolute robustness scores for text perturbations are shown in Table 8 and for visual perturbations in Table 9. When observing absolute robustness, it is clear that the UniVL Align model that is pre-trained but not fine-tuned is the most robust model. This model uses a cross-encoder architecture with an alignment-based objective function. This differs from the relative robustness results, in which the most robust model and pre-training strategy vary. In Table 9, the models struggle most with Blur, Noise, and Digital. Digital is likely challenging because it perturbs both spatial and temporal features (see Figure 9). On YouCook2, models struggle more with temporal perturbations. This again indicates a difference in model behavior on short clips as opposed to clips from videos of long, complex activities.
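For orientation, relative and absolute robustness are commonly computed from the clean and perturbed scores roughly as below; the precise definitions used in this paper are given in the main text, so treat this only as a sketch:

def relative_robustness(clean, perturbed):
    # gamma_r: 1 minus the drop relative to the clean score.
    return 1.0 - (clean - perturbed) / clean

def absolute_robustness(clean, perturbed, max_score=100.0):
    # gamma_a: 1 minus the drop relative to the maximum attainable score.
    return 1.0 - (clean - perturbed) / max_score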

e.4 Cross-encoders vs. Two-Branch

The breakdown of text perturbation categories for cross-encoders and two-branch encoders is shown in Table 10. When models are fine-tuned, a two-branch encoder is typically more robust. When models are evaluated zero-shot, this varies for the MSRVTT dataset on AddText and SwapText. This is likely because the pre-training dataset is HowTo100M Miech et al. [2019b], which uses automatic speech recognition (ASR) to obtain the associated text; this text often contains irrelevant narration and incorrectly extracted words. Table 11 shows these results when video is perturbed. When video is perturbed, cross-attention is typically more robust on the MSRVTT dataset, while results vary on the YouCook2 dataset. The difference between fine-tuning and zero-shot is less significant compared to when text is perturbed.

MSRVTT Train AddText Bias ChangeChar DropText Positional SwapText TextStyle
cross zs 0.92±0.10 0.97±0.04 0.71±0.11 0.33±0.27 0.64±0.15 0.84±0.15 0.90±0.07
two-branch zs 0.84±0.08 0.92±0.04 0.74±0.10 0.48±0.30 0.70±0.17 0.78±0.12 0.94±0.04
cross ft 0.92±0.05 0.88±0.05 0.80±0.09 0.49±0.31 0.78±0.13 0.81±0.14 0.96±0.01
two-branch ft 0.94±0.04 0.91±0.05 0.81±0.09 0.53±0.32 0.87±0.08 0.83±0.15 0.97±0.02
YouCook2 Train AddText Bias ChangeChar DropText Positional SwapText TextStyle
cross zs 1.14±0.03 – 0.75±0.10 0.43±0.41 0.80±0.24 0.75±0.17 0.94±0.09
two-branch zs 0.94±0.03 – 0.75±0.12 0.52±0.36 0.77±0.15 0.76±0.15 0.96±0.03
cross ft 0.91±0.08 – 0.74±0.09 0.45±0.33 0.76±0.07 0.78±0.15 0.95±0.02
two-branch ft 0.95±0.03 – 0.84±0.10 0.50±0.35 0.82±0.09 0.81±0.18 0.99±0.07
Table 10: Relative robustness scores for text categories for models that use cross-attention vs. two-branch architectures. When fine-tuned, a two-branch encoder is typically more robust. When using zero-shot, this varies.
MSRVTT Train Blur Camera Digital Noise Temporal
cross zeroshot 0.61±0.21 0.85±0.12 0.61±0.16 0.27±0.20 0.96±0.04
two-branch zeroshot 0.59±0.18 0.81±0.12 0.52±0.21 0.23±0.18 0.92±0.06
cross finetune 0.60±0.19 0.85±0.12 0.58±0.19 0.27±0.22 0.90±0.09
two-branch finetune 0.58±0.27 0.84±0.13 0.62±0.19 0.26±0.21 0.95±0.04
YouCook2 Train Blur Camera Digital Noise Temporal
cross zeroshot 0.53±0.24 0.79±0.21 0.67±0.31 0.09±0.12 0.54±0.38
two-branch zeroshot 0.45±0.24 0.84±0.12 0.72±0.29 0.11±0.15 0.59±0.38
cross finetune 0.59±0.23 0.84±0.11 0.75±0.29 0.14±0.17 0.55±0.38
two-branch finetune 0.47±0.30 0.85±0.11 0.75±0.29 0.11±0.18 0.57±0.40
Table 11: Relative robustness scores for visual categories for models that use cross-attention vs. two-branch architectures. Cross-attention tends to be more robust against visual perturbations, while this varies with YouCook2.

e.5 MSRVTT QA

The results of the multiple-choice VideoQA task on the MSRVTT dataset for the different VideoClip variations are shown in Tables 12 and 13. When only text is perturbed, the difference between pre-training strategies is not consistent, unlike in the text-to-video retrieval task. Unlike the previous task, zero-shot is typically the least robust, and training from scratch is as robust as or more robust than fine-tuning. This indicates that when the task chooses between a small set of candidates, pre-training on a large corpus of data may be less necessary for both performance and robustness.

When video is perturbed, the zero-shot model is typically more robust, a very different finding between the two modalities. This is likely because this task is similar to video-to-text retrieval and we have already observed that models are less robust to visual perturbations. Zero-shot is therefore relatively more robust because of the nature of the pre-training dataset HowTo100m Miech et al. [2019b], which consists of a variety of noisy YouTube videos.

VideoQA AddText Bias ChangeChar DropText Positional SwapText TextStyle
scratch 0.99±0.01 0.98±0.02 0.96±0.02 0.77±0.23 0.97±0.03 0.94±0.07 1.0±0.0
zeroshot 0.97±0.03 0.98±0.01 0.88±0.07 0.67±0.26 0.9±0.06 0.89±0.09 0.98±0.01
finetune 0.99±0.01 0.99±0.01 0.96±0.02 0.75±0.25 0.96±0.03 0.95±0.06 0.99±0.01
Table 12: Relative robustness scores for text categories on MSRVTT-QA for the VideoClip model and its training variations.
VideoQA Blur Camera Digital Noise Temporal
scratch 0.75±0.14 0.89±0.08 0.71±0.16 0.46±0.2 0.93±0.03
zeroshot 0.8±0.15 0.96±0.07 0.8±0.12 0.51±0.18 1.0±0.01
finetune 0.78±0.14 0.91±0.08 0.79±0.14 0.49±0.2 0.95±0.02
Table 13: Relative robustness scores for visual categories on MSRVTT-QA for the VideoClip model and its training variations.

e.6 Training Procedure

The breakdown of text and visual perturbations for relative robustness aggregated across training procedures can be found in Tables 14 and 15. When text is perturbed and robustness is aggregated across training strategy and subcategories of perturbations, fine-tuning a model after pre-training is typically most robust, followed by zero-shot. When video is perturbed, we find that fine-tuned models are typically more robust, with the surprising exception of Temporal perturbations on both datasets. This may indicate that temporal elements are better learned when training is specific to the dataset, especially given that MSRVTT and YouCook2 differ so much in duration and activity. The robustness difference for temporal perturbations is also larger on the MSRVTT dataset, which is a larger shift from the pre-training dataset HowTo100M than YouCook2 is. Both YouCook2 and HowTo100m contain user-generated YouTube videos of long, complex instructional activities, while MSRVTT contains short user-generated videos of a variety of activities, such as a girl doing gymnastics. Future research should investigate the temporal aspects of video distributions and how models can improve robustness when these distributions shift between short clips and longer, more complex ones.

MSRVTT AddText Bias ChangeChar DropText Positional SwapText TextStyle
scratch 0.90±0.06 0.88±0.05 0.78±0.09 0.46±0.32 0.73±0.13 0.81±0.18 0.96±0.02
zs 0.87±0.09 0.94±0.05 0.73±0.10 0.43±0.29 0.68±0.16 0.80±0.13 0.93±0.05
ft 0.93±0.04 0.89±0.05 0.81±0.09 0.51±0.31 0.82±0.11 0.82±0.14 0.97±0.01
YouCook2 AddText Bias ChangeChar DropText Positional SwapText TextStyle
scratch 0.86±0.10 – 0.40±0.32 0.39±0.34 0.73±0.11 0.61±0.33 0.75±0.39
zs 1.00±0.11 – 0.75±0.11 0.49±0.37 0.78±0.17 0.75±0.15 0.95±0.06
ft 0.93±0.06 – 0.79±0.11 0.47±0.33 0.79±0.08 0.80±0.16 0.97±0.05
Table 14: Relative robustness scores for text perturbations for each training procedure. Fine-tuned models are typically more robust, closely followed by zero-shot. Models trained from scratch are typically not robust on average.
MSRVTT Blur Camera Digital Noise Temporal
scratch 0.54±0.23 0.77±0.16 0.52±0.23 0.25±0.23 0.97±0.04
zs 0.60±0.19 0.80±0.12 0.55±0.21 0.27±0.20 0.94±0.05
ft 0.47±0.28 0.82±0.12 0.58±0.21 0.30±0.22 0.94±0.07
YouCook2 Blur Camera Digital Noise Temporal
scratch 0.44±0.26 0.81±0.11 0.60±0.33 0.15±0.17 0.57±0.39
zs 0.47±0.25 0.81±0.14 0.62±0.33 0.13±0.16 0.57±0.39
ft 0.54±0.25 0.83±0.10 0.67±0.32 0.16±0.19 0.56±0.40
Table 15: Relative robustness scores for visual categories aggregated across training procedures. Fine-tuned models are typically more robust.
MSRVTT Method Natural Machine Synthetic
VideoClip scratch 0.90±0.10 0.77±0.21 0.94±0.04 0.87±0.10 0.82±0.14 0.60±0.30
HowTo100m MIL zeroshot 0.93±0.04 0.73±0.15 0.96±0.02 0.85±0.09 0.92±0.07 0.68±0.27
UniVL Align zeroshot 0.97±0.03 0.80±0.18 0.97±0.02 0.85±0.14 0.92±0.05 0.50±0.30
VideoClip zeroshot 0.97±0.04 0.78±0.16 0.97±0.03 0.86±0.14 0.89±0.06 0.52±0.28
UniVL Align finetune 0.89±0.09 0.79±0.18 0.94±0.05 0.89±0.09 0.81±0.15 0.63±0.29
Videoclip finetune 0.90±0.10 0.80±0.18 0.94±0.05 0.89±0.09 0.83±0.16 0.68±0.30
YouCook2 Method Natural Machine Synthetic
COOT scratch 0.90±0.10 0.76±0.24 0.75±0.19 0.44±0.44 0.76±0.16 0.45±0.37
VideoClip scratch 0.95±0.07 0.70±0.23 0.98±0.05 0.83±0.18 0.86±0.09 0.52±0.30
HowTo100m MIL zeroshot 0.95±0.08 0.74±0.23 0.98±0.05 0.86±0.15 0.93±0.11 0.67±0.32
UniVL Align zeroshot 0.95±0.04 0.73±0.21 0.98±0.03 0.91±0.17 0.93±0.07 0.59±0.37
VideoClip zeroshot 0.87±0.09 0.75±0.18 0.93±0.06 0.86±0.11 0.79±0.15 0.59±0.29
UniVL Align finetune 0.89±0.08 0.75±0.19 0.93±0.05 0.85±0.12 0.82±0.13 0.59±0.30
VideoClip finetune 0.85±0.12 0.76±0.19 0.98±0.06 0.91±0.10 0.78±0.20 0.67±0.32
Table 16: Distribution shift evaluation on MSRVTT and YouCook2 captions, respectively, when robustness scores are aggregated over Natural vs. Machine vs. Artificial (Positional and DropText) perturbations.

e.7 Natural vs. Machine vs. Artificial Text Perturbations

To understand models based on the sub-categories of natural, machine-learning based, and synthetic perturbations, we aggregate scores across these categories (see Figure 8). Synthetic perturbations are those in DropText and Positional, as they would be rare occurrences in the real world, although still possible with automatic speech recognition, for example. When only text is perturbed, UniVL with zero-shot learning is typically the most relatively robust for all three categories, although it is close to the HowTo100m MIL robustness on synthetic perturbations.