Self-Supervised MultiModal Versatile Networks

by Jean-Baptiste Alayrac, et al.

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: vision, audio and language. To this end, we introduce the notion of a multimodal versatile network – a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of audio and vision can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51 and ESC-50 when compared to previous self-supervised work.



1 Introduction

Our experience of the world is multimodal. From as far back as the crib, we perceive through multi-sensory systems: for instance, we watch the flames dancing in the fireplace, hear the sound of the crackling wood, and feel the heat coming off. Through this multimodal synchronous perception, we learn to draw useful connections between modalities Smith and Gasser (2005) which, in turn, enables us to form good representations of the world. Later comes language, which allows us to communicate this fine-grained multimodal experience using higher-level abstract concepts.

Our objective is to learn representations from such multimodal experience in a self-supervised manner without resorting to any specific manual annotation. The modalities considered are the three that are easily available from large collections of unlabelled videos: visual, audio and language (obtained from narrations) streams. In this, we seek to learn a multimodal versatile network, defined as a network that has the following four properties: (i) it should be able to take as input any of the three modalities; (ii) it should respect the specificity of modalities, in particular the fact that audio and vision are much more fine-grained than language; (iii) it should enable the different modalities to be easily compared even when they are never seen together during training; and finally (iv) it should be efficiently applicable to visual data coming in the form of dynamic videos or static images.

The question is how to design a network that respects these four principles. We choose a design that embeds each modality into a vector space such that similarity between modalities is obtained via simple dot products. Each modality is processed by a backbone network adapted to the nature of the signal, and a modality embedding graph is constructed such that the visual and audio embeddings are fine-grained, whilst the textual embedding is semantically coarse-grained. This strategy is based on the observation that the visual and audio spaces are fine-grained (there are many images or sounds of guitars that might be really different from each other) while the textual domain is more coarse, as its goal is to abstract away details (a single “guitar” word). The network is then trained from scratch via self-supervised contrastive learning on a large set of unlabelled videos.

To quantitatively evaluate our learned MultiModal Versatile (MMV) networks, we measure their performance on multiple downstream tasks, and in this way assess various properties of the representation of videos and images: verb learning (action classification on HMDB51, UCF101); noun learning (image classification on PASCAL VOC and ImageNet); joint text and visual representation (YouCook2, MSRVTT); and audio representation (sound classification on ESC-50). The proposed MMV achieves state-of-the-art performance for self-supervised approaches on these benchmarks, and reduces the gap to the state-of-the-art performance for supervised approaches.

Contributions. Our contributions are the following: (a) we investigate different modality embedding graphs for MMV, and propose a simple yet effective self-supervised training strategy for multimodal representation of audio, visual and language streams; (b) we introduce a deflation approach so that the MMV video network can efficiently ingest a static image; and (c) we demonstrate the superiority of the learned representations on multiple image, video, audio and video-text downstream tasks.

2 Related work

Self-supervised learning from single modality. Self-supervised methods design pretext tasks that require no manual annotation but facilitate learning of useful representations of the data. A variety of pretext tasks have been developed for vision (single modality), such as predicting the relative position of patches Doersch et al. (2015); Noroozi and Favaro (2016), colorization Zhang et al. (2016), predicting orientation Gidaris et al. (2018) or invariance to transformation Dosovitskiy et al. (2014); Jing and Tian (2018). In videos, works have also leveraged the temporal dimension Fernando et al. (2017); Lee et al. (2017); Misra et al. (2016); Xu et al. (2019). Recently, methods that maximise the similarity between multiple views (augmented versions) of the same image via contrastive losses Bachman et al. (2019); Chen et al. (2020); He et al. (2020); Hénaff et al. (2019); Hjelm et al. (2018); Oord et al. (2018) stand out due to impressive results on the ImageNet benchmark; we draw inspiration from them (use a contrastive loss and non-linear projection heads Chen et al. (2020)). However, details of view generation are crucial and require careful design Tian et al. (2020). In contrast, we argue that using multiple modalities as different views is simpler and more natural Tian et al. (2019).

Vision and language. WSABIE Weston et al. (2011) and DeVise Frome et al. (2013) introduced the idea of embedding text and image in the same space. This allows semantic similarity to be measured by a dot product in a vector space and enables fast and efficient large scale search across modalities Johnson et al. (2019). This idea is at the core of our versatile networks. With larger datasets Lin et al. (2014); Plummer et al. (2015); Rohrbach et al. (2017); Xu et al. (2016); Zhou et al. (2018), many works have profited from learning such a joint visual-textual space Chowdhury et al. (2018); Dong et al. (2019); Gong et al. (2014a, b); Klein et al. (2015); Miech et al. (2018); Mithun et al. (2018); Pan et al. (2016); Plummer et al. (2017); Xu et al. (2015); Wang et al. (2018, 2016); Wray et al. (2019); Wu et al. (2017). Recently, instructional videos became a popular source of video and language data Alayrac et al. (2016); Malmaud et al. (2015); Sener et al. (2015); Yu et al. (2014) due to not requiring careful manual annotation, by using Automatic Speech Recognition (ASR) to generate text from narrations. We build on top of Miech et al. (2020); Sun et al. (2019b, a) who learn good representations from such narrated material, but consider learning representations using audio as well.

Vision and audio. Cross-modal teacher-student methods Owens et al. (2016); Aytar et al. (2016) exploit the temporal co-occurrence between visual and audio modalities in a video to learn good representations. Taking this idea into the self-supervised domain Arandjelović and Zisserman (2017), multiple works use a pretext task of predicting whether visual and audio signals come from the same video Arandjelović and Zisserman (2017, 2018); Owens and Efros (2018); Korbar et al. (2018); Rouditchenko et al. (2020). Recent developments such as XDC Alwassel et al. (2019), who employ cross-modality clustering, or Evolving Losses Piergiovanni et al. (2020), where many single- and multi-modal pretext tasks are used, demonstrate an impressive ability to learn good representations in both modalities. We propose a simpler method that achieves better performance, and consider the text modality as well.

Vision, audio and language. Using audio, vision and language to learn representations has also been explored in past work. Harwath et al. (2019) use a dataset of images and audio descriptions to associate spoken words and their visual representation. Similarly to us, Aytar et al. (2017) train a cross-modal network with image, audio and text modalities, but learn the image-text association from curated annotated datasets, while our work requires no manual annotation.

From video to image. We reverse the usual route of going from an image network to a video network by inflation Carreira and Zisserman (2017). Historically, this was the usual route Girdhar et al. (2019) as labels were more readily available for images, e.g. ImageNet, than for videos. However, our perception of the world is actually dynamic, a time series of images, and learning first from videos is more natural. Similarly to Dong et al. (2019), we enable our network to ingest both dynamic video and still images. But instead of having two different pathways and requiring learning from both images and videos, we propose a simple deflation mechanism that enables our network, trained purely on videos, to be directly adapted to still images.

3 Approach

We are given a set of unlabelled videos containing different modalities. For example, a video may contain an RGB stream (a set of frames depicting a dog), an audio track (the sound of that same dog barking) and some linguistic narrations (coming from a person providing verbal instructions). We follow previous work Miech et al. (2020, 2019) and obtain language as text by using off-the-shelf Automatic Speech Recognition (ASR) on the audio, leaving the removal of this dependency to future work. Equipped with this, our goal is to learn a model that has the versatile properties described in Section 1. We do so by introducing a bespoke multimodal architecture and optimizing its parameters via self-supervised learning. In detail, we use the temporal co-occurrence between the modalities to define the self-supervised proxy task and enforce it with a multi-modal pairwise contrastive loss.

Formally, a video $x \in \mathcal{X}$ is defined by an instantiation of different modalities $\mathcal{M}$: $x = \{x_m\}$, $m \in \mathcal{M}$. In this work, we focus on three modalities, namely vision $x_v \in \mathcal{X}_v$, audio $x_a \in \mathcal{X}_a$ and text $x_t \in \mathcal{X}_t$, but the proposed approach could be easily generalized to more modalities. Specifically, $x_v$, $x_a$ and $x_t$ correspond to a few-second sequence of RGB frames, 1D audio samples, and discrete word tokens, respectively. Given a training set containing $n$ videos $\{x^i\}_{i=1}^{n} \in \mathcal{X}^n$, we seek to learn modality specific representations as well as ways to compare streams across modalities. To that end, let $f_m : \mathcal{X}_m \to \mathbb{R}^{d_m}$ be a parametrized modality specific backbone neural network that takes as input an instance $x_m$ from modality $m$ and outputs a representation vector of dimension $d_m$. To compare different modalities together via simple dot products, we embed them into a shared space $\mathcal{S}_s \subset \mathbb{R}^{d_s}$ of dimension $d_s$, where $s$ contains the list of modalities that we embed in the space, e.g. $s = va$ for a joint visual-audio space $\mathcal{S}_{va}$, or $s = vat$ for a joint visual-audio-text space $\mathcal{S}_{vat}$. A modality specific representation $f_m(x_m)$ is embedded into a space $\mathcal{S}_s$ via a projection head $g_{m \to s} : \mathbb{R}^{d_m} \to \mathbb{R}^{d_s}$. We denote by $z_{m,s} = g_{m \to s}(f_m(x_m))$ the vector representing the input modality $x_m$ in the space $\mathcal{S}_s$.

Section 3.1 explores various model design choices for the MMV networks, which induce different structures of modality spaces $\mathcal{S}_s$. It also presents the self-supervised losses that enforce the different modalities to align in the common spaces. In Section 3.2, we explain how to simply adapt models that have been trained on sequences of RGB frames to operate on single frames.

3.1 MMV: MultiModal Versatile Networks

Recall that our goal is to embed different modalities into a vector space where semantic comparisons can be made by simple dot products. Since there are three modalities, multiple configurations of modality spaces with different inter-relations, which we call modality embedding graphs, can be envisaged. An important note is that since the text modality is directly obtained from the audio track using ASR, we do not construct the audio-text space nor the loss that puts them in alignment explicitly. This is because our goal is not to learn ASR but instead to associate a word, e.g. “car”, with the sound associated with that entity, e.g. the sound produced by the engine. However, we hypothesize that the model can learn this desired link implicitly thanks to the common visual ground. We consider three options for the modality embedding graphs, illustrated in Figure 1 and detailed next.

Option I: Shared space. This is the simplest model where all modalities are embedded into a single shared vector space $\mathcal{S}_{vat}$, and in which direct comparisons can be made between modalities (Figure 1(a)). For example, starting from a visual vector $f_v(x_v)$, a single projection head is applied to obtain the embedding $z_{v,vat}$ used to compare to the audio and the text modalities. This strategy has the advantage that it is easy to navigate between modalities since they all live in the same space (property (iii)). However, the model implicitly assumes that all modalities have equal granularity and hence does not respect their specificities (lack of property (ii)).

Option II: Disjoint spaces. Another natural option is to have different visual-audio ($\mathcal{S}_{va}$) and visual-text ($\mathcal{S}_{vt}$) spaces, as illustrated in Figure 1(b). For example, starting from the visual representation $f_v(x_v)$, there are two distinct projection heads mapping to the $\mathcal{S}_{va}$ and the $\mathcal{S}_{vt}$ domains, i.e. $z_{v,va} \neq z_{v,vt}$. While the disjoint spaces approach respects the specificity of different modality pairs (property (ii)), it does not allow easy navigation between the embedding spaces (lack of property (iii)); for example, text to audio retrieval (“car” to “engine sound”) is not possible.

Option III: Fine and coarse spaces (FAC). In the introduction, we argue that the visual and the audio domains are different from the language domain in terms of their granularities. Inspired by this intuition, we propose to learn two embedding spaces: vision and audio are compared in the fine-grained space ($\mathcal{S}_{va}$), while text is compared with vision and audio in the lower dimensional coarse-grained space ($\mathcal{S}_{vat}$). Crucially, vectors in $\mathcal{S}_{va}$ can be embedded into $\mathcal{S}_{vat}$ via a simple fine-to-coarse projection $g_{va \to vat}$, as illustrated in Figure 1(c). For example, to compare vision to audio, the visual representation is projected into the fine-grained space $\mathcal{S}_{va}$ via the projection head $g_{v \to va}$. To compare vision to text, vision is embedded into the coarse-grained space $\mathcal{S}_{vat}$ via the projection $g_{v \to vat}$, which decomposes as $g_{va \to vat} \circ g_{v \to va}$; this can be seen as first projecting the vision into the fine-grained space $\mathcal{S}_{va}$ via $g_{v \to va}$, followed by projecting the fine- into the coarse-grained space by $g_{va \to vat}$ (see Figure 1(d)). Note that even though we do not align audio and text during training (as mentioned before, this is to not learn ASR), the imposed modality embedding graph enables audio-text comparison because audio can still be projected into the coarse-grained space $\mathcal{S}_{vat}$ via $g_{va \to vat} \circ g_{a \to va}$. This strategy covers the three relevant properties of the MMV network: as opposed to the shared space solution, it models the text differently from the vision and the audio (property (ii)), and, in contrast to the disjoint spaces approach, it enables easy navigation across modalities (property (iii)).
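The fine-and-coarse graph can be illustrated with a small sketch. This is not the authors' implementation: the matrices, sizes and the `embed` helper below are hypothetical, and every projection head is taken to be a plain linear map for simplicity. It only shows how the fine-to-coarse projection is composed with the modality heads, and why an audio-text score can be computed even though that pair has no dedicated head.

```python
def matvec(matrix, vector):
    """Apply a linear map given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical linear heads with toy sizes: vision/audio backbones output 3-d
# vectors, the fine space is 3-d and the coarse space is 2-d.
G_V_TO_VA = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
G_A_TO_VA = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
G_VA_TO_VAT = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # fine-to-coarse projection
G_T_TO_VAT = [[1.0, 0.0], [0.0, 1.0]]

def embed(features, modality):
    """Return (fine, coarse) embeddings; text only has a coarse embedding."""
    if modality == "text":
        return None, matvec(G_T_TO_VAT, features)
    head = G_V_TO_VA if modality == "vision" else G_A_TO_VA
    fine = matvec(head, features)
    # The coarse embedding is the composition: fine-to-coarse after the modality head.
    return fine, matvec(G_VA_TO_VAT, fine)

v_fine, v_coarse = embed([1.0, 2.0, 3.0], "vision")
a_fine, a_coarse = embed([2.0, 1.0, 3.0], "audio")
_, t_coarse = embed([3.0, 3.0], "text")

vision_audio_score = dot(v_fine, a_fine)     # compared in the fine space
vision_text_score = dot(v_coarse, t_coarse)  # compared in the coarse space
audio_text_score = dot(a_coarse, t_coarse)   # no dedicated head, yet still computable
```

In this sketch the shared and disjoint options would correspond to using a single target space for every head, or to dropping `G_VA_TO_VAT` and giving vision a second, independent head into a visual-text space.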

Figure 1: (a)-(c) Modality Embedding Graphs: (a) Shared, (b) Disjoint, (c) FAC. (d) Projection heads and losses for the FAC graph. V=Vision, A=Audio, T=Text.

Multimodal contrastive loss. Given the previously described embedding graphs joining the three different modalities, the question remains how to actually train the backbones and the projection heads. We wish to do so without resorting to any form of manual annotations in order to leverage large amounts of readily available videos on the internet. Instead, inspired by Miech et al. (2020); Arandjelović and Zisserman (2017), we construct self-supervised tasks which aim to align pairs of modalities: vision-audio or vision-text, but not audio-text as explained earlier. Concretely, positive training pairs across two modalities are constructed by sampling the two streams from the same location of a video. Conversely, negative training pairs are created by sampling streams from different videos. In practice, a minibatch of $N$ video samples is formed, which induces $N$ positive and $N^2 - N$ negative pairs. Given these positive and negative training pairs, we use a contrastive loss Oord et al. (2018); Hénaff et al. (2019); Miech et al. (2020) to make the positive pairs similar and negative pairs dissimilar in their corresponding joint embedding space. The only difference between losses used with different embedding graph designs is the choice of spaces where the dot products are computed; next we give the loss for FAC and provide the shared and disjoint losses in Appendix B. Formally, given a video $x$, we minimize the multimodal contrastive loss:

$$\mathcal{L}(x) = \lambda_{va}\,\mathrm{NCE}(x_v, x_a) + \lambda_{vt}\,\text{MIL-NCE}(x_v, x_t), \tag{1}$$

where $\lambda_{mm'}$ corresponds to the weight for the modality pair $m$ and $m'$. The component corresponding to the visual-audio pair is the following NCE loss (for FAC):

$$\mathrm{NCE}(x_v, x_a) = -\log\left(\frac{\exp(z_{v,va}^\top z_{a,va}/\tau)}{\exp(z_{v,va}^\top z_{a,va}/\tau) + \sum_{z' \sim \mathcal{N}(x)} \exp({z'}_{v,va}^\top z'_{a,va}/\tau)}\right), \tag{2}$$

where $\mathcal{N}(x)$ is a set of negative modality pairs for the video $x$, and $\tau$ is the temperature parameter. For the text, recall that we use narrations automatically obtained from speech. As opposed to the audio that is usually perfectly in sync with its visual source (e.g. the sound of the piano is synchronized with the visual of the instrument being played), the correspondence between narrations and what is actually happening in the video is much weaker Miech et al. (2020). To address that issue, we use the MIL-NCE variant from Miech et al. (2020) that is tailored to account for this misalignment issue. In short, it considers multiple positive candidate pairs as positives by simply replacing the single term $\exp(z_{v,vat}^\top z_{t,vat}/\tau)$ in the standard NCE equation (2) by a sum of scores over positive text candidates: $\sum_{z \in \mathcal{P}(x)} \exp(z_{v,vat}^\top z_{t,vat}/\tau)$. As in Miech et al. (2020), the set of potential positives $\mathcal{P}(x)$ is formed from temporally close narrations.
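As a rough illustration of the two loss components described above, the sketch below computes an NCE term for one vision-audio pair and a MIL-NCE term for one clip with several candidate narrations. It is a minimal single-example version under stated assumptions: the function names, the representation of negatives as a list of embedding pairs, and the toy vectors and weights are all hypothetical, and a practical implementation would instead operate on a batch similarity matrix.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nce_loss(z_v, z_a, negatives, tau):
    """NCE for one positive (vision, audio) pair.

    negatives: list of (z_v', z_a') embedding pairs drawn from different videos.
    """
    pos = math.exp(dot(z_v, z_a) / tau)
    neg = sum(math.exp(dot(n_v, n_a) / tau) for n_v, n_a in negatives)
    return -math.log(pos / (pos + neg))

def mil_nce_loss(z_v, text_candidates, negatives, tau):
    """MIL-NCE: the single positive term becomes a sum over candidate narrations."""
    pos = sum(math.exp(dot(z_v, z_t) / tau) for z_t in text_candidates)
    neg = sum(math.exp(dot(n_v, n_t) / tau) for n_v, n_t in negatives)
    return -math.log(pos / (pos + neg))

def multimodal_loss(nce_va, mil_nce_vt, lambda_va, lambda_vt):
    """Weighted sum of the vision-audio and vision-text components."""
    return lambda_va * nce_va + lambda_vt * mil_nce_vt

# Toy unit-norm embeddings (hypothetical values).
tau = 0.1
negatives = [([1.0, 0.0], [0.0, 1.0])]
aligned = nce_loss([1.0, 0.0], [1.0, 0.0], negatives, tau)
misaligned = nce_loss([1.0, 0.0], [0.0, 1.0], negatives, tau)
single_candidate = mil_nce_loss([1.0, 0.0], [[1.0, 0.0]], negatives, tau)
two_candidates = mil_nce_loss([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], negatives, tau)
total = multimodal_loss(aligned, two_candidates, lambda_va=1.0, lambda_vt=1.0)
```

With a single text candidate the MIL-NCE term reduces to the plain NCE term, which is a quick sanity check on the substitution described in the text.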

Missing modalities. Some videos do not have all modalities, for example not all videos contain narration. In that case, we simply discard the corresponding loss component in (1) and upweight the remaining examples of the same modality pair in the batch in order for the total loss weight to remain constant.
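One possible reading of this reweighting is sketched below: examples lacking the modality are dropped from a pair's loss, and the remaining examples are scaled so that the pair contributes with the same total weight as a full batch. The function name and the boolean mask are hypothetical, and the exact scheme used by the authors may differ.

```python
def masked_pair_loss(per_example_losses, has_modality):
    """Average a modality-pair loss over a batch, skipping examples that lack the modality.

    Present examples are upweighted by batch_size / num_present, so the pair's
    total loss weight matches that of a batch in which no example is missing.
    """
    batch_size = len(per_example_losses)
    num_present = sum(1 for present in has_modality if present)
    if num_present == 0:
        return 0.0  # no example has this modality pair: the component is dropped
    scale = batch_size / num_present
    total = sum(scale * loss
                for loss, present in zip(per_example_losses, has_modality)
                if present)
    return total / batch_size

# The third clip has no narration, so its vision-text loss value is ignored.
vt_loss = masked_pair_loss([1.0, 3.0, 100.0, 5.0], [True, True, False, True])
```

Under this reading the result equals the mean loss over the examples that do have the modality.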

3.2 Video to image network deflation

To comply with the property (iv) of the multimodal versatile network, we introduce a network deflation operation to transform a video network into a network that can ingest a single image. The deflated network can be evaluated on image downstream tasks while training on videos, and is more efficient than the standard trick of assembling a static video by repeating the image in time.

Ideally we would wish for video-image equivalence: that the output of the deflated video network on a single image is identical to that obtained by applying the original video network to the single-image static-video. It might be thought that this can simply be achieved by deflating the network over the temporal dimension. In the two types of video networks considered here, this deflation corresponds to the following operations: for 3D convolutional based networks Carreira and Zisserman (2017); Xie et al. (2018), summing the 3D spatio-temporal filters over the temporal dimension to obtain 2D filters; for TSM networks Lin et al. (2019), turning off the channel shifting which results in a standard residual architecture (ResNet50) for images.

However, due to zero-padding these operations do not achieve the desired equivalence, since filters whose receptive fields overlap the clip boundary receive zeros in the single-image static-video, and this is not taken into account by the simple deflation operation above. Note that the padding used in the spatial domain is not a problem, as the spatial padding applies equivalently to both video frames and single images. To take account of the zero-padding, we learn new parameters $\gamma$ and $\beta$ for the batch normalization layers to correct for this boundary effect on the filter outputs, and approximate the equivalence we seek. In detail, the $\gamma$ and $\beta$ parameters are trained to minimize the loss between the output of the original video network when presented with single-image static-videos, and the output of the deflated network for the same images; all parameters are frozen apart from the $\gamma$'s and $\beta$'s of the deflated network. Note that this process only requires images without the need for annotations.
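The boundary effect can be seen in a toy one-dimensional example. The sketch below follows a single pixel through a temporal filter; the kernel weights, pixel value and helper name are hypothetical. Summing the filter over time reproduces the static-video output exactly when there is no temporal padding, but not at the clip boundaries when the video is zero-padded.

```python
def temporal_conv(signal, kernel, zero_pad):
    """Correlate a 1D signal with a kernel along time.

    With zero_pad=True the signal is padded so the output keeps its length.
    """
    k = len(kernel)
    if zero_pad:
        pad = k // 2
        signal = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [0.2, 0.5, 0.3]     # hypothetical temporal weights at one spatial location
pixel = 2.0
static_video = [pixel] * 5   # the same image repeated over 5 frames

# Deflation for a 3D convolution: sum the filter over the temporal dimension.
deflated_weight = sum(kernel)
deflated_out = deflated_weight * pixel

valid_out = temporal_conv(static_video, kernel, zero_pad=False)
padded_out = temporal_conv(static_video, kernel, zero_pad=True)

# Interior frames agree with the deflated filter; the first and last frames do not.
boundary_error = [abs(out - deflated_out) for out in padded_out]
```

The non-zero entries of `boundary_error` at the first and last frames are the kind of mismatch that the re-learned batch normalization parameters are meant to compensate for.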

4 Experiments

In this section we evaluate the performance of the networks on a wide range of downstream tasks. We start by describing the experimental protocol and the datasets used for self-supervised pretraining and downstream evaluations (Section 4.1), followed by exploring various design choices (Section 4.2). Based on this study, we train final models at a larger scale to compare them to the state-of-the-art (Section 4.3). Finally, we apply the trained video networks on still image tasks (Section 4.4).

4.1 Experimental setup, datasets and downstream tasks

Network architectures, hyperparameters and optimization. For video we explore using S3D-G Xie et al. (2018) ($d_v = 1024$), and TSM Lin et al. (2019) with a ResNet50 backbone ($d_v = 2048$) or a ResNet50x2 backbone (ResNet50 with all channels doubled Kolesnikov et al. (2019), $d_v = 4096$). We apply temporal and spatial average pooling at the last layer of the backbone (before the usual classification layer) to obtain a single vector $f_v(x_v)$. During training, ( for the exploration design) frames are sampled at fps and crops are used (frames are resized so that the minimum side is ). We use the following standard augmentation during training: random crop, horizontal flipping, temporal sampling and scale jittering, and color augmentation (details in Appendix A.1). Audio is represented as a log MEL spectrogram with 80 bins, processed with ResNet50, and sampled in sync with the frames. Spatial pooling is applied to obtain $f_a(x_a)$ of dimension $d_a = 2048$. For the final audio evaluation (Section 4.3), the network ingests 2 seconds of audio for fair comparison to Korbar et al. (2018); Alwassel et al. (2019), otherwise we use the same duration as the input video clip. Following Miech et al. (2020), text is processed by removing stop words, retaining a maximum or padding to 16 words, then extracting 300-dimensional Google News pre-trained word2vec Mikolov et al. (2013) and finally applying a linear layer to independently map the word inputs to dimension $d_t$, followed by a max pooling layer over the 16 words. The dimension of the shared subspaces is , except for the Fine And Coarse (FAC) design where we use dimensions for $\mathcal{S}_{va}$ (fine) and for $\mathcal{S}_{vat}$ (coarse). More details about architecture are provided in Appendix B. As done in Chen et al. (2020), we normalize vectors prior to computing their dot products in the NCE and MIL-NCE losses and use a temperature of $\tau = 0.07$ in the softmax as in Wu et al. (2018); He et al. (2020); Patrick et al. (2020). When training with all three modalities on HowTo100M, we observe that a larger weight on the Vision-Text loss is beneficial since text is more prominent. However, when training on HowTo100M+AudioSet, equal loss weights worked best because the audio from AudioSet is more informative. Therefore, a 10:1 loss weight ratio is used when training on HowTo100M and 1:1 for HowTo100M+AudioSet. Finally, all networks are trained from scratch using Adam Kingma and Ba (2015) with an initial learning rate of , steps of warm up and a half-period cosine schedule Loshchilov and Hutter (2017).
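The text branch described above (stop-word removal, truncation or padding to 16 words, a per-word linear layer, then max pooling) can be sketched as follows. This is an illustrative toy, not the authors' code: the stop-word list, padding token, 2-d word vectors standing in for word2vec, zero vectors for padding and unknown words, and the bias-free linear layer are all assumptions.

```python
STOP_WORDS = {"the", "a", "an", "of", "to"}  # hypothetical stop-word list
MAX_WORDS = 16                                # from the text: keep or pad to 16 words
PAD = "<pad>"                                 # hypothetical padding token

def preprocess(sentence):
    """Drop stop words, then truncate or pad to exactly MAX_WORDS tokens."""
    tokens = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    tokens = tokens[:MAX_WORDS]
    return tokens + [PAD] * (MAX_WORDS - len(tokens))

def embed_text(tokens, word_vectors, weights):
    """Apply the same linear layer to every word vector, then max-pool over words."""
    zero = [0.0] * len(weights[0])
    projected = []
    for token in tokens:
        # Assumption: padding and out-of-vocabulary tokens map to a zero vector.
        vec = word_vectors.get(token, zero)
        projected.append([sum(w * x for w, x in zip(row, vec)) for row in weights])
    # Element-wise maximum across the MAX_WORDS projected vectors.
    return [max(column) for column in zip(*projected)]

# Toy 2-d vectors standing in for pre-trained word embeddings.
word_vectors = {"chop": [1.0, 0.0], "onion": [0.0, 2.0]}
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # linear layer mapping 2 -> 3 dims

tokens = preprocess("Chop the onion")
text_repr = embed_text(tokens, word_vectors, weights)
```

Because the linear layer is applied to each word independently and the pooling is a maximum, the resulting vector does not depend on word order.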

Training datasets. We use the HowTo100M Miech et al. (2019) and/or the train split of AudioSet Gemmeke et al. (2017) datasets for self-supervised training. The HowTo100M dataset contains more than 100 million narrated video clips coming from million unique videos where the audio narration is transcribed into text using ASR. We follow the same processing as described in Miech et al. (2020) for creating positive and negative pairs for our contrastive based loss. AudioSet consists of 10-second clips coming from million different internet videos. The dataset contains a variety of audio tracks such as musical instruments, animals or mechanical sounds, since it was built for audio classification, but we discard the labels for self-supervised training. Due to the nature of the dataset, text is considered a missing modality for AudioSet.

Downstream tasks. The trained networks are evaluated on various downstream tasks that aim to capture different aspects of the learned representations. For action classification, we evaluate the visual representation on the UCF101 Soomro et al. (2012) (13K videos and 101 action classes) and the HMDB51 Kuehne et al. (2011) (7K videos and 51 classes) benchmarks. Two settings are explored: a frozen setting, where we simply learn a linear classifier on top of the pretrained $f_v(x_v)$ vector, and a finetune setting, where the full visual model is finetuned. To evaluate the quality of the audio representation, we use the ESC-50 Piczak (2015) (2K audio clips with 50 classes) classification task using the frozen setting on $f_a(x_a)$. These classification datasets have official splits (3 for UCF101/HMDB51 and 5 for ESC-50). As per standard, split#1 serves as the validation set and is therefore used for ablations (Section 4.2), and the average accuracy over all splits is reported when comparing to the state-of-the-art (Section 4.3). The quality of our text-video representation is evaluated on zero-shot text-to-video retrieval on the MSRVTT Xu et al. (2016) (1K videos) and YouCook2 Zhou et al. (2018) (3190 videos at the time of publication) benchmarks, by following the evaluation protocol described in Miech et al. (2019) and reporting the recall at 10 (R@10) (and other retrieval metrics in Appendix A.2). Finally, to evaluate how well our video representation transfers to image tasks we use the PASCAL VOC 2007 Everingham et al. (2010) and ImageNet Russakovsky et al. (2015) classification tasks. For the image tasks, the frozen setting on the deflated version of $f_v(x_v)$ is used, and, as per standard, we report the mAP on PASCAL and the top-1 and top-5 accuracies on ImageNet.
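For reference, recall at k for text-to-video retrieval can be computed from a matrix of text-video similarity scores. The sketch below assumes that query i has exactly one correct video, at index i, and that ties are resolved in favour of the correct video; the function name and toy scores are hypothetical, and the exact protocol of Miech et al. (2019) is not reproduced here.

```python
def recall_at_k(similarity, k):
    """Fraction of text queries whose correct video ranks within the top k.

    similarity[i][j] is the score between text query i and video j;
    the correct video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: number of videos scored strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)

# Toy scores for 3 queries and 3 videos.
scores = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
r_at_1 = recall_at_k(scores, 1)
r_at_2 = recall_at_k(scores, 2)
```

In the zero-shot setting the scores would simply be dot products between text and video embeddings in the shared space, with no training on the retrieval dataset.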

4.2 Design explorations

We here summarize the effects of various design choices of our method. To facilitate running a large number of experiments, we use the S3D-G Xie et al. (2018) network as the video backbone, with frames per video clip, a total batch size of and K training steps (20 hours training on 16 Cloud TPUs). Unless otherwise stated, linear projection heads are used for all modalities, and the networks are trained on HowTo100M. To minimize the amount of hyper-parameter tuning, for UCF101, HMDB51 and ESC-50 we use only the frozen setting and report top-1 accuracy on the split#1. We also report R@10 for YC2 (YR10) and MSRVTT (MR10) under the zero-shot setting. Full details, including all quantitative results, are given in Appendix C.

Pairs of modalities. We here summarize the main findings from experiments that consider learning from two modalities (Vision and Text, or Vision and Audio), as this setup makes it easy to isolate the effects of different components and discover the best building blocks to be used in the three-modality setting. For the video backbones, we observe that TSM ResNet50 always beats S3D-G for downstream tasks that involve vision. For Vision and Audio, a contrastive based loss consistently outperforms the logistic loss (used in Arandjelović and Zisserman (2017); Korbar et al. (2018)) by 2% on vision downstream tasks, and is on par for audio. This is in line with findings of recent single-modality self-supervised approaches as well as work in Vision and Text Miech et al. (2020) that demonstrate the superiority of an NCE based loss compared to its binary classification counterpart. Regarding the projection heads, the experiments confirm the findings of Chen et al. (2020) that adding a non-linear projection head (a two-layer MLP with BatchNorm and ReLU activations) on top of the visual representations improves the downstream performance (notably for UCF101 and HMDB51). It was not beneficial to have non-linear projection heads for the language and audio branches. We hence keep linear projection heads for the audio and text branches and use a non-linear projection head for vision in the rest of the paper. Regarding data augmentation, we observe that despite training on large datasets, removing visual augmentation such as color augmentation or scale jittering slightly decreases performance, hence we keep them for the rest of the paper. Concerning the audio, we add Gaussian noise to the raw signal, with mean and variance , which seems to slightly improve results. Mildly jittering with SpecAugment Park et al. (2019) was not beneficial, and more aggressive augmentations were detrimental; this is in contrast with the findings of Patrick et al. (2020) where SpecAugment helped, presumably due to training on a relatively small dataset. Temporal jittering by randomly offsetting the audio with respect to the visual stream by up to 0.8s (half of the training clip length) reduces the performance on visual tasks by 4%, showing that synchronization is an important training signal.

(a) Benefits of multiple modalities on HT
Modalities   UCF    HMDB   YR10   MR10   ESC-50
VT           82.7   55.9   33.6   27.5   /
VA           75.5   51.6   /      /      79.0
VAT (FAC)    84.7   57.3   32.2   28.6   78.7

(b) VAT: modality merging strategies on HT+AS
Strategy     UCF    HMDB   YR10   MR10   ESC-50
Shared       84.7   60.2   20.8   22.4   88.5
Disjoint     85.1   59.3   25.0   22.5   87.0
FAC          86.2   62.5   23.8   23.5   88.0

Table 1: Design explorations for multiple modalities (HT=HowTo100M, AS=AudioSet). The video networks use non-linear projection heads.

Combining Vision, Audio and Text. On HowTo100M, learning with all three modalities clearly outperforms networks trained with only pairs of modalities (Table 1(a)), obtaining significantly better visual representations (UCF101 and HMDB51) and on-par audio representations (ESC-50). The scores are tied on Vision-Text tasks, with the 3-modality network winning on MSRVTT but losing on YC2. These results demonstrate the ability of our network to learn from the complementary training signals coming from the audio and the text modalities. Next we look at the performance of the different modality merging strategies on the combination of HowTo100M and AudioSet in Table 1(b). First, comparing to Table 1(a), we observe that combining AudioSet with HowTo100M improves performance on HMDB51, UCF101 and ESC-50. This confirms again that our networks can leverage the complementary nature of the modalities to learn better representations, and showcases the advantage of being able to cope with heterogeneous sources of data (AudioSet does not have text). We note a decrease in performance for the video-text benchmarks (MSRVTT and YC2), which can be explained by the fact that only half of the training samples contain text, in contrast to Table 1(a) (the other half comes from AudioSet, which does not have text). As shown in the next section, this can be recovered by training for longer. Second, we note that all strategies for merging the three modalities obtain good representations, but the fine-and-coarse (FAC) method dominates on UCF101, HMDB51 and MSRVTT, achieves a good result on ESC-50 and is second best on YC2. The result agrees with the intuition that care should be taken to account for the specificity of the different modalities.

4.3 Large-scale experiments and comparison to the state-of-the-art

Final experimental setup. We use 32 frames per video clip, K training steps, and a total batch size of (S3D-G and TSM-50) or (TSM-50x2); training TSM-50 takes 3 days on 32 Cloud TPUs. Based on the design exploration studies, the audio and text networks employ a linear projection head, whereas the video network uses a non-linear head. All models use the FAC design when working with the three modalities. Self-supervised training is performed on the combination of HowTo100M and AudioSet datasets with standard augmentation. As is common practice, split#1 of each downstream task is used as the validation set for all hyperparameter tuning, and we report performance averaged over all splits. The full details are in Appendix A.

Results. Table 2 shows that our visual and audio representations match or outperform the state-of-the-art on all downstream tasks and evaluation modes (linear or finetuning). Impressively, simple linear classifiers are competitive with some of the best previous work that uses finetuning. The smaller TSM-50 model achieves similar performance to GDT Patrick et al. (2020) while being superior to XDC Alwassel et al. (2019) and ELo Piergiovanni et al. (2020), despite having significantly fewer parameters and being trained with the same or less data; note also that XDC Alwassel et al. (2019) and GDT Patrick et al. (2020) train on Instagram65M Ghadiyaram et al. (2019), which has been collected specifically to mimic action recognition datasets. The superior performance of the larger TSM-50x2 model demonstrates that large networks can benefit from self-supervised training on vast amounts of data, and that our self-supervised task facilitates this process. MMV performance on visual tasks is close to supervised pretraining, while on audio it is even better than the best supervised result B. et al. (2017) (88.9 vs. 86.5 on ESC-50 in Table 2, a gap of 2.4%).

Comparing to the two-modality case: Vision+Text with S3D-G is a setup similar to Miech et al. (2020), and training with three modalities is clearly beneficial over it (Table 2). Similarly, FAC also beats training with only Vision+Audio.

Regarding zero-shot text-to-video retrieval, our MMV S3D-G, TSM-50 and TSM-50x2 respectively obtain a R@10 of 37.2, 41.5 and 45.4 on YouCook2 and 29.3, 31.1 and 31.1 on MSRVTT. As explained in Section 4.2, longer training significantly improves the performance on these two benchmarks when compared to the results reported in Table 1(b). We are not far from the state-of-the-art performance reported in Miech et al. (2020) for MSRVTT (32.4) but still below it for YouCook2 (51.2). However, Miech et al. (2020) train 4 times longer on vision-text pairs (same number of total training steps, but larger batches, and half of our samples come from AudioSet which has no text). We believe this gap could be further reduced by longer training but leave that for further investigation.

Method Backbone (#params) Train data years Mod. UCF101 Linear UCF101 FT HMDB51 Linear HMDB51 FT ESC-50 Linear
MIL-NCE Miech et al. (2020) I3D (12.1M) HT 15 VT 83.4 89.1 54.8 59.2 /
MIL-NCE Miech et al. (2020) S3D-G (9.1M) HT 15 VT 82.7 91.3 53.1 61.0 /
AVTS Korbar et al. (2018) MC3 (11.7M) AS 1 VA 89.0 61.6 80.6
AVTS Korbar et al. (2018) MC3 (11.7M) SNet 1 VA 82.3
XDC Alwassel et al. (2019) R(2+1)D-18 (33.3M) AS 1 VA 91.2 61.0 84.8
XDC Alwassel et al. (2019) R(2+1)D-18 (33.3M) IG65M 21 VA 94.2 67.4
ELo Piergiovanni et al. (2020) R(2+1)D-50 (46.9M) YT8M 13 VFA 93.8 64.5 67.4
AVID Morgado et al. (2020) R(2+1)D-50 (46.9M) AS 1 VA 91.5 64.7 89.2
GDT Patrick et al. (2020) R(2+1)D-50 (46.9M) AS 1 VA 92.5 66.1 88.5
GDT Patrick et al. (2020) R(2+1)D-50 (46.9M) IG65M 21 VA 95.2 72.8
VA only (ours) S3D-G (9.1M) AS 1 VA 84.7 90.1 60.4 68.2 86.1
VA only (ours) S3D-G (9.1M) AS+HT 16 VA 86.2 91.1 61.5 68.3 87.2
MMV FAC (ours) S3D-G (9.1M) AS+HT 16 VAT 89.6 92.5 62.6 69.6 87.7
MMV FAC (ours) TSM-50 (23.5M) AS+HT 16 VAT 91.5 94.9 66.7 73.2 86.4
MMV FAC (ours) TSM-50x2 (93.9M) AS+HT 16 VAT 91.8 95.2 67.1 75.0 88.9
Supervised Xie et al. (2018); B. et al. (2017); Piergiovanni et al. (2020) 96.8 71.5 75.9 86.5
Table 2: Comparison of learnt representations versus the state-of-the-art. Results are averaged over all splits. The “Mod.” column shows which combinations of modalities are used by the methods: V=Vision, A=Audio, T=Text, F=Flow. Training dataset abbreviations: AS=AudioSet, HT=HowTo100M, IG65M=Instagram65M Ghadiyaram et al. (2019), SNet=SoundNet Aytar et al. (2016), YT8M=2M videos from YouTube8M Abu-El-Haija et al. (2016); their length in years is given in the “years” column. B. et al. (2017) uses a non-linear classifier.

4.4 Transfer to image tasks via network deflation

Experimental setup. The best MMV networks trained in Section 4.3 are deflated and evaluated on image tasks. The deflation (Section 3.2) is trained on 45981 frames of the HowTo100M Miech et al. (2019) training set, where the static videos (ingested by the original video network to produce the regression targets for the deflated image network) are 32 frames long to match the video length used during self-supervised training; the Adam optimizer Kingma and Ba (2014) is used, with the initial learning rate decayed by a constant factor every 30 epochs for a total of 100 epochs. Results are reported for linear classification on top of the frozen image features on the PASCAL VOC 2007 and ImageNet benchmarks. Implementation details are provided in Appendix A.2.

Method VI Train data PASCAL (mAP) ImageNet (top1) ImageNet (top5)
Supervised S3D-G def Kinetics 67.9 42.8 68.0
MMV S3D-G n-def AS+HT 41.8 20.7 40.5
MMV S3D-G def AS+HT 71.4 45.2 71.3
MMV S3D-G i-inf AS+HT 72.1 46.7 72.5
Supervised TSM def Kinetics 66.9 43.4 68.3
MMV TSM n-def AS+HT 34.4 10.9 24.6
MMV TSM def AS+HT 74.8 50.4 76.0
MMV TSM i-inf AS+HT 75.7 51.5 77.3
Supervised TSMx2 def Kinetics 66.9 47.8 72.7
MMV TSMx2 n-def AS+HT 45.6 20.3 39.9
MMV TSMx2 def AS+HT 77.4 56.6 81.4
MMV TSMx2 i-inf AS+HT 77.4 57.4 81.7
SimCLR Chen et al. (2020) ResNet50 / ImageNet 80.5 69.3 89.0
SimCLR Chen et al. (2020) ResNet50x2 / ImageNet / 74.2 92.0
SimCLR Chen et al. (2020) ResNet50x4 / ImageNet 84.2 76.5 93.2
Table 3: Image classification results on PASCAL and ImageNet. “VI” denotes the image handling strategy for the video networks: naive deflation without any deflation training (n-def), the proposed deflation (def), and input-inflation, i.e. the video net ingesting 32-frame static videos (i-inf).

Results. Table 3 shows that the deflated networks perform almost as well as the original video model applied on input-inflated 32-frame static videos (the difference is only around 1% when comparing the ‘def’ and ‘i-inf’ results). However, the deflated model is an order of magnitude more efficient because it processes single images instead of the full 32-frame videos. Naive deflation underperforms severely due to the strong padding effects, showing that our deflation training is necessary. The state-of-the-art self-supervised models trained on images (SimCLR Chen et al. (2020)) outperform MMV because they do not have to bridge the video-image domain gap and have in fact been trained on ImageNet images; the performance difference is much smaller on PASCAL. Finally, our approach is significantly better than pre-training in a fully supervised manner on Kinetics-700 Carreira et al. (2019).

5 Conclusion

In this paper we have explored how to train versatile networks for vision, audio and language in a self-supervised manner. Our method is simple, yet it matches or exceeds the state-of-the-art for action and audio classification on three challenging benchmarks: HMDB51, UCF101 and ESC-50. Our network can also be used for zero-shot text-to-video retrieval. Our deflation process shows how to train on videos to obtain representations for still images. Given the sheer number of videos available for self-supervised training on the web, we believe this is a more natural route to transfer, which we hope will be pursued in the future.


The authors would like to thank Yusuf Aytar and Karen Simonyan for fruitful discussions and Elena Buchatskaya for help on the evaluation benchmarks.

References


  • S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: Table 2.
  • J. Alayrac, P. Bojanowski, N. Agrawal, I. Laptev, J. Sivic, and S. Lacoste-Julien (2016) Unsupervised learning from narrated instruction videos. In CVPR, Cited by: §2.
  • H. Alwassel, D. Mahajan, L. Torresani, B. Ghanem, and D. Tran (2019) Self-supervised learning by cross-modal audio-video clustering. arXiv preprint arXiv:1911.12667. Cited by: §2, §4.1, §4.3, Table 2.
  • R. Arandjelović and A. Zisserman (2017) Look, listen and learn. In ICCV, Cited by: Appendix C, §2, §3.1, §4.2.
  • R. Arandjelović and A. Zisserman (2018) Objects that sound. In ECCV, Cited by: §2.
  • Y. Aytar, C. Vondrick, and A. Torralba (2016) SoundNet: Learning sound representations from unlabeled video. In NIPS, Cited by: §2, Table 2.
  • Y. Aytar, C. Vondrick, and A. Torralba (2017) See, hear, and read: deep aligned representations. arXiv preprint arXiv:1706.00932. Cited by: §2.
  • S. B., D. M. Agrawal, and H. A. Patil (2017) Unsupervised filterbank learning using convolutional restricted boltzmann machine for environmental sound classification. In InterSpeech, Cited by: §4.3, Table 2.
  • P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In NeurIPS, Cited by: §2.
  • J. Carreira, E. Noland, C. Hillier, and A. Zisserman (2019) A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987. Cited by: §4.4.
  • J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, Cited by: §2, §3.2.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: Appendix B, Appendix C, §2, §4.1, §4.2, §4.4, Table 3.
  • M. Chowdhury, P. Rameswar, E. Papalexakis, and A. Roy-Chowdhury (2018) Webly supervised joint embedding for cross-modal image-text retrieval. In ACM MM, Cited by: §2.
  • C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In ICCV, Cited by: §2.
  • J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang (2019) Dual encoding for zero-example video retrieval. In CVPR, Cited by: §2, §2.
  • A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, Cited by: §2.
  • M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The PASCAL visual object classes (VOC) challenge. IJCV 88 (2), pp. 303–338. Cited by: §A.2, §4.1.
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In CVPR, Cited by: §2.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) DeViSE: A deep visual-semantic embedding model. In NIPS, Cited by: §2.
  • J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In ICASSP, Cited by: §4.1.
  • D. Ghadiyaram, D. Tran, and D. Mahajan (2019) Large-scale weakly-supervised pre-training for video action recognition. In CVPR, Cited by: §4.3, Table 2.
  • S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In ICLR, Cited by: §2.
  • R. Girdhar, D. Tran, L. Torresani, and D. Ramanan (2019) Distinit: learning video representations without a single labeled video. In ICCV, Cited by: §2.
  • Y. Gong, Q. Ke, M. Isard, and S. Lazebnik (2014a) A multi-view embedding space for modeling internet images, tags, and their semantics. IJCV. Cited by: §2.
  • Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik (2014b) Improving image-sentence embeddings using large weakly annotated photo collections. In ECCV, Cited by: §2.
  • D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba, and J. Glass (2019) Jointly discovering visual objects and spoken words from raw sensory input. IJCV, pp. 1–22. Cited by: §2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In CVPR, Cited by: §2, §4.1.
  • O. J. Hénaff, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord (2019) Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272. Cited by: §2, §3.1.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.
  • L. Jing and Y. Tian (2018) Self-supervised spatiotemporal feature learning by video geometric transformations. arXiv preprint arXiv:1811.11387. Cited by: §2.
  • J. Johnson, M. Douze, and H. Jégou (2019) Billion-scale similarity search with GPUs. IEEE Transactions on Big Data. Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §A.2, §4.4.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Cited by: §4.1.
  • B. Klein, G. Lev, G. Sadeh, and L. Wolf (2015) Associating neural word embeddings with deep image representations using Fisher vectors. In CVPR, Cited by: §2.
  • A. Kolesnikov, X. Zhai, and L. Beyer (2019) Revisiting self-supervised visual representation learning. In CVPR, Cited by: §4.1.
  • B. Korbar, D. Tran, and L. Torresani (2018) Cooperative learning of audio and video models from self-supervised synchronization. In NeurIPS, Cited by: Appendix C, §2, §4.1, §4.2, Table 2.
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: A large video database for human motion recognition. In ICCV, Cited by: §4.1.
  • H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In ICCV, Cited by: §2.
  • J. Lin, C. Gan, and S. Han (2019) TSM: Temporal shift module for efficient video understanding. In ICCV, Cited by: §3.2, §4.1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common Objects in Context. In ECCV, Cited by: §2.
  • I. Loshchilov and F. Hutter (2017) SGDR: Stochastic gradient descent with warm restarts. In ICLR, Cited by: §4.1.
  • J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, and K. Murphy (2015) What’s cookin’? Interpreting cooking videos using text, speech and vision. NAACL. Cited by: §2.
  • A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-End Learning of Visual Representations from Uncurated Instructional Videos. In CVPR, Cited by: Table 4, Appendix C, §2, §3.1, §3, §4.1, §4.1, §4.2, §4.3, §4.3, Table 2.
  • A. Miech, I. Laptev, and J. Sivic (2018) Learning a Text-Video Embedding from Incomplete and Heterogeneous Data. arXiv preprint arXiv:1804.02516. Cited by: §2.
  • A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In ICCV, Cited by: §3, §4.1, §4.1, §4.4.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §4.1.
  • I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: Unsupervised learning using temporal order verification. In ECCV, Cited by: §2.
  • N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury (2018) Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In ICMR, Cited by: §2.
  • P. Morgado, N. Vasconcelos, and I. Misra (2020) Audio-visual instance discrimination with cross-modal agreement. arXiv preprint arXiv:2004.12943. Cited by: Table 2.
  • M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §2.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2, §3.1.
  • A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. In ECCV, Cited by: §2.
  • A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba (2016) Ambient sound provides supervision for visual learning. In ECCV, Cited by: §2.
  • Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui (2016) Jointly modeling embedding and translation to bridge video and language. In CVPR, Cited by: §2.
  • D. S. Park, W. Chan, Y. Zhang, C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le (2019) SpecAugment: A simple data augmentation method for automatic speech recognition. In InterSpeech, Cited by: Table 9, Appendix C, §4.2.
  • M. Patrick, Y. M. Asano, R. Fong, J. F. Henriques, G. Zweig, and A. Vedaldi (2020) Multi-modal self-supervision from generalized data transformations. arXiv preprint arXiv:2003.04298. Cited by: Table 9, Appendix C, §4.1, §4.2, §4.3, Table 2.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830. Cited by: §A.2.
  • K. J. Piczak (2015) ESC: Dataset for Environmental Sound Classification. In Proceedings of the 23rd Annual ACM Conference on Multimedia, Cited by: §4.1.
  • A. Piergiovanni, A. Angelova, and M. S. Ryoo (2020) Evolving losses for unsupervised video representation learning. In CVPR, Cited by: §2, §4.3, Table 2.
  • B. A. Plummer, M. Brown, and S. Lazebnik (2017) Enhancing video summarization via vision-language embedding. In CVPR, Cited by: §2.
  • B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, Cited by: §2.
  • A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele (2017) Movie description. IJCV. Cited by: §2.
  • A. Rouditchenko, A. Boggust, D. Harwath, D. Joshi, S. Thomas, K. Audhkhasi, R. Feris, B. Kingsbury, M. Picheny, A. Torralba, et al. (2020) AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. arXiv preprint arXiv:2006.09199. Cited by: §2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. Cited by: §4.1.
  • O. Sener, A. R. Zamir, S. Savarese, and A. Saxena (2015) Unsupervised semantic parsing of video collections. In ICCV, Cited by: §2.
  • L. Smith and M. Gasser (2005) The development of embodied cognition: six lessons from babies. Artificial life. Cited by: §1.
  • K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §4.1.
  • C. Sun, F. Baradel, K. Murphy, and C. Schmid (2019a) Contrastive bidirectional transformer for temporal representation learning. arXiv preprint arXiv:1906.05743. Cited by: §2.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019b) VideoBERT: A joint model for video and language representation learning. In ICCV, Cited by: §2.
  • Y. Tian, D. Krishnan, and P. Isola (2019) Contrastive multiview coding. arXiv preprint arXiv:1906.05849. Cited by: §2.
  • Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, and P. Isola (2020) What makes for good views for contrastive learning. arXiv preprint arXiv:2005.10243. Cited by: §2.
  • L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018) Learning two-branch neural networks for image-text matching tasks. PAMI. Cited by: §2.
  • L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In CVPR, Cited by: §2.
  • J. Weston, S. Bengio, and N. Usunier (2011) WSABIE: Scaling up to large vocabulary image annotation. In IJCAI, Cited by: §2.
  • M. Wray, D. Larlus, G. Csurka, and D. Damen (2019) Fine-grained action retrieval through multiple parts-of-speech embeddings. In ICCV, Cited by: §2.
  • C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl (2017) Sampling matters in deep embedding learning. ICCV. Cited by: §2.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In CVPR, Cited by: §4.1.
  • S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In ECCV, Cited by: Appendix C, §3.2, §4.1, §4.2, Table 2.
  • D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In CVPR, Cited by: §2.
  • J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-VTT: A large video description dataset for bridging video and language. In CVPR, Cited by: §2, §4.1.
  • R. Xu, C. Xiong, W. Chen, and J. J. Corso (2015) Jointly modeling deep video and compositional text to bridge vision and language in a unified framework.. In AAAI, Cited by: §2.
  • S. Yu, L. Jiang, and A. Hauptmann (2014) Instructional videos for unsupervised harvesting and learning of action examples. In ACM, Cited by: §2.
  • R. Zhang, P. Isola, and A. A. Efros (2016) Colorful image colorization. In ECCV, Cited by: §2.
  • L. Zhou, C. Xu, and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In AAAI, Cited by: §2, §4.1.

Appendix overview

Appendix A contains additional details about optimization during training (A.1) and about the evaluation setup (A.2). Appendix B precisely describes the architecture of the different backbones and projection heads, as well as all the losses for the different embedding graphs. Appendix C provides the quantitative evaluation of the design exploration for pairs of modalities that were summarized in the main paper.

Appendix A Optimization and evaluation details

A.1 Training details

Pre-processing for video. We apply the following preprocessing steps, in this order, to our videos during training: temporal sampling, scale jittering, resizing the minimum side to , extracting a random crop of , random horizontal flipping and color augmentation. For temporal sampling, we randomly sample in time a subclip (of 16 or 32 frames) from the original video clip. For scale jittering, we independently scale width and height by a value uniformly sampled from . For color augmentation, we randomize brightness (max delta = ), saturation (max delta = ), contrast (max delta=0.4) and hue (max delta=0.2). We clip values to ensure the RGB is in .

Optimization. We train our networks for steps using the Adam optimizer with parameters , and . The initial learning rate is and a half period cosine schedule is used with K steps of linear warm up.
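The warm-up plus half-period cosine schedule described above can be sketched as follows. This is a minimal illustration, not the authors' training code: the function name, the decay to exactly zero at the end of training, and all numeric values in the usage line are assumptions, since the actual step counts and learning rate are not recoverable from this copy of the text.

```python
import math


def learning_rate(step, total_steps, warmup_steps, base_lr):
    """Linear warm-up followed by a half-period cosine decay.

    Assumption: the cosine phase decays from base_lr down to zero.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 to base_lr over the warm-up phase.
        return base_lr * step / warmup_steps
    # Fraction of the post-warm-up phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Half a cosine period: 1 at progress=0, 0 at progress=1.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical values, chosen only for illustration.
schedule = [learning_rate(s, total_steps=1000, warmup_steps=100, base_lr=1.0)
            for s in range(0, 1001, 100)]
```

The learning rate peaks at the end of warm-up and then decreases monotonically, which avoids a large step size while the batch-norm statistics and the embeddings are still unstable early in training.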

Batch norm. We applied batch norm where we aggregate the mean and variance statistics over all workers. Both the bias and scale term are learned. We use a decay rate of for the moving averages and .

A.2 Downstream tasks details

Linear classifier on UCF101/HMDB51. We use Scikit-Learn Pedregosa et al. (2011) SVM to optimize a linear classifier on the frozen features generated by our model. We use or frames per video clip ( for the design explorations and for large-scale experiments), sampled at FPS (to match the FPS used during training). For training, we collect features corresponding to times the size of the training set by applying the same data augmentation described in Section A.1. We resize the frames such that the minimum side is and take a random crop (of size for HMDB51 and for UCF101). Before fitting the SVM, features are scaled so that they are zero mean and unit variance using the training statistics. Then the best value for the regularization parameter is found by validation on the first split. At test time, we take linearly spaced clips and average their predictions to get the final score. We take the central crops of the frames (of size for HMDB51 and for UCF101). We do not apply color augmentation, scale jittering or horizontal flipping during test time.
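Two steps of this protocol are easy to get subtly wrong: features are standardized with training-set statistics only, and test-time predictions are averaged over linearly spaced clips. The sketch below illustrates both in plain Python. All function names are hypothetical, the clip-spacing rule is one plausible reading of "linearly spaced", and the classifier itself (a linear SVM in the text) is left out.

```python
import math


def fit_standardizer(train_features):
    """Per-dimension mean and standard deviation of the training features."""
    n = len(train_features)
    dim = len(train_features[0])
    means = [sum(f[d] for f in train_features) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in train_features) / n
        stds.append(math.sqrt(var) or 1.0)  # guard against zero variance
    return means, stds


def standardize(feature, means, stds):
    """Apply the training statistics to any feature (train or test)."""
    return [(x - m) / s for x, m, s in zip(feature, means, stds)]


def clip_starts(num_frames, clip_len, num_clips):
    """Start indices of clips spread evenly from the first to the last valid start."""
    last = max(0, num_frames - clip_len)
    if num_clips == 1:
        return [last // 2]
    return [round(i * last / (num_clips - 1)) for i in range(num_clips)]


def average_scores(per_clip_scores):
    """Average the class scores predicted for each clip of one video."""
    n = len(per_clip_scores)
    num_classes = len(per_clip_scores[0])
    return [sum(s[c] for s in per_clip_scores) / n for c in range(num_classes)]
```

Reusing the training mean and variance at test time keeps the test features on the scale the classifier was fitted on; recomputing them on the test set would leak test statistics into the evaluation.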

FT on UCF101/HMDB51. For fine-tuning, we use the SGD optimizer with momentum = . A learning rate schedule is used where the learning rate gets multiplied by at the given steps (values for each dataset are provided in Table 5). We also apply weight decay to the variables of the linear classifier. Because in FT, the network can readapt to slight changes in the input, we resize the minimum side to and take random crops of size . At test time, we take linearly spaced clips and average their predictions to get the final score. We take the central crops of the frames of size . We do not apply color augmentation, scale jittering or horizontal flipping during test time.
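The fine-tuning schedule multiplies the learning rate by a fixed factor at a list of given steps. A minimal sketch follows; the function name is hypothetical, the boundary steps in the usage lines are made up because the values of Table 5 are not recoverable here, and treating a boundary as inclusive is an assumption.

```python
def step_lr(step, base_lr, decay, boundaries):
    """Piecewise-constant schedule: multiply by `decay` once per boundary reached."""
    # Assumption: the decay takes effect at the boundary step itself.
    num_passed = sum(1 for b in boundaries if step >= b)
    return base_lr * decay ** num_passed


# Base LR 1.0 and decay 0.1 follow Table 5; the three boundaries are hypothetical.
lrs = [step_lr(s, base_lr=1.0, decay=0.1, boundaries=[10, 20, 30])
       for s in (0, 10, 25, 35)]
```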

YouCook2 MSRVTT
Method R@1 R@5 R@10 MedR R@1 R@5 R@10 MedR
MILNCE Miech et al. (2020) S3D-G 15.1 38.0 51.2 10 9.9 24.0 32.4 30
MMV FAC (ours) S3D-G 9.0 25.7 37.2 20 8.2 21.0 29.3 44
MMV FAC (ours) TSM-50 11.5 30.2 41.5 16 9.2 22.4 31.1 37
MMV FAC (ours) TSM-50x2 11.7 33.4 45.4 13 9.3 23.0 31.1 38
Table 4: Additional retrieval metrics for zero-shot text-to-video retrieval.

Zero-shot text-to-video retrieval on YouCook2/MSRVTT. For zero-shot text-to-video retrieval, we simply use our networks to map text queries and videos to the same subspace. In that space, we can find the best video matches for a given query by maximizing the cosine similarity. Again, to minimize the discrepancy between pretraining and evaluation, we resize the frames to a minimum height/width of 224 and take a central crop. Embedding features for the video are obtained by first computing features of linearly spaced clips and then averaging them. In Table 4 we provide additional metrics for retrieval on YouCook2 and MSRVTT for the S3D-G, TSM-50 and TSM-50x2 networks of Section 4.3. We provide R@K for K ∈ {1, 5, 10} (higher is better) and median rank (MedR), corresponding to the median rank of the correctly retrieved video (lower is better).
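The retrieval protocol and its metrics can be sketched in a few lines. The sketch assumes that text and video embeddings have already been computed and that each query has exactly one correct video; the function names and the tie-handling rule (ties do not worsen the rank) are illustrative choices, not taken from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def average_clips(clip_embeddings):
    """Video embedding = mean of the embeddings of its clips."""
    n = len(clip_embeddings)
    dim = len(clip_embeddings[0])
    return [sum(e[d] for e in clip_embeddings) / n for d in range(dim)]


def rank_of_correct(query, videos, correct_index):
    """1-based rank of the correct video under cosine similarity."""
    q = l2_normalize(query)
    sims = [sum(a * b for a, b in zip(q, l2_normalize(v))) for v in videos]
    target = sims[correct_index]
    # Rank = 1 + number of videos scoring strictly higher than the correct one.
    return 1 + sum(1 for s in sims if s > target)


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median of the ranks of the correct videos (lower is better)."""
    s = sorted(ranks)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else 0.5 * (s[mid - 1] + s[mid])
```

Because the embeddings are normalized before the dot product, ranking by cosine similarity does not depend on the norm of the averaged clip features.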

Linear on PASCAL/ImageNet. We evaluate our deflated networks using a linear classifier on PASCAL VOC 2007 and ImageNet benchmarks. To build the deflated S3D-G network, we collapse the 3D temporal filters into 2D filters by summing along the temporal dimension: . For TSM, we run the image through the backbone network without any channel shift. We use both train and validation sets as training data. We resize the images to have a minimum side of and then use random crops of . For ImageNet, we augmented the training set with scale jittering and color augmentation as described in Section A.1. For the PASCAL linear experiments, we train the linear layer for epochs using the Adam optimizer Kingma and Ba (2014) with parameters , and . We use per-class binary cross-entropy loss to train the linear classifier. A square root decay (learning rate decays as where is the number of steps) is used for the learning rate. The best initial learning rate is selected independently for each of the models using the ‘validation’ set. We report mAP in the ‘test’ set using 11-point mAP metric as described in Everingham et al. (2010). For the ImageNet experiments, we train a linear layer for epochs using the Adam optimizer Kingma and Ba (2014) with parameters , and . We use standard cross-entropy loss to train the classifier. A square root decay is used for the learning rate. The best initial learning rate is selected using an internal validation set (subset of the official training set).
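The filter-collapsing step for S3D-G rests on a simple identity: on a static video, where every frame is the same image, a spatio-temporal filter responds exactly like the 2D filter obtained by summing its weights over time. The sketch below checks this for a single filter position without padding. The nested-list layout `[T][H][W]` and the function names are assumptions made for illustration.

```python
def deflate_kernel(kernel_3d):
    """Sum a [T][H][W] kernel over its temporal axis to get an [H][W] kernel."""
    height, width = len(kernel_3d[0]), len(kernel_3d[0][0])
    return [[sum(frame[i][j] for frame in kernel_3d) for j in range(width)]
            for i in range(height)]


def response_2d(kernel_2d, patch):
    """Response of an [H][W] kernel on an image patch of the same size."""
    return sum(k * p
               for krow, prow in zip(kernel_2d, patch)
               for k, p in zip(krow, prow))


def response_3d(kernel_3d, clip):
    """Response of a [T][H][W] kernel on a [T][H][W] clip (no padding)."""
    return sum(response_2d(k, frame) for k, frame in zip(kernel_3d, clip))


kernel = [[[1, 2], [3, 4]], [[0, 1], [1, 0]], [[-1, 0], [0, 2]]]
patch = [[1, 2], [3, 4]]
static_clip = [patch] * len(kernel)  # the same image repeated over time
```

The identity holds only where the temporal window lies fully inside the clip. With temporal padding, the responses near the clip boundaries differ, which is consistent with the padding effects that Section 4.4 gives as the reason naive deflation underperforms.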

Parameter HMDB51 UCF101
LR base 1.0 1.0
LR decay 0.1 0.1
LR schedule K/K/K K/K/K
Weight decay
Batch size 256 256
Training steps K K
Table 5: Parameters for FT on downstream classification tasks.

Appendix B Model architecture and losses details

Figure 2: Backbone architecture for audio, vision and text.

Backbones. Starting from raw data, our audio, visual and text backbones extract modality specific embeddings as illustrated in Figure 2.

Linear and non linear projection heads. The precise architectures for the projection heads are given in Figure 3(d). The non linear head design follows the non linear head from the SimCLR work Chen et al. (2020).

(a) Shared
(b) Disjoints
(c) FAC
(d) Linear and Non Linear heads.
Figure 3: (a-c) Architecture details for the embedding graphs (linear projection heads are framed by a solid border while the non linear ones are framed by a dashed border). (d) Details of the linear and non linear heads used in this work.

Shared head architecture and losses. We provide an illustration of the detailed architecture for the shared embedding graph in Figure 3(a). In that case, the NCE loss between video and audio is the following:


The MIL-NCE loss between video and text is defined as follows:


Disjoint architecture and losses. We provide an illustration of the detailed architecture for the disjoint embedding graph in Figure 3(b). In that case, the NCE loss between video and audio is the following:


The MIL-NCE loss between video and text is defined as follows:


Fine and Coarse (FAC) architecture and losses. We provide an illustration of the detailed architecture for the FAC embedding graph in Figure 3(c). In that case, the NCE loss between video and audio is the following:


The MIL-NCE loss between video and text is defined as follows:


Appendix C Additional design choices exploration for pairs of modalities

In this section, we explore the effects of various design choices of our method. The full results accompany Section 4.2, paragraph on “pairs of modalities”.

To facilitate running a large number of experiments, we use the S3D-G Xie et al. (2018) network as the video backbone, with frames per video clip, a total batch size of and K training steps (20 hours training on 16 Cloud TPUs). Unless otherwise stated, linear projection heads are used for all modalities, and the networks are trained on HowTo100M. To minimize the amount of hyper-parameter tuning, for UCF101, HMDB51 and ESC-50 we use only the frozen setting and report top-1 accuracy on the split#1. We also report R@10 for YC2 (YR10) and MSRVTT (MR10) under the zero-shot setting.

Here we provide full results from experiments that consider learning from two modalities – Vision and Text, or Vision and Audio – as this setup makes it easy to isolate the effects of different components and discover the best building blocks to be used in the three-modality setting.

train: Vision+Text train: Vision+Audio
Visual backbone UCF101 HMDB51 YC2 MSRVTT UCF101 HMDB51 ESC-50
S3D-G 81.0 52.0 35.4 29.0 71.1 49.1 80.0
TSM Res50 82.9 56.0 37.7 33.3 75.8 52.5 78.0
TSM Res50x2 86.8 55.1 43.4 32.9 77.1 53.6 79.2
Table 6: Effects of varying the visual backbone. All experiments use linear projection heads. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip.

Visual backbone. TSM ResNet50 variants always beat S3D-G for downstream tasks that involve vision, with TSM ResNet50x2 being on par or better than TSM ResNet50 (Table 6).

UCF101 HMDB51 ESC-50
NCE loss 71.1 49.1 80.0
Logistic loss 69.9 47.5 80.7
Table 7: NCE vs logistic loss for Vision+Audio. All experiments use linear projection heads and the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip.

Losses. Previous works use the logistic loss when learning from Vision and Audio Arandjelović and Zisserman (2017); Korbar et al. (2018). The NCE loss consistently outperforms it on vision downstream tasks (by 1.2% on UCF101 and 1.6% on HMDB51), and is on par on audio (Table 7). This is in line with the findings of recent single-modality self-supervised approaches that demonstrate the superiority of NCE-based losses over their binary classification counterpart. Note that, due to the multiple candidate positives in the Vision+Text setting, it is not sensible to compare a logistic loss against MIL-NCE. We refer to Miech et al. (2020) for a relevant comparison that draws the same conclusion.
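To make the comparison concrete, the sketch below contrasts the two loss families on raw similarity scores between one video and several audio candidates. It is a generic illustration, not the exact formulation used in this work: the temperature value, the function names, and the unweighted averaging in the logistic loss are all assumptions.

```python
import math


def nce_loss(pos_score, neg_scores, temperature=0.07):
    """Softmax-style contrastive loss: -log of the positive pair's share.

    The temperature default is a hypothetical value.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


def logistic_loss(pos_score, neg_scores):
    """Binary classification: the positive pair is labelled 1, each negative 0."""
    def softplus(x):
        # log(1 + exp(x)), computed without overflow
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    losses = [softplus(-pos_score)] + [softplus(s) for s in neg_scores]
    return sum(losses) / len(losses)
```

The NCE-style loss scores the positive pair relative to the negatives in the same batch, while the logistic loss judges every pair independently against a fixed threshold.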

Projection heads           | train: Vision+Text             | train: Vision+Audio
(Vision / Other modality)  | UCF101  HMDB51  YC2   MSRVTT   | UCF101  HMDB51  ESC-50
Linear both                | 81.0    52.0    35.4  29.0     | 71.1    49.1    80.0
Non Linear / Linear        | 82.7    55.9    33.6  27.5     | 75.5    51.6    79.0
Non Linear both            | 83.0    54.4    31.1  28.7     | 73.4    51.0    79.5
Table 8: Effects of varying the projection heads. All experiments use the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip. Best number is in bold. Second best is underlined.

Projection heads. Table 8 confirms the findings of Chen et al. (2020) that adding a non-linear projection head (see Figure 2(d) for the architecture details of the linear and non-linear heads) on top of the visual representations improves performance on visual downstream tasks (UCF101 and HMDB51 in the frozen setting). However, non-linear projection heads were not beneficial for the language and audio branches.

Video augmentation | Audio augmentation                                           | UCF101  HMDB51  ESC-50
None               | None                                                         | 70.6    47.9    77.0
Standard           | None                                                         | 71.1    49.1    80.0
Standard           | Temporal                                                     | 67.6    45.8    79.0
Standard           | SpecAugment Park et al. (2019), weak Patrick et al. (2020)   | 71.3    49.2    79.0
Standard           | SpecAugment Park et al. (2019), strong Patrick et al. (2020) | 70.8    48.4    76.2
Standard           | Gaussian noise                                               | 72.8    48.4    78.2
Table 9: Effects of data augmentation for Vision+Audio. All experiments use linear projection heads and the S3D-G network as the video backbone. Training is performed on HowTo100M with 16 frames per video clip. Evaluation is done in the frozen setting, also with 16 frames per video clip.

Data augmentation. Despite training on large datasets, performing standard video augmentations usually improves downstream performance (Table 9). Mildly jittering audio with SpecAugment Park et al. (2019) is not beneficial, and more aggressive augmentation is detrimental; this is in contrast with the findings of Patrick et al. (2020), where SpecAugment helped, presumably due to training on a relatively small dataset. Temporal jittering, which randomly offsets the audio with respect to the visual stream by up to 0.8 s (half of the training clip length), reduces performance on visual tasks by about 3.5 points (71.1 to 67.6 on UCF101 and 49.1 to 45.8 on HMDB51), showing that synchronization is an important training signal. Small additive Gaussian noise applied to the raw audio signal () makes only a slight difference, but we decide to use it because it is inaudible and potentially helps prevent the network from latching onto encoding artefacts.
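The two simplest audio augmentations discussed here can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the offset is assumed to be drawn uniformly and symmetrically around the video start time, and the noise scale is not specified in this section, so the value `noise_std=0.01` used in the example is a placeholder rather than the setting used in the experiments.

```python
import random


def jitter_audio_start(video_start_s, max_offset_s=0.8, rng=random):
    """Return an audio start time randomly offset from the video start.

    Assumption: the offset is uniform in [-max_offset_s, +max_offset_s].
    """
    return video_start_s + rng.uniform(-max_offset_s, max_offset_s)


def add_gaussian_noise(waveform, noise_std, rng=random):
    """Add zero-mean Gaussian noise to every sample of a raw waveform."""
    return [x + rng.gauss(0.0, noise_std) for x in waveform]


rng = random.Random(0)  # seeded generator for reproducibility
audio_start = jitter_audio_start(5.0, max_offset_s=0.8, rng=rng)
waveform = [0.0, 0.5, -0.5, 0.25]  # toy raw audio samples
noisy = add_gaussian_noise(waveform, noise_std=0.01, rng=rng)  # placeholder scale
print(audio_start)
print(noisy)
```

Temporal jittering deliberately breaks audio-visual synchronization, which is consistent with the drop on visual tasks in Table 9, whereas additive noise leaves the alignment intact.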