Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Temporal Synchronicity

by Pritam Sarkar, et al.
Queen's University

We present CrissCross, a self-supervised framework for learning audio-visual representations. Our framework introduces a novel notion whereby, in addition to learning the intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relationships. We show that by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong time-invariant representations. Our experiments show that strong augmentations for both audio and visual modalities, combined with relaxation of cross-modal temporal synchronicity, optimize performance. To pretrain our proposed framework, we use three datasets of varying sizes: Kinetics-Sound, Kinetics-400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and retrieval. CrissCross shows state-of-the-art performance on action recognition (UCF101 and HMDB51) and sound classification (ESC50). The code and pretrained models will be made publicly available.







1 Introduction

Figure 1: We present an overview of our method. CrissCross learns effective audio-visual multi-modal representations not only by considering intra-modal relationships, but also by learning synchronous cross-modal as well as asynchronous cross-modal relations. The sample frames are obtained from Kinetics-400 [kinetics400].

In recent years, self-supervised learning has shown great promise in learning strong representations without human-annotated labels [simclr, simsiam, deepcluster], narrowing the gap between self-supervised and fully-supervised pretraining. There are a number of benefits to such methods. Firstly, they reduce the time and resources required for expensive human annotations and allow researchers to directly use large uncurated datasets for learning meaningful representations. Moreover, models trained in a self-supervised fashion learn more abstract representations, which can be useful in solving a variety of downstream tasks without needing to train the models from scratch.

Given the abundance of videos, their spatio-temporal information-rich nature, and the fact that in most cases they contain both audio and visual streams, self-supervised approaches are strong alternatives to fully-supervised methods for video representation learning. Moreover, the high dimensionality and multi-modal nature of videos make them difficult to annotate, which further motivates the use of self-supervision.

A thorough review of the literature on self-supervised audio-visual representation learning reveals the following open problems. (i) Data augmentation is known to induce invariance that helps in learning strong representations [simclr, scan], yet little attention has been paid to investigating different augmentation strategies in multi-modal self-supervised frameworks. (ii) Given that many existing works on video-based self-supervised learning take direct inspiration from image-based methods [mmv, avts, cvrl, roy2021self], they do not take full advantage of the temporal information available in video data. For instance, methods based on contrastive frameworks [simclr, pirl, avid, ravid, cmacc, gdt], distillation [byol, simsiam], quantization [swav, deepcluster, xdc], or information maximization [barlow_twins] require two augmented views of a given sample to be fed to a shared backbone, followed by optimizing the underlying cost function. As also identified in a recent work [brave], we observe that many video-based methods [mmv, avts, cvrl] perform the necessary augmentations on temporally identical views. (iii) Lastly, existing solutions try to learn audio-visual representations by maintaining a tight temporal synchronicity between the two modalities [mmv, avts, xdc, selavi]. Yet, the impact of learning temporally asynchronous cross-modal relationships in the context of self-supervised learning has not been explored. This notion deserves deeper exploration, as learning such temporally asynchronous cross-modal relationships may in fact result in additional invariance and distinctiveness in the learned representations.

In this study, in an attempt to address the above-mentioned issues, we present CrissCross (a simple illustration is presented in Figure 1), a novel framework to learn robust time-invariant representations from videos. First, we leverage the ‘time’ information available in videos and explore efficient ways of sampling audio-visual segments from given source video clips. Our empirical studies show that with properly composed temporal sampling and the right amount of spatial augmentations, the model learns strong representations useful for a variety of downstream tasks. Second, we introduce a novel concept to learn cross-modal representations through relaxing time-synchronicity between audio and visual segments, which we refer to as ‘asynchronous cross-modal’ optimization. We use datasets of different sizes: Kinetics-Sound [l3-kineticssound], Kinetics-400 [kinetics400], and AudioSet [audioset], to pretrain CrissCross. We evaluate CrissCross on different downstream tasks, namely action recognition, sound classification, and retrieval. We use popular benchmarks UCF101 [ucf101] and HMDB51 [hmdb] to perform action recognition and retrieval, while ESC50 [esc] is used for sound classification.

Contributions. The key contributions of this work are as follows: (1) We present a novel framework for multi-modal self-supervised learning by relaxing the audio-visual temporal synchronicity to learn effective time-invariant representations. Our method is a simple yet effective solution to learn robust multi-modal representations for downstream tasks. (2) We perform an in-depth study to explore the proposed framework and its major concepts. Additionally, we extensively investigate a wide range of audio-visual augmentation techniques towards learning strong audio-visual representations in a multi-modal setup. (3) Comparing the performance of the proposed framework to prior works, CrissCross achieves state-of-the-art results on UCF101 [ucf101], HMDB51 [hmdb], and ESC50 [esc] when pretrained on Kinetics-400 [kinetics400]. Moreover, CrissCross outperforms fully-supervised pretraining when both are trained on the same small-scale dataset, Kinetics-Sound [l3-kineticssound], and sets a new state of the art. To the best of our knowledge, CrissCross is the first to show that self-supervised pretraining outperforms full supervision on action recognition on such a small-scale dataset. Additionally, when CrissCross is trained with the very large-scale AudioSet [audioset], it achieves better or competitive performance with respect to the current state of the art. We hope our proposed self-supervised method can motivate researchers to further explore the notion of asynchronous multi-modal representation learning.

2 Related Work

2.1 Self-supervised Learning

Self-supervised learning aims to learn generalized representations of data without any human-annotated labels through properly designed pseudo tasks (also known as pretext tasks). Self-supervised learning has recently drawn significant attention in different fields of deep learning such as image [simclr, simsiam, pirl, swav, byol, deepcluster, roy2021self], video [avid, ravid, xdc, selavi, gdt, mmv, cmac], and wearable data [sarkar-ssl-icassp, sarkar-ssl-tafc, sarkar-ssl2] analysis, among others.

In self-supervised learning, the main focus of interest lies in designing novel pseudo-tasks to learn useful representations. We briefly mention some of the popular categories in the context of self-supervised video representation learning, namely, (i) context-based, (ii) generation-based, (iii) clustering-based, and (iv) contrastive learning-based. Various pretext tasks have been proposed in the literature exploring the spatio-temporal context of video frames, for example, temporal order prediction [opn], puzzle solving [stc, shuffle-learn, ahsan2019video], rotation prediction [rotnet3d], and others. Generation-based video feature learning methods refer to the process of learning feature representations through video generation [vondrick2016generating, tulyakov2017mocogan, saito2017temporal], video colorization [tran2016deep], and frame or clip prediction [mathieu2015deep, reda2018sdc, babaeizadeh2017stochastic, liang2017dual, finn2016unsupervised], among a few others. Clustering-based approaches [xdc, selavi] rely on self-labeling, where data is fed to the network and the extracted feature embeddings are clustered using a classical clustering algorithm such as k-means, followed by using the cluster assignments as the pseudo-labels for training the neural network. The key concept of contrastive learning [simsiam, pirl, byol, swav, avid, gdt] is that in the embedding space, ‘positive’ samples should be similar to each other, and ‘negative’ samples should have discriminative properties. Using this concept, several prior works [avid, ravid, gdt, cmacc] have attempted to learn representations by minimizing the distance between positive pairs and maximizing the distance between negative pairs.

2.2 Audio-Visual Representation Learning

Typically in multi-modal self-supervised learning, multiple networks are jointly trained on the same pretext tasks towards maximizing the mutual information between multiple data streams [xdc, avid, avts, cliporder, wang2021self, khare2021self, siriwardhana2020multimodal]. In the following, we briefly discuss some of the prior works [avts, xdc, avid, cmacc] on audio-visual multi-modal representation learning. A multi-modal self-supervised task is introduced in AVTS [avts], leveraging the natural synergy between audio-visual data. The network is trained to distinguish whether the given audio and visual sequences are ‘in sync’ or ‘out of sync’. The authors propose a two-stream network, where one stream receives audio as input and the other stream is fed with the visual data. Next, audio and visual embeddings are fused at the end of the convolution layers, and the joint representations are used to minimize the contrastive loss. In XDC [xdc], the authors introduce a framework to learn cross-modal representations through a self-labelling process. In XDC, cluster assignments obtained from the audio-visual representations are used as pseudo-labels to train the backbones. Specifically, the pseudo-labels computed from audio embeddings are used to train the visual backbone, while the pseudo-labels computed using visual embeddings are used to train the audio network. A self-supervised learning framework based on contrastive learning is proposed in AVID [avid] to learn audio-visual representations from video. AVID performs instance discrimination as the pretext task. AVID [avid] redefines the notion of positive and negative pairs based on their similarity and dissimilarity in the feature space, followed by optimizing a noise contrastive estimator loss to learn multi-modal representations. This is different from AVTS [avts], where audio-visual segments originating from the same sample are considered positive pairs, and segments originating from different samples are considered negative pairs.

Distinctions to our work. We acknowledge that earlier works [avts, avid, cmacc] show great promise in learning strong multimodal representations; however, we identify some limitations, which we attempt to address in our study. Most of the earlier works based on contrastive learning try to find negative pairs and positive pairs through a complex process. We also notice that over time, the definition of ‘positive’ and ‘negative’ pairs has been changing. For instance, we find distinct differences in such definitions amongst some earlier works [avts, avid, ravid]. In this study, our goal is to propose a simple yet effective solution towards learning multi-modal representations. Additionally, we would like to highlight that earlier works [avid, selavi, xdc, cmacc] use a massive distributed GPU setup (- GPUs), which is a significant bottleneck when computing resources are limited. In this study, we effectively train our method on - GPUs. Lastly, as discussed earlier, we hypothesize that to learn effective time-invariant features, the synchronicity between audio and visual segments could be relaxed. Interestingly, it may appear that our approach is in contrast to some prior works that suggest synchronization [avts, brave] is helpful in learning strong multi-modal representations. Nonetheless, our framework exploits both synchronous and asynchronous cross-modal relationships in an attempt to learn both time-dependent and time-invariant representations.

3 Method

Figure 2: Our proposed framework. We present the uni-modal baselines and the multi-modal setup. The uni-modal setups (Visual-only and Audio-only) are presented in (A) and (B) respectively. The multi-modal framework, CrissCross, is presented in (C). In case of uni-modal setups, we show the predictor heads, as well as the stop-grad to elaborate on the frameworks. However, in case of CrissCross, we skip those components for the sake of simplicity.

In this section we present the core concepts of our proposed framework. First, we briefly discuss the uni-modal concepts that our model is built on, which are adopted from an earlier work, SimSiam [simsiam]. Next, we introduce the multi-modal concepts of our framework to jointly learn audio-visual representations in a self-supervised setup.

3.1 Uni-modal Learning

To separately learn visual and audio representations, we follow the setup proposed in [simsiam], which we briefly mention here for the sake of completeness. Let us assume an encoder $f$, where $f$ is composed of a convolutional backbone followed by an MLP projection head, and an MLP prediction head $h$. Two augmented views of a sample $x$ are created as $x_1$ and $x_2$. Accordingly, the objective is to minimize the symmetrized loss:

$$\mathcal{L}_{x_1 x_2} = \frac{1}{2}\mathcal{D}(p_1, S(z_2)) + \frac{1}{2}\mathcal{D}(p_2, S(z_1)) \qquad (1)$$

Here, $\mathcal{D}$ denotes the negative cosine similarity, and output vectors $p_1$ and $z_2$ are computed as $h(f(x_1))$ and $f(x_2)$ respectively. Similarly, $p_2$ and $z_1$ are computed as $h(f(x_2))$ and $f(x_1)$ respectively. Further, we apply stop-grad on the latent vectors $z_1$ and $z_2$, which is denoted by $S$. We extend this concept to learn visual and audio representations as discussed below.

To learn visual representations from videos, we use a visual encoder $f_v$ and a predictor head $h_v$. We generate two augmented views of a sample $v$ as $v_1$ and $v_2$, where $v_1$ belongs to timestamp $t_1$, and $v_2$ belongs to timestamp $t_2$. Finally, we optimize the loss $\mathcal{L}_{v_1 v_2}$ using Equation 1. We present this concept in Figure 2(A). Similarly, we generate two augmented views of an audio sample $a$ as $a_1$ and $a_2$, where $a_1$ and $a_2$ belong to timestamps $t_1$ and $t_2$ respectively. We use $a_1$ and $a_2$ to optimize $\mathcal{L}_{a_1 a_2}$ (following Equation 1) using an audio encoder $f_a$ and a predictor head $h_a$ to learn audio representations. A pictorial representation of this method is depicted in Figure 2(B).
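The symmetrized objective described above can be illustrated with a minimal, forward-only NumPy sketch. The function names (`neg_cosine`, `symmetrized_loss`) and the toy tensors are assumptions made for illustration and are not taken from the authors' code. Stop-grad only affects back-propagation, so in this sketch the projections are simply treated as constants.

```python
import numpy as np

def neg_cosine(p, z):
    """Negative cosine similarity between rows of p and z, averaged over the batch."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))

def symmetrized_loss(p1, z1, p2, z2):
    """0.5 * D(p1, S(z2)) + 0.5 * D(p2, S(z1)).

    p1, p2: predictor outputs for the two views.
    z1, z2: projector outputs for the two views; stop-grad (S) changes only
    the backward pass, so here they are plain constant arrays.
    """
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Toy stand-ins for projector outputs of two augmented views (batch of 4, dim 8).
rng = np.random.default_rng(0)
z1 = rng.normal(size=(4, 8))
z2 = rng.normal(size=(4, 8))

# If each prediction matches the other view's projection exactly, the loss
# reaches its minimum of -1 (up to floating-point rounding).
best = symmetrized_loss(p1=z2, z1=z1, p2=z1, z2=z2)
print(best)
```

In an actual training loop the same expression would be written in an autodiff framework, with the projections detached from the graph before the loss is computed.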

3.2 Multi-modal Learning

Here, we discuss the multi-modal learning components of our proposed framework. We present different ways to learn multi-modal representations, namely Intra-modal, Synchronous Cross-modal, Asynchronous Cross-modal, and finally, CrissCross, which blends all three previous methods. We explain each of these concepts below.

Intra-modal Representations. To learn multi-modal representations, our first approach is a joint representation learning method where we train the visual and audio networks with a common objective function $\mathcal{L}_{intra}$. Here, $\mathcal{L}_{intra}$ is calculated as $\frac{1}{2}(\mathcal{L}_{v_1 v_2} + \mathcal{L}_{a_1 a_2})$, where $\mathcal{L}_{v_1 v_2}$ and $\mathcal{L}_{a_1 a_2}$ are uni-modal losses for visual and audio learning as discussed earlier.

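Synchronous Cross-modal Representations. To learn cross-modal audio-visual representations, we calculate the distance between the two different modalities, particularly by calculating $\mathcal{L}_{a_1 v_1}$, corresponding to $a_1$, $v_1$ (the audio and visual views at timestamp $t_1$), and $\mathcal{L}_{a_2 v_2}$, corresponding to $a_2$, $v_2$ (the views at timestamp $t_2$). Finally, we optimize the synchronous cross-modal loss $\mathcal{L}_{sync}$, which is calculated as $\frac{1}{2}(\mathcal{L}_{a_1 v_1} + \mathcal{L}_{a_2 v_2})$.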

Asynchronous Cross-modal Representations. Next, we introduce an asynchronous (or cross-time) cross-modal loss to learn local time-invariant representations. Here, we attempt to optimize asynchronous cross-modal representations by calculating $\mathcal{L}_{a_1 v_2}$ to minimize the distance between feature vectors corresponding to $a_1$ (audio at timestamp $t_1$) and $v_2$ (video at timestamp $t_2$). Similarly, we calculate $\mathcal{L}_{a_2 v_1}$ to minimize the distance between feature vectors corresponding to $a_2$ and $v_1$. Finally, we calculate the asynchronous cross-modal loss $\mathcal{L}_{async}$ as $\frac{1}{2}(\mathcal{L}_{a_1 v_2} + \mathcal{L}_{a_2 v_1})$.

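CrissCross. Our proposed multi-modal representation learning method is named CrissCross. In this setup, we combine the objective functions of Intra-modal, Synchronous Cross-modal, and Asynchronous Cross-modal learning. Accordingly, we define the final objective function as $\mathcal{L}_{CrissCross} = \frac{1}{3}(\mathcal{L}_{intra} + \mathcal{L}_{sync} + \mathcal{L}_{async})$, which gives:

$$\mathcal{L}_{CrissCross} = \frac{1}{6}(\mathcal{L}_{v_1 v_2} + \mathcal{L}_{a_1 a_2} + \mathcal{L}_{a_1 v_1} + \mathcal{L}_{a_2 v_2} + \mathcal{L}_{a_1 v_2} + \mathcal{L}_{a_2 v_1}) \qquad (2)$$
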
We present the proposed CrissCross framework in Figure 2(C) and its pseudocode in Section S1.
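As a rough illustration of how the intra-modal, synchronous, and asynchronous terms could be combined, the sketch below computes all six pairwise losses from dictionaries of predictor and projector outputs. The equal weighting of the three groups, the dictionary keys (`v1`, `v2`, `a1`, `a2`), and the helper names are assumptions for this example; it is not the pseudocode of Section S1.

```python
import numpy as np

def neg_cosine(p, z):
    """Negative cosine similarity between rows of p and z, averaged over the batch."""
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return -np.mean(np.sum(p * z, axis=1))

def pair_loss(p, z, x, y):
    """Symmetrized loss between views x and y (z would be stop-grad in training)."""
    return 0.5 * neg_cosine(p[x], z[y]) + 0.5 * neg_cosine(p[y], z[x])

def crisscross_loss(p, z):
    """Average of intra-modal, synchronous, and asynchronous cross-modal losses.

    p, z: dicts mapping 'v1', 'v2', 'a1', 'a2' to predictor / projector outputs,
    where the index is the timestamp the view was sampled from.
    Equal weighting of the three groups is an assumption of this sketch.
    """
    intra = 0.5 * (pair_loss(p, z, "v1", "v2") + pair_loss(p, z, "a1", "a2"))
    sync = 0.5 * (pair_loss(p, z, "a1", "v1") + pair_loss(p, z, "a2", "v2"))
    asyn = 0.5 * (pair_loss(p, z, "a1", "v2") + pair_loss(p, z, "a2", "v1"))
    return (intra + sync + asyn) / 3.0

# Toy embeddings: batch of 4 samples, 8-dimensional features per view.
rng = np.random.default_rng(0)
keys = ("v1", "v2", "a1", "a2")
z = {k: rng.normal(size=(4, 8)) for k in keys}
p = {k: rng.normal(size=(4, 8)) for k in keys}
loss = crisscross_loss(p, z)
print(loss)
```

Dropping the `asyn` term recovers a purely synchronous variant, which makes it straightforward to run the kind of ablation the framework is compared against.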

3.3 Relaxing Time Synchronicity

Audio and visual modalities from the same source clip generally maintain a very strong correlation, which makes them suitable for multi-modal representation learning as one modality can be used as a supervisory signal for the other in a self-supervised setup. However, our intuition behind CrissCross is that these cross-modal temporal correlations do not necessarily need to follow a strict frame-wise coupling. Instead, we hypothesize that relaxing cross-modal temporal synchronicity to some extent can help in learning more generalized representations.

To facilitate this idea within CrissCross, we present different temporal sampling methods to create the augmented views of a source clip. These temporal sampling methods are designed to explore varying amounts of temporal synchronicity when learning cross-modal relationships. (i) Same-timestamp: where both the audio and visual segments are sampled from the exact same time window (denoted as none in terms of temporal relaxation). (ii) Overlapped: where the two views of the audio-visual segments share overlap amongst them (denoted as mild relaxation). (iii) Adjacent: where adjacent frame sequences and audio segments are sampled (denoted as medium relaxation). (iv) Far-apart: in which we sample one view from the first half of the source clip, while the other view is sampled from the second half of the source clip (denoted as extreme relaxation). (v) Random: where the two audio-visual segments are sampled in a temporally random manner (denoted as mixed relaxation). It should be noted that the concept of relaxing cross-modal time synchronicity doesn’t apply to the uni-modal setups.
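The five sampling schemes can be sketched as a small function that returns the start times of the two views. The exact offsets used here (half-segment overlap for "overlapped", back-to-back segments for "adjacent"), the mode strings, and the function name are assumptions chosen for illustration; the text above only fixes the qualitative behaviour of each scheme.

```python
import random

def sample_view_starts(clip_len, seg_len, mode, rng=random):
    """Return start times (in seconds) of two segments sampled from one clip.

    clip_len: duration of the source clip; seg_len: duration of each segment.
    """
    if clip_len < 2 * seg_len:
        raise ValueError("clip must be long enough for two non-overlapping segments")
    last = clip_len - seg_len  # latest valid start time
    if mode == "same":  # no relaxation: identical time window
        t1 = rng.uniform(0, last)
        return t1, t1
    if mode == "overlapped":  # mild: second view starts halfway into the first (assumed)
        t1 = rng.uniform(0, last - seg_len / 2)
        return t1, t1 + seg_len / 2
    if mode == "adjacent":  # medium: second view starts where the first ends
        t1 = rng.uniform(0, last - seg_len)
        return t1, t1 + seg_len
    if mode == "far_apart":  # extreme: one view from each half of the clip
        half = clip_len / 2
        return rng.uniform(0, half - seg_len), rng.uniform(half, last)
    if mode == "random":  # mixed: two independent draws
        return rng.uniform(0, last), rng.uniform(0, last)
    raise ValueError(f"unknown sampling mode: {mode}")

# Example: a 10-second clip with 2-second segments.
r = random.Random(0)
for mode in ("same", "overlapped", "adjacent", "far_apart", "random"):
    print(mode, sample_view_starts(10.0, 2.0, mode, r))
```

With two timestamps in hand, the synchronous pairs use audio and video from the same timestamp, while the asynchronous pairs mix audio from one timestamp with video from the other.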

4 Experiment

The details of the experiment setup and the findings of our thorough empirical studies for investigating the major concepts of our proposed framework are presented here.

4.1 Experiment Setup

Datasets. We use datasets of different sizes for pretraining purposes, namely, Kinetics-Sound [l3-kineticssound], Kinetics-400 [kinetics400], and AudioSet [audioset]. Following the standard practices of prior works [avid, ravid, mmv, xdc, selavi, avts], we evaluate our self-supervised methods on two types of downstream tasks: (i) action recognition using UCF101 [ucf101] and HMDB51 [hmdb], and (ii) sound classification using ESC50 [esc]. We provide additional details for all the datasets in Section S3.

Architectures. Following the standard practice among prior works [avid, ravid, xdc, selavi, gdt], we use R(2+1)D [r2plus1d] and ResNet [resnet] as the visual and audio backbones. We use a slightly modified version [avid] of R(2+1)D-18 [r2plus1d] as the backbone for visual feature extraction. To extract the audio features, we use ResNet-18 [resnet]. The projector and predictor heads of the self-supervised framework are composed of MLPs. The details of all of the architectures are presented in Section S6.

Pretraining Details. To train the network in a self-supervised fashion with audio-visual inputs, we downsample the visual streams to frames per second, and feed -second frame sequences to the visual encoder. We resize the spatial resolution to , so the final input dimension to the visual encoder becomes , where represents the channels of RGB. Next, we downsample the audio signals to kHz, and segment them into -second segments. We then transform the segmented raw audio waveforms to mel-spectrograms using mel filters; we set the hop size to milliseconds and the FFT window length to . Finally, we feed spectrograms of shape to the audio encoder. We use the Adam [adam] optimizer with a cosine learning rate scheduler [cosine_lrs] to pretrain the encoders and use a fixed learning rate to train the predictors. We provide additional details of the hyperparameters in Section


4.2 Empirical Study

Here we present the empirical study performed to investigate the major concepts of our proposed framework. During the empirical study, all of the models are trained using Kinetics-Sound [l3-kineticssound] for epochs, unless stated otherwise. We perform transfer learning to evaluate visual and audio representations. Linear evaluation is performed using a one-vs-all SVM classifier (linear kernel) on the fixed features to quickly evaluate our models on downstream tasks. We prefer a one-vs-all SVM over training an FC layer to limit parameter tuning at this point. Moreover, to limit memory overhead, we use seconds ( frames) of visual input and seconds of audio input to extract the fixed features. The details of the linear evaluation protocol are provided in Section S5. We use UCF101 to evaluate visual representations on action recognition, and ESC50 to evaluate audio representations on sound classification. All of our empirical studies are evaluated using split-1 of both datasets.
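To make the linear evaluation idea concrete, the following is a small NumPy sketch of a one-vs-all linear SVM trained on frozen features with hinge loss and sub-gradient descent. It is not the protocol of Section S5: the synthetic Gaussian "features", the hyperparameters, and the function names are all assumptions, and in practice an off-the-shelf SVM solver would normally be used instead.

```python
import numpy as np

def train_ova_svm(X, y, n_classes, lr=0.1, reg=1e-3, epochs=200):
    """One-vs-all linear SVM via full-batch sub-gradient descent on the hinge loss."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    b = np.zeros(n_classes)
    Y = -np.ones((n, n_classes))
    Y[np.arange(n), y] = 1.0  # +1 for the sample's own class, -1 for all others
    for _ in range(epochs):
        margins = Y * (X @ W.T + b)        # shape (n, n_classes)
        active = (margins < 1.0) * Y       # signed mask of margin violations
        W += lr * (active.T @ X / n - reg * W)
        b += lr * active.mean(axis=0)
    return W, b

def predict(W, b, X):
    """Pick the class whose binary classifier gives the highest score."""
    return np.argmax(X @ W.T + b, axis=1)

def make_blobs(rng, per_class):
    """Synthetic stand-in for fixed features extracted by a frozen encoder."""
    centers = np.array([[0.0, 5.0], [5.0, -5.0], [-5.0, -5.0]])
    X = np.concatenate([c + 0.5 * rng.normal(size=(per_class, 2)) for c in centers])
    y = np.repeat(np.arange(len(centers)), per_class)
    return X, y

rng = np.random.default_rng(0)
X_train, y_train = make_blobs(rng, 40)
X_test, y_test = make_blobs(rng, 20)
W, b = train_ova_svm(X_train, y_train, n_classes=3)
acc = np.mean(predict(W, b, X_test) == y_test)
print(f"top-1 accuracy on held-out synthetic features: {acc:.2f}")
```

Because the encoder stays frozen, only the classifier is fitted, which keeps the comparison between pretraining variants cheap and limits the amount of tuning involved.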

4.2.1 Ablation Study

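Table 1: Ablation study. We present the results of CrissCross and its uni-modal and multi-modal ablation variants. Columns: Method, UCF101, ESC50. Rows: (a)-(b) uni-modal baselines, (c)-(e) multi-modal variants with single-term losses, (f)-(i) multi-modal variants with combined losses.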

We present the ablation results in Table 1 to show the improvement made by optimizing intra-modal, synchronous cross-modal, and asynchronous cross-modal losses. First, we train the framework in uni-modal setups, denoted as and . We report the top-1 accuracy of UCF101 and ESC50 as and respectively. Next, we train the network in a multi-modal setup, where we find that outperforms the other multi-modal variants with single-term losses (Table 1(c) to 1(e)) as well as the uni-modal baselines (Table 1(a) and 1(b)). Further study shows that combining different multi-modal losses (Table 1(f) to 1(i)) improves the model performance. Specifically, we notice that outperforms by and on action recognition and sound classification, respectively.

We further investigate the benefits of versus the top ablation competitors ( + and + ) on the large and diverse Kinetics-400 [kinetics400]. We observe that outperforms these variants by and in action recognition and sound classification, respectively, showing the significance of asynchronous cross-modal optimization in a multi-modal setup. Our intuition is that, as Kinetics-Sound consists of a few hand-picked classes that are prominently manifested in both audio and visual modalities, the performance gain of CrissCross is less prominent. However, Kinetics-400 is considerably larger in scale and comprises highly diverse action classes. It therefore benefits more from the more generalized representations learned by asynchronous cross-modal optimization. Additional results for Kinetics-400 can be found in the supplementary material.

4.2.2 Understanding Relaxed Time-synchronicity

In this subsection we study how different amounts of temporal relaxation in cross-modal synchronicity impact CrissCross. To do so, we exploit the different temporal sampling methods discussed earlier in Section 3. We further aim to identify the best temporal sampling method in a uni-modal setup. We train the Visual-only, Audio-only, and CrissCross frameworks using different temporal sampling methods. The results presented in Table 2 show that the overlapped sampling method works the best overall for both uni-modal setups. The same-timestamp sampling method shows poor performance on the Visual-only model. However, it performs as well as the overlapped sampling method on the Audio-only model. Interestingly, far-apart sampling shows the worst performance among all methods on the Audio-only model, whereas the Visual-only model works reasonably well with the far-apart sampling method. Next, we test these temporal sampling methods on CrissCross and present the results in Table 2. Interestingly, we notice that the same-timestamp and far-apart methods, which work poorly in the uni-modal setups, perform reasonably well in a multi-modal setup. We believe the improvement in performance here is due to the strong supervision received from the other modality. Nonetheless, we find that the overlapped temporal sampling method (mild temporal relaxation) outperforms the other approaches.

4.2.3 Exploring Audio-Visual Augmentations

We perform an in-depth study to explore the impact of different audio and visual augmentations.

Temp. Sampling          Uni-modal            CrissCross
(Temp. Relaxation)      UCF101    ESC50      UCF101    ESC50
Same (None)
Overlapped (Mild)
Adjacent (Medium)
Random (Random)
Far-apart (Extreme)
Table 2: Temporal sampling. Exploring temporal sampling methods for multi-modal and uni-modal representation learning.

Video Augs.        UCF101    Audio Augs.    ESC50
MSC-HF-CJ          62.3      VJ             44.8
MSC-HF-CJ-GS       68.1      VJ-M           49.5
MSC-HF-CJ-GS-C     68.3      VJ-M-TW        49.5

Video Augs. + Audio Augs.    UCF101    ESC50

Table 3: Augmentations. Exploring audio-visual augmentations.

Visual Augmentations. We explore a wide range of visual augmentations. As a starting point, we adopt the basic spatial augmentations used in [avid], which consist of Multi-Scale Crop (MSC), Horizontal Flip (HF), and Color Jitter (CJ). Additionally, we explore other augmentations, namely Gray Scale (GS), Gaussian Blur (GB) [simclr], and Cutout (C) [cutout], which show great performance in image-based self-supervised learning [simclr, scan]. We explore almost all the possible combinations of different visual augmentations in a uni-modal setup and present the results in Table 3. The results show that strong augmentations improve the top-1 accuracy by in comparison to the basic augmentations used in [avid]. We mention the augmentation parameters and implementation details in Section S2.

Temporal Consistency of Spatial Augmentations. While investigating different spatial augmentations, we are also interested to know whether the spatial augmentations should be consistent at the frame level or whether they should be random (i.e., vary among consecutive frames within a sequence). We refer to these concepts as temporally consistent and temporally random. We perform an experiment where we apply MSC-HF-CJ-GS randomly at the frame level, and compare the results to applying the same augmentations consistently across all the frames of a sequence. Our results show that maintaining temporal consistency in spatial augmentations across consecutive frames is beneficial, which is in line with the findings in [cvrl]. Specifically, temporally random augmentations result in a top-1 accuracy of , whereas the same augmentations applied in a temporally consistent manner result in .
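The difference between temporally consistent and temporally random augmentation comes down to whether the augmentation parameters are drawn once per clip or once per frame. The sketch below shows this for a random crop plus horizontal flip only; the function names, clip shape, and crop size are hypothetical and stand in for the fuller augmentation pipeline described in Section S2.

```python
import numpy as np

def crop_and_flip(frame, top, left, size, flip):
    """Crop a size x size window from one frame and optionally flip it horizontally."""
    out = frame[top:top + size, left:left + size]
    return out[:, ::-1] if flip else out

def augment_clip(clip, size, rng, consistent=True):
    """Apply random crop + horizontal flip to a clip of shape (T, H, W, C).

    consistent=True draws one set of parameters and reuses it for every frame;
    consistent=False draws fresh parameters for each frame.
    """
    n_frames, height, width, _ = clip.shape

    def draw():
        top = rng.integers(0, height - size + 1)
        left = rng.integers(0, width - size + 1)
        return top, left, rng.random() < 0.5

    params = [draw()] * n_frames if consistent else [draw() for _ in range(n_frames)]
    frames = [crop_and_flip(f, t, l, size, fl) for f, (t, l, fl) in zip(clip, params)]
    return np.stack(frames)

# Toy clip: 8 identical 16x16 RGB frames, so any difference comes from the augmentation.
frame = np.arange(16 * 16 * 3, dtype=float).reshape(1, 16, 16, 3)
clip = np.tile(frame, (8, 1, 1, 1))
consistent_out = augment_clip(clip, 8, np.random.default_rng(0), consistent=True)
random_out = augment_clip(clip, 8, np.random.default_rng(0), consistent=False)
print(consistent_out.shape, random_out.shape)
```

With consistent sampling, every frame of the output sees the same crop window and flip decision, so motion between frames is preserved rather than being confounded with augmentation jitter.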

Audio Augmentations. Similar to visual augmentations, we thoroughly investigate a variety of audio augmentations. Our audio augmentations include Volume Jitter (VJ), Time and Frequency Masking (Mask) [specaug], Random Crop (RC) [byol_audio], and Time Warping (TW) [specaug]. We also explore almost all the possible combinations of these augmentations, and present the results in Table 3. Our findings show that time-frequency masking and random crop improve the top-1 accuracy by compared to the base variant. We also notice that time warping doesn’t improve the performance and is also quite computationally expensive. Hence, going forward we do not use time warping during pretraining. We present the augmentation parameters and additional implementation details in Section S2.
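Two of the audio augmentations named above, volume jitter and time-frequency masking, can be sketched in a few lines of NumPy. The jitter range, mask widths, and spectrogram shape below are placeholder values chosen for illustration, not the parameters reported in Section S2.

```python
import numpy as np

def volume_jitter(wave, rng, max_scale=0.2):
    """Scale the waveform amplitude by a random factor in [1 - max_scale, 1 + max_scale]."""
    return wave * rng.uniform(1.0 - max_scale, 1.0 + max_scale)

def time_freq_mask(spec, rng, max_f=10, max_t=20):
    """Zero out one random frequency band and one random time band of a spectrogram.

    spec has shape (n_mels, n_frames); the input array is left untouched.
    """
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = rng.integers(0, min(max_f, n_mels) + 1)     # band height in mel bins
    f0 = rng.integers(0, n_mels - f + 1)
    t = rng.integers(0, min(max_t, n_frames) + 1)   # band width in frames
    t0 = rng.integers(0, n_frames - t + 1)
    spec[f0:f0 + f, :] = 0.0
    spec[:, t0:t0 + t] = 0.0
    return spec

rng = np.random.default_rng(0)
wave = np.ones(16000)        # placeholder waveform
spec = np.ones((80, 200))    # placeholder mel-spectrogram (hypothetical shape)
jittered = volume_jitter(wave, rng)
masked = time_freq_mask(spec, rng)
print(jittered[0], masked.shape)
```

Volume jitter acts on the raw waveform before the spectrogram is computed, whereas masking is applied to the spectrogram itself, which is why the two functions take different inputs.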

Audio-Visual Augmentations. We conduct further experiments on a few combinations of augmentations in a multi-modal setup. We pick the top performing augmentations obtained from the uni-modal variants and apply them concurrently. The results are presented in Table 3 where we find that the results are consistent with the uni-modal setups, as the combination of MSC-HF-CJ-GS-GB-C and VJ-M-RC performs the best in comparison to the other combinations.

4.2.4 Exploring Design Choices

UCF101                      ESC50
Base LR    10×Base LR       Base LR    10×Base LR
- -
- -
Table 4: Predictor learning rate. A comparative study of different predictor learning rates with respect to the base learning rate.

Predictor. Our empirical study shows that the predictor head plays an important role in effectively training the audio and visual encoders to learn good representations. The predictor architecture is similar to [simsiam]. For the sake of completeness, we provide the details of the predictor head in Section S6. We explore (i) different learning rates, and (ii) using a common vs. a separate predictor in the multi-modal setup. It should be noted that none of the variants cause a collapse, even though we notice considerable differences in performance. We present the findings in the following paragraphs.

Similar to [simsiam], we use a constant learning rate for the predictors. However, unlike [simsiam], where the predictor learning rate is the same as the base learning rate of the encoder, we find that a higher predictor learning rate helps the network to learn better representations in both uni-modal and multi-modal setups. In the case of CrissCross, setting the predictor learning rate to be the same as the base learning rate results in unstable training, and the loss curve shows oscillating behavior. We empirically find that setting the predictor learning rate to 10 times the base learning rate works well. We present the results in Table 4.
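The learning-rate arrangement can be sketched as two schedules: a cosine decay for the encoders and a constant, scaled-up rate for the predictors. The factor of 10 is taken from the "10×Base LR" column header of Table 4; the base learning rate, step counts, and function names are hypothetical, and any warm-up phase is omitted.

```python
import math

def cosine_lr(step, total_steps, base_lr, final_lr=0.0):
    """Cosine decay from base_lr at step 0 to final_lr at total_steps."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return final_lr + (base_lr - final_lr) * cos

def learning_rates(step, total_steps, base_lr, predictor_scale=10.0):
    """Encoder follows the cosine schedule; the predictor rate stays constant."""
    return {
        "encoder": cosine_lr(step, total_steps, base_lr),
        "predictor": predictor_scale * base_lr,
    }

# Hypothetical base learning rate and schedule length.
for step in (0, 50, 100):
    print(step, learning_rates(step, total_steps=100, base_lr=0.001))
```

In an optimizer with parameter groups, these two values would be assigned to the encoder group and the predictor group respectively at every step.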

Next, we evaluate whether the framework can be trained in a multi-modal setup with a common predictor head instead of separate predictor heads (the default setup). In simple terms, one predictor head would work towards identity mapping for both audio and video feature vectors. To test this, l2-normalized feature vectors and are fed to the predictor, which are then used in the usual manner to optimize the cost function. The results are presented in Table 5. We observe that though such a setup works somewhat well, having separate predictors is beneficial for achieving more stable training and learning better representations.

Projector. We present a comparative study of projector heads with 2 layers vs. 3 layers. We notice , and improvements in top-1 accuracies when using layers instead of on action recognition and sound classification respectively (please refer to Table 5). Note that we use fully-connected layers as the default setup for the projectors. The details of the architecture are presented in Section S6.

Predictor Projector
Common Separate 2 Layers 3 Layers
Table 5: Design choice. Exploring design choices for the predictor and projector heads.

5 Comparison to the State-of-the-Art

We compare the proposed CrissCross framework against state-of-the-art methods. We validate visual representations on action recognition, and audio representations on sound classification. We present the details in the following.

5.1 Action Recognition

Input Size
Pretraining Dataset: Kinetics-Sound (22K)
Fully Supervised[cmacc] 3D-ResNet18
CM-ACC[cmacc] 3D-ResNet18
CrissCross R(2+1)D-18
Pretraining Dataset: Kinetics-400 (240K)
Fully Supervised[gdt] R(2+1)D-18 95.0 74.0
AVTS [avts] MC3-18
SeLaVi [selavi] R(2+1)D-18
XDC [xdc] R(2+1)D-18
AVID [avid] R(2+1)D-18
GDT [gdt] R(2+1)D-18
Robust-xID [ravid] R(2+1)D-18
CrissCross R(2+1)D-18
Pretraining Dataset: AudioSet (1.8M)
Fully Supervised[brave] R(2+1)D-18 96.8 75.9
AVTS [avts] MC3-18
XDC [xdc] R(2+1)D-18
AVID [avid] R(2+1)D-18
GDT [gdt] R(2+1)D-18
MMV [mmv] R(2+1)D-18
BraVe [brave] R(2+1)D-18
CM-ACC [cmacc] R(2+1)D-18
CrissCross R(2+1)D-18
Table 6: State-of-the-art comparison on action recognition. Top-1 accuracy averaged over all the splits on UCF101 and HMDB51 are presented. We group the results based on the pretraining dataset. Additionally, we present the architecture details and finetuning input size of the respective methods.

Full-Finetuning. In line with [xdc, selavi, avid, gdt, cmacc], we benchmark CrissCross using UCF101 [ucf101] and HMDB51 [hmdb] on action recognition. We briefly mention the experimental setup for downstream evaluation here and redirect readers to Section S5 for additional information. We use the pretrained 18-layer R(2+1)D [r2plus1d] as the video backbone, and fully finetune it on action recognition. We use Kinetics-Sound [l3-kineticssound], Kinetics-400 [kinetics400], and AudioSet [audioset] for pretraining. For a fair comparison to earlier works, we adopt two setups for finetuning, one with 8 frames and the other with frames. In both these setups, we use a spatial resolution of . We tune the model using the split-1 of both datasets and report the top-1 accuracy averaged over all the splits.

The comparison of CrissCross with recent prior works is presented in Table 6. To save computation time and resources, the fully-supervised baselines compared to CrissCross are taken directly from prior works [cmacc, gdt, brave, xdc] and have not been re-implemented by us. When pretrained with Kinetics-400, CrissCross achieves state-of-the-art results on UCF101 and HMDB51 in both fine-tuning setups. CrissCross outperforms the current state-of-the-art AVID [avid] on UCF101 and HMDB51 by and , respectively, when fine-tuned with frame inputs. Additionally, when fine-tuned with frames, CrissCross outperforms the current state-of-the-art GDT [gdt] and AVID [avid] by and on UCF101 and HMDB51 respectively. CrissCross shows significant improvements compared to CM-ACC [cmacc] on UCF101 ( vs. ) and HMDB51 ( vs. ) when pretrained using the small-scale dataset Kinetics-Sound [l3-kineticssound]. Additionally, CrissCross outperforms fully-supervised baselines by and on UCF101 and HMDB51 respectively when both the fully-supervised and self-supervised methods are pretrained on Kinetics-Sound [l3-kineticssound]. To the best of our knowledge, this is the first time that self-supervision outperforms fully-supervised pretraining on action recognition using the same small-scale dataset, showing that our method performs well even with limited pretraining data. Finally, CrissCross outperforms the current state-of-the-art, AVID [avid], when pretrained on AudioSet and fine-tuned with -frame inputs, on both UCF101 and HMDB51. Next, when fine-tuned with -frame inputs, CrissCross achieves competitive results amongst the leading methods. It should be noted that prior works which show slightly better performance compared to CrissCross are trained with much longer visual input segments. For example, BraVe [brave] uses augmented views of and frames, whereas CrissCross takes only frames as input for both views. Please see the additional results section of the supplementary material for an extended list of comparisons.

Clip Order [cliporder]
VCP [vcp]
VSP [vsp]
CoCLR [coclr]
SeLaVi [selavi]
GDT [gdt]
Robust-xID [ravid]
Table 7: State-of-the-art comparison on action recognition retrieval performance. We present the accuracy of video retrieval on UCF and HMDB datasets for different numbers of nearest neighbors, using the video backbone pretrained on Kinetics400.

Retrieval. In addition to full finetuning, we also compare the performance of CrissCross in an unsupervised setup. Following prior works [ravid, gdt, selavi], we perform a retrieval experiment. We use the split-1 of both UCF101 [ucf101] and HMDB51 [hmdb] and present the comparison with prior works in Table 7. We observe that CrissCross outperforms the current state-of-the-art on UCF101, while achieving competitive results on HMDB51. We present additional details of the retrieval experiment setup in Section S5.

5.2 Sound Classification

Method Backbone
Kinetics-400 AudioSet
AVTS [avts] VGG-8 sec.
XDC [xdc] ResNet-18 sec.
AVID [avid] ConvNet-9 sec.
GDT [gdt] ResNet-9 sec. -
MMV [mmv] ResNet-50 sec. -
BraVe [brave] ResNet-50 sec. -
ResNet-18 2 sec.
CrissCross ResNet-18 sec.
Table 8: State-of-the-art comparison on sound classification. Top-1 accuracy averaged over all the splits on ESC50 is presented. Additionally, we present the linear evaluation input size and the architecture details of the respective methods.

To evaluate audio representations learned by CrissCross, we use the popular benchmark ESC50 [esc] to perform sound classification. We find large variability of experimental setups in the literature for evaluating audio representations. For example, different backbones, different input lengths, different datasets, and different evaluation protocols (linear evaluation, full-finetuning) have been used, making it impractical to compare to all the prior works. Following [avid, avts, ravid, brave, mmv], we perform linear classification using a one-vs-all SVM. For the sake of fair comparison with a wide range of prior works, we perform linear evaluation using and -second inputs. We redirect readers to Section S5 for additional details of the evaluation protocols and to the additional results section of the supplementary material for an extended list of comparisons. As presented in Table 8, when pretrained on Kinetics-400 and evaluated with -second inputs, CrissCross outperforms the current state-of-the-art AVID [avid] by . Additionally, when pretrained on AudioSet and evaluated with -second inputs, CrissCross marginally outperforms the current state-of-the-art, BraVe [brave].

6 Summary

We propose a novel self-supervised framework to learn audio-visual representations by considering intra-modal as well as synchronous and asynchronous cross-modal relationships. We conduct a thorough study investigating the major concepts of our framework. Our findings show that properly composed strong augmentations and relaxation of cross-modal temporal synchronicity are beneficial for learning effective audio-visual representations. These representations can then be used for a variety of downstream tasks including action recognition, sound classification, and retrieval.

Limitations. The notion of asynchronous cross-modal optimization has not been explored beyond audio-visual modalities. For example, our model can be expanded to consider more than two modalities (e.g., audio, visual, and text), which we are yet to study. Additionally, we notice a considerable performance gap between full-supervision and self-supervision when both methods are pretrained with the same large-scale dataset (Kinetics-400 or AudioSet), showing room for further improvement.

Broader Impact. Better self-supervised audio-visual learning can be used for detection of harmful content on the Internet. Additionally, such methods can be used to develop better multimedia systems and tools. Lastly, the notion that relaxed cross-modal temporal synchronicity is useful can challenge our existing/standard approaches in learning multi-modal representations and result in new directions of inquiry. The authors do not foresee any major negative impacts.


We are grateful to Bank of Montreal and Mitacs for funding this research. We are thankful to Vector Institute and SciNet HPC Consortium for helping with the computation resources.


Change Log

  • V1: Initial release.

  • V2: Minor modifications in the text.

Supplementary Material

In this supplementary material we provide additional details of our experimental setup and results as follows:

  • Section S1: Pseudocode;

  • Section S2: Details of data augmentations;

  • Section S3: Details of all the datasets;

  • Section S4: Hyperparameters and training details;

  • Section S5: Downstream evaluation protocols;

  • Section S6: Architecture details;

  • Section S7: Additional results and analysis.

S1 Algorithms

We present the pseudocode of our proposed CrissCross framework in Algorithm 1. Please note that this pseudocode is written in a PyTorch-like format.

import torch.nn as nn

class CrissCross(nn.Module):
    def __init__(self, fv, fa, hv, ha, D):
        super().__init__()
        self.fv, self.fa = fv, fa   # video and audio encoders
        self.hv, self.ha = hv, ha   # video and audio predictor heads
        self.D = D                  # D: loss function (passed in so the module is self-contained)

    def forward(self, v1, v2, a1, a2):
        fv, fa, hv, ha, D = self.fv, self.fa, self.hv, self.ha, self.D
        # video
        zv1, zv2 = fv(v1), fv(v2)       # video embeddings
        pv1, pv2 = hv(zv1), hv(zv2)     # predictor output
        # audio
        za1, za2 = fa(a1), fa(a2)       # audio embeddings
        pa1, pa2 = ha(za1), ha(za2)     # predictor output
        # loss calculation
        # asynchronous cross-modal loss
        Lv1a2 = D(pv1, za2)/2 + D(pa2, zv1)/2 # v1-a2
        La1v2 = D(pa1, zv2)/2 + D(pv2, za1)/2 # a1-v2
        L_async = (Lv1a2 + La1v2)/2
        # synchronous cross-modal loss
        Lv1a1 = D(pv1, za1)/2 + D(pa1, zv1)/2 # v1-a1
        Lv2a2 = D(pv2, za2)/2 + D(pa2, zv2)/2 # v2-a2
        L_sync = (Lv1a1 + Lv2a2)/2
        # intra-modal loss
        Lv1v2 = D(pv1, zv2)/2 + D(pv2, zv1)/2 # v1-v2
        La1a2 = D(pa1, za2)/2 + D(pa2, za1)/2 # a1-a2
        L_intra = (Lv1v2 + La1a2)/2
        # total loss
        L_CrissCross = (L_async + L_sync + L_intra)/3
        return L_CrissCross


Algorithm 1 CrissCross pseudocode (PyTorch-like)
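One way to read Algorithm 1 is that every term D(p_x, z_y) receives the same weight of 1/12, so the total loss is the mean of D over all ordered pairs of distinct views among v1, v2, a1, and a2. The sketch below checks this numerically in plain Python. It assumes D is a negative cosine similarity in the style of [simsiam]; gradient handling is omitted and the toy vectors are hypothetical.

```python
import itertools
import math


def neg_cosine(p, z):
    """Assumed form of D: negative cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(p, z))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in z))
    return -dot / norm


def crisscross_loss(p, z, D=neg_cosine):
    """Nested form mirroring Algorithm 1; p and z map view names to vectors."""
    def sym(x, y):
        return D(p[x], z[y]) / 2 + D(p[y], z[x]) / 2

    l_async = (sym("v1", "a2") + sym("a1", "v2")) / 2
    l_sync = (sym("v1", "a1") + sym("v2", "a2")) / 2
    l_intra = (sym("v1", "v2") + sym("a1", "a2")) / 2
    return (l_async + l_sync + l_intra) / 3


def flat_loss(p, z, D=neg_cosine):
    """Flat form: mean of D over all ordered pairs of distinct views."""
    pairs = list(itertools.permutations(["v1", "v2", "a1", "a2"], 2))
    return sum(D(p[x], z[y]) for x, y in pairs) / len(pairs)


# Toy embeddings z and predictor outputs p (hypothetical values).
z = {"v1": [1.0, 0.0, 0.5], "v2": [0.9, 0.1, 0.4], "a1": [0.2, 1.0, 0.0], "a2": [0.1, 0.8, 0.3]}
p = {"v1": [0.8, 0.1, 0.6], "v2": [1.0, 0.0, 0.3], "a1": [0.3, 0.9, 0.1], "a2": [0.0, 1.0, 0.2]}
difference = abs(crisscross_loss(p, z) - flat_loss(p, z))
```

Under these assumptions `difference` is zero up to floating-point error, which reflects that the asynchronous, synchronous, and intra-modal groups contribute four terms each with equal weight.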

S2 Data Augmentation

Visual Augmentations. The parameters for visual augmentations are presented in Table S1. Some of the parameters are chosen from the literature, while the rest are found through empirical search. We set the parameters of Multi Scale Crop, Gaussian Blur, and Gray Scale as suggested in [simclr], and the parameters for Color Jitter are taken from [avid]. We use TorchVision [Pytorch] for all the implementations of visual augmentations, except Cutout, for which we use a separately available implementation. Please note that for the Cutout transformation, the mask is created with the mean value of the first frame in the sequence. We summarize the augmentation schemes used for pretraining and evaluation in Table S2.

Audio Augmentations. We present the parameters used for audio augmentations in Table S3. We use the Librosa [librosa] library to generate mel-spectrograms. We use the techniques proposed in [specaug] to perform Time Mask, Frequency Mask, and Time Warp transformations. The parameters for the audio augmentations are set empirically, except for Random Crop, which we adopt from [byol_audio]. We summarize the augmentation schemes used for pretraining and evaluation in Table S4.

Augmentation      Parameters
Multi Scale Crop  min area = 0.08
Horizontal Flip   p = 0.5
Color Jitter      brightness = 0.4, contrast = 0.4, saturation = 0.4, hue = 0.2
Gray Scale        p = 0.2
Gaussian Blur     p = 0.5
Cutout            max size = 20, num = 1
Table S1: Visual augmentation parameter details.
Linear evaluation
Table S2: Visual augmentation summary.
Augmentation    Parameters
Volume Jitter   range = 0.2
Time Mask       max size = 20, num = 2
Frequency Mask  max size = 10, num = 2
Timewarp        wrap window = 20
Random Crop     range = [0.6,1.5], crop scale = [1.0,1.5]
Table S3: Audio augmentation parameter details.
Linear evaluation
Table S4: Audio augmentation summary.
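To make the Time Mask and Frequency Mask parameters of Table S3 more concrete, the following is a minimal plain-Python sketch of spectrogram masking in the spirit of [specaug]. It is not the implementation used in the paper: the function name, the zero fill value, the uniform sampling of the mask width, and the toy 80 x 100 spectrogram are all assumptions, and "max size" is interpreted here as the maximum mask width in bins or frames.

```python
import random


def mask_spectrogram(spec, max_size, num, axis, rng, fill=0.0):
    """Return a copy of spec (rows = frequency bins, columns = time frames)
    with `num` random bands of width in [0, max_size] set to `fill`.
    axis is "time" (mask columns) or "freq" (mask rows)."""
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    length = n_time if axis == "time" else n_freq
    for _ in range(num):
        width = rng.randint(0, min(max_size, length))   # inclusive bounds
        start = rng.randint(0, length - width)
        for f in range(n_freq):
            for t in range(n_time):
                idx = t if axis == "time" else f
                if start <= idx < start + width:
                    out[f][t] = fill
    return out


rng = random.Random(0)
spec = [[1.0] * 100 for _ in range(80)]   # toy mel-spectrogram (assumed shape)
spec = mask_spectrogram(spec, max_size=20, num=2, axis="time", rng=rng)
spec = mask_spectrogram(spec, max_size=10, num=2, axis="freq", rng=rng)
```

With these settings at most 40 time frames and at most 20 frequency bins are masked per spectrogram, and the two masks may overlap.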

S3 Datasets

Pretraining Datasets. We use 3 datasets of different sizes for pretraining, namely, Kinetics-Sound [l3-kineticssound], Kinetics400 [kinetics400], and AudioSet [audioset]. Kinetics-Sound is a small-scale action recognition dataset, which has a total of 22K video clips, distributed over action classes. Kinetics400 is a medium-scale human action recognition dataset, originally collected from YouTube. It has a total of 240K training samples and 400 action classes. Please note that Kinetics-Sound is a subset of Kinetics400, and consists of action classes which are prominently manifested audibly and visually [l3-kineticssound]. Lastly, AudioSet [audioset] is a large-scale video dataset of audio events consisting of a total of 1.8M audio-video segments originally obtained from YouTube spread over audio classes. Please note that none of the provided labels are used in self-supervised pretraining.

Downstream Datasets. Following the standard practices of prior works [avid, ravid, mmv, xdc, selavi, avts], we evaluate our self-supervised methods on two types of downstream tasks: (i) action recognition based on visual representations and (ii) sound classification based on audio representations. To perform action recognition, we use two popular benchmarks, i.e., UCF101 [ucf101] and HMDB51 [hmdb]. UCF101 consists of a total of K clips distributed among 101 action classes, while HMDB51 contains nearly K video clips distributed over 51 action categories. To perform sound classification, we use the popular benchmark ESC50 [esc], which is a collection of K audio events comprising 50 classes.

S4 Hyperparameters and Training Details

Abbreviations  Name                     Description
bs             batch size               The size of a mini-batch.
es             epoch size               The total number of samples per epoch.
ep             total epochs             The total number of epochs.
lr             learning rate            The learning rates to train the networks (audio backbone lr, video backbone lr, audio predictor lr, video predictor lr).
lrs            learning rate scheduler  The learning rate scheduler to train the network.
ms             milestones               At every ms epoch the learning rate is decayed.
               lr decay rate            The learning rate is decayed by a factor of .
wd             weight decay             The weight decay used in the SGD optimizer.
Table S5: Abbreviations and descriptions of the hyperparameters.
dataset method bs es ep optim lrs lr(start/end) lr(start/end) lr lr wd betas
Kinetics-Sound CrissCross K Adam Cosine
Kinetics-400 CrissCross M Adam, LARC Cosine
AudioSet CrissCross M Adam, LARC Cosine
Table S6: Pretext training parameters.
dataset input es bs ep ms optim lrs lr wd momentum dropout
UCF101 K SGD multi-step
UCF101 K SGD multi-step
HMDB51 K SGD multi-step
HMDB51 K SGD multi-step
Table S7: Full-finetuning hyperparameters for action recognition when pretrained on Kinetics-400.

In this section, we present the details of the hyperparameters, computation requirements, as well as additional training details of self-supervised pretraining and full finetuning.

S4.1 Pretraining Details

We present the pretraining hyperparameters of CrissCross in Table S6. Most of the parameters remain the same across all datasets, with the exception of a few hyperparameters such as learning rates and epoch size, which are set depending on the size of the datasets. We train on Kinetics-Sound with a batch size of , on a single node with Nvidia RTX-6000 GPUs. Next, when training on Kinetics-400 and AudioSet, we use nodes and set the batch size to . The Adam [adam] optimizer is used to train our proposed framework. We use LARC [larc] as a wrapper to the Adam optimizer to clip the gradients while pretraining with a batch size of 2048. In this work, we stick to batch sizes of and , because (i) they show stable performance based on the findings of [simsiam], and (ii) they fit well with our available GPU setups. Additionally, we perform mixed-precision training [amp] using PyTorch AMP [Pytorch] to reduce the computation overhead.

Ablation Parameters. In the ablation study, we keep the training setup exactly identical across all the variants, with the exception of the learning rates, which we tune to find the best performance for that particular variant. For example, we set the base learning rate for visual-only and audio-only models as and respectively. Next, the predictor learning rates are set to and for the visual-only and audio-only variants.

S4.2 Full Finetuning Details

The full finetuning hyperparameters for both the benchmarks are presented in Table S7. We use a batch size of for the -frame input and for the -frame input. We use an SGD optimizer with multi-step learning rate scheduler to finetune the video backbones. Please note that we perform the full finetuning on a single Nvidia RTX-6000 GPU.
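For readers unfamiliar with the multi-step scheduler, the sketch below shows how the milestones (ms) and the decay rate from Table S5 would interact, assuming the usual semantics in which the learning rate is multiplied by the decay rate once each milestone epoch is reached (as in PyTorch's `MultiStepLR`). The function name and all numeric values are hypothetical and are not the settings of Table S7.

```python
def multistep_lr(epoch, base_lr, milestones, decay_rate):
    """Learning rate at `epoch`: base_lr multiplied by decay_rate once for
    every milestone epoch that has already been reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * (decay_rate ** passed)


# Hypothetical schedule: decay by 10x at epochs 6 and 10.
lrs = [multistep_lr(epoch, base_lr=0.1, milestones=[6, 10], decay_rate=0.1) for epoch in range(12)]
```

Here `lrs` stays at the base rate for epochs 0 to 5, drops by one factor for epochs 6 to 9, and by two factors from epoch 10 onward.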

S5 Downstream Evaluation Protocol

To evaluate the representations learned with self-supervised pretraining, we test the proposed framework in different setups, namely linear evaluation, full finetuning, and retrieval. The details of the evaluation protocols are mentioned below.

S5.1 Linear Evaluation

To perform linear evaluation of the learned representations on downstream tasks, we extract fixed features (also called frozen features) using the pretrained backbones. We train a linear classifier using a one-vs-all SVM on the fixed feature representations to compare different variants of our model. We present the details below.

Video. Following the protocols mentioned in [mmv, brave], we use -frame inputs to the video backbone, with a spatial resolution of . During training, we randomly pick clips per sample to extract augmented representations, while during testing, we uniformly select clips per sample. The augmentation techniques are mentioned in Section S2. We do not apply Gaussian Blur while extracting the training features since it deteriorates the performance. Moreover, to perform a deterministic evaluation, we do not apply any augmentations during validation. The visual features are extracted from the final convolution layer and passed to a max-pool layer with a kernel size of [avid]. Finally, we use the learned visual representations to train a linear SVM classifier in order to perform action recognition.

Audio. In the case of sound classification, we use and -second audio inputs to extract audio representations. Following [gdt], we extract epochs worth of augmented feature vectors from the training clips. During testing, when using -second inputs, we extract equally spaced audio segments [avid, gdt, ravid, xdc], and when using -second inputs, we extract segment [mmv, brave] from each sample. We perform the augmentations mentioned in Section S2 to extract the training features. We notice that, unlike in self-supervised pretraining, time warping improves the model performance in the linear evaluation. We do not apply any augmentations during validation. We extract the representations from the final convolution layer and pass them through a max-pool layer with a kernel size of and a stride of [gdt]. Similar to action recognition, we perform classification using a one-vs-all linear SVM classifier.

Please note that while training the SVM, we sweep the cost values over {0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01, 1} and report the best accuracy. In both cases, action recognition and sound classification, we obtain the sample-level prediction by averaging the clip-level predictions and report the top-1 accuracy.
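The aggregation from clip-level to sample-level predictions can be sketched as follows. This is a minimal plain-Python illustration rather than the evaluation code of the paper; the function names and the toy scores are hypothetical, and each clip is assumed to come with one score per class (for instance, SVM decision values).

```python
def sample_level_prediction(clip_scores):
    """Average per-clip class scores; return (predicted_class, mean_scores)."""
    n_clips = len(clip_scores)
    n_classes = len(clip_scores[0])
    mean_scores = [sum(clip[c] for clip in clip_scores) / n_clips for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: mean_scores[c])
    return predicted, mean_scores


def top1_accuracy(per_sample_clip_scores, labels):
    """Fraction of samples whose averaged prediction matches the label."""
    correct = sum(
        sample_level_prediction(scores)[0] == label
        for scores, label in zip(per_sample_clip_scores, labels)
    )
    return correct / len(labels)


# Two toy samples with three and two clips respectively, three classes.
samples = [
    [[0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.8, 0.1]],
    [[0.5, 0.4, 0.1], [0.9, 0.05, 0.05]],
]
accuracy = top1_accuracy(samples, labels=[1, 0])
```

In the first sample one clip on its own would favor class 0, but the averaged scores favor class 1, which is the kind of smoothing that clip averaging provides.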

S5.2 Full Finetuning

Following earlier works [xdc, avid, ravid, selavi], we use the pretrained visual backbone along with a newly added fully-connected layer for full finetuning on UCF101 [ucf101] and HMDB51 [hmdb]. We adopt two setups for full finetuning, -frame inputs and -frame inputs. In both cases we use a spatial resolution of . Lastly, we replace the final adaptive average-pooling layer with an adaptive max-pooling layer. We find that applying strong augmentations improves the model performance in full-finetuning. Please see the augmentation details in Section S2. During testing, we extract equally spaced clips from each sample and do not apply any augmentations. We report the top-1 accuracy at sample level prediction by averaging the clip level predictions. We use an SGD optimizer with a multi-step learning rate scheduler to finetune the model. We present the hyperparameters of full-finetuning in Table S7.

S5.3 Retrieval

In addition to full-finetuning, we also perform retrieval to test the quality of the representations in an unsupervised setup. We follow the evaluation protocol laid out in [gdt, cliporder]. We uniformly select clips per sample from both training and test splits. We feed -second inputs to the backbone to extract representations. We empirically test additional steps such as l2-normalization and applying batch-normalization on the extracted features, and notice that they do not help the performance. Hence, we simply average the features extracted from the test split to query the features of the training split. We compute the cosine distance between the feature vectors of the test clips (query) and the representations of all the training clips (neighbors). We consider a correct prediction if neighboring clips of a query clip belong to the same class. We calculate accuracies for . We use the NearestNeighbors API (sklearn.neighbors.NearestNeighbors) provided in scikit-learn in this experiment.
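The retrieval metric can be illustrated with a small plain-Python stand-in for the nearest-neighbor search. The sketch assumes the common convention that a query counts as correct when at least one of its k nearest training neighbors shares its class; the function names and the toy 2-D features are hypothetical.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def retrieval_accuracy(query_feats, query_labels, train_feats, train_labels, k):
    """Fraction of queries whose k nearest training neighbors (by cosine
    distance) contain at least one item of the query's class (assumed rule)."""
    hits = 0
    for q, q_label in zip(query_feats, query_labels):
        order = sorted(range(len(train_feats)), key=lambda i: cosine_distance(q, train_feats[i]))
        if any(train_labels[i] == q_label for i in order[:k]):
            hits += 1
    return hits / len(query_feats)


# Toy features (hypothetical): two classes in a 2-D feature space.
train_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = [0, 0, 1, 1]
query_feats = [[1.0, 0.05], [0.6, 0.5]]
query_labels = [0, 1]
acc_at_1 = retrieval_accuracy(query_feats, query_labels, train_feats, train_labels, k=1)
acc_at_3 = retrieval_accuracy(query_feats, query_labels, train_feats, train_labels, k=3)
```

In this toy example the second query lies closer to the class-0 features, so it is missed at k = 1 and only recovered once three neighbors are considered.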

S6 Architecture Details

frames 112 8 3 - - - -
conv1 56 8 64 7 3 2 1
maxpool 28 8 64 3 1 2 1
block2.1.1 28 8 64 3 3 1 1
block2.1.2 28 8 64 3 3 1 1
block2.2.1 28 8 64 3 3 1 1
block2.2.2 28 8 64 3 3 1 1
block3.1.1 14 4 128 3 3 2 2
block3.1.2 14 4 128 3 3 1 1
block3.2.1 14 4 128 3 3 1 1
block3.2.2 14 4 128 3 3 1 1
block4.1.1 7 2 256 3 3 2 2
block4.1.2 7 2 256 3 3 1 1
block4.2.1 7 2 256 3 3 1 1
block4.2.2 7 2 256 3 3 1 1
block5.1.1 4 1 512 3 3 2 2
block5.1.2 4 1 512 3 3 1 1
block5.2.1 4 1 512 3 3 1 1
block5.2.2 4 1 512 3 3 1 1
- - 512 - - - -
Table S8: Architecture of the video backbone: R(2+1)D-18.
Table S9: Architecture of the audio backbone: ResNet-18.
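The feature-map sizes listed for the video backbone in Table S8 can be sanity-checked with the standard output-size formula for strided convolution and pooling. The sketch below assumes that the columns of Table S8 are, in order, spatial size, temporal size, channels, spatial kernel, temporal kernel, spatial stride, and temporal stride, and that each layer uses a padding of kernel // 2; neither assumption is stated in the table itself. Only the layers with a stride of 2 are listed, since the remaining layers preserve the size under these assumptions.

```python
def conv_out(size, kernel, stride, padding=None):
    """Output length along one axis: floor((size + 2p - k) / s) + 1."""
    if padding is None:
        padding = kernel // 2   # assumed padding, not stated in Table S8
    return (size + 2 * padding - kernel) // stride + 1


# (name, spatial kernel, temporal kernel, spatial stride, temporal stride)
layers = [
    ("conv1", 7, 3, 2, 1),
    ("maxpool", 3, 1, 2, 1),
    ("block3.1.1", 3, 3, 2, 2),
    ("block4.1.1", 3, 3, 2, 2),
    ("block5.1.1", 3, 3, 2, 2),
]

spatial, temporal = 112, 8      # input frames: 112 x 112 pixels, 8 frames
sizes = []
for name, ks, kt, ss, st in layers:
    spatial = conv_out(spatial, ks, ss)
    temporal = conv_out(temporal, kt, st)
    sizes.append((name, spatial, temporal))
# sizes -> conv1 (56, 8), maxpool (28, 8), block3.1.1 (14, 4),
#          block4.1.1 (7, 2), block5.1.1 (4, 1)
```

Under these assumptions the computed sizes agree with the first two columns of Table S8.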