Robust Audio-Visual Instance Discrimination

03/29/2021 ∙ by Pedro Morgado, et al.

We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences often produce faulty positives since the audio and video signals can be uninformative of each other. To limit the detrimental impact of faulty positives, we optimize a weighted contrastive learning loss, which down-weighs their contribution to the overall loss. Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives. To alleviate the impact of faulty negatives, we propose to optimize an instance discrimination loss with a soft target distribution that estimates relationships between instances. We validate our contributions through extensive experiments on action recognition tasks and show that they address the problems of audio-visual instance discrimination and improve transfer learning performance.




1 Introduction

Self-supervised representation learning aims to learn feature representations that can transfer to downstream tasks without costly human annotations. Many recent self-supervised methods [swav, moco, pirl, simclr, cld, cmc] use a variant of the instance discrimination framework [instance, exemplar], which matches features from multiple views/augmentations of the same instance, while distinguishing these features from those of other instances. This often relies on a contrastive loss [hadsell2006dimensionality], where different augmentations are considered ‘positives’ and other samples ‘negatives.’

Cross-modal instance discrimination (xID) extends instance discrimination to the realm of multiple modalities, where data modalities, such as video, audio, or text, act as the different ‘views’ of an instance. Since there is a strong correlation between audio and visual events (e.g., the sound of an instrument or a baseball match), audio-visual instance discrimination has gained popularity [l3, multisensory, bruno_avts, avid, evolving_losses, xdc, gdt, versatile_nets]. Representations learned by these methods show promising performance on tasks like action recognition and environmental sound classification. xID methods rely on two key assumptions - (1) the audio and video of a sample are informative of each other, i.e., positives; (2) the audio and video of all other samples are not related, i.e., negatives. In practice, both these assumptions are too strong and do not hold for a significant amount of real-world data. This results in faulty positive samples that are not related to each other and faulty negative samples that are semantically related.

Figure 1: Example of a positive audio/video pair and negative instances used for contrastive learning. Audio-visual signals may not semantically correspond to each other, such as the pairing weightlifting video/background music shown in green, which leads to faulty positives. Due to random sampling, semantically similar instances can also appear as faulty negatives, e.g. a second weightlifting video in yellow. Faulty positive and negative samples are a common occurrence in audio-visual contrastive learning and hurt representation learning.
Figure 2: Comparison between standard cross-modal instance discrimination (xID) and the proposed procedure. In xID, samples contribute equally to the loss, and all instances other than themselves are treated as negatives. The proposed method addresses the two main sources of noisy training signals: faulty positives and faulty negatives. Faulty positives are discounted by down-weighting instances with poor audio-visual correspondence. Faulty negatives are addressed by optimizing the loss over a soft target distribution that encodes instance similarity.

Fig. 1 shows examples of these faulty correspondences. Videos where the audio is uninformative of the visual content can lead to faulty positives, e.g., videos containing audio from sources outside of the camera field-of-view or containing post-edited sounds like a soundtrack. Similarly, random negative sampling can produce faulty negatives, i.e., negative samples that are semantically related to the positive. These faulty correspondences undermine the primary goal of representation learning, i.e., to ensure that similar instances have similar feature representations. As we show empirically in fig. 7 and table 1, they can hurt representation learning and degrade downstream performance. Thus, we believe cross-modal learning should be seen as a problem of learning with noisy targets. This raises the question of how to identify faulty positive and negative samples in the absence of human annotations.

We propose to use cross-modal information during self-supervised training to detect both faulty positive and negative instances. This is done by estimating the quality of the audio-visual correspondence of each instance and optimizing a weighted contrastive learning loss that down-weighs the contribution of faulty positive examples. To address faulty negatives, we estimate the similarity across instances to compute a soft target distribution over instances. The model is then tasked to match this distribution. As a result, instances with enough evidence of similarity are no longer used as negatives and may even be used as positives.

The contributions of this work are as follows (fig. 2). We identify two sources of training noise in cross-modal learning: instances with weak cross-modal correspondence, which create faulty positives, and the sampling of semantically similar instances as negatives, which creates faulty negatives. We show that removing faulty positives and negatives using an oracle can significantly improve the performance of a state-of-the-art xID method [avid]. We then propose a mechanism to replace the oracle and a robust cross-modal instance discrimination loss that limits the impact of faulty correspondences. The effectiveness of the proposed method is demonstrated on several downstream tasks.

2 Related work

Self-supervised representation learning aims to learn representations by solving pretext tasks defined from the data alone, i.e., without human annotations. In computer vision, pretext tasks involve reasoning about spatial context [jigsaw, doersch2015unsupervised, 3d_puzzle, inpainting, rotation, MemDPC, cvrl], temporal context [shuffle, opn, aot, 3d_puzzle, mobahi2009deep, dpc, OddOneOut, pace_pred, speednet, MemDPC, CoCLR, cvrl], other visual properties such as hue, brightness and flow [deshpande2015colorization, larsson2016colorization, colorization, larsson2017colorization, splitbrain, cmc, evolving_losses], or clusters of features [deepcluster, sela, swav, cld]. One promising technique is the instance discrimination task proposed in [instance, exemplar] and further explored in [moco, pirl, simclr, cld, xie2020delving]. However, contrastive learning from a single modality requires heavy data augmentations to generate distinct views. Instead, we focus on cross-modal instance discrimination, which avoids this issue by generating views from different modalities.

Representation learning from audio-visual correspondences:

Since, in video, the audio is naturally paired and synced with the visual component, audio-visual correspondences have been used to draw direct supervision for several tasks, such as visually guided source separation and localization [gao2018learning, gao2019co, zhao2018sound, zhao2019sound, gan2020music, senocak2018learning], visually guided audio spatialization [morgado2018self, 25DSound], audio-visual embodied navigation [chen2019audio], lip-speech synchronization [chung2016out] and audio-visual speech recognition [afouras2018deep, chung2017lip].

In the context of contrastive learning, audio-visual correspondences are used to generate alternative views of an instance. While this has been known for a long time [desa94], self-supervised audio-visual representation learning gained popularity in recent years. For example, [l3, arandjelovic2018objects] propose to learn representations by solving a binary classification problem that identifies audio and video clips belonging to the same instance. [bruno_avts, multisensory] predict if audio and video clips are temporally synchronized, and [morgado20_avsa] predicts if audio and video clips extracted from a 360 video are spatially aligned. [avid, gdt] improve upon the audio-visual correspondence problem [l3] by posing it as a cross-modal instance discrimination task, where instances are contrasted to a large number of negatives. As a result, [avid, gdt] achieve impressive performance on downstream tasks such as action recognition.

In this work, we address two issues inherent to cross-modal instance discrimination, namely the detrimental impact of faulty positives and negatives. Recently, [xdc, selavi] proposed to learn representations by iteratively clustering the audio and visual representations and seeking to predict cluster assignments from the opposite modality. While clustering can also discourage faulty negatives from acting as repelling forces, our method accomplishes this by optimizing a simple instance discrimination loss with soft targets, thus avoiding the significant computational overhead of clustering.

Supervised learning from noisy labels.

Our work is closely related to supervised learning from noisy labels [bootstrapping, GenXEnt, patrini2017making, han2018co, li2017learning]. Since label collection is expensive and time-consuming, scaling human annotation to large datasets often requires the use of non-experts or non-curated labels such as user tags, which are prone to noise. Since deep neural networks can easily overfit to noisy labels, this results in poor generalization. Several techniques have been developed to increase the robustness of learning algorithms to label noise, including losses that reduce the impact of outliers [rmae, GenXEnt, wang2019symmetric], loss correction approaches that model the sources of label noise [patrini2017making, hendrycks2018using, chang2017active, bootstrapping, arazo2019unsupervised, ma2018dimensionality, song2019selfie], meta-learning procedures that learn how to correct the sources of label noise [li2017learning, ren2018learning, li2019learning, shu2019meta, zhang2020distilling] and regularization procedures tailored to lower the impact of noise [mixup, smoothing]. We refer the reader to [song2020learning, frenay2013classification] for a detailed survey of prior work on learning with label noise. In this work, we show that cross-modal instance discrimination should be seen as a problem of learning with noisy targets. However, instead of class mislabeling, we identify two main sources of noise for cross-modal instance discrimination (faulty positives and faulty negatives) and propose an algorithm to mitigate them.

3 Analysis: Instance Discrimination

We analyze the cross-modal instance discrimination method [avid, cmc, gdt] and show that faulty positives and negatives have a disproportionately large contribution to the training updates. Additionally, in table 1, we document the detrimental empirical effects of faulty samples.

Cross-Modal Instance Discrimination

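Consider a dataset $\mathcal{D} = \{(v_i, a_i)\}_{i=1}^{N}$ containing $N$ samples (or instances) of video $v_i$ and audio $a_i$. Cross-modal instance discrimination uses a contrastive loss [hadsell2006dimensionality] to learn video and audio encoders, $f_v(\cdot)$ and $f_a(\cdot)$, so as to align the two modalities belonging to the same instance [cmc, avid, gdt] by minimizing

$$L_{\text{xID}}(v_i, a_i) = -\log P(\bar{\mathbf{a}}_i \mid \mathbf{v}_i; \tau) - \log P(\bar{\mathbf{v}}_i \mid \mathbf{a}_i; \tau) \qquad (1)$$

$$\text{where} \quad P(\bar{\mathbf{t}}_i \mid \mathbf{s}_i; \tau) = \frac{\exp(\mathbf{s}_i^T \bar{\mathbf{t}}_i / \tau)}{\sum_k \exp(\mathbf{s}_i^T \bar{\mathbf{t}}_k / \tau)} \qquad (2)$$

where $\mathbf{v}_i = f_v(v_i)$ and $\mathbf{a}_i = f_a(a_i)$ are visual and audio features normalized to the unit sphere, $\bar{\mathbf{v}}_i$ and $\bar{\mathbf{a}}_i$ are target representations, and $\tau$ is a temperature hyper-parameter. Prior works differ by the type of target representations employed. For example, $\bar{\mathbf{v}}_i$ and $\bar{\mathbf{a}}_i$ can be entries of a memory bank as in [avid, instance], the network representations themselves ($\bar{\mathbf{v}}_i = \mathbf{v}_i$ and $\bar{\mathbf{a}}_i = \mathbf{a}_i$) as in SimCLR [simclr], the outputs of momentum encoders as in MoCo [moco], or the centroids of an online clustering procedure as in SwAV or CLD [swav, cld]. In this work, we build on the Audio-Visual Instance Discrimination (AVID) method of [avid], focusing on target representations sampled from a memory bank. However, the principles introduced below can also be applied to SimCLR, MoCo or SwAV style targets.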
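As a rough illustration of the loss in eq. 1, the following NumPy sketch computes the two cross-entropy terms for a small batch. It is a minimal sketch under stated assumptions, not the AVID implementation: the targets are taken from the batch itself rather than from a memory bank, and the function names and the temperature value of 0.07 are hypothetical.

```python
import numpy as np

def l2_normalize(x):
    """Project each row onto the unit sphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def xid_loss(v, a, v_bar, a_bar, tau=0.07):
    """Per-instance loss: -log P(a_bar_i | v_i) - log P(v_bar_i | a_i).

    tau=0.07 is an illustrative value, not one taken from the paper.
    """
    log_p_v2a = log_softmax(v @ a_bar.T / tau)  # video features vs. audio targets
    log_p_a2v = log_softmax(a @ v_bar.T / tau)  # audio features vs. video targets
    idx = np.arange(len(v))
    return -(log_p_v2a[idx, idx] + log_p_a2v[idx, idx])

rng = np.random.default_rng(0)
v = l2_normalize(rng.normal(size=(8, 16)))
a = l2_normalize(rng.normal(size=(8, 16)))

aligned = xid_loss(v, v, v, v)    # audio identical to video: perfectly aligned pairs
unrelated = xid_loss(v, a, v, a)  # audio and video drawn independently
```

Aligned pairs yield a much lower loss than unrelated ones, which is the behaviour the contrastive objective rewards.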

Faulty positives and negatives in practice.

The contrastive loss of eq. 1 is minimized when audio and visual representations from the same instance are aligned (dot-product similarities $\mathbf{v}_i^T \bar{\mathbf{a}}_i$ and $\mathbf{a}_i^T \bar{\mathbf{v}}_i$ between the feature of one modality and the target of the other as close to 1 as possible), and representations from different instances are far apart. In practice, however, the two modalities are not informative of each other for a significant number of instances (see fig. 1). We refer to these unclear correspondences as faulty positives. (We prefer ‘faulty positives’ over ‘false positives’ to distinguish from supervised learning, where one has access to labels.) On the other hand, a significant number of contrastive learning negatives are semantically similar to the base instance. We term these semantically similar negatives faulty negatives, since they should ideally be used as positives.

Fig. 3 shows the histogram of similarities $\bar{\mathbf{v}}_i^T \bar{\mathbf{a}}_i$ between the video and audio target representations of each instance after training an audio-visual model with the loss of eq. 1. As can be seen, instances with higher scores tend to have stronger correspondences (i.e., the audio and video signals are informative of each other). Instances where the two modalities are uninformative of each other tend to have lower scores and are generally faulty positives. On the other hand, fig. 4 shows the similarities between a video $i$ and its negatives $j$. As can be seen, faulty negatives tend to occur for negatives with high similarity $\bar{\mathbf{v}}_i^T \bar{\mathbf{a}}_j$.

How do faulty positives and negatives affect learning?

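Faulty positives and negatives have a disproportionately large contribution to the training updates. To see this, examine the gradients that are computed when optimizing eq. 1. The partial derivatives are given as

$$-\frac{\partial L_{\text{xID}}}{\partial \mathbf{v}_i} \propto \bar{\mathbf{a}}_i \left(1 - P(\bar{\mathbf{a}}_i \mid \mathbf{v}_i)\right) - \sum_{n \neq i} \bar{\mathbf{a}}_n P(\bar{\mathbf{a}}_n \mid \mathbf{v}_i) \qquad (3)$$

$$-\frac{\partial L_{\text{xID}}}{\partial \mathbf{a}_i} \propto \bar{\mathbf{v}}_i \left(1 - P(\bar{\mathbf{v}}_i \mid \mathbf{a}_i)\right) - \sum_{n \neq i} \bar{\mathbf{v}}_n P(\bar{\mathbf{v}}_n \mid \mathbf{a}_i) \qquad (4)$$

Intuitively, the target representations $\bar{\mathbf{v}}_i$ and $\bar{\mathbf{a}}_i$ of the instance itself act as ‘attraction points’ for the encoder of the opposing modality, while the target representations of other (negative) instances, $\bar{\mathbf{v}}_n$ and $\bar{\mathbf{a}}_n$, act as ‘repelling points’. For example, in eq. 3, the negative gradient pushes the video feature $\mathbf{v}_i$ toward $\bar{\mathbf{a}}_i$ and away from $\bar{\mathbf{a}}_n$, $n \neq i$. The attraction forces are weighted by the complement of the prediction confidence, i.e., $1 - P(\bar{\mathbf{a}}_i \mid \mathbf{v}_i)$ or $1 - P(\bar{\mathbf{v}}_i \mid \mathbf{a}_i)$. When positive samples are faulty, these gradients lead to noisy training signals. As shown in fig. 3, faulty positives tend to have lower similarities and thus less confident predictions. As a result, the cross-modal loss of eq. 1 assigns stronger gradients to faulty positive samples. On the other hand, the repelling forces of negative instances are also weighted by the likelihood of matching the base sample, i.e., $P(\bar{\mathbf{a}}_n \mid \mathbf{v}_i)$ and $P(\bar{\mathbf{v}}_n \mid \mathbf{a}_i)$. However, as shown in fig. 4, faulty negatives tend to have high similarity scores, leading to high posteriors $P(\bar{\mathbf{a}}_n \mid \mathbf{v}_i)$ and $P(\bar{\mathbf{v}}_n \mid \mathbf{a}_i)$. Thus, the targets $\bar{\mathbf{a}}_n$ and $\bar{\mathbf{v}}_n$ of faulty negatives act as strong repelling forces for $\mathbf{v}_i$ and $\mathbf{a}_i$ (see eq. 3-4), even though they should ideally be close in feature space.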
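The attraction/repulsion form of the gradient can be checked numerically. The sketch below assumes the standard softmax cross-entropy form of the video-to-audio term and compares the analytic negative gradient (positive target weighted by one minus its posterior, minus the posterior-weighted negatives) with a central finite difference. All names and the toy sizes are hypothetical.

```python
import numpy as np

def posterior(s, targets, tau):
    """P(t_k | s) for all k: softmax of dot products over the target set."""
    logits = targets @ s / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()

def neg_grad_video(v_i, a_bar, i, tau):
    """Analytic negative gradient of -log P(a_bar_i | v_i) with respect to v_i."""
    p = posterior(v_i, a_bar, tau)
    attract = a_bar[i] * (1.0 - p[i])                   # pull toward the positive target
    mask = np.arange(len(a_bar)) != i
    repel = (a_bar[mask] * p[mask, None]).sum(axis=0)   # push away from the negatives
    return (attract - repel) / tau

def loss(v_i, a_bar, i, tau):
    return -np.log(posterior(v_i, a_bar, tau)[i])

rng = np.random.default_rng(0)
a_bar = rng.normal(size=(5, 4))
a_bar /= np.linalg.norm(a_bar, axis=1, keepdims=True)
v_i = rng.normal(size=4)
i, tau, eps = 2, 0.5, 1e-6

# Central finite differences of the loss along each coordinate axis.
numeric = np.array([
    -(loss(v_i + eps * e, a_bar, i, tau) - loss(v_i - eps * e, a_bar, i, tau)) / (2 * eps)
    for e in np.eye(4)
])
analytic = neg_grad_video(v_i, a_bar, i, tau)
```

The two vectors agree up to finite-difference error, and the analytic form makes explicit that a low posterior on the positive and a high posterior on a negative both produce large updates.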

Figure 3: Faulty positives in a pretrained cross-modal model. Histogram of similarity scores between video and audio representations, and examples obtained at various points of the distribution. We describe both the sound and video content in the plot. Examples with lower similarity scores contain audio that is less predictive of the video content, which creates faulty positives for training.

4 Robust audio-visual representation learning

We have seen that contrastive learning places too much emphasis on the impossible goals of bringing together the audio-visual components of faulty positives and repelling the feature representations from faulty negatives. We next propose solutions to these two problems.

4.1 Weighted xID: Tackling Faulty Positives

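To reduce the impact of faulty positives, we propose to optimize a weighted loss. Let $\{w_i\}$ be a set of sample weights that identify faulty positives. Robustness is achieved by re-weighting the xID loss of eq. 1

$$L_{\text{Weighted-xID}} = \frac{\sum_i w_i \, L_{\text{xID}}(v_i, a_i)}{\sum_i w_i} \qquad (5)$$

To estimate sample weights $w_i$, we leverage observations from fig. 3. Since low similarities $\bar{\mathbf{v}}_i^T \bar{\mathbf{a}}_i$ are indicative of faulty positives, we define the weights $w_i$ to be proportional to the cumulative distribution of these scores. We assume the scores to be normally distributed and define $w_i$ as

$$w_i = \mathcal{T}\left(\mathcal{C}_{\mathcal{N}}\left(\bar{\mathbf{a}}_i^T \bar{\mathbf{v}}_i;\ \mu + \delta\sigma,\ \kappa\sigma^2\right);\ w_{\min}\right) \qquad (6)$$

where $\mu$ and $\sigma^2$ are the sample mean and variance of the scores, $\mathcal{C}_{\mathcal{N}}$ is the cumulative distribution of a transformed normal distribution $\mathcal{N}(\mu + \delta\sigma, \kappa\sigma^2)$, and $\mathcal{T}(w; w_{\min}) = w \cdot (1 - w_{\min}) + w_{\min}$ is a soft truncation function used to assign a non-zero weight $w_{\min}$ to low score instances. $\delta$, $\kappa$ and $w_{\min}$ are shape hyper-parameters that provide flexibility to the weight function, adjusting the location and rate of decay of the weights. Fig. 5 shows how the weighting function varies with the shape hyper-parameters $\delta$, $\kappa$ and $w_{\min}$.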
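A weighting function of this kind can be sketched with the standard library alone. The parametrization below (a normal CDF whose mean is shifted by `delta` standard deviations and whose variance is scaled by `kappa`, followed by a soft truncation at `w_min`) is an assumption made for illustration; the argument names are hypothetical and no particular values are implied.

```python
import math

def normal_cdf(x, mean, var):
    """Cumulative distribution function of N(mean, var)."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def sample_weights(scores, delta, kappa, w_min):
    """Map audio-visual similarity scores to weights in [w_min, 1].

    Assumed form: CDF of N(mu + delta * sigma, kappa * sigma^2), where mu and
    sigma^2 are the sample mean and variance of the scores, followed by the
    soft truncation w * (1 - w_min) + w_min.
    """
    n = len(scores)
    mu = sum(scores) / n
    var = sum((s - mu) ** 2 for s in scores) / n
    sigma = math.sqrt(var)
    weights = []
    for s in scores:
        c = normal_cdf(s, mu + delta * sigma, kappa * var)
        weights.append(c * (1.0 - w_min) + w_min)  # low scores keep weight w_min
    return weights
```

Under this form, `delta` moves the midpoint of the weighting curve along the score axis, `kappa` controls how sharply the weights decay, and `w_min` sets the floor assigned to the lowest-scoring instances.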

Figure 4: Faulty negatives in a pretrained cross-modal model. Two instances and the corresponding negatives used by an xID model sorted by their similarity scores. The actual videos are provided in supplementary material. xID often uses faulty negatives for contrastive learning.
Figure 5: Weights as function of similarity scores for different values of the shape parameters. Parameters are automatically determined from the histogram of similarity scores (shown in red).

4.2 Soft Targets: Tackling Faulty Negatives

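As observed in section 3, faulty negatives are overemphasized during training. The underlying reason is that the xID loss of eq. 1 has too strict a definition of negatives: every negative instance is considered ‘equally negative.’ To limit the impact of faulty negatives, we introduce a ‘softer’ definition by introducing soft targets $T(j|i)$, based on the similarity between instance $i$ and negative $j$. We then minimize a soft-xID loss

$$L_{\text{Soft-xID}}(v_i, a_i) = -\sum_j T_v(j|i) \log P(\bar{\mathbf{a}}_j \mid \mathbf{v}_i; \tau) - \sum_j T_a(j|i) \log P(\bar{\mathbf{v}}_j \mid \mathbf{a}_i; \tau) \qquad (7)$$

$$T_v(j|i) = (1 - \lambda)\, \mathbb{1}_{i=j} + \lambda\, S_v(j|i), \qquad T_a(j|i) = (1 - \lambda)\, \mathbb{1}_{i=j} + \lambda\, S_a(j|i)$$

where $\mathbb{1}_{i=j}$ is the one-hot targets of vanilla xID, $S_v$ and $S_a$ are softening scores (described next) used to adjust the one-hot targets, and $\lambda \in [0, 1]$ is a mixing coefficient that weighs the two terms. Equations 1 and 7 are identical when $\lambda = 0$. Since $T(j|i)$ is no longer strictly zero for similar instances, minimizing eq. 7 reduces the force to repel faulty negatives and thus their impact.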
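The mixing of one-hot and softened targets can be sketched as follows. This is an illustrative sketch only: it assumes the softening scores form a row-stochastic matrix (each row sums to one) and uses hypothetical function names; only one direction of the loss is shown, and the full loss would add the symmetric audio-to-video term.

```python
import numpy as np

def soft_targets(softening, lam):
    """Mix one-hot instance targets with row-stochastic softening scores.

    lam is the mixing coefficient; lam = 0 recovers the one-hot targets.
    """
    n = softening.shape[0]
    return (1.0 - lam) * np.eye(n) + lam * softening

def soft_xid_term(features, target_bank, target_dist, tau):
    """One direction of the loss: -sum_j T(j|i) log P(t_j | s_i)."""
    logits = features @ target_bank.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(target_dist * log_p).sum(axis=1)
```

With `lam = 0` the target matrix is the identity and `soft_xid_term` reduces to the usual instance discrimination cross-entropy; larger values move probability mass toward instances judged similar.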

Estimating softening scores.

Since our approach focuses on self-supervised learning, we must estimate the softening scores automatically, i.e., without class labels. We describe multiple strategies for estimating these values and illustrate them in fig. 6.

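•  Bootstrapping [bootstrapping] is a well established procedure to create soft targets. It uses the model’s own predictions (posteriors) as the softening scores, i.e.,

$$S_v(j|i) = P(\bar{\mathbf{a}}_j \mid \mathbf{v}_i; \tau_s), \qquad S_a(j|i) = P(\bar{\mathbf{v}}_j \mid \mathbf{a}_i; \tau_s)$$

where the temperature $\tau_s$ controls the peakiness of the distribution. However, bootstrapping computes the target distribution without aggregating information from any source other than each model’s own posterior.

•  Swapped prediction improves upon bootstrapping by using the posteriors of the opposite modality, i.e., the softening scores for the video modality are computed using the posterior of the audio encoder and vice-versa,

$$S_v(j|i) = P(\bar{\mathbf{v}}_j \mid \mathbf{a}_i; \tau_s), \qquad S_a(j|i) = P(\bar{\mathbf{a}}_j \mid \mathbf{v}_i; \tau_s)$$

As a result, in addition to the instance itself, the model is asked to predict which other instances are deemed similar in the opposite modality.

•  Neighbor prediction relies on within-modal relationships to estimate the similarity between instances, thus avoiding potential mismatched audio and visual modalities when computing the soft targets. Specifically, we define

$$S_v(j|i) = \rho\left(\bar{\mathbf{v}}_i^T \bar{\mathbf{v}}_j / \tau_s\right), \qquad S_a(j|i) = \rho\left(\bar{\mathbf{a}}_i^T \bar{\mathbf{a}}_j / \tau_s\right)$$

where $\rho$ is the softmax operator.

•  Cycle consistent prediction improves upon ‘swapped prediction’ by focusing on negatives that are good correspondences themselves, i.e., negatives with high similarity scores $\bar{\mathbf{v}}_j^T \bar{\mathbf{a}}_j$. In this case, we define

$$S_v(j|i) = \rho\left(\frac{\bar{\mathbf{v}}_i^T \bar{\mathbf{a}}_i}{\tau_t} + \frac{\bar{\mathbf{a}}_i^T \bar{\mathbf{v}}_j}{\tau_s} + \frac{\bar{\mathbf{v}}_j^T \bar{\mathbf{a}}_j}{\tau_t}\right), \qquad S_a(j|i) = \rho\left(\frac{\bar{\mathbf{a}}_i^T \bar{\mathbf{v}}_i}{\tau_t} + \frac{\bar{\mathbf{v}}_i^T \bar{\mathbf{a}}_j}{\tau_s} + \frac{\bar{\mathbf{a}}_j^T \bar{\mathbf{v}}_j}{\tau_t}\right)$$

where $\tau_s$ and $\tau_t$ control the relative importance of swapped prediction target and avoiding negatives with weak correspondences. As shown in fig. 6, the terms $\bar{\mathbf{v}}_i^T \bar{\mathbf{a}}_i$ and $\bar{\mathbf{v}}_j^T \bar{\mathbf{a}}_j$ complete a cycle over instances $i$ and $j$.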
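The four strategies differ only in which similarities feed the softmax. The sketch below computes softening scores for the video branch under assumed forms: the cycle variant adds each candidate's own audio-visual similarity to the swapped logit, and omits any term that is constant across candidates because it cancels in the softmax. The function name, the strategy labels and both default temperatures are hypothetical.

```python
import numpy as np

def softmax(x, axis=1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def softening_scores_video(v, v_bar, a, a_bar, strategy, tau_s=0.05, tau_t=0.1):
    """Softening scores for the video branch; rows index i, columns index j.

    v, a are encoder features and v_bar, a_bar are target representations,
    all with unit-norm rows. tau_s and tau_t are illustrative temperatures.
    """
    if strategy == "bootstrapping":   # the branch's own posterior
        return softmax(v @ a_bar.T / tau_s)
    if strategy == "swapped":         # posterior of the opposite (audio) branch
        return softmax(a @ v_bar.T / tau_s)
    if strategy == "neighbor":        # within-modal similarity of video targets
        return softmax(v_bar @ v_bar.T / tau_s)
    if strategy == "cycle":           # swapped term plus candidate j's own correspondence
        own = np.sum(v_bar * a_bar, axis=1)  # similarity of v_bar_j and a_bar_j for every j
        return softmax(a_bar @ v_bar.T / tau_s + own[None, :] / tau_t)
    raise ValueError(f"unknown strategy: {strategy}")
```

Every strategy returns a row-stochastic matrix, so the result can be mixed directly with one-hot targets to form a soft target distribution.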

How do soft targets mitigate faulty negatives?

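The soft xID loss of eq. 7 prevents overemphasizing faulty negatives by relying on soft targets that encode similarities between instances. To better understand the mechanism, we examine the partial derivatives of the soft-xID loss:

$$-\frac{\partial L_{\text{Soft-xID}}}{\partial \mathbf{v}_i} \propto \sum_j \bar{\mathbf{a}}_j \left(T_v(j|i) - P(\bar{\mathbf{a}}_j \mid \mathbf{v}_i)\right), \qquad -\frac{\partial L_{\text{Soft-xID}}}{\partial \mathbf{a}_i} \propto \sum_j \bar{\mathbf{v}}_j \left(T_a(j|i) - P(\bar{\mathbf{v}}_j \mid \mathbf{a}_i)\right)$$

Since faulty negatives tend to be similar to the base instance $i$, the soft targets $T(j|i)$ are higher. Thus, the target representations $\bar{\mathbf{v}}_j$ and $\bar{\mathbf{a}}_j$ of faulty negatives act as weaker negatives, or even as positives when $T(j|i)$ is larger than the model posteriors.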

Figure 6: Strategies to estimate softening scores.

4.3 Training

We introduced two procedures to deal with noisy training signals inherent to cross-modal instance discrimination. Section 4.1 presents a weighting mechanism that limits the effect of faulty positives, while section 4.2 proposes a soft instance discrimination loss that predicts relations between instances, thus preventing the training algorithm from overemphasizing faulty negatives. Since both procedures rely on the alignment between audio and visual target representations to find weak correspondences, we start by training the model for cross-modal instance discrimination alone (eq. 1). After the initial warmup stage, the two procedures can be combined by minimizing

$$L_{\text{Robust-xID}} = \frac{\sum_i w_i \, L_{\text{Soft-xID}}(v_i, a_i)}{\sum_i w_i} \qquad (17)$$

where $w_i$ are the sample weights of eq. 6 and $L_{\text{Soft-xID}}$ is the xID loss with soft targets of eq. 7.
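The two-stage schedule can be summarized in a few lines. The sketch below assumes the combined objective is a weight-normalized average of per-instance soft-target losses, and that the warm-up stage uses a plain average of the unweighted loss; the function names and the `warmup_epochs` argument are hypothetical.

```python
import numpy as np

def robust_xid(per_instance_loss, weights):
    """Weighted average of per-instance losses, normalized by the total weight."""
    per_instance_loss = np.asarray(per_instance_loss, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float((weights * per_instance_loss).sum() / weights.sum())

def training_loss(epoch, warmup_epochs, xid_losses, soft_xid_losses, weights):
    """Plain xID loss during warm-up, weighted soft-target loss afterwards."""
    if epoch < warmup_epochs:
        return float(np.mean(xid_losses))
    return robust_xid(soft_xid_losses, weights)
```

Down-weighting an instance with a large loss, as expected for a faulty positive, lowers its share of the batch objective without removing it entirely.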

5 Experiments

We perform experiments to better understand cross-modal learning and validate the proposed improvements. We pretrain models on a subset of the Kinetics-400 [kinetics] dataset containing 50K videos and evaluate the pretrained models by transfer learning.

5.1 Experimental Setup

Video and audio preprocessing. During training, we extract video clips of length frames and resolution at 16 fps. Video clips are augmented using temporal jittering, multi-scale cropping, horizontal flipping, color jittering, gray-scaling, and Gaussian blur [simclr]. All data augmentations are applied consistently over all frames. For the audio, we extract mono clips of length s at a sample rate of Hz, and compute log spectrograms on ms windows with a hop size of ms. The spectrogram is then converted to a mel scale with bands, yielding an audio input of size . Audio data is augmented by randomly changing the volume by at most .

Video and audio models. The video encoder is a 9-layer version of the R(2+1)D model of [tran2018closer]. Following [l3, avid], we replaced global average pooling with max pooling. The audio encoder is a 9-layer 2D ConvNet with batch normalization and global max pooling. Both encoders yield 512-dimensional features, which are mapped into a 128-dimensional sphere using a non-linear projection head (as in [simclr]) followed by L2 normalization.

Pretraining. In the warm-up stage, the video and audio models are trained to optimize the loss of eq. 1 using the Adam optimizer [adam] with default hyper-parameters ($\beta_1 = 0.9$ and $\beta_2 = 0.999$) for epochs with a learning rate of and a batch size of split over 2 Gb GPUs. In order to reduce the memory footprint of our models, we employ mixed-precision training [micikevicius2017mixed] using PyTorch AMP [pytorch]. Following [avid, instance], the audio and video target representations are generated using memory banks updated by exponential moving average with an update constant of . The contrastive loss of eq. 1 is defined by opposing the target representation of the opposite modality to negatives randomly drawn from the memory bank. The temperature hyper-parameter is set to .
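A memory bank with exponential-moving-average updates can be sketched as below. This is a minimal sketch, not the AVID code: re-normalizing the updated entries to the unit sphere is an assumption, negatives are drawn uniformly without excluding the current instance, and the function names and the `momentum` argument are hypothetical.

```python
import numpy as np

def update_memory_bank(bank, indices, features, momentum):
    """EMA update of the stored targets for the given instances (in place).

    Assumption: entries are re-normalized to unit length after the update.
    """
    updated = momentum * bank[indices] + (1.0 - momentum) * features
    updated /= np.linalg.norm(updated, axis=1, keepdims=True)
    bank[indices] = updated
    return bank

def sample_negatives(bank, num_negatives, rng):
    """Draw random bank entries to serve as negatives (with replacement)."""
    idx = rng.integers(0, len(bank), size=num_negatives)
    return bank[idx]
```

Because each entry changes slowly, the bank supplies a large pool of negatives at the cost of slightly stale target representations.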

After the initial warm-up stage, models are trained for an additional 200 epochs to optimize the loss of eq. 17 using the Adam optimizer and a cosine learning rate schedule starting at and ending at . The hyper-parameters for the weighting function (eq. 6) and the soft xID loss (eq. 7) are discussed below. To provide a fair comparison to the AVID baseline [avid], we control for the number of epochs by training the baseline model for an additional 200 epochs as well.
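A cosine learning rate schedule of the kind mentioned here can be written as a short function. The sketch assumes the common half-cosine form evaluated once per epoch; the argument names are hypothetical and no specific start or end rate from the paper is implied.

```python
import math

def cosine_lr(epoch, total_epochs, lr_start, lr_end):
    """Half-cosine decay from lr_start (first epoch) to lr_end (last epoch)."""
    progress = epoch / max(total_epochs - 1, 1)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))
```

The rate stays close to `lr_start` early on, decays fastest mid-training, and flattens out as it approaches `lr_end`.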

Downstream tasks. We evaluate audio and video features using transfer learning. Video features are evaluated on the UCF [ucf] and HMDB [hmdb] datasets. Models are fine-tuned using 8-frame clips for 200 epochs using the Adam optimizer with a batch size of on a single GPU and a cosine learning rate schedule starting at and ending at . To prevent overfitting, we use dropout after the global max-pooling layer, weight decay of , and reduced the learning rate for backbone weights by a factor of . At test time, top-1 accuracy is measured on video level predictions computed by averaging the predictions of clips uniformly sampled over the entire video.

Following [selavi, xu2019self], we also evaluate the quality of video representations by conducting retrieval experiments without fine-tuning. Feature maps of size are extracted from clips per video and averaged. We then use videos in the test set to query the training set. As in [selavi, xu2019self], a correct retrieval occurs when the class of one of the top-k retrieved videos matches the query, and performance is measured by the average top-k retrieval performance ().
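The retrieval protocol can be sketched as follows. The use of cosine similarity between averaged features is an assumption, since the similarity measure is not stated here, and the function and argument names are hypothetical.

```python
import numpy as np

def recall_at_k(query_feats, query_labels, train_feats, train_labels, k):
    """Fraction of queries whose k nearest training videos include the query's class."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = q @ t.T                              # cosine similarities (assumed metric)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k most similar videos
    hits = (train_labels[topk] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())
```

A query counts as correct as soon as any one of its top-k neighbours shares its class, so the metric is non-decreasing in k.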

5.2 Weighted cross-modal learning

We analyze the impact of faulty positives on the representations learned by cross-modal instance discrimination.

Faulty positives are detrimental to representation learning.

We artificially control the number of faulty positives to assess their impact on representation learning. The pretraining dataset already contains an unknown (but significant) number of faulty positives. We increase this number by injecting more faulty positives. A faulty positive is injected by replacing the audio of an instance with a randomly selected audio that is not part of the training set. After pretraining, the learned visual representation is evaluated on the UCF and HMDB datasets using both classification and retrieval protocols. fig. 7 shows that as the fraction of faulty positives increases, the transfer performance of cross-modal instance discrimination (xID) decreases significantly.
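The injection procedure can be illustrated on a list of audio identifiers. This is a hypothetical sketch: the function name, the identifier lists and the seed are illustrative, and the returned flags are the kind of information an oracle weighting would have access to.

```python
import random

def inject_faulty_positives(audio_ids, external_audio_ids, fraction, seed=0):
    """Replace the audio of a random fraction of instances with external audio.

    Returns the new audio assignment and a flag per instance marking
    which ones were altered.
    """
    rng = random.Random(seed)
    n = len(audio_ids)
    num_altered = round(fraction * n)
    altered = set(rng.sample(range(n), num_altered))
    new_audio = [
        rng.choice(external_audio_ids) if i in altered else audio_ids[i]
        for i in range(n)
    ]
    is_altered = [i in altered for i in range(n)]
    return new_audio, is_altered
```

Drawing the replacement audio from outside the training set guarantees that an altered instance cannot accidentally receive a matching track.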

Figure 7: Faulty positives vs. transfer performance. We inject faulty positives in the pre-training data (Kinetics) by randomly replacing the audio corresponding to a video. We evaluate the pretrained models on action recognition and see that increasing the fraction of faulty positives hurts transfer performance. Our weighted loss (Weighted-xID) is less sensitive to faulty positives and even outperforms an oracle version that has information about altered samples. This is because the unaltered pretraining dataset itself has an unknown but significant number of faulty positives.

Weighted xID reduces the impact of faulty positives.

We evaluate the effectiveness of the weighted xID loss (eq. 5) as a function of the number of faulty positives. We compare the representations learned by Weighted-xID to its unweighted counterpart (xID), as well as an oracle weight function (Oracle-xID) which assigns to artificially altered instances and otherwise. The weight function of eq. 6 is defined with and . For simplicity, we assume that the noise level is known and set in Weighted-xID so that the midpoint of the weighting function coincides with the known fraction of altered samples. In practice, the noise level would need to be estimated either by cross-validation or by manual inspection. Weighted-xID is not very sensitive to these parameters (see appendix).

Fig. 7 shows the performance of the three approaches. Oracle-xID consistently outperforms xID when the fraction of injected faulty positives is high. This shows that the detrimental impact of noisy correspondences can be mitigated with a weighting strategy. Weighted-xID also outperforms the unweighted version (xID) in nearly all cases, with larger margins for larger fractions of noisy correspondences. In fact, Weighted-xID even outperforms the oracle weight function, especially at lower noise levels. This is because the original Kinetics dataset already contains a significant number of weak correspondences, which the oracle weight function treats as clean, while the weighting function of eq. 6 can suppress them.

5.3 Instance discrimination with soft targets

To limit the impact of faulty negatives, we proposed to match a soft target distribution that encodes instance similarity. We analyze different design decisions for creating the soft targets and their effect on transfer performance.

Method            UCF Acc   UCF R@5   HMDB Acc   HMDB R@5
Oracle*           73.6      76.0      45.4       53.6
xID [avid]        68.0      63.2      39.0       43.4
Bootstrapping     69.2      64.4      40.5       44.7
Neighbor Pred.    70.5      65.4      41.2       45.0
Swapped Pred.     70.0      64.9      41.3       45.4
CCP               70.3      65.9      41.5       45.5
* Uses class labels to generate target distribution.
Table 1: Different strategies for computing soft targets in the pretraining loss of eq. 7. Models are pretrained on Kinetics and evaluated on UCF and HMDB datasets using fine-tuning and retrieval protocols. Best method is bolded. Second best is underlined.

Comparison of strategies for computing targets

As summarized in fig. 6, the soft target distributions can be computed by aggregating evidence from all modalities. Four different strategies were proposed: bootstrapping, neighbor prediction, swapped prediction and cycle consistent prediction. Models were trained to minimize the loss of eq. 7 with . We empirically found that peakier target distributions work better, and set the temperature parameter to . For cycle consistent assignments, the similarity terms between the video and audio targets of each negative are used so as to focus on negatives that are good correspondences themselves. A temperature hyper-parameter of was sufficient to impose such constraint. Beyond the baseline xID, we also compare to an oracle target distribution that has access to class labels to determine the similarity between instances. Specifically, the oracle considers two instances $i$ and $j$ to be similar if they share the same class label, and computes the video and audio target distributions by assigning a uniform distribution over similar instances, and zero to non-similar ones.
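The oracle target distribution described above can be sketched directly from class labels. The function name is hypothetical, and assigning exactly zero mass to instances of other classes is an assumption that follows from making the distribution uniform over same-class instances.

```python
import numpy as np

def oracle_targets(labels):
    """Row i is uniform over instances sharing the label of i, zero elsewhere."""
    labels = np.asarray(labels)
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same / same.sum(axis=1, keepdims=True)

# Three instances: the first two share a class, the third is alone in its class.
targets = oracle_targets([0, 0, 1])
```

Each instance always shares a label with itself, so every row has at least one non-zero entry and sums to one.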

Table 1 shows the performance of different target distributions. We observe a large gap between vanilla xID and xID with an oracle soft target, which demonstrates the detrimental effect of faulty negatives. In the self-supervised case, however, labels are not available for determining the target distribution. Nevertheless, the estimated target distributions (bottom four rows) still significantly improve over the xID loss. Regarding the various types of target distributions, bootstrapping is the least effective. This is expected since, in this case, the target distribution is a peakier version of the model posterior, i.e., it is obtained without aggregating information from any other sources. Cycle consistent prediction is the most effective in most cases. This is because cycle consistent prediction not only leverages the opposite modality to create the target distribution, but also avoids targets that are not good correspondences themselves, i.e., avoids samples with low cross-modal similarities.

Method            UCF Acc   UCF R@5   HMDB Acc   HMDB R@5
xID [avid]        68.0      63.2      39.0       43.4
Weighted-xID      69.7      64.1      40.1       44.3
Soft-xID          70.3      65.9      41.5       45.5
Robust-xID        71.6      67.4      41.9       46.2
Table 2: Combining weighted xID loss with soft targets. Models are pretrained on Kinetics with the loss of eq. 17 and evaluated on UCF and HMDB datasets using fine-tuning and retrieval protocols.

5.4 Robust instance discrimination with soft targets

Sample weighting and soft targets are designed to address two different sources of noisy training signals inherent to cross-modal contrastive learning: faulty positives and faulty negatives. Table 2 shows that the two proposed improvements (Weighted-xID and Soft-xID) not only improve upon the representations of vanilla xID, but are also complementary to each other. By combining the two approaches using the loss of eq. 17, Robust-xID improves upon both Weighted-xID and Soft-xID.

6 Comparison to prior work

We compare Robust-xID to prior work in self-supervised learning. We train our models on the Kinetics dataset, using an 18-layer R(2+1)D model [tran2018closer] for the video, and a 9-layer 2D ConvNet with batch normalization for the audio. Video clips of length 8 frames and resolution are extracted at 16 fps, and the same data augmentations from section 5 are used. We extract audio clips of length 2 s at 24 kHz and compute log mel spectrograms with 128 time steps and 128 frequency bands. All models are trained with the Adam optimizer with a batch size of 512 distributed across eight 12 GB GPUs. We warm up the models for 200 epochs by training on the xID loss alone with a learning rate of . The models are then trained with sample weights and cycle consistent soft targets for an additional 200 epochs using a cosine learning rate schedule from to .
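A minimal sketch of a cosine learning rate schedule of the kind used in the second training stage is shown below. The function name and the start and end values in the usage line are placeholders chosen for illustration; they are not the values used in the experiments.

```python
import math

def cosine_lr(epoch, total_epochs, lr_start, lr_end):
    """Cosine annealing from lr_start at epoch 0 to lr_end at total_epochs."""
    # Clamp progress to [0, 1] so out-of-range epochs stay at the endpoints
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

# Placeholder endpoints (assumed), annealed over 200 epochs.
schedule = [cosine_lr(e, 200, lr_start=1e-3, lr_end=1e-5) for e in range(201)]
```

The schedule decays slowly at first, fastest around the midpoint, and flattens again as it approaches the final value.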

After pre-training, models are evaluated on UCF and HMDB. We fine-tune the models using either 8 or 32 frame clips for action recognition and report the top-1 accuracy of video-level predictions (with 10 clips per video) in table 3. The proposed procedure outperformed all prior work where pretraining is limited to a single node (8 GPUs), and even outperformed methods like SeLaVi, which require 8× more compute for training. We also conducted a close comparison to the CMA procedure of [avid] (xID+CMA). While CMA can also partially address the problem of faulty negatives, Robust-xID showed better performance. Robust-xID is also easier to implement, as it identifies both faulty positives and negatives in a simpler online fashion. We note that xID+CMA is a faithful implementation of AVID+CMA [avid], as it follows the original code with improved data augmentations. However, the results reported for xID+CMA are lower than those originally reported in [avid] because 1) distributed training was conducted on 8 GPUs instead of 64 (large batch sizes are known to have a substantial impact on contrastive learning performance [simclr, mocov2, swav]), and 2) [avid] is trained and evaluated with videos of higher resolution (224 instead of 112). We expect the performance of the proposed model to improve further when trained with a larger batch size.
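Video-level predictions combine the predictions of several clips sampled from the same video. The sketch below assumes that clip-level class probabilities are averaged before taking the arg max; the aggregation rule and the function name are assumptions, since the text only states that 10 clips per video are used.

```python
import numpy as np

def video_level_accuracy(clip_probs, labels):
    """Top-1 accuracy after averaging clip-level class probabilities.

    clip_probs: array of shape (num_videos, num_clips, num_classes)
    labels:     array of shape (num_videos,) with integer class labels
    """
    video_probs = np.asarray(clip_probs).mean(axis=1)  # average over clips
    predictions = video_probs.argmax(axis=1)           # one label per video
    return float((predictions == np.asarray(labels)).mean())

# Two videos, three clips each, two classes (toy numbers).
clip_probs = np.array([
    [[0.9, 0.1], [0.4, 0.6], [0.4, 0.6]],
    [[0.2, 0.8], [0.3, 0.7], [0.6, 0.4]],
])
accuracy = video_level_accuracy(clip_probs, np.array([0, 1]))
```

Averaging probabilities lets one confident clip outweigh several uncertain ones, which differs from a majority vote over clip-level labels.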

Method                        Model         Compute (# GPUs)   Finetuning Resolution   UCF    HMDB
DPC [dpc]                     S3D           4                                          75.7   35.7
CBT [cbt]                     S3D           8                                          79.5   44.6
Multisensory [multisensory]   3D-ResNet18   3                                          82.1   -
AVTS [bruno_avts]             MC3-18        4                                          84.1   52.5
SeLaVi [selavi]               R(2+1)D-18    64                                         83.1   47.1
                              R(2+1)D-18    64                                         74.2   39.0
XDC [xdc]                     R(2+1)D-18    64                                         86.8   52.6
                              R(2+1)D-18    64                                         83.7   49.5
AVID-CMA [avid]               R(2+1)D-18    64                                         87.5   60.8
GDT [gdt]                     R(2+1)D-18    64                                         89.3   60.0
                              R(2+1)D-18    8                                          80.6   48.6
xID+CMA [avid]                R(2+1)D-18    8                                          84.9   54.7
                              R(2+1)D-18    8                                          81.9   49.5
Robust-xID                    R(2+1)D-18    8                                          85.6   55.0
Models pre-trained with more than one compute node (8 GPUs).
Table 3: Comparison to prior work (finetuning). Performance on the downstream UCF and HMDB datasets by full network fine-tuning after pre-training on Kinetics. We report top-1 accuracy of video-level predictions (10 clips per video). We also list the video encoder, amount of compute used for pre-training and the fine-tuning resolution.
                      UCF                    HMDB
Method                R@1    R@5    R@20     R@1    R@5    R@20
SpeedNet [speednet]   13.0   28.1   49.5     -      -      -
VCP [vcp]             18.6   33.6   53.5     7.6    24.4   53.6
VSP [vsp]             24.6   41.9   76.9     10.3   26.6   54.6
CoCLR [CoCLR]         55.9   70.8   82.5     26.1   45.8   69.7
SeLaVi [selavi]       52.0   68.6   84.5     24.8   47.6   75.5
GDT [gdt]             57.4   73.4   88.1     25.4   51.4   75.0
xID+CMA [avid]        60.1   76.6   90.1     29.7   54.4   77.1
Robust-xID            60.9   79.4   90.8     30.8   55.8   79.7
Table 4: Retrieval performance on UCF and HMDB datasets after pre-training on Kinetics for different numbers of retrieved neighbors.
                       UCF                          HMDB
Method                 1-shot   5-shot   20-shot    1-shot   5-shot   20-shot
3D-RotNet [3drotnet]   15.0     31.5     47.1       -        -        -
GDT [gdt]              26.3     42.4     49.4       13.4     15.6     20.8
xID+CMA [avid]         30.8     53.1     66.9       13.5     25.0     33.6
Robust-xID             32.8     54.6     67.8       14.1     25.9     34.9
Table 5: Few-shot learning on UCF and HMDB after pre-training on Kinetics. Classification is conducted using a one-vs-all SVM trained on the pretrained features of n images per class. Top-1 accuracy is reported for n = 1, 5, and 20.

We also compare the learned representations to prior work without fine-tuning. Following [selavi, gdt], we conducted retrieval experiments, and report the retrieval performance for 1, 5, and 20 neighbors in table 4. The retrieval protocol was described in section 5. Following [3drotnet, gdt], we also assessed the few-shot learning performance of Robust-xID models on UCF and HMDB. For the few-shot evaluation, we average the pretrained max-pooling features of 10 clips per video. The features from n videos per class (n = 1, 5, or 20) are then used to learn a one-vs-all linear SVM classifier with . We report the top-1 accuracy averaged over 50 trials in table 5. On both the retrieval and few-shot learning tasks, Robust-xID improves significantly over all prior work, reaffirming the importance of mitigating the training noise introduced by faulty positives and faulty negatives.
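The R@k metric reported in the retrieval experiments can be sketched as follows. The sketch assumes a split into query and gallery videos, Euclidean distance between features, and that a query counts as correct when at least one of its k nearest gallery videos shares its class; these details and the function name `recall_at_k` are assumptions, since the exact protocol is defined in section 5.

```python
import numpy as np

def recall_at_k(query_feats, query_labels, gallery_feats, gallery_labels, k):
    """Fraction of queries with a same-class video among the k nearest."""
    # Pairwise Euclidean distances, shape (num_queries, num_gallery)
    diffs = query_feats[:, None, :] - gallery_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Indices of the k closest gallery items for each query
    nearest = np.argsort(dists, axis=1)[:, :k]
    hits = (gallery_labels[nearest] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example with 2-D features.
queries = np.array([[0.0, 0.0], [10.0, 10.0]])
query_labels = np.array([0, 1])
gallery = np.array([[0.0, 1.0], [9.0, 9.0], [0.5, 0.5]])
gallery_labels = np.array([0, 1, 1])

r_at_1 = recall_at_k(queries, query_labels, gallery, gallery_labels, k=1)
r_at_2 = recall_at_k(queries, query_labels, gallery, gallery_labels, k=2)
```

By construction R@k cannot decrease as k grows, which matches the trend across the columns of table 4.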

7 Discussion and future work

We identified and tackled two significant sources of noisy training signals in audio-visual instance discrimination, namely instances with weak audio-visual correspondence (or faulty positives) and semantically similar negatives (or faulty negatives). We showed the impact of faulty correspondences on representation learning by removing them using an oracle with access to ground-truth annotations. We then proposed a method that mitigates the impact of faulty correspondences without relying on ground-truth annotations. Extensive analysis and experimental evaluations show that the proposed procedure enhances representation learning and improves transfer performance significantly.

Our findings show that cross-modal learning should be seen as a problem of learning with noisy targets. While we propose two specific methods to address faulty positives and faulty negatives (i.e., weighting and soft targets), there is a rich literature regarding supervised learning with noisy labels. Developing methods that tackle noisy correspondences is a promising avenue for future research. Furthermore, we focused on audio-visual learning, but other pairs of modalities such as RGB and flow or text from instructional videos also present similar problems. We believe that our method will also benefit cross-modal learning from other modalities.


This work was partially funded by NSF awards IIS-1924937, IIS-2041009, and gifts from Amazon and Qualcomm. We also acknowledge the use of the Nautilus platform for some of the experiments above.


Appendix A Parametric studies

We provide a parametric study of key Robust-xID hyper-parameters.

Weight function shape parameter

One critical parameter of Weighted-xID is the shape parameter , which specifies the mid-point location of the weight function. For example, when , the midpoint is located at where and are the sample mean and standard deviation of the scores . This means that for , the majority of samples will have a weight of , and only a small fraction will have a weight close to . As increases, the proportion of samples that are down-weighted also increases. To study the impact of , we trained several models using Weighted-xID with different values of and for different amounts of injected faulty positives . Other hyper-parameters were kept at their default values and . The transfer performance is shown in fig. 8. As can be seen, the proposed robust xID procedure is not very sensitive to this hyper-parameter. This suggests that Robust-xID can help representation learning as long as clear faulty positives are suppressed.
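The role of the shape parameter can be illustrated with a small sketch. It assumes a logistic weight function whose mid-point sits a given number of standard deviations away from the mean score, with weights bounded between a minimum value and 1. The logistic form, the names `delta`, `kappa` and `w_min`, and their default values are all hypothetical; the actual weight function is the one defined in the main text.

```python
import numpy as np

def sample_weights(scores, delta, kappa=0.5, w_min=0.25, eps=1e-8):
    """Hypothetical logistic weight function with mid-point mean + delta * std."""
    scores = np.asarray(scores, dtype=np.float64)
    mu, sigma = scores.mean(), scores.std()
    midpoint = mu + delta * sigma                        # set by the shape parameter
    z = (scores - midpoint) / (kappa * sigma + eps)      # kappa controls the slope
    gate = 1.0 / (1.0 + np.exp(-z))                      # near 0 well below the mid-point
    return w_min + (1.0 - w_min) * gate                  # weights lie in [w_min, 1]

# Evenly spread toy scores; higher means better audio-visual correspondence.
scores = np.linspace(-3.0, 3.0, 601)
frac_low_delta = float((sample_weights(scores, delta=-2.0) < 0.9).mean())
frac_high_delta = float((sample_weights(scores, delta=0.0) < 0.9).mean())
```

In this sketch, moving the mid-point towards higher scores enlarges the fraction of down-weighted samples, mirroring the behavior described above.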

Figure 8: Effect of shape parameter in Weighted-xID. Transfer learning performance is evaluated on two datasets (UCF and HMDB) under two protocols (full finetuning and retrieval). For the fine-tuning protocol, we report final accuracy of video level predictions. For the retrieval protocol, we report .
Figure 9: Effect of mixing coefficient in Soft-xID. Transfer learning performance is evaluated on two datasets (UCF and HMDB) under two protocols (full finetuning and retrieval). For the fine-tuning protocol, we report final accuracy of video level predictions. For the retrieval protocol, we report .
Figure 10: Best and worst Kinetics classes. For each class, we depict the top-1 retrieval performance (R@1) averaged across all images of each class. The plot above shows the top 40 classes and the plot below the bottom 40 classes.

Soft-xID: Mixing coefficient

The mixing coefficient specifies the degree to which the one-hot targets of instance discrimination are softened in Soft-xID. The one-hot instance discrimination targets are used when . As increases, the softening scores are increasingly used to adjust the one-hot targets. To study the impact of the mixing coefficient , we trained several models using Soft-xID with various values of . Cycle consistent targets were used as the softening strategy. Fig. 9 shows the transfer performance of the learned models on UCF and HMDB under the fine-tuning and retrieval protocols. The trend is consistent across the two datasets and two evaluation protocols. Softening the instance discrimination targets enhances representation learning, with the optimal performance achieved with a mixing coefficient between and . However, as the mixing coefficient increases substantially , the targets are derived from the model prediction alone and disregard instance labels. In this case of large , the pre-training fails completely, i.e., the learned representations have very low transfer performance.
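The effect of the mixing coefficient can be sketched as a convex combination of the one-hot instance targets and the softening scores. This form is an assumption that matches the limiting behavior described above (pure one-hot targets at one extreme, pure model-derived targets at the other); the name `lam` and the example scores are hypothetical.

```python
import numpy as np

def soften_targets(soft_scores, lam):
    """Blend one-hot instance targets with softening scores.

    soft_scores: (n, n) array whose rows are distributions over instances
    lam:         mixing coefficient in [0, 1]
    """
    soft_scores = np.asarray(soft_scores, dtype=np.float64)
    one_hot = np.eye(soft_scores.shape[0])  # instance i is its own positive
    return (1.0 - lam) * one_hot + lam * soft_scores

# Hypothetical softening scores for three instances (rows sum to 1).
scores = np.array([
    [0.6, 0.3, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.1, 0.8],
])
targets = soften_targets(scores, lam=0.5)
```

Since both components are valid distributions, every blended row still sums to 1, so the result can be used directly as a target for a cross-entropy loss.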

Appendix B Additional analysis

The proposed approach learns high-quality feature representations that can be used to discriminate several action classes. This was shown in the main paper by reporting transfer learning results. We now provide additional qualitative evidence and analysis.


For each video, we extracted feature maps from the video encoder learned using Robust-xID on the full Kinetics dataset. Fig. 11 depicts the top 4 closest videos for several query samples. As can be seen, Robust-xID produces highly semantic features, enabling correct retrievals for a large number of videos spanning a large number of classes. Furthermore, even when a video of a different class is retrieved, the errors are intuitive (for example, the confusion between ‘American football’ and ‘Hurling’ in the third row). Failure cases also seem to be correlated with classes that are hard to distinguish from the audio alone (e.g., different types of kicking sports or swimming strokes).

Class-based analysis

To better understand which classes are better modeled by the Robust-xID framework, we measured the top-1 retrieval performance (R@1) averaged across all images of each class. Similar to the analysis above, each video is represented by a feature map extracted from a video encoder learned using Robust-xID on the full Kinetics dataset. Fig. 10 depicts a list of Kinetics classes sorted by their average score. As can be seen, action classes which are often accompanied by long and distinctive sounds (e.g., squash, harp, drums, accordion, or scuba diving) tend to be more easily distinguished from others. In contrast, classes with less distinctive audio (e.g., making a cake, eating cake, or hugging) or classes where distinctive sounds are short-lived (e.g., blowing nose, gargling, or kicking ball) are harder to model using a cross-modal audio-visual framework. As a result, the features learned for such classes are less discriminative.
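A class-based breakdown of this kind can be sketched as follows. The sketch assumes a leave-one-out setting in which each video is matched to its nearest other video under Euclidean distance, and then averages the top-1 hits per class; the leave-one-out protocol and the function name `per_class_r1` are assumptions made for illustration.

```python
import numpy as np

def per_class_r1(feats, labels):
    """Per-class average of top-1 retrieval hits, sorted from best to worst."""
    feats = np.asarray(feats, dtype=np.float64)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all videos
    dists = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)      # a video cannot retrieve itself
    nearest = dists.argmin(axis=1)
    hits = labels[nearest] == labels     # True when the top-1 class matches
    scores = {int(c): float(hits[labels == c].mean()) for c in np.unique(labels)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy example: four videos with 2-D features and two classes.
feats = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.2, 0.9]])
labels = np.array([0, 0, 1, 1])
ranking = per_class_r1(feats, labels)
```

Sorting the per-class scores yields a ranking from the best to the worst modeled class, which is the kind of view summarized in fig. 10.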

Figure 11: Retrievals. In each row, the first image depicts the query video, and the following four images depict the top 4 retrievals. The corresponding Kinetics class description is provided above each frame. Each video is represented by a feature map produced by the video encoder learned using Robust-xID on the full Kinetics dataset. Euclidean distance is used to determine video similarity.

Faulty positive detection performance

To obtain a rough estimate of the performance of the faulty positive detection procedure, we randomly sampled 100 videos from the 10,000 most likely faulty positives, as identified by Robust-xID trained on the full Kinetics dataset. We then manually labeled them according to how related their audio and visual signals are. From those, 67 were clear faulty pairs; 24 contained narrative voice-overs (i.e., required natural language understanding to link the two modalities); and 9 samples were clearly misidentified.
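The counts above translate into simple detection rates. The short calculation below treats the 100 labeled videos as a random sample and gives two readings of the result: a strict one that counts only clear faulty pairs, and a lenient one that also counts voice-overs. The lenient reading and the binomial standard error are additions made for illustration.

```python
import math

counts = {"clear_faulty": 67, "voice_over": 24, "misidentified": 9}
total = sum(counts.values())  # 100 manually labeled videos

# Strict reading: only clear faulty pairs count as correct detections
strict_precision = counts["clear_faulty"] / total
# Lenient reading (assumption): voice-overs also count as correct detections
lenient_precision = (counts["clear_faulty"] + counts["voice_over"]) / total
# Binomial standard error of the strict estimate for a sample of this size
std_err = math.sqrt(strict_precision * (1.0 - strict_precision) / total)
```

With only 100 labeled samples, the standard error of the strict estimate is close to five percentage points, so these figures are best read as rough estimates.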