
Noise-Tolerant Learning for Audio-Visual Action Recognition

Recently, video recognition has been advancing with the help of multi-modal learning, which focuses on integrating multiple modalities to improve the performance or robustness of a model. Although various multi-modal learning methods have been proposed and offer remarkable recognition results, almost all of these methods rely on high-quality manual annotations and assume that the modalities of multi-modal data provide relevant semantic information. Unfortunately, most widely used video datasets are collected from the Internet and inevitably contain noisy labels and noisy correspondence. To solve this problem, we use the audio-visual action recognition task as a proxy and propose a noise-tolerant learning framework to find model parameters that resist interference from both noisy labels and noisy correspondence. Our method consists of two phases and aims to rectify noise via the inherent correlation between modalities. A noise-tolerant contrastive training phase is performed first to learn robust model parameters unaffected by the noisy labels. To reduce the influence of noisy correspondence, we propose a cross-modal noise estimation component to adjust the consistency between different modalities. Since noisy correspondence exists at the instance level, a category-level contrastive loss is proposed to further alleviate its interference. Then, in the hybrid supervised training phase, we calculate a distance metric among features to obtain corrected labels, which are used as complementary supervision. In addition, we investigate noisy correspondence in real-world datasets and conduct comprehensive experiments with synthetic and real noise data. The results verify the advantageous performance of our method compared to state-of-the-art methods.



I Introduction

With the growing popularity of mobile devices and online video platforms, people are generating and consuming a huge amount of video content every day. A recent study has shown that over 1 billion hours of video are watched each day on YouTube. This trend has encouraged advanced techniques that accurately recognize actions or events in videos [25, 21, 27], which can benefit a wide range of applications, including video summarization, video retrieval, and video recommendation.

As an information-intensive medium, video is rich in multiple modalities such as frames, motion (optical flow), and audio. Therefore, recent advancements in action recognition have mostly focused on integrating various modalities to improve the performance [27, 39] or robustness [35] of supervised models. To utilize the massive number of unlabeled videos in Internet-scale datasets, self-supervised multi-modal learning methods have been proposed that leverage the strong correlation among modalities to obtain pretrained models [2, 3, 32]. However, the promising results of most existing methods depend on datasets that are both cleanly annotated and cleanly corresponding, which are expensive and time-consuming to build. Unfortunately, most widely used video datasets such as Kinetics [18] and YouTube-8M [1] are collected from the Internet and inevitably contain both noisy labels and noisy correspondence. Specifically, noisy labels are corrupted from the ground-truth labels and thus result in poor generalization performance [13, 49]. To alleviate the harmful effects of noisy labels, numerous approaches have been proposed, such as Co-teaching [13], Meta-Weight-Net [33], and DivideMix [22]. Although these studies have achieved encouraging success, it is challenging to adapt them to the video recognition task. On the one hand, they are proposed to tackle the uni-modal situation and cannot simultaneously integrate the multiple modalities in video. On the other hand, these studies are generally based on the memorization effect of DNNs [4] and thus distinguish noisy from clean data by the loss difference. However, video data also contain noisy correspondence, which may confuse the losses of noisy and clean data. As shown in Figure 1, the existence of noisy correspondence makes the loss difference less distinguishable than in uni-modal scenarios. Specifically, noisy correspondence means that the multiple modalities in an instance provide irrelevant or redundant information, leading to sub-optimal performance [27]. Take the Kinetics dataset as an example: the sound tracks of videos are often unrelated to the visual content, e.g., human voices mask the sound of instruments, or a music montage accompanies a parkour video. Moreover, multi-modal networks are more prone to overfitting the noise due to their increased capacity [39]. Thus, it is extra challenging and complicated to consider both noisy labels and noisy correspondence simultaneously.

(a)
(b)
Fig. 1: Cross-entropy loss vs. epoch on the UCF101 dataset under noise for clean-annotated and noisy-label samples. (a) Training the late-fusion audio-visual network with 60% noisy labels and no noisy correspondence. (b) Training the late-fusion audio-visual network with both 60% noisy labels and 60% noisy correspondence. From the figure, we can see that noisy correspondence confuses the losses of noisy-label samples and clean-annotated samples.

Audio and vision represent the two most essential ways people perceive the world and are also the two most common modalities in video media. Therefore, we use the audio-visual action recognition task as a proxy to explore multi-modal learning with both noisy labels and noisy correspondence. Previous works on noisy labels conduct controlled experiments by injecting different kinds of synthetic noise, e.g., symmetric and asymmetric noise, into clean datasets. Although such synthetic constructions of noisy labels can appropriately simulate real-world label noise, they cannot be applied to simulate noisy correspondence. Based on our investigation of the real-world Kinetics dataset, noisy correspondence is more complex and irregular, and unfortunately there is no benchmark for constructing controlled noisy correspondence data. To this end, we annotate the validation set of the Kinetics dataset to estimate the noisy correspondence level of each class and then conduct experiments on real-world data with controlled levels of noise.

To address the aforementioned problems, we propose a novel noise-tolerant learning framework to find model parameters that resist interference from both noisy labels and noisy correspondence. Our method consists of two phases and aims to rectify noise via the inherent correlation between modalities. A noise-tolerant contrastive training phase is performed prior to the conventional supervised training, aiming to learn robust model parameters unaffected by the noisy labels. To reduce the influence of noisy correspondence, we propose a cross-modal noise estimation component based on the observation shown in Figure 2 that the clusters in the feature space formed by either modality should be the same if the modalities of a sample carry related semantic information; this component adjusts the consistency between the different modalities within an instance. Since noisy correspondence is produced by irrelevant modalities at the instance level, we propose a category-level contrastive loss to further alleviate its interference. Then, in the hybrid supervised training phase, we calculate a distance metric among the robust features to obtain corrected labels, which are used as complementary supervision for the supervised training.

To our best knowledge, this is the first work to consider both noisy labels and noisy correspondence simultaneously in an audio-visual action recognition task. The main novelties and contributions are summarized as follows:

  • We investigate the noisy correspondence problem in a real-world dataset and propose the first benchmark on audio-visual data with controlled noisy correspondence levels.

  • We propose a noise-tolerant learning framework for audio-visual action recognition with both noisy labels and noisy correspondence that uses the inherent correlation between different modalities to rectify noise.

  • Extensive experiments are conducted with a wide range of noise levels, and demonstrate the advantageous performance of our approach in audio-visual action recognition tasks compared to state-of-the-art methods.

(a)
(b)
Fig. 2: The motivation of cross-modal noise estimation. In this figure, we denote $(v_i, a_i)$, $(v_j, a_j)$ and $(v_k, a_k)$ as the features of audio-visual pairs and show their similarity in the common space. (a) Correct Correspondence: if $a_i$ has correct correspondence, we use the audio modality to find the clustered samples $a_j$ and $a_k$, and their corresponding visual features $v_j$ and $v_k$ are also close to $v_i$ in the common space, and vice versa. (b) Noisy Correspondence: if $a_i$ is irrelevant to $v_i$, the corresponding visual features $v_j$ and $v_k$ are far away from $v_i$ in the common space.

II Related Work

II-A Audio-visual Video Learning

Video understanding is an essential research area of computer vision and multimedia. Previous works have achieved significant success in understanding the temporal information of the video modality [17, 30, 10, 37, 40, 44]. In recent years, the trend from single-modality to multi-modality learning has become active in pursuit of better performance. Among the rich modalities in video, visual and audio information are the most common and thus have been widely developed. One typical technique is to combine the features of different modalities and jointly optimize them [29, 39]. Alternatively, some works aim to learn effective representations based on the semantic relation between modalities and transfer them to downstream tasks. Arandjelovic et al. [3] propose a cross-modal self-supervision method to localize the sounding object within an image. Alwassel et al. [2] extend DeepCluster [6] to the multi-modal setting, leveraging unsupervised clustering of one modality to produce a supervisory signal for the other. Rouditchenko et al. [32] introduce an approach based on concept disentanglement and semantic category assignment for audio-visual co-segmentation. More advances can be found in a recent audio-visual learning survey [51], which also indicates that previous works pay little attention to audio-visual learning under noise.

II-B Learning with Label Noise

Most existing approaches for training DNNs with label noise focus on uni-modal scenarios. One typical direction is to adjust the loss of training samples before updating the DNN, multiplying the predicted outputs by an estimated transition matrix [28, 43]. To obtain a more precise transition matrix, Hendrycks et al. [14] use a small dataset with clean labels as additional information, and Yao et al. [47] factorize the matrix into the product of two easily estimated matrices. Moreover, Yang et al. [46] estimate the instance-dependent noise transition probability by Bayesian optimization over distilled samples. Another direction focuses on re-weighting the contribution of each sample to the loss. Zhang et al. [50] employ a graphical model and re-weight samples by the structural relations among labels. Wang et al. [41] perform unsupervised learning to help re-weight the loss by pushing noisy samples away from correct samples in the representation space. Recently, meta learning has shown a robust ability to infer the loss weights automatically [31, 33, 23]. However, most prior arts designed for uni-modal scenarios cannot be directly adopted to multi-modal cases, since multi-modal data contain data noise in which two or more modalities are semantically irrelevant. Thus, samples with data noise can yield high loss and be misclassified as noisy-label cases. In the multi-modal scenario, Hu et al. [15] combine a robust clustering loss with a multi-modal contrastive loss to address cross-modal retrieval with noisy labels. Recently, Huang et al. [16] introduced a new paradigm of noisy labels, termed noisy correspondence, which refers to mismatched pairs in multi-modal samples. Specifically, they use the memorization effect of neural networks to divide the data and then rectify them via an adaptive prediction model. Contrary to all the aforementioned methods, our method is designed for the multi-modal setting and considers both noisy labels and noisy correspondence simultaneously.

II-C Multi-modal Contrastive Learning

Contrastive learning can be considered as learning by comparing among input samples. Most previous works aim to learn single-modal representations by constructing positive and negative pairs through a variety of data augmentations [8, 38, 42, 24]. Recently, various works have shown the potential of contrastive learning for multi-modal data. For example, Tian et al. [36] regard different views (e.g., L and ab channels) of an image as data augmentations and maximize the mutual information between different views of the same sample. In addition to contrasting between different modalities of the same sample, Morgado et al. [26] introduce a method that groups together videos with high similarity in both the video and audio modalities; these groups are then treated as extended positive pairs to directly optimize the visual representations. Yuan et al. [48] propose a method combining intra-modal and inter-modal similarity preservation objectives to improve the quality of the learned representations. To alleviate the influence of false negatives caused by random sampling, Yang et al. [45] propose a noise-robust contrastive loss to simultaneously learn representations and align data in multi-view learning. Existing works heavily rely on correct correspondence between modalities; in practice, such a condition is difficult to satisfy. Different from them, our work aims to learn robust representations under the noisy correspondence problem.

III Method

III-A Problem Statement

We target multi-modal video classification with both noisy labels and noisy correspondence. Contrary to the uni-modal case, multi-modal methods process each modality with a separate encoder, then concatenate the features and pass them to a classifier. Here we consider the two most common modalities in videos, the RGB stream and the audio track. Let $\mathcal{D} = \{(v_i, a_i, y_i)\}_{i=1}^{N}$ be a set of video data, where for each sample $i$, $v_i$ and $a_i$ are the visual stream and audio stream, and $y_i$ is the label. Let $f_v$ with parameters $\theta_v$ and $f_a$ with parameters $\theta_a$ denote the visual and audio encoders, respectively, and let $\theta$ denote the parameters of the whole multi-modal network.

The goal of supervised multi-modal classification is to minimize the empirical loss

$\mathcal{L}_{ce} = \frac{1}{N}\sum_{i=1}^{N} \ell\big(h(\phi(f_v(v_i), f_a(a_i))),\, y_i\big), \qquad (1)$

where $h$ denotes a classifier, $\phi$ denotes a fusion operation, and $\ell$ is the cross-entropy loss.

However, when the labels in $\mathcal{D}$ contain noise, the multi-modal network may overfit and perform poorly on the clean test set; meanwhile, multi-modal data also contain noisy correspondence, which can be regarded as hard samples and causes uni-modal methods to perform sub-optimally. To deal with these issues, we propose a noise-tolerant learning framework whose details are elaborated in the following sections.
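To make the problem setup concrete, the sketch below shows a minimal late-fusion forward pass and the empirical cross-entropy loss of Eq. (1). It is an illustrative sketch only: the encoder modules, the feature dimension, and the class count are placeholder assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusionAVNet(nn.Module):
    """Minimal late-fusion audio-visual classifier (illustrative placeholder)."""
    def __init__(self, visual_encoder, audio_encoder, feat_dim=512, num_classes=51):
        super().__init__()
        self.visual_encoder = visual_encoder   # f_v, e.g. an R(2+1)D-18 trunk
        self.audio_encoder = audio_encoder     # f_a, e.g. a ResNet-22 audio trunk
        # two-FC-layer fusion head applied to the concatenated features
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_classes),
        )

    def forward(self, video, audio):
        zv = self.visual_encoder(video)        # (B, feat_dim)
        za = self.audio_encoder(audio)         # (B, feat_dim)
        fused = torch.cat([zv, za], dim=1)     # late fusion by concatenation
        return self.classifier(fused)          # class logits

def empirical_loss(model, video, audio, labels):
    # Eq. (1): average cross-entropy over the (possibly noisy) labels
    logits = model(video, audio)
    return F.cross_entropy(logits, labels)
```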

Fig. 3: The framework of our proposed method. The noise-tolerant contrastive training phase and the hybrid supervised training phase proceed iteratively until convergence. The modality encoders share their weights between the two phases.

III-B Overall Framework

The motivation behind our method is that the inherent correlation between different modalities can aid in the rectification of noise. We achieve this with a novel noise-tolerant learning framework, which consists of two phases: a noise-tolerant contrastive training phase and a hybrid supervised training phase.

The noise-tolerant contrastive training phase is performed prior to the conventional supervised training, aiming to learn robust model parameters unaffected by the noisy labels. As illustrated in Figure 3, this phase is trained with a combination of two jointly learned contrastive losses: an instance-level loss $\mathcal{L}_{ins}$ and a category-level loss $\mathcal{L}_{cat}$. The instance-level loss is inspired by the existing uni-modal contrastive framework SwAV [7], which we extend to the multi-modal setting by maximizing the consistency between modalities instead of between data augmentations. Specifically, it clusters data from the two modalities simultaneously and leverages one modality's cluster assignments as a supervisory signal for the other modality. However, this assumes that the modalities of multi-modal data provide relevant semantic information; in practice, the existence of noisy correspondence makes such an assumption difficult to satisfy. Based on the observation shown in Figure 2, we address this limitation by designing a cross-modal noise estimation component that assigns weights to adjust the consistency between the different modalities within an instance. Since noisy correspondence is produced by irrelevant modalities at the instance level, we additionally propose a category-level contrastive loss to further alleviate its interference. Specifically, for each modality, we first obtain category representations by mapping features to prototype vectors. Then, we adopt a common semantic autoencoder with tied weights to reconstruct the category representations, which forces them to be closer in semantics and filters noisy information from the original features.

In the hybrid supervised training phase, we calculate a distance metric among the robust features to obtain corrected labels. Although the corrected labels are closer to the ground-truth labels, training with only the corrected labels may lead to a self-convergence problem and thus yield sub-optimal results. Therefore, the original labels and the corrected labels are used complementarily as supervisory signals to train the multi-modal network. The two proposed phases are trained in an iterative manner to find model parameters that resist interference from both noisy labels and noisy correspondence.

III-C Noise-Tolerant Contrastive Training Phase

III-C1 Instance-level Contrastive Loss

In the noise-tolerant contrastive training phase, the model parameters are not affected by noisy labels; to alleviate the influence of noisy correspondence, we propose cross-modal noise estimation to align the semantic consistency at the instance level. The motivation of our method is based on the observation illustrated in Figure 2. For one modality, we utilize the dense clusters of the corresponding modality to excavate similar samples, which are then used to calculate the inter-modal similarity. Given an audio-visual pair $(v_i, a_i)$, we denote the sets of similar samples mined from the training set for the two modalities, respectively, as

(2)
(3)

where the two thresholds are hyperparameters that control the quality and quantity of each set, and cosine similarity is used to measure feature closeness. For convenience of presentation, we denote two arbitrary features in the common space as $z_1$ and $z_2$, and the similarity function between them is defined as

(4)

where $\tau$ denotes a temperature parameter. For the audio-visual pair $(v_i, a_i)$, we generate the estimated probability that the visual and audio data have correct correspondence as

(5)
(6)
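Because the exact formulas of Eqs. (5)-(6) are not reproduced above, the following is only a minimal sketch of one plausible instantiation of the cross-modal noise estimation, assuming L2-normalized features in a common space and an illustrative neighbour count k and temperature tau. It scores each pair by how close the visual counterparts of its audio neighbours stay to the visual anchor (the intuition of Figure 2); the symmetric direction, which checks visual neighbours in the audio space, can be computed in the same way.

```python
import torch
import torch.nn.functional as F

def correspondence_weights(zv, za, k=16, tau=0.1):
    """Hedged sketch of cross-modal noise estimation.

    zv, za: visual/audio features of shape (N, D) in a common space.
    If a pair corresponds correctly, the neighbours found via the audio
    modality should also be close in the visual modality (Figure 2).
    """
    zv = F.normalize(zv, dim=1)
    za = F.normalize(za, dim=1)
    sim_a = za @ za.t()                                   # audio-audio cosine similarity
    sim_v = zv @ zv.t()                                   # visual-visual cosine similarity
    # k most similar samples according to the audio modality (index 0 is the sample itself)
    idx = sim_a.topk(k + 1, dim=1).indices[:, 1:]         # (N, k)
    # how close the visual counterparts of those neighbours are to the visual anchor
    cross = torch.gather(sim_v, 1, idx).mean(dim=1)       # (N,)
    # squash to (0, 1): a high value suggests correct correspondence
    return torch.sigmoid(cross / tau)
```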

For audio-visual video data, the strong correlation between the visual and audio modalities makes it possible to leverage one modality's cluster assignments as a supervisory signal for the other modality. Formally, given a mini-batch of samples from the training set, we first obtain the marginal cluster probability that each modality belongs to the $k$-th class:

(7)
(8)

where the trainable prototypes are shared between the two modalities and a temperature parameter controls the sharpness of the assignment; each prototype can be considered as the cluster center vector representing the $k$-th class. We denote by $q^v$ the one-hot encoding of the label predicted from the visual feature, and $q^a$ is analogous but uses the audio feature. The contrastive learning objective comprises two terms (i.e., a video term and an audio term). The video term is defined as the cross-entropy loss between the labels predicted by the audio modality and the marginal probabilities of the video features:

(9)

The formulation of the audio term is analogous, obtained by exchanging the roles of the two modalities. Note that the one-hot vectors of the predicted labels are unknown; following previous work [5], they can be learned by casting the optimization of the contrastive objective (9) as an optimal transport problem. To avoid the trivial solution in which all samples are predicted as the same label, a constraint is added to partition the batch into equal-sized clusters:

(10)

The optimal predicted vectors $q^v$ and $q^a$ can be calculated efficiently with the iterative Sinkhorn-Knopp algorithm [9].
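For reference, here is a short sketch of the standard SwAV-style Sinkhorn-Knopp normalization cited as [9], which produces balanced soft assignments from prototype scores; the iteration count and the regularization parameter eps are illustrative choices, not values taken from the paper.

```python
import torch

@torch.no_grad()
def sinkhorn_knopp(scores, n_iters=3, eps=0.05):
    """Balanced soft cluster assignments from prototype scores (SwAV-style sketch).

    scores: (B, K) dot products between features and prototypes.
    Returns Q of shape (B, K); columns are equalized so the batch is split
    into (approximately) equal-sized clusters, avoiding the trivial solution.
    """
    Q = torch.exp(scores / eps).t()          # (K, B)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)      # each prototype receives total mass 1/K
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)      # each sample receives total mass 1/B
        Q /= B
    return (Q * B).t()                       # rows sum to 1: per-sample assignment
```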

Owing to the existence of noisy correspondence, we combine the above noise estimation component with the contrasted cluster assignments to adjust the consistency between the two modalities at the instance level. We thus formulate the instance-level contrastive loss $\mathcal{L}_{ins}$ as:

(11)

III-C2 Category-level Contrastive Loss

Since noisy correspondence is produced by irrelevant modalities at the instance level, we utilize semantic information at the category level to further enhance the noise-tolerant ability of the networks. Similar to the instance-level loss, we use another set of trainable prototypes and compute their dot products with each modality in the batch. Formally, given a mini-batch of samples, for the $i$-th sample we take the softmax-normalized dot products of the video feature and of the audio feature with the prototypes. For each modality, the $k$-th element measures the similarity between the $k$-th class and the sample, and can thus be regarded as the category representation of the $k$-th class. To preserve the semantic information contained in both the video and audio category representations, a common autoencoder with tied weights is used to reconstruct the category representations:

(12)

where the hidden representation of the corresponding modality is produced by the encoder and the decoder reconstructs the original input from it; the encoder maps the category representation to a lower-dimensional space. The original representation and the reconstructed representation of the same class from one modality are regarded as a positive pair, while the original representation and the reconstructed representations of different classes from both modalities are regarded as negative pairs. The visual intra-modal contrastive loss is defined as:

(13)

The audio intra-modal contrastive loss is defined in the same way, but uses the audio modality to contrast.

On the one hand, the autoencoder can embed category representations from the two modalities more closely in the common space. On the other hand, the autoencoder naturally retains the effective feature information and filters out the noisy part. To improve the relevance between modalities at the category level, an inter-modal contrastive loss is proposed, which considers the inter-modal reconstructed representations of the same class as positive pairs. The inter-modal contrastive objective also comprises two terms. The video term is defined as:

(14)

The audio term is analogous, but uses the audio modality to contrast. We thus formulate the category-level contrastive loss $\mathcal{L}_{cat}$ as the combination of the intra-modal and inter-modal contrastive losses:

(15)
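To illustrate the mechanics described above, the sketch below pairs a tied-weight autoencoder over category representations with a simplified intra-modal contrastive term. The hidden dimension, the tanh activation, the temperature, and the single-modality negatives are assumptions made for illustration; the paper additionally draws negatives from the other modality and adds the inter-modal term of Eq. (14).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TiedAutoencoder(nn.Module):
    """Common semantic autoencoder with tied weights (decoder = encoder transposed)."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(hidden_dim, in_dim) * 0.01)

    def forward(self, x):                         # x: (K, in_dim) category representations
        h = torch.tanh(F.linear(x, self.W))       # encode to a lower-dimensional space
        recon = F.linear(h, self.W.t())           # decode with the tied (transposed) weights
        return h, recon

def intra_modal_contrastive(orig, recon, tau=0.1):
    """InfoNCE-style loss: the reconstruction of class k is the positive for the
    original representation of class k; reconstructions of other classes are negatives."""
    orig = F.normalize(orig, dim=1)
    recon = F.normalize(recon, dim=1)
    logits = orig @ recon.t() / tau               # (K, K) class-to-class similarity
    targets = torch.arange(orig.size(0), device=orig.device)
    return F.cross_entropy(logits, targets)
```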

III-D Hybrid Supervised Training Phase

The final noise-tolerant contrastive learning loss is formulated as:

$\mathcal{L}_{ctr} = \mathcal{L}_{ins} + \mathcal{L}_{cat} \qquad (16)$

For each epoch, after the noise-tolerant contrastive learning update, we aim to obtain corrected labels from the robust features. To this end, we apply the classical $k$-nearest neighbors ($k$-NN) approach, computing the nearest neighbors from a distance matrix and filtering suspicious samples. Each label is proposed by taking a majority vote of similar samples:

$\hat{y}_i = \arg\max_{c}\, \big|\{\, j \in \mathcal{N}_k(i) : y_j = c \,\}\big| \qquad (17)$

where $\hat{y}_i$ is the corrected label for the $i$-th sample and $\mathcal{N}_k(i)$ denotes the $k$ nearest neighbors of sample $i$ computed from the distance matrix. Since $k$-NN may miscalculate some hard samples as noise, we retain the original labels as a portion of the supervision, and the hybrid supervised objective is

$\mathcal{L}_{hyb} = (1-\lambda)\, \mathcal{L}_{ce}(\mathcal{Y}) + \lambda\, \mathcal{L}_{ce}(\hat{\mathcal{Y}}), \qquad (18)$

where $\mathcal{L}_{ce}$ is the cross-entropy loss shown in Eq. (1), $\hat{\mathcal{Y}}$ denotes the corrected labels in the dataset $\mathcal{D}$, and $\lambda$ is the weight factor that controls the balance between the two terms. For initial convergence of the algorithm, we set $\lambda = 0$ to warm up the model for a few epochs. Then the noise-tolerant contrastive training phase and the hybrid supervised training phase proceed iteratively. The noise-tolerant contrastive training phase first trains the encoders using the combination of $\mathcal{L}_{ins}$ and $\mathcal{L}_{cat}$. We update the encoder parameters and yield corrected labels by calculating the similarity among the robust features. These corrected labels are then used as complementary supervision to train the encoders and the classifier. The above procedure proceeds in an iterative manner until convergence. The full algorithm is outlined in Algorithm 1.
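A minimal sketch of the label rectification and the hybrid objective is given below, assuming cosine distance, a hypothetical neighbour count k, and lam standing in for the weight factor $\lambda$; the filtering of suspicious samples mentioned above is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def knn_corrected_labels(features, labels, k=16):
    """Propose a corrected label for each sample by majority vote of its k nearest
    neighbours in the robust feature space (sketch in the spirit of Eq. (17))."""
    features = F.normalize(features, dim=1)
    dist = 1.0 - features @ features.t()                          # cosine distance matrix
    idx = dist.topk(k + 1, dim=1, largest=False).indices[:, 1:]   # drop the sample itself
    neighbour_labels = labels[idx]                                # (N, k)
    return torch.mode(neighbour_labels, dim=1).values             # majority vote

def hybrid_supervised_loss(logits, labels, corrected, lam=0.6):
    """Sketch of Eq. (18): weighted mix of original and corrected labels.
    lam = 0 uses only the original labels (warm-up); lam = 1 only the corrected ones."""
    return (1.0 - lam) * F.cross_entropy(logits, labels) + \
           lam * F.cross_entropy(logits, corrected)
```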

Input: Network parameters $\theta$, instance prototypes and category prototypes, training dataset $\mathcal{D}$, estimated-probability thresholds, temperature parameters, weight factor $\lambda$.
1       $e \leftarrow 1$;
2       while $e \leq$ MaxEpoch do
3             Generate the estimated correspondence probability for each sample;
4             // noise-tolerant contrastive training phase
5             repeat
6                   Sample a mini-batch;
7                   Normalize the instance prototypes;
8                   Compute $\mathcal{L}_{ins}$ through Equations (7)-(11);
9                   Normalize the category prototypes;
10                  Compute $\mathcal{L}_{cat}$ through Equations (12)-(15);
11                  Compute the contrastive loss $\mathcal{L}_{ctr}$ by Eq. (16);
12                  Update $\theta$ by minimizing $\mathcal{L}_{ctr}$ with Adam;
13            until all samples are selected;
14            // rectify labels
15            Get the corrected labels according to Eq. (17);
16            // supervised training phase
17            repeat
18                  Sample a mini-batch;
19                  Compute the supervised loss $\mathcal{L}_{hyb}$ by Eq. (18);
20                  Update $\theta$ by minimizing $\mathcal{L}_{hyb}$ with Adam;
21            until all samples are selected;
22            $e \leftarrow e + 1$;
23      end while
Algorithm 1: Iterative Training

IV Experiments

To compare our proposed method with state-of-the-art methods, we evaluate it on the audio-visual action recognition task with a wide range of noise levels for both noisy labels and noisy correspondence.

IV-A Noisy Correspondence in Real-World Dataset

Compared to noisy labels, the noisy correspondence problem is more complex and irregular, making it difficult to simulate with synthetic constructions. To this end, we investigate a real-world audio-visual dataset to conduct experiments under real noisy correspondence. Kinetics [18] is a large-scale audio-visual video dataset for action recognition, which has 240K training videos and 20K validation videos over 400 classes. As the dataset is harvested from YouTube, the videos inevitably contain noisy correspondence between the two modalities. To quantitatively explore the influence of the noisy correspondence problem, we relabel the Kinetics validation set, annotating for each video whether it contains noisy correspondence. As the training set and validation set follow the same distribution, we can consider these statistics as an estimate of the noisy correspondence for each class. Figure 4 shows the statistics of noisy correspondence in the validation set of Kinetics. The classes with high-level noise are usually related to sports (e.g., parkour, windsurfing, and snowboarding), whose videos are often dubbed with completely unrelated sound tracks. In contrast, the classes with low-level noise are usually related to sound (e.g., singing and playing harp), which manifest both visually and aurally. To quantitatively study the influence of real-world noisy correspondence, we create 4 mini-datasets with controlled noise levels: {10%, 20%, 30%, 40%}. Each mini-dataset consists of 50 categories within the specified noise level; for each category, we randomly sample 400 videos from the training split and 50 videos from the validation split. The full list of each mini-Kinetics is given in the appendix.
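For readers who wish to reproduce a similar protocol, the sketch below assembles one controlled-noise subset from a per-class noise estimate. The band-based class selection, the tuple format of videos, and all parameter names are assumptions made for illustration; the paper only states that each subset contains 50 categories within the specified noise level, with 400 training and 50 validation videos per category.

```python
import random
from collections import defaultdict

def build_mini_kinetics(videos, class_noise_ratio, target_noise, band=0.05,
                        n_classes=50, n_train=400, n_val=50, seed=0):
    """Assemble a controlled-noise subset (hypothetical helper).

    videos: list of (video_id, class_name, split) tuples.
    class_noise_ratio: {class_name: estimated noisy-correspondence ratio}.
    """
    rng = random.Random(seed)
    eligible = [c for c, r in class_noise_ratio.items()
                if abs(r - target_noise) <= band]      # classes near the target noise level
    classes = rng.sample(eligible, n_classes)

    by_class_split = defaultdict(list)
    for vid, cls, split in videos:
        if cls in classes:
            by_class_split[(cls, split)].append(vid)

    subset = {"train": [], "val": []}
    for cls in classes:
        subset["train"] += rng.sample(by_class_split[(cls, "train")], n_train)
        subset["val"] += rng.sample(by_class_split[(cls, "val")], n_val)
    return subset
```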

Fig. 4: Histogram of instance counts over the 19,404 validation videos, sorted by the percentage of noisy correspondence within each class. Here we show the 20 classes with the most noisy correspondence and the 20 cleanest classes.

IV-B Experimental Setup

Datasets. In this paper, we use two video action classification datasets to evaluate our method. The first one is UCF101 [34]. It contains 7K audio-visual videos from 51 classes, and the mean video length is about 7 seconds. We adopt split 1 of the 3 official train/test splits to conduct our study. The second dataset is Kinetics; we use the 4 mini-Kinetics subsets described above to explore the influence of noisy correspondence.

Backbone architecture. We employ R(2+1)D-18 [37] as our visual backbone and ResNet-22 [20] as our audio backbone. For Kinetics, the visual backbone is pretrained on IG65M [12]. For UCF101, the visual backbone is pretrained on Kinetics. The audio backbone is pretrained on AudioSet [11] for both datasets. For fusion, we follow the fusion manner of [39]: a two-FC-layer network takes the concatenated features from the two modalities, and each FC layer except the last is followed by a ReLU layer.

Input preprocessing. For the visual modality, we use video clips as input, processed by random crop and center crop for training and testing, respectively. For the audio modality, we sample a 2-second audio track and use log-Mel spectrograms as input; the audio preprocessing follows [20]. Audio and visual inputs are temporally aligned.
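For illustration, the following sketch prepares a 2-second log-Mel input with torchaudio; the Mel parameters (n_fft, hop_length, n_mels) are placeholders here, as the paper simply follows the preprocessing of [20].

```python
import torchaudio

def log_mel_input(waveform, sample_rate, clip_seconds=2, n_mels=64):
    """Convert a (channels, samples) waveform excerpt into a log-Mel spectrogram."""
    n_samples = clip_seconds * sample_rate
    excerpt = waveform[:, :n_samples]                         # take a 2-second excerpt
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=320, n_mels=n_mels)(excerpt)
    log_mel = torchaudio.transforms.AmplitudeToDB()(mel)      # log-scaled Mel bins
    return log_mel                                            # (channels, n_mels, frames)
```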

Implementation Details. We employ the Adam [19] optimizer with default parameters to train our model. The batch size is set to 64, and the total training process contains 60 epochs. The two temperature parameters are set to 1 and 0.1, respectively. The numbers of similar samples are set to 16 and 32 for UCF101 and Kinetics, respectively. The weight factor $\lambda$ is initialized to 0.6 and iteratively increased to 0.8 after 20 epochs.

Evaluation Metrics. For evaluation, we take clip-level and video-level accuracy as the measurements. We uniformly sample 10 clips for each testing video and average their predictions as the video-level prediction. In the experiments, we report clip-level top-1, clip-level top-5, and video-level top-1 accuracy for a comprehensive evaluation.
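As a concrete reading of this protocol, the sketch below computes video-level top-1 accuracy by averaging the softmax predictions of the clips belonging to each video; the tensor shapes and the video_ids/video_labels containers are illustrative assumptions.

```python
import torch

def video_level_top1(clip_logits, video_ids, video_labels):
    """clip_logits: (num_clips, num_classes); video_ids: integer tensor of length
    num_clips mapping each clip to its video; video_labels: {video_id: label}."""
    probs = clip_logits.softmax(dim=1)
    correct, total = 0, 0
    for vid, label in video_labels.items():
        mask = video_ids == vid
        pred = probs[mask].mean(dim=0).argmax().item()   # average the clip predictions
        correct += int(pred == label)
        total += 1
    return correct / total
```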

IV-C Comparisons with State of the Arts

We compare our method with multiple state-of-the-art methods, including four audio-visual learning methods (i.e., G-Blend [39], XDC [2], AV-R [35], and AdaMML [27]) and one multi-modal label-noise method (i.e., MRL [15]). Note that for XDC, we also use $k$-NN to rectify labels in order to make it suitable for supervised learning (denoted by XDC*). All methods use the same backbone architecture as ours. To comprehensively evaluate the robustness of the methods, for noisy labels we set the noise to be symmetric (i.e., noise ratios 20%, 40%, 60%, 80%) and asymmetric (i.e., noise ratios 20%, 40%). For noisy correspondence, we use the mini-Kinetics subsets with different ratios of real noise, i.e., 10%, 20%, 30%, and 40%. Note that the two types of noise exist simultaneously.
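To make the controlled label-noise settings concrete, here is a hedged sketch of synthetic noise injection; the asymmetric flipping map asym_map is a hypothetical argument, since the paper does not spell out its class-to-class mapping.

```python
import numpy as np

def inject_label_noise(labels, num_classes, noise_ratio, mode="symmetric",
                       asym_map=None, seed=0):
    """Corrupt a given fraction of labels.

    symmetric: a corrupted label is drawn uniformly from the other classes;
    asymmetric: a corrupted label is flipped to a fixed similar class given by asym_map.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(len(noisy)) < noise_ratio
    for i in np.where(flip)[0]:
        if mode == "symmetric":
            choices = [c for c in range(num_classes) if c != noisy[i]]
            noisy[i] = rng.choice(choices)
        else:  # asymmetric
            noisy[i] = asym_map[noisy[i]]
    return noisy
```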

UCF101 mini-Kinetics mini-Kinetics mini-Kinetics mini-Kinetics
10% noisy correspondence 20% noisy correspondence 30% noisy correspondence 40% noisy correspondence
Noisy Labels Methods C@1 C@5 V@1 C@1 C@5 V@1 C@1 C@5 V@1 C@1 C@5 V@1 C@1 C@5 V@1
0.20 G-Blend 76.90 91.46 81.52 57.24 81.99 63.27 58.66 82.63 63.85 56.01 81.65 61.27 55.21 80.49 61.27
MRL 82.85 95.37 85.21 62.17 86.48 68.30 60.71 86.75 65.49 57.22 83.00 64.05 57.80 82.02 64.89
XDC* 82.66 95.06 85.91 63.11 86.84 70.18 61.21 86.58 66.63 59.68 85.54 65.17 57.82 82.28 65.59
AV-R 81.33 94.44 85.12 63.61 87.11 70.30 60.06 84.29 65.88 59.92 85.69 64.56 59.39 83.49 66.42
AdaMML 78.34 92.28 83.77 61.16 84.47 67.59 59.03 83.14 64.41 57.59 83.00 63.27 57.22 82.12 63.37
Ours 83.18 95.47 87.69 65.60 88.64 71.81 63.45 87.31 68.22 63.28 88.19 68.06 62.79 86.22 67.33
0.40 G-Blend 67.44 84.57 71.66 50.36 72.80 54.65 48.79 70.96 52.55 48.03 69.67 51.43 45.70 70.83 49.81
MRL 81.53 94.03 83.75 56.97 79.90 62.43 56.59 80.21 62.01 55.43 79.82 60.63 57.31 81.28 61.53
XDC* 79.78 94.75 84.28 57.57 81.85 62.30 56.57 80.00 62.46 55.06 79.73 59.48 53.76 80.17 60.79
AV-R 72.48 90.59 76.58 54.70 79.46 59.82 51.58 77.65 56.13 51.15 77.56 57.16 50.06 78.75 56.09
AdaMML 72.43 89.92 76.59 53.63 78.04 59.29 53.75 80.41 59.35 52.17 78.36 57.91 51.06 79.46 56.64
Ours 82.02 95.32 85.97 62.86 84.53 69.62 60.25 84.61 65.34 58.57 84.31 65.73 58.23 84.17 64.70
0.60 G-Blend 55.56 76.39 63.12 38.79 63.45 46.02 38.75 62.94 45.32 37.84 62.37 44.77 36.68 60.73 42.33
MRL 69.60 85.39 74.17 52.90 75.48 59.73 51.92 73.61 58.73 52.03 77.28 59.60 50.21 74.86 52.25
XDC* 77.57 92.54 83.59 54.21 78.06 60.75 53.52 78.26 60.97 51.19 76.65 58.45 49.04 74.13 51.99
AV-R 63.53 83.08 67.08 45.59 68.38 51.25 45.69 67.76 48.89 42.85 68.50 49.44 41.56 67.25 46.51
AdaMML 60.75 80.71 65.64 44.00 66.66 50.04 43.91 66.67 48.80 43.87 66.95 49.10 43.37 66.91 47.70
Ours 80.04 93.98 84.05 58.59 78.79 63.43 60.38 80.48 63.33 55.83 81.91 62.53 54.56 79.86 61.15
0.80 G-Blend 35.94 61.27 40.46 27.39 50.73 32.65 24.62 46.07 29.89 25.76 48.77 31.41 25.72 49.63 30.50
MRL 43.52 62.60 46.48 39.60 61.73 45.28 37.68 61.10 42.95 36.69 59.53 42.90 36.16 58.57 41.46
XDC* 51.95 76.34 59.11 41.55 65.49 47.46 38.53 61.78 42.98 36.20 58.78 41.77 34.16 57.26 40.56
AV-R 37.96 61.16 42.16 32.38 53.57 37.28 31.66 52.67 36.09 29.45 52.64 33.68 29.15 52.74 33.73
AdaMML 36.68 60.39 42.33 29.68 51.53 34.85 30.48 50.72 34.95 28.06 50.70 32.44 28.07 51.78 32.55
Ours 59.10 81.12 61.34 45.29 67.60 49.55 44.45 66.76 47.88 43.60 65.97 47.76 43.16 66.22 49.02
TABLE I: Comparison with state-of-the-art methods on UCF101 and mini-Kinetics with symmetric label noise.
UCF101 mini-Kinetics mini-Kinetics mini-Kinetics mini-Kinetics
10% noisy correspondence 20% noisy correspondence 30% noisy correspondence 40% noisy correspondence
Noisy Labels Methods C@1 C@5 V@1 C@1 C@5 V@1 C@1 C@5 V@1 C@1 C@5 V@1 C@1 C@5 V@1
0.20 G-Blend 73.56 92.75 77.24 54.68 80.58 59.44 53.30 79.52 58.80 53.55 78.53 57.88 52.73 77.34 57.41
MRL 82.66 94.96 85.17 61.57 81.92 66.56 59.69 80.86 66.02 59.62 81.68 65.83 57.43 80.60 63.93
XDC* 80.20 94.24 83.84 58.32 80.94 64.17 56.34 79.74 63.03 55.60 79.59 61.08 53.01 78.66 57.31
AV-R 77.73 92.54 81.36 57.39 80.63 62.76 57.05 79.86 63.66 56.25 80.43 62.57 55.19 79.10 60.73
AdaMML 77.37 94.03 81.72 57.17 80.84 62.84 56.38 80.08 63.32 54.60 79.47 60.37 54.12 78.50 58.98
Ours 83.28 95.27 86.79 65.62 87.24 71.06 67.42 88.27 71.26 65.36 86.71 67.41 62.14 85.48 66.88
0.40 G-Blend 51.85 91.10 55.96 35.97 69.66 40.54 35.52 68.54 40.97 36.96 69.33 41.62 34.85 68.13 39.91
MRL 63.32 90.90 67.08 44.80 73.39 51.25 43.86 74.02 50.98 42.04 72.72 48.91 40.99 72.27 47.46
XDC* 60.80 90.33 64.04 43.39 72.41 49.65 40.29 71.40 47.67 39.61 70.94 47.97 38.08 70.33 46.50
AV-R 56.74 90.23 61.52 41.46 72.05 49.59 41.56 71.54 48.51 40.80 72.34 48.80 40.86 72.15 48.60
AdaMML 51.85 89.92 55.93 39.66 70.17 45.21 38.81 70.46 44.81 38.70 70.70 45.10 37.13 69.82 44.43
Ours 71.45 92.13 74.39 57.43 81.13 62.43 56.06 80.28 62.27 55.13 79.50 60.65 53.72 78.04 57.61
TABLE II: Comparison with state-of-the-art methods on UCF101 and mini-Kinetics with asymmetric label noise.

Table I and Table II show the results for symmetric and asymmetric noisy labels, respectively. As shown in these tables, our method is superior to the state-of-the-art methods in all cases. From the experimental results, we make the following observations:

  • The existence of noisy labels remarkably impacts the performance of audio-visual action recognition methods. As the ratio of label noise increases, the performance of these methods decreases rapidly.

  • For the audio-visual action recognition task, it’s more challenging to learn under asymmetric noisy labels. But asymmetric label noise has less effect on clip-level top-5 accuracy.

  • The existence of noisy correspondence affects the anti-interference ability of audio-visual action recognition methods against noisy labels. On the one hand, it increases the difficulty of the task, making the methods easier to overfit. On the other hand, some existing audio-visual learning methods enforce agreement between multiple modalities (e.g., XDC* and MRL), which may lead to sub-optimal results.

  • Our approach shows better performance and stability than other audio-visual action recognition methods, especially in the case of high-level noise.

IV-D Progressive Comparison

Figure 5 plots the models' clip-level top-1 accuracy on training data with noisy labels and the corresponding test accuracy on the clean test data of UCF101 as training proceeds. We show some representative training processes using 20% asymmetric label noise, 40% asymmetric label noise, 60% symmetric label noise, and 80% symmetric label noise. It can be seen from Figure 5 that at the beginning of training, all methods quickly learn the clean data and achieve a certain accuracy. However, as training proceeds, most methods progressively overfit the training data and thus their performance on the clean test data decreases. As the noise level increases, most existing methods overfit more easily, degrading test accuracy. Some methods (e.g., XDC* and MRL) have a certain tolerance to noisy labels, since they are designed with anti-interference components such as robust loss functions, but they cannot deal with both symmetric and asymmetric noisy labels. Compared to the existing state-of-the-art methods, our method does not overfit the training data and has better noise tolerance in all noise cases.

Fig. 5: Clip-level top-1 accuracy vs. epoch on the noisy training set and the clean test set of UCF101. The label noise is set as: 20% asymmetric, 40% asymmetric, 60% symmetric, and 80% symmetric.

IV-E Ablation Study

To evaluate the contribution of the proposed components in our method, i.e., the instance-level and category-level contrastive losses, we carry out an ablation study on mini-Kinetics (with a noisy correspondence ratio of 40%) and UCF101; both datasets have 60% symmetric label noise. To sufficiently validate the effectiveness of the proposed components, we compare our method with three counterparts: 1) None: no noise-tolerant contrastive learning scheme; the corrected labels are produced from the features of supervised learning. 2) L-ins: only the instance-level noise-tolerant contrastive loss is used to update the network. 3) L-cat: only the category-level noise-tolerant contrastive loss is used to update the network. As shown in Table III, all of the proposed components are important for improving the noise tolerance of the audio-visual network, and the instance-level contrastive loss contributes the most.

Dataset Method C@1 C@5 V@1
UCF101 None 65.62 80.91 69.39
L-ins 79.22 93.25 82.97
L-cat 68.21 83.93 73.53
Full 80.04 93.98 84.05
mini-Kinetics None 42.91 73.07 49.30
L-ins 52.31 79.76 58.09
L-cat 48.08 76.73 54.48
Full 54.56 79.83 61.15
TABLE III: Ablation study on the proposed components.

IV-F Parameter Analysis

The weight factor $\lambda$ plays a significant role in the hybrid supervised training phase, determining whether the network focuses more on the original labels or on the corrected labels. For the two extreme cases, $\lambda = 0$ and $\lambda = 1$ denote that the network is trained using only the original labels and only the corrected labels, respectively. To evaluate the impact of this trade-off hyper-parameter, we conduct experiments to study the influence of different $\lambda$ values ranging from 0.0 to 1.0 under symmetric label noise rates of 0.2, 0.4, 0.6, and 0.8. Note that we keep $\lambda$ constant here for this quantitative study.

We plot the clip-level top-1 accuracy versus $\lambda$ on the validation sets of UCF101 and mini-Kinetics with 40% noisy correspondence in Figure 6. From the figure, we can see that training using only the original labels leads to poor performance for all noise rates; however, it is still superior to some of the compared methods, which indicates that the parameters learned in the contrastive training phase enhance the noise tolerance of the supervised training phase. Although the corrected labels are closer to the ground-truth labels, training using only the corrected labels may lead to a self-convergence problem and thus yield sub-optimal results. Furthermore, the sensitivity of $\lambda$ is also influenced by the noise rate and the difficulty of the dataset (e.g., the number of classes and the existence of noisy correspondence). Specifically, $\lambda$ is less sensitive under low noise rates and on easy-to-learn datasets. In most cases, our method achieves good performance within a relatively large range (i.e., 0.4-0.8).

(a)
(b)
Fig. 6: Clip-level top-1 accuracy for different values of $\lambda$ on the validation sets of (a) UCF101 and (b) mini-Kinetics with 40% noisy correspondence, respectively. The symmetric label noise rates are 0.2, 0.4, 0.6, and 0.8.
Fig. 7: Selection results of our cross-modal noise estimation component for correct and noisy correspondence samples.

IV-G Visualization of the Cross-Modal Noise Estimation

To further investigate the proposed cross-modal noise estimation on real-world noisy correspondence data, we plot the per-sample weight distribution of clean and noisy correspondence samples on the validation set. Note that the networks are trained with 40% symmetric noisy labels on both mini-Kinetics datasets. As shown in Figure 8, our method assigns larger weights to most clean samples than to noisy samples, which implies that our method can distinguish clean samples from noisy correspondence samples even when trained under noisy labels.

(a)
(b)
Fig. 8: Per-sample weight distribution on validation samples. (a) mini-Kinetics with 20% noisy correspondence. (b) mini-Kinetics with 40% noisy correspondence.

We also demonstrate the similar samples selected by our cross-modal noise estimation component. Specifically, we visualize the selection results for correct samples and noisy correspondence samples under 20% symmetric noisy labels and 10% real noisy correspondence in Figure 7. It can be seen from Figure 7 that our method clearly separates the correct and noisy samples. For correct samples, we use the audio modality to find similar samples that also have high similarity in the visual modality. For noisy samples, the irrelevant information gives the corresponding visual modality a low similarity. In the classes "answering questions" and "playing monopoly", the audio of the noisy samples is interspersed with additional background music. In the class "clay pottery making", human voices mask the sound of the clay-making action. In the class "riding camel", the main audio information comes from road noise.

V Conclusion

In this paper, we propose a noise-tolerant learning framework for audio-visual action recognition with both noisy labels and noisy correspondence, where a noise-tolerant contrastive training phase is performed prior to the conventional supervised training. The proposed noise-tolerant contrastive training phase aims to learn robust model parameters that are unaffected by the noisy labels. Owing to the existence of noisy correspondence, we also propose a cross-modal noise estimation component to adjust the consistency between different modalities. In the hybrid supervised training phase, we apply a $k$-NN approach to obtain corrected labels from the robust features, which are further used as complementary supervision for the supervised training. The noise-tolerant contrastive training phase and the hybrid supervised training phase are trained iteratively to find model parameters that resist interference from both noisy labels and noisy correspondence. In addition, we investigate noisy correspondence in a real-world audio-visual dataset and conduct comprehensive experiments with synthetic and real noise data. The results verify the advantageous performance of our method compared to state-of-the-art methods. To the best of our knowledge, this paper could be the first attempt to reveal the influence of noisy labels and noisy correspondence existing simultaneously in an audio-visual action recognition task.

References

  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: §I.
  • [2] H. Alwassel, D. Mahajan, B. Korbar, L. Torresani, B. Ghanem, and D. Tran (2020) Self-supervised learning by cross-modal audio-video clustering. Advances in Neural Information Processing Systems 33, pp. 9758–9770. Cited by: §I, §II-A, §IV-C.
  • [3] R. Arandjelovic and A. Zisserman (2018) Objects that sound. In Proceedings of the European conference on computer vision (ECCV), pp. 435–451. Cited by: §I, §II-A.
  • [4] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In International conference on machine learning, pp. 233–242. Cited by: §I.
  • [5] Y. M. Asano, C. Rupprecht, and A. Vedaldi (2019) Self-labelling via simultaneous clustering and representation learning. arXiv preprint arXiv:1911.05371. Cited by: §III-C1.
  • [6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149. Cited by: §II-A.
  • [7] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882. Cited by: §III-B.
  • [8] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. Cited by: §II-C.
  • [9] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. Advances in neural information processing systems 26, pp. 2292–2300. Cited by: §III-C1.
  • [10] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2019) Slowfast networks for video recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 6202–6211. Cited by: §II-A.
  • [11] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. Cited by: §IV-B.
  • [12] D. Ghadiyaram, D. Tran, and D. Mahajan (2019) Large-scale weakly-supervised pre-training for video action recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12046–12055. Cited by: §IV-B.
  • [13] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. Advances in neural information processing systems 31. Cited by: §I.
  • [14] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel (2018) Using trusted data to train deep networks on labels corrupted by severe noise. arXiv preprint arXiv:1802.05300. Cited by: §II-B.
  • [15] P. Hu, X. Peng, H. Zhu, L. Zhen, and J. Lin (2021) Learning cross-modal retrieval with noisy labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5403–5413. Cited by: §II-B, §IV-C.
  • [16] Z. Huang, G. Niu, X. Liu, W. Ding, X. Xiao, H. Wu, and X. Peng (2021) Learning with noisy correspondence for cross-modal matching. Advances in Neural Information Processing Systems 34. Cited by: §II-B.
  • [17] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §II-A.
  • [18] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §I, §IV-A.
  • [19] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §IV-B.
  • [20] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley (2020) Panns: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2880–2894. Cited by: §IV-B, §IV-B.
  • [21] D. Li, T. Yao, L. Duan, T. Mei, and Y. Rui (2018) Unified spatio-temporal attention networks for action recognition in videos. IEEE Transactions on Multimedia 21 (2), pp. 416–428. Cited by: §I.
  • [22] J. Li, R. Socher, and S. C. Hoi (2020) Dividemix: learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394. Cited by: §I.
  • [23] J. Li, Y. Wong, Q. Zhao, and M. S. Kankanhalli (2019) Learning to learn from noisy labeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5051–5059. Cited by: §II-B.
  • [24] Y. Li, P. Hu, Z. Liu, D. Peng, J. T. Zhou, and X. Peng (2021) Contrastive clustering. In 2021 AAAI Conference on Artificial Intelligence (AAAI). Cited by: §II-C.
  • [25] J. Liu, Y. Yang, and S. Jeng (2018) Weakly-supervised visual instrument-playing action detection in videos. IEEE Transactions on Multimedia 21 (4), pp. 887–901. Cited by: §I.
  • [26] P. Morgado, N. Vasconcelos, and I. Misra (2021) Audio-visual instance discrimination with cross-modal agreement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12475–12486. Cited by: §II-C.
  • [27] R. Panda, C. R. Chen, Q. Fan, X. Sun, K. Saenko, A. Oliva, and R. Feris (2021) AdaMML: adaptive multi-modal learning for efficient video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7576–7585. Cited by: §I, §I, §IV-C.
  • [28] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu (2017) Making deep neural networks robust to label noise: a loss correction approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1944–1952. Cited by: §II-B.
  • [29] S. Petridis and M. Pantic (2015) Prediction-based audiovisual fusion for classification of non-linguistic vocalisations. IEEE Transactions on Affective Computing 7 (1), pp. 45–58. Cited by: §II-A.
  • [30] Z. Qiu, T. Yao, and T. Mei (2017) Learning spatio-temporal representation with pseudo-3d residual networks. In proceedings of the IEEE International Conference on Computer Vision, pp. 5533–5541. Cited by: §II-A.
  • [31] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343. Cited by: §II-B.
  • [32] A. Rouditchenko, H. Zhao, C. Gan, J. McDermott, and A. Torralba (2019) Self-supervised audio-visual co-segmentation. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2357–2361. Cited by: §I, §II-A.
  • [33] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng (2019) Meta-weight-net: learning an explicit mapping for sample weighting. Advances in Neural Information Processing Systems 32, pp. 1919–1930. Cited by: §I, §II-B.
  • [34] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §IV-B.
  • [35] Y. Tian and C. Xu (2021) Can audio-visual integration strengthen robustness under multimodal attacks?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5601–5611. Cited by: §I, §IV-C.
  • [36] Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In European conference on computer vision, pp. 776–794. Cited by: §II-C.
  • [37] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §II-A, §IV-B.
  • [38] A. Van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv e-prints, pp. arXiv–1807. Cited by: §II-C.
  • [39] W. Wang, D. Tran, and M. Feiszli (2020) What makes training multi-modal classification networks hard?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12695–12705. Cited by: §I, §II-A, §IV-B, §IV-C.
  • [40] X. Wang, A. Farhadi, and A. Gupta (2016) Actions~ transformations. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2658–2667. Cited by: §II-A.
  • [41] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S. Xia (2018) Iterative learning with open-set noisy labels. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8688–8696. Cited by: §II-B.
  • [42] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3733–3742. Cited by: §II-C.
  • [43] X. Xia, T. Liu, N. Wang, B. Han, C. Gong, G. Niu, and M. Sugiyama (2019) Are anchor points really indispensable in label-noise learning?. Advances in Neural Information Processing Systems 32, pp. 6838–6849. Cited by: §II-B.
  • [44] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2017) Rethinking spatiotemporal feature learning for video understanding. arXiv preprint arXiv:1712.04851 1 (2), pp. 5. Cited by: §II-A.
  • [45] M. Yang, Y. Li, Z. Huang, Z. Liu, P. Hu, and X. Peng (2021) Partially view-aligned representation learning with noise-robust contrastive loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1134–1143. Cited by: §II-C.
  • [46] S. Yang, E. Yang, B. Han, Y. Liu, M. Xu, G. Niu, and T. Liu (2021) Estimating instance-dependent label-noise transition matrix using dnns. arXiv preprint arXiv:2105.13001. Cited by: §II-B.
  • [47] Y. Yao, T. Liu, B. Han, M. Gong, J. Deng, G. Niu, and M. Sugiyama (2020) Dual t: reducing estimation error for transition matrix in label-noise learning. arXiv preprint arXiv:2006.07805. Cited by: §II-B.
  • [48] X. Yuan, Z. Lin, J. Kuen, J. Zhang, Y. Wang, M. Maire, A. Kale, and B. Faieta (2021) Multimodal contrastive training for visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6995–7004. Cited by: §II-C.
  • [49] C. Zhang, Q. Wang, G. Xie, Q. Wu, F. Shen, and Z. Tang (2021) Robust learning from noisy web images via data purification for fine-grained recognition. IEEE Transactions on Multimedia. Cited by: §I.
  • [50] H. Zhang, X. Xing, and L. Liu (2021) DualGraph: a graph-based method for reasoning about label noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9654–9663. Cited by: §II-B.
  • [51] H. Zhu, M. Luo, R. Wang, A. Zheng, and R. He (2021) Deep audio-visual learning: a survey. International Journal of Automation and Computing 18 (3), pp. 351–376. Cited by: §II-A.