Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

Multimodal self-supervised learning has attracted growing attention because it allows not only training large networks without human supervision but also searching and retrieving data across various modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space that, in addition to sharing representations across different modalities, enforces a grouping of semantically similar instances. To this end, we extend the concept of instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. To evaluate our approach, we train our model on the HowTo100M dataset and evaluate its zero-shot retrieval capabilities in two challenging domains, namely text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.

1 Introduction

To robustly learn visual events and concepts, humans seldom rely on visual inputs alone. Instead, a rich multimodal environment is utilized for understanding by combining multiple sensory signals along with various language representations. Many recent techniques have attempted to mimic this paradigm to train efficient computer vision models, especially those that learn from videos where multiple modalities are naturally present [1, 2, 36].

Figure 1: The Multimodal Clustering Network (MCN) combines a contrastive loss that learns feature representations to be close across different modalities such as video, audio, and text (blue box) with a clustering loss that draws semantically related instances together, e.g., scenes depicting the same semantic concept (such as chopping or frying) from different videos or different clips (yellow box).

Learning on multimodal video data has both benefits and challenges. It is beneficial that each video instance has information available in multiple modalities. Textual information corresponding to the spoken narrations in the video, for example, provides a valuable language modality in addition to the visual and audio modalities [7, 20, 24]. In this work, we focus on the problem of learning a joint embedding space across multiple modalities. Given that the features from different modalities are often not comparable, the goal is to learn the projections into a common space where features from different domains but with similar content are close to each other to allow for a direct retrieval across modalities. However, creating an effective joint multimodal embedding space is not easy. First, each of those modalities is different, with respect to its source, how it is sampled and processed, and its resulting feature representation. Additionally, in real-world data, the supervision available to learn these projections from each of the modalities is unfortunately weak, as audio sequences can be misaligned to their visual representations and corresponding narration might or might not be present in the same time interval [2, 31].

To deal with multimodal data of this nature, several recent approaches use a contrastive loss [17, 18] to learn, e.g., feature representations in a joint embedding space. The goal is to bring samples drawn from the same temporal instance closer to each other while keeping samples from different times apart. Recent works [1, 31] show that such training is useful for pretraining models on large-scale data without additional supervision and that the resulting models achieve competitive performance on several tasks, e.g., in action classification when fine-tuned on various datasets. One problem arising from the contrastive loss is that this criterion does not consider the samples’ semantic structure and similarity at different times: two samples are treated as a negative pair as long as they occur at different times, regardless of their semantic similarity. This can have a considerable adverse impact on the learned representation. In a different formulation for learning representations, instead of comparing individual instances, clusters of instances are first created using a clustering algorithm [2, 5, 11, 28]. This approach encourages samples that are semantically similar to each other (namely, samples in the same cluster) to be close in the embedding space. However, if we cluster features from multiple modalities, those clusters would likely emerge only within each modality separately, grouping audio instances with audio instances and visual instances with visual instances. Therefore, a mechanism that pulls the instances from different modalities together is crucial for clustering features from different modalities in a joint space. This leads to our proposed method, which treats these two approaches as reciprocal information.

In this context of multimodal representation learning, we present a self-supervised learning framework that learns joint representations from the visual, audio, and language modalities and accounts for the semantic similarity of embeddings using a large corpus of naturally narrated videos. The proposed Multimodal Clustering Network (MCN) adopts a novel architecture to combine promising ideas from both representation learning paradigms described earlier: learning via the contrastive loss at the instance level and the semantic consistency at the cluster level. As another novel feature of our approach, we explore joint clusters using multimodal representations instead of clusters using separate modalities. The resulting features allow us to do retrieval across different modalities in linear time. Figure 1 provides a high-level overview of our approach.

To evaluate our proposed method, we address the challenging problem of zero-shot learning in two contexts: multimodal video retrieval and multimodal temporal action localization. We train our system on the HowTo100M dataset [32] and evaluate its retrieval capabilities on the YouCook2 [44] and MSR-VTT [42] datasets, and its temporal action localization on the task of action detection on the CrossTask [46] dataset and on the task of temporal action segmentation on the Mining YouTube [26] dataset. MCN significantly outperforms the best text-to-video retrieval baseline in recall and also outperforms the temporal action localization baseline in recall, both in zero-shot settings.

Contributions. The contributions of this work are threefold: (i) We propose a novel method by combining the benefits of contrastive loss and clustering loss for multimodal representation learning. Unlike prior works that create clusters using separate modalities, our method shows the important benefits of using multimodal joint clusters. (ii) We show that the proposed model can learn across three modalities (video, audio, text) in a joint space. (iii) We demonstrate significant performance gains on multiple downstream tasks in the zero-shot setting. These results show that the learned common space representations can improve state-of-the-art results without any additional training on the target datasets.

2 Related Work

Learning from Multimodal Data. Instead of collecting new annotated datasets [12, 38] for building various state-of-the-art visual recognition models, current approaches leverage large amounts of videos available on multiple social media platforms. When specific language resources like automatically generated speech recognition captions are available in narrated video datasets such as How2 [39] or HowTo100M [32], an appropriate proxy task that leverages these resources is instead used. Such visual caption pairs have been widely used in self-supervised models in vision and language tasks recently [3, 16, 30, 40, 45, 35, 27, 15]. In other approaches like [2, 6, 8, 20, 29], the need for these language transcripts is avoided by using just the corresponding raw speech signal. More recently, models trained from scratch on narrated video along with generated speech captions have also been successfully developed [31]. The three modalities naturally present in videos, the visual, audio, and language streams, are further integrated via a multimodal variant of this learning framework in [1]. Unlike these works, our goal in this paper is to learn a joint embedding in three modalities for zero-shot multimodal downstream tasks, where we create an embedding space in which the features across different modalities are directly comparable.

Contrastive Learning. A technique central to several state-of-the-art self-supervised representation learning approaches for images is instance-wise contrastive learning [13, 21]. In this paradigm, a model is trained to place samples extracted from the same instance, e.g., transforms or crops of an image, close to each other while pushing samples from different instances further apart. This is similar to noise contrastive estimation (NCE), where two samples are treated as a negative pair as long as they are drawn from different time segments. In MIL-NCE [31], the benefits of both multiple instance learning and NCE are combined. An advantage of this approach is that it allows for compensation of misalignments inherently found in videos and corresponding text captions. One inherent drawback of the instance-wise contrastive learning described above is that it is agnostic to the semantic similarity between the samples when positive and negative pairs are constructed. In our work, we alleviate this problem by relaxing the instance-level similarity across modalities to semantic-level similarity, introducing a clustering component that learns semantic similarity among multimodal instances within the batch.

Figure 2: Cross-domain Clustering vs. Joint Clustering. (a) Previous methods such as XDC perform clustering at separate spaces and use pseudo-labels as supervision to other domains. (b) Our method performs clustering across features from different modalities in the joint space to learn multimodal clusters. Best viewed in color.

Deep Unsupervised Clustering. Given the high cost of computing all pairwise comparisons in a large dataset, instead of applying the contrastive learning paradigm discussed above on each individual instance, a more practical solution is to discriminate between groups of instances during training. This is done by first pre-training a model to derive suitable feature representations of the data in a simple cascaded approach. Keeping the representations fixed, a clustering algorithm is then used to group instances before the weights of the model are updated using the derived class assignments as supervision [10, 43]. In contrast, instead of keeping the clustering step independent of the representation learning phase, more recent techniques jointly learn visual embeddings and cluster assignments [5, 6, 11, 41]. While both these approaches can produce interpretable clustering results that benefit downstream tasks by integrating global information across the entire dataset, running a clustering algorithm over a large data set slows down training. However, this issue can be addressed by performing the clustering in an online fashion [11]. These online models simultaneously learn to cluster and represent image data. To improve the performance of clustering, it is, however, also essential to leverage the correlated yet very complementary information available in the various modalities present in narrated videos [5]. To learn better feature extractors for audio and video, recent works, XDC [5] and SeLaVi [2] extend this clustering idea to the multimodal space. While these approaches focus on learning better feature extractors for each domain separately, our goal is to learn a joint multimodal embedding. As shown in Figure 2, these cross-domain clustering methods (left) create separate clusters and use cross-domain pseudo-labels as the supervision for each feature extractor. 
In contrast, our model (right) creates a common embedding space across all modalities and performs clustering jointly.

Figure 3: Illustration of our proposed framework. Our framework comprises five parts: (a) Extracting features from several modalities and projecting them into a joint space. (b) Calculating the contrastive loss pairwise to pull the features close across modalities. (c) Performing multimodal clustering across features from different domains in a batch. (d) Performing joint prediction across features to multimodal centroids to bring together semantically similar embeddings. (e) Reconstruction loss for regularization. Best viewed in color.

3 Learning to Cluster Multimodal Data

To effectively construct a joint representation space from unlabeled narrated videos, we start with narrated video clips. Each video clip is associated with its corresponding visual representation, audio representation and text narration. Given this input, the joint embedding space is learned, where the embeddings of video clips with semantically similar visual, audio, and text content are close to each other and apart when the content is dissimilar, as illustrated in Figure 1.

Using notation as in [31], for each clip, let $v \in \mathcal{V}$ denote its visual representation, let $a \in \mathcal{A}$ denote its corresponding audio, and let $t \in \mathcal{T}$ denote its matching text narration generated using an automatic speech recognition (ASR) system. Given a set of $n$ tuples of associated video, audio and text narrations $\{(v_i, a_i, t_i)\}_{i=1}^{n} \in (\mathcal{V} \times \mathcal{A} \times \mathcal{T})^n$, as shown in Figure 3 (a), we first construct three parametrized mappings that derive embedding representations from the original video, audio and text signals. Transform $f: \mathcal{V} \to \mathbb{R}^d$ derives a $d$-dimensional embedding representation $f(v) \in \mathbb{R}^d$ from a video clip $v$; transforms $g: \mathcal{A} \to \mathbb{R}^d$ and $h: \mathcal{T} \to \mathbb{R}^d$ produce similar $d$-dimensional audio and text embeddings $g(a) \in \mathbb{R}^d$ and $h(t) \in \mathbb{R}^d$. In this work, $f$ takes as input pre-extracted 2D and 3D features from a fixed-length clip, the input for $g$ are log-mel spectrograms extracted from the audio segments, and for $h$, we use a sentence-based neural model that transforms a set of words into a single vector. More details about model architectures are in Section 4.1.


Next, we introduce three loss functions to guide and properly situate these embeddings in the joint embedding space. A contrastive loss $L_{MMS}$ is used to ensure that the representations from each of the three modalities are comparable. A second clustering loss $L_{Cluster}$ encourages representations from semantically similar samples across all modalities to remain close in the learned embedding space. A third reconstruction loss $L_{Reconstruct}$ regularizes the multimodal common space features for more stable clustering training. The final model is trained to minimize the sum of these losses:

$$L = L_{MMS} + L_{Cluster} + L_{Reconstruct}$$


3.1 Contrastive Loss for Learning Joint Spaces

To learn a joint space for the three modalities, we compute a contrastive loss on all pairs of modalities, $(v, t)$, $(t, a)$, and $(v, a)$, as shown in Figure 3 (b). This loss maximizes the similarity between representations corresponding to any two modalities from the same instance (video clip) while minimizing the similarity of imposter pairs, in which the two modalities come from different video clips. In this work, we use the Masked Margin Softmax (MMS) function [23], which defines the similarity between representations from two modalities in terms of their learned embedding vectors’ dot product within a batch $B$. Features from each of the three modalities are assembled for each batch. The total contrastive loss is the sum of pairwise losses using each of the three modalities:

$$L_{MMS} = L_{ta} + L_{vt} + L_{va}$$

where $L_{ta}$, $L_{vt}$, and $L_{va}$ represent the losses associated with the pairwise modalities $(t, a)$, $(v, t)$, and $(v, a)$, respectively. For a pair of modalities, for example the text and audio modalities, the individual loss is in turn given as:

$$L_{ta} = -\frac{1}{B} \sum_{i=1}^{B} \left[ \log \frac{e^{h(t_i) \cdot g(a_i) - \delta}}{e^{h(t_i) \cdot g(a_i) - \delta} + \sum_{k=1, k \neq i}^{B} e^{h(t_k^{imp}) \cdot g(a_i)}} + \log \frac{e^{h(t_i) \cdot g(a_i) - \delta}}{e^{h(t_i) \cdot g(a_i) - \delta} + \sum_{j=1, j \neq i}^{B} e^{h(t_i) \cdot g(a_j^{imp})}} \right]$$

where $t_k^{imp}$ and $a_j^{imp}$ represent imposter pairs from two modalities that are sampled from a batch but do not co-occur. As can be seen in the $L_{ta}$ case, this loss attempts to discriminate between positive or true embedding pairs and imposter or negative pairs within each batch. Using two separate parts, the space of positive and negative samples is enumerated separately: in one case, a given text sample is paired with various negative audio samples. In the second case, an audio sample is paired with various negative text samples. ($i$, $j$, $k$) are various indices of video clips in a given batch.

$\delta$ is a margin hyperparameter that is empirically selected. By projecting all features to the same space and ensuring that their similarities are maximized pairwise, this formulation of the pairwise contrastive loss ensures that the features across different modalities are comparable.
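The pairwise contrastive loss described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes that row i of each input array holds the projected embedding of clip i, it omits the masking step of MMS, and the function names and the default margin value are hypothetical.

```python
import numpy as np

def log_softmax(s, axis):
    """Numerically stable log-softmax along the given axis."""
    s = s - s.max(axis=axis, keepdims=True)
    return s - np.log(np.exp(s).sum(axis=axis, keepdims=True))

def margin_softmax_pair_loss(x, y, margin=0.001):
    """Two-way margin softmax loss between two modalities.

    x, y: arrays of shape (B, d); x[i] and y[i] come from the same clip.
    The margin value is an assumption, not taken from the text.
    """
    sim = x @ y.T                                  # (B, B) dot-product similarities
    sim = sim - margin * np.eye(sim.shape[0])      # margin applied to positive pairs only
    x_to_y = np.diag(log_softmax(sim, axis=1))     # x_i against all y in the batch
    y_to_x = np.diag(log_softmax(sim, axis=0))     # y_i against all x in the batch
    return float(-(x_to_y + y_to_x).mean())

def total_contrastive_loss(v, a, t, margin=0.001):
    """Sum of the three pairwise losses (text-audio, video-text, video-audio)."""
    return (margin_softmax_pair_loss(t, a, margin)
            + margin_softmax_pair_loss(v, t, margin)
            + margin_softmax_pair_loss(v, a, margin))
```

With embeddings that are aligned across modalities, the diagonal of the similarity matrix dominates each row and column and the loss approaches the margin; with mismatched pairs it grows.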

3.2 Clustering Multimodal Features

To ensure that representations of semantically related instances are close in the learned joint multimodal space, in addition to contrastive loss described above, a self-supervised clustering step is included as part of the training process.

Online K-means clustering.

We apply the standard clustering algorithm $K$-means, which takes a set of vectors as input, in our case the features $M$ produced by fusing the multimodal features:

$$M = \frac{f(v) + g(a) + h(t)}{3}$$

where we take the mean over features from three modalities to represent a multimodal instance. We cluster them into $K$ distinct groups. More precisely, the algorithm outputs a $d \times K$ centroid matrix $C = \{\mu_1, \dots, \mu_K\}$, and the cluster assignments $y_n$ of each multimodal instance $n$ are defined by solving the following problem:

$$\min_{C \in \mathbb{R}^{d \times K}} \frac{1}{N} \sum_{n=1}^{N} \min_{y_n \in \{0,1\}^K} \| M_n - C y_n \|_2^2 \quad \text{such that} \quad y_n^\top 1_K = 1$$

We then acquire a centroid matrix $C^*$ and a set of assignments $(y_n^*)_{n \le N}$. Unlike pseudo-label-based methods [10] that only make use of the assignments (labels), we make use of the centroid matrix for semantic learning. To cover varied semantic information for clustering, we use features from the previous batches to gather sufficient instances for online learning.
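The clustering step can be illustrated with a small NumPy sketch. It is written under stated assumptions: plain Lloyd iterations with random initialization stand in for whatever k-means implementation was actually used, and the queue of earlier batches, its length, and all function names are hypothetical.

```python
import numpy as np
from collections import deque

def fuse_modalities(v, a, t):
    """Mean of the three projected embeddings, one fused row per clip."""
    return (v + a + t) / 3.0

def kmeans(M, k, n_iter=20, seed=0):
    """Plain Lloyd k-means. Returns (centroids of shape (k, d), assignments of shape (N,))."""
    rng = np.random.default_rng(seed)
    centroids = M[rng.choice(len(M), size=k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((M[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = M[assign == j]
            if len(members) > 0:               # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    d2 = ((M[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)

def cluster_with_history(queue, fused_batch, k):
    """Append the current batch to a queue of earlier batches and cluster the pooled features."""
    queue.append(fused_batch)
    pooled = np.concatenate(list(queue), axis=0)
    return kmeans(pooled, k)

# Hypothetical usage: keep fused features from the last 4 batches.
history = deque(maxlen=4)
```

Pooling several batches gives the clustering more instances to work with than a single batch provides, which is the motivation the text gives for reusing features from previous batches.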

Semantic centroid learning. To pull the features closer to their multimodal semantic centroids, we propose to use the centroid as a contrastive loss reference target. This target pulls the features from the three modalities closer to the centroid that is nearest to their multimodal instance feature and pushes them away from the other centroids. For each modality, for example the text modality, the individual loss is in turn given as:

$$L_t = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{h(t_i) \cdot \mu' - \delta}}{\sum_{k=1}^{K} e^{h(t_i) \cdot \mu_k}}$$

where $\mu'$ is the nearest centroid for the multimodal instance feature $M_i$ and $\mu_1, \dots, \mu_K$ are the $K$ centroids. We then sum the losses from the three modalities:

$$L_{Cluster} = L_v + L_a + L_t$$

In the end, the projected features from the three modalities learn to be closer to their shared centroid and, as a result, closer to semantically similar instances.
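A minimal NumPy sketch of the centroid-targeted loss for one modality follows. It assumes that the nearest centroid is chosen by Euclidean distance to the fused instance feature, that similarities are dot products, and that the margin is subtracted only from the target logit; the function name and the margin default are hypothetical.

```python
import numpy as np

def centroid_loss(feats, fused, centroids, margin=0.001):
    """Cross-entropy of one modality's features against the centroids.

    feats:     (B, d) projected features of one modality.
    fused:     (B, d) fused multimodal instance features.
    centroids: (K, d) cluster centroids.
    The target for row i is the centroid nearest to fused[i].
    """
    d2 = ((fused[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                        # (B,) index of target centroid
    logits = feats @ centroids.T                       # (B, K) dot-product similarities
    rows = np.arange(len(feats))
    pos = logits[rows, nearest] - margin               # target logit with margin
    m = logits.max(axis=1)
    lse = m + np.log(np.exp(logits - m[:, None]).sum(axis=1))   # stable log-sum-exp
    return float(-(pos - lse).mean())

def cluster_loss(v, a, t, fused, centroids, margin=0.001):
    """Sum of the per-modality centroid losses."""
    return sum(centroid_loss(x, fused, centroids, margin) for x in (v, a, t))
```

Because all three modalities share the same target centroid for a given clip, minimizing this loss pulls the video, audio, and text embeddings of semantically related clips toward the same region of the joint space.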

Multimodal features reconstruction. We apply a reconstruction loss on top of the common space features from the three modalities to stabilize feature training during clustering. For each modality, for example the visual modality, the individual loss is in turn given as:

$$L_{v'} = \frac{1}{B} \sum_{i=1}^{B} \| f'(v_i) - f(v_i) \|^2$$

where $f'(v)$ represents the reconstructed features obtained by feeding $f(v)$ into two linear layers acting as encoder and decoder. We then sum the loss over each modality:

$$L_{Reconstruct} = L_{v'} + L_{a'} + L_{t'}$$
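As a rough illustration of this regularizer, the sketch below passes the common-space features through two linear maps and measures the squared error against the input. The bias-free layers, the squared L2 distance, and the function name are assumptions made for the example.

```python
import numpy as np

def reconstruction_loss(z, w_enc, w_dec):
    """Mean squared reconstruction error of common-space features.

    z:     (B, d) projected features of one modality.
    w_enc: (d, h) weights of the linear encoder (bias omitted for brevity).
    w_dec: (h, d) weights of the linear decoder.
    """
    z_rec = (z @ w_enc) @ w_dec
    return float(((z_rec - z) ** 2).sum(axis=1).mean())
```

The per-modality values would then be summed over video, audio, and text, in the same way as the other two losses.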


4 Experiments

4.1 Implementation details

For the visual branch of the proposed MCN model we follow [32] and use pre-trained 2D features from a ResNet-152 model [22] trained on ImageNet [14] to extract features at the rate of one frame per second, along with pre-trained 3D features from a ResNeXt-101 model [19] trained on Kinetics [12] to obtain 1.5 features per second. The video clip features were computed by concatenating the 2D and 3D features into a 4096-dimensional vector and max-pooling the features over time. For the audio branch of the network, we compute log-mel spectrograms and use a pre-trained DAVEnet model to extract audio features. For the textual branch, the feature extraction process proposed in [32] is adopted to extract text representations: a GoogleNews pre-trained Word2vec model [33] provides word embeddings, followed by a max-pooling over words in a given sentence to extract a sentence embedding. Note that all backbones are fixed and are not fine-tuned during training. Each feature extraction branch is followed by a separate fully-connected layer and a gated unit for projecting the features into a common embedding space. To allow for pairwise comparisons, features from each of the different modalities are set to be 4096-dimensional vectors. We use an Adam optimizer [25] with a cosine learning rate schedule [34]. The model is trained for 30 epochs on four V100 GPUs over a period of about two days. Hyperparameters in our experiments are set as follows: a batch size of $B = 4096$ video clips and 256 clusters; the margin hyperparameter $\delta$ is selected empirically, as noted in Section 3.1.
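The pooling of pre-extracted features into clip-level and sentence-level vectors can be sketched as follows. Since the 2D and 3D streams are sampled at different rates, this sketch max-pools each stream over time and then concatenates; the 2048-dimensional size of each stream and the function names are assumptions, and the subsequent fully-connected layer and gated unit are not shown.

```python
import numpy as np

def clip_video_feature(feat_2d, feat_3d):
    """Clip-level video feature from per-timestep 2D and 3D features.

    feat_2d: (T2, 2048) features from the 2D backbone (assumed size).
    feat_3d: (T3, 2048) features from the 3D backbone (assumed size).
    Returns a single 4096-dimensional vector.
    """
    return np.concatenate([feat_2d.max(axis=0), feat_3d.max(axis=0)])

def sentence_feature(word_vectors):
    """Sentence embedding as an element-wise max over word embeddings of shape (n_words, d)."""
    return word_vectors.max(axis=0)
```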

4.2 Datasets

Training Dataset. Our models are trained on the HowTo100M [32] instructional video dataset, which contains 1.2M videos along with their corresponding audio that consists of speech and environmental sound and automatically generated speech transcriptions. The set of video-audio-text segment pairs used in training are defined using transcription time stamps provided with the dataset.

Downstream Datasets. As we are focusing on multimodal zero-shot capabilities, we work with datasets that provide audio and video data in combination with a textual representation, either in the form of transcripts [42, 44] or action labels [26, 46]. For text-to-video retrieval, we evaluate our representations on the following two datasets. The YouCook2 [44] dataset contains 3.5K cooking instruction video clips with text descriptions collected from YouTube. Unlike the HowTo100M dataset, text descriptions in YouCook2 are human-annotated. The MSR-VTT [42] dataset contains 200K human-annotated video clip-caption pairs on various topics. We use the same test set with 1K video clip-caption pairs constructed in [32] in our experiments.

For temporal action localization, we evaluate on the following two datasets. The CrossTask [46] dataset contains 2.7K instructional videos that cover various topics, including cooking tutorials like “Make banana ice cream” and car repair tutorials like “Add oil to car”. The action steps and their order for each task were collected from wikiHow articles with manual annotation for each frame. We follow [32], using 2.7K videos for evaluation and the inference procedure of [46] for calculating the recall. The MiningYoutube [26] dataset focuses on YouTube videos for five simple dishes based on egg preparation, “eggroll”, “fried egg”, “pancake”, “omelet” and “scrambled egg”, as they all share common tasks. The test set contains 250 cooking videos, 50 of each task, that are densely annotated, i.e., each frame is labeled with its respective action class. Overall, the dataset contains 512 different classes, based on 94 different verbs and 171 objects.

                                              YouCook2             MSR-VTT
Method          Mod  Model             TR   R@1    R@5    R@10   R@1    R@5    R@10
Random          -    -                      0.03   0.15   0.3    0.01   0.05   0.1
Miech [32]      VT   R152+RX101        N    6.1    17.3   24.8   7.2    19.2   28.0
MDR [3]         VT   R152+RX101        N    -      -      -      8.0    21.3   29.3
MIL-NCE* [31]   VT   R152+RX101        N    8.1    23.3   32.3   8.4    23.2   32.4
MCN (ours)      VAT  R152+RX101        N    18.1   35.5   45.2   10.5   25.2   33.8
MDR [3]         VT   R152              N    -      -      -      8.4    22.0   30.4
ActBERT [45]    VT   R101+Res3D        N    9.6    26.7   38.0   8.6    23.4   33.1
SSB [35]        VT   R(2+1)D-34+R152   N    -      -      -      8.7    23.0   31.1
MMV FAC [1]     VAT  TSM-50x2          Y    11.7   33.4   45.4   9.3    23.0   31.1
MIL-NCE [31]    VT   I3D-G             Y    11.4   30.6   42.0   9.4    22.0   30.0
MIL-NCE [31]    VT   S3D-G             Y    15.1   38.0   51.2   9.9    24.0   32.4
Table 1: Comparison of text-to-video retrieval systems. Mod indicates the modalities used, where V: video, A: audio, T: text. TR indicates whether a trainable backbone is used.
                                                 CrossTask             MYT
Method             Mod  Model            TR   Recall  IOD    IOU    Recall  IOD    IOU
CrossTask [46]     VT   R152+I3D         N    22.4    -      -      -       -      -
CrossTask [46]     VT   R152+I3D         N    31.6    -      -      -       -      -
Mining: GRU [26]   VT   TSN              N    -       -      -      -       14.5   7.8
Mining: MLP [26]   VT   TSN              N    -       -      -      -       19.2   9.8
Miech [32]         VT   R152+RX101       N    33.6    26.6   17.5   15.0    17.2   11.4
MIL-NCE* [31]      VT   R152+RX101       N    33.2    30.2   16.3   14.9    26.4   17.8
MCN (ours)         VAT  R152+RX101       N    35.1    33.6   22.2   18.1    32.0   23.1
ActBERT [45]       VT   R101+Res3D       N    37.1    -      -      -       -      -
ActBERT [45]       VT   + Faster R-CNN   N    41.4    -      -      -       -      -
MIL-NCE [31]       VT   I3D-G            Y    36.4    -      -      -       -      -
MIL-NCE [31]       VT   S3D-G            Y    40.5    -      -      -       -      -
Table 2: Evaluation of temporal action localization systems on the CrossTask and MiningYoutube (MYT) datasets.

4.3 Downstream Tasks

To demonstrate the effectiveness of the proposed model, we evaluate embeddings derived from the network in two downstream tasks: text-to-video retrieval and temporal action localization. We focus on the zero-shot task because we want to assess the quality of the cross-modal semantic embedding that was learned during training. When performing retrieval using our model, we compare the query text features with the video and audio features by computing the similarity for both and using the average. For action localization, we compute the same similarity between the video-audio pair of each frame and each respective label embedding, and are thus able to align video frames to each of the provided action steps.

Text-to-Video Retrieval. The goal of this task is to retrieve the matching video from a pool of videos, given its ground truth text query description. The model is tested on two video description datasets and evaluated on recall metrics: R@1, R@5, R@10. These evaluations are used to demonstrate the effectiveness of the contrastive loss and learned joint embedding space across three modalities.
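The scoring and evaluation just described can be sketched as follows. This is an illustrative NumPy version under the assumption that query i matches candidate i and that similarities are plain dot products in the joint space; the function names are hypothetical.

```python
import numpy as np

def retrieval_scores(text, video, audio):
    """Average of text-to-video and text-to-audio similarities.

    text:  (Q, d) query embeddings.
    video: (N, d) and audio: (N, d) candidate embeddings.
    Returns a (Q, N) score matrix.
    """
    return 0.5 * (text @ video.T + text @ audio.T)

def recall_at_k(scores, k):
    """Fraction of queries whose ground-truth candidate (same index) ranks in the top k."""
    target = np.diag(scores)[:, None]
    rank = (scores > target).sum(axis=1)    # candidates scored strictly higher than the target
    return float((rank < k).mean())
```

For example, `recall_at_k(retrieval_scores(text, video, audio), 10)` gives R@10 for a square score matrix in which row i and column i belong to the same clip.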

Text-to-Full Video Retrieval. The conventional text-to-video retrieval task attempts to match a caption (or ground-truth text query) to a single video clip. Since a single caption can refer to many individual clips within a dataset, this task is limiting. To this end, we propose the task of text-to-full video retrieval where the goal is to match a set of captions (or text queries) describing multiple parts of a video to an entire video. This is a more realistic task than single clip retrieval since various real-world applications require retrieving entire videos from complex textual queries. We evaluate on YouCook2 dataset with recall metrics: R@1, R@5, R@10.

Temporal action localization. We further evaluate our model on two temporal action localization tasks. Here, given a minute-long video and a list of the action steps present in the clip, the task is to find those actions within the video stream. This is challenging as the labels for action steps only consist of verb-object pairs such as “add flour”, which significantly differ from the ASR-based video transcriptions that the model is trained on. The CrossTask [46] dataset considers the task of clip-level action detection. Here, an unordered set of action labels is given for a set of clips of the same video, and clips have to be classified with the respective action labels. The performance is reported as recall, computed as the ratio of correctly predicted clips over the total number of clips in the video, as used in [46]. Background elements are not included in the evaluation. The MiningYoutube [26] dataset considers the task of frame-level temporal action segmentation as used in weakly supervised action learning [37]. Here, each test video is provided together with the respective actions and their ordering, including the background. The goal is to find the correct frame-wise segmentation of the video given the action order. We follow the inference procedure outlined in [26] to compute the alignment given our similarity input matrix. The dataset employs two evaluation metrics: intersection over detection (IoD) [9], defined as $\frac{|G \cap D|}{|D|}$, the ratio between the intersection of ground-truth action $G$ and prediction $D$ to prediction $D$, and the Jaccard index, which is an intersection over union (IoU) given as $\frac{|G \cap D|}{|G \cup D|}$. IOD and IOU scores are computed separately for each video and reported as an average score over all the videos. We report recall and the IOU-IOD metric for both datasets.
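For concreteness, the two overlap metrics can be computed from frame-wise labels as in the sketch below. It handles a single action class in a single video; how classes are aggregated within a video is not specified here, so that step is left out. The function name is hypothetical.

```python
import numpy as np

def iod_and_iou(gt_labels, pred_labels, action):
    """Intersection over detection and intersection over union for one action class.

    gt_labels, pred_labels: 1-D arrays of per-frame class ids for one video.
    """
    g = np.asarray(gt_labels) == action       # ground-truth frames of this action
    d = np.asarray(pred_labels) == action     # predicted (detected) frames of this action
    inter = np.logical_and(g, d).sum()
    detected = d.sum()
    union = np.logical_or(g, d).sum()
    iod = inter / detected if detected > 0 else 0.0
    iou = inter / union if union > 0 else 0.0
    return float(iod), float(iou)
```

IoD ignores ground-truth frames that were missed and only penalizes detections outside the ground truth, whereas IoU penalizes both kinds of error.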

4.4 Comparison with State-of-the-art Methods

Zero-shot Video Retrieval. We first examine the results of the text-to-video retrieval task on the YouCook2 and MSR-VTT datasets (Table 1). We compare only with baseline models that were not fine-tuned on the respective dataset for a fair comparison. To allow comparability between different approaches, we use a fixed visual feature extraction backbone as described in [32] whenever possible. For the baseline MIL-NCE* [31], we apply their training strategy on the same visual feature set we use, ResNet-152 (R152) and ResNeXt-101 (RX101) [32]. On YouCook2, our model significantly outperforms prior works on the same architecture and is even competitive with models that use a trainable visual backbone (TR). Our method also performs better than the other baselines on MSR-VTT. The gains are, however, not as significant as on YouCook2. We attribute this to the fact that neither the available audio nor the textual description is instructional in nature and, therefore, semantically further away from our training set.

Zero-shot Action Localization. Additionally, we examine the action localization tasks on the CrossTask and the MiningYoutube datasets in Table 2. For CrossTask, given each frame in the video, we perform a zero-shot classification of the given labels and calculate the recall. In this zero-shot setting, the model computes video-text similarity to localize action step labels, similar to [32]. Our method outperforms state-of-the-art approaches for self-supervised learning [31, 32] and a fully supervised approach [46], especially in the IOU and IOD metrics, which also consider false-positive predictions from the background class as an action step. The approaches in [32] and MIL-NCE* [31] are directly comparable with our method since they use the same feature extractor as we do. In contrast, [31] uses a stronger video backbone and [45] uses additional feature modalities such as region features along with a stronger language model. We also evaluate our model on the MiningYoutube [26] temporal action localization benchmark. Our method outperforms state-of-the-art approaches for both self-supervised [31, 32] and weakly supervised [26] learning.

Clustering Metrics. We further evaluate our system with respect to various clustering metrics as proposed by [5]. Results are shown in Table 3. The definition of each metric is included in the supplementary. It shows that our learned multimodal features are closer to the ground-truth distribution and have higher purity within the cluster.

Method          NMI    ARI    Acc.   ⟨H⟩    ⟨p_max⟩
Random          3.2    3.2    9.4    1.30   47.5
Miech [32]      61.8   46.1   57.0   0.39   81.5
MIL-NCE* [31]   62.0   45.6   56.7   0.37   82.4
MCN (ours)      65.5   48.5   57.6   0.34   83.8
Table 3: Performance on various clustering metrics on the CrossTask dataset evaluated by GT text annotations on video segments.
Figure 4: Qualitative results for the text-to-video retrieval task on YouCook2. Top-ranked clips show a high similarity to the described task as well as among each other without being too visually similar.
Figure 5: t-SNE visualizations on the CrossTask dataset for the task of “Make French Toast”. Best viewed in color.

4.5 Full Video Retrieval

To address the problem of full video retrieval from a set of captions, we divide each video into a set of clips, which are compared with the queries. We evaluate three different methods. In majority vote over clip predictions, we obtain the top-k predictions of each clip/caption pair as votes and select the video that has the majority of votes. For majority vote over videos, the maximal prediction over all the clips of a video is taken for each caption to obtain video/caption pairs. Then, the top-k of these predictions are selected as votes, and the video with the most votes is predicted. Lastly, in our caption averaging method, the maximal prediction over all the clips of a video is taken for each caption, and these maxima are then averaged over the set of captions in a query. This gives a single prediction for the entire video.
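The caption averaging scheme can be sketched as below, assuming a precomputed caption-to-clip similarity matrix for one query and an integer array mapping every clip to its source video. Names are hypothetical, and the two majority-vote variants are not shown.

```python
import numpy as np

def caption_average_scores(sim, clip_to_video, n_videos):
    """Video-level scores for one query consisting of several captions.

    sim:           (n_captions, n_clips) caption-to-clip similarities.
    clip_to_video: (n_clips,) integer id of the video each clip belongs to.
    Returns an (n_videos,) array: max over a video's clips, then mean over captions.
    """
    clip_to_video = np.asarray(clip_to_video)
    per_video = np.full((sim.shape[0], n_videos), -np.inf)
    for vid in range(n_videos):
        mask = clip_to_video == vid
        if mask.any():                       # videos without clips keep a score of -inf
            per_video[:, vid] = sim[:, mask].max(axis=1)
    return per_video.mean(axis=0)
```

The predicted video for the query is then the index with the highest score, for example `int(np.argmax(caption_average_scores(sim, clip_to_video, n_videos)))`.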

We examine the results of the text-to-full video retrieval task on the YouCook2 dataset (Table 4). Of the three methods to obtain full video predictions, caption averaging achieves better results than both majority voting schemes. Furthermore, we find that our method outperforms prior works on R@1 and R@5, with a 6.8% absolute improvement on R@1. Since we obtain full video predictions, we also perform full-video classification on the CrossTask dataset using the set of sub-task labels as the set of query captions, where we achieve a top-1 accuracy of 68.7%.

| Method | Prediction | R@1 | R@5 | R@10 |
|---|---|---|---|---|
| Random | - | 0.23 | 1.15 | 2.32 |
| MCN (ours) | MV-Clip | 38.8 | 67.4 | 76.8 |
| MCN (ours) | MV-Video | 38.8 | 67.7 | 78.4 |
| MCN (ours) | Caption Avg. | 53.4 | 75.0 | 81.4 |
| Miech [32] | Caption Avg. | 43.1 | 68.6 | 79.1 |
| MIL-NCE* [31] | Caption Avg. | 46.6 | 74.3 | 83.7 |

Table 4: Comparison of Text-to-Full Video retrieval systems on the YouCook2 dataset. The prediction column denotes the method used to obtain video-level predictions: majority vote over clips (MV-Clip), majority vote over videos (MV-Video), and caption averaging (Caption Avg.).

4.6 Ablation Studies

| Loss | YR10 | MR10 | CTR | MYT-IOU |
|---|---|---|---|---|
| NCE | 39.2 | 33.5 | 33.9 | 21.5 |
| MIL-NCE | 40.0 | 33.0 | 33.7 | 21.1 |
| MMS | 43.7 | 32.9 | 34.3 | 22.1 |
| MMS + Cluster | 44.3 | 33.7 | 34.5 | 22.6 |
| MMS + Cluster + Reconstruct | 45.2 | 33.8 | 35.1 | 23.1 |

Table 5: Ablation study on different losses, including the choice of contrastive loss and the additional clustering and reconstruction losses.

| Method | Target | Labels | YR10 | MR10 | CTR | MYT-IOU |
|---|---|---|---|---|---|---|
| Sinkhorn | Swap | hard | 39.0 | 33.4 | 33.6 | 21.1 |
| Sinkhorn | Swap | soft | 41.8 | 33.9 | 34.5 | 22.1 |
| Sinkhorn | Joint | hard | 44.4 | 33.4 | 34.6 | 21.1 |
| Sinkhorn | Joint | soft | 43.6 | 32.4 | 34.1 | 21.6 |
| K-means | Swap | hard | 41.3 | 32.8 | 33.2 | 21.0 |
| K-means | Joint | hard | 44.3 | 33.1 | 34.6 | 21.4 |
| K-means | Centroid | hard | 45.2 | 33.8 | 35.1 | 23.1 |

Table 6: Ablation study on different clustering pipelines with various clustering methods, prediction targets, and label types.

To better understand the contributions of the algorithmic design choices used to build the proposed MCN model, we perform a set of ablation studies on the following downstream tasks: YouCook2 R@10 (YR10), MSR-VTT R@10 (MR10), CrossTask average recall (CTR), and MiningYoutube IOU (MYT-IOU). For each setting, we use the same feature extractors for the three modalities as described in Sec 4.1 for a fair comparison.

Selection of different losses. In our first set of experiments, we find that the proposed clustering is crucial not only for clustering-related tasks but also for retrieval (MSR-VTT) tasks, as shown in Table 5. This validates our hypothesis that semantically close instances should be clustered closely in the joint embedding space. In addition, choosing MMS as the contrastive loss gives better results than NCE and MIL-NCE on YouCook2, CrossTask, and MiningYoutube.
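To make the contrastive baseline in this ablation concrete, the sketch below shows one common form of a margin softmax loss in the spirit of MMS [23]: a symmetric softmax cross-entropy over a batch similarity matrix in which a margin is subtracted from the positive (diagonal) scores. This is an illustrative assumption rather than the exact loss used in our experiments; in particular, the function name and the margin value of 0.1 are hypothetical, and no masking of negatives is applied.

```python
import math

def mms_loss(sim, margin=0.1):
    """Symmetric margin softmax over a square similarity matrix.

    sim[i][j] is the similarity between item i of one modality and item j
    of another; matching pairs sit on the diagonal. The margin value is an
    arbitrary placeholder, not a tuned hyperparameter.
    """
    n = len(sim)

    def one_direction(s):
        total = 0.0
        for i in range(n):
            # Subtract the margin from the positive pair only.
            logits = [s[i][j] - (margin if i == j else 0.0) for j in range(n)]
            m = max(logits)  # log-sum-exp shift for numerical stability
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]
        return total / n

    transposed = [list(col) for col in zip(*sim)]
    return one_direction(sim) + one_direction(transposed)

# Toy 2x2 batch: well-aligned pairs give a small loss.
aligned = [[10.0, 0.0], [0.0, 10.0]]
print(mms_loss(aligned, margin=0.0), mms_loss(aligned, margin=0.5))
```

Subtracting the margin from the positive score makes the objective stricter: a matching pair must exceed its negatives by at least the margin before its loss term becomes small.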

Different choices of clustering methods. We evaluate the performance of (1) different clustering methods, namely Sinkhorn clustering [6] and K-means [4]; (2) different prediction targets: swap prediction, which uses the pseudo-labels of the other modalities as the prediction target as in [11, 2]; joint prediction, which uses the pseudo-label of the mean feature as a shared target for the three modalities; and centroid prediction, which uses the centroid of the cluster as the target; and (3) different label types, including hard (one-hot) and soft (continuous) labels. Detailed descriptions are included in the supplementary. As shown in Table 6, our method (K-means with centroid targets) encourages each modality feature to move closer to the semantic centroid, which improves performance by explicitly encouraging semantically close features from different domains to cluster together.
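The difference between the target and label variants can be illustrated with a small sketch. It assumes cluster centroids are already available (for example from K-means on the mean features) and shows how a hard label, a soft label, and a centroid target could be derived for one instance. The helper names, the use of squared Euclidean distance, and the softmax temperature are assumptions for illustration; the actual formulations are described in the supplementary.

```python
import math

def mean_feature(video, audio, text):
    """Joint feature: element-wise mean of the three modality embeddings."""
    return [(v + a + t) / 3.0 for v, a, t in zip(video, audio, text)]

def sq_dist(x, y):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def hard_label(x, centroids):
    """One-hot assignment to the nearest centroid."""
    k = min(range(len(centroids)), key=lambda i: sq_dist(x, centroids[i]))
    return [1.0 if i == k else 0.0 for i in range(len(centroids))]

def soft_label(x, centroids, temperature=1.0):
    """Continuous assignment: softmax over negative squared distances."""
    logits = [-sq_dist(x, c) / temperature for c in centroids]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def centroid_target(x, centroids):
    """Centroid of the nearest cluster, used directly as the target vector."""
    return min(centroids, key=lambda c: sq_dist(x, c))

# Toy instance with hypothetical 2-D embeddings and two centroids.
centroids = [[0.0, 0.0], [1.0, 1.0]]
video, audio, text = [0.9, 1.0], [1.0, 1.1], [0.2, 0.1]
joint = mean_feature(video, audio, text)

swap_target_for_video = hard_label(audio, centroids)  # label taken from another modality
joint_target = hard_label(joint, centroids)           # one shared label from the mean feature
center = centroid_target(joint, centroids)            # target is the centroid itself
print(swap_target_for_video, joint_target, soft_label(joint, centroids), center)
```

With swap and joint targets each modality predicts a cluster label, while with a centroid target each modality feature is pulled toward a point in the embedding space, which matches the intuition given above for why the centroid variant groups modalities more tightly.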

4.7 Qualitative Analysis

We perform a qualitative analysis of the model's ability to do zero-shot text-to-video retrieval, as shown in Figure 4. Given an open-vocabulary caption, our model can retrieve the correct corresponding video segment. We also visualize the efficacy of using multimodal embeddings (concatenated video and audio representations) over using only visual embeddings. Representations from the CrossTask dataset are visualized using t-SNE plots. We observe that with multimodal features (Figure 5 (b)), semantically related instances (based on ground-truth classes) tend to be more tightly grouped than with uni-modal visual features trained with a contrastive loss (Figure 5 (a)), which appear more spread out. Multimodal features are also more clearly separable for different actions.

5 Conclusions

We have developed a novel self-supervised multimodal clustering network that learns a common embedding space by processing local (via a contrastive loss) and global (via a clustering loss) semantic relationships present in multimodal data. The multimodal clustering network is trained on a large corpus of narrated videos without any manual annotations. Our extensive experiments on multiple datasets show that creating a joint video-audio-language embedding space with a clustering loss is essential for self-supervised learning of good video representations. Our approach can be extended to more modalities such as optical flow or sentiment features and applied to other multimodal datasets for learning joint representation spaces without human annotation.

References


  • [1] J. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, and A. Zisserman (2020) Self-supervised multimodal versatile networks. arXiv preprint arXiv:2006.16228. Cited by: §1, §1, §2, Table 1.
  • [2] H. Alwassel, D. Mahajan, B. Korbar, L. Torresani, B. Ghanem, and D. Tran (2020) Self-supervised learning by cross-modal audio-video clustering. Advances in Neural Information Processing Systems 33. Cited by: §1, §1, §1, §2, §2, §4.6.
  • [3] E. Amrani, R. Ben-Ari, D. Rotman, and A. Bronstein (2020) Noise estimation using density estimation for self-supervised multimodal learning. arXiv preprint arXiv:2003.03186. Cited by: §2, Table 1.
  • [4] D. Arthur and S. Vassilvitskii (2006) K-means++: the advantages of careful seeding. Technical report Stanford. Cited by: §4.6.
  • [5] Y. M. Asano, M. Patrick, C. Rupprecht, and A. Vedaldi (2020) Labelling unlabelled videos from scratch with multi-modal self-supervision. arXiv preprint arXiv:2006.13662. Cited by: §1, §2, §4.4.
  • [6] Y. M. Asano, C. Rupprecht, and A. Vedaldi (2020) Self-labelling via simultaneous clustering and representation learning. Int. Conf. Learn. Represent.. Cited by: §2, §2, §4.6.
  • [7] Y. Aytar, C. Vondrick, and A. Torralba (2017) See, hear, and read: deep aligned representations. arXiv preprint arXiv:1706.00932. Cited by: §1.
  • [8] A. Boggust, K. Audhkhasi, D. Joshi, D. Harwath, S. Thomas, R. Feris, D. Gutfreund, Y. Zhang, A. Torralba, M. Picheny, et al. (2019) Grounding spoken words in unlabeled video. In CVPRW, Cited by: §2.
  • [9] P. Bojanowski, R. Lajugie, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic (2014) Weakly supervised action labeling in videos under ordering constraints. In European Conference on Computer Vision, pp. 628–643. Cited by: §4.3.
  • [10] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. In Eur. Conf. Comput. Vis., Cited by: §2, §3.2.
  • [11] M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882. Cited by: §1, §2, §4.6.
  • [12] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §2, §4.1.
  • [13] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §2.
  • [14] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. Cited by: §4.1.
  • [15] J. Dong, X. Li, C. Xu, X. Yang, G. Yang, X. Wang, and M. Wang (2021) Dual encoding for video retrieval by text. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.
  • [16] V. Gabeur, C. Sun, K. Alahari, and C. Schmid (2020) Multi-modal transformer for video retrieval. In European Conference on Computer Vision (ECCV), Vol. 5. Cited by: §2.
  • [17] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §1.
  • [18] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §1.
  • [19] K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6546–6555. Cited by: §4.1.
  • [20] D. Harwath, A. Recasens, D. Surís, G. Chuang, A. Torralba, and J. Glass (2018) Jointly discovering visual objects and spoken words from raw sensory input. In Proceedings of the European conference on computer vision (ECCV), pp. 649–665. Cited by: §1, §2, §4.1.
  • [21] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738. Cited by: §2.
  • [22] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.1.
  • [23] G. Ilharco, Y. Zhang, and J. Baldridge (2019) Large-scale representation learning from visually grounded untranscribed speech. In Proceedings of the Conference on Computational Natural Language Learning (CoNLL), pp. 55–65. Cited by: §3.1.
  • [24] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit (2017) One model to learn them all. arXiv preprint arXiv:1706.05137. Cited by: §1.
  • [25] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [26] H. Kuehne, A. Iqbal, A. Richard, and J. Gall (2019) Mining youtube-a dataset for learning fine-grained action concepts from webly supervised video data. arXiv preprint arXiv:1906.01012. Cited by: §1, §4.2, §4.2, §4.3, §4.4, Table 2.
  • [27] J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu (2021) Less is more: clipbert for video-and-language learning via sparse sampling. arXiv preprint arXiv:2102.06183. Cited by: §2.
  • [28] J. Li, P. Zhou, C. Xiong, R. Socher, and S. C. Hoi (2020) Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966. Cited by: §1.
  • [29] Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman (2019) Use what you have: video retrieval using representations from collaborative experts. arXiv preprint arXiv:1907.13487. Cited by: §2.
  • [30] H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, X. Chen, and M. Zhou (2020) Univilm: a unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353. Cited by: §2.
  • [31] A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889. Cited by: §1, §1, §2, §2, §3, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4.
  • [32] A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) Howto100m: learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE international conference on computer vision, pp. 2630–2640. Cited by: §1, §2, §4.1, §4.2, §4.2, §4.2, §4.4, §4.4, Table 1, Table 2, Table 3, Table 4.
  • [33] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §4.1.
  • [34] I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717. Cited by: §4.1.
  • [35] M. Patrick, P. Huang, Y. Asano, F. Metze, A. Hauptmann, J. Henriques, and A. Vedaldi (2020) Support-set bottlenecks for video-text representation learning. arXiv preprint arXiv:2010.02824. Cited by: §2, Table 1.
  • [36] A. Piergiovanni, A. Angelova, and M. S. Ryoo (2020) Evolving losses for unsupervised video representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 133–142. Cited by: §1.
  • [37] A. Richard, H. Kuehne, and J. Gall (2017) Weakly supervised action learning with rnn based fine-to-coarse modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 754–763. Cited by: §4.3.
  • [38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. Cited by: §2.
  • [39] R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze (2018) How2: a large-scale dataset for multimodal language understanding. In Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL), Cited by: §2.
  • [40] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) Videobert: a joint model for video and language representation learning. In ICCV, Cited by: §2.
  • [41] W. Van Gansbeke, S. Vandenhende, S. Georgoulis, M. Proesmans, and L. Van Gool (2020) Scan: learning to classify images without labels. ECCV. Cited by: §2.
  • [42] J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-VTT: a large video description dataset for bridging video and language. In CVPR, Cited by: §1, §4.2.
  • [43] X. Yan, I. Misra, A. Gupta, D. Ghadiyaram, and D. Mahajan (2020) ClusterFit: improving generalization of visual representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6509–6518. Cited by: §2.
  • [44] L. Zhou, X. Chenliang, and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In AAAI, Cited by: §1, §4.2.
  • [45] L. Zhu and Y. Yang (2020) ActBERT: learning global-local video-text representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8746–8755. Cited by: §2, §4.4, Table 1, Table 2.
  • [46] D. Zhukov, J. Alayrac, R. G. Cinbis, D. Fouhey, I. Laptev, and J. Sivic (2019) Cross-task weakly supervised learning from instructional videos. In CVPR, Cited by: §1, §4.2, §4.2, §4.3, §4.4, Table 2.