
Video-Text Representation Learning via Differentiable Weak Temporal Alignment

by   Dohwan Ko, et al.
Kakao Corp.
Korea University
HanYang University

Learning generic joint representations for video and text with supervised methods requires a prohibitively large amount of manually annotated video data. As a practical alternative, a large-scale but uncurated narrated video dataset, HowTo100M, has recently been introduced. However, it is still challenging to learn joint embeddings of video and text in a self-supervised manner because of the dataset's ambiguity and non-sequential alignment. In this paper, we propose a novel multi-modal self-supervised framework, Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS), to capture significant information from noisy and weakly correlated data using a variant of Dynamic Time Warping (DTW). We observe that the standard DTW inherently cannot handle weakly correlated data and only considers the globally optimal alignment path. To address these problems, we develop a differentiable DTW which also reflects local information with weak temporal alignment. Moreover, our proposed model applies a contrastive learning scheme to learn feature representations on weakly correlated data. Our extensive experiments demonstrate that VT-TWINS attains significant improvements in multi-modal representation learning and outperforms prior methods on various challenging downstream tasks. Code is available at



1 Introduction

Learning video-text representations is an important problem in computer vision. In recent years, it has drawn increasing attention due to the large amount of available video data and its various applications. Previous works [lin2014microsoft, zhou2018towards, wray2019fine] have achieved exciting results by learning mappings between video clips and texts, but they usually require a large amount of manual annotation, as in MSR-VTT [xu2016msr], DiDeMo [anne2017localizing], and EPIC-KITCHENS [damen2018scaling]. However, since labeling videos is expensive and time-consuming, this approach does not scale to the sufficiently large datasets that are essential for learning generic video-text representations readily applicable to a wide range of downstream tasks, including text-to-video or video-to-text retrieval [klein2015associating, wang2018learning, wang2016learning, yu2018joint], text-based action localization [anne2017localizing, cheron2018flexible], action segmentation [lea2016temporal, sigurdsson2017asynchronous], and video question answering [tapaswi2016movieqa, malinowski2015ask, yu2018joint]. Recent studies suggest that multi-modal self-supervised learning with a huge amount of data is a promising alternative to fully supervised methods [fernando2017self, xu2019self]. To this end, HowTo100M [miech2019howto100m] has been introduced, which is composed of 100 million pairs of video clips and captions from 1.22M narrated instructional videos.

HowTo100M is one of the largest video datasets, but it comes with several challenges. It is uncurated and its video-text pairs are weakly correlated, meaning that, given a video clip, the caption depicting the visual content may appear before or after the clip, or may not exist at all (Figure 1). To handle the weakly correlated video-text pairs, MIL-NCE [miech2020end] proposed multiple instance learning (MIL)-based contrastive learning that adopts the Noise Contrastive Estimation (NCE) loss [gutmann2010noise]. MIL-NCE treats the multiple captions that are temporally close to one clip as positive samples, allowing one-to-many correspondence. But this strong assumption often leads to suboptimal representation learning.

In this paper, to address the problem, we develop a new weak temporal alignment algorithm building upon Dynamic Time Warping (DTW) [sakoe1978dynamic]. In contrast to the standard DTW which is limited to sequential alignment, our proposed alignment algorithm allows flexibility by skipping irrelevant pairs and starting/ending at arbitrary time points. Also, it takes into account a globally optimal path as well as locally optimal paths by introducing local neighborhood smoothing. More importantly, our alignment algorithm is differentiable so we incorporate it into representation learning as a distance measure. We then propose a novel multi-modal self-supervised learning framework to learn a joint video and text embedding model named as Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS) that automatically handles the correspondence between noisy and weakly correlated captions and clips.

Our extensive experiments on five benchmark datasets demonstrate that our learned video and text representations generalize well on various downstream tasks including action recognition, text-to-video retrieval, and action step localization. Moreover, ablation studies and qualitative analysis show that our framework effectively aligns the noisy and weakly correlated multi-modal time-series data.

Our contributions are threefold:

  • We propose a novel self-supervised learning framework with differentiable weak temporal alignment that automatically handles the noisy and weakly correlated multi-modal time-series data.

  • We analyze the local neighborhood smoothing in our alignment algorithm, showing that, unlike DTW, the alignment takes into account locally optimal paths as well as the globally optimal path.

  • Our experiments show that the proposed method considerably improves joint representations of video and text and adapts well to various downstream tasks.

(a) Sequentially aligned pairs
(b) Non-sequentially aligned pairs
(c) Partially irrelevant pairs
(d) Entirely irrelevant pairs
Figure 1: Examples of the HowTo100M. The HowTo100M dataset contains narrated instructional videos and the clips and captions are weakly aligned. Each video is composed of several pairs of clips and captions. Shaded captions are irrelevant to any clips in the video.

2 Related Work

Self-Supervised Learning for Videos. Self-supervised learning approaches have received considerable attention because they do not require additional annotations during representation learning. Recently, several works have been proposed to learn video representations in a self-supervised manner. One research direction is to design video-specific pretext tasks, such as verifying temporal orders [lee2017unsupervised, fernando2017self, misra2016shuffle, xu2019self], predicting video rotation [jing2018self], solving jigsaw puzzles in a video [kim2019self], and dense predictive coding [han2019video]. Another line of research uses contrastive learning, which pulls clips from the same video together while pushing clips from different videos apart [sun2019learning, wang2020self, qian2021spatiotemporal, chen2020simple, chen2021exploring, grill2020bootstrap, he2020momentum]. In view of the multi-modality of videos, many works explore mutual supervision across modalities to learn representations of each modality. For example, they regard temporal or semantic consistency between videos and audio [korbar2018cooperative, chen2021multimodal] or narrations [miech2020end, alayrac2016unsupervised, miech2019howto100m, bain2021frozen] as a natural source of supervision. MIL-NCE [miech2020end] introduced contrastive learning to learn joint embeddings between clips and captions of unlabeled and uncurated narrated videos. Another line of work adopts an additional crossmodal encoder (e.g., a crossmodal transformer) to capture richer interaction between modalities [sun2019videobert, sun2019learning, zhu2020actbert, li2020hero, luo2020univl, ging2020coot]. In this paper, we focus on extending contrastive learning to temporally align two time-series modalities, i.e., clips and captions from videos, without any additional crossmodal encoders.

Sequence Alignment. Sequence alignment is crucial in fields that deal with time-series data because of the temporal information such data carries. In particular, the lack of manually annotated video datasets makes it harder to align clips and captions temporally. Dynamic Time Warping (DTW) [sakoe1978dynamic] measures the distance between two sequences under strong temporal constraints. [chang2021learning] uses global sequence alignment as a proxy task by relying on the DTW. [cuturi2017soft, hadji2021representation] extended the DTW for end-to-end learning with differentiable approximations of the discrete operations (e.g., the ‘min’ operator) in the DTW. Chang et al. [chang2019d3tw] proposed a frame-wise alignment loss using the DTW for weakly supervised action alignment in videos. Drop-DTW [dvornik2021drop] proposed a variant of the DTW algorithm which automatically drops outlier elements from the pairwise distance to handle noisy data. However, using the DTW alone can cause feature collapsing, in which all the feature embeddings concentrate at a single point. To address this problem, [chang2019d3tw] and [haresh2021learning] use a subsidiary regularization loss term with the DTW.

3 Preliminaries

We briefly summarize the basic concepts of dynamic time warping and the characteristics of an uncurated narrated video dataset HowTo100M.

3.1 Dynamic Time Warping (DTW)

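DTW [berndt1994using] finds an optimal alignment between two time-series data. Let $X$ and $Y$ denote two time-series data of length $n$ and $m$, i.e., $X = [x_1, \dots, x_n]^\top \in \mathbb{R}^{n \times d}$ and $Y = [y_1, \dots, y_m]^\top \in \mathbb{R}^{m \times d}$. DTW first computes a pairwise distance matrix $D(X, Y) := [\delta(x_i, y_j)]_{ij} \in \mathbb{R}^{n \times m}$ with a distance measure $\delta(\cdot, \cdot)$. Then, DTW optimizes the following:

$$ \mathrm{DTW}(X, Y) = \min_{A \in \mathcal{A}_{n,m}} \langle A, D(X, Y) \rangle, \tag{1} $$

where $\mathcal{A}_{n,m} \subset \{0, 1\}^{n \times m}$ is a set of (binary) alignment matrices. An alignment matrix $A \in \mathcal{A}_{n,m}$ represents a path that connects the $(1,1)$-th to the $(n,m)$-th entry of $D(X, Y)$ by three possible moves $\{\rightarrow, \downarrow, \searrow\}$.

To efficiently find an optimal path, DTW [berndt1994using] uses dynamic programming to recursively solve the following subproblems:

$$ r_{i,j} = d_{i,j} + \min\{r_{i-1,j},\, r_{i,j-1},\, r_{i-1,j-1}\}, \tag{2} $$

where $r_{i,j}$ is the $(i,j)$-th element of a cumulative cost matrix $R \in \mathbb{R}^{n \times m}$ of $D$. Therefore, $\mathrm{DTW}(X, Y)$ in (1) is equal to $r_{n,m}$, which is the accumulated cost that evaluates the similarity between two time-series data.

Soft-DTW [cuturi2017soft] proposed a differentiable variant of the DTW that replaces the non-differentiable operator ‘$\min$’ in (2) with the soft-min ‘$\min^{\gamma}$’ defined as:

$$ \min{}^{\gamma}\{a_1, \dots, a_k\} = -\gamma \log \sum_{i=1}^{k} e^{-a_i / \gamma}, \tag{3} $$

where $\gamma$ is a smoothing parameter. Then, the recurrence relation of Soft-DTW is given as:

$$ r_{i,j} = d_{i,j} + \min{}^{\gamma}\{r_{i-1,j},\, r_{i,j-1},\, r_{i-1,j-1}\}. \tag{4} $$

If $\gamma$ is zero, soft-min is identical to the $\min$ operator. As $\gamma$ increases, Soft-DTW$(X, Y)$ takes the cost of suboptimal paths into account more strongly.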
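The recurrence described in this subsection can be illustrated with a minimal NumPy sketch. It is only an illustration, not the authors' implementation: the function names `soft_min` and `soft_dtw` are hypothetical, the cumulative cost matrix is padded with an extra row and column of infinities to handle the boundary (an implementation choice assumed here), and a smoothing parameter of zero is treated as the hard minimum.

```python
import numpy as np


def soft_min(values, gamma):
    """Soft-min -gamma * log(sum(exp(-a / gamma))); hard min when gamma == 0."""
    values = np.asarray(values, dtype=float)
    if gamma == 0:
        return float(values.min())
    z = -values / gamma
    z_max = z.max()  # subtract the max for numerical stability
    return float(-gamma * (z_max + np.log(np.exp(z - z_max).sum())))


def soft_dtw(D, gamma=0.0):
    """Cumulative alignment cost for a pairwise distance matrix D of shape (n, m).

    gamma == 0 gives the standard DTW cost; gamma > 0 gives the Soft-DTW cost.
    """
    n, m = D.shape
    # Hypothetical boundary handling: pad with +inf so that every path
    # starts at the top-left entry of D.
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]]
            R[i, j] = D[i - 1, j - 1] + soft_min(prev, gamma)
    return R[n, m]


if __name__ == "__main__":
    D = np.array([[1.0, 2.0], [3.0, 1.0]])
    print(soft_dtw(D, gamma=0.0))  # hard DTW cost of the diagonal path
    print(soft_dtw(D, gamma=1.0))  # smoothed cost, never larger than the hard one
```

Because the soft-min of several finite values is strictly smaller than their minimum, the smoothed cost is a lower bound of the hard DTW cost and moves further away from it as the smoothing parameter grows.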

3.2 The HowTo100M Dataset

The HowTo100M dataset [miech2019howto100m] is a large-scale dataset that contains 136M video clips with paired captions from 1.22M narrated instructional videos across 23K different visual tasks. A video has 110 clip-caption pairs with an average duration of 4 seconds. The captions are narrations transcribed automatically via automatic speech recognition (ASR). Learning joint video-text embeddings with HowTo100M has two sources of difficulty: ‘uncurated narrations’ and ‘weak correlation’ between clip-caption pairs. As discussed in [miech2020end], the narrations transcribed by ASR are potentially erroneous, and the colloquial language does not form complete or grammatically correct sentences. In addition, due to the weak correlation between the paired clips and captions, computing the optimal correspondence to learn a joint embedding entails addressing the following challenges, which are the main focus of this paper.

Ambiguity. As mentioned above, the average duration of a clip-caption pair is 4 seconds. Since short clips are sampled densely in one video, consecutive clips are often semantically similar, i.e., clip-caption alignments are inherently ambiguous. It is therefore more beneficial to use algorithms that take multiple alignments into account, allowing many-to-many correspondence, than algorithms that consider only one optimal path, such as the standard DTW.

Irrelevant pairs. The paired clips and captions may contain irrelevant content for several reasons. People might skip demonstrating some steps when the narration is clear enough, or vice versa. In Figure 1(c), since the narration “select the correct program” is clear enough, no demonstration is given in the corresponding clip. In addition, some videos have entirely irrelevant clips and captions, as in Figure 1(d). When learning joint video-text embeddings, these irrelevant pairs should be properly handled.

Non-sequential alignment. Although videos and texts are correlated overall at the video level, the paired clips and captions are often not temporally well aligned. For instance, people in a video describe plans before demonstrations or explain details after actions, i.e., captions may come with temporal shifts. To estimate the correspondence between clips and captions, they can be aligned without changing the order of elements in each modality, as in Figure 1(a); this is called sequential alignment. In contrast, when the order of elements in a modality is partially reversed, or the content of a clip/caption is arbitrarily interspersed in the other modality, non-sequential alignments are required to compute the optimal correspondence. We observe that non-sequential alignments often occur when videos have long sequences of captions and clips, as in Figure 1(b). We address these challenges with a new learning strategy.

Figure 2: Overall Architecture. We propose a multi-modal self-supervised learning framework, VT-TWINS, to learn joint embeddings of video and text from noisy and weakly correlated data. The video and text encoders first extract feature embeddings from the input clips and input captions, respectively. Then, we present a new alignment algorithm based on the DTW, called S2DTW, which can handle weakly correlated data with local neighborhood smoothing (Section 4.1) and weak alignment (Section 4.2). We also apply temporal data augmentation (Section 4.3) to learn from the non-sequentially aligned data with the S2DTW. We finally employ a contrastive learning scheme, which uses the S2DTW as a distance measure between the clip-caption pairs, with negative pairs for representation learning while preventing feature collapsing (Section 4.4).

4 Method

In this section, we present a novel multi-modal self-supervised framework, named as Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS), to learn joint embeddings of video and text from uncurated narrated videos. To address the problems mentioned above and estimate more accurate correspondence, we propose a new differentiable variant of DTW, called Locally Smoothed Soft-DTW with Weak Alignment (S2DTW). First, we apply local neighborhood smoothing and weak alignment. We then adopt temporal data augmentation for non-sequential alignments that the standard DTW cannot inherently handle. We finally apply a contrastive learning scheme and present VT-TWINS for representation learning without feature collapsing. Figure 2 and Algorithm 1 show our overall algorithm VT-TWINS including S2DTW.

4.1 Local Neighborhood Smoothing

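To address the ambiguity mentioned in Section 3.2, we smooth the pairwise distance matrix $D$ as:

$$ s_{i,j} = d_{i,j} + \min{}^{\gamma}\{d_{i-1,j},\, d_{i,j-1},\, d_{i-1,j-1}\}, \tag{5} $$

where $s_{i,j}$ and $d_{i,j}$ are the $(i,j)$-th elements of $S(D)$ and $D$, respectively. This allows many-to-many correspondence and encourages the alignment algorithm to focus more on a locally optimal clip (or caption), which has relatively smaller distances to others within a small neighborhood. $s_{i,j}$ can be viewed as $d_{i,j}$ smoothed with its previous elements $d_{i-1,j}$, $d_{i,j-1}$, and $d_{i-1,j-1}$. Then, similar to (4), we apply dynamic programming to compute the optimal cost from the smoothed distance matrix $S(D)$ instead of $D$ as follows:

$$ r_{i,j} = s_{i,j} + \min{}^{\gamma}\{r_{i-1,j},\, r_{i,j-1},\, r_{i-1,j-1}\}. \tag{6} $$

S2DTW decays the cost of older matches and reflects more recent elements, since (6) accumulates the cost sequentially from the top-left element to the bottom-right element. Roughly speaking, the proposed S2DTW considers local optimality by (5) as well as global optimality by (6), since substituting (5) into (6) lets S2DTW be rewritten as:

$$ r_{i,j} = d_{i,j} + \min{}^{\gamma}\{d_{i-1,j},\, d_{i,j-1},\, d_{i-1,j-1}\} + \min{}^{\gamma}\{r_{i-1,j},\, r_{i,j-1},\, r_{i-1,j-1}\}. \tag{7} $$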


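Differentiation. We compare Soft-DTW [cuturi2017soft] and S2DTW via their derivatives. For Soft-DTW, the authors denote a gradient matrix $E := [e_{i,j}]$, where $e_{i,j} := \partial r_{n,m} / \partial r_{i,j} = \partial r_{n,m} / \partial d_{i,j}$, obtained by differentiating (4) w.r.t. $d_{i,j}$. In the S2DTW case, however, $\partial r_{n,m} / \partial r_{i,j} \neq \partial r_{n,m} / \partial d_{i,j}$ due to the local neighborhood smoothing layer, i.e., $S(D)$. We therefore redefine $e_{i,j} := \partial r_{n,m} / \partial s_{i,j}$ and additionally denote $G := [g_{i,j}]$ with $g_{i,j} := \partial r_{n,m} / \partial d_{i,j}$ as the gradient matrix for the local neighborhood smoothing layer. $e_{i,j}$ of S2DTW is calculated as follows:

$$ e_{i,j} = \frac{\partial r_{i+1,j}}{\partial r_{i,j}}\, e_{i+1,j} + \frac{\partial r_{i,j+1}}{\partial r_{i,j}}\, e_{i,j+1} + \frac{\partial r_{i+1,j+1}}{\partial r_{i,j}}\, e_{i+1,j+1}. \tag{8} $$

By differentiating (6) with $s_{i,j}$ instead of $d_{i,j}$, the first factor of (8) is calculated as:

$$ \frac{\partial r_{i+1,j}}{\partial r_{i,j}} = \frac{e^{-r_{i,j}/\gamma}}{e^{-r_{i,j}/\gamma} + e^{-r_{i+1,j-1}/\gamma} + e^{-r_{i,j-1}/\gamma}}. \tag{9} $$

After calculating $e_{i,j}$ in (8), $g_{i,j}$ is calculated as:

$$ g_{i,j} = e_{i,j} + \frac{\partial s_{i+1,j}}{\partial d_{i,j}}\, e_{i+1,j} + \frac{\partial s_{i,j+1}}{\partial d_{i,j}}\, e_{i,j+1} + \frac{\partial s_{i+1,j+1}}{\partial d_{i,j}}\, e_{i+1,j+1}. \tag{10} $$

In (10), $\partial s_{i+1,j} / \partial d_{i,j} = \partial \min^{\gamma}\{d_{i,j}, d_{i+1,j-1}, d_{i,j-1}\} / \partial d_{i,j}$, since $s_{i+1,j} = d_{i+1,j} + \min^{\gamma}\{d_{i,j}, d_{i+1,j-1}, d_{i,j-1}\}$. Similar to (9), it is written as:

$$ \frac{\partial s_{i+1,j}}{\partial d_{i,j}} = \frac{e^{-d_{i,j}/\gamma}}{e^{-d_{i,j}/\gamma} + e^{-d_{i+1,j-1}/\gamma} + e^{-d_{i,j-1}/\gamma}}, \tag{11} $$

and the other two factors are calculated in the same way. Like (9), which measures how minimal $r_{i,j}$ is among three directions, (11) measures how minimal $d_{i,j}$ is among three directions. Hence, (8) aggregates global optimal path information and (10) aggregates local optimal path information, due to the cumulative costs $r$ in the former and the pairwise distances $d$ in the latter. Unlike S2DTW, Soft-DTW only requires calculating the matrix $E$ by (8) with $d_{i,j}$ instead of $s_{i,j}$, and therefore does not consider the local optimality. Figure 3 depicts the forward and backward propagation of S2DTW.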
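The forward pass of the locally smoothed alignment can be sketched in NumPy as follows. This is a hedged illustration rather than the authors' code: it assumes that the smoothed distance is the pairwise distance plus the soft-min of its three previous elements (up, left, up-left), that the cumulative cost is then accumulated over the smoothed distances with the same soft-min recurrence, and that out-of-range neighbors are ignored at the boundary. The function names are hypothetical, and weak alignment with dummy elements is not included here.

```python
import numpy as np


def soft_min(values, gamma):
    """Soft-min -gamma * log(sum(exp(-a / gamma))); hard min when gamma == 0."""
    values = np.asarray(values, dtype=float)
    if gamma == 0:
        return float(values.min())
    z = -values / gamma
    z_max = z.max()
    return float(-gamma * (z_max + np.log(np.exp(z - z_max).sum())))


def smooth_distances(D, gamma):
    """Local neighborhood smoothing: add the soft-min of the previous elements."""
    n, m = D.shape
    padded = np.full((n + 1, m + 1), np.inf)  # +inf marks missing neighbors
    padded[1:, 1:] = D
    S = np.empty((n, m))
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                S[i, j] = D[i, j]  # assumption: nothing to smooth with
                continue
            # D[i-1, j], D[i, j-1], D[i-1, j-1] in padded coordinates
            prev = [padded[i, j + 1], padded[i + 1, j], padded[i, j]]
            S[i, j] = D[i, j] + soft_min(prev, gamma)
    return S


def cumulative_cost(C, gamma):
    """Soft-min dynamic program over a cost matrix C; returns the final cost."""
    n, m = C.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = [R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]]
            R[i, j] = C[i - 1, j - 1] + soft_min(prev, gamma)
    return R[n, m]


def s2dtw_cost(D, gamma):
    """Smooth the distances first, then accumulate them along the alignment."""
    return cumulative_cost(smooth_distances(D, gamma), gamma)
```

In an automatic differentiation framework the two gradient matrices discussed above would be obtained by backpropagating through these two loops; the explicit NumPy version only makes the order of the two stages visible.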

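Inputs: clips, captions
Parameters: smoothing parameter, dummy elements

# Temporal Data Augmentation
    sample shuffled clips and captions from the distribution in (12)
# Apply Contrastive Learning Scheme
    compute the loss in (13), using S2DTW as the distance measure
# S2DTW
function S2DTW(clip embeddings, caption embeddings)
    compute the pairwise distance matrix
    # Local Neighborhood Smoothing
    for each entry (i, j) do
        compute the smoothed distance by (5)
    end for
    # Weak Alignment
    apply weak alignment with the dummy elements (Section 4.2)
    # Calculate DTW
    for each entry (i, j) do
        compute the cumulative cost by (6)
    end for
    return the final cumulative cost
end function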


Algorithm 1 VT-TWINS Algorithm with S2DTW

4.2 Weak Alignment

We further modify the Soft-DTW so that its path is not forced to align irrelevant pairs (Figures 1(c) and 1(d)). In addition, our S2DTW can start from (or end at) an arbitrary point. Adopting the trick in DWSA [shen2021learning] for one-to-one matching with skipping, we achieve weak alignment by inserting dummy elements $\phi$ in the intervals (and at both ends) of the clip and caption sequences (e.g., $[x_1, x_2, x_3]$ becomes $[\phi, x_1, \phi, x_2, \phi, x_3, \phi]$).

In S2DTW, the pairwise distance matrix with dummy elements is $D_\phi \in \mathbb{R}^{(2n+1) \times (2m+1)}$, which has the dummy distance $d_\phi$ at every pair that includes $\phi$. $d_\phi$ is a hyperparameter that can be interpreted as a threshold: calculating the DTW with dummy elements leads the DTW path to pass only through pairs whose distance is smaller than $d_\phi$. Unlike the standard DTW or Soft-DTW, which forcibly align at least one pair per timestamp, our proposed S2DTW weakly aligns the irrelevant clip-caption pairs and even enables many-to-many matchings, which cannot be handled by DWSA. Figures 4(a) and 4(b) show the pairwise distance before and after adding dummy elements. This weak alignment step is followed by the local neighborhood smoothing. As a result, the final pairwise distance is the smoothed matrix $S(D_\phi)$, which is used to calculate the DTW.
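A small sketch of the dummy-element construction may help make the matrix layout concrete. It assumes, as described above, that a dummy element is placed before, between, and after the real elements of each sequence, so an n-by-m distance matrix grows to (2n+1)-by-(2m+1), and that every pair involving a dummy receives the same dummy distance. The function name and the example threshold are hypothetical.

```python
import numpy as np


def add_dummy_elements(D, dummy_distance):
    """Interleave dummy rows and columns into a pairwise distance matrix.

    D has shape (n, m). The result has shape (2n + 1, 2m + 1): real pairs sit
    at odd (row, column) indices, and every pair that involves a dummy element
    is assigned the constant dummy distance.
    """
    n, m = D.shape
    D_dummy = np.full((2 * n + 1, 2 * m + 1), float(dummy_distance))
    D_dummy[1::2, 1::2] = D  # keep the original distances for real pairs
    return D_dummy


if __name__ == "__main__":
    D = np.array([[0.1, 0.9],
                  [0.8, 0.2]])
    print(add_dummy_elements(D, dummy_distance=0.5))
```

With this layout, an alignment path can step through a dummy row or column instead of a real pair whenever the real distance exceeds the dummy distance, which is how the threshold interpretation arises.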

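Figure 3: Forward and Backward of Local Neighborhood Smoothing in terms of the pairwise distance $d_{i,j}$. At forward propagation, we first compute the smoothed distance $s_{i,j}$ by smoothing $d_{i,j}$ with $d_{i-1,j}$, $d_{i,j-1}$, and $d_{i-1,j-1}$ in (5). Then, the cumulative cost $r_{i,j}$ is calculated with the smoothed $s_{i,j}$ and $r_{i-1,j}$, $r_{i,j-1}$, $r_{i-1,j-1}$ by (6). At backward propagation, the gradient $e_{i,j}$ w.r.t. the smoothed distance is calculated by (8). It gains the gradient from three directions in proportion to how optimal the cumulative cost of each direction is. Then, the gradient $g_{i,j}$ w.r.t. the pairwise distance is calculated from $e_{i+1,j}$, $e_{i,j+1}$, and $e_{i+1,j+1}$ in proportion to how optimal each pairwise cost is in (10).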

4.3 Temporal Data Augmentation

As discussed in Section 3.2, videos often have non-sequential alignments, but the standard DTW cannot resolve them since it allows only three moves $\{\rightarrow, \downarrow, \searrow\}$. To address this problem, we propose a simple data augmentation that temporally shuffles clips and captions. Let $\sigma$ denote a permutation; the clip sequence $X$ permuted by $\sigma$ is then $X_\sigma$. To avoid augmentations that are temporally or semantically too extreme, we consider a subset of possible permutations. We first leave out the cases in which a clip is temporally shifted beyond a time window: for a given window size, the index of the $i$-th clip after a permutation must stay within the window around $i$. The set of permutations that satisfies this temporal constraint is denoted as $\Pi$. Given the temporal constraint, we propose the target distribution as follows:

$$ p(\sigma) = \operatorname{softmax}_{\sigma \in \Pi}\left( -\frac{\lVert X X^\top - X_\sigma X_\sigma^\top \rVert}{\tau} \right), \tag{12} $$

where the softmax is computed over all permutations in $\Pi$ and $\tau$ is a temperature parameter. $X X^\top$ and $X_\sigma X_\sigma^\top$ are the self-similarity matrices before and after permutation. The proposed target distribution is more likely to generate a permutation that changes the self-similarity structure less. In other words, the proposed augmentation is less likely to generate semantically overly strong augmentations that hinder representation learning. Then, the temporally augmented clip sequence, i.e., a shuffled sequence of clips, is sampled from the distribution defined in (12). The captions are augmented in the same way, and finally we calculate the pairwise distance matrix as the input for alignment (e.g., DTW). For simplicity of implementation, each modality is shuffled independently.

Our temporal augmentation encourages learning features that are invariant under permutation and allows minimizing the distance between clips and captions that cannot be aligned by sequential alignment algorithms such as the standard DTW. This is helpful for learning representations when the clips and captions are non-sequentially aligned, as in Figure 1(b).
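The sampling procedure can be sketched for short sequences as follows. Several details are assumptions made only for illustration: the temporal constraint is taken to mean that each element moves by at most `window` positions, the change in self-similarity is measured with the Frobenius norm, the self-similarity matrix is the Gram matrix of the embeddings, and permutations are enumerated exhaustively, which is feasible only for small sequence lengths. All function names are hypothetical.

```python
import itertools

import numpy as np


def windowed_permutations(n, window):
    """All permutations of range(n) that move every index by at most `window`."""
    return [
        p
        for p in itertools.permutations(range(n))
        if all(abs(p[i] - i) <= window for i in range(n))
    ]


def permutation_distribution(X, window, temperature):
    """Softmax over allowed permutations, favoring small self-similarity change."""
    perms = windowed_permutations(len(X), window)
    sim = X @ X.T  # self-similarity before permutation
    # sim[np.ix_(p, p)] is the self-similarity of the permuted sequence X[p].
    scores = np.array(
        [-np.linalg.norm(sim - sim[np.ix_(p, p)]) for p in perms]
    ) / temperature
    scores -= scores.max()  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return perms, probs


def sample_permutation(X, window, temperature, rng):
    """Draw one permutation and return it together with the shuffled sequence."""
    perms, probs = permutation_distribution(X, window, temperature)
    perm = perms[rng.choice(len(perms), p=probs)]
    return perm, X[list(perm)]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    perm, X_shuffled = sample_permutation(X, window=1, temperature=1.0, rng=rng)
    print(perm)
```

In this toy input the first two clips are identical, so swapping them leaves the self-similarity matrix unchanged and receives the same probability as the identity, while swapping the last two clips is penalized.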

(a) Original Pair-wise Distance
(b) Pair-wise Distance with Dummy
Figure 4: Illustration of Weak Alignment. (a): The path on the original pairwise distance without dummy elements has to pass through pairs whose clip and caption are irrelevant to each other, e.g., a caption that is irrelevant to all clips. (b): On the other hand, with dummy elements, a pair whose distance is bigger than the dummy distance can be skipped.

4.4 Contrastive Learning with S2DTW

With S2DTW, we perform representation learning in a self-supervised manner. S2DTW can be used as a distance measure between clips and captions. Minimizing the distance between two samples without negative pairs causes feature collapsing. Hence, to address this problem, we adopt a well-known contrastive loss, the InfoNCE loss [oord2018representation]. Our final loss is defined as:

$$ \mathcal{L} = -\sum_{i} \log \frac{\exp\big(-\mathrm{S2DTW}(X_i, Y_i)\big)}{\exp\big(-\mathrm{S2DTW}(X_i, Y_i)\big) + \sum_{(X', Y') \in \mathcal{N}_i} \exp\big(-\mathrm{S2DTW}(X', Y')\big)}, \tag{13} $$

where $X_i$ and $Y_i$ are the clips and captions of the $i$-th video and $\mathcal{N}_i$ is a set of negative samples of the $i$-th video in the mini-batch. This formulation also implicitly mines hard negatives. At the clip-caption level, due to the nature of the DTW, a clip-caption pair in the negative samples that has a smaller distance receives a stronger negative signal to be pushed away than the other pairs in the negative samples. Therefore, unlike in the baseline [kalantidis2020hard], no additional hard negative mining strategy (e.g., [he2020momentum]) was used for the proposed method. Further discussions with qualitative results are in the appendix.
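The role of the negatives can be illustrated with a short sketch of an InfoNCE-style loss computed from alignment distances. It assumes that the score of a pair is the negative of its distance, that no temperature is applied, and that the distances have already been computed by some alignment function; the function name is hypothetical and the sketch covers a single video rather than a mini-batch.

```python
import numpy as np


def info_nce_from_distances(pos_dist, neg_dists):
    """InfoNCE-style loss for one video, given alignment distances.

    pos_dist:  distance between the clips and captions of the same video.
    neg_dists: distances of the negative pairs drawn from the mini-batch.
    The loss is -log(exp(-pos) / (exp(-pos) + sum(exp(-neg)))).
    """
    logits = -np.concatenate(([pos_dist], np.asarray(neg_dists, dtype=float)))
    shift = logits.max()  # log-sum-exp trick for numerical stability
    log_denominator = shift + np.log(np.exp(logits - shift).sum())
    return float(log_denominator - logits[0])


if __name__ == "__main__":
    # Negatives that are far away contribute little to the loss ...
    print(info_nce_from_distances(1.0, [5.0, 5.0]))
    # ... while negatives as close as the positive are penalized strongly.
    print(info_nce_from_distances(1.0, [1.0, 1.0]))
```

Because the denominator grows when a negative pair has a small distance, close negatives dominate the gradient, which is the implicit hard negative mining effect described above.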

5 Experiments

In this section, we evaluate the performance on various downstream tasks by applying our pretrained feature embeddings (Section 5.1). We also describe ablation studies on the effect of each component introduced in Section 4 and finally analyze qualitative results of each component in terms of the DTW path (Section 5.2). All downstream tasks and ablation studies except for the action recognition task are conducted in the zero-shot setting to evaluate only the quality of the learned representations. For the action recognition task, we adopt the widely used linear evaluation protocol, which trains a linear classifier on top of the frozen representation. The experimental setup and further ablation studies are in the appendix.

5.1 Downstream Tasks

5.1.1 Action Recognition

We first evaluate the learned video representation, without using the text representation, on the action recognition task, whose goal is to distinguish video-level actions. In Table 1, we compare the proposed method with other self-supervised methods. Under the linear evaluation protocol, our VT-TWINS outperforms all self-supervised learning methods, including the baselines that performed fine-tuning, denoted by (Frozen ✗), such as CBT [sun2019learning] and 3DRotNet [jing2018self]. This result shows that our method improves the generality of video representations. On HMDB in particular, VT-TWINS improves over MIL-NCE with the same backbone model (S3D) by 4.8 points (57.9 vs. 53.1).

5.1.2 Video and Text Retrieval

We evaluate the effectiveness of the joint representation of video and text by applying text-to-video and video-to-text retrieval tasks, which aim to find a corresponding clip (caption) given a query caption (clip).

| Method | Dataset | MM | Model | Frozen | HMDB | UCF |
|---|---|---|---|---|---|---|
| OPN [alayrac2016unsupervised] | UCF | | VGG | | 23.8 | 59.6 |
| Shuffle & Learn [misra2016shuffle]* | K600 | | S3D | | 35.8 | 68.7 |
| Wang et al. [wang2019self] | K400 | Flow | C3D | | 33.4 | 61.2 |
| CMC [tian2020contrastive] | UCF | Flow | CaffeNet | | 26.7 | 59.1 |
| Geometry [gan2018geometry] | UCF | Flow | CaffeNet | | 26.7 | 59.1 |
| Fernando et al. [fernando2017self] | UCF | | AlexNet | | 32.5 | 60.3 |
| ClipOrder [xu2019self] | UCF | | R(2+1)D | | 30.9 | 72.4 |
| 3DRotNet [jing2018self]* | K600 | | S3D | | 40.0 | 75.3 |
| DPC [han2019video] | K400 | | 3D-R34 | | 35.7 | 75.7 |
| 3D ST-puzzle [kim2019self] | K400 | | 3D-R18 | | 33.7 | 65.8 |
| CBT [sun2019learning] | K600 | | S3D | | 29.5 | 54.0 |
| CBT [sun2019learning] | K600 | | S3D | | 44.6 | 79.5 |
| AVTS [korbar2018cooperative] | K600 | Audio | I3D | | 53.0 | 83.7 |
| MIL-NCE [miech2020end] | HTM | Text | I3D | | 54.8 | 83.4 |
| MIL-NCE [miech2020end] | HTM | Text | S3D | | 53.1 | 82.7 |
| VT-TWINS | HTM | Text | S3D | | 57.9 | 85 |
| S3D (supervised learning) [xie2018rethinking] | | | S3D | | 75.9 | 96.8 |

Table 1: Action Recognition. Shuffle & Learn* and 3DRotNet* are reimplemented by [sun2019learning] with S3D.

| Method | Labeled Dataset | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|---|
| Random Init | None | 0.03 | 0.15 | 0.3 | 1675 |
| HGLMM FC CCA [klein2015associating] | IM, K400, YC2 | 4.6 | 14.3 | 21.6 | 75 |
| Miech et al. [miech2019howto100m] | IM, K400 | 6.1 | 17.3 | 24.8 | 46 |
| Miech et al. [miech2019howto100m] | IM, K400, YC2 | 8.2 | 24.5 | 35.3 | 24 |
| COOT [ging2020coot] | YC2 | 5.9 | 16.7 | 24.8 | 49.7 |
| ActBERT [zhu2020actbert] | YC2 | 9.6 | 26.7 | 38.0 | 19 |
| MIL-NCE [miech2020end] | None | 8.8 | 24.3 | 34.6 | 23 |
| VT-TWINS | None | 9.7 | 27 | 38.8 | 19 |

Table 2: Text-to-Video Retrieval on YouCook2.

Text-to-video retrieval. Tables 2 and 3 show the performance of text-to-video retrieval on the YouCook2 and MSR-VTT datasets. For a fair comparison with MIL-NCE, we trained our model on the HowTo100M dataset and evaluated it on the test sets without any additional supervision. Table 2 shows that our VT-TWINS outperforms MIL-NCE and even other methods (e.g., COOT and ActBERT) that are fine-tuned on YouCook2 (denoted as YC2). Similarly, on the MSR-VTT dataset, Table 3 shows that the proposed method outperforms several multi-modal self-supervised methods trained on HowTo100M (MIL-NCE, Amrani et al., SSB). In addition, our method is better than or on par with ActBERT, which is fine-tuned on the target dataset MSR-VTT.

| Method | Labeled Dataset | R@1 | R@5 | R@10 | MedR |
|---|---|---|---|---|---|
| Random Init | None | 0.01 | 0.05 | 0.1 | 500 |
| Miech et al. [miech2019howto100m] | IM, K400 | 7.5 | 21.2 | 29.6 | 38 |
| Amrani et al. [amrani2020noise] | None | 8.0 | 21.3 | 29.3 | 33 |
| SSB [patrick2020support] | None | 8.7 | 23.0 | 31.1 | 31.0 |
| ActBERT [zhu2020actbert] | MSRVTT | 8.6 | 23.4 | 33.1 | 36 |
| MIL-NCE [miech2020end] | None | 8.2 | 21.5 | 29.5 | 40 |
| VT-TWINS | None | 9.4 | 23.4 | 31.6 | 32 |

Table 3: Text-to-Video Retrieval on MSRVTT.

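| Method | YouCook2 R@1 | YouCook2 R@5 | YouCook2 R@10 | YouCook2 MedR | MSRVTT R@1 | MSRVTT R@5 | MSRVTT R@10 | MSRVTT MedR |
|---|---|---|---|---|---|---|---|---|
| Random Init | 0.03 | 0.13 | 0.26 | 1717.5 | 0.1 | 0.49 | 0.98 | 499.5 |
| MIL-NCE* [miech2020end] | 9.35 | 26.22 | 37.36 | 22 | 8.9 | 20.65 | 27.2 | 46 |
| VT-TWINS | 9.7 | 28 | 40.3 | 16 | 9.1 | 22.9 | 29.1 | 43 |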

Table 4: Video-to-Text retrieval. * is our reproduction of official code of the MIL-NCE.

| Method | Labeled Dataset | CTR |
|---|---|---|
| Alayrac et al. [alayrac2016unsupervised] | IM, K400 | 13.3 |
| CrossTask [zhukov2019cross] | IM, K400 | 22.4 |
| CrossTask [zhukov2019cross] | IM, K400, CT | 31.6 |
| Miech et al. [miech2019howto100m] | IM, K400 | 33.6 |
| DWSA [shen2021learning] | CT | 35.5 |
| ActBERT [zhu2020actbert] | CT | 37.1 |
| MIL-NCE [miech2020end] | None | 35.5 |
| VT-TWINS | None | 40.7 |

Table 5: Action Step Localization on CrossTask.

Video-to-text retrieval. We also compare the performance of video-to-text retrieval with MIL-NCE. Table 4 shows that our VT-TWINS outperforms MIL-NCE on both YouCook2 and MSR-VTT. Note that MIL-NCE blindly and equally treats all the captions in a time window around a query clip as positives. We believe that this assumption often does not hold and that learning with inaccurate clip-caption pairs may hinder learning representations that precisely associate clips and captions.
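For readers unfamiliar with the retrieval metrics reported in Tables 2 to 4, the following sketch shows one common way to compute recall at K (R@K) and the median rank (MedR) from a query-by-candidate similarity matrix. It assumes that the correct candidate for query i is candidate i (the diagonal) and that ties are resolved in favor of the correct candidate; the function name is hypothetical and this is not the evaluation code used in the paper.

```python
import numpy as np


def retrieval_metrics(similarity, ks=(1, 5, 10)):
    """Recall@K (in percent) and median rank for a square similarity matrix.

    similarity[i, j] is the score between query i and candidate j, and the
    ground-truth match of query i is assumed to be candidate i.
    """
    sim = np.asarray(similarity, dtype=float)
    correct = np.diag(sim)[:, None]
    # Rank of the correct candidate: 1 + number of candidates scored higher.
    ranks = 1 + (sim > correct).sum(axis=1)
    metrics = {f"R@{k}": float((ranks <= k).mean() * 100) for k in ks}
    metrics["MedR"] = float(np.median(ranks))
    return metrics


if __name__ == "__main__":
    sim = np.array([[0.9, 0.1, 0.0],
                    [0.8, 0.2, 0.1],
                    [0.1, 0.3, 0.2]])
    print(retrieval_metrics(sim, ks=(1, 2)))
```

Higher R@K is better, whereas a lower MedR is better; for text-to-video retrieval the rows are captions and the columns are clips, and the roles are swapped for video-to-text retrieval.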

5.1.3 Action Step Localization

We also evaluate the representations learned by our method on the action step localization task on the CrossTask dataset. We adopt the zero-shot evaluation suggested in [miech2019howto100m]. Table 5 shows that VT-TWINS significantly outperforms the baselines, achieving a CrossTask average recall (CTR) of 40.7%. This surpasses MIL-NCE (35.5%) and even the models that are trained on the CrossTask dataset, such as DWSA (35.5%) and ActBERT (37.1%).

Figure 5: Results of Weak Alignment. The leftmost matrix is the pairwise distance matrix. The Soft-DTW path and the S2DTW path matrices are the gradient matrices defined in Section 4.1. The top and bottom rows show the partially and entirely irrelevant pairs, respectively.

5.2 Ablation Study and Qualitative Analysis

5.2.1 Temporal Data Augmentation

As explained in Section 4.3, the proposed augmentation is less likely to generate a permutation that is significantly different from the original sequence. To evaluate the effectiveness of our temporal data augmentation, we compare it with two other strategies: one samples from the uniform distribution, and the other samples from the inverse of our distribution, i.e., it assigns a higher probability to permutations that are semantically dissimilar to the original sequence. Rows (2), (3), and (4) in Table 6 demonstrate that temporal shuffling that maintains semantic information helps to learn feature representations on weakly correlated data with non-sequential alignments. The gap is especially substantial in the tasks that use joint embedding representations (YouCook2, MSR-VTT, and CrossTask in Table 6): because strong augmentation severely harms semantic information, it becomes difficult to learn representations that are aligned between clips and captions.

5.2.2 Weak Alignment

The top row of Figure 5 shows the case of partially irrelevant pairs; the pairwise distance matrix on top shows that the fifth caption has a consistently large distance from the clips (see also Figure 1(c) as an illustrative example). In this case, the Soft-DTW is forced to align one or more pairs per timestamp. On the other hand, with S2DTW the unrelated pairs are only weakly aligned because S2DTW skips them appropriately. Moreover, the Soft-DTW has another problem: it is forced to align the start point (1,1) and the end point (n,m). Unlike the Soft-DTW, we observe that S2DTW can ignore the start point and the end point.

The Soft-DTW also finds a temporal alignment path even in entirely uncorrelated data, as in the case of Figure 1(d). The bottom row of Figure 5 illustrates that most elements of the pairwise distance are greater than zero (the leftmost matrix), i.e., the clips and captions are almost entirely irrelevant. A path is clearly drawn by the Soft-DTW, whereas in S2DTW most elements receive little gradient thanks to weak alignment. Rows (2) and (5) of Table 6 show that the weak alignment of S2DTW improves the performance on weakly correlated data by ignoring irrelevant pairs.

5.2.3 Local Neighborhood Smoothing

We also evaluate the effectiveness of local neighborhood smoothing. As mentioned in Section 4.1, local neighborhood smoothing can reflect local optimal path as well as global optimal path. (5) and (6) in Table 6 show that local neighborhood smoothing complements the DTW and improves the performance.

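| | TA | WA | LS | HMDB | UCF | YC2 | MV | CT |
|---|---|---|---|---|---|---|---|---|
| (1) | - | - | - | 38.9 | 68.6 | 8.7 | 12.7 | 22.9 |
| (2) | A | - | - | 39.4 | 69.3 | 9.6 | 13.6 | 23.5 |
| (3) | B | - | - | 36 | 68.5 | 5 | 10.5 | 17.4 |
| (4) | C | - | - | 36.9 | 68 | 4.9 | 11.5 | 16.8 |
| (5) | A | ✓ | - | 39.1 | 70.6 | 10.6 | 14.7 | 26.9 |
| (6) | A | ✓ | ✓ | 42 | 72.1 | 12.5 | 17.4 | 28.2 |
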
Table 6: Ablation Studies. We report accuracy on the HMDB and UCF, R@10 on the YouCook2 (YC2) and MSR-VTT (MV), and CTR on the CrossTask (CT) to evaluate the contribution of the followings: temporal data augmentation (TA), weak alignment (WA), and local neighborhood smoothing (LS). Each element is applied to the standard DTW with a contrastive learning scheme. For TA, we evaluate the following strategies: A: suppressing semantically strong permutation (ours), B: random permutation, and C: encouraging semantically strong permutation (inverse of ours).

6 Conclusion

We have presented a novel multi-modal self-supervised learning framework for learning joint embeddings of video and text from uncurated narrated videos. To address the challenges of weakly correlated video and caption pairs, our framework VT-TWINS first aligns the clips and captions by the proposed weak alignment algorithm and learns representations via contrastive learning. Our experiments on a wide range of three tasks over five benchmark datasets demonstrate that the proposed method significantly improves the generality of joint embeddings and outperforms self-supervised methods as well as fine-tuned models on target tasks. The proposed framework is a generic framework that is applicable in representation learning with multi-modal time-series data. Future directions, limitations, and negative societal impacts are discussed in the appendix.


This work was partly supported by Efficient Meta-Learning Based Training Method and Multipurpose Multi-Modal Artificial Neural Network for Drone AI (No.2021-0-02312), and ICT Creative Consilience program (IITP-2022-2020-0-01819) supervised by the IITP; the National Supercomputing Center with supercomputing resources including technical support (KSC-2021-CRE-0299) and Kakao Brain corporation.