1 Introduction
Learning video-text representations is an important problem in computer vision. In recent years, it has drawn increasing attention due to the large amount of available video data and its various applications. Previous works
[lin2014microsoft, zhou2018towards, wray2019fine] have achieved exciting results by learning mappings between video clips and texts, but they usually require a large amount of manual annotation, such as MSR-VTT [xu2016msr], DiDeMo [anne2017localizing], and EPIC-KITCHENS [damen2018scaling]. However, since labeling videos is expensive and time-consuming, it does not scale well to sufficiently large datasets, which are essential for learning generic video-text representations that are readily applicable to a wide range of downstream tasks, including text-to-video or video-to-text retrieval [klein2015associating, wang2018learning, wang2016learning, yu2018joint], text-based action localization [anne2017localizing, cheron2018flexible], action segmentation [lea2016temporal, sigurdsson2017asynchronous], and video question answering [tapaswi2016movieqa, malinowski2015ask, yu2018joint]. Recent studies suggest that multimodal self-supervised learning with a huge amount of data is a promising alternative to fully supervised methods
[fernando2017self, xu2019self]. To this end, HowTo100M [miech2019howto100m] has been introduced, which is composed of 100 million pairs of video clips and captions from 1.22M narrated instructional videos. HowTo100M is one of the largest video datasets, but it comes with several challenges. It is uncurated, and its video-text pairs are weakly correlated, meaning that given a video clip, the caption depicting the visual content may appear before/after the clip or not exist at all (Figure 1). To handle the weakly correlated video-text pairs, MIL-NCE [miech2020end] has proposed a multiple instance learning (MIL)-based contrastive learning adopting the Noise-Contrastive Estimation (NCE) loss [gutmann2010noise]. MIL-NCE treats the multiple captions that are temporally close to one clip as positive samples, allowing one-to-many correspondence. But this strong assumption often leads to suboptimal representation learning.
In this paper, to address this problem, we develop a new weak temporal alignment algorithm building upon Dynamic Time Warping (DTW) [sakoe1978dynamic]. In contrast to the standard DTW, which is limited to sequential alignment, our proposed alignment algorithm allows flexibility by skipping irrelevant pairs and starting/ending at arbitrary time points. It also takes into account a globally optimal path as well as locally optimal paths by introducing local neighborhood smoothing. More importantly, our alignment algorithm is differentiable, so we incorporate it into representation learning as a distance measure. We then propose a novel multimodal self-supervised learning framework to learn a joint video and text embedding model, named Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS), that automatically handles the correspondence between noisy and weakly correlated captions and clips.
Our extensive experiments on five benchmark datasets demonstrate that our learned video and text representations generalize well to various downstream tasks, including action recognition, text-to-video retrieval, and action step localization. Moreover, ablation studies and qualitative analysis show that our framework effectively aligns the noisy and weakly correlated multimodal time-series data.
Our contributions are threefold:

We propose a novel self-supervised learning framework with differentiable weak temporal alignment that automatically handles noisy and weakly correlated multimodal time-series data.

We analyze the local neighborhood smoothing in our alignment algorithm, showing that, unlike DTW, the alignment takes into account locally optimal paths as well as the globally optimal path.

Our experiments show that the proposed method considerably improves joint representations of video and text and adapts well to various downstream tasks.
2 Related Work
Self-Supervised Learning for Videos. Self-supervised learning approaches have received considerable attention because they do not require additional annotations during representation learning. Recently, several works have been proposed to learn video representations in a self-supervised manner. One research direction is to design video-specific pretext tasks, such as verifying temporal orders [lee2017unsupervised, fernando2017self, misra2016shuffle, xu2019self], predicting video rotation [jing2018self], solving jigsaw puzzles in a video [kim2019self], and dense predictive coding [han2019video]. Another line of research uses contrastive learning, which pulls together clips from the same video while pushing away clips from different videos [sun2019learning, wang2020self, qian2021spatiotemporal, chen2020simple, chen2021exploring, grill2020bootstrap, he2020momentum]. In view of the multimodality of videos, many works explore mutual supervision across modalities to learn representations of each modality. For example, they regard temporal or semantic consistency between videos and audios [korbar2018cooperative, chen2021multimodal] or narrations [miech2020end, alayrac2016unsupervised, miech2019howto100m, bain2021frozen] as a natural source of supervision. MIL-NCE [miech2020end] introduced contrastive learning to learn joint embeddings between clips and captions of unlabeled and uncurated narrated videos. Another line of work adopts an additional cross-modal encoder (e.g., a cross-modal transformer) to capture richer interactions between modalities [sun2019videobert, sun2019learning, zhu2020actbert, li2020hero, luo2020univl, ging2020coot]. In this paper, we focus on extending contrastive learning to temporally align two time-series modalities, i.e., clips and captions from videos, without any additional cross-modal encoders.
Sequence Alignment. Sequence alignment is crucial in fields dealing with time-series data due to their temporal structure. In particular, the lack of manually annotated video datasets makes it harder to align clips and captions temporally. Dynamic Time Warping (DTW) [sakoe1978dynamic] measures the distance between two sequences under strong temporal constraints. [chang2021learning] uses global sequence alignment as a proxy task by relying on the DTW. [cuturi2017soft, hadji2021representation] extended the DTW to end-to-end learning with differentiable approximations of the discrete operations (e.g., the ‘min’ operator) in the DTW. Chang et al. [chang2019d3tw] proposed a frame-wise alignment loss using the DTW for weakly supervised action alignment in videos. Drop-DTW [dvornik2021drop] proposed a variant of the DTW algorithm which automatically drops outlier elements from the pairwise distance to handle noisy data. However, using the DTW alone can cause feature collapsing, where all the feature embeddings concentrate to a single point. To address this problem, [chang2019d3tw] and [haresh2021learning] use a subsidiary regularization loss term with the DTW.
3 Preliminaries
We briefly summarize the basic concepts of dynamic time warping and the characteristics of HowTo100M, an uncurated narrated video dataset.
3.1 Dynamic Time Warping (DTW)
DTW [berndt1994using] finds an optimal alignment between two time-series data. Let $X$ and $Y$ denote two time-series data of lengths $n$ and $m$, i.e., $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_m)$. DTW first computes a pairwise distance matrix $D(X, Y) \in \mathbb{R}^{n \times m}$, with $d_{i,j} = d(x_i, y_j)$ for a distance measure $d(\cdot, \cdot)$. Then, DTW optimizes the following:
$$\mathrm{DTW}(X, Y) = \min_{A \in \mathcal{A}} \langle A, D(X, Y) \rangle \quad (1)$$
where $\mathcal{A} \subset \{0, 1\}^{n \times m}$ is a set of (binary) alignment matrices. An alignment matrix $A \in \mathcal{A}$ represents a path that connects the $(1,1)$-th to the $(n,m)$-th entries of $D(X, Y)$ by three possible moves $\{\rightarrow, \downarrow, \searrow\}$.
To efficiently find an optimal path, DTW [berndt1994using] uses dynamic programming to recursively solve the following subproblems:
$$r_{i,j} = d_{i,j} + \min(r_{i-1,j},\, r_{i,j-1},\, r_{i-1,j-1}) \quad (2)$$
where $r_{i,j}$ is the $(i,j)$-th element of the cumulative cost matrix of $D(X, Y)$. Therefore, $\mathrm{DTW}(X, Y)$ in (1) is equal to $r_{n,m}$, the accumulated cost that evaluates the similarity between the two time-series data.
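As a concrete illustration, the recursion in (2) can be sketched in a few lines of NumPy; the toy 1-D sequences and the absolute-difference distance below are hypothetical and only serve to exercise the recursion:

```python
import numpy as np

def dtw(D):
    """DTW cumulative cost via the recursion of Eq. (2):
    r[i, j] = d[i, j] + min(r[i-1, j], r[i, j-1], r[i-1, j-1])."""
    n, m = D.shape
    r = np.full((n + 1, m + 1), np.inf)  # padded border enforces the boundary
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = D[i - 1, j - 1] + min(r[i - 1, j], r[i, j - 1], r[i - 1, j - 1])
    return r[n, m]  # DTW(X, Y) = r_{n,m}

# Toy 1-D sequences; absolute difference as the distance measure d.
X = np.array([0.0, 1.0, 2.0])
Y = np.array([0.0, 2.0])
D = np.abs(X[:, None] - Y[None, :])
cost = dtw(D)  # optimal path aligns x1-y1, then x2 and x3 both to y2
```

In practice the same recursion runs on a distance matrix between clip and caption embeddings rather than scalar sequences.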
SoftDTW [cuturi2017soft] has proposed a differentiable variant of the DTW, replacing the non-differentiable operator ‘$\min$’ in (2) with the softmin ‘$\min^{\gamma}$’ defined as:
$$\min{}^{\gamma}(a_1, \ldots, a_k) = -\gamma \log \sum_{i=1}^{k} e^{-a_i / \gamma} \quad (3)$$
where $\gamma \geq 0$ is a smoothing parameter. Then, the recurrence relation of SoftDTW is given as:
$$r_{i,j} = d_{i,j} + \min{}^{\gamma}(r_{i-1,j},\, r_{i,j-1},\, r_{i-1,j-1}) \quad (4)$$
If $\gamma$ is zero, the softmin is identical to the $\min$ operator. As $\gamma$ increases, $\mathrm{SoftDTW}(X, Y)$ takes the cost of suboptimal paths into account more.
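For reference, the softmin of (3) and the SoftDTW recursion of (4) can be sketched as follows (a minimal NumPy sketch using a log-sum-exp shift for stability; not the optimized implementation of [cuturi2017soft]):

```python
import numpy as np

def softmin(values, gamma):
    """min^gamma(a_1..a_k) = -gamma * log(sum_i exp(-a_i / gamma)),  Eq. (3)."""
    a = -np.asarray(values, dtype=float) / gamma
    amax = a.max()  # log-sum-exp shift for numerical stability
    return -gamma * (amax + np.log(np.exp(a - amax).sum()))

def soft_dtw(D, gamma=0.1):
    """The DTW recursion of Eq. (4) with softmin in place of min."""
    n, m = D.shape
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = D[i - 1, j - 1] + softmin(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[n, m]
```

Since the softmin lower-bounds the min, the SoftDTW cost is never larger than the hard DTW cost and converges to it as $\gamma \to 0$.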
3.2 The HowTo100M Dataset
HowTo100M dataset [miech2019howto100m] is a large-scale dataset that contains 136M video clips with paired captions from 1.22M narrated instructional videos across 23K different visual tasks. A video has 110 clip-caption pairs with an average duration of 4 seconds. The captions are narrations automatically transcribed via automatic speech recognition (ASR). Learning joint video-text embeddings with HowTo100M involves two sources of difficulty: ‘uncurated narrations’ and ‘weak correlation’ between clip-caption pairs. As discussed in
[miech2020end], the narrations transcribed by ASR are potentially erroneous, and the colloquial language consists of sentences that are neither complete nor grammatically correct. In addition, due to the weak correlation between the paired clips and captions, computing the optimal correspondence to learn a joint embedding entails addressing the following challenges, which are the main focus of this paper.

Ambiguity. As mentioned above, the average duration of a clip-caption pair is 4 seconds. Since short clips are sampled densely in one video, consecutive clips are often semantically similar, i.e., clip-caption alignments inherently have ambiguity. It is therefore more beneficial to use algorithms that take multiple alignments into account, allowing many-to-many correspondence, rather than algorithms that consider only one optimal path, such as the standard DTW.
Irrelevant pairs. The paired clips and captions may contain irrelevant content for several reasons. People might skip demonstrating some steps when narrations are clear enough, or vice versa. In Figure 1(c), since the narration “select the correct program” is clear enough, no demonstration is given in the corresponding clip. In addition, some videos have entirely irrelevant clips and captions, as in Figure 1(d). When learning joint video-text embeddings, these irrelevant pairs should be properly handled.
Non-sequential alignment. Although videos and texts are correlated at the video level overall, the paired clips and captions are often not temporally well-aligned. For instance, people in a video describe plans before demonstrations or explain details after actions, i.e., captions may come with temporal shifts. To estimate the correspondence between clips and captions, they can be aligned without changing the order of elements in each modality, as in Figure 1(a), called sequential alignment. In contrast, when the order of elements in a modality is partially reversed, or the content of a clip/caption is arbitrarily interspersed in the other modality, non-sequential alignments are required to compute the optimal correspondence. We observe that non-sequential alignments often occur when videos have long sequences of captions and clips, as in Figure 1(b). We address these challenges with a new learning strategy.
4 Method
In this section, we present a novel multimodal self-supervised framework, named Video-Text Temporally Weak Alignment-based Contrastive Learning (VT-TWINS), to learn joint embeddings of video and text from uncurated narrated videos. To address the problems mentioned above and estimate more accurate correspondence, we propose a new differentiable variant of DTW, called Locally Smoothed Soft-DTW with Weak Alignment (S2DTW). First, we apply local neighborhood smoothing and weak alignment. We then adopt temporal data augmentation for non-sequential alignments that the standard DTW cannot inherently handle. We finally apply a contrastive learning scheme and present VT-TWINS for representation learning without feature collapsing. Figure 2 and Algorithm 1 show our overall algorithm VT-TWINS, including S2DTW.
4.1 Local Neighborhood Smoothing
To address the ambiguity mentioned in Section 3.2, we smooth the pairwise distance matrix $D$ as:
$$\tilde{d}_{i,j} = \min{}^{\gamma}(d_{i,j},\, \tilde{d}_{i-1,j},\, \tilde{d}_{i,j-1},\, \tilde{d}_{i-1,j-1}) \quad (5)$$
where $\tilde{d}_{i,j}$ and $d_{i,j}$ are the $(i,j)$-th elements of $\tilde{D}$ and $D$, respectively. This allows many-to-many correspondence and encourages the alignment algorithm to focus more on a locally optimal clip (or caption), which has relatively smaller distances to others within a small neighborhood. $\tilde{d}_{i,j}$ can be viewed as $d_{i,j}$ smoothed with its previous elements $\tilde{d}_{i-1,j}$, $\tilde{d}_{i,j-1}$, and $\tilde{d}_{i-1,j-1}$. Then, similar to (4), we apply dynamic programming to compute the optimal cost from the smoothed distance matrix $\tilde{D}$ and $\tilde{d}_{i,j}$ instead of $D$ and $d_{i,j}$ as follows:
$$\tilde{r}_{i,j} = \tilde{d}_{i,j} + \min{}^{\gamma}(\tilde{r}_{i-1,j},\, \tilde{r}_{i,j-1},\, \tilde{r}_{i-1,j-1}) \quad (6)$$
S2DTW decays the cost of older matches and reflects more recent elements, since (6) accumulates the cost sequentially from the top-left to the bottom-right element. Roughly speaking, the proposed S2DTW with $\gamma > 0$ considers local optimality by (5) as well as global optimality by (6), since S2DTW can be rewritten as:
$$\tilde{r}_{i,j} = \min{}^{\gamma}(d_{i,j},\, \tilde{d}_{i-1,j},\, \tilde{d}_{i,j-1},\, \tilde{d}_{i-1,j-1}) + \min{}^{\gamma}(\tilde{r}_{i-1,j},\, \tilde{r}_{i,j-1},\, \tilde{r}_{i-1,j-1}) \quad (7)$$
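Under our reading of (5) and (6), the forward pass of S2DTW chains the two recursions. The sketch below illustrates that reading only; the boundary handling and the exact smoothing neighborhood are our assumptions, not the reference implementation:

```python
import numpy as np

def softmin(values, gamma):
    # min^gamma of Eq. (3), with a log-sum-exp shift for stability
    a = -np.asarray(values, dtype=float) / gamma
    amax = a.max()
    return -gamma * (amax + np.log(np.exp(a - amax).sum()))

def s2_dtw_forward(D, gamma=0.1):
    """Sketch of the S2DTW forward pass as we read Eqs. (5)-(6):
    smooth D into D_tilde using its top/left neighbors, then run the
    SoftDTW recursion on the smoothed matrix."""
    n, m = D.shape
    # Eq. (5): local neighborhood smoothing (padded border = +inf).
    Dt = np.full((n + 1, m + 1), np.inf)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            Dt[i, j] = softmin(
                [D[i - 1, j - 1], Dt[i - 1, j], Dt[i, j - 1], Dt[i - 1, j - 1]], gamma)
    # Eq. (6): accumulate the smoothed costs.
    r = np.full((n + 1, m + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r[i, j] = Dt[i, j] + softmin(
                [r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]], gamma)
    return r[n, m]
```

Because both the smoothing and the accumulation use softmin, the resulting cost lower-bounds the corresponding hard-DTW cost.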
Differentiation. We compare SoftDTW [cuturi2017soft] and S2DTW via their derivatives. For SoftDTW, a gradient matrix $E$ with $e_{i,j} = \partial r_{n,m} / \partial r_{i,j}$ is obtained by differentiating (4) w.r.t. $r_{i,j}$. In the S2DTW case, however, the local neighborhood smoothing layer introduces an additional dependency through $\tilde{d}_{i,j}$. We therefore redefine $\tilde{E}$ with $\tilde{e}_{i,j} = \partial \tilde{r}_{n,m} / \partial \tilde{r}_{i,j}$ and denote an additional gradient matrix $G$ with $g_{i,j} = \partial \tilde{r}_{n,m} / \partial \tilde{d}_{i,j}$ for the local neighborhood smoothing layer. $\tilde{e}_{i,j}$ of S2DTW is calculated as follows:
$$\tilde{e}_{i,j} = \tilde{e}_{i+1,j} \frac{\partial \tilde{r}_{i+1,j}}{\partial \tilde{r}_{i,j}} + \tilde{e}_{i,j+1} \frac{\partial \tilde{r}_{i,j+1}}{\partial \tilde{r}_{i,j}} + \tilde{e}_{i+1,j+1} \frac{\partial \tilde{r}_{i+1,j+1}}{\partial \tilde{r}_{i,j}} \quad (8)$$
By differentiating (6) with $\tilde{r}$ and $\tilde{d}$ instead of $r$ and $d$, the first partial derivative in (8) is calculated as:
$$\frac{\partial \tilde{r}_{i+1,j}}{\partial \tilde{r}_{i,j}} = \exp\left( \frac{\tilde{r}_{i+1,j} - \tilde{d}_{i+1,j} - \tilde{r}_{i,j}}{\gamma} \right) \quad (9)$$
After calculating $\tilde{E}$ in (8), $G$ is calculated as:
$$g_{i,j} = \tilde{e}_{i,j} + g_{i+1,j} \frac{\partial \tilde{d}_{i+1,j}}{\partial \tilde{d}_{i,j}} + g_{i,j+1} \frac{\partial \tilde{d}_{i,j+1}}{\partial \tilde{d}_{i,j}} + g_{i+1,j+1} \frac{\partial \tilde{d}_{i+1,j+1}}{\partial \tilde{d}_{i,j}} \quad (10)$$
In (10), $\partial \tilde{r}_{i,j} / \partial \tilde{d}_{i,j} = 1$ since $\tilde{r}_{i,j} = \tilde{d}_{i,j} + \min^{\gamma}(\tilde{r}_{i-1,j}, \tilde{r}_{i,j-1}, \tilde{r}_{i-1,j-1})$. Similar to (9), the first partial derivative in (10) is written as:
$$\frac{\partial \tilde{d}_{i+1,j}}{\partial \tilde{d}_{i,j}} = \exp\left( \frac{\tilde{d}_{i+1,j} - \tilde{d}_{i,j}}{\gamma} \right) \quad (11)$$
and the other two partial derivatives are calculated in the same way. Like (9), which measures how minimal $\tilde{r}_{i,j}$ is among the three directions, (11) measures how minimal $\tilde{d}_{i,j}$ is among the three directions. Hence, (8) aggregates globally optimal path information and (10) aggregates locally optimal path information, due to $\tilde{r}$ in the former and $\tilde{d}$ in the latter. Unlike S2DTW, SoftDTW only requires calculating the matrix $E$ with $r$ instead of $\tilde{r}$ by (8) and thus does not consider local optimality. Figure 3 depicts the forward and backward propagation of S2DTW.
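The directional derivatives such as (9) and (11) are softmax-type weights of the softmin. A small sketch (names illustrative) makes this concrete: the gradient of $\min^{\gamma}$ w.r.t. its inputs is a softmax over the negated, scaled inputs, i.e., a soft indicator of ‘how minimal’ each direction is:

```python
import numpy as np

def softmin(values, gamma):
    # min^gamma of Eq. (3)
    a = -np.asarray(values, dtype=float) / gamma
    amax = a.max()
    return -gamma * (amax + np.log(np.exp(a - amax).sum()))

def softmin_grad(values, gamma):
    """d min^gamma / d a_j = softmax(-a / gamma)_j: each weight softly
    indicates how minimal that entry is among its competitors, which is
    the shape of the directional terms in Eqs. (9) and (11)."""
    a = -np.asarray(values, dtype=float) / gamma
    w = np.exp(a - a.max())
    return w / w.sum()
```

The weights sum to one and concentrate on the smallest input as $\gamma \to 0$, recovering the hard argmin of the standard DTW backward pass.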
4.2 Weak Alignment
We further modify SoftDTW by allowing its path not to forcibly align irrelevant pairs (Figure 1(c) and 1(d)). Besides, our S2DTW can start from (or end at) an arbitrary point. Adopting the trick in DWSA [shen2021learning] for one-to-one matching with skipping, we achieve weak alignment by inserting dummy elements $*$ in the intervals (and at both ends) of the clip and caption sequences (e.g., $X = (x_1, \ldots, x_n)$ becomes $(*, x_1, *, \ldots, *, x_n, *)$).
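The dummy-insertion step described above can be sketched directly on the pairwise distance matrix (a sketch; the function name and the uniform dummy distance follow our reading of this section):

```python
import numpy as np

def add_dummies(D, eta):
    """Interleave dummy rows/columns (constant distance eta) into the
    pairwise distance matrix so the alignment path can 'park' on a dummy
    and skip real pairs whose distance exceeds the threshold eta."""
    n, m = D.shape
    Dp = np.full((2 * n + 1, 2 * m + 1), float(eta))
    Dp[1::2, 1::2] = D  # real clip-caption pairs sit at odd indices
    return Dp
```

Any real pair with distance above `eta` is then dominated by a dummy route, so the DTW path skips it.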
In S2DTW, the pairwise distance matrix with dummy elements is $D' \in \mathbb{R}^{(2n+1) \times (2m+1)}$, which has a constant dummy distance $\eta$ at every pair that includes a dummy element $*$. $\eta$ is a hyperparameter that can be interpreted as a threshold: by calculating the DTW with dummy elements, the DTW path is led to pass only through pairs whose distance is smaller than $\eta$. Unlike the standard DTW or SoftDTW, which forcibly align at least one pair per timestamp, our proposed S2DTW weakly aligns the irrelevant clip-caption pairs and even enables many-to-many matchings, which cannot be handled by DWSA. Figure 4(a) and 4(b) show the pairwise distance before/after adding dummy elements. This weak alignment framework is followed by the local neighborhood smoothing. As a result, the final pairwise distance is the smoothed $\tilde{D}'$, which is used to calculate the DTW.
4.3 Temporal Data Augmentation
As discussed in Section 3.2, videos often have non-sequential alignments, but the standard DTW cannot resolve them since it allows only three moves $\{\rightarrow, \downarrow, \searrow\}$. To address this problem, we propose a simple data augmentation that temporally shuffles clips and captions. Let $\sigma$ denote a permutation; then the clip sequence permuted by $\sigma$ is $X_\sigma = (x_{\sigma(1)}, \ldots, x_{\sigma(n)})$. To avoid temporally or semantically too extreme augmentations, we consider a subset of possible permutations. We first leave out the cases where a clip is temporally shifted beyond a time window. For example, with a window parameter $w$, the $i$-th clip cannot be moved out of the window, i.e., the range of possible indices of the $i$-th clip after a permutation is $[i - w, i + w]$. The set of permutations that satisfies this temporal constraint is denoted as $\Sigma_w$. Given the temporal constraint, we propose the target distribution as follows:
$$p(\sigma) = \underset{\sigma \in \Sigma_w}{\mathrm{softmax}} \left( -\frac{\lVert S - S_\sigma \rVert_F}{\tau} \right) \quad (12)$$
where softmax is computed over all permutations in $\Sigma_w$ and $\tau$ is a temperature parameter. $S$ and $S_\sigma$ are the self-similarity matrices before/after the permutation. The proposed target distribution is more likely to generate a permutation that changes the self-similarity structure less. In other words, the proposed augmentation is less likely to generate semantically too strong augmentations that hinder representation learning. Then, the temporally augmented $X_\sigma$, which is a shuffled sequence of clips, is sampled from the distribution defined in (12). The caption sequence is augmented in the same way, and finally we calculate the pairwise distance matrix between the shuffled sequences as the input for alignment (e.g., DTW). For simplicity of implementation, each modality is shuffled independently.
Our temporal augmentation encourages learning features that are invariant under permutation and allows minimizing the distance between clips and captions that cannot be aligned by sequential alignment algorithms such as the standard DTW. This is helpful for learning representations when the clips and captions are non-sequentially aligned, as in Figure 1(b).
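A brute-force sketch of this sampling procedure follows (enumerating the window-constrained permutations is only feasible for short toy sequences; the dot-product self-similarity and the Frobenius-norm score are our assumptions):

```python
import itertools
import numpy as np

def sample_permutation(feats, window=1, tau=0.1, seed=0):
    """Sample a temporal shuffle: enumerate permutations that move no
    element further than `window` positions, score each by how little it
    changes the self-similarity matrix, and sample from the softmax of
    the scores (cf. Eq. (12))."""
    n = len(feats)
    S = feats @ feats.T  # self-similarity before permutation
    valid = [p for p in itertools.permutations(range(n))
             if all(abs(p[i] - i) <= window for i in range(n))]
    scores = np.array([-np.linalg.norm(S - S[np.ix_(p, p)]) / tau
                       for p in valid])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return valid[rng.choice(len(valid), p=probs)]
```

Permutations that barely perturb the self-similarity structure receive most of the probability mass, so semantically destructive shuffles are rarely drawn.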
4.4 Contrastive Learning with S2DTW
With S2DTW, we perform representation learning in a self-supervised manner. S2DTW can be used as a distance measure between clips and captions. However, minimizing the distance between two samples without negative pairs causes feature collapsing. Hence, to address this problem, we adopt a well-known contrastive loss, the InfoNCE loss [oord2018representation]. Our final loss is defined as:
$$\mathcal{L} = -\sum_{i} \log \frac{e^{-\mathrm{S2DTW}(X_i, Y_i)}}{e^{-\mathrm{S2DTW}(X_i, Y_i)} + \sum_{(X, Y) \in \mathcal{N}_i} e^{-\mathrm{S2DTW}(X, Y)}} \quad (13)$$
where $X_i$ and $Y_i$ are the clips and captions of the $i$-th video and $\mathcal{N}_i$ is the set of negative samples of the $i$-th video in a minibatch. This formulation also implicitly mines hard negatives. At the clip-caption level, due to the nature of the DTW, a clip-caption pair that has a closer distance among the negative samples receives a stronger signal to be pushed away than other negative pairs. Therefore, unlike the baseline [kalantidis2020hard], no additional hard negative mining strategy (e.g., [he2020momentum]) is needed for the proposed method. Further discussion with qualitative results is in the appendix.
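Plugging a sequence distance into InfoNCE can be sketched as follows (a simplified version of (13) where, for video $i$, the distances to all other videos in the batch serve as negatives; the matrix layout is our assumption):

```python
import numpy as np

def info_nce(dist, i):
    """InfoNCE for video i with sequence distances as negated logits:
    dist[i, j] is the alignment distance between the clips of video i
    and the captions of video j; j == i is the positive, the rest are
    in-batch negatives."""
    logits = -dist[i]               # smaller distance -> larger logit
    logits = logits - logits.max()  # numerical stability
    return -np.log(np.exp(logits[i]) / np.exp(logits).sum())
```

Because the negatives enter through a softmax over negated distances, the closest (hardest) negative pair dominates the repulsive gradient, which is the implicit hard-negative mining noted above.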
5 Experiments
In this section, we evaluate the performance on various downstream tasks by applying our pretrained feature embeddings (Section 5.1). We also present ablation studies on the effect of each component introduced in Section 4, and analyze qualitative results of each component in terms of the DTW path (Section 5.2). All downstream tasks and ablation studies except for the action recognition task are conducted in the zero-shot setting to evaluate only the quality of the learned representations. For the action recognition task, we adopt the widely used linear evaluation protocol, which trains a linear classifier on top of the frozen representation. The experimental setup and further ablation studies are in the appendix.
5.1 Downstream Tasks
5.1.1 Action Recognition
We first evaluate the learned video representation, without using the text representation, on the action recognition task, whose goal is to distinguish video-level actions. In Table 1, we compare the proposed method with other self-supervised methods. Under the linear evaluation protocol, our VT-TWINS outperforms all self-supervised learning methods, including the fine-tuned baselines (denoted by Frozen ✗) such as CBT [sun2019learning] and 3DRotNet [jing2018self]. This result shows that our method improves the generality of video representations. Especially on HMDB, VT-TWINS obtains about a 4% improvement over MIL-NCE with the same backbone model (S3D).
5.1.2 Video and Text Retrieval
We evaluate the effectiveness of the joint representation of video and text on text-to-video and video-to-text retrieval tasks, which aim to find the corresponding clip (caption) given a query caption (clip).
Text-to-video retrieval. Tables 2 and 3 show the performance of text-to-video retrieval on the YouCook2 and MSR-VTT datasets. For a fair comparison with MIL-NCE, we train our model on the HowTo100M dataset and evaluate on the test sets without any additional supervision. Table 2 shows that our VT-TWINS outperforms MIL-NCE and even other methods (e.g., COOT and ActBERT) that are fine-tuned on YouCook2 (denoted as YC2). Similarly, on the MSR-VTT dataset, Table 3 shows that the proposed method outperforms several multimodal self-supervised methods trained on HowTo100M (MIL-NCE, Amrani et al., SSB). In addition, our method is better than or on par with ActBERT, which is fine-tuned on the target dataset MSR-VTT.
Method  Labeled Dataset  CTR 

Alayrac et al. [alayrac2016unsupervised]  IM, K400  13.3 
CrossTask [zhukov2019cross]  IM, K400  22.4 
CrossTask [zhukov2019cross]  IM, K400, CT  31.6 
Miech et al. [miech2019howto100m]  IM, K400  33.6 
DWSA [shen2021learning]  CT  35.5 
ActBERT [zhu2020actbert]  CT  37.1 
MIL-NCE [miech2020end]  None  35.5 
VT-TWINS  None  40.7 
Video-to-text retrieval. We also compare video-to-text retrieval performance with MIL-NCE. Table 4 shows that our VT-TWINS outperforms MIL-NCE on both YouCook2 and MSR-VTT. Note that MIL-NCE blindly and equally treats all the captions in a time window around a query clip as positives. We believe that this assumption often does not hold, and learning with the inaccurate clip-caption pairs may hinder learning representations that precisely associate clips and captions.
5.1.3 Action Step Localization
We also evaluate the representations learned by our method on the action step localization task on the CrossTask dataset. We adopt the zero-shot evaluation suggested in [miech2019howto100m]. Table 5 shows that VT-TWINS significantly outperforms baselines, achieving a CrossTask average recall (CTR) of 40.7%. This surpasses MIL-NCE (35.5%) and even the models that are trained on the CrossTask dataset, such as DWSA (35.5%) and ActBERT (37.1%).
5.2 Ablation Study and Qualitative Analysis
5.2.1 Temporal Data Augmentation
As explained in Section 4.3, the proposed augmentation is less likely to generate a permutation that is significantly different from the original sequence. To evaluate the effectiveness of our temporal data augmentation, we compare it with two other strategies: one samples from the uniform distribution, and the other samples from the inverse of our distribution, i.e., assigning a higher probability to permutations that are semantically farther from the original sequence. Rows (2), (3), and (4) in Table 6 demonstrate that temporal shuffles that maintain semantic information help to learn feature representations on weakly correlated data with non-sequential alignments. The gap is especially substantial in the tasks that use joint embedding representations (YouCook2, MSR-VTT, and CrossTask in Table 6): because strong augmentation severely harms semantic information, it becomes difficult to learn representations aligned between clips and captions.
5.2.2 Weak Alignment
The top row of Figure 5 shows the case of partially irrelevant pairs; the pairwise distance matrix on top shows that the fifth caption has a consistently large distance from the other clips (also refer to Figure 1(c) for an illustrative example). In this case, SoftDTW is forced to align one or more pairs per timestamp. In contrast, S2DTW shows that the unrelated pairs are only weakly aligned because S2DTW skips them appropriately. Moreover, SoftDTW has another problem: it is forced to align the start point $(1, 1)$ and the end point $(n, m)$. Unlike SoftDTW, we observe that S2DTW can ignore the start point and the end point.
SoftDTW also finds a temporal alignment path even in entirely uncorrelated data, as in the case of Figure 1(d). The bottom row of Figure 5 illustrates that most elements of the pairwise distance matrix are greater than zero (the leftmost matrix), i.e., the clips and captions are almost entirely irrelevant. A path is clearly drawn by SoftDTW, while most pairs remain only weakly aligned by S2DTW. Rows (2) and (5) of Table 6 show that the weak alignment of S2DTW improves the performance on the weakly correlated data by ignoring irrelevant pairs.
5.2.3 Local Neighborhood Smoothing
We also evaluate the effectiveness of local neighborhood smoothing. As mentioned in Section 4.1, local neighborhood smoothing can reflect local optimal path as well as global optimal path. (5) and (6) in Table 6 show that local neighborhood smoothing complements the DTW and improves the performance.
TA  WA  LS  HMDB  UCF  YC2  MV  CT  

(1)        38.9  68.6  8.7  12.7  22.9 
(2)  A      39.4  69.3  9.6  13.6  23.5 
(3)  B      36  68.5  5  10.5  17.4 
(4)  C      36.9  68  4.9  11.5  16.8 
(5)  A  ✔    39.1  70.6  10.6  14.7  26.9 
(6)  A  ✔  ✔  42  72.1  12.5  17.4  28.2 
6 Conclusion
We have presented a novel multimodal self-supervised learning framework for learning joint embeddings of video and text from uncurated narrated videos. To address the challenges of weakly correlated video and caption pairs, our framework VT-TWINS first aligns the clips and captions with the proposed weak alignment algorithm and learns representations via contrastive learning. Our experiments on three tasks over five benchmark datasets demonstrate that the proposed method significantly improves the generality of joint embeddings and outperforms self-supervised methods as well as models fine-tuned on the target tasks. The proposed framework is generic and applicable to representation learning with multimodal time-series data. Future directions, limitations, and negative societal impacts are discussed in the appendix.
Acknowledgments.
This work was partly supported by Efficient Meta-Learning Based Training Method and Multi-purpose Multi-Modal Artificial Neural Network for Drone AI (No.2021002312), and the ICT Creative Consilience program (IITP20222020001819) supervised by the IITP; the National Supercomputing Center with supercomputing resources including technical support (KSC2021CRE0299); and Kakao Brain Corporation.