Drop-DTW: Aligning Common Signal Between Sequences While Dropping Outliers

by Nikita Dvornik, et al.

In this work, we consider the problem of sequence-to-sequence alignment for signals containing outliers. Assuming the absence of outliers, the standard Dynamic Time Warping (DTW) algorithm efficiently computes the optimal alignment between two (generally) variable-length sequences. While DTW is robust to temporal shifts and dilations of the signal, it fails to align sequences in a meaningful way in the presence of outliers that can be arbitrarily interspersed in the sequences. To address this problem, we introduce Drop-DTW, a novel algorithm that aligns the common signal between the sequences while automatically dropping the outlier elements from the matching. The entire procedure is implemented as a single dynamic program that is efficient and fully differentiable. In our experiments, we show that Drop-DTW is a robust similarity measure for sequence retrieval and demonstrate its effectiveness as a training loss on diverse applications. With Drop-DTW, we address temporal step localization on instructional videos, representation learning from noisy videos, and cross-modal representation learning for audio-visual retrieval and localization. In all applications, we take a weakly- or unsupervised approach and demonstrate state-of-the-art results under these settings.








1 Introduction

The problem of sequence-to-sequence alignment is central to many computational applications. Aligning two sequences (e.g., temporal signals) entails computing the optimal pairwise correspondence between the sequence elements while preserving their match orderings. For example, video Hadji et al. (2021) or audio Sakoe and Chiba (1978) synchronization are important applications of sequence alignment in the same modality, while the alignment of video to audio Halperin et al. (2019) represents a cross-modal task. Dynamic Time Warping (DTW) Sakoe and Chiba (1978) is a standard algorithm for recovering the optimal alignment between two variable length sequences. It efficiently solves the alignment problem by finding all correspondences between the two sequences, while being robust to temporal variations in the execution rate and shifts.

A major issue with DTW is that it enforces correspondences between all elements of both sequences and thus cannot properly handle sequences containing outliers. That is, given sequences with interspersed outliers, DTW enforces matches between the outliers and clean signal, which is prohibitive in many applications. A real-world example of sequences containing outliers is instructional videos. These are long, untrimmed videos depicting a person performing a given activity (e.g., making a latte) following a pre-defined set of ordered steps (e.g., a recipe). Typically, only a few frames in the video correspond to the instruction steps, while the rest of the video frames are unrelated to the main activity (e.g., the person talking); see Figure 1 for an illustration. In this case, matching the outlier frames to instruction steps will not yield a meaningful alignment. Moreover, the match score between such sequences, computed by DTW, will be negatively impacted by “false” correspondences and therefore cannot be used reliably for downstream tasks, e.g., retrieval or representation learning.

Figure 1: Aligning instructional videos. Left: Both video sequences (top and bottom) depict the three main steps of “making latte”; however, there are unrelated video segments, i.e., outliers, in between the steps. DTW aligns all the frames with each other and creates false correspondences where outliers are matched to signal (red links). Right: In contrast, Drop-DTW finds the optimal alignment, while simultaneously dropping unrelated frames (crossed out), leaving only correct correspondences (green links).

In this paper, we introduce Drop-DTW to address the problem of matching sequences that contain interspersed outliers, as illustrated in Figure 1 (right), in contrast to standard DTW (left). While various improvements to DTW have been previously proposed (e.g., Hadji et al. (2021); Cuturi and Blondel (2017); Chang et al. (2019); Cao et al. (2020); Nakatsu et al. (1982); Needleman and Wunsch (1970)), Drop-DTW is the first to augment DTW with the ability to flexibly skip through irrelevant parts of the signal during alignment. Rather than relying on a two-step greedy approach where elements are first dropped before aligning the remaining signal, Drop-DTW achieves this in a unified framework that solves for the optimal temporal alignment while jointly detecting outliers. Drop-DTW casts sequence alignment as an optimization problem with a novel component specifying the cost of dropping an element within the optimization process. It is efficiently realized using a dynamic program that naturally admits a differentiable approximation and can be efficiently used at training and inference time.


Our main contributions are summarized as follows:

  • We propose an extension of DTW that is able to identify and align the common signal between sequences, while simultaneously excluding interspersed outliers from the alignment.

  • The proposed Drop-DTW formulation naturally admits a differentiable approximation and we therefore demonstrate its usage as a training loss function.

  • We demonstrate the utility of Drop-DTW, for both training and inference for multi-step localization in instructional videos, using only ordered steps as weak supervision. We achieve state-of-the-art results on the CrossTask Zhukov et al. (2019) dataset and are the first to tackle the COIN Tang et al. (2019) and YouCook2 Zhou et al. (2018) datasets, given only ordered steps (i.e., no framewise labels are used).

  • We employ Drop-DTW as a loss function for weakly-supervised video representation learning on the PennAction dataset Zhang et al. (2013), modified to have interspersed outliers, and unsupervised audio-visual representation learning on the AVE dataset Tian et al. (2018). Compared to the baselines, Drop-DTW yields superior representations as measured by its performance on various downstream tasks.

We are committed to releasing the code upon acceptance.

2 Related work

Sequence alignment. The use of aligned sequences in learning has recently seen a growing interest across various tasks Sermanet et al. (2018); Cuturi and Blondel (2017); Chang et al. (2019); Dwibedi et al. (2019); Hadji et al. (2021); Cao et al. (2020); Chang et al. (2021); Cai et al. (2019). While some methods start from aligned paired signals Sermanet et al. (2018); Sigurdsson et al. (2018), others directly learn to align unsynchronized signals Dwibedi et al. (2019); Chang et al. (2019); Hadji et al. (2021); Cao et al. (2020). One approach tackles the alignment problem locally by maximizing the number of one-to-one correspondences using a soft nearest neighbors method Dwibedi et al. (2019). More closely related are methods Cuturi and Blondel (2017); Chang et al. (2019); Hadji et al. (2021); Chang et al. (2021) that seek a global alignment between sequences by relying on Dynamic Time Warping (DTW) Sakoe and Chiba (1978). To handle noise in the feature space some methods use Canonical Correlation Analysis (CCA) with standard DTW Zhou and Torre (2009); Trigeorgis et al. (2016). To use DTW for end-to-end learning, differentiable approximations of the discrete operations (i.e., the min operator) in DTW have been explored Cuturi and Blondel (2017); Hadji et al. (2021). One of the main downsides of standard DTW-based approaches is that they require clean, tightly cropped, data with matching endpoints. In contrast to such methods, Drop-DTW extends DTW and its differentiable variants with the ability to handle outliers in the sequences by allowing the alignment process to automatically skip outliers (i.e., match signal while dropping outliers). While previous work Müller (2007); Cao et al. (2020); Sakurai et al. (2007) targeted rejecting outliers limited to the start or end of the sequences, our proposed Drop-DTW is designed to handle outliers interspersed in the sequence, including at the endpoints. 
Other work proposed limited approaches to handle such outliers either using a greedy two-step approach Nakatsu et al. (1982), consisting of outlier rejection followed by alignment, or by restricting the space of possible alignments Needleman and Wunsch (1970); Smith and Waterman (1981) (e.g., only allowing one-to-one matches and individual drops). In contrast, Drop-DTW solves for the optimal temporal alignment while simultaneously detecting outliers. As shown in our experiments, the ability of Drop-DTW to flexibly skip outliers during alignment can in turn be used at inference time to localize the start and end times of inlier subsequences.

Representation learning. Self- and weakly-supervised approaches are increasingly popular for representation learning as they eschew the need of dense labels for training. Such approaches are even more appealing for video-based tasks that typically require inordinately large labeling efforts. A variety of (label-free) proxy tasks have emerged for the goal of video representation learning Misra et al. (2016); Fernando et al. (2017); Lee et al. (2017); Büchler et al. (2018); Xu et al. (2019); Wei et al. (2018); Yu et al. (2016); Meister et al. (2018); Vondrick et al. (2018); Janai et al. (2018); Wang and Gupta (2015); Wang et al. (2019b); Jabri et al. (2020); Benaim et al. (2020); Jayaraman and Grauman (2016); Vondrick et al. (2016); Wang et al. (2019a); Han et al. (2019).

More closely related are methods using sequence alignment as a proxy task Chang et al. (2019); Dwibedi et al. (2019); Hadji et al. (2021); Cao et al. (2020); Chang et al. (2021); Cai et al. (2019). Our approach has the added advantage of handling interspersed outliers during alignment. Also related are methods that leverage multimodal aspects of video, e.g., video-audio Owens and Efros (2018); Arandjelovic and Zisserman (2018) or video-language alignment Miech et al. (2019, 2020); Sun et al. (2019); Luo et al. (2020). Our step localization application builds on strong representations learned via the task of aligning vision and language using narrated instructional videos Miech et al. (2020). However, we augment these representations with a more global and robust approach to the alignment process, thereby directly enabling fine-grained applications, such as multi-step localization.

Step localization. Temporal step localization consists of determining the start and end times of one or more activities present in a video. These methods can be broadly categorized into two classes based on the level of training supervision involved. Fully supervised approaches (e.g., Fabian Caba Heilbron and Niebles (2015); Ma et al. (2016); Tang et al. (2019)) rely on fine-grained temporal labels indicating the start and end times of activities. Weakly supervised approaches eschew the need for framewise labels. Instead, they use a video-level label corresponding to the activity category Nguyen et al. (2018, 2019), or sparse time stamps that provide the approximate (temporal) locations of the activities. More closely related are methods that rely on the order of instruction steps in a clip to yield framewise step localization Huang et al. (2016); Ding and Xu (2018); Richard et al. (2018); Chang et al. (2019); Zhukov et al. (2019). In contrast, we do not rely on categorical step labels from a fixed set of categories and demonstrate that our approach applies to any ordered sequence of embedded steps, such as embedded descriptions in natural language.

3 Technical approach

3.1 Preliminaries: Dynamic time warping

Dynamic Time Warping (DTW) Sakoe and Chiba (1978) computes the optimal alignment between two sequences subject to certain constraints. Let $X = [x_1, \dots, x_N]$ and $Z = [z_1, \dots, z_K]$ be the input sequences, where $N$ and $K$ are the respective sequence lengths and $d$ is the dimensionality of each element, i.e., $x_i, z_j \in \mathbb{R}^d$. The valid alignments between sequences are defined as a binary matrix, $M \in \{0, 1\}^{N \times K}$, of the pairwise correspondences, where $M_{i,j} = 1$ if $x_i$ is matched to $z_j$, and $M_{i,j} = 0$ otherwise. Note that $i$ corresponds to the first (row) index of $M$ and $j$ the second (column). Matching an element $x_i$ to element $z_j$ has a cost $C_{i,j} = c(x_i, z_j)$, which is typically a measure of dissimilarity between the elements. DTW finds the alignment between sequences $X$ and $Z$ that minimizes the overall matching cost:

$$M^* = \arg\min_{M \in \mathcal{M}} \langle M, C \rangle, \qquad (1)$$

where $\langle \cdot, \cdot \rangle$ is the Frobenius inner product and $\mathcal{M}$ is the set of all feasible alignments that satisfy the following constraints: monotonicity, continuity, and matching endpoints ($M_{1,1} = M_{N,K} = 1$). Fig. 2 (a) provides an illustration of feasible and optimal alignments. The cost of aligning two sequences with DTW is defined as the cost of the optimal matching: $\mathrm{DTW}(X, Z) = \langle M^*, C \rangle$. DTW provides an efficient dynamic programming algorithm to find the solution to Equation 1.
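For concreteness, the dynamic program behind Equation 1 can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' implementation; `dtw_cost` and the table layout are our own naming:

```python
import numpy as np

def dtw_cost(C):
    """Classic DTW: C[i, j] is the cost of matching x_i to z_j.
    Returns the optimal alignment cost with matching endpoints."""
    N, K = C.shape
    D = np.full((N + 1, K + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, K + 1):
            # monotonicity + continuity: extend from the diagonal, top, or left
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j - 1],
                                            D[i - 1, j],
                                            D[i, j - 1])
    return float(D[N, K])
```

For instance, on a cost matrix whose zero-cost entries lie along the diagonal, the optimal path follows that diagonal and the returned cost is zero.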

In the case of sequences containing outliers, the matching-endpoints and continuity constraints are too restrictive, leading to meaningless alignments. Thus, we introduce Drop-DTW, which admits more flexibility in the matching process.

Figure 2: Optimal alignment with DTW and Drop-DTW. Aligning two different videos where the digit “3” moves across the square frame. The colored matrices represent the pairwise matching costs, $C$, with darker cells indicating higher costs. The paths on the grid are alignment paths, while the points on them indicate a pairwise match between the corresponding row and column elements. (a) All three paths are feasible DTW paths, while only one of them (in green) is optimal. (b) When a sequence contains an outlier (i.e., digit “0”), DTW uses it in the alignment and incurs a high cost (red point). (c) In contrast, Drop-DTW skips the outlier (while paying the drop cost) and only keeps the relevant matches.

3.2 Sequence alignment with Drop-DTW

Drop-DTW extends the set of feasible alignments beyond $\mathcal{M}$ to those adhering to the monotonicity constraint only. Consequently, unlike in DTW, elements can be dropped from the alignment process.

Before introducing Drop-DTW, let us discuss a naive solution to the problem of outlier filtering in DTW. One could imagine a greedy two-step approach where one first 1) drops the outliers, and then 2) aligns remaining elements with standard DTW. Since step 1) (i.e., dropping) is performed independently from step 2) (i.e., alignment), this approach yields a sub-optimal solution. Critically, if an important element is erroneously dropped in step 1), it is impossible to recover it in step 2). Moreover, the outlier rejection step is order agnostic and results in drops broadly scattered over the entire sequence, which makes this approach inapplicable for precise step localization. To avoid such issues, it is critical to jointly address 1) outlier detection and 2) sequence alignment.

For this purpose, we propose Drop-DTW, a unified framework that solves for the optimal temporal alignment while jointly enabling element dropping by adding another dimension to the dynamic programming table. Specifically, dropping an element $z_j$ means there is no element in $X$ that has been matched to $z_j$. This is captured by the $j$-th column of the correspondence matrix $M$ containing only zeros, i.e., $\sum_{i=1}^{N} M_{i,j} = 0$.

To account for unmatched elements in the alignment objective, we extend the set of costs beyond the pairwise matching costs, $C$, used in DTW, with novel drop costs $d^x \in \mathbb{R}^N$ and $d^z \in \mathbb{R}^K$, for elements of $X$ and $Z$, respectively. The optimal matching can then be defined as follows:

$$M^* = \arg\min_{M \in \mathcal{M}_m} \; \langle M, C \rangle + \langle \bar{m}^x(M), d^x \rangle + \langle \bar{m}^z(M), d^z \rangle, \qquad (2)$$

where $\langle \cdot, \cdot \rangle$ denotes the inner product, $\mathcal{M}_m$ is the set of binary matrices satisfying just the monotonicity constraint, and $\bar{m}^z(M) \in \{0, 1\}^K$ is a vector with the $j$-th element equal to one if the $j$-th column of $M$ contains only zeros and zero otherwise; $\bar{m}^x(M) \in \{0, 1\}^N$ is defined similarly, but on the rows.

For clarity, in Algorithm 1, we describe the dynamic program for efficiently computing the optimal alignment and its cost with dropping limited to elements of $Z$, i.e., only $d^z$ is used. Our general Drop-DTW algorithm that drops elements from both sequences, $X$ and $Z$, is given in the supplemental.

1: Inputs: $C \in \mathbb{R}^{N \times K}$ - pairwise match cost matrix, $d^z \in \mathbb{R}^{K}$ - drop costs for elements in $Z$.
2: // initializing dynamic programming tables
3: $D^+_{0,0} = 0$; $D^+_{i,0} = D^+_{0,j} = \infty$ // match table
4: $D^-_{0,0} = 0$; $D^-_{i,0} = \infty$; $D^-_{0,j} = \sum_{k=1}^{j} d^z_k$ // drop table
5: $D_{i,j} = \min(D^+_{i,j}, D^-_{i,j})$ // optimal solution table
6: for $i = 1, \dots, N$ do // iterating over elements in $X$
7:     for $j = 1, \dots, K$ do // iterating over elements in $Z$
8:         $D^+_{i,j} = C_{i,j} + \min(D_{i-1,j-1}, D^+_{i-1,j}, D_{i,j-1})$ // consider matching $x_i$ to $z_j$
9:         $D^-_{i,j} = d^z_j + D_{i,j-1}$ // consider dropping $z_j$
10:        $D_{i,j} = \min(D^+_{i,j}, D^-_{i,j})$ // select the optimal action
11:     end for
12: end for
13: $M^* \leftarrow$ trace back the minimum-cost path from $D_{N,K}$ // compute the optimal alignment
14: Output: $M^*$, $D_{N,K}$
Algorithm 1 Subsequence alignment with Drop-DTW.
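The dynamic program of Algorithm 1 can be transcribed directly. The following NumPy sketch covers the one-sided variant that drops elements of $Z$ only; the table names and the exact form of the recurrences are our reconstruction for illustration, not the authors' released code:

```python
import numpy as np

def drop_dtw_cost(C, dz):
    """Drop-DTW with drops allowed in Z only.
    C[i, j]: cost of matching x_i to z_j; dz[j]: cost of dropping z_j."""
    N, K = C.shape
    Dp = np.full((N + 1, K + 1), np.inf)  # match table: z_j matched at (i, j)
    Dm = np.full((N + 1, K + 1), np.inf)  # drop table: z_j dropped at (i, j)
    Dp[0, 0] = Dm[0, 0] = 0.0
    Dm[0, 1:] = np.cumsum(dz)             # a prefix of Z entirely dropped
    D = np.minimum(Dp, Dm)                # optimal solution table
    for i in range(1, N + 1):
        for j in range(1, K + 1):
            # match x_i to z_j: come from the diagonal, keep matching z_j
            # for consecutive x's, or advance over column j-1 (matched or dropped)
            Dp[i, j] = C[i - 1, j - 1] + min(D[i - 1, j - 1],
                                             Dp[i - 1, j],
                                             D[i, j - 1])
            # drop z_j: pay dz[j-1] on top of the best solution one column back
            Dm[i, j] = dz[j - 1] + D[i, j - 1]
            D[i, j] = min(Dp[i, j], Dm[i, j])
    return float(D[N, K])
```

For example, aligning $X = [a, b]$ against $Z = [a, \text{noise}, b]$ with a cheap drop cost for the noise element yields the cost of the two correct matches plus one drop, rather than forcing a match with the outlier.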

3.3 Definition of match costs

The pairwise match costs, $C_{i,j}$, are typically defined based on the dissimilarity between elements $x_i$ and $z_j$. In this paper, we consider two different ways to define the pairwise costs, depending on the application. In general, when dropping elements from both $X$ and $Z$ is permitted, we consider the following symmetric match cost:

$$C_{i,j} = 1 - \frac{\langle x_i, z_j \rangle}{\|x_i\| \, \|z_j\|}, \qquad (3)$$

i.e., the cosine distance between the elements. Alternatively, when one of the sequences, i.e., $Z$, is known to contain only signal, we follow Hadji et al. (2021), who show that the following asymmetric matching cost is useful during representation learning:

$$C_{i,j} = -\log \left[ \mathrm{softmax}_i \left( \frac{\langle x_i, z_j \rangle}{\gamma} \right) \right], \qquad (4)$$

where $\mathrm{softmax}_i$ defines a standard softmax operator applied over the first tensor dimension and $\gamma$ is a temperature hyperparameter.

Importantly, the matching cost $C_{i,j}$ does not solely dictate whether the elements $x_i$ and $z_j$ are matched. Instead, the optimal matching is governed by Equation 2, where the match cost is just one of the costs affecting the optimal alignment, along with the drop costs $d^x$ and $d^z$.

3.4 Definition of drop costs

There are many ways to define the drop costs. As a starting point, consider a drop cost that is a constant fixed across all elements in $X$ and $Z$: $d^x_i = d^z_j = c$. Setting the constant $c$ too low (relative to the match costs, $C$) will lead to a high frequency of drops and thus matching only a small fraction of the signal. In contrast, setting $c$ too high may result in retaining the outliers, i.e., no drops.

Percentile drop costs. To avoid such extreme outcomes, we define the drop cost on a per-instance basis, with values comparable to those of the match cost, $C$. In particular, we define the drop costs as the top-$p$ percentile of the values contained in the cost matrix, $C$:

$$d^x_i = d^z_j = \mathrm{percentile}(\{C_{i,j}\}, \, p). \qquad (5)$$

Defining the drop cost as a function of the top percentile match costs has the advantage that adjusting $p$ allows one to inject a prior belief on the outlier rate in the input sequences.
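Under this formulation, the drop cost can be computed per sequence pair in essentially one line. A sketch, where the percentile value `p` encodes the user's prior on the outlier rate:

```python
import numpy as np

def percentile_drop_cost(C, p=30):
    """Set every drop cost to the p-th percentile of the match costs in C,
    so that roughly p% of candidate matches are cheaper than a drop."""
    return float(np.percentile(C, p))
```

With `p = 30`, only the cheapest ~30% of potential matches beat dropping, which keeps drops from dominating while still rejecting expensive (outlier) matches.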

Learnable drop costs. While injecting prior knowledge into the system using the percentile drop cost can be an advantage, it may be hard to do so when the expected level of noise is unknown or changes from one dataset to another. To address this scenario, we introduce an instantiation of a learnable drop cost here. We choose to define the drop costs of the outliers in $Z$ based on the content of both sequences. This is realized as follows:

$$d^z_j = f_\theta([\bar{x}; \bar{z}]), \qquad (6)$$

where $\bar{x}$ and $\bar{z}$ are the respective means of sequences $X$ and $Z$, and $f_\theta$ is a learnable function (i.e., a feed-forward neural net) parameterized by $\theta$. This definition can yield a more adaptable system.

3.5 Drop-DTW as a differentiable loss function

Differentiable approximation. To make the matching process differentiable, we replace the hard $\min$ operator in Alg. 1 with the following differentiable approximation introduced in Hadji et al. (2021):

$$\min{}^{\gamma}(a_1, \dots, a_n) = \sum_{i=1}^{n} a_i \, \frac{e^{-a_i / \gamma}}{\sum_{k=1}^{n} e^{-a_k / \gamma}}, \qquad (7)$$

where $\gamma > 0$ is a hyperparameter controlling the trade-off between smoothness and approximation error.
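This relaxation replaces the hard min with a softmax-weighted average of its arguments. A small sketch of this smoothing (our own helper, assuming the softmax-weighted form from Hadji et al. (2021)):

```python
import numpy as np

def smooth_min(a, gamma=1.0):
    """Differentiable min: weight each argument by softmax(-a / gamma).
    As gamma -> 0 this approaches min(a); larger gamma gives smoother gradients."""
    a = np.asarray(a, dtype=float)
    w = np.exp(-(a - a.min()) / gamma)   # subtract the min for numerical stability
    w /= w.sum()
    return float((w * a).sum())
```

With a small `gamma`, `smooth_min([0, 100])` is essentially 0, matching the hard min; with a large `gamma` it tends toward the plain average, which is what makes gradients flow through all candidate paths.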

Loss function. The differentiable Drop-DTW yields a sequence match cost suitable for representation learning (cf. Hadji et al. (2021)). The corresponding Drop-DTW loss function is defined as follows:

$$\mathcal{L}_{\text{D-DTW}}(X, Z) = D_{N,K}, \qquad (8)$$

where $D_{N,K}$ is the optimal match cost between $X$ and $Z$ computed in Alg. 1.

4 Experiments

The key novelty of Drop-DTW is the ability to robustly match sequences with outliers. In Sec. 4.1, we first present a controlled experiment using synthetic data to demonstrate the robustness of Drop-DTW. Next, we show the strength of Drop-DTW on a range of applications, including multi-step localization (Sec. 4.2), representation learning from noisy videos (Sec. 4.3), and audio-visual alignment for retrieval and localization (Sec. 4.4).

4.1 Controlled synthetic experiments

Synthetic dataset. We use the MNIST dataset LeCun et al. (1998) to generate videos of moving digits (cf. Srivastava et al. (2015)). Each video in our dataset depicts a single digit moving along a given trajectory. For each digit-trajectory pair, we generate two videos: (i) the digit moves along the full trajectory (termed TMNIST-full) and (ii) the digit performs a random sub-part of the trajectory (termed TMNIST-part). Each video frame is independently encoded with a shallow ConvNet LeCun et al. (1998) trained for digit recognition. We elaborate on the dataset and the framewise embedding network in the Appendix. Here, we use these datasets for sequence retrieval and subsequence localization.

4.1.1 Video retrieval

In this experiment, we use TMNIST-part as queries and look for the most similar videos in TMNIST-full. A correct retrieval is defined as identifying the video in TMNIST-full containing the same digit-trajectory pair as the one in the query. We use the Recall@1 metric to report performance. To analyze the strength of Drop-DTW in this controlled setting, we also introduce temporal noise in the query sequence by randomly blurring a subset of the video frames. We compare Drop-DTW to standard DTW Sakoe and Chiba (1978) in this experiment. In all cases, we use the alignment score obtained from each algorithm as a matching cost between sequences. For this experiment, we use the symmetric matching costs defined in (3). Since no training is involved in this experiment, we set the drop costs to a constant value, which we establish through cross-validation.

Figure 3: Drop-DTW for retrieval and event localization on TMNIST. (a) We consider queries from TMNIST-part with interspersed noise, for indexing into TMNIST-full. Drop-DTW is more robust to interspersed noise than other alignment algorithms. (b) An example query-signal correspondence matrix illustrates outlier removal, where red rows (cols) depict dropped frames in the query (signal, resp.), and green (yellow) arrows denote correct (incorrect) query outlier identification. (c) Replicating the query and dataset signal in time demonstrates temporal localization by Drop-DTW despite signal repetitions and interspersed outliers (see main text).

Figure 3 (a) compares the performance of the two alignment algorithms with various levels of noise. As expected, DTW is very sensitive to interspersed noise. In contrast, Drop-DTW remains robust across the range of noise levels. Concretely, for the highest noise level in Fig. 3 (a), we see that Drop-DTW is 8× better than DTW, which speaks decisively in favor of Drop-DTW.

4.1.2 Subsequence localization

To demonstrate Drop-DTW’s ability to identify start and end times, we consider test sequences, $S$, formed by concatenating clips from TMNIST-part, where exactly $P$ of these clips are from a selected digit-trajectory class. We consider queries of the form $Q = [q_1, \dots, q_P]$, where each of the $q_p$’s is an instance of this same digit-trajectory class in TMNIST-full. Ideally, aligning $Q$ and $S$ should put a subsequence of each of the $q_p$’s in correspondence with a matching subsequence in $S$ (see Fig. 3 (c)). By construction, there should be $P$ such matches, interspersed with outliers. Here, we randomly generate 1000 such sequences, $S$.

We use Drop-DTW to align the generated pairs, $Q$ and $S$, and select the longest contiguously matched subsequences. We recover start and end times as the endpoints of these longest matched intervals. The resulting localization accuracy and IoU on this data (see metric definitions in Sec. 4.2) demonstrate the efficacy of Drop-DTW for localizing events in long untrimmed videos. Note that we do not compare our method to any baseline, as Drop-DTW is the only alignment-based method capable of solving this task.
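Given the per-element match/drop decisions produced by the alignment, recovering event boundaries reduces to finding the longest contiguous run of matched elements. An illustrative helper (our own, not from the paper's code):

```python
def longest_matched_interval(matched):
    """matched: iterable of booleans, True where an element was matched (not dropped).
    Returns (start, end) inclusive indices of the longest contiguous run of True."""
    best = (0, -1)   # empty interval by default
    start = None
    for idx, m in enumerate(list(matched) + [False]):  # sentinel closes the last run
        if m and start is None:
            start = idx            # a new run of matches begins
        elif not m and start is not None:
            if (idx - 1) - start > best[1] - best[0]:
                best = (start, idx - 1)   # longest run so far
            start = None
    return best
```

Running it on a match mask such as `[False, True, True, False, True, True, True, False]` returns the third-through-sixth span, i.e., the longest matched interval.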

4.2 Multi-step localization

We now evaluate Drop-DTW on multi-step localization in more realistic settings. For this task, we are given as input: (i) a long untrimmed video of a person performing a task (e.g., “making salad”), where the steps involved in performing the main task are interspersed with irrelevant actions (e.g., intermittently washing dishes), and (ii) an ordered set of textual descriptions of the main steps (e.g., “cut tomatoes”) in this video. The goal is to temporally localize each step in the video.


For evaluation, we use the following three recent instructional video datasets: CrossTask Zhukov et al. (2019), COIN Tang et al. (2019), and YouCook2 Zhou et al. (2018). Both CrossTask and COIN include instructional videos of different activity types (i.e., tasks), with COIN being twice the size of CrossTask in the number of videos and spanning 10 times more tasks. YouCook2 is the smallest of these datasets and focuses on cooking tasks. While all datasets provide framewise labels for the start and end times of each step in a video, we take a weakly-supervised approach and only use the ordered step information.


Metrics. We evaluate the performance of Drop-DTW on step localization using three increasingly strict metrics. Recall Zhukov et al. (2019) is defined as the number of steps assigned to the correct ground-truth time interval divided by the total number of steps, and is the least strict metric of the three considered. Framewise accuracy (Acc.) Tang et al. (2019) is defined as the ratio between the number of frames assigned the correct step label (including background) and the total number of frames. Finally, Intersection over Union (IoU) Zhou et al. (2018) is defined as the sum of the intersections between the predicted and ground-truth time intervals of each step divided by the sum of their unions. IoU is the most challenging of the three metrics, as it more strictly penalizes misalignments.
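The IoU metric described above can be computed as follows. This is a sketch under the assumption that each step has one predicted and one ground-truth interval, given as inclusive frame ranges; `steps_iou` is our own illustrative name:

```python
def steps_iou(pred, gt):
    """pred, gt: lists of (start, end) frame intervals, one per step (end inclusive).
    Returns the sum of per-step intersections divided by the sum of per-step unions."""
    inter = union = 0
    for (ps, pe), (gs, ge) in zip(pred, gt):
        overlap = max(0, min(pe, ge) - max(ps, gs) + 1)   # intersection length
        inter += overlap
        union += (pe - ps + 1) + (ge - gs + 1) - overlap  # inclusion-exclusion
    return inter / union
```

For example, a predicted step spanning frames 0-4 against a ground-truth step spanning 2-6 overlaps on 3 frames out of a 7-frame union, giving IoU 3/7.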

Training and Inference.

We start from (pretrained) strong vision and language embeddings obtained from training on a large instructional video dataset Miech et al. (2020). We further train a two-layer fully-connected network on top of the visual embeddings alone to align videos with a list of corresponding step (language) embeddings using the Drop-DTW loss, (8). Notably, the number of different step descriptions in a video is orders of magnitude smaller than the number of video clips, which leads to degenerate alignments when training with an alignment loss alone (i.e., most clips are matched to the most frequently occurring step). To regularize the training, we introduce an additional clustering loss, defined in the Appendix. It is an order-agnostic discriminative loss that encourages video embeddings to cluster around the embeddings of steps present in the video. Finally, during training we use the Drop-DTW loss with asymmetric costs, (4), and either the 30th-percentile drop cost, (5), or the learned variant, (6), in combination with the clustering loss.

At test time, we are given a video and the corresponding ordered list of step descriptions. To temporally localize each of the steps in the video, we align the learned vision and (pre-trained) language embeddings using Drop-DTW and directly read off the step boundaries.

4.2.1 Drop-DTW as a loss for step localization

We now compare the performance of Drop-DTW to various alignment methods Hadji et al. (2021); Chang et al. (2019); Cao et al. (2020), as well as previous reported results on the datasets Miech et al. (2019); Zhou et al. (2018); Ding and Xu (2018); Richard et al. (2018); Miech et al. (2020); Luo et al. (2020). In particular, we compare to the following baselines:

- SmoothDTW Hadji et al. (2021): Learns fine-grained representations in a contrastive setting using a differentiable approximation of DTW. Drop-DTW uses the same approximation, while supporting the ability to skip irrelevant parts of sequences.

- D3TW Chang et al. (2019): Uses a different differentiable formulation of DTW, as well as a discriminative loss.

- OTAM Cao et al. (2020): Extends DTW with the ability to handle outliers strictly present around the endpoints.

- MIL-NCE Miech et al. (2020): Learns strong vision and language representations by locally aligning videos and narrations using a noise contrastive loss in a multi-instance learning setting. Notably, we learn our embeddings on top of representations trained with this loss on the HowTo100M dataset Miech et al. (2019). There is a slight difference in performance between our re-implementation and Miech et al. (2019), which is due to the difference in the input clip length and frame rate.

The results in Table 1 speak decisively in favor of our approach: we outperform all the baselines on all datasets across all the metrics with both the learned and percentile-based definitions of our drop costs. Notably, although we train only a shallow fully-connected network on top of MIL-NCE’s Miech et al. (2020) frozen pretrained visual representations, we still outperform it by sizeable margins due to the global perspective taken in aligning the two modalities. Also, note that the results of all other DTW-based methods Hadji et al. (2021); Chang et al. (2019); Cao et al. (2020) are on par due to their collective inability to handle interspersed background frames during alignment, as shown in Fig. 4. This figure also motivates the use of Drop-DTW as the inference method for step localization: it uses the drop capability to ‘label’ irrelevant frames as background, without ever learning a background model, and does so rather accurately according to Fig. 4, whereas the other methods are not designed to handle such frames.

Method CrossTask COIN YouCook2
Recall Acc. IoU Recall Acc. IoU Recall Acc. IoU
MaxMargin Miech et al. (2019) 33.6 - - - - - - - -
UniVL Luo et al. (2020) 42.0 - - - - - - - -
NN-Viterbi Richard et al. (2018) - - - - 21.2 - - - -
ISBA Ding and Xu (2018) - - - - 34.3 - - - -
ProcNet Zhou et al. (2018) - - - - - - - - 37.5
MIL-NCE Miech et al. (2020) 39.1 66.9 20.9 33.0 50.2 23.3 70.7 63.7 43.7
SmoothDTW Hadji et al. (2021) 43.1 70.2 30.5 37.7 52.7 27.7 75.3 66.0 47.5
D3TW Chang et al. (2019) 43.2 70.4 30.6 37.9 52.8 27.7 75.3 66.4 47.1
OTAM  Cao et al. (2020) 43.8 70.6 30.8 37.2 52.6 27.2 75.4 66.9 46.8
Drop-DTW + Percentile drop cost 48.9 71.3 34.2 40.8 54.8 29.5 77.4 68.4 49.4
Drop-DTW + Learned drop cost 49.7 74.1 36.9 42.8 59.6 29.5 76.8 69.6 48.4
Table 1: Step localization results on the CrossTask Zhukov et al. (2019), COIN Tang et al. (2019), and YouCook2 Zhou et al. (2018) datasets.
Figure 4: Step localization with DTW variants. Rows two to four show step assignment results when the same alignment method is used for training and inference. Drop-DTW is able to identify interspersed unlabelled clips and much more closely approximates the ground truth.

4.2.2 Drop-DTW as an inference method for step localization

Here, we further investigate the benefits of Drop-DTW (with the percentile drop cost) as an inference-time algorithm for step localization. We compare to the classical DTW algorithm Sakoe and Chiba (1978), which does not allow for element dropping; OTAM Cao et al. (2020), which allows skipping elements at the start and end of a sequence; LCSS Nakatsu et al. (1982), which greedily excludes potential matches by thresholding the cost matrix prior to the alignment operation; and the Needleman-Wunsch algorithm Needleman and Wunsch (1970), which uses drop costs to reject outliers but restricts the match space to one-to-one correspondences, making it impossible to match several frames to a single step.

Notably, Drop-DTW does not restrict possible matches and can infer the start and end times of each step directly during the alignment, as our algorithm supports skipping outliers. To demonstrate the importance of performing dropping and matching within a single dynamic program, we also consider a naive baseline, "greedy drop + DTW", that drops outliers before alignment. In this case, a clip is dropped if the cost of matching that clip to any step is greater than the drop cost (defined in Section 3.4). The alignment is then performed on the remaining clips using standard DTW.
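Under our reading of this baseline, it can be sketched in a few lines of NumPy (function names are ours; rows of the cost matrix are clips, columns are steps):

```python
import numpy as np

def dtw_cost(C):
    """Standard DTW on a cost matrix C (clips x steps); returns the optimal path cost."""
    n, m = C.shape
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = C[i - 1, j - 1] + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

def greedy_drop_dtw(C, drop_cost):
    """Greedily drop any clip whose best match to *any* step costs more than
    drop_cost, then run standard DTW on the remaining clips."""
    keep = C.min(axis=1) <= drop_cost          # per-clip, greedy decision
    total_drop = drop_cost * (~keep).sum()     # pay the drop cost for removed clips
    return dtw_cost(C[keep]) + total_drop
```

The key difference from Drop-DTW is that the drop decision here is made per clip, in isolation, before the alignment is computed, so it cannot trade off a drop against the alignment cost it induces.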

Method CrossTask COIN YouCook2
Acc. IoU Acc. IoU Acc. IoU
DTW Sakoe and Chiba (1978) 11.2 10.1 21.6 18.3 35.0 31.1
OTAM Cao et al. (2020) 19.5 11.6 26.5 19.5 43.4 34.7
LCSS Nakatsu et al. (1982) 50.3 4.1 47.0 4.5 43.4 9.0
Needleman-Wunsch Needleman and Wunsch (1970) 68.8 9.5 52.1 7.4 50.1 11.7
greedy drop + DTW 60.1 13.8 45.0 18.9 54.3 34.1
Drop-DTW (ours) 70.2 30.5 52.7 27.7 66.0 47.5
Table 2: Comparison of different alignment algorithms used as the inference procedure for step localization. The first column gives the alignment algorithm used at inference. The video and text features used for alignment are obtained from the model of Miech et al. (2020) pre-trained on the HowTo100M dataset.

For a fair comparison, in all cases we use video and step embeddings extracted from a pre-trained network Miech et al. (2020), thus removing the impact of training on inference performance. Table 2 demonstrates that Drop-DTW outperforms all existing methods on all metrics and datasets by a significant margin, owing to the baselines' inability to drop interspersed outliers or to their restricted space of possible matches. More interestingly, comparing Drop-DTW to "greedy drop + DTW" reveals the importance of formulating dropping and alignment jointly as part of the optimization, and of finding the optimal solution to (2) rather than a greedy one.

4.3 Drop-DTW for representation learning

In this section, we demonstrate the utility of the Drop-DTW loss, (8), for representation learning. We train the same network used in previous work Dwibedi et al. (2019); Hadji et al. (2021) using the alignment proxy task on PennAction Zhang et al. (2013). Similar to previous work Dwibedi et al. (2019); Hadji et al. (2021), we evaluate the quality of the learned representations (i.e., embeddings) on the task of video alignment using Kendall's Tau metric. Notably, the PennAction videos are tightly temporally cropped around the actions, making the dataset suitable for an alignment-based proxy task. To assess the robustness of Drop-DTW to the presence of outliers, we also contaminate PennAction sequences by randomly introducing a percentage of frames from other activities into each sequence. Note that this type of data transformation mimics a realistic scenario in which a video of a baseball game sequentially captures both baseball pitch and baseball swing actions, as well as crowd shots. For various amounts of outlier frames, we contaminate PennAction videos as specified above and train a network to perform sequence alignment with the SmoothDTW Hadji et al. (2021) or Drop-DTW loss. Note that no regularization loss is needed in this case and the drop cost is defined according to (5). Fig. 5 demonstrates the strong performance of Drop-DTW as a loss for representation learning, which once again proves to be more resilient to outliers.

Figure 5: Representation learning with Drop-DTW. Video alignment results using Kendall’s Tau metric on PennAction Zhang et al. (2013) with an increasing level of outliers introduced in training.
Table 3: Unsupervised audio-visual cross-modal localization on AVE Tian et al. (2018). A2V: visual localization from audio segment query; V2A: audio localization from visual segment query.
Model A2V V2A
OTAM Cao et al. (2020) 37.5 32.7
SmoothDTW Hadji et al. (2021) 39.8 33.9
Drop-DTW (ours) 41.1 35.8
Supervised Tian et al. (2018) 44.8 34.8

4.4 Drop-DTW for audio-visual localization

To show the benefits of Drop-DTW for unsupervised audio-visual representation learning, we adopt the cross-modality localization task from Tian et al. (2018). Given a trimmed signal from one modality (e.g., audio), the goal is to localize the signal in an untrimmed sequence in the other modality (e.g., visual). While the authors of Tian et al. (2018) use supervised learning with clip-wise video annotations indicating the presence (or absence) of an audio-visual event, we train our models in a completely unsupervised manner. We use a margin-based loss that encourages audio-visual sequence pairs from the same video to be closer in the shared embedding space than audio-visual pairs from different videos, and measure the matching cost between the sequences with a matching algorithm. When applying Drop-DTW, we use symmetric match costs, (3), and 70%-percentile drop costs, (5). Otherwise, we closely follow the experimental setup from Tian et al. (2018) and adopt their encoder architecture and evaluation protocol; for additional details, please see the Appendix.

In Table 3, we compare different alignment methods used as a sequence matching cost to train the audio-visual model. We can see that Drop-DTW outperforms both SmoothDTW and OTAM due to its ability to drop outliers from both sequences. Interestingly, Drop-DTW even outperforms the fully-supervised baseline on visual localization given an audio query. We hypothesize that this is due to our triplet loss formulation, which introduces learning signals from non-matching audio-visual pairs, while supervised training in Tian et al. (2018) only considers audio-visual pairs from the same video.

4.5 Discussion and limitations

Even though Drop-DTW can remove arbitrary sequence elements as part of the alignment process, the final alignment is still subject to the monotonicity constraint, which rests on the assumption that the relevant signals are strictly ordered. While not an issue for the applications presented in this paper, training from partially ordered signals with Drop-DTW is not currently addressed and is an interesting subject for future research.

5 Conclusion

In summary, we introduced an extension of the classic DTW algorithm that relaxes the constraints of matching the endpoints of paired sequences and of the continuity of the matching path. This relaxation allows our method to handle interspersed outliers during sequence alignment. The proposed algorithm is efficiently implemented as a dynamic program and naturally admits a differentiable approximation. We showed that Drop-DTW can be used both as a flexible matching cost between sequences and as a loss during training. Finally, we demonstrated the strengths of Drop-DTW across a range of applications.


  • R. Arandjelovic and A. Zisserman (2018) Objects that sound. In ECCV.
  • S. Benaim, A. Ephrat, O. Lang, I. Mosseri, W. T. Freeman, M. Rubinstein, M. Irani, and T. Dekel (2020) SpeedNet: learning the speediness in videos. In CVPR.
  • U. Büchler, B. Brattoli, and B. Ommer (2018) Improving spatiotemporal self-supervision by deep reinforcement learning. In ECCV.
  • X. Cai, T. Xu, J. Yi, J. Huang, and S. Rajasekaran (2019) DTWNet: a dynamic time warping network. In NeurIPS.
  • K. Cao, J. Ji, Z. Cao, C. Chang, and J. C. Niebles (2020) Few-shot video classification via temporal alignment. In CVPR.
  • C. Chang, D. Huang, Y. Sui, L. Fei-Fei, and J. C. Niebles (2019) D3TW: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. In CVPR.
  • X. Chang, F. Tung, and G. Mori (2021) Learning discriminative prototypes with dynamic time warping. In CVPR.
  • M. Cuturi and M. Blondel (2017) Soft-DTW: a differentiable loss function for time-series. In ICML.
  • L. Ding and C. Xu (2018) Weakly-supervised action segmentation with iterative soft boundary assignment. In CVPR.
  • D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman (2019) Temporal cycle-consistency learning. In CVPR.
  • F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles (2015) ActivityNet: a large-scale video benchmark for human activity understanding. In CVPR.
  • B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In CVPR.
  • I. Hadji, K. G. Derpanis, and A. D. Jepson (2021) Representation learning via global temporal alignment and cycle-consistency. In CVPR.
  • T. Halperin, A. Ephrat, and S. Peleg (2019) Dynamic temporal alignment of speech to lips. In ICASSP, pp. 3980–3984.
  • T. Han, W. Xie, and A. Zisserman (2019) Video representation learning by dense predictive coding. In ICCV Workshops.
  • D. Huang, L. Fei-Fei, and J. C. Niebles (2016) Connectionist temporal modeling for weakly supervised action labeling. In ECCV.
  • A. Jabri, A. Owens, and A. A. Efros (2020) Space-time correspondence as a contrastive random walk. In NeurIPS.
  • J. Janai, F. Güney, A. Ranjan, M. J. Black, and A. Geiger (2018) Unsupervised learning of multi-frame optical flow with occlusions. In ECCV.
  • D. Jayaraman and K. Grauman (2016) Slow and steady feature analysis: higher order temporal coherence in video. In CVPR.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE.
  • H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In ICCV.
  • H. Luo, L. Ji, B. Shi, H. Huang, N. Duan, T. Li, J. Li, T. Bharti, and M. Zhou (2020) UniVL: a unified video and language pre-training model for multimodal understanding and generation. arXiv preprint arXiv:2002.06353.
  • M. Ma, H. Fan, and K. M. Kitani (2016) Going deeper into first-person activity recognition. In CVPR.
  • S. Meister, J. Hur, and S. Roth (2018) UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In AAAI.
  • A. Miech, J. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman (2020) End-to-end learning of visual representations from uncurated instructional videos. In CVPR.
  • A. Miech, D. Zhukov, J. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic (2019) HowTo100M: learning a text-video embedding by watching hundred million narrated video clips. In ICCV.
  • I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In ECCV.
  • M. Müller (2007) Information retrieval for music and motion. Springer-Verlag, Berlin, Heidelberg. ISBN 3540740473.
  • N. Nakatsu, Y. Kambayashi, and S. Yajima (1982) A longest common subsequence algorithm suitable for similar text strings. Acta Informatica.
  • S. B. Needleman and C. D. Wunsch (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology.
  • P. Nguyen, T. Liu, G. Prasad, and B. Han (2018) Weakly supervised action localization by sparse temporal pooling network. In CVPR.
  • P. X. Nguyen, D. Ramanan, and C. C. Fowlkes (2019) Weakly-supervised action localization with background modeling. In ICCV.
  • A. Owens and A. A. Efros (2018) Audio-visual scene analysis with self-supervised multisensory features. In ECCV.
  • A. Richard, H. Kuehne, A. Iqbal, and J. Gall (2018) NeuralNetwork-Viterbi: a framework for weakly supervised video learning. In CVPR.
  • H. Sakoe and S. Chiba (1978) Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26.
  • Y. Sakurai, C. Faloutsos, and M. Yamamuro (2007) Stream monitoring under the time warping distance. In International Conference on Data Engineering (ICDE).
  • P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, and S. Levine (2018) Time-contrastive networks: self-supervised learning from video. In ICRA.
  • G. A. Sigurdsson, A. Gupta, C. Schmid, A. Farhadi, and K. Alahari (2018) Actor and observer: joint modeling of first and third-person videos. In CVPR.
  • T. F. Smith and M. S. Waterman (1981) Identification of common molecular subsequences. Journal of Molecular Biology.
  • N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised learning of video representations using LSTMs. In ICML.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) VideoBERT: a joint model for video and language representation learning. In ICCV.
  • Y. Tang, D. Ding, Y. Rao, Y. Zheng, D. Zhang, L. Zhao, J. Lu, and J. Zhou (2019) COIN: a large-scale dataset for comprehensive instructional video analysis. In CVPR.
  • Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu (2018) Audio-visual event localization in unconstrained videos. In ECCV.
  • G. Trigeorgis, M. A. Nicolaou, S. Zafeiriou, and B. W. Schuller (2016) Deep canonical time warping. In CVPR.
  • C. Vondrick, H. Pirsiavash, and A. Torralba (2016) Anticipating visual representations from unlabeled video. In CVPR.
  • C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy (2018) Tracking emerges by colorizing videos. In ECCV.
  • J. Wang, J. Jiao, L. Bao, S. He, Y. Liu, and W. Liu (2019a) Self-supervised spatio-temporal representation learning for videos by predicting motion and appearance statistics. In CVPR.
  • X. Wang and A. Gupta (2015) Unsupervised learning of visual representations using videos. In ICCV.
  • X. Wang, A. Jabri, and A. A. Efros (2019b) Learning correspondence from the cycle-consistency of time. In CVPR.
  • D. Wei, J. J. Lim, A. Zisserman, and W. T. Freeman (2018) Learning and using the arrow of time. In CVPR.
  • D. Xu, J. Xiao, Z. Zhao, J. Shao, D. Xie, and Y. Zhuang (2019) Self-supervised spatiotemporal learning via video clip order prediction. In CVPR.
  • J. J. Yu, A. W. Harley, and K. G. Derpanis (2016) Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In ECCV Workshops.
  • W. Zhang, M. Zhu, and K. G. Derpanis (2013) From actemes to action: a strongly-supervised representation for detailed action understanding. In ICCV.
  • F. Zhou and F. Torre (2009) Canonical time warping for alignment of human behavior. In NeurIPS.
  • L. Zhou, C. Xu, and J. J. Corso (2018) Towards automatic learning of procedures from web instructional videos. In AAAI.
  • D. Zhukov, J. Alayrac, R. G. Cinbis, D. Fouhey, I. Laptev, and J. Sivic (2019) Cross-task weakly supervised learning from instructional videos. In CVPR.

Appendix A Summary

Our appendix is organized as follows. Sec. B.1 provides the details of the full version of our Drop-DTW algorithm, which allows for dropping outliers from both sequences during alignment. Sec. B.2 provides implementation details for the asymmetric match cost. Sec. B.3 describes the regularization loss used for the multi-step localization application (Sec. 4.2 in the main paper). Finally, Sec. C provides additional details of our experimental setups.

Appendix B Technical approach details

In this section, we provide additional details of our Drop-DTW algorithm and its components.

B.1 Drop-DTW algorithm

Algorithm 2 presents the full version of Drop-DTW. This version allows dropping outlier elements from both sequences, X = (x_1, ..., x_N) and Z = (z_1, ..., z_M). In Algorithm 2, the ⊕ operation between a set and a scalar increases every element of the set by the scalar, i.e., S ⊕ c = {s + c : s ∈ S}. The main idea of Algorithm 2 is to simultaneously solve four dynamic programs (and fill their corresponding tables), i.e., D^xz, D^x-, D^-z and D^--:

  • D^xz_{i,j} corresponds to the optimal cost of a feasible prefix path that matches x_{1:i} and z_{1:j} with Drop-DTW, given that this feasible prefix path ends with x_i and z_j matched to each other.

  • D^x-_{i,j} represents the optimal matching cost for a prefix path ending with element z_j dropped from the matching, and x_i matched to some previous element of Z.

  • D^-z_{i,j} corresponds to dropping x_i while matching z_j.

  • D^--_{i,j} is for prefix paths ending by dropping both x_i and z_j.

The optimal alignment cost of matching X and Z with Drop-DTW is then the minimum over the paths with the different types of endpoints, i.e., min(D^xz_{N,M}, D^x-_{N,M}, D^-z_{N,M}, D^--_{N,M}).

1: Inputs: C ∈ R^{N×M} (pairwise match cost matrix), d^x ∈ R^N, d^z ∈ R^M (drop costs for elements in X and Z, respectively).
2: ▷ initialize DP tables storing optimal prefix-path costs under the different end-matching conditions
3: D^xz_{0,0} = 0;  D^xz_{i,0} = D^xz_{0,j} = ∞                    ▷ matching ends in X and Z
4: D^-z_{i,0} = D^-z_{0,j} = ∞                                     ▷ dropping X's end, but matching Z's end
5: D^x-_{i,0} = D^x-_{0,j} = ∞                                     ▷ dropping Z's end, but matching X's end
6: D^--_{0,0} = 0;  D^--_{i,0} = Σ_{k≤i} d^x_k;  D^--_{0,j} = Σ_{k≤j} d^z_k   ▷ dropping ends in X and Z
7: for i = 1, ..., N do                                            ▷ iterate over elements of X
8:     for j = 1, ..., M do                                        ▷ iterate over elements of Z
9:         ▷ group costs of neighboring cells into sets, for convenience
10:        diag_cells = {D^xz_{i-1,j-1}, D^x-_{i-1,j-1}, D^-z_{i-1,j-1}, D^--_{i-1,j-1}}
11:        left_cells_with_z = {D^xz_{i-1,j}, D^-z_{i-1,j}}
12:        top_cells_with_x = {D^xz_{i,j-1}, D^x-_{i,j-1}}
13:        left_cells_without_z = {D^x-_{i-1,j}, D^--_{i-1,j}}
14:        top_cells_without_x = {D^-z_{i,j-1}, D^--_{i,j-1}}
15:        ▷ dynamic programming update of all tables
16:        D^xz_{i,j} = C_{i,j} + min(diag_cells ∪ left_cells_with_z ∪ top_cells_with_x)   ▷ consider matching x_i to z_j
17:        D^-z_{i,j} = d^x_i + min(left_cells_with_z)                                     ▷ consider dropping x_i
18:        D^x-_{i,j} = d^z_j + min(top_cells_with_x)                                      ▷ consider dropping z_j
19:        D^--_{i,j} = min((left_cells_without_z ⊕ d^x_i) ∪ (top_cells_without_x ⊕ d^z_j)) ▷ consider dropping x_i and z_j
20:        D_{i,j} = min(D^xz_{i,j}, D^x-_{i,j}, D^-z_{i,j}, D^--_{i,j})                    ▷ select the optimal action
21:    end for
22: end for
23: M* = traceback(D)   ▷ compute the optimal alignment by tracing back the minimum-cost path
24: Output: D_{N,M}, M*
Algorithm 2: Subsequence alignment with Drop-DTW.
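For intuition, the simpler one-sided case (dropping elements of X only, in the spirit of the basic algorithm in the main paper) can be sketched in NumPy as follows; this is our own simplification with two tables rather than four, not the authors' implementation:

```python
import numpy as np

def drop_dtw_one_sided(C, dx):
    """One-sided Drop-DTW sketch: align X (rows) to Z (cols), allowing elements
    of X to be dropped at cost dx[i]; every element of Z must be matched.
    Dp[i, j]: best prefix cost where the path ends with x_i matched to z_j.
    Dm[i, j]: best prefix cost where the path ends with x_i dropped."""
    n, m = C.shape
    Dp = np.full((n + 1, m + 1), np.inf)
    Dm = np.full((n + 1, m + 1), np.inf)
    Dp[0, 0] = Dm[0, 0] = 0.0
    for i in range(1, n + 1):
        Dm[i, 0] = Dm[i - 1, 0] + dx[i - 1]        # drop every leading element of X
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(Dp[i-1, j-1], Dm[i-1, j-1],  # diagonal step
                            Dp[i-1, j], Dm[i-1, j],      # z_j matched to several x's
                            Dp[i, j-1])                  # x_i matched to several z's
            Dp[i, j] = C[i - 1, j - 1] + best_prev       # match x_i to z_j
            Dm[i, j] = dx[i - 1] + min(Dp[i-1, j], Dm[i-1, j])  # drop x_i
    return min(Dp[n, m], Dm[n, m])
```

With large drop costs the recursion degenerates to standard DTW; with moderate drop costs, outlier rows are paid for once and skipped.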

B.2 Asymmetric match costs

In cases where there is an asymmetry between the input sequences X and Z ingested by Drop-DTW, e.g., X is a video with outliers and Z is the sequence of (outlier-free) step labels contained in the video, we define an asymmetric match cost in Eq. 5 of the main paper. Expanding the softmax in Eq. 5, the asymmetric cost can be written as follows:

    C_{ij} = -log [ exp(⟨x_i, z_j⟩) / Σ_{z_k ∈ Z} exp(⟨x_i, z_k⟩) ].    (9)

Note that our implementation slightly differs from the above formulation. Precisely, to avoid "oversmoothing" the softmax, we perform the summation in the denominator of Eq. 9 over the unique elements of Z only. This is implemented by performing the summation in the denominator over a different set of elements, Z̃, where Z̃ is obtained from Z by removing duplicate elements.
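A sketch of this cost, assuming inner-product similarities and embedding rows as the sequence elements (function and variable names are ours):

```python
import numpy as np

def asym_match_costs(X, Z):
    """Asymmetric match cost: negative log-softmax of the similarity of each
    x_i, normalized over the *unique* elements of Z. X: (N, d), Z: (M, d)."""
    Z_unique = np.unique(Z, axis=0)          # deduplicate repeated label embeddings
    sims = X @ Z.T                           # (N, M) numerator similarities
    sims_unique = X @ Z_unique.T             # (N, M') denominator terms
    log_denom = np.log(np.exp(sims_unique).sum(axis=1, keepdims=True))
    return -(sims - log_denom)               # (N, M): -log softmax over unique Z
```

Deduplicating the denominator means that repeating a step label in Z does not inflate the normalizer, which is the "oversmoothing" fix described above.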

B.3 Training regularization for multi-step localization

In our experiments, we consider the Drop-DTW loss (i.e., Eq. 7 in the main paper) for video sequence labeling (i.e., Sec. 4.2 in the main paper), where a video sequence is aligned to a sequence of (discrete) labels from a finite set. An issue with this setting is that alignments minimizing the loss are prone to limiting correspondences to the most frequently occurring labels in the dataset.

To address this degeneracy, we augment our Drop-DTW loss with the following regularizer that promotes a more uniform matching between the elements of the video sequence, X = (x_1, ..., x_N), and the sequence of discrete labels, Z = (z_1, ..., z_M):

    L_reg = || X̄ Z^T - I ||_F^2,    (10)

where I is the identity matrix and X̄ = [x̄_1, ..., x̄_M]^T. Each element x̄_j of X̄ is defined according to

    x̄_j = Σ_{i=1}^{N} α_{ij} x_i,  with  α_{ij} = exp(⟨x_i, z_j⟩) / Σ_{k=1}^{N} exp(⟨x_k, z_j⟩).    (11)

In other words, x̄_j in Eq. (11) defines attention-based pooling of sequence X, relative to an element z_j. Minimizing L_reg pushes every element in Z to have a unique match in X, which prevents overfitting to frequent labels in Z and encourages the clustering of the embeddings x_i around the appropriate label embeddings z_j.
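The attention-based pooling described here can be sketched as follows (our own illustration; any softmax temperature is omitted):

```python
import numpy as np

def attention_pool(X, Z):
    """For each label embedding z_j, pool the video sequence X with attention
    weights softmax_i(<x_i, z_j>). X: (N, d), Z: (M, d) -> pooled (M, d)."""
    logits = X @ Z.T                               # (N, M) similarities
    logits -= logits.max(axis=0, keepdims=True)    # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=0, keepdims=True)              # columns sum to one
    return A.T @ X                                 # row j is sum_i A[i, j] * x_i
```

Each pooled row is a convex combination of the video features, concentrated on the frames most similar to the corresponding label.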

Appendix C Experimental details

C.1 Controlled synthetic experiments

In the main paper, we performed controlled experiments using a synthetic dataset that we generated. Here, we provide a detailed description of this dataset and additional details of our retrieval and localization experiments.

Synthetic dataset.

We start from the MNIST dataset LeCun et al. (1998) and use it to generate videos of digits moving around an empty canvas, cf. Srivastava et al. (2015), with just one digit per canvas. In particular, we place MNIST images on a black canvas, where an image can move along one of eight pre-defined trajectories. The trajectories are: (a) a figure "8", (b) a sideways figure "8" (i.e., "∞"), (c) a circle, all traversed clockwise; trajectories (d), (e) and (f) are obtained from (a), (b) and (c) but traversed counter-clockwise; and finally (i) and (j) are the two square diagonals ("\" and "/").

To synthesize a moving digit video, we perform the following four steps:

  1. Choose a digit from {0, ..., 9} and one of the eight trajectories.

  2. Sample a random image of the chosen digit from MNIST.

  3. Choose a random video length, T, and sample T equally-spaced 2D points along the chosen trajectory.

  4. Synthesize each frame of the output video by placing the digit image onto the canvas at the corresponding sampled location.

Note that since each video length, T, is randomly selected, the moving speed of the digits varies across the generated videos.

For each digit-trajectory pair, we follow the steps described above and generate two videos: (i) the digit executes the full trajectory, and (ii) the digit performs only a portion of the trajectory. We term videos falling under these two categories TMNIST-full and TMNIST-part, respectively, where TMNIST stands for Trajectory-MNIST. Each dataset thus contains 80 (10 digits × 8 trajectories) video clips in total. Please refer to the supplemental video for sample videos from our dataset, as well as illustrations of use cases in the retrieval and localization scenarios.
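The generation steps above can be sketched as follows for a circular trajectory (canvas and digit sizes are our own assumptions, not specified in the text):

```python
import numpy as np

def synthesize_video(digit_img, num_frames, canvas_size=64):
    """Place a digit crop at num_frames equally spaced points on a circular
    trajectory over a black canvas; returns an array of shape (T, H, W)."""
    h, w = digit_img.shape
    r = (canvas_size - max(h, w)) // 2 - 1          # keep the digit fully inside
    cx = cy = canvas_size // 2
    frames = np.zeros((num_frames, canvas_size, canvas_size), dtype=digit_img.dtype)
    for t, phi in enumerate(np.linspace(0.0, 2 * np.pi, num_frames, endpoint=False)):
        y = int(cy + r * np.sin(phi)) - h // 2      # top-left corner of the digit
        x = int(cx + r * np.cos(phi)) - w // 2
        frames[t, y:y + h, x:x + w] = digit_img
    return frames
```

Varying `num_frames` per video reproduces the varying digit speeds noted above.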

Encoding TMNIST.

We independently encode the frames of each video in the TMNIST datasets using a shallow 2D ConvNet. In particular, we use a four-layer ConvNet architecture, where each layer consists of the following building blocks: conv → bn → relu → maxpool2 → dropout. The last layer eschews the local max pooling block in favor of global average pooling followed by a linear layer, with the final layers producing 32- and then 10-dimensional outputs. This network is trained on the MNIST dataset on the task of digit recognition, where it reaches high accuracy.

In our experiments, the frame encodings are taken from the last convolutional layer of the network.

TMNIST for retrieval.

In this experiment (i.e., Sec 4.1.1 in the main paper), we use videos from TMNIST-part as queries and look for the most similar videos in TMNIST-full. To demonstrate Drop-DTW's robustness to noise, we introduce noise into the query videos. In particular, we randomly blur a fraction of the frames in each query. This blur is achieved by convolving the randomly selected frames with a Gaussian kernel of radius two.
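This corruption can be sketched as follows; we substitute a simple box blur for the Gaussian kernel to keep the sketch NumPy-only (names and the exact kernel are our assumptions):

```python
import numpy as np

def box_blur(frame, radius=2):
    """Box blur as a simple stand-in for a Gaussian kernel of radius two."""
    padded = np.pad(frame, radius, mode='edge')
    h, w = frame.shape
    k = 2 * radius + 1
    out = np.zeros((h, w))
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def corrupt_query(video, fraction, rng=None):
    """Blur a random fraction of the frames of a (T, H, W) query video."""
    rng = rng or np.random.default_rng(0)
    video = video.astype(float).copy()
    idx = rng.choice(len(video), size=int(round(fraction * len(video))), replace=False)
    for i in idx:
        video[i] = box_blur(video[i])
    return video, set(idx)
```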

TMNIST for localization.

In this experiment (i.e., Sec 4.1.2 in the main paper), we consider test sequences formed by concatenating five clips from TMNIST-part, where exactly two of these clips are from a selected digit-trajectory class. An example of a test sequence is obtained by concatenating the sequences: [0, b] [3, c] [8, a] [3, c] [6, j], where each subsequence represents a [digit, trajectory] pair. Given a target class (e.g., [3, c]), the task is to find all time intervals corresponding to the event [3, c] in the test sequence. For this purpose, we form queries by concatenating instances of this same digit-trajectory class (i.e., [3, c]) from TMNIST-full.

Ideally, aligning the query and the test sequence should put a subsequence of each query instance in correspondence with a matching subsequence of the test sequence (see Fig. 3 (c) in the main paper). By construction, there should be two such matches, interspersed with outliers. We randomly generate 1000 such test sequences.

We use Drop-DTW to align each generated query-test pair and select the longest contiguously matched subsequences.
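Given the per-frame match/drop decisions produced by the alignment, extracting the longest contiguously matched subsequence can be sketched as (boolean labelling is our own format):

```python
def longest_matched_run(matched):
    """Given per-frame booleans (True = matched, False = dropped), return the
    (start, end) of the longest contiguous matched run, end exclusive."""
    best = (0, 0)
    start = None
    for i, m in enumerate(matched + [False]):   # sentinel closes a trailing run
        if m and start is None:
            start = i                           # a run begins
        elif not m and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)               # longest run so far
            start = None
    return best
```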

C.2 Complete ablation study on instructional video step localization

Role of the regularization loss.

Here, we investigate the role of each component of the proposed loss for step localization on the CrossTask Zhukov et al. (2019), COIN Tang et al. (2019) and YouCook2 Zhou et al. (2018) datasets. We train the video embeddings with various alignment losses, with or without regularization, and compare against the pre-trained embeddings. At inference time, we use the learned embeddings to calculate pairwise similarities between all video clips and steps. From these similarities, we obtain match and drop costs that are then used to assign a step label to each video clip.

Method CrossTask COIN YouCook2
Recall Acc. IoU Recall Acc. IoU Recall Acc. IoU
MIL-NCE Miech et al. (2020) 39.1 66.9 20.9 33.0 50.2 23.3 70.7 63.7 43.7
SmoothDTW Hadji et al. (2021) 29.6 48.5 9.0 29.1 38.8 17.8 70.2 61.1 39.7
Drop-DTW 33.8 56.6 14.2 30.6 43.7 20.1 71.5 63.4 41.6
Clustering loss only 43.4 70.2 29.8 37.7 53.6 27.6 75.8 66.3 47.3
SmoothDTW + clustering loss 43.1 70.2 30.5 37.7 52.7 27.7 75.3 66.0 47.5
Drop-DTW + clustering loss 48.9 71.3 34.2 40.8 54.8 29.5 77.4 68.4 49.4
Table 4: Ablation study of training and inference methods for step localization on CrossTask Zhukov et al. (2019), COIN Tang et al. (2019) and YouCook2 Zhou et al. (2018). The first column gives the loss function(s) used for training, the first row gives the dataset, and the second row specifies the measured metric (higher is better).

As seen in Table 4, using the clustering loss alone on top of the pre-trained embeddings provides a strong baseline, as it effectively uses an unordered set of steps to learn the video representations. Interestingly, combining SmoothDTW with the clustering loss to introduce order information in training does not improve (and can decrease) performance. We conjecture that the outliers present in the sequences can negatively impact the training signal. In contrast, training with Drop-DTW yields a significant boost in performance on all the metrics. This suggests that filtering out the background is key to training with step-order supervision on instructional videos.

Role of the min operator.

In the main paper, we use the SmoothMin approximation of the min operator proposed in Hadji et al. (2021) to enable differentiability. This is not a key component of our contribution, and other differentiable min operators, such as SoftMin Cuturi and Blondel (2017), as well as the (hard) min operator, can be used instead. The corresponding results, shown in Table 5, highlight the stability of Drop-DTW across min operators.

Recall Acc. IoU
SoftMin Cuturi and Blondel (2017) 46.1 70.8 32.9
Hard min 48.3 71.2 34.3
SmoothMin Hadji et al. (2021) 48.9 71.3 34.2
Table 5: Ablation study on the role of the min operator used in our Drop-DTW algorithm.
Role of the percentile choice in the drop cost.

When training representations with Drop-DTW, the match costs produced by the model evolve during training. To have the drop cost change accordingly, we define it as a percentile of the match costs, as in Equation 5. Here, we provide results for setting the drop cost to various percentile values. We also include a comparison to setting the drop cost as a learned component, as defined in Equation 6.

Recall Acc IoU
p = 0.1 47.5 71.4 34.9
p = 0.3 48.9 71.3 34.2
p = 0.5 46.2 69.5 31.0
p = 0.7 45.3 68.4 31.0
p = 0.9 44.6 67.7 30.0
learned drop-cost 49.7 74.1 36.9
Table 6: Ablation study on the role of the percentile used in the drop cost definition of Drop-DTW.

The results provided in Table 6 demonstrate the robustness of the proposed Drop-DTW algorithm to the choice of percentile. Interestingly, the learned drop cost yields the best overall performance, which demonstrates the adaptability of the Drop-DTW algorithm.
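Concretely, the percentile drop cost amounts to a one-liner over the pairwise match costs of a given sequence pair (our sketch of the Eq. (5) definition; p = 0.3 is the strongest percentile setting in Table 6):

```python
import numpy as np

def percentile_drop_cost(match_costs, p=0.3):
    """Instance-level drop cost: the p-th percentile of all pairwise match
    costs for the current sequence pair, recomputed as the model evolves."""
    return np.percentile(match_costs, 100.0 * p)
```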

C.3 Drop-DTW for representation learning

As mentioned in the main paper, we use the PennAction dataset Zhang et al. (2013) to evaluate Drop-DTW for representation learning. To evaluate the robustness to outliers, we contaminate PennAction sequences with interspersed outlier frames. Here, we give more details on the contamination and alignment procedures.

PennAction contamination.

At training time, a batch contains a set of paired sequences. Given two such sequences of the same action (e.g., baseball pitch), we extract the same number of frames from both sequences. To ensure that the extracted frames cover the entire duration of both sequences, we rely on strided sampling. We then select outlier frames from another action (e.g., baseball swing) and randomly intersperse them within one of the two sequences, thereby yielding a sequence containing outlier elements; the other sequence is left untouched. This process is illustrated in Fig. 6. Under these settings, Drop-DTW is expected to learn strong representations that rely on the common signal between the two sequences, while ignoring the interspersed outliers. The results reported in Fig. 5 in the main paper support this intuition and speak decisively in favor of Drop-DTW.
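The contamination step described above can be sketched as follows (frame values and counts are illustrative; the relative order of the original frames is preserved, as the text implies):

```python
import numpy as np

def intersperse_outliers(frames, outlier_frames, rng=None):
    """Randomly intersperse outlier frames within a sequence, preserving the
    relative order of both the original and the outlier frames."""
    rng = rng or np.random.default_rng(0)
    n = len(frames) + len(outlier_frames)
    positions = np.sort(rng.choice(n, size=len(outlier_frames), replace=False))
    out, is_outlier = [], []
    it_f, it_o = iter(frames), iter(outlier_frames)
    oi = 0
    for i in range(n):
        if oi < len(positions) and i == positions[oi]:
            out.append(next(it_o)); is_outlier.append(True); oi += 1
        else:
            out.append(next(it_f)); is_outlier.append(False)
    return out, is_outlier
```

The returned boolean mask doubles as ground truth for which elements an aligner should drop.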

Figure 6: Illustration of the PennAction outlier contamination process. The top sequence depicts a clean baseball pitch, while the bottom sequence has outlier frames from a baseball swing interspersed within a second baseball pitch sequence.

C.4 Drop-DTW for audio-visual localization

We report results on the task of audio-visual localization in the main paper. Here, we provide further details on the training procedure. Given an audio-visual pair, (v, a), from the AVE dataset Tian et al. (2018), we start by splitting each modality into consecutive one-second-long segments and encode each segment using the same per-modality backbones as the original paper Tian et al. (2018). We then calculate a pairwise cost matrix between the sequences of the two modalities, using the symmetric match cost defined in Eq. 4 of the main paper. Next, we use a DTW-based approach to obtain the optimal match cost. In the case of Drop-DTW, the optimal match cost is the alignment cost returned by Algorithm 2, and we use the 70%-percentile drop costs defined in Eq. 5 of the main paper. To train the networks for cross-modal localization, we use a margin loss defined as:

    L = max(0, C(v, a) - C(v, a⁻) + δ) + max(0, C(v, a) - C(v⁻, a) + δ),    (12)

where (v, a) represents an audio-visual pair from the same sequence, whereas a⁻ (v⁻) denotes an audio (visual) signal from a different sequence, and C(·, ·) is the sequence matching cost described above. The margin δ is kept fixed in all our experiments.
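Assuming precomputed sequence matching costs, a margin loss of this form can be sketched as (names and the exact formulation are our own; one hinge term per negative pair):

```python
def margin_loss(pos_cost, neg_costs, delta=1.0):
    """Hinge loss: the matching cost of a positive audio-visual pair should be
    at least delta below the cost of every negative (mismatched) pair."""
    return sum(max(0.0, pos_cost - nc + delta) for nc in neg_costs)
```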

Once we have learned representations using Eq. 12, we strictly follow the experimental protocol from Tian et al. (2018).

C.5 Additional qualitative results

In Sec. 4.2 of the main paper, we demonstrate the ability of Drop-DTW to tackle multi-step localization and compare it to alternative alignment-based approaches. In Fig. 4 of the main paper, we provide a qualitative comparison between various alignment-based approaches when each algorithm is used both at training and inference time. Here, we provide more such qualitative results in Fig. 7. Collectively, these results demonstrate Drop-DTW's unique capability to locate subsequences with interspersed outliers. Moreover, we show that Drop-DTW is versatile enough to also handle situations with no interspersed outliers (e.g., see the Make French Toast example in Fig. 7). In addition, in Fig. 8, we provide qualitative results showing the advantage of using Drop-DTW at inference time even on representations learned with other alignment-based approaches. These last results show that Drop-DTW is a valuable inference-time tool for subsequence localization. A visual demo of the subsequence localization application is provided in the supplemental video.

Figure 7: Step localization with DTW variants used for training and inference. In each panel, rows two to four show step assignment results when the same alignment method is used for training and inference. Drop-DTW identifies interspersed unlabelled clips and much more closely approximates the ground truth.
Figure 8: Step localization with DTW variants used for training only, while always using Drop-DTW for inference. In each panel, rows two to four show step assignment results when different alignment methods are used for training, but Drop-DTW is used for inference in all cases. The top part of the figure (highlighted in green) illustrates scenarios where using Drop-DTW both during training and inference is most beneficial, while the two bottom examples (highlighted in red) do not show a clear advantage of Drop-DTW at training time but clearly show the benefit of using it at inference time. Qualitative results in this figure correspond to the quantitative results in Table 2 of the main paper.