
A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

Identifying a short segment in a long video that semantically matches a text query is a challenging task with important potential applications in language-based video search, browsing, and navigation. Typical retrieval systems respond to a query with either a whole video or a pre-defined video segment, but it is challenging to localize undefined segments in untrimmed and unsegmented videos, where exhaustively searching over all possible segments is intractable. The outstanding challenge is that the representation of a video must account for different levels of granularity in the temporal domain. To tackle this problem, we propose the HierArchical Multi-Modal EncodeR (HAMMER), which encodes a video at both the coarse-grained clip level and the fine-grained frame level to extract information at different scales based on multiple subtasks, namely, video retrieval, segment temporal localization, and masked language modeling. We conduct extensive experiments to evaluate our model on moment localization in video corpus on the ActivityNet Captions and TVR datasets. Our approach outperforms the previous methods as well as strong baselines, establishing a new state of the art for this task.





1 Introduction

With over 70% of current internet traffic being video data [networking2016cisco], a growing number of videos are being created, shared, and consumed over time. To effectively and efficiently search, browse, and navigate video content, an intelligent system needs to understand the rich and complex semantic information in it. For this type of use case, the recently proposed task of moment localization in video corpus (MLVC) highlights several challenges in semantic understanding of videos [escorcia2019temporal, lei2020tvr]. The goal of MLVC is to find a video segment that corresponds to a text query from a corpus of untrimmed and unsegmented videos, with a significant amount of variation in factors such as the type of content, length, visual appearance, quality, and so on.

This task can be seen as “finding a needle in a haystack”. It is different from searching videos with broad queries such as genres or names of artists. Instead, the text query needs to be semantically congruent with a relatively short segment in a much longer target video. For example, the query “LeBron James shot over Yao Ming” matches only a few seconds of a game that lasts hours. Thus, MLVC requires semantic understanding of videos at a more fine-grained level than video retrieval, which typically only targets the whole video. Furthermore, finding the corresponding segment for a text query requires combing through all videos in a corpus and all possible segments in each video. For a large corpus with long videos, such an exhaustive search is infeasible because its computational cost grows with the square of the (average) number of frames per video.

In this paper, we address this challenge by representing videos at multiple scales of granularity. At the coarse-grained level, the representation captures semantic information in a video over long temporal spans (e.g., clips), allowing us to retrieve the most relevant set of videos for a text query. At the fine-grained level, the representation captures semantic information in short temporal spans (e.g., frames) to allow for precise localization of the most relevant video segments among the retrieved videos.

We propose a novel hierarchical multi-modal encoder (hammer) to implement this idea. hammer uses cross-modal attention to combine the information between the text and visual modalities. The cross-modal learning occurs hierarchically at 3 scales: frame, clip, and video (as a whole). Frames are the most fine-grained building blocks of a video. Each clip consists of a non-overlapping set of frames with equal length, and is in turn the building block of the final video-level representation. The architecture of the model is illustrated in Fig. 1. The frame-level representation is obtained from a text-visual cross-modal encoder operated on video frames, while the clip-level representation is built upon the frame-level representation with a similar encoder.

Figure 1: Overview of the hammer model. The model contains two cross-modal encoders, a frame encoder and a clip encoder on top of it. The outputs of the model are contextualized frame-level and clip-level features, which are used by downstream task-specific modules, e.g. video retrieval and temporal localization.

The introduction of the clip-level representation encoder is important as it allows us to capture both coarse- and fine-grained semantic information. In contrast, existing approaches for MLVC [lei2020tvr, escorcia2019temporal] and other visual-language tasks [frome2013devise, kiros2014unifying, faghri2017vse++, chen2020learning] typically pack information of different granularity into a single vector embedding, making it hard to balance the differing demands between retrieving a long video and localizing a short segment.

We apply hammer to the MLVC task on two large-scale datasets, ActivityNet Captions [krishna2017dense] and TVR [lei2020tvr]. We train it with a multi-task approach combining three objectives: video retrieval, temporal localization, and auxiliary masked language modeling. Our experiments demonstrate the efficacy of hammer and establish state-of-the-art performance on all the tasks simultaneously: video retrieval, moment localization in a single video, and moment localization in video corpus. To better understand the inner workings of our model, we compare it with a strong flat baseline, a video encoder without any hierarchical representation. Since longer videos tend to be less homogeneous, it becomes decidedly important to represent videos at multiple levels of granularity. Our analysis shows that the performance of the flat baseline declines as the number of frames irrelevant to the text query increases. In contrast, the performance of our proposed hammer model is robust to the length of the videos, showing that our hierarchical approach is not affected by the increase of irrelevant information and can flexibly handle longer videos.

Our contributions are summarized as follows:

  • We propose a novel model architecture hammer that represents videos hierarchically and improves video modeling at long-term temporal scales.

  • We demonstrate the efficacy of hammer on two large-scale datasets, i.e., ActivityNet Captions and TVR, outperforming previous state-of-the-art methods.

  • We carry out a detailed analysis of hammer and show that it particularly improves the performance of video retrieval over strong baselines on long videos.

  • We conduct a thorough ablation study to understand the effects of different design choices in our model.

2 Related Work

Most existing MLVC approaches consider text-based video retrieval [xu2015video2txt, dong2016word2visualvec, venugopalan2015vid2text, Pan_2016_CVPR, mithun2018crossmodal] and temporal localization [hendricks2017localizing, gao2017tall, xu2018text, liu2018moment, chen2018temporally, regneri-etal-2013-grounding] as separate tasks.

Video Retrieval (VR) is a task that ranks candidate videos based on their relevance to a descriptive text query. Standard cross-modal retrieval methods [krishna2017dense, venugopalan2014translating] represent video and text as two holistic embeddings, which are then used to compute the similarity that serves as the ranking score. When the text query is a lengthy paragraph, hierarchical modeling is applied to both modalities separately [zhang2018cross, shao2018find], leading to a significant improvement in the performance of text-based video retrieval. Different from prior work, in this study we consider a more realistic problem where we use a single query sentence that describes only a small segment to retrieve the entire video. For instance, the text query “Add the onion paste to the mixture” may correspond to a temporal segment of a few seconds in a long cooking video.

Temporal Localization (TL) aims at localizing a video segment (usually a short fraction of the entire video) described by a query sentence inside a video. Two types of methods have been proposed to tackle this challenge, namely the top-down (or proposal-based) approach [hendricks2017localizing, xu2018text, gao2017tall] and the bottom-up (or proposal-free) approach [chen2018temporally, lu2019debug, chen2020rethinking, yuan2019semantic, yuan2019find]. The top-down approach first generates multiple clip proposals before matching them with a query sentence to localize the most relevant clip from the proposals. The bottom-up approach first calculates a query-aware video representation as a sequence of features, then predicts the start and end times of the clip described by the query.

Moment Localization in Video Corpus (MLVC) was first proposed by Escorcia et al. [escorcia2019temporal]. They consider a practical setting where models are required to jointly retrieve videos and localize moments corresponding to natural language queries from a large collection of untrimmed and unsegmented videos. They devised a top-down localization approach that compares text embeddings on uniformly partitioned video chunks. Recently, Lei et al. [lei2020tvr] proposed a new dataset, TVR, that considers a similar task called Moment Retrieval in Video-Subtitle Corpus, which requires a model to align a query text temporally with a video clip, using multi-modal information from video and subtitles (derived from Automatic Speech Recognition or ASR).

3 Method

We first describe the problem setting of MLVC and introduce the notations in §3.1. In §3.2, we describe a general strategy of decomposing MLVC into two sub-tasks, VR and TL [xu2019multilevel, lu2019debug]. The main purpose is to reduce computation and to avoid the need to search all possible segments of all videos. In §3.3, we present a novel HierArchical Multi-Modal EncodeR (hammer) model, and in §3.4 we describe how it is trained. Finally, we describe key details for inference in §3.5.

3.1 Problem Setting and Notations

We represent a video $v$ as a sequence of frames $\{x_t \mid t = 1, \ldots, T\}$, where $x_t$ is a visual feature vector representing the $t$-th frame. Given a text query $h$ (e.g., a sentence), our goal is to learn a parameterized function (i.e., a neural network) that accurately estimates the conditional probability $p(s \mid h)$, where $s$ is a video segment given by $s = \{x_t \mid t = t_s, \ldots, t_e\}$. Here $t_s$ and $t_e$ stand for the indices of the starting and the ending frames of the segment in a video $v$. Note that for a video corpus $\mathcal{V}$ with an average length of $\bar{T}$ frames, the number of all possible segments is $O(|\mathcal{V}| \times \bar{T}^2)$. Thus, in a large corpus, exhaustive search for the best segment corresponding to $h$ is not feasible. In what follows, we describe how to address this challenge.
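To make the scale concrete, the following minimal Python sketch counts the candidate segments an exhaustive search would have to score. It assumes a segment is any pair of frame indices with the end strictly after the start. The function names and the example corpus (1,000 videos of 128 frames each) are illustrative assumptions, not values taken from the paper.

```python
def num_segments(num_frames: int) -> int:
    """Count segments (t_s, t_e) with 0 <= t_s < t_e < num_frames."""
    return num_frames * (num_frames - 1) // 2


def corpus_segments(video_lengths) -> int:
    """Total number of candidate segments over a corpus of videos."""
    return sum(num_segments(t) for t in video_lengths)


# Hypothetical corpus: 1,000 videos, each with 128 frames.
lengths = [128] * 1000
print(num_segments(128))         # candidate segments in one video
print(corpus_segments(lengths))  # candidate segments in the whole corpus
```

The count grows quadratically in the number of frames per video and linearly in the number of videos, which is the growth rate the paragraph above refers to.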

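To localize the moment that best corresponds to a text query $h$, we need to identify

$$\hat{s} = \arg\max_{s} p(s \mid h) = \arg\max_{s} \sum_{v \in \mathcal{V}} p(s \mid v, h)\, p(v \mid h). \tag{1}$$

Note that the conditional probability is factorized into two components. If we assume $s$ uniquely belongs to only one video in the corpus $\mathcal{V}$, then the marginalization over the video $v$ is vacuous and can be discarded. This leads to

$$(\hat{v}, \hat{s}) = \arg\max_{v \in \mathcal{V},\; s \in v} p(s \mid v, h)\, p(v \mid h). \tag{2}$$

The training data are available in the form of $(h, v, s)$, where $s$ is the matched segment to the query $h$.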

3.2 Two-Stage MLVC: Retrieval and Localization

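As mentioned above, inference with Eq. (2) is infeasible for large-scale corpora and/or long videos. Thus, we approximate it by first selecting the most likely video and then the most likely segment within that video:

$$\hat{v} = \arg\max_{v \in \mathcal{V}} p(v \mid h), \qquad \hat{s} = \arg\max_{s \in \hat{v}} p(s \mid \hat{v}, h). \tag{3}$$

This approximation allows us to build two different learning components and stage them together to solve MLVC. This approach has been applied in a recent work on the task [lei2020tvr]. We give a formal summary below.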

Video Retrieval (VR) identifies the best video by minimizing the negative log-likelihood

$$\ell_{\mathrm{VR}} = -\log p(v \mid h), \tag{4}$$

where $v$ is the ground-truth video for the text query $h$. This is a rather standard (cross-modal) retrieval problem, which has been widely studied in the literature. (See §2 for some references.)

Temporal Localization (TL) models $p(s \mid v, h)$. While it is possible to model all $O(T^2)$ possible segments in a video with $T$ frames, we choose to model it with the probabilities of identifying the correct starting ($t_s$) and ending ($t_e$) frames:

$$p(s \mid v, h) \approx p(t_s \mid v, h)\, p(t_e \mid v, h)\, \mathbb{I}[t_e > t_s]. \tag{5}$$

Here, we consider $t_s$ and $t_e$ to be independent to efficiently approximate $p(s \mid v, h)$. The indicator function $\mathbb{I}[\cdot]$ simply stipulates that the ending frame needs to be after the starting frame.
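The factorization into independent start and end probabilities means the best valid segment can be found in a single pass over the frames rather than by enumerating all pairs. The sketch below illustrates this under stated assumptions: `p_start` and `p_end` are hypothetical per-frame probability lists (zero-based indices), and `best_span` is an illustrative helper rather than the authors' implementation.

```python
def best_span(p_start, p_end):
    """Return (t_s, t_e, score) maximizing p_start[t_s] * p_end[t_e]
    subject to t_e > t_s, or None if the video has fewer than 2 frames."""
    best = None
    best_start = 0  # index of the largest p_start among frames before t_e
    for t_e in range(1, len(p_end)):
        # Frame t_e - 1 has just become an admissible start for this t_e.
        if p_start[t_e - 1] > p_start[best_start]:
            best_start = t_e - 1
        score = p_start[best_start] * p_end[t_e]
        if best is None or score > best[2]:
            best = (best_start, t_e, score)
    return best


# Hypothetical per-frame probabilities for a 4-frame video.
p_start = [0.1, 0.6, 0.2, 0.1]
p_end = [0.5, 0.1, 0.1, 0.3]
t_s, t_e, score = best_span(p_start, p_end)
print(t_s, t_e)
```

In this example the unconstrained maximum of the end probability is at frame 0, which precedes the most likely start frame; the indicator constraint rules that pair out.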

To model each of the factors, we treat it as a frame classification problem, annotating each frame with one of three possible labels: BEGIN and END mark the starting and ending frames respectively, and all other frames are labeled OTHER. We refer to this as the B, E, O classification scheme. During training, we optimize (the sum of) the frame-wise cross-entropy between the model’s predictions and the labels. We denote the training loss as

$$\ell_{\mathrm{TL}} = -\sum_{t=1}^{T} \log p(y_t \mid v, h), \tag{6}$$

where $y_t \in \{\mathrm{B}, \mathrm{E}, \mathrm{O}\}$ is the true label for the frame $x_t$ of the video $v$, and $p(y_t \mid v, h)$ is the corresponding prediction of the model.

This type of labeling scheme has been widely used in the NLP community, for example, recently for span-based question answering [joshi2020spanbert, fevry2020entities].
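As a small illustration of the B, E, O scheme and the frame-wise cross-entropy, the following Python sketch builds the label sequence for one annotated segment and sums the negative log-probabilities of the true labels. The helper names and the dictionary-based representation of predictions are assumptions made for this example only.

```python
import math


def beo_labels(num_frames, t_start, t_end):
    """Label the start frame B, the end frame E, and every other frame O."""
    assert 0 <= t_start < t_end < num_frames
    labels = ["O"] * num_frames
    labels[t_start] = "B"
    labels[t_end] = "E"
    return labels


def frame_cross_entropy(pred_probs, labels):
    """Sum over frames of -log(probability assigned to the true label).

    pred_probs[t] maps each label in {"B", "E", "O"} to a probability.
    """
    return -sum(math.log(p[y]) for p, y in zip(pred_probs, labels))


labels = beo_labels(num_frames=5, t_start=1, t_end=3)
print(labels)

# Hypothetical uniform predictions: every label gets probability 1/3.
uniform = [{"B": 1 / 3, "E": 1 / 3, "O": 1 / 3} for _ in labels]
print(frame_cross_entropy(uniform, labels))
```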

3.3 HierArchical Multi-Modal Encoder (hammer)

Our first contribution is to introduce the hierarchical modeling approach to parameterize the conditional probability for the VR sub-task and the labeling model for the TL sub-task. In the next section, we describe novel learning algorithms for training our model.

Main idea Video and text are complex and structural objects. They are naturally in “temporally” linear orders of frames and words. More importantly, semantic relatedness manifests in both short-range and long-range contextual dependencies. To this end, hammer infuses textual and visual information hierarchically at different temporal scales. Figure 1 illustrates the architecture of hammer. A key element here is to introduce cross-modal attention at both the frame level and the clip level.

Clip-level Representation We introduce an intermediate-level temporal unit with a fixed length of $M$ frames, and refer to it as a clip $c_k = \{x_t \mid t = (k-1)M + 1, \ldots, kM\}$, where $k = 1, \ldots, \lceil T/M \rceil$. As such, a video can also be hierarchically organized as a sequence of non-overlapping video clips $v = \{c_k\}$. $M$ is a hyper-parameter to be adjusted for different tasks and datasets. We emphasize that, while segments and clips are sometimes used interchangeably, we use “segment” for a set of frames that is also the visual grounding of a text query, and “clip” for a collection of temporally contiguous frames. We treat clips as memory slots that hold aggregated lower-level semantic information from frames.
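The partition of a frame sequence into non-overlapping clips can be sketched as follows. This is a minimal illustration, assuming zero-based frame indices and a final clip that may be shorter when the number of frames is not a multiple of the clip length; how the model pads such a clip is not specified here. The function name is hypothetical.

```python
def split_into_clips(frames, clip_len):
    """Split a frame sequence into consecutive, non-overlapping clips."""
    if clip_len <= 0:
        raise ValueError("clip_len must be positive")
    return [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]


# 128 frames with a clip length of 32 (one of the settings studied in the
# clip-length ablation) give 4 clips of 32 frames each.
frames = list(range(128))
clips = split_into_clips(frames, clip_len=32)
print(len(clips), [len(c) for c in clips])
```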

Cross-modal Transformers hammer has two cross-modal Transformers. At the frame level, the frame encoder $\Phi$ takes as input both the frame sequence of a video clip $c_k$ and the text sequence of a query $h$, and outputs the contextualized visual frame features $\{\phi(x_t)\}$ for each clip $c_k$. The frame encoder encodes the local and short-range contextual dependencies among the frames of the same clip.

We also introduce a Clip CLS Token ($\mathrm{CCLS}_k$) for each $c_k$ [lu2019vilbert]. The contextual embedding of this token gives the representation of the clip:

$$\phi_k = \Phi(\mathrm{CCLS}_k;\, c_k, h). \tag{7}$$

Contextual embeddings for all clips are then fed into a higher-level clip encoder $\Psi$, also with cross-modal attention to the input text, yielding a set of contextualized clip representations

$$\{\psi_k\} = \Psi(\{\phi_k\};\, h). \tag{8}$$

Note that $\psi_k$ now encodes the global and longer-range contextual dependencies among all frames (through clips).

Footnote 1: Alternatively, we can summarize it (into a vector, in lieu of the set) through various reduction operations such as pooling or introducing a video-level CLS token VCLS.

To summarize, our model has 3 levels of representations: the contextualized frames $\{\phi(x_t)\}$, the clips $\{\phi_k\}$, and the entire video $\{\psi_k\}$. Next, we describe how to use them to form our learning algorithms.

3.4 Learning hammer for MLVC

The different levels of representation give us the flexibility to model the two subtasks (VR and TL) with semantic information across different temporal scales.

Modeling Video Retrieval We use the contextualized clip representations $\{\psi_k\}$ to compute the video-query compatibility score for a query $h$ and its corresponding video $v$. In order to retrieve as many of the likely relevant videos as possible, we need a coarse-grained matching that focuses more on higher-level semantic information.

Specifically, we identify the best match among all clip embeddings and use it as the matching score for the whole video:

$$p(v \mid h) \propto \exp\Big(\max_{k}\; \theta_{\mathrm{VR}}^{\top} \psi_k\Big), \tag{9}$$

where $\theta_{\mathrm{VR}}$ is a linear projection to extract the matching scores. The conditional probability is normalized with respect to all videos in the corpus (though in practice, a set of positive and negative ones).

Footnote 2: An alternative design is to pool all $\psi_k$ and then perform a linear projection. However, this type of pooling has the disadvantage that a short but relevant segment, say within a single clip, can be overwhelmed by all other clips. Empirically, we also find the current formalism works better. A similar finding is reported in [zhang2018cross].
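The contrast between taking the best-matching clip and pooling over all clips can be illustrated with a toy example. The sketch below is a simplified, assumption-laden illustration: clip embeddings are hand-written 2-D vectors, `theta` is a hypothetical projection vector, mean pooling stands in for the pooled alternative, and a softmax over the candidate videos is assumed as the normalization.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def max_clip_score(clip_embeddings, theta):
    """Video score = best projected score among its clip embeddings."""
    return max(dot(theta, psi) for psi in clip_embeddings)


def mean_pool_score(clip_embeddings, theta):
    """Alternative: average the clip embeddings, then project."""
    n = len(clip_embeddings)
    pooled = [sum(psi[d] for psi in clip_embeddings) / n for d in range(len(theta))]
    return dot(theta, pooled)


def softmax(scores):
    """Normalize scores over the candidate videos (assumed normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


theta = [1.0, 0.0]
# Video A: one highly relevant clip among three irrelevant ones.
video_a = [[0.0, 0.0], [3.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
# Video B: every clip is mildly relevant.
video_b = [[1.0, 0.0]] * 4

max_scores = [max_clip_score(v, theta) for v in (video_a, video_b)]
mean_scores = [mean_pool_score(v, theta) for v in (video_a, video_b)]
probs = softmax(max_scores)
print(max_scores, mean_scores, probs)
```

With the maximum, video A (which contains the one strongly matching clip) ranks first; with mean pooling, the three irrelevant clips dilute its score and video B ranks first.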

Modeling Temporal Localization As in the previous section, we treat localization as classifying each frame $x_t$ into B, E, or O:

$$p(y_t = y \mid v, h) = p(y \in c_k \mid v, h)\; p(y_t = y \mid y \in c_k, v, h), \tag{10}$$

where $c_k$ is the clip that contains $x_t$. Note that each frame can belong to only one clip, so there is no need to marginalize over $k$. The first factor measures the likelihood of $c_k$ containing the label $y$ in one of its frames. The second factor measures the likelihood that the specific frame $x_t$ is labeled as $y$. Clearly, these two factors are on different semantic scales and are thus modeled separately:


where TCLS is a text CLS token summarizing the query embedding.

Masked Multi-Modal Model Masked language modeling has been widely adopted as a pre-training task for language modeling [lu2019vilbert, sun2019videobert, devlin2018bert]. The main idea is to backfill a masked text token from its contexts, i.e., the other tokens in a sentence.

The multi-modal modeling task in this paper can similarly benefit from this idea. During training, we randomly mask some text tokens. We expect the model to achieve two things: (1) retrieving and localizing with the partially masked text query, which acts as a regularization mechanism; and (2) better text grounding by recovering the masked tokens with the assistance of the multi-modal context, i.e., both the textual context and the visual information in the frames and the clips.

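To incorporate a masked query into the loss functions $\ell_{\mathrm{VR}}$ and $\ell_{\mathrm{TL}}$ of the model, we apply $h \odot m$ to replace $h$, where $m$ is a binary mask vector for text tokens, $\mathbf{1}$ is a one-valued vector of the same size, and $\odot$ indicates element-wise multiplication. We introduce another loss to backfill the missing tokens represented by $h \odot (\mathbf{1} - m)$:

$$\ell_{\mathrm{MLM}} = -\log p\big(h \odot (\mathbf{1} - m) \mid v,\; h \odot m\big). \tag{12}$$

This probability is computed using both the frame encoder's representations for frames and the clip encoder's representations for clips.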
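The masking step can be sketched in a few lines of Python. This is an illustrative assumption-based example: the masking probability of 0.15, the `[MASK]` placeholder string, and the function name are not taken from the paper. The returned binary vector plays the role of the mask over text tokens (1 = token kept, 0 = token masked), and the returned dictionary holds the tokens the model would be asked to recover.

```python
import random


def mask_query(tokens, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Randomly mask text tokens.

    Returns (masked_tokens, keep_mask, targets), where keep_mask[i] is 1 if
    token i is kept and 0 if it is masked, and targets maps each masked
    position to its original token.
    """
    rng = rng or random.Random()
    keep_mask = [0 if rng.random() < mask_prob else 1 for _ in tokens]
    masked = [tok if keep else mask_token for tok, keep in zip(tokens, keep_mask)]
    targets = {i: tok for i, (tok, keep) in enumerate(zip(tokens, keep_mask)) if not keep}
    return masked, keep_mask, targets


query = "add the onion paste to the mixture".split()
masked, keep_mask, targets = mask_query(query, mask_prob=0.5, rng=random.Random(0))
print(masked)
print(keep_mask)
print(targets)
```

The masked query would be fed to the retrieval and localization losses in place of the original query, while `targets` defines what the backfilling loss must predict.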

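Multi-Task Learning Objective We use a weighted combination of video retrieval, moment localization, and masked multi-modal modeling objectives as our final training objective:

$$\ell = \mathbb{E}_{m}\big[\lambda_{\mathrm{VR}}\,\ell_{\mathrm{VR}} + \lambda_{\mathrm{TL}}\,\ell_{\mathrm{TL}} + \lambda_{\mathrm{MLM}}\,\ell_{\mathrm{MLM}}\big], \tag{13}$$

where $\lambda_{\mathrm{VR}}$, $\lambda_{\mathrm{TL}}$, and $\lambda_{\mathrm{MLM}}$ are the task weights and the expectation is taken with respect to random masking. Since the VR and TL subtasks share the same model and output representations, the final objective needs to balance different goals and is multi-tasking in nature. We provide a detailed ablation study in §4 to analyze the choice of weights.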
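A weighted combination of the three losses for a single training example can be written as below. This is only a sketch: the function name is hypothetical, the loss values are made-up numbers, and the default weights (1.0 for VR, 5.0 for TL, 0.1 for MLM) are the best-performing setting reported in the task-weight ablation of §4.3, used here as an assumption about a sensible default.

```python
def multitask_loss(loss_vr, loss_tl, loss_mlm, w_vr=1.0, w_tl=5.0, w_mlm=0.1):
    """Weighted sum of the retrieval, localization, and masked-token losses."""
    return w_vr * loss_vr + w_tl * loss_tl + w_mlm * loss_mlm


# Hypothetical per-example loss values.
print(multitask_loss(loss_vr=2.0, loss_tl=1.0, loss_mlm=3.0))
```

In training, this quantity would be averaged over examples and over random draws of the text mask.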

3.5 Two-stage Inference with hammer

For inference with hammer, we perform two sequential stages, i.e., video retrieval and temporal localization, to accomplish the task of moment localization in video corpus. For video retrieval, we use hammer and the linear regressor to compute pairwise compatibility scores as in Eq. (9) between the text query and every video in the corpus. Next, we perform temporal localization on the top-ranked videos. Specifically, we predict the start and end frames with hammer to localize the temporal segment following Eq. (5): we greedily label the frame with the maximum start (B) probability as the start frame and the frame with the maximum end (E) probability as the end frame, subject to the additional constraint that the predicted end frame must appear after the predicted start frame. This two-stage inference localizes segments only in the retrieved videos, which is significantly cheaper than exhaustively scoring all candidate segments of all videos in the corpus [escorcia2019temporal].
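The two stages can be sketched as a short pipeline. The example below is a toy illustration under stated assumptions: the retrieval scores and per-frame start/end probabilities are hard-coded stand-ins for model outputs, `top_k` is a hypothetical parameter, and the greedy rule shown (pick the most likely start, then the most likely end among later frames) is one plausible way to enforce the ordering constraint, not necessarily the authors' exact procedure.

```python
def greedy_span(p_start, p_end):
    """Pick the most likely start frame, then the most likely end frame
    among frames strictly after it. Requires at least 2 frames."""
    # The last frame cannot be a start, since an end must follow it.
    t_s = max(range(len(p_start) - 1), key=p_start.__getitem__)
    t_e = max(range(t_s + 1, len(p_end)), key=p_end.__getitem__)
    return t_s, t_e


def two_stage_inference(video_scores, localize, top_k=10):
    """Stage 1: rank videos by score. Stage 2: localize in the top_k only."""
    ranked = sorted(video_scores, key=video_scores.get, reverse=True)[:top_k]
    return [(vid, *localize(vid)) for vid in ranked]


# Hypothetical model outputs for a 3-video corpus with 4 frames per video.
scores = {"v1": 0.2, "v2": 0.9, "v3": 0.5}
frame_probs = {
    "v1": ([0.1, 0.7, 0.1, 0.1], [0.6, 0.1, 0.2, 0.1]),
    "v2": ([0.5, 0.2, 0.2, 0.1], [0.1, 0.1, 0.1, 0.7]),
    "v3": ([0.25] * 4, [0.25] * 4),
}

results = two_stage_inference(
    scores, lambda vid: greedy_span(*frame_probs[vid]), top_k=2
)
print(results)
```

Only the `top_k` retrieved videos are passed to the localization step, so the per-frame predictions for the remaining videos are never needed.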

4 Experiments

In this section, we perform experiments with the proposed hammer model. We first introduce the datasets and setups of our experiments in §4.1. Next, we present the main results of the hammer model in §4.2, contrasting it against a strong flat baseline as well as other existing methods. We then conduct a thorough ablation study in §4.3 to evaluate the importance of various design choices for the hammer model. Finally, we carry out a qualitative analysis of our model to better demonstrate its behaviour.

4.1 Experimental Setups

Datasets We experiment on two popular MLVC datasets:

  • ActivityNet Captions [krishna2017dense] contains 20K videos, each with 3.65 temporally localized query sentences on average. The mean video duration is 180 seconds, and the average query is 13.48 words long and spans 36 seconds of the video. There are 10,009 videos for training and 4,917 videos for validation (val_1 split). We follow prior work [escorcia2019temporal, hendricks2018localizing] to train our models and evaluate them on the val_1 split.

  • TVR [lei2020tvr] contains 22K videos in total, of which 17.5K videos are in the training set and 2,180 are in the validation set. The dataset contains videos from movies, cartoons, and TV-shows. The videos are on average 76.2 seconds long and contain 5 temporally localized sentences per video. The moments in the videos are 9.1 seconds long and described by sentences containing 13.4 words on average. We make use of the subtitle (ASR) features together with the video feature in TVR dataset, following prior works [lei2020tvr, li2020hero].

We use multiple popular choices of video features on these two datasets, following the existing literature [escorcia2019temporal, hendricks2018localizing, lei2020tvr]. These include appearance-only features (ResNet152 [he2016deep] pre-trained on ImageNet [deng2009imagenet]), spatio-temporal features (I3D [carreira2017quo] pre-trained on Kinetics [kay2017kinetics]), and their combinations. We present the details of feature preparation in the Suppl. Material.

Evaluation Metrics

We use different evaluation metrics for different video understanding tasks:

  • Video Retrieval (VR) We report Recall@k and Median Rank (MedR or MedRank) as the evaluation metrics for Video Retrieval as suggested in the literature.

  • Temporal Localization (TL) We report both mean IoU (mIoU) and average precision with IoU={0.3, 0.5, 0.7} as the evaluation metrics. Here, IoU measures the Intersection over Union between the ground truth and predicted video segments, i.e., the localization accuracy.

  • Moment Localization in Video Corpus (MLVC) We use Recall@k at a given IoU threshold as the main evaluation metric [escorcia2019temporal, lei2020tvr]. Specifically, we measure whether a correctly localized segment exists in the top k of the ranked videos. Here, a localized segment is correct if its IoU with the ground truth segment exceeds a threshold of {0.5, 0.7}.
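The localization metrics above can be computed with a few lines of code. The following Python sketch shows temporal IoU and a corpus-level Recall@k at an IoU threshold. The data layout (a ranked list of `(video_id, start, end)` predictions per query) and the use of a greater-or-equal comparison at the threshold are assumptions made for this example; the example predictions are made up.

```python
def temporal_iou(pred, gt):
    """Intersection over Union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truth, k, iou_threshold):
    """Fraction of queries whose top-k predictions contain a segment from
    the correct video with IoU >= iou_threshold."""
    hits = 0
    for query, (gt_vid, gt_start, gt_end) in ground_truth.items():
        top_k = predictions.get(query, [])[:k]
        if any(
            vid == gt_vid
            and temporal_iou((start, end), (gt_start, gt_end)) >= iou_threshold
            for vid, start, end in top_k
        ):
            hits += 1
    return hits / len(ground_truth)


# Hypothetical ranked predictions and ground truth for two queries.
predictions = {
    "q1": [("v2", 0.0, 4.0), ("v1", 10.0, 20.0)],
    "q2": [("v3", 0.0, 5.0)],
}
ground_truth = {"q1": ("v1", 12.0, 20.0), "q2": ("v3", 4.0, 12.0)}

print(recall_at_k(predictions, ground_truth, k=1, iou_threshold=0.7))
print(recall_at_k(predictions, ground_truth, k=2, iou_threshold=0.7))
```

A prediction counts only if it is both in the right video and sufficiently overlapping, which is why retrieval errors and localization errors both lower this metric.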

Baseline and the hammer Models In hammer, we use two encoders, i.e., the frame and clip encoders, with multiple Transformer [Vaswani2017Attention] layers to represent the visual (and ASR) features as well as the text query features (details in Figure 1). Each encoder contains 1 layer of Transformer for visual input, 5 layers of Transformers for the text query input, and 1 layer of cross-modal Transformer between the visual and text query inputs. When ASR is provided (i.e., in TVR), we add one additional Transformer layer to incorporate the ASR input, with another cross-modal Transformer layer that cross-attends between the query input and ASR features. The processed ASR and visual features are concatenated. Meanwhile, we design a flat model as a strong baseline. The flat model has a similar architecture as hammer, except that it only uses the frame encoder to capture the visual (and ASR) features. We provide complete details about the architectural configurations and model optimization in the Suppl. Material.

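| Dataset | Model | Feature | IoU=0.5 R1 | IoU=0.5 R10 | IoU=0.5 R100 | IoU=0.7 R1 | IoU=0.7 R10 | IoU=0.7 R100 |
|---|---|---|---|---|---|---|---|---|
| ActivityNet | mcn [hendricks2017localizing] | R | 0.02 | 0.18 | 1.26 | 0.01 | 0.09 | 0.70 |
| ActivityNet | cal [escorcia2019temporal] | R | 0.21 | 1.32 | 6.82 | 0.12 | 0.89 | 4.79 |
| ActivityNet | flat | R | 0.34 | 2.28 | 10.09 | 0.21 | 1.28 | 5.69 |
| ActivityNet | hammer | R | 0.51 | 3.29 | 12.01 | 0.30 | 1.87 | 6.94 |
| ActivityNet | flat | I | 2.57 | 13.07 | 30.66 | 1.51 | 7.69 | 17.67 |
| ActivityNet | hammer | I | 2.94 | 14.49 | 32.49 | 1.74 | 8.75 | 19.08 |
| TVR | xml [lei2020tvr] | I+R | - | - | - | 2.62 | 6.39 | 22.00 |
| TVR | hero* [li2020hero] | I+R | - | - | - | 2.98 | 10.65 | 18.25 |
| TVR | flat | I+R | 8.45 | 21.14 | 30.75 | 4.61 | 11.29 | 16.24 |
| TVR | hammer | I+R | 9.19 | 21.28 | 31.25 | 5.13 | 11.38 | 16.71 |

* We compare against their model without large-scale pre-training for fair comparison.

Table 1: MLVC Results on ActivityNet and TVR datasets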

4.2 MLVC Experiments

Main Results Table 1 presents a comparison between the proposed hammer and other methods on the two MLVC benchmarks. We observe that, irrespective of the feature types, hammer outperforms flat noticeably, which in turn outperforms most published results on both datasets. On ActivityNet, we observe that models using I3D features (denoted as I) outperform their counterparts with ResNet152 (denoted as R) features, by a significant margin. It indicates the importance of spatio-temporal feature representation in the MLVC tasks.

Meanwhile, we note that our flat model outperforms the baselines on the TVR dataset, which is mainly due to the introduction of the cross-modal Transformer between the query and the visual and ASR features (see §4.3 for a detailed study). On both datasets, hammer establishes new state-of-the-art results for the MLVC task (without using additional data). This result shows a clear benefit of hierarchical structure modeling in video for the MLVC task.

| Model | R1 | R10 | R100 | MedR |
|---|---|---|---|---|
| flat | 5.37 | 29.14 | 71.64 | 29 |
| hammer | 5.89 | 30.98 | 73.38 | 26 |

Table 2: VR results on ActivityNet Captions.

| Model | IoU=0.3 | IoU=0.5 | IoU=0.7 | mIoU |
|---|---|---|---|---|
| flat | 57.58 | 39.60 | 22.59 | 40.98 |
| hammer | 59.18 | 41.45 | 24.27 | 42.68 |

Table 3: TL results on ActivityNet Captions.

Tables 2 and 3 contrast hammer with the flat model in more detail by comparing their performance on the tasks of video retrieval and temporal localization separately. The results are reported on ActivityNet Captions with models using the I3D features. In both cases, hammer achieves better performance than the baseline flat model on every metric.

Comparing Models on Videos of Different Duration We discuss the potential reasons for hammer to outperform the flat model. Since hammer learns video representation at multiple granularities, we hypothesize that it should be able to focus on the task-relevant parts of a video without getting distracted by irrelevant parts. Specifically for the task of sentence-based video retrieval which requires matching the relevant frames in the video with the text query, hammer would be less sensitive to the presence of non-matching frames and hence be robust to the length of the video. To verify this, we analyze hammer’s performance on videos with different lengths for the task of video retrieval and temporal localization.

Figure 2: Comparison of Video Retrieval performances under different video duration. Results are reported in Median Rank (MedRank) on the ActivityNet Captions (Lower is better).
Figure 3: Comparison of Temporal Localization performances under different video duration. Results are reported in Mean IoU (mIoU) on the ActivityNet Captions (Higher is better).

We compare the performance of hammer and flat on videos with different durations for the video retrieval task in Fig. 2. The metric used for comparison is the median rank where lower numbers indicate better performance. Firstly, it can be observed that while the performance of flat model is inconsistent (e.g., performance on longest videos is worse than second-to-longest videos), the hammer model’s performance consistently improves with the length of the video. Secondly, the performance of the hammer model is best for the longest videos in the dataset. Finally, while both the models perform sub-optimally on the shortest videos, hammer still outperforms flat for those videos.

We further compare the temporal localization performance of the hammer and flat models in Fig. 3. The results are reported using mean IoU, where higher numbers indicate better performance. It shows that hammer consistently achieves higher performance than flat across all videos irrespective of their length.

Overall, the analysis shows a clear advantage of using hammer over flat, which is especially pronounced for longer videos, hence supporting our central modeling argument.

4.3 Ablation Studies and Analyses

In this section, we evaluate the effectiveness of learning objectives and various design choices for hammer. We note that all the ablation studies in this section are conducted on ActivityNet Captions using the I3D features.

4.3.1 Learning Objectives

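| VR | TL | FM | CM | VR R1 | VR R10 | VR R100 | TL IoU=0.5 | TL IoU=0.7 | TL mIoU |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | - |  |  | 4.93 | 29.02 | 72.15 | - | - | - |
| ✓ | - |  |  | 5.52 | 30.53 | 73.02 | - | - | - |
| ✓ | - |  |  | 5.45 | 30.45 | 73.24 | - | - | - |
| ✓ | - |  |  | 5.67 | 30.20 | 72.67 | - | - | - |
| - | ✓ |  |  | - | - | - | 39.02 | 22.74 | 40.28 |
| - | ✓ |  |  | - | - | - | 39.27 | 22.04 | 40.30 |
| - | ✓ |  |  | - | - | - | 39.13 | 22.38 | 40.51 |
| - | ✓ |  |  | - | - | - | 39.16 | 22.82 | 40.64 |
| ✓ | ✓ |  |  | 5.22 | 30.22 | 72.70 | 40.59 | 23.70 | 42.01 |
| ✓ | ✓ |  |  | 5.57 | 30.97 | 73.09 | 41.17 | 24.04 | 42.45 |
| ✓ | ✓ |  |  | 5.85 | 30.82 | 73.54 | 41.30 | 23.94 | 42.43 |
| ✓ | ✓ |  |  | 5.89 | 30.98 | 73.38 | 41.45 | 24.27 | 42.68 |

Table 4: Ablation study on sub-tasks (VR=Video Retrieval, TL=Temporal Localization, FM=Frame MLM, CM=Clip MLM)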

As mentioned above, hammer is optimized with three objectives jointly, namely video retrieval (VR), temporal localization (TL), and masked language modeling (MLM). We study the contribution of the different objectives, reported in Table 4. It is worth noting that here we differentiate between the MLM objective applied to the frame encoder (denoted as FM) and to the clip encoder (denoted as CM). Firstly, the objectives of VR and TL are complementary to each other, and jointly optimizing the two surpasses the single-task performance on both tasks simultaneously. Secondly, CM and FM applied individually each benefit both the VR and TL tasks, and using them in unison results in the best performance. This verifies the effectiveness of the MLM objective in improving the text representation. Finally, the best performance is achieved by combining all the objectives, which indicates that they are complementary.

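| VR weight | TL weight | MLM weight | IoU=0.5 R1 | IoU=0.5 R10 | IoU=0.5 R100 | IoU=0.7 R1 | IoU=0.7 R10 | IoU=0.7 R100 |
|---|---|---|---|---|---|---|---|---|
| 1.0 | 0.1 | 0.1 | 1.65 | 9.18 | 20.81 | 0.87 | 4.88 | 10.50 |
| 1.0 | 0.5 | 0.1 | 2.15 | 10.75 | 23.41 | 1.10 | 5.68 | 12.16 |
| 1.0 | 1.0 | 0.1 | 2.02 | 10.95 | 24.74 | 1.10 | 6.07 | 13.12 |
| 1.0 | 5.0 | 0.1 | 2.94 | 14.49 | 32.49 | 1.74 | 8.75 | 19.08 |
| 1.0 | 10.0 | 0.1 | 2.35 | 14.25 | 31.84 | 1.42 | 8.53 | 18.76 |

Table 5: Ablation study on task weights (VR=Video Retrieval, TL=Temporal Localization, MLM=Masked Language Model)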

Weights of Different Objectives We also conduct detailed experiments to investigate the influence of the weights of the different objectives (VR, TL, and MLM). Table 5 shows that it is important to balance the weights between VR and TL. The best performance is achieved when the weights for VR and TL are set to 1.0 and 5.0, respectively. For MLM, we find that the best loss weight is 0.1, and thus use this value throughout all our experiments.

4.3.2 Evaluating Design Choices of hammer

We study the importance of a few design choices in the hammer model. Specifically, we evaluate the following:

  • Effect of the cross-modal Transformer layer

  • Effect of different clip lengths for the clip representation

  • Effect of parameter sharing for frame and clip encoders

  • Effect of an additional clip-level position embedding

We present the results and discussion on these experiments in the following paragraphs.

Model    X-Modal   IoU=0.5               IoU=0.7
                   R1    R10    R100     R1    R10   R100
hammer   no        1.38   8.89  26.35    0.84  5.08  15.27
hammer   yes       2.94  14.49  32.49    1.74  8.75  19.08
Table 6: Ablation study on Cross-modal Transformer
Figure 4: Illustration of temporal localization using different hierarchies of hammer as well as the final hammer model

Cross-Modal Transformer is Essential. Both the frame and clip encoders contain one cross-modal (X-modal) Transformer layer between the text query and the video inputs. To verify its effectiveness, we compare against an ablated model without this layer. Table 6 shows that the X-modal Transformer more than doubles R1 at both IoU thresholds (from 1.38 to 2.94 at IoU=0.5 and from 0.84 to 1.74 at IoU=0.7) and improves R10 by more than 60% in relative terms, indicating that it is essential to the success of hammer.

Model    Clip Length   IoU=0.5               IoU=0.7
                       R1    R10    R100     R1    R10   R100
hammer   16            2.70  14.06  31.85    1.63  8.16  18.60
hammer   32            2.94  14.49  32.49    1.74  8.75  19.08
hammer   64            2.78  14.69  32.08    1.70  9.00  18.71
Table 7: Ablation study on different clip lengths

Optimal Length of the Clip-Level Representation. Recall that in hammer the frame encoder takes a clip with a fixed number of frames and outputs a clip-level representation. Here, we examine the performance under different clip lengths, summarized in Table 7. Overall, we observe that the model's performance is robust to the choice of clip length. With a maximum video length of 128, a clip length of 32 works best overall: it gives the highest R1 and R100 at both IoU thresholds, while a length of 64 is slightly better only on R10.
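To make the relation between clip length and the number of clips concrete, the following is a minimal sketch of cutting a frame sequence into fixed-length clips. The function name `split_into_clips`, the truncation to the maximum video length, and the choice to keep a shorter final clip are assumptions made for illustration, not details confirmed by the text.

```python
from typing import List, Sequence


def split_into_clips(frames: Sequence, clip_length: int, max_frames: int = 128) -> List[list]:
    """Truncate to max_frames, then cut into consecutive clips of clip_length.

    The last clip may be shorter than clip_length (an assumption of this sketch).
    """
    if clip_length <= 0:
        raise ValueError("clip_length must be positive")
    kept = list(frames)[:max_frames]
    return [kept[i:i + clip_length] for i in range(0, len(kept), clip_length)]


# Number of clips for the three clip lengths compared in the ablation.
for clip_length in (16, 32, 64):
    clips = split_into_clips(range(128), clip_length)
    print(clip_length, len(clips))
```

With a maximum video length of 128 frames, clip lengths of 16, 32, and 64 yield 8, 4, and 2 clips, respectively, so the clip encoder sees a shorter sequence as the clip length grows.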

Model    Weight Sharing   IoU=0.5               IoU=0.7
                          R1    R10    R100     R1    R10   R100
hammer   no               2.94  14.49  32.49    1.74  8.75  19.08
hammer   yes              2.89  14.17  30.31    1.69  8.05  17.24
Table 8: Ablation study on weight sharing

Parameter Sharing between Frame/Clip Encoders. We also consider whether the frame encoder and the clip encoder in the hammer model should share the same set of parameters, since weight sharing could regularize the model capacity and therefore improve generalization. However, Table 8 indicates that untying the encoder weights achieves slightly better performance, potentially thanks to the greater flexibility of separate encoders.

Model    Clip Position   IoU=0.5               IoU=0.7
                         R1    R10    R100     R1    R10   R100
hammer   no              2.82  14.39  32.01    1.76  8.59  18.63
hammer   yes             2.94  14.49  32.49    1.74  8.75  19.08
Table 9: Ablation study on clip position embeddings

Position Embedding for Clip Encoder. Position embedding is an important model input as it indicates the temporal boundary of each video frame segment. In the hammer model, since we also have a clip encoder that takes the aggregated “Clip CLS” token as input, it is natural to ask if we need a position encoding for each clip representation. Thus, we compare two models, with and without additional clip position encoding. Table 9 shows that clip position embedding is indeed important to achieve superior performance.

4.3.3 Qualitative Visualization

To better understand the behavior of the hammer model, we show a couple of examples of temporal localization. Figure 4 lists the predicted spans from the frame and clip encoders as well as from the entire hammer model. In both examples, we observe that the frame encoder of hammer makes an incorrect prediction of the temporal timestamps, which is then corrected by the prediction from the clip encoder. Overall, hammer makes more accurate predictions with respect to the ground-truth video segment.
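The comparison of predicted spans against the ground-truth segment can be made concrete with temporal IoU. The sketch below uses a common formulation and assumes that a prediction is a hypothetical `(video_id, start, end)` triple, and that it counts as correct only when it comes from the right video and its IoU with the ground truth reaches the threshold. The helper names are illustrative.

```python
from typing import List, Tuple

# (video_id, start_sec, end_sec); this layout is a hypothetical choice.
Moment = Tuple[str, float, float]


def temporal_iou(pred: Moment, gt: Moment) -> float:
    """Intersection over union of two time spans; 0 if the videos differ."""
    if pred[0] != gt[0]:
        return 0.0
    inter = max(0.0, min(pred[2], gt[2]) - max(pred[1], gt[1]))
    union = (pred[2] - pred[1]) + (gt[2] - gt[1]) - inter
    return inter / union if union > 0 else 0.0


def hit_at_k(ranked: List[Moment], gt: Moment, k: int, iou_threshold: float) -> bool:
    """True if any of the top-k ranked predictions matches the ground truth."""
    return any(temporal_iou(p, gt) >= iou_threshold for p in ranked[:k])


gt = ("v1", 10.0, 20.0)
ranked = [("v2", 10.0, 20.0), ("v1", 12.0, 22.0), ("v1", 0.0, 5.0)]
print(hit_at_k(ranked, gt, k=1, iou_threshold=0.5))  # False: top-1 is from the wrong video
print(hit_at_k(ranked, gt, k=2, iou_threshold=0.5))  # True: the second span has IoU 8/12
print(hit_at_k(ranked, gt, k=2, iou_threshold=0.7))  # False: 8/12 is below 0.7
```

Averaging `hit_at_k` over all queries would give a recall-at-k style number for a given IoU threshold.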

5 Conclusion

In this paper, we propose a hierarchical multi-modal encoder (hammer) that captures video dynamics at three scales of granularity: frame, clip, and video. By hierarchically modeling videos, hammer achieves significantly better performance than the baseline approaches on the task of moment localization in a video corpus, and further establishes a new state of the art on two challenging datasets, ActivityNet Captions and TVR. Extensive studies verify the effectiveness of the proposed architectures and learning objectives.

6 Supplementary Material

In this section, we provide additional implementation details and visualizations omitted from the main text.

6.1 Additional Implementation Details

Figure 5: Illustration of temporal localization results using hammer and its individual hierarchies, the frame and clip encoder. The top two are successful examples and the bottom one is a failed example.

6.1.1 Visual Feature Representation

We use three different kinds of visual features throughout our experiments, i.e., ResNet-152 [he2016deep], I3D [carreira2017quo], and their combination. On the ActivityNet Captions dataset, we report model performance with ResNet-152 for fair comparison with prior methods, and also report results using the widely used I3D features for comparison. On the TVR dataset, we follow the setting in [lei2020tvr] and report results using the concatenation of ResNet-152 and I3D features. The details about how to extract these features are specified below.

ResNet-152 Feature. For all of the frames in a given video, we extract 2048-dimensional features from the penultimate layer of a ResNet-152 [he2016deep] model pre-trained on the ImageNet dataset. For the ActivityNet Captions dataset, the frames are extracted at a fixed frame rate.

I3D Feature. We use an I3D model [carreira2017quo] to extract the spatio-temporal visual features (with a dimension of 1024). The I3D model used for feature extraction is pre-trained on the Kinetics-400 [kay2017kinetics] dataset. Similar to the setting for the ResNet-152 features, we take the features from the penultimate layer of the I3D model. For the ActivityNet Captions dataset, the I3D features are extracted at a fixed frame rate.

I3D+ResNet-152 Feature. For the TVR dataset, we use the I3D+ResNet-152 features provided by Lei et al. [lei2020tvr] to represent the visual information in the videos. The I3D and ResNet-152 models are pre-trained on the Kinetics-600 [carreira2018short] and ImageNet datasets, respectively. For both models, the features from the penultimate layers are used. The ResNet-152 features are extracted at a fixed frame rate and max-pooled over each 1.5-second clip. The I3D features are extracted every 1.5 seconds as well. The two sets of features are then concatenated to form the combined 3072-dimensional feature.
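The pooling and concatenation described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the extraction pipeline of Lei et al.: the function names, the `frames_per_clip` parameter, the toy dimensions, and the concatenation order are all hypothetical.

```python
from typing import List

Vector = List[float]


def max_pool(vectors: List[Vector]) -> Vector:
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def pool_frames(frame_feats: List[Vector], frames_per_clip: int) -> List[Vector]:
    """Max-pool consecutive groups of frames_per_clip frame-level features."""
    return [
        max_pool(frame_feats[i:i + frames_per_clip])
        for i in range(0, len(frame_feats), frames_per_clip)
    ]


def concat_features(first: List[Vector], second: List[Vector]) -> List[Vector]:
    """Concatenate two time-aligned feature sequences along the feature axis."""
    if len(first) != len(second):
        raise ValueError("feature sequences must be time-aligned")
    return [a + b for a, b in zip(first, second)]


# Toy data: 6 frame-level vectors (2-d) and 2 clip-level vectors (1-d).
frame_level = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.0, 0.0], [5.0, 1.0], [2.0, 2.0]]
clip_level = [[0.1], [0.2]]

pooled = pool_frames(frame_level, frames_per_clip=3)
combined = concat_features(clip_level, pooled)
print(combined)  # [[0.1, 3.0, 2.0], [0.2, 5.0, 2.0]]
```

In the toy example the frame-level stream stands in for the image features and the clip-level stream for the spatio-temporal features; after pooling, both have one vector per clip, so they can be concatenated step by step.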

6.1.2 Subtitle (ASR) Feature Representation

Previous work [lei2020tvr] has demonstrated that subtitles (e.g., extracted from ASR) can complement the visual information in video and language tasks. For the TVR dataset, we follow the standard setting and use the pre-extracted ASR embeddings provided by Lei et al. [lei2020tvr] as an additional input to our models. Contextualized token-level subtitle embeddings are first generated using a 12-layer RoBERTa [Liu2019roberta] model fine-tuned on the TVR train split. The token embeddings are then max-pooled every 1.5 seconds to get an aggregated 768-dimensional feature vector. A zero vector of the same dimensionality is used for frames without corresponding subtitles. The resulting subtitle embeddings are temporally aligned to the visual features (I3D+ResNet-152), allowing us to combine the two modalities later in the cross-modal encoders. We refer the reader to Lei et al. [lei2020tvr] for more details on the feature extraction process.
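A minimal sketch of the time-window pooling with zero-vector filling follows. It assumes each subtitle token comes with a timestamp in seconds, which is a hypothetical input format; the function name and the toy dimensionality are likewise illustrative and not taken from the pipeline of Lei et al.

```python
from typing import List, Tuple

Vector = List[float]


def pool_subtitle_tokens(
    tokens: List[Tuple[float, Vector]],
    num_windows: int,
    dim: int,
    window_sec: float = 1.5,
) -> List[Vector]:
    """Max-pool token embeddings per time window; empty windows get a zero vector."""
    buckets: List[List[Vector]] = [[] for _ in range(num_windows)]
    for timestamp, embedding in tokens:
        index = int(timestamp // window_sec)
        if 0 <= index < num_windows:
            buckets[index].append(embedding)
    return [
        [max(column) for column in zip(*bucket)] if bucket else [0.0] * dim
        for bucket in buckets
    ]


# Toy input: two tokens in the first window, none in the second, one in the third.
tokens = [
    (0.2, [1.0, -1.0, 0.0]),
    (1.0, [0.5, 2.0, -3.0]),
    (3.4, [0.0, 0.1, 0.2]),
]
print(pool_subtitle_tokens(tokens, num_windows=3, dim=3))
# [[1.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.1, 0.2]]
```

Because every window produces exactly one vector, the output has the same length as a visual feature sequence computed on the same windows, which is what makes the two streams temporally aligned.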

6.1.3 Model Architecture with ASR input

The general architecture of the hammer model is illustrated in Figure 1 of the main text. It consists of two hierarchical encoders (i.e., the frame and clip encoders) that have the same structure, and two input streams: the query and the video. When only the visual features are present (e.g., ActivityNet Captions), the video encoder contains only the visual encoder. Each hierarchical encoder contains 5 standard Transformer layers for the query input and 1 Transformer layer for the visual input. There is an additional cross-modal Transformer layer between the query and visual representations.

When ASR is provided as another input stream (e.g., in TVR), we add another branch to the video encoder in each of the hierarchical encoders to process the ASR input. The ASR and visual branches have a similar structure, each with 1 Transformer layer. The pre-extracted ASR embeddings and the visual features both attend to the query representation to form their contextualized representations. The query embeddings in turn attend to the ASR and visual modalities separately. The resulting ASR-grounded and visual-grounded query representations are then added together in the feature dimension, followed by a normalization and a dropout layer. Finally, the query-grounded ASR and visual representations are concatenated to form the frame-level and clip-level representations for the two hierarchical encoders, respectively.
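The two fusion steps at the end of this description can be sketched as follows. The sketch assumes the normalization is a layer normalization without learned scale and bias, and it omits dropout (as at inference time); both choices, along with the function names, are assumptions for illustration. The attention layers that produce the inputs are not modeled.

```python
import math
from typing import List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def fuse_query(q_asr: List[Vector], q_vis: List[Vector]) -> List[Vector]:
    """Add ASR-grounded and visual-grounded query tokens, then normalize each token."""
    return [
        layer_norm([a + b for a, b in zip(token_asr, token_vis)])
        for token_asr, token_vis in zip(q_asr, q_vis)
    ]


def fuse_video(asr: List[Vector], vis: List[Vector]) -> List[Vector]:
    """Concatenate query-grounded ASR and visual features at each time step."""
    return [a + v for a, v in zip(asr, vis)]


# Toy inputs: one query token of width 4, one time step of ASR (2-d) and visual (1-d).
fused_query = fuse_query([[1.0, 2.0, 3.0, 4.0]], [[0.0, 0.0, 1.0, 1.0]])
fused_video = fuse_video([[1.0, 2.0]], [[3.0]])
print(fused_query)
print(fused_video)  # [[1.0, 2.0, 3.0]]
```

Note the asymmetry: the query side is summed, so its width is unchanged, while the video side is concatenated, so its width is the sum of the ASR and visual widths.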

6.1.4 Model Optimization

For ActivityNet Captions, we train the models with a mini-batch size of 64 and optimize them using Adam [kingma2014adam]. The learning rate increases linearly from 0 to its maximum value during the first 10% of the training epochs, and is then reduced to fixed fractions of the maximum at 50% and 75% of the training epochs, respectively. We set the maximum video sequence length to 128, and experiment with clip lengths varying from 16 to 64 (refer to Table 7 in the main text).

For TVR, we train the model with a batch size of 128. We use the same learning rate schedule as described above, but with a different maximum learning rate. We set the maximum video sequence length to 96, and experiment with clip lengths varying from 8 to 48.
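The shape of the learning rate schedule (linear warm-up over the first 10% of training, then step decays at 50% and 75%) can be sketched as a function of training progress. The function name and the values passed in the example, namely the maximum rate and the two decay fractions, are placeholders chosen for illustration and are not the values used in the experiments.

```python
def learning_rate(
    progress: float,
    max_lr: float,
    first_decay: float,
    second_decay: float,
    warmup_frac: float = 0.10,
) -> float:
    """Learning rate at a given fraction of training, with progress in [0, 1]."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must lie in [0, 1]")
    if progress < warmup_frac:
        # Linear warm-up from 0 to max_lr.
        return max_lr * progress / warmup_frac
    if progress < 0.50:
        return max_lr
    if progress < 0.75:
        return max_lr * first_decay
    return max_lr * second_decay


# Placeholder hyperparameters, for illustration only.
for progress in (0.0, 0.05, 0.1, 0.3, 0.5, 0.75, 1.0):
    print(progress, learning_rate(progress, max_lr=1.0, first_decay=0.5, second_decay=0.1))
```

Expressing the schedule in terms of progress keeps the sketch independent of whether the rate is updated per step or per epoch, which the description leaves open.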

6.1.5 Model Initialization

We do not pre-train hammer on any dataset. We randomly initialize the visual and ASR branches and the cross-attention Transformer layers. For the text query branch, following prior work [lu2019vilbert], we initialize from the first 5 layers of a pre-trained BERT model [devlin2018bert], and use the uncased WordPiece tokenizer [wu2016google] to tokenize the text input. The vocabulary size of the tokenizer is 30,522.

6.2 Illustration on the TVR Dataset

Figure 5 illustrates the temporal localization results on the TVR dataset using predictions from hammer and its frame and clip encoders. In the top two examples, hammer successfully localizes the video segments described by the respective queries with the help of the clip encoder, even though the frame encoder makes erroneous predictions. In the bottom example, the clip encoder picks the incorrect video clip, causing hammer to only partially capture the video segment described by the query. These examples show the important role played by the two hierarchical encoders: while the clip encoder is responsible for choosing the video clips that best describe the query, the frame encoder fine-tunes the predictions within the chosen clips.