Coarse to Fine: Video Retrieval before Moment Localization

by Zijian Gao, et al.

Current state-of-the-art methods for video corpus moment retrieval (VCMR) often use a similarity-based feature alignment approach for the sake of convenience and speed. However, late fusion methods such as cosine similarity alignment cannot make full use of the information in both the query text and the video. In this paper, we combine feature alignment with feature fusion to improve performance on VCMR.




1 Introduction

Video Corpus Moment Retrieval (VCMR) [1] is a new video-text retrieval task that aims to retrieve the most relevant moments from a large video corpus instead of from a single video. Text-based VCMR can be decomposed into two sub-tasks: video retrieval (VR) and single video moment retrieval (SVMR). The former retrieves the most relevant video, and the latter retrieves the most relevant moment from a single video.

We observe that recently proposed methods [3, 8] mainly follow [2], which only aligns video features and query features by computing their similarity. In contrast, methods that fuse video with query [10, 9] are popular in the SVMR realm, since they can align video and query at a more fine-grained level and collect adequate information. Therefore, in this paper, we combine a conventional VCMR method with an SVMR method to retrieve video moments more precisely.

2 Approach

2.1 Review of HERO

To facilitate the description, we use the same mathematical symbols as in [3]. A video is denoted by a series of frames $v = \{v_i\}_{i=1}^{N_v}$ and subtitles $s = \{s_i\}_{i=1}^{N_s}$, where $N_v$ is the number of frames in the video and $N_s$ is the number of sentences in its subtitles. Frame embeddings $V^{emb}$ and token embeddings $W^{emb}$ are encoded by the frame embedder and the text embedder, respectively.

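The alignment between tokens and frames helps the model collect abundant information. Therefore, a Cross-Modal Transformer with cross-modal attention is adopted for multimodal fusion, producing a sequence of contextualized embeddings:

$$V^{cross}, W^{cross} = f_{cross}(V^{emb}, W^{emb}) \tag{1}$$

where $f_{cross}$ denotes the cross-modal transformer, and $V^{emb}$ and $W^{emb}$ are the frame and token embeddings.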

To make full use of temporal information, the contextualized frame embeddings are re-organized into a video following frame order, and a Temporal Transformer is used to learn the global context of the video. The final contextualized video embeddings are calculated as:

$$V^{temp} = f_{temp}(V^{emb} + V^{cross}) \tag{2}$$

where $f_{temp}$ is the temporal transformer.

As for the query, the query text $s_q$ is fed into the cross-modal transformer to compute query embeddings $W^{cross}_{s_q}$. Then a query encoder takes the query embeddings as input and outputs the final query vector $q$.

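The similarity score between the whole video and the query text is computed by max-pooling the cosine similarity between each frame and the query text:

$$S_{glob}(s_q, v) = \max\left(\frac{V^{temp}}{\lVert V^{temp} \rVert} \, \frac{q}{\lVert q \rVert}\right) \tag{3}$$

where $V^{temp}$ denotes the final contextualized frame embeddings, $q$ is the query vector, and the maximum is taken over frames.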
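The video-level score described above can be illustrated with a minimal sketch. It uses plain Python lists in place of tensors; the function names `cosine` and `video_query_score` and the toy vectors are illustrative assumptions, not the authors' implementation, and all vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def video_query_score(frame_embs, query_vec):
    """Max-pool the per-frame cosine similarities into one video-level score."""
    return max(cosine(frame, query_vec) for frame in frame_embs)

# Toy data (hypothetical): three 2-d frame embeddings and one query vector.
frames = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
query = [0.0, 1.0]
score = video_query_score(frames, query)
```

Because each video is reduced to a single score that needs no cross-modal fusion, frame embeddings can be computed once and reused for every query, which is what makes this stage fast.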
Local similarity used for moment retrieval is computed by dot product:

$$S_{local}(s_q, v) = V^{temp} q \in \mathbb{R}^{N_v} \tag{4}$$

and two trainable 1D convolution filters are applied to the local similarity scores to obtain $S_{st}$ and $S_{ed}$, which are the probabilities of each frame being the start or end frame of the moment.
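As a rough sketch of this late-fusion localization step, the code below computes per-frame dot products and passes them through two 1D filters. The kernel values are hand-picked stand-ins for trained filters, and normalizing the filter outputs with a softmax over frames is an assumption made here so that the outputs are probabilities.

```python
import math

def local_similarity(frame_embs, query_vec):
    """Dot product between every frame embedding and the query vector."""
    return [sum(a * b for a, b in zip(frame, query_vec)) for frame in frame_embs]

def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding; assumes an odd kernel length."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(signal):
                acc += weight * signal[j]
        out.append(acc)
    return out

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

# Toy data (hypothetical): frames 1 and 2 match the query, frames 0 and 3 do not.
frames = [[0.1, 0.0], [0.9, 0.2], [1.0, 0.1], [0.0, 0.3]]
query = [1.0, 0.0]
s_local = local_similarity(frames, query)

# Hand-set kernels standing in for the two trainable filters:
# one responds to a rising edge (start), the other to a falling edge (end).
start_kernel = [-1.0, 1.0, 0.0]
end_kernel = [0.0, 1.0, -1.0]
p_start = softmax(conv1d_same(s_local, start_kernel))
p_end = softmax(conv1d_same(s_local, end_kernel))
```

In this toy case the start distribution peaks at frame 1 and the end distribution at frame 2, the boundaries of the high-similarity region.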

2.2 Moment Retrieval with Fusion

As mentioned above, moment retrieval in HERO is a late fusion method based on similarity scores. However, retrieving moments from local similarity alone neglects the rich interaction between the two modalities. To enhance cross-modal interaction, we use context-query attention (CQA) [6, 10] to integrate video and query. First, the similarity between each frame and each query token is calculated, giving a frame-by-token similarity matrix. The context-to-query attention and query-to-context attention are then computed from the row-normalized and column-normalized versions of this matrix. The final fused feature concatenates the original features with the attention-guided features, combined through element-wise multiplication, and passes the result through a fully-connected layer that reduces the feature dimension.
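The following sketch illustrates context-query attention in the form used by QANet [6] and VSLNet [10]. Several details are assumptions because the original equations are not recoverable here: the similarity is a plain dot product, normalization is a softmax, and the final fully-connected layer is omitted so that the output has four times the input dimension.

```python
import math

def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def context_query_attention(video, query):
    """video: n frames x d, query: m tokens x d. Returns n x 4d fused features."""
    sim = matmul(video, transpose(query))                      # (n, m) frame-token similarity
    s_row = [softmax(row) for row in sim]                      # each row sums to 1 over tokens
    s_col = transpose([softmax(c) for c in transpose(sim)])    # each column sums to 1 over frames
    c2q = matmul(s_row, query)                                 # context-to-query attention, (n, d)
    q2c = matmul(matmul(s_row, transpose(s_col)), video)       # query-to-context attention, (n, d)
    fused = []
    for v, a, b in zip(video, c2q, q2c):
        va = [x * y for x, y in zip(v, a)]                     # element-wise products
        vb = [x * y for x, y in zip(v, b)]
        fused.append(v + a + va + vb)                          # concatenation along features
    return fused

# Toy data (hypothetical): 3 frames and 2 query tokens with d = 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [[1.0, 0.0], [0.5, 0.5]]
fused = context_query_attention(video, query)
```

Unlike the similarity-only scores of late fusion, every frame representation here depends on every query token, which is why this step is reserved for a small number of candidate videos.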

We split a video into two parts, with the target moment as foreground and the rest as background. To further distinguish foreground from background, our model needs to learn the importance of each frame with respect to the whole query text. To this end, another query encoder is applied to the query embeddings to obtain a sentence-level feature, which is concatenated with each frame embedding. An importance score is then computed for every frame from the concatenated features, and this score is used to highlight the fused features.
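A minimal sketch of this highlighting step is given below. The scoring function is an assumption: a single linear layer followed by a sigmoid is used for illustration, and the names `highlight`, `weights`, and `bias` are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def highlight(frame_feats, sentence_vec, weights, bias):
    """Score each frame against the sentence feature and rescale the frame by its score."""
    scores, highlighted = [], []
    for feat in frame_feats:
        joint = feat + sentence_vec  # concatenate frame feature and sentence feature
        score = sigmoid(sum(w * x for w, x in zip(weights, joint)) + bias)
        scores.append(score)
        highlighted.append([score * x for x in feat])
    return scores, highlighted

# Toy data (hypothetical): two 2-d frame features and a 1-d sentence feature.
frame_feats = [[1.0, 0.0], [0.0, 1.0]]
sentence_vec = [1.0]
weights = [2.0, -2.0, 0.0]  # stand-in for trained parameters
scores, highlighted = highlight(frame_feats, sentence_vec, weights, bias=0.0)
```

Frames with a score near 1 keep their features almost unchanged, while frames with a score near 0 are suppressed, which separates foreground from background before span prediction.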
Following [10], two-layer LSTMs with two fully-connected layers serve as the conditioned span predictor, which outputs for every position the scores of being the start and the end of the ground-truth span.

The training objective of moment retrieval combines a cross-entropy loss and an L1 loss.

Note that only positive video-query pairs are fed into moment retrieval training. At inference time, we perform a two-stage procedure: the similarity scores between each query text and each video are first computed quickly by Eq. (3), and then the top-ranked videos are selected for each query text and their moments are localized with the fusion method.
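The two-stage inference can be sketched as follows. The function and parameter names (`two_stage_retrieval`, `score_video`, `localize`, `top_k`) are hypothetical, and the toy scoring and localization functions only stand in for the similarity model and the fusion-based localizer.

```python
def two_stage_retrieval(query, corpus, score_video, localize, top_k):
    """Rank all videos with a cheap score, then localize moments only in the top-k."""
    ranked = sorted(corpus, key=lambda vid: score_video(query, corpus[vid]), reverse=True)
    results = []
    for vid in ranked[:top_k]:
        start, end = localize(query, corpus[vid])  # expensive fusion step
        results.append((vid, start, end))
    return results

# Toy corpus (hypothetical): each video is a list of per-frame relevance values.
corpus = {
    "vid_a": [0.1, 0.2, 0.1],
    "vid_b": [0.2, 0.9, 0.8],
    "vid_c": [0.5, 0.4, 0.3],
}
fusion_calls = []

def toy_score(query, frame_scores):
    # Stand-in for the max-pooled cosine similarity.
    return max(frame_scores)

def toy_localize(query, frame_scores):
    # Stand-in for the fusion-based span predictor; assumes at least one frame passes.
    fusion_calls.append(1)
    selected = [i for i, s in enumerate(frame_scores) if s >= 0.5]
    return selected[0], selected[-1]

results = two_stage_retrieval("a toy query", corpus, toy_score, toy_localize, top_k=2)
```

Only `top_k` videos reach the fusion step, so its cost stays constant as the corpus grows, while the cheap first stage scales with the corpus size.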

2.3 Pre-training

Since the retrieval task depends on the quality of visual representations, ViT-CLIP+S3D features are selected as the inputs of the visual modality. However, this may hurt the performance of the provided pre-trained model, which was trained with ResNet+S3D features. Therefore, following the pre-training strategy used in HERO, we re-train our model with four pre-training tasks.

2.4 Data Augmentation

To improve robustness, several data augmentations are used in training. Following [5], we adopt token shuffling, cutoff, and dropout. Token shuffling randomly shuffles the order of the tokens in the token embeddings. Cutoff consists of token cutoff and feature cutoff, which randomly erase some tokens or some feature dimensions, respectively. Dropout is a widely used method to mitigate over-fitting; we use it to randomly select elements of the token embeddings and set them to zero. The augmentation operations are applied to video embeddings, subtitle token embeddings, and query token embeddings with a given probability, which is decomposed into separate probabilities for token shuffling, token cutoff, feature cutoff, dropout, and leaving the input unchanged.
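The augmentations can be sketched on a sequence of embeddings stored as a list of rows, one row per token or frame. The probabilities, the number of erased tokens or dimensions, and the dropout rate below are placeholder assumptions, since the values used in training are not given here.

```python
import random

def token_shuffle(emb, rng):
    """Randomly permute the order of the rows (tokens)."""
    out = [row[:] for row in emb]
    rng.shuffle(out)
    return out

def token_cutoff(emb, rng, num_tokens=1):
    """Erase whole rows (tokens) by setting them to zero."""
    erased = set(rng.sample(range(len(emb)), num_tokens))
    return [[0.0] * len(row) if i in erased else row[:] for i, row in enumerate(emb)]

def feature_cutoff(emb, rng, num_dims=1):
    """Erase whole columns (feature dimensions) by setting them to zero."""
    erased = set(rng.sample(range(len(emb[0])), num_dims))
    return [[0.0 if j in erased else x for j, x in enumerate(row)] for row in emb]

def dropout(emb, rng, rate=0.1):
    """Set individual elements to zero with probability `rate` (no rescaling)."""
    return [[0.0 if rng.random() < rate else x for x in row] for row in emb]

def unchanged(emb, rng):
    return [row[:] for row in emb]

OPS = [token_shuffle, token_cutoff, feature_cutoff, dropout, unchanged]

def augment(emb, rng, op_probs):
    """Pick one operation according to op_probs and apply it to a copy of emb."""
    op = rng.choices(OPS, weights=op_probs, k=1)[0]
    return op(emb, rng)

rng = random.Random(0)
emb = [[float(4 * i + j + 1) for j in range(4)] for i in range(5)]  # 5 tokens, 4 dims
augmented = augment(emb, rng, op_probs=[0.2, 0.2, 0.2, 0.2, 0.2])    # placeholder probabilities
```

Sampling exactly one operation per input, with "unchanged" as one of the options, is one way to read the decomposition of the overall augmentation probability described above.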

3 Experiments

3.1 Dataset and Evaluation Metric

This paper is a technical report for the VALUE Challenge, and we only focus on the retrieval tasks. We conduct experiments on two VCMR datasets, TVR and How2R, and two VR datasets, YC2R and VATEX-EN-R. Following the VALUE benchmark [4], both the VCMR task and the VR task use average recall at K (R@K) as the evaluation metric, and Average Recall (AveR) at {1, 5, 10} is also adopted for overall appraisal.
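As a small illustration of these metrics, the sketch below computes R@K for the video retrieval case, where a query counts as a hit if its ground-truth video appears among the top K results, and AveR as the mean of R@1, R@5, and R@10. The function names and toy rankings are assumptions; for VCMR a hit additionally depends on the temporal overlap of the predicted moment, which is omitted here.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Percentage of queries whose ground-truth video is among the top-k results."""
    hits = sum(1 for q, ranked in ranked_lists.items() if ground_truth[q] in ranked[:k])
    return 100.0 * hits / len(ranked_lists)

def average_recall(recalls):
    """AveR: the mean of the given R@K values."""
    return sum(recalls) / len(recalls)

# Toy rankings (hypothetical) for two queries.
ranked_lists = {"q1": ["v1", "v2"], "q2": ["v3", "v2"]}
ground_truth = {"q1": "v1", "q2": "v2"}
r_at_1 = recall_at_k(ranked_lists, ground_truth, k=1)
r_at_2 = recall_at_k(ranked_lists, ground_truth, k=2)

# Check against the first VALUE validation row of Table 2: R@1, R@5, R@10.
aver = round(average_recall([26.23, 49.71, 60.05]), 2)
```

The last line reproduces the AveR value of 45.33 reported for that row.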

3.2 Performance Comparison on VCMR

Split  Method         TVR                       How2R
                      R@1    R@10    R@100      R@1    R@10    R@100
VAL    XML [2]        2.62   9.05    22.47      -      -       -
VAL    HERO [3]       5.13   16.26   24.55      -      -       -
VAL    HAMMER [7]     5.13   11.38   16.71      -      -       -
VAL    ReLoCLNet [8]  4.15   14.06   32.42      -      -       -
VAL    VALUE [4]      5.93   18.76   -          3.01   7.80    -
VAL    Ours           7.57   21.20   -          4.01   8.11    -
TEST   HERO [3]       6.21   19.34   36.66      -      -       -
TEST   VALUE [4]      6.39   19.54   -          1.90   6.17    -
TEST   Ours           8.17   21.95   -          3.64   6.17    -

Table 1: Results on VCMR datasets

The results of the VCMR task on the TVR and How2R datasets are reported in Table 1. Besides the baseline model HERO [3] and the VALUE benchmark [4], we compare our model with XML [2], HAMMER [7], and ReLoCLNet [8]. Our method outperforms all the other methods on the TVR dataset and has a slight advantage in Recall@1 on the How2R dataset. To some extent, we regard XML, ReLoCLNet, and HERO as using the same kind of approach, late fusion, since both video retrieval and moment retrieval are based on similarities between videos and query texts. HAMMER and our method belong to the same family, which performs two-stage inference and adopts fine-grained cross-modal integration. This result demonstrates that fusing video with query improves performance in video moment retrieval. Compared to HAMMER, we use multi-channel inputs, which explains why our result surpasses theirs. Note that although HERO conducted experiments on the How2R dataset, the version used by HERO is not the same as the one used by the VALUE benchmark and by us, so we cannot report their results on How2R.

3.3 Performance Comparison on VR


Split  Method  YC2R                           VATEX-EN-R
               R@1    R@5    R@10   AveR      R@1    R@5    R@10   AveR
VAL    VALUE   26.23  49.71  60.05  45.33     55.62  89.15  95.07  79.95
VAL    Ours    32.05  55.47  65.01  50.84     38.93  76.13  86.09  67.05
TEST   VALUE   35.10  59.48  68.27  54.28     24.51  54.34  68.41  49.09
TEST   Ours    41.21  65.90  75.56  60.89     24.04  54.15  68.62  48.94

Table 2: Results on VR datasets

Table 2 shows model performance on the two VR datasets. We find experimentally that CLIP-ViT+S3D features do not work on YC2R, so we use ResNet+S3D features on the YC2R dataset. The large gap between VALUE and our model on the validation set, together with the slight performance difference on the test set, indicates that models overfit easily on the VATEX-EN-R dataset, as also noted in [4].

3.4 Ablation Study


Method     TVR                           How2R
           R@1    R@5    R@10   AveR     R@1    R@5    R@10   AveR
Base       6.66   14.78  19.39  13.61    2.54   4.71   6.56   4.61
B+Fusion   7.34   15.37  20.37  14.36    4.01   5.48   7.18   5.56
B+F+Aug    7.81   15.69  20.49  14.67    3.47   5.94   7.57   5.66
B+F+A+PT   7.57   16.21  21.20  14.96    4.01   6.56   8.11   6.23

Table 3: Ablation results

Since our fusion method focuses on the VCMR task, we only study ablations on the two VCMR datasets. The results show that cross-modal integration improves VCMR performance by a large margin. This indicates that fusing video with query captures more information, leading to more precise video moment localization. Data augmentations, including shuffling, cutoff, and dropout, contribute to the overall metrics, despite decreasing Recall@1 on the How2R dataset. Pre-training on four datasets also improves performance on the VCMR task.

4 Conclusion

Although the VCMR task is an extension of the SVMR task, some methods commonly used in SVMR are not employed in VCMR. In this paper, we add a video-query feature fusion approach on top of video retrieval to further align features and fuse them at a fine-grained level. This approach significantly improves the performance of the model on the VCMR task. Meanwhile, data augmentation methods such as token shuffling, cutoff, and dropout improve the robustness of the model and further enhance its performance.


References

  • [1] V. Escorcia, M. Soldan, J. Sivic, B. Ghanem, and B. Russell (2019) Temporal localization of moments in video collections with natural language. arXiv preprint arXiv:1907.12763.
  • [2] J. Lei, L. Yu, T. L. Berg, and M. Bansal (2020) TVR: a large-scale dataset for video-subtitle moment retrieval. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pp. 447–463.
  • [3] L. Li, Y. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu (2020) HERO: hierarchical encoder for video+language omni-representation pre-training. arXiv preprint arXiv:2005.00200.
  • [4] L. Li, J. Lei, Z. Gan, L. Yu, Y. Chen, R. Pillai, Y. Cheng, L. Zhou, X. E. Wang, W. Y. Wang, et al. (2021) VALUE: a multi-task benchmark for video-and-language understanding evaluation. arXiv preprint arXiv:2106.04632.
  • [5] Y. Yan, R. Li, S. Wang, F. Zhang, W. Wu, and W. Xu (2021) ConSERT: a contrastive framework for self-supervised sentence representation transfer. arXiv preprint arXiv:2105.11741.
  • [6] A. W. Yu, D. Dohan, M. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le (2018) QANet: combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541.
  • [7] B. Zhang, H. Hu, J. Lee, M. Zhao, S. Chammas, V. Jain, E. Ie, and F. Sha (2020) A hierarchical multi-modal encoder for moment localization in video corpus. arXiv preprint arXiv:2011.09046.
  • [8] H. Zhang, A. Sun, W. Jing, G. Nan, L. Zhen, J. T. Zhou, and R. S. M. Goh (2021) Video corpus moment retrieval with contrastive learning. arXiv preprint arXiv:2105.06247.
  • [9] H. Zhang, A. Sun, W. Jing, L. Zhen, J. T. Zhou, and R. S. M. Goh (2021) Parallel attention network with sequence matching for video grounding. arXiv preprint arXiv:2105.08481.
  • [10] H. Zhang, A. Sun, W. Jing, and J. T. Zhou (2020) Span-based localizing network for natural language video localization. arXiv preprint arXiv:2004.13931.