Video is one of the most common data formats on the Internet today. Individual videos contain rich information that people consume for entertainment or learning, and are preferred over text or static images in many scenarios (e.g. an instructional video on cooking a dish or a video lesson on how to play tennis). With the vast amount of video available online and new content being generated and uploaded every day, an efficient video search engine is of great practical value. The core of a video search engine is the retrieval task, which deals with finding the best-matching videos for a user's text query [mithun2018learning, miech2018learning, liu2019use, dzabraev2021mdmmt, cheng2021improving]. The video retrieval task is usually formulated as learning a similarity function between the text query and candidate videos in the database, which is then used to rank them. With the rise of deep neural networks [krizhevsky2012imagenet, russakovsky2015imagenet, lecun2015deep, simonyan2015very, szegedy2015going, he2016deep] as universal feature extractors replacing manually-designed features, modern video retrieval algorithms [mithun2018learning, miech2018learning, bain2021frozen, gao2021clip2tv, croitoru2021teachtext, liu2021hit] are often composed of a text encoder and a visual encoder that map the text query and candidate video into the same embedding space. The similarity can then be calculated using a distance metric (e.g. cosine distance) in this shared representation space. With collective efforts over the years, current state-of-the-art methods [andres2021straightforward, luo2021clip4clip, fang2021clip2video] have achieved reasonable performance on several video retrieval benchmarks [xu2016msr, chen2011collecting, anne2017localizing, krishna2017dense, rohrbach2015dataset].
Despite these advances, existing work focuses mostly on the single-query setting, i.e. retrieving target videos given a single text description as input. This paradigm has two issues. First, as most existing video retrieval benchmarks are based on datasets collected for other tasks (e.g. video captioning) [xu2016msr, wang2019vatex, chen2011collecting, krishna2017dense, rohrbach2015dataset], many text descriptions are of low quality and unsuited for video retrieval (i.e. they can be matched to many videos), which adds noise to both training and evaluation. Therefore, as shown in Figure 1, evaluation under the single-query setup does not provide an accurate comparison of models' relative performance, since the task is inherently under-specified. Figure 2 shows the Recall@1 scores for two models, Frozen [bain2021frozen] and collaborative experts (CE) [liu2019use], as a function of an increasing number of queries (for a fair comparison, both models are trained from scratch). While Frozen performs better with a single query, CE is clearly better at retrieval as the number of queries increases. This calls into question the validity of the single-query setting alone as a way to test video retrieval models.
Second, the single-query setting does not represent a realistic use case of video search engines, where issuing multiple search queries is very common. Consider a user searching for a video on the Internet: their first query often fails to retrieve the desired content due to under- or mis-specification, leading them to re-issue the search with a new query. Rather than treating the second query as a single-query search independent of the first, combining both queries into a multi-query search is the more logical choice, since the first query may contain information not included in the second. In other words, leveraging multiple queries is essential to retrieving more accurate content. This is also evident from the example in Figure 2, which shows the change in Recall@1 scores for two different models [liu2019use, bain2021frozen] with an increasing number of descriptions used for retrieval.
Based on these observations, we propose the task of multi-query video retrieval (MQVR), where multiple text queries are provided to the model simultaneously (bottom of Figure 1). It addresses the noise introduced by imperfect annotations in a simple and effective way. We re-purpose existing video retrieval datasets and perform an extensive investigation of state-of-the-art models adapted to handle multiple queries. We first propose different ways of using multiple queries only during inference, and then explore various ways of learning retrieval models explicitly with multiple queries provided at training time. Our experiments on three different datasets demonstrate that MQVR is a more realistic setting that better leverages the capabilities of modern retrieval systems, and also provides a better evaluation setup for benchmarking the latest advances.
To summarize, we make the following contributions:
We extensively investigate multi-query video retrieval (MQVR) – where multiple text queries are provided to the model – over multiple video retrieval datasets (MSR-VTT [xu2016msr], MSVD [chen2011collecting], VATEX [wang2019vatex]). While some of the previous works have made use of multiple queries during evaluation [liu2019use, fang2021clip2video], none of them have treated this as a standalone task.
We find that dedicated multi-query training methods can provide large gains (up to 21% in R@1) over simply treating each query independently and combining their similarity scores. We also propose several architectural changes, such as localized and contextualized weighting mechanisms, that further improve MQVR performance.
To facilitate future research in this area, we further propose a new metric based on the area under the curve for MQVR with varying number of queries, which is complementary to standard metrics used in single-query retrieval.
Finally, we also demonstrate that the proposed multi-query training methods can be utilized to benefit standard single-query retrieval, and that the multi-query trained representations have better generalization than the single-query trained counterpart.
2 Related Work
Video-Text Representation Learning. Multi-modal learning for videos [sun2019videobert, Arandjelovic2017LookLA, patrick2020support, wu2021exploring, yu2018joint, Hendricks_2017_ICCV, dong2021dual, gabeur2022masking, alayrac2020self] has received increasing attention in recent years. Among these directions, learning better video and language representations [zhang2018cross, gabeur2022masking, zhu2020actbert, wang2016learning] is one of the most active topics. Recent works leverage large-scale vision-language datasets for model pretraining [miech2018learning, bain2021frozen, li2020oscar, kamath2021mdetr]. HowTo100M [miech2019howto100m]
is a large-scale video-text pretraining dataset that uses text obtained from narrated videos via automatic speech recognition (ASR) as supervision for video-text representation learning. Miech et al. [miech2018learning]
then proposed a multiple instance learning approach derived from noise contrastive estimation (MIL-NCE) to learn from these noisy instructional video-text pairs. Lei et al. [lei2021less] proposed ClipBERT, which leverages sparse sampling to train the model end-to-end. Recently, CLIP (Contrastive Language-Image Pre-training) [radford2021learning] has shown great success by pre-training on large-scale webly-supervised image-text pairs. The learned image and language representations have proven helpful in most downstream vision-language tasks.
Text-to-video Retrieval. To search videos with unconstrained text input, both video and text modalities should be embedded into a shared representation space for similarity matching. Early works [mithun2018learning, miech2018learning, liu2019use, gabeur2020multi, wang2021t2vlad] use pre-trained models to extract representations from multi-modal data (RGB, motion, audio). Mithun et al. [mithun2018learning] proposed hard negative mining to improve the training of joint text-video models. Liu et al. [liu2019use] leverage seven modalities (e.g. speech content, scene text, faces) to build the video representation. Miech et al. [miech2018learning] further proposed a strong joint embedding framework based on mixture-of-expert features. Gabeur et al. [gabeur2020multi] introduced a multi-modal transformer to jointly encode these different modalities with attention. Recently, several works [luo2021clip4clip, fang2021clip2video, li2021clip, gao2021clip2tv]
based on large-scale pretrained vision-language models (e.g. CLIP [radford2021learning]) have achieved superior performance compared to previous trained-from-scratch models. However, these works are evaluated on single-query benchmarks and thus suffer from the query quality issue. In contrast, this paper studies multi-query video retrieval, where the input is a set of multiple queries.
3 Multi-query retrieval
Video retrieval (throughout this paper, we use the term 'video retrieval' to refer to the specific task of text-to-video retrieval) is the task of searching for videos given text descriptions [mithun2018learning, miech2018learning, liu2019use]. Formally, given a video database composed of $N$ videos, $\mathcal{V} = \{v_1, \dots, v_N\}$, and a text description $t_i$ of video $v_i$, the goal is to successfully retrieve $v_i$ from $\mathcal{V}$ based on $t_i$. This is the setting most existing video retrieval benchmarks adopt [xu2016msr, anne2017localizing, krishna2017dense, rohrbach2015dataset]. We refer to it as the single-query setting, since one caption is used for retrieval during evaluation. Specifically, the goal is to learn a model $s$ that can evaluate the similarity between any query-video pair, such that $s(t, v)$ reflects how well the query matches the video. A perfect retrieval model would score the matching pair higher than all non-matching pairs, i.e.,
$$s(t_i, v_i) > s(t_i, v_j), \quad \forall j \neq i.$$
However, this single-query setting is problematic when the input description is of low quality, for example so general or abstract that it can be matched to many videos in the database. This causes trouble during evaluation, as all matched videos are given similar matching scores and are essentially ranked randomly, resulting in a wrong estimate of the model's true performance [wray2021semantic].
To quantify this effect, the authors manually inspected one hundred randomly-picked single-query retrieval results for a state-of-the-art model [luo2021clip4clip] on a popular benchmark [xu2016msr]. While the R@1 is 41% as measured with the standard evaluation, the authors found the retrieved videos to be reasonable 62% of the time.
Because existing video retrieval benchmarks are mostly based on datasets collected for other tasks (e.g. video captioning) [xu2016msr, wang2019vatex, chen2011collecting, krishna2017dense, rohrbach2015dataset], such low-quality queries are prevalent during model evaluation. As an example, MSR-VTT [xu2016msr], one of the most popular benchmarks in the field, contains many generic descriptions that can be matched to numerous videos, like 'a car is shown', 'cartoon show for kids', and 'a man is singing' (these descriptions are valid for video captioning but are too abstract for retrieval).
To tackle this issue, we study the multi-query retrieval setting, where more than one query is available during retrieval. This is a natural extension of single-query retrieval, reflecting the practical scenario where users fail to retrieve what they need and provide more information through additional queries. Under similar notation, the task of multi-query retrieval is to retrieve a target video $v_i$ from the video database $\mathcal{V}$ based on multiple descriptions $\{t_i^1, \dots, t_i^M\}$ of it. Similarly, we would like to learn a model that correctly retrieves the target video,
$$s(\{t_i^1, \dots, t_i^M\}, v_i) > s(\{t_i^1, \dots, t_i^M\}, v_j), \quad \forall j \neq i.$$
Correspondingly, during training, standard methods use the single-query setting, where a text-video pair is composed of a video and one description, i.e., $(t_i, v_i)$ as a positive sample and $(t_i, v_j), j \neq i$, as a negative sample. As we show later in the experiments, this is not the best choice for multi-query evaluation, and dedicated multi-query training, i.e. $(\{t_i^1, \dots, t_i^M\}, v_i)$ as a positive pair and $(\{t_i^1, \dots, t_i^M\}, v_j)$ as a negative pair, can provide a large gain.
We adopt the evaluation metrics widely used in the standard single-query retrieval setting and report R@K (recall at rank K, higher is better), median rank and mean rank (lower is better) in our experiments. Additionally, specific to multi-query evaluation, we would like to compare models tested under varying numbers of query inputs (e.g. Figure 2). To facilitate future research on this subject, we further propose an area under the curve (AUC) metric. Specifically, $\mathrm{AUC}_K^M$ is defined as the normalized area under the curve of R@K with the number of test queries varying from 1 to $M$,
$$\mathrm{AUC}_K^M = \frac{1}{M} \sum_{m=1}^{M} \mathrm{R@K}(m),$$
where $\mathrm{R@K}(m)$ is the R@K value when evaluating with $m$ queries as input.
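As a concrete illustration, the metric can be computed from a list of R@K values measured with 1 to M queries. The sketch below assumes the normalized area reduces to a simple mean over query counts, which may differ from the exact discretization used in the paper; the R@1 values are made up for illustration:

```python
def auc_at_k(recall_at_k):
    """Normalized area under the R@K-vs-number-of-queries curve.

    recall_at_k: R@K values measured with m = 1..M queries as input.
    Assumes the normalized area reduces to a mean over query counts.
    """
    return sum(recall_at_k) / len(recall_at_k)

# Hypothetical R@1 curve for m = 1..5 queries (values are illustrative only)
curve = [41.5, 55.0, 62.0, 66.0, 68.0]
print(auc_at_k(curve))
```

This makes the metric a single scalar summary of the whole query-count curve, complementary to any single-point R@K value.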
In this section we describe all the multi-query retrieval methods investigated in this paper; experiments are shown in the next section. Broadly, we divide the methods into two categories: the first category of methods extends the models trained with single-query to multi-query evaluation in a post-hoc fashion without any retraining, and the second one has dedicated modifications for multi-query retrieval during training.
4.1 Post-hoc inference methods
Models trained with single queries can be easily adapted to the case where multiple queries are available, simply by evaluating each query separately and then aggregating the results.
Similarity aggregation (SA).
A simple approach is to take the mean of the similarity scores between each query and the video, $s_{\mathrm{SA}}(\{t^m\}, v) = \frac{1}{M} \sum_{m=1}^{M} s(t^m, v)$, as the final multi-query similarity, which is then used to rank the videos.
Rank aggregation (RA).
Unlike similarity aggregation, which aggregates the raw similarity score of each query, rank aggregation aggregates the retrieval results instead. Denote $r(t^m, v)$ as the rank of $v$ among all videos in the candidate pool when ordered by similarity to $t^m$ according to $s$ (a smaller rank means more compatible with the query); the multi-query similarity can then be calculated as:
$$s_{\mathrm{RA}}(\{t^m\}, v) = -\frac{1}{M} \sum_{m=1}^{M} r(t^m, v).$$
Note that the overall similarity score here does not have a well-defined quantitative meaning as in the similarity aggregation case; it only serves to order the videos.
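The two post-hoc schemes can be sketched with NumPy. Here `sim` holds the similarity scores between the M queries of one target (rows) and the candidate videos (columns); the names, shapes, and toy values are ours, not from the paper:

```python
import numpy as np

def similarity_aggregation(sim):
    """SA: mean of per-query similarity scores, used directly for ranking."""
    return sim.mean(axis=0)

def rank_aggregation(sim):
    """RA: rank videos per query, then average the ranks (negated so higher = better)."""
    # argsort of descending similarity gives, per query, each video's rank 0..N-1
    order = np.argsort(-sim, axis=1)
    ranks = np.empty_like(order)
    rows = np.arange(sim.shape[0])[:, None]
    ranks[rows, order] = np.arange(sim.shape[1])
    return -ranks.mean(axis=0)

sim = np.array([[0.9, 0.2, 0.4],   # query 1 vs. 3 candidate videos
                [0.7, 0.8, 0.1]])  # query 2
print(int(similarity_aggregation(sim).argmax()))  # 0: video 0 wins on mean score
print(int(rank_aggregation(sim).argmax()))        # 0: video 0 also has the best mean rank
```

In this toy example both schemes agree, but they can diverge when one query ranks the target poorly: the noisy rank pulls the RA average down linearly, while a small similarity score barely affects the SA mean.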
4.2 Multi-query training methods
Unlike post-hoc methods, which essentially handle the multi-query problem in a "single-query way", dedicated training modifications can help when multiple queries are available during the training phase. Multiple queries per video are already provided by many standard benchmarks [xu2016msr, chen2011collecting, wang2019vatex], but to the best of the authors' knowledge, most existing works treat the descriptions independently and adopt single-query training. Figure 3 shows the four multi-query training methods we propose in this work, which can serve as strong baselines for future research in this area. We describe them in detail below.
To facilitate the discussion, we first expand the similarity score of a query-video pair as $s(t, v) = d(f(t), g(v))$. Here $f$, the text feature extractor, maps the input query $t$ to an embedding space. Similarly, $g$, the video feature extractor, maps the input video $v$ to the same embedding space. $d$ is the metric that measures the similarity between $f(t)$ and $g(v)$ (e.g. the cosine similarity).
Mean feature (MF).
A naive way of combining multiple queries during training is to take the mean of their features as the final query feature, with the corresponding similarity score calculated as:
$$s_{\mathrm{MF}}(\{t^m\}, v) = d\left(\frac{1}{M} \sum_{m=1}^{M} f(t^m),\; g(v)\right).$$
We will show in the next section that with this simple modification alone, the trained model outperforms the post-hoc methods by a large margin. However, a potential drawback of the mean feature is that each query contributes equally to the result regardless of its quality. We therefore make a natural extension and further propose weighted feature (WF) methods:
$$s_{\mathrm{WF}}(\{t^m\}, v) = d\left(\sum_{m=1}^{M} w_m f(t^m),\; g(v)\right),$$
where $\sum_{m=1}^{M} w_m = 1$. We experiment with several ways to generate the weights below.
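In code, mean-feature and weighted-feature pooling differ only in how the query embeddings are combined before the similarity metric is applied. A minimal NumPy sketch with cosine similarity (function names, shapes, and toy embeddings are ours for illustration):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mean_feature_score(query_feats, video_feat):
    """MF: pool the M query embeddings with an unweighted mean, then compare."""
    return cosine(query_feats.mean(axis=0), video_feat)

def weighted_feature_score(query_feats, video_feat, weights):
    """WF: pool with per-query weights that sum to one (quality estimates)."""
    pooled = (weights[:, None] * query_feats).sum(axis=0)
    return cosine(pooled, video_feat)

q = np.array([[1.0, 0.0],   # query 1 embedding (matches the video)
              [0.0, 1.0]])  # query 2 embedding (orthogonal to the video)
v = np.array([1.0, 0.0])    # video embedding
print(round(mean_feature_score(q, v), 3))                          # 0.707
print(round(weighted_feature_score(q, v, np.array([1.0, 0.0])), 3))  # 1.0
```

Putting all the weight on the better-matching query recovers a perfect match, which is exactly the behavior the weighting mechanisms below try to learn.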
Text-to-text similarity weighting (TS-WF).
It is desirable for a query to contain complementary information about the target video that is not captured by the other queries. On the contrary, a query is uninformative if it only contains redundant information already captured by the others. This inspires a parameter-free method that evaluates the informativeness of a query by comparing its similarity to the other queries. Specifically, the informativeness of query $t^m$ is computed as $e_m = -\frac{1}{M-1} \sum_{m' \neq m} d(f(t^m), f(t^{m'}))$. Note that the minus sign means informative queries should be different from the others. Finally, the weights are computed by taking a softmax over the $e_m$'s.
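One possible reading of TS-WF as a parameter-free procedure is sketched below with cosine similarity; the exact similarity function and normalization in the paper may differ:

```python
import numpy as np

def ts_wf_weights(query_feats):
    """Weight each query by the negated average cosine similarity to the
    other queries: redundant queries get lower weight."""
    f = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    sim = f @ f.T                         # pairwise cosine similarities
    m = len(f)
    # average similarity to the *other* queries (drop the diagonal self-similarity of 1)
    avg_to_others = (sim.sum(axis=1) - 1.0) / (m - 1)
    scores = -avg_to_others               # minus sign: distinct queries are informative
    e = np.exp(scores - scores.max())     # numerically stable softmax
    return e / e.sum()

# Two near-duplicate queries plus one distinct query: the distinct one is upweighted.
feats = np.array([[1.0, 0.0],
                  [0.99, 0.14],
                  [0.0, 1.0]])
w = ts_wf_weights(feats)
print(int(w.argmax()))  # 2: the query most different from the others
```

Because the scores only depend on the text embeddings, this weighting adds no parameters and can be applied at inference time as well.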
Localized weight generation (LG-WF).
The weights can also be produced by a separate network trained end-to-end with the other parameters. We experiment with a multilayer perceptron (MLP) that maps each extracted text feature to a scalar. The MLP is shared across all queries and each query is processed separately. The resulting weights are normalized with a softmax.
Contextualized weight generation (CG-WF).
Instead of computing the weight for each query individually, CG-WF attends to the other queries when generating each weight. Specifically, we experiment with a transformer-based attention network: all query features are first input to a transformer to generate contextualized features, and an MLP head then maps the contextualized features to scalars. A softmax computes the final normalized weights, as in the localized weight generation case.
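The paper uses a full transformer over the query features; as a self-contained stand-in, the sketch below uses a single (randomly initialized, untrained) self-attention layer followed by a linear head. All layer sizes, parameter names, and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cg_wf_weights(query_feats, wq, wk, wv, head):
    """Contextualize each query against the others with one attention layer,
    then map each contextualized feature to a scalar and softmax-normalize."""
    q, k, v = query_feats @ wq, query_feats @ wk, query_feats @ wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    contextualized = attn @ v                  # (M, D): each row sees all queries
    scores = (contextualized @ head).ravel()   # MLP head reduced to a linear map here
    return softmax(scores)

d = 4
feats = rng.normal(size=(5, d))                # 5 query embeddings of dimension d
params = [rng.normal(size=(d, d)) for _ in range(3)] + [rng.normal(size=(d, 1))]
w = cg_wf_weights(feats, *params)
print(w.shape)  # one weight per query, summing to 1
```

In contrast to LG-WF, the score for each query here depends on all the other queries through the attention step, which is what lets the model judge relative rather than absolute quality.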
We first describe in Section 5.1 the architecture backbones on which we build our multi-query retrieval experiments. Specific experimental settings and implementation details are described in Section 5.2, and the experimental results are presented in Section 5.3.
5.1 Architecture backbones
CLIP [radford2021learning] has been shown to learn transferable features that benefit many downstream tasks, including video action recognition, OCR, etc. Recently, this trend has also been demonstrated for video retrieval. Among the multiple models proposed in this line of work [andres2021straightforward, luo2021clip4clip, fang2021clip2video], we adopt CLIP4Clip for its simplicity. For our experiments, we use the publicly released 'ViT-B/32' checkpoint to initialize the CLIP model.
Frozen [bain2021frozen], proposed by Bain et al., utilizes a transformer-based architecture [vaswani2017attention, dosovitskiy2020image] composed of space-time self-attention blocks. For our experiments, we use the checkpoint provided by the original authors, which is pretrained on Conceptual Captions [sharma2018conceptual] and WebVid-2M [bain2021frozen].
5.2 Experimental setup
We conduct our experiments on three datasets, which have multiple descriptions available for each video clip.
MSR-VTT [xu2016msr] is one of the most widely used datasets for video retrieval. It contains 10K video clips gathered from YouTube, each annotated with 20 natural language descriptions. Following previous works, we use 9K videos for training and 1K videos for testing.
MSVD [chen2011collecting] is a dataset initially collected for translation and paraphrase evaluation. It contains 1970 videos, each with multiple descriptions in several languages; we use only the English descriptions in our experiments. Following previous works, we use 1200 videos for training, 100 for validation and 670 for testing.
VATEX [wang2019vatex] is a large-scale multilingual video description dataset. It contains 34991 videos, each annotated with ten English and ten Chinese captions. We keep only the English annotations and use the standard split with 25991 videos for training, 3000 for validation and 6000 for testing.
All models are trained with 8-frame inputs and a batch size of 48 for 30 epochs. We adopt the cross-entropy loss over similarity scores with a softmax temperature of 0.05, as used in [bain2021frozen]. The AdamW [loshchilov2017decoupled] optimizer is used with a cosine learning rate schedule and a linear warm-up of 5 epochs [loshchilov2016sgdr]. The maximum learning rate is 3e-5 and 3e-6 for Frozen and CLIP4Clip, respectively. We set the number of queries to five for our multi-query experiments. Specifically, all available captions for each video are used during training, but for each training sample, a random subset of five descriptions is input to the multi-query model. At test time, we also evaluate five-query performance by sampling five query captions per video. To avoid selection bias, we repeat the evaluation one hundred times (each time selecting a different subset of five captions as queries for every video) and report the mean as the final result.
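The repeated-sampling evaluation protocol described above can be sketched as follows; `recall_at_1` is a placeholder for a model-specific evaluation call (the control structure, not the metric, is what the sketch shows), and the toy data and function names are ours:

```python
import random

def evaluate_multi_query(captions_per_video, recall_at_1,
                         n_queries=5, n_repeats=100, seed=0):
    """Sample n_queries captions per video, evaluate, and average over repeats
    to avoid selection bias from any single caption subset."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        subsets = [rng.sample(caps, n_queries) for caps in captions_per_video]
        scores.append(recall_at_1(subsets))
    return sum(scores) / len(scores)

# Toy stand-in metric: the average number of distinct sampled captions per video.
toy_captions = [[f"v{i}-cap{j}" for j in range(20)] for i in range(3)]
toy_metric = lambda subsets: sum(len(set(s)) for s in subsets) / len(subsets)
print(evaluate_multi_query(toy_captions, toy_metric))  # 5.0: each subset has 5 distinct captions
```

Fixing the seed makes the hundred sampled subsets reproducible across model comparisons, which matters when the reported number is a mean over random caption draws.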
Table 1 summarizes the evaluation results on the test splits of the MSR-VTT, MSVD and VATEX datasets. Due to space limits, we only show the CLIP4Clip numbers for MSVD and VATEX; please see the Appendix for the full results of the Frozen model. Since the findings are similar for both models, we focus on CLIP4Clip in the following discussion.
(R1) Multi-query vs. single-query.
The first row of Table 1 shows the result for the single-query baseline, where the model is both trained and evaluated with one query. The remaining rows show results for five-query evaluation. It is clear that with five queries available, all models see a large boost in retrieval accuracy compared to single-query. Specifically, for CLIP4Clip, the single-query baseline only achieves an R@1 of 41.5%, 43.8% and 33.3% on MSR-VTT, MSVD and VATEX, while even the worst multi-query model tested (rank aggregation) achieves 56.4%, 47.3% and 43.6% (improvements of 14.9%, 3.5% and 10.3%, respectively). For the best multi-query model (CLIP4Clip with contextualized weight generation), the ground-truth video is retrieved within the top 10 videos 97.8%, 94.9%, and 92.7% of the time. While it is expected that a multi-query model should outperform its single-query counterpart, the magnitude of the improvement is still striking. This corroborates our claim about the value of multi-query retrieval.
(R2) Post-hoc aggregation methods.
Comparing the two post-hoc methods, which simply extend the pretrained single-query model to the multi-query setting, it is clear that similarity aggregation outperforms rank aggregation by a large margin. On CLIP4Clip, R@1 improves from 56.4% to 68.4% on MSR-VTT, from 47.3% to 57.6% on MSVD, and from 43.6% to 47.3% on VATEX. This is not surprising, as the rank provided by a low-quality query is noisy and drags down the final rank when combined with the ranks provided by the other queries. While the similarity score generated by a low-quality query is also noisy, its value tends to be small and is typically dominated by the similarity scores provided by the high-quality queries.
(R3) Multi-query training methods.
It is clear from Table 1 that multi-query training methods with dedicated training modifications improve considerably over the post-hoc methods. Simply by feeding five captions and taking the mean of the encoded features as the final text feature during training (mean feature), the R@1 of the CLIP4Clip model improves by 2.9% (from 68.4% to 71.3%), 2.0% (from 57.6% to 59.6%) and 9.9% (from 47.3% to 57.2%) on MSR-VTT, MSVD and VATEX, respectively. We attribute this improvement mainly to the denoising effect of multi-query training. As the ranking loss acts on the combined features, the part of the loss that tries to push apart false-negative text-video pairs (caused by general descriptions that match more than one video) is lessened, thus avoiding potential over-fitting and providing more robust features (additional evidence is discussed in R6 with the zero-shot test).
Even though mean feature training is already a very strong baseline, additional weighting heuristics still introduce further improvements. On CLIP4Clip, with the best-performing weighted feature training, contextualized weight generation, R@1 improves over the mean feature by 1.8% (from 71.3% to 73.1%), 2.1% (from 59.6% to 61.7%) and 1.8% (from 57.2% to 59.0%) points across the datasets. To show that the learned weights correctly capture the relative quality of the queries, we compute the average weight given to each query ordered by quality (we rank query quality by its single-query retrieval result; 1 is the best and 5 the worst). Figure 4 shows the result: the generated weights correctly reflect query quality. Contextualized weight generation works better than localized weight generation. This is expected, as instead of trying to learn a standalone quality predictor, the former eases the task by learning to predict relative quality. Especially when the training data is small, e.g. on MSVD, localized weight generation has a harder time learning query quality and produces more diffuse weights across queries.
Figure 5 shows several qualitative examples on MSR-VTT for a model trained with contextualized weight generation (more examples are shown in the Appendix). While some queries are of low quality, the model correctly attends to the better ones and achieves good overall retrieval accuracy.
(R4) Evaluation with varying number of queries.
Table 1 shows results where evaluation is performed with the same number of queries used for multi-query training (five-query training and five-query testing). It is then natural to ask whether these results generalize across varying numbers of queries. Figure 6 (left) shows R@1 curves when the same models are tested with different numbers of queries (to avoid clutter, text-to-text similarity weighting and localized weight generation are not shown, as they mostly overlap with contextualized weight generation), and the table on the right summarizes the results in the form of the proposed AUC metric. First, the performance of all methods improves as more queries become available. The curves increase rapidly at the beginning and gradually saturate as additional queries provide marginal extra information. Second, the five-query trained models outperform the single-query trained models across different numbers of testing queries except one, with dominating AUC scores. This shows that multi-query training indeed learns features better suited for multi-query evaluation. While single-query trained models maintain the lead in the single-query test, the differences are very small: comparing similarity aggregation with contextualized weight generation, the single-query R@1 is 41.5% vs. 41.0% on MSR-VTT, 43.8% vs. 43.6% on MSVD, and 33.3% vs. 32.8% on VATEX, respectively.
To get the best of both worlds, we further conduct an experiment combining single-query and five-query mean feature training, i.e. some training pairs contain one video and one caption while others contain one video and five captions (as in all other experiments, all available captions are used during training, but for each training instance a random sample of one or five captions is used as input). The dashed lines in Figure 6 show the results. Rather surprisingly, the combined training achieves the best single-query performance and outperforms single-query training entirely. This shows that the de-noising effect of multi-query training can also be utilized to improve standard single-query retrieval.
(R5) Training with different number of queries.
To understand the effect of the number of queries (N) used during training, we plot the performance of different models in Figure 7. We observe that performance first increases for all models, then decreases (sometimes to a score even lower than N=1) as more queries are added. We hypothesize that this is due to additional noise in the training process when too many queries are used, which prevents the model from learning discriminative representations for each individual query.
(R6) Transfer to new tasks.
To further demonstrate the utility of the MQVR framework in learning robust representations, we perform a zero-shot transfer evaluation on the VATEX dataset using models trained on MSR-VTT (Table 2). Quite promisingly, we observe that the multi-query trained models (MSR-VTT 5q, MSR-VTT 7q) outperform their single-query counterpart (MSR-VTT 1q) on both single-query and multi-query evaluations, achieving 0.6% higher single-query R@1 and 2.6-5.5 points higher AUC, respectively. The latter results in particular are only a few points below the AUC performance of an in-domain VATEX 1q model, demonstrating strong transfer and indicating that multi-query training can lead to better generalization. We leave further exploration of this direction to future work.
In this work, we extensively study the multi-query retrieval problem, where multiple descriptions are available for retrieving target videos. We argue that this previously under-studied setting is of practical value, both because it can significantly improve retrieval accuracy by incorporating information from multiple queries and because it addresses challenges in model training and evaluation introduced by imperfect descriptions in existing retrieval benchmarks. We then propose several multi-query training methods and a new evaluation metric dedicated to this setting. With extensive experiments, we demonstrate that the proposed multi-query training methods significantly outperform single-query training on MSR-VTT, MSVD and VATEX. We believe further investigation of this problem will benefit the field and bring new insights for building better retrieval systems for real-world applications.
This material is based upon work supported by the National Science Foundation under Grant No. 2107048. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. We would like to thank members of Princeton Visual AI Lab (Jihoon Chung, Zhiwei Deng, Sunayana Rane and others) for their helpful comments and suggestions.
Appendix A Additional experiment results
A.1 Five-query experiment for Frozen
Additional results for the Frozen [bain2021frozen] model on MSVD [chen2011collecting] and VATEX [wang2019vatex] are shown in Table 3. The findings mirror those for the CLIP4Clip [luo2021clip4clip] model: similarity aggregation outperforms rank aggregation; dedicated multi-query training outperforms post-hoc inference methods; and weighted feature training improves over mean feature training. One noticeable difference is that, unlike with CLIP4Clip, the parameter-free text-to-text similarity weighting outperforms the localized and contextualized weight generation methods. We anticipate that this is because the stronger text representations provided by CLIP facilitate better weight learning in CLIP4Clip.
Appendix B More qualitative examples
Additional qualitative examples of generated weights for different queries are shown in Figure 8. The weights correctly capture the relative quality of the queries, giving higher weights to those containing more information.