An Efficient COarse-to-fiNE Alignment Framework @ Ego4D Natural Language Queries Challenge 2022

11/16/2022
by Zhijian Hou, et al.

This technical report describes the CONE approach for the Ego4D Natural Language Queries (NLQ) Challenge at ECCV 2022. CONE is an efficient, window-centric COarse-to-fiNE alignment framework. It first slices the long video into candidate windows with a sliding-window approach. Centered on these windows, CONE (1) learns inter-window (coarse-grained) semantic variance through contrastive learning and speeds up inference by pre-filtering the candidate windows relevant to the NL query, and (2) ranks candidate moments within each window (fine-grained), drawing on the strong multi-modal alignment ability of the contrastive vision-text pre-trained model EgoVLP. On the blind test set, CONE achieves 15.26 and 9.24 for R1@IoU=0.3 and R1@IoU=0.5, respectively.
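The coarse-grained stage and the evaluation metric can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`slice_windows`, `prefilter_windows`, `temporal_iou`), the mean-pooled cosine similarity used as the window score, and the parameters `window_len`, `stride`, and `top_k` are all assumptions made for illustration. CONE itself learns the window-level scoring through contrastive training and uses EgoVLP features for the fine-grained ranking, neither of which is reproduced here.

```python
import math


def slice_windows(num_frames, window_len, stride):
    """Slice a video into (start, end) frame windows, end exclusive.

    Hypothetical helper: a final window is appended so the tail of the
    video is always covered.
    """
    if num_frames <= window_len:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window_len + 1, stride))
    if starts[-1] + window_len < num_frames:
        starts.append(num_frames - window_len)
    return [(s, s + window_len) for s in starts]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def prefilter_windows(frame_feats, query_feat, windows, top_k):
    """Coarse stage (assumed scoring): keep the top_k windows whose
    mean-pooled frame feature is most similar to the query feature."""
    scored = [
        (cosine(mean_vector(frame_feats[s:e]), query_feat), (s, e))
        for s, e in windows
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [window for _, window in scored[:top_k]]


def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) spans, as thresholded by R1@IoU."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

In this sketch, only the windows returned by `prefilter_windows` would be passed to a more expensive fine-grained ranker, which is where the inference speed-up comes from. A top-1 prediction counts toward R1@IoU=0.3 when `temporal_iou` between it and the ground-truth span is at least 0.3.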

