Explore and Match: End-to-End Video Grounding with Transformer

01/25/2022
by Sangmin Woo, et al.

We present a new paradigm named explore-and-match for video grounding, which aims to seamlessly unify two streams of video grounding methods: proposal-based and proposal-free. To achieve this goal, we formulate video grounding as a set prediction problem and design an end-to-end trainable Video Grounding Transformer (VidGTR) that can utilize the architectural strengths of rich contextualization and parallel decoding for set prediction. The overall training is balanced by two key losses that play different roles, namely span localization loss and set guidance loss. These two losses force each proposal to regress the target timespan and identify the target query. Throughout the training, VidGTR first explores the search space to diversify the initial proposals and then matches the proposals to the corresponding targets to fit them in a fine-grained manner. The explore-and-match scheme successfully combines the strengths of two complementary methods, without encoding prior knowledge into the pipeline. As a result, VidGTR sets new state-of-the-art results on two video grounding benchmarks with double the inference speed.
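The abstract describes a DETR-style set-prediction objective: proposals are matched to ground-truth targets, then a span localization loss pulls each matched proposal onto its target timespan while a set guidance loss makes it identify the target query. A minimal sketch of such an objective is shown below; the function name, the (center, width) span parameterization, and the brute-force matching are illustrative assumptions, not the paper's actual implementation (which would use Hungarian matching and train end-to-end):

```python
import math
from itertools import permutations

def explore_and_match_losses(pred_spans, pred_logits, gt_spans, gt_labels):
    """Illustrative set-prediction losses (assumed names/shapes, not VidGTR's code).

    pred_spans:  list of N predicted (center, width) timespans
    pred_logits: list of N score vectors over the text queries
    gt_spans:    list of M target (center, width) timespans
    gt_labels:   list of M query indices, one per target
    """
    n, m = len(pred_spans), len(gt_spans)

    # Softmax each proposal's logits into probabilities over the queries.
    probs = []
    for logits in pred_logits:
        z = max(logits)
        exps = [math.exp(l - z) for l in logits]
        s = sum(exps)
        probs.append([e / s for e in exps])

    # Matching cost: L1 span distance minus the score of the correct query.
    def cost(i, j):
        l1 = (abs(pred_spans[i][0] - gt_spans[j][0])
              + abs(pred_spans[i][1] - gt_spans[j][1]))
        return l1 - probs[i][gt_labels[j]]

    # Brute-force bipartite matching (fine for small sets; Hungarian in practice).
    best = min(permutations(range(n), m),
               key=lambda p: sum(cost(p[j], j) for j in range(m)))

    # Span localization loss: L1 regression of matched proposals to targets.
    span_loss = sum(abs(pred_spans[best[j]][k] - gt_spans[j][k])
                    for j in range(m) for k in (0, 1)) / m
    # Set guidance loss: cross-entropy on identifying the target query.
    guide_loss = -sum(math.log(probs[best[j]][gt_labels[j]])
                      for j in range(m)) / m
    return span_loss, guide_loss
```

When a proposal already sits on its target span and scores its query highly, both losses approach zero; early in training the matching step lets diverse proposals "explore" before being pulled toward specific targets.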


