On Pursuit of Designing Multi-modal Transformer for Video Grounding

09/13/2021
by Meng Cao, et al.

Video grounding aims to localize the temporal segment corresponding to a sentence query in an untrimmed video. Almost all existing video grounding methods fall into two frameworks: 1) top-down models, which predefine a set of segment candidates and then perform segment classification and regression; and 2) bottom-up models, which directly predict frame-wise probabilities of the referential segment boundaries. However, none of these methods is end-to-end, i.e., they all rely on time-consuming post-processing steps to refine predictions. To this end, we reformulate video grounding as a set prediction task and propose a novel end-to-end multi-modal Transformer model, dubbed GTR. Specifically, GTR has two encoders for video and language encoding and a cross-modal decoder for grounding prediction. To facilitate end-to-end training, we use a Cubic Embedding layer to transform raw videos into a set of visual tokens. To better fuse the two modalities in the decoder, we design a new Multi-head Cross-Modal Attention. The whole GTR is optimized via a Many-to-One matching loss. Furthermore, we conduct comprehensive studies to investigate different model design choices. Extensive results on three benchmarks validate the superiority of GTR: all three typical GTR variants achieve record-breaking performance on all datasets and metrics, with several times faster inference.
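Since the abstract names each component (Cubic Embedding tokenization, two unimodal encoders, a query-based cross-modal decoder, set prediction), the pipeline can be sketched in a few lines of PyTorch. The sketch below rests on assumed choices: the module names CubicEmbedding and GTRSketch, the cube size, the 256-dim hidden width, the 10 learnable queries, and the use of concatenated video and language tokens as attention memory are all illustrative, not the authors' implementation.

```python
# Minimal sketch of a GTR-style set-prediction grounder (PyTorch).
# All names, sizes, and wiring choices are illustrative assumptions,
# not the authors' released code.
import torch
import torch.nn as nn

class CubicEmbedding(nn.Module):
    """Split a raw clip into spatio-temporal cubes and project each cube
    to a visual token (a 3D analogue of ViT's patch embedding)."""
    def __init__(self, dim=256, cube=(2, 16, 16)):
        super().__init__()
        self.proj = nn.Conv3d(3, dim, kernel_size=cube, stride=cube)

    def forward(self, video):                      # (B, 3, T, H, W)
        tokens = self.proj(video)                  # (B, dim, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, N, dim)

class GTRSketch(nn.Module):
    def __init__(self, dim=256, n_queries=10, n_heads=8, n_layers=2):
        super().__init__()
        self.video_embed = CubicEmbedding(dim)
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True),
            n_layers)
        self.video_enc, self.text_enc = enc(), enc()
        # Learnable segment queries decoded against both modalities.
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.cross_attn = nn.MultiheadAttention(dim, n_heads,
                                                batch_first=True)
        self.head = nn.Linear(dim, 2)  # normalized (center, width)

    def forward(self, video, word_feats):
        # word_feats: (B, Nt, dim) pre-extracted word embeddings.
        v = self.video_enc(self.video_embed(video))    # (B, Nv, dim)
        t = self.text_enc(word_feats)                  # (B, Nt, dim)
        q = self.queries.unsqueeze(0).expand(video.size(0), -1, -1)
        # Stand-in for the paper's Multi-head Cross-Modal Attention:
        # queries attend over concatenated video and language tokens.
        mem = torch.cat([v, t], dim=1)
        ctx, _ = self.cross_attn(q, mem, mem)
        return self.head(ctx).sigmoid()  # (B, n_queries, 2) segments

model = GTRSketch()
video = torch.randn(2, 3, 32, 128, 128)   # two short raw clips
words = torch.randn(2, 12, 256)           # 12 word features per query
print(model(video, words).shape)          # torch.Size([2, 10, 2])
```

In the full model, the predicted segment set would be trained against the single ground-truth span with the paper's Many-to-One matching loss; the sketch stops at prediction because the abstract does not specify the matching details.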

Related research

08/31/2022 · Hierarchical Local-Global Transformer for Temporal Sentence Grounding
This paper studies the multimedia problem of temporal sentence grounding...

08/11/2023 · ViGT: Proposal-free Video Grounding with Learnable Token in Transformer
The video grounding (VG) task aims to locate the queried action or event...

01/25/2022 · Explore and Match: End-to-End Video Grounding with Transformer
We present a new paradigm named explore-and-match for video grounding, w...

01/08/2022 · Learning Sample Importance for Cross-Scenario Video Temporal Grounding
The task of temporal grounding aims to locate video moment in an untrimm...

04/17/2021 · TransVG: End-to-End Visual Grounding with Transformers
In this paper, we present a neat yet effective transformer-based framewo...

01/03/2022 · Memory-Guided Semantic Learning Network for Temporal Sentence Grounding
Temporal sentence grounding (TSG) is crucial and fundamental for video u...

09/23/2021 · End-to-End Dense Video Grounding via Parallel Regression
Video grounding aims to localize the corresponding video moment in an un...
