Boundary Proposal Network for Two-Stage Natural Language Video Localization

03/15/2021
by Shaoning Xiao, et al.

We address the problem of Natural Language Video Localization (NLVL): localizing the video segment that corresponds to a natural language description in a long, untrimmed video. State-of-the-art NLVL methods are almost all one-stage and typically fall into two categories: 1) anchor-based approaches, which first pre-define a series of video segment candidates (e.g., by sliding window) and then classify each candidate; and 2) anchor-free approaches, which directly predict, for each video frame, the probability that it is a boundary frame or an intermediate frame inside the positive segment. Both kinds of one-stage approaches have inherent drawbacks. The anchor-based approach is sensitive to the heuristic rules used to define candidates, which limits its ability to handle videos of varying length. The anchor-free approach fails to exploit segment-level interaction and therefore achieves inferior results. In this paper, we propose a novel Boundary Proposal Network (BPNet), a universal two-stage framework that avoids the issues mentioned above. Specifically, in the first stage, BPNet uses an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer jointly models the multi-modal interaction between each candidate and the language query, followed by a matching score rating layer that outputs the alignment score for each candidate. We evaluate BPNet on three challenging NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablation studies on these datasets demonstrate that BPNet outperforms the state-of-the-art methods.
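The two-stage idea described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `propose_segments` and `rerank`, the product-of-probabilities ranking, and the caller-supplied `score_fn` (standing in for the visual-language fusion and matching score layers) are all assumptions made for illustration.

```python
def propose_segments(start_probs, end_probs, top_k=5):
    """Stage 1 (sketch): turn per-frame boundary probabilities, as an
    anchor-free model might output them, into candidate segments.

    Assumption: candidates are ranked by P(start) * P(end); the abstract
    does not specify how BPNet selects its candidates.
    """
    candidates = []
    for s, p_start in enumerate(start_probs):
        # A segment must end at or after the frame where it starts.
        for e in range(s, len(end_probs)):
            candidates.append((p_start * end_probs[e], s, e))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(s, e) for _, s, e in candidates[:top_k]]


def rerank(proposals, score_fn):
    """Stage 2 (sketch): order candidates by a segment-level alignment
    score. `score_fn(start, end)` is a hypothetical placeholder for a
    learned matching model that compares the segment with the query.
    """
    return sorted(proposals, key=lambda p: score_fn(*p), reverse=True)


# Toy per-frame probabilities for a four-frame video.
start_probs = [0.1, 0.7, 0.1, 0.1]
end_probs = [0.05, 0.05, 0.1, 0.8]
proposals = propose_segments(start_probs, end_probs)
```

In this sketch the first stage keeps several candidates rather than committing to a single boundary pair, so the second stage can use whole-segment evidence to choose among them, which is the motivation the abstract gives for combining the two approaches.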


research
09/22/2021

Natural Language Video Localization with Learnable Moment Proposals

Given an untrimmed video and a natural language query, Natural Language ...
research
09/14/2021

Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos

We address the problem of temporal sentence localization in videos (TSLV...
research
09/11/2019

Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction

The task of temporally grounding language queries in videos is to tempor...
research
04/04/2019

ExCL: Extractive Clip Localization Using Natural Language Descriptions

The task of retrieving clips within videos based on a given natural lang...
research
11/18/2020

A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

Identifying a short segment in a long video that semantically matches a ...
research
04/07/2020

Dense Regression Network for Video Grounding

We address the problem of video grounding from natural language queries....
research
11/21/2018

MAC: Mining Activity Concepts for Language-based Temporal Localization

We address the problem of language-based temporal localization in untrim...
