Boundary Proposal Network for Two-Stage Natural Language Video Localization

by   Shaoning Xiao, et al.
Zhejiang University
University of Rochester

We aim to address the problem of Natural Language Video Localization (NLVL)-localizing the video segment corresponding to a natural language description in a long and untrimmed video. State-of-the-art NLVL methods are almost in one-stage fashion, which can be typically grouped into two categories: 1) anchor-based approach: it first pre-defines a series of video segment candidates (e.g., by sliding window), and then does classification for each candidate; 2) anchor-free approach: it directly predicts the probabilities for each video frame as a boundary or intermediate frame inside the positive segment. However, both kinds of one-stage approaches have inherent drawbacks: the anchor-based approach is susceptible to the heuristic rules, further limiting the capability of handling videos with variant length. While the anchor-free approach fails to exploit the segment-level interaction thus achieving inferior results. In this paper, we propose a novel Boundary Proposal Network (BPNet), a universal two-stage framework that gets rid of the issues mentioned above. Specifically, in the first stage, BPNet utilizes an anchor-free model to generate a group of high-quality candidate video segments with their boundaries. In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query, followed by a matching score rating layer that outputs the alignment score for each candidate. We evaluate our BPNet on three challenging NLVL benchmarks (i.e., Charades-STA, TACoS and ActivityNet-Captions). Extensive experiments and ablative studies on these datasets demonstrate that the BPNet outperforms the state-of-the-art methods.


page 1

page 7


Natural Language Video Localization with Learnable Moment Proposals

Given an untrimmed video and a natural language query, Natural Language ...

Adaptive Proposal Generation Network for Temporal Sentence Localization in Videos

We address the problem of temporal sentence localization in videos (TSLV...

Temporally Grounding Language Queries in Videos by Contextual Boundary-aware Prediction

The task of temporally grounding language queries in videos is to tempor...

ExCL: Extractive Clip Localization Using Natural Language Descriptions

The task of retrieving clips within videos based on a given natural lang...

A Hierarchical Multi-Modal Encoder for Moment Localization in Video Corpus

Identifying a short segment in a long video that semantically matches a ...

Dense Regression Network for Video Grounding

We address the problem of video grounding from natural language queries....

MAC: Mining Activity Concepts for Language-based Temporal Localization

We address the problem of language-based temporal localization in untrim...

Please sign up or login with your details

Forgot password? Click here to reset