Self-view Grounding Given a Narrated 360° Video

11/23/2017
by Shih-Han Chou, et al.

Narrated 360° videos are typically provided in many touring scenarios to mimic real-world experience. However, previous work has shown that smart assistance (i.e., providing visual guidance) can significantly help users follow the Normal Field of View (NFoV) corresponding to the narrative. In this project, we aim to automatically ground the NFoVs of a 360° video given the subtitles of the narrative (referred to as "NFoV-grounding"). We propose a novel Visual Grounding Model (VGM) to implicitly and efficiently predict the NFoVs given the video content and subtitles. Specifically, at each frame, we efficiently encode the panorama into a feature map of candidate NFoVs using a Convolutional Neural Network (CNN) and encode the subtitles into the same hidden space using an RNN with Gated Recurrent Units (GRU). Then, we apply soft attention over the candidate NFoVs to trigger a sentence decoder that minimizes the reconstruction loss between the generated and given sentences. Finally, we obtain the NFoV as the candidate NFoV with the maximum attention, without any human supervision. To train the VGM more robustly, we also generate a reverse sentence conditioned on one minus the soft attention, so that this attention focuses on the candidate NFoVs less relevant to the given sentence. The negative log reconstruction loss of the reverse sentence (referred to as the "irrelevant loss") is jointly minimized to encourage the reverse sentence to differ from the given sentence. To evaluate our method, we collect the first narrated 360° video dataset and achieve state-of-the-art NFoV-grounding performance.
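To make the abstract's pipeline concrete, below is a minimal PyTorch-style sketch of the soft-attention grounding step and one reading of the reverse-sentence "irrelevant loss". It assumes candidate-NFoV features have already been extracted by a CNN from the panorama; the module names, dimensions, decoder, and the exact form of the losses are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's code): attend over candidate
# NFoV features with a subtitle encoding, reconstruct the subtitle from the
# attended context, and add a negated reconstruction loss for the (1 - attention)
# "irrelevant" context.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualGroundingSketch(nn.Module):
    def __init__(self, vocab_size, feat_dim=512, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid_dim)
        self.sent_enc = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.att_proj = nn.Linear(feat_dim + hid_dim, 1)
        self.dec_init = nn.Linear(feat_dim + hid_dim, hid_dim)
        self.dec = nn.GRU(hid_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def attend(self, nfov_feats, sent_hidden):
        # nfov_feats: (B, K, feat_dim) pre-extracted candidate NFoV features
        # sent_hidden: (B, hid_dim) GRU encoding of the subtitle
        K = nfov_feats.size(1)
        query = sent_hidden.unsqueeze(1).expand(-1, K, -1)
        scores = self.att_proj(torch.cat([nfov_feats, query], dim=-1)).squeeze(-1)
        return F.softmax(scores, dim=-1)                 # (B, K) soft attention

    def decode_loss(self, context, sent_hidden, tokens):
        # Reconstruct the given subtitle from a visual context vector
        # (teacher forcing; cross-entropy = negative log likelihood).
        h0 = torch.tanh(self.dec_init(torch.cat([context, sent_hidden], dim=-1)))
        emb = self.embed(tokens[:, :-1])
        out, _ = self.dec(emb, h0.unsqueeze(0))
        logits = self.out(out)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))

    def forward(self, nfov_feats, tokens):
        _, h = self.sent_enc(self.embed(tokens))
        sent_hidden = h[-1]                              # (B, hid_dim)
        att = self.attend(nfov_feats, sent_hidden)       # (B, K)

        # Relevant context: attention-weighted sum of candidate NFoVs.
        ctx = torch.bmm(att.unsqueeze(1), nfov_feats).squeeze(1)
        recon_loss = self.decode_loss(ctx, sent_hidden, tokens)

        # Irrelevant context: weight candidates by (1 - attention); negating its
        # reconstruction loss discourages the "reverse" decoding from matching
        # the given sentence (one interpretation of the irrelevant loss).
        inv = 1.0 - att
        inv = inv / inv.sum(dim=-1, keepdim=True).clamp_min(1e-8)
        inv_ctx = torch.bmm(inv.unsqueeze(1), nfov_feats).squeeze(1)
        irrelevant_loss = -self.decode_loss(inv_ctx, sent_hidden, tokens)

        # At inference, the grounded NFoV is the candidate with maximum attention.
        pred_nfov = att.argmax(dim=-1)
        return recon_loss + irrelevant_loss, pred_nfov
```

In this sketch the grounding itself requires no NFoV labels: supervision comes only from reconstructing the subtitle, and the argmax over the learned attention yields the predicted NFoV, matching the unsupervised setup described above.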

