Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

05/13/2020
by Hyounghun Kim, et al.

Videos convey rich information: dynamic spatio-temporal relationships between people and objects, and diverse multimodal events, all within a single clip. It is therefore important to develop automated models that can accurately extract such information from videos, and answering questions about videos is one task that evaluates these AI abilities. In this paper, we propose a video question answering model that effectively integrates multimodal input sources and finds the temporally relevant information needed to answer questions. Specifically, we first employ dense image captions to help identify objects, their detailed salient regions, and their actions, giving the model useful extra information (in explicit textual format, to allow easier matching) for answering questions. Moreover, our model comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for the different sources (video and dense captions), and gates that pass the more relevant information on to the classifier. Finally, we cast the frame-selection problem as a multi-label classification task and introduce two loss functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state of the art by a large margin (74.09%). We also present several word-, object-, and frame-level visualization studies. Our code is publicly available at: https://github.com/hyounghk/VideoQADenseCapFrameGate-ACL2020
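To make the two frame-supervision losses concrete, here is a minimal sketch of how losses of this kind are typically formulated. The exact formulations used in the paper are not given in this abstract, so the implementations below are assumptions: BBCE is sketched as binary cross-entropy averaged separately over in-span (positive) and out-of-span (negative) frames so both classes contribute equally, and IOFSM as a hinge on the margin between the mean scores of in-span and out-of-span frames. The function names, the `margin` parameter, and the numerical details are all illustrative, not the authors' code.

```python
import numpy as np

def balanced_bce(scores, labels, eps=1e-8):
    """Sketch of a Balanced Binary Cross-Entropy (BBCE)-style loss.

    `scores` are per-frame logits, `labels` are 1 for frames inside the
    human-annotated answer span and 0 outside. The BCE terms are averaged
    separately over positive and negative frames, then summed, so the
    (usually much larger) set of negative frames cannot dominate.
    """
    probs = 1.0 / (1.0 + np.exp(-scores))  # sigmoid
    pos = labels > 0.5
    pos_loss = -np.log(probs[pos] + eps).mean()
    neg_loss = -np.log(1.0 - probs[~pos] + eps).mean()
    return pos_loss + neg_loss

def iofsm(scores, labels, margin=1.0):
    """Sketch of an In-and-Out Frame Score Margin (IOFSM)-style loss.

    Encourages the average score of in-span frames to exceed the average
    score of out-of-span frames by at least `margin` (hinge form).
    """
    in_mean = scores[labels > 0.5].mean()
    out_mean = scores[labels <= 0.5].mean()
    return max(0.0, margin + out_mean - in_mean)
```

In this sketch the two losses are complementary: BBCE supervises each frame's score independently (and rebalances the classes), while IOFSM shapes the gap between the in-span and out-of-span score distributions as a whole.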

Related research

- 08/10/2023: Progressive Spatio-temporal Perception for Audio-Visual Question Answering
- 04/25/2019: TVQA+: Spatio-Temporal Grounding for Video Question Answering
- 06/14/2019: Improving Visual Question Answering by Referring to Generated Paragraph Captions
- 08/07/2020: Location-aware Graph Convolutional Networks for Video Question Answering
- 07/22/2023: Discovering Spatio-Temporal Rationales for Video Question Answering
- 08/26/2020: Making a Case for 3D Convolutions for Object Segmentation in Videos
- 11/30/2021: AssistSR: Affordance-centric Question-driven Video Segment Retrieval
