Learning to Locate Visual Answer in Video Corpus Using Question

10/11/2022
by   Bin Li, et al.

We introduce a new task, named video corpus visual answer localization (VCVAL), which aims to locate the visual answer in a large collection of untrimmed, unsegmented instructional videos using a natural language question. This task requires a range of skills: vision-language interaction, video retrieval, passage comprehension, and visual answer localization. In this paper, we propose a cross-modal contrastive global-span (CCGS) method for VCVAL that jointly trains the video corpus retrieval and visual answer localization subtasks. Specifically, we first enrich the video question-answer semantics by adding element-wise visual information to the pre-trained language model representations, and then design a novel global-span predictor that uses the fused information to locate the start and end points of the visual answer. Global-span contrastive learning is adopted to rank span points from positive and negative samples with a global-span matrix. We have reconstructed a dataset named MedVidCQA, on which the VCVAL task is benchmarked. Experimental results show that the proposed method outperforms competitive baselines on both the video corpus retrieval and visual answer localization subtasks. Most importantly, we perform detailed analyses of extensive experiments, paving a new path for understanding instructional videos and ushering in further research.
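The two mechanisms named in the abstract, element-wise fusion of visual features into the language model's token states and a global-span matrix ranked by contrastive learning, can be illustrated with a short sketch. Everything below (module and function names, hidden sizes, the margin loss) is an illustrative assumption based only on the abstract, not the authors' released CCGS implementation.

```python
# Minimal sketch, assuming torch is available; names such as GlobalSpanSketch,
# span_contrastive_loss, and the margin value are hypothetical, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalSpanSketch(nn.Module):
    def __init__(self, hidden=768, visual_dim=1024):
        super().__init__()
        self.visual_proj = nn.Linear(visual_dim, hidden)  # project frame features to LM space
        self.start_head = nn.Linear(hidden, hidden)
        self.end_head = nn.Linear(hidden, hidden)

    def forward(self, text_states, visual_feats):
        # text_states: (B, T, H) contextual token states from a pre-trained language model
        # visual_feats: (B, T, Dv) frame-level features aligned to the same T positions
        fused = text_states + self.visual_proj(visual_feats)   # element-wise visual fusion
        start = self.start_head(fused)                         # (B, T, H)
        end = self.end_head(fused)                             # (B, T, H)
        # global-span matrix: a score for every (start, end) position pair
        return torch.einsum("bsh,beh->bse", start, end)        # (B, T, T)

def span_contrastive_loss(pos_matrix, neg_matrix, margin=1.0):
    # Push the best span score of the positive video above that of a negative video.
    pos_score = pos_matrix.flatten(1).max(dim=1).values
    neg_score = neg_matrix.flatten(1).max(dim=1).values
    return F.relu(margin - pos_score + neg_score).mean()

# Toy usage with random tensors in place of real LM and video encoders.
model = GlobalSpanSketch()
text = torch.randn(2, 16, 768)        # question + subtitle token states
pos_video = torch.randn(2, 16, 1024)  # frame features from the relevant video
neg_video = torch.randn(2, 16, 1024)  # frame features from an irrelevant video
loss = span_contrastive_loss(model(text, pos_video), model(text, neg_video))
loss.backward()
```

The point of the sketch is that every (start, end) pair of a video gets a score in a single matrix, so spans drawn from positive and negative videos can be ranked jointly, which is what lets the retrieval and localization subtasks share one training signal.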

Related research

10/26/2022 · Visual Answer Localization with Cross-modal Mutual Knowledge Transfer
The goal of visual answer localization (VAL) in the video is to obtai...

03/13/2022 · Towards Visual-Prompt Temporal Answering Grounding in Medical Instructional Video
The temporal answering grounding in the video (TAGV) is a new task natur...

10/05/2022 · Locate before Answering: Answer Guided Question Localization for Video Question Answering
Video question answering (VideoQA) is an essential task in vision-langua...

07/16/2022 · Clover: Towards A Unified Video-Language Alignment and Fusion Model
Building a universal video-language model for solving various video unde...

08/02/2022 · To Answer or Not to Answer? Improving Machine Reading Comprehension Model with Span-based Contrastive Learning
Machine Reading Comprehension with Unanswerable Questions is a difficult...

04/19/2023 · EC^2: Emergent Communication for Embodied Control
Embodied control requires agents to leverage multi-modal pre-training to...

06/28/2023 · SpotEM: Efficient Video Search for Episodic Memory
The goal in episodic memory (EM) is to search a long egocentric video to...
