Adaptive Hierarchical Graph Reasoning with Semantic Coherence for Video-and-Language Inference

07/26/2021
by Juncheng Li, et al.

Video-and-Language Inference is a recently proposed task for joint video-and-language understanding. The task requires a model to infer whether a natural language statement entails or contradicts a given video clip. In this paper, we study how to address three critical challenges for this task: judging the global correctness of a statement that involves multiple semantic meanings, joint reasoning over video and subtitles, and modeling long-range relationships and complex social interactions. First, we propose an adaptive hierarchical graph network that achieves in-depth understanding of videos with complex interactions. Specifically, it performs joint reasoning over video and subtitles in three hierarchies, where the graph structure is adaptively adjusted according to the semantic structure of the statement. Second, we introduce semantic coherence learning to explicitly encourage the semantic coherence of the adaptive hierarchical graph network across the three hierarchies. Semantic coherence learning further improves the alignment between vision and language, as well as the coherence across a sequence of video segments. Experimental results show that our method outperforms the baseline by a large margin.
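
To make the idea of a statement-dependent graph concrete, here is a minimal sketch in plain Python. It is not the paper's architecture. The abstract gives no equations, so the gating by element-wise product, the softmax edge weights, the single propagation step, the mean pooling, and all function names (`adaptive_adjacency`, `propagate`, `entailment_score`) are assumptions made for illustration, and the sketch uses one graph level rather than three hierarchies.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_adjacency(nodes, statement):
    """Row-normalised edge weights conditioned on the statement vector.

    Assumption: each node feature is gated by the statement (element-wise
    product), so edge weights depend on what the statement refers to.
    """
    gated = [[n * s for n, s in zip(node, statement)] for node in nodes]
    return [softmax([dot(g_i, g_j) for g_j in gated]) for g_i in gated]


def propagate(nodes, adjacency):
    """One round of message passing: each node becomes a weighted mean of all nodes."""
    dim = len(nodes[0])
    return [
        [sum(w * nodes[j][d] for j, w in enumerate(row)) for d in range(dim)]
        for row in adjacency
    ]


def entailment_score(nodes, statement):
    """Sigmoid of the similarity between the mean-pooled graph and the statement."""
    adjacency = adaptive_adjacency(nodes, statement)
    updated = propagate(nodes, adjacency)
    pooled = [sum(col) / len(updated) for col in zip(*updated)]
    return 1.0 / (1.0 + math.exp(-dot(pooled, statement)))


# Toy inputs (hypothetical): three video-segment features and one statement feature.
segments = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 0.0]]
statement = [1.0, 0.0, 0.0]

adjacency = adaptive_adjacency(segments, statement)
score = entailment_score(segments, statement)
print("entails" if score >= 0.5 else "contradicts", round(score, 3))
```

Changing `statement` changes the edge weights returned by `adaptive_adjacency`, which is the only property of the described method that this sketch tries to mirror.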

research
10/12/2020

Joint Semantic Analysis with Document-Level Cross-Task Coherence Rewards

Coreference resolution and semantic role labeling are NLP tasks that cap...
research
03/25/2020

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

We introduce a new task, Video-and-Language Inference, for joint multimo...
research
11/30/2018

MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment

This research strives for natural language moment retrieval in long, unt...
research
03/02/2023

Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos

Temporal sentence localization in videos (TSLV) aims to retrieve the mos...
research
12/31/2022

Translating Text Synopses to Video Storyboards

A storyboard is a roadmap for video creation which consists of shot-by-s...
research
01/17/2020

Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data

Conventional sequential learning methods such as Recurrent Neural Networ...
research
02/06/2023

Consensus dynamics and coherence in hierarchical small-world networks

The hierarchical small-world network is a real-world network. It models ...
