Exploiting Visual Semantic Reasoning for Video-Text Retrieval

06/16/2020
by   Zerun Feng, et al.
3

Video retrieval is a challenging research topic bridging the vision and language areas and has attracted broad attention in recent years. Previous works have been devoted to representing videos by directly encoding from frame-level features. In fact, videos consist of various and abundant semantic relations to which existing methods pay less attention. To address this issue, we propose a Visual Semantic Enhanced Reasoning Network (ViSERN) to exploit reasoning between frame regions. Specifically, we consider frame regions as vertices and construct a fully-connected semantic correlation graph. Then, we perform reasoning by novel random walk rule-based graph convolutional networks to generate region features involved with semantic relations. With the benefit of reasoning, semantic interactions between regions are considered, while the impact of redundancy is suppressed. Finally, the region features are aggregated to form frame-level features for further encoding to measure video-text similarity. Extensive experiments on two public benchmark datasets validate the effectiveness of our method by achieving state-of-the-art performance due to the powerful semantic reasoning.

READ FULL TEXT

page 1

page 6

research
05/18/2022

VRAG: Region Attention Graphs for Content-Based Video Retrieval

Content-based Video Retrieval (CBVR) is used on media-sharing platforms ...
research
03/02/2023

Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos

Temporal sentence localization in videos (TSLV) aims to retrieve the mos...
research
09/06/2019

Visual Semantic Reasoning for Image-Text Matching

Image-text matching has been a hot research topic bridging the vision an...
research
12/06/2022

Semantically Enhanced Global Reasoning for Semantic Segmentation

Recent advances in pixel-level tasks (e.g., segmentation) illustrate the...
research
11/23/2022

TransVCL: Attention-enhanced Video Copy Localization Network with Flexible Supervision

Video copy localization aims to precisely localize all the copied segmen...
research
06/18/2022

A Double-Graph Based Framework for Frame Semantic Parsing

Frame semantic parsing is a fundamental NLP task, which consists of thre...
research
03/11/2022

TFCNet: Temporal Fully Connected Networks for Static Unbiased Temporal Reasoning

Temporal Reasoning is one important functionality for vision intelligenc...

Please sign up or login with your details

Forgot password? Click here to reset