Multi-modal Learning with Prior Visual Relation Reasoning

12/23/2018
by Zhuoqian Yang, et al.

Visual relation reasoning, which aims to infer the relationships between objects and their properties, is a central component of recent cross-modal analysis tasks. These relationships convey rich semantics and can enrich the visual representation, improving cross-modal analysis. Previous work has successfully designed strategies for modeling latent relations or rigidly categorized relations, yielding performance gains. However, such methods ignore the ambiguity inherent in relations, which arises because the same relation can carry different semantics depending on the visual appearance of the objects involved. In this work, we model relations with context-sensitive embeddings built on human prior knowledge. We propose a novel plug-and-play relation reasoning module, injected with these relation embeddings, to enhance the image encoder. Specifically, we design enhanced Graph Convolutional Networks (GCNs) that exploit both the relation embeddings and the directionality of relations between objects to generate relation-aware image representations. We demonstrate the effectiveness of the relation reasoning module by applying it to both Visual Question Answering (VQA) and Cross-Modal Information Retrieval (CMIR). Extensive experiments on the VQA 2.0 and CMPlaces datasets show superior performance compared with state-of-the-art methods.
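To make the idea of a GCN that uses both relation embeddings and relation directionality more concrete, here is a minimal sketch of one such layer in NumPy. It is not the authors' formulation: the function name `relation_aware_gcn_layer`, the separate "outgoing"/"incoming" weight matrices, concatenating the neighbour feature with the relation embedding, and mean aggregation are all illustrative assumptions. Each directed edge `(subj, obj, rel_id)` sends a message in both directions, but through different weights, so the layer can distinguish "A rides B" from "B rides A".

```python
import numpy as np


def relation_aware_gcn_layer(h, edges, rel_emb, w_self, w_out, w_in):
    """One hypothetical direction-aware, relation-embedding GCN layer.

    h:       (N, d) object (node) features
    edges:   list of (subj, obj, rel_id) directed relation triples
    rel_emb: (R, d_r) relation embeddings (assumed precomputed)
    w_self:  (d, d_out) self-loop weights
    w_out:   (d + d_r, d_out) weights for messages reaching the subject
    w_in:    (d + d_r, d_out) weights for messages reaching the object
    """
    n = h.shape[0]
    d_out = w_self.shape[1]
    agg = np.zeros((n, d_out))
    counts = np.zeros(n)
    for s, o, r in edges:
        # Subject aggregates the object's feature plus the relation embedding.
        agg[s] += np.concatenate([h[o], rel_emb[r]]) @ w_out
        # Object aggregates the subject's feature, using different weights,
        # so the direction of the relation affects the result.
        agg[o] += np.concatenate([h[s], rel_emb[r]]) @ w_in
        counts[s] += 1
        counts[o] += 1
    # Mean over incident edges; nodes without edges keep only the self term.
    agg /= np.maximum(counts, 1)[:, None]
    # Self-loop term plus aggregated messages, followed by ReLU.
    return np.maximum(0.0, h @ w_self + agg)


# Usage: 3 detected objects, 2 relation types, 2 directed relations.
rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))
rel_emb = rng.normal(size=(2, 5))
edges = [(0, 1, 0), (1, 2, 1)]
out = relation_aware_gcn_layer(
    h, edges, rel_emb,
    w_self=rng.normal(size=(4, 6)),
    w_out=rng.normal(size=(9, 6)),
    w_in=rng.normal(size=(9, 6)),
)
print(out.shape)  # (3, 6)
```

Using separate weights per direction is one simple way to encode directionality; a real module would typically stack several such layers and learn the weights end to end with the downstream VQA or retrieval loss.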
