SrTR: Self-reasoning Transformer with Visual-linguistic Knowledge for Scene Graph Generation

12/19/2022
by Yuxiang Zhang, et al.

Objects in a scene are not always related. One-stage scene graph generation approaches execute efficiently, inferring the effective relations between entity pairs from sparse proposal sets and a few queries. However, they focus only on the relation between subject and object in the triplet set (subject entity, predicate entity, object entity), ignoring the relations between subject and predicate or between predicate and object, and they lack self-reasoning ability. In addition, the linguistic modality has been neglected in one-stage methods. It is necessary to mine linguistic knowledge to improve a model's reasoning ability. To address these shortcomings, a Self-reasoning Transformer with Visual-linguistic Knowledge (SrTR) is proposed to add flexible self-reasoning ability to the model. SrTR adopts an encoder-decoder architecture, and a self-reasoning decoder is developed to complete three inferences over the triplet set: s+o-p, s+p-o and p+o-s. Inspired by large-scale pre-trained image-text foundation models, visual-linguistic prior knowledge is introduced, and a visual-linguistic alignment strategy is designed to project visual representations into semantic spaces with prior knowledge to aid relational reasoning. Experiments on the Visual Genome dataset demonstrate the superiority and fast inference ability of the proposed method.
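
The two ideas in the abstract can be illustrated with a toy sketch. It is not the SrTR implementation: the abstract gives no architectural details, so every name here (`INFERENCE_PATTERNS`, `split_triplet`, `project`, `nearest_label`), the linear projection, and the cosine-similarity scoring are assumptions made for illustration. The first part enumerates the three inference patterns (two known triplet slots, one inferred slot). The second part shows one plausible reading of "projecting visual representations into a semantic space": map a visual feature into the space of text embeddings and pick the closest label.

```python
import math
from typing import Dict, List, Sequence, Tuple

# The three inference patterns named in the abstract:
# two known triplet slots are used to infer the remaining one.
INFERENCE_PATTERNS: Dict[str, Tuple[Tuple[str, str], str]] = {
    "s+o-p": (("subject", "object"), "predicate"),
    "s+p-o": (("subject", "predicate"), "object"),
    "p+o-s": (("predicate", "object"), "subject"),
}


def split_triplet(triplet: Dict[str, str], pattern: str) -> Tuple[Dict[str, str], str]:
    """Return the known slots (context) and the name of the slot to infer."""
    known_slots, target_slot = INFERENCE_PATTERNS[pattern]
    context = {slot: triplet[slot] for slot in known_slots}
    return context, target_slot


def project(visual: Sequence[float], weight: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical linear map from the visual space to the semantic space."""
    return [sum(w * v for w, v in zip(row, visual)) for row in weight]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_label(
    visual: Sequence[float],
    weight: Sequence[Sequence[float]],
    text_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Project a visual feature and return the label with the closest text embedding."""
    projected = project(visual, weight)
    return max(text_embeddings, key=lambda label: cosine(projected, text_embeddings[label]))
```

For example, `split_triplet({"subject": "person", "predicate": "riding", "object": "horse"}, "s+o-p")` returns the subject and object as context and names `predicate` as the slot to infer. In a real system the text embeddings would presumably come from a pre-trained image-text model and the projection would be learned; here both are placeholder values supplied by the caller.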


