Learning to Generate Scene Graph from Natural Language Supervision

09/06/2021
by   Yiwu Zhong, et al.
4

Learning from image-text data has demonstrated recent success for many recognition tasks, yet is currently limited to visual features or individual visual concepts such as objects. In this paper, we propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as scene graph. To bridge the gap between images and texts, we leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph. Further, we design a Transformer-based model to predict these "pseudo" labels via a masked token prediction task. Learning from only image-sentence pairs, our model achieves 30 scene graphs. Our model also shows strong results for weakly and fully supervised scene graph generation. In addition, we explore an open-vocabulary setting for detecting scene graphs, and present the first result for open-set scene graph generation. Our code is available at https://github.com/YiwuZhong/SGG_from_NLS.

READ FULL TEXT

page 8

page 9

research
03/16/2022

Pseudo-Q: Generating Pseudo Language Queries for Visual Grounding

Visual grounding, i.e., localizing objects in images according to natura...
research
11/29/2022

Language-driven Open-Vocabulary 3D Scene Understanding

Open-vocabulary scene understanding aims to localize and recognize unsee...
research
09/11/2019

Specifying Object Attributes and Relations in Interactive Scene Generation

We introduce a method for the generation of images from an input scene g...
research
07/17/2023

Pair then Relation: Pair-Net for Panoptic Scene Graph Generation

Panoptic Scene Graph (PSG) is a challenging task in Scene Graph Generati...
research
08/21/2022

Objects Can Move: 3D Change Detection by Geometric Transformation Constistency

AR/VR applications and robots need to know when the scene has changed. A...
research
10/12/2021

Topic Scene Graph Generation by Attention Distillation from Caption

If an image tells a story, the image caption is the briefest narrator. G...
research
11/04/2021

Unsupervised Learning of Compositional Energy Concepts

Humans are able to rapidly understand scenes by utilizing concepts extra...

Please sign up or login with your details

Forgot password? Click here to reset