Learning Similarity between Scene Graphs and Images with Transformers

04/02/2023
by Yuren Cong, et al.

Scene graph generation is conventionally evaluated by (mean) Recall@K, which measures the ratio of correctly predicted triplets that appear in the ground truth. However, such triplet-oriented metrics can neither capture the global semantic information of scene graphs nor measure the similarity between images and generated scene graphs. The usability of scene graphs in downstream tasks is therefore limited. To address this issue, a framework that can measure the similarity between scene graphs and images is required. Motivated by the successful application of Contrastive Language-Image Pre-training (CLIP), we propose a novel contrastive learning framework consisting of a graph Transformer and an image Transformer that aligns scene graphs and their corresponding images in a shared latent space. To enable the graph Transformer to comprehend the scene graph structure and extract representative features, we introduce a graph serialization technique that transforms a scene graph into a sequence with structural encoding. Based on our framework, we introduce R-Precision, which measures image retrieval accuracy, as a new evaluation metric for scene graph generation, and we establish new benchmarks for the Visual Genome and Open Images datasets. A series of experiments further demonstrates the effectiveness of the graph Transformer, which shows great potential as a scene graph encoder.
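To make the two ideas in the abstract concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss and a retrieval-based score, computed over already-encoded embeddings. It is an illustration under stated assumptions, not the authors' implementation. The function names, the toy 2-D embeddings, and the temperature of 0.07 are hypothetical. `top1_retrieval_accuracy` is a simplified stand-in for R-Precision, whose exact definition is not given in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(graph_embs, image_embs):
    """sim[i][j] = cosine similarity of graph i and image j."""
    return [[cosine(g, img) for img in image_embs] for g in graph_embs]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """CLIP-style loss: mean of graph-to-image and image-to-graph cross-entropy.

    Assumes graph k and image k form the matching pair (the diagonal).
    The temperature value is an assumption chosen for illustration.
    """
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[k]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


def top1_retrieval_accuracy(sim):
    """Fraction of graphs whose matching image has the highest similarity."""
    hits = 0
    for k, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == k)
    return hits / len(sim)


# Toy embeddings (hypothetical): graph k is meant to match image k.
graph_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.6]]

aligned = similarity_matrix(graph_embs, image_embs)
shuffled = similarity_matrix(graph_embs, image_embs[::-1])
```

In this toy setup the aligned pairing retrieves every matching image and gives a lower loss than the shuffled pairing, which is the behavior a similarity model of this kind is trained toward.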

research
04/06/2021

Scene Graph Embeddings Using Relative Similarity Supervision

Scene graphs are a powerful structured representation of the underlying ...
research
10/17/2022

SGRAM: Improving Scene Graph Parsing via Abstract Meaning Representation

Scene graph is structured semantic representation that can be modeled as...
research
11/30/2022

Iterative Scene Graph Generation with Generative Transformers

Scene graphs provide a rich, structured representation of a scene by enc...
research
05/27/2023

FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

Textual scene graph parsing has become increasingly important in various...
research
07/18/2023

Improving Text Semantic Similarity Modeling through a 3D Siamese Network

Siamese networks have gained popularity as a method for modeling text se...
research
09/19/2019

Triplet-Aware Scene Graph Embeddings

Scene graphs have become an important form of structured knowledge for t...
research
08/14/2019

3-D Scene Graph: A Sparse and Semantic Representation of Physical Environments for Intelligent Agents

Intelligent agents gather information and perceive semantics within the ...
