LANDMARK: Language-guided Representation Enhancement Framework for Scene Graph Generation

03/02/2023
by Xiaoguang Chang, et al.

Scene graph generation (SGG) is a sophisticated task that suffers from both complex visual features and the dataset long-tail problem. Recently, various unbiased strategies have been proposed that design novel loss functions and data-balancing schemes. Unfortunately, these unbiased methods fail to exploit language priors from a feature-refinement perspective. Inspired by the fact that predicates are highly correlated with the semantics hidden in the subject-object pair and the global context, we propose LANDMARK (LANguage-guiDed representation enhanceMent frAmewoRK), which learns predicate-relevant representations from language-vision interactive patterns, global language context, and pair-predicate correlation. Specifically, we first project object labels into three distinct semantic embeddings for different representation-learning branches. Then, the Language Attention Module (LAM) and the Experience Estimation Module (EEM) map subject-object word embeddings to an attention vector and a predicate distribution, respectively. The Language Context Module (LCM) encodes global context from all word embeddings, which avoids isolated learning from local information. Finally, the modules' outputs are used to update the visual representations and the SGG model's predictions. All language representations are generated solely from object categories, so no extra knowledge is needed. The framework is model-agnostic and consistently improves the performance of existing SGG models. Moreover, its representation-level unbiasing strategy gives LANDMARK the advantage of compatibility with other unbiased methods. Code is available at https://github.com/rafa-cxg/PySGG-cxg.
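To make the pipeline above concrete, the following is a minimal PyTorch sketch of how the three language modules could refine pairwise visual features and fuse a language prior into the predicate prediction. All module and parameter names here (LanguageAttentionModule, ExperienceEstimationModule, LanguageContextModule, embed_dim, visual_dim, and so on) are illustrative assumptions based only on the abstract, not the authors' implementation; the actual code is in the linked repository.

```python
# Hypothetical sketch of the LANDMARK idea: language embeddings derived purely
# from object categories gate visual features, add global context, and supply
# a predicate prior. Names and shapes are assumptions, not the official code.
import torch
import torch.nn as nn


class LanguageAttentionModule(nn.Module):
    """LAM: maps subject-object word embeddings to a channel attention vector."""
    def __init__(self, embed_dim, visual_dim):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, visual_dim), nn.ReLU(),
            nn.Linear(visual_dim, visual_dim), nn.Sigmoid(),
        )

    def forward(self, subj_emb, obj_emb):
        return self.mlp(torch.cat([subj_emb, obj_emb], dim=-1))


class ExperienceEstimationModule(nn.Module):
    """EEM: estimates a predicate distribution from the subject-object pair."""
    def __init__(self, embed_dim, num_predicates):
        super().__init__()
        self.classifier = nn.Linear(2 * embed_dim, num_predicates)

    def forward(self, subj_emb, obj_emb):
        return self.classifier(torch.cat([subj_emb, obj_emb], dim=-1)).softmax(dim=-1)


class LanguageContextModule(nn.Module):
    """LCM: encodes global context over all object word embeddings in an image."""
    def __init__(self, embed_dim, visual_dim):
        super().__init__()
        self.encoder = nn.LSTM(embed_dim, visual_dim, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * visual_dim, visual_dim)

    def forward(self, word_embs):            # (num_objects, embed_dim)
        ctx, _ = self.encoder(word_embs.unsqueeze(0))
        return self.proj(ctx.mean(dim=1))    # (1, visual_dim) global context


class Landmark(nn.Module):
    """Combines the three modules to refine pairwise visual features."""
    def __init__(self, num_classes, num_predicates, embed_dim=200, visual_dim=512):
        super().__init__()
        # Three distinct label embeddings, one per representation branch.
        self.embeds = nn.ModuleList(
            [nn.Embedding(num_classes, embed_dim) for _ in range(3)]
        )
        self.lam = LanguageAttentionModule(embed_dim, visual_dim)
        self.eem = ExperienceEstimationModule(embed_dim, num_predicates)
        self.lcm = LanguageContextModule(embed_dim, visual_dim)

    def forward(self, pair_visual_feats, subj_labels, obj_labels, all_labels, rel_logits):
        # Language-vision interactive pattern: gate pairwise visual features.
        attn = self.lam(self.embeds[0](subj_labels), self.embeds[0](obj_labels))
        # Global language context from every object label in the image.
        context = self.lcm(self.embeds[1](all_labels))
        refined = pair_visual_feats * attn + context
        # Pair-predicate correlation: fuse the language prior into the prediction.
        prior = self.eem(self.embeds[2](subj_labels), self.embeds[2](obj_labels))
        return refined, rel_logits.softmax(dim=-1) * prior
```

In this sketch the attention vector gates visual features element-wise, the global context is added residually, and the EEM prior is fused multiplicatively with the model's predicate scores; these fusion choices are one plausible reading of the abstract rather than the paper's exact formulation.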

