VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection

06/18/2022
by   Yu Cui, et al.

Visual Relationship Detection (VRD) pushes a computer vision model to 'see' beyond individual object instances and 'understand' how the objects in a scene relate to one another. The traditional approach to VRD first detects objects in an image and then separately predicts the relationship between the detected object instances. Such a disjoint approach is prone to predicting redundant relationship tags (i.e., predicates) with similar semantic meaning for the same object pair, or tags that resemble the ground truth but are semantically incorrect. To remedy this, we propose to jointly train a VRD model on visual object features and semantic relationship features. To this end, we present VReBERT, a BERT-like transformer model for Visual Relationship Detection that uses a multi-stage training strategy to jointly process visual and semantic features. We show that our simple BERT-like model outperforms state-of-the-art VRD models in predicate prediction. Furthermore, we show that using the pre-trained VReBERT model improves the state of the art in zero-shot predicate prediction by a significant margin (+8.49 R@50 and +8.99 R@100).
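The R@50 and R@100 figures above are recall-at-K scores. As a rough illustration only, the sketch below computes a per-image recall@K over scored (subject, predicate, object) triplets. The function name `recall_at_k`, the triplet representation, and the per-image formulation are assumptions made for this example; the exact evaluation protocol used for VReBERT is not described here and may differ.

```python
from typing import Hashable, Iterable, Tuple

# Hypothetical representation: a relationship is a (subject, predicate, object) triplet.
Triplet = Tuple[Hashable, Hashable, Hashable]


def recall_at_k(
    scored_predictions: Iterable[Tuple[Triplet, float]],
    ground_truth: Iterable[Triplet],
    k: int,
) -> float:
    """Fraction of ground-truth triplets found among the k highest-scoring predictions."""
    gt = set(ground_truth)
    if not gt:
        raise ValueError("ground_truth must contain at least one triplet")
    # Rank predictions by confidence, highest first, and keep the top k.
    ranked = sorted(scored_predictions, key=lambda pair: pair[1], reverse=True)
    top_k = {triplet for triplet, _score in ranked[:k]}
    return len(gt & top_k) / len(gt)


# Toy data, invented for illustration.
predictions = [
    (("person", "rides", "horse"), 0.9),
    (("person", "on", "horse"), 0.8),
    (("horse", "on", "grass"), 0.3),
]
truth = [("person", "rides", "horse"), ("horse", "on", "grass")]

recall_top2 = recall_at_k(predictions, truth, k=2)
recall_top3 = recall_at_k(predictions, truth, k=3)
```

In this toy case the near-synonymous prediction ("person", "on", "horse") occupies a top-2 slot and crowds out a correct triplet, which is the kind of redundancy the abstract attributes to disjoint pipelines.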
