Deep Variation-structured Reinforcement Learning for Visual Relationship and Attribute Detection

03/08/2017
by Xiaodan Liang, et al.

Despite progress in visual perception tasks such as image classification and detection, computers still struggle to understand the interdependency of objects in the scene as a whole, e.g., relations between objects or their attributes. Existing methods often ignore global context cues capturing the interactions among different object instances, and can only recognize a handful of types by exhaustively training individual detectors for all possible relationships. To capture such global interdependency, we propose a deep Variation-structured Reinforcement Learning (VRL) framework to sequentially discover object relationships and attributes in the whole image. First, a directed semantic action graph is built using language priors to provide a rich and compact representation of semantic correlations between object categories, predicates, and attributes. Next, we use a variation-structured traversal over the action graph to construct a small, adaptive action set for each step based on the current state and historical actions. In particular, an ambiguity-aware object mining scheme is used to resolve semantic ambiguity among object categories that the object detector fails to distinguish. We then make sequential predictions using a deep RL framework, incorporating global context cues and semantic embeddings of previously extracted phrases in the state vector. Our experiments on the Visual Relationship Detection (VRD) dataset and the large-scale Visual Genome dataset validate the superiority of VRL, which can achieve significantly better detection results on datasets involving thousands of relationship and attribute types. We also demonstrate that VRL is able to predict unseen types embedded in our action graph by learning correlations on shared graph nodes.
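To make the described pipeline concrete, below is a minimal Python sketch (not the authors' implementation) of two of the ideas in the abstract: a directed semantic action graph that links subject categories to predicates and attributes using language priors, and a variation-structured step that assembles a small, state-dependent action set from that graph instead of scoring every possible triple. All identifiers here (SemanticActionGraph, variation_structured_actions, max_actions) are illustrative assumptions, not names from the paper.

# Minimal sketch, assuming a graph built from language priors and a
# traversal that returns only the actions licensed for the current step.
from collections import defaultdict


class SemanticActionGraph:
    """Directed graph: (subject, object) -> predicates, category -> attributes."""

    def __init__(self):
        self.predicates = defaultdict(set)   # (subject, object) -> {predicate, ...}
        self.attributes = defaultdict(set)   # category -> {attribute, ...}

    def add_relation(self, subject, predicate, obj):
        self.predicates[(subject, obj)].add(predicate)

    def add_attribute(self, category, attribute):
        self.attributes[category].add(attribute)


def variation_structured_actions(graph, subject, candidate_objects, history,
                                 max_actions=20):
    """Build a small, adaptive action set for the current state:
    only graph-licensed predicates and attributes for the current subject
    and candidate objects, excluding actions already taken."""
    predicate_actions = {
        (subject, p, o)
        for o in candidate_objects
        for p in graph.predicates[(subject, o)]
    }
    attribute_actions = {(subject, a) for a in graph.attributes[subject]}
    actions = (predicate_actions | attribute_actions) - set(history)
    return sorted(actions, key=str)[:max_actions]


if __name__ == "__main__":
    g = SemanticActionGraph()
    g.add_relation("person", "rides", "horse")
    g.add_relation("person", "next to", "horse")
    g.add_attribute("horse", "brown")

    # At each step the agent scores only this handful of graph-licensed
    # actions rather than every (subject, predicate, object) combination.
    print(variation_structured_actions(g, "person", ["horse"], history=[]))
    print(variation_structured_actions(g, "horse", [], history=[]))

In a full RL setup, the state vector fed to the policy would additionally encode global image context and embeddings of previously extracted phrases, as the abstract describes; this sketch only illustrates how the action space stays small and adaptive.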

Related research

09/11/2018 · Context-Dependent Diffusion Network for Visual Relationship Detection
Visual relationship detection can bridge the gap between computer vision...

03/08/2017 · Tree-Structured Reinforcement Learning for Sequential Object Localization
Existing object proposal algorithms usually search for possible object r...

03/08/2019 · Knowledge-Embedded Routing Network for Scene Graph Generation
To understand a scene in depth not only involves locating/recognizing in...

11/20/2018 · Scene Graph Generation via Conditional Random Fields
Despite the great success object detection and segmentation models have ...

08/14/2020 · ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection
We consider the problem of Human-Object Interaction (HOI) Detection, whi...

07/28/2016 · SEMBED: Semantic Embedding of Egocentric Action Videos
We present SEMBED, an approach for embedding an egocentric object intera...

02/23/2016 · Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations
Despite progress in perceptual tasks such as image classification, compu...
