Leveraging Auxiliary Text for Deep Recognition of Unseen Visual Relationships

10/27/2019
by Gal Sadeh Kenigsfield, et al.

One of the most difficult tasks in scene understanding is recognizing interactions between objects in an image, a task often called visual relationship detection (VRD). We ask whether VRD performance can be improved by providing auxiliary textual data in addition to the standard visual data used to train VRD models. We present a new deep model that can leverage such additional text. Our model relies on a shared text–image representation of subject-verb-object relationships appearing in the text and of object interactions in images. Our method is the first to enable recognition of visual relationships that are missing from the visual training data and appear only in the auxiliary text. We evaluate our approach with two different text sources, text originating in images and text originating in books, on two large-scale recognition tasks: VRD and Scene Graph Generation. We find a surprising result: our approach works better with text originating in books, outperforming text originating in images on unseen relationship recognition while remaining comparable to it on seen relationship recognition.
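The shared text–image representation described above can be pictured as two encoders projecting into one common space: one encoder for subject-verb-object triples mined from text, one for the visual features of a subject-object box pair. The sketch below is only an illustration of that idea, not the authors' architecture; the module names, dimensions, and the generic contrastive alignment loss are all assumptions.

# Minimal sketch (assumed, not the paper's actual model): two encoders mapping
# text triples and visual object pairs into a shared embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TripleTextEncoder(nn.Module):
    """Embeds a (subject, verb, object) triple given word indices."""

    def __init__(self, vocab_size: int, word_dim: int = 300, out_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.proj = nn.Sequential(
            nn.Linear(3 * word_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, triples: torch.Tensor) -> torch.Tensor:
        # triples: (batch, 3) word indices for subject, verb, object
        w = self.embed(triples)                      # (batch, 3, word_dim)
        return F.normalize(self.proj(w.flatten(1)), dim=-1)


class PairVisualEncoder(nn.Module):
    """Embeds the appearance features of a detected subject-object box pair."""

    def __init__(self, feat_dim: int = 2048, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(2 * feat_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim)
        )

    def forward(self, subj_feat: torch.Tensor, obj_feat: torch.Tensor) -> torch.Tensor:
        pair = torch.cat([subj_feat, obj_feat], dim=-1)
        return F.normalize(self.proj(pair), dim=-1)


def alignment_loss(text_emb: torch.Tensor, vis_emb: torch.Tensor, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss pulling matching text/visual embeddings together."""
    logits = text_emb @ vis_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

Under this reading, a relationship that appears only in the auxiliary text can still be recognized at test time, because its text embedding lives in the same space as the visual pair embeddings and can be matched by nearest-neighbor search; the actual training objective and encoders used in the paper may differ.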


Related research

11/20/2018 · Scene Graph Generation via Conditional Random Fields
Despite the great success object detection and segmentation models have ...

09/11/2018 · Context-Dependent Diffusion Network for Visual Relationship Detection
Visual relationship detection can bridge the gap between computer vision...

11/27/2018 · A Compositional Textual Model for Recognition of Imperfect Word Images
Printed text recognition is an important problem for industrial OCR syst...

07/21/2018 · Equal But Not The Same: Understanding the Implicit Relationship Between Persuasive Images and Text
Images and text in advertisements interact in complex, non-literal ways....

04/25/2019 · Scene Graph Prediction with Limited Labels
Visual knowledge bases such as Visual Genome power numerous applications...

07/13/2018 · Zoom-Net: Mining Deep Feature Interactions for Visual Relationship Recognition
Recognizing visual relationships <subject-predicate-object> among any pa...

04/27/2023 · Learning Human-Human Interactions in Images from Weak Textual Supervision
Interactions between humans are diverse and context-dependent, but previ...
