Phrase Localization and Visual Relationship Detection with Comprehensive Image-Language Cues

by Bryan A. Plummer, et al.
University of Illinois at Urbana-Champaign

This paper presents a framework for localizing (grounding) phrases in images using a large collection of linguistic and visual cues. We model the appearance, size, and position of entity bounding boxes; adjectives that carry attribute information; and spatial relationships between pairs of entities connected by verbs or prepositions. Special attention is given to relationships between people and mentions of clothing or body parts, as these are useful for distinguishing individuals. We automatically learn weights for combining these cues and, at test time, perform joint inference over all phrases in a caption. The resulting system achieves state-of-the-art performance on phrase localization on the Flickr30k Entities dataset and on visual relationship detection on the Stanford VRD dataset.
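The core idea sketched in the abstract, scoring each candidate box for a phrase as a learned weighted combination of per-cue scores, can be illustrated with a minimal toy example. The cue names, weights, and function names below are illustrative assumptions, not the paper's actual model or learning procedure.

```python
import numpy as np

# Hypothetical sketch of cue combination for phrase localization:
# each candidate box gets a score that is a weighted sum of per-cue
# scores (e.g. appearance, size, position, attributes, relations).
# In the paper the weights are learned; here they are hand-set.

def score_boxes(cue_scores: dict, weights: dict) -> np.ndarray:
    """Combine per-cue scores (each an array over candidate boxes)
    into a single score per box via a weighted sum."""
    n = len(next(iter(cue_scores.values())))
    total = np.zeros(n)
    for cue, scores in cue_scores.items():
        total += weights.get(cue, 0.0) * np.asarray(scores, dtype=float)
    return total

def localize_phrase(cue_scores: dict, weights: dict) -> int:
    """Return the index of the highest-scoring candidate box."""
    return int(np.argmax(score_boxes(cue_scores, weights)))

# Toy example: three candidate boxes scored by two cues.
scores = {
    "appearance": [0.2, 0.9, 0.4],  # e.g. detector confidence per box
    "position":   [0.5, 0.1, 0.3],  # e.g. spatial prior per box
}
w = {"appearance": 1.0, "position": 0.5}
best = localize_phrase(scores, w)  # box with the best combined score
```

Joint inference over all phrases in a caption, as described in the abstract, would additionally couple these per-phrase choices through pairwise relationship cues rather than picking each box independently.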

