Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators

09/22/2019
by Kuang-Huei Lee, et al.

Grounding language to visual relations is critical to various language-and-vision applications. In this work, we tackle two fundamental language-and-vision tasks, image-text matching and image captioning, and demonstrate that neural scene graph generators can learn effective visual relation features that help ground language to visual relations and, in turn, improve both end applications. Combining the relation features with state-of-the-art models yields significant improvements on the standard Flickr30K and MSCOCO benchmarks. Our experimental results and analysis show that relation features improve downstream models' ability to capture visual relations in these end applications. We also demonstrate that training scene graph generators on visually relevant relations is important to the effectiveness of the relation features.
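
As a rough illustration of the approach described in the abstract, the sketch below shows one way relation features produced by a pretrained scene graph generator could be fused with detector region features before they reach an image-text matching or captioning model. This is a minimal, hypothetical example: the class name RelationAugmentedEncoder, the feature dimensions, and the concatenation-based fusion are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' released code): project object-region
# features and relation (triplet) features into one shared space so a
# downstream image-text matching or captioning model can attend to both.

import torch
import torch.nn as nn


class RelationAugmentedEncoder(nn.Module):
    """Fuses region features with relation features as one set of visual tokens."""

    def __init__(self, region_dim=2048, relation_dim=512, joint_dim=1024):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.relation_proj = nn.Linear(relation_dim, joint_dim)

    def forward(self, region_feats, relation_feats):
        # region_feats:   (num_regions,   region_dim)   from an object detector
        # relation_feats: (num_relations, relation_dim) from a scene graph
        #                 generator, one vector per subject-predicate-object triplet
        regions = self.region_proj(region_feats)
        relations = self.relation_proj(relation_feats)
        # Concatenating along the token axis lets the downstream model attend
        # over objects and relations uniformly.
        return torch.cat([regions, relations], dim=0)


if __name__ == "__main__":
    encoder = RelationAugmentedEncoder()
    regions = torch.randn(36, 2048)    # e.g. 36 detected regions
    relations = torch.randn(20, 512)   # e.g. 20 predicted relation triplets
    visual_tokens = encoder(regions, relations)
    print(visual_tokens.shape)         # torch.Size([56, 1024])
```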
