Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators

by Kuang-Huei Lee, et al.

Grounding language to visual relations is critical to various language-and-vision applications. In this work, we tackle two fundamental language-and-vision tasks, image-text matching and image captioning, and demonstrate that neural scene graph generators can learn effective visual relation features that facilitate grounding language to visual relations and subsequently improve both end applications. By combining relation features with state-of-the-art models, our experiments show significant improvement on the standard Flickr30K and MSCOCO benchmarks. Our experimental results and analysis show that relation features improve downstream models' ability to capture visual relations in end vision-and-language applications. We also demonstrate that learning scene graph generators with visually relevant relations is important to the effectiveness of the relation features.


Consensus Graph Representation Learning for Better Grounded Image Captioning

The contemporary visual captioning models frequently hallucinate objects...

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality

Contrastively trained vision-language models have achieved remarkable pr...

Quantifying the amount of visual information used by neural caption generators

This paper addresses the sensitivity of neural image caption generators ...

Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

Many vision and language models suffer from poor visual grounding - ofte...

More Grounded Image Captioning by Distilling Image-Text Matching Model

Visual attention not only improves the performance of image captioners, ...

An Interpretable Model for Scene Graph Generation

We propose an efficient and interpretable scene graph generator. We cons...

Addressing Class Imbalance in Scene Graph Parsing by Learning to Contrast and Score

Scene graph parsing aims to detect objects in an image scene and recogni...