Diverse Image Captioning with Context-Object Split Latent Spaces

11/02/2020
by   Shweta Mahajan, et al.

Diverse image captioning models aim to learn the one-to-many mappings that are innate to cross-domain datasets, such as those of images and text. Current methods for this task are based on generative latent variable models, e.g. VAEs with structured latent spaces. Yet, the amount of multimodality captured by prior work is limited to that of the paired training data – the true diversity of the underlying generative process is not fully captured. To address this limitation, we leverage the contextual descriptions in the dataset that explain similar contexts in different visual scenes. To this end, we introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts within the dataset. Our framework not only enables diverse captioning through context-based pseudo supervision, but extends this to images with novel objects and without paired captions in the training data. We evaluate our COS-CVAE approach on the standard COCO dataset and on the held-out COCO dataset consisting of images with novel objects, showing significant gains in accuracy and diversity.
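The core idea of a context-object split latent space can be illustrated with a minimal sketch. This is not the paper's COS-CVAE implementation; the encoder, dimensions, and fixed posterior scale below are illustrative assumptions. It only shows the factorization mechanic: the latent code is split into a context part and an object part, and resampling the context latent while holding the object latent fixed yields multiple candidate codes describing the same objects in varied contexts.

```python
import numpy as np

rng = np.random.default_rng(0)

D_CTX, D_OBJ = 4, 4  # illustrative latent dimensions

def encode(image_feat, caption_feat):
    """Hypothetical encoder: map paired features to the means of two
    independent Gaussian posteriors, one per latent factor."""
    mu_ctx = np.tanh(image_feat[:D_CTX] + caption_feat[:D_CTX])
    mu_obj = np.tanh(image_feat[D_CTX:D_CTX + D_OBJ])
    return mu_ctx, mu_obj

def sample(mu, sigma=0.1):
    # Reparameterized draw from N(mu, sigma^2 I) with a fixed scale.
    return mu + sigma * rng.standard_normal(mu.shape)

# Toy paired features standing in for image/caption embeddings.
img = rng.standard_normal(8)
cap = rng.standard_normal(8)

mu_ctx, mu_obj = encode(img, cap)
z_obj = sample(mu_obj)  # object latent drawn once and held fixed

# Diverse codes: resample only the context factor per caption candidate.
diverse_codes = [np.concatenate([sample(mu_ctx), z_obj]) for _ in range(3)]
```

A decoder conditioned on such codes would then generate one caption per code; because `z_obj` is shared, the described objects stay consistent while the context varies.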


Related research

- 09/10/2021: Partially-supervised novel object captioning leveraging context from paired data. "In this paper, we propose an approach to improve image captioning soluti..."
- 02/16/2020: Latent Normalizing Flows for Many-to-Many Cross-Domain Mappings. "Learned joint representations of images and text form the backbone of se..."
- 08/22/2019: Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning. "Diverse and accurate vision+language modeling is an important goal to re..."
- 05/03/2022: Diverse Image Captioning with Grounded Style. "Stylized image captioning as presented in prior work aims to generate ca..."
- 11/17/2015: Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data. "While recent deep neural network models have achieved promising results ..."
- 12/13/2021: MAGIC: Multimodal relAtional Graph adversarIal inferenCe for Diverse and Unpaired Text-based Image Captioning. "Text-based image captioning (TextCap) requires simultaneous comprehensio..."
- 03/06/2020: Captioning Images with Novel Objects via Online Vocabulary Expansion. "In this study, we introduce a low cost method for generating description..."
