Sequential Latent Spaces for Modeling the Intention During Diverse Image Captioning

08/22/2019
by Jyoti Aneja, et al.

Diverse and accurate vision+language modeling is an important goal to retain creative freedom and maintain user engagement. However, adequately capturing the intricacies of diversity in language models is challenging. Recent works commonly resort to latent variable models augmented with more or less supervision from object detectors or part-of-speech tags. Common to all those methods is the fact that the latent variable either only initializes the sentence generation process or is identical across the steps of generation. Neither design offers fine-grained control. To address this concern, we propose Seq-CVAE, which learns a latent space for every word position. We encourage this temporal latent space to capture the 'intention' about how to complete the sentence by mimicking a representation which summarizes the future. We illustrate the efficacy of the proposed approach in anticipating the sentence continuation on the challenging MSCOCO dataset, significantly improving diversity metrics compared to baselines while performing on par w.r.t. sentence quality.
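
To make the contrast between a single shared latent variable and one latent variable per word position concrete, here is a minimal sketch in plain Python. It is not the Seq-CVAE implementation: `toy_prior`, `toy_step`, and `decode` are hypothetical stand-ins for learned networks, the dimensions are arbitrary, and the training objective (including the part that mimics a representation summarizing the future) is omitted entirely. It only shows where sampling happens in each design.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Reparameterized Gaussian sample: z = mu + sigma * eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def toy_prior(hidden):
    """Hypothetical stand-in for a learned prior conditioned on the decoder state."""
    mu = [math.tanh(h) for h in hidden]
    log_var = [-1.0 for _ in hidden]
    return mu, log_var


def toy_step(hidden, z):
    """Hypothetical stand-in for one recurrent decoder update."""
    return [math.tanh(0.5 * h + z_i) for h, z_i in zip(hidden, z)]


def decode(num_steps, dim, rng, per_step=True):
    """Return the latent used at each word position.

    per_step=True  : draw a fresh latent at every position (sequential latents).
    per_step=False : draw one latent and reuse it at every position.
    """
    hidden = [0.0] * dim
    z = None
    latents = []
    for _ in range(num_steps):
        if per_step or z is None:
            mu, log_var = toy_prior(hidden)
            z = sample_latent(mu, log_var, rng)
        latents.append(z)
        hidden = toy_step(hidden, z)
    return latents


if __name__ == "__main__":
    sequential = decode(5, 3, random.Random(0), per_step=True)
    shared = decode(5, 3, random.Random(0), per_step=False)
    print("distinct per-step latents:", len({tuple(z) for z in sequential}))
    print("distinct shared latents:", len({tuple(z) for z in shared}))
```

In the per-step variant, each latent is drawn from a distribution that depends on the current decoder state, so the sampled values can steer the sentence at every word rather than only at the start.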

