Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

09/05/2019
by Wei Wei, et al.

Recently, automatic image caption generation has become an important focus of work on multimodal translation. Existing approaches can be roughly divided into two classes, i.e., top-down and bottom-up: the former transfers image information (visual-level features) directly into a caption, while the latter uses extracted words (semantic-level attributes) to generate a description. However, previous methods are typically based on a one-stage decoder or exploit only part of the visual-level or semantic-level information for caption generation. In this paper, we address this problem and propose an innovative multi-stage architecture (called Stack-VS) for rich, fine-grained image caption generation, which combines bottom-up and top-down attention models to effectively handle both the visual-level and semantic-level information of an input image. Specifically, we propose a novel stacked decoder model composed of a sequence of decoder cells, each containing two LSTM layers that work interactively to re-optimize the attention weights on both visual-level feature vectors and semantic-level attribute embeddings for generating a fine-grained image caption. Extensive experiments on the popular benchmark dataset MSCOCO show significant improvements on different evaluation metrics; e.g., the BLEU-4/CIDEr/SPICE scores improve by 0.372, 1.226 and 0.216, respectively, compared to the state-of-the-art.
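
The decoder cell described in the abstract lends itself to a compact sketch. The following is a minimal PyTorch illustration, based only on the description of two interacting LSTM layers that re-weight attention over visual-level feature vectors and semantic-level attribute embeddings; the module and variable names (SoftAttention, StackVSDecoderCell, attn_lstm, lang_lstm, etc.) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of one Stack-VS-style decoder cell (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    """Additive (Bahdanau-style) attention over a set of feature vectors."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_items, feat_dim); hidden: (batch, hidden_dim)
        energy = torch.tanh(self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1))
        weights = F.softmax(self.score(energy).squeeze(-1), dim=1)    # (batch, num_items)
        context = (weights.unsqueeze(-1) * feats).sum(dim=1)          # (batch, feat_dim)
        return context, weights


class StackVSDecoderCell(nn.Module):
    """One decoder cell: an attention LSTM followed by a language LSTM,
    attending over visual features and semantic attribute embeddings."""
    def __init__(self, word_dim, visual_dim, attr_dim, hidden_dim, attn_dim, vocab_size):
        super().__init__()
        self.visual_attn = SoftAttention(visual_dim, hidden_dim, attn_dim)
        self.semantic_attn = SoftAttention(attr_dim, hidden_dim, attn_dim)
        # Layer 1 mixes the previous word with the language LSTM's state.
        self.attn_lstm = nn.LSTMCell(word_dim + hidden_dim, hidden_dim)
        # Layer 2 consumes both attended contexts to predict the next word.
        self.lang_lstm = nn.LSTMCell(visual_dim + attr_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, visual_feats, attr_embs, state):
        (h_attn, c_attn), (h_lang, c_lang) = state
        # Update the attention LSTM with the current word and language context.
        h_attn, c_attn = self.attn_lstm(torch.cat([word_emb, h_lang], dim=1),
                                        (h_attn, c_attn))
        # Re-weight visual-level and semantic-level information with the new state.
        vis_ctx, _ = self.visual_attn(visual_feats, h_attn)
        sem_ctx, _ = self.semantic_attn(attr_embs, h_attn)
        # The language LSTM fuses both contexts and emits word logits.
        h_lang, c_lang = self.lang_lstm(torch.cat([vis_ctx, sem_ctx, h_attn], dim=1),
                                        (h_lang, c_lang))
        logits = self.out(h_lang)
        return logits, ((h_attn, c_attn), (h_lang, c_lang))
```

In a full Stack-VS-style model, several such cells would presumably be stacked, with each stage's hidden states and attention weights passed to the next stage to refine the caption; the sketch above covers only a single cell at a single time step.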


