End-to-end Image Captioning Exploits Multimodal Distributional Similarity

09/11/2018
by Pranava Madhyastha, et al.

We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn 'distributional similarity' in a multimodal feature space: they map a test image to similar training images in this space and generate a caption from the same space. To validate our hypothesis, we focus on the 'image' side of image captioning: we vary the input image representation while keeping the RNN text-generation component of a CNN-RNN model constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) suffer no significant performance loss when a high-dimensional representation is compressed to a lower-dimensional space; and (iii) cluster images with similar visual and linguistic information together. These findings support our distributional similarity hypothesis. We conclude that, regardless of the image representation used, image captioning systems match images and generate captions in a learned joint image-text semantic subspace.
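To make the experimental design concrete, here is a minimal sketch of such a setup, assuming a PyTorch implementation; the class name, dimensions, and random-projection compression below are illustrative assumptions, not the paper's released code. The RNN decoder is held constant, and only the image representation fed into it (e.g. a full 2048-d pooled CNN feature versus a compressed 256-d version) changes:

```python
# Minimal sketch (illustrative, not the authors' code): a CNN-RNN captioner
# whose LSTM decoder stays fixed while the image representation is swapped.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM text generator conditioned on an image feature of any dimension."""
    def __init__(self, feat_dim, vocab_size=10000, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial hidden state
        self.init_c = nn.Linear(feat_dim, hidden_dim)  # image feature -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, captions):
        h0 = self.init_h(image_feat).unsqueeze(0)      # (1, B, H)
        c0 = self.init_c(image_feat).unsqueeze(0)      # (1, B, H)
        x = self.embed(captions)                       # (B, T, E)
        hidden, _ = self.lstm(x, (h0, c0))
        return self.out(hidden)                        # next-token logits (B, T, V)

# "Vary the image side": only feat_dim changes between runs; the decoder
# architecture and training recipe are identical.
full_feats = torch.randn(4, 2048)                      # e.g. pooled CNN features
proj = torch.randn(2048, 256) / 2048 ** 0.5            # random projection (compression)
compressed_feats = full_feats @ proj                   # 2048-d -> 256-d

tokens = torch.randint(0, 10000, (4, 12))              # dummy caption prefixes
logits_full = CaptionDecoder(feat_dim=2048)(full_feats, tokens)
logits_small = CaptionDecoder(feat_dim=256)(compressed_feats, tokens)
print(logits_full.shape, logits_small.shape)           # both (4, 12, 10000)
```

Because only the input projection depends on the feature dimension, any difference in caption quality between runs can be attributed to the image representation alone, which is what the paper's finding (ii) probes.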


Related research

04/23/2018
Object Counts! Bringing Explicit Detections Back into Image Captioning
The use of explicit object detectors as an intermediate step to image ca...

01/20/2023
Visual Semantic Relatedness Dataset for Image Captioning
Modern image captioning systems rely heavily on extracting knowledge fr...

07/12/2022
A Baseline for Detecting Out-of-Distribution Examples in Image Captioning
Image captioning research has achieved breakthroughs in recent years by deve...

04/30/2021
End-to-End Attention-based Image Captioning
In this paper, we address the problem of image captioning specifically f...

05/25/2023
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
A great deal of progress has been made in image captioning, driven by re...

09/15/2017
Self-Guiding Multimodal LSTM - when we do not have a perfect training dataset for image captioning
In this paper, a self-guiding multimodal LSTM (sg-LSTM) image captioning...

11/16/2016
Semantic Regularisation for Recurrent Image Annotation
The "CNN-RNN" design pattern is increasingly widely applied in a variety...
