Retrieval-Augmented Transformer for Image Captioning

07/26/2022
by Sara Sarto, et al.

Image captioning models aim to connect Vision and Language by generating natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models, with progress coming from better visual feature extraction or from better modeling of multi-modal connections. In this paper, we investigate an image captioning approach with a kNN memory, through which knowledge can be retrieved from an external corpus to aid the generation process. Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer that predicts tokens based on the past context and on text retrieved from the external memory. Experimental results on the COCO dataset demonstrate that employing an explicit external memory can aid the generation process and increase caption quality. Our work opens up new avenues for improving image captioning models at larger scale.
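
The retrieval and attention steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes cosine similarity between image feature vectors as the "visual similarity", and plain scaled dot-product attention over the embeddings of the retrieved captions. The names `knn_retrieve` and `attend_to_retrieved`, and the shapes of all arrays, are hypothetical.

```python
import numpy as np


def knn_retrieve(query_feat, memory_feats, k=3):
    """Return indices and cosine similarities of the k nearest memory entries.

    Assumption: each image is represented by a single feature vector, and
    visual similarity is cosine similarity between those vectors.
    """
    q = query_feat / np.linalg.norm(query_feat)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every memory item
    top = np.argsort(-sims)[:k]       # indices of the k most similar items
    return top, sims[top]


def attend_to_retrieved(token_query, retrieved_embs):
    """Scaled dot-product attention of one decoder query over retrieved text.

    Assumption: retrieved captions have already been encoded into vectors
    (one row per retrieved item) with the same dimension as the query.
    """
    d = token_query.shape[0]
    scores = retrieved_embs @ token_query / np.sqrt(d)
    scores = scores - scores.max()    # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    context = weights @ retrieved_embs
    return context, weights


# Toy usage with a hypothetical four-item external memory.
memory_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
memory_text_embs = np.array([[0.5, 0.1], [0.2, 0.9], [0.4, 0.4], [0.0, 0.3]])

idx, sims = knn_retrieve(np.array([1.0, 0.0]), memory_feats, k=2)
context, weights = attend_to_retrieved(np.array([0.3, 0.7]), memory_text_embs[idx])
```

In a full model the resulting `context` vector would be combined with the decoder's ordinary attention over the past context before predicting the next token. How that combination is done is left open here, since the abstract does not specify it.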

