M^2: Meshed-Memory Transformer for Image Captioning

12/17/2019
by Marcella Cornia, et al.

Transformer-based architectures represent the state of the art in sequence modeling tasks like machine translation and language understanding. Their applicability to multi-modal contexts like image captioning, however, is still largely under-explored. With the aim of filling this gap, we present M^2, a Meshed Transformer with Memory for Image Captioning. The architecture improves both the image encoding and the language generation steps: it learns a multi-level representation of the relationships between image regions that integrates learned a priori knowledge, and it uses a mesh-like connectivity at the decoding stage to exploit both low- and high-level features. Experimentally, we investigate the performance of the M^2 Transformer and different fully-attentive models in comparison with recurrent ones. When tested on COCO, our proposal achieves a new state of the art in single-model and ensemble configurations on the "Karpathy" test split and on the online test server. We also assess its performance when describing objects unseen in the training set. Trained models and code for reproducing the experiments are publicly available at: https://github.com/aimagelab/meshed-memory-transformer.
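The abstract does not spell out how "learned a priori knowledge" enters the encoder. One plausible reading, sketched below in plain Python, is that a small set of learned memory slots is appended to the keys and values of a standard scaled dot-product attention over image regions, so each region can attend to other regions and to the memory. The function name `memory_augmented_attention`, the slot matrices `mem_k` and `mem_v`, and all sizes are assumptions made for illustration; this is not taken from the released code, and the weights here are random rather than trained.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def memory_augmented_attention(regions, w_q, w_k, w_v, mem_k, mem_v):
    """Single-head attention over image regions plus memory slots.

    regions: n x d region features; w_q, w_k, w_v: d x d projections.
    mem_k, mem_v: m x d memory slots (hypothetical names; assumed to be
    learned parameters that do not depend on the input image).
    Returns (n x d outputs, n x (n + m) attention weights).
    """
    q = matmul(regions, w_q)
    k = matmul(regions, w_k) + mem_k  # list concatenation: (n + m) x d keys
    v = matmul(regions, w_v) + mem_v  # (n + m) x d values
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))  # n x (n + m)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n, m, d = 3, 2, 4  # 3 regions, 2 memory slots, feature size 4 (toy values)
regions = rand_matrix(n, d)
w_q, w_k, w_v = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
mem_k, mem_v = rand_matrix(m, d), rand_matrix(m, d)

out, weights = memory_augmented_attention(regions, w_q, w_k, w_v, mem_k, mem_v)
print(len(out), len(out[0]), len(weights[0]))
```

Each row of `weights` has one entry per region plus one per memory slot, so part of the attention mass can go to information that is not present in the current image. The mesh-like decoder connectivity mentioned in the abstract is not covered by this sketch.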

research
02/11/2022

ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning

Recent research that applies Transformer-based architectures to image ca...
research
10/07/2019

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

The ability to generate natural language explanations conditioned on the...
research
02/21/2022

CaMEL: Mean Teacher Learning for Image Captioning

Describing images in natural language is a fundamental step towards the ...
research
07/07/2022

ExpansionNet: exploring the sequence length bottleneck in the Transformer for Image Captioning

Most recent state of art architectures rely on combinations and variatio...
research
08/23/2023

With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning

Image captioning, like many tasks involving vision and language, current...
research
04/29/2020

Image Captioning through Image Transformer

Automatic captioning of images is a task that combines the challenges of...
research
11/17/2022

Progressive Tree-Structured Prototype Network for End-to-End Image Captioning

Studies of image captioning are shifting towards a trend of a fully end-...
