Image Captioning through Image Transformer

04/29/2020
by Sen He, et al.

Automatic captioning of images is a task that combines the challenges of image analysis and text generation. One important aspect of captioning is the notion of attention: how to decide what to describe and in which order. Inspired by the successes in text analysis and translation, previous work has proposed the transformer architecture for image captioning. However, the structure between the semantic units in images (usually the regions detected by an object detection model) differs from that in sentences (where each unit is a single word). Limited work has been done to adapt the transformer's internal architecture to images. In this work, we introduce the image transformer, which consists of a modified encoding transformer and an implicit decoding transformer, motivated by the relative spatial relationships between image regions. Our design widens the original transformer layer's inner architecture to adapt to the structure of images. With only region features as inputs, our model achieves new state-of-the-art performance on both the MSCOCO offline and online testing benchmarks.

research
01/26/2021

CPTR: Full Transformer Network for Image Captioning

In this paper, we consider the image captioning task from a new sequence...
research
12/17/2019

M^2: Meshed-Memory Transformer for Image Captioning

Transformer-based architectures represent the state of the art in sequen...
research
06/15/2020

Multi-Image Summarization: Textual Summary from a Set of Cohesive Images

Multi-sentence summarization is a well studied problem in NLP, while gen...
research
10/07/2019

SMArT: Training Shallow Memory-aware Transformers for Robotic Explainability

The ability to generate natural language explanations conditioned on the...
research
12/28/2021

Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) – Team: MMCUniAugsburg

The Multimedia and Computer Vision Lab of the University of Augsburg par...
research
07/01/2021

Egocentric Image Captioning for Privacy-Preserved Passive Dietary Intake Monitoring

Camera-based passive dietary intake monitoring is able to continuously c...