Adaptively Clustering Neighbor Elements for Image Captioning

01/05/2023
by   Zihua Wang, et al.

We design a novel global-local Transformer named Ada-ClustFormer (ACF) to generate captions. We use this name because each layer of ACF can adaptively cluster input elements to carry out self-attention (Self-ATT) for learning local context. Compared with other global-local Transformers that carry out Self-ATT in fixed-size windows, ACF can capture varying graininess: for example, an object may cover different numbers of grids, or a phrase may contain diverse numbers of words. To build ACF, we insert a probabilistic matrix C into the Self-ATT layer. For an input sequence s_1,...,s_N, C_i,j softly determines whether the sub-sequence s_i,...,s_j should be clustered for carrying out Self-ATT. In the implementation, C_i,j is calculated from the contexts of s_i,...,s_j, so ACF can exploit the input itself to decide which local contexts should be learned. By using ACF to build both the vision encoder and the language decoder, the captioning model can automatically discover the hidden structures in both vision and language, which encourages the model to learn a unified structural space for transferring more structural commonalities. The experimental results demonstrate the effectiveness of ACF: we achieve a CIDEr score of 137.8, which outperforms most SOTA captioning models and is comparable to some BERT-based models. The code will be available in the supplementary material.
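The core mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact parameterization of C is assumed here (the span score is computed from mean-pooled span context via prefix sums and a learned projection `w_c`, and C soft-masks the attention weights before renormalization).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def clustered_self_attention(x, w_q, w_k, w_v, w_c):
    """Sketch of adaptively clustered self-attention.

    x: (N, d) input sequence; w_q, w_k, w_v: (d, d) projections;
    w_c: (d, 1) hypothetical projection that scores each span.
    """
    N, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    logits = (q @ k.T) / np.sqrt(d)  # plain scaled dot-product attention

    # C[i, j]: soft score that the span covering positions i..j should be
    # clustered, computed from the span's own context. Here the span context
    # is mean-pooled via prefix sums; the paper's exact form may differ.
    cum = np.vstack([np.zeros((1, d)), np.cumsum(x, axis=0)])  # (N+1, d)
    i = np.arange(N)[:, None]
    j = np.arange(N)[None, :]
    lo, hi = np.minimum(i, j), np.maximum(i, j)
    span_mean = (cum[hi + 1] - cum[lo]) / (hi - lo + 1)[..., None]  # (N, N, d)
    C = 1.0 / (1.0 + np.exp(-(span_mean @ w_c)))[..., 0]  # sigmoid, (N, N)

    # Soft-mask the attention weights with C and renormalize, so each token
    # attends mostly within spans that score as coherent clusters.
    attn = softmax(logits) * C
    attn = attn / attn.sum(axis=-1, keepdims=True)
    return attn @ v
```

Because C is computed from the input itself rather than from fixed window sizes, the effective cluster boundaries can differ per layer and per input, which is the "adaptive" part of the design.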

Related research:

01/26/2021 · CPTR: Full Transformer Network for Image Captioning
In this paper, we consider the image captioning task from a new sequence...

08/24/2021 · Auto-Parsing Network for Image Captioning and Visual Question Answering
We propose an Auto-Parsing Network (APN) to discover and exploit the inp...

06/28/2022 · ZoDIAC: Zoneout Dropout Injection Attention Calculation
Recently the use of self-attention has yielded to state-of-the-art resul...

12/13/2020 · Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network
Transformer-based architectures have shown great success in image captio...

09/16/2021 · Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning
Automatic transcription of scene understanding in images and videos is a...

12/28/2021 · Extended Self-Critical Pipeline for Transforming Videos to Text (TRECVID-VTT Task 2021) – Team: MMCUniAugsburg
The Multimedia and Computer Vision Lab of the University of Augsburg par...

07/19/2022 · Relational Future Captioning Model for Explaining Likely Collisions in Daily Tasks
Domestic service robots that support daily tasks are a promising solutio...
