Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning

03/05/2023
by Pranav Dandwate, et al.

In the present era of generative intelligence, many manual tasks are being automated with increasing efficiency, which can help businesses save time and money. A crucial component of generative intelligence is the integration of vision and language, which has made image captioning an intriguing area of research. Researchers have attempted to solve this problem with a variety of deep learning architectures; although accuracy has improved, the results are still not up to standard. This study focuses on a comparison of a Transformer model and an LSTM-with-attention-block model on MS-COCO, a standard dataset for image captioning. Both models use a pretrained Inception-V3 CNN encoder to extract image features. The Bilingual Evaluation Understudy (BLEU) score is used to assess the accuracy of the captions generated by each model. Alongside the Transformer and LSTM-with-attention-block models, the CLIP-diffusion, M2-Transformer, and X-Linear Attention models, which report state-of-the-art accuracy, are also discussed.
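For readers unfamiliar with the metric, BLEU scores a generated caption by its clipped n-gram overlap with one or more reference captions, combined with a brevity penalty. The sketch below is a minimal pure-Python version with add-one smoothing; it illustrates the idea but is not the exact scorer used in the study (standard implementations such as NLTK's differ in smoothing details). The example captions are illustrative, not drawn from MS-COCO.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(references, candidate, max_n=4):
    """Sentence-level BLEU with uniform weights and add-one smoothing
    (an illustrative simplification, not the canonical smoothing)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        # Clip each candidate n-gram count by its max count in any reference.
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append((clipped + 1) / (total + 1))
    # Brevity penalty against the closest reference length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)


refs = ["a dog runs across the grassy field".split(),
        "a brown dog is running through the grass".split()]
print(bleu(refs, "a dog runs through the grass".split()))
```

A perfect match of a reference scores 1.0, while captions sharing fewer n-grams with the references score closer to 0; both models' captions can be ranked on the same scale this way.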

