DeepAI AI Chat
Log In Sign Up

A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning

by   Ahmad Asadi, et al.

Generating textual descriptions for images has been an attractive problem for the computer vision and natural language processing researchers in recent years. Dozens of models based on deep learning have been proposed to solve this problem. The existing approaches are based on neural encoder-decoder structures equipped with the attention mechanism. These methods strive to train decoders to minimize the log likelihood of the next word in a sentence given the previous ones, which results in the sparsity of the output space. In this work, we propose a new approach to train decoders to regress the word embedding of the next word with respect to the previous ones instead of minimizing the log likelihood. The proposed method is able to learn and extract long-term information and can generate longer fine-grained captions without introducing any external memory cell. Furthermore, decoders trained by the proposed technique can take the importance of the generated words into consideration while generating captions. In addition, a novel semantic attention mechanism is proposed that guides attention points through the image, taking the meaning of the previously generated word into account. We evaluate the proposed approach with the MS-COCO dataset. The proposed model outperformed the state of the art models especially in generating longer captions. It achieved a CIDEr score equal to 125.0 and a BLEU-4 score equal to 50.5, while the best scores of the state of the art models are 117.1 and 48.0, respectively.


page 15

page 16


Bangla Image Caption Generation through CNN-Transformer based Encoder-Decoder Network

Automatic Image Captioning is the never-ending effort of creating syntac...

A sequential guiding network with attention for image captioning

The recent advances of deep learning in both computer vision (CV)and nat...

Fine-Grained Image Captioning with Global-Local Discriminative Objective

Significant progress has been made in recent years in image captioning, ...

Boost Image Captioning with Knowledge Reasoning

Automatically generating a human-like description for a given image is a...

Learning to Caption Images with Two-Stream Attention and Sentence Auto-Encoder

Automatically generating natural language descriptions from an image is ...

Image to Bengali Caption Generation Using Deep CNN and Bidirectional Gated Recurrent Unit

There is very little notable research on generating descriptions of the ...

An Image captioning algorithm based on the Hybrid Deep Learning Technique (CNN+GRU)

Image captioning by the encoder-decoder framework has shown tremendous a...