Towards Local Visual Modeling for Image Captioning

02/13/2023
by Yiwei Ma, et al.

In this paper, we study local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this goal, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA handles intra-layer interaction in the Transformer by modeling the relationship between each grid and its neighbors, which reduces the difficulty of recognizing local objects during captioning. LSF handles inter-layer information fusion, aggregating the outputs of different encoder layers for cross-layer semantic complementarity. With these two designs, LSTNet models the local visual information of grid features and thereby improves caption quality. To validate LSTNet, we conduct extensive experiments on the competitive MS-COCO benchmark. The results show that LSTNet is not only capable of local visual modeling but also outperforms a range of state-of-the-art captioning models in offline and online testing, reaching 134.8 CIDEr and 136.3 CIDEr, respectively. The generalization of LSTNet is further verified on the Flickr8k and Flickr30k datasets.
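The abstract only outlines LSA and LSF at a high level, so the sketch below is one plausible reading of the two components rather than the authors' implementation. The module names, the 3x3 depthwise convolution used as the neighbor-mixing step, and the learned layer weighting in the fusion module are all assumptions made for illustration.

import torch
import torch.nn as nn

class LocalitySensitiveAttention(nn.Module):
    """Sketch of locality-sensitive attention over grid features.

    Global multi-head self-attention is combined with a local branch that
    mixes each grid cell with its 3x3 spatial neighborhood (here a depthwise
    convolution, an assumed stand-in for the paper's neighbor modeling).
    """

    def __init__(self, dim=512, num_heads=8, grid_size=7):
        super().__init__()
        self.grid_size = grid_size
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Depthwise 3x3 conv aggregates each grid with its spatial neighbors.
        self.local_mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, N, D) with N = grid_size ** 2
        b, n, d = x.shape
        g = self.grid_size
        # Global branch: vanilla self-attention across all grid positions.
        glob, _ = self.global_attn(x, x, x)
        # Local branch: reshape to a 2-D grid and mix 3x3 neighborhoods.
        grid = x.transpose(1, 2).reshape(b, d, g, g)
        local = self.local_mix(grid).flatten(2).transpose(1, 2)
        return self.norm(x + glob + local)


class LocalitySensitiveFusion(nn.Module):
    """Sketch of cross-layer fusion: a learned weighted sum of encoder layers."""

    def __init__(self, num_layers=3):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_layers))

    def forward(self, layer_outputs):  # list of (B, N, D) tensors
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * out for wi, out in zip(w, layer_outputs))


if __name__ == "__main__":
    feats = torch.randn(2, 49, 512)          # 7x7 grid features
    lsa = LocalitySensitiveAttention()
    layers = [lsa(feats) for _ in range(3)]  # stand-in for stacked encoder layers
    fused = LocalitySensitiveFusion(3)(layers)
    print(fused.shape)                       # torch.Size([2, 49, 512])

In this reading, the local branch lowers the burden on global attention for recognizing small objects, while the fusion step lets later decoding draw on complementary semantics from every encoder depth; the actual LSTNet formulation is given in the full paper.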


Related research:
- Dual-Level Collaborative Transformer for Image Captioning (01/16/2021)
- Geometry-Entangled Visual Semantic Transformer for Image Captioning (09/29/2021)
- Improving Image Captioning by Leveraging Intra- and Inter-layer Global Representation in Transformer Network (12/13/2020)
- Multimodal Transformer with Multi-View Visual Representation for Image Captioning (05/20/2019)
- End-to-End Transformer Based Model for Image Captioning (03/29/2022)
- UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese (05/07/2023)
- Efficient Image Captioning for Edge Devices (12/18/2022)
