OSIC: A New One-Stage Image Captioner Coined

11/04/2022
by   Bo Wang, et al.

Mainstream image captioning models are usually two-stage captioners: they compute object features with a pre-trained detector and feed them into a language model to generate text descriptions. However, this pipeline introduces a task-based information gap that degrades performance, since object features learned for the detection task are a suboptimal representation and cannot provide all the information needed for subsequent text generation. In addition, object features are usually taken from the last layer of the detector and therefore lose the local details of the input image. In this paper, we propose a novel One-Stage Image Captioner (OSIC) with dynamic multi-sight learning, which directly transforms an input image into descriptive sentences in one stage. As a result, the task-based information gap can be greatly reduced. To obtain rich features, we use the Swin Transformer to compute multi-level features and feed them into a novel dynamic multi-sight embedding module that exploits both the global structure and the local texture of the input image. To enhance the encoder's global modeling for captioning, we propose a new dual-dimensional refining module that non-locally models the interactions among the embedded features. Together, these components let OSIC obtain rich and useful information for the image captioning task. Extensive comparisons on the benchmark MS-COCO dataset verify the superior performance of our method.
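To make the encoder idea more concrete, the following is a minimal sketch of one possible reading of "multi-level features refined non-locally along two dimensions". It is not the authors' implementation: the function names (`embed_multi_level`, `dual_dimensional_refine`), the parameter-free attention, the residual sum, and the toy feature values are all assumptions made for illustration, and the multi-level features are assumed to have already been projected to a common channel width.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(x):
    return [list(col) for col in zip(*x)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

def self_attention(x):
    # Parameter-free non-local mixing: every row of x attends to every row.
    scale = math.sqrt(len(x[0]))
    scores = matmul(x, transpose(x))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, x)

def embed_multi_level(levels):
    # Hypothetical embedding step: concatenate the tokens of all feature
    # levels (fine to coarse) into one sequence of shape (N, C).
    return [token for level in levels for token in level]

def dual_dimensional_refine(tokens):
    # Hypothetical refinement: attend across tokens and across channels,
    # then add both results back to the input (residual connection).
    token_mixed = self_attention(tokens)
    channel_mixed = transpose(self_attention(transpose(tokens)))
    return [
        [t + a + b for t, a, b in zip(r_in, r_tok, r_ch)]
        for r_in, r_tok, r_ch in zip(tokens, token_mixed, channel_mixed)
    ]

# Toy stand-ins for two feature levels (assumed values, channel width 4).
fine_level = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1],
              [0.0, 0.5, 0.5, 0.0], [0.2, 0.2, 0.2, 0.2]]
coarse_level = [[0.3, 0.1, 0.4, 0.2]]

tokens = embed_multi_level([fine_level, coarse_level])
refined = dual_dimensional_refine(tokens)
print(len(refined), len(refined[0]))
```

The refined sequence keeps the shape of the embedded sequence, so under these assumptions it could be passed to a caption decoder in place of detector-based object features.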


research
05/21/2022

HLATR: Enhance Multi-stage Text Retrieval with Hybrid List Aware Transformer Reranking

Deep pre-trained language models (e.g., BERT) are effective at large-scal...
research
10/15/2021

Multi-Tailed, Multi-Headed, Spatial Dynamic Memory refined Text-to-Image Synthesis

Synthesizing high-quality, realistic images from text-descriptions is a ...
research
08/09/2021

TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network

Salient object detection is the pixel-level dense prediction task which ...
research
10/22/2020

Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

Image-Text Matching is one major task in cross-modal information process...
research
08/14/2023

Mutual Information-driven Triple Interaction Network for Efficient Image Dehazing

Multi-stage architectures have exhibited efficacy in image dehazing, whi...
research
07/13/2022

Symmetry-Aware Transformer-based Mirror Detection

Mirror detection aims to identify the mirror regions in the given input ...
research
07/18/2022

Towards the Human Global Context: Does the Vision-Language Model Really Judge Like a Human Being?

As computer vision and NLP make progress, Vision-Language (VL) is becomin...
