DEVICE: DEpth and VIsual ConcEpts Aware Transformer for TextCaps

02/03/2023
by Dongsheng Xu, et al.

Text-based image captioning is an important but under-explored task that aims to generate descriptions covering both visual objects and scene text. Recent studies have made encouraging progress, but they still suffer from a limited overall understanding of scenes and tend to generate inaccurate captions. One possible reason is that current studies mainly construct plane-level geometric relationships among scene text, without depth information. This leads to insufficient relational reasoning over scene text, so models may describe it inaccurately. Another possible reason is that existing methods fail to generate fine-grained descriptions of some visual objects. In addition, they may ignore essential visual objects entirely, so the scene text associated with those objects goes unused. To address these issues, we propose a DEpth and VIsual ConcEpts Aware Transformer (DEVICE) for TextCaps. Concretely, to construct three-dimensional geometric relations, we introduce depth information and propose a depth-enhanced feature updating module to ameliorate OCR token features. To generate more precise and comprehensive captions, we introduce semantic features of detected visual object concepts as auxiliary information. Our DEVICE comprehends scenes more thoroughly and boosts the accuracy of the visual entities it describes. Extensive experiments demonstrate the effectiveness of DEVICE, which outperforms state-of-the-art models on the TextCaps test set. Our code will be publicly available.
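To make the idea of depth-aware feature updating concrete, the sketch below shows one plausible way to fold depth into OCR token features: each token's 2D box centre is extended with its estimated depth into a 3D position, projected to an embedding, and tokens are then mixed with attention weights that decay with 3D distance. This is only an illustrative sketch; the function name `depth_enhanced_update`, the random projection `W`, and the exponential-decay attention are assumptions for exposition, not the paper's actual module.

```python
import numpy as np

def depth_enhanced_update(ocr_feats, boxes, depths, d_model=64, seed=0):
    """Illustrative sketch (not the paper's exact module).

    ocr_feats: (N, d_model) OCR token appearance features
    boxes:     (N, 4) 2D boxes as (x, y, w, h)
    depths:    (N,) estimated depth per OCR token
    """
    rng = np.random.default_rng(seed)
    # Hypothetical learned projection from 3D position to the feature space
    W = rng.standard_normal((3, d_model)) / np.sqrt(3)

    centres = boxes[:, :2] + boxes[:, 2:] / 2.0            # 2D box centres
    pos3d = np.concatenate([centres, depths[:, None]], 1)  # (N, 3) positions
    feats = ocr_feats + pos3d @ W                          # add 3D position embedding

    # Depth-aware mixing: tokens closer in 3D space influence each other more
    dist = np.linalg.norm(pos3d[:, None, :] - pos3d[None, :, :], axis=-1)
    attn = np.exp(-dist)
    attn /= attn.sum(axis=1, keepdims=True)                # row-normalize weights
    return attn @ feats
```

The key point the sketch captures is that two OCR tokens overlapping in the image plane but lying at different depths receive small mutual attention weights, which a purely plane-level geometric relation cannot express.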

