Question-controlled Text-aware Image Captioning

08/04/2021
by Anwen Hu, et al.

For an image with multiple scene texts, different people may be interested in different text information. Current text-aware image captioning models cannot generate distinct captions tailored to these varying information needs. To explore how to generate personalized text-aware captions, we define a new, challenging task: Question-controlled Text-aware Image Captioning (Qc-TextCap). With questions as control signals, this task requires models to understand the questions, find the related scene texts, and describe them together with objects fluently in human language. Based on two existing text-aware captioning datasets, we automatically construct two datasets, ControlTextCaps and ControlVizWiz, to support the task. We propose a novel Geometry and Question Aware Model (GQAM). GQAM first applies a Geometry-informed Visual Encoder to fuse region-level object features and region-level scene text features while taking spatial relationships into account. Then, a Question-guided Encoder selects the most relevant visual features for each question. Finally, GQAM generates a personalized text-aware caption with a Multimodal Decoder. Our model achieves better captioning performance and question answering ability than carefully designed baselines on both datasets. With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model. Our code and datasets are publicly available at https://github.com/HAWLYQ/Qc-TextCap.
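To make the idea of question-guided feature selection more concrete, the following is a minimal illustrative sketch, not the authors' GQAM implementation. It assumes plain dot-product attention between a single question vector and a small set of region features; the function names `softmax` and `question_guided_attention`, the two-dimensional toy features, and the choice of dot-product scoring are all assumptions made for illustration. The actual encoder is described in the paper and released at the repository linked above.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def question_guided_attention(question_vec, region_feats):
    """Weight region features by their relevance to a question.

    Hypothetical helper: scores each region by its dot product with the
    question vector, normalizes the scores with softmax, and returns both
    the weights and the weighted sum of the region features.
    """
    scores = [
        sum(q * r for q, r in zip(question_vec, feat))
        for feat in region_feats
    ]
    weights = softmax(scores)
    dim = len(region_feats[0])
    attended = [
        sum(w * feat[i] for w, feat in zip(weights, region_feats))
        for i in range(dim)
    ]
    return weights, attended


# Toy example (made-up numbers): the first region aligns with the question.
question = [1.0, 0.0]
regions = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights, attended = question_guided_attention(question, regions)
```

In this toy setup the first region receives the largest weight because it is the only one with a positive dot product against the question vector. A real model would use learned projections and much higher-dimensional object and scene text features.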
