Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps

12/09/2020
by Qi Zhu, et al.

Texts appearing in daily scenes, recognizable by OCR (Optical Character Recognition) tools, carry significant information, such as street names, product brands, and prices. Two tasks that extend existing vision-language applications to such text, text-based visual question answering and text-based image captioning, are catching on rapidly. To address these problems, many sophisticated multi-modality encoding frameworks (such as heterogeneous graph structures) have been proposed. In this paper, we argue that a simple attention mechanism can do the same job, or an even better one, without any bells and whistles. Under this mechanism, we simply split OCR token features into separate visual- and linguistic-attention branches and send them to a popular Transformer decoder to generate answers or captions. Surprisingly, we find that this simple baseline model is rather strong: it consistently outperforms state-of-the-art (SOTA) models on two popular benchmarks, TextVQA and all three tasks of ST-VQA, even though those SOTA models use far more complex encoding mechanisms. Transferring it to text-based image captioning, we also surpass the winner of the TextCaps Challenge 2020. We hope this work sets a new baseline for these two OCR-text-related applications and inspires new thinking about multi-modality encoder design. Code is available at https://github.com/ZephyrZhuQi/ssbaseline
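The split-branch idea in the abstract — attend over OCR visual features and OCR linguistic features in two separate branches, then hand the combined result to a decoder — can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names, feature dimensions, and the use of plain scaled dot-product attention (with the question as the query) are all choices made here for clarity.

```python
import numpy as np

def scaled_dot_attention(query, keys, values):
    """Plain scaled dot-product attention.
    query: (Lq, d), keys/values: (N, d) -> output: (Lq, d)."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)          # (Lq, N) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values                       # (Lq, d) attended features

def split_ocr_attention(question, ocr_visual, ocr_linguistic):
    """Hypothetical sketch of the split-branch mechanism: the question
    attends separately over the visual and the linguistic OCR features,
    and the two branch outputs are concatenated for a downstream decoder."""
    visual_branch = scaled_dot_attention(question, ocr_visual, ocr_visual)
    linguistic_branch = scaled_dot_attention(question, ocr_linguistic, ocr_linguistic)
    return np.concatenate([visual_branch, linguistic_branch], axis=-1)

# Toy shapes: 4 question tokens, 10 OCR tokens, 64-dim features.
rng = np.random.default_rng(0)
question = rng.standard_normal((4, 64))
ocr_visual = rng.standard_normal((10, 64))
ocr_linguistic = rng.standard_normal((10, 64))
fused = split_ocr_attention(question, ocr_visual, ocr_linguistic)
print(fused.shape)  # (4, 128): one d-dim vector per branch, concatenated
```

In a real model the concatenated per-token features would be projected and fed to a Transformer decoder as context; here the point is only that the two modalities of the same OCR tokens are attended independently rather than fused in an elaborate graph encoder.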


