TAP: Text-Aware Pre-training for Text-VQA and Text-Caption

12/08/2020
by Zhengyuan Yang, et al.

In this paper, we propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks. These two tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. In contrast to conventional vision-language pre-training, which fails to capture scene text and its relationship with the visual and text modalities, TAP explicitly incorporates scene text (generated from OCR engines) in pre-training. With three pre-training tasks, including masked language modeling (MLM), image-text (contrastive) matching (ITM), and relative (spatial) position prediction (RPP), TAP effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. Due to this aligned representation learning, even pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4, compared with a non-TAP baseline. To further improve the performance, we build a large-scale dataset based on the Conceptual Caption dataset, named OCR-CC, which contains 1.4 million scene text-related image-text pairs. Pre-trained on this OCR-CC dataset, our approach outperforms the state of the art by large margins on multiple tasks, i.e., +8.3 in accuracy on TextVQA, +8.6 in accuracy on ST-VQA, and +10.2 in CIDEr score on TextCaps.



1 Introduction

Vision-language tasks incorporating scene text [bigham2010vizwiz, gurari2018vizwiz, singh2019towards, sidorov2020textcaps], e.g., Text-VQA [singh2019towards, biten2019scene, mishra2019ocr, wang2020general] and Text-Caption [sidorov2020textcaps], pose the new challenge of reading and understanding scene text in the image context to vision-language models. Extended from Visual Question Answering (VQA) [VQA_15], Text-VQA aims to answer questions by understanding the scene text in the image-question context. Text-Caption seeks to generate an image caption [veit2016coco, anderson2018bottom] that describes both the visual and scene text information in the image, as shown in Figure 1 (a). These tasks have many potential applications, including robotics [anderson2018vision], document understanding [mishra2019ocr], and assisting visually-impaired people [bigham2010vizwiz, gurari2018vizwiz].

Figure 1: (a) Text-VQA and Text-Caption tasks aim at reading and understanding scene text in images for question answering and image caption generation, respectively. We highlight the scene text-related words in bold. (b) By explicitly incorporating scene text in pre-training, Text-Aware Pre-training (TAP) significantly outperforms both the non-TAP baseline and previous state of the art on multiple tasks (bars shown in red and blue colors, respectively).

A typical Text-VQA/Text-Caption framework consists of 1) a feature encoder for each single modality (text word, visual object, and scene text), 2) a multi-modal fusion module, and 3) a decoding module for prediction generation. Previous studies [singh2019towards, gao2020multi, gao2020structured, hu2020iterative, kant2020spatially, sidorov2020textcaps, wang2020multimodal] improve the model’s performance by designing stronger network architectures. Among them, LoRRA [singh2019towards] added an OCR attention branch for scene text encoding to a VQA model [jiang2018pythia]. M4C [hu2020iterative, sidorov2020textcaps] proposed a transformer-based multi-modal fusion module [vaswani2017attention] and a multi-step multi-choice decoding module. Despite the effective network design, most previous models are optimized with a sole objective directly towards the correct answer/caption. Such a single answer/caption loss tries to predict each word in the ground truth but is less effective in learning a joint representation among text word, visual object, and scene text. Without a good joint representation, directly optimizing for question answering/image captioning could be challenging. Inspired by the success of Vision-Language Pre-training (VLP) [lu2019vilbert, li2019visualbert, chen2019uniter, tan2019lxmert, li2020oscar, huang2020pixel, cao2020behind] in image-text joint representation learning, we leverage the effective Text-VQA/Text-Caption network designs and explore how to further improve Text-VQA/Text-Caption by pre-training.

Vision-Language Pre-training (VLP) shows its effectiveness in learning task-agnostic joint representations of image and text. The main idea is to first pre-train the model with pre-training tasks on image-caption datasets [sharma2018conceptual, krishna2017visual, veit2016coco, ordonez2011im2text, plummer2015flickr30k], and then fine-tune the model for a specific vision-language task [VQA_15, young2014image, kazemzadeh2014referitgame, veit2016coco]. However, conventional VLP methods are designed for general vision-language tasks and do not include scene text in pre-training. Therefore, previous methods fail to capture the scene text modality and its relationship with the visual and text modalities, and are thus less effective for Text-VQA/Text-Caption.

In this study, we propose Text-Aware Pre-training (TAP), which incorporates the scene text modality in pre-training to learn a joint representation of text word, visual object, and scene text. In TAP, we design text-aware pre-training tasks to better fuse scene text (including both scene text words and their visual regions detected by OCR) with the text words and visual objects. For the former, we refine the pre-training tasks in VLP [lu2019vilbert, li2020oscar] to support the extra scene text input. We find it particularly important to include the detected scene text words as extra language inputs. The extra inputs anchor the scene text and language modalities and make the aligned representation learning easier. For the latter, previous studies [kant2020spatially, wang2020multimodal] show that the spatial relationships between scene text and object regions are important, e.g., the relationship “left” in Figure 1 (a). Therefore, we propose a “relative (spatial) position prediction” task that learns regions’ spatial relationships by predicting their relative spatial positions in pre-training.

The extra scene text modality, together with the specially designed pre-training tasks, effectively helps the model learn a better aligned representation among the three modalities: text word, visual object, and scene text. This aligned representation learning, even pre-trained and fine-tuned on the same downstream task dataset, leads to significant improvement over the non-TAP baseline and helps the TAP model achieve the new state of the art.

To further unleash the power of TAP, we clean and generate a large-scale scene text-related image-caption dataset for pre-training. In general image-caption datasets [sharma2018conceptual, krishna2017visual, veit2016coco, ordonez2011im2text, plummer2015flickr30k], many image-text pairs contain either no scene text-related visual regions or no scene text-related language references, and are thus less helpful to Text-VQA/Text-Caption. On the visual side, we run an OCR detector to filter out images with no scene text. On the language side, we include the detected OCR text tokens as the additional caption input to obtain scene text-related language descriptions. In the end, we build a large-scale dataset named OCR-CC with around 1.4 million scene text-related image-text pairs based on the Conceptual Captioning dataset [sharma2018conceptual]. By using this large-scale dataset for pre-training, we observe further improvement on the Text-VQA and Text-Caption tasks.

We experiment with the TAP approach on the M4C network architecture [hu2020iterative] and benchmark it on the TextVQA [singh2019towards], ST-VQA [biten2019scene], and TextCaps [sidorov2020textcaps] datasets. With the identical network architecture and training data, TAP improves the accuracy on the TextVQA dataset [singh2019towards] from 44.50 to 49.91, compared with a non-TAP baseline. Our final model ranks No.1 on multiple Text-VQA/Text-Caption challenges (according to the official leaderboards, Nov. 2020) and outperforms previous methods by large margins: TextVQA [singh2019towards] (+8.3 in absolute accuracy), ST-VQA [biten2019scene] (+8.6 in absolute accuracy), and TextCaps [sidorov2020textcaps] (+10.2 in CIDEr score).

Our main contributions are:

  • To the best of our knowledge, we are the first to explore pre-training for Text-VQA and Text-Caption.

  • By explicitly incorporating scene text with three specially designed pre-training tasks, Text-Aware Pre-training (TAP) effectively learns a better aligned representation that leads to significant performance improvement on Text-VQA/Text-Caption.

  • We build a large-scale dataset named OCR-CC with around 1.4 million scene text-related image-text pairs. TAP with OCR-CC leads to the new state of the art on multiple tasks: TextVQA [singh2019towards] (+8.3 in absolute accuracy), ST-VQA [biten2019scene] (+8.6 in absolute accuracy), and TextCaps [sidorov2020textcaps] (+10.2 in CIDEr score). We will release the dataset and the models.

2 Related Work

Figure 2: An overview of Text-Aware Pre-training (TAP). (a) In pre-training, the framework takes text words, visual objects, scene text, and a special begin token as inputs, and improves the aligned representation learning by performing pre-training tasks (MLM, ITM, RPP) on the fused feature. (b) In fine-tuning, we train the same model to step-by-step generate the answer/caption prediction, conditioned on the text words, visual objects, scene text, and the previous word predictions at each decoding step. Text word, visual object, and scene text-related tokens are highlighted in green, cyan, and yellow, respectively.

Vision-language tasks incorporating scene text. Text-VQA [singh2019towards, biten2019scene, mishra2019ocr, wang2020general] and Text-Caption [sidorov2020textcaps] aim at reading and understanding scene text in images for question answering and image caption generation. Various datasets [singh2019towards, biten2019scene, mishra2019ocr] are built for the Text-VQA task, e.g., the TextVQA dataset [singh2019towards], the ST-VQA dataset [biten2019scene], etc. TextCaps [sidorov2020textcaps] is a dataset recently proposed for the Text-Caption task.

Recent studies [singh2019towards, gao2020multi, gao2020structured, hu2020iterative, kant2020spatially, wang2020multimodal, liu2020cascade, han2020finding] proposed various network architectures to improve the Text-VQA/Text-Caption performance. Among them, LoRRA [singh2019towards] approached Text-VQA by extending a VQA model Pythia [jiang2018pythia] with an OCR attention branch. The answer vocabulary is a combination of a static vocabulary and detected OCR tokens. Multi-modal Multi-Copy Mesh (M4C) [hu2020iterative] boosted the Text-VQA performance by proposing a transformer-based multi-modal fusion module [vaswani2017attention] and a multi-step multi-choice decoding module that supports multi-step answer decoding. M4C’s variant M4C-Captioner [sidorov2020textcaps] sets a strong baseline on TextCaps [sidorov2020textcaps] with the question text inputs removed. SA-M4C [kant2020spatially] further improved M4C by encoding the spatial relationships among visual regions as attention masks in the multi-modal transformer. Similar explorations of the spatial relationships [wang2020multimodal] have been studied for the Text-Caption task.

Despite the effective network design, all previous studies directly optimize towards the sole objective of the Text-VQA/Text-Caption task. We contend that such a single answer/caption loss could be ineffective in aligned representation learning and thus limits the Text-VQA/Text-Caption performance. In this study, we leverage the effective network designs and explore how to further improve Text-VQA/Text-Caption by pre-training.

Vision-Language Pre-training (VLP). VLP [lu2019vilbert, li2019visualbert, alberti2019fusion, li2020unicoder, tan2019lxmert, su2019vl, zhou2020unified, chen2019uniter, lu202012, li2020oscar, huang2020pixel] shows its effectiveness in learning task-agnostic vision-language joint representations. Most studies [lu2019vilbert, tan2019lxmert, chen2019uniter] focused on vision-language understanding tasks, e.g., image-text retrieval [young2014image], visual question answering [VQA_15], visual grounding [kazemzadeh2014referitgame], etc. Recent studies [zhou2020unified, li2020oscar, hu2020vivo] unified the pre-training framework to cover generation tasks, e.g., image captioning [veit2016coco, anderson2018bottom].

However, conventional VLP methods do not capture scene text during pre-training and are therefore less effective for Text-VQA/Text-Caption. The proposed Text-aware Pre-training (TAP) explicitly incorporates scene text to learn a better aligned representation among the three modalities: text word, visual object, and scene text.

3 Text-Aware Pre-training (TAP)

TAP explicitly incorporates scene text in pre-training to improve Text-VQA/Text-Caption. We first pre-train the model with the scene text-aware pre-training tasks and then fine-tune it for a specific downstream task.

In this section, we first introduce the design of scene text-aware pre-training tasks. We then present the data corpus used for TAP and our proposed OCR-CC dataset. We postpone the model details to Section 4.2.

3.1 Text-aware pre-training tasks

Figure 2 overviews TAP in pre-training and fine-tuning. In pre-training, the inputs to the fusion module are the embeddings of text words, object regions, scene text regions, and a special begin token. In the text word embedding, each word in the extended text input is encoded as a feature vector; the extended text input consists of the question text, the detected object labels, and the detected scene text words. In the object and scene text embedding, object and scene text regions are detected and encoded by object detectors and OCR engines.
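As a concrete illustration, the sketch below shows one plausible way to assemble the extended text input from the question words, detected object labels, and detected scene text words. The function name, the segment-id convention, and the details of lower-casing are illustrative assumptions rather than the authors' implementation; the length cap of 220 follows the hyper-parameters listed in the appendix.

```python
# Minimal sketch (not the authors' code): building the extended text input by
# concatenating question words, detected object labels, and OCR (scene text) words.
def build_extended_text_input(question_words, object_labels, ocr_words,
                              max_len=220):
    """Return the token list and a segment id per token
    (0 = question word, 1 = object label, 2 = scene text word)."""
    tokens, segments = [], []
    for seg_id, words in enumerate([question_words, object_labels, ocr_words]):
        for w in words:
            tokens.append(w.lower())
            segments.append(seg_id)
    return tokens[:max_len], segments[:max_len]

# toy usage
tokens, segments = build_extended_text_input(
    question_words=["what", "beer", "brand", "is", "shown"],
    object_labels=["bottle", "table"],
    ocr_words=["coors", "light"])
print(tokens, segments)
```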

Taking the fused feature as input, TAP improves multi-modal fusion by performing text-aware pre-training tasks. The proposed pre-training tasks consist of two parts, focusing on fusing scene text with the text words and with the visual objects, respectively.

Scene-text language pre-training tasks. To better fuse the scene text with the text words, we design two scene-text language pre-training tasks based on the masked language modeling (MLM) and image-text (contrastive) matching (ITM) tasks in VLP [devlin2018bert, lu2019vilbert, chen2019uniter]. For MLM on the extended text input, we randomly mask each text token with a probability of 15%. A masked word is replaced with a special MASK token 80% of the time, with a random word 10% of the time, and remains unchanged 10% of the time. The MLM task takes the fused feature at the masked position as input and aims to recover the masked word with two fully-connected layers. For ITM, the extended text input is polluted 50% of the time by replacing one of its text sub-sequences (the question text, object labels, or scene text words) with a randomly selected one from another image. The polluted text words are thus not paired with the visual object and scene text regions. The ITM task takes the sequence feature as input and aims to predict whether the sequence has been polluted.

We find that the extra scene text word input is critical for learning the scene-text language aligned representation. In comparison, pre-training with the original MLM and ITM [devlin2018bert, lu2019vilbert] on the question text alone leads to limited improvement over the non-pre-training baseline. The failure is due to the limited number of scene text-related words in the question text: since many randomly masked words and polluted sequences are not relevant to scene text, the scene text regions are less important for solving the pre-training tasks (MLM, ITM) and are thus often overlooked. Including the detected scene text words in the extended text input generates extra scene text references in the language modality and thus makes TAP effective.
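To make the two corruptions concrete, here is one plausible way to implement the text-aware MLM masking and ITM pollution described above. The masking and pollution rates follow common BERT-style practice, and the field names are illustrative assumptions, not the authors' code.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(words, vocab, p_mask=0.15):
    """BERT-style masking over the extended text input (question words,
    object labels, and OCR words). Returns the corrupted sequence and the
    MLM labels (None = not masked, ignored by the loss)."""
    corrupted, labels = [], []
    for w in words:
        if random.random() < p_mask:
            labels.append(w)                            # recover the original word
            r = random.random()
            if r < 0.8:
                corrupted.append(MASK_TOKEN)            # replace with [MASK]
            elif r < 0.9:
                corrupted.append(random.choice(vocab))  # replace with a random word
            else:
                corrupted.append(w)                     # keep unchanged
        else:
            corrupted.append(w)
            labels.append(None)
    return corrupted, labels

def pollute_for_itm(sample, other_sample, p_pollute=0.5):
    """ITM corruption: with some probability, swap one text sub-sequence
    (question, object labels, or OCR words) with the one from another image.
    Returns the (possibly) polluted sample and the ITM label (1 = polluted)."""
    if random.random() >= p_pollute:
        return sample, 0
    key = random.choice(["question", "obj_labels", "ocr_words"])
    polluted = dict(sample)
    polluted[key] = other_sample[key]
    return polluted, 1

# toy usage
words = ["what", "beer", "is", "this", "bottle", "coors", "light"]
print(mask_tokens(words, vocab=["dog", "car", "sign"]))
```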

Scene-text visual pre-training tasks. Understanding the spatial relationships between visual objects and scene text benefits Text-VQA/Text-Caption [kant2020spatially, wang2020multimodal]. The extra feature input of bounding box coordinates helps spatial relationship learning [hu2020iterative, gao2020multi, gao2020structured] but does not fully solve the problem. Recent studies [kant2020spatially, wang2020multimodal] hard-code the coordinate features as the regions’ relationships in feature fusion and obtain further improvements. In this study, we explore spatial relationship learning by pre-training.

Specifically, we design a scene-text visual pre-training task in TAP. The main idea is to predict the relative spatial position between two randomly sampled visual regions; we therefore refer to the task as “relative (spatial) position prediction” (RPP). The input to the pre-training task is a randomly sampled visual object feature and a randomly sampled scene text feature. The objective is to predict the relative spatial position between the two sampled regions. We start with a single relationship of whether “the scene text region is on the object,” and thus model RPP as a binary classification problem. We then extend the task to a 12-class relative position prediction problem with the classes defined by Yao et al. [yao2018exploring], including on, cover, overlap, eight-way relative orientation, and unrelated.
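The sketch below shows one way such relative position labels could be derived from two bounding boxes. The 12 classes mirror the on/cover/overlap, eight orientation bins, and unrelated categories named above, but the IoU threshold and the "unrelated" rule are illustrative assumptions, not the paper's (or Yao et al.'s) exact specification.

```python
import math

def relative_position_class(ocr_box, obj_box, iou_thresh=0.5):
    """Assign one of 12 relative spatial position classes to an
    (ocr_box, obj_box) pair. Boxes are (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = ocr_box
    bx1, by1, bx2, by2 = obj_box
    # intersection-over-union
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / float(area_a + area_b - inter + 1e-6)

    if ax1 >= bx1 and ay1 >= by1 and ax2 <= bx2 and ay2 <= by2:
        return 0                      # "on": the scene text lies inside the object
    if bx1 >= ax1 and by1 >= ay1 and bx2 <= ax2 and by2 <= ay2:
        return 1                      # "cover": the scene text covers the object
    if iou >= iou_thresh:
        return 2                      # "overlap"
    acx, acy = (ax1 + ax2) / 2, (ay1 + ay2) / 2
    bcx, bcy = (bx1 + bx2) / 2, (by1 + by2) / 2
    # "unrelated" heuristic (assumption): no overlap and far apart
    if inter == 0 and math.hypot(acx - bcx, acy - bcy) > \
            2 * max(ax2 - ax1, ay2 - ay1, bx2 - bx1, by2 - by1):
        return 11
    # classes 3..10: eight-way relative orientation from the center angle
    angle = math.degrees(math.atan2(acy - bcy, acx - bcx)) % 360
    return 3 + int(angle // 45)

print(relative_position_class((40, 40, 60, 50), (20, 20, 100, 100)))  # 0 ("on")
```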

3.2 Pre-training corpus

TAP works well even without extra pre-training data. We first experiment with “TAP without extra data,” where we only use the downstream Text-VQA/Text-Caption dataset for pre-training, i.e., the training set of the TextVQA [singh2019towards], ST-VQA [biten2019scene], or TextCaps [sidorov2020textcaps] dataset. These datasets [singh2019towards, biten2019scene, sidorov2020textcaps] each contain fewer than 30K images and 150K image-text pairs. We detail the pre-training and fine-tuning pipeline for each downstream task in Section 4.2.

We then experiment with “TAP with large-scale data.” We build a large-scale scene text-related image-caption dataset named OCR-CC based on the Conceptual Captions (CC) dataset [sharma2018conceptual], and use it for pre-training. Among the image-caption datasets [sharma2018conceptual, krishna2017visual, veit2016coco, ordonez2011im2text, plummer2015flickr30k], only the CC dataset contains a reasonable portion of images with meaningful scene text regions. Therefore, we run the Microsoft Azure OCR system (public Microsoft OCR API: https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text) on all images in the CC dataset and filter out the images with no scene text, with watermarks only, or with tiny scene text regions only. In the end, we obtain around 1.4 million image-caption pairs, whose distribution of detected scene text per image is comparable to those of the TextVQA [hu2020iterative] and ST-VQA [biten2019scene] datasets. We adopt the same region feature extraction method used for the TextVQA dataset [singh2019towards] to provide the object and scene text region embeddings. By including scene text words as additional text inputs, OCR-CC provides scene text-related image-caption pairs for TAP. We keep the caption text from CC in OCR-CC and use it as the question text in pre-training. We show the details of dataset collection, the scene text number distribution, and additional qualitative examples of OCR-CC in the supplementary material.
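Below is a rough sketch of the kind of filtering and caption construction this pipeline implies. The OCR result fields, the watermark flag, and the size threshold are hypothetical stand-ins, since the actual filtering rules are only described at a high level.

```python
def keep_for_ocr_cc(ocr_results, image_h, min_regions=1, min_rel_height=0.015):
    """Decide whether a Conceptual Captions image goes into OCR-CC.
    `ocr_results` is a list of dicts like
    {"text": str, "box": (x1, y1, x2, y2), "is_watermark": bool}
    produced by an OCR service; field names and thresholds are illustrative."""
    informative = [
        r for r in ocr_results
        if not r.get("is_watermark", False)
        and (r["box"][3] - r["box"][1]) / float(image_h) >= min_rel_height
    ]
    return len(informative) >= min_regions

def build_ocr_cc_text(cc_caption, object_labels, ocr_results):
    """OCR-CC text side: the original CC caption plus detected object labels
    and scene-text words as additional text input."""
    ocr_words = [r["text"] for r in ocr_results]
    return {"caption": cc_caption, "obj_labels": object_labels,
            "ocr_words": ocr_words}

# toy usage
ocr = [{"text": "coors", "box": (10, 10, 60, 25), "is_watermark": False}]
print(keep_for_ocr_cc(ocr, image_h=480))  # True
```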

4 Experiments

We benchmark TAP for both the Text-VQA task on the TextVQA [singh2019towards] and ST-VQA [biten2019scene] datasets, and the Text-Caption task on the TextCaps dataset [sidorov2020textcaps]. We use our proposed OCR-CC dataset for large-scale pre-training.

4.1 Datasets

TextVQA. The TextVQA dataset [singh2019towards] contains 28,408 images from the Open Images dataset [kuznetsova2018open]. We follow the same training/validation/test split used in the previous work [singh2019towards] in our experiments. The methods are evaluated by the soft-voting accuracy of 10 answers.
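For reference, a simplified version of this soft-voting accuracy over the 10 human answers can be written as below. The official evaluator additionally normalizes answer strings and averages over leave-one-out subsets of the ten answers, which this sketch omits.

```python
def soft_vqa_accuracy(prediction, gt_answers):
    """Soft-voting accuracy in the spirit of the VQA metric: a prediction
    matching k of the 10 annotators scores min(k / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in gt_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

print(soft_vqa_accuracy("coors light", ["coors light"] * 4 + ["coors"] * 6))  # 1.0
```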

ST-VQA. The ST-VQA dataset [biten2019scene] contains 21,892 images from multiple sources including ICDAR 2013 [karatzas2013icdar], ICDAR 2015 [karatzas2015icdar], ImageNet [deng2009imagenet], VizWiz [gurari2018vizwiz], IIIT STR [mishra2013image], Visual Genome [krishna2017visual], and COCO-Text [veit2016coco]. The methods are evaluated by both accuracy and Average Normalized Levenshtein Similarity (ANLS) [biten2019scene].
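ANLS can be computed as sketched below: the score is one minus the normalized edit distance to the closest ground-truth answer, zeroed out when that distance reaches a threshold of 0.5. The string normalization here (lower-casing and stripping) is simplified relative to the official evaluation script.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def anls(prediction, gt_answers, tau=0.5):
    """Average Normalized Levenshtein Similarity for one question:
    1 - NL(pred, gt) over the best-matching ground truth, set to 0 when the
    normalized distance NL is at least tau."""
    pred = prediction.strip().lower()
    best = 0.0
    for gt in gt_answers:
        gt = gt.strip().lower()
        nl = levenshtein(pred, gt) / max(len(pred), len(gt), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best

print(anls("coors ligth", ["coors light"]))  # close match, high score
```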

TextCaps. The TextCaps dataset [sidorov2020textcaps] augments the 28,408 images in TextVQA [singh2019towards] with 145,329 captions. The captions are evaluated by the caption metrics (BLEU [papineni2002bleu], METEOR [denkowski2014meteor], ROUGE_L [lin2004rouge], SPICE [anderson2016spice], and CIDEr [vedantam2015cider]).

OCR-CC. Our OCR-CC dataset contains around 1.4 million scene text-related image-caption pairs derived from the Conceptual Captions (CC) dataset [sharma2018conceptual]. More details of OCR-CC are in the supplementary material.

4.2 Experiment settings

Network architecture. We conduct experiments based on the M4C network architecture [hu2020iterative]. We extend the text input with the detected object labels and scene text words. We keep all remaining settings the same as in the original M4C [hu2020iterative], including the feature embedding, network architecture, training parameters, and layer initialization.

M4C’s text encoder is a three-layer trainable transformer [vaswani2017attention] initialized from the first three layers of BERT [devlin2018bert]. A pre-trained Faster R-CNN [ren2015faster] detects objects and represents the detected region with its visual and coordinate features. The final layer (fc7) of the detector is fine-tuned. An offline OCR detector [borisyuk2018rosetta] detects scene text regions and represents the region with its visual, coordinates, FastText [bojanowski2017enriching], and Pyramidal Histogram of Characters (PHOC) [almazan2014word] features. The fusion module in M4C is a four-layer multi-modal transformer that has the same hyper-parameters as BERT. The fusion module is initialized from scratch. A multi-step decoding module then takes fused features as inputs, and word-by-word predicts the final answer. The predicted answer word at each decoding step is selected either from a fixed frequent word vocabulary or from the dynamic OCR tokens. The word classification loss is applied to each decoding step.
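The sketch below illustrates this multi-choice scoring idea in PyTorch: a linear classifier scores the fixed frequent-word vocabulary while a bilinear pointer scores the image's dynamic OCR tokens, and the two score lists are concatenated. Module names and dimensions are illustrative; this is a simplified stand-in for M4C's dynamic pointer network, not its actual code.

```python
import torch
import torch.nn as nn

class AnswerWordScorer(nn.Module):
    """One decoding step: score the decoder state against a fixed vocabulary
    (linear head) and against the OCR token features (bilinear pointer)."""
    def __init__(self, hidden_dim=768, vocab_size=5000):
        super().__init__()
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)
        self.query_proj = nn.Linear(hidden_dim, hidden_dim)
        self.ocr_proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, dec_state, ocr_feats):
        # dec_state: (batch, hidden), ocr_feats: (batch, num_ocr, hidden)
        vocab_scores = self.vocab_head(dec_state)            # (B, vocab)
        q = self.query_proj(dec_state).unsqueeze(1)          # (B, 1, hidden)
        ocr_scores = (q * self.ocr_proj(ocr_feats)).sum(-1)  # (B, num_ocr)
        # the predicted word is the argmax over the concatenated scores
        return torch.cat([vocab_scores, ocr_scores], dim=-1)

scorer = AnswerWordScorer()
scores = scorer(torch.randn(2, 768), torch.randn(2, 50, 768))
print(scores.shape)  # torch.Size([2, 5050])
```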

Adapting to Text-VQA. By taking the fused feature as input, we pre-train the feature encoder and fusion module with the pre-training tasks (MLM, ITM, RPP). MLM is only computed on the sequences that have not been polluted by ITM. The pre-trained model with the highest pre-training task accuracy is used to initialize the feature encoder and fusion module. In fine-tuning, the model step-by-step predicts the answer with an extra decoding module, and is trained with the answer classification loss in each step.

Adapting to Text-Caption. We keep the framework architecture the same for Text-Caption as for Text-VQA, except for increasing the maximum answer decoding length from 12 words [hu2020iterative] to 30 words [sidorov2020textcaps]. The question text is left blank in both pre-training and fine-tuning, so the input text sequence consists of the detected object labels, the scene text words, and the blank question text. During fine-tuning, the framework is trained with the same multi-step word classification loss as used in Text-VQA.

Compared methods. We compare TAP with other state of the art [singh2019towards, gao2020multi, hu2020iterative, kant2020spatially, gao2020structured, liu2020cascade, han2020finding, wang2020multimodal] and systematically study the following baselines and variants of our method.

  • TAP (Ours). We first experiment with “TAP without extra pre-training data.” We use the same downstream task dataset for both pre-training and fine-tuning, and follow the same training parameters as used in M4C. For the Text-VQA task, we pre-train the model for 24K iterations with the pre-training tasks (MLM, ITM, RPP) and then fine-tune it with the answer loss for another 24K iterations. The numbers of pre-training and fine-tuning iterations are both 12K for the Text-Caption task, following M4C-Captioner [sidorov2020textcaps].

  • M4C†. “M4C†” is the non-TAP baseline. Based on M4C, we include the detected object labels and scene text tokens as the additional text input, following “TAP.” We train the model for 48K iterations with the answer loss to match TAP’s total iteration number. Compared with “TAP,” the only difference is that “M4C†” trains the first 24K iterations with the answer loss instead of the pre-training tasks.

  • TAP† (Ours). “TAP†” reports our best performance, achieved with extra pre-training data (TextVQA, ST-VQA, TextCaps, OCR-CC) and other minor modifications. We pre-train “TAP†” for 480K iterations. Section 4.4 details the benefits of each extra data source.

4.3 Text-VQA/Text-Caption results

Method OCR System Extra Data Val Acc. Test Acc.
LoRRA [singh2019towards] Rosetta-ml 26.56 27.63
MM-GNN [gao2020multi] Rosetta-ml 31.44 31.10
M4C [hu2020iterative] Rosetta-en 39.40 39.01
SMA [gao2020structured] Rosetta-en 40.05 40.66
CRN [liu2020cascade] Rosetta-en 40.39 40.96
LaAP-Net [han2020finding] Rosetta-en 40.68 40.54
M4C† [hu2020iterative] Rosetta-en 39.55 -
TAP (Ours) Rosetta-en 44.06 -
M4C [hu2020iterative] Rosetta-en ST-VQA 40.55 40.46
LaAP-Net [han2020finding] Rosetta-en ST-VQA 41.02 40.54
SA-M4C [kant2020spatially] Google-OCR ST-VQA 45.4 44.6
SMA [gao2020structured] SBD-Trans OCR ST-VQA - 45.51
M4C† [hu2020iterative] Microsoft-OCR 44.50 44.75
M4C† [hu2020iterative] Microsoft-OCR ST-VQA 45.22 -
TAP (Ours) Microsoft-OCR 49.91 49.71
TAP (Ours) Microsoft-OCR ST-VQA 50.57 50.71
TAP† (Ours) Microsoft-OCR ST-VQA, TextCaps, OCR-CC 54.71 53.97
Table 1: Text-VQA results on the TextVQA dataset [singh2019towards]. The top part reports results in the constrained setting that only uses TextVQA for training and Rosetta for OCR detection. The bottom part compares our best performance with other state-of-the-art methods in the unconstrained setting. The methods “M4C†,” “TAP,” and “TAP†” are detailed in Section 4.2.

TextVQA. Table 1 reports the accuracy on the TextVQA dataset [singh2019towards]. The top part of the table shows the results in the constrained setting that only uses TextVQA for training and Rosetta [borisyuk2018rosetta] for OCR detection. The bottom compares our best performance with the state of the art [singh2019towards, gao2020multi, hu2020iterative, kant2020spatially, gao2020structured, liu2020cascade, han2020finding, wang2020multimodal] in the unconstrained setting.

We list the adopted OCR detector in the “OCR system” column. LoRRA [singh2019towards] and M4C [hu2020iterative] adopted the Rosetta OCR system [borisyuk2018rosetta]. SA-M4C [kant2020spatially] and SMA [gao2020structured] experiment with both Rosetta and other OCR systems (Google-OCR, SBD-Trans OCR). In this study, we experiment with Rosetta and the Microsoft Azure OCR system (Microsoft-OCR). We use Microsoft-OCR to detect the single OCR words that appear in the image, i.e., each detected scene text region contains only a single word. The “Extra data” column shows the training data used other than the TextVQA dataset. Previous methods [hu2020iterative, kant2020spatially, gao2020structured] adopt the ST-VQA dataset for joint training. Other than ST-VQA, TAP enables the use of weak data with no ground-truth answers in pre-training, e.g., TextCaps and OCR-CC. “TAP†” reports the final performance with all extra datasets.

Three major observations can be made from Table 1: 1) “TAP” significantly outperforms the non-TAP baseline “M4C†” with the identical training data and network architecture, in both the constrained setting (top part of Table 1) and the unconstrained setting (bottom part). In the constrained setting, TAP improves the non-TAP baseline accuracy from 39.55 to 44.06. In the unconstrained setting, “TAP” with Microsoft-OCR improves the accuracy over the corresponding non-TAP baselines “M4C†” and “M4C† + ST-VQA” from 44.50 to 49.91 and from 45.22 to 50.57, respectively. The improvement achieved with the same network and training data validates the effectiveness of our pre-training approach for Text-VQA/Text-Caption. 2) “TAP” outperforms the previous state of the art [singh2019towards, gao2020multi, hu2020iterative, gao2020structured, liu2020cascade, han2020finding] by large margins, even without large-scale pre-training. 3) Large-scale pre-training with the OCR-CC dataset further improves the accuracy. “TAP†” adopts OCR-CC in pre-training and improves the validation accuracy from 49.91 to 54.71. The improvement shows that TAP benefits from extra training data and indicates the effectiveness of our proposed OCR-CC dataset.

Method Val Acc. Val ANLS Test ANLS
SAN+STR [biten2019scene] - - 0.135
M4C [hu2020iterative] 38.05 0.472 0.462
SA-M4C [kant2020spatially] 42.23 0.512 0.504
SMA [gao2020structured] - - 0.466
CRN [liu2020cascade] - - 0.483
LaAP-Net [han2020finding] 39.74 0.497 0.485
M4C† [hu2020iterative] 42.28 0.517 0.517
TAP (Ours) 45.29 0.551 0.543
TAP† (Ours) 50.83 0.598 0.597
Table 2: Text-VQA results on the ST-VQA dataset [biten2019scene].

ST-VQA. Table 2 shows the Text-VQA accuracy on the ST-VQA dataset [biten2019scene] in the unconstrained setting. “TAP” uses Microsoft-OCR and is pre-trained and fine-tuned on the training set of ST-VQA. “TAP†” uses TextVQA, ST-VQA, TextCaps, and OCR-CC in pre-training. Similar conclusions as in Table 1 can be drawn from Table 2. First, “TAP” outperforms the state of the art [hu2020iterative, kant2020spatially, gao2020structured, liu2020cascade, han2020finding] by large margins and significantly improves the non-TAP baseline “M4C†.” Second, large-scale pre-training further improves the accuracy from 45.29 to 50.83, as shown in the bottom two rows.

Method Val CIDEr Test CIDEr
BUTD [anderson2018bottom] 41.9 33.8
AoANet [huang2019attention] 42.7 34.6
M4C [sidorov2020textcaps] 89.6 81.0
MMA-SR [wang2020multimodal] 98.0 88.0
CNMT [cnmt] - 93.03
M4C† [sidorov2020textcaps] 99.89 93.36
TAP (Ours) 105.05 99.49
TAP† (Ours) 109.16 103.22
Table 3: Text-Caption CIDEr scores on the TextCaps dataset [sidorov2020textcaps]. The full result table can be found in the supplementary material.

TextCaps. Table 3 shows the CIDEr scores on the TextCaps dataset [sidorov2020textcaps]. We report only the CIDEr score in the table and present the full table with other metrics in the supplementary material. We draw similar observations: with the same training data, “TAP” improves the validation CIDEr score of “M4C†” from 99.89 to 105.05. Large-scale pre-training (“TAP†”) further improves the CIDEr score to 109.16.

4.4 Ablation studies

Pre-training tasks. We experiment with different pre-training tasks (MLM, ITM, RPP) as well as their variants. We conduct ablation studies on TextVQA with Microsoft-OCR and no extra data. We examine the effectiveness of scene-text language pre-training (MLM, ITM) and scene-text visual pre-training (RPP). We verify the importance of the extra scene-text token input in MLM and ITM.

As shown in Table 4, the scene-text language pre-training in row (d) and the scene-text visual pre-training in row (e) improve the non-TAP baseline (row (b)) from 44.50 to 49.01 and 46.42, respectively. “TAP” (row (f)) performs all pre-training tasks and further improves the accuracy to 49.91.

The extra scene text token input is essential for TAP. Rows (a)-(d) in Table 4 show that neither the extra input alone (c.f. rows (a,b)) nor pre-training without it (c.f. rows (a,c)) leads to a clear improvement over the non-TAP baseline (row (a)). In contrast, TAP with the extra input (row (d)) boosts the accuracy to 49.01. The bottom rows show the effectiveness of RPP. RPP with the single spatial relationship “on” improves the accuracy from 44.50 to 46.42 (c.f. rows (b,e)). Combining RPP with MLM and ITM improves the accuracy from 49.01 to 49.91 (c.f. rows (d,f)). Extending the number of spatial relationship classes to 12 [yao2018exploring] leads to a further improvement.

Method +MLM,ITM +RPP Val Acc.
(a) Non-TAP (w/o extra text input) - - 44.48
(b) Non-TAP - - 44.50
(c) + MLM,ITM (w/o extra text input) ✓ - 44.63
(d) + MLM,ITM ✓ - 49.01
(e) + RPP - ✓ 46.42
(f) TAP ✓ ✓ 49.91
Table 4: Ablation studies on different pre-training tasks (MLM, ITM, RPP), and on the variant that excludes the extra scene-text token input from MLM and ITM. We highlight “TAP” by underline.
TextVQA ST-VQA TextCaps OCR-CC Val Acc. (3+4) Val Acc. (0+12)
(a) ✓ - - - 49.91 48.78
(b) ✓ ✓ - - 50.57 49.64
(c) ✓ ✓ ✓ - 51.86 50.13
(d) - - - ✓ 52.10 54.03
(e) ✓ ✓ ✓ ✓ 52.90 54.71
Table 5: Ablation studies on pre-training with extra data. We use the listed data only in pre-training and then fine-tune the model on the TextVQA dataset only. “3+4” and “0+12” indicate the layer numbers of the text and multi-modal transformers, respectively. We highlight “TAP” and “TAP†” by underline and bold.

Pre-training with extra data. Table 5 breaks down the benefits of adopting different sources of extra data. We conduct experiments on the TextVQA dataset with Microsoft-OCR. TAP enables the use of weak data with no answer annotations in the pre-training stage, such as TextCaps and OCR-CC, in addition to the Text-VQA datasets. Compared with “TAP” with no extra data, pre-training with ST-VQA and TextCaps improves the accuracy from 49.91 to 50.57 and 51.86 (c.f. rows (a,b) and rows (b,c)). Large-scale pre-training with OCR-CC alone (row (d)) achieves an accuracy of 52.10. Including all data during pre-training (row (e)) further improves the accuracy to 52.90.

Furthermore, we find that the extra data benefits the use of larger models. The original architecture consists of a three-layer text-only transformer and a four-layer multi-modal transformer. We also experiment with a 12-layer multi-modal transformer with the same structure as BERT [devlin2018bert], initialized from BERT and with the separate text transformer removed. We represent the two architectures as “3+4” and “0+12” in Table 5, where the numbers indicate the text and multi-modal transformer layer numbers, respectively. With the extra transformer layers, the accuracy without extra data drops from 49.91 to 48.78 (row (a)), while the accuracy with extra data increases from 52.90 to 54.71 (row (e)).

4.5 How does TAP help?

Coref Type W/O TAP With TAP
Text Word → Scene Text 0.0477 0.3514
Scene Text → Text Word 0.0473 0.5206
Visual Object → Scene Text 0.0045 0.0130
Scene Text → Visual Object 0.0337 0.0680
Table 6: The coreference scores with and without TAP. Numbers represent the attention score between two semantically corresponded tokens, averaged across all such token pairs in TextVQA. Higher coreference scores imply a better aligned representation.

Figure 3: Visualization of region attention scores with respect to each word in the question text , extracted from the multi-modal fusion transformers with (bottom row) and without (top row) TAP. The score by a region indicates its attention strength. TAP generates interpretable attentions on scene text-related question words like “must” and “survive.”

In this section, we analyze how TAP helps Text-VQA/Text-Caption. We empirically show that with TAP, certain attention heads in the multi-modal transformer ground the scene text to the semantically corresponded text words or visual objects. By learning such latent alignments, TAP improves the aligned representation learning and thus helps Text-VQA/Text-Caption.

Recent VLP analyses [cao2020behind, li2020does] show that VLP [tan2019lxmert, chen2019uniter, li2019visualbert] learns the latent alignments between the semantically corresponded region-word or region-region pairs. Specifically, certain attention heads in the transformer generate higher attention scores between such corresponded pairs. The attention scores between corresponded pairs are also referred to as coreference scores [cao2020behind]. Similarly, we analyze the change in the coreference score of scene text-related pairs to better understand TAP.

Each layer and attention head of our multi-modal transformer produces an attention score between any two positions. Following VALUE [cao2020behind], we define the coreference score as the maximum attention score over all heads between two semantically corresponded positions. A text word and a scene text region are corresponded if they refer to the same scene text token, e.g., the text word and scene text region “coors” in Figure 3. We collect all corresponded pairs between the extended text input and the scene text regions in the TextVQA dataset, and report the score averaged over all pairs. A scene text region and a visual object are corresponded if they share the spatial relationship “on.”
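As a sketch, the coreference score described above can be computed from the transformer's attention maps as follows. The tensor layout (layers, heads, sequence, sequence), the choice to take the maximum over layers as well as heads, and the toy sizes are assumptions for illustration.

```python
import torch

def coreference_score(attn, pairs):
    """For each semantically corresponded (i, j) position pair, take the
    maximum attention score over all layers and heads, then average over
    pairs. `attn` has shape (layers, heads, seq_len, seq_len)."""
    scores = [attn[:, :, i, j].max().item() for i, j in pairs]
    return sum(scores) / max(len(scores), 1)

# toy usage: random attention maps for a 4-layer, 12-head transformer
attn = torch.softmax(torch.randn(4, 12, 20, 20), dim=-1)
print(coreference_score(attn, [(3, 15), (5, 17)]))
```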

As shown in Table 6, we analyze TAP by comparing the coreference scores before and after TAP, i.e., “M4C†” and “TAP.” The first two rows show that TAP improves the scene-text language coreference scores by around seven times. The bottom two rows show that TAP increases the scene-text visual coreference scores by around two times. These increases validate that TAP successfully learns the latent alignments and thus improves the joint representation learning.

Furthermore, Figure 3 visualizes the attention score between a text word and all visual regions. Qualitatively, we observe a higher coreference score with TAP (bottom row) than the non-TAP baseline (top row). For example, in Figure 3 (a), TAP grounds the text word “must” and “survive” to the corresponded scene text regions.

Figure 4: Failure cases of the non-TAP baseline “M4C” that can be corrected by “TAP.”

4.6 Qualitative results

Figure 4 shows representative failure cases of the non-TAP baseline “M4C” that can be corrected by “TAP.” These cases show that TAP improves Text-VQA/Text-Caption by learning better aligned representations.

  • TAP shows a good performance on challenging questions that require paraphrasing the scene text sentences. For example, in Figure 4 (a), the model answers “who must survive” by the scene text “yaam must survive” in the image. The attention in Figure 3 further visualizes the latent region-word alignments.

  • TAP also performs better on questions that refer to a scene text via an intermediate object. For example, in Figure 4 (b), the model grounds the object region “the jacket on the man pointing” and generates the correct answer “ryman” with the scene text “ryman football league” on the man’s jacket.

  • Figure 4 (c) shows an example that TAP correctly understands the relative spatial relationship in question.

  • Furthermore, TAP helps the model read a large piece of text. For example, in Figure 4 (d), the model correctly answers the question “who edited the book” by finding the editors’ names “jeff vandermeer & mark roberts.” We note that each word is detected as a separate scene text region, e.g., “jeff,” “&,” etc., which makes the answer sequence prediction non-trivial.

The bottom row of Figure 4 shows examples of multiple questions on the same image. For example, (e,f) and (g,h) show that the model selects the correct scene text regions as the answer based on the input questions. More qualitative results are included in the supplementary material.

5 Conclusion

We have presented Text-Aware Pre-training (TAP), which explicitly incorporates scene text in pre-training and effectively learns a better aligned multi-modality representation for Text-VQA/Text-Caption. With the identical framework and training data, TAP boosts the non-TAP baseline by +5.4 in absolute accuracy on the TextVQA challenge. Furthermore, we build a large-scale dataset named OCR-CC and further improve the TAP performance. TAP outperforms the state-of-the-art methods by large margins. Analyses show that TAP helps the aligned representation learning among text word, visual object, and scene text.

Acknowledgment

Zhengyuan Yang and Jiebo Luo were supported in part by NSF awards IIS-1704337, IIS-1722847, and IIS-1813709.

References

Appendix A The OCR-CC Dataset

Figure 5: (a,b) The distribution of the number of scene text regions detected by Microsoft-OCR on the Conceptual Captions (CC) dataset [sharma2018conceptual] and on our OCR-CC dataset. (c,d) Representative examples of discarded and selected images. We draw one OCR box over multiple related words for visualization purposes. We note that each scene text region contains a single word, e.g., the four words “HYUNDAI,” “INSPIRING,” “THE,” “FL” in the top-left sub-figure of (d).

In this section, we introduce the details of building the OCR-CC dataset based on the Conceptual Captions (CC) dataset [sharma2018conceptual]. First, we run the Microsoft Azure OCR system on all CC images (around 3 million). Then, we discard the images that have no detected scene text (around half of the CC images) or that have watermark “text” only. These watermarks record the source image website/provider and are thus not related to the image content. Figure 5 (c) shows examples of the discarded images, which either have no detected scene text or have watermark “text” only. In the end, we select around 1.4 million images from CC as the images in our OCR-CC dataset. We pair each selected image with a caption for pre-training. The caption text is the concatenation of the original image caption in CC, the detected object labels, and the detected scene text words. Figures 5 (a,b) visualize the distribution of the number of scene text regions per image in CC and in our OCR-CC, respectively. Similar to the distributions on TextVQA [singh2019towards] and ST-VQA [biten2019scene], the majority of images contain a small number of detected scene text regions, while a small portion of images contains a large number of them. Figure 5 (d) shows some representative selected images.

Appendix B TextCaps Results

Method B-4 M R S C
BUTD [anderson2018bottom] 20.1 17.8 42.9 11.7 41.9
AoANet [huang2019attention] 20.4 18.9 42.9 13.2 42.7
M4C [sidorov2020textcaps] 23.3 22.0 46.2 15.6 89.6
MMA-SR [wang2020multimodal] 24.6 23.0 47.3 16.2 98.0
M4C† [sidorov2020textcaps] 24.3 22.9 47.3 16.5 99.9
TAP (Ours) 25.2 23.4 47.7 16.9 105.0
TAP† (Ours) 25.8 23.8 47.9 17.1 109.2
M4C (GT OCR) [sidorov2020textcaps] 26.0 23.2 47.8 16.2 104.3
Table 7: Results on the TextCaps [sidorov2020textcaps] validation set. B-4, M, R, S, C short for BLEU, METEOR, ROUGE_L, SPICE, CIDEr, respectively. The oracle analyses are shown in the gray text color.
Method B-4 M R S C
BUTD [anderson2018bottom] 14.9 15.2 39.9 8.8 33.8
AoANet [huang2019attention] 15.9 16.6 40.4 10.5 34.6
M4C [sidorov2020textcaps] 18.9 19.8 43.2 12.8 81.0
CNMT[cnmt] 20.0 20.9 44.4 13.5 93.0
M4C† [sidorov2020textcaps] 20.4 20.7 44.6 13.6 93.4
TAP (Ours) 21.5 21.7 45.4 14.5 99.5
TAP† (Ours) 21.9 21.8 45.6 14.6 103.2
M4C (GT OCR) [sidorov2020textcaps] 21.3 21.1 45.0 13.5 97.2
Human [sidorov2020textcaps] 24.4 26.1 47.0 18.8 125.5
Table 8: Results on the TextCaps [sidorov2020textcaps] test set.

Tables 7 and 8 present the full results on TextCaps [sidorov2020textcaps] to supplement the abstracted results in the main paper’s Table 3. We draw similar conclusions from Tables 7 and 8 as in the main paper. Specifically, “TAP” significantly improves the non-TAP baseline “M4C†” in all metrics with the identical network architecture and training data. Our TAP approach also outperforms the previous state of the art [sidorov2020textcaps, wang2020multimodal, cnmt] by large margins.

Furthermore, we compare TAP with the oracle numbers, as shown in the gray text color at the bottom part of Tables 7, 8. “TAP” outperforms the “M4C (GT OCR)” that uses ground-truth scene text detection in training and inference. Meanwhile, there still exists a gap between “TAP” and human performance. We expect future studies focusing on captioning to further reduce the gap, e.g., with better decoding step pre-training designed especially for captioning.

Appendix C Hyper-parameters

We summarize the hyper-parameters used in the “TAP” and “TAP†” experiments. We conduct experiments based on M4C [hu2020iterative, sidorov2020textcaps] and follow most of its hyper-parameter selections, as shown in Table 9. We highlight the changed parameters in bold in the table.

  • First, the max length of the extended text input is set to 220.

  • Second, we increase the max length of scene text from 50 to 100 when experimenting with Microsoft-OCR. Compared with Rosetta, Microsoft-OCR detects more scene text regions in each image; for example, on the TextVQA dataset, both the mean and median numbers of detected scene text regions per image are higher with Microsoft-OCR than with Rosetta, and a larger portion of images contains more scene text regions than the original max length covers. We therefore enlarge the max length of scene text to cover more of the detected scene text.

  • In the experiment of “pre-training without extra data” (“TAP”), we follow the same learning rate steps and maximum iteration settings as used in fine-tuning. In pre-training with OCR-CC (“TAP†”), we pre-train the model for a maximum of 480K iterations and scale the learning rate steps linearly.

Hyper-parameter Value
(a) General parameters
max length of text word 220
max length of visual object 100
max length of scene text 100
optimizer Adam
batch size 128
base learning rate 1e-4
warm-up learning rate factor 0.2
warm-up iterations 2000
max gradient L2-norm for clipping 0.25
learning rate decay 0.1
(b) Pre-training parameters
learning rate steps (“TAP,” VQA) 14K, 19K
max iterations (“TAP,” VQA) 24K
learning rate steps (“TAP,” Caption) 10K, 11K
max iterations (“TAP,” Caption) 12K
learning rate steps (“TAP†”) 280K, 380K
max iterations (“TAP†”) 480K
(c) Text-VQA Fine-tuning (TextVQA, ST-VQA)
max length of decoding step 12
learning rate steps 14K, 19K
max iterations 24K
(d) Text-Caption Fine-tuning (TextCaps)
max length of decoding step 30
learning rate steps 10K, 11K
max iterations 12K
Table 9: Hyper-parameters of the TAP experiments without and with OCR-CC pre-training, i.e., “TAP” and “TAP†.” We conduct the experiments based on M4C [hu2020iterative, sidorov2020textcaps] and highlight the changed parameters in bold. We detail these changes in Section C.

Appendix D Pre-train + Fine-tune vs. Joint-train

Results in the main paper’s Section 4.3 show that TAP works well even without extra data. We hypothesize that we can view TAP as a multi-task learning framework and obtain similar improvement by using the pre-training tasks (MLM, ITM, RPP) as auxiliary training losses. Therefore, we explore an alternative training pipeline named “joint train,” where the pre-training tasks are used as auxiliary losses together with the main answer/caption loss. Because the MLM and ITM tasks require “polluting” the input sequence, we randomly select a portion of the samples in each batch to compute the pre-training losses and keep the remaining samples unchanged for the answer/caption loss.
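A minimal sketch of this “joint train” loss mixing is given below. The split fraction and the two loss callables are placeholders, since the exact portion of pre-training samples per batch is not reproduced here.

```python
import random

def joint_train_step(batch, pretraining_loss_fn, answer_loss_fn,
                     pretrain_fraction=0.5):
    """One 'joint train' step: a random subset of the batch is corrupted and
    scored with the pre-training losses (MLM/ITM/RPP), while the remaining
    samples keep their clean input and receive the answer/caption loss."""
    idx = list(range(len(batch)))
    random.shuffle(idx)
    cut = int(pretrain_fraction * len(batch))
    pretrain_part = [batch[i] for i in idx[:cut]]
    answer_part = [batch[i] for i in idx[cut:]]
    loss = 0.0
    if pretrain_part:
        loss += pretraining_loss_fn(pretrain_part)
    if answer_part:
        loss += answer_loss_fn(answer_part)
    return loss

# toy usage with stand-in loss functions
print(joint_train_step(list(range(8)),
                       lambda s: 0.1 * len(s),
                       lambda s: 0.2 * len(s)))
```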

Studies show that the two training pipelines achieve similar performance on TextVQA, i.e., 49.91 for “pre-train + fine-tune” and a comparable accuracy for “joint train.” Both methods significantly outperform the non-TAP baseline (44.50). For “joint train,” we train the framework for the same total number of iterations as the “pre-train + fine-tune” pipeline. Compared with “joint train,” one advantage of the “pre-train + fine-tune” pipeline in the main paper is that extra weak data with no answer/caption annotations can be more easily used.

The effectiveness of different TAP pipelines implies the potential of improving other multi-modal tasks by incorporating pre-training tasks. Specifically, the pre-training tasks can be used either in the “joint-train” approach to best preserve the main task’s training pipeline, or in the “pre-train + fine-tune” approach to benefit from the large-scale weak pre-training data.

Appendix E Qualitative Results

In this section, we present additional qualitative examples. Figure 6 shows the failure cases that can be corrected by OCR detection. Figure 7 presents the failure cases of our method. “TAP” occasionally fails on samples that require complex reasoning (Figures 7 (a,b)) or have incorrect scene text detection (Figures 7 (c,d)). For example, in Figure 7 (a), TAP selects the scene text “cutfittep” on the black bag as the answer, instead of the correct scene text “aldo” on the referred white bag.

Figure 6: Failure cases that can be corrected by scene text detection. The top and bottom rows visualize the detected scene text by Rosetta-OCR and Microsoft-OCR, respectively. We draw adjacent words into the same box for visualization purposes and highlight the key scene text regions for the question, e.g., “moon bar,” “bud light,” “clemson,” and “marvel.”

Figure 7: Representative failure cases of “TAP.” We highlight the key scene text regions for each question.