Text-VQA, which calls for a model to answer questions based on scene text in Visual Question Answering (VQA) settings, has attracted much attention in recent years. The task evaluates a model's capability to detect and understand textual information in images, and to reason over cross-modal facts. Various high-quality Text-VQA benchmarks have been released in recent years [28, 3, 24, 31], and multiple methods have been proposed to address this task.
Previous works have followed the pipeline of first extracting the embedded scene texts, and then jointly generating the answer based on the input question, extracted texts, and visual contents [28, 15, 33]. However, simply selecting tokens from detected scene text is often not enough. Text-VQA differs from other multi-modal language analysis tasks (multi-modal sentiment analysis, image retrieval, etc.), where there is a direct connection between the task and the textual information, and the text itself is often complete, reliable, and can be emphasized over other modalities [32, 25]. In the Text-VQA scenario, things become much more complicated because: (1) The scene text itself may not contain enough information to fulfill the information need, especially when the given question focuses on certain visual characteristics (e.g. name written in blue) or certain objects (e.g. text written on the boat). (2) The layout of scene text is noisy in real-life images: a single sentence may be split into different rows in the image, which makes it difficult for current OCR systems to recognize them as a full sentence or paragraph. (3) OCR systems may make detection errors, especially on real-life images from very diverse domains, where OCR systems with different tendencies may excel or struggle; this is a natural bottleneck for the task.
Figure 1 shows a typical example of Text-VQA. In order to answer the question “What is the title of the book in the middle?” correctly, the model needs to overcome several obstacles. It should bridge the question with corresponding textual or visual characteristics (the book in the middle), properly group the “lines” of words on the book covers to form book and author names, and avoid possible OCR detection errors and choose the most reasonable and reliable OCR output from multiple sources.
Considering these difficulties, in this paper, we propose a model named LOGOS (Localize, Group, and Select) for better scene text understanding. Concretely, we localize the question to its Region of Interest (ROI) by (1) introducing a visual grounding task to connect question text and image regions and (2) connecting regions with text modalities using object tokens as explicit alignments. We also group the individual text pieces via unsupervised, position-based clustering, and provide this positional information to the model as a layout representation. Finally, we utilize OCR systems with different capabilities to model words at different granularities, and train LOGOS to dynamically select the answer from multiple noisy OCR sources. Experiments on two benchmarks confirm that our model benefits from better modeling of the text modality and can outperform the state-of-the-art models without additional OCR data annotation.
The main contributions of this paper are the following:
A novel Text-VQA model that effectively grounds different modalities and denoises low-quality OCR inputs
State-of-the-art results over two benchmark Text-VQA datasets
Detailed analysis on the advantages of LOGOS in cross-modal grounding and scene-text understanding
2 Related Works
The Text-VQA task aims at reading and understanding the text captured in the images to answer visual questions. Although the importance of involving scene texts in visual question answering tasks was originally emphasized by , due to the lack of available large-scale datasets, early development of question answering tasks related to embedded text understanding was limited in narrow domains such as bar-charts  or diagrams [17, 18]. The first large-scale open-domain dataset of the Text-VQA task, TextVQA, was introduced by , followed by several similar works including STVQA , OCR-VQA  and EST-VQA .
Recent studies [28, 12, 15, 33, 7, 8, 22, 11] have proposed several models and network architectures for the Text-VQA task. Introduced together with the TextVQA dataset, the LoRRA  model is built upon Pythia  with an OCR module added to detect and recognize the scene texts. M4C  first models Text-VQA as a multimodal task and uses a multimodal transformer to fuse different features over a joint embedding space. Also, it attaches a pointer network that can dynamically copy words from OCR systems. SA-M4C  added spatial information between objects, OCR, and question tokens to implicitly capture their relationships based on M4C to get further improvement. TAP  proposed to pretrain the model on several auxiliary tasks such as masked language modeling (MLM) and relative position prediction (RPP). It also leverages additional large-scale OCR datasets to enhance its ability to capture the contextualized information of scene text.
Although some previous works reported better results obtained by purely changing the OCR system [15, 33], they either don’t fully realize the potential of modeling the text modality with information from the visual modality or use expensive large-scale OCR data for pretraining. In this paper, we look deeper into other solutions to ground and refine features in the textual modality within existing datasets to facilitate the understanding of scene text in the multimodal fusion process.
3 LOGOS Model
Figure 2 demonstrates the LOGOS model structure, upon which we attempt to bridge the modalities of the question, image, and OCR text with different approaches. The focus of our model is three-fold:
Localize ROI by question-visual pretraining and question-OCR text modeling
Group related scene text tokens by clustering OCR layout information
Select OCR tokens dynamically from multiple noisy OCR systems
3.1 Model Architecture
Considering Text-VQA as a typical multi-modal task with inputs from several different modalities, our model utilizes the hybrid fusion technique, where the first step is to generate separate unimodal representations of different modalities. We use BERT  and Faster R-CNN  to encode the text and image respectively. Following previous works [12, 33], we also include the pyramidal histogram of characters (PHOC) representations  for the OCR tokens.
To make the model aware of the ROI given a certain question, we leverage question-visual pretraining before Text-VQA training and question-OCR modeling during Text-VQA training, with object tokens helping to align the two modalities. Details are given in Section 3.2.
With the aforementioned inputs, we add grouping information for OCR tokens based on their locations. The details of the clustering algorithm are introduced in Section 3.3.
As shown in Figure 2(b), after uni-modal features are obtained, the representations of the same OCR and object tokens from different modalities are concatenated respectively. We feed the textual representation of the question and the fused representation of detected objects and OCR tokens into a multi-modal transformer. We utilize an answer decoder with a pointer network for output generation and design a selection framework for multiple OCR sources, which we discuss in Section 3.4.
3.2 ROI Localization
To successfully predict the answer given the question information, a robust model should be capable of first pointing out the specific region that is most related to the question, then relying on the OCR text of that specific region to generate the output.
Ideally, we would have large amounts of alignment data between questions and OCR regions from which to learn the question-OCR grounding. Since there is no publicly available large-scale question-OCR dataset, we view the question-OCR grounding problem as a two-stage learning problem. Specifically, we first learn the general grounding between text and image regions using existing region-description datasets, for example, Visual Genome . Second, following the intuition from Oscar , we use the object tokens as textual “guidance” to help the model learn to ground between question and OCR regions.
3.2.1 Question-Visual Pretraining
To better equip the model with this ability, we select referral expression selection as a pretraining task for question-visual grounding purposes. Concretely, given an image with a description and non-overlapping bounding boxes, the model learns to predict which bounding box is best aligned with the description. As shown in Figure 2 (c), we reuse most parts of the network but add an extra classification layer for candidate prediction. Similar to the LOGOS encoding structure, we use question encodings and object encodings as input to the multi-modal transformer module. We train the question-visual pretraining task using cross-entropy loss.
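The pretraining objective described above can be illustrated as a softmax cross-entropy over per-box scores. This is a minimal sketch, assuming the extra classification layer produces one scalar score per candidate bounding box; the function names and the example scores are hypothetical and not taken from the LOGOS implementation.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over candidate-box scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grounding_loss(box_scores: List[float], target_index: int) -> float:
    """Cross-entropy: negative log-probability of the box matching the description."""
    probs = softmax(box_scores)
    return -math.log(probs[target_index])


def predict_box(box_scores: List[float]) -> int:
    """Index of the highest-scoring candidate box."""
    return max(range(len(box_scores)), key=lambda i: box_scores[i])


# Hypothetical scores for four candidate boxes; the description matches box 2.
scores = [0.1, -1.3, 2.4, 0.5]
loss = grounding_loss(scores, target_index=2)
```

The loss shrinks as the score of the correct box grows relative to the distractors, which is what pushes the model to ground the description in the right region.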
3.2.2 Question-OCR Modeling
Question-visual pretraining learns general grounding between question and object regions. In the second stage, we bring object tokens together with object region representations to bridge the gap between question tokens and unseen OCR regions. As shown in Figure 2 (b), we concatenate object label tokens together with question and OCR texts and feed them into a uni-modal BERT encoder, which jointly models contextualized representations of the question, objects, and OCR texts. After the uni-modal BERT layer, we follow the first-stage pretraining routine, with the concatenated OCR representation as an additional input source to the multi-modal transformer encoder. The injection of object information in both the uni-modal and multi-modal transformers can be seen as a bridge that helps learn the grounding between question and OCR inputs when large amounts of question-OCR data are not available.
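As an illustration of the concatenated text input, the sketch below builds a single token sequence from the question, the object labels, and the OCR tokens. The segment order, the use of `[CLS]` and `[SEP]` markers, and the segment ids are assumptions made for this example; the paper does not specify the exact layout.

```python
from typing import List, Tuple


def build_text_input(
    question: List[str], object_labels: List[str], ocr_tokens: List[str]
) -> Tuple[List[str], List[int]]:
    """Concatenate question, object-label, and OCR tokens for a text encoder.

    Returns the token sequence and a parallel list of segment ids
    (0 = question, 1 = object labels, 2 = OCR tokens). The layout is hypothetical.
    """
    tokens = ["[CLS]"] + question + ["[SEP]"]
    segments = [0] * len(tokens)
    tokens += object_labels + ["[SEP]"]
    segments += [1] * (len(object_labels) + 1)
    tokens += ocr_tokens + ["[SEP]"]
    segments += [2] * (len(ocr_tokens) + 1)
    return tokens, segments


tokens, segments = build_text_input(
    question=["what", "is", "written", "on", "the", "boat"],
    object_labels=["boat", "water"],
    ocr_tokens=["sea", "breeze"],
)
```

Because all three token types pass through the same encoder, the object labels can act as shared anchors between the question words and the OCR words.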
3.3 Scene Text Clustering and Modeling
Context understanding has always been important in language processing and question answering. This problem becomes even more vital in Text-VQA, where the evidence of clear separation of text groups lies in the visual modality rather than text. The detection scope of current OCR systems also mainly remains at individual token level or “line” level (grouping tokens aligned closely forming a line).
Figure 2(a) shows an example of raw detection results by an OCR system. For most text groups, such as the names and authors of the books, the words are split into multiple lines and thus detected individually. Simply aligning text pieces by their generated sequence, without providing extra visual evidence, will often lead to misordered text and incomplete or incorrect context modeling.
A straightforward idea would be to extract potential text groups based on visual modality evidence, which may serve as additional spatial information along with raw OCR bounding box coordinates, and provide weak evidence for token realignment or token context to form complete sentences in the text domain.
In LOGOS, we perform unsupervised clustering of the OCR text bounding boxes during data preparation. Given an image containing a set of lines detected by OCR, each containing one or more tokens, we define the distance between two bounding boxes as the minimum distance between any two points on the two bounding boxes, and perform clustering using DBSCAN  on the bounding boxes of all lines. Through clustering, each token is assigned extra hierarchical spatial information: its cluster index, its line index, and its token index. We generate a sinusoidal positional embedding for each of these three attributes, concatenate the three embeddings to form the overall layout descriptor, and append this descriptor to the final OCR representation. Figure 2(b) shows a sample result of this clustering approach.
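The grouping step can be sketched in a few lines. The code below is an illustrative approximation under stated assumptions: boxes are axis-aligned and normalized, DBSCAN is written out in plain Python rather than calling a library, and the `eps`, `min_samples`, and `dim` values are hypothetical because the paper's settings are not recoverable here. The sinusoidal embedding follows the standard Transformer formulation, which is assumed rather than confirmed by the text.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), normalized


def box_distance(a: Box, b: Box) -> float:
    """Minimum Euclidean distance between any two points of two axis-aligned boxes."""
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def dbscan(boxes: Sequence[Box], eps: float, min_samples: int = 1) -> List[int]:
    """Plain DBSCAN over box distances; returns one cluster id per box (-1 = noise)."""
    n = len(boxes)
    neighbors = [
        [j for j in range(n) if box_distance(boxes[i], boxes[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    visited = [False] * n
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_samples:
            continue  # not a core point; stays noise unless reached from a core point
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])
        cluster += 1
    return labels


def sinusoidal_embedding(position: int, dim: int) -> List[float]:
    """Transformer-style embedding: sine on even indices, cosine on odd indices."""
    emb = []
    for k in range(dim):
        angle = position / (10000 ** (2 * (k // 2) / dim))
        emb.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return emb


def layout_descriptor(cluster_id: int, line_id: int, token_id: int, dim: int) -> List[float]:
    """Concatenate the embeddings of the cluster, line, and token indices."""
    return (
        sinusoidal_embedding(cluster_id, dim)
        + sinusoidal_embedding(line_id, dim)
        + sinusoidal_embedding(token_id, dim)
    )


# Two stacked lines of one book title plus one distant line (hypothetical boxes).
lines = [(0.10, 0.10, 0.30, 0.15), (0.10, 0.16, 0.28, 0.21), (0.70, 0.80, 0.90, 0.85)]
labels = dbscan(lines, eps=0.05)
descriptor = layout_descriptor(cluster_id=labels[1], line_id=1, token_id=0, dim=4)
```

With `min_samples=1` every line is a core point, so clusters reduce to groups of lines chained together by gaps of at most `eps`; a larger `min_samples` would instead mark isolated lines as noise.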
3.4 OCR Source Selection
As mentioned above, current OCR systems are not robust enough to perfectly extract high-quality scene text from images. Different systems also produce different kinds of errors: a careless OCR system will fail to find all of the text in an image and make more detection errors, whereas a meticulous OCR system may detect too much text from details inside the image, which increases the difficulty of localizing the question to the correct region. This opens up the possibility of minimizing OCR error and maximizing reasoning effectiveness by combining different OCR sources.
We modify the training and prediction procedures to make the best use of different OCR systems. Given OCR scene text from multiple systems, independent answers are generated separately during the decoding stage, one per OCR source. LOGOS then calculates a confidence score for each answer, conditioned on the OCR tokens from the corresponding source and on the other input features, including visual features and the question. The answer with the highest score is selected as the final answer. During training, all OCR sources are treated equally, which also provides extra training data for the learning process.
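The selection step can be sketched as follows. The exact confidence formula is not reproduced here, so this example assumes a length-normalized log-probability of the decoded tokens as the score; that choice, the function names, and the example probabilities are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def sequence_confidence(token_probs: List[float]) -> float:
    """Assumed score: mean log-probability of the decoded answer tokens."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)


def select_answer(candidates: Dict[str, Tuple[str, List[float]]]) -> Tuple[str, str]:
    """Pick the (source, answer) pair whose decoded answer has the highest confidence.

    `candidates` maps an OCR source name to (answer string, per-token probabilities).
    """
    best_source = max(candidates, key=lambda s: sequence_confidence(candidates[s][1]))
    return best_source, candidates[best_source][0]


# Hypothetical decoder outputs, one independent answer per OCR source.
candidates = {
    "rosetta": ("3", [0.9]),
    "azure": ("17", [0.6, 0.7]),
}
source, answer = select_answer(candidates)
```

Normalizing by length keeps longer answers from being penalized merely for containing more tokens; an unnormalized sum of log-probabilities would be an equally plausible reading of a confidence score.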
4.1 Datasets and Evaluation Metrics
TextVQA is the first large-scale Text-VQA dataset, with 28,408 images sampled from the Open Images dataset . A total of 45,336 questions related to the text information in the images were answered by annotators. For each question-image pair, 10 answers are provided by different annotators. The accuracy normalized by weighted voting over the 10 answers is reported on this dataset. Following previous settings [28, 12], we split the dataset into 21,953, 3,166, and 3,289 images for the train, validation, and test sets respectively.
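The voting-based accuracy can be illustrated with the widely used soft VQA accuracy, in which a prediction earns full credit once at least three annotators gave the same answer. This is a sketch under that assumption; the official evaluation script additionally normalizes answer strings and averages over annotator subsets, which is omitted here.

```python
from typing import List


def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Soft accuracy: min(number of matching annotators / 3, 1)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)


# Ten hypothetical annotator answers for one question.
answers = ["coca cola"] * 2 + ["coke"] * 5 + ["cola"] * 3
partial = vqa_accuracy("coca cola", answers)
full = vqa_accuracy("coke", answers)
```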
STVQA is similar to the TextVQA dataset and contains 23,038 images with 31,791 questions. We follow the setting from M4C  and split the dataset into train, validation, and test splits with 17,028, 1,893, and 2,971 images respectively. Compared with the TextVQA dataset, the data sources of STVQA are more diverse, including Coco-text, Visual Genome , VizWiz , ICDAR , ImageNet, and IIIT-STR  data. We report two metrics on this dataset: accuracy and Average Normalized Levenshtein Similarity (ANLS).
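For reference, ANLS can be computed as sketched below, assuming the standard definition from the ST-VQA benchmark: the per-question score is one minus the normalized edit distance to the closest ground-truth answer, and is set to zero when that normalized distance reaches the threshold of 0.5. Lower-casing and whitespace stripping are simplifications chosen for this example.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions: List[str], ground_truths: List[List[str]], threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a set of questions."""
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        p = pred.strip().lower()
        best = 0.0
        for gt in answers:
            g = gt.strip().lower()
            denom = max(len(p), len(g))
            nl = levenshtein(p, g) / denom if denom else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


# One near-miss caused by a single character error, one exact match.
score = anls(["stap", "exit"], [["stop"], ["exit", "way out"]])
```

Unlike exact-match accuracy, ANLS gives partial credit when an answer is off by a few characters, which is common when OCR misreads a letter.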
Visual Genome is a multi-modal grounding dataset with 108,077 images . We use this dataset for visual grounding joint training. Specifically, region descriptions and their corresponding bounding boxes are used to generate multiple-choice data. In total, we generate 1,216,806 training pairs from the Visual Genome dataset.
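|Training data||Model||TextVQA val acc. (%)||TextVQA test acc. (%)||STVQA val acc. (%)||STVQA val ANLS||STVQA test ANLS|
|External data||TAP||54.71||53.97||50.83||0.598||0.597|
|Original dataset only||LoRRA||26.56||27.63||-||-||-|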
External data: uses extra OCR annotations on the OCR-CC dataset, which is currently not publicly available, together with TextVQA and STVQA.
4.2 Experiment settings and Training Details
Our model is implemented based on the framework of M4C . For the visual modality, we follow the settings of M4C and use Faster R-CNN  fc7 features of the 100 top-scoring objects in the image, detected by a Faster R-CNN detector pretrained on the Visual Genome dataset. The fc7 weights are fine-tuned during training. For the text modality, we use two OCR systems, Rosetta OCR  and Azure OCR (https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision), to recognize scene text. We use a trainable 3-layer BERT-base encoder  for text representation. We follow M4C and include pyramidal histogram of characters (PHOC) representations  of OCR text. We perform scene text clustering on normalized bounding boxes using DBSCAN.
The candidate vocabulary for decoding consists of the top 5,000 frequent words from the answers in the training set, as well as the detected OCR tokens in the current image. The answer is generated with a pointer network .
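One decoding step over this combined candidate set can be sketched as an argmax over the concatenation of fixed-vocabulary scores and per-OCR-token copy scores. The scores below are hypothetical placeholders; in an M4C-style decoder they would come from an output layer and a pointer module, which are not modeled here.

```python
from typing import List, Sequence, Tuple


def decode_step(
    vocab: Sequence[str],
    vocab_scores: Sequence[float],
    ocr_tokens: Sequence[str],
    ocr_scores: Sequence[float],
) -> Tuple[str, str]:
    """Choose the next answer word from the fixed vocabulary or by copying an OCR token."""
    candidates: List[str] = list(vocab) + list(ocr_tokens)
    scores: List[float] = list(vocab_scores) + list(ocr_scores)
    best = max(range(len(scores)), key=scores.__getitem__)
    source = "vocab" if best < len(vocab) else "ocr"
    return candidates[best], source


# Hypothetical scores: the pointer favors copying an OCR token from the image.
word, source = decode_step(
    vocab=["yes", "no", "the"],
    vocab_scores=[0.2, 0.1, 0.4],
    ocr_tokens=["coloring", "stain"],
    ocr_scores=[1.3, 0.9],
)
```

Because the OCR part of the candidate set changes with every image, the model can output words that never appear in the fixed 5,000-word vocabulary.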
LOGOS contains 116M trainable parameters. During pretraining steps, we first pretrain LOGOS using the question-visual grounding task, then continue to train on the Text-VQA task. We set the batch size to 48 and train for 24,000 iterations for both pretraining and fine-tuning stages. The initial learning rate is set to 1e-4 with a warm-up period of 1000 iterations, and the learning rate decays to 0.1x after 14,000 and 19,000 iterations respectively. We use the checkpoints that achieved the best performance on the validation set for evaluation.
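The learning-rate schedule described above can be written as a small function. This sketch assumes a linear warm-up starting from zero and a multiplicative decay of 0.1 applied at each of the two milestones; the warm-up shape is an assumption, since only its length is stated.

```python
from typing import Sequence


def learning_rate(
    step: int,
    base_lr: float = 1e-4,
    warmup_steps: int = 1000,
    milestones: Sequence[int] = (14000, 19000),
    decay: float = 0.1,
) -> float:
    """Step-wise schedule: linear warm-up (assumed), then 0.1x decay at each milestone."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * decay ** passed


# Learning rate at a few points of a 24,000-iteration run.
schedule = {s: learning_rate(s) for s in (500, 1000, 14000, 19000, 23999)}
```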
4.3 Experiment Results
Table 1 lists the performance of LOGOS on the TextVQA dataset compared with other baselines. We find that LOGOS outperforms all current baselines that do not use extra OCR data.
We make several observations from this table: (1) There is a large gap between models using another OCR system (M4C+Azure OCR, SA-M4C, TAP, LOGOS) and models using the Rosetta system. The gap illustrates that previous performance on the Text-VQA task is severely limited by the quality of the detected scene text. This underscores the importance of better modeling and refinement of the textual modality. (2) LOGOS shares with TAP the idea of introducing auxiliary tasks for pretraining or joint training, and the results improve similarly although the training tasks are very different. However, LOGOS still outperforms TAP by 0.93% on the TextVQA dataset because of better modeling of text features. (3) It is noteworthy that the current highest score from TAP is obtained by pretraining on another large-scale OCR dataset (OCR-CC), which is designed to better utilize scene text features in multimodal tasks. This idea is similar to ours, and our model is fully compatible with pretraining on this dataset. We expect a further improvement of our model when it has access to more OCR data.
Looking at the STVQA results, we can see that training using only STVQA data does not outperform the TAP model . This may be due to the fact that STVQA contains more short answers, compared with TextVQA, where the questions are relatively longer and harder to answer. As the STVQA dataset is small and its spatial relationships are relatively simple, we observe that LOGOS suffers from overfitting on this dataset. After introducing joint training with the TextVQA dataset, we see a large improvement on the STVQA dataset: our model outperforms SNE, the previous best STVQA model jointly trained on the TextVQA dataset, by 2.9 points in ANLS. This shows that LOGOS can also achieve strong performance on the STVQA dataset with additional training data from other Text-VQA datasets.
5.1 Ablation Studies
Besides the full model, we also ran several experiments on variants of LOGOS to examine the effectiveness of each component. Results are shown in Table 2. We analyze model variant performance from three perspectives.
Effect of ROI Localization
Rows #1-#3 show the effectiveness of ROI localization based on question information. We see an improvement when either the question-visual pretraining or the question-OCR modeling is involved, meaning that grounding questions in both modalities is helpful. Specifically, we see better performance for question-OCR modeling compared with question-visual pretraining. This is expected because the question and OCR tokens share the same embedding space in the textual modality, which leads to more efficient relationship learning.
Effect of Using Multiple OCR Systems
Compared to simply selecting the highest-quality OCR system, LOGOS utilizes the different capabilities of different OCR systems to achieve better performance. Rows #3-#5 show the results under the same modeling setting except for the source of OCR tokens. We can see that changing from a low-quality OCR system (Rosetta OCR) to a better one (Azure OCR) results in a large gain in accuracy, and that using both and choosing between them with a selector results in further improvement.
|Select Better Confidence||51.53|
To verify that the improvement is not achieved only by more diverse training data, we conduct an extra experiment that uses both OCR sources during training but limits the source during inference. The results are shown in Table 3. Across the whole TextVQA validation set, only one OCR system predicts the right answer for 22.4% of the data, and in 71.96% of these cases, LOGOS correctly selects the OCR source based on confidence.
Effect of Scene Text Clustering
By comparing Rows #5 and #6, we find that by grouping scene text within a similar area and adding the corresponding position representation, our model achieves an improvement of around 0.6%. Note that all the methods mentioned in this paper are compatible with each other and can be used at the same time. Combining them yields the results reported in Row #7, which are state-of-the-art among models that use only TextVQA as OCR data for training.
5.2 Case Studies
In this section, we analyze predictions of different model variants of LOGOS on validation set questions to check module effectiveness. We show two examples in Figure 4.
The case on the left demonstrates the effect of text clustering. The two images show OCR text pieces without and with text clustering respectively. Without the clustering information and with only the bounding box coordinates, the model still struggles to find the connection between individual words to form the phrase “coloring with stain”. Adding clustering-based word position information leads to much easier inference for the model to predict the correct answer.
The case on the right further emphasizes the importance of utilizing detection results from multiple OCR sources. When one OCR source (Azure) misses the key evidence text (the “3” on the player’s jersey), the model is very likely to generate an incorrect answer. After incorporating the source selection module, the model is able to dynamically choose the most reasonable source, based on its understanding of the question and the relationship between text and objects. In this case, the model successfully grounds the question to the correct object (the jersey on the left) and answers with the correct jersey number, 3, instead of 17.
5.3 Error Analysis
In this section, we present an analysis of error types to see how our model achieves better accuracy and what limits further improvement. We randomly sampled 30 negative examples from M4C and another 30 from LOGOS and counted the number of errors caused by bad OCR quality, failure to group scene text, or failure to locate questions in the related area. The proportion of such errors decreased from 80.0% to 53.3%, which suggests that LOGOS is better able to locate questions and utilize scene text clusters from multiple OCR systems.
When looking into negative examples from LOGOS, we also notice two typical types of error, which we present in Figure 5. In these two cases, the model correctly recognizes the scene text and localizes the question to it, but still fails to generate the correct answer.
In the first case, the question asks about the size of the TV. Answering such questions requires a QA model to be equipped with proper external knowledge. Here, the model needs to understand measurements and different kinds of notations such as 8K for resolution and 98” for size. Other types of external knowledge in similar TextVQA error cases include: recognizing time, identifying dates, and counting and calculation.
As for the second case, although all the words are detected correctly by the OCR system and grouped together, the model struggles to completely understand the long text. Answering this type of question further requires the model to fully understand the content, which is almost impossible to achieve without a more specific design, for example, a pipeline with a reading comprehension module.
6 Conclusion and Future Work
In this paper, we propose LOGOS, a novel model that hierarchically groups evidence from both image and text modality, and outputs answers from multiple noisy OCR systems. Results reveal that LOGOS outperforms the state-of-the-art models without using additional OCR training data. Detailed analysis shows that LOGOS can not only learn which region to focus on given the question but can also generate coherent answers with correct spatial order from the original image.
Based on the analysis and comparison with other models, there are also some potential areas for future work. We do not use any external OCR-related datasets to enhance the model’s ability to model scene text, and we expect a higher score when more OCR data are applied to our model. Our model makes the first attempt to better utilize scene text from images in the multi-modal setting. We observe a huge improvement on Text-VQA performance, and we believe this method can also be applied to similar tasks such as Text-Caption  where the understanding of scene text plays an important role.
- (2014) Word spotting and recognition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (12), pp. 2552–2566.
- (2010) Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pp. 333–342.
- Scene text visual question answering. Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4291–4301.
- (2018) Rosetta: large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79.
- Imagenet: a large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255.
- (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- (2020) Structured multimodal attentions for textvqa. arXiv preprint arXiv:2006.00753.
- Multi-modal graph neural network for joint reasoning on vision and scene text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12746–12756.
- (2018) Vizwiz grand challenge: answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3608–3617.
- (2019) dbscan: fast density-based clustering with R. Journal of Statistical Software 91 (1), pp. 1–30.
- (2020) Finding the evidence: localization-aware answer prediction for text visual question answering. arXiv preprint arXiv:2010.02582.
- (2020) Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9992–10002.
- (2018) Pythia v0.1: the winning entry to the vqa challenge 2018. arXiv preprint arXiv:1807.09956.
- Dvqa: understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5648–5656.
- (2020) Spatially aware multimodal transformers for textvqa. arXiv preprint arXiv:2007.12146.
- (2015) ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160.
- (2016) A diagram is worth a dozen images. In European Conference on Computer Vision, pp. 235–251.
- (2017) Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Proceedings of the IEEE Conference on Computer Vision and Pattern recognition, pp. 4999–5007.
- (2017) Openimages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2 (3), pp. 18.
- (2016) Visual genome: connecting language and vision using crowdsourced dense image annotations. arXiv preprint arXiv:1602.07332.
- (2020) Oscar: object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pp. 121–137.
- (2020) Cascade reasoning network for text-based visual question answering. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 4060–4069.
- (2013) Image retrieval using textual cues. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3040–3047.
- (2019) Ocr-vqa: visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 947–952.
- (2020) Integrating multimodal information in large pretrained transformers. In Proceedings of the conference. Association for Computational Linguistics. Meeting, Vol. 2020, pp. 2359.
- (2015) Faster r-cnn: towards real-time object detection with region proposal networks. arXiv preprint arXiv:1506.01497.
- Textcaps: a dataset for image captioning with reading comprehension. In European Conference on Computer Vision, pp. 742–758.
- (2019) Towards vqa models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8317–8326.
- (2016) Coco-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140.
- (2015) Pointer networks. arXiv preprint arXiv:1506.03134.
- (2020) On the general value of evidence, and bilingual scene-text visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10126–10135.
- Words can shift: dynamically adjusting word representations using nonverbal behaviors. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7216–7223.
- (2020) TAP: text-aware pre-training for text-vqa and text-caption. arXiv preprint arXiv:2012.04638.
- (2020) Simple is not easy: a simple strong baseline for textvqa and textcaps. arXiv preprint arXiv:2012.05153.