Visual Question Answering (VQA) has shown great progress thanks to the development of deep neural networks. However, recent studies [7, 15, 38] show that most VQA models unfortunately fail on a type of question that requires understanding the text in the image. The VizWiz dataset first identified this problem, finding that nearly a quarter of the questions asked by visually-impaired people are related to reading text. Singh et al. systematically studied this problem and introduced a novel dataset, TextVQA, that only contains questions requiring the model to read and reason about the text in the image in order to be answered.
Three key abilities are required to tackle the TextVQA problem: reading, reasoning and answering. Their absence is also the main reason why state-of-the-art (SoTA) VQA models fail on this task. The reading ability relies on Optical Character Recognition (OCR) techniques, a long-standing sub-field of computer vision, to accurately recognise the text appearing in the image, while reasoning requires a model to jointly reason over the visual content and the OCR text in the image. SoTA VQA models [13, 51] may gain strong reasoning abilities over visual content and natural language questions via sophisticated mechanisms such as attention and memory networks, but none of them can read the "text" in images accurately, let alone reason over it. LoRRA, the method provided with TextVQA, equips an OCR model to read text, but its results are not outstanding due to a lack of deep reasoning between text and visual content. As for the answering aspect, almost all SoTA VQA models use a discriminative answering model because it is easy to optimise and leads to better performance on traditional VQA datasets. However, the answer in TextVQA is normally a combination of OCR tokens detected in the image and general text tokens, so the answer vocabulary is not fixed; a discriminative answering model may limit the output variety.
Figure 1 shows an example from TextVQA that involves several types of relationships. For instance, "the front of shirt" and "player's shirt" are object-object links; "word printed on the front of the player's shirt" is a text-object bond; and "word above the number 12…" is a text-text relation. In this paper, to enhance the relationship reasoning ability, we introduce a Structured Multimodal Attention (SMA) model that reasons over a graph with multiple types of relationships. Specifically, a question self-attention module first decomposes questions into six sub-components that indicate objects, object-object relations, object-text relations, texts, text-text relations and text-object relations. A role-aware graph is then constructed with objects/texts as nodes, where the connections between nodes are decided by their relative distance. The graph is then updated by a question-conditioned graph attention module. In this model, instead of using the whole question to guide the graph updating, only the corresponding question components extracted by the question self-attention module are used to update each graph component. For example, the object-related question feature is used for object nodes and the object-text-related question feature only for object-text edge updating. Finally, to solve the aforementioned answering issue, we propose a global-local attentional module that produces variable-length answers in a generative way. The summarised global features of question, object and text, together with local OCR embeddings, are fed into our proposed module to iteratively select answer words from a fixed answer vocabulary or the OCR tokens.
Our proposed SMA model outperforms SoTA TextVQA models, but we find the results are still far from satisfactory and there remains a big gap between machines and humans. To study whether this performance gap is caused by the "reading" part or the "reasoning" part, we investigate how much TextVQA accuracy is affected by OCR performance when a fixed reasoning model is used. In this paper, to completely peel off the impact of OCR and investigate the real reasoning ability, we ask AMT workers to annotate all the text appearing in the TextVQA dataset, yielding ground-truth OCR annotations. These annotations were not given in the original TextVQA and we will release them to the community for fair comparison. We also report the performance of LoRRA and our best model when given the ground-truth OCR, in order to test solely the reasoning ability of each model. A new upper bound is also given by using the ground-truth OCR annotations.
In summary, our contributions are threefold:
We propose a structured multimodal attentional (SMA) model that can effectively reason over structural text-object graphs and produce answers in a generative way. Thanks to the adopted graph reasoning strategy, the proposed model achieves better interpretability.
We study the contribution of OCR to the TextVQA problem and provide human-annotated ground-truth OCR labels to complete the original TextVQA dataset. This allows others in the community to evaluate only their models' reasoning ability, under a perfect reading situation.
Our SMA model outperforms existing state-of-the-art TextVQA models on both the TextVQA and ST-VQA datasets.
In the remainder of this paper, matrices are denoted by bold capital letters and column vectors by bold lower-case letters; ⊙ represents the element-wise product and [·; ·] refers to concatenation.
2 Related Work
2.1 Text based VQA
Straddling the fields of computer vision and natural language processing, Visual Question Answering (VQA) has attracted increasing interest since large-scale VQA datasets were released. A large number of methods and datasets have been proposed: VQA datasets such as CLEVR and FigureQA have been introduced to study visual reasoning without the consideration of OCR; Wang et al. introduced a dataset that explicitly requires external knowledge to answer a question.
Reading and reasoning over the text involved in an image are of great value for visual understanding, since text contains rich semantic information, which is the main concern of VQA. Several datasets and baseline methods have been introduced in recent years aiming at studying joint reasoning over visual and textual contents. For example, Textbook QA asks multimodal questions given text, diagrams and images from middle school textbooks. FigureQA requires answering questions based on synthetic scientific-style figures like line plots, bar graphs or pie charts. DVQA assesses bar-chart understanding ability in a VQA framework. In these datasets, texts are machine-printed and appear in standard fonts with good quality, which alleviates the challenging text recognition work. VizWiz is the first dataset that requires text information for question answering, given images captured in natural scenes. Nevertheless, a large proportion of the questions are "unanswerable" because of poor image quality, which makes the dataset inappropriate for training an effective VQA model and studying the problem systematically.
Most recently, TextVQA and ST-VQA were proposed concurrently to highlight the importance of text reading from natural scene images in the VQA process. LoRRA was proposed with TextVQA; it uses a simple UpDn attention framework over both image objects and OCR text to infer answers. The model was later improved with a BERT-based word embedding and a Multimodal Factorized High-order pooling based feature fusion method, and won the TextVQA Challenge. Compared to TextVQA, where any question is allowed as long as text reading is required, all questions in ST-VQA can be answered unambiguously and directly by the text in images. The Stacked Attention Network (SAN) is adopted in ST-VQA as a baseline, simply concatenating text features with image features for answer classification. The answering modules in previous models such as LoRRA encounter two bottlenecks: one serious setback is that they treat the dynamic OCR space as invariant indexes, and the other is the inability to generate answers longer than one word. M4C first tackles both problems by using a transformer decoder and a dynamic pointer network. In this work, we focus on explicitly modelling relationships between objects, texts and object-text pairs, achieving better performance and interpretability than previous TextVQA approaches.
2.2 Graph Networks in Vision and Language
Graph networks have received a lot of attention due to their expressive power in structural feature learning. They can not only capture the node features themselves, but also encode the neighbourhood properties between nodes in graphs, which is essential for VQA and other vision-and-language tasks that need to incorporate structure in both spatial and semantic information. For instance, Teney et al. construct graphs over image scene objects and over question words respectively to exploit the structural information in these representations; the model shows significant improvements in general VQA tasks. Narasimhan et al. perform finer relation exploration for the factual-VQA task by taking into account a list of facts via Graph Convolutional Networks (GCN) for correct answer selection. Another work learns a question-specific graph representation for the input image in VQA, capturing object interactions with the relevant neighbours via spatial graph convolutions. MUREL goes one step further to model spatial-semantic pairwise relations between all pairs of regions for relation reasoning, in addition to a rich vectorial representation for the interaction between each region's visual content and the question.
Our work also uses a graph representation, but unlike previous methods that use a fully-connected graph over all the objects, our task needs to take into account both visual elements and text information from the image, which are essentially heterogeneous. We construct a role-aware graph that considers the different roles of nodes (such as object and text) and edges ("object-object", "text-text" and "object-text"), which results in a much better cross-modality feature representation for answer inference.
In this section we introduce our Structured Multimodal Attentions (SMA) model. At a high level, SMA is composed of three modules (as shown in Figure 1): (1) a question self-attention module that decomposes questions into six sub-components w.r.t. the different roles in our constructed object-text graph; (2) a question-conditioned graph attention module that reasons over the graph under the guidance of the above question representations and infers the importance of different nodes as well as their relationships; and (3) a global-local attentional answering module that can generate answers with multiple words stitched together. We detail each module in the following sections.
3.1 Question Self-Attention Module
Since the question includes not only information about object and text nodes, but also four categories of relationships between them (object-object, object-text, text-text and text-object), our question self-attention module (see Figure 2) divides the question into six sub-components. Although this is inspired by [3, 50], our modules are more fine-grained and are designed specifically for the TextVQA task.
Given a question with T words {w_1, …, w_T}, we first embed the words into a feature sequence E = {e_1, …, e_T} using pre-trained BERT. Next, six individual two-layer MLPs followed by softmax layers are applied on E to generate six sets of attention weights over the words. These weights are further used to calculate six weighted sums of E: q^o, q^oo, q^ot, q^t, q^tt and q^to, which are considered as question representations decomposed w.r.t. object nodes, object-object (oo) edges, object-text (ot) edges, text nodes, text-text (tt) edges and text-object (to) edges. Taking q^o as an example, the computation is performed as follows:

a_i^o = softmax_i(MLP^o(e_i)),   q^o = Σ_i a_i^o · e_i.
These decomposed question features are used as guiding signals when performing question conditioned graph attention in Section 3.2.
We also learn two sets of self-attention weights over the decomposed sub-components: one over the object-side components (q^o, q^oo, q^ot) and one over the text-side components (q^t, q^tt, q^to), where each weight is a scalar. They are calculated by projecting each triple of decomposed question features to three scalars and normalising with a softmax, so that each set sums to one. To some extent, these weights play the role of prior probabilities, as they can be calculated from the question alone. The two sets of question self-attention weights will be used to generate question-conditioned object and text features, respectively (see Section 3.3).
3.2 Question Conditioned Graph Attention Module
The question conditioned graph attention module is the core of our network, which generates a heterogeneous graph over both objects and texts of an image and reasons over it.
Role-aware Heterogeneous Graph Construction. 'Role' denotes the type of a node. We construct a role-aware heterogeneous graph over the object nodes and text nodes of an image, with an edge set connecting them. In our graph, an edge denotes the relationship between two particular nodes, and each node can be connected to a set of nearest object nodes plus a set of nearest text nodes. Since nodes and edges in our graph have different roles, we call it a heterogeneous graph. 'Role-awareness' means we explicitly use the role information of each node to construct the graph. We can further divide the edges into four sets according to their roles: oo edges, ot edges, tt edges and to edges. Here we showcase how the oo edges are constructed: for each object node, we rank the remaining objects by their spatial distance to it and define the neighbourhood as the top-ranked object nodes.
We build the edge representation between two nodes based on their relative spatial relationship, taking an oo edge as an example. Suppose the centre coordinate, width and height of one node are (x_i, y_i), w_i and h_i, and the top-left coordinate, bottom-right coordinate, width and height of the other node are (x_j^tl, y_j^tl), (x_j^br, y_j^br), w_j and h_j; the associated edge representation then encodes the second node's position relative to the first, normalised by the first node's scale, e.g. [(x_j^tl − x_i)/w_i, (y_j^tl − y_i)/h_i, (x_j^br − x_i)/w_i, (y_j^br − y_i)/h_i].
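A minimal sketch of the neighbourhood construction and a simplified relative-position edge encoding follows. It uses centre coordinates for both nodes and Euclidean centre distance for ranking; the exact feature layout and distance measure in the paper may differ:

```python
import math

def knn_neighbors(boxes, k):
    """For each box (cx, cy, w, h), return the indices of the k
    spatially nearest other boxes, ranked by the Euclidean distance
    between box centres."""
    nbrs = []
    for i, (cx, cy, _, _) in enumerate(boxes):
        dists = sorted(
            (math.hypot(cx - bx, cy - by), j)
            for j, (bx, by, _, _) in enumerate(boxes) if j != i
        )
        nbrs.append([j for _, j in dists[:k]])
    return nbrs

def edge_feature(box_i, box_j):
    """Relative spatial encoding of node j w.r.t. node i, normalised
    by node i's size -- one plausible instantiation of the edge
    representation, not the paper's exact formula."""
    cxi, cyi, wi, hi = box_i
    cxj, cyj, wj, hj = box_j
    return [(cxj - cxi) / wi, (cyj - cyi) / hi, wj / wi, hj / hi]
```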
Question Conditioned Graph Attention. We use the decomposed question features from Section 3.1 to reason on the role-aware graph constructed in the last section, formulating the reasoning process as an attention mechanism. Instead of applying global attention weights derived from a single question feature, we update different parts of the graph with different question features according to their roles. For example, the object-related question representation is used to guide the attention weights over object nodes, and the text-object-related representation is used to guide the text-object edge attention weights. Considering that there are six roles in the graph, we compute the attention weights respectively for object nodes, text nodes, object-object edges, object-text edges, text-text edges and text-object edges. The mechanism can be formulated as:
where Att(·, ·) denotes the attention mechanism that computes attention weights from question features and the specific nodes/edges in the graph, which we introduce in the next section. The node inputs are features extracted from isolated object and text regions respectively, which are then fed into the graph attention module to generate question-conditioned features.
1) The object node attention weights. An object node is represented by its appearance feature from a Faster R-CNN detector and a bounding box feature holding the object's relative bounding box coordinates, normalised by the width and height of the image. Given the appearance feature and bounding box feature of an object, the attention weight for the object node is calculated under the guidance of the object-related question representation:
where LN is layer normalisation and the remaining matrices are linear transformation parameters to be learned. Finally, we obtain the object node attention weights by feeding the result into a softmax layer.
2) The text node attention weights. For text nodes, we also employ a combination of multiple features (referred to as Multi-Feats) to enrich the OCR regions' representation: 1) a FastText feature generated from pre-trained FastText embeddings, 2) an appearance feature generated from the same Faster R-CNN detector as the object nodes, 3) a Pyramidal Histogram of Characters (PHOC) feature and 4) a bounding box feature. In addition to Multi-Feats, we also introduce a CNN feature (referred to as RecogCNN), extracted from a transformer-based text recognition network. The attention weights for text nodes are calculated under the guidance of the text-related question representation:
where the matrices are linear transformation parameters to be learned. Finally, we obtain the text node attention weights by feeding the result into a softmax layer.
3) The edge attention weights. The edge attention weights need to consider the relationship between two nodes. Because the calculation processes for the different edge types (oo, ot, tt, to) are similar, we only show how the oo edge weights are computed.
There are two main steps. Firstly, for each object node, we compute the attention weights over all the oo edges connected to it:
where an MLP is used for embedding the initial oo edge features, which are designed as the concatenation of the edge feature and the neighbouring node feature; two linear layers respectively map the oo-edge-related question representation and the embedded edge features into vectors of the same dimension. The attention weights are normalised over each node's neighbourhood via a softmax layer.
In the second step, we calculate oo edge attention weights over all object nodes:
where the weighted sum of the embedded edge features is considered as the question-conditioned oo edge feature w.r.t. each object node. We compute the ot, tt and to edge attention weights using the same equations as above, but with their own initial edge features, question representations and transformation parameters.
Weighting Module. The above graph attention modules output three attention weights for each object and text node, each guided by the corresponding question part. For each object node, we have the node weight and the oo and ot edge weights; similarly, for each text node, we have the node weight and the tt and to edge weights. Now we combine them with the question self-attention weights. For each object node, the final weight score is calculated as a weighted sum of the three parts:
where the mixing weights are the question self-attention weights obtained in Section 3.1. Similarly, the final weight for each text node is:
Note that the final weights remain normalised, since both the question self-attention weights and each set of node/edge attention weights sum to one. The final weights actually measure the relevance between object/text nodes and the question, and are used to generate question-conditioned object and text features:
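The weighting step amounts to a convex combination of the three attention distributions, mixed by the question self-attention weights; a minimal sketch (names are illustrative):

```python
import torch

def combine_weights(self_attn, node_attn, edge_attns):
    """Final per-node score: a mixture of the node attention and the
    two aggregated edge attentions, weighted by the question
    self-attention weights.

    self_attn:  (3,) weights over [node, edge-type-1, edge-type-2]
    node_attn:  (N,) softmax-normalised node attention
    edge_attns: two (N,) softmax-normalised edge attentions
    """
    parts = torch.stack([node_attn] + list(edge_attns))   # (3, N)
    return self_attn @ parts                              # (N,)
```

Because each input distribution sums to one and the self-attention weights sum to one, the combined weights also sum to one, as noted in the text.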
3.3 Global-Local Attentional Answering Module
Inspired by the transformer structure in M4C, we introduce a new global-local attentional answering module. The global graph features are not directly fused with the global question features. Instead, they are first fed into a transformer-style answering module along with the local OCR node embeddings to get updated. Specifically, the object-related and text-related question features are concatenated together:
These features are forwarded into the transformer layers and updated, during which the global features and local OCR features can freely attend to each other.
Then we fuse the updated features with their respective question representations as follows:
The equation for predicting the answer probabilities in the first timestep can be written as:
where a linear transformation is followed by a two-branch scoring function, which tackles the dilemma that answers in the TextVQA task can be dynamic texts changing across questions. Our answer space is the combination of two parts: a fixed dictionary and the dynamic out-of-vocabulary (OOV) OCR tokens extracted from each specific image. Accordingly, the two branches compute the respective scores: one branch is a simple linear layer that maps the input to a score vector over the fixed dictionary, and the other calculates the dot product between the input and each of the updated OCR embeddings. The separate scores of the two branches are concatenated and used to select the highest-scored result.
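A minimal sketch of the described two-branch scoring head, with hidden size and vocabulary size as free parameters (class and variable names are our own):

```python
import torch
import torch.nn as nn

class TwoBranchScorer(nn.Module):
    """Scores a decoder output against a fixed vocabulary (linear
    layer) and against the dynamic OCR tokens of the current image
    (dot product with their updated embeddings); the two score
    vectors are concatenated into one answer space."""
    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.vocab_proj = nn.Linear(hidden, vocab_size)

    def forward(self, z, ocr_emb):
        # z: (hidden,) fused decoder state; ocr_emb: (M, hidden)
        vocab_scores = self.vocab_proj(z)             # (V,) fixed dictionary
        ocr_scores = ocr_emb @ z                      # (M,) dynamic copy scores
        return torch.cat([vocab_scores, ocr_scores])  # (V + M,)
```

At inference time, `argmax` over the concatenated vector selects either a dictionary entry or one of the image's OCR tokens.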
While in the first timestep the concatenation of fused features is the input, in the remaining timesteps we use the updated embedding of the previous output as input and decode iteratively. If the previous output comes from OCR, its embedding before forwarding to the answering module is used; otherwise, the corresponding linear-layer weight of the general vocabulary serves as the embedding. We also add position embeddings and type embeddings to the decoding input, where the type embedding indicates whether the input token comes from the fixed vocabulary or from the OCR tokens.
In this module, we stack transformer layers such that the global features of question, object and text, along with the local OCR embeddings, cannot attend to decoding steps; each decoding step can attend to previous decoding steps in addition to the global and local embeddings. Considering that the answer may come from two sources, we use a multi-label sigmoid loss instead of softmax.
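The described attention pattern can be expressed as a boolean mask over the concatenated sequence of context features and decoding steps; a sketch of the mask construction (sizes are illustrative):

```python
import torch

def answering_attention_mask(num_ctx, num_dec):
    """Boolean mask (True = may attend) for the answering module:
    context positions (global + OCR features) attend only among
    themselves, while each decoding step attends to all context
    positions and, causally, to previous decoding steps."""
    n = num_ctx + num_dec
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_ctx, :num_ctx] = True                   # context <-> context
    mask[num_ctx:, :num_ctx] = True                   # decoder -> context
    dec = torch.tril(torch.ones(num_dec, num_dec, dtype=torch.bool))
    mask[num_ctx:, num_ctx:] = dec                    # causal among steps
    return mask
```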
We evaluate our model on two challenging benchmarks, TextVQA and all three tasks of ST-VQA, and achieve SoTA performance. We also manually labelled all the texts appearing in the TextVQA dataset, i.e., we provide ground truth for the OCR part.
4.1 Implementation Details
Same as M4C, the region-based appearance features of objects and OCR tokens are extracted from the fc6 layer that immediately follows the RoI-Pooling layer of a Faster R-CNN model. The model is pretrained on Visual Genome and we then fine-tune the fc7 layer on TextVQA. The number of object regions per image is capped. For text nodes, we run an independent Rosetta OCR system to recognise word strings, which has two versions: multi-language (Rosetta-ml) and English-only (Rosetta-en). We likewise cap the number of OCR tokens per image and generate rich OCR representations based on them; if any count is below its maximum, we zero-pad the rest. We cap the length of questions and encode them as feature sequences using the first three layers of a pretrained BERT, whose parameters are further fine-tuned during training. Our answering module uses 4 transformer layers with 12 attention heads; the other hyper-parameters are the same as BERT-BASE. The maximum number of decoding steps is set to 12.
We implement all models in PyTorch and experiment on NVIDIA GeForce 1080Ti GPUs. The learning rate is set uniformly for all layers except the three-layer BERT used for question encoding and the fc7 layer used for region feature encoding, which use a smaller learning rate. The learning rate is decayed in steps during training and the optimiser is Adam. We periodically compute the VQA accuracy metric on the validation set, based on which the best-performing model is selected. To gracefully capture errors in text recognition, the ST-VQA dataset adopts Average Normalized Levenshtein Similarity (ANLS) as its official evaluation metric; we also apply this metric to the ST-VQA dataset. All our experimental results are generated by submissions to the relevant online platforms.
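For reference, ANLS for a single prediction can be computed as below, following the metric's published definition with similarity threshold 0.5:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(pred, gts, tau=0.5):
    """Normalized Levenshtein Similarity of one prediction against its
    ground-truth answers: 1 - normalised edit distance to the closest
    ground truth, zeroed when the distance reaches the threshold tau.
    The dataset score is this value averaged over all questions."""
    best = 0.0
    for gt in gts:
        nl = levenshtein(pred.lower(), gt.lower()) / max(len(pred), len(gt), 1)
        best = max(best, 1 - nl if nl < tau else 0.0)
    return best
```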
4.2 Results and Analysis on TextVQA
The TextVQA dataset samples images from the OpenImages dataset. The questions are divided into train, validation and test splits, and each question-image pair has human-provided ground-truth answers.
| Method | Question enc. | OCR system | OCR features | Output module | Val acc. | Test acc. |
| SMA w/o dec. | - | Rosetta-ml | - | classifier | - | - |
| SMA w/o dec. | - | Rosetta-en | - | classifier | - | - |
| LoRRA (Rosetta-ml) | GloVe | Rosetta-ml | FastText | classifier | - | - |
| DCD ZJU (ensemble) | - | - | - | - | - | - |
| MSFT VTI | - | - | - | - | - | - |
| SMA w/o dec. | GloVe | Rosetta-en | Multi-feats | classifier | - | - |
| SMA | BERT | Rosetta-en | Multi-feats + RecogCNN | decoder | 40.05 | 40.66 |
Ablations on Relationship Attentions. We conduct an ablation study to investigate the key components of the proposed Question Conditioned Graph Attention Module, i.e., the four types (oo, ot, tt, to) of relationship attention. To focus on the reasoning ability, we evaluate without the rich OCR representation and the iterative answering module. The tested architecture variations and their results are shown in Table 1. The experimental results reveal that each of the four modelled relations improves accuracy. In particular, the to relation attention leads to the largest improvement, which is consistent with the observation that annotators tend to refer to a specific text by describing the object that the text is printed on. Overall, the relations that originate from text (to and tt) are more important than those originating from objects (oo and ot), which validates the key role of text in this text VQA task.
Ablations on Answering Modules. From the corresponding rows of Table 2, we find that our proposed generative answering module surpasses the discriminative classifier-based answering module by a large margin in validation accuracy, which shows that the ability to generate variable-length answers is of significant importance for TextVQA.
Ablations on Features for Question and OCR. The GloVe and BERT features are evaluated for encoding questions, and the latter performs better in validation accuracy (see Table 2). Adding the RecogCNN feature for OCR brings a further improvement, which validates that the RecogCNN feature is complementary to the FastText, Faster R-CNN, PHOC and BBox features packed in Multi-Feats. Note that RecogCNN is trained on a text recognition task while Faster R-CNN is trained for general object detection; FastText and PHOC are extracted from the recognised OCR character sequences, whereas RecogCNN is extracted from text image patches.
Comparison to Previous Work. We compare our method to LoRRA, an ensemble result of DCD, MSFT VTI and the newest SoTA model M4C, and surpass them all. Using the same question, object and OCR features, our single model outperforms M4C on the test set.
Results with Ground Truth OCR. We provide ground-truth OCR annotations for the TextVQA train and validation sets, as they give researchers a fair test base to focus on the text-visual reasoning part without additionally tuning the OCR model. We ask Amazon Mechanical Turk (AMT) workers to annotate all the texts appearing in the TextVQA dataset in order to completely peel off the impact of OCR and investigate the real reasoning ability. We evaluate the performance of LoRRA and SMA using the ground-truth OCR; the results are shown in Table 4. Both models improve by a large margin on the validation set when Rosetta-en results are replaced with the ground truth, and the larger increase of SMA demonstrates the better reasoning and answering ability of our model. Rosetta OCR UB is the upper bound accuracy attainable if the answer can be built directly from OCR tokens and is always predicted correctly (considering combinations of OCR tokens up to a fixed n-gram length). With GT OCR, the upper bound on validation is further raised, calculated in the same way as the Rosetta OCR UB. However, there is still a large gap between human performance and SMA, which leaves great potential to unlock.
Visualization. For each type of relation, we visualise those with the highest attention weights and their corresponding decomposed question attention, in order to explore their contributions to answer prediction and give better insight into our model (see Figure 5). In the first example, there are several bikes, among which the question asks about the right one; locating the requested bike needs oo relationship reasoning, and the to relationship is also needed since the number on the bike is exactly what we have to figure out. In the second example, we need the ot relationship to locate the player with the given number; the to relationship is then employed to reason about the last name of this player. Similarly, in the last example, two different oo relationships are extracted to pinpoint the player on the right with blue hair, and the ot relationship is then used to get the player's number. All the examples validate the relationship reasoning ability of our model.
4.3 TextVQA Challenge 2020
Our model participated in the TextVQA Challenge 2020 with a slight change: the OCR embeddings in Figure 4 are replaced by the OCR features updated by the graph attention module. In our final model, a Sequential-free Box Discretization (SBD) model is first used for scene text detection and a robust transformer-based network is employed for word recognition (denoted "SBD-Trans OCR"), which achieves a better OCR result and leads to an improvement on the TextVQA task. The SBD model is pretrained on a dataset consisting of images from the LSVT training set, the MLT 2019 training set, and ArT (which contains all the images of SCUT-CTW1500 and Total-Text [9, 10]), with the remaining images selected from RCTW-17, ICDAR 2013, ICDAR 2015, MSRA-TD500, COCO-Text and USTB-SV1K. The model was finally finetuned on the MLT 2019 training set. The robust transformer-based network is trained on the following datasets: IIIT 5K-Words, Street View Text, ICDAR 2013, ICDAR 2015, Street View Text Perspective, CUTE80 and ArT. Finally, we experiment with the ST-VQA dataset as additional training data together with SBD-Trans OCR, achieving 45.51% final test accuracy, a new SoTA on the TextVQA dataset.
4.4 Evaluation on the ST-VQA dataset
The ST-VQA dataset comprises natural images with question-answer pairs and three VQA tasks: strongly contextualised, weakly contextualised and open vocabulary. For the strongly contextualised task, the authors provide a small dictionary per image; in the weakly contextualised task, they provide a single larger dictionary for all images; and for the open vocabulary task, no candidate answers are provided. As the ST-VQA dataset does not have an official split for training and validation, we follow M4C and randomly hold out a subset of images as our validation set, training on the rest.
For the first and second tasks, a single-step version of our model (SMA w/o dec.) is used, while for the third task we use the proposed full SMA model. Compared with methods on the leaderboard, we set a new SoTA for all three tasks (see Table 4).
We introduce Structured Multimodal Attentions (SMA), a novel model architecture for answering questions based on the texts in images, which sets new state-of-the-art performance on the TextVQA and ST-VQA datasets. SMA is composed of three key modules: a Question Self-Attention Module that guides a Graph Attention Module to learn the node and edge attention, and a final Answering Module that combines the attention weights and question-guided features of the Graph Attention Module to iteratively yield a reasonable answer. A human-annotated ground-truth OCR set for TextVQA is also provided to set up a new upper bound and to help the community evaluate the real text-visual reasoning ability of different models, without suffering from poor OCR accuracy.
-  (2014) Word spotting and recognition with embedded attributes. IEEE transactions on pattern analysis and machine intelligence 36 (12), pp. 2552–2566. Cited by: §3.2.
Bottom-up and top-down attention for image captioning andvisual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.1.
-  (2016) Neural module networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 39–48. Cited by: §3.1.
-  (2015) Vqa: visual question answering. In Proc. IEEE Int. Conf. Comp. Vis., pp. 2425–2433. Cited by: §1, §2.1.
-  (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
-  (2019) Icdar 2019 competition on scene text visual question answering. arXiv preprint arXiv:1907.00490. Cited by: Table 4.
-  (2019) Scene text visual question answering. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: §1, §2.1, §4.1, §4.3, §4.4, Table 4, §4.
-  (2018) Rosetta: large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79. Cited by: §4.1.
-  (2017) Total-text: a comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 935–942. Cited by: §4.3.
-  (2020) Total-text: toward orientation robustness in scene text detection. International Journal on Document Analysis and Recognition (IJDAR) 23 (1), pp. 31–52. Cited by: §4.3.
-  (2019) ICDAR 2019 robust reading challenge on arbitrary-shaped text (RRC-ArT). arXiv preprint arXiv:1909.07145. Cited by: §4.3.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Cited by: §3.1, §4.1.
-  (2018) Motion-appearance co-memory networks for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6576–6585. Cited by: §1.
-  (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913. Cited by: §4.1.
-  (2018) VizWiz grand challenge: answering visual questions from blind people. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1, §2.1.
-  (2019) Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. arXiv preprint arXiv:1911.06258. Cited by: §2.1, §3.2, §3.3, §4.1, §4.2, §4.4, Table 2, Table 4.
-  (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2901–2910. Cited by: §2.1.
-  (2017) Bag of tricks for efficient text classification. In Proc. Conf. of European Chapter of Association for Computational Linguistics, Cited by: §3.2.
-  (2018) DVQA: understanding data visualizations via question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.1.
-  (2018) FigureQA: An annotated figure dataset for visual reasoning. In ICLR workshop track, Cited by: §2.1, §2.1.
-  (2015) ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160. Cited by: §4.3.
-  (2013) ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1484–1493. Cited by: §4.3.
-  (2017) Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.1.
-  (2017) OpenImages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2, pp. 3. Cited by: §4.2.
-  (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123 (1), pp. 32–73. Cited by: §4.1.
-  DCD_ZJU, TextVQA challenge 2019 winner. Note: https://visualqa.org/workshop.html Cited by: §4.2, Table 2.
-  (2019) Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition 90, pp. 337–345. Cited by: §4.3.
-  (2019) Omnidirectional scene text detection with sequential-free box discretization. In Proc. Int. Joint Conf. Artificial Intell., pp. 3052–3058. Cited by: §4.3.
-  (2012) Scene text recognition using higher order language priors. In Proc. British Machine Vision Conf., Cited by: §4.3.
-  (2018) Out of the box: reasoning with graph convolution nets for factual visual question answering. In Proc. Advances in Neural Inf. Process. Syst., pp. 2654–2665. Cited by: §2.2.
-  (2019) ICDAR 2019 robust reading challenge on multi-lingual scene text detection and recognition (RRC-MLT-2019). arXiv preprint arXiv:1907.00945. Cited by: §4.3.
-  (2019) MUREL: multimodal relational reasoning for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.2.
-  (2018) Learning conditioned graph structures for interpretable visual question answering. In Proc. Advances in Neural Inf. Process. Syst., pp. 8344–8353. Cited by: §2.2.
-  (2013) Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 569–576. Cited by: §4.3.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §4.1.
-  (2014) A robust arbitrary text detection system for natural scene images. Expert Systems with Applications 41 (18), pp. 8027–8048. Cited by: §4.3.
-  (2017) ICDAR 2017 competition on reading Chinese text in the wild (RCTW-17). In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1429–1434. Cited by: §4.3.
-  (2019) Towards vqa models that can read. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1, §1, §2.1, §4.1, §4.2, §4.2, Table 2, Table 4, §4.
-  MSFT_VTI, TextVQA challenge 2019 top entry (post-challenge). Note: https://evalai.cloudcv.org/web/challenges/challenge-page/244/ Cited by: §4.2, Table 2.
-  (2019) ICDAR 2019 competition on large-scale street view text with partial labeling (RRC-LSVT). arXiv preprint arXiv:1909.07741. Cited by: §4.3.
-  (2017) Graph-structured representations for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3233–3241. Cited by: §2.2.
-  (2016) COCO-Text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140. Cited by: §4.3.
-  (2011) End-to-end scene text recognition. In 2011 International Conference on Computer Vision, pp. 1457–1464. Cited by: §4.3.
-  (2018) FVQA: fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell. 40 (10), pp. 2413–2427. Cited by: §2.1, §2.2.
-  (2019) A simple and robust convolutional-attention network for irregular text recognition. arXiv:1904.01375. Cited by: §3.2, §4.3.
-  (2014) Memory networks. arXiv preprint arXiv:1410.3916. Cited by: §1.
-  (2016) Stacked attention networks for image question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 21–29. Cited by: §2.1.
-  (2012) Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1083–1090. Cited by: §4.3.
-  (2015) Multi-orientation scene text detection with adaptive clustering. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1930–1937. Cited by: §4.3.
-  (2018) MAttNet: modular attention network for referring expression comprehension. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1307–1315. Cited by: §3.1.
-  (2019) Deep modular co-attention networks for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6281–6290. Cited by: §1.