Structured Multimodal Attentions for TextVQA

06/01/2020 ∙ by Chenyu Gao, et al. ∙ The University of Adelaide South China University of Technology International Student Union 12

Text based Visual Question Answering (TextVQA) is a recently raised challenge that requires a machine to read text in images and answer natural language questions by jointly reasoning over the question, Optical Character Recognition (OCR) tokens and visual content. Most state-of-the-art (SoTA) VQA methods fail to answer these questions because of i) poor text reading ability; ii) a lack of text-visual reasoning capacity; and iii) the adoption of a discriminative answering mechanism instead of a generative one, which makes it hard to cover both OCR tokens and general text tokens in the final answer. In this paper, we propose a structured multimodal attention (SMA) neural network to solve the above issues. Our SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then applies a multimodal graph attention network to reason over it. Finally, the outputs of the above module are processed by a global-local attentional answering module that iteratively produces an answer covering tokens from both OCR and general text. Our proposed model outperforms the SoTA models on the TextVQA dataset and on all three tasks of the ST-VQA dataset. To provide an upper bound for our method and a fair testing base for future work, we also provide human-annotated ground-truth OCR annotations for the TextVQA dataset, which were not given in the original release.






1 Introduction

Visual Question Answering (VQA) [4] has shown great progress thanks to the development of deep neural networks. However, recent studies [7, 15, 38] show that most VQA models fail on questions that require understanding the text in the image. VizWiz [15] first identified this problem and found that nearly a quarter of the questions asked by visually-impaired people are related to text reading. Singh et al. [38] systematically studied this problem and introduced a novel dataset, TextVQA, which only contains questions that require the model to read and reason about the text in the image.

Figure 1: (Left) Three modules work in concert to answer this question. Question Self-Attention Module decomposes the question into guiding signals that guide Graph Attention Module to reason over a graph. Besides, they also join the global-local attentional answering module to generate an answer. (Right) As for the heterogeneous graph constructed in question-conditioned graph attention module, we illustrate objects in yellow and OCR tokens in red. While unbroken-line boxes represent nodes most relevant to the question, dashed-line ones are other nodes.

Three key abilities are required to tackle the TextVQA problem: reading, reasoning and answering. These are also the main reasons why state-of-the-art (SoTA) VQA models fail on this task. The reading ability relies on Optical Character Recognition (OCR) techniques, a long-standing sub-field of computer vision, to accurately recognise the text appearing in the image, and the reasoning ability requires a model to jointly reason over the visual content and OCR text in the image. The SoTA VQA models [13, 51] may gain strong reasoning abilities on visual content and natural language questions via sophisticated mechanisms such as attention [5] and memory networks [46], but none of them can read the “text” in images accurately, let alone reason over it. Although LoRRA [38], the method provided with TextVQA, is equipped with an OCR model to read text, its results are not outstanding due to a lack of deep reasoning between text and visual content. As to the answering aspect, almost all SoTA VQA models use a discriminative answering model because it is easy to optimise and leads to better performance on traditional VQA datasets. However, the answer in TextVQA is normally a combination of OCR tokens detected in the image and general text tokens, so the answer vocabulary is not fixed. A discriminative answering model may therefore limit the output variety.

Figure 1 shows an example from TextVQA that involves several types of relationships. For instance, “the front of shirt” and “player’s shirt” are object-object links; “word printed on the front of the player’s shirt” is a text-object bond; and the “word above the number 12…” is a text-text relation. In this paper, to enhance the relationship reasoning ability, we introduce an SMA model that reasons over a graph with multiple types of relationships. Specifically, a question self-attention module first decomposes questions into six sub-components that indicate objects, object-object relations, object-text relations, texts, text-text relations and text-object relations. A role-aware graph is then constructed with objects/texts as nodes. The connections between nodes are decided by their relative distance. The graph is then updated by a question-conditioned graph attention module. In this model, instead of using the whole question to guide the graph updating, only the relevant type of question component extracted by the question self-attention module is used to update the corresponding graph component. For example, the object-related question feature is used for object nodes and the object-text related question feature is used only for updating object-text edges. Finally, to solve the aforementioned answering issue, we propose a global-local attentional module that produces variable-length answers in a generative way. The summarised global features of question, object and text, together with local OCR embeddings, are fed into our proposed module to iteratively select answer words from a fixed answer vocabulary or the OCR tokens.

Our proposed SMA model outperforms SoTA TextVQA models, but we find that the results are still far from satisfactory and that there is a big gap between machines and humans. To study whether this performance gap is caused by the “reading” part or the “reasoning” part, we investigate how much the TextVQA accuracy is affected by the OCR performance when a fixed reasoning model is used. To completely remove the impact of OCR and investigate the real reasoning ability, we ask AMT workers to annotate all the text appearing in the TextVQA dataset, which yields ground-truth OCR annotations. These annotations were not given in the original TextVQA and we will release them to the community for a fair comparison. We also report the performance of LoRRA and our best model when given the ground-truth OCR, in order to test solely the reasoning ability of the models. A new upper bound is also given by using the ground-truth OCR annotations.

In summary, our contributions are threefold:

  1. We propose a structured multimodal attentional (SMA) model that can effectively reason over structural text-object graphs and produce answers in a generative way. Thanks to the adopted graph reasoning strategy, the proposed model achieves better interpretability.

  2. We study the contribution of OCR in the TextVQA problem and provide human-annotated ground-truth OCR labels to complete the original TextVQA dataset. This allows the community to evaluate the reasoning ability of their models in isolation, under a perfect reading situation.

  3. Our SMA model outperforms existing state-of-the-art TextVQA models on both the TextVQA dataset and the ST-VQA dataset.


In the remainder of this paper, matrices are denoted by bold capital letters and column vectors are denoted by bold lower-case letters.

represents element-wise product. refers to concatenation.

2 Related Work

2.1 Text based VQA

Straddling the fields of computer vision and natural language processing, Visual Question Answering (VQA) has attracted increasing interest since the release of a large-scale VQA dataset [4]. A large number of methods and datasets have been proposed: VQA datasets such as CLEVR [17] and FigureQA [20] were introduced to study visual reasoning without the consideration of OCR; Wang et al. [44] introduced a dataset that explicitly requires external knowledge to answer a question.

Reading and reasoning over text in an image are of great value for visual understanding, since text contains rich semantic information, which is the main concern of VQA. Several datasets and baseline methods have been introduced in recent years to study joint reasoning over visual and text content. For example, Textbook QA [23] asks multimodal questions given text, diagrams and images from middle school textbooks. FigureQA [20] requires answering questions based on synthetic scientific-style figures like line plots, bar graphs or pie charts. DVQA [19] assesses bar-chart understanding ability in a VQA framework. In these datasets, texts are machine-printed and appear in standard fonts with good quality, which alleviates the challenge of text recognition. VizWiz [15] is the first dataset that requires text information for question answering, given images captured in natural scenes. Nevertheless, of the questions are “unanswerable” because of the poor image quality, which makes the dataset inappropriate for training an effective VQA model and studying the problem systematically.

Most recently, TextVQA [38] and ST-VQA [7] were proposed concurrently to highlight the importance of reading text from natural scene images in the VQA process. LoRRA, proposed together with TextVQA, uses a simple Updn [2] attention framework on both image objects and OCR text to infer answers. The model was then improved by using a BERT based word embedding and a Multimodal Factorized High-order pooling based feature fusion method, and won the TextVQA challenge. Compared to TextVQA, where any question is allowed as long as text reading is required, all questions in ST-VQA can be answered unambiguously and directly by text in the images. Stacked Attention Network (SAN) [47] is adopted in ST-VQA as a baseline, by simply concatenating text features with image features for answer classification. The answering modules in previous models such as LoRRA [38] encounter two bottlenecks. One serious setback is that they view the dynamic OCR space as invariant indexes, and the other is the inability to generate long answers with more than one word. M4C [16] was the first to tackle both problems, by using a transformer decoder and a dynamic pointer network. In this work, we focus on explicitly modeling relationships between objects, texts and object-text pairs, and achieve better performance and interpretability than previous TextVQA approaches.

2.2 Graph Networks in Vision and Language

Graph networks have received a lot of attention due to their expressive power in structural feature learning. They can not only capture the node features themselves, but also encode the neighbourhood properties between nodes in graphs, which is essential for VQA and other vision-and-language tasks that need to incorporate structures in both spatial and semantic information. For instance, Teney et al. [41] construct graphs over image scene objects and over question words respectively to exploit the structural information in these representations. The model shows significant improvements in general VQA tasks. Narasimhan et al. [30] perform finer relation exploration for the factual-VQA task [44] by taking into account a list of facts via Graph Convolution Networks (GCN) for correct answer selection. The work in [33] learns a question-specific graph representation for the input image in VQA, capturing object interactions with the relevant neighbours via spatial graph convolutions. MUREL [32] goes one step further and models spatial-semantic pairwise relations between all pairs of regions for relation reasoning, in addition to a rich vectorial representation for the interaction between a region’s visual content and the question.

Our work also uses a graph as the representation. However, unlike previous methods that use a fully-connected graph to connect all the objects, our task needs to take into account both visual elements and text information from the image, which are essentially heterogeneous. We therefore construct a role-aware graph that considers the different roles of nodes (such as object and text) and edges (“object-object”, “text-text” and “object-text”), which results in a much better cross-modality feature representation for answer inference.

3 Method

In this section we introduce our Structured Multimodal Attentions (SMA) model. At a high level, SMA is composed of three modules (as shown in Figure 1): (1) a question self-attention module that decomposes questions into six sub-components w.r.t. different roles in our constructed object-text graph; (2) a question conditioned graph attention module that reasons over the graph under the guidance of the above question representations and infers the importance of different nodes as well as their relationships; and (3) a global-local attentional answering module that can generate answers by stitching multiple words together. We detail each module in the following sections.

Figure 2: An overview of the Question Self-Attention Module. Given the word sequence of a question as input, we obtain two kinds of outputs: question self-attention weights, which act as prior probabilities over the whole graph, and fine-grained decomposed question features for the corresponding nodes or edges.

3.1 Question Self-Attention Module

Since a question includes not only information about object and text nodes, but also four categories of relationships between them (object-object, object-text, text-text and text-object), our question self-attention module (see Figure 2) divides the question into six sub-components. Although this is inspired by [3, 50], our modules are more fine-grained and are designed for the TextVQA task.

Given a question with words , we first embed the words into a feature sequence using pre-trained BERT [12] to obtain . Next, six individual two-layer MLPs followed by softmax layers are applied on to generate six sets of attention weights over words, i.e., , , , , and . These weights are further used to calculate six weighted sums of : , , , , , , which are considered as question representations decomposed w.r.t. object nodes, object-object (oo) edges, object-text (ot) edges, text nodes, text-text (tt) edges and text-object (to) edges. Taking and as example, the computation is performed as follows:


These decomposed question features are used as guiding signals when performing question conditioned graph attention in Section 3.2.
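The decomposition described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the feature sizes, the random weights, the ReLU activation inside the MLP and all names (`make_mlp`, `decompose_question`, `ROLES`) are assumptions made for illustration. Only the overall recipe follows the text: one two-layer MLP per role scores each word, a softmax turns the scores into attention weights, and the weights pool the word features into one vector per role.

```python
import math
import random

ROLES = ["o", "oo", "ot", "t", "tt", "to"]  # the six roles named in the text


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_mlp(in_dim, hidden_dim, rng):
    """Random two-layer MLP mapping a word feature to a scalar score.

    The ReLU non-linearity is an assumption; the text only says "two-layer MLP".
    """
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]

    def mlp(x):
        hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
        return sum(w * h for w, h in zip(w2, hidden))

    return mlp


def decompose_question(word_feats, mlps):
    """Return {role: (attention weights over words, pooled question vector)}."""
    dim = len(word_feats[0])
    out = {}
    for role in ROLES:
        weights = softmax([mlps[role](x) for x in word_feats])
        pooled = [sum(w * x[d] for w, x in zip(weights, word_feats)) for d in range(dim)]
        out[role] = (weights, pooled)
    return out


rng = random.Random(0)
# Stand-in for BERT word features: 5 words, 8 dimensions each (toy sizes).
word_feats = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(5)]
mlps = {role: make_mlp(8, 16, rng) for role in ROLES}
decomposed = decompose_question(word_feats, mlps)
```

Each role gets its own MLP, so the same word can receive a high weight for one role (for example, text nodes) and a low weight for another.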

We also learn two sets of self-attention weights over the decomposed sub-components, i.e., and , where each is a scalar. They are calculated as below:


where , , and . To some extent, these weights play a role of prior probability as they can be calculated with questions only. The two sets of question self-attention weights will be used to generate question-conditioned object and text features, and , respectively (see Section 3.3).

3.2 Question Conditioned Graph Attention Module

The question conditioned graph attention module is the core of our network, which generates a heterogeneous graph over both objects and texts of an image and reasons over it.

Role-aware Heterogeneous Graph Construction. ‘Role’ denotes the type of a node. We construct a role-aware heterogeneous graph over object nodes and text nodes of an image , where is the set of object nodes, is the set of text nodes and is the edge set. In our graph, an edge denotes the relationship between two particular nodes and each node can be connected to object nodes plus text nodes. Because the nodes and edges in our graph have different roles, we call it a heterogeneous graph. ‘Role-awareness’ means that we explicitly use the role information of each node to construct the graph. We can further divide the edges into four sets according to their roles: for oo edges, for ot edges, for tt edges and for to edges. Here we showcase how is constructed. For an object node , we rank the remaining objects in the order of their spatial distances to and define the neighborhood as the top ranked object nodes.

We build the edge representation between two nodes based on their relative spatial relationship. Here we build an oo edge as an example. Suppose the center coordinate, width and height of a node are represented as , and the top-left coordinate, bottom-right coordinate, width and height of another node are represented as , then the associated edge representation is defined as .
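The neighbourhood construction and edge representation can be sketched as follows. This is an illustration under assumptions, not the paper's exact definition: the distance is taken between box centers, and the edge feature shown (corner offsets of the neighbour relative to the source center, normalised by the source size, plus an area ratio) is one plausible encoding of the "relative spatial relationship". The function names and the `(x1, y1, x2, y2)` box layout are hypothetical.

```python
import math


def center(box):
    """Center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def knn_neighbors(boxes, i, k):
    """Indices of the k boxes closest to box i (center distance, assumed)."""
    ci = center(boxes[i])
    others = [j for j in range(len(boxes)) if j != i]
    others.sort(key=lambda j: math.dist(ci, center(boxes[j])))
    return others[:k]


def edge_feature(box_i, box_j):
    """Assumed relative-geometry encoding of the edge from node i to node j."""
    cx, cy = center(box_i)
    wi, hi = box_i[2] - box_i[0], box_i[3] - box_i[1]
    wj, hj = box_j[2] - box_j[0], box_j[3] - box_j[1]
    return [
        (box_j[0] - cx) / wi,   # top-left x of j relative to center of i
        (box_j[1] - cy) / hi,   # top-left y of j relative to center of i
        (box_j[2] - cx) / wi,   # bottom-right x of j relative to center of i
        (box_j[3] - cy) / hi,   # bottom-right y of j relative to center of i
        (wj * hj) / (wi * hi),  # area ratio
    ]


boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (50, 50, 60, 60), (0, 12, 10, 22)]
neighbors_of_0 = knn_neighbors(boxes, 0, k=2)
feat_0_to_1 = edge_feature(boxes[0], boxes[1])
```

The same two functions can be reused for the other three edge types by ranking text boxes around an object node, or object boxes around a text node.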

Figure 3: An overview of Question Conditioned Graph Attention Module. This module builds a heterogeneous graph whose mixed nodes are shown in different colors.

Question Conditioned Graph Attention. We use the decomposed question features from Section 3.1 to reason over the role-aware graph constructed above. We formulate the reasoning process as an attention mechanism. Instead of applying global attention weights computed from a single question feature, we update different parts of the graph with different question features according to their roles. For example, the object-related question representation is used to guide the attention weights over object nodes, and is used to guide the text-object edge attention weights. Considering that there are six roles in the graph, we compute the attention weights respectively for object nodes (), text nodes (), object-object edges (), object-text edges (), text-text edges () and text-object edges (). The mechanism can be formulated as:


where is the attention mechanism to generally compute attention weights using question features and specific nodes/edges in graph, that we will introduce in the next section, and . and represent features extracted from isolated object and text regions respectively, which are then fed into the graph attention module to generate question-conditioned features.

1) The object node attention weights. An object node is represented by its D appearance feature from a Faster R-CNN detector and D bounding box feature with the object’s relative bounding box coordinates , where and represent the width and height of the image. Given the appearance features , and bounding box feature of an object, the attention weights for object nodes are calculated under the guidance of :


where is layer normalization; , , , and are linear transformation parameters to be learned. Finally, we obtain the object node attention weights by feeding into a softmax layer.

2) The text node attention weights. For text nodes, we also employ a combination of multiple features (referred to as Multi-Feats) to enrich the representation of OCR regions, as in [16]: 1) a D FastText feature generated from pre-trained FastText [18] embeddings, 2) a D appearance feature generated from the same Faster R-CNN detector as for object nodes, 3) a D Pyramidal Histogram of Characters (PHOC) [1] feature and 4) a D bounding box feature . In addition to Multi-Feats, we also introduce a D CNN feature (referred to as RecogCNN), which is extracted from a transformer-based text recognition network [45]. The attention weights for text nodes are calculated under the guidance of :


where , , , , , , and are linear transformation parameters to be learned. Finally, we obtain the text node attention weights by feeding into a softmax layer.

3) The edge attention weights. The edge attention weights need to consider the relationship between two nodes. Because the calculation process of attention weights for the different edge types , , and is similar, we only show how is computed.

There are mainly two steps. Firstly, for each node , we compute the attention weights over all the oo edges connected to :


where is an MLP used for embedding the initial oo edge features that are designed to be the concatenation of the edge feature and the neighbor node feature ; and respectively map the oo edge related question representation and the embedded edge features into vectors of the same dimension. The attention weights are normalized over ’s neighborhood via a softmax layer.

In the second step, we calculate oo edge attention weights over all object nodes:


where is considered as the question-conditioned oo edge feature w.r.t. object node . We compute , and using the same above equations, but with individual initial edge features, question representations and transformation parameters.
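The two-step edge attention can be sketched as below. The scoring function is an assumption: a plain dot product between the question vector and each (already embedded) edge input stands in for the learned MLP and linear maps described in the text, and the names `edge_attention` and `edge_inputs` are hypothetical. What the sketch preserves is the structure: a softmax over each node's own neighbourhood, pooling into one question-conditioned edge feature per node, and then a softmax over all nodes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def edge_attention(question_vec, edge_inputs):
    """Two-step attention over edges.

    edge_inputs[i] holds one vector per edge leaving node i; each vector is
    assumed to be the embedded concatenation of the edge feature and the
    neighbour node feature, already projected to the question dimension.
    """
    pooled = []
    for edges in edge_inputs:
        # Step 1: normalise over the neighbourhood of node i.
        local = softmax([dot(question_vec, e) for e in edges])
        dim = len(edges[0])
        pooled.append([sum(w * e[d] for w, e in zip(local, edges)) for d in range(dim)])
    # Step 2: normalise over all nodes using the pooled edge features.
    node_weights = softmax([dot(question_vec, p) for p in pooled])
    return node_weights, pooled


question_vec = [1.0, 0.0]
edge_inputs = [
    [[2.0, 0.0], [0.0, 1.0]],   # node 0: one edge strongly matches the question
    [[0.0, 1.0], [0.0, 2.0]],   # node 1: edges orthogonal to the question
    [[-1.0, 0.0], [0.0, 0.0]],  # node 2: one edge opposes the question
]
edge_weights, pooled_edges = edge_attention(question_vec, edge_inputs)
```

In this toy input, node 0 ends up with the largest edge attention weight because one of its edges aligns with the question vector.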

Figure 4: An overview of Global-Local Attentional Answering Module.

Weighting Module. The above graph attention modules output three attention weights for each object and text node, via the corresponding question part as the guidance. For each object node , we have , and . Similarly, for each text node , we have , and . Now we combine them together with the question self-attention weights. For each object node, the final weight score is calculated as a weighted sum of three parts:


where are obtained in Section 3.1. Similarly, the final weight for each text node is:


Note that , as we have , , and . Likewise, we also have . The weights and actually measure the relevance between object/text nodes and the question, and are used to generate question-conditioned object and text features:
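A small worked example may help to see why the combined node weights still sum to one. The sketch assumes three attention distributions over the same set of nodes and three question self-attention weights that sum to one, as stated in the text; the concrete numbers and the names `combine_weights` and `pool_features` are made up for illustration.

```python
def combine_weights(prior, attentions):
    """Weighted sum of several attention distributions over the same nodes.

    prior: one scalar per distribution, assumed to sum to 1.
    attentions: list of distributions, each summing to 1 over the nodes.
    """
    num_nodes = len(attentions[0])
    return [sum(w * a[i] for w, a in zip(prior, attentions)) for i in range(num_nodes)]


def pool_features(weights, feats):
    """Question-conditioned feature: weighted sum of node features."""
    dim = len(feats[0])
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]


# Toy numbers: question self-attention weights for the node, oo-edge and
# ot-edge parts, followed by the three attention distributions over 3 objects.
prior = [0.5, 0.3, 0.2]
attentions = [
    [0.7, 0.2, 0.1],  # node attention
    [0.1, 0.8, 0.1],  # oo edge attention
    [0.3, 0.3, 0.4],  # ot edge attention
]
final_weights = combine_weights(prior, attentions)

object_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled_object = pool_features(final_weights, object_feats)
```

Here the final weights are 0.44, 0.40 and 0.16. They sum to one because a convex combination of distributions is itself a distribution.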


3.3 Global-Local Attentional Answering Module

Inspired by the transformer structure in M4C [16], we introduce a new global-local attentional answering module here. The global graph features and are not directly fused with global question features , , , , , . Instead, they are first fed into a transformer-style answering module along with local OCR node embeddings to get updated. Specifically, object-related and text-related question features are concatenated together:


, , , are forwarded into transformer layers and updated as , , , , during which these global features and local OCR features can freely attend to each other.

Then we fuse updated features and with their respective question representations as follows:


The equation for predicting the answer probabilities in the first timestep can be written as:


where is a linear transformation and is a two-branch scoring function, which tackles the dilemma that answers in the TextVQA task can be dynamic texts that change from question to question. Our answer space is a combination of two parts: a fixed dictionary consisting of entries and the dynamic out-of-vocabulary (OOV) OCR tokens extracted from each specific image. Accordingly, two branches compute their respective scores. One branch is a simple linear layer that maps the input to a D score vector, and the other branch calculates the dot product between the input and each of the updated OCR embeddings. The scores of the two branches are concatenated and the highest-scoring entry is selected.
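The two-branch scoring can be sketched in a few lines. The sizes, weights and token lists below are toy values, and the function names are hypothetical; the sketch only mirrors the description: a linear layer scores the fixed vocabulary, dot products score the OCR tokens of the current image, and the concatenated scores are searched for the maximum.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def two_branch_scores(decoder_out, vocab_weight, vocab_bias, ocr_embs):
    """Concatenate fixed-vocabulary scores with per-image OCR scores."""
    vocab_scores = [dot(row, decoder_out) + b for row, b in zip(vocab_weight, vocab_bias)]
    ocr_scores = [dot(emb, decoder_out) for emb in ocr_embs]
    return vocab_scores + ocr_scores


def pick_token(scores, vocab, ocr_tokens):
    """Map the arg-max index back to a vocabulary word or an OCR token."""
    idx = max(range(len(scores)), key=scores.__getitem__)
    if idx < len(vocab):
        return ("vocab", vocab[idx])
    return ("ocr", ocr_tokens[idx - len(vocab)])


vocab = ["yes", "no"]
vocab_weight = [[0.2, 0.1], [0.1, 0.9]]  # one row per vocabulary entry
vocab_bias = [0.0, 0.0]
ocr_tokens = ["coca", "cola"]            # tokens read from this image
ocr_embs = [[0.9, 0.3], [0.4, 0.8]]      # their updated embeddings

decoder_out = [1.0, 0.0]
scores = two_branch_scores(decoder_out, vocab_weight, vocab_bias, ocr_embs)
choice = pick_token(scores, vocab, ocr_tokens)
```

Because the OCR branch is a dot product against whatever embeddings the image provides, the size of the answer space changes from image to image without any change to the model parameters.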

In the first timestep, the concatenation of the fused features is used as input; in the remaining timesteps, we use the updated embedding of the previous output as input to decode iteratively.


where is the output of the answering module when given the previous output embeddings as input. If the previous output comes from OCR, is the OCR embedding before it is forwarded to the answering module. Otherwise, the corresponding linear layer weight of the general vocabulary becomes . We also add position embeddings and type embeddings to the decoding input, where the type embeddings indicate whether the input is a fixed vocabulary word or an OCR token.

In this module, we adopt transformer layers. Global features of question, object and text, along with local OCR embeddings, cannot attend to decoding steps. Decoding steps can only attend to previous decoding steps, besides the global and local embeddings. Considering that the answer may come from two sources, we use multi-label sigmoid loss instead of softmax.
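The attention pattern described above can be written down as a boolean mask. The sketch assumes that the sequence is laid out as context entries (global features and local OCR embeddings) followed by decoding steps, and that a decoding step may also attend to itself, which is the usual causal-mask convention but is not stated explicitly in the text. The function name is hypothetical.

```python
def build_attention_mask(num_context, num_steps):
    """mask[i][j] is True when position i may attend to position j.

    Positions 0..num_context-1 are global/local entries; the remaining
    positions are decoding steps in order.
    """
    size = num_context + num_steps
    mask = [[False] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            if j < num_context:
                # Every position can see the global and local entries.
                mask[i][j] = True
            elif i >= num_context and j <= i:
                # Decoding steps see earlier steps (and themselves, assumed).
                mask[i][j] = True
    return mask


mask = build_attention_mask(num_context=3, num_steps=2)
```

With three context entries and two decoding steps, the context rows are blind to columns 3 and 4, the first decoding step sees columns 0 to 3, and the second sees all five.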

4 Experiments

We evaluate our model on two challenging benchmarks, TextVQA [38] and all three tasks of ST-VQA [7], and achieve SoTA performance. We also manually labelled all the text appearing in the TextVQA dataset, i.e., we provide the ground truth of the OCR part.

4.1 Implementation Details

As in M4C [16], the region-based appearance features of objects and OCR tokens are extracted from the fc6 layer, which immediately follows the RoI-Pooling layer of a Faster R-CNN [35] model. The model is pretrained on Visual Genome [25] and we then fine-tune the fc7 layer on TextVQA [38]. The maximum number of object regions is . For text nodes, we run an independent Rosetta OCR system [8] to recognize word strings, which has two versions: multi-language (Rosetta-ml) and English-only (Rosetta-en). We recognise at most OCR tokens in an image and generate rich OCR representations based on them. If any of the above is below the maximum, we apply zero padding to the rest. We set the maximum length of questions to and encode them as D feature sequences by the first three layers of a pretrained BERT [12], whose parameters are further fine-tuned during training. Our answering module uses 4 layers of transformers with 12 attention heads. The other hyper-parameters are the same as in BERT-BASE [12]. The maximum number of decoding steps is set to 12.

We implement all the models in PyTorch and experiment on NVIDIA GeForce 1080Ti GPUs with a batch size of . The learning rate is set to for all layers except for the three-layer BERT used for question encoding and the fc7 layer used for region feature encoding, which have a learning rate of . We multiply the learning rate by at the and iterations and the optimiser is Adam. At every iterations we compute a VQA accuracy metric [14] on the validation set, and the best performing model is selected on this basis. To gracefully capture errors in text recognition, the ST-VQA dataset [7] adopts Average Normalized Levenshtein Similarity (ANLS) as its official evaluation metric. We also apply this metric for the ST-VQA dataset. All our experimental results are obtained from submissions to the relevant online evaluation platforms.
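For readers unfamiliar with ANLS, the metric can be sketched as follows. The sketch follows our understanding of the ST-VQA definition: for each question, the score is one minus the normalised Levenshtein distance to the closest ground-truth answer, set to zero when that distance reaches a threshold of 0.5, and the scores are averaged over questions. The lower-casing and whitespace stripping are assumptions, and the official evaluation server remains the reference.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Average Normalized Levenshtein Similarity over a list of questions.

    predictions: one predicted string per question.
    ground_truths: one list of acceptable answers per question.
    """
    total = 0.0
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        p = pred.strip().lower()
        for ans in answers:
            a = ans.strip().lower()
            longest = max(len(p), len(a))
            nl = levenshtein(p, a) / longest if longest else 0.0
            score = 1.0 - nl if nl < threshold else 0.0
            best = max(best, score)
        total += best
    return total / len(predictions)


exact = anls(["coca cola"], [["Coca Cola"]])
mixed = anls(["coca colo", "pepsi"], [["coca cola"], ["fanta"]])
```

A prediction with a single wrong character in a nine-character answer still earns 8/9 of the credit, which is how the metric tolerates small recognition errors, while a completely different string earns nothing.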

4.2 Results and Analysis on TextVQA

The TextVQA dataset [38] samples images from OpenImages dataset [24]. The questions are divided into train, validation and test splits with size , , and respectively, and each question-image pair has human-provided ground truth answers.

on val
Baseline Rosetta-ml classifier
Baseline+oo Rosetta-ml classifier
Baseline+ot Rosetta-ml classifier
Baseline+tt Rosetta-ml classifier
Baseline+to Rosetta-ml classifier
SMA w/o dec. Rosetta-ml classifier
SMA w/o dec. Rosetta-en classifier
Table 1: Ablation study on key components of question conditioned graph attention module on TextVQA dataset. As mentioned in Section 3, there are four kinds of edges (relations) in our graph, which are oo, ot, tt and to edges. Stripping the question conditioned graph attention module of these four relations yields a baseline. Individually, we add each of the four edge attentions into the baseline and evaluate the corresponding accuracy.

Question enc.
OCR token
on val
on test
LoRRA (Rosetta-ml) [38] GloVe Rosetta-ml FastText classifier
LoRRA (Rosetta-en) GloVe Rosetta-en FastText classifier -
DCD ZJU (ensemble) [26] - - - -
MSFT VTI [39] - - - -
M4C [16] BERT Rosetta-en Multi-feats decoder
SMA w/o dec. GloVe Rosetta-en Multi-feats classifier -
SMA GloVe Rosetta-en Multi-feats decoder -
SMA BERT Rosetta-en Multi-feats decoder
SMA BERT Rosetta-en Multi-feats + RecogCNN decoder 40.05 40.66
Table 2: Step by step, we (1) replace the classification-based answering module with our proposed generative decoder, (2) replace GloVe with BERT for question encoding and (3) add the RecogCNN OCR feature. The three operations improve the validation accuracy by , and respectively. Using the same features for question, objects and OCRs, our model (line ) outperforms the previous SoTA M4C model by percentage in test accuracy.

Ablations on Relationship Attentions. We conduct an ablation study to investigate the key components of the proposed Question Conditioned Graph Attention Module, i.e., the four types (oo, ot, tt, to) of relationship attentions. In order to focus on the reasoning ability, we evaluate it without the rich OCR representation and the iterative answering module. The tested architecture variations and their results are shown in Table 1. The experimental results reveal that each of the four modeled relations improves the accuracy. In particular, the to relation attention leads to the largest improvement. This is consistent with the observation that annotators tend to refer to a specific text by describing the object on which the text is printed. Overall, the relations that originate from text (to and tt) are more important than those that originate from objects (oo and ot), which validates the key role of text in this text VQA task.

Ablations on Answering Modules. From lines and of Table 2, we can find that our proposed generative answering module surpasses the discriminative classifier-based answering module by a large margin ( in validation accuracy), which shows that the ability of generating variable-length answers is of significant importance for TextVQA.

Ablations on Features for Question and OCR. The GloVe and BERT features are evaluated for encoding questions, and the latter outperforms the former by in validation accuracy (see lines and in Table 2). By comparing lines and in Table 2, we can see a further improvement of by adding the RecogCNN feature for OCRs. This validates that the RecogCNN feature is complementary to the FastText, Faster R-CNN, PHOC and BBox features packed in Multi-Feats. Note that RecogCNN is trained on a text recognition task while Faster R-CNN is trained for general object detection. FastText and PHOC are extracted from the recognised OCR character sequences, whereas RecogCNN is extracted from visual patches of the text.

Comparison to Previous Work. We compare our method to LoRRA [38], an ensemble result of DCD [26], MSFT VTI [39] and the newest SoTA model M4C [16], and outperform all of them. Using the same question, object and OCR features, our single model is better than M4C on the test set.

Results with Ground Truth OCR. We provide ground-truth OCR annotations for the TextVQA train and validation sets, because they provide a fair test base for researchers to focus on the text-visual reasoning part without additionally tuning the OCR model. We ask Amazon Mechanical Turk (AMT) workers to annotate all the text appearing in the TextVQA dataset in order to completely remove the impact of OCR and to investigate the real reasoning ability. We evaluate the performance of LoRRA and SMA using the ground-truth OCR. The results are shown in Table 3. Both of them improve by a large margin: LoRRA goes up from to while SMA rises to from on the validation set, by replacing Rosetta-en results with ground truth. The larger increase in accuracy ( vs ) demonstrates the better reasoning and answering ability of our model. Rosetta OCR UB is the upper bound accuracy one can get if the answer can be built directly from OCR tokens and can always be predicted correctly (considering combinations of OCR tokens up to grams). With GT OCR, the upper bound rises to on validation, which is calculated in the same way as Rosetta OCR UB. However, there is still a large gap between human performance and SMA, which leaves considerable room for improvement.
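The idea behind an OCR upper bound can be illustrated with a simplified sketch. Several choices here are assumptions rather than the procedure used for the reported numbers: only contiguous n-grams in the given token order are considered, matching is case-insensitive, each question counts as a binary hit instead of using the soft VQA accuracy, and the maximum n-gram length `max_n` is left as a parameter because its value did not survive in this copy.

```python
def ocr_ngrams(tokens, max_n):
    """All contiguous n-grams (n = 1..max_n) of the OCR tokens, space-joined."""
    grams = set()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[start:start + n]))
    return grams


def ocr_upper_bound(samples, max_n):
    """Fraction of questions with at least one answer buildable from OCR tokens.

    samples: list of (ocr_tokens, ground_truth_answers) pairs.
    """
    hits = 0
    for ocr_tokens, answers in samples:
        grams = ocr_ngrams([t.lower() for t in ocr_tokens], max_n)
        if any(ans.lower() in grams for ans in answers):
            hits += 1
    return hits / len(samples)


samples = [
    (["STOP", "AHEAD"], ["stop ahead"]),  # answer is a 2-gram of the OCR tokens
    (["EXIT"], ["left"]),                 # answer is not readable from the image
]
upper_bound = ocr_upper_bound(samples, max_n=2)
```

Swapping the OCR tokens of a system such as Rosetta for human-annotated tokens changes only the `ocr_tokens` field, which is why the same computation yields both upper bounds.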

Visualization. For each type of relation, we visualize the relations with the highest attention weights and their corresponding decomposed question attention, in order to explore their contributions to answer prediction and to give better insight into our model (see Figure 5). In the first example, there are several bikes, among which the question asks about the right one. Locating the requested bike requires oo relationship reasoning. The to relationship is also needed, as the number on the bike is exactly what we have to figure out. In the second example, we need the ot relationship to locate the player whose number is . The to relationship is then employed to reason about the last name of this player. Similarly, in the last example, two different oo relationships are extracted to pinpoint the player on the right with blue hair. The ot relationship is then used to get the player’s number. All the examples validate the relationship reasoning ability of our model.

Methods | Val Accuracy(%)
LoRRA [38] |
SMA (Ours) |
OCR UB Rosetta-ml |
OCR UB GroundTruth |
Human |
Table 3: Evaluation with GT OCR.
# | Method | Task 1 ANLS | Task 2 ANLS | Task 3 ANLS
1 | SAN+STR [7] | | |
2 | VTA [6] | | |
3 | M4C [16] | | |
4 | SMA (Ours) | 0.508 | 0.310 | 0.466
Table 4: Evaluation on ST-VQA dataset.

4.3 TextVQA Challenge 2020

Our model participated in the TextVQA Challenge 2020 with a slight change to the OCR embeddings in Figure 4: the current version is replaced by the OCR features updated by the graph attention module. In our final model, a Sequential-free Box Discretization (SBD) model [28] is first used for scene text detection, and a robust transformer-based network [45] is employed for word recognition (denoted as "SBD-Trans OCR"), which achieves a better OCR result and leads to an improvement on the TextVQA task. The SBD model is pretrained on a dataset consisting of images from the LSVT [40] training set, images from the MLT 2019 [31] training set, images from ArT [11] (which contains all the images of SCUT-CTW1500 [27] and Total-text [9, 10]), and further images selected from RCTW-17 [37], ICDAR 2013 [22], ICDAR 2015 [21], MSRA-TD500 [48], COCOText [42], and USTB-SV1K [49]. The model is finally fine-tuned on the MLT 2019 [31] training set. The robust transformer-based network is trained on the following datasets: IIIT 5K-Words [29], Street View Text [43], ICDAR 2013 [22], ICDAR 2015 [21], Street View Text Perspective [34], CUTE80 [36] and ArT [11]. Finally, we experiment with using the ST-VQA dataset [7] as additional training data together with SBD-Trans OCR, which achieves 45.51% final test accuracy, a new SoTA on the TextVQA dataset.

4.4 Evaluation on the ST-VQA dataset

The ST-VQA dataset [7] comprises images with question-answer pairs. There are three VQA tasks, namely strongly contextualised, weakly contextualised and open vocabulary. For the strongly contextualised task, the authors provide a dictionary of candidate words for each image; for the weakly contextualised task, they provide a single dictionary of words shared by all images; and for the open vocabulary task, no candidate answers are provided. As the ST-VQA dataset does not have an official split for training and validation, we follow M4C [16] and randomly select a subset of the images as our training set, using the remaining images as our validation set.

For the first and second tasks, a single-step version of our model (SMA w/o dec.) is used, while for the third task we use the proposed full SMA model. Compared with the methods on the leaderboard, we set a new SoTA on all three tasks (see Table 4).
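Table 4 reports ANLS (Average Normalized Levenshtein Similarity), the metric of the ST-VQA benchmark, which this text does not define. The sketch below follows the commonly stated definition: each prediction scores one minus its normalized edit distance to the closest ground-truth answer, set to zero when that distance reaches the threshold of 0.5, and scores are averaged over questions. The function names and the lowercase/strip normalization are assumptions, so the official evaluation script may differ in details.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(pred: str, gt: str, tau: float = 0.5) -> float:
    """1 - normalized edit distance, or 0 if the distance is at or above tau."""
    pred, gt = pred.strip().lower(), gt.strip().lower()
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0
    nl = levenshtein(pred, gt) / longest
    return 1.0 - nl if nl < tau else 0.0


def anls(predictions: List[str], ground_truths: List[List[str]], tau: float = 0.5) -> float:
    """Average over questions of the best similarity to any ground-truth answer."""
    if not predictions:
        return 0.0
    scores = [
        max((similarity(p, g, tau) for g in gts), default=0.0)
        for p, gts in zip(predictions, ground_truths)
    ]
    return sum(scores) / len(scores)


# Hypothetical predictions and answers.
print(anls(["stop", "abc"], [["stop"], ["xyz"]]))
```

The threshold means that a prediction which is mostly wrong earns no partial credit, while small OCR-style character errors are penalized only lightly.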

Figure 5: Edge attention and decomposed question attention visualization for SMA. Three representative examples that require relationship reasoning for question answering are presented, which demand different kinds of edge relations. For instance, to represents the relation whose former node is text and latter one is object. For each example we highlight nodes or edges with the highest attention weights, wherein nodes are represented by boxes and edges are displayed by arrows pointing from former node to latter one. For boxes/nodes, yellow ones are for object and blue ones are for text. Solid ones are those with the highest attention weights whereas dashed ones are normal. For decomposed question attention, the darker highlighted text area has a higher attention weight. All of them are predicted by SMA with Ground-Truth OCR.

5 Conclusion

We introduce Structured Multimodal Attentions (SMA), a novel model architecture for answering questions based on the text in images, which sets new state-of-the-art performance on the TextVQA and ST-VQA datasets. SMA is composed of three key modules: a Question Self-Attention Module, which guides a Graph Attention Module to learn the node and edge attention, and a final Answering Module, which combines the attention weights and question-guided features of the Graph Attention Module to yield an answer iteratively. A human-annotated ground-truth OCR set for TextVQA is also provided to establish a new upper bound and to help the community evaluate the real text-visual reasoning ability of different models without suffering from poor OCR accuracy.


  • [1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny (2014) Word spotting and recognition with embedded attributes. IEEE transactions on pattern analysis and machine intelligence 36 (12), pp. 2552–2566. Cited by: §3.2.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.1.
  • [3] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein (2016) Neural module networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 39–48. Cited by: §3.1.
  • [4] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proc. IEEE Int. Conf. Comp. Vis., pp. 2425–2433. Cited by: §1, §2.1.
  • [5] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • [6] A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, M. Mathew, C. Jawahar, E. Valveny, and D. Karatzas (2019) Icdar 2019 competition on scene text visual question answering. arXiv preprint arXiv:1907.00490. Cited by: Table 4.
  • [7] A. F. Biten, R. Tito, A. Mafla, L. Gomez, M. Rusinol, E. Valveny, C.V. Jawahar, and D. Karatzas (2019) Scene text visual question answering. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: §1, §2.1, §4.1, §4.3, §4.4, Table 4, §4.
  • [8] F. Borisyuk, A. Gordo, and V. Sivakumar (2018) Rosetta: large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79. Cited by: §4.1.
  • [9] C. K. Ch’ng and C. S. Chan (2017) Total-text: a comprehensive dataset for scene text detection and recognition. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 935–942. Cited by: §4.3.
  • [10] C. Ch’ng, C. S. Chan, and C. Liu (2020) Total-text: toward orientation robustness in scene text detection. International Journal on Document Analysis and Recognition (IJDAR) 23 (1), pp. 31–52. Cited by: §4.3.
  • [11] C. Chng, Y. Liu, Y. Sun, C. C. Ng, C. Luo, Z. Ni, C. Fang, S. Zhang, J. Han, E. Ding, et al. (2019) Icdar2019 robust reading challenge on arbitrary-shaped text (rrc-art). arXiv preprint arXiv:1909.07145. Cited by: §4.3.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Cited by: §3.1, §4.1.
  • [13] J. Gao, R. Ge, K. Chen, and R. Nevatia (2018) Motion-appearance co-memory networks for video question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6576–6585. Cited by: §1.
  • [14] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913. Cited by: §4.1.
  • [15] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham (2018) Vizwiz grand challenge: answering visual questions from blind people. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1, §2.1.
  • [16] R. Hu, A. Singh, T. Darrell, and M. Rohrbach (2019) Iterative answer prediction with pointer-augmented multimodal transformers for textvqa. arXiv preprint arXiv:1911.06258. Cited by: §2.1, §3.2, §3.3, §4.1, §4.2, §4.4, Table 2, Table 4.
  • [17] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) Clevr: a diagnostic dataset for compositional language and elementary visual reasoning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2901–2910. Cited by: §2.1.
  • [18] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov (2017) Bag of tricks for efficient text classification. In Proc. Conf. of European Chapter of Association for Computational Linguistics, Cited by: §3.2.
  • [19] K. Kafle, S. Cohen, B. Price, and C. Kanan (2018) Dvqa: understanding data visualizations via question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.1.
  • [20] S. E. Kahou, V. Michalski, A. Atkinson, A. Kadar, A. Trischler, and Y. Bengio (2018) FigureQA: An annotated figure dataset for visual reasoning. In ICLR workshop track, Cited by: §2.1.
  • [21] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160. Cited by: §4.3.
  • [22] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. De Las Heras (2013) ICDAR 2013 robust reading competition. In 2013 12th International Conference on Document Analysis and Recognition, pp. 1484–1493. Cited by: §4.3.
  • [23] A. Kembhavi, M. Seo, D. Schwenk, J. Choi, A. Farhadi, and H. Hajishirzi (2017) Are you smarter than a sixth grader? textbook question answering for multimodal machine comprehension. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.1.
  • [24] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, A. Veit, et al. (2017) Openimages: a public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages 2, pp. 3. Cited by: §4.2.
  • [25] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vision 123 (1), pp. 32–73. Cited by: §4.1.
  • [26] Y. Lin, H. Zhao, Y. Li, and D. Wang. DCD_ZJU, textvqa challenge 2019 winner. Cited by: §4.2, Table 2.
  • [27] Y. Liu, L. Jin, S. Zhang, C. Luo, and S. Zhang (2019) Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition 90, pp. 337–345. Cited by: §4.3.
  • [28] Y. Liu, S. Zhang, L. Jin, L. Xie, Y. Wu, and Z. Wang (2019) Omnidirectional scene text detection with sequential-free box discretization. In Proc. Int. Joint Conf. Artificial Intell., pp. 3052–3058. Cited by: §4.3.
  • [29] A. Mishra, K. Alahari, and C. Jawahar (2012) Scene text recognition using higher order language priors. Cited by: §4.3.
  • [30] M. Narasimhan, S. Lazebnik, and A. G. Schwing (2018) Out of the box: reasoning with graph convolution nets for factual visual question answering. In Proc. Advances in Neural Inf. Process. Syst., pp. 2654–2665. Cited by: §2.2.
  • [31] N. Nayef, Y. Patel, M. Busta, P. N. Chowdhury, D. Karatzas, W. Khlif, J. Matas, U. Pal, J. Burie, C. Liu, et al. (2019) ICDAR2019 robust reading challenge on multi-lingual scene text detection and recognition–rrc-mlt-2019. arXiv preprint arXiv:1907.00945. Cited by: §4.3.
  • [32] R. Cadene, H. Ben-Younes, M. Cord, and N. Thome (2019) Murel: multimodal relational reasoning for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.2.
  • [33] W. Norcliffe-Brown, S. Vafeias, and S. Parisot (2018) Learning conditioned graph structures for interpretable visual question answering. In Proc. Advances in Neural Inf. Process. Syst., pp. 8344–8353. Cited by: §2.2.
  • [34] T. Quy Phan, P. Shivakumara, S. Tian, and C. Lim Tan (2013) Recognizing text with perspective distortion in natural scenes. In Proceedings of the IEEE International Conference on Computer Vision, pp. 569–576. Cited by: §4.3.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §4.1.
  • [36] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan (2014) A robust arbitrary text detection system for natural scene images. Expert Systems with Applications 41 (18), pp. 8027–8048. Cited by: §4.3.
  • [37] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, S. Lu, and X. Bai (2017) Icdar2017 competition on reading chinese text in the wild (rctw-17). In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Vol. 1, pp. 1429–1434. Cited by: §4.3.
  • [38] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach (2019) Towards vqa models that can read. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §1, §1, §2.1, §4.1, §4.2, §4.2, Table 2, Table 4, §4.
  • [39] A. submission MSFT_VTI, textvqa challenge 2019 top entry (post-challenge). Cited by: §4.2, Table 2.
  • [40] Y. Sun, Z. Ni, C. Chng, Y. Liu, C. Luo, C. C. Ng, J. Han, E. Ding, J. Liu, D. Karatzas, et al. (2019) ICDAR 2019 competition on large-scale street view text with partial labeling–rrc-lsvt. arXiv preprint arXiv:1909.07741. Cited by: §4.3.
  • [41] D. Teney, L. Liu, and A. V. D. Hengel (2017) Graph-structured representations for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3233–3241. Cited by: §2.2.
  • [42] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie (2016) Coco-text: dataset and benchmark for text detection and recognition in natural images. arXiv preprint arXiv:1601.07140. Cited by: §4.3.
  • [43] K. Wang, B. Babenko, and S. Belongie (2011) End-to-end scene text recognition. In 2011 International Conference on Computer Vision, pp. 1457–1464. Cited by: §4.3.
  • [44] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel (2018) Fvqa: fact-based visual question answering. IEEE Trans. Pattern Anal. Mach. Intell. 40 (10), pp. 2413–2427. Cited by: §2.1, §2.2.
  • [45] P. Wang, L. Yang, H. Li, Y. Deng, C. Shen, and Y. Zhang (2019) A simple and robust convolutional-attention network for irregular text recognition. arXiv:1904.01375. Cited by: §3.2, §4.3.
  • [46] J. Weston, S. Chopra, and A. Bordes (2014) Memory networks. arXiv preprint arXiv:1410.3916. Cited by: §1.
  • [47] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 21–29. Cited by: §2.1.
  • [48] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu (2012) Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1083–1090. Cited by: §4.3.
  • [49] X. Yin, W. Pei, J. Zhang, and H. Hao (2015) Multi-orientation scene text detection with adaptive clustering. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1930–1937. Cited by: §4.3.
  • [50] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg (2018) Mattnet: modular attention network for referring expression comprehension. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1307–1315. Cited by: §3.1.
  • [51] Z. Yu, J. Yu, Y. Cui, D. Tao, and Q. Tian (2019) Deep modular co-attention networks for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6281–6290. Cited by: §1.