Cross-Modality Relevance for Reasoning on Language and Vision

05/12/2020 ∙ by Chen Zheng, et al. ∙ Michigan State University

This work deals with the challenge of learning and reasoning over language and vision data for downstream tasks such as visual question answering (VQA) and natural language for visual reasoning (NLVR). We design a novel cross-modality relevance module, used in an end-to-end framework, to learn the relevance representation between components of various input modalities under the supervision of a target task; this is more generalizable to unobserved data than merely reshaping the original representation space. In addition to modeling the relevance between textual entities and visual entities, we model the higher-order relevance between entity relations in the text and object relations in the image. Our proposed approach shows competitive performance on two different language and vision tasks using public benchmarks and improves the published state-of-the-art results. The alignments of input spaces and relevance representations learned on the NLVR task also boost the training efficiency of the VQA task.







1 Introduction

Real-world problems often involve data from multiple modalities and resources. Solving a problem at hand usually requires the ability to reason about the components across all the involved modalities. Examples of such tasks are visual question answering (VQA) Antol et al. (2015); Goyal et al. (2017) and natural language visual reasoning (NLVR) Suhr et al. (2017, 2018). One key to intelligence here is to identify the relations between the modalities and to combine and reason over them for decision making. Deep learning is a prominent technique for learning data representations for various target tasks, and it has achieved strong performance when trained on large-scale corpora Devlin et al. (2019). However, learning joint representations for cross-modality data is challenging because deep learning is data-hungry. There have been many recent efforts to build such multi-modality datasets Lin et al. (2014); Krishna et al. (2017); Johnson et al. (2017); Antol et al. (2015); Suhr et al. (2017); Goyal et al. (2017); Suhr et al. (2018). Researchers develop models by joining features, aligning representation spaces, and using Transformers Li et al. (2019b); Tan and Bansal (2019). However, generalizability remains an issue when operating on unobserved data: it is hard for deep learning models to capture high-order patterns of reasoning, which is essential for generalization.

There are several challenging research directions in learning representations for cross-modality data and enabling reasoning for target tasks. The first is aligning the representation spaces of multiple modalities; the second is designing architectures that can capture high-order relations for generalizable reasoning; the third is using pre-trained modules to make the most of limited data.

An orthogonal direction to the above-mentioned aspects of learning is finding the relevance between the components and the structure of various modalities when working with multi-modal data. Most previous language and visual reasoning models try to capture this relevance by learning representations based on an attention mechanism. Finding relevance, known as matching, is a fundamental task in information retrieval (IR) Mitra et al. (2017). Benefiting from matching, Transformer models gain a strong ability to index, retrieve, and combine features of the underlying instances through a matching score Vaswani et al. (2017), which leads to state-of-the-art performance on various tasks Devlin et al. (2019). However, the matching in the attention mechanism is only used to learn a set of weights that highlight the importance of various components.

In our proposed model, we learn representations directly based on the relevance score, inspired by ideas from IR models. In contrast to the attention mechanism and Transformer models, we claim that the relevance patterns themselves are just as important. With proper alignment of the representation spaces of different input modalities, matching can be applied to those spaces. The idea of learning relevance patterns is similar to Siamese networks Koch et al. (2015), which learn transferable patterns of similarity between two image representations for one-shot image recognition. A similarity metric between two modalities has also been shown to help align multiple modality spaces Frome et al. (2013).

The contributions of this work are as follows: 1) We propose a cross-modality relevance (CMR) framework that considers entity relevance and high-order relational relevance between the two modalities with an alignment of representation spaces. The model can be trained end-to-end with customizable target tasks. 2) We evaluate the methods and analyze the results on both VQA and NLVR tasks, using the VQA v2.0 and NLVR² datasets respectively, and improve the published state-of-the-art results on both tasks. Our analysis shows the significance of the patterns of relevance for reasoning, and the CMR model trained on NLVR² boosts the training efficiency of the VQA task.

2 Related Work

Language and Vision Tasks.

Learning and decision making based on natural language and visual information has attracted the attention of many researchers because it exposes many interesting research challenges to the AI community. Among many other efforts Lin et al. (2014); Krishna et al. (2017); Johnson et al. (2017), Antol et al. proposed the VQA challenge, which contains open-ended questions about images that require understanding of and reasoning about language and visual components. Suhr et al. proposed the NLVR task, which asks models to determine whether a sentence is true with respect to an image.

Attention Based Representation.

Transformers are stacked self-attention models for general-purpose sequence representation Vaswani et al. (2017). They have achieved extraordinary success in natural language processing, not only for better results but also for efficiency due to their parallel computation. Self-attention is a mechanism that reshapes the representation of each component based on relevance scores and has proven effective in generating contextualized representations for text entities. More importantly, there are several efforts to pre-train large Transformers on large-scale corpora Devlin et al. (2019); Yang et al. (2019); Radford et al. (2019) over multiple popular tasks, which makes it possible to exploit them for other tasks with small corpora. Researchers have also extended Transformers with both textual and visual modalities Li et al. (2019b); Sun et al. (2019); Tan and Bansal (2019); Su et al. (2020); Tsai et al. (2019), and sophisticated pre-training strategies have been introduced to boost performance Tan and Bansal (2019). However, as mentioned above, modeling relations between components is still a challenge for approaches that reshape the entity representation space, while the relevance score can be more expressive for these relations. In our CMR framework, we model high-order relations in the relevance representation space rather than the entity representation space.

Matching Models.

Matching is a fundamental task in information retrieval (IR). There are IR models that focus on matching global representations Huang et al. (2013); Shen et al. (2014), matching local components (a.k.a. terms) Guo et al. (2016); Pang et al. (2016), and hybrid methods Mitra et al. (2017). Our relevance framework is partially inspired by local component matching, which we apply here to model the relevance of the components of the model's inputs. However, our work differs in several significant ways. First, we work in a cross-modality setting. Second, we extend the relevance to higher orders, i.e., we model the relevance of entity relations. Third, our framework can work with different target tasks, and we show that the parameters trained on one task can boost the training of another.

Figure 1:

Cross-Modality Relevance model is composed of single-modality transformer, cross-modality transformer, entity relevance, and high-order relational relevance, followed by a task-specific classifier.

3 Cross-Modality Relevance

Cross-Modality Relevance (CMR) aims to establish a framework for general purpose relevance in various tasks. As an end-to-end model, it encodes the relevance between the components of input modalities under task-specific supervision. We further add a high-order relevance between relations that occur in each modality.

Figure 1 shows the proposed architecture. We first encode data from different modalities with single-modality Transformers and align the encoding spaces with a cross-modality Transformer. We consistently refer to the words in text and the objects in images (i.e., bounding boxes in images) as “entities” and to their representations as “Entity Representations”. We use the relevance between the components of the two modalities to model the relation between them. The relevance includes the relevance between their entities, shown as “Entity Relevance”, and the high-order relevance between their relations, shown as “Relational Relevance”. We learn the representations of the affinity matrix of relevance scores with convolutional layers and fully-connected layers. Finally, we predict the output by a non-linear mapping over all the relevance representations. This architecture can help solve tasks that need reasoning over two modalities based on their relevance. We argue that the parameters trained on one task can boost the training of other tasks that deal with multi-modality reasoning.

In this section, we first formulate the problem. Then we describe our cross-modality relevance (CMR) model for solving the problem. The architecture, loss function, and training procedure of CMR are explained in detail. We will use the VQA and NLVR tasks as showcases.

3.1 Problem Formulation

Formally, the problem is to model a mapping f: X → Y from a cross-modality data sample X = {X_m} to an output Y in a target task, where m denotes the type of modality and X_m is the set of entities in modality m. In visual question answering (VQA), the task is to predict an answer given two modalities: a textual question (X_T) and a visual image (X_V). In NLVR, given a textual statement (X_T) and a pair of images (X_V), the task is to determine the correctness of the textual statement.

3.2 Representation Spaces Alignment

Single Modality Representations.

For the textual modality X_T, we utilize BERT Devlin et al. (2019), as shown in the bottom-left part of Figure 1, which is a multi-layer Transformer Vaswani et al. (2017) with three different inputs: WordPiece embeddings Wu et al. (2016), segment embeddings, and position embeddings. We refer to all the words as the entities of the textual modality and use the BERT representations as the textual single-modality representations, assuming n_T words as textual entities.

For the visual modality X_V, as shown in the top-left part of Figure 1, Faster-RCNN Ren et al. (2015) is used to generate regions of interest (ROIs), extract dense encoding representations of the ROIs, and predict the probability of each ROI. We refer to the ROIs in images as the visual entities. Each time, we consider a fixed number n_V of visual entities with the highest probabilities predicted by Faster-RCNN. The dense representation of each ROI is a local latent representation given by a d-dimensional vector Ren et al. (2015). To enrich the visual entity representation with visual context, we further project the vectors with feed-forward layers and encode them by a single-modality Transformer, as shown in the second column of Figure 1. The visual Transformer takes the dense representation, segment embedding, and pixel position embedding Tan and Bansal (2019) as input and generates the single-modality representation. When there are multiple images, as in NLVR² where each example has two images, each image is encoded by the same procedure and we keep n_V visual entities per image; we refer to these as different sources of the same modality throughout the paper. We restrict all the single-modality representations to be vectors of the same dimension d. However, these original representation spaces still need to be aligned.

Cross-Modality Alignment.

To align the single-modality representations in a unified representation space, we introduce a cross-modality Transformer, as shown in the third column of Figure 1. All the entities are treated uniformly in the cross-modality Transformer. Given the set of entity representations from all modalities, we define the matrix H whose rows are the elements of this set. Each cross-modality self-attention calculation is computed as follows Vaswani et al. (2017) (note that we keep the usual notation of the attention mechanism for this equation; the notation may be overloaded in other parts of the paper):

Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V,

where in our case the key K, query Q, and value V are all the same tensor H, and softmax normalizes along the columns. A cross-modality Transformer layer consists of a cross-modality self-attention representation followed by a residual connection with normalization from the input representation, a feed-forward layer, and another residual connection with normalization. We stack several cross-modality Transformer layers to get a uniform representation over all modalities. We refer to the resulting uniform representations as the entity representations and denote the set of entity representations of all entities as E. Although the representations are still organized by their original modality per entity, they carry information from interactions with the other modality and are aligned in a uniform representation space. The entity representations (the fourth column in Figure 1) alleviate the gap between representations from different modalities, as we show in the ablation studies, and allow them to be matched in the following steps.
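The cross-modality self-attention step can be sketched numerically. This is a minimal single-head version with Q = K = V = H as described above; the real model adds learned projections, residual connections, and layer normalization.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modality_self_attention(H):
    """One self-attention pass over the stacked entity matrix H.

    H: (n_text + n_visual, d) -- textual and visual entities stacked,
    so every entity can attend to entities of the other modality too.
    Query, key, and value are all H itself, as in the paper's equation.
    """
    d = H.shape[1]
    scores = H @ H.T / np.sqrt(d)        # (n, n) pairwise match scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ H                   # re-weighted entity representations
```

Note that if all entity vectors are identical, the attention weights are uniform and the output equals the input, which is a quick sanity check for an implementation.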

3.3 Entity Relevance

Relevance plays a critical role in reasoning, which is required in many tasks such as information retrieval, question answering, and intra- and inter-modality reasoning. Relevance patterns are independent of the input representation space and can generalize better to unobserved data. To capture the entity relevance between two modalities m_1 and m_2, the entity relevance representation is calculated as shown in Figure 1. Given entity representation matrices E^(m_1) and E^(m_2), the relevance representation is calculated by

S^(m_1, m_2) = E^(m_1) (E^(m_2))^T,    r_e^(m_1, m_2) = g_e(S^(m_1, m_2)),

where S^(m_1, m_2) is the affinity matrix of the two modalities, as shown on the right side of Figure 1, and S_ij is the relevance score of the i-th entity in m_1 and the j-th entity in m_2. g_e is a CNN, corresponding to the sixth column of Figure 1, which contains several convolutional layers and fully connected layers. Each convolutional layer is followed by a max-pooling layer, and fully connected layers finally map the flattened feature maps to a d_e-dimensional vector. We refer to r_e^(m_1, m_2) as the entity relevance representation between m_1 and m_2.

We compute the relevance between different modalities; for the modalities considered in this work, when there are multiple images in the visual modality, we calculate the relevance representation between them as well. In particular, for the VQA dataset, the above setting results in one entity relevance representation: a textual-visual entity relevance. For the NLVR² dataset, there are three entity relevance representations: two textual-visual entity relevances (one per image) and a visual-visual entity relevance between the two images. Entity relevance representations are flattened and joined with the other features in the next layer of the network.
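A minimal sketch of the entity relevance computation, assuming a dot-product affinity between aligned entity matrices. The single max-pooling stand-in for the CNN g_e is a simplification of the paper's several convolutional and fully connected layers.

```python
import numpy as np

def entity_affinity(E1, E2):
    """Affinity matrix S with S[i, j] = relevance of entity i in one
    modality to entity j in the other (dot product of aligned vectors)."""
    return E1 @ E2.T

def relevance_features(S, pool=2):
    """Toy stand-in for the relevance CNN g_e: one max-pooling pass over
    pool x pool windows, then flattening. (The real g_e stacks several
    conv + max-pool layers followed by fully connected layers.)"""
    n = (S.shape[0] // pool) * pool        # trim rows to a multiple of pool
    m = (S.shape[1] // pool) * pool        # trim cols to a multiple of pool
    blocks = S[:n, :m].reshape(n // pool, pool, m // pool, pool)
    return blocks.max(axis=(1, 3)).ravel() # max over each window, flatten
```

Because the features are computed from the affinity matrix rather than from the entity vectors themselves, the same relevance pattern can be recognized even for unseen entity representations.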

3.4 Relational Relevance

We also consider relevance beyond entities, that is, the relevance of the entities’ relations. This extension allows CMR to capture higher-order relevance patterns. We consider pair-wise, non-directional relations between entities in each modality and calculate the relevance of the relations across modalities. The procedure is similar to the entity relevance, as shown in Figure 1. We denote the relational representation of an entity pair as a non-linear mapping, modeled by fully-connected layers, from the concatenation of the representations of the two entities in the relation. A relational relevance affinity matrix can be calculated by matching the relational representations from different modalities. However, there are quadratically many possible pairs in each modality, most of which are irrelevant, so the relational relevance representations will be sparse. Computing the relevance score of all possible pairs would introduce a large number of unnecessary parameters, which makes training more difficult.

Figure 2: Relational Relevance is the relevance of top-K relations in terms of intra-modality relevance score and inter-modality importance.

We propose to rank the relation candidates (i.e., entity pairs) by an intra-modality relevance score and an inter-modality importance, and then compare the top-K ranked relation candidates between the two modalities, as shown in Figure 2. For the intra-modality relevance score, shown in the bottom-left part of the figure, we estimate a normalized score over the candidate pairs by feeding their relational representations to a softmax layer:

p_ij^(m) = softmax_(i,j)( FC( r_ij^(m) ) ),

where r_ij^(m) is the relational representation of the candidate pair (i, j), FC is a fully-connected layer, and the softmax normalizes over all candidate pairs in modality m.
To evaluate the inter-modality importance of a relation candidate, which is a pair of entities in the same modality, we first compute the relevance of each entity in the text with respect to the visual objects. As shown in Figure 2, we take, for each word, its maximum relevance score over the visual objects, and denote the resulting importance vector as a^(T). This helps focus on words that are grounded in the visual modality. We use the same procedure to compute the most relevant words for each visual object, yielding a^(V).

Then we calculate the relation-candidate importance matrix M^(m) by an outer product of the importance vector with itself:

M^(m) = a^(m) (a^(m))^T,  i.e.,  M_ij^(m) = a_i^(m) a_j^(m),

where a_i^(m) is the i-th scalar element of a^(m), corresponding to the i-th entity and obtained from the affinity matrix S calculated by Equation 2a.

Notice that the inter-modality importance is symmetric. The upper triangular part of M^(m), excluding the diagonal, indicates the importance of the corresponding entries, with the same index, in the intra-modality relevance scores. The ranking score for a candidate is the combination (here, the product) of the two scores. We select the set of top-K ranked candidate relations and reorganize their relational representations into a matrix R^(m). The relational relevance representation between modalities m_1 and m_2 can then be calculated similarly to the entity relevance representation, as shown in Figure 1: the affinity matrix of R^(m_1) and R^(m_2) is fed to a CNN g_r, which has its own parameters and produces a d_r-dimensional relational relevance feature.
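The ranking of relation candidates can be sketched as follows, assuming the importance vector is the row-wise maximum of the affinity matrix and the combined score is the product described above; `rank_relations` and the dict of pair scores are illustrative names, not the paper's code.

```python
import numpy as np

def rank_relations(S, pair_scores, k):
    """Rank entity pairs (i, j), i < j, inside one modality.

    S:           (n, m) affinity matrix between this modality's n entities
                 and the other modality's m entities.
    pair_scores: intra-modality relevance score per pair, as a dict
                 {(i, j): score} (normalized, e.g. by a softmax layer).
    Returns the k pairs with the highest combined ranking score.
    """
    a = S.max(axis=1)       # importance: each entity's best cross-modal match
    M = np.outer(a, a)      # inter-modality importance of pair (i, j)
    ranked = sorted(
        pair_scores,
        key=lambda ij: pair_scores[ij] * M[ij[0], ij[1]],  # combined score
        reverse=True,
    )
    return ranked[:k]
```

Only the representations of the selected top-K pairs are compared across modalities, which avoids the quadratic blow-up of scoring all pairs.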

In particular, for the VQA task, the above setting results in one relational relevance representation: a textual-visual relational relevance. For the NLVR² task, there are three relational relevance representations: two textual-visual relational relevances (one per image) and a visual-visual relational relevance between the two images. Relational relevance representations are flattened and joined with the other features in the next layers of the network.

After acquiring all the entity and relational relevance representations, we concatenate them into the final feature vector. A task-specific classifier predicts the output of the target task from this vector, as shown in the right-most column of Figure 1.

3.5 Training

End-to-end Training. CMR can be considered an end-to-end relevance representation extractor. We simply predict the output of a specific task from the final feature with a differentiable regression or classification function. The gradient of the loss function is back-propagated to all the components of CMR to penalize the prediction and adjust the parameters. We freeze the parameters of the basic feature extractors, namely BERT for the textual modality and Faster-RCNN for the visual modality. The following parts are updated by gradient descent: the single-modality Transformers (except BERT), the cross-modality Transformer, the relational representation layers, the intra-modality scoring layers, the relevance CNNs g_e and g_r for all modalities and modality pairs, and the task-specific classifier.

The VQA task can be formulated as multi-class classification that chooses a word to answer the question; we apply a softmax classifier on the final feature and penalize it with the cross-entropy loss. For the NLVR² dataset, the task is binary classification that determines whether the statement is correct with regard to the images; we apply logistic regression on the final feature and penalize it with the cross-entropy loss.

Pre-training Strategy.

To leverage the pre-trained parameters of our cross-modality Transformer and relevance representations, we use the following training settings. For all tasks, we freeze the parameters in BERT and Faster-RCNN. We use pre-trained parameters for the (visual) single-modality Transformer as proposed by Tan and Bansal (2019) and allow them to be fine-tuned in the following procedure. We then randomly initialize and train all remaining parameters of the model on the NLVR task with the NLVR² dataset. After that, we keep and fine-tune all the parameters on the VQA task with the VQA v2.0 dataset (see the data description in Section 4.1). In this way, the parameters of the cross-modality Transformer and relevance representations, pre-trained on NLVR², are reused and fine-tuned on the VQA dataset; only the final task-specific classifier is initialized randomly. The pre-trained cross-modality Transformer and relevance representations help the VQA model converge faster and achieve competitive performance compared to the state-of-the-art results.
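A toy sketch of this warm-start procedure, with parameter groups represented as a plain dict; the group names and stand-in weights here are illustrative, not the actual module names.

```python
import random

def transfer_parameters(nlvr2_params, new_classifier_init):
    """Warm start for VQA: reuse every parameter group trained on NLVR2
    except the task-specific classifier, which is re-initialized.
    Parameter groups are a dict of name -> weights (lists stand in for
    tensors)."""
    vqa_params = dict(nlvr2_params)                    # reuse trained weights
    vqa_params["task_classifier"] = new_classifier_init()  # fresh head
    return vqa_params

# Toy usage: three shared groups are carried over, the head is new.
trained = {
    "cross_modality_transformer": [0.2, -0.1],
    "entity_relevance_cnn": [0.5],
    "relational_relevance_cnn": [0.7],
    "task_classifier": [9.9],   # NLVR2 head, must not be reused for VQA
}
warm = transfer_parameters(trained, lambda: [random.uniform(-1, 1)])
```

The shared groups start from their NLVR²-trained values, which is what lets the VQA run converge in far fewer epochs than training from scratch.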

4 Experiments and Results

4.1 Data Description

NLVR² Suhr et al. (2018) is a dataset that targets joint reasoning about natural language descriptions and related images. Given a textual statement and a pair of images, the task is to indicate whether the statement correctly describes the two images. NLVR² contains sentences paired with visual images and is designed to emphasize semantic diversity, compositionality, and visual reasoning challenges.

VQA v2.0 Goyal et al. (2017) is an extended version of the VQA dataset. It contains images from MS COCO Lin et al. (2014), paired with free-form, open-ended natural language questions and answers. These questions are divided into three categories: Yes/No, Number, and Other.

4.2 Implementation Details

We implemented CMR using PyTorch; our code and data are publicly available. We use d = 768-dimensional single-modality representations. For the textual modality, the pre-trained BERT “base” model Devlin et al. (2019) is used to generate the single-modality representations. For the visual modality, we use a Faster-RCNN pre-trained by Anderson et al., followed by a five-layer Transformer. Parameters in BERT and Faster-RCNN are fixed. For each example, we keep a fixed number of words as textual entities and a fixed number of ROIs per image as visual entities. For the relational relevance, the top-10 ranked pairs are used. Each relevance CNN, g_e and g_r, uses two convolutional layers, each followed by max-pooling, and fully connected layers. For the relational representations and their intra-modality relevance scores, we use one hidden layer each. The task-specific classifier contains three hidden layers. The model is optimized using the Adam optimizer and trained with weight decay, max gradient-norm clipping, and a batch size of 32.

4.3 Baseline Description


VisualBERT Li et al. (2019b) is an end-to-end model for language and vision tasks consisting of Transformer layers that align textual and visual representation spaces with self-attention. VisualBERT and CMR have a similar cross-modality alignment approach; however, VisualBERT only uses the Transformer representations, while CMR also uses the relevance representations.


LXMERT Tan and Bansal (2019) learns cross-modality encoder representations from Transformers. It pre-trains the model with a set of tasks and fine-tunes it on another set of specific tasks. LXMERT is the previously published state-of-the-art on both NLVR² and VQA v2.0.

4.4 Results


NLVR²: The results on the NLVR task are listed in Table 1. Transformer-based models (VisualBERT, LXMERT, and CMR) outperform the other models (N2NMN Hu et al. (2017), MAC Hudson and Manning (2018), and FiLM Perez et al. (2018)) by a large margin. This is due to the strong pre-trained single-modality representations and the Transformers’ ability to reshape the representations so that the spaces are aligned. Furthermore, CMR performs best among all Transformer-based baselines and achieves a new state-of-the-art. VisualBERT and CMR have a similar cross-modality alignment approach, yet CMR outperforms VisualBERT by 8.3 points on the test set. The gain mainly comes from the entity relevance and relational relevance that model the relations.

Models Dev Test
N2NMN 51.0 51.1
MAC-Network 50.8 51.4
FiLM 51.0 52.1
CNN+RNN 53.4 52.4
VisualBERT 67.4 67.0
LXMERT 74.9 74.5
CMR 75.4 75.3
Table 1: Accuracy on NLVR².

VQA v2.0:

In Table 2, we show the comparison with published models, excluding ensembles. The most competitive models are based on Transformers (ViLBERT Lu et al. (2019), VisualBERT Li et al. (2019b), VL-BERT Su et al. (2020), LXMERT Tan and Bansal (2019), and CMR). BUTD Anderson et al. (2018); Teney et al. (2018), ReGAT Li et al. (2019a), and BAN Kim et al. (2018) also employ attention mechanisms for relation-aware models. The proposed CMR achieves the best test accuracy on Y/N questions and Other questions. However, CMR does not achieve the best performance on Number questions, because Number questions require the ability to count within one modality, while CMR focuses on modeling relations between modalities; performance on counting might be improved by explicit modeling of quantity representations. CMR also achieves the best overall accuracy. In particular, we see a 1.6-point improvement over VisualBERT Li et al. (2019b) on the test-standard set, consistent with the NLVR results above. This shows the significance of the entity and relational relevance.

Model Dev Test Standard
Overall Y/N Num Other Overall
BUTD 65.32 81.82 44.21 56.05 65.67
ReGAT 70.27 86.08 54.42 60.33 70.58
ViLBERT 70.55 - - - 70.92
VisualBERT 70.80 - - - 71.00
BAN 71.4 87.22 54.37 62.45 71.84
VL-BERT 71.79 87.94 54.75 62.54 72.22
LXMERT 72.5 87.97 54.94 63.13 72.54
CMR 72.58 88.14 54.71 63.16 72.60
Table 2: Accuracy on VQA v2.0.

Another observation is that if we train CMR for the VQA task from scratch with random initialization, while still using the fixed BERT and Faster-RCNN, the model converges after 20 epochs. When we instead initialize the parameters with the model trained on NLVR², it takes 6 epochs to converge. The significant improvement in convergence speed indicates that the optimal model for VQA is close to that for NLVR.

5 Analysis

5.1 Model Size

To investigate the influence of model size, we empirically evaluated CMR on NLVR² with various Transformer sizes, since the Transformers contain most of the model’s parameters. All other details are kept the same as described in Section 4.2. The textual Transformer remains at 12 layers because it is the pre-trained BERT; most of the model’s parameters belong to the pre-trained BERT and the Transformer layers. Table 3 shows the results. Increasing the number of layers in the visual Transformer and the cross-modality Transformer tends to improve accuracy; however, the performance becomes stable beyond five layers. We therefore use five layers for the visual Transformer and the cross-modality Transformer in the other experiments.

Textual Visual Cross Dev Test
12 3 3 74.1 74.4
12 4 4 74.9 74.7
12 5 5 75.4 75.3
12 6 6 75.5 75.1
Table 3: Accuracy on NLVR² of CMR with various Transformer sizes. The numbers in the left part of the table indicate the number of self-attention layers.

5.2 Ablation Studies

To better understand the influence of each part of CMR, we perform ablation studies. Table 4 shows the performance of four ablated variations on NLVR².

Models Dev Test
CMR 75.4 75.3
without Single-Modality Transformer 68.2 68.5
without Cross-Modality Transformer 59.7 59.1
without Entity Relevance 70.6 71.2
without Relational Relevance 73.0 73.4
Table 4: Accuracy of different variations of CMR on NLVR².

Effect of Single Modality Transformer.

We remove both the textual and visual single-modality Transformers and instead map the raw input to the d-dimensional space with a linear transformation. Notice that the raw input of the textual modality consists of the WordPiece Wu et al. (2016) embeddings, segment embeddings, and position embeddings of each word, while that of the visual modality is the dense representation of each ROI extracted by Faster-RCNN. Removing the single-modality Transformers decreases the test accuracy by 6.8 points: single-modality Transformers play a critical role in producing strong contextualized representations for each modality.

Effect of Cross-Modality Transformer.

We remove the cross-modality Transformer and use the single-modality representations as entity representations. As shown in Table 4, the model degrades dramatically, and the test accuracy decreases by 16.2 points. This large gap demonstrates the crucial contribution of the cross-modality Transformer to aligning the representation spaces of the input modalities.

Effect of Entity Relevance.

We remove the entity relevance representation from the final feature. As shown in Table 4, the test accuracy is reduced by 4.1 points, a significant difference relative to the other Transformer-based models Li et al. (2019b); Lu et al. (2019); Tan and Bansal (2019). To highlight the significance of entity relevance, we visualize an example affinity matrix in Figure 3. The two major entities, “bird” and “branch”, are matched perfectly. More interestingly, the three ROIs matching the phrase “looking to left” capture an indicator (the beak), a direction (left), and the semantics of the whole phrase.

Figure 3: The entity affinity matrix between textual (rows) and visual (columns) modalities. The darker color indicates the higher relevance score. The ROIs with maximum relevance score for each word are shown paired with the words.

Effect of Relational Relevance.

We remove the relational relevance representation from the final feature. A 1.9-point decrease in test accuracy is observed in Table 4. We argue that, by modeling relational relevance, CMR captures high-order relations that are not captured by the entity relevance. We present two examples of textual relation ranking scores in Figure 4. The learned ranking scores highlight important pairs, for example “gold - top” and “looking - left”, which describe important relations in the textual modality.

Figure 4: The relation ranking scores of two example sentences. Darker colors indicate higher ranking scores.

6 Conclusion

In this paper, we propose a novel cross-modality relevance (CMR) framework for language and vision reasoning. In particular, we argue for the significance of relevance between the components of the two modalities for reasoning, including both entity relevance and relational relevance. We propose an end-to-end cross-modality relevance framework tailored for language and vision reasoning and evaluate it on the NLVR and VQA tasks, exceeding the state-of-the-art on the NLVR² and VQA v2.0 datasets. Moreover, the model trained on NLVR² boosts training on the VQA v2.0 dataset. The experiments and empirical analysis demonstrate CMR’s capability of modeling relational relevance for reasoning and, consequently, its better generalizability to unobserved data, indicating the significance of relevance patterns. Our proposed architectural component for capturing relevance patterns can be used independently of the full CMR architecture and is potentially applicable to other multi-modal tasks.


We thank the anonymous reviewers for their helpful comments. This project is supported by a National Science Foundation (NSF) CAREER award.


  • P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077–6086.
  • S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) VQA: visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2425–2433.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. Ranzato, and T. Mikolov (2013) DeViSE: a deep visual-semantic embedding model. In NIPS.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6325–6334.
  • J. Guo, Y. Fan, Q. Ai, and W. B. Croft (2016) A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 55–64.
  • R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko (2017) Learning to reason: end-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • P. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck (2013) Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 2333–2338.
  • D. A. Hudson and C. D. Manning (2018) Compositional attention networks for machine reasoning. In International Conference on Learning Representations (ICLR).
  • J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick (2017) CLEVR: a diagnostic dataset for compositional language and elementary visual reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems, pp. 1564–1574.
  • G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France. JMLR: W&CP volume 37.
  • R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73.
  • L. Li, Z. Gan, Y. Cheng, and J. Liu (2019a) Relation-aware graph attention network for visual question answering. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10312–10321.
  • L. H. Li, M. Yatskar, D. Yin, C. Hsieh, and K. Chang (2019b) VisualBERT: a simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • J. Lu, D. Batra, D. Parikh, and S. Lee (2019) ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23.
  • B. Mitra, F. Diaz, and N. Craswell (2017) Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web, pp. 1291–1299.
  • L. Pang, Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng (2016) Text matching as image recognition. In Thirtieth AAAI Conference on Artificial Intelligence.
  • E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil (2014) Learning semantic representations using convolutional neural networks for web search. In Proceedings of the 23rd International Conference on World Wide Web, pp. 373–374.
  • W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai (2020) VL-BERT: pre-training of generic visual-linguistic representations. In International Conference on Learning Representations.
  • A. Suhr, M. Lewis, J. Yeh, and Y. Artzi (2017) A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 217–223.
  • A. Suhr, S. Zhou, I. D. Zhang, H. Bai, and Y. Artzi (2018) A corpus for reasoning about natural language grounded in photographs. In ACL.
  • C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid (2019) VideoBERT: a joint model for video and language representation learning. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 7463–7472.
  • H. Tan and M. Bansal (2019) LXMERT: learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
  • D. Teney, P. Anderson, X. He, and A. van den Hengel (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4223–4232.
  • Y. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L. Morency, and R. Salakhutdinov (2019) Multimodal transformer for unaligned multimodal language sequences. In ACL.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean (2016) Google's neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144.
  • Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. In NeurIPS.