Referring Expression Grounding by Marginalizing Scene Graph Likelihood

06/09/2019 · Daqing Liu et al. · USTC and Nanyang Technological University

We focus on the task of grounding referring expressions in images, e.g., localizing "the white truck in front of a yellow one". To resolve this task fundamentally, one should first find the contextual objects (e.g., the "yellow" truck) and then exploit them to disambiguate the referent from other similar objects, by using the attributes and relationships (e.g., "white", "yellow", "in front of"). However, it is extremely challenging to train such a model, as the ground-truth of the contextual objects and their relationships is usually missing due to the prohibitive annotation cost. Therefore, nearly all existing methods evade the above joint grounding and reasoning process and resort to a holistic association between the sentence and region features. As a result, they suffer from heavy fully-connected layers, poor interpretability, and limited generalization to unseen expressions. In this paper, we tackle this challenge by training and inference with the proposed Marginalized Scene Graph Likelihood (MSGL). Specifically, we use a scene graph: a graphical representation parsed from the referring expression, where the nodes are objects with attributes and the edges are relationships. Thanks to the conditional random field (CRF) built on the scene graph, we can ground every object to its corresponding region, and perform reasoning with the unlabeled contexts by marginalizing them out using sum-product belief propagation. Overall, our proposed MSGL is effective and interpretable, e.g., on three benchmarks, MSGL consistently outperforms the state-of-the-art methods while offering a complete grounding of all the objects in a sentence.







1 Introduction

Grounding referring expressions (REF) in visual scenes (a.k.a. referring expression comprehension mao2016generation ) is perhaps the most natural form of human control over AI, e.g., "park the car beside the red sedan in front of the blue gate" for a self-driving car chen2018touchdown , and "who is the man in blue with a dress watch" for a visual Q&A (VQA) agent antol2015vqa . Beyond object detection ren2015faster , REF fundamentally requires understanding the language compositions (e.g., the linguistic meaning of "beside" and "with" connecting objects) and then using them as guidance to distinguish the referent from the contexts, especially those of the same class (e.g., "the man" vs. "other men"). In the era of deep learning, when many hard-core problems in natural language processing (e.g., language modeling vaswani2017attention ) and computer vision (e.g., recognition redmon2017yolo9000 ) are considered "well-addressed", one may take it for granted that the REF task is merely a straightforward joint visual detection of the referent and contexts parsed from the sentence. As shown in Figure 1, can't this problem be solved simply by grounding and comparing the entities mentioned in a sentence?

Indeed, we admit that the above simple idea should be the principled solution for REF. However, it is very challenging to realize by machine learning, mainly due to the prohibitive cost of annotating a complete grounding for all possible expressions, as the number of multi-object articulations in the visual world is combinatorially large. This is also a common challenge for many other visual reasoning tasks such as VQA johnson2017clevr , image captioning vinyals2017show , and visual dialog das2017visual . Therefore, given only the referent's ground-truth, almost all popular REF methods lower the requirement of joint grounding and reasoning to a holistic association score between the sentence and region features hu2017modeling ; zhang2018grounding ; yu2018mattnet . For example, the state-of-the-art method yu2018mattnet can only coarsely model the triplet score of (subject, predicate, object), regardless of the sentence complexity, e.g., the "object" may still have its own subordinate triplet decomposition, and so on. As these methods violate the nature of visual reasoning, even for a correct grounding result, the inference may not be faithful and interpretable to the language composition, and thus generalizes poorly to unseen expressions.

In fact, we are not the first to re-think the downside of the holistic models. Inspired by the success of neural module networks on synthetic VQA datasets hu2017learning ; shi2018explainable ; Johnson_2017_ICCV , where the visual reasoning is guided by the question parsing trees, researchers have attempted to localize the objects along the expression parsing trees for REF. However, due to the difficulty of training the dynamically assembled modules with massively missing annotations of the contexts, they are either significantly under-performing cirik2018using or easily degenerate to holistic scores with limited interpretability hong2019learning ; cao2018interpretable .

In this paper, we present a novel REF framework, called Marginalized Scene Graph Likelihood (MSGL), that offers joint modeling and reasoning with all the objects mentioned in a sentence. To obtain the semantic composition of a sentence at large, we use an off-the-shelf scene graph parser schuster2015generating to parse the sentence into a scene graph, where a node is an entity object modified by attributes, and an edge is a relationship between two nodes (cf. Figure 1). Such a scene graph offers a graphical inductive bias battaglia2018relational for the joint grounding and reasoning. As detailed in Section 3, we model a scene-graph-based Conditional Random Field (CRF), where the visual regions can be considered as the observational label space for configuring the scene graph. In particular, the unary and binary potentials are single and pairwise vision-language association scores, respectively. To train the CRF model without the ground-truth of context nodes, we propose to marginalize out the joint distribution of the contexts (e.g., "shirt" and "television" in Figure 1(a)) by using efficient sum-product belief propagation to obtain the marginal likelihood of the referent (e.g., "man"), which has a ground-truth and thus can be trained with cross-entropy loss. It is worth noting that the belief propagation can be considered as a visual reasoning process. For example, as shown in Figure 1(b), the likelihoods of "table" and "wine" help to pinpoint the grounding of the referent "man", and vice versa.

On three REF benchmarks yu2016modeling ; mao2016generation , we conduct extensive ablations and comparisons with state-of-the-art methods. Thanks to the fact that MSGL is a well-posed probabilistic graphical model, we get the best of both worlds: it consistently outperforms the popular holistic networks of low interpretability, while retaining the high interpretability of structural models.


Figure 1: The qualitative grounding results of our MSGL on RefCOCOg. Scene graph legends are: green shaded rectangle: referent node, colored rectangle: object node, arrow rectangle: attribute, oval: edge relationship. The same color of the bounding box and the node denotes a grounding.

2 Related Work

Referring Expression Grounding (REF). This task is to localize a region in an image, where the region is described by a natural language expression. It is fundamentally different from object detection ren2015faster and phrase localization plummer2017phrase because the key to REF is to fully exploit the linguistic composition to distinguish the referent from other objects, especially objects of the same category. Existing methods generally fall into two categories: 1) generative models mao2016generation ; yu2016modeling ; luo2017comprehension ; yu2017joint : they use a CNN-LSTM encoder-decoder to localize the region that can generate the sentence with maximum posterior probability; 2) discriminative models hu2017modeling ; yu2018mattnet ; yu2018rethining ; zhang2018grounding : they usually compute a joint vision-language embedding score for the sentence and region, which can be discriminatively trained with cross-entropy loss. Note that generative models can accomplish both referring expression comprehension (REF) and generation, while discriminative ones are only designed for the former. Our proposed MSGL belongs to the discriminative category.

Compared with the above discriminative models, which neglect the rich linguistic structure and focus on holistic grounding score calculation, we exploit the full linguistic structure: we parse the language into a scene graph schuster2015generating and then perform joint grounding of multiple objects and reasoning. Compared to cirik2018using , which uses tree-based neural networks, our model is a well-posed graphical model that is specialized to tackle the challenge of training without context ground-truth. Recent progress on neural module networks for synthetic VQA andreas2016neural ; cao2018visual ; hu2017learning has shown both interpretability and high performance. However, these methods rely on additional annotations to learn an accurate sequence-to-sequence, sentence-to-module layout parser, which is not available in general domains like REF. To this end, we propose to marginalize out the contexts by sum-product belief propagation in a CRF, which dates back to training CRFs with partial annotations tsuboi2008training .

Visual Reasoning with Scene Graphs. Scene graphs have been widely used in visual reasoning recently. Most existing works use a "visual scene graph" detected from images zellers2018neural . Visual scene graphs have been shown to boost a variety of vision-language tasks such as VQA teney2017graph ; shi2018explainable , REF peng2019grounding , and image captioning yao2018exploring . Our work is related to works using a "language scene graph", where a sentence is parsed into a scene graph anderson2016spice ; schuster2015generating , which can be considered as a structure with fewer linguistic compositions than a dependency parsing tree. Similar to visual scene graphs, the language counterpart serves as a reasoning inductive bias battaglia2018relational that regularizes the model training and inference, which has been shown useful in image generation johnson2018image and captioning yang2019caption . Unlike johnson2015image , which also used a CRF to ground the scene graph to images for calculating the similarity between the image and sentence, our work uses the CRF as a framework to obtain the marginalized node likelihood for the referent, which is only a word rather than the entire sentence. Besides, in their work, every entity in the scene graph is annotated, whereas our work has no annotations for the contexts at all.

3 Approach: Marginalized Scene Graph Likelihood

The overview of the proposed MSGL is illustrated in Figure 2. First, we extract region features from the image and a scene graph from the sentence (cf. Section 3.2). Second, we build a CRF model based on the scene graph (cf. Section 3.2), whose node potentials can be marginalized by belief propagation (cf. Section 3.3). Last, we use this marginal likelihood to train our model and infer the grounding results for all the objects, including the unlabeled contexts (cf. Section 3.3).

3.1 Task Formulation

The REF task aims to localize the referent region in an image given a referring expression. For an image I, we represent it as a set of regions R = {r_1, ..., r_N}, where N is the number of regions. The referring expression S is a sequence of words. The REF task can be formulated as a retrieval problem that returns the most probable grounding from S to R. Here, we introduce a random variable x, which denotes a grounding of the referent mentioned in S to a region in R, e.g., x = r_i. Formally, we have:

x* = argmax_x p(x | R, S).   (1)
In fact, almost all state-of-the-art models hu2017modeling ; yu2018mattnet ; yu2018rethining ; zhang2018grounding can be formulated as Eq. (1), where the composition in language is oversimplified, i.e., neither joint grounding of all the objects nor visual reasoning is taken into account. In contrast, we believe that a principled REF solution should be faithful to all the objects mentioned in S. In particular, we slightly abuse the notation S to also denote the set of objects in the language, and we assume that there are m groundings {x_1, ..., x_m} for the objects. Without loss of generality, we always denote the first grounding x_1 as the referent grounding. How we identify the referent object in S will be introduced later in Section 3.2. Formally, searching for the optimal referent grounding can be formulated as:

(x_1*, ..., x_m*) = argmax_{x_1, ..., x_m} p(x_1, ..., x_m | R, S),   (2)

where we take x_1* as the referent grounding. One can easily see that the key is to model the joint probability of all the groundings. However, it is challenging to learn such a joint probabilistic model without the ground-truth for the context groundings {x_2, ..., x_m}. Next, we will detail the implementation of the joint probability using a scene graph CRF and how to tackle this challenge with the marginalized likelihood.


Figure 2: The overview of our proposed Marginalized Scene Graph Likelihood (MSGL) method for referring expression grounding. Note that the potentials are updated after belief propagation, demonstrating the effectiveness of visual reasoning.

3.2 Scene Graph CRF

In conventional natural language processing tasks such as part-of-speech tagging tsuboi2008training , the structure of the language is considered as a sequence (or a chain) in the graphical model. In visual reasoning tasks like REF, a graph inductive bias is more appealing, as the object relationships are crucial to distinguish the referent from its similar contexts. Specifically, we construct a conditional random field (CRF) laffertyconditional based on the language scene graph schuster2015generating .

Scene Graph. As shown in Figure 1, a scene graph is defined as G = (V, E), where V is a set of nodes representing the objects, and E is a set of edges. Specifically, a node v_i contains a noun word (e.g., "man") and some attributes (e.g., "white"). An edge e_ij is a triplet ⟨v_i, r_ij, v_j⟩ which contains a subject v_i, a relationship r_ij, and an object v_j (e.g., ⟨"man", "watching", "television"⟩). In the rest of the paper, when the context is clear, we will say "scene graph" instead of G. In practice, we identify the referent node as the one whose in-degree is zero, because the referent is usually the central node modified by the others.
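To make this representation concrete, here is a minimal sketch (in Python; all class and field names are our own illustration, not from the paper's code) of a scene graph with the zero-in-degree referent rule described above:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # A node: a noun object plus optional attribute modifiers.
    noun: str
    attributes: list = field(default_factory=list)

@dataclass
class SceneGraph:
    # G = (V, E): nodes indexed by id; edges as (subject_id, relation, object_id).
    nodes: dict
    edges: list

    def referent(self):
        # The referent is the node with zero in-degree: it points to its
        # contexts via relationships but no edge points back to it.
        in_degree = {i: 0 for i in self.nodes}
        for _, _, obj in self.edges:
            in_degree[obj] += 1
        return next(i for i in self.nodes if in_degree[i] == 0)

# "the man in a white shirt watching television"
g = SceneGraph(
    nodes={0: Node("man"), 1: Node("shirt", ["white"]), 2: Node("television")},
    edges=[(0, "in", 1), (0, "watching", 2)],
)
```

Here `g.referent()` returns node 0 ("man"), matching the heuristic that the referent is the central node modifying the others.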

Conditional Random Field (CRF). By constructing a CRF on the scene graph, the joint probability in Eq. (2) can be factorized as:

p(x_1, ..., x_m | R, G) ∝ ∏_{v_i ∈ V} p(x_i) · ∏_{e_ij ∈ E} p(x_i, x_j),   (3)

where p(x_i) denotes the grounding likelihood of node v_i, and p(x_i, x_j) denotes the joint grounding likelihood of nodes v_i and v_j, whose relationship is r_ij. Since it is difficult to model the exact probabilities for nodes and edges, we re-write the above equation in terms of potential functions:

p(x_1, ..., x_m | R, G) = (1/Z) ∏_{v_i ∈ V} θ_i(x_i) ∏_{e_ij ∈ E} θ_ij(x_i, x_j),   (4)

where θ_i is the unary potential function for grounding node v_i, θ_ij is the binary potential for grounding the relationship r_ij, and Z is the normalizing partition function. In a nutshell, the scene-graph-based CRF offers an inductive bias for factorizing the joint probability in Eq. (2) into the much simpler unary and binary potentials in Eq. (4).
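As a sanity check on the factorization, the score of one joint grounding and the partition function Z of Eq. (4) can be computed by brute force on a tiny graph. This is a sketch with hypothetical potential tables, feasible only for a handful of nodes:

```python
import itertools
import numpy as np

def joint_score(unary, binary, assign):
    # Unnormalized Eq. (4): product of unary potentials theta_i(x_i)
    # and binary potentials theta_ij(x_i, x_j) over scene-graph edges.
    score = 1.0
    for i, theta in unary.items():
        score *= theta[assign[i]]
    for (i, j), theta in binary.items():
        score *= theta[assign[i], assign[j]]
    return score

def partition(unary, binary, n_regions):
    # Z: sum of joint scores over every grounding configuration
    # (exponential in the number of nodes -- illustration only).
    nodes = sorted(unary)
    return sum(
        joint_score(unary, binary, dict(zip(nodes, a)))
        for a in itertools.product(range(n_regions), repeat=len(nodes)))

# Two nodes, two candidate regions, one edge between the nodes.
unary = {0: np.array([2.0, 1.0]), 1: np.array([1.0, 1.0])}
binary = {(0, 1): np.ones((2, 2))}
Z = partition(unary, binary, 2)   # (2 + 1) * (1 + 1) = 6
```

Dividing `joint_score(...)` by `Z` then gives the normalized probability of a configuration.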

Unary Potential. It models how well the appearance of each region agrees with node v_i. We use the similarity score between regions and nodes as the unary potential:

θ_i(x_i = r_j) = softmax_j( FC( ℓ2(v_j) ⊙ ℓ2(u_i) ) ),   (5)

where v_j is the visual feature of region r_j and u_i is a word embedding feature for node v_i. If v_i is modified by attributes, u_i is the average of the noun object embedding and the attribute embeddings. ⊙ denotes element-wise multiplication and ℓ2(·) denotes L2 vector normalization. Note that to maintain the non-negativity of the potentials, we use a softmax function over all elements.
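A numpy sketch of the unary potential described above; the FC parameters W and b are hypothetical placeholders (the real model learns them end-to-end):

```python
import numpy as np

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def node_embedding(noun_vec, attr_vecs):
    # A node modified by attributes: average the noun embedding
    # with the attribute embeddings.
    return np.mean([noun_vec, *attr_vecs], axis=0)

def unary_potential(region_feats, node_embed, W, b):
    # Element-wise product of L2-normalized region features and the node
    # embedding, scored by one FC layer, then a softmax over regions so
    # the potential is non-negative.
    joint = l2norm(region_feats) * l2norm(node_embed)   # (N, d)
    return softmax(joint @ W + b)                       # (N,)

rng = np.random.default_rng(0)
N, d = 5, 8
theta = unary_potential(rng.normal(size=(N, d)),
                        node_embedding(rng.normal(size=d), [rng.normal(size=d)]),
                        rng.normal(size=d), 0.0)
```

The output is a distribution over the N candidate regions for one node.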

Binary Potential. Similarly, it models the agreement between a region pair and the two nodes involved in the edge relation r_ij:

θ_ij(x_i = r_a, x_j = r_b) = softmax_{a,b}( FC( ℓ2([v_a; v_b]) ⊙ ℓ2(u_ij) ) ),   (6)

where [·;·] denotes concatenation and u_ij is the averaged word embedding of all the relation words. It is worth noting that even though the belief propagation introduced later is undirected, our design of the binary potential preserves the directed property of the scene graph, thanks to the directional feature concatenation in Eq. (6).
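A matching sketch of the binary potential; how the relation embedding is broadcast to the concatenated dimension is our assumption about the shapes, and the FC parameters are again hypothetical:

```python
import numpy as np

def l2norm(x):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def binary_potential(region_feats, rel_embed, W, b):
    # Directional concatenation [v_a; v_b] for every ordered region pair,
    # combined element-wise with the relation embedding (tiled to 2d),
    # scored by an FC layer, then softmax over all N*N pairs.
    N, d = region_feats.shape
    v = l2norm(region_feats)
    pairs = np.concatenate([np.repeat(v, N, axis=0),      # v_a
                            np.tile(v, (N, 1))], axis=1)  # v_b
    joint = pairs * l2norm(np.tile(rel_embed, 2))         # (N*N, 2d)
    return softmax(joint @ W + b).reshape(N, N)           # theta[a, b]

rng = np.random.default_rng(1)
N, d = 4, 6
theta = binary_potential(rng.normal(size=(N, d)), rng.normal(size=d),
                         rng.normal(size=2 * d), 0.0)
```

Because [v_a; v_b] differs from [v_b; v_a], theta[a, b] ≠ theta[b, a] in general, which is how the potential preserves the edge direction.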

#Parameters. (More details are in the supplementary material.) Except for the trainable word embedding vectors, our CRF model has only two sets of FC parameters, in Eq. (5) and (6), whose number is significantly smaller than that of any existing model.

3.3 Training & Inference

Marginalization. When all the grounding variables have ground-truth annotations, training the parameters of the CRF is straightforward: optimize the log-likelihood of Eq. (4). However, in REF, there are no annotations for the context nodes and none for the edges. Therefore, we propose to marginalize out all the unlabeled variables:

p(x_1 | R, G) = Σ_{x_2} ··· Σ_{x_m} p(x_1, x_2, ..., x_m | R, G).   (7)

Now, one can easily train our graphical model with the cross-entropy loss on the marginalized likelihood of the referent x_1:

L = −log p(x_1 = r_gt | R, G),   (8)

where x_1 corresponds to the referent node discussed in the scene graph part of Section 3.2, and r_gt is the ground-truth region. There are two ways to infer the final grounding results of all the nodes in a scene graph. The first is to marginalize every node and take the region with the highest marginal likelihood as each node's grounding result. The second is the same as Eq. (2): find the joint configuration of the variables with the highest probability for the whole graph, and then pick out the referent grounding. We compare these two inference methods in Section 4.3. Note that with either inference method, we obtain the joint groundings of all the objects as shown in Figure 1, rather than only the referent as in previous works.
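The marginalization and cross-entropy loss above can be written down directly by enumeration. This reference sketch, with hypothetical potential tables, makes explicit the exponential cost that belief propagation later avoids:

```python
import itertools
import numpy as np

def referent_marginal(unary, binary, n_regions, referent=0):
    # Eq. (7): sum the joint score over all context groundings,
    # bucketed by the referent's grounding, then normalize.
    nodes = sorted(unary)
    marginal = np.zeros(n_regions)
    for assign in itertools.product(range(n_regions), repeat=len(nodes)):
        a = dict(zip(nodes, assign))
        score = 1.0
        for i in nodes:
            score *= unary[i][a[i]]
        for (i, j), theta in binary.items():
            score *= theta[a[i], a[j]]
        marginal[a[referent]] += score
    return marginal / marginal.sum()

def referent_loss(marginal, gt_region):
    # Cross-entropy on the marginalized referent likelihood.
    return -np.log(marginal[gt_region])

unary = {0: np.array([2.0, 1.0]), 1: np.array([1.0, 3.0])}
binary = {(0, 1): np.array([[1.0, 2.0], [3.0, 1.0]])}
marg = referent_marginal(unary, binary, 2)
```

Only the referent's marginal needs a label; the context variables are summed out.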

Belief Propagation. Since directly computing the marginalization in Eq. (7) requires enumerating all N^(m−1) context configurations, we adopt the sum-product belief propagation algorithm andres2012opengm to compute the marginal probability for every node, including the referent.

In a nutshell, the algorithm works by passing messages along the edges between nodes. At the beginning, we initialize the messages. Then, we choose the referent node as the root. After that, we first pass messages by depth-first search from the root, and second, pass messages along the inverse path. The message passing function is as follows:

m_{i→j}(x_j) = Σ_{x_i} θ_i(x_i) θ_ij(x_i, x_j) ∏_{k ∈ N(i)\{j}} m_{k→i}(x_i),   (9)

where N(i) indicates the neighbors of node v_i. Note that we simplify the functions in matrix form in Eq. (9). After that, we compute the beliefs for each node and edge, and the resultant marginals are:

p(x_i | R, G) ∝ θ_i(x_i) ∏_{k ∈ N(i)} m_{k→i}(x_i).   (10)
The above message passing rules perform like visual reasoning, accumulating supporting evidence for the referent. As illustrated in Figure 2, the initial potential for "man" cannot distinguish which man is the one "in the red jacket". After the belief propagation, the node collects evidence from its neighbor "jacket" and is able to tell which region is the "man in the red jacket". Similarly, "man" also provides supporting evidence to "skis". By accumulating evidence along the belief propagation paths, we obtain more accurate grounding results for the referent and its contexts.
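Putting the message passing of Eq. (9) and the resulting beliefs together, an exact sum-product pass on a tree-structured graph can be sketched as follows (our own minimal implementation, not the OpenGM library cited by the paper); for a two-node graph it reproduces the brute-force marginals:

```python
import numpy as np

def sum_product(unary, binary, edges, root=0):
    # Exact sum-product belief propagation on a tree: an upward
    # (leaves-to-root) pass and a downward pass of messages,
    # then node beliefs, i.e., the marginals.
    nbrs = {i: [] for i in unary}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def pairwise(i, j):
        # Tables are stored for directed edges; transpose when
        # traversing an edge backwards.
        return binary[(i, j)] if (i, j) in binary else binary[(j, i)].T

    msgs = {}

    def send(i, j):
        # m_{i->j}(x_j) = sum_{x_i} theta_i(x_i) * theta_ij(x_i, x_j)
        #                * prod_{k in N(i)\{j}} m_{k->i}(x_i)
        b = unary[i].copy()
        for k in nbrs[i]:
            if k != j:
                b = b * msgs[(k, i)]
        msgs[(i, j)] = b @ pairwise(i, j)

    def upward(i, parent):
        for k in nbrs[i]:
            if k != parent:
                upward(k, i)
        if parent is not None:
            send(i, parent)

    def downward(i, parent):
        for k in nbrs[i]:
            if k != parent:
                send(i, k)
                downward(k, i)

    upward(root, None)
    downward(root, None)

    marginals = {}
    for i in unary:
        b = unary[i].copy()
        for k in nbrs[i]:
            b = b * msgs[(k, i)]
        marginals[i] = b / b.sum()
    return marginals

unary = {0: np.array([2.0, 1.0]), 1: np.array([1.0, 3.0])}
binary = {(0, 1): np.array([[1.0, 2.0], [3.0, 1.0]])}
marginals = sum_product(unary, binary, edges=[(0, 1)], root=0)
```

The cost is linear in the number of edges rather than exponential in the number of nodes.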

4 Experiments

4.1 Datasets & Evaluation Metrics

We conducted our experiments on three REF datasets. RefCOCO yu2016modeling and RefCOCO+ yu2016modeling both collected expression annotations with an interactive game and are split into three parts, i.e., validation, testA, and testB, where testA contains images with multiple people and testB contains images with multiple objects. The difference between them is that location words, e.g., "left" and "behind", are banned in RefCOCO+ but not in RefCOCO. The average referring expression length is around 3.6 words. In RefCOCOg mao2016generation , the referring expressions were collected without an interactive game and have a longer average length of 8.43 words. Since our work focuses on grounding referring expressions based on scene graphs, we mainly ran the ablations on RefCOCOg, as the scene graphs parsed from longer sentences are of higher quality.

There are two evaluation settings for different purposes. The ground-truth setting (gt) provides the ground-truth candidate regions, and the goal is to find the best-matching region described by the referring expression. It filters out the noise from the object detector so that we can focus on visual reasoning. The detection setting (det) only provides an image and a referring expression, so we must extract regions first. It evaluates the overall performance of a practical grounding system. For det, we count a predicted region as correct if its IoU with the ground-truth is larger than 0.5.
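For reference, the IoU criterion under the det setting is the standard one; a sketch with boxes given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def is_correct(pred, gt, thresh=0.5):
    # det-setting criterion: a prediction counts as correct when
    # its IoU with the ground-truth box exceeds the threshold.
    return iou(pred, gt) > thresh
```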

4.2 Implementation Details

Language Settings. We built scene graphs by using the Stanford Scene Graph Parser schuster2015generating . Unlike the previous works which usually trim the length of the expressions for computational reasons, we kept the whole sentences for more accurate scene graph parsing. For the word embedding, we used 300-d GloVe pennington2014glove pre-trained word vectors as initialization.

Visual Representations. We followed MAttNet yu2018mattnet to extract the region features of an image. Specifically, we used a Faster R-CNN ren2015faster with a ResNet-101 he2016deep backbone, pre-trained on MS-COCO with an attribute head. We also incorporated location information, as relative location offsets, into the region features.

Parameter Settings. The model is trained with the Adam optimizer kingma2014adam for up to 30 epochs. The learning rate starts at 1e-3 and shrinks by a factor of 0.9 every 10 epochs. One mini-batch includes 128 images. For loopy belief propagation, we set the maximum number of iterations to 10.

Backbone. Our framework can easily take any other REF model as a backbone by using its grounding results to initialize MSGL's referent unary potential. We designed a baseline model as the backbone to evaluate the compatibility of our framework and to test whether it leads to a performance boost. The baseline deploys a bidirectional LSTM to encode the embedding vector of each word into a hidden vector. We then calculate a soft self-attention weight for each word and, with these weights, represent the referring expression as the weighted average of the word embeddings. Finally, we use a matching score function, similar to Eq. (5), to obtain the final grounding results.
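The attention pooling in this baseline can be sketched as follows; the hidden states would come from the BiLSTM, and the scoring vector w is a hypothetical learned parameter (here both inputs are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(hidden, word_embeds, w):
    # Score each word's hidden vector with the learned vector w,
    # softmax the scores into attention weights, then return the
    # weighted average of the word embeddings as the sentence feature.
    weights = softmax(hidden @ w)   # (T,)
    return weights @ word_embeds    # (d,)

T, h, d = 6, 10, 300
rng = np.random.default_rng(2)
sent_feat = attention_pool(rng.normal(size=(T, h)),
                           rng.normal(size=(T, d)),
                           rng.normal(size=h))
```

When all hidden states are equal the weights are uniform, and the pooling reduces to a plain average of the word embeddings.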

4.3 Quantitative Results

(Columns: RefCOCO val, testA, testB | RefCOCO+ val, testA, testB | RefCOCOg val*, val, test.)

Ground-truth (gt) regions:
Holistic Models
  MMI mao2016generation :          -, 63.15, 64.21 | -, 48.73, 42.13 | 62.14, -, -
  CMN hu2017modeling :             -, 75.94, 79.57 | -, 59.29, 59.34 | 69.30, -, -
  Speaker yu2017joint :            79.56, 78.95, 80.22 | 62.26, 64.60, 59.62 | 72.63, 71.65, 71.92
  VC zhang2018grounding :          -, 78.98, 82.39 | -, 62.56, 62.90 | 73.98, -, -
  Multi-hop Film strub2018visual : 84.90, 87.40, 83.10 | 73.80, 78.70, 65.80 | 71.50, -, -
  MAttN yu2018mattnet :            85.65, 85.26, 84.57 | 71.01, 75.13, 66.17 | -, 78.10, 78.12
Structural Models
  parser+CMN hu2017modeling :      -, -, - | -, -, - | 53.50, -, -
  parser+MAttN yu2018mattnet :     80.20, 79.10, 81.22 | 66.08, 68.30, 62.94 | -, 73.82, 73.72
  GroundNet cirik2018using :       -, -, - | -, -, - | 68.90, -, -
  RvG-Tree hong2019learning :      83.48, 82.52, 82.90 | 68.86, 70.21, 65.49 | 76.29, 76.82, 75.20
  MSGL (ours):                     85.69, 85.45, 85.12 | 72.30, 75.31, 67.50 | -, 79.11, 78.46

Detected (det) regions:
Holistic Models
  MMI mao2016generation :          -, 64.90, 54.51 | -, 54.03, 42.81 | 45.85, -, -
  CMN hu2017modeling :             -, 71.03, 65.77 | -, 54.32, 47.76 | 57.47, -, -
  Speaker yu2017joint :            69.48, 72.95, 63.43 | 55.71, 60.43, 48.74 | 59.51, 60.21, 59.63
  VC zhang2018grounding :          -, 73.33, 67.44 | -, 58.40, 53.18 | 62.30, -, -
  MAttN yu2018mattnet :            76.40, 80.43, 69.28 | 64.93, 70.26, 56.00 | -, 66.67, 67.01
Structural Models
  RvG-Tree hong2019learning :      75.06, 78.61, 69.85 | 63.51, 67.45, 56.66 | 66.20, 66.95, 66.51
  MSGL (ours):                     77.00, 81.56, 71.19 | 66.36, 71.08, 57.11 | -, 68.75, 68.89

Table 1: Comparison with state-of-the-art REF grounding models on the three datasets with ground-truth (gt) and detected (det) regions. In RefCOCOg, val* indicates the data split in mao2016generation , while no superscript indicates the data split in nagaraja2016modeling . Models marked with a superscript use ResNet features.
Train   Inference   Backbone | val (gt)   test (gt)
no      no          no       | 66.93      67.41
no      sum         no       | 66.97      67.41
sum     sum         no       | 74.08      74.56
sum     max         no       | 73.68      74.02
loopy   loopy       no       | 73.67      74.14
no      no          yes      | 77.47      77.89
sum     sum         yes      | 79.11      78.46

Table 2: Ablation study results on RefCOCOg with ground-truth (gt) regions. Train and Inference denote the belief propagation strategies: "no" for no belief propagation, "sum" for the exact sum-product algorithm, "loopy" for loopy sum-product belief propagation, and "max" for the exact max-product algorithm. Backbone indicates whether we use the baseline model to initialize the referent unary potential.
Figure 3: Human evaluation of our MSGL and RvG-Tree hong2019learning . Evaluators are asked how clearly they can understand the model's outputs and rate it on a 4-point scale. The percentage of each choice indicates that our MSGL is more interpretable to humans.

Comparisons with State-of-The-Arts. In Table 1, we compared our MSGL, which uses the aforementioned simple model as the backbone, with other state-of-the-art REF models proposed in recent years. As can be seen, our framework consistently outperforms the other methods on almost every dataset and split. Moreover, besides the referent grounding results, our framework also provides the grounding results of the context objects.

Ablative Study.

We conducted extensive ablative studies of our REF framework to explore different training and inference strategies, and also evaluated the compatibility of our framework. Table 2 shows the grounding results on the RefCOCOg dataset. We make the following observations: 1) Belief propagation in inference alone does not improve REF grounding; without the marginalized training strategy, the unary and binary potentials of the contexts can hardly be trained without annotations. 2) "Max" inference underperforms "sum" inference: training with "sum" but inferring with "max" creates a mismatch between training and inference. 3) "Loopy" does not outperform the other belief propagation strategies, even though the language scene graphs are not strictly acyclic. Besides, the "loopy" strategy is more time-consuming than "sum", as it passes messages for many iterations until convergence. 4) Marrying our framework to the backbone gains a considerable improvement. Even though we have not tested our model married with other REF models, we believe that MSGL will consistently boost their performance.

4.4 Qualitative Results


Figure 4: Qualitative results on RefCOCOg. For each sample, it contains: 1) the image with regions tagged by id numbers (top right), 2) the scene graph (bottom right), 3) the initial unary potentials for every node (top left), and 4) the updated unary potentials by belief propagation (bottom left).

As shown in Figure 4, we provide some qualitative grounding results for all the objects in the sentences. We can see that after updating the potentials by belief propagation, the likelihood becomes more concentrated (e.g., "woman" in (a)). Even when there are mistakes in the initial potentials, our framework can correct them after the updates (e.g., "boy" in (b)). Not only does the referent grounding become more accurate after belief propagation, but so do the context object groundings (e.g., "headband" in (c)). Our MSGL also works well on a complex graph (e.g., (d)). There are some failure cases, caused either by scene graph parsing errors (e.g., the object "bikini" is missing in (c), and "to left of" should be an integrated edge in (d)) or by the absence of corresponding regions (e.g., "beach" in (c) and "fence" in (d)). Human evaluations also show that, compared to the tree-parsing model, our model is more interpretable. More examples and evaluation settings are provided in the supplementary material.

5 Conclusions

We presented a novel REF framework called Marginalized Scene Graph Likelihood (MSGL), which jointly models all the objects mentioned in the referring expression and hence allows visual reasoning with the referent and its contexts. This is fundamentally different from existing methods, which can only model a holistic sentence-to-referent-region score and thus lack interpretability. MSGL first constructs a CRF model based on the scene graph parsed from the sentence, and then marginalizes out the unlabeled contexts by belief propagation. On three popular REF benchmarks, we showed that MSGL is not only higher-performing than other state-of-the-art methods, but also more interpretable.

As MSGL is a well-posed graphical model whose core is to learn the unary and binary potential functions, which can be considered as object detection and relationship detection zhang2017visual ; zellers2018neural , we see two interesting directions. First, once we have a high-quality visual scene graph detector to serve as the potential functions, MSGL is applicable to any REF task without training. Second, as annotating REF is relatively easier than labeling a complete visual scene graph, we may use MSGL to indirectly train a visual scene graph detector, i.e., it is possible to train a scene graph detector from REF supervision.


  • [1] Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. Spice: Semantic propositional image caption evaluation. In ECCV, 2016.
  • [2] Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In CVPR, 2016.
  • [3] Bjoern Andres, Thorsten Beier, and Jörg H Kappes. Opengm: A c++ library for discrete graphical models. arXiv preprint arXiv:1206.0111, 2012.
  • [4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In ICCV, 2015.
  • [5] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [6] Qingxing Cao, Xiaodan Liang, Bailin Li, and Liang Lin. Interpretable visual question answering by reasoning on dependency trees. In CVPR, 2018.
  • [7] Qingxing Cao, Xiaodan Liang, Bailing Li, Guanbin Li, and Liang Lin. Visual question reasoning on general dependency tree. In CVPR, 2018.
  • [8] Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. arXiv preprint arXiv:1811.12354, 2018.
  • [9] Volkan Cirik, Taylor Berg-Kirkpatrick, and Louis-Philippe Morency. Using syntax to ground referring expressions in natural images. In AAAI, 2018.
  • [10] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. Visual dialog. In CVPR, 2017.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [12] Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, and Hanwang Zhang. Learning to compose and reason with language tree structures for visual grounding. TPAMI, 2019.
  • [13] Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Kate Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.
  • [14] Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, and Kate Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017.
  • [15] Justin Johnson, Agrim Gupta, and Li Fei-Fei. Image generation from scene graphs. In CVPR, 2018.
  • [16] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
  • [17] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017.
  • [18] Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. In CVPR, 2015.
  • [19] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [20] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
  • [21] Ruotian Luo and Gregory Shakhnarovich. Comprehension-guided referring expressions. In CVPR, 2017.
  • [22] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
  • [23] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
  • [24] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
  • [25] Bryan A Plummer, Arun Mallya, Christopher M Cervantes, Julia Hockenmaier, and Svetlana Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, 2017.
  • [26] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. In CVPR, 2017.
  • [27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [28] Sebastian Schuster, Ranjay Krishna, Angel Chang, Li Fei-Fei, and Christopher D. Manning. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In ACL Workshop on Vision and Language, 2015.
  • [29] Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and explicit visual reasoning over scene graphs. In CVPR, 2019.
  • [30] Florian Strub, Mathieu Seurin, Ethan Perez, Harm De Vries, Jérémie Mary, Philippe Preux, Aaron Courville, and Olivier Pietquin. Visual reasoning with multi-hop feature modulation. In ECCV, 2018.
  • [31] Damien Teney, Lingqiao Liu, and Anton van den Hengel. Graph-structured representations for visual question answering. arXiv preprint, 2017.
  • [32] Yuta Tsuboi, Hisashi Kashima, Hiroki Oda, Shinsuke Mori, and Yuji Matsumoto. Training conditional random fields using incomplete annotations. In ACL, 2008.
  • [33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • [34] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. TPAMI, 2017.
  • [35] Peng Wang, Qi Wu, Jiewei Cao, Chunhua Shen, Lianli Gao, and Anton van den Hengel. Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks. In CVPR, 2019.
  • [36] Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. In CVPR, 2019.
  • [37] Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. Exploring visual relationship for image captioning. In ECCV, 2018.
  • [38] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L Berg. MAttNet: Modular attention network for referring expression comprehension. In CVPR, 2018.
  • [39] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In ECCV, 2016.
  • [40] Licheng Yu, Hao Tan, Mohit Bansal, and Tamara L Berg. A joint speaker-listener-reinforcer model for referring expressions. In CVPR, 2017.
  • [41] Zhou Yu, Jun Yu, Chenchao Xiang, Zhou Zhao, Qi Tian, and Dacheng Tao. Rethinking diversified and discriminative proposal generation for visual grounding. IJCAI, 2018.
  • [42] Rowan Zellers, Mark Yatskar, Sam Thomson, and Yejin Choi. Neural motifs: Scene graph parsing with global context. In CVPR, 2018.
  • [43] Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, and Tat-Seng Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017.
  • [44] Hanwang Zhang, Yulei Niu, and Shih-Fu Chang. Grounding referring expressions in images by variational context. In CVPR, 2018.

6 Supplementary Material

6.1 Number of Parameters

Our referring expression grounding model MSGL is extremely lightweight. Here we list the details of the unary and binary potential initialization functions described in Eq. (5) and Eq. (6):

Index | Input    | Operation                   | Output | Trainable Parameters
(1)   | -        | visual feature              | -      | -
(2)   | -        | embedding feature           | -      | -
(3)   | (1)      | fc(·)                       |        |
(4)   | (3)      | L2norm                      | -      | -
(5)   | (4), (2) | element-wise multiplication | (300)  | -
(6)   | (5)      | fc(·)                       |        |
Table 3: The details of unary potential initialization function Eq. (5).
Index | Input    | Operation                   | Output | Trainable Parameters
(1)   | -        | visual feature              | -      | -
(2)   | -        | visual feature              | -      | -
(3)   | -        | embedding feature           | -      | -
(4)   | (1), (2) | concatenation               | -      | -
(5)   | (4)      | fc(·)                       |        |
(6)   | (5)      | L2norm                      | -      | -
(7)   | (6), (3) | element-wise multiplication | (300)  | -
(8)   | (7)      | fc(·)                       |        |
Table 4: The details of binary potential initialization function Eq. (6).
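As a rough sketch (not the authors' code), the unary pipeline of Table 3 can be written out directly; the visual-feature dimension here (2048-d, as from a ResNet backbone) and the 300-d embedding size are assumptions for illustration:

```python
import numpy as np

def l2norm(x, eps=1e-8):
    """L2-normalize a vector (eps avoids division by zero)."""
    return x / (np.linalg.norm(x) + eps)

def unary_potential(visual_feat, word_emb, W1, b1, W2, b2):
    """Sketch of the unary pipeline in Table 3.

    visual_feat: region feature (dimension assumed, e.g. 2048-d);
    word_emb:    300-d node embedding;
    W1, b1:      first fc layer, projecting into the embedding space;
    W2, b2:      second fc layer, producing a scalar potential.
    """
    h = W1 @ visual_feat + b1   # (3) fc
    h = l2norm(h)               # (4) L2norm
    h = h * word_emb            # (5) element-wise multiplication (300-d)
    return W2 @ h + b2          # (6) fc -> scalar potential

# The binary potential of Table 4 is analogous: concatenate the two
# region features, then fc, L2norm, element-wise multiplication with
# the relationship embedding, and a final fc.
```

Note that the two fc layers and the embeddings are the only trainable weights, which is what keeps the model light.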

The word embedding vectors in our model are also trainable. Taking RefCOCOg, whose vocabulary size is 6,864, as an example, the overall number of parameters is:
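As a back-of-the-envelope illustration (assuming 300-d embedding vectors, consistent with the 300-d features in Tables 3 and 4), the embedding table alone contributes:

```python
# Hypothetical arithmetic: RefCOCOg vocabulary size, 300-d embeddings assumed.
vocab_size = 6864
emb_dim = 300
embedding_params = vocab_size * emb_dim
print(f"{embedding_params:,}")  # 2,059,200 parameters in the embedding table
```

The fc layers in Tables 3 and 4 add comparatively few parameters on top of this.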


6.2 Belief Propagation Algorithms

In this section, we detail the belief propagation algorithms as follows:

Initialize the messages m_{i→j}
Choose the referent node as the root
Send messages from the leaves to the root, and then from the root back to the leaves
Compute the beliefs b_i
Normalize the beliefs and return the marginals
Algorithm 1: Sum-Product Belief Propagation
1 Initialize the messages m_{i→j}
while not converged and max iterations not reached do
2       Send messages along all edges
3 end while
Compute the beliefs b_i
Normalize the beliefs and return the marginals
Algorithm 2: Loopy Belief Propagation

In the algorithms, m_{i→j} denotes the message from node i to edge j, b_i denotes the belief of node i, and N(·) denotes the neighbors of a node or an edge. The softmax function is applied over all elements of its input vector. The max-product belief propagation follows the same procedure as the sum-product version, except that the matrix multiplication is replaced by its max-product counterpart.
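To make the two-pass schedule of Algorithm 1 concrete, here is a minimal sum-product implementation for a tree-structured model over discrete variables (a sketch with illustrative potentials, not the paper's code; `msgs[(i, j)]` is the message from node i to node j):

```python
import numpy as np

def sum_product_tree(unary, edges, pairwise, root=0):
    """Sum-product belief propagation on a tree.

    unary:    list of (K_i,) non-negative node potentials
    edges:    list of (i, j) pairs forming a tree
    pairwise: dict mapping (i, j) -> (K_i, K_j) edge potential matrix
    Returns normalized marginal beliefs for every node.
    """
    n = len(unary)
    nbrs = {i: [] for i in range(n)}
    for (i, j) in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    msgs = {}  # msgs[(i, j)]: message from node i to node j

    def psi(i, j):
        # Edge potential oriented as (K_i, K_j).
        return pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T

    def send(i, j):
        # Product of i's unary potential and all incoming messages except j's,
        # then marginalize node i out through the edge potential.
        prod = unary[i].copy()
        for k in nbrs[i]:
            if k != j:
                prod = prod * msgs[(k, i)]
        m = psi(i, j).T @ prod
        msgs[(i, j)] = m / m.sum()  # normalize for numerical stability

    # Post-order edge list: each child->parent edge after its subtree's edges.
    order = []
    def dfs(i, parent):
        for k in nbrs[i]:
            if k != parent:
                dfs(k, i)
        if parent is not None:
            order.append((i, parent))
    dfs(root, None)

    for (i, j) in order:            # leaves -> root
        send(i, j)
    for (i, j) in reversed(order):  # root -> leaves
        send(j, i)

    beliefs = []
    for i in range(n):
        b = unary[i].copy()
        for k in nbrs[i]:
            b = b * msgs[(k, i)]
        beliefs.append(b / b.sum())
    return beliefs
```

Because the scene graph parsed from an expression is usually a tree, two passes suffice for exact marginals; when the graph has cycles, the same `send` step is instead iterated over all edges until convergence, as in Algorithm 2.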

6.3 Human Evaluation on Interpretability

In the experiment (cf. Section 4.4 and Figure 2 of the main paper), we conducted a human evaluation of the interpretability of MSGL and RvG-Tree. We invited 12 evaluators, and each evaluator rated 30 examples per model. For each example, the evaluators were asked to judge how clearly they could understand the grounding process and to rate it on a 4-point Likert scale, i.e., unclear, slightly clear, mostly clear, and clear. For a fair evaluation, we preprocessed the grounding results of each model into the same format (Figure 5) and presented the shuffled examples to the evaluators.


Figure 5: An evaluation example. Each example is rated on a 4-point scale. The evaluators are blind to which model generated the example. Specifically, the grounding process is collected from the output of each node in MSGL or the smallest sub-tree of each noun phrase in RvG-Tree. Note that we remove the structure information from the grounding results, i.e., the tree structure of RvG-Tree and the scene graph of MSGL, to avoid biasing the evaluators.

6.4 More Qualitative Results

In this section, we provide more qualitative results to demonstrate how belief propagation changes the potentials. For comparison, we also show two failure cases in the last row.


Figure 6: Qualitative grounding results of MSGL on the RefCOCOg test set. Scene graph legend: green shaded rectangle: referent node; colored rectangle: object node; arrow rectangle: attribute; oval: edge relationship. Matching colors between a bounding box and a node denote a grounding. Each sample contains: 1) the image region bounding boxes with id numbers (top right), 2) the language scene graph (bottom right), 3) the initial unary potentials for each node (top left), and 4) the unary potentials updated by belief propagation (bottom left). The sentence ID is provided for reproducibility.