Grounding natural language in images is one of the core AI tasks for testing machine comprehension of visual scenes and language. Perhaps the most fundamental yet successful grounding system for words is object detection (or segmentation): the image regions (or pixels) are classified to the corresponding word of the object class. Despite their diverse model architectures, their sole objective is to calculate a grounding score between a visual region and a language word, measuring the semantic association between the two modalities. Thanks to the development of deep visual features and language models, we can scale up word grounding to open vocabularies [29, 11] or even descriptive and relational phrases [39, 28].
However, grounding complex language sentences, e.g., referring expressions such as “an umbrella carried by a girl in pink boots”, is far different from the above word or phrase cases. For example, given the image in Figure 1, how would we humans locate the “umbrella”? One may follow this reasoning process: 1) Identify the referent “umbrella”, but there are two of them. 2) Use the contextual evidence “carried by a girl”, but there are two girls. 3) Use the more specific evidence “in pink boots” to locate the “girl” from the last step. 4) Finally, by accumulating the above evidence, locate the “umbrella”.
Unfortunately, existing language grounding methods generally rely on 1) a single monolithic grounding score fed with the whole sentence [24, 36, 22, 37] (Figure 1(a)), or 2) a contextual grounding score accounting for subject, predicate, and object phrases [12, 35] (Figure 1(b)). Though some of them adopt a word-level attention mechanism to focus on the informative language parts, their reasoning is still coarse compared to the above human-level reasoning. More seriously, such coarse grounding scores are easily optimized to learn certain vision-language patterns rather than visual reasoning; e.g., if most of the “umbrellas” in the dataset are “carried by people”, the score may not be responsive to other ones such as “people under umbrella stall”. Not surprisingly, this problem has been repeatedly discovered in many end-to-end vision-language embedding frameworks used in other tasks such as VQA and image captioning.
In this paper, we propose to exploit Dependency Parsing Trees (DPTs), which already offer an off-the-shelf schema for composite reasoning in natural language grounding. Specifically, to empower the visual grounding ability of DPTs, we propose a novel neural module network, the Neural Module Tree (NMTree), which provides an explainable grounding score in great detail. As illustrated in Figure 1(c), we transform a DPT into an NMTree by assembling three primitive module networks: Single for leaves and the root, and Sum and Comp for internal nodes. Each module calculates a grounding score (detailed in Section 3.3), which is then accumulated in a bottom-up fashion, simulating the visual evidence gained so far. For example, the Comp module at “carried” receives the scores gained by the “by” node and then calculates a new score for the region composition, meaning “something is carried by the thing that is already grounded by the ‘by’ node”. Thanks to the fixed reasoning schema, NMTree disentangles visual perception from composite reasoning to alleviate unnecessary vision-language bias, as the primitive modules receive consistent training signals with relatively simpler visual patterns and shorter language constituents.
One may be concerned about the potential brittleness caused by DPT parsing errors, which affect the robustness of the module assembly, as observed in most neural module networks applied in practice [10, 2]. We address this issue in three ways: 1) the assembly is simple: except for Single, which is fixed for leaves and the root, only Sum and Comp are to be determined at run-time; 2) Sum is merely an add operation that requires no visual grounding; 3) we adopt the recently proposed Gumbel-Softmax (GS) approximation for the discrete assembly decision. During training, the forward pass selects between the two modules with the GS sampler in a “hard” discrete fashion, while the backward pass updates all possible decisions using the straight-through gradient estimator in a “soft” robust way. By using GS, the entire NMTree can be trained end-to-end without any module layout annotations.
We validate the effectiveness of NMTree on three challenging referring expression grounding benchmarks: RefCOCO, RefCOCO+, and RefCOCOg. NMTree achieves new state-of-the-art performance on most of the test splits and grounding tasks. Qualitative results show that NMTree is the first transparent and explainable model in natural language grounding.
2 Related Work
Natural language grounding is a task that requires a system to localize a region in an image given a natural language expression. Different from object detection and phrase localization, the key to natural language grounding is to utilize the linguistic information to distinguish the target from other objects, especially objects of the same category.
Earlier generation-based methods use the CNN-LSTM structure to localize the region that can generate the expression with maximum a posteriori probability. More recently, joint embedding models [12, 35, 38] have become widely used; they model the conditional probability and then localize the region with maximum probability conditioned on the expression. Our model belongs to the second category. However, compared with previous works that neglect the rich linguistic structure, we step forward by taking it into account: we parse the language into a tree structure and then perform visual reasoning with several simple neural module networks. Compared to prior work that also relies on a parsing tree, our model parses in greater detail and its module assembly is learned end-to-end from scratch, while theirs is hand-crafted.
Although there are some works [12, 35] on using module networks in the visual grounding task, their modules are too coarse compared to ours. Fine-grained module networks are widely used in VQA [1, 10]. However, they rely on additional annotations to learn a sequence-to-sequence, sentence-to-module layout parser, which are not available in general domains. Our module layout is trained from scratch by using the Gumbel-Softmax training strategy, which has been shown to be empirically effective in neural architecture search [4, 33].
3 NMTree Model
In this section, we first formulate the problem of natural language grounding in Section 3.1. Then, using the walk-through example illustrated in Figure 2, we introduce how to build NMTree in Section 3.2 and how to calculate the grounding score using NMTree in Section 3.3. Finally, we detail the Gumbel-Softmax training strategy in Section 3.4.
3.1 Problem Formulation
The task of grounding a natural language sentence in an image can be reduced to a ranking problem. Formally, given an image $\mathcal{I}$, we represent it by a set of Region of Interest (RoI) features (e.g., by using Faster RCNN) $\mathcal{I} = \{x_1, x_2, \dots, x_K\}$, where $x_i \in \mathbb{R}^{d_x}$ and $K$ is the number of regions. For a natural language sentence $\mathcal{L}$, we represent it by a word sequence $\mathcal{L} = \{w_1, w_2, \dots, w_T\}$, where $T$ is the length of the sentence. Then, the task is to localize the target region $x^*$ by maximizing the grounding score $S(x_i, \mathcal{L})$ between any region and the sentence:
$$x^* = \arg\max_{x_i \in \mathcal{I}} S(x_i, \mathcal{L}). \tag{1}$$
Therefore, the key is to define a proper $S(\cdot)$ that distinguishes the target region from others by comprehending the language composition.
The pioneering grounding models [24, 36] are generally based on a holistic sentence-level language representation (cf. Figure 1(a)): $S(x_i, \mathcal{L}) = S_h(x_i, y_h)$, where $y_h$ is a feature representation for the whole language expression and $S_h(\cdot)$ can be any similarity function between the two vectors. More recently, a coarse composition was proposed to represent the sentence as a (subject, relationship, object) triplet (cf. Figure 1(b)). Thus, the score can be decomposed into a finer-grained composition:
$$S(x_i, \mathcal{L}) = S_s(x_i, y_s) + S_r([x_i, x_o], y_r) + S_o(x_o, y_o),$$
where the subscripts $s$, $r$, and $o$ indicate the three linguistic roles: subject, relationship, and object, respectively; $x_o$ is an estimated object region feature. However, these grounding scores over-simplify the composition of the language. For example, as shown in Figure 1(b), it is meaningful to decompose short sentences such as “umbrella carried by girl” into triplets, as there is a clear vision-language association for the individual “girl”, “umbrella”, and their relationship; but it is problematic for longer, more general sentences with clauses, e.g., even if “girl in pink boots” is identified as the object, it is still too coarse and difficult to ground.
To this end, we propose to use the Dependency Parsing Tree (DPT) as a fine-grained language decomposition, which empowers the grounding score to perform visual reasoning in great detail (cf. Figure 1(c)):
$$S(x_i, \mathcal{L}) = \sum_{t} S_t(x_i, \mathcal{L}_t), \tag{2}$$
where $t$ is a node in the tree and $S_t(\cdot)$ is a node-specific score function that calculates the similarity between a region $x_i$ and a node-specific language part $\mathcal{L}_t$. Intuitively, Eq. (2) is more human-like: it accumulates the evidence (e.g., grounding scores) while comprehending the language. Next, we will introduce how to implement Eq. (2).
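As a rough illustration of the ranking formulation and of the accumulated tree score in Eq. (2), the sketch below sums per-node scores over three candidate regions and picks the best one. The node names and score values are made up for illustration; they are not outputs of the model.

```python
import numpy as np

# Hypothetical per-node grounding scores for K = 3 candidate regions.
# Each array plays the role of one node-specific score over all regions.
node_scores = {
    "umbrella": np.array([2.0, 1.8, 0.1]),
    "carried": np.array([0.5, 1.2, 0.0]),
    "girl": np.array([0.1, 0.9, 0.2]),
}

def tree_score(scores_by_node):
    """Accumulate the node-level scores into one score per region."""
    return np.sum(list(scores_by_node.values()), axis=0)

def ground(scores_by_node):
    """Return the index of the region with the highest accumulated score."""
    return int(np.argmax(tree_score(scores_by_node)))

total = tree_score(node_scores)
best = ground(node_scores)
```

In this toy case the “umbrella” scores alone would favor region 0, while the accumulated evidence favors region 1, which is the kind of disambiguation the tree score is meant to provide.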
3.2 Sentence to NMTree
There are three steps to transform a sentence into the proposed NMTree, as shown in the bottom three blocks of Figure 2. First, we parse the sentence into a DPT, where every word is a tree node. Then, we encode each word and its linguistic information into a hidden vector by a Bidirectional Tree LSTM. Finally, we assemble the neural modules to the tree by using node hidden vectors.
Dependency Parsing Tree. We adopt a state-of-the-art dependency tree parser from the spaCy toolbox (spaCy 2: https://spacy.io/). As shown in Figure 2, it structures the words into a tree, where every node is a word with its part-of-speech (POS) tag and the directed edge from one node to another carries the dependency relation label, e.g., “riding” is VB (verb) and its nsubj (nominal subject) is “man”, which is NN (noun). The DPT offers an in-depth comprehension of a sentence, and its tree structure offers a reasoning path for language grounding. Note that a free-form sentence always yields some unnecessary syntax elements, such as determiners, symbols, and punctuation. We remove these nodes and edges to reduce the computational complexity without hurting the performance.
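The pruning step can be sketched on a hand-written parse. The token list below is a hypothetical example rather than actual parser output, the coarse tag names (`DET`, `PUNCT`, `SYM`) are an assumed tag set, and re-attaching orphaned children to the nearest kept ancestor is our own choice for the sketch.

```python
# A hand-written (hypothetical) parse of "the man riding a horse ."
# head == -1 marks the root of the tree.
tokens = [
    {"id": 0, "word": "the", "pos": "DET", "dep": "det", "head": 1},
    {"id": 1, "word": "man", "pos": "NOUN", "dep": "nsubj", "head": 2},
    {"id": 2, "word": "riding", "pos": "VERB", "dep": "ROOT", "head": -1},
    {"id": 3, "word": "a", "pos": "DET", "dep": "det", "head": 4},
    {"id": 4, "word": "horse", "pos": "NOUN", "dep": "dobj", "head": 2},
    {"id": 5, "word": ".", "pos": "PUNCT", "dep": "punct", "head": 2},
]

DROPPED_POS = {"DET", "PUNCT", "SYM"}

def prune(tokens, dropped=DROPPED_POS):
    """Drop determiner/symbol/punctuation nodes and re-attach any of their
    children to the nearest ancestor that is kept."""
    by_id = {t["id"]: t for t in tokens}

    def kept_head(head):
        # Walk up until we reach the root marker or a node that is kept.
        while head != -1 and by_id[head]["pos"] in dropped:
            head = by_id[head]["head"]
        return head

    return [dict(t, head=kept_head(t["head"]))
            for t in tokens if t["pos"] not in dropped]

def children(tokens):
    """Map each node id to the ids of its children."""
    out = {t["id"]: [] for t in tokens}
    for t in tokens:
        if t["head"] != -1:
            out[t["head"]].append(t["id"])
    return out

pruned = prune(tokens)
tree = children(pruned)
```

After pruning, only “man”, “riding”, and “horse” remain, with “riding” as the root and the other two as its children.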
Bidirectional Tree LSTM. Once the DPT is obtained, we encode each node into a hidden vector by a bidirectional tree-structured LSTM. This bidirectional (i.e., bottom-up and top-down) propagation makes each node aware of the information from both its children and its parent, which is particularly crucial for capturing the context in a sentence. For any node $t$, we embed the word $w_t$, POS tag $p_t$, and dependency relation label $d_t$ into a concatenated embedding vector as:
$$e_t = [\mathbf{E}_w \Pi_w;\ \mathbf{E}_p \Pi_p;\ \mathbf{E}_d \Pi_d], \tag{3}$$
where $\mathbf{E}_w$, $\mathbf{E}_p$, and $\mathbf{E}_d$ are trainable embedding matrices, and $\Pi_w$, $\Pi_p$, and $\Pi_d$ are one-hot encodings for the word, POS tag, and dependency relation label, respectively. Each embedding matrix has one row per embedding dimension and one column per vocabulary entry.
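To make the concatenated embedding concrete, the following sketch multiplies three embedding matrices by one-hot vectors and concatenates the results. All vocabulary sizes and embedding dimensions here are hypothetical placeholders, and the matrices are randomly initialized rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding dimensions (illustrative only).
V_WORD, V_POS, V_DEP = 10, 5, 6
D_WORD, D_POS, D_DEP = 4, 2, 2

# One matrix per input type, shaped (embedding dim, vocabulary size).
E_word = rng.normal(size=(D_WORD, V_WORD))
E_pos = rng.normal(size=(D_POS, V_POS))
E_dep = rng.normal(size=(D_DEP, V_DEP))

def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def embed_node(word_id, pos_id, dep_id):
    """Concatenate the word, POS-tag, and dependency-label embeddings."""
    return np.concatenate([
        E_word @ one_hot(word_id, V_WORD),
        E_pos @ one_hot(pos_id, V_POS),
        E_dep @ one_hot(dep_id, V_DEP),
    ])

e_t = embed_node(word_id=3, pos_id=1, dep_id=4)
```

Multiplying by a one-hot vector simply selects one column of the matrix, which is why implementations usually replace the product with a table lookup.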
Our tree LSTM implementation is based on the Child-Sum Tree LSTM. For one direction, a node $t$ receives the LSTM states from its children and its embedding vector $e_t$ as input to update the state:
$$c_t, h_t = \text{TreeLSTM}(e_t, \{c_{t_j}\}, \{h_{t_j}\}), \tag{4}$$
where $c_{t_j}$ and $h_{t_j}$ denote the cell and hidden vectors of the $j$-th child of node $t$. By applying the TreeLSTM in two directions, we obtain the final node representation:
$$h_t = [h_t^{\uparrow}; h_t^{\downarrow}], \tag{5}$$
where $h_t^{\uparrow}$ and $h_t^{\downarrow}$ denote the hidden vectors encoded in the bottom-up and top-down directions, respectively. We initialize all leaf nodes with zero hidden and cell states. The bottom-up and top-down Tree LSTMs have independent trainable parameters.
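For readers unfamiliar with the Child-Sum Tree LSTM, the sketch below implements one bottom-up pass in NumPy following the standard Child-Sum update (summed child hidden states for the input, output, and update gates, and one forget gate per child). It is a minimal illustration with random weights and a toy three-node tree, not the implementation used in the experiments; the top-down direction, which would use a second cell with its own parameters, is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ChildSumTreeLSTMCell:
    """One direction of a Child-Sum Tree LSTM, written with NumPy only."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=0.1, size=shape)
        # Input weights (W), recurrent weights (U), and biases (b) for the
        # input (i), forget (f), and output (o) gates and the update (u).
        self.W = {g: init(hidden_dim, input_dim) for g in "ifou"}
        self.U = {g: init(hidden_dim, hidden_dim) for g in "ifou"}
        self.b = {g: np.zeros(hidden_dim) for g in "ifou"}
        self.hidden_dim = hidden_dim

    def _gate(self, g, x, h):
        return self.W[g] @ x + self.U[g] @ h + self.b[g]

    def __call__(self, x, child_states):
        """x: node embedding; child_states: list of (c, h) pairs."""
        if not child_states:
            # Leaves see zero cell and hidden states.
            zeros = np.zeros(self.hidden_dim)
            child_states = [(zeros, zeros)]
        h_sum = np.sum([h for _, h in child_states], axis=0)
        i = sigmoid(self._gate("i", x, h_sum))
        o = sigmoid(self._gate("o", x, h_sum))
        u = np.tanh(self._gate("u", x, h_sum))
        c = i * u
        # One forget gate per child, conditioned on that child's hidden state.
        for c_k, h_k in child_states:
            c = c + sigmoid(self._gate("f", x, h_k)) * c_k
        h = o * np.tanh(c)
        return c, h

def encode_bottom_up(cell, node, embeddings, children, out):
    """Post-order traversal: children are encoded before their parent."""
    states = [encode_bottom_up(cell, k, embeddings, children, out)
              for k in children[node]]
    out[node] = cell(embeddings[node], states)
    return out[node]

# Toy tree: node 0 is the root with two leaf children, 1 and 2.
children = {0: [1, 2], 1: [], 2: []}
embeddings = {n: np.full(3, 0.1 * (n + 1)) for n in children}
cell = ChildSumTreeLSTMCell(input_dim=3, hidden_dim=4)
states = {}
c_root, h_root = encode_bottom_up(cell, 0, embeddings, children, states)
```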
Module Assembler. Given the node representation $h_t$, we feed it into a module assembler that determines which module is assembled at node $t$. As we will detail in Section 3.3, we have three modules: Single, Sum, and Comp. Since Single is always assembled at the leaves and the root, the assembler only needs to choose between Sum and Comp for the remaining nodes:
$$\text{Sum or Comp} \leftarrow \arg\max\ \text{softmax}(\text{fc}(h_t)), \tag{6}$$
where fc is a linear mapping from the input feature to a 2-d vector, indicating the relative scores for Sum and Comp, respectively. It is worth noting that the assembler is not purely linguistic even though Eq. (6) is based on DPT node features. In fact, thanks to the end-to-end training, visual cues are eventually incorporated into the parameters of Eq. (6). Due to the discrete and non-differentiable nature of argmax, we use the Gumbel-Softmax strategy detailed in Section 3.4 for training. Figure 3 illustrates which types of words are likely to be assembled by each module. We find that the Sum module covers more visible words (e.g., adjectives and nouns), and the Comp module covers more words describing relations (e.g., verbs and prepositions). This reveals the explainable potential of NMTree.
3.3 NMTree Modules
Given the above assembled NMTree, we can implement the tree grounding score proposed in Eq. (2) by accumulating the scores in a bottom-up fashion. There are three types of modules used in NMTree, i.e., Single, Sum, and Comp. Each module at node $t$ updates the grounding scores $s_t$ for all the regions and outputs them to its parent. Thanks to the score output, NMTree is explainable, as the scores can be visualized as an attention map to investigate the grounding at each node. Figure 4 illustrates an extreme example with a very long expression of 22 tokens; by using the neural modules in NMTree, the model still works well and reasons from bottom to top with an explainable intermediate process. We first introduce the common functions used in the modules and then detail each module.
Language Representation. For a node $t$ with children $\{t_j\}$, every module is aware of the node set rooted at $t$, i.e., $\mathcal{N}_t = \{t, t_j\}$. We have two language representations, $y_t^s$ and $y_t^p$, where $y_t^s$ is associated with a single visual feature and $y_t^p$ is associated with a pairwise visual feature. Specifically, each language representation is calculated as the weighted sum of the node embedding vectors in the node set of node $t$:
$$y_t = \sum_{i \in \mathcal{N}_t} \alpha_i e_i, \tag{7}$$
where $\alpha_i$ are the node-level attention weights calculated from the corresponding node hidden features: $\alpha_i = \text{softmax}(\text{fc}(h_i))$. Note that $y_t^s$ and $y_t^p$ have independent fc parameters. It is worth noting that these weighted averages of the word embeddings in the node set reduce the negative impact of DPT parsing errors.
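The attention-weighted language representation can be sketched as follows. The dimensions, the random inputs, and the parameter names `w_single` and `w_pair` are assumptions made for illustration; the only point is that the two representations share the same embeddings and hidden vectors but use separate linear parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()

def language_representation(embeddings, hiddens, w, b=0.0):
    """Attention-weighted sum of node embeddings.

    embeddings: (n, d_e) embedding vectors of the nodes in the node set
    hiddens:    (n, d_h) hidden vectors of the same nodes
    w, b:       parameters of a linear layer mapping a hidden vector to a scalar
    """
    logits = hiddens @ w + b     # one attention logit per node
    alpha = softmax(logits)      # attention weights, summing to 1
    return alpha @ embeddings, alpha

rng = np.random.default_rng(0)
emb = rng.normal(size=(3, 5))    # e.g. a node and its two children
hid = rng.normal(size=(3, 4))
w_single = rng.normal(size=4)    # separate parameters for the single ...
w_pair = rng.normal(size=4)      # ... and the pairwise representation

y_single, a_single = language_representation(emb, hid, w_single)
y_pair, a_pair = language_representation(emb, hid, w_pair)
```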
Score Functions. There are two types of score functions used in our modules: the single score function $S^s$ and the pairwise score function $S^p$, where $S^s$ measures the similarity between a single region $x_i$ and a language representation $y^s$, and $S^p$ indicates how likely a region pair $(x_i, x_j)$ matches a relationship $y^p$. Formally, we define them as:
$$S^s(x_i, y^s) = \text{fc}(\text{L2norm}(\text{fc}(x_i) \odot y^s)), \tag{8}$$
$$S^p(x_i, x_j, y^p) = \text{fc}(\text{L2norm}(\text{fc}([x_i; x_j]) \odot y^p)), \tag{9}$$
where $\odot$ is element-wise multiplication and L2norm normalizes a vector to unit L2 norm.
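A possible realization of the two score functions is sketched below. It assumes the form “linear map of the visual input, element-wise product with the language vector, L2 normalization, linear map to a scalar”; this composition, the class name `ScoreFunction`, and all dimensions are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def l2norm(v, eps=1e-12):
    """Scale a vector to unit L2 norm."""
    return v / (np.linalg.norm(v) + eps)

class ScoreFunction:
    """score = w_out . L2norm((W_in v + b_in) * y) + b_out  (assumed form)."""

    def __init__(self, visual_dim, lang_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=0.1, size=(lang_dim, visual_dim))
        self.b_in = np.zeros(lang_dim)
        self.w_out = rng.normal(scale=0.1, size=lang_dim)
        self.b_out = 0.0

    def __call__(self, visual, y):
        fused = l2norm((self.W_in @ visual + self.b_in) * y)
        return float(self.w_out @ fused + self.b_out)

D_X, D_Y = 6, 4  # hypothetical region-feature and language dimensions
single_score = ScoreFunction(D_X, D_Y, seed=1)
pair_score = ScoreFunction(2 * D_X, D_Y, seed=2)  # input is a concatenated pair

rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=D_X), rng.normal(size=D_X)
y_s, y_p = rng.normal(size=D_Y), rng.normal(size=D_Y)

s_single = single_score(x_i, y_s)
s_pair = pair_score(np.concatenate([x_i, x_j]), y_p)
```

Because the fused vector has unit norm, the magnitude of each score is bounded by the norm of the output weights, which keeps scores from different nodes on a comparable scale before they are summed.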
Single Module. It is assembled at the leaves and the root. Its job is to 1) calculate a single score for each region and the current language feature using Eq. (8), 2) add this new score to the scores collected from its children, and 3) pass the sum to its parent. As illustrated in Figure 2, its design motivation is to initiate the bottom-up grounding with the most elementary words and to finalize the grounding by passing the accumulated scores to ROOT:
$$s_t \leftarrow S^s(x_i, y_t^s) + \sum_j s_{t_j}, \tag{10}$$
where $s_{t_j}$ is the score passed from the $j$-th child. Note that leaves have no children, so the node set of a leaf contains only the leaf itself and the sum over child scores is zero.
Sum Module. It plays a transitional role during the reasoning process. It simply sums up the scores passed from its children and passes the sum to its parent. As illustrated in Figure 2, it intuitively passes along the easy-to-locate words (cf. Figure 3(a)) such as “horse” and “man” to help the subsequent composite grounding:
$$s_t \leftarrow \sum_j s_{t_j}. \tag{11}$$
Note that this module has no parameters, and hence it significantly reduces the complexity of our model.
Comp Module. This is the core module for composite visual reasoning. As shown in Figure 3(b), it is typically assembled at relationship words that connect two language constituents. It first computes an “average region” visual feature $\tilde{x}_t$, a weighted average of the region features in which the weights are grounded by the sum of the single scores. In particular, $\tilde{x}_t$ can be considered as the contextual region that supports the target region score, e.g., “what is riding the horse” in Figure 2. Therefore, this module outputs the target region score for its parent:
$$s_t \leftarrow S^p(x_i, \tilde{x}_t, y_t^p). \tag{12}$$
Recall that $y_t^p$ is the pairwise language feature that represents the relationship words.
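The interplay of the three modules can be sketched with a small recursive evaluator. Everything concrete here is an assumption for illustration: the toy region features, the stand-in score functions `single_fn` and `pair_fn`, and in particular the choice to compute the Comp context as a softmax-weighted average using the node's own single scores plus the scores from its children.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def run_node(node, tree, regions, single_fn, pair_fn):
    """Return accumulated scores (one per region) for the subtree at `node`."""
    spec = tree[node]
    child_sum = np.zeros(len(regions))
    for child in spec["children"]:
        child_sum = child_sum + run_node(child, tree, regions, single_fn, pair_fn)

    if spec["module"] == "Sum":
        return child_sum  # parameter-free pass-through of child scores
    single = np.array([single_fn(x, node) for x in regions]) + child_sum
    if spec["module"] == "Single":
        return single
    if spec["module"] == "Comp":
        # Assumed form: soft attention over regions gives the context region,
        # then every region is scored against it with the pairwise function.
        context = softmax(single) @ regions
        return np.array([pair_fn(x, context, node) for x in regions])
    raise ValueError(f"unknown module {spec['module']!r}")

# Toy regions with features [is_umbrella, is_girl, horizontal position].
regions = np.array([[1.0, 0.0, 0.0],   # umbrella far from the girl
                    [1.0, 0.0, 5.0],   # umbrella next to the girl
                    [0.0, 1.0, 5.2]])  # the girl
WORD_VECS = {"umbrella": np.array([4.0, 0.0, 0.0]),
             "carried": np.zeros(3),
             "girl": np.array([0.0, 4.0, 0.0])}

single_fn = lambda x, node: float(x @ WORD_VECS[node])
pair_fn = lambda x, ctx, node: -abs(x[2] - ctx[2])  # closer to context is better

tree = {"umbrella": {"module": "Single", "children": ["carried"]},
        "carried": {"module": "Comp", "children": ["girl"]},
        "girl": {"module": "Single", "children": []}}

scores = run_node("umbrella", tree, regions, single_fn, pair_fn)
best = int(np.argmax(scores))
```

In this toy setting both umbrellas receive the same single score at the root, and the Comp node breaks the tie in favor of the umbrella closest to the grounded “girl” region.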
3.4 NMTree Training
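With the assembled NMTree, the overall grounding score in Eq. (2) can be calculated in a bottom-up fashion. Suppose $x_{gt}$ is the ground-truth region; the cross-entropy loss is:
$$L(\Theta; x_{gt}, \mathcal{L}) = -\log \text{softmax}(S(x_{gt}, \mathcal{L}; \Theta)), \tag{13}$$
where $\Theta$ is the trainable parameter set, $\mathcal{L}$ is the sentence, and the softmax is taken across all regions in an image.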
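The loss amounts to a negative log-softmax over the per-region scores of one image, evaluated at the ground-truth region. A minimal sketch, with made-up score values:

```python
import numpy as np

def grounding_loss(scores, gt_index):
    """Negative log-softmax of the ground-truth region's score."""
    shifted = scores - np.max(scores)  # shift for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[gt_index])

scores = np.array([-1.1, 3.9, -0.1])  # hypothetical accumulated region scores
loss_right = grounding_loss(scores, gt_index=1)
loss_wrong = grounding_loss(scores, gt_index=0)
```

The loss is small when the ground-truth region already has the highest score and grows as other regions outscore it.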
Recall that, in the inference phase, the assembler in Eq. (6) makes a discrete decision, which blocks end-to-end training. Therefore, we utilize the Gumbel-Softmax strategy, which has been shown to be effective in recent works on architecture search [4, 33]. For more details, please refer to those papers. Here, we only introduce how to apply Gumbel-Softmax to NMTree training.
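Forward. We add Gumbel noise to the relative scores (i.e., $\log \mathbf{p}$, where $\mathbf{p} = \text{softmax}(\text{fc}(h_t))$) of each module, which introduces stochasticity for exploring module assemblies. Specifically, we parameterize the assembler decision as a 2-d one-hot vector $\mathbf{z}$, where the index of the non-zero entry indicates the decision:
$$\mathbf{z} = \text{one\_hot}(\arg\max(\log \mathbf{p} + \mathbf{G})), \tag{14}$$
where $\mathbf{G}$ is noise drawn i.i.d. from Gumbel(0, 1). (The Gumbel(0, 1) distribution is sampled by $G = -\log(-\log(U))$, where $U \sim \text{Uniform}(0, 1)$.)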
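Backward. We take a continuous approximation that relaxes $\mathbf{z}$ to $\tilde{\mathbf{z}}$ by replacing argmax with softmax:
$$\tilde{\mathbf{z}} = \text{softmax}((\log \mathbf{p} + \mathbf{G}) / \tau), \tag{15}$$
where $\log \mathbf{p}$ are the relative scores of the two modules, $\mathbf{G}$ is the same sample drawn in the forward pass (i.e., we reuse the noise samples), and $\tau$ is a temperature parameter: the softmax approaches argmax as $\tau \to 0$ and approaches the uniform distribution as $\tau \to \infty$.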
Although there is a bias due to the mismatch between the forward and backward pass, we empirically observe that the Gumbel-Softmax strategy performs well in our experiments.
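The forward and backward quantities can be sketched in NumPy as below. Since NumPy has no automatic differentiation, the sketch only computes the hard decision and its soft relaxation from the same noise; the module probabilities are made-up values, and the empirical frequency check is included only to show that the hard samples follow those probabilities.

```python
import numpy as np

def sample_gumbel(shape, rng, eps=1e-20):
    """Gumbel(0, 1) noise via G = -log(-log(U)), U ~ Uniform(0, 1)."""
    u = rng.uniform(size=shape)
    return -np.log(-np.log(u + eps) + eps)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gumbel_softmax(log_p, tau, rng):
    """Return (hard one-hot decision, soft relaxation) from the same noise."""
    noisy = log_p + sample_gumbel(log_p.shape, rng)
    hard = np.zeros_like(log_p)
    hard[np.argmax(noisy)] = 1.0  # discrete decision used in the forward pass
    soft = softmax(noisy / tau)   # relaxation whose gradient is used backward
    return hard, soft

rng = np.random.default_rng(0)
log_p = np.log(np.array([0.7, 0.3]))  # hypothetical Sum/Comp probabilities
hard, soft = gumbel_softmax(log_p, tau=1.0, rng=rng)

# Empirical check: hard samples follow the underlying probabilities.
draws = np.array([gumbel_softmax(log_p, 1.0, rng)[0] for _ in range(5000)])
freq = draws.mean(axis=0)
```

In an automatic-differentiation framework, the straight-through estimator is commonly written as the hard sample plus the soft sample minus a gradient-stopped copy of the soft sample, so that the forward value is discrete while gradients flow through the relaxation.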
4 Experiments
4.1 Datasets
We conducted our experiments on the following three datasets, all collected from MS-COCO images.
RefCOCO contains 142,210 referring expressions for 19,994 images. An interactive game is used during the expression collection. All expression-referent pairs are split into train, validation, testA, and testB. The train and validation partitions contain 120,624 and 10,834 pairs, respectively. TestA contains 5,657 pairs where each image contains multiple people. TestB contains 5,095 pairs where each image contains multiple objects.
RefCOCO+ contains 141,564 referring expressions for 49,856 objects in 19,992 images. It is collected with the same interactive game as RefCOCO and split into 120,191, 10,758, 5,726, and 4,889 expressions for train, validation, testA, and testB, respectively. The difference from RefCOCO is that RefCOCO+ only allows expressions that describe appearance, not location.
RefCOCOg contains 95,010 referring expressions for 49,822 objects in 25,799 images. Different from RefCOCO and RefCOCO+, it is collected in a non-interactive way and contains longer expressions that describe both appearance and location. There are two types of data partitions. The first partition splits the dataset into train and validation sets. As there is no open test split released, we evaluated on the validation set and denote it as “val*”. The second partition divides the images into train, validation, and test splits. We conducted most RefCOCOg experiments on this division and denote its validation set as “val” and its test set as “test”.
4.2 Implementation Details and Metrics
Language Pre-Processing. We built specific vocabularies for the three datasets from the words, POS tags, and dependency labels that appeared more than once in the dataset. Note that to obtain accurate parsing results, we did not trim the length of the expressions. We used GloVe pre-trained word vectors to initialize our word vectors. For dependency label vectors and Part-of-Speech (POS) tag vectors, we trained them from scratch with random initialization. Words, POS tags, and dependency labels each had their own embedding size.
Visual Representations. To extract RoI features of an image, we followed a procedure similar to that of MAttNet. It is based on a Faster RCNN with ResNet-101 as the backbone, trained with attribute heads. Besides, we also incorporated spatial features. Finally, the visual representation dimension was set to 3,072. For fair comparison, we also used VGG-16 as the backbone, in which case the dimension was set to 5,120.
Parameter Settings. We optimized our model with the Adam optimizer for up to 40 epochs. The learning rate was initialized to 1e-3 and shrunk by 0.9 every 10 epochs. The mini-batch size was set to 128 images. The hidden size of the LSTM was set to 1,024, as was the hidden size of the attention in the language representation. To avoid overfitting, we applied dropout with a ratio of 0.5 to the output layers of the LSTM. The temperature of Gumbel-Softmax was set to 1.0.
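The learning-rate schedule can be written as a small step function. This sketch assumes that “shrunk by 0.9” means the rate is multiplied by 0.9 at every tenth epoch; the function name and the zero-based epoch index are our own conventions.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.9, step=10):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

# Rates at the start of training and after each decay step (40 epochs in total).
schedule = [learning_rate(e) for e in (0, 10, 20, 30, 39)]
```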
Evaluation Metrics. For the detection task, we calculated the Intersection-over-Union (IoU) between the detected bounding box and the ground-truth one, and treated a detection with IoU of at least 0.5 as correct. We used Top-1 accuracy as the metric, which is the fraction of correctly grounded test expressions. For the segmentation task, we used Pr@0.5 (the percentage of expressions with IoU of at least 0.5) and overall IoU as metrics.
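The detection metric can be sketched as follows. The corner-coordinate box format `(x1, y1, x2, y2)` and the example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def top1_accuracy(predicted, ground_truth, threshold=0.5):
    """Fraction of expressions whose predicted box reaches the IoU threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predicted, ground_truth))
    return hits / len(ground_truth)

pred = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30)]
gt = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10)]
acc = top1_accuracy(pred, gt)
```

Here the first prediction matches exactly, the second overlaps the ground truth with an IoU of one third, and the third does not overlap at all, so one of three expressions counts as correctly grounded.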
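| Method | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test | RefCOCO (det) val | RefCOCO (det) testA | RefCOCO (det) testB | RefCOCO+ (det) val | RefCOCO+ (det) testA | RefCOCO+ (det) testB | RefCOCOg (det) val | RefCOCOg (det) test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NMTree w/o Comp | 83.65 | 83.59 | 83.04 | 70.76 | 73.07 | 65.19 | 75.98 | 76.20 | 75.10 | 79.38 | 68.60 | 64.85 | 70.43 | 55.00 | 63.07 | 63.40 |
| NMTree w/o Sum | 83.79 | 83.81 | 83.67 | 70.83 | 73.72 | 65.83 | 76.11 | 76.09 | 75.49 | 79.84 | 69.11 | 65.29 | 70.85 | 55.99 | 63.60 | 64.06 |
| NMTree w/ Rule | 84.46 | 84.59 | 84.26 | 71.48 | 74.76 | 66.95 | 77.82 | 77.70 | 75.51 | 80.61 | 69.23 | 65.23 | 70.94 | 56.96 | 64.69 | 65.53 |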
4.3 Ablation Studies
Settings. We conducted extensive ablation studies to reveal the internal mechanism of NMTree. The ablations and their motivations are detailed as follows. Chain: it ignores the structure information of the language. Specifically, we first embedded each word into a vector. Then we used a bidirectional LSTM to encode those embedding vectors into hidden vectors. Using these hidden vectors, we calculated a soft attention weight for each word. Finally, we represented a natural language expression as the weighted average of the word embeddings. NMTree w/o Comp: it is the NMTree without the Comp module, forcing all internal nodes to be Sum modules. NMTree w/o Sum: it is the NMTree without the Sum module, forcing all internal nodes to be Comp modules. NMTree w/ Rule: it uses a hand-crafted rule. Instead of deciding which module should be assembled at each node by computing the confidence, we designed a fixed linguistic rule to make a discrete and non-trainable decision. The rule is: set the internal nodes whose dependency relation label is ‘acl’ (i.e., adjectival clause) or ‘prep’ (i.e., prepositional modifier) as Comp modules, and the others as Sum modules.
Results. Table 1 shows the grounding accuracies of the ablation methods on the three benchmarks. We make the following observations: 1) On all datasets, NMTree outperforms Chain even if we remove one module or use the hand-crafted rule. This is because the tree structure contains more linguistic information and is more suitable for reasoning. It also demonstrates that our proposed fine-grained composition is better than the holistic Chain. 2) When we remove one module, i.e., NMTree w/o Comp and NMTree w/o Sum, the results are worse than those of the full NMTree. This demonstrates the necessity of both Sum and Comp. Note that removing any module also hurts the explainability of the model. 3) NMTree w/o Comp and NMTree w/o Sum are comparable, but NMTree w/o Sum is slightly better. Since the Comp module is more complex, using Comp at every internal node is counter-intuitive and prone to overfitting, which may explain why the gain is only slight. 4) NMTree outperforms NMTree w/ Rule. This demonstrates that NMTree can automatically find which nodes need composite reasoning (as Comp) and which do not (as Sum). Further, it also implies that NMTree is more suitable for the visual grounding task, because the Gumbel-Softmax training strategy makes our assembler aware of visual cues.
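The hand-crafted rule of the NMTree w/ Rule ablation can be sketched as a simple lookup over dependency labels. The node dictionary below is a hypothetical, already-pruned parse written by hand (its labels are not actual parser output), and the function name `assign_modules` is our own.

```python
COMP_LABELS = {"acl", "prep"}

def assign_modules(nodes, root):
    """nodes maps a node id to {"dep": label, "children": [ids]}."""
    modules = {}
    for node_id, node in nodes.items():
        if node_id == root or not node["children"]:
            modules[node_id] = "Single"  # fixed for the root and the leaves
        elif node["dep"] in COMP_LABELS:
            modules[node_id] = "Comp"    # clause or prepositional modifier
        else:
            modules[node_id] = "Sum"     # every other internal node
    return modules

# Hypothetical pruned parse of "umbrella carried by girl in pink boots".
nodes = {
    "umbrella": {"dep": "ROOT", "children": ["carried"]},
    "carried": {"dep": "acl", "children": ["by"]},
    "by": {"dep": "agent", "children": ["girl"]},
    "girl": {"dep": "pobj", "children": ["in"]},
    "in": {"dep": "prep", "children": ["boots"]},
    "boots": {"dep": "pobj", "children": ["pink"]},
    "pink": {"dep": "amod", "children": []},
}
modules = assign_modules(nodes, root="umbrella")
```

Unlike the learned assembler, this assignment depends only on the parse, so it cannot adapt to visual cues or compensate for parsing errors.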
4.4 Comparisons with State-of-the-Arts
Settings. We compared NMTree with other state-of-the-art visual grounding models published in recent years. According to whether the model requires language composition, we group those methods into: 1) Generation-based methods, which select the region with the maximum generation probability: MMI, Attribute, Speaker, and Listener. 2) Holistic language-based methods: NegBag. 3) Language composition-based methods: CMN, VC, and MAttNet. NMTree belongs to the third category, but its language composition is more fine-grained than theirs. We evaluated them in three different settings: ground-truth regions, detected regions, and segmentation masks.
Results. From Table 2 and Table 3, we find that: 1) the triplet composition models mostly outperform holistic models, because exploiting linguistic information by decomposing sentences, even coarsely, is helpful in visual grounding. 2) Our model outperforms the triplet models with the help of fine-grained composite reasoning. Although some of the performance gains are marginal, NMTree appears to balance the well-known trade-off between performance and explainability; that is, we achieve explainability without hurting accuracy. Moreover, we focus on the language composition and merely adopt plain visual features without fine-tuning. More sophisticated visual features are expected to further improve the results reported here.
4.5 Qualitative Results
To demonstrate the explainability of NMTree, we show some qualitative visualizations with the tree structure, the module assembly, and the intermediate reasoning process in Figure 5. Most of the grounding results show reasonable intermediate results. For (a), the tree structure degenerates into a linear structure, but unlike traditional chain-based methods, our model still contains a reasoning process: it first localizes the context “table”; then, through the relation “next to”, it moves to the things that are “next to table”, i.e., the “dog” and the “chair”; finally, with the referent “dog”, it localizes the correct region with high confidence. For (b), our model similarly finds the things “parked on side” and the “black” thing, and by summing up those cues it detects the “suv” accurately. For (c), we find that if a leaf node has no visible concept, e.g., “directly”, it does not concentrate on any specific object. In contrast, a leaf such as “girl” pays the most attention to the corresponding region. For (d), the parsing tree contains errors: the edges connecting “behind” to “holding” and to “shirt” are wrong, but our model still gives the correct result thanks to the robustness of our modules. For (e), the grounding fails, but we find that up to the node “giraffes” our model stays focused on the correct region; after the short phrase “one out of”, which is rare in the dataset, our model is a little confused and outputs a wrong region; nevertheless, the second-best region is the target. For (f), we show an extreme case in which the parser gives a chaotic structure that leads to failure.
5 Conclusions
We proposed Neural Module Tree Networks (NMTree), a novel end-to-end model that localizes the target region by accumulating the grounding confidence score along the dependency parsing tree (DPT) of a natural language sentence. NMTree consists of three simple neural modules, whose assembly is trained without additional annotations. Compared with previous grounding methods, our model performs more fine-grained and explainable composite language reasoning with superior performance, as demonstrated by extensive experiments on three benchmarks. In the future, we plan to apply NMTree to other vision-language tasks such as VQA, visual dialog, and image captioning.
-  J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In CVPR, 2016.
-  Q. Cao, X. Liang, B. Li, G. Li, and L. Lin. Visual question reasoning on general dependency tree. In CVPR, 2018.
D. Chen and C. Manning.
A fast and accurate dependency parser using neural networks.In EMNLP, 2014.
-  J. Choi, K. M. Yoo, and S.-g. Lee. Learning to compose task-specific tree structures. In AAAI, 2018.
-  V. Cirik, T. Berg-Kirkpatrick, and L.-P. Morency. Using syntax to ground referring expressions in natural images. In AAAI, 2018.
-  E. J. Gumbel. Statistical theory of extreme values and some practical applications. NBS Applied Mathematics Series, 33, 1954.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  R. Hu, J. Andreas, T. Darrell, and K. Saenko. Explainable neural computation via stack neural module networks. In ECCV, 2018.
-  R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In ICCV, 2017.
-  R. Hu, P. Dollár, K. He, T. Darrell, and R. Girshick. Learning to segment every thing. In CVPR, 2018.
-  R. Hu, M. Rohrbach, J. Andreas, T. Darrell, and K. Saenko. Modeling relationships in referential expressions with compositional modular networks. In CVPR, 2017.
-  E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. In ICLR, 2017.
-  J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In CVPR, 2017.
-  S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg. Referitgame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  E. Krahmer and K. Van Deemter. Computational generation of referring expressions: A survey. Computational Linguistics, 38(1):173–218, 2012.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  J. Liu, L. Wang, M.-H. Yang, et al. Referring expression generation and comprehension via attributes. In CVPR, 2017.
-  L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen. Deep learning for generic object detection: A survey, 2018.
-  J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In CVPR, 2018.
-  R. Luo and G. Shakhnarovich. Comprehension-guided referring expressions. In CVPR, 2017.
M.-T. Luong, H. Pham, and C. D. Manning.
Effective approaches to attention-based neural machine translation.In EMNLP, 2015.
-  J. Mao, J. Huang, A. Toshev, O. Camburu, A. L. Yuille, and K. Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
-  T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
-  V. K. Nagaraja, V. I. Morariu, and L. S. Davis. Modeling context between objects for referring expression understanding. In ECCV, 2016.
-  J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In EMNLP, 2014.
-  B. A. Plummer, A. Mallya, C. M. Cervantes, J. Hockenmaier, and S. Lazebnik. Phrase localization and visual relationship detection with comprehensive image-language cues. In ICCV, 2017.
-  J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In CVPR, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
K. S. Tai, R. Socher, and C. D. Manning.
Improved semantic representations from tree-structured long short-term memory networks.ACL, 2015.
-  A. Veit and S. Belongie. Convolutional networks with adaptive inference graphs. In ECCV, 2018.
-  K. Yi, J. Wu, C. Gan, A. Torralba, P. Kohli, and J. B. Tenenbaum. Neural-symbolic vqa: Disentangling reasoning from vision and language understanding. In NIPS, 2018.
-  L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. Mattnet: Modular attention network for referring expression comprehension. In CVPR, 2018.
-  L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg. Modeling context in referring expressions. In ECCV, 2016.
-  L. Yu, H. Tan, M. Bansal, and T. L. Berg. A joint speaker-listener-reinforcer model for referring expressions. In CVPR, 2017.
-  Z. Yu, J. Yu, C. Xiang, Z. Zhao, Q. Tian, and D. Tao. Rethinking diversified and discriminative proposal generation for visual grounding. In IJCAI, 2018.
-  H. Zhang, Z. Kyaw, S.-F. Chang, and T.-S. Chua. Visual translation embedding network for visual relation detection. In CVPR, 2017.
-  H. Zhang, Y. Niu, and S.-F. Chang. Grounding referring expressions in images by variational context. In CVPR, 2018.
-  M. Zhu, Y. Zhang, W. Chen, M. Zhang, and J. Zhu. Fast and accurate shift-reduce constituent parsing. In ACL, 2013.
6 Supplementary Material
6.1 Motivation of Dependency Parsing Tree
There are two mainstream parsing trees used in NLP: a) the constituency parsing tree, which splits a sentence into sub-phrases; and b) the dependency parsing tree, which establishes relationships between words. The appropriate tree depends on the task. The visual grounding task studied in this paper relies on relationships between objects (e.g., noun words), so the dependency parsing tree is the better choice: it provides rich information between the “(sub) ROOT” words (usually corresponding to the target region) and the words that modify them.
Therefore, compared with the constituency tree-based visual grounding model, our dependency tree-based NMTree performs better in both accuracy and explainability. The reasons are two-fold: a) a dependency parsing tree contains only about half as many nodes and edges as a constituency parsing tree, which mitigates overfitting and improves performance; b) the relationship labels of the dependency parsing tree guide the module assembly and assist the reasoning process, which enhances explainability.
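To make point (a) concrete, the sketch below counts the nodes and edges of a dependency tree for the phrase “umbrella carried by a girl in pink boots” and compares them with the node count of a constituency tree over the same words. The parse is hand-written with spaCy-style labels and is an assumption for illustration, not the output of the parser used in the paper; the helper names are hypothetical, and the 2n − 1 count assumes a strictly binary constituency tree.

```python
from collections import defaultdict

# Hand-written dependency parse (hypothetical, for illustration only):
# (word, index of the head word or -1 for ROOT, dependency label)
PARSE = [
    ("umbrella", -1, "ROOT"),
    ("carried", 0, "acl"),
    ("by", 1, "agent"),
    ("a", 4, "det"),
    ("girl", 2, "pobj"),
    ("in", 4, "prep"),
    ("pink", 7, "amod"),
    ("boots", 5, "pobj"),
]


def build_children(parse):
    """Return the root index and a head -> children adjacency map."""
    children = defaultdict(list)
    root = None
    for idx, (_, head, _) in enumerate(parse):
        if head < 0:
            root = idx
        else:
            children[head].append(idx)
    return root, children


def count_nodes(node, children):
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(count_nodes(c, children) for c in children[node])


def binary_constituency_nodes(num_words):
    """A strictly binary tree with n leaves has n - 1 internal nodes."""
    return 2 * num_words - 1


root, children = build_children(PARSE)
dep_nodes = count_nodes(root, children)   # one node per word
dep_edges = dep_nodes - 1                 # a tree has (nodes - 1) edges
con_nodes = binary_constituency_nodes(len(PARSE))
print(dep_nodes, dep_edges, con_nodes)
```

For this eight-word phrase the dependency tree has one node per word, whereas the binary constituency tree adds an internal node for every merge, which is where the roughly two-to-one ratio comes from.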
6.2 Implementation of Bidirectional Tree LSTM
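We simplified the implementation of the bidirectional tree LSTM in Eq. (4) of the paper. In more detail, let $t$ denote the current node, $x_t$ denote the input vector, and $C_t$ denote the children nodes of $t$. Our tree LSTM transition equations are:

$$
\begin{aligned}
\tilde{h}_t &= \sum_{k \in C_t} h_k, \\
i_t &= \sigma\big(W^{(i)} x_t + U^{(i)} \tilde{h}_t + b^{(i)}\big), \\
f_{tk} &= \sigma\big(W^{(f)} x_t + U^{(f)} h_k + b^{(f)}\big), \\
o_t &= \sigma\big(W^{(o)} x_t + U^{(o)} \tilde{h}_t + b^{(o)}\big), \\
u_t &= \tanh\big(W^{(u)} x_t + U^{(u)} \tilde{h}_t + b^{(u)}\big), \\
c_t &= i_t \odot u_t + \sum_{k \in C_t} f_{tk} \odot c_k, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned}
$$

where $k \in C_t$, and $W$, $U$, $b$ are trainable parameters of the LSTM. Recall that for our tree LSTM implementation, the input vector $x_t$ is the embedding vector of node $t$.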
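As a rough illustration, the following NumPy sketch implements one node update of a Child-Sum Tree-LSTM in the style of Tai et al., together with a recursive bottom-up pass. It assumes that this is the variant underlying Eq. (4); the function names, dimensions, random initialisation, and the toy tree are all hypothetical, and the top-down direction of the bidirectional model is not shown.

```python
import numpy as np

GATES = ("i", "f", "o", "u")


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(input_dim, hidden_dim, seed=0):
    """Randomly initialise W, U, b for each gate (illustrative values only)."""
    rng = np.random.default_rng(seed)
    params = {}
    for g in GATES:
        params["W_" + g] = rng.normal(0.0, 0.1, size=(hidden_dim, input_dim))
        params["U_" + g] = rng.normal(0.0, 0.1, size=(hidden_dim, hidden_dim))
        params["b_" + g] = np.zeros(hidden_dim)
    return params


def gate(name, x, h, p, activation):
    """activation(W x + U h + b) for the gate called `name`."""
    return activation(p["W_" + name] @ x + p["U_" + name] @ h + p["b_" + name])


def tree_lstm_node(x, child_states, p):
    """One Child-Sum Tree-LSTM step; child_states is a list of (h_k, c_k)."""
    hidden_dim = p["b_i"].shape[0]
    # Sum of the children's hidden states (zero vector for a leaf).
    h_sum = sum((h_k for h_k, _ in child_states), np.zeros(hidden_dim))
    i = gate("i", x, h_sum, p, sigmoid)
    o = gate("o", x, h_sum, p, sigmoid)
    u = gate("u", x, h_sum, p, np.tanh)
    c = i * u
    # One forget gate per child, computed from that child's own hidden state.
    for h_k, c_k in child_states:
        f_k = gate("f", x, h_k, p, sigmoid)
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c


def encode_bottom_up(node, children, inputs, p):
    """Recursively encode the subtree rooted at `node`, leaves first."""
    states = [encode_bottom_up(k, children, inputs, p)
              for k in children.get(node, [])]
    return tree_lstm_node(inputs[node], states, p)


# Toy tree (hypothetical): node 0 has children 1 and 2; node 2 has child 3.
children = {0: [1, 2], 2: [3]}
rng = np.random.default_rng(1)
inputs = {n: rng.normal(size=4) for n in range(4)}
params = init_params(input_dim=4, hidden_dim=3)
h_root, c_root = encode_bottom_up(0, children, inputs, params)
print(h_root.shape)  # (3,)
```

Because the children's hidden states are summed before entering the input, output, and update gates, the same cell handles any number of children, which suits dependency trees where the branching factor varies from node to node.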
6.3 Statistics of Module Assembler
We counted the frequency of words, dependency labels, and POS tags corresponding to the two kinds of modules, i.e., Comp and Sum, to illustrate the properties of each module. As shown in Figure 6, we find that: a) Sum contains more visible words (e.g., NOUN and ADJ), while Comp contains more relation words (e.g., ADP and VERB), which indicates that relationships are the core of reasoning; b) the most frequent dependency labels of Comp are ‘prep’ and ‘acl’, which is also the motivation for the ablation study NMTree w/ Rule. As discussed in Section 4.3 of the main paper, this demonstrates the suitability of NMTree for the visual grounding task.
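The counting itself is straightforward; a minimal sketch is shown below. The node records and their module assignments are made up for illustration (they are not taken from the dataset or from the trained assembler), and `module_statistics` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Hypothetical records: (word, dependency label, POS tag, assigned module).
NODES = [
    ("umbrella", "ROOT", "NOUN", "Sum"),
    ("carried", "acl", "VERB", "Comp"),
    ("by", "agent", "ADP", "Comp"),
    ("girl", "pobj", "NOUN", "Sum"),
    ("in", "prep", "ADP", "Comp"),
    ("boots", "pobj", "NOUN", "Sum"),
]


def module_statistics(nodes):
    """Count words, dependency labels, and POS tags separately per module."""
    stats = defaultdict(
        lambda: {"word": Counter(), "dep": Counter(), "pos": Counter()}
    )
    for word, dep, pos, module in nodes:
        stats[module]["word"][word] += 1
        stats[module]["dep"][dep] += 1
        stats[module]["pos"][pos] += 1
    return stats


stats = module_statistics(NODES)
for module in ("Comp", "Sum"):
    print(module, stats[module]["pos"].most_common(1))
```

Running the same tally over all assembled trees in a dataset would yield per-module histograms of the kind summarized in Figure 6.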
6.4 Internals of Comp Module
We visualize the internal reasoning process of the Comp module. Recall that we compute a pairwise score in Comp (cf. Eqs. (12) and (13)), which involves two quantities: the visual context that comes from the children nodes, and the output of Comp. As shown in Figure 7, each example contains two attention maps: the left one for the visual context and the right one for the output. We represent the partial tree structure by colors: red for the current node of interest, blue for its direct children nodes, and green for its direct parent.
In Figure 7, we present nine words, which fall into two categories: the first three correspond to the ‘prep’ dependency label and the other six to the ‘acl’ dependency label. We find that: a) the intermediate process of Comp is reasonable and explainable. It shows a very strong pattern for each word, e.g., “behind” tends to move the attention from front to back, and “holding” moves the attention from the object in hand to the person. b) Comp is robust to errors from the object detector and the parser. For instance, in the second example of “wearing”, the detector did not recognize the goggles and jacket, but Comp still provided a correct result to its parent. In the second example of “feeding”, the parser mistakenly linked “feeding” to “shirt” (ideally it should be “child”), but our model still provided a correct grounding result in the end. Note that the dataset contains some typos, e.g., “too left of” instead of “to left of”.
6.5 More Qualitative Results
We also show more qualitative results using NMTree with ground-truth bounding boxes (Figure 8), detected bounding boxes (Figure 9), and detected masks (Figure 10). The color of each word indicates the module of its node: black for Single, red for Comp, and blue for Sum. The image at the bottom right corner is the original image, with a green bounding box (or mask) for the ground truth and a red one for our result. Our model works well in the detected settings. In addition, we show some incorrect results in Figure 8(b). These results indicate that our model consistently provides an explainable reasoning process.