Explainability by Parsing: Neural Module Tree Networks for Natural Language Visual Grounding

12/08/2018
by   Daqing Liu, et al.

Grounding natural language in images essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of a subject-predicate-object triplet. They may perform well on short phrases, but generally fail on longer sentences, mainly because they over-fit to certain vision-language biases. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion. In particular, we develop a novel modular network called the Neural Module Tree network (NMTree), which regularizes visual grounding along the dependency parsing tree of the sentence: each node is a neural module that calculates or accumulates grounding scores in a bottom-up direction, as needed. NMTree disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, which account for the discrete process of module selection. Overall, NMTree not only consistently outperforms the state of the art on several benchmarks and tasks, but also shows explainable reasoning in its grounding score calculation. NMTree thus points to a promising direction for closing the gap between explainability and performance.
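
The discrete module selection mentioned in the abstract can be illustrated with a minimal sketch of Gumbel-Softmax sampling. This is not the authors' implementation: the function names, the three example logits, and the temperature `tau` are assumptions made for illustration. The sketch shows only the forward sampling step, which returns a one-hot choice together with the relaxed probabilities. The straight-through gradient itself needs an automatic differentiation framework and is described only in the comments.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Sample a module choice from per-node logits (hypothetical helper).

    Returns (hard, soft):
      hard: one-hot list marking the selected module (discrete choice)
      soft: relaxed probabilities from the Gumbel-Softmax distribution
    In an autodiff framework, a straight-through estimator would use
    `hard` in the forward pass and route gradients through `soft`;
    that backward step is not implemented here.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform sample away from 0 and 1 to avoid log(0).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    soft = softmax(noisy)
    k = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(soft))]
    return hard, soft


# Usage: one tree node choosing among three hypothetical module types.
rng = random.Random(0)
hard, soft = gumbel_softmax([2.0, 0.5, -1.0], tau=0.5, rng=rng)
```

A lower `tau` pushes `soft` closer to one-hot, at the cost of higher-variance gradients when the relaxation is used for training.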

research
06/05/2019

Learning to Compose and Reason with Language Tree Structures for Visual Grounding

Grounding natural language in images, such as localizing "the black dog ...
research
05/22/2018

Guided Feature Transformation (GFT): A Neural Language Grounding Module for Embodied Agents

Recently there has been a rising interest in training agents, embodied i...
research
10/01/2022

Differentiable Parsing and Visual Grounding of Verbal Instructions for Object Placement

Grounding spatial relations in natural language for object placing could...
research
07/17/2020

Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Generating natural language descriptions for videos, i.e., video caption...
research
03/15/2022

End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding

Natural language spatial video grounding aims to detect the relevant obj...
research
05/24/2022

Sim-To-Real Transfer of Visual Grounding for Human-Aided Ambiguity Resolution

Service robots should be able to interact naturally with non-expert huma...
