Unsupervised Vision-Language Parsing: Seamlessly Bridging Visual Scene Graphs with Language Structures via Dependency Relationships

03/27/2022
by Chao Lou, et al.

Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling, comprehensive results by building hierarchical structures for visual scenes (e.g., scene graphs) and natural languages (e.g., dependency trees) individually. However, how to construct a joint vision-language (VL) structure has barely been investigated. We introduce a new task, more challenging but worthwhile, that aims to induce such a joint VL structure in an unsupervised manner. Our goal is to bridge visual scene graphs and linguistic dependency trees seamlessly. Due to the lack of VL structural data, we start by building a new dataset, VLParse. Rather than relying on labor-intensive labeling from scratch, we propose an automatic alignment procedure that produces coarse structures, followed by human refinement that produces high-quality ones. Moreover, we benchmark our dataset by proposing a contrastive learning (CL)-based framework, VLGAE, short for Vision-Language Graph Autoencoder. Our model obtains superior performance on two derived tasks, i.e., language grammar induction and VL phrase grounding. Ablations show the effectiveness of both visual cues and dependency relationships for fine-grained VL structure construction.

research
12/01/2020

StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling

There are two major classes of natural language grammars – the dependenc...
research
05/30/2023

Fine-Grained is Too Coarse: A Novel Data-Centric Approach for Efficient Scene Graph Generation

Learning to compose visual relationships from raw images in the form of ...
research
03/24/2021

VLGrammar: Grounded Grammar Induction of Vision and Language

Cognitive grammar suggests that the acquisition of language grammar is g...
research
09/13/2019

Scene Graph Parsing by Attention Graph

Scene graph representations, which form a graph of visual object nodes t...
research
07/05/2019

Head-Driven Phrase Structure Grammar Parsing on Penn Treebank

Head-driven phrase structure grammar (HPSG) enjoys a uniform formalism r...
research
03/25/2018

Scene Graph Parsing as Dependency Parsing

In this paper, we study the problem of parsing structured knowledge grap...
research
08/03/2017

CRF Autoencoder for Unsupervised Dependency Parsing

Unsupervised dependency parsing, which tries to discover linguistic depe...
