Joint Video and Text Parsing for Understanding Events and Answering Queries

08/29/2013
by Kewei Tu et al.

We propose a framework for jointly parsing video and text to understand events and answer user queries. Our framework produces a parse graph that represents the compositional structures of spatial information (objects and scenes), temporal information (actions and events), and causal information (causalities between events and fluents) in the video and text. The knowledge representation of our framework is based on a spatial-temporal-causal And-Or graph (S/T/C-AOG), which jointly models possible hierarchical compositions of objects, scenes, and events as well as their interactions and mutual contexts, and which specifies the prior probability distribution over parse graphs. We present a probabilistic generative model for joint parsing that captures the relations between the input video/text, their corresponding parse graphs, and the joint parse graph. Based on this model, we propose a joint parsing system consisting of three modules: video parsing, text parsing, and joint inference. Video parsing and text parsing produce two parse graphs from the input video and text, respectively. The joint inference module then produces a joint parse graph by performing matching, deduction, and revision on the video and text parse graphs. The proposed framework has three objectives. First, we aim at deep semantic parsing of video and text that goes beyond traditional bag-of-words approaches. Second, we perform parsing and reasoning across the spatial, temporal, and causal dimensions based on the joint S/T/C-AOG representation. Third, we show that deep joint parsing facilitates subsequent applications such as generating narrative text descriptions and answering who, what, when, where, and why queries. We empirically evaluate our system by comparing its output against ground truth and by measuring query-answering accuracy, and obtain satisfactory results.
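To make the three-step joint inference concrete, below is a minimal Python sketch of a parse-graph structure and the matching, deduction, and revision steps the abstract names. All names here (Node, joint_parse, the who/when attribute slots) are hypothetical illustrations, not the authors' implementation: the actual system scores alignments, deduced nodes, and revisions under the S/T/C-AOG probability model rather than by the simple label matching shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    """One parse-graph node; a hypothetical simplification of the paper's
    S/T/C-AOG parse graphs, which also carry probabilities."""
    label: str                             # concept, e.g. "person", "open_door"
    kind: str                              # "and" (composition), "or" (choice), "terminal"
    children: List["Node"] = field(default_factory=list)
    attrs: Dict[str, str] = field(default_factory=dict)  # who/what/when/where/why slots

def nodes_of(pg: Node) -> List[Node]:
    """Flatten a parse graph into a list of its nodes."""
    out, stack = [], [pg]
    while stack:
        n = stack.pop()
        out.append(n)
        stack.extend(n.children)
    return out

def joint_parse(video_pg: Node, text_pg: Node) -> Node:
    """Toy stand-in for the joint inference module's three steps:
    matching, deduction, and revision."""
    joint = Node("root", "and")
    text_by_label = {n.label: n for n in nodes_of(text_pg)}
    for v in nodes_of(video_pg):
        merged = Node(v.label, v.kind, attrs=dict(v.attrs))
        t = text_by_label.get(v.label)
        if t is not None:
            # Matching: same label -> same entity/event (the paper instead
            # scores candidate alignments under the joint probability model).
            for slot, value in t.attrs.items():
                if slot not in merged.attrs:
                    # Deduction: fill a slot one modality lacks from the other.
                    merged.attrs[slot] = value
                elif merged.attrs[slot] != value:
                    # Revision: on conflict, keep the video's value here;
                    # the paper resolves conflicts probabilistically.
                    pass
        joint.children.append(merged)
    return joint

# Minimal usage: the video parse detects an event with a time stamp; the
# text parse names the agent. The joint parse graph carries both slots.
video = Node("scene", "and", [Node("open_door", "terminal", attrs={"when": "t=12s"})])
text = Node("story", "and", [Node("open_door", "terminal", attrs={"who": "the man"})])
print(joint_parse(video, text).children[1].attrs)  # {'when': 't=12s', 'who': 'the man'}
```

In the actual framework, each matching, deduction, and revision decision is weighted by the S/T/C-AOG prior together with the video and text likelihoods, so the joint parse graph maximizes posterior probability rather than merely unioning attributes as this sketch does.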


Related research

- 09/06/2021 · Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing
  Compared with image scene parsing, video scene parsing introduces tempor...

- 06/12/2019 · Joint Reasoning for Temporal and Causal Relations
  Understanding temporal and causal relations between events is a fundamen...

- 11/03/2021 · SERC: Syntactic and Semantic Sequence based Event Relation Classification
  Temporal and causal relations play an important role in determining the ...

- 12/06/2015 · A Restricted Visual Turing Test for Deep Scene and Event Understanding
  This paper presents a restricted visual Turing test (VTT) for story-line...

- 10/01/2021 · Self-Attentive Constituency Parsing for UCCA-based Semantic Parsing
  Semantic parsing provides a way to extract the semantic structure of a t...

- 11/14/2020 · ActBERT: Learning Global-Local Video-Text Representations
  In this paper, we introduce ActBERT for self-supervised learning of join...

- 02/03/2015 · Clothing Co-Parsing by Joint Image Segmentation and Labeling
  This paper aims at developing an integrated system of clothing co-parsin...
