Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension

04/06/2022
by Huibin Zhang, et al.

Procedural Multimodal Documents (PMDs) organize textual instructions and their corresponding images step by step. Comprehending PMDs and inducing their representations for downstream reasoning tasks is termed Procedural MultiModal Machine Comprehension (M3C). In this study, we approach Procedural M3C at the entity level, which is finer-grained than existing explorations at the document or sentence level. We model entities in both their temporal and cross-modal relations and propose a novel Temporal-Modal Entity Graph (TMEG). Specifically, a graph structure is formulated to capture textual and visual entities and to trace their temporal-modal evolution. In addition, a graph aggregation module is introduced to conduct graph encoding and reasoning. Comprehensive experiments across three Procedural M3C tasks are conducted on the traditional RecipeQA dataset and on our new CraftQA dataset, which can better evaluate the generalization of TMEG.
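As a rough illustration of the graph idea described in the abstract, the Python sketch below builds a toy entity graph over procedure steps and runs one round of neighbor averaging. It is not the authors' implementation. The abstract does not specify how edges are defined or how aggregation works, so everything here is an assumption made for illustration: the function names `build_tmeg` and `aggregate`, the edge rules (a "temporal" edge links the same entity in the same modality across consecutive steps, and a "modal" edge links the textual and visual mention of the same entity within a step), and the mean aggregation.

```python
from collections import defaultdict


def build_tmeg(steps):
    """Build a toy temporal-modal entity graph (hypothetical edge rules).

    steps: list of dicts {"text": [entity names], "image": [entity names]},
    one dict per procedure step, with unique names per modality.
    Returns (nodes, edges); a node is (step index, modality, entity).
    """
    nodes = []
    for t, step in enumerate(steps):
        for modality in ("text", "image"):
            for entity in step[modality]:
                nodes.append((t, modality, entity))

    node_set = set(nodes)
    edges = set()
    for t, modality, entity in nodes:
        # Assumed temporal edge: same entity and modality in the next step.
        nxt = (t + 1, modality, entity)
        if nxt in node_set:
            edges.add(((t, modality, entity), nxt, "temporal"))
        # Assumed modal edge: same entity mentioned in text and image of one step.
        if modality == "text" and (t, "image", entity) in node_set:
            edges.add(((t, "text", entity), (t, "image", entity), "modal"))
    return nodes, edges


def aggregate(nodes, edges, features):
    """One round of mean aggregation over each node and its neighbors.

    Edges are treated as undirected. A stand-in for a learned graph
    aggregation module, not the method from the paper.
    """
    neighbors = defaultdict(list)
    for a, b, _ in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    updated = {}
    for n in nodes:
        group = [n] + neighbors[n]
        dim = len(features[n])
        updated[n] = [
            sum(features[m][i] for m in group) / len(group) for i in range(dim)
        ]
    return updated


if __name__ == "__main__":
    # Two made-up recipe steps; entity lists are illustrative only.
    steps = [
        {"text": ["egg", "flour"], "image": ["egg"]},
        {"text": ["egg"], "image": ["egg", "bowl"]},
    ]
    nodes, edges = build_tmeg(steps)
    features = {n: [1.0 if n[2] == "egg" else 0.0] for n in nodes}
    for node, vec in aggregate(nodes, edges, features).items():
        print(node, vec)
```

In a real system the node features would come from text and image encoders and the aggregation would be learned; the plain lists and averaging here only make the data flow visible.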

research
10/01/2020

Referring Image Segmentation via Cross-Modal Progressive Comprehension

Referring image segmentation aims at segmenting the foreground masks of ...
research
01/12/2021

Latent Alignment of Procedural Concepts in Multimodal Recipes

We propose a novel alignment mechanism to deal with procedural reasoning...
research
05/15/2021

Cross-Modal Progressive Comprehension for Referring Segmentation

Given a natural language expression and an image/video, the goal of refe...
research
08/05/2021

TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding

Recently proposed fine-grained 3D visual grounding is an essential and c...
research
09/19/2019

Procedural Reasoning Networks for Understanding Multimodal Procedures

This paper addresses the problem of comprehending procedural commonsense...
research
06/17/2022

Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval

Our goal in this research is to study a more realistic environment in wh...
research
07/19/2023

Multi-Grained Multimodal Interaction Network for Entity Linking

Multimodal entity linking (MEL) task, which aims at resolving ambiguous ...
