Latent Alignment of Procedural Concepts in Multimodal Recipes

01/12/2021
by Hossein Rajaby Faghihi, et al.

We propose a novel alignment mechanism for procedural reasoning on the recently released multimodal QA dataset RecipeQA. Our model solves the textual cloze task, a reading-comprehension task over a recipe that contains both images and instructions. We exploit the power of attention networks, cross-modal representations, and a latent alignment space between instructions and candidate answers. We introduce constrained max-pooling, which refines the max-pooling operation on the alignment matrix to impose disjoint constraints among the outputs of the model. Our evaluation results indicate a 19% improvement over the baselines.
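The abstract names two components that lend themselves to a short illustration: an alignment matrix between instructions and candidate answers, and a constrained variant of max-pooling that keeps the model's answer choices disjoint. Below is a minimal sketch of one way such a constraint could be realized, assuming alignment scores of shape (num_candidates, num_slots). The function name constrained_max_pooling, the greedy argmax-and-mask strategy, and the tensor layout are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of a disjointness-constrained max-pooling over an alignment
# matrix. Plain max-pooling scores each answer slot independently, so several
# slots can pick the same candidate; here we greedily enforce distinct picks.
# Assumes a float tensor of alignment scores (hypothetical layout).
import torch

def constrained_max_pooling(align: torch.Tensor) -> torch.Tensor:
    """Greedy disjoint selection: repeatedly take the highest remaining
    candidate-slot score, then mask out that candidate and that slot."""
    scores = align.clone()  # (num_candidates, num_slots), float scores
    num_candidates, num_slots = scores.shape
    assignment = torch.full((num_slots,), -1, dtype=torch.long)
    for _ in range(min(num_candidates, num_slots)):
        flat = torch.argmax(scores).item()      # global best remaining pair
        cand, slot = divmod(flat, num_slots)
        assignment[slot] = cand
        scores[cand, :] = float("-inf")         # candidate already used
        scores[:, slot] = float("-inf")         # slot already filled
    return assignment
```

As a usage example, with scores [[0.9, 0.8, 0.1], [0.7, 0.95, 0.2], [0.3, 0.4, 0.6]] the sketch first assigns candidate 1 to slot 1 (score 0.95), then candidate 0 to slot 0, then candidate 2 to slot 2, yielding [0, 1, 2]. An unconstrained row-wise max could instead assign candidate 0 to both of the first two slots, which is exactly what the disjointness constraint rules out.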

Related research

04/06/2022
Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension
Procedural Multimodal Documents (PMDs) organize textual instructions and...

07/04/2021
Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model
While Machine Comprehension (MC) has attracted extensive research intere...

04/25/2020
MCQA: Multimodal Co-attention Based Network for Question Answering
We present MCQA, a learning-based algorithm for multimodal question answ...

12/20/2022
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Despite recent progress towards scaling up multimodal vision-language mo...

01/14/2019
Learning Shared Semantic Space with Correlation Alignment for Cross-modal Event Retrieval
In this paper, we propose to learn shared semantic space with correlatio...

10/16/2021
Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals
The ability to sequence unordered events is an essential skill to compre...
