Latent Alignment of Procedural Concepts in Multimodal Recipes

by Hossein Rajaby Faghihi, et al.

We propose a novel alignment mechanism for procedural reasoning on RecipeQA, a newly released multimodal QA dataset. Our model solves the textual cloze task, a reading comprehension task over a recipe that contains images and instructions. We exploit attention networks, cross-modal representations, and a latent alignment space between instructions and candidate answers to solve the problem. We introduce constrained max-pooling, which refines the max-pooling operation on the alignment matrix to impose disjoint constraints among the outputs of the model. Our evaluation results indicate a 19% improvement over the baselines.
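The abstract does not spell out how constrained max-pooling is computed, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a small alignment matrix whose rows are candidate answers and whose columns are instruction slots, and it uses a greedy selection to enforce the disjointness constraint (no two rows may pick the same column). The function names `max_pool` and `constrained_max_pool`, the toy scores, and the greedy strategy are all hypothetical illustrations, not the authors' implementation.

```python
def max_pool(scores):
    """Plain max-pooling: each row independently picks its best column."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def constrained_max_pool(scores):
    """Greedy disjoint variant (an assumption, not the paper's method):
    no column may be selected by more than one row."""
    n_rows, n_cols = len(scores), len(scores[0])
    if n_rows > n_cols:
        raise ValueError("need at least as many columns as rows")
    # Enumerate every cell and visit them from highest to lowest score.
    cells = sorted(
        ((scores[r][c], r, c) for r in range(n_rows) for c in range(n_cols)),
        key=lambda cell: -cell[0],
    )
    assignment = [None] * n_rows
    used_cols = set()
    for _, r, c in cells:
        # Accept a cell only if its row is unassigned and its column is free.
        if assignment[r] is None and c not in used_cols:
            assignment[r] = c
            used_cols.add(c)
    return assignment


# Toy alignment matrix: rows = candidates, columns = slots (hypothetical values).
scores = [
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
    [0.1, 0.4, 0.6],
]
print(max_pool(scores))              # [0, 0, 2]: rows 0 and 1 collide on column 0
print(constrained_max_pool(scores))  # [0, 1, 2]: each column is used at most once
```

In this toy case, plain max-pooling lets two candidates claim the same slot, while the constrained variant pushes the second candidate to its next-best free slot. A greedy pass is only one way to impose such a constraint; an optimal assignment procedure would be another option.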


Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension

Procedural Multimodal Documents (PMDs) organize textual instructions and...

Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model

While Machine Comprehension (MC) has attracted extensive research intere...

MCQA: Multimodal Co-attention Based Network for Question Answering

We present MCQA, a learning-based algorithm for multimodal question answ...

Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment

Despite recent progress towards scaling up multimodal vision-language mo...

Learning Shared Semantic Space with Correlation Alignment for Cross-modal Event Retrieval

In this paper, we propose to learn shared semantic space with correlatio...

Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals

The ability to sequence unordered events is an essential skill to compre...