Understanding Procedural Knowledge by Sequencing Multimodal Instructional Manuals

10/16/2021
by Te-Lin Wu, et al.

The ability to sequence unordered events is an essential skill for comprehending and reasoning about real-world task procedures. It often requires a thorough understanding of temporal common sense and multimodal information, because these procedures are typically communicated through a combination of text and images. This capability is essential for applications such as sequential task planning and multi-source instruction summarization. While humans are capable of reasoning about and sequencing unordered multimodal procedural instructions, whether current machine learning models have this capability is still an open question. In this work, we benchmark models' capability of reasoning over and sequencing unordered multimodal instructions by curating datasets from popular online instructional manuals and collecting comprehensive human annotations. We find that models not only perform significantly worse than humans but also seem incapable of efficiently utilizing the multimodal information. To improve machines' performance on multimodal event sequencing, we propose sequentiality-aware pretraining techniques that exploit the sequential alignment properties of both text and images, resulting in improvements of more than 5%.


