Multimodal Procedural Planning via Dual Text-Image Prompting

05/02/2023
by Yujie Lu et al.

Embodied agents have achieved prominent performance in following human instructions to complete tasks. However, the potential of providing instructions informed by texts and images to assist humans in completing tasks remains underexplored. To uncover this capability, we present the multimodal procedural planning (MPP) task, in which models are given a high-level goal and generate plans of paired text-image steps, providing more complementary and informative guidance than unimodal plans. The key challenges of MPP are to ensure the informativeness, temporal coherence, and accuracy of plans across modalities. To tackle this, we propose Text-Image Prompting (TIP), a dual-modality prompting method that jointly leverages the zero-shot reasoning ability of large language models (LLMs) and the compelling text-to-image generation ability of diffusion-based models. TIP improves the interaction between the two modalities using a Text-to-Image Bridge and an Image-to-Text Bridge, allowing LLMs to guide textually grounded image plan generation and, in reverse, leveraging the descriptions of image plans to ground the textual plan. To address the lack of relevant datasets, we collect WIKIPLAN and RECIPEPLAN as a testbed for MPP. Our results show compelling human preferences and automatic scores against unimodal and multimodal baselines on WIKIPLAN and RECIPEPLAN in terms of informativeness, temporal coherence, and plan accuracy. Our code and data: https://github.com/YujieLu10/MPP.
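The dual-bridge loop described above can be sketched in a few lines. This is a minimal illustrative sketch only, not the authors' implementation: the functions `llm`, `text_to_image`, and `caption` are hypothetical stubs standing in for a real LLM call, a diffusion model, and an image captioner.

```python
def llm(prompt: str) -> str:
    """Stub LLM: in practice, a zero-shot call to a large language model."""
    return f"Step derived from: {prompt}"

def text_to_image(step_text: str) -> str:
    """Stub diffusion model: returns a handle to the generated image."""
    return f"<image for '{step_text}'>"

def caption(image: str) -> str:
    """Stub captioner used by the Image-to-Text Bridge."""
    return f"Caption of {image}"

def multimodal_plan(goal: str, num_steps: int = 3) -> list[tuple[str, str]]:
    """Generate a plan of paired (text step, image) tuples for a goal."""
    plan = []
    context = goal
    for _ in range(num_steps):
        # LLM drafts the next textual step from the goal plus prior grounding.
        step_text = llm(context)
        # Text-to-Image Bridge: the LLM-guided step text drives image generation.
        image = text_to_image(step_text)
        # Image-to-Text Bridge: the image's description grounds later steps.
        context = caption(image)
        plan.append((step_text, image))
    return plan

plan = multimodal_plan("Make a paper airplane")
for step_text, image in plan:
    print(step_text, "|", image)
```

The key design point illustrated here is the feedback loop: each generated image is described back into text before the next step is drafted, so the textual and visual plans stay mutually grounded rather than being produced independently.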


