Learning the Effects of Physical Actions in a Multi-modal Environment

01/27/2023
by Gautier Dagan, et al.

Large Language Models (LLMs) handle physical commonsense information inadequately. Because they are trained in a disembodied setting, LLMs often fail to predict an action's outcome in a given environment. However, predicting the effects of an action before it is executed is crucial in planning, where coherent sequences of actions are often needed to achieve a goal. Therefore, we introduce the multi-modal task of predicting the outcomes of actions solely from realistic sensory inputs (images and text). Next, we extend an LLM to model latent representations of objects to better predict action outcomes in an environment. We show that multi-modal models can capture physical commonsense when augmented with visual information. Finally, we evaluate our model's performance on novel actions and objects and find that combining modalities helps models generalize and learn physical commonsense reasoning better.


