PIGLeT: Language Grounding Through Neuro-Symbolic Interaction in a 3D World

by Rowan Zellers, et al.

We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown, plastic ones don't. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation, or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast "what happens next" given an English sentence over 80% of the time, outperforming a 100x larger, text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are also judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.
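
The factorization described in the abstract (a dynamics component that predicts what happens to an object, and a language component that reads a sentence and reports the result) can be illustrated with a small toy sketch. Everything here is an assumption made for illustration: the names `ObjectState`, `toy_dynamics`, `parse_action`, `describe`, and `predict` are hypothetical, and the hand-written rules stand in for what PIGLeT learns with neural networks. It is not the authors' implementation.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ObjectState:
    """Hypothetical symbolic object state (illustrative attributes only)."""
    name: str
    material: str
    is_broken: bool = False


def toy_dynamics(obj: ObjectState, action: str) -> ObjectState:
    """Hand-written stand-in for a learned dynamics model.

    Encodes the abstract's example: glass breaks when thrown, plastic does not.
    """
    if action == "throw" and obj.material == "glass":
        return replace(obj, is_broken=True)
    return obj


def parse_action(sentence: str) -> str:
    """Toy language-to-action step using keyword matching (an assumption)."""
    lowered = sentence.lower()
    for verb in ("throw", "pick up"):
        if verb in lowered:
            return verb
    return "noop"


def describe(before: ObjectState, after: ObjectState) -> str:
    """Toy state-to-language step: verbalize the predicted change."""
    if after.is_broken and not before.is_broken:
        return f"The {after.name} breaks."
    return f"The {after.name} is unchanged."


def predict(sentence: str, obj: ObjectState):
    """Read a sentence, simulate the outcome, and return it in both forms."""
    action = parse_action(sentence)
    after = toy_dynamics(obj, action)
    return after, describe(obj, after)


if __name__ == "__main__":
    for material in ("glass", "plastic"):
        state, summary = predict(
            "The robot throws the cup.", ObjectState("cup", material)
        )
        print(material, state.is_broken, summary)
```

The point of the sketch is the interface: the language side only turns text into an action and a state change back into text, while the prediction of what happens lives in the dynamics function. The result is available both as a symbolic state and as a sentence, mirroring the two output forms the abstract mentions.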






