Mapping Instructions and Visual Observations to Actions with Reinforcement Learning

by Dipendra Misra et al., Cornell University

We propose to directly map raw visual observations and text input to actions for instruction execution. While existing approaches assume access to structured environment representations or use a pipeline of separately trained models, we learn a single model to jointly reason about linguistic and visual input. We use reinforcement learning in a contextual bandit setting to train a neural network agent. To guide the agent's exploration, we use reward shaping with different forms of supervision. Our approach does not require intermediate representations, planning procedures, or training different models. We evaluate in a simulated environment, and show significant improvements over supervised learning and common reinforcement learning variants.
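The abstract describes policy-gradient training in a contextual bandit setting, where each instruction-following decision is a single-step episode and a shaped reward densifies the learning signal. The toy sketch below illustrates that setup; the linear softmax policy, synthetic contexts, and reward values are illustrative assumptions, not the paper's neural architecture over raw visual and text input.

```python
import numpy as np

# Toy illustration of policy-gradient learning in a contextual bandit
# setting with a shaped reward. The linear softmax policy, synthetic
# contexts, and reward constants are illustrative assumptions; the
# paper's agent is a neural network over raw vision and language.

rng = np.random.default_rng(0)
N_ACTIONS = 4                         # assumed small discrete action set
W = np.zeros((N_ACTIONS, N_ACTIONS))  # linear policy parameters

def action_probs(ctx):
    """Softmax policy pi(a | ctx) for a single context vector."""
    logits = W @ ctx
    logits = logits - logits.max()    # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def shaped_reward(action, goal):
    """+1 for the goal action, small penalty otherwise (assumed shaping)."""
    return 1.0 if action == goal else -0.1

alpha = 0.5  # learning rate
for _ in range(3000):
    ctx = rng.random(N_ACTIONS)          # stand-in for fused text+vision features
    goal = int(ctx.argmax())             # toy "instruction": pick the largest coord
    p = action_probs(ctx)
    a = int(rng.choice(N_ACTIONS, p=p))  # sample one action (one-step episode)
    r = shaped_reward(a, goal)
    onehot = np.zeros(N_ACTIONS)
    onehot[a] = 1.0
    # REINFORCE update for a one-step (bandit) episode:
    # grad log pi(a | ctx) = outer(onehot(a) - p, ctx)
    W += alpha * r * np.outer(onehot - p, ctx)
```

Because each episode is a single action, the policy-gradient update needs no value bootstrapping or planning; the shaping term only densifies an otherwise sparse task reward.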


Related research

Learning to Map Natural Language Instructions to Physical Quadcopter Control using Simulated Flight

We propose a joint simulation and real-world learning framework for mapp...

Unsupervisedly Learned Representations: Should the Quest be Over?

There exists a classification accuracy gap of about 20% between methods of genera...

Causally Correct Partial Models for Reinforcement Learning

In reinforcement learning, we can learn a model of future observations a...

Situated Mapping of Sequential Instructions to Actions with Single-step Reward Observation

We propose a learning approach for mapping context-dependent sequential ...

Following Instructions by Imagining and Reaching Visual Goals

While traditional methods for instruction-following typically assume pri...

Visual Radial Basis Q-Network

While reinforcement learning (RL) from raw images has been largely inves...

Reinforcement Learning with Attention that Works: A Self-Supervised Approach

Attention models have had a significant positive impact on deep learning...
