Embodied Agents for Efficient Exploration and Smart Scene Description

by Roberto Bigazzi, et al.

The development of embodied agents that can communicate with humans in natural language has gained increasing interest in recent years, as it facilitates the diffusion of robotic platforms in human-populated environments. As a step towards this objective, in this work we tackle a setting for visual navigation in which an autonomous agent needs to explore and map an unseen indoor environment while portraying interesting scenes with natural language descriptions. To this end, we propose and evaluate an approach that combines recent advances in visual robotic exploration and image captioning on images generated through agent-environment interaction. Our approach can generate smart scene descriptions that maximize semantic knowledge of the environment and avoid repetitions. Furthermore, such descriptions offer user-understandable insights into the robot's representation of the environment, highlighting the prominent objects and the correlations between them as encountered during exploration. To quantitatively assess the performance of the proposed approach, we also devise a specific score that accounts for both exploration and description skills. Experiments carried out on both photorealistic simulated environments and real-world ones demonstrate that our approach can effectively describe the robot's point of view during exploration, improving the human-friendly interpretability of its observations.
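To make the "avoid repetitions" idea concrete, the sketch below shows one simple way a caption stream produced during exploration could be filtered so that only sufficiently novel descriptions are kept. This is an illustrative stand-in, not the paper's actual method: the `novel_captions` function, the word-level Jaccard similarity, and the `threshold` value are all assumptions chosen for clarity.

```python
def novel_captions(captions, threshold=0.5):
    """Keep a caption only if its word-level Jaccard similarity to every
    previously kept caption stays below `threshold`.

    Hypothetical repetition filter for captions generated while the agent
    explores; the real system may use learned features instead of words.
    """
    kept = []
    for cap in captions:
        words = set(cap.lower().split())
        if all(
            len(words & set(prev.lower().split()))
            / len(words | set(prev.lower().split())) < threshold
            for prev in kept
        ):
            kept.append(cap)
    return kept


# Example: the second caption largely repeats the first and is dropped.
stream = [
    "a kitchen with a wooden table",
    "a kitchen with a wooden table and chairs",
    "a bathroom with a white sink",
]
filtered = novel_captions(stream)
```

A similarity threshold like this trades coverage for conciseness: a lower value yields fewer, more diverse descriptions, which matches the goal of maximizing semantic knowledge without redundancy.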




Explore and Explain: Self-supervised Navigation and Recounting

Embodied AI has been recently gaining attention as it aims to foster the...

Towards Task Understanding in Visual Settings

We consider the problem of understanding real world tasks depicted in vi...

Learning-Augmented Model-Based Planning for Visual Exploration

We consider the problem of time-limited robotic exploration in previousl...

ALAN: Autonomously Exploring Robotic Agents in the Real World

Robotic agents that operate autonomously in the real world need to conti...

Explore and Tell: Embodied Visual Captioning in 3D Environments

While current visual captioning models have achieved impressive performa...

Streaming Scene Maps for Co-Robotic Exploration in Bandwidth Limited Environments

This paper proposes a bandwidth tunable technique for real-time probabil...

ANSEL Photobot: A Robot Event Photographer with Semantic Intelligence

Our work examines the way in which large language models can be used for...
