Explore and Tell: Embodied Visual Captioning in 3D Environments

08/21/2023
by   Anwen Hu, et al.

While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. To tackle this task, we propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, code, and models are available at https://aim3-ruc.github.io/ExploreAndTell.
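
To make the navigate-then-describe cascade concrete, the following is a minimal toy sketch of the control flow only. It is not the CaBOT implementation: the 1-D `ToyScene`, the rule-based `toy_navigator`, the template-based `toy_captioner`, and the action names are all hypothetical stand-ins, assumed here purely for illustration. In the actual task the views are rendered images of a 3D scene and both components are learned models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

View = Tuple[str, ...]  # hypothetical stand-in for an image: the objects visible from a viewpoint


@dataclass
class ToyScene:
    """Hypothetical 1-D scene; each position exposes a different partial view."""
    views: List[View]
    position: int = 0

    def observe(self) -> View:
        return self.views[self.position]

    def step(self, action: str) -> None:
        # Move one position left or right, clamped to the scene bounds.
        if action == "right":
            self.position = min(self.position + 1, len(self.views) - 1)
        elif action == "left":
            self.position = max(self.position - 1, 0)


def toy_navigator(trajectory: List[View]) -> str:
    """Rule-based stand-in for a learned navigator: keep moving right
    until the latest view reveals nothing new, then stop."""
    if len(trajectory) == 1:
        return "right"
    seen_before = {obj for view in trajectory[:-1] for obj in view}
    return "stop" if set(trajectory[-1]) <= seen_before else "right"


def toy_captioner(trajectory: List[View]) -> str:
    """Template stand-in for a learned captioner: describe every object
    seen anywhere along the whole trajectory, in order of first sighting."""
    objects: List[str] = []
    for view in trajectory:
        for obj in view:
            if obj not in objects:
                objects.append(obj)
    return "The scene contains: " + ", ".join(objects) + "."


def explore_and_tell(
    scene: ToyScene,
    navigator: Callable[[List[View]], str],
    captioner: Callable[[List[View]], str],
    max_steps: int = 10,
) -> str:
    """Cascade: the navigator collects views first, then the captioner
    describes the scene from the full trajectory."""
    trajectory = [scene.observe()]
    for _ in range(max_steps):
        action = navigator(trajectory)
        if action == "stop":
            break
        scene.step(action)
        trajectory.append(scene.observe())
    return captioner(trajectory)


scene = ToyScene(views=[("mug",), ("mug", "laptop"), ("laptop", "book"), ("book",)])
caption = explore_and_tell(scene, toy_navigator, toy_captioner)
print(caption)
```

The point of the sketch is the division of labor: the starting view shows only one object, and the description becomes complete only because the captioner conditions on every view gathered during navigation rather than on the first image alone.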


research
01/08/2019

Viewpoint Invariant Change Captioning

The ability to detect that something has changed in an environment is va...
research
04/21/2020

ParaCNN: Visual Paragraph Generation via Adversarial Twin Contextual CNNs

Image description generation plays an important role in many real-world ...
research
12/02/2021

Relational Graph Learning for Grounded Video Description Generation

Grounded video description (GVD) encourages captioning models to attend ...
research
01/17/2023

Embodied Agents for Efficient Exploration and Smart Scene Description

The development of embodied agents that can communicate with humans in n...
research
02/02/2022

Image-based Navigation in Real-World Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation

Navigating complex indoor environments requires a deep understanding of ...
research
05/08/2023

IIITD-20K: Dense captioning for Text-Image ReID

Text-to-Image (T2I) ReID has attracted a lot of attention in the recent ...
research
02/11/2023

See Your Heart: Psychological states Interpretation through Visual Creations

In psychoanalysis, generating interpretations to one's psychological sta...
