NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

05/24/2023
by   Tianwen Qian, et al.
We introduce a novel visual question answering (VQA) task in the context of autonomous driving, aiming to answer natural language questions based on street-view clues. Compared to traditional VQA tasks, VQA in the autonomous driving scenario presents more challenges. Firstly, the raw visual data are multi-modal, including images and point clouds captured by cameras and LiDAR, respectively. Secondly, the data are multi-frame due to continuous, real-time acquisition. Thirdly, the outdoor scenes exhibit both moving foreground objects and a static background. Existing VQA benchmarks fail to adequately address these complexities. To bridge this gap, we propose NuScenes-QA, the first benchmark for VQA in the autonomous driving scenario, encompassing 34K visual scenes and 460K question-answer pairs. Specifically, we leverage existing 3D detection annotations to generate scene graphs and design question templates manually. Subsequently, the question-answer pairs are generated programmatically based on these templates. Comprehensive statistics demonstrate that our NuScenes-QA is a balanced, large-scale benchmark with diverse question formats. Built upon it, we develop a series of baselines that employ advanced 3D detection and VQA techniques. Our extensive experiments highlight the challenges posed by this new task. Code and dataset are available at https://github.com/qiantianwen/NuScenes-QA.
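The scene-graph-plus-template pipeline the abstract describes can be illustrated with a minimal sketch. Everything here is a hypothetical reconstruction for clarity: the `scene_graph` structure, the template set, and the function names are illustrative assumptions, not the benchmark's actual code.

```python
# Toy "scene graph": one node per annotated object in a frame,
# standing in for the 3D detection annotations mentioned in the abstract.
scene_graph = [
    {"category": "car", "status": "moving"},
    {"category": "car", "status": "parked"},
    {"category": "pedestrian", "status": "moving"},
]

# Manually designed question templates (hypothetical examples), each paired
# with a function that computes the answer from the scene graph.
templates = [
    ("How many {cat}s are there?",
     lambda g, cat: sum(1 for n in g if n["category"] == cat)),
    ("Are there any moving {cat}s?",
     lambda g, cat: any(n["category"] == cat and n["status"] == "moving" for n in g)),
]

def generate_qa(graph):
    """Programmatically fill templates for every category in the graph."""
    qa_pairs = []
    categories = {n["category"] for n in graph}
    for text, answer_fn in templates:
        for cat in sorted(categories):
            question = text.format(cat=cat)
            answer = answer_fn(graph, cat)
            qa_pairs.append((question, str(answer)))
    return qa_pairs

for q, a in generate_qa(scene_graph):
    print(q, "->", a)
```

In this toy run, "How many cars are there?" yields "2" and "Are there any moving pedestrians?" yields "True"; the real benchmark presumably applies far richer templates (existence, counting, comparison, status, etc.) over multi-frame, multi-modal annotations.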


