Towards a performance analysis on pre-trained Visual Question Answering models for autonomous driving

07/18/2023
by   Kaavya Rekanar, et al.

This short paper presents a preliminary analysis of three popular Visual Question Answering (VQA) models, namely ViLBERT, ViLT, and LXMERT, in the context of answering questions about driving scenarios. The performance of these models is evaluated by comparing the similarity of their responses to reference answers provided by computer vision experts. Model selection is based on an analysis of how transformers are used in multimodal architectures. The results indicate that models incorporating cross-modal attention and late fusion exhibit promising potential for generating improved answers in a driving context. This initial analysis serves as a launchpad for a forthcoming comprehensive comparative study involving nine VQA models and sets the stage for further investigations into the effectiveness of VQA model queries in self-driving scenarios. Supplementary material is available at https://github.com/KaavyaRekanar/Towards-a-performance-analysis-on-pre-trained-VQA-models-for-autonomous-driving.
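The abstract states that model answers are scored by their similarity to expert-provided reference answers, but does not specify the metric. As a purely hypothetical illustration, the sketch below scores a candidate answer against a set of references using a normalized character-sequence ratio from Python's standard library; the function names and the choice of `difflib.SequenceMatcher` are assumptions, not the paper's method.

```python
import difflib

def answer_similarity(candidate: str, reference: str) -> float:
    """Similarity in [0, 1] between a model answer and one reference answer.

    Uses difflib's sequence-matching ratio on lowercased text; this is an
    illustrative stand-in for whatever metric the paper actually employs.
    """
    return difflib.SequenceMatcher(None, candidate.lower(), reference.lower()).ratio()

def best_reference_score(candidate: str, references: list[str]) -> float:
    """Score a candidate against several expert references, keeping the best match."""
    return max(answer_similarity(candidate, ref) for ref in references)

# Toy driving-scenario example (invented data, for illustration only)
refs = ["the pedestrian is crossing the road", "a person crosses the street"]
print(best_reference_score("the pedestrian is crossing the road", refs))  # 1.0
```

Taking the maximum over references is one common convention when several ground-truth answers are acceptable; averaging over references would be an equally reasonable design choice.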

Related research

NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario (05/24/2023)
We introduce a novel visual question answering (VQA) task in the context...

SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering (04/05/2022)
While Visual Question Answering (VQA) has progressed rapidly, previous w...

Explaining Autonomous Driving Actions with Visual Question Answering (07/19/2023)
The end-to-end learning ability of self-driving vehicles has achieved si...

QACE: Asking Questions to Evaluate an Image Caption (08/28/2021)
In this paper, we propose QACE, a new metric based on Question Answering...

Regularizing Attention Networks for Anomaly Detection in Visual Question Answering (09/21/2020)
For stability and reliability of real-world applications, the robustness...

A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021 (06/24/2021)
In this paper, inspired by the successes of vision-language pre-trained m...

Improved RAMEN: Towards Domain Generalization for Visual Question Answering (09/06/2021)
Currently nearing human-level performance, Visual Question Answering (VQ...
