AVIS: Autonomous Visual Information Seeking with Large Language Models

06/13/2023
by Ziniu Hu, et al.

In this paper, we propose AVIS, an autonomous information-seeking visual question answering framework. Our method leverages a Large Language Model (LLM) to dynamically strategize the use of external tools and to investigate their outputs, thereby acquiring the knowledge needed to answer the posed questions. Responding to visual questions that require external knowledge, such as "What event is commemorated by the building depicted in this image?", is a complex task: it presents a combinatorial search space that demands a sequence of actions, including invoking APIs, analyzing their responses, and making informed decisions. We conduct a user study to collect a variety of instances of human decision-making when faced with this task. This data is then used to design a system composed of three components: an LLM-powered planner that dynamically determines which tool to use next, an LLM-powered reasoner that analyzes and extracts key information from the tool outputs, and a working memory component that retains the acquired information throughout the process. The collected user behavior guides our system in two key ways. First, we create a transition graph by analyzing the sequence of decisions made by users; this graph delineates distinct states and confines the set of actions available at each state. Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions. We show that AVIS achieves state-of-the-art results on knowledge-intensive visual question answering benchmarks such as Infoseek and OK-VQA.
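To make the control loop concrete, below is a minimal Python sketch of the planner/reasoner/working-memory cycle described above, with the LLM calls stubbed out. The tool names, the contents of TRANSITION_GRAPH, and every class and function here are illustrative assumptions for exposition, not the paper's actual implementation.

```python
# Minimal sketch of the AVIS-style control loop described in the abstract.
# All names (tools, states, signatures) are illustrative, not from the paper.

from dataclasses import dataclass, field
from typing import Callable

# A tool takes a query string and returns raw output (e.g., an API response).
Tool = Callable[[str], str]


@dataclass
class WorkingMemory:
    """Retains information acquired throughout the process."""
    facts: list = field(default_factory=list)

    def add(self, fact: str) -> None:
        self.facts.append(fact)

    def render(self) -> str:
        return "\n".join(self.facts)


# Transition graph built from user-study data: each state confines
# the set of actions (tools) available at the next step.
TRANSITION_GRAPH = {
    "start": ["image_search", "object_detection"],
    "image_search": ["web_search", "answer"],
    "object_detection": ["web_search", "answer"],
    "web_search": ["answer"],
}


def planner(state: str, memory: WorkingMemory, question: str) -> str:
    """LLM-powered planner (stubbed): picks the next tool from the
    actions the transition graph permits in the current state. A real
    system would prompt an LLM with in-context examples of human
    decision-making; here we simply take the first legal action."""
    return TRANSITION_GRAPH[state][0]


def reasoner(tool_output: str, memory: WorkingMemory) -> str:
    """LLM-powered reasoner (stubbed): extracts key information from
    the raw tool output before it is written to working memory."""
    return f"extracted: {tool_output}"


def avis(question: str, tools: dict[str, Tool], max_steps: int = 5) -> str:
    memory = WorkingMemory()
    state = "start"
    for _ in range(max_steps):
        action = planner(state, memory, question)
        if action == "answer":
            break
        output = tools[action](question)
        memory.add(reasoner(output, memory))
        state = action
    # A real system would generate a final answer from memory here.
    return memory.render()


if __name__ == "__main__":
    dummy_tools = {
        "image_search": lambda q: "building: Berlin Wall memorial",
        "object_detection": lambda q: "objects: wall, plaque",
        "web_search": lambda q: "the Berlin Wall fell in 1989",
    }
    print(avis("What event is commemorated by the building?", dummy_tools))
```

Restricting the planner's choices to TRANSITION_GRAPH[state] mirrors how the user-study transition graph prunes the combinatorial action space at each step, rather than letting the LLM choose freely among all tools.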


