Visual Programming: Compositional visual reasoning without training

11/18/2022
by Tanmay Gupta, et al.

We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate Python-like modular programs, which are then executed to produce both the solution and a comprehensive, interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or Python functions to produce intermediate outputs that may be consumed by subsequent parts of the program. We demonstrate the flexibility of VISPROG on four diverse tasks: compositional visual question answering, zero-shot reasoning on image pairs, factual knowledge object tagging, and language-guided image editing. We believe neuro-symbolic approaches like VISPROG are an exciting avenue for easily and effectively expanding the scope of AI systems to serve the long tail of complex tasks that people may wish to perform.
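To make the execution model concrete, below is a minimal sketch of how a VISPROG-style program could be interpreted line by line. The module stubs (LOC, CROP, COUNT, EVAL), the program syntax, and the execute helper are illustrative assumptions based only on the abstract, not the paper's actual modules or implementation; in a real system each module would wrap a pretrained vision model or image-processing routine.

import re

# Hypothetical module registry. Each entry stands in for an off-the-shelf
# vision model, image-processing routine, or plain Python function; the
# string-returning stubs here are placeholders, not VISPROG's real modules.
MODULES = {
    "LOC":   lambda image, object: f"<box of {object} in {image}>",
    "CROP":  lambda image, box: f"<crop of {image} at {box}>",
    "COUNT": lambda box: 1,
}

LINE_RE = re.compile(r"^(\w+)\s*=\s*(\w+)\((.*)\)$")

def execute(program, inputs):
    # Run a VISPROG-style program line by line, threading intermediate
    # outputs through a shared environment that doubles as the rationale.
    env = dict(inputs)
    for line in (l.strip() for l in program.splitlines()):
        if not line:
            continue
        target, module, arg_str = LINE_RE.match(line).groups()
        kwargs = {}
        for arg in arg_str.split(","):
            key, value = (p.strip() for p in arg.split("=", 1))
            # Quoted values are literals; bare names refer to earlier outputs.
            kwargs[key] = value.strip("'\"") if value[0] in "'\"" else env[value]
        if module == "EVAL":
            # EVAL runs a small Python expression over prior results.
            env[target] = eval(kwargs["expr"], {}, env)
        else:
            env[target] = MODULES[module](**kwargs)
    return env

# Example program of the kind an LLM might generate for
# "Is there a dog in the image?" (illustrative, not taken from the paper).
program = """
BOX0 = LOC(image=IMAGE, object='dog')
IMAGE0 = CROP(image=IMAGE, box=BOX0)
COUNT0 = COUNT(box=BOX0)
ANSWER = EVAL(expr='COUNT0 > 0')
"""

result = execute(program, {"IMAGE": "input.jpg"})
print(result["ANSWER"])   # True; every intermediate output stays in `result`

Because each line writes its output into the shared environment, the full chain of intermediate results is available after execution, which is what gives the approach its interpretable, step-by-step rationale.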
