Image Manipulation via Multi-Hop Instructions – A New Dataset and Weakly-Supervised Neuro-Symbolic Approach

05/23/2023
by Harman Singh, et al.

We are interested in image manipulation via natural language text, a task that is useful for multiple AI applications but requires complex reasoning over multi-modal spaces. We extend the recently proposed Neuro-Symbolic Concept Learning (NSCL) approach, which has been quite effective for Visual Question Answering (VQA), to the task of image manipulation. Our system, referred to as NeuroSIM, can perform complex multi-hop reasoning over multi-object scenes and requires only weak supervision in the form of annotated data for VQA. NeuroSIM parses an instruction into a symbolic program, based on a Domain-Specific Language (DSL) comprising object attributes and manipulation operations, that guides its execution. We create a new dataset for the task, and extensive experiments demonstrate that NeuroSIM is highly competitive with, or beats, SOTA baselines that make use of supervised data for manipulation.
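To make the program-guided execution concrete, here is a minimal sketch of how a multi-hop instruction could be parsed into a symbolic program and run over a scene. The scene encoding, the operator names (scene, filter, relate, change), and the example instruction are illustrative assumptions in the spirit of a CLEVR-style DSL, not the paper's actual grammar:

```python
from copy import deepcopy

# Toy scene graph (assumed encoding): each object is a dict of attributes;
# "relations" maps a relation name to the ids of objects standing in that
# relation to the key object (e.g. objects behind it).
SCENE = [
    {"id": 0, "shape": "sphere", "color": "red",   "relations": {"behind": [1]}},
    {"id": 1, "shape": "cube",   "color": "blue",  "relations": {"behind": []}},
    {"id": 2, "shape": "cube",   "color": "green", "relations": {"behind": [0, 1]}},
]

# Hypothetical symbolic program for: "change the color of the cube behind
# the red sphere to green" -- a linear sequence of (operator, argument) steps:
# reasoning hops first, then a single manipulation step.
PROGRAM = [
    ("scene", None),
    ("filter", ("color", "red")),
    ("filter", ("shape", "sphere")),
    ("relate", "behind"),
    ("filter", ("shape", "cube")),
    ("change", ("color", "green")),
]

def execute(program, scene):
    """Run a symbolic program over a scene; returns the edited scene."""
    scene = deepcopy(scene)
    by_id = {obj["id"]: obj for obj in scene}
    selected = []  # ids of the objects the instruction currently refers to
    for op, arg in program:
        if op == "scene":          # start from all objects
            selected = [o["id"] for o in scene]
        elif op == "filter":       # keep objects matching attribute == value
            attr, value = arg
            selected = [i for i in selected if by_id[i][attr] == value]
        elif op == "relate":       # hop to objects related to the selection
            selected = [j for i in selected for j in by_id[i]["relations"][arg]]
        elif op == "change":       # manipulation: rewrite an attribute in place
            attr, value = arg
            for i in selected:
                by_id[i][attr] = value
        else:
            raise ValueError(f"unknown operator: {op}")
    return scene

if __name__ == "__main__":
    for obj in execute(PROGRAM, SCENE):
        print(obj)  # object 1 (the cube behind the red sphere) is now green
```

The sketch only illustrates the multi-hop program structure the abstract describes; NeuroSIM itself learns its parser and executor from weak VQA supervision and operates over visual representations of real scenes rather than a ground-truth attribute table like this one.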
