Towards General Purpose Vision Systems

by Tanmay Gupta, et al.

A special purpose learning system assumes knowledge of admissible tasks at design time. Adapting such a system to unforeseen tasks requires architecture manipulation such as adding an output head for each new task or dataset. In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text. The system supports a wide range of vision tasks such as classification, localization, question answering, captioning, and more. We evaluate the system's ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting.
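The core idea is a single input/output contract shared by all tasks: the model takes an image and a natural-language task description, and always returns bounding boxes, confidences, and text. The sketch below illustrates that contract only; the class and field names (`GPVOutput`, `GeneralPurposeVision`) and the keyword-based stub logic are illustrative assumptions, not the authors' actual implementation.

```python
# Minimal sketch of a task-agnostic vision interface, assuming
# hypothetical names; a real model would jointly encode the image
# and the task description with a learned network.
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class GPVOutput:
    boxes: List[Tuple[float, float, float, float]]  # (x1, y1, x2, y2)
    confidences: List[float]                        # one score per box
    text: str                                       # answer / caption / label

class GeneralPurposeVision:
    """Every task uses the same predict() signature; no per-task heads."""

    def predict(self, image: Optional[object], task: str) -> GPVOutput:
        # Stub: dispatch on the task string just to show that
        # localization and text-generation tasks share one output type.
        if "locate" in task.lower():
            return GPVOutput(boxes=[(0.1, 0.1, 0.5, 0.5)],
                             confidences=[0.9], text="")
        return GPVOutput(boxes=[], confidences=[], text="a stub caption")

gpv = GeneralPurposeVision()
localization = gpv.predict(image=None, task="Locate the dog")
captioning = gpv.predict(image=None, task="Describe the image")
```

Because classification, localization, question answering, and captioning all map onto this one structure, adding a new task requires only a new task description at inference time, not a new output head.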


