DetGPT: Detect What You Need via Reasoning

05/23/2023
by   Renjie Pi, et al.
0

In recent years, the field of computer vision has seen significant advancements thanks to the development of large language models (LLMs). These models have enabled more effective and sophisticated interactions between humans and machines, paving the way for novel techniques that blur the lines between human and machine intelligence. In this paper, we introduce a new paradigm for object detection that we call reasoning-based object detection. Unlike conventional object detection methods that rely on specific object names, our approach enables users to interact with the system using natural language instructions, allowing for a higher level of interactivity. Our proposed method, called DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors to perform reasoning within the context of the user's instructions and the visual scene. This enables DetGPT to automatically locate the object of interest based on the user's expressed desires, even if the object is not explicitly mentioned. For instance, if a user expresses a desire for a cold beverage, DetGPT can analyze the image, identify a fridge, and use its knowledge of typical fridge contents to locate the beverage. This flexibility makes our system applicable across a wide range of fields, from robotics and automation to autonomous driving. Overall, our proposed paradigm and DetGPT demonstrate the potential for more sophisticated and intuitive interactions between humans and machines. We hope that our proposed paradigm and approach will provide inspiration to the community and open the door to more interative and versatile object detection systems. Our project page is launched at detgpt.github.io.

READ FULL TEXT

page 7

page 10

page 11

page 12

research
07/11/2019

A Survey of Deep Learning-based Object Detection

Object detection is one of the most important and challenging branches o...
research
05/30/2023

Multi-modal Queried Object Detection in the Wild

We introduce MQ-Det, an efficient architecture and pre-training strategy...
research
05/29/2023

Contextual Object Detection with Multimodal Large Language Models

Recent Multimodal Large Language Models (MLLMs) are remarkable in vision...
research
06/08/2023

Multi-Modal Classifiers for Open-Vocabulary Object Detection

The goal of this paper is open-vocabulary object detection (OVOD) x2013 ...
research
03/14/2019

SimpleDet: A Simple and Versatile Distributed Framework for Object Detection and Instance Recognition

Object detection and instance recognition play a central role in many AI...
research
03/25/2022

Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers

Natural language is the most intuitive medium for us to interact with ot...
research
11/04/2019

VASTA: A Vision and Language-assisted Smartphone Task Automation System

We present VASTA, a novel vision and language-assisted Programming By De...

Please sign up or login with your details

Forgot password? Click here to reset