GRiT: A Generative Region-to-text Transformer for Object Understanding

12/01/2022
by Jialian Wu, et al.

This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. For example, the text in object detection denotes class names while that in dense captioning refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects via not only simple nouns, but also rich descriptive sentences including object attributes or actions. Experimentally, we apply GRiT to object detection and dense captioning tasks. GRiT achieves 60.4 AP on COCO 2017 test-dev for object detection and 15.5 mAP on Visual Genome for dense captioning. Code is available at https://github.com/JialianW/GRiT
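
The <region, text> formulation described above can be illustrated with a small sketch. It is not taken from the GRiT code base: the names `RegionText`, `understand_objects`, and the three callables (`encode`, `extract_regions`, `decode_text`) are hypothetical stand-ins for the visual encoder, the foreground object extractor, and the text decoder, and the toy functions simply return fixed values. The point is only that object detection and dense captioning can share one pipeline and one output format, differing in the text that is produced.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

# Assumed box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class RegionText:
    """One <region, text> pair: where an object is and how it is described."""
    box: Box
    text: str


def understand_objects(
    image: Any,
    encode: Callable[[Any], Any],
    extract_regions: Callable[[Any], List[Box]],
    decode_text: Callable[[Any, Box], str],
) -> List[RegionText]:
    """Hypothetical three-stage pipeline mirroring the abstract's description."""
    features = encode(image)              # visual encoder: image -> features
    boxes = extract_regions(features)     # foreground object extractor: features -> boxes
    # text decoder: one open-set description per localized region
    return [RegionText(box, decode_text(features, box)) for box in boxes]


# Toy stand-ins (assumptions for illustration only; no learning involved).
def toy_encode(image):
    return {"size": (len(image[0]), len(image))}  # (width, height)


def toy_extract_regions(features):
    width, height = features["size"]
    return [(0.0, 0.0, width / 2, height / 2)]


def toy_class_name_decoder(features, box):
    return "dog"  # detection-style text: a class name


def toy_caption_decoder(features, box):
    return "a brown dog lying on the grass"  # dense-captioning-style text


image = [[0] * 4 for _ in range(2)]  # a 4x2 placeholder "image"
detections = understand_objects(image, toy_encode, toy_extract_regions, toy_class_name_decoder)
captions = understand_objects(image, toy_encode, toy_extract_regions, toy_caption_decoder)
```

In this sketch both calls return the same regions and the same `RegionText` type; only the decoded text changes, which is the sense in which the two tasks share a single formulation.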
