Unified Open-Vocabulary Dense Visual Prediction

07/17/2023
by Hengcan Shi, et al.

In recent years, open-vocabulary (OV) dense visual prediction (such as OV object detection and semantic, instance, and panoptic segmentation) has attracted increasing research attention. However, most existing approaches are task-specific and tackle each task individually. In this paper, we propose a Unified Open-Vocabulary Network (UOVN) to jointly address four common dense prediction tasks. Compared with separate models, a unified network is more desirable for diverse industrial applications. Moreover, OV dense prediction training data is relatively scarce. Separate networks can only leverage task-relevant training data, while a unified approach can integrate diverse training data to boost individual tasks. We address two major challenges in unified OV prediction. First, unlike unified methods for fixed-set predictions, OV networks are usually trained with multi-modal data. We therefore propose a multi-modal, multi-scale and multi-task (MMM) decoding mechanism to better leverage multi-modal data. Second, because UOVN uses data from different tasks for training, there are significant domain and task gaps. We present a UOVN training mechanism to reduce such gaps. Experiments on four datasets demonstrate the effectiveness of our UOVN.


Related research

Open-Vocabulary Object Detection via Scene Graph Discovery (07/07/2023)
In recent years, open-vocabulary (OV) object detection has attracted inc...

Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection (08/30/2023)
In this paper, we for the first time explore helpful multi-modal context...

Multi-modal Queried Object Detection in the Wild (05/30/2023)
We introduce MQ-Det, an efficient architecture and pre-training strategy...

Towards Flexible Multi-modal Document Models (03/31/2023)
Creative workflows for generating graphical documents involve complex in...

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection (03/23/2022)
Finding relevant moments and highlights in videos according to natural l...

Multi-Modal Aesthetic Assessment for MObile Gaming Image (01/27/2021)
With the proliferation of various gaming technology, services, game styl...

AntM^2C: A Large Scale Dataset For Multi-Scenario Multi-Modal CTR Prediction (08/31/2023)
Click-through rate (CTR) prediction is a crucial issue in recommendation...
