Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

08/30/2023
by Yifan Xu, et al.

In this paper, we explore, for the first time, multi-modal contextual knowledge for understanding novel categories in open-vocabulary object detection (OVD). Multi-modal contextual knowledge refers to the joint relationships across image regions and caption words. Incorporating such knowledge into OVD is challenging, however, because previous detection frameworks cannot model it jointly: object detectors accept only visual inputs, and no caption is available at test time. To this end, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, which transfers the contextual knowledge learned by a teacher fusion transformer with diverse multi-modal masked language modeling (D-MLM) to a student detector. Diverse multi-modal masked language modeling adds an object divergence constraint to traditional multi-modal masked language modeling (MLM) in order to extract fine-grained region-level visual contexts, which are vital for object detection. Extensive experiments on various detection datasets demonstrate the effectiveness of our multi-modal context learning strategy, with our approach clearly outperforming recent state-of-the-art methods.
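To make the distillation idea above concrete, the following is a minimal, hypothetical sketch of how a teacher fusion transformer could perform masked language modeling over concatenated region and caption features while its contextualized region representations are distilled into a student detector's region embeddings. All names (FusionTeacher, distillation_losses), dimensions, and loss choices are assumptions for illustration only; the paper's diverse MLM with its object divergence constraint is not reproduced here.

```python
# Hypothetical sketch of multi-modal contextual knowledge distillation
# in the spirit of MMC-Det. Module names, dimensions, and losses are
# illustrative assumptions; the paper's actual design may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusionTeacher(nn.Module):
    """Teacher: a fusion transformer that jointly encodes region features
    and (partially masked) caption token embeddings."""

    def __init__(self, dim=256, vocab_size=30522, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.mlm_head = nn.Linear(dim, vocab_size)  # predicts masked words

    def forward(self, region_feats, token_embeds):
        # Concatenate regions and tokens so attention mixes both modalities.
        x = torch.cat([region_feats, token_embeds], dim=1)
        x = self.encoder(x)
        num_regions = region_feats.size(1)
        ctx_regions = x[:, :num_regions]                 # contextualized regions
        mlm_logits = self.mlm_head(x[:, num_regions:])   # masked-word predictions
        return ctx_regions, mlm_logits


def distillation_losses(ctx_regions, student_regions, mlm_logits, mlm_labels):
    """MLM loss for the teacher plus an L1 loss transferring the teacher's
    contextual region knowledge to the student detector's region embeddings."""
    mlm_loss = F.cross_entropy(
        mlm_logits.flatten(0, 1), mlm_labels.flatten(), ignore_index=-100
    )
    distill_loss = F.l1_loss(student_regions, ctx_regions.detach())
    return mlm_loss, distill_loss


if __name__ == "__main__":
    B, R, T, D = 2, 5, 12, 256
    teacher = FusionTeacher(dim=D)
    region_feats = torch.randn(B, R, D)            # region features from image crops
    token_embeds = torch.randn(B, T, D)            # caption embeddings, some masked
    mlm_labels = torch.randint(0, 30522, (B, T))   # ids of the masked words
    student_regions = torch.randn(B, R, D, requires_grad=True)  # from the detector

    ctx_regions, mlm_logits = teacher(region_feats, token_embeds)
    mlm_loss, distill_loss = distillation_losses(
        ctx_regions, student_regions, mlm_logits, mlm_labels
    )
    print(float(mlm_loss), float(distill_loss))
```

Detaching the teacher's contextualized regions in the distillation term keeps its gradients flowing only into the student, which matches the usual teacher-student setup; again, this is a sketch under stated assumptions rather than the authors' implementation.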

research
05/30/2023
Multi-modal Queried Object Detection in the Wild
We introduce MQ-Det, an efficient architecture and pre-training strategy...

research
06/08/2023
Multi-Modal Classifiers for Open-Vocabulary Object Detection
The goal of this paper is open-vocabulary object detection (OVOD) – ...

research
08/23/2022
DeepInteraction: 3D Object Detection via Modality Interaction
Existing top-performance 3D object detectors typically rely on the multi...

research
07/17/2023
Unified Open-Vocabulary Dense Visual Prediction
In recent years, open-vocabulary (OV) dense visual prediction (such as O...

research
09/22/2021
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Self-supervised vision-and-language pretraining (VLP) aims to learn tran...

research
08/02/2023
A vision transformer-based framework for knowledge transfer from multi-modal to mono-modal lymphoma subtyping models
Determining lymphoma subtypes is a crucial step for better patients trea...

research
12/27/2021
VibEmoji: Exploring User-authoring Multi-modal Emoticons in Social Communication
Emoticons are indispensable in online communications. With users' growin...
