You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers

07/03/2022
by   Xiaoke Shen, et al.

Most systems use different models for different modalities, e.g., one model for processing RGB images and another for depth images. Meanwhile, recent work has shown that a model designed for one modality can be reused for another with the help of cross-modality transfer learning. In this article, we go further and find that, by using a vision transformer together with cross/inter-modality transfer learning, a single unified detector can achieve strong performance across different input modalities. The unified model is practical for robotics because there is no need to maintain separate models or weights per modality, making the system more efficient. One application scenario: without any change to the model architecture or weights, a robot can switch smoothly between using the RGB camera (or both RGB and depth sensors) during the day and the depth sensor alone at night. Experiments on the SUN RGB-D dataset show that our unified model is not only efficient but also matches or exceeds single-modality baselines in terms of mAP50 on the SUNRGBD16 categories: compared with the RGB-only model, ours is slightly worse (52.3 → 51.9); compared with the point-cloud-only model, ours is on par (52.7 → 52.8); and when using the novel inter-modality mixing method proposed in this work, our model achieves a significantly better result, a 3.1-point absolute improvement (52.7 → 55.8) over the previous best. Code (including training/inference logs and model checkpoints) is available at <https://github.com/liketheflower/YONOD.git>.
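To illustrate how a single detector can consume either modality without any weight update, a minimal sketch of one common preprocessing step is shown below: normalizing a single-channel depth map and replicating it across three channels so it matches the RGB input layout. The function name and normalization details here are illustrative assumptions, not necessarily the paper's exact pipeline.

```python
import numpy as np

def depth_to_pseudo_rgb(depth, d_min=None, d_max=None):
    """Scale a single-channel depth map to [0, 255] and replicate it
    across three channels, giving it the same layout as an RGB image.

    This lets a 3-channel detector consume depth input; the exact
    preprocessing used in the paper may differ (hypothetical sketch).
    """
    depth = depth.astype(np.float32)
    d_min = depth.min() if d_min is None else d_min
    d_max = depth.max() if d_max is None else d_max
    scaled = (depth - d_min) / max(d_max - d_min, 1e-6) * 255.0
    scaled = np.clip(scaled, 0.0, 255.0).astype(np.uint8)
    return np.stack([scaled] * 3, axis=-1)  # (H, W) -> (H, W, 3)

# The same detector weights can then take either input:
rgb = np.zeros((480, 640, 3), dtype=np.uint8)      # daytime: RGB camera
depth = np.random.rand(480, 640) * 10.0            # night: depth sensor (metres)
pseudo_rgb = depth_to_pseudo_rgb(depth)            # same (480, 640, 3) layout as rgb
```

Because both inputs share one tensor layout, switching modalities at runtime is just a change of preprocessing, with no architecture or checkpoint swap.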


