DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

05/25/2023
by Sitian Shen, et al.

Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning; the CLIP model in particular has achieved impressive results in image classification, object detection, and semantic segmentation. However, its performance on 3D point cloud processing tasks is limited by the domain gap between depth maps rendered from 3D projections and the natural images CLIP was trained on. This paper proposes DiffCLIP, a new pre-training framework that incorporates Stable Diffusion with ControlNet into the visual branch to minimize this domain gap, and introduces a style-prompt generation module in the textual branch for few-shot tasks. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong 3D understanding ability. Using Stable Diffusion and style-prompt generation, DiffCLIP achieves a state-of-the-art zero-shot classification accuracy of 43.2% on the OBJ_BG split of ScanObjectNN, and an 80.6% zero-shot accuracy on ModelNet10, comparable to state-of-the-art performance.
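The zero-shot step inherited from CLIP can be sketched independently of the diffusion components: each class label is embedded as a text prompt, and the image embedding (in DiffCLIP, a CLIP encoding of the diffusion-translated depth map) is matched to the most similar prompt by cosine similarity. A minimal sketch with placeholder NumPy vectors standing in for real CLIP encoder outputs (the function name and toy embeddings are illustrative, not from the paper):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Return the label whose text embedding is most cosine-similar
    to the image embedding.

    image_emb: (d,) image feature (stand-in for a CLIP visual encoding).
    text_embs: (n, d) text features, one per class prompt.
    labels:    list of n class names.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity to each class prompt
    return labels[int(np.argmax(sims))]

# Toy example: 3 classes, 4-dim embeddings (illustrative only).
labels = ["chair", "table", "sofa"]
text_embs = np.eye(3, 4)                    # one orthogonal vector per class
image_emb = np.array([0.9, 0.1, 0.0, 0.1])  # closest to the "chair" prompt
pred = zero_shot_classify(image_emb, text_embs, labels)
# pred == "chair"
```

In the real pipeline both sets of embeddings come from CLIP's frozen encoders, so classification reduces to this nearest-prompt lookup; DiffCLIP's contribution is making the image embedding more CLIP-like before this step.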


Related research

10/03/2022

CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training

Pre-training across 3D vision and language remains under development bec...
12/29/2021

A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model

Recently, zero-shot image classification by vision-language pre-training...
02/07/2023

Boosting Zero-shot Classification with Synthetic Data Diversity via Stable Diffusion

Recent research has shown it is possible to perform zero-shot classifica...
05/18/2023

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Achieving machine autonomy and human control often represent divergent o...
08/23/2023

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

Producing quality segmentation masks for images is a fundamental problem...
09/21/2023

Exploiting CLIP-based Multi-modal Approach for Artwork Classification and Retrieval

Given the recent advances in multimodal image pretraining where visual m...
11/21/2022

PointCLIP V2: Adapting CLIP for Powerful 3D Open-world Learning

Contrastive Language-Image Pre-training (CLIP) has shown promising open-...
