Can Language Understand Depth?

07/03/2022
by   Renrui Zhang, et al.

Beyond image classification, Contrastive Language-Image Pre-training (CLIP) has achieved extraordinary success on a wide range of vision tasks, including object-level and 3D space understanding. However, it remains challenging to transfer the semantic knowledge learned by CLIP to more intricate tasks with quantified targets, such as depth estimation with geometric information. In this paper, we propose to apply CLIP to zero-shot monocular depth estimation, a method we name DepthCLIP. We find that patches of the input image can respond to a particular semantic distance token and then be projected to a quantified depth bin for coarse estimation. Without any training, DepthCLIP surpasses existing unsupervised methods and even approaches early fully-supervised networks. To the best of our knowledge, we are the first to perform zero-shot adaptation from semantic language knowledge to quantified downstream tasks and to conduct zero-shot monocular depth estimation. We hope our work can shed light on future research. The code is available at https://github.com/Adonis-galaxy/DepthCLIP.
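The core idea described in the abstract, mapping per-patch similarities over semantic distance prompts to a scalar depth via quantified bins, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the prompt words, bin centers (in meters), and temperature here are assumptions, and the per-patch CLIP similarity logits are supplied as toy inputs rather than computed from a real image encoder.

```python
import numpy as np

# Hypothetical semantic distance prompts and corresponding depth bin
# centers (meters). The exact prompts and bin values used by DepthCLIP
# are assumptions for illustration only.
PROMPTS = ["giant", "extremely close", "close", "not in distance",
           "a little remote", "far", "unseen"]
BIN_CENTERS = np.array([1.00, 1.50, 2.00, 2.25, 2.50, 2.75, 3.00])

def patch_depth(similarity_logits, temperature=0.1):
    """Map a patch's CLIP similarity logits over the K distance prompts
    to a scalar depth via a softmax-weighted sum of bin centers."""
    logits = np.asarray(similarity_logits, dtype=np.float64) / temperature
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ BIN_CENTERS

# Toy example: a patch whose similarity is highest for the "far" prompt
# yields a depth close to that prompt's bin center (2.75 m here).
logits = [0.1, 0.2, 0.1, 0.3, 0.2, 0.9, 0.1]
print(patch_depth(logits))
```

In the full method this weighting would be applied independently to every image patch, producing a coarse depth map without any depth supervision; the softmax temperature controls how sharply the estimate commits to a single bin.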


Related research

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding (09/28/2021)
We present VideoCLIP, a contrastive approach to pre-train a unified mode...

Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image (07/20/2023)
Reconstructing accurate 3D scenes from images is a long-standing vision ...

Simple but Effective: CLIP Embeddings for Embodied AI (11/18/2021)
Contrastive language image pretraining (CLIP) encoders have been shown t...

Kick Back Relax: Learning to Reconstruct the World by Watching SlowTV (07/20/2023)
Self-supervised monocular depth estimation (SS-MDE) has the potential to...

CLIP Surgery for Better Explainability with Enhancement in Open-Vocabulary Tasks (04/12/2023)
Contrastive Language-Image Pre-training (CLIP) is a powerful multimodal ...

CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion (10/19/2022)
Masked Image Modeling (MIM) has recently been established as a potent pr...

Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation (07/17/2023)
Low-light conditions not only hamper human visual experience but also de...
