DwinFormer: Dual Window Transformers for End-to-End Monocular Depth Estimation

03/06/2023
by Md Awsafur Rahman, et al.

Depth estimation from a single image is of paramount importance in computer vision, with a multitude of applications. Conventional methods suffer from a trade-off between consistency and fine-grained detail because of their limited local receptive field, which restricts their practicality; this lack of long-range dependency stems from the convolutional neural network portion of the architecture. In this paper, a dual window transformer-based network, namely DwinFormer, is proposed, which utilizes both local and global features for end-to-end monocular depth estimation. The DwinFormer consists of dual window self-attention and cross-attention transformers, Dwin-SAT and Dwin-CAT, respectively. Dwin-SAT extracts intricate, locally aware features while concurrently capturing global context. It harnesses local and global window attention to capture both short-range and long-range dependencies, obviating the need for complex and computationally expensive operations such as attention masking or window shifting. Moreover, Dwin-SAT introduces inductive biases that provide desirable properties, such as translational equivariance and reduced dependence on large-scale data. Furthermore, conventional decoding methods often rely on skip connections, which may result in semantic discrepancies and a lack of global context when fusing encoder and decoder features. In contrast, Dwin-CAT employs both local and global window cross-attention to seamlessly fuse encoder and decoder features with both fine-grained local and contextually aware global information, effectively bridging the semantic gap. Empirical evidence obtained through extensive experimentation on the NYU-Depth-V2 and KITTI datasets demonstrates the superiority of the proposed method, which consistently outperforms existing approaches across both indoor and outdoor environments.
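The abstract does not give implementation details, but a minimal sketch may help make the dual-window idea concrete. The PyTorch sketch below pairs standard non-overlapping local window self-attention with a global branch in which every token attends to window-pooled summary tokens, so long-range context is obtained without shifted windows or masks. The module name `DualWindowSelfAttention`, the pooling-based global branch, and all shape and hyperparameter choices are illustrative assumptions, not the paper's actual Dwin-SAT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualWindowSelfAttention(nn.Module):
    """Illustrative dual-window self-attention (hypothetical layout, not the paper's code)."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 8):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), with H and W divisible by the window size.
        B, H, W, C = x.shape
        w = self.window

        # Local branch: self-attention inside each non-overlapping w x w window.
        win = x.view(B, H // w, w, W // w, w, C)
        win = win.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
        local, _ = self.local_attn(win, win, win)
        local = local.view(B, H // w, W // w, w, w, C)
        local = local.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

        # Global branch: every token attends to one average-pooled token per
        # window, supplying long-range context with no shifting or masking.
        pooled = F.adaptive_avg_pool2d(x.permute(0, 3, 1, 2), (H // w, W // w))
        pooled = pooled.flatten(2).transpose(1, 2)        # (B, num_windows, C)
        ctx, _ = self.global_attn(x.reshape(B, H * W, C), pooled, pooled)

        return local + ctx.view(B, H, W, C)


# Toy usage: 8x8 windows over a 32x32 feature map with 64 channels.
attn = DualWindowSelfAttention(dim=64, num_heads=4, window=8)
out = attn(torch.randn(2, 32, 32, 64))                    # -> (2, 32, 32, 64)
```

Under the same assumptions, a Dwin-CAT-style fusion could reuse this pattern with decoder features as queries and encoder skip features as keys and values, replacing a plain skip connection with cross-attention.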

