Lightweight Endoscopic Depth Estimation with CNN-Transformer Encoder
In this study, we tackle the key challenges concerning accuracy and robustness in depth estimation for endoscopic imaging, with a particular emphasis on real-time inference and the impact of reflections. We propose an innovative lightweight solution that integrates Convolutional Neural Networks (CNN) and Transformers to predict multi-scale depth maps. Our approach includes optimizing the network architecture, incorporating multi-scale dilated convolution, and a multi-channel attention mechanism. We also introduce a statistical confidence boundary mask to minimize the impact of reflective areas. Moreover, we propose a novel complexity evaluation metric that considers network parameter size, floating-point operations, and inference frames per second. Our research aims to enhance the efficiency and safety of laparoscopic surgery significantly. We comprehensively evaluate our proposed method and compare it with existing solutions. The results demonstrate that our method ensures depth estimation accuracy while being lightweight.
READ FULL TEXT