Estimating depth from a single monocular image of a general scene is a fundamental problem in computer vision, with widespread applications in scene understanding, 3D modeling, robotics, and other challenging problems. It is a notorious example of an ill-posed problem, as one captured image may correspond to numerous real-world scenes. It remains a challenging task for computer vision algorithms because no reliable cues, such as temporal information or stereo correspondences, can be exploited. Previous research involving depth maps usually relies on geometric [11, 9, 8], convolutional, and semantic techniques. Nevertheless, none of these works tried to perform image classification using depth maps as training data. Different from previous efforts, we propose to utilize estimated depth maps in an image classification task. While extensively studied for semantic labeling and accuracy improvement, depth map regression has been less explored as an aid to classification problems. Intuitively, one might expect that a sufficiently deep neural network would generate its own depth map (or at least simulate depth-map-like features).
Recently, deep convolutional neural networks (CNNs) have become both powerful and widely accessible [10, 12]. With a CNN, we are able to perform depth estimation on a single image. However, most classification tasks are still performed on RGB images alone. Even with only RGB images, CNN features have been setting new records for a wide variety of vision applications [10, 17, 7, 18, 3]. Despite all the successes in depth estimation and image classification, deep CNNs have not yet been used for learning on RGBD images, since RGBD datasets are not as widely used as RGB datasets.
We propose to build an RGBD dataset from an existing RGB dataset and perform image classification on it, as illustrated in Figure 1. We then evaluate the performance of neural networks on the RGBD dataset compared to the RGB dataset. To our knowledge, we are the first to bridge the gap between estimated depth and image classification. In our experiments, the benefit is significant for both shallow and deep networks; for example, it improves ResNet-20 by 0.55%.
To sum up, we highlight the main contributions of this work as follows:
We create an RGBD image dataset for CIFAR-10.
We illustrate that the depth channel provides a better feature representation than the R, G, and B channels, and show that training on RGBD images can improve performance.
We define a new metric for the ill-posed depth prediction problem.
2 Related Work
CNNs have been applied with great success to object classification [13, 19, 20, 10, 2] and detection [7, 18, 3]. They have recently been applied to a variety of other tasks as well, such as depth estimation. Depth estimation from a single image is well addressed by Liu  and Eigen . Both agree that depth estimation is an ill-posed problem, since there is no true ground-truth depth map. We define a transfer learning accuracy metric for depth estimation models (Section 3.1), which makes it easier to compare the performance of different depth estimation models.
Estimated depth maps  have been successfully applied to other problems; for example, depth information has improved performance on semantic labeling . However, depth maps have not yet been combined with an image classification task. To our knowledge, we are the first to bridge the gap between depth estimation and image classification.
3 Method

Recent depth image research mainly focuses on depth estimation  and segmentation with depth images . We have witnessed significant improvement in depth estimation quality in recent years. However, most image classification tasks are still performed on RGB images, so we want to transfer the depth knowledge learned by a depth estimation model into our image classification model. In this section, we first build an RGBD dataset for CIFAR-10 , based on a trained deep convolutional neural field model . To investigate the effect of the depth channel on the image classification task, we design two experiments (one with a simple feed-forward NN and one with a CNN). Finally, we propose a new metric for measuring depth estimation performance.
3.1 Build RGBD Dataset
Since the deep convolutional neural field (CNF) model accepts images  that are much larger than CIFAR-10 tiny images (32 × 32), we build the RGBD dataset as follows:
resize the CIFAR-10 tiny image (32 × 32) to normal size () in order to feed it into the CNF.
perform depth estimation on the normal-size image.
downscale the output depth image () back to tiny size (32 × 32).
combine the RGB and D channels together to form our RGBD image ().
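The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the depth model is passed in as a placeholder callable, the CNF input size of 224 × 224 is an assumption (the paper does not state it), and nearest-neighbor resizing stands in for whatever interpolation was actually used.

```python
import numpy as np

def upscale(img, factor):
    # nearest-neighbor upscaling (stand-in for the paper's resize step)
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def downscale(img, factor):
    # strided downscaling back to the tiny-image resolution
    return img[::factor, ::factor]

def build_rgbd(rgb32, depth_model, big=224):
    # rgb32: (32, 32, 3) CIFAR-10 image; depth_model: callable that maps a
    # (big, big, 3) image to a (big, big) depth map. big=224 is an assumption.
    factor = big // 32
    large = upscale(rgb32, factor)        # step 1: resize to model input size
    depth = depth_model(large)            # step 2: single-image depth estimation
    d32 = downscale(depth, factor)        # step 3: downscale depth to 32 x 32
    # step 4: stack RGB and D into a 4-channel RGBD image
    return np.concatenate([rgb32, d32[:, :, None]], axis=2)
```

With any placeholder depth model, the output is a (32, 32, 4) array, ready to be fed to a classifier in place of the original (32, 32, 3) image.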
Figure 2 shows the transfer learning procedure.
Figure 3 shows some example depth maps. Since there is no ground-truth depth image for the CIFAR-10 dataset, we cannot directly measure the quality of our depth estimates for these tiny images. However, we can infer it indirectly: we use the accuracy results of our two experiments as a new metric to quantify depth map quality.
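The proposed metric is, in essence, the downstream accuracy gain attributable to the depth channel. A minimal sketch (the function name is ours, not the paper's):

```python
def transfer_learning_accuracy_gain(acc_with_depth, acc_rgb_only):
    """Accuracy gain attributable to the estimated depth channel.

    A larger gain suggests the depth estimator produced more informative
    depth maps, giving an indirect quality measure when no ground-truth
    depth exists.
    """
    return acc_with_depth - acc_rgb_only

# e.g. with the shallow-network validation accuracies reported in Section 4.2:
gain = transfer_learning_accuracy_gain(0.56, 0.52)
```

Two depth estimators can then be ranked by the gain each one's depth maps produce on the same classification task.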
3.2 Classification Task on RGBD Dataset
To make the effect of the depth channel easier to observe, we employ a simple two-layer neural network for the classification task. The architecture for learning on the RGBD dataset is shown in Figure 1.
The number of neurons in the input layer depends on the input. If the input is a single channel (R, G, B, or D), we have 32 × 32 = 1024 input neurons. The number of hidden neurons is not fixed; we fine-tune it for each configuration. The number of output neurons is always the number of classes (10 for CIFAR-10).
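A forward pass of such a two-layer network can be sketched in NumPy. The input and output sizes follow the text; the ReLU activation, the hidden width of 100, and the random initialization are our assumptions for illustration.

```python
import numpy as np

def two_layer_forward(x, W1, b1, W2, b2):
    # x: (batch, 1024 * channels) flattened input; 1024 = 32 * 32 per channel
    h = np.maximum(0.0, x @ W1 + b1)   # hidden layer (ReLU assumed)
    return h @ W2 + b2                 # scores for the 10 CIFAR-10 classes

# shapes for a single-channel (e.g. D-only) input with 100 hidden units:
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((1024, 100)), np.zeros(100)
W2, b2 = rng.standard_normal((100, 10)), np.zeros(10)
scores = two_layer_forward(rng.standard_normal((8, 1024)), W1, b1, W2, b2)
```

For RGB input the first weight matrix would have 3 × 1024 = 3072 rows, and for RGBD input 4 × 1024 = 4096; nothing else in the architecture changes.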
We measure depth map quality in two ways. First, we evaluate neural network performance with each of the R, G, B, and D channels as input. Second, we train the network on RGB and RGBD inputs respectively and compare the performance. Our depth estimation is based on TensorFlow, and our neural network training is based on Caffe.
4 Experiments

4.1 R vs G vs B vs D
We fine-tune the network for each channel so that their performances are approximately optimal. Figure 4 compares validation accuracy over time.
At test time, the depth channel outperforms the R, G, and B channels under the same architecture. This implies that the depth channel has a better feature representation than the R, G, and B channels, and suggests that training on the RGBD dataset should yield better performance for shallow networks.
4.2 RGB vs RGBD
Figure 5 compares validation accuracy over time.
We obtain 56% and 52% validation accuracy with the RGBD and RGB datasets respectively. This suggests that the depth map brings extra knowledge, learned by the deep convolutional neural field, to our classification task.
Also note that, although the RGBD dataset has more inputs and neurons, it converges much faster than the RGB dataset. This can be interpreted as a better feature representation brought by the depth map. The estimated depth map thus helps shallow networks; whether it also helps deep networks like ResNet  remains an open question.
4.3 ResNet Experimentation
Our previous experiments using a 2-layer feed-forward neural network yielded a performance increase of 4% when comparing a network trained on the RGB dataset to one trained on the RGBD dataset. These results did not fully satisfy us: a simple feed-forward network may show how good the presented features are, but it is not the optimal tool for image classification. We want to see whether the estimated depth map truly brings in new depth knowledge that cannot be obtained from RGB images alone. We therefore test the performance of ResNet on the CIFAR-10 RGB dataset and compare it to the performance on our CIFAR-10 RGBD dataset. As with the 2-layer networks, only the number of channels in the input layer changes, from 3 to 4. The increased computational complexity is negligible, since later layers have many more channels.
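The claim that widening the input from 3 to 4 channels is negligible can be checked by counting first-layer weights. This sketch assumes the standard CIFAR-style ResNet stem (a 3 × 3 convolution with 16 output filters), which is our assumption about the architecture used:

```python
def conv_weights(in_ch, out_ch, k=3):
    # number of weights in a k x k convolution (bias omitted)
    return k * k * in_ch * out_ch

rgb_stem = conv_weights(3, 16)    # 432 weights for RGB input
rgbd_stem = conv_weights(4, 16)   # 576 weights for RGBD input
extra = rgbd_stem - rgb_stem      # 144 extra weights
# ResNet-20 has roughly 0.27M parameters in total, so the relative
# increase is on the order of 0.05% -- negligible, as the text argues.
```

Only the first convolution grows; every later layer sees the same 16-channel feature maps regardless of whether the input was RGB or RGBD.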
As shown in Table 1, ResNet-20 improves by 0.45% with the estimated depth map. ResNet-56 achieves an error of 6.45% on the RGBD dataset, which is competitive with ResNet-110 on the RGB dataset. This accuracy gain of 0.53% between RGBD and RGB cannot be ignored on CIFAR-10. We conclude that training on the RGBD dataset also yields better performance for deep networks like ResNet; depth-estimation features may be hard for the deep network to generate on its own.
5 Conclusion

We created an RGBD image dataset for CIFAR-10 and defined a transfer learning accuracy metric for the depth prediction problem. On RGBD CIFAR-10, we showed that the depth channel has a better feature representation, and that training on RGBD images can improve image classification for both shallow and deep networks.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
-  J. Carreira, H. Madeira, and J. G. Silva. Xception: A technique for the experimental evaluation of dependability in modern computers. IEEE Transactions on Software Engineering, 24(2):125–136, 1998.
-  J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
-  D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in neural information processing systems, pages 2366–2374, 2014.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
-  A. Gupta, A. A. Efros, and M. Hebert. Blocks world revisited: Image understanding using qualitative geometry and mechanics. In Computer Vision–ECCV 2010, pages 482–496. Springer, 2010.
-  A. Gupta, M. Hebert, T. Kanade, and D. M. Blei. Estimating spatial layout of rooms using volumetric reasoning about objects and surfaces. In Advances in neural information processing systems, pages 1288–1296, 2010.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
-  V. Hedau, D. Hoiem, and D. Forsyth. Thinking inside the box: Using appearance models and context based on room geometry. In Computer Vision–ECCV 2010, pages 224–237. Springer, 2010.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages 675–678. ACM, 2014.
-  A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images, 2009.
-  L. Ladicky, J. Shi, and M. Pollefeys. Pulling things out of perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 89–96, 2014.
-  J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan. Attentive contexts for object detection. IEEE Transactions on Multimedia, 19(5):944–954, 2017.
-  F. Liu, C. Shen, and G. Lin. Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5162–5170, 2015.
-  A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. Cnn features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
-  S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.