I Introduction
Range measurement is vital for robot and autonomous vehicle operations. For ground vehicles, reliable and accurate range sensing is key to Adaptive Cruise Control, Automatic Emergency Braking, and autonomous driving. With the rapid development of deep learning techniques, image-based depth prediction has attracted considerable attention and achieved notable results, promising cost-effective and accessible range sensing using commercial monocular cameras. However, depth ground truth for an image is not always available for training a neural network. In outdoor scenarios, we mainly rely on LIDAR sensors today to provide accurate and detailed depth measurements, but the resulting point clouds are sparse compared with image pixels. Besides, LIDARs cannot obtain reliable returns from some surfaces (e.g., dark, reflective, or transparent ones [1]). Stereo cameras are another option for range sensing, but their accuracy degrades at mid-to-far distances. Generating ground-truth depth from an external visual SLAM module [2, 3] suffers from similar problems, being subject to noise and error.


Due to the lack of perfect ground truth as discussed above, plus the fact that monocular cameras are widely available, much research effort has been devoted to unsupervised monocular depth learning, which requires only sequences of monocular images as training data. These approaches have achieved results close to those of supervised approaches [7]. However, monocular unsupervised approaches are inherently scale-ambiguous: the predicted depth is relative and needs a scale factor to recover metric depth, which limits real-world deployment.
Although LIDAR sensors are not yet ready for mass deployment on vehicles, a series of large-scale driving datasets with sensor suites including cameras and LIDARs [8, 9, 10, 11] has already been published. Motivated by the availability of such rich datasets, we improve monocular depth regression by leveraging sparse LIDAR data as ground truth. As stated in BTS [4], a very recent work ranking 1st in monocular depth estimation on the KITTI dataset [6] (Eigen's split [1]), the high sparsity of the ground-truth data limits neural network performance. Addressing this issue, we propose a new continuous 3D loss that transforms discrete point clouds into continuous functions. The proposed loss better exploits data correlation in Euclidean and feature spaces, leading to improved performance of current deep neural networks. An example is shown in Fig. 1. We note that the proposed 3D loss function is agnostic to the network architecture, whose design is an active research area in its own right. The main contributions include:
We propose a novel continuous 3D loss function for monocular depth estimation.

Our work is open-sourced and available at
https://github.com/minghanz/DepthC3D.
The remainder of this paper is organized as follows. The literature is reviewed in Sec. II. The proposed new loss function, its theoretical foundation, and its application to monocular depth prediction are introduced in Sec. III. The experimental setup and results are presented in Sec. IV. Section V concludes the paper and provides future work ideas.
II Related Work
Deep-learning-based 3D geometric understanding shares ideas with SfM/vSLAM approaches. For example, the use of reprojection loss in unsupervised depth prediction [14] and direct methods in SfM/vSLAM [15] are tightly connected. However, they are fundamentally different, since backpropagation in neural networks only takes a small step along the gradient to gradually learn a general prior from large amounts of data. Learning correspondences among different views can assist with recovering depth [16] if stereo or multi-view images are available as input. For single-view depth prediction, the network needs to learn from more general cues, including perspective, object size, and scene layout. Although single-view depth prediction is an ill-posed problem in theory, since infinitely many 3D layouts could produce the same rendered 2D image, the task is still viable because the plausible geometric layouts that actually occur in the real world are limited and can be learned from data.
II-A Supervised single-view depth prediction
It is straightforward to learn image depth by minimizing the point-wise difference between the predicted and ground-truth depth values. The ground-truth depth can come from LIDAR, but such measurements are sparse. One strategy is simply to mask out pixels without ground-truth depth and evaluate the loss only on valid points [1]. An alternative is to fill in invalid pixels in the ground-truth maps before evaluation [17], for example using the "colorization" method [18] included in the NYUv2 dataset [19]. While learning from processed dense depth maps is an easier task, it also limits the accuracy upper bound. [20, 21] used synthetic datasets (e.g., [22, 23]) for training, in which perfect dense ground-truth depth maps are available. However, in practice, the domain gap between synthetic and real data becomes an issue.
II-B Unsupervised single-view depth prediction
The fact that ground-truth depth for an image is hard to obtain, and usually sparse and/or noisy, motivates some researchers to apply unsupervised approaches. Stereo cameras with a known baseline provide self-supervision, in that an image can be reconstructed from its stereo counterpart if the disparity is accurately estimated. Following this idea, [24] proposed an end-to-end method to learn single-view depth from stereo data. Using consecutive image frames for self-supervision is similar, except that the camera motion between consecutive time steps must be estimated, and scale ambiguity may arise. [14] is among the first to use monocular videos only, learning pose and depth regression through CNNs in an end-to-end manner. To handle moving objects, researchers have included an optical flow estimation module [25] and/or a motion segmentation module [26], so that rigid and non-rigid parts are treated separately.
II-C Loss functions in single-view depth prediction
As reviewed above, existing learning methods mainly rely on direct supervision from true depth and indirect supervision from view-synthesis error. Most other losses are essentially regularization terms. Here the most commonly used loss functions in the literature are summarized. Losses from the adversarial learning framework [27] are omitted, as they require dedicated network structures.
II-C1 Geometric losses
Point-wise differences between predicted and ground-truth depth in the L1 [28], L2 [29], Huber [30], and berHu [17] norms, as well as the same norms on inverse depth [2], have all been applied, with the consideration of emphasizing the prediction error of near/far points. Cross-entropy loss [31] and ordinal loss [32] are applied when depth prediction is formulated as a classification or ordinal-regression problem instead of regression. Negative log-likelihood is adopted in approaches producing probabilistic outputs, e.g., [33]. [1] introduced a scale-invariant loss to enable learning from data across scenarios with large scale variance. Surface-normal difference is a more structured form of geometric loss [29]. In contrast to the above loss terms, which take value differences in the image space, [34] directly measures geometric loss in 3D space, minimizing point-cloud distance via the ICP (Iterative Closest Point) algorithm. [35] proposed non-local geometric losses to capture large-scale structures.
II-C2 Non-geometric losses
These are applied in unsupervised approaches. The most commonly used forms are the intensity difference between warped and original pixels, and the Structural Similarity (SSIM) [36], which also captures higher-order statistics of pixels in a local area. To handle occlusion and non-rigid scenarios, various adjustments to the photometric error have been proposed: for example, weighting or masking to ignore pixels that are unlikely to be recovered correctly from view synthesis [29, 14], and [5] used the minimum between forward and backward reprojection errors to handle occlusion.
II-C3 Regularization losses
Cross-frame consistency
losses fully exploit the available connections in data between stereo pairs and sequential frames, and improve generalizability by enforcing the network to learn view synthesis in different directions. For example, [37] performed view synthesis on a view-synthesized image from the stereo view, aiming to recover the original image through this loop.
Cross-task consistency
Self-regularization
losses are terms that suppress high-order variations in depth predictions. The edge-aware depth smoothness loss [40] is one of the most common examples [41, 37]. It is widely applied because, in unsupervised approaches, view-synthesis losses rely on image gradients, which are heavily non-convex and only valid in a local region; in supervised approaches, sparse ground truth also leaves a subset of pixels uncovered. Such regularization terms smooth the prediction and broadcast the supervision signal to a larger region.
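For illustration, the edge-aware smoothness idea can be sketched in a few lines: penalize disparity gradients, down-weighted where the image itself has strong gradients. This is a minimal NumPy sketch of the general technique, with illustrative function and argument names rather than any paper's exact implementation:

```python
import numpy as np

def edge_aware_smoothness(disp, img):
    """Edge-aware smoothness regularizer (in the spirit of [40]):
    penalize disparity gradients, down-weighted at image edges,
    which likely correspond to true depth discontinuities.

    disp: (H, W) disparity map; img: (H, W, 3) RGB image in [0, 1].
    """
    # mean-normalize disparity so the penalty is scale-invariant
    disp = disp / (disp.mean() + 1e-7)
    # forward differences of disparity along x and y
    d_dx = np.abs(disp[:, 1:] - disp[:, :-1])
    d_dy = np.abs(disp[1:, :] - disp[:-1, :])
    # image gradients, averaged over color channels
    i_dx = np.abs(img[:, 1:] - img[:, :-1]).mean(axis=-1)
    i_dy = np.abs(img[1:, :] - img[:-1, :]).mean(axis=-1)
    # exp(-|dI|) suppresses the penalty where the image has edges
    return (d_dx * np.exp(-i_dx)).mean() + (d_dy * np.exp(-i_dy)).mean()
```

A constant disparity map incurs zero penalty, and a disparity step co-located with a strong image edge is penalized less than the same step on a textureless region.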
As discussed above, the supervision signals in the literature come mostly from pixel-wise values (e.g., depth/reprojection error) and simple statistics in a local region (e.g., surface normal, SSIM), with heuristic regularization terms addressing the locality of such signals. In contrast, we introduce a new loss term that is smooth and continuous, overcoming such locality with an embedded regularization effect.
[Figure: qualitative comparison. Columns: Image & LIDAR, BTS [4], S-Monodepth2, Ours]
III Proposed Method
The information captured by LIDAR and camera sensors is a discretized sampling of the real environment, in the form of points and pixels. The discretizations of the two sensors differ, and a common approach to associating them is to project LIDAR points onto the image frame. This approach has two drawbacks. First, allocating a pixel location to a LIDAR point is an approximation, subject to rounding error and foreground-background mixture error [42]. Second, LIDAR points are much sparser than image pixels, meaning that the supervision signal propagates from only a small fraction of the image, and surfaces with certain characteristics (e.g., reflective, dark, transparent) are consistently missed due to the limitations of the LIDAR.
To handle the first problem, our new loss function is evaluated in 3D space instead of the image frame. Specifically, we measure the difference between the LIDAR point cloud and the point cloud of image pixels back-projected using the predicted depth. This is similar to [34], which applied the ICP distance metric for depth learning. However, since ICP needs associated point pairs, that approach still suffers from the discretization problem. The problem may not be prominent when both point clouds come from image pixels [34], but it becomes important when using a sparse LIDAR point cloud.
We propose to transform the point cloud into a continuous function, so that the learning problem becomes aligning two functions induced by the LIDAR point cloud and the image depth (point cloud). Our approach alleviates the discretization problem, as shown in more detail in Secs. IV-B and IV-A.
III-A Function construction from a point cloud
Consider a collection of points $X = \{x_1, \dots, x_N\}$, with each point $x_i \in \mathbb{R}^3$ and its associated feature vector $\ell_{x_i} \in \mathcal{I}$, where $\mathcal{I}$ is the inner product space of features. To construct a function from a point cloud such as $X$, we follow the approach of [43, 44]. That is,
$f_X(\cdot) = \sum_{x_i \in X} \ell_{x_i} k(\cdot, x_i)$,   (1)
where $k$ is the kernel of a Reproducing Kernel Hilbert Space (RKHS). Then the inner product with the function $f_Z$ of point cloud $Z$ is given by
$\langle f_X, f_Z \rangle = \sum_{x_i \in X} \sum_{z_j \in Z} \langle \ell_{x_i}, \ell_{z_j} \rangle_{\mathcal{I}} \, k(x_i, z_j)$.   (2)
For simplicity, let $c_{ij} = \langle \ell_{x_i}, \ell_{z_j} \rangle_{\mathcal{I}}$. We model the geometric kernel $k$ using the exponential kernel [45, Chapter 4] as
$k(x, z) = \sigma^2 \exp\!\left(-\frac{\|x - z\|^2}{2\ell^2}\right)$,   (3)
where $\sigma$ and $\ell$ are tunable hyperparameters controlling the scale and size, and $\|\cdot\|$ is the usual Euclidean norm.
III-B Continuous 3D loss
Let $Z$ be the LIDAR point cloud that we use as ground truth, and $X$ the point cloud of image pixels with predicted depth. We then formulate our continuous 3D loss function as
$L_{C3D} = -\langle f_X, f_Z \rangle$,   (4)
i.e., we maximize the inner product. Different from [43], which aims to find the optimal transformation in the Lie group to align two functions, we operate on points in $\mathbb{R}^3$. The gradient of $L_{C3D}$ w.r.t. a point $x_i$ is
$\frac{\partial L_{C3D}}{\partial x_i} = -\sum_{z_j \in Z} \left( c_{ij} \frac{\partial k(x_i, z_j)}{\partial x_i} + k(x_i, z_j) \frac{\partial c_{ij}}{\partial x_i} \right)$.   (5)
For the exponential kernel we have
$\frac{\partial k(x_i, z_j)}{\partial x_i} = \frac{z_j - x_i}{\ell^2} \, k(x_i, z_j)$,   (6)
and $\partial c_{ij} / \partial x_i$ depends on the specific form of the inner product of the feature space.
In our experiments we design two sets of features, i.e., $\ell_x = (\ell^c_x, \ell^n_x)$. The first is the color in HSV space, denoted $\ell^c_x$. We define the inner product in the HSV vector space using the same exponential kernel form and treat it as a constant. Since the pixel color is invariant w.r.t. its depth, $\partial c^c_{ij} / \partial x_i = 0$.
The second feature is the surface normal, denoted $\ell^n_x = n_x$, and we use a weighted dot product as the inner product of normal features, i.e.,
$\langle \ell^n_{x_i}, \ell^n_{z_j} \rangle = \frac{n_{x_i}^\top n_{z_j}}{r_{x_i} r_{z_j} + \epsilon}$,   (7)
where $\epsilon$ avoids numerical instability, and $r_x$ denotes the residual embedding the smoothness of the local surface at $x$, as further explained in the following.
Given a point $x$ with unit normal vector $n_x$, the plane defined by the normal is
$\{\, p \in \mathbb{R}^3 : n_x^\top (p - x) = 0 \,\}$.   (8)
Accordingly, the residual of an arbitrary point $q$ w.r.t. this local surface is defined as
$r(q) = \frac{|n_x^\top (q - x)|}{\|q - x\|}$,   (9)
which equals the cosine of the angle between the line $q - x$ and the normal $n_x$, vanishing when $q$ lies on the local plane. Then the residual of the local surface at $x$ is defined as
$r_x = \frac{1}{|\mathcal{N}(x)|} \sum_{q \in \mathcal{N}(x)} r(q)$,   (10)
where $\mathcal{N}(x)$ is the set of points in the neighborhood of $x$, and $|\mathcal{N}(x)|$ denotes its cardinality. This term is the average residual of the local plane over a neighborhood.
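A minimal NumPy sketch of these residuals, with hypothetical function names and an illustrative stabilizing constant, could read:

```python
import numpy as np

def plane_residual(q, x, n):
    """Residual of point q w.r.t. the local plane through x with
    unit normal n, as in Eq. (9); zero when q lies on the plane."""
    d = q - x
    return float(np.abs(n @ d) / (np.linalg.norm(d) + 1e-12))

def local_surface_residual(x, n, neighbors):
    """Average plane residual over a neighborhood, as in Eq. (10):
    small for locally flat regions, larger for noisy/curved areas."""
    return float(np.mean([plane_residual(q, x, n) for q in neighbors]))
```

For neighbors lying exactly on the plane the residual is zero; for a point displaced along the normal it approaches one.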
The derivative of this kernel w.r.t. the local surface normal vector is then given by
$\frac{\partial \langle \ell^n_{x_i}, \ell^n_{z_j} \rangle}{\partial n_{x_i}} = \frac{n_{z_j}}{r_{x_i} r_{z_j} + \epsilon}$.   (11)
From the above analysis, we can see that the continuous 3D loss produces a gradient combining position differences and normal-direction differences between ground-truth and predicted points, weighted by their closeness in the geometric and feature spaces. The proposed method avoids point-to-point correspondences, which are not always available in data, and provides inherent regularization that can be adjusted with understandable physical meaning.
The exponential operations in $k$ result in values on a very different scale from other losses. For numerical stability and scale consistency, we use the logarithm of the 3D loss in practice, i.e.,
$L_{C3D}^{\log} = -\log \langle f_X, f_Z \rangle$.   (12)
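For concreteness, the inner product of Eq. (2) with the exponential kernel of Eq. (3), together with the negated log form of Eq. (12), can be sketched in a few lines of NumPy. The plain dot-product feature inner product and the hyperparameter values here are illustrative assumptions; the actual implementation described later is a custom PyTorch GPU operation:

```python
import numpy as np

def c3d_inner_product(X, Z, lX, lZ, sigma=1.0, ell=0.5):
    """<f_X, f_Z>: double sum of feature inner products weighted by
    the exponential geometric kernel.

    X, Z: (N, 3) and (M, 3) point clouds; lX, lZ: (N, D), (M, D) features.
    """
    # pairwise squared Euclidean distances, shape (N, M)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    k_geo = sigma ** 2 * np.exp(-d2 / (2.0 * ell ** 2))
    # feature inner products c_ij; plain dot product for illustration
    c = lX @ lZ.T
    return float((c * k_geo).sum())

def c3d_log_loss(X, Z, lX, lZ):
    """Negated log inner product: smaller when the clouds align."""
    return -np.log(c3d_inner_product(X, Z, lX, lZ) + 1e-12)
```

Aligned clouds yield a larger inner product (hence a smaller loss) than misaligned ones, which is the behavior the gradient in Eq. (5) drives toward.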
The continuous 3D loss can also be used for cross-frame supervision, in which case relative camera poses come into play. For example, we can denote
$L_{C3D}^{i \to j} = -\log \left\langle f_{T_{ij} X^C_i}, \, f_{Z^L_j} \right\rangle$,   (13)
where $X^C_i$ denotes the point cloud from the camera and $Z^L_i$ that from the LIDAR at frame $i$, and $T_{ij}$ transforms points from frame-$i$ coordinates to frame-$j$ coordinates.
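The pose transform applied to the camera point cloud before evaluating the cross-frame loss is a standard homogeneous-coordinate rigid-body transform; a sketch with an illustrative function name:

```python
import numpy as np

def transform_points(T, pts):
    """Apply a 4x4 rigid-body transform T (frame i -> frame j) to an
    (N, 3) point cloud, as needed for cross-frame supervision."""
    homog = np.hstack([pts, np.ones((len(pts), 1))])  # homogeneous coords
    return (homog @ T.T)[:, :3]
```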
III-C Network architecture
To evaluate the effect of the continuous 3D loss function, we first build a baseline network based on the commonly adopted U-Net structure [12], and then construct our approach by adding the continuous 3D loss as an additional supervision signal.
III-C1 Baseline
We build our baseline on the network structure of Monodepth2 [5], a state-of-the-art unsupervised single-view depth prediction method. It includes a depth network and a pose network. The depth network uses a U-Net structure, taking one RGB image as input and producing inverse depth as output. The pose network takes two adjacent images as input and predicts the 6-DOF rigid-body transformation between the two frames. The supervision signals in the Monodepth2 paper include:

Per-pixel photometric error (non-geometric loss)

SSIM (non-geometric loss)

Edge-aware smoothness (regularization loss)
On top of these, we add an L1 disparity loss against ground-truth depth for pixels with a valid LIDAR projection as a geometric loss, which is not in the original Monodepth2.
In this way, the baseline system has access to supervision signals of all major types of existing loss functions. We denote this baseline "S-Monodepth2" in the following comparisons, since it is a supervised version of Monodepth2.
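The added geometric term can be sketched as an L1 loss masked to valid LIDAR projections; the function and argument names here are illustrative:

```python
import numpy as np

def masked_l1_disparity(pred_disp, gt_disp, valid):
    """L1 loss evaluated only at pixels with a LIDAR ground-truth
    projection (valid == True), as in the supervised baseline."""
    diff = np.abs(pred_disp - gt_disp)[valid]
    return float(diff.mean()) if diff.size else 0.0
```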
III-C2 Our approach
Building upon the baseline, the continuous 3D loss term is added to form our improved system. LIDAR point clouds are cropped to keep only the front section within the camera's view. One data snippet includes three consecutive frames, following the Monodepth2 setup (frames $t-1$, $t$, and $t+1$). Accordingly, the continuous 3D loss we apply is the sum of the cross-frame terms of Eq. (13) over these frames.
[Figure: qualitative comparison. Columns: LIDAR, BTS [4], S-Monodepth2, Ours]


IV Experiments
From Eq. (4), we can see that the calculation of the inner product involves a double sum over all point pairs in the two point clouds, which is substantial computation for a regular-sized image and LIDAR point cloud. To alleviate this, we discard point pairs that are far from each other in image space, for which the geometric kernel value is likely to be very small and hardly contributes to the loss. We implemented a customized operation in PyTorch to efficiently calculate the inner product on the GPU, taking advantage of the sparsity of LIDAR point clouds and of the double sum. This computation induces only a small (~5%) time overhead per iteration.
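The image-space pruning of the double sum can be sketched as follows. The gating by a pixel radius is the idea described above, while the loop structure, names, and hyperparameter values are illustrative (the actual implementation is a custom vectorized PyTorch GPU op):

```python
import numpy as np

def pruned_inner_product(X, Z, uv_X, uv_Z, lX, lZ,
                         radius=8, sigma=1.0, ell=0.5):
    """Evaluate the double sum only over point pairs whose image
    projections lie within `radius` pixels of each other; distant
    pairs have near-zero kernel values and are skipped.

    X, Z: (N, 3), (M, 3) points; uv_X, uv_Z: their (u, v) pixel
    coordinates; lX, lZ: feature vectors.
    """
    total = 0.0
    for i in range(len(X)):
        # cheap image-space gate before the 3D kernel evaluation
        close = np.abs(uv_Z - uv_X[i]).max(axis=1) <= radius
        if not close.any():
            continue
        d2 = ((Z[close] - X[i]) ** 2).sum(-1)
        k = sigma ** 2 * np.exp(-d2 / (2.0 * ell ** 2))
        total += float(((lZ[close] @ lX[i]) * k).sum())
    return total
```

With a sufficiently large radius the result equals the full double sum; a tight radius only drops pairs whose contribution is near zero.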
The model is implemented in PyTorch and trained for 20 epochs with the Adam optimizer, with the learning rate reduced after the first 10 epochs. The batch size is 3, and the input/output size follows Monodepth2. We train the network on an NVIDIA RTX 2080 Ti GPU. Consistent with Monodepth2, we train on the KITTI dataset [6], using Eigen's data split [46] and Zhou's [14] preprocessing to remove static frames, resulting in 39,810 training samples and 4,424 validation samples. The ground-truth LIDAR point clouds used in training come from the raw LIDAR data rather than the densified version in the KITTI 2015 depth benchmark [47], in order to show our capacity to exploit raw sparse point clouds in training.
Method  Train  Abs Rel  Sq Rel  RMSE  RMSE log  δ<1.25  δ<1.25²  δ<1.25³
(Abs Rel, Sq Rel, RMSE, RMSE log: lower is better; the three δ accuracies: higher is better)
DDVO[3]  U  0.126  0.866  4.932  0.185  0.851  0.958  0.986 
3net[48]  U  0.102  0.675  4.293  0.159  0.881  0.969  0.991 
SuperDepth[49]  U  0.090  0.542  3.967  0.144  0.901  0.976  0.993 
Monodepth2[5]  U  0.090  0.545  3.942  0.137  0.914  0.983  0.995 
SVSM FT[50]  SU  0.077  0.392  3.569  0.127  0.919  0.983  0.995 
semiDepth[51]  SU  0.078  0.417  3.464  0.126  0.923  0.984  0.995 
DORN[52]  S  0.081  0.337  2.930  0.121  0.936  0.986  0.995 
BTS[4]  S  0.064  0.254  2.815  0.100  0.950  0.993  0.999 
S-Monodepth2  SU  0.077  0.444  3.568  0.118  0.934  0.988  0.997
Ours  SU  0.072  0.370  3.371  0.116  0.937  0.988  0.997 
Improvements    6.49%  15.9%  5.52%  1.69%  0.32%  0.00%  0.00% 
Bold numbers are the best, underlined numbers are the second best. The "Improvements" row is w.r.t. the baseline (S-Monodepth2).
In the "Train" column, "U": unsupervised, "S": supervised, "SU": using both ground truth supervision and self supervision.
Method  Abs Rel  Sq Rel  RMSE  RMSE log  δ<1.25  δ<1.25²  δ<1.25³
(error metrics: lower is better; δ accuracies: higher is better)
S-Monodepth2  0.135  0.800  4.835  0.204  0.820  0.949  0.982
Ours  0.117  0.688  4.535  0.188  0.854  0.961  0.987 
Improvements  13.3%  14.0%  6.20%  7.84%  4.15%  1.26%  0.51% 
IV-A Quantitative results and analysis
IV-A1 KITTI dataset
Table I reports the quantitative comparison of our method with the baseline and other state-of-the-art approaches. The evaluation is on the KITTI dataset using Eigen's test split [1], with denser and more accurate ground-truth depth [47] obtained by accumulating 11 consecutive LIDAR scans. 652 of the 697 test samples in Eigen's split have such improved depth ground truth; these form our test set. Consistent with the literature [37], depth is truncated at a maximum of 80 m before evaluation; the same setup is used in Monodepth2. The definitions of all metrics are consistent with [1].
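The metrics of [1] used in Tables I-III can be computed with the standard formulas; a NumPy sketch, where the clipping value and dictionary keys are illustrative assumptions:

```python
import numpy as np

def depth_metrics(pred, gt, max_depth=80.0):
    """Standard depth-evaluation metrics: Abs Rel, Sq Rel, RMSE,
    RMSE log, and the delta < 1.25^k accuracy ratios."""
    mask = (gt > 0) & (gt <= max_depth)        # valid ground truth only
    pred = np.clip(pred[mask], 1e-3, max_depth)
    gt = gt[mask]
    thresh = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": float(np.mean(np.abs(pred - gt) / gt)),
        "sq_rel": float(np.mean((pred - gt) ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean((pred - gt) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))),
        "d1": float(np.mean(thresh < 1.25)),
        "d2": float(np.mean(thresh < 1.25 ** 2)),
        "d3": float(np.mean(thresh < 1.25 ** 3)),
    }
```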
Table I shows that our approach outperforms most existing work, supervised and unsupervised. The only difference between the baseline and our approach is the continuous 3D loss, which brings significant improvements.
Remark 1.
Notice that the recent work BTS [4] achieves higher accuracy than our result; it benefits from geometry-aware layers specifically designed for depth prediction, while our result is based on a common U-Net structure. Our contributions are orthogonal: by adopting the more advanced network structure in our work, we expect further improvement over the currently reported results. This is an interesting direction for future work.
Although our quantitative results do not outperform BTS, we produce higher-quality depth predictions, forming geometrically consistent point clouds, as discussed in more detail in Sec. IV-B.
IV-A2 VKITTI2 dataset
The ground-truth depth in the KITTI dataset is still sparse, even in the improved version [47], since it comes from LIDAR scans. To better show the improvement of our approach quantitatively, we further evaluate on the Virtual KITTI 2 [13] (VKITTI2) dataset. VKITTI2 is a synthetic dataset providing a photo-realistic approximation of KITTI using the Unity game engine, with fully annotated ground truth. We randomly sampled 5% of all VKITTI2 frames under the "clone" setting (with climate and lighting conditions consistent with the corresponding KITTI frames) to compose a test set. Table II shows the evaluation results on this test set using the same metrics as Table I, comparing the baseline and our approach. To be more consistent with KITTI, we include only the bottom half of each image in the error calculation, since LIDAR scans cover only the bottom part of the images, and we again truncate the maximum depth to 80 m. Note that neither the baseline nor our approach is trained on any VKITTI2 data.
Table II shows more significant quantitative improvements than Table I, as VKITTI2 has dense depth ground truth on all pixels. We argue that this is a more solid evaluation, since the ground truth is perfect and dense. The fact that the synthetic data are not "real" does imply a domain gap, which shows in the decreased accuracy of both the baseline and our method.
Fig. 4 shows a few samples of VKITTI2 data with ground truth and predictions.
Dataset  Method  Abs Rel  Sq Rel  RMSE  RMSE log  δ<1.25  δ<1.25²  δ<1.25³
(error metrics: lower is better; δ accuracies: higher is better)
KITTI  S-Monodepth2  0.077  0.444  3.568  0.118  0.934  0.988  0.997
KITTI  Ours w/o normal kernel (Eq. 14)  0.075  0.404  3.481  0.117  0.935  0.988  0.997
KITTI  Ours w/ normal kernel (Eq. 15)  0.072  0.370  3.371  0.116  0.937  0.988  0.997
VKITTI2  S-Monodepth2  0.135  0.800  4.835  0.204  0.820  0.949  0.982
VKITTI2  Ours w/o normal kernel (Eq. 14)  0.124  0.725  4.556  0.198  0.843  0.951  0.980
VKITTI2  Ours w/ normal kernel (Eq. 15)  0.117  0.688  4.535  0.188  0.854  0.961  0.987
IV-B Qualitative results and analysis
To show the effect of the new continuous 3D loss intuitively, Fig. 2 presents several samples from the KITTI dataset. Each sample includes the RGB image, the raw LIDAR scan, and the predicted depth with corresponding surface-normal directions from the baseline and from our method. We also compare with the current top-ranking method BTS [4] to show our qualitative advantages.
IV-B1 Depth view
From the depth predictions, we observe that both S-Monodepth2 and BTS predict incorrectly in vehicle-window areas. This not only creates "holes" in the depth map, but also results in failure to recover the full object contours, as in the second and third examples. This area is not handled well by previous methods because:

The window area is a non-Lambertian surface with inconsistent appearance at different viewing angles, so photometric losses do not work.

LIDAR does not receive good reflections from glass, as can be seen in Fig. 2, so no supervision from the ground truth is available.

The color of the window area is usually inconsistent with the rest of the vehicle body, further defeating appearance-aware depth smoothness terms.

In contrast, our continuous 3D loss function provides supervision from all nearby points, thus overcoming the problem and providing inherent smoothness. The window area is predicted correctly, with the full object shape preserved in our predictions.
IV-B2 Surface-normal view
The surface-normal view provides better visualization of 3D structure and local smoothness. The second row of each sample in Fig. 2 shows the surface-normal direction computed from the predicted depth. Despite the regularizing smoothness term, the baseline still produces many textures inherited from the color space, because the edge-aware smoothness loss is down-weighted at pixels with high image gradient. BTS shows fewer, but still visible, color-space textures and artifacts in the normal map, and the inconsistency in window areas is apparent in this view. In contrast, our method does not produce such textures, while still preserving 3D structure with clear transitions between different surfaces.
IV-B3 Point-cloud view
By back-projecting image pixels with predicted depth into 3D space, we recover a point cloud of the scene. This view lets us inspect how well the depth prediction recovers real-world 3D geometry. This is important because such pixel clouds could provide a denser alternative to accurate-but-sparse LIDAR point clouds, benefiting 3D object detection as indicated in [53]. For this reason, we focus on vehicle objects in this paragraph.
Fig. 3 shows four examples, covering near and far objects as well as cases of overexposure and color blend-in with the background. The raw LIDAR scans are quite sparse on dark vehicle bodies and glass surfaces, posing challenges for using such data as ground truth for depth learning. "Holes" in the predicted depth map turn into unregulated noise points in the 3D view. Compared with the baseline, our method produces point clouds of much higher quality in both glass and non-glass areas, with smooth surfaces and geometric structure consistent with the real vehicles. BTS produces better point clouds than the baseline, but the distortion is still heavier than in our approach, especially in window areas.


IV-C Ablation study
In this section we take a closer look at different configurations of the function constructed from point clouds, mainly investigating the effect of the surface-normal kernel. We denote the continuous 3D loss without the surface-normal kernel as
$L_{C3D}^{c} = -\log \sum_{x_i \in X} \sum_{z_j \in Z} \langle \ell^c_{x_i}, \ell^c_{z_j} \rangle \, k(x_i, z_j)$,   (14)
and the one with the surface-normal kernel as
$L_{C3D}^{c,n} = -\log \sum_{x_i \in X} \sum_{z_j \in Z} \langle \ell^c_{x_i}, \ell^c_{z_j} \rangle \langle \ell^n_{x_i}, \ell^n_{z_j} \rangle \, k(x_i, z_j)$.   (15)
The quantitative comparison is in Table III, following the same setup as Table I, and a data sample is shown in Fig. 5 for visualization. While the continuous loss improves upon the baseline by exploiting the correlation among points, the prediction still shows artifacts caused by color-space textures. The surface-normal kernel is sensitive to local noise and also distinguishes between different parts of the 3D geometry, inducing more geometrically plausible predictions.
V Conclusion
We proposed a new continuous 3D loss function for monocular single-view depth prediction. The proposed loss addresses the gap between dense image prediction and sparse LIDAR supervision. We achieved this by transforming point clouds into continuous functions and aligning them via the inner product structure of the function space. By simply adding this new loss, with the network architecture untouched, the accuracy and geometric consistency of depth predictions are improved significantly. As the contribution is orthogonal to network-architecture developments (e.g., [4], [52]), further improvement is expected when this loss is combined with more advanced network architectures.
Future work includes representation learning for the features used in the proposed loss function, which could potentially bring further improvement. Exploring the benefits of the improved depth prediction for 3D object detection is another interesting research direction.
Acknowledgment
This article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
References
 [1] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Proc. Advances Neural Inform. Process. Syst. Conf., 2014, pp. 2366–2374.
 [2] N. Yang, R. Wang, J. Stuckler, and D. Cremers, “Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry,” in Proc. European Conf. Comput. Vis., 2018, pp. 817–833.
 [3] C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey, “Learning depth from monocular videos using direct methods,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2022–2030.
 [4] J. H. Lee, M.-K. Han, D. W. Ko, and I. H. Suh, "From big to small: Multi-scale local planar guidance for monocular depth estimation," arXiv preprint arXiv:1907.10326, 2019.
 [5] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, “Digging into selfsupervised monocular depth estimation,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 3828–3838.
 [6] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” Int. J. Robot. Res., 2013.

 [7] R. Wang, S. M. Pizer, and J.-M. Frahm, "Recurrent neural network for (un)supervised learning of monocular video visual odometry and depth," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., June 2019.
 [8] "Waymo open dataset: An autonomous driving dataset," 2019.
 [9] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet, "Lyft level 5 av dataset 2019," https://level5.lyft.com/dataset/, 2019.
 [10] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” arXiv preprint arXiv:1903.11027, 2019.
 [11] Y. Dong, Y. Zhong, W. Yu, M. Zhu, P. Lu, Y. Fang, J. Hong, and H. Peng, “Mcity data collection for automated vehicles study,” arXiv preprint arXiv:1912.06258, 2019.
 [12] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
 [13] Y. Cabon, N. Murray, and M. Humenberger, “Virtual kitti 2,” 2020.

 [14] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 1851–1858.
 [15] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, "DTAM: Dense tracking and mapping in real-time," in Proc. IEEE Int. Conf. Comput. Vis. IEEE, 2011, pp. 2320–2327.
 [16] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox, “Demon: Depth and motion network for learning monocular stereo,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 5038–5047.
 [17] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in Int. Conf. 3D Vis. (3DV). IEEE, 2016, pp. 239–248.
 [18] A. Levin, D. Lischinski, and Y. Weiss, “Colorization using optimization,” in ACM SIGGRAPH, 2004, pp. 689–694.
 [19] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Proc. European Conf. Comput. Vis. Springer, 2012, pp. 746–760.

 [20] B. Bozorgtabar, M. S. Rad, D. Mahapatra, and J.-P. Thiran, "SynDeMo: Synergistic deep feature alignment for joint learning of depth and ego-motion," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 4210–4219.
 [21] Y. Di, H. Morimitsu, S. Gao, and X. Ji, "Monocular piecewise depth estimation in dynamic scenes by exploiting superpixel relations," in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 4363–4372.
 [22] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual worlds as proxy for multiobject tracking analysis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 4340–4349.
 [23] G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. M. Lopez, “The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2016, pp. 3234–3243.
 [24] R. Garg, V. K. BG, G. Carneiro, and I. Reid, “Unsupervised cnn for single view depth estimation: Geometry to the rescue,” in Proc. European Conf. Comput. Vis. Springer, 2016, pp. 740–756.
 [25] Z. Yin and J. Shi, “Geonet: Unsupervised learning of dense depth, optical flow and camera pose,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 1983–1992.
 [26] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black, “Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 12 240–12 249.
 [27] F. Aleotti, F. Tosi, M. Poggi, and S. Mattoccia, “Generative adversarial networks for unsupervised monocular depth prediction,” in Proc. European Conf. Comput. Vis., 2018, pp. 0–0.
 [28] Z. Wu, X. Wu, X. Zhang, S. Wang, and L. Ju, “Spatial correspondence with generative adversarial network: Learning depth from monocular videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7494–7504.
 [29] X. Qi, R. Liao, Z. Liu, R. Urtasun, and J. Jia, “Geonet: Geometric neural network for joint depth and surface normal estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 283–291.
 [30] J. Ye, Y. Ji, X. Wang, K. Ou, D. Tao, and M. Song, “Student becoming the master: Knowledge amalgamation for joint scene parsing, depth estimation, and more,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 2829–2838.
 [31] D. Shin, Z. Ren, E. B. Sudderth, and C. C. Fowlkes, “3d scene reconstruction with multilayer depth and epipolar transformers,” in Proc. IEEE Int. Conf. Comput. Vis., October 2019.
 [32] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 2002–2011.
 [33] F. Brickwedde, S. Abraham, and R. Mester, “Monosf: Multiview geometry meets singleview depth for monocular scene flow estimation of dynamic traffic scenes,” in Proc. IEEE Int. Conf. Comput. Vis., October 2019.
 [34] R. Mahjourian, M. Wicke, and A. Angelova, “Unsupervised learning of depth and egomotion from monocular video using 3d geometric constraints,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 5667–5675.
 [35] W. Yin, Y. Liu, C. Shen, and Y. Yan, “Enforcing geometric constraints of virtual normal for depth prediction,” in Proc. IEEE Int. Conf. Comput. Vis., October 2019.
 [36] Zhou Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, April 2004.
 [37] C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised monocular depth estimation with leftright consistency,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 270–279.
 [38] T. Dharmasiri, A. Spek, and T. Drummond, “Joint prediction of depths, normals and surface curvature from rgb images using cnns,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots and Syst. IEEE, 2017, pp. 1505–1512.

[39]
P. Z. Ramirez, M. Poggi, F. Tosi, S. Mattoccia, and L. Di Stefano, “Geometry
meets semantics for semisupervised monocular depth estimation,” in
Asian Conference on Computer Vision
. Springer, 2018, pp. 298–313.  [40] P. Heise, S. Klose, B. Jensen, and A. Knoll, “Pmhuber: Patchmatch with huber regularization for stereo matching,” in Proc. IEEE Int. Conf. Comput. Vis., December 2013.
 [41] Y. Kuznietsov, J. Stuckler, and B. Leibe, “Semisupervised deep learning for monocular depth map prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2017, pp. 6647–6655.
 [42] J. Qiu, Z. Cui, Y. Zhang, X. Zhang, S. Liu, B. Zeng, and M. Pollefeys, “Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., June 2019.
 [43] M. Ghaffari, W. Clark, A. Bloch, R. M. Eustice, and J. W. Grizzle, “Continuous direct sparse visual odometry from RGBD images,” in Proc. Robot.: Sci. Syst. Conf., Freiburg, Germany, June 2019.
 [44] W. Clark, M. Ghaffari, and A. Bloch, “Nonparametric continuous sensor registration,” arXiv preprint arXiv:2001.04286, 2020.

[45]
C. Rasmussen and C. Williams,
Gaussian processes for machine learning
. MIT press, 2006, vol. 1.  [46] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multiscale convolutional architecture,” in Proc. IEEE Int. Conf. Comput. Vis., December 2015.
 [47] J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in Int. Conf. 3D Vis. (3DV), Oct 2017, pp. 11–20.
 [48] M. Poggi, F. Tosi, and S. Mattoccia, “Learning monocular depth estimation with unsupervised trinocular assumptions,” in Int. Conf. 3D Vis. (3DV). IEEE, 2018, pp. 324–333.
 [49] S. Pillai, R. Ambruş, and A. Gaidon, “Superdepth: Selfsupervised, superresolved monocular depth estimation,” in Proc. IEEE Int. Conf. Robot. and Automation. IEEE, 2019, pp. 9250–9256.

[50]
Y. Luo, J. Ren, M. Lin, J. Pang, W. Sun, H. Li, and L. Lin, “Single view
stereo matching,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, 2018, pp. 155–163.  [51] A. J. Amiri, S. Y. Loo, and H. Zhang, “Semisupervised monocular depth estimation with leftright consistency using deep neural network,” arXiv preprint arXiv:1905.07542, 2019.
 [52] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2002–2011.
 [53] Y. Wang, W.L. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger, “Pseudolidar from visual depth estimation: Bridging the gap in 3d object detection for autonomous driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., June 2019.