Abstract
Good-quality reconstruction and comprehension of a scene rely on 3D estimation methods. 3D information has traditionally been obtained from images by stereo-photogrammetry, but deep learning has recently provided excellent results for monocular depth estimation. Building a sufficiently large and rich training dataset to achieve these results requires onerous processing. In this paper, we address the problem of learning outdoor 3D point clouds from monocular data using a sparse ground-truth dataset. We propose Pix2Point, a deep-learning-based approach for monocular 3D point cloud prediction, able to deal with complete and challenging outdoor scenes. Our method relies on a hybrid 2D-3D neural network architecture and a supervised, end-to-end minimisation of an optimal transport divergence between point clouds. We show that, when trained on sparse point clouds, our simple yet promising approach achieves better coverage of 3D outdoor scenes than efficient monocular depth methods.
1 Introduction
Geometry estimation is a prior requirement for autonomous agents to comprehend their environment and to progress and interact within it. This fundamental task has led to various 3D reconstruction methods. Parallax between images has long been exploited [10, 14], either between two cameras in a stereo setup or between multiple acquisitions after displacement. Recently, deep learning techniques have revolutionised 3D estimation from images, yielding excellent results even from a single view [1, 5, 8, 12, 15]. These impressive results rely upon large and highly accurate databases such as KITTI [13], which involve the simultaneous collection of stereo pairs and LiDAR data, post-processing, and temporal integration in order to provide accurate, reliable, and dense ground truth for learning purposes. The overall process requires large-scale cooperation and is therefore lengthy. In this paper, we address the question of performing 3D estimation from monocular data with much lighter requirements on the training data. Moreover, while monocular point set prediction has been proposed in the literature, it has only targeted the reconstruction of a single 3D object [9, 16, 18].
In contrast, we propose here (i) one of the first approaches to reconstruct a 3D point cloud for an entire outdoor scene given only a single image, using a hybrid 2D-3D neural network architecture illustrated in Fig. 1; (ii) an end-to-end learning scheme for the hybrid model using a sparse point cloud dataset; (iii) a cost function based on optimal transport (OT) that makes it possible to obtain good coverage of the scenes on the KITTI dataset [13]; and (iv) better performance than state-of-the-art monocular depth map prediction methods when trained with sparse data. This illustrates the advantage of directly predicting point clouds in the case of a low-density ground-truth 3D dataset.
2 Related Work
The monocular 3D estimation task has been addressed in terms of depth map prediction in the following works [2, 4, 8, 12, 15, 21, 25], but as already mentioned, they are trained on large and rich datasets. Recently, 3D estimation has also been addressed in terms of 3D point sets with PSGN by Fan et al. [9], a method that predicts the envelope of an object as an unordered set of points from a single view of it and its location in the image. Mandikal and Babu [16] address the limited number of points predicted by PSGN with DensePCR, a pyramidal structure that enlarges the number of points. Xia et al. [24] also tackle the generation of a monocular point cloud for objects, using prior knowledge of their shapes to gain robustness to occlusions and varying poses. A generative flow-based model for single-view object point cloud prediction has recently been proposed with C-Flow [18]; it leverages a back-and-forth prediction loop from image to point cloud and back to image for consistency.
The aforementioned point cloud works only consider the reconstruction of a single 3D object model, on data obtained through demanding procedures: either objects scanned with RGB-D sensors or laser scanners, or hand-crafted models. These procedures do not apply to real-life outdoor scenes with varied settings in which multiple objects lie.
The work in [7] tackles the problem of monocular volumetric reconstruction with occlusion completion for indoor scenes. In addition to the input RGB image, this approach requires a corresponding normal image that is hardly obtainable for outdoor scenes. To our knowledge, we are the first to propose a deep learning method for reconstructing complex outdoor scenes as 3D point clouds, conditioned solely on a single RGB image and learned on sparse point clouds. Our technical contributions with respect to previous works are: a performance study of the hybrid 2D-3D neural network architecture for point cloud reconstruction of scenes "in the wild", and end-to-end training with OT loss optimisation on sparse point clouds of up to ten thousand elements. Finally, we compare our point set prediction approach to state-of-the-art depth map prediction approaches when they are all trained on the same sparse dataset.
3 Method
Given a single colour image, our method, namely Pix2Point, predicts a set of 3D point coordinates whose cardinality is arbitrary but fixed before training. Fig. 1 shows an overview of the proposed method. Like DensePCR, our architecture consists of an encoding-decoding module that predicts a first coarse point cloud, which is then enriched by a densification module. Unlike DensePCR, our model is trained end-to-end to minimise an optimal transport divergence over the point coordinates.
Encoding The encoding block is a series of convolution, pooling, and normalisation layers that extract a feature description of the full RGB input image, which is then processed by a fully connected layer to obtain a first coarse set of 3D point coordinates. We explore and compare several encoding approaches following the VGG, DenseNet, and ResNet architectures in Section 5. We refer to them as backbones.
Decoding We refer to decoding as the densification of the first coarse 3D point cloud. Every point is duplicated several times, and each clone is described by concatenating: its 3D coordinates; both the global point cloud feature vector and the corresponding local feature vector, obtained with dedicated PointNet-like shared multi-layer perceptrons (MLP) [19, 20]; and lastly a grid-alignment feature vector that identifies every clone of the same point and suggests geometric relations between clones. This point description is processed by another shared MLP, resulting in the 3D coordinates of one point.

Optimisation
The performance of our approach depends mainly on the training loss function. Unlike depth map prediction methods, which exploit the gridded structure of the image to evaluate errors, our method uses distances between unordered point sets. In particular, it requires an additional, computationally expensive step to match points between the predicted and target point clouds. We present the two distances commonly used for this task.
Chamfer distance The Chamfer distance is the average of squared Euclidean distances from each point to its nearest neighbour in the other set, computed in both directions.
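As an illustrative sketch (not the authors' implementation), the Chamfer distance between two small point sets can be computed by brute force; practical pipelines typically use KD-trees or GPU kernels for the nearest-neighbour search.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets.

    For each point, take the squared Euclidean distance to its nearest
    neighbour in the other set, then average over both directions.
    a: (N, 3) array, b: (M, 3) array.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # a -> b direction plus b -> a direction.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The brute-force cost matrix is O(NM) in memory, which is fine for a sketch but not for 10k-point clouds.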
Optimal transport (OT) distance To compute this distance, also known as the Earth Mover's distance, one must find a one-to-one mapping from one set to the other that minimises the sum, over all points, of the squared distance between each point and its image under the mapping. This minimised sum is the OT distance; it captures discrepancies between the distributions of the two point sets.
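For equal-size sets, this distance can be computed exactly with the Hungarian algorithm, as in the sketch below. The exact solver is cubic in the number of points, so large sets in practice rely on entropic approximations such as Sinkhorn iterations [6, 11].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_distance(a, b):
    """Exact OT (Earth Mover's) distance between two equal-size point
    sets: the minimum, over one-to-one matchings, of the sum of squared
    distances between matched points (Hungarian algorithm)."""
    # Pairwise squared-distance cost matrix, shape (N, N).
    cost = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()
```

Note how the optimal matching can yield zero even when the sets are differently ordered, which is exactly the permutation invariance required for unordered point clouds.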
4 Experimental Settings
This section presents the dataset we are considering for real scene point cloud estimation from a single image and defines several criteria for reconstruction performance evaluation.
4.1 Dataset
To assess our method, we operate on RGB image sequences of real urban scenes and the corresponding LiDAR point cloud acquisitions from the KITTI depth estimation benchmark dataset [13]. Every scene point cloud is an accumulation of filtered LiDAR acquisitions over a few successive time instants. We use the split defined by Eigen et al. [8], that is 22,600 training scenes and 697 test scenes.
4.2 Evaluation Metrics
There is no reference value or method for the KITTI scene reconstruction task in terms of 3D point clouds. We therefore measure performance using the completeness and accuracy criteria from [22], and additionally propose relative accuracy. The three measures are defined as follows:
Completeness is the percentage of the target point cloud covered by the predicted points. A target point is covered if a predicted point lies in its surroundings (a fixed-radius ball). We evaluate completeness for radii of 50 cm, 25 cm, and 10 cm.
Accuracy is the distance, in metres, at a given percentile of the distances to the nearest neighbour, from the predicted point cloud to the ground-truth point cloud. It measures the greatest nearest-neighbour distance among the predicted points closest to the ground truth; the percentile is chosen to include most of the points while discarding eventual outliers.
Relative accuracy is similar to the accuracy, except that every nearest-neighbour distance is divided by the norm of the corresponding target point. It penalises errors on short-range predictions more heavily.
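Under the definitions above, the three metrics can be sketched as follows. The percentile used for accuracy is elided in the text, so the default below is a placeholder parameter, not the paper's choice.

```python
import numpy as np

def nearest_dists(src, dst):
    """For each point in src, Euclidean distance to its nearest point in dst."""
    d2 = np.sum((src[:, None, :] - dst[None, :, :]) ** 2, axis=-1)
    return np.sqrt(d2.min(axis=1))

def completeness(pred, gt, radius):
    """Fraction of ground-truth points with a predicted point within radius."""
    return float(np.mean(nearest_dists(gt, pred) <= radius))

def accuracy(pred, gt, percentile=95):  # percentile value is assumed
    """Distance below which the given percentile of predicted points
    find their nearest ground-truth neighbour."""
    return float(np.percentile(nearest_dists(pred, gt), percentile))

def relative_accuracy(pred, gt, percentile=95):  # percentile value is assumed
    """Same as accuracy, with each nearest-neighbour distance divided
    by the norm of the matched ground-truth point."""
    d2 = np.sum((pred[:, None, :] - gt[None, :, :]) ** 2, axis=-1)
    idx = d2.argmin(axis=1)
    d = np.sqrt(d2[np.arange(len(pred)), idx]) / np.linalg.norm(gt[idx], axis=1)
    return float(np.percentile(d, percentile))
```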
4.3 Implementation Details
Experiments were conducted using the PyTorch framework [17]. We kept the original image resolution and cropped every picture to a fixed pixel definition. Due to the heavy computational cost of loss backpropagation, parameters were updated after every sample forward pass, making batch normalisation ineffective; instance normalisation was applied instead [23]. The number of predicted points was chosen to fully load the 8 GB GPU during training: a coarse point cloud is first predicted by the fully connected encoding module and then upscaled by a DensePCR-like module into a point cloud with 10k elements.
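The reason instance normalisation stays usable with single-sample updates can be seen in a minimal sketch: its statistics are computed per sample and per channel (over spatial dimensions only), so they remain well defined with a batch of one, unlike batch statistics.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalisation for an (N, C, H, W) activation tensor:
    each (sample, channel) slice is normalised with its own spatial
    mean and variance, so a batch of N = 1 is perfectly fine."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```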
Training vs. testing point clouds: Using an OT loss enables, in principle, the comparison of any ground-truth point cloud to the predicted one; in practice, however, computing and optimising an OT loss is much more efficient with point sets of equal cardinality. Ground-truth point clouds are therefore randomly subsampled to 10k points, i.e., as many points as Pix2Point predicts. At test time, we measure performance against the whole ground-truth point cloud.
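A minimal sketch of this subsampling step; the sampling scheme (uniform, without replacement when possible) is an assumption, since the text only states that it is random.

```python
import numpy as np

def subsample(points, n=10_000, rng=None):
    """Randomly subsample a point cloud to exactly n points, matching
    the predicted set's cardinality. Sampling is without replacement
    unless the cloud has fewer than n points."""
    rng = np.random.default_rng() if rng is None else rng
    replace = len(points) < n
    idx = rng.choice(len(points), size=n, replace=replace)
    return points[idx]
```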
5 Experimental Results
In this section we first present the performance of Pix2Point with various encoding backbones and losses, then compare our method to depth map prediction approaches using the evaluation metrics defined in Section 4.2.

5.1 Network parameter study
We trained several models with varying encoding backbones and loss functions, considering the following configurations: the Pix2Point architecture with a VGG backbone trained by end-to-end minimisation of either the Chamfer or the OT distance, and Pix2Point with a ResNet backbone minimising the OT distance. The performance of these models is given in Table 1 as P2P-VGG-C, P2P-VGG-OT, and P2P-ResNet-OT respectively. From these figures, we notice that minimising the Chamfer distance drives predictions towards low local error, while minimising the OT distance yields predictions with higher completeness and hence better coverage of the scenes. To determine whether the two distances could complement each other, combinations of both were tested; however, they led to convergence issues during training and overall worse performance, owing to the opposite objectives of the two distances. Changing the backbone from VGG to ResNet also has a slight impact on completeness and accuracy. The small gain in relative accuracy indicates that far-away predictions are more accurate. We also provide a comparison to a similar image-to-point-cloud approach, DensePCR [16], initially proposed for 3D graphics models.
5.2 Comparison to depth prediction approaches
[Figure 2: qualitative results. Columns: RGB and ground-truth scene, AdaBins, BTS, P2P-VGG-C, P2P-VGG-OT, P2P-ResNet-OT.]
The current dominant approach to 3D estimation from a single image predicts a corresponding depth map by leveraging the power of image-to-image translation networks. On KITTI, these methods are trained on pseudo-dense depth maps built by accumulating several consecutive LiDAR acquisitions. For comparison in a similar setting, we train two state-of-the-art models for monocular depth estimation from the KITTI challenge (http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction), namely BTS [15] and AdaBins [2], on the same 10k point clouds as Pix2Point. At inference time, dense depth maps are projected back to 3D using the known camera parameters. Performance comparison with the various flavours of Pix2Point is reported in Table 1. These results show that Pix2Point, with only 15M parameters, trained with the Chamfer distance performs better than BTS and AdaBins, with 45M and 78M parameters respectively. Moreover, when trained with the OT distance, Pix2Point accuracy decreases, but it covers three times more points than depth map approaches at the closest neighbourhood radii. These observations can also be made in Figure 2, where we show the point cloud predictions and the coverage error map for each method (for comparison, all predictions are visualised with 10k points). The error map displays, for each ground-truth point, the distance to its nearest predicted point, shown from 0 to 50 cm using the jet colourmap. While AdaBins and BTS preserve fine features, all Pix2Point variants achieve better coverage of the scene and a lower error, especially for far-away elements and small-size features that are not retrieved by AdaBins and BTS (see, for instance, the right part of the bottom scene).
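The back-projection of a predicted depth map to a 3D point cloud mentioned above follows the standard pinhole camera model; a minimal sketch, assuming known intrinsics fx, fy, cx, cy:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (H, W) to 3D points using the
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Returns an (K, 3) array, dropping invalid (zero-depth) pixels."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] > 0]
```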
Table 1: Completeness and accuracy of the compared approaches.

Approaches     | Completeness (in %)    | Accuracy
               | 50cm  | 25cm  | 10cm   | in m | rel.
P2P-ResNet-OT  | 71.35 | 48.82 | 15.12  | 1.92 | 0.18
P2P-VGG-OT     | 67.4  | 47.7  | 14.7   | 1.79 | 0.19
P2P-VGG-C      | 64.4  | 36.0  | 8.0    | 0.85 | 0.05
DensePCR       | 59.9  | 23.5  | 3.5    | 1.77 | 0.18
BTS            | 67.59 | 31.29 | 6.28   | 1.23 | 0.06
AdaBins        | 65.86 | 27.52 | 5.71   | 1.25 | 0.06
6 Conclusion
We proposed in this work an innovative approach to point cloud reconstruction for complex outdoor scenes from a single RGB image, using a light hybrid 2D-3D neural network. It recovers properly distributed point clouds by taking advantage of an optimal transport loss. We also provided the first benchmark for this novel task on the KITTI dataset and introduced performance metrics to assess the quality of point cloud reconstruction. We showed that our method outperforms state-of-the-art depth map prediction methods when trained with sparse data, illustrating the interest of direct image-to-3D-point-cloud translation.
References
[1] (2019) Semi-supervised monocular depth estimation with left-right consistency using deep neural network. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO).
[2] (2021) AdaBins: depth estimation using adaptive bins. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] (2019) Deep generative modeling of LiDAR data. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
[4] (2018) Deep Depth from Defocus: how can defocus blur improve 3D estimation using dense neural networks? In European Conference on Computer Vision (ECCV).
[5] (2018) On regression losses for deep depth estimation. In IEEE International Conference on Image Processing (ICIP).
[6] (2013) Sinkhorn distances: lightspeed computation of optimal transport. In NIPS.
[7] (2020) 3D scene reconstruction from a single viewport. In European Conference on Computer Vision (ECCV), pp. 51-67.
[8] (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems 27.
[9] (2017) A point set generation network for 3D object reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] (1993) Three-dimensional computer vision: a geometric viewpoint.
[11] (2019) Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS).
[12] (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
[14] (2004) Multiple view geometry in computer vision.
[15] (2019) From Big to Small: multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326.
[16] (2019) Dense 3D point cloud reconstruction using a deep pyramid network. In 2019 Winter Conference on Applications of Computer Vision (WACV).
[17] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32.
[18] (2020) C-Flow: conditional generative flow models for images and 3D point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30.
[21] (2006) Learning depth from single monocular images. In Advances in Neural Information Processing Systems 18.
[22] (2019) The second workshop on 3D Reconstruction Meets Semantics: challenge results discussion. In ECCV 2018 Workshops.
[23] (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
[24] (2018) RealPoint3D: point cloud generation from a single image with complex background. arXiv preprint arXiv:1809.02743.
[25] (2020) Monocular depth prediction through continuous 3D loss. arXiv preprint arXiv:2003.09763.