Pix2Point: Learning Outdoor 3D Using Sparse Point Clouds and Optimal Transport

07/30/2021 ∙ by Rémy Leroy, et al. ∙ ONERA, European Space Agency


Abstract

Good quality reconstruction and comprehension of a scene rely on 3D estimation methods. 3D information has usually been obtained from images by stereo-photogrammetry, but deep learning has recently provided excellent results for monocular depth estimation. Building up a sufficiently large and rich training dataset to achieve these results requires onerous processing. In this paper, we address the problem of learning outdoor 3D point clouds from monocular data using a sparse ground-truth dataset. We propose Pix2Point, a deep learning-based approach for monocular 3D point cloud prediction, able to deal with complete and challenging outdoor scenes. Our method relies on a 2D-3D hybrid neural network architecture and a supervised end-to-end minimisation of an optimal transport divergence between point clouds. We show that, when trained on sparse point clouds, our simple yet promising approach achieves a better coverage of 3D outdoor scenes than efficient monocular depth methods.

1 Introduction

Geometry estimation is a prior requirement for autonomous agents to comprehend their environment, and to progress and interact within it. This fundamental task has led to various estimation methods for 3D reconstruction. For a long time, the parallax between images has been used [10, 14], either between two cameras in stereo mode or between multiple acquisitions after displacement. Recently, deep learning techniques have revolutionised 3D estimation from images, even achieving excellent results from a single view [1, 5, 8, 12, 15]. These impressive results rely upon large and highly accurate databases like KITTI [13] that involve the simultaneous collection of stereo pairs and LiDAR data, post-processing, and temporal integration in order to provide accurate, reliable and dense ground truth for learning purposes. The overall process requires large-scale cooperation and is therefore lengthy. In this paper, we address the question of performing 3D estimation from monocular data with much lighter requirements in terms of training data. Besides, while monocular point set prediction has been proposed in the literature, it has only addressed the reconstruction of a single 3D object [9, 16, 18].

In contrast, we propose here (i) one of the first approaches to reconstruct a 3D point cloud for an entire outdoor scene given only a single image, using a 2D-3D hybrid neural network architecture, illustrated in Fig. 1; (ii) an end-to-end learning scheme of the hybrid model using a sparse point cloud dataset; (iii) a cost function based on optimal transport (OT) that makes it possible to obtain good coverage of the scenes of the KITTI dataset [13]; and (iv) better performance than state-of-the-art monocular depth map prediction methods when trained with sparse data. This illustrates the advantage of directly predicting point clouds in the case of a low-density ground-truth 3D dataset.

2 Related Work

The monocular 3D estimation task has been addressed in terms of depth map prediction in the following works [2, 4, 8, 12, 15, 21, 25], but as already mentioned, they are trained with large and rich datasets. Recently, 3D estimation has also been addressed in terms of 3D point sets with PSGN by Fan et al. [9], a method that aims to predict the envelope of an object as an unordered set of points using a single view of it and its location in the image. Mandikal and Babu [16] address the limitation of PSGN regarding the small number of predicted points with DensePCR, a pyramidal structure that increases the number of output points. Xia et al. [24] also tackle the generation of a monocular point cloud for objects, using prior knowledge of their shapes, making the approach robust to occlusions and varying poses. A generative flow-based model allowing single-view object point cloud prediction has recently been proposed with C-Flow [18]. It leverages a back-and-forth prediction loop from image to point cloud and back to image for consistency.

The aforementioned point cloud works only consider the reconstruction of a single 3D object model, on data obtained through demanding procedures: objects scanned with RGB-D sensors or laser scanners, or handcrafted models. These procedures do not apply to real-life outdoor scenes with varied settings in which multiple objects lie.

The work in [7] tackles the problem of monocular volumetric reconstruction with occlusion completion for indoor scenes. In addition to the input RGB image, this approach requires a corresponding normal image that is hardly obtainable for outdoor scenes. To our knowledge, we are the first to propose a deep learning method for the reconstruction of complex outdoor scenes in the form of a 3D point cloud, solely conditioned by a single RGB image and learned on sparse point clouds. Our technical contributions with respect to previous works are: a performance study of the 2D-3D hybrid neural network architecture for point cloud reconstruction of scenes "in the wild", and end-to-end training with OT loss optimisation on sparse point clouds of up to ten thousand elements. Finally, we compare our point set prediction approach to state-of-the-art depth map prediction approaches when they are all trained on the same sparse dataset.

3 Method

Given a single colour image, our method, namely Pix2Point, predicts a set of 3D point coordinates whose cardinality is arbitrary but fixed before training. Fig. 1 shows an overview of the proposed method. Like DensePCR, our architecture consists of an encoding-decoding module that predicts a first coarse point cloud, which is then enriched using a densification module. Unlike DensePCR, our model is trained to minimise an optimal transport divergence over the point coordinates in an end-to-end fashion.

Encoding The encoding block is a series of convolution, pooling and normalisation layers that extract a feature description of the full RGB input image, which is then processed by a fully connected layer to obtain a first coarse set of 3D point coordinates. We explore and compare several encoding approaches, following the VGG, DenseNet and ResNet architectures, in Section 5. We refer to them as backbones.
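As an illustration, a minimal PyTorch sketch of such an encoding stage is given below; the ResNet-18 backbone and the coarse point count are assumptions made for the sake of the example, not the exact Pix2Point configuration.

```python
# A minimal PyTorch sketch of the encoding stage described above: a CNN
# backbone extracts a global feature from the RGB image, and a fully
# connected layer regresses a coarse set of 3D points. The ResNet-18
# backbone and the coarse point count (1024) are illustrative assumptions,
# not the exact Pix2Point configuration.
import torch.nn as nn
import torchvision


class CoarsePointEncoder(nn.Module):
    def __init__(self, n_coarse=1024):
        super().__init__()
        resnet = torchvision.models.resnet18(weights=None)
        # Keep everything up to (and including) the global average pooling.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(512, n_coarse * 3)
        self.n_coarse = n_coarse

    def forward(self, img):                        # img: (B, 3, H, W)
        feat = self.backbone(img).flatten(1)       # (B, 512)
        return self.fc(feat).view(-1, self.n_coarse, 3)   # (B, n_coarse, 3)
```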

Decoding We refer to decoding as the densification of the first coarse 3D point cloud. Every point is duplicated a fixed number of times, and each duplicate is described by concatenating: its 3D coordinates; the global point cloud feature vector and the corresponding local feature vector, both obtained using dedicated PointNet-like shared multi-layer perceptrons (MLPs) [19, 20]; and lastly a grid alignment feature vector that identifies every clone of the same point and suggests geometric relations between clones. This point description is processed by another shared MLP, producing the 3D coordinates of one new point.
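The PyTorch sketch below illustrates one possible form of this densification step; the duplication factor, feature dimensions and layer widths are hypothetical and only serve to make the mechanism concrete.

```python
# A rough PyTorch sketch of the densification step: each coarse point is
# cloned k times, each clone is described by its coordinates, a global
# feature, a per-point local feature and a small grid code identifying the
# clone, and a shared MLP maps every description to a new 3D point. The
# duplication factor k, feature sizes and layer widths are hypothetical.
import torch
import torch.nn as nn


class Densifier(nn.Module):
    def __init__(self, k=10, local_dim=64, global_dim=256, grid_dim=2):
        super().__init__()
        self.k = k
        # One grid code per clone, shared by all points.
        self.register_buffer("grid", torch.rand(k, grid_dim))
        in_dim = 3 + local_dim + global_dim + grid_dim
        self.mlp = nn.Sequential(                  # shared, point-wise MLP
            nn.Conv1d(in_dim, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 3, 1),
        )

    def forward(self, pts, local_feat, global_feat):
        # pts: (B, N, 3), local_feat: (B, N, local_dim), global_feat: (B, global_dim)
        B, N, _ = pts.shape
        pts_k = pts.repeat_interleave(self.k, dim=1)          # (B, N*k, 3)
        loc_k = local_feat.repeat_interleave(self.k, dim=1)   # (B, N*k, local_dim)
        glob_k = global_feat[:, None, :].expand(B, N * self.k, global_feat.shape[-1])
        grid_k = self.grid.repeat(N, 1)[None].expand(B, -1, -1)
        desc = torch.cat([pts_k, loc_k, glob_k, grid_k], dim=-1)
        return self.mlp(desc.transpose(1, 2)).transpose(1, 2)  # (B, N*k, 3)
```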

Optimisation

The performance of our approach mainly depends on the training loss function. Unlike depth map prediction methods, which exploit the gridded structure of the image to evaluate errors, our method uses distances between unordered point sets. In particular, it requires an additional, computationally expensive step to match points between the predicted and the target point clouds. We present two standard distances for this task.

Chamfer distance The Chamfer distance is the average of squared Euclidean distances from each point to its nearest neighbour in the other set, computed in both directions.
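For reference, one common symmetric formulation between point sets S_1 and S_2 is the following (normalisation conventions vary):

```latex
d_{\mathrm{CD}}(S_1, S_2) =
    \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \lVert x - y \rVert_2^2
  + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \lVert x - y \rVert_2^2
```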

Optimal transport or OT distance To compute this distance, also known as the Earth Mover's distance, one has to find a one-to-one mapping from one set to the other that minimises the sum, over all points, of the squared distance between each point and its image. This minimised sum is the OT distance, and it reflects the discrepancy between the point set distributions.
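For two point sets of equal cardinality, this can be written as a minimisation over bijections from S_1 to S_2:

```latex
d_{\mathrm{OT}}(S_1, S_2) =
    \min_{\phi \,:\, S_1 \to S_2 \ \text{bijective}}
    \sum_{x \in S_1} \lVert x - \phi(x) \rVert_2^2
```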

The exact computation of an OT distance is time- and memory-expensive, especially for sets of several thousand elements. We therefore use an approximation of the OT distance, obtained by adding a regularising term and solving the resulting problem with the Sinkhorn-Knopp algorithm [6, 11].
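As an example, such a Sinkhorn-regularised OT loss can be computed in PyTorch with the GeomLoss library of Feydy et al. [11]; the regularisation scale below is a placeholder and not necessarily the setting used in our experiments.

```python
# A hedged sketch of computing a Sinkhorn-regularised OT loss between a
# predicted and a target point cloud in PyTorch, here via the GeomLoss
# library of Feydy et al. [11]. The blur value (regularisation scale) is a
# placeholder, not necessarily the setting used in the paper.
import torch
from geomloss import SamplesLoss

sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)

pred = torch.rand(10_000, 3, requires_grad=True)   # predicted point cloud
target = torch.rand(10_000, 3)                     # subsampled ground truth

loss = sinkhorn(pred, target)   # differentiable scalar
loss.backward()                 # gradients flow back to the prediction
```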

4 Experimental Settings

This section presents the dataset we are considering for real scene point cloud estimation from a single image and defines several criteria for reconstruction performance evaluation.

4.1 Dataset

To assess our method, we operate on RGB image sequences of real urban scenes and the corresponding LiDAR point cloud acquisitions from the KITTI depth estimation benchmark dataset [13]. Every scene point cloud is an accumulation of filtered LiDAR acquisitions over a few successive time instants. We use the split defined by Eigen et al. [8], i.e. 22 600 training scenes and 697 testing scenes.

4.2 Evaluation Metrics

There is no reference value or method for the KITTI scene reconstruction task regarding 3D point clouds. Therefore, we propose to measure performance using the completeness and accuracy criteria from [22], and additionally introduce a relative accuracy. All three measures are defined as follows (an illustrative implementation is sketched after the definitions):

Completeness is the coverage, in per cent, of the target point cloud by the predicted points. A target point is covered if a predicted point lies in its surrounding (a ball of fixed radius). We evaluate completeness for radii of 50 cm, 25 cm and 10 cm.

Accuracy is the distance, in metres, given by a fixed percentile of the distances to the nearest neighbour, from the predicted point cloud to the ground-truth point cloud. It measures the greatest distance to the nearest neighbour among the predicted points closest to the ground truth. The percentile is chosen to include most of the points and discard possible outliers.

Relative accuracy is similar to the accuracy, except that each distance to the nearest neighbour is divided by the norm of the corresponding target point. It penalises short-range prediction errors more heavily.
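The sketch below gives illustrative implementations of these three criteria using nearest-neighbour queries; the percentile value is a placeholder.

```python
# Illustrative NumPy/SciPy implementations of the three criteria, based on
# nearest-neighbour queries. The percentile value below is a placeholder;
# the paper chooses one that keeps most points and discards outliers.
import numpy as np
from scipy.spatial import cKDTree


def completeness(pred, target, radius):
    """Fraction of target points with a predicted point within `radius`."""
    dists, _ = cKDTree(pred).query(target, k=1)
    return float(np.mean(dists <= radius))


def accuracy(pred, target, percentile=95):
    """Distance at a given percentile of predicted-to-target NN distances."""
    dists, _ = cKDTree(target).query(pred, k=1)
    return float(np.percentile(dists, percentile))


def relative_accuracy(pred, target, percentile=95):
    """Same as accuracy, each NN distance divided by the target point norm."""
    dists, idx = cKDTree(target).query(pred, k=1)
    rel = dists / np.linalg.norm(target[idx], axis=1)
    return float(np.percentile(rel, percentile))
```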

4.3 Implementation Details

Experiments were conducted using the PyTorch framework [17]. We kept the original image resolution and cropped every picture to a fixed pixel definition. Due to the heavy computational cost of loss back-propagation, parameters were updated after each sample's forward pass, making batch normalisation ineffective; instance normalisation was applied instead [23]. The number of predicted points was set to fully load the 8 GB GPU during training. A coarse point cloud is therefore first predicted by the fully connected encoding module, and then up-scaled by a DensePCR-like module into a point cloud with 10k elements.

Training vs. testing point clouds: Using an OT loss enables, in principle, the comparison of any ground-truth point cloud to the predicted one. In practice, however, computing and optimising an OT loss is much more efficient with point sets of equal cardinality. Therefore, at training time, ground-truth point clouds are randomly subsampled to 10k points, i.e. as many points as Pix2Point predicts (see the minimal example below). At test time, we measure performance against the whole ground-truth point cloud.
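A minimal example of such a subsampling step, assuming the ground-truth points are stored as an (N, 3) tensor:

```python
# Minimal example of the random subsampling applied to ground-truth point
# clouds at training time (uniform, without replacement), assuming points
# are stored as an (N, 3) tensor.
import torch


def subsample(points, n=10_000):
    idx = torch.randperm(points.shape[0])[:n]
    return points[idx]
```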

5 Experimental Results

In this section, we first present the performance of Pix2Point using various encoding backbones and losses, then we compare our method to depth map prediction approaches using the evaluation metrics defined in Section 4.2.

5.1 Network parameter study

We trained several models with varying encoding backbones and loss functions. We considered the following configurations: the Pix2Point architecture with a VGG backbone trained by end-to-end minimisation of either the Chamfer or the OT distance, and Pix2Point with a ResNet backbone minimising the OT distance. The performances of these models are reported in Table 1 as P2P-VGG-C, P2P-VGG-OT and P2P-ResNet-OT, respectively. From these figures, we notice that minimising the Chamfer distance favours predictions with low local error, while minimising the OT distance yields predictions with higher completeness, hence better coverage of the scenes. To investigate whether these distances could complement each other, combinations of both were also tested; however, they led to convergence issues during training and overall worse performance, due to the conflicting objectives of the two distances. Changing the backbone from VGG to ResNet also has a slight impact on completeness and accuracy. The small gain in relative accuracy indicates that far-away predictions are more accurate. We also provide a comparison to a similar image-to-point-cloud approach, DensePCR [16], initially proposed for 3D graphics models.

5.2 Comparison to depth prediction approaches

Figure 2: For each scene, first row: 3D ground truth and predictions for the RGB image according to AdaBins [2], BTS [15], and our Pix2Point VGG-Chamfer, VGG-OT and ResNet-OT variants, all trained on 10k points. We follow the 3D representation of [3]: bird's-eye view, where the colour encodes the altitude. Second row: the input RGB image and the ground-truth-to-prediction error map for each method. Errors range from 0 (blue) to 50 cm (red).

The current dominant approach to 3D estimation from a single image consists of predicting corresponding depth maps by leveraging the power of image-to-image translation networks. On KITTI, these methods are trained on pseudo-dense depth maps built by accumulating several consecutive LiDAR acquisitions. For comparison in a similar setting, we train two state-of-the-art models for monocular depth estimation from the KITTI challenge (http://www.cvlibs.net/datasets/kitti/eval_depth.php?benchmark=depth_prediction), namely BTS [15] and AdaBins [2], on the same 10k point clouds as Pix2Point. At inference time, their dense depth maps are projected back to 3D using known camera parameters. The performance comparison with the various flavours of Pix2Point is reported in Table 1.

These results show that Pix2Point trained with the Chamfer distance, with only 15M parameters, performs better than BTS and AdaBins (45M and 78M parameters, respectively). Moreover, when trained with the OT distance, Pix2Point is less accurate but covers about three times more points than the depth map approaches for the closest neighbourhoods. These observations are illustrated in Figure 2, where we show point cloud predictions and the coverage error map for each method (for comparison, all predictions are visualised with 10k points). This error map displays, for each ground-truth point, the distance to its nearest predicted point; errors from 0 to 50 cm are displayed using the jet colourmap. While AdaBins and BTS preserve fine features, all Pix2Point variants achieve better coverage of the scene and a lower error, especially for far-away elements and small features that are not retrieved by AdaBins and BTS (see for instance the right part of the bottom scene).

Approaches       Completeness (in %)            Accuracy
                 50 cm     25 cm     10 cm      in m      rel.
P2P-ResNet-OT    71.35     48.82     15.12      1.92      0.18
P2P-VGG-OT       67.4      47.7      14.7       1.79      0.19
P2P-VGG-C        64.4      36.0      8.0        0.85      0.05
DensePCR         59.9      23.5      3.5        1.77      0.18
BTS              67.59     31.29     6.28       1.23      0.06
AdaBins          65.86     27.52     5.71       1.25      0.06
Table 1: Comparison of 3D scene reconstructions on KITTI. We report completeness and accuracy. All methods are trained with 10k point clouds.

6 Conclusion

In this work, we proposed an innovative approach to point cloud reconstruction of complex outdoor scenes from a single RGB image, using a lightweight 2D-3D hybrid neural network. It recovers properly distributed point clouds by taking advantage of an optimal transport loss. We also provided the first benchmark for this novel task on the KITTI dataset and introduced performance metrics to assess the quality of point cloud reconstruction. We showed that our method outperforms state-of-the-art depth map prediction methods when trained with sparse data, illustrating the interest of direct image-to-point-cloud translation.

References

  • [1] A. J. Amiri, S. Y. Loo, and H. Zhang (2019) Semi-supervised monocular depth estimation with left-right consistency using deep neural network. In 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), Cited by: §1.
  • [2] S. F. Bhat, I. Alhashim, and P. Wonka (2021) AdaBins: depth estimation using adaptive bins. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2, Figure 2, §5.2.
  • [3] L. Caccia, H. van Hoof, A. Courville, and J. Pineau (2019) Deep generative modeling of LiDAR data. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Cited by: Figure 2.
  • [4] M. Carvalho, B. Le Saux, P. Trouvé-Peloux, A. Almansa, and F. Champagnat (2018) Deep Depth from Defocus: How can defocus blur improve 3D estimation using dense neural networks?. In European Conf. on Computer Vision (ECCV), Cited by: §2.
  • [5] M. Carvalho, B. Le Saux, P. Trouvé-Peloux, F. Champagnat, and A. Almansa (2018) On regression losses for deep depth estimation. In IEEE Int. Conf. on Image Processing (ICIP), Cited by: §1.
  • [6] M. Cuturi (2013) Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, Cited by: §3.
  • [7] M. Denninger and R. Triebel (2020) 3d scene reconstruction from a single viewport. In European Conference on Computer Vision, pp. 51–67. Cited by: §2.
  • [8] D. Eigen, C. Puhrsch, and R. Fergus (2014) Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems 27, Cited by: §1, §2, §4.1.
  • [9] H. Fan, H. Su, and L. J. Guibas (2017) A point set generation network for 3d object reconstruction from a single image. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §1, §2.
  • [10] O. Faugeras (1993) Three-dimensional computer vision: A geometric viewpoint. Cited by: §1.
  • [11] J. Feydy, T. Séjourné, F. Vialard, S. Amari, A. Trouvé, and G. Peyré (2019) Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, Cited by: §3.
  • [12] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
  • [13] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR). Cited by: §1, §1, §4.1.
  • [14] R. I. Hartley and A. Zisserman (2004) Multiple view geometry in computer vision. Cited by: §1.
  • [15] J. H. Lee, M. Han, D. W. Ko, and I. H. Suh (2019) From Big to Small: multi-scale local planar guidance for monocular depth estimation. arXiv preprint arXiv:1907.10326. Cited by: §1, §2, Figure 2, §5.2.
  • [16] P. Mandikal and V. B. Radhakrishnan (2019) Dense 3D point cloud reconstruction using a deep pyramid network. In 2019 Winter Conference on Applications of Computer Vision (WACV), Cited by: §1, §2, §5.1.
  • [17] A. Paszke, S. Gross, F. Massa, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, Cited by: §4.3.
  • [18] A. Pumarola, S. Popov, F. Moreno-Noguer, and V. Ferrari (2020) C-flow: conditional generative flow models for images and 3d point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
  • [19] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3D classification and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Cited by: §3.
  • [20] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems 30, Cited by: §3.
  • [21] A. Saxena, S. H. Chung, and A. Y. Ng (2006) Learning depth from single monocular images. In Adv. in Neural Information Processing Systems 18, Cited by: §2.
  • [22] R. Tylecek, T. Sattler, H. Le, T. Brox, M. Pollefeys, R. B. Fisher, and T. Gevers (2019) The second workshop on 3D Reconstruction Meets Semantics: challenge results discussion. In ECCV 2018 Workshops, Cited by: §4.2.
  • [23] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §4.3.
  • [24] Y. Xia, Y. Zhang, D. Zhou, X. Huang, C. Wang, and R. Yang (2018) Realpoint3d: point cloud generation from a single image with complex background. arXiv preprint arXiv:1809.02743. Cited by: §2.
  • [25] M. Zhu, M. Ghaffari, Y. Zhong, P. Lu, Z. Cao, R. M. Eustice, and H. Peng (2020) Monocular depth prediction through continuous 3D loss. arXiv preprint arXiv:2003.09763. Cited by: §2.