Code for "Deep Generative Models for LiDAR Data"
Building models capable of generating structured output is a key challenge for AI and robotics. While generative models have been explored on many types of data, little work has been done on synthesizing lidar scans, which play a key role in robot mapping and localization. In this work, we show that one can adapt deep generative models for this task by unravelling lidar scans into a multi-channel 2D signal. Our approach can generate high-quality samples, while simultaneously learning a meaningful latent representation of the data. Furthermore, we demonstrate that our method is robust to noisy input: the learned model can recover the underlying lidar scan from seemingly uninformative data.
One of the main challenges in mobile robotics is the development of systems capable of fully understanding their environment. This non-trivial task becomes even more complex when sensor data is noisy or missing. An intelligent system that understands the data generation process is much better equipped to tackle inconsistency in its sensor data. There is significant potential gain in having autonomous robots equipped with data generation capabilities which can be leveraged for reconstruction, compression, or prediction of the data stream.
In autonomous driving, information from the environment is captured from sensors mounted on the vehicle, such as cameras, radars, and lidars. While a significant amount of research has been done on generating RGB images, relatively little work has focused on generating lidar data. These scans, represented as an array of three dimensional coordinates, give an explicit topography of the vehicle’s surroundings, potentially leading to better obstacle avoidance, path planning, and inter-vehicle spatial awareness.
To this end, we leverage recent advances in deep generative modeling, namely variational autoencoders (VAE) [vae] and generative adversarial networks (GAN) [gan], to produce a generative model of lidar data. While the VAE and GAN approaches have different objectives, both can be combined with Convolutional Neural Networks (CNN) [cnn] to extract local information from nearby sensor points.
Unlike other approaches for lidar processing, we do not convert the data to images or voxel grids [dewan2017deep, voxelnet]. Instead, we work directly with point cloud coordinates. This data format is more memory efficient, and is therefore less expensive to process and generate. This property aids real-time input processing, which can be helpful for self-driving vehicles.
The main contribution of this paper is to provide a method for raw lidar data generation that requires minimal human craftsmanship. In other words, our work enables lidar point cloud generation in a fully unsupervised fashion. To the best of our knowledge, no previous attempts have been made to tackle this problem. Experimental results using the standard KITTI dataset show that the method can synthesize lidar data that preserves essential structural properties of the scene.
The majority of papers applying deep learning methods to lidar data present discriminative models to extract relevant information from the vehicle’s environment. Dewan et al. [dewan2017deep] propose a CNN for pointwise semantic segmentation to distinguish between static and moving obstacles. Caltagirone et al. [caltagirone2017fast] use a similar approach to perform pixel-wise classification for road detection. Both papers use image representations of the lidar input, which hold less information than raw lidar scans. To leverage the full 3D structure of the input, Bo Li [li20173d] uses 3D convolutions on a voxel grid for vehicle detection. However, processing voxels is computationally heavy and does not leverage the sparsity of lidar scans.
To address this issue, several papers [li2016vehicle, velas2018cnn, vaquero2017deconvolutional, vaquero2018deep] use a bijective mapping from the 3D point cloud to a 2D point map, similar to polar coordinates, where coordinates are encoded as azimuth and elevation angles measured from the origin. Using such a bijection lies at the core of our proposed approach for generative modeling of lidar data.
An alternative approach for generative modeling of lidar data is that of Ondruska et al. [ondruska2016end]. They train a Recurrent Neural Network for semantic segmentation and convert their input to an occupancy grid. More relevant to our task, they train their network to also predict future occupancy grids, thereby creating a generative model for lidar data. Their approach differs from ours, as we operate directly on raw coordinates. This not only reduces preprocessing time, but also allows us to efficiently represent data with non-uniform spatial density. We can therefore run our model at a much higher resolution, while remaining computationally efficient.
A recent line of work [achlioptas2017representation, groueix2018atlasnet, yang2018foldingnet, fan2017point] considers the problem of generating point clouds as unordered sets of coordinates. This approach does not define an ordering on the points, and must therefore be invariant to permutations. To achieve this, they use a variant of PointNet [qi2017pointnet] to encode a variable-length point cloud into a fixed-length representation. This latent vector is then decoded back to a point cloud, and the whole network is trained using permutation-invariant losses such as the Earth-Mover’s Distance or the Chamfer Distance [fan2017point]. While these approaches work well for arbitrary point clouds, we show that they give suboptimal performance on lidar, as they do not leverage the known structure of the data.
The underlying task of generative models is density estimation. Formally, we are given a set of $d$-dimensional i.i.d. samples $X = \{x_i\}_{i=1}^{n}$ drawn from some unknown probability density function $p^*$. Our objective is to learn a density $p_\theta \in \mathcal{P}$, where $\theta$ represents the parameters of our estimator and $\mathcal{P}$ a parametric family of models. Training is done by minimizing some distance $\mathcal{D}(p^*, p_\theta)$ between $p^*$ and $p_\theta$. The choice of both $\mathcal{D}$ and the training algorithm are the defining components of the density estimation procedure. Common choices for $\mathcal{D}$ are either $f$-divergences such as the Kullback-Leibler (KL) divergence, or Integral Probability Metrics (IPMs), such as the Wasserstein metric [arjovsky2017wasserstein]. These similarity metrics between distributions usually come with specific training algorithms, as we describe next.
Maximum likelihood estimation (MLE) aims to find model parameters that maximize the likelihood of the observed samples under $p_\theta$. Since samples are i.i.d., the optimization criterion can be written as:

$\theta^* = \arg\max_{\theta} \sum_{i=1}^{n} \log p_\theta(x_i) \quad (1)$

It can be shown that training with the MLE criterion converges to a minimization of the KL-divergence as the sample size increases. From Eqn (1) we see that any model admitting a differentiable density $p_\theta$ can be trained via backpropagation. Powerful generative models trained via MLE include Variational Autoencoders [vae] and autoregressive models [oord2016pixel].
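As a concrete illustration of the MLE criterion in Eqn (1), the following sketch fits a one-dimensional Gaussian, whose maximizing parameters are available in closed form (the data and parameter values here are illustrative, not drawn from our experiments):

```python
import numpy as np

def gaussian_mle(x):
    """Closed-form MLE for a 1-D Gaussian: the sample mean and the
    (biased, 1/n) variance maximize the summed log-likelihood in Eqn (1)."""
    return x.mean(), x.var()

def log_likelihood(x, mu, var):
    """Summed Gaussian log-density of the samples (the MLE objective)."""
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)
mu_hat, var_hat = gaussian_mle(x)
# The MLE parameters score at least as high as any perturbed ones.
assert log_likelihood(x, mu_hat, var_hat) >= log_likelihood(x, mu_hat + 0.1, var_hat)
```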
The VAE [vae] is a regularized version of the traditional autoencoder (AE). It consists of two parts: an inference network $q_\phi(z|x)$ that maps an input $x$ to a posterior distribution over latent codes $z$, and a generative network $p_\theta(x|z)$ that aims to reconstruct the original input conditioned on the latent encoding. By imposing a prior distribution $p(z)$ on latent codes, this model enforces the distribution over $z$ to be smooth and well-behaved. This property enables proper sampling from the model via ancestral sampling from latent to input space. Without it, the inference network would encode the inputs as single points by making the variance of $q_\phi(z|x)$ arbitrarily small (in this case the model collapses to a regular autoencoder). The full objective of the VAE is then:

$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{\mathrm{KL}}\left(q_\phi(z|x) \,\|\, p(z)\right) \quad (2)$

which is a valid lower bound on the true likelihood, thereby making Variational Autoencoders valid generative models. For a more in-depth analysis of VAEs we refer the reader to [doersch2016tutorial].
The GAN [gan] formulates the density estimation problem as a minimax game between two opposing networks. The generator $G$ maps noise $z$ drawn from a prior distribution $p(z)$ to the input space, aiming to fool its adversary, the discriminator $D$. The latter then tries to distinguish between real samples $x \sim p^*$ and fake samples $G(z)$. In practice, both models are represented as neural networks. Formally, the objective is written as

$\min_G \max_D \; \mathbb{E}_{x \sim p^*}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right] \quad (3)$
GANs have shown the ability to produce more realistic samples [karras2017progressive] than their MLE counterparts. However, the optimization process is notoriously difficult; stabilizing GAN training is still an open problem. In practice, GANs can also suffer from mode collapse [salimans2016improved], which happens when the generator overlooks certain modes of the target distribution.
We next describe the proposed deep learning framework used for generative modeling of lidar scans.
Our approach relies heavily on 2D convolutions; we therefore start by converting a lidar scan containing $n$ points into a 2D grid. We begin by clustering together points emitted from the same elevation angle, giving $H$ clusters, one per laser beam. Second, for every cluster, we sort the points in increasing order of azimuth angle. In order to have a proper grid with a fixed number of points per row, we divide the $xy$-plane into $W$ angular bins. For every bin, we take the average of all its points to obtain a single representative. This yields an $H \times W \times 3$ grid, where the last axis holds the spatial coordinates. We note that the default ordering in most lidar scanners is the same as the one obtained after applying this preprocessing. Therefore, sorting is not required in practice, and the whole procedure can be executed in $O(n)$. Figure 2 provides a visual representation of this mapping.
Since this mapping is implicitly in cylindrical coordinates (due to the ordering with respect to azimuth angle), we argue that the points should also be encoded in cylindrical form: in this representation, the convolution operator becomes fully equivariant to rotations of the $xy$-plane. Note that this is not the case when using raw $(x, y, z)$ values. We later show in the experimental section that this cylindrical mapping gives better samples, while showing similar quantitative performance.
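The mapping above can be sketched as follows. This is a minimal NumPy illustration: the grid size $H \times W$, the elevation bucketing, and the angle normalization are assumptions for the sketch (real scanners expose the beam index directly, removing the need to bucket by elevation):

```python
import numpy as np

def lidar_to_grid(points, H=64, W=256):
    """Sketch of the preprocessing step: bucket points by elevation
    (rows) and azimuth (columns), then average each bucket, yielding
    an H x W x 3 grid of representative coordinates."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    elev = np.arctan2(z, np.sqrt(x ** 2 + y ** 2))   # elevation angle
    azim = np.arctan2(y, x)                          # azimuth angle in [-pi, pi)
    # Map angles to integer row/column bins.
    rows = np.clip(((elev - elev.min()) / (np.ptp(elev) + 1e-9) * H).astype(int), 0, H - 1)
    cols = np.clip(((azim + np.pi) / (2 * np.pi) * W).astype(int), 0, W - 1)
    grid = np.zeros((H, W, 3))
    counts = np.zeros((H, W, 1))
    np.add.at(grid, (rows, cols), points)            # accumulate points per cell
    np.add.at(counts, (rows, cols, 0), 1)
    return grid / np.maximum(counts, 1)              # average per occupied cell
```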
In practice, both encoder and decoder are represented as neural networks with parameters $\phi$ and $\theta$, respectively.
Similar to a traditional AE, the training procedure first encodes the data into a latent representation. The variational aspect is introduced by interpreting the encoder output not as a latent vector, but as the parameters of a distribution over latent codes. In our work we choose a Gaussian, and the encoder output therefore decomposes as $(\mu, \sigma)$. We then sample $z \sim \mathcal{N}(\mu, \sigma^2 I)$ from this distribution and pass it through the decoder to obtain the reconstruction $\hat{x}$. Using the reparametrization trick [vae], the network is fully deterministic and differentiable w.r.t. its parameters $\phi$ and $\theta$, which are updated via stochastic gradient descent (SGD).
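The reparametrized sampling step can be sketched as follows (a minimal NumPy illustration; the latent dimensionality and the log-variance parametrization are common conventions, not values prescribed by the paper):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparametrization trick: z = mu + sigma * eps with eps ~ N(0, I),
    so z is a deterministic, differentiable function of (mu, log_var)
    and all randomness is isolated in eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian posterior."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
mu, log_var = np.zeros(8), np.zeros(8)            # encoder outputs (illustrative)
z = reparameterize(mu, log_var, rng)
assert kl_to_standard_normal(mu, log_var) == 0.0  # posterior already equals the prior
```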
Training alternates between updates for the generator and discriminator, with parameters $\theta_G$ and $\theta_D$ respectively. Similarly to the VAE, samples are obtained by ancestral sampling from the prior through the generator. In the original GAN, the discriminator’s loss is the binary cross-entropy, and the generator’s is simply the opposite of its adversary’s. In practice, we use the Relativistic Average GAN (RaGAN) objective [jolicoeur2018relativistic], which is easier to optimize. Further details on the importance of the RaGAN loss are provided in Appendix VII-A. Again, $\theta_G$ and $\theta_D$ are updated using SGD.
Deep Convolutional GANs (DCGANs) [dcgan] have shown great success in generating images. They use a symmetric architecture for the two networks: the generator consists of 5 transpose convolutions with stride two to upsample at each layer, and ReLU activations. The discriminator uses stride-two convolutions to downsample the input, and Leaky ReLU activations. In both networks, Batch Normalization [bn] is interleaved between convolution layers for easier optimization. We use this architecture for all our models: the VAE encoder setup is simply the first four layers of the discriminator, and the decoder’s architecture replicates the DCGAN generator.
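The upsampling arithmetic of the stride-two transpose convolutions can be checked in a few lines; kernel size 4 and padding 1 are the usual DCGAN choices, assumed here rather than quoted from the paper:

```python
def transpose_conv_out(size, kernel=4, stride=2, pad=1):
    """Output size of a 2D transpose convolution along one spatial dim."""
    return (size - 1) * stride - 2 * pad + kernel

# Five stride-two transpose convolutions double the spatial size each time,
# e.g. a 4x4 seed tensor grows to 128x128.
size = 4
for _ in range(5):
    size = transpose_conv_out(size)
assert size == 128
```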
This section provides a thorough analysis of the performance of our framework on a variety of tasks related to generative modeling. Our experiments include conditional and unconditional generation. In the first setting, the model must reconstruct a (potentially corrupted) lidar scan. In the latter, we are only interested in producing realistic samples, which are not explicitly tied to a real lidar cloud.
We consider the point clouds available in the KITTI dataset [geiger2013vision]. We use the train/validation/test split proposed by [lotter2016deep], which yields 40 000, 80, and 700 samples for the train, validation, and test sets. For training we subsample the stream from 10 Hz to 3 Hz, since temporally adjacent frames are nearly identical.
Since, to the best of our knowledge, no work has attempted generative modeling of raw lidar clouds, we compare our method to models that operate on arbitrary point clouds. We first choose AtlasNet [groueix2018atlasnet], which has shown strong modeling performance on the ShapeNet [chang2015shapenet] dataset. This network first encodes point clouds using a shared MLP that operates on each point individually. A max-pooling operation is then performed over the point axis to obtain a fixed-length global representation of the point cloud. In other words, the encoder treats each point independently of the others, without assuming an ordering on the set of coordinates. This makes the feature extraction process invariant to permutations of points. The decoder is given the encoder output along with the coordinates of a 2D grid, and attempts to fold this 2D grid into a three-dimensional surface. The decoder also uses an MLP shared across all points.
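The permutation-invariant encoding described above can be sketched as follows (a single shared linear layer stands in for the full MLP; dimensions are illustrative):

```python
import numpy as np

def pointnet_encode(points, weights):
    """Sketch of a PointNet-style encoder: a shared per-point transform
    (one linear layer + ReLU for brevity) followed by a max-pool over
    the point axis, yielding a fixed-length code for any cloud size."""
    features = np.maximum(points @ weights, 0.0)  # shared MLP, applied per point
    return features.max(axis=0)                   # permutation-invariant pooling

rng = np.random.default_rng(0)
weights = rng.standard_normal((3, 16))            # 3-D points -> 16-D features
cloud = rng.standard_normal((100, 3))
code = pointnet_encode(cloud, weights)
shuffled = cloud[rng.permutation(100)]
# The code is identical under any reordering of the points.
assert np.allclose(code, pointnet_encode(shuffled, weights))
```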
Similar to AtlasNet, we compare our model with the one from Achlioptas et al. [achlioptas2017representation]. Only its decoder differs from AtlasNet’s; the model does not deform a 2D grid, but rather uses fully-connected layers to convert the latent vector into a point cloud.
Both networks are trained end-to-end using the Chamfer loss [fan2017point], defined as

$d_{\mathrm{CH}}(S_1, S_2) = \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2$

where $S_1$ and $S_2$ are two sets of coordinates. We note again that this loss is invariant to the ordering of the output points. For both autoencoders, we regularize the latent space using a Gaussian prior to obtain a valid generative model.
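A direct NumPy implementation of the Chamfer loss, quadratic in the number of points and intended only to make the definition concrete:

```python
import numpy as np

def chamfer_distance(s1, s2):
    """Chamfer distance between two point sets: for each point, the
    squared distance to its nearest neighbour in the other set, summed
    in both directions. Invariant to the ordering of either set."""
    d = np.sum((s1[:, None, :] - s2[None, :, :]) ** 2, axis=-1)  # pairwise sq. dists
    return d.min(axis=1).sum() + d.min(axis=0).sum()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
assert chamfer_distance(a, a) == 0.0        # identical sets
assert chamfer_distance(a, a[::-1]) == 0.0  # invariant to point ordering
```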
For this section, we consider the GAN model introduced in Section IV-C. Our goal is to train a model that can produce realistic samples, which can be used in downstream tasks such as simulator development. In this use case, training an agent in an environment that lacks realism will likely result in poor skill transfer to real-world navigation. Since GANs have been shown to produce more realistic samples than MLE-based models on images [larsen2015autoencoding], we hope to see similar results with our model in the case of lidar data.
Evaluation criteria: Rigorous quantitative evaluation of samples produced by GANs and other generative models is an open research question. GANs trained on images have been evaluated with the Inception Score [salimans2016improved] and the Fréchet Inception Distance (FID) [heusel2017gans]. Since there exists no standardized metric for unconditional generation of lidar clouds, we rely on visual inspection of samples for quality assessment.
We proceed to test our approach in the conditional setting. Here, only the VAE model described in Section IV-C is used, as vanilla GANs do not have an inference mechanism. We refer to our models as LiDAR-VAE (cyl.) and LiDAR-VAE (xyz), where the former represents its input using cylindrical coordinates, and the latter processes points directly. Formally, given a lidar cloud, we evaluate a model’s ability to reconstruct it from a compressed encoding. More relevant to real-world applications, we look at how robust the model’s latent representation is to input perturbation. Specifically, we look at the two following corruption mechanisms:
Additive Noise: we add Gaussian noise drawn from $\mathcal{N}(0, \sigma^2)$ to the coordinates of the lidar cloud. We experiment with varying levels of $\sigma$.
Point Dropout: we remove random points from the input lidar scan. Specifically, the probability of removing a point is modeled as a Bernoulli distribution parametrized by $p$. We experiment with different values for $p$.
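Both corruption mechanisms can be sketched in a few lines (parameter values are illustrative):

```python
import numpy as np

def corrupt(points, sigma=0.0, drop_p=0.0, rng=None):
    """The two corruption mechanisms: additive Gaussian noise on the
    coordinates, and Bernoulli(drop_p) removal of individual points."""
    if rng is None:
        rng = np.random.default_rng()
    noisy = points + rng.normal(0.0, sigma, size=points.shape)
    keep = rng.random(len(points)) >= drop_p  # True = point survives
    return noisy[keep]

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1000, 3))
assert corrupt(cloud, sigma=0.0, drop_p=0.0, rng=rng).shape == cloud.shape
assert len(corrupt(cloud, sigma=0.1, drop_p=0.5, rng=rng)) < len(cloud)
```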
To measure how close the reconstructed output is to the original point cloud, we use the Earth-Mover’s Distance (EMD) [fan2017point]. It is defined as

$d_{\mathrm{EMD}}(S_1, S_2) = \min_{\phi : S_1 \to S_2} \sum_{x \in S_1} \|x - \phi(x)\|_2$

where $\phi$ is a bijection between the two sets.
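The bijection-based definition can be made concrete with a brute-force sketch; this enumerates all bijections and is only feasible for tiny sets, whereas practical implementations solve the underlying assignment problem (e.g. with the Hungarian algorithm):

```python
import itertools
import numpy as np

def emd_brute_force(s1, s2):
    """Exact EMD over all bijections between two equal-size point sets.
    Factorial complexity; for definition purposes only."""
    n = len(s1)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(np.linalg.norm(s1[i] - s2[j]) for i, j in enumerate(perm))
        best = min(best, cost)
    return best

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[1.0, 0.0], [0.0, 0.0]])  # same points, different order
assert emd_brute_force(a, b) == 0.0     # EMD ignores point ordering
```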
The EMD gives the solution to the optimal transportation problem, which attempts to transform one point cloud into the other. Recent work [achlioptas2017representation] has shown that this metric correlates well with human evaluation. Moreover, this distance is sensitive to both global and local structure, and does not require points to be ordered.
In this section, we will first discuss results for unconditional generation and subsequently evaluate results for conditional generation of lidar images.
[Table 1, fragment: EMD reconstruction error. Achlioptas et al.: 2403.75]
From a visual inspection (Figure 6), we see that our model generates realistic samples. First, the scans have a well-defined global structure: an aerial view of the samples shows points correctly aligned to model the structure of the road. Second, the samples share local characteristics of real data: the model correctly generates road obstacles, such as cars or cyclists. This amounts to having locations with a dense aggregation of points, followed by a trailing area with almost no points, similar to the shadow of an object. Third, the model respects the point density of the data, where density is roughly inversely proportional to the distance from the origin. Lastly, our models show good sample diversity. We refer to Figure 6 for a specific analysis of samples, and to the code repository for a wider range of samples.
In all conditional tasks, our proposed VAE beats the available baselines by a significant margin, both in terms of EMD and visual inspection. The results are described in Table 1 and in Figures 4 and 5. The model is able to extract a significant amount of information from noisy data. As shown in Figure 3, the VAE correctly reconstructs the defining components of the original cloud, even when the given input is seemingly uninformative. We emphasize that our model was not trained on such corrupted data; these results are therefore quite surprising.
We note that the suboptimal performance of the baselines is mainly due to two factors. First, since points are encoded independently, only information about the global structure is kept, and local fine-grained details are neglected. Second, the Chamfer Distance used for training assumes that the point cloud has a uniform density, which is not the case for lidar scans. However, we note that removing data from the input does not affect their reconstruction performance. By encoding points individually, their encoder can process a variable number of points. It can therefore easily adapt to missing data. However, this characteristic is not enough to close the performance gap with our model, as shown in Figure 5.
In this work we introduced two generative models for raw lidar scans, a GAN and a VAE. Both models use a novel lidar representation, giving rise to fully rotation-equivariant convolutions. We have shown that the proposed LiDAR-GAN can generate highly realistic data, capturing both local and global features of real lidar scans. The LiDAR-VAE successfully encodes and generates lidar samples, and is highly robust to noisy or missing data. We demonstrate that even when enough noise is added to render the scan uninformative to the human eye, the proposed VAE still extracts relevant information and generates the missing data.
Moreover, we show that the LiDAR-GAN learns a meaningful latent representation. Interesting future work would be to leverage this representation for sequential data generation. In other words, one could aim to train a model such that the latent space interpolations would produce temporally adjacent frames.
To stabilize GAN training, we experimented with several tricks known from image generation, namely Spectral Normalization [miyato2018spectral], SELU activations [klambauer2017self], and the Relativistic Average GAN (RaGAN) loss [jolicoeur2018relativistic]. Of all these techniques, we found that only the RaGAN objective gave consistent improvements and stable training dynamics. RaGAN modifies equation 3 to

$\min_G \max_D \; \mathbb{E}_{x \sim p^*}\left[\log \sigma\left(C(x) - \mathbb{E}_{z \sim p(z)}\, C(G(z))\right)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - \sigma\left(C(G(z)) - \mathbb{E}_{x \sim p^*}\, C(x)\right)\right)\right]$

where $C$ denotes the discriminator’s unnormalized output (so that $D(x) = \sigma(C(x))$) and $\sigma$ is the sigmoid function.
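The relativistic average discriminator loss can be sketched as follows (logit values are illustrative; the generator loss mirrors this with the roles of real and fake reversed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ragan_d_loss(real_logits, fake_logits):
    """Relativistic average discriminator loss: each logit is judged
    relative to the *average* logit of the opposite class, rather than
    in isolation as in the standard GAN objective."""
    real_rel = real_logits - fake_logits.mean()
    fake_rel = fake_logits - real_logits.mean()
    return -np.mean(np.log(sigmoid(real_rel))) - np.mean(np.log(1.0 - sigmoid(fake_rel)))

real = np.array([2.0, 1.5, 2.5])    # illustrative critic outputs on real data
fake = np.array([-2.0, -1.5, -2.5]) # illustrative critic outputs on fake data
# The loss is small when real logits sit well above the average fake logit.
assert ragan_d_loss(real, fake) < ragan_d_loss(fake, real)
```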
To monitor convergence, we look at the difference in discriminator output for real and fake data, shown in Figure 7. We argue that this is a good metric for monitoring convergence: once the discriminator correctly classifies data with 100% confidence, this metric reaches 1, and qualitative evaluation of samples suggests that they stop improving at this point.
In Figure 7, we plot 8 runs with the classic GAN objective, and 8 runs with the modified RaGAN objective. For each run, we sampled different hyperparameters values for the batch size, learning rates and the use of spectral normalization. These results seem to strongly suggest that the relativistic approach is much more robust to hyperparameter changes.