TensorFlow implementation of PointRNN, PointGRU and PointLSTM.
Point clouds are attracting more and more attention in the community. However, few works study dynamic point clouds. In this paper, we introduce a Point Recurrent Neural Network (PointRNN) unit for moving point cloud processing. To keep the spatial structure, rather than taking a sole one-dimensional vector x∈R^d like RNN as input, PointRNN takes points' coordinates P∈R^{n×3} and their features X∈R^{n×d} as inputs (n and d denote the number of points and feature dimensions, respectively). Accordingly, the state s∈R^{d'} in RNN is extended to (P, S∈R^{n×d'}) in PointRNN (d' denotes the number of state dimensions). Since point clouds are orderless, features and states of two adjacent time steps cannot be operated on directly. Therefore, PointRNN replaces the concatenation operation in RNN with a correlation operation, which aggregates inputs and states according to points' coordinates. To evaluate PointRNN, we apply one of its variants, i.e., Point Long Short-Term Memory (PointLSTM), to moving point cloud prediction, which aims to predict the future trajectories of points in a cloud given their history movements. Experimental results show that PointLSTM is able to produce correct predictions on both synthetic and real-world datasets, demonstrating its effectiveness in modeling point cloud sequences. The code has been released at https://github.com/hehefan/PointRNN.
Most modern robot and self-driving car platforms rely on 3D point clouds for visual perception. In contrast to RGB images, point clouds provide accurate displacement measurements, which are generally unaffected by lighting conditions. Point clouds are attracting more and more researchers in the community. However, most existing works focus on static point cloud analysis, e.g., classification and segmentation [21, 22, 14]. Few works study dynamic point clouds. Intelligent systems need the ability to understand not only the static scenes around them but also dynamic changes in the environment. In this paper, we propose a Point Recurrent Neural Network (PointRNN) for moving point cloud processing.
In the general setting, a point cloud is a set of points in 3D space. Usually, the point cloud is represented by points' three coordinates P∈R^{n×3} and their additional features X∈R^{n×d} (if features are provided), where n and d denote the number of points and feature dimensions, respectively. Essentially, point clouds are unordered sets and invariant to permutations of their points. For example, the sets {p_1, p_2, p_3} and {p_3, p_1, p_2} represent the same point cloud. This irregular data format considerably increases the challenge of reasoning about point clouds, and causes many achievements of deep neural networks on image processing to fail on point clouds.
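As a minimal illustration of this permutation invariance, any symmetric aggregation over the points, such as max-pooling, yields the same result for every ordering of the rows. The 3-point cloud below is a made-up example:

```python
import numpy as np

def global_feature(points):
    """Max-pool over the point dimension -> a permutation-invariant vector."""
    return points.max(axis=0)

cloud = np.array([[0.0, 1.0, 2.0],
                  [3.0, 0.5, 1.0],
                  [1.0, 2.0, 0.0]])
shuffled = cloud[[2, 0, 1]]  # same set of points, different row order

# The aggregated feature is identical for both orderings.
assert np.allclose(global_feature(cloud), global_feature(shuffled))
```

Operations that depend on row order (e.g., concatenating rows across time steps) do not enjoy this property, which is exactly the difficulty PointRNN addresses.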
Recurrent neural networks are well-suited to processing time series data. Generally, the (vanilla) RNN looks at an input vector x_t at time step t, updates its state from s_{t−1} to s_t, and outputs s_t. The RNN can be formulated as follows,

s_t = rnn(x_t, s_{t−1}; W),    (1)

where W denotes the learned parameters. Since a moving point cloud is a kind of sequence, we can exploit RNN to model it. However, RNN has two severe limitations on processing point cloud sequences.
On one hand, RNN learns from one-dimensional vectors, in which the input, output and states are highly compact. It is difficult for a sole vector to represent a point cloud. Although we can flatten P_t and X_t to a one-dimensional vector, such an operation heavily damages the data structure and increases the challenge for neural networks to understand point clouds. If we use a global feature to represent a point cloud, the local structure will be lost. To overcome this problem, PointRNN takes the original (P_t, X_t) as inputs. Similarly, the one-dimensional state s in RNN is extended to the two-dimensional S in PointRNN, in which each row corresponds to a point. Besides, because S_t depends on points' coordinates, P_t is added into the states and outputs of PointRNN. The PointRNN can be formulated as follows,

(P_t, S_t) = pointrnn(P_t, X_t, P_{t−1}, S_{t−1}; W).    (2)
We illustrate an LSTM and a PointLSTM unit in Figure 1.
On the other hand, RNN aggregates the past information and the current input based on a concatenation operation (Figure 2(a)). However, because point clouds are unordered, concatenation cannot be directly applied to point clouds. To alleviate this problem, we use a correlation operation to aggregate (P_{t−1}, S_{t−1}) and (P_t, X_t) according to points' coordinates (Figure 2(b)). Specifically, for each point in P_t, PointRNN first searches for its radius neighborhoods in P_{t−1} and samples a fixed number of them. Second, the feature of the query point, the hidden state of each sampled neighbor, and the displacement from the query point to that neighbor are concatenated, and the concatenations are then processed by a fully-connected (FC) layer. At last, the processed representations are reduced to a single representation by pooling.
PointRNN provides a fundamental component for point cloud sequence modelling. We evaluate PointRNN on moving point cloud prediction. Given the history movements of point clouds, the goal of this task is to predict the future trajectories of their points. Predicting how point clouds move in the future can help robots and self-driving cars plan their actions and make decisions. Besides, moving point cloud prediction has the innate advantage that it does not require external supervision. To avoid the exploding and vanishing gradient problems, we adopt a variant of PointRNN, i.e., Point Long Short-Term Memory (PointLSTM). Based on PointLSTM, we propose a seq2seq architecture for moving point cloud prediction. Experimental results on a synthetic moving MNIST point cloud dataset and two large-scale autonomous driving datasets, Argoverse and nuScenes, show that PointLSTM produces correct predictions, confirming its ability to model moving point clouds.
Static Point Cloud Understanding. The dawn of point clouds has boosted a number of applications, such as object classification, object part segmentation, scene semantic segmentation [21, 22, 11, 13, 14, 28, 8, 26, 17, 31, 29, 27], reconstruction [6, 15, 32] and object detection [4, 33, 20, 23, 12, 19] in 3D space. Most recent works aim to consume point sets without transforming the coordinates to regular 3D voxel grids or collections of images. There exist two main challenges for static point cloud processing. First, a point cloud is in essence a set of unordered points, invariant to permutations of its points, which necessitates certain symmetrizations in computation. Second, different from images, point cloud data is irregular. Convolution operations that capture local structures in images cannot be directly applied to such a data format. Different from these existing works, we are dedicated to a new challenge, i.e., modeling dynamics in point cloud sequences.
RNN Variants for Spatio-temporal Modeling. Because the conventional RNNs, e.g., Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), learn from one-dimensional vectors in which the representation is highly compact and the spatial structure is heavily damaged, a number of LSTM variants have been proposed to model spatio-temporal sequences. For example, convolutional LSTM (ConvLSTM) modifies LSTM by taking three-dimensional tensors as input and replacing FC with convolution to capture spatial local structures. Based on ConvLSTM, Spatio-temporal LSTM (ST-LSTM) extracts and memorizes spatial and temporal representations simultaneously by adding a spatio-temporal memory. Cubic LSTM (CubicLSTM) extends ConvLSTM by splitting states into temporal states and spatial states, which are generated by independent convolutions. However, these methods focus on 2D videos and cannot be directly applied to 3D point cloud sequences.
In this section, we first review the standard (vanilla) RNN and then describe the proposed PointRNN in detail.
The RNN is a class of deep neural networks that take inputs along a temporal sequence. This allows it to exhibit temporal dynamic behavior. The RNN can use its internal state (memory) to process sequences of inputs. It relies on a concatenation operation to aggregate the past and the current, which is referred to as concat in this paper,

concat(x_t, s_{t−1}; W, b) = W · [x_t ; s_{t−1}] + b,

where · denotes matrix multiplication and [· ; ·] denotes concatenation. The weight W and bias b are the parameters of RNN to be learned. Usually, this operation can be implemented by an FC layer in deep neural networks (Figure 2(a)).
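A minimal NumPy sketch of this concat-based update, assuming a tanh nonlinearity and illustrative shapes (4 input dimensions, 8 state dimensions):

```python
import numpy as np

def rnn_step(x_t, s_prev, W, b):
    """One vanilla-RNN step: s_t = tanh(W · [x_t ; s_{t-1}] + b)."""
    z = np.concatenate([x_t, s_prev])  # [x_t ; s_{t-1}]
    return np.tanh(W @ z + b)

d, d_state = 4, 8                      # illustrative dimensions
rng = np.random.default_rng(0)
W = rng.normal(size=(d_state, d + d_state)) * 0.1
b = np.zeros(d_state)

# Unroll over a short random sequence.
s = np.zeros(d_state)
for t in range(5):
    s = rnn_step(rng.normal(size=d), s, W, b)
print(s.shape)  # (8,)
```

Flattening a whole point cloud into x_t would force this single FC layer to carry all spatial structure, which is exactly the limitation discussed above.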
The conventional RNN learns from one-dimensional vectors, limiting its application to moving point cloud sequences. If we flatten the coordinates of points in a cloud to a vector, the data structure is heavily damaged. This will increase the challenge for deep neural networks to understand point clouds. If we use a global feature (e.g., extracted by PointNet++) to represent a point cloud and then apply RNN, the local structure will be highly compacted. It is difficult to learn the dynamics of a moving point cloud sequence from global representations. Therefore, to keep the spatial structure, we propose to take the coordinates P_t∈R^{n×3} and features X_t∈R^{n×d} of points as inputs, in which each row corresponds to a point. Accordingly, the state s∈R^{d'} in RNN is extended to S_t∈R^{n×d'} in PointRNN. The update of the state for PointRNN at the t-th time step is formulated in Eq. (2). We refer to this operation as point-rnn.
The goal of the point-rnn function is to aggregate the past and the current according to the coordinates of points. Specifically, given (P_{t−1}, S_{t−1}) and (P_t, X_t), point-rnn merges them according to the coordinates of points in P_t. First, for each point in P_t, e.g., the i-th point p_t^i, point-rnn finds all points in P_{t−1} that are within a radius r of p_t^i, which can be seen as a neighborhood ball in the underlying Euclidean space. In implementation, we sample a fixed number of neighbors from the neighborhood ball for computation. The radius neighbors potentially share the same geometry or motion information as the query point p_t^i. Second, the feature of the query point, the states of the neighbors, and the displacements from the sampled neighbors to the query point are concatenated separately, and then processed by a shared FC layer. Third, the processed concatenations are pooled to a single representation. The output of point-rnn, i.e., S_t, contains the past and current information of each point in P_t (Figure 2(b)). Note that, compared with the concat function in Eq. (1), if we ignore the parameters caused by the displacements, the point-rnn function does not introduce additional parameters.
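The point-rnn aggregation described above can be sketched as follows in NumPy. The names, shapes, and the choice of max-pooling are illustrative assumptions, and a real implementation would batch the neighborhood search on GPU rather than loop over points:

```python
import numpy as np

def point_rnn(P_t, X_t, P_prev, S_prev, W, radius=1.0, k=4):
    """Sketch of point-rnn: for each query point in P_t, gather up to k
    neighbors of P_prev inside `radius`, concatenate
    [query feature ; neighbor state ; displacement], apply a shared
    linear layer W, then max-pool over the neighbors."""
    n, d_out = P_t.shape[0], W.shape[0]
    S_t = np.zeros((n, d_out))
    for i in range(n):
        disp = P_prev - P_t[i]                                # (m, 3)
        idx = np.where((disp ** 2).sum(axis=1) <= radius ** 2)[0][:k]
        if idx.size == 0:                                     # empty ball
            continue
        rows = []
        for j in idx:
            f = np.concatenate([X_t[i], S_prev[j], P_prev[j] - P_t[i]])
            rows.append(W @ f)                                # shared FC
        S_t[i] = np.stack(rows).max(axis=0)                   # pooling
    return P_t, S_t

# Tiny random example: 5 current points (2-dim features),
# 6 past points (4-dim states), 3 output state dimensions.
rng = np.random.default_rng(0)
P_t, X_t = rng.random((5, 3)), rng.random((5, 2))
P_prev, S_prev = rng.random((6, 3)), rng.random((6, 4))
W = rng.normal(size=(3, 2 + 4 + 3))
P_out, S_t = point_rnn(P_t, X_t, P_prev, S_prev, W)
```

Note the output rows are indexed by the current points P_t, so the result is well-defined even though the two clouds are unordered.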
Since a point cloud is a set of points that is irregular and orderless, the same row of features or states at two different time steps may represent different points. Without the coordinates of points, an independent X_t and S_t are meaningless. Therefore, the coordinates of points are integrated into the states and outputs, i.e., (P_t, S_t).
The PointRNN provides a prototype for using RNN to process point cloud sequences. Each component in PointRNN is necessary. However, more effective functions can be designed to replace the point-rnn function of this paper, which can be further studied in the future.
Moving point clouds provide a large amount of geometric information about scenes as well as profound dynamic changes in motion. Understanding scenes and imagining motions in 3D space are fundamental abilities for robot and self-driving car platforms. It is indisputable that an intelligent system that is able to predict the future trajectories of points in a cloud will have such abilities. In this paper, we apply PointRNN to moving point cloud prediction. Because this task does not require external human-annotated supervision, the model can be trained in an unsupervised (self-supervised) manner. Because RNN encounters the exploding and vanishing gradient problems, when applying PointRNN to moving point cloud prediction, we use one of its variants, i.e., PointLSTM, which inherits from both LSTM and PointRNN.
An LSTM unit is composed of a cell state c_t, an input gate i_t, an output gate o_t and a forget gate f_t. The cell state acts as an accumulator of the sequence or temporal information over time. Specifically, the current input x_t will be integrated into c_t if the input gate i_t is activated. Meanwhile, the past cell state c_{t−1} may be forgotten if the forget gate f_t turns on. Whether c_t will be propagated to the hidden state s_t is controlled by the output gate o_t. The updates for LSTM are formulated in Table 1. The σ denotes the sigmoid function and ⊙ denotes the Hadamard product. The weights W and biases b are the parameters of LSTM to be learned. Similar to RNN, LSTM relies on the concat function to aggregate the past and the current.
The comparison between PointLSTM and LSTM is shown in Table 1. Generally, the concat functions in LSTM are replaced with point-rnn functions. Besides, the unit structure is also changed, i.e., input: (P_t, X_t), state: (P_t, C_t, S_t) and output: (P_t, S_t).
Another difference between the standard LSTM and our PointLSTM is that PointLSTM has an additional step, which transforms and shuffles the past cell state C_{t−1} according to the current input points P_t. Only after this step can we perform the Hadamard product between the transformed cell state and the forget gate f_t.
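Under the simplifying assumption that the gate pre-activations have already been produced by point-rnn-style aggregation (collapsed here into a single `agg` matrix), the per-point gating and the cell-state alignment step might look like the sketch below. The alignment via nearest neighbor is an illustrative stand-in for the transform-and-shuffle step:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def align_cell(P_t, P_prev, C_prev):
    """Assumed alignment: move each past cell-state row to its nearest
    current point, so the cell state and the gates share row order."""
    d2 = ((P_t[:, None, :] - P_prev[None, :, :]) ** 2).sum(-1)
    return C_prev[d2.argmin(axis=1)]

def pointlstm_step(P_t, agg, P_prev, C_prev, params):
    """Per-point LSTM gating; `agg` stands in for point-rnn outputs."""
    Wi, Wf, Wo, Wg = params
    i = sigmoid(agg @ Wi)          # input gate
    f = sigmoid(agg @ Wf)          # forget gate
    o = sigmoid(agg @ Wo)          # output gate
    g = np.tanh(agg @ Wg)          # candidate cell state
    C_hat = align_cell(P_t, P_prev, C_prev)
    C_t = f * C_hat + i * g        # Hadamard products, as in LSTM
    S_t = o * np.tanh(C_t)
    return P_t, C_t, S_t

# Tiny random example: 4 points, 6 aggregated dims, 5 cell dims.
rng = np.random.default_rng(1)
n, h, dc = 4, 6, 5
P_t, P_prev = rng.random((n, 3)), rng.random((n, 3))
C_prev, agg = rng.random((n, dc)), rng.random((n, h))
params = [rng.normal(size=(h, dc)) * 0.1 for _ in range(4)]
_, C_t, S_t = pointlstm_step(P_t, agg, P_prev, C_prev, params)
```

Without the alignment step, row i of C_{t−1} and row i of the gates could refer to different physical points, and the Hadamard product would be meaningless.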
We design a basic model and an advanced model based on the seq2seq framework. The basic model (shown in Figure 3(a)) is composed of two parts: one for encoding the given point cloud sequence and the other for predicting. Specifically, an encoding PointLSTM watches the input point clouds one by one. After the last input, its states are used to initialize the states of a predicting PointLSTM. The predicting PointLSTM then takes the last input point cloud as input and begins to make predictions. Rather than directly generating point coordinates, we predict the displacements that will happen between the current step and the next step, which can be seen as scene flow [16, 9].
Like LSTM and its variants, we can stack multiple PointLSTM units to build a multi-layer structure for hierarchical prediction. However, a major drawback of this structure is that it is computationally intensive, especially for high-resolution point sets. To alleviate this problem, we propose an advanced model (shown in Figure 3(b)), which borrows two components from PointNet++: 1) sampling and grouping operations to down-sample points and their features, and 2) feature propagation layers to up-sample the representations associated with the intermediate points back to the original points. With this down-up-sampling structure, we can take advantage of hierarchical learning while decreasing the number of points to be processed in the middle layers.
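The down-sampling step borrowed from PointNet++ is commonly implemented with farthest point sampling; a greedy NumPy sketch (not the authors' code) of that selection:

```python
import numpy as np

def farthest_point_sampling(points, m, seed=0):
    """Greedy FPS: start from a random point, then repeatedly pick the
    point farthest from the already-chosen set. Returns m indices."""
    n = points.shape[0]
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(n))]
    # d[i] = squared distance from point i to the nearest chosen point.
    d = ((points - points[chosen[0]]) ** 2).sum(axis=1)
    for _ in range(m - 1):
        nxt = int(d.argmax())
        chosen.append(nxt)
        d = np.minimum(d, ((points - points[nxt]) ** 2).sum(axis=1))
    return np.array(chosen)

pts = np.random.default_rng(2).random((20, 3))
idx = farthest_point_sampling(pts, 4)
```

FPS covers the cloud more evenly than uniform random sampling, which is why hierarchical point networks prefer it for choosing the centroids of grouping regions.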
There are two strategies to train recurrent neural networks. The first uses the ground truth as input during decoding, which is known as teacher-forcing training. The second uses the prediction generated by the network as input (Figure 3(a)), the same as at test time, which is referred to as free-running training. When using teacher-forcing training, we find that the model quickly gets stuck in a bad local optimum, in which the predicted displacements tend to be zero for all inputs. Therefore, we adopt the free-running training strategy.
Since point clouds are unordered, point-to-point loss functions cannot be directly applied to compute the difference between the prediction and the ground truth. Loss functions should be invariant to the relative ordering of input points. In this paper, we adopt the Chamfer Distance (CD) and the Earth Mover's Distance (EMD). The CD between a predicted point cloud P̂ and its ground truth P is defined as follows,

d_CD(P̂, P) = Σ_{p̂∈P̂} min_{p∈P} ‖p̂ − p‖² + Σ_{p∈P} min_{p̂∈P̂} ‖p − p̂‖².

Basically, this loss function is a nearest-neighbour distance metric that bidirectionally measures the error between the two sets. Every point in P̂ is mapped to the nearest point in P, and vice versa. The EMD between P̂ and P is defined as follows,

d_EMD(P̂, P) = min_{φ: P̂→P} Σ_{p̂∈P̂} ‖p̂ − φ(p̂)‖,

where φ: P̂→P is a bijection. The EMD calculates a point-to-point mapping between the two point clouds. The overall loss is as follows,

ℒ = α · d_CD + β · d_EMD.    (4)
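Both losses can be sketched in NumPy. The brute-force EMD below enumerates bijections and is only viable for tiny clouds; it stands in for the approximate solvers used in practice:

```python
import numpy as np
from itertools import permutations

def chamfer_distance(P, Q):
    """Sum of nearest-neighbour squared distances in both directions."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # pairwise
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def earth_mover_distance(P, Q):
    """Exact EMD by brute force over bijections (tiny clouds only)."""
    d = np.sqrt(((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1))
    return min(d[range(len(P)), list(perm)].sum()
               for perm in permutations(range(len(Q))))

P = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
assert chamfer_distance(P, P) == 0.0
assert earth_mover_distance(P, P[::-1]) == 0.0  # order-invariant
```

Both functions depend only on the sets of points, not on row order, which is the property the loss requires.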
We conduct experiments on a synthetic moving MNIST point cloud dataset and two large-scale real-world datasets, i.e., Argoverse and nuScenes. Models are trained for 200k iterations using the Adam optimizer, with gradient clipping. We follow point cloud reconstruction [1, 32, 18] in adopting CD and EMD for quantitative evaluation. Implementation details of the basic and advanced models are listed in Table 2. Max-pooling is used for our models. For the two real-world datasets, because they contain too many points to be processed by the basic model, we only evaluate the advanced model. For evaluation, we randomly synthesize or select 5,000 sequences from the test subsets of these datasets.
To evaluate whether PointLSTM has the ability to model moving point clouds, we compare our methods with a "copy last input" baseline. The baseline does not make predictions; it simply copies the last input of the given sequence as its outputs. If our models outperform this baseline, the effectiveness of PointLSTM is demonstrated.
[Table 2: implementation details of the basic and advanced models on Moving MNIST Point Cloud, and of the advanced model on Argoverse & nuScenes.]
Experiments on the synthetic Moving MNIST point cloud dataset can provide some basic understanding of the behavior of the proposed PointLSTM. To synthesize moving MNIST digit sequences, we use a generation process similar to that described in prior work. Each synthetic sequence consists of 20 consecutive point clouds, 10 for input and 10 for prediction. Each point cloud contains one or two digits moving and bouncing inside an image. We remove pixels whose brightness values are less than 16. The locations of the remaining pixels are transformed to (x, y) coordinates, and the z-coordinate is set to 0 for all points. We randomly select 128 points for one digit and 256 points for two digits as inputs. Batch size is set to 32. The α and β in Eq. (4) are set to 1.0.
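A hedged sketch of this synthesis step; the image content, function name, and sampling policy are assumptions for illustration:

```python
import numpy as np

def image_to_point_cloud(img, n_points=128, threshold=16, seed=0):
    """Keep pixels with brightness >= threshold, map (row, col) to
    (x, y), set z = 0, and randomly sample n_points of them."""
    ys, xs = np.nonzero(img >= threshold)
    pts = np.stack([xs, ys, np.zeros_like(xs)], axis=1).astype(float)
    rng = np.random.default_rng(seed)
    # Sample with replacement only if there are too few lit pixels.
    idx = rng.choice(len(pts), size=n_points, replace=len(pts) < n_points)
    return pts[idx]

# A bright square stands in for a digit image.
img = np.zeros((28, 28))
img[5:12, 8:15] = 200.0
pts = image_to_point_cloud(img, n_points=128)
```

Since the source data is 2D, the resulting clouds are planar (z = 0), which is why the same pipeline later admits a voxelization back to images for the video-based baselines.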
[Table 3 (excerpt), copy-last-input baseline — one digit: CD 262.46, EMD 15.94; two digits: CD 140.14, EMD 15.18.]
Besides the copy-last-input baseline, we also compare our methods with two video prediction models, i.e., ConvLSTM and CubicLSTM. Essentially, the Moving MNIST point cloud dataset is 2D. We can first voxelize digit point clouds into images and then use video-based methods to process them. Specifically, a pixel in the voxelized image is set to 1 if there exists a point at that position; otherwise, the pixel is set to 0. In this way, a 2D point cloud sequence is converted to a video. For ConvLSTM, we adopt a three-layer architecture with a kernel size of 5. For CubicLSTM, we adopt a three-by-three-layer architecture; the spatial, temporal and output kernel sizes are set to 5, 1 and 5, respectively. For training, we use the binary cross-entropy loss to optimize each output pixel. For testing, because the number of output points is usually not consistent with that of the input points, we collect the points whose brightness values are in the top 128 (for one digit) or top 256 (for two digits) as the output point cloud.
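The voxelization described above might be sketched as follows (the canvas size is an assumption):

```python
import numpy as np

def voxelize_2d(points, size=64):
    """Rasterize a planar point cloud into a binary image: a pixel is 1
    iff at least one point falls into it."""
    img = np.zeros((size, size), dtype=np.uint8)
    xy = np.clip(np.round(points[:, :2]).astype(int), 0, size - 1)
    img[xy[:, 1], xy[:, 0]] = 1     # row = y, column = x
    return img

pc = np.array([[3.2, 2.7, 0.0], [10.9, 5.1, 0.0]])
img = voxelize_2d(pc)
```

Rasterization discards point multiplicity and sub-pixel positions, which is one reason sparse clouds suffer under voxelization-based baselines.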
Experimental results are listed in Table 3. The PointLSTM models outperform the baseline significantly, demonstrating the effectiveness of PointLSTM in modeling moving point clouds.
PointLSTMs also outperform the voxelization-based ConvLSTM and CubicLSTM. For example, the EMD of the advanced PointLSTM model on one moving digit is 1.78, outperforming CubicLSTM by 2.42. Moreover, because the predictions of ConvLSTM and CubicLSTM on one moving digit are much worse than those on two moving digits, the voxelization-based methods may not be sensitive to sparse point clouds. Beyond expectation, the advanced model, which aims to reduce the points processed in the middle layers and consume less computation, slightly outperforms the basic model.
Numbers of parameters and floating-point operations (FLOPs) of the models are listed in Table 4. ConvLSTM and CubicLSTM use more parameters and, especially, more FLOPs than PointLSTM. For example, the FLOPs of ConvLSTM reach 345.26 billion, while the advanced model on one moving digit uses only 3.18 billion. Usually, for PointLSTM, the basic model contains fewer parameters than the advanced model. However, since the parameters are replicated for each point and the basic model processes all points in each layer, the basic model consumes more computation than the advanced model.
Examples of moving MNIST point cloud prediction visualization are shown in Figure 4. The PointLSTM models are able to produce considerably accurate predictions for both appearance and motion. Compared to the basic model, the advanced model generates clearer point clouds in the two-digit experiments.
Argoverse and nuScenes are two large-scale autonomous driving datasets. The Argoverse data is collected by a fleet of autonomous vehicles in Pittsburgh (86km) and Miami (204km). The nuScenes data is recorded in Boston and Singapore, with 15h of driving data (242km travelled at an average of 16km/h). These datasets are collected by multiple sensors, including LiDAR, RADAR, camera, etc. In this paper, we only use the data from the LiDAR sensors, without any human-annotated supervision, for moving point cloud prediction. Details about Argoverse and nuScenes are listed in Table 5.
Predicting moving point clouds on real-world driving datasets is considerably challenging. The content of a long driving point cloud sequence may change dramatically. Because we cannot predict what is not provided in the given inputs, we only ask our models to make short-term predictions on the driving datasets. Each driving log is considered a continuous point cloud sequence. We randomly choose 10 successive point clouds from a driving log to train the models, with 5 for input and 5 for prediction. Since Argoverse and nuScenes are high-resolution, using all points requires considerable computation and running memory. Therefore, for each cloud, we only use the points whose coordinates lie within a fixed range. Then, we randomly choose 1,024 points from the remaining points as inputs and ground truths. Batch size is set to 4.
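This preprocessing (range cropping plus random sampling) can be sketched as follows; the bound value is a placeholder, since the actual range is unspecified above:

```python
import numpy as np

def crop_and_sample(points, bound=32.0, n=1024, seed=0):
    """Keep points whose x/y/z coordinates lie within [-bound, bound]
    (placeholder range), then randomly sample n of them."""
    mask = (np.abs(points) <= bound).all(axis=1)
    kept = points[mask]
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(kept), size=n, replace=len(kept) < n)
    return kept[idx]

# Synthetic stand-in for one LiDAR sweep.
rng = np.random.default_rng(3)
raw = rng.normal(scale=20.0, size=(5000, 3))
sampled = crop_and_sample(raw)
```

Cropping bounds the per-frame workload, and fixing the sample count gives every training frame the same tensor shape.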
[Table 6 (excerpt), copy-last-input baseline — Argoverse: CD 0.5812, EMD 1.0667; nuScenes: CD 0.0794, EMD 0.3961.]
Experimental results are listed in Table 6. The PointLSTM model outperforms the "copy last input" baseline, confirming the effectiveness of PointLSTM on real-world datasets. For example, the CD of PointLSTM on Argoverse is 0.2966, outperforming the baseline by 0.2846. Because the LiDARs in nuScenes emit many rings (made up of dense points) and most scenes in the cropped range are relatively static, PointLSTM does not improve the prediction significantly. This is also the reason that the accuracy of the baseline on nuScenes is much higher than on Argoverse.
Visualization examples of predicted scene flow and predicted point clouds are shown in Figure 5 and Figure 6, respectively. PointLSTM makes correct predictions on the real-world datasets. For the second example in Figure 5, when the autonomous driving car slows down and stops, the scene flow gradually vanishes. For the third example in Figure 6, PointLSTM correctly predicts that the pedestrian is moving to the left rear.
We propose PointRNN for moving point cloud processing. A variant of PointRNN, i.e., PointLSTM, is applied to moving point cloud prediction. Experimental results demonstrate the ability of PointLSTM to model point cloud sequences. With the development of point cloud research, PointRNN and its variants can be widely applied to other temporal applications, such as point-cloud-based action recognition.
HPLFlowNet: Hierarchical Permutohedral Lattice FlowNet for Scene Flow Estimation on Large-scale Point Clouds. In CVPR.
Relation-Shape Convolutional Neural Network for Point Cloud Analysis. In CVPR.
PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In CVPR.
Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In NeurIPS.