PointRNN: Point Recurrent Neural Network for Moving Point Cloud Processing

10/18/2019 ∙ by Hehe Fan, et al.

Point clouds are attracting more and more attention in the community. However, few works study dynamic point clouds. In this paper, we introduce a Point Recurrent Neural Network (PointRNN) unit for moving point cloud processing. To keep the spatial structure, rather than taking a sole one-dimensional vector x∈R^d like RNN as input, PointRNN takes points' coordinates P∈R^{n×3} and their features X∈R^{n×d} as inputs (n and d denote the number of points and feature dimensions, respectively). Accordingly, the state s∈R^{d'} in RNN is extended to (P, S∈R^{n×d'}) in PointRNN (d' denotes the number of state dimensions). Since point clouds are orderless, features and states of two adjacent time steps cannot be operated on directly. Therefore, PointRNN replaces the concatenation operation in RNN with a correlation operation, which aggregates inputs and states according to points' coordinates. To evaluate PointRNN, we apply one of its variants, i.e., Point Long Short-Term Memory (PointLSTM), to moving point cloud prediction, which aims to predict the future trajectories of points in a cloud given their history movements. Experimental results show that PointLSTM is able to produce correct predictions on both synthetic and real-world datasets, demonstrating its effectiveness in modeling point cloud sequences. The code has been released at https://github.com/hehefan/PointRNN.


1 Introduction

Most modern robot and self-driving car platforms rely on 3D point clouds for visual perception. In contrast to RGB images, point clouds provide accurate displacement measurements, which are generally unaffected by lighting conditions. Point clouds are therefore attracting more and more researchers in the community. However, most existing works focus on static point cloud analysis, e.g., classification and segmentation [21, 22, 14]; few works study dynamic point clouds. Intelligent systems need the ability to understand not only the static scenes around them but also the dynamic changes in the environment. In this paper, we propose a Point Recurrent Neural Network (PointRNN) for moving point cloud processing.

In the general setting, a point cloud is a set of points in 3D space. Usually, a point cloud is represented by its points' three coordinates P ∈ R^{n×3} and their additional features X ∈ R^{n×d} (if features are provided), where n and d denote the number of points and feature dimensions, respectively. Essentially, point clouds are unordered sets and invariant to permutations of their points. For example, the sets {p_1, p_2, p_3} and {p_3, p_1, p_2} represent the same point cloud. This irregular data format considerably increases the challenge of reasoning about point clouds, and makes many achievements of deep neural networks on image processing fail to transfer to point clouds.

Figure 1: Comparison between RNN and PointRNN. At time step t, RNN takes a vector x_t as input and updates its state from s_{t-1} to s_t. Usually, RNN uses its state as its output. PointRNN takes a matrix P_t (points' coordinates) and a matrix X_t (points' features) as inputs, and updates its states from (P_{t-1}, S_{t-1}) to (P_t, S_t). Similar to RNN, PointRNN uses its states as its outputs.

Recurrent neural networks are well-suited to processing time series data. Generally, the (vanilla) RNN looks at an input vector x_t at time step t, updates its state from s_{t-1} to s_t, and outputs s_t. The RNN can be formulated as s_t = rnn(x_t, s_{t-1}; W, b), where W and b denote the learned parameters. Since a moving point cloud sequence is a kind of time series, we can exploit RNN to model it. However, RNN has two severe limitations when processing point cloud sequences.

On one hand, RNN learns from one-dimensional vectors, in which the input, output and state are highly compact. It is difficult for a single vector to represent a point cloud. Although we can flatten P ∈ R^{n×3} and X ∈ R^{n×d} into a one-dimensional vector, such an operation heavily damages the data structure and increases the challenge for neural networks to understand point clouds. If we instead use a global feature to represent a point cloud, the local structure is lost. To overcome this problem, PointRNN takes the original (P, X) as inputs. Similarly, the one-dimensional state s ∈ R^{d'} in RNN is extended to a two-dimensional S ∈ R^{n×d'} in PointRNN, in which each row corresponds to a point. Besides, because S depends on the points' coordinates, P is added into the states and outputs of PointRNN. The PointRNN can thus be formulated as (P_t, S_t) = PointRNN((P_t, X_t), (P_{t-1}, S_{t-1})).

We illustrate an RNN unit and a PointRNN unit in Figure 1.

On the other hand, RNN aggregates the past information and the current input with a concatenation operation (Figure 2(a)). However, because point clouds are unordered, concatenation cannot be applied to point clouds directly. To alleviate this problem, we use a correlation operation to aggregate (P_t, X_t) and (P_{t-1}, S_{t-1}) according to the points' coordinates (Figure 2(b)). Specifically, for each point in P_t, PointRNN first searches its radius neighborhoods in P_{t-1} and samples a fixed number of neighbors from them. Second, the feature of the query point, the hidden states of the sampled neighbors, and the displacements from the query point to the sampled neighbors are concatenated separately, and then processed by a fully-connected (FC) layer. At last, the processed representations are reduced to a single representation by pooling.

PointRNN provides a fundamental component for point cloud sequence modelling. We evaluate PointRNN on moving point cloud prediction. Given the history movements of point clouds, the goal of this task is to predict the future trajectories of their points. Predicting how point clouds move in the future can help robots and self-driving cars plan their actions and make decisions. Besides, moving point cloud prediction has an innate advantage in that it does not require external supervision. To avoid the exploding and vanishing gradient problems, we adopt a variant of PointRNN, i.e., Point Long Short-Term Memory (PointLSTM). Based on PointLSTM, we propose a seq2seq architecture for moving point cloud prediction. Experimental results on a synthetic moving MNIST point cloud dataset and two large-scale autonomous driving datasets, Argoverse [3] and nuScenes [2], show that PointLSTM produces correct predictions, confirming its ability to model moving point clouds.

2 Related Work

Static Point Cloud Understanding. The dawn of point cloud research has boosted a number of applications in 3D space, such as object classification, object part segmentation and scene semantic segmentation [21, 22, 11, 13, 14, 28, 8, 26, 17, 31, 29, 27], reconstruction [6, 15, 32] and object detection [4, 33, 20, 23, 12, 19]. Most recent works aim to consume point sets directly, without transforming the coordinates into regular 3D voxel grids or collections of images. There are two main challenges for static point cloud processing. First, a point cloud is in essence a set of unordered points and invariant to permutations of its points, which necessitates certain symmetrizations in the computation. Second, different from images, point cloud data is irregular, so the convolution operations that capture local structures in images cannot be directly applied to this data format. Different from these existing works, we are dedicated to a new challenge, i.e., modeling the dynamics in point cloud sequences.

Figure 2: a) RNN aggregates the input and state by a concatenation operation, followed by an FC. b) PointRNN aggregates (P_t, X_t) and (P_{t-1}, S_{t-1}) according to the points' coordinates by a correlation operation. For example, for a query point in P_t, PointRNN first searches its radius neighbors (closer than a distance r) in P_{t-1} and gets three points. Then, PointRNN randomly samples a fixed number (e.g., two) of points from these neighbors. The feature of the query point in X_t, the hidden states of the sampled neighbors in S_{t-1}, and the displacements from the sampled neighbors to the query point are concatenated separately. Similar to RNN, the concatenated representations are processed by an FC. Because there are multiple sampled neighbors, pooling is used to reduce the dimension. Note that, compared with RNN, if we ignore the parameters caused by the displacements, PointRNN does not introduce additional parameters.

RNN Variants for Spatio-temporal Modeling. Because the conventional RNNs, e.g., Long Short-Term Memory (LSTM) [10] and Gated Recurrent Unit (GRU) [5], learn from one-dimensional vectors, where the representation is highly compact and the spatial structure is heavily damaged, a number of LSTM variants have been proposed to model spatio-temporal sequences. For example, convolutional LSTM (ConvLSTM) [24] modifies LSTM by taking three-dimensional tensors as input and replacing the FC with a convolution to capture spatial local structures. Based on ConvLSTM, Spatio-temporal LSTM (ST-LSTM) [30] extracts and memorizes spatial and temporal representations simultaneously by adding a spatio-temporal memory. Cubic LSTM (CubicLSTM) [7] extends ConvLSTM by splitting the states into temporal states and spatial states, which are generated by independent convolutions. However, these methods focus on 2D videos and cannot be directly applied to 3D point cloud sequences.

3 PointRNN

In this section, we first review the standard (vanilla) RNN and then describe the proposed PointRNN in detail.

The RNN is a class of deep neural networks that takes inputs along a temporal sequence, which allows it to exhibit temporal dynamic behavior. The RNN uses its internal state (memory) to process sequences of inputs. It relies on a concatenation operation to aggregate the past and the current, which is referred to as rnn in this paper,

s_t = rnn(x_t, s_{t-1}; W, b) = W · (x_t ⊕ s_{t-1}) + b,    (1)

where · denotes matrix multiplication and ⊕ denotes concatenation. W and b are the parameters of the RNN to be learned. Usually, this operation is implemented by an FC layer in deep neural networks (Figure 2(a)).
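Concretely, the rnn aggregation above is just an FC layer applied to the concatenation of the input and the previous state. A minimal PyTorch sketch (our own illustration, not the released implementation) is:

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    """Vanilla RNN aggregation of Eq. (1): s_t = W (x_t ⊕ s_{t-1}) + b."""

    def __init__(self, input_dim, state_dim):
        super().__init__()
        # A single FC layer over the concatenated input and previous state.
        self.fc = nn.Linear(input_dim + state_dim, state_dim)

    def forward(self, x_t, s_prev):
        # x_t: (batch, input_dim), s_prev: (batch, state_dim).
        # One FC over the concatenation; a nonlinearity is often added in practice.
        return self.fc(torch.cat([x_t, s_prev], dim=-1))
```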

The conventional RNN learns from one-dimensional vectors, which limits its application to moving point cloud sequences. If we flatten the coordinates of the points in a cloud into a vector, the data structure is heavily damaged, which increases the challenge for deep neural networks to understand point clouds. If we use a global feature (e.g., extracted by PointNet++ [22]) to represent a point cloud and then apply RNN, the local structure is highly compacted, and it is difficult to learn the dynamics of a moving point cloud sequence from such global representations. Therefore, to keep the spatial structure, we propose to take the coordinates P_t ∈ R^{n×3} and features X_t ∈ R^{n×d} of points as inputs, in which each row corresponds to a point. Accordingly, the state s ∈ R^{d'} in RNN is extended to S ∈ R^{n×d'} in PointRNN. The update of the state for PointRNN at the t-th time step is formulated in Eq. (2),

S_t = point-rnn((P_t, X_t), (P_{t-1}, S_{t-1}); W, b),    (2)

where S_t ∈ R^{n×d'}. We refer to this operation as point-rnn.

The goal of the point-rnn function is to aggregate the past and the current according to the coordinates of points. Specifically, given (P_t, X_t) and (P_{t-1}, S_{t-1}), point-rnn merges them according to the coordinates of the points in P_t. First, for each point in P_t, e.g., the i-th point p_t^i, point-rnn finds all points in P_{t-1} that are within a radius r of p_t^i, which can be seen as a neighborhood ball [22] in the underlying Euclidean space. In implementation, we sample a fixed number of neighbors from the neighborhood ball for computation. The radius neighbors potentially share the same geometry or motion information as the query point p_t^i. Second, the feature of the query point in X_t, the states of the sampled neighbors in S_{t-1}, and the displacements from the sampled neighbors to the query point are concatenated separately, and the concatenations are subsequently processed by a shared FC. Third, the processed concatenations are pooled to a single representation. The output of point-rnn, i.e., S_t, therefore contains the past and current information of each point in P_t (Figure 2(b)). Note that, compared with the rnn function in Eq. (1), if we ignore the parameters caused by the displacements, the point-rnn function does not introduce additional parameters.
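The following is a minimal, brute-force PyTorch sketch of this point-rnn aggregation as we read it from the description above. The class name PointRNNAgg, the k-nearest-neighbor stand-in for the radius search, and the max-pooling choice are ours; the released code relies on optimized GPU operators instead:

```python
import torch
import torch.nn as nn

class PointRNNAgg(nn.Module):
    """point-rnn aggregation: radius neighbors + shared FC + max-pooling (brute-force sketch)."""

    def __init__(self, feat_dim, state_dim, out_dim, radius, k):
        super().__init__()
        self.radius, self.k = radius, k
        # Shared FC applied to [query feature ; neighbor state ; displacement].
        self.fc = nn.Linear(feat_dim + state_dim + 3, out_dim)

    def forward(self, P_t, X_t, P_prev, S_prev):
        # P_t: (B, n, 3), X_t: (B, n, d), P_prev: (B, m, 3), S_prev: (B, m, d')
        B, n, _ = P_t.shape
        dist = torch.cdist(P_t, P_prev)                           # (B, n, m) pairwise distances
        knn_dist, idx = dist.topk(self.k, dim=-1, largest=False)  # k nearest previous points
        gather = lambda src: torch.gather(
            src.unsqueeze(1).expand(B, n, -1, -1), 2,
            idx.unsqueeze(-1).expand(-1, -1, -1, src.size(-1)))
        neigh_P, neigh_S = gather(P_prev), gather(S_prev)         # (B, n, k, 3), (B, n, k, d')
        disp = neigh_P - P_t.unsqueeze(2)                         # displacements w.r.t. the query point
        feat = X_t.unsqueeze(2).expand(-1, -1, self.k, -1)        # repeat query feature per neighbor
        h = self.fc(torch.cat([feat, neigh_S, disp], dim=-1))     # (B, n, k, out_dim)
        # Emulate the radius constraint: ignore sampled points outside the ball,
        # but always keep the nearest one so the max-pool stays well defined.
        mask = knn_dist > self.radius
        mask[..., 0] = False
        h = h.masked_fill(mask.unsqueeze(-1), float('-inf'))
        return h.max(dim=2).values                                # (B, n, out_dim)
```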

Since a point cloud is a set of points that is irregular and orderless, the same row of features or states between two time steps may represent different points. Without the coordinates of points, an independent and is meaningless. Therefore, the coordinates of points are integrated into states and outputs, i.e., .

Gate/State     LSTM                                                           PointLSTM
input gate     i_t = σ(rnn(x_t, h_{t-1}; W_i, b_i))                           i_t = σ(point-rnn((P_t, X_t), (P_{t-1}, H_{t-1}); W_i, b_i))
forget gate    f_t = σ(rnn(x_t, h_{t-1}; W_f, b_f))                           f_t = σ(point-rnn((P_t, X_t), (P_{t-1}, H_{t-1}); W_f, b_f))
output gate    o_t = σ(rnn(x_t, h_{t-1}; W_o, b_o))                           o_t = σ(point-rnn((P_t, X_t), (P_{t-1}, H_{t-1}); W_o, b_o))
cell state     c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(rnn(x_t, h_{t-1}; W_c, b_c))  C̃_{t-1} = point-rnn((P_t, X_t), (P_{t-1}, C_{t-1})),  C_t = f_t ⊙ C̃_{t-1} + i_t ⊙ tanh(point-rnn((P_t, X_t), (P_{t-1}, H_{t-1}); W_c, b_c))
hidden state   h_t = o_t ⊙ tanh(c_t)                                          H_t = o_t ⊙ tanh(C_t)
Table 1: Comparison between LSTM and PointLSTM.

PointRNN provides a prototype for using recurrent neural networks to process point cloud sequences. Each component in PointRNN is necessary. However, more effective functions may be designed to replace the point-rnn function used in this paper, which can be further studied in the future.

4 Moving Point Cloud Prediction

Moving point clouds provide a large amount of geometric information about scenes as well as profound dynamic changes in motion. Understanding scenes and anticipating motion in 3D space are fundamental abilities for robot and self-driving car platforms, and an intelligent system that can predict the future trajectories of points in a cloud has such abilities. In this paper, we apply PointRNN to moving point cloud prediction. Because this task does not require external human-annotated supervision, the model can be trained in an unsupervised (self-supervised) manner. Because the plain RNN suffers from the exploding and vanishing gradient problems, when applying PointRNN to moving point cloud prediction we use one of its variants, i.e., PointLSTM, which inherits from both LSTM [10] and PointRNN.

4.1 PointLSTM

An LSTM unit is composed of a cell state c_t, an input gate i_t, an output gate o_t and a forget gate f_t. The cell state acts as an accumulator of the sequence or temporal information over time. Specifically, the current input is integrated into c_t if the input gate is activated. Meanwhile, the past cell state c_{t-1} may be forgotten if the forget gate f_t turns on. Whether c_t is propagated to the hidden state h_t is controlled by the output gate o_t. The updates for LSTM are formulated in Table 1, where x_t ∈ R^d and i_t, f_t, o_t, c_t, h_t ∈ R^{d'}. The σ denotes the sigmoid function and ⊙ denotes the Hadamard product. The W and b terms are the parameters of LSTM to be learned. Similar to RNN, LSTM relies on the rnn function of Eq. (1) to aggregate the past and the current.

The comparison between PointLSTM and LSTM is shown in Table 1. Generally, the rnn functions in LSTM are replaced with point-rnn functions. Besides, the unit structure is also changed, i.e., input: (P_t, X_t), states: (P_{t-1}, C_{t-1}, H_{t-1}), and output: (P_t, H_t).

Another difference between the standard LSTM and our PointLSTM is that PointLSTM has an additional step, which transforms and shuffles the old cell state C_{t-1} according to the current input points P_t. Only after this step can we perform the Hadamard product between the transformed cell state C̃_{t-1} and the forget gate f_t.
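Putting the rows of Table 1 together, one PointLSTM step can be sketched as follows, reusing the PointRNNAgg module above. This is our reading of the gate equations and of the cell-state alignment step; the authors' released implementation may differ in details:

```python
class PointLSTMCell(nn.Module):
    """One PointLSTM step: every rnn aggregation in LSTM is replaced by PointRNNAgg."""

    def __init__(self, feat_dim, state_dim, radius, k):
        super().__init__()
        make = lambda: PointRNNAgg(feat_dim, state_dim, state_dim, radius, k)
        self.agg_i, self.agg_f, self.agg_o, self.agg_g = make(), make(), make(), make()
        # Extra aggregation that re-associates the old cell state with the new points.
        self.agg_c = make()

    def forward(self, P_t, X_t, P_prev, C_prev, H_prev):
        i = torch.sigmoid(self.agg_i(P_t, X_t, P_prev, H_prev))   # input gate
        f = torch.sigmoid(self.agg_f(P_t, X_t, P_prev, H_prev))   # forget gate
        o = torch.sigmoid(self.agg_o(P_t, X_t, P_prev, H_prev))   # output gate
        g = torch.tanh(self.agg_g(P_t, X_t, P_prev, H_prev))      # candidate cell input
        # Align C_{t-1} with the current coordinates P_t before element-wise gating.
        C_aligned = self.agg_c(P_t, X_t, P_prev, C_prev)
        C_t = f * C_aligned + i * g
        H_t = o * torch.tanh(C_t)
        return P_t, C_t, H_t
```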

4.2 Architecture

Figure 3: Architectures for moving point cloud prediction. a) Basic model (single PointLSTM layer). A PointLSTM encodes the given point cloud sequence into states, which are then used to initialize the states of a predicting PointLSTM. The predicted point cloud is obtained by adding the predicted displacements to the coordinates of the current point cloud. b) Predicting part of the advanced model (three PointLSTM layers). Sampling (S) and grouping (G) operations are used to down-sample points and group features, respectively. PointLSTM units aggregate the past and the current and output features for prediction. Feature propagation (FP) layers are used to propagate features from the subsampled points back to the original points. An FC regresses the features to the predicted displacements. This model reduces the number of points to be processed in the middle layers.

We design a basic model and an advanced model based on the seq2seq framework. The basic model (shown in Figure 3(a)) is composed of two parts: one encodes the given point cloud sequence and the other makes predictions. Specifically, an encoding PointLSTM watches the input point clouds one by one. After the last input, its states are used to initialize the states of a predicting PointLSTM. The predicting PointLSTM then takes the last input point cloud as input and begins to make predictions. Rather than directly generating point coordinates, we predict the displacements that occur between the current step and the next step, which can be seen as scene flow [16, 9].
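To make the data flow of the basic model concrete, here is a hedged sketch of the encode-then-predict rollout of Figure 3(a), with each predicted cloud fed back as the next input. The helper names (encoder, predictor, regress_fc), the zero-initialized states, and the zero feature matrices (which assume the feature dimension equals the state dimension) are our own simplifications:

```python
def predict_sequence(encoder, predictor, regress_fc, inputs, n_future, state_dim):
    """Free-running rollout of the basic model (illustrative names, not the released API).

    `encoder` and `predictor` are PointLSTMCell instances, `regress_fc` maps hidden
    features to per-point displacements, and `inputs` is a list of (B, n, 3) clouds.
    """
    B, n, _ = inputs[0].shape
    P_prev = inputs[0]
    C = torch.zeros(B, n, state_dim)          # zero-initialized states (a simplification)
    H = torch.zeros(B, n, state_dim)
    # Encoding: the encoder watches the input point clouds one by one.
    for P in inputs:
        P_prev, C, H = encoder(P, torch.zeros(B, n, state_dim), P_prev, C, H)
    # Predicting: the encoder's final states initialize the predictor, and each
    # predicted cloud is fed back as the next input (free running).
    P = inputs[-1]
    predictions = []
    for _ in range(n_future):
        P_prev, C, H = predictor(P, torch.zeros(B, n, state_dim), P_prev, C, H)
        P = P + regress_fc(H)                 # predict displacements, not raw coordinates
        predictions.append(P)
    return predictions
```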

Like LSTM and its variants, we can stack multiple PointLSTM units to build a multi-layer structure for hierarchical prediction. However, a major drawback of this structure is that it is computationally intensive, especially for high-resolution point sets. To alleviate this problem, we propose an advanced model (shown in Figure 3(b)), which borrows two components from PointNet++ [22]: 1) sampling and grouping operations, to down-sample points and group their features, and 2) feature propagation layers, to up-sample the representations associated with the intermediate points back to the original points. With this down-up-sampling structure, we can take advantage of hierarchical learning while decreasing the number of points to be processed in the middle layers.

4.3 Training

There are two strategies to train recurrent neural networks. The first uses the ground truth as input during decoding, which is known as teacher-forcing training. The second uses the prediction generated by the network itself as input (Figure 3(a)), the same as at test time, which is referred to as free-running training. When using teacher-forcing training, we find that the model quickly gets stuck in a bad local optimum, in which the predicted displacements tend to be zero for all inputs. Therefore, we adopt the free-running training strategy.

Since point clouds are unordered, point-to-point loss functions cannot be directly applied to compute the difference between the prediction and the ground truth. Loss functions should be invariant to the relative ordering of the input points. In this paper, we adopt the Chamfer Distance (CD) and the Earth Mover's Distance (EMD). The CD between the predicted point cloud P̂ and the ground-truth point cloud P is defined as follows,

d_CD(P̂, P) = Σ_{p̂ ∈ P̂} min_{p ∈ P} ||p̂ − p||² + Σ_{p ∈ P} min_{p̂ ∈ P̂} ||p − p̂||².    (3)

Basically, this loss function is a nearest-neighbour distance metric that bidirectionally measures the error between the two sets: every point in P̂ is mapped to its nearest point in P, and vice versa. The EMD between P̂ and P is defined as follows,

d_EMD(P̂, P) = min_{φ: P̂→P} Σ_{p̂ ∈ P̂} ||p̂ − φ(p̂)||,    (4)

where φ: P̂ → P is a bijection. The EMD computes a point-to-point mapping between the two point clouds. The overall loss is as follows,

L = α · d_CD(P̂, P) + β · d_EMD(P̂, P),    (5)

where the hyperparameters α and β balance the two terms.
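A small PyTorch/SciPy sketch of the two losses is given below. The mean reduction and the squared distances in the CD, and the exact Hungarian matching used for the EMD, are common conventions that we assume here; the paper's implementation may normalize differently, and an exact EMD is only practical for small point counts:

```python
import torch
from scipy.optimize import linear_sum_assignment

def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance between (B, n, 3) predicted and (B, m, 3) ground-truth clouds."""
    d = torch.cdist(pred, gt) ** 2                     # squared pairwise distances
    return (d.min(dim=2).values.mean(dim=1)            # every predicted point -> nearest ground-truth point
            + d.min(dim=1).values.mean(dim=1)).mean()  # and vice versa

def earth_mover_distance(pred, gt):
    """EMD via an exact one-to-one matching (Hungarian algorithm); assumes equal point counts."""
    losses = []
    for p, g in zip(pred, gt):
        with torch.no_grad():
            cost = torch.cdist(p, g).cpu().numpy()
        row, col = linear_sum_assignment(cost)          # optimal bijection between the two sets
        row, col = torch.as_tensor(row), torch.as_tensor(col)
        losses.append((p[row] - g[col]).norm(dim=-1).mean())  # differentiable along the matching
    return torch.stack(losses).mean()

# Overall training loss (Eq. (5)); alpha = beta = 1.0 in the moving MNIST experiments:
# loss = alpha * chamfer_distance(pred, gt) + beta * earth_mover_distance(pred, gt)
```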

Figure 4: Visualization of moving MNIST point cloud prediction. For ConvLSTM and CubicLSTM, we first convert the input point clouds to images and then convert the predicted videos back to point cloud sequences. The advanced PointLSTM model generates the best predictions.

5 Experiments

We conduct experiments on a synthetic moving MNIST point cloud dataset and two large-scale real-world datasets, i.e., Argoverse [3] and nuScenes [2]. Models are trained for 200k iterations using the Adam optimizer with a fixed learning rate, and gradients are clipped during training. Following point cloud reconstruction work [1, 32, 18], we adopt CD and EMD for quantitative evaluation. Implementation details of the basic and advanced models are listed in Table 2. Max-pooling is used in our models. For the two real-world datasets, because they contain too many points to be processed by the basic model, we only evaluate the advanced model. For evaluation, we randomly synthesize or select 5,000 sequences from the test subsets of these datasets.

To evaluate whether PointLSTM has the ability to model moving point clouds, we compare our methods with a “copy last input” baseline. This baseline does not make predictions; it simply copies the last input of the given sequence as every output. Outperforming this baseline demonstrates the effectiveness of PointLSTM.
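For reference, the baseline amounts to a one-line function:

```python
def copy_last_input(inputs, n_future):
    """'Copy last input' baseline: repeat the final observed point cloud as every prediction."""
    return [inputs[-1]] * n_future
```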

Cpnt   Moving MNIST (basic model)   Moving MNIST (advanced model)   Argoverse & nuScenes (advanced model)
S      –                            (1.0, 4, –)                     (0.5, 8, –)
PL     (4.0, 8, 64)                 (4.0, 12, 64)                   (1.0, 24, 128)
SG     –                            (2.0, 4, –)                     (1.0, 8, –)
PL     (8.0, 8, 128)                (8.0, 8, 128)                   (2.0, 16, 256)
SG     –                            (4.0, 4, –)                     (2.0, 8, –)
PL     (12.0, 8, 256)               (12.0, 4, 256)                  (4.0, 8, 512)
FP     –                            (–, –, 128)                     (–, –, 256)
FP     –                            (–, –, 128)                     (–, –, 256)
FP     –                            (–, –, 128)                     (–, –, 256)
FC     (–, –, 64)                   (–, –, 64)                      (–, –, 128)
FC     (–, –, 3)                    (–, –, 3)                       (–, –, 3)
Table 2: Architecture specs. Each component (cpnt) is described by (search radius, number of sampled neighbors, feature size). S: sampling, G: grouping, PL: PointLSTM, FP: feature propagation, FC: fully-connected layer.

5.1 Moving MNIST Point Cloud

Experiments on the synthetic moving MNIST point cloud dataset provide a basic understanding of the behavior of the proposed PointLSTM. To synthesize moving MNIST digit sequences, we use a generation process similar to that described in [25]. Each synthetic sequence consists of 20 consecutive point clouds, 10 for input and 10 for prediction. Each point cloud contains one or two digits moving and bouncing inside the image. We remove pixels whose brightness values are less than 16. The locations of the remaining pixels are transformed to (x, y) coordinates, and the z-coordinate is set to 0 for all points. We randomly select 128 points for one digit and 256 points for two digits as inputs, respectively. The batch size is set to 32. The α and β in Eq. (5) are set to 1.0.
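As a sketch, converting one synthetic frame into a point cloud as described above might look like the following (NumPy; pixel coordinates are used directly here, whereas the actual implementation may rescale them):

```python
import numpy as np

def frame_to_point_cloud(frame, n_points):
    """Convert one moving-MNIST frame (H x W, uint8) into an (n_points, 3) point cloud.

    Keeps pixels with brightness >= 16, uses the (x, y) pixel locations as coordinates,
    sets z = 0, and randomly samples a fixed number of points (128 for one digit,
    256 for two digits).
    """
    ys, xs = np.nonzero(frame >= 16)
    points = np.stack([xs, ys, np.zeros_like(xs)], axis=1).astype(np.float32)
    idx = np.random.choice(len(points), n_points, replace=len(points) < n_points)
    return points[idx]
```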


Model                          One digit            Two digits
                               CD       EMD         CD       EMD
Copy last input (baseline)     262.46   15.94       140.14   15.18
ConvLSTM [24]                  58.09    8.85        13.02    5.99
CubicLSTM [7]                  9.51     4.20        6.19     4.42
PointLSTM (ours), basic        1.40     2.00        7.28     4.82
PointLSTM (ours), advanced     1.16     1.78        5.18     4.21

Table 3: Prediction accuracy on moving MNIST point cloud.

Besides the copy-last-input baseline, we also compare our methods with two video prediction models, i.e., ConvLSTM [24] and CubicLSTM [7]. Essentially, the moving MNIST point cloud dataset is 2D, so we can first voxelize the digit point clouds into images and then use video-based methods to process them. Specifically, a pixel in the voxelized image is set to 1 if there exists a point at that position; otherwise, the pixel is set to 0. In this way, a 2D point cloud sequence is converted to a video. For ConvLSTM, we adopt a three-layer architecture with a kernel size of 5. For CubicLSTM, we adopt a three-by-three-layer architecture; the spatial, temporal and output kernel sizes are set to 5, 1 and 5, respectively. For training, we use the binary cross-entropy loss to optimize each output pixel. For testing, because the number of output points is usually not consistent with that of the input points, we collect the pixels whose brightness values are in the top 128 (for one digit) or top 256 (for two digits) as the output point cloud.
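A minimal sketch of this voxelization step (the image resolution used here is an assumption):

```python
import numpy as np

def points_to_image(points, size=64):
    """Voxelize a 2D digit point cloud (n, 3) into a binary (size, size) image for the
    ConvLSTM / CubicLSTM baselines."""
    img = np.zeros((size, size), dtype=np.float32)
    xs = np.clip(np.round(points[:, 0]).astype(int), 0, size - 1)
    ys = np.clip(np.round(points[:, 1]).astype(int), 0, size - 1)
    img[ys, xs] = 1.0          # a pixel is 1 if any point falls on it, 0 otherwise
    return img
```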


Model         ConvLSTM [24]   CubicLSTM [7]   PointLSTM basic (one / two)   PointLSTM advanced (one / two)
#params (M)   2.16            6.08            1.22                          1.30
FLOPs (B)     345.26          448.88          24.74 / 49.55                 3.18 / 6.37

Table 4: Numbers of parameters (#params, million) and floating-point operations (FLOPs, billion) of the models on moving MNIST point cloud; “one” and “two” refer to the one-digit and two-digit settings.

Experimental results are listed in Table 3. The PointLSTM models outperform the baseline significantly, demonstrating the effectiveness of PointLSTM to model moving point clouds.

Figure 5: Visualization of predicted scene flow (i.e., the predicted per-point displacements). Points whose flow magnitude is less than a threshold are removed. Colors in the point clouds and scene flows indicate points' heights and flow magnitudes, respectively. Top: a bird's-eye-view example. Bottom: a first-person-perspective example. When points become static, the scene flow vanishes.

The PointLSTM models also outperform the voxelization-based ConvLSTM and CubicLSTM. For example, the EMD of the advanced PointLSTM model on one moving digit is 1.78, outperforming CubicLSTM by 2.42. Moreover, because the predictions of ConvLSTM and CubicLSTM on one moving digit are much worse than those on two moving digits, the voxelization-based methods may not handle sparse point clouds well. Somewhat unexpectedly, the advanced model, which aims to reduce the number of points processed in the middle layers and consume less computation, slightly outperforms the basic model.

The numbers of parameters and floating-point operations (FLOPs) of the models are listed in Table 4. ConvLSTM and CubicLSTM use more parameters and, especially, more FLOPs than PointLSTM. For example, the FLOPs of ConvLSTM reach 345.26 billion, while the advanced model on one moving digit uses only 3.18 billion. For PointLSTM, the basic model contains fewer parameters than the advanced model. However, since the computation is repeated for every point and the basic model processes all points in each layer, the basic model consumes more computation than the advanced model.

Examples of moving MNIST point cloud prediction are visualized in Figure 4. The PointLSTM models are able to produce fairly accurate predictions, in terms of both appearance and motion. Compared to the basic model, the advanced model generates cleaner point clouds in the two-digit experiments.

5.2 Argoverse and nuScenes

Argoverse [3] and nuScenes [2] are two large-scale autonomous driving datasets. The Argoverse data is collected by a fleet of autonomous vehicles in Pittsburgh (86km) and Miami (204km). The nuScenes data is recorded in Boston and Singapore, with 15h of driving data (242km travelled at an average of 16km/h). These datasets are collected by multiple sensors, including LiDAR, RADAR, camera, etc. In this paper, we only use the data from LiDAR sensors, without any human-annotated supervision, for moving point cloud prediction. Details about Argoverse and nuScenes are listed in Table 5.

Predicting moving point clouds on real-world driving datasets is considerably challenging, since the content of a long driving point cloud sequence may change dramatically. Because we cannot predict what is not provided in the given inputs, we only ask our models to make short-term predictions on the driving datasets. Each driving log is treated as a continuous point cloud sequence. We randomly choose 10 successive point clouds from a driving log to train the models, with 5 for input and 5 for prediction. Since the Argoverse and nuScenes point clouds are high-resolution, using all points requires considerable computation and running memory. Therefore, for each cloud, we only use the points whose coordinates lie within a fixed range around the ego-vehicle, and then randomly choose 1,024 of the remaining points as inputs and ground truths. The batch size is set to 4.
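A hedged NumPy sketch of this preprocessing, with the crop extent left as a placeholder since the exact range is not reproduced above:

```python
import numpy as np

def preprocess_sweep(points, xy_range, n_points=1024):
    """Crop a LiDAR sweep (N, 3) to a square x/y range around the ego-vehicle and
    randomly sample a fixed number of points. The crop extent `xy_range` is a
    placeholder for the range used in the paper.
    """
    keep = (np.abs(points[:, 0]) <= xy_range) & (np.abs(points[:, 1]) <= xy_range)
    cropped = points[keep]
    idx = np.random.choice(len(cropped), n_points, replace=len(cropped) < n_points)
    return cropped[idx]
```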


Dataset         trainval             test               frequency   #pnts/pc   range
                #logs    #pcs        #logs   #pcs
Argoverse [3]   89       18,211      24      4,189      10 Hz       90,549     200 m
nuScenes [2]    68       297,737     15      52,423     20 Hz       34,722     70 m
Table 5: Details of the nuScenes and Argoverse datasets. #pcs: number of point clouds, #pnts/pc: average number of points per point cloud.

Model             #params   FLOPs     Argoverse            nuScenes
                                      CD        EMD        CD        EMD
Copy last input   –         –         0.5812    1.0667     0.0794    0.3961
PointLSTM         5.18 M    98.57 B   0.2966    0.8892     0.0624    0.3745

Table 6: Prediction accuracy on Argoverse and nuScenes.
Figure 6: Visualization of moving point cloud prediction on Argoverse and nuScenes. Colors indicate points’ heights. We mark the objects of interest with bounding boxes in the first input and last prediction.

Experimental results are listed in Table 6. The PointLSTM model outperforms the “copy last input” baseline, confirming the effectiveness of PointLSTM on real-world datasets. For example, the CD of PointLSTM on Argoverse is 0.2966, outperforming the baseline by 0.2846. Because the LiDARs in nuScenes emit many rings (made up of dense points) and most scenes within the considered range are relatively static, PointLSTM does not improve the prediction on nuScenes significantly. This is also the reason why the baseline performs much better on nuScenes than on Argoverse.

Visualization examples of the predicted scene flow and the predicted point clouds are shown in Figure 5 and Figure 6, respectively. PointLSTM makes correct predictions on the real-world datasets. In the second example in Figure 5, when the autonomous driving car slows down and stops, the scene flow gradually vanishes. In the third example in Figure 6, PointLSTM correctly predicts that the pedestrian is moving to the left rear.

6 Conclusion

We propose PointRNN for moving point cloud processing. A variant of PointRNN, i.e., PointLSTM, is applied to moving point cloud prediction. Experimental results demonstrate the ability of PointLSTM to model point cloud sequences. With the development of point cloud research, PointRNN and its variants can be widely applied to other temporal applications, such as point-cloud-based action recognition.

References

  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas (2018) Learning representations and generative models for 3d point clouds. In ICML, Cited by: §5.
  • [2] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027. Cited by: §1, §5.2, Table 5, §5.
  • [3] M. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan, and J. Hays (2019) Argoverse: 3d tracking and forecasting with rich maps. In CVPR, Cited by: §1, §5.2, Table 5, §5.
  • [4] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In CVPR, Cited by: §2.
  • [5] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, Cited by: §2.
  • [6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In CVPR, Cited by: §2.
  • [7] H. Fan, L. Zhu, and Y. Yang (2019) Cubic lstms for video prediction. In AAAI, Cited by: §2, §5.1, Table 3, Table 4.
  • [8] B. Graham, M. Engelcke, and L. van der Maaten (2018) 3D semantic segmentation with submanifold sparse convolutional networks. In CVPR, Cited by: §2.
  • [9] X. Gu, Y. Wang, C. Wu, Y. J. Lee, and P. Wang (2019) HPLFlowNet: hierarchical permutohedral lattice flownet for scene flow estimation on large-scale point clouds. In CVPR, Cited by: §4.2.
  • [10] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §2, §4.
  • [11] R. Klokov and V. S. Lempitsky (2017) Escape from cells: deep kd-networks for the recognition of 3d point cloud models. In ICCV, Cited by: §2.
  • [12] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In CVPR, Cited by: §2.
  • [13] J. Li, B. M. Chen, and G. H. Lee (2018) SO-net: self-organizing network for point cloud analysis. In CVPR, Cited by: §2.
  • [14] Y. Li, R. Bu, M. Sun, W. Wu, X. Di, and B. Chen (2018) PointCNN: convolution on x-transformed points. In NeurIPS, Cited by: §1, §2.
  • [15] C. Lin, C. Kong, and S. Lucey (2018) Learning efficient point cloud generation for dense 3d object reconstruction. In AAAI, Cited by: §2.
  • [16] X. Liu, C. R. Qi, and L. J. Guibas (2019) FlowNet3D: learning scene flow in 3d point clouds. In CVPR, Cited by: §4.2.
  • [17] Y. Liu, B. Fan, S. Xiang, and C. Pan (2019) Relation-shape convolutional neural network for point cloud analysis. In CVPR, Cited by: §2.
  • [18] P. Mandikal and V. B. Radhakrishnan (2019) Dense 3d point cloud reconstruction using a deep pyramid network. In WACV, Cited by: §5.
  • [19] C. R. Qi, O. Litany, K. He, and L. J. Guibas (2019) Deep hough voting for 3d object detection in point clouds. In ICCV, Cited by: §2.
  • [20] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from RGB-D data. In CVPR, Cited by: §2.
  • [21] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In CVPR, Cited by: §1, §2.
  • [22] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. In NeurIPS, Cited by: §1, §2, §3, §3, §4.2.
  • [23] S. Shi, X. Wang, and H. Li (2019) PointRCNN: 3d object proposal generation and detection from point cloud. In CVPR, Cited by: §2.
  • [24] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In NeurIPS, Cited by: §2, §5.1, Table 3, Table 4.
  • [25] N. Srivastava, E. Mansimov, and R. Salakhutdinov (2015) Unsupervised learning of video representations using lstms. In ICML, Cited by: §5.1.
  • [26] H. Su, V. Jampani, D. Sun, S. Maji, E. Kalogerakis, M. Yang, and J. Kautz (2018) SPLATNet: sparse lattice networks for point cloud processing. In CVPR, Cited by: §2.
  • [27] H. Thomas, C. R. Qi, J. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas (2019) KPConv: flexible and deformable convolution for point clouds. In ICCV, Cited by: §2.
  • [28] W. Wang, R. Yu, Q. Huang, and U. Neumann (2018) SGPN: similarity group proposal network for 3d point cloud instance segmentation. In CVPR, Cited by: §2.
  • [29] X. Wang, S. Liu, X. Shen, C. Shen, and J. Jia (2019) Associatively segmenting instances and semantics in point clouds. In CVPR, Cited by: §2.
  • [30] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu (2017) PredRNN: recurrent neural networks for predictive learning using spatiotemporal lstms. In NeurIPS, Cited by: §2.
  • [31] W. Wu, Z. Qi, and F. Li (2019) PointConv: deep convolutional networks on 3d point clouds. In CVPR, Cited by: §2.
  • [32] L. Yu, X. Li, C. Fu, D. Cohen-Or, and P. Heng (2018) PU-net: point cloud upsampling network. In CVPR, Cited by: §2, §5.
  • [33] Y. Zhou and O. Tuzel (2018) VoxelNet: end-to-end learning for point cloud based 3d object detection. In CVPR, Cited by: §2.