LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention

04/03/2020 ∙ by Junbo Yin, et al.

Existing LiDAR-based 3D object detectors usually focus on single-frame detection, ignoring the spatiotemporal information in consecutive point cloud frames. In this paper, we propose an end-to-end online 3D video object detector that operates on point cloud sequences. The proposed model comprises a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former, a novel Pillar Message Passing Network (PMPNet) is proposed to encode each discrete point cloud frame. It adaptively collects information for a pillar node from its neighbors by iterative message passing, which effectively enlarges the receptive field of the pillar feature. In the latter, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) to aggregate the spatiotemporal information, which enhances the conventional ConvGRU with an attentive memory gating mechanism. AST-GRU contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module, which emphasize the foreground objects and align the dynamic objects, respectively. Experimental results demonstrate that the proposed 3D video object detector achieves state-of-the-art performance on the large-scale nuScenes benchmark.




1 Introduction

LiDAR-based 3D object detection plays a critical role in a wide range of applications, such as autonomous driving, robot navigation and virtual/augmented reality [11, 46]. The majority of current 3D object detection approaches [42, 58, 6, 62, 24] follow the single-frame detection paradigm, while few of them perform detection in the point cloud video. A point cloud video is defined as a temporal sequence of point cloud frames. For instance, in the nuScenes dataset [4], tens of point cloud frames can be captured per second with a modern 32-beam LiDAR sensor. Detection on a single frame may suffer from several limitations due to the sparse nature of point clouds. In particular, occlusions, long-range sparsity and non-uniform sampling inevitably occur in a given frame, and a single-frame object detector is incapable of handling these situations, leading to deteriorated performance, as shown in Fig. 1. A point cloud video, however, contains rich spatiotemporal information about the foreground objects, which can be exploited to improve detection performance. The key question in constructing a 3D video object detector is how to model the spatial and temporal feature representations of consecutive point cloud frames. In this work, we propose to integrate a graph-based spatial feature encoding component with an attention-aware spatiotemporal feature aggregation component to capture the video coherence in consecutive point cloud frames, yielding an end-to-end online solution for LiDAR-based 3D video object detection.

Popular single-frame 3D object detectors tend to first discretize the point cloud into voxel or pillar grids [62, 56, 24], and then extract the point cloud features using stacks of convolutional neural networks (CNNs). Such approaches build on the success of existing 2D or 3D CNNs and usually gain better computational efficiency compared with the point-based methods [42, 37]. Therefore, in our spatial feature encoding component, we also follow this paradigm to extract features for each input frame. However, a potential problem with these approaches is that they only compute a locally aggregated feature, i.e., they employ a PointNet [39] to extract features for separate voxels or pillars as in [62] and [24]. To further enlarge the receptive field, they have to apply strided or pooling operations repeatedly, which causes a loss of spatial information. To alleviate this issue, we propose a novel graph-based network, named Pillar Message Passing Network (PMPNet), which treats each non-empty pillar as a graph node and adaptively enlarges the receptive field of a node by aggregating messages from its neighbors. PMPNet can mine the rich geometric relations among different pillar grids in a discretized point cloud frame by iteratively reasoning on a k-NN graph. This effectively encourages information exchange among different spatial regions within a frame.

After obtaining the spatial features of each input frame, we assemble these features in our spatiotemporal feature aggregation component. Since ConvGRU [1] has shown promising performance in the 2D video understanding field, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) that extends ConvGRU to the 3D field by capturing dependencies of consecutive point cloud frames with an attentive memory gating mechanism. Specifically, there exist two potential limitations when considering LiDAR-based 3D video object detection in autonomous driving scenarios. First, in the bird's eye view, most foreground objects (e.g., cars and pedestrians) occupy small regions, so background noise is inevitably accumulated when computing the new memory in a recurrent unit. We therefore propose a Spatial Transformer Attention (STA) module, an intra-attention derived from [48, 53], to suppress the background noise and emphasize the foreground objects by attending to each pixel with its context information. Second, when updating the memory in the recurrent unit, the spatial features of the two inputs (i.e., the old memory and the new input) are not well aligned. In particular, though we can accurately align the static objects across frames using the ego-pose information, the dynamic objects with large motion are not aligned, which impairs the quality of the new memory. To address this, we propose a Temporal Transformer Attention (TTA) module that adaptively captures the object motions in consecutive frames with a temporal inter-attention mechanism, implemented with modified deformable convolutional layers [65, 64]. Compared with the vanilla ConvGRU, our AST-GRU can better handle the spatiotemporal features and produce a more reliable new memory. To summarize, we propose a new LiDAR-based online 3D video object detector that leverages previous long-term information to improve detection performance. In our model, a novel PMPNet is introduced to adaptively enlarge the receptive field of the pillar nodes in a discretized point cloud frame by iterative graph-based message passing. The output sequential features are then aggregated in the proposed AST-GRU to mine the rich coherence in the point cloud video with an attentive memory gating mechanism. Extensive evaluations demonstrate that our 3D video object detector achieves better performance than the single-frame detectors on the large-scale nuScenes benchmark.

Figure 2: Our online 3D video object detection framework includes a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel PMPNet (§3.1) is proposed to extract the spatial features of each point cloud frame. Then, features from consecutive frames are sent to the AST-GRU (§3.2) in the latter component, to aggregate the spatiotemporal information with an attentive memory gating mechanism.

2 Related Work

LiDAR-based 3D Object Detection. Existing works on 3D object detection can be roughly categorized into three groups: LiDAR-based [42, 58, 62, 24, 61, 56], image-based [22, 54, 26, 34, 25] and multi-sensor fusion-based [5, 29, 30, 21, 38] methods. Here, we focus on the LiDAR-based approaches since they are less sensitive to different illumination and weather conditions. Among them, one category [62, 57, 24] typically discretizes the point cloud into regular grids (e.g., voxels or pillars), and then exploits 2D or 3D CNNs for feature extraction. Another category [42, 58, 6] learns 3D representations directly from the raw point cloud with a point-wise feature extractor like PointNet++ [39]. It is usually impractical to directly apply the point-based detectors in scenes with large-scale point clouds, because they tend to perform feature extraction for every single point. For instance, a keyframe in the nuScenes dataset [4] contains around 300,000 points, densified by 10 non-keyframe LiDAR sweeps. Operating on point clouds at such a scale leads to non-trivial computation cost and memory demand. In contrast, the voxel-based methods can handle this difficulty since they are less sensitive to the number of points. Zhou et al. [62] first apply end-to-end CNNs for voxel-based 3D object detection. They describe each voxel with a Voxel Feature Encoding (VFE) layer, and utilize cascaded 3D and 2D CNNs to extract the deep features. Then a Region Proposal Network (RPN) is employed to obtain the final detection results. After that, Lang et al. [24] further extend [62] by projecting the point clouds onto the bird's eye view and encoding each discretized grid (named a pillar) with a Pillar Feature Network (PFN).

Both the VFE layers and the PFN only take into account separate voxels or pillars when generating the grid-level representation, ignoring the information exchange across larger spatial regions. In contrast, our PMPNet encodes the pillar feature from a global perspective by graph-based message passing, and thus endows the representation with a non-local property. Besides, all these single-frame 3D object detectors process the point cloud data frame by frame, without exploring the temporal information. Though [33] applies a temporal 3D ConvNet to point cloud sequences, it suffers from a feature collapse issue when downsampling the features in the temporal domain. Moreover, it cannot deal with long-term sequences with multi-frame labels. Our AST-GRU instead captures the long-term temporal information with an attentive memory gating mechanism, which can fully mine the spatiotemporal coherence in the point cloud video.

Graph Neural Networks. Graph Neural Networks (GNNs) were first introduced by Gori et al. [13] to model the intrinsic relationships of graph-structured data, and Scarselli et al. [41] extended them to different types of graphs. Afterward, GNNs have been explored in two directions in terms of different message propagation strategies. The first group [28, 19, 60, 36, 40] uses a gating mechanism to let information propagate across the graph. For instance, Li et al. [28] leverage recurrent neural networks to describe the state of each graph node, and Gilmer et al. [12] generalize this line of work into a parameterized message passing framework. The second group [3, 15, 9, 17, 27] brings convolutional networks to the graph domain, named Graph Convolutional Neural Networks (GCNNs), which update node features via stacks of graph convolutional layers. GNNs have achieved promising results in many areas [9, 10, 51, 2, 52] due to the great expressive power of graphs. Our PMPNet belongs to the first group: it computes the pillar features with a gated message passing strategy, which is used to construct the spatial representation of each point cloud frame.

3 Model Architecture

In this section, we elaborate on our online 3D video object detection framework. As shown in Fig. 2, it consists of a spatial feature encoding component and a spatiotemporal feature aggregation component. Given an input sequence of T frames, we first convert the point cloud coordinates of the previous frames to the current frame using the GPS data, so as to eliminate the influence of the ego-motion and align the static objects across frames. Then, in the spatial feature encoding component, we extract features for each frame with the Pillar Message Passing Network (PMPNet) (§3.1) and a 2D backbone, producing the sequential features {X_t}. After that, these features are fed into the Attentive Spatiotemporal Transformer Gated Recurrent Unit (AST-GRU) (§3.2) in the spatiotemporal feature aggregation component, to generate the new memory features {H_t}. Finally, an RPN head is applied on {H_t} to give the final detection results {Y_t}. Some network architecture details are provided in §3.3.

Figure 3: Illustration of one iteration step of message propagation, where h_i^s is the state of node v_i at step s. In step s, the neighbors of v_i (within the gray dashed line) are pillars from the top car. After aggregating messages from these neighbors, the receptive field of v_i is enlarged in step s+1, indicating that relations with nodes from the bottom car are modeled.

3.1 Pillar Message Passing Network

Previous point cloud encoding layers (e.g., the VFE layers in [62] and the PFN in [24]) for voxel-based 3D object detection typically encode each voxel or pillar separately, which limits the expressive power of the grid-level representation due to the small receptive field of each local grid region. Our PMPNet instead explores the rich spatial relations among different grid regions by treating the non-empty pillar grids as graph nodes. This design effectively preserves the non-Euclidean geometric characteristics of the original point clouds and enhances the output pillar features with a non-local property.

Given an input point cloud frame, we first uniformly discretize it into a set of pillars, with each pillar uniquely associated with a spatial coordinate in the x-y plane as in [24]. Then, PMPNet maps the resultant pillars to a directed graph G = (V, E), where each node v_i ∈ V represents a non-empty pillar and each edge indicates a message passed between a pair of nodes. To reduce the computational overhead, we define G as a k-nearest neighbor (k-NN) graph, built in geometric space by comparing the centroid distances among different pillars.
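As a concrete illustration of this graph construction, the sketch below builds the neighbor lists of a k-NN graph from pillar centroids by brute force. The function name and the toy centroids are ours, not the paper's implementation, which would use an efficient spatial index:

```python
# Sketch: nodes are non-empty pillars; each node receives edges from its
# k nearest neighbors, measured by centroid distance in the x-y plane.

def knn_graph(centroids, k):
    """Return {i: [j1, ..., jk]} mapping each pillar to its k nearest
    neighbors (excluding itself) by squared Euclidean distance."""
    neighbors = {}
    for i, (xi, yi) in enumerate(centroids):
        dists = []
        for j, (xj, yj) in enumerate(centroids):
            if j != i:
                dists.append(((xj - xi) ** 2 + (yj - yi) ** 2, j))
        dists.sort()
        neighbors[i] = [j for _, j in dists[:k]]
    return neighbors

# Four pillar centroids on a line; with k=2 each node connects to the
# two closest pillars.
graph = knn_graph([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)], k=2)
```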

To explicitly mine the rich relations among different pillar nodes, PMPNet performs iterative message passing on G and updates the node states at each iteration step. Concretely, given a node v_i, we first utilize a pillar feature network (PFN) [24] to describe its initial state h_i^0 at iteration step s = 0:

h_i^0 = F_PFN(P_i),    (1)
where h_i^0 is a C-dimensional vector and P_i denotes a pillar containing N LiDAR points, each parameterized by a D-dimensional representation (e.g., the XYZ coordinates and the received reflectance). The PFN is realized by applying fully connected layers to each point within the pillar, and then summarizing the features of all points through a channel-wise maximum operation. The initial node state is thus a locally aggregated feature, only encoding the points within a single pillar grid.

Next, we elaborate on the message passing process. One iteration step of message propagation is illustrated in Fig. 3. At step s, a node v_i aggregates information from all its neighbor nodes N(v_i) in the k-NN graph. We define the incoming edge feature from node v_j as e_{j,i}^s, indicating the relation between v_j and v_i. Inspired by [55], the incoming edge feature is given by:

e_{j,i}^s = h_j^s - h_i^s,    (2)
which is an asymmetric function encoding the local neighbor information. Accordingly, the message passed from v_j to v_i is denoted as:

m_{j,i}^{s+1} = f([h_i^s, e_{j,i}^s]),    (3)

where f(·) is parameterized by a fully connected layer, which takes as input the concatenation of h_i^s and e_{j,i}^s and yields the message feature.

After computing all the pair-wise relations between v_i and its neighbors, we summarize the received messages with an element-wise maximum operation:

m_i^{s+1} = max_{v_j ∈ N(v_i)} m_{j,i}^{s+1}.    (4)
Then, we update the state of node v_i with the aggregated message m_i^{s+1}. The update process should consider both the newly collected message m_i^{s+1} and the previous state h_i^s. Recurrent neural networks and their variants [16, 47] can adaptively capture dependencies across different time steps; hence, we utilize a Gated Recurrent Unit (GRU) [7] as the update function for its better convergence characteristics. The update process is formulated as:

h_i^{s+1} = GRU(h_i^s, m_i^{s+1}).    (5)
In this way, the new node state h_i^{s+1} contains the information from all the neighbor nodes of v_i. Moreover, each neighbor node v_j has also collected information from its own neighbors N(v_j). Consequently, after S total iteration steps, node v_i is able to aggregate information from its high-order neighbors. This effectively enlarges the perceptual range of each pillar grid and enables our model to better recognize objects from a global view.
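Putting Eqs. 2-5 together, one message-passing iteration can be sketched as below: each node gathers edge features from its neighbors, maps them to messages with a shared layer, max-aggregates them, and updates its state with a GRU cell. All sizes and weights here are illustrative:

```python
# Compact sketch of one PMPNet message-passing step over a k-NN graph.
import numpy as np

rng = np.random.default_rng(1)
C = 8                                   # node state dimension
Wm = rng.normal(size=(2 * C, C)) * 0.1  # shared message function f(.)
Wz, Uz = rng.normal(size=(C, C)) * 0.1, rng.normal(size=(C, C)) * 0.1
Wr, Ur = rng.normal(size=(C, C)) * 0.1, rng.normal(size=(C, C)) * 0.1
Wh, Uh = rng.normal(size=(C, C)) * 0.1, rng.normal(size=(C, C)) * 0.1
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def step(h, neighbors):
    """h: (num_nodes, C) states; neighbors: {i: [j, ...]} k-NN graph."""
    new_h = h.copy()
    for i, nbrs in neighbors.items():
        # edge feature [h_i, h_j - h_i] -> message, then max-aggregate
        msgs = [np.concatenate([h[i], h[j] - h[i]]) @ Wm for j in nbrs]
        m = np.max(msgs, axis=0)
        # GRU-style update of the node state with the aggregated message
        z = sigmoid(m @ Wz + h[i] @ Uz)
        r = sigmoid(m @ Wr + h[i] @ Ur)
        cand = np.tanh(m @ Wh + (r * h[i]) @ Uh)
        new_h[i] = (1.0 - z) * h[i] + z * cand
    return new_h

h = rng.normal(size=(4, C))
h1 = step(h, {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [2, 1]})
```

Stacking several such steps lets each node see its high-order neighbors, which is the receptive-field enlargement described above.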

Note that each pillar is associated with a spatial coordinate in the x-y plane. Therefore, after the iterative message passing, the encoded pillar features are scattered back into a 3D tensor X̃_t, which can be further exploited by 2D CNNs. Here, we leverage the backbone network in [62] to further extract features from X̃_t:

X_t = F_B(X̃_t),    (6)

where F_B denotes the backbone network and X_t is the spatial features of the frame. Details of the PMPNet and the backbone network can be found in §3.3.

3.2 Attentive Spatiotemporal Transformer GRU

Since the sequential features {X_t} produced by the spatial feature encoding component are regular tensors, we can employ ConvGRU [1] to fuse them in our spatiotemporal feature aggregation component. However, directly applying ConvGRU suffers from two limitations. On the one hand, the objects of interest are relatively small in the bird's eye view compared with those in 2D images (a car occupies only a small number of pixels at typical pillar resolutions). This may cause the background noise to dominate the results when computing the memory. On the other hand, though the static objects can be well aligned across frames using the GPS data, the dynamic objects with large motion still lead to an inaccurate new memory. To address these issues, we propose AST-GRU, which equips the vanilla ConvGRU [1] with a spatial transformer attention (STA) module and a temporal transformer attention (TTA) module. As illustrated in Fig. 4, the STA module stresses the foreground objects in the input X_t and produces the attentive new input X'_t, while the TTA module aligns the dynamic objects in the old memory H_{t-1} and the new input X'_t, and outputs the attentive old memory H'_{t-1}. Then, X'_t and H'_{t-1} are used to generate the new memory H_t, from which the final detections Y_t are produced. Before detailing the STA and TTA modules, we first review the vanilla ConvGRU.

Vanilla ConvGRU. The GRU model [7] operates on a sequence of inputs and adaptively captures dependencies across time steps with a memory mechanism. ConvGRU is a variant of the conventional GRU that employs convolution operations rather than fully connected ones, to reduce the number of parameters and preserve the spatial resolution of the input features. ConvGRU has achieved promising results on many tasks [32, 49, 23, 50], and converges faster than its LSTM [43] counterparts [8]. More specifically, ConvGRU contains an update gate z_t, a reset gate r_t, a candidate memory H̃_t and a new memory H_t. At each time step, the new memory (also called the hidden state) is computed from the old memory H_{t-1} and the new input X_t as:

z_t = σ(W_z * X_t + U_z * H_{t-1}),    (7)
r_t = σ(W_r * X_t + U_r * H_{t-1}),    (8)
H̃_t = tanh(W * X_t + U * (r_t ⊙ H_{t-1})),    (9)
H_t = (1 - z_t) ⊙ H_{t-1} + z_t ⊙ H̃_t,    (10)

where '*' and '⊙' denote the convolution operation and the Hadamard product, respectively, and σ is the sigmoid function. The W and U terms are 2D convolutional kernels. When computing the candidate memory H̃_t, the importance of the old memory H_{t-1} relative to the new input X_t is determined by the reset gate r_t, i.e., all the information in H̃_t comes from X_t when r_t = 0. Additionally, the update gate z_t decides the degree to which the unit accumulates the old memory H_{t-1} when yielding the new memory H_t. In §4.2, we show that the vanilla ConvGRU already outperforms simple point cloud merging [4] and the temporal 3D ConvNet [33]. Next, we present how we promote the vanilla ConvGRU with the STA and TTA modules.
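The gate equations above can be sketched compactly; to keep the example dependency-light, the convolutions are reduced to 1×1 (per-pixel linear maps), whereas the actual model uses spatial kernels. Shapes and weights are illustrative:

```python
# Minimal ConvGRU-style cell on (C, H, W) feature maps with 1x1 convs.
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 4, 6, 6
conv = lambda K, x: np.einsum('oc,chw->ohw', K, x)   # 1x1 convolution
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
Wz, Uz, Wr, Ur, Wc, Uc = (rng.normal(size=(C, C)) * 0.1 for _ in range(6))

def conv_gru(x, h):
    z = sigmoid(conv(Wz, x) + conv(Uz, h))           # update gate
    r = sigmoid(conv(Wr, x) + conv(Ur, h))           # reset gate
    cand = np.tanh(conv(Wc, x) + conv(Uc, r * h))    # candidate memory
    return (1.0 - z) * h + z * cand                  # new memory

h = np.zeros((C, H, W))
for x in rng.normal(size=(3, C, H, W)):              # a 3-frame sequence
    h = conv_gru(x, h)
```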

Figure 4: The detailed architecture of the proposed AST-GRU, which consists of a spatial transformer attention (STA) module and a temporal transformer attention (TTA) module. AST-GRU models the dependencies of consecutive frames and produces the attentive new memory H_t.

Spatial Transformer Attention. The core idea of the STA module is to attend to each pixel-level feature with rich spatial context, to better distinguish foreground objects from background noise. Basically, a transformer attention receives a query and a set of keys (e.g., the neighbors of the query), and calculates an attentive output. The STA is designed as an intra-attention, which means both the queries and keys come from the same input features X_t.

Formally, given a query x_i ∈ X_t at location i, the attentive output y_i is computed by:

y_i = Σ_j A(f_Q(x_i), f_K(x_j)) · f_V(x_j),    (11)

where A(·,·) is the attention weight, and f_Q, f_K and f_V are linear layers that map the inputs into different embedding subspaces. The attention weight is computed from the embedded query-key pair (f_Q(x_i), f_K(x_j)), and is then applied to the corresponding value f_V(x_j).

Since we need to obtain the attention for all query-key pairs, the linear layers f_Q, f_K and f_V are implemented as 1×1 convolutional layers to facilitate the computation. Specifically, the input features X_t are first embedded as Q, K and V through f_Q, f_K and f_V. Then, we reshape Q and K into C'×N matrices, where N = H×W, in order to compute the attention weight matrix:

A = softmax(Qᵀ K),    (12)

where the softmax layer normalizes each row of the attention weight matrix. After that, A is employed to aggregate information from the values V through a matrix multiplication, generating the attentive output Y, whose tensor shape is recovered to C'×H×W. Finally, we obtain the spatially enhanced features X'_t through a residual operation [14], which can be summarized as:

X'_t = f_out(Y) + X_t,    (13)

where f_out is the output layer of the attention head, mapping the embedding subspace (C'-dim) of Y back to the original space (C-dim). In this way, each feature in X'_t carries information from its spatial context and thus better focuses on the meaningful foreground objects.
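The STA computation (embeddings for query, key and value, a softmax-normalized attention matrix over all spatial positions, and a residual output projection) can be sketched as follows; all sizes and weights are illustrative:

```python
# Sketch of spatial self-attention over a (C, H, W) feature map.
import numpy as np

rng = np.random.default_rng(3)
C, Cp, H, W = 8, 4, 5, 5                 # input dims, embedding dims
Pq, Pk, Pv = (rng.normal(size=(Cp, C)) * 0.1 for _ in range(3))
Po = rng.normal(size=(C, Cp)) * 0.1      # output projection back to C dims

def sta(x):
    """x: (C, H, W) -> spatially attentive features of the same shape."""
    flat = x.reshape(C, H * W)                        # N = H*W positions
    q, k, v = Pq @ flat, Pk @ flat, Pv @ flat         # (C', N) each
    logits = q.T @ k                                  # (N, N) query-key scores
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # row-wise softmax
    out = v @ attn.T                                  # attend values, (C', N)
    return (Po @ out).reshape(C, H, W) + x            # residual connection

y = sta(rng.normal(size=(C, H, W)))
```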

Temporal Transformer Attention. To adaptively align the features of dynamic objects from H_{t-1} to X'_t, we apply modified deformable convolutional layers [65, 64] as a special instantiation of the transformer attention. The core idea is to attend to the queries in H_{t-1} with adaptive supporting key regions, computed by integrating the motion information.

Specifically, given a vanilla deformable convolutional layer with kernel size k×k, let w_n denote the learnable weight and p_n the predetermined offset of the n-th grid among the N = k×k kernel grids. The output y(p) for the input x at location p can be expressed as:

y(p) = Σ_{n=1}^{N} w_n · x(p + p_n + Δp_n),    (14)

where Δp_n is the deformation offset learnt through a separate regular convolutional layer applied to x, whose 2N output channels encode the x-y offsets of the N kernel grids. We can also reformulate Eq. 14 from the perspective of transformer attention as in Eq. 11, such that the attentive output for the query at location q is given by:

y(q) = Σ_{n=1}^{N} w_n · A(q, k_n) · x(k_n),    (15)

where x(·) acts as an identity value function, and the kernel weights w_n act as the weights of different attention heads [64], with each head corresponding to one sampled key position k_n. The attention weight A(q, k_n) is defined by a bilinear interpolation kernel G(·,·), i.e., A(q, k_n) = G(k_n, q + p_n + Δp_n), which is nonzero only at the integral positions closest to the fractional location q + p_n + Δp_n.
The supporting key regions play an important role in attending the queries, and they are determined by the deformation offset ΔP. In our TTA module, we compute ΔP not only from H_{t-1}, but also from a motion map, defined as the difference between H_{t-1} and X'_t:

ΔP = W_d * [H_{t-1}, H_{t-1} - X'_t],    (16)

where W_d is a regular convolutional layer with the same kernel size as that in the deformable convolutional layer, and [·,·] is the concatenation operation. The intuition is that, in the motion map, the feature response of the static objects is very low since they have been spatially aligned in H_{t-1} and X'_t, while the feature response of the dynamic objects remains high. Therefore, we integrate H_{t-1} with the motion map to further capture the motion of dynamic objects. Then, ΔP is used to select the supporting key regions and attend all the query regions according to Eq. 15, yielding the temporally attentive memory H'_{t-1}. Since the supporting key regions are computed from both H_{t-1} and X'_t, our TTA module can be deemed an inter-attention.
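The motion-map-based offset prediction described above can be sketched as follows; 1×1 weights are used for brevity (the actual layer shares the deformable layer's kernel size), and all names and sizes are illustrative:

```python
# Sketch: predict 2N x-y deformation offsets per position from the old
# memory and a motion map (old memory minus attentive new input).
import numpy as np

rng = np.random.default_rng(4)
C, Hh, Ww, N = 6, 4, 4, 9               # channels, spatial size, kernel points
Wd = rng.normal(size=(2 * N, 2 * C)) * 0.1   # offset-prediction conv weights

def predict_offsets(h_prev, x_new):
    """h_prev, x_new: (C, H, W) -> (2N, H, W) deformation offsets."""
    motion = h_prev - x_new                      # static objects respond ~0
    feats = np.concatenate([h_prev, motion], axis=0)   # (2C, H, W)
    return np.einsum('oc,chw->ohw', Wd, feats)   # per-position offsets

offsets = predict_offsets(rng.normal(size=(C, Hh, Ww)),
                          rng.normal(size=(C, Hh, Ww)))
```

The predicted offsets would then steer the deformable sampling of Eq. 15 toward the moved object features.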

Additionally, we can stack multiple modified deformable convolutional layers to obtain a more accurate H'_{t-1}. In our implementation, we adopt two layers: the second layer takes the output of the first as input to predict the deformation offsets according to Eq. 16, and these offsets are then used to attend H_{t-1} via Eq. 15. Accordingly, we can now utilize the temporally attentive memory H'_{t-1} and the spatially attentive input X'_t to compute the new memory H_t in the recurrent unit (see Fig. 4). Finally, an RPN detection head is applied on H_t to produce the final detection results Y_t.

3.3 Network Details

PMPNet. Our PMPNet is an end-to-end differentiable model, obtained by parameterizing all the functions with neural networks. Given a discretized point cloud frame with a set of pillar nodes, the PFN is first used to generate the initial state h_i^0 for every node (Eq. 1); it is realized by a shared convolutional layer followed by a max pooling layer operating over the points within each pillar. In each iteration step s, the edge features from the neighbor nodes are first collected (Eq. 2) with a concatenation operation. Then the message functions (Eq. 3 and Eq. 4) map the collected features to the aggregated message m_i^{s+1}, through a convolutional layer followed by a max pooling layer over the incoming messages. The update function (Eq. 5) then updates the node state using a GRU with fully connected layers, taking both h_i^s and m_i^{s+1} into account, and outputs h_i^{s+1}. After S iteration steps, we obtain the final node states and scatter them back into a 3D tensor X̃_t (Eq. 6).

   Method Car Pedestrian Bus Barrier T.C. Truck Trailer Moto. Cons. Bicycle Mean
VIPL_ICT [35] 71.9 57.0 34.1 38.0 27.3 20.6 26.9 20.4 3.3 0.0 29.9
MAIR [44] 47.8 37.0 18.8 51.1 48.7 22.0 17.6 29.0 7.4 24.5 30.4
PointPillars [24] 68.4 59.7 28.2 38.9 30.8 23.0 23.4 27.4 4.1 1.1 30.5
SARPNET [59] 59.9 69.4 19.4 38.3 44.6 18.7 18.0 29.8 11.6 14.2 32.4
WYSIWYG [18] 79.1 65.0 46.6 34.7 28.8 30.4 40.1 18.2 7.1 0.1 35.0
Tolist [35] 79.4 71.2 42.0 51.2 47.8 34.5 34.8 36.8 9.8 12.3 42.0
Ours 79.7 76.5 47.1 48.8 58.8 33.6 43.0 40.7 18.1 7.9 45.4
Table 1: Quantitative detection results on the nuScenes 3D object detection benchmark. T.C. denotes the traffic cone; Moto. and Cons. are short for motorcycle and construction vehicle, respectively. Our 3D video object detector outperforms the single-frame detectors, achieving state-of-the-art performance on the leaderboard.

Backbone Module. As in [62], we utilize a 2D backbone network to further extract features from X̃_t, which consists of three blocks of fully convolutional layers. Each block is characterized by its stride, its number of convolutional layers, and its number of output channels. The first layer of each block downsamples the feature map, while the other layers operate at stride 1. The output features of each block are resized to a common resolution via upsampling layers and then concatenated together to merge the semantic information from different feature levels.
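A toy sketch of this multi-scale pattern is given below, with stand-in blocks that halve the resolution and double the channels, then nearest-neighbor upsampling and channel concatenation. The real backbone's strides, kernels and channel counts differ:

```python
# Toy multi-scale backbone: downsample per block, upsample all block
# outputs to one resolution, concatenate along channels.
import numpy as np

def block(x):
    """Stand-in for a conv block: stride-2 downsample, channels doubled."""
    down = x[:, ::2, ::2]
    return np.concatenate([down, down], axis=0)

def upsample(x, factor):
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

x = np.ones((4, 16, 16))
b1 = block(x)            # (8, 8, 8)
b2 = block(b1)           # (16, 4, 4)
b3 = block(b2)           # (32, 2, 2)
merged = np.concatenate([upsample(b1, 2),     # all back to (., 16, 16)
                         upsample(b2, 4),
                         upsample(b3, 8)], axis=0)
```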

AST-GRU Module. In our STA module, all the linear functions in Eq. 11 and the output layer in Eq. 13 are implemented as 1×1 convolutional layers. In our TTA module, the regular convolutional layers, the deformable convolutional layers and the ConvGRU all use learnable kernels of the same spatial size.

Detection Head. The detection head in [62] is applied on the attentive memory features. In particular, the smooth L1 loss and the focal loss [31] account for the object bounding box regression and classification, respectively. A cross-entropy loss is used for the orientation classification. For the velocity regression required by the nuScenes benchmark, a simple L1 loss is adopted and proves sufficient in practice.

4 Experimental Results

3D Video Object Detection Benchmark. We evaluate our algorithm on the challenging nuScenes 3D object detection benchmark [4], since the KITTI benchmark [11] does not provide point cloud videos. nuScenes is a large-scale dataset with a total of 1,000 scenes, where 700 scenes (28,130 samples) are used for training and 150 scenes (6,008 samples) for testing, resulting in 7× as many annotations as KITTI. The samples (also called keyframes) in each video are annotated at a fixed interval with a full 360-degree view, and their point clouds are densified by the 10 non-keyframe LiDAR sweeps in between, yielding around 300,000 points per keyframe with a 5-dimensional representation that includes the reflectance and the time lag to the keyframe. Besides, nuScenes requires detecting 10 classes of objects with full 3D boxes, attributes and velocities.

Implementation Details. For each keyframe, we consider the point clouds within a fixed range along the X, Y and Z axes, and discretize the X-Y plane into pillars at a fixed resolution. The number of pillar nodes used in PMPNet is 16,384, sampled from roughly 25,000 non-empty pillars, with an upper bound on the number of points per pillar. Each input point is embedded from its raw representation into a higher-dimensional feature space over the graph iteration steps. The upsampling layers in the 2D backbone have kernel size 3 and 128 output channels each, and their concatenated output is the feature map fed to AST-GRU. We calculate anchors for the different classes using their mean sizes, and set the matching threshold according to the instance number of each class. The coefficients of the loss functions for classification, localization and velocity prediction are set to 1, 2 and 0.1, respectively. NMS with an IoU threshold of 0.5 is utilized when generating the final detections. In both the training and testing phases, we feed at most 3 consecutive keyframes to the model due to the memory limitation. The training procedure has two stages. In the first stage, we pre-train the spatial feature encoding component using the one-cycle policy with a maximum learning rate of 0.003. Then, we fix the learning rate to 0.0002 in the second stage to train the full model. We train 50 epochs for both stages with batch size 3. The Adam optimizer [20] is used to optimize the loss functions.
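For illustration, a minimal one-cycle-style schedule of the kind used in the first training stage might look like the following; the linear ramp and the warm-up fraction are our simplifying assumptions, not the exact policy:

```python
# Minimal one-cycle-style learning-rate schedule: ramp up linearly to the
# maximum learning rate, then ramp back down over the remaining steps.

def one_cycle_lr(step, total_steps, max_lr, warmup_frac=0.4):
    peak = int(total_steps * warmup_frac)
    if step <= peak:
        return max_lr * step / peak
    return max_lr * (total_steps - step) / (total_steps - peak)

# Schedule over 100 steps with the paper's maximum learning rate of 0.003.
lrs = [one_cycle_lr(s, 100, 0.003) for s in range(101)]
```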

4.1 Quantitative and Qualitative Performance

We present the performance comparison between our algorithm and other state-of-the-art approaches on the nuScenes benchmark in Table 1. PointPillars [24], SARPNET [59], WYSIWYG [18] and Tolist [35] are all voxel-based single-frame 3D object detectors. In particular, PointPillars is used as the baseline of our model. WYSIWYG is a recent algorithm that extends PointPillars with a voxelized visibility map. Tolist uses a multi-head network that contains multiple prediction heads for different classes. Our 3D video object detector outperforms these approaches by a large margin; in particular, we improve the official PointPillars results by about 15 points in mean AP. Note that there is a severe class imbalance issue in the nuScenes dataset, and the approach in [63] designs a class-balanced data augmentation algorithm for it; integrating such techniques could further improve our model. However, our focus here is on exploring the spatiotemporal coherence in the point cloud video, and handling the class imbalance issue is beyond the scope of this work. In addition, we show some qualitative results in Fig. 5. Besides the occlusion situation in Fig. 1, we present another case of detecting a distant car (top right), whose point clouds are especially sparse, which is very challenging for single-frame detectors. Again, our 3D video object detector effectively detects the distant car using the attentive temporal information.

(a) Detection results from the single-frame 3D object detector [24].
(b) Detection results from our 3D video object detector.
Figure 5: Detections for the distant cars. The grey and red boxes indicate the predictions and ground-truths, respectively.

4.2 Ablation Study

In this section, we investigate the effectiveness of each module in our algorithm. Since the training samples in nuScenes are 7× as many as those in KITTI (28,130 vs 3,712), it is non-trivial to train multiple models on the whole dataset. Hence, we use a mini train set for validation purposes, containing around 3,500 samples uniformly sampled from the original train set. Besides, PointPillars [24] is used as the baseline detector in our model.

First, we evaluate our PMPNet in the spatial feature encoding component by replacing the PFN in PointPillars with PMPNet. As shown in Table 2, it improves the baseline by 2.05%. Second, we validate each module in the spatiotemporal feature aggregation component (i.e., ConvGRU, STA-GRU and TTA-GRU) by adding it to PointPillars. All of these modules achieve better performance than the single-frame detector. Moreover, we compare AST-GRU with other video object detectors. Since each keyframe in nuScenes contains point clouds merged from the previous 10 non-keyframe sweeps, the PointPillars baseline, trained on the merged keyframes, can be deemed the simplest video object detector; our AST-GRU improves it by 5.98%. Then, we compare AST-GRU with the temporal 3D ConvNet-based method by implementing the late feature fusion module in [33]. The temporal 3D ConvNet can only access a single keyframe label during training, and aggregating more labels instead impairs the performance. According to Table 2, the 3D ConvNet-based method surpasses the PointPillars baseline by 1.82%, while our AST-GRU further outperforms it by 4.16%, which demonstrates the importance of the long-term temporal information. Finally, the full model with PMPNet achieves the best performance.

We further analyze the effect of the input sequence length. Since each keyframe contains a large number of points, which increases the memory demand, we conduct this experiment without using the point clouds in non-keyframe sweeps. Experimental results with different input lengths are shown in Table 3, which demonstrates that using longer-term temporal information consistently yields better performance in 3D object detection.

Components                 Modules             Performance
3D Object Detector         PointPillars (PP)   21.30    -
                           PP + PMPNet         23.35    +2.05
3D Video Object Detector   PP + 3D ConvNet     23.12    +1.82
                           PP + ConvGRU        23.83    +2.53
                           PP + STA-GRU        25.23    +3.93
                           PP + TTA-GRU        25.32    +4.02
                           PP + AST-GRU        27.28    +5.98
                           Full Model          29.35    +8.05
Table 2: Ablation study for our 3D video object detector. PointPillars [24] is the reference baseline for computing the relative improvement.
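The relative improvements in Table 2 are simply score differences against the PointPillars baseline. A minimal sketch (variable names are ours; the mAP values are those reported in the table):

```python
# mAP (%) on the mini validation set, from Table 2
results = {
    "PointPillars (PP)": 21.30,   # single-frame baseline
    "PP + PMPNet":       23.35,
    "PP + 3D ConvNet":   23.12,
    "PP + ConvGRU":      23.83,
    "PP + STA-GRU":      25.23,
    "PP + TTA-GRU":      25.32,
    "PP + AST-GRU":      27.28,
    "Full Model":        29.35,   # PMPNet + AST-GRU
}

baseline = results["PointPillars (PP)"]
# Relative improvement of each variant over the baseline
deltas = {name: round(score - baseline, 2) for name, score in results.items()}
```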
Aspect                       Modules   Performance
Input Lengths (Full Model)   T=1       16.84    -
                             T=2       19.34    +2.50
                             T=3       20.27    +3.43
                             T=4       20.77    +3.93
                             T=5       21.52    +4.68
Table 3: Ablation study for the input lengths. Detection results with one input frame (T=1) serve as the reference baseline.

5 Conclusion

This paper proposed a new 3D video object detector that explores the spatiotemporal information in point cloud video. It comprises two new components: a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former, we introduce a novel PMPNet that encodes the spatial features of each point cloud frame; it effectively enlarges the receptive field of each pillar grid by iteratively aggregating messages on a k-NN graph. In the latter, an AST-GRU module composed of STA and TTA mines the spatiotemporal coherence in consecutive frames through an attentive memory gating mechanism, where the STA emphasizes the foreground objects and the TTA aligns the dynamic objects. Extensive experiments on the nuScenes benchmark demonstrate the superior performance of our model.


  • [1] N. Ballas, L. Yao, C. Pal, and A. Courville (2016) Delving deeper into convolutional networks for learning video representations. In ICLR.
  • [2] P. Battaglia, R. Pascanu, M. Lai, D. J. Rezende, et al. (2016) Interaction networks for learning about objects, relations and physics. In NeurIPS.
  • [3] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun (2014) Spectral networks and locally connected networks on graphs. In ICLR.
  • [4] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom (2019) NuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.
  • [5] X. Chen, H. Ma, J. Wan, B. Li, and T. Xia (2017) Multi-view 3d object detection network for autonomous driving. In CVPR.
  • [6] Y. Chen, S. Liu, X. Shen, and J. Jia (2019) Fast point r-cnn. In ICCV.
  • [7] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing.
  • [8] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
  • [9] M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS.
  • [10] L. Fan, W. Wang, S. Huang, X. Tang, and S. Zhu (2019) Understanding human gaze communication by spatio-temporal graph reasoning. In ICCV.
  • [11] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.
  • [12] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl (2017) Neural message passing for quantum chemistry. In ICML.
  • [13] M. Gori, G. Monfardini, and F. Scarselli (2005) A new model for learning in graph domains. In IEEE International Joint Conference on Neural Networks.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • [15] M. Henaff, J. Bruna, and Y. LeCun (2015) Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163.
  • [16] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • [17] J. Hu, J. Shen, B. Yang, and L. Shao (2020) Infinitely wide graph convolutional networks: semi-supervised learning via Gaussian processes. arXiv preprint arXiv:2002.12168.
  • [18] P. Hu, J. Ziglar, D. Held, and D. Ramanan (2020) What you see is what you get: exploiting visibility for 3d object detection. In CVPR.
  • [19] S. Kearnes, K. McCloskey, M. Berndl, V. Pande, and P. Riley (2016) Molecular graph convolutions: moving beyond fingerprints. Journal of Computer-Aided Molecular Design 30 (8), pp. 595–608.
  • [20] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR.
  • [21] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. L. Waslander (2018) Joint 3d proposal generation and object detection from view aggregation. In IROS.
  • [22] J. Ku, A. D. Pon, and S. L. Waslander (2019) Monocular 3d object detection leveraging accurate proposals and shape reconstruction. In CVPR.
  • [23] Q. Lai, W. Wang, H. Sun, and J. Shen (2019) Video saliency prediction using spatiotemporal residual attentive networks. TIP 29, pp. 1113–1126.
  • [24] A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom (2019) PointPillars: fast encoders for object detection from point clouds. In CVPR.
  • [25] B. Li, W. Ouyang, L. Sheng, X. Zeng, and X. Wang (2019) GS3D: an efficient 3d object detection framework for autonomous driving. In CVPR.
  • [26] P. Li, X. Chen, and S. Shen (2019) Stereo r-cnn based 3d object detection for autonomous driving. In CVPR.
  • [27] T. Li, Z. Liang, S. Zhao, J. Gong, and J. Shen (2020) Self-learning with rectification strategy for human parsing. In CVPR.
  • [28] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. In ICLR.
  • [29] M. Liang, B. Yang, Y. Chen, R. Hu, and R. Urtasun (2019) Multi-task multi-sensor fusion for 3d object detection. In CVPR.
  • [30] M. Liang, B. Yang, S. Wang, and R. Urtasun (2018) Deep continuous fusion for multi-sensor 3d object detection. In ECCV.
  • [31] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV.
  • [32] M. Liu and M. Zhu (2018) Mobile video object detection with temporally-aware feature maps. In CVPR.
  • [33] W. Luo, B. Yang, and R. Urtasun (2018) Fast and furious: real time end-to-end 3d detection, tracking and motion forecasting with a single convolutional net. In CVPR.
  • [34] F. Manhardt, W. Kehl, and A. Gaidon (2019) Roi-10d: monocular lifting of 2d detection to 6d pose and metric shape. In CVPR.
  • [35] nuTonomy. NuScenes 3d object detection challenge.
  • [36] N. Peng, H. Poon, C. Quirk, K. Toutanova, and W. Yih (2017) Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association for Computational Linguistics 5, pp. 101–115.
  • [37] C. R. Qi, O. Litany, K. He, and L. J. Guibas (2019) Deep hough voting for 3d object detection in point clouds. In ICCV.
  • [38] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas (2018) Frustum pointnets for 3d object detection from rgb-d data. In CVPR.
  • [39] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2017) Pointnet: deep learning on point sets for 3d classification and segmentation. In CVPR.
  • [40] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV.
  • [41] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks.
  • [42] S. Shi, X. Wang, and H. Li (2019) Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR.
  • [43] X. Shi, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In NeurIPS.
  • [44] A. Simonelli, S. R. Bulo, L. Porzi, M. López-Antequera, and P. Kontschieder (2019) Disentangling monocular 3d object detection. In ICCV.
  • [45] L. N. Smith (2017) Cyclical learning rates for training neural networks. In WACV.
  • [46] X. Song, P. Wang, D. Zhou, R. Zhu, C. Guan, Y. Dai, H. Su, H. Li, and R. Yang (2019) Apollocar3d: a large 3d car instance understanding benchmark for autonomous driving. In CVPR.
  • [47] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In NeurIPS.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS.
  • [49] W. Wang, X. Lu, J. Shen, D. J. Crandall, and L. Shao (2019) Zero-shot video object segmentation via attentive graph neural networks. In ICCV.
  • [50] W. Wang, J. Shen, M. Cheng, and L. Shao (2019) An iterative and cooperative top-down and bottom-up inference network for salient object detection. In CVPR.
  • [51] W. Wang, Y. Xu, J. Shen, and S. Zhu (2018) Attentive fashion grammar network for fashion landmark detection and clothing category classification. In CVPR.
  • [52] W. Wang, H. Zhu, J. Dai, Y. Pang, J. Shen, and L. Shao (2020) Hierarchical human parsing with typed part-relation reasoning. In CVPR.
  • [53] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In CVPR.
  • [54] Y. Wang, W. Chao, D. Garg, B. Hariharan, M. Campbell, and K. Q. Weinberger (2019) Pseudo-lidar from visual depth estimation: bridging the gap in 3d object detection for autonomous driving. In CVPR.
  • [55] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon (2019) Dynamic graph cnn for learning on point clouds. ACM Transactions on Graphics 38 (5), pp. 146.
  • [56] Y. Yan, Y. Mao, and B. Li (2018) Second: sparsely embedded convolutional detection. Sensors 18 (10), pp. 3337.
  • [57] B. Yang, W. Luo, and R. Urtasun (2018) Pixor: real-time 3d object detection from point clouds. In CVPR.
  • [58] Z. Yang, Y. Sun, S. Liu, X. Shen, and J. Jia (2019) STD: sparse-to-dense 3d object detector for point cloud. In ICCV.
  • [59] Y. Ye, H. Chen, C. Zhang, X. Hao, and Z. Zhang (2020) SARPNET: shape attention regional proposal network for lidar-based 3d object detection. Neurocomputing 379, pp. 53–63.
  • [60] V. Zayats and M. Ostendorf (2018) Conversation modeling on reddit using a graph-structured LSTM. Transactions of the Association for Computational Linguistics 6, pp. 121–132.
  • [61] D. Zhou, J. Fang, X. Song, C. Guan, J. Yin, Y. Dai, and R. Yang (2019) IoU loss for 2d/3d object detection. In International Conference on 3D Vision.
  • [62] Y. Zhou and O. Tuzel (2018) Voxelnet: end-to-end learning for point cloud based 3d object detection. In CVPR.
  • [63] B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu (2019) Class-balanced grouping and sampling for point cloud 3d object detection. arXiv preprint arXiv:1908.09492.
  • [64] X. Zhu, D. Cheng, Z. Zhang, S. Lin, and J. Dai (2019) An empirical study of spatial attention mechanisms in deep networks. In ICCV.
  • [65] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In CVPR.