Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving

The strong demand for autonomous driving in industry has led to strong interest in 3D object detection and resulted in many excellent 3D object detection algorithms. However, the vast majority of algorithms only model single-frame data, ignoring the temporal information in the data sequence. In this work, we propose a new transformer, called Temporal-Channel Transformer, to model the spatial-temporal domain and channel domain relationships for video object detection from Lidar data. As a special design of this transformer, the information encoded in the encoder is different from that in the decoder, i.e., the encoder encodes temporal-channel information of multiple frames while the decoder decodes the spatial-channel information for the current frame in a voxel-wise manner. Specifically, the temporal-channel encoder of the transformer is designed to encode the information of different channels and frames by utilizing the correlation among features from different channels and frames. On the other hand, the spatial decoder of the transformer decodes the information for each location of the current frame. Before conducting object detection with the detection head, a gate mechanism is deployed to re-calibrate the features of the current frame, which filters out object-irrelevant information by repeatedly refining the representation of the target frame along with the up-sampling process. Experimental results show that we achieve state-of-the-art performance in grid voxel-based 3D object detection on the nuScenes benchmark.


1 Introduction

3D object detection on Lidar-based point clouds plays an important role in robotic perception and autonomous driving applications. Although natural image and video based object detection has witnessed great improvements in recent years, recognizing and locating 3D objects from point cloud data remains challenging due to the irregular and uneven distribution of data points. To handle this problem, current 3D object detection approaches either apply graph-based models directly to a raw point cloud frame [26, 7, 34] or transform 3D point cloud data into 2D pseudo-image feature maps via Voxel Feature Encoding (VFE) methods and then conduct detection on a single frame [37, 33, 19]. However, single-frame 3D object detection is still challenging (e.g., detecting moving objects) due to the sparse nature of point clouds.

Figure 1: Framework overview. The raw data of continuous point cloud frames is converted into 2D pseudo-images. After feature extraction, the proposed TCTR module is adopted to generate representations containing temporal-channel information, followed by the feature refinement for the detection head.

Utilizing contextual information from adjacent frames of a point cloud video to enhance the target frame is a promising direction for solving the prominent data sparsity issue. In particular, early works [16] in the area simply concatenate past point cloud frames with the current frame to make it denser, without considering inter-frame correlations explicitly. Subsequent works further take spatio-temporal correlations among consecutive frames into account based on the Recurrent Neural Network (RNN) or its variants, e.g., ConvGRU [1] and ConvLSTM [27]. All of these methods treat the information from all point cloud frames equally and integrate it to obtain the final representation of the target frame, which is not precise and introduces irrelevant information for two reasons. First, consecutive Lidar data are highly redundant due to the high sampling frequency (i.e., 20 point cloud frames per second with a 32-beam Lidar sensor [4]). Second, Lidar also records massive information about the surrounding environment rather than only the objects of interest. Thus, achieving a dense yet precise representation of the target frame based on the adjacent frames is an indispensable and critical problem for accurate Lidar-based 3D object detection.

In this work, we study 3D Lidar-based video object detection under the feature-based framework (as shown in Figure 1), which projects the raw point cloud data to pseudo frames (i.e., 2D feature maps) for computational efficiency. Our work operates on several consecutive pseudo frames and attempts to achieve a dense and precise representation of the target frame by integrating only relevant information from the adjacent frames.

To achieve this goal, we propose a Temporal-Channel Transformer module to reconstruct the target frame with both intra-frame and inter-frame relevant information in a fine-grained voxel-wise manner. The basic idea is to use the attention mechanism to find and integrate, for each voxel of the target frame, the useful information from correlated voxels in the target frame (intra-frame) and correlated frames in the input Lidar video (inter-frame). Instead of deploying shallow attention layers, we adapt the Transformer to 3D Lidar video data analysis by taking each channel of the compressed input frames as a Transformer encoder node and each voxel of the target frame as a Transformer decoder node, which makes it possible to exploit diverse and complex spatial, temporal, and channel correlations among all input frames. While the representation learned by the Temporal-Channel Transformer is dense and only integrates information relevant to the target frame, it still contains object-irrelevant information, as the sparse target frame does, which harms detection performance. We solve this problem by combining the dense representation and the sparse representation of the target frame into the final representation with a gating mechanism, which can control the information flow and consistently refine the final representation by removing object-irrelevant information.

To summarize, we propose a new Lidar-based 3D video object detection method that achieves a dense representation for the target frame with the help of adjacent frames. Our model has two main components: 1) a data enhancement module that integrates the most relevant information from the context frames into the target frame based on the transformer in a voxel-wise manner, and 2) a data refinement module that filters out object-irrelevant information based on the gate mechanism. Extensive experiments are performed on the large-scale nuScenes dataset [4] to validate the effectiveness of our method. Our multi-frame model achieves improvements of 7.4 mAP and 5.1 mAP over the single-frame baseline and the state-of-the-art multi-frame 3D object detection method, respectively.

2 Related Work

2.1 Single-frame Lidar Object Detection

3D Lidar-based single-frame object detection methods can be roughly divided into two groups: point-based [26, 7, 34] and grid-based [37, 33, 19]. Point-based approaches are inspired by [23, 24] and capture features directly from the raw sparse point cloud data. Although point-based approaches can lead to more accurate object detection, the high computational cost of finding neighboring points makes them difficult to use in real time. Grid-based approaches [37, 33, 19] convert the irregular point cloud data into a regular grid representation (e.g., voxels or pillars). Features are then extracted using 2D or 3D CNNs. The overall framework of these grid-based works is generally the same and consists of three parts. First, an encoder for feature encoding, which projects point clouds into sparse pseudo-images under the bird’s-eye view, such as the Voxel Feature Encoding (VFE) layer proposed in [37] and the Pillar Feature Network (PFN) layers proposed in [19]. Second, a feature representation module, which represents the pseudo-images with a well-designed network to capture complex spatial correlations in the point cloud frame. Third, a detection head with a Region Proposal Network (RPN) for generating the final 3D bounding boxes.

Considering the computational efficiency requirements of tasks using 3D object detection, we choose to transform the point cloud data into a representation of the grid pillars and then extract information from the grid pillars via 2D convolution.

2.2 Multi-frame Lidar Object Detection

There is a growing number of works that use contextual information for 3D object detection. Hu et al. [16] propose to fuse multiple frames into a single frame to expand the visibility region of the current frame. Ngiam et al. [21] use the detection results of the previous frame as prior knowledge to improve detection in the current frame. However, these works overlook the complex spatiotemporal dependencies among different frames, which have been demonstrated to be effective in natural 2D video object detection works. Recurrent neural networks and their variants (e.g., LSTM [29] and GatedRNN [8]) have achieved excellent results in sequence modeling and transduction problems, especially in the field of language modeling and machine translation. Applying recurrent structures to acquire temporal features has also driven the development of 3D video object detection [12, 18, 6, 32]. Yin et al. [36] propose modules that fuse graph-based spatial coding features and combine them with spatial-temporal attention awareness modules to capture video coherence. Huang et al. [17] use 3D sparse convolutional neural networks to extract features from point cloud data and feed them into an LSTM to produce hidden features and memorized features for continuous frame information transfer.

Instead of using RNN-based temporal information processing, our work is the first (to our knowledge) to investigate the Transformer for 3D Lidar-based video object detection. The Transformer is adopted because it is known to perform better than RNN-based methods in sequential tasks such as Natural Language Processing.


2.3 Transformer

The Transformer is built upon the attention mechanism in [30] to handle the sequence modeling problem by relating each node within a sequence to every other node. It has achieved promising performance in several tasks, such as machine translation, document generation, and named entity recognition.

The core component of the Transformer is the multi-head scaled dot-product attention module, which achieves feature aggregation among all nodes. Compared to RNNs or CNNs, multi-head attention is more powerful at capturing global inter-dependencies across long-range sequences.

Most recently, DETR [5] brings the transformer into the computer vision field. This work integrates the transformer into the 2D object detection framework and achieves higher accuracy than Faster RCNN [25]. DETR extracts features from the input image using a 2D CNN and then uses the Transformer to establish the correlation among features from different locations for refining spatial features, which are then used for predicting 2D detection results. Dosovitskiy et al. [10] also use the transformer model for the object classification task. They slice the input image into a grid of patches and flatten the patches through a linear projection matrix. Then, the transformer is used after adding the position embedding for feature aggregation. Finally, category recognition is performed through the classification head.

The encoder and decoder of transformers encode the same information in existing works for vision tasks. In contrast, the information captured and correlation built for encoder are different from that for decoder in our TCTR. Specifically, TCTR uses the encoder module for capturing temporal and channel information to model the relationship among features of different channels and frames, while the attention module in the decoder establishes the correlation between temporal-channel features of the encoder and spatial features for a better representation of the spatial features of the current frame.

Figure 2: Network structure. The input is a set of continuous point cloud frames, from which the backbone generates a feature set. The features of all frames form the input to the TC-encoder, and the features of the current frame form the input to the S-decoder. Our TCTR generates a feature with temporal-channel information, which is then up-sampled synchronously with the current-frame feature and used to refine it.

3 Method

In this section, we present the details of our method. We first briefly introduce some preliminaries about 3D Lidar-based video object detection and the pre-operations (e.g., mapping 3D Lidar data to 2D pseudo-images, the backbone network for basic data representation) that precede our proposed network. Then, our Temporal-Channel Transformer is described, which consists of a Temporal-Channel Encoder to explore channel-wise temporal correlations among all the input frames and a Spatial Decoder to enhance the target frame precisely with both intra-frame and inter-frame relevant information in a voxel-wise manner. Next, the feature refinement module is presented, which combines the dense representation from the Temporal-Channel Transformer with the original representation of the target frame through a gating mechanism to filter out object-irrelevant information. Finally, the detection head and training loss used in our work are discussed.

3.1 Preliminary

The raw Lidar data generated at each time step is a sparse and irregular point cloud frame, in which each point contains its 3D spatial location and its reflection magnitude. This work focuses on detecting objects of interest in the current frame given a set of continuous point cloud frames.

Instead of directly learning from raw point cloud data with a point-based framework, we choose the voxel-based framework to ensure computational efficiency. Specifically, we first project the 3D space into pillars under the bird's-eye view, and then extract features from the raw point cloud frame using PFN for each pillar. As a result, we obtain a 2D pseudo-image for each point cloud frame and convert the sequence of frames into a sequence of 2D pseudo-images.
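The pillar projection described above can be illustrated with a minimal NumPy sketch. This is not the PFN of [19]: the learned per-pillar network is replaced by a plain per-channel maximum over the points in each pillar, and the function name `pillarize`, its arguments, and the point layout (x, y, z, reflectance) are assumptions made for illustration only.

```python
import numpy as np

def pillarize(points, x_range, y_range, pillar_size):
    """Scatter points of shape (N, 4) into a bird's-eye-view pseudo-image.

    Assumed point layout: (x, y, z, reflectance).
    Simplification: a learned PFN is replaced by a per-channel maximum over
    the points falling into each pillar; empty pillars stay zero.
    Returns an array of shape (4, ny, nx).
    """
    nx = int(round((x_range[1] - x_range[0]) / pillar_size))
    ny = int(round((y_range[1] - y_range[0]) / pillar_size))
    image = np.zeros((points.shape[1], ny, nx), dtype=np.float32)
    filled = np.zeros((ny, nx), dtype=bool)

    # Pillar index of every point on the ground plane.
    ix = np.floor((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    iy = np.floor((points[:, 1] - y_range[0]) / pillar_size).astype(int)
    inside = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)

    for p, cx, cy in zip(points[inside], ix[inside], iy[inside]):
        if filled[cy, cx]:
            image[:, cy, cx] = np.maximum(image[:, cy, cx], p)
        else:
            image[:, cy, cx] = p
            filled[cy, cx] = True
    return image
```

Applying `pillarize` to every frame of a sequence yields one pseudo-image per frame; most pillars remain empty, which is the sparsity the method tries to compensate for.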

Similar to natural videos, a sequence of consecutive point cloud frames also contains massive repetitive scenes, so the pseudo-image sequence contains a lot of redundant information. Besides, the scales of the objects of interest in the 3D space differ greatly, e.g., pedestrians and construction vehicles, which makes it necessary to fuse features with multi-scale receptive fields. Thus, we collect CNN features with different resolutions from different layers of the backbone CNN. As the backbone network is not our focus, we simply use two convolutional layers followed by four resblocks [15], each of which is followed by a max-pooling layer for downsampling.

3.2 Temporal-channel Transformer

Based on these CNN features, our goal is to achieve a dense representation for the target frame by introducing relevant information from adjacent frames. Previous works on video object detection [11, 14, 3, 20] have demonstrated that exploring spatial and temporal correlations is useful for learning an accurate representation of the current frame and achieving better object detection performance. In our work, we further take channel-wise correlations into account, considering that different channels can learn different patterns from the input, and explore spatial, temporal, and channel dependencies.

We achieve this by designing a novel Transformer-based data enhancement module, which consists of a Temporal-Channel Encoder and a Spatial Decoder. Specifically, the Temporal-Channel Encoder is utilized to learn complex interactions among different frames based on the multi-head attention mechanism. Instead of viewing the whole frame as an entity, we regard each channel of a frame as a separate entity and feed all channels of the input sequence into the encoder, which can capture both intra-frame and inter-frame channel-wise dependencies. The Spatial Decoder is deployed to align the target frame with all channels of the transformed input sequence (i.e., the output of the encoder), which achieves fine-grained feature aggregation with the most relevant information from the adjacent frames of the video in a pixel-wise manner.

3.2.1 Temporal-channel Encoder

The TC-encoder adapts the structure designed in [30], which has been shown to be effective in establishing relationships between nodes (e.g., words), to our task. Given the features of one frame, we first obtain a multi-channel transformed feature by applying a convolution operation. We then flatten the transformed feature along the spatial dimensions for each channel to obtain stacked features. The features of all frames in the input sequence are obtained in the same way and stacked together. In the stacked features, we treat each channel as a node whose spatial dimensions are flattened. In addition, a positional encoding is combined with the stacked features to obtain a representation with positional information in the same way as [30]. Thus, the input to the multi-head attention block can be regarded as a representation sequence containing one node per channel of each frame. Each node represents a certain channel of one frame, and the feature of each node represents the spatial information among all voxels of the corresponding frame. We use the TC-encoder to exploit the correlation among channel features within frames, as well as between frames.
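As an illustration of how channels become encoder nodes, the following NumPy sketch flattens a stack of per-frame feature maps into one node per channel per frame and adds the sinusoidal positional encoding of [30]. The function names, tensor layout (frames, channels, height, width), and toy sizes are assumptions; the convolution that compresses the channels is omitted.

```python
import numpy as np

def frames_to_channel_nodes(features):
    """(num_frames, channels, height, width) -> (num_frames * channels, height * width).

    Each row is one node: a single channel of a single frame, with its
    spatial dimensions flattened into the node's feature vector.
    """
    n, c, h, w = features.shape
    return features.reshape(n * c, h * w)

def sinusoidal_positions(num_nodes, dim):
    """Sinusoidal positional encoding as in the original Transformer."""
    pos = np.arange(num_nodes)[:, None]          # (num_nodes, 1)
    i = np.arange(dim)[None, :]                  # (1, dim)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Toy input: 3 frames, 4 channels, 2 x 5 spatial grid (sizes are arbitrary).
features = np.arange(3 * 4 * 2 * 5, dtype=np.float64).reshape(3, 4, 2, 5)
nodes = frames_to_channel_nodes(features)                 # (12, 10)
encoder_input = nodes + sinusoidal_positions(*nodes.shape)
```

With this layout, node index `f * channels + c` corresponds to channel `c` of frame `f`, so self-attention over the rows mixes information across both channels and frames.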

The TC-encoder consists of a stack of identical encoding blocks. Each encoding block contains a multi-head self-attention layer and a Feed-Forward Network (FFN) layer. The basic building block of multi-head attention is formulated as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$

where $Q$ represents the queries, $K$ represents the keys, $V$ represents the values, and $\sqrt{d_k}$ is a scalar for normalization. The match between query and key pairs, $QK^{\top}$, decides the attention score (i.e., importance) for the values at the corresponding nodes. The softmax function is used to ensure that the importance values sum to 1. Instead of measuring the importance only once, multi-head attention conducts multiple scaled dot-product attentions in parallel to capture more comprehensive interactions among different nodes. The results of all attention heads are combined and projected once more with a linear transformation, leading to the final output:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i).$$

In our TC-encoder, $Q_i$, $K_i$, and $V_i$ are obtained from transformations of the position-encoded input sequence with parameters $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$, respectively. Thus, multi-head attention can capture both inter-channel and inter-frame correlations. $W^{O}$ denotes the parameters of the output projection.
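The multi-head scaled dot-product attention of [30] used in the encoder can be sketched in NumPy as follows. This is a generic implementation of the standard mechanism rather than the authors' code; the weight shapes, number of heads, and random inputs are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Return (output, attention weights) for queries q, keys k, values v."""
    d_k = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)           # each query's weights sum to 1
    return weights @ v, weights

def multi_head_attention(x_q, x_kv, w_q, w_k, w_v, w_o):
    """x_q: (n, d_q_in), x_kv: (m, d_kv_in).

    w_q: (heads, d_q_in, d_k); w_k, w_v: (heads, d_kv_in, d_k);
    w_o: (heads * d_k, d_out).
    """
    q = np.einsum('nd,hdk->hnk', x_q, w_q)
    k = np.einsum('md,hdk->hmk', x_kv, w_k)
    v = np.einsum('md,hdk->hmk', x_kv, w_v)
    heads, _ = scaled_dot_product_attention(q, k, v)     # (heads, n, d_k)
    concat = np.concatenate(list(heads), axis=-1)        # (n, heads * d_k)
    return concat @ w_o

# Self-attention over 12 channel nodes with 10-dimensional features (toy sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 10))
w_q, w_k, w_v = (rng.normal(size=(2, 10, 4)) for _ in range(3))
w_o = rng.normal(size=(8, 10))
out = multi_head_attention(x, x, w_q, w_k, w_v, w_o)     # (12, 10)
```

Passing the same sequence as `x_q` and `x_kv` gives self-attention, as in the encoder; passing two different sequences gives the cross-attention used later in the decoder.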

3.2.2 Spatial Decoder

The spatial decoder module focuses on enhancing the features of the current frame with both intra-frame and inter-frame relevant information.

For the intra-frame spatial correlation, we first apply a convolution operation to the features of the current frame to obtain transformed features. The transformed features are then reshaped into a matrix in which each voxel is a node and combined with the positional embedding. Here, we directly use the position encoding method of [22, 2], as it is widely adopted in existing works. Similar to the encoder, multi-head attention is also applied in the decoder. There are two types of attention in the decoder.

The first type is self-attention. We set the queries, keys, and values all as transformations of the position-encoded current-frame feature and calculate the correlation among different nodes. After this multi-head attention, each voxel of the current frame is enhanced with information from the most relevant voxels in the same frame.

After the self-attention operation, the enhanced feature is fed to the second type of multi-head attention block to capture inter-frame spatial correlations and enhance it with inter-frame information. Specifically, the formulation of Eq. (4) for the queries, keys, and values in the encoder is adapted in the second type of attention in the decoder (Eq. (5)): the query comes from the features of the current frame, and the keys and values come from the output of the encoder. To achieve more fine-grained spatial interactions, we measure the relationships between each voxel of the current frame and all frames in the input sequence at the channel level instead of the frame level, which is achieved by aligning the current-frame feature with the output of the Temporal-Channel Encoder module. As shown in Figure 2, we set the transformation of the self-attended current-frame feature as the query, and set the projections of the encoder output as the keys and values of the multi-head attention block. Through the attention mechanism, all channels in the input sequence are matched with each voxel of the current frame and contribute to the new representation of the voxel according to their correlations. Thus, the final output of the Spatial Decoder is the current-frame feature enhanced with both inter-frame and intra-frame information, considering spatial, temporal, and channel correlations.
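The second attention type of the decoder can be sketched as a single-head cross-attention in NumPy, where every voxel of the current frame queries all channel nodes produced by the encoder. The function name, projection matrices, and toy sizes below are assumptions for illustration; the actual module is multi-head and preceded by self-attention.

```python
import numpy as np

def voxel_to_channel_cross_attention(voxel_feats, channel_nodes, w_q, w_k, w_v):
    """Single-head cross-attention (simplified from the multi-head form).

    voxel_feats:   (num_voxels, c)          one decoder node per voxel
    channel_nodes: (num_frames * c, h * w)  encoder output, one node per channel
    w_q: (c, d), w_k and w_v: (h * w, d)    hypothetical projection matrices
    """
    q = voxel_feats @ w_q                    # queries from the current frame
    k = channel_nodes @ w_k                  # keys from the encoder output
    v = channel_nodes @ w_v                  # values from the encoder output
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy sizes: 3 frames, 4 channels, a 2 x 3 grid (6 voxels), model width 8.
rng = np.random.default_rng(0)
voxel_feats = rng.normal(size=(6, 4))
channel_nodes = rng.normal(size=(12, 6))
w_q = rng.normal(size=(4, 8))
w_k = rng.normal(size=(6, 8))
w_v = rng.normal(size=(6, 8))
enhanced, weights = voxel_to_channel_cross_attention(
    voxel_feats, channel_nodes, w_q, w_k, w_v)
```

Row `i` of `weights` shows how strongly voxel `i` attends to each (frame, channel) node, which is the channel-level alignment described above.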

Figure 3: Comparison between our method, the single-frame baseline, and the ConvGRU-based method. The green boxes represent the ground truth. The red boxes indicate the detection results. From left to right, the first, second, third, and fourth columns are the ground truth and the detection results of the single-frame baseline, ConvGRU, and our proposed TCTR, respectively.

3.3 Feature Refinement Module

The output of the Temporal-Channel Transformer module is a denser representation than the original features of the target frame and contains spatial, temporal, and channel dependencies among the input frames. Instead of feeding this output into the object detection head directly, we combine it with the original target-frame features to integrate more useful information. This combination could also lead to more accurate features. Rather than simply concatenating or adding the two, we use the gate mechanism to combine them, which is widely used in sequence modeling (e.g., RNNs) and is able to control the information flow and filter out object-irrelevant information [9]. Formally, the gate mechanism based refinement is defined as:



where $\sigma$ is the sigmoid activation function and $\odot$ is the element-wise multiplication.
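To make the idea of gated fusion concrete, the sketch below shows one common form of gate: a sigmoid over a 1x1 convolution of the concatenated inputs, used for a convex combination of the two feature maps. This specific form is an assumption and is not claimed to match the paper's equation; the names `gated_fusion`, `w_gate`, and `b_gate` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(dense, sparse, w_gate, b_gate):
    """Fuse two feature maps of shape (C, H, W) with a learned gate.

    Assumed gate form (not taken from the paper):
        gate = sigmoid(conv1x1([dense, sparse]))
        out  = gate * dense + (1 - gate) * sparse
    w_gate: (C, 2C) weights of the 1x1 convolution, b_gate: (C,) bias.
    """
    stacked = np.concatenate([dense, sparse], axis=0)              # (2C, H, W)
    logits = np.einsum('oc,chw->ohw', w_gate, stacked) + b_gate[:, None, None]
    gate = sigmoid(logits)                                         # values in (0, 1)
    return gate * dense + (1.0 - gate) * sparse                    # element-wise

dense = np.ones((2, 3, 3))       # stand-in for the transformer output
sparse = np.zeros((2, 3, 3))     # stand-in for the original target-frame features
fused = gated_fusion(dense, sparse, np.zeros((2, 4)), np.zeros(2))
```

With zero weights and bias the gate is 0.5 everywhere, so `fused` is the plain average; training the gate lets the network decide per location and channel which source to trust.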

We up-sample the re-calibrated representation with a multi-layer CNN structure to the resolution at which detection is conducted. Along with the up-sampling process, we also up-sample the dense representation and repeatedly combine the up-sampled dense representation with the up-sampled re-calibrated representation to keep refining the output and ensure feature space consistency.

3.4 Detection Head and Training Loss

We use the same detection head and loss function as in PointPillars [19]. Specifically, 2D IoU is used to match the prior boxes to the ground truth, regardless of the height dimension of the object to be measured. Focal loss is used for object classification.


In addition to predicting the object category, we also need to obtain the 3D position of the object, as well as the box’s length, width, height, and orientation angle. We use the Smooth L1 loss for location regression, which is the loss most commonly used in object detection tasks.


Although we have regressed the values of the orientation angles, we also need to make positive and negative judgments about the direction of the orientation angles to further correct the orientation angles. We use softmax classification loss to optimize orientation angle judgments.

Thus, the total loss is the weighted sum of the three loss terms above, normalized by the number of positive anchors.
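The three loss ingredients can be sketched in NumPy as follows. The focal loss and Smooth L1 definitions are the standard ones; the default values `alpha=0.25` and `gamma=2.0` come from the focal loss literature and are not confirmed for this paper, and the weight arguments `w_loc`, `w_cls`, and `w_dir` are hypothetical names whose values are deliberately left to the caller.

```python
import numpy as np

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss per anchor; p is the predicted foreground probability."""
    p_t = np.where(target == 1, p, 1.0 - p)
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def smooth_l1(diff, beta=1.0):
    """Smooth L1 on regression residuals: quadratic near zero, linear elsewhere."""
    a = np.abs(diff)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def total_loss(loc, cls, direction, num_pos, w_loc, w_cls, w_dir):
    """Weighted sum of the three summed loss terms, normalized by positive anchors."""
    return (w_loc * loc + w_cls * cls + w_dir * direction) / max(num_pos, 1)

# Example with made-up loss sums and weights, purely to show the arithmetic.
example = total_loss(loc=2.0, cls=1.0, direction=0.5, num_pos=4,
                     w_loc=2.0, w_cls=1.0, w_dir=0.2)
```

The focal term down-weights easy examples through the factor `(1 - p_t) ** gamma`, which matters here because the vast majority of anchors on a sparse pseudo-image are background.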

| Type | Method | Car | Ped. | Bus | Barrier | T.C | Truck | Trailer | Moto. | Cons | Bicycle | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Single-frame | MAIR [28] | 47.8 | 37.0 | 18.8 | 51.1 | 48.7 | 22.0 | 17.6 | 29.0 | 7.4 | 24.5 | 30.4 |
| Single-frame | PointPillars [19] | 68.4 | 59.7 | 28.2 | 38.9 | 30.8 | 23.0 | 23.4 | 27.4 | 4.1 | 1.1 | 30.5 |
| Single-frame | SARPNET [35] | 59.9 | 69.4 | 19.4 | 38.3 | 44.6 | 18.7 | 18.0 | 29.8 | 11.6 | 14.2 | 32.4 |
| Single-frame | WYSIWYG [16] | 79.1 | 65.0 | 46.6 | 34.7 | 28.8 | 30.4 | 40.1 | 18.2 | 7.1 | 0.1 | 35.0 |
| Single-frame | SSN [39] | 80.7 | 72.3 | 39.9 | 56.3 | 54.2 | 37.5 | 43.9 | 43.7 | 14.6 | 20.1 | 46.3 |
| Single-frame | PointPainting [31] | 77.9 | 73.3 | 36.1 | 60.2 | 62.4 | 35.8 | 37.3 | 41.5 | 15.8 | 24.1 | 46.4 |
| Multi-frame | 3DVID [36] | 79.7 | 76.5 | 47.1 | 48.8 | 58.8 | 33.6 | 43.0 | 40.7 | 18.1 | 7.9 | 45.4 |
| Multi-frame | Ours | 83.2 | 74.9 | 63.7 | 53.8 | 52.5 | 51.5 | 33.0 | 54.0 | 15.6 | 22.6 | 50.5 |

Table 1: Results on the nuScenes dataset [4]. T.C, Moto., and Cons denote traffic cone, motorcycle, and construction vehicle, respectively. The results are divided into single-frame and multi-frame approaches. The results show that we achieve state-of-the-art performance among both grid voxel-based single-frame and multi-frame methods.

4 Experiment

4.1 Datasets

Most previous single-frame 3D object detection methods conduct experiments on the KITTI dataset [13]. However, the KITTI dataset does not have continuous frames, which makes it infeasible for evaluating 3D video object detection methods. We evaluate our model on the nuScenes dataset [4], which has a richer set of scenes and a larger amount of data. The nuScenes dataset includes 700 scenes for training and 150 scenes for testing. It contains 20 frames per second and is annotated every 0.5 seconds. We denote the annotated frames as key-frames and the others as sweeps.

4.2 Implementation Details

The x, y, and z coordinates of the point clouds are in the range of to , to , and to , respectively. Each point cloud is divided into pillars, each of which has a size of . As mentioned above, PFN is first used to derive features for each pillar. The output size of the feature maps of PFN is . These feature maps then pass through the backbone to generate feature maps with a size of , which finally go through the TCTR and feature refinement modules. The anchor size we set for bounding box regression is the mean of the ground truth. The weights , , and in our loss function are set to , , and , respectively. We use the Adam optimizer and the one-cycle training strategy with a learning rate of 0.001. We train for 40 epochs with 16 GPUs. The batch size is set to 2. The number of frames is set to to achieve a good balance between performance and complexity.

Single-frame 3D object detection algorithms usually add the ground truth of one frame to another frame for training data augmentation [38]. Our proposed method, however, considers the temporal correlation of objects in neighboring frames, so we cannot use this kind of training data augmentation. Instead, we use random flipping along the x- and y-axes and rotation around the z-axis in the range of to . We also apply random scaling ranging from to for data augmentation.
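The global augmentations above can be sketched in NumPy as follows. Two things here are assumptions rather than statements from the text: that one transform is sampled per sequence and shared by all of its frames (so that temporal correspondence is preserved), and the parameter names `max_angle` and `scale_range`, whose values are left to the caller. Flipping "along the x-axis" is interpreted as mirroring across it (negating y), and likewise for the y-axis.

```python
import numpy as np

def sample_transform(rng, max_angle, scale_range):
    """Sample one global transform; parameter ranges are supplied by the caller."""
    return {
        "flip_x": bool(rng.random() < 0.5),   # mirror across the x-axis (y -> -y)
        "flip_y": bool(rng.random() < 0.5),   # mirror across the y-axis (x -> -x)
        "angle": float(rng.uniform(-max_angle, max_angle)),
        "scale": float(rng.uniform(scale_range[0], scale_range[1])),
    }

def apply_transform(points, t):
    """Apply flip, z-rotation, and scaling to points of shape (N, >=3)."""
    out = points.astype(np.float64, copy=True)
    if t["flip_x"]:
        out[:, 1] *= -1.0
    if t["flip_y"]:
        out[:, 0] *= -1.0
    c, s = np.cos(t["angle"]), np.sin(t["angle"])
    rot = np.array([[c, -s], [s, c]])         # rotation around the z-axis
    out[:, :2] = out[:, :2] @ rot.T
    out[:, :3] *= t["scale"]                  # extra columns (e.g. reflectance) untouched
    return out

def augment_sequence(frames, rng, max_angle, scale_range):
    """Assumption: the same transform is applied to every frame of a sequence."""
    t = sample_transform(rng, max_angle, scale_range)
    return [apply_transform(f, t) for f in frames]
```

In a full pipeline the ground-truth boxes (centers, sizes, and yaw angles) would need the same transform; that part is omitted here to keep the sketch short.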

4.3 Overall Performance

We present the quantitative results of our model and other state-of-the-art methods on the nuScenes dataset in Table 1. Among the compared methods, PointPillars [19], SARPNET [35], and WYSIWYG [16] are grid voxel-based single-frame 3D object detection methods. PointPainting [31] is based on the PointPillars framework and further fuses the Lidar data with natural image data to provide richer information. 3DVID [36] is the state-of-the-art 3D video object detection network on the nuScenes dataset, which combines a KNN graph and ConvGRU to capture the spatiotemporal correlations. SSN [39] takes advantage of the shape features of the point cloud and proposes a way to encode the shape of the point cloud to improve object classification.

Our model achieves the best performance among all the compared methods by a large margin. In particular, we outperform the 3DVID model and the original PointPillars approach by 11.2% (5.1 in mAP) and 65.6% (20.0 in mAP), respectively. Although our method only utilizes the point cloud data in the nuScenes dataset, we still achieve better performance than PointPainting, which uses both Lidar and natural image data.
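The relative and absolute gains quoted above follow directly from the mean mAP values in Table 1 (50.5 for our model, 45.4 for 3DVID, 30.5 for PointPillars). A short check, with the helper name `relative_gain` chosen here for illustration:

```python
def relative_gain(ours, other):
    """Relative improvement of `ours` over `other`, in percent."""
    return 100.0 * (ours - other) / other

ours, vid3d, pointpillars = 50.5, 45.4, 30.5   # mean mAP values from Table 1

abs_vs_3dvid = round(ours - vid3d, 1)                      # 5.1
rel_vs_3dvid = round(relative_gain(ours, vid3d), 1)        # 11.2
abs_vs_pp = round(ours - pointpillars, 1)                  # 20.0
rel_vs_pp = round(relative_gain(ours, pointpillars), 1)    # 65.6
```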

4.4 Ablation Studies

4.4.1 Ablation Study A

We first conduct an overall ablation study of our model. As shown in Table 2, the baseline of our method only contains the backbone network and multiple CNN up-sampling layers (i.e., the target-frame features are used directly in place of the output of Equation 6), which is a single-frame 3D object detection method. The "baseline+concat" variant extends the baseline to video object detection simply by concatenating the features of the input frames. The "baseline+TCTR" variant only uses the Temporal-Channel Transformer to explore the spatial, temporal, and channel dependencies among the input frames and uses its output directly for detection. We can observe that "baseline+concat" performs better than the baseline, which demonstrates the importance of exploring adjacent frames for 3D object detection. Besides, integrating adjacent frames with our TCTR module is better than simply concatenating the inputs, which shows the necessity of exploring the complex correlations within the sequence. Our feature refinement module further re-calibrates the learned representation and helps improve the object detection performance.

| Method | Frames | mAP |
|---|---|---|
| baseline | 1 | 43.15 |
| baseline+concat | 3 | 45.57 |
| baseline+TCTR | 3 | 48.43 |
| baseline+TCTR+FRM (ours) | 3 | 50.47 |

Table 2: Ablation Study A: overall ablation study on the framework.

4.4.2 Ablation Study B

We then conduct a deeper study of the effectiveness of our Temporal-Channel Transformer by replacing it with other networks or variants. Specifically, we consider the following four comparison methods: 1) ConvLSTM as used in [27], 2) ConvGRU as used in [1], 3) T-encoder, which does not consider channel-wise correlations and takes the whole representation of each frame, instead of a single channel of it, as a node for encoding, and 4) C-encoder, which does not consider the temporal correlation and only takes different channels of a single frame as nodes for encoding. Except that the Temporal-Channel Transformer part is replaced, all other parts of our model are kept unchanged. As can be observed from Table 3, our method achieves the best performance among all the compared variants, which demonstrates the superiority of our Temporal-Channel Transformer in capturing complex dependencies in the input pseudo-images. Besides, our Temporal-Channel encoder achieves better performance than T-encoder, revealing the importance of exploring channel-wise correlations.

| Method | mAP |
|---|---|
| ConvLSTM | 46.92 |
| ConvGRU | 47.41 |
| T-encoder | 48.71 |
| C-encoder | 47.43 |
| TC-encoder (ours) | 50.47 |

Table 3: Ablation Study B: ablation study on the Temporal-Channel Transformer.

4.4.3 Ablation Study C

In this part, we study the influence of different methods for combining the output of the Temporal-Channel Transformer with the original target-frame features, as discussed in Section 3.3. We compare our gate mechanism based feature refinement module with concatenation-based and addition-based fusion strategies. The result without any combination is also presented for comparison. As shown in Table 4, the detection accuracy improves whether the information from the target frame is combined by concatenation or by addition, which highlights the importance of the current frame in object detection. Besides, our feature refinement module uses the gating mechanism to select detection-relevant information and achieves even better performance, which demonstrates the effectiveness of our design and highlights the importance of paying more attention to object-relevant features.

Method mAP
(ours) 50.47
Table 4: Ablation Study C: ablation study on the feature refinement module, denotes the concatenate operation.

4.4.4 Ablation Study D

We are also interested in how the number of input frames influences object detection performance. In this study, we evaluate our model’s performance with different numbers of input frames. As mentioned in Section 4.1, the nuScenes dataset contains both key-frames and sweeps. We only use the key-frames without any sweeps as input for this study due to GPU memory and training time constraints. The results are presented in Table 5. It can be observed that the detection accuracy consistently improves with the input length, revealing the importance of exploiting and aggregating the information from more frames to make the target frame denser.

| Input Lengths | mAP |
|---|---|
| 1T | 40.62 |
| 3T | 42.05 |
| 5T | 43.76 |
| 7T | 44.68 |

Table 5: Ablation Study D: ablation study on the relationship between the number of frames and performance.

5 Conclusions

In this work, we study the 3D Lidar-based video object detection problem and propose a novel deep learning method for enhancing the target point cloud frame with adjacent frames. A new transformer, named Temporal-Channel Transformer, is designed to fully explore the spatial, temporal, and channel correlations among different frames and enhance the target frame based on these learned correlations in a voxel-wise manner. Besides, we also design a feature refinement module to re-calibrate the learned dense representations, which helps filter out object-irrelevant information. We conduct experiments on the large-scale nuScenes dataset and compare our method with several strong baselines; the results demonstrate that our model achieves state-of-the-art performance. Extensive ablation studies are also conducted and demonstrate the effectiveness of our design.