Perceiving the world is a critical task in modern robotics applications. Self-driving vehicles must first process sensory information to perform object detection and estimate the free space before attempting to plan a safe and comfortable maneuver towards the goal. LiDAR has become the main sensing modality in most self-driving vehicles due to the geometrical richness it provides. Most prevalent LiDAR sensors operate by collecting a rotating scan of the environment, typically completing revolutions at a rate of 10Hz. However, as the sensor rotates, observations arrive as a stream of spatio-temporal points grouped in fine-grained packets, each spanning approximately 10ms. This gives rise to the rolling shutter effect shown in Figure 1, where objects in different locations are observed asynchronously.
Modern autonomous systems accumulate the LiDAR packets into a full sweep before running perception. This waiting time adds significant latency to the pipeline, particularly for objects that were seen in the earlier packets of the sweep. It also introduces the erroneous assumption that all observations in the full sweep are made synchronously. In reality, by the time the perception model receives the input, there is already a discrepancy between the outdated observations and the true state of the world, as illustrated in Figure 1. Furthermore, there is a temporal discontinuity in the sweep where the earliest and the latest packets meet, which creates artifacts in the point cloud.
For safety-critical applications like self-driving, even minimal delays may result in catastrophic outcomes. For example, in the presence of high-speed vehicles, building a sweep from a 10Hz LiDAR introduces a latency of 100ms, which translates to several meters of error in free space estimation. Having lower latency is crucial in safety-critical situations where the vehicle must quickly perceive and react to avoid harmful events. Therefore, it is important to process incoming sensory information with minimal latency.
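The link between buffering latency and spatial error can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the speeds are hypothetical values chosen for the example, and `displacement_m` is a helper name introduced here, not something defined in the paper.

```python
def displacement_m(speed_mps: float, latency_s: float) -> float:
    """Distance an actor travels while its observation waits in a buffer."""
    return speed_mps * latency_s


SWEEP_LATENCY_S = 0.100   # full sweep of a 10Hz LiDAR (from the text)
PACKET_LATENCY_S = 0.010  # single packet (from the text)

# Hypothetical relative speeds between the ego vehicle and another actor.
for speed_kmh in (50, 100, 200):
    speed_mps = speed_kmh / 3.6
    sweep_err = displacement_m(speed_mps, SWEEP_LATENCY_S)
    packet_err = displacement_m(speed_mps, PACKET_LATENCY_S)
    print(f"{speed_kmh:>3} km/h: {sweep_err:.2f} m per sweep, "
          f"{packet_err:.2f} m per packet")
```

Under these assumed speeds, the per-sweep displacement is ten times the per-packet displacement, which is the gap the rest of the paper sets out to close.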
Processing individual LiDAR packets can be challenging, since only a small sector of the scene is observable as illustrated in Figure 1. Objects of interest are often fragmented across different LiDAR packets, particularly when close to the sensor. Coincidentally, that is also when high accuracy and low latency are the most important as close range objects are typically the most critical to safety. Thus, individual packets alone may be insufficient for high quality detections, making it necessary to incorporate past observations.
Existing LiDAR object detectors generally assume access to a full 360 degree sweep, or a large subregion (e.g., front view) that spans all objects of interest. As such, these models do not explicitly reason about objects split across multiple observations. As shown in our experiments, directly adopting full-sweep models for processing individual LiDAR packets is not a good solution due to the partial observation and lack of global context. Conversely, exploiting multiple sweeps [casas2018intentnet, luo2018fast] provides richer geometrical evidence as more LiDAR points are collected over time. However, most current solutions are computationally inefficient as each packet would be processed as many times as the duration of the history. As such, naively aggregating historical sensory information at the input level is not amenable to emitting low latency object detections from fine-grained LiDAR packets.
In this paper we propose StrObe, a novel detection model which exploits the sequential nature of LiDAR observations and efficiently reuses past computation to stream low latency object detections from LiDAR packets. Our approach voxelizes the incoming LiDAR packets into a Bird’s-Eye View (BEV) grid, and uses an efficient convolutional backbone to process only the relevant region. Furthermore, we introduce a multi-scale spatial memory that is read and updated with each LiDAR packet. This allows us to reuse past computation, and makes the incremental processing of incoming LiDAR packets lightweight and efficient. Importantly, we achieve an end-to-end latency of 21ms (from observing an actor to emitting a detection) on an NVIDIA 2080Ti: 10ms for accumulating a packet and 11ms for model inference. In contrast, even fast full sweep detectors [yang2018pixor] operate at roughly six times this latency, taking 100ms to accumulate the sweep and another 28ms for model inference, for a total of 128ms.
Our second contribution is a novel large scale benchmark for evaluating streaming object detection from LiDAR packets. Unlike existing public datasets, PacketATG4D contains LiDAR data at the packet level, along with accurate ego-pose and associated object bounding box annotations at the same temporal resolution (i.e., 100Hz). We also propose a novel metric latency-aware mAP to explicitly take latency into account when evaluating perception. We show that our approach far outperforms the state-of-the-art when the data buffering latency is taken into account, while still matching the performance in the conventional setting.
2 Related Work
3D object detection has made tremendous progress in recent years due to the advances of deep learning and the availability of large-scale labeled datasets. The topic of how to effectively process LiDAR data has received significant attention and many approaches have been proposed. Point clouds have been processed in perspective format using a range image [li2016fcn, meyer2019lasernet]. By converting the point cloud into an image, these approaches can leverage the vast body of knowledge on 2D object detection to build good architectures for the task. However, such methods suffer from the same challenge present in 2D detection: high variance in receptive field requirements as a function of depth.
To tackle these issues, some methods perform 3D detection directly on the unstructured 3D points. This is usually achieved through first extracting local signatures with a fully connected layer [qi2017pointnet, li2018pointcnn, hua2018pointwise, wang2018deep] or by using deformable filters [xiong2019deformable]. An alternative framework is to voxelize the points into a regularly spaced 3D grid, making reasoning on point clouds amenable to convolutional architectures. Early works [wu20153d, maturana2015voxnet, li20173d, luo2018fast] leverage 3D convolutions, but they are memory intensive. Others [riegler2017octnet, ren2018sbnet, yan2018second] exploit the sparsity of point clouds to reduce redundant computation and make higher resolution processing feasible. BEV detectors [yang2018pixor, yang2018hdnet, casas2018intentnet] avoid heavy computation by exploiting efficient 2D convolutions over a top-down pseudo-image of the scene. Other methods have leveraged hybrid representation of points and voxels [chen2019fast, shi2020pv, yang2019std, lang2019pointpillars, zhou2019end] to exploit the benefits of both representations.
However, the aforementioned methods assume a full sweep is available, which requires the sensor to complete a full rotation and incurs latency. Previous works have explored the problem of latency in different settings, for instance on the effect of model runtime for 2D object detection [li2020towards], or how the temporal aspect of point clouds is relevant for odometry and mapping [alismail2014continuous, zhang2017low]. Concurrent work [han2020streaming] has considered streaming object detections from a rolling shutter LiDAR. However, their model uses an LSTM to maintain the state, which does not leverage the spatial nature of the problem. Furthermore, their evaluation does not capture the impact latency has on the accuracy of state estimation.
3 Low Latency Detection on Streaming LiDAR
In this paper, we propose StrObe, a low-latency object detector that emits detections from streaming LiDAR observations. As illustrated in Figure 2, as the LiDAR sensor spins, it yields data in sector packets (each roughly spanning 36° in our 10Hz LiDAR). As opposed to previous models, which buffer this data into a full sweep before processing, our proposed method operates at the packet level. In doing so, we lower our latency by 90ms. A fundamental component of our approach is a novel spatial memory module designed to reuse past computation and make incremental processing of incoming LiDAR packets lightweight and effective.
3.1 Streaming Object Detection
The overall architecture of our model is illustrated in Figure 3. The network takes as input a LiDAR packet and an HD map, which is useful as a prior on the location of actors (e.g., a vehicle is more likely to be on the road than on the sidewalk). For each packet we first voxelize the points and rasterize into a BEV pseudo-image with height as the channel dimension [yang2018pixor]. Following [yang2018hdnet, casas2018intentnet], we also rasterize the map into a BEV pseudo-image, where each channel corresponds to a different layer of the map (e.g., crosswalks, roads, etc). We then extract features using our novel regional convolutions (Figure 3 – a, b), which only compute features in the rectangular area defined by the packet. A latent spatial representation of the scene is then maintained using a memory module (Figure 3 – c, d, e). Lastly, we channel-wise concatenate multi-scale features and regress detection parameters using our output header (Figure 3 – f).
Regional Convolution Layer:
To reduce latency while leveraging the proven strength of BEV representations and 2D convolutions, we propose to process the input with a local operator, which we call regional convolution. Specifically, for an input $x$ and coordinate ranges $(w_0, w_1)$ and $(h_0, h_1)$, we extract features only on the region $y = f_{\text{filter}}(x[w_0{:}w_1,\, h_0{:}h_1])$, where the brackets denote indexing at the rectangle defined by the coordinate ranges. This allows us to leverage locality to minimize wasted computation.
$f_{\text{filter}}$ is a sequence of 2D convolution, ReLU activation and Group Normalization [wu2018group]. Furthermore, for both the LiDAR packet and HD map, the region coordinates are defined as the minimal rectangle that fully encloses all points in the LiDAR packet. This is illustrated in Figure 3 – a, b.
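To make the "minimal enclosing rectangle" idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It voxelizes one packet into a BEV occupancy grid with height bins as channels and then crops the grid to the rectangle that encloses the packet. The function names, the scene extents and the example points are assumptions made for illustration; only the 0.2m resolution comes from the text.

```python
import numpy as np

RES_M = 0.2  # BEV cell size in metres (value taken from the text)


def voxelize_bev(points, x_range, y_range, z_range, res=RES_M):
    """Binary occupancy grid of shape (Z, H, W); height bins act as channels."""
    mins = np.array([x_range[0], y_range[0], z_range[0]], dtype=float)
    maxs = np.array([x_range[1], y_range[1], z_range[1]], dtype=float)
    shape = np.ceil((maxs - mins) / res).astype(int)  # cells along (x, y, z)
    keep = np.all((points >= mins) & (points < maxs), axis=1)
    idx = ((points[keep] - mins) / res).astype(int)
    idx = np.minimum(idx, shape - 1)  # guard against floating point edge cases
    grid = np.zeros((shape[2], shape[1], shape[0]), dtype=np.float32)
    grid[idx[:, 2], idx[:, 1], idx[:, 0]] = 1.0
    return grid


def packet_region(grid):
    """Minimal rectangle (r0, r1, c0, c1) enclosing every occupied BEV cell."""
    rows, cols = np.nonzero(grid.any(axis=0))
    return rows.min(), rows.max() + 1, cols.min(), cols.max() + 1


# Three hypothetical LiDAR returns (x, y, z) in metres from a single packet.
packet = np.array([[5.05, 1.07, 0.50],
                   [6.13, 2.31, 1.05],
                   [7.93, 0.27, 0.15]])
grid = voxelize_bev(packet, x_range=(0, 10), y_range=(-5, 5), z_range=(0, 2))
r0, r1, c0, c1 = packet_region(grid)
regional_input = grid[:, r0:r1, c0:c1]  # the only area a regional conv touches
print(grid.shape, "->", regional_input.shape)
```

The cropped tensor is what a regional convolution would consume, so the cost of the backbone scales with the area of the packet rather than with the area of the full scene.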
While regional convolutions allow us to efficiently ingest packets, independently processing them is not sufficient for accurate perception since objects will often be fragmented across many packets. Furthermore, a single observation of an object far away will typically yield few points due to the sparsity of the sensor at range. We would thus like to leverage information from previous scans of the region. However, naively processing the history of observations every time we receive a packet results in redundant computation and slow inference. Instead, our approach iteratively builds a global spatial memory from a series of partial observations while at the same time producing new detections with each LiDAR packet (Figure 3 – c). This enables us to reuse past computation and produce low-latency and accurate detections. Importantly, the LiDAR points are registered on a consistent coordinate frame defined by a continuous ego-pose. The memory is aligned with this pose by bilinearly resampling its features to account for ego-motion with every new packet (Figure 3 – c, d). This guarantees that the LiDAR and map features are consistently aligned with the spatial memory in the same coordinate frame.
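The ego-motion alignment step can be pictured as a rigid 2D warp of the memory followed by bilinear interpolation. The sketch below is one possible reading under stated assumptions, not the paper's code: the function name `warp_memory`, the rotation about the grid centre, the shift expressed in cells, the sign conventions and the zero-filling of cells that leave the grid are all choices made for this example.

```python
import numpy as np


def warp_memory(memory, rotation_rad, shift_cells):
    """Bilinearly resample a (C, H, W) memory under a rigid 2D motion.

    Output cell (r, c) reads the old memory at the location obtained by
    rotating (r, c) about the grid centre and adding `shift_cells`.
    Locations outside the old grid contribute zero.
    """
    _, H, W = memory.shape
    rows, cols = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    cos, sin = np.cos(rotation_rad), np.sin(rotation_rad)
    src_r = cos * (rows - cy) - sin * (cols - cx) + cy + shift_cells[0]
    src_c = sin * (rows - cy) + cos * (cols - cx) + cx + shift_cells[1]

    r0 = np.floor(src_r).astype(int)
    c0 = np.floor(src_c).astype(int)
    wr, wc = src_r - r0, src_c - c0  # fractional parts drive the blend

    out = np.zeros_like(memory)
    corners = ((0, 0, (1 - wr) * (1 - wc)), (0, 1, (1 - wr) * wc),
               (1, 0, wr * (1 - wc)), (1, 1, wr * wc))
    for dr, dc, weight in corners:
        rr, cc = r0 + dr, c0 + dc
        valid = (rr >= 0) & (rr < H) & (cc >= 0) & (cc < W)
        out[:, valid] += memory[:, rr[valid], cc[valid]] * weight[valid]
    return out


memory = np.arange(2 * 4 * 5, dtype=float).reshape(2, 4, 5)
aligned = warp_memory(memory, rotation_rad=0.0, shift_cells=(0.0, 1.0))
```

In a multi-scale memory the same warp would be applied at every scale, with the shift expressed in the cell size of that scale.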
As each LiDAR packet arrives, the spatial memory is incrementally updated with new local features to reflect the latest state (Figure 3 – d). Each update step is done through aggregation of the current memory state $M$ and the incoming local features $x_{\text{local}}$. Specifically, we employ a channel reduction with learned parameters $W$ as follows:
$$M' = f_{\text{memory}}(M, x_{\text{local}}; W)$$
In practice, $f_{\text{memory}}$ channel-wise concatenates $M$ and $x_{\text{local}}$, resulting in a tensor with $2C$ channels, then applies two blocks of 2D convolution, ReLU activation and Group Normalization, with the second block bringing the number of channels back to $C$. This is illustrated as the red dotted arrows in Figure 3 – e.
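The update described above (concatenate memory and local features, then reduce the channels with two convolution blocks) can be sketched in a few lines of NumPy. This is a simplified illustration rather than the authors' module: it uses 1x1 convolutions in place of spatial kernels, Group Normalization without learned affine parameters, random weights, and it assumes the update is written back only inside the packet region. All names (`update_memory`, `block`, `group_norm`) are introduced here for the example.

```python
import numpy as np


def group_norm(x, groups=4, eps=1e-5):
    """Group Normalization over a (C, H, W) tensor, no learned affine."""
    C, H, W = x.shape
    g = x.reshape(groups, -1)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(
        g.var(axis=1, keepdims=True) + eps)
    return g.reshape(C, H, W)


def block(x, weight):
    """Convolution -> ReLU -> GroupNorm, with a 1x1 kernel for brevity."""
    y = np.einsum("oc,chw->ohw", weight, x)
    return group_norm(np.maximum(y, 0.0))


def update_memory(memory, local, region, w1, w2):
    """Write the aggregated features back into the packet region only."""
    r0, r1, c0, c1 = region
    stacked = np.concatenate([memory[:, r0:r1, c0:c1], local], axis=0)  # 2C
    updated = memory.copy()
    updated[:, r0:r1, c0:c1] = block(block(stacked, w1), w2)  # back to C
    return updated


rng = np.random.default_rng(0)
C = 8
memory = np.zeros((C, 10, 10))
local = rng.normal(size=(C, 4, 5))        # features of the current packet
w1 = rng.normal(size=(2 * C, 2 * C))      # first block keeps 2C channels
w2 = rng.normal(size=(C, 2 * C))          # second block reduces to C
updated = update_memory(memory, local, (2, 6, 3, 8), w1, w2)
```

Because only the packet region is read and written, the cost of each update is proportional to the size of the packet, while the rest of the memory carries over unchanged.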
In order to leverage the semantic representations of feature maps at different scales (i.e., richer geometry on higher resolutions; richer semantics on lower) we employ a multi-scale backbone for the extraction of both LiDAR and HD map features. Together with the spatial memory at each scale, the benefits of this are twofold: It allows the model to regress accurate and low latency detections from partial observations by remembering the features from immediately preceding packets. It also makes it possible for the network to persist long term features that are useful to detect objects through occlusion over multiple sweeps as well as overwrite previous features when stronger evidence is available.
We employ a BEV grid with resolution of 0.2m for each voxel. This grid then goes through 4 blocks of [2, 2, 3, 6] Regional Convolution layers with [24, 64, 128, 256] channels, followed by Max Pooling with a stride of 2. Each block has a corresponding Spatial Memory that holds the pre-pooling state of the features. In parallel, features are extracted from the HD map with a backbone that consists of a sequence of 4 blocks with [2, 2, 3, 3] Regional Convolution layers with [16, 32, 64, 128] channels. After each block, Max Pooling with a stride of 2 is employed. The feature maps from each block of both the LiDAR and HD map backbones are then bilinearly resized to a common resolution of 0.8m, channel-wise concatenated, and processed by one last block of 4 Regional Convolutions with 256 channels.
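The resolutions implied by this configuration can be worked out directly. The snippet below is a small sanity check under two assumptions that the text does not state explicitly: that pooling follows every block of both backbones, and that all four blocks of each backbone feed the fusion step.

```python
BASE_RES_M = 0.2                     # input BEV resolution (from the text)
TARGET_RES_M = 0.8                   # common resolution before fusion
LIDAR_CHANNELS = [24, 64, 128, 256]  # per-block channels, LiDAR backbone
MAP_CHANNELS = [16, 32, 64, 128]     # per-block channels, HD map backbone

# Resolution of the pre-pooling features of block i (what the memory holds).
block_res_m = [BASE_RES_M * 2 ** i for i in range(4)]

# Spatial scale factor applied by the bilinear resize of each block's output:
# below 1 the map is downsampled, above 1 it is upsampled.
resize_factor = [res / TARGET_RES_M for res in block_res_m]

# Channels entering the final fusion block if every block is concatenated.
fused_in_channels = sum(LIDAR_CHANNELS) + sum(MAP_CHANNELS)

for res, factor in zip(block_res_m, resize_factor):
    print(f"block at {res:.1f} m -> resize by x{factor:g}")
print("fused input channels:", fused_in_channels)
```

Under these assumptions the first two blocks are downsampled, the third is already at the target resolution, and the fourth is upsampled by a factor of two.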
We perform multi-class BEV detection for vehicles, cyclists, and pedestrians via a single-stage detection header consisting of 2 convolutional layers that predict the classification and regression targets for each cell in the fused feature map (hereinafter referred to as "anchors"). All objects are defined via their centroid $(b_x, b_y)$ and confidence $\sigma$, whereas cyclists and vehicles also have length, width, and heading $(b_l, b_w, b_\phi)$ in BEV. For the confidence, we predict its logit. We define the centroid of the box as an offset $(\Delta x, \Delta y)$ from the coordinates of the center point of its anchor pixel $(a_x, a_y)$:
$$(\Delta x, \Delta y) = (b_x - a_x,\; b_y - a_y)$$
For the vehicle dimensions we predict $(\log b_l, \log b_w)$, which encourages the network to learn a prior on the dimension of the boxes (low variance should be expected from the dimension of vehicles). The heading is parameterized by the tangent value. In particular, we predict a signed ratio $(\theta_1, \theta_2)$ so that the specific quadrant can be retrieved:
$$b_\phi = \operatorname{atan2}(\theta_1, \theta_2)$$
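Decoding a box from the header outputs can be sketched as follows. This is an illustration under assumptions rather than the paper's exact parameterization: it supposes that the centroid is regressed as an offset from the anchor centre, that the dimensions are regressed in log space, and that the heading is recovered with `atan2` from two predicted values. The output layout of `pred` and the function name `decode_box` are hypothetical.

```python
import math


def decode_box(anchor_xy, pred):
    """Turn one anchor's regression outputs into a BEV box.

    `pred` is assumed to be (dx, dy, log_l, log_w, t1, t2).
    """
    dx, dy, log_l, log_w, t1, t2 = pred
    x = anchor_xy[0] + dx              # centroid = anchor centre + offset
    y = anchor_xy[1] + dy
    length = math.exp(log_l)           # undo the log parameterization
    width = math.exp(log_w)
    heading = math.atan2(t1, t2)       # keeps the quadrant, unlike atan(t1/t2)
    return x, y, length, width, heading


box = decode_box(
    anchor_xy=(10.0, 4.0),
    pred=(0.3, -0.1, math.log(4.5), math.log(1.9), -1.0, -1.0),
)
print(box)
```

The example uses two negative heading values on purpose: their ratio is positive, so a plain arctangent of the ratio would place the heading in the first quadrant, while `atan2` places it in the third.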
Following common practice in object detection [ren2015faster], we optimize the model with a multi-task loss over classification and bounding box regression, i.e., a weighted sum of a classification loss $\mathcal{L}_{\text{cls}}$ and a regression loss $\mathcal{L}_{\text{reg}}$.
The regression loss $\mathcal{L}_{\text{reg}}$ is defined as the weighted sum of the smooth $\ell_1$ loss [DBLP:journals/corr/Girshick15] between the ground truth vector of detection parameters and the predictions. Note that the dimensions and the heading are not considered for pedestrians since we are only concerned with predicting their centroid.
The classification loss $\mathcal{L}_{\text{cls}}$ is defined as the binary cross entropy between the predicted scores and the ground truth. Due to the severe class imbalance between positive and negative anchors, given that most pixels in the BEV scene do not contain an object, we employ hard negative mining: the negative term of the loss is computed only over a set $\mathcal{N}$ of hard negative anchors. This set is obtained by sampling 750 anchors for vehicles, 1500 for cyclists and pedestrians, and picking the 20 with highest loss for each class.
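The sampling-then-ranking procedure for hard negatives can be sketched in NumPy for a single class. This is a minimal illustration, not the authors' training code: the function names are introduced here, the scores are random, and the final normalization (a mean over positives plus hard negatives) is an assumption, since the exact form of the loss is not recoverable from the text.

```python
import numpy as np


def bce(scores, labels, eps=1e-7):
    """Per-anchor binary cross entropy for scores in (0, 1)."""
    p = np.clip(scores, eps, 1 - eps)
    return -(labels * np.log(p) + (1 - labels) * np.log(1 - p))


def hard_negative_indices(scores, labels, num_sampled, num_hard, rng):
    """Sample negatives at random, then keep the ones with the highest loss."""
    loss = bce(scores, labels)
    negatives = np.flatnonzero(labels == 0)
    sampled = rng.choice(negatives, size=min(num_sampled, negatives.size),
                         replace=False)
    order = np.argsort(loss[sampled])[::-1]  # highest loss first
    return sampled[order[:num_hard]]


def classification_loss(scores, labels, num_sampled, num_hard, rng):
    """BCE over all positives and the mined hard negatives (assumed mean)."""
    loss = bce(scores, labels)
    positives = np.flatnonzero(labels == 1)
    hard = hard_negative_indices(scores, labels, num_sampled, num_hard, rng)
    return loss[np.concatenate([positives, hard])].mean()


rng = np.random.default_rng(0)
labels = np.zeros(5000)
labels[:5] = 1                                  # a handful of positive anchors
scores = rng.uniform(0.01, 0.99, size=labels.size)
hard = hard_negative_indices(scores, labels, num_sampled=750, num_hard=20,
                             rng=rng)
value = classification_loss(scores, labels, 750, 20, rng)
```

With 750 sampled anchors and 20 kept, this mirrors the vehicle setting described above; the other classes would only change `num_sampled`.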
Due to the sequential nature of the memory, the model is trained sequentially through examples that contain 50 packets (corresponding to 0.5s). Back-propagation through time is used to compute gradients across the memory. Furthermore, the model is trained to remember by supervising it on objects with 0 points, as long as the object was seen in any of the previous packets. In practice, due to GPU memory constraints, we only compute the forward pass on the first 40 packets to warm up the memory, then run forward and backward through time on the last 10 packets.
4 Experimental Evaluation
We evaluate our model on a real world dataset for 3D object detection. In particular, we compute mean average precision (mAP) in the default detection setting (using full sweeps) and propose a new metric that takes into account the latency incurred by different input granularities (i.e., per-packet processing vs. sweep building). Our experimental results show that our model far outperforms the baselines in the per-packet setting while remaining competitive with the state-of-the-art in the full sweep setting. Furthermore, our latency evaluation also uncovers a problem with the mAP metric in the default setting as it does not accurately measure real world performance.
Since there is no publicly available dataset that provides packets, we collect a new dataset, PacketATG4D, containing 6500 snippets with diverse conditions (e.g., geographical, lighting, road topology, vehicle types). The LiDAR rotates at a rate of 10Hz and emits new packets at 100Hz, each roughly covering a 36° region, for a total of 16,250,000 packets (1,625,000 frames). Accurate ego-pose is available for each LiDAR packet via a commercial localization system. Labels provide both the spatial extents and motion of vehicles, cyclists and pedestrians, from which we can extract accurate bounding boxes at discrete observation times as well as in continuous time through the use of a precise motion model. Note that if the observation of an instance is split across packets, each packet will have an instance of the label according to the pose at the timestamp of the packet.
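The dataset figures above are mutually consistent, which a few lines of arithmetic confirm. The per-snippet numbers below are averages derived from the totals, under the assumption that snippets are of equal length; the text does not state a snippet duration.

```python
ROTATION_HZ = 10          # sweeps per second (from the text)
PACKET_HZ = 100           # packets per second (from the text)
NUM_SNIPPETS = 6500
NUM_PACKETS = 16_250_000

packets_per_sweep = PACKET_HZ // ROTATION_HZ       # packets in one rotation
sector_deg = 360 / packets_per_sweep               # angular span of a packet
packet_ms = 1000 / PACKET_HZ                       # time span of a packet
num_sweeps = NUM_PACKETS // packets_per_sweep      # full sweeps ("frames")
packets_per_snippet = NUM_PACKETS // NUM_SNIPPETS  # average, if equal length
snippet_s = packets_per_snippet / PACKET_HZ        # average snippet duration

print(packets_per_sweep, sector_deg, packet_ms)
print(num_sweeps, packets_per_snippet, snippet_s)
```

The derived frame count matches the 1,625,000 frames quoted in the text, and each packet covers one tenth of a rotation.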
|Model|Accumulation (ms)|Inference (ms)|Total Latency (ms)|
|PointPillars [lang2019pointpillars, yan2018second]|100|37|137|
|Model|Packet Stream|Full Sweep|
|PointPillars [lang2019pointpillars, yan2018second]|66.8 / 47.7 / 53.4 / 49.2 / 16.8 / 6.1|84.2 / 61.1 / 74.4 / 68.9 / 56.1 / 34.9|
We provide a wide range of baselines that exploit different representations. HDNET [yang2018hdnet] is a detection model that processes input point clouds into occupancy voxels and performs 2D convolution in BEV using the height-axis voxels and HD maps as feature channels. PointRCNN [pointrcnn] processes raw LiDAR inputs using a PointNet [qi2017pointnet] backbone to perform foreground segmentation and generate region-of-interest (RoI) proposals. The RoI proposals are then processed by a classification and bounding box refinement network to output 3D detections. PointPillars [lang2019pointpillars, yan2018second] groups input points into discrete bins in BEV and uses PointNet [qi2017pointnet] to extract features of each bin. The BEV features are then processed with 2D convolutions to generate detection outputs. Note that the PointRCNN and PointPillars baselines do not make use of HD maps.
We evaluate our method using mean average precision (mAP) as our metric, with IoU thresholds of [0.5, 0.7] for vehicles and [0.3, 0.5] for cyclists. For pedestrians, we use the distance to centroid with thresholds [0.5m, 0.3m], since we treat the detections as circles with a fixed radius and thus only the centroid is predicted. We evaluate with latency-aware labels that take into account the delay introduced by aggregating consecutive packets (Latency mAP). We refer the reader to Figure 4 for an illustration of this metric. We redefine the detection label for each object in the scene as its state at detection time (green box), rather than at observation time (red box), which does not accurately reflect the current state of the world. The benefits of this metric are twofold: (1) it evaluates how well the detector models the state of the real world and the quality of the information that would be used by downstream motion planning, and (2) it allows for a direct comparison with standard detection metrics, thus making apparent the effects of latency.
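The relabeling behind the latency-aware metric can be illustrated by moving a label from observation time to detection time. The sketch below uses a constant-velocity model as a stand-in for the dataset's motion model, which the text does not specify; `ActorState`, `propagate` and the example values are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ActorState:
    x: float   # metres
    y: float   # metres
    vx: float  # metres per second
    vy: float  # metres per second
    t: float   # seconds


def propagate(state: ActorState, t_target: float) -> ActorState:
    """Constant-velocity propagation of a label to another timestamp."""
    dt = t_target - state.t
    return replace(state, x=state.x + state.vx * dt,
                   y=state.y + state.vy * dt, t=t_target)


# An actor observed at the very start of a packet, moving at 25 m/s.
observed = ActorState(x=20.0, y=3.0, vx=25.0, vy=0.0, t=0.000)
packet_label = propagate(observed, 0.010)  # label when the packet is complete
sweep_label = propagate(observed, 0.100)   # label when the sweep is complete

print(packet_label.x - observed.x, sweep_label.x - observed.x)
```

In this example the label moves by 0.25m under packet-level processing and by 2.5m under sweep-level processing, which is the kind of offset that a strict IoU threshold penalizes.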
|Model|Packet Stream|Full Sweep|
|PointPillars [lang2019pointpillars, yan2018second]|66.9 / 48.0 / 53.5 / 49.3 / 16.9 / 6.2|84.8 / 70.6 / 74.2 / 69.2 / 56.1 / 36.3|
Since implementations might differ, we did not consider model inference times in the latency-aware detection metric. However, inference time is an important factor in the end-to-end latency for safety, since it indicates the minimal amount of time the system would require to recognize an actor, i.e., the time taken for sensor data acquisition, model inference, and emission of a corresponding detection for the actor to downstream systems. We report end-to-end latency timings in Table 1; our approach leads to a much faster detection emission time (6x faster on average).
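The end-to-end figures quoted in the text can be combined as follows. The snippet only uses numbers that appear in this paper (10ms + 11ms for StrObe, 100ms + 28ms for the full sweep detector cited as [yang2018pixor], and 100ms + 37ms for PointPillars); the dictionary keys are labels chosen for this example.

```python
# (accumulation_ms, inference_ms) as reported in the text.
LATENCY_MS = {
    "strobe": (10, 11),
    "yang2018pixor": (100, 28),
    "pointpillars": (100, 37),
}

totals = {name: acc + inf for name, (acc, inf) in LATENCY_MS.items()}
speedup = {name: total / totals["strobe"]
           for name, total in totals.items() if name != "strobe"}

for name, total in totals.items():
    print(f"{name}: {total} ms end to end")
for name, ratio in speedup.items():
    print(f"{name} is {ratio:.1f}x slower than packet-level processing")
```

For the two full sweep detectors listed here, the ratio lands between 6x and 7x, in line with the average speedup reported above.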
Table 2 shows our results for PacketATG4D. In the leftmost setting – Packet Stream – all models are first trained on detection using LiDAR packets (as opposed to full sweeps) and evaluated using the state of the labels at the time of detection (i.e., green box in Figure 4). Our model far outperforms the baselines, which do not have memory and struggle with partial observations (i.e., a single packet as opposed to the full sweep). In the right portion of the table – Full Sweep – the models are trained using a traditional full sweep setting and evaluation is done using the label states at the end of the sweep (therefore in the worst case an object could move for 100ms before evaluation).
We additionally evaluate in the standard object detection setting, not taking into account the sweep building latency and using the labels for each object in the scene at the time of observation (i.e., when the LiDAR hit the object). The leftmost columns of Table 3 show the results of the models trained in a packet setting. A key takeaway from these results stems from comparing the numbers in the "Packet Stream" setting between Tables 2 and 3, which shows that the 10ms latency introduced by accumulating a single packet is negligible in the mAP settings we evaluate, since performance remains the same. Conversely, comparing the "Full Sweep" setting in Tables 2 and 3 shows considerable degradation in metrics. This indicates that the performance of full sweep models in the real world would be considerably lower.
[Figure 5: qualitative results. Panels: Snippet 1, Snippet 2, Snippet 3.]
We first ablate the memory component of the model. In particular, we evaluate two alternative approaches: (1) No Memory: a memoryless instantiation of our model; (2) Attention: a memory module that uses linear attention to update the spatial memory (see supplementary for more details). As shown in Table 4, memory is a fundamental component for effective perception from partial observations. Furthermore, the attention-based memory updates are outperformed by our approach, which learns the aggregation function through convolutions. We also evaluate our model without the HD map component to assess its importance. The results in Table 4 (No Map row) show that while the map backbone is beneficial overall, it is not a fundamental component, as its removal does not lead to major degradations in metrics.
The qualitative results in Figure 5 show the predictions of the model over 4 consecutive packets in 3 snippets. The network is able to predict boxes even before points are visible due to the memory module. It can also update the positions of detections as new points arrive to best exploit the evidence.
We have proposed a novel method for perception of point cloud streaming data. Our approach produces highly accurate object detections at very low latency by using regional convolutions over individual LiDAR packets alongside a spatial memory that keeps track of previous observations. We also introduced a new latency-aware metric that quantifies the cost of data buffering, and how that affects the quality of the models in the real world, which are inevitably affected by latency. Results on the large-scale PacketATG4D show that our approach far outperforms the state-of-the-art in the packet setting that takes into account latency, while being competitive in the commonly adopted full sweep setting. For future work, we intend to expand the use of the memory module for long term tracking through occlusion and motion forecasting.