Detecting and localizing objects forms a critical component of any autonomous driving platform [geiger2013vision, nuscenes2019]. As a result, self-driving cars (SDCs) are equipped with a variety of sensors such as cameras, LiDARs, and radars [cho2014multi, thrun2006stanley], which the perception system must use to create an accurate 3D representation of the world. Due to the nature of the driving task, the perception system must operate in real-time and in a highly variable operating environment [kim2013parallel]. Of these sensors, LiDAR is one of the most critical, as it natively provides high-resolution, accurate 3D data about the environment. Despite LiDAR's importance and the uniqueness of point clouds as a modality, object detection systems for SDCs look remarkably similar to systems designed for generic camera imagery.

†† Denotes equal contribution and authors for correspondence. JN proposed the idea and implemented the model. BC, JN, BY, YC, ZF and VV developed the infrastructure and experimented with the model. WH, VV, PS, JS built the evaluation framework. XY, YZ, PN and OA developed early pieces of infrastructure and the dataset. JS, JN, VV, BC and others wrote the manuscript.
Object detection research has matured rapidly for camera images, with systems designed to solve camera-specific challenges such as multiple overlapping objects, large intra-class scale variance due to camera perspective, and object occlusion [girshick2014rich, girshick2015fast, faster_rcnn, lin2016feature, lin2017focal]. These modality-specific challenges make the task of localizing and classifying objects in imagery uniquely difficult: an object may occupy any pixel, and neighboring objects may be as close as one pixel apart. This necessitates treating every location and scale in the image equally, which naturally aligns with the use of convolutional networks for feature extraction [girshick2014rich, girshick2015fast]. While convolutional operations have been heavily optimized for parallelized hardware architectures, scaling these methods to high-resolution images is difficult because computational cost scales quadratically with image resolution.
In contrast, LiDAR is naturally sparse; 3D objects have real-world scale with no perspective distortions, and rarely overlap. Additionally, in SDC perception, not every location in the scene is equally important [zeng2019end, bojarski2016end, bansal2018chauffeurnet], and that importance can change dynamically based on the local environment and context. Despite large modality- and task-specific differences, the best performing methods for 3D object detection re-purpose camera-based detection architectures. Several methods apply convolutions to discretized representations of point clouds in the form of a projected Birds Eye View (BEV) image [yang2018pixor, luo2018fast, yang2018hdnet, lang2018pointpillars] or a 3D voxel grid [zhou2018voxelnet, yan2018second]. Alternatively, methods that operate directly on point clouds have re-purposed the two-stage object detector design, replacing the feature extraction operations but still adopting the same camera-inspired region proposal stage [yang2018ipod, shi2019pointrcnn, qi2018frustum].
In this paper, we revisit the design of object detection systems in the context of 3D LiDAR data and propose a new framework which better matches the data modality and the demands of SDC perception. We start by recognizing that 3D region proposals in LiDAR are fundamentally distinct from 2D proposals in imagery: every reflected point (above the road) must belong to an object or surface. In this setting, we demonstrate that efficient sampling schemes on point clouds – with zero learned parameters – are sufficient for generating region proposals. In addition to being computationally inexpensive, sampling has the advantage of implicitly exploiting the sparsity of the data by matching the data distribution of the scene. We process each proposed region completely independently, with no global context nor shared information. Finally, we entirely avoid any discretization procedure and instead featurize the native point cloud [qi2017pointnet, qi2017pointnet++] in order to classify objects and regress bounding box locations [faster_rcnn, ren2015faster]. By revisiting several design assumptions underlying current approaches, we arrive at a non-convolutional, point-based object detector with no learned proposals and no global context.
The resulting detector is as accurate as the state of the art at lower inference cost, and more accurate at similar inference costs. In addition, these design decisions result in several key benefits. First, because the proposal method naturally exploits point cloud sparsity, the model does not waste computation on empty regions. Second, because the model does not use global context and is completely point-based, one can dynamically vary the number of proposals and the number of points per proposal at inference time. The latter feature scales the cost of inference linearly, allowing a single trained model to operate at different computational loads. This permits upper-bounding a model's performance or reducing the deployed run-time cost. Finally, because each region is completely independent, one may select at run-time where to allocate region proposals based on context. For example, a deployed system could exploit priors (e.g. an HD map or temporal information) to target where in the scene to run the detector. In summary, our main contributions are as follows:
Introduce a computationally-flexible, non-convolutional, point-based object detector geared towards SDC perception. In the process, we demonstrate that cheap proposals on point clouds, paired with a simple point-based network, result in a system competitive with the state of the art on self-driving car benchmarks.
Highlight how a single model designed in this fashion may adapt its inference cost or selectively target specific locations of interest. For instance, a single trained pedestrian model may exceed the predictive performance of a baseline convolutional model at a similar computational demand; or, the same model without retraining may achieve similar predictive performance at a fraction of the computational demand.
2.1 Object detection in images
Object detection has a rich history in image processing, fueled by camera-based academic datasets [lin2014microsoft]. Early systems employed two stages: the first to propose candidate detection locations, and the second to discriminate whether a given proposal is an object of interest [felzenszwalb2010object, dean2013fast]. Initially, the first stage employed convolutional sliding windows [sermanet2013pedestrian], foreground classification, image segmentation, and selective search based on segmentation and clustering [uijlings2013selective]. The advent of convolutional neural networks (CNNs) [krizhevsky2012imagenet] highlighted that a CNN featurization may provide superior proposals as well as improve the final discriminative stage [girshick2014rich, ren2015faster, girshick2015fast].
In modern detection systems based on CNNs, maximizing prediction performance requires densely sampling an image for all possible object locations. This usually requires a computationally-heavy first-stage featurization to provide high quality bounding box proposals [girshick2014rich, girshick2015fast], and implies that for a single image, the second stage of the detector must be run on each proposal. Such heavy computational demands become prohibitive in constrained environments (e.g. mobile and edge devices), such that speed versus accuracy trade-offs must be considered [huang2017speed].
To address these concerns and the added complexity of a two-stage system, recent work has focused on one-stage object detectors that attempt to predict bounding box locations and object identity in a single inference operation of a CNN [liu2016ssd, redmon2016you]. Although single-stage systems provide faster inference, they generally exhibit worse predictive performance than two-stage systems [huang2017speed]. That said, recent advances in redesigning the loss functions have mitigated this disadvantage significantly [lin2017focal]. The speed and reduced complexity of a one-stage model do, however, come with an associated cost: by basing the inference procedure on convolutions which densely sample an image, the resulting model must treat all spatial locations equally. In the context of an SDC, this design decision hampers the ability to adapt computation to the current scene or latency requirements.
2.2 Featurizations of point clouds
Raw data from many types of sensors (e.g. LiDAR, RGB-D) come in the form of point clouds. Although point clouds may be projected into traditional dense images or into 3D voxel grids, on-going efforts have attempted to design models that operate directly on point cloud data.
A point cloud consists of a set of 3-D points $\{x_i\}_{i=1}^{N}$, indexed by $i$, each of which may carry an associated feature vector. Fundamentally, the set of points is unordered and may be of arbitrary size, depending on the number of reflections identified by a sensor in a single scan. Ideally, learned representations for point clouds aim to be invariant to permutations of the index $i$ and agnostic to the number of points $N$ in a given example [qi2017pointnet, qi2017pointnet++]. Such considerations lead to a large class of point-based models, some of which are derived to mimic convolutions [wang2018deep].
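As a concrete illustration, a permutation-invariant featurizer can be sketched as a shared per-point transform followed by a symmetric pooling operation; the weights and dimensions below are placeholders for this sketch, not the paper's learned parameters:

```python
import numpy as np

def featurize_points(points, weights):
    """Permutation-invariant featurization of an (N, 3) point set.

    A minimal sketch in the spirit of PointNet: a shared per-point
    linear layer with ReLU, followed by a symmetric (max) aggregation.
    Because max pooling ignores row order, the output is invariant to
    any permutation of the input points.
    """
    per_point = np.maximum(points @ weights, 0.0)  # shared layer, shape (N, D)
    return per_point.max(axis=0)                   # symmetric pooling, shape (D,)

rng = np.random.default_rng(0)
pts = rng.normal(size=(128, 3))
w = rng.normal(size=(3, 8))

f1 = featurize_points(pts, w)
f2 = featurize_points(pts[::-1], w)  # same points, reversed order
assert np.allclose(f1, f2)           # invariant to point ordering
```

The same function also accepts any number of input points, matching the size-agnosticism described above.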
2.3 Object detection with point clouds
Object detection in point clouds started with porting ideas from the image-based object detection literature. By voxelizing a point cloud (i.e. identifying a grid location for each individual point) into a series of stacked image slices describing occupancy, one may employ traditional CNN techniques for object detection on the resulting images or voxel grids (e.g. [yang2018pixor, zhou2018voxelnet, yan2018second, luo2018fast, yang2018hdnet]). (Note that the third dimension is handled by generating one image within a range of height values; thus, to cover a 3-dimensional volume, a voxelization may result in multiple stacked images.)
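A minimal sketch of such a voxelization, producing a single BEV occupancy slice (the ranges and resolution below are illustrative assumptions; real systems stack several height slices and richer per-cell features):

```python
import numpy as np

def bev_occupancy(points, x_range, y_range, resolution):
    """Project an (N, 3+) point cloud to a Birds Eye View occupancy image.

    Bins x/y coordinates at `resolution` meters per pixel and marks
    occupied cells with 1; points outside the ranges are discarded.
    """
    w = int((x_range[1] - x_range[0]) / resolution)
    h = int((y_range[1] - y_range[0]) / resolution)
    img = np.zeros((h, w), dtype=np.uint8)
    ix = ((points[:, 0] - x_range[0]) / resolution).astype(int)
    iy = ((points[:, 1] - y_range[0]) / resolution).astype(int)
    valid = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)
    img[iy[valid], ix[valid]] = 1
    return img
```

Note how the image size, and hence the downstream convolutional cost, is fixed by the chosen ranges and resolution regardless of how sparse the point cloud is.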
Employing a grid (image or voxel) representation for point clouds in object detection presents potential drawbacks. Even when ignoring the height dimension in 3D by using a Birds Eye View (BEV) representation, convolutions can be expensive, and the computational demand grows roughly as $O(HW)$, where $H$ and $W$ are the height and width of the image. Running inference on high-resolution images is often computationally infeasible; in practice, this constraint limits the input pixel resolution at which CNNs may operate [huang2017speed]. Given the large spatial range of LiDAR, selecting a grid resolution coarse enough to meet this constraint (e.g. a fixed number of meters per pixel [lang2018pointpillars]) results in a severely degraded image in which high-resolution details provided by the sensor are removed. This results in systematically worse performance on smaller objects such as pedestrians [yang2018pixor, zhou2018voxelnet, yan2018second, yang2018hdnet], which may occupy only several pixels in a voxelized image.
For these reasons, many authors explored building detection systems that operate directly on representations of the point cloud data. For instance, VoxelNet partitions 3-D space and encodes LiDAR points within each partition with a point cloud featurization [zhou2018voxelnet]. The result is a fixed-size feature map, on which a conventional CNN-based object detection architecture may be applied. Likewise, PointPillars [lang2018pointpillars] proposes an object detector that employs a point cloud featurization, providing input into a gridded feature representation for use in a feature pyramid network [lin2016feature]; the resulting per-pillar features are combined with anchors for every pillar to perform joint classification and regression. The resulting network achieves a high level of predictive performance with minimal computational cost on small scenes, but its fixed grid increases in cost notably on larger scenes and cannot adapt to each scene’s unique data distribution.
In the vein of two stage detection systems, PointRCNN [shi2019pointrcnn] employs a point cloud featurizer [qi2017pointnet++] to make proposals via an expensive per-point segmentation network into foreground and background. Subsequently, a second stage operates on cropped featurizations to perform the final classification and localization. Finally, other works propose bounding boxes through a computationally intensive, learned proposal system operating on paired camera images [yang2018ipod, qi2018frustum], with the goal of improving predictive performance by leveraging a camera image to seed a proposal system to maximize recall.
Our goal is to construct a detector that better aligns with the requirements of a SDC perception system, taking advantage of the sparsity of the data, allowing us to target where to spend computation, and operating on the native data. To address these goals, we propose a sparse targeted object detector, termed StarNet: Given a sparse sampling of locations in the point cloud, the model extracts a small (random) subset of neighboring points. The model featurizes the point cloud [qi2017pointnet], classifies the region, and regresses bounding box parameters. Importantly, the object location is predicted relative to the selected location and does not employ any global information. This setup ensures that each spatial location may be processed by the detector completely independently. An overview of this method is depicted in Figure 1 and Appendix B.
The structure of the proposed system confers two advantages. First, inference on each proposal occurs completely independently, enabling computation across locations to be parallelized to decrease inference latency. Second, simple heuristics or external side information [bojarski2016end, yang2018hdnet] may be used to rank-order which proposals to process. As a result, the amount of computation spent on each spatial region in the scene may be targeted to system priorities based on resource availability.
Note that this approach blurs the line between one- and two-stage detectors, with the fraction of points sampled determining how densely the scene is sampled: dense sampling acts like a one-stage detector, whereas sparse sampling acts like a two-stage detector. The rest of this section describes the architecture of StarNet in more detail.
3.1 Center selection
We propose using an inexpensive, data-dependent algorithm to generate proposals from LiDAR with high recall. In contrast to prior work [yang2018pixor, lang2018pointpillars, yan2018second], we do not base proposals on fixed grid locations, but instead generate proposals to respect the observed data distribution in a scene.
Concretely, we sample points from the point cloud and use their coordinates as proposals. In this work, we explore two sampling algorithms, visualized in the Appendix: random uniform sampling and farthest point sampling (FPS). Random uniform sampling provides a simple and effective baseline because it biases towards densely populated regions of space. In contrast, farthest point sampling selects points sequentially such that each new point is maximally far from all previously selected points, maximizing spatial coverage across the point cloud. This approach permits varying the number of proposals from a small, sparse set to a large, dense set that covers the point cloud. The former is reminiscent of the sparse coverage of two-stage detectors [faster_rcnn, ren2015faster], whereas the latter is more akin to the dense coverage and prediction task of single-stage detectors [redmon2016you, liu2016ssd].
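Farthest point sampling as described above can be sketched in a few lines; this greedy O(nk) formulation is the standard one, though the paper does not specify its exact implementation:

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Select k proposal centers from an (N, 3) point cloud via FPS.

    Start from a random point, then greedily pick the point whose
    minimum distance to all previously selected points is largest,
    maximizing spatial coverage.
    """
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(points)))]
    # Running minimum distance from every point to the selected set.
    dists = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return points[selected]
```

Random uniform sampling, by comparison, is simply `points[rng.choice(len(points), k, replace=False)]`, which inherits the density bias discussed above.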
3.2 Featurizing local point clouds
After obtaining a proposal location, we featurize the local point cloud around it. We randomly select up to $K$ points within a radius of $R$ meters of each proposal center; in our experiments, $K$ is typically between 32 and 1024, and $R$ is 2-3 meters. All local points are re-centered to an origin at each proposal, and each point retains its associated LiDAR features. We experimented with several architectures for featurizing native point cloud data [qi2017pointnet++, wu2018pointconv] but most closely followed [xu2018powerful]. The resulting architecture is agnostic to the number of points provided as input [qi2017pointnet++, wu2018pointconv, xu2018powerful]. Specifically, we use a 5-layer featurizer that outputs a 384-dimensional feature for every cell, which is employed for regression and classification (see Appendix B for details).
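The local gathering step might be sketched as follows; the parameter names are assumptions for illustration, with `radius` and `max_points` mirroring the 2-3 meter and 32-1024 ranges quoted above:

```python
import numpy as np

def gather_local_points(points, center, radius=3.0, max_points=64, seed=0):
    """Collect and re-center a random subset of points near a proposal.

    `points` is (N, 3+f): xyz coordinates followed by f LiDAR features,
    which are carried through unchanged. Coordinates are shifted so the
    proposal center becomes the origin.
    """
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(points[:, :3] - center, axis=1)
    idx = np.flatnonzero(d < radius)
    if len(idx) > max_points:
        idx = rng.choice(idx, size=max_points, replace=False)
    local = points[idx].copy()
    local[:, :3] -= center  # re-center to the proposal origin
    return local
```

Because each proposal's subset is gathered and re-centered independently, proposals can be batched or distributed freely at inference time.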
3.3 Constructing final predictions from bounding box proposals
We provide a brief overview here and reserve the details for Appendix B. The current design uses a grid of anchor offsets relative to each cell center, and each offset may employ different rotations or anchor dimension priors. We emphasize that, unlike in single-stage detectors [lang2018pointpillars, yan2018second], the anchor grid placement is data-dependent since it is based on the proposals. We project each featurized cell to a $D$-dimensional feature vector at each location offset, from which we predict classification logits and bounding box regression logits following [yan2018second, lang2018pointpillars]. Ground truth labels are assigned to individual anchors based on their intersection-over-union (IoU) overlap [yan2018second, lang2018pointpillars]. To make final predictions, we employ an oriented, 3-D, multi-class version of non-maximal suppression (NMS) [girshick2014rich].
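The suppression step can be illustrated with a simplified greedy variant; note that the version used here is oriented, 3-D, and multi-class, whereas this sketch uses axis-aligned 2-D boxes:

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximal suppression over (x1, y1, x2, y2) boxes.

    Repeatedly keeps the highest-scoring remaining box and discards any
    box overlapping it above `iou_threshold`. Returns kept indices.
    """
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(xx2 - xx1, 0) * np.maximum(yy2 - yy1, 0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]
    return keep
```

The oriented 3-D version replaces the intersection computation with rotated-box overlap but follows the same greedy structure.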
We construct the proposed detection system and experiment with various formulations of the sampling strategy. We start with the KITTI object detection benchmark [geiger2013vision], predicting 2-D and 3-D rotated bounding boxes solely from LiDAR point cloud data and comparing these results to other recent work. For the majority of the work, we apply the method to a large-scale self-driving dataset, the Waymo Open Dataset, focused on car and pedestrian detection.
We briefly summarize training procedures here and reserve details for the Appendix. (We plan to open-source code at http://github.com/tensorflow/lingvo.) We train models using the Adam optimizer with an exponentially-decaying learning rate schedule starting at 1e-3 and decaying over 650 epochs. We perform hyper-parameter tuning through cross-validated studies and perform final evaluations on the corresponding test datasets.
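An exponentially-decaying schedule of this form might be sketched as follows; the decay rate and step horizon are illustrative assumptions, as the text only specifies the 1e-3 starting rate and the roughly 650-epoch decay:

```python
def exp_decay_lr(step, init_lr=1e-3, decay_rate=0.96, decay_steps=1000):
    """Exponentially-decaying learning rate for use with Adam.

    Multiplies the initial rate by `decay_rate` for every `decay_steps`
    training steps (continuous, non-staircase form).
    """
    return init_lr * decay_rate ** (step / decay_steps)
```

In a TensorFlow-based codebase such as Lingvo this corresponds to the standard exponential-decay schedule passed to the optimizer.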
In the following section, we first investigate a range of sampling strategies evaluated on KITTI and Waymo Open Dataset. Second, we train the detection system on the heavily-pursued KITTI benchmark and compare these results to other work. Next, we switch exclusively to the large-scale Waymo Open Dataset in order to highlight how this model may aid flexible computation for vehicle and pedestrian detection. In particular, we apply a single trained StarNet model operating across a range of computational settings. We investigate the relative merits of this approach over other state-of-the-art methods.
4.1 Sampling strategies for point cloud detections
As a first step in constructing the object detection system, we explored two simple strategies for naively sampling point clouds: random sampling and farthest point sampling (Section 3.1). A figure in the Appendix showcases typical proposals each mechanism generates on scenes from the KITTI validation set. Note that random sampling places many centers in dense regions, whereas farthest point sampling maximizes spatial coverage of the scene.
To quantify the efficacy of each proposal method, we measure coverage as a function of the number of proposals. Coverage is defined as the fraction of annotated objects containing 5+ points whose bounding boxes overlap a sampled box at or above a given IoU. Figure 2 plots the coverage of each method at a fixed IoU of 0.5 for cars in KITTI [geiger2013vision] and the Waymo Open Dataset. Both methods achieve monotonically higher coverage with a greater number of proposals, with coverage on KITTI exceeding 98% within 256 samples. Because random sampling is heavily biased towards regions containing many points, it tends to oversample large objects and undersample regions containing few points. Farthest point sampling, by contrast, distributes samples across the spatial extent of the point cloud (Section 3.1). We observe that FPS provides uniformly better coverage for a fixed number of proposals, and we employ this technique for the rest of the work.
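The coverage metric can be sketched as follows, using axis-aligned 2-D IoU as a simplified stand-in for the rotated overlap used in the actual evaluation:

```python
import numpy as np

def iou_2d(a, b):
    """Axis-aligned IoU between two (x1, y1, x2, y2) boxes (a simplified
    stand-in for the oriented IoU used in the paper's evaluation)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def coverage(gt_boxes, proposal_boxes, iou_threshold=0.5):
    """Fraction of ground-truth boxes matched by at least one proposal
    at or above `iou_threshold`."""
    matched = [any(iou_2d(g, p) >= iou_threshold for p in proposal_boxes)
               for g in gt_boxes]
    return sum(matched) / len(matched)
```

Sweeping the number of proposals and re-evaluating `coverage` yields curves like those in Figure 2.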
4.2 Object detection for point clouds
We first evaluate StarNet on the popular KITTI self-driving benchmark. KITTI is relatively small compared to other deep learning datasets, and data augmentation has been pursued aggressively in the field; we likewise employ the same data augmentations for point clouds and bounding box labels [yang2018pixor, zhou2018voxelnet, yan2018second, luo2018fast, yang2018hdnet, lang2018pointpillars]. Briefly, we note that gains in predictive performance due to data augmentation (up to +18.0, +16.9 and +30.5 mAP on car, pedestrian and cyclist, respectively) are substantially larger than the gains observed across advances in detection architectures (see Appendix A).
We take our best system for 3-D object detection with the same data augmentations and compare its efficacy to previously reported state-of-the-art systems that operate only on point cloud data [zhou2018voxelnet, yan2018second, lang2018pointpillars, yang2018hdnet]. Table 1 reports the 3-D detection results on the KITTI test server. StarNet provides mAP scores on car, pedestrian and cyclist that are competitive with other state-of-the-art methods, exceeding them on subsets of each category's strata. Given that data augmentation plays such a strong role in determining model performance (Appendix A), we consider these results to indicate that the model is comparable to other methods in overall predictive performance, and instead focus on the performance of the model on new tasks with a substantially larger dataset (Sections 4.3 and 4.4).
4.3 Targeted computation for point cloud detection
We next explore how a single trained detection model may be deployed effectively across a range of computational costs and targeted to specific areas in the point cloud. We demonstrate these results on pedestrian and vehicle detection in the large-scale Waymo Open Dataset (see the Appendix for dataset details). We address this question by training a detection model and demonstrating how it may be deployed across a range of settings without retraining. In addition, we show how simple heuristics may steer this computation.
We selected the single best trained model from Section 4.2 but systematically altered the manner in which we run inference. Because bounding box proposals are cheap, the vast majority of the computational cost is associated with evaluating each proposal; hence, total computational cost grows linearly with the number of proposals. (Sub-linear growth may be possible by amortizing matrix multiplications across proposals. Notably, because inference for each proposal is independent, proposals may be computed in parallel across machines; in terms of walltime, the detector may therefore evaluate all proposals concurrently.) As a first test, we demonstrate that the performance of a single trained vehicle model degrades gracefully as the number of proposals decreases (Figure 3). Degradation with random sampling is significantly worse, suggesting that the manner in which one selects bounding box proposals is critical.
We employ the same model but add a simple ranking of proposals by distance, to demonstrate that computation may be targeted in a manner that reflects importance to an SDC. Here, we sample 1024 proposals, rank them by distance from the vehicle, and restrict the model to evaluating only the K nearest proposals, for several values of K. We compare detection accuracy versus distance against the default model with 1024 proposals (Figure 4). The same model with one eighth of the proposals and computational cost achieves the same mean average precision (mAP) up to 10 meters; likewise, with fewer than half the proposals and computational cost, it achieves nearly the same accuracy as the default model up to 20 meters. Although proximity to the car is a simple heuristic for sorting proposals, more sophisticated methods for assessing detection priority may be possible (see Discussion).
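The distance-ranking heuristic amounts to a sort and truncation; this sketch assumes the ego vehicle sits at the origin of the point cloud frame:

```python
import numpy as np

def rank_proposals_by_distance(centers, k):
    """Keep only the k proposal centers nearest the ego vehicle.

    `centers` is (N, 3); distance is measured in the BEV (x, y) plane
    from the origin, assumed here to be the vehicle's position.
    """
    d = np.linalg.norm(centers[:, :2], axis=1)
    return centers[np.argsort(d)[:k]]
```

Any other priority function (e.g. one derived from an HD map) could be substituted for the distance computation without changing the rest of the pipeline.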
4.4 Leveraging the detector on a large-scale self-driving dataset
We next leverage the flexible nature of the detector to demonstrate how this formulation may lead to a detection system in which computation may be dedicated along particular dimensions. To demonstrate the relative merits, we trained StarNet models on pedestrians and vehicles and compared the performance of each model to a family of baseline models. Data augmentations employed for KITTI did not improve performance here, so no augmentations were used. We employed PointPillars as a baseline model (our custom implementation achieves 74.5, 57.1, and 59.0 mAP for cars, pedestrians, and cyclists, respectively, on KITTI validation at moderate difficulty; these numbers are slightly lower than the results reported by the authors [lang2018pointpillars]), training 5 grid resolutions of this model for each class and validating accuracy on all annotated bounding boxes with 5+ LiDAR points. Each version employs a different input spatial resolution for the pseudo-image (128, 192, 256, 384, and 512 pixel spatial grids), with 16K to 32K non-zero featurized locations (pillars). Following the paper, vehicle models had an output stride of 2, while pedestrian models had an output stride of 1. We hypothesized that a single-stage object detector would exhibit trade-offs in detecting small objects based on the resolution of the image projection. Indeed, we observe in Figure 5 (black points) that higher spatial resolutions achieve higher precision for pedestrians and vehicles, but with a computational cost that grows quadratically.
We also examined the performance of a single StarNet model while systematically altering the manner in which it is evaluated. We explore two strategies for altering computational demand: varying the number of proposals and (because the point featurizer is agnostic to the number of points) varying the number of points supplied to the model per proposed region. Each blue curve in Figure 5 traces out computational cost versus predictive performance for a given number of points per region, while varying the number of proposals from 64 to 1024. (For comparison, the baseline models featurize 16K to 32K spatial locations for each prediction.) Many of these points lie above the baseline curve, indicating that StarNet provides favorable performance relative to a family of separately-trained baseline models. In particular, the same pedestrian detection model (e.g., StarNet-128 with 1024 centers) may achieve a relative gain in predictive performance at a computational budget similar to the baseline PointPillars model; or, the same model may achieve predictive performance similar to the most accurate PointPillars model at a fraction of the computational budget. Again, we emphasize that all StarNet points arise from a single trained model. These results demonstrate how one may employ a single trained StarNet flexibly through manipulations at inference time.
In this work, we presented a non-convolutional detection system that operates on native point cloud data. The sampling method is designed to match the sparsity of point cloud data while allowing the system to be flexibly targeted across a range of computational priorities. We demonstrated that the resulting detector is competitive with state-of-the-art detection systems on the KITTI object detection benchmark [geiger2013vision] and on the large-scale Waymo Open Dataset. Finally, we demonstrated how, in principle, the detection system may be employed to target spatial locations without retraining or sacrificing prediction quality. For instance, depending on evaluation settings, a single trained pedestrian model may exceed the predictive performance of a baseline convolutional model at a similar computational demand; or, the same model may achieve the same predictive performance at a fraction of the FLOPS cost.
Given the results presented, we foresee multiple avenues for further improving the fidelity of the system including: multi-sensor fusion with cameras [yang2018ipod, qi2018frustum], employing semantic information such as road maps to spatially target detections [yang2018hdnet], leveraging temporal information [luo2018fast] or restoring global context by removing conditional independence from each proposal. Finally, we are particularly interested in studying how this system may be amenable to object tracking [gordon2018re] as we suspect that because of the design, the computational demands may scale as the difference between successive time points as opposed to operating on the entirety of the scene [feichtenhofer2017detect].
We have focused this first work on simple, computationally inexpensive methods for sampling proposals. More expensive or learned methods may further improve the system [faster_rcnn], including the possibility of learning simple ranking algorithms to order the relative importance of proposals for a self-driving planning system [cohen1998learning].
We wish to thank Tsung-Yi Lin, Chen Wu, Junhua Mao, Drago Anguelov, George Dahl, Anelia Angelova and the larger Google Brain and Waymo teams for support and feedback.
Appendix A Data Augmentation on KITTI
Table 2: Effect of data augmentation on KITTI validation (mAP, easy / moderate / hard).

| | Car | Pedestrian | Cyclist |
| + global augment | 85.4 / 74.8 / 72.7 | 46.3 / 39.4 / 36.2 | 75.8 / 51.9 / 48.7 |
| + bbox augment | 86.6 / 77.0 / 74.6 | 63.9 / 56.5 / 52.4 | 81.5 / 58.3 / 54.8 |
We employed the following data augmentations, culled from a survey of the recent literature [luo2018fast, yang2018pixor, yang2018hdnet, zhou2018voxelnet, yan2018second, lang2018pointpillars]: per-bounding-box rotation, y-axis scene flipping, world coordinate scaling (0.95, 1.05), global rotation, and ground-truth bounding box augmentation (copy-pasting objects from different scenes).
In order to parse the relative benefit of various data augmentations strategies to the overall performance of the model, we selectively removed data augmentations before training the model and report the corresponding results in Table 2. We find that the addition of data augmentations for the bounding box labels as well as augmentations for global alterations of the point clouds substantially improved the detection performance. Furthermore, both sets of augmentations are additive in terms of improving predictive performance. In particular, we note that some of the gains in predictive performance (up to +30.5 mAP on cyclist) are substantially larger than the gains in performance observed across advances in detection architectures [yan2018second, lang2018pointpillars].
Appendix B Architecture and Training of StarNet
B.1 Ground removal from point clouds
One important feature for the efficiency of the proposal system is removing the points associated with ground reflections. We follow previous work and remove points whose z-dimension falls below a threshold [yan2018second, lang2018pointpillars], although more sophisticated methods are possible [bojarski2016end]. For KITTI, we use a fixed threshold. For the Waymo Open Dataset, we computed lower and upper percentiles of the center z location of all objects and kept only points with z coordinate in that range. In spite of this heuristic, FPS still covers many ground points; higher-quality ground removal would decrease the number of centers required.
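A minimal version of this z-band filter might look like the following; the threshold values are caller-supplied, since the exact values used are not reproduced here:

```python
import numpy as np

def remove_ground(points, z_min, z_max=None):
    """Drop points below a z threshold, and optionally above a ceiling.

    `points` is (N, 3+); z is column 2. Passing only `z_min` matches the
    fixed-threshold KITTI heuristic, while passing both bounds matches
    the percentile-band heuristic used for the Waymo Open Dataset.
    """
    mask = points[:, 2] > z_min
    if z_max is not None:
        mask &= points[:, 2] < z_max
    return points[mask]
```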
B.2 Point cloud featurization
The StarNet point featurizer (Figure 6) closely follows ideas from graph neural networks [xu2018powerful]. We experimented with different choices of network architecture and found that using max aggregation, concatenation combination, and mean readout performed well. By design, the same trained network can be used with a varying number of input points, giving it a large degree of flexibility.