Recent advances in deep learning have improved the performance of modern perception systems on many tasks, such as object detection [zhou2018voxelnet, lu2021raanet, fan2021deep], semantic segmentation [pan2020cross, xiong2019adaptive], and visual navigation [du2020learning, pal2021learning]. Despite the remarkable progress, single-agent perception systems still have many limitations due to the single-view constraint. For instance, autonomous vehicles (AVs) usually suffer from occlusion [cooper], which is difficult to handle because of the lack of sensory observations of the occluded area. To address this issue, recent studies [wang2020v2vnet, f-cooper, li2021learning, xu2022opencood, Li_2021_ICCVW, xu2022v2xvit, FISITA2021] have explored wireless communication technology to enable nearby agents to share sensory information and collaboratively perceive the surrounding environment.
Although existing methods have obtained a significant 3D object detection performance boost, they assume that all collaborating agents share an identical model with the same parameters, which often does not hold in practice, particularly in autonomous driving. Distributing model parameters among AVs can raise privacy and confidentiality concerns, particularly for vehicles from distinct automotive companies. Relying on well-synchronized detectors is also unrealistic, as AVs may have different update frequencies. Without adequately handling this inconsistency, the shared sensory information can exhibit a large domain gap, and the advantage brought by multi-agent perception diminishes rapidly.
To this end, we propose a model-agnostic multi-agent perception framework that handles the heterogeneity between agents while maintaining confidentiality. Only the perception outputs (i.e., detected bounding boxes and confidence scores) are propagated, avoiding any reliance on the underlying model's details. Because different agents use distinct models, their confidence scores can be systematically misaligned, i.e., different agents have dissimilar confidence estimation biases: some agents may be over-confident, whereas others tend to be under-confident. Ignoring this bias and directly fusing bounding box proposals from neighboring agents via naive Non-Maximum Suppression (NMS) [neubeck2006efficient] can result in poor detection accuracy due to the presence of over-confident yet low-quality proposals; an example is illustrated in Fig. 2. In our framework, we integrate a simple and flexible uncertainty calibrator, called Doubly Bounded Scaling (DBS), to mitigate the misalignment. Moreover, in the bounding box aggregation stage, we propose a new module, Promotion-Suppression Aggregation (PSA), to replace the classical NMS; it leverages the box proposals' spatial correlation and agreement across agents to further refine the final results. The whole process does not reveal any details of the model design or parameters, ensuring confidentiality.
We evaluate our approach on an open-source large-scale multi-agent perception dataset, OPV2V [xu2022opencood]. Experiments show that when model discrepancies among agents are taken into account, our framework significantly improves multi-agent LiDAR-based 3D object detection performance, outperforming state-of-the-art methods by at least 6% in terms of Average Precision (AP).
II Related Work
Multi-Agent Perception. Multi-agent perception investigates how to leverage visual cues from neighboring agents through the communication system to enhance perception capability. Existing work falls into three categories according to the information sharing schema: 1) early fusion [cooper], where raw point clouds are transmitted directly and projected into the same coordinate frame; 2) late fusion [rawashdeh2018collaborative], where detected bounding boxes and confidence scores are shared; and 3) intermediate fusion [li2021learning, f-cooper, xu2022opencood, wang2020v2vnet, xu2022v2xvit], where compressed latent neural features extracted from point clouds are propagated. Though early fusion has no information loss, it usually requires large bandwidth. Intermediate fusion can achieve a good balance between accuracy and transmission data size, but it requires complete knowledge of each agent's model, which is non-trivial to satisfy in reality due to intellectual property concerns. In contrast, late fusion only needs the outputs of the detector without demanding access to the underlying neural networks, which are typically confidential for automotive companies. Therefore, our approach adopts the late fusion strategy but further designs customized new components to address the model discrepancy issue in vanilla late fusion.
3D LiDAR Detection. To tackle the irregular and unordered format of point clouds, researchers have developed point-based, voxel-based, and point-voxel-based methods. Frustum PointNet [qi2018frustum] uses 2D image detection bounding boxes to generate frustums on raw point clouds; the points within each frustum are then processed directly to obtain the final bounding box positions. PointRCNN [shi2019pointrcnn] develops a two-stage framework for 3D detection, which first produces rough bounding box proposals and then refines them in the second stage. [mccraith2021lifting] combines outlier detection [li2022ecod, zhao2021automatic, chen2021informative] and PointNet to make precise predictions. In [zhou2018voxelnet, Lang2019PointPillarsFE, Yan2018SECONDSE], point clouds are aggregated into voxels to generate latent features per voxel. Such approaches usually follow a one-stage fashion, with lower accuracy but also lower inference latency than the two-stage methods. [shi2020pv, shi2021pvplus] integrate both voxel-based networks and PointNet-based [qi2017pointnet] set abstraction to produce more robust point cloud features, keeping high learning efficiency while enjoying the flexible receptive fields of PointNet-based networks.
Uncertainty Calibration. Modern neural networks often produce confidence scores that deviate from their empirical accuracy [guo2017calibration]; uncertainty calibration aims to endow a classifier with the property that its confidence matches its accuracy. Calibration methods can be tightly coupled with the neural networks, such as Bayesian neural networks and regularization techniques [maddox2019simple, gal2016dropout, thulasidasan2019mixup], or serve as a post-processing step. Post-processing methods include histogram binning methods [zadrozny2001obtaining], scaling methods [platt1999probabilistic, zadrozny2002transforming], and mixtures [kumar2019verified] that combine the first two branches. Owing to the popularity of the Temperature Scaling method [guo2017calibration], a single-parameter version of Platt Scaling [platt1999probabilistic], scaling methods are widely adopted for calibrating neural networks. Our proposed method follows the same fashion.
Bounding Box Aggregation. Object detection models typically require bounding box aggregation to merge the proposals corresponding to the same object. The de facto standard post-processing method is Non-Maximum Suppression (NMS) [neubeck2006efficient, hosang2017learning], which sequentially selects the proposal with the highest confidence score and then suppresses other overlapping boxes. NMS does not fully exploit the information in the proposals, as it only uses the relative order of confidences, ignoring the absolute confidence scores and the spatial information hidden in the bounding box coordinates. Several works have been proposed to refine the box aggregation strategy. Soft-NMS [bodla2017soft] softly decays the confidence scores of the proposals in proportion to the degree of overlap. In [hosang2017learning], NMS is learned by a neural network to achieve better occlusion handling and bounding box localization. Adaptive NMS [liu2019adaptive] applies a dynamic suppression threshold to an instance according to the target object density. [rothe2014non] formulates NMS as a clustering problem and solves it with Affinity Propagation Clustering. Their idea of message passing between proposals is related to the PSA introduced in Section III-C, but the update rules of PSA are simpler and more efficient.
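For reference, the greedy procedure described above can be sketched in a few lines. This is a generic 2D axis-aligned illustration of classic NMS, not any detector's actual implementation; note that it uses only the relative order of the scores, which is exactly the limitation discussed.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """Axis-aligned IoU between one box and an array of boxes, format [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping boxes, repeat."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        ious = iou_one_to_many(boxes[i], boxes[order[1:]])
        order = order[1:][ious <= iou_thresh]  # suppress heavily overlapped boxes
    return keep
```

An over-confident agent whose proposals always sort first will dominate this loop regardless of their quality, which motivates the calibration step introduced later.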
III Method

In this paper, we consider cooperative perception in the context of a heterogeneous multi-agent system, where agents communicate to share sensing information from different perception models without revealing model information, i.e., model-agnostic collaboration. We focus on a 3D LiDAR detection task in autonomous driving, but the methodology can also be customized for other cooperative perception applications. Our goal is to develop a robust framework that handles the heterogeneity among agents while preserving confidentiality. We therefore propose a model-agnostic collective perception framework, shown in Fig. 3, which is divided into two stages. In the offline stage, we train a model-specific calibrator. During the online phase, real-time on-road sensing information is calibrated and aggregated.
III-A Model-agnostic Pipeline
Agents with distinct perception models usually generate systematically different confidence scores. This mismatch in confidence distributions can degrade the fusion when merging bounding boxes during collaboration. For instance, an inferior model may be over-confident and dominate the NMS process, decreasing the accuracy of the final results.
To alleviate this issue, we train a calibrator offline for each model to align its confidence scores with its empirical accuracy on a calibration dataset. Concretely, each agent first runs its pre-trained detector on the same public dataset to produce a local calibration dataset containing confidence scores and labels. The calibration dataset is then used to train a calibrator (see Section III-B for more details). After training, the calibrator is stored locally in each agent.
When the vehicle is driving on-road and making predictions from the sensor measurements, the calibrator aligns the predicted confidence scores to a common standard, thus alleviating the aforementioned mismatch. The bounding box coordinates and calibrated confidence scores are then packed together and transmitted to neighboring agents. The receiving agent (i.e., the ego vehicle) fuses the shared information via Promotion-Suppression Aggregation (see Section III-C for details) to output the final results. Since each agent trains its calibrator individually in the offline stage and only shares the detection outputs during the online phase, the detector architecture and parameters are invisible to other agents, protecting intellectual property.
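To make the communication interface concrete, a minimal sketch of the transmitted payload might look as follows. The class name, field names, and box parameterization are illustrative assumptions, not the paper's wire format; the point is that only detector outputs cross the channel, never model internals.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PerceptionMessage:
    """What one agent broadcasts: detector outputs only, never model internals.
    Field names and layout are illustrative, not taken from the paper."""
    agent_id: str
    boxes: np.ndarray    # (N, 7) boxes: x, y, z, dx, dy, dz, yaw in a shared frame
    scores: np.ndarray   # (N,) confidence scores AFTER local calibration

def broadcast(agent_id, boxes, raw_scores, calibrator):
    """Calibrate locally with the agent's own calibrator, then pack
    coordinates and calibrated scores for transmission."""
    return PerceptionMessage(agent_id, boxes, calibrator(raw_scores))
```

The `calibrator` argument is any callable mapping raw confidences to calibrated ones, so each agent can plug in its own privately trained calibrator.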
III-B Uncertainty Calibration
Well-calibrated uncertainty. To eliminate the impacts brought by the system heterogeneity, the models need to be well-calibrated. A model is well-calibrated when its confidence scores reflect the likelihood of correct prediction, i.e., predictions made with confidence $p$ are correct with frequency $p$. Formally, let $p \in [0, 1]$ be the confidence score produced by the model and $y \in \{0, 1\}$ be the label indicating vehicle or background (we discuss binary classification here for simplicity, but the proposed framework can be generalized to the multi-class case). A model is well-calibrated if its confidence score matches the expectation of correctly predicting the label:

$$\mathbb{P}(y = 1 \mid p) = p, \quad \forall p \in [0, 1]. \qquad (1)$$
Scaling-based uncertainty calibration. Our goal is to learn a parametric scaling function (i.e., calibrator) $g_\phi$ on a calibration dataset to transform the uncalibrated confidence scores $p$ into well-calibrated ones $q = g_\phi(p)$. Given a calibration set $\{(p_i, y_i)\}_{i=1}^{N}$ containing the model-dependent confidence scores $p_i$ and ground-truth labels $y_i$, we optimize the parameters $\phi$ of the calibrator by gradient descent on the binary cross-entropy loss

$$\mathcal{L}(\phi) = -\sum_{i=1}^{N} \big[\, y_i \log q_i + (1 - y_i) \log(1 - q_i) \,\big], \qquad (2)$$

where $q_i = g_\phi(p_i)$. Training a parametric function by optimizing Eq. 2 is similar to standard binary classification; however, in uncertainty calibration extra constraints are required on the scaling function.
Doubly Bounded Scaling (DBS). Designing a suitable calibrator for our application requires satisfying three conditions: 1) the scaling function needs to be monotonically non-decreasing, as a higher confidence score is supposed to indicate a higher expected accuracy; 2) the scaling function should be smooth rather than wiggly, to avoid over-fitting to the calibration set; 3) the scaling function is supposed to be doubly bounded, meaning that it maps the confidence interval $[0, 1]$ onto the same range. We propose to use the Kumaraswamy Cumulative Density Function (CDF) [kumaraswamy1980generalized], which meets all three constraints and has demonstrated good flexibility, as the scaling function family. To the best of our knowledge, this is the first time this function family has been adopted in uncertainty calibration. Specifically, we learn a scaling function of the following form:

$$g_{a,b}(p) = 1 - (1 - p^{a})^{b}, \qquad (3)$$

where $a > 0$ and $b > 0$ are the parameters. For each detector, we optimize these calibrator parameters on a calibration dataset by minimizing Eq. 2. Scaling functions that follow Eq. 3 are monotonically non-decreasing, smooth, and doubly bounded, hence the name Doubly Bounded Scaling (DBS).
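A minimal sketch of fitting DBS on a calibration set follows. The Kumaraswamy CDF form is as in Eq. 3; the optimizer choice (derivative-free Nelder-Mead over log-parameters to keep $a, b > 0$) and the clipping constants are implementation assumptions, not the paper's training procedure.

```python
import numpy as np
from scipy.optimize import minimize

def dbs(p, a, b):
    """Kumaraswamy CDF (Eq. 3): monotone, smooth, maps [0, 1] onto [0, 1]."""
    return 1.0 - (1.0 - p ** a) ** b

def fit_dbs(scores, labels):
    """Fit (a, b) by minimizing the binary cross-entropy of Eq. 2 on the
    calibration set. Parameters are optimized in log-space so they stay positive."""
    scores = np.clip(np.asarray(scores, dtype=float), 1e-6, 1 - 1e-6)
    labels = np.asarray(labels, dtype=float)

    def bce(theta):
        a, b = np.exp(theta)
        q = np.clip(dbs(scores, a, b), 1e-9, 1 - 1e-9)
        return -np.mean(labels * np.log(q) + (1 - labels) * np.log(1 - q))

    res = minimize(bce, x0=np.zeros(2), method="Nelder-Mead")  # start at a = b = 1
    return np.exp(res.x)
```

Starting at $a = b = 1$ corresponds to the identity map, i.e., leaving the scores unchanged, which is a natural initialization for a calibrator.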
Comparison with Platt Scaling. Here we compare DBS with one of the most widely used scaling methods, Platt Scaling (PS) [platt1999probabilistic], to show the merits of the proposed method. PS uses the logistic family:

$$g_{w,b}(z) = \frac{1}{1 + \exp(-(w z + b))}, \qquad (4)$$

where $w$ and $b$ are parameters, with $w \ge 0$ to ensure that the calibration map is monotonically non-decreasing. In Fig. 6, we can see that Eq. 3 is more flexible than the logistic form in Eq. 4. PS can fail if its parametric assumptions are not met [kull2017beta], for example, when an "inverse-sigmoid" scaling function is required (see the green curve in Fig. 6). Note that the identity function is also not a member of the logistic family but is included in Eq. 3. In addition to the limited flexibility, the logistic family does not naturally map $[0, 1]$ to $[0, 1]$ (its input domain is $\mathbb{R}$), so pre-processing of the inputs is required. By contrast, DBS is inherently doubly bounded.
III-C Promotion-Suppression Aggregation (PSA)
Although calibration can narrow the gap between the confidence and the ground-truth distribution, it operates independently in each agent, ignoring the spatial correlation and agreement of box proposals across agents. To leverage the bounding box spatial information aggregated from various agents, we propose a bounding box aggregation algorithm, named Promotion-Suppression Aggregation (PSA), which promotes bounding box proposals endorsed by an ensemble of nearby boxes. We first construct a spatial graph of bounding box proposals based on Intersection-over-Union (IoU) values and the confidence scores. The confidence scores are then propagated within each connected component as promotion messages. After that, the proposal with the largest refined score suppresses the scores of the other proposals. Finally, the suppressed score vector is thresholded into a binary vector that selects the output bounding boxes.
Let $G = (V, E)$ be a weighted graph with a set of edges $E$ and a set of vertices $V$, where each vertex $v_i$ represents a bounding box proposal $b_i$ with an associated confidence score $s_i$ after calibration. We draw an edge between vertices $v_i$ and $v_j$ if their corresponding boxes overlap, and define the edge weight between them as $w_{ij} = \mathrm{IoU}(b_i, b_j)$. The graph consists of a number of components in which each pair of vertices is connected via a path. Bounding box aggregation is essentially computing an index set $\mathcal{K}$ to select/filter the bounding box proposals based on the IoU matrix $W$ among box proposals and their confidence scores $s$. Algorithm 1 shows how PSA computes the index set. Given the IoU adjacency matrix, we find the indices of each component and collect them into a component set $\mathcal{C} = \{c_1, \dots, c_M\}$, where $M$ is the number of components and $c_k$ contains the indices of the vertices in the $k$-th component. For each component, we extract the IoU submatrix and the confidence score vector corresponding to that component. Then, we perform the promotion step, where each vertex updates its score with the IoU-weighted sum of the scores of the other vertices in the component. We design this promotion update rule to meet the following desiderata:
A proposal is more plausible if many other proposals agree with it;
Having confident neighbors brings a proposal more significant promotion;
Update rules that are parallelizable and permutation-invariant are favored.
In the suppression step, we normalize the updated scores back to $[0, 1]$ and separate the winning proposal "softly" via a temperature-controlled softmax. When the temperature $\tau$ is large and the selection threshold $\epsilon$ is small, multiple proposals can be selected. This is akin to Soft-NMS [bodla2017soft] and is beneficial when a small object is in front of a large object in image-based object detection. However, in our 3D object detection application, one component typically contains one object, so we use a small $\tau$. In the end, indices with updated confidence values larger than the threshold $\epsilon$ are added to the set $\mathcal{K}$. Note that PSA is highly parallelizable, as each component operates independently and each step only requires a simple linear search or a small matrix-vector multiplication.
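The steps above can be sketched as follows. This is not the paper's Algorithm 1: the promotion and suppression updates below simply follow the textual description, with the softmax temperature `tau` and a relative selection threshold `thresh` as assumed knobs.

```python
import numpy as np

def psa(iou, scores, tau=0.1, thresh=0.5):
    """Promotion-Suppression Aggregation sketch.
    iou: (N, N) IoU matrix of box proposals; scores: (N,) calibrated confidences.
    The exact update rules are assumptions based on the textual description."""
    n = len(scores)
    adj = iou > 0
    # connected components of the overlap graph via an iterative DFS
    comp_id = -np.ones(n, dtype=int)
    n_comp = 0
    for start in range(n):
        if comp_id[start] < 0:
            stack, comp_id[start] = [start], n_comp
            while stack:
                u = stack.pop()
                for v in np.flatnonzero(adj[u]):
                    if comp_id[v] < 0:
                        comp_id[v] = n_comp
                        stack.append(v)
            n_comp += 1
    keep = []
    for c in range(n_comp):
        idx = np.flatnonzero(comp_id == c)
        w = iou[np.ix_(idx, idx)]
        s = scores[idx]
        # promotion: add the IoU-weighted scores of the OTHER proposals
        promoted = s + (w - np.diag(np.diag(w))) @ s
        # suppression: normalize back to [0, 1], then sharpen with a softmax
        norm = promoted / promoted.max()
        soft = np.exp(norm / tau) / np.exp(norm / tau).sum()
        keep.extend(idx[soft > thresh * soft.max()].tolist())
    return sorted(keep)
```

With a small `tau`, the softmax is sharply peaked, so each component typically keeps exactly one winner, matching the one-object-per-component case discussed above.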
IV-A Dataset

We evaluate the proposed framework on a large-scale open-source multi-agent perception dataset, OPV2V [xu2022opencood], which is simulated using the high-fidelity simulator CARLA [dosovitskiy2017carla] and the cooperative driving automation simulation framework OpenCDA [xu2021opencda]. It includes scenarios with an average of seconds duration. In each scene, various numbers ( to ) of Autonomous Vehicles (AVs) provide LiDAR point clouds from their viewpoints. The train/validation/test splits are frames, respectively. For details of the dataset, please refer to [xu2022opencood].
IV-B Experiment Setup
Evaluation metric. Following [xu2021opencda], we evaluate the detection accuracy in the range of and , centered at the ego-vehicle coordinate frame. The detection performance is measured with Average Precision (AP) at .
Evaluation setting. We evaluate our method under three different settings: 1) Homo Setting, where the detectors of agents are homogeneous with the same architecture and trained parameters. This setting has no confidence distribution gap and is used to demonstrate the performance drop when taking heterogeneity into account; 2) Hetero Setting 1, where the agents have the same model architecture but different parameters; 3) Hetero Setting 2, where the detector architectures are disparate. For Homo Setting, we select pre-trained Pointpillar [pointpillar] as the backbone for all the AVs. For Hetero Setting 1, the ego vehicle employs the same pre-trained Pointpillar model as in Homo Setting
, whereas the other AVs use the parameters of Pointpillar from a different epoch during training. Likewise, in the Hetero Setting 2, the ego vehicle utilizes Pointpillar while the other AVs use SECOND [Yan2018SECONDSE] for detection. As intermediate fusion requires equal feature map resolution, we apply simple bi-linear interpolation under this setting. The ego vehicle uses the identical model with the same parameters across all settings for No Fusion and Late Fusion. To compare with existing calibrators, we use the same calibration method for all agents, but the parameters are agent-specific. The proposed framework should also work even when the calibration methods across agents are heterogeneous, as long as the prediction bias is effectively reduced.
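As a sketch of the resolution-matching step for intermediate fusion, a feature map can be bilinearly resized with `scipy.ndimage.zoom`. The function below is an illustrative assumption of how such matching could be done, not the framework's actual code.

```python
import numpy as np
from scipy.ndimage import zoom

def match_resolution(feat, target_hw):
    """Bilinearly resize a (C, H, W) feature map to a target (H', W') so that
    intermediate features from heterogeneous backbones can be fused."""
    c, h, w = feat.shape
    th, tw = target_hw
    # order=1 selects (bi)linear interpolation; channels are left untouched
    return zoom(feat, (1, th / h, tw / w), order=1)
```

Resizing aligns spatial dimensions only; it does not close the semantic gap between feature spaces of different models, which is one reason intermediate fusion degrades under heterogeneity.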
Compared methods. We regard No Fusion as the baseline, which only takes the ego vehicle's LiDAR data as input and omits any collaboration. Ideally, a multi-agent system should at least outperform this baseline. To validate the necessity of calibration, we compare our method with naive late fusion and intermediate fusion, which ignore calibration. The naive late fusion gathers all detected bounding box positions and confidence scores together and simply applies NMS to produce the final results. The intermediate fusion method is the same as the one in [xu2022opencood]. We exclude early fusion from the comparison as it requires large bandwidth, which leads to high communication delay and is thus impractical to deploy in the real world. Moreover, we also compare the proposed Doubly Bounded Scaling (DBS) with two other commonly used scaling-based calibrators: Temperature Scaling (TS) [guo2017calibration] and Platt Scaling (PS) [platt1999probabilistic].
| Method (AP) | Homo | Hetero 1 | Hetero 2 |
|---|---|---|---|
| Intermediate fusion w/o calibration | 0.815 | 0.677 | 0.571 |
| Late fusion w/o calibration | 0.781 | 0.691 | 0.723 |
IV-C Quantitative Evaluation
Main performance analysis. Table I compares the performance of the different methods under the Homo, Hetero 1, and Hetero 2 Settings. In the unrealistic Homo Setting, all methods exceed the baseline markedly, while intermediate fusion and our method perform nearly identically (0.2% difference). However, once the realistic model discrepancy factor is considered, our method outperforms the classic late fusion and intermediate fusion significantly: by 5.9% and 7.3% under the Hetero 1 Setting, and by 6.1% and 21.3% under the Hetero 2 Setting, respectively. The classic late fusion and intermediate fusion suffer from the model discrepancy, leading to clear accuracy decreases; in the Hetero 2 Setting, intermediate fusion even falls below the baseline. By contrast, our method only drops around 6% and 3% under the two realistic settings, indicating the effectiveness of the proposed calibration for the heterogeneity of the multi-agent perception system. Note that although our framework is primarily designed to handle heterogeneous situations, we also obtain a performance boost under the Homo Setting compared with the standard late fusion that shares detection proposals. We attribute this gain to PSA and to the filtering of low-confidence proposals after uncertainty calibration, which removes some potential false positives.
Major component analysis. Here we investigate the contribution of each component by incrementally adding DBS and PSA. Table II reveals that both modules contribute to the performance boost, with the calibration contributing more, increasing the AP by 4.3% and 5.3%.
Uncertainty calibration evaluation. Fig. 9 shows the reliability diagram of Pointpillar used by the ego vehicle, in which a perfect calibration will produce a diagonal reliability curve, indicating the real accuracy matches the predictive confidence score. Reliability curves under or above the diagonal line represent over-confident or under-confident models, respectively. Pointpillar has much higher empirical accuracy than its reported confidence score. When using NMS to fuse the predictions of Pointpillar with that of another inaccurate but over-confident detector, the under-estimated confidence will result in the removal of Pointpillar’s good predictions. After being calibrated by DBS, the reliability curve of Pointpillar lies on the diagonal line.
Comparison with other calibration methods. Fig. 11 compares our DBS calibration with other calibration methods, including TS and PS. DBS achieves better performance than the others under both heterogeneous settings. Moreover, PSA also improves the accuracy across different calibrators and experimental settings, showing its general capability to refine the prediction results.
IV-D Qualitative Results
Fig. 18 shows the detection results of intermediate fusion, classic late fusion, and our method under Hetero1 and Hetero2 Setting. Our method can identify more objects while keeping very few false positives. More importantly, our method can regress the bounding box positions more accurately (see the zoom-in example), indicating the robustness against the model discrepancy in multi-agent perception systems.
V Conclusion

In the context of cooperative perception, agents from different stakeholders have heterogeneous models. Due to confidentiality concerns, information about the models and their parameters should not be revealed to other agents. In this work, we present a model-agnostic collaboration framework that addresses two critical challenges of the vanilla late fusion strategy. First, we propose a Doubly Bounded Scaling uncertainty calibrator to align the confidence score distributions of different agents. Second, the novel Promotion-Suppression Aggregation algorithm further improves the detection accuracy by fully exploiting the shared information, namely bounding box spatial congruence and confidence score propagation. Experiments on a large-scale cooperative perception dataset shed light on the necessity of model calibration across heterogeneous agents. The results show that combining the two proposed techniques improves the state of the art in cooperative 3D object detection when different agents use distinct perception models.