Model-Agnostic Multi-Agent Perception Framework

by Weizhe Chen, et al.
Indiana University

Existing multi-agent perception systems assume that every agent utilizes the same models with identical parameters and architecture, which is often impractical in the real world. The significant performance boost brought by the multi-agent system can be degraded dramatically when the perception models are noticeably different. In this work, we propose a model-agnostic multi-agent framework to reduce the negative effect caused by model discrepancies and maintain confidentiality. Specifically, we consider the perception heterogeneity between agents by integrating a novel uncertainty calibrator which can eliminate the bias among agents' predicted confidence scores. Each agent performs such calibration independently on a standard public database, and therefore the intellectual property can be protected. To further refine the detection accuracy, we also propose a new algorithm called Promotion-Suppression Aggregation (PSA) that considers not only the confidence score of proposals but also the spatial agreement of their neighbors. Our experiments emphasize the necessity of model calibration across different agents, and the results show that our proposed approach outperforms the state-of-the-art baseline methods for 3D object detection on the open OPV2V dataset.





I Introduction

Recent advancements in deep learning have improved the performance of modern perception systems on many tasks, such as object detection [zhou2018voxelnet, lu2021raanet, fan2021deep], semantic segmentation [pan2020cross, xiong2019adaptive], and visual navigation [du2020learning, pal2021learning]. Despite the remarkable progress, single-agent perception systems still have many limitations due to single-view constraints. For instance, autonomous vehicles (AVs) often suffer from occlusion [cooper], which is difficult to handle because of the lack of sensory observations of the occluded area. To address this issue, recent studies [wang2020v2vnet, f-cooper, li2021learning, xu2022opencood, Li_2021_ICCVW, xu2022v2xvit, FISITA2021] have explored wireless communication technology to enable nearby agents to share sensory information and collaboratively perceive the surrounding environment.

Although existing methods have achieved a significant boost in 3D object detection performance, they assume that all collaborating agents share an identical model with the same parameters, which often does not hold in practice, particularly in autonomous driving. Distributing model parameters among AVs raises privacy and confidentiality concerns, especially for vehicles from different automotive companies, and relying on well-synchronized detectors is fragile because AVs may have different update frequencies. Without adequately handling this inconsistency, the shared sensory information can exhibit a large domain gap, and the advantage brought by multi-agent perception diminishes rapidly.


Fig. 2: Ground truth (green) and bounding box proposals (red) produced by three connected autonomous vehicles. (a) Some agents' confidence scores are systematically larger than others' (e.g., the blue scores versus the orange scores), yet they might be over-confident and provide misleading proposals. Fusing the proposals according to confidence scores without proper calibration may remove the correct proposals. (b) Proposals with slightly lower confidence scores (orange) but higher spatial agreement with neighboring boxes can be better than a singleton with a higher confidence score (blue).

To this end, we propose a model-agnostic multi-agent perception framework to handle the heterogeneity between agents while maintaining confidentiality. The perception outputs (i.e., detected bounding boxes and confidence scores) are propagated to avoid relying on the underlying model's detailed information. Because different agents use distinct models, their confidence scores can be systematically misaligned, i.e., different agents have dissimilar confidence estimation biases. Some agents may be over-confident, whereas others tend to be under-confident. Ignoring this bias and directly fusing bounding box proposals from neighboring agents via naive Non-Maximum Suppression (NMS) [neubeck2006efficient] can result in poor detection accuracy due to the presence of over-confident yet low-quality proposals. An example is illustrated in Fig. 2. In our framework, we integrate a flexible and simple uncertainty calibrator, called Doubly Bounded Scaling (DBS), to mitigate the misalignment. Moreover, in the bounding box aggregation stage, we introduce a new module, Promotion-Suppression Aggregation (PSA), to replace the classical NMS and leverage box proposals' spatial correlation and agreement across agents, which further refines the final results. The whole process does not reveal any details of model design or parameters, ensuring confidentiality.

We evaluate our approach on OPV2V [xu2022opencood], an open-source large-scale multi-agent perception dataset. Experiments show that when model discrepancies exist among agents, our framework significantly improves multi-agent LiDAR-based 3D object detection performance, outperforming state-of-the-art methods by at least 6% in Average Precision (AP).

II Related Work

Multi-Agent Perception. Multi-agent perception investigates how to leverage visual cues from neighboring agents through a communication system to enhance perception capability. Existing work falls into three categories according to the information-sharing schema: 1) early fusion [cooper], where raw point clouds are transmitted directly and projected into the same coordinate frame; 2) late fusion [rawashdeh2018collaborative], where detected bounding boxes and confidence scores are shared; and 3) intermediate fusion [li2021learning, f-cooper, xu2022opencood, wang2020v2vnet, xu2022v2xvit], where compressed latent neural features extracted from point clouds are propagated. Though early fusion has no information loss, it usually requires large bandwidth. Intermediate fusion can achieve a good balance between accuracy and transmitted data size, but it requires complete knowledge of each agent's model, which is non-trivial to satisfy in reality due to intellectual property concerns. In contrast, late fusion only needs the outputs of the detector, without demanding access to the underlying neural networks, which are typically confidential for automotive companies. Therefore, our approach adopts the late fusion strategy but designs new customized components to address the model discrepancy issue in vanilla late fusion.

3D LiDAR Detection. To tackle the irregular and unordered data format of point clouds, researchers have developed point-based, voxel-based, and point-voxel-based methods. Frustum PointNet [qi2018frustum] uses 2D image detection bounding boxes to generate frustums on raw point clouds; the points inside the frustums are then processed directly to obtain the final bounding box positions. PointRCNN [shi2019pointrcnn] develops a two-stage framework for 3D detection, which first produces rough bounding box proposals and then refines them in the second stage. mccraith2021lifting combines outlier detection [li2022ecod, zhao2021automatic, chen2021informative] and PointNet to make precise predictions. In [zhou2018voxelnet, Lang2019PointPillarsFE, Yan2018SECONDSE], point clouds are aggregated into voxels to generate latent features per voxel. Such approaches usually follow a one-stage fashion, with lower accuracy but lower inference latency than the two-stage methods. [shi2020pv, shi2021pvplus] integrate both voxel-based networks and PointNet-based [qi2017pointnet] set abstraction to produce more robust point cloud features, keeping high learning efficiency while enjoying the flexible receptive fields of PointNet-based networks.

Uncertainty Calibration. For a probabilistic classifier, the probability associated with the predicted class label should reflect its correctness likelihood. However, many modern neural networks do not have this property [guo2017calibration]. Uncertainty calibration aims to endow a classifier with it. Calibration methods can be tightly coupled with the neural networks, such as Bayesian neural networks and regularization techniques [maddox2019simple, gal2016dropout, thulasidasan2019mixup], or serve as a post-processing step. Post-processing methods include histogram binning methods [zadrozny2001obtaining], scaling methods [platt1999probabilistic, zadrozny2002transforming], and mixtures [kumar2019verified] that combine the first two branches. Due to the popularity of the Temperature Scaling method [guo2017calibration], which is a single-parameter version of Platt Scaling [platt1999probabilistic], scaling methods are widely adopted for calibrating neural networks. Our proposed method follows the same fashion.

Bounding Box Aggregation. Object detection models typically require bounding box aggregation to merge the proposals corresponding to the same object. The de facto standard post-processing method is Non-Maximum Suppression (NMS) [neubeck2006efficient, hosang2017learning], which sequentially selects the proposal with the highest confidence score and then suppresses other overlapping boxes. NMS does not fully exploit the information in the proposals because it only uses the relative order of confidence, ignoring the absolute confidence scores and the spatial information hidden in the bounding box coordinates. Several works have been proposed to refine the box aggregation strategy. Soft-NMS [bodla2017soft] softly decays the confidence scores of the proposals proportionally to the degree of overlap. In [hosang2017learning], NMS is learned by a neural network to achieve better occlusion handling and bounding box localization. Adaptive NMS [liu2019adaptive] applies a dynamic suppression threshold to an instance according to the target object density. rothe2014non formulate NMS as a clustering problem and use Affinity Propagation Clustering to solve it. Their idea of message passing between proposals is related to the PSA introduced in Section III-C, but the update rules of PSA are simpler and more efficient.
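As a reference point for the aggregation strategies discussed above, classical greedy NMS can be sketched in a few lines. This is a minimal illustration for axis-aligned 2D boxes, not code from the paper; the box format (x1, y1, x2, y2) and the IoU threshold are assumptions.

```python
import numpy as np

def iou(box_a, box_b):
    """Axis-aligned 2D IoU; boxes are (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and
    suppress every remaining box that overlaps it too much."""
    order = np.argsort(scores)[::-1]   # only the relative order of scores matters
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        remaining = [j for j in order[1:] if iou(boxes[i], boxes[j]) < iou_thresh]
        order = np.array(remaining, dtype=int)
    return keep
```

Note that the absolute score values and the spatial layout of the suppressed boxes are discarded, which is exactly the information PSA later exploits.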

Fig. 3: Overview of our proposed framework. It consists of an offline stage to train the calibrator locally and an online stage to perform on-road collaboration.

III Methodology

In this paper, we consider cooperative perception in the context of a heterogeneous multi-agent system, where agents communicate to share sensing information from different perception models without revealing model details, i.e., model-agnostic collaboration. We focus on a 3D LiDAR detection task in autonomous driving, but the methodology can also be customized for other cooperative perception applications. Our goal is to develop a robust framework that handles the heterogeneity among agents while preserving confidentiality. We therefore propose a model-agnostic collective perception framework, as shown in Fig. 3, which can be divided into two stages. In the offline stage, each agent trains a model-specific calibrator. During the online phase, real-time on-road sensing information is calibrated and aggregated.

III-A Model-Agnostic Pipeline

Agents with distinct perception models usually generate systematically different confidence scores. This mismatch in confidence distributions can lead to biased fusion when merging bounding boxes during collaboration. For instance, an inferior model may be over-confident and dominate the NMS process, decreasing the accuracy of the final results.

To alleviate this issue, we train a calibrator offline for each model to align its confidence scores with its empirical accuracy on a calibration dataset. Concretely, each agent first runs its pre-trained detector on the same public dataset to produce a local calibration dataset containing confidence scores and labels. The calibration dataset is then used to train a calibrator (see Section III-B for more details). After training, the calibrator is stored locally in each agent.

When the vehicle is driving on-road and making predictions from the sensor measurements, the calibrator aligns the predicted confidence scores to the same standard, thus alleviating the aforementioned mismatch. The bounding box coordinates and calibrated confidence scores are then packed together and transmitted to neighboring agents. The receiving agent (i.e., the ego vehicle) fuses the shared information via Promotion-Suppression Aggregation (see Section III-C for details) to output the final results. Since each agent trains its calibrator individually in the offline stage and only shares the detection outputs during the online phase, the detector architecture and parameters are invisible to other agents, protecting the intellectual property.

III-B Uncertainty Calibration

Well-calibrated uncertainty. To eliminate the impact of system heterogeneity, the models need to be well-calibrated. We regard a model as well-calibrated when its confidence scores reflect the likelihood of correct prediction, for example, when predictions made with confidence $\hat{p}$ are correct with frequency $\hat{p}$. Formally, let $\hat{p} \in [0, 1]$ be the confidence score produced by the model and $y \in \{0, 1\}$ be the label indicating vehicle or background (we discuss binary classification here for simplicity, but the proposed framework can be generalized to the multi-class case). A model is well-calibrated if its confidence score matches the expectation of correctly predicting the label:

$$\mathbb{P}(y = 1 \mid \hat{p} = p) = p, \quad \forall p \in [0, 1]. \tag{1}$$
Scaling-based uncertainty calibration. Our goal is to learn a parametric scaling function (i.e., calibrator) $g_{\theta}$ on a calibration dataset to transform the uncalibrated confidence scores $\hat{p}$ into well-calibrated ones $q = g_{\theta}(\hat{p})$. Given a calibration set $\{(\hat{p}_i, y_i)\}_{i=1}^{N}$ containing the model-dependent confidence scores $\hat{p}_i$ and ground-truth labels $y_i$, we optimize the parameters $\theta$ of the calibrator by gradient descent on the binary cross-entropy loss

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \left[ y_i \log q_i + (1 - y_i) \log (1 - q_i) \right], \tag{2}$$

where $q_i = g_{\theta}(\hat{p}_i)$. Training a parametric function by optimizing Eq. 2 is similar to standard binary classification; in uncertainty calibration, however, extra constraints are required on the scaling function.

Doubly Bounded Scaling (DBS). Designing a suitable calibrator for our application requires satisfying three conditions: 1) the scaling function needs to be monotonically non-decreasing, as a higher confidence score is supposed to indicate higher expected accuracy; 2) the scaling function should be smooth rather than wiggly, to avoid over-fitting the calibration set; 3) the scaling function is supposed to be doubly bounded, meaning that it maps the confidence interval $[0, 1]$ to the same range. We propose to use the Kumaraswamy Cumulative Density Function (CDF) [kumaraswamy1980generalized] as the scaling function family, which meets all three constraints and has demonstrated good flexibility. To the best of our knowledge, this is the first time this function family has been adopted in uncertainty calibration. Specifically, we learn a scaling function of the following form

$$g_{\theta}(\hat{p}) = 1 - \left(1 - \hat{p}^{\,a}\right)^{b}, \tag{3}$$

where $a > 0$ and $b > 0$ are the parameters. For each detector, we optimize these calibrator parameters on a calibration dataset by minimizing Eq. 2. Scaling functions that follow Eq. 3 are monotonically non-decreasing, smooth, and doubly bounded, hence the name Doubly Bounded Scaling (DBS).
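A calibrator of this form is straightforward to fit. The sketch below is an illustrative NumPy implementation, not the authors' code: it parameterizes a, b > 0 through exponentials and minimizes the binary cross-entropy of Eq. 2 with plain finite-difference gradient descent; the learning rate, step count, and the synthetic over-confident detector are all assumptions for the demo.

```python
import numpy as np

def dbs(p, a, b):
    """Doubly Bounded Scaling (Eq. 3): the Kumaraswamy CDF on [0, 1]."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return 1.0 - (1.0 - p ** a) ** b

def bce(log_params, p, y):
    """Binary cross-entropy (Eq. 2) of the calibrated scores."""
    a, b = np.exp(log_params)                 # exp keeps a, b > 0
    q = np.clip(dbs(p, a, b), 1e-7, 1.0 - 1e-7)
    return -np.mean(y * np.log(q) + (1.0 - y) * np.log(1.0 - q))

def fit_dbs(p, y, lr=0.1, steps=3000, h=1e-5):
    """Fit (a, b) by gradient descent with finite-difference gradients."""
    log_params = np.zeros(2)                  # start at a = b = 1 (identity map)
    for _ in range(steps):
        grad = np.zeros(2)
        for i in range(2):
            d = np.zeros(2)
            d[i] = h
            grad[i] = (bce(log_params + d, p, y)
                       - bce(log_params - d, p, y)) / (2 * h)
        log_params -= lr * grad
    return tuple(np.exp(log_params))          # (a, b)

# Hypothetical over-confident detector: its true accuracy is p**2,
# below its reported confidence p. DBS should pull scores down.
rng = np.random.default_rng(0)
conf = rng.uniform(0.05, 0.95, 5000)
labels = (rng.uniform(size=5000) < conf ** 2).astype(float)
a, b = fit_dbs(conf, labels)
```

Since 1 - (1 - p^2)^1 = p^2, this particular correction lies exactly inside the family at (a, b) = (2, 1), so the fitted map should land near it.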

(a) Logistic

(b) Kumaraswamy
Fig. 6: Scaling functions with various parameters that follow (a) the logistic form and (b) the Kumaraswamy CDF. Note that, in (b), the “inverse-sigmoid” shape (green curve, $a, b < 1$) and the identity map (orange curve, $a = b = 1$) are not in the logistic family.

Comparison with Platt Scaling. Here we compare DBS with one of the most widely used scaling methods, Platt Scaling (PS) [platt1999probabilistic], to show the merits of the proposed method. PS uses the logistic family:

$$g_{\theta}(z) = \frac{1}{1 + \exp\left(-(wz + c)\right)}, \tag{4}$$

where $w$ and $c$ are parameters, with $w \ge 0$ to ensure that the calibration map is monotonically non-decreasing. In Fig. 6, we can see that Eq. 3 is more flexible than the logistic form in Eq. 4. PS can fail if its parametric assumptions are not met [kull2017beta], for example, when an “inverse-sigmoid” scaling function is required (see the green curve in Fig. 6(b)). Note that the identity function is also not a member of the logistic family but is included in Eq. 3. In addition to this limited flexibility, the logistic family does not naturally map $[0, 1]$ to $[0, 1]$ (its input domain is $\mathbb{R}$), so pre-processing of the inputs is required. In contrast, DBS is inherently doubly bounded.
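The flexibility argument can be checked numerically. The snippet below (illustrative, not from the paper) confirms that the Kumaraswamy family contains the identity map at a = b = 1 and produces both sigmoid-like and inverse-sigmoid shapes, whereas a logistic curve can be neither the identity nor inverse-sigmoid; the specific parameter values are arbitrary.

```python
import numpy as np

def kumaraswamy_cdf(p, a, b):
    """Eq. 3: the DBS scaling family, mapping [0, 1] onto [0, 1]."""
    return 1.0 - (1.0 - p ** a) ** b

p = np.linspace(0.0, 1.0, 101)

identity = kumaraswamy_cdf(p, 1.0, 1.0)   # a = b = 1: exactly the identity map
s_shape = kumaraswamy_cdf(p, 2.0, 2.0)    # a, b > 1: sigmoid-like S-shape
inverse = kumaraswamy_cdf(p, 0.5, 0.5)    # a, b < 1: "inverse-sigmoid" shape
```

At low confidence the S-shaped curve lies below the diagonal while the inverse-sigmoid curve lies above it, which is the qualitative difference Fig. 6(b) highlights.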

III-C Promotion-Suppression Aggregation (PSA)


Algorithm 1 Promotion-Suppression Aggregation

Arguments: bounding boxes $\mathcal{B} = \{b_1, \dots, b_n\}$
                      confidence score vector $\mathbf{s} \in [0, 1]^n$
                      softmax temperature $\tau$
                      binarization parameter $\epsilon$

1: Initialize the set of selected box indices $\mathcal{S} \leftarrow \emptyset$
2: Compute IoU matrix $\mathbf{W}$ using $\mathcal{B}$
3: Find the index sets $C_1, \dots, C_K$ of the connected components of $\mathbf{W}$
4: for each $C_k$ do
5:      Extract sub-matrix $\mathbf{W}_k$ via indices in $C_k$
6:      Extract sub-vector $\mathbf{s}_k$ via indices in $C_k$
7:      Promotion step: $\tilde{\mathbf{s}}_k \leftarrow \mathbf{W}_k \mathbf{s}_k$
8:      Suppression step: $\hat{\mathbf{s}}_k \leftarrow \mathrm{softmax}(\tilde{\mathbf{s}}_k / \tau)$
9:      Find $\mathcal{I}_k \leftarrow \{\, i \in C_k : \hat{s}_{k,i} > \epsilon \,\}$
10:     Update $\mathcal{S} \leftarrow \mathcal{S} \cup \mathcal{I}_k$
11: return selected bounding box proposal indices $\mathcal{S}$

Although the calibration can narrow the gap between the confidence and ground-truth distributions, it operates independently in each agent, ignoring the spatial correlation and agreement of box proposals across all agents. To leverage the bounding box spatial information aggregated from various agents, we propose a bounding box aggregation algorithm, named Promotion-Suppression Aggregation (PSA), which promotes bounding box proposals endorsed by an ensemble of nearby boxes. We first construct a spatial graph of bounding box proposals based on Intersection-over-Union (IoU) values and the confidence scores. Then the confidence scores are propagated within each connected component as promotion messages. After that, the proposal with the largest refined score suppresses the scores of the other proposals. Finally, the suppressed score vector is binarized with a threshold to select the output bounding boxes.

Let $G = (V, E)$ be a weighted graph with a set of edges $E$ and a set of vertices $V$, where each vertex represents a bounding box proposal with an associated confidence score after calibration. We draw an edge between vertices $i$ and $j$ if their corresponding boxes overlap, and define the edge weight between them to be the IoU of the two boxes. The graph consists of a number of connected components, in which each pair of vertices is connected via a path. Bounding box aggregation is essentially computing an index set $\mathcal{S}$ to select/filter the bounding box proposals based on the IoU matrix $\mathbf{W}$ among box proposals and their confidence scores $\mathbf{s}$. Algorithm 1 shows how PSA computes this index set. Given the IoU adjacency matrix, we find the indices of each component and put them into a component set $\{C_1, \dots, C_K\}$, where $K$ is the number of components and $C_k$ contains the indices of the vertices in the $k$-th component. For each component, we extract the IoU matrix $\mathbf{W}_k$ and confidence score vector $\mathbf{s}_k$ corresponding to that component. Then, we perform the promotion step, where each vertex updates its score to the IoU-weighted sum of scores from the vertices in its component. We design this promotion update rule to meet the following desiderata:

  • A proposal is more plausible if many other proposals agree with it;

  • Having confident neighbors brings a proposal more significant promotion;

  • Update rules that are parallelizable and permutation-invariant are favored.

In the suppression step, we normalize the updated scores back to $[0, 1]$ and separate the winning proposal “softly” via a softmax with temperature $\tau$. When $\tau$ is large and $\epsilon$ is small, multiple proposals can be selected. This is akin to Soft-NMS [bodla2017soft] and is beneficial when a small object sits in front of a large object in image-based object detection. However, in our 3D object detection application, one component typically contains one object, so we use a small $\tau$. In the end, indices with updated confidence values larger than the threshold $\epsilon$ are added to the set $\mathcal{S}$. Note that PSA is highly parallelizable, as each component operates independently and each step only requires a simple linear search or a small matrix-vector multiplication.
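Under the reading above (promotion as an IoU-weighted sum, suppression as a tempered softmax followed by thresholding), PSA can be sketched as follows. This is an illustrative NumPy version of Algorithm 1, not the authors' implementation; the default temperature and threshold are placeholders.

```python
import numpy as np

def psa(iou_mat, scores, tau=0.1, eps=0.5):
    """Promotion-Suppression Aggregation over a precomputed IoU matrix.

    iou_mat : (n, n) symmetric IoU matrix of the box proposals
    scores  : (n,) calibrated confidence scores
    tau     : softmax temperature for the suppression step
    eps     : binarization threshold for selecting proposals
    """
    n = len(scores)
    adj = iou_mat > 0.0
    # Label the connected components of the overlap graph (iterative DFS).
    comp = np.full(n, -1)
    n_comp = 0
    for seed in range(n):
        if comp[seed] >= 0:
            continue
        stack = [seed]
        while stack:
            v = stack.pop()
            if comp[v] >= 0:
                continue
            comp[v] = n_comp
            stack.extend(np.flatnonzero(adj[v] & (comp < 0)).tolist())
        n_comp += 1

    keep = []
    for c in range(n_comp):
        idx = np.flatnonzero(comp == c)
        W, s = iou_mat[np.ix_(idx, idx)], scores[idx]
        promoted = W @ s                         # promotion: IoU-weighted support
        z = np.exp((promoted - promoted.max()) / tau)
        suppressed = z / z.sum()                 # suppression: tempered softmax
        keep.extend(idx[suppressed > eps].tolist())
    return sorted(keep)
```

A singleton component gets softmax output 1 and is always kept, matching the intuition that an isolated proposal has no competitors to suppress it.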

IV Experiments

IV-A Dataset

We evaluate the proposed framework on OPV2V [xu2022opencood], a large-scale open-source multi-agent perception dataset simulated using the high-fidelity simulator CARLA [dosovitskiy2017carla] and the cooperative driving automation simulation framework OpenCDA [xu2021opencda]. Each scene contains a varying number of Autonomous Vehicles (AVs) that provide LiDAR point clouds from their respective viewpoints. For the scenario durations, the train/validation/test split sizes, and other details of the dataset, please refer to [xu2022opencood].

IV-B Experiment Setup

Evaluation metric. Following [xu2021opencda], we evaluate the detection accuracy within a fixed range centered at the ego-vehicle coordinate frame. The detection performance is measured with Average Precision (AP) at an IoU threshold of 0.7.

Evaluation setting. We evaluate our method under three different settings: 1) Homo Setting, where the detectors of all agents are homogeneous, with the same architecture and trained parameters. This setting has no confidence distribution gap and is used to demonstrate the performance drop when heterogeneity is taken into account; 2) Hetero Setting 1, where the agents have the same model architecture but different parameters; 3) Hetero Setting 2, where the detector architectures are disparate. For the Homo Setting, we select a pre-trained Pointpillar [pointpillar] as the backbone for all the AVs. For Hetero Setting 1, the ego vehicle employs the same pre-trained Pointpillar model as in the Homo Setting, whereas the other AVs use Pointpillar parameters from a different training epoch. Likewise, in Hetero Setting 2, the ego vehicle utilizes Pointpillar while the other AVs use SECOND [Yan2018SECONDSE] for detection. As intermediate fusion requires equal feature map resolution, we apply simple bi-linear interpolation under this setting. The ego vehicle uses the identical model with the same parameters across all settings for No Fusion and Late Fusion. To compare with existing calibrators, we use the same calibration method for all agents, but the parameters are agent-specific. The proposed framework should also work even when the calibration methods across agents are heterogeneous, as long as the prediction bias is effectively reduced.

Compared methods. We regard No Fusion as the baseline, which only takes the ego vehicle's LiDAR data as input and omits any collaboration. Ideally, a multi-agent system should at least outperform this baseline. To validate the necessity of calibration, we compare our method with naive late fusion and intermediate fusion, which ignore calibration. The naive late fusion gathers all detected bounding box positions and confidence scores together and simply applies NMS to produce the final results. The intermediate fusion method is the same as the one in [xu2022opencood]. We exclude early fusion from the comparison as it requires large bandwidth, which leads to high communication delay and is thus impractical to deploy in the real world. Moreover, we also compare the proposed Doubly Bounded Scaling (DBS) with two other commonly used scaling-based calibrators: Temperature Scaling (TS) [guo2017calibration] and Platt Scaling (PS) [platt1999probabilistic].

Method                         Homo     Hetero1   Hetero2
No Fusion                      0.602    0.602     0.602
Intermediate w/o calibration   0.815    0.677     0.571
Late fusion w/o calibration    0.781    0.691     0.723
Our method                     0.813    0.750     0.784

TABLE I: Object detection performance. Average Precision (AP) at IoU=0.7 under the Homo, Hetero1, and Hetero2 settings.

IV-C Quantitative Evaluation

Main performance analysis. Table I presents the performance comparison of different methods under the Homo, Hetero1, and Hetero2 Settings. In the unrealistic Homo setting, all fusion methods exceed the baseline remarkably, while intermediate fusion and our method perform very closely (0.2% difference). However, once the realistic model discrepancy factor is considered, our method outperforms the classic late fusion and intermediate fusion significantly: by 5.9% and 7.3% under the Hetero1 Setting, and by 6.1% and 21.3% under the Hetero2 Setting, respectively. The classic late fusion and intermediate fusion suffer from the model discrepancy, leading to clear accuracy decreases; in the Hetero2 Setting, intermediate fusion even falls below the baseline. In contrast, our method only drops around 6% and 3% under the two realistic settings, indicating the effectiveness of the proposed calibration for heterogeneous multi-agent perception systems. Note that although our framework is designed to handle heterogeneous situations, we also obtain a performance boost under the Homo Setting compared with the standard late fusion that shares detection proposals. We attribute this gain to PSA and to the filtering of low-confidence proposals after uncertainty calibration, which removes some potential false positives.

DBS    PSA    Hetero1 AP@0.7    Hetero2 AP@0.7
              0.691             0.723
 ✓            0.734             0.776
 ✓      ✓     0.750             0.784

TABLE II: Component ablation study (components added incrementally).

Major component analysis. Here we investigate the contribution of each component by incrementally adding DBS and PSA. Table II reveals that both modules contribute to the performance boost, while the calibration contributes more, increasing the AP by 4.3% under Hetero1 and 5.3% under Hetero2.

Uncertainty calibration evaluation. Fig. 9 shows the reliability diagram of the Pointpillar model used by the ego vehicle, in which perfect calibration produces a diagonal reliability curve, indicating that the empirical accuracy matches the predicted confidence score. Reliability curves below or above the diagonal line represent over-confident or under-confident models, respectively. Pointpillar has much higher empirical accuracy than its reported confidence score, i.e., it is under-confident. When using NMS to fuse the predictions of Pointpillar with those of another inaccurate but over-confident detector, the under-estimated confidence will result in the removal of Pointpillar's good predictions. After being calibrated by DBS, the reliability curve of Pointpillar lies on the diagonal line.
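The reliability diagram and the associated Expected Calibration Error (ECE) can be computed by simple binning. The sketch below is a generic implementation of these standard diagnostics, not tied to the paper's code; the bin count of 10 is a common default.

```python
import numpy as np

def reliability_curve(conf, correct, n_bins=10):
    """Bin predictions by confidence; return (mean confidence,
    empirical accuracy) per non-empty bin. A well-calibrated model
    yields points close to the diagonal."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            xs.append(conf[mask].mean())
            ys.append(correct[mask].mean())
    return np.array(xs), np.array(ys)

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error: bin-weighted |accuracy - confidence|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            err += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return err
```

On a synthetic under-confident model (accuracy consistently above confidence, as observed for Pointpillar), every point of the curve sits above the diagonal and the ECE is correspondingly large.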

Comparison with other calibration methods. Fig. 11 compares our DBS calibration with other calibration methods, including TS and PS. DBS achieves better performance than the others under both heterogeneous settings. Moreover, PSA also improves the accuracy across different calibrators and experimental settings, showing its general capability to refine the prediction results.

IV-D Qualitative Results

Fig. 18 shows the detection results of intermediate fusion, classic late fusion, and our method under the Hetero1 and Hetero2 Settings. Our method identifies more objects while producing very few false positives. More importantly, it regresses the bounding box positions more accurately (see the zoomed-in example), indicating robustness against model discrepancy in multi-agent perception systems.

Fig. 9: Reliability diagram of Pointpillar before and after calibration.



Fig. 11: Comparison of different calibrators.

(a) Intermediate Fusion in Hetero1

(b) Late Fusion in Hetero1

(c) Ours in Hetero1

(d) Intermediate Fusion in Hetero2

(e) Late Fusion in Hetero2

(f) Ours in Hetero2
Fig. 18: Qualitative comparison in a busy freeway and a congested intersection. Green and red 3D bounding boxes represent the ground truth and predictions, respectively. Our method yields more accurate detection results.

V Conclusions

In the context of cooperative perception, agents from different stakeholders have heterogeneous models. Due to confidentiality concerns, information about the models and parameters should not be revealed to other agents. In this work, we present a model-agnostic collaboration framework that addresses two critical challenges of the vanilla late fusion strategy. First, we propose a Doubly Bounded Scaling uncertainty calibrator to align the confidence score distributions of different agents. Second, the novel Promotion-Suppression Aggregation algorithm further improves the detection accuracy by fully exploiting the shared information: bounding box spatial congruence and confidence score propagation. Experiments on a large-scale cooperative perception dataset shed light on the necessity of model calibration across heterogeneous agents. The results show that combining the two proposed techniques improves on the state of the art for cooperative 3D object detection when different agents use distinct perception models.