Infrastructure-Based Object Detection and Tracking for Cooperative Driving Automation: A Survey

Object detection plays a fundamental role in enabling Cooperative Driving Automation (CDA), which is regarded as a revolutionary solution to the safety, mobility, and sustainability issues of contemporary transportation systems. Although current computer vision technologies can provide satisfactory object detection results in occlusion-free scenarios, the perception performance of onboard sensors is inevitably limited by range and occlusion. Owing to the flexible position and pose of sensor installation, infrastructure-based detection and tracking systems can enhance the perception capability of connected vehicles and have thus quickly become one of the most popular research topics. In this paper, we review the research progress on infrastructure-based object detection and tracking systems. Architectures of roadside perception systems based on different types of sensors are reviewed to give a high-level description of the workflows for infrastructure-based perception systems. Roadside sensors and different perception methodologies are then reviewed and analyzed with detailed literature to provide a low-level explanation of specific methods, followed by datasets and simulators, to draw an overall landscape of infrastructure-based object detection and tracking methods. Finally, discussions point out current opportunities, open problems, and anticipated future trends.




I Introduction

Rapid development of the transportation system has improved the efficiency of daily commuting and goods transport. Nevertheless, the rapidly increasing number of vehicles has resulted in several major issues in the transportation system in terms of safety [2019Crash], mobility [2018Congestion], and sustainability [2021Energy]. Taking advantage of recent strides in advanced sensing, wireless connectivity, and artificial intelligence, Cooperative Driving Automation (CDA) enables automated vehicles (AVs) to communicate with other vehicles, roadway infrastructure, and other road users such as pedestrians and cyclists equipped with mobile devices. Hence, CDA has attracted more and more attention over the past few years and is regarded as a transformative solution to the aforementioned challenges.


Object Perception (OP) plays a fundamental role in the basic structure of CDA applications [2021SAE]. Different kinds of sensors installed on vehicles or at the roadside are capable of perceiving traffic conditions in a mixed traffic environment. The perception data can act as the system input and support various kinds of CDA applications, such as Collision Warning [wu2020improved], Eco-Approach and Departure (EAD) [bai2022hybrid], and Cooperative Adaptive Cruise Control (CACC) [wangCACC].

With the development of sensing technologies, transportation systems can retrieve high-fidelity traffic data from different sensors. For instance, cameras can provide detailed vision data to classify various kinds of traffic objects, such as vehicles, pedestrians, and cyclists [liu2020deep]. LiDAR can provide high-fidelity 3D point cloud data to capture the precise 3D location of traffic objects [arnold2019survey]. RADAR sensors have been an integral part of safety-critical applications in the automotive industry owing to their robustness to weather and lighting conditions [8443497].

During the last couple of decades, a large portion of object perception methods and high-fidelity perception data has come from on-board sensors, while most roadside sensors are still used for traditional traffic data collection, such as counting traffic volumes with loop detectors or cameras [zou2019object]. Although empowered with advanced perception methods, on-board sensors are inevitably limited by range and occlusion. Infrastructure-based perception systems have the potential to achieve better object perception results, with fewer occlusion effects and more flexibility in terms of mounting height and pose.

In this paper, infrastructure-based object detection and tracking methods are reviewed. This survey aims to establish an overall landscape for object perception based on roadside high-fidelity sensors and to provide inspiration for future research. The rest of this paper is organized as follows: architectures for infrastructure-based perception systems are reviewed in Section II, followed by roadside sensors and general perception methodologies. Infrastructure-based object detection and tracking approaches are reviewed in Section V, followed by datasets and simulators. The last section concludes this paper with further discussions.

II System Architectures

Defined by the Society of Automotive Engineers (SAE) J3216 Standard [2021SAE], Cooperative Driving Automation (CDA) enables communication and cooperation between properly equipped vehicles, infrastructure, and other road users. Led by the Federal Highway Administration (FHWA), CARMA [2021CARMA] is one of the state-of-the-art (SOTA) programs for CDA.

According to the definition by SAE, CDA has four classes of cooperative automation with an increasing amount of cooperation associated with information shared among CDA participants. Hence, the fidelity and range of perception information have significant impact on the subsequent cooperation performance. Fig. 1 demonstrates a systematic architecture of infrastructure-based object perception system for enabling CDA. Specifically, four typical phases are identified in the infrastructure-based object perception process: 1) Information Collection; 2) Edge Processing; 3) Cloud Fusion; and 4) Information Distribution.

Fig. 1: Visualization of systematic architecture for infrastructure-based perception system.

II-A Information Collection

Traditional roadside sensors, such as loop detectors and microwave RADARs, are widely used to provide the source data for traffic surveillance and dynamic traffic management [nellore2016survey]. However, these traditional sensors can mainly report only the presence of objects at certain locations. Thanks to advances in high-performance computation and machine learning, high-resolution sensors (e.g., camera, LiDAR, etc.) are now able to provide object-level perception results. These sensors are mounted on roadside infrastructure to perceive the environment and transmit the collected data to the roadside server via a communication hub for further processing.

II-B Edge Processing

Considering the limited bandwidth to transmit a large volume of raw data (e.g., point clouds), information collected from roadside sensors may be processed on an edge (roadside) server. Generally, there are three main steps for processing the raw sensing data in this phase, as shown below:

  • Preprocessing: Manipulations of raw data to provide a ready-to-use format for perception modules according to specific sensors, such as coordinate transformation, geo-fencing, and noise reduction.

  • Object Perception: Generation of object detection and tracking results that describe position and pose and identify individual road users, such as rotated bounding boxes with a unique ID and classification tag. In addition, multi-sensor fusion algorithms may be applied if more than one sensor is used for a single perception node.

  • Storage: Recording of raw sensing data and perception data with timestamps for post-processing or analysis at edge-side.
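The preprocessing step listed above can be illustrated with a minimal sketch. It assumes a sensor whose frame differs from the local ground frame only by a yaw rotation and a translation, and a rectangular region of interest; the function names, the mounting pose, and the fence limits are hypothetical and not taken from any reviewed system.

```python
import math

def sensor_to_world(points, yaw_rad, translation):
    """Rotate points about the z-axis by yaw_rad, then translate them."""
    c, s = math.cos(yaw_rad), math.sin(yaw_rad)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]

def geofence(points, x_range, y_range):
    """Keep only points whose (x, y) lies inside a rectangular region of interest."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    return [p for p in points if x_min <= p[0] <= x_max and y_min <= p[1] <= y_max]

# Three raw LiDAR points in the sensor frame (metres).
raw = [(1.0, 0.0, -4.0), (0.0, 2.0, -4.0), (50.0, 0.0, -4.0)]

# Assumed mounting: sensor 5 m above the ground, rotated 90 degrees about z.
world = sensor_to_world(raw, math.pi / 2, (0.0, 0.0, 5.0))

# Assumed geo-fence: a 20 m x 20 m square centred on the sensor position.
roi = geofence(world, x_range=(-10.0, 10.0), y_range=(-10.0, 10.0))
print(len(roi), "of", len(raw), "points remain inside the geo-fence")
```

A real deployment would typically use a full 3D calibration (roll, pitch, yaw) and a polygonal fence aligned with the road geometry; the single-axis rotation here only keeps the sketch short.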

II-C Cloud Computing

Generally, perception data is generated by roadside equipment (RSE) due to the large volume of raw data, and the perception data is then transmitted to the Cloud via wireless communication (e.g., cellular network, WLAN, etc.). In systems equipped with a high-speed connection that allows high-volume, low-latency data transmission, raw data can also be transmitted to the Cloud for processing. For a multi-node perception system, i.e., one that simultaneously perceives the environment from different locations, time alignment (with delay compensation where necessary) and object association need to be considered for spatiotemporal information assimilation and synchronization.
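As a rough illustration of time alignment and object association between two perception nodes, the sketch below extrapolates each object over a known delay under a constant-velocity assumption and then matches objects greedily by distance. The dictionary layout, the 2 m gating distance, and the greedy matching rule are assumptions made for this example; the surveyed systems may use different motion models and assignment methods.

```python
import math

def compensate_delay(obj, delay_s):
    """Extrapolate an object's position over delay_s, assuming constant velocity."""
    return {**obj, "x": obj["x"] + obj["vx"] * delay_s, "y": obj["y"] + obj["vy"] * delay_s}

def associate(objs_a, objs_b, max_dist=2.0):
    """Greedy nearest-neighbour matching; returns index pairs (i in objs_a, j in objs_b)."""
    pairs = sorted(
        (math.hypot(a["x"] - b["x"], a["y"] - b["y"]), i, j)
        for i, a in enumerate(objs_a)
        for j, b in enumerate(objs_b)
    )
    used_a, used_b, matches = set(), set(), []
    for dist, i, j in pairs:
        if dist <= max_dist and i not in used_a and j not in used_b:
            matches.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return matches

# Node A reports 0.1 s earlier than node B (hypothetical timestamps).
node_a = [{"x": 0.0, "y": 0.0, "vx": 10.0, "vy": 0.0}]
node_b = [{"x": 1.1, "y": 0.1, "vx": 10.0, "vy": 0.0},
          {"x": 25.0, "y": 3.0, "vx": 8.0, "vy": 0.0}]

aligned_a = [compensate_delay(o, 0.1) for o in node_a]
matches = associate(aligned_a, node_b)
print(matches)
```

Greedy matching is easy to read but not globally optimal; an assignment solver would be the usual choice when many objects are close together.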

II-D Message Distribution

The perception information (along with any advisory or actuation signals) can be distributed to road users in two major ways, depending on the connectivity status: for conventional road users without wireless connectivity, such information can be delivered to end devices at the roadside, such as Dynamic Message Sign (DMS) or signal head display of traffic lights via the Traffic Management Center (TMC). For connected road users with wireless communications, customized information, e.g., surrounding objects and Signal Phase and Timing (SPaT) of upcoming signals, can be accessed to enable various cooperative driving automation (CDA) applications, such as Cooperative Eco-Driving [altan2017glidepath, bai2022hybrid].

III Roadside Sensors

For infrastructure-based object detection and tracking systems, roadside sensors are the fundamental modules for data collection. This section gives an overview of typical types of roadside sensors from different perspectives.

III-A Configuration and Performance

TABLE I: Performance matrix for different sensors utilized for infrastructure-based perception.

Sensors (columns): Camera, LiDAR, RADAR, Thermal, Fisheye, Loop

Capabilities (rows):

  • Privacy-safe data

  • Accurately detects and classifies objects

  • Accurately measures object speed and position

  • Extensive field-of-view (FOV)

  • Reliability across changes in lighting, sun, temperature

  • Ability to read signs and differentiate color

Regarding the installation of roadside sensors, typical locations include signal arms and street lamp posts, subject to a minimum height requirement to avoid vandalism. As a result, roadside sensors can be mounted much higher than on-board sensors, which minimizes the occlusion effect caused by dense traffic. The specific installation position varies with the type of roadside sensor. For example, roadside LiDAR sensors are mainly installed at a height of 10 to 20 ft (but no more than 30 ft), while fisheye cameras are preferably installed higher.

For general performance of different sensors used in a roadside perception system, Table I provides a summary on those that are widely utilized in roadside traffic surveillance. Each of these sensors has its own capabilities and strengths in different use cases.

III-B Operational Pipeline

In terms of the number of sensors applied, the systematic operational pipeline of object detection and tracking based on roadside sensors can be divided into two main categories, i.e., single-sensor-based and multi-sensor-based, as shown in Fig. 2.

Fig. 2: Systematic diagram of operational pipeline for: (a) single-sensor-based perception model; and (b) multi-sensor-based perception model.

III-B1 Single-sensor-based Perception

Single-sensor-based object detection and tracking systems have been widely developed and applied in real-world transportation systems; their main pipeline is shown in Fig. 2 (a). Data collected from the sensor is first preprocessed to reduce noise, filter out unrelated data, and reformat it for downstream modules. Then, feature extraction is applied to calculate predefined features with mathematical models (if based on traditional methods) or to generate hidden features with a neural network (if based on deep learning). Detection and tracking results are generated by the perception module and fed into the post-processing module, which further cleans the perception outputs (e.g., by filtering overlapped bounding boxes and predictions with scores below the threshold).
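The post-processing step can be sketched as score thresholding followed by greedy non-maximum suppression. This is a simplified illustration using axis-aligned 2D boxes, whereas roadside systems often output rotated boxes; the thresholds and the sample detections are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def postprocess(dets, score_thr=0.5, iou_thr=0.5):
    """dets: list of (box, score). Drop low scores, then suppress overlapping boxes."""
    candidates = sorted((d for d in dets if d[1] >= score_thr),
                        key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with a higher-scored kept box.
        if all(iou(box, k[0]) <= iou_thr for k in kept):
            kept.append((box, score))
    return kept

dets = [((0, 0, 10, 10), 0.9),    # strong detection
        ((1, 1, 11, 11), 0.8),    # near-duplicate of the first box
        ((20, 20, 30, 30), 0.7),  # separate object
        ((40, 40, 50, 50), 0.3)]  # below the score threshold
kept = postprocess(dets)
print([score for _, score in kept])
```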

III-B2 Multi-sensor-based Perception

Compared with single-sensor-based perception systems, multi-sensor-based perception systems have the potential to achieve better object detection and tracking performance via sensor fusion, owing to the complementarity of different sensors. In terms of the stage at which fusion takes place, multi-sensor perception systems can be divided into three classes: 1) Early Fusion, which fuses raw data at the preprocessing stage; 2) Deep Fusion, which fuses features at the feature extraction stage; and 3) Late Fusion, which fuses perception results at the post-processing stage. Each fusion scheme has its own pros and cons. For instance, Early Fusion and Deep Fusion are stronger in fusion accuracy but need more computational power and more complex model design. Conversely, Late Fusion can achieve better real-time performance at the cost of accuracy. Which fusion scheme to deploy depends on the specific demands of the traffic scenario.
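To make the late-fusion idea concrete, the following sketch merges object-level outputs from two sensors. It assumes each detection is reduced to a ground-plane position and a positive confidence score, and that matched detections are combined by score-weighted averaging; this merging rule and the 1.5 m matching radius are illustrative assumptions, not a method taken from the surveyed papers.

```python
import math

def late_fuse(dets_a, dets_b, max_dist=1.5):
    """Each detection is (x, y, score) with score > 0.
    Matched pairs are merged by score-weighted averaging; unmatched ones are kept."""
    fused, used_b = [], set()
    for xa, ya, sa in dets_a:
        best_j, best_d = None, max_dist
        for j, (xb, yb, _) in enumerate(dets_b):
            d = math.hypot(xa - xb, ya - yb)
            if j not in used_b and d <= best_d:
                best_j, best_d = j, d
        if best_j is None:
            fused.append((xa, ya, sa))
        else:
            xb, yb, sb = dets_b[best_j]
            w = sa + sb
            fused.append(((sa * xa + sb * xb) / w, (sa * ya + sb * yb) / w, max(sa, sb)))
            used_b.add(best_j)
    # Detections seen only by the second sensor are passed through unchanged.
    fused.extend(d for j, d in enumerate(dets_b) if j not in used_b)
    return fused

camera_dets = [(10.0, 5.0, 0.6)]
lidar_dets = [(10.4, 5.0, 0.9), (30.0, 2.0, 0.8)]
fused = late_fuse(camera_dets, lidar_dets)
print(fused)
```

Because only compact object lists are exchanged, this kind of fusion is cheap to run, which matches the real-time advantage described above; the price is that raw-data cues lost in each sensor's own detector cannot be recovered.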

IV General Perception Methodology

In this section, general perception methodologies, which act as the building blocks for infrastructure-based perception methods, are briefly reviewed. Due to limited space, only several object detection milestones are covered, in chronological order and from two perspectives: the traditional approach and the deep-learning approach.

IV-A Traditional Approach

About 20 years ago, Viola and Jones [viola2001rapid, viola2004robust] proposed a method for real-time detection of human faces without any constraints. This algorithm outperformed other contemporary algorithms in terms of real-time performance without compromising detection accuracy. In 2005, Dalal and Triggs [dalal2005histograms] proposed the Histogram of Oriented Gradients (HOG) feature descriptor, which provided a significant improvement over the scale-invariant feature transform [lowe1999object, lowe2004distinctive] and shape context [belongie2002shape]. The HOG detector has been regarded as the cornerstone of many subsequent object detectors and has been implemented in various real-world applications [felzenszwalb2008discriminatively, felzenszwalb2010cascade, malisiewicz2011ensemble]. The Deformable Part-based Model (DPM) proposed by Felzenszwalb et al. [felzenszwalb2008discriminatively] consecutively won the PASCAL Visual Object Classes (VOC)-07, -08, and -09 detection challenges [everingham2010pascal]. Due to their dominant performance, DPM and its variants [felzenszwalb2010cascade] are widely regarded as the pinnacle of traditional object detection methods [zou2019object].

IV-B The Deep-Learning Approaches

Benefiting from increased computational power, convolutional neural networks (CNNs) [krizhevsky2012imagenet] were reborn in 2012. Two years later, Girshick et al. proposed Regions with CNN features (R-CNN) for object detection, which opened the way for the rapid growth of deep learning in this field [girshick2014rich, girshick2015region]. In the same year, Spatial Pyramid Pooling Networks (SPPNet), proposed by He et al., were able to generate feature representations regardless of the image size and ran 20 times faster than R-CNN without compromising accuracy [he2015spatial]. In 2015, multiple renowned detectors were proposed: 1) Fast R-CNN [girshick2015fast], over 200 times faster than R-CNN, proposed by Girshick; 2) Faster R-CNN [ren2015faster, ren2016faster], the first end-to-end and the first near-realtime deep learning detector, proposed by Ren et al.; 3) You Only Look Once (YOLO) [redmon2016you], the first one-stage detector in the deep learning era with extremely fast speed (45 - 155 fps), proposed by Redmon et al.; and 4) Single Shot MultiBox Detector (SSD) [liu2016ssd], the second one-stage detector but with significantly improved accuracy, proposed by Liu et al. In 2017, Lin et al. proposed Feature Pyramid Networks (FPN) [lin2017feature] based on Faster R-CNN, which achieved SOTA object detection performance and have become a fundamental building block for various object perception models. In recent years, Transformers [vaswani2017attention], built on the attention mechanism, have been leading the trend in the majority of object perception tasks, with examples such as the Vision Transformer (ViT) proposed by Dosovitskiy et al. [dosovitskiy2020vit] and the Swin Transformer proposed by Liu et al. [liu2021swin].

V Infrastructure-based Object Detection and Tracking Approaches

Although general object detection has gone through an era of rapid development, object detection and tracking based on roadside sensors is still an emerging topic. It has the potential to break the current bottleneck for autonomous driving via cooperative perception, especially in a mixed traffic environment [gupta2021deep]. This section reviews infrastructure-based object detection and tracking approaches and analyzes the details reported in the literature. Since camera-based perception work has been reviewed comprehensively in previous surveys [zou2019object, datondji2016survey], this section mainly focuses on roadside LiDAR-based perception methods.

V-A 2D Object Detection

Roadside cameras have been widely used for object detection to support traffic surveillance, safety warning, and various other applications. Ojala et al. proposed a CNN-based pedestrian detection and localization approach using a roadside camera [Ojala8793228]. The perception system consists of a monovision camera that streams video and a computing unit that performs object detection and distance measurements on the detected objects.

Using a roadside LiDAR, Zhang et al. [zhang2020gc] proposed GC-net, a three-stage pipeline, including gridding, clustering, and classification. The raw point cloud data (PCD) is mapped into a grid structure and then clustered by the Grid-Density-Based Spatial Clustering algorithm. Finally, a CNN-based classifier is applied to categorize the detected objects by extracting the local features.

Liu et al. proposed a roadside LiDAR-based object detection approach by background filtering and clustering [Liu9434525]. Specifically, the background filtering method is designed on the basis of point correlation by KDTree [redmond2007method] neighborhood searching and the clustering is based on an adaptive (threshold) Euclidean clustering method.

Song et al. proposed a layer-based method for background filtering and object detection [Song9216093]. Specifically, a layer-based searching method is designed on the basis of the feature distribution of PCD to distinguish moving objects from the rest of the point cloud. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [ester1996density] method is then applied to cluster the points and generate the object detection results.
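Since DBSCAN recurs in many of the roadside LiDAR pipelines reviewed here, a compact pure-Python sketch of the algorithm on 2D points may help. It uses a brute-force neighbor search for readability, and the `eps` and `min_pts` values and the sample points are hypothetical; it is not the implementation used in any of the cited works.

```python
import math

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including idx itself)."""
    px, py = points[idx]
    return [j for j, (qx, qy) in enumerate(points)
            if math.hypot(px - qx, py - qy) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

points = [(0.0, 0.0), (0.3, 0.0), (0.0, 0.3), (0.3, 0.3),  # first object
          (5.0, 5.0), (5.2, 5.0), (5.0, 5.2),              # second object
          (10.0, 10.0)]                                    # isolated return
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

For real point clouds, the brute-force search would normally be replaced by a spatial index such as a KD-tree, since each query here scans every point.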

Gouda et al. proposed an automated approach to mapping and assessing roadside clearance parameters using LiDAR on rural highways [gouda2021automated]. Pavement edge trajectories are extracted based on the pavement surface point extracted from PCD. Then, a voxel-based raycasting approach is designed to search for roadside objects and query their locations, and non-compliant locations with substandard conditions are automatically queried.

Zhang et al. proposed an object detection method based on background construction and clustering from roadside LiDAR data [8484040, zhang2019automatic]. The discrete horizontal and vertical angular values are regarded as coordinates of pixels in digital images, and the farthest and mean distance of each azimuth are used to construct the background dataset. Then, a density-based spatial clustering method [tran2013revised] is applied to generate the object detection results.
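The background-construction idea can be sketched as follows, under simplifying assumptions: each LiDAR frame is represented as a mapping from an (azimuth bin, laser channel) cell to a measured range, only the farthest range per cell is used as the background (the cited work also uses the mean distance), and a fixed 0.5 m margin separates foreground from background. All names and values are hypothetical.

```python
def build_background(frames):
    """frames: list of dicts mapping (azimuth_bin, channel) -> range in metres.
    The farthest range observed in each cell is taken as the background surface."""
    background = {}
    for frame in frames:
        for cell, rng in frame.items():
            background[cell] = max(rng, background.get(cell, 0.0))
    return background

def foreground(frame, background, margin=0.5):
    """Keep returns that are clearly closer than the background in their cell."""
    return {cell: rng for cell, rng in frame.items()
            if rng < background.get(cell, float("inf")) - margin}

frames = [
    {(0, 0): 30.0, (1, 0): 30.2},  # empty road
    {(0, 0): 12.0, (1, 0): 30.1},  # a vehicle blocks cell (0, 0)
    {(0, 0): 29.9, (1, 0): 30.0},  # empty road again
]
background = build_background(frames)
moving = foreground(frames[1], background)
print(moving)
```

The remaining foreground returns would then be passed to a clustering step to form object candidates.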

For a multi-sensor system, Zhu et al. proposed Multi-Sensor Multi-Level Enhanced YOLO (MME-YOLO) for vehicle detection in traffic surveillance [zhu2021mme]. MME-YOLO consists of two tightly coupled structures:

  • The enhanced inference head is empowered by attention-guided feature selection blocks and an anchor-based/anchor-free ensemble head to achieve better generalization in real-world scenarios.

  • The LiDAR-Image composite module is based on CBNet [liu2020cbnet] to cascade the multi-level feature maps from the LiDAR subnet to the image subnet, which strengthens the generalization of the detector in complex scenarios.

Owing to the above innovations, MME-YOLO can achieve better performance for vehicle detection compared with YOLOv3 [farhadi2018yolov3] for roadside sensor data.

Bai et al. [bai2022cyber] proposed a deep learning-based real-time vehicle detection and reconstruction system using roadside LiDAR data. Specifically, the CARLA simulator [dosovitskiy2017carla] is used to collect the training dataset, and the ComplexYOLO model [simony2018complex] is applied and retrained for object detection on the CARLA dataset. Finally, a co-simulation platform is designed and developed to provide vehicle detection and object-level reconstruction, which aims to empower subsequent CDA applications with readily retrieved authentic detection data.

Beyond detecting general road users (e.g., vehicles and pedestrians), Chen et al. made an innovative attempt at detecting deer crossing the road by using roadside LiDAR [chen2019deer]. The main workflow is adopted from their previous works and follows a "background filtering-clustering-classification" process [8484040, zhao2019detection, zhang2019vehicle]. In particular, several different classification algorithms, e.g., naive Bayes, random forest, and k-nearest neighbor (KNN) [li2002detection], are applied for classifying deer, pedestrians, and vehicles.
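As a small illustration of the classification stage, the sketch below applies a k-nearest-neighbor majority vote to per-cluster features. The two features (cluster length and height in metres) and all training values are made up for this example and are not taken from the cited study, which may use different features and scaling.

```python
import math
from collections import Counter

def knn_classify(sample, training, k=3):
    """training: list of (feature_vector, label). Majority vote among the k nearest."""
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical (length, height) features of labelled clusters.
training = [
    ((4.5, 1.5), "vehicle"), ((4.8, 1.6), "vehicle"), ((4.2, 1.4), "vehicle"),
    ((0.5, 1.7), "pedestrian"), ((0.6, 1.8), "pedestrian"), ((0.4, 1.6), "pedestrian"),
    ((1.8, 1.2), "deer"), ((1.6, 1.1), "deer"), ((2.0, 1.3), "deer"),
]

label = knn_classify((1.7, 1.2), training)
print(label)
```

In practice, features on different scales (for example point counts versus dimensions) are usually normalized before computing distances.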

Cicek proposed a deep learning-based automated curbside parking spot detection approach using a roadside camera [cicek2021fully]. To identify the road boundaries, road segmentation and object detection are performed with an FCN-VGG16 model [long2015fully] on the KITTI dataset [Geiger2012CVPR] and Faster R-CNN [ren2015faster] on the Microsoft COCO dataset [lin2014microsoft], respectively. Then, a method is designed to differentiate parked vehicles from moving ones and to guide drivers to the nearest available spot.

V-B 3D Object Detection

Guo et al. proposed a 3D vehicle detection method based on a monocular camera [9502706], which consists of three steps: 1) clustering arbitrary object contours into linear equations; 2) estimating positions, orientations, and dimensions of vehicles by applying the K-means method; and 3) refining 3D detection results by maximizing the posterior probability.

For pedestrian detection, Gong et al. proposed a roadside LiDAR-based real-time detection approach that combines traditional and deep learning algorithms [gong2021pedestrian]. Several techniques are designed to guarantee real-time performance, including the application of an Octree with region-of-interest (ROI) selection and the development of an improved Euclidean clustering algorithm with an adaptive search radius. The roadside system is equipped with an NVIDIA Jetson AGX Xavier and achieves an inference time of 110 ms per frame.

Bai et al. [bai2021cmm] proposed a deep learning-based 3D object detection, tracking, and reconstruction system for real-world implementation. The field operational system consists of three main parts: 1) 3D object detection by adopting PointPillars [lang2019pointpillars] for inference on roadside PCD; 2) 3D multi-object tracking by extending DeepSORT [veeramani2018deepsort] to support 3D tracking; and 3) 3D reconstruction by geodetic transformation and real-time on-board GUI display.

For multi-sensor perception systems, Arnold et al. proposed a cooperative 3D object detection model that utilizes multiple depth cameras to mitigate the FOV limitation of a single-sensor system [arnold2020cooperative]. For each camera, the depth image is projected into a pseudo point cloud [hartley2000zisserman, glennie2010static]. Two sensor-fusion schemes, early fusion and late fusion (see Fig. 2), are designed on the basis of VoxelNet [zhou2018voxelnet]. The evaluation in a T-junction and a roundabout scenario in the CARLA simulator [dosovitskiy2017carla] demonstrates that the proposed method can enlarge the detection coverage without compromising accuracy.
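The projection of a depth image into a pseudo point cloud can be sketched with the standard pinhole camera model. The intrinsics `fx`, `fy`, `cx`, `cy` and the tiny 2 x 2 depth image below are hypothetical values chosen so the arithmetic is easy to follow; the cited work may apply additional steps such as transforming the points into a common world frame.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth image (rows of depths in metres) with a pinhole model.
    Returns camera-frame points (X right, Y down, Z forward); zero depth is skipped."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z > 0:
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# 2 x 2 depth image; a value of 0.0 marks a pixel without a valid measurement.
depth = [[0.0, 2.0],
         [4.0, 0.0]]
cloud = depth_to_points(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(cloud)
```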

V-C Object Detection and Tracking

Using roadside LiDAR, Zhao et al. proposed a detection and tracking approach for pedestrians and vehicles [zhao2019detection]. As one of the early studies utilizing roadside LiDAR for perception, a classical detection and tracking pipeline for PCD was designed. It mainly consists of:

  • Background Filtering: To remove the laser points reflected from road surface or buildings by applying a statistics-based background filtering method [wu2017automatic].

  • Clustering: To generate clusters for the laser points by implementing a DBSCAN method [ester1996density].

  • Classification: To generate different labels for different traffic objects, such as vehicles and pedestrians, based on neural networks [li2012brief].

  • Tracking: To identify the same object in continuous data frames by applying a discrete Kalman filter.
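The tracking step above relies on a discrete Kalman filter. A minimal sketch for a single object moving along one axis is given below, assuming a constant-velocity motion model, position-only measurements, and hand-picked noise values (`q`, `r`, and the initial variance are assumptions, not parameters from the cited work). The 2 x 2 matrix algebra is written out by hand to keep the example dependency-free.

```python
class KalmanCV1D:
    """Discrete Kalman filter with a constant-velocity model along one axis."""

    def __init__(self, pos, vel=0.0, p_var=1.0, q=0.01, r=0.25):
        self.x = [pos, vel]                       # state: position, velocity
        self.P = [[p_var, 0.0], [0.0, p_var]]     # state covariance
        self.q, self.r = q, r                     # process / measurement noise

    def predict(self, dt):
        """Propagate state and covariance: x = F x, P = F P F^T + Q, F = [[1, dt], [0, 1]]."""
        (a, b), (c, d) = self.P
        self.x = [self.x[0] + self.x[1] * dt, self.x[1]]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                 # innovation covariance
        k0, k1 = a / s, c / s          # Kalman gain
        y = z - self.x[0]              # innovation
        self.x = [self.x[0] + k0 * y, self.x[1] + k1 * y]
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]

# Noise-free positions of an object moving at 10 m/s, sampled every 0.1 s.
kf = KalmanCV1D(pos=0.0)
for k in range(1, 51):
    kf.predict(0.1)
    kf.update(1.0 * k)
print(round(kf.x[0], 2), round(kf.x[1], 2))
```

Although only positions are measured, the filter also recovers the velocity, which is what allows a tracker to predict where each object should appear in the next frame before matching it with new detections.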


Based on the aforementioned work, Cui et al. designed an automatic vehicle tracking system by considering vehicle detection and lane identification [cui2019automatic]. A real-world operational system is developed, which consists of a roadside LiDAR, an edge computer, a Dedicated Short-Range Communication (DSRC) Roadside Unit (RSU), a Wi-Fi router, a DSRC On-board Unit (OBU), and a Graphic User Interface (GUI).

Following a similar workflow, Zhang et al. proposed a vehicle tracking and speed estimation approach based on a roadside LiDAR [zhang2020vehicle]. Vehicle detection results are generated by the "background filtering-clustering-classification" process. Then, a centroid-based tracking flow is implemented to obtain initial vehicle transformations, and the unscented Kalman filter [julier2004unscented] and joint probabilistic data association filter [bar2009probabilistic] are adopted in the tracking flow. Finally, vehicle tracking is refined through a BEV-LiDAR-image matching process to improve the accuracy of the estimated vehicle speeds.

To mitigate the impact of occlusion in dense traffic, Zhang et al. proposed an adjacent-frame fusion method for vehicle detection and tracking with a roadside LiDAR [zhang2019vehicle]. Compared with previous research [zhao2019detection], objects can be detected and tracked without object model extraction or bounding box description, which improves the perception performance under occlusion.

MultEYE [balamuralidhar2021multeye] is a monitoring system for real-time vehicle detection, tracking, and speed estimation proposed by Balamuralidhar et al. Unlike general roadside sensors mounted on signal poles or light poles, the data source of MultEYE is an Unmanned Aerial Vehicle (UAV) equipped with an embedded computer and a video camera. Inspired by the multi-task learning methodology, a segmentation head [paszke2016enet] is added to the object detector backbone [bochkovskiy2020yolov4]. Dedicated object tracking [bolme2010visual] and speed estimation algorithms have been optimized to track objects reliably from a UAV with limited computational resources.

VI Datasets and Simulators

VI-A General Datasets

VI-A1 General Object Detection Datasets

Owing to prevailing needs in autonomous driving for surrounding perception, most datasets for object detection and tracking are collected from on-board sensors. Several widely used datasets for driving automation are briefly introduced as follows:

  • KITTI: one of the most popular datasets, which consists of hours of traffic scenarios recorded with a variety of sensor modalities for mobile robotics and autonomous driving [Geiger2012CVPR].

  • NuScenes: the first dataset to carry the full autonomous vehicle sensor suite: 6 cameras, 5 radars and 1 LiDAR, all with full 360 degree field of view [caesar2020nuscenes].

  • Waymo Open Dataset: a large-scale, high quality, diverse dataset which consists of 1150 scenes captured across a range of urban and suburban geography [sun2020scalability].

VI-B Roadside Datasets

Because roadside perception has great potential to promote the development of CDA, there is an immediate demand for roadside sensor-based datasets for various infrastructure-based object perception tasks. In 2021, the BAAI-VANJEE Roadside Dataset was published by Deng et al. to support Connected Automated Vehicle Highway technologies [yongqiang2021baai]. The dataset consists of LiDAR data and RGB images collected by a roadside data-collection platform. It contains 2500 frames of LiDAR data and 5000 frames of RGB images, covering 12 classes of objects with 74K 3D object annotations and 105K 2D object annotations.

VI-C Simulators

To promote the development of autonomous driving, several game engine-based simulators have been developed to provide a cost-effective way to design and evaluate algorithms, such as CARLA [dosovitskiy2017carla], SVL [rong2020lgsvl], and AirSim [shah2018airsim]. These simulators are open-source, come with detailed tutorials, and can provide high-resolution, high-fidelity sensor data from various kinds of cameras and LiDARs. They offer a highly customizable and cost-effective way to collect training datasets and build traffic scenarios, and are thus widely applied in learning-based object perception tasks [marvasti2020cooperative, arnold2020cooperative, arnold2019survey].

VII Discussion

Although infrastructure-based object perception is an emerging research area, it is playing an increasingly significant role in promoting the perception capabilities of CDA applications. Many studies have been conducted to lay the foundation and provide inspiration for future work. In this section, we present our insights concerning the current states, open problems, and future trends in infrastructure-based object detection and tracking for CDA applications.

VII-A Current States and Open Challenges

VII-A1 Sensor System

Limited by space, this paper does not provide a comprehensive overview of every related approach. Nevertheless, most of the studies mentioned are based on a single sensor, while only a few address multi-sensor perception systems [zhu2021mme, arnold2020cooperative] for roadside sensor-based perception. Key open challenges include the choice of fusion scheme (e.g., early fusion, late fusion) for different sensor combinations and the associated efficient fusion methods.

VII-A2 Core Perception Methods

According to the literature reviewed in Section IV and Section V, there is an evident gap between general object perception and infrastructure-based object perception. For instance, a large portion of the existing roadside LiDAR-based detection approaches rely on DBSCAN for clustering as their core methodology [Song9216093, 8484040, zhang2019automatic, zhao2019detection, zhang2020vehicle], which leaves a performance gap relative to the SOTA methods [lang2019pointpillars, zhou2018voxelnet]. Thus, one of the major challenges is roadside data acquisition and annotation to promote deep learning-based research on infrastructure-based perception systems.

VII-A3 Communications and Synchronization

According to the experience from several field operational systems [cui2019automatic], the synchronization problem caused by the processing and communication delay is a crucial issue for large-scale implementation.

VII-B Future Trends

VII-B1 Towards Multi-Sensor Fusion

A multi-sensor-based perception system has the potential to improve perception performance by taking advantage of complementary sensor data [zamanakos2021comprehensive] with appropriate fusion techniques. An infrastructure-based perception system offers more flexibility for installing multiple sensors and can be supported by high-performance edge servers.

VII-B2 Towards Cooperative Perception

No matter how powerful the perception methods are, physical occlusion cannot be addressed by single-node perception. Perceiving the environment from multiple nodes can mitigate the limitation caused by occlusion, which is one of the major bottlenecks of current perception systems.

VII-B3 Towards Lightweight OBU

Although the on-board device has made major strides in development, it could be extremely costly to empower every single vehicle with a high-performance computation system for perception. Due to the rapid advance in high-speed wireless communication technologies, it becomes much more cost-effective to equip lightweight OBUs for local situation awareness perception and receive data from infrastructure-based high-performance nodes for wider range perception.

VIII Conclusions

This paper provides an overview of infrastructure-based object detection and tracking systems. The architecture is presented to illustrate the four fundamental parts of an infrastructure-based traffic object perception system. Roadside sensors are then introduced in terms of configuration, performance, and operational pipelines, followed by general perception methodologies. Infrastructure-based object detection and tracking approaches are reviewed with detailed analyses, followed by a brief introduction of datasets and simulators. Finally, this paper discusses current issues and future trends. To the best of our knowledge, this work is the first study that aims to provide a survey on infrastructure-based traffic object detection and tracking methods.


This research was funded by the Toyota Motor North America InfoTech Labs. The contents of this paper reflect the views of the authors, who are responsible for the facts and the accuracy of the data presented herein. The contents do not necessarily reflect the official views of the Toyota Motor North America.