Log In Sign Up

Hercules: An Autonomous Logistic Vehicle for Contact-less Goods Transportation During the COVID-19 Outbreak

Since December 2019, the coronavirus disease 2019 (COVID-19) has spread rapidly across China. As at the date of writing this article, the disease has been globally reported in 100 countries, infected over 100,000 people and caused over 3,000 deaths. Avoiding person-to-person transmission is an effective approach to control and prevent the epidemic. However, many daily activities, such as logistics transporting goods in our daily life, inevitably involve person-to-person contact. To achieve contact-less goods transportation, using an autonomous logistic vehicle has become the preferred choice. This article presents Hercules, an autonomous logistic vehicle used for contact-less goods transportation during the outbreak of COVID-19. The vehicle is designed with autonomous navigation capability. We provide details on the hardware and software, as well as the algorithms to achieve autonomous navigation including perception, planning and control. This paper is accompanied by a demonstration video and a dataset, which are available here:


page 1

page 2

page 3

page 4

page 8


A Survey on Contact Tracing: the Latest Advancements and Challenges

Infectious diseases are caused by pathogenic microorganisms, such as bac...

Clustering of countries based on the associated social contact patterns in epidemiological modelling

Mathematical models have been used to understand the spread patterns of ...

Nailed It: Autonomous Roofing with a Nailgun-Equipped Octocopter

This paper presents the first demonstration of autonomous roofing with a...

Automatic Unstructured Handwashing Recognition using Smartwatch to Reduce Contact Transmission of Pathogens

Current guidelines from the World Health Organization indicate that the ...

1 Hardware System

The hardware system of our vehicle mainly consists of a Drive-by-Wire (DBW) chassis, an Industrial Personal Computer (IPC), sensors, and accessories. Fig. 2(a) shows the sensors on the 3-D model of our vehicle. Fig. 2(b) shows the top view of the modules in the chassis.

The main modules of the chassis are those related to motion control: Motor Control Unit (MCU), Electric Power Steering (EPS), Electro-Hydraulic Brake (EHB) and Electronic Parking Brake (EPB). MCU supports both the speed-control and torque-control, with the input commands from the Vehicle Control Unit (VCU). EPS controls the vehicle steering. The combination of MCU and EPS controls the vehicle longitudinal and lateral motions. EHB controls the brake, and EPB controls the parking brake. The Body Control Module (BCM) controls various electronic accessories in the vehicle body, such as light and horn. VCU is the core of the DBW chassis, which is responsible for coordinating each module. It keeps communicating with the IPC, performing parameter checking and sending commands to other modules. In addition, VCU responds to critical safety signals, such as the stopping signal from the emergency button. There are some sensors in the chassis, such as a bumper, encoder and Tire Pressure Monitoring System (TPMS). The bumper and TPMS are both safety-critical sensors. Specifically, the bumper is used to avoid collisions. The encoder is a low-cost and effective sensor for odometry which measures the rotation of wheel precisely.

There are two batteries in our vehicle. A 12 V Lead-acid starter battery and a 72 V removable lithium-ion battery with around 9 KWh electricity, which can support the maximum 80 Km running distance. The lithium-ion battery powers the chassis, IPC, sensors and accessories. It has a built-in Battery Management System (BMS) to monitor and manage the battery. The removable design allows the vehicle to operate at 24 hours a day without stopping for a recharge. An On-Board Charger (OBC) with a Direct Current Direct Current (DCDC) converter takes about 5 hours to fully recharge the battery. The detailed specifications of our vehicle are displayed in Tab. 1.

The IPC contains an Intel i7-8700 CPU with 6 cores and 12 threads, 32 GB memory, and a 1050Ti NVIDIA Graphics card. It is able to run deep learning-based algorithms. The accessories mainly consist of a 4G/5G Data Transfer Unit (DTU), Human Machine Interface (HMI), LED display, remote controller and cargo box. The DTU allows the IPC to connect to our cloud management platform via the Internet. The cargo box is replaceable.

Figure 3: The software architecture of our vehicle. The part shown in the yellow box is running on the vehicle, and the part shown in the blue box is running on the cloud. The figure is best viewed in color.

2 Software System

Fig. 3 shows the software architecture of our autonomous vehicle. It can be generally divided into two parts: software system running on the vehicle, and software system running on the cloud.

2.1 Software Platforms on the Vehicle

There are three main computing platforms on the vehicle: the IPC, Electronic Control Unit (ECU) and BCM. The IPC is used to run algorithms for autonomous navigation. The ECU is used to ensure the safety of the vehicle by energy management and safety diagnosis. It has three main functions: the united diagnostic services, driving torque coordination control, and the distribution and brake energy recovery system. The united diagnostic services are a series of applications checking and monitoring the mechanical and control systems. The driving torque coordination control is an application that dynamically distributes motor torque according to the vehicle and road status. The brake energy recovery is an application aimed at saving power and improving the travel distance. These applications run on the Real-Time Operating System (RTOS), which satisfies the real-time requirements.

The BCM connects the IPC and ECU. It also runs on the RTOS, which meets the real-time requirements. It detects the communication between the nodes of the CAN network by heartbeat protocols. When major nodes of this network experience an outage or crash, the BCM stops transmitting high-level CAN signals from the IPC to the VCU and waits for human interventions.

The sensors on the vehicle are synchronized by a 1 Hz Pulse per Second (PPS) signal from the external GNSS receiver. The IPC receives data from the sensors and uses them at different frequencies. For example, the state estimation module updates at 100 Hz, providing localization results for the system. The Lidar object detection module runs at 10 Hz according to the refresh rate of the Lidar. All the modules are developed on the Robot Operating System (ROS) to facilitate data communication.

Figure 4:

The architecture of the 3-D object detection using Lidar point clouds. The inputs are multiple point clouds captured by synchronized and well-calibrated Lidars. The 3-D object detector contains mainly three components: a feature learning network (FLN) for feature extraction; middle layers (ML) for feature embedding with sparse convolution; and region proposal network (RPN) for box prediction and regression. The final outputs are a series of 3-D bounding boxes: blue cubes denote the detection results, and the gree cubes denote the ground-truth boxes.

2.2 Software Platforms on the Cloud

The software platforms on the cloud mainly include the map server, the scheduling server and the log and simulation server. The map server stores the pre-built map for the traversed environments. The scheduling server performs the task allocations and collects the status of every registered running vehicle. It also plays the role of accessing the map data for routing, transmitting sensor data into the map server, recording the key information into the log server, and replaying the data recorded for a good trace-back. The log and simulation server runs the end-to-end simulator Carla222Available at: and 51Sim-One333Available at: The clock synchronization between the platforms on the vehicle and cloud is manipulated based on the network time through the Network Time Protocol (NTP).

3 Perception

Perception serves as the fundamental component of autonomous navigation. It provides necessary information for planning and control. This section describes some of the key perception technologies used in our vehicle.

3.1 3-D Object Detection with Lidar Point Clouds

The 3-D object detection aims to recognize and classify objects, as well as estimate their poses with respect to a specific coordinate system. We use multiple Lidars for object detection. To exploit the information from multiple Lidars, it is important to calibrate them. To this end, we propose a marker-based approach

[1] for automatic calibration without any additional sensors and human intervention. We assume that three linearly independent planar surfaces forming a wall corner shape are provided as the calibration targets, ensuring that the geometric constraints are sufficient to calibrate each pair of Lidars. After matching the corresponding planar surfaces, our method can successfully recover the unknown extrinsic parameters with two steps: a closed-form solution for initialization based on the Kabsch algorithm [2] and a plane-to-plane iterative closest point (ICP) [3] for refinement.

The overview of our 3-D object detection is shown in Fig. 4. The inputs to our approach are multiple point clouds captured by different Lidars. We adopt an early-fusion scheme to fuse the data from multiple calibrated Lidars at the input stage. With the assumption that the Lidars are synchronized, we transform the raw point clouds captured by all the Lidars into the base frame, and then feed the fused point clouds into the 3-D object detector. The final output is a series of 3-D bounding boxes. Each bounding box is parameterized as , where is the class of a bounding box, and denotes the bottom center of a box. We employ VoxelNet [4] as the 3D object detector, which consists of three main components: a Feature Learning Network (FLN) for feature extraction; Middle Layers (ML) for feature embedding with sparse convolution; and a Region Proposal Network (RPN) for box prediction and regression.

3.1.1 Feature Learning Network

The FLN encodes the sparse points into a dense feature representation. In [4]

, the authors first voxelize the input point clouds and then apply a bunch of Voxel Feature Encoding layers (VFELayers). The VFELayer is composed of multi-layer perceptrons and a max-pooling layer. In the max-pooling layer, the point-wise features inside a voxel are aggregated into voxel-wise features, which will be forwarded to the ML for feature embedding.

3.1.2 Middle Layers

The ML converts the 3-D dense grid to a 2-D Bird’s-Eye-View (BEV) representation. To process the grids, the ML is implemented with several 3-D convolution layers, where the 3-D convolutional operations, BN layer, and ReLU layer are applied sequentially. The convolution kernel size is set as

, since we denote the dense grids in the order of . As a result, these layers can aggregate voxel-wise features within a progressively expanding receptive field along the -axis, as well as enrich the shape description.

In [4], the ML costs almost of the computation of the entire network due to the dense 3-D convolutional operations. The non-empty voxels usually occupy of all grids. However, in the process of dense convolution, calculations are required on each voxel. Inspired by [5], we utilize the spatially sparse convolution [6] to replace the dense convolution layers in the basic ML. Hence, the inference time is boosted from approximately ms to ms even with a larger sampling resolution. Additionally, we add the DropBlock [7] to randomly deactivates patches of the feature map during training. This forces the network to search for more evidence to detect an object with several invisible parts.

Figure 5: The schematic diagram of our 3-D point-cloud mapping system. After the initialization, the system will estimate the states and refine the global map and Lidar poses in the odometry and mapping sub-modules, respectively.

3.1.3 Region Proposal Network

The RPN applies convolutional layers to predict the probability and regression map from the extracted features. Since the increasing invariance and large receptive fields of top-level nodes yield smooth responses and cause inaccurate localization, the RPN adopts skip-layers

[8] to combine high-level semantic features with low-level spatial features [4]. Each entry of the probability map estimates the probability of the objectiveness, and one entry in the regression map regresses the pose and size of a bounding box. We jointly estimate instead of to avoid the angular singularity problem described in [9].

To train the network, we adopt the classification and regression loss [10] to train the stage-2 network:



represents the posterior probability of objectiveness, and

and is the regression output and ground truth for the positive anchors. is defined as the classification loss, denotes the focal loss [11], and is defined as the normalized regression loss where we use the SmoothL1 function to calculate the loss .

Figure 6: The schematic diagram of our moving-object detection system. The white point cloud and the blue point cloud represent the detected objects in the current frame and the last frames, respectively.
Figure 7: The schematic diagram of planning for our autonomous vehicle. The planning process consists of four layers: global planning, behavioral planning, path planning and motion planning. The four layers are coloured in pink.

3.2 3-D Point-cloud Mapping

The 3-D point-cloud mapping aims to build the 3-D map of the traversed environments. Fig. 5 shows the diagram of our mapping system. The inputs to the system are the raw data from the IMU and 3-D Lidar (i.e., accelerometer and gyroscope readings from the IMU and point clouds from the 3-D Lidar). The system starts with an adapted initialization procedure, followed by two major sub-modules: the Lidar-inertial odometry and rotationally constrained mapping. Since the vehicle usually remains still at the beginning of mapping, we do not need to excite the IMU to initialize the module as described in [12], which is more suitable for hand-held applications. With the stationary IMU readings, the initial orientation for the first body frame can be obtained by aligning the average of the IMU accelerations to the opposite of the gravity direction in the world frame . The initial velocity , and IMU biases are set to zero. The state estimation problem in the Lidar-inertial odometry sub-module is aimed at optimally fusing Lidar and IMU measurements in a local window. The state

in our odometry contains the body pose

, velocity of the IMU in , accelerometer and gyroscope biases in , and extrinsic parameter between the Lidar and IMU .

First, with new current IMU readings, we can propagate the previously estimated IMU pose to the current predicted IMU pose , and pre-integrate the IMU measurements. Transformed by , the predicted Lidar pose can be obtained from

. The Lidar point cloud is undistorted by linear interpolation between

and the previously optimized Lidar pose . Similar to [13], Lidar features are extracted from each sweep. Using , we can associate feature points to these in the local map and form the relative Lidar measurements according to [12]. With the pre-integration and relative Lidar measurements, we can write the state estimation problems in a fixed-size window of as:


where is the prior term from marginalizing out old states, is the relative Lidar measurement, and is the pre-integration term. The optimized can be obtained by solving (2), using the Levenberg-Marquart algorithm. After the optimization, the state for propagation and local map can be updated. In our odometry sub-module, the window-size is used to limit the number of parameters. Although the odometry sub-module can make the system align with gravity well, it lacks a larger global map to constrain the local map. Thus, after a long time, the system may drift by using only the local constraints.

To reduce the drift, we will refine the undistorted point cloud and Lidar pose to an incremental global map. Following the normal mapping module from [13], we can write the global map problem as


By registering a point cloud to the global map, we can achieve a more consistent map than using only local window Lidar-inertial odometry. However, if we only use the pose from the odometry as a prior to minimize 3, we would waste the property of the orientation of the Lidar-inertial odometry that is aware of the gravity direction. Therefore, we propose rotationally constrained mapping. At the optimization, we will apply a modified Jacobian according to the confidence of the orientation estimation and update the Lidar pose as stated by [12].

3.3 Ego-motion Compensated Moving-object Detection

Moving objects are very common in traffic environments. Detection of moving objects is crucial for the motion planning of our autonomous vehicle in dynamic environments [14]. Based on the aforementioned object detection and odometry estimation, we propose an ego-motion compensated moving-object detection system. The schematic diagram is shown in Fig. 6. The input to our moving-object detection system is consecutive point-cloud frames with object detection results including both moving and static ones. We first use the aforementioned Lidar-inertial odometry estimation algorithm to find the transformation between the two frames. Then, the last point-cloud frame is transformed to the current coordinate frame with the transformation matrix. We use object tracking to associate the bounding boxes across the two frames. The moving objects are detected by subtracting the corresponding bounding boxes from the warped and current point-cloud frames. Specifically, we set a threshold for the distance between the bounding boxes. Moving objects are detected with distances above the threshold. As shown in Fig. 6, there are three detected objects: a cyclist and two cars. With ego-motion compensation and frame differencing, a moving cyclist and a moving vehicle are detected.

Figure 8: (a) An example for global planning and local planning inside an industrial estate. The red lines show the global path and the blue lines show the candidate path. The online local path planning examples use the global path as the reference. The dashed lines show the global path. The bold lines show the planned local path and the dots show the samples for path planning. The autonomous vehicle is represented as the rectangle. (b) An example for online motion planning given the local path as the reference. The green trace shows the planned path. The yellow curve shows the planned trajectory.

4 Planning

Planning enables autonomous vehicles to acquire future paths or motions towards the destination. The planning for autonomous driving is challenging, because traffic environments are usually with dynamic objects, bringing about risks and uncertainties [15]. Autonomous vehicles are required to interact with various road participants, including cars, motorcycles, bicycles, pedestrians, etc. The planner needs to meet the vital requirements of safety, and the kinematic and dynamic constraints of vehicles, as well as the traffic rules. To satisfy these requirements, our planning is hierarchically divided into four layers: route planning, behavioral planning, local path planning, and motion planning. Fig. 7 shows the four-layer planning process.

4.1 Route Planning

The route planning aims at finding the global path from the global map. For autonomous driving, the route planner typically plans a route given the road network. For structured environments with clear road maps, we can use path planning algorithms, such as A*, to find the route. However, in unstructured environments, such as an industrial estate or housing district, we collect the past drivable routes as the teaching process. Fig.8(a) shows the global route established inside an industrial estate. The red lines show the global path and the blue lines show the candidate path as an alternative.

4.2 Behavioral Planning

Behavioral planning decides the manoeuvres for local navigation. It is a high-level representation of a sequence of vehicle motions. Typical manoeuvres are lane keeping and overtaking. This layer receives information from the global maps and finds the category of the local area to give specifications on path planning. For example, unstructured environments, like parking lots, have different requirements on planning. Given the road map and the localization of the ego-vehicle, features of the local area can be obtained. As shown in Fig. 7, road information that indicates the road environment classification of the global path segments is helpful for behavioral planning. Furthermore, traffic information from traffic signs helps in making decisions. The road and traffic information together with the estimation of other moving agents allows the behavioral planner to follow or overtake the front car, or pull over the ego-vehicle.

4.3 Local Path Planning

The local path planning generates a geometric path from the starting pose to the goal pose for the vehicle to follow. The time complexity of this process increases with increased path length, so it is often limited to a local range to ensure real-time planning. The local path planner needs to tackle motion constraints of the vehicle to generate collision-free paths that conform to the lane boundaries and traffic rules. Fig. 7 shows the online local path planning for driving on standard roads. Here we plan the path in the Frenet coordinate system. With the global path as the reference path, it defines the lateral shift to the path and the distance travelled along the path from the start position. We drew multiple samples with different speeds and lateral offsets. Then a graph search method is adopted to search the path with the minimum cost. To define the cost of the coordinates of each curve, we take into consideration the quality of the curve, ending offset to the global path, and other factors (e.g., the potential trajectories of other agents).

4.4 Motion Planning

Given the planned path, motion planning is the final layer which optimizes the trajectory with dynamic constraints from the vehicle, the requirements for comfort and energy consumption. The planned trajectory specifies the velocity and acceleration of the vehicle at different timestamps, so it is also called trajectory planning. Though the path planned in the Frenet frame contains speed information, the dynamic constraint of the vehicle is not yet considered. Besides this, the local planning process is time-consuming and has a low update rate, which is inadequate to handle dynamic obstacles and emergency cases. The motion planner optimizes the trajectory given the information of obstacles, the constraints from the vehicle, and the path from the local path planner. It outputs the final trajectories for the controller to follow at a much higher updating rate to ensure safety. Fig. 8(b) shows the planned trajectory as yellow curves. The planned path is shown as the green trace. The planned trajectory differs from the path in the first segment, which is critical for collision avoidance. The motion planner shows more strict requirements for safe distances to ensure safety.

5 Control

The main task of vehicle control is to track the planned trajectory. In the past decade, many trajectory tracking controllers have been developed, among which the Model Predictive Controller (MPC) [16] is the most popular one. The schematic diagram of our controller is shown in Fig. 9.

As we can see, there are two inputs to the trajectory tracking controller. One is the trajectory , which includes the information (e.g., desired coordinates, curvatures, speed) from the motion planner, the other is the feedback information from the state estimator. Sometimes, sensor feedback from the chassis cannot be directly sent to the controller, or more feedback quantities are required by the controller, which is difficult to obtain from sensors. In such cases, a state feedback estimator is required but is not a must. In Fig. 9, the output of the trajectory tracking controller is sent to the chassis after being processed by a lower controller. The lower controller can work for many purposes. For example, our autonomous vehicle can work in both the autonomous-driving mode and the parallel-driving model (i.e., the remote control mode). The trajectory tracking controller only functions in the autonomous-driving mode, which means that only in the autonomous-driving mode the lower controller takes as input .

The unmanned-car control can be divided into the lateral control, which controls steer angles, and the longitudinal control, which controls the car speed. There are two types of MPCs in the area of the autonomous vehicle. One is kinematics-based while the other is dynamics-based. A kinematics-based MPC is a combined controller that integrates the lateral control and longitudinal control. Therefore, the longitudinal PID controller highlighted in the dashed box in Fig. 9

is not required. The vector of two control quantities

(i.e., steer angle and speed), will be directly given by the MPC. However, the dynamics-based MPC is a standalone lateral controller of which the output is a control quantity of the steering angle. In such a case, a longitudinal PID controller that outputs the speed control quantity will be required. And the outputs of these two controllers constitute .

No matter whether it is kinematic or dynamic, an MPC is essentially composed of three key parts: the predictive model, rolling optimization, and feedback correction, as shown in Fig. 9. The main function of the predictive model is to predict the future output of the controlled plant based on the simplified plant model, historical information and future input. The goal of rolling optimization is to determine the final control quantities through the optimization of an objective function. Rolling optimization must be repeatedly carried out in real-time. Feedback information is used to correct the predictive results based on the car feedback information so as to process rolling optimization in the next period. The major difference between the kinematics-based MPC and dynamics-based MPC is the simplified vehicle model. The vehicle kinematics model is totally geometry-based, which takes the vehicle as a two-wheeled bicycle model, namely the Ackerman model. It applies while the vehicle is operated at low speed. The dynamic-based MPC is much more accurate than the kinematic-based MPC, and is derived from force analysis. However, the model complexities increase as well, leading to a more complicated optimization process and reduced real-time performance.

Figure 9: The schematic diagram of our controller. The main component is the trajectory tracking controller.

Figure 10: Selected task routes and demonstration photos. The first column includes three representative routes: (a) A 9.6 Km route in Zibo, Shandong for vegetable delivery; (f) A 1.2 Km route in Suzhou, Jiangsu for lunch meal delivery; (i) A 1.6 Km route in Shenzhen, Guangdong for lunch meal delivery. Photos taken in Zibo, Shandong: (b) Starting from the logistics company; (c) Crossing the gate of the logistics company; (d) and (e) Urban main roads. Photos taken in Suzhou, Jiangsu: (g) Left turn; (h) Spraying disinfectant in a residential area. Photos taken in Shenzhen, Guangdong: (j) Heavy traffic environment; (k) Surrounded with road users and meeting with oncoming cars; (l) U-turn in a narrow road; (m) Traffic light at night; (n) Contact-less picking up meal.
City Task Distance Time Duration Payload Environments Characteristics
Vegetable Delivery 9.6 Km 75 min 600 KG Urban Main Road
Light Traffic
Heavy Payload
Slow Speed
Vegetable Delivery 5.4 Km 50 min 960 KG
Urban Main Road
Residential Area
Light Traffic
Heavy Payload
Slow Speed
Meal Delivery 1.2 Km 30 min
80 KG
(100 boxes of meals)
Urban Main Road
Medium Traffic
Road Disinfection 0.6 Km 10 min - Residential Area
Narrow Road
Crowded Obstacle
Slow Speed
Meal Delivery 1.6 Km 20 min
64 KG
(80 boxes of meals)
Urban Main Road
Residential Area
Heavy Traffic
Narrow Road
Meal Delivery 4.0 Km 40 min
96 KG
(120 boxes of meals)
Urban Main Road
Residential Area
Heavy Traffic
Narrow Road
Table 2: The details of the contact-less goods transportation tasks during the outbreak of COVID-19 in China.
Level Description No. of Occurrences
Z1 Human intervention to avoid an accident 0
Z2 Human intervention due to system errors 5
Z3 Human intervention due to freezing 25
Z4 Human intervention due to too slow speed 2
Table 3: The descriptions for the 4-level human interventions, and the number of times happened in the tasks in Shenzhen, Guangdong.

6 Evaluation

This section describes the real tasks of contact-less goods transportation using our vehicle during the outbreak of COVID-19 in China. From Feb. 2, 2020 to Mar. 7, 2020, our vehicles were deployed in 3 different places (Zibo, Shandong; Suzhou, Jiangsu; Shenzhen, Guangdong) in China. The total running distances of each vehicle reached 1100 Km. The details of the transportation tasks are summarized in Tab. 2. Selected demonstration photos during the tasks are displayed in Fig. 10. Note that in case of failures during the autonomous navigation, such as unavoidable accidents and system errors, we build a parallel driving system for our vehicle to back up our navigation. The parallel driving system is a remote control system based on 4G/5G technology. The vehicle control is immediately taken over by a human driver in case of any failure. We expect less human intervention during our goods transportation tasks. The vehicle is evaluated by the number of occurrences of human interventions.

6.1 Tasks in Zibo, Shandong

We first deploy our vehicle in Zibo, Shandong due to the urgent requirement of a local logistics company. The tasks are to deliver fresh vegetables from a supplier to quarantined residential areas. There are two routes for the tasks, both of which have good traffic conditions (e.g., very few cars running on the roads). As the tasks require the vehicle loaded with goods that approximately approach the maximum payload, such as 960 KG. To ensure safety, the vehicle speed is limited to 20 Km/h. There is no human intervention during the tasks.

6.2 Tasks in Suzhou, Jiangsu

The tasks in Suzhou are divided into two parts. One task is to deliver lunch meals from a restaurant to a company. The vehicle receives 100 boxes of meals from the restaurant, and delivers them to the building where the company is located. As shown in the route map, the vehicle needs to take two U-turns and one left turn. The second task is to disinfect a road in a residential district every night. We mount a sprayer on our cargo box so that the vehicle can be used to disinfect traversed areas. The traversed route is approximately 600 m long, but the environment is very crowded (e.g., vehicles and bicycles parked along the road). In these two tasks, few human interventions are involved to give way to other vehicles to avoid accidents or go through areas crowded with obstacles.

6.3 Tasks in Shenzhen, Guangdong

Similar to the tasks in Suzhou, the tasks in Shenzhen are to deliver lunch meals, but the routes in Shenzhen have heavy traffic. So human intervention is frequently involved. We classify the human interventions into 4 levels: Z1 to Z4, the definitions of which are shown in Tab. 3. The Z1 level indicates the most serious case. Without human intervention, there will be an accident for Z1. The seriousness decreases from Z1 to Z4.

As shown in Tab. 3, Z1 never occurred during the tasks in Shenzhen. There are 5 occurrences of Z2, which are caused by the initialization failure of the Lidar-inertial-based localization and the false detections of traffic lights at night. There are 25 occurrences of Z3. The main reason is that there are so many road users around the vehicle that it stops to avoid collisions. There are 2 occurrences of Z4, both of which happen on very narrow roads. On such roads, even experienced human drivers cannot go through quickly.

7 Lessons Learned and Conclusions

For object detection, we found that real-time performance deserves much more attention than accuracy in practice. The perception algorithms should be efficient since they are always the front-end of the entire autonomous navigation system. Therefore, we replaced the dense convolution layers with spatially sparse convolution [6] in our 3D object detection module. As a result, the inference time is boosted from ms approximately to ms.

For the point-cloud mapping, we found that our system was capable of dealing with the rapid motion of the vehicle and short-term point-cloud occlusions. Since most of the Lidars are mounted parallel to the ground and the vehicles always move along the ground, the typical ring structure of the point clouds makes the system difficult to observe in terms of translational movements vertical to the ground plane. Drift in this direction is inevitable during long-time operations. In practice, we learned that the GPS localization results signaled the potential loop, which led to a consistent larger-region map for Lidar-based localization. In very crowded dynamic traffic environments, the system could be degraded by the disturbances from moving objects. To tackle this issue, we use semantic segmentation to remove movable objects to get clear point-cloud data.

For the planning part, we found the four-layer hierarchical planning necessary and effective. The application environments of our vehicle are complex in the sense of road structures, traffic conditions, driving etiquette, etc. Layer-wise planning makes the system extensible for multiple environments and various requirements. Furthermore, it is essential to attach importance to the uncertainty from the perception modules. The uncertainty comes from the limited accuracy of the perception system as well as the time delay in processing. For the planning, we avoid hitting the critical conditions and leave safe distances for the planned trajectory.

For the control part, we found that the greatest advantage of using an MPC can be gained by adding multiple constraints in the control process. When the vehicle operates at low speed, the kinematic constraints restrain the vehicle motion planning and control. But with the increment of speed, dynamic characteristics become more influential. As aforementioned, the dynamic-based MPC is much more accurate than the kinematic-based MPC since the predictive model is more accurate. However, we found that the complex-model prediction is not the best option. With regards to low- and middle-speed operation environments, a simplified predictive model with multiple constraints would be sufficient. As is known, the introduction of multiple constraints in the optimization may narrow down the range of solutions and even lead to no solution, which may corrupt the program. In such a case, it would be wise to bring in the relaxing factor when designing the objective function that could avoid the no-solution situation. Another factor that needs to be considered carefully when applying the MPC is the determination of the prediction horizon. A long prediction horizon can improve the control accuracy to a certain degree, but would consume more computation power and reduce the real-time performance. Different controlled plants in different operation environments may have different requirements on the control frequency. With the guarantee of the control frequency, our strategy is to select a prediction horizon that is as long as possible.

The successful daily operations demonstrated that using our autonomous logistic vehicle could effectively avoid virus spread due to human contact. It effectively provides a physical wall between the recipient and sender in the process of goods transportation. For example, one task is to deliver meals to quarantined persons suspected. Our vehicle automatically drives to the infected area and notifies the persons to collect the food through short messages. There is no contact between the persons sending the food and the quarantined during the whole process, which successfully avoid virus spread through human-to-human contact. Currently, there is a huge demand for contact-less goods transportation in many infected areas. We believe that continuous operations could extensively improve our vehicle and speed up the maturity of our autonomous-navigation technology.

8 Acknowledgements

This work was supported by the Research Grant Council of Hong Kong SAR Government, China, under Project No. 11210017 and No. 21202816, Guangdong Science and Technology Plan Guangdong-Hong Kong Cooperation Innovation Platform (Grant Number 2018B050502009), Shenzhen Science and Technology Innovation Commission (Grant Number JCYJ2017081853518789), Science and Technology Development Fund, Macao S.A.R (FDCT) (Project reference number 0015/2019/AKP).


  • [1] J. Jiao, Q. Liao, Y. Zhu, T. Liu, Y. Yu, R. Fan, L. Wang, and M. Liu, “A novel dual-lidar calibration algorithm using planar surfaces,” in 2019 IEEE IV.   IEEE, 2019, pp. 1499–1504.
  • [2] W. Kabsch, “A discussion of the solution for the best rotation to relate two sets of vectors,” Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, vol. 34, no. 5, pp. 827–828, 1978.
  • [3] F. Pomerleau, F. Colas, R. Siegwart, and S. Magnenat, “Comparing icp variants on real-world data sets,” Autonomous Robots, vol. 34, no. 3, pp. 133–148, 2013.
  • [4] Y. Zhou and O. Tuzel, “Voxelnet: End-to-end learning for point cloud based 3d object detection,” in

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    , 2018, pp. 4490–4499.
  • [5] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
  • [6] B. Graham and L. van der Maaten, “Submanifold sparse convolutional networks,” arXiv preprint arXiv:1706.01307, 2017.
  • [7] G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Dropblock: A regularization method for convolutional networks,” in Advances in Neural Information Processing Systems, 2018, pp. 10 727–10 737.
  • [8] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [9] M. Simon, S. Milz, K. Amende, and H.-M. Gross, “Complex-yolo: An euler-region-proposal for real-time 3d object detection on point clouds,” in ECCV.   Springer, 2018, pp. 197–209.
  • [10] P. Yun, L. Tai, Y. Wang, C. Liu, and M. Liu, “Focal loss in 3d object detection,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 1263–1270, April 2019.
  • [11] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [12] H. Ye, Y. Chen, and M. Liu, “Tightly coupled 3d lidar inertial odometry and mapping,” in 2019 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2019.
  • [13] J. Zhang and S. Singh, “Loam: Lidar odometry and mapping in real-time.” in Robotics: Science and Systems, vol. 2, no. 9, 2014.
  • [14] Y. Sun, M. Liu, and M. Q.-H. Meng, “Improving RGB-D SLAM in dynamic environments: A motion removal approach,” Robotics and Autonomous Systems, vol. 89, pp. 110 – 122, 2017.
  • [15] M. Liu, “Robotic online path planning on point cloud,” IEEE transactions on cybernetics, vol. 46, no. 5, pp. 1217–1228, 2015.
  • [16] C. E. Garcia, D. M. Prett, and M. Morari, “Model predictive control: theory and practice—a survey,” Automatica, vol. 25, no. 3, pp. 335–348, 1989.