ObserveNet Control: A Vision-Dynamics Learning Approach to Predictive Control in Autonomous Vehicles

by   Cosmin Ginerica, et al.

A key component in autonomous driving is the ability of the self-driving car to understand, track and predict the dynamics of the surrounding environment. Although there is significant work in the area of object detection, tracking and observations prediction, there is no prior work demonstrating that raw observations prediction can be used for motion planning and control. In this paper, we propose ObserveNet Control, a vision-dynamics approach to the predictive control problem of autonomous vehicles. Our method is composed of: i) a deep neural network able to confidently predict future sensory data on a time horizon of up to 10s and ii) a temporal planner designed to compute a safe vehicle state trajectory based on the predicted sensory data. Given the vehicle's historical state and sensing data in the form of Lidar point clouds, the method aims to learn the dynamics of the observed driving environment in a self-supervised manner, without the need to manually specify training labels. The experiments are performed both in simulation and in the real world, using CARLA and RovisLab's AMTU mobile platform as a 1:4 scaled model of a car. We evaluate the capabilities of ObserveNet Control in aggressive driving contexts, such as overtaking maneuvers or side cut-off situations, while comparing the results with a baseline Dynamic Window Approach (DWA) and two state-of-the-art imitation learning systems, namely Learning by Cheating (LBC) and World on Rails (WOR).




1 Introduction

With the rise of Artificial Intelligence and Deep Neural Networks (DNN), research in the area of autonomous vehicles has advanced significantly in the last decade, driven by academia and industry alike. An autonomous car is an intelligent agent which observes its environment, makes decisions and performs actions based on these decisions. The state-of-the-art approaches for self-driving car control are based on the traditional perception-planning-act pipeline [pendleton2017perception]. First, the environment is reconstructed from sensory data. Secondly, a safe vehicle state trajectory is computed by a path planner. Finally, the trajectory is translated to motion commands which are executed by a controller.

A key difficulty is that the control strategy is affected by unpredictable external factors that can arise in the driving environment. It has been observed that the control quality increases when the motion of the objects around the car is tracked and predicted. However, such an approach is strictly dependent on the accuracy of the object detection system. If the detection system fails, then the motion of the objects can no longer be predicted, which degrades the performance of the controller.

Figure 1: ObserveNet Control. A deep neural network is trained in a self-supervised fashion to predict raw sensory data over a finite prediction horizon. The predictions, as well as historic states and observations, are iteratively used by a temporal state trajectory planner to calculate and execute a safe vehicle motion.

In this paper, we propose the ObserveNet Control method depicted in Fig. 1, which is an approach to derive control signals from predicted raw sensory data. The algorithm is composed of a DNN which receives as input historical system states and measurements from a Lidar range sensing device. The network is composed of convolutional and recurrent layers trained in a self-supervised fashion, without the need of manually labeling training data. The output of the DNN represents the most likely future state trajectory of the vehicle, where each state in the state trajectory is augmented with predicted sensory data.

Our work was mainly inspired by [world_discovery_models], which served as a baseline for developing the architecture of the DNN. Apart from the application area (autonomous driving vs. navigating a simulated maze), there are several key differences in our work, as detailed in Section 2. For performance evaluation, we have tested our algorithm on two distinct platforms, namely within the CARLA simulator, as well as on our RovisLab Autonomous Mobile Test Unit (AMTU) from Fig. 6.

The key contributions of the paper are:

  • the deep neural network architecture of ObserveNet, used to predict future state trajectories and raw sensory measurements along a finite prediction horizon;

  • the self-supervised training of ObserveNet with raw range sensing information acquired from a Lidar device;

  • a temporal planner used to calculate a safe vehicle path based on the output of ObserveNet.

The rest of the paper is organized as follows. Section 2 covers the related work. The methodology of ObserveNet Control is given in Section 3. The experimental validation is given in Section 4. Finally, conclusions are stated in Section 5.

2 Related Work

Similar to the perception-planning-act [pendleton2017perception] paradigm, our approach divides the algorithm into raw observations prediction and state trajectory planning. This is opposed to End-To-End learning [bojarski2016end], where sensory data is directly mapped to control signals via a deep neural network.

As a result of progress in AI, several model-free approaches have emerged in the field of learning-based control for self-driving vehicles. One such paradigm is End-To-End learning, introduced above. Another approach is Deep Reinforcement Learning (DRL), where an agent is controlled via action-reward systems. In terms of training data, self-supervised learning is a technique heavily used in modern learning-based architectures, consisting of algorithms that learn independently, without the need for human intervention in the form of manual data labelling.

From a data-driven control perspective, our work builds on goal-conditioned imitation learning [C, D, F, chen2021wor]. As shown in [D], this class of methods combines learning from an expert driver with the advantages of goal-directed planning based on dynamic models and reward functions. However, in addition to learning the optimal driving path, we also employ the prediction of future raw observations within the control task.

In addition to accurate object detection, the performance of autonomous vehicles is directly influenced by their ability to predict future positions of dynamic agents, such as other cars and pedestrians [G]. Instead of focusing on label-intensive object detection and tracking, we propose to learn to predict future sensory data via self-supervised learning and to plan future trajectories based on the predicted sensory data. Raw observation prediction methods include PredNet [Lotter2016DeepPC], which was trained to predict future frames in a video sequence, while the work in [H, I, K, L] is focused on forecasting an intermediate occupancy map representation of the surrounding traffic environment. An inverted pose forecasting pipeline is proposed in [J], where the first phase of the algorithm predicts future 3D point clouds, followed by object detection and pose estimation, instead of first detecting objects and then predicting their future motion. Apart from sensory data, LaneGCN [LaneGCN] and VectorNet [VectorNet] fuse object detection and tracking with HD map data in order to better forecast the motion of dynamic objects within complex road networks. In contrast, using our ObserveNet sensory data predictor, we also propose to derive control signals for the vehicle, thus showing that automatic control based on predicted observations is viable.

The authors of [world_discovery_models] use a temporal neural network to construct an agent's belief. The belief is used in an RL setup to build a world model. The agent's observations are modeled as one-hot encoded obstacles in an occupancy grid. The goal is to learn an information-seeking behavior for an agent, endowing it with a strong drive towards discovering its environment. In our work, we use a different observation modelling mechanism for Lidar data. While the agent in [world_discovery_models] moves based on four discrete control actions (left, right, up, down), in our work the actions are represented by vehicle control signals. Another difference is that the authors in [world_discovery_models] use a control method based on Reinforcement Learning, as opposed to our ObserveNet Control pipeline.

A key contribution of our work is therefore to extend the neural network architecture from [world_discovery_models], adapt it to our sensor setup, and illustrate that this modified architecture can provide results in a closed-loop system, tested under dynamic traffic conditions and with a real-world robotic system.

3 Methodology

Throughout the paper, we use the following notation. The value of a variable is defined either for a single discrete time step, written as a superscript, or as a discrete sequence defined over a time interval of a given length. For example, the value of a state variable is defined either at a discrete time or within a sequence interval. The difference between a measured and a predicted quantity is made using the hat notation (a hatted symbol denotes a predicted value, its unhatted counterpart a measured one). Vectors and matrices are indicated by bold symbols.

3.1 Problem definition

The autonomous driving problem can be stated as follows. Given a global reference trajectory, we want to calculate and execute a safe vehicle state trajectory over the control horizon, given past and current sensory observations, as well as predicted raw observations. The problem is parameterized by the discrete time, the length of the sequence of past observations and the length of the control horizon.

A state trajectory is defined as a sequence of vehicle states, where each state comprises the Cartesian coordinates of the ego-vehicle, its velocity and its orientation.
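As an illustration of this definition, a single state and a state trajectory over the control horizon can be written as below. The symbols are notation assumed here for the sketch: $x$, $y$, $v$ and $\psi$ for position, velocity and orientation, $\mathbf{z}$ for the state, $t$ for the discrete time, $\tau_o$ for the horizon length, and angle brackets for the time index.

```latex
\mathbf{z}^{<t>} = \left[ x^{<t>},\; y^{<t>},\; v^{<t>},\; \psi^{<t>} \right]^{T},
\qquad
\mathbf{z}^{<t+1,\,t+\tau_o>} = \left[ \mathbf{z}^{<t+1>},\; \mathbf{z}^{<t+2>},\; \dots,\; \mathbf{z}^{<t+\tau_o>} \right]
```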

3.2 ObserveNet Predictive Control

The objective of the control loop is to compute a safe state trajectory which is to be executed by a motion controller. Fig. 2 illustrates our proposed ObserveNet predictive control approach. Our main contributions are the Observations and Trajectory Prediction module, along with the Temporal State Trajectory Planning algorithm, which consumes past and predicted observations and states and computes the input to a constrained Non-linear Model Predictive Controller (NMPC). Within its optimization loop, NMPC calculates the optimal control signals, namely the longitudinal velocity and the steering angle.

Figure 2: Block diagram of the ObserveNet Predictive Control algorithm. Based on historical Localization & Perception data stored in the Augmented Memory component, the method predicts raw sensory Observations and State Trajectories which are used by a Temporal Planner to calculate a safe desired vehicle trajectory executed via Constrained Non-linear Model Predictive control. The dotted lines illustrate the flow of data used during self-supervised training.

The Localization & Perception component from Fig. 2 constructs a top-view 2D occupancy grid out of 3D Lidar point clouds, while computing the state of the vehicle using inertial measurements from an IMU, geolocation data from a GPS receiver and wheel odometry. The Ackermann vehicle model (Eq. 2) is used both by the state estimator and the controller. It is parameterized by the steering angle, the length of the vehicle and the sampling time.
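A minimal sketch of one update step of a kinematic Ackermann model may help fix ideas. It assumes the commonly used discrete-time form driven by a velocity command and a steering angle; the names `VehicleState` and `ackermann_step`, and the choice to apply the commanded velocity directly, are assumptions made for this example rather than the exact model used in the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float    # position along the x axis [m]
    y: float    # position along the y axis [m]
    v: float    # longitudinal velocity [m/s]
    psi: float  # orientation (heading) [rad]


def ackermann_step(state, v_cmd, delta, length, dt):
    """Advance the state by one sampling period dt.

    v_cmd is the commanded longitudinal velocity, delta the steering
    angle and length the vehicle length (all hypothetical names).
    """
    x = state.x + state.v * math.cos(state.psi) * dt
    y = state.y + state.v * math.sin(state.psi) * dt
    psi = state.psi + (state.v / length) * math.tan(delta) * dt
    # Assumption: the velocity command is tracked within one step.
    return VehicleState(x=x, y=y, v=v_cmd, psi=psi)
```

With a zero steering angle the vehicle moves along its current heading; a positive steering angle increases the heading at a rate that grows with speed and shrinks with vehicle length.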

The predictions are integrated by taking into consideration the vehicle’s dynamics from Eq. 2, so that obstacle information is projected onto the occupancy grid according to the ego-vehicle’s state for each predicted set of observations. In the following, we detail ObserveNet’s deep neural network architecture and the methodology within the temporal state trajectory planner.

3.3 ObserveNet Architecture and Training

The goal of the ObserveNet neural network from Fig. 2 is to predict future observations based on past sensory measurements, the global reference route and the vehicle's past state trajectory. This past and predicted data is used to calculate the reference of the constrained NMPC controller which will execute the motion of the car.

Figure 3: ObserveNet deep neural network architecture. The DNN receives input data in the form of historical observations, vehicle states and reference route, with the goal of computing a safe desired state trajectory using predicted observations and temporal path planning. ObserveNet computes a dynamic output, having a size dependent on the time horizon for which the future observations are predicted.

In order to increase the computational speed, the input Lidar point clouds are first converted into an OcTree representation. An octree is a data structure used for representing 3D spatial features by subdividing them recursively into eight child units. The obtained 3D grid map can be calculated at a finer or coarser resolution, while preserving a low memory footprint, making the representation suitable for embedded computing. Within the conversion loop, we use several downsampling mechanisms, as follows.

First, we introduced a constraint for points which lie farther than a distance threshold. Each point of the input point cloud is copied to the output point cloud only if the Euclidean distance between the ego-vehicle's origin and the current point does not exceed the threshold.

Secondly, points which are below a minimum height or higher than our ego-vehicle's height are discarded. The goal of this filter is to remove the points representing the road.

Finally, points which are outside of a field-of-view threshold, measured from the front and rear of the vehicle with respect to the vehicle origin, are filtered out.

The problem of slopes, where the road would appear as an obstacle in the pre-processed point cloud, can be counteracted by updating the extrinsic parameters of the Lidar, which relate its orientation to the vehicle.
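The three filters can be sketched with NumPy as follows. This is an illustration under stated assumptions: points are given in the ego-vehicle frame with x pointing forward, the field of view is interpreted as a total angular width centred on the forward and rearward axes, and the function and parameter names (`filter_point_cloud`, `d_max`, `h_min`, `h_ego`, `fov_deg`) are hypothetical.

```python
import numpy as np


def filter_point_cloud(points, d_max, h_min, h_ego, fov_deg):
    """Keep points that pass the distance, height and field-of-view tests.

    points: (N, 3) array of x, y, z coordinates in the ego-vehicle frame,
            with x pointing forward (assumed convention).
    """
    points = np.asarray(points, dtype=float)
    # 1) Distance: Euclidean distance from the ego-vehicle origin.
    dist = np.linalg.norm(points, axis=1)
    keep = dist <= d_max
    # 2) Height: drop road points and points above the vehicle.
    keep &= (points[:, 2] >= h_min) & (points[:, 2] <= h_ego)
    # 3) Field of view: bearing measured from the forward or rearward axis.
    bearing = np.degrees(np.arctan2(points[:, 1], points[:, 0]))
    front = np.abs(bearing) <= fov_deg / 2.0
    rear = np.abs(bearing) >= 180.0 - fov_deg / 2.0
    keep &= front | rear
    return points[keep]
```

Applying all three masks in a single pass avoids building intermediate point clouds, which matters when the filter runs on every Lidar sweep.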

The obtained OcTree is stored as historical data in the Augmented Memory component from Fig. 2. In order to further increase the computational speed of ObserveNet, we project the OcTree onto a top-view 2D occupancy grid which is fed to the deep network. The network outputs observation predictions for future frames.
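The top-view projection can be illustrated with a short sketch. It assumes the occupied OcTree voxels are available as an array of 3D points (for example voxel centres) and that the grid is square and centred on the ego-vehicle; `to_occupancy_grid`, `grid_size` and `cell_size` are names chosen for this example.

```python
import numpy as np


def to_occupancy_grid(points, grid_size, cell_size):
    """Project (N, 3) points onto a square top-view grid centred on the ego-vehicle."""
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    points = np.asarray(points, dtype=float).reshape(-1, 3)
    half = grid_size // 2
    # Map metric x/y coordinates to integer row/column indices; z is dropped.
    rows = np.floor(points[:, 0] / cell_size).astype(int) + half
    cols = np.floor(points[:, 1] / cell_size).astype(int) + half
    inside = (rows >= 0) & (rows < grid_size) & (cols >= 0) & (cols < grid_size)
    grid[rows[inside], cols[inside]] = 1  # mark occupied cells
    return grid
```

Dropping the height coordinate turns each observation into a fixed-size 2D array, which is the kind of input convolutional layers operate on directly.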

ObserveNet's architecture is shown in Fig. 3. The basic building blocks are convolutional layers for spatial data encoding and Gated Recurrent Units (GRU) for extracting temporal dependencies. The input sequences are normalized and fed through the network's convolutional layers. The proposed DNN has a dynamic output, dependent on the length of the prediction horizon. Namely, each predicted observation is given by a separate network output branch, with one branch per time step of the prediction horizon.

As illustrated in Fig. 3, apart from the differentiable layers in the network, that is, its convolutional, recurrent and dense units, the architecture also embodies non-differentiable layers used to map each predicted observation to the most probable future state of the vehicle. The Plan & Simulate layer recursively uses our temporal planner to simulate future states, while replanning the desired vehicle trajectory. The predicted states are smoothed using an extended Kalman filter. A few observation predictions and their ground truth are shown in Fig. 4, as well as in Fig. 5 in the context of the temporal planner.

Figure 4: Predicted and ground truth observations at three different future timesteps. (top) Ground truth observations superimposed on ObserveNet predictions, along with the reference path marked in red and the planned state trajectory illustrated in green. For visualization purposes, we have considered only observations within a limited range. (bottom) Images from the front of the vehicle at the considered timestamps.

ObserveNet has been trained for a fixed number of epochs in a self-supervised fashion using the Adam optimizer, with a learning rate of 0.0003, while minimizing the mean squared error loss function between measured and predicted observations.

3.4 Temporal State Trajectory Planning and Control

The temporal path planner is based on the Dynamic Window Approach (DWA), modified to iteratively compute optimal vehicle paths based on current, as well as predicted observations. DWA is an online collision avoidance strategy for mobile robots, which uses the robot dynamics and the constraints imposed on the robot's velocities and accelerations to calculate a collision-free path in the 2D plane. We have implemented DWA based on the Robot Operating System (ROS) DWA local planner. DWA takes as input the distances from the ego-vehicle to the obstacles present in the scene, calculated as perception rays. In our case, we tune DWA to take into consideration both past and predicted information.
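The core idea of DWA can be sketched in a few lines: restrict the command search to the velocities reachable within one time step, roll each candidate forward, discard colliding candidates and score the rest. The sketch below is a simplified illustration, not the ROS DWA local planner; the scoring weights, the point-obstacle model and all function names are assumptions made for this example.

```python
import math


def dynamic_window(v, omega, v_max, omega_max, a_max, alpha_max, dt):
    """Velocities reachable within one time step under acceleration limits."""
    return (max(0.0, v - a_max * dt), min(v_max, v + a_max * dt),
            max(-omega_max, omega - alpha_max * dt),
            min(omega_max, omega + alpha_max * dt))


def rollout(x, y, psi, v, omega, dt, steps):
    """Forward-simulate a constant (v, omega) command."""
    path = []
    for _ in range(steps):
        psi += omega * dt
        x += v * math.cos(psi) * dt
        y += v * math.sin(psi) * dt
        path.append((x, y))
    return path


def dwa_step(pose, v, omega, goal, obstacles, limits,
             dt=0.1, steps=20, n=7, radius=0.5):
    """Pick the admissible (v, omega) pair with the best score.

    limits = (v_max, omega_max, a_max, alpha_max); obstacles are (x, y) points.
    Returns (0.0, 0.0), i.e. a stop command, if every candidate collides.
    """
    v_lo, v_hi, w_lo, w_hi = dynamic_window(v, omega, *limits, dt)
    best, best_score = (0.0, 0.0), -math.inf
    for i in range(n):
        for j in range(n):
            vc = v_lo + (v_hi - v_lo) * i / (n - 1)
            wc = w_lo + (w_hi - w_lo) * j / (n - 1)
            path = rollout(*pose, vc, wc, dt, steps)
            clearance = min((math.dist(p, o) for p in path for o in obstacles),
                            default=math.inf)
            if clearance < radius:
                continue  # candidate collides, discard it
            goal_cost = math.dist(path[-1], goal)
            # Hypothetical weighting of goal progress, clearance and speed.
            score = -goal_cost + 0.1 * min(clearance, 2.0) + 0.1 * vc
            if score > best_score:
                best, best_score = (vc, wc), score
    return best
```

Because the window is bounded by the acceleration limits, every returned command is one the vehicle can actually reach in the next control cycle.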

Figure 5: Temporal state trajectory planning using RovisLab AMTU. While tracking the red reference path, ObserveNet predicts the future observations marked with blue, green and yellow, respectively, as well as the most probable state trajectory of the vehicle. The blue lines represent candidate trajectories at each future timestamp, while the green line represents the output of the temporal planner. The white rectangles show predicted observations of the person depicted on the left, considered in this case as a moving obstacle.

A desired vehicle state is iteratively predicted at each future time step, based on the temporal vehicle dynamics and observations. The temporal path planning function takes as input both historical and predicted data and is evaluated at each prediction timestep (Eq. 6).

The temporal path planning function is used to interpolate the outputs of the DWA planner along the time horizon and can be decomposed into two terms (Eq. 7). The first term represents the interpolated trajectories of the planner given past observations, while the second term computes the interpolation for the planned trajectories based on predicted observations. The states are iteratively integrated into the desired state trajectory fed as input to the NMPC.

A snapshot of our temporal path planner can be seen in Fig. 5. The planner generates candidate trajectories at each prediction step (shown with blue lines in Fig. 5). The optimal trajectory is selected based on the vehicle dynamics from Eq. 2 and the predicted observations. The final trajectory, computed using Eq. 7, is marked in green in Fig. 5.
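The iterative use of predicted observations can be illustrated with a deliberately simplified loop: at each future step, candidate steering angles are simulated with a kinematic model, candidates that come too close to the obstacles predicted for that step are rejected, and the safe candidate closest to the reference is appended to the desired trajectory. This is a stand-in for the DWA-based planner, not the authors' implementation; the function names, the candidate set and the heading-first update order are assumptions of this sketch.

```python
import math


def step(state, v, delta, length, dt):
    """One kinematic update of (x, y, psi); the heading is updated first."""
    x, y, psi = state
    psi = psi + (v / length) * math.tan(delta) * dt
    return (x + v * math.cos(psi) * dt, y + v * math.sin(psi) * dt, psi)


def plan_temporal(state, reference, predicted_obstacles, v=1.0, length=0.5,
                  dt=0.5, safety=0.6, deltas=(-0.4, -0.2, 0.0, 0.2, 0.4)):
    """Build a desired trajectory with one state per predicted observation.

    reference: list of (x, y) reference points, one per future step.
    predicted_obstacles: list of obstacle point lists, one per future step.
    """
    trajectory = []
    for target, obstacles in zip(reference, predicted_obstacles):
        best, best_cost = None, math.inf
        for delta in deltas:  # candidate trajectories at this step
            cand = step(state, v, delta, length, dt)
            clearance = min((math.dist(cand[:2], o) for o in obstacles),
                            default=math.inf)
            if clearance < safety:
                continue  # unsafe with respect to the predicted obstacles
            cost = math.dist(cand[:2], target)
            if cost < best_cost:
                best, best_cost = cand, cost
        if best is None:
            break  # no safe candidate: stop extending the trajectory
        trajectory.append(best)
        state = best
    return trajectory
```

In this sketch the obstacle set changes from step to step, so a candidate is judged against where the obstacles are predicted to be at that time rather than where they were last observed.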

4 Experiments

The performance of ObserveNet Control was benchmarked against the classical Dynamic Window Approach (DWA) baseline method, as well as against two state-of-the-art methods based on imitation learning, namely Learning by Cheating (LBC) [B] and World on Rails (WOR) [chen2021wor]. Since ObserveNet is a derivation of imitation learning, we have chosen to compare it to LBC and WOR, which, unlike [LaneGCN] or [VectorNet], do not require object bounding boxes, tracks and HD maps. DWA and ObserveNet use the same Constrained NMPC controller for executing the motion of the vehicle.

The evaluation of the four competing algorithms was performed on aggressive driving scenarios simulated in the city environments Town 1 to 6 in CARLA, as well as on CARLA Leaderboard [A]. Real-world experiments have been performed using DWA and ObserveNet on indoor and outdoor navigation using the RovisLab Autonomous Mobile Test Platform (AMTU) from Fig. 6.

Figure 6: RovisLab AMTU (Autonomous Mobile Test Unit). The platform is used in the context of a 1:4 scaled car model.

Global reference trajectories were recorded while controlling both simulated and real vehicles manually and used afterwards as reference trajectories for the autonomous vehicle controller to track.

As performance measures, we have recorded the number of collisions between the ego-vehicle and other traffic participants, the cross-track error, the vehicle's orientation error, the rate of change of the steering and acceleration commands, as well as the ability of the vehicle to successfully complete the pre-recorded track, quantified as a completion percentage.

The cross-track error represents the difference between the reference trajectory and the vehicle's position along one coordinate axis. The reference trajectory is represented by a polynomial approximation, evaluated at the vehicle's current position.
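A minimal sketch of this metric with NumPy follows. It assumes the reference is fitted as y = f(x) with a cubic polynomial and that the error is taken as f(x) - y; the polynomial degree, the sign convention and the function name `cross_track_error` are assumptions, not details taken from the paper.

```python
import numpy as np


def cross_track_error(ref_x, ref_y, x, y, degree=3):
    """Difference between the fitted reference and the vehicle position along y."""
    # Fit the reference waypoints with a polynomial y = f(x).
    coeffs = np.polyfit(ref_x, ref_y, degree)
    # Evaluate the polynomial at the vehicle's x coordinate.
    return float(np.polyval(coeffs, x) - y)
```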

The vehicle orientation error is defined in terms of the vehicle orientation, the desired vehicle orientation, the ego-vehicle velocity, the vehicle wheelbase L and the steering angle.
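One form that combines exactly these quantities, and that is common in kinematic model predictive control formulations, is given below as a sketch. It is an assumption rather than the paper's confirmed equation; the symbols $e_\psi$, $\psi$, $\psi_{des}$, $v$, $\delta$ and $\Delta t$ (sampling time) are chosen here for illustration.

```latex
e_{\psi}^{<t+1>} = \psi^{<t>} - \psi_{des}^{<t>} + \frac{v^{<t>}}{L}\,\delta^{<t>}\,\Delta t
```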

The rate of change of the steering and acceleration commands is computed as the difference between consecutive commands.
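A minimal sketch of such a smoothness measure, assuming the per-step differences are aggregated as a mean of absolute values (the aggregation and the name `command_rate` are assumptions):

```python
import numpy as np


def command_rate(commands):
    """Mean absolute difference between consecutive commands."""
    commands = np.asarray(commands, dtype=float)
    if commands.size < 2:
        return 0.0  # a single command has no rate of change
    return float(np.mean(np.abs(np.diff(commands))))
```

Lower values indicate smoother steering or acceleration profiles, that is, a less jerky ride.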

Figure 7: Mean (solid line) and standard deviation (shaded region) of the performance metrics recorded during testing in the five considered scenarios. Per sampling time cross-track error, orientation error, consecutive acceleration difference and consecutive steering angle difference analysis. From left to right: overtake, opposite direction, side cut-off left, side cut-off right and multiple vehicle scenarios statistics.

The workflow of the experiments is as follows:

  • collect training data from driving recordings;

  • format training data as pairs of historical sequences and prediction sequences of fixed lengths;

  • train the ObserveNet deep network from Fig. 3;

  • evaluate on simulated and real-world driving scenarios.
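The data formatting step of the workflow above can be sketched as a sliding window over a recording, where each window is split into a historical part (network input) and a prediction part (self-supervised target). The function name `make_sequences` and the stride of one sample are assumptions of this example.

```python
def make_sequences(samples, hist_len, pred_len):
    """Split a recording into (history, future) training pairs.

    samples: time-ordered list of recorded frames (any type).
    """
    pairs = []
    last_start = len(samples) - hist_len - pred_len
    for start in range(last_start + 1):
        split = start + hist_len
        # The future frames serve as labels, so no manual annotation is needed.
        pairs.append((samples[start:split], samples[split:split + pred_len]))
    return pairs
```

For a recording of six frames, a history of three and a prediction length of two, this yields two pairs: frames 0 to 2 with frames 3 to 4, and frames 1 to 3 with frames 4 to 5.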

To train the network, we use data collected from CARLA and RovisLab AMTU, with the ego-vehicle being manually driven in the test environment, while encountering various traffic participants. In order to mitigate overfitting, all vehicles were driven in a different manner each time. During training, we have used an 80/10/10 train/validation/test data split.
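An 80/10/10 split can be sketched as follows. Shuffling with a fixed seed is an assumption made here for reproducibility; the text does not state how the samples were assigned to each subset, and `split_dataset` is a hypothetical name.

```python
import random


def split_dataset(items, seed=0):
    """Return (train, validation, test) lists in an 80/10/10 proportion."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: random assignment
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```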

4.1 Simulation Experiments

CARLA [Dosovitskiy17] was used for testing ObserveNet Control in simulation. We have simulated a Lidar sensor with a fixed maximum range. Additionally, camera images, inertial data and odometry information were acquired.

Due to the high flexibility of the CARLA simulator, we were able to simulate real-world situations like overtaking, opposite direction lane swerving, side cut-off, etc. During the evaluation step, our algorithm was used to control the ego-vehicle, while the other traffic participants were controlled using a simple feedback loop directly in CARLA.

Towns 1-6 from CARLA are used for training and testing, amounting to K pairs of Lidar and vehicle state data. We have investigated city driving with two or more simulated vehicles driving in the environment. We tackle aggressive driving, mainly overtaking and cut-off situations, where a classic controller would have difficulties due to the scene’s high dynamics. The ego-vehicle navigated a pre-recorded trajectory, while several challenging conditions were added. Other traffic participants were simulated as follows: i) travelling at a dangerously low speed, leading to the need of aggressive overtaking maneuvers, ii) vehicles incoming from the opposite direction at an unsafe distance and iii) vehicles cutting off the ego-vehicle from the left and right sides.

The goal of our experiments was to check whether we can eliminate collisions, as well as smooth the state trajectory, thus generating a safer, more pleasant ride with less jerk. We have implemented five challenging scenarios, with quantitative results shown in Fig. 7.

The overtake scenario represents a challenging situation, particularly at higher speed. The other vehicle accelerates at a slower pace than the ego-vehicle, towards a lower top speed, generating an overtake situation. State data is recorded for generating statistics.

In the opposite direction experiment, the ego-vehicle travels straight, while the other vehicle is coming from the opposite direction, spawning a near-miss situation. Using the network’s predictions, the ego-vehicle should sense the other vehicle sooner than the setup without predictions, generating a safer trajectory.

The side cut-off scenario consists of the ego-vehicle travelling straight and approaching an intersection. When it comes close to the intersection, another vehicle cuts it off, creating an imminent collision. Using the predicted data, the ego-vehicle needs to be able to avoid the collision. This scenario was tested in two symmetrical situations: left side cut-off and right side cut-off.

The rear pass-by scenario simulates a vehicle coming from behind the ego-vehicle on the same side, at a greater velocity, passing it at an unsafe distance. Using the predictions, the other vehicle’s position is predicted, thus allowing the ego-vehicle’s path planner to plan accordingly, widening the gap between the vehicles.

The multi-vehicle scenario has the most complexity, where the ego-vehicle is traveling a larger map section, encountering vehicles, overtaking and avoiding collisions. The scenario is designed to simulate real world complex scenes, where the ego-vehicle has limited maneuver space.

For all five experimental scenarios, the corresponding quality measures are summarized in Table 1. The experiments show that ObserveNet achieves better results than the regular DWA planner, providing less jerky movements. This is due to the fact that the obstacles are available before being actually observed.

The ego-vehicle successfully avoided obstacles in all five experimental scenarios. The cross-track error remained relatively low, although in some circumstances it was greater than the DWA baseline for all three imitation learning methods (e.g. when running the side cut-off or multi-vehicle scenarios). This behavior happens when the ego-vehicle is overtaking, thus deviating from the reference trajectory. We believe that the lower performance of ObserveNet, LBC and WOR against DWA in the side cut-off and multi-vehicle scenarios is due to the randomness of these scenarios, as well as due to the high dynamics of the involved participants. An additional amount of training data should increase performance.

The overtaking maneuvers also cause spikes in orientation error levels. When running the overtake scenario solely with DWA, the ego-vehicle collided violently with the other traffic participant, thus causing large cross-track and orientation errors. Consecutive acceleration and steering angle differences were in some instances better without predictions; however, the error values were very close.

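| Scenario | Algorithm | Collisions | Cross-track error | Orientation error | Steering / acceleration rate of change | Track completion (%) |
|---|---|---|---|---|---|---|
| Overtake | DWA | 1 | 0.06 | 0.02 | 0.001 / 0.001 | 100 |
| | LBC | 0 | 0.076 | 0.027 | 0.0069 / 0.0023 | 100 |
| | WOR | 0 | 0.065 | 0.019 | 0.0061 / 0.0012 | 100 |
| | Ours | 0 | 0.6 | 0.00001 | 0.0006 / 0.0006 | 100 |
| Opposite | DWA | 1 | 0.72 | 0.003 | 0.006 / 0.001 | 100 |
| | LBC | 0 | 0.2 | 0.0023 | 0.042 / 0.005 | 100 |
| | WOR | 0 | 0.17 | 0.04 | 0.031 / 0.01 | 100 |
| | Ours | 0 | 0.13 | 0.004 | 0.0008 / 0.0008 | 100 |
| Side cut-off | DWA | 1 | 0.002 | 0.06 | 0.002 / 0.002 | 100 |
| | LBC | 1 | 0.10 | 0.23 | 0.064 / 0.020 | 100 |
| | WOR | 0 | 0.05 | 0.04 | 0.0096 / 0.010 | 100 |
| | Ours | 0 | 0.003 | 0.005 | 0.004 / 0.001 | 100 |
| Rear pass-by | DWA | N/A | 1.57 | 0.007 | 0.01 / 0.003 | 100 |
| | LBC | N/A | 0.45 | 0.02 | 0.063 / 0.020 | 100 |
| | WOR | N/A | 0.62 | 0.005 | 0.0067 / 0.010 | 100 |
| | Ours | N/A | 0.39 | 0.02 | 0.004 / 0.004 | 100 |
| Multi-vehicle | DWA | 4 | 0.005 | 0.04 | 0.0003 / 0.0003 | 24.56 |
| | LBC | 1 | 0.006 | 0.34 | 0.0002 / 0.0004 | 100 |
| | WOR | 1 | 0.006 | 0.13 | 0.00047 / 0.0011 | 100 |
| | Ours | 0 | 0.01 | 0.01 | 0.0005 / 0.0005 | 100 |

Table 1: Quantitative results on simulated scenarios in CARLA.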

4.2 CARLA Leaderboard Experiments

On top of our own CARLA simulation experiments, we have also assessed ObserveNet within the CARLA Autonomous Driving Leaderboard [A], which is an open platform designed to ease testing of autonomous driving systems. The platform provides several urban environments, as well as varying weather and lighting conditions, and can be configured to simulate a variety of challenging traffic scenarios, such as vehicle control loss, obstacle avoidance, lane changing, crossing traffic running a red light and intersection handling. Three metrics are used for performance evaluation: total driving score, route completion and infraction penalty. In addition to these metrics, the framework provides information on collisions, off-road driving, route deviations, and red light and stop sign violations.

Since our focus is on eliminating collisions between the ego-vehicle and other traffic participants, we aim to minimize the number of collisions and route deviations and to maximize route completion. We have evaluated a set of routes, each repeated three times for each evaluated algorithm, comparing side by side collision metrics and other infractions, as shown in Fig. 8. Although infractions like stop sign violations and red light violations are more frequent in our approach, the average number of collisions remained overall lower than for LBC and WOR.

From Fig. 8 it can be observed that LBC and WOR tend to perform better than our approach for stop sign and red light violations. This happens because we use range sensing only, so stop signs and traffic lights are not recognized.

Figure 8: Experimental results on CARLA Leaderboard [A]. Average number of infractions for all tested routes. From left to right: layout collisions, pedestrian collisions, vehicle collisions, red light violations, stop sign violations and blocked vehicle violations.

4.3 Ablation Study

In order to assess the core components of the algorithm, we have performed the following ablations to the ObserveNet structure from Fig. 3: i) ablation of the Kalman filter, ii) ablation of the prediction horizon and iii) control based on the prediction of the next vehicle's state. In the last case, the Kalman filter and the temporal planner were removed, together with all but one of the output branches. The kept branch was trained to predict only the next state of the vehicle based on observations. Fig. 9(a) shows the ablations against the cross-track error, while Fig. 9(b) takes into consideration pedestrian and vehicle collisions.

Figure 9: Ablation study results. The Kalman filter was switched on/off, while the prediction horizon was set to five different lengths.

The cross-track error grows with the length of the prediction horizon: the further in time we predict observations, the larger the cross-track error is. Our intuition for this increase is that ObserveNet is anticipating the visual dynamics of the obstacles encoded in the raw measurements, thus optimizing the vehicle's trajectory against collisions, while diverging from the reference path. This trade-off can be seen in Fig. 9(b), where the number of collisions with pedestrians and other vehicles decreases as the prediction horizon grows.

Without the smoothing property of the Kalman filter, the values of the control signals are higher, leading to a larger error. A decrease in performance is observed when the vehicle is controlled solely by predicting its next state.

4.4 RovisLab AMTU Experiments

In addition to the CARLA simulated experiments, we have tested ObserveNet and DWA on real-world navigation tasks using the RovisLab AMTU robot from Fig. 6. Due to its better performance over LBC and WOR, we have chosen to conduct real-world experiments only with ObserveNet as the learning-based controller and DWA. RovisLab AMTU is an AgileX Scout 2.0 platform which acts as a 1:4 scaled car, equipped with a Hesai Pandar 40 Lidar, 4x e-130A cameras providing a visual perception of the surroundings, a VESC inertial measurement unit, GPS and an NVIDIA AGX Xavier board for data processing and control. The robot navigated indoor and outdoor environments, while avoiding dynamic obstacles.

The prediction horizon was set based on trial-and-error experiments, and a separate set of training data was acquired for each of the indoor and outdoor experiments, in the form of pairs of Lidar and vehicle state data. Both real-world experiments have been conducted at Transilvania University's research institute.

We have replicated scenarios similar to the ones performed in simulation, with the difference that the moving vehicles from CARLA have been replaced by persons interacting with the robot in similar ways as in the five testing schemes described above. Table 2 shows quantitative results on the real-world experiments. The results show an improvement in tracking accuracy and obstacle avoidance in the case of ObserveNet Control.

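| Scenario | Algorithm | Collisions | Cross-track error | Orientation error | Steering / acceleration rate of change | Track completion (%) |
|---|---|---|---|---|---|---|
| Overtake | DWA | 1 | 0.27 | 0.055 | 0.0022 / 0.0022 | 100 |
| | Ours | 0 | 0.55 | 0.001 | 0.0004 / 0.0004 | 100 |
| Opposite | DWA | 1 | 1.32 | 0.006 | 0.005 / 0.002 | 100 |
| | Ours | 0 | 0.19 | 0.002 | 0.0007 / 0.0007 | 100 |
| Side cut-off | DWA | 1 | 0.12 | 0.1 | 0.003 / 0.004 | 100 |
| | Ours | 0 | 0.09 | 0.007 | 0.001 / 0.001 | 100 |
| Rear pass-by | DWA | N/A | 2.23 | 0.06 | 0.009 / 0.0025 | 100 |
| | Ours | N/A | 0.72 | 0.09 | 0.006 / 0.006 | 100 |
| Multi-obstacles | DWA | 3 | 0.035 | 0.08 | 0.009 / 0.008 | 70.32 |
| | Ours | 0 | 0.067 | 0.01 | 0.005 / 0.005 | 100 |

Table 2: Quantitative results on RovisLab AMTU.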

5 Conclusions

This paper introduces ObserveNet Control, an observation prediction control method for autonomous vehicles composed of a DNN for observation prediction and a temporal path planner. We have shown that our control approach achieves improved path tracking and obstacle avoidance accuracy when augmented with predicted observations. Additionally, the training of the system is self-supervised from driving recordings, without requiring manual human annotations. Because we only predict raw observations, one drawback of ObserveNet is that it lacks driving context, which results in a large number of traffic violations. This can be corrected by integrating our method into autonomous driving systems that use visual data to detect additional cues, such as lanes or traffic signs. In such a case, ObserveNet would contribute its raw observation predictions to the reconstructed virtual environment model used by the vehicle to plan its path. As future work, we plan to increase the prediction time horizon, as well as to investigate possible applications in mobile robotics and simultaneous localization and mapping.