Autonomous vehicles have evolved quickly in recent years, mainly thanks to developments in the field of deep learning. Creating a self-driving vehicle can generally be divided into three subtasks: perception and localization, planning, and control and actuation. This paper is concerned with the last of these, namely following a given path by ensuring correct control and actuation of the vehicle. More specifically, a Reinforcement Learning (RL) approach was used for controlling the steering angle and acceleration of a simulated land vehicle, trying to keep the cross-track error (CTE) and velocity error small. While traditional control algorithms exist for this coupled control task [chebly2017coupled] and provide promising performance, their main drawback is that precise modeling of the vehicle at hand is required: many parameters must be measured or estimated accurately for the control algorithm to perform adequately. This motivates the use of deep learning, where suitable parameters are instead learned by the algorithm itself. Although traditional supervised deep learning can be applied to this problem [devineau2018coupled], it will not be considered here. Instead, Deep Deterministic Policy Gradient will be applied, an off-policy learning algorithm suitable for domains where continuous control actions are desirable [lillicrap2015continuous]. The algorithm and its application to the control problem are explained in the coming sections.
II Related work
Reinforcement learning has been successfully applied to various scenarios, such as playing Atari video games [mnih2013playing] and basic control tasks [lillicrap2015continuous]. The basic idea of reinforcement learning consists of having an agent interact with an environment: at each time step the agent receives an observation $s_t$, takes an action $a_t$ based on it, and receives a reward $r_t$. The agent then tries to select actions that maximize the cumulative reward. One of the more commonly known algorithms for finding action policies is Q-learning, a model-free off-policy RL algorithm introduced already in 1989 [watkins1989learning]. Model-free refers to the fact that no model of the environment is needed to learn suitable actions, while off-policy means that the algorithm can learn from previously collected observations. Q-learning is value-based, i.e. it essentially uses observations to learn the optimal action for any given state by keeping track of all possible state-action pairs and continuously evaluating and updating their quality $Q(s, a)$ (hence the name Q-learning). After learning $Q$, actions are selected such that $Q(s, a)$ is maximized for each state. However, as the state space grows larger, the discrete approach used in Q-learning becomes infeasible due to the curse of dimensionality. Furthermore, for a continuous state space, each state has to be discretized, leading to a trade-off between small discretization steps and low dimensionality.
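As an illustration, the tabular Q-learning update can be sketched as follows; the state and action encodings and the hyperparameter values are illustrative, not tied to any particular environment:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on a dict-based Q-table.

    Q maps (state, action) pairs to values; unseen pairs default to 0.
    alpha is the learning rate, gamma the discount factor.
    """
    # bootstrap from the best action in the next state (the "max" in Q-learning)
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    td_target = r + gamma * best_next
    # move Q(s, a) a fraction alpha toward the TD target
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q[(s, a)]
```

The dict-based table makes explicit why this approach breaks down for large or continuous state spaces: every distinct state-action pair needs its own entry.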
The problem of continuous observation spaces can be solved using Deep Q-learning (DQN) [mnih2015humanlevel]. Instead of using a table to store the Q-values, a function approximator based on a neural network is used to estimate them: a continuous state vector is fed into the network, whose output is an estimate of the corresponding Q-value. However, this still leaves the problem of maximizing the Q-value over all possible actions, i.e. for each state one has to search through the actions to find the one with the highest Q-value. For a large action space or a large function approximation network, this becomes computationally costly and time-consuming.
For handling continuous action spaces, one can use Deep Deterministic Policy Gradient (DDPG) [lillicrap2015continuous], which extends the actor-critic approach of Deterministic Policy Gradient [pmlr-v32-silver14] with the findings from Deep Q-learning [mnih2015humanlevel]. The algorithm is based on two different neural networks, the actor $\mu(s \,|\, \theta^\mu)$ and the critic $Q(s, a \,|\, \theta^Q)$, where $\theta^\mu$ and $\theta^Q$ denote their weights respectively. The actor network maps a vector in the state space to a vector in the action space, i.e. it uses the knowledge it has about its current state to determine the best possible action to take. The critic network uses both the current state and the action taken by the actor to decide how good that particular action is in that particular state, i.e. it tries to approximate the Q-value. The parameters of the critic are updated using stochastic gradient descent to minimize the Mean-Squared Bellman Error, similarly to traditional Q-learning. For the actor, however, [pmlr-v32-silver14] showed that there exists a policy gradient, i.e. a gradient of the policy's performance, along which the parameters of the network are updated.
Deep Q-learning introduced two major ideas to scale Q-learning which are also used in DDPG, namely replay buffers and separate target networks. A replay buffer saves a large number of experiences $(s_t, a_t, r_t, s_{t+1}, d_t)$, where $s_{t+1}$ denotes the next state and $d_t$ whether that state is terminal, in a dataset $\mathcal{D}$. When training the actor and critic networks, one then randomly samples mini-batches from this dataset rather than taking the observations directly from the simulation. By employing a replay buffer, one suppresses the issue of the networks overfitting to the most recent data and achieves a more stable training process. Furthermore, it also addresses the issue that observations taken from the simulation are sequentially dependent on each other.
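A minimal replay buffer along these lines might look as follows; the capacity and tuple layout are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience store with uniform random mini-batch sampling."""

    def __init__(self, capacity):
        # deque with maxlen silently evicts the oldest experience when full
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform sampling breaks the temporal correlation of consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly over the whole buffer is what decorrelates the training data; prioritized variants exist but are not used here.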
When training the critic network, the Mean-Squared Bellman Error (MSBE) is minimized with respect to the critic parameters $\theta^Q$, i.e. we are trying to minimize the loss function

$L(\theta^Q) = \mathbb{E}\big[ \big( Q(s_t, a_t \,|\, \theta^Q) - y_t \big)^2 \big]$   (1)

by making $Q(s_t, a_t \,|\, \theta^Q)$ more similar to the target

$y_t = r_t + \gamma \, (1 - d_t) \, Q(s_{t+1}, a_{t+1} \,|\, \theta^Q)$   (2)

Here $\gamma$ denotes the discount factor determining how much future rewards are valued, and $a_{t+1} = \mu(s_{t+1} \,|\, \theta^\mu)$ is the action taken in the next state $s_{t+1}$. As can be seen in (2), the target is dependent on the parameters we are trying to learn, namely $\theta^Q$. This can make the training very unstable, leading to the agent not learning anything. This is where the authors of DQN proposed target networks, which are yet another set of parameters, $\theta^{Q'}$ and $\theta^{\mu'}$, that lag behind the actual parameters $\theta^Q$ and $\theta^\mu$. In the original DQN paper [mnih2015humanlevel], the target networks were updated after a fixed number of episodes; in the DDPG algorithm [lillicrap2015continuous], however, the authors used Polyak averaging after each training step. The update is therefore described as

$\theta' \leftarrow \tau \theta + (1 - \tau) \, \theta', \quad \tau \ll 1$   (3)
By using the target networks instead of the actual networks to calculate the target in (2), one can achieve much more stable training.
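The target computation and the Polyak soft update can be sketched as follows; the value of tau is illustrative:

```python
def polyak_update(target_params, params, tau=0.005):
    """Soft update: theta' <- tau * theta + (1 - tau) * theta'.

    Parameters are modeled as flat lists of floats for clarity; in practice
    this loop runs over network weight tensors.
    """
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]

def td_target(r, gamma, done, q_next):
    """Bootstrapped target: only non-terminal next states contribute future value."""
    return r + gamma * (0.0 if done else q_next)
```

With a small tau, the target networks trail the trained networks slowly, which is what stabilizes the otherwise self-referential target in (2).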
Finally, in RL one has to balance the trade-off between exploration and exploitation. By adding a certain level of randomness to the agent, one allows it to explore and learn more about the state-action space before optimizing its policy. In the original DDPG paper [lillicrap2015continuous], the authors proposed an exploration policy where Gaussian noise was added to the actor network's output, i.e. directly to the selected action before executing it in the environment. However, as described in [kamran2019learning], such independent exploration noise can be very inefficient, since a vehicle acts as a low-pass filter for the high-frequency changes in acceleration and steering angle that this type of noise provides. Instead, they proposed an exploration policy where the noise added to the steering angle and acceleration, $n_\delta$ and $n_a$, is sinusoidal, with a bias and amplitude sampled from a zero-mean Gaussian distribution and a frequency sampled uniformly in a fixed range. Note that these sampled values are kept constant over one full episode.
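A sketch of such a per-episode sinusoidal noise generator follows; the parameter names and sampling ranges are our assumptions, and the exact form used in [kamran2019learning] may differ:

```python
import math, random

def make_episode_noise(sigma_bias=0.1, sigma_amp=0.3, freq_range=(0.5, 2.0)):
    """Sample noise parameters once per episode; return a function n(t).

    The bias and amplitude are zero-mean Gaussian, the frequency uniform;
    all three stay constant for the whole episode, so the resulting signal
    is smooth and low-frequency rather than independent per-step jitter.
    """
    b = random.gauss(0.0, sigma_bias)   # constant offset over the episode
    A = random.gauss(0.0, sigma_amp)    # amplitude of the sinusoid
    f = random.uniform(*freq_range)     # frequency in Hz, fixed for the episode
    return lambda t: b + A * math.sin(2.0 * math.pi * f * t)
```

One such noise function would be created per action dimension at the start of each episode and evaluated at the simulation time of every step.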
Furthermore, in [kamran2019learning], the authors showed that it is advantageous to have an initial exploration period (e.g. the first 500 episodes) where the agent, rather than using the output of the actor network, samples actions uniformly over the action range. They showed that this drastically improved the performance of the agent and enabled it to learn much better behaviour, as it allows for vast exploration in the initial training period.
III Method

Before training the RL agent, an environment must be created with which it can interact. The environment in this project was produced by the authors and consisted of a kinematic bicycle model which took requested steering angles and accelerations as input, and calculated the agent's global position $(x, y)$, velocity $v$, steering angle $\delta$, and heading $\psi$. The kinematic bicycle model was chosen for its simplicity while still capturing the overall motion of a car. However, the model was adjusted such that the requested steering angle was limited by a maximum steering rate, i.e. the change in steering angle per time step could not exceed a certain threshold. Further, the requested accelerations were also clipped to lie within adequate limits. These changes were introduced to simulate how the actuators of a real car might behave. The model can be summarized as selecting a time step $\Delta t$ and vehicle length $L$, then iterating

$x_{t+1} = x_t + v_t \cos(\psi_t) \, \Delta t$
$y_{t+1} = y_t + v_t \sin(\psi_t) \, \Delta t$
$\psi_{t+1} = \psi_t + \frac{v_t}{L} \tan(\delta_t) \, \Delta t$
$v_{t+1} = v_t + a_t \, \Delta t$
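A minimal Euler-discretized sketch of such a kinematic bicycle step with actuator limits follows; the time step, wheelbase, and limit values are illustrative, not the paper's:

```python
import math

def bicycle_step(x, y, psi, v, delta, a_cmd, delta_cmd,
                 dt=0.05, L=2.5, max_steer_rate=math.radians(30),
                 a_min=-3.0, a_max=2.0):
    """One Euler step of the kinematic bicycle model.

    delta_cmd and a_cmd are the requested steering angle and acceleration;
    the steering rate and acceleration are clipped to mimic real actuators.
    """
    # the wheel cannot jump to the requested angle: limit the change per step
    d_delta = max(-max_steer_rate * dt, min(max_steer_rate * dt, delta_cmd - delta))
    delta = delta + d_delta
    a = max(a_min, min(a_max, a_cmd))  # clip the requested acceleration
    x += v * math.cos(psi) * dt
    y += v * math.sin(psi) * dt
    psi += v / L * math.tan(delta) * dt
    v += a * dt
    return x, y, psi, v, delta
```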
The model was also used to generate reference paths for the agent to follow. For this, an average velocity was sampled once before each path generation started, and then an acceleration and a steering angle were sampled randomly every time step until the path had reached its full length. The acceleration was sampled from a uniform distribution, and the steering angle from one of two different uniform distributions depending on the current velocity. The reference trajectory was then processed to consist of waypoints, each with a reference speed, every meter. Due to the randomness of the generation, the agent received a unique path to follow each episode. The agent's state observations consisted of the distances to the 25 nearest future waypoints, expressed in the agent's coordinate frame, along with the difference between the current velocity and the reference velocity for all 25 waypoints. Additionally, the observed state included the agent's current velocity $v$ and steering angle $\delta$. An illustration is shown in Figure 1, where $(X, Y)$ denote global coordinates and $(x, y)$ coordinates in the agent's frame.
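Expressing a global waypoint in the agent's coordinate frame amounts to a translation followed by a rotation by the negative heading; a sketch, with function and argument names chosen by us:

```python
import math

def to_agent_frame(wx, wy, x, y, psi):
    """Express the global waypoint (wx, wy) in the frame of an agent at
    position (x, y) with heading psi (local x-axis points along the heading)."""
    dx, dy = wx - x, wy - y
    # rotate the offset by -psi
    return (math.cos(psi) * dx + math.sin(psi) * dy,
            -math.sin(psi) * dx + math.cos(psi) * dy)
```

Applying this to each of the 25 waypoints yields the relative-position part of the observation vector.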
As the agent controls the steering angle $\delta$ and the acceleration $a$, the action space is continuous, and the actions have been chosen to lie in the range $[-1, 1]$. Each command must therefore be multiplied by the maximum allowed acceleration or steering angle, as defined by the model described above, before it is passed to the simulation.
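This scaling from normalized actions to physical commands can be sketched as follows (function and parameter names are ours):

```python
def scale_action(action, a_max, delta_max):
    """Map a normalized (steer, accel) pair in [-1, 1] to physical commands,
    i.e. a steering angle in [-delta_max, delta_max] and an acceleration
    in [-a_max, a_max]."""
    steer_cmd, accel_cmd = action
    return steer_cmd * delta_max, accel_cmd * a_max
```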
The reward signal was designed with inspiration from [kamran2019learning]. It combines four penalty terms, where cte is the cross-track error between the agent and the reference path, i.e. the perpendicular distance from the reference path to the agent's location, and the penalties are based on the cross-track error, the steering angle, the deviation from the reference velocity, and the acceleration performed by the agent. The penalties were introduced to discourage cross-track errors, large steering angles, velocity errors and jerky maneuvers, respectively.
Additional negative reward was assigned when the normalized velocity error became too large.
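Since the exact penalty expressions are not reproduced here, the following is only a hedged sketch of how such a reward might be assembled; the quadratic forms, weights, and threshold are our assumptions, not the paper's definitions:

```python
def reward(cte, delta, v_err_norm, accel,
           w_cte=1.0, w_delta=0.1, w_v=0.5, w_a=0.05, v_err_limit=0.3):
    """Illustrative reward: a weighted sum of quadratic penalties on
    cross-track error, steering angle, normalized velocity error and
    acceleration, plus an extra penalty for large velocity errors."""
    r = -(w_cte * cte**2 + w_delta * delta**2
          + w_v * v_err_norm**2 + w_a * accel**2)
    if abs(v_err_norm) > v_err_limit:  # additional negative reward
        r -= 1.0
    return r
```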
Each simulation was terminated when the cross-track error or the velocity error exceeded a fixed threshold, or when the agent reached the end of the path. However, only the first two cases were marked as terminal states in the target $y_t$.
As mentioned previously, the DDPG architecture consists of two different networks, the actor and the critic. The architecture of each network is shown in Figure 2. The actor network begins with two shared fully connected layers and then splits into two branches, each producing its own action, i.e. either the steering angle $\delta$ or the acceleration $a$. The two outputs are then concatenated to form the final output of the actor network. The weights and biases of the last fully connected layers were initialized to small values close to zero, as this was found to be crucial for the agent to learn anything at all.
Furthermore, a replay buffer and the sinusoidal exploration policy explained in Section II were used. To transition smoothly from exploration to exploitation, a noise amplitude multiplier was initialized to 1 and reduced slightly every episode. The agent was trained over several epochs, each lasting a fixed number of episodes; the division into epochs was done mainly for learning-rate scheduling.
IV Results

In Figure 3, several metrics from the first 2500 episodes of training are shown. Starting from the top, we see the average cross-track error and velocity error, normalized by the travelled path length, as well as the average reward per time step. After the initial exploration phase, a major jump in performance was obtained, caused by switching from the aggressive uniform exploration used in the initial training phase to the more lenient sinusoidal noise explained in Section II. Simply from sampling random actions in the action space, the agent was able to learn enough about the environment to perform adequately. When trained without the initial exploration phase, the agent was not able to exhibit any satisfactory behaviour after the same number of training steps, thus validating the results in [kamran2019learning]. After the initial jump in performance, the behaviour of the agent only changed marginally. While it improved somewhat in all aspects shown in Figure 3, the biggest change was in consistency: towards the end of the training, the agent maintained a lower variance in all of the aspects discussed above, including the average percentage of the reference path completed.
After training, the agent was evaluated on a set of 10 test tracks randomly generated using the environment described earlier; its performance is summarized in Table I.
|Track||Avg cte||Max cte||Avg velocity error||Max velocity error||% of path|
From the evaluation, one can conclude that the agent performs passably. The average cross-track error is kept low (0.12 m) while the velocity is also kept close to its reference (0.6 m/s deviation on average). For the worst-case scenario, the table shows the maximal cross-track error during the 10 evaluations, which remained small. This is important for guaranteeing that the path-following algorithm never strays too far from its reference.
Furthermore, Figure 4 shows the evaluation on test track #6 (deemed representative of the 10 evaluations) in more detail. Here the reference path and the actual path travelled are shown together with the cross-track error and the velocity error throughout the simulation. As can be seen, the cross-track error rarely exceeds 0.3 m.
V Conclusions & Future Work
In this paper, we leveraged DDPG to solve the coupled lateral and longitudinal control task. While this task is conventionally solved with algorithms such as Pure Pursuit or Model Predictive Control, our RL algorithm showed very promising results. The agent used the relative distances and velocities to the 25 closest waypoints on the reference path to determine the appropriate action. During testing, the agent was able to steer with an average cross-track error of only 0.12 m while keeping the agent's velocity within 0.6 m/s of its reference on average.
However, it should be noted that the agent was never applied in a real-world setting, but only in a rather simplistic simulator. In future work, it would be interesting not only to apply and test this approach on an actual vehicle, but also to compare its performance to that of the more conventional controllers.
Furthermore, one could make the simulation more realistic by including e.g. noise in the state observations fed to the agent, or simply use a more sophisticated simulator.
Finally, the authors recognize that the black-box nature of our RL approach raises the issue of how safe behavior can be guaranteed at all times. While a conventional controller's performance can be analyzed to determine under which circumstances it will perform as intended, our solution is inherently hard to analyze, begging the question: can we ever guarantee satisfactory behavior at all times, and if not, under which circumstances can we assure that the RL approach will act appropriately?