Active stability technologies play a major role in modern vehicle dynamics systems, enhancing lateral vehicle stability and reducing fatal accidents. Technologies such as torque-vectoring controllers (i.e., control of the traction and braking torque of each wheel) can effectively enhance vehicle handling performance in extreme manoeuvres. A torque-vectoring controller has the prominent advantage of stabilizing a vehicle through the choice of appropriate control parameters. However, if the parameters cannot be adapted to environmental changes, the control design becomes cumbersome and does not guarantee good robustness. To address this problem, adaptive control design has received wide attention. Several parameter tuning methods have been proposed. Fuzzy logic methods face a parameter optimization problem of their own and require more knowledge of the environment. Evolutionary algorithms, such as genetic algorithms, are a strong tool for tuning control parameters with less need for prior knowledge; however, depending on the complexity of the problem, the calculation can be slow. Others have used neural networks to learn the optimal parameters for each state, typically training the network with supervised learning. The downside of supervised learning is that the ground truth (i.e., the correct prediction for given inputs) is required during training to measure the error in the network's predictions. Supervised learning is therefore poorly suited to problems where the ground truth is not readily available or is difficult to estimate. Reinforcement learning, on the other hand, is a technique inspired by human and animal learning in which the neural network is trained through a trial-and-error mechanism, without requiring ground truth labels. It is therefore better suited to this type of problem, where the correct predictions are not known, because it enables the network to learn the optimal policy through interaction with its environment.
The success of reinforcement learning was perceived as a breakthrough in the field of artificial intelligence. Reinforcement learning attempts to learn the optimal control policy through interactions between the system and its environment, without prior knowledge of the environment or the system model. An agent takes an action based on the environment state and consequently receives a reward. Reinforcement learning algorithms are generally divided into three groups: value-based, policy gradient and actor-critic methods. In this paper, the Deep Deterministic Policy Gradient (DDPG) algorithm, which belongs to the actor-critic family, is used to tune the parameters of the torque-vectoring controller. A DDPG agent computes a policy that maximizes the long-term reward. The actor is a network that attempts to execute the best action given the current state. The critic estimates the maximum value function of a given state (i.e., it acts as an approximate maximizer) and then uses the reward from the environment to determine the accuracy of the predicted value. DDPG has been proposed as a reinforcement learning algorithm for tuning the sliding mode controller parameters of an autonomous underwater vehicle, where the results demonstrated good performance in terms of stability, fast convergence and low chattering. Reinforcement learning has also been used to tune PID controller parameters with an actor-critic auto-tuning strategy, resulting in stable tracking performance. However, the generalization of the trained models to scenarios beyond the training environment was not validated, so the performance of these algorithms in new environments cannot be guaranteed.
In this paper, a reinforcement learning algorithm is incorporated into the torque-vectoring controller to adjust the control input weights in order to maintain the stability of the vehicle. The contribution of this paper lies in the use of the DDPG algorithm as an intelligent tuning strategy for the torque-vectoring controller in a real application under different scenarios. Proper tuning of the control input weights is shown to improve the handling and stability of the vehicle. To validate the effectiveness of the DDPG algorithm, the simulation environment is set up with velocities ranging from 80 to 130 and surface friction coefficients ranging from 0.4 to 0.6. The performance is then compared with a conventional trial-and-error approach (referred to as manual tuning in the rest of the paper) and a self-tuning genetic algorithm, to show the enhanced effectiveness of DDPG as a tuning strategy for active yaw control systems. Moreover, to validate the generalization capability of the trained model, simulations are run in an environment that DDPG was not trained in, to assess the robustness of the algorithm under unknown operating conditions. The simulations use a four-wheel vehicle model with nonlinear tire characteristics in the Matlab/Simulink environment, and the Matlab Reinforcement Learning Toolbox is used for on-line tuning of the torque-vectoring controller parameters.
The remainder of the paper is structured as follows: Section II describes the vehicle model. Section III introduces the torque-vectoring algorithm used for stabilization. Section IV presents the reinforcement learning algorithm used for tuning the parameters of the torque-vectoring controller. Section V presents the simulation results of the trained algorithm, and concluding remarks are given in Section VI.
II Vehicle model
In this paper, the vehicle simulation model is a six-state nonlinear four-wheel vehicle model. The model states are the vehicle longitudinal and lateral velocities in the body frame, together with the yaw angle, yaw rate, and longitudinal and lateral coordinates in the inertial frame. In the formulation, the normal forces are assumed to be constant and the roll angle is neglected. The tire model is the Pacejka model, an empirical nonlinear tire model that closely captures the tire dynamics.
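As an illustration of the empirical tire model mentioned above, the following minimal sketch evaluates the widely used "Magic Formula" form of the Pacejka model for a single force component. The function name and the coefficient values `B`, `C`, `D` and `E` are illustrative assumptions and are not the parameters used in this paper.

```python
import math


def magic_formula(slip, B=10.0, C=1.9, D=4000.0, E=0.97):
    """Pacejka 'Magic Formula' for one tire force component.

    slip : slip quantity (slip ratio or slip angle in rad)
    B, C, D, E : stiffness, shape, peak and curvature factors
                 (hypothetical values chosen only for illustration)
    Returns the tire force in the same unit as D.
    """
    bs = B * slip
    return D * math.sin(C * math.atan(bs - E * (bs - math.atan(bs))))


# Force over a small sweep of slip values: nearly linear at low slip,
# saturating close to the peak factor D at larger slip.
sweep = [(s / 100.0, magic_formula(s / 100.0)) for s in range(0, 31, 5)]
```

The curve is odd in the slip variable and bounded by `D`, which is what makes the tire behave nonlinearly near the friction limit.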
III Torque vectoring control
III-A System Architecture
The overall architecture of the proposed control framework is shown in Fig. 1. In this structure, the command interpreter module (CIM) generates the desired forces at the centre of gravity (C.G) based on the inputs from the reference generation block (the driver's input torque and steering wheel angle), in order to interpret the driver's intention for the vehicle motion. The CIM continuously monitors the inputs as well as the states of the vehicle to provide accurate desired values to the torque-vectoring controller and keep the vehicle within the stable region. Note that the CIM generates the desired C.G forces under the assumption of normal driving conditions (e.g. a high surface friction condition). For brevity, the calculation of the CIM block is omitted here and the reader is referred to the literature listed in the references. The inputs to the reinforcement learning agent are the errors between the actual and desired forces, followed by the actual states of the vehicle. Moreover, the error between the actual and reference signals, together with the actions from the reinforcement learning agent, is fed to the controller to generate the adjustment torque required to assist the driver in improving the handling and stability of the vehicle.
A torque-vectoring controller was developed to ensure the lateral and yaw stability of the vehicle. The primary objective of this controller is to provide a safe driving experience under unexpected driving conditions, e.g. low surface friction. The desired C.G forces in Fig. 1 consist of the desired longitudinal force, lateral force and yaw moment at the C.G, and the actual C.G forces of the vehicle consist of the corresponding actual longitudinal force, lateral force and yaw moment. Each C.G force component is a function of the longitudinal and lateral tire forces on each wheel of the vehicle (see Fig. 2). Equations (3)-(5) therefore reveal a possible way to obtain the desired C.G forces by controlling the tire forces, which are collected in a tire force vector.
The adjusted C.G forces that minimize the error between the actual and desired forces can be written as the actual C.G forces plus the product of the Jacobian matrix and the vector of control actions, i.e. the tire force adjustments needed to minimise the error between the actual and desired forces. In this work, only the longitudinal tire force adjustments are considered in the controller design, and the effect of lateral control actions is neglected.
To formulate the required control action, the error between the desired and actual C.G forces is defined first. An objective function is then defined as a weighted combination of the error between the actual and desired vehicle C.G forces and the control action, with one weight on the longitudinal, lateral and yaw moment errors and another weight on the control action. Since the objective function (13) is quadratic in the tire force adjustment, the necessary condition for the solution is obtained by setting its derivative with respect to the tire force adjustment to zero. Minimizing the objective function in this way yields the control action required to stabilize the vehicle.
The corrective torque applied on each wheel is the corresponding longitudinal tire force adjustment multiplied by the effective wheel radius. Since there is no adjustment of the lateral tire forces, only the longitudinal corrective actions are calculated by the controller. Consequently, only the weights on the longitudinal corrective actions are tuned using the reinforcement learning algorithm, while the remaining weights are kept at fixed values.
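The minimization described in this section can be sketched numerically. The sketch below assumes that the objective has the common weighted least-squares form $J = \tfrac{1}{2}(E - A\,\delta f)^T W_E (E - A\,\delta f) + \tfrac{1}{2}\,\delta f^T W_{\delta f}\,\delta f$, whose stationarity condition gives $\delta f = (A^T W_E A + W_{\delta f})^{-1} A^T W_E E$. The symbols ($E$ for the C.G force error, $A$ for the Jacobian, $W_E$ and $W_{\delta f}$ for the weights), the diagonal weights, the example Jacobian for zero steering angle, the half-track value and all function names are assumptions made for illustration; only the effective wheel radius is taken from Table II.

```python
def solve_linear(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def control_action(A, E, w_e, w_df):
    """Minimizer of the assumed quadratic objective (diagonal weights)."""
    m, n = len(A), len(A[0])
    # H = A^T W_E A + W_df (n x n) and g = A^T W_E E (length n)
    H = [[sum(A[k][i] * w_e[k] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        H[i][i] += w_df[i]
    g = [sum(A[k][i] * w_e[k] * E[k] for k in range(m)) for i in range(n)]
    return solve_linear(H, g)


def cost(A, E, w_e, w_df, df):
    """Value of the assumed objective for a given tire force adjustment."""
    residual = [E[k] - sum(A[k][j] * df[j] for j in range(len(df)))
                for k in range(len(E))]
    return (0.5 * sum(w * r * r for w, r in zip(w_e, residual))
            + 0.5 * sum(w * d * d for w, d in zip(w_df, df)))


# Hypothetical example: wheel order FL, FR, RL, RR, zero steering angle.
HALF_TRACK = 0.8  # m, assumed (not given in the paper)
R_EFF = 0.3       # m, effective wheel radius from Table II
A = [[1.0, 1.0, 1.0, 1.0],                                   # d(Fx)/d(fx_i)
     [0.0, 0.0, 0.0, 0.0],                                   # d(Fy)/d(fx_i)
     [-HALF_TRACK, HALF_TRACK, -HALF_TRACK, HALF_TRACK]]     # d(Mz)/d(fx_i)
E = [100.0, 0.0, 50.0]           # assumed errors in Fx (N), Fy (N), Mz (N m)
W_E = [1.0, 1.0, 1.0]            # fixed error weights (assumed)
W_DF = [0.01, 0.01, 0.01, 0.01]  # control-action weights, the tuned quantity

delta_f = control_action(A, E, W_E, W_DF)
torques = [R_EFF * d for d in delta_f]  # corrective torque per wheel
```

In this reading, larger entries in `W_DF` penalize the corrective force on the corresponding wheel more heavily, which is the lever that the tuning strategy adjusts.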
IV Reinforcement Learning Algorithm
This section presents an overview of the reinforcement learning framework, followed by the DDPG algorithm used for parameter tuning of the torque-vectoring controller.
IV-A Reinforcement learning overview
To implement a reinforcement learning (RL) based controller, we consider a standard RL setup for continuous control in which an agent interacts with the environment, aiming to learn from its own actions. The formulation of reinforcement learning is based on a Markov Decision Process. At each time step, the agent receives an observation and takes an action from a set of possible actions. As a consequence of the action, the agent receives a reward and observes a new state. The goal in reinforcement learning is to learn a policy that maps states to actions so as to maximize the expected cumulative discounted reward, in which future rewards are weighted by a discount factor. The state-action value (Q-value) at a given time step represents the expected cumulative discounted reward from that time step onwards. The reinforcement learning problem is solved using Bellman's principle of optimality: if the optimal state-action value for the next time step is known, then the optimal state-action value for the current time step can be calculated by maximizing the sum of the immediate reward and the discounted optimal value of the next state-action pair.
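The two quantities described in this paragraph, the cumulative discounted reward and the Bellman backup, can be illustrated with a short sketch. The function names and the finite, discrete set of candidate next-step values are simplifying assumptions made for illustration; the paper itself works with continuous actions.

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward: r_0 + gamma * r_1 + gamma^2 * r_2 + ..."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def bellman_backup(reward, gamma, next_q_values, done=False):
    """Optimal Q-value of the current step, given next-step Q-values.

    next_q_values : Q-values of the candidate actions in the next state
                    (a finite list, assumed here for simplicity)
    done          : True if the episode ended, so no future value is added
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


episode_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With a discount factor below one, rewards further in the future contribute less to the return, which is what keeps the sum finite over long episodes.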
In this paper, the actions of the reinforcement learning agent are the control action weighting parameters. The observations are the errors between the actual and desired forces at the C.G of the vehicle, as well as the vehicle states.
To encourage the agent to tune the weighting parameters, a reward function must be appropriately defined by the user. In our model, the objective is to find weighting parameters that force the vehicle to follow the reference signals while maintaining stability. The reward function is therefore designed such that the weighting matrix generated by reinforcement learning makes the reward function and the objective function (13) as similar as possible. Hence, the errors between the actual and reference C.G forces used in the cost function (13) are also used in the reward function, in order to find a suitable weighting matrix for the torque-vectoring controller. The choice of reward function determines the behavior of the controller and thus affects the stability of the vehicle. Therefore, to ensure the reliability of the torque-vectoring controller, a well-defined reward function (16), evaluated at every time step, is introduced.
The first term in the reward function encourages the agent to minimize the errors, while the logic conditions encourage the agent to keep the error below a threshold. Two logic terms appear in (16): the first is active if the simulation is terminated, and the second is active if the components of the error are below the threshold. In this way, a large positive reward is applied when the agent is close to its ideal condition, i.e. when the C.G force errors are small. Conversely, a negative reward is applied when the vehicle fails to maintain stability, which discourages the agent from losing the directionality of the vehicle. The simulation is terminated when the sideslip angle of the vehicle exceeds a set limit. Moreover, the action signal is bounded by the same lower and upper limits for all weighting parameters.
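A reward with the structure described above (an error-minimizing term plus two logic terms) might look like the following sketch. The quadratic form of the first term and every numerical value (`error_scale`, `threshold`, `bonus`, `penalty`, `sideslip_limit`) are hypothetical placeholders, since the actual expression and constants of (16) are not reproduced here.

```python
def is_terminated(sideslip_angle, sideslip_limit=0.2):
    """Episode ends when |sideslip| exceeds a limit (value assumed, in rad)."""
    return abs(sideslip_angle) > sideslip_limit


def reward(errors, terminated, error_scale=1e-3, threshold=50.0,
           bonus=10.0, penalty=100.0):
    """Per-step reward with a tracking term and two logic terms.

    errors     : C.G force errors (longitudinal, lateral, yaw moment)
    terminated : True if the simulation was stopped (stability lost)
    All constants are illustrative assumptions.
    """
    tracking = -error_scale * sum(e * e for e in errors)   # minimize errors
    within = all(abs(e) < threshold for e in errors)       # small-error logic
    return (tracking
            + (bonus if within else 0.0)
            - (penalty if terminated else 0.0))
```

For example, `reward([0.0, 0.0, 0.0], False)` returns the full bonus, whereas the same errors with `terminated=True` yield a net negative reward, mirroring the positive and negative cases described in the text.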
IV-B Deep Deterministic Policy Gradient
DDPG is an evolution of the Deterministic Policy Gradient algorithm. It belongs to the family of actor-critic, model-free, off-policy algorithms that use deep neural networks (DNNs) as function approximators. DDPG can learn policies in high-dimensional, continuous state and action spaces. The DDPG implementation used in this paper follows the original DDPG formulation; only a brief description of the theory is given in this section, and the interested reader is encouraged to read the original paper. The DDPG algorithm uses two deep neural networks: an actor network and a critic network. The actor network is responsible for the state-action mapping through a deterministic policy parameterized by the actor network weights, and the critic network approximates the Q-value function. The action value function is approximated using a DNN with its own weight parameters. To learn the Q-value, Bellman's principle of optimality is used to minimize the root mean squared loss between the critic's prediction and the Bellman target. The actor policy is updated using the policy gradient: gradient ascent is performed with respect to the policy parameters to maximize the Q-value. In DDPG, target networks are used to stabilize training, Gaussian noise is used for action exploration, experience replay is used for stability, and mini-batch gradient descent is used for the updates. The network parameters were tuned empirically, and the final hyperparameters can be found in Table I.
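The supporting mechanisms listed above (experience replay, target networks, Gaussian exploration noise and the Bellman target for the critic) can be sketched with the Python standard library. This is a simplified illustration rather than the Matlab toolbox implementation used in the paper; the class and function names, the flat-list representation of network weights and all numerical settings are assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of transitions, sampled uniformly in mini-batches."""

    def __init__(self, capacity, seed=0):
        self._data = deque(maxlen=capacity)  # oldest transitions are dropped
        self._rng = random.Random(seed)

    def add(self, transition):
        self._data.append(transition)

    def sample(self, batch_size):
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


def td_target(reward, next_q, gamma, done):
    """Bellman target for the critic; next_q comes from the target networks."""
    return reward if done else reward + gamma * next_q


def soft_update(target_params, online_params, tau):
    """Move target weights slowly towards the online weights."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]


def noisy_action(action, sigma, low, high, rng):
    """Add zero-mean Gaussian exploration noise and clip to the bounds."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add((step, 0.0, 1.0, step + 1, False))  # (s, a, r, s', done)
batch = buffer.sample(2)
```

Sampling past transitions at random breaks the correlation between consecutive time steps, and the slowly updated target weights keep the Bellman target from shifting too quickly during training.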
Both the actor and critic networks are neural networks with two hidden layers. All hidden neurons use the ReLU activation; the actor output uses a tanh activation, whilst the critic output uses a linear activation.
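The actor structure described above (two ReLU hidden layers and a tanh output) can be sketched as a plain forward pass. The layer widths, the observation size of seven, the four outputs (one weight per wheel), the action bounds and the random initialization are all assumptions made for illustration; the actual sizes are those listed in Table I.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Affine map: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def actor_forward(obs, layers, low, high):
    """Two ReLU hidden layers, tanh output rescaled to [low, high]."""
    hidden = relu(dense(obs, *layers[0]))
    hidden = relu(dense(hidden, *layers[1]))
    squashed = [math.tanh(v) for v in dense(hidden, *layers[2])]
    return [low + 0.5 * (s + 1.0) * (high - low) for s in squashed]


rng = random.Random(0)
OBS_DIM, HIDDEN, ACT_DIM = 7, 16, 4  # hypothetical sizes
layers = [init_layer(OBS_DIM, HIDDEN, rng),
          init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, ACT_DIM, rng)]
weights_out = actor_forward([0.1] * OBS_DIM, layers, low=0.0, high=1.0)
```

Because tanh is bounded, rescaling its output guarantees that every action stays within the lower and upper limits imposed on the weighting parameters.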
V Simulation results
The DDPG agent was trained for 650 episodes. At the start of each training episode, the environment parameters were varied by randomly choosing the initial velocity within the training velocity range and a friction value between 0.4 and 0.6. As can be seen from the episode rewards (see Fig. 3), the agent quickly converges to an optimal policy, although slight improvements in the average reward can still be seen in the later episodes. Once the training phase is completed, the performance of the DDPG network is validated in various driving scenarios. All simulation experiments presented in this paper were performed on the four-wheel vehicle model with the Pacejka tire model described in Section II. The driver's input torque and steering wheel angle from the reference generator are shown in Fig. 4, and the parameters of the vehicle model are tabulated in Table II.
In this section, we first present the results of the algorithm in scenarios that DDPG was trained in and compare them with the genetic algorithm and manual tuning of the weighting matrix. To allow a fair comparison between the two strategies, the error signals used in the reward function (16) are also used to design the cost function of the genetic algorithm. For further validation, the effectiveness of the DDPG algorithm is investigated under conditions that the agent was not trained in, with a different input steering angle. This analysis investigates the generalization capability of the DDPG tuning strategy by testing its performance in scenarios beyond the training environment. The first scenario chosen to demonstrate the performance of the DDPG algorithm has a surface friction of 0.4 and a fixed initial velocity.
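The per-episode randomization of the training environment described above can be sketched as follows. The uniform distribution, the function name and the reuse of the 80 to 130 velocity range quoted in the introduction are assumptions; the velocity unit is left unspecified because it is not stated in the text.

```python
import random


def sample_episode_conditions(rng, velocity_range=(80.0, 130.0),
                              friction_range=(0.4, 0.6)):
    """Draw the initial velocity and road friction for one training episode.

    Both quantities are drawn uniformly (an assumption); the ranges are
    the ones quoted for the simulation environment.
    """
    velocity = rng.uniform(*velocity_range)
    friction = rng.uniform(*friction_range)
    return velocity, friction


rng = random.Random(42)
conditions = [sample_episode_conditions(rng) for _ in range(650)]
```

Drawing new conditions at every episode exposes the agent to the whole operating envelope instead of a single operating point.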
At every iteration, the algorithm actively tries to maximise the reward function by reducing the error signals in (16), while maintaining a smooth and low control effort from the torque-vectoring controller by generating an optimal weighting matrix. This can be observed in Fig. 5, where the top plot shows the weighting matrix generated by the reinforcement learning algorithm and the bottom plot shows the control action of the torque-vectoring controller. The generated adjustment torque on each wheel shows a symmetric and smooth evolution as a consequence of the weighting matrix. Note that the weight on the front left (FL) wheel remains constant over the entire simulation, unlike the weights on the other wheels. This behavior indicates that reinforcement learning found a semi-trivial solution to the problem over the entire operating region, demonstrating the generalization capabilities of the trained model, which will become clear later in this section.
Table II: Vehicle model parameters.
|Parameter|Value|
|Yaw moment of inertia|2050 kg·m²|
|Effective wheel radius|0.3 m|
|Front distance from vehicle C.G|1.43 m|
|Rear distance from vehicle C.G|1.21 m|
|Front cornering stiffness||
|Rear cornering stiffness||
Fig. 6 compares the actual C.G forces obtained from the vehicle model with those estimated by the CIM block. As can be seen, the generated C.G forces track the reference values, stabilizing the vehicle under the low-friction condition. However, a small discrepancy can be seen in the second subplot. Because the formulation presented in Section III-B has no direct control of the lateral tire forces, the error in the lateral C.G force is relatively large, and it is difficult to fully minimize the lateral C.G error under this condition.
To validate the effectiveness of the DDPG algorithm, the longitudinal C.G error is compared with that of the genetic algorithm and the trial-and-error approach (manual tuning). The manually tuned weights were selected based on our previous work, where the corresponding values maintained the vehicle stability. Fig. 7 compares parameter tuning of the torque-vectoring controller using the DDPG algorithm, the genetic algorithm and manual tuning. Fig. 7(a) shows the evolution of the error in the longitudinal C.G force for all tuning approaches. In this analysis, the errors are small enough that the vehicle successfully maintains stability with every approach. However, DDPG outperforms the other approaches, with a smaller maximum absolute error than both the genetic algorithm and manual tuning.
Moreover, unlike manual tuning, the error for both DDPG and the genetic algorithm returns to near zero, but the magnitude of the error is smaller for DDPG than for the genetic algorithm, which shows the advantage of the DDPG strategy over the other approaches. This confirms that the DDPG algorithm is a reliable technique for gaining a better understanding of the controller parameters and their tuning, particularly in the nonlinear operating region of the vehicle. Note that only the error in the longitudinal C.G force is analyzed in this paper, since only the longitudinal corrective action is controlled in order to stabilize the vehicle. Fig. 7(b) shows the value of the objective function (13), to illustrate the optimality of the torque-vectoring controller solution resulting from parameter tuning. As can be seen, DDPG finds the best solution, reducing the performance index to the lowest level and converging to a near-zero value during the simulation. This result is better than that of the genetic algorithm and is an almost twofold improvement over manual tuning of the weighting matrix.
A parametric analysis was carried out to verify the effectiveness of the DDPG algorithm. In this analysis, the maximum absolute error of the longitudinal C.G force is reported for the DDPG algorithm, the genetic algorithm and manual tuning of the torque-vectoring controller (see Table III). The first analysis compares the DDPG algorithm with the other approaches under different road friction coefficients at the same initial velocity. The error increases gradually for all approaches as the road friction decreases. For all friction values, however, DDPG yields the smallest error: according to Table III, it reduces the maximum absolute error by roughly 60% to 74% relative to manual tuning and by roughly 36% to 53% relative to the genetic algorithm.
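Table III: Maximum absolute longitudinal C.G force error for the three tuning approaches.
Different road friction coefficients at a fixed initial velocity (columns ordered by increasing friction):
|DDPG|132 N|115 N|77 N|54 N|43 N|32 N|18 N|
|Genetic algorithm|205 N|180 N|163 N|105 N|86 N|65 N|32 N|
|Manual tuning|366 N|289 N|223 N|170 N|127 N|95 N|70 N|
Different initial velocities at a fixed road friction (columns ordered by increasing velocity):
|DDPG|17 N|33 N|77 N|132 N|182 N|202 N|214 N|
|Genetic algorithm|40 N|73 N|161 N|205 N|287 N|322 N|390 N|
|Manual tuning|71 N|105 N|212 N|366 N|487 N|498 N|515 N|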
This analysis shows that the DDPG algorithm helps the control structure adapt to changes in the environmental conditions. The second analysis evaluates the proposed framework under different velocities with a fixed surface friction. As the velocity of the vehicle increases, the maximum absolute error increases for all three tuning approaches, but DDPG again yields the smallest error in every case: according to Table III, the reduction is roughly 58% to 76% relative to manual tuning and roughly 36% to 58% relative to the genetic algorithm. This analysis confirms the ability of the DDPG algorithm to improve the adaptive capability of the controller under different velocities and low surface friction.
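The relative improvements implied by Table III can be checked with a few lines of code. The error values below are copied from Table III; the dictionary layout, the key names and the function names are illustrative choices.

```python
# Maximum absolute longitudinal C.G force errors (N), copied from Table III.
TABLE_III = {
    "friction": {
        "DDPG": [132, 115, 77, 54, 43, 32, 18],
        "GA": [205, 180, 163, 105, 86, 65, 32],
        "Manual": [366, 289, 223, 170, 127, 95, 70],
    },
    "velocity": {
        "DDPG": [17, 33, 77, 132, 182, 202, 214],
        "GA": [40, 73, 161, 205, 287, 322, 390],
        "Manual": [71, 105, 212, 366, 487, 498, 515],
    },
}


def reduction_percent(baseline, tuned):
    """Column-wise error reduction of `tuned` relative to `baseline`, in %."""
    return [100.0 * (b - t) / b for b, t in zip(baseline, tuned)]


def reduction_range(sweep, baseline):
    """Smallest and largest reduction achieved by DDPG within one sweep."""
    values = reduction_percent(TABLE_III[sweep][baseline],
                               TABLE_III[sweep]["DDPG"])
    return min(values), max(values)


for sweep in ("friction", "velocity"):
    for baseline in ("GA", "Manual"):
        low, high = reduction_range(sweep, baseline)
        print(f"{sweep:8s} vs {baseline:6s}: {low:.1f}% to {high:.1f}%")
```

The reduction relative to manual tuning exceeds 50% in every column of both sweeps, while the reduction relative to the genetic algorithm is smaller and falls below 50% in several columns.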
Finally, to validate the generalization capability of the DDPG algorithm, simulations were run in an environment that DDPG was not trained in. A surface friction that was not considered during training was chosen, together with an initial velocity that is a reasonable representation of high-speed driving on an icy or snowy surface. Additionally, a step steer manoeuvre was applied as the input steering angle in order to investigate the generalization capability of the DDPG algorithm under unforeseen conditions. The results are shown in Fig. 8. In Fig. 8(a), the longitudinal C.G error grows during braking and acceleration between 8 and 11 s for all tuning approaches, due to the sudden input driving torque applied to the controller. However, the error for the DDPG algorithm converges to zero, which indicates reasonably good tracking even in an environment that it was not trained in. The genetic algorithm and manual tuning, on the other hand, show a larger error that fails to converge to zero, with the genetic algorithm tracking slightly better than manual tuning. Fig. 8(b) shows the cost function (13) for all approaches, to illustrate the optimality of the torque-vectoring controller solution. DDPG offers a better solution than the other methods, despite a relatively large peak around 11 s. This peak is caused by a large weighting value (see Fig. 5(a)) applied to the controller to maintain the vehicle stability, which ultimately reduced the performance index to a near-zero value. In contrast, the performance index for the genetic algorithm and manual tuning shows an increasing trend and does not converge to an optimal solution. The results show the superiority of the proposed DDPG approach to parameter tuning over the manual tuning and genetic algorithm baselines.
More importantly, by carrying out tests with road conditions, vehicle velocities and steering inputs different from those seen during training, the results demonstrate that DDPG has learned a general approach to tuning the weighting parameters of the torque-vectoring controller, which generalizes to new scenarios and environments.
VI Conclusion
This paper has investigated the use of the DDPG algorithm for automatically tuning a torque-vectoring controller under a wide range of vehicle velocities and surface friction conditions. The proposed control framework consists of (i) a reinforcement learning method as an auto-tuning algorithm and (ii) a torque-vectoring controller to ensure the lateral-yaw stability of the vehicle. The closed-loop scheme was implemented on a four-wheel vehicle model with nonlinear tire characteristics, and the numerical results indicate the benefits of reinforcement learning for tuning the torque-vectoring parameters. DDPG has multiple advantages over the genetic algorithm and manual tuning: it finds better weighting parameters for the controller, and it tunes the parameters online based on the current states of the vehicle. The results demonstrated the effectiveness of the DDPG algorithm as an auto-tuning method over a wide range of operating points, resulting in a significant reduction of the longitudinal C.G errors compared to the other tuning approaches. The overall performance of the vehicle shows accurate tracking of the references while maintaining the stability of the vehicle, which demonstrates the efficacy of the algorithm. While this work evaluated the adaptive tuning of torque-vectoring parameters, the framework is quite general and could be adapted to different controllers and even different domains.
References
- (2020) PID controller optimized by genetic algorithm for direct-drive servo system. Neural Computing and Applications 32 (1), pp. 23–30. Cited by: §I.
- (2004) Applying neural networks to on-line updated PID controllers for nonlinear process control. Journal of Process Control 14 (2), pp. 211–230. Cited by: §I.
- (2007) A linear time varying model predictive control approach to the integrated vehicle dynamics control problem in autonomous systems. Proceedings of the IEEE Conference on Decision and Control, pp. 2980–2985. Cited by: §I, §II.
- (2013) Vehicle Optimal Torque Vectoring Using State-Derivative Feedback and Linear Matrix Inequality. 62 (4), pp. 1540–1552. Cited by: §III-A, §III-B.
- Adaptive Sliding Mode Control of Dynamic Systems Using Double Loop Recurrent Neural Network Structure. IEEE Transactions on Neural Networks and Learning Systems 29 (4), pp. 1275–1286. Cited by: §I.
- (2016) Adaptive neuro-fuzzy tracking control of UUV using sliding-mode-control-theory-based online learning algorithm. Proceedings of the World Congress on Intelligent Control and Automation (WCICA) 2016-Septe, pp. 691–696. Cited by: §I.
- (2019) Review : A selected review on reinforcement learning based control. (2011), pp. 1–14. Cited by: §I.
- (2003) On actor-critic algorithms. SIAM Journal on Control and Optimization 42 (4), pp. 1143–1166. Cited by: §I.
- (2019) End-to-end Reinforcement Learning for Autonomous Longitudinal Control Using Advantage Actor Critic with Temporal Context. 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pp. 2456–2462. Cited by: §IV-A.
- (2016) Continuous control with deep reinforcement learning. 4th International Conference on Learning Representations, ICLR 2016 - Conference Track Proceedings. Cited by: §I, §IV-A, §IV-B, §IV-B.
- (2016) H∞ loop shaping for the torque-vectoring control of electric vehicles: Theoretical design and experimental assessment. Mechatronics 35, pp. 32–43. Cited by: §I.
- (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §IV-B.
- (2012) Tire and Vehicle Dynamics. Cited by: §II.
- (2012) Tuning and Retuning of PID Controller for Unstable Systems Using Evolutionary Algorithm. ISRN Chemical Engineering 2012, pp. 1–11. Cited by: §I.
- (2015) Reinforcement Learning: An Introduction. Cambridge. Cited by: §IV-B.
- (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §I.
- (2017) Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. pp. 1–19. Cited by: §I.
- Deterministic policy gradient algorithms. 31st International Conference on Machine Learning, ICML 2014, vol. 1, pp. 605–619. Cited by: §IV-B.
- (2007) A Proposal of Adaptive PID Controller Based on Reinforcement Learning. Journal of China University of Mining and Technology 17 (1), pp. 40–44. Cited by: §I.
- (2019) Integrated Trajectory Planning and Torque Vectoring for Autonomous Emergency Collision Avoidance. International Conference on Intelligent Transportation Systems (ITSC). Cited by: §V.
- (1930) On the Theory of the Brownian Motion. Physical Review 36, pp. 823–841. Cited by: §IV-B.
- (2019) Adaptive Fuzzy Sliding Mode Control for Nonlinear Uncertain SISO System Optimized by Differential Evolution Algorithm. International Journal of Fuzzy Systems 21 (3), pp. 755–768. Cited by: §I.
- (2019) Adaptive DDPG Design-Based Sliding-Mode Control for Autonomous Underwater Vehicles at Different Speeds. IEEE Underwater Technology (UT), pp. 1–5. Cited by: §I.
- (2018) A gain scheduled robust linear quadratic regulator for vehicle direct yaw moment Control. Mechatronics 51 (January), pp. 31–45. Cited by: §I.