1 Introduction
Control systems have advanced considerably over the past few decades, and feedback control is widely used in both academic research and industry. SISO feedback is efficient for simple linear systems in which the observed quantity depends directly on the input. For more complex nonlinear systems, a linear approximation is usually computed around the desired operating point, and linear controllers are designed around that approximation. MIMO feedback control is useful in many settings because it incorporates multiple observable quantities as well as multiple controllable inputs. Many MIMO architectures have been developed over the years, and these control strategies increasingly employ modern methods such as predictive control and the application of machine learning in control systems.
The inverted pendulum on a cart is an ideal unstable-system problem, making it one of the classic problems of feedback control theory, control systems and control algorithms. A load is attached to the top of a wheeled assembly that moves along a track. One of the earliest works on integrating neural networks into a control system for the inverted pendulum on a cart was by Charles W. Anderson, who described two-layer and single-layer neural network systems using a rudimentary form of reinforcement learning and the temporal-difference technique to solve the given control problem. The paper uses only one measure of progress, the number of task 'failures', and also notes the tedious amount of learning time required [1].
More recent studies of similar configurations use different control systems such as PD, LQR, fuzzy logic, MPC and RL controllers. Comparing these algorithms in terms of robustness, timeliness, integration, acceptability and flexibility is important for identifying the best solution for industrial application, so the authors reviewed papers comparing the control systems mentioned above on the same (or similar) assembly. A comparison of PID versus (PD-like) fuzzy logic control in a similar configuration by Goher [2] in 2019 concluded that (PD-like) fuzzy logic control provides much better performance than PID control, improving the system responses. The work on model predictive control for a wheeled inverted pendulum by Yue et al. [3] shows the excellent robustness of the MPC controller on the WIP system. The study by Dinev et al. [4] on a wheeled robot used a prismatic joint as an extensible architecture, allowing an extra DOF in the direction of translation of the joint; they further compared MPC and PD controllers on their system to benchmark the qualities of the controllers, showing that MPC and PD gave identical results on flat terrain, while on rough terrain MPC achieved better stabilization and more robust performance under sensor noise. MPC thus appears to give very robust and consistent performance on inverted pendulum systems. Comparing LQR to PD control, LQR proves to be the better alternative, as concluded in the results of Li et al. [5]. LQR is efficient in tracking small positional changes, which makes the system more responsive and improves overall performance, although, as discussed in [5], nonlinear control systems are necessary to achieve a more manageable solution for larger displacements.
Reinforcement learning based controllers have in recent years shown substantial progress on nonlinear control problems, and their learning time and robustness have also seen ample improvement in recent works [5], [6] and Lim et al. [7]. Taking inspiration from the work being done on applying RL to mobile robot control, we have developed an Extendable Wheeled Inverted Pendulum (EWIP) system to be controlled with two different popular RL algorithms.
Wheeled inverted pendulum systems are being extensively investigated for their applications in modern robotics for creating mobile robots. Kim et al. [8] analyzed and demonstrated a two-wheeled IP with an LQR-based controller for basic operations such as balancing, steering and spinning; they claimed that the robustness of the presented model was on par with their expectations, and they proved it by testing their system on different terrains. Peng et al. [9] used PID together with fuzzy logic controllers to balance a modified two-wheeled inverted pendulum robot: they introduced a prismatic joint with a bar at the top of the system for better balance and then designed a fuzzy logic controller for this system. Like Wang et al., many modified versions of the inverted pendulum have been studied over time with new and advanced control strategies for accomplishing various tasks. Works such as those of Klemm et al. [10] and Dinev et al. [4] show the advent of modified inverted pendulum systems in the hope of better balancing performance; the former uses LQR controllers, and the latter compares model predictive controllers with traditional controllers such as PD for crossing rough terrain. In recent times, reinforcement learning has been a major field of interest for system design researchers; Manrique Escobar et al. [11] show the implementation of deep reinforcement learning in controlling a cart-pole system. Such works have motivated the authors to experiment with new RL-based control strategies and develop controllers for the EWIP system, given its versatile and nonlinear nature. In this article, the authors discuss deep reinforcement learning and model predictive strategies to control a simulated extendable wheeled inverted pendulum system.
2 System Definition
The control system analysis is done on an extendable wheeled inverted pendulum (EWIP). The independent states of the system can be represented by q = (x, z, θ, φ, l) and their derivatives by q̇ = (ẋ, ż, θ̇, φ̇, l̇), where x is the displacement of the wheel in the x direction, z is the displacement along the z axis, θ is the angle made by the pendulum with the normal, φ is the angle through which the wheel has rotated and, finally, l is the length of the extendable pendulum from the wheel center to the 'bob'.
The nonlinear equations of motion of this system can be obtained by writing the total kinetic energy (T) and potential energy (V) of the system. They are given by
T = \frac{1}{2}M(\dot{x}^2 + \dot{z}^2) + \frac{1}{2}I_w\dot{\varphi}^2 + \frac{1}{2}m(\dot{x}_b^2 + \dot{z}_b^2)  (1)
V = Mgz + mgz_b  (2)
where x_b and z_b are defined for the pendulum 'bob' as,
x_b = x + l\sin\theta  (3)
z_b = z + l\cos\theta  (4)
Here, M is the mass of the wheel, m is the mass of the pendulum 'bob', I_w is the moment of inertia of the wheel, and g is the acceleration due to gravity. Let the Lagrangian be L = T − V. For the various state variables, the state equations in terms of the Lagrangian can be written as,
\frac{d}{dt}\frac{\partial L}{\partial \dot{x}} - \frac{\partial L}{\partial x} = Q_x  (5)
\frac{d}{dt}\frac{\partial L}{\partial \dot{z}} - \frac{\partial L}{\partial z} = Q_z  (6)
\frac{d}{dt}\frac{\partial L}{\partial \dot{\theta}} - \frac{\partial L}{\partial \theta} = Q_\theta  (7)
\frac{d}{dt}\frac{\partial L}{\partial \dot{\varphi}} - \frac{\partial L}{\partial \varphi} = Q_\varphi  (8)
\frac{d}{dt}\frac{\partial L}{\partial \dot{l}} - \frac{\partial L}{\partial l} = Q_l  (9)
In Equations (5)–(9), Q_x, Q_z, Q_θ, Q_φ and Q_l are the generalized forces on the corresponding coordinates: the normal force F_N on the wheel enters Q_z, the input torque τ applied to the wheel by the motor enters Q_φ, and the input force F_l on the extendable link enters Q_l. The Lagrangian can be expanded separately for each state variable to define the nonlinear dynamics of the system.
(M+m)\ddot{x} + m(\ddot{l}\sin\theta + l\ddot{\theta}\cos\theta + 2\dot{l}\dot{\theta}\cos\theta - l\dot{\theta}^2\sin\theta) = Q_x  (10)
(M+m)\ddot{z} + m(\ddot{l}\cos\theta - l\ddot{\theta}\sin\theta - 2\dot{l}\dot{\theta}\sin\theta - l\dot{\theta}^2\cos\theta) + (M+m)g = Q_z  (11)
m(l^2\ddot{\theta} + 2l\dot{l}\dot{\theta} + l\ddot{x}\cos\theta - l\ddot{z}\sin\theta - gl\sin\theta) = Q_\theta  (12)
I_w\ddot{\varphi} = Q_\varphi  (13)
m(\ddot{l} + \ddot{x}\sin\theta + \ddot{z}\cos\theta - l\dot{\theta}^2 + g\cos\theta) = Q_l  (14)
The above equations (10)–(14) are highly nonlinear, so designing an optimal control strategy directly requires a complex controller obtained through rigorous calculations involving inseparable nonlinear terms. To bypass these calculations, a self-learning control system can be implemented that moves 'closer' to the 'ideal' controller for this system through iterative training.
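The open-loop instability that makes this plant hard to control can be checked numerically. The sketch below simulates a simplified special case of Equations (10) and (12) — fixed link length, wheel height held constant, wheel inertia and friction ignored, so the model reduces to a classic cart-pole — with masses and link length loosely taken from Table 1. The simplification and the explicit-Euler integrator are illustrative assumptions, not the paper's simulation setup.

```python
import math

# Parameters loosely taken from Table 1 (kg, m); simplified special case of
# Equations (10) and (12): fixed-length pendulum on a cart, theta measured
# from the upright equilibrium.
M, m, l, g = 0.25, 0.125, 0.25, 9.81

def step(state, F, dt):
    """One explicit-Euler step of the simplified nonlinear dynamics."""
    x, x_dot, th, th_dot = state
    s, c = math.sin(th), math.cos(th)
    # Solve the coupled x / theta equations for the accelerations.
    x_dd = (F + m * l * th_dot**2 * s - m * g * s * c) / (M + m * s**2)
    th_dd = (g * s - x_dd * c) / l
    return (x + x_dot * dt, x_dot + x_dd * dt,
            th + th_dot * dt, th_dot + th_dd * dt)

def simulate(theta0=0.01, dt=1e-3, T=1.0):
    """Return the peak |theta| reached with zero input (open loop)."""
    state = (0.0, 0.0, theta0, 0.0)
    peak = abs(theta0)
    for _ in range(int(T / dt)):
        state = step(state, F=0.0, dt=dt)
        peak = max(peak, abs(state[2]))
    return peak

peak = simulate()
```

With no input, a 0.01 rad perturbation grows past 0.5 rad within a second, so any useful controller must actively stabilize the upright equilibrium.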
3 Control Schemes
3.1 Reinforcement Learning in Control System Design
Controlling highly complex systems, such as mechanical assemblies with an extensive amount of dynamics involved [12], with the traditional approach of deriving the system equations to compute their inverse or forward kinematics and tediously adjusting parameters is an unyielding, gargantuan task. A more manageable solution to such highly dynamic and complex problems can be attained with the help of ML systems and algorithms. One such ML algorithm, based on policy iteration over Markov Decision Processes (MDPs), is known as Reinforcement Learning. Another thread of RL is based on trial and error [13], much analogous to how humans psychologically learn a task via reward–punishment based skill development, where a reward is given at a successful execution and a punishment when the task fails [14], [15]. Figure 3 shows a typical RL-based controller attached to a plant. This MDP optimization technique has the goal of maximizing its reward function. Consider a uniform, standard RL setup with an agent (consisting of an actor and a critic) that interacts with the environment at specific discrete times. The parameters involved in the agent–environment interaction are (S_t, A_t, S_{t+1}, R_{t+1}): S_t gives the state of the environment at time t, A_t gives the action performed by the agent at time t, S_{t+1} describes the state of the environment after the action has taken place, and R_{t+1} describes the reward received by the agent after evaluation of A_t. As mentioned earlier, the goal of the system is an 'optimal control' policy (which maps states to actions) with the maximum amount of reward amassed over the time steps [16], i.e. the total cumulative reward to be maximized can be expressed as:
R_t = \sum_{k=0}^{\infty}\gamma^k r_{t+k+1}, \quad \gamma \in [0, 1)  (15)
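For a finite episode, the discounted sum in Equation (15) can be evaluated directly; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Total discounted reward sum_k gamma^k * r_k, as in Equation (15),
    accumulated backwards so each reward is discounted exactly once."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For a constant reward stream of 1.0 per step, the return approaches 1/(1 − γ) as the episode lengthens, which is why γ < 1 keeps the objective bounded.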
The policy π can be approximated as a deep neural network function, which implies that it contains the weights of the various neurons and layers, given by a vector ω. Now π(s, a, ω) becomes a deep neural network function that receives the states, actions and weights and gives the control policy. The validity of any RL-based controller is defined by three basic functions, as described by Zhang et al. [16]: the state value function V_π(s), the action value function Q_π(s, a), and the value of an action with respect to a state, A_π(s, a). For current state s_t, current action a_t and time t, these functions, taken in expectation over the environment with the action policy π, become,

V_\pi(s_t) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k}\,\middle|\,s_t\right]  (16)
Q_\pi(s_t, a_t) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty}\gamma^k r_{t+k}\,\middle|\,s_t, a_t\right]  (17)
A_\pi(s_t, a_t) = Q_\pi(s_t, a_t) - V_\pi(s_t)  (18)
Here γ is the discount factor. The goal of training the RL network is to learn the vector ω that yields the optimal control policy π*. Recent RL research has produced numerous algorithms, primarily for continuous state and action spaces; among them, deep deterministic policy gradient (DDPG) [19] and proximal policy optimization (PPO) [17] are the policy-gradient RL implementations we will be benchmarking in our analysis. These methods iteratively optimize a policy estimated by gradients. DDPG and PPO are both model-free algorithms that follow the actor–critic architecture; their goal is to maximize the long-term reward while reducing the variance of the gradient estimate [18], [19].
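The three functions in Equations (16)–(18) can be made concrete on a toy MDP. The two-state example below (hypothetical, unrelated to the EWIP) recovers V, Q and the advantage A = Q − V by value iteration:

```python
# Toy deterministic MDP: the next state equals the action taken, and the
# reward is 1 for action 1 and 0 for action 0.  Value iteration recovers
# V*, Q* and the advantage A = Q - V of Equations (16)-(18).
GAMMA = 0.9
STATES, ACTIONS = (0, 1), (0, 1)

def reward(s, a):
    return 1.0 if a == 1 else 0.0

def value_iteration(iters=500):
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        # Bellman optimality backup: V(s) = max_a [r(s,a) + gamma * V(s')]
        V = {s: max(reward(s, a) + GAMMA * V[a] for a in ACTIONS)
             for s in STATES}
    Q = {(s, a): reward(s, a) + GAMMA * V[a] for s in STATES for a in ACTIONS}
    A = {(s, a): Q[(s, a)] - V[s] for s in STATES for a in ACTIONS}
    return V, Q, A

V, Q, A = value_iteration()
```

Here the optimal value is 1/(1 − γ) = 10 in every state, the optimal action has zero advantage, and the suboptimal action has negative advantage — the property PPO's advantage-weighted objective exploits later.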
3.1.1 Deep Deterministic Policy Gradient
Deep Deterministic Policy Gradient (or DDPG) is a model-free algorithm with an actor–critic framework that learns a deterministic policy, extending the approach to continuous rather than discrete action spaces via the combination of DPG and DQN [19]. The DDPG algorithm trains two networks, the actor μ(s | θ^μ) and the critic Q(s, a | θ^Q). As mentioned earlier, the actor gives the actual action corresponding to a given state and the critic gives the Q-value, which measures the 'goodness' of the action taken in the current state. The neural networks thus defined are randomly initialized and are then trained on mini-batches from the experience buffer. For each experience, the set of state variables, action, reward and resulting state is used to update the actor and critic networks so that the actor takes the appropriate action for the given state and the critic yields the optimal Q-value. The critic, or Q-value network, is updated by minimizing the mean squared Bellman error, as described by Escobar et al. [11],
L = \frac{1}{M}\sum_{i=1}^{M}\left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2  (19)
Here M is the mini-batch size, r_i is the current reward and y_i is the expected target value, which can be expressed as,
y_i = r_i + \gamma\, Q'\!\left(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \,\middle|\, \theta^{Q'}\right)  (20)
Here, γ is the discount factor. The loss L is utilized to update the Q network's parameters, and the policy gradient is utilized to update the actor μ's parameters while keeping the parameters of the Q network constant. The policy gradient is given by differentiating the objective w.r.t. θ^μ,
\nabla_{\theta^\mu} J \approx \frac{1}{M}\sum_{i=1}^{M}\nabla_a Q(s, a \mid \theta^Q)\big|_{s=s_i,\,a=\mu(s_i)}\,\nabla_{\theta^\mu}\mu(s \mid \theta^\mu)\big|_{s=s_i}  (21)
The policy and the value function are updated until a viable policy is found. In our case, the state s and the action a can be described as tuples of the following format,
s = (x, z, \theta, \varphi, l, \dot{x}, \dot{z}, \dot{\theta}, \dot{\varphi}, \dot{l})  (22)
a = (\tau, F_l)  (23)
Figure 4 denotes the structure of the actor and critic networks utilized for the DDPG implementation for the system. The observations are 22 in number, comprising the 10 state variables and 12 input target-error states from 6 previous steps. The actions are 2 in number, as described earlier. The DDPG networks are trained for 8578 episodes with a total of 1.7M steps. The sampling time was set to 0.05 s with a discount factor of 0.99. The batch size was 1024 and the experience buffer had a maximum length of 1e+06 experiences. The exploration noise used by the DDPG network had mean 0 and standard deviation 0.1. The learning rates for the critic and actor are set to 0.0001 and 1e−05 respectively, with a gradient threshold of 1. The Adam optimizer is used for both actor and critic networks with an L2 regularization parameter of 0.0001 for both.
3.1.2 Proximal Policy Optimization
PPO (proximal policy optimization) is a model-free, on-policy policy-gradient reinforcement learning technique that aims to evaluate and improve the policy used to make choices. The technique alternates between sampling data via interaction with the environment and optimizing a surrogate objective function using stochastic gradient descent. The clipped surrogate objective improves training stability by limiting the extent of the policy change at each step [17]. Unlike DDPG, PPO builds its objective from the ratio r_t(θ) of the new and the previous policy, weighted by an advantage function Â_t. The clipped surrogate objective is given as,

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\right)\right]  (24)
Here, r_t(\theta) = \pi_\theta(a_t \mid s_t)/\pi_{\theta_{old}}(a_t \mid s_t) denotes the ratio of the new and old probability distributions of the policy, a measure of the divergence between the two distributions. The critic provides a state value function V(s), which is utilized to define the 'advantage' for the current time step,

\hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t+1} r_{T-1} + \gamma^{T-t} V(s_T)  (25)
T is the maximum number of time steps the value estimator can foresee, which adds up to the overall advantage Â_t. For each epoch, the actor calculates the surrogate losses for each time segment of length T. The advantage is calculated and the parameters θ are then optimized over a mini-batch. Similar to DDPG, the state variables and actions for the system presented in this paper are,

s = (x, z, \theta, \varphi, l, \dot{x}, \dot{z}, \dot{\theta}, \dot{\varphi}, \dot{l})  (26)
a = (\tau, F_l)  (27)
Figure 5 depicts the actor and critic models used to control the system with the PPO algorithm. Here the total number of observations is 22 and there are two system inputs. As discussed earlier, the actor outputs the optimal input for the system based on the observations, and the critic evaluates the state value, which is utilized to calculate the advantage. The sampling time for the PPO implementation was 0.01 s with an experience horizon of 1000 samples, a discount factor of 0.99 and a clip factor of 0.2. Both actor and critic were trained with the Adam optimizer at a learning rate of 0.0001; the gradient threshold is set to 10, with an L2 regularization parameter of 0.0001. The networks were trained for 9875 episodes, or 9.8M steps, over more than 35 hours.
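The per-sample clipped objective of Equation (24) and the T-step advantage of Equation (25) can be sketched directly; the numeric inputs below are illustrative:

```python
def clipped_surrogate(ratio, advantage_t, eps=0.2):
    """Per-sample clipped objective of Equation (24):
    min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio)) * advantage_t
    return min(ratio * advantage_t, clipped)

def advantage(rewards, v_s0, v_sT, gamma=0.99):
    """T-step advantage of Equation (25): discounted rewards over the
    segment, bootstrapped with V(s_T), minus the value of the start state."""
    disc = sum(gamma ** k * r for k, r in enumerate(rewards))
    return -v_s0 + disc + gamma ** len(rewards) * v_sT
```

The min/clip pair is what keeps a single update from moving the policy ratio far outside [1 − ε, 1 + ε]: a ratio of 2.0 with positive advantage is credited only as 1.2, and a ratio of 0.5 with negative advantage is still penalized as 0.8.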
3.2 Model Predictive Control
Model predictive control (MPC) is an effective and efficient advanced control approach that has long featured in academic and research discussions and is broadly applied in industrial operations. MPC's performance-tuning formulation and its ability to handle constrained, nonlinear multivariable problems distinguish it from traditional feedback control methods. However, compared to conventional proportional–integral–derivative (PID) controllers, MPC's computational efficiency is at a disadvantage, especially in large-scale nonlinear programming (NLP) challenges, since the NLP must be solved online at every time interval [20]. Therefore, using MPC for NLP requires methods such as sequential quadratic programming (SQP). The SQP technique is an iterative approach that solves a series of optimization subproblems, using warm starts and optimal active-set recognition [21]. The linear MPC employed here creates a Hessian matrix that encodes the prediction model based on the number of inputs, the observable states, the added input and output noise, and the prediction horizon. The resulting quadratic programming (QP) problem is solved using the active-set solver.
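A minimal receding-horizon sketch may clarify the prediction-matrix formulation. The toy below tracks a position reference with an unconstrained double integrator, solving the horizon QP in closed form at each step; the plant, horizon and weights are hypothetical, and the paper's controller additionally enforces input constraints via the active-set solver.

```python
import numpy as np

# Receding-horizon linear MPC on a double integrator (position, velocity).
# The horizon cost sum_k (y_k - ref)^2 + lam * u_k^2 is minimized in closed
# form each step and only the first input is applied.
dt, N, lam = 0.05, 20, 1e-4
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
C = np.array([[1.0, 0.0]])

# Stacked prediction matrices: Y = F x0 + G U over the horizon.
F = np.vstack([C @ np.linalg.matrix_power(A, k + 1) for k in range(N)])
G = np.zeros((N, N))
for i in range(N):
    for j in range(i + 1):
        G[i, j] = (C @ np.linalg.matrix_power(A, i - j) @ B)[0, 0]

def mpc_step(x, ref):
    """Solve the unconstrained horizon QP; return only the first input."""
    e = np.full((N, 1), ref) - F @ x
    U = np.linalg.solve(G.T @ G + lam * np.eye(N), G.T @ e)
    return float(U[0, 0])

x = np.zeros((2, 1))
for _ in range(600):                 # 30 s of simulated time
    u = mpc_step(x, ref=1.0)
    x = A @ x + B * u
position, velocity = float(x[0, 0]), float(x[1, 0])
```

Re-solving the same QP from the updated state at every step is what makes the scheme "receding horizon"; adding inequality constraints on U turns this closed-form solve into the active-set QP described above.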
The MPC controller implemented here is based on a linearized model of the system around its desired position, where the system stands still with no velocity and θ = 0. The MPC is created using the MPC Designer toolbox in MATLAB. The sampling rate is 0.01 s with a prediction horizon of 100 and a control horizon of 15. The input observations are the 10 state variables, and the two outputs are τ and F_l respectively. Reference values are provided for x and l, and the MPC has the constraint of keeping l between its minimum and maximum lengths; the other states are unconstrained. The weights on the input variables τ and F_l are 0.0210 and 0.2101 respectively, and the rates of change of the input variables, τ̇ and Ḟ_l, have a weight of 0.4759 each.
Since MPC controllers are fairly popular and provide robust system control in many situations, the comparison of the RL models developed here is done against the MPC for the same system and the same input trajectories.
4 Experimental Setup
The simulation is developed and run in Simulink. This paper focuses on the ability of the robot system under study to balance itself, move and stop at a point. The system comprises the robot and a ground plane that provides friction for the robot's wheel. The robot itself is made up of three major parts: the wheel, the extendable link and the bob. The dimensions of the robot's anatomy are given in the figure. The wheel is connected to the extendable link with a revolute joint, and the bob is connected to the wheel shaft with a prismatic joint. Other simulation parameters are given in Table 1.
Quantity  Value  Unit
Coefficient of Friction of the Ground  0.6  –
Radius of Wheel  100  mm
Mass of Wheel  0.25  kg
Link Minimum Length  250  mm
Link Maximum Length  500  mm
Radius of Bob  50  mm
Mass of Bob  0.125  kg
The RL simulation is also set up in Simulink (Figure 6). The raw states generated by the environment are processed to form part of the observation vector O_t to be fed to the RL block. The planned variables contain the x and l positions of the trajectory for the system to follow. The errors are calculated from the current system's x and l positions, and the errors of the last 5 timesteps complete the total observation vector. The RL block also requires a 'done' signal to terminate and reset the simulation when required.
O_t = (x, z, \theta, \varphi, l, \dot{x}, \dot{z}, \dot{\theta}, \dot{\varphi}, \dot{l}, e_x(t), e_l(t), \ldots, e_x(t-5), e_l(t-5))  (28)
The reward r_t depends on the actions, the state variables, the previous τ and F_l change rates, and the trajectory error. The reward function for this simulation can be written as,
(29) 
(30) 
(31) 
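The paper's exact reward coefficients are not reproduced here; a generic shaping of the kind described — trajectory-error and posture terms plus penalties on input magnitude and input rate — might look like the following, with entirely hypothetical weights:

```python
def reward(x_err, l_err, theta, tau, f_l, d_tau, d_fl, fell):
    """Illustrative shaping of the kind Equations (29)-(31) describe.
    All weights are hypothetical, not the paper's actual coefficients."""
    r = 1.0                                    # alive bonus per step
    r -= 2.0 * x_err ** 2 + 1.0 * l_err ** 2   # trajectory tracking error
    r -= 0.5 * theta ** 2                      # posture deviation
    r -= 0.01 * (tau ** 2 + f_l ** 2)          # input (energy) penalty
    r -= 0.01 * (d_tau ** 2 + d_fl ** 2)       # input rate-of-change penalty
    return -10.0 if fell else r
```

The input-magnitude term is the kind of penalty credited in Section 5 with keeping the wheel's rotational power low under the DDPG controller; the large terminal penalty on falling is what generates the 'done' signal's learning pressure.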
5 Result and Discussion
The training of the proposed ML models is done on a cloud VM with an Intel Xeon 2.3 GHz 4-core CPU, 52 GB of memory and an Nvidia Tesla T4 GPU with 16 GB of compute memory. The training for both models was done in a discrete manner due to the memory constraints of the system. Since the system is continuous, training took a considerable amount of time. The actor and critic were trained in parallel.
Table 2 depicts the training parameters for the actor and critic agents. To compare all controllers, we provide a reference x-trajectory for the controller to follow. Since we are studying the balancing and the point-to-point motion of the robot in one direction, we 'publish' the trajectory values to the robot. The 'ideal' trajectory requires the robot to stay at its position for 3 seconds, reach 2 meters ahead in the next 4 seconds, and then stay at the destination for 3 more seconds. MPC performs well but exhibits a delay throughout the trajectory, while doing the best job of stabilizing the system when it stops. The DDPG controller does the best job of following the exact trajectory given to it, and it also provides good stabilization when the system is at rest. The PPO controller, on the other hand, provides stability at the start and follows the trajectory decently, but does a relatively poor job of slowing down in the last 3 seconds. This can be observed in Figure 7.
Parameter  DDPG  PPO 
Sample Time  0.05  0.01 
Number of Episodes Trained  8578  9875 
Number of Steps Trained  1.7e+06  9.7e+06 
Total Time Taken for Training  32+hrs  35+hrs 
Maximum Reward Possible After Training  60  300 
Maximum Reward Achieved After Final Training Episode  55.47  235.12 
The positional data in Figure 8 show that all control strategies succeed in following the reference trajectory. MPC falls slightly behind while following the path, while the RL algorithms accurately follow the given positional waypoints. From the velocity curves in Figure 8, it is evident that DDPG and MPC go hand in hand with each other, except for fluctuations during stopping. PPO starts increasing its velocity directly, without first accelerating backwards slightly to build up forward momentum as the MPC and DDPG controllers do. All controllers achieve a maximum x velocity of 0.5 m/s. There is a much larger angular deviation at the stopping point for the DDPG controller, but it quickly stabilizes this deviation and returns to 0. The PPO controller shows a great amount of high-frequency angular-velocity fluctuation, indicating that the controller needs more training time and episodes to smooth out the noise.
Figure 9 shows the input torque provided by each controller to the system. The DDPG controller tracks the MPC closely, with disturbances at some points. PPO's input contains significant noise, and it follows a different control strategy from MPC and DDPG; the effects are clearly visible in the state outputs. The rotational power of the wheel can be expressed as P = τφ̇, where φ̇ is the wheel's angular velocity and τ is the input torque on the wheel. The rotational power of the wheel under DDPG, PPO and MPC control for the same trajectory is shown in Figure 10. The PPO controller pumps the most energy into the wheel, indicating that this control scheme may be quite ineffective compared with DDPG and MPC.
Our implementation of DDPG is on par with the predictive MPC controller. The DDPG-based controller not only follows the given trajectory well, but the wheel also draws less rotational power, even less than under MPC; this is due to the input penalty introduced in the reward function (Equation 31). The main limitation is seen in the PPO controller. PPO does a great job of maintaining a steady posture for the initial 3 seconds, but does not move off and stop as smoothly as the other two methods. A suspected reason is a lack of experience with the stopping portion of the task during training. Even with more training episodes and steps, PPO does not perform as well as the other two controllers; given the limited computational resources, the PPO network may not have been trained sufficiently to remove the noise from its inputs. PPO should work better than DDPG on paper, but this is not seen in our experiments; the way to improve PPO's results is more rigorous training on more computationally powerful machines with larger memory and greater processing speed.
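The power relation P = τφ̇ and the energy delivered to the wheel can be computed directly from logged torque and angular-velocity traces; a small sketch (the trapezoidal integration is an assumption about how the curves in Figure 10 would be post-processed):

```python
def rotational_power(torque, omega):
    """Instantaneous wheel power P = tau * omega."""
    return torque * omega

def wheel_energy(torques, omegas, dt):
    """Trapezoidal estimate of the total energy delivered to the wheel
    from uniformly sampled torque and angular-velocity traces."""
    p = [rotational_power(t, w) for t, w in zip(torques, omegas)]
    return sum(0.5 * (p[i] + p[i + 1]) * dt for i in range(len(p) - 1))
```

Integrating power this way over each controller's trace gives a single scalar for the energy comparison between DDPG, PPO and MPC.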
6 Conclusion
In this paper we discussed different RL-based control algorithms, namely PPO and DDPG, and implemented them on an Extendable Wheeled Inverted Pendulum (EWIP). The system defined is nonlinear. To control such nonlinear systems, the complicated functions that give the required inputs to the system based on the independent state variables can be represented as neural networks, and one can implement a pipeline in which these systems learn an optimal control function by self-improving through reward-based learning.
This article has compared two popular reinforcement learning algorithms to a model predictive control approach. The results indicate that both RL models perform well in balancing the robot and initially starting the traversal. DDPG succeeds in following the trajectory closely, but PPO struggles towards the end to stop the movement at the goal location, despite more training effort.
Future work includes exploring a two-wheeled extendable configuration to traverse various types of terrain, and exploring 'jumping' control on the present system. We would also explore more robust nonlinear control methods and other RL algorithms to better control more generalized trajectories.
7 Acknowledgements
The authors would like to express their gratitude to the Department of Applied Physics and the Department of Mechanical Engineering, Delhi Technological University, New Delhi, for promoting research in the field of robotics and the application of new AI techniques. This work would not have been completed without the support of the undergraduate students of the DTU Altair Laboratory, DTU, Delhi. The authors are thankful to every person directly or indirectly involved with the work who helped along the way.
References
 [1] C W Anderson. Learning to control an inverted pendulum using neural networks. IEEE Control Syst., 9(3):31–37, April 1989.
 [2] Khaled M Goher and Sulaiman O Fadlallah. Control of a two-wheeled machine with two-directions handling mechanism using PID and PD-FLC algorithms. Int. J. Autom. Comput., 16(4):511–533, August 2019.
 [3] Ming Yue, Cong An, and Jian-Zhong Sun. An efficient model predictive control for trajectory tracking of wheeled inverted pendulum vehicles with various physical constraints. Int. J. Control Autom. Syst., 16(1):265–274, February 2018.
 [4] Traiko Dinev, Songyan Xin, Wolfgang Merkt, Vladimir Ivan, and Sethu Vijayakumar. Modeling and control of a hybrid wheeled jumping robot. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, October 2020.
 [5] Zhijun Li, Chenguang Yang, and Liping Fan. Advanced control of wheeled inverted pendulum systems. Springer, London, England, 2013 edition, August 2014.
 [6] Phuong Nam Dao and YenChen Liu. Adaptive reinforcement learning strategy with sliding mode control for unknown and disturbed wheeled inverted pendulum. Int. J. Control Autom. Syst., 19(2):1139–1150, February 2021.
 [7] Hyun-Kyo Lim, Ju-Bong Kim, Chan-Myung Kim, Gyu-Young Hwang, Ho-Bin Choi, and Youn-Hee Han. Federated reinforcement learning for controlling multiple rotary inverted pendulums in edge computing environments. In 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC). IEEE, February 2020.
 [8] Yeonhoon Kim, Soo Hyun Kim, and Yoon Keun Kwak. Dynamic analysis of a nonholonomic two-wheeled inverted pendulum robot. J. Intell. Robot. Syst., 44(1):25–46, September 2005.
 [9] Tongrui Peng, Quanmin Zhu, Tokhi M. Osman, and Yufeng Yao. A study on PD-like fuzzy logic control based active noise control for narrowband noise cancellation with acoustic feedback and distance ratio. In Proceedings of the 11th International Conference on Modelling, Identification and Control (ICMIC2019), pages 87–98, Singapore, 2020. Springer Singapore.
 [10] Victor Klemm, Alessandro Morra, Ciro Salzmann, Florian Tschopp, Karen Bodie, Lionel Gulich, Nicola Kung, Dominik Mannhart, Corentin Pfister, Marcus Vierneisel, Florian Weber, Robin Deuber, and Roland Siegwart. Ascento: A Two-Wheeled jumping robot. In 2019 International Conference on Robotics and Automation (ICRA). IEEE, May 2019.
 [11] Camilo Andrés Manrique Escobar, Carmine Maria Pappalardo, and Domenico Guida. A parametric study of a deep reinforcement learning control system applied to the swing-up problem of the cart-pole. Appl. Sci. (Basel), 10(24):9013, December 2020.
 [12] Francesco Villecco and Arcangelo Pellegrino. Evaluation of uncertainties in the design process of complex mechanical systems. Entropy (Basel), 19(9):475, September 2017.
 [13] Richard S Sutton and Andrew G Barto. Reinforcement Learning. Adaptive Computation and Machine Learning series. Bradford Books, Cambridge, MA, 2 edition, November 2018.
 [14] Giovanni Pezzulo. Anticipation and futureoriented capabilities in natural and artificial cognition. In 50 Years of Artificial Intelligence, pages 257–270. Springer Berlin Heidelberg, Berlin, Heidelberg, 2007.
 [15] Jiaqi Xiang, Qingdong Li, Xiwang Dong, and Zhang Ren. Continuous control with deep reinforcement learning for mobile robot navigation. In 2019 Chinese Automation Congress (CAC). IEEE, November 2019.
 [16] Zhiang Zhang and Khee Poh Lam. Practical implementation and evaluation of deep reinforcement learning control for a radiant heating system. In Proceedings of the 5th Conference on Systems for Built Environments, New York, NY, USA, November 2018. ACM.
 [17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
 [18] Eivind Bohn, Erlend M Coates, Signe Moe, and Tor Arne Johansen. Deep reinforcement learning attitude control of fixed-wing UAVs using proximal policy optimization. In 2019 International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, June 2019.
 [19] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning, 2019.
 [20] Xin He and Fernando V Lima. A modified SQPbased model predictive control algorithm: Application to supercritical coalfired power plant cycling. Ind. Eng. Chem. Res., 59(35):15671–15681, September 2020.
 [21] B Kouvaritakis and M Cannon. Model Predictive Control: Classical, Robust and Stochastic. Springer, 2016.