Deep Reinforcement Learning for Time Optimal Velocity Control using Prior Knowledge

11/28/2018, by Gabriel Hartman, et al.

While autonomous navigation has recently gained great interest in the field of reinforcement learning, only a few works in this field have focused on the time optimal velocity control problem, i.e. controlling a vehicle such that it travels at the maximal stable speed. Achieving maximal stable speed is important in many situations, such as emergency vehicles traveling at high speeds to their destinations, and regular vehicles executing emergency maneuvers to avoid imminent collisions. Traditionally, time optimal velocity control is solved by numerical computations that are based on optimal control and vehicle dynamics. In this paper, we show that a deep reinforcement learning method for the time optimal velocity control problem outperforms a numerically derived solution. We propose a method for using the numerical solution to further improve the performance of the reinforcement learner, especially at early stages of learning. This result may contribute to the optimal control of robots in applications where some analytical model is available.


1 Introduction

The operation of autonomous vehicles requires the synergetic application of several critical technologies, such as sensing, motion planning, and control. This paper focuses on a subset of the motion planning problem: time optimal velocity control along a known path. Driving at the time optimal velocity yields the shortest travel time possible along the path; it reflects the vehicle dynamics and its acceleration and braking capabilities. If at any time during the motion the vehicle exceeds the optimal speed, it will lose control, that is, turn over or slip off the desired path.

Driving at the time optimal velocity is important when motion time is critical, e.g. for emergency vehicles such as ambulances and firefighting vehicles attempting to reach their destinations at high speed, and for ordinary vehicles attempting to execute emergency maneuvers to avoid collisions. In such cases, it is necessary to know the vehicle’s dynamic capabilities, or its performance envelope, to ensure that the emergency maneuver is dynamically feasible. Even in non-extreme driving scenarios, the optimal velocity serves as an upper bound on any vehicle response. This is even more important when traveling off-road: in such situations the velocity is influenced by rough terrain and sharp curves, hence the maximum velocity can be extremely low and is more challenging to determine.

The problem of velocity control (not necessarily optimal) has been studied for decades but is still not considered solved; autonomous vehicles still lag behind human-driven vehicles [Fingas2018]. The velocity selection is influenced by many factors, for example, the mechanical properties, the interaction between the wheels and the ground, the current maneuver and the terrain. All these and additional factors influence the vehicle's maximum speed at each moment. Most practical solutions and studies do not take all of these factors into account; instead, they simplify the problem and do not attempt to reach high velocities.

The velocity profile along a path (i.e. the velocity at each point along the path) can be determined by several methods. The simplest is the trapezoidal velocity profile, in which the vehicle accelerates at a constant acceleration, then cruises at a constant speed and finally decelerates at a constant deceleration to a stop at the end of the path. To ensure the vehicle’s dynamic stability, the cruising velocity must be lower than the lowest value of the upper velocity limit over the entire path. Therefore, this method yields a relatively low average speed and hence results in poor performance in terms of motion time. A more promising approach reduces the vehicle’s speed when the vehicle approaches a sharp curve and increases it along straight path segments [Likhachev and Ferguson2009].
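As a concrete illustration of the trapezoidal approach, the following is a minimal sketch; the function and parameter names are ours, and a simple kinematic model is assumed rather than any specific vehicle dynamics. The cruise speed must be chosen below the lowest velocity limit along the path.

```python
import numpy as np

def trapezoidal_profile(path_length, v_cruise, accel, decel, ds=0.5):
    """Velocity at evenly spaced stations along a path: accelerate at a
    constant rate to v_cruise, cruise, then decelerate to a stop at the end.
    v_cruise must not exceed the lowest velocity limit along the path."""
    s = np.arange(0.0, path_length + ds, ds)
    v_acc = np.sqrt(2.0 * accel * s)                                  # accelerating from rest
    v_dec = np.sqrt(2.0 * decel * np.maximum(path_length - s, 0.0))   # braking to a stop
    return np.minimum(np.minimum(v_acc, v_dec), v_cruise)

# Example (illustrative numbers): 200 m path, 15 m/s cruise, 2 m/s^2 accel, 3 m/s^2 decel.
profile = trapezoidal_profile(200.0, 15.0, 2.0, 3.0)
```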

While several works have focused on the velocity control problem, the time optimal velocity control problem has drawn less attention. An efficient method for computing the time optimal velocity profile along a given path using dynamics-based numerical planning is presented in [Mann and Shiller2008]. It takes into account the vehicle’s dynamic capabilities while preventing overturning and slipping. This method is based on an efficient approach for computing the time optimal motions of robotic manipulators that move along specified paths [Shiller and Lu1992].

More recently, several reinforcement learning based methods have emerged for autonomous vehicle navigation [Bojarski et al.2016, Lefèvre, Carvalho, and Borrelli2016, Michels, Saxena, and Ng2005]. Although the purpose of reinforcement learning is to search for an optimal policy, most of these works do not focus on time optimal velocity control. We propose a policy based reinforcement learning method for solving the time optimal problem and compare the results to a direct numerical solution.

Reinforcement learning based methods have two main advantages over direct planning. The first is that reinforcement learning based methods do not require the development of a precise dynamic model of the vehicle that takes into account all its dynamic characteristics, as the direct optimization methods do, but may learn everything on the go. Moreover, learning based methods may eventually achieve better velocity control than direct optimization, because the complexity of hand-developed models is limited and such models cannot always adequately describe real-world problems. The second advantage is that reinforcement learning removes the need to perform complex tuning of the dynamics formulas for each type of vehicle and road separately, whereas direct planning methods require the development of a new model for every problem.

On the other hand, direct optimization has a major advantage over reinforcement learning since the latter requires a large number of iterations in order to achieve convergence to the desired policy, and in the case of complex tasks, convergence is not always achieved. In contrast, direct optimization methods produce the solution immediately, and complex problems can be solved by developing simplified models that will yield a sub-optimal solution.

In this paper, we also propose a method that integrates reinforcement learning and direct optimization for solving the time optimal velocity problem. We show that the synergy between these two methods outperforms each one separately. Our proposed method does not require any access to the internal structure of the direct planner, but only to the output of a dynamics based planning controller; that is, this controller can be treated as a black box.

Our proposed method is based upon two concepts for improving the performance of a reinforcement learning agent. The first is adding the action computed by the dynamics based controller as another feature to the reinforcement learner. The second is adding the dynamics based controller's action to the action taken by the agent, that is, the agent’s action acts as a delta from the action of the dynamics based method. We show that this novel solution results in a significant improvement to the reinforcement learner, especially at early stages of learning. We further show that both concepts have a positive effect on the agent’s performance, and that the agent performs best when applying both concepts.

To summarize, the main contributions of this paper are: (i) introducing a deep reinforcement learning based method for solving the time optimal velocity problem, and showing that this method outperforms a dynamics based numerical planning method; (ii) developing a method for using prior knowledge obtained from a dynamics based optimization method in a reinforcement learner, and showing that this method outperforms our purely learning based method.

2 Related Work

There are several approaches to velocity control using reinforcement learning, each of which solves a differently defined problem. For example, Stavens and Thrun [Stavens and Thrun2012] developed a “self-supervised” method for velocity selection in off-road terrain using the input of an inertial measurement unit; the goal is to protect the vehicle from excessive vibrations while achieving a higher, though not time optimal, velocity. Rosolia and Borrelli [Rosolia and Borrelli2018] present a model predictive controller that improves its performance based on previous iterations and test it on a racing car. This method is effective for repetitive tasks, in which the initial state is the same at each iteration; hence it cannot control a vehicle on different paths. Other approaches use machine learning based methods as a preprocessing stage only. Drews et al. [Drews et al.2017] use a convolutional neural network to create the input for a predictive controller to achieve aggressive driving capabilities; the convolutional neural network is used as a preprocessing stage to create a cost map, while the vehicle itself is controlled by a model predictive controller.

There exist many methods for using additional information related to the problem being solved. Imitation learning uses external demonstrations to teach an agent to perform a given task. For example, in [Bojarski et al.2016] an agent is trained by human demonstrations to control the steering of a vehicle. Aşık et al. [Aşık et al.2018] use imitation learning to control a vehicle using low-cost sensors. Lefèvre et al. [Lefèvre, Carvalho, and Borrelli2016] learn driving style from driver demonstrations. Imitation learning itself cannot outperform the demonstrator’s performance, but it is possible to start the learning process by initializing the policy with imitation learning and then further improve the policy by reinforcement learning [Sendonaris and Dulac-Arnold2017, Silver et al.2016]. These methods in fact subdivide the learning process into two different phases and do not actually use the demonstrations to improve the reinforcement learning process itself. A demonstrator for imitation learning is usually not available during the entire training process, but when another controller exists (assuming its computational cost is low), it can be integrated in an inseparable form, which enables approaches such as the methods proposed in this paper.

Another well-studied approach is encoding the knowledge in the reward function, known as reward shaping [Ng, Harada, and Russell1999], but this may be non-trivial in many tasks. Another approach to using additional knowledge to speed up learning is transfer learning [Pan and Yang2010]; however, this method is useful when another trained model exists, not for directly computed results. Martínez-Marín et al. [Martínez-Marín et al.2005] use an optimal control technique to decide whether to update the learned model; however, they do not use reinforcement learning or try to learn the optimal velocity. Achiam et al. [Achiam et al.2017] introduce a safe exploration learning process defined by constraints. Such an approach is not suitable for our problem, because we do not have an upper or lower bound on the velocity that could define a constraint. Furthermore, to achieve the true time optimal velocity, a constraint on the upper limit of the velocity may reduce the learned velocity.

3 Problem Definition

3.1 Environment

We denote the path $P = (p_1, \dots, p_n)$, where $p_i = (x_i, y_i)$ are positions on the plane and $d(p_i, p_{i+1}) \le \Delta$ for all $i$. The vehicle has position $p_{veh}$, velocity of its center of mass in the forward direction $v$, yaw angle $\psi$ (rotation around the z axis) and roll angle $\phi$. The vehicle is controlled by a steering command $\delta$ and by a torque command $\tau \in [-1, 1]$, the normalized torque applied to each wheel. We use $d(\cdot,\cdot)$ to denote the Euclidean distance between two positions, i.e. $d(p, q) = \lVert p - q \rVert$. The closest position on the path to the vehicle is denoted by $p_c = \arg\min_{p_i \in P} d(p_i, p_{veh})$, assuming w.l.o.g. that $p_c$ is uniquely defined. $s_c$ denotes the distance from $p_1$ to $p_c$ along the path, i.e. $s_c = \sum_{i=1}^{c-1} d(p_i, p_{i+1})$.

The steering control of the vehicle is performed by an analytic path-following controller (pure pursuit [Snider and others2009]), which causes the vehicle to stay close to the predetermined path $P$, with a maximum deviation of $d_{max}$. Formally: $d(p_{veh}, p_c) \le d_{max}$ at all times. The maximum distance between consecutive points on $P$ is assumed to be small (i.e. much smaller than $d_{max}$), hence the vehicle stays within a band of width $2 d_{max}$ with center $P$.

$P_h$ is a limited segment of the predetermined path ahead of the vehicle, $P_h = (p_c, p_{c+1}, \dots, p_{c+N})$, where $N$ is the predefined number of points. Figure 1 summarizes these definitions. The state of the system includes all the information needed, e.g. the vehicle's velocity and the path. The system evolves in time according to the following discrete-time function: $s_{t+1} = f(s_t, a_t)$, where $f$ is an unknown, possibly stochastic, function.

A policy is a function that returns an action given a state: $a_t = \pi(s_t)$.

A state at time $t$ is considered “stable” if: (1) The vehicle is not too far away from $P$, that is, $d(p_{veh}, p_c) \le d_{max}$; a larger distance is an indication of slipping, because this distance can exceed $d_{max}$ only if the vehicle has slipped.

(2) The vehicle did not turn over, that is, $|\phi| \le \phi_{max}$, where $\phi_{max}$ is the predetermined roll angle limit of the vehicle.

A state $s_t$ is considered “asymptotically stable” if it is possible to control the vehicle from state $s_t$ such that it remains in stable states. Formally: there exists a policy $\pi$ such that every state reached from $s_t$ under $\pi$ is a stable state.
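To make the two stability conditions concrete, the following is a small sketch of how they could be checked; `D_MAX` and `PHI_MAX` correspond to the deviation and roll limits above, with default values assumed from the experimental settings reported later (2 [m] and 4 degrees), and the array-based path representation is our own choice.

```python
import numpy as np

D_MAX = 2.0                  # [m]   maximum allowed deviation from the path (assumed)
PHI_MAX = np.radians(4.0)    # [rad] maximum allowed roll angle (assumed)

def closest_point_index(path_xy, vehicle_xy):
    """Index of the path point closest to the vehicle (p_c)."""
    dists = np.linalg.norm(path_xy - vehicle_xy, axis=1)
    return int(np.argmin(dists))

def is_stable(path_xy, vehicle_xy, roll_angle):
    """True if the vehicle has neither slipped off the path nor rolled over."""
    c = closest_point_index(path_xy, vehicle_xy)
    not_slipping = np.linalg.norm(path_xy[c] - vehicle_xy) <= D_MAX
    not_rolled_over = abs(roll_angle) <= PHI_MAX
    return not_slipping and not_rolled_over
```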

Figure 1: A vehicle with current velocity $v$ follows the path $P$; at each moment only the segment $P_h$ is visible. The closest point on $P$ to the vehicle is $p_c$. The vehicle tries to stay closer than $d_{max}$ to the path.

3.2 The Time Optimal Velocity Problem

For every path $P$, a vehicle with some initial velocity $v_0$ and an initial position near one of the points of the path, and a traveling time $T$, a time optimal policy outputs at every time $t$ the action $a_t$ (the torque command $\tau_t$) such that the vehicle maximizes the travel distance along the path while preserving stability in all states during the motion. Formally: let $s_0$ be the initial state and $s_T$ the state after $T$ time steps, where the state transitions are according to $f$, and assume that $P$ is long enough. The time optimal policy maximizes the distance traveled along the path between $s_0$ and $s_T$, i.e. $s_c(T) - s_c(0)$, subject to every state during the motion being an asymptotically stable state.

The time optimal velocity on path $P$ is the velocity profile produced by traveling on $P$ when the vehicle is controlled by the optimal policy.

3.3 Information Required For the State Space

Although all the information about the vehicle’s state (e.g. position, linear and rotational velocity and acceleration, individual wheel velocities, steering angle) may be available, more information increases the complexity of the problem. In this paper it is assumed that the velocity and the path relative to the vehicle are sufficient to control the vehicle at the optimal velocity. The reinforcement learning agent can implicitly infer additional information because the behavior of the vehicle on the given path is predictable. For example, the rotational speed of the vehicle depends only on the velocity and the steering angle, and the steering angle is computed from the path and the velocity; therefore, given the linear velocity and the path, the rotational velocity can also be estimated. A similar trade-off applies to the path: according to the problem definition the whole path is predefined, but while driving a car the whole path is usually not observed, and only a limited segment ahead of the vehicle can be seen. Ideally, the visible segment $P_h$ would be exactly long enough to decide whether the state is asymptotically stable and what action must be taken in order to achieve the time optimal velocity; a longer segment would provide unnecessary information that increases the complexity of the problem. It is assumed that if $P_h$ is slightly longer than the distance needed to stop the vehicle within it, it is also possible to decide from it what the optimal action is; this can be proved under some assumptions on the environment, and the influence of a slightly too short $P_h$ on performance is expected to be negligible. Not all the information in the original path segment is needed, hence $P_h$ is down-sampled to reduce the dimensionality of the state. In addition, every $P_h$ is transformed to a local reference frame whose origin and direction match those of the vehicle: the vehicle’s position is subtracted from each point of $P_h$, and the points are then rotated around the z axis by $-\psi$, where $\psi$ is the yaw angle of the vehicle. Because the state contains the path relative to the vehicle, there is no need to know the absolute position of the vehicle, which dramatically reduces the complexity caused by path diversity.
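A minimal sketch of the preprocessing described above: down-sampling the visible segment and expressing it in the vehicle's local frame. The resampling by arc length and the parameter names are assumptions, not the exact implementation.

```python
import numpy as np

def path_to_local_frame(path_xy, vehicle_xy, vehicle_yaw, n_points=25, spacing=1.0):
    """Down-sample the visible path segment and express it in the vehicle frame."""
    # Translate so the vehicle is at the origin.
    shifted = path_xy - vehicle_xy

    # Rotate by -yaw so the vehicle's heading becomes the local x axis.
    c, s = np.cos(-vehicle_yaw), np.sin(-vehicle_yaw)
    rot = np.array([[c, -s],
                    [s,  c]])
    local = shifted @ rot.T

    # Down-sample: keep points at (roughly) fixed arc-length spacing.
    seg_len = np.linalg.norm(np.diff(local, axis=0), axis=1)
    arc = np.concatenate([[0.0], np.cumsum(seg_len)])
    targets = np.arange(n_points) * spacing
    xs = np.interp(targets, arc, local[:, 0])
    ys = np.interp(targets, arc, local[:, 1])
    return np.stack([xs, ys], axis=1)   # shape (n_points, 2)
```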

4 Our Method

4.1 Learning Method

Our basic reinforcement learner is based upon “deep deterministic policy gradient” (DDPG) [Lillicrap et al.2015], which we adapt to the time optimal velocity control problem. We term this reinforcement learning based velocity optimization REVO. DDPG is an actor-critic, model-free algorithm for a continuous action space $A$ and a continuous state space $S$. Under the assumption that the state is fully observed, the agent receives a reward $r(s_t, a_t)$ when being at state $s_t$ and taking action $a_t$. The transition function $p(s_{t+1} \mid s_t, a_t)$ is the probability of ending at $s_{t+1}$ when being at $s_t$ and taking action $a_t$. The goal of the DDPG algorithm is to learn a deterministic policy $\pi: S \rightarrow A$ (represented as a neural network) which maximizes the return from the start of the episode, $R = \sum_{t} \gamma^t r(s_t, a_t)$, where $\gamma$ is the discount factor. DDPG learns the policy using the deterministic policy gradient. Exploration of the environment is done by adding exploration noise to the action.

The goal is to learn a policy that achieves the time optimal velocity on each path, hence the paths used for training need to be as diverse as possible. Only paths that are feasible for the vehicle are considered, that is, we avoid paths so curved that the vehicle cannot follow them at any velocity due to steering limitations. Each path is generated randomly in the following way: until the desired path length is reached, randomly choose a segment length and a segment curvature, and connect these segments while changing the curvature at a constant rate (see the sketch below). In this way a random path that respects the vehicle's steering capabilities is created.
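The following sketch illustrates this kind of path generator; the numeric limits on segment length and curvature are illustrative assumptions, not the values used in the experiments.

```python
import numpy as np

def random_path(total_length=200.0, step=1.0,
                max_curvature=0.1, max_segment=40.0, seed=None):
    """Generate a random feasible path by chaining segments of random length
    and target curvature, ramping the curvature at a constant rate."""
    rng = np.random.default_rng(seed)
    x, y, heading, kappa = 0.0, 0.0, 0.0, 0.0
    pts = [(x, y)]
    travelled = 0.0
    while travelled < total_length:
        seg_len = rng.uniform(step, max_segment)
        kappa_target = rng.uniform(-max_curvature, max_curvature)
        n_steps = max(1, int(seg_len / step))
        dk = (kappa_target - kappa) / n_steps       # constant curvature change rate
        for _ in range(n_steps):
            kappa += dk
            heading += kappa * step                 # integrate heading along arc length
            x += np.cos(heading) * step
            y += np.sin(heading) * step
            pts.append((x, y))
            travelled += step
            if travelled >= total_length:
                break
    return np.array(pts)
```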

The reinforcement learner is trained on episodes of constant duration $T$. If an unstable state is reached, the episode is terminated. At the start time $t = 0$ the vehicle is located at the beginning of the path ($p_{veh} = p_1$) and faces it (the yaw angle is in the direction of $p_2 - p_1$), with a velocity of zero ($v = 0$). The state of the system is defined as $s_t = (v_t, \hat{P}_h)$, where $\hat{P}_h$ is a down-sampled version of $P_h$. If the vehicle is stable and has a positive velocity, the reward is proportional to the vehicle’s velocity ($r_t \propto v_t$); if the vehicle encounters an unstable state it obtains a negative reward. That is, $r_t \propto v_t$ if $s_t$ is a stable state and $v_t > 0$, and $r_t < 0$ otherwise. At each time step the action is determined as $a_t = \pi(s_t) + \epsilon_t$, where $\epsilon_t$ is the exploration noise.
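A hedged sketch of the reward and episode loop described above; `env` and `agent` are hypothetical interfaces, and the reward scale and failure penalty are assumptions rather than the exact values used in the experiments.

```python
import numpy as np

REWARD_SCALE = 1.0    # assumed proportionality constant
FAIL_REWARD = -1.0    # assumed penalty for an unstable state

def reward(velocity, stable):
    if stable and velocity > 0.0:
        return REWARD_SCALE * velocity   # proportional to forward speed
    return FAIL_REWARD                   # unstable: slipped or rolled over

def run_episode(env, agent, max_steps=100, noise_std=0.1):
    state = env.reset()                  # vehicle at p_1 with v = 0
    for _ in range(max_steps):
        action = agent.act(state) + np.random.normal(0.0, noise_std)  # exploration noise
        next_state, velocity, stable = env.step(np.clip(action, -1.0, 1.0))
        r = reward(velocity, stable)
        agent.observe(state, action, r, next_state, done=not stable)
        if not stable:
            break                        # terminate the episode on an unstable state
        state = next_state
```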

The next section describes another approach, based on an analytically derived model.

4.2 Dynamics Based Numerical Planning

The time optimal velocity problem can also be solved by direct optimization methods. We implemented a model predictive controller that is based on a dynamic model of the vehicle and uses numerical planning. This controller is based upon the algorithm described in [Mann and Shiller2008, Shiller and Gwo1991] and is termed VOD. The performance of this controller is used to evaluate the results of the learning based optimization (REVO); in addition, the actions output by this controller are used to improve the learning process itself.

Given a path $P$ and a vehicle, this method provides an optimal solution for the time optimal velocity problem under several assumptions: 1) the dynamics of the system are deterministic; 2) the vehicle is exactly on the path during the motion, i.e. $d(p_{veh}, p_c) = 0$; 3) the vehicle model has no suspension; 4) the geometric dimensions, mass and wheel torque of the vehicle, the friction coefficient and the maximum velocity are known.

The forces acting on the vehicle are inertial forces (e.g. centrifugal force), gravitation, and the force produced by the torque applied to the wheels. These forces must be limited at every point during the motion to ensure that all states are stable, i.e. that the forces do not overcome the surface grip or the roll stability of the vehicle.

Since the position and orientation of the vehicle depend only on the geometry of the path (by the assumptions above), it is possible to compute the forces applied to the vehicle at a point along the path given the velocity of the vehicle at that point. The ”velocity limit curve” is the set of velocity limits of all points along the path, that is, the highest velocity at each point that produces the maximum forces the vehicle can handle without becoming unstable.

Similarly, if the velocity of the vehicle at some point along the path is known, the maximum possible acceleration (positive and negative) of the vehicle in the forward direction can be computed.

”Bang-bang” control provides the optimal control for this problem; that is, at every moment during the motion either the maximum or the minimum acceleration must be applied to the vehicle, never an intermediate value. At the start point of the path $p_1$ the initial velocity is $v_0$, and the final velocity at the end of the path ($p_n$) is defined as $v_f$. The velocity profile is computed numerically by integrating the maximum accelerations and decelerations in such a way that the velocity limit curve is never crossed.

An example of computing the velocity profile using the described method is shown in Figure 2. For the given path, the velocity at the start point A is 0, and the goal is to reach the end point D, also with velocity 0, as fast as possible. To achieve this, the vehicle needs to accelerate at the maximum allowed acceleration until point B, where it must decelerate at the maximum deceleration to avoid crossing the velocity limit at point C. Approaching point D, the vehicle must again decelerate so that it stops at the goal.

This algorithm is used to control the vehicle. At every time step the optimal velocity profile is computed on the limited-horizon path segment $P_h$ (as defined in the problem definition), with the current vehicle velocity as the initial velocity. To ensure that it will be possible to stop at the end of this path segment, the target velocity at the endpoint of $P_h$ is set to zero. The acceleration at the first step of the computed velocity profile is used as the action output of the controller; a sketch of this computation is given below. Since the computation is repeated every time step and the path segment is long enough, the vehicle is not affected by the zero target velocity.
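A simplified sketch of this receding-horizon planning step, assuming constant acceleration limits per sample instead of the full dynamics-based limits used by VOD.

```python
import numpy as np

def velocity_profile(v_limit, ds, a_max, d_max, v_init=0.0, v_final=0.0):
    """Velocity profile along a discretized path segment.

    Forward pass: integrate the maximum acceleration from v_init.
    Backward pass: integrate the maximum deceleration so the vehicle can
    brake to v_final at the end of the segment. The result is the pointwise
    minimum of both passes and the velocity limit curve.
    """
    v = np.asarray(v_limit, dtype=float).copy()

    v[0] = min(v[0], v_init)
    for i in range(1, len(v)):                       # forward (acceleration) pass
        v[i] = min(v[i], np.sqrt(v[i - 1] ** 2 + 2.0 * a_max * ds))

    v[-1] = min(v[-1], v_final)
    for i in range(len(v) - 2, -1, -1):              # backward (braking) pass
        v[i] = min(v[i], np.sqrt(v[i + 1] ** 2 + 2.0 * d_max * ds))
    return v

def vod_action(profile, v_current, dt):
    """Acceleration command implied by the first step of the planned profile."""
    return (profile[1] - v_current) / dt
```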

Figure 2: (a) A path segment. (b) The directly computed velocity profile (red) and the velocity limit curve (black). At a highly curved point on the path, the velocity limit is low. The velocity never crosses the velocity limit curve.

4.3 Using Direct Computation to Enhance Reinforcement Learning

We propose a method to combine VOD, the direct velocity optimization controller, and REVO, the reinforcement learning based controller. The method is based upon two concepts. The first combines VOD and REVO through the actions and is termed REVO+A. The value of the action output by the direct planner, $a_{VOD}$, is added to the value of the action output by the reinforcement learner, $a_{RL}$, such that the action output of the REVO+A controller is $a_t = a_{RL} + a_{VOD}$. In this form, at the beginning of the learning process the vehicle follows the actions of the direct optimization (VOD), that is, $a_t \approx a_{VOD}$ at each time $t$, because with the initial policy the learner's actions are close to zero. The policy of the VOD controller is a good initial policy. This simplifies the problem for the reinforcement learning agent, which only needs to learn how much to deviate from the directly computed optimal velocity and is not required to start from scratch.
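A minimal sketch of the REVO+A action combination; the `policy` interface and the clipping range are assumptions for illustration.

```python
import numpy as np

def revo_a_action(policy, state, vod_action, noise_std=0.1):
    """REVO+A: the learner's output is a delta added to the VOD action."""
    a_rl = policy(state)                                 # near zero early in training
    eps = np.random.normal(0.0, noise_std)               # exploration noise
    return np.clip(a_rl + vod_action + eps, -1.0, 1.0)   # normalized torque command
```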

The second concept for combining REVO and VOD combines them through the state space and is termed REVO+F. In this method the action output $a_{VOD}$ of the VOD controller is added as an additional feature to the state, that is, $s_t = (v_t, \hat{P}_h, a_{VOD})$. An intuitive justification is that the reinforcement learner has the information about $a_{VOD}$, hence it can decide to act exactly like VOD, or to ”understand” when to act like it and when it is better to deviate from these actions. In the results section it will be shown that in a certain case the REVO+F learning process is faster, and it can be seen that the reinforcement learner acts according to $a_{VOD}$. A further improvement above REVO+A and REVO+F is obtained by applying the two concepts together (REVO+FA). A graphic summary of the methods is illustrated in Figure 3.
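A sketch combining both concepts (REVO+FA); the state layout and clipping range are again assumptions.

```python
import numpy as np

def revo_fa_step(policy, base_state, vod_action, noise_std=0.1):
    """REVO+FA: use the VOD action both as an extra state feature (REVO+F)
    and as an additive baseline for the learner's action (REVO+A)."""
    state = np.concatenate([np.asarray(base_state, dtype=float), [vod_action]])
    a_rl = policy(state)
    eps = np.random.normal(0.0, noise_std)
    action = np.clip(a_rl + vod_action + eps, -1.0, 1.0)
    return state, action
```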

Figure 3: $\pi$ denotes a policy, $a$ an action, and $s$ the state. (a) VOD: direct planning. (b) REVO: DDPG based learning. (c) REVO+A: combining the actions of VOD and REVO. (d) REVO+F: adding the action output of VOD as a feature in the state for REVO. (e) REVO+FA: combining the actions and adding the VOD action as a feature.

5 Experimental Results

5.1 Settings

A simulation of a four-wheel vehicle was developed using “Unity” software. A video of the vehicle driving along a path at the time optimal velocity is included in the supplementary material. The vehicle properties were set to a width of 2.08 [m], a height of 1.9 [m], a length of 5.1 [m], a center of mass at a height of 0.94 [m], a mass of 3200 [Kg], and a total force produced by all wheels of 21 [KN]. The maximum velocity of the vehicle is 30 [m/s] (108 km/h). The maximum acceleration of the vehicle is 6.5 [m/s²] (0-100 km/h in 4.3 seconds). Acceleration and deceleration are applied to all four wheels (4x4), and steering is applied to the front wheels (Ackermann steering). The friction coefficient between the wheels and the ground was set to 5.

The paths are generated randomly such that continuous, feasible paths are obtained, that is, each path has a limited maximum curvature and a constant curvature change rate, as described in the ”Our Method” section.

As for the specific settings of the reinforcement learning agents, a state consists of 25 points of the path in front of the vehicle, with a distance of 1 [m] between consecutive points. The state was normalized to a fixed range. The reward function was defined as described in the ”Learning Method” section: proportional to the vehicle's velocity in stable states, with a fixed negative reward in unstable states.

A state is considered unstable if the roll angle of the vehicle exceeds 4 degrees ($\phi_{max} = 4°$), or if the vehicle is more than 2 [m] ($d_{max} = 2$ [m]) away from the path (an indication of slipping).

All the hyper-parameters of the reinforcement learning algorithm (e.g. the neural network architecture and learning rates) were set as described in [Lillicrap et al.2015].

5.2 Experiment Description

The goal of the experiment is to compare the five discussed methods: the direct, numerically based method (VOD), the DDPG based learning method (REVO), the combination of VOD and REVO through the actions (REVO+A), the combination of them through the state (REVO+F), and the combination of REVO+A and REVO+F (REVO+FA). Every learning process includes 500 episodes, each limited to 100 time steps; a time step is 0.2 seconds. The policy updates are synchronized with the simulation time steps, with two updates per step. After every 10 episodes, an evaluation process is performed on 5 fixed random paths. During this process, the exploration noise is disabled so that the policy can be evaluated accurately.

Every episode has a fixed time limit of 20 seconds (0.2 seconds per step and 100 steps per episode); the goal is for the vehicle to make the greatest possible progress along the path in this time. The progress the vehicle achieves along the path in an episode is denoted $d_{ep}$, and the distance achieved by the VOD controller is denoted $d_{VOD}$; the normalized progress is $d_{ep} / d_{VOD}$. If the vehicle fails, the distance achieved is irrelevant (even the progress until the failure is not informative, because each path segment is independent of the rest of the path; hence, under the same policy, the distance traveled until the failure depends only on the path). Therefore the learning process for each method is described by the normalized progress and the failure rate as a function of policy updates. The failure rate is a running average over 60 evaluation episodes; see the sketch below.
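A short sketch of how these two evaluation quantities could be computed; the function names are ours.

```python
import numpy as np

def normalized_progress(distance_agent, distance_vod):
    """Progress along the path normalized by the VOD controller's progress."""
    return distance_agent / distance_vod

def failure_rate(failures, window=60):
    """Running average of episode failures (0/1) over the last `window`
    evaluation episodes, as used for the training curves."""
    recent = np.asarray(failures, dtype=float)[-window:]
    return float(recent.mean()) if recent.size else 0.0
```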

5.3 Results

Figure 4 shows one of the paths used for evaluation during the training process and the velocity along it during 20 seconds. It can be seen that the velocity profile learned by REVO+FA is higher than the dynamics based planned velocity.

Figure 4: (a) One of the random paths used for evaluation and (b) the dynamics based velocity profile (VOD) and a learned velocity profile (REVO+FA).
Figure 5: Normalized progress in 20 seconds along 5 test paths, during training
Figure 6: Failure rate during training
Figure 7: A comparison between the learning progress of REVO (a) and REVO+F (b), each shown together with the VOD velocity. In episode 5, both methods accelerate until a roll-over occurs. In episode 155, REVO+F has started to imitate the VOD controller, while REVO shows very little progress. In episode 205, REVO+F shows almost full imitation of the VOD velocity, while REVO is still in its initial stages of learning. In episode 405, the REVO+F actions result in a higher velocity than VOD.

Figure 5 presents the normalized progress on the path in successful episodes (i.e. unsuccessful episodes were filtered out of this graph). All results are normalized with respect to VOD, hence VOD is a horizontal line at 1. The reinforcement learning method (REVO) reaches the VOD performance after about 6500 training iterations, while REVO+FA and REVO+A are always above the VOD performance (on the successful episodes). A slight advantage of REVO+FA over REVO+A can be seen. REVO+F shows worse performance than REVO. All methods eventually outperform the VOD controller by about 15%.

Figure 6 shows the percentage of failed episodes during learning. As depicted in the figure, the failure rate of REVO+A and REVO+FA is below the failure rate of REVO during the whole training. The conclusion from these results is that the training process is faster when the model-based controller (VOD) is combined with the reinforcement learning controller (REVO) through the actions.

Figure 6 also shows that after the average failure rate decreases, it increases again. A possible explanation is that the velocity gets closer to the handling limits of the vehicle, and when driving near the limit the probability of failure is much higher. Therefore, in reality, it is better to avoid such close-to-limit velocities and to scale the velocity down by some factor. But if this limit is unknown, it is also unknown how far the velocity is from the limit, which may be dangerous.

5.4 Closer Look at REVO+F

Before we conclude this section, we would like to take a closer look at how adding the VOD output $a_{VOD}$ as a feature to the state (REVO+F) influences the learning process. When running the learning process on the same path (instead of on random paths), it is possible to closely track the policy improvement. In this case, it can be seen in Figure 7 that after some training the learned policy acts like the VOD controller, while the regular learning process (REVO) is still not able to complete this path.

6 Conclusions & Future Work

Current methods for time optimal velocity control are usually not data driven but are analytically derived based upon vehicle dynamics. Previous works on reinforcement learning methods for autonomous navigation usually do not try to achieve time optimal velocity control. In this paper, we explore the problem of time optimal velocity control of a simulated vehicle. We introduce the use of deep reinforcement learning to solve this problem and show that it outperforms an analytically derived solution.

While several methods for using prior knowledge exist, none of them has shown success in our domain. We therefore propose a method, REVO+FA, which is based upon two novel concepts for improving the performance of a reinforcement learning agent. The first is adding the output of the dynamics based method as another feature to the reinforcement learner. The second is adding the dynamics based method's action to the action taken by the agent, that is, the agent’s action acts as a delta from the dynamics based method. We show that this novel solution results in a significant improvement to the reinforcement learner, especially at early stages of learning. We further show that the agent performs best when applying both concepts.

This method may have a great impact on robot agents in domains in which a controller that can take reasonable actions already exists. Future work will include the deployment of REVO+FA in an actual (small) autonomous vehicle. Aside from the engineering challenges that we will face, such a deployment will also pose research challenges such as the need for transfer learning [Pan and Yang2010], in which the agent must transfer knowledge gathered in simulation to the actual vehicle. While the reinforcement learning agent developed in simulation will not work as is in an actual vehicle, the advantage of REVO+FA (over a standard reinforcement learner) may become more pronounced when interacting with an actual autonomous vehicle, as such interaction requires a much longer duration, and hence starting with a completely random reinforcement learner becomes impractical.

References