Deep reinforcement learning for guidewire navigation in coronary artery phantom

10/05/2021 · by Jihoon Kweon, et al.

In percutaneous intervention for the treatment of coronary plaques, guidewire navigation is a primary procedure for stent delivery. Steering a flexible guidewire within coronary arteries requires considerable training, and the non-linearity between the control operation and the movement of the guidewire makes precise manipulation difficult. Here, we introduce a deep reinforcement learning (RL) framework for autonomous guidewire navigation in robot-assisted coronary intervention. Using Rainbow, a segment-wise learning approach is applied to determine how best to accelerate training using human demonstrations with deep Q-learning from demonstrations (DQfD), transfer learning, and weight initialization. The 'state' for RL is customized as a focus window near the guidewire tip, and subgoals are placed to mitigate the sparse reward problem. The RL agent improves its performance, eventually enabling the guidewire to reach all valid targets in the 'stable' phase. Our framework opens a new direction in the automation of robot-assisted intervention, providing guidance on RL in physical spaces involving mechanical fatigue.


1 Introduction

Coronary arteries are vessels that supply oxygen-rich blood and nutrients to the myocardium. When coronary arteries are obstructed, the heart muscle is not supplied with sufficient oxygen and energy, resulting in ischemia. Such ischemic heart disease is reported to be the leading cause of death, responsible for 16% of the world's total deaths.
Percutaneous coronary intervention (PCI) with balloon angioplasty and stent implantation is the standard treatment for coronary artery stenosis. The catheter provides access from the incision area to the coronary ostium, and the balloon and stent are delivered along the guidewire to the target location. Because the diameter of the coronary artery is relatively small (4 mm) (dodge1992lumen) and the distance between the operating controls and the distal end of the guidewire is long, a considerable amount of specialized training is required for precise manipulation of the interventional devices. Friction with the vessel wall deforms the flexible tip, impeding guidewire control, and the risk of perforation increases when treating severe lesions (guttmann2017prevalence). The non-linear relationship between the control motion applied to the guidewire and the movement of the distal end is an important feature that makes precise control of the device difficult.
Interventional robots for coronary diseases have been introduced to improve the manipulation of interventional devices while reducing irradiation (beyar2005concept; kiemeneij2008use). The safety and feasibility of interventional robots have been demonstrated by clinical studies (weisz2013safety; patel2020comparison), and their applications have widened to complex lesions such as multi-vessel disease and chronic total occlusion (hirai2020initial; mahmud2017demonstration). Integration of telecommunication systems with the robotic apparatus enables remote operation of robot-assisted PCI (patel2019long). In pandemic situations such as the current COVID-19 wave, robotic procedures have been proposed as a way to reduce the potential infection risk for medical staff and patients (attanasio2021autonomy; zemmar2020rise). With the adoption of artificial intelligence, interventional robots are expected to be further automated to minimize interference from human operators (sardar2019impact).

Reinforcement learning (RL) is an area of machine learning that trains an agent to achieve a goal by maximizing rewards, which are given according to the next state that results when an action is taken in the current state. Deep RL has been applied to various domains, such as Go and computer games, to surpass the world's best human players (silver2016mastering; mnih2015human), and its applications have expanded from software to control engines for real-world hardware (levine2016end; gandhi2017learning; pinto2017asymmetric). Considering that simple control operations are repeatedly performed to operate a robot-assisted intervention system, deep RL may be a solution that effectively alleviates the burden on human operators. Recent applications for autonomous control of interventional devices in phantom simulation support the potential applicability of deep RL (you2019automatic; behr2019deep; karstensen2020autonomous; chi2020collaborative; zhao2019cnn).
In this study, we propose a deep RL framework for autonomous guidewire navigation in robot-assisted coronary interventions. We focus on how to accelerate RL training to prevent mechanical fatigue of the guidewire due to repetitive movements. First, under the constraints of a discrete action space and training in a real-world setting, Rainbow (hessel2018rainbow) was applied (Figure 1a). Rainbow, which integrates Deep Q-Networks (DQN) (mnih2015human) with recent advancements in reinforcement learning (van2016deep; schaul2015prioritized; wang2016dueling; fortunato2017noisy; bellemare2017distributional), has demonstrated outstanding performance in real-world environments (church2020deep). Replay memory (mnih2015human), a representative off-policy method to enhance sample efficiency and increase training speed with quality data, was a key component in reducing the physical time requirement. To optimize the initial composition of the replay memory, human demonstrations with DQfD (hester2018deep) and weighted random action (WRA) were evaluated. Second, the state for the RL agent was customized with a focus window near the guidewire tip. Given that the movement of the guidewire per control command was about two orders of magnitude smaller than the travel distance of the navigation, the focus window allowed the RL agent to confine its input to the area with the most important information. Subgoals, like the dots in 'Pac-Man', guided the navigation toward the goal beyond the focus window and mitigated the sparse reward problem. Finally, segment-wise training was conducted, inspired by the concept of curriculum learning (graves2017automated). When expanding the navigation area, transfer learning was applied using the model from the previous training. Our framework was assessed in a two-dimensional (2D) coronary phantom for trainees and further validated in a three-dimensional (3D) coronary phantom with fluid flow.

Figure 1: (a) Interaction between the physical environment and the RL agent. Of the entire navigation area, only information around the guidewire tip is defined as the 'state', which is given as an input to the RL agent. The RL agent selects a control command as an 'action' maximizing the expected 'rewards' at the current 'state', and the selected control command is transferred to the robotic device to perform one of the control operations: forward/backward motion or rotation. While this process is repeated, sets of states, actions, rewards, and next states (transitions) are accumulated in the replay memory, and RL training is performed periodically using these transitions. (b) Reward design of reinforcement learning for guidewire navigation.

2 Materials and Methods

2.1 Physical environment for reinforcement learning

A robotic module developed at Asan Medical Center was used for guidewire navigation (Figure 1a). A pair of roller units rotating in opposite directions enabled forward and backward motion of the guidewire (Terumo Radifocus 0.035", Terumo Co., Ltd., Japan). Vertical translation of the roller units produced rotation of the guidewire. The roller units, driven by step motors, generated discrete control commands corresponding to 0.4 mm displacement or 33° rotation of the guidewire at the roller side. The guidewire with a pre-angled tip was delivered via the guiding catheter (Heartrail II JL-3.5, Terumo Co., Ltd., Tokyo, Japan) engaged at the ostium of the coronary artery in the phantom. The entire phantom area was captured by a RealSense™ D435 camera (Intel Co., Ltd., CA) mounted orthogonal to the phantom.

2.2 Network setting for reinforcement learning

2.2.1 Training

To build an RL agent that determines control commands using Rainbow (hessel2018rainbow), a convolutional neural network (CNN) was constructed (Figure 1a). Using 'state' information composed of four consecutive images as input, the trained network output a distribution of Q values for deciding an 'action'. According to the control command, the guidewire was manipulated by the robotic module, and the RL agent received a 'reward' depending on the 'next state'. A step was defined as the generation of a transition, which was a set of state, next state, action, and reward. Every transition was saved in the replay memory.
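As a rough illustration of the network described above, the following is a minimal sketch, not the authors' exact architecture, of a DQN-style convolutional network mapping four stacked 84 × 84 grayscale frames to a 51-atom categorical value distribution per action, as in Rainbow. The layer widths and the three-action output are assumptions, and the dueling and noisy layers that Rainbow also uses are omitted for brevity.

```python
import torch
import torch.nn as nn

class DistributionalQNet(nn.Module):
    """Minimal sketch of the value network: four stacked 84x84 grayscale
    frames in, a 51-atom categorical value distribution per action out.
    Layer widths are illustrative; Rainbow's dueling/noisy layers are omitted."""

    def __init__(self, n_actions: int = 3, n_atoms: int = 51):
        super().__init__()
        self.n_actions, self.n_atoms = n_actions, n_atoms
        self.conv = nn.Sequential(  # classic DQN convolutional trunk
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions * n_atoms),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) -> per-action probabilities over 51 atoms
        logits = self.head(self.conv(x)).view(-1, self.n_actions, self.n_atoms)
        return torch.softmax(logits, dim=-1)
```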
In each episode, the guidewire tip, initially located in front of the catheter, was moved toward a goal by a combination of control commands. When the guidewire reached the target location within 500 steps, the episode was considered a 'success'; otherwise it was considered a 'failure'. After finishing an episode, the guidewire was pulled back to the initial location, and then a new episode began. The training consisted of 1000 episodes, and the goal was switched randomly for each episode. At the beginning of training, transitions for the replay memory were generated using weighted random action (WRA) or transferred weights for a given number of steps. The network was not updated during this 'transition generation' phase. The composition and generation methods of transitions for the replay memory are summarized in Table 1.
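The episode structure described above can be summarized in the following Python sketch; `env`, `agent`, and `replay_memory` are hypothetical placeholders for the robotic environment, the Rainbow agent, and the replay buffer, and the update frequency follows Table A1. The initial transition-generation phase, during which the network is not updated, is omitted.

```python
MAX_STEPS, N_EPISODES, UPDATE_EVERY = 500, 1000, 4  # protocol above; update frequency from Table A1

def run_training(env, agent, replay_memory):
    """Sketch of the episode loop: one random goal per episode, a 500-step
    budget, transitions pushed to replay memory, and periodic Rainbow updates."""
    for episode in range(N_EPISODES):
        state = env.reset(goal=env.sample_goal())        # guidewire pulled back, new goal
        for step in range(MAX_STEPS):
            action = agent.select_action(state)          # control command from Q distribution
            next_state, reward, done = env.step(action)  # robot executes the command
            replay_memory.push(state, action, reward, next_state, done)
            if step % UPDATE_EVERY == 0:
                agent.update(replay_memory)              # sample transitions, update network
            state = next_state
            if done:                                     # goal reached or terminal signal
                break
```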

The loss function was defined as a combination of the Rainbow loss (hessel2018rainbow) with the large margin classification loss and L2 regularization loss from DQfD (hester2018deep). The large margin classification loss was used only when training the RL agent with human demonstrations. Hyper-parameters used for training are summarized in Table A1 in the Appendix.
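For reference, the combination referred to above presumably follows the DQfD formulation (hester2018deep), with the Rainbow distributional n-step loss J_DQ, the large-margin classification loss J_E applied only to demonstration transitions, and an L2 term; the loss weights are left symbolic because they are not restated here.

```latex
% Combined loss (sketch, following DQfD): Rainbow loss, large-margin
% classification loss on demonstration transitions, L2 regularization.
J(Q) = J_{DQ}(Q) + \lambda_1 J_E(Q) + \lambda_2 J_{L2}(Q), \qquad
J_E(Q) = \max_{a \in \mathcal{A}} \big[ Q(s, a) + \ell(a_E, a) \big] - Q(s, a_E)
```

Here a_E is the demonstrated action and ℓ(a_E, a) is zero when a = a_E and a positive margin otherwise (0.8 in Table A1); the L2 term presumably corresponds to the weight decay of 0.00001 listed in the same table.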

Figure 2: (a) The navigation area, divided into three zones, is initially restricted to the proximal zone and is expanded by sequentially adding the medial and distal zones. The goal is set at the target location of the guidewire, and terminal signals are assigned to the other branches. (b) Because the goal is not visible in the focus window around the guidewire tip, subgoals are introduced. The focus window contains at least one subgoal or the goal. (c) Experimental setup according to navigation area.
Generation of initial transitions in replay memory

Exp.  Target                           Model  Feature  Number of    Generation  Segment     Probability of
                                                       transitions  method                  command in WRA
1     Prox-main, Prox-side             P1     HD       10,000       WRA         Prox        0.4, 0.2, 0.4
                                       P2     -        10,000       WRA         Prox        0.4, 0.2, 0.4
2     Med-main, Med-side               M1     HD       10,000       WRA         Prox, Med   0.4, 0.2, 0.4
                                       M2     TR       10,000       P1          Prox, Med   -
                                       M3     TR       10,000       P2          Prox        -
                                                                    WRA         Med         0.4, 0.2, 0.4
3     Dist-main, Med-side, Prox-side   D1     TR       20,000       M3          Prox, Med   -
                                                                    WRA         Dist        0.6, 0.2, 0.2
                                       D2     TR, WI   20,000       M3          Prox, Med   -
                                                                    WRA         Dist        0.6, 0.2, 0.2

Table 1: Composition of the initial replay memory. For the P1 and M1 models, 3234 transitions from 40 episodes of human demonstrations are included in the replay memory prior to weighted random action (WRA). The probability of the forward command in WRA is increased in experiment 3, because the distal zone has only a small branch that the guidewire is very unlikely to enter. Prox, proximal; Med, medial; Dist, distal; HD, human demonstration with DQfD; TR, transfer learning; WI, weight initialization.

2.2.2 State

In defining the 'state', two major modifications were introduced: the focus window and subgoals. The image area for the state, converted to grayscale as in X-ray angiography, was cropped to 84 × 84 pixels near the guidewire tip, allowing the RL agent to focus on the most important information (Figure 2b). The main drawback of the image crop was that the RL agent could not recognize the goal until the guidewire approached the target location. Therefore, subgoals were added on the path leading to the target location. The subgoals were initially set at the bifurcation points, and additional subgoals were placed at a distance of 20 pixels, about a quarter of the image size for the state. Also, terminal signals were designated near the entrances of untargeted branches, which helped prevent the RL agent from wasting steps on useless exploration.
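A minimal sketch of how such a focus window might be extracted, assuming the tip position is available in pixel coordinates from the camera image and that OpenCV is used for grayscale conversion; the padding strategy and names are illustrative, not the authors' implementation.

```python
import numpy as np
import cv2  # assumes OpenCV is used for the camera frames

def focus_window(frame_bgr: np.ndarray, tip_xy: tuple, size: int = 84) -> np.ndarray:
    """Crop an 84x84 grayscale patch centered on the guidewire tip.
    The frame is zero-padded so crops near the image border keep the same size."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    half = size // 2
    padded = np.pad(gray, half, mode="constant")         # shifts coordinates by `half`
    x, y = tip_xy[0] + half, tip_xy[1] + half
    return padded[y - half:y + half, x - half:x + half]  # (84, 84) patch
```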

2.2.3 Action

The action space of the RL agent was composed of forward/backward motion and rotation. The magnitude of each action generated by the robotic manipulator was fixed as a constant. The rotational direction of the guidewire was not reversed until the roller units reached their maximum rotation angle.
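The discrete action space and the rotation-direction behaviour can be illustrated as follows; the three-command encoding, the roller limit value, and all names are assumptions for illustration only.

```python
from enum import IntEnum

class Command(IntEnum):
    """Hypothetical encoding of the discrete action space; each command maps to a
    fixed robot motion (0.4 mm translation or a 33-degree rotation step)."""
    FORWARD = 0
    BACKWARD = 1
    ROTATE = 2

MAX_ROLLER_ANGLE = 180.0  # assumed mechanical limit; the actual value is not stated

def next_rotation(current_angle: float, direction: int, step: float = 33.0):
    """Keep rotating in the same direction until the roller limit would be
    exceeded, then reverse, mirroring the behaviour described above."""
    if abs(current_angle + direction * step) > MAX_ROLLER_ANGLE:
        direction = -direction
    return current_angle + direction * step, direction
```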

2.2.4 Rewards

The RL agent accumulated a reward of -0.001 per step, while a reward of zero was given at subgoals and the final goal (Figure 1b). When the guidewire tip arrived at a terminal signal, a large negative reward of -0.5 was imposed.
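One reading of this reward design, written as a small sketch with the values given above; the boolean flags are hypothetical names for the events detected by the environment.

```python
STEP_PENALTY = -0.001    # accumulated on every control step
TERMINAL_PENALTY = -0.5  # guidewire enters an untargeted branch (terminal signal)

def reward(reached_goal: bool, reached_subgoal: bool, hit_terminal: bool) -> float:
    """Sketch of the reward design in Figure 1b: a small per-step penalty,
    zero at subgoals and the goal, and a large penalty at terminal signals."""
    if hit_terminal:
        return TERMINAL_PENALTY
    if reached_goal or reached_subgoal:
        return 0.0       # the per-step penalty is waived at (sub)goals
    return STEP_PENALTY
```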

Figure 3: A trajectory of the guidewire tip in human operation as a demonstration. Even if given the same control command, the movement of the guidewire tip may vary depending on the tip orientation of the guidewire and the combination of control operations previously applied. For example, although four forward commands (FFFF) are consecutively selected at the starting point, the guidewire tip does not advance forward. Deformation of the guidewire caused by the friction forces on the walls of the catheter and the blood vessels causes non-linearity between the control command and the movement of the distal end, occasionally leaving the operator unable to predict the next state.

2.2.5 Human demonstrations

DQfD (hester2018deep) was proposed to enhance the performance of reinforcement learning with a small amount of demonstration data. In our application of DQfD, human demonstrations were used to pre-train the network with supervised learning and were sampled with a high priority from the replay memory during reinforcement learning. To record demonstration data for the experiments using the 2D phantom, 10 episodes per target location in the proximal and medial zones were created, with trained personnel generating discrete control commands using a keyboard (Figure 3).
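One way to realize the demonstration handling described above, sketched under the assumption of a prioritized replay buffer with a hypothetical API (`push`, `sample`, and their keyword arguments are placeholders); the priority bonus is an illustrative value, not the paper's setting.

```python
DEMO_PRIORITY_BONUS = 1.0  # illustrative constant keeping demonstrations sampled often

def add_demonstrations(replay_memory, demo_transitions):
    """Insert human-demonstration transitions with an extra priority bonus so they
    are sampled preferentially during RL training, in the spirit of DQfD."""
    for t in demo_transitions:
        replay_memory.push(*t, priority_bonus=DEMO_PRIORITY_BONUS, protected=True)

def pretrain_on_demos(agent, replay_memory, n_steps=1000):
    """Supervised pre-training phase (1,000 steps in Table A1) that uses only
    demonstration data before any robot interaction."""
    for _ in range(n_steps):
        batch = replay_memory.sample(demos_only=True)
        agent.update(batch, use_margin_loss=True)  # large-margin classification loss active
```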

3 Results

3.1 Navigation performance in 2D phantom

First, the RL navigation in this study aimed to deliver the guidewire to a target location in the main vessel or a side branch of a 2D phantom (PCI trainer for beginners, Medi Alpha Co., Ltd., Japan). The training was conducted by dividing the left anterior descending artery into three parts and expanding the navigation area step by step (Figure 2a). The RL performance was evaluated by independently performing the procedure three times per experiment, using a new guidewire each time. The target was selected at the beginning of every episode from a uniform random distribution. A paired Wilcoxon test was used to compare the operation time and number of steps between RL agents. Values of p < 0.001 were considered statistically significant. Statistical analyses were performed using the R package and SPSS 17.0 for Windows (IBM Corp., Armonk, NY, USA).
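The paper performed this comparison with R and SPSS; as an equivalent illustration, a paired Wilcoxon signed-rank test can be run with SciPy on the per-episode step counts of two agents. The arrays below are placeholders.

```python
from scipy.stats import wilcoxon

def compare_agents(steps_a, steps_b, alpha=0.001):
    """Paired Wilcoxon signed-rank test on per-episode step counts of two agents
    (arrays paired by episode index). Returns significance at p < 0.001."""
    stat, p_value = wilcoxon(steps_a, steps_b)
    return p_value < alpha, p_value
```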

                 Success rate           Number of steps                                     Operating time (s)
Exp.  Model      ≥95%      ≥99%         First 100       Last 100        Last 500           First 100       Last 100        Last 500
                                        episodes        episodes        episodes           episodes        episodes        episodes
1     P1         201st     401st        145.3 ± 92.2    73.7 ± 30.7     80.1 ± 41.3        16.00 ± 9.48    9.59 ± 5.96     10.40 ± 6.32
      P2         188th     216th        164.8 ± 99.9    63.1 ± 19.4     67.7 ± 28.3        17.43 ± 9.70    8.99 ± 5.88     9.29 ± 6.00
2     M1         560th     -            179.3 ± 85.5    203.2 ± 77.7    203.8 ± 77.0       18.94 ± 9.19    20.43 ± 9.08    20.78 ± 9.26
      M2         195th     325th        291.5 ± 142.3   162.2 ± 33.3    165.3 ± 40.2       26.83 ± 12.42   17.48 ± 7.03    17.63 ± 7.40
      M3         212th     446th        435.9 ± 116.3   157.6 ± 37.2    160.7 ± 41.3       39.41 ± 11.64   17.24 ± 6.66    17.56 ± 7.17
3     D1         -         -            346.6 ± 154.5   375.6 ± 180.8   353.9 ± 186.4      30.16 ± 12.57   33.63 ± 16.00   32.46 ± 16.72
      D2         646th     -            218.7 ± 134.2   183.6 ± 90.7    184.8 ± 98.6       21.69 ± 12.97   21.65 ± 11.44   21.21 ± 11.47

Table 2: Evaluation metrics for assessing the performance of RL agents. The success-rate columns give the episode at which the success rate first reached 95% and 99%; step counts and operating times are reported as mean ± standard deviation over the first 100, last 100, and last 500 episodes.
Figure 4: Navigation performance over episodes. (a) Human demonstrations with DQfD (hester2018deep) accelerate the training speed of the RL agent, but require more episodes to succeed 100% of the time in the proximal zone. (b) The transfer learning strategy allows the RL agent to reach the same level of training speed as in the proximal zone. (c) When the final layer is initialized, the RL agent can make the guidewire reach targets at different distances.
Figure 5: Tip trajectories of the guidewire in RL navigation. (a) In the proximal zone, the RL agent initially explores the path in a stochastic pattern, and as the RL evolves, the guidewire successfully reaches its goal by repeating efficient patterns in the stable phase. (b) In the medial zone, the RL agent finds an effective way to move the guidewire into a small side branch where the ostium is narrowed. (c) The RL agent passes through the severe obstruction in the distal zone by facing the guidewire tip to the right with respect to the travel direction.

3.1.1 Proximal targets: Human demonstrations

For the main and side targets in the proximal zone, we evaluated the effects of human demonstrations with DQfD on RL performance for guidewire navigation (Figure 2c). A stenotic lesion was located in the branching area between the targets, which hindered the guidewire control. Another side branch opposite to the proximal side target was set as a terminal signal.
When human demonstrations were added to the replay memory (P1 model), the learning speed was initially faster (<175 episodes in Figure 4a) than that of the RL agent trained with only WRA (P2 model). However, the success rate of the P2 model then increased rapidly, reaching 99% first, at the 216th episode (Table 2). After 500 episodes, RL navigation hardly ever failed, while the P2 model required significantly fewer control commands (80.1 ± 41.3 vs. 67.7 ± 28.3, p < 0.0001) and reduced operating time (10.40 ± 6.32 s vs. 9.29 ± 6.00 s, p < 0.0001). Compared to the human operation for the demonstration data (82.1 ± 34.2 steps), the reduction rates in the number of steps were 10.3% and 23.1% for the P1 and P2 models in the final 100 episodes, respectively. For both models, although the success rate for the proximal-main target (vs. the proximal-side target) was slightly lower in the beginning, the difference between the navigation goals almost disappeared as the RL training progressed.
In the early stage of training, the RL agent explored the path in a stochastic pattern (exploration phase in Figure 5a). As the training progressed, unnecessary changes in the orientation of the guidewire tip were gradually reduced, and the probability of escaping from untargeted branches improved (evolving phase). In the 'stable phase' at the last stage of training, two representative patterns were found in the trajectories of guidewire navigation. In the first pattern, after proceeding along the centerline of the main vessel, the guidewire rotated sharply toward the proximal-side target in the bifurcation area or advanced to the proximal-main target. The second pattern was characterized by avoiding the branch vessel marked with the terminal signal. The RL agent then steered the guidewire along the sidewall of the side branch (proximal-side target) or used the evasive movement again toward the opposite side (proximal-main target).

3.1.2 Medial targets: Transfer learning

Because the travel distances to the medial targets were roughly three times longer than those to the proximal targets, it was extremely difficult to reach the goals with only WRA, especially the medial-side target (Figure 2a). For the medial targets, a transfer learning approach was applied by initializing the network with the models trained in the proximal zone. The M1 model, as a control, was trained using human demonstrations with DQfD, like the P1 model. The M2 model took its initial weights from the P1 model, which was also used to generate the initial transitions toward the medial targets. For the M3 model, which used the weights of the P2 model, the initial transitions were produced with both the transferred weights and WRA (Figure 2c and Table 1).
Despite the increased travel distance and the multiple intervening branches, the success rates of the M2 and M3 models increased sharply in the same pattern as in the proximal experiment, exceeding 95% from the 212th episode for both models (Figure 4b and Table 2). After the success rate saturated, little variation was found between the performance of the two models using transfer learning. Also, the number of steps and total reward per episode remained stable when averaged over 100 episodes. Although human demonstrations allowed the RL agent to temporarily produce better results (M1 model), after the initial stage (> 200 episodes) the performance of the M2 and M3 models exceeded that of the RL agent trained without transfer learning.
The most efficient patterns for the medial targets, unsurprisingly, were to follow the centerline primarily using forward commands, which accounted for most of the last half of the experiment (Figure 5b). The main reasons for failure or a longer travel distance were that the guidewire was misled into side branches (medial-main target) and that the orientation of the guidewire tip had to be changed repeatedly to pass through the narrowing (medial-side target).

3.1.3 Distal-main and side targets: Weight initialization

The goal was to demonstrate that the RL agent was viable for the major destinations of guidewire navigation: the proximal-side, medial-side, and distal-main targets. Transfer learning was applied using the trained weights of the M3 model. To address the overfitting issue, weight initialization was applied by re-initializing the final layer of the convolutional neural network with randomly generated parameters.
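A minimal PyTorch sketch of this transfer-plus-reinitialization step, assuming the previous-stage network is available as a checkpoint and that the output layer is the last module of a `head` block, as in the network sketch in Section 2.2.1; the names and the choice of random initializer are assumptions.

```python
import torch
import torch.nn as nn

def transfer_with_reinit(new_net: nn.Module, m3_checkpoint_path: str) -> nn.Module:
    """Load weights transferred from the previous stage (M3 model), then
    re-initialize the final layer with random parameters (D2 setup)."""
    state = torch.load(m3_checkpoint_path, map_location="cpu")
    new_net.load_state_dict(state)
    final = new_net.head[-1]                # assumes the output Linear layer is last in `head`
    nn.init.kaiming_uniform_(final.weight)  # discard learned weights in the final layer
    nn.init.zeros_(final.bias)
    return new_net
```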


Without weight initialization (D1 model), the performance of the RL agent regressed as the training continued. For the last 100 episodes, the D1 model produced a success rate of 25% and mostly failed to reach the distal-main target (Figure 4c). When the final-layer initialization was applied (D2 model), the success rate for the medial-side and distal-main targets fluctuated but eventually approached 100% for all targets. The success rate of the D2 model was 98.0% on average over the last 300 episodes. The change in the number of steps during the training process was relatively small compared to the proximal and medial zones, because the navigation was terminated early in the failed episodes.
Initially, the guidewire control suffered from the orientation adjustment of the guidewire in front of the stenosis (Figure 5c). Unless the guidewire tip faced right relative to the travel direction, it was exceedingly difficult to proceed through the distal obstruction. As the training progressed, even when the guidewire moved along an incorrect route, the RL agent returned it to a path that could reach the designated goal, as it had learned in the previous experimental stages. Eventually, the navigation trajectories for the distal-main target almost converged, except for irregular patterns around the largest side branches in the proximal zone.

Figure 6: (a) Navigation targets in 3D phantom. The target names may differ from the clinical standard. (b) Snapshots of RL navigation to the distal-main target. (c) Probability of guidewire movement corresponding to forward (F), backward (B) and rotation (R) commands. (d) Number of control commands required for guidewire movement.

3.2 Validation in 3D phantom

The training process of our framework was further validated using a 3D phantom. Expanding the navigation area in the 3D phantom (Embedded Coronary Model, Trandomed 3D Medical Technology Co., Ltd., China), the training methods of the P2, M3, and D2 models were applied sequentially, following the best scenario from the 2D phantom. In the 3D experimental setup (Figure 6a), vessels and guidewire segments located away from the center of the camera view could appear shorter than their actual length, analogous to the foreshortening effect in X-ray coronary angiography. Also, fluid at a physiological flow rate was supplied by an output-adjustable pump (WT300-1JA, Longer Precision Pump Co., Ltd., China) to the right coronary artery (RCA), which, together with the silicone wall, could affect the dynamic behavior of the guidewire.
Despite substantial changes in the experimental environment, and thereby in the interaction between the guidewire and its surroundings (Figures 6c and 6d), an RL agent was constructed that was able to steer the guidewire to all valid targets (see Figure A1 in the Appendix). The navigation performance of the RL agent improved through the exploration and evolution phases, and eventually it became capable of maneuvering the guidewire into vessels with wavy walls and differing branching patterns (Figure 6b and Video A1).

4 Discussion

Our framework demonstrated the potential applicability of autonomous guidewire navigation using RL. To accelerate the training speed and avoid mechanical fatigue of the guidewire, human demonstrations with DQfD (hester2018deep), transfer learning, and weight initialization were evaluated as a segment-wise learning approach. The focus window and subgoals were introduced to customize the state and reward for the RL agent, respectively. The RL agent improved its navigation performance through 'exploration' and 'evolution' phases, which eventually enabled the guidewire to reach all valid targets in a 'stable' phase.
Human demonstrations initially accelerated the training but required more time to further increase the success rate. Sampling human demonstrations with a high priority, even after the RL agent's performance exceeds that of the demonstrations, may hinder further improvement. The patterns of the human demonstrations were suboptimal and rarely appeared in the 'stable' phase (gao2018reinforcement). Also, the difference in input frequency between the human operator and the RL agent may cause different interactions between the guidewire tip and the experimental environment. In navigating to the medial and distal targets, the segment-wise approach helped collect better transitions for targets that random actions alone were very unlikely to reach. Transfer learning, which is commonly used to improve the learning speed and performance of deep networks (girshick2014rich), also contributed to reducing the time required to reach 100% success.
Under realistic physiological conditions, providing accurate state information to the RL agent is key to applying RL to three-dimensionally deforming vascular pathways with a beating heart. Uncertainties in registration can be an obstacle to applying our framework, which requires precise position information to define the 'state' and place subgoals. To detect the relative location of the guidewire within the coronary tree, a dynamic coronary roadmap can be helpful (piayda2018dynamic). The latest method provides real-time registration of X-ray angiography with the guidewire tip in fluoroscopic images using ECG gating (ma2020dynamic). Also, deep-learning segmentation of major vessels in X-ray angiography offers rapid and accurate identification of the target vessel to be reached (yang2019deep).
The ultimate goal of autonomous navigation using deep RL is to build a generalized model encompassing the anatomical diversity of the coronary arteries. The time and cost of training can be an obstacle that fundamentally limits the application of RL navigation. To this end, the development of novel simulators is essential (wang2017robust; dulac2019challenges). Virtual simulators provide an opportunity to train on more subjects quickly and at low cost. Distributed RL can improve the training and performance of the RL agent by exploiting the strengths of virtual simulators (mnih2016asynchronous). Advancements in the modeling of cardiovascular anatomy (corral2020digital) and interventional devices (sharei2018navigation) support the construction of virtual simulators that more accurately mimic interactions in the human body. Physical simulators may help bridge the gap between virtual simulators and in-vivo applications by alleviating safety issues. Integration of novel 3D printing techniques (stepniak2020novel) with functional modeling of the cardiovascular system (vukicevic2017cardiac) may allow the dynamic responses of interventional devices to be implemented in physical simulators. Our framework is expected to contribute to the adoption of autonomous navigation not only by providing data necessary for modeling virtual simulators, but also by presenting guidance on training methods for physical simulators involving mechanical fatigue.

References

Acknowledgement

This research is based upon work supported by the Ministry of Trade, Industry and Energy (MOTIE, Korea), the Ministry of Science and ICT (MSIT, Korea), and the Ministry of Health and Welfare (MOHW, Korea) under the Technology Development Program for AI-Bio-Robot-Medicine Convergence (20001638), and by the Korea Medical Device Development Fund grant funded by the Korean government (the Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health and Welfare, the Ministry of Food and Drug Safety) (Project Number: KMDFPR202009010013).

Appendix

Figure A1: Success rate, number of steps, operating time, and total reward in subsequent 3 experiments using 3D phantom.
Hyper-parameter                               Value
Batch size                                    32
Buffer size                                   100,000
Discount factor (gamma)                       0.99
Update frequency                              4
Target network soft update ratio              0.005
Gradient clip                                 10
Multi-step returns                            v
Prioritization exponent (alpha)               0.4
Prioritization importance sampling (beta)     0.6
Distributional atoms                          51
Distributional min/max values                 [-2.0, 0.0]
Noisy Net sigma (std)                         0.5
Human demo pre-training steps                 1,000
Human demo supervised loss margin             0.8
Learning rate                                 0.0001
Weight decay                                  0.00001
Optimizer                                     Adam
Table A1: Hyper-parameters for reinforcement learning agent.