## 1 Introduction

Deep Reinforcement Learning (DRL) has proven extremely effective at mastering complex simulations and artificial tasks, e.g. playing Atari games [13] and Go [17]. However, DRL's poor sample complexity has limited its application to real-world tasks, such as navigating a robot to a target position without crashing.

Deep Deterministic Policy Gradient (DDPG) [10] is an actor-critic algorithm that is suitable for such continuous control tasks in principle, but in practice the cost of exploration in complex navigation environments can prove prohibitive.

Since an agent must stochastically explore a long sequence of states during each training episode, high variance becomes the main bottleneck that hinders DDPG from learning effective DRL models. In order to mitigate this issue, conventional architectures generally require a huge number of learning samples, resulting in high computational and environmental costs in practice. In this paper, we propose a new framework that allows an agent to stochastically switch between high variance controllers (e.g. DDPG), and low variance controllers (e.g. simple deterministic controllers), effectively allowing the DDPG component to be quickly bootstrapped instead of starting from completely random moves.

Intuitively, learning is usually easier when guided by other heuristics. The independent controllers here act as guidance for learning better DDPG policies. In our case, the agent still maintains an independent DDPG module that learns navigation by exploring the environment, but it can dynamically switch between learning from exploration and learning from the heuristic controllers. Here, the switching mechanism is constructed as a stochastic function updated by the REINFORCE learning signal [23] to maximise the total reward. Meanwhile, the DDPG component learns from the action selected by the stochastic switch, rather than directly using the output action generated by its policy network. Therefore, the switching mechanism helps DDPG avoid trivial exploration early in training, and learns to balance exploration against heuristic guidance. More interestingly, the DDPG component can be tested in isolation from the other controllers, in which case the switch is turned off and navigation is carried out solely by the DDPG component. Similar to the idea of imitation learning [3], the DDPG component is able to learn from the demonstrations given by the guidance (the PID and OA (obstacle avoidance) components in our case) and instantly generalise to new situations (which PID and OA could not handle). Here, the guidance can be considered a positive bias that reduces the variance of the gradient estimators, and the model is able to remove this bias after benefiting from it.

For quantitative evaluation, we first compare our model with the stochastic switch to the vanilla DDPG baseline and to deterministic benchmarks, demonstrating the benefits brought by the independent controllers. We then investigate the influence of using different independent controllers, which shows that the framework generalises well and is able to accumulate the benefits of different simple controllers. In addition, we propose two variants of the switch mechanism, a uniformly random switch and an argmax switch, for comparison. Finally, we show that the models can abandon the extra controllers once their usage rate declines below a threshold, and can continue self-learning using only the DDPG component. For qualitative evaluation, we test our model in a real-world scenario. Without further modification, the model trained in simulation can be directly transferred to carry out navigation tasks.

In summary, we propose a new framework that leverages the heuristic knowledge provided by independent controllers to bootstrap deep reinforcement learning for robot navigation. Our experiments demonstrate that by incorporating stochastic guidance, we are able to effectively and efficiently train the DDPG navigation policies and achieve significantly better performance than state-of-the-art baseline models. As a simple, robust and easy-to-use framework, it can be a generic method applied to improve many other deep reinforcement learning algorithms and applications.

## 2 Model

The robot navigation task can be defined as a partially observable Markov decision process [18] problem, which can be solved by DRL. Given observations of the world state, a robot needs to decide its action, i.e. a control policy, maximising an accumulative future reward. Since it is ideal for the robot to reach the goal without any crash, the reward function at time $t$ is defined as:

$$
r_t = \begin{cases} r_{crash}, & \text{if the robot crashes} \\ r_{reach}, & \text{if the robot reaches the goal} \\ (d_{t-1} - d_t) - c_\omega |\omega_t| - c, & \text{otherwise} \end{cases} \tag{1}
$$

where $r_{crash}$ is a large penalty for collision, $r_{reach}$ is a positive reward for reaching the goal, $d_{t-1}$ and $d_t$ denote the distances between the robot and the goal at two consecutive time steps $t-1$ and $t$, $\omega_t$ represents the rotational speed of the robot at time $t$ (weighted by the coefficient $c_\omega$), and $c$ is a constant time penalty which encourages the robot to approach the goal quickly.
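A minimal sketch of this reward, with illustrative constants (the paper does not give the actual values of the penalty and reward terms):

```python
def navigation_reward(crashed, reached, d_prev, d_curr, omega,
                      r_crash=-10.0, r_reach=10.0, c_omega=0.1, c_time=0.05):
    """Reward of Eq. 1: terminal penalty/bonus, otherwise progress toward
    the goal minus rotation and time penalties. Constants are illustrative,
    not the paper's actual values."""
    if crashed:
        return r_crash
    if reached:
        return r_reach
    # progress term minus rotation penalty minus constant time penalty
    return (d_prev - d_curr) - c_omega * abs(omega) - c_time
```

The progress term is positive only when the robot gets closer to the goal, so idling or spinning in place accumulates negative reward.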

The proposed model consists of three parts: perception, control and the stochastic switch, as shown in Fig. 1. At each time step, the perception part processes an observation and generates a corresponding input representation. Different controllers then propose candidate actions based on the input representation. Finally, the stochastic switch determines which of the candidate actions is carried out.

### 2.1 Perception

At each time step $t$, the robot observes the state of the world $x_t$, which includes a stack of current and historical geometric observations, its linear and angular velocities, and a destination. The geometric observations, which give distances to surrounding objects (depth images in this work), are processed by a convolutional neural network to produce a compressed input representation. This representation is then concatenated with the velocities and the destination for the controllers and the stochastic switch.

### 2.2 Control

#### Action

Given the observation $x_t$ at time $t$, the robot takes an action $a_t = (v_t, \omega_t)$, where $v_t$ and $\omega_t$ respectively denote the expected linear and rotational velocities at time $t$. It then obtains a reward $r_t$ from the environment assessing the chosen action, and transits to the next observation $x_{t+1}$. The goal of our model is to maximise the discounted accumulative future reward $R_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$, where $\gamma$ is the discount factor and $r_i$ is given by the reward function in Eq. 1. In this work, the actions can be determined by the independent controllers and by DDPG, establishing a set of candidate actions for the stochastic switch to choose from.

#### Independent Controllers

Two independent controllers are introduced to facilitate the learning of DDPG's policies, in particular by providing reasonable actions at the beginning. The first is a proportional-integral-derivative (PID) controller using only the proportional term [2], which derives its action from the relative position $e_t$ of the destination in the robot coordinate frame as:

$$
a_t^{PID} = K_p\, e_t \tag{2}
$$

where $K_p$ is the coefficient for the proportional term. The PID controller is one of the most widely used and successful control mechanisms. However, since it ignores the geometric observation, it has no obstacle avoidance capability.
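A minimal sketch of such a proportional controller, with illustrative gains and a hypothetical decomposition of the goal's relative position into linear and rotational commands (the paper does not specify how $K_p e_t$ maps onto $(v_t, \omega_t)$):

```python
import math

def pid_action(goal_x, goal_y, k_p_v=0.3, k_p_w=0.8):
    """P-only controller (Eq. 2 sketch): commands proportional to the
    goal position in the robot frame. Gains and the distance/bearing
    decomposition are illustrative assumptions."""
    dist = math.hypot(goal_x, goal_y)       # distance to the goal
    bearing = math.atan2(goal_y, goal_x)    # heading error toward the goal
    return k_p_v * dist, k_p_w * bearing
```

A goal straight ahead yields pure forward motion; a goal to the side yields a turning command proportional to the heading error.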

The other is a simple obstacle avoidance (OA) algorithm which can drive the robot without collision. It uses the geometric observation to detect and avoid nearby obstacles by controlling the heading direction (rotational speed) of the robot:

$$
\omega_t^{OA} = \begin{cases} \omega_{max}, & \text{if } d_{min} < d_{safe} \\ 0, & \text{otherwise} \end{cases} \tag{3}
$$

where $d_{min}$ is the distance to the closest obstacle, $\omega_{max}$ represents the largest rotational speed, and $d_{safe}$ indicates a pre-defined minimum safety distance. When the distance between the robot and an object is less than the safety distance, i.e. $d_{min} < d_{safe}$, the robot rotates to avoid collision. These two controllers complement one another in providing candidate actions for the stochastic switch. Note that the OA only produces $\omega_t$, while the corresponding $v_t$ is provided by the DDPG controller.
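The OA rule reduces to a one-line threshold test; a sketch with illustrative values for the maximum rotational speed and safety distance:

```python
def oa_rotation(ranges, omega_max=1.0, d_safe=0.5):
    """OA rule (Eq. 3 sketch): rotate at full speed when the nearest
    obstacle is inside the safety distance, otherwise apply no
    rotational correction. omega_max and d_safe are illustrative."""
    d_min = min(ranges)                     # closest obstacle distance
    return omega_max if d_min < d_safe else 0.0
```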

#### DDPG

The main controller of this framework is DDPG, an actor-critic DRL approach [10] that simultaneously learns the policy and the action-state value (Q-value) used to assess the learnt policy. Although the policy network and the critic network of DDPG share the same input representation from the perception part, the policy network predicts the action, while the critic network estimates the Q-value of the current state-action pair. For the critic network, given a policy $\pi$ mapping states $x_t$ to actions $a_t$, the expected return is $Q^\pi(x_t, a_t) = \mathbb{E}[R_t \mid x_t, a_t]$, which can be calculated with the Bellman equation [22]:

$$
Q^\pi(x_t, a_t) = \mathbb{E}_{r_t, x_{t+1}}\!\left[ r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\!\left[ Q^\pi(x_{t+1}, a_{t+1}) \right] \right]
$$

If the policy is deterministic, we can define $\mu : x_t \mapsto a_t$ and the inner expectation disappears. Since the remaining outer expectation does not depend on the policy $\mu$, the learning becomes off-policy. The objective is then to minimise the temporal difference (TD) error:

$$
L(\theta^Q) = \mathbb{E}\!\left[ \big( r_t + \gamma\, Q(x_{t+1}, \mu(x_{t+1}); \theta^Q) - Q(x_t, a_t; \theta^Q) \big)^2 \right] \tag{4}
$$

where $\theta^Q$ denotes the parameters of the critic network. To update the critic network by temporal difference learning [21], all learning samples stored in the replay buffer are formulated as tuples $(x_t, a_t, r_t, x_{t+1})$.
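The TD targets and the resulting critic loss can be sketched as follows (NumPy, batch form; the target networks DDPG uses in practice to stabilise these targets are omitted for brevity):

```python
import numpy as np

def td_targets(rewards, next_q, dones, gamma=0.99):
    """Targets r_t + gamma * Q(x_{t+1}, mu(x_{t+1})); terminal
    transitions (done=1) bootstrap nothing."""
    return rewards + gamma * next_q * (1.0 - dones)

def critic_loss(q_pred, rewards, next_q, dones, gamma=0.99):
    """Mean squared TD error minimised by the critic (Eq. 4)."""
    targets = td_targets(rewards, next_q, dones, gamma)
    return float(np.mean((targets - q_pred) ** 2))
```

In DDPG the `next_q` values come from evaluating the critic at the actor's action for the next state; here they are plain inputs.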

The policy network is parameterised by $\theta^\mu$. During training, the gradients are estimated by applying the chain rule to the objective function (expected reward) $J$ w.r.t. the parameters $\theta^\mu$:

$$
\nabla_{\theta^\mu} J \approx \mathbb{E}\!\left[ \nabla_a Q(x_t, a; \theta^Q)\big|_{a = \mu(x_t)}\, \nabla_{\theta^\mu} \mu(x_t; \theta^\mu) \right]
$$

Generally, in DDPG, the parameters are updated by gradients computed from the actions produced by the policy network. In our case, however, we introduce a stochastic switch that chooses the final action from the set of actions proposed by all controllers. Hence, the networks are actually updated with the action $a_t$ decided by the switch network, instead of the action $\mu(x_t)$ produced by DDPG's policy network.

### 2.3 Stochastic Switch

The PID controller, the OA algorithm and DDPG are three independent sources producing candidate actions for the switch network to (optimally) select from. The switch network is a stochastic deep neural network consisting of a parameterisation network and a multinomial distribution. Conventionally, a softmax layer would be employed to provide the parameter $\pi$ of the multinomial distribution. Here, instead, we apply the stick-breaking construction [16, 8, 11] as an alternative to softmax. The intuition is to introduce a bias that encourages more usage of the deep reinforcement learning algorithm, DDPG in our case. Since our framework is designed to train a robust DDPG component that benefits from the stochastic guidance, we expect DDPG to be used more often than the other controllers, so that we can get rid of the simple independent controllers after a certain period of training.

The stick-breaking construction transforms the modelling of multinomial probability parameters into the modelling of the logits of binomial probability parameters. In our case ($K=3$ controllers), given the binomial logits $\eta_1$ and $\eta_2$, the multinomial parameter $\pi = (\pi_1, \pi_2, \pi_3)$ can be generated by two breaks:

$$
\pi_1 = \sigma(\eta_1), \quad \pi_2 = (1 - \pi_1)\, \sigma(\eta_2), \quad \pi_3 = 1 - \pi_1 - \pi_2 \tag{5}
$$

where $\sigma$ is the sigmoid function, $\eta_i$ is the unscaled logit produced by a deep neural network given the input representation, and $\theta^s$ denotes the parameters of the switch network. The stick-breaking construction generalises to more breaks ($K > 3$). Conditioned on the current observation $x_t$, we construct the stochastic switch policy as:

$$
s_t \sim \pi^s(s_t \mid x_t; \theta^s) = \mathrm{Multinomial}(\pi_1, \pi_2, \pi_3) \tag{6}
$$
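The two breaks can be sketched directly (the logits would come from the switch network; here they are plain inputs):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stick_breaking(eta1, eta2):
    """Two breaks over K=3 controllers (Eq. 5 sketch): the first break
    carves out the DDPG probability, the second splits the remaining
    stick between PID and OA."""
    p_ddpg = sigmoid(eta1)
    p_pid = (1.0 - p_ddpg) * sigmoid(eta2)
    p_oa = 1.0 - p_ddpg - p_pid
    return [p_ddpg, p_pid, p_oa]
```

Note the built-in bias: with zero logits the probabilities are (0.5, 0.25, 0.25), so the first-break controller (DDPG) already receives more mass than a uniform 1/3, matching the stated intuition.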

At each time step $t$, the stochastic switch samples a decision $s_t \in \{1, 2, 3\}$, where $s_t = 1, 2, 3$ corresponds to DDPG, PID and OA respectively. Then, according to the decision $s_t$, the critic network of DDPG takes the final action $a_t$ as input and the networks are updated accordingly. Meanwhile, the stochastic switch is updated by the REINFORCE learning signal, so that by observing the environment the switch network can dynamically choose either to learn through exploration (DDPG) or to use the output of a heuristic controller (PID or OA) as guidance.

#### REINFORCE Algorithm

Since the gradients cannot be directly back-propagated through the discrete samples, we employ the REINFORCE algorithm [23] to construct the gradient estimator for the switch network, where the goal is to maximise the total reward $R$ under the switching policy $\pi^s$. Thus the objective function is:

$$
H = \mathbb{E}_{p(S; \theta^s)}[R] \tag{7}
$$

where $S$ is a sequence of decisions in an episode, $s_t$ is the decision sample at each time step $t$, and $p(S; \theta^s)$ is the probability of generating the current decision sequence. Hence, the gradients can be estimated as follows:

$$
\frac{\partial H}{\partial \theta^s} = \mathbb{E}_{p(S; \theta^s)}\!\left[ \frac{\partial}{\partial \theta^s} \log p(S; \theta^s)\, R \right] \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \frac{\partial}{\partial \theta^s} \log \pi^s(s_t^n \mid x_t^n; \theta^s)\, R^n
$$

where $N$ is the number of sampled episodes, $T_n$ is the length of episode $n$, and $R^n$ is the total reward of that episode. This yields a Monte Carlo based unbiased gradient estimate for updating the switch network.
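The Monte Carlo estimator above (optionally with the centred signal of the variance-reduction step later in Sec. 2.3) can be sketched as follows; `grad_logps` stands in for the per-step score-function gradients that an autodiff framework would supply:

```python
import numpy as np

def reinforce_gradient(grad_logps, returns, baselines=None):
    """Monte Carlo estimate of dH/dtheta_s: average over episodes of
    sum_t grad log pi(s_t|x_t) times the (optionally centred) episode
    return. grad_logps[n][t] is the gradient vector at step t of
    episode n; returns[n] is that episode's total reward."""
    n_eps = len(grad_logps)
    if baselines is None:
        baselines = np.zeros(n_eps)
    grad = np.zeros_like(grad_logps[0][0])
    for n in range(n_eps):
        signal = returns[n] - baselines[n]   # centred learning signal
        for g in grad_logps[n]:
            grad += g * signal
    return grad / n_eps
```

Subtracting a baseline leaves the estimator unbiased while reducing its variance, which is exactly the role of the control variates discussed below.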

The introduction of the stochastic switch can be considered an inductive bias for learning to navigate with better action samples. Updated by REINFORCE, the stochastic switch is able to sense the environment, avoid trivial exploration and select better actions for learning the DDPG policies. In addition, since the independent controllers are incorporated via the stochastic switch, the negative influence of the biases introduced by the heuristics is limited. More interestingly, the independent controllers can be turned off late in training (the stochastic switch then always chooses the output of DDPG as the final action), so that the learning of DDPG can rely further on exploration after being bootstrapped by the independent controllers.

#### Variance Reduction

Since the REINFORCE gradient estimator also suffers from high variance, we introduce two control variates [12] to alleviate the problem: a centred learning signal (a moving average $b$) and an input-dependent control variate $b(x_t)$. Here, we simply build an MLP (multilayer perceptron) to implement $b(x_t)$ conditioned on the input $x_t$. During training, the two control variates are learned by minimising the expectation $\mathbb{E}\big[(R - b - b(x_t))^2\big]$, and the gradients are derived as:

$$
\frac{\partial H}{\partial \theta^s} \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \frac{\partial}{\partial \theta^s} \log \pi^s(s_t^n \mid x_t^n; \theta^s)\, \big( R^n - b - b(x_t^n) \big)
$$

## 3 Experiments

### 3.1 Training Environments and Settings

The proposed framework is trained in two different simulators.
The first is a light-weight simulator, ROS Stage (http://wiki.ros.org/stage, Fig. 2(a)), in which a large number of repetitive experiments are conducted to show the learning curves, demonstrate the improvements brought by stochastic guidance, and compare against other baseline models.
In this simulator, we mount the mobile robot with a laser scanner to provide the geometric information of surroundings.
Hence, the convolutional neural network (in Fig. 1) is not being used in this case, and the laser scans are directly concatenated with other observation as input representation.
By accelerating the simulation time, we obtain the quantitative evaluation through a lot of repetitive experiments in ROS Stage.

The other, ROS Gazebo (http://wiki.ros.org/gazebo_ros_pkgs, Fig. 2(b)), contains a physics engine and can accurately simulate the dynamics of the mobile robot.
The model trained in ROS Gazebo is therefore directly applied to the real-world scenario to qualitatively evaluate navigation performance, though it has a larger computational overhead than ROS Stage.
Here, depth images are used to observe the surroundings; therefore, a 3-layer convolutional network (with filters of shape [4,4,3,8], [4,4,8,16] and [4,4,16,32] respectively) is constructed to provide input representations based on the depth images.

In each training episode, the robot starts at the origin with a random heading direction, and the destination is randomised within the obstacle-free area.
When the robot collides with an obstacle or reaches the destination, the current episode terminates.
The action control frequency is 5Hz
and the switching frequency is 1Hz.
For all the experiments carried out in ROS Stage, the training process lasts for 100k steps and is repeated 5 times.
The averaged learning curves, along with their variance (note that the variance here is the variance of the smoothed learning curves), are plotted to demonstrate performance.
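The 5 Hz action and 1 Hz switching frequencies imply that each switch decision is held for five control steps; a sketch of that loop, with hypothetical callables standing in for the real components:

```python
def run_episode(switch_fn, controllers, env_step, horizon=100,
                action_hz=5, switch_hz=1):
    """Hold each switch decision for action_hz // switch_hz control
    steps (5 here, from the 5 Hz action / 1 Hz switching rates).
    switch_fn, controllers and env_step are hypothetical callables
    standing in for the switch network, the three controllers and
    the simulator."""
    hold = action_hz // switch_hz
    choice = None
    trace = []
    for t in range(horizon):
        if t % hold == 0:          # re-sample the switch at 1 Hz
            choice = switch_fn(t)
        action = controllers[choice](t)   # act at 5 Hz
        env_step(action)
        trace.append(choice)
    return trace
```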

As for the hyper-parameters, the hidden layers of the critic and actor networks contain 100 ReLU units each, while the output layer of the actor network applies tanh and sigmoid for the rotational and linear velocities respectively. When updating the DDPG parameters, 32 learning samples are randomly drawn from a rank-based prioritised experience replay [15] as a training batch; the actor network, the critic network and the stochastic switch each use their own learning rate, and the remaining settings follow [10].

### 3.2 Navigation in Simulated Environment

#### Reinforcement Learning with Stochastic Guidance

Fig. 3(a) compares the models to demonstrate the benefits of learning with the stochastic switch. SGuidance is our model with a stochastic switch that dynamically chooses the action from the candidates proposed by the DDPG, PID and OA controllers. As shown in the figure, SGuidance achieves significantly better performance than the DDPG baseline. Meanwhile, DDPG suffers from high variance, visible as the wide transparent band around its learning curve, while SGuidance is much more stable. This is due to the high complexity of the environment, which leads to highly variable learning samples from DDPG and may cause trivial exploration. In addition, the stochastic gradient estimator of DDPG uses a biased approximation, which makes convergence and stability difficult to guarantee. By contrast, SGuidance is able to benefit from the simple heuristic controllers from the beginning of training, instead of starting from completely random moves.

In this experiment, we also plot the rewards of MoveBase (without map) and Oracle (MoveBase with map) for comparison.
The MoveBase package (http://wiki.ros.org/move_base) is a widely used motion planner for mobile robot navigation, implemented in the ROS Navigation stack; it consists of a local planner [5, 4] and a global planner (implemented with the Dijkstra or A* algorithm).
The global planner generates an optimal path from the origin to the destination on the global map of the environment, and the local planner dynamically avoids the newly detected obstacles while moving along the optimal path.
Hence, we call the MoveBase with map Oracle in this experiment.
As shown in Fig.3(a),
DDPG is able to obtain comparable performance to MoveBase.
SGuidance, however, significantly surpasses the deterministic MoveBase model.
Even without the access to the global map, SGuidance has shown its strong ability to navigate in the environment by just using the geometric information.
In addition, we plot the performance of two simple heuristic controllers (OA and PID) for reference.
Essentially, the simple deterministic controllers cannot carry out the navigation task independently (their accumulative rewards are both below 0).
However, when incorporated with DDPG via the stochastic switch, they contribute notably to alleviating the high-variance issue in learning DDPG.

#### Using Different Independent Controllers

This experiment shows the investigation where different independent controllers are incorporated with DDPG via stochastic switch. As illustrated in Fig.3(b), SGuidance (PID + OA) achieves the best performance when compared to the DDPG with only PID or OA and the DDPG without any independent controllers. Moreover, the contribution of the stochastic switch is greatly enhanced by adding more controllers, which yields more stable learning curves and better navigation performance. Interestingly, PID controller brings more benefits than OA controller in this context, and their benefits could be accumulated with the help of the stochastic switch.

#### Using Different Switching Mechanisms

*Fig. 4: (a) comparison with the proposed argmax switch and uniformly random switch; (b) discussion of different stochastic switch functions, with total reward on the left y-axis and total usage of the independent controllers on the right y-axis.*

Fig. 4(a) compares the stochastic switch to two other variants of the switching mechanism: a uniformly random switch and an argmax switch. The uniform switch assigns fixed, uniform probabilities to the DDPG, PID and OA controllers, while the argmax switch takes the (biased) argmax output of the switch network instead of stochastically drawing samples from it. As illustrated in Fig. 4(a), SGuidance has the best performance. The uniform switch is not as good as SGuidance, but still contributes remarkably to navigation performance. The Argmax curve lies between the other two mechanisms, but has much larger variance in total reward. This is because Argmax is a biased sampler compared to the others, and the introduced bias in turn damages its final performance, since there is less exploration after a certain period of training.

#### Construction of Stochastic Switch Function

In Fig. 4(b), StickBreaking1 (DDPG, PID, OA) denotes the function applied in this paper, and StickBreaking2 (DDPG, OA, PID) uses an alternative ordering of the independent controllers. More specifically, following Eq. 5, both StickBreaking1 and StickBreaking2 assign $\pi_1$ to the DDPG controller but order PID and OA differently: $\pi_2$ is assigned to the PID controller in StickBreaking1 and to the OA controller in StickBreaking2. As shown in the figure, Softmax achieves almost comparable total reward to StickBreaking1. However, in terms of the total usage of the independent controllers, the DDPG component is used less in Softmax than in StickBreaking1 and StickBreaking2. Although the two stick-breaking functions have similar total usage of the independent controllers, StickBreaking2 performs slightly worse than StickBreaking1, which shows that the order of the independent controllers has a small effect on performance. Hence, the softmax function is a safe choice for constructing the stochastic switch function; however, prior knowledge about the performance of the simple controllers can be exploited in the stick-breaking construction. For instance, Fig. 3(b) shows that PID brings more benefit than OA when incorporated via the stochastic switch, and Fig. 4(b) correspondingly shows that StickBreaking1 performs slightly better than StickBreaking2.

#### REINFORCE Variance Reduction

Fig. 5(a) shows the performance of the model with and without variance reduction techniques. Because the REINFORCE algorithm also suffers from high variance, we study the benefit of the control variates. As shown in Fig. 5(a), SGuidance (no CV) already improves significantly over the vanilla DDPG model. However, by introducing the two control variates to reduce the variance of the REINFORCE gradient estimator (Sec. 2.3), SGuidance further enhances and stabilises navigation performance, which again illustrates how influential the high-variance issue is in the context of deep reinforcement learning.

#### Turning off Independent Controllers

This experiment investigates a key property of DDPG with the stochastic switch: the trained DDPG policies can carry out navigation independently once all heuristic controllers are turned off. In Fig. 5(b), the percentages of heuristic controllers selected by SGuidance are shown as dashed lines. Their usage drops quickly and becomes stable after approximately 60k training steps, because by then the DDPG controller has reached a policy comparable to the other controllers. We therefore turn off both the PID and OA controllers once their usage falls below 15%, in order to study the performance of DDPG in isolation. Instead of abruptly shutting the controllers down, we monotonically diminish the probability of executing a heuristic controller's proposed action (when it is selected by the switch) to zero within 10k training steps. As a result, turning off the heuristic controllers only slightly affects navigation performance, which confirms the independent navigation capability of the trained DDPG policies.
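The turn-off schedule can be sketched as a simple probability decay (linear decay over the 10k-step window is an assumption; the paper only states that the probability diminishes monotonically):

```python
def anneal_heuristic_prob(step, turnoff_step, window=10_000):
    """Probability of actually executing a heuristic action once
    turn-off begins: 1 before turnoff_step, then decaying to 0 over
    `window` training steps. Linear decay is an assumption; the paper
    only specifies a monotonic decrease within 10k steps."""
    if step < turnoff_step:
        return 1.0
    return max(0.0, 1.0 - (step - turnoff_step) / window)
```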

### 3.3 Navigation in Real World Environment

In this experiment, we qualitatively analyse the performance of our model in a real-world environment. The model is trained in a simulated world built with ROS Gazebo (Fig. 2(b)) and transferred directly into the real-world scenario without any fine-tuning, in order to verify the effectiveness and strong generalisation of the model.

A Turtlebot 2 robot mounted with a Kinect depth camera is used as the mobile platform. Unlike the laser-scanner observations simulated in ROS Stage, the dimension of the state space for a depth camera is dramatically larger. Therefore, a 3-layer convolutional neural network is employed (as in Fig. 1) to provide geometric representations. The other inputs, i.e. velocity and goal location, are concatenated with the geometric representation into a dense input representation.

Since the ground truth of the robot locations is not available in the real-world environment, we apply the off-the-shelf AMCL ROS package (http://wiki.ros.org/amcl) to estimate the robot location and calculate the destination position in the local coordinate frame.
To improve localisation accuracy, we record a map of the environment with the Gmapping ROS package (http://wiki.ros.org/gmapping).
It is worth mentioning that this global map is not used by the navigation component of the model during training or testing.
The obstacles are laid out in the room as illustrated in Fig. 6(a).
The target of this experiment is to employ the learned policy and control the robot to reach several destinations successively without any collision.
As shown in Fig. 6(b),
the trajectory of the robot is plotted as the blue curves, which indicate that the robot can smoothly avoid all the obstacles and reach each target successfully, having learned only in simulation with the proposed stochastic guidance model.

## 4 Related Work

Many works have applied DRL to robotic problems, e.g. navigation [20, 26, 14, 24] and manipulation [6, 25]. Since most robotics problems involve continuous control, policy-based approaches such as policy gradients [19] or actor-critic methods, e.g. DDPG [10], are the conventional choices. Introducing a positive bias is a common approach to alleviating the high-variance issue. [15] assigns higher weights to the data on which the model is less confident, improving the efficiency of sample usage. [7] leverages the concept of information gain when exploring new policies. Unlike the above approaches, where the biases are tightly merged into the models, our framework incorporates extra knowledge as stochastic guidance without imposing any change to the underlying approach.

Thompson Sampling [1] shares a similar spirit with our framework, in that the model learns to switch among different controllers. The difference is that, instead of explicitly computing a posterior for the update as in Thompson Sampling, our framework directly employs neural networks to construct the latent distributions, which are trained jointly with the DDPG component by backpropagation. The advantage is that the switch function can easily be built and conditioned on all of the sensor inputs, so that it chooses different controllers according to the context. In addition, our framework focuses on training a better DDPG component, which benefits from a lower-variance gradient estimator thanks to the better samples generated by the stochastic switch.

In [9], Leonetti et al. investigated a low-level integration of RL and external controllers in which the RL algorithm only explores feasible actions provided by the planner; these heuristics cannot be discarded, during either training or testing. The performance of the learner therefore depends heavily on, if not being limited by, the capability of the heuristics. By contrast, in our framework DDPG can explore the full action space by itself alongside the guidance throughout training, and can eventually work independently.

## 5 Conclusion

This paper proposes a new framework for effectively incorporating heuristic knowledge to overcome the high-variance issue in learning DDPG. The experiments demonstrate that the stochastic switch allows an agent to balance learning from exploration against learning from heuristics, which significantly bootstraps navigation performance and surpasses state-of-the-art baseline models. More interestingly, the DDPG component remains independent and can be tested in isolation from the other controllers. By transferring the policies into the real world, the robot is able to successfully carry out navigation tasks, which indicates the robustness and strong generalisation of the proposed framework.

## References

[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.

[2] Karl Johan Åström and Tore Hägglund. PID Controllers: Theory, Design, and Tuning, volume 2. Instrument Society of America, Research Triangle Park, NC, 1995.

[3] Yan Duan, Marcin Andrychowicz, Bradly Stadie, Jonathan Ho, Jonas Schneider, Ilya Sutskever, Pieter Abbeel, and Wojciech Zaremba. One-shot imitation learning. arXiv preprint arXiv:1703.07326, 2017.

[4] Dieter Fox, Wolfram Burgard, and Sebastian Thrun. The dynamic window approach to collision avoidance. IEEE Robotics and Automation Magazine, 4(1):23–33, 1997.

[5] Brian P Gerkey and Kurt Konolige. Planning and control in unstructured terrain. In ICRA Workshop on Path Planning on Costmaps, 2008.

[6] Shixiang Gu, Ethan Holly, Timothy Lillicrap, and Sergey Levine. Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3389–3396. IEEE, 2017.

[7] Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, pages 1109–1117, 2016.

[8] Mohammad Khan, Shakir Mohamed, Benjamin Marlin, and Kevin Murphy. A stick-breaking likelihood for categorical data analysis with latent Gaussian models. In Artificial Intelligence and Statistics, pages 610–618, 2012.

[9] Matteo Leonetti, Luca Iocchi, and Peter Stone. A synthesis of automated planning and reinforcement learning for efficient, robust decision-making. Artificial Intelligence, 241:103–130, 2016.

[10] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. International Conference on Learning Representations, 2016.

[11] Yishu Miao, Edward Grefenstette, and Phil Blunsom. Discovering discrete latent topics with neural variational inference. In International Conference on Machine Learning, pages 2410–2419, 2017.

[12] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

[13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[14] Fereshteh Sadeghi and Sergey Levine. (CAD)2RL: Real single-image flight without a single real image. Robotics: Science and Systems, 2017.

[15] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[16] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.

[17] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[18] Reid Simmons and Sven Koenig. Probabilistic robot navigation in partially observable environments. In International Joint Conference on Artificial Intelligence, volume 95, pages 1080–1087, 1995.

[19] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[20] Lei Tai, Giuseppe Paolo, and Ming Liu. Virtual-to-real deep reinforcement learning: Continuous control of mobile robots for mapless navigation. Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, 2017.

[21] Gerald Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.

[22] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[23] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[24] Linhai Xie, Sen Wang, Andrew Markham, and Niki Trigoni. Towards monocular vision based obstacle avoidance through deep reinforcement learning. arXiv preprint arXiv:1706.09829, 2017.

[25] Ali Yahya, Adrian Li, Mrinal Kalakrishnan, Yevgen Chebotar, and Sergey Levine. Collective robot reinforcement learning with distributed asynchronous guided policy search. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 79–86. IEEE, 2017.

[26] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 3357–3364. IEEE, 2017.