One of the motives for introducing robots was to provide a safe means of access and operation in environments that are hazardous or unreachable for humans. But very often, these environments destabilize or partially damage the robot, impairing it and leading to mission failure or a significant drop in performance. This is especially critical for robots deployed in manufacturing industries and warehouses (Khatib, 2005), search and rescue missions (Murphy, 2004) and disaster response (Nagatani et al., 2013). While humans and animals handle partial damage by learning alternate ways to perform an action, this kind of learning in robots requires what we call intelligence. Hence, the objective while designing robotic devices is not restricted to avoiding or tackling obstacles; it also includes the agent's adaptation in the presence of adversaries, both in the form of internal damage and external effects.
Deep Reinforcement Learning (Deep RL) has been shown to be effective in modeling such navigation problems because of both its online and offline learning capabilities in high-dimensional search spaces (Chatzilygeroudis et al., 2018; Hwangbo et al., 2017; Pinto et al., 2017a; Lobos-Tsunekawa et al., 2018). In the context of adapting to damage, offline learning means training a robust policy before the robot is deployed, while online learning means adapting at the time of damage. But the environments and agents in these situations are very complex, so retraining the RL policy every time either of them changes is highly impractical. This points to the need for an efficient control architecture that helps the agent adapt under varying adversarial conditions.
To this end, several approaches have tried to learn multiple policies at training time and then choose among them at the time of damage. However, models which have made progress in this domain require the agent to be reset to its initial state (Cully et al., 2015), or multiple hardware trials to be performed to help the agent recover or adapt (Cully et al., 2015; Bongard and Lipson, 2004; Koos et al., 2013). Although this is intuitive, it is inefficient considering the overhead of choosing from a set of high-performing gaits. To make a smart recovery decision, an alternative is for the agent to understand the damage first and then use that damage awareness to act optimally.
We thus propose Damage Aware-Proximal Policy Optimization (DA-PPO), combining damage diagnosis with deep reinforcement learning. The control architecture first performs damage diagnosis on multiple damage cases using a supervised learning network based on Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). It uses the difference between the gaits of a (simulated) healthy robot and a damaged robot as input and classifies the damage that has occurred, if any. The diagnosed damage is combined with the current observation vector to create an augmented observation space, which contains information about both the state space observation and the damage. This augmented observation is used to train our RL model, which is optimized using Proximal Policy Optimization (PPO) (Schulman et al., 2017). The trained model is able to understand the damage that has occurred and choose its gait accordingly. Since only a single policy is learnt, there is no overhead of storing and choosing between multiple policies, making our algorithm effective in real time.
The major objectives of our work are:
To create a deep reinforcement learning based control architecture for enabling locomotory agents to accomplish mission-critical tasks even in the presence of single or multiple internal damages.
To optimize the control architecture so that the agent adapts its gait in a single hardware trial.
2. Related Work
2.1. Automated Recovery in Robotics
Preliminary work on automated recovery in robots was based on evolutionary algorithms and generally divided the process into two phases: damage estimation and recovery. This necessitated that a healthy robot's simulation always be available to the physical robot, so as to estimate the damage. This estimate would then help create a neural controller, during the exploration or recovery phase, which can handle the damage. The neural controller is passed to the robot and used for adaptation. The algorithm introduced in (Bongard and Lipson, 2004) was one of the first to propose an automatic and continuous information flow between a physical robot and its simulation, wherein the robot provides its current state information. The simulator updates its own state using this information and provides the robot with neural controllers to handle its state or damage. The major advantage was that the recovery method did not have to be created directly on the physical robot, so the number of trials required to recover was drastically reduced.
Extending this work, Koos et al. (Koos et al., 2013) also created a self-diagnosis model. The main difference between the two works is the use of an undamaged self-model of the robot to discover behaviors, rather than constantly updating the model based on diagnosis. Although the intuition behind these approaches is correct and applicable even today, the use of evolutionary algorithms makes these methods inefficient.
2.2. Map-based Algorithms for Adaptation
Algorithms based on behavior-performance maps (Cully et al., 2015; Chatzilygeroudis et al., 2018) rely on the assumption that knowledge of the cause of damage, i.e., a proper diagnosis report, is not necessary to recover from the damage. Rather than considering two separate phases for damage diagnosis and recovery algorithm generation, Cully et al. (Cully et al., 2015) proposed a method inspired by animals, which perform trial and error to determine the least painful alternate gait in the presence of injury. The approach put forward in this work, Intelligent Trial and Error (ITE), relies on a behavior-performance map. This map enables the robot to try multiple behaviors which are predicted to perform well. Based on the trials conducted and their results, the estimated performance values are updated in the map. The process converges when the best possible behavior has been found. Even in the absence of damage, the high-performing behaviors are expected to be useful.
The implementation of this idea uses Gaussian processes (Rasmussen and Williams, 2005) and a Bayesian optimization procedure (Mockus, 1989; Borji and Itti, 2013) to choose which gaits or behaviors to try at the time of damage, by maximizing the performance function over the behavior-performance space. The selected gait is tested on the robot and its performance is recorded, which then updates the expected performance of that gait. This select-test-update loop continues until the right behavior is obtained.
Inspired by ITE, Chatzilygeroudis et al. (Chatzilygeroudis et al., 2018) proposed a more optimized version of the algorithm. Reset-free Trial and Error (RTE) builds on the fact that some of the high-performing policies which work on an intact robot should also work on a damaged robot, which holds mainly in complex robotic systems like humanoids or multi-legged robots. Similar to ITE, RTE pre-computes a behavior-performance map using MAP-Elites (Mouret and Clune, 2015). It learns the robot's model, especially when it is damaged, and uses Monte Carlo Tree Search (Chaslot et al., 2008) to compute the next best action for the current state of the robot. The method also uses a probabilistic model to incorporate the uncertainty of predictions and uses this data to correct the outcome of each action on the damaged robot (Silver et al., 2016; T, 2013). This combination of algorithms ensures that no reset is required when damage occurs.
A significant drawback of the previous two methods is the huge complexity overhead due to the use of Gaussian processes, as well as their inability to work on dynamic unknown terrains.
2.3. Handling Environmental Adversaries
Adversarial forces on robots are not limited to physical damages. There can also be environmental factors which hinder normal robotic locomotion. Several methods have been proposed to deal with these kinds of adversaries. Robust Adversarial Reinforcement Learning (RARL) (Pinto et al., 2017b) concentrates on ensuring the stability of an agent in the presence of an adversary which is trying to destabilize it. It is based on the assumption that environmental changes between training and testing, such as a change in the coefficient of friction of the floor, can also be modelled as an adversary acting on some part of the agent's body.
The algorithm reduces to a min-max game where the adversary tries to minimize the reward of the concerned Markov Decision Process (MDP) and the protagonist tries to maximize it. The proposed method of achieving this is to alternate between training the policies of the adversary and the protagonist for a fixed number of iterations until convergence.
Another approach, introduced in (Kume et al., 2017), enables adaptation to both environmental adversaries and physical or internal damage of the robot. The major difference from previous works like ITE and RTE is the existence of a multi-policy mapping for a single behavior in place of a single policy. Map-based Multi-Policy Reinforcement Learning (MMPRL), proposed in this work, trains many different policies by combining a behavior-performance map with deep reinforcement learning. It searches for and stores these multiple policies while maximizing the expected reward. MMPRL saves all possible policies with different behavioral features, making it extremely fast and adaptable.
2.4. Domain Randomization
Some recent works have also experimented with randomization of simulation environments through domain and dynamics randomization (Tobin et al., 2017; Peng et al., 2017), so as to bridge the gap between simulation and the real world. The idea is to create numerous variations of the simulation environment so that the real world appears as just another sample from a rich distribution of training samples. In (Tobin et al., 2017), the authors experimented with object localization for the purpose of grasping in a cluttered environment. They showed impressive results, randomizing in the visual domain to transfer learning from simulation to the real world without requiring real-world training images. On the other hand, in (Peng et al., 2017), the authors randomized the dynamics of the environment, such as mass, damping factor and friction coefficient, and showed that the policy learned in such a dynamic environment is quite robust to calibration errors in the real world.
While most map-based methods are able to adapt over a wide range of damages, the computational overhead of creating the behaviour-performance map is a significant drawback. In ITE and RTE, the complexity is further increased by the Gaussian process computations. Moreover, all these approaches require multiple hardware trials to adapt to a damage. We incorporate the domain randomization approach in the context of damages, so that damages in the real world are just another variation of the training samples. We further improve this approach by presenting a single-hardware-trial control loop for diagnosing the damage.
3. Proposed Approach
3.1. Overview
We consider the following scenario: A robot has been damaged while in a remote and hazardous environment. We require the robot to reach the destination by adapting its gait so as to overcome the damage. Rather than making the agent dependent on a pre-computed set of high performing gaits, it should be able to identify and adapt to its damage autonomously.
Thus we propose a self-diagnose network which can predict the type of damage that has occurred in the structure of the robot. With this damage awareness, we use an augmented observation space for learning a well-performing policy through a modified version of Proximal Policy Optimization (PPO), which we call Damage Aware-Proximal Policy Optimization (DA-PPO). In our work, we assume that internal damages, unlike environmental adversaries, do not change constantly. Thus, we only need to perform the self-diagnosis step for determining the damage class whenever the reward drops drastically below a certain threshold, indicating that damage has occurred.
3.2. Self-Diagnose Network
In the min-max game approach put forward in RARL (Pinto et al., 2017b), the technique fails to generalize when the adversary changes the damage at every time step. This is a hard problem, since the policy has no feedback mechanism for judging the performance of the action taken in the last time step. We therefore propose a self-diagnose network, an LSTM (Hochreiter and Schmidhuber, 1997) based predictive model which classifies the type of damage that has occurred in the robot using continuous feedback from its gait. In (Bongard and Lipson, 2004), the authors used the difference between the behaviours of the simulated robot and the physical robot, in terms of forward displacement, to classify damages. We extend this idea by measuring the difference in sensor values between the two for a fixed number of time steps. This results in a time series, and our problem reduces to classifying the damage from this data. More specifically, the on-board computer of the robot can run a simulation of a healthy robot and compare its gait with the actual steps taken. Based on the difference between the two, the network can diagnose the class of damage (see Fig. 1).
Since this time series is multivariate and high-dimensional, we use LSTM hidden units, which are powerful and increasingly popular models for learning from sequential data (Greff et al., 2017). Algorithm 1 describes the sample collection process in detail. The healthy and damaged robot environments are represented by E_h and E_d respectively. Both environments are run from the same initial state, and the difference between their observation spaces is collected continuously for a fixed number of time steps (represented here as T). For any environment, this results in a matrix of size T × n (where n is the observation space size for that environment), and this matrix represents a single data point. These data points act as training data, whose labels are the damage classes with which the simulation was run. The whole process is repeated N times to obtain multiple data points. Note that π* represents an expert policy which has been pretrained on a healthy robot.
The network is trained using data obtained through the sample collection step explained in Algorithm 1. This step is parallelized and thus does not act as a bottleneck for the entire algorithm. The self-diagnose network can be queried on demand to determine the damage class within a single trial, as shown in Fig. 1.
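As a sketch of the sample collection step, the following compares a healthy and a damaged rollout driven by the same expert policy and collects the observation differences. The `reset`/`step`/`policy` interfaces are hypothetical stand-ins for the paper's simulated environments, and whether both rollouts share the same actions is an assumption here:

```python
import numpy as np

def collect_sample(env_healthy, env_damaged, policy, T):
    """Roll out the expert policy in the healthy and damaged environments
    from the same initial state, and return the T x n matrix of
    observation differences (one training point for the classifier)."""
    obs_h = env_healthy.reset()
    obs_d = env_damaged.reset()
    diffs = []
    for _ in range(T):
        # The expert policy was pretrained on the healthy robot only.
        action = policy(obs_h)
        obs_h = env_healthy.step(action)
        obs_d = env_damaged.step(action)
        diffs.append(np.asarray(obs_h) - np.asarray(obs_d))
    return np.stack(diffs)  # shape (T, n)
```

Each returned matrix is labeled with the damage class used to configure the damaged environment, and the loop is repeated N times per class.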
3.3. Encoding of Damage Indicators
The self-diagnose network predicts the damage class of the robot, which can act as additional state information about the environment. We thus concatenate it with the observation space of the original robot to form, what we call, an augmented observation space.
This necessitates encoding the output of the classifier so that the policy efficiently learns various gaits in accordance with the damage. If a random encoding scheme is used to create the augmented observation space, the algorithm perceives the encoding as noise and completely ignores it during policy learning. We have thus used a partial one-hot encoding, which is observed to work well in practice, as the damage information is not lost during training.
In our experiments, we have limited the number of damages that can occur simultaneously to two and have assumed that only one damage can occur on a limb at a time. The number of damage classes can thus be calculated as the sum of the no-damage case, the single-damage cases and the multiple-damage cases occurring at distinct limbs. This is given by:

$$N_{\mathrm{classes}} = 1 + nd + \binom{n}{2}d^2 \qquad (1)$$

where n represents the number of limbs in the agent and d represents the number of damage types considered.
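The class count can be checked numerically; the helper name below is ours, not the paper's:

```python
from math import comb

def num_damage_classes(n_limbs, n_damage_types):
    """Eq. (1): one no-damage class, plus single damages on each limb,
    plus damage pairs on distinct limbs."""
    d = n_damage_types
    return 1 + n_limbs * d + comb(n_limbs, 2) * d ** 2

# Matches the class ranges used later (classes 0..32 and 0..72):
assert num_damage_classes(4, 2) == 33  # Ant: 4 limbs, 2 damage types
assert num_damage_classes(6, 2) == 73  # Hexapod: 6 limbs, 2 damage types
```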
The encoded vector is of length 2n, where the damage of limb i is represented by the values at indices 2i and 2i+1 in the encoded vector. Thus, we have a tuple of size 2 associated with each limb, where (0, 0) represents no damage, (1, 0) represents damage type 1 and (0, 1) represents damage type 2 at that limb. Note that the tuple (1, 1) can be used if we remove the assumption that two types of damages can't occur together at a single limb. Furthermore, the tuple size can be increased to model more types of damages.
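The partial one-hot scheme above can be sketched as a small helper (the function name and damage-map representation are our own illustration):

```python
def encode_damage(damages, n_limbs):
    """Partial one-hot damage encoding: two slots per limb.
    `damages` maps limb index -> damage type (1 or 2)."""
    enc = [0] * (2 * n_limbs)
    for limb, dtype in damages.items():
        # (1, 0) marks damage type 1, (0, 1) marks damage type 2.
        enc[2 * limb + (dtype - 1)] = 1
    return enc

# Ant (4 limbs): type-1 damage on limb 0 and type-2 damage on limb 2.
assert encode_damage({0: 1, 2: 2}, 4) == [1, 0, 0, 0, 0, 1, 0, 0]
assert encode_damage({}, 4) == [0] * 8  # healthy robot
```

This encoded vector is what gets concatenated with the observation vector to form the augmented observation space.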
3.4. Proximal Policy Optimization
Since our task is that of continuous action control, we formulate it as a reinforcement learning problem: starting from an initial state s_0, the agent chooses an action a_t and obtains the next state s_{t+1} and reward r_t at each timestep t, while maximizing the expected sum of rewards by updating the parameters θ of the parameterized stochastic policy π_θ. Large-scale optimization methods are less widespread in continuous action spaces, and an attractive option for such problems is policy gradient algorithms (Silver et al., 2014). Proximal Policy Optimization is a simplified version of Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a). It improves upon the stability of policy gradient methods by allowing multiple updates on a minibatch of on-policy data. This is implemented by limiting the KL divergence between the updated policy and the policy from which the data was sampled. TRPO achieves this with a hard optimization constraint, which is computationally expensive; PPO approximates TRPO by using a soft constraint. The original paper (Schulman et al., 2017) proposes two methods for implementing this soft constraint: an adaptive KL loss penalty and a clipped surrogate loss function.
PPO represents the ratio between the new policy and the old policy as:

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$

The objective functions can be (Schulman et al., 2017):

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\right)\right]$$

$$L^{\mathrm{KLPEN}}(\theta) = \hat{\mathbb{E}}_t\left[r_t(\theta)\hat{A}_t - \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t)\right]\right]$$
Both these objective functions stabilize training by constraining the policy change at each step, keeping the gradient local so that large steps are not taken between iterations. Additionally, we use Generalized Advantage Estimation (GAE) (Schulman et al., 2015b) for computing the advantage function. In our implementation of PPO, we use a combination of the clipping loss and the adaptive KL penalty for locomotion tasks. The hyperparameters for the same are mentioned in Section 4.3.
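As a concrete illustration of the clipped surrogate term, a per-sample NumPy version might look like the following (a sketch, not the TF-Agents implementation used in our experiments):

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective (to be maximized):
    mean over samples of min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(ratio * advantage, clipped))

# A ratio far above 1+eps with positive advantage is clipped to 1.2 * A:
assert np.isclose(clipped_surrogate(np.array([2.0]), np.array([1.0])), 1.2)
# Within the trust region, the objective is simply r * A:
assert np.isclose(clipped_surrogate(np.array([1.1]), np.array([1.0])), 1.1)
```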
3.5. Damage Aware Proximal Policy Optimization
With the self-diagnose network in place, we can now run the policy learning algorithm on the augmented observation space, which encapsulates both the environment state (through the observation vector) and damage awareness (through the damage encoding vector). We use the PPO algorithm for policy learning on the augmented observation space, where o_t is the observation at timestep t, a_t is the action taken according to policy π_θ, and E_d is the environment in which the damage has occurred (see Fig. 1). Note that we only run the self-diagnose network when the reward during a run falls below a certain threshold. At other times, the damage is considered to be the same as diagnosed in the last run.
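The resulting control loop can be sketched as follows; the environment interface, diagnosis callable and threshold value are hypothetical placeholders for the components described above:

```python
def run_episode(env, policy, diagnose, encoded_damage, reward_threshold):
    """One DA-PPO rollout: act on the augmented observation; if the
    episode reward falls below the threshold, re-run self-diagnosis
    and refresh the damage encoding for subsequent runs."""
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        # Augmented observation = state observation + damage encoding.
        action = policy(list(obs) + list(encoded_damage))
        obs, reward, done = env.step(action)
        total_reward += reward
    if total_reward < reward_threshold:
        # A reward collapse suggests (new) damage: diagnose once.
        encoded_damage = diagnose(env)
    return total_reward, encoded_damage
```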
4. Experimental Setup
4.1. Simulation Setup
To evaluate our approach, we have conducted experiments in two environments: Ant, a quadrupedal locomotory robot, and Hexapod, a six-legged locomotory robot. We have used the OpenAI Gym toolkit (Brockman et al., 2016) for performing simulations, in combination with the MuJoCo physics engine (Todorov et al., 2012). Ant is an already implemented environment in OpenAI Gym, while Hexapod is implemented using the configuration and model described in ITE (Cully et al., 2015).
The two environments used in our experiments are discussed below:
Ant (Quadrupedal bot)
: Ant is a simple quadrupedal robot with 12 degrees of freedom (DoF) and 8 torque-actuated joints. Each joint has a maximum flexion and extension of 30 degrees from its neutral setting and carries a force and torque sensor. The observation includes joint angles, angular velocities, the positions of all structural elements with respect to the center of mass, and the force and torque sensor outputs of each joint, forming a 111-dimensional vector. The target action values are the motor torque values, which are limited to the range -1.0 to 1.0. We limit an episode to at most 1000 timesteps; the episode ends whenever it crosses this limit, the robot falls over, or the robot jumps above a certain height. The reward function R is defined as follows:

$$R = \Delta x + R_s - w_1 n_c - w_2 \lVert a \rVert^2$$

where Δx is the distance covered by the robot in the current time step since the previous time step, and R_s is the survival reward, which is 1 on survival and 0 if the episode is terminated by the aforementioned conditions. The variable n_c is the number of legs making contact with the ground, a is the vector of target joint angles (the actions), and w_i is the weight of each component, with w_1 = 0.5, w_2 = 0.5.
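A reward of this shape can be sketched as below; the exact composition of the original reward term is reconstructed from the surrounding description, so treat the formula as indicative rather than definitive:

```python
import numpy as np

def ant_reward(dx, survived, n_contact, actions, w1=0.5, w2=0.5):
    """Forward progress plus survival bonus, penalized by ground
    contacts and control effort (reconstructed reward shape)."""
    survival = 1.0 if survived else 0.0
    return dx + survival - w1 * n_contact - w2 * float(np.sum(np.square(actions)))
```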
Hexapod: There are three actuators on each leg of the Hexapod. In the neutral position, the height of the robot is 0.2 meters. The actions are the joint angle positions of all 18 joints, which range from -0.785 to 0.785 radians. The observation space of the agent is a 53-dimensional vector consisting of the positions and velocities of all the joints as well as the center of mass. Along with this, the observation space contains boolean values from touch sensors which indicate whether a leg is making contact with the ground or not. Again, we limit an episode to at most 1000 timesteps; the episode ends whenever the robot falls over, jumps above a certain height, or crosses the time limit.
The reward function R is defined as follows:

$$R = \Delta x + R_s - w_1 n_c - w_2 \textstyle\sum F - w_3 \lVert a \rVert^2$$

where Δx is the distance covered by the robot in the current time step since the previous time step, and R_s is the survival reward, which is 0.1 on survival and 0 if the episode is terminated by the aforementioned conditions. The variable n_c represents the number of legs making contact with the ground, F is the vector of squared sums of external forces and torques on each joint, a is the vector of target joint angles (the actions), and w_i is the weight of each component, with w_1 = 0.03, w_2 = 0.0005, and w_3 = 0.05.
4.2. Damage Simulation
Since both environments considered in our experiments are simulated in OpenAI Gym, the damages are implemented by changing the XML files of the 3D models. This can be done on the fly without affecting parallel-running experiments. In our work, we have broadly simulated two kinds of internal damage:
Jamming of joint such that it can’t move irrespective of the amount of torsional force applied by the motor at that joint.
Missing toe, i.e., lower limb of the robot breaks off.
In MuJoCo environments, these damages are implemented as follows:
For the Ant, jamming of a joint is modelled by restricting the angle range of the concerned joint to -0.1 to 0.1 degrees from the default value of -30 to 30 degrees.
A missing toe is modelled by shrinking the lower limb size to 0.01 from the original value of 0.8.
The original angle range of the hexapod is -45 to 45 degrees; this is restricted to -0.1 to 0.1 degrees when jamming of a joint is modelled.
A missing toe on any of the 6 legs of the hexapod is modelled by reducing the lower limb size to 0.01 from 0.07 in the healthy robot.
There are touch sensors on each lower limb of the hexapod. Thus, whenever a lower limb breaks off, we consider that the corresponding touch sensor stops giving any signal, and its output is taken to be 0.
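Since the damages are introduced by rewriting the MuJoCo model XML, a minimal sketch of the joint-jamming edit might look as follows. The `joint`/`range` element and attribute names follow standard MuJoCo models, while the joint name and helper are hypothetical:

```python
import xml.etree.ElementTree as ET

def jam_joint(model_xml, joint_name, jammed_range="-0.1 0.1"):
    """Return a copy of the MuJoCo model XML in which the named
    joint's angle range is collapsed, effectively jamming it."""
    root = ET.fromstring(model_xml)
    for joint in root.iter("joint"):
        if joint.get("name") == joint_name:
            joint.set("range", jammed_range)
    return ET.tostring(root, encoding="unicode")

model = '<mujoco><body><joint name="hip_1" range="-30 30"/></body></mujoco>'
damaged = jam_joint(model, "hip_1")
assert 'range="-0.1 0.1"' in damaged
```

The modified XML can then be reloaded to instantiate the damaged environment for a given damage class.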
4.3. Hyperparameter Details
For the self-diagnose network, we take as input a matrix of size T × n (the observation differences collected over T timesteps), followed by an embedding layer with embedding size 512 and an LSTM layer with 32 hidden units. After this, we stack three dense layers of sizes 256, 128 and 64, along with dropouts, so as to reduce overfitting. The output layer uses softmax as its activation so that it outputs class probabilities. The loss function and optimizer used are categorical cross-entropy and Adam respectively. For the Ant and Hexapod environments, the possible classes range from 0 to 32 and 0 to 72 respectively, as calculated from Equation 1.
For policy learning with PPO, we use the implementation from (Guadarrama et al., 2018). For both the value function and the policy function, we use the same network configuration, with hidden layer sizes 100, 200 and 100. The Adam optimizer was used for both neural networks. The GAE gamma value is taken as 0.995 and lambda as 0.98. The clipping range is kept at 0.2 and the adaptive KL target is initialized to 0.01. The Adam learning rate and KL target value are adjusted dynamically during training. Moreover, we trained the value function on the combination of the current batch and the previous batch to stabilize training.
5. Results and Discussion
We evaluate the performance of our approach with respect to its two components: (1) the self-diagnose network, which predicts the class of damage, and (2) DA-PPO, which learns to adapt its policy given that a particular damage has occurred.
5.1. Self-Diagnose Network
For the comparison of performance, we consider different numbers of rollouts (the amount of data to train on), lengths of history to look back on (timesteps), and choices of observation data, i.e., our proposed approach of using the difference of observations between the healthy and damaged runs versus the observations of only the damaged run. Table 1 summarizes the validation accuracy across these parameters. We observe that classifying using fewer timesteps results in faster diagnosis, but at the expense of accuracy. Moreover, classification using the difference between observation vectors as input outperforms the use of observations from the damaged run alone in all cases. However, if there is a constraint on the computation power of the robot's on-board computer, the latter approach can be preferred over the former.
Table 1. Classification accuracy in the Ant environment, by timesteps, method, and number of rollouts.
Table 1 (continued). Classification accuracy in the Hexapod environment, by timesteps, method, and number of rollouts.
5.2. Damage Aware-Proximal Policy Optimization
We start by creating a baseline model for comparison. We define a model using a PPO policy which is trained on runs with damaged robots but without the augmented observation space (i.e., without explicit knowledge of the damage class), and call it PPO-Unaware. This is analogous to a policy implementing domain randomization in the damage space but without a feedback loop. Our proposed model, which uses the Damage Aware PPO policy, is called DA-PPO. The performance metric used is the forward reward of the agent, averaged across all the damage classes. Fig. 2 shows the training curve comparison between PPO-Unaware and DA-PPO in the Ant and Hexapod environments (Fig. 2(a), 2(b)). DA-PPO shows a 60.7% improvement in average forward reward in the Ant environment, while in the Hexapod environment there is a 31.5% reward gain over PPO-Unaware.
For the Hexapod environment, we also use curriculum learning (Bengio et al., 2009), progressively training on more difficult cases. We implement this by increasing the percentage of damage classes in the training examples and progressively increasing the severity of damages (by including multiple damages). In Fig. 2(b), each piece-wise curve represents a stage (I, II, III or IV) of the curriculum learning process. Stage I has 100% healthy cases; stage II has 60% healthy and 40% single-damage cases; stage III has 70% healthy and single-damage cases and 30% multiple-damage cases; and stage IV has all damages equally likely. In this way, we were able to encourage faster learning progress.
We also perform a per-class performance analysis of the two approaches across the various damage classes in both Ant and Hexapod (see Fig. 4, 5). In the Ant environment, DA-PPO performs better in 82.84% of damage classes compared to PPO-Unaware. Comparing across damage classes, DA-PPO adapts particularly well (in terms of reward improvement over PPO-Unaware) when damages occur on opposite limbs, as compared to damages occurring on adjacent limbs. In the Hexapod environment, DA-PPO performs better in 72.6% of damage classes compared to PPO-Unaware (see Fig. 4). This shows that being damage aware results in a significant improvement in performance in the presence of adversaries.
6. Conclusion
We have proposed and implemented a two-part control architecture for robotic damage adaptation. This is particularly useful when robots are used in hazardous environments, where human intervention is nearly impossible.
Our approach enables the agent to autonomously identify and understand the damage that has occurred in its physical structure and to adapt its gait accordingly. Since the ultimate goal is the creation of intelligent machines, understanding the damage is as important as adapting to it, which has often been overlooked in past works.
Compared with map-based approaches, DA-PPO does not require any map generation phase, so the initial training time is much lower. This is further enhanced by the fact that our approach adapts to the damage in a single trial, without trying multiple well-performing gaits or having to be reset to the initial state to perform the trial.
Our work can also easily scale to a larger number of damage classes. Since no differentiation is made regarding the cause of damage, adaptation is possible for both morphological and external damages. Also, in the case of unknown damages, the network is expected to predict the damage class which most resembles the actual damage and to choose a gait accordingly. This implies a very low rate of complete failure. We intend to study this further in future work.
Future work will focus on extending the algorithm to handle environmental adversaries, which is desirable since real-world environments are not predictable. We also intend to extend DA-PPO to complex and dynamic environments using SLAM (Durrant-Whyte and Bailey, 2006). Finally, we plan to prove the method's effectiveness by applying it to a physical robot.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum Learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09). ACM, New York, NY, USA, 41–48. https://doi.org/10.1145/1553374.1553380
- Bongard and Lipson (2004) J. C. Bongard and H. Lipson. 2004. Automated damage diagnosis and recovery for remote robotics. In IEEE International Conference on Robotics and Automation, 2004. Proceedings. ICRA ’04. 2004, Vol. 4. 3545–3550 Vol.4. https://doi.org/10.1109/ROBOT.2004.1308802
- Borji and Itti (2013) Ali Borji and Laurent Itti. 2013. Bayesian optimization explains human active search. In Advances in Neural Information Processing Systems 26. Curran Associates, Inc., 55–63.
- Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. 2016. OpenAI Gym. CoRR abs/1606.01540 (2016). arXiv:1606.01540 http://arxiv.org/abs/1606.01540
- Chaslot et al. (2008) Guillaume Chaslot, Sander Bakkes, István Szita, and Pieter Spronck. 2008. Monte-Carlo Tree Search: A New Framework for Game AI. In AIIDE.
- Chatzilygeroudis et al. (2018) Konstantinos Chatzilygeroudis, Vassilis Vassiliades, and Jean-Baptiste Mouret. 2018. Reset-free Trial-and-Error Learning for Robot Damage Recovery. Robotics and Autonomous Systems 100 (2018), 236–250. https://www.sciencedirect.com/science/article/pii/S0921889017302440
- Cully et al. (2015) Antoine Cully, Jeff Clune, Danesh Tarapore, and Jean-Baptiste Mouret. 2015. Robots that can adapt like animals. Nature 521, 7553 (28 May 2015), 503–507. https://doi.org/10.1038/nature14422
- Durrant-Whyte and Bailey (2006) Hugh Durrant-Whyte and Tim Bailey. 2006. Simultaneous Localisation and Mapping (SLAM): Part I The Essential Algorithms. IEEE Robotics and Automation Magazine 2 (2006), 2006.
- Greff et al. (2017) Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, 10 (2017), 2222–2232.
- Guadarrama et al. (2018) Sergio Guadarrama, Anoop Korattikara, Oscar Ramirez, Pablo Castro, Ethan Holly, Sam Fishman, Ke Wang, Ekaterina Gonina, Neal Wu, Chris Harris, Vincent Vanhoucke, and Eugene Brevdo. 2018. TF-Agents: A library for Reinforcement Learning in TensorFlow. https://github.com/tensorflow/agents. [Online; accessed 25-June-2019].
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
- Hwangbo et al. (2017) J. Hwangbo, I. Sa, R. Siegwart, and M. Hutter. 2017. Control of a Quadrotor With Reinforcement Learning. IEEE Robotics and Automation Letters 2, 4 (Oct 2017), 2096–2103. https://doi.org/10.1109/LRA.2017.2720851
- Khatib (2005) Bruno Siciliano and Oussama Khatib. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
- Koos et al. (2013) Sylvain Koos, Antoine Cully, and Jean-Baptiste Mouret. 2013. Fast damage recovery in robotics with the T-resilience algorithm. The International Journal of Robotics Research 32, 14 (2013), 1700–1723. https://doi.org/10.1177/0278364913499192
- Kume et al. (2017) Ayaka Kume, Eiichi Matsumoto, Kuniyuki Takahashi, Wilson Ko, and Jethro Tan. 2017. Map-based Multi-Policy Reinforcement Learning: Enhancing Adaptability of Robots by Deep Reinforcement Learning. CoRR abs/1710.06117 (2017). http://arxiv.org/abs/1710.06117
- Lobos-Tsunekawa et al. (2018) K. Lobos-Tsunekawa, F. Leiva, and J. Ruiz-del-Solar. 2018. Visual Navigation for Biped Humanoid Robots Using Deep Reinforcement Learning. IEEE Robotics and Automation Letters 3, 4 (Oct 2018), 3247–3254. https://doi.org/10.1109/LRA.2018.2851148
- Mockus (1989) Jonas Mockus. 1989. Bayesian Approach to Global Optimization. Springer Netherlands.
- Mouret and Clune (2015) Jean-Baptiste Mouret and Jeff Clune. 2015. Illuminating search spaces by mapping elites. CoRR abs/1504.04909 (2015). arXiv:1504.04909 http://arxiv.org/abs/1504.04909
- Murphy (2004) R. R. Murphy. 2004. Trial by fire [rescue robots]. IEEE Robotics Automation Magazine 11, 3 (Sep. 2004), 50–61. https://doi.org/10.1109/MRA.2004.1337826
- Nagatani et al. (2013) Keiji Nagatani, Seiga Kiribayashi, Yoshito Okada, Kazuki Otake, Kazuya Yoshida, Satoshi Tadokoro, Takeshi Nishimura, Tomoaki Yoshida, Eiji Koyanagi, Mineo Fukushima, and Shinji Kawatsuma. 2013. Emergency Response to the Nuclear Accident at the Fukushima Daiichi Nuclear Power Plants Using Mobile Rescue Robots. J. Field Robot. 30, 1 (Jan. 2013), 44–63. https://doi.org/10.1002/rob.21439
- Peng et al. (2017) Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. 2017. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. CoRR abs/1710.06537 (2017). arXiv:1710.06537 http://arxiv.org/abs/1710.06537
- Pinto et al. (2017a) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. 2017a. Robust Adversarial Reinforcement Learning. ICML (2017). https://arxiv.org/abs/1703.02702
- Pinto et al. (2017b) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. 2017b. Robust Adversarial Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning (Proceedings of Machine Learning Research), Vol. 70. PMLR, 2817–2826. http://proceedings.mlr.press/v70/pinto17a.html
- Rasmussen and Williams (2005) Carl Edward Rasmussen and Christopher K. I. Williams. 2005. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press.
- Schulman et al. (2015a) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015a. Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on Machine Learning (Proceedings of Machine Learning Research), Francis Bach and David Blei (Eds.), Vol. 37. PMLR, Lille, France, 1889–1897. http://proceedings.mlr.press/v37/schulman15.html
- Schulman et al. (2015b) John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2015b. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438 (2015).
- Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347 (2017).
- Silver et al. (2016) D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (2016), 484–489.
- Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. 2014. Deterministic Policy Gradient Algorithms. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (ICML’14). JMLR.org, I–387–I–395. http://dl.acm.org/citation.cfm?id=3044805.3044850
- Hester (2013) Todd Hester. 2013. The TEXPLORE Algorithm. Springer, Heidelberg.
- Tobin et al. (2017) Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. 2017. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 23–30.
- Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. 2012. MuJoCo: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (2012), 5026–5033.