Neural SLAM: Learning to Explore with External Memory

06/29/2017 · Jingwei Zhang et al. · University of Freiburg & The Hong Kong University of Science and Technology

We present an approach for agents to learn representations of a global map from sensor data, to aid their exploration in new environments. To achieve this, we embed procedures mimicking that of traditional Simultaneous Localization and Mapping (SLAM) into the soft attention based addressing of external memory architectures, in which the external memory acts as an internal representation of the environment. This structure encourages the evolution of SLAM-like behaviors inside a completely differentiable deep neural network. We show that this approach can help reinforcement learning agents to successfully explore new environments where long-term memory is essential. We validate our approach in both challenging grid-world environments and preliminary Gazebo experiments. A video of our experiments can be found at:




1 Introduction

1.1 Cognitive Mapping

Studies of animal navigation have shown that the hippocampus plays an important role [O’keefe and Nadel1978] [McNaughton et al.2006] [Collett and Graham2004]. It performs cognitive mapping that combines path integration and visual landmarks, giving animals sophisticated navigation capabilities that go beyond reflexive behaviors based only upon the immediate information they perceive.

Similarly, to successfully navigate and explore new environments in a timely fashion, intelligent agents would benefit from having their own internal representation of the environment while traversing it, so as to go beyond reactive actions based only on the most recent sensory input. Traditional robotics has thus developed a series of methods, such as simultaneous localization and mapping (SLAM), localization in a given map, path planning, and motion control, to enable robots to complete such challenging tasks [Thrun, Burgard, and Fox2005] [LaValle2006] [Latombe1991]. These individual components have been well studied and understood as separate parts, but here we view them as a unified system and attempt to embed SLAM-like procedures into a neural network, such that SLAM-like behaviors may emerge out of the course of reinforcement learning agents exploring new environments. This jointly learned system could then benefit from each individual component (localization, mapping and planning) adapting in awareness of the others' existence, instead of the components being rigidly combined as in traditional methods. Also, in this paper we represent this system with a completely differentiable deep neural network, ensuring the learned representation is distributed and feature-rich, a property that rarely comes with traditional methods but is key to robust and adaptive systems [Bengio2013].

1.2 External Memory

The memory structure in traditional recurrent neural networks (RNNs) such as long short-term memory networks (LSTMs) is ultimately short-term, which is not sufficient for developing informative navigation or exploration strategies. For the network to have an internal representation of the environment, i.e., its own cognitive map, an external memory architecture [Graves, Wayne, and Danihelka2014] [Graves et al.2016] is required. Having an external memory beside a deep network separates the learning of computation algorithms from the storage of data over long time-scales. This is essential for learning successful exploration strategies: if computation and memory are mixed together in the weights of the network, then as memory demands increase over time, the expressive capacity of the network is likely to decrease [Graves et al.2016].

Besides the neural Turing machine (NTM) [Graves, Wayne, and Danihelka2014] and the differentiable neural computer (DNC) [Graves et al.2016], another branch of work on external memory architectures for deep networks studies memory networks. However, memory networks as in [Oh et al.2016] [Sukhbaatar et al.2015] do not learn what to write to the memory, which is not sufficient for our task, since the network is expected to learn to map the environment onto its external memory to aid planning.

The Neural Map proposed in [Parisotto and Salakhutdinov2017] adapted the external memory of [Graves, Wayne, and Danihelka2014] as a form of structured map storage for an agent learning to navigate. However, they do not utilize the structure of this memory, as all their operations could be conducted as if the memory were a flat vector. Furthermore, they assume the location of the agent is always known, so that it can write exactly to the corresponding location in the memory while traveling through the maze, a prerequisite that can rarely be met in real-world scenarios.

1.3 Embedding Classic Models into Deep Neural Networks

Embedding domain-specific structures into neural networks has been explored in many works. Unlike methods that treat networks as pure black-box approximators, and thus cannot benefit from the valuable prior knowledge accumulated over the years (akin to forcing a boy to deduce all the laws of physics from his own observations instead of giving him a physics textbook to learn from), this line of work biases deep models toward learning representations containing the structures that we already know would benefit specific domains.

[Tamar et al.2016] embedded the value iteration procedures into a single network, forcing the network to learn representations following the well-defined policy-evaluation, policy-improvement loop, while benefiting from the feature-rich representations from deep architectures. [Gupta et al.2017] went one step further by using the Value Iteration Network as the planning module inside a visual navigation system. They treat an internal part of the network as an egocentric map and apply motion on it. [Fischer et al.2015] added a cross-correlation layer to compute correlations of features of corresponding neighboring cells between subsequent frames, which explicitly provides the network with matching capabilities. This greatly helps the learning of optical flow since the optical flow is computed based on local pixel dynamics. [Zhang et al.2016] forced the network to learn representative features across tasks by explicitly embedding structures mimicking the computation procedures of successor feature reinforcement learning into the network, and their resulting architecture is able to transfer navigation policies across similar environments.

Traditionally, when well-established models are used in a combined system with other modules, they do not benefit from the other components, because their behaviors cannot adapt accordingly: those models come out of deduction rather than evolving out of learning (directly applying such well-established traditional models is like directly giving the boy all the answers to his physics questions instead of giving him the textbook from which to learn to solve them). If those functionalities are instead learned along with the other components, their behaviors can influence each other, and the system can potentially obtain performance beyond that of directly combining well-established models.

Let us take SLAM as an example. SLAM is used as a building block in complicated autonomous systems to aid navigation and exploration, yet the SLAM model and the path planning algorithms are developed individually, without taking each other into account. [Bhatti et al.2016] augmented the state space of their reinforcement learning agent with the output of a traditional SLAM algorithm. Although this improves the navigation performance of the agent, it still suffers from the issues discussed above, since SLAM is rigidly combined into their architecture. If SLAM-like behaviors can instead be encouraged to evolve out of the process of agents learning to navigate or explore, the resulting system will be much more deeply integrated as a whole, with each individual component influencing, and benefiting from learning alongside, the others. The SLAM model of the resulting system would evolve out of the need for exploration or navigation, not purely for performing SLAM. Additionally, when learned with deep neural networks, the resulting models are naturally feature-rich, which is rarely a property of traditional well-established models.

Although a number of works have utilized deep reinforcement learning algorithms for autonomous navigation [Mirowski et al.2016] [Zhu et al.2016] [Zhang et al.2016] [Gupta et al.2017] [Tai, Paolo, and Liu2017], none of them has an explicit external memory architecture to equip the agent with the capability of making long-term decisions based on an internal representation of a global map. Also, these works mainly focus on learning to navigate to a target location, while in this paper we attempt the more challenging task of learning to explore new environments under a time constraint, for which an effective long-term memory mechanism is essential.

Following these observations, we embed the motion prediction step and the measurement update step of SLAM into our network architecture by utilizing the soft attention based addressing mechanism of [Graves, Wayne, and Danihelka2014], biasing the write/read operations towards traditional SLAM procedures and treating the external memory as an internal representation of the map of the environment. We train this model using deep reinforcement learning algorithms to encourage the evolution of SLAM-like behaviors during the course of exploration.

1.4 Exploration in Unknown Environments

Effective exploration capabilities are required for intelligent agents to perform tasks like surveillance, rescue and sample collection [Shen, Michael, and Kumar2012]. Traditional techniques for exploration include information-gain-based approaches and goal assignment using coverage maps or occupancy grid maps [Stachniss2009]. However, such techniques require building and maintaining accurate maps of the environment for the agent to memorize the already explored areas, in which loop closure plays an important role.

[Mirowski et al.2016] added loop closure detection, along with depth prediction, as auxiliary tasks to provide additional supervision signals when training reinforcement learning agents in environments with only sparse rewards. Specifically, their loop closure detection is trained via supervised learning by integrating the ground-truth velocities of the agent, which are not accessible in real-world scenarios. In contrast, in our model the loop closure is learned implicitly via an embedded SLAM structure; our strategy requires less input information and depends less on ground-truth supervision. Additionally, we tested our approach in the Gazebo environment [Koenig and Howard], which is more realistic with respect to the underlying physics and the sensor noise than the simulated environment used in [Mirowski et al.2016]. Compared with traditional SLAM-based methods, our strategy also eliminates the need to build and maintain an expensive map for each new environment: a single forward pass through the trained model, which runs on CPU, gives out planning decisions. This enables our agent to cope with the limited memory and processing capabilities of robotic platforms.

2 Methods

2.1 Background

Figure 1: Visualization of the Neural-SLAM model architecture (we intentionally use blue for the components in charge of computation, green for memory, and cyan for a mixture of both.)

We formulate the exploration task as a Markov decision process (MDP), in which the agent interacts with the environment through a sequence of observations, actions and rewards. At each time step $t$, the agent receives an observation $o_t$ (in this paper, a vector of laser ranges) of the true state $s_t$ of the environment. The agent then selects an action $a_t$ based on a policy $\pi(a_t|s_t)$, which corresponds to a motion command for the agent to execute. The agent then receives a reward signal $r_t$ and transits to the next state $s_{t+1}$. The goal of the agent is to maximize the expected cumulative future reward, where $\gamma \in (0, 1]$ is the discount factor:

$$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k} \quad (1)$$


Recent successes in deep reinforcement learning represent the value functions or the policies with deep neural networks. In this paper we utilize the asynchronous advantage actor-critic (A3C) algorithm [Mnih et al.2016], in which both the policy $\pi(a_t|s_t; \theta_p)$ and the value function $V(s_t; \theta_v)$ are represented by deep neural function approximators, parameterized by $\theta_p$ and $\theta_v$ respectively (we note that $\pi$ and $V$ share all parameters except for their output layers (Sec. 2.4)). Those parameters are updated using the following gradients, where $R_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n}; \theta_v)$ with $n$ being the rollout step, $H$ is the entropy of the policy, and $\beta$ is the coefficient of the entropy regularization term:

$$d\theta_p \leftarrow d\theta_p + \nabla_{\theta_p} \log \pi(a_t|s_t; \theta_p)\big(R_t - V(s_t; \theta_v)\big) + \beta\, \nabla_{\theta_p} H\big(\pi(\cdot|s_t; \theta_p)\big) \quad (2)$$

$$d\theta_v \leftarrow d\theta_v + \partial\big(R_t - V(s_t; \theta_v)\big)^2 / \partial \theta_v \quad (3)$$


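To make the bootstrapped return concrete, the n-step return used in these gradients can be computed from a rollout as in the following sketch (plain Python; the function name and defaults are our own, not from the paper):

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted n-step returns R_t for one rollout.

    rewards: list of rewards r_t collected over the rollout.
    bootstrap_value: value estimate V(s_{t+n}) used to bootstrap
    past the end of the truncated rollout.
    """
    R = bootstrap_value
    returns = []
    for r in reversed(rewards):  # accumulate from the end of the rollout
        R = r + gamma * R
        returns.append(R)
    return list(reversed(returns))

# Example: only the final step of a 3-step rollout is rewarded.
rs = n_step_returns([0.0, 0.0, 1.0], bootstrap_value=0.0, gamma=0.5)
# rs == [0.25, 0.5, 1.0]
```

Each entry is then paired with the corresponding value prediction to form the advantage $R_t - V(s_t; \theta_v)$.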
2.2 Neural SLAM Architecture

As discussed previously, we require our model to have an external memory structure that the agent can utilize as an internal representation of the environment. Thus, we add an external memory chunk $M$ of size $H \times W \times C$ (containing $H \times W$ memory slots, with $C$ channels or features per slot), which the network accesses via a write head and a read head. (We note that our work can easily be extended to multiple write/read heads, but in this paper we investigate only one write head and one read head. We also observe that the number of heads can be viewed as analogous to the number of particles in a particle filter [Thrun, Burgard, and Fox2005].)

At each time step, we feed our input directly to an LSTM cell, which outputs a hidden state $h_t$. This hidden state is then used by each head to emit a set of control variables $\{k_t, \beta_t, g_t, s_t, \gamma_t\}$ (each write head additionally emits $\{e_t, a_t\}$) through a set of linear layers. The write head and the read head then each compute their access weight ($w^w_t$ and $w^r_t$, both of size $H \times W$) based on those control variables. The write head then uses its access weight along with $e_t$ and $a_t$ to write to the memory $M_t$, while the read head accesses the updated memory with its access weight to output a read vector $r_t$. Finally, $h_t$ and $r_t$ are concatenated to compute the final output: a policy distribution $\pi$ and an estimated value $V$, which are then used to calculate gradients to update the whole model according to Equ. 2 and Equ. 3.

The Neural-SLAM Model Architecture is shown in Fig. 1, and we will describe the operations in each component in detail in the following section.

2.3 Embedded SLAM Structure

We use the same addressing mechanism for computing the access weights of the write head ($w^w_t$) and the read head ($w^r_t$), except that the read head addressing happens after the write head updates the external memory, so the read head accesses the memory of the current time step. Below we describe the computations in detail; we refer to both access weights at time step $t$ simply as $w_t$.

Prior Belief

We view the access weights of the heads as their current beliefs. We assume that the initial pose of the agent is known at the beginning of each episode, and that the sensing range of its onboard sensor is known a priori. We then initialize the access weight with a Gaussian kernel centered on the initial pose, filling the whole sensing area and summing to 1; all other cells are assigned zero weight. The external memory is initialized with all zeros (we discuss this choice of initialization value in more detail in Sec. 2.5).

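As a minimal sketch of this prior belief (NumPy; the kernel width `sigma` and the circular sensing mask are our own illustrative choices):

```python
import numpy as np

def init_access_weight(h, w, pose, sensing_radius, sigma=1.0, eps=1e-12):
    """Gaussian prior belief centered on the known initial pose.

    Cells inside the sensing area get Gaussian mass; all others a
    negligible weight. The whole map is normalized to sum to 1.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - pose[0]) ** 2 + (xs - pose[1]) ** 2
    weight = np.where(d2 <= sensing_radius ** 2,
                      np.exp(-d2 / (2.0 * sigma ** 2)),
                      eps)
    return weight / weight.sum()

w0 = init_access_weight(8, 8, pose=(4, 4), sensing_radius=2)
```

With the memory zeroed and the weight initialized this way, the agent starts each episode maximally uncertain about the map but certain about its own pose.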
Localization & Motion Prediction

At each time step, we first perform a motion prediction by applying the motion command $a_{t-1}$ the agent received from the last time step onto its access weight from the last time step ($f$ here can be any motion model):

$$\hat{w}_t = f(w_{t-1}, a_{t-1}) \quad (4)$$


Note that since we view our external memory not as an egocentric map but as a global map, we need to first localize on the access weight before the motion model can be applied. Thus, we localize by first identifying the center of mass of the current access weight matrix as the position of the agent, then choosing the direction with the largest sum of weights within the corresponding sensing area as its orientation.

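The localization step can be sketched as follows (NumPy; reducing the orientation choice to the four map halves is a simplification of the sensing-area rule described above, and all names are ours):

```python
import numpy as np

def localize(weight):
    """Estimate a pose from an access weight matrix.

    Position: center of mass of the weights. Orientation: the cardinal
    direction whose side of the map carries the most weight.
    """
    h, w = weight.shape
    ys, xs = np.mgrid[0:h, 0:w]
    total = weight.sum()
    cy = float((weight * ys).sum() / total)
    cx = float((weight * xs).sum() / total)
    sums = {
        "up": weight[ys < cy].sum(),
        "down": weight[ys > cy].sum(),
        "left": weight[xs < cx].sum(),
        "right": weight[xs > cx].sum(),
    }
    return (cy, cx), max(sums, key=sums.get)

belief = np.zeros((5, 5))
belief[1, 2], belief[2, 2] = 0.4, 0.6
pos, heading = localize(belief)  # pos == (1.6, 2.0), heading == "down"
```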
Data Association

Each head emits a key vector $k_t$ of length $C$, which is compared to each slot of the external memory under a similarity measure $K$ (in this paper we use the cosine similarity, Equ. 6) to compute a content-based access weight $w^c_t$ from the data association scores (each head also outputs a key strength scalar $\beta_t$ to increase or decrease the focus of attention):

$$w^c_t(i) = \frac{\exp\big(\beta_t\, K(k_t, M_t(i))\big)}{\sum_j \exp\big(\beta_t\, K(k_t, M_t(j))\big)} \quad (5)$$

$$K(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|} \quad (6)$$


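This content-based addressing follows the NTM formulation and can be sketched as (NumPy; the memory is flattened to N slots of C features, and all names are ours):

```python
import numpy as np

def content_weight(memory, key, beta):
    """Softmax over cosine similarities between a key and all memory slots.

    memory: (N, C) array of slots; key: (C,); beta: key strength scalar.
    """
    eps = 1e-8
    sims = memory @ key / (np.linalg.norm(memory, axis=1)
                           * np.linalg.norm(key) + eps)
    logits = beta * sims
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

M = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
w_c = content_weight(M, key=np.array([1.0, 0.0]), beta=5.0)
# The first slot matches the key best and receives the largest weight.
```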
Measurement Update

We then perform a measurement update by the following steps.

First, the content-based access weight $w^c_t$ from this time step and the access weight after motion prediction $\hat{w}_t$ are interpolated using an interpolation gate scalar $g_t$ generated by each head:

$$w^g_t = g_t\, w^c_t + (1 - g_t)\, \hat{w}_t \quad (7)$$


Then, a shift operation is applied based on the shift kernel $s_t$ emitted by each head (in this paper $s_t$ defines a normalized distribution over a small local area), to account for the noise in motion and measurement. This shift operation can be viewed as a convolution over the access weight matrix, with $s_t$ as the convolution kernel:

$$\tilde{w}_t = w^g_t * s_t \quad (8)$$


Finally, the smoothing effect of the shift operation is compensated with a sharpening scalar $\gamma_t \geq 1$:

$$w_t(i) = \frac{\tilde{w}_t(i)^{\gamma_t}}{\sum_j \tilde{w}_t(j)^{\gamma_t}} \quad (9)$$



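The three measurement-update steps can be chained as in the following sketch (NumPy; the circular 3 × 3 shift is an illustrative choice for the kernel's support, not necessarily the paper's):

```python
import numpy as np

def interpolate(w_content, w_pred, g):
    """Blend the content-based weight with the motion-predicted weight."""
    return g * w_content + (1.0 - g) * w_pred

def shift(weight, kernel):
    """Convolve the weight matrix with a 3x3 shift kernel (circularly)."""
    out = np.zeros_like(weight)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += kernel[dy + 1, dx + 1] * np.roll(
                np.roll(weight, dy, axis=0), dx, axis=1)
    return out

def sharpen(weight, gamma):
    """Re-focus the weight after the smoothing of the shift operation."""
    p = weight ** gamma
    return p / p.sum()

w = np.zeros((4, 4)); w[1, 1] = 1.0
identity = np.zeros((3, 3)); identity[1, 1] = 1.0  # no-op shift kernel
w_out = sharpen(shift(interpolate(w, w, 0.5), identity), gamma=2.0)
# With an identity kernel and identical inputs, the belief is unchanged.
```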
The write head additionally generates two vectors (both containing $C$ elements): an erase vector $e_t$ and an add vector $a_t$. Along with its access weight $w^w_t$, the write head accesses and updates the external memory with the following operations:

$$\tilde{M}_t(i) = M_{t-1}(i)\big[\mathbf{1} - w^w_t(i)\, e_t\big] \quad (10)$$

$$M_t(i) = \tilde{M}_t(i) + w^w_t(i)\, a_t \quad (11)$$


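A sketch of the erase-then-add write (NumPy, with the memory flattened to N slots of C features; names are ours):

```python
import numpy as np

def write(memory, w_write, erase, add):
    """NTM-style write: erase then add, both scaled by the write weight.

    memory: (N, C); w_write: (N,); erase, add: (C,), erase entries in [0, 1].
    """
    erased = memory * (1.0 - np.outer(w_write, erase))
    return erased + np.outer(w_write, add)

M = np.ones((2, 3))
M_next = write(M, w_write=np.array([1.0, 0.0]),
               erase=np.ones(3), add=np.full(3, 0.5))
# Slot 0 is fully rewritten to 0.5; slot 1 is untouched.
```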
2.4 Planning

After the memory has been updated to $M_t$, it is accessed by the read head using its access weight $w^r_t$ to output a read vector $r_t$ (which can be seen as a summary of the current internal map):

$$r_t = \sum_i w^r_t(i)\, M_t(i) \quad (12)$$


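The read operation is simply a weighted sum over slots, e.g. (NumPy, flattened slots as before):

```python
import numpy as np

def read(memory, w_read):
    """Read vector r_t: expectation of memory slots under the read weight.

    memory: (N, C); w_read: (N,), non-negative and summing to 1.
    """
    return w_read @ memory

r = read(np.array([[1.0, 2.0], [3.0, 4.0]]), np.array([0.5, 0.5]))
# r == [2.0, 3.0]
```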
This read vector is then concatenated with the hidden state $h_t$ and fed into two linear layers (parameterized by $\theta_p$ and $\theta_v$ respectively) to give out the policy distribution and the value estimate:

$$\pi(a_t|s_t; \theta_p) = \mathrm{softmax}\big(\theta_p\, [h_t; r_t]\big) \quad (13)$$

$$V(s_t; \theta_v) = \theta_v\, [h_t; r_t] \quad (14)$$


$\pi$ and $V$ are subsequently used to calculate the losses for on-policy deep reinforcement learning, as discussed in Sec. 2.1. During training, an action is drawn from the multinomial distribution defined by $\pi$; during evaluation and testing, the greedy action is taken.

2.5 Read-out Map from External Memory

As previously mentioned, we view the external memory as an internal representation of the environment for the agent. More specifically, we treat the values stored in $M$ as the log odds representation of occupancy used in occupancy grid mapping techniques [Thrun, Burgard, and Fox2005]. Following this representation, we can recover the occupancy probability of all the grids (i.e., the slots of $M$) with the equation below:

$$p\big(m(i)\big) = 1 - \frac{1}{1 + \exp\big(M(i)\big)} \quad (15)$$

At the beginning of each episode, we set all the values in $M$ to $0$, corresponding to an occupancy probability of $0.5$, to represent maximum uncertainty. We note that Equ. 15 is identical to a sigmoid operation, so the sigmoid function is used in our implementation for this map read-out operation.

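Since the read-out reduces to a sigmoid, it is a one-liner (NumPy; the function name is ours):

```python
import numpy as np

def occupancy_from_log_odds(memory):
    """Occupancy probability per slot: 1 - 1/(1+exp(M)) == sigmoid(M)."""
    return 1.0 / (1.0 + np.exp(-memory))

probs = occupancy_from_log_odds(np.array([0.0, 4.0, -4.0]))
# probs ≈ [0.5, 0.982, 0.018]: zero log odds is maximum uncertainty.
```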
Following this formulation, one possible extension of our method would be to use $M$ to compute the exploration reward as an internal reward signal for the agent, eliminating the need to receive rewards from the ground-truth map (for example, using the information gain from $M_{t-1}$ to $M_t$ as the reward; we refer to Sec. 3.1 for a detailed description of our reward structure).

3 Experiments

3.1 Experimental Setup

Figure 2: Visualization of a sample trajectory of a trained Neural-SLAM agent successfully completing exploration in a new environment. The agent is visualized as a grey grid with a black rectangle at its center pointing in its current orientation. Obstacles are shown as black grids, free space as white grids, and grey grids indicate unexplored areas. The world clears up as the agent explores with its sensor (the sensor cannot see through walls or across sharp angles), whose sensing area is shown by red bounding boxes (the information in the red bounding box is the input to the network). An exploration is completed when the agent has cleared up all reachable grids, in which case the current episode is considered terminated and solved. An episode is also terminated (but not considered solved) when the maximum number of steps is reached.
Figure 3: Comparison of the average reward obtained and the number of episodes solved within the step limit during evaluation, between an A3C agent (1 LSTM, motion command directly concatenated into the input), an A3C-Nav1 agent (2 stacked LSTMs, motion command directly concatenated into the input), an A3C-Nav2 agent (2 stacked LSTMs, motion command concatenated with the output of the first LSTM, then input into the second LSTM), an A3C-Ext agent (1 LSTM with an external memory, motion command concatenated with the output of the LSTM then fed to the external memory architecture; like Neural-SLAM but without the Localization & Motion Prediction step), and our Neural-SLAM agent (incorporates the motion command through an explicit motion model, as discussed in Sec. 2.3). We train continuously over curriculum courses of increasing world size.




Figure 4: Visualization of the world view, write head weights, memory (we note that this visualization is the output of the map read-out operation (Sec. 2.5)), and read head weights during one exploration run of our trained Neural-SLAM agent (Fig. 3(a), 3(b), 3(c), 3(d)) and a trained A3C-Ext agent (Fig. 3(e), 3(f), 3(g), 3(h)) in one of the simplest environments.

We first test our algorithm in simulated grid world environments. We use a curriculum learning strategy to train agents to explore randomly generated environments of increasing size (we ensure all the free grids are connected when generating environments). At the beginning of each episode, the agent is randomly placed in a randomly generated grid world, during both training and evaluation. It has a fixed-size sensing area (we note that this simulated laser sensor cannot see through walls or across sharp angles), as shown in Fig. 2.

The agent can take an action out of {, , , }. It receives a reward of for each step it takes before completing the exploration task, for colliding with obstacles, and for the completion of an exploration. Also during the course of exploration, the agent receives a reward of for each new grid it clears up. (We note that this requires a ground truth map, but such a map is only needed during training to provide exploration rewards, while a map is not needed during execution.)

At each time step, the agent receives a sensor reading of size , which is then fed into the network, along with the action it selected from the last time step. We train the network the same way as A3C [Mnih et al.2016] and deploy training processes purely on CPU, optimized with the ADAM optimizer [Kingma and Ba2015] with shared statistics across all training processes, with a learning rate of . We also use a weight decay of since we find this to be essential to stabilize training when combining external memory architectures with A3C. We set the rollout step to be and the maximum number of steps for each episode to be .

We experimented with four baseline agents as comparisons for our Neural-SLAM agent: 1) A3C: an A3C agent with one LSTM cell and no external memory architecture; the action from the last time step is concatenated with the current sensor readings as the input to the network. 2) A3C-Nav1: an A3C agent with two stacked LSTM cells; the last action is fed into the first LSTM cell. 3) A3C-Nav2: same as A3C-Nav1 except that the last action is fed into the second LSTM cell (we note that this agent is very similar to the one proposed by [Mirowski et al.2016], except that we feed only the selected action, not the true velocity and the reward, to the LSTM, since that information is usually not available during execution). 4) A3C-Ext: an A3C agent with one LSTM cell, which also interacts with an external memory accessed using the same approach as described in Sec. 2.3 (we note that the largest map size the agents are trained on is smaller than the address space of the external memory, which we initialize larger for the generalization tests on larger map sizes discussed in Sec. 3.3). However, unlike our Neural-SLAM agent, where the previous action is applied onto the memory through an explicit motion model, no motion prediction step (Sec. 2.3) is executed for the A3C-Ext agent; the previous action is simply concatenated with the output of the LSTM and fed to the external memory structure.

3.2 Grid World Experiments

We conducted experiments in the simulated grid world environment, training the four baseline agents and our Neural-SLAM agent continuously over a curriculum of courses. We observe that our Neural-SLAM agent shows relatively consistent and stable performance across all courses (Fig. 3). In particular, our agent can still successfully and reliably explore in the course where the environments contain more complex structures for which effective long-term memory is essential (the A3C-Nav2 agent failed to learn this course in all of the multiple training runs we conducted). We visualize the memory addressing in Fig. 4 and observe that the write head addressing weight converges to a focused attention centered around the current pose of the agent, while the read head addressing weight learns to spread out over the entire world area, so that the resulting read vector can summarize the current memory for the agent to make planning decisions. We note that the memory and the weights of the write/read heads are all initialized to the larger size used in the generalization tests of Sec. 3.3; the agent is nonetheless able to constrain its writing and reading to the correct size of the map it is currently traveling in (within the red bounding boxes).

3.3 Generalization Tests

In the experiments discussed above, the agents are trained across 3 courses of increasing world size. We conducted additional experiments on a set of 50 pre-generated worlds of size 16 × 16 to test the generalization capabilities of the trained agents. Specifically, we deploy the following agents onto the same set of 50 worlds, starting from the same position in each world: a Random agent, which always selects random actions and whose performance can be viewed as a measure of the complexity of the tasks; a trained A3C-Nav2 agent, as this agent is the most similar to the model proposed by [Mirowski et al.2016] and has been shown to generalize its navigation capabilities across different environments; and our trained Neural-SLAM agent. We note that no step limit is set for the Random agent, while for both the A3C-Nav2 and Neural-SLAM agents an episode is terminated if the agent has not finished the exploration task within 750 steps. The experimental results are summarized in Table 1.

We can observe from Table 1 that these generalization tasks are relatively challenging, as the Random agent takes an average of 5531.6 steps to finish an episode. We can also see that the A3C-Nav2 agent has almost no capability to generalize to much larger environments. We suspect that, although the two stacked LSTMs enable the A3C-Nav2 agent to memorize its odometry, so that it does not travel back to places it has recently visited, the lack of an external memory to store its perception of the world makes it difficult for it to navigate to unexplored areas far outside of its current vicinity. The Neural-SLAM agent, by contrast, embeds a SLAM structure within the planning module and is able to construct an internal representation of the world, which enables it to identify, and plan paths to, unexplored areas that may be relatively far away. These different behaviors can be observed in the supplementary video.

3.4 Gazebo Experiments

Agent         Steps                  Reward                Success Ratio
Random        5531.600 ± 4299.554    -596.644 ± 505.436    -
A3C-Nav2       682.980 ±  201.075     -15.345 ±  11.209    5/50
Neural-SLAM    174.920 ±  174.976      13.732 ±   9.839    46/50
Table 1: Testing statistics for the generalization experiment (mean ± standard deviation), showing the performance of Random (random actions), A3C-Nav2 (very similar to the model proposed by [Mirowski et al.2016]), and Neural-SLAM (ours), each evaluated on the same set of 50 randomly generated worlds of size 16 × 16. The maximum number of steps per episode was 750 for both the A3C-Nav2 agent and our Neural-SLAM agent.

We also experimented with a simple world built in Gazebo, using a slightly different reward structure: a small step cost, a collision penalty, a completion reward, and an exploration reward scaled down relative to the grid world experiments. We deploy parallel learners using Docker for training our Neural-SLAM agent. From the experimental results shown in Fig. 5, we can see that our Neural-SLAM agent is able to solve the task effectively. We also deployed the trained agent in new Gazebo environments to test its generalization performance and observed that it is still able to complete exploration efficiently (these experiments are shown in the supplementary video).

Figure 5: Gazebo experiments (rollout step: 50; maximum steps per episode: 2500).

4 Conclusions and Future Work

We propose an approach to provide deep reinforcement learning agents with long-term memory capabilities by utilizing external memory access mechanisms. We embed SLAM-like procedures into the soft-attention-based addressing to bias the write/read operations towards SLAM-like behaviors. Our method provides the agent with an internal representation of the environment, guiding it to make informative planning decisions to effectively explore new environments. Several interesting extensions could emerge from our work: using the internal reward discussed in Sec. 2.5, evaluating our approach in more challenging environments, conducting real-world experiments, and experimenting with higher-dimensional inputs.