Learning to Navigate in Cities Without a Map

by Piotr Mirowski, et al.

Navigating through unstructured environments is a basic capability of intelligent creatures, and thus is of fundamental interest in the study and development of artificial intelligence. Long-range navigation is a complex cognitive task that relies on developing an internal representation of space, grounded by recognisable landmarks and robust visual processing, that can simultaneously support continuous self-localisation ("I am here") and a representation of the goal ("I am going there"). Building upon recent research that applies deep reinforcement learning to maze navigation problems, we present an end-to-end deep reinforcement learning approach that can be applied on a city scale. Recognising that successful navigation relies on integration of general policies with locale-specific knowledge, we propose a dual pathway architecture that allows locale-specific features to be encapsulated, while still enabling transfer to multiple cities. We present an interactive navigation environment that uses Google StreetView for its photographic content and worldwide coverage, and demonstrate that our learning method allows agents to learn to navigate multiple cities and to traverse to target destinations that may be kilometres away. A video summarizing our research and showing the trained agent in diverse city environments as well as on the transfer task is available at: goo.gl/ESUfho.



1 Introduction

(a) Diverse views and corresponding local maps in Street View.
(b) Street View regions used in this study.
Figure 1: (a) Our environment is built of real-world places from Street View (we illustrate Times Square and Central Park in New York City and St. Paul’s Cathedral in London). The green cone represents the agent’s location and orientation. (b) We use large regions of London and Paris and in New York we focus on 5 different regions to show transfer.

The subject of navigation is attractive to various research disciplines and technology domains alike, being at once a subject of inquiry from the point of view of neuroscientists wishing to crack the code of grid and place cells banino2018vector ; cueva2018emergence , as well as a fundamental aspect of robotics research. The majority of algorithms involve building an explicit map during an exploration phase and then planning and acting via that representation. In this work, we are interested in pushing the limits of end-to-end deep reinforcement learning for navigation by proposing new methods and demonstrating their performance in large-scale, real-world environments. Just as humans can learn to navigate a city without relying on maps, GPS localisation, or other aids, it is our aim to show that a neural network agent can learn to traverse entire cities using only visual observations. In order to realise this aim, we designed an interactive environment that uses the images and underlying connectivity information from Google Street View, and propose a dual pathway agent architecture that can navigate within the environment (see Fig. 


Learning to navigate directly from visual inputs has been shown to be possible in some domains, by using deep reinforcement learning (RL) approaches that can learn from task rewards, for instance navigating to a destination. Recent research has demonstrated that RL agents can learn to navigate house scenes zhu_icra2017 ; wu2018building , mazes (e.g. mirowski2016learning ), and 3D games (e.g. lample_aaai17 ). These successes notwithstanding, deep RL approaches are notoriously data inefficient and sensitive to perturbations of the environment, and are better known for their successes in games and simulated environments than in real-world applications. It is therefore not obvious that they can be used for large-scale visual navigation based on real-world images, and hence this is the subject of our investigation.

The primary contributions of this paper are (a) to present a new RL challenge that features real world visual navigation through city-scale environments, and (b) to propose a modular, goal-conditional deep RL algorithm that can solve this task, thus providing a strong baseline for future research. StreetLearn (dataset: http://streetlearn.cc; code: https://github.com/deepmind/streetlearn) is a new interactive environment for reinforcement learning that features real-world images as agent observations, with real-world grounded content that is built on top of the publicly available Google Street View. Within this environment we have developed a traversal task that requires that the agent navigates from goal to goal within London, Paris and New York City. To evaluate the feasibility of learning in such an environment, we propose an agent that learns a goal-dependent policy with a dual pathway, modular architecture with similarities to the interchangeable task-specific modules approach from devin2017learning , and the target-driven visual navigation approach of zhu_icra2017 . The approach features a recurrent neural architecture that supports both locale-specific learning as well as general, transferable navigation behaviour. Balancing these two capabilities is achieved by separating a recurrent neural pathway from the general navigation policy of the agent. This pathway addresses two needs. First, it receives and interprets the current goal given by the environment, and second, it encapsulates and memorises the features and structure of a single city region. Thus, rather than using a map, or an external memory, we propose an architecture with two recurrent pathways that can effectively address a challenging navigation task in a single city as well as transfer to new cities or regions by training only a new locale-specific pathway.

2 Related Work

Reward-driven navigation in a real-world environment is related to research in various areas of deep learning, reinforcement learning, navigation and planning.

Learning from real-world imagery. Localising from only an image may seem impossible, but humans can integrate visual cues to geolocate a given image with surprising accuracy, motivating machine learning approaches. For instance, convolutional neural networks (CNNs) achieve competitive scores on the geolocation task weyand2016planet and CNN+LSTM architectures improve on this donahue2015long ; malinowski2017ask . CNNs have also been used for robust location-based image retrieval arandjelovic2016netvlad . Several methods berriel2018heading ; khosla2014looking , including DeepNav brahmbhatt2017deepnav , use datasets collected using Street View or Open Street Maps and solve navigation-related tasks using supervision. RatSLAM demonstrates localisation and path planning over long distances using a biologically-inspired architecture milford2004ratslam . The aforementioned methods rely on supervised training with ground truth labels: with the exception of the compass, we do not provide labels in our environment.

Deep RL methods for navigation. Many RL-based approaches for navigation rely on simulators which have the benefit of features like procedurally generated variations but tend to be visually simple and unrealistic beattie2016deepmind ; kempka2016vizdoom ; tessler_aaai17 . To support sparse reward signals in these environments, recent navigational agents use auxiliary tasks in training mirowski2016learning ; jaderberg2016reinforcement ; lample_aaai17 . Other methods learn to predict future measurements or to follow simple text instructions dosovitskiy2016learning ; hill2017understanding ; hermann2017grounded ; chaplot2017gated ; in our case, the goal is designated using proximity to local landmarks. Deep RL has also been used for active localisation chaplot2018active . Similar to our proposed architecture, zhu_icra2017 show goal-conditional indoor navigation with a simulated robot and environment.

To bridge the gap between simulation and reality, researchers have developed more realistic, higher-fidelity simulated environments dosovitskiy2017carla ; kolve2017ai2 ; shah2018airsim ; wu2018building . However, in spite of their increasing photo-realism, the inherent problems of simulated environments lie in the limited diversity of the environments and the antiseptic quality of the observations. Photographic environments have been used to train agents on short navigation problems in indoor scenes with limited scale chang2017matterport3d ; anderson2017vision ; bruce2017one ; mo2018adobeindoornav . Our real-world dataset is diverse and visually realistic, comprising scenes with vegetation, pedestrians or vehicles, diverse weather conditions and covering large geographic areas. However, we note that there are obvious limitations of our environment: it does not contain dynamic elements, the action space is necessarily discrete as it must jump between panoramas, and the street topology cannot be arbitrarily altered.

Deep RL for path planning and mapping. Several recent approaches have used memory or other explicit neural structures to support end-to-end learning of planning or mapping. These include Neural SLAM zhang2017neural that proposes an RL agent with an external memory to represent an occupancy map and a SLAM-inspired algorithm, Neural Map parisotto2017neural which proposes a structured 2D memory for navigation, Memory Augmented Control Networks khan2017memory , which uses a hierarchical control strategy, and MERLIN, a general architecture that achieves superhuman results in novel navigation tasks wayne2018unsupervised . Other work brunner2017teaching ; chaplot2018active explicitly provides a global map that is input to the agent. The architecture in gupta2017unifying uses an explicit neural mapper and planner for navigation tasks as well as registered pairs of landmark images and poses. Similar to gupta2017cognitive ; zhang2017neural , they use extra memory that represents the ego-centric agent position. Another recent work proposes a graph network solution savinov2018semi . The focus of our paper is to demonstrate that simpler architectures can explore and memorise very large environments using target-driven visual navigation with a goal-conditional policy.

3 Environment

This section presents an interactive environment, named StreetLearn, constructed using Google Street View, which provides a public API (https://developers.google.com/maps/documentation/streetview/). Street View provides a set of geolocated panoramic images which form the nodes of an undirected graph. We selected a number of large regions in New York City, Paris and London that contain between 7,000 and 65,500 nodes (and between 7,200 and 128,600 edges, respectively), have a mean node spacing of 10m, and cover a range of up to 5km (see Fig. 1(b)). We do not simplify the underlying connectivity, thus there are congested areas with complex occluded intersections, tunnels and footpaths, and other ephemera. Although the graph is used to construct the environment, the agent only sees the raw RGB images (see Fig. 1(a)).

3.1 Agent Interface and the Courier Task

(a) Goal description using landmarks.
(b) Comparison of architectures.
Figure 2: (a) In the illustration of the goal description, we show a set of 5 nearby landmarks and 4 distant ones; the code is a vector with a softmax-normalised distance to each landmark. (b) Left: GoalNav is a convolutional encoder plus policy LSTM with goal description input. Middle: CityNav is a single-city navigation architecture with a separate goal LSTM and optional auxiliary heading prediction. Right: MultiCityNav is a multi-city architecture with individual goal LSTM pathways for each city.

An RL environment needs to specify the start space, observations, and action space of the agent as well as the task reward. The agent has two inputs: the image, which is a cropped, square, RGB image that is scaled to a fixed size in pixels (i.e. not the entire panorama), and the goal description $g_t$. The action space is composed of five discrete actions: “slow” rotate left or right, “fast” rotate left or right, or move forward; this last action becomes a no-op if there is no edge in view from the current agent pose. If there are multiple edges in the view cone of the agent, then the most central one is chosen.
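The rule for the forward action can be sketched as follows. This is a minimal illustration, not the environment's actual code: the function name `forward_target`, the representation of edges as compass bearings, and the 30-degree half-width of the view cone are all assumptions made for the example.

```python
def forward_target(agent_heading_deg, edge_bearings_deg, half_cone_deg=30.0):
    """Pick the edge to follow on a "move forward" action.

    Returns the index of the edge whose bearing is closest to the centre of
    the agent's view cone, or None when no edge is in view (the no-op case).
    half_cone_deg is a hypothetical value chosen for this sketch.
    """
    best_index, best_offset = None, None
    for index, bearing in enumerate(edge_bearings_deg):
        # Smallest absolute angle between the edge and the view direction.
        offset = abs((bearing - agent_heading_deg + 180.0) % 360.0 - 180.0)
        if offset <= half_cone_deg and (best_offset is None or offset < best_offset):
            best_index, best_offset = index, offset
    return best_index


# An agent facing 350 degrees sees the edges at 10 and 340 degrees;
# the one at 340 degrees is more central.
chosen = forward_target(350.0, [10.0, 100.0, 340.0])
```

Wrapping the angular difference into the range [-180, 180) keeps the comparison correct when the heading and the edge bearing sit on either side of north.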

There are many options for how to specify the goal to the agent, from images to agent-relative directions, to text descriptions or addresses. We choose to represent the current goal in terms of its proximity to a set of fixed landmarks, specified using the Lat/Long (latitude and longitude) coordinate system. To represent a goal, we take a softmax over the distances to the landmarks (see Fig. 2(a)): for distances $\{d_k\}_k$ from the goal to the landmarks, the goal vector contains $g_i = \exp(-\alpha d_i) / \sum_k \exp(-\alpha d_k)$ for the $i$-th landmark, with a scaling coefficient $\alpha$ (which we chose through cross-validation). This forms a goal description with certain desirable qualities: it is a scalable representation that extends easily to new regions, it does not rely on any arbitrary scaling of map coordinates, and it has intuitive meaning, since humans and animals also navigate with respect to fixed landmarks. Note that landmarks are fixed per map and we used the same list of landmarks across all experiments; $g_t$ is computed using the distance to all landmarks, but by feeding these distances through a non-linearity, the contribution of distant landmarks is reduced to zero. In the Supplementary material, we show that the locally-continuous landmark-based representation of the goal performs as well as the linear scalar representation of the goal coordinates. Since the landmark-based representation performs well while being independent of the coordinate system and thus more scalable, we use this representation as canonical. Note that the goal description is not relative to the agent’s position and only changes when a new goal is sampled. Locations of the 644 manually defined landmarks in New York, London and Paris are given in the Supplementary material, where we also show that the density of landmarks does not impact the agent performance.
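A minimal sketch of the landmark-based goal description is given below, assuming great-circle (haversine) distances in metres and a softmax over negatively scaled distances. The helper names, the example coordinates and the value of `alpha` are placeholders for illustration; the text only states that the scale was chosen by cross-validation.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two Lat/Long points."""
    earth_radius_m = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_m * math.asin(math.sqrt(a))


def goal_descriptor(goal, landmarks, alpha=0.002):
    """Softmax over -alpha * distance from the goal to every landmark.

    alpha is an assumed placeholder scale, not a value taken from the text.
    """
    logits = [-alpha * haversine_m(goal[0], goal[1], lat, lon) for lat, lon in landmarks]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - shift) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Three illustrative landmarks; the goal coincides with the first one.
landmarks = [(40.7580, -73.9855), (40.7829, -73.9654), (40.7069, -74.0113)]
descriptor = goal_descriptor(landmarks[0], landmarks)
```

Because the exponential decays quickly with distance, landmarks several kilometres from the goal contribute almost nothing to the vector, which matches the behaviour described above.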

In the courier task, which we define as the problem of navigating to a series of random locations in a city, the agent starts each episode from a randomly sampled position and orientation. If the agent gets within 100m of the goal (approximately one city block), the next goal is randomly chosen and input to the agent. Each episode ends after 1000 agent steps. The reward that the agent gets upon reaching a goal is proportional to the shortest path between the goal and the agent’s position when the goal is first assigned; much like a delivery service, the agent receives a higher reward for longer journeys. Note that we do not reward agents for taking detours, but rather that the reward in a given level is a function of the optimal distance from start to goal location. As the goals get more distant during the training curriculum, per-episode reward statistics should ideally reach and stay at a plateau performance level if the agent can equally reach closer and further goals.
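The bookkeeping of the courier task can be sketched on a toy example. This is an illustrative simplification: the street graph is replaced by positions on a straight line, one distance function is used both for the shortest path and for the 100m test, and `reward_per_m` is a hypothetical proportionality constant.

```python
def courier_return(trajectory, sample_goal, dist_m,
                   goal_radius_m=100.0, max_steps=1000, reward_per_m=0.01):
    """Sum of goal rewards collected along one episode.

    trajectory: agent positions, one per step.
    sample_goal: callable returning the next goal position.
    dist_m(a, b): distance in metres (toy stand-in for the street graph).
    """
    goal = sample_goal()
    # The reward is fixed when the goal is assigned, from the distance at that time.
    pending = reward_per_m * dist_m(trajectory[0], goal)
    total = 0.0
    for position in trajectory[:max_steps]:
        if dist_m(position, goal) <= goal_radius_m:
            total += pending
            goal = sample_goal()
            pending = reward_per_m * dist_m(position, goal)
    return total


# Toy street: nodes 0..99 on a line, 10m apart; the agent walks forward.
goals = iter([50, 90, 0])
episode_return = courier_return(
    trajectory=list(range(100)),
    sample_goal=lambda: next(goals),
    dist_m=lambda a, b: abs(a - b) * 10.0,
)
```

In this toy run the agent reaches the first two goals and is then sent back towards node 0, which it never approaches, so only two rewards are collected.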

4 Methods

We formalise the learning problem as a Markov Decision Process, with state space $\mathcal{S}$, action space $\mathcal{A}$, environment $\mathcal{E}$, and a set of possible goals $\mathcal{G}$. The reward function depends on the current goal and state: $r_t = R(s_t, g_t)$. The usual reinforcement learning objective is to find the policy that maximises the expected return, defined as the sum of discounted rewards $\sum_{k \geq 0} \gamma^k r_{t+k}$ starting from state $s_t$ with discount $\gamma$. In this navigation task, the expected return from a state $s_t$ also depends on the series of sampled goals $\{g_k\}_k$. The policy is a distribution over actions given the current state $s_t$ and the goal $g_t$: $\pi(a \mid s, g) = \Pr(a_t = a \mid s_t = s, g_t = g)$. We define the value function to be the expected return for the agent that is sampling actions from policy $\pi$ from state $s_t$ with goal $g_t$: $V^{\pi}(s, g) = \mathbb{E}\left[\sum_{k \geq 0} \gamma^k r_{t+k} \mid s_t = s, g_t = g\right]$.
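As a small worked illustration of the return that the value function estimates, the sum of discounted rewards can be accumulated backwards over a finite reward sequence. The function name and the discount value used here are assumptions for the example only.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return, for every step t, the sum over k of gamma**k * rewards[t + k]."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A sparse reward received only at the final step, as when a goal is reached.
example = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
```

With a reward only at the last step, earlier steps see it scaled by successive powers of the discount, which is why distant goals yield a weak learning signal.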

We hypothesise the courier task should benefit from two types of learning: general, and locale-specific. A navigating agent not only needs an internal representation that is general, to support cognitive processes such as scene understanding, but also needs to organise and remember the features and structures that are unique to a place. Therefore, to support both types of learning, we focus on neural architectures with multiple pathways.

4.1 Architectures

The policy and the value function are both parameterised by a neural network which shares all layers except the final linear outputs. The agent operates on raw pixel images, which are passed through a convolutional network as in mnih2016asynchronous . A Long Short-Term Memory (LSTM) hochreiter1997long receives the output of the convolutional encoder as well as the past reward and previous action. The three different architectures are described below. Additional architectural details are given in the Supplementary Material.

The baseline GoalNav architecture (Fig. 2(b), left) has a convolutional encoder and policy LSTM. The key difference from the canonical A3C agent mnih2016asynchronous is that the goal description $g_t$ is input to the policy LSTM (along with the previous action and reward).

The CityNav architecture (Fig. 2(b), middle) combines the previous architecture with an additional LSTM, called the goal LSTM, which receives visual features as well as the goal description. The CityNav agent also adds an auxiliary heading prediction task on the outputs of the goal LSTM.

The MultiCityNav architecture (Fig. 2(b), right) extends the CityNav agent to learn in different cities. The remit of the goal LSTM is to encode and encapsulate locale-specific features and topology such that multiple pathways may be added, one per city or region. Moreover, after training on a number of cities, we demonstrate that the convolutional encoder and the policy LSTM become general enough that only a new goal LSTM needs to be trained for new cities, a benefit of the modular approach devin2017learning .

Figure 2(b) illustrates that the goal descriptor $g_t$ is not seen by the policy LSTM but only by the locale-specific LSTM in the CityNav and MultiCityNav architectures (the baseline GoalNav agent has only one LSTM, so we directly input $g_t$ to it). This separation forces the locale-specific LSTM to interpret the absolute goal position coordinates, with the hope that it then sends relative goal information (directions) to the policy LSTM. This hypothesis is tested in section 2.3 of the supplementary material.

As shown in jaderberg2016reinforcement ; mirowski2016learning ; dosovitskiy2016learning ; lample_aaai17 , auxiliary tasks can speed up learning by providing extra gradients as well as relevant information. We employ a very natural auxiliary task: the prediction of the agent’s heading, defined as the angle between the north direction and the agent’s pose, using a multinomial classification loss on binned angles. The optional heading prediction is an intuitive way to provide additional gradients for training the convnet. The agent can learn to navigate without it, but we believe that heading prediction helps learning the geometry of the environment; the Supplementary material provides a detailed architecture ablation analysis and agent implementation details.
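The heading auxiliary task can be sketched as a classification target plus a cross-entropy loss. This is only an illustration under assumptions: the number of bins (16 here) is a hypothetical choice, and the function names are not taken from the released code.

```python
import math


def heading_bin(heading_deg, num_bins=16):
    """Map a heading in degrees (0 = north) to a class index in [0, num_bins)."""
    width = 360.0 / num_bins
    index = int((heading_deg % 360.0) // width)
    return min(index, num_bins - 1)  # guard against rounding at 360 degrees


def cross_entropy(logits, target):
    """Multinomial classification loss: -log softmax(logits)[target]."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_norm - logits[target]


# With uninformative (uniform) logits the loss equals log(num_bins).
loss = cross_entropy([0.0] * 16, heading_bin(45.0))
```

Binning turns a circular regression problem into classification, so the wrap-around at north needs no special handling in the loss.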

To train the agents, we use IMPALA espeholt2018impala , an actor-critic implementation that decouples acting and learning. In our experiments, IMPALA results in similar performance to A3C mnih2016asynchronous . We use 256 actors for CityNav and 512 actors for MultiCityNav, with batch sizes of 256 or 512 respectively, and sequences are unrolled to length 50.

4.2 Curriculum Learning

Curriculum learning gradually increases the complexity of the learning task by presenting progressively more difficult examples to the learning algorithm bengio2009curriculum ; graves2017automated ; zaremba2014learning . We use a curriculum to help the agent learn to find increasingly distant destinations. Similar to RL problems such as Montezuma’s Revenge, the courier task suffers from very sparse rewards; unlike that game, we are able to define a natural curriculum scheme. We start by sampling each new goal to be within 500m of the agent’s position (phase 1). In phase 2, we progressively grow the maximum range of allowed destinations to cover the full graph (3.5km in the smaller New York areas, or 5km for central London or Paris).
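The two-phase curriculum can be sketched as a schedule for the maximum allowed goal distance. The linear growth in phase 2 and the phase lengths are assumptions made for this example; the text only states that the range grows progressively from 500m to the full extent of the graph.

```python
def max_goal_distance_m(step, phase1_steps, phase2_steps,
                        start_m=500.0, full_m=5000.0):
    """Maximum goal distance allowed at a given training step.

    Phase 1 keeps goals within start_m; phase 2 grows the range linearly
    (an assumed schedule) until it covers the full graph.
    """
    if step < phase1_steps:
        return start_m
    fraction = min(1.0, (step - phase1_steps) / phase2_steps)
    return start_m + fraction * (full_m - start_m)


# Hypothetical phase lengths of 100 steps each, for illustration.
schedule = [max_goal_distance_m(s, 100, 100) for s in (0, 150, 1000)]
```

Goals would then be sampled among nodes whose distance to the agent does not exceed this bound, so nearby goals remain possible throughout training.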

5 Results

In this section, we demonstrate and analyse the performance of the proposed architectures on the courier task. We first show the performance of our agents in large city environments, next their generalisation capabilities on a held-out set of goals. Finally, we investigate whether the proposed approach allows transfer of an agent trained on a set of regions to a new and previously unseen region.

(a) NYU (New York City)
(b) Central London
(c) Effect of reward shaping
Figure 3: Average per-episode rewards (y axis) are plotted vs. learning steps (x axis) for the courier task. We compare the GoalNav agent, the CityNav agent, and the CityNav agent without skip connection on the NYU environment (a), and the CityNav agent in London (b). We also give Oracle performance and a Heuristic agent. A curriculum is used in London—we indicate the end of phase 1 (up to 500m) and the end of phase 2 (5000m). (c) Results of the CityNav agent on NYU, comparing radii of early rewards (ER) vs. ER with random coins vs. curriculum with ER 200m and no coins.

5.1 Courier Navigation in Large, Diverse City Environments

Figure 4: (a) Number of steps required for the CityNav agent to reach a goal from 100 start locations vs. the straight-line distance to the goal in metres. (b) CityNav performance in London (left panes) and NYU (right panes). Top: examples of the agent’s trajectory during one 1000-step episode, showing successful consecutive goal acquisitions. The arrows show the direction of travel of the agent. Bottom: We visualise the agent’s value function over 100 trajectories with random starting points and the same goal. Thicker and warmer colour lines correspond to higher value functions.

We first show that the CityNav agent, trained with curriculum learning, succeeds in learning the courier task in New York, London and Paris. We replicated experiments with 5 random seeds and plot the mean and standard deviation of the reward statistic throughout the experimental results. Throughout the paper, and for ease of comparison with experiments that include reward shaping, we report only the rewards at the goal destination (goal rewards). Figure 3 compares different agents and shows that the CityNav architecture with the dual LSTM pathways and the heading prediction task attains a higher performance and is more stable than the simpler GoalNav agent. We also trained a CityNav agent without the skip connection from the vision layers to the policy LSTM. While this hurts the performance in single-city training, we consider it because of the multi-city transfer scenario (see Section 5.4) where funnelling all visual information through the locale-specific LSTM seems to regularise the interface between the goal LSTM and the policy LSTM. We also consider two baselines which give lower (Heuristic) and upper (Oracle) bounds on the performance. Heuristic is a random walk on the street graph, where the agent turns in a random direction if it cannot move forward; at an intersection it turns with a fixed probability. Oracle uses the full graph to compute the optimal path using breadth-first search.
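For illustration, an Oracle of this kind can be sketched with a breadth-first search over an adjacency-list graph. The toy graph and the function name are assumptions; breadth-first search minimises the number of hops, which approximates travelled distance only when nodes are roughly evenly spaced.

```python
from collections import deque


def bfs_path(graph, start, goal):
    """Shortest path in number of hops from start to goal, or None."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in graph[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


# Toy undirected street graph: a block with one dead-end side street.
toy_graph = {
    "a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d", "e"],
    "d": ["a", "c"], "e": ["c"], "f": [],
}
route = bfs_path(toy_graph, "a", "e")
```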

We visualise trajectories from the trained agent over two 1000 step episodes (Fig. 4(b), top row). In London, we see that the agent crosses a bridge to get to the first goal, then travels to goal 2, and the episode ends before it can reach the third goal. Figure 4(b) (bottom row) shows the value function of the agent as it repeatedly navigates to a chosen destination (respectively, St Paul’s Cathedral in London and Washington Square in New York).

To understand whether the agent has learned a policy over the full extent of the environment, we plot the number of steps required by the agent to get to the goal. As the number grows linearly with the straight-line distance to that goal, this result suggests that the agent has successfully learnt the navigation policy in both cities (Fig. 4(a)).

5.2 Impact of Reward Shaping and Curriculum Learning

To better understand the environment, we present further experiments on reward shaping and curriculum learning. Additional analysis, including architecture ablations, the robustness of the agent to the choice of goal representations, and position and goal decoding, are presented in the Supplementary Material.

Our navigation task assigns a goal to the agent; once the agent successfully navigates to the goal, a new goal is given to the agent. The long distance separating the agent from the goal makes this a difficult RL problem with sparse rewards. To simplify this challenging task, we investigate giving early rewards (reward shaping) to the agent before it reaches the goal (we define goals with a 100m radius), or adding random rewards (coins) to encourage exploration beattie2016deepmind ; mirowski2016learning . Figure 3(c) suggests that coins by themselves are ineffective as our task does not benefit from wide explorations. At the same time, large radii of reward shaping help as they greatly simplify the problem. We prefer curriculum learning to reward shaping on large areas because the former approach keeps agent training consistent with its experience at test time and also reduces the risk of learning degenerate strategies such as ascending the gradient of increasing rewards to reach the goal, rather than learning to read the goal specification.

As a trade-off between task realism and feasibility, and guided by the results in Fig. 3(c), we decide to keep a small amount of reward shaping (200m away from the goal) combined with curriculum learning. The specific reward function we use depends on the distance from the current position of the agent to the goal and on the reward that the agent will receive if it reaches the goal. Early rewards are given only once per panorama / node, and only if the distance to the goal is decreasing (in order to avoid the agent developing a behavior of harvesting early rewards around the goal rather than going directly towards the goal).
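The two conditions on early rewards (once per node, and only while approaching the goal) can be sketched as follows. The linear ramp used for the reward size is an assumption made purely for illustration and is not the paper's reward function; the class and attribute names are likewise hypothetical.

```python
class EarlyReward:
    """Early-reward bookkeeping for a single goal (illustrative sketch)."""

    def __init__(self, radius_m=200.0, goal_reward=1.0):
        self.radius_m = radius_m
        self.goal_reward = goal_reward
        self.rewarded_nodes = set()
        self.prev_dist_m = None

    def step(self, node, dist_m):
        """Return the early reward for arriving at `node`, dist_m from the goal."""
        closer = self.prev_dist_m is not None and dist_m < self.prev_dist_m
        self.prev_dist_m = dist_m
        if not closer or dist_m > self.radius_m or node in self.rewarded_nodes:
            return 0.0
        self.rewarded_nodes.add(node)
        # Assumed linear ramp: zero at the shaping radius, growing towards the goal.
        return self.goal_reward * (1.0 - dist_m / self.radius_m)


shaper = EarlyReward(radius_m=200.0, goal_reward=10.0)
rewards = [shaper.step(n, d) for n, d in
           [("a", 300.0), ("b", 150.0), ("b", 140.0), ("c", 160.0), ("d", 120.0)]]
```

In this run only the first visit to node "b" and the visit to node "d" are rewarded: the revisit of "b" and the step to "c", which moves away from the goal, yield nothing.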

We choose a curriculum that starts by sampling the goal within a radius of 500m from the agent’s location, and progressively grows that disc until it reaches the maximum distance an agent could travel within the environment (e.g., 3.5km and 5km in the NYU and London environments respectively) by the end of the training. Note that this does not preclude the agent from going astray in the opposite direction several kilometres away from the goal, and that the goal may occasionally be sampled close to the agent. Hence, our curriculum scheme naturally combines easy with difficult cases zaremba2014learning , with the latter becoming more common over time.

5.3 Generalization on Held-out Goals

Figure 5: Illustration of medium-sized held-out grid with gray corresponding to training destinations, black corresponding to held-out test destinations. Landmark locations are marked in red.
Grid size   Train reward   Test reward   Test fail   Test half-trip time
fine        655            567           11%         229
medium      637            293           20%         184
coarse      623            164           38%         243
Figure 6: CityNav agent generalization performance (reward and fail metrics) on a set of held-out goal locations. We also compute the half-trip time, the number of steps needed to reach halfway to the goal.

Navigation agents should, ideally, be able to generalise to unseen environments dhiman2018critical . While the nature of our courier task precludes zero-shot navigation in a new city without retraining, we test the CityNav agent’s ability to exploit local linearities of the goal representation to handle unseen goal locations. We mask a subset of the possible goals and train on the remaining ones (Fig. 5). At test time we evaluate the agent only on its ability to reach goals in the held-out areas. Note that the agent is still able to traverse through these areas, it just never samples a goal there. More precisely, the held-out areas are squares of three sizes in latitude and longitude (roughly 1km × 1km, 0.5km × 0.5km and 0.25km × 0.25km). We call these grids respectively coarse (with few and large held-out areas), medium and fine (with many small held-out areas).

In the experiments, we train the CityNav agent for 1B steps, and next freeze the weights of the agent and evaluate its performance on held-out areas for 100M steps. The table in Figure 6 shows decreasing performance of the agents as the held-out area size increases. We believe that the performance drops on the large held-out areas (medium and coarse grid size) because the model cannot process new or unseen local landmark-based goal specifications, which is due to our landmark-based goal representation: as Figure 5 shows, some coarse grid held-out areas cover multiple landmarks. To gain further understanding, in addition to the Test Reward metric, we also use missed goals (Fail) and half-trip time metrics. The missed goals metric measures the percentage of times goals were not reached. The half-trip time measures the number of agent steps necessary to cover half the distance separating the agent from the goal. While the agent misses more goal destinations on larger held-out grids, it still manages to travel half the distance to the goal within a similar time, which suggests that the agent has an approximate held-out goal representation that enables it to head towards the goal until it gets close, at which point the representation is no longer useful for the final approach.
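The two additional metrics can be sketched from per-goal logs. This is a hedged illustration: the input formats (a list of reached flags, and a list of distances to the goal recorded at every step) and the function names are assumptions made for the example.

```python
def fail_rate_percent(reached_flags):
    """Percentage of assigned goals that were not reached."""
    missed = sum(1 for reached in reached_flags if not reached)
    return 100.0 * missed / len(reached_flags)


def half_trip_time(distances_to_goal_m):
    """Number of steps until the agent has covered half the initial distance.

    distances_to_goal_m[0] is the distance when the goal is assigned.
    Returns None if the halfway point is never reached.
    """
    halfway = distances_to_goal_m[0] / 2.0
    for step, distance in enumerate(distances_to_goal_m):
        if distance <= halfway:
            return step
    return None


fails = fail_rate_percent([True, True, False, True])
steps_to_half = half_trip_time([1000.0, 900.0, 700.0, 500.0, 300.0])
```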

5.4 Transfer in Multi-city Experiments

A critical test for our proposed method is to demonstrate that it can provide a mechanism for transfer to new cities. By definition, the courier task requires a degree of memorization of the map, and what we focused on was not zero-shot transfer, but rather the capability of models to generalize quickly, learning to separate general ability from local knowledge when migrating to a new map. Our motivation for transfer learning experiments comes from the goal of continual learning, which is about learning new skills without forgetting older skills. As with humans, when our agent visits a new city we would expect it to have to learn a new set of landmarks, but not have to re-learn its visual representation, its behaviours, etc. Specifically, we expect the agent to take advantage of existing visual features (convnet) and movement primitives (policy LSTM). Therefore, using the MultiCityNav agent, we train on a number of cities (actually regions in New York City), freeze both the policy LSTM and the convolutional encoder, and then train a new locale-specific pathway (the goal LSTM) on a new city. The gradient that is computed by optimising the RL loss is passed through the policy LSTM without affecting it and then applied only to the new pathway.
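The effect of freezing can be illustrated with a toy update rule in which gradients exist for every parameter but are applied only to the new pathway. The parameter names, the scalar parameters and the plain SGD rule are all hypothetical simplifications, not the training setup used in the experiments.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """One SGD update that changes only parameters named in `trainable`."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }


# Hypothetical scalar "parameters" standing in for whole network modules.
params = {"convnet.w": 1.0, "policy_lstm.w": 2.0, "goal_lstm.wall_street.w": 0.0}
grads = {"convnet.w": 0.5, "policy_lstm.w": -0.5, "goal_lstm.wall_street.w": 1.0}
new_params = sgd_step(params, grads, trainable={"goal_lstm.wall_street.w"})
```

The frozen modules still take part in the forward and backward passes, so the new goal pathway receives a learning signal shaped by the existing policy; only the update itself is restricted.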

We compare the performance using three different training regimes, illustrated in Fig. 7(a): training on only the target city (single training); training on multiple cities, including the target city, together (joint training); and joint training on all but the target city, followed by training on the target city with the rest of the architecture frozen (pre-train and transfer). In these experiments, we use the whole Manhattan environment as shown in Figure 1(b), which consists of the following regions: “Wall Street”, “NYU”, “Midtown”, “Central Park” and “Harlem”. The target city is always the Wall Street environment, and we evaluate the effects of pre-training on 2, 3 or 4 of the other environments. We also compare performance if the skip connection between the convolutional encoder and the policy LSTM is removed.

We can see from the results in Figure 6(b) that not only is transfer possible, but that its effectiveness increases with the number of regions the network is trained on. Remarkably, the agent that is pre-trained on 4 regions and then transferred to Wall Street achieves comparable performance to an agent trained jointly on all the regions, and only slightly worse than single-city training on Wall Street alone (we observed that we could train a model jointly on 4 cities in fewer steps than when training 4 single-city models). This result supports our intuition that training on a larger set of environments results in successful transfer. We also note that in the single-city scenario it is better to train an agent with a skip-connection, but this trend is reversed in the multi-city transfer scenario. We hypothesise that isolating the locale-specific LSTM as a bottleneck is more challenging but reduces overfitting of the convolutional features and enforces a more general interface to the policy LSTM. While the transfer learning performance of the agent is lower than that of the stronger agent trained jointly on all the areas, the agent significantly outperforms the baselines and demonstrates goal-dependent navigation.

(a) Diagram of transfer learning experiments.
(b) Transfer learning performance.
Figure 7: Left: Illustration of training regimes: (a) training on a single city (equivalent to CityNav); (b) joint training over multiple cities with a dedicated per-city pathway and shared convolutional net and policy LSTM; (c) joint pre-training on a number of cities followed by training on a target city with convolutional net and policy LSTM frozen (only the target city pathway is optimised). Right: Joint multi-city training and transfer learning performance of variants of the MultiCityNav agent, evaluated only on the target city (Wall Street).

6 Conclusion

Navigation is an important cognitive task that enables humans and animals to traverse a complex world without maps. We have presented a city-scale real-world environment for training RL navigation agents, introduced and analysed a new courier task, demonstrated that deep RL algorithms can be applied to problems involving large-scale real-world data, and presented a multi-city neural network agent architecture that demonstrates transfer to new environments. A multi-city version of the Street View based RL environment, with carefully processed images provided by Google Street View (i.e., blurred faces and license plates, with a mechanism for enforcing image take-down requests) has been released for Manhattan and Pittsburgh and is accessible from http://streetlearn.cc and https://github.com/deepmind/streetlearn. The project webpage at http://streetlearn.cc also contains resources on how to build and train an agent. Future work will involve learning landmarks from images and scaling up the navigation and path-planning thanks to hierarchical RL approaches.


The authors wish to acknowledge Andras Banki-Horvath for open-sourcing the StreetLearn environment, Lasse Espeholt and Hubert Soyer for technical help with the IMPALA algorithm, Razvan Pascanu, Ross Goroshin, Pushmeet Kohli and Nando de Freitas for their feedback, Chloe Hillier, Razia Ahamed and Vishal Maini for help with the project, and the Google Street View team (Tilman Reinhardt, Wenfeng Li, Ben Mears, Karen Guo, Oliver Metzger, Jayanth Nayak) as well as Richard Ives and Ashwin Kakarla for their support in accessing the data.


Appendix A Video of the Agent Trajectories and Observations

The video available at http://streetlearn.cc and https://youtu.be/2yjWDNXYh5s shows the performance of trained CityNav agents in the Paris Rive Gauche and Central London environments, as well as of the MultiCityNav agents trained jointly on 4 environments (Greenwich Village, Midtown, Central Park and Harlem) and then transferred to a fifth environment (Lower Manhattan). The video shows the high-resolution StreetView images (actual inputs to the network are RGB observations), overlaid with the map of the environment indicating its position and the location of the goal.

Appendix B Further Analysis

B.1 Architecture Ablation Analysis

Figure 8: Learning curves of the CityNav agent (2LSTM+Skip+HD) on NYU, comparing different ablations, all the way down to GoalNav (LSTM). 2LSTM architectures have a global pathway LSTM and a policy LSTM with an optional Skip connection between the convnet and the policy LSTM. HD is the heading prediction auxiliary task.

In this analysis, we focus on agents trained with a two-day learning curriculum and early rewards at 200m, in the NYU environment. We study how learning benefits from various auxiliary tasks, and we provide additional ablation studies that investigate various design choices. We quantify the performance in terms of average reward per episode obtained at goal acquisition. Each experiment was repeated 5 times with different seeds. We report average final reward and plot the mean and standard deviation of the reward statistic. As Fig. 8 shows, the auxiliary task of heading (HD) prediction helps in achieving better performance in the navigation task for the GoalNav architecture, and, in conjunction with a skip connection from the convnet to the policy LSTM, for the 2-LSTM architecture. The CityNav agent significantly outperforms our main baseline GoalNav (LSTM in Fig. 8), which is a goal-conditioned variant of the standard RL baseline [35]. CityNav performs on par with GoalNav with heading prediction, but the latter cannot adapt to new cities without re-training or adding city-specific components, whereas the MultiCityNav agent with locale-specific LSTM pathways can, as demonstrated in the paper’s section on transfer learning. Our weakest baseline (CityNav no vision) performs poorly as the agent cannot exploit visual cues while performing navigation tasks.

In our investigation, we do not consider other auxiliary tasks introduced in prior works [26, 34] as they are either unsuitable for our task, do not perform well, or require extra information that we consider too strong. Specifically, we did not implement the reward prediction auxiliary task on the convnet from [26], as the goal is not visible from pixels, and the motion model of the agent with widely changing visual input is not appropriate for the pixel control tasks in that same paper. From [34], we kept the 2-LSTM architecture and substituted depth prediction (which we cannot perform on this dataset) by heading and neighbor traversability prediction. We did not implement loop-closure prediction as it performed poorly in the original paper and uses privileged map information.

B.2 Goal Representation

Figure 9: Learning curves for CityNav agents with different goal representations: landmark-based, as well as latitude and longitude classification-based and regression-based.

As described in Section 3.1 of the paper, our task uses a goal description which is a vector of normalised distances to a set of fixed landmarks. Reducing the density of the landmarks to half, a quarter or an eighth does not significantly reduce the performance (Fig. 9). We also investigate some alternative representations: a) latitude and longitude scalar coordinates normalised to be between 0 and 1 (Lat/long scalar in Figure 9), and b) a binned representation (Lat/long binned) using 35 bins for X and Y each, with each bin covering 100m. The Lat/long scalar goal representation performs best.

Fig. 9 compares the performance of the CityNav agent for different goal representations on the NYU environment. The most intuitive one consists in normalized latitude and longitude coordinates, or in binned representation of latitude and longitude (we used 35 bins for X and 35 bins for Y, where each bin covers 100m, or 80 bins for each coordinate).
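The two coordinate-based goal encodings can be sketched as follows. This is a minimal illustration: the bounding box values are hypothetical, and encoding each binned coordinate as a one-hot vector is an assumption, since the text only states the number of bins.

```python
def latlong_scalar(lat, lng, bounds):
    """Normalise coordinates to [0, 1] within the environment's bounding box."""
    lat_min, lat_max, lng_min, lng_max = bounds
    return [(lat - lat_min) / (lat_max - lat_min),
            (lng - lng_min) / (lng_max - lng_min)]


def latlong_binned(lat, lng, bounds, num_bins=35):
    """One-hot encode each normalised coordinate (assumed), then concatenate."""
    encoded = []
    for value in latlong_scalar(lat, lng, bounds):
        # Clamp so that a value of exactly 1.0 falls in the last bin.
        index = min(int(value * num_bins), num_bins - 1)
        one_hot = [0.0] * num_bins
        one_hot[index] = 1.0
        encoded.extend(one_hot)
    return encoded


# Hypothetical bounding box: (lat_min, lat_max, lng_min, lng_max).
bounds = (40.0, 41.0, -74.0, -73.0)
scalar_goal = latlong_scalar(40.5, -73.5, bounds)
binned_goal = latlong_binned(40.5, -73.5, bounds)
```

With 35 bins per coordinate the binned goal has 70 entries, exactly two of which are non-zero.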

An alternative goal representation is expressed in terms of the distances from the goal position to a set of arbitrary landmarks . We defined and tuned using grid search. We manually defined 644 landmarks covering New York, Paris and London, which we use throughout the experiments and which are illustrated in Fig. 2a. We observe that reducing the density of the landmarks to half, a quarter or an eighth has a slightly detrimental effect on performance because some areas are sparsely covered by landmarks. Because the landmark representation is independent of the coordinate system, we choose it and use it in all the other experiments in this paper.
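A landmark-based goal vector can be sketched as below. This is a sketch under stated assumptions: the normalisation (dividing each distance by the sum of all distances) is a placeholder, because the paper's exact normalisation and its tuned constant are not reproduced in this text, and the three "landmarks" are simply area starting points borrowed for illustration, not entries from the real landmark list.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in metres between two Lat/Long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def landmark_goal_vector(goal, landmarks):
    """Distances from the goal to each landmark, normalised to sum to 1.

    The sum-normalisation is an assumption made for this sketch.
    """
    distances = [haversine_m(goal[0], goal[1], lat, lng)
                 for lat, lng in landmarks]
    total = sum(distances)
    if total == 0.0:
        return [0.0] * len(distances)
    return [d / total for d in distances]


# Illustrative points only (not the paper's landmarks).
landmarks = [(40.705510, -74.013589),
             (40.806379, -73.950124),
             (40.756889, -73.986147)]
goal = (40.705510, -74.013589)
goal_vector = landmark_goal_vector(goal, landmarks)
```

Because the vector depends only on distances, it does not change if the coordinate system is shifted or rotated, which is the property the text highlights.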

Finally, we also train a Goal-less CityNav agent by removing the goal inputs. The poor performance of this agent (Fig. 9) confirms that the performance of our method cannot be attributed to clever street graph exploration alone. Goal-less CityNav learns to run in circles of increasing radii, a reasonable, greedy behaviour that is a good baseline for the other agents.

Since the landmark-based representation performs well while being independent of the coordinate system and thus more scalable, we use this representation as canonical.

B.3 Allocentric and Egocentric Goal Representation

We analyse the activations of the 256 hidden units of the region-specific LSTM by training decoders (2-layer multi-layer perceptrons, MLP, with 128 units on the hidden layer and rectified nonlinearity transfer functions) for the allocentric position of the agent and of the goal as well as for the egocentric direction towards the goal. Allocentric decoders are multinomial classifiers over the joint Lat/Long position, and have 50×50 bins (London) or 35×35 bins (NYU), each bin covering an area of size 100m×100m. The egocentric decoder has 16 orientation bins. Fig. 10 illustrates the noisy decoding of the agent’s position along 3 trajectories and the decoding of the goal (here, St Paul’s), overlaid with the ground truth trajectories and goal location. The average error of the egocentric goal direction decoding is about (as compared to for a random predictor), suggesting some decoding but not a Cartesian manifold representation in the hidden units of the LSTM.
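The class targets for an allocentric decoder can be sketched as a mapping from a metric position to a joint grid cell. The function names and the convention of measuring offsets in metres from the south-west corner of the area are assumptions made for this illustration.

```python
def allocentric_class(x_m, y_m, bins_per_side=35, cell_size_m=100.0):
    """Map metric offsets from the area's corner to a joint class index.

    Positions outside the grid are clamped to the nearest border cell.
    """
    col = min(max(int(x_m // cell_size_m), 0), bins_per_side - 1)
    row = min(max(int(y_m // cell_size_m), 0), bins_per_side - 1)
    return row * bins_per_side + col


def class_to_cell_centre(index, bins_per_side=35, cell_size_m=100.0):
    """Inverse mapping: centre of the decoded cell, in metres."""
    row, col = divmod(index, bins_per_side)
    return ((col + 0.5) * cell_size_m, (row + 0.5) * cell_size_m)


nyu_index = allocentric_class(250.0, 120.0)                       # 35 x 35 grid
london_index = allocentric_class(4999.0, 4999.0, bins_per_side=50)  # 50 x 50 grid
decoded_centre = class_to_cell_centre(nyu_index)
```

Decoding a class back to a cell centre bounds the spatial resolution at the 100m cell size, which is one reason decoded trajectories look noisy.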

Figure 10: Decoding of the agent position (blue dots) and goal position (cyan stars) over 3 trajectories (in red) with a goal at St Paul’s Cathedral, in London (in black).

B.4 Reward Shaping: Goal Rewards vs. Rewards

The agent is considered to have reached the goal if it is within 100m of the goal, and the reward shaping consists in giving the agent early rewards once it is within 200m of the goal. Early rewards are shaped as follows:

where is the distance from the current position of the agent to the goal and is the reward that the agent will receive if it reaches the goal. Early rewards are given only once per panorama / node, and only if the distance to the goal is decreasing (in order to avoid the agent developing a behavior of harvesting early rewards around the goal rather than going directly towards the goal). However, depending on the path taken by the agent towards the goal, it could earn slightly more rewards if it takes a longer path to the goal rather than a shorter path. Throughout the paper, and for ease of comparison with experiments that include reward shaping, we report only the rewards at the goal destination (goal rewards).
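The bookkeeping rules for early rewards can be sketched as below. The exact shaping formula is not reproduced in this text, so the linear ramp used here (zero at 200m, rising to the goal reward at 100m) is an assumption, as is comparing the distance with that of the previous step to decide whether it is decreasing. The class and method names are hypothetical.

```python
GOAL_RADIUS_M = 100.0
EARLY_REWARD_RADIUS_M = 200.0


class EarlyRewardShaper:
    """Tracks which nodes were already rewarded during one episode."""

    def __init__(self, goal_reward):
        self.goal_reward = goal_reward
        self.rewarded_nodes = set()
        self.previous_distance = None

    def step(self, node_id, distance_m):
        approaching = (self.previous_distance is not None
                       and distance_m < self.previous_distance)
        self.previous_distance = distance_m
        if distance_m <= GOAL_RADIUS_M:
            return self.goal_reward  # goal reached
        if (distance_m >= EARLY_REWARD_RADIUS_M
                or not approaching
                or node_id in self.rewarded_nodes):
            return 0.0
        self.rewarded_nodes.add(node_id)  # reward each node only once
        # Assumed linear ramp between the two radii.
        ramp = ((EARLY_REWARD_RADIUS_M - distance_m)
                / (EARLY_REWARD_RADIUS_M - GOAL_RADIUS_M))
        return ramp * self.goal_reward


shaper = EarlyRewardShaper(goal_reward=1.0)
rewards = [shaper.step("a", 250.0),   # too far
           shaper.step("b", 150.0),   # approaching, inside 200m
           shaper.step("c", 180.0),   # moving away
           shaper.step("b", 150.0),   # node already rewarded
           shaper.step("d", 90.0)]    # goal reached
```

The once-per-node rule and the decreasing-distance condition are what prevent the agent from harvesting early rewards by circling near the goal.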

Appendix C Implementation Details

C.1 Neural Network Architecture

For all the experiments in the paper we use the standard vision model for Deep RL [35] with 2 convolutional layers followed by a fully connected layer. The baseline GoalNav architecture has a single recurrent layer (LSTM), from which we predict the policy and value function, similarly to [35].

The convolutional layers are as follows. The first convolutional layer has a kernel of size 8x8 and a stride of 4x4, and 16 feature maps. The second layer has a kernel of size 4x4 and a stride of 2x2, and 32 feature maps. The fully connected layer has 256 units, and outputs 256-dimensional visual features. Rectified nonlinearities (ReLU) separate the layers.
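The spatial sizes implied by these kernels and strides can be worked out with the standard output-size formula. Two assumptions are made here because the text does not state them: the input is taken to be 84×84 pixels, and the convolutions are taken to use no padding.

```python
def conv_output_size(input_size, kernel, stride):
    """Output side length of an unpadded ("valid") convolution."""
    return (input_size - kernel) // stride + 1


def visual_feature_shapes(input_size=84):
    """Shapes after each layer; input_size=84 is an assumed value."""
    side1 = conv_output_size(input_size, kernel=8, stride=4)
    side2 = conv_output_size(side1, kernel=4, stride=2)
    return {
        "conv1": (side1, side1, 16),
        "conv2": (side2, side2, 32),
        "flattened": side2 * side2 * 32,
        "fc": 256,
    }


shapes = visual_feature_shapes()
```

Under these assumptions the fully connected layer maps 2592 flattened activations to the 256-dimensional visual features.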

The convnet is connected to the policy LSTM (in case of two-LSTM architectures, we call it a Skip connection). The policy LSTM has additional inputs: past reward and previous action expressed as a one-hot vector of dimension 5 (one for each action: forward, and turning left or right by either of two angles).

The goal information is provided as an extra input, either to the policy LSTM (GoalNav agent) or to each goal LSTM in the CityNav and MultiCityNav agents. In the case of landmark-based goals, the goal vector has 644 elements (see Appendix D for the complete list of landmark locations in the New York and London environments). In the case of Lat/Long scalars, it is a 2-dimensional vector of Lat and Long coordinates normalized to be between 0 and 1 in the environment of interest. In the case of binned Lat/Long coordinates, we bin the normalized scalar coordinates using 35 bins for Lat and 35 bins for Long in the NYU environment, each bin representing 100m, and the vector contains 70 elements.

The goal LSTM also takes 256-dimensional inputs from the convnet. The goal LSTM contains 256 hidden units and is followed by a tanh nonlinearity, a dropout layer, then a 64-dimensional linear layer and finally a tanh layer. It is this (CityNav) or these (MultiCityNav) 64-dimensional outputs that are connected to the policy LSTM. We chose to use this bottleneck, consisting of a dropout, a linear layer from 256 to 64 and a nonlinearity, in order to force the representations in the goal LSTM to be more robust to noise and to send only a small amount of information (possibly related to the egocentric position of the agent w.r.t. the goal) to the policy LSTM. Please note that the CityNav agent can still be trained to solve the navigation task without that layer.

Similarly to [35], the policy LSTM contains 256 hidden units, followed by two parallel layers: one linear layer going from 256 to 1 and outputting the value function, and one linear layer going from 256 to 5 (the number of actions), followed by a softmax nonlinearity, outputting the policy.

The heading prediction auxiliary task is done using an MLP with a hidden layer of 128 units, connected to the hidden units of each goal LSTM in the CityNav and MultiCityNav agents, and outputs a softmax of 16-dimensional vectors, corresponding to 16 binned directions towards North. The auxiliary task is optimized using a multinomial loss.
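The heading target and its multinomial (cross-entropy) loss can be sketched as follows. The binning convention is an assumption: bins are taken to be 22.5 degrees wide with bin 0 starting at North, and the function names are hypothetical.

```python
import math

NUM_HEADING_BINS = 16


def heading_bin(heading_deg):
    """Map a compass heading in degrees (0 = North) to one of 16 bins."""
    width = 360.0 / NUM_HEADING_BINS
    return int((heading_deg % 360.0) // width)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def heading_loss(logits, heading_deg):
    """Cross-entropy between the predicted distribution and the true bin."""
    probs = softmax(logits)
    return -math.log(probs[heading_bin(heading_deg)])


uniform_loss = heading_loss([0.0] * NUM_HEADING_BINS, 90.0)
```

A predictor that outputs uniform logits incurs a loss of log(16), which gives a reference level for judging whether the auxiliary head has learned anything.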

C.2 Learning Hyperparameters

The costs for all auxiliary heading prediction tasks, of the value prediction, of the entropy regularization and of the policy loss are added before being sent to the RMSProp gradient learning algorithm [41] (momentum 0, discounting factor 0.99, initial learning rate 0.001). The weight of heading prediction is 1, the entropy cost is 0.004 and the value baseline weight is 0.5.
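The weighted sum of losses can be written out as below. The weights come from the text; the sign convention (the entropy term enters as negative entropy, so that minimising the total encourages exploration) and the function name are assumptions for this sketch.

```python
HEADING_WEIGHT = 1.0   # weight of each heading prediction loss
ENTROPY_COST = 0.004   # entropy regularisation cost
VALUE_WEIGHT = 0.5     # value baseline weight


def total_loss(policy_loss, value_loss, policy_entropy, heading_losses):
    """Combine the loss terms into the scalar passed to the optimiser."""
    entropy_loss = -policy_entropy  # assumed sign convention
    return (policy_loss
            + VALUE_WEIGHT * value_loss
            + ENTROPY_COST * entropy_loss
            + HEADING_WEIGHT * sum(heading_losses))


# Toy values: two goal LSTMs, each contributing one heading loss.
example_loss = total_loss(policy_loss=1.0, value_loss=2.0,
                          policy_entropy=0.5, heading_losses=[0.25, 0.25])
```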

In all our experiments, we train our agent with IMPALA [19], an actor-critic implementation of deep reinforcement learning that decouples acting and learning. In our experiments, IMPALA results in similar performance to A3C [35] on a single city task, but as it has been demonstrated to handle multi-task learning better than A3C, we prefer it to A3C for our multi-city and transfer learning experiments. We use 256 actors for CityNav and 512 actors for MultiCityNav, with batch sizes of 256 or 512 respectively, and sequences are unrolled to length 50. We used a learning rate of 0.001, linearly annealed to 0 after 2B steps (NYU), 4B steps (London) or 8B steps (multi-city and transfer learning experiments). The discounting coefficient in the Bellman equation is 0.99. Rewards are clipped at 1 for the purpose of gradient calculations.
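The learning-rate schedule and reward clipping can be sketched in a few lines. The function names are hypothetical, and clipping symmetrically to the range [-1, 1] is an assumption, since the text only states that rewards are clipped at 1.

```python
def annealed_learning_rate(step, total_steps, initial_rate=0.001):
    """Linear decay from initial_rate at step 0 to 0 at total_steps."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return initial_rate * remaining


def clip_reward(reward, limit=1.0):
    """Clip the reward used in gradient calculations (symmetric, assumed)."""
    return max(-limit, min(limit, reward))


NYU_TOTAL_STEPS = 2_000_000_000  # 2B steps, as stated for NYU
halfway_rate = annealed_learning_rate(1_000_000_000, NYU_TOTAL_STEPS)
```

Halfway through the NYU schedule the rate has dropped to half its initial value, and it stays at zero for any step past the end of the schedule.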

C.3 Curriculum Learning

Because of the distributed nature of the learning algorithm, it was easier to implement the duration of phase 1 and phase 2 of curriculum learning using the wall clock of the actors and learners rather than by sharing the total number of steps with the actors, which explains why phase durations are expressed in days rather than in a given number of steps. With our software implementation, hardware, batch size and number of actors, the distributed learning algorithm runs at about 6000 environment steps/sec, and a day of training corresponds to about 500M steps. In terms of gradient steps, given that we use unrolls of length 50 steps and batch sizes of 256 or 512, each gradient step corresponds to either 50 × 256 = 12,800 or 50 × 512 = 25,600 environment steps, and is taken roughly every 2s or 4s respectively at a speed of 6000 environment steps/sec.
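The throughput figures in this paragraph follow from simple arithmetic on the stated unroll length, batch sizes and step rate. The short check below uses only those quoted values; the function names are illustrative.

```python
STEPS_PER_SECOND = 6000          # approximate environment steps per second
UNROLL_LENGTH = 50               # steps per unrolled sequence
SECONDS_PER_DAY = 24 * 60 * 60


def steps_per_day(rate=STEPS_PER_SECOND):
    """Environment steps processed in one day of wall-clock training."""
    return rate * SECONDS_PER_DAY


def gradient_step_stats(batch_size, rate=STEPS_PER_SECOND):
    """Environment steps per gradient step, and seconds between gradient steps."""
    env_steps = UNROLL_LENGTH * batch_size
    return env_steps, env_steps / rate


daily_steps = steps_per_day()
citynav_stats = gradient_step_stats(256)
multicity_stats = gradient_step_stats(512)
```

One day yields 518.4M steps, consistent with "about 500M", and the intervals between gradient steps come out at roughly 2.1s and 4.3s for the two batch sizes.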

Appendix D Environment

For the experiments on data from Manhattan, New York, we relied on sub-areas of a larger StreetView graph that contains 256961 nodes and 266040 edges. We defined 5 areas by selecting a starting point at a given coordinate and collecting panoramas in a panorama adjacency graph using breadth-first search, until a given depth of the search tree. We defined the areas as follows:

  • Wall Street / Lower Manhattan: 6917 nodes and 7191 edges, 200-deep search tree starting at (40.705510, -74.013589).

  • NYU / Greenwich Village: 17227 nodes and 17987 edges, 200-deep search tree starting at (40.731342, -73.996903).

  • Midtown: 16185 nodes and 16723 edges, 200-deep search tree starting at (40.756889, -73.986147).

  • Central Park: 10557 nodes and 10896 edges, 200-deep search tree starting at (40.773863, -73.971984).

  • Harlem: 14589 nodes and 15099 edges, 220-deep search tree starting at (40.806379, -73.950124).
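The area extraction described above can be sketched as a depth-limited breadth-first search over the panorama adjacency graph. This is an illustration, not the released tooling: the adjacency dictionary format, the function name and the choice to keep every edge whose two endpoints were both collected are assumptions.

```python
from collections import deque


def collect_area(adjacency, start, max_depth):
    """Collect all nodes within max_depth hops of start, plus induced edges."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[node] + 1
                queue.append(neighbour)
    nodes = set(depth)
    # Undirected edges between collected nodes (assumed edge rule).
    edges = {frozenset((a, b))
             for a in nodes
             for b in adjacency.get(a, ())
             if b in nodes}
    return nodes, edges


# Toy chain of five panoramas: a - b - c - d - e.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"],
             "d": ["c", "e"], "e": ["d"]}
area_nodes, area_edges = collect_area(toy_graph, "a", max_depth=2)
```

On a street graph, a fixed search depth yields areas whose size depends on local street density, which is consistent with the different node counts listed for the five regions.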

The Central London StreetView environment contains 24428 nodes and 25352 edges, and is defined by a bounding box between the following Lat/Long coordinates: (51.500567, -0.139157) and (51.526175, -0.080043). The Paris Rive Gauche environment contains 34026 nodes and 35475 edges, and is defined by a bounding box between Lat/Long coordinates: (48.839413, 2.2829247) and (48.866578, 2.3653221).

We provide, in a text file (available at http://streetlearn.cc), the locations of the 644 landmarks used throughout the study.