We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the planner, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. Our experiments demonstrate that CMP outperforms both reactive strategies and standard memory-based architectures and performs well in novel environments. Furthermore, we show that CMP can also achieve semantically specified goals, such as "go to a chair".
As humans, when we navigate through novel environments, we draw on our previous experience in similar conditions. We reason about free-space, obstacles and the topology of the environment, guided by common sense rules and heuristics for navigation. For example, to go from one room to another, I must first exit the initial room; to go to a room at the other end of the building, getting into a hallway is more likely to succeed than entering a conference room; a kitchen is more likely to be situated in open areas of the building than in the middle of cubicles. The goal of this paper is to design a learning framework for acquiring such expertise, and demonstrate this for the problem of robot navigation in novel environments.
Classical approaches to navigation first build a 3D map using LIDAR, depth sensing, or structure from motion, and then plan paths in this map. These maps are built purely geometrically: nothing is known until it has been explicitly observed, even when there are obvious patterns. This becomes a problem for goal-directed navigation. Humans can often guess, for example, where they will find a chair or that a hallway will probably lead to another hallway, but a classical robot agent can at best only do uninformed exploration. The separation between mapping and planning also makes the overall system unnecessarily fragile. For example, the mapper might fail on texture-less regions in a corridor, leading to failure of the whole system, even though precise geometry may not be necessary if the robot just has to keep traveling straight.
Inspired by this reasoning, recently there has been an increasing interest in more end-to-end learning-based approaches that go directly from pixels to actions zhu2016target ; mnih2015human ; levine2016end
without going through explicit model or state estimation steps. These methods thus enjoy the power of being able to learn behaviors from experience. However, it is necessary to carefully design architectures that can capture the structure of the task at hand. For instance, Zhu et al. zhu2016target use reactive, memory-less, vanilla feed-forward architectures for solving visual navigation problems. In contrast, experiments by Tolman tolman1948cognitive have shown that even rats build sophisticated representations of space in the form of 'cognitive maps' as they navigate, giving them the ability to reason about shortcuts, something that a reactive agent is unable to do.
This motivates our Cognitive Mapping and Planning (CMP) approach for visual navigation (arch). CMP consists of a) a spatial memory to capture the layout of the world, and b) a planner that can plan paths given partial information. The mapper and the planner are put together into a unified architecture that can be trained to leverage regularities of the world. The mapper fuses information from input views as observed by the agent over time to produce a metric egocentric multi-scale belief about the world in a top-down view. The planner uses this multi-scale egocentric belief of the world to plan paths to the specified goal and outputs the optimal action to take. This process is repeated at each time step to convey the agent to the goal.
At each time step, the agent updates the belief of the world from the previous time step by a) using the ego-motion to transform the belief from the previous time step into the current coordinate frame and b) incorporating information from the current view of the world to update the belief. This allows the agent to progressively improve its model of the world as it moves around. The most significant contrast with prior work is that our approach is trained end-to-end to take good actions in the world. To that end, instead of analytically computing the update to the belief (via classical structure from motion) we frame this as a learning problem and train a convolutional neural network to predict the update based on the observed first person view. We make the belief transformation and update operations differentiable thereby allowing for end-to-end training. This allows our method to adapt to the statistical patterns in real indoor scenes without the need for any explicit supervision of the mapping stage.
Our planner uses the metric belief of the world obtained through the mapping operation described above to plan paths to the goal. We use value iteration as our planning algorithm but crucially use a trainable, differentiable and hierarchical version of value iteration. This has three advantages, a) being trainable it naturally deals with partially observed environments by explicitly learning when and where to explore, b) being differentiable it enables us to train the mapper for navigation, and c) being hierarchical it allows us to plan paths to distant goal locations in time complexity that is logarithmic in the number of steps to the goal.
Our approach is reminiscent of classical work in navigation that also involves building maps and then planning paths in these maps to reach desired target locations. However, our approach differs from classical work in the following significant way: except for the architectural choice of maintaining a metric belief, everything else is learned from data. This leads to some very desirable properties: a) our model can learn statistical regularities of indoor environments in a task-driven manner, b) jointly training the mapper and the planner makes our planner more robust to errors of the mapper, and c) our model can be used in an online manner in novel environments without requiring a pre-constructed map.
This paper originally appeared at CVPR 2017. In this journal article, we additionally describe real world deployment of our learned policies on a TurtleBot 2 platform, and report results of our deployment on indoor test environments. We have also incorporated feedback from the community. In particular, we have added comparisons to a policy that very closely resembles a classical mapping and planning method. We have also included more visualizations of the representations produced by the mapper. Finally, we situate the current work in context of the new research directions that are emerging in the field at the intersection of machine learning, robotics and computer vision.
Related Work

Navigation is one of the most fundamental problems in mobile robotics. The standard approach is to decompose the problem into two separate stages: (1) mapping the environment, and (2) planning a path through the constructed map khatib1986real ; elfes1987sonar . Decomposing navigation in this manner allows each stage to be developed independently, but prevents each from exploiting the specific needs of the other. A comprehensive survey of classical approaches for mapping and planning can be found in thrun2005probabilistic .
Mapping has been well studied in computer vision and robotics in the form of structure from motion and simultaneous localization and mapping slam-survey:2015 ; izadiUIST11 ; henry2010rgb ; snavely2008modeling with a variety of sensing modalities such as range sensors, RGB cameras and RGB-D cameras. These approaches take a purely geometric approach. Learning based approaches zamir2016generic ; hadsell2009learning study the problem in isolation, thus only learning generic task-independent maps. Path planning in these inferred maps has also been well studied, with pioneering works from Canny canny1988complexity , Kavraki et al. kavraki1996probabilistic and LaValle and Kuffner lavalle2000rapidly . Works such as elfes1989using ; fraundorfer2012vision have studied the joint problem of mapping and planning. While this relaxes the need for pre-mapping by incrementally updating the map while navigating, these methods still treat navigation as a purely geometric problem. Konolige et al. konolige2010view and Aydemir et al. aydemir2013active proposed approaches which leveraged semantics for more informed navigation. Kuipers et al. kuipers1991robot introduce a cognitive mapping model using hierarchical abstractions of maps. Semantics have also been associated with 3D environments more generally koppulaNIPS11 ; guptaCVPR13 .
As an alternative to separating out discrete mapping and planning phases, reinforcement learning (RL) methods directly learn policies for robotic tasks kim2003autonomous ; peters2008reinforcement ; kohl2004policy . A major challenge with using RL for this task is the need to process complex sensory input, such as camera images. Recent works in deep reinforcement learning (DRL) learn policies in an end-to-end manner mnih2015human , going from pixels to actions. Follow-up works mnih2016asynchronous ; gu2016continuous ; schulman2015trust propose improvements to DRL algorithms, while oh2016control ; mnih2016asynchronous ; wierstra2010recurrent ; heess2015memory ; zhang2016learning study how to incorporate memory into such neural network based models. We build on the work of Tamar et al. tamar2016value , who study how explicit planning can be incorporated into such agents, but do not consider the case of first-person visual navigation, nor provide a framework for memory or mapping. oh2016control also study the generalization behavior of these algorithms to novel environments they have not been trained on.
In the context of navigation, learning and DRL have been used to obtain policies toussaint2003learning ; zhu2016target ; oh2016control ; tamar2016value ; kahn2016plato ; giusti2016machine ; daftry2016learning ; abel2016exploratory . Some of these works kahn2016plato ; giusti2016machine focus on the problem of learning controllers for effectively maneuvering around obstacles directly from raw sensor data. Others, such as tamar2016value ; blundell2016model ; oh2016control , focus on the planning problem associated with navigation under full state information tamar2016value , design strategies for faster learning via episodic control blundell2016model , or incorporate memory into DRL algorithms to ease generalization to new environments. Most of this research (except zhu2016target ) focuses on navigation in synthetic mazes which have little structure to them. Because these environments are randomly generated, the policy can learn a random exploration strategy, but there are no statistical regularities in the layout that it can exploit. We instead test on layouts obtained from real buildings, and show that our architecture consistently outperforms the feed-forward and LSTM models used in prior work.
The research most directly relevant to our work is the contemporary work of Zhu et al. zhu2016target . Similar to us, Zhu et al. also study first-person view navigation using macro-actions in more realistic environments instead of synthetic mazes. Zhu et al. propose a feed-forward model which, when trained in one environment, can be finetuned in another environment. Such a memory-less agent cannot map, plan or explore the environment, which our expressive model naturally does. Zhu et al. also don't consider zero-shot generalization to previously unseen environments, and focus on smaller worlds where memorization of landmarks is feasible. In contrast, we explicitly handle generalization to new, never before seen interiors, and show that our model generalizes successfully to floor plans not seen during training.
Relationship to contemporary research. In this paper, we used scans of real world environments to construct visually realistic simulation environments to study representations that can enable navigation in novel, previously unseen environments. Since conducting this research, over the last year, there has been a major thrust in this direction in computer vision and related communities. Numerous works such as Matterport3D ; dai2017scannet ; active-vision-dataset2017 have collected large-scale datasets consisting of scans of real world environments, while savva2017minos ; wu2018building ; xiazamirhe2018gibsonenv have built more sophisticated simulation environments based on such scans. A related and parallel stream of research studies whether or not models trained in simulators can be effectively transferred to the real world sadeghi2016cadrl ; bruce2018learning , and how the domain gap between simulation and the real world may be reduced xiazamirhe2018gibsonenv . A number of works have studied related navigation problems in such simulation environments henriques2018mapnet ; chaplot2018active ; anderson2017vision . Researchers have also gone beyond specifying goals as a desired location in space and finding objects of interest as done in this paper; for example, Wu et al. wu2018building generalize the goal specification to also include rooms of interest, and Das et al. embodiedqa allow goal specification via templated questions. Finally, a number of works have also pursued the problem of building representations for space in the context of navigation. parisotto2017neural ; bhatti2016playing ; khan2017memory ; gordon2017iqa use similar spatial representations, Mirowski et al. mirowski2016learning use fully-connected LSTMs, while Savinov et al. savinov2018semi develop topological representations. Interesting reinforcement learning techniques have also been explored for the task of navigation mirowski2016learning ; duan2016rl .
Preliminaries

To be able to focus on the high-level mapping and planning problem, we remove confounding factors arising from low-level control by conducting our experiments in simulated real world indoor environments. Studying the problem in simulation makes it easier to run exhaustive evaluation experiments, while the use of scanned real world environments allows us to retain the richness and complexity of real scenes. We also only study the static version of the problem, though extensions to dynamic environments would be interesting to explore in future work.
We model the robot as a cylinder of a fixed radius and height, equipped with vision sensors (RGB cameras or depth cameras) mounted at a fixed height and oriented at a fixed pitch. The robot is equipped with low-level controllers which provide relatively high-level macro-actions. These macro-actions are a) stay in place, b) rotate left by a fixed angle, c) rotate right by the same angle, and d) move forward a fixed distance. We further assume that the environment is a grid world and the robot uses its macro-actions to move between nodes on this graph. The robot also has access to its precise egomotion. This amounts to assuming perfect visual odometry nister2004visual , which can itself be learned haarnoja2016backprop , but we defer the joint learning problem to future work.
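Under the grid-world assumption, the agent's state reduces to a cell and a heading, and each macro-action acts on it deterministically. A minimal sketch (the four-heading discretization, unit step size, and action names are illustrative assumptions, not the paper's exact interface):

```python
# State: (x, y, h) with heading h in {0: +x, 1: +y, 2: -x, 3: -y}.
HEADINGS = [(1, 0), (0, 1), (-1, 0), (0, -1)]

def step(state, action):
    """Apply one macro-action to a grid-world state."""
    x, y, h = state
    if action == "rotate_left":
        h = (h + 1) % 4
    elif action == "rotate_right":
        h = (h - 1) % 4
    elif action == "forward":
        dx, dy = HEADINGS[h]
        x, y = x + dx, y + dy
    # "stay" (or any unrecognized action) leaves the state unchanged
    return (x, y, h)
```

In the actual environment, a "forward" transition would additionally be rejected if the destination node is not traversable in the precomputed graph.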
We want to learn policies for this robot for navigating in novel environments that it has not previously encountered. We study two navigation tasks: a geometric task, where the robot is required to go to a target location specified in the robot's coordinate frame, and a semantic task, where the robot is required to go to an object of interest (e.g., a chair). These tasks are to be performed in novel environments; neither the exact environment map nor its topology is available to the robot.
Our navigation problem is defined as follows. At a given time step t, the robot is at a global position P_t (a position in the world coordinate frame). At each time step the robot receives as input an image of the environment I_t, and a target location (or a semantic goal) specified in the coordinate frame of the robot. The navigation problem is to learn a policy that at every time step uses these inputs (current image, egomotion and target specification) to output the action that will convey the robot to the target as quickly as possible.
Experimental Testbed. We conduct our experiments on the Stanford large-scale 3D Indoor Spaces (S3DIS) dataset introduced by Armeni et al.armeni20163d . The dataset consists of 3D scans (in the form of textured meshes) collected in 6 large-scale indoor areas that originate from 3 different buildings of educational and office use. The dataset was collected using the Matterport scanner matterport . Scans from 2 buildings were used for training and the agents were tested on scans from the 3rd building. We pre-processed the meshes to compute space traversable by the robot. We also precompute a directed graph consisting of the set of locations the robot can visit as nodes and a connectivity structure based on the set of actions available to the robot to efficiently generate training problems. More details in envvis.
Mapping

We describe how the mapping portion of our learned network can integrate first-person camera images into a top-down 2D representation of the environment, while learning to leverage statistical structure in the world. Note that, unlike analytic mapping systems, the map in our model amounts to a latent representation. Since it is fed directly into a learned planning module, it need not encode purely free space representations, but can instead function as a general spatial memory. The model learns to store inside the map whatever information is most useful for generating successful plans. However, to make the description in this section concrete, we assume that the mapper predicts free space.
The mapper architecture is illustrated in mapper. At every time step t we maintain a cumulative estimate of the free space f_t in the coordinate frame of the robot. f_t is represented as a multi-channel 2D feature map that metrically represents space in the top-down view of the world. f_t is estimated from the current image I_t, the cumulative estimate from the previous time step f_{t-1}, and the egomotion e_t between the last and the current step, using the following update rule:

f_t = U(W(f_{t-1}, e_t), f'_t), where f'_t = φ(I_t).

Here, W is a function that transforms the free space prediction f_{t-1} from the previous time step according to the egomotion e_t in the last step, φ is a function that takes as input the current image I_t and outputs an estimate of the free space based on the view of the environment from the current location (denoted by f'_t), and U is a function which accumulates the free space prediction from the current view with the accumulated prediction from previous time steps. Next, we describe how each of the functions W, φ and U is realized.
The warping function is realized using bi-linear sampling. Given the ego-motion, we compute a backward flow field that maps each pixel in the current free space image to the location in the previous free space image where it should come from. This backward flow can be analytically computed from the ego-motion (as shown in ego). The function uses bi-linear sampling to apply this flow field to the free space estimate from the previous frame. Bi-linear sampling allows us to back-propagate gradients through this operation jaderberg2015spatial , which makes it possible to train the model end to end.
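As a concrete sketch, the warp can be implemented by analytically computing the backward flow from the egomotion and bilinearly sampling the previous belief. This NumPy stand-in for the differentiable sampling layer assumes a rotation about the map center plus a translation in cells; in a trained model the same operation would be expressed with a differentiable sampler:

```python
import numpy as np

def warp_belief(prev, dx, dy, dtheta):
    """Warp the previous top-down belief into the current frame.
    prev: (H, W) belief map; dx, dy: translation in cells; dtheta: rotation
    in radians (egomotion parameterization is an illustrative assumption)."""
    h, w = prev.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = np.cos(dtheta), np.sin(dtheta)
    # Backward flow: for each output cell, the source location in `prev`
    sx = c * (xs - cx) - s * (ys - cy) + cx + dx
    sy = s * (xs - cx) + c * (ys - cy) + cy + dy
    x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
    wx, wy = sx - x0, sy - y0

    def at(yy, xx):
        # Read prev[yy, xx], treating out-of-bounds sources as zero
        valid = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        out = np.zeros((h, w), dtype=np.float64)
        out[valid] = prev[yy[valid], xx[valid]]
        return out

    # Standard bilinear interpolation of the four neighboring source cells
    return ((1 - wx) * (1 - wy) * at(y0, x0) + wx * (1 - wy) * at(y0, x0 + 1)
            + (1 - wx) * wy * at(y0 + 1, x0) + wx * wy * at(y0 + 1, x0 + 1))
```

Because the output is a (piecewise) linear function of the input belief, gradients flow through the warp, which is what makes end-to-end training of the mapper possible.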
The per-frame free space predictor is realized by a convolutional neural network. Because of our choice to always represent free space in the coordinate frame of the robot, this becomes a relatively easy function to learn: the network only has to output free space in the current coordinate frame, rather than in an arbitrary world coordinate frame determined by the cumulative egomotion of the robot so far.
Intuitively, the network can use semantic cues (such as the presence of scene surfaces like floors and walls, or common furniture objects like chairs and tables) alongside other learned priors about the sizes and shapes of common objects to generate free space estimates, even for objects that may only be partially visible. Qualitative results in mapper_vis show an example of this, where our proposed mapper is able to make predictions for spaces that haven't been observed.
The architecture of the neural network that realizes this function is shown in mapper. It is composed of a convolutional encoder which uses residual connections he2015deep and produces a representation of the scene in the 2D image space. This representation is transformed into one in the egocentric 2D top-down view via fully connected layers. This representation is then up-sampled using up-convolutional layers (also with residual connections) to obtain the update to the belief about the world from the current frame.
In addition to producing an estimate of the free space from the current view f'_t, the model also produces a confidence c'_t. This estimate is also warped by the warping function and accumulated over time into c_t. The confidence allows us to simplify the update function, and can be thought of as playing the role of the update gate in a gated recurrent unit. The update function takes in the tuples (f_{t-1}, c_{t-1}) and (f'_t, c'_t), and produces (f_t, c_t) as follows:

f_t = (f_{t-1} c_{t-1} + f'_t c'_t) / (c_{t-1} + c'_t),   c_t = c_{t-1} + c'_t.
We chose an analytic update function to keep the overall architecture simple. This can be replaced with more expressive functions like those realized by LSTMs hochreiter1997long .
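The analytic, confidence-weighted update can be sketched in a few lines (the eps term is an added numerical-stability assumption, not part of the paper's formulation):

```python
import numpy as np

def accumulate(prev_f, prev_c, cur_f, cur_c, eps=1e-6):
    """Confidence-weighted running average of free-space beliefs.
    prev_f/prev_c: warped belief and confidence from the previous step;
    cur_f/cur_c: per-frame prediction and confidence from the current view."""
    new_c = prev_c + cur_c
    new_f = (prev_f * prev_c + cur_f * cur_c) / (new_c + eps)
    return new_f, new_c
```

High-confidence new observations dominate the average, while unobserved regions (near-zero confidence) leave the accumulated belief essentially unchanged, mirroring the update-gate behavior described above.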
Mapper performance in isolation. To demonstrate that our proposed mapper architecture works, we test it in isolation on the task of free space prediction. mapper_performance shows qualitative and quantitative results.
Planning

Our planner is based on value iteration networks proposed by Tamar et al. tamar2016value , who observed that a particular type of planning algorithm called value iteration bellman1957markovian can be implemented as a neural network with alternating convolutions and channel-wise max pooling operations, allowing the planner to be differentiated with respect to its inputs. Value iteration can be thought of as a generalization of Dijkstra's algorithm, where the value of each state is iteratively recalculated by taking a max over the values of its neighbors plus the reward of the transition to those neighboring states. This plays nicely with 2D grid-world navigation problems, where these operations can be implemented with small convolution kernels followed by max-pooling over channels. Tamar et al. tamar2016value also showed that this reformulation of value iteration can be used to learn the planner (the parameters in the convolutional layer of the planner) by providing supervision for the optimal action for each state. Thus planning can be done in a trainable and differentiable manner by a very deep convolutional network (with channel-wise max-pooling). For our problem, the mapper produces the 2D top-down view of the world, which shares the same 2D grid-world structure described above, and we use value iteration networks as a trainable and differentiable planner.
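To make the connection concrete, here is hand-rolled value iteration on a 4-connected grid. A VIN replaces the fixed neighbor-max update below with learned convolution kernels followed by channel-wise max-pooling; the reward map and hyperparameters here are illustrative:

```python
import numpy as np

def value_iteration(reward, n_iters=40, gamma=0.99):
    """Value iteration on a 4-connected grid, written as the local
    max-over-neighbors update that a VIN layer implements.
    reward: (H, W) map, e.g. 1 at the goal cell."""
    v = np.zeros_like(reward, dtype=np.float64)
    for _ in range(n_iters):
        p = np.pad(v, 1)  # zero padding: out-of-grid neighbors contribute 0
        # Stack the value of the N, S, W, E neighbor of every cell,
        # then take the channel-wise max -- one "VIN layer".
        nbr = np.stack([p[:-2, 1:-1], p[2:, 1:-1],
                        p[1:-1, :-2], p[1:-1, 2:]]).max(0)
        v = reward + gamma * nbr
    return v
```

After enough iterations, the value function decays with distance from the goal, so greedily moving to the highest-valued neighbor recovers a shortest path, which is exactly the Dijkstra-like behavior described above.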
Hierarchical Planning. Value iteration networks as presented in tamar2016value (v2) are impractical to use for any long-horizon planning problem. This is because the planning step size is coupled with the action step size, leading to a) high computational complexity at run time, and b) a hard learning problem, as gradients have to flow back for as many steps. To alleviate this problem, we extend the hierarchical version presented in tamar2016value (v1).
Our hierarchical planner plans at multiple spatial scales. We start with a spatially downsampled environment and conduct value iterations in this downsampled environment. The output of this value iteration process is center cropped, upsampled, and used for doing value iterations at a finer scale. This process is repeated until we reach the resolution of the original problem. This procedure allows us to plan for goals that are exponentially far away while performing (and backpropagating through) only a logarithmic number of planning iterations. This efficiency increase comes at the cost of approximate planning.
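A rough sketch of the coarse-to-fine scheme: each finer scale is warm-started from the upsampled coarse values. This sketch omits the center-cropping and any value rescaling of the actual planner, and the scale and iteration counts are illustrative assumptions:

```python
import numpy as np

def vi(reward, iters, gamma=0.99, v0=None):
    # Plain 4-connected grid value iteration; v0 warm-starts the values.
    v = np.zeros_like(reward, dtype=np.float64) if v0 is None else v0
    for _ in range(iters):
        p = np.pad(v, 1)
        nbr = np.stack([p[:-2, 1:-1], p[2:, 1:-1],
                        p[1:-1, :-2], p[1:-1, 2:]]).max(0)
        v = reward + gamma * nbr
    return v

def hierarchical_vi(reward, scales=3, iters=8):
    """Coarse-to-fine planning: solve a 2x-downsampled problem first, then
    upsample its value function to initialize the next finer scale."""
    pyr = [reward.astype(np.float64)]
    for _ in range(scales - 1):
        r = pyr[-1]
        h, w = r.shape
        # 2x max-pool downsampling of the reward map
        pyr.append(r[:h - h % 2, :w - w % 2]
                   .reshape(h // 2, 2, w // 2, 2).max((1, 3)))
    v = None
    for r in reversed(pyr):
        if v is not None:
            # Nearest-neighbor upsample of the coarse value function
            v = np.kron(v, np.ones((2, 2)))[:r.shape[0], :r.shape[1]]
        v = vi(r, iters, v0=v)
    return v
```

With a fixed per-scale iteration budget, the reachable planning horizon grows with the number of scales while the total number of (backpropagated) iterations grows only linearly in the number of scales.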
Planning in Partially Observed Environments. Value iteration networks have only been evaluated when the environment is fully observed, i.e., the entire map is known while planning. However, for our navigation problem, the map is only partially observed. Because the planner is not hand specified but learned from data, it can learn policies which naturally take partially observed maps into account. Note that the mapper produces not just a belief about the world but also a confidence estimate, so the planner knows which parts of the map have and have not been observed.
Our final architecture, Cognitive Mapping and Planning (CMP), puts together the mapper and planner described above. At each time step, the mapper updates its multi-scale belief about the world based on the current observation. This updated belief is input to the planner which outputs the action to take. As described previously, all parts of the network are differentiable and allow for end-to-end training, and no additional direct supervision is used to train the mapping module – rather than producing maps that match some ground truth free space, the mapper produces maps that allow the planner to choose effective actions.
Training Procedure. We optimize the CMP network with fully supervised training using DAGGER ross2011reduction . We generate training trajectories by sampling arbitrary start and goal locations on the graph. We generate supervision for training by computing shortest paths on the graph. We use an online version of DAGGER, where during each episode we sample the next state based on the action from either the agent's current policy or the expert policy. We use scheduled sampling and anneal the probability of sampling from the expert policy using inverse sigmoid decay.
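The inverse sigmoid decay schedule for the expert-sampling probability can be sketched as follows (the decay constant k is an illustrative choice, not the value used in our experiments):

```python
import math

def expert_prob(i, k=100.0):
    """Probability of sampling the next action from the expert policy at
    training iteration i, annealed with inverse sigmoid decay."""
    return k / (k + math.exp(i / k))
```

Early in training the agent mostly follows the expert (probability near 1), and as training progresses it increasingly rolls out its own policy, so it sees and learns to recover from its own mistakes.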
Note that the focus of this work is on studying different architectures for navigation. Our proposed architecture can also be trained with alternate paradigms for learning such policies, such as reinforcement learning. We chose DAGGER for training our models because we found it to be significantly more sample efficient and stable in our domain, allowing us to focus on the architecture design.
Experiments

The goal of this paper is to learn policies for visual navigation for different navigation tasks in novel indoor environments. We first describe these different navigation tasks and the performance metrics. We then discuss different comparison points that quantify the novelty of our test environments and the difficulty of the tasks at hand. Next, we compare our proposed CMP architecture to other learning-based methods and to classical mapping and planning based methods. We report all numbers on the test set. The test set consists of a floor from an altogether different building not contained in the training set. (See the dataset website and envvis for environment visualizations.)
Tasks. We study two tasks: a geometric task, where the goal is to reach a point in space, and a semantic task, where the goal is to find objects of interest. We provide more details about both these tasks below:
Geometric Task: The goal is specified geometrically in terms of the position of the goal in the robot's coordinate frame. Problems for this task are generated by sampling a start node on the graph and then sampling an end node which is within 32 steps of the starting node and preferably in another room or in the hallway (we use room and hallway annotations from the dataset armeni20163d ). This is the same as the PointGoal task described in anderson2018evaluation .
Semantic Task: We consider three tasks: 'go to a chair', 'go to a door' and 'go to a table'. The agent receives a one-hot vector indicating the object category it must go to and is considered successful if it can reach any instance of the indicated object category. We use object annotations from the S3DIS dataset armeni20163d to set up this task. We initialize the agent such that it is within 32 time steps of at least one instance of the indicated category, and train it to go towards the nearest instance. This is the same as the ObjectGoal task described in anderson2018evaluation .
The same sampling process is used during training and testing. For testing, we sample 4000 problems on the test set; these problems remain fixed across the different algorithms that we compare. We measure performance by measuring the distance to goal after running the policies for a maximum number of time steps (200), or until they emit the stop action.
Performance Metrics. We report multiple performance metrics: a) the mean distance to goal, b) the 75th percentile distance to goal, and c) the success rate (the agent succeeds if it is within a small distance of the goal location) as a function of the number of time steps. We plot these metrics as a function of time steps and also report performance at fixed time steps in the various tables. For the most competitive methods, we also report the SPL metric (higher is better) as introduced in anderson2018evaluation . In addition to measuring whether the agent reaches the goal, SPL also measures the efficiency of the path used and whether the agent reliably determines that it has reached the goal. For computing SPL, we use the shortest path on the graph as the shortest-path length, and count both rotation and translation actions for both the agent's path and the shortest path. An episode is considered successful if the agent ends up within 3 steps of the goal location. For the geometric task, we run the agent until it outputs the 'stay-in-place' action, or for a maximum of 200 time steps. For the semantic task, we train a separate 'stop' predictor, which is trained to predict whether the agent is within 3 steps of the goal; the probability threshold at which the episode is terminated is determined on the validation set.
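SPL, as defined in anderson2018evaluation, averages success weighted by the ratio of the shortest-path length to the maximum of the actual and shortest path lengths. A direct implementation (assuming nonzero shortest-path lengths):

```python
def spl(successes, shortest_lengths, actual_lengths):
    """Success weighted by Path Length: mean over episodes of
    S_i * l_i / max(p_i, l_i), where S_i is success (0 or 1), l_i the
    shortest-path length, and p_i the agent's actual path length."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lengths, actual_lengths)]
    return sum(terms) / len(terms)
```

A failed episode contributes 0 regardless of path length, and a successful episode contributes at most 1, achieved only when the agent's path matches the shortest path.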
Models are trained asynchronously using TensorFlow tf . We used ADAM kingma2014adam to optimize our loss function and trained for 60K iterations, with the learning rate dropped by a factor of 10 every 20K iterations (we found this necessary for consistent training across different runs). We use weight decay to regularize the network and use batch-norm ioffe2015batch . We use ResNet-50 he2016identity pre-trained on ImageNet imagenet_cvpr09 to represent RGB images. We transfer supervision from RGB images to depth images using cross modal distillation gupta2016cross between RGB-D image pairs rendered from meshes in the training set to obtain a pre-trained ResNet-50 model to represent depth images.
Our experiments are designed to test performance at visual navigation in novel environments. We first quantify the differences between training and test environments using a nearest neighbor trajectory method. We next quantify the difficulty of our environments and evaluation episodes by training a blind policy that only receives the relative goal location at each time step. Next, we test the effectiveness of our memory-based architecture. We compare to a purely reactive agent to understand the role of memory for this task, and to an LSTM-based policy to test the effectiveness of our specific memory architecture. Finally, we make comparisons with classical mapping and planning based techniques. Since the goal of this paper is to study various architectures for navigation, we train all these architectures the same way using DAGGER ross2011reduction as described earlier. We provide more details for each of these baselines below.
Nearest Neighbor Trajectory Transfer: To quantify similarity between training and testing environments, we transfer optimal trajectories from the train set to the test set using visual nearest neighbors (in RGB ResNet-50 feature space). At each time step, we pick the location in the training set which results in the most similar view to that seen by the agent at the current time step. We then compute the optimal action that conveys the robot to the same relative offset in the training environment from this location and execute this action at the current time step. This procedure is repeated at each time step. Such a transfer leads to very poor results.
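The nearest-neighbor transfer baseline described above can be sketched as follows; the function and data-layout names are illustrative (the actual feature database and action lookup are not part of this text):

```python
import numpy as np

def nearest_neighbor_action(query_feat, train_feats, train_actions):
    """Pick the optimal action recorded at the visually closest training view.

    query_feat: (D,) ResNet-50 feature of the agent's current view.
    train_feats: (N, D) features of views sampled from training environments.
    train_actions: length-N sequence of optimal actions at those views.
    """
    # L2 distance in feature space to every stored training view.
    dists = np.linalg.norm(train_feats - query_feat[None, :], axis=1)
    # Execute the optimal action from the most similar training location.
    return train_actions[int(np.argmin(dists))]
```

This is repeated at every time step; as the experiments show, such transfer performs poorly in novel environments.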
No image, goal location only, with LSTM: Here, we ignore the image and simply use the relative goal location (in the robot's current coordinate frame) as input to an LSTM, and predict the action that the agent should take. The relative goal location is embedded via fully connected layers with ReLU non-linearities before being input to the LSTM.
Reactive Policy, Single Frame: We next compare to a reactive agent that uses the first-person view of the world. As described above we use ResNet-50 to extract features. These features are passed through a few fully connected layers, and combined with the representation for the relative goal location which is used to predict the final action. We experimented with additive and multiplicative combination strategies and both performed similarly.
Reactive Policy, Multiple Frames: We also consider the case where the reactive policy receives 3 previous frames in addition to the current view. Given that the robot's step-size is fairly large, we use a late-fusion architecture and fuse the information extracted from ResNet-50. Note that this architecture is similar to the one used in zhu2016target. The primary differences are: the goal is specified as a relative offset (instead of an image), training uses DAGGER (which utilizes denser supervision) instead of A3C, and testing is done in novel environments. These adaptations are necessary to make an interpretable comparison on our task.
LSTM Based Agent: Finally, we also compare to an agent which uses an LSTM based memory. We introduce LSTM units on the multiplicatively combined image and relative goal location representation. Such an architecture also gives the LSTM access to the egomotion of the agent (via how the relative goal location changes between consecutive steps). Thus this model has access to all the information that our method uses. We also experimented with other LSTM based models (ones without egomotion, ones with the egomotion input more explicitly, etc.), but weren't able to reliably train them in early experiments and did not pursue them further.
Purely Geometric Mapping: We also compare to a purely geometric incremental mapping and path planning policy. We projected observed 3D points incrementally into a top-down occupancy map using the ground truth egomotion and the camera extrinsics and intrinsics. When using depth images as input, these 3D points are directly available. When using RGB images as input, we triangulated SIFT feature points in the RGB images (registered using the ground truth egomotion) to obtain the observed 3D points (we used the COLMAP library schoenberger2016sfm). This occupancy map was used to compute a grid-graph (unoccupied cells are assumed free). For the geometric task, we mark the goal location on this grid-graph and execute the action that minimizes the distance to the goal node. For the semantic task, we use purely geometric exploration along with a semantic segmentation network (trained to segment chairs, doors and tables on the S3DIS dataset armeni20163d) to identify object categories of interest. The agent systematically explores the environment using frontier-based exploration yamauchi1997frontier (we sample a goal location outside the map and try to reach it, as for the geometric task; as there is no path to this location, the agent ends up systematically exploring the environment) until it detects the specified category using the semantic segmentation network. These labels are projected onto the occupancy map using the 3D points, and the corresponding nodes in the grid-graph are labelled as goals. We then output the action that minimizes the distance to these inferred goal nodes.
For this baseline, we experimented with different input image sizes and increased the frequency at which RGB or depth images were captured. We validated a number of other hyper-parameters: a) the number of points in a cell before it is considered occupied, b) the number of intervening occupied cells before a cell is considered non-traversable, and c) the radius for morphological opening of the semantic labels on the map. 3D reconstruction from RGB images was computationally expensive, and thus we report comparisons to these classical baselines on a subset of the test cases.
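The core of this classical baseline — binning observed 3D points into an occupancy grid and planning on the induced grid-graph — can be sketched as follows. This is a simplified stand-in (4-connected grid, BFS shortest path, illustrative thresholds and names), not the actual implementation:

```python
import numpy as np
from collections import deque

def build_occupancy(points_xy, grid_size, cell_m, min_pts=2):
    """Bin observed 3D points (projected to the ground plane) into an
    occupancy grid; a cell is occupied once it holds >= min_pts points."""
    counts = np.zeros((grid_size, grid_size), dtype=int)
    for x, y in points_xy:
        i, j = int(x // cell_m), int(y // cell_m)
        if 0 <= i < grid_size and 0 <= j < grid_size:
            counts[i, j] += 1
    return counts >= min_pts

def first_action_to_goal(occ, start, goal):
    """BFS on the 4-connected grid-graph of free cells; return the first
    cell of a shortest path from start to goal (or None if unreachable)."""
    if start == goal:
        return start
    prev = {start: None}
    q = deque([start])
    while q:
        cur = q.popleft()
        if cur == goal:
            break
        for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + d[0], cur[1] + d[1])
            if (0 <= nxt[0] < occ.shape[0] and 0 <= nxt[1] < occ.shape[1]
                    and not occ[nxt] and nxt not in prev):
                prev[nxt] = cur
                q.append(nxt)
    if goal not in prev:
        return None
    node = goal
    while prev[node] != start:  # walk back to the cell adjacent to start
        node = prev[node]
    return node
```

At each step the agent would move toward the returned cell, re-running the search as new observations update the map.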
Each Mean / 75th %ile / Success cell reports values at 39 / 199 time steps.

| Method | RGB: Mean | RGB: 75th %ile | RGB: Success % | Depth: Mean | Depth: 75th %ile | Depth: Success % | SPL % (RGB / Depth) |
|---|---|---|---|---|---|---|---|
| *Geometric Task (4000 episodes)* | | | | | | | |
| No Image LSTM | 20.8 / 20.8 | 28 / 28 | 6.2 / 6.2 | — | — | — | — |
| Reactive (1 frame) | 20.9 / 17.0 | 28 / 26 | 8.2 / 21.9 | — | — | — | — |
| Reactive (4 frames) | 17.0 / 8.9 | 26 / 18 | 20.3 / 56.1 | 16.7 / 8.2 | 26 / 17 | 22.8 / 62.2 | 17.4 / 52.0 |
| *Geometric Task (1000 episodes)* | | | | | | | |
| Classical (900px images) | 20.3 / 3.3 | 29 / 2 | 17.4 / 89.6 | 20.4 / 3.3 | 29 / 2 | 17.7 / 90.7 | 15.9 / 80.6 |
| Our (CMP [+6 MP3D Envs]) | 6.3 / 2.8 | 7 / 0 | 71.5 / 86.1 | 3.8 / 1.9 | 0 / 0 | 89.7 / 94.6 | 70.8 / 82.3 |
Each cell reports per-category values (chair / door / table) followed by their mean.

| Method | Success % [39 steps] | Success % [199 steps] | SPL % [199 steps] |
|---|---|---|---|
| *RGB (4000 episodes)* | | | |
| Reactive (4 frames) | 24.8 / 24.7 / 18.6 / 22.7 | 19.1 / 24.9 / 19.7 / 21.2 | 12.7 / 13.3 / 10.2 / 12.1 |
| *RGB (500 episodes)* | | | |
| Classical (Explore + Sem. Segm.) | 11.6 / 28.3 / 13.6 / 17.8 | 16.2 / 29.1 / 22.2 / 22.5 | 2.5 / 10.1 / 5.1 / 5.9 |
| Our (CMP [+6 MP3D Envs]) | 42.8 / 40.3 / 21.0 / 34.7 | 80.3 / 42.2 / 40.7 / 54.4 | 21.1 / 20.7 / 15.1 / 19.0 |
| *Depth (4000 episodes)* | | | |
| Reactive (4 frames) | 28.4 / 23.6 / 29.2 / 27.1 | 33.8 / 23.8 / 38.7 / 32.1 | 14.0 / 9.8 / 11.9 / 11.9 |
| *Depth (1000 episodes)* | | | |
| Classical (Explore + Sem. Segm.) | 38.0 / 72.4 / 28.5 / 46.3 | 38.9 / 49.8 / 43.0 / 43.9 | 27.5 / 24.7 / 24.3 / 25.5 |
| Our (CMP [+6 MP3D Envs]) | 43.9 / 43.2 / 42.5 / 43.2 | 60.4 / 64.5 / 66.5 / 63.8 | 19.5 / 22.9 / 26.1 / 22.8 |
Geometric Task. We first present results for the geometric task. geom-plot plots the error metrics over time (for 199 time steps), while geom-result reports these metrics at 39 and 199 time steps, along with SPL (with a maximum episode length of 199). We summarize the results below:
We first note that nearest neighbor trajectory transfer does not work well, with high mean and median distances to goal. This highlights the differences between the train and test environments in our experiments.
Next, we note that the ‘No Image LSTM’ baseline also performs poorly, with a very low success rate. This suggests that our testing episodes aren’t trivial: they don’t just involve going straight to the goal, but require understanding the layout of the given environment.
Next, we observe that the reactive baseline with a single frame also performs poorly, succeeding only rarely. Note that this reactive baseline is able to perform well on the training environments, obtaining a mean distance to goal of about 9 steps, but performs poorly on the test set, only getting to within 17 steps of the goal on average. This suggests that a reactive agent is able to effectively memorize the environments it was trained on, but fails to generalize to novel environments; this is not surprising given that it does not have any form of memory to allow it to map or plan. We also experimented with using dropout in the fully connected layers for this model, but found that it hurt performance on both the train and the test sets.
Using additional frames as input to the reactive policy leads to a large improvement in performance, for both RGB and depth input.
The LSTM based model consistently outperforms these reactive baselines across all metrics. This indicates that memory does have a role to play in navigation in novel environments.
Our proposed method, CMP, outperforms all of these learning based methods, across all metrics and input modalities. CMP achieves a lower 75th %ile distance to goal (14 and 1 as compared to 21 and 5 for the LSTM) and improves the success rate to 62.5% and 78.3% from 53.0% and 71.8%. CMP also obtains higher SPL (59.4% vs. 51.3% and 73.7% vs. 69.1% for RGB and depth input respectively).
We next compare to classical mapping and path planning. We first note that a purely geometric approach, when provided with depth images, does very well, obtaining an SPL of 80.6%. Access to depth images and perfect pose allows efficient and accurate mapping, leading to high performance. In contrast, when using only RGB images as input (but still with perfect pose), performance drops sharply to only 15.9%. There are two failure modes: spurious stray points in the reconstruction that get treated as obstacles, and failure to reconstruct texture-less obstacles (such as walls), causing the agent to bump into them. In comparison, CMP performs well even when presented with just RGB images, at 59.6% SPL. Furthermore, when CMP is trained with more data (6 additional large buildings from the Matterport3D dataset Matterport3D), performance improves further, to 70.8% SPL for RGB input and 82.3% SPL for depth input. Though we tried our best at implementing the classical purely geometry-based method, we note that it may be improved further by introducing and validating over additional hyper-parameters, especially for the case where depth observations are available as input.
Variance Over Multiple Runs.
We also report variance in performance over five re-trainings from different random initializations of the network for the 3 most competitive methods (Reactive with 4 frames, LSTM, and CMP) for the depth image case. all_error shows the performance: the solid line shows the median metric value, and the surrounding shaded region represents the minimum and maximum metric value over the five re-trainings. As we use imitation learning (instead of reinforcement learning) to train our models, variation in performance is reasonably small for all models, and CMP leads to significant improvements.
Ablations. We also studied ablated versions of our proposed method. The key takeaways: a learned mapper leads to better navigation performance than an analytic mapper, planning is crucial (especially when using RGB images as input), and single-scale planning works slightly better than multi-scale planning, at the cost of increased planning cost. More details in supp_exp.
Additional comparisons between LSTM and CMP. We also conducted additional experiments to further compare the performance of the LSTM baseline with our model in the most competitive scenario, where both methods use depth images. We summarize the key conclusions here and provide more details in supp_exp. We report performance when the target is much further away (64 time steps away) in scenarios (top). We see a larger gap in performance between LSTM and CMP for these test scenarios. We also compared the performance of CMP and LSTM over problems of different difficulty and observed that CMP is generally better across all values of hardness; for RGB images it is particularly better for cases with high hardness. We also evaluate how well these models generalize when trained on a single scene, and when transferring across datasets. We find that there is a smaller drop in performance for CMP as compared to LSTM (scenarios (bottom)). More details in supp_exp. trajectory-vis visualizes and discusses some representative success and failure cases for CMP; video examples are available on the project website.
Semantic Task. We next present results for the semantic task, where the goal is to find an object of interest. The agent receives a one-hot vector indicating the object category it must go to and is considered successful if it can reach any instance of the indicated object category. We compare our method to the best performing reactive and LSTM based baseline models from the geometric navigation task (this LSTM is impoverished because it no longer receives the egomotion of the agent as input, since the goal cannot be specified as an offset relative to the robot; we did experiment with an LSTM model which received egomotion as input, but weren't able to train it in initial experiments). This is a challenging task, especially because the agent may start in a location from which the desired object is not visible, and it must learn to explore the environment to find the desired object. sem-plot and sem-result report the success rate and the SPL metric for the different categories we study. trajectory-vis shows sample trajectories for this task for CMP. We summarize our findings below:
This is a hard task; performance for all methods is much lower than for the geometric task of reaching a specified point in space.
CMP performs better than the other two learning based baselines across all metrics.
Comparisons to the classical baseline of geometric exploration followed by semantic segmentation (sem-plot (bottom), orange and blue lines) are also largely favorable to CMP. Performance for the classical baseline with RGB input suffers due to inaccuracy in estimating the occupancy of the environment. With depth input, this becomes substantially easier, leading to better performance. A particularly interesting case is that of finding doors. As the classical baseline explores, it comes close to doors as it exits the room it started from; however, it is unable to stop (possibly being unable to reliably detect them). This explains the spike in performance in sem-plot.
We also report SPL for this task for the different methods in sem-result. We observe that though the success rates are high, SPL numbers are low. In comparison to the success rate, SPL additionally measures path efficiency and whether the agent is able to reliably determine that it has reached the desired target. sem-plot (bottom) shows that the success rate continues to improve over the length of the episodes, implying that the agent does realize that it has reached the desired object of interest. Thus, we believe SPL numbers are low because of inefficiency in reaching the target, especially as SPL measures efficiency with respect to the closest object of interest using full environment information. This can be particularly strict in novel environments, where the agent may not have any desired objects in view and thus needs to explore the environment to find them. Nevertheless, CMP outperforms learning-based methods on this metric, and also outperforms our classical baseline when using RGB input.
CMP, when trained with additional data (6 additional buildings from the Matterport3D dataset Matterport3D), performs much better (green and orange lines in sem-plot (bottom)), indicating scope for further improvements in such policies as larger datasets become available. Semantic segmentation networks for the classical baseline can similarly be improved using more data (possibly also from large-scale Internet datasets), but we leave those experiments and comparisons for future work.
We visualize activations at different layers in the CMP network to check if the architecture conforms to the intuitions that inspired the design of the network. We check for the following three aspects: a) is the representation produced by the mapper indeed spatial, b) does the mapper capture anything beyond what a purely geometric mapping pipeline captures, and c) do the value maps obtained from the value iteration module capture the behaviour exhibited by the agent.
Is the representation produced by the mapper spatial? We train simple readout functions on the learned mapper representation to predict free space around the agent. rom visualizes the output of these readout functions at two time steps from an episode as the agent moves. We see that the representation produced by the mapper is in correspondence with the actual free space around the agent. The representation produced by the mapper is indeed spatial in nature. We also note that readouts are generally better at finer scales.
What does the mapper representation capture? We next try to understand what information is captured in these spatial representations. First, as discussed above, the representation produced by the mapper can be used to predict free space around the agent. Note that the agent was never trained to predict free space, yet the representations produced by the mapper carry enough information to predict free space reasonably well. Second, rom2 shows free space prediction for two cases where the agent is looking through a doorway. We see that the mapper representation is expressive enough to make reasonable predictions for free space behind the doorway. This is something that a purely geometric system that only reasons about directly visible parts of the environment is simply incapable of doing. Finally, we show the output of readout functions that were trained to differentiate between free space in a hallway vs. free space in a room. rom3 (top) shows the prediction when the agent is out in the hallway, and rom3 (bottom) shows the prediction when the agent is in a room. We see that the representation produced by the mapper can reasonably distinguish between free space in a hallway and free space in a room, even though it was never explicitly trained to do so. Once again, this is something that a purely geometric description of the world is unable to capture.
Do the value maps obtained from the value iteration module capture the behaviour exhibited by the agent? Finally, vis visualizes a one channel projection of the value map for the single scale version of our model at five time steps from an episode. We can see that the value map is indicative of the current actions that the agent takes, and how the value maps change as the agent discovers that the previously hypothesised path was infeasible.
realrobot We have also deployed these learned policies on a real robot. We describe the robot setup, implementation details and our results below.
Robot description. We conducted our experiments on a TurtleBot 2 robot. TurtleBot 2 is a differential drive platform based on the Yujin Kobuki Base. We mounted an Orbbec Astra camera on the robot, along with a GPU-equipped high-end gaming laptop (Gigabyte Aero 15” with an NVIDIA 1060 GPU). The robot is shown in robot (left). We used ROS to interface with the robot and the camera. We read out images from the camera, and an estimate of the robot's position and orientation obtained from wheel encoders and an onboard inertial measurement unit (IMU). We controlled the robot by specifying desired linear and angular velocities. These desired velocity commands are internally used to determine the voltage applied to the two motors through a proportional integral derivative (PID) controller. Note that TurtleBot 2 is a non-holonomic system: it only moves in the direction it is facing, and its dynamics can be approximated as a Dubins car.
Implementation of macro-actions. Our policies output macro actions (rotate left or right by a fixed angle, or move forward by a fixed distance). Unlike past work bruce2018learning that uses human operators to implement such macro-actions for simulation-to-real transfer, we implement these macro-actions using an iterative linear–quadratic regulator (iLQR) controller jacobson1970differential ; li2004iterative. iLQR leverages known system dynamics to output a dynamically feasible local reference trajectory (a sequence of states and controls) that can convey the system from a specified initial state to a specified final state (in our case, the desired rotation or forward motion). Additionally, iLQR is a state-space feedback controller: it estimates time-varying feedback matrices that adjust the reference controls to compensate for deviations from the reference trajectory (due to mismatch in system dynamics or noise in the environment). These adjusted controls are applied to the robot. More details are provided in ilqr.
Policy. We deployed the policy for the geometric task onto the robot. Like all other policies, this policy was trained entirely in simulation. We used the ‘CMP [+6 MP3D Env]’ policy that was trained with the six additional large environments from the Matterport3D dataset Matterport3D (on top of the 4 environments from the S3DIS armeni20163d dataset). Apart from improved performance in simulation (higher SPL), it also exhibited better real world behavior in preliminary runs.
Results. We ran the robot in 10 different test configurations (shown in realtests). These tests were picked such that there was no straight path to the goal location, and involved situations like getting out of a room, going from one cubicle to another, and going around tables and kitchen counters. We found the depth as sensed from the Orbbec camera to be very noisy (and different from the depth produced in our simulator), and hence only conducted experiments with RGB images as input. We conducted 5 runs for each of the 10 test configurations, and report the success rate for the 10 configurations in robot (middle). A run was considered successful if the robot made it close to the specified target location without brushing against or colliding with any objects. Sample videos of execution are available on the project website. The policy succeeded on 34 of the 50 runs. Executed trajectories are plotted in realtests. This is a very encouraging result, given that the policy was trained entirely in simulation on very different buildings, without any form of domain adaptation. Our robot, which only uses monocular RGB images, successfully avoids running into obstacles and arrives at the goal location in a number of test cases.
robot (right) presents failure modes of our runs. 10 of the 16 failures are due to infractions (head-on collisions, grazing against objects, and tripping over rods on the floor). These failures can possibly be mitigated by use of a finer action space for more dexterous motion, additional instrumentation such as near range obstacle detection, or coupling with a collision avoidance system. The remaining 6 failures correspond to not going around obstacles, possibly due to inaccurate perception.
discussion In this paper, we introduced a novel end-to-end neural architecture for navigation in novel environments. Our architecture learns to map from first-person viewpoints and uses a planner with the learned map to plan actions for navigating to different goals in the environment. Our experiments demonstrate that such an approach outperforms other direct methods which do not use explicit mapping and planning modules. While our work represents exciting progress towards problems which have not been looked at from a learning perspective, a lot more needs to be done for solving the problem of goal oriented visual navigation in novel environments.
A central limitation of our work is the assumption of perfect odometry. Robots operating in the real world do not have perfect odometry, and a model that factors in uncertainty in movement is essential before such a model can be deployed in the real world.
A related limitation is that of building and maintaining metric representations of space, which does not scale well to large environments. We mitigate this by using a multi-scale representation of space. Though this allows us to study larger environments, it generally makes planning more approximate, given the lower resolution of the coarser scales, which could lead to loss of connectivity information. Investigating representations for space that do not suffer from such limitations is important future work.
In this work, we have exclusively used DAGGER for training our agents. Though this resulted in good results, it suffers from the issue that the optimal policy under an expert may be infeasible under the information that the agent currently has. Incorporating this into learning, through guided policy search or reinforcement learning, may lead to better performance, especially for the semantic task.
Acknowledgments: We thank Shubham Tulsiani, David Fouhey, Somil Bansal and Christian Häne for useful discussion and feedback on the manuscript, as well as Marek Fiser and Sergio Guadarrama for help with the Google infrastructure and Tensorflow.
Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation (1997)
Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML (2015)
Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: NIPS (2015)
ego Consider a robot that rotates about its position by an angle $\theta$ and then moves $t$ units forward. Corresponding points $p$ in the original top view and $p'$ in the new top view are related to each other as follows ($R_\theta$ is a rotation matrix that rotates a point by an angle $\theta$; forward motion is taken to be along the $y$-axis of the egocentric frame):

$$p = R_\theta \left( p' + \begin{bmatrix} 0 \\ t \end{bmatrix} \right)$$

Thus, given the egomotion $\theta$ and $t$, for each point in the new top view we can compute the location in the original top view from which it came.
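This backward warp can be applied per point; a minimal sketch, assuming forward motion is along the +y axis of the egocentric frame (an assumption of this sketch, not stated in the text):

```python
import numpy as np

def old_coords(points_new, theta, t):
    """Map points in the post-motion egocentric top view back to the
    pre-motion top view, for a rotation by theta followed by a forward
    translation of t.

    points_new: (N, 2) array of (x, y) coordinates in the new top view.
    """
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    shifted = points_new + np.array([0.0, t])   # undo the forward motion
    return shifted @ R.T                        # undo the rotation
```

Sampling the old map at these locations (e.g. with a spatial transformer jaderberg-style bilinear sampler) yields the differentiable egomotion-compensated update of the belief map.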
mapper_vis To demonstrate that our proposed mapper architecture works, we test it in isolation on the task of free space prediction. We consider the scenario of an agent rotating about its current position, and the task is to predict free space in a 3.20 meter neighborhood of the agent. We only provide supervision for this experiment at the end of the agent's rotation. mapper-vis illustrates what the mapper learns. Observe that our mapper is able to make predictions where no observations are made. We also report the mean average precision for various versions of the mapper (mapper) on the test set (consisting of 2000 locations from the testing environment). We compare against an analytic mapping baseline which projects points observed in the depth image into the top view (by back-projecting them into space and rotating them into the top-down view).
| Method | Modality | CNN Architecture | Free Space |
|---|---|---|---|
| Learned Mapper | depth | ResNet-50, random init. | 63.4 |
| Learned Mapper | depth | ResNet-50, init. using gupta2016cross | 78.4 |
Additional experiment on an internal Matterport dataset. We also conduct experiments on an internal Matterport dataset consisting of 41 scanned environments. We train on 27 of these environments, use 4 for validation and test on the remaining 10. We show results for the 10 test environments in gmd_error. We again observe that CMP consistently outperforms the 4 frame reactive baseline and LSTM.
| Scenario | Mean | 75th %ile | Success Rate (in %) |
|---|---|---|---|
| Far away goal (maximum 64 steps away): run for 79 steps | 47.2 / 15.2 / 11.9 | 58 / 29 / 19.2 | 0.0 / 58.4 / 66.3 |
| Far away goal: run for 159 steps | 47.2 / 12.5 / 9.3 | 58 / 19 / 0 | 0.0 / 69.0 / 78.5 |
| Train on 1 floor | 25.3 / 8.9 / 7.0 | 30 / 18 / 10 | 0.7 / 58.9 / 67.9 |
| Transfer from IMD | 25.3 / 11.0 / 8.5 | 30 / 21 / 15 | 0.7 / 48.6 / 61.1 |
Ablations. We also present performance of ablated versions of our proposed method in ablation.
Single Scale Planning. We replace the multi-scale planner with a single-scale planner. This results in slightly better performance, but at increased planning cost.
No Planning. We swap out the planner CNN for a shallower CNN. This results in a drop in performance, especially for the RGB case, as compared to the full system which uses the full planner.
Analytic Mapper. We also train a model where we replace our learned mapper with an analytic mapper that projects points from the depth image into the overhead view, used with a single-scale version of the planner. We observe that this analytic mapper actually works worse than the learned one, thereby validating our architectural choice of learning to map.
Additional comparisons between LSTM and CMP. We also report additional experiments on the Stanford S3DIS dataset to further compare the performance of the LSTM baseline with our model in the most competitive scenario, where both methods use depth images. These are reported in scenarios. We first evaluate how well these models perform when the target is much further away (instead of sampling problems where the goal is within 32 time steps, we sample problems where the goal is 64 time steps away). We present evaluations for two cases, when the agent is run for 79 steps or 159 steps (see ‘Far away goal’ rows in scenarios). We find that both methods suffer when run for only 79 steps, because of the limited time available for back-tracking, and performance improves when these agents are run for longer. We also see a larger gap in performance between LSTM and CMP for both these test scenarios, highlighting the benefit of our mapping and planning architecture.
We also evaluate how well these models generalize when trained on a single scene (‘Train on 1 scene’). We find that there is a smaller drop in performance for CMP as compared to LSTM. We also found CMP to transfer from internal Matterport dataset to the Stanford S3DIS Dataset slightly better (‘Transfer from internal dataset’).
We also study how performance of LSTM and CMP compares across geometric navigation tasks of different hardness in hardness. We define hardness as the gap between the ground truth and heuristic (Manhattan) distance between the start and goal, normalized by the ground truth distance. For each range of hardness we show the fraction of cases where LSTM gets closer to the goal (LSTM Better), both LSTM and CMP are equally far from the goal (Both Equal) and CMP gets closer to goal than LSTM (CMP Better). We observe that CMP is generally better across all values of hardness, but for RGB images it is particularly better for cases with high hardness.
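The hardness measure defined above — the gap between the ground truth and the Manhattan (heuristic) distance, normalized by the ground truth distance — can be computed directly; a minimal sketch for 2D grid coordinates (function and argument names are ours):

```python
def hardness(start, goal, ground_truth_distance):
    """Hardness of a navigation episode.

    start, goal: (x, y) grid coordinates.
    ground_truth_distance: true shortest-path distance on the graph.
    Returns 0 when the Manhattan heuristic is exact (a straight shot),
    approaching 1 as obstacles force long detours.
    """
    manhattan = abs(goal[0] - start[0]) + abs(goal[1] - start[1])
    return (ground_truth_distance - manhattan) / ground_truth_distance
```

Episodes can then be bucketed by hardness to compare how often LSTM or CMP gets closer to the goal in each bucket.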
envvis We pre-processed the meshes to compute the space traversable by the robot. Top views of the obtained traversable space are shown in area_train1 and area_train2 (training and validation) and area4 (testing); they indicate the complexity of the environments we are working with and the differences in layouts between the training and testing environments. Recall that the robot's action space consists of macro-actions. We pick a fixed step size, which allows us to pre-compute the set of locations (spatial location and orientation) that the robot can visit in this traversable space. We also precompute a directed graph consisting of this set of locations as nodes and a connectivity structure based on the actions available to the robot.
Our setup allows us to study navigation, but also enables us to independently develop and design our mapper and planner architectures. We developed our mapper by studying the problem of free space prediction from sequences of first-person views, as available while walking through these environments. We developed our planner by using the ground truth top-view free space as 2D mazes to plan paths through. Note that this division was merely done to better understand each component; the final mapper and planner are trained jointly, and there is no restriction on what information gets passed between the mapper and the planner.
ilqr We use the robot's 2D location and orientation as the state $x = (p_x, p_y, \phi)$, the linear and angular velocity $u = (v, \omega)$ as the control inputs to the system, and a function $f$ to model the (Dubins car) dynamics of the system as follows:

$$\dot{p}_x = v \cos\phi, \qquad \dot{p}_y = v \sin\phi, \qquad \dot{\phi} = \omega.$$
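For simulation or sanity checks, a single discretized (forward Euler) step of these Dubins-car dynamics can be sketched as follows (a simplification; the actual controller integrates the continuous dynamics):

```python
import math

def dubins_step(state, control, dt):
    """One Euler step of Dubins-car dynamics.

    state:   (x, y, phi) -- 2D position and heading.
    control: (v, omega)  -- linear and angular velocity.
    dt:      integration time step in seconds.
    """
    x, y, phi = state
    v, w = control
    return (x + v * math.cos(phi) * dt,
            y + v * math.sin(phi) * dt,
            phi + w * dt)
```

Note that the car only translates along its current heading, capturing the non-holonomic constraint mentioned in the robot description.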
Given an initial state $x_0$, and a desired final state $x^*$ (taken to be $0$ without loss of generality), iLQR solves the following optimization problem over the control sequence, subject to the system dynamics $f$:

$$\min_{u_{1:T}} \; \sum_{t=1}^{T} \left( x_t^\top Q\, x_t + u_t^\top R\, u_t \right),$$
where the matrices $Q$ and $R$ are specified to be appropriately scaled identity matrices, the time step $\Delta t$ controls the frequency with which we apply the control input, and the horizon $T$ determines the total time duration we have to finish executing the macro-action. Matrix $Q$ incentivizes the system to reach the target state quickly, and matrix $R$ incentivizes applying small velocities. The exact scaling of the matrices $Q$ and $R$, $\Delta t$, and $T$ are set experimentally by running the robot on a variety of start and goal state pairs.
Given that Dubins car dynamics are non-linear, iLQR optimizes the cost function by iteratively linearizing the system around the current solution. As mentioned, iLQR outputs a reference trajectory $\bar{x}_t$, reference controls $\bar{u}_t$, and a set of feedback matrices $K_t$. The control applied to the system at time step $t$ is obtained as $u_t = \bar{u}_t + K_t(\hat{x}_t - \bar{x}_t)$, where $\hat{x}_t$ is the estimated state of the system as measured from the robot's wheel encoders and IMU (after appropriate coordinate transforms).
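Applying the iLQR feedback law described above amounts to one matrix-vector product per step; a minimal sketch (shapes assume a 3D state and 2D control, matching the Dubins-car setup):

```python
import numpy as np

def feedback_control(u_ref, K, x_est, x_ref):
    """Adjust the reference control with the iLQR feedback term:
    u = u_ref + K (x_est - x_ref).

    u_ref: (2,) reference control for this time step.
    K:     (2, 3) time-varying feedback gain matrix.
    x_est: (3,) state estimated from wheel encoders / IMU.
    x_ref: (3,) reference state from the planned trajectory.
    """
    return u_ref + K @ (x_est - x_ref)
```

When the robot tracks the reference trajectory exactly (x_est == x_ref), the correction vanishes and the reference control is applied unchanged.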
v1. First version.
v2. CVPR 2017 camera-ready version. Added more experiments for the semantic task. Code made available on the project website.
v3. IJCV version. Put the work in context of more recent work in the area, added the SPL metric, added comparisons to classical methods, added details about real-world deployment of the learned policies, and added more visualizations of what is being learned inside the network.