Autonomous navigation in outdoor and urban environments has received major attention in the vision and robotics communities, mostly driven by investments from the automotive industry. In contrast, less attention has been dedicated to indoor navigation, where the diversity of environment structures poses new and open scientific challenges.
This paper focuses on the Active Visual Search (AVS) problem in a known indoor environment. We propose a motion planning policy that decides the movements of an agent within its observed world, in order to approach a specific object (the target) and visually detect it. When the target is successfully detected the agent can reach it following the shortest path (see Figure 1(a)).
AVS in real-world scenarios using an egocentric camera can be very challenging because of the unpredictable quality of the observations, i.e. objects in the far field, motion blur and low resolution, partial views and occlusions due to scene clutter. This has an impact not only on the object detection but also on the planning policy. To address this challenge, recent efforts are mostly based on deep Reinforcement Learning (RL), e.g. deep recurrent Q-network (DRQN), fed with deep visual embedding [schmid2019iros, ye2019ral]. Training such DRQN models requires a large amount of data in the form of observation sequences of various lengths, covering successful and unsuccessful search episodes from multiple real scenarios or simulated environments.
Instead of performing any training to learn the policy beforehand, we propose to learn the AVS policy online, so that the agent reacts properly to any environmental condition of the scene (e.g. changes in the furniture) and copes with new modalities of sensory data without the need for ad-hoc training. This fundamental shift in methodology is carried out using the Partially Observable Monte Carlo Planning (POMCP) method [Silver2010]. POMCP has been applied to benchmark problems, such as rocksample, battleship and pocman (partially observable pacman), with impressive results; however, its use for robotic applications is an open and challenging research problem. To the best of our knowledge, this is the first attempt to use POMCP for the AVS problem.
The overall architecture of POMP is shown in Figure 2. At each time step, the inputs are the agent pose, i.e. position and orientation, in a known 2D map and an RGB-D frame given by a sensor acquisition. An off-the-shelf object detector is applied to the RGB image, and the corresponding depth of the candidate target proposal is further exploited to obtain the candidate position in the map. This object-related observation is then passed to the POMCP exploration module, which assigns each possible move a reward indicating whether that move brings the agent closer to the object. The policy is learnt online by Monte Carlo simulations and the related particle-filter-based belief update; it is therefore general and easy to deploy in any environment. Crucially, our approach exploits the model of the environment to consider the sensor’s field of view and all the admissible moves of the agent in the area. For our active visual search scenario, such a model can be easily obtained by building a map of the environment that includes the position of fixed elements (e.g. walls) but does not need to consider the position of moving objects. Unlike other RL-based strategies [schmid2019iros, ye2019ral], which implicitly encode such environment knowledge in a data-driven manner, our motion policy explicitly uses the knowledge of the environment for the visibility modelling. Once the target is detected, the robust visual approaching module further localises the target on the map, so that a destination pose of the agent can be determined, i.e. the closest pose with a frontal-facing viewpoint to the target, for the estimation of the shortest path [dijkstra1959note]. A path replanning scheme is proposed in the docking module to be robust to detector failures, such as missed detections or false positives.
Our main contributions can be summarised as follows: 1) we remove the policy learning bottleneck, i.e. the need for an offline training stage, by proposing the first online policy learning approach based on the POMCP technique; 2) we evaluate our approach on the Active Vision Dataset (AVD) benchmark [ammirato2017dataset] and show that, in certain cases, it outperforms alternative approaches in terms of success rate without the cost of offline training; and 3) we perform an ablation study to assess the behaviour of the proposed approach when fed with increasingly corrupted detections, and show the robustness of our approach against missing detections.
2 Related work
Active Visual Search can be addressed either as a pure exploration task [Tovar2003, Ruangpayoongsak2005], where target detection is merely subordinate to the solution of such a task, or as an exploration and search task with target-specific inferences [Garvey1976, Wixson1994, Kollar2009, Sjoo2012, Aydemir2013, schmid2019iros, ye2019ral, han2019active]. One early example of the latter approach is indirect search [Garvey1976, Wixson1994, Kunze2014], which exploits intermediate objects (e.g. a table) to restrict the search area for the target object (e.g. a telephone). Although intermediate objects are usually easier to detect because of their size, their spatial relation w.r.t. the target may not be systematic. A softer reasoning is proposed in [Kunze2014], where the likelihood of the target increases when objects which are expected to be co-occurring are detected. Such probabilistic modelling in a voxelised 3D scene representation is a common strategy to facilitate the planning of the agent’s path towards the discovery of the target [Kollar2009, Shubina2010, Andreopoulos2011, Sjoo2012, Aydemir2013], enriched by visual attention principles [Rasouli2017] that rank the search locations depending on their saliency.
AVS with deep learning is viable using Deep Reinforcement Learning techniques [han2019active, schmid2019iros, ye2019ral], where visual neural embeddings are often exploited for the action policy training. Han et al. [han2019active] proposed a novel deep Q-network (DQN) where the agent state is given by CNN features describing the current RGB observation and the bounding box of the detected object. However, this work is limited since it assumes that the object has to be detected initially. To address the search task, EAT [schmid2019iros] performs feature extraction from the current RGB observation and from the candidate target crop generated by a region proposal network (RPN). The features are then fed into the Action Policy network. Twelve scenes from the AVD [ammirato2017dataset] are used for the training of EAT. Similarly, GAPLE [ye2019ral] uses deep visual features enriched by 3D information (from depth) for the policy learning. Although GAPLE claims to generalise, this comes at the cost of expensive training, as GAPLE is trained with 100 scenes rendered using the House3D simulator based on the synthetic SUNCG dataset. In general, RL-based strategies depend on training with a large amount of data in order to encode the environmental modelling and motion policy. In contrast, our proposed POMCP-based method makes explicit use of the available scene knowledge and performs efficient planning of the agent’s path online without additional offline training.
As for optimal policy computation, a popular choice is to use Partially Observable Markov Decision Processes (POMDPs), a sound and complete framework for modeling dynamical processes in uncertain environments [Kaelbling1998]. Computing exact solutions for non-trivial POMDPs is computationally intractable [Papadimitriou1987], but in recent years impressive progress has been made in developing approximate solvers. One of the most recent and efficient approximation methods for POMDP policies is Monte Carlo Tree Search (MCTS) [Thrun2000, Coulom2006, Kocsis2006, Browne2012], a heuristic search algorithm that represents system states as nodes of a tree, and actions/observations as edges. The most influential solver for POMDPs which takes advantage of MCTS is Partially Observable Monte Carlo Planning (POMCP) [Silver2010], which combines a Monte Carlo update of the agent’s belief state with an MCTS-based policy. The most recent extensions of POMCP include applications to multi-agent problems [Amato2015] and reward maximization while constraining the cost [Lee2018]. Finally, [Castellini2019a] uses constraints on the state space to refine the belief space and increase policy performance. Here we build our AVS approach upon this method and propose a first methodology which integrates POMCP into AVS, avoiding the training bottleneck of state-of-the-art AVS methods and allowing the agent to move while simultaneously learning the optimal policy.
We consider the scenario where an agent moves in a known environment, searching for a specific object. The agent explores the environment to find the target object, to localise it in the floor map, and then to approach it, i.e. move close to the object location.
The agent’s pose (see Footnote 1) at time step $t$ is $p_t = (x_t, y_t, \theta_t)$, where $x_t$ and $y_t$ are the coordinates on the floor plane, and $\theta_t$ is the orientation. At each time step $t$ the agent takes an action $a_t$: it can move forward or backward, or rotate clockwise or counter-clockwise by a fixed angle. We assume the set of feasible actions is known a priori. When the agent reaches a new pose $p_{t+1}$, it receives an observation $o_{t+1}$, which is the output of an object detector applied to the image acquired by an RGB-D camera (see Footnote 2). We model the search space as a grid map (see Figure 1(b)). Each cell can be either: (i) “visual occlusion”, if the cell is occupied by obstacles, such as a wall or a piece of furniture, that prevent the agent from seeing through; (ii) “empty”, if the agent is allowed to enter the cell and thus no objects can be located there; or (iii) “candidate”, if none of the above applies, so that the cell is a candidate location for the target object.
Footnote 1: Here, by pose we mean the 2D robot pose, i.e. position and orientation, to be coherent with the related literature. We do not consider complex kinematics related to the agent structure (e.g. if a humanoid robot is used).
Footnote 2: Notice that observations are not actions, i.e. they are not actively performed by our POMCP planner; rather, they are received after each movement of the robot.
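The pose and action model described above can be illustrated with a minimal sketch. The step length, the turning angle and the convention that counter-clockwise rotations increase the orientation are assumptions made for illustration only; the text states just that the rotation angle is fixed, and on AVD the reachable poses come from a discrete pose graph instead.

```python
import math
from typing import NamedTuple


class Pose(NamedTuple):
    x: float      # floor-plane coordinate
    y: float      # floor-plane coordinate
    theta: float  # orientation in degrees, kept in [0, 360)


# Hypothetical values: the text only states that the rotation angle is fixed.
STEP_LENGTH = 1.0
TURN_ANGLE = 30.0
ACTIONS = ("forward", "backward", "rotate_cw", "rotate_ccw")


def apply_action(pose: Pose, action: str) -> Pose:
    """Return the pose reached from `pose` by one of the four actions."""
    if action in ("forward", "backward"):
        sign = 1.0 if action == "forward" else -1.0
        heading = math.radians(pose.theta)
        return Pose(pose.x + sign * STEP_LENGTH * math.cos(heading),
                    pose.y + sign * STEP_LENGTH * math.sin(heading),
                    pose.theta)
    if action == "rotate_ccw":
        # Assumed convention: counter-clockwise increases the angle.
        return Pose(pose.x, pose.y, (pose.theta + TURN_ANGLE) % 360.0)
    if action == "rotate_cw":
        return Pose(pose.x, pose.y, (pose.theta - TURN_ANGLE) % 360.0)
    raise ValueError(f"unknown action: {action}")
```

Enumerating `apply_action` over all actions from every admissible pose would yield the edges of a pose graph in which two poses are connected when a single action leads from one to the other.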
We formulate the AVS problem as a Partially Observable Markov Decision Process (POMDP), which is a standard framework for modeling sequential decision processes under uncertainty in dynamical environments [Kaelbling1998]. A POMDP is a tuple $(S, A, O, T, Z, R, \gamma)$, where $S$ is a finite set of partially observable states, $A$ is a finite set of actions, $O$ is a finite set of observations, $T: S \times A \rightarrow \Pi(S)$ is the state-transition model, $Z: S \times A \rightarrow \Pi(O)$ is the observation model, $R: S \times A \rightarrow \mathbb{R}$ is the reward function and $\gamma \in [0, 1)$ is a discount factor. Agents operating in POMDPs aim to maximise their expected total discounted reward $\mathbb{E}[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)]$ by choosing the best action $a_t$ in each state $s_t$, where $t$ is the time instant; $\gamma$ reduces the weight of distant rewards and ensures the convergence of the (infinite) sum. The partial observability of the state is dealt with by considering, at each time step, a probability distribution over all the states, called the belief $b \in B$. POMDP solvers are algorithms that compute, in an exact or approximate way, a policy for POMDPs, namely a function $\pi: B \rightarrow A$ that maps beliefs to actions.
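As a small worked example of the discounted objective, the sketch below accumulates a finite reward sequence from the last step backwards. The function name and the reward values are illustrative assumptions and are not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated backwards for stability."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three unit rewards with discount 0.5: 1 + 0.5 + 0.25
example = discounted_return([1.0, 1.0, 1.0], 0.5)
```

With a discount factor below one, later rewards contribute progressively less, which is why a move cost paid now weighs more than the same cost paid many steps ahead.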
We therefore propose POMP to address the POMDP problem with a Monte-Carlo Tree Search strategy that allows us to learn the policy online (POMCP). A graphical overview of the method is shown in Figure 2. The POMCP exploration module explores the known environment to detect the target, using some prior knowledge that can be obtained from a pre-exploration of the environment. The learning process ends when the agent detects the object. Once the target object is detected, we can approach it. We first localise the detected target using the depth channel together with the camera pose (which can be obtained from the agent pose with agent-camera calibration). With the target position, we then compute the destination pose as the closest pose of the agent to the object, facing it frontally. Finally, we drive the agent to the destination pose by using a shortest-path method with a path replanning scheme, to be robust against imperfect detectors. We detail our framework in the following sections.
3.1 POMCP exploration
Partially Observable Monte Carlo Planning (POMCP) [Silver2010] is an online Monte-Carlo based solver for POMDPs. It uses Monte-Carlo Tree Search (MCTS) for selecting, at each time step, an action which approximates the optimal one. The Monte Carlo tree is generated by performing a certain number of simulations from the current belief. A big advantage of POMCP is that it scales to large state spaces, because it never represents the complete policy but generates only the part of the policy related to the belief states actually seen during the plan execution. Moreover, the local policy approximation is generated online using a simulator of the environment, namely a function that, given the current state and an action, provides the new state and an observation according to the POMDP transition and observation models.
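To make the role of the simulator concrete, here is a deliberately simplified sketch: a generative model for a toy 1-D corridor and a flat Monte Carlo planner that samples hidden states from a particle belief and averages rollout returns per first action. This is not the tree search of POMCP (no search tree, no exploration bonus), and the corridor, the reward values and all names are hypothetical.

```python
import random

N_CELLS = 5          # hypothetical corridor length
ACTIONS = (-1, +1)   # move left / move right


def simulate(state, action):
    """Generative model: (state, action) -> (next state, observation, reward)."""
    agent, target = state
    agent = max(0, min(N_CELLS - 1, agent + action))
    observed = int(agent == target)          # 1 when the target is seen
    reward = 10.0 if observed else -1.0      # illustrative reward values
    return (agent, target), observed, reward


def rollout(state, first_action, depth, gamma, rng):
    """Discounted return of one simulation with random follow-up actions."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(depth):
        state, observed, reward = simulate(state, action)
        total += discount * reward
        if observed:
            break
        discount *= gamma
        action = rng.choice(ACTIONS)
    return total


def plan(agent_cell, particles, n_sim, depth=8, gamma=0.95, seed=0):
    """Pick the first action with the best average simulated return."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in ACTIONS}
    counts = {a: 0 for a in ACTIONS}
    for i in range(n_sim):
        target = rng.choice(particles)        # sample a hidden state from the belief
        action = ACTIONS[i % len(ACTIONS)]    # try first actions evenly
        totals[action] += rollout((agent_cell, target), action, depth, gamma, rng)
        counts[action] += 1
    return max(ACTIONS, key=lambda a: totals[a] / max(counts[a], 1))
```

For instance, `plan(2, [4], n_sim=400)` should favour moving right when every particle places the target in the rightmost cell. A full POMCP implementation would additionally store visit counts and value estimates in a tree indexed by action-observation histories.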
The methodology proposed here is a specialization of POMCP for the AVS problem. It is based on four main elements, defined in the following and used together by POMCP to perform the search of an object in the environment. We assume that $N$ is the number of possible poses (i.e. position-orientation pairs) that the agent can take in the environment, $M$ is the number of objects in the environment, and $K$ is the number of positions in which each object can be located. The first element of the proposed framework is a pose graph $G$ in which nodes represent the possible poses of the agent and edges connect only poses reachable by the agent with a single action. The second element is the set $L = \{1, \dots, K\}$ of all possible indices of positions that each object can take in the environment. Each index in $L$ corresponds to a specific position in the topology of the environment where the search is made. The third element of our framework is the hidden state of the system, which is represented by a vector $s = (l_1, \dots, l_M)$ of object positions, where $l_j \in L$ indicates the position of the $j$-th object in the environment. The goal of the search is to reach a specific object. The fourth element is a matrix of object visibility $V \in \{0, 1\}^{N \times M}$, where $V_{ij} = 1$ if object $j$ is visible from pose $i$ (i.e. agent’s position and orientation). Matrix $V$ can be deterministically derived from $G$, $L$ and $s$ by a visibility function which computes the visibility of each object from each agent pose, considering the physical properties of the environment.
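A visibility matrix of this kind can be sketched on a small grid map. The example below is only an assumption-laden illustration: the field of view is reduced to a single ray along one of four headings, the maximum range is arbitrary, and the cell symbols are hypothetical; the paper's visibility function models the sensor's field of view on the real map.

```python
OCCLUSION, EMPTY, CANDIDATE = "#", ".", "c"   # hypothetical cell symbols
HEADINGS = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}


def visible_cells(grid, pose, max_range):
    """Cells seen along the heading until an occluding cell or the border."""
    row, col, heading = pose
    d_row, d_col = HEADINGS[heading]
    cells = set()
    for _ in range(max_range):
        row, col = row + d_row, col + d_col
        if not (0 <= row < len(grid) and 0 <= col < len(grid[0])):
            break
        if grid[row][col] == OCCLUSION:
            break                      # walls and furniture block the view
        cells.add((row, col))
    return cells


def visibility_matrix(grid, poses, object_cells, max_range=3):
    """Entry [i][j] is 1 when object j is visible from pose i, else 0."""
    return [[int(cell in visible_cells(grid, pose, max_range))
             for cell in object_cells]
            for pose in poses]


grid = ["..c",
        ".#c",
        "..."]
poses = [(0, 0, "E"), (1, 0, "E")]
objects = [(0, 2), (1, 2)]
matrix = visibility_matrix(grid, poses, objects)
```

In this toy map the first pose sees the object in the top-right cell, while the second pose sees nothing because the occluding cell directly ahead blocks its ray.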
POMCP uses all these elements during its computation. Vectors of object positions are first used to represent possible hidden states (i.e. possible arrangements of objects in the environment); these vectors are then used to generate matrices of object visibility, which are used, together with the pose graph, to perform simulation steps. In particular, at each step the POMCP simulator takes the current agent pose (i.e. a node of the pose graph) and computes the related set of visible objects. If this set contains the searched object, then a positive reward is provided and the search involving POMCP is terminated; otherwise a negative reward is provided (corresponding to the energy spent to perform the movement) and the POMCP-based search continues. To prevent the agent from visiting the same poses more than once, the agent maintains an internal memory vector that collects all the poses already visited during the current run. Every time the agent visits a pose that it has already visited, it receives a high negative reward. The planner receives a binary observation indicating whether or not the searched object has been observed. The belief of the agent at each step is an (approximated) probability distribution over positions of the searched object in the environment, which represents the POMCP hidden state. If the object is not observed after a given number of moves, the method terminates and reports a search failure.
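The reward structure and a particle-style belief update can be sketched as follows. The numeric rewards are hypothetical placeholders (the text only specifies their signs and that revisits are penalised heavily), and the update assumes a deterministic observation, so it reduces to discarding particles that are inconsistent with what the agent saw.

```python
def step_reward(pose, visited, target_visible,
                r_found=100.0, r_move=-1.0, r_revisit=-10.0):
    """Reward for arriving at `pose`; the values are illustrative assumptions."""
    if target_visible:
        return r_found        # positive reward, the search terminates
    if pose in visited:
        return r_revisit      # high negative reward for revisiting a pose
    return r_move             # cost of the movement


def update_belief(particles, visible_cells, target_observed):
    """Keep the particles (candidate target cells) consistent with the observation."""
    if target_observed:
        kept = [p for p in particles if p in visible_cells]
    else:
        kept = [p for p in particles if p not in visible_cells]
    # Fall back to the previous particles rather than ending with an empty belief.
    return kept or list(particles)


belief = update_belief([1, 2, 3, 3], visible_cells={3}, target_observed=False)
```

Here the agent looked at cell 3 without seeing the target, so the two particles placing the target there are removed and the belief mass moves to cells 1 and 2.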
3.2 Robust Visual Docking
Once the agent has explored the environment and detects the target object, the agent is asked to approach the object and stop as close as possible in front of it; this pose is named destination pose. We first process the depth channel of the last observation to estimate the position of the detected target in the environment. The depth crop of the object detection is converted to 3D points with the camera pose, where the position of the closest point to the camera is used to approximate the target position. Then we generate the destination pose by selecting the subset of admissible poses that can see the target position, according to the observation model, and finding the closest to the target position. We use the Dijkstra algorithm [dijkstra1959note] to compute the shortest path between the current pose (reached using POMCP) and the estimated destination pose. In order to be robust against the detector imperfections, we further introduce a path replanning scheme after new observations. At every time step in the approaching phase the agent observes the scene. In case the target object is detected, we recompute the object location in the environment, the destination pose and re-plan the optimal path to reach it. If instead the object is not detected, the agent continues with the planned path. The effect of the path replanning against following the originally planned path can be seen in Table 1, where we have an average improvement of 10% in success rate.
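The approaching phase can be sketched as Dijkstra's algorithm on a pose graph combined with a replanning loop. The graph, the unit edge costs and the `observe` callback (returning a new destination estimate, or `None` when the target is not detected) are assumptions made for illustration.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra with unit edge costs; returns the node list or None."""
    dist, prev, queue = {start: 0}, {}, [(0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue                       # stale queue entry
        for neighbour in graph.get(node, ()):
            nd = d + 1
            if nd < dist.get(neighbour, float("inf")):
                dist[neighbour], prev[neighbour] = nd, node
                heapq.heappush(queue, (nd, neighbour))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def approach_with_replanning(graph, start, goal, observe, max_steps=50):
    """Follow the planned path, replanning whenever the target is re-detected."""
    pose, path, trajectory = start, shortest_path(graph, start, goal), [start]
    for _ in range(max_steps):
        if path is None or len(path) < 2:
            break                          # unreachable or already arrived
        pose = path[1]
        trajectory.append(pose)
        new_goal = observe(pose)           # None when the target is not detected
        if new_goal is not None:
            goal = new_goal
            path = shortest_path(graph, pose, goal)
        else:
            path = path[1:]                # keep following the current plan
    return trajectory


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
route = approach_with_replanning(
    graph, "a", "e", observe=lambda pose: "d" if pose == "b" else None)
```

In this toy run the agent initially plans a path to `e`, re-detects the target at `b`, updates the destination to `d` and stops there.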
Figure 3: (a) Easy: Home_005_2; (b) Medium difficulty: Home_001_2; (c) Hard: Home_003_2.
We validate our proposed method against baselines and state-of-the-art methods using the AVD dataset [ammirato2017dataset], following the evaluation protocol defined by the AVD Benchmark (AVDB) on the task of active object search in known environments (referred to as Task 1a in the benchmark). AVD is the largest real-world dataset available for testing active visual search, containing scans of 14 real apartments recorded using a robot equipped with an RGB-D camera, thus allowing for a virtual exploration of the environment.
Following the analysis proposed by the EAT authors [schmid2019iros], we test three scenes that correspond to three different difficulty levels (see Figure 3). The easy level is represented by Home_005_2, where the agent only explores within a kitchen area. The medium-difficulty scene is represented by Home_001_2, where the agent explores a living room with an open kitchen. Finally, Home_003_2 represents the most difficult scene, where the agent explores a large living room with a half-open kitchen and a dining area. For each scene, the agent’s pose graph and the ground-truth (GT) annotations of each target object are provided by AVD, while we prepare the 2D grid map of each scene for the POMCP module. To obtain the occluded cells, we first perform 3D scene reconstruction using Open3D [Zhou2018], followed by a z-plane intersection.
We remark that with POMP we are introducing a new (harder) scenario where no training is allowed, so no other published approaches are available for comparison. Nevertheless, we refer to state-of-the-art systems which were applied to the AVDB, quoting their performance in italics as a reminder that they derive from an easier setup. In particular, we consider the RL-based EAT [schmid2019iros] and GAPLE [ye2019ral] (discussed in Sec. 2). Unfortunately, the protocol adopted with GAPLE on AVD is not documented, while the protocol in [schmid2019iros] is explicit but differs from that of the AVD benchmark. With EAT, only a subset of objects is used for the search task in each scene, while the AVDB protocol uses all objects. For this reason, we mark with an asterisk the numbers obtained with the EAT protocol (reported in the original paper [schmid2019iros]). As comparative baselines we consider two methods. The first is the Random Walk, in which the agent randomly selects an action among all the feasible ones at each time step. The second is partial-POMP, i.e. an ablation of POMP in which we exclude the path replanning after new observations. This helps to appreciate the net contribution of the path replanning scheme during the visual docking phase.
In line with the AVDB, we consider three metrics. Success Rate (SR) is the percentage of times the agent successfully reaches one of the destination poses (as provided in AVDB) over the total number of trials (a larger value indicates a more effective search). Average Path Length (APL) is the average number of poses visited by the agent among the paths that lead to a successful search (a lower value indicates a higher efficiency). Finally, the Average shortest path length (ASPPL) is the average ratio between the shortest possible path to reach a valid destination pose (provided by AVDB as a piece of GT information) and the length of the path generated by the model (a larger value indicates a higher absolute efficiency). Additionally, we compute the standard deviation of ASPPL to investigate the variability of POMP in behaving efficiently.
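The three metrics can be computed from per-episode records as in the following sketch. The record fields are hypothetical, and restricting the ratio to successful episodes is an assumption, since the text does not state over which episodes the ratio is averaged.

```python
from statistics import mean, pstdev


def evaluate(episodes):
    """episodes: dicts with `success`, `path_length` and `shortest_length`."""
    successes = [e for e in episodes if e["success"]]
    sr = len(successes) / len(episodes)
    if not successes:
        return {"SR": sr, "APL": None, "ASPPL": None, "ASPPL_std": None}
    # Assumption: ratios are computed over successful episodes only.
    ratios = [e["shortest_length"] / e["path_length"] for e in successes]
    return {
        "SR": sr,
        "APL": mean([e["path_length"] for e in successes]),
        "ASPPL": mean(ratios),
        "ASPPL_std": pstdev(ratios),
    }


results = evaluate([
    {"success": True, "path_length": 10, "shortest_length": 5},
    {"success": True, "path_length": 20, "shortest_length": 20},
    {"success": False, "path_length": 30, "shortest_length": 6},
])
```

In this toy example two of three episodes succeed, the successful paths average 15 poses, and the ratios 0.5 and 1.0 average to 0.75.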
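Table 1, upper part (EAT protocol):
| Method | Easy SR | Easy APL | Easy ASPPL (std) | Medium SR | Medium APL | Medium ASPPL (std) | Hard SR | Hard APL | Hard ASPPL (std) | Avg. SR | Avg. APL | Avg. ASPPL (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Walk | 0.32 | 74 | 0.19 (0.35) | 0.113 | 74.48 | 0.21 (0.36) | 0.10 | 79.27 | 0.17 (0.18) | 0.18 | 75.91 | 0.19 (0.29) |
| partial-POMP | 0.96 | 12.88 | 0.73 (0.25) | 0.68 | 16.13 | 0.80 (0.23) | 0.41 | 21.05 | 0.70 (0.40) | 0.68 | 16.68 | 0.74 (0.29) |

Table 1, lower part (AVDB protocol):
| Method | Easy SR | Easy APL | Easy ASPPL (std) | Medium SR | Medium APL | Medium ASPPL (std) | Hard SR | Hard APL | Hard ASPPL (std) | Avg. SR | Avg. APL | Avg. ASPPL (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Walk | 0.22 | 71.47 | 0.23 (0.38) | 0.16 | 69.84 | 0.22 (0.33) | 0.14 | 62.30 | 0.29 (0.38) | 0.17 | 67.87 | 0.25 (0.36) |
| partial-POMP | 0.88 | 12.19 | 0.80 (0.24) | 0.68 | 16.75 | 0.74 (0.24) | 0.34 | 23.09 | 0.66 (0.33) | 0.63 | 17.34 | 0.73 (0.27) |
| POMP | 0.93 | 12.96 | 0.78 (0.24) | 0.80 | 18.2 | 0.72 (0.24) | 0.43 | 21.9 | 0.65 (0.32) | 0.72 | 17.68 | 0.72 (0.26) |

Table 2 (AVDB protocol, with the object detector):
| Method | Easy SR | Easy APL | Easy ASPPL (std) | Medium SR | Medium APL | Medium ASPPL (std) | Hard SR | Hard APL | Hard ASPPL (std) | Avg. SR | Avg. APL | Avg. ASPPL (std) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Walk | 0.22 | 71.47 | 0.23 (0.38) | 0.16 | 69.84 | 0.22 (0.33) | 0.14 | 62.30 | 0.29 (0.38) | 0.17 | 67.87 | 0.25 (0.36) |
| partial-POMP | 0.61 | 17.49 | 0.7 (0.29) | 0.37 | 19.2 | 0.64 (0.26) | 0.18 | 26.22 | 0.54 (0.28) | 0.38 | 20.97 | 0.62 (0.27) |
| POMP | 0.6 | 17.9 | 0.68 (0.28) | 0.40 | 20.73 | 0.62 (0.26) | 0.19 | 26.6 | 0.53 (0.28) | 0.4 | 21.74 | 0.61 (0.27) |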
Table 1 is divided into two parts. The upper part shows the results obtained with the EAT [schmid2019iros] protocol (fewer objects in play, approaches marked with an asterisk), where the EAT numbers are in italics as a reminder that EAT has been trained beforehand on separate data, while our approach is training-free. The lower part shows the numbers obtained with the original AVDB protocol (all the objects are considered). On the right we report the average performance over the three scenarios (mean of the means) and the average of the three standard deviations. To compare the exploration engine of POMP while discarding nuisances caused by the underlying object detectors, we report the results using an ideal detector (as in EAT [schmid2019iros]). The behaviour of POMP in the presence of noisy detectors is the subject of another experiment.
From Table 1 we can see that, on average, our proposed method outperforms EAT in terms of SR with a comparable APL, even though our setup eliminates any training. Notably, POMP has a dominant advantage over EAT in terms of SR, with this advantage decreasing as the scene gets more difficult.
Table 2 shows the performance of POMP against the baselines with a detector provided by the authors of the AVDB [ammirato2018target]. The detector is similar to Faster-RCNN [ren2015faster] but takes reference images of the target object as additional input, in order to improve the detection quality. We use the pre-trained model without any customisation. The detector achieves a precision of 0.73 and a recall of 0.53 at a confidence threshold of 0.9. On average, our method with path replanning improves the SR with a slight increase in APL, which is consistent with what we observe in Fig. 4. In terms of processing speed, we run experiments on a machine with a 6-core Intel i7-6800k CPU, achieving 0.07 seconds per step on average.
Since the detector plays a role in POMP, we investigate its impact in terms of missing detections and false positives by manipulating the GT annotations. Specifically, for each target in a scene, we randomly exclude a given ratio of its GT annotations, with ratios ranging from 0% to 80% in steps of 20%. For false positives, we randomly change the label of detections corresponding to other instances to that of the target object, using the same set of ratios (0% to 80% in steps of 20%) of its GT annotations. Fig. 4 plots the SR and APL of POMP and partial-POMP over the varying ratios of missing detections (left) and false positives (right). The results are averaged over 551 independent runs, comprising 19 target objects and 29 starting positions for each target object. On the one hand, both versions are robust to missing detections: the SR only starts to decrease noticeably after 60% missing detections, while the path length starts to increase noticeably after 40% missing detections. On the other hand, both are vulnerable to false positives in terms of SR, while the APL is not much affected. This is because more false positives lead to a higher chance of POMCP perceiving wrong destinations and ending up with failed paths. However, since the APL is only computed over successful paths, the impact of the false positive ratio on it is limited. Both plots also show that POMP with path replanning in the shortest path computation further boosts the SR, at the cost of an increase in APL.
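The corruption of the annotations can be sketched as below. The data layout (a list of annotations, and `(label, box)` pairs for detections of other instances) is hypothetical, and applying the false-positive ratio to the detections of other instances is one possible reading of the protocol, not necessarily the one used in the experiments.

```python
import random


def drop_detections(annotations, ratio, rng):
    """Randomly remove a fraction `ratio` of the target's annotations."""
    n_drop = round(ratio * len(annotations))
    dropped = set(rng.sample(range(len(annotations)), n_drop))
    return [a for i, a in enumerate(annotations) if i not in dropped]


def inject_false_positives(other_detections, ratio, target_label, rng):
    """Relabel a fraction `ratio` of other-instance detections as the target."""
    n_flip = round(ratio * len(other_detections))
    flipped = set(rng.sample(range(len(other_detections)), n_flip))
    return [(target_label if i in flipped else label, box)
            for i, (label, box) in enumerate(other_detections)]


rng = random.Random(0)
ratios = [0.0, 0.2, 0.4, 0.6, 0.8]
kept_counts = [len(drop_detections(list(range(10)), r, rng)) for r in ratios]
```

With ten annotations, the five ratios leave 10, 8, 6, 4 and 2 annotations respectively; which ones are removed depends on the random seed.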
We proposed a POMCP-based planner, POMP, to learn an optimal policy for AVS in known indoor environments. To the best of our knowledge, our approach is the first to use an online policy learning method for AVS. Notably, POMP does not need expensive (in time and computation) labelled data but rather exploits the information in the floor map of the environment, which is usually available or easy to obtain (e.g. via a single exploration run). We evaluated our approach following the AVD benchmark and achieved performance (i.e. average success rate and average path length) comparable to that of state-of-the-art methods while using far less information. This work paves the way to several interesting research directions, including the possibility of integrating more scene priors, e.g. object co-occurrence, in the POMCP modelling to further boost the performance.