Semantic Linking Maps for Active Visual Object Search

06/18/2020 ∙ by Zhen Zeng, et al.

We aim for mobile robots to function in a variety of common human environments. Such robots need to be able to reason about the locations of previously unseen target objects. Landmark objects can help this reasoning by narrowing down the search space significantly. More specifically, we can exploit background knowledge about common spatial relations between landmark and target objects. For example, seeing a table and knowing that cups can often be found on tables aids the discovery of a cup. Such correlations can be expressed as distributions over possible pairing relationships of objects. In this paper, we propose an active visual object search method built on the Semantic Linking Maps (SLiM) model. SLiM simultaneously maintains the belief over a target object's location as well as landmark objects' locations, while accounting for probabilistic inter-object spatial relations. Based on SLiM, we describe a hybrid search strategy that selects the next best view pose for searching for the target object based on the maintained belief. We demonstrate the efficiency of our SLiM-based search strategy through comparative experiments in simulated environments. We further demonstrate the real-world applicability of SLiM-based search in scenarios with a Fetch mobile manipulation robot.







I Introduction

Being able to efficiently search for objects in an environment is crucial for service robots to autonomously perform tasks [khandelwal2017bwibots, veloso2015cobots, hawes2017strands]. When asked where a target object can be found, humans are able to give hypothetical locations expressed by spatial relations with respect to other objects. For example, a cup can be found “on a table” or “near a sink”. Table and sink are considered landmark objects that are informative for searching for the target object cup. Robots should be able to reason similarly about object locations, as shown in Figure 1.

Previous works [kollar2009utilizing, kunze2014using, toris2017temporal] assume landmark objects are static, in that they mostly remain where they were last observed. This assumption can be invalid for dynamic landmark objects that change their location over time, such as chairs, food carts and toolboxes. Temporal assumptions can mislead the search process if the prior on the landmarks’ locations is too strong. Further, there also exists uncertainty in the spatial relations between landmark objects and the target object, and between landmark objects themselves. For example, a cup can be “in” or “next to” a sink.

Considering the problem of dynamic landmarks, we propose the Semantic Linking Maps (SLiM) model to account for uncertainty in the locations of landmark objects during object search. Building on Lorbach et al. [lorbach2014prior], we model inter-object spatial relations probabilistically via a factor graph. The marginal belief on inter-object spatial relations inferred from the factor graph is used in SLiM to account for probabilistic spatial relations between objects.

Using the maintained belief over target and landmark objects’ locations from SLiM, we propose a hybrid strategy for active object search. We select the next best view pose, which guides the robot to explore promising regions that may contain the target and/or landmark objects. Previous works [wixson1994using, garvey1976perceptual, sjoo2012topological, aydemir2011search] have shown the benefit of purposefully looking for landmark objects (Indirect Search) before directly looking for the target object (Direct Search). The proposed hybrid search strategy draws insights from both indirect and direct search. We demonstrate the efficiency of the proposed hybrid search strategy in our experiments.

In this paper, we describe the Semantic Linking Maps model as a Conditional Random Field (CRF). Our description of SLiM as a CRF allows us to simultaneously maintain the belief over target and landmark object locations with probabilistic modeling over inter-object spatial relations. We also describe a hybrid search strategy based on SLiM that draws upon ideas from both indirect and direct search representations. This SLiM-based search makes use of the maintained belief over objects’ locations by selecting the next best view pose based on the current belief. In our experiments, we show that the proposed object search approach is more robust to noisy priors on landmark locations by simultaneously maintaining belief over the locations of target and landmark objects.

Fig. 1: Robot tasked to find a coffee machine.

II Related Work

Existing works have studied object search with different assumptions on prior knowledge of the environment. Some assume priors on landmark objects’ locations in the environment, and utilize the spatial relations between the target object and landmark objects to prioritize regions to search. Kollar et al. [kollar2009utilizing] utilize object-object co-occurrences extracted from image tags to infer target object locations. Kunze et al. [kunze2014using] expanded the generic notion of co-occurrences to more restrictive spatial relations (e.g. “in front of”, “left of”), which provide more confined regions to search, thus improving the search efficiency. Toris et al. [toris2017temporal] proposed to learn a temporal model on inter-object spatial relations to facilitate search. These methods assume the landmark objects to be static. We believe, however, that accounting for the uncertainty in landmark objects’ locations is important for object search.

Existing works have also explored known priors on spatial relations between landmark and target objects. Given exact spatial relations between landmark and target objects, Sjöö et al. [sjoo2012topological] used an indirect object search strategy [wixson1994using, garvey1976perceptual], where the robot first searches for landmark objects, and then searches for a target object in regions satisfying given spatial relations. On the other hand, given a probabilistic distribution over the spatial relations between objects, Aydemir et al. [aydemir2011search] formulate the object search problem as a Markov Decision Process. In our work, we learn the probabilistic inter-object spatial relations by building on ideas of Lorbach et al. [lorbach2014prior], where inter-object relations are modeled probabilistically via a factor graph.

There are also works that do not assume prior knowledge of the environment. Researchers have explored object search with visual attention mechanisms [shubina2010visual, sjo2009object, meger2010curious], such as saliency detection. Similar to [kollar2009utilizing, kunze2014using], other research [loncomilla2018bayesian, elfring2013active, 5509285] utilizes object-object co-occurrences to guide the search for a target object. Positive and negative detections of landmark objects will result in an updated belief over the target object. We expand object-object co-occurrences to finer-grained spatial relations between objects, i.e., “in”, “on”, “proximity”, “disjoint”, which specify more confined regions for object search.

Other literature [wang2018efficient, kunze2012searching, viswanathan2009automated] has also explored object-place relations to facilitate object search. Wang et al. [wang2018efficient] build a belief road map based on object-place co-occurrences for efficient path planning during object search. Kunze et al. [kunze2012searching] bootstrap commonsense knowledge on object-place co-occurrences from the Open Mind Indoor Common Sense (OMICS) dataset. Samadi learned similar knowledge by actively querying the World Wide Web (WWW). Our work also takes object-place co-occurrences into account. Aydemir et al. [aydemir2013active] made use of place-place co-occurrences to infer the type of the room next door as the robot explores an environment during search. Manipulation-based object search, as in [xiao2019online, wong2013manipulation, li2016act], is not within the scope of this paper.

III Problem Statement

Let be the set of objects of interest, including landmark objects and the target object for search. Given observations and robot poses , we aim to maintain the belief over object locations , while accounting for the probabilistic spatial relations between objects . For this work, we consider the set of spatial relations to be . For example, the relation indicates that object is inside object . The probabilistic spatial relations between objects are represented by the belief over , denoted as .

Based on the maintained belief , the robot searches for the target object by selecting the next best view pose ranked by a utility function . specifies the 6 DOF of camera view pose. The utility function trades off between navigation cost and the probability of search success. Upon a user request to find a target object, the robot iterates between the belief update of objects’ locations and view pose selection, until the target object is found or the maximum search time is reached.
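The iteration described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the function name `active_search` and all callback names are hypothetical, and a step budget stands in for the maximum search time.

```python
def active_search(belief, update_belief, propose_views, utility,
                  move_and_observe, target_found, max_steps=50):
    """Alternate between next-best-view selection and belief update.

    All callbacks are hypothetical placeholders supplied by the caller:
      propose_views(belief)                -> list of candidate view poses
      utility(view, belief)                -> score used to rank candidates
      move_and_observe(view)               -> observation taken at that pose
      update_belief(belief, obs, view)     -> new belief
      target_found(belief)                 -> True once the search can stop
    """
    views_taken = 0
    for _ in range(max_steps):  # step budget stands in for the time limit
        candidates = propose_views(belief)
        if not candidates:
            break
        # Pick the next best view pose according to the utility function.
        best_view = max(candidates, key=lambda v: utility(v, belief))
        observation = move_and_observe(best_view)
        views_taken += 1
        belief = update_belief(belief, observation, best_view)
        if target_found(belief):
            return True, views_taken
    return False, views_taken
```

The loop returns both a success flag and the number of views taken, which mirrors two of the metrics reported later in the experiments.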

IV Semantic Linking Maps

For Semantic Linking Maps (SLiM), we consider inter-object spatial relations, while maintaining the belief over target and landmark objects’ locations. Building on our previous work [zeng2018semantic], we probabilistically formalize the object location estimation problem via a Conditional Random Field (CRF). The model is now extended to account for probabilistic inter-object spatial relations, as shown in Figure 2.


The posterior probability of object locations history is


where is a normalization constant. Robot pose and observation are known. We assume that the robot stays localized given a metric map of the environment.

is the prediction potential that models the movement of an object over time. We assume objects to remain static or move with temporal coherence (varies across object classes) during the search, i.e.

is the measurement potential that accounts for the observation model, and are (potentially noisy) detections for each object at time . Because and are independent if , we simplify to s.t.,


where each stands for the probability of false negative, true negative, true positive, and false positive detection. is the effective observation region for given robot pose at time . Note, is larger for larger objects, which can be reliably detected from longer distance compared to small objects. is the camera projection matrix, and denotes that the projected object lies in the detected bounding box in .
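The four detection outcomes named above can be illustrated with a small lookup. This is a sketch under stated assumptions: the function name, the boolean inputs, and the probability values are hypothetical, and the case split is one plausible reading of the description rather than the paper's exact formula.

```python
def measurement_potential(in_observation_region, detected, projects_into_box,
                          p_fn=0.2, p_tn=0.9, p_tp=0.8, p_fp=0.1):
    """Score one hypothesized object location against one camera frame.

    in_observation_region: the location lies in the effective observation
        region for the current robot pose.
    detected: the detector reported this object class in the frame.
    projects_into_box: the location projects into the detected bounding box.
    The four probabilities are illustrative values, not taken from the paper.
    """
    if not detected:
        # No detection: a miss if the object should have been visible,
        # otherwise a correct non-detection.
        return p_fn if in_observation_region else p_tn
    # Detection: consistent with the box (true positive) or not (false positive).
    return p_tp if projects_into_box else p_fp
```

In a particle-based belief, such a function would be evaluated once per particle and per frame.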

Fig. 2: CRF-based SLiM model: (a) Known: robot poses, sensor observations; Unknown: . (b) Plate notation: at time , the spatial relations between each object pair is parameterized by the belief over their spatial relations .

We model the spatial relations between objects with context potential . Here, we extend from our previous work by parameterizing it with the belief over the inter-object spatial relation between ,


where can take any value in the set of possible relations .

For , is equal to if objects satisfy the spatial relation given the width, length and height of the object, otherwise . For , corresponds to a Gaussian distribution that models and is determined by the size of objects . The larger the size of , the larger the variance in . For , .
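One way to read the context potential is as a mixture over relations, weighted by the relation belief. The sketch below makes several simplifying assumptions that are not from the paper: locations are 2D, containment is an axis-aligned footprint test, the Gaussian width scales with the larger footprint side, and the "disjoint" case is treated as uninformative. All names are hypothetical.

```python
import math

def relation_potential(relation, loc_i, loc_j, size_j, sigma_scale=0.5):
    """Compatibility of two 2D locations under a single spatial relation."""
    dx, dy = loc_i[0] - loc_j[0], loc_i[1] - loc_j[1]
    if relation in ("in", "on"):
        # Indicator: object i lies within the footprint of object j.
        inside = abs(dx) <= size_j[0] / 2 and abs(dy) <= size_j[1] / 2
        return 1.0 if inside else 0.0
    if relation == "proximity":
        # Gaussian falloff; a larger object j gives a larger variance.
        sigma = sigma_scale * max(size_j)
        return math.exp(-(dx * dx + dy * dy) / (2 * sigma * sigma))
    if relation == "disjoint":
        return 1.0  # assumption: no constraint on the relative location
    raise ValueError(f"unknown relation: {relation}")

def context_potential(relation_belief, loc_i, loc_j, size_j):
    """Mix the per-relation potentials using the belief over relations."""
    return sum(prob * relation_potential(rel, loc_i, loc_j, size_j)
               for rel, prob in relation_belief.items())
```

For example, with a belief of 0.7 on "on" and 0.3 on "disjoint", a location outside the landmark's footprint still receives a non-zero score, so the target is not ruled out elsewhere.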

IV-A Inference

We propose a particle filtering inference method for maintaining the belief over object locations, as shown in Algorithm 1. Examples of the belief update over time are available in Figure 3. Instead of estimating the posterior of the complete history of object locations , we recursively estimate the posterior probability of each object , similarly to [zeng2018semantic, limketkai2007crf].

To deal with particle decay, we reinvigorate the particles of each by sampling in known room areas, as well as around other objects based on . In step 5, only if . Across our experiments, we use 100 particles for each object. The inference algorithm does not assume a single object instance for each object class. The inference algorithm has a complexity of , where is the average cardinality of . Future work could reduce the complexity to by sampling representative and divergent particles from the original particles ().

Input: Observation , Robot pose , Particle set for each object:
1 Resample particles from with probability proportional to importance weights ;
2 for  do
3       for  do
4             Sample ;
5             Assign weight ;
6                 where  
7       end for
8 end for
Algorithm 1 Inference of object locations in SLiM.
Fig. 3: Examples of belief updates in SLiM given observations. Upper: Evolution of particles of fridge, sink, coffee machine over time. Lower: RGB observation (with object detection) over time. (Best viewed in color).
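The resample, sample and weight pattern of Algorithm 1 can be sketched for a single object as follows. This is an illustration only: the Gaussian jitter stands in for the prediction potential, the `measurement` and `context` callbacks are hypothetical placeholders for the measurement and context potentials, and particle reinvigoration is omitted.

```python
import random

def particle_filter_step(particles, weights, measurement, context,
                         motion_noise=0.05, rng=None):
    """One belief update for a single object over 2D location particles."""
    rng = rng or random.Random()
    n = len(particles)
    # Step 1: resample in proportion to the importance weights.
    resampled = rng.choices(particles, weights=weights, k=n)
    new_particles, new_weights = [], []
    for x, y in resampled:
        # Step 4: sample from a motion model (isotropic Gaussian jitter here).
        moved = (x + rng.gauss(0.0, motion_noise),
                 y + rng.gauss(0.0, motion_noise))
        # Step 5: weight by the measurement and context potentials.
        new_particles.append(moved)
        new_weights.append(measurement(moved) * context(moved))
    total = sum(new_weights)
    if total > 0:
        new_weights = [w / total for w in new_weights]
    else:
        # Every particle was ruled out; fall back to uniform weights.
        new_weights = [1.0 / n] * n
    return new_particles, new_weights
```

In the full model, the context callback would itself depend on the particles of the other objects, which is where the stated complexity of the inference comes from.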

IV-B Probabilistic Inter-Object Spatial Relations

To obtain the belief over inter-object spatial relations for each object pair , we use a factor graph, building on preceding work by Lorbach et al. [lorbach2014prior]. We generalize [lorbach2014prior] by relaxing the assumption on known spatial relations between landmark objects.

The factor graph consists of variable vertices , factor vertices and edges which connect factor vertices with variable vertices. Specifically, is a unary factor that considers commonsense knowledge on spatial relation between objects,

Similar to [lorbach2014prior], we extract commonsense knowledge on from an online image search engine (e.g. Flickr) by counting the frequency of a certain spatial relation between objects . For example, the frequency of is computed as the number of search results for the query “cup on the table” divided by the number of search results for the query “on the table”. These extracted frequencies can be noisy. For example, the frequency of “laptop on kitchen” is larger than 0, but it is not a valid expression because it refers to a laptop being on top of the room geometry of a kitchen. We manually encode the for invalid expressions to .
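The frequency computation in the example above amounts to a ratio of hit counts. The sketch below assumes the counts are already available in a dictionary keyed by query string; the function name, the query template, and the choice to map invalid expressions to zero are assumptions for illustration. No real search API is called.

```python
def relation_frequency(hit_counts, target, relation, landmark, invalid=()):
    """Estimate how often `target` stands in `relation` to `landmark`.

    hit_counts: dict mapping a query string to its number of search results
        (hypothetical, pre-collected data).
    invalid: (target, relation, landmark) triples that are manually overridden;
        mapping them to 0.0 is an assumption made for this sketch.
    """
    if (target, relation, landmark) in invalid:
        return 0.0
    context_query = f"{relation} the {landmark}"   # e.g. "on the table"
    full_query = f"{target} {context_query}"       # e.g. "cup on the table"
    denominator = hit_counts.get(context_query, 0)
    if denominator == 0:
        return 0.0
    return hit_counts.get(full_query, 0) / denominator
```

With made-up counts such as `{"cup on the table": 120, "on the table": 4800}`, the estimated frequency of a cup being on a table is 120 / 4800.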

is a triplet factor that considers logical consistency between a triplet of objects ,

For example, if is in , and is in , then should be in to satisfy logical consistency, i.e., . Previous work [lorbach2014prior] assumes the spatial relations between landmark objects to be known, and only relations connecting target object and landmark object to be unknown. Their pairwise factor enforcing logical consistency is a binary function . In contrast, our formulation employs a trinary factor considering all possible combinations of and evaluating their logical consistency.
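The containment example above can be written as a small consistency check over relation triplets. Only the single rule stated in the text is encoded; the function names, the string labels, and the hard 0/1 scoring are assumptions, and a fuller factor would cover more combinations.

```python
def triplet_consistent(r_ij, r_jk, r_ik):
    """Check one relation triplet for logical consistency.

    Encodes only the rule from the text: if i is in j and j is in k,
    then i must be in k. All other combinations are left unconstrained.
    """
    if r_ij == "in" and r_jk == "in":
        return r_ik == "in"
    return True

def triplet_factor(r_ij, r_jk, r_ik, inconsistent_value=0.0):
    """Factor value for a triplet: 1 if consistent, else a low score."""
    return 1.0 if triplet_consistent(r_ij, r_jk, r_ik) else inconsistent_value
```

Evaluating this function over every combination of the three relation variables yields the table that a factor-graph library would consume for the triplet factor.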

By applying Belief Propagation [kschischang2001factor] on the factor graph formulated above, we obtain the marginal belief over inter-object relations between all object pairs. We use the libDAI [mooij2010libdai] library for inference. An example of the probabilistic inter-object spatial relations inferred from the factor graph is shown in Figure 5, and it is used in our experiments.

V Search Strategy

Based on the belief over the object locations, we actively search for the target object by generating promising view poses and selecting the best one ranked by a utility function. Given the particle set of the target object maintained as in Section IV, we fit Gaussian Mixture Models (GMMs) to the particles through Expectation Maximization, automatically selecting the number of clusters.



V-A View Pose Generation

For each Gaussian component , we generate a set of camera view pose candidates , where and denote the translation and the rotation of the camera respectively.

Initially, we sample the location of the camera evenly from a circle with a fixed radius around the center of the Gaussian component, and assign a default value to rotation . Note that these initially sampled view poses can put the robot in collision with the environment, and the camera is not necessarily looking at . Thus, we formulate a view pose optimization problem under constraints as below,

where is the view direction given , denotes the effective observation region of the target object at camera pose , and is a function that computes a signed distance of a configuration to the collision geometry of the environment.
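The initial sampling step can be sketched in 2D as follows. This is a simplification under stated assumptions: poses are planar `(x, y, yaw)` tuples rather than 6-DOF camera poses, the yaw is set to face the component center directly instead of being optimized, and no collision check is performed. The function name is hypothetical.

```python
import math

def sample_view_poses(center, radius, n):
    """Sample n planar view poses evenly on a circle around `center`.

    Returns a list of (x, y, yaw) tuples, with yaw pointing at the center.
    Collision checking and pose optimization are deliberately left out.
    """
    poses = []
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        x = center[0] + radius * math.cos(angle)
        y = center[1] + radius * math.sin(angle)
        # Face the center of the Gaussian component.
        yaw = math.atan2(center[1] - y, center[0] - x)
        poses.append((x, y, yaw))
    return poses
```

A later stage would still need to move each candidate out of collision and verify that the component falls inside the effective observation region.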

V-B View Pose Selection

We propose two different utility functions to rank the view pose candidates:

V-B1 Direct Search utility

encourages the robot to explore promising areas that could contain the target object while accounting for navigation cost,


where is the weight of the Gaussian component (as in (4)) that is generated from, and is the navigation distance from the current robot location to view pose . Parameter trades off between the probability of finding the target object and the navigation cost. Parameter determines how quickly the plateaus.

With , the object search is direct because we are directly considering promising areas represented by the GMMs for the target object.

Fig. 4: Simulation experiments setup in Gazebo: an apartment-like environment with four rooms. There are landmark objects and target objects: coffee machine, laptop, cup. Each target object has two equally possible locations.

V-B2 Hybrid Search utility

encourages the robot to explore promising areas that could contain the target object and/or any landmark object, while accounting for navigation cost

where the additional term compared to acts to encourage the robot to also explore areas that could contain landmark object which co-occurs with the target object with probability . Specifically, , and is the weight of the -th Gaussian component of GMMs fitted to the belief over the location of the landmark object . And is if the -th Gaussian of object is within the effective observation region at camera pose , otherwise .

is inspired by the indirect object search strategy as studied in [garvey1976perceptual, wixson1994using]. Previous studies demonstrated that purposefully looking for an intermediate landmark object helps quickly narrow down the search region for the target object if the landmark object often co-occurs with the target object, thus improving the search efficiency.

With , the object search can be considered hybrid because we are considering promising areas represented by GMMs for both the target object (as in direct search) and landmark objects that co-occur with the target object (as in indirect search).
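The relationship between the two utilities can be illustrated with a toy version. The functional forms here are assumptions, not the paper's equations: the navigation term is a hypothetical saturating cost `1 - exp(-d / beta)`, and `alpha`, `beta`, `gamma` are placeholder parameters. Only the overall structure (a direct term plus a landmark term built from co-occurrence probability, component weight, and visibility) follows the description above.

```python
import math

def direct_utility(component_weight, nav_distance, alpha=0.5, beta=1.0):
    """Target-component weight minus a saturating navigation cost (assumed form)."""
    nav_cost = 1.0 - math.exp(-nav_distance / beta)  # flattens for large distances
    return component_weight - alpha * nav_cost

def hybrid_utility(component_weight, nav_distance, landmarks,
                   gamma=0.5, alpha=0.5, beta=1.0):
    """Direct utility plus a bonus for landmark components visible from the pose.

    landmarks: list of (co_occurrence_prob, components) pairs, where components
        is a list of (gmm_weight, visible_from_pose) tuples for that landmark.
    """
    bonus = sum(co_prob * weight
                for co_prob, components in landmarks
                for weight, visible in components
                if visible)
    return direct_utility(component_weight, nav_distance, alpha, beta) + gamma * bonus
```

When no landmark component is visible from a candidate pose, the hybrid score reduces to the direct score, so the two strategies only differ where landmarks can actually be observed.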

In our experiments, we use an A*-based planner to compute . We empirically set , , and such that plateaus as goes beyond 3.

VI Experiments

We perform object search tasks in both simulation and real-world environments with a Fetch robot. In the simulation experiments, we quantitatively benchmark various methods, including methods that resemble previous works and our proposed method. In the real-world experiments, we demonstrate qualitatively that the proposed method scales to real-world applications. In both simulation and real-world experiments, the robot accelerates to at most m/s and turns at most at rad/s.

VI-1 Simulation Experiments

The simulation experiments are performed in an apartment-like environment (mxm) set up in the Gazebo simulator, as shown in Figure 4. The room types and considered landmark objects are annotated in Figure 4, along with the placements of target objects. The marginal belief inferred from the factor graph as explained in Sec. IV-B is depicted in Figure 5.

Fig. 5: Marginal belief on inter-object spatial relations, as well as object-room relations, inferred from the factor graph as explained in Sec. IV-B. CM: coffee machine, CT: coffee table
Fig. 6: Examples of search paths generated by each method while searching for cup. Methods from left to right: UDS, IDS-Known-Static, IDS-Known-Dynamic, IDS-Unknown, IHS-Unknown. (Best viewed in color).
Target Object    Metric            UDS     IDS known, static   IDS known, dynamic   IDS unknown   IHS unknown
Coffee Machine   Views             7.83    6.17                4.67                 6.33          3.67
                 Search Time (s)   107     76                  60                   75            50
                 Search Path (m)   8.68    6.70                5.80                 6.74          4.93
                 Success Rate      1.0     1.0                 1.0                  1.0           1.0
Laptop           Views             11.00   12.50               7.17                 5.67          4.17
                 Search Time (s)   197     222                 124                  91            78
                 Search Path (m)   28.27   26.86               13.13                7.69          8.40
                 Success Rate      0.83    0.50                1.00                 1.00          1.00
Cup              Views             13.17   14.50               12.67                11.83         9.00
                 Search Time (s)   184     229                 189                  185           139
                 Search Path (m)   22.64   29.81               23.40                19.68         13.91
                 Success Rate      0.83    0.33                0.83                 0.83          1.00
TABLE I: Benchmark results for object search in simulation experiments. Among methods that reached 100% success rate, IHS unknown successfully found target objects within the smallest number of views and least search time.

We set up an object detector in simulation that returns a detection of an object, if the object is in view, not fully occluded, and within the effective observation range. For large objects (e.g. sofa, bed, fridge), mid-sized objects (e.g. desk, table, sink), and small objects (e.g. cup, laptop, coffee machine), we assume an effective observation range of m, m, m respectively.
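The simulated detector rule described above can be sketched as a simple predicate. The range values in `RANGE_BY_SIZE` are hypothetical placeholders (the actual ranges are not reproduced here), and the function and parameter names are assumptions.

```python
# Hypothetical effective observation ranges in metres, one per size class.
RANGE_BY_SIZE = {"large": 5.0, "medium": 3.0, "small": 2.0}

def simulated_detection(distance, in_view, fully_occluded, size_class,
                        ranges=RANGE_BY_SIZE):
    """Return True when the simulated detector reports the object.

    The object must be in the camera view, not fully occluded, and within
    the effective observation range for its size class.
    """
    return in_view and not fully_occluded and distance <= ranges[size_class]
```

Under these placeholder ranges, a sofa 4 m away would be detected while a cup at the same distance would not, which is the behavior the size-dependent ranges are meant to capture.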

We benchmark the following methods:

  • UDS: Uninformed direct search (Eq.5). The robot does not account for the spatial relations between the target and landmark objects (omitting Eq. 3 in SLiM). This baseline represents a naive approach for object search.

  • IDS-Known-Static: Informed direct search (Eq.5) with a known prior on landmark object locations. The robot assumes that landmark objects are static at the locations provided by the prior. This method resembles previous works [kollar2009utilizing, kunze2014using, toris2017temporal].

  • IDS-Known-Dynamic: Informed direct search (Eq.5) with a known prior on landmark object locations. This is similar to IDS-Known-Static except that the robot does not assume the landmark objects to remain at the locations expressed in the prior.

  • IDS-Unknown: Informed direct search (Eq.5) without prior on landmark object locations. The particles for landmark objects are initialized uniformly across the environment. This method resembles previous works [loncomilla2018bayesian, aydemir2010object].

  • IHS-Unknown: Informed hybrid search (Sec. V-B2) without prior on landmark object locations.

All methods except for UDS use the full SLiM model. We assume that an occupancy-grid map of the environment is given. We also assume that the room types are accurately recognized across the environment. The IDS-Known-* methods are provided with a noisy prior on landmark object locations that differs from the actual locations, to emulate the common case where perfect knowledge about landmark locations is not available. For all methods, the particles for the target object are initialized uniformly across the environment.

For each target object, we run trials per method. In each trial, the robot starts at the same location, depicted in Figure 4. The object search is terminated if (1) the belief over the target object location has converged, or (2) the maximum search time of mins has been exceeded. A trial is successful if the robot finds the target object before timeout. For each target object and each method, we measure the number of view poses, search time, distance travelled by the robot, and search success rate averaged across all trials.

The benchmark results are shown in Table I. Examples of the resulting search path from each method are depicted in Figure 6. UDS is the least efficient on most metrics because it does not make use of the spatial relations between the target and landmark objects in the environment. Given a noisy prior on landmark object locations, IDS-Known-Dynamic outperforms IDS-Known-Static because it accounts for the uncertainty of the landmark object locations, whereas IDS-Known-Static is misled by the noisy prior.

Given no prior information, IHS-Unknown outperforms IDS-Unknown because it encourages the robot to explore promising regions that contain the target and/or useful landmark objects, whereas IDS-Unknown only considers promising regions that contain the target object. With IHS-Unknown, the robot benefits from finding landmark objects, which help narrow down the search region for the target object.

VI-2 Real-World Experiments

The real-world experiment is executed in an environment (mxm) that consists of a kitchen and a living room. The robot stays localized in the pre-mapped environment based on its LIDAR, and navigates using an MPEPC-based path planner [park2012IROS]. The target object is a cup, and landmark objects include table, sofa, coffee machine and sink. IHS-Unknown reached an average success rate of 0.7 (7 out of 10 trials). The average number of view poses, search time and search path are , s, and m respectively. The failure cases were caused by false negative detections of the cup under difficult lighting (we used Faster R-CNN [ren2017faster] trained on the COCO dataset [coco]). Examples of real-world experiments with a Fetch robot are available in the online video.

VII Conclusion

In this paper, we present an efficient active visual object search approach through the introduction of the SLiM model. SLiM simultaneously maintains the belief over target and landmark object locations, while accounting for the probabilistic inter-object spatial relations. Further, we propose a hybrid search strategy that draws insights from both direct and indirect object search. Given a noisy prior or no prior on landmark object locations, we demonstrate, in both simulation and real-world experiments, the benefit of modeling landmark object locations under uncertainty in SLiM and of the hybrid search strategy, which encourages the robot to explore promising areas that can contain the target and/or landmark objects.