In the context of partially observable Markovian systems, planning over belief space (BSP), under some simplifying assumptions, scales to applications including autonomous navigation, object grasping and manipulation, active SLAM, and robotic surgery. In the presence of uncertainty, such as in robot motion and sensing, the true state of the variables of interest (e.g. robot poses) is unknown and can only be represented by a probability distribution over possible states, given the available data. This distribution, the belief, is inferred using probabilistic approaches based on incoming sensor observations and prior knowledge. The corresponding problem is an instantiation of a partially observable Markov decision process (POMDP). Apart from simplifying structural assumptions, such as Gaussian noise around a given observation and motion model, state-of-the-art BSP approaches typically assume data association to be given and perfect (see Figure (b)), i.e. the robot is assumed to correctly perceive the environment observed by its sensors, given a candidate action. For brevity, we shall call this the data-association-is-solved (DAS) assumption. In reality, the world is often full of ambiguity which, together with other sources of uncertainty, makes perception a challenging task. Examples include matching images of two different places that are similar in appearance, or attempting to recognise an object that, from the current viewpoint, is similar in appearance to another object. Both are ambiguous situations in which naïve and straightforward approaches using DAS are likely to yield incorrect results, i.e. mistakenly considering the two places to be the same, or incorrectly associating the observed object.
Thus, in the presence of ambiguity, DAS may lead to incorrect posterior beliefs and, as a result, to sub-optimal actions. More advanced approaches are therefore required to enable reliable operation in ambiguous conditions; these are often referred to as (active) robust perception. They typically involve probabilistic data association and hypothesis tracking given the available data. Thus, for the object detection example, each hypothesis may represent a candidate object from a given database to which the current observation (e.g. image or point cloud) is successfully registered. Similarly, one might reason probabilistically about perceptual aliasing, as in the first example above, which would also involve probabilistic data association. Yet existing robust perception approaches focus on the passive case, where robot actions are externally determined and given, while the closely related approaches for active object detection and classification consider the robot to be perfectly localised.
In this work we develop a general data association aware belief space planning (DA-BSP) framework capable of better handling the complexities arising in real-world, possibly perceptually aliased, scenarios. We rigorously incorporate reasoning about data association within belief space planning, while also considering other sources of uncertainty (motion, sensing and environment). In particular, we show that our framework can be used for active disambiguation by determining appropriate actions, e.g. future viewpoints, that increase confidence in a certain data association hypothesis.
Organization of the paper: After discussing related work and stating our contributions, we formulate the considered problem in Section 2. In Section 3 we provide a conceptual overview and then discuss the proposed approach in detail, and in Section 4 we demonstrate its key aspects in basic and realistic simulated scenarios. Finally, in Section 5 we conclude the discussion and suggest potential directions for future research.
1.1 Related Work
Calculating optimal solutions to a POMDP is computationally intractable (PSPACE-complete) for all but the smallest problems. The vast research area of approximate approaches (with reduced computational complexity) can be roughly segmented into point-based value iteration methods [26, 19], simulation-based and sampling-based approaches [27, 6, 2], and direct trajectory optimization methods [33, 25, 11]. In all cases, finding the (locally) optimal actions involves evaluating a given objective function while considering the future observations to be acquired as a result of each candidate action. All of these approaches assume DAS. For example, it is typically assumed that the robot can be localised by making observations of known landmarks or beacons (see, e.g. [27, 2]), and that each future measurement is correctly associated with the appropriate landmark. Though reasonable in certain scenarios, DAS becomes unrealistic in perceptually aliased environments (two scenes that look alike) under localisation uncertainty, as considered in this work.
The issue of perceptual aliasing was considered in earlier works on POMDP planning, though again in highly simplified scenarios, since data association further complicates the problem. In a slightly separate line of research, the issue has been studied in the context of multiple hypothesis tracking (MHT) and, more recently, of active robust perception. Both lines of work rely on passive and often non-parametric approaches, through various filtering techniques; we refer the interested reader to the book and tutorial on the subject for further details. For example, the use of a Gaussian mixture probability hypothesis density (PHD) filter has been proposed. To the best of our knowledge, such approaches have not been considered in the context of active planning.
Coming back to scalable planning methods such as BSP, we note that while traditional BSP approaches typically assumed the environment to be accurately known (e.g. a given map), recent works, including [8, 9, 35, 18, 11], relax this assumption and model the uncertainty of the environment mapped thus far within the belief. The corresponding framework is thus tightly related to active SLAM, with its well-known trade-off between exploration and exploitation. Recent work [18, 11, 9, 35] in this branch focused in particular on probabilistically modelling what future observations will be obtained given a candidate action. However, none of these works relaxes the DAS assumption.
In the last few years, the SLAM research community has investigated approaches that are resilient to false data association (outliers) overlooked by front-end algorithms (e.g. image matching); see e.g. [31, 21, 7, 14, 13]. However, these approaches, also known as robust graph optimization approaches, are developed only for the passive problem setting, i.e. robot actions are given and externally determined. In contrast, we consider a complementary active framework that incorporates data association aspects within BSP.
Our approach is also tightly related to recent work on active hypothesis disambiguation in the context of object detection and classification [4, 29, 20, 36, 32]. Given hypotheses regarding object class and pose, these approaches aim to find a sequence of future viewpoints that will lead to disambiguation, i.e. identifying the correct hypothesis. However, these approaches assume that the sensor is perfectly localized, and can be shown to be a specific case of DA-BSP.
Probably the closest work to our approach is by Agarwal et al., where the authors also consider hypotheses due to ambiguous data association and develop a BSP approach for active disambiguation. However, unlike that work, DA-BSP considers ambiguous data association also in the posterior, and thus does not require that a fully disambiguating action be guaranteed in the future.
To summarize, our main contributions in this paper (earlier versions of which have appeared previously) are as follows: (a) relaxing the data-association-is-solved assumption to obtain a general data-association aware BSP framework (DA-BSP) with GMM priors; (b) considering the active data-association aspect for both planning and inference, hence providing a closed-loop framework; (c) reducing some of the known recent BSP approaches to degenerate cases of DA-BSP; (d) demonstrating empirical results in support of two claims: data association is crucial for robust BSP, and the principled approach of DA-BSP can be scalable enough to be applied to practical problems.
2 Notations and Problem Formulation
Consider a robot operating in a partially known or pre-mapped environment which can be ambiguous and perceptually aliased. The robot takes observations of different scenes and objects in the environment, and uses these observations to infer application-dependent random variables of interest (e.g. past and current robot poses). The following three spaces are involved in the considered problem, as shown in Figure 1: pose-space, scene-space and observation-space.
Pose-space involves all possible perspectives a robot can take with respect to a given environment model and in the context of the task at hand. We denote the robot pose at time step by and a sequence of poses from up to by . Given all controls and observations up to time step , the posterior probability distribution function (pdf) is defined as . For notational convenience, we define below histories and and rewrite the posterior pdf (belief), at time as .
The scene-space involves a discrete set of objects or scenes, denoted by the set , in the given world model, which can be detected through the sensors of the robot. We will use symbols and to denote such typical scenes. Note that even if the objects are identical, they are distinct in scene space. This will be important when we consider cases where the objects look similar from some perspectives. Finally, observation-space is the set of all possible observations that the robot is capable of obtaining when considering its mission and sensory capabilities.
We consider probabilistic motion and observation models
and denote them by and , respectively. As is common in the literature, we consider zero-mean Gaussian process and measurement noise and , with known noise covariance matrices and . Here, is a noise-free observation, which we refer to as the nominal or predicted observation , and which corresponds to observing scene from pose .
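For concreteness, one common way of writing such models is sketched below. The symbols used here ($f$, $h$, $w_k$, $v_k$, $\Sigma_w$, $\Sigma_v$, $x_k$, $u_k$, $z_k$, $A_i$) are notation assumed for illustration only and need not match the notation used elsewhere in the paper.

```latex
\begin{align}
  x_{k+1} &= f(x_k, u_k) + w_k, & w_k &\sim \mathcal{N}(0, \Sigma_w), \\
  z_k     &= h(x_k, A_i) + v_k, & v_k &\sim \mathcal{N}(0, \Sigma_v), \\
  \hat{z} &= h(x, A_i)          & &\text{(nominal, noise-free observation of scene } A_i \text{ from pose } x\text{)}.
\end{align}
```

Under these assumed forms, the motion model is the Gaussian $\mathcal{N}(f(x_k, u_k), \Sigma_w)$ over the next pose and the observation model is the Gaussian $\mathcal{N}(h(x_k, A_i), \Sigma_v)$ over the measurement.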
Given a prior and motion and observation models (2), the joint posterior pdf at the current time can be written as
Note that DAS is the underlying assumption in the above equation.
If the prior is Gaussian, it is not difficult to show that is also a Gaussian with some mean and covariance that can be efficiently calculated via maximum a posteriori (MAP) inference. This also holds when the environment model is given but uncertain, and when this model is unknown a priori and is instead constructed on-line within a SLAM framework. However, in this paper we consider a more general case where the prior belief is modeled by a Gaussian mixture model (GMM). Such a situation can arise, for example, in the kidnapped robot problem in a perceptually aliased environment (e.g. different rooms that are similar in appearance), where matching sensor observations against a given map would indicate several most probable robot locations. In such a case the belief at time can be represented by a GMM,
where is the number of components (or modes), the th component is represented by the weight , modeling the probability of the robot being in that component, and by the conditional Gaussian
with appropriate mean and covariance . Here, is an indicator variable denoting the component number.
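As an illustration, a GMM belief of this kind can be sketched in a few lines of Python. This is a minimal one-dimensional sketch, not the authors' implementation: the `Component` class, the function names and the numerical values are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    weight: float  # probability of the robot being in this mode
    mean: float    # mean of the conditional Gaussian (1-D pose, for simplicity)
    var: float     # variance of the conditional Gaussian


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(belief, x):
    """Mixture density: sum over modes of weight * conditional Gaussian."""
    return sum(c.weight * gaussian_pdf(x, c.mean, c.var) for c in belief)


# Kidnapped-robot style prior (made-up numbers): two similar-looking rooms
# are equally plausible, so the belief has two equally weighted modes.
belief = [Component(0.5, -2.0, 0.25), Component(0.5, 3.0, 0.25)]
density_at_first_mode = gmm_pdf(belief, -2.0)
```

The component index of each entry plays the role of the indicator variable: conditioning on it selects a single Gaussian.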
Given the belief at time , one can reason about the robot’s best future actions that would minimize (or maximize) an objective function .
where the expectation is over the (unknown) future observation , and is the immediate cost.
The posterior belief at time is a function of control and observation , i.e.
Similarly, we define the propagated joint belief as
from which the marginal belief over the future pose can be calculated as .
In particular, the propagated belief at the first look ahead step, given the GMM belief (4) at time is
with , and .
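To make the propagation step concrete, the following sketch pushes a GMM belief through a motion model. It assumes a one-dimensional, linear, additive-noise model (next pose = pose + control + noise), which is a simplification chosen for illustration; the function name and numbers are hypothetical.

```python
def propagate(belief, u, var_w):
    """Propagate a 1-D GMM belief through x' = x + u + w, with w ~ N(0, var_w).

    belief: list of (weight, mean, variance) tuples, one per mode.
    Under this assumed linear model each mode stays Gaussian: its mean is
    shifted by the control and its variance grows by the process noise.
    The mode weights are unchanged, since no observation has been used yet.
    """
    return [(w, mean + u, var + var_w) for (w, mean, var) in belief]


belief = [(0.5, -2.0, 0.25), (0.5, 3.0, 0.25)]
propagated = propagate(belief, u=1.0, var_w=0.1)
```

For nonlinear motion models the per-mode update would instead rely on linearisation or MAP inference, as discussed in the text.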
As earlier, under the DAS assumption, one can consider for each specific value of the corresponding observed scene , and express the posterior (7) recursively as
which can be represented as with appropriate mean and covariance . The optimal control is then defined as:
The DAS assumption greatly simplifies the above formulation. Yet, in practice, determining data association reliably is often a non-trivial task in itself, especially when operating in perceptually aliased environments. An incorrect data association (wrong scene in Eq. (10)) can lead to catastrophic results, see, e.g. [14, 12, 13]. In this work we relax this restrictive assumption and rigorously incorporate data association aspects within belief space planning and inference, assuming that the underlying distributions are GMMs.
Given some candidate action (or sequence of actions) and the belief at planning time , we can reason about a future observation (e.g. an image) to be obtained once this action is executed; its actual value is unknown. All the possible values such an observation can assume should be taken into account while evaluating the objective function; hence, the expectation operator in Eq. (6). When written explicitly it transforms to
The two terms and in the above equation have an intuitive meaning: for each considered value of , represents how likely it is to obtain such an observation when both the history and control are known, while corresponds to the posterior belief given this specific .
Considering DAS means we can correctly associate each possible measurement with the corresponding scene it captures, as in Eq. (10). Yet it is unknown from what future robot pose the actual observation will be acquired, since the actual robot pose at time is unknown and the control is stochastic. Indeed, as a result of action , the robot's actual (true) pose can be anywhere within the propagated belief . In inference, we have a similar situation, with the key difference that the observation has already been acquired. We must first associate the captured measurement with the scene or object it describes, i.e. write the appropriate measurement likelihood term in the posterior (3).
In the BSP framework, solved data association means that for each such observation the corresponding observed scene is known. In contrast, we do not assume this, and instead reason about the possible scenes or objects that could have generated the future observation , see Figures (b) and 1.
Parsimonious data association:
Incorporating data association is expensive. However, if the environment contains only distinct scenes or objects, then for each specific value of , there will be only one scene that can generate such an observation according to the model (2). In perceptually aliased environments, there could also be several scenes (or objects) that are either completely identical, or have a similar visual appearance when observed from appropriate viewpoints. They could equally well explain the considered observation . Thus, there are several possible associations and, due to localisation uncertainty, determining which association is the correct one is not trivial. As we show in the sequel, in these cases the posterior (term in Eq. (11)) becomes a Gaussian mixture with appropriate weights that we rigorously compute. Additionally, the weight updates are capable of discriminating against unlikely data associations during the planning steps.
Intuitively speaking, perceptual aliasing occurs when an object different from the actual one produces the same observation, and is thereby an alias, in the sense of perception, of the true object. We consider two notions of perceptual aliasing: exact and probabilistic. Exact perceptual aliasing of scenes and is defined as , and will be denoted in this paper by . In other words, the same nominal (noise-free) observation can be generated by observing different scenes, possibly from different viewpoints. Such a situation is depicted in Figure 1. Probabilistic perceptual aliasing is a more general form of aliasing, which can be defined as for some small threshold .
3.1 Computing the term (a) :
Applying total probability over non-overlapping scene space and marginalizing over all possible robot poses, yields
As seen from the above equation, to calculate the likelihood of obtaining some observation , we consider separately, for each scene , the likelihood that this observation was generated by scene . This probability is captured for each scene by a corresponding weight ; these weights are then summed to get the actual likelihood of observation . As will be seen below, these weights naturally account for perceptual aliasing aspects for each considered .
In practice, instead of considering the entire scene space , which could be huge, the availability of the belief makes it possible to consider only those scenes that could actually be observed from viewpoints with non-negligible probability according to , e.g. within standard deviations of uncertainty for each GMM component. In the following, however, we proceed while reasoning about the entire scene space .
Proceeding with the derivation further, using the chain rule we compute
However, since , we get
Since the propagated belief (9), from which is calculated, is a GMM, we can replace with .
Here, is the standard measurement likelihood term, while represents the event likelihood, which denotes the probability of scene to be observed from viewpoint . In other words, this scenario-dependent term encodes from what viewpoints each scene is observable and could also model occlusion and additional aspects. As such, this term can be determined given a model of the environment and thus, in this work, we consider this term to be given.
The weights (15) naturally capture perceptual aliasing aspects: consider some observation and the corresponding generative model with appropriate unknown true robot pose and scene . Clearly, the measurement likelihood will be high when evaluated for and in the vicinity of . Note that we will necessarily consider such a case, since according to Eq. (12) we separately consider each scene in , and, given , we reason about all poses in Eq. (15). In case of perceptual aliasing, however, there will also be other scene(s) which could generate the same observation from an appropriate robot pose . Thus, the measurement likelihood term corresponding to will also be high for .
However, the actual value of (for each ) depends not only on the measurement likelihood, but also on the above-mentioned event likelihood and on the GMM belief , with the latter weighting the probability of each considered robot pose . This correctly captures the intuition that observations associated with low-probability poses are unlikely to be actually acquired, leading to a low value of with . Since is a GMM with components, a low-probability pose corresponds to low probabilities for each component . However, the likelihood term (12) could still go up in case of perceptual aliasing, where the aliased scene generates a similar observation to from viewpoint , with the latter being more probable, i.e. high probability .
In practice, calculating the integral in Eq. (15) can be done efficiently by considering each component of the GMM separately. Each such component is a Gaussian that is multiplied by the measurement likelihood , which is also a Gaussian, and it is known that a product of Gaussians remains (up to a scale factor) a Gaussian. The integral then needs to be calculated only over the window where the event likelihood is non-zero, i.e. . For general probability distributions, the integral in Eq. (15) should be computed numerically. Since in practical applications is sparse w.r.t. , this computational cost is not severe.
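The per-component closed form can be sketched as follows. This is a simplified illustration under assumptions that are not taken from the paper: a one-dimensional pose, an object-centric linear observation model z = x - a_i + v, and an event likelihood that is constant over poses (one by default). With these assumptions, the integral of the product of the two Gaussians is itself a Gaussian density in z.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def scene_weights(z, belief, scenes, var_v, event_prob=None):
    """Unnormalised weight of each scene for a given observation z.

    belief: propagated GMM as a list of (weight, mean, variance) tuples.
    scenes: dict mapping a scene name to its (hypothetical) location a_i.
    For the assumed model z = x - a_i + v, integrating the product
    N(z; x - a_i, var_v) * N(x; mu_j, var_j) over x gives
    N(z; mu_j - a_i, var_j + var_v) for each GMM component j.
    """
    weights = {}
    for name, a in scenes.items():
        p_event = 1.0 if event_prob is None else event_prob[name]
        weights[name] = sum(
            xi * p_event * gaussian_pdf(z, mu - a, var + var_v)
            for (xi, mu, var) in belief
        )
    return weights


belief = [(1.0, 0.0, 0.5)]
scenes = {"A1": 0.0, "A2": 0.0, "A3": 4.0}  # A1 and A2 are exactly aliased
w = scene_weights(z=0.1, belief=belief, scenes=scenes, var_v=0.1)
likelihood = sum(w.values())  # likelihood of obtaining this observation
```

In this toy setup the two aliased scenes receive identical weights, while the distinct scene A3 receives a negligible one, which mirrors the discussion of aliasing above.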
3.2 Computing the term (b) :
The term , , represents the posterior probability conditioned on observation . This term can be similarly calculated, with a key difference: since the observation is given, it must have been generated by one specific (but unknown) scene according to measurement model (2). Hence, also here, we consider all possible such scenes and weight them accordingly, with weights representing the probability of each scene to have generated the observation . As will be seen next, the posterior is a GMM with components.
Applying total probability over non-overlapping and chain-rule, we get:
The first term, , is the posterior belief conditioned on observation , history , as well as a candidate scene that supposedly generated the observation . It is not difficult to show that this posterior is actually the GMM
where is the posterior of the th GMM component of the propagated belief , see Eq. (9).
The term, , is merely the likelihood of being actually the one which generated the observation . This term can be evaluated, in a similar fashion to Section 3.1, accounting for for each considered th component as , and applying Bayes’ rule yields
with . Note that for each component , . Finally, we can re-write Eq. (18) as
or in short, , where
As seen, we obtain a new GMM with components, where each component , with appropriate mapping to indices from Eq. (18), is represented by weight and posterior conditional belief . The latter can be evaluated as the Gaussian , where the mean and covariance can be efficiently recovered via MAP inference.
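The construction of the posterior mixture can be sketched as below, again under assumptions made only for illustration: a one-dimensional pose, the linear observation model z = x - a_i + v, and an event likelihood equal to one. In this linear-Gaussian case the MAP estimate of each component coincides with a Kalman update; all names and numbers are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior_gmm(z, belief, scenes, var_v):
    """Posterior mixture with one component per (scene, prior mode) pair.

    Returns a list of (weight, mean, variance, scene_name) with weights
    normalised to sum to one.
    """
    components = []
    for name, a in scenes.items():
        for (xi, mu, var) in belief:
            s = var + var_v                        # innovation variance
            w = xi * gaussian_pdf(z, mu - a, s)    # unnormalised weight
            k = var / s                            # Kalman gain
            post_mean = mu + k * (z - (mu - a))    # update with z = x - a + v
            post_var = (1.0 - k) * var
            components.append((w, post_mean, post_var, name))
    total = sum(c[0] for c in components)
    return [(w / total, m, v, name) for (w, m, v, name) in components]


belief = [(0.5, 0.0, 0.5), (0.5, 4.0, 0.5)]
scenes = {"A1": 0.0, "A2": 4.0}
post = posterior_gmm(0.0, belief, scenes, var_v=0.1)
```

With these numbers the observation is explained equally well by "robot near 0 observing A1" and "robot near 4 observing A2", so two of the four posterior components carry almost all of the weight.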
3.3 Summary thus Far
To summarize the discussion thus far, we have shown that for the myopic case, the objective function (11) can be re-written as
One can observe that according to Eq. (18), each of the components from the belief at a previous time, is split into new components with appropriate weights. This would imply an explosion in the number of components, making the proposed framework hardly applicable. However, in practice, the majority of the weights will be negligible, and therefore can be pruned, while the remaining number of components is denoted by in Eq. (20). Depending on the scenario and the degree of perceptual aliasing, this can correspond to full or partial disambiguation.
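A pruning step of this kind might look as follows. The threshold value and the rule of always keeping at least the heaviest component are assumptions made for this sketch; the text above does not specify a particular pruning criterion.

```python
def prune(components, threshold=1e-3):
    """Drop mixture components with negligible weight and renormalise.

    components: list of tuples whose first entry is a normalised weight.
    If every weight is below the threshold, the single heaviest component
    is kept so that the result is never empty.
    """
    kept = [c for c in components if c[0] >= threshold]
    if not kept:
        kept = [max(components, key=lambda c: c[0])]
    total = sum(c[0] for c in kept)
    return [(c[0] / total,) + tuple(c[1:]) for c in kept]


# Three (weight, mean, variance) components with made-up values.
mixture = [(0.6, 0.0, 1.0), (0.3999, 4.0, 1.0), (0.0001, 9.0, 1.0)]
pruned = prune(mixture, threshold=1e-3)
```

If only one component survives, the observation fully disambiguates the association; if several survive, the disambiguation is partial.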
Having shown that incorporating data association within belief space planning leads to Eq. (22), we now proceed with the exposition of our approach.
3.4 Simulating Future Observations given
Calculating the objective function (22) for each candidate action involves considering all possible realizations of . One way to do this in practice is to simulate future observations given the propagated GMM belief , scenes and observation model (2). One can then evaluate Eq. (22) considering all observations in .
We now briefly describe how this concept can be realised. First, viewpoints are sampled from . For each viewpoint , an observed scene is determined according to the event likelihood . Together, and are then used to generate nominal and noise-corrupted observations according to observation model (2): . The set is then the union of all such generated observations . Note that although the true association (scene ) is known while generating , it is unknown to our algorithm, i.e. while evaluating Eq. (22).
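The sampling procedure described above can be sketched as follows. The event model used here (the scene closest to the sampled viewpoint is the one observed) and the linear observation model z = x - a_i + v are hypothetical stand-ins chosen to keep the example short.

```python
import math
import random


def simulate_observations(belief, scenes, var_v, n_samples, rng):
    """Simulate future observations from a propagated 1-D GMM belief.

    belief: list of (weight, mean, variance) tuples.
    scenes: dict mapping a scene name to its (hypothetical) location.
    Returns the observations together with the scenes that generated them;
    the latter are for inspection only and are not given to the planner.
    """
    mode_weights = [xi for (xi, _, _) in belief]
    names = list(scenes)
    observations, true_scenes = [], []
    for _ in range(n_samples):
        # Sample a viewpoint: pick a mode by its weight, then a pose within it.
        _, mu, var = rng.choices(belief, weights=mode_weights, k=1)[0]
        x = rng.gauss(mu, math.sqrt(var))
        # Assumed event model: the scene closest to the viewpoint is observed.
        scene = min(names, key=lambda n: abs(scenes[n] - x))
        z_nominal = x - scenes[scene]                      # noise-free observation
        z = z_nominal + rng.gauss(0.0, math.sqrt(var_v))   # add measurement noise
        observations.append(z)
        true_scenes.append(scene)
    return observations, true_scenes


belief = [(0.5, -2.0, 0.25), (0.5, 3.0, 0.25)]
scenes = {"A1": -2.0, "A2": 3.0}
observations, true_scenes = simulate_observations(
    belief, scenes, var_v=0.1, n_samples=100, rng=random.Random(0)
)
```

Passing an explicit `random.Random` instance keeps the simulation reproducible for a fixed seed.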
3.5 Computing Mixture of Posterior Beliefs
As seen from Eq. (22), reasoning about data association aspects resulted in a mixture of posteriors within the cost , i.e. , for each possible observation . In this section we briefly describe how one can actually calculate the corresponding posterior distributions, given some specific observation . For simplicity, we consider the belief at planning time is a Gaussian . However, our approach could be applied also to more general cases (e.g. mixture of Gaussians) with a certain price in terms of computational complexity. Further investigation of these aspects is left to future research.
Under this setting, each of the components in the mixture pdf can be written as . It is then not difficult to show that the above belief is a Gaussian
and to find its first two moments via MAP inference. Obviously, the mixture of posterior beliefs in the cost from Eq. (22) is now a mixture of Gaussians:
3.6 Designing a Specific Cost Function
The treatment so far has been agnostic to the structure of the cost function . Recalling Eq. (22), we see that the belief over which the cost function is defined is multimodal in general. Standard cost functions in the literature typically include terms such as control usage , distance to goal and uncertainty , see e.g. [33, 11]. In our case, however, the specific form of the latter should be re-examined, and an additional term quantifying the ambiguity level can be introduced. In this section we thus briefly discuss these two terms, starting with the cost over posterior uncertainty.
Since, unlike in usual BSP, the posterior belief in our case is multimodal and represented as a mixture of Gaussians , see Eq. (23), we could define several different cost structures depending on how we treat the different modes. Two particular such choices are to take the worst-case covariance among all covariances in the mixture, e.g. , or to collapse the mixture into a single Gaussian . In both cases, we can define the cost due to uncertainty as .
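The two options can be sketched for a one-dimensional mixture as follows. Collapsing is done here by moment matching, which is one standard way of replacing a mixture by a single Gaussian; whether it matches the exact construction intended above is an assumption, and the function names are hypothetical.

```python
def worst_case_variance(mixture):
    """Largest variance among the mixture components."""
    return max(var for (_, _, var) in mixture)


def collapse(mixture):
    """Moment-match a 1-D Gaussian mixture with a single Gaussian.

    mixture: list of (weight, mean, variance) with weights summing to one.
    The collapsed variance contains both the within-mode variances and the
    spread between the mode means.
    """
    mean = sum(w * m for (w, m, _) in mixture)
    second_moment = sum(w * (v + m * m) for (w, m, v) in mixture)
    return mean, second_moment - mean * mean


mixture = [(0.5, -1.0, 1.0), (0.5, 1.0, 1.0)]
collapsed_mean, collapsed_var = collapse(mixture)
worst_var = worst_case_variance(mixture)
```

For this mixture the collapsed variance (2.0) exceeds the worst-case component variance (1.0), because collapsing also penalises the distance between modes; in higher dimensions the scalar cost would typically be a function such as the trace of the covariance.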
The cost due to ambiguity, , should penalise ambiguities such as those arising out of perceptual aliasing. Here, we note that non-negligible weights in Eq. (22) arise due to perceptual aliasing, whereas in case of no aliasing, all but one of these weights are zero. In the most severe case of aliasing (all scenes or objects are identical), all of these weights are comparable to each other. Thus we take the Kullback-Leibler divergence of these weights from a uniform distribution to penalise higher aliasing, and define , where is a small number to avoid division by zero in case of extreme perceptual aliasing. With user-defined weights and , the overall cost can then be defined as a combination
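One possible reading of this ambiguity cost is sketched below: the cost is taken as the reciprocal of the KL divergence (plus a small constant), so that near-uniform weights, i.e. strong aliasing, are heavily penalised. The reciprocal form is inferred from the remark about division by zero and should be treated as an assumption, as should the function names and the value of `eps`.

```python
import math


def kl_from_uniform(weights):
    """KL divergence of a normalised weight vector from the uniform distribution."""
    n = len(weights)
    kl = sum(w * math.log(w * n) for w in weights if w > 0.0)
    return max(0.0, kl)  # guard against tiny negative values from rounding


def ambiguity_cost(weights, eps=1e-6):
    """Assumed form of the ambiguity cost: 1 / (KL + eps)."""
    return 1.0 / (kl_from_uniform(weights) + eps)


cost_aliased = ambiguity_cost([1 / 3, 1 / 3, 1 / 3])  # all scenes identical
cost_unique = ambiguity_cost([1.0, 0.0, 0.0])         # no aliasing
```

With uniform weights the divergence vanishes and the cost is bounded only by `eps`; with a single non-zero weight the divergence equals the logarithm of the number of components and the cost is small.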
3.7 Formal Algorithm for DA-BSP
We now have all the ingredients to present the overall framework of data-association aware belief space planning, calling it DA-BSP for brevity. It is summarised in Algorithm 1 and briefly described below.
Given a GMM belief and candidate action , we first propagate the belief to get and then simulate future observations (line 4). The algorithm then calculates the contribution of each observation to the objective function (22). In particular, on lines 10 and 11 we calculate the weights that are used in evaluating the likelihood of obtaining observation . On lines 12-18 we compute the posterior belief: this involves updating each th component from the propagated belief with observation , considering each of the possible scenes . After pruning (line 20), this yields a posterior GMM with components. We then evaluate the cost (line 22) and use to update the value of the objective function with the weighted cost for measurement (line 23).
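The overall flow described above can be sketched end to end for a single candidate action. This is a simplified one-dimensional illustration and not a transcription of Algorithm 1: the linear motion and observation models, the nearest-scene event model, the pruning threshold, the collapsed-variance cost and the division by the number of samples are all assumptions made here.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def da_bsp_objective(belief, u, scenes, var_w, var_v,
                     n_samples=200, prune_tol=1e-3, seed=0):
    """Myopic objective for one candidate action u (toy 1-D setting).

    belief: list of (weight, mean, variance); scenes: name -> location.
    """
    rng = random.Random(seed)
    names = list(scenes)
    # Propagate each mode through the assumed model x' = x + u + w.
    prop = [(xi, mu + u, var + var_w) for (xi, mu, var) in belief]
    mode_weights = [xi for (xi, _, _) in prop]
    objective = 0.0
    for _ in range(n_samples):
        # Simulate one future observation (true association is then discarded).
        _, mu, var = rng.choices(prop, weights=mode_weights, k=1)[0]
        x = rng.gauss(mu, math.sqrt(var))
        scene = min(names, key=lambda n: abs(scenes[n] - x))
        z = x - scenes[scene] + rng.gauss(0.0, math.sqrt(var_v))
        # One posterior component per (scene, mode) pair, with its weight.
        comps = []
        for name in names:
            a = scenes[name]
            for (xi, mu_j, var_j) in prop:
                s = var_j + var_v
                w = xi * gaussian_pdf(z, mu_j - a, s)
                k = var_j / s
                comps.append((w, mu_j + k * (z - (mu_j - a)), (1.0 - k) * var_j))
        likelihood = sum(c[0] for c in comps)
        if likelihood <= 0.0:
            continue  # numerically impossible observation, contributes nothing
        norm = [(w / likelihood, m, v) for (w, m, v) in comps]
        # Prune negligible components and renormalise.
        kept = [c for c in norm if c[0] >= prune_tol] or [max(norm, key=lambda c: c[0])]
        total = sum(c[0] for c in kept)
        kept = [(w / total, m, v) for (w, m, v) in kept]
        # Cost: variance of the mixture collapsed to a single Gaussian.
        mean = sum(w * m for (w, m, _) in kept)
        cost = sum(w * (v + m * m) for (w, m, v) in kept) - mean * mean
        objective += likelihood * cost  # weight the cost by the likelihood of z
    return objective / n_samples


belief = [(0.5, 0.0, 0.5), (0.5, 4.0, 0.5)]
scenes = {"A1": 0.0, "A2": 4.0}
j_stay = da_bsp_objective(belief, 0.0, scenes, var_w=0.1, var_v=0.1, n_samples=50)
```

Evaluating this function for several candidate actions and picking the one with the lowest value would correspond to the myopic action selection discussed in the text.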
4 Experimental Results
4.1 An Abstract Example for DA-BSP
Consider the problem of robotic manipulation of objects in the kitchen. For simplicity, let us abstract it to a simpler domain of three objects, . We consider a single step control at time step , from a given belief , as well as that of one step ahead , and assume the following motion and observation models and
where the observations, as well as the shift , are in an object-centric frame, with representing the location of . Intuitively, is a simple mechanism to model perceptual aliasing between objects; e.g., identical objects would have the same . Figure 2 illustrates the process of simulating future observations for = up, considering unique and perceptually aliased scenes (Figures (c)-(d)). In particular, a sampled pose used to generate an observation is shown in Figure (b).
Figure 3 demonstrates key aspects of our approach, considering a single observation each time. Our approach reasons about data association, and hence we consider that each could have been generated by any one of the 3 objects; each such association yields a conditional posterior belief, denoted by the small ellipses. Finally, we compute the total cost according to Algorithm 1.
Figures (a)-(d) show the situation when the true pose is close to the center and observes , while in Figures (e)-(h) it is at the left side and observes . Different degrees of aliasing are considered. Both weights and are shown in the inset histograms. Note that the unnormalised weight is higher when the object is at the centre, because the overall likelihood of the observation is higher. Also, with no aliasing, for any scene other than the true one, the normalised weight is small irrespective of where is. In other words, the weights are also related to how likely the objects are to be the causes behind an observation; in case of no aliasing, this can be negligibly small. This is crucial, since it implies that DA-BSP in practical applications with infrequent aliasing would not require any significant additional computational effort w.r.t. usual BSP.
Figures (b)-(d) depict , and . When , the weights are similar, and indeed our cost of weights (in Eq. (24)) is high. For similar uncertainty in pose, this cost would remain constant. Hence, in the presence of identical objects placed similarly within the current belief, optimization of a general cost function would be guided towards active localization. On the other hand, if one object lies closer to the current nominal pose, it will have a slightly higher . In case , i.e. all objects are identical, the weights are simply an indication of the prior. This is reasonable, since in such a case considering different data associations does not yield any new information.
quantifying estimation error, defined over incorrect (w.r.t. ground truth) associations through random sampling of various modes. Intuitively, these metrics evaluate how good the posterior mean is w.r.t. ground truth for usual BSP and DA-BSP, respectively (lower is better). Recall that, unlike action , action leads to fully unambiguous observations around the most likely value (see Fig. 3) and consequently, .