Object-based World Modeling in Semi-Static Environments with Dependent Dirichlet-Process Mixtures

12/02/2015 ∙ by Lawson L. S. Wong, et al. ∙ 0

To accomplish tasks in human-centric indoor environments, robots need to represent and understand the world in terms of objects and their attributes. We refer to this attribute-based representation as a world model, and consider how to acquire it via noisy perception and maintain it over time, as objects are added, changed, and removed in the world. Previous work has framed this as a multiple-target tracking problem, where objects are potentially in motion at all times. Although this approach is general, it is computationally expensive. We argue that such generality is not needed in typical world modeling tasks, where objects only change state occasionally. More efficient approaches are enabled by restricting ourselves to such semi-static environments. We consider a previously-proposed clustering-based world modeling approach that assumed static environments, and extend it to semi-static domains by applying a dependent Dirichlet-process (DDP) mixture model. We derive a novel MAP inference algorithm under this model, subject to data association constraints. We demonstrate that our approach improves computational performance in semi-static environments.




1 Introduction

There are many situations in which it is important for an automated system to maintain an estimate of the state of a complex dynamical system. Many physical systems are well described in terms of a set of objects, attributes of those objects, and relations between them. The number and properties of the objects in the world may change over time, and they are only partially observable due to noise and occlusion in the observation process. Domains that are appropriately modeled this way include: a household robot, which must maintain an estimate of the contents of a refrigerator that is used by multiple other people based on partial views of its contents; a wildlife-monitoring drone, which must maintain an estimate of the number, age, and health of elephants in a herd based on a sequence of photos of the herd moving through a forest; a surveillance satellite, which must estimate the number, activity, and hostility of soldiers in an enemy camp based on photos capturing people only when they are outside of buildings.

Estimating properties of individuals from noisy observations is a relatively simple statistical estimation problem if the observations are labeled according to which individual generated them. Even when the underlying attributes of the individual change over time, the problem of inferring the history of each individual’s attributes can be reduced to a problem of inference in a hidden Markov model.

The key difficulty in the problems described above is data association. We do not know which particular individual is responsible for each observation, so determining an appropriate association of observations to individuals is key. The only information we have to help make such associations comes from noisy and partial observations, which may contain errors both in attribute values and in number.

Within the context of world modeling, Cox and Leonard (1994) first identified this issue, and applied well-known multiple-hypothesis tracking (MHT) methods to resolve it (Reid, 1979; Bar-Shalom and Fortmann, 1988; Elfring et al., 2013). Recently, Oh et al. (2009) have pointed out drawbacks in using the MHT, which include inefficiency due to considering an exponential number of hypotheses, and the inability to revisit associations from previously-considered views (the MHT is essentially a forward filtering algorithm). Inspired by this, they and others (Dellaert et al., 2003; Pasula et al., 1999) have proposed different Markov-chain Monte Carlo (MCMC) methods for data association. See Wong et al. (2015) for in-depth coverage of previous work in semantic world modeling and data association.

The methods mentioned above were all formulated for multiple-target tracking problems, where each target’s state (typically location) changes between observations. However, if we consider applications such as tracking objects in a household, the dynamics are typically different: most objects tend to stay in the same state when they are not being actively used. In this paper, we study the world modeling problem in semi-static environments, where time is divided into known epochs, and within each epoch the world is stationary. It seems intuitive that data association should be easier within static periods, since there is no uncertainty arising from stochastic dynamics.

Figure 1: An illustration of the world modeling problem. An unknown number of objects exist in the world (top row), and change in pose and number over time (world at each epoch enclosed in box). At each epoch, limited views of the world are captured, as depicted by the triangular viewcones. Within these viewcones, objects and their attributes are detected using black-box perception modules (e.g., off-the-shelf object detectors). In this example, the attributes are shape type (discrete) and 2-D location. The observations are noisy, as depicted by the perturbed versions of viewcones in the middle row. Uncertainty exists both in the attribute values and the existence of objects, as detections may include false positives and negatives (e.g., ). The actual attribute detection values obtained from the views are shown in the bottom row (“Observations”); this is the format of input data. Given these noisy measurements as input, the goal is to determine which objects were in existence at each epoch, their attribute values (e.g., in top right), and their progression over time.

An alternative approach to data association is to perform inference over the entire time-series of observations and to think of it as a problem of clustering: we wish to group together similar detections over time, under the assumption that they will have been generated by the same individual. Bayesian nonparametric models, such as the Dirichlet-process mixture model (DPMM), can be used to model domains in which the number of individuals is unknown a priori; in Wong et al. (2015), we found that a state-estimation technique based on DPMM clustering was effective for determining the number and type of objects in a static domain, given a sequence of images with partial views of the scene and significant occlusion.

In this paper, we apply the clustering approach to the much more difficult case of a dynamic domain in which the attributes of objects may change over time, new objects may appear, and old objects may permanently disappear. The DPMM is not an appropriate model for this problem, but an extension, the dependent Dirichlet process mixture model (DDPMM), which models dependencies between a collection of clusters, can be used effectively. In particular, we use a construction proposed by Lin et al. (2010) for a class of DDPs that can be represented as a Markov chain over DPs. In our case of semi-static world modeling, we model objects in each static epoch as clusters in a DPMM, and clusters between epochs are related by Markovian transitions, thus forming a DDPMM.

In the remainder, we will formalize the world modeling problem, review the DDP construction and apply it to our problem, and derive a novel maximum a posteriori (MAP) inference algorithm for the model. We show that this model yields computational advantages for tracking in semi-static environments, both in simulation and on real-world data.

2 Problem Definition

In world modeling, we seek the state of the world, consisting of an unknown finite number of objects, which changes over time. Object at epoch has attribute values . We sometimes decompose into , where is a vector of fixed attributes, and is a vector of attributes that may change between epochs. The top row in Figure 1 illustrates the world state over three epochs for a simple domain.

Our system obtains noisy, partial views of the world. Each view produces a set of observations , where , corresponding to the fixed attributes and dynamic attributes of some (possibly non-existent) object. (Footnote 1: Superscripts in variables will generally refer to the ‘context’, such as object index and time index . Subscripts refer to the index in a list, such as ’th observation at time .) Each view is also associated with a field of view . The collection of views in a single epoch may fail to cover the entire world. The partial views and noisy observations are illustrated in the middle and bottom rows of Figure 1.

The world modeling problem can now be defined: Given observations and fields of view , determine the state of objects over time . The state includes not only objects’ attribute values, but also the total number of objects that existed at each epoch, and implicitly when objects were added and removed (if at all).

There is no definitive information in the observations that will allow us to know which particular observations correspond with which underlying objects in the world, or even how many objects were in existence at any time step. For example, in the views of shown in Figure 1, the square detected in the left-most view may correspond to either (or neither) square in the center view. Also, despite there being only four objects in the world, there were five observations because of overlapping visible regions.

The critical piece of information that is missing is the association of an observation to an underlying object . With this information, we can perform statistical aggregation of the observations assigned to the same object to recover its state. We will model the associations as latent variables in a Bayesian inference process.

2.1 Observation noise model

The observation model describes how likely it is that an observation was generated from some given object state (if any), given by the probability . For a single object, let and be the true continuous and discrete attribute values respectively, and likewise and for a single observation of the object. We typically consider observation noise models of the following form:



represents a discrete confusion matrix, where is the probability of observing given the true object has discrete attributes . The continuous-valued observation is the true value corrupted with zero-mean Gaussian noise, with fixed sensing covariance . The noise terms on and are assumed to be independent for simplicity.

Besides errors in attribute values, Figure 1 also illustrates cases of false positives and false negatives. A false positive occurs when the observation did not originate from any true object. We assume that this occurs at a fixed rate , depending on the perception system. When this occurs, has noise distribution , and is uniformly distributed over the field of view . A false negative occurs when an object is within the sensor’s field of view but failed to be detected. We assume that an object within the field of view will be undetected with an attribute-dependent probability .
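As a rough illustration of the noise model in this section, the following sketch evaluates the log-likelihood of one observation under a confusion matrix plus independent Gaussian noise, and the log-likelihood of a spurious detection. It is not the authors' implementation: the function and argument names are hypothetical, the sensing covariance is assumed to be diagonal, and the field of view is summarized by a single area (or volume) value.

```python
import math

def gaussian_logpdf_diag(obs, mean, var):
    """Log-density of a Gaussian with diagonal covariance (assumed for simplicity)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (o - m) ** 2 / v)
        for o, m, v in zip(obs, mean, var)
    )

def observation_loglik(obs_type, obs_pos, true_type, true_pos, confusion, sensing_var):
    """Log-likelihood of an observation given the true object state.

    confusion[true_type][obs_type] is the probability of observing obs_type
    when the object's true discrete attribute is true_type. The discrete and
    continuous noise terms are independent, so their log-probabilities add.
    """
    log_discrete = math.log(confusion[true_type][obs_type])
    log_continuous = gaussian_logpdf_diag(obs_pos, true_pos, sensing_var)
    return log_discrete + log_continuous

def false_positive_loglik(obs_type, fp_type_dist, view_area):
    """Log-likelihood of a spurious detection: a type drawn from a false-positive
    noise distribution and a location uniform over the field of view."""
    return math.log(fp_type_dist[obs_type]) - math.log(view_area)
```

Working in log space keeps products of many small probabilities numerically stable when the terms are later combined across observations.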

2.2 Additional assumption: Cannot-link constraint (CLC)

Finally, there is an additional common domain assumption in target-tracking problems that is essential: within a single view, each visible object can generate at most one detection (Bar-Shalom and Fortmann, 1988). This implies that within a view, each observation must be assigned to a different hypothesized underlying object. Adopting the terminology of clustering, we refer to this as a “cannot-link constraint” (CLC). The constraint is powerful because it can reduce ambiguities when there are similar nearby objects. However, clustering algorithms typically cannot handle such constraints, and similar to the DPMM-based data association work of Wong et al. (2015), we will need to modify the DDP model and inference algorithms to handle the world modeling problem.

3 A Clustering-Based Approach

We now specify a prior on how likely an assignment to a cluster is, and how clusters change over time. Since the number of clusters is unknown, we choose to use Bayesian nonparametric mixture models, which allow for an indefinite and unbounded number of mixture components (although the number of instantiated components is limited by the data size).

The Dirichlet process (DP) (Teh (2010) provides a good overview), and its application to mixture modeling (Antoniak, 1974; Neal, 2000), is a widely-studied prior for density estimation and clustering. The DP’s popularity stems from its simplicity and elegance. However, one major limitation is that clusters cannot change over time, a consequence of the fact that observations are assumed to be fully exchangeable. This assumption is violated for problems like ours, where the observed entities change over time and space. Indeed, the previous application of DPs to world modeling mentioned above required that the world is static, which is a significant limitation. Various generalizations of the DP that model temporal dynamics have thus been proposed (Zhu et al., 2005; Ahmed and Xing, 2008).

Many of these generalizations belong to a broad class of stochastic models known as dependent Dirichlet processes (DDP) (MacEachern, 1999, 2000). We will adopt a theoretically-appealing instance of the DDP, based on a recently-proposed Poisson-process construction (Lin et al., 2010; Lin, 2012). This construction subsumes a number of existing algorithmically-motivated DP generalizations. Additionally, Lin’s construction has the nice property that at each time slice, the prior over clusters is marginally a DP. Given a DP prior at time , the construction specifies a dependent prior at time (or another future time), which is shown to also be a DP. The construction therefore generates a Markov chain of DPs over time, which reflects temporal dynamics between epochs in our problem.

We now state one result of the DDP construction; see Appendix A, and Lin (2012) for details. The construction results in the following prior on parameter (to be assigned to a new observation), given past parameters and parameters corresponding to clusters that have already been instantiated at the current epoch:


At the initial time step, clusters are formed as in a standard DPMM with concentration parameter and base distribution . For later time steps, the prior distribution on is defined recursively. The first two terms are similar to the base case, for new clusters and already-instantiated clusters (in the current epoch) respectively. The third term corresponds to previously-existing clusters that may be removed with probability , and, if it survives, is moved with transition probability . is the number of points that have been assigned to cluster , for all time steps up to time . This term is similar to that in the DP. Note that if and , then the model is static, and Equation 2 is equivalent to the predictive distribution in the DP.
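The three-term structure described above can be sketched as a set of mixing weights. This is only an illustration under stated assumptions, not the construction's exact formula: the names are hypothetical, each instantiated cluster is weighted by its cumulative count, each previously-existing (not yet instantiated) cluster is weighted by its survival probability times its cumulative count, and the transition density that moves a surviving cluster's parameter is left out.

```python
def ddp_prior_weights(alpha, instantiated_counts, dormant_counts, survival_prob):
    """Normalized prior weights for assigning a new observation.

    alpha               -- concentration parameter (weight of a new cluster)
    instantiated_counts -- {cluster: count} for clusters already used this epoch
    dormant_counts      -- {cluster: count} for clusters seen only in past epochs
    survival_prob       -- {cluster: probability that a dormant cluster survived}
    """
    weights = {"new": float(alpha)}
    for k, n in instantiated_counts.items():
        weights[("instantiated", k)] = float(n)
    for k, n in dormant_counts.items():
        # A dormant cluster contributes only if it survived the epoch transition.
        weights[("dormant", k)] = survival_prob[k] * n
    total = sum(weights.values())
    return {key: w / total for key, w in weights.items()}
```

With no dormant clusters, the weights reduce to the familiar Chinese-restaurant-process form of the static DP: a new cluster has weight proportional to the concentration parameter, and each existing cluster has weight proportional to its count.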

3.1 Inference by forward sampling

As mentioned in the problem definition, our focus will be on determining latent assignments of observations to clusters with parameters . In the generic DDP, views do not exist yet; those will be introduced in Section 4. One way to explore the distribution of assignments is to sample repeatedly from the assignment’s conditional distribution, given all other assignments :


The first term in the integrand is given by the observation noise model (Equation 1), and the second term is given by the DDP prior (Equation 2). If already exists, then , and the integrand only has support for . Otherwise, we have to consider all possible settings of , which has a prior distribution given by Equation 2. The expression in Equation 3.1 above can be decomposed into three cases, corresponding to terms in Equation 2:


In the DDPMM, clusters move around the parameter space during their lifetimes, and, depending on our chosen viewpoints, may not generate observations at some epochs. When cluster has at least one time- observation assigned to it, it becomes instantiated at time . Any subsequent observations at time that are assigned to cluster must then share the same parameter ; this corresponds to the first case. The second case is for clusters not yet instantiated at time , and we must infer from the last known parameter for cluster , at time . If , we use generalized survival and transition expressions for our application:


The third case is for new clusters that are added at time . The first and third cases essentially have the same form as the Gibbs sampler for the (static) DP .

In general, since the cluster parameters are also unknown, inference schemes need to alternate between sampling the cluster assignments (given parameters) as above, and sampling the parameters given the cluster assignments. The conditional distribution of each cluster’s parameters (for each cluster , a sequence of parameters) can be found using Bayes’ rule:


Depending on the choice of parameter priors and observation functions, the resulting conditional distributions can potentially be complicated to represent and difficult to sample from. With additional assumptions that will be presented next, we can find the parameter posterior distribution efficiently and avoid sampling the parameter entirely by “collapsing” it.

3.2 Application of DDPs to world modeling

We now apply the DDP mixture model (DDPMM) to our semi-static world modeling problem. For concreteness and simplicity, we consider an instance of the world modeling problem where the fixed attribute is the discrete object type (from a finite list of known types), and the dynamic attribute is the continuous pose in (either 3-D location or 6-D pose). Despite these restrictions, our model and derivations below can be immediately applied to problems with any fixed attributes, and with any dynamic continuous attributes with linear-Gaussian dynamics. Arbitrary dynamic attributes can be represented in our model, but inference will likely be more challenging because in general we will not obtain closed-form expressions.

For our instance of the DDPMM, we assume:

  • Time steps in the DDP correspond to epochs in world modeling. This implies that each epoch is modeled as a static DPMM, similar to the problem in Wong et al. (2015).

  • The survival rate only depends on the fixed attribute, i.e., . (For us, that means the likelihood of object removal is dependent on the object type but not its pose.)

  • Likewise, the detection probability only depends on the fixed attribute, i.e., .

  • The dynamic attribute (pose) follows a random walk with zero-mean Gaussian noise that depends on (e.g., a mug likely travels farther per epoch than a table):


    This implies that the full transition distribution (of both object type and pose) is:

  • At each epoch, the DP base distribution has the following form:


    Here a (discrete) prior over the object type, and a normal distribution over the object pose. The initial covariance is large, in order to give reasonable likelihood of an object being introduced at any location. In fact, we will set and , representing a noninformative prior over the location. Details can be found in Appendix B.

The above choices for the dynamics and base distribution imply that the parameter posterior and predictive distributions have closed-form expressions. The posterior distribution of the dynamic attribute is a mixture of Gaussians, with a component for each possible value of the fixed attribute (since the process noise may be different), weighted by the posterior probability of . In practice, we track the pose using only the dynamics of the most-likely object type. Thus, in our application, each cluster will maintain a discrete posterior distribution for the object type, and a single Kalman filter / Rauch-Tung-Striebel (RTS) smoother for the object pose distribution. The latter is represented as a sequence of means and covariances over the cluster’s lifetime , with the interpretation that .
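To make the per-cluster pose tracker concrete, here is a minimal sketch of the forward (filtering) pass of a Kalman filter for a random walk, written for a single scalar coordinate. The function names are hypothetical, the RTS backward smoothing pass is omitted, and a real pose would use vector means and covariance matrices rather than scalars.

```python
def predict(mean, var, process_var):
    """Epoch transition under a random walk: the mean is unchanged and the
    variance grows by the (type-dependent) process noise."""
    return mean, var + process_var

def update(mean, var, obs, sensing_var):
    """Fold in one observation with Gaussian sensing noise."""
    gain = var / (var + sensing_var)          # Kalman gain
    new_mean = mean + gain * (obs - mean)     # move toward the observation
    new_var = (1.0 - gain) * var              # uncertainty shrinks
    return new_mean, new_var
```

Within a static epoch only `update` is applied, once per observation assigned to the cluster; `predict` is applied once at each epoch boundary, which is where the uncertainty about an object's pose increases.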

As mentioned previously, because we have compact representations of the parameter posterior distributions, we can analytically integrate them out instead of sampling them. We first modify the forward sampling equation (Equation 4) to reflect this “collapsing” operation. Since we can no longer condition on the parameters themselves, we instead need to condition on the other observations and their current cluster assignments , and use posterior predictive likelihoods of the form to evaluate the current observation :


We can now substitute the expressions for , , and , where properties of the normal distribution will help us evaluate the integrals. The derivations in Appendix B give the following expressions, as well as details for finding the posterior hyperparameters , , and (recall ):


In the second case, for tractability in filtering, we have assumed that a cluster’s dynamics behave according to its most-likely type ; otherwise, the posterior is a mixture of Gaussians (over all possible transition densities). Also, the third case contains an approximation to avoid evaluating an improper probability density; see Appendix B for details.

4 Incorporating World Modeling Constraints

So far, we have only applied a generic DDPMM to our observations, but have ignored the cannot-link constraint, as well as false positives and negatives. We now present modifications to the Gibbs sampler to handle these constraints; the modifications are similar to those from the static case in Wong et al. (2015).

The cannot-link constraint (see Section 2.2; referred to as “one measurement per object” (OMPO) in Wong et al. (2015)) couples together cluster assignments for observations within the same view, since we must ensure that no two observations can be assigned to the same existing cluster. For each view, all cluster assignments must be considered together as a joint correspondence vector, and the probability of choosing one such correspondence is proportional to the product of the individual cluster assignment probabilities given in Equation 3.2. Invalid correspondence vectors that violate the cannot-link constraint are assigned zero probability and hence are not considered; the remaining conditional probabilities are normalized. This can be interpreted as performing blocked Gibbs sampling, where blocks are determined by the joint constraints:


The correspondence vector is again the concatenation of the individual assignment variables, for all observation indices made in view at epoch ; the interpretation of is similar. The individual terms in the product are given by Equation 3.2 (with the appropriate case depending on the value of ), except now all observations within the same view are excluded (since their assignments are being sampled together) – instead of , and likewise for assignments and counts .
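The blocked conditional described above can be sketched by brute-force enumeration: form every joint correspondence vector for a view, drop those that assign two observations to the same existing cluster, and normalize the rest. The names below are hypothetical, the per-observation weights are assumed to be already computed, and the enumeration is exponential in the number of observations, so this is only suitable for very small views.

```python
from itertools import product

def constrained_correspondence_probs(weights, shared_labels=("new", "fp")):
    """Distribution over correspondence vectors that satisfy the cannot-link constraint.

    weights[i][label] is the unnormalized probability of assigning observation i
    to `label`. Labels in `shared_labels` (new cluster, false positive) may be
    used by several observations; any other label is an existing cluster and
    may be used at most once per view.
    """
    labels = [list(w.keys()) for w in weights]
    scores = {}
    for vec in product(*labels):
        existing = [c for c in vec if c not in shared_labels]
        if len(existing) != len(set(existing)):
            continue  # two observations share an existing cluster: invalid
        score = 1.0
        for i, c in enumerate(vec):
            score *= weights[i][c]
        scores[vec] = score
    total = sum(scores.values())
    return {vec: s / total for vec, s in scores.items()}
```

Sampling a key of the returned dictionary in proportion to its value corresponds to one blocked Gibbs step for that view.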

For false positives, we essentially treat them as a special “cluster” that has no underlying parameter. Instead, we assume that if an observation is generated from a false positive, it is generated from some spurious parameter drawn from the base distribution , so the likelihood term is the same as that for drawing a new cluster. Like the other cases, we also multiply the likelihood by the number of points already assigned to the cluster, i.e., the number of false positives except for those in the current view. If there are currently no other false positives, then we multiply by the concentration parameter instead to ensure that it is always feasible to assign observations to the false positive “cluster”. Also, to incorporate the assumption that false positives are generated with a fixed rate , we attach a Bernoulli probability to each case in the Gibbs sampler. The false positive conditional probability is multiplied by , and all other cases are multiplied by . In summary, the conditional probability of an observation being a false positive () is:


The normalizer depends on the other cases in Equation 3.2 (with additional factors).

Finally, for false negatives, recall that an object that is within the field of view fails to be detected with type-dependent probability . Let be if cluster is detected in view at epoch , and otherwise. For a cluster that is alive at epoch () with parameter , the probability of detection is therefore:


The function denotes the CDF of the multivariate normal distribution, with mean and covariance . For a particular view , we only evaluate the above detection probability on clusters that are currently alive at epoch . For each such cluster, there is a corresponding detection indicator variable, whose value is determined during sampling by the candidate joint correspondence vector : if some element of is assigned to cluster index , then ; otherwise, . The detection probability for the correspondence vector is:


Putting everything together, we arrive at a constrained blocked collapsed Gibbs sampling inference algorithm. The algorithm takes the observations and visible regions as input. As output, the algorithm produces samples from the posterior distribution over correspondence vectors , from which we can compute the posterior parameter distributions and . The sampling algorithm repeatedly iterates over epochs and views , each time sampling a new correspondence vector from its constrained conditional distribution:


The probability terms in the first and third lines can be found in Equations 3.2 and 14 respectively.

As in the static case, after incorporating the world modeling constraints, inference becomes inefficient because we now have to compute conditional probabilities for (and sample from) the joint space of correspondence vectors, which in general is exponential in the number of observations in a view. Using the same insights and ideas as in Wong et al. (2015), however, we can adaptively factor the correspondence vector by initially decoupling all assignment variables, then coupling only those that violate the cannot-link constraint.

5 Approximate Maximum a Posteriori (MAP) inference

We have now presented the entire Gibbs sampling algorithm for DDPMM-based world modeling. However, sampling-based inference can be slow, especially because of the cannot-link constraint that couples together many latent variables, even if adaptive factoring is used. Although we are interested in maintaining an estimate of our uncertainty in the world, frequently just having the most-likely (maximum a posteriori – MAP) world state suffices. In general, even the MAP world model is hard to find (Bar-Shalom and Fortmann, 1988), and many approximate solutions have been proposed.

In the static case, Wong et al. (2015) adapted a hard-clustering algorithm, DP-means, and empirically found that it returned good clustering assignments for some hyperparameter settings. A similar analysis via small-variance asymptotics was performed recently for DDPs, where the mixture components were Gaussian distributions with isotropic noise, resulting in the Dynamic Means algorithm (Campbell et al., 2013). However, there is no simple and principled way to incorporate the additional information from Section 4. Additionally, even without such modifications, the Dynamic Means algorithm requires three free hyperparameters to be specified, which may be significantly harder to tune than the one in DP-means. Instead, we will use a much older idea that does not involve asymptotics, can incorporate all the world-modeling information and constraints, and produces a local optimization algorithm that is similar in spirit to Dynamic Means.

5.1 Iterated conditional modes (ICM)

The iterated conditional modes (ICM) algorithm performs coordinate ascent on each variable’s conditional distribution, and is guaranteed to converge to a local maximum (Besag, 1986). In particular, instead of iteratively sampling correspondence vectors from their conditional distributions in Gibbs sampling, we find the most-likely one, update parameters based on it, and repeat for each view. Since we are still dealing with the joint space of assignments for all observations in a given view, finding the maximizer still potentially requires searching through a combinatorial space. Fortunately, finding the most-likely correspondence can be formulated as a maximum weighted assignment problem, for which efficient algorithms such as the Hungarian algorithm exist (and have been previously used in data association).

Suppose, for view at epoch , there are observations and existing clusters (possibly not alive/instantiated). Then we wish to match each to an existing cluster, a new cluster, or a false positive. Any unmatched existing cluster must also be assigned the probability of missed detection. We can solve this as an assignment problem with the following payoff matrix:

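                Obs ()                                            FN ()
Clusters ()     log-probability, cases 1 and 2 of Equation 3.2    log-probability of missed detection
New ()          log-probability, case 3 of Equation 3.2           0
FP ()           log-probability, Equation 13                      0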

The payoff matrix has entries (indicated in parentheses), to allow for the case that all observations are assigned to new clusters, and likewise that all are spurious. Any extra New/FP nodes are assigned to extra FN nodes, with zero payoff. The payoffs in the first column are: for an existing cluster, given by cases 1 and 2 in Equation 3.2, depending on whether or not the cluster has been instantiated yet; for a new cluster, given by case 3 in Equation 3.2; and for a false positive, given by Equation 13. Note that log probabilities are used to decompose the view’s joint correspondence probability into a sum of individual terms. By construction, the cannot-link constraint is satisfied.
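The assignment formulation can be illustrated with a small sketch that builds a square payoff matrix of log-probabilities and finds the best matching. Several things here are assumptions rather than the paper's specification: the names are hypothetical, the matrix is padded with one "new" row and one "FP" row per observation and enough "FN" columns to make it square, and the search is a brute-force scan over permutations standing in for the Hungarian algorithm, so it is only usable for a handful of nodes.

```python
import math
from itertools import permutations

def build_payoff(cluster_obs, new_obs, fp_obs, miss):
    """Square payoff matrix. Rows: K clusters, M 'new' slots, M 'FP' slots.
    Columns: M observations, then K + M 'FN' slots.

    cluster_obs[k][i] -- log-payoff of matching cluster k to observation i
    new_obs[i], fp_obs[i] -- log-payoff of observation i being new / spurious
    miss[k] -- log-probability that cluster k goes undetected
    """
    K, M = len(cluster_obs), len(new_obs)
    rows = [list(cluster_obs[k]) + [miss[k]] * (K + M) for k in range(K)]
    for slot_payoffs in (new_obs, fp_obs):
        for _ in range(M):
            # Unused New/FP slots fall through to an FN column at zero payoff.
            rows.append(list(slot_payoffs) + [0.0] * (K + M))
    return rows

def best_assignment(payoff):
    """Brute-force maximum weighted assignment: column chosen for each row."""
    n = len(payoff)
    return max(permutations(range(n)),
               key=lambda cols: sum(payoff[r][cols[r]] for r in range(n)))

def decode(cols, K, M):
    """Label each observation with a cluster index, 'new', or 'fp'."""
    labels = [None] * M
    for row, col in enumerate(cols):
        if col < M:
            labels[col] = row if row < K else ("new" if row < K + M else "fp")
    return labels

# One existing cluster, two observations.
payoff = build_payoff(
    cluster_obs=[[math.log(0.8), math.log(0.1)]],
    new_obs=[math.log(0.05), math.log(0.3)],
    fp_obs=[math.log(0.01), math.log(0.02)],
    miss=[math.log(0.1)],
)
labels = decode(best_assignment(payoff), K=1, M=2)
```

Because every column is matched to exactly one row, no existing cluster can receive two observations, which is how the matrix construction enforces the cannot-link constraint. For realistic problem sizes a polynomial-time solver would replace the permutation scan.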

5.2 A two-stage inference scheme

Input:  Observations; visible regions; number of samples
Output:  Samples of cluster assignments
1:  Init. all entries to (FP) in
2:  for  to  do
3:      (see Oh et al. (2009))
5:     Sample
6:     if  then
8:     else
(a) MCMCDA (Oh et al., 2009) for DDPMM
Input:  Observations; visible regions; number of samples
Output:  Samples of cluster assignments
1:  Init. all entries to (FP) in
2:  repeat
3:     for  to ; to  do
4:         Solve ICM weighted assignment problem for most-likely , given
5:  until convergence
6:  Construct new dataset with a single data point for each non-FP cluster found by ICM (at the same epoch)
7:  Sample tracks by performing MCMCDA on
8:  Convert track samples to cluster assignments
(b) Two-stage inference algorithm for DDPMM
Figure 2: Two algorithms for performing inference in DDPMMs, one by Metropolis-Hastings (MH) (Oh et al., 2009), the other a two-stage procedure involving ICM, followed by the MH procedure.

Although the ICM algorithm presented can find good clusters at a single epoch very quickly, we will see in experiments that it does not converge to good cluster trajectories. The issue is that ICM moves are local, in that each considers one view at a time. Suppose we have correctly identified all objects in epoch using ICM. When we consider the first view in epoch , there may be significant changes present, and using the first view only, ICM decides whether or not to assign the new observations to existing clusters (by reviving them from the previous epoch). Since the uncertainty in the object states immediately after a transition is high, basing the cluster connectivity decisions on a single view is unreliable.

This suggests a two-level inference scheme. Since ICM can reliably find good clusters within single epochs, we first apply ICM to each epoch’s data independently, treating them as unrelated static worlds. Next, we attempt to connect clusters between different epochs. This is essentially another tracking problem, although the likelihood function is somewhat different (depends on many underlying data points), and is much reduced in size. Since the problem is significantly smaller, traditional tracking methods such as MHT can be applied to this cluster-level tracking problem.

We present one such scheme in Algorithm 1(b), using MCMCDA (Algorithm 1(a); Oh et al. (2009)) to solve the cluster-level problem. We choose a batch-mode sampling algorithm such as MCMCDA because it can return samples from the posterior distribution, and has an attractive anytime property – we can terminate at any point and still return a list of valid samples. For inferring the MAP configuration, the best sample can be returned instead. Since we are sampling from the true posterior distribution, in the limit of infinite samples, the true MAP configuration will be found almost surely.

To apply MCMCDA, we need to evaluate the likelihood of a complete configuration , encompassing all epochs and views (line 4 in Algorithm 1(a)). To do so, we first find the posterior parameter distributions for the clusters/objects (as given by ) using Appendix B, then combine the observation likelihoods (Equation 36), as well as the false positive and false negative priors:


6 Experiments

Approximate MAP inference for world modeling via ICM, MCMCDA, and the two-stage algorithm ICM-MCMC was tested on a simulated domain, and also on a sequence of real robot vision data constructed from the static scenes in Wong et al. (2015). To perform MAP inference with MCMCDA and ICM-MCMC, the most-likely sample (as scored by Equation 5.2) was chosen, from samples in MCMCDA, and in the second stage of ICM-MCMC. In both experiments, ICM-MCMC significantly outperforms the other two methods, and even ICM performs better than MCMCDA.

(a) True object trajectories
(b) Data (from all epochs)
Figure 3: Data and object states in a simulated domain. The top left shows the true object locations and their trajectories over time, color-coded by their associated object type. Observations are shown as filled dots (corresponding to true positives) and crosses (false positives). The top right shows the data from all epochs ( views per epoch) that is given as input, without any information about the underlying object states and associations. Some form of clustering over views and time is visible. A more realistic view of the data is shown in the bottom row, for a sequence of epochs.

6.1 Simulation

Objects in our simulated domain had one of four fixed object types, a time-evolving location , and a time-evolving velocity vector. Observations were made in epochs of this domain, with views per epoch (visible region is the entire domain). In total, objects existed, each for some contiguous sub-interval of the elapsed time. Within each view, the number of false positives was generated from , and the probability of a missed detection was . The correct object type was observed with probability , with equal likelihood () of being confused with the other object types. Locations were observed with isotropic Gaussian noise, standard deviation . The object’s velocity vector was maintained from the previous time step, with added Gaussian noise, standard deviation . Between epochs, the probability of survival was . The observed data (i.e., the algorithm input) and the true object states are shown in Figure 3.
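The observation process described above can be sketched as a small generator for a single view. All numeric defaults below are placeholders chosen for illustration, not the settings used in the experiments; the Poisson form of the false-positive count, the unit-square domain, and the function names are likewise assumptions.

```python
import math
import random

NUM_TYPES = 4  # four fixed object types, as in the simulated domain


def sample_poisson(rate, rng):
    """Draw a Poisson variate by Knuth's multiplication method (small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_view(objects, rng, p_miss=0.1, p_correct_type=0.8,
                  loc_std=0.03, fp_rate=1.0, extent=1.0):
    """Generate one view: a list of (observed_type, (x, y)) tuples.

    objects: list of (true_type, (x, y)). All default values are
    placeholders chosen for illustration only.
    """
    view = []
    for true_type, (x, y) in objects:
        if rng.random() < p_miss:
            continue  # missed detection (false negative)
        if rng.random() < p_correct_type:
            obs_type = true_type
        else:
            # Confuse uniformly with one of the other object types.
            obs_type = rng.choice(
                [t for t in range(NUM_TYPES) if t != true_type])
        # Isotropic Gaussian location noise.
        view.append((obs_type,
                     (rng.gauss(x, loc_std), rng.gauss(y, loc_std))))
    # False positives: assumed Poisson count, uniform type and location.
    for _ in range(sample_poisson(fp_rate, rng)):
        view.append((rng.randrange(NUM_TYPES),
                     (rng.uniform(0, extent), rng.uniform(0, extent))))
    return view


rng = random.Random(1)
objects = [(0, (0.2, 0.3)), (2, (0.7, 0.6)), (3, (0.5, 0.9))]
view = simulate_view(objects, rng)
```

Calling `simulate_view` repeatedly on the same object list gives the multiple views of one epoch; the object list itself would change only between epochs.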

The resulting MAP clusters found by ICM, MCMCDA, and ICM-MCMC are shown in Figure 4, along with their log-likelihood values (higher / less negative is better). ICM-MCMC clearly outperforms the other methods, and finds essentially the same clusters as given by the true association. The clusters found generally have tight covariance values, unlike those in ICM and MCMCDA. These two methods, especially MCMCDA, tend to find many more clusters than are truly present.

(a) Truth
(b) ICM-MCMC (LL )
(c) ICM (LL )
(d) MCMC (LL )
Figure 4: The clusters found for the simulated domain are shown in thick ellipses, centered at the location mean, color-coded by the most-likely object type inferred (across the entire trajectory, since it is a static attribute). The ellipses depict a level set of the posterior location distribution (uncertainty given by Gaussian covariance matrix). The posterior clusters derived from the true association are shown in the top left; those found by ICM-MCMC are essentially identical (with a minor difference in the green track). In contrast, the posterior clusters found by ICM and the most-likely sample from MCMC (of ), shown in the bottom row, are qualitatively very different, and have significantly lower log-likelihood (LL) values.

6.2 Using robot data from static scenes

We also applied the same algorithms to the static robot vision data that were used in Wong et al. (2015) to evaluate DPMM methods. To convert static scenes into dynamic scenes, we chose static scenes that were reasonably similar, and simply concatenated their data together, as if each scene corresponded to a different epoch. One such example is shown in Figure 5.

Objects in different scenes were all placed on the same tabletop of dimensions ; all data were placed in the table’s frame of reference. Four object types were present, and typically each scene had objects. Unlike the previous simulation, we do not assume objects have velocities; between epochs, we assume that the location changes with isotropic Gaussian noise, standard deviation . Since changes were significant between epochs, we assumed a relatively low probability of survival. Object locations are sensed with Gaussian noise, standard deviation ; the object type noise model and probability of detection are the same as before. The probability of false positives is much lower for this domain; we assumed the number of false positives had a distribution.

Figure 5 shows the MAP associations found by ICM and ICM-MCMC, with lines connecting cluster states over epochs. Annotations were also added (in the form of three different line styles) to facilitate comparison between the ICM and ICM-MCMC results; see figure caption for details. ICM tends to suggest many more transitions than ICM-MCMC, many of which are actually implausible.

(a) ICM transitions not present in ICM-MCMC
(b) Most-likely ICM configuration (LL )
(c) Most-likely ICM-MCMC conf (LL )
(d) ICM-MCMC transitions not present in ICM
Figure 5: Approximate MAP cluster (object) trajectories found using ICM and ICM-MCMC on the robot vision data collection in Wong et al. (2015). The concatenated sequence of scenes (epochs) is shown from top to bottom. The inferred clusters and tracks are shown in the middle two columns. Lines connecting cluster pairs between epochs are color-coded by the inferred object type (fixed across epochs), and are marked by one of three line styles used to compare results from the two algorithms. A solid line means the same pair was connected by both algorithms; a dashed line means a similar pair (in likelihood) was connected; a dotted line means the pair was not connected by the other algorithm. To make the differences clearer, the top-down reference views have been annotated with arrows, for pairs of objects that were only connected by one algorithm (dotted lines in the middle two). The left column shows pairs that were connected by ICM but not ICM-MCMC; the right column shows the opposite. Solid arrows depict transitions that are unlikely, whereas dashed arrows depict plausible transitions. ICM tends to suggest many more transitions than ICM-MCMC, many of which are actually implausible.

Appendix A Background on dependent Dirichlet processes

Lin et al. (2010) exploited the fact that there exists a one-to-one correspondence between DPs over space and spatial Poisson processes in the product space . This means that an underlying Poisson process can be extracted from any DP, and vice versa. By considering transitions on the underlying Poisson processes, and restricting to transition steps where the Poisson process remains closed under transition (more fundamentally, by preserving complete randomness), we obtain a new spatial Poisson process at the next time step, which can be converted back to a new DP.

According to the stick-breaking construction of the DP (Sethuraman, 1994), if , then it can be expressed as an infinite sum of weighted atoms: , where , and . Then the following DP-preserving transition steps are applied in order:

  • Subsampling (removal): Let be a parameter-dependent survival rate, i.e., specifies how likely some in the current time step survives in the next time step. For each atom , draw , and retain atoms with . Renormalizing the weights on the retained atoms gives a new DP (where ).

  • Point transition (movement): Let be a parameter-dependent transition function, i.e., specifies how likely some in the current time step moves to in the next time step, given that it survives. For each atom , draw . Then is a new DP.

  • Superposition (addition): Let be a new independent DP, and let , where and are the concentration parameters of and respectively. Then the random convex combination is a DP, and acts as the prior for the next time step.
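For reference, the three steps above can be written compactly as follows. The symbols used here ($D$, $\pi_k$, $\theta_k$, $q$, $T$, $c$, and the primed quantities) are assumed for illustration and may differ from the notation of the main text; $\alpha''$ and $\alpha_{\mathrm{new}}$ denote the concentration parameters of $D''$ and of the new independent DP $D_{\mathrm{new}}$.

```latex
\begin{align*}
D &= \sum_{k=1}^{\infty} \pi_k \, \delta_{\theta_k},
  \qquad D \sim \mathrm{DP}(\alpha, H) \\
\text{Subsampling:}\quad
  & b_k \sim \mathrm{Bernoulli}\bigl(q(\theta_k)\bigr), \qquad
  D' = \sum_{k:\, b_k = 1} \pi'_k \, \delta_{\theta_k}, \qquad
  \pi'_k = \frac{\pi_k}{\sum_{j:\, b_j = 1} \pi_j} \\
\text{Point transition:}\quad
  & \theta'_k \sim T(\theta_k, \cdot\,), \qquad
  D'' = \sum_{k:\, b_k = 1} \pi'_k \, \delta_{\theta'_k} \\
\text{Superposition:}\quad
  & c \sim \mathrm{Beta}(\alpha'', \alpha_{\mathrm{new}}), \qquad
  D_{\mathrm{next}} = c \, D'' + (1 - c) \, D_{\mathrm{new}}
\end{align*}
```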

The upshot of this DDP construction is that, if we marginalize out the DP prior, we get the following prior for , given the parameters from the previous time :


The first term is for new atoms, drawn from a DP with base distribution and concentration parameter . (Technically, includes both the innovation process from the superposition step, as well as a subsampled and transitioned version of innovation processes from previous times; see Lin (2012).) The second term corresponds to existing atoms that have undergone subsampling and transition steps; these steps affect the assignment probability, as indicated by the presence of and . Additionally, is the number of points that have been assigned to cluster , for all time steps up to time . This term is similar to that in the DP. Notice that if and , then we recover exactly the predictive distribution in the DP.
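The reduction to the DP predictive distribution can be checked numerically on the mixing weights alone. The sketch below is a deliberate simplification: it ignores the transition density and assumes the prior over "new cluster" versus "existing cluster k" is proportional to the concentration parameter and to the survival-weighted counts, respectively. The function name and arguments are hypothetical.

```python
def assignment_prior(alpha, counts, survival):
    """Prior probabilities [new, cluster 0, cluster 1, ...].

    alpha: concentration parameter for new atoms.
    counts: number of points assigned to each existing cluster so far.
    survival: per-cluster survival probability in [0, 1].
    """
    weights = [alpha] + [q * n for q, n in zip(survival, counts)]
    total = sum(weights)
    return [w / total for w in weights]


# With survival 1 everywhere this is the usual DP (Chinese restaurant
# process) predictive rule: alpha / (alpha + n) and n_k / (alpha + n).
crp = assignment_prior(1.0, [3, 1], [1.0, 1.0])

# Lower survival shifts prior mass toward starting a new cluster.
decayed = assignment_prior(1.0, [3, 1], [0.5, 0.5])
```

In the full model each existing-cluster weight would additionally be multiplied by the transition density evaluated at the candidate parameter value, which this sketch omits.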

Since , and is a DP, we can find the predictive distribution of , conditioning also on parameters that have been instantiated at time :


In general, some atoms may not be observed for several time steps, but still affect the prior (with decayed weight and dispersed parameter values). Also, some clusters may already have been instantiated at the current time , either newly drawn from the innovation process, or transitioned from existing atoms. The general form of the prior on is:


The first two terms are similar to those in Equation 18 above, except the sum is only over clusters that have not been instantiated at time (). These existing clusters may be ‘revived’, but they are weighted by accumulated subsampling and transition terms, based on the previous time at which they were instantiated:


The third term in Equation 20 corresponds to atoms that have been instantiated at the current time . In this case, we know both that the atom survived and its current value, so and disappear. Also, the count now includes cluster assignments at the current time as well.

Appendix B Derivation of closed-form inference expressions for application of DDPs to world modeling (Section 3.2)

In this appendix, we derive closed-form expressions for the posterior and predictive distributions of the parameter , under the assumptions specified in Section 3.2.

The expressions for the fixed attribute are the same as in Wong et al. (2015), since it is static. For convenience, we reproduce the equations here. Given a set of observations :


Given a set of observations of the dynamic attributes, we can find the posterior distribution on by performing Kalman filtering and smoothing. Applying a generic Kalman filter to the world modeling problem gives the following recursive filtering equations for , the hyperparameters in the forward direction (during filtering):


Recall that is the covariance per time step of the random walk on , and is the covariance of the measurement noise distribution. The “^” variables are the predicted parameters before incorporating observations, and the “~” variables are the parameters after incorporating observations (i.e., the Kalman filter output). Since there may be multiple observations of the pose in a single epoch, we have used an equivalent formulation involving the sample means , by exploiting the fact that if each , then the sample mean has distribution . There may also be no observations at a given time, in which case the correction step has no effect ().
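The predict/correct structure described above can be sketched for a single coordinate with scalar variances, which is sufficient under isotropic noise. This is an illustrative sketch rather than the paper's exact equations: the function names, the `steps` argument for elapsed time steps, and the very large prior variance standing in for a noninformative prior are all assumptions.

```python
def predict(mean, var, process_var, steps=1):
    """Random-walk prediction: the mean is unchanged, variance grows."""
    return mean, var + steps * process_var


def correct(pred_mean, pred_var, observations, obs_var):
    """Fold all observations from one epoch in through their sample mean.

    The sample mean of n observations has variance obs_var / n. With no
    observations the prediction is returned unchanged.
    """
    n = len(observations)
    if n == 0:
        return pred_mean, pred_var
    sample_mean = sum(observations) / n
    mean_var = obs_var / n
    gain = pred_var / (pred_var + mean_var)
    post_mean = pred_mean + gain * (sample_mean - pred_mean)
    # Product form of (1 - gain) * pred_var; stable for huge pred_var.
    post_var = pred_var * mean_var / (pred_var + mean_var)
    return post_mean, post_var


# Very broad prior (assumed stand-in for a noninformative one).
mean, var = 0.0, 1e12
mean, var = correct(mean, var, [1.0, 1.2, 0.8], obs_var=0.09)
mean, var = predict(mean, var, process_var=0.01)
mean, var = correct(mean, var, [], obs_var=0.09)  # no data: unchanged
```

With the broad prior, the first correction returns approximately the sample mean with variance `obs_var / n`, which matches the behavior the noninformative initialization is meant to produce.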

The Kalman filter is initialized with a noninformative prior:


In practice, this implies that after the initial measurement(s) at time , . To see this, we can apply Equation B on :


To handle the infinite initial covariance, we interpret as . This leads to:


Hence . For the covariance: