## 1 Introduction

An autonomous agent that interacts with other agents needs to do more than simply perceive and respond to its environment. Eventually, agents will need to reason about all of the complexities inherent in the real world, including the beliefs, intents, and desires of other intentional agents. This capacity is known as theory of mind, and it is indispensable if we hope to one day create agents capable of empathy, “reading between the lines,” and interacting with humans as peers.

In this paper, we explore how theory of mind can be implemented using nested simulations in the form of probabilistic programs. We develop a scenario involving two agents, a chaser and a runner. The chaser seeks to intercept the runner and the runner seeks to reach a goal location without detection. However, the runner’s intended start location, goal location, and likely path to the goal are initially unknown to the chaser. We assume that the runner knows the current location of the chaser, but not where the chaser will move in the future. This results in a setting where both agents must reason about each other, and about how they reason about reasoning.

To simulate runner and chaser trajectories, we employ a variety of semi-realistic primitives, including path planners and visibility graphs. We formulate the models of the chaser and the runner as nested probabilistic programs, which are conditioned according to the desired behavior of the respective agents. The model of the chaser is conditioned to *maximize* the likelihood of detection, and the runner is conditioned to *minimize* the likelihood of detection. The result is a probabilistic model over possible chaser trajectories. At each point in time, the chaser imagines possible future trajectories, along with possible runner trajectories, and selects a move that has a high relative expected utility. This planning-as-inference formulation [toussaint06] is a natural fit for probabilistic programming, which makes it straightforward to incorporate complex deterministic primitives into both models, and to perform recursive Bayesian reasoning using the framework of nested importance sampling [naesseth2015nested].

We evaluate our models in a variety of scenarios and demonstrate that nested Bayesian reasoning leads to rational behaviors that maximize the respective utility at each level. Our experiments show that our formulation leads to improved runner detection rates relative to basic models, and that allocating additional computation to perform nested reasoning about agents results in lower-variance estimates of expected utility.

## 2 Background

### 2.1 Theory of Mind

Human children develop theory of mind during their early years, generally between the ages of three and six [wellman1990child, chater2006probabilistic]. bello2006developmental explore this phenomenon with a computational model that suggests that the underlying cognitive shifts required for the development of theory of mind may be smaller than previously supposed. goodman2006intuitive present a formal model that attempts to account for false belief in children, and later take the innovative approach of linking inference with causal reasoning [goodman2009cause]. Additionally, the same group explores language as a type of social cognition [goodman2013knowledge].

The development of theory of mind in machines leads naturally to interaction with their human counterparts. awais2010human, fern2007decision, and nguyen2012capir investigate collaboration between humans and robots in which the robot must determine the human’s (unobservable) goal. In a complementary line of research, sadigh2016information explore the idea of active information, in which the agent’s own behaviors become a tool for identifying a human’s internal state.

Fully-developed theory of mind requires the possibility of nested beliefs. koller1997effective present an inference algorithm for recursive stochastic programs. frith2005theory argue that theory of mind can be modeled using probabilistic programming, and demonstrate examples of nested conditioning with the probabilistic programming language, Church. zettlemoyer2009multi address filtering in environments with many agents and infinitely nested beliefs.

To our knowledge, our work is the first to model nested reasoning about agents in a time-dependent manner. Prior work by baker2009action develops a Bayesian framework for reasoning about preferences of individual agents based on observed time-dependent trajectories. Our work differs in that our environment is not discretized into a grid world, and as such represents a continuous action space. Work by stuhlmuller2014reasoning employed probabilistic programs to model nested reasoning about other agents. Relative to this work, our work differs in that agents update and act upon their beliefs of other agents in a time-dependent manner, whereas the work by stuhlmuller2014reasoning considers problems in which there is a single decision.

### 2.2 Probabilistic Program Inference

To represent our generative model cleanly and to perform inference in it, we employ the tools of probabilistic programming [vandemeent2018introduction]. This allows us to define probabilistic models that incorporate control flow, libraries of deterministic primitives, and data structures. A probabilistic program is a procedural model that, when run unconditionally, yields a sample from a prior distribution. Running probabilistic programs forward can be quite fast, and is limited only by the native speed of the interpreter for the language.

Inference in probabilistic programming involves reasoning about a target distribution that is conditioned by a likelihood, or more generally a notion of utility [vandemeent2018introduction]. Inference for probabilistic programs is difficult because of the flexibility that probabilistic programming languages provide: an inference algorithm must behave reasonably for any program a user wishes to write. Many probabilistic programming systems rely on Monte Carlo methods due to their generality [goodman08, milch05, pfeffer01, standevelopmentteam2014stan, venture]. Methods based on importance sampling and SMC have become particularly popular [murray2013, todeschini2014biips, wood-aistats-2014, goodman2014dippl, ge2016turing], owing to their simplicity and compositionality [naesseth2015nested].

For our purposes, the most important feature of probabilistic programming languages is that they allow us to freely mix deterministic and stochastic elements, resulting in tremendous modeling flexibility. This makes it relatively easy to describe, for example, distributions over Rapidly-Exploring Random Trees (RRTs), isovists, or even distributions that involve optimization problems as a subcomponent of the distribution.

## 3 Simulation Primitives

Although probabilistic programming has previously been used to model theory of mind [stuhlmuller2014reasoning], past implementations have thus far considered relatively simplistic problems involving a small number of decisions. In this paper, we not only model a setting in which agents must reason about future events, but also do so in a manner that involves reasoning about properties of the physical world around them. To enable this type of reasoning, we will employ a number of semi-realistic simulation primitives.

The environment. To search for and intercept the runner, the chaser requires a representation of the world that allows reasoning about starting locations, goals, plans, movement and visibility. We use a polygonal model designed around a known, fixed map of the city of Bremen, Germany [BremenPointCloud], shown in Fig. 1 (a).

Path planning and trajectory optimization. We model paths using an RRT [lavalle1998rapidly], a randomized path planning algorithm designed to handle nonholonomic constraints and high degrees of freedom. We leverage the random nature of the RRT to describe an entire distribution over possible paths: each generated RRT path can be viewed as a sample from the distribution of possible paths taken by a runner (see Fig. 1 (b)). RRTs naturally consider short paths as well as long paths to the goal location. To foreshadow a bit, note that because we will be performing inference over RRTs conditioned on not being detected, the runner will naturally tend to use paths that minimize the chance of detection, which are often, but not always, the shortest and most direct. Our RRTs are refined using a trajectory optimizer to eliminate bumps and wiggles.

Visibility and detection. Detection of the runner by the chaser is modeled using an isovist, a polygon representation of the chaser’s current range of sight [isovist79, morariu2007human]. Given a map, chaser location, and runner location, the isovist determines the likelihood that the runner was detected. Although an isovist usually uses a 360 degree view to describe all points visible to the chaser, we limit the range of sight to 45 degrees, and add direction to the chaser’s sight as seen in Fig. 1 (c). The direction of the chaser’s line of sight is determined by the imagined location of the runner.
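The directed, limited field of view described above can be illustrated with a small geometric test. This is only a sketch under stated assumptions: the function names, the representation of the map as a list of wall segments, and the reading of 45 degrees as the total width of the view cone are all hypothetical choices made for illustration, and a full isovist would compute the visible polygon rather than test a single point.

```python
import math

def segments_intersect(p1, p2, q1, q2):
    """Return True if segment p1-p2 properly crosses segment q1-q2."""
    def orient(a, b, c):
        # Sign of the cross product (b - a) x (c - a).
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    d1 = orient(q1, q2, p1)
    d2 = orient(q1, q2, p2)
    d3 = orient(p1, p2, q1)
    d4 = orient(p1, p2, q2)
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def is_visible(chaser, heading, runner, walls, fov_deg=45.0):
    """Directed line-of-sight test (hypothetical helper).

    chaser, runner: (x, y) points; heading: view direction in radians;
    walls: list of ((x1, y1), (x2, y2)) segments that block sight.
    Assumption: fov_deg is the total angular width of the view cone.
    """
    dx, dy = runner[0] - chaser[0], runner[1] - chaser[1]
    if dx == 0 and dy == 0:
        return True
    bearing = math.atan2(dy, dx)
    # Smallest signed angle between bearing and heading, in [-pi, pi).
    diff = (bearing - heading + math.pi) % (2 * math.pi) - math.pi
    if abs(diff) > math.radians(fov_deg) / 2:
        return False
    # The runner is inside the cone; check that no wall blocks the sight line.
    return not any(segments_intersect(chaser, runner, a, b) for a, b in walls)
```

Counting the time points at which such a test succeeds along a pair of trajectories gives one simple stand-in for the visibility count used to score trajectories.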

## 4 The Chaser-Runner Model

To model theory of mind, we will develop a nested probabilistic program in which a Chaser plans a trajectory by maximizing the probability of interception relative to imagined runner trajectories. The model for runner trajectories, in turn, assumes that the runner imagines chaser trajectories and avoids paths with a high probability of interception.

Our model has four levels: the episode model samples a sequence of moves by the chaser. Each move is sampled from the outermost model, which describes the beliefs of the chaser about the expected utility of moves. This model compares future chaser trajectories to possible runner trajectories and assigns higher probability to trajectories in which the runner is likely to be detected. The runner trajectories are in turn sampled from the middlemost model, which minimizes detection probability based on imagined chaser trajectories that are sampled from the innermost model. These three models work in tandem to create nuanced inferences about where the chaser believes the runner might be, and how it ought to counter-plan to maximize probability of detection.

Algorithm 1 shows pseudo-code for the Chaser-Runner model, formulated as nested probabilistic programs, which we refer to as queries. Together, these programs define a planning-as-inference problem [toussaint06] in which queries generate weighted samples, resulting in a nested importance sampling [naesseth2015nested] scheme that we describe in more detail below.

The episode model initializes the location of the chaser to a specified start location. At each subsequent time point, the model samples a weighted partial trajectory from the outermost chaser model. After the final iteration, the model returns the completed trajectory and the product of the incremental weights.

The outermost model describes the chaser’s plan for trajectories, given the chaser’s belief about possible runner trajectories. The chaser selects a goal location at random and uses the RRT planner to sample a possible future trajectory. Note that this trajectory is random, owing to the stochastic nature of the RRT algorithm. In order to evaluate the utility of this trajectory, the chaser imagines a possible runner trajectory by sampling from the middlemost runner model. The chaser then evaluates the utility of the trajectory by using an isovist representation to determine the number of time points during which the runner is visible to the chaser. The chaser then conditions the sampled trajectories by defining a weight based on this count. As we will discuss below, this corresponds to assigning a utility proportional to the number of visible time points in a planning-as-inference formulation. The model discards most of the imagined future trajectory, keeping only the next time point, and returns the resulting partial trajectory, together with a weight that reflects the utility of the chaser and the runner.

The middlemost model describes the chaser’s reasoning about possible runner trajectories. We assume that the chaser models a worst-case scenario where the runner is aware of the chaser’s location. This could be, for example, because the runner uses a police scanner to listen in on the chaser’s reported location. Moreover, we assume that at any point in time, the episode only continues when the chaser has not yet detected the runner. Finally, we assume that the runner will seek to avoid detection by imagining a chaser trajectory, and then selecting a trajectory that will not intersect that of the chaser. We implement these assumptions in the probabilistic program as follows. The runner model first selects a start location and goal location at random, and then samples a random trajectory using the RRT planner. The runner then imagines a future chaser trajectory by selecting a goal location at random and sampling from the innermost model. We then condition this sample by computing the total time of visibility, based on both the known past trajectory and the imagined future trajectory of the chaser. Finally, we assign a weight that corresponds to a negative utility (i.e. a cost) proportional to the total time of visibility in the planning-as-inference formulation.

The innermost model describes future chaser trajectories imagined by the runner. This model is the simplest of all the models in our nested formulation. Given the previous location of the chaser, the runner imagines a goal location at random and then uses the RRT planner to sample a random future trajectory. Since this model is not conditioned in any way, it returns weight 1.
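The four levels described in this section can be sketched as ordinary functions that each return a sample together with a weight. The sketch below is illustrative only and is not Algorithm 1: the landmark list, the horizon, the jittered straight-line `plan` function standing in for the RRT planner, and the distance threshold standing in for the isovist are all assumptions introduced here. It also draws a single sample per level and ignores the chaser's past trajectory, so it shows the nesting and the sign of each weight rather than the full inference scheme.

```python
import math
import random

# Hypothetical map landmarks, horizon, and sight radius (assumptions for illustration).
LOCATIONS = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
HORIZON = 10
SIGHT_RADIUS = 2.0

def plan(start, goal, rng):
    """Stand-in for the RRT planner: a jittered straight line from start to goal."""
    path = []
    for t in range(HORIZON):
        a = t / (HORIZON - 1)
        x = (1 - a) * start[0] + a * goal[0] + rng.gauss(0.0, 0.3)
        y = (1 - a) * start[1] + a * goal[1] + rng.gauss(0.0, 0.3)
        path.append((x, y))
    return path

def visible_steps(chaser_path, runner_path):
    """Stand-in for the isovist: count time points with the runner within sight."""
    return sum(math.dist(c, r) <= SIGHT_RADIUS for c, r in zip(chaser_path, runner_path))

def innermost(chaser_loc, rng):
    """Naive chaser imagined by the runner: unconditioned, so the weight is 1."""
    return plan(chaser_loc, rng.choice(LOCATIONS), rng), 1.0

def middlemost(chaser_loc, rng):
    """Runner imagined by the chaser: time spent visible acts as a cost."""
    start, goal = rng.sample(LOCATIONS, 2)
    runner_path = plan(start, goal, rng)
    imagined_chaser, _ = innermost(chaser_loc, rng)
    return runner_path, math.exp(-visible_steps(imagined_chaser, runner_path))

def outermost(chaser_loc, rng):
    """Chaser: time points at which the imagined runner is visible act as a reward."""
    chaser_path = plan(chaser_loc, rng.choice(LOCATIONS), rng)
    runner_path, runner_weight = middlemost(chaser_loc, rng)
    weight = math.exp(visible_steps(chaser_path, runner_path)) * runner_weight
    return chaser_path[1], weight  # keep only the next time point

def episode(start, steps, seed=0):
    """Sample a sequence of chaser moves and accumulate the incremental weights."""
    rng = random.Random(seed)
    trajectory, total_weight = [start], 1.0
    for _ in range(steps):
        next_loc, w = outermost(trajectory[-1], rng)
        trajectory.append(next_loc)
        total_weight *= w
    return trajectory, total_weight
```

Calling `episode((5.0, 5.0), 3)` returns a four-point chaser trajectory and a positive weight. Multiplying the chaser's reward term by the runner's weight is one way to express a weight that reflects the utility of both agents.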

## 5 Planning as Inference Formulation

The Chaser-Runner model performs two levels of nested inference. At the episode level, we infer the next time point, conditioning on expected future detections. In order to evaluate this likelihood, we simulate runner trajectories that are conditioned to avoid future detections. We will perform inference using a nested importance sampling scheme [naesseth2015nested], which is a generalization of importance sampling in which weighted samples at one level in the model are used as proposals at other levels in the model. Note that nested importance sampling is *not* a form of nested Monte Carlo estimation as discussed in rainforth2018nesting and rainforth2018nestingb. We discuss the distinctions between the two methods below.

We implement conditioning using a planning-as-inference formulation [toussaint06, vandemeent2016black-box]. In planning-as-inference problems, the target density $\pi(x)$ is defined in terms of an unnormalized density $\gamma(x)$,

$$\pi(x) = \frac{\gamma(x)}{Z}, \qquad \gamma(x) = p(x) \exp\big(R(x)\big), \tag{1}$$

which in turn is defined in terms of a prior $p(x)$ and a utility or reward $R(x)$. The normalizing constant $Z$ is sometimes referred to as the desirability [todorov2009efficient].
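As a small worked illustration of planning as inference, the sketch below assumes the common form in which the unnormalized density is the prior multiplied by the exponentiated reward. The three candidate moves, their prior probabilities, and their rewards are made-up values chosen only for illustration. The code computes the target distribution exactly and compares the exact normalizing constant with an importance sampling estimate that uses the prior as the proposal.

```python
import math
import random

def exact_target(prior, reward):
    """Normalize gamma(x) = prior(x) * exp(reward(x)); return (target, Z)."""
    gamma = {x: prior[x] * math.exp(reward[x]) for x in prior}
    z = sum(gamma.values())
    return {x: g / z for x, g in gamma.items()}, z

def importance_estimate(prior, reward, num_samples, seed=0):
    """Estimate Z by sampling from the prior and averaging the weights exp(reward)."""
    rng = random.Random(seed)
    xs = list(prior)
    draws = rng.choices(xs, weights=[prior[x] for x in xs], k=num_samples)
    weights = [math.exp(reward[x]) for x in draws]
    return sum(weights) / num_samples

# Hypothetical moves with made-up prior probabilities and rewards.
prior = {"left": 0.5, "center": 0.3, "right": 0.2}
reward = {"left": 0.0, "center": 2.0, "right": 1.0}

target, z = exact_target(prior, reward)
z_hat = importance_estimate(prior, reward, 20000)
```

Under these made-up numbers the target shifts probability mass toward the high-reward move even though the prior favors another, which is the sense in which conditioning on a utility turns planning into inference.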

The Chaser-Runner model in Algorithm 1 defines a sequence of unnormalized densities

In this density, the reward depends on the *difference* between the number of time points during which the chaser expects that the runner will be visible, and the number of time points during which the runner expects to be visible based on the imagined chaser trajectory (which reflects a more naive model of the chaser). In other words, the chaser aims to identify trajectories that will result in likely detections of the runner, under the assumption that the runner will avoid trajectories where detection is likely given a naive chaser model.

## 6 Nested Importance Sampling

We can perform inference in the chaser-runner model using Monte Carlo methods for probabilistic programs. Algorithm 1 defines an importance sampling scheme. At each time step, we sample from the marginal of the target density above. To do so, we sample a set of particles from the chaser model. For each of these samples, we draw a set of samples from the runner model. We then resample from the resulting weighted particles, which corresponds to performing SMC sampling within the episode model.
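The resampling step can be sketched with a multinomial resampler. This is a minimal illustration and not necessarily the resampling operation used in Algorithm 1: the function name and signature are hypothetical, and assigning every surviving particle the average incoming weight is one standard convention that preserves the estimate of the normalizing constant.

```python
import random

def resample(particles, weights, num_out, seed=0):
    """Multinomial resampling: draw num_out particles with probability proportional to weight."""
    rng = random.Random(seed)
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    chosen = rng.choices(particles, weights=weights, k=num_out)
    # After resampling, every surviving particle carries the same (average) weight.
    return chosen, [total / len(weights)] * num_out
```

Because `num_out` need not equal the number of incoming particles, the same function covers both down-sampling and up-sampling.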

To denote this sampling scheme, we define query distributions in lines 1-3 of Algorithm 1. We assume an operator importance that accepts a query and a number of samples and returns a transformed query that accepts samples, and returns weighted samples. We additionally assume a resample operation that accepts a query and a sample count and returns a new query that resamples samples from a query, down-sampling or up-sampling if necessary.

When a single runner sample is drawn for each chaser sample, this sampling scheme reduces to standard SMC inference for probabilistic programs [wood-aistats-2014]. When multiple runner samples are drawn, it can be understood as a form of nested importance sampling [naesseth2015nested]. Note that in this sampling scheme, each of the runner samples corresponds to a *different* runner trajectory, but that the reward for this trajectory is evaluated relative to the *same* past and imagined future trajectory for the chaser.

As noted above, nested importance sampling is not the same as nested Monte Carlo estimation. In nested Monte Carlo problems, we compute a nested expectation, which is to say that we compute an expectation in which, for each outer sample, we need to compute an expected value by marginalizing over inner samples. In the chaser-runner problem, we would obtain a nested Monte Carlo problem if we defined the weight by averaging the reward over chaser trajectories. In nested importance sampling, we instead select a particle according to the average weight.

This is sometimes referred to as nested conditioning, in the context of probabilistic programming systems [rainforth2018nestingb]. For any choice of the number of inner samples, this is a valid importance sampling scheme in which the importance weight provides an unbiased estimate of the normalizing constant.
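The distinction between the two ways of combining inner samples can be made concrete numerically. The sketch below reflects one plausible reading of the text, in which weights are exponentiated rewards: averaging the weights gives the nested importance sampling quantity, while exponentiating the averaged reward gives the nested Monte Carlo quantity. The function names and the example rewards are hypothetical.

```python
import math

def nested_is_weight(rewards):
    """Average of the exponentiated rewards: mean over inner samples of exp(R)."""
    return sum(math.exp(r) for r in rewards) / len(rewards)

def nested_mc_weight(rewards):
    """Exponential of the averaged reward: exp(mean over inner samples of R)."""
    return math.exp(sum(rewards) / len(rewards))

# Made-up rewards for three inner samples (negative values act as costs).
rewards = [-3.0, 0.0, -1.0]
w_is = nested_is_weight(rewards)
w_mc = nested_mc_weight(rewards)
```

By Jensen's inequality the second quantity never exceeds the first, and the two agree only when all inner rewards are equal, so the two definitions generally assign different weights to the same particle.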

## 7 Experiments

We carry out three categories of experiments: 1) trajectory visualization experiments, in which we qualitatively evaluate what forms of rational behavior arise in our model depending on conditioning, 2) detection rate experiments, which test to what extent a more accurate model of a runner enables the chaser to detect the runner most often, and 3) sample budget experiments, where we quantify the trade-offs in allocating our sample budget at different levels of the model.

### 7.1 Visualization of Trajectories

Before carrying out a more quantitative evaluation of the chaser-runner model, we visualize sampled trajectories to show how nested inference converges empirically to rational behavior at each level of the model. We begin by considering a simplified scenario in which we assume fixed start and goal locations. These locations are known to both the chaser and the runner, which means that the chaser and runner do not have to perform inference over possible goal locations. Figure 2 (a) shows a heat map of naive chaser paths in the innermost model, which are conditioned on the start and goal locations. In Figure 2 (b), we show a heat map of runner paths, in which the runner travels in the opposite direction along the same two locations. We observe that the runner avoids direct routes so as to minimize chance of detection. In Figure 2 (c) we show a heat map of chaser trajectories in the outermost model, which shows that the chaser selects paths that are likely to lead to interception of the runner. Together, Figures 2 (a)-(c) demonstrate how our Chaser-Runner model can perform planning conditioned on start and end locations.

In Figure 2 (d)-(e), we visualize naive chaser and runner paths from a single sample (sampled proportionally to importance weights) at a single time step. The runner paths in Figure 2 (e) once again avoid detection relative to the naive chaser paths in Figure 2 (d). Although the naive chaser travels directly toward goal locations from the upper end of the map, in this particular sample, the naive chaser most often remains on the left side of the map. This results in the runner traveling through the center of the city to minimize probability of detection, but more often planning toward location h. This is a case where the RRT planner provides the runner with a shorter and more direct plan that minimizes detection by the chaser.

### 7.2 Detection Experiments

To evaluate the influence of nested modeling on resulting plans, we compare detection rates in the full chaser-runner model to detection rates in three simplified models. We run simulations using two types of runners. We refer to the runner from the full model as the *smarter runner*, and also consider a naive runner which samples from the RRT planner in the same manner as the naive chaser. We similarly consider two chaser models. We refer to the chaser from the full model as the *smartest chaser*. We additionally consider a simplified model in which the chaser assumes a naive runner, which we refer to as a *smart chaser*.

These two runner and chaser models together yield 4 modeling scenarios. Table 1 shows the average detection rate over 50 restarts for each scenario. Figure 3 shows illustrative trajectories. In this figure, ‘C’ stands for Chaser; ‘R’ stands for runner. The blue triangle represents the chaser’s true, current location. Blue dashed lines represent the past chaser trajectory whereas crosses mark future locations imagined by the runner. Magenta diamonds represent samples of inferred runner locations; magenta crosses represent inferred future runner trajectories.

#### 1. Naive Runner, Smart Chaser.

A smart chaser can reliably intercept a naive runner. Figure 3 (a) illustrates a successful detection. We observe that the chaser typically navigates to the center of the map. Since the shortest path between most points crosses the center of the map, this allows the chaser to intercept the runner with high probability.

#### 2. Smarter Runner, Smart Chaser.

When we increase the model complexity of the runner, the detection probability decreases.
Figure 3 (b) illustrates a prototypical result. The *smarter runner* expects the chaser to remain in the center of the map, as it is trying to head off a naive agent, and successfully avoids the center of the map. In Figure 3 (b), the runner is seen swerving sharply left taking a longer path around the perimeter of the city to reach its goal. As a result, the chaser is unable to find the runner for the rest of the simulation. The average detection rate is 0.36, which means that a smarter runner is able to avoid a misinformed chaser in most episodes.

#### 3. Naive Runner, Smartest Chaser.

In this experiment, the chaser assumes a smarter runner, even though the runner’s behavior is in fact naive. Figure 3 (c) illustrates a prototypical result. Here, the multimodality of the model’s inferences is apparent: the chaser predicts two possible modes where the runner could be (clusters of magenta diamonds), but assigns more probability mass to the upper (correct) cluster. As a result, the chaser plans a path to that location, which results in a detection. This model variant yields a detection rate of 0.98, which is the same as in scenario 1, where the chaser has an accurate model of the naive runner.

#### 4. Smarter Runner, Smartest Chaser.

Figure 3 (d) shows a prototypical result from the full chaser-runner model, which results in a successful detection. The chaser anticipates that the runner will avoid highly visible areas of the map and travel through alleyways and around the city.

This experiment yielded a detection rate of 0.56, which is significantly higher than the detection rate of 0.36 in experiment 2.

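Table 1: Average detection rate over 50 restarts for each scenario.

| | Naive Runner | Smarter Runner |
|---|---|---|
| Smart Chaser | 0.98 | 0.36 |
| Smartest Chaser | 0.98 | 0.56 |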

Discussion. These 4 scenarios illustrate that when the runner reasons more deeply, it evades more effectively; conversely, when the chaser reasons more deeply, it intercepts more effectively. Furthermore, we show that a single, unified inference algorithm can uncover a wide variety of intuitive, rational behaviors for both the runner and the chaser.

### 7.3 Sample Budget Experiments

To evaluate how the allocation of computational resources to different levels of the model affects the variance of our importance sampling estimator, we carry out experiments in which we vary the number of samples drawn at the chaser and runner levels while keeping the total computation budget fixed. This allows us to assess how many samples from the runner are needed to effectively evaluate utilities in the chaser model.

In this experiment, we perform independent episode restarts for each combination of sample counts. For each episode, we compute chaser trajectories and runner trajectories at every time step. In total, we compute just over 4 million runner trajectories.

In Figure 4 (top row), we show the log mean weights for the chaser (left) and runner (right) at each time point, computed as the logarithm of the average importance weight at that time point.

For each sample budget, the chaser log mean weight decreases (left) as a function of time, while the runner log mean weight remains relatively stable independent of time (right). The decrease in the chaser log mean weight is to be expected, given that the probability of intercepting the runner decreases as we approach the end of the episode.

To evaluate the weight variance at each time step, we compute the effective sample size (ESS), which for a set of weights $w_1, \dots, w_N$ is defined as

$$\mathrm{ESS} = \frac{\left(\sum_{n=1}^{N} w_n\right)^2}{\sum_{n=1}^{N} w_n^2}.$$

Figure 4 (bottom row) shows the fractional ESS (normalized by the number of samples) as a function of time for each sample budget. The effective sample size for the chaser weights increases over the course of the episode, reflecting that inference becomes easier as runner detection probabilities progressively decrease toward the end of the episode.
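Both diagnostics used in this section are straightforward to compute from log weights. The sketch below uses the standard definition of the effective sample size, the squared sum of the weights divided by the sum of the squared weights, and a log-sum-exp formulation of the log mean weight. The function names are hypothetical, and working in log space is a choice made here for numerical stability rather than something stated in the text.

```python
import math

def log_mean_weight(log_weights):
    """log((1/N) * sum_n exp(log_w_n)), computed stably via log-sum-exp."""
    m = max(log_weights)
    total = sum(math.exp(lw - m) for lw in log_weights)
    return m + math.log(total) - math.log(len(log_weights))

def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, evaluated from log weights."""
    m = max(log_weights)
    # Rescaling all weights by exp(-m) leaves the ratio unchanged.
    w = [math.exp(lw - m) for lw in log_weights]
    return sum(w) ** 2 / sum(x * x for x in w)

def fractional_ess(log_weights):
    """ESS divided by the number of samples, so that 1.0 means uniform weights."""
    return effective_sample_size(log_weights) / len(log_weights)
```

Uniform weights give a fractional ESS of 1, while a single dominant weight drives the ESS toward 1 sample, which is why a rising fractional ESS indicates lower weight variance.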

Figure 5 shows quantiles with respect to restarts for log mean weights, which further confirms the trend in Figure 4. We observe higher median log weights and fewer outliers as K decreases and L increases. These results show that computed log weights are less robust when we draw a smaller number of samples from the outermost model.

## 8 Conclusion and Future Work

At the beginning of this paper, we considered the question, “How do we give autonomous agents the ability to infer the mental state of other agents?”, and more importantly, “How do we reason about that mental state for decision making and planning?” We have taken a step towards this goal by contributing a model with several novel elements, including complex path planning, visibility, and nested planning-as-inference. We have shown how relatively straightforward models of theory of mind can capture a variety of rich behavior, and that probabilistic programming is a natural way to describe those models. We experimentally demonstrated that runner detection rates increase as we increase the complexity of the chaser model, which indicates that deeper models of the other agent produce improved behavior. Additionally, we showed that nested reasoning results in lower-variance estimates of expected utility.

One of the virtues of a Bayesian approach is compositionality. While we assumed access to a high-level map, the same framework could be applied to a joint model that blends high-level reasoning with low-level perception. In such models, inferences driven by theory of mind models could go beyond goals and paths, and could additionally infer (for example) the existence of objects or other agents seen by the runner, but not by the chaser. Such integrated models may require inference metaprogramming; but how best to make such models computationally tractable is an open question.

## 9 Acknowledgements

We would like to acknowledge helpful feedback from a number of reviewers. IRS and DW gratefully acknowledge the support of DARPA under grant FA8750-16-2-0209, and additional support from the NSF CUAS. IRS and JWM additionally acknowledge support from startup funds provided by Northeastern.
