Suppose you have to catch a flight in three hours, and you are in your bedroom packing. How would you plan your next move? You might reason, “I will finish packing my bag. Then, I will get a car to take me to the airport.” This plan seems intuitive and straightforward. However, many details have been left out: Which taxi or ride-sharing company will you use? How are you going to get from security to your gate? If you reach the gate and get hungry, can you pick up a snack in time to still make your flight? Clearly, you did not imagine every possible contingency from now until your flight departs. Instead, you sketched out a partial plan that considers the information most relevant to your current circumstance, and delayed thinking about other details to when they become more relevant. For instance, rather than thinking of a detailed path from your bedroom to the flight gate right now, you might plan to later think about the route to the gate once you get through airport security.
As this example suggests, human decision making not only involves planning one’s actions, but also planning one’s plans. But why plan one’s plans? Why not just plan one’s actions? We propose that planning one’s plans emerges from two aspects of human decision making. First, plans can include different details at different times and change as time moves on. Second, representing detailed plans is itself costly (in terms of time and memory). Thus, people should consider what details they include in a plan and when to include those details. That is, they should plan their plans.
Here, we develop and empirically evaluate this idea in humans. We first discuss how planning one’s plans relates to existing ideas in psychology and machine learning. Then, we formalize the relation between planning one’s plans and the cost of planning by defining a general Bellman objective that includes both task rewards and information-theoretic planning costs. We then discuss several qualitative features of solutions to this objective. Finally, we report the results of two new human experiments that compare participant reaction times during problem solving to the predictions of our model and alternatives. These empirical findings support our normative account of planning one’s plans and raise new questions about the nature of boundedly rational decision making in both people and machines.
Background
Put simply, planning involves finding a good sequence of actions given a particular problem representation. In computer science, many approaches have been developed to facilitate planning. These include classical planning methods such as depth-limited search, heuristic search, and Monte-Carlo tree search, as well as the use of data structures like hierarchies [29, 14], temporal abstractions, and state abstractions. At the same time, psychologists have long recognized that people also rely on heuristics [35, 8] and use abstractions to organize their thoughts and behaviors [17, 26, 31].
But why do people use heuristics and abstractions, and why do we build these structures into our algorithms to use? A simple reason is that naïve planning is prohibitively costly [2, 18], so such aids focus limited computational resources on the most important, urgent, or relevant parts of a problem. This observation motivates the following question: What if usefully and efficiently representing plans were an explicit goal of making decisions over time?
To develop a model of adaptively planning with the costs of planning in mind (i.e., planning one’s plans), we draw on ideas from several lines of research. One is work on boundedly optimal intelligence, anytime algorithms [13, 6], and human rational meta-reasoning, which articulates the value of managing computational costs during decision making. Another is work on the psychology of intertemporal choice and representation, which studies how people’s construal of different aspects of the world changes as a function of distance, time, and context. Finally, we draw on formal tools from information-theoretic approaches to bounded rationality [33, 27, 21], which provide a general framework for characterizing the relationship between environmental rewards and decision-making costs.
Planned Information Processing
We begin by formalizing the objective of planning one’s plans in terms of planned information processing. To model how an agent should plan at different points in time given a task and a cost of planning, we treat the decisions of what to plan and when to plan as a sequential decision-making problem. To capture a general, architecture-agnostic notion of planning costs, we use a formulation of partial planning and information-theoretic action costs [33, 27, 21]. Finally, at the end of this section and in the supplementary materials, we present a gradient-based algorithm for solving this objective.
Markov Decision Processes
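Sequential decision-making can be described in terms of a Markov Decision Process (MDP). A discrete MDP is a tuple $M = \langle \mathcal{S}, \mathcal{A}, T, R, \gamma \rangle$, where $\mathcal{S}$ is the state space; $\mathcal{A}$ is the action space; $T : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is a transition function that defines a probability distribution over next states $s'$ having taken action $a$ in state $s$; $R : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ is a reward function that defines agent payoffs; and $\gamma \in [0, 1)$ is a discount rate. [Footnote 1: $\Delta(X)$ denotes the simplex over discrete elements $X$.]
An agent’s policy describes its behavior in an MDP. Formally, a policy is a mapping from states to probability distributions over actions, $\pi : \mathcal{S} \to \Delta(\mathcal{A})$. A policy combined with a transition function defines how an agent will move within the state space over time. Additionally, the normalized discounted occupancy of a policy $\pi$ in MDP $M$ is $\rho^{\pi}(s' \mid s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^{t} \Pr(s_t = s' \mid s_0 = s, \pi)$.
We are interested in the value function for a particular policy, $V^{\pi}$, which is the expected discounted cumulative reward that an agent receives by following $\pi$ from state $s$ onward. We are especially interested in the optimal value function $V^{*}$, defined by the unique fixed point of the Bellman equation, where for all $s \in \mathcal{S}$:

$$V^{*}(s) = \max_{a \in \mathcal{A}} \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right] \tag{1}$$

The optimal value function describes the best that an agent can expect to do in terms of maximizing future discounted rewards from each state. It is also useful to define the optimal state-action value function, $Q^{*}(s, a) = \sum_{s'} T(s' \mid s, a) \left[ R(s, a, s') + \gamma V^{*}(s') \right]$, for all $s \in \mathcal{S}$ and $a \in \mathcal{A}$. Finally, an optimal policy is any policy $\pi^{*}$ such that $V^{\pi^{*}}(s) = V^{*}(s)$ for all $s \in \mathcal{S}$.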
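To make the Bellman backup in Equation 1 concrete, the following minimal sketch runs value iteration on a five-state corridor. The corridor, its rewards (+10 for entering the goal, -1 per step), the discount rate, and the names `step`, `reward`, and `value_iteration` are illustrative assumptions and are not the tasks or parameters used in this paper.

```python
# Toy corridor MDP (hypothetical): states 0..4, state 4 is an absorbing goal.
N_STATES, GOAL, ACTIONS, GAMMA = 5, 4, (-1, +1), 0.95


def step(s, a):
    """Deterministic transition: move left or right, staying inside the corridor."""
    if s == GOAL:
        return s
    return min(max(s + a, 0), N_STATES - 1)


def reward(s, a, s_next):
    """-1 per step, +10 on entering the goal, 0 once absorbed (assumed values)."""
    if s == GOAL:
        return 0.0
    return 10.0 if s_next == GOAL else -1.0


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = [0.0] * N_STATES
    while True:
        Q = [[reward(s, a, step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS]
             for s in range(N_STATES)]
        V_new = [max(q) for q in Q]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new, Q
        V = V_new


V_star, Q_star = value_iteration()
# Greedy action per state; at the absorbing goal both actions tie.
greedy = [ACTIONS[q.index(max(q))] for q in Q_star]
print(V_star)
print(greedy)
```

In this toy task the greedy policy moves right from every non-goal state, and values grow as the agent gets closer to the goal.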
Partial Planning via Soft Planning
In an MDP, planning corresponds to finding a policy that yields high value by doing computations over a model of the task. For instance, an optimal policy generates plans that perfectly maximize value. But, we may not always want a perfect plan. Rather, since perfection is costly, it is often useful to express imperfect partial plans.
We introduce two ideas to formalize partial plans. First, we distinguish between the ground MDP, $M$, and a “simulated” MDP, $\hat{M}$. Here, we focus on the relationship between ground states $s$ and simulated states $\hat{s}$, while also assuming that $\hat{M} = M$. This distinction allows us to express different quantities (e.g., action probabilities, expected values) over the same simulated state space but from the perspective of different ground states. For example, we denote the planned probability of taking action $\hat{a}$ at simulated state $\hat{s}$ from ground state $s$ as $\pi_{s}(\hat{a} \mid \hat{s})$. In the airport example, you are simulating what you would do at the airport, $\hat{s}$, from your bedroom, $s$. Note the special case of $\hat{s} = s$ (i.e., $\pi_{s}(a \mid s)$), which defines the actions an agent plans on taking at their current state.
Second, we introduce a soft planning parameterization of partial policies that controls the allocation of planning to different simulated states. Formally, an inverse temperature assignment from state $s$, $\beta_{s}$, assigns a positive real inverse temperature $\beta_{s}(\hat{s})$ to each simulated state $\hat{s}$. Given this assignment, we define soft-Bellman equations over simulated states from a ground state $s$:
Intuitively, the inverse temperature assignment captures how much attention is paid at each simulated state when constructing a partial plan. Larger inverse temperatures entail more attention at a particular state, and the interaction of temperatures induces a partial plan. [Footnote 2: Previous work has interpreted the soft-maximization in Eq. 2 in terms of the sampling process of a bounded rational decision-maker, where inverse temperature reflects noise (Train 2003, p. 48; Ortega 2015). Here, we mainly treat soft planning as a way to parameterize partial policies and leave a more in-depth analysis of the sampling interpretation of this model for future work.]
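One way to read soft planning procedurally is sketched below. It assumes a particular form of the soft-Bellman equations: simulated action values come from a one-step Bellman backup, the planned policy at each simulated state is a softmax over those values scaled by that state's inverse temperature, and simulated state values are expectations under the planned policy. Other soft-maximization variants (for example, a log-sum-exp value) are possible. The corridor task, reward values, number of sweeps, and function names are hypothetical.

```python
import math

# Toy corridor MDP (hypothetical): states 0..4, state 4 is an absorbing goal.
N_STATES, GOAL, ACTIONS, GAMMA = 5, 4, (-1, +1), 0.95


def step(s, a):
    if s == GOAL:
        return s
    return min(max(s + a, 0), N_STATES - 1)


def reward(s, a, s_next):
    if s == GOAL:
        return 0.0
    return 10.0 if s_next == GOAL else -1.0


def softmax(values, beta):
    """Numerically stable softmax with inverse temperature beta."""
    m = max(values)
    weights = [math.exp(beta * (v - m)) for v in values]
    z = sum(weights)
    return [w / z for w in weights]


def soft_plan(beta, sweeps=200):
    """Partial plan induced by an inverse temperature assignment.

    beta[s_hat] is the inverse temperature at simulated state s_hat.
    Returns the planned policy pi[s_hat][action_index] and simulated values.
    """
    V = [0.0] * N_STATES
    pi = []
    for _ in range(sweeps):
        Q = [[reward(s, a, step(s, a)) + GAMMA * V[step(s, a)] for a in ACTIONS]
             for s in range(N_STATES)]
        pi = [softmax(Q[s], beta[s]) for s in range(N_STATES)]
        V = [sum(p * q for p, q in zip(pi[s], Q[s])) for s in range(N_STATES)]
    return pi, V


# A detailed plan attends to every simulated state; a partial plan made
# from state 0 attends only to states 0 and 1 and stays vague elsewhere.
detailed, _ = soft_plan([5.0] * N_STATES)
nearby_only, _ = soft_plan([5.0, 5.0, 0.01, 0.01, 0.01])
print(detailed[3])      # sharply prefers moving right next to the goal
print(nearby_only[3])   # close to uniform: this state was left unplanned
```

A fixed number of sweeps is used rather than a convergence test, since this softmax-weighted backup is not guaranteed to be a contraction in general.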
Information-Theoretic Planning Costs
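How can we quantify the cost of a partial plan? Our proposal borrows the idea of information-theoretic costs for actions [33, 27, 21] and applies it to simulated (planned) actions. For instance, we can define the cost of a plan $\pi_{s}$ based on the sum of Kullback-Leibler (KL) divergences, denoted $D_{\mathrm{KL}}$, from a default policy $\bar{\pi}$ at each simulated state $\hat{s}$:

$$C(\pi_{s}) = \sum_{\hat{s}} D_{\mathrm{KL}}\left( \pi_{s}(\cdot \mid \hat{s}) \,\|\, \bar{\pi}(\cdot \mid \hat{s}) \right) \tag{5}$$

where $D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)}$ for distributions $p$ and $q$ with the same support $X$. Here, we set $\bar{\pi}$ to be the uniform distribution at all states, as it makes few assumptions and is justified by previous work. However, our formulation straightforwardly accommodates task-specific default policies.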
Specifying planning costs as an information-theoretic quantity has conceptual and practical advantages. First, planning costs can be characterized independently of an agent’s specific representation of a task. Second, the cost can be interpreted as the minimal cost in bits of using the old plan (the default policy) to represent the action distributions in the new plan (the partial plan), which captures intuitions about planning as a costly process of information transformation and control. Finally, the cost term is differentiable with respect to the policy probabilities.
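As a small illustration of this cost, the sketch below sums per-state KL divergences (in bits) between a plan and a default policy, using a uniform default when none is supplied. The three example plans and the names `kl_bits` and `plan_cost` are hypothetical.

```python
import math


def kl_bits(p, q):
    """KL divergence D(p || q) in bits; terms with p(x) = 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def plan_cost(plan, default=None):
    """Sum of per-state KL divergences from a default policy.

    plan[s_hat] is the planned action distribution at simulated state s_hat.
    If no default is given, a uniform policy is assumed at every state.
    """
    total = 0.0
    for s_hat, action_probs in enumerate(plan):
        n = len(action_probs)
        base = default[s_hat] if default is not None else [1.0 / n] * n
        total += kl_bits(action_probs, base)
    return total


vague = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]        # nothing planned
sharp_here = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]]   # one state planned
sharp_all = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # every state planned

for name, plan in [("vague", vague), ("sharp_here", sharp_here), ("sharp_all", sharp_all)]:
    print(name, plan_cost(plan))
```

With two actions, committing fully to one action at a state costs one bit relative to the uniform default, so the cost grows with the number of states that the plan specifies in detail.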
Planned Information Processing Bellman Objective
Given our formalization of partial planning and information theoretic planning costs, we can now define the objective of planning one’s plans. Put simply, we nest partial planning (Equations 2–4) inside a meta-planning problem that includes planning costs (Equation 5):
This equation extends the original Bellman equation (Equation 1) in two ways. First, ground actions are no longer directly chosen. Instead, a distribution over actions is induced at the current state by influencing the computations of the planner through the inverse temperature assignment.
Second, Equation 6 takes into account the immediate cost of planning via the planning cost term, which is scaled by a planning cost weight. Note that this formulation also includes planning opportunities and costs in the future via the recursive nature of the Bellman equation. To understand the significance of this, consider again the airport example: While at home packing, you only vaguely consider how you will get from security to your gate. This is not because it does not matter; if you never make it to your gate and miss your flight, there is no point in packing. Rather, packing does not require you to think about the details of navigating the airport until later. In short, Equation 6 expresses how an agent should partially plan to act in the moment, taking into account that they will engage in planning in the future.
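The sketch below evaluates a meta-level value of this kind for fixed inverse temperature assignments, without optimizing them. It assumes the objective has the form "expected one-step reward plus discounted future meta-level value under the action distribution planned for the current state, minus a weight times the KL cost of the current partial plan"; this form, the weight name `lam`, the corridor task, and all numeric values are assumptions made for illustration.

```python
import math

# Toy corridor MDP (hypothetical): states 0..4, state 4 is a terminal goal.
N_STATES, GOAL, ACTIONS, GAMMA = 5, 4, (-1, +1), 0.95


def step(s, a):
    return s if s == GOAL else min(max(s + a, 0), N_STATES - 1)


def reward(s, a, s_next):
    if s == GOAL:
        return 0.0
    return 10.0 if s_next == GOAL else -1.0


def backup(s, a, V):
    s_next = step(s, a)
    return reward(s, a, s_next) + GAMMA * V[s_next]


def softmax(values, beta):
    m = max(values)
    weights = [math.exp(beta * (v - m)) for v in values]
    z = sum(weights)
    return [w / z for w in weights]


def soft_plan(beta, sweeps=200):
    """Partial plan over simulated states for one ground state (assumed form)."""
    V = [0.0] * N_STATES
    pi = []
    for _ in range(sweeps):
        Q = [[backup(s, a, V) for a in ACTIONS] for s in range(N_STATES)]
        pi = [softmax(Q[s], beta[s]) for s in range(N_STATES)]
        V = [sum(p * q for p, q in zip(pi[s], Q[s])) for s in range(N_STATES)]
    return pi


def plan_cost(pi):
    """Summed KL divergence (bits) of each planned distribution from uniform."""
    n = len(ACTIONS)
    return sum(p * math.log2(p * n) for row in pi for p in row if p > 0)


def meta_values(assignments, lam, sweeps=500):
    """Evaluate the meta-level value when ground state s uses assignments[s]."""
    plans = [soft_plan(assignments[s]) for s in range(N_STATES)]
    costs = [plan_cost(plans[s]) for s in range(N_STATES)]
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        # The goal is treated as terminal: no further reward or planning cost.
        V = [0.0 if s == GOAL else
             sum(p * backup(s, a, V) for p, a in zip(plans[s][s], ACTIONS))
             - lam * costs[s]
             for s in range(N_STATES)]
    return V, costs


everywhere = [[5.0] * N_STATES for _ in range(N_STATES)]
nearby = [[5.0 if abs(s_hat - s) <= 1 else 0.05 for s_hat in range(N_STATES)]
          for s in range(N_STATES)]

V_free, cost_all = meta_values(everywhere, lam=0.0)
V_costly, _ = meta_values(everywhere, lam=0.1)
V_near, cost_near = meta_values(nearby, lam=0.1)
print(V_free, V_costly, V_near)
print(cost_all, cost_near)
```

Comparing `cost_all` with `cost_near` shows the trade-off in miniature: planning only around the current state yields a cheaper partial plan, and whether that is worthwhile depends on how much task reward is lost and on the planning cost weight.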
Optimal Planned Information Processing
To solve for the meta-planning objective in Equation 6, we implemented a gradient-based algorithm that solves for the optimal inverse temperature assignments given an MDP, a planning cost weight, and a default policy.
To understand optimal meta-planning and how it relates to previous work on hierarchical planning and learning, we ran our algorithm on the Four Rooms domain with deterministic transitions . An absorbing goal state worth points was placed in the upper-right corner. Step costs were included ( points) and the discount rate was set to , with . Planning iterations were chosen such that value iteration would converge (), while meta-planning parameters were solved using the Adam optimizer  for iterations (see supplementary materials for details on the algorithm).
The optimization finds an inverse temperature assignment that yields a partial plan from each ground state. The cost of each partial plan is determined by the entire simulated state space (Equation 5), but we can examine the contribution of the planned actions at each simulated state to identify which ones are prioritized, shown as colors in Figure 1a. Two salient features emerge in the Four Rooms domain. First, at any point in time, “doorways” leading to the goal are prioritized because the quality of the current decision depends on planning to go through the door and proceeding towards the goal. Second, actions at states that are closer along the path are represented in greater detail since they are more relevant to simulating the value of the current decision.
Additionally, Equation 6 implicitly defines a task-specific Pareto frontier where task rewards cannot be improved without worsening planning costs and vice versa. Since our algorithm seeks to find points that jointly maximize rewards and minimize planning costs, we can identify this curve for the Four Rooms task by running our algorithm with a range of planning cost weights. Figure 1b shows the plot of this curve at the lower-left state, which divides the space of task rewards and planning costs into feasible and infeasible combinations.
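To illustrate what such a frontier is, the sketch below filters a set of (planning cost, task reward) pairs down to the non-dominated ones. The pairs are made-up numbers, not results from the Four Rooms simulation, and `pareto_front` is a hypothetical helper.

```python
def pareto_front(points):
    """Return the non-dominated (cost, reward) pairs.

    A pair is kept only if no other pair has lower-or-equal cost together
    with higher-or-equal reward (and differs from it).
    """
    front = []
    # Sort by increasing cost; for equal cost, put the highest reward first.
    for cost, rew in sorted(points, key=lambda p: (p[0], -p[1])):
        if not front or rew > front[-1][1]:
            front.append((cost, rew))
    return front


# Hypothetical outcomes, e.g. one per planning cost weight tried.
outcomes = [(0, 1), (1, 3), (2, 2), (3, 5), (3, 4), (1, 3)]
print(pareto_front(outcomes))  # [(0, 1), (1, 3), (3, 5)]
```

Points such as (2, 2) drop out because another outcome achieves more reward at lower cost.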
Experiment 1: Parametric Mazes
Our account makes quantitative predictions about how people should flexibly plan information processing on a task. To test these predictions, we examined how long it took people to make decisions when navigating through a set of parametrically generated 2D Gridworld mazes. For a given maze, our model predicts an optimal amount of partial planning at a state. For example, Figure 2a displays two mazes that our model predicts will have different immediate partial planning costs at the initial state. Since this cost reflects information processing, we can operationalize it in this task based on reaction times (RTs) as in previous work (e.g., Ortega 2016). Specifically, we assume that the amount of time between when a participant is presented with a maze and when they take their first action reflects the cost of encoding a partial plan from the initial state.
Materials, Participants, and Procedure
The stimuli were a set of 50 different Gridworld mazes in which the start state was in the lower right corner and the goal state was in the upper left corner. These were chosen by first randomly generating a batch of 2,000 mazes and then selecting a random subset that were predicted by the model to have a range of different optimal planning costs at the initial state.
We recruited 50 participants from Amazon Mechanical Turk and used the psiTurk framework . Each participant was paid a base pay of $1.00. After reading the instructions and familiarizing themselves with the general mechanics of the task, participants started the main part of the task that included the 50 mazes. Each round, participants were first shown a blank grid. When they pressed the spacebar, the maze for that round appeared immediately and they could move their agent (a blue circle) using the arrow keys. The initial-state RT measure was the time measured between the appearance of the maze on a round and their first action. When they reached the goal state, they received points (50 points = 1¢; total bonus = $1.00).
|Predictor||LL Ratio||Estimate||SE|
|Partial Planning Cost||37.71||0.15||0.02|
|A* Node Count||18.60||0.00||0.00|
|Optimal Plan Length||14.06||-0.02||0.01|
|Information Theoretic Bounded Rationality||0.82||0.00||0.00|
|RL Softmax Entropy||1.50||0.32||0.26|
Table 1: Experiment 1 (Parametric Mazes) likelihood ratio tests and model estimates. Even when multiple planning metrics are included, the Partial Planning Cost derived from planning to plan is predictive of RTs.
Planned Information Processing and Alternative Models
In this experiment, we are interested in the minimized information-theoretic planning cost at the initial state predicted by planned information processing. Formally, this corresponds to the planning cost term in Equation 6 evaluated at the initial state. These values were calculated for each maze using our algorithm and the same parameters as in the simulation.
To assess whether people are not simply planning, but adaptively planning their information processing, we considered seven alternative planning-based metrics as predictors. First, we considered the length of the shortest path from the start state to the goal, calculated using value iteration (Optimal Plan Length). Second, we ran A* search, a classical planning algorithm that finds a shortest path by maintaining a prioritized set of states to explore, starting with an initial state, and then iteratively exploring states and adding connected states to the exploration set until the goal state is reached. To facilitate better exploration, we provided A* with a Manhattan distance heuristic to the goal. We considered the number of candidate states explored by the algorithm before termination (A* Node Count).
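A node-count metric of this kind can be sketched as follows. The grid encoding (`#` for walls), the choice to count every state popped from the frontier including the goal, and the example maze are assumptions made for illustration; the exact counting convention used for the A* Node Count predictor is not specified here.

```python
import heapq


def astar_node_count(grid, start, goal):
    """A* on a 4-connected grid with a Manhattan-distance heuristic.

    Returns (shortest path length, number of states expanded), or
    (None, number of states expanded) if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def h(pos):
        return abs(pos[0] - goal[0]) + abs(pos[1] - goal[1])

    frontier = [(h(start), 0, start)]   # entries are (f = g + h, g, state)
    best_g = {start: 0}
    expanded = set()
    while frontier:
        _, g, pos = heapq.heappop(frontier)
        if pos in expanded:
            continue                    # stale entry for an expanded state
        expanded.add(pos)
        if pos == goal:
            return g, len(expanded)
        r, c = pos
        for nxt in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                if g + 1 < best_g.get(nxt, float("inf")):
                    best_g[nxt] = g + 1
                    heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt))
    return None, len(expanded)


# Hypothetical maze: start in the lower-right corner, goal in the upper-left.
maze = [".#...",
        ".#.#.",
        "...#."]
print(astar_node_count(maze, start=(2, 4), goal=(0, 0)))
```

In this single-corridor maze every free cell lies on the only route, so all of them are expanded; in more open mazes the heuristic lets the search skip cells that lead away from the goal.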
Third, we analyzed the action cost associated with the first step of the boundedly rational planning method proposed by Ortega 2015 (Information Theoretic Bounded Rationality). We set the information-theoretic cost parameter to be commensurate with our planning-to-plan implementation. Fourth, we calculated the initial-state entropy of a standard softmax over the optimal value function (RL Softmax Entropy). Fifth, we calculated the initial-state entropy of an optimal soft-Bellman policy (Soft-Bellman Entropy). Sixth, we calculated the number of iterations of standard (planning) value iteration before convergence as a measure of planning computation (VI Iterations). Finally, as a heuristic measure of the complexity of a grid, we calculated the mean number of “turns” that occurred along a trajectory sampled from the optimal policy (Trajectory Turns).
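Two of these simpler metrics can be sketched directly. The code below computes the entropy of a softmax over a state's action values and counts direction changes along an action sequence. The action values, the inverse temperature of 3.0, and the example trajectory are hypothetical, as are the function names.

```python
import math


def softmax(values, beta):
    """Numerically stable softmax with inverse temperature beta."""
    m = max(values)
    weights = [math.exp(beta * (v - m)) for v in values]
    z = sum(weights)
    return [w / z for w in weights]


def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def count_turns(actions):
    """Number of times consecutive actions differ along a trajectory."""
    return sum(1 for a, b in zip(actions, actions[1:]) if a != b)


# Hypothetical optimal action values at an initial state (up, down, left, right).
q_initial = [4.2, 1.0, 4.0, 0.5]
print(entropy_bits(softmax(q_initial, beta=3.0)))

# Hypothetical trajectory: left, left, up, up, left.
print(count_turns("LLUUL"))  # 2
```

The entropy is highest when all actions look equally good and approaches zero when one action clearly dominates.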
If people are only planning actions, then the planning-based metrics should be sufficient for predicting RTs. If people are planning information processing—that is, constructing a computationally inexpensive partial plan that provides a good action at their current state—then the partial planning cost in our model would separately predict RTs.
Results and Discussion
We analyzed our data by comparing the predictions of the models to participants’ initial-state RTs. Two participants had substantial missing data, and outliers were excluded, leaving the remaining initial-state RT measurements for analysis. To assess the relative predictive power of the eight models, we first fit a fully specified mixed-effects linear model to log-normalized RTs. This included by-participant intercepts and round-number slopes as random effects, and Partial Planning Cost as well as the seven planning metrics as fixed effects. We then performed log-likelihood ratio tests against lesioned versions of the model, each omitting one of the eight fixed effects. As shown in Table 1, although several of the planning metrics are significant predictors, Partial Planning Cost not only predicts RTs but is also the predictor with the highest log-likelihood ratio test statistic. Thus, Experiment 1 suggests that people engage in planned information processing.
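For readers unfamiliar with lesion-based likelihood ratio tests, the sketch below shows the arithmetic. It assumes that the LL Ratio column of Table 1 holds the test statistic (twice the difference in log-likelihood between the full and lesioned models) and that each lesion removes a single fixed effect, so the statistic is referred to a chi-square distribution with one degree of freedom. Both points are assumptions, and the example log-likelihoods are invented.

```python
import math


def lr_statistic(loglik_full, loglik_lesioned):
    """Likelihood ratio statistic for nested models."""
    return 2.0 * (loglik_full - loglik_lesioned)


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Statistics as listed in Table 1 (interpretation as chi-square values assumed).
table_1 = {
    "Partial Planning Cost": 37.71,
    "A* Node Count": 18.60,
    "Optimal Plan Length": 14.06,
    "Information Theoretic Bounded Rationality": 0.82,
    "RL Softmax Entropy": 1.50,
}
p_values = {name: chi2_sf_1df(stat) for name, stat in table_1.items()}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3g}")

# Invented log-likelihoods, only to show how a statistic is formed.
example_stat = lr_statistic(loglik_full=-100.0, loglik_lesioned=-101.92075)
print(example_stat, chi2_sf_1df(example_stat))
```

The closed form works because a chi-square variable with one degree of freedom is the square of a standard normal variable.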
Experiment 2: Probing Partial Plans
Experiment 1 tested planned information processing using initial-state RTs to measure planning costs. In Experiment 2, we more directly examine whether people’s partial planning is captured by our model. To probe partial plans, we used a technique of teleporting participants’ avatars in Gridworld mazes and measuring their reactions. For instance, imagine that, in the middle of packing your bag for a flight, you are unexpectedly teleported to the main terminal of the airport with your bag. It is likely you would quickly know what to do next (e.g., pick up your boarding pass) if that were part of your partial plan prior to being teleported. In contrast, if you had not thought that far ahead, then it would likely take you longer to determine your next action. Thus, post-teleportation reaction times can be used to measure the divergence between a pre-teleportation plan and a post-teleportation plan.
Materials, Participants, and Procedure
The experiment consisted of 64 rounds of Gridworld mazes. To generate the mazes, four base mazes were generated such that the initial state was in the lower right corner and the goal state was in the upper left corner. These were then transformed using the eight symmetries of a square, yielding a total of 32 perceptually distinct mazes. Each of the 32 mazes appeared twice. Half of the rounds were Normal rounds while the other half were Teleportation rounds. On the Teleportation rounds, a random number $k$ between 1 and the length of an optimal path for the maze was chosen, and on the $k$-th step of the round, the agent was hidden for 750 ms and could not be controlled. It then reappeared in a randomly chosen location in the maze and could be controlled immediately. The amount of time between the reappearance of the circle and the participant’s response was the post-teleportation RT measure that is the focus of this study.
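The eight symmetries of a square (four rotations, each with and without a mirror flip) can be generated as in the sketch below. The grid encoding as a list of strings, the function names, and the two-by-two example are illustrative assumptions.

```python
def rotate(grid):
    """Rotate a grid (list of equal-length strings) 90 degrees clockwise."""
    return ["".join(row[c] for row in reversed(grid)) for c in range(len(grid[0]))]


def mirror(grid):
    """Flip a grid left to right."""
    return [row[::-1] for row in grid]


def symmetries(grid):
    """All eight images of a square grid under rotations and reflections."""
    images = []
    current = list(grid)
    for _ in range(4):
        images.append(current)
        images.append(mirror(current))
        current = rotate(current)
    return images


# A tiny grid with four distinct cells, so that all eight images differ.
base = ["ab",
        "cd"]
variants = symmetries(base)
print(len({tuple(v) for v in variants}))  # 8
```

Applied to four base mazes without internal symmetry, this yields 4 x 8 = 32 distinct mazes, matching the count described above.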
Sixty participants from MTurk were recruited for our experiment and given the same familiarization procedure as in Experiment 1. Participants were paid a base pay of $1.00 and received a bonus of $1.28 for completing the 64 trials.
|Predictor||LL Ratio||SE|
|A* Destination Nodes|
|A* Node Difference|
|Optimal Path Length|
Our goal is to explain post-teleportation RTs as a function of pre- and post-teleportation states and task structure. Our account provides partial plans at the pre- and post-teleportation states. If these map onto people’s partial plans, then RTs will reflect a process of updating the pre-teleportation plan into the post-teleportation plan. To quantify this updating (i.e., re-planning) process, we calculated the state–action divergence between the pre-teleportation partial plan and the post-teleportation partial plan. This “Partial-Plan Divergence” reflects the cost to encode the state–action distribution of the partial plan at the post-teleportation state starting from the one at the pre-teleportation state. It should thus reflect participants’ “new” planning at the post-teleportation state. The same parameters as in the model in Experiment 1 were used to calculate the partial plans.
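A divergence of this kind can be sketched as a KL divergence between two joint distributions over state–action pairs. The sketch assumes that each joint distribution is the product of a state weighting (for example, a normalized occupancy) and the planned action probabilities, and that the divergence is taken from the post-teleportation plan to the pre-teleportation plan. These choices, the helper names, and all numbers are assumptions for illustration.

```python
import math


def joint(state_weights, plan):
    """Joint state-action distribution: weight of each state times action probs."""
    return [[w * p for p in row] for w, row in zip(state_weights, plan)]


def divergence_bits(post, pre):
    """KL divergence D(post || pre) in bits over state-action pairs."""
    total = 0.0
    for post_row, pre_row in zip(post, pre):
        for p, q in zip(post_row, pre_row):
            if p > 0:
                total += p * math.log2(p / q)
    return total


# Hypothetical three-state, two-action plans before and after a teleport.
pre_plan = [[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]
pre_weights = [0.6, 0.3, 0.1]
post_plan = [[0.5, 0.5], [0.5, 0.5], [0.1, 0.9]]
post_weights = [0.1, 0.3, 0.6]

pre_joint = joint(pre_weights, pre_plan)
post_joint = joint(post_weights, post_plan)
print(divergence_bits(post_joint, pre_joint))
print(divergence_bits(pre_joint, pre_joint))  # 0.0
```

The divergence is zero when the teleport lands somewhere the old plan already covers in the same way, and grows as the new plan places weight on state–action pairs the old plan treated as unlikely.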
We calculated several alternative planning measures. First, we calculated the length of the optimal path from the post-teleportation state to the goal. Second, we calculated the number of A* nodes from the post-teleportation state (A* Destination Nodes). Third, we calculated an A* Node Difference score, corresponding to the additional nodes that A* explores at the post-teleportation state, taking into account those already explored at the pre-teleportation state.
Results and Discussion
To assess the influence of the different predictors on post-teleportation log-normalized RTs, we used a mixed-effects linear model similar to the one in Experiment 1 (random effects were by-participant intercepts and round-number slopes, and by-maze intercepts; fixed effects were model predictors). As summarized in Table 2, the Partial-Plan Divergence significantly predicted log-transformed RTs on post-teleportation trials. The planning-based predictors did not account for how quickly people reacted after being teleported.
Additionally, we conducted a separate analysis that included teleportation distance as a fixed effect in the full model. Note that unlike the planning and partial planning models, teleportation distance is not an explicit model of decision-making. In this new model, partial planning remains significant but is weakened. Additional details are included in the Supplementary Materials.
Thus, overall, the results of Experiment 2 suggest that people’s RTs are explained by planned information processing via partial planning and not simply by planning actions.
General Discussion
This paper asks two questions. First, why plan one’s information processing? We argue that meta-planning lets agents adaptively capitalize on the benefits of planning while regulating planning costs. To make this precise, we formalize the general notion of partial plans that prioritize planning in different parts of a simulated model and define an information-theoretic encoding cost for partial plans, enabling us to define a novel recursive Bellman objective that includes both task rewards and planning costs. This model provides a point of departure for future normative accounts of human meta-planning.
Second, do people plan their information processing? We reported two human experiments that test our formal account of planned information processing. Experiment 1 demonstrates that adaptive partial planning explains people’s initial reaction times when navigating parametrically generated 2D mazes. Experiment 2 used unexpected teleportations while navigating mazes to probe partial planning representations. The optimal partial plans generated by our model explain human responses even when accounting for action planning.
People plan because planning is useful. But, planning is hard, so people make planning easier by being selective about what and when they plan. In other words, people should plan their planning. For the most part, current decision-making algorithms plan, albeit with the help of good heuristics and abstractions provided by computer scientists. But, ideally, algorithms would learn how to make planning easier for themselves by planning their planning. Understanding planned use of computational resources can also provide insight into the nature and function of abstractions when learning. For instance, we performed hierarchical clustering over states in Four Rooms based on the similarity of their optimal partial plans (Figure 3; details in supplementary materials), which results in clusters resembling options from research on hierarchical reinforcement learning [32, 7, 23, 4]. In short, this work is an important step towards understanding the scale and sophistication of human meta-planning and applying such insights to the design of machines.
The authors would like to thank Daniel Reichman, Bill Thompson, Fred Callaway, and Rachit Dubey for their advice and feedback on this work. This research was supported by NSF grant #1544924, AFOSR grant #FA9550-18-1-0077, and grant #61454 from the John Templeton Foundation.
[1] (1987) The Expected-Outcome Model of Two-Player Games. Ph.D. Thesis, Columbia University, New York, NY, USA.
[2] (1957) Dynamic programming. Princeton University Press.
[3] (2007) Intertemporal choice - toward an integrative framework. Trends in Cognitive Sciences 11 (11), pp. 482–488.
[4] (2008) Hierarchical models of behavior and prefrontal function. Trends in Cognitive Sciences 12 (5), pp. 201–208.
[5] (1991) Elements of Information Theory. Wiley Series in Telecommunications, John Wiley & Sons, Inc., New York, USA.
[6] (1988) An analysis of time-dependent planning. In AAAI, Vol. 88, pp. 49–54.
[7] Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, pp. 227–303.
[8] (1996) Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review 103 (4), pp. 650–669.
[9] (2003) Equivalence notions and model minimization in Markov decision processes. Artificial Intelligence 147 (1-2), pp. 163–223.
[10] (2015) Rational Use of Cognitive Resources: Levels of Analysis Between the Computational and the Algorithmic. Topics in Cognitive Science 7 (2), pp. 217–229.
[11] PsiTurk: an open-source framework for conducting replicable behavioral experiments online. Behavior Research Methods 48 (3), pp. 829–842.
[12] (1968) A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics 4 (2), pp. 100–107.
[13] (1990) Computation and action under bounded resources. Ph.D. Thesis, Stanford University, California.
[14] (2010) Hierarchical Planning in the Now. Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence.
[15] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[16] (1990) Real-time heuristic search. Artificial Intelligence 42 (2), pp. 189–211.
[17] (1951) The problem of serial order in behavior. In Cerebral mechanisms in behavior; the Hixon Symposium, pp. 112–146.
[18] (1995) On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp. 394–402.
[19] (1998) PDDL-the planning domain definition language.
[20] (1958) Elements of a theory of human problem solving. Psychological Review 65 (3), pp. 151–166.
[21] (2013) Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 469 (2153), 20120683.
[22] (2015) Information-Theoretic Bounded Rationality.
[23] (1998) Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, pp. 1043–1049.
[24] (1984) Heuristics: intelligent search strategies for computer problem solving. Addison-Wesley.
[25] (1994) Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc.
[26] (1984) Choosing between movement sequences: A hierarchical editor model. Journal of Experimental Psychology: General 113 (3), pp. 372–393.
[27] (2012) Trading Value and Information in MDPs. pp. 57–74.
[28] (1995) Provably Bounded-Optimal Agents. Journal of Artificial Intelligence Research 2, pp. 575–609.
[29] (1974) Planning in a hierarchy of abstraction spaces. Artificial Intelligence 5 (2), pp. 115–135.
[30] (1948) A mathematical theory of communication. Bell System Technical Journal 27 (3), pp. 379–423.
[31] (2014) Optimal Behavioral Hierarchy. PLoS Computational Biology 10 (8), e1003779.
[32] (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211.
[33] (2011) Information Theory of Decisions and Actions. In Perception-Action Cycle, pp. 601–636.
[34] (2003) Temporal construal. Psychological Review 110 (3), p. 403.
[35] (1974) Judgment under uncertainty: heuristics and biases. Science 185 (4157), pp. 1124–1131.
Solving Planning to Plan
Solving the planning-to-plan objective allows us to understand its qualitative features as well as derive predictions of human decision-making. Algorithm 1 describes a gradient-based procedure for solving this problem, and conveys the main ideas of our account procedurally. The output of an inner loop of partial planning (lines 4 to 9) is evaluated based on the information-theoretic cost of plans (lines 10 and 11) and sequential decision-making rewards (line 12). This is nested within an outer Planning to Plan loop (lines 3 to 16) that optimizes the partial plans.
Deriving Option-like Representations
Our account concerns planning, but also sheds light on the nature of representations that facilitate good learning. We examined how option-like representations can emerge from the optimal partial planning process by clustering states based on their planning similarity.
To examine how the plans at different states in Four Rooms relate to one another, we calculated a symmetric planning distance for each pair of states. Specifically, for each pair of ground states and , we calculated
We then performed hierarchical clustering using Ward’s method on the resulting distance matrix. Figure 4 shows the results of hierarchical clustering. Figure 5 shows the largest three clusters. The first cluster is the room containing the goal, while the other two clusters each contain one of the two intermediate rooms and half of the starting room. Although our goal here is to better understand meta-planning, we note that these clusters are highly reminiscent of options used in hierarchical reinforcement learning.
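As an illustration of the first step, the sketch below builds a symmetric distance matrix between partial plans. It assumes a symmetrized KL divergence (the sum of the two directed divergences, each summed over simulated states); this is one common symmetric choice and may differ from the distance actually used. The three plans and all function names are hypothetical.

```python
import math


def kl_bits(p, q):
    """KL divergence D(p || q) in bits for strictly positive q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def plan_divergence(plan_a, plan_b):
    """Sum over simulated states of the KL divergence between action distributions."""
    return sum(kl_bits(row_a, row_b) for row_a, row_b in zip(plan_a, plan_b))


def symmetric_distance(plan_a, plan_b):
    """Symmetrized divergence (assumed form): D(a || b) + D(b || a)."""
    return plan_divergence(plan_a, plan_b) + plan_divergence(plan_b, plan_a)


def distance_matrix(plans):
    """Pairwise symmetric distances between the plans made at each ground state."""
    return [[symmetric_distance(a, b) for b in plans] for a in plans]


# Hypothetical partial plans made from three ground states (2 simulated states each).
plans = [
    [[0.9, 0.1], [0.6, 0.4]],
    [[0.8, 0.2], [0.7, 0.3]],
    [[0.2, 0.8], [0.5, 0.5]],
]
D = distance_matrix(plans)
for row in D:
    print([round(d, 3) for d in row])
```

A matrix like `D` could then be handed, in condensed form, to a hierarchical clustering routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"`, keeping in mind that Ward's method is strictly defined for Euclidean distances.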
Experiment 2 Additional Analysis
|Predictor||LL Ratio||SE|
|A* Distance to Goal|
|A* Node Difference|
|Optimal Path Length|
For Experiment 2, we conducted a secondary analysis in which the full mixed-effects linear model included the Partial-Plan Divergence, A* Distance to Goal, A* Node Difference, Optimal Path Length, and teleportation distance. Unlike the other metrics, which are based on a planning model, the teleportation distance metric is derived by taking the Euclidean distance between the pre-teleportation and post-teleportation states. In this model, teleportation distance is strongly predictive, suggesting that future work is needed to disentangle the contributions of planning, meta-planning, and other processes in predicting human RTs.