Many decisions involve choosing among uncertain courses of action in deep and wide decision trees, as when we plan to visit an exotic country for vacation. In these cases, exhaustive search for the best sequence of actions is intractable due to the large number of possibilities and the limited time or computational resources available to make the decision. Therefore, planning agents need to balance breadth (exploring many actions at each level of the tree) and depth (exploring many levels of the tree) to optimally allocate their finite search capacity. We provide efficient analytical solutions and numerical analysis for the problem of allocating finite sampling capacity in one shot over large decision trees. We find that in general the optimal policy is to allocate few samples per level so that deep levels can be reached, thus favoring depth over breadth search. In contrast, in poor environments and at low capacity, it is best to sample branches broadly at the cost of not sampling deeply, although this policy is only marginally better than deep allocations. Our results provide a theoretical foundation for the optimality of deep imagination for planning and show that it is a generally valid heuristic that could have evolved from the finite constraints of cognitive systems.
When we plan our next vacation to an exotic paradise, we decide on a course of actions that has a tree structure: first, choose a country to visit, then the city to stay in, then which restaurant or show to go to, and so on. Planning is a daunting problem because the number of scenarios that could be considered grows exponentially with the depth and width of the associated decision tree. The dilemma that arises is how to allocate limited search resources over large decision trees: should we consider many countries for our next vacation (breadth), at the cost of not evaluating any of them very thoroughly, or should we consider very few countries more deeply (depth), at the risk of missing the most exciting one? The above problem is one example of the so-called breadth-depth (BD) dilemma, important in tree search algorithms [21, 23], optimizing menu designs, decision-making [29, 38, 53], knowledge management and education.
Optimizing BD tradeoffs in decision trees is a hard problem due to the combinatorial explosion of states with their depth (number of levels) and width (number of actions per node). Many approaches that work in relatively small trees do not scale well to large decision trees, where BD tradeoffs will be most relevant. For instance, optimal policies in decision trees can be found by solving the Bellman equation using backwards induction, but exact induction is intractable in very large trees due to the exponential growth of the number of states with the depth of the tree. Monte Carlo tree search algorithms approximate the optimal solution by efficiently exploring promising tree nodes, but these methods assign in the long run a non-zero sampling probability to every available action, and thus they do not scale well in very wide decision trees. Meta-reasoning approaches extend the notion of actions to any internal action that can update the state of knowledge of the agent, such as expanding a node in a decision tree and simulating its value [41, 11, 13, 5, 45, 22], but as they are formally identical to dynamic programming [10, 31], exact inference is extremely expensive in large trees.
While the above approaches will sample all tree nodes exhaustively in the long run, exhaustive search might be prohibitive, unnecessary or both. First, agents are characterized by having finite capacity [41, 11, 13, 29, 33, 26], and thus in practice any algorithm needs to be aware of the limited resources available. Second, not every action in a node needs to be sampled in order to achieve relatively high performance, and thus in practice many actions might be ignored from the very outset of the planning process. For example, in economic choices the first relevant decision is to select a small consideration set out of the many options available and then make a choice within the smaller set, a heuristic that pervades human behavior [19, 49, 40, 27, 42, 43]. Many other situations are best characterized by the availability of compound actions, where a myriad of simple actions can be performed in parallel with little or no interaction between them. Examples range from cognitive systems, where millions of neurons can perform independent computations in parallel, over investing, where money can be divided and allocated in a combinatorial number of ways, to social decisions. In all these cases, exhaustive exploration of all possible actions and levels is prohibitive.
Optimization of BD tradeoffs has been studied using the framework of infinitely many-armed bandits and combinatorial multi-armed bandits, where finite resources can be arbitrarily allocated among many options. These include one-shot infinitely many-armed Bernoulli and Gaussian bandits with compound actions, sequential infinitely many-armed Bernoulli bandits and broader families thereof with simple actions, and sequential combinatorial multi-armed bandits with compound actions. These studies show that it is indeed optimal to ignore the vast majority of options while focusing sampling on a relatively small number of them that scales sublinearly with capacity [29, 38]. However, the described optimal BD tradeoffs have been limited to trees of depth one, and thus how to balance breadth and depth search in deeper decision trees remains an unresolved problem.
In this paper we characterize the optimal sampling policies in model-based planning for the allocation of finite search capacity over a large, stochastically and binarily rewarded decision tree (Fig. 1). Rewards resulting from visiting the tree nodes are unknown and can be learned by sampling them, but as only a finite, possibly low, number of samples is available, the agent needs to determine the best way to allocate them over the nodes of the tree. The agent, if desired, could allocate many samples in the first levels, but then search capacity would be exhausted without reaching deep into the tree (breadth search; Fig. 1a), or could allocate few samples per level such that the tree can be sampled deeply (depth search; Fig. 1a), or anything in between. We consider the problem of allocating samples in one shot, without knowing their individual outcomes, corresponding to a single compound action that consists of many simple actions to be executed in parallel. One-shot allocations describe situations where the dispatching of sampling resources needs to be made before feedback is received. Even if the assumption of long delays does not hold, our framework will still be relevant when it is better to use simpler allocation strategies that are agnostic to feedback and thus avoid computational overload. In the trip example, a one-shot allocation policy would correspond to sampling countries, cities, etc. in magazines or books, independently of each other during some period of time, having decided beforehand how many countries, cities, etc. will be sampled. Once all the information is acquired, the agent can choose the best course of actions. Thus, optimal allocations are sampling policies that maximize the probability of finding the best course of actions starting at the root of the tree by using only the information obtained from the samples.
We describe the optimal sampling policy over large decision trees as a function of the capacity of the agent and the difficulty of obtaining rewards. We develop an efficient diffusion-maximization algorithm for the exact evaluation of the search policies with computational cost of order $O(D^2 \log b)$, where $D$ is the number of levels of the decision tree and $b$ is its branching factor, much better than the $O(b^D)$ scaling of backwards induction on the tree itself. We find that it is generally better to sample the decision tree very deeply, such that information over many levels can be gathered, a policy that we call deep imagination, in analogy to human imagination. We find that the optimal number of actions that are explored per node is just two in most conditions, thus leading to a vast options-narrowing effect by which most available actions per node are ignored from the outset of the planning process. Regardless of capacity, in rich environments it is best to allocate samples deeply into many levels, such that depth is favored over breadth, and departures from the optimal policy result in large performance impairments. In poor environments at low capacity, it is best to sample branches broadly at the cost of not sampling deeply, although this policy is very often only marginally better than deep allocations. Altogether, our results provide a theoretical foundation for the optimality of deep imagination for model-based planning in large decision trees, which will be discussed in relation to similar heuristics used in human planning.
A model for search in wide and deep decision trees with finite capacity
We consider a Markov Decision Process (MDP) that operates in two consecutive phases having different actions (Fig. 1b). The first phase is a learning or exploration phase, while the second one is an exploitation phase. In both phases, the underlying structure is a directed rooted tree $\mathcal{T}$ with $D$ levels and homogeneous branching factor, or out-degree, $b$. Thus, each parent node has exactly $b$ children, so that there are $b^d$ nodes at level $d$. Vertices in $\mathcal{T}$ correspond to nodes in the tree, with a total of $\sum_{d=0}^{D} b^d$ of them, and edges are links between parent and children nodes. In the first phase, an action consists of sampling in one shot a subset of nodes in $\mathcal{T}$ excluding the root node, denoted $\mathcal{S}$, which results in observing the associated random variables for each sampled node $i \in \mathcal{S}$. Based on the outcomes of the sampled nodes, the agent can update their belief about the expected rewards $r_i$ resulting from visiting them, for all $i \in \mathcal{S}$, while the expected reward resulting from visiting unsampled nodes remains unchanged. In the second phase, the agent solves an MDP over $\mathcal{T}$, where edges correspond to potential actions, and the expected reward resulting from visiting state $i$ in the tree is the $r_i$ updated (or not) in the first phase. Next we describe the above in further detail and provide a rationale for our modeling choices.
A relevant example is planning a trip to an exotic country: in the first step (root of the trees in Fig. 1) an agent can choose one out of $b$ different countries, from where they can choose one of $b$ different cities to visit in that country, from where they can choose one of $b$ different restaurants, and so on. The planning process can be divided into two phases (Fig. 1b). In the learning phase, the agent learns about which states would be more desirable. In this first phase, actions of the agent do not correspond to actually visiting the nodes of the tree. Rather, actions correspond to allocating ‘samples’ over certain nodes, resulting in observations that the agent can use to update their belief about the expected reward, $r_i$, of visiting those nodes. For instance, the agent first gathers information about countries, hotels, etc., by using external (e.g., books) and internal (e.g., memory recollections) information, which results in an update of the expected rewards resulting from actually visiting those states. This information is used in the exploitation (second) phase to design the best course of actions and commence the trip.
In the learning phase, we assume that the agent has a finite search capacity, modeled as a finite number $C$ of samples that can be allocated over the tree (Fig. 1b, brown panel). The most interesting scenario corresponds to $C$ much smaller than the total number of nodes, when the agent can only sample a small fraction of the nodes in a large decision tree. Thus, the agent's action set equals all possible allocations of the $C$ samples over the graph excluding the root node. Formally, every node $i$ has an associated binary variable $n_i$, indicating whether the node has been sampled, $n_i = 1$, or not, $n_i = 0$. Note that we assume that nodes can be sampled at most once, and that the finite capacity constraint imposes $\sum_i n_i \le C$. Then, the action set can be expressed as the set of allocations $(n_1, n_2, \ldots)$ satisfying this constraint. The nodes with $n_i = 1$ define the subset of sampled nodes $\mathcal{S}$. Finite sampling capacity models cognitive and time limitations of the agent, which preclude a full exhaustive search over all the nodes.
We assume that the agent allocates all samples at once, that is, without knowing the feedback from the samples. Thus, we consider ‘one-shot’ allocation policies, which model situations where feedback from the samples arrives with delays longer than the duration of the allocation process (related to explore-then-commit policies). Many relevant allocation problems are well described by this framework, such as dividing search time to plan a trip, allocating neurons and wiring to different brain areas or cognitive functions during brain development, or dividing a budget into several research programs or vaccines. One-shot policies are not optimal if the agent is allowed to sample nodes sequentially, one by one, based on immediate feedback. However, as we show below, optimal one-shot policies strongly favor depth over breadth search, and including feedback is expected to further favor depth search, as some tree branches can be pruned early on in the planning process. Therefore, restricting ourselves to one-shot strategies entails a conservative stance for studying whether optimal policies favor depth search.
The result of sampling a node is to gain information about the expected reward when visiting the node, which will be used in the exploitation phase to optimize the course of actions. We assume that, before sampling starts, the expected reward of any state is zero, $r_i = 0$. A non-zero average reward can easily be introduced by adding a constant offset to the rewards, independent of the policy. Effectively, we focus on reward excesses compared to a baseline that could result, for instance, from a default policy over which the agent will improve. Thus, with this definition, if the agent chose a path from the root to the leaves and navigated through it without having sampled any of the nodes before, the expected accumulated reward associated to such a course of actions would be zero. In the trip example, assuming zero expected rewards for all the states before sampling might imply that the agent does not have any initial preference for countries, exotic restaurants and so on. This situation is clearly extreme, as agents might have strong initial, overt preferences. However, strong preferences can only reduce the number of actions to be considered, and therefore will favor depth over breadth policies. Thus, once again, our initial no-preference assumption effectively entails a conservative stance.
When the agent chooses an allocation action, the graph is partitioned into the sampled and unsampled nodes, $\mathcal{S}$ and $\mathcal{U}$ (excluding the root node), respectively. The expected reward of an unsampled node, $i \in \mathcal{U}$, is not updated and thus it remains $r_i = 0$. For a sampled node, $i \in \mathcal{S}$, the belief about its expected reward is updated as follows: we assume that the outcome of sampling the node is to update $r_i$ from $0$ to $r_+ > 0$ with probability $p$ and to $r_- < 0$ with probability $1-p$, independently for each sampled node (see Fig. 1b, blue and red dots). Thus, a sampled node has expected reward $r_+$ with probability $p$ and $r_-$ with probability $1-p$, while an unsampled node keeps $r_i = 0$. We enforce the condition that the average over updated expected rewards equals zero, that is, $p\, r_+ + (1-p)\, r_- = 0$, such that sampling a node does not result in net reward or loss. We call this condition the ‘zero-average constraint’, which can be satisfied by taking $r_+ = 1$ without loss of generality and then using $r_- = -p/(1-p)$. If the zero-average constraint were not satisfied, we would violate the basic assumption that sampling by itself cannot create or annihilate reward. That is, sampling can change our state of knowledge but not the state or rewards of the world. One way to think about this process is by considering samples as internal ‘actions’ acting over our memory, so that they serve to recall or imagine whether some type of food or city would be desirable [39, 46]. Clearly, this process does not change the state or the rewards of the world, although it will be critical to build our preferences. It is important to note that the probability $p$ of a high reward in a sampled node measures the overall richness of the environment, and thus how easy it is to find a sampled node with positive expected reward $r_+$. Therefore, ‘rich’ environments correspond to high $p$ and ‘poor’ environments correspond to low $p$.
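As a minimal numerical illustration of the zero-average constraint (the helper name below is ours, not from the text), the low reward implied by a given $p$ and $r_+ = 1$ follows directly from $p\, r_+ + (1-p)\, r_- = 0$:

```python
def low_reward(p, r_high=1.0):
    """Low reward r_- implied by the zero-average constraint
    p * r_high + (1 - p) * r_low = 0."""
    return -p * r_high / (1.0 - p)

# For p = 1/2 the rewards are symmetric: r_+ = 1, r_- = -1.
# Rich environments (large p) imply a strongly negative r_-,
# poor environments (small p) only a mildly negative one.
```

For instance, `low_reward(0.5)` gives `-1.0`, the symmetric case used throughout the analytical treatment below.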
Once the expected rewards have been updated, the optimal path (Fig. 1b, green path) is computed, which corresponds to the one that has the highest expected accumulated reward based on the observations from the samples. Specifically, in the exploitation phase the decision problem forms a standard MDP, where states correspond to nodes in the graph, actions correspond to edges of the graph, the learned rewards $r_i$ correspond to the actual expected rewards that result from visiting state $i$, and the transition function between states after an action is made is deterministic. The agent starts at the root node, corresponding to the zeroth level, and takes an action, which results in a deterministic transition to the chosen child node in the first level and the acquisition of a reward with expected value equal to that node's $r_i$. Recursively, from a node at level $d$, the agent can choose a new action resulting in a transition to one of its children nodes in level $d+1$ and the acquisition of a reward with the corresponding average. At the last level $D$, there are no possible actions, and thus leaves correspond to terminal states. Given the learned expected rewards, the optimal course of actions is found by using backwards induction. As we will see, the optimal set of sampled nodes forms a much smaller tree than the original one due to the finite sampling capacity, and then backwards induction over the reduced tree becomes tractable.
The overall goal of the agent is to determine the best policy to allocate samples in order to maximize the expected accumulated reward of the optimal path, which implies balancing breadth and depth search: should the agent allocate samples broadly over a few levels, or should it allocate few samples per level so that the tree can be sampled deeply?
Value estimation and optimal sample allocations
We first introduce exhaustive allocation policies, which effectively ignore finite capacity by sampling all nodes of a decision tree of depth $D$ and branching factor $b$, but are simpler to analyze and provide useful tools. We then introduce selective allocation policies, which allow the agent to select the number of sampled branches as well as the probability of drawing samples at each tree level, under the constraint that the number of allocated samples is on average a fixed capacity $C$. As we show below, selective allocations are rich enough to display a broad range of behaviors. For each policy we show how to compute its value, defined as the expected accumulated reward of the optimal path. To avoid cluttered text, we refer to expected rewards simply as rewards.
An exhaustive allocation policy fully samples all the nodes of a tree with depth $D$ and branching factor $b$. Here, we first compute the probability that an agent can find a path with accumulated reward equal to the depth $D$ in such a tree. After this, we calculate the value of playing such a tree, to develop a useful tool for the case where agents cannot exhaustively sample all nodes.
We first show that, in general, it is not possible to find a path with all visited nodes having a positive reward. Hence, an optimal path is likely to find a blocked node, that is, a node where all possible actions lead to negative reward, and thus extreme optimism cannot be guaranteed. By assuming that the reward in a node takes value $r_+ = 1$ with probability $p$ and setting $r_- = -p/(1-p)$ (which is negative) such that the zero-average constraint is satisfied, the event of finding a path with all positive rewards corresponds to the event that the accumulated reward of the optimal path equals the depth of the tree. We denote the accumulated reward of the optimal path in a tree of depth $D$ by $V_D$, and thus we ask for the probability $P(V_D = D)$. If the tree has depth $1$ and branching factor $b$, then $P(V_1 = 1) = 1 - (1-p)^b$. This expression follows from the fact that there are $b$ possible actions, and the probability that none of those actions leads to a reward equal to $1$, and thus that the root is blocked, is $(1-p)^b$.
For deeper trees we make use of the quantity $q_{D+1}$, known as the action-value, defined as the accumulated reward obtained by first choosing one of the $b$ branches and collecting the immediate reward $r$, and then choosing the best sequence of branches in the remaining $D$ levels to collect accumulated reward $V_D$, that is, $q_{D+1} = r + V_D$. Note that in principle there are $b$ different action-values $q_{D+1}$, one per branch, but as all of them are statistically indistinguishable, an index is not made explicit (the same happens for the rewards $r$). Using this relationship we find

$P(V_{D+1} = D+1) = 1 - \left(1 - P(q_{D+1} = D+1)\right)^b = 1 - \left(1 - p\, P(V_D = D)\right)^b. \qquad (1)$
The first equality in Eq. (1) comes from the fact that to get an accumulated reward $V_{D+1} < D+1$ it is necessary that none of the $b$ possible actions from the root node leads to $q_{D+1} = D+1$, and that each of those events is statistically independent. The second equality comes from the fact that $P(q_{D+1} = D+1) = p\, P(V_D = D)$, which is the probability that a particular action from the root node is followed by a state with reward $r_+ = 1$, which has probability $p$, and afterwards followed by an optimal path with accumulated reward $V_D = D$, which has probability $P(V_D = D)$.
We can use the above expression to find cases where the probability of having optimal paths with accumulated reward $D$ approaches zero as $D$ increases. Writing $x_D \equiv P(V_D = D)$, for $p = 1/2$ and $b = 2$, using Eq. (1) we obtain $x_1 = 3/4$ and $x_{D+1} = x_D - x_D^2/4$ for $D \ge 1$. We see that $x_D \to 0$ as $D$ grows, as the only solution to the fixed-point equation $x = x - x^2/4$ is $x = 0$. Therefore, the probability that the agent finds a blocking node tends to one as the tree depth increases. For any branching factor $b$ and probability $p$, the fixed-point equation for large $D$ becomes $1 - x = (1 - px)^b$. As the rhs is convex in $x$, positive, and has its maximum at $x = 0$, the fixed-point equation has a non-zero solution only when the rhs' slope at the origin is smaller than $-1$, that is, when $pb > 1$. Therefore, if $p$ decreases, a large enough $b$ still ensures a non-zero probability of finding an optimal path with accumulated reward equal to the tree depth. In contrast, if $pb \le 1$, then the probability that the path is blocked by nodes having negative rewards tends to one.
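The recursion $x_{D+1} = 1 - (1 - p\,x_D)^b$ and its fixed-point behavior can be checked numerically. The sketch below is our own illustration, not the authors' code:

```python
def prob_full_reward_path(p, b, depth):
    """Probability x_D that the optimal path collects the maximal
    accumulated reward D in a tree of the given depth and branching
    factor b, via the recursion x_{D+1} = 1 - (1 - p*x_D)**b."""
    x = 1.0 - (1.0 - p) ** b          # depth-1 tree: 1 - (1-p)^b
    for _ in range(depth - 1):
        x = 1.0 - (1.0 - p * x) ** b  # Eq. (1)
    return x

# p*b <= 1: the probability decays to zero with depth.
# p*b > 1: it converges to the non-zero fixed point of 1-x = (1-p*x)^b.
```

For example, with $p = 1/2$ and $b = 2$ (so $pb = 1$) the probability slowly vanishes with depth, whereas with $p = 1/2$ and $b = 4$ (so $pb = 2 > 1$) it settles at a non-zero fixed point.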
After establishing that extreme optimism is not always guaranteed, we turn to the problem of finding the value of playing the tree with $D$ levels and branching factor $b$, defined as the expected accumulated reward of the optimal paths over such a tree. We provide here the analytical solution for $p = 1/2$ and describe the more general analytical solution, valid for $p = 1/n$ and $p = 1 - 1/n$ where $n$ is a positive integer, in Sec. 3 of the Methods.
For simplicity and without loss of generality we set $r_+ = 1$ and $r_- = -1$, with probabilities $p = 1 - p = 1/2$, which satisfies the zero-average constraint. Thus, the accumulated reward of a path following a sequence of actions through the tree with $D$ levels can take the values $\{-D, -D+2, \ldots, D\}$. The size of this set is of order $D$, which allows us to compute the value of any tree of depth $D$ in polynomial time. We first compute the probability distribution of the value of playing a tree of depth $1$, and then compute the distribution for a tree of depth $D+1$ recursively from that of depth $D$. Above we showed that $P(V_1 = 1) = 1 - (1/2)^b$ for a tree of depth $1$. Thus, the value of playing such a tree is the average of $V_1$ over sampling outcomes, which equals $E[V_1] = 1 - 2^{1-b}$.
Our algorithm is based on alternating diffusion and maximization steps as follows. To find the distribution of $V_{D+1}$ from that of $V_D$, we first recall that the action-value $q_{D+1}$ is defined as the accumulated reward obtained by taking one action at the root, collecting reward $r$, and then following the optimal path in a tree with $D$ levels. Written as $q_{D+1} = r + V_D$, it has probabilities

$P(q_{D+1} = v) = \frac{1}{2} P(V_D = v-1) + \frac{1}{2} P(V_D = v+1). \qquad (2)$
This mapping from $V_D$ to $q_{D+1}$ is a diffusion step, as each state $v$ of $V_D$ diffuses to the higher, $v+1$, and lower, $v-1$, states of $q_{D+1}$ with probability $1/2$ each. We recognize the first identity in Eq. (2), evaluated at $v = D+1$, as the probability that a chosen action followed by the optimal path over a tree with $D$ levels leads to an accumulated reward of $D+1$ for the case $p = 1/2$, as discussed above.
The diffusion step is followed by the maximization step, which maps $q_{D+1}$ into $V_{D+1}$ by

$P(V_{D+1} \le v) = \left(P(q_{D+1} \le v)\right)^b \qquad (3)$
for $v \in \{-(D+1), \ldots, D+1\}$. Eq. (3) represents a maximization step because the agent will choose the best action out of the $b$ available actions, and it expresses that the probability that $V_{D+1}$ is at most $v$ equals the probability that every one of the $b$ actions has a value of at most $v$.
In summary, iterating the diffusion and maximization steps in Eqs. (2,3) with the initial conditions $P(V_1 = \pm 1)$ given above allows us to compute the value of playing a tree with $D$ levels and $b$ branches as $E[V_D] = \sum_v v\, P(V_D = v)$. The number of operations required to determine the value of such a tree is $O(D^2 \log b)$, as the diffusion step requires $O(D^2)$ operations due to the presence of $D$ levels and $O(D)$ different states at each level, and the maximization step involves $O(\log b)$ operations for each state in the calculation of $b$-th powers. In contrast, a direct solution of the problem using dynamic programming requires $O(b^D)$ operations. This is because the complexity is dominated by the number of nodes in the level before the last one, where there are $b^{D-1}$ nodes, and $b$ operations are needed in each one to solve the max operator before implementing backwards induction. In addition, this complexity of dynamic programming does not take into account the additional need to average over the samples' outcomes, while the diffusion-maximization method in Eqs. (2,3) provides the exact expected value of playing the tree.
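For $p = 1/2$, the diffusion-maximization iteration can be implemented in a few lines. The following sketch is our own illustration (not the authors' code); it propagates the full distribution of $V_D$ and returns the value $E[V_D]$:

```python
from collections import defaultdict

def tree_value(depth, b):
    """Expected accumulated reward E[V_D] of the optimal path in a fully
    sampled tree with `depth` levels, branching factor b, and rewards
    +/-1 with probability 1/2 each (diffusion-maximization iteration)."""
    probs = {0: 1.0}                      # distribution of V_0
    for _ in range(depth):
        # Diffusion step: q = r + V, with r = +/-1 w.p. 1/2 each.
        q = defaultdict(float)
        for v, pv in probs.items():
            q[v - 1] += 0.5 * pv
            q[v + 1] += 0.5 * pv
        # Maximization step: P(V' <= v) = P(q <= v)**b over b branches.
        cdf, prev, probs = 0.0, 0.0, {}
        for v in sorted(q):
            cdf += q[v]
            cb = cdf ** b
            probs[v] = cb - prev
            prev = cb
    return sum(v * pv for v, pv in probs.items())
```

A depth-$1$ tree with $b = 2$ gives $1 - 2^{1-b} = 0.5$, as derived above, and the cost grows only polynomially with depth, so trees far too large for backwards induction remain tractable.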
We have studied the value of playing trees as a function of $D$, $b$ and $p$, using the diffusion-maximization method in Eqs. (2,3) for $p = 1/2$ and Eqs. (9,10) and (13,14) in the Methods for the rational values $p = 1/n$ and $p = 1 - 1/n$ with positive integer $n$. In all cases, the zero-average constraint is satisfied by setting $r_+ = 1$ and $r_- = -p/(1-p)$. The analytical predictions allow us to study very deep trees at little numerical cost, where the number of nodes is astronomically large. In contrast, such sizes are prohibitive for backwards induction or Monte Carlo simulations. The value of playing a tree grows monotonically with both its depth and breadth (Fig. 2a), as a tree with a smaller depth or breadth is a subtree that can only have a value equal to or smaller than that of the original tree. Asymptotically, the value grows with unit slope and runs parallel to and below the diagonal line (dashed line), which constitutes the highest possible value of any tree, as no tree can have a value above it given our choice $r_+ = 1$. With larger $b$, the value runs closer to the diagonal. The value of the tree also grows monotonically with the probability $p$ of finding high expected reward nodes (Fig. 2b).
Now we turn to the central problem of how to optimally sample an infinitely large tree with finite sampling capacity $C$. Assuming a tree with an infinite number of levels and infinitely many branches per node allows us to consider any possible sampling allocation policy that is solely constrained by finite capacity. As such a decision tree cannot be sampled exhaustively, we refer to the problem of allocating finite sampling capacity as ‘selective’ allocation. We restrict ourselves to a family of policies where the agent chooses the number of levels $D$ that will be considered as well as the number of branches $m$ per reached node that will be contemplated. Given finite capacity $C$, choosing a large $D$ will imply having to choose a small $m$, thus allowing the agent to trade breadth for depth. To provide more flexibility to the allocation policy, we also allow the agent to choose the probability $\gamma_d$ of independently allocating a sample to each node in level $D-d+1$ (note the reversed order, e.g., $\gamma_1$ refers to the last level $D$). Under this stochastic allocation policy, a node receives at most one sample or none, and thus the allocation is an independent Bernoulli process with sampling probability $\gamma_d$ in each node of the corresponding level. Note that here we have relaxed the hard capacity constraint to an average capacity constraint, which turns out to be easier to deal with and leads to a smoother analysis. We have observed through numerical simulations that results do not qualitatively differ between hard and average capacity constraints.
In the following, we first compute the value of sampling a tree of depth $D$ and branching factor $m$ with per-level sampling probabilities $\gamma_1, \ldots, \gamma_D$. The capacity constraint will be imposed afterwards simply by constraining $D$, $m$ and the $\gamma_d$ to be such that on average the number of allocated samples equals the capacity $C$. The algorithm is simply a generalization of the diffusion-maximization algorithm derived for exhaustive allocation in Eqs. (2,3), shown here for the case $p = 1/2$ and generalized in Sec. 3 of the Methods to other rational probabilities.
In contrast to exhaustive allocation, when using selective allocation some nodes might not be sampled, as $\gamma_d < 1$ in general, and thus their expected reward will remain $0$. As before, sampled nodes have rewards $\pm 1$, each with probability $1/2$. Therefore, the value of a depth-$1$ tree is in the set $\{-1, 0, 1\}$. To compute the expectation of $V_1$ we note that the action-value $q_1$ of each branch (leaf) takes values $1$, $0$ and $-1$ with probabilities $\gamma_1/2$, $1-\gamma_1$ and $\gamma_1/2$, respectively, which follows from the facts that the node is sampled with probability $\gamma_1$, that if it is sampled then its expected reward is $\pm 1$ with probability $1/2$ each, and that if it is not sampled then its expected reward is $0$. As $m$ branches are available, each with the same independent distribution of action-values, the value $V_1$ has cumulative probabilities $P(V_1 \le v) = P(q_1 \le v)^m$, which results in $P(V_1 = -1) = (\gamma_1/2)^m$, $P(V_1 = 1) = 1 - (1 - \gamma_1/2)^m$, and $P(V_1 = 0) = (1 - \gamma_1/2)^m - (\gamma_1/2)^m$.
To compute the distribution of $V_{D+1}$ recursively from that of $V_D$, we first relate $q_{D+1}$ with $V_D$. Since the action-value can be written as $q_{D+1} = r + V_D$, where $r$ is the reward in a node in level $D+1$ (counting levels from the bottom), the diffusion step takes the form

$P(q_{D+1} = v) = \frac{\gamma_{D+1}}{2} P(V_D = v-1) + (1-\gamma_{D+1})\, P(V_D = v) + \frac{\gamma_{D+1}}{2} P(V_D = v+1) \qquad (4)$
The diffusion step is followed by the maximization step

$P(V_{D+1} \le v) = \left(P(q_{D+1} \le v)\right)^m \qquad (5)$
for $v \in \{-(D+1), \ldots, D+1\}$. Iterating the diffusion and maximization steps in Eqs. (4,5) with the initial conditions described above allows us to compute $E[V_D]$, which is the value of playing a tree of depth $D$, branching factor $m$ and per-level sampling probabilities $\gamma_1, \ldots, \gamma_D$.
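The generalized iteration for $p = 1/2$ differs from the exhaustive one only in the diffusion kernel, which gains a central term weighted by $1-\gamma_d$ because unsampled nodes contribute reward $0$. The following sketch is again our own illustration:

```python
from collections import defaultdict

def selective_value(gammas, m):
    """Value E[V_D] of a selectively sampled tree with branching factor m
    and p = 1/2. `gammas` lists the per-level sampling probabilities from
    the LAST level up to the first (the reversed indexing of the text)."""
    probs = {0: 1.0}
    for g in gammas:
        # Diffusion step: q = r + V with r in {-1, 0, +1},
        # probabilities g/2, 1-g, g/2.
        q = defaultdict(float)
        for v, pv in probs.items():
            q[v - 1] += 0.5 * g * pv
            q[v] += (1.0 - g) * pv
            q[v + 1] += 0.5 * g * pv
        # Maximization step: P(V' <= v) = P(q <= v)**m over m branches.
        cdf, prev, probs = 0.0, 0.0, {}
        for v in sorted(q):
            cdf += q[v]
            cm = cdf ** m
            probs[v] = cm - prev
            prev = cm
    return sum(v * pv for v, pv in probs.items())
```

With all sampling probabilities equal to one this reduces to the exhaustive value, and with a single level it reproduces the depth-$1$ probabilities given above.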
We now turn to the problem of optimizing $D$, $m$ and the $\gamma_d$ under the finite capacity constraint. In practice, we can consider a fixed, large $D$ and optimize $m$ and the $\gamma_d$, such that we effectively assume that the sampling probabilities are zero above some depth $D$. If $D$ is large enough this assumption does not impose any restriction, as the sampling probability can also be zero in levels shallower than the last considered level $D$. As the agent is limited by finite sampling capacity, both $m$ and the $\gamma_d$ are constrained by
$\sum_{d=1}^{D} m^d\, \gamma_{D-d+1} = C, \qquad (6)$

which states that the average number of sampled nodes in the subtree must be equal to the capacity $C$. The optimal $m$ and $\gamma_d$ are found by

$(m^*, \gamma_1^*, \ldots, \gamma_D^*) = \arg\max_{m,\, \gamma_1, \ldots, \gamma_D} E[V_D] \quad \text{subject to Eq. (6)}. \qquad (7)$
In addition to the optimal allocation policies in Eq. (7), which we call heterogeneous, we also consider a subfamily of selective allocations that we call homogeneous. In a homogeneous allocation policy, the sampling probability is one for all levels except, possibly, the last one, which is chosen to satisfy the finite capacity constraint. As shown below, homogeneous policies are close to optimal and are also simpler to study. In a homogeneous selective policy, as in exhaustive allocations, the only choice of the agent is the number of considered branches per reached node, $m$. Then, effectively, upon choosing $m$, the agent samples $m$ nodes in the first level, and from each of those the agent samples another $m$ nodes in the second level, and so on until capacity is exhausted at some depth $D$, which depends on $C$ and $m$. Possibly, not all resulting nodes in the last sampled level can be fully sampled. Defining $R$ as the remaining number of samples available when reaching the last sampled level $D$, each of the $m^D$ considered nodes there is given a sample independently with probability $R/m^D$, such that on average the total number of allocated samples equals $C$. More specifically, we focus on policies where $m$ is free, $\gamma_d = 1$ for $d > 1$, and $\gamma_1 = R/m^D$ (note again the reversed index), with $R = C - \sum_{d=1}^{D-1} m^d$. Within this family of allocation policies, the optimal policy is

$m^* = \arg\max_{m} E[V_D]. \qquad (8)$
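Under our reading of this construction, the homogeneous policy can be sketched as follows: given $C$ and $m$, fill levels with probability one until capacity runs out, then spread the residual $R$ over the last level (an illustration; the helper name is ours):

```python
def homogeneous_gammas(C, m):
    """Per-level sampling probabilities (listed from the FIRST level
    down, i.e., opposite to the reversed index of the text) of the
    homogeneous policy: gamma = 1 until capacity is exhausted, then the
    residual R is spread over the m**D nodes of the last sampled level."""
    gammas, remaining, nodes = [], C, m
    while remaining >= nodes:   # fully sample this level of m**d nodes
        gammas.append(1.0)
        remaining -= nodes
        nodes *= m
    if remaining > 0:           # partial last level: gamma = R / m**D
        gammas.append(remaining / nodes)
    return gammas

# By construction, the expected number of allocated samples,
# sum over levels d of gamma_d * m**d, equals C.
```

For example, `homogeneous_gammas(10, 2)` fully samples the first two levels (2 + 4 nodes) and samples the 8 third-level nodes with probability 0.5, for 10 samples on average.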
Optimal breadth-depth tradeoffs in allocating finite capacity
We now describe how optimal selective allocations depend on the sampling capacity $C$ and on the richness of the environment as measured by $p$. We start with homogeneous policies, which will be shown in the next section to be very close to optimal when compared to heterogeneous policies. Selective homogeneous allocations maximize the value of sampling selectively an infinitely broad and deep tree by optimizing the number of sampled branches $m$ (Eqs. 4,5,8). As capacity is constrained and the sampling probability is one except possibly for the last level, choosing a large $m$ implies reaching shallowly into the tree (Fig. 3b). Thus, optimal BD tradeoffs are reflected in the optimal number of considered branches. We find that the optimal number of branches is $m^* = 2$ for a rich environment (high $p$) regardless of capacity (Fig. 3c, left panel). Interestingly, we observe that choosing $m = 1$ or $m = 3$, the neighbor policies to the optimal $m^* = 2$, leads to a large reduction of performance, indicating that the benefit from correctly choosing the optimum is high. The optimal $m^* = 2$ favors exploring trees as deep as possible while keeping the possibility of choosing between two branches at each level. Indeed, the deepest possible policy, resulting from $m = 1$, is highly suboptimal (leftmost point in the left panel, and rightmost points in the right panel), as its expected accumulated reward equals zero due to the lack of freedom to select the best path.
For a poor environment (Fig. 3d; ), the optimal number of sampled branches is also when capacity is large (peak of red line), but as capacity decreases, increases. Thus, the optimal policy approaches pure breadth at low capacity, which entails exhausting all sampling resources in just the first level. We observe that in this case the dependence of the value of playing the tree with is very shallow when capacity is small (blue line), and therefore the actual optimal is quite loose.
The results for the two environments described above suggest that depth is always favored when capacity is large enough or whenever the environment is rich, while breadth is only favored at low capacities and for poor environments. Further, while optimal breadth policies can be quite loose, in that choosing the exact value of is not very important to maximize value, optimal depth policies are very sensitive to the precise value chosen, always very close to , such that variations of it cause large losses in performance. Exploration of a large parameter space confirms the generality of the above results (Fig. 4). In particular, the optimal number of sampled branches is for a very large region of the parameter space (Fig. 4b), while an optimal number of branches larger than occurs almost exclusively when is small () or capacity is small (). If the agent used a depth heuristic consisting of always sampling branches, then the loss incurred compared to the optimal would be around at most, but the region with significant deviations in performance concentrates at both low and values (Fig. 4c). Indeed, for a very large region of parameter space the loss is zero, because almost everywhere the optimal number of sampled branches equals or because the value of playing the tree is not very sensitive to . In contrast, using a breadth heuristic where the agent always uses is almost everywhere a very poor policy, as losses can reach close to or above in large regions of parameter space (Fig. 4d). Therefore, as an optimal strategy, depth dominates over breadth in larger portions of parameter space, and as a heuristic, depth generalizes much better than breadth.
Although the optimal policy is quite nuanced as a function of the parameters, a general intuition can be provided about why depth tends to dominate over breadth: exploring a tree allows agents to find paths with accumulated rewards bounded by the length of the path; thus, exploring more deeply leads to knowledge about potentially large reward excesses as compared to exploring less deeply and following a default policy afterwards. Although this effect seems to be the dominant one, being able to compare among many short courses of actions becomes optimal in poor environments when capacity is small, as it allows securing at least a good-enough accumulated reward.
Exploring further into the future is a slightly better policy
One important question is how much can be gained by giving the agent a larger degree of flexibility in allocating samples over the levels. In heterogeneous selective policies, the agent is free to choose the number of branches to be considered as well as the sampling probabilities for each of the levels (Eqs. 4, 5, 7). Therefore, in contrast to homogeneous selective policies, the agent can decide not to allocate samples to the first levels and reserve them for deeper levels. Our analysis, however, shows that this is not the best allocation policy, as optimal heterogeneous policies sample the first levels exhaustively, as homogeneous policies do (Fig. 5a). One important difference is that optimal heterogeneous policies explore further into the future than homogeneous policies. This is accomplished by using sampling probabilities decaying to zero in the last few sampled levels. This is in contrast to homogeneous policies, where only the last level is given, possibly, a sampling probability smaller than one. Thus, exploring slightly further into the future provides a surplus value of playing the tree (Fig. 5b, full lines), but it is only marginally better than the value obtained from homogeneous policies (dashed lines), which are much simpler to implement due to their fixed sampling probability structure. As in the case of homogeneous policies, heterogeneous policies attain their optimal value when the number of considered branches is , thus favoring depth over breadth search. Finally, we tested random policies where samples are allocated with the same probability to the nodes of the first layers of the tree until capacity is exhausted (dotted lines), and found that they are much worse than the optimal policies.
Agents with limited resources face breadth-depth tradeoffs when looking for the best course of actions in deep and wide decision trees. To gain information about the best course, an agent might allocate resources to sample many actions per level at the cost of not exploring the tree deeply, or allocate resources to sample the tree deeply at the risk of missing relevant actions. We have found that deep imagination is favored over breadth in a broad range of conditions, with very little balance between the two: it is almost always optimal to sample just a couple of actions per depth level such that the tree is explored as deeply as possible while sacrificing wide exploration. In addition, using depth as a heuristic for all cases incurs much smaller errors than assuming a breadth heuristic. We have provided analytical expressions for this problem, which allow us to study the optimal allocations in very large decision trees.
During planning, we very often picture the course of actions as an imaginary episode, from taking the plane to visiting the first museum, in a process that has been called imagination-based planning, model-based planning, mental simulation or emulation, each term carrying somewhat different meanings [8, 14, 30, 9, 48, 35, 18]. Imagination strongly affects choices through the availability of the imagined content, and it is used when the values of the options are unknown and thus preferences need to be built on the fly. However, imagination-based planning is slow and there is no evidence that it can run in parallel [15, 36], implying that, as an algorithm for exploring deep and wide decision trees, it might not be efficient. Indeed, very few courses of actions () are considered in our ‘minds’ before a decision is made [19, 49, 40, 27, 42, 43], and in some cases the imagined episodes can be characteristically long, as when playing chess, although their depth can be adapted to the current constraints and time pressure. In spite of this apparent clumsiness, deep imagination –the process of sampling a few long sequences of states and actions– might have evolved as the favored solution to breadth-depth tradeoffs in model-based planning under limited resources, against policies that sample many short sequences. Our results provide a theoretical foundation for the optimality of deep imagination in model-based planning by showing that it becomes the dominant strategy in one-shot allocations of resources over a broad range of capacity and environmental parameters. Recent deep-learning work has studied through numerical simulations how agents can benefit from imagining future steps by using models of the environment [17, 32, 55, 16], and thus our results might help to clarify and stress the importance of deep tree sampling through mental simulations of state transitions.
Deep imagination resembles depth-first tree search algorithms in that both favor deep over broad exploration [34, 23]. However, depth-first search starts by sampling deeply until a terminal state is found, but actually reaching a terminal state in very deep trees can be impractical, and even the notion of a terminal state might not be well-defined, as in continuing tasks. In very deep decision trees, such a strategy would imply sampling a single course of actions until exhaustion of resources, which is a highly suboptimal strategy, as we have shown (see Fig. 3 with ). Another family of search algorithms, called breadth-first search, and other approaches that give finite sampling probability to every action at each visited node, such as Monte Carlo tree search or -greedy reinforcement learning methods, scale poorly when the branching factor of the tree is very large, and thus they are impractical approaches for BD dilemmas. In contrast, deep imagination samples two actions per visited node until resources are exhausted, which allows selecting the best among a large number of paths, and at the same time constitutes an algorithm that is simple to implement and generalizes well. Due to finite capacity, any algorithm can only sample a large decision tree up to some finite depth, which leaves open the question of how the agent should act afterwards. Following the approach of plan-until-habit strategies [22, 45], we have assumed that agents follow a random, or default, strategy after the last sampled level of the tree, such that different allocation policies with different sampled depths and branching factors can be compared on an equal footing.
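The scaling advantage of sampling two actions per node can be illustrated with a toy computation (illustrative only; names and the example capacity are ours): for a given capacity it contrasts the number of complete paths a depth policy with two branches per node can compare against the one-step options a pure-breadth policy evaluates.

```python
def depth_vs_breadth(C):
    """For capacity C, compare a two-branches-per-node depth policy with
    pure breadth: how many courses of actions each can discriminate, and
    how long those courses are (toy sketch)."""
    used, D = 0, 0
    while used + 2 ** (D + 1) <= C:   # fully sample 2 branches per node
        D += 1
        used += 2 ** D
    # depth policy: 2**D paths of length D; breadth: C paths of length 1
    return {"depth_paths": 2 ** D, "depth_len": D,
            "breadth_paths": C, "breadth_len": 1}
```

With `C = 62` the depth policy compares 32 paths of length 5, whose accumulated rewards can be up to five times larger than those of the 62 one-step options evaluated by the breadth policy.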
One important assumption in our work is the one-shot nature of the sample allocation. Many important decisions have delayed feedback, like allocating a funding budget to vaccine companies, choosing a college, or planning a round of interviews for a faculty position, and thus they are well modeled as one-shot finite-resource allocations [19, 40, 27]. However, other decisions involve quicker feedback, and then the allocation of resources could be adapted on the fly. Although our results are yet to be extended to sequential problems where a compound action is to be made at every step, we conjecture that such an extension will not substantially change the close-to-optimality of deep sampling, although a bias towards more breadth is expected. Further, pre-computing allocation strategies at design time and using them afterwards might lift the burden of performing heavy online computations that would require complex tree expansion in large state spaces. Thus, by hard-wiring these strategies, much of the overload caused by meta-reasoning [41, 11, 13] could be alleviated, allowing agents to use their finite resources for the tasks that change on a faster time scale. Finally, it is important to note that, in contrast to many experimental frameworks on binary choices or a very low number of options [12, 7, 10, 24] and games [47, 1], where the number of actions is highly constrained by design, realistic decisions face too many immediate options for all of them to be considered [19, 49, 40, 43], and thus a first decision that cannot be deferred is how many of them to focus on in the first place [29, 38, 24, 20]. All in all, the optimal BD tradeoffs that we have characterized here might play an important role even in cases that substantially depart from our modeling assumptions.
In summary, we have provided a theoretical foundation for deep imagination as a close-to-optimal policy for allocating finite resources in wide and deep decision trees. Many of the features of the optimal allocations described here can be tested by parametrically controlling the available capacity of agents and the properties of the environment, using experimental paradigms similar to those recently developed, which constitutes a relevant future direction.
1 Bellman–Monte Carlo simulations
The exact values of playing the tree for a subset of rational values of are computed using the diffusion-maximization algorithm. For probabilities of positive rewards not in that set, we can estimate the value by Bellman–Monte Carlo simulations. We first sample each node in the tree (except the root node) to determine the reward associated with it, which is with probability and with probability . We take and to satisfy the zero-average constraint. Based on the learned -s, we compute the value of the tree by using backward induction from the leaf nodes until reaching the root node. Specifically, the leaf nodes have value . Recursively, going backwards, the value of a node at depth is computed from the values of its children nodes at depth as . The value of playing the tree with the specific realization of the -s is the value of the root node computed that way. The value of playing the tree is the average value over a large number of realizations of the -s, as indicated in the corresponding figures.
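A minimal sketch of this Bellman–Monte Carlo procedure is given below, assuming for illustration a full tree with a fixed branching number and the zero-mean reward choice r+ = 1, r- = -p/(1-p); all function and variable names are ours.

```python
import random

def bellman_mc_value(p, branching, depth, n_mc=2000, seed=0):
    """Bellman-Monte Carlo estimate of the value of exhaustively playing a
    full tree: sample every reward, then back up values from the leaves.
    Sketch with r+ = 1 and r- = -p/(1-p) (the zero-mean choice)."""
    rng = random.Random(seed)
    r_plus, r_minus = 1.0, -p / (1.0 - p)   # p*r+ + (1-p)*r- = 0

    def node_value(level):
        # reward of this node, plus the best continuation among its children
        r = r_plus if rng.random() < p else r_minus
        if level == depth:
            return r                        # leaf: value is its own reward
        return r + max(node_value(level + 1) for _ in range(branching))

    # the root carries no reward: its value is the best of its children
    total = sum(max(node_value(1) for _ in range(branching))
                for _ in range(n_mc))
    return total / n_mc
```

With p = 1/2, branching 2 and depth 1, the estimate converges to the exact value 1/2, the expected maximum of two zero-mean binary rewards.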
2 Gradient ascent
For each we optimize in Eq. (7) under the capacity constraint, Eq. (6), by a gradient ascent method. The unconstrained gradient of the value is numerically computed for an initial using a discretization step size , . The unconstrained gradient is then projected onto the capacity constraint plane defined by Eq. (6). Then, the projected gradient multiplied by a learning rate is added to the original , from which a new is proposed. If the resulting has a component that does not satisfy the constraint , then is moved to either or , whichever is closer. This movement can in turn leave outside the capacity constraint plane, so a new projection onto the constraint plane is performed. The projections and movements are repeated until satisfies both constraints, leading to a new valid . From the new , an unconstrained gradient is computed again, and the procedure continues up to a maximum of iterations or until the improvement in the value is less than a tolerance of . To avoid numerical instabilities for very deep trees (), the probabilities are normalized to sum to one at every iteration. One-order-of-magnitude differences in the ranges of step sizes, learning rates and tolerances, and all tested initial conditions for , give almost identical results to those reported in the main text.
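The projection scheme can be sketched for a generic objective f under a linear capacity plane c·q = C and box constraints 0 ≤ q ≤ 1. This is a hedged illustration: the objective and constraint vector below are placeholders, not Eqs. (6)-(7) themselves, and all names are ours.

```python
def projected_gradient_ascent(f, c, C, q0, lr=0.05, eps=1e-4,
                              max_iter=5000, tol=1e-9):
    """Maximize f(q) subject to sum_k c[k]*q[k] = C and 0 <= q[k] <= 1,
    following the projected-gradient scheme described in the text."""
    n = len(q0)
    cc = sum(ck * ck for ck in c)
    q, best = list(q0), f(q0)
    for _ in range(max_iter):
        # central-difference numerical gradient with step eps
        g = []
        for k in range(n):
            qp, qm = q[:], q[:]
            qp[k] += eps
            qm[k] -= eps
            g.append((f(qp) - f(qm)) / (2 * eps))
        # project the gradient onto the constraint plane c . q = C
        cg = sum(ck * gk for ck, gk in zip(c, g))
        g = [gk - ck * cg / cc for ck, gk in zip(c, g)]
        qn = [qk + lr * gk for qk, gk in zip(q, g)]
        # alternate clipping to [0, 1] with re-projection onto the plane
        for _ in range(100):
            qn = [min(1.0, max(0.0, qk)) for qk in qn]
            drift = (sum(ck * qk for ck, qk in zip(c, qn)) - C) / cc
            qn = [qk - drift * ck for ck, qk in zip(c, qn)]
            if all(-1e-12 <= qk <= 1 + 1e-12 for qk in qn):
                break
        qn = [min(1.0, max(0.0, qk)) for qk in qn]
        val = f(qn)
        if val < best + tol:   # stop when the improvement is below tolerance
            break
        q, best = qn, val
    return q, best
```

On a toy concave objective, e.g. maximizing -Σ(q_k - 0.5)² subject to Σq_k = 1.2 in three dimensions, the scheme converges to the constrained optimum q = (0.4, 0.4, 0.4).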
3 Value of exhaustive or selective search in a large tree with rational
We extend our results for to the case of rational values and for any positive integer . The zero-average reward constraint enforces that and . We arbitrarily take and select so that the zero-average reward constraint is satisfied.
3.1 Reward probability
We first consider , which implies . The zero-average constraint results in . We describe below how to compute the value of playing a large tree exhaustively and selectively with such a probability of positive reward.
We begin by describing the value of a tree with one level (), which will serve as initial condition for the diffusion-maximization algorithm. In this case, the accumulated reward can only be or , that is, . Thus
where is the number of branches.
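The depth-1 value above is the expected maximum of independent rewards, which can be computed in closed form: the best branch is negative only if all branches are. The sketch below assumes, for illustration, r+ = 1 with r- fixed by the zero-mean constraint (names are ours).

```python
def one_level_value(p, N, r_plus=1.0):
    """Value of a depth-1 tree with N branches: the expected maximum of N
    i.i.d. rewards taking r+ with prob. p and r- with prob. 1 - p, with
    r- set by the zero-mean constraint p*r+ + (1-p)*r- = 0 (sketch)."""
    r_minus = -p * r_plus / (1.0 - p)
    p_all_neg = (1.0 - p) ** N        # best branch negative only if all are
    return r_plus * (1.0 - p_all_neg) + r_minus * p_all_neg
```

For p = 1/2 this gives 0 for a single branch (no freedom to choose) and 1/2 for two branches.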
As we have seen for in the main text, we can compute the probabilities for a tree of depth starting from the probabilities of the accumulated reward of a tree of depth by alternating the diffusion and maximization steps. The diffusion step uses the probabilities of the accumulated reward of a tree of depth to compute the action values of a tree of depth using the possible rewards . Both the accumulated rewards and the action values for a tree of depth can take values , with , where is the number of times the positive reward was observed in the best possible path.
Using the above, the diffusion step becomes
where it is understood that if lies outside the domain of , in particular when or , and thus some terms in the rhs of the above equation can become zero, by definition.
The maximization step is, as before,
The average finite capacity constraint enforces that
where is the sampling probability of tree level . We underline the reverse order of the index of , which is due to the fact that we are describing a backward algorithm: will appear in the first step and corresponds to the last level, in the second step and corresponds to the second-to-last level, and so on. In selective allocation of samples, it is possible that a node is not sampled, and thus the possible values of both and are
with and , where is the number of times the positive reward is observed, and is the number of times the negative reward is observed.
We now proceed to compute the value of a tree with one level, and then use the diffusion-maximization algorithm to compute the value of a tree with any arbitrary depth . The probabilities of the action values for the branches of such a tree are
and by using the maximization step, we obtain that the values take probabilities
Now, the diffusion step is
where, again, it is understood that when lies outside the domain of , in particular when or , and thus many terms contribute zero.
The diffusion step is then followed by the usual maximization step
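The alternating steps can be sketched compactly for the exhaustive-sampling case (sampling probability one at every level): the diffusion step convolves the current value distribution with the single-step reward distribution, and the maximization step raises the CDF to the N-th power, which is the distribution of the maximum of N i.i.d. draws. As elsewhere, we assume for illustration r+ = 1 with r- fixed by the zero-mean constraint; all names are ours.

```python
def diffusion_maximization(p, N, depth, r_plus=1.0):
    """Distribution (value -> probability) of exhaustively playing a tree
    of the given depth with branching N, by alternating diffusion and
    maximization steps (sketch; zero-mean rewards r+ = 1, r- = -p/(1-p))."""
    r_minus = -p * r_plus / (1.0 - p)
    rewards = [(r_plus, p), (r_minus, 1.0 - p)]

    def maximize(dist, n):
        # distribution of the max of n i.i.d. draws: F_max(v) = F(v)**n
        out, cdf_prev, cdf = {}, 0.0, 0.0
        for v in sorted(dist):
            cdf += dist[v]
            out[v] = cdf ** n - cdf_prev ** n
            cdf_prev = cdf
        return out

    # depth-1 initial condition: best of N single rewards
    dist = maximize(dict(rewards), N)
    for _ in range(depth - 1):
        # diffusion: add one level's reward to the accumulated value
        diffused = {}
        for v, pv in dist.items():
            for r, pr in rewards:
                diffused[v + r] = diffused.get(v + r, 0.0) + pv * pr
        dist = maximize(diffused, N)   # best of N i.i.d. action values
    return dist
```

For p = 1/2 and N = 2 the depth-1 distribution is {+1: 3/4, -1: 1/4} (mean 1/2), and the depth-2 mean value is 1.1875.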
3.2 Algorithmic complexity
The complexity of the algorithm is proportional to the number of equations, which equals the sum of the number of possible different states per level. As we said above, the possible state values at level are , with and . As is an integer, it is possible to have repeated values of for different values of and within the allowed set.
To count the number of distinct states, we start by noticing that if , then , and thus there are distinct states (Fig. 5(a), orange points in the bottom row of the triangle). Assume first that . If , then , where lies between and (second bottom row of points in the triangle). As , the resulting states do not reach , and thus all of them are distinct from those corresponding to the bottom row. If , the states are , where lies between and (third bottom row), and as the values of do not reach , the resulting states are all new. In conclusion, if the total number of distinct states in level is
For , there are many values of and that result in repeated states (Fig. 5(b), violet points). If , then , resulting in distinct states, as before (orange points in the bottom row of the triangle). If , then , resulting in the states , of which all states equal to or above are repeated (violet points in the second bottom row). Thus, there are new states. Extending the above, for each in there are new states, and for larger values of the new states are .
In conclusion, if the total number of distinct states in level is
From here, the scaling of states is proportional to the level , and for large the term dominates. Therefore, when summing the distinct states from the first to the last level of the tree, we conclude that the complexity of the maximization-diffusion algorithm is , where we take into account that for every state we need to perform a maximization step (a power operation that counts per state). Analogous steps can be made for the case considered next of to reach an identical algorithmic complexity.
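The counting argument can be checked numerically. The sketch below enumerates, with exact rational arithmetic, the distinct states of the form k·r+ + m·r- with k + m ≤ n, assuming for illustration r+ = 1 and r- = -1/l (our parameterization of the zero-mean rational-reward case; names are ours).

```python
from fractions import Fraction

def count_distinct_states(l, depth):
    """Number of distinct accumulated-reward states k*r+ + m*r- with
    k + m <= n at each level n, for r+ = 1 and r- = -1/l (sketch; exact
    rational arithmetic avoids floating-point collisions)."""
    r_plus, r_minus = Fraction(1), Fraction(-1, l)
    counts = []
    for n in range(1, depth + 1):
        states = {k * r_plus + m * r_minus
                  for k in range(n + 1) for m in range(n + 1 - k)}
        counts.append(len(states))
    return counts
```

For l = 2 the counts grow linearly (3, 6, 9, 12, …), in line with the per-level scaling derived above, while the naive count of (k, m) pairs would grow quadratically.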
3.3 Reward probability
We proceed by considering , which implies . The zero-average reward constraint leads in this case to a negative reward . We show here how to compute the value of playing a large tree, exhaustively and selectively, with such a reward probability .
As shown before, the initial conditions for the diffusion-maximization algorithm come from the value of a tree with just one level . For a single level tree the accumulated reward can only be or , namely . Thus, for a number of branches
Again we can compute the probabilities of for a tree of depth from the probabilities of for a tree of depth using diffusion-maximization. In the diffusion step, we use the probabilities of of a tree of depth to compute the action values of the tree of depth along with the possible rewards . For a tree of depth , both the accumulated reward and the action value can take the values with , where is the number of times that the positive reward is observed.
Now, the diffusion step becomes
where again the probabilities are zero when lies outside the domain of , in particular when or .
After the diffusion, the maximization step is always
As we have shown in the main text for , and previously here for , in selective allocation we consider the average finite capacity constraint
where is the sampling probability of tree level . As nodes might not be sampled, the possible values of both and are
with and , where is the number of times that the positive reward is observed, and is the number of times that the negative reward is observed in the best possible path. We first compute the value of a tree with depth and then use the diffusion-maximization algorithm to perform induction over . The probabilities of the action values for the branches of a tree with are
Thus, the probabilities of are obtained by using the maximization step
Given these initial conditions, it is easy to see that the diffusion step for level is