Decision trees are a ubiquitous tool in decision theory and artificial intelligence research for representing a wide range of decision-making problems, including the classic reinforcement learning paradigm as well as competitive games (Osborne and Rubinstein, 1999; Russell and Norvig, 2010). Depending on the kind of system one is interacting with, different decision rules apply, the most famous being Expectimax, Minimax and Expectiminimax (see Figure LABEL:fig:gametrees). When an agent interacts with a stochastic system, the agent chooses its decisions based on Expectimax. Essentially, Expectimax is the dynamic programming algorithm that solves the Bellman optimality equations, thereby recursively maximizing expected future reward in a sequential decision problem (Bellman, 1957).
In two-player zero-sum games where strictly competitive players alternate moves, an agent should use the Minimax strategy. The motivation underlying Minimax decisions is that the agent wants to optimize the worst-case gain as a means of protecting itself against the potentially harmful decisions made by the adversary. Finally, there are games that mix the two previous interaction types. For instance, in Backgammon, the course of the game depends both on the skill of the players and on chance elements. In these cases, the agent bases its decisions on the Expectiminimax rule (Michie, 1966).
What is common to all of these decision-making schemes is that they presuppose a fully rational decision-maker that is able to compute all of the required operations with absolute precision. In contrast, a bounded rational decision-maker trades off expected utility gains against the cost of the required computations (Simon, 1984). Recently, the free energy has been suggested as a normative variational principle for such bounded rational decision-making that takes the computational effort into account (Ortega and Braun, 2011; Braun and Ortega, 2011; Ortega, 2011). This builds on previous work on efficient computation of optimal actions that trades off the benefits obtained from maximizing the utility function against the cost of changing the uncontrolled dynamics given by the environment (Kappen, 2005; Todorov, 2006, 2009; Kappen et al., 2012). The aim of this paper is to extend these results to generalized decision trees such that Expectimax, Minimax, Expectiminimax, and bounded rational acting can all be derived from a single optimization principle. Moreover, this framework leads to a natural measure of computational costs spent at each node of the decision tree. All the proofs are given in the appendix.
2 Free Energy
2.1 Equilibrium Distribution
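Bounded rational decision-making can be modeled as a transformation between two information processing states represented by two probability distributions $P_0$ and $P$. The decision process then transforms the initial choice probability $P_0$ into a final choice probability $P$ by taking into account the utility gains (or losses) and the transformation costs. This transformation process can be formalized as
$$P(x) = \frac{1}{Z} P_0(x)\, e^{\alpha U(x)}, \qquad Z = \sum_{x'} P_0(x')\, e^{\alpha U(x')}, \tag{1}$$
where $U$ is the utility function and $\alpha$ is the inverse temperature.
Accordingly, the choice pattern of the decision-maker is predicted by the equilibrium distribution $P$. Crucially, the probability distribution $P$ extremizes the following functional (Callen, 1985; Keller, 1998):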
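Definition (Negative Free Energy Difference). Let $P_0$ be a probability distribution and let $U$ be a real-valued utility function over the set $\mathcal{X}$. For any nonzero $\alpha \in \mathbb{R}$, define the negative free energy difference $F_\alpha[P]$ as
$$F_\alpha[P] := \sum_{x} P(x)\, U(x) - \frac{1}{\alpha} \sum_{x} P(x) \log \frac{P(x)}{P_0(x)}. \tag{2}$$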
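The parameter $\alpha$ is called the inverse temperature. Although strictly speaking, the functional $F_\alpha[P]$ corresponds to the negative free energy difference, we will refer to it as the “free energy” in the following for simplicity. When inserting the equilibrium distribution (1) into (2), the extremum of $F_\alpha$ yields:
$$\frac{1}{\alpha} \log Z = \frac{1}{\alpha} \log \sum_{x} P_0(x)\, e^{\alpha U(x)}.$$
For different values of $\alpha$, this extremum takes the following limits:
$$\lim_{\alpha \to \infty} \frac{1}{\alpha} \log Z = \max_{x} U(x), \qquad \lim_{\alpha \to 0} \frac{1}{\alpha} \log Z = \sum_{x} P_0(x)\, U(x), \qquad \lim_{\alpha \to -\infty} \frac{1}{\alpha} \log Z = \min_{x} U(x).$$
The case $\alpha \to \infty$ corresponds to the perfectly rational agent, the case $\alpha \to 0$ corresponds to the expectation at a chance node and the case $\alpha \to -\infty$ anticipates the perfectly rational opponent. Therefore, the single expression $\frac{1}{\alpha} \log Z$ can represent the maximum, expectation and minimum depending on the value of $\alpha$.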
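The three limits can be checked numerically. The following is a minimal sketch, assuming a finite choice set with strictly positive base probabilities; the function name `free_energy_extremum` and the example numbers are illustrative and not taken from the paper.

```python
import math

def free_energy_extremum(alpha, utilities, base_probs):
    """Return (1/alpha) * log sum_x P0(x) * exp(alpha * U(x)).

    For alpha == 0 the limit value (the expected utility under P0) is returned.
    """
    if alpha == 0:
        return sum(p * u for p, u in zip(base_probs, utilities))
    # Shift by the largest exponent so that exp() cannot overflow.
    shift = max(alpha * u for u in utilities)
    total = sum(p * math.exp(alpha * u - shift)
                for p, u in zip(base_probs, utilities))
    return (shift + math.log(total)) / alpha

# Hypothetical example: three choices with utilities U and base distribution P0.
U = [1.0, 2.0, 4.0]
P0 = [0.5, 0.3, 0.2]

near_max = free_energy_extremum(200.0, U, P0)    # large positive alpha
expectation = free_energy_extremum(0.0, U, P0)   # alpha -> 0
near_min = free_energy_extremum(-200.0, U, P0)   # large negative alpha
print(near_max, expectation, near_min)
```

With a large positive inverse temperature the value approaches the largest utility, with a large negative one the smallest, and at zero it equals the expected utility under the base distribution.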
Inspection of (2) reveals that the free energy encapsulates a fundamental decision-theoretic trade-off: it corresponds to the expected utility, penalized (or regularized) by the information cost of transforming the base distribution $P_0$ into the final distribution $P$. The inverse temperature $\alpha$ plays the role of the conversion factor between units of information and units of utility.
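This trade-off can be illustrated with a small numerical sketch. It assumes the free energy has the form "expected utility minus Kullback-Leibler divergence divided by the inverse temperature" and that the equilibrium distribution is the Boltzmann-like reweighting of the base distribution; the helper names and numbers are hypothetical.

```python
import math

def boltzmann(alpha, utilities, base_probs):
    """Equilibrium distribution: P(x) proportional to P0(x) * exp(alpha * U(x))."""
    weights = [p * math.exp(alpha * u) for p, u in zip(base_probs, utilities)]
    z = sum(weights)
    return [w / z for w in weights]

def free_energy(alpha, probs, utilities, base_probs):
    """Expected utility minus (1/alpha) times the KL divergence from P0 to P."""
    expected_utility = sum(p * u for p, u in zip(probs, utilities))
    kl = sum(p * math.log(p / p0)
             for p, p0 in zip(probs, base_probs) if p > 0)
    return expected_utility - kl / alpha

alpha = 2.0
U = [1.0, 2.0, 4.0]
P0 = [0.5, 0.3, 0.2]

f_equilibrium = free_energy(alpha, boltzmann(alpha, U, P0), U, P0)
f_no_change = free_energy(alpha, P0, U, P0)          # no information cost, no gain
f_greedy = free_energy(alpha, [0.0, 0.0, 1.0], U, P0)  # best utility, highest cost
print(f_equilibrium, f_no_change, f_greedy)
```

For a positive inverse temperature, the equilibrium distribution scores higher than both leaving the base distribution unchanged and committing fully to the best option, which is the regularization effect described above.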
If we want to change the inverse temperature from $\alpha$ to $\beta$ while keeping the equilibrium and reference distributions fixed, then we need to change the corresponding utilities from $U$ to $V$ in a manner given by the following theorem. Temperature changes will be important for the application of the free energy principle to the general decision trees in Section 3.
Theorem 2.1. Let $P$ be the equilibrium distribution for a given inverse temperature $\alpha$, utility function $U$ and reference distribution $P_0$. If the temperature changes to $\beta$ while keeping $P$ and $P_0$ fixed, then the utility function changes to
$$V(x) = U(x) - \left( \frac{1}{\alpha} - \frac{1}{\beta} \right) \log \frac{P(x)}{P_0(x)}.$$
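The temperature change can be verified numerically. The sketch below assumes the transformed utility is $V(x) = U(x) - (1/\alpha - 1/\beta)\log(P(x)/P_0(x))$ and checks that the equilibrium distribution is unchanged; the function names and example values are illustrative only.

```python
import math

def boltzmann(inv_temp, utilities, base_probs):
    """Equilibrium distribution proportional to P0(x) * exp(inv_temp * U(x))."""
    weights = [p * math.exp(inv_temp * u) for p, u in zip(base_probs, utilities)]
    z = sum(weights)
    return [w / z for w in weights]

def retempered_utilities(alpha, beta, utilities, probs, base_probs):
    """Utilities V that reproduce `probs` at inverse temperature beta."""
    factor = 1.0 / alpha - 1.0 / beta
    return [u - factor * math.log(p / p0)
            for u, p, p0 in zip(utilities, probs, base_probs)]

alpha, beta = 2.0, 0.5
U = [1.0, 2.0, 4.0]
P0 = [0.5, 0.3, 0.2]

P = boltzmann(alpha, U, P0)                       # equilibrium at alpha
V = retempered_utilities(alpha, beta, U, P, P0)   # utilities for beta
P_check = boltzmann(beta, V, P0)                  # equilibrium at beta
print(P, P_check)
```

Both calls to `boltzmann` return the same distribution up to floating-point error, so the pair $(\beta, V)$ describes the same choice behaviour as $(\alpha, U)$.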
2.2 Resource Costs
Consider the problem of picking the largest number in a sequence $x_1, x_2, x_3, \ldots$ of i.i.d. data, where each $x_i$ is drawn from a source with probability distribution $M$. After $\alpha$ draws the largest number will be given by $\max\{x_1, \ldots, x_\alpha\}$. Naturally, the larger the number of draws, the higher the chances of observing a large number.
Let be a finite set. Let and be strictly positive probability distributions over . Let be a positive integer. Define as the probability distribution over the maximum of samples from . Then, there are strictly positive constants and depending only on such that for all ,
Consequently, one can interpret the inverse temperature $\alpha$ as a resource parameter that determines how many samples are drawn to estimate the maximum. Note that the reference distribution is arbitrary as long as it has the same support as the distribution the samples are drawn from. This interpretation can be extended to a negative $\alpha$ by noting that $\alpha U(x) = (-\alpha)(-U(x))$, i.e. instead of the maximum we take the minimum of $|\alpha|$ samples.
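The sampling interpretation can be made concrete with a short sketch. It computes the exact distribution of the maximum of a given number of i.i.d. draws from a finite, ordered set using the cumulative distribution, as in the proof in the appendix. The function name and the example distribution are hypothetical, and the outcomes are assumed to be listed from smallest to largest.

```python
def max_of_samples_distribution(probs, num_samples):
    """Distribution of the maximum of `num_samples` i.i.d. draws.

    `probs[i]` is the probability of the i-th smallest outcome. The maximum
    equals outcome i with probability F(i)**n - F(i-1)**n, where F is the
    cumulative distribution.
    """
    dist = []
    cdf_prev = 0.0
    for p in probs:
        cdf = cdf_prev + p
        dist.append(cdf ** num_samples - cdf_prev ** num_samples)
        cdf_prev = cdf
    return dist

source = [0.5, 0.3, 0.2]  # outcomes ordered from smallest to largest
dist_1 = max_of_samples_distribution(source, 1)
dist_5 = max_of_samples_distribution(source, 5)
dist_50 = max_of_samples_distribution(source, 50)
print(dist_1, dist_5, dist_50)
```

With a single draw the result is the source distribution itself; as the number of draws grows, the probability mass concentrates on the largest outcome, mirroring the effect of increasing the inverse temperature.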
3 General Decision Trees
A generalized decision tree is a tree where each node corresponds to a possible interaction history $x_{\leq t} = x_1 x_2 \ldots x_t$, where $t$ is at most some fixed horizon $T$, and where edges connect two consecutive interaction histories. Furthermore, every node $x_{<t}$ has an associated inverse temperature $\beta(x_{<t})$; and every transition has a base probability $P_0(x_t|x_{<t})$ of moving from state $x_{<t}$ to state $x_{\leq t}$, representing the stochastic law the interactions follow when it is not controlled, and an immediate reward $R(x_t|x_{<t})$. The objective of the agent is to make decisions such that the sum $\sum_{t=1}^{T} R(x_t|x_{<t})$ is maximized subject to the temperature constraints.
3.1 Free Energy for General Decision Trees
The free energy principle is stated above for one decision variable $x$. If $x = (x_1, \ldots, x_T)$ represents a tuple of (possibly dependent) random variables, then the free energy principle can be applied in a straightforward manner to the corresponding tree. However, all nodes of the tree will have the same inverse temperature $\alpha$ assigned to them and, therefore, the same amount of computational resources will be spent at each node of the tree. This allows, for example, deriving the formalisms of path integral control and KL control (Todorov, 2009; Braun and Ortega, 2011; Kappen et al., 2012).
In the case of general decision trees the assumption of uniform temperatures has to be relaxed (Figure LABEL:fig:transformation). In general, we can then dedicate different amounts of computational resources to each node of the tree. However, this requires a translation between a tree with a single temperature $\alpha$ and a tree with different temperatures. This translation can be achieved using Theorem 2.1. Define a reward as the change in utility of two subsequent nodes. Then, the rewards of the resulting decision tree are given by
$$R(x_t|x_{<t}) := \big[ U(x_{\leq t}) - U(x_{<t}) \big] - \left( \frac{1}{\alpha} - \frac{1}{\beta(x_{<t})} \right) \log \frac{P(x_t|x_{<t})}{P_0(x_t|x_{<t})}. \tag{4}$$
This allows introducing a collection of node-specific (not necessarily time-specific) inverse temperatures $\beta(x_{<t})$, allowing for a greater degree of flexibility in the representation of information costs. The next theorem states the connection between the free energy and the general decision tree formulation.
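Theorem 3.1. The free energy of the whole trajectory can be rewritten in terms of rewards:
$$F_\alpha[P] = \sum_{x_{\leq T}} P(x_{\leq T}) \sum_{t=1}^{T} \left\{ R(x_t|x_{<t}) - \frac{1}{\beta(x_{<t})} \log \frac{P(x_t|x_{<t})}{P_0(x_t|x_{<t})} \right\}. \tag{5}$$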
This translation allows applying the free energy principle to each node with a different resource parameter $\beta(x_{<t})$. By writing out the sum in (5), one realizes that this free energy has a nested structure where the latest time step forms the innermost variational problem and all other variational problems of the previous time steps can be solved recursively by working backwards in time. This then leads to the following solution:
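Theorem 3.2. The solution to the free energy in terms of rewards is given by
$$P(x_t|x_{<t}) = \frac{1}{Z(x_{<t})} P_0(x_t|x_{<t}) \exp\left\{ \beta(x_{<t}) \left[ R(x_t|x_{<t}) + \frac{1}{\beta(x_{\leq t})} \log Z(x_{\leq t}) \right] \right\},$$
where $Z(x_{\leq T}) = 1$ and where for all $t \leq T$
$$Z(x_{<t}) = \sum_{x_t} P_0(x_t|x_{<t}) \exp\left\{ \beta(x_{<t}) \left[ R(x_t|x_{<t}) + \frac{1}{\beta(x_{\leq t})} \log Z(x_{\leq t}) \right] \right\}.$$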
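The backward recursion can be sketched in a few lines. The following is an illustrative implementation under stated assumptions: the tree is a nested dictionary (a hypothetical encoding, not the paper's), every inner node has a finite, nonzero inverse temperature `beta`, and each child is a triple of base probability, immediate reward and subtree. The value returned for a node is $\frac{1}{\beta}\log Z$ of that node, with leaves contributing zero.

```python
import math

def solve_tree(node, policy=None):
    """Backward recursion returning (value, policy).

    value is (1/beta) * log Z(node); policy maps node names to the
    choice probabilities over that node's children.
    """
    if policy is None:
        policy = {}
    children = node.get("children", [])
    if not children:                      # leaf: Z = 1, so the value is 0
        return 0.0, policy
    beta = node["beta"]                   # assumed finite and nonzero here
    totals = []
    for _, reward, child in children:
        child_value, _ = solve_tree(child, policy)
        totals.append(reward + child_value)
    # Log-sum-exp with a shift for numerical stability.
    shift = max(beta * g for g in totals)
    weights = [p * math.exp(beta * g - shift)
               for (p, _, _), g in zip(children, totals)]
    z = sum(weights)
    policy[node["name"]] = [w / z for w in weights]
    return (shift + math.log(z)) / beta, policy

# Hypothetical two-level tree with different resources at each node.
leaf = {}
late = {"name": "late", "beta": 2.0,
        "children": [(0.5, 1.0, leaf), (0.5, 0.0, leaf)]}
root = {"name": "root", "beta": 0.5,
        "children": [(0.5, 0.0, late), (0.5, 0.5, leaf)]}

value, policy = solve_tree(root)
print(value, policy)
```

The node named `late` has the larger inverse temperature and therefore deviates more strongly from its uniform base probabilities than `root` does, which illustrates how node-specific resource parameters shape the resulting policy.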
3.2 Generalized Optimality Equations
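Writing $V(x_{<t})$ for the free energy attained at node $x_{<t}$, with $V(x_{\leq T}) = 0$ at the leaves, the recursive solution takes the form of the generalized optimality equations
$$V(x_{<t}) = \frac{1}{\beta(x_{<t})} \log \sum_{x_t} P_0(x_t|x_{<t}) \exp\Big\{ \beta(x_{<t}) \big[ R(x_t|x_{<t}) + V(x_{\leq t}) \big] \Big\}.$$
By virtue of our previous analysis, this equation tells us how to recursively calculate the value function (i.e. the utility of each node) given the computational resources allocated in each node.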
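It is immediately clear that the three kinds of decision trees mentioned in the introduction are special cases of general decision trees. In particular, the three classical operators are obtained as limit cases:
$$\beta(x_{<t}) \to \infty: \quad V(x_{<t}) = \max_{x_t} \big\{ R(x_t|x_{<t}) + V(x_{\leq t}) \big\},$$
$$\beta(x_{<t}) \to 0: \quad V(x_{<t}) = \sum_{x_t} P_0(x_t|x_{<t}) \big\{ R(x_t|x_{<t}) + V(x_{\leq t}) \big\},$$
$$\beta(x_{<t}) \to -\infty: \quad V(x_{<t}) = \min_{x_t} \big\{ R(x_t|x_{<t}) + V(x_{\leq t}) \big\}.$$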
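The familiar Bellman optimality equations for stochastic systems are obtained by considering an agent decision node ($\beta(x_{<t}) \to \infty$) followed by a random decision node ($\beta(x_{\leq t}) \to 0$):
$$V(x_{<t}) = \max_{x_t} \Big\{ R(x_t|x_{<t}) + \sum_{x_{t+1}} P_0(x_{t+1}|x_{\leq t}) \big[ R(x_{t+1}|x_{\leq t}) + V(x_{\leq t+1}) \big] \Big\}.$$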
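The limit cases can be illustrated on a small tree. The sketch below is not the authors' implementation; it assumes a nested-dictionary tree (a hypothetical encoding), strictly positive base probabilities, and uses `math.inf`, `0.0` and `-math.inf` as inverse temperatures to select the max, expectation and min operators respectively.

```python
import math

def soft_operator(beta, base_probs, values):
    """(1/beta) * log sum_i p_i * exp(beta * v_i), including its limits."""
    if beta == math.inf:
        return max(values)                               # agent node
    if beta == -math.inf:
        return min(values)                               # adversary node
    if beta == 0:
        return sum(p * v for p, v in zip(base_probs, values))  # chance node
    shift = max(beta * v for v in values)
    total = sum(p * math.exp(beta * v - shift)
                for p, v in zip(base_probs, values))
    return (shift + math.log(total)) / beta

def node_value(node):
    """Recursively evaluate a node; leaves have value 0."""
    children = node.get("children", [])
    if not children:
        return 0.0
    base_probs = [p for p, _, _ in children]
    totals = [reward + node_value(child) for _, reward, child in children]
    return soft_operator(node["beta"], base_probs, totals)

def make_root(beta):
    """Root with two actions, each followed by a chance node (beta = 0)."""
    chance_a = {"beta": 0.0, "children": [(0.5, 10.0, {}), (0.5, 0.0, {})]}
    chance_b = {"beta": 0.0, "children": [(0.9, 4.0, {}), (0.1, 8.0, {})]}
    return {"beta": beta,
            "children": [(0.5, 0.0, chance_a), (0.5, 1.0, chance_b)]}

expectimax_value = node_value(make_root(math.inf))   # Bellman backup
minimizing_value = node_value(make_root(-math.inf))  # adversarial root
bounded_value = node_value(make_root(1.0))           # bounded rational root
print(expectimax_value, minimizing_value, bounded_value)
```

With an infinite inverse temperature at the root and zero at its children, the recursion reduces to a Bellman backup; a finite value at the root yields a result between the worst and best action values.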
4 Discussions & Conclusions
Bounded rational decision-making schemes based on the free energy generalize classic decision-making schemes by taking into account information processing costs measured by the Kullback-Leibler divergence (Wolpert, 2004; Todorov, 2009; Peters et al., 2010; Ortega and Braun, 2011; Kappen et al., 2012). Ultimately, these costs are determined by Lagrange multiplier constraints given by the inverse temperature playing the role of a resource parameter. Here we generalize this approach to general decision trees where each node can have a different resource allocation. Consequently, we obtain generalized optimality equations for sequential decision-making that include the well-known Bellman optimality equation as well as Expectimax, Minimax and Expectiminimax decision rules, depending on the limit values of the resource parameters. The resource parameters themselves are amenable to interesting computational, statistical and economic interpretations. In the first sense they measure the number of samples needed from a distribution before applying the max operator and therefore correspond directly to computational effort. In the second sense they reflect the confidence of the estimate of the maximum and therefore they can also express risk attitudes. Finally, the resource parameters reflect the control an agent has over a random variable. These different ramifications need to be explored further in the future.
Appendix A Proofs
A.1 Proof of Theorem 2.1
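Requiring the free energy to take the same value under both parameterizations gives
$$\sum_{x} P(x) \left[ U(x) - \frac{1}{\alpha} \log \frac{P(x)}{P_0(x)} \right] = \sum_{x} P(x) \left[ V(x) - \frac{1}{\beta} \log \frac{P(x)}{P_0(x)} \right].$$
Since the equilibrium and reference distributions $P$ and $P_0$ are constant but arbitrarily chosen, it must be that
$$U(x) - \frac{1}{\alpha} \log \frac{P(x)}{P_0(x)} = V(x) - \frac{1}{\beta} \log \frac{P(x)}{P_0(x)},$$
and solving for $V(x)$ gives the claim.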
A.2 Proof of Theorem 2.2
Let be the ordering of such that . It is well known that the distribution over the maximum of samples is equal to , where is the cumulative distribution . Defining , one has . Hence, the probability can be bounded as , or
if we use where . The Boltzmann distribution can be bounded as
The upper bound is obtained by dropping all the summands in the expectation but the largest. In exponential form, the bounds are written as
Choosing allows rewriting the upper bound and changing the lower bound to
Finally, choosing and yields the bounds of the theorem
A.3 Proof of Theorem 3.1
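The free energy of the whole trajectory with inverse temperature $\alpha$ is given by
$$F_\alpha[P] = \sum_{x_{\leq T}} P(x_{\leq T}) \left\{ U(x_{\leq T}) - \frac{1}{\alpha} \log \frac{P(x_{\leq T})}{P_0(x_{\leq T})} \right\}.$$
Using a telescopic sum for the utilities (with the utility of the empty history set to zero) and the chain rule for the probabilities yields
$$F_\alpha[P] = \sum_{x_{\leq T}} P(x_{\leq T}) \sum_{t=1}^{T} \left\{ \big[ U(x_{\leq t}) - U(x_{<t}) \big] - \frac{1}{\alpha} \log \frac{P(x_t|x_{<t})}{P_0(x_t|x_{<t})} \right\}.$$
Using the definition of rewards (4), one gets the result
$$F_\alpha[P] = \sum_{x_{\leq T}} P(x_{\leq T}) \sum_{t=1}^{T} \left\{ R(x_t|x_{<t}) - \frac{1}{\beta(x_{<t})} \log \frac{P(x_t|x_{<t})}{P_0(x_t|x_{<t})} \right\}.$$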
A.4 Proof of Theorem 3.2
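The inner sum of the free energy
$$\sum_{x_{\leq T}} P(x_{\leq T}) \sum_{t=1}^{T} \left\{ R(x_t|x_{<t}) - \frac{1}{\beta(x_{<t})} \log \frac{P(x_t|x_{<t})}{P_0(x_t|x_{<t})} \right\}$$
can be expanded as
$$\sum_{x_1} P(x_1) \left\{ R(x_1) - \frac{1}{\beta(\epsilon)} \log \frac{P(x_1)}{P_0(x_1)} + \cdots + \sum_{x_T} P(x_T|x_{<T}) \left\{ R(x_T|x_{<T}) - \frac{1}{\beta(x_{<T})} \log \frac{P(x_T|x_{<T})}{P_0(x_T|x_{<T})} \right\} \cdots \right\},$$
where $\epsilon$ denotes the empty history.
This can be solved by induction, starting with the innermost sums and then recursively solving the outer sums. The innermost sums
$$\sum_{x_T} P(x_T|x_{<T}) \left\{ R(x_T|x_{<T}) - \frac{1}{\beta(x_{<T})} \log \frac{P(x_T|x_{<T})}{P_0(x_T|x_{<T})} \right\}$$
are maximized when
$$P(x_T|x_{<T}) = \frac{1}{Z(x_{<T})} P_0(x_T|x_{<T}) \exp\big\{ \beta(x_{<T}) R(x_T|x_{<T}) \big\}.$$
This can be seen by noting that for probabilities $p_i$ and positive numbers $q_i$, the quantity $\sum_i p_i \log \frac{p_i}{q_i}$ is minimized by choosing $p_i = \frac{1}{Z} q_i$, where $Z = \sum_i q_i$ is just a normalizing constant. Substituting this solution yields the outer sums
$$\sum_{x_{T-1}} P(x_{T-1}|x_{<T-1}) \left\{ R(x_{T-1}|x_{<T-1}) - \frac{1}{\beta(x_{<T-1})} \log \frac{P(x_{T-1}|x_{<T-1})}{P_0(x_{T-1}|x_{<T-1})} + \frac{1}{\beta(x_{<T})} \log Z(x_{<T}) \right\}.$$
These sums are then maximized by choosing
$$P(x_{T-1}|x_{<T-1}) = \frac{1}{Z(x_{<T-1})} P_0(x_{T-1}|x_{<T-1}) \exp\left\{ \beta(x_{<T-1}) \left[ R(x_{T-1}|x_{<T-1}) + \frac{1}{\beta(x_{<T})} \log Z(x_{<T}) \right] \right\},$$
and repeating the argument for the remaining time steps gives the stated solution.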
References
- Bellman (1957) R.E. Bellman. Dynamic programming, 1957.
- Braun and Ortega (2011) D. A. Braun and P. A. Ortega. Path integral control and bounded rationality. In IEEE Symposium on adaptive dynamic programming and reinforcement learning, pages 202–209, 2011.
- Callen (1985) H.B. Callen. Thermodynamics and an introduction to thermostatistics. John Wiley & Sons, New York, 1985.
- Kappen (2005) H.J. Kappen. A linear theory for control of non-linear stochastic systems. Physical Review Letters, 95:200201, 2005.
- Kappen et al. (2012) H.J. Kappen, V. Gómez, and M. Opper. Optimal control as a graphical model inference problem. Machine Learning, 1:1–11, 2012.
- Keller (1998) G. Keller. Equilibrium States in Ergodic Theory. London Mathematical Society Student Texts. Cambridge University Press, 1998.
- Michie (1966) D. Michie. Game-playing and game-learning automata. Advances in Programming & Non-Numerical Computation, pages 183–200, 1966.
- Ortega (2011) P. Ortega. A unified framework for resource-bounded autonomous agents interacting with unknown environments. PhD thesis, Department of Engineering, University of Cambridge, UK, 2011.
- Ortega and Braun (2011) P.A. Ortega and D.A. Braun. Information, utility and bounded rationality. In Lecture notes on artificial intelligence, volume 6830, pages 269–274, 2011.
- Osborne and Rubinstein (1999) M.J. Osborne and A. Rubinstein. A Course in Game Theory. MIT Press, 1999.
- Peters et al. (2010) J. Peters, K. Mülling, and Y. Altun. Relative entropy policy search. In AAAI, 2010.
- Russell and Norvig (2010) S.J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, NJ, 3rd edition, 2010.
- Simon (1984) H. Simon. Models of Bounded Rationality. MIT Press, Cambridge, MA, 1984.
- Todorov (2006) E. Todorov. Linearly solvable markov decision problems. In Advances in Neural Information Processing Systems, volume 19, pages 1369–1376, 2006.
- Todorov (2009) E. Todorov. Efficient computation of optimal actions. Proceedings of the National Academy of Sciences U.S.A., 106:11478–11483, 2009.
- Wolpert (2004) D.H. Wolpert. Complex Engineering Systems, chapter Information theory - the bridge connecting bounded rational game theory and statistical physics. Perseus Books, 2004.