Abstraction in decision-makers with limited information processing capabilities

by Tim Genewein, et al.
Max Planck Society

A distinctive property of human and animal intelligence is the ability to form abstractions by neglecting irrelevant information, which allows one to separate structure from noise. From an information-theoretic point of view, abstractions are desirable because they allow for very efficient information processing. In artificial systems, abstractions are often implemented through computationally costly formations of groups or clusters. In this work we establish the relation between the free-energy framework for decision-making and rate-distortion theory, and demonstrate how the application of rate-distortion theory to decision-making leads to the emergence of abstractions. We argue that abstractions are induced by a limit in information processing capacity.






1 Introduction

Most scientific papers start with an abstract that focuses on the main ideas of the work but leaves out many of the details. From an information-theoretic point of view this allows for very efficient processing, which is crucial if information processing capabilities are limited. In general, abstractions are formed by reducing the information content of an entity until it contains only information that is relevant for a particular purpose. This partial neglect of information can lead to different entities being treated as equal or, phrased differently, to the separation of structure from noise. Consider the abstract concept of a “chair”, where many aspects such as the size, color, material or particular shape are considered as noise that is irrelevant to the purpose of “sitting down”.

The ability to form abstractions is thought of as a hallmark of intelligence, both in cognitive tasks and in basic sensorimotor behaviors [1, 2, 3, 4, 5, 6]. Traditionally it is conceptualized as being computationally costly, because particular entities have to be grouped together by neglecting irrelevant information. Here we argue that abstractions arise as a consequence of limited computational capacity: the inability to distinguish different entities leads to the formation of abstractions. Note that this information processing limitation can be induced through limited computational capacity, but also through limited sample sizes or low signal-to-noise ratios. In this paper we study abstractions in the process of decision-making, where “similar” situations elicit the same behavior when the current situational context is partially ignored.

Following the work of [7], decision-making with limited information-processing resources has been studied extensively in psychology, economics, political science, industrial organization, computer science and artificial intelligence research. In this paper we use an information-theoretic model of decision-making under resource constraints [8, 9, 10, 11, 12, 13, 14]. In particular, [15, 16, 17, 18] present a framework in which the gain in expected utility is traded off against the adaptation cost of changing from an initial behavior to a posterior behavior. The variational problem that arises from this trade-off has the same mathematical form as the minimization of a free energy difference functional in thermodynamics. Here, we discuss the close connection between the thermodynamic decision-making framework [15] and rate-distortion theory, which is an information-theoretic framework for lossy compression. The problem in lossy compression is essentially the problem of separating structure from noise and is thus highly related to finding abstractions [19, 20, 21]. In the context of decision-making, the rate-distortion framework can be applied by conceptualizing the decision-maker as a channel with limited capacity from observations to actions, which is known in economics as the framework of “Rational Inattention” [22].

In the next section we discuss how the rate-distortion framework can be obtained for bounded-rational decision-makers that face a number of tasks. In Section 3 we demonstrate two simple applications to explore the type of abstractions that emerge from limited information processing capabilities. In Section 4 we summarize the findings and discuss the presented approach.

2 Rate-distortion theory for decision-making

2.1 Bounded-rational decision-making

In [15], a bounded-rational actor that initially follows a policy p0(a) changes its behavior to a policy p(a) in a way that optimally trades off the expected gain in utility against the transformation costs for adapting from p0(a) to p(a). This trade-off is formalized by the following variational principle

    p*(a) = argmax_{p(a)} Σ_a p(a) U(a) − (1/β) Σ_a p(a) log( p(a) / p0(a) ),    (1)

where β is known as the inverse temperature and the objective is known as the difference in free energy (the negative free energy difference in physics), which is composed of the expected utility w.r.t. p(a) and the Kullback-Leibler (KL) divergence between p(a) and p0(a). The factor 1/β acts as a conversion factor between transformation cost (usually measured in nats or bits) and the expected utility.

The distribution that maximizes the variational principle is given by

    p*(a) = (1/Z) p0(a) exp(β U(a)),    (2)

with the partition sum Z = Σ_a p0(a) exp(β U(a)).

The influence of the transformation cost, and thus the boundedness of the actor, is governed by the parameter β, which determines “how far” the final behavior p*(a) can deviate from the initial behavior p0(a), measured in terms of KL-divergence. The perfectly rational actor that maximizes its utility is recovered in the limit β → ∞, where transformation cost is ignored, whereas β → 0 corresponds to an actor that has infinite transformation cost or no computational resources and, thus, sticks with its prior policy p0(a).
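As a concrete illustration, the solution in Equation 2 and its two limit cases can be sketched in a few lines of Python (the utilities and the uniform prior below are made-up toy values, not taken from the paper):

```python
import numpy as np

def bounded_rational_policy(utilities, prior, beta):
    """Solution of the free-energy variational principle (Equation 2):
    p*(a) proportional to p0(a) * exp(beta * U(a))."""
    unnormalized = prior * np.exp(beta * utilities)
    return unnormalized / unnormalized.sum()

# Illustrative utilities and a uniform prior over three actions
U = np.array([1.0, 0.5, 0.0])
p0 = np.ones(3) / 3

# Large beta approaches the perfectly rational (argmax) actor ...
print(bounded_rational_policy(U, p0, beta=100.0))
# ... while beta = 0 leaves the prior unchanged
print(bounded_rational_policy(U, p0, beta=0.0))
```

For large β nearly all probability mass lands on the maximum-utility action, while for β = 0 the returned distribution equals the prior, matching the two limit cases discussed above.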

Note that in the notation shown here, U(a) is conceptualized as a function over gains. In case U(a) corresponds to a loss function, the same variational principle allows one to find the distribution that optimally trades off minimum expected loss against transformation cost; in this case the argmin over p(a) has to be taken and the sign of 1/β is inverted. In case a is a continuous random variable, sums have to be replaced by the corresponding integrals.

2.2 Multi-task decision-making with limited resources

Consider an actor that is embedded in an environment and receives (potentially partial and noisy) information about the current state of the environment, that is, the actor observes the value of a random variable W. This observation allows the actor to reduce uncertainty about the current state of the environment and adapt its behavior correspondingly. Formally, this is expressed with the conditional distribution p(a|w) over the action a. The thermodynamic framework for decision-making introduced in the previous section can straightforwardly be harnessed for describing such a bounded-rational agent that receives information by plugging the conditional distribution p(a|w) into Equation 1:

    p*(a|w) = argmax_{p(a|w)} Σ_a p(a|w) U(a,w) − (1/β) Σ_a p(a|w) log( p(a|w) / p0(a) ),    (3)
with the solution

    p*(a|w) = (1/Z(w)) p0(a) exp(β U(a,w)),  Z(w) = Σ_a p0(a) exp(β U(a,w)).    (4)

Notice that the utility function in general depends on the observation, leading to U(a,w), where the second argument indicates the conditioning on a specific value w of the observation variable W.

The initial distribution p0(a) can be interpreted as a default or prior behavior in the absence of an observation; thus we will refer to p0(a) as “the prior”. The information processing cost is then given as the KL divergence between p(a|w) and the prior p0(a), with the conversion factor 1/β relating the units of transformation cost and the units of utility.

2.3 The optimal prior

In the free energy principle (Equation 3), the prior p0(a) is assumed to be given. A very interesting question is which prior distribution maximizes the free energy difference for all observations on average. To formalize this question, we extend the variational principle in Equation 3 by taking the expectation over p(w) and the argmax over p0(a):

    argmax_{p0(a)} Σ_w p(w) argmax_{p(a|w)} [ Σ_a p(a|w) U(a,w) − (1/β) KL( p(a|w) ‖ p0(a) ) ].

The inner argmax-operator over p(a|w) and the expectation over p(w) can be swapped because the variation is not over w. With the KL-term expanded this leads to

    argmax_{p0(a)} argmax_{p(a|w)} Σ_w p(w) Σ_a p(a|w) U(a,w) − (1/β) Σ_w p(w) Σ_a p(a|w) log( p(a|w) / p0(a) ).

The solution to the argmax over p0(a) is given by the marginal p0(a) = Σ_w p(w) p(a|w) = p(a) (see 2.1.1 in [19] or [23]). Plugging in p(a) for p0(a) yields the following variational principle for bounded-rational decision-making with a minimum average relative entropy prior

    argmax_{p(a|w)} Σ_w p(w) Σ_a p(a|w) U(a,w) − (1/β) I(A;W),    (5)

where I(A;W) = Σ_w p(w) Σ_a p(a|w) log( p(a|w) / p(a) ) is the mutual information between actions A and observations W. The variational problem can be interpreted as maximizing expected utility with an upper bound on the mutual information or, in the dual point of view, as minimizing the mutual information between actions and observations with a lower bound on the expected utility. The problem in Equation 5 is equivalent to the problem formulation in rate-distortion theory ([19, 24, 25]), where U(a,w) is usually conceptualized as a distortion function, which leads to a flip in the sign of the utility term and an argmin instead of an argmax.

The solution that extremizes the variational problem is given by the self-consistent equations (see [19])

    p*(a|w) = (1/Z(w)) p(a) exp(β U(a,w)),  Z(w) = Σ_a p(a) exp(β U(a,w)),    (6)

    p(a) = Σ_w p(w) p*(a|w).    (7)

Note that the solution for the conditional distribution p*(a|w) in the rate-distortion problem (Equation 6) is the same as the solution in the free energy case of the previous section (Equation 4), except that the prior is now given by the marginal distribution p(a) (see Equation 7). This particular prior distribution minimizes the average relative entropy between p(a|w) and p(a), which is the mutual information I(A;W) between actions and observations.

In the limit case β → ∞ where transformation costs are ignored, p*(a|w) is equal to the perfectly rational policy for each value of w, independent of any of the other policies, and the prior p(a) becomes a mixture of these solutions. Note that if there is a subset of perfectly rational solutions that is shared among tasks, then only this subset will be assigned probability mass, since this reduces the mutual information (see Section 3.3). Importantly, high values of the mutual information term in Equation 5 will not lead to a penalization, which means that actions can be very informative about the observation w. The behavior of an actor with infinite computational resources will thus be very observation-specific.

In the case β → 0, the mutual information between actions and observations is minimized to I(A;W) = 0, leading to p*(a|w) = p(a): the maximal abstraction, where all observations w elicit the same response. The actor’s behavior becomes independent of the observation due to the lack of computational resources to change its behavior. Within this limitation the actor will, however, still emit actions that maximize the average expected utility Σ_w p(w) Σ_a p(a) U(a,w).

For values of the rationality parameter in between these limit cases, that is 0 < β < ∞, the bounded-rational actor trades off observation-specific actions, which lead to a higher expected utility for particular observations at the cost of a high mutual information I(A;W), against abstract actions that yield a “good” expected utility for many observations and lead to a lower mutual information term.

An alternative interpretation, closer to the rate-distortion framework, is that the perceptual channel through which w is transmitted to the actor has a limited capacity given by I(A;W). For large values of β, the transmission of w is not severely influenced and the actor can choose the best action for this particular observation. For lower values of β, however, the actor becomes very uncertain about the true value of w and has to choose abstract actions that are “good” under all observations which are compatible with the actor’s belief over w.

2.4 Computing the self-consistent solution

The self-consistent solutions that maximize the variational principle in Equation 5 can be computed by starting with an initial distribution p(a) and then iterating Equation 6 and Equation 7 in an alternating fashion. This procedure is well known in the rate-distortion framework as a Blahut-Arimoto-type algorithm [25, 26]. The iteration is guaranteed to converge to a unique maximum (see 2.1.1 in [19] and [23, 24]). Note that the initial distribution p(a) has to have the same support as the optimal solution p*(a).

Implemented in a straightforward manner, the Blahut-Arimoto iterations can become computationally costly, since they involve evaluating the utility function for every action-observation pair and computing the normalization constant Z(w). In the case of continuous-valued random variables, closed-form analytic solutions exist only for special cases.
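A minimal sketch of such a Blahut-Arimoto-type iteration, assuming discrete actions and observations represented as a utility matrix (the variable names are our own, not from the paper):

```python
import numpy as np

def blahut_arimoto(U, p_w, beta, n_iter=200):
    """Alternate Equations 6 and 7: given the marginal p(a), compute the
    conditionals p(a|w) proportional to p(a) exp(beta U(a,w)); given the
    conditionals, recompute the marginal p(a) = sum_w p(w) p(a|w).

    U:    (n_actions, n_obs) utility matrix U(a, w)
    p_w:  (n_obs,) distribution over observations/tasks
    """
    n_actions, n_obs = U.shape
    p_a = np.ones(n_actions) / n_actions  # full-support initialization
    for _ in range(n_iter):
        # Equation 6: p(a|w) = p(a) exp(beta U(a,w)) / Z(w)
        p_a_given_w = p_a[:, None] * np.exp(beta * U)
        p_a_given_w /= p_a_given_w.sum(axis=0, keepdims=True)
        # Equation 7: p(a) = sum_w p(w) p(a|w)
        p_a = p_a_given_w @ p_w
    return p_a_given_w, p_a
```

The uniform initialization gives p(a) full support, as required for convergence; the normalization over the action axis is exactly the computation of Z(w) for every w, which is the costly step mentioned above.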

3 Abstractions in multi-task decision-making

3.1 Problem formulation

In the following we present the application of the rate-distortion framework for decision-making introduced in the previous section to multi-task decision problems. We assume that we are given a number of tasks within the same environment and that the observations from the environment are fully informative about the current task, that is, we observe the value of a discrete random variable W corresponding to a unique task. Note that this assumption can easily be relaxed.

More formally, we make the following assumptions: we are given a set of N tasks which define the set of observations {w_1, ..., w_N}, where W = w_i if and only if the current task is task i. Each task is defined through the utility function U(a, w_i), where a is an action. The action space is the same for all tasks. We assume that the probability over tasks is known and given by p(w).

The goal of the decision maker is to find task-specific distributions p(a|w) that maximize the expected utility given its computational constraints. This problem is formalized in the variational principle in Equation 5 with the self-consistent solutions in Equations 6 and 7. In this principle for bounded-rational decision-making, information processing costs arise from changing the prior behavior p(a) to the task-specific behavior p(a|w) and are measured in terms of KL-cost, in accordance with the thermodynamic framework for decision-making [15].

3.2 Trading off abstraction against optimal action

We designed the following two-task problem to demonstrate the role of the rationality parameter β that governs the trade-off between expected utility and mutual information. In both tasks, the action a = (a1, a2) is one of four possible action vectors (see Table 1). The utility for the first task is simply given by the value of the first component of the action vector, whereas the utility for the second task is the Manhattan distance between the two components of the action vector:

    U(a, w1) = a1,    U(a, w2) = |a1 − a2|.

The utilities for all actions are summarized in Table 1. The observation variable is fully informative about the task, with given task probabilities p(w).

With this particular choice of utility functions and action vectors, the maximum-utility action for one task has a utility of zero for the other task. However, there is a suboptimal action that leads to the second-best utility in both tasks. The simulation results summarized in Table 1 show that for a high value of the inverse temperature β the decision-maker picks the maximum-utility action in each task with probability one. At a low value of β the actor uses the same action distribution for both tasks due to its boundedness, resulting in I(A;W) = 0. This leads to a maximal abstraction over both tasks, which is solved optimally by putting all the probability mass on the suboptimal action. Note that the low value of β shown here is in general still far from the fully bounded limit β → 0; in this particular example, however, lowering β further has no effect.

Table 1: Two-task decision problem. Possible actions and their utilities for both tasks are given in the first three columns of the table. The results of the Blahut-Arimoto iterations for a large value of β are shown in the middle three columns; in this case the maximum-utility action for each task is picked with full certainty. The results for a small value of β are shown in the last three columns; the decision maker does not have the computational resources to change its behavior according to the task and thus always picks the suboptimal action that leads to a high utility in both tasks.

Figure 1 A shows the transition from perfect rationality to full boundedness. Starting at large β, the entropy of the conditionals p(a|w) is zero, since for a given task the actor picks the maximum-utility action with certainty. By lowering the inverse temperature β, both the mutual information and the expected utility monotonically decrease. Initially I(A;W) stays constant, whereas the conditional entropy increases, which means that the actor picks the two maximum-utility actions with increasing stochasticity. At a critical value of β a phase transition occurs: the entropy rapidly peaks at log2(3) ≈ 1.58 bits, implying that three actions are now equally probable in p(a|w). Lowering β further leads to a rapid drop of I(A;W) and of the entropy to zero bits, as well as a drop in expected utility. The decision maker is now in the fully abstract regime, where the suboptimal action is always chosen, regardless of the task.

Figure 1 B shows the Rate-Utility function (in analogy to the rate-distortion function) where the information processing rate is shown as a function of the expected utility. If the decision-maker is conceptualized as a communication channel between observations and actions, the rate defines the minimal required capacity of that channel. The Rate-Utility function thus specifies the minimum required capacity for computing an action with a certain expected utility, or analogously the maximally achievable expected utility given a certain information processing capacity. Importantly, decision-makers in the shaded region are impossible, whereas decision-makers in the white region are suboptimal with respect to their information processing capabilities.

Figure 1: Transition from full rationality (β → ∞) to full boundedness (β → 0). A Trade-off between mutual information I(A;W) and expected utility. B Rate-Utility function showing the information processing rate as a function of the expected utility. The rate specifies the minimal average number of bits of the observation that need to be processed in order to achieve a certain expected utility. In the limit β → ∞ the decision maker picks the maximum-utility action for each environment deterministically, thus following the maximum expected utility (MEU) principle.

3.3 Changing the level of granularity

Abstractions are formed by reducing the information content of an entity until it only contains relevant information. For a discrete random variable this translates into forming a partitioning of its space, where “similar” elements are grouped into the same subset and become indistinguishable within that subset. In physics, changing the granularity of a partitioning to a coarser level is known as coarse-graining, which reduces the resolution of the space in a nonuniform manner. In the rate-distortion framework the partitioning emerges in the shared prior p(a) as a soft partitioning (see [20]), where actions with the same average utility get the same probability mass and become essentially indistinguishable.

To demonstrate this, we use a binary grid where each cell of the grid can be white or colored black. Actions are particular patterns on this grid; the action space thus becomes the set of all binary patterns of the grid. The utility function U(a, w) defines the following three tasks:

  1. The utility equals the number of colored pixels, but one row and one column have to be all-white; otherwise the utility is zero.

  2. Any pattern with exactly four colored pixels scores a fixed positive utility; all other patterns have utility zero.

  3. Any pattern with an even number of colored pixels scores a utility equal to the total number of colored pixels; all other patterns have a utility of zero.
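The three task utilities can be sketched as follows; the 3×3 grid size and the reward value for task (2) are assumptions, since the exact values do not appear in the text above (a 3×3 grid is, however, consistent with task (1)'s maximizers having exactly four colored pixels):

```python
import numpy as np
from itertools import product

N = 3  # assumed grid size

def u_task1(g):
    """Number of colored pixels, but only if at least one row AND one
    column are all-white; otherwise zero."""
    if (g.sum(axis=1) == 0).any() and (g.sum(axis=0) == 0).any():
        return int(g.sum())
    return 0

def u_task2(g, reward=4):
    """Fixed utility (value assumed here) for exactly four colored pixels."""
    return reward if g.sum() == 4 else 0

def u_task3(g):
    """Total number of colored pixels if that number is even, else zero."""
    s = int(g.sum())
    return s if s % 2 == 0 else 0

# Enumerate the full action space of 2**9 = 512 binary patterns.
patterns = [np.array(bits).reshape(N, N) for bits in product([0, 1], repeat=N * N)]

# The maximum-utility patterns of task (1) color a full 2x2 block, i.e.
# four pixels, so they form a subset of task (2)'s maximizers.
best1 = max(u_task1(g) for g in patterns)
print(best1)  # 4
```

Enumerating the patterns like this also makes explicit why the Blahut-Arimoto iterations become costly: every one of the 512 actions must be evaluated under every task in each iteration.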

Figure 2 shows samples from the conditionals p(a|w) for each task and from the prior p(a) for a high value of β. Since the inverse temperature is high, all the samples with nonzero probability are actions that yield maximum utility in their particular task. Note that the patterns that lead to a maximum utility in task (1) are a subset of the patterns that lead to maximum utility in task (2), but also lead to a nonzero utility in task (3). Since transformation costs are mostly ignored in this case, the patterns appearing for task (3) are very different from the patterns in task (1). Note however that the additional patterns in task (2) that would also lead to maximum utility are assigned a probability of zero: the subset of patterns which are also optimal in task (1) is sufficient to achieve maximum expected utility, and by not including the additional “specialized” patterns for task (2) the mutual information can be reduced significantly. The prior consists essentially of two kinds of patterns: the ones that are optimal in tasks (1) and (2) simultaneously, and the patterns that are optimal in task (3). The first two tasks have essentially become indistinguishable, because the actor will respond with exactly the same action distribution.

By lowering the inverse temperature (see Figure 3), the mutual information constraint gets more weight and suboptimal patterns are picked for task (3), similar to the simulation in the previous section. The behavior of the actor has now become indistinguishable for all three tasks, at the expense of a lower expected utility. Importantly, the effective resolution of the prior has been reduced from two distinct sets of patterns to a single set of patterns that are indistinguishable in terms of their expected utility; the level of granularity of the prior has thus been reduced even further.

Figure 2: Sampled patterns for a high value of β. The number above each pattern indicates the probability of the pattern in the corresponding distribution. A Samples for task (1). All shown patterns yield maximum utility in the task. B Samples for task (2). All shown patterns yield maximum utility in this task; however, task (2) has more patterns that would potentially lead to maximum utility. Only the subset that coincides with the maximum-utility patterns in task (1) has nonzero probability, as a consequence of sharing the same prior and of mutual information minimization. C Samples for task (3). The patterns in tasks (1) and (2) would also have a nonzero probability in task (3), but the sampled patterns shown here yield twice the utility and thus carry all the probability mass. D Samples from the shared prior p(a). The prior is a mixture over the patterns shown in the conditional distributions.
Figure 3: Sampled patterns for a low value of β. The number above each pattern indicates the probability of the pattern in the corresponding distribution. A Samples for task (1). B Samples for task (2). C Samples for task (3). Compared to the case in Figure 2, the increased weight of the mutual information term has led to the selection of suboptimal actions in task (3), similar to the previous simulation. D Samples from the shared prior p(a). In the fully abstract regime all conditional distributions are exactly equal to the prior p(a), leading to I(A;W) = 0.

4 Discussion & Conclusions

In this work, we discussed the connection between the thermodynamic framework [15] for decision-making with information processing costs and rate-distortion theory. This connection implies a novel interpretation of the rate-distortion framework for multi-task bounded-rational decision-making. Importantly, abstractions emerge naturally in this framework due to limited information processing capabilities. The authors in [27] find a very similar emergence of “natural abstractions” and “ritualized behavior” when studying goal-directed behavior in the MDP case using the Relevant Information method, which is a particular application of rate-distortion theory.

Although not shown here, the approach presented in this paper straightforwardly carries over to an inference case by treating w as observations and a as the belief state. In the inference case, limited information processing capacities make it impossible to detect certain patterns, which in turn renders different entities indistinguishable, leading to the formation of abstractions. This idea has been explored previously in [20, 21]. Both the work just mentioned and our work are inspired by the Information Bottleneck Method [19], which is mathematically very similar to the rate-distortion problem (with a particular choice of distortion function) and thus also to the approach presented here.

Note that limited information processing capabilities can arise for various reasons. The most obvious reason, perhaps, is the lack of computational power which is in many cases equivalent to certain time-constraints (such as reaction times) or memory constraints. Other reasons for information processing limits are small sample sizes or low signal-to-noise ratios that put an upper limit on the mutual information independent of available computational power.

In the approach presented here, we assume that the decision-maker draws samples from p(a|w). Responding to a certain task with a sample from p(a|w) could then be implemented, for instance, with a rejection sampling procedure. The prior p(a) will then be the proposal distribution that has the highest average acceptance rate over all tasks. The computational cost of finding p(a) is not part of the current framework; these implications have to be explored in further work.
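Such a rejection sampler is not spelled out in the paper; a minimal sketch, assuming discrete actions and the solution form p(a|w) ∝ p(a) exp(β U(a,w)) from Equation 6, could look like this (the utility matrix and prior are hypothetical):

```python
import numpy as np

def sample_conditional(p_a, U, w, beta, rng):
    """Rejection sampling sketch: propose an action from the prior p(a) and
    accept with probability exp(beta * (U[a, w] - U_max)) <= 1; accepted
    samples are distributed as p(a|w) proportional to p(a) exp(beta U(a,w))."""
    U_max = U[:, w].max()  # bounds the acceptance ratio by one
    while True:
        a = rng.choice(len(p_a), p=p_a)
        if rng.random() < np.exp(beta * (U[a, w] - U_max)):
            return a

# Hypothetical two-task example: U[a, w], uniform prior over two actions.
U = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p_a = np.array([0.5, 0.5])
rng = np.random.default_rng(0)
samples = [sample_conditional(p_a, U, w=0, beta=5.0, rng=rng) for _ in range(1000)]
# Action 0, the high-utility action for task w=0, dominates the samples.
```

The better the proposal p(a) matches the high-utility actions of a task, the fewer rejections occur, which illustrates why the minimum-mutual-information prior of Equation 7 yields the highest average acceptance rate across tasks.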


This study was supported by the DFG, Emmy Noether grant BR4164/1-1.


  • [1] Joshua B Tenenbaum, Charles Kemp, Thomas L Griffiths, and Noah D Goodman. How to grow a mind: Statistics, structure, and abstraction. science, 331(6022):1279–1285, 2011.
  • [2] Charles Kemp, Amy Perfors, and Joshua B Tenenbaum. Learning overhypotheses with hierarchical bayesian models. Developmental science, 10(3):307–321, 2007.
  • [3] Samuel J Gershman and Yael Niv. Learning latent structure: carving nature at its joints. Current opinion in neurobiology, 20(2):251–256, 2010.
  • [4] Daniel A Braun, Carsten Mehring, and Daniel M Wolpert. Structure learning in action. Behavioural brain research, 206(2):157–165, 2010.
  • [5] Daniel A Braun, Stephan Waldert, Ad Aertsen, Daniel M Wolpert, and Carsten Mehring. Structure learning in a sensorimotor association task. PloS one, 5(1):e8973, 2010.
  • [6] Tim Genewein and Daniel A Braun. A sensorimotor paradigm for bayesian model selection. Frontiers in human neuroscience, 6, 2012.
  • [7] Herbert A Simon. Theories of bounded rationality. Decision and organization, 1:161–176, 1972.
  • [8] Richard D McKelvey and Thomas R Palfrey. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
  • [9] David H Wolpert. Information theory – the bridge connecting bounded rational game theory and statistical physics. In Complex Engineered Systems, pages 262–290. Springer, 2006.
  • [10] Hilbert J Kappen. Linear theory for control of nonlinear stochastic systems. Physical review letters, 95(20):200201, 2005.
  • [11] Jan Peters, Katharina Mülling, and Yasemin Altun. Relative entropy policy search. In AAAI, 2010.
  • [12] Emanuel Todorov. Efficient computation of optimal actions. Proceedings of the national academy of sciences, 106(28):11478–11483, 2009.
  • [13] Evangelos Theodorou, Jonas Buchli, and Stefan Schaal. A generalized path integral control approach to reinforcement learning. The Journal of Machine Learning Research, 11:3137–3181, 2010.
  • [14] Jonathan Rubin, Ohad Shamir, and Naftali Tishby. Trading value and information in mdps. In Decision Making with Imperfect Decision Makers, pages 57–74. Springer, 2012.
  • [15] Pedro A Ortega and Daniel A Braun. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 469(2153), 2013.
  • [16] Daniel A Braun, Pedro A Ortega, Evangelos Theodorou, and Stefan Schaal. Path integral control and bounded rationality. In Adaptive Dynamic Programming And Reinforcement Learning (ADPRL), 2011 IEEE Symposium on, pages 202–209. IEEE, 2011.
  • [17] Pedro Alejandro Ortega and Daniel Alexander Braun. Information, utility and bounded rationality. In Artificial General Intelligence, pages 269–274. Springer, 2011.
  • [18] Pedro A Ortega and Daniel A Braun. Free energy and the generalized optimality equations for sequential decision making. arXiv preprint arXiv:1205.3997, 2012.
  • [19] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. The 37th annual Allerton Conference on Communication, Control, and Computing, 1999.
  • [20] Susanne Still and James P Crutchfield. Structure or noise? arXiv preprint arXiv:0708.0654, 2007.
  • [21] Susanne Still, James P Crutchfield, and Christopher J Ellison. Optimal causal inference: Estimating stored information and approximating causal architecture. Chaos: An Interdisciplinary Journal of Nonlinear Science, 20(3):037111, 2010.
  • [22] Christopher A Sims. Implications of rational inattention. Journal of monetary Economics, 50(3):665–690, 2003.
  • [23] I Csiszár and G Tusnády. Information geometry and alternating minimization procedures. Statistics and decisions, 1984.
  • [24] Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 1991.
  • [25] Raymond W Yeung. Information theory and network coding. Springer, 2008.
  • [26] Richard Blahut. Computation of channel capacity and rate-distortion functions. IEEE Transactions on Information Theory, 18(4):460–473, 1972.
  • [27] Sander G van Dijk and Daniel Polani. Informational constraints-driven organization in goal-directed behavior. Advances in Complex Systems, 2013.