Bounded rational decision-making from elementary computations that reduce uncertainty

04/08/2019 · by Sebastian Gottwald, et al.

In its most basic form, decision-making can be viewed as a computational process that progressively eliminates alternatives, thereby reducing uncertainty. Such processes are generally costly, meaning that the amount of uncertainty that can be reduced is limited by the amount of available computational resources. Here, we introduce the notion of elementary computation based on a fundamental principle for probability transfers that reduce uncertainty. Elementary computations can be considered as the inverse of Pigou-Dalton transfers applied to probability distributions, closely related to the concepts of majorization, T-transforms, and generalized entropies that induce a preorder on the space of probability distributions. As a consequence, we can define resource cost functions that are order-preserving and therefore monotonic with respect to the uncertainty reduction. This leads to a comprehensive notion of decision-making processes with limited resources. Along the way, we prove several new results on majorization theory, as well as on entropy and divergence measures.


1. Introduction

In rational decision theory, uncertainty may have multiple sources that ultimately share the commonality that they reflect a lack of knowledge on the part of the decision-maker about the environment. A paramount example is the perfectly rational decision-maker [99] that has a probabilistic model of the environment and chooses its actions so as to maximize the expected utility entailed by the different choices. When we consider bounded rational decision-makers [87], we may add another source of uncertainty arising from the decision-maker's limited processing capabilities, since the decision-maker will not only accept a single best choice, but any satisficing option. Today, bounded rationality is an active research topic that crosses multiple scientific fields such as economics, political science, decision theory, game theory, computer science, and neuroscience [81, 66, 59, 12, 31, 63, 45, 89, 101, 42, 91, 92, 90, 46, 16, 70, 58, 1, 30], where uncertainty is one of the most important common denominators.

Uncertainty is often equated with Shannon entropy in information theory [85], measuring the average number of yes/no questions that have to be answered to resolve the uncertainty. Even though Shannon entropy has many desirable properties, there are plenty of alternative suggestions for entropy measures in the literature, known as generalized entropies, such as Rényi entropy [77] or Tsallis entropy [95]. Closely related to entropies are divergence measures, which express how probability distributions differ from a given reference distribution. If the reference distribution is uniform, then divergence measures can be expressed in terms of entropy measures, which is why divergences can be viewed as generalizations of entropy, for example the Kullback-Leibler divergence [52] generalizing Shannon entropy.
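
As a concrete illustration of these quantities (our own minimal example, not part of the original paper), the following Python snippet computes Shannon, Rényi, and Tsallis entropies and verifies that the Kullback-Leibler divergence from a uniform reference is an entropy difference:

```python
import numpy as np

def shannon(p):
    """Shannon entropy in bits: average number of yes/no questions."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def renyi(p, alpha):
    """Rényi entropy of order alpha (alpha != 1); tends to Shannon as alpha -> 1."""
    return np.log2(np.sum(p**alpha)) / (1 - alpha)

def tsallis(p, alpha):
    """Tsallis entropy of order alpha (alpha != 1), natural-log convention."""
    return (1 - np.sum(p**alpha)) / (alpha - 1)

def kl(p, q):
    """Kullback-Leibler divergence of p from the reference q, in bits."""
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

p = np.array([0.5, 0.25, 0.125, 0.125])
u = np.full(4, 0.25)  # uniform reference

# With a uniform reference, the divergence is an entropy difference:
print(kl(p, u))                  # 0.25
print(np.log2(4) - shannon(p))   # 0.25
```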

Here, we introduce the concept of elementary computation based on a slightly stronger notion of uncertainty than is expressed by Shannon entropy, or any other generalized entropy alone, but which is equivalent to all of them combined. Equating decision-making with uncertainty reduction then leads to a new comprehensive view of decision-making with limited resources. Our main contributions can be summarized as follows:

  1. Based on a fundamental concept of probability transfers related to the Pigou-Dalton principle of welfare economics [23], we promote a generalized notion of uncertainty reduction of a probability distribution that we call elementary computation. This leads to a natural definition of cost functions that quantify the resource costs for uncertainty reduction necessary for decision-making. We generalize these concepts to arbitrary reference distributions. In particular, we define Pigou-Dalton-type transfers for probability distributions relative to a reference or prior distribution, which induce a preorder that is slightly stronger than Kullback-Leibler divergence, but is equivalent to the notion of divergence given by all $f$-divergences combined. We prove several new characterizations of the underlying concept, known as relative majorization.

  2. An interesting property of cost functions is their behavior under coarse-graining, which plays an important role in decision-making and formalizes the general notion of making abstractions. More precisely, if a decision in a set $X$ is split up into two steps by partitioning $X$, first deciding among the (coarse-grained) partitions and second choosing a fine-grained option inside the selected partition, then it is an important question how the cost for the total decision-making process differs from the sum of the costs in each step. We show that $f$-divergences are superadditive with respect to coarse-graining, which means that decision-making costs can potentially be reduced by splitting up the decision into multiple steps. In this regard, we find evidence that the well-known property of Kullback-Leibler divergence of being additive under coarse-graining might be viewed as describing the minimal amount of processing cost that cannot be reduced by a more intelligent decision-making strategy.

  3. We define bounded rational decision-makers as decision-making processes that are optimizing a given utility function under a constraint on the cost function, or minimizing the cost function under a minimal requirement on expected utility. As a special case for Shannon-type information costs, we arrive at information-theoretic bounded rationality, which may form a normative baseline for bounded-optimal decision-making in the absence of process-dependent constraints. We show that bounded-optimal posteriors with informational costs trace a path through probability space that can itself be seen as an anytime decision-making process, where each step optimally trades off utility and processing costs.

  4. We show that Bayesian inference can be seen as a decision-making process with limited resources given by the number of available datapoints.

Section 2 deals with items 1 and 2, aiming at a general characterization of decision-making in terms of uncertainty reduction. Item 3 is covered in Section 3, deriving information-theoretic bounded rationality as a special case. Section 4 illustrates the concepts with an example including item 4. Sections 5 and 6 contain a general discussion and concluding remarks.


Notation

Let $\mathbb{R}$ denote the real numbers, $\mathbb{R}_{\geq 0}$ the set of non-negative real numbers, and $\mathbb{Q}$ the rational numbers. We write $|A|$ for the number of elements contained in a countable set $A$, and $A \setminus B$ for the set difference, that is the set of elements in $A$ that are not in $B$. $\mathbb{P}_X$ denotes the set of probability distributions on a set $X$; in particular, any $p \in \mathbb{P}_X$ is normalized so that $\sum_{x \in X} p(x) = 1$. Random variables are denoted by capital letters $X, Y, \dots$, while their explicit values are denoted by small letters $x, y, \dots$. For the probability distribution of a random variable $X$ we write $p_X$, and $p_X(x)$ for its values. Correspondingly, the expectation of a function $f$ is also written as $\mathbb{E}[f(X)]$, $\mathbb{E}_{p_X}[f]$, or $\sum_x p_X(x) f(x)$. We also write $\mathbb{E}[f(X)] \approx \frac{1}{M} \sum_{k=1}^M f(x_k)$, $x_k \sim p_X$, to denote the approximation of $\mathbb{E}[f(X)]$ by an average over $M$ samples from $p_X$.


2. Decision-making with limited resources

In this section, we develop the notion of a decision-making process with limited resources following the simple assumption that any decision-making process

  • reduces uncertainty

  • by spending resources.

Starting from an intuitive interpretation of uncertainty and resource costs, these concepts are refined incrementally until a precise definition of a decision-making process is given at the end of this section (Definition 2.11) in terms of elementary computations. Here, a decision-making process is a comprehensive term that describes all kinds of biological as well as artificial systems that are searching for solutions to given problems, for example a human decision-maker that burns calories while thinking, or a computer that uses electric energy to run an algorithm. However, resources do not necessarily refer to a real consumable quantity, but can also be more explicit quantities (like time) that serve as a proxy, for example the number of binary comparisons in a search algorithm, the number of forward simulations in a reinforcement learning algorithm, the number of samples in a Monte Carlo algorithm, or, even more abstractly, the limited availability of some source of information, like the number of datapoints that are available to an inference algorithm (see Section 4).


Figure 1. Decision-making as search in a set of options. At the expense of more and more resources, the number of uncertain options is progressively reduced until a single option $x^*$ remains.

2.1. Uncertainty reduction by eliminating options

In its most basic form, the concept of decision-making can be formalized as the process of looking for a decision in a discrete set of options $X$. We say that a decision is certain, if repeated queries of the decision-maker will result in the same decision, and it is uncertain, if repeated queries can result in different decisions. Uncertainty reduction then corresponds to reducing the number of uncertain options. Hence, a decision-making process that transitions from a space of options $A \subseteq X$ to a strictly smaller subset $A' \subset A$ reduces the number of uncertain options from $|A|$ to $|A'|$, with the possible goal to eventually find a single certain decision $x^* \in X$. Such a process is generally costly: the more uncertainty is reduced, the more resources it costs (Figure 1). The explicit mapping between uncertainty reduction and resource cost depends on the details of the underlying process and on which explicit quantity is taken as the resource. For example, if the resource is given by time (or any monotone function of time), then a search algorithm that eliminates options sequentially until the target value is found (linear search) is less cost efficient than an algorithm that takes a sorted list and in each step removes half of the options by comparing the midpoint to the target (logarithmic search). Abstractly, any real-valued function $C$ on the power set of $X$ that satisfies $C(A') > C(A)$ whenever $A' \subset A$ might be used as a cost function, in the sense that $C(A') - C(A)$ quantifies the expenses of reducing the uncertainty from $A$ to $A'$.
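
The contrast between the two search strategies can be made explicit in code; the following sketch (ours, with arbitrary example values) counts the comparisons each strategy spends to reduce the option set to a single certain decision:

```python
import math

def linear_search_steps(options, target):
    """Eliminate options one by one until the target is found."""
    steps = 0
    for x in options:
        steps += 1
        if x == target:
            return steps
    return steps

def binary_search_steps(sorted_options, target):
    """Each comparison with the midpoint discards half of the remaining options."""
    lo, hi, steps = 0, len(sorted_options) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_options[mid] == target:
            return steps
        elif sorted_options[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

opts = list(range(1024))
print(linear_search_steps(opts, 1000))  # ~|X| comparisons in the worst case
print(binary_search_steps(opts, 1000))  # ~log2(|X|) = 10 comparisons
```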

In utility theory, decision-making is modelled as an optimization process that maximizes a so-called utility function $U: X \to \mathbb{R}$ (which can itself be an expected utility with respect to a probabilistic model of the environment, in the sense of von Neumann and Morgenstern [99]). A decision-maker that is optimizing a given utility function obtains a utility of $\frac{1}{|A|} \sum_{x \in A} U(x)$ on average after reducing the set of uncertain options from $X$ to $A$ (see Figure 2). A decision-maker that completely reduces uncertainty by finding the optimum $x^* = \mathrm{argmax}_{x \in X}\, U(x)$ is called rational (w.l.o.g. we can assume that $x^*$ is unique, by redefining $X$ in the case when it is not). Since uncertainty reduction generally comes with a cost, a utility-optimizing decision-maker with limited resources, correspondingly called bounded rational (see Section 3), will in contrast obtain only uncertain decisions from a subset $A \subset X$. Such decision-makers seek satisfactory rather than optimal solutions, for example by taking the first option that satisfies a minimal utility requirement, which Herbert Simon calls a satisficing solution [87].

Summarizing, we conclude that a decision-making process with decision space $X$ that successively eliminates options can be represented by a mapping between subsets of $X$, together with a cost function that quantifies the total expenses of arriving at a given subset,

(2.1) $A \longmapsto A' \subseteq A\,, \qquad C: 2^X \to \mathbb{R}\,,$

such that

(2.2) $C(A') > C(A) \quad \text{whenever } A' \subset A\,.$

For example, a rational decision-maker can afford the cost $C(\{x^*\})$ of full uncertainty reduction, whereas a decision-maker with limited resources can typically only afford an uncertainty reduction to a subset $A$ with $C(A) < C(\{x^*\})$.


Figure 2. Decision-making as utility optimization process.

From a probabilistic perspective, a decision-making process as described above is a transition from a uniform probability distribution over $|A|$ options to a uniform probability distribution over $|A'| < |A|$ options, which converges to the Dirac measure centered at $x^*$ in the fully rational limit. From this point of view, the restriction to uniform distributions is artificial. A decision-maker that is uncertain about the optimal decision might indeed have a bias towards a subset $A \subset X$ without completely excluding the other options (the ones in $X \setminus A$), so that the behavior must be properly described by a probability distribution on $X$. Therefore, in the following section, we extend (2.1) and (2.2) to transitions between probability distributions. In particular, we must replace the power set of $X$ by the space of probability distributions on $X$, denoted by $\mathbb{P}_X$.


2.2. Probabilistic decision-making

Let $X$ be a discrete decision space of $n$ options, so that $\mathbb{P}_X$ consists of discrete distributions, often represented by probability vectors $p = (p_1, \dots, p_n)$. However, many of the concepts presented in this and the following section can be generalized to the continuous case [62, 44].

Intuitively, the uncertainty contained in a distribution is related to the relative inequality of its entries: the more similar the entries are, the higher the uncertainty. This means that uncertainty is increased by moving some probability weight from a more likely option to a less likely option. It turns out that this simple idea leads to a concept widely known as majorization [36, 10, 72, 13, 62, 11], which has roots in the economic literature of the early 20th century [60, 74, 23], where it was introduced to describe income inequality, later known as the Pigou-Dalton Principle of Transfers. Here, the operation of moving weight from a more likely to a less likely option corresponds to the transfer of income from one individual of a population to a relatively poorer individual (also known as a Robin Hood operation [10]). Since a decision-making process can be viewed as a sequence of uncertainty-reducing computations, we call the inverse of such a Pigou-Dalton transfer an elementary computation.

Definition 2.1 (Elementary computation).

A transformation on $\mathbb{P}_X$ of the form

(2.3) $(p_i, p_j) \longmapsto (p_i - \varepsilon,\, p_j + \varepsilon)\,,$

where $i, j$ are such that $p_i > p_j$, and $0 < \varepsilon < p_i - p_j$, is called a Pigou-Dalton transfer (see Figure 3). We call its inverse an elementary computation.

Since making two probability values more similar or more dissimilar are the only two possibilities to minimally transform a probability distribution, elementary computations are the most basic principle of how uncertainty is reduced. Hence, we conclude that a distribution $p$ has more uncertainty than a distribution $p'$ if and only if $p'$ can be obtained from $p$ by finitely many elementary computations (and permutations, which are not considered elementary computations due to the choice of $\varepsilon < p_i - p_j$).
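
A minimal sketch of Definition 2.1 in code (our illustration; the example vector is an arbitrary choice):

```python
import numpy as np

def pigou_dalton(p, i, j, eps):
    """Move probability eps from a more likely option i to a less likely
    option j (Eq. 2.3); requires 0 < eps < p[i] - p[j]. Increases uncertainty."""
    assert p[i] > p[j] and 0 < eps < p[i] - p[j]
    p = p.copy()
    p[i] -= eps
    p[j] += eps
    return p

def elementary_computation(p, i, j, eps):
    """The inverse transfer: make two entries more dissimilar, reducing uncertainty."""
    assert p[i] >= p[j] and eps > 0 and p[j] - eps >= 0
    p = p.copy()
    p[i] += eps
    p[j] -= eps
    return p

p = np.array([0.5, 0.3, 0.2])
print(pigou_dalton(p, 0, 2, 0.1))           # [0.4, 0.3, 0.3] -- more uniform
print(elementary_computation(p, 0, 2, 0.1)) # [0.6, 0.3, 0.1] -- less uniform
```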


Figure 3. A Pigou-Dalton transfer as given by Equation (2.3). The transfer of probability from a more likely to a less likely option increases uncertainty.
Definition 2.2 (Uncertainty).

We say that $p$ contains more uncertainty than $p'$, denoted by

(2.4) $p \prec p'\,,$

if and only if $p'$ can be obtained from $p$ by a finite number of elementary computations and permutations.

Note that, mathematically, this defines a preorder on $\mathbb{P}_X$, i.e. a reflexive ($p \prec p$ for all $p$) and transitive (if $p \prec p'$ and $p' \prec p''$, then $p \prec p''$) binary relation.

In the literature, there are different names for the relation between $p$ and $p'$ expressed by Definition 2.2: for example, $p$ is called more mixed than $p'$ [79], more disordered than $p'$ [78], more chaotic than $p'$ [13], or an average of $p'$ [36]. Most commonly, however, $p'$ is said to majorize $p$, which started with the early influences of Muirhead [65] and Hardy, Littlewood, and Pólya [36] and was developed by many authors into the field of majorization theory (a standard reference is by Marshall, Olkin, and Arnold [62]), with far-reaching applications until today, especially in nonequilibrium thermodynamics and quantum information theory [15, 41, 34].

There are plenty of equivalent (arguably less intuitive) characterizations of $\prec$, some of which we summarize below. One characterization makes use of a concept very closely related to Pigou-Dalton transfers, known as T-transforms [13, 62], which expresses the fact that moving some weight from a more likely option to a less likely option is equivalent to taking (weighted) averages of the two probability values. More precisely, a T-transform is a linear operator on $\mathbb{P}_X$ with a matrix of the form $T = (1-\lambda)\,\mathbb{1} + \lambda\, \Pi$, where $\mathbb{1}$ denotes the identity matrix, $\Pi$ denotes a permutation matrix of two elements, and $\lambda \in [0, 1]$. If $\Pi$ permutes $i$ and $j$, then $(Tp)_k = p_k$ for all $k \neq i, j$, and

(2.5) $(Tp)_i = (1-\lambda)\, p_i + \lambda\, p_j\,, \qquad (Tp)_j = \lambda\, p_i + (1-\lambda)\, p_j\,.$

Hence, a T-transform considers any two probability values $p_i$ and $p_j$ of a given $p$, calculates their weighted averages with weights $\lambda$ and $1-\lambda$, and replaces the original values with these averages. From (2.5) it follows immediately that a T-transform with parameter $\lambda \in (0,1)$ applied to $i$ and $j$ with $p_i > p_j$ is a Pigou-Dalton transfer with $\varepsilon = \lambda\,(p_i - p_j)$. Also allowing $\lambda = 1$ means that T-transforms include permutations; in particular, $p \prec p'$ if and only if $p$ can be derived from $p'$ by successive applications of finitely many T-transforms.
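
The matrix form of a T-transform is easy to verify numerically; the following sketch (ours) applies $T = (1-\lambda)\,\mathbb{1} + \lambda\,\Pi$ to an example vector and shows the two limiting cases:

```python
import numpy as np

def t_transform(p, i, j, lam):
    """Apply T = (1 - lam)*I + lam*Pi, where Pi swaps entries i and j:
    both entries are replaced by their weighted averages (Eq. 2.5)."""
    I = np.eye(len(p))
    Pi = np.eye(len(p))
    Pi[[i, j]] = Pi[[j, i]]          # permutation matrix for i and j
    T = (1 - lam) * I + lam * Pi
    return T @ p

p = np.array([0.5, 0.3, 0.2])
# lam in (0,1) acts like a Pigou-Dalton transfer with eps = lam*(p_i - p_j):
print(t_transform(p, 0, 2, 0.25))  # [0.425, 0.3, 0.275]
# lam = 1 is a pure permutation of the two entries:
print(t_transform(p, 0, 2, 1.0))   # [0.2, 0.3, 0.5]
```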

Due to a classic result by Hardy, Littlewood, and Pólya [36, p. 49], this characterization can be stated in an even simpler form by using doubly stochastic matrices, i.e. matrices $M = (M_{ij})$ with $M_{ij} \geq 0$ and $\sum_k M_{ik} = \sum_k M_{kj} = 1$ for all $i, j$. By writing $M \geq 0$ for entrywise non-negativity and $e = (1, \dots, 1)$, these conditions are often stated as

(2.6) $M \geq 0\,, \qquad M e = e\,, \qquad M^T e = e\,.$

Note that doubly stochastic matrices can be viewed as generalizations of T-transforms in the sense that a T-transform takes an average of two entries, whereas if $p = M p'$ with a doubly stochastic matrix $M$, then each entry $p_i = \sum_j M_{ij}\, p'_j$ is a convex combination, or a weighted average, of the entries of $p'$. This is also why $p$ is then called more mixed than $p'$ [79]. Therefore, similar to T-transforms, we might expect that if $p$ is the result of an application of a doubly stochastic matrix, $p = M p'$, then $p$ is an average of $p'$ and therefore contains more uncertainty than $p'$. This is exactly what is expressed by characterization (3) in the following theorem. A similar characterization of $\prec$ is that $p$ must be given by a convex combination of permutations of the elements of $p'$ (see property (4) below).

Without having the concept of majorization, Schur proved that functions of the form $p \mapsto \sum_i f(p_i)$ with a convex function $f$ are monotone with respect to the application of a doubly stochastic matrix [83] (see property (5) below). Functions of this form are an important class of cost functions for probabilistic decision-makers, as we will discuss in Example 2.2.


Figure 4. Comparability of probability distributions in $\mathbb{P}_X$. The region in the center consists of all $p$ that are majorized by $q$, i.e. $p \prec q$, whereas the outer region consists of all $p$ that majorize $q$, $q \prec p$. The bright regions are not comparable to $q$.
Theorem 2.3 (Characterizations of $\prec$ [62]).

For $p, p' \in \mathbb{P}_X$, the following are equivalent:

  1. $p \prec p'$, i.e. $p$ contains more uncertainty than $p'$ (Definition 2.2)

  2. $p$ is the result of finitely many T-transforms applied to $p'$

  3. $p = M p'$ for a doubly stochastic matrix $M$

  4. $p = \sum_k \lambda_k\, \Pi_k\, p'$, where $\lambda_k \geq 0$, $\sum_k \lambda_k = 1$, and $\Pi_k$ is a permutation matrix for all $k$

  5. $\sum_i f(p_i) \leq \sum_i f(p'_i)$ for all continuous convex functions $f$

  6. $\sum_{i=1}^k p_i^\downarrow \leq \sum_{i=1}^k (p'_i)^\downarrow$ for all $k \in \{1, \dots, n\}$, where $p^\downarrow$ denotes the decreasing rearrangement of $p$.

As argued above, the equivalence between (1) and (2) is straightforward. The equivalences between (2), (3), and (6) are due to Muirhead [65] and Hardy, Littlewood, and Pólya [36]. The implication (6) ⇒ (5) is due to Karamata [47] and Hardy, Littlewood, and Pólya [37], whereas (3) ⇒ (5) goes back to Schur [83]. Mathematically, (4) means that $p$ belongs to the convex hull of all permutations of the entries of $p'$, and the equivalence (3) ⟺ (4) is known as the Birkhoff-von Neumann theorem. Here, we have stated all relations for probability vectors, even though they are usually stated for arbitrary vectors in $\mathbb{R}^n$ with the additional requirement that $\sum_i p_i = \sum_i p'_i$.

Condition (6) is the classical and most commonly used definition of majorization [60, 36, 62], since it is often the easiest to check in practical examples. For example, from (6) it immediately follows that uniform distributions over $m$ options contain more uncertainty than uniform distributions over $m' \leq m$ options, since the partial sums satisfy $\min\{k/m, 1\} \leq \min\{k/m', 1\}$ for all $k$, i.e. for $m' \leq m$ it follows that

(2.7) $u_m \prec u_{m'}\,,$

where $u_m$ denotes the uniform distribution over $m$ options. In particular, if $A' \subseteq A$, then the uniform distribution over $A'$ contains less uncertainty than the uniform distribution over $A$, which shows that the notion of uncertainty introduced in Definition 2.2 is indeed a generalization of the notion of uncertainty given by the number of uncertain options introduced in the previous section.

Note that, $\prec$ only being a preorder on $\mathbb{P}_X$, in general two distributions are not necessarily comparable, i.e. we can have both $p \not\prec q$ and $q \not\prec p$. In Figure 4, we visualize the regions of all comparable distributions for two exemplary distributions on a three-dimensional decision space ($n = 3$), represented on the two-dimensional simplex of probability vectors. For example, two distributions $p$ and $q$ cannot be compared under $\prec$ if $p_1^\downarrow < q_1^\downarrow$, but $p_1^\downarrow + p_2^\downarrow > q_1^\downarrow + q_2^\downarrow$.
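
Characterization 6 of Theorem 2.3 translates directly into a test; the sketch below (our code, with arbitrary example vectors) checks the partial-sum criterion, reproduces (2.7), and exhibits an incomparable pair as in Figure 4:

```python
import numpy as np

def more_uncertain(p, p_prime, tol=1e-12):
    """Check p ≺ p' via characterization (6): partial sums of the
    decreasing rearrangement of p never exceed those of p'."""
    ps = np.sort(p)[::-1]        # decreasing rearrangement of p
    qs = np.sort(p_prime)[::-1]  # decreasing rearrangement of p'
    return bool(np.all(np.cumsum(ps) <= np.cumsum(qs) + tol))

u4 = np.full(4, 0.25)
u3 = np.array([1/3, 1/3, 1/3, 0.0])  # uniform over 3 of 4 options
print(more_uncertain(u4, u3))        # True: Eq. (2.7) with m = 4, m' = 3

p = np.array([0.6, 0.2, 0.2, 0.0])
r = np.array([0.5, 0.4, 0.1, 0.0])
print(more_uncertain(p, r), more_uncertain(r, p))  # False False: incomparable
```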

Cost functions can now be generalized to probabilistic decision-making by noting that the property $C(A') > C(A)$ whenever $A' \subset A$ in (2.2) means that $C$ is strictly monotonic with respect to the preorder given by set inclusion.

Definition 2.4 (Cost functions on $\mathbb{P}_X$).

We say that a function $C: \mathbb{P}_X \to \mathbb{R}$ is a cost function, if it is strictly monotonically increasing with respect to the preorder $\prec$, i.e. if

(2.8) $C(p) \leq C(p') \quad \text{whenever } p \prec p'\,,$

with equality only if $p$ and $p'$ are equivalent, $p \sim p'$, which is defined as $p \prec p'$ and $p' \prec p$. Moreover, for a parametrized family of posteriors $(p_\beta)_{\beta \geq 0}$, we say that $\beta$ is a resource parameter with respect to a cost function $C$, if the mapping $\beta \mapsto C(p_\beta)$ is strictly monotonically increasing.

Since monotonic functions with respect to majorization were first studied by Schur [83], functions with this property are usually called (strictly) Schur-convex [62, Ch. 3].


Figure 5. Examples of cost functions for decision spaces with three elements ($n = 3$): Shannon entropy, Tsallis entropy, and Rényi entropy.
Example (Generalized entropies).

From (5) in Theorem 2.3 it follows that functions of the form

(2.9) $C(p) = \sum_{i=1}^n f(p_i)\,,$

where $f$ is strictly convex, are examples of cost functions. Since many entropy measures used in the literature can be seen to be special cases of (2.9) with $f$ replaced by a concave function, functions of this form are often called generalized entropies [18]. In particular, for the choice $f(t) = t \log t$, we have $C(p) = -H(p)$, where $H$ denotes the Shannon entropy of $p$. Thus, if $p$ contains more uncertainty than $p'$ in the sense of Definition 2.2 ($p \prec p'$), then the Shannon entropy of $p$ is larger than the Shannon entropy of $p'$, and therefore $p$ also contains more uncertainty in the sense of classical information theory. Similarly, for $f(t) = -\log t$ we obtain the (negative) Burg entropy, and for functions of the form $f(t) = \pm\, t^\alpha$ for $\alpha > 0$ we get the (negative) Tsallis entropy, where the sign is chosen depending on $\alpha$ such that $f$ is convex (see e.g. [32] for more examples). Moreover, the composition of any strictly monotonically increasing function with (2.9) generates another class of cost functions, which contains for example the (negative) Rényi entropy [77]. Note also that entropies of the form (2.9) are special cases of Csiszár's $f$-divergences [20] for uniform reference distributions (see Example 2.3 below). In Figure 5, several examples of cost functions are shown for $n = 3$. In this case, the two-dimensional probability simplex is given by the triangle in $\mathbb{R}^3$ with vertices $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. Cost functions are visualized in terms of their level sets.
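
The Schur-convexity of (2.9) can be checked numerically; the following sketch (ours; the distributions and the choice $\alpha = 2$ are arbitrary) confirms that such costs decrease under a Pigou-Dalton transfer:

```python
import numpy as np

def cost(p, f):
    """Generalized-entropy cost C(p) = sum_i f(p_i) with f strictly convex
    (Eq. 2.9); Schur-convex, hence monotone with respect to ≺."""
    return np.sum(f(p))

f_shannon = lambda p: p * np.log(p)  # C(p) = -H(p)
f_tsallis2 = lambda p: p**2          # negative Tsallis entropy, order 2 (up to shift)

p = np.array([0.5, 0.3, 0.2])
p_more_uniform = np.array([0.4, 0.3, 0.3])  # p after a Pigou-Dalton transfer
for f in (f_shannon, f_tsallis2):
    # more uncertainty means a smaller cost value:
    print(cost(p_more_uniform, f) < cost(p, f))  # True, True
```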

We prove in Proposition A.1 in the appendix that all cost functions of the form (2.9) are superadditive with respect to coarse-graining. This seems to be a new result and an improvement upon the fact that generalized entropies (and $f$-divergences) satisfy information monotonicity [5]. More precisely, if a decision in $X$, represented by a random variable $X$, is split up into two steps by partitioning $X = \bigcup_y A_y$ and first deciding about the partition, correspondingly described by a random variable $Y$ with values in the set of partition labels, and then choosing a fine-grained option inside of the selected partition $A_y$, then the total cost $C(p_X)$ is related to the cost $C(p_Y)$ of the coarse-grained step and the expected cost of the fine-grained step by the superadditivity inequality

(2.10) $C(p_X) \ \geq\ C(p_Y) + \sum_y p_Y(y)\; C(p_{X|Y=y})\,,$

where $p_Y(y) = \sum_{x \in A_y} p_X(x)$ and $p_{X|Y=y}(x) = p_X(x)/p_Y(y)$ for $x \in A_y$. For symmetric cost functions (such as (2.9)) this is equivalent to (2.11). The case of equality in (2.10) and (2.11) (see Figure 6) is sometimes called separability [51], strong additivity [21], or recursivity [4], and it is often used to characterize Shannon entropy [26, 96, 48, 55, 77, 2]. In fact, we also show in the appendix (Proposition A.2) that cost functions that are additive under coarse-graining are proportional to the negative Shannon entropy $-H$. See also Example 2.3 in the next section, where we discuss the generalization to arbitrary reference distributions.
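
To make the coarse-graining decomposition concrete, the following sketch (our illustration; the partition, the distributions, and the $p_Y(y)$-weighted form of the two-step cost are our own choices) compares the one-shot cost with the summed costs of the two steps: for the Shannon-type cost the two coincide (additivity), while generalized costs differ in general, which is the phenomenon quantified by the superadditivity result:

```python
import numpy as np

def shannon_cost(p):
    """C(p) = sum_i p_i log p_i = -H(p), i.e. f(t) = t log t in (2.9)."""
    p = p[p > 0]
    return np.sum(p * np.log(p))

def tsallis2_cost(p):
    """f(t) = t^2 - t: the negative Tsallis entropy of order 2."""
    return np.sum(p**2 - p)

def two_step_cost(p, parts, cost):
    """Cost of deciding the partition first, then the option within it,
    weighting the second step by the partition probabilities p_Y(y)."""
    p_Y = np.array([p[list(A)].sum() for A in parts])
    fine = sum(w * cost(p[list(A)] / w) for w, A in zip(p_Y, parts))
    return cost(p_Y) + fine

p = np.array([0.4, 0.2, 0.3, 0.1])
parts = [(0, 1), (2, 3)]  # coarse-grained options {1,2} and {3,4}

print(shannon_cost(p), two_step_cost(p, parts, shannon_cost))    # equal: additivity
print(tsallis2_cost(p), two_step_cost(p, parts, tsallis2_cost))  # differ in general
```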

We can now refine the notion of a decision-making process introduced in the previous section as a mapping together with a cost function satisfying (2.2). Instead of simply mapping sets to smaller subsets by successively eliminating options, we now consider a mapping between probability distributions whose output $p'$ can be obtained from its input $p$ by a finite number of elementary computations (without permutations), and we require $C$ to be a cost function on $\mathbb{P}_X$, so that

(2.12) $C(p') > C(p) \quad \text{whenever } p \precneq p'\,.$


Figure 6. Additivity under coarse-graining. If the cost for $p_X$ is the sum of the cost for $p_Y$ and the expected cost for $p_{X|Y}$ given $Y$, then the cost function is proportional to Shannon entropy.

Here, $C(p)$ quantifies the total costs of arriving at a distribution $p$, and $p \precneq p'$ means that $p \prec p'$ and $p \not\sim p'$. In other words, a decision-making process can be viewed as traversing probability space by moving pieces of probability from one option to another option such that uncertainty is reduced.

Up to now we have ignored one important property of a decision-making process: the distribution with minimal cost, i.e. the $p_0$ satisfying $C(p_0) \leq C(p)$ for all $p \in \mathbb{P}_X$, which must be identified with the initial distribution of a decision-making process with cost function $C$. As one might expect (see Figure 5), it turns out that all cost functions according to Definition 2.4 have the same minimal element.

Proposition 2.5 (Uniform distributions are minimal).

The uniform distribution $u = (1/n, \dots, 1/n)$ is the unique minimal element in $\mathbb{P}_X$ with respect to $\prec$, i.e.

(2.13) $u \prec p \quad \text{for all } p \in \mathbb{P}_X\,,$

and $p \prec u$ only if $p = u$.

Once (2.13) is established, it follows from (2.8) that $C(u) \leq C(p)$ for all $p \in \mathbb{P}_X$; in particular, the uniform distribution corresponds to the initial state of all decision-making processes with cost function satisfying (2.12). Moreover, it contains the maximum amount of uncertainty with respect to any entropy measure of the form (2.9), known as the second Khinchin axiom [51], e.g. $H(p) \leq H(u) = \log n$ for Shannon entropy. Proposition 2.5 follows from characterization (4) in Theorem 2.3 after noticing that every $p$ can be transformed to the uniform distribution by permuting its elements cyclically (see Proposition A.3 in the appendix for a detailed proof).

Regarding the possibility that a decision-maker may have prior information, for example originating from the experience of previous comparable decision-making tasks, the assumption of a uniform initial distribution seems to be artificial. Therefore, in the following section we arrive at the final notion of a decision-making process by extending the results of this section to allow for arbitrary initial distributions.


2.3. Decision-making with prior knowledge

From the discussion at the end of the previous section we conclude that, in full generality, a decision-maker transitions from an initial probability distribution $q$, called prior, to a terminal distribution $p$, called posterior. Note that, since once eliminated options are excluded from the rest of the decision-making process, a posterior must be absolutely continuous with respect to the prior $q$, denoted by $p \ll q$, i.e. $p(x)$ can be non-zero for a given $x \in X$ only if $q(x)$ is non-zero.

The notion of uncertainty (Definition 2.2) can be generalized with respect to a non-uniform prior $q$ by viewing the probabilities $q_i$ as the probabilities of partitions of an underlying elementary probability space $\Omega$ of equally likely elements, in particular representing $q$ as the uniform distribution on $\Omega$ (see Figure 7). The similarity of the entries of the corresponding representation $\tilde{p}$ of any $p \ll q$ (its uncertainty) then contains information about how close $p$ is to $q$, which we call the relative uncertainty of $p$ with respect to $q$ (Definition 2.6 below).

The formal construction is as follows: Let $q \in \mathbb{P}_X$ be such that $q_i > 0$ and $q_i \in \mathbb{Q}$ for all $i$. The case of irrational entries then follows from a simple approximation of each entry by a rational number. Let $N \in \mathbb{N}$ be such that $N q_i \in \mathbb{N}$ for all $i$; for example, $N$ could be chosen as the least common multiple of the denominators of the $q_i$. The underlying elementary probability space $\Omega$ then consists of $N$ elements and there exists a partitioning $\Omega = \bigcup_i \Omega_i$ such that

(2.14) $q_i = \frac{|\Omega_i|}{N} = \sum_{\omega \in \Omega_i} u_N(\omega)\,,$

where $u_N$ denotes the uniform distribution on $\Omega$. In particular, it follows that

(2.15) $q_i = u_N(\Omega_i)\,,$

i.e. $u_N$ represents $q$ in $\Omega$ with respect to the partitioning $\{\Omega_i\}$. Similarly, any $p \ll q$ can be represented as a distribution $\tilde{p}$ on $\Omega$ by requiring that $\tilde{p}(\Omega_i) = p_i$ for all $i$ and letting $\tilde{p}$ be constant inside of each partition, i.e. similar to (2.15) we have $\tilde{p}(\Omega_i) = p_i$ for all $i$ and therefore by (2.14)

(2.16) $\tilde{p}(\omega) = \frac{p_i}{N q_i} = \frac{p_i}{q_i}\, u_N(\omega) \quad \text{for all } \omega \in \Omega_i\,.$

Note that, if $q_i = 0$ then $p_i = 0$ by absolute continuity ($p \ll q$), in which case we can either exclude option $i$ from $X$ or set $\Omega_i = \emptyset$.


Figure 7. Representation of $q$ and $p$ by $u_N$ and $\tilde{p}$ on $\Omega$ (Example 2.3), such that the probabilities $q_i$ and $p_i$ are given by the probabilities of the partitions $\Omega_i$ with respect to $u_N$ and $\tilde{p}$, respectively.
Example.

For instance, for a prior $q = (2/3, 1/3)$ we can put $N = 3$, so that $\Omega = \{1, 2, 3\}$ should be partitioned as $\Omega_1 = \{1, 2\}$ and $\Omega_2 = \{3\}$. Then $q_i$ corresponds to the probability of the $i$-th partition under the uniform distribution $u_3 = (1/3, 1/3, 1/3)$, while $p = (p_1, p_2)$ is represented on $\Omega$ by $\tilde{p} = (p_1/2,\, p_1/2,\, p_2)$ (see Figure 7).

Importantly, if the components of the representation $\tilde{p}$ given by (2.16) are similar to each other, i.e. if $\tilde{p}$ is close to uniform, then the components of $p$ must be very similar to the components of $q$, which we express by the concept of relative uncertainty.

Definition 2.6 (Uncertainty relative to $q$).

We say that $p$ contains more uncertainty with respect to a prior $q$ than $p'$, denoted by $p \prec_q p'$, if and only if $\tilde{p}$ contains more uncertainty than $\tilde{p}'$, i.e.

(2.17) $\tilde{p} \prec \tilde{p}'\,,$

where $\tilde{p}$ is given by (2.16).

As we will show in Theorem 2.7 below, it turns out that $\prec_q$ coincides with a known concept called $q$-majorization [97], majorization relative to $q$ [44, 62], or mixing distance [80]. Due to the lack of a characterization by partial sums, it is usually defined as a generalization of characterization (3) in Theorem 2.3, that is, $p$ is $q$-majorized by $p'$ iff $p = M p'$, where $M$ is a so-called $q$-stochastic matrix, which means that it is a stochastic matrix ($M_{ij} \geq 0$, $\sum_i M_{ij} = 1$) with $M q = q$. In particular, $\prec_q$ does not depend on the choice of $N$ in the definition of $\tilde{p}$. Here, we provide two new characterizations of $q$-majorization: the one given by Definition 2.6, and one using partial sums, generalizing the original definition of majorization.

Theorem 2.7 (Characterizations of $\prec_q$).

The following are equivalent:

  1. $p \prec_q p'$, i.e. $p$ contains more uncertainty relative to $q$ than $p'$ (Def. 2.6)

  2. $\tilde{p}'$ can be obtained from $\tilde{p}$ by a finite number of elementary computations and permutations on $\Omega$

  3. $p = M p'$ for a $q$-stochastic matrix $M$, i.e. $M_{ij} \geq 0$, $\sum_i M_{ij} = 1$, and $M q = q$

  4. $\sum_i q_i\, f\!\big(\tfrac{p_i}{q_i}\big) \leq \sum_i q_i\, f\!\big(\tfrac{p'_i}{q_i}\big)$ for all continuous convex functions $f$

  5. $\sum_{i=1}^k p_i^\downarrow \leq \sum_{i=1}^l (p'_i)^\downarrow$ for all $k, l$ that satisfy $\sum_{i=1}^k q_i^\downarrow = \sum_{i=1}^l q_i^\downarrow$, where the arrows indicate that the entries of $p$ and $p'$ (and accordingly of $q$) are ordered decreasingly with respect to the ratios $p_i/q_i$ and $p'_i/q_i$, respectively.

To prove that (1), (3), and (5) are equivalent (see Proposition A.4 in the appendix), we use the fact that the embedding $p \mapsto \tilde{p}$ has a left inverse. This can be verified by simply multiplying the corresponding matrices given in the proof of Proposition A.4. The equivalence between (3) and (4) is shown in [44] (see also [80, 62]). Characterization (2) follows immediately from Definition 2.2 and Definition 2.6.
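
Definition 2.6 can be turned into an explicit test for rational priors; the following sketch (our construction, following (2.14)-(2.16)) embeds $p$ into the elementary space $\Omega$ and applies the ordinary partial-sum criterion there:

```python
import numpy as np
from fractions import Fraction
from math import lcm

def embed(p, q):
    """Represent p on the elementary space of a rational prior q (Eq. 2.16):
    option i is split into N*q_i equally weighted elementary events."""
    fracs = [Fraction(x).limit_denominator(10**6) for x in q]
    N = lcm(*[f.denominator for f in fracs])
    sizes = [int(f * N) for f in fracs]
    return np.concatenate([np.full(s, pi / s) for pi, s in zip(p, sizes)])

def more_uncertain(p, p_prime, tol=1e-12):
    """Ordinary majorization check via partial sums (Theorem 2.3, item 6)."""
    ps, qs = np.sort(p)[::-1], np.sort(p_prime)[::-1]
    return bool(np.all(np.cumsum(ps) <= np.cumsum(qs) + tol))

q = [Fraction(2, 3), Fraction(1, 3)]  # prior
p1 = np.array([0.7, 0.3])             # closer to the prior
p2 = np.array([0.9, 0.1])             # further from the prior
# p1 contains more uncertainty relative to q than p2:
print(more_uncertain(embed(p1, q), embed(p2, q)))  # True
```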

As required by the discussion at the end of the previous section, the prior $q$ is indeed minimal with respect to $\prec_q$, which means that it contains the most uncertainty with respect to itself.

Proposition 2.8 (The prior is minimal).

The prior $q$ is the unique minimal element in $\mathbb{P}_X$ with respect to $\prec_q$, that is

(2.18) $q \prec_q p \quad \text{for all } p \ll q\,,$

and $p \prec_q q$ only if $p = q$.

This follows more or less directly from Proposition 2.5 and the equivalence of (1) and (2) in Theorem 2.7 (see Proposition A.5 in the appendix for a detailed proof).

Order-preserving functions with respect to $\prec_q$ generalize the cost functions introduced in the previous section (Definition 2.4). According to Proposition 2.8, such functions have a unique minimum given by the prior $q$. Since cost functions are used in Definition 2.11 below to quantify the expenses of a decision-making process, we require their minimum to be zero, which can always be achieved by redefining a given cost function by an additive constant.

Definition 2.9 (Cost functions relative to $q$).

We say that a function $C(\,\cdot\,\|\,q): \mathbb{P}_X \to \mathbb{R}$ is a cost function relative to $q$, if $C(q\,\|\,q) = 0$, if it is invariant under relabeling of the options, and if it is strictly monotonically increasing with respect to the preorder $\prec_q$, that is, if

(2.19) $C(p\,\|\,q) \leq C(p'\,\|\,q) \quad \text{whenever } p \prec_q p'\,,$

with equality only if $p \sim_q p'$, i.e. if $p \prec_q p'$ and $p' \prec_q p$. Moreover, for a parametrized family of posteriors $(p_\beta)_{\beta \geq 0}$, we say that $\beta$ is a resource parameter with respect to a cost function $C$, if the mapping $\beta \mapsto C(p_\beta\,\|\,q)$ is strictly monotonically increasing.


Figure 8. Examples of cost functions relative to a non-uniform prior $q$ for $n = 3$: Kullback-Leibler divergence, squared distance, and Tsallis relative entropy.

Similar to generalized entropy functions discussed in Example 2.2, in the literature there are many examples of relative cost functions, usually called divergences or measures of divergence.

Example (f-divergences).

From (4) in Theorem 2.7 it follows that functions of the form

(2.20) $D_f(p\,\|\,q) = \sum_{i=1}^n q_i\, f\!\left(\frac{p_i}{q_i}\right),$

where $f$ is continuous and strictly convex with $f(1) = 0$, are examples of cost functions relative to $q$. Many well-known divergence measures can be seen to belong to this class of relative cost functions, also known as Csiszár's $f$-divergences [20]: the Kullback-Leibler divergence (or relative entropy), the squared distance, the Hartley entropy, the Burg entropy, the Tsallis entropy, and many more [32, 21] (see Figure 8 for visualizations of some of them in $\mathbb{P}_X$ relative to a non-uniform prior).
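
A minimal implementation of (2.20) (ours; the choices of $f$, prior, and posterior are arbitrary examples):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p||q) = sum_i q_i f(p_i / q_i) (Eq. 2.20), assuming p << q."""
    return np.sum(q * f(p / q))

f_kl = lambda t: t * np.log(t)  # Kullback-Leibler divergence
f_sq = lambda t: (t - 1)**2     # squared distance (chi-square type)
f_ts2 = lambda t: t**2 - t      # Tsallis relative entropy, order 2

q = np.array([0.5, 0.25, 0.25])  # prior
p = np.array([0.7, 0.2, 0.1])    # posterior
for f in (f_kl, f_sq, f_ts2):
    print(f_divergence(p, q, f))  # all vanish iff p == q
```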

As a generalization of Proposition A.1 (superadditivity of generalized entropies), we prove in Proposition A.6 in the appendix that $f$-divergences are superadditive under coarse-graining, that is, the total cost of the one-step process and the summed costs of the corresponding two-step process satisfy the superadditivity inequality (2.21), under the conditions made precise in Proposition A.6. This generalizes (2.10) to the case of a non-uniform prior. Similar to entropies, the case of equality in (2.21) is sometimes called composition rule [40], chain rule [57], or recursivity [21], and is often used to characterize the Kullback-Leibler divergence [40, 63, 21, 57].

Indeed, we also show in the appendix (Proposition A.7) that all cost functions that are additive with respect to coarse-graining are proportional to the Kullback-Leibler divergence (relative entropy). This goes back to Hobson's modification [40] of Shannon's original proof [85], after establishing the following monotonicity property for uniform distributions: if $c(m, m')$ denotes the cost of a uniform distribution over $m$ elements relative to a uniform distribution over $m' \geq m$ elements, then (see Figure 9)

(2.22) $c(m, m')$ is increasing in $m'$ and decreasing in $m$.

Note that, even though our proof of Proposition A.7 uses additivity under coarse-graining to show the monotonicity property (2.22), it is easy to see that any relative cost function of the form (2.20) also satisfies (2.22) by using the convexity of $f$.


Figure 9. Monotonicity property (2.22). The cost is higher when more uncertainty has been reduced. If the posterior is the same, then it is cheaper to start from a prior with fewer options.

In terms of decision-making, superadditivity under coarse-graining means that decision-making costs can potentially be reduced by splitting up the decision into multiple steps, for example by a more intelligent search strategy. For example, if $|X| = m \cdot m'$ for some $m, m' \in \mathbb{N}$ and $C$ is superadditive, then the cost for reducing uncertainty to a single option, i.e. $p = \delta_x$, when starting from a uniform distribution $u_X$, satisfies

$C(\delta_x\,\|\,u_X) \ \geq\ C(\delta\,\|\,u_m) + C(\delta\,\|\,u_{m'})\,,$

where $\delta$ denotes a Dirac distribution on the respective space. Iterating this argument for $|X| = 2^k$ yields $C(\delta_x\,\|\,u_X) \geq k$, where we have set the cost $C(\delta\,\|\,u_2)$ of a single binary decision as unit cost (corresponding to 1 bit in the case of Kullback-Leibler divergence). Thus, intuitively, the property of the Kullback-Leibler divergence of being additive under coarse-graining might be viewed as describing the minimal amount of processing costs that must be contained in any cost function, because it cannot be reduced by changing the decision-making process. Therefore, in the following we call cost functions that are proportional to the Kullback-Leibler divergence simply informational costs.
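
For informational costs, the additivity under coarse-graining can be checked directly; in the sketch below (ours) a uniform prior over $8 = 2 \cdot 4$ options is reduced to a single option either in one shot or in two independent uniform-to-certain steps, with identical total cost:

```python
import numpy as np

def kl_cost(p, q):
    """Informational cost: Kullback-Leibler divergence in bits."""
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

def delta(N, i=0):
    """Dirac distribution on N options, centered at option i."""
    d = np.zeros(N)
    d[i] = 1.0
    return d

N, m, mp = 8, 2, 4  # |X| = m * m'
one_shot = kl_cost(delta(N), np.full(N, 1/N))
two_step = kl_cost(delta(m), np.full(m, 1/m)) + kl_cost(delta(mp), np.full(mp, 1/mp))
print(one_shot, two_step)  # 3.0 bits each: the KL cost is not reduced by splitting
```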

In contrast to the previous section, in the definition of $\prec_q$ and its characterizations we have never used elementary computations on $X$ directly. This is due to the fact that permutations do interact with the uncertainty relative to $q$, and therefore $\prec_q$ cannot be characterized by a finite number of elementary computations and permutations on $X$. However, we can still define elementary computations relative to $q$ by the inverse of Pigou-Dalton transfers of the form (2.3) such that $p_i/q_i > p_j/q_j$ for the options $i, j$ involved, which is arguably the most basic form of how to generate uncertainty with respect to $q$.

Even for small $\varepsilon$, a regular Pigou-Dalton transfer does not necessarily increase uncertainty relative to $q$, because the similarity of the components now needs to be considered with respect to $q$. Instead, we compare the components of the representation $\tilde{p}$ of $p$, and move some probability weight $\varepsilon$ from option $i$ to option $j$ whenever $\tilde{p}$ is larger on $\Omega_i$ than on $\Omega_j$, i.e. whenever $p_i/q_i > p_j/q_j$, by distributing $\varepsilon$ evenly among the elements in $\Omega_j$ (see Figure 10). Here, $\varepsilon$ must be small enough such that the inequality is invariant under the transfer, which means that

$\frac{p_i - \varepsilon}{q_i} \ \geq\ \frac{p_j + \varepsilon}{q_j}$

and therefore

(2.23) $0 < \varepsilon \ \leq\ \frac{p_i\, q_j - p_j\, q_i}{q_i + q_j}\,.$

By construction, such a transfer minimally increases uncertainty in $\Omega$ while staying in the image of $\mathbb{P}_X$ under $p \mapsto \tilde{p}$, by keeping the values of $\tilde{p}$ constant in each partition, and therefore can be considered as the most basic way of how to increase uncertainty relative to $q$.


Figure 10. Pigou-Dalton transfer relative to $q$. A distribution $p$ is transformed relative to $q$ by first moving some amount of weight $\varepsilon$ from $\Omega_i$ to $\Omega_j$, where $i, j$ are such that $p_i/q_i > p_j/q_j$, with $\varepsilon$ small enough such that this relation remains true after the transformation, and then mapping the transformed distribution back to $\mathbb{P}_X$ (see Definition 2.10).
Definition 2.10 (Elementary computation relative to $q$).

We call a transformation on $\mathbb{P}_X$ of the form

(2.24) $(p_i, p_j) \longmapsto (p_i - \varepsilon,\, p_j + \varepsilon)\,,$

with $i, j$ such that $p_i/q_i > p_j/q_j$, and $\varepsilon$ satisfying (2.23), a Pigou-Dalton transfer relative to $q$, and its inverse an elementary computation relative to $q$.

We are now in the position to state our final definition of a decision-making process.

Definition 2.11 (Decision-making process).

A decision-making process is a gradual transformation

$q = p_0 \to p_1 \to \dots \to p_k = p$

of a prior $q$ to a posterior $p$, such that each step decreases uncertainty relative to $q$. This means that $p$ is obtained from $q$ by successive application of a mapping between probability distributions on $X$, such that each $p_{t+1}$ can be obtained from $p_t$ by finitely many elementary computations relative to $q$, in particular

(2.25) $C(p_t\,\|\,q) < C(p_{t+1}\,\|\,q) \quad \text{whenever } p_t \precneq_q p_{t+1}\,,$

where $C(p\,\|\,q)$ quantifies the total costs of a distribution $p$, and $p \precneq_q p'$ means that $p \prec_q p'$ and $p \not\sim_q p'$.

In other words, a decision-making process can be viewed as traversing probability space from prior to posterior by moving pieces of probability from one option to another option such that uncertainty is reduced relative to $q$, while expending a certain amount of resources determined by the cost function $C$.


3. Bounded rationality

3.1. Bounded rational decision-making

In this section, we consider decision-making processes that trade off utility against costs. Such decision-makers either maximize a utility function subject to a constraint on the cost function, for example an author of a scientific article who optimizes the article's quality until a deadline is reached, or minimize the cost function subject to a utility constraint, for example a high-school student who minimizes effort subject to the requirement of passing a certain class. In both cases, the decision-makers are called bounded rational, since in the limit of no resource constraints they coincide with rational decision-makers.

In general, depending on the underlying system, such an optimization process might have additional process-dependent constraints that are not directly given by resource limitations, for example in cases when the optimization takes place in a parameter space that has fewer degrees of freedom than the full probability space $\mathbb{P}_X$. Abstractly, this is expressed by allowing the optimization process to search only in a subset $\Delta \subseteq \mathbb{P}_X$.

Definition 3.1 (Bounded rational decision-making process).

Let $U: X \to \mathbb{R}$ be a given utility function, and $\Delta \subseteq \mathbb{P}_X$. A decision-making process with prior $q$, posterior $p^*$, and cost function $C$ is called bounded rational if its posterior satisfies

(3.1) $p^* \in \underset{p \in \Delta}{\mathrm{argmax}}\ \mathbb{E}_p[U] \quad \text{subject to} \quad C(p\,\|\,q) \leq C_0$

for a given upper bound $C_0$, or equivalently

(3.2) $p^* \in \underset{p \in \Delta}{\mathrm{argmin}}\ C(p\,\|\,q) \quad \text{subject to} \quad \mathbb{E}_p[U] \geq U_0$

for a given lower bound $U_0$. In the case when the process constraints disappear, i.e. if $\Delta = \mathbb{P}_X$, a bounded rational decision-maker is called bounded-optimal.

The equivalence between (3.1) and (3.2) is easily seen from the equivalent optimization problem given by the formalism of Lagrange multipliers [25],

(3.3) $p^* \in \underset{p \in \Delta}{\mathrm{argmin}}\, \big( C(p\,\|\,q) - \beta\, \mathbb{E}_p[U] \big) \ =\ \underset{p \in \Delta}{\mathrm{argmax}}\, \Big( \mathbb{E}_p[U] - \tfrac{1}{\beta}\, C(p\,\|\,q) \Big)\,,$

where the cost or utility constraint is expressed by a trade-off between utility and cost, or cost and utility, with a trade-off parameter given by the Lagrange multiplier $\beta$, which is chosen such that the constraint given by $C_0$ or $U_0$ is satisfied. It is easily seen from the maximization problem on the right side of (3.3) that a larger value of $\beta$ decreases the weight of the cost term and thus allows for higher values of the cost function. Hence, $\beta$ parametrizes the amount of resources the decision-maker can afford with respect to the cost function $C$, and, at least in non-trivial cases (non-constant utilities), it is therefore a resource parameter with respect to $C$ in the sense of Definition 2.9. In particular, for $\beta \to 0$, the decision-maker minimizes its cost function irrespective of the expected utility and therefore stays at the prior, $p^* = q$, whereas $\beta \to \infty$ makes the cost term disappear, so that the decision-maker becomes purely rational with a Dirac posterior centered on the optima of the utility function $U$.
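
For informational costs and in the absence of process constraints ($\Delta = \mathbb{P}_X$), the trade-off (3.3) has the well-known closed-form solution $p^*(x) \propto q(x)\, e^{\beta U(x)}$, standard in information-theoretic bounded rationality; the following sketch (our toy example) traces such a family of bounded-optimal posteriors from the prior ($\beta = 0$) towards the rational limit ($\beta \to \infty$):

```python
import numpy as np

def bounded_optimal_posterior(q, U, beta):
    """Maximizer of E_p[U] - (1/beta) * KL(p||q) over the whole simplex:
    the Boltzmann-like posterior p(x) proportional to q(x) * exp(beta * U(x))."""
    w = q * np.exp(beta * U)
    return w / w.sum()

q = np.array([0.5, 0.3, 0.2])  # prior
U = np.array([1.0, 2.0, 0.5])  # utility function

for beta in (0.0, 1.0, 10.0, 100.0):
    print(beta, bounded_optimal_posterior(q, U, beta).round(3))
# beta = 0 reproduces the prior; large beta concentrates on argmax U
```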

For example, in Figure 11 we can see how the posteriors of bounded-optimal decision-makers with different cost functions for $n = 3$ and a fixed utility function $U$ leave a trace in probability space, by moving away from an exemplary prior $q$ and eventually arriving at the rational solution $\delta_{x^*}$.


Figure 11. Paths of bounded-optimal decision-makers in $\mathbb{P}_X$ for $n = 3$. The straight lines in the background denote level sets of expected utility, the solid lines are level sets of the cost functions, and the dashed curves represent the paths of bounded-optimal decision-makers given by (3.3) for increasing $\beta$, with prior $q$ and cost functions given by Kullback-Leibler divergence, Tsallis relative entropy, and Burg relative entropy.

For informational costs (i.e. cost functions proportional to the Kullback-Leibler divergence), $\beta$ is a resource parameter with respect to any cost function.

Proposition 3.2.

If $(p_\beta)_{\beta \geq 0}$ is a family of bounded-optimal posteriors given by (3.3) with informational costs $C = D_{KL}(\,\cdot\,\|\,q)$, then $\beta$ is a resource parameter with respect to any cost function, in particular

(3.4) $p_{\beta'} \prec_q p_\beta \quad \text{whenever } \beta' \leq \beta\,.$

This generalizes a result in [78] to the case of non-uniform priors, by making use of our new characterization