The question of how to combine a given set of individual entities in order to perform a certain task efficiently is a long-standing question shared by many disciplines, including economics, neuroscience, and computer science. Even though the explicit nature of a single individual might differ between these fields, e.g. an employee of a company, a neuron in a human brain, or a computer or processor as part of a cluster, they have one important feature in common that usually prevents them from functioning in isolation: they are all limited. In fact, this was the driving idea that inspired Herbert A. Simon's early work on decision-making within economic organizations (Simon, 1943, 1955), which earned him a Nobel prize in 1978. He suggested that a scientific behavioral grounding of economics should be based on bounded rationality, which has remained an active research topic until today (Russell and Subramanian, 1995; Lipman, 1995; Aumann, 1997; Kaelbling et al., 1998; DeCanio and Watkins, 1998; Gigerenzer and Selten, 2001; Jones, 2003; Sims, 2003; Burns et al., 2013; Ortega and Braun, 2013; Acerbi et al., 2014; Gershman et al., 2015). Subsequent studies in management theory have been built upon Simon's basic observation, because “if individual managers had unlimited access to information that they could process costlessly and instantaneously, there would be no role for organizations employing multiple managers” (Geanakoplos and Milgrom, 1991). In neuroscience and biology, similar concepts have been used to explore the evolution of specialization and modularity in nature (Kashtan and Alon, 2005; Wagner et al., 2007). In modern computer science, the terms parallel computing and distributed computing denote two separate fields that share the concept of decentralized computing (Radner, 1993), i.e. the combination of multiple processing units in order to decrease the time of computationally expensive calculations.
Despite their success, most approaches to the organization of decision-making units based on bounded rationality also have shortcomings: as DeCanio and Watkins (1998) point out, existing agent-based methods (including their own) do not use an overarching optimization principle, but are tailored to the specific types of calculations the agents are capable of, and therefore lack generality. Moreover, it is usually imposed as a separate assumption that there are two types of units, specialized operational units and coordinating non-operational units, which was expressed by Knight (1921) as “workers do, and managers figure out what to do”.
Here, we use a Free Energy optimization principle in order to study systems of bounded rational agents, extending the work in (Ortega and Braun, 2011, 2013; Genewein and Braun, 2013; Genewein et al., 2015) on decision-making, hierarchical information-processing, and abstraction in intelligent systems with limited information-processing capacity, which has precursors in the economic and game-theoretic literature (McKelvey and Palfrey, 1995; Ochs, 1995; Mattsson and Weibull, 2002; Wolpert, 2006; Spiegler, 2011; Howes et al., 2009; Todorov, 2009; Still, 2009; Tishby and Polani, 2011; Kappen et al., 2012; Edward et al., 2014; Lewis et al., 2014). Note that the Free Energy optimization principle of information-theoretic bounded rationality is connected to the Free Energy principle used in variational Bayes and Active Inference (Friston et al., 2015a, b, 2017a, 2017b), but has a conceptually distinct interpretation and some formal differences (see Section 6.3 for a detailed comparison).
By generalizing the ideas in (Genewein and Braun, 2013; Genewein et al., 2015) on two-step information-processing to an arbitrary number of steps, we arrive at a general Free Energy principle that can be used to study systems of bounded rational agents. The advantages of our approach can be summarized as follows:
The computational nature of the optimization principle makes it possible to explicitly calculate and compare optimal performances of different agent architectures for a given set of objectives and resource constraints (see Section 5).
The information-theoretic description implies the existence of the two types of units mentioned above, non-operational units (selector nodes) that coordinate the activities of operational units. Depending on their individual resource constraints, the Free Energy principle assigns each unit to a region of specialization that is part of an optimal partitioning of the underlying decision space (see Section 4.3).
In particular, we find that, for a wide range of objectives and resource limitations (see Sections 9 and 5.3), hierarchical systems with specialized experts at lower levels and coordinating units at higher levels generally outperform other structures.
This section serves as an introduction to the terminology required for our framework presented in Sections 3 and 4.
We use curly letters, $\mathcal{W}$, $\mathcal{A}$, $\mathcal{X}$, etc. to denote sets of finite cardinality, in particular the underlying spaces of the corresponding random variables $W$, $A$, $X$, etc., whereas the values of these random variables are denoted by small letters, i.e. $w \in \mathcal{W}$, $a \in \mathcal{A}$, and $x \in \mathcal{X}$, respectively. We denote the space of probability distributions on a given set $\mathcal{X}$ by $\mathbb{P}_{\mathcal{X}}$. Given a probability distribution $p \in \mathbb{P}_{\mathcal{X}}$, the expectation of a function $f: \mathcal{X} \to \mathbb{R}$ is denoted by $\langle f \rangle_p := \sum_{x} p(x) f(x)$. If the underlying probability measure is clear without ambiguity we just write $\langle f \rangle$.
For a function $f$ with multiple arguments, e.g. for $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, $(x,y) \mapsto f(x,y)$, we denote the function $\mathcal{X} \to \mathbb{R}$, $x \mapsto f(x,y)$ for fixed $y$ by $f(\cdot, y)$ (partial application), i.e. the dot indicates the variable of the new function. Similarly, for fixed $y \in \mathcal{Y}$, we denote a conditional probability distribution on $\mathcal{X}$ with values $p(x|y)$ by $p(\cdot|y)$. This notation shows the dependencies clearly without giving up the original function names and thus allows us to write more complicated expressions in a concise form. For example, if $F$ is a functional defined on functions of one variable, e.g. $F[f] := \sum_{x} f(x)$ for all functions $f: \mathcal{X} \to \mathbb{R}$, then evaluating $F$ on the function $g: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ in its first variable while keeping the second variable fixed, is simply denoted by $F[g(\cdot, y)]$. Here, the dot indicates on which argument of $g$ the functional $F$ is acting and at the same time it records that the resulting value (which equals $\sum_{x} g(x,y)$ in the case of the example) does not depend on a particular $x$ but on the fixed $y$.
Here, we consider (multi-task) decision-making as the process of observing a world state $w \in \mathcal{W}$, sampled from a given distribution $\rho \in \mathbb{P}_{\mathcal{W}}$, and choosing a corresponding action $a \in \mathcal{A}$ drawn from a posterior policy $P(\cdot|w) \in \mathbb{P}_{\mathcal{A}}$. Assuming that the joint distribution of $W$ and $A$ is given by $p(w,a) := \rho(w) P(a|w)$, then $P$ is the conditional probability distribution of $A$ given $W$. Unless stated otherwise, the capital letter $P$ always denotes a posterior, while the small letter $p$ denotes the joint distribution or a marginal of the joint (i.e. a dependent variable).
A decision-making unit is called an agent. An agent is rational if its posterior policy $P$ maximizes the expected utility
$$\langle U \rangle_p = \sum_{w} \rho(w) \sum_{a} P(a|w)\, U(w,a) \qquad (1)$$
for a given utility function $U: \mathcal{W} \times \mathcal{A} \to \mathbb{R}$. Note that the utility $U$ may itself represent an expected utility over consequences in the sense of von Neumann and Morgenstern (1944), where $W$ would serve as a context variable for different tasks. The posterior $P$ can be seen as a state-action policy that selects the best action $a$ with respect to a utility function $U$ given the state $w$ of the world.
2.2 Bounded rational agents
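An agent is bounded rational if its posterior policy $P$ maximizes the expected utility subject to the constraint
$$\sum_{w} \rho(w)\, D_{KL}\big(P(\cdot|w)\,\|\,q\big) \le D_0 \qquad (2)$$
for a given bound $D_0 > 0$ and a prior policy $q \in \mathbb{P}_{\mathcal{A}}$. Here, $D_{KL}(p\,\|\,q)$ denotes the Kullback-Leibler (KL) divergence between two distributions $p$ and $q$ on a set $\mathcal{X}$, defined by $D_{KL}(p\,\|\,q) := \sum_{x} p(x) \log \frac{p(x)}{q(x)}$. Note that, for $D_{KL}(p\,\|\,q)$ to be well-defined, $p$ must be absolutely continuous with respect to $q$, so that $q(x) = 0$ implies $p(x) = 0$. When $p$ or $q$ are conditional probabilities, then we treat $D_{KL}(p\,\|\,q)$ as a function of the additional variables.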
Given a world state $w$, the information-processing consists of transforming a prior $q$ to a world state specific posterior distribution $P(\cdot|w)$. Since $D_{KL}(P(\cdot|w)\,\|\,q)$ measures by how much $P(\cdot|w)$ diverges from $q$, the upper bound $D_0$ in (2) characterizes the limitation of the agent’s average information-processing capability: If $D_0$ is close to zero, the posterior must be close to the prior for all world states, which means that $A$ contains only little information about $W$, whereas if $D_0$ is large, the posterior is allowed to deviate from the prior by larger amounts and therefore $A$ contains more information about $W$. We use the KL-divergence as a proxy for any resource measure, as any resource must be monotone in processed information, which is measured by the KL-divergence between prior and posterior.
Technically, maximizing expected utility under the constraint (2) is the same as minimizing expected complexity cost under the constraint of a minimal expected performance, where complexity is given by the expected KL-divergence between prior and posterior and performance by expected utility. Minimizing complexity means minimizing the number of bits required to generate the actions.
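The resource constraint described above can be illustrated with a minimal sketch. It assumes finite world-state and action sets, distributions stored as plain Python lists, and logarithms to base 2 so that the result is in bits; the function names and the two example policies are hypothetical and not part of the original framework.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two distributions over the same finite set."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x > 0.0:
            if q_x == 0.0:
                return math.inf  # p is not absolutely continuous w.r.t. q
            total += p_x * math.log2(p_x / q_x)
    return total

def expected_kl(rho, posterior, prior):
    """Average of D_KL(P(.|w) || q) over world states w distributed as rho."""
    return sum(r * kl_divergence(p_w, prior) for r, p_w in zip(rho, posterior))

rho = [0.5, 0.5]                           # two equally likely world states
prior = [0.5, 0.5]                         # prior policy over two actions
deterministic = [[1.0, 0.0], [0.0, 1.0]]   # one fixed action per world state
uninformed = [[0.5, 0.5], [0.5, 0.5]]      # posterior equal to the prior

print(expected_kl(rho, deterministic, prior))  # cost of the deterministic policy
print(expected_kl(rho, uninformed, prior))     # cost of ignoring the world state
```

In this toy setting the deterministic policy needs one bit on average, so it is only admissible for a bound of at least one bit, while the uninformed policy satisfies any bound.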
2.3 Free Energy principle
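By the variational method of Lagrange multipliers, the above constrained optimization problem is equivalent to the unconstrained problem
$$\max_{P} F[P], \qquad F[P] := \sum_{w} \rho(w) \Big[ \sum_{a} P(a|w)\, U(w,a) - \frac{1}{\beta}\, D_{KL}\big(P(\cdot|w)\,\|\,q\big) \Big], \qquad (3)$$
where $\beta > 0$ is chosen such that the constraint (2) is satisfied. In the literature on information-theoretic bounded rationality (Ortega and Braun, 2011, 2013), the objective $F$ in (3) is known as the Free Energy of the corresponding decision-making process. In this form, the optimal posterior can be explicitly derived by determining the zeros of the functional derivative of $F$ with respect to $P$, yielding the Boltzmann-Gibbs distribution
$$P(a|w) = \frac{1}{Z(w)}\, q(a)\, e^{\beta U(w,a)}, \qquad Z(w) := \sum_{a} q(a)\, e^{\beta U(w,a)}. \qquad (4)$$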
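Note how the Lagrange multiplier $\beta$ (also known as the inverse temperature) interpolates between an agent with zero processing capability that always acts according to its prior policy ($\beta = 0$) and a perfectly rational agent ($\beta \to \infty$). Note that plugging (4) back into the Free Energy (3) gives
$$\max_{P} F[P] = \frac{1}{\beta} \sum_{w} \rho(w) \log Z(w), \qquad (5)$$
with $Z(w)$ the normalization constant of (4).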
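The interpolation between prior and rational behavior can be checked numerically. The following is a minimal sketch for a single fixed world state, assuming natural logarithms and hypothetical values for the prior and the utilities; `boltzmann_posterior` and `free_energy` are illustrative names rather than functions from the original work.

```python
import math

def boltzmann_posterior(prior, utilities, beta):
    """Posterior proportional to prior(a) * exp(beta * U(a)) for one world state.

    Returns the posterior and log Z, the log of the normalization constant.
    """
    u_max = max(utilities)  # shift utilities for numerical stability
    weights = [q * math.exp(beta * (u - u_max)) for q, u in zip(prior, utilities)]
    norm = sum(weights)
    posterior = [x / norm for x in weights]
    log_z = math.log(norm) + beta * u_max
    return posterior, log_z

def free_energy(prior, posterior, utilities, beta):
    """Expected utility minus (1/beta) * KL(posterior || prior), for beta > 0."""
    expected_utility = sum(p * u for p, u in zip(posterior, utilities))
    kl = sum(p * math.log(p / q) for p, q in zip(posterior, prior) if p > 0.0)
    return expected_utility - kl / beta

prior = [0.25, 0.25, 0.5]      # hypothetical prior policy over three actions
utilities = [1.0, 0.0, 0.5]    # hypothetical utilities for one world state

for beta in (0.0, 2.0, 50.0):
    posterior, log_z = boltzmann_posterior(prior, utilities, beta)
    print(beta, [round(p, 3) for p in posterior])
```

For `beta = 0` the posterior coincides with the prior, for large `beta` it concentrates on the action with the highest utility, and for any positive `beta` the value returned by `free_energy` at the Boltzmann posterior agrees with `log_z / beta`.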
2.4 Optimal prior
The performance of a given bounded rational agent crucially depends on the choice of the prior policy $q$. Depending on $\beta$ and the explicit form of the utility function, it can be advantageous to a priori prefer certain actions over others. Therefore, optimal bounded rational decision-making includes optimizing the prior in (3). In contrast to (3), the modified optimization problem
$$\max_{q,\,P} \sum_{w} \rho(w) \Big[ \sum_{a} P(a|w)\, U(w,a) - \frac{1}{\beta}\, D_{KL}\big(P(\cdot|w)\,\|\,q\big) \Big] \qquad (6)$$
does not have a closed form solution. However, since the objective is concave in $(q, P)$ (the KL-divergence being jointly convex in its two arguments), a unique solution can be obtained iteratively by alternating between fixing one and optimizing the other variable (Csiszár and Tusnády, 1984), resulting in a Blahut-Arimoto type algorithm (Arimoto, 1972; Blahut, 1972) that consists of alternating the equations
$$P^*(a|w) = \frac{1}{Z(w)}\, q^*(a)\, e^{\beta U(w,a)}, \qquad q^*(a) = \sum_{w} \rho(w)\, P^*(a|w), \qquad (7)$$
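with $Z(w)$ given by (4). In particular, the optimal prior policy is the marginal of the joint distribution of $W$ and $A$. In this case, the average Kullback-Leibler divergence between prior and posterior coincides with the mutual information between $W$ and $A$,
$$I(W;A) = \sum_{w} \rho(w)\, D_{KL}\big(P(\cdot|w)\,\|\,p\big), \qquad p(a) = \sum_{w} \rho(w)\, P(a|w).$$
It follows that the modified optimization principle (6) is equivalent to
$$\max_{P} \Big[ \sum_{w} \rho(w) \sum_{a} P(a|w)\, U(w,a) - \frac{1}{\beta}\, I(W;A) \Big]. \qquad (8)$$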
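The alternating scheme described in this section can be sketched as follows. This is an illustrative implementation under simplifying assumptions (finite sets, a fixed number of iterations instead of a convergence test, a uniform initial prior, natural logarithms); the function names and the toy utility are hypothetical.

```python
import math

def blahut_arimoto(rho, utility, beta, iterations=200):
    """Alternate the Boltzmann posterior update and the marginal prior update.

    rho[w] is the world-state distribution and utility[w][a] the utility.
    Returns the prior and posterior after the last prior update, so the
    returned prior is exactly the marginal of the returned posterior.
    """
    n_actions = len(utility[0])
    prior = [1.0 / n_actions] * n_actions  # assumed initialization
    posterior = []
    for _ in range(iterations):
        posterior = []
        for u_w in utility:
            weights = [q * math.exp(beta * u) for q, u in zip(prior, u_w)]
            z = sum(weights)
            posterior.append([x / z for x in weights])
        prior = [sum(r * p_w[a] for r, p_w in zip(rho, posterior))
                 for a in range(n_actions)]
    return prior, posterior

def mutual_information(rho, posterior):
    """I(W;A) in nats, computed as the average KL to the action marginal."""
    n_actions = len(posterior[0])
    marginal = [sum(r * p_w[a] for r, p_w in zip(rho, posterior))
                for a in range(n_actions)]
    return sum(r * p * math.log(p / m)
               for r, p_w in zip(rho, posterior)
               for p, m in zip(p_w, marginal) if p > 0.0)

rho = [0.5, 0.5]
utility = [[1.0, 0.0], [0.0, 1.0]]  # each world state has its own best action

for beta in (0.0, 1.0, 20.0):
    prior, posterior = blahut_arimoto(rho, utility, beta)
    print(beta, round(mutual_information(rho, posterior), 4))
```

In this symmetric toy problem the mutual information grows from zero at `beta = 0` towards `log 2` nats for large `beta`, reflecting the trade-off between expected utility and processed information.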
2.5 Multi-step and multi-agent systems
When multiple random variables are involved in a decision-making process, such a process constitutes a multi-step system (see Section 3). Consider the case of a prior over that is conditioned on an additional random variable with values , i.e. for all . Remember that we introduced a bounded rational agent as a decision-making unit, that, after observing a world state , transforms a single prior policy over a choice space to a posterior policy . Therefore, in the case of a conditional prior, the collection of prior policies can be considered as a collection or ensemble of agents, or a multi-agent system, where for a given , the prior is transformed to a posterior by exactly one agent. Note that a single agent deciding about both, and , would be modelled by a prior of the form with and , instead.
Hence, in order to combine multiple bounded rational agents, we are first splitting the full decision-making process into multiple steps by introducing additional intermediate random variables (Section 3), which then will be used to assign one or more agents to each of these steps (Section 4). In this view, we can regard a multi-agent decision-making system as performing a sequence of successive decision steps until an ultimate action is selected.
3 Multi-step bounded rational decision-making
3.1 Decision nodes
Let and denote the random variables describing the full decision-making process for a given utility function , as described in Section 2. In order to separate the full process into steps, we introduce internal random variables , …, , which represent the outputs of additional intermediate bounded rational decision-making steps. For each , let denote the target space and a particular value of . We call a random variable that is part of a multi-step decision-making system a (decision) node. For simplicity, we assume that all intermediate random variables are discrete (just like and ).
Here, we treat feed-forward architectures originating at and terminating in . This allows us to label the variables according to the information flow, so that potentially can only obtain information about if . The canonical factorization
of the joint probability distribution of therefore consists of the posterior policies of each decision node.
3.2 Two types of nodes: inputs and prior selectors
A specific multi-step architecture is characterized by specifying the explicit dependencies on the preceding variables for each node’s prior and posterior, or better the missing dependencies. For example, in a given multi-step system, the posterior of the node might depend explicitly on the outputs of and but not on , so that . If its prior has the form , then has to process the output of . Moreover, in this case, the actual prior policy that is used by for decision-making is selected by (see Figure 1).
In general, the inputs that have to be processed by a particular node , are given by the variables in the posterior that are missing from the prior, and, if its prior is conditioned on the outputs of , then these nodes select which of the prior policies is used by for decision-making, i.e. for the transformation
We denote the collection of input nodes of by () and the prior selecting nodes of by (). The joint distribution of is then given by
for all and , ().
Specifying the sets and of selectors and inputs for each node in the system then uniquely characterizes a particular multi-step decision-making system. Note that we always have .
Decompositions of the form (9) are often visualized by directed acyclic graphs, so-called DAGs (see e.g. Bishop, 2006, pp. 360). Here, in addition to the decomposition of the joint in terms of posteriors, we have added the information about the prior dependencies in terms of dashed arrows, as shown in Figure 1.
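The factorization of the joint distribution into posteriors along a feed-forward graph can be made concrete with a small sketch. It assumes that each node is described by its cardinality, the indices of the earlier variables its posterior depends on (selectors and inputs together, since both condition the posterior), and a conditional probability table; this data layout and the serial two-step example are hypothetical.

```python
from itertools import product

def joint_distribution(rho, nodes):
    """Joint over (w, x_1, ..., x_N) as a product of node posteriors.

    nodes is a list of (cardinality, parents, table) in feed-forward order.
    parents holds indices of earlier variables (index 0 is the world state),
    and table maps a tuple of parent values to a list of probabilities.
    """
    sizes = [len(rho)] + [cardinality for cardinality, _, _ in nodes]
    joint = {}
    for values in product(*(range(size) for size in sizes)):
        prob = rho[values[0]]
        for k, (_, parents, table) in enumerate(nodes, start=1):
            parent_values = tuple(values[i] for i in parents)
            prob *= table[parent_values][values[k]]
        joint[values] = prob
    return joint

rho = [0.5, 0.5]
nodes = [
    # intermediate node: posterior depends on the world state only
    (2, (0,), {(0,): [0.9, 0.1], (1,): [0.2, 0.8]}),
    # final node: posterior depends on the intermediate node only
    (2, (1,), {(0,): [1.0, 0.0], (1,): [0.0, 1.0]}),
]

joint = joint_distribution(rho, nodes)
print(joint[(0, 0, 0)], sum(joint.values()))
```

Which of the parents act as prior selectors and which as inputs does not change the joint distribution; that distinction only matters for the KL terms of the Free Energy.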
3.3 Multi-step Free Energy principle
If and denote the posterior and prior of the -th node of an -step decision-process, then the Free Energy principle takes the form
where, in addition to the expectation over inputs, the average of now also includes the expectation with respect to ,
Since the prior policies only appear in the KL-divergences, and moreover, there is exactly one KL-divergence per prior, it follows as in 2.4, that for each the optimal prior is the marginal given for all by
whenever . Hence, the Free Energy principle can be simplified to
where denotes the conditional mutual information of two random variables given a third random variable .
By optimizing (12) alternatingly, i.e. optimizing one posterior at a time while keeping the others fixed, we obtain for each ,
whenever and . Here, denotes the normalization constant and denotes the (effective) utility function on which the decision-making in is based. More precisely, given , it is the Free Energy of the subsequent nodes in the system, i.e. for any value of we obtain for ,
Here, and are collections of values of the random variables in and , respectively. The final Blahut-Arimoto-type algorithm consists of iterating (13), (11), and (14) for each until convergence is achieved. Note that, since each optimization step is convex (marginal convexity), convergence is guaranteed, but the limit is generally not unique (Jain and Kar, 2017), so that, depending on the initialization, one might end up in a local optimum.
3.4 Example: two-step information-processing
The cases of serial and parallel information-processing studied in (Genewein and Braun, 2013) are special cases of the multi-step decision-making systems introduced above. Both cases are two-step processes () involving the variables , , and . The serial case is characterized by , and the parallel case by . There is a third possible combination for , given by . However, it can be shown that this case is equivalent to the (one-step) rate distortion case from Section 2, because if has direct world state access, then any extra input to the final node , that is not a prior selector, contains redundant information.
4 Systems of bounded rational agents
4.1 From multi-step to multi-agent systems
As explained in 2.5 above, a single random variable that is part of an -step decision-making system can represent a single agent or a collection of multiple agents, depending on the cardinality of , i.e. whether has multiple priors which are selected by the nodes in or not. Therefore, an -step bounded rational decision-making system with represents a bounded rational multi-agent system (of depth ).
For a given , each value of corresponds to exactly one agent in . During decision-making, the agents that belong to the nodes in are choosing which of the agents in is going to receive a given input (see 4.4 below for a detailed example). This decision is based on how well the selected agent will perform on the input by transforming its prior policy into a posterior policy , subject to the constraint
where is a given bound on the agent’s information-processing capability. Similarly to multi-step systems, this choice is based on the performance measured by the Free Energy of the subsequent agents.
4.2 Multi-agent Free Energy principle
In contrast to multi-step decision-making, the information-processing bounds are allowed to be functions of the agents instead of just the nodes, resulting in an extra Lagrange multiplier for each agent in the Free Energy principle (10). As in (12), optimizing over the priors yields the simplified Free Energy principle
which can be solved iteratively as explained in the previous section, the only difference being that the Lagrange parameters now depend on . Hence, for the posterior of an agent that belongs to node , we have
The resulting Blahut-Arimoto-type algorithm is summarized in Algorithm 1.
Even though a given multi-agent architecture predetermines the underlying set of choices for each agent, only a small part of such a set might be used by a given agent in the optimized system. For example, all agents in the final step potentially can perform any action (see Figure 2 and the Example in 4.4 below). However, depending on their individual information-processing capabilities, the optimization over the agents’ priors can result in a (soft) partitioning of the full action space into multiple chunks, where each of these chunks is given by the support of the prior of a given agent , . Note that the resulting partitioning is not necessarily disjoint, since agents might still be sharing a number of actions, depending on their available information-processing resources. If the processing capability is low compared to the number of possible actions in the full space, and if there are enough agents at the same level, then this partitioning allows each agent to focus on a smaller number of options to choose from, provided that the coordinating agents have enough resources to decide between the partitions reliably.
Therefore, the amount of prior adaptation of an agent, i.e. by how much its optimal prior deviates from a uniform prior over all accessible choices, which is measured by the KL-divergence , determines its degree of specialization. More precisely, we define the specialization of an agent with prior and choice space by
where denotes the Shannon entropy of . By normalizing with , we obtain a quantity between and , since . Here, corresponds to , which means that the agent is completely unspecialized, whereas corresponds to , which implies that has support on a single option, meaning that the agent always performs the same action and is therefore fully specialized.
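The specialization measure can be sketched numerically. The code below assumes that the normalized quantity is one minus the Shannon entropy of the prior divided by the logarithm of the number of choices, which is what the normalization and the two limiting cases described above imply; equivalently, it is the KL-divergence from the uniform prior divided by its maximal value. The function name and the example priors are hypothetical.

```python
import math

def specialization(prior):
    """1 - H(prior) / log(n) for a prior over n >= 2 choices (assumed form)."""
    n = len(prior)
    if n < 2:
        raise ValueError("need at least two choices to normalize")
    entropy = -sum(p * math.log(p) for p in prior if p > 0.0)
    return 1.0 - entropy / math.log(n)

print(specialization([0.25, 0.25, 0.25, 0.25]))  # uniform prior
print(specialization([0.5, 0.5, 0.0, 0.0]))      # support on half the choices
print(specialization([1.0, 0.0, 0.0, 0.0]))      # single option
```

A uniform prior yields no specialization, a prior concentrated on one option yields full specialization, and priors supported on a subset of the choices fall in between.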
4.4 Example: Hierarchical multi-agent system with three levels
Consider the example of an architecture of 10 agents shown in Figure 2 that are combined via the 3-step decision-making system given by
as visualized in the upper left corner of Figure 2. The number of agents in each node is given by the cardinality of the target space of the selecting node(s) (or equals one if there are no selectors). Hence, consists of one agent, consists of agents, and consists of agents. For example, if we have and , as in Figure 2, then this results in a hierarchy of and agents.
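The counting rule for agents per node can be written out as a short sketch. The node labels `X1`, `X2`, and `A`, the dictionary layout, and the helper name are assumptions made for illustration; the cardinalities are chosen to match the hierarchy of one, three, and six agents described for Figure 2.

```python
from math import prod

def agents_per_node(cardinality, selectors):
    """Number of agents in each node: product of its selectors' cardinalities.

    A node without selectors contains exactly one agent (empty product).
    """
    return {node: prod(cardinality[s] for s in selecting_nodes)
            for node, selecting_nodes in selectors.items()}

cardinality = {"X1": 3, "X2": 2}   # sizes of the selecting nodes' target spaces
selectors = {
    "X1": (),                      # no selectors: a single agent
    "X2": ("X1",),                 # one agent per value of X1
    "A": ("X1", "X2"),             # one agent per pair of values of X1 and X2
}

counts = agents_per_node(cardinality, selectors)
print(counts, sum(counts.values()))
```

With these hypothetical cardinalities the three nodes contain one, three, and six agents, ten in total.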
The joint probability of the system characterized by (20) is given by
and the Free Energy by
where the priors , , and are given by the marginals (11), i.e.
By (13), the posteriors that iteratively solve the Free Energy principle are
Given a world state , the agent in decides about which of the three agents in obtains as an input. This narrows down the possible choices for the selected agent in to two out of the six agents in . The selected agent performs the final decision by choosing an action . Depending on its degree of specialization, which is a result of its own and the coordinating agents’ resources, this agent will choose its action from a certain subset of the full space .
5 Optimal Architectures
Here, we show how the above framework can be used to determine optimal architectures of bounded rational agents. Summarizing the assumptions made in the derivations, the multi-agent systems that we analyze must fulfill the following requirements:
The information-flow is feed-forward: An agent in can obtain information directly from another agent that belongs to only if .
Intermediate agents cannot be endpoints of the decision-making process: the information-flow always starts with the processing of and always ends with a decision .
A single agent is not allowed to have multiple prior policies: Agents are the smallest decision-making unit, in the sense that they transform a prior to a posterior policy over a set of actions in one step.
The performance of the resulting architectures is measured with respect to the expected utility they are able to achieve under a given set of resource constraints. To this end, we need to specify (1) the objective for the full decision-making process, (2) the number of decision-making steps in the system, (3) the maximal number of agents to be distributed among the nodes, and (4) the individual resource constraints of those agents.
We illustrate the specifications (1)–(4) with a toy example in Section 5.2 by showcasing and explicitly explaining the differences in performance of several architectures. Moreover, we provide a broad performance comparison in Section 5.3, where we systematically vary a set of objective functions and resource constraints, in order to determine which architectural features most affect the overall performance. For simplicity, in all simulations we are limiting ourselves to architectures with nodes and agents. In the following section, we start by describing how we characterize the architectures conforming to the requirements listed above.
5.1 Characterization of architectures
Type. In view of property above, we can label any -step decision-making process by a tuple , which we call the type of the architecture, where characterizes the relation between the first variables , , …, , and determines how these variables are connected to .
For example, for , we obtain the types shown in Figure 3, where and represent the following relations:
For example, the architecture shown in Figure 2 has the type . Correspondingly, the two-step cases are labelled by for , and the one-step rate distortion case by . Note that not every combination of and describes a unique system, e.g. is equivalent to when replacing by . Moreover, as mentioned above, is equivalent to , and similarly, is equivalent to .
Shape. After the number of nodes has been fixed, the remaining property that characterizes a given architecture is the number of agents per node. For most architectures there are multiple possibilities to distribute a given number of agents among the nodes, even when neglecting individual differences in resource constraints. We call such a distribution a shape, denoted by , where denotes the number of agents in node . Note that not all architectures will be able to use the full number of available agents, most notably the one-step rate distortion case ( agent), or the two-step serial case ( agents). For these systems, we always use the agents with the highest available resources in our simulations.
For example, for the resulting shapes for a maximum of agents are as follows:
for , for , and for ,
for and ,
for , , , ,
and for and ,
for and ,
for and ,
and for ,
where a tuple inside the shape means that two different nodes are deciding about the agents in that spot, e.g. means that there are 8 agents in the last node, labeled by the values with and . In Figure 4, we visualize one example architecture for each of the above 3-step shapes, except for the shapes of type of which one example is shown in Figure 2.
Together, the type and shape uniquely characterize a given multi-agent architecture, denoted by .
5.2 Example: Callcenter
Consider the operation of a company’s callcenter as a decision-making process, where customer calls (world states) must be answered with an appropriate response (action) in order to achieve high customer satisfaction (utility). The utility function shown in Figure 5 on the left can be viewed as a simplistic model for a real-world callcenter of a big company such as a communication service provider. In this simplification, there are 24 possible customer calls that belong to three separate topics, for example questions related to telephone, internet, or television, which can be further subdivided into two subcategories, for example consisting of questions concerning the contract or problems with the hardware. See the description of Figure 5 for the explicit utility values.
Handling all possible phone calls perfectly by always choosing the corresponding response with maximum utility requires bit (see Figure 5). However, in practice a single agent is usually not capable of knowing the optimal answers to every single type of question. For our example this means that the callcenter only has access to agents with information-processing capability less than bit. It is then required to organize the agents in a way so that each agent only has to deal with a fraction of the customer calls. This is often realized by first passing the phone call through several filters in order to forward it to a specialized agent. Arranging these selector or filter units in a strict hierarchy then corresponds to architectures of the form of or (see below for a comparison of these two), where at each stage a single operator selects how a call is forwarded. In contrast, architectures of the form of allow for multiple independent filters working in parallel, for example realized by multiple trained neural networks, where each is responsible for a particular feature of the call (for example, one node deciding about the language of the call, and another node deciding about the topic). In the following we do not discriminate between human and artificial decision-makers, since both can qualify equally well as information-processing units.