Abstract
The concept of free energy has its origins in 19th century thermodynamics, but has recently found its way into the behavioral and neural sciences, where it has been promoted for its wide applicability and has even been suggested as a fundamental principle for understanding intelligent behavior and brain function. We argue that there are essentially two different notions of free energy in current models of intelligent agency, both of which can be considered as applications of Bayesian inference to the problem of action selection: one that appears when trading off accuracy and uncertainty based on a general maximum entropy principle, and one that formulates action selection in terms of minimizing an error measure that quantifies deviations of beliefs and policies from given reference models. The first approach provides a normative rule for action selection in the face of model uncertainty or when information-processing capabilities are limited. The second approach directly aims to formulate the action selection problem as an inference problem in the context of Bayesian brain theories, also known as Active Inference in the literature. We elucidate the main ideas and discuss critical technical and conceptual issues revolving around these two notions of free energy, both of which claim to apply at all levels of decision-making, from the high-level deliberation of reasoning down to the low-level information-processing of perception.
Keywords: free energy, intelligent agency, Bayesian inference, maximum entropy, utility theory, active inference
1 Introduction
There is a surprising line of thought connecting some of the greatest scientists of the last centuries, including Immanuel Kant, Hermann von Helmholtz, Ludwig E. Boltzmann, and Claude E. Shannon, whereby model-based processes of action, perception, and communication are explained with concepts borrowed from statistical physics. Inspired by Kant’s Copernican revolution, Helmholtz was one of the first proponents of the analysis-by-synthesis approach to perception (Yuille and Kersten, 2006) that was motivated from his own studies of the physiology of the sensory system, whereby a perceiver does not simply record raw external stimuli on some kind of tabula rasa, but rather relies on internal models of the world to match and anticipate sensory inputs as well as possible. The internal model paradigm is now ubiquitous in the cognitive and neural sciences and has even led some researchers to propose a Bayesian brain hypothesis, whereby the brain would essentially be a prediction and inference engine based on internal models (Kawato, 1999; Flanagan et al., 2003; Doya, 2007). Coincidentally, Helmholtz also invented the notion of the Helmholtz free energy that plays an important role in thermodynamics and statistical mechanics, even though he never made a connection between the two concepts in his lifetime.
This connection was first made by Dayan, Hinton, Neal, and Zemel in their computational model of perceptual processing as a statistical inference engine known as the Helmholtz machine (Dayan et al., 1995). In this neural network architecture, there are feedforward and feedback pathways, where the bottom-up pathway translates inputs from the bottom layer into hidden causes at the upper layer (the recognition model), and top-down activation translates simulated hidden causes into simulated inputs (the generative model). When considering log-likelihood in this setup as energy in analogy to statistical mechanics, learning becomes a relaxation process that can be described by the minimization of variational free energy. While it should be emphasized that variational free energy is not the same as Helmholtz free energy, the two free energy concepts can be formally related. Importantly, variational free energy minimization is not only a hallmark of the Helmholtz machine, but of a more general family of inference algorithms, such as the popular EM algorithm
(Neal and Hinton, 1998; Beal, 2003). In fact, over the last two decades, variational Bayesian methods have become one of the foremost approximation schemes for tractable inference in the machine learning literature. Moreover, a plethora of machine learning approaches use free energy tradeoffs when optimizing performance under entropy regularization in order to boost generalization of learning models
(Williams and Peng, 1991; Mnih et al., 2016). Meanwhile, free energy concepts have also made their way into the behavioral sciences. In the economic literature, for example, tradeoffs between utility and entropic uncertainty measures that take the form of free energies have been proposed to describe decision-makers with stochastic choice behavior due to limited resources (McKelvey and Palfrey, 1995; Sims, 2003; Mattsson and Weibull, 2002; McFadden, 2005; Wolpert, 2006) or robust decision-makers with limited precision in their models (Maccheroni et al., 2006; Hansen and Sargent, 2008). The free energy tradeoff between entropy and reward can also be found in information-theoretic models of biological perception-action systems (Still, 2009; Tishby and Polani, 2011; Ortega and Braun, 2013), some of which have been subjected to experimental testing (Ortega and Stocker, 2016; Sims, 2016; Schach et al., 2018; Lindig-León et al., 2019; Bhui and Gershman, 2018; Ho et al., 2020). Finally, in the neuroscience literature the notion of free energy has risen to recent fame as the central puzzle piece in the Free Energy Principle (Friston, 2010) that has been used to explain a cornucopia of experimental findings including neural prediction error signals, the hierarchical organization of cortical responses, synaptic plasticity rules, and neural effects of biased competition and attention—see references in (Parr and Friston, 2019). Over time, the Free Energy Principle has grown out of an application of the free energy concept used in the Helmholtz machine, to interpret cortical responses in the context of predictive coding (Friston, 2005), and has gradually developed into a general principle for intelligent agency, also known as Active Inference (Friston et al., 2013, 2015b; Parr and Friston, 2019).
Consequences and implications of the Free Energy Principle are discussed in neighbouring fields like psychiatry (Schwartenbeck and Friston, 2016; Linson et al., 2020) and the philosophy of mind (Clark, 2013; Colombo and Wright, 2018).
Given that the notion of free energy has become such a pervasive concept that cuts through multiple disciplines, the main rationale for this discussion paper is to trace back and to clarify different notions of free energy, to see how they are related and what role they play in explaining behavior and neural activity. As the notion of free energy mainly appears in the context of statistical models of cognition, we need to be familiar with probabilistic models as a common framework in the following discussion. Section 2 therefore starts with preliminary remarks on probabilistic modelling. Section 3 introduces two notions of free energy that are subsequently expounded in Section 4 and Section 5, where they are applied to models of intelligent agency. Section 6 concludes the paper.
2 Probabilistic models and perception-action systems
Systems that show stochastic behavior, for example due to randomly behaving components or because the observer ignores certain degrees of freedom, are modelled using probability distributions. This way, any behavioral, environmental, and hidden variables can be related by their statistics, and dynamical changes can be modelled by changes in their distributions.
Consider, for example, the simple probabilistic model illustrated in Fig 1, consisting of the past and future soil quality $s$ and $s'$, the past and future crop yields $y$ and $y'$, and the fertilization $a$. The graphical model shown in the figure corresponds to the joint probability $p(s, y, a, s', y')$ given by the factorization

$$p(s, y, a, s', y') = p(s)\, p(y|s)\, p(a|y)\, p(s'|s, a)\, p(y'|s') \tag{1}$$

where $p(s)$ is the base probability of the past soil quality $s$, $p(y|s)$ is the probability of crop yields depending on the past soil quality, and so forth. Given the joint distribution we can also ask questions about each of the variables. For example, we could ask about the probability distribution of the soil quality if we are told that the crop yields are equal to a value $y$. We can obtain the answer from the probabilistic model by doing Bayesian inference, yielding the Bayes' posterior

$$p(s|y) = \frac{p(s, y)}{p(y)} = \frac{p(s)\, p(y|s)}{\sum_{\tilde{s}} p(\tilde{s})\, p(y|\tilde{s})} \tag{2}$$

where the dependencies on $a$, $s'$, and $y'$ have been summed out to calculate the marginal $p(s, y) = p(s)\, p(y|s)$ from $p(s, y, a, s', y')$. In general, Bayesian inference in a probabilistic model means to determine the probability of some queried unobserved variables given the knowledge of some observed variables. This can be viewed as transforming the prior probabilistic model $p$ to a posterior model $p^*$, where the observed variables have probability one and unobserved variables have probabilities given by the corresponding Bayes' posteriors.
In principle, Bayesian inference requires only two different kinds of operations, namely marginalization, i.e. summing out unobserved variables that have not been queried, such as $a$, $s'$, and $y'$ above, and conditionalization, i.e. renormalizing the joint distribution over observed and queried variables—which may itself be the result of a previous marginalization, such as $p(s, y)$ above—to obtain the required conditional distribution over the queried variables. In practice, however, inference is a hard computational problem and many more efficient inference methods are available that may provide approximate solutions to the exact Bayes' posteriors, including belief propagation (Pearl, 1988), expectation propagation (Minka, 2001), variational Bayesian inference (Hinton and van Camp, 1993), and Monte Carlo algorithms (MacKay, 2002). Also note that inference is trivial if the sought-after conditional distribution of the queried variable is already given by one of the conditional distributions that jointly specify the probabilistic model.
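These two operations can be sketched numerically for the soil and crop example; all probability values below are invented for illustration:

```python
import numpy as np

# Hypothetical numbers for the model of Fig 1, restricted to the variables
# needed for the posterior (2): two soil-quality levels (0 = poor, 1 = good)
# and two crop-yield levels (0 = low, 1 = high).
p_s = np.array([0.6, 0.4])            # prior p(s) over past soil quality
p_y_given_s = np.array([[0.8, 0.2],   # p(y | s): rows index s, columns y
                        [0.3, 0.7]])

p_sy = p_s[:, None] * p_y_given_s     # joint p(s, y) = p(s) p(y | s)
p_y = p_sy.sum(axis=0)                # marginalization: p(y) = sum_s p(s, y)

y_obs = 1                             # suppose a high yield is observed
posterior = p_sy[:, y_obs] / p_y[y_obs]   # conditionalization: p(s | y)
print(posterior)                      # distribution over soil quality given y
```

Here a high yield shifts belief toward good soil quality, since good soil makes high yields more likely under the assumed likelihood table.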
Probabilistic models can not only be used as observer models, but also as internal models that are employed by the agent itself or by a designer of an agent in order to determine a desired course of action (cf. Fig 2). In this case, actions could either be thought of as parameters of the probabilistic model that influence the future (influence diagrams) or as random variables that are part of the probabilistic model themselves (prior models). Both types of models allow predictions over future consequences in order to specify actions or distributions over actions that lead to desirable outcomes, for example actions that produce high rewards in the future. In mechanistic or process model interpretations, some of these specification methods are themselves meant to represent what the agent is actually doing while reasoning, whereas as-if interpretations simply use these methods as tools to arrive at distributions that describe the agent's behavior. Free energy is one of the concepts that appears in some of these methods.
3 The two notions of free energy
Vaguely speaking, free energy can refer to any quantity that is of the form

$$\text{free energy} = \text{energy} - \text{entropy} \tag{3}$$

where energy is an expected value of some quantity of interest, and entropy refers to a quantity measuring disorder, uncertainty, or complexity, that must be specified in the given context. From the relation (3), it is not surprising that free energy sometimes appears enshrouded by mystery, as it relies on an understanding of entropy, and “nobody really knows what entropy is anyway”, as John von Neumann famously quipped (Feynman et al., 1996).
Historically, the concept of free energy goes back to the roots of thermodynamics, where it was introduced to measure the maximum amount of work that can be extracted from a thermodynamic system at a constant temperature and volume. If, for example, all the molecules in a box move to the left, we can use this kinetic energy to drive a turbine. If, however, the same kinetic energy is distributed as random molecular motion, it cannot be fully transformed into work. Therefore, only part of the total energy is usable, because the exact positions and momenta of the molecules, the so-called microstates, are unknown. In this case, the maximum usable part of the energy is the Helmholtz free energy, defined as
$$F = U - TS \tag{4}$$

where $U$ is the total energy, $T$ the temperature, and $S$ the thermodynamic entropy. In general, the transformation between two macrostates with free energies $F_1$ and $F_2$ allows the extraction of work $W = F_1 - F_2$.
3.1 Non-equilibrium free energy and maximum entropy
3.1.1 The Boltzmann distribution
In statistical mechanics, which studies macroscopic systems in terms of the behavior of their elementary constituents, thermodynamic quantities (macrostates of a system) are identified with expected values of the corresponding quantities defined on microstates. This means that the total energy $U$ is identified with the expected value $\langle E \rangle_p = \sum_x p(x)\, E_x$ of the energy levels $E_x$ of the system with respect to a probability distribution $p$. Based on the central assumption that states with equal energy are occupied with equal probability and that thermodynamic entropy grows logarithmically with the number of possible microstates (Boltzmann's equation), one can determine the probability of a microstate $x$ with energy $E_x$ as

$$p_B(x) = \frac{1}{Z}\, e^{-\frac{E_x}{kT}} \tag{5}$$

and the Helmholtz free energy (4) as $F = -kT \log Z$, where $T$ is the temperature of a heat bath that is in equilibrium with the thermodynamic system, $k$ is Boltzmann's constant, and $Z = \sum_x e^{-E_x/(kT)}$ is the so-called partition sum (Callen, 1985). Consequently, one can identify the thermodynamic entropy $S$ with the Gibbs or Shannon entropy through $S = k\, H(p_B)$, where $H(p) := -\sum_x p(x) \log p(x)$.
By allowing distributions $p$ other than the Boltzmann distribution $p_B$, one can define a non-equilibrium free energy

$$F[p] = \langle E \rangle_p - kT\, H(p) \tag{6}$$

that equals the Helmholtz free energy when evaluated at the Boltzmann distribution, $F[p_B] = F$. Moreover, it turns out that (6) actually takes its minimum at $p_B$, i.e. $F = \min_p F[p]$. In general, minimizing the non-equilibrium free energy (6) with respect to $p$ can be understood more abstractly without any reference to thermodynamics or physics, because it is equivalent to the constrained optimization problem of maximizing entropy under a constraint on the expectation $\langle E \rangle_p$, known as the principle of maximum entropy (Jaynes, 1957).
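This minimum property is easy to verify numerically. The following sketch (with invented energy levels and $k = 1$) checks that the Boltzmann distribution attains the minimal value $-T \log Z$, and that no other distribution has lower free energy:

```python
import numpy as np

# Illustrative values (k = 1): the Boltzmann distribution p_B(x) ∝ exp(-E_x / T)
# minimizes F[p] = <E>_p - T H(p), and the minimum is the Helmholtz value -T log Z.
E = np.array([1.0, 2.0, 4.0])   # invented energy levels E_x
T = 1.5                          # temperature

def F(p):
    return np.sum(p * E) + T * np.sum(p * np.log(p))   # <E>_p - T H(p)

w = np.exp(-E / T)
Z = w.sum()                      # partition sum
p_B = w / Z

assert np.isclose(F(p_B), -T * np.log(Z))   # F[p_B] equals -T log Z

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(3))            # an arbitrary other distribution...
    assert F(p) >= F(p_B) - 1e-12            # ...never has lower free energy
```

The gap $F[p] - F[p_B]$ equals $T$ times the relative entropy between $p$ and $p_B$, which is why the Boltzmann distribution is the unique minimizer.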
3.1.2 The tradeoff between energy and uncertainty
An important feature of the minimization of the free energy (6) consists in the tradeoff between the two competing terms, expected energy and entropy (see Fig 3). It is this tradeoff between maximal uncertainty (uniform distribution) and minimal energy (delta distribution) that is at the core of free energy minimization. Here, the temperature $T$ plays the role of a tradeoff parameter that controls how these two counteracting forces are balanced. In optimization theory, such a tradeoff parameter is usually introduced to transform an optimization problem that has to satisfy an equality or inequality constraint into an unconstrained optimization problem of the form (6). In this case, the tradeoff parameter plays the role of a so-called Lagrange multiplier that is determined by the constraint.
If the two counteracting quantities in such a tradeoff are entropy and the expected value of some quantity, then one obtains the principle of maximum entropy, first formulated rigorously by Jaynes (Jaynes, 1957) as a method of determining an unbiased subjective probability distribution on the basis of partial information given by a constraint. It goes back to the principle of insufficient reason (Bernoulli, 1713; de Laplace, 1812; Poincaré, 1912), which states that two events should be assigned the same probability if there is no reason to think otherwise. The principle of maximum entropy (and its close relative, the principle of minimum relative entropy) has very broad application and appears in virtually all branches of science. It has been hailed as a principled method to determine prior distributions and to incorporate novel information into existing probabilistic knowledge. In fact, Bayesian inference can be cast in terms of relative entropy minimization with constraints given by the available information (Williams, 1980). Applications of this idea can also be found in the machine learning literature, where subtracting (or adding) an entropy term from an expected value of a function that must be optimized is known as entropy regularization and plays an important role in modern reinforcement learning algorithms (Williams and Peng, 1991; Mnih et al., 2016) to encourage exploration (Haarnoja et al., 2017) as well as to penalize overly deterministic policies resulting in biased reward estimates (Fox et al., 2016).

3.2 Variational free energy
3.2.1 An extension of relative entropy
There is another, distinct appearance of the term “free energy” outside of physics, that is a priori not motivated by a tradeoff between an energy and an entropy term, but by possible efficiency gains when representing Bayes' rule in terms of an optimization problem. This technique is mainly used in variational Bayesian inference and was originally introduced by Hinton and van Camp (Hinton and van Camp, 1993). Such variational representations not only allow the approximation of exact Bayes' posteriors by simpler distributions, but also the construction of efficient iterative algorithms for exact or approximate inference, such as the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Neal and Hinton, 1998), belief propagation (Pearl, 1988; Yedidia et al., 2001), and other message-passing algorithms (Minka, 2001; Wainwright et al., 2005; Winn and Bishop, 2005; Minka, 2005). All of these examples can be seen as applications of the same basic concept, which allows casting Bayesian inference in terms of the minimization of the variational free energy $F[q, \varphi]$, which generally is a function of two quantities, a probability distribution $q$ and a non-negative function $\varphi$, given by
$$F[q, \varphi] = \Big\langle \log \frac{q}{\varphi} \Big\rangle_q = \sum_x q(x) \log \frac{q(x)}{\varphi(x)} \tag{7}$$
where $\langle \cdot \rangle_q$ denotes the expected value with respect to $q$. In the application to Bayesian inference, the reference function $\varphi$ is constructed by evaluating a joint distribution given by the probabilistic model, say $p(x, y)$, at known quantities, say $y_0$, resulting in $\varphi(x) = p(x, y_0)$, which is not a probability distribution in $x$ anymore. Its rescaling (normalization) in order to obtain the probability distribution $p(x|y_0)$ is exactly what Bayesian inference is about, and what the variational free energy (7) is used for. It is a free energy in the sense of (3) since, by the additivity of the logarithm under multiplication ($\log(uv) = \log u + \log v$),
$$F[q, \varphi] = -\langle \log \varphi \rangle_q - H(q) \tag{8}$$
with energy term $-\langle \log \varphi \rangle_q$ and entropy term given by the Shannon entropy $H(q) = -\sum_x q(x) \log q(x)$. It is variational because its purpose is to be minimized over the so-called trial distributions $q$, with the solution
$$\arg\min_q F[q, \varphi] = \frac{\varphi}{\sum_x \varphi(x)} \tag{9}$$
Here, for simplicity, all random variables are discrete, but most expressions can directly be translated to the continuous case by replacing sums by the corresponding integrals. When choosing $\varphi(x) = e^{-E_x/(kT)}$, Equation (9) becomes the Boltzmann distribution (5) and accordingly the variational free energy (8) becomes the non-equilibrium free energy (6) divided by $kT$. The variational property (9) allows the normalization of $\varphi$ to obtain the probability distribution $q^* = \varphi / \sum_x \varphi(x)$, which has the same shape as $\varphi$ but sums to $1$, without having to carry out the rescaling of $\varphi$ explicitly. Instead, by minimizing variational free energy, one fits auxiliary trial distributions $q$ to the shape of $\varphi$ (cf. Fig 4). If this optimization process has no constraints, then the trial distributions are fitted to the shape of $\varphi$ until $q = q^*$ is achieved. In the case of constraints, for instance if the trial distributions are given by a non-exhaustive parametrization, then the optimized trial distributions approximate $q^*$ as closely as possible within this parametrization. Moreover, the minimal value of the variational free energy (7) is
$$\min_q F[q, \varphi] = -\log \sum_x \varphi(x) \tag{10}$$
In particular, this implies that $-F[q, \varphi] \leq \log \sum_x \varphi(x)$ for all $q$, so that varying $q$ over arbitrary trial distributions always provides a lower bound to the logarithm of the unknown normalization constant $\sum_x \varphi(x)$. In Bayesian inference this unknown constant is the normalization constant in Bayes' rule, called the model evidence (cf. Section 3.2.2 below). Due to this bound, the negative variational free energy is also called the evidence lower bound (ELBO). The proof of (9) and (10) directly follows from Jensen's inequality and only relies on the concavity of the logarithm.
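Both the minimizer (9) and the bound (10) can be checked numerically; in the following sketch, the non-normalized reference function is an arbitrary invented non-negative vector:

```python
import numpy as np

# Numerical check of (9) and (10) for an invented, non-normalized
# reference function phi over three states.
phi = np.array([0.3, 1.2, 0.5])
norm = phi.sum()

def F(q):
    return np.sum(q * (np.log(q) - np.log(phi)))   # variational free energy (7)

q_star = phi / norm                                # the minimizer (9)
assert np.isclose(F(q_star), -np.log(norm))        # the minimum value (10)

rng = np.random.default_rng(1)
for _ in range(1000):
    q = rng.dirichlet(np.ones(3))                  # arbitrary trial distribution
    assert -F(q) <= np.log(norm) + 1e-12           # -F is a lower bound (ELBO)
```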
Variational free energy (7) can be regarded as an extension of relative entropy with the reference distribution being replaced by a non-normalized function: in the case when $\varphi$ is already normalized, that is if $\sum_x \varphi(x) = 1$, the free energy (7) coincides with the relative entropy $D_{\mathrm{KL}}(q \| \varphi)$, the so-called Kullback-Leibler (KL) divergence from information theory. In particular, while relative entropy is a measure for the dissimilarity of two probability distributions, where the minimum is achieved if both distributions are equal, variational free energy is a measure for the dissimilarity between a probability distribution $q$ and a (generally non-normalized) non-negative function $\varphi$, where the minimum with respect to $q$ is achieved at $q^* = \varphi / \sum_x \varphi(x)$. Accordingly, we can think of the variational free energy as a specific error measure between probability distributions and reference functions. In principle, one could design many other error measures that have the same minimum. This means that a statement in a probabilistic setting that a distribution minimizes variational free energy is analogous to a statement in a non-probabilistic setting that some parameter $\theta$ minimizes an error measure like the squared error $(f_\theta - y)^2$ between a parameterized prediction $f_\theta$ and a given reference value $y$.
3.2.2 Variational inference
As we have seen in Section 2, Bayesian inference consists in the calculation of a conditional probability distribution over unknown variables given the values of known variables. In the most simple case of two variables, say $x$ and $y$, and a probabilistic model of the form $p(x, y) = p(x)\, p(y|x)$, Bayesian inference applies if $y$ is observed and $x$ is queried. Analogous to (2), the exact Bayes' posterior $p(x|y_0)$ is defined by the renormalization of $\varphi(x) = p(x, y_0)$ in order to obtain a distribution over $x$ that respects the new information $y = y_0$. In variational Bayesian inference (cf. Fig 5A), however, this Bayes' posterior is not calculated directly by normalizing the joint distribution with respect to $x$, but indirectly by approximating it by a distribution $q$ that is adjusted through the minimization of an error measure that quantifies the deviation from the exact Bayes' posterior. As we have seen in the previous section, the variational free energy is one possible candidate for such an error measure, since by (9),
$$p(\cdot|y_0) = \arg\min_q F[q, p(\cdot, y_0)] \tag{11}$$
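A minimal sketch of such a variational approximation (with an invented model): the posterior over two binary hidden variables given a fixed observation is approximated by a factorized trial distribution $q_1(x_1)\, q_2(x_2)$, and alternating coordinate updates of the factors can only lower the variational free energy:

```python
import numpy as np

rng = np.random.default_rng(2)
# phi(x1, x2) = p(x1, x2, y0): the joint model evaluated at the observed value,
# here just an arbitrary positive 2x2 table invented for illustration.
phi = rng.random((2, 2)) + 0.1

def F(q1, q2):
    q = np.outer(q1, q2)                         # factorized trial distribution
    return np.sum(q * (np.log(q) - np.log(phi)))

q1 = np.array([0.5, 0.5])
q2 = np.array([0.5, 0.5])
F_prev = F(q1, q2)
for _ in range(50):
    # coordinate updates: q1(x1) ∝ exp(sum_x2 q2(x2) log phi(x1, x2)), and symmetrically
    q1 = np.exp(np.log(phi) @ q2); q1 /= q1.sum()
    q2 = np.exp(q1 @ np.log(phi)); q2 /= q2.sum()
    assert F(q1, q2) <= F_prev + 1e-12           # each sweep can only decrease F
    F_prev = F(q1, q2)

# By (10), F can never fall below -log(sum phi), which the exact posterior attains:
assert F(q1, q2) >= -np.log(phi.sum()) - 1e-9
```

Without the factorization constraint, the minimizer would be the exact normalized posterior $\varphi / \sum \varphi$; with it, the updates converge to the closest factorized approximation reachable from the initialization.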
As mentioned at the beginning of this section, representing Bayes' rule as an optimization problem over auxiliary distributions has two main applications that can both simplify the inference process (cf. Fig 5B). First, it allows the approximation of exact Bayes' posteriors by restricting the optimization space, for example using a non-exhaustive parametrization such as Gaussian distributions. Second, it enables iterative inference algorithms consisting of multiple simpler optimization steps, for example by optimizing with respect to each term in a factorized representation of $q$ separately. A popular choice is the mean-field approximation, which combines both of these simplifications: it assumes independence between hidden states, effectively reducing the search space from joint distributions to factorized ones, and moreover it allows optimization with respect to each factor alternatingly.

4 Free energy from uncertainty reduction
We are now turning to the question of how the two notions of free energy introduced in the previous section appear in recent theories on intelligent agency, as it is a priori not entirely obvious how the practical uses of free energy as uncertainty-energy tradeoff on the one hand (this section) and as a variational error measure on the other hand (next section) relate to intelligent behavior.
4.1 The basic idea
The concept of free energy as a tradeoff between energy and uncertainty can be used in models of perception-action systems, where entropy quantifies information-processing complexity required for decision-making (e.g. planning a path for fleeing a predator) and energy corresponds to performance (e.g. distinguishing better and worse flight directions). The notion of decision in this context is very broad and can be applied to any internal variable in the perception-action pipeline (Kahneman, 2002) that is not directly determined by the environment. In particular, it also subsumes perception itself, where the decision variables are given by the hidden causes that are being inferred from observations.
In rational choice theory (von Neumann and Morgenstern, 1944), a decision-maker chooses its decisions from a set of options $A$ such that a utility function $U$ defined on $A$ is maximized,

$$a^* = \arg\max_{a \in A} U(a) \tag{12}$$

The utility values could either be objective, for example a monetary gain, or subjective, in which case they represent the decision-maker's preferences. In general, the utility does not have to be defined directly on $A$, but could be derived from utility values that are attached to certain states, for example to the configurations of the board in a board game. In the case of perception, utility values are usually given by (log-)likelihood functions, in which case utility maximization without constraints corresponds to greedy inference such as maximum likelihood estimation.
Whereas ideal rational decision-makers are assumed to perfectly optimize a given utility function $U$, real behavior is often stochastic, meaning that multiple exposures to the same problem lead to different decisions. Such non-deterministic behavior could be a consequence of model uncertainty, as in Bayesian inference or various stochastic gambling schemes, or a consequence of satisficing (Simon, 1955), where decision-makers do not choose the single best option, but simply one option that is good enough. Abstractly, this means that the choice of a single decision $a^*$ is replaced by the choice of a distribution $p$ over decisions. More generally, also considering prior information that the decision-maker might have from previous experience, the process of deliberation during decision-making might be expressed as the transformation of a prior distribution $p_0$ to a posterior distribution $p$.
When assuming that deliberation has a cost $C(p_0, p)$, then arriving at narrow posterior distributions should intuitively be more costly than choosing distributions that contain more uncertainty (cf. Fig 6A). In other words, deliberation costs must be increasing with the amount of uncertainty that is reduced by the transformation from $p_0$ to $p$. Uncertainty reduction can be understood as making the probabilities of options less equal to each other, rigorously expressed by the mathematical concept of majorization (Marshall et al., 2011). This notion of uncertainty can also be generalized to include prior information, so that the degree of uncertainty reduction corresponds to more or less deviation from the prior (Gottwald and Braun, 2019a).
Maximizing expected utility with respect to $p$ under restrictions on processing costs is a constrained optimization problem that can be interpreted as a particular model of bounded rationality (Simon, 1955), explaining non-rational behavior of decision-makers that may be unable to select the single best option because of their limited information-processing capability. Similarly to the free energy tradeoff between energy and entropy (cf. Fig 3), this results in a tradeoff between utility $U$ and processing costs $C(p_0, p)$,

$$\max_p \Big( \langle U \rangle_p - \frac{1}{\beta}\, C(p_0, p) \Big) \tag{13}$$

Here, the tradeoff parameter $\beta$ is analogous to the inverse temperature $\frac{1}{kT}$ in statistical mechanics (cf. Equation (6)) and parametrizes the optimal tradeoffs between utility and cost that define an efficiency frontier separating the space of perception-action systems into bounded-optimal, non-optimal, and non-admissible systems (cf. Fig 6).
When assuming that the total transformation cost is the same independent of whether a decision problem is solved in one step or in multiple substeps (additivity under coarse-graining), the tradeoff in (13) takes the general form (3) of a free energy in the sense of energy (utility) minus entropy (cost), because then the cost function is uniquely given by the relative entropy

$$C(p_0, p) = D_{\mathrm{KL}}(p \| p_0) = \sum_a p(a) \log \frac{p(a)}{p_0(a)} \tag{14}$$

Note that the additivity of (14) also implies a coarse-graining property of the free energy (13) in the case when the decision is split into multiple steps, such that the utility of preceding decisions is effectively given by the free energy of following decisions. Therefore, in this case, free energy can be seen as a certainty-equivalent value of a stochastic choice that, besides expected utility, also takes the information-processing costs of the subordinate decision problems into account. The special case (14) has been studied extensively in multiple contexts, including quantal response equilibria in the game-theoretic literature (McKelvey and Palfrey, 1995; Wolpert, 2006), rational inattention and costly contemplation (Sims, 2003; Ergin and Sarver, 2010), bounded rationality with KL costs (Mattsson and Weibull, 2002; Ortega and Braun, 2013), KL control (Todorov, 2009; Kappen et al., 2012), entropy regularization (Williams and Peng, 1991; Mnih et al., 2016), robustness (Maccheroni et al., 2006; Hansen and Sargent, 2008), and the analysis of information flow in perception-action systems (Tishby and Polani, 2011; Still, 2009).
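For the relative-entropy cost (14), the optimizer of the tradeoff (13) has the closed form $p^*(a) \propto p_0(a)\, e^{\beta U(a)}$. A small sketch (utilities and prior invented for illustration) shows how $\beta$ interpolates between sticking to the prior and deterministic utility maximization:

```python
import numpy as np

# Tradeoff (13) with cost (14): the optimal posterior is p*(a) ∝ p0(a) exp(beta U(a)).
U = np.array([1.0, 0.9, 0.2])      # invented utilities of three options
p0 = np.ones(3) / 3                # uniform prior policy

def optimal_policy(beta):
    w = p0 * np.exp(beta * U)
    return w / w.sum()

# beta -> 0 (severe resource limits): the posterior stays at the prior.
assert np.allclose(optimal_policy(0.0), p0)
# large beta (information processing is cheap): nearly deterministic maximization.
assert optimal_policy(100.0)[np.argmax(U)] > 0.99
```

Intermediate values of $\beta$ produce stochastic, satisficing-like behavior in which near-optimal options such as the second one retain substantial probability.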
4.2 A Simple Example
Consider the probabilistic model shown in Fig 1 with the joint distribution $p(s, y, a, s', y')$ that is specified by the factors in the decomposition (1). Here, $s$ and $y$ denote the current environmental state and the corresponding observation, and $a$ denotes the action that must be determined in order to drive the system into a new state $s'$ with observation $y'$. The decision-making problem is specified by assuming that we are given a utility function $U(y')$ over future observations, which the decision-maker seeks to maximize by selecting an action $a$, while only having access to the current observation $y$. This means that the decision-maker has control over the distribution $p(a|y)$, which replaces the prior $p_0(a|y)$ in the model to define the posterior model (cf. Fig 7). Further assuming that the decision-maker is subject to an information-processing constraint $D_{\mathrm{KL}}(p(\cdot|y) \| p_0(\cdot|y)) \leq C_0$, for some non-negative bound $C_0$, results in the unconstrained optimization problem with free energy given by (13), where the tradeoff parameter $\beta$ is tuned to comply with the bound $C_0$.
Since the action distribution $p(a|y)$ is the only distribution in the posterior model that is varied, the total free energy simplifies to $F = \sum_y p(y)\, F_y$ with

$$F_y = \sum_a p(a|y) \Big( U(a, y) - \frac{1}{\beta} \log \frac{p(a|y)}{p_0(a|y)} \Big), \qquad U(a, y) := \sum_{s, s', y'} p(s|y)\, p(s'|s, a)\, p(y'|s')\, U(y').$$

In particular, the optimal action distribution for a given observation $y$ is a Boltzmann distribution (5) with “energy” $-U(a, y)$ and prior $p_0(a|y)$,

$$p^*(a|y) = \frac{1}{Z(y)}\, p_0(a|y)\, e^{\beta U(a, y)}, \qquad Z(y) = \sum_a p_0(a|y)\, e^{\beta U(a, y)}.$$

Note that in order to evaluate the utility $U(a, y)$, it is required to determine the Bayes' posterior $p(s|y)$. This shows how, in a utility-based approach, the need to perform Bayesian inference results directly from the assumption about which variables are observed and which are not.
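Putting the pieces together, a small numerical instantiation of this example might look as follows (all probabilities, utilities, and the value of the tradeoff parameter are invented for illustration):

```python
import numpy as np

# Binary soil states s, s', crop yields y, y', and actions a (1 = fertilize).
p_s = np.array([0.5, 0.5])                       # p(s): current soil quality
p_y_s = np.array([[0.9, 0.1], [0.2, 0.8]])       # p(y | s)
p_s1_sa = np.array([[[0.7, 0.3], [0.1, 0.9]],    # p(s' | s, a), indexed [s][a][s']
                    [[0.4, 0.6], [0.1, 0.9]]])
p_y1_s1 = np.array([[0.9, 0.1], [0.2, 0.8]])     # p(y' | s')
U_y1 = np.array([0.0, 1.0])                      # utility of future crop yields
p0_a = np.array([0.5, 0.5])                      # prior over actions
beta = 2.0                                       # tradeoff parameter

y = 0                                            # observed: low current yield
# Bayes' posterior over the hidden soil state, needed to evaluate the utility:
post_s = p_s * p_y_s[:, y]
post_s /= post_s.sum()
# Expected utility of each action: sum over s, s', y' of p(s|y) p(s'|s,a) p(y'|s') U(y')
U_a = np.einsum('s,sat,tu,u->a', post_s, p_s1_sa, p_y1_s1, U_y1)
# Boltzmann action distribution with prior p0 and "energy" -U(a, y):
p_a = p0_a * np.exp(beta * U_a)
p_a /= p_a.sum()
print(U_a, p_a)   # fertilizing promises more utility, hence gets more probability
```

With these invented numbers, the low observed yield makes poor soil more probable, fertilizing yields the higher expected utility, and the bounded-rational policy shifts probability toward it without committing deterministically.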
4.3 Critical points
The main idea of free energy in the context of information-processing with limited resources is that any computation can be thought of abstractly as a transformation from a distribution $p_0$ of prior knowledge to a posterior distribution $p$ that encapsulates an advanced state of knowledge resulting from deliberation. The progress that is made through such a transformation is quantitatively captured by two measures: the expected utility $\langle U \rangle_p$ that quantifies the quality of $p$, and the cost $C(p_0, p)$ that measures the cost of uncertainty reduction from $p_0$ to $p$. Clearly, the critical point of this framework is the choice of the cost function $C$. In particular, we could ask whether there is some kind of universal cost function that is applicable to any perception-action process or whether there are only problem-specific instantiations. Of course, having a universal measure that allows applying the same concepts to extremely diverse systems is both a boon and a bane, because the practical insights it may provide for any concrete instance could be very limited. This is the root of a number of critical issues:


What is the cost $C$? An important restriction of all deliberation costs of the form $C(p_0, p)$ is that they only depend on the initial and final distributions and ignore the process of how to get from $p_0$ to $p$. When varying a single resource (e.g. processing time), we can use $C$ as a process-independent proxy for the resource. However, if there are multiple resources involved (e.g. processing time, memory, and power consumption), a single cost $C$ cannot tell us how these resources are weighted optimally without making further process-dependent assumptions. In general, the theory makes no suggestions whatsoever about mechanical processes that could implement resource-optimal strategies; it only serves as a baseline for comparison. Finally, simply requiring the measure to be monotonic in the uncertainty reduction does not uniquely determine the form of $C$, as there have been multiple proposals of uncertainty measures in the literature (see e.g. (Csiszár, 2008)), where relative entropy is just one possibility. However, relative entropy is distinguished from all other uncertainty measures by its additivity property, which for example allows optimal probabilistic updates from $p_0$ to $p$ to be expressed in terms of additions or subtractions of utilities, such as log-likelihoods for evidence accumulation in Bayesian inference.

What is the utility? When systems are engineered, utilities are usually assumed to be given, such that desired behavior is specified by utility maximization. However, when we observe perception-action systems, it is often not so clear what the utility should be, or in fact, whether there even exists a utility that captures the observed behavior in terms of utility maximization. This question of the identifiability of a utility function is studied extensively in the economic sciences, where the basic idea is that systems reveal their preferences through their actual choices and that these preferences have to satisfy certain consistency axioms in order to guarantee the existence of a utility function. In practice, to guarantee unique identifiability these axioms are usually rather strong, for example ignoring the effects of history and context when choosing between different items, or ignoring the possibility that there might be multiple objectives. Without these strong assumptions, utility becomes a rather weak concept, even weaker than probabilities, as additional assumptions like soft-maximization would be necessary to translate from utilities to choice probabilities.

The problem of infinite regress. One of the main conceptual issues with the interpretation of $C$ as a deliberation cost is that the original utility optimization problem is simply replaced by another optimization problem that may even be more difficult to solve. This novel optimization problem might again require resources to be solved and could therefore be described by a higher-level deliberation cost, thus leading to an infinite regress. In fact, any decision-making model that assumes that decision-makers reason about processing resources is affected by this problem (Russell and Subramanian, 1995; Gigerenzer and Selten, 2001). A possible way out is to consider the utility-information tradeoff simply as an as-if description, since perception-action systems that are subject to a utility-information tradeoff do not necessarily have to reason or know about their deliberation costs. It is straightforward, for example, to design processes that probabilistically optimize a given utility with no explicit notion of free energy, but for an outside observer the resulting choice distribution looks like an optimal free energy tradeoff (Ortega and Braun, 2014).
In summary, the free energy tradeoff between utility and information primarily serves as a normative model that provides a Pareto-optimality curve consisting of optimal decision policies. It can also serve as a guide for constructing and interpreting systems, although it is in general not a mechanistic model of behavior. In that respect, the abstract free energy tradeoff shares the fate of its cousins in thermodynamics and Shannon's coding theory (Shannon, 1948): they provide theoretical bounds on optimality but devise no mechanism for processes to achieve these bounds.
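The Pareto-optimal policies of such a utility-information tradeoff can be made concrete with a small numerical sketch (all utilities and priors below are hypothetical, and relative entropy is taken as the deliberation cost): the optimal posterior is a softmax of the utilities against the prior, and sweeping the tradeoff parameter traces out the optimality curve between expected utility and information cost.

```python
import numpy as np

def free_energy_posterior(U, p0, beta):
    """Optimal trade-off posterior p*(a) ∝ p0(a) exp(beta * U(a)), which
    maximizes E_p[U] - (1/beta) * D_KL(p || p0) over all distributions p."""
    logits = np.log(p0) + beta * U
    w = np.exp(logits - logits.max())   # subtract max for numerical stability
    return w / w.sum()

# Hypothetical toy problem: three actions with given utilities, uniform prior.
U = np.array([1.0, 0.5, 0.0])
p0 = np.ones(3) / 3

for beta in [0.0, 1.0, 100.0]:
    p = free_energy_posterior(U, p0, beta)
    EU = p @ U                            # expected utility (quality of p)
    DKL = np.sum(p * np.log(p / p0))      # information cost in nats
    print(beta, p.round(3), EU.round(3), DKL.round(3))
```

At $\beta = 0$ the posterior equals the prior (no information-processing), while for large $\beta$ it concentrates on the utility maximizer; the pairs (DKL, EU) are points on the Pareto-optimality curve described above.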
5 Variational free energy in Active Inference
5.1 The basic idea
Variational free energy is the main ingredient used in the Free Energy Principle for biological systems in the neuroscience literature (Friston, 2005, 2010; Friston et al., 2015b, 2006) and has even been considered "arguably the most ambitious theory of the brain available today" (Gershman, 2019). Since variational free energy in itself is just a mathematical construct to measure the dissimilarity between distributions and functions (see Section 3), the biological content of the Free Energy Principle must come from somewhere else. The basic biological phenomenon that the Free Energy Principle purports to explain is homeostasis, the ability to actively maintain certain relevant variables (e.g. blood sugar) within a preferred range. Usually, homeostasis is applied as an explanatory principle in physiology, whereby the actual value of a variable is compared to a target value and corrections to deviation errors are made through a feedback loop. However, homeostasis has also been proposed as an explanatory principle for complex behavior in the cybernetics literature (Wiener, 1948; Ashby, 1960; Powers, 1973; Cisek, 1999)—for example, maintaining blood sugar may entail complex feedback loops of learning to hunt, to trade, and to buy food. Crucially, being able to exploit the environment in order to attain favorable sensory states requires implicit or explicit knowledge of the environment that could either be preprogrammed (e.g. insect locomotion) or learnt (e.g. playing the piano).
The Free Energy Principle was originally suggested as a theory of cortical responses (Friston, 2005) by promoting the free energy formulation of predictive coding that was introduced by Dayan and Hinton with the Helmholtz machine (Dayan et al., 1995). It found its most recent incarnation in what is known as Active Inference, the attempt to extend variational Bayesian inference to action selection. Here, the target value of homeostasis is expressed through a probability distribution under which desired sensory states have a high probability. The required knowledge about the environment is expressed through a generative model that relates observations, hidden causes, and actions. As the generative model allows predictions to be made about future states and observations, it enables choosing actions in such a way that the predicted consequences conform to the desired distribution. In Active Inference, this is achieved by merging the generative and the desired distributions, $p$ and $\tilde p$, into a single reference function $\rho$ to which trial distributions $q$ over the unknown variables are fitted by minimizing the variational free energy $F[q\,\|\,\rho]$. In the resulting homeostatic process, the trial distributions play the role of internal variables that are manipulated in order to achieve the desired sensory states that are not directly controllable. Minimizing variational free energy by the alternating variation of trial distributions $q(a)$ over actions and trial distributions $q(s')$ over hidden states,
$q(a) \,\leftarrow\, \operatorname*{argmin}_{q(a)} F[q\,\|\,\rho]\,, \qquad q(s') \,\leftarrow\, \operatorname*{argmin}_{q(s')} F[q\,\|\,\rho]\,,$   (15)
is then equated with processes of action and perception. Such a free energy minimization can be regarded as an approximate inference process with respect to the reference $\rho$, similar to variational Bayesian inference (cf. Section 3.2.2).
In a nutshell, the central tenet of the Free Energy Principle states that organisms maintain homeostasis through minimization of the variational free energy between a trial distribution $q$ and a reference $\rho$ by acting and perceiving. Sometimes the even stronger statement is made that minimizing variational free energy is mandatory for homeostatic systems (Friston, 2013).
5.2 A Simple Example
Following the Active Inference recipe (cf. Fig 8), first we need to define a generative model and a desired distribution for our running example from Fig 1, assuming that $x$ is observed and that $x'$ and $s'$ are in the future, to be determined by the choice of the action $a$. As before, the generative model $p$ is specified by the factors in the decomposition (1), the desired distribution $\tilde p$ is a given fixed probability distribution over future sensory states $x'$, and the trial distributions $q$ are probabilities over all unknown variables $s$, $s'$, $x'$, and $a$.
In most treatments of Active Inference in the literature, the trial distributions are simplified, either by a full mean-field approximation over states and actions (Friston et al., 2013, 2015b), by a partial mean-field approximation where the dependency on actions is kept but the states are treated independently of each other (Friston et al., 2016), or, most recently (Parr et al., 2019), by the so-called Bethe approximation. Note that the Bethe approximation is actually exact in tree-like models (Heskes, 2003); in particular, it is exact in all models that have been considered in Active Inference so far. In the partial mean-field assumption of (Friston et al., 2016), the trial distribution over $x'$ is fixed and given by the likelihood $p(x'|s')$, while for $s$, $s'$, and $a$ the trial distributions are variable but restricted to be of the mean-field form $q(s|a)\,q(s'|a)$ for $s$ and $s'$, and $q(a)$ for $a$, so that
$q(x', s', s, a) \,=\, p(x'|s')\, q(s'|a)\, q(s|a)\, q(a)\,.$   (16)
This assumption effectively means that under the approximate model $q$, knowing a particular value of the random variable $s$ does not say anything more about $s'$ than knowing its distribution $q(s|a)$. Note, however, that such mean-field assumptions might be too strong a simplification and can fail to produce goal-directed behavior even for very simple tasks, such as navigation in a gridworld, as can be seen in B.2.
Next, the two distributions $p$ and $\tilde p$ are put together to form the reference model $\rho$. To do so, there have been several proposals in the Active Inference literature, which fall into one of two cases: either a handcrafted value function $Q$ is defined whose softmax replaces the action probability of the generative model, $\rho(a) \propto e^{Q(a)}$ (Friston et al., 2015b, 2016), or the desired distribution is multiplied directly into the generative model, $\rho \propto p\,\tilde p(x')$ (Parr and Friston, 2019). The value function $Q$ is sometimes also referred to as the expected free energy and defined as
$Q(a) \,=\, \mathbb{E}_{q(x'|a)}\big[\log \tilde p(x')\big] \,+\, H\big(q(x'|a)\big) \,-\, \mathbb{E}_{q(s'|a)}\big[H\big(p(x'|s')\big)\big]\,,$   (17)
where the expectation term favors both desirable and plausible future observations $x'$, in contrast to the utility function in Section 4.2 that only considers desirability (there, the likelihood of future observations is automatically taken into account when (soft-)maximizing expected utility; see the simulations in B.3). Moreover, the extra entropy term $H(q(x'|a))$ in $Q$ ensures that actions lead to consequences that more or less match the desired distribution, rather than trying to produce the single most desired outcome—see the discussion at the end of Section 5.3. Note also that the value function depends on the trial distributions $q$, which is generally problematic—see the discussion of the non-fixed reference model in Section 5.3.
Once the form of the trial distributions $q$ (e.g. by the partial mean-field assumption (16)) and the reference $\rho$ are defined, the variational free energy is simply determined by $F[q\,\|\,\rho]$. The resulting free energy minimization problem is usually solved approximately by performing an alternating optimization scheme, in which the variational free energy is minimized separately with respect to each of the variable factors in a factorization of $q$, for example by alternating between $q(s|a)$, $q(s'|a)$, and $q(a)$ in the case of the partial mean-field assumption (16), where in each step the factors that are not optimized are kept fixed. The resulting update equations (see A.1) turn out to be quite different depending on how the probabilistic model $p$ is combined with $\tilde p$ and which assumption on the structure of $q$ is made. In B.1, we compare different proposed formulations of Active Inference in the general case of arbitrarily many time steps.
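The alternating scheme can be sketched numerically. The following minimal example (all numbers are hypothetical, with a single future-state variable and an action, and an arbitrary unnormalized table standing in for the merged reference) performs coordinate descent on the variational free energy; since the trial factors here are flexible enough to represent any joint distribution, the procedure recovers exact inference, with the free energy converging to the negative log-normalizer of the reference.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unnormalized reference rho(s', a): 3 states x 2 actions,
# standing in for a generative model merged with a desired distribution.
rho = rng.uniform(0.1, 1.0, size=(3, 2))

# Trial distributions q(a) and q(s'|a), initialized uniformly.
q_a = np.ones(2) / 2
q_s_a = np.ones((3, 2)) / 3

def free_energy(q_a, q_s_a, rho):
    """F[q||rho] = E_q[log q - log rho]; may be negative since rho
    need not be normalized."""
    q_joint = q_s_a * q_a                      # q(s', a) = q(s'|a) q(a)
    return np.sum(q_joint * (np.log(q_joint) - np.log(rho)))

for _ in range(5):
    # 'perception': minimize F over q(s'|a) with q(a) held fixed
    q_s_a = rho / rho.sum(axis=0)
    # 'action': minimize F over q(a) with q(s'|a) held fixed
    log_q_a = np.sum(q_s_a * (np.log(rho) - np.log(q_s_a)), axis=0)
    q_a = np.exp(log_q_a - log_q_a.max())
    q_a /= q_a.sum()

# At the optimum, q(s', a) ∝ rho(s', a) and F equals -log(sum(rho)).
print(free_energy(q_a, q_s_a, rho), -np.log(rho.sum()))
```

In the actual Active Inference schemes discussed above, the trial factors are further restricted (mean-field or Bethe forms) and the reference itself may depend on $q$ through the value function, so the fixed point is in general not the exact conditional of the reference.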
5.3 Critical points
The main idea behind Active Inference is to express the problem of action selection in a similar manner to the perceptual problem of Bayesian inference over hidden causes. In Bayesian inference, agents are equipped with likelihood models that determine the plausibility of different hypotheses given the data $x$. In Active Inference, agents are equipped with a given desired distribution $\tilde p$ over future outcomes that ultimately determines the desirability of actions $a$. An important difference that arises is that perceptual inference has to condition on past observations $x$, whereas inference over actions would have to be conditioned on desired future observations $x'$. However, Active Inference never conditions on future observations, but rather merges the desired distribution $\tilde p$ with the likelihood into a single reference model $\rho$, such that the dependency of actions on the future is not the result of an inference process, but is prespecified by the handcrafted reference. This is the root of a number of critical issues with current formulations of Active Inference:


How to incorporate the desired distribution into the reference? Instead of using Bayesian conditioning directly in order to condition the generative model on the desired future, there have been essentially two different proposals in the literature for how to merge the two distributions $p$ and $\tilde p$ into a single reference function $\rho$. In order to create a dependency of actions on the desired future, this function is used in place of the probabilistic model as a reference in the variational free energy to perform variational inference. The merging of $p$ and $\tilde p$ is achieved either by a handcrafted value function that specifically modifies the action probability of the generative model, or by adjusting the probability over futures of the generative model by multiplying the likelihood with $\tilde p$ and renormalizing. The first option leads to the problem that the reference model is not fixed, the second to the problem that exact inference over actions is given by the prior (see the following two points).

The reference model is not fixed. In most implementations, see e.g. (Friston et al., 2015a, 2016), the probability over actions in the probabilistic reference model $\rho$ is defined through the value function $Q$ that itself depends on the trial distributions $q$. Therefore, both the trial distribution and the reference distribution change when $q$ is varied during free energy minimization. Consequently, minimizing the variational free energy with respect to $q$ no longer fits the trial distributions to a fixed reference as in variational Bayes, but instead minimizes the dissimilarity of the two variables $q$ and $\rho(q)$. This is comparable to minimizing a squared error loss with respect to a model parameter $\theta$ in order to fit a function $f_\theta$ to data $y$, where the data $y$ is not fixed but also depends on $\theta$. Due to this double dependency, it is no longer obvious what kind of result such a minimization process produces, even though an optimum might well be found.

Exact inference over actions is given by the prior. Given a reference model $\rho$ with known factorization, standard Bayesian conditioning on past observations $x$ can only produce trivial inference over actions, because it can only return the predefined distribution $\rho(a)$. In this case, exact inference over actions given past experience can only reproduce the prespecified distributions in the prior model. For Active Inference models that combine $p$ and $\tilde p$ by using a value function of the form (17) (Friston et al., 2015b, 2016), this means that exact inference just produces the predefined action distribution given by the softmax of $Q$, whereas in Active Inference models that multiply $\tilde p$ directly (Parr and Friston, 2019), exact inference results in the fixed prior that does not lead to desirable futures. Note that the update equation for the action distribution $q(a)$ (for example resulting from a mean-field assumption) will in both cases depend on the other factors. However, in the end, the variational free energy minimization seeks to approximate the prespecified prior model as closely as possible. This effect can be seen in the gridworld simulations provided in B.2.
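The structural point can be checked in a toy computation (all numbers hypothetical). If the reference factorizes such that the action distribution is prespecified and the past observation carries no likelihood term involving the action, so that for simplicity $\rho(x, a) = \rho(x)\,\rho(a)$, then exact conditioning on past data simply hands back the prespecified action distribution:

```python
import numpy as np

# Hypothetical reference: the action a has a prespecified prior rho(a),
# e.g. a softmax of some value function, and the past observation x is
# modelled independently of a, so rho(x, a) = rho(x) * rho(a).
rho_a = np.array([0.7, 0.2, 0.1])   # prespecified action distribution
rho_x = np.array([0.3, 0.7])        # marginal over past observations

joint = np.outer(rho_x, rho_a)      # rho(x, a), shape (2, 3)

# Exact Bayesian conditioning on the observed past, here x = 1, can only
# return the prespecified prior over actions:
posterior_a = joint[1] / joint[1].sum()   # rho(a | x = 1)
print(posterior_a)                        # equals rho_a
```

No matter which past observation is plugged in, the posterior over actions equals the prespecified prior; any apparent goal-directedness must therefore already be built into $\rho(a)$ itself.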
Instead of doing inference over actions given past experience, one could do inference over actions given the desired future outcomes, as has been done in other approaches (Dayan and Hinton, 1997; Toussaint and Storkey, 2006; Kappen et al., 2012; Levine, 2018). For a single desired future observation $x'$, inference can be applied in a straightforward manner by simply conditioning the generative model on $x'$. Similarly, one could condition on a desired distribution $\tilde p$ using Jeffrey's conditioning rule (Jeffrey, 1965), resulting in the mixture $p(a\,|\,\tilde p) = \sum_{x'} p(a|x')\,\tilde p(x')$, which could be implemented by first sampling a goal $x'$ from $\tilde p$ and then inferring $a$ given the single desired observation $x'$. However, the problem with such a naive approach is that the choice of a goal is solely determined by its desirability, whereas its realizability for the decision-maker is not taken into account; that is, the decision-maker ignores how likely a certain outcome can be achieved under the predictive distribution $p(x'|a)$ by choosing the right action $a$. This problem can be alleviated by introducing an auxiliary variable together with a probability that plays the role of a utility and determines how well the outcomes satisfy the desirability criteria of the decision-maker (Toussaint and Storkey, 2006). The extra variable gives the necessary flexibility to infer successful actions by simply conditioning on it (cf. B.3). In Active Inference this issue is avoided because the desired distribution is not used for conditioning. Instead, in early versions of Active Inference (Friston et al., 2013), decision-makers are assumed to match the desired distribution over future states, by defining a value function
that takes the form of a Kullback-Leibler divergence between the predicted and the desired future, $Q(a) = -\,D_{\mathrm{KL}}\big(q(s'|a)\,\|\,\tilde p(s')\big)$. As can be seen from examples such as A.2, this assumption can lead to counterintuitive behavior, especially when none of the predicted outcomes fits the desired distribution well. In later versions of Active Inference (Friston et al., 2015b, 2016), the value function is modified by an additional entropy term that explicitly punishes observations with high variability (cf. B.1). Even though this correction might fix the issue resulting from matching the predicted and desired distributions, the general question remains whether defining a desired distribution directly over outcomes is a good starting point when formulating decision-making as an inference problem (Gershman and Daw, 2012), and whether the proposed form of the value function is the right way to implement such a desired distribution into the probabilistic model (Millidge et al., 2020).

6 So What Does Free Energy Bring To the Table?
6.1 A Practical Tool
It is unquestionable that free energy has seen many fruitful practical applications in the statistics and machine learning literature. As discussed in Section 3, these applications generally fall into one of two categories: the principle of maximum entropy, and a variational formulation of Bayesian inference. Here, the principle of maximum entropy is interpreted in the wider sense of optimizing a tradeoff between uncertainty (entropy) and the expected value of some quantity of interest (energy), which in practice often appears in the form of regularized optimization problems (e.g. to prevent overfitting) or as a general inference method that allows determining unbiased priors and posteriors (cf. Section 3.1). In the variational formulation of Bayes' rule, free energy plays the role of an error measure that makes approximate inference possible by constraining the space of distributions over which the free energy is optimized, but it can also inform the design of efficient iterative inference algorithms that result from an alternating optimization scheme in which each step optimizes the full variational free energy only partially, such as the EM algorithm, belief propagation, and other message-passing algorithms (cf. Section 3.2).
6.2 Theories of Intelligent Agency
These practical use cases of free energy formulations have also influenced models of intelligent behavior. In the cognitive and behavioral sciences, intelligent agency has been modelled in a number of different frameworks, including logic-based symbolic models, connectionist models, statistical decision-making models, and dynamical systems approaches. Even though statistical thinking in a broader sense can in principle be applied to any of the other frameworks as well, statistical models of cognition in a more narrow sense have often focused on Bayesian inference, where agents are equipped with probabilistic models of their environment allowing them to infer unknown variables in order to select actions that lead to desirable consequences (Tenenbaum and Griffiths, 2001; Wolpert, 2006; Todorov, 2009). Naturally, the inference of unknown variables in such models can be achieved by a plethora of methods, including the two types of free energy approaches of maximum entropy and variational Bayes. However, both free energy formulations go one step further in that they attempt to extend these principles from the case of inference to the case of action selection: utility optimization with information constraints based on the maximum entropy principle, and Active Inference based on variational free energy.
While sharing similar mathematical concepts, both approaches differ in syntax and semantics. A prominent apple of discord is the concept of utility (Gershman and Daw, 2012). Utility optimization with information constraints requires the determination of a utility function, whereas Active Inference requires the determination of a reference function. Subjective utility functions that quantify the preferences of decision-makers can lead to identifiability issues when certain consistency axioms are not satisfied. Similarly, in Active Inference the reference function involves determining a desired distribution given by the preferred frequency of outcomes, which in the utility framework would correspond to a very general but weak utility concept, more similar to the concept of probability, that is able to explain arbitrary behavior. However, Active Inference then has to solve the additional problem of marrying up the agent's probabilistic model with its desired distribution into a single reference function (cf. Section 5.3), for example by a handcrafted value function that is incorporated into the probabilistic model. Crucially, the choice of the reference lies outside the scope of variational Bayes, but is critical for the resulting behavior because it determines the exact solutions that are approximated by free energy minimization. Thus, the choice of the reference in Active Inference conceptually corresponds to the solutions of the free energy tradeoff in utility-based approaches.
Also, both approaches differ fundamentally in their motivation. The motivation of utility optimization with information constraints is to capture the tradeoff between precision and uncertainty that underlies information-processing. This tradeoff takes the form of a free energy once an informational cost function has been chosen (cf. Section 4.3). Note that Bayes' rule can be seen as a special case of such a free energy tradeoff with log-likelihoods as utilities, even though this equivalence is not the primary motivation of the tradeoff. In contrast, Active Inference is motivated by casting the problem of action selection itself as an inference process (Friston et al., 2013), as this makes it possible to express both action and perception as the result of minimizing the same function, the variational free energy. This is clearly possible, because the underlying probabilistic model already contains both action and perception variables in a single functional format and the variational free energy is just a function of that model. Moreover, while approximate inference can be formulated on the basis of variational free energy, inference in general does not rely on this concept, and thus inference over actions can easily be done without free energy (Dayan and Hinton, 1997; Toussaint and Storkey, 2006). Also, as we have argued in Section 5.3, even without constraints on the auxiliary distributions, Active Inference does not actually do straightforward Bayesian inference over actions. Instead, Active Inference merges the desired distribution with the probabilistic model and fits trial distributions to the resulting reference (cf. Fig 8). Therefore, there is not a single fundamental principle in Active Inference that generates all the equations, but rather several principles in multiple variants, where different formal assumptions are put together to determine the reference and to perform variational inference with respect to that reference (cf. Section 5.3).
However, there are also plenty of similarities between the two approaches. For example, the assumption of a softmax action distribution in Active Inference is similar to the posterior solutions resulting from utility optimization with information constraints. Moreover, the assumption of a desired future distribution relates to constrained computational resources, because the uncertainty in a desired distribution over future states may not only be a consequence of environmental uncertainty, but could also originate from stochastic preferences of a satisficing decision-maker that accepts a wide range of outcomes. In B.2, we provide a comparison of the two approaches using gridworld simulations, where we also include other methods for inference over actions.
A remarkable resemblance between the two approaches is the exclusive appearance of relative entropy as the measure of dissimilarity. In Active Inference it is claimed that every homeostatic system must minimize variational free energy (Friston, 2013), which is simply an extension of relative entropy to non-normalized reference functions (cf. Section 3.2.1). In utility-based approaches, one typically uses relative entropy (14) to measure the amount of information-processing, even though theoretically other cost functions would be conceivable (Gottwald and Braun, 2019a). For a given homeostatic process, the Kullback-Leibler divergence measures the dissimilarity between the current distribution and the limiting distribution, and is therefore reduced while approaching the equilibrium. Similarly, in utility-based decision-making models, relative entropy measures the dissimilarity between the current posterior and the prior. In the Active Inference literature, the stepwise minimization of variational free energy that goes along with this KL minimization is often equated with the minimization of sensory surprise (see A.3 for a more detailed explanation), an idea that stems from maximum likelihood algorithms, but that has been challenged as a general principle (Biehl et al., 2020). Similarly, one could in principle rewrite maximum entropy tradeoffs in terms of informational surprise, which would, however, simply be a rewording of the probabilistic concepts in log-space. The same kind of rewording is well-known between probabilistic inference and the minimum description length principle (Grünwald, 2007), which also operates in log-space and thus reformulates the inference problem as a surprise minimization problem, without adding any new features or properties.
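The relation between variational free energy, surprise, and relative entropy invoked here is a simple identity that can be checked numerically. In the sketch below (a hypothetical discrete model with four hidden states and one observed datum), the free energy equals the sensory surprise $-\log p(x)$ plus the Kullback-Leibler divergence from the trial distribution to the exact posterior, so minimizing free energy over the trial distribution tightens an upper bound on surprise.

```python
import numpy as np

# Hypothetical joint p(s, x_obs) for each of 4 hidden states s at the
# observed datum, and an arbitrary trial distribution q(s).
p_joint = np.array([0.10, 0.25, 0.05, 0.20])
q = np.array([0.4, 0.3, 0.2, 0.1])

evidence = p_joint.sum()          # p(x_obs)
posterior = p_joint / evidence    # exact posterior p(s | x_obs)

# Variational free energy F[q || p(., x_obs)] = E_q[log q - log p(s, x_obs)]
F = np.sum(q * (np.log(q) - np.log(p_joint)))
# KL divergence from q to the exact posterior
KL = np.sum(q * (np.log(q) - np.log(posterior)))

# Identity: F = -log p(x_obs) + KL(q || posterior) >= -log p(x_obs)
print(F, -np.log(evidence) + KL)
```

Because the KL term is non-negative and vanishes exactly when the trial distribution equals the posterior, driving the free energy down is equivalent to Bayesian inference in this setting; whether it also licenses the stronger surprise-minimization reading is precisely the contested point discussed above.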
6.3 Biological Relevance
So far we have seen how free energy is used as a technical instrument to solve inference problems, and how it correspondingly appears in different models of intelligent agency. Crucially, these kinds of models can be applied to any input-output system, be it a human that reacts to sensory stimuli, a cell that tries to maintain homeostasis, or a particle that reacts to physical forces. Given the existing literature that has widely applied the concept of free energy to biological systems, we may ask whether there are any specific biological implications of these models.
If we regard free energy primarily as a tradeoff between utility and information-processing costs, we obtain a normative model of decision-making under resource constraints that extends previous optimality models based on expected utility maximization and Bayesian inference. Similarly to rate-distortion curves in coding theory, it provides optimal solutions to decision-making problems forming an information-utility curve of Pareto optima (cf. Fig 3). The behavior of real decision-making systems under varying information constraints can be analyzed experimentally by comparing their performance with respect to the corresponding optimality curve. One can experimentally relate abstract information-processing costs measured in bits to task-dependent resource costs like reaction or planning times (Schach et al., 2018; Ortega and Stocker, 2016). Moreover, the free energy tradeoff can also be used to describe networks of agents, where each agent is limited in its ability, but the system as a whole has a higher information-processing capacity—for example, neurons in a brain or humans in a group. In such systems, different levels of abstraction arise depending on the different positions of decision-makers in the network (Lindig-León et al., 2019; Genewein et al., 2015; Gottwald and Braun, 2019b). As we have discussed in Section 4.3, just like coding and rate-distortion theory, utility theory with information costs can only provide optimality bounds but does not specify any particular mechanism of how to achieve optimality. However, by including more and more constraints one can make a model more and more mechanistic and thereby gradually move from a normative to a more descriptive model, such as models that consider the communication channel capacity of neurons with a finite energy budget (Bhui and Gershman, 2018).

Considering free energy in the sense of variational free energy, there is a vast literature on biological applications mostly focusing on neural processing (e.g. predictive coding, dopamine) (Schwartenbeck et al., 2015; Friston et al., 2017b; Parr et al., 2019), but there are also a number of applications aiming to explain behavior (e.g. human decision-making, hallucinations) (Parr et al., 2018). Similarly to utility-based models, Active Inference models can be studied as as-if models, so that actual behavior can be compared to predicted behavior as long as suitable prior and likelihood models can be identified from the experiment. When applied to brain dynamics, the as-if models are sometimes also given a mechanistic interpretation by relating the iterative update equations that appear when minimizing variational free energy to dynamics in neuronal circuits. As discussed in Section 3.2.2, the update equations resulting, for example, from mean-field or Bethe approximations can often be written in message-passing form, in the sense that the update for a given variable only has contributions that require the current approximate posterior of neighbouring nodes in the probabilistic model. These contributions are interpreted as local messages passed between the nodes and might be related to brain signals (Parr et al., 2019).
Other interpretations (Friston et al., 2006, 2017a; Bogacz, 2017) obtain similar update equations by minimizing variational free energy directly through gradient descent, which can again be related to neural coding schemes like predictive coding. As these coding schemes existed irrespective of free energy (Rao and Ballard, 1999; Aitchison and Lengyel, 2017), especially since prediction error minimization is essentially just maximum likelihood estimation, the question remains whether there are any specific predictions of the Active Inference framework that cannot be explained by previous models.
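The observation that prediction error minimization amounts to maximum likelihood estimation can be illustrated in the simplest Gaussian case (synthetic data and a hypothetical learning rate): gradient descent on the summed squared prediction error converges to the sample mean, which is exactly the maximum likelihood estimate under a Gaussian observation model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: noisy observations of an unknown quantity mu.
x = rng.normal(loc=2.0, scale=0.5, size=200)

# Under a Gaussian model p(x|mu), the negative log-likelihood is (up to
# additive and multiplicative constants) the summed squared prediction
# error, so gradient descent on the prediction error is ML estimation.
mu = 0.0
lr = 0.001
for _ in range(200):
    prediction_error = x - mu
    mu += lr * prediction_error.sum()   # gradient step on -0.5*sum((x-mu)^2)

print(mu, x.mean())   # both approach the ML estimate, the sample mean
```

The same equivalence underlies the gradient-descent formulations of predictive coding mentioned above, which is why prediction error dynamics alone do not discriminate between the free energy framing and plain maximum likelihood.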
6.4 Conclusion
The goal of this article is to trace back the seemingly mysterious connection between the Helmholtz free energy from thermodynamics and Helmholtz's view of model-based information-processing that led to the analysis-by-synthesis approach to perception, as exemplified in predictive coding schemes, and in particular to discuss the role of free energy in current models of intelligent behavior. The mystery starts to dissolve when we consider the two kinds of free energy discussed in this article, one based on the maximum entropy principle and the other based on variational free energy—a dissimilarity measure between distributions and (generally unnormalized) functions that extends the well-known Kullback-Leibler divergence from information theory. The Helmholtz free energy is a particular example of an energy-information tradeoff that results from the maximum entropy principle (Jaynes, 1957). Analysis-by-synthesis is a particular application of inference to perception, where determining model parameters and hidden states can either be seen as a result of maximum entropy under observational constraints or of fitting parameter distributions to the model through variational free energy minimization. Thus, both notions of free energy can be formally related as entropy-regularized maximization of log-probabilities.
Any theory of intelligent behavior has to answer three questions: where am I?, where do I want to go?, and how do I get there?, corresponding to the three problems of inference and perception, goals and preferences, and planning and execution. All three problems can be addressed either in the language of probabilities or in that of utilities. Perceptual inference can either be considered as finding parameters that maximize probabilities or likelihood utilities. Goals and preferences can either be expressed by utilities over outcomes or by desired distributions. The third question is answered by the two free energy approaches that either determine future utilities based on model predictions or infer actions that lead to outcomes predicted to match the desired distribution. In standard decision-making models, actions are usually determined by a utility function that ranks the different options, whereas perceptual inference is determined by a likelihood model that quantifies how probable certain observations are. In contrast, both free energy approaches have in common that they treat all types of information-processing, from action planning to perception, as the same formal process of minimizing some form of free energy. The crucial difference, however, is not whether they use utilities or probabilities, but how predictions and goals are interwoven into action.
Utility-based models with information constraints serve primarily as ultimate explanations of behavior: they do not focus on mechanism, but on the goals of behavior and their realizability under ideal circumstances. They have the appeal of being relatively straightforward generalizations of standard utility theory, but they rely on abstract concepts like utility and relative entropy that may not be so straightforwardly related to experimental settings. While these normative models have no immediate mechanistic interpretation, their relevance for mechanistic models is analogous to the relevance of optimality bounds in Shannon's information theory for practical codes (Shannon, 1948). In contrast, Active Inference models of behavior often mix ultimate and proximate arguments for explaining behavior (Alcock, 1993; Tinbergen, 1963), because they combine the normative aspect of optimizing variational free energy with the mechanistic interpretation of the particular form of approximate solutions to this optimization.
Finally, both free energy formulations are so general and flexible in their ingredients that it might be more appropriate to consider them languages or tools for phrasing and describing behavior rather than theories that explain behavior, analogous to how the language of statistics is used in statistical mechanics, where the actual physical theory depends on many additional constraints that must be taken into account.
Funding
This study was funded by the European Research Council (ERC Starting Grant ERC-StG-2015, Project ID: 678082, “BRISC: Bounded Rationality in Sensorimotor Coordination”).
References
 Aitchison and Lengyel (2017) Aitchison, L. and Lengyel, M. (2017). With or without you: predictive coding and Bayesian inference in the brain. Current Opinion in Neurobiology, 46:219–227.
 Alcock (1993) Alcock, J. (1993). Animal behavior: an evolutionary approach. Sinauer Associates.
 Ashby (1960) Ashby, W. (1960). Design for a Brain: The Origin of Adaptive Behavior. Springer Netherlands.
 Beal (2003) Beal, M. J. (2003). Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University of Cambridge, UK.
 Bernoulli (1713) Bernoulli, J. (1713). Ars conjectandi. Basel, Thurneysen Brothers.
 Bhui and Gershman (2018) Bhui, R. and Gershman, S. J. (2018). Decision by sampling implements efficient coding of psychoeconomic functions. Psychological Review, 125(6):985–1001.
 Biehl et al. (2020) Biehl, M., Pollock, F. A., and Kanai, R. (2020). A technical critique of the free energy principle as presented in "life as we know it" and related works. preprint: arXiv:2001.06408v2.
 Bogacz (2017) Bogacz, R. (2017). A tutorial on the free-energy framework for modelling perception and learning. Journal of Mathematical Psychology, 76:198–211.
 Callen (1985) Callen, H. (1985). Thermodynamics and an Introduction to Thermostatistics. Wiley.
 Cisek (1999) Cisek, P. (1999). Beyond the computer metaphor: behaviour as interaction. Journal of Consciousness Studies, 6(11-12):125–142.
 Clark (2013) Clark, A. (2013). Whatever next? predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204.
 Colombo and Wright (2018) Colombo, M. and Wright, C. (2018). First principles in the life sciences: the freeenergy principle, organicism, and mechanism. Synthese.
 Csiszár (2008) Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3):261–273.
 Dayan and Hinton (1997) Dayan, P. and Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278.
 Dayan et al. (1995) Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5):889–904.
 de Laplace (1812) de Laplace, P. S. (1812). Théorie analytique des probabilités. Ve. Courcier, Paris.
 Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.
 Doya (2007) Doya, K. (2007). Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press, Cambridge, Mass.
 Ergin and Sarver (2010) Ergin, H. and Sarver, T. (2010). A unique costly contemplation representation. Econometrica, 78(4):1285–1339.
 Feynman et al. (1996) Feynman, R., Hey, A., and Allen, R. (1996). Feynman Lectures on Computation. Advanced book program. AddisonWesley.
 Flanagan et al. (2003) Flanagan, J. R., Vetter, P., Johansson, R. S., and Wolpert, D. M. (2003). Prediction precedes control in motor learning. Current Biology, 13(2):146–150.

 Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. (2016). Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 202–211, Arlington, Virginia, United States. AUAI Press.
 Friston (2013) Friston, K. (2013). Life as we know it. Journal of The Royal Society Interface, 10(86):20130475.
 Friston et al. (2016) Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., and Pezzulo, G. (2016). Active inference and learning. Neuroscience & Biobehavioral Reviews, 68:862–879.
 Friston et al. (2013) Friston, K., Schwartenbeck, P., Fitzgerald, T., Moutoussis, M., Behrens, T., and Dolan, R. (2013). The anatomy of choice: active inference and agency. Frontiers in Human Neuroscience, 7:598.
 Friston (2005) Friston, K. J. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):815–836.
 Friston (2010) Friston, K. J. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11:127–138.
 Friston et al. (2017a) Friston, K. J., FitzGerald, T. H. B., Rigoli, F., Schwartenbeck, P., and Pezzulo, G. (2017a). Active inference: A process theory. Neural Computation, 29:1–49.
 Friston et al. (2006) Friston, K. J., Kilner, J., and Harrison, L. M. (2006). A free energy principle for the brain. Journal of PhysiologyParis, 100:70–87.
 Friston et al. (2015a) Friston, K. J., Levin, M., Sengupta, B., and Pezzulo, G. (2015a). Knowing one’s place: a freeenergy approach to pattern regulation. Journal of The Royal Society Interface, 12(105):20141383.
 Friston et al. (2017b) Friston, K. J., Parr, T., and de Vries, B. (2017b). The graphical brain: Belief propagation and active inference. Network Neuroscience, 1(4):381–414.
 Friston et al. (2015b) Friston, K. J., Rigoli, F., Ognibene, D., Mathys, C., Fitzgerald, T., and Pezzulo, G. (2015b). Active inference and epistemic value. Cognitive Neuroscience, 6(4):187–214.
 Genewein et al. (2015) Genewein, T., Leibfried, F., GrauMoya, J., and Braun, D. A. (2015). Bounded rationality, abstraction, and hierarchical decisionmaking: An informationtheoretic optimality principle. Frontiers in Robotics and AI, 2.
 Gershman (2019) Gershman, S. J. (2019). What does the free energy principle tell us about the brain? Neurons, Behavior, Data Analysis, and Theory.
 Gershman and Daw (2012) Gershman, S. J. and Daw, N. D. (2012). Perception, action and utility: The tangled skein. In Principles of Brain Dynamics. MIT Press.
 Gigerenzer and Selten (2001) Gigerenzer, G. and Selten, R. (2001). Bounded Rationality: The Adaptive Toolbox. MIT Press: Cambridge, MA, USA.
 Gottwald and Braun (2019a) Gottwald, S. and Braun, D. A. (2019a). Bounded rational decisionmaking from elementary computations that reduce uncertainty. Entropy, 21(4).
 Gottwald and Braun (2019b) Gottwald, S. and Braun, D. A. (2019b). Systems of bounded rational agents with informationtheoretic constraints. Neural Computation, 31(2):440–476.
 Grünwald (2007) Grünwald, P. (2007). The Minimum Description Length Principle. MIT Press, Cambridge, Mass.
 Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energybased policies. In ICML.
 Hansen and Sargent (2008) Hansen, L. P. and Sargent, T. J. (2008). Robustness. Princeton University Press.
 Heskes (2003) Heskes, T. (2003). Stable fixed points of loopy belief propagation are local minima of the bethe free energy. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 359–366. MIT Press.

 Hinton and van Camp (1993) Hinton, G. E. and van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, pages 5–13, New York, NY, USA. ACM.
 Ho et al. (2020) Ho, M. K., Abel, D., Cohen, J. D., Littman, M. L., and Griffiths, T. L. (2020). The efficiency of human cognition reflects planned information processing. Proceedings of the 34th AAAI Conference on Artificial Intelligence. arXiv:2002.05769.
 Jaynes (1957) Jaynes, E. T. (1957). Information theory and statistical mechanics. Phys. Rev., 106:620–630.
 Jeffrey (1965) Jeffrey, R. C. (1965). The Logic of Decision. University of Chicago Press, 1 edition.
 Kahneman (2002) Kahneman, D. (2002). Maps of bounded rationality: A perspective on intuitive judgement. In Frangsmyr, T., editor, Nobel prizes, presentations, biographies, & lectures, pages 416–499. Almqvist & Wiksell, Stockholm, Sweden.
 Kappen et al. (2012) Kappen, H. J., Gómez, V., and Opper, M. (2012). Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182.
 Kawato (1999) Kawato, M. (1999). Internal models for motor control and trajectory planning. Current Opinion in Neurobiology, 9(6):718–727.
 Levine (2018) Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv:1805.00909v3.
 Lindig-León et al. (2019) Lindig-León, C., Gottwald, S., and Braun, D. A. (2019). Analyzing abstraction and hierarchical decision-making in absolute identification by information-theoretic bounded rationality. Frontiers in Neuroscience, 13:1230.
 Linson et al. (2020) Linson, A., Parr, T., and Friston, K. J. (2020). Active inference, stressors, and psychological trauma: A neuroethological model of (mal)adaptive exploreexploit dynamics in ecological context. Behavioural Brain Research, 380:112421.
 Maccheroni et al. (2006) Maccheroni, F., Marinacci, M., and Rustichini, A. (2006). Ambiguity aversion, robustness, and the variational representation of preferences. Econometrica, 74(6):1447–1498.
 MacKay (2002) MacKay, D. J. C. (2002). Information Theory, Inference & Learning Algorithms. Cambridge University Press, USA.
 Marshall et al. (2011) Marshall, A. W., Olkin, I., and Arnold, B. C. (2011). Inequalities: Theory of Majorization and Its Applications. Springer New York, 2nd edition.
 Mattsson and Weibull (2002) Mattsson, L.-G. and Weibull, J. W. (2002). Probabilistic choice and procedurally bounded rationality. Games and Economic Behavior, 41(1):61–78.
 McFadden (2005) McFadden, D. L. (2005). Revealed stochastic preference: a synthesis. Economic Theory, 26(2):245–264.
 McKelvey and Palfrey (1995) McKelvey, R. D. and Palfrey, T. R. (1995). Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38.
 Millidge et al. (2020) Millidge, B., Tschantz, A., and Buckley, C. L. (2020). Whence the expected free energy?
 Minka (2005) Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft.
 Minka (2001) Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI ’01, pages 362–369, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
 Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA. PMLR.
 Neal and Hinton (1998) Neal, R. M. and Hinton, G. E. (1998). A view of the em algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., editor, Learning in Graphical Models, pages 355–368. Springer Netherlands, Dordrecht.
 Ortega and Braun (2013) Ortega, P. A. and Braun, D. A. (2013). Thermodynamics as a theory of decisionmaking with informationprocessing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469(2153):20120683.

 Ortega and Braun (2014) Ortega, P. A. and Braun, D. A. (2014). Generalized Thompson sampling for sequential decision-making and causal inference. Complex Adaptive Systems Modeling, 2(1):2.
 Ortega and Stocker (2016) Ortega, P. A. and Stocker, A. (2016). Human decision-making under limited time. In 30th Conference on Neural Information Processing Systems.
 Parr et al. (2018) Parr, T., Benrimoh, D. A., Vincent, P., and Friston, K. J. (2018). Precision and false perceptual inference. Frontiers in Integrative Neuroscience, 12:39.
 Parr and Friston (2017) Parr, T. and Friston, K. J. (2017). Working memory, attention, and salience in active inference. Scientific reports, 7(1):14678–14678.
 Parr and Friston (2019) Parr, T. and Friston, K. J. (2019). Generalised free energy and active inference. Biological Cybernetics.
 Parr et al. (2019) Parr, T., Markovic, D., Kiebel, S. J., and Friston, K. J. (2019). Neuronal message passing using meanfield, bethe, and marginal approximations. Scientific Reports, 9(1):1889.
 Pearl (1988) Pearl, J. (1988). Belief updating by network propagation. In Pearl, J., editor, Probabilistic Reasoning in Intelligent Systems, pages 143–237. Morgan Kaufmann, San Francisco (CA).
 Poincaré (1912) Poincaré, H. (1912). Calcul des probabilités. GauthierVillars, Paris.
 Powers (1973) Powers, W. T. (1973). Behavior: The Control of Perception. Aldine, Chicago, IL.
 Rao and Ballard (1999) Rao, R. P. N. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extraclassical receptivefield effects. Nature Neuroscience, 2(1):79–87.
 Russell and Subramanian (1995) Russell, S. J. and Subramanian, D. (1995). Provably boundedoptimal agents. Journal of Artificial Intelligence Research, 2(1):575–609.
 Schach et al. (2018) Schach, S., Gottwald, S., and Braun, D. A. (2018). Quantifying motor task performance by bounded rational decision theory. Frontiers in Neuroscience, 12:932.
 Schwartenbeck et al. (2015) Schwartenbeck, P., FitzGerald, T. H. B., Mathys, C., Dolan, R., and Friston, K. (2015). The dopaminergic midbrain encodes the expected certainty about desired outcomes. Cerebral cortex (New York, N.Y. : 1991), 25(10):3434–3445.
 Schwartenbeck and Friston (2016) Schwartenbeck, P. and Friston, K. (2016). Computational phenotyping in psychiatry: A worked example. eNeuro, 3(4):ENEURO.0049–16.2016.
 Shannon (1948) Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656.
 Simon (1955) Simon, H. A. (1955). A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99–118.
 Sims (2003) Sims, C. A. (2003). Implications of rational inattention. Journal of Monetary Economics, 50(3):665–690.
 Sims (2016) Sims, C. R. (2016). Rate–distortion theory and human perception. Cognition, 152:181–198.
 Still (2009) Still, S. (2009). Informationtheoretic approach to interactive learning. EPL (Europhysics Letters), 85(2):28005.
 Tenenbaum and Griffiths (2001) Tenenbaum, J. B. and Griffiths, T. L. (2001). Generalization, similarity, and bayesian inference. Behavioral and Brain Sciences, 24(4):629–640.
 Tinbergen (1963) Tinbergen, N. (1963). On aims and methods of ethology. Zeitschrift für Tierpsychologie, 20:410–433.
 Tishby and Polani (2011) Tishby, N. and Polani, D. (2011). Information Theory of Decisions and Actions, pages 601–636. Springer New York.
 Todorov (2009) Todorov, E. (2009). Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478–11483.

 Toussaint and Storkey (2006) Toussaint, M. and Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 945–952, New York, NY, USA. Association for Computing Machinery.
 von Neumann and Morgenstern (1944) von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, USA.

 Wainwright et al. (2005) Wainwright, M., Jaakkola, T., and Willsky, A. (2005). MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory, 51(11):3697–3717.
 Wiener (1948) Wiener, N. (1948). Cybernetics: Or Control and Communication in the Animal and the Machine. John Wiley.
 Williams (1980) Williams, P. M. (1980). Bayesian conditionalisation and the principle of minimum information. The British Journal for the Philosophy of Science, 31(2):131–144.
 Williams and Peng (1991) Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268.
 Winn and Bishop (2005) Winn, J. and Bishop, C. M. (2005). Variational message passing. J. Mach. Learn. Res., 6:661–694.

 Wolpert (2006) Wolpert, D. H. (2006). Information Theory – The Bridge Connecting Bounded Rational Game Theory and Statistical Physics, pages 262–290. Springer Berlin Heidelberg.
 Yedidia et al. (2001) Yedidia, J. S., Freeman, W. T., and Weiss, Y. (2001). Generalized belief propagation. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 689–695. MIT Press.
 Yuille and Kersten (2006) Yuille, A. and Kersten, D. (2006). Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308.
Appendix A Appendices
a.1 Exemplary update equations of Active Inference
Here, we list the iterative update equations of active inference for the example from Section 5.2 under the partial mean-field assumption (16). In the case when the desired distribution is combined with the generative model via the value function , i.e. if , then the update equations are given by
where is shorthand for the log-transition probability, and denotes the free energy over all variables besides the action, which simplifies to
Note that the update equations shown above are only correct under the assumption that the dependency of the value function on the trial distributions can be neglected (Friston et al., 2015b). This problem can be avoided in the other case where is combined with multiplicatively, i.e. if , where the update equations for and are given by
with . Here, denotes the free energy over all variables besides the action, replacing in the previous expression, and is given by
Moreover, sometimes the more restrictive mean-field assumption is made (Friston et al., 2015b), in which case the definition of and the solution equations are also different. For simplicity, and in line with more recent formulations of active inference (Parr and Friston, 2019), we have ignored the precision parameter that appears in earlier formulations (Friston et al., 2013, 2015b), where it is multiplied by the value function and treated as another unknown variable.
a.2 Uncertain and deterministic options
Consider the simple example of three possible observations, , a desired distribution , and two actions with predictive distributions and . When conditioning on directly through a naive Bayesian inference approach using Jeffrey conditioning, one finds the choice probability . In particular, one would prefer the action that has maximum variability in the two outcomes over the deterministic action , even though the average desirability, , is the same for both actions. In fact, following (Toussaint and Storkey, 2006) with success probability results in being indifferent regarding the choice between the two options when doing inference over conditioned on .
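Since the concrete numbers of this example are not reproduced here, the following sketch uses assumed illustrative values (a desired distribution chosen so that both actions have the same average desirability) to replicate the qualitative effect described in this appendix: naive Jeffrey conditioning and the KL-based softmax both prefer the variable action, while additionally subtracting the entropy of the prediction tips the preference to the deterministic action:

```python
import numpy as np

p_des = np.array([5.0, 3.0, 4.0]) / 12.0   # desired distribution (assumed values)
p_var = np.array([0.5, 0.5, 0.0])          # action with variable outcomes
p_det = np.array([0.0, 0.0, 1.0])          # deterministic action
lik = np.vstack([p_var, p_det])

# Both actions have the same average desirability sum_o p(o|a) p_des(o)
assert np.isclose(p_var @ p_des, p_det @ p_des)

def kl(q, p):
    m = q > 0
    return float(np.sum(q[m] * np.log(q[m] / p[m])))

def entropy(q):
    m = q > 0
    return float(-np.sum(q[m] * np.log(q[m])))

# Naive Jeffrey conditioning with a uniform action prior:
# p(a) = sum_o p_des(o) p(a|o), where p(a|o) is proportional to p(o|a)
p_choice = (p_des * lik / lik.sum(axis=0)).sum(axis=1)
assert p_choice[0] > p_choice[1]           # variable action preferred

# Softmax of the negative KL divergence also prefers the variable action ...
neg_kl = -np.array([kl(p, p_des) for p in lik])
assert neg_kl[0] > neg_kl[1]

# ... while subtracting the entropy of the prediction from that score
# (equivalently, using the negative cross-entropy) slightly prefers
# the deterministic action by punishing outcome variability
score = neg_kl - np.array([entropy(p) for p in lik])
assert score[1] > score[0]
```

The numbers are ours, not the original article's, but any desired distribution satisfying the equal-desirability constraint reproduces the same ordering of preferences.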
The early Active Inference approach (Friston et al., 2013) of measuring the dissimilarity between and using the Kullback-Leibler divergence and then calculating the choice probability through a softmax function would result in , i.e., similar to the naive Bayes' approach, it would lead to preferring the second option, because the shape of is more similar to than to when measured using the Kullback-Leibler divergence, although none of them is very close. The modification that was made in later versions of Active Inference (e.g. Friston et al., 2015b), i.e. subtracting the entropy of , here results in the choice probability . Thus, Active Inference slightly prefers the deterministic option by explicitly punishing the option with higher variability.
a.3 Surprise minimization
The (informational) surprise or surprisal of a given element $x$ with respect to a probability distribution $p$ is defined as $-\log p(x)$, i.e. it is simply a strictly decreasing function of probability such that outcomes with low probability have high surprise and outcomes with high probability have low surprise. A common statement found in the literature (Parr and Friston, 2017) is that variational free energy is an upper bound on surprise and thus minimizing free energy also minimizes surprise. This idea originates from the special case of greedy inference with latent variables, where, for fixed data $x$, the goal is to maximize the likelihood $p(x|\theta)$ with respect to a parameter $\theta$. If the marginalization over the latent variable $z$ is too hard to carry out directly, then one might take advantage of the bound

$-\log p(x|\theta) \;\leq\; \mathbb{E}_{q(z)}\big[\log q(z) - \log p(x,z|\theta)\big] \;=:\; F(q,\theta)$   (18)

i.e. that the variational free energy $F(q,\theta)$ is an upper bound on the surprise $-\log p(x|\theta)$, which might therefore be reduced by minimizing its upper bound with respect to $\theta$ as a proxy. In the variational Bayes' approach to the above inference problem, where $\theta$ is treated as a random variable with prior $p(\theta)$, minimization with respect to $\theta$ is replaced by the minimization with respect to a distribution $q(\theta)$. In this case, writing $q(z,\theta) = q(z|\theta)\,q(\theta)$, the analogous bound to (18) is

$F(q) \;=\; \mathbb{E}_{q(\theta)}\big[F(q(\cdot|\theta),\theta)\big] + D_{\mathrm{KL}}\big(q(\theta)\,\|\,p(\theta)\big) \;\geq\; -\log \textstyle\sum_\theta p(\theta)\, e^{-F(q(\cdot|\theta),\theta)}$

where the right-hand side is the minimum of the left-hand side with respect to $q(\theta)$. In this sense, variational free energy is generally not a bound on the surprise anymore, but on a log-sum-exp version of it instead. Nonetheless, also in this Bayesian approach, variational free energy is an upper bound on the surprise $-\log p(x)$,

$F(q) \;\geq\; -\log p(x)$   (19)

where the right-hand side is the minimum of the left-hand side with respect to both $q(\cdot|\theta)$ and $q(\theta)$. However, in contrast to (18), there is no variable left in $-\log p(x)$ over which one could minimize. Therefore, saying that minimizing free energy also minimizes surprise (Parr and Friston, 2017) is generally only true in the sense that minimizing free energy minimizes an upper bound on surprise; surprise itself is not minimized. Instead, the important fact about (19) is that equality is achieved by the Bayes' posteriors $p(z|x,\theta)$ and $p(\theta|x)$, as discussed in Section 3.2.2.
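The bound (19) and its tightness at the Bayes' posterior can be verified numerically. The following is a minimal sketch of our own, with an assumed toy joint distribution over one observed datum and a discrete latent variable (latents only, no parameters):

```python
import numpy as np

# Assumed toy joint p(x0, z) for one observed datum x0 and latent z in {0, 1, 2}
p_joint = np.array([0.2, 0.1, 0.05])
p_x = p_joint.sum()                  # marginal likelihood p(x0)
surprise = -np.log(p_x)              # surprisal of the observed datum

def free_energy(q):
    """Variational free energy F(q) = E_q[log q(z) - log p(x0, z)]."""
    m = q > 0
    return float(np.sum(q[m] * (np.log(q[m]) - np.log(p_joint[m]))))

# F(q) upper-bounds the surprise for every trial distribution q ...
rng = np.random.default_rng(0)
for _ in range(100):
    q = rng.dirichlet(np.ones(3))    # random trial distribution
    assert free_energy(q) >= surprise - 1e-9

# ... and the bound is tight exactly at the Bayes' posterior p(z|x0)
posterior = p_joint / p_x
assert np.isclose(free_energy(posterior), surprise)
```

Note that the gap between the two sides is the Kullback-Leibler divergence from the trial distribution to the posterior, so minimizing free energy here tightens the bound rather than lowering the surprise itself, in line with the discussion above.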
Appendix B Supplementary Material
The following ancillary files are provided as supplementary material:
b.1 Comparison of different formulations of Active Inference
A detailed comparison of the different formulations of active inference found in the literature (2013–2019), including their mean-field and exact solutions in the general case of arbitrarily many time steps.
b.2 Simulations: Gridworld navigation
We provide implementations of the models discussed in this article in a gridworld environment, both as a rendered HTML file and as an interactive Jupyter notebook for interested readers to tinker with.
b.3 Simulations: Non-uniform emission probability
In these simulations, we compare the four approaches that successfully navigated the basic gridworlds from B.2 in an environment with two desired outcomes and a non-uniform emission probability, again both as a rendered HTML file and as an interactive Jupyter notebook.