The Two Kinds of Free Energy and the Bayesian Revolution

04/24/2020 ∙ by Sebastian Gottwald, et al.


Abstract

The concept of free energy has its origins in 19th century thermodynamics, but has recently found its way into the behavioral and neural sciences, where it has been promoted for its wide applicability and has even been suggested as a fundamental principle for understanding intelligent behavior and brain function. We argue that there are essentially two different notions of free energy in current models of intelligent agency that can both be considered as applications of Bayesian inference to the problem of action selection: one that appears when trading off accuracy and uncertainty based on a general maximum entropy principle, and one that formulates action selection in terms of minimizing an error measure that quantifies deviations of beliefs and policies from given reference models. The first approach provides a normative rule for action selection in the face of model uncertainty or when information-processing capabilities are limited. The second approach directly aims to formulate the action selection problem as an inference problem in the context of Bayesian brain theories, also known as Active Inference in the literature. We elucidate the main ideas and discuss critical technical and conceptual issues revolving around these two notions of free energy that both claim to apply at all levels of decision-making, from the high-level deliberation of reasoning down to the low-level information-processing of perception.
Keywords: free energy, intelligent agency, bayesian inference, maximum entropy, utility theory, active inference

1 Introduction

There is a surprising line of thought connecting some of the greatest scientists of the last centuries, including Immanuel Kant, Hermann von Helmholtz, Ludwig E. Boltzmann, and Claude E. Shannon, whereby model-based processes of action, perception, and communication are explained with concepts borrowed from statistical physics. Inspired by Kant’s Copernican revolution, Helmholtz was one of the first proponents of the analysis-by-synthesis approach to perception (Yuille and Kersten, 2006), which was motivated by his own studies of the physiology of the sensory system, whereby a perceiver does not simply record raw external stimuli on some kind of tabula rasa, but rather relies on internal models of the world to match and anticipate sensory inputs as well as possible. The internal model paradigm is now ubiquitous in the cognitive and neural sciences and has even led some researchers to propose a Bayesian brain hypothesis, whereby the brain would essentially be a prediction and inference engine based on internal models (Kawato, 1999; Flanagan et al., 2003; Doya, 2007). Coincidentally, Helmholtz also invented the notion of the Helmholtz free energy, which plays an important role in thermodynamics and statistical mechanics, even though he himself never made a connection between the two concepts in his lifetime.

This connection was first made by Dayan, Hinton, Neal, and Zemel in their computational model of perceptual processing as a statistical inference engine known as the Helmholtz machine (Dayan et al., 1995). In this neural network architecture, there are feed-forward and feedback pathways, where the bottom-up pathway translates inputs at the bottom layer into hidden causes at the upper layer (the recognition model), and top-down activation translates simulated hidden causes into simulated inputs (the generative model). When the log-likelihood in this setup is considered as an energy in analogy to statistical mechanics, learning becomes a relaxation process that can be described as the minimization of variational free energy. While it should be emphasized that variational free energy is not the same as Helmholtz free energy, the two free energy concepts can be formally related. Importantly, variational free energy minimization is not only a hallmark of the Helmholtz machine, but of a more general family of inference algorithms, such as the popular EM algorithm (Neal and Hinton, 1998; Beal, 2003). In fact, over the last two decades, variational Bayesian methods have become one of the foremost approximation schemes for tractable inference in the machine learning literature. Moreover, a plethora of machine learning approaches use free energy trade-offs when optimizing performance under entropy regularization in order to boost the generalization of learning models (Williams and Peng, 1991; Mnih et al., 2016).

Meanwhile, free energy concepts have also made their way into the behavioral sciences. In the economic literature, for example, trade-offs between utility and entropic uncertainty measures that take the form of free energies have been proposed to describe decision-makers with stochastic choice behavior due to limited resources (McKelvey and Palfrey, 1995; Sims, 2003; Mattsson and Weibull, 2002; McFadden, 2005; Wolpert, 2006) or robust decision-makers with limited precision in their models (Maccheroni et al., 2006; Hansen and Sargent, 2008). The free energy trade-off between entropy and reward can also be found in information-theoretic models of biological perception-action systems (Still, 2009; Tishby and Polani, 2011; Ortega and Braun, 2013), some of which have been subjected to experimental testing (Ortega and Stocker, 2016; Sims, 2016; Schach et al., 2018; Lindig-León et al., 2019; Bhui and Gershman, 2018; Ho et al., 2020). Finally, in the neuroscience literature the notion of free energy has risen to recent fame as the central puzzle piece in the Free Energy Principle (Friston, 2010), which has been used to explain a cornucopia of experimental findings including neural prediction error signals, the hierarchical organization of cortical responses, synaptic plasticity rules, and neural effects of biased competition and attention—see references in (Parr and Friston, 2019). Over time, the Free Energy Principle has grown out of an application of the free energy concept used in the Helmholtz machine to interpret cortical responses in the context of predictive coding (Friston, 2005), and has gradually developed into a general principle for intelligent agency, also known as Active Inference (Friston et al., 2013, 2015b; Parr and Friston, 2019). Consequences and implications of the Free Energy Principle are discussed in neighbouring fields like psychiatry (Schwartenbeck and Friston, 2016; Linson et al., 2020) and the philosophy of mind (Clark, 2013; Colombo and Wright, 2018).

Given that the notion of free energy has become such a pervasive concept that cuts through multiple disciplines, the main rationale for this discussion paper is to trace back and clarify different notions of free energy, to see how they are related and what role they play in explaining behavior and neural activity. As the notion of free energy mainly appears in the context of statistical models of cognition, probabilistic models serve as the common framework for the following discussion. Section 2 therefore starts with preliminary remarks on probabilistic modelling. Section 3 introduces two notions of free energy that are subsequently expounded in Section 4 and Section 5, where they are applied to models of intelligent agency. Section 6 concludes the paper.

Figure 1: Graphical representation of an exemplary probabilistic model. The arrows (edges) indicate causal relationships between the random variables (nodes). The full joint distribution $p(s_0, y_0, a, s_1, y_1)$ over all random variables is sometimes also referred to as a generative model, because it contains the complete knowledge about the random variables and their dependencies and therefore allows one to generate simulated data. Such a model could, for example, be used by a farmer to infer the soil quality based on the crop yields through Bayesian inference, which allows one to determine a priori unknown distributions such as $p(s_0 \mid y_0)$ from the generative model via marginalization and conditionalization.

2 Probabilistic models and perception-action systems

Systems that show stochastic behavior, for example due to randomly behaving components or because the observer ignores certain degrees of freedom, are modelled using probability distributions. This way, any behavioral, environmental, and hidden variables can be related by their statistics, and dynamical changes can be modelled by changes in their distributions.

Consider, for example, the simple probabilistic model illustrated in Fig 1, consisting of the variables past and future soil quality $S_0$ and $S_1$, past and future crop yields $Y_0$ and $Y_1$, and fertilization $A$. The graphical model shown in the figure corresponds to the joint probability given by the factorization

$$p(s_0, y_0, a, s_1, y_1) \;=\; p(s_0)\, p(y_0 \mid s_0)\, p(a \mid y_0)\, p(s_1 \mid s_0, a)\, p(y_1 \mid s_1) \qquad (1)$$

where $p(s_0)$ is the base probability of the past soil quality $s_0$, $p(y_0 \mid s_0)$ is the probability of crop yields $y_0$ depending on the past soil quality $s_0$, and so forth. Given the joint distribution we can also ask questions about each of the variables. For example, we could ask about the probability distribution of soil quality if we are told that the crop yields are equal to a value $y_0$. We can obtain the answer from the probabilistic model by doing Bayesian inference, yielding the Bayes’ posterior

$$p(s_0 \mid y_0) \;=\; \frac{p(s_0, y_0)}{p(y_0)} \;=\; \frac{p(s_0)\, p(y_0 \mid s_0)}{\sum_{s_0'} p(s_0')\, p(y_0 \mid s_0')} \qquad (2)$$

where the dependencies on $a$, $s_1$, and $y_1$ have been summed out to calculate the marginal $p(s_0, y_0)$ from the full joint $p(s_0, y_0, a, s_1, y_1)$. In general, Bayesian inference in a probabilistic model means to determine the probability of some queried unobserved variables given the knowledge of some observed variables. This can be viewed as transforming the prior probabilistic model to a posterior model, where the observed variables have probability one and unobserved variables have probabilities given by the corresponding Bayes’ posteriors.

In principle, Bayesian inference requires only two different kinds of operations, namely marginalization, i.e. summing out unobserved variables that have not been queried, such as $a$, $s_1$, and $y_1$ above, and conditionalization, i.e. renormalizing the joint distribution over observed and queried variables—which may itself be the result of a previous marginalization, such as $p(s_0, y_0)$ above—to obtain the required conditional distribution over the queried variables. In practice, however, inference is a hard computational problem, and many efficient inference methods have been developed that may provide approximate solutions to the exact Bayes’ posteriors, including belief propagation (Pearl, 1988), expectation propagation (Minka, 2001), variational Bayesian inference (Hinton and van Camp, 1993), and Monte Carlo algorithms (MacKay, 2002). Also note that inference is trivial if the sought-after conditional distribution of the queried variable is already given by one of the conditional distributions that jointly specify the probabilistic model.
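To make the two operations concrete, here is a minimal sketch in Python of exact inference in the farmer model; all probability tables are illustrative (made up for this sketch), and the downstream variables $a$, $s_1$, $y_1$ are omitted since they would simply marginalize to one:

```python
import numpy as np

# A minimal sketch of Bayesian inference by marginalization and
# conditionalization in the farmer model of Fig 1, with binary variables
# (0 = "low", 1 = "high"). All probability tables are illustrative.

p_s0 = np.array([0.4, 0.6])                      # p(s0): prior soil quality
p_y0_s0 = np.array([[0.8, 0.2],                  # p(y0|s0), rows indexed by s0
                    [0.3, 0.7]])

# Chain rule: p(s0, y0) = p(s0) p(y0|s0). In the full model (1), the
# downstream variables a, s1, y1 would be summed out first.
joint = p_s0[:, None] * p_y0_s0                  # shape (s0, y0)

y0 = 1                                           # observe high crop yields

marginal_y0 = joint[:, y0].sum()                 # marginalization: p(y0=1)
posterior_s0 = joint[:, y0] / marginal_y0        # conditionalization: p(s0|y0=1)

print("p(y0=1)    =", marginal_y0)
print("p(s0|y0=1) =", posterior_s0)
```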


Figure 2: Two types of probabilistic models must be distinguished. An external observer model describes the input-output behavior of an agent, as viewed by an outside observer. In contrast, an agent model is used by the agent itself or by a designer of the agent to plan actions based on beliefs about current states, observations, and predicted consequences. Actions can either be parameters of such an internal model (influence diagrams), or be part of the model as random variables themselves, in which case at least part of the prior model must be variable so that it can be changed to a posterior model under which desirable outcomes are more likely.

Probabilistic models can not only be used as observer models, but also as internal models that are employed by the agent itself or by a designer of an agent in order to determine a desired course of action (cf. Fig 2). In this case, actions could either be thought of as parameters of the probabilistic model that influence the future (influence diagrams) or as random variables that are part of the probabilistic model themselves (prior models). Both types of models allow making predictions about future consequences in order to specify actions or distributions over actions that lead to desirable outcomes, for example actions that produce high rewards in the future. In mechanistic or process model interpretations, some of these specification methods are themselves meant to represent what the agent is actually doing while reasoning, whereas as if interpretations simply use these methods as tools to arrive at distributions that describe the agent’s behavior. Free energy is one of the concepts that appears in some of these methods.

3 The two notions of free energy

Vaguely speaking, free energy can refer to any quantity that is of the form

$$\text{free energy} \;=\; \text{energy} \;-\; \text{entropy} \qquad (3)$$

where energy is an expected value of some quantity of interest, and entropy refers to a quantity measuring disorder, uncertainty, or complexity that must be specified in the given context. From the relation (3), it is not surprising that free energy sometimes appears enshrouded by mystery, as it relies on an understanding of entropy, and “nobody really knows what entropy is anyway”, as John von Neumann famously quipped (Feynman et al., 1996).

Historically, the concept of free energy goes back to the roots of thermodynamics, where it was introduced to measure the maximum amount of work that can be extracted from a thermodynamic system at a constant temperature and volume. If, for example, all the molecules in a box move to the left, we can use this kinetic energy to drive a turbine. If, however, the same kinetic energy is distributed as random molecular motion, it cannot be fully transformed into work. Therefore, only part of the total energy is usable, because the exact positions and momenta of the molecules, the so-called microstates, are unknown. In this case, the maximum usable part of the energy is the Helmholtz free energy, defined as

$$F_H \;=\; U \;-\; T\, S \qquad (4)$$

where $U$ is the total energy, $T$ the temperature, and $S$ the thermodynamic entropy. In general, the transformation between two macrostates with free energies $F_1$ and $F_2$ allows the extraction of work $W \leq F_1 - F_2$.

3.1 Non-equilibrium free energy and maximum entropy

3.1.1 The Boltzmann distribution

In statistical mechanics, which studies macroscopic systems in terms of the behavior of their elementary constituents, thermodynamic quantities (macrostates of a system) are identified with expected values of the corresponding quantities defined on microstates. This means that the total energy $U$ is identified with the expected value $\mathbb{E}_p[E] = \sum_i p(i)\, E_i$ of the energy levels $E_i$ of the system with respect to a probability distribution $p$. Based on the central assumption that states with equal energy are occupied with equal probability and that thermodynamic entropy grows logarithmically with the number of possible microstates (Boltzmann’s equation), one can determine the probability of a microstate $i$ with energy $E_i$ as

$$p_B(i) \;=\; \frac{1}{Z}\, e^{-\frac{1}{k_B T} E_i} \qquad (5)$$

and the Helmholtz free energy (4) as $F_H = -k_B T \log Z$, where $T$ is the temperature of a heat bath that is in equilibrium with the thermodynamic system and $Z = \sum_i e^{-E_i/(k_B T)}$ is the so-called partition sum (Callen, 1985). Consequently, one can identify the thermodynamic entropy with the Gibbs or Shannon entropy, $S = k_B\, H(p_B)$, where $H(p) = -\sum_i p(i)\, \log p(i)$.

By allowing distributions $p$ other than the Boltzmann distribution $p_B$, one can define a non-equilibrium free energy

$$F(p) \;:=\; \mathbb{E}_p[E] \;-\; k_B T\, H(p) \qquad (6)$$

that equals the Helmholtz free energy when evaluated at the Boltzmann distribution, $F(p_B) = F_H$. Moreover, it turns out that (6) actually takes its minimum at $p_B$, i.e. $F(p) \geq F(p_B)$ for all $p$. In general, minimizing the non-equilibrium free energy (6) with respect to $p$ can be understood more abstractly without any reference to thermodynamics or physics, because it is equivalent to the constrained optimization problem of maximizing the entropy $H(p)$ under a constraint on the expectation $\mathbb{E}_p[E]$, known as the principle of maximum entropy (Jaynes, 1957).
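The minimizing property of the Boltzmann distribution is easy to verify numerically; the following sketch (with made-up energy levels and $k_B = 1$) checks it against random distributions on the simplex:

```python
import numpy as np

# A small numerical check (not from the paper) that the Boltzmann
# distribution minimizes the non-equilibrium free energy
# F(p) = <E>_p - T H(p), with k_B = 1 and illustrative energy levels.

rng = np.random.default_rng(0)
E = np.array([1.0, 2.0, 4.0])    # assumed energy levels
T = 0.5                          # temperature

def free_energy(p):
    H = -np.sum(p * np.log(p))               # Gibbs/Shannon entropy
    return np.sum(p * E) - T * H

# Closed-form minimizer: the Boltzmann distribution p_B(i) ∝ exp(-E_i / T).
p_B = np.exp(-E / T)
Z = p_B.sum()
p_B /= Z

# F(p) >= F(p_B) for random distributions p on the probability simplex.
samples = rng.dirichlet(np.ones(3), size=1000)
assert all(free_energy(p) >= free_energy(p_B) - 1e-12 for p in samples)

# At the minimum, F equals the Helmholtz free energy -T log Z.
print(free_energy(p_B), -T * np.log(Z))      # the two values agree
```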

3.1.2 The trade-off between energy and uncertainty

Figure 3: Minimizing the non-equilibrium free energy (6) requires trading off the competing terms of energy $\mathbb{E}_p[E]$ and entropy $H(p)$, here shown exemplarily for the case of three elements. Assuming there exists a unique minimal entry of the energy vector, say $E_1$, then minimizing only the energy $\mathbb{E}_p[E]$ over all probability distributions results in the (Dirac delta) distribution that assigns zero probability to all $i \neq 1$ and probability one to $i = 1$, and therefore has zero entropy. In contrast, minimizing only the term $-H(p)$ is equivalent to maximizing the entropy $H(p)$ and therefore would result in the uniform distribution that gives equal probability to all elements. The distribution $p_B$ that results from free energy minimization interpolates between these two extreme solutions of minimal energy ($T \to 0$) and maximum entropy ($T \to \infty$).

Figure 4: The normalization of a function $\varphi$ to obtain a probability distribution is equivalent to fitting trial distributions $q$ to the shape of $\varphi$ by minimizing free energy. In two dimensions, the normalization of a point corresponds to a (non-orthogonal) projection onto the plane of probability vectors (A). For continuous domains, where probability distributions are represented by densities, normalization corresponds to a rescaling of $\varphi$ such that the area below the graph equals $1$ (B). Instead, when minimizing variational free energy (red colour), the trial distributions $q$ are varied until they fit the shape of the unnormalized function $\varphi$ (perfectly at $q^* = \varphi / \sum_x \varphi(x)$).

An important feature of the minimization of the free energy (6) lies in the trade-off between the two competing terms, expected energy and entropy (see Fig 3). It is this trade-off between maximal uncertainty (uniform distribution) and minimal energy (delta distribution) that is at the core of free energy minimization. Here, the temperature $T$ plays the role of a trade-off parameter that controls how these two counteracting forces are balanced. In optimization theory, such a trade-off parameter is usually introduced to transform an optimization problem that has to satisfy an equality or inequality constraint into an unconstrained optimization problem of the form (6). In this case, the trade-off parameter plays the role of a so-called Lagrange multiplier that is determined by the constraint.

If the two counteracting quantities in such a trade-off are entropy and the expected value of some quantity, then one obtains the principle of maximum entropy, first formulated rigorously by Jaynes (Jaynes, 1957) as a method of determining an unbiased subjective probability distribution on the basis of partial information given by a constraint. It goes back to the principle of insufficient reason (Bernoulli, 1713; de Laplace, 1812; Poincaré, 1912), which states that two events should be assigned the same probability if there is no reason to think otherwise. The principle of maximum entropy (and its close relative, the principle of minimum relative entropy) has very broad application and appears in virtually all branches of science. It has been hailed as a principled method to determine prior distributions and to incorporate novel information into existing probabilistic knowledge. In fact, Bayesian inference can be cast in terms of relative entropy minimization with constraints given by the available information (Williams, 1980). Applications of this idea can also be found in the machine learning literature, where subtracting (or adding) an entropy term from an expected value of a function that must be optimized is known as entropy regularization and plays an important role in modern reinforcement learning algorithms (Williams and Peng, 1991; Mnih et al., 2016), both to encourage exploration (Haarnoja et al., 2017) and to penalize overly deterministic policies resulting in biased reward estimates (Fox et al., 2016).
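As a small illustration of entropy regularization (a sketch with made-up action values, not taken from any of the cited papers), note that maximizing $\mathbb{E}_\pi[Q] + \alpha H(\pi)$ over policies $\pi$ has the closed-form softmax solution $\pi(a) \propto e^{Q(a)/\alpha}$:

```python
import numpy as np

# A hedged sketch of entropy regularization in action selection: the maximizer
# of E_pi[Q] + alpha * H(pi) is the softmax policy pi(a) ∝ exp(Q(a)/alpha).
# The Q-values and temperatures alpha below are illustrative.

Q = np.array([1.0, 0.9, -0.5])                 # assumed action values

def entropy_regularized_policy(Q, alpha):
    logits = Q / alpha
    logits -= logits.max()                     # subtract max for stability
    p = np.exp(logits)
    return p / p.sum()

for alpha in (10.0, 1.0, 0.1, 0.01):
    print(alpha, np.round(entropy_regularized_policy(Q, alpha), 3))
# Large alpha -> near-uniform policy (exploration);
# small alpha -> near-greedy policy (pure utility maximization).
```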

3.2 Variational free energy

3.2.1 An extension of relative entropy

There is another, distinct appearance of the term “free energy” outside of physics that is a priori not motivated by a trade-off between an energy and an entropy term, but by possible efficiency gains when representing Bayes’ rule in terms of an optimization problem. This technique is mainly used in variational Bayesian inference and was originally introduced by Hinton and van Camp (Hinton and van Camp, 1993). Such variational representations not only allow approximating exact Bayes’ posteriors by simpler distributions, but also constructing efficient iterative algorithms for exact or approximate inference, such as the Expectation-Maximization (EM) algorithm (Dempster et al., 1977; Neal and Hinton, 1998), belief propagation (Pearl, 1988; Yedidia et al., 2001), and other message-passing algorithms (Minka, 2001; Wainwright et al., 2005; Winn and Bishop, 2005; Minka, 2005).


Figure 5: In variational Bayesian inference, the operation of renormalizing the probabilistic model evaluated at an observation (Bayes’ rule) is replaced by an optimization problem (A). In practice, this variational representation is often exploited to simplify a given inference problem (B), either by reducing the search space of distributions, for example through a non-exhaustive parametrization (e.g. a Gaussian), resulting in approximate inference, or by splitting up the optimization into multiple partial optimization steps that are potentially easier to solve than the original problem but might still converge to the exact solution. These two simplifications can also be combined, for example in the case of mean-field assumptions, where the space of distributions is reduced and an efficient iterative inference algorithm is obtained at the same time.

All of these examples can be seen as applications of the same basic concept allowing one to cast Bayesian inference in terms of the minimization of the variational free energy $F(q, \varphi)$, which generally is a function of two quantities, a probability distribution $q$ and a non-negative function $\varphi$, given by

$$F(q, \varphi) \;:=\; \sum_x q(x)\, \log \frac{q(x)}{\varphi(x)} \;=\; \mathbb{E}_q\Big[\log \frac{q}{\varphi}\Big] \qquad (7)$$

where $\mathbb{E}_q[f] = \sum_x q(x)\, f(x)$ denotes the expected value of $f$ with respect to $q$. In the application to Bayesian inference, the reference function $\varphi$ is constructed by evaluating a joint distribution given by the probabilistic model, say $p(x, y)$, at known quantities, say $y = y_0$, resulting in $\varphi(x) = p(x, y_0)$, which is not a probability distribution in $x$ anymore. Its rescaling (normalization) in order to obtain the probability distribution $p(x \mid y_0)$ is exactly what Bayesian inference is about, and what the variational free energy (7) is used for. It is a free energy in the sense of (3) since, by the additivity of the logarithm under multiplication ($\log(ab) = \log a + \log b$),

$$F(q, \varphi) \;=\; \underbrace{\mathbb{E}_q[-\log \varphi]}_{\text{energy}} \;-\; \underbrace{H(q)}_{\text{entropy}} \qquad (8)$$

with energy term $\mathbb{E}_q[-\log \varphi]$ and entropy term given by the Shannon entropy $H(q) = -\sum_x q(x)\, \log q(x)$. It is variational because its purpose is to be minimized over the so-called trial distributions $q$, with the solution

$$q^* \;:=\; \operatorname*{arg\,min}_q F(q, \varphi) \;=\; \frac{\varphi}{\sum_x \varphi(x)} \qquad (9)$$

Here, for simplicity, all random variables are discrete, but most expressions can directly be translated to the continuous case by replacing sums by the corresponding integrals. When choosing $\varphi = e^{-\frac{1}{k_B T} E}$, Equation (9) becomes the Boltzmann distribution (5), and accordingly the variational free energy (8) becomes the non-equilibrium free energy (6) (up to the constant factor $k_B T$). The variational property (9) allows one to normalize $\varphi$ to obtain the probability distribution $q^*$, which has the same shape as $\varphi$ but sums to $1$, without having to carry out the rescaling of $\varphi$ explicitly. Instead, by minimizing variational free energy, one fits auxiliary trial distributions $q$ to the shape of $\varphi$ (cf. Fig 4). If this optimization process has no constraints, then the trial distributions are fitted to the shape of $\varphi$ until $q = q^*$ is achieved. In the case of constraints, for instance if the trial distributions are parametrized by a non-exhaustive parametrization, then the optimized trial distributions approximate $q^*$ as closely as possible within this parametrization. Moreover, the minimal value of the variational free energy (7) is

$$F(q^*, \varphi) \;=\; -\log \sum_x \varphi(x) \qquad (10)$$

In particular, this implies that $-F(q, \varphi) \leq \log \sum_x \varphi(x)$ for all $q$, so that the negative free energy of arbitrary trial distributions always provides a lower bound to the logarithm of the unknown normalization constant $\sum_x \varphi(x)$. In Bayesian inference this unknown constant is the normalization constant in Bayes’ rule, called the model evidence (cf. Section 3.2.2 below). Due to this bound, the negative variational free energy is also called the evidence lower bound (ELBO). The proof of (9) and (10) follows directly from Jensen’s inequality and relies only on the concavity of the logarithm.
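The properties (9) and (10) can be checked numerically in a few lines; the following sketch uses an arbitrary, made-up reference function $\varphi$ on three elements:

```python
import numpy as np

# A minimal numerical illustration (not from the paper) of Eqs. (7)-(10):
# for an unnormalized reference phi, F(q, phi) = sum_x q(x) log(q(x)/phi(x))
# is minimized by q* = phi / Z with minimum -log Z, so -F lower-bounds log Z.

rng = np.random.default_rng(1)
phi = np.array([0.2, 0.05, 0.5])     # non-negative, unnormalized reference

def F(q, phi):
    return np.sum(q * np.log(q / phi))

Z = phi.sum()
q_star = phi / Z                     # Eq. (9): normalized reference

print("F(q*) =", F(q_star, phi), " -log Z =", -np.log(Z))   # equal, Eq. (10)

# For arbitrary trial distributions, -F(q, phi) <= log Z (the ELBO property).
for q in rng.dirichlet(np.ones(3), size=5):
    assert -F(q, phi) <= np.log(Z) + 1e-12
```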

Variational free energy (7) can be regarded as an extension of relative entropy, with the reference distribution being replaced by a non-normalized function: in the case when $\varphi$ is already normalized, that is if $\sum_x \varphi(x) = 1$, the free energy (7) coincides with relative entropy, the so-called Kullback-Leibler (KL) divergence from information theory. In particular, while relative entropy is a measure of the dissimilarity of two probability distributions, where the minimum is achieved if both distributions are equal, variational free energy is a measure of the dissimilarity between a probability distribution $q$ and a (generally non-normalized) function $\varphi$, where the minimum with respect to $q$ is achieved at $q^* = \varphi / \sum_x \varphi(x)$. Accordingly, we can think of the variational free energy as a specific error measure between probability distributions and reference functions. In principle, one could design many other error measures that have the same minimum. This means that a statement in a probabilistic setting that a distribution minimizes variational free energy is analogous to a statement in a non-probabilistic setting that some parameter $\theta$ minimizes an error measure like the squared error $(f_\theta - y)^2$ between a parameterized prediction $f_\theta$ and a given reference value $y$.

3.2.2 Variational inference

As we have seen in Section 2, Bayesian inference consists in the calculation of a conditional probability distribution over unknown variables given the values of known variables. In the simplest case of two variables, say $X$ and $Y$, and a probabilistic model of the form $p(x, y)$, Bayesian inference applies if $Y = y_0$ is observed and $X$ is queried. Analogous to (2), the exact Bayes’ posterior $p(x \mid y_0)$ is defined by the renormalization of $\varphi(x) = p(x, y_0)$ in order to obtain a distribution over $X$ that respects the new information $y_0$.

In variational Bayesian inference (cf. Fig 5A), however, this Bayes’ posterior is not calculated directly by normalizing the joint distribution with respect to $x$, but indirectly by approximating it by a distribution $q$ that is adjusted through the minimization of an error measure that quantifies the deviation from the exact Bayes’ posterior. As we have seen in the previous section, the variational free energy is one possible candidate for such an error measure, since by (9),

$$p(x \mid y_0) \;=\; \operatorname*{arg\,min}_q F\big(q, p(\cdot\,, y_0)\big) \qquad (11)$$

As mentioned at the beginning of this section, representing Bayes’ rule as an optimization problem over auxiliary distributions has two main applications that both can simplify the inference process (cf. Fig 5B). First, it allows approximating exact Bayes’ posteriors by restricting the optimization space, for example using a non-exhaustive parametrization such as Gaussian distributions. Second, it enables iterative inference algorithms consisting of multiple simpler optimization steps, for example by optimizing with respect to each term in a factorized representation of $q$ separately. A popular choice is the mean-field approximation, which combines both of these simplifications, as it assumes independence between hidden states, effectively reducing the search space from joint distributions to factorized ones, and moreover allows optimizing with respect to each factor alternatingly.
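The following sketch illustrates this alternating scheme for a mean-field trial distribution $q(x, z) = q(x)\,q(z)$ fitted to a randomly generated reference table $\varphi(x, z)$; all values are made up, and the updates are the standard coordinate-wise minimizers of $F(q, \varphi)$:

```python
import numpy as np

# A sketch of mean-field variational inference on a random reference table.
# The trial distribution q(x, z) = q(x) q(z) is fitted to phi(x, z) by
# alternating the standard updates
#   log q(x) <- E_{q(z)}[log phi(x, z)] + const,
# and symmetrically for q(z), each of which minimizes F(q, phi) in turn.

rng = np.random.default_rng(2)
phi = rng.random((4, 5))                # unnormalized reference phi(x, z)
log_phi = np.log(phi)

def normalize(v):
    return v / v.sum()

q_x, q_z = normalize(np.ones(4)), normalize(np.ones(5))
for _ in range(100):
    q_x = normalize(np.exp(log_phi @ q_z))        # expectation over z
    q_z = normalize(np.exp(q_x @ log_phi))        # expectation over x

# Compare with the exact marginals of the normalized reference; mean-field
# is an approximation, so the marginals need not match exactly.
p = phi / phi.sum()
print(np.round(q_x, 3), np.round(p.sum(axis=1), 3))
print(np.round(q_z, 3), np.round(p.sum(axis=0), 3))
```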

4 Free energy from uncertainty reduction

We are now turning to the question of how the two notions of free energy introduced in the previous section appear in recent theories of intelligent agency, as it is a priori not entirely obvious how the practical uses of free energy as an uncertainty-energy trade-off on the one hand (this section) and as a variational error measure on the other hand (next section) relate to intelligent behavior.

4.1 The basic idea

The concept of free energy as a trade-off between energy and uncertainty can be used in models of perception-action systems, where entropy quantifies the information-processing complexity required for decision-making (e.g. planning a path for fleeing a predator) and energy corresponds to performance (e.g. distinguishing better and worse flight directions). The notion of decision in this context is very broad and can be applied to any internal variable in the perception-action pipeline (Kahneman, 2002) that is not directly determined by the environment. In particular, it also subsumes perception itself, where the decision variables are given by the hidden causes that are being inferred from observations.

In rational choice theory (von Neumann and Morgenstern, 1944), a decision-maker chooses its decisions $a$ from a set of options $\mathcal{A}$ such that a utility function $U$ defined on $\mathcal{A}$ is maximized,

$$a^* \;=\; \operatorname*{arg\,max}_{a \in \mathcal{A}} U(a) \qquad (12)$$

The utility values could either be objective, for example a monetary gain, or subjective, in which case they represent the decision-maker’s preferences. In general, the utility does not have to be defined directly on $\mathcal{A}$, but could be derived from utility values that are attached to certain states, for example to the configurations of the game board in a board game. In the case of perception, utility values are usually given by (log-)likelihood functions, in which case utility maximization without constraints corresponds to greedy inference such as maximum likelihood estimation.

Figure 6: Decision-making can be considered as a search process in the space of options $\mathcal{A}$, where options are progressively ruled out. Deliberation costs are defined to be monotone functions under such uncertainty reduction (A). The resulting trade-off between utility and costs yields an efficiency curve that separates non-optimal from non-admissible behavior, while the points on the curve correspond to bounded-optimal agents that optimally trade off utility against uncertainty, analogous to the rate-distortion curve in information theory (B).

Whereas ideal rational decision-makers are assumed to perfectly optimize a given utility function $U$, real behavior is often stochastic, meaning that multiple exposures to the same problem lead to different decisions. Such non-deterministic behavior could be a consequence of model uncertainty, as in Bayesian inference or various stochastic gambling schemes, or a consequence of satisficing (Simon, 1955), where decision-makers do not choose the single best option, but simply one option that is good enough. Abstractly, this means that the choice of a single decision $a^*$ is replaced by the choice of a distribution $p$ over decisions. More generally, also considering prior information that the decision-maker might have from previous experience, the process of deliberation during decision-making might be expressed as the transformation of a prior distribution $p_0$ to a posterior distribution $p$.

When assuming that deliberation has a cost $C(p_0 \to p)$, then arriving at narrow posterior distributions should intuitively be more costly than choosing distributions that contain more uncertainty (cf. Fig 6A). In other words, deliberation costs must increase with the amount of uncertainty that is reduced by the transformation from $p_0$ to $p$. Uncertainty reduction can be understood as making the probabilities of options less equal to each other, rigorously expressed by the mathematical concept of majorization (Marshall et al., 2011). This notion of uncertainty can also be generalized to include prior information, so that the degree of uncertainty reduction corresponds to larger or smaller deviations from the prior (Gottwald and Braun, 2019a).

Maximizing expected utility with respect to $p$ under restrictions on processing costs is a constrained optimization problem that can be interpreted as a particular model of bounded rationality (Simon, 1955), explaining non-rational behavior of decision-makers that may be unable to select the single best option due to their limited information-processing capability. Similarly to the free energy trade-off between energy and entropy (cf. Fig 3), this results in a trade-off between utility $\mathbb{E}_p[U]$ and processing costs $C(p_0 \to p)$,

$$\max_p \Big( \mathbb{E}_p[U] \;-\; \tfrac{1}{\beta}\, C(p_0 \to p) \Big) \qquad (13)$$

Here, the trade-off parameter $\beta$ is analogous to the inverse temperature in statistical mechanics (cf. Equation (6)) and parametrizes the optimal trade-offs between utility and cost, which define an efficiency frontier separating the space of perception-action systems into bounded-optimal, non-optimal, and non-admissible systems (cf. Fig 6).

When assuming that the total transformation cost is the same independent of whether a decision problem is solved in one step or in multiple sub-steps (additivity under coarse-graining), the trade-off in (13) takes the general form (3) of a free energy in the sense of energy (utility) minus entropy (cost), because the cost function is then uniquely given by the relative entropy

$$C(p_0 \to p) \;=\; D_{KL}(p \,\|\, p_0) \;=\; \sum_a p(a)\, \log \frac{p(a)}{p_0(a)} \qquad (14)$$

Note that the additivity of (14) also implies a coarse-graining property of the free energy (13) in the case when the decision is split into multiple steps, such that the utility of preceding decisions is effectively given by the free energy of following decisions. Therefore, in this case, free energy can be seen as a certainty-equivalent value of a stochastic choice that, besides expected utility, also takes the information-processing costs of the subordinate decision problems into account. The special case (14) has been studied extensively in multiple contexts, including quantal response equilibria in the game-theoretic literature (McKelvey and Palfrey, 1995; Wolpert, 2006), rational inattention and costly contemplation (Sims, 2003; Ergin and Sarver, 2010), bounded rationality with KL costs (Mattsson and Weibull, 2002; Ortega and Braun, 2013), KL control (Todorov, 2009; Kappen et al., 2012), entropy regularization (Williams and Peng, 1991; Mnih et al., 2016), robustness (Maccheroni et al., 2006; Hansen and Sargent, 2008), and the analysis of information flow in perception-action systems (Tishby and Polani, 2011; Still, 2009).
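For the KL cost (14), the optimizer of (13) has the closed form $p^*(a) \propto p_0(a)\, e^{\beta U(a)}$; the following sketch (with made-up utilities and a uniform prior) sweeps $\beta$ to trace out the efficiency frontier of Fig 6B:

```python
import numpy as np

# A sketch of the utility-information trade-off (13)-(14): for each beta, the
# optimizer of E_p[U] - (1/beta) KL(p || p0) is p*(a) ∝ p0(a) exp(beta U(a)).
# Sweeping beta traces the efficiency frontier. U and p0 are illustrative.

U = np.array([1.0, 0.8, 0.1, 0.0])          # assumed utilities
p0 = np.ones(4) / 4                          # uniform prior policy

def boltzmann(beta):
    w = p0 * np.exp(beta * U)
    return w / w.sum()

for beta in (0.0, 1.0, 5.0, 50.0):
    p = boltzmann(beta)
    EU = p @ U                               # expected utility
    KL = np.sum(p * np.log(p / p0))          # information cost (14)
    print(f"beta={beta:5.1f}  E[U]={EU:.3f}  KL={KL:.3f}")
# beta -> 0 reproduces the prior (zero cost); beta -> inf approaches the
# rational-choice maximizer of (12) at maximal information cost.
```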


Figure 7: Overview of how to apply utility maximization with information-processing costs to the example from Section 2.

4.2 A Simple Example

Consider the probabilistic model shown in Fig 1 with the joint distribution $p(s_0, y_0, a, s_1, y_1)$ that is specified by the factors in the decomposition (1). Here, $S_0$ and $Y_0$ denote the current environmental state and the corresponding observation, and $A$ denotes the action that must be determined in order to drive the system into a new state $S_1$ with observation $Y_1$. The decision-making problem is specified by assuming that we are given a utility function $U$ over future observations $y_1$ which the decision-maker seeks to maximize by selecting an action $a$, while only having access to the current observation $y_0$. This means that the decision-maker has control over the action distribution, replacing the prior $p_0(a \mid y_0)$ given by the corresponding factor in (1) with a new distribution $p(a \mid y_0)$ that defines the posterior model (cf. Fig 7). Further assuming that the decision-maker is subject to an information-processing constraint $D_{KL}(p \,\|\, p_0) \leq C_0$, for some non-negative bound $C_0$, results in the unconstrained optimization problem with free energy given by (13), where the trade-off parameter $\beta$ is tuned to comply with the bound $C_0$.

Since the action distribution $p(a \mid y_0)$ is the only distribution in the posterior model that is varied, the total free energy simplifies to $F\big(p(\cdot \mid y_0)\big) = \mathbb{E}_{p(a \mid y_0)}\big[\mathbf{U}(a, y_0)\big] - \frac{1}{\beta}\, D_{KL}\big(p(\cdot \mid y_0)\,\|\,p_0(\cdot \mid y_0)\big)$ with the effective utility

$$\mathbf{U}(a, y_0) \;:=\; \mathbb{E}_{p(y_1 \mid y_0, a)}\big[U(y_1)\big], \qquad p(y_1 \mid y_0, a) \;=\; \sum_{s_0, s_1} p(y_1 \mid s_1)\, p(s_1 \mid s_0, a)\, p(s_0 \mid y_0).$$

In particular, the optimal action distribution for a given observation $y_0$ is a Boltzmann distribution (5) with “energy” $-\mathbf{U}(a, y_0)$ and prior $p_0(a \mid y_0)$,

$$p^*(a \mid y_0) \;=\; \frac{1}{Z(y_0)}\, p_0(a \mid y_0)\, e^{\beta\, \mathbf{U}(a, y_0)}.$$

Note that, in order to evaluate the utility $\mathbf{U}(a, y_0)$, it is required to determine the Bayes’ posterior $p(s_0 \mid y_0)$. This shows how, in a utility-based approach, the need to perform Bayesian inference results directly from the assumption about which variables are observed and which are not.
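As a concrete illustration, here is a hedged end-to-end sketch of this recipe with made-up probability tables (binary variables throughout); none of the numbers come from the paper:

```python
import numpy as np

# A sketch of the recipe in Fig 7 for the farmer example (all probability
# tables and utilities are illustrative; 0 = "low", 1 = "high"). The agent
# observes y0, infers s0 by Bayes' rule, computes the effective utility
# U(a, y0), and soft-maximizes it against the prior p0(a|y0).

p_s0 = np.array([0.4, 0.6])                          # p(s0)
p_y0_s0 = np.array([[0.8, 0.2], [0.3, 0.7]])         # p(y0|s0), rows: s0
p_s1_s0a = np.array([[[0.9, 0.1], [0.4, 0.6]],       # p(s1|s0,a),
                     [[0.5, 0.5], [0.1, 0.9]]])      # indexed [s0][a][s1]
p_y1_s1 = np.array([[0.8, 0.2], [0.2, 0.8]])         # p(y1|s1), rows: s1
U_y1 = np.array([0.0, 1.0])                          # utility of y1
p0_a = np.array([0.5, 0.5])                          # prior p0(a|y0)
beta, y0 = 5.0, 1                                    # trade-off, observation

# Step 1: Bayes' posterior p(s0|y0), as in Eq. (2).
post_s0 = p_s0 * p_y0_s0[:, y0]
post_s0 /= post_s0.sum()

# Step 2: predictive p(y1|y0,a) and effective utility U(a,y0).
p_y1_a = np.einsum("i,iaj,jy->ay", post_s0, p_s1_s0a, p_y1_s1)
U_eff = p_y1_a @ U_y1

# Step 3: Boltzmann action distribution p*(a|y0) ∝ p0(a|y0) exp(beta U(a,y0)).
p_a = p0_a * np.exp(beta * U_eff)
p_a /= p_a.sum()

print("p(s0|y0) =", np.round(post_s0, 3))
print("U(a,y0)  =", np.round(U_eff, 3), "  p*(a|y0) =", np.round(p_a, 3))
```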

4.3 Critical points

The main idea of free energy in the context of information-processing with limited resources is that any computation can be thought of abstractly as a transformation from a distribution $p_0$ of prior knowledge to a posterior distribution $p$ that encapsulates an advanced state of knowledge resulting from deliberation. The progress that is made through such a transformation is quantitatively captured by two measures: the expected utility $\mathbb{E}_p[U]$ that quantifies the quality of $p$, and the cost $C(p_0 \to p)$ that measures the cost of uncertainty reduction from $p_0$ to $p$. Clearly, the critical point of this framework is the choice of the cost function $C$. In particular, we could ask whether there is some kind of universal cost function that is applicable to any perception-action process or whether there are only problem-specific instantiations. Of course, having a universal measure that allows applying the same concepts to extremely diverse systems is both a boon and a bane, because the practical insights it may provide for any concrete instance could be very limited. This is the root of a number of critical issues:


  1. What is the cost $C(p_0 \to p)$? An important restriction of all deliberation costs of the form $C(p_0 \to p)$ is that they only depend on the initial and final distributions and ignore the process of how to get from $p_0$ to $p$. When varying a single resource (e.g. processing time) we can use $C$ as a process-independent proxy for the resource. However, if there are multiple resources involved (e.g. processing time, memory, and power consumption), a single cost cannot tell us how these resources are weighted optimally without making further process-dependent assumptions. In general, the theory makes no suggestions whatsoever about mechanical processes that could implement resource-optimal strategies; it only serves as a baseline for comparison. Finally, simply requiring the measure to be monotonic in the uncertainty reduction does not uniquely determine the form of $C$, as there have been multiple proposals of uncertainty measures in the literature (see e.g. (Csiszár, 2008)), where relative entropy is just one possibility. However, relative entropy is distinguished from all other uncertainty measures by its additivity property, which for example allows expressing optimal probabilistic updates from $p_0$ to $p$ in terms of additions or subtractions of utilities, such as log-likelihoods for evidence accumulation in Bayesian inference.

  2. What is the utility? When systems are engineered, utilities are usually assumed to be given, such that desired behavior is specified by utility maximization. However, when we observe perception-action systems, it is often not so clear what the utility should be, or in fact, whether there even exists a utility that captures the observed behavior in terms of utility maximization. This question of the identifiability of a utility function is studied extensively in the economic sciences, where the basic idea is that systems reveal their preferences through their actual choices and that these preferences have to satisfy certain consistency axioms in order to guarantee the existence of a utility function. In practice, to guarantee unique identifiability these axioms are usually rather strong, for example ignoring the effects of history and context when choosing between different items, or ignoring the possibility that there might be multiple objectives. When not making these strong assumptions, utility becomes a rather weak concept, even weaker than probabilities, as additional assumptions like soft-maximization would be necessary to translate from utilities to choice probabilities.

  3. The problem of infinite regress. One of the main conceptual issues with the interpretation of $C$ as a deliberation cost is that the original utility optimization problem is simply replaced by another optimization problem that may even be more difficult to solve. This novel optimization problem might again require resources to be solved and could therefore be described by a higher-level deliberation cost, thus leading to an infinite regress. In fact, any decision-making model that assumes that decision-makers reason about processing resources is affected by this problem (Russell and Subramanian, 1995; Gigerenzer and Selten, 2001). A possible way out is to consider the utility-information trade-off simply as an as if description, since perception-action systems that are subject to a utility-information trade-off do not necessarily have to reason or know about their deliberation costs. It is straightforward, for example, to design processes that probabilistically optimize a given utility with no explicit notion of free energy, but for an outside observer the resulting choice distribution looks like an optimal free energy trade-off (Ortega and Braun, 2014).

In summary, the free energy trade-off between utility and information primarily serves as a normative model that provides a Pareto-optimality curve consisting of optimal decision policies. It can also serve as a guide for constructing and interpreting systems, although it is in general not a mechanistic model of behavior. In that respect, the abstract free energy trade-off shares the fate of its cousins in thermodynamics and Shannon’s coding theory (Shannon, 1948): they provide theoretical bounds on optimality but devise no mechanism for processes to achieve these bounds.

5 Variational free energy in Active Inference

5.1 The basic idea

Variational free energy is the main ingredient used in the Free Energy Principle for biological systems in the neuroscience literature (Friston, 2005, 2010; Friston et al., 2015b, 2006) and has even been considered “arguably the most ambitious theory of the brain available today” (Gershman, 2019). Since variational free energy in itself is just a mathematical construct to measure the dissimilarity between distributions and functions (see Section 3), the biological content of the Free Energy Principle must come from somewhere else. The basic biological phenomenon that the Free Energy Principle purports to explain is homeostasis, the ability to actively maintain certain relevant variables (e.g. blood sugar) within a preferred range. Usually, homeostasis is applied as an explanatory principle in physiology, whereby the actual value of a variable is compared to a target value and corrections to deviation errors are made through a feedback loop. However, homeostasis has also been proposed as an explanatory principle for complex behavior in the cybernetic literature (Wiener, 1948; Ashby, 1960; Powers, 1973; Cisek, 1999)—for example, maintaining blood sugar may entail complex feedback loops of learning to hunt, to trade, and to buy food. Crucially, being able to exploit the environment in order to attain favorable sensory states requires implicit or explicit knowledge of the environment that could either be pre-programmed (e.g. insect locomotion) or learnt (e.g. playing the piano).

The Free Energy Principle was originally suggested as a theory of cortical responses (Friston, 2005) by promoting the free energy formulation of predictive coding that was introduced by Dayan and Hinton with the Helmholtz machine (Dayan et al., 1995). It found its most recent incarnation in what is known as Active Inference, the attempt to extend variational Bayesian inference to action selection. Here, the target value of homeostasis is expressed through a probability distribution $p_{\mathrm{des}}$ under which desired sensory states have a high probability. The required knowledge about the environment is expressed through a generative model $p$ that relates observations, hidden causes, and actions. As the generative model allows making predictions about future states and observations, it enables choosing actions in such a way that the predicted consequences conform to the desired distribution. In Active Inference, this is achieved by merging the generative and the desired distributions, $p$ and $p_{\mathrm{des}}$, into a single function $\varphi$ to which trial distributions $q$ over the unknown variables are fitted by minimizing the variational free energy $F(q, \varphi)$. In the resulting homeostatic process, the trial distributions play the role of internal variables that are manipulated in order to achieve the desired sensory states that are not directly controllable. Minimizing variational free energy by the alternating variation of trial distributions $q(a)$ over actions and trial distributions $q(s)$ over hidden states,

$$\underbrace{\min_{q(a)} F(q, \varphi)}_{\text{action}} \qquad \text{and} \qquad \underbrace{\min_{q(s)} F(q, \varphi)}_{\text{perception}} \qquad (15)$$

is then equated with the processes of action and perception. Such a free energy minimization can be regarded as an approximate inference process with respect to the reference $\varphi$, similarly to variational Bayesian inference (cf. Section 3.2.2).
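Schematically, and with a randomly generated stand-in for the reference $\varphi$ (how $\varphi$ is actually built from $p$ and $p_{\mathrm{des}}$ is exactly the contested step discussed in Section 5.3), the alternation in (15) looks as follows:

```python
import numpy as np

# A schematic sketch of Eq. (15): alternating minimization of the variational
# free energy F(q, phi) over a trial distribution q(s, a) = q(s) q(a), where
# phi(s, a) is a given reference assumed to already merge the generative
# model with the desired distribution. All tables are illustrative.

rng = np.random.default_rng(3)
phi = rng.random((3, 2))                     # stand-in reference phi(s, a)
log_phi = np.log(phi)

def normalize(v):
    return v / v.sum()

q_s, q_a = normalize(np.ones(3)), normalize(np.ones(2))
for _ in range(50):
    q_s = normalize(np.exp(log_phi @ q_a))   # "perception": update q(s)
    q_a = normalize(np.exp(q_s @ log_phi))   # "action": update q(a)

print("q(s) =", np.round(q_s, 3), "  q(a) =", np.round(q_a, 3))
```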


Figure 8: Overview of the active inference recipe, applied to our running example from Figure 1.

In a nutshell, the central tenet of the Free Energy Principle states that organisms maintain homeostasis through minimization of variational free energy between a trial distribution and a reference distribution by acting and perceiving. Sometimes the even stronger statement is made that minimizing variational free energy is mandatory for homeostatic systems (Friston, 2013).

5.2 A Simple Example

Following the Active Inference recipe (cf. Fig 8), first we need to define a generative model and a desired distribution $p_{\mathrm{des}}$ for our running example from Fig 1, assuming that $Y_0$ is observed while $S_1$ and $Y_1$ lie in the future and are to be determined by the choice of the action $A$. As before, the generative model is specified by the factors in the decomposition (1), the desired distribution $p_{\mathrm{des}}$ is a given fixed probability distribution over future sensory states $y_1$, and the trial distributions $q$ are probabilities over all unknown variables, $s_0$, $s_1$, and $a$.

In most treatments of Active Inference in the literature, the trial distributions are simplified, either by a full mean-field approximation over states and actions (Friston et al., 2013, 2015b), by a partial mean-field approximation where the dependency on actions is kept but the states are treated independently of each other (Friston et al., 2016), or, most recently (Parr et al., 2019), by the so-called Bethe approximation. Note that the Bethe approximation is actually exact in tree-like models (Heskes, 2003); in particular, it is exact in all models that have been considered in Active Inference so far. In the partial mean-field assumption of (Friston et al., 2016), the trial distribution over $Y_1$ is fixed and given by the likelihood $p(y_1 \mid s_1)$, while for $S_0$, $S_1$, and $A$ the trial distributions are variable but restricted to be of the mean-field form $q(s_t \mid a)$ for $t \in \{0, 1\}$ and $q(a)$, so that

$$q(s_0, s_1, y_1, a) \;=\; p(y_1 \mid s_1)\, q(s_0 \mid a)\, q(s_1 \mid a)\, q(a) \qquad (16)$$

This assumption effectively means that, under the approximate model $q$, knowing a particular value of one of the state variables, say $S_0$, does not say anything more about the other, $S_1$, than what is already known from its distribution. Note, however, that such mean-field assumptions might be too strong a simplification and can fail to produce goal-directed behavior even for very simple tasks such as navigation in a gridworld, as can be seen in B.2.

Next, the two distributions $p$ and $p_{\mathrm{des}}$ are put together to form the reference model $\varphi$. To do so, there have been several proposals in the Active Inference literature, which fall into one of two cases: either a handcrafted value function $-G$ is defined which is multiplied to the generative model using a soft-max function, $\tilde{p}(a) \propto e^{-G(a)}$ (Friston et al., 2015b, 2016), or the desired distribution is multiplied directly to the generative model, $\varphi \propto p \cdot p_{\mathrm{des}}$ (Parr and Friston, 2019). The value function is sometimes also referred to as the (negative) expected free energy and defined as

$$G(a) \;=\; \underbrace{D_{KL}\big(q(y_1 \mid a)\,\big\|\,p_{\mathrm{des}}(y_1)\big)}_{\text{risk}} \;+\; \underbrace{\mathbb{E}_{q(s_1 \mid a)}\Big[H\big(p(y_1 \mid s_1)\big)\Big]}_{\text{ambiguity}}, \qquad q(y_1 \mid a) = \sum_{s_1} p(y_1 \mid s_1)\, q(s_1 \mid a) \qquad (17)$$

where $-G(a)$ favors both desirable and plausible future observations $y_1$, in contrast to the utility function in Section 4.2 that only considers desirability (there, the likelihood of future observations is automatically taken into account when (soft-)maximizing expected utility; see the simulations in B.3). Moreover, the entropy term $H(q(y_1 \mid a))$ contained in the risk term of (17) ensures that actions lead to consequences that more or less match the desired distribution, rather than trying to produce the single most desired outcome—see the discussion at the end of Section 5.3. Note also that the value function depends on the trial distributions $q$, which is generally problematic—see point 2 in Section 5.3.
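A hedged numerical sketch of this construction, with made-up tables and with (17) taken in the risk-plus-ambiguity form reconstructed above, might look as follows; note in the code how $G$ depends on the trial distribution $q(s_1 \mid a)$:

```python
import numpy as np

# A hedged sketch of the value function (17): score each action by the
# divergence of its predicted outcome distribution q(y1|a) from the desired
# distribution p_des (risk), plus the expected observation entropy
# (ambiguity), and soft-maximize -G. All tables are illustrative. Note that
# G depends on the trial distribution q(s1|a), which is the circularity
# criticized in point 2 of Section 5.3.

q_s1_a = np.array([[0.9, 0.1], [0.2, 0.8]])      # q(s1|a), rows indexed by a
p_y1_s1 = np.array([[0.8, 0.2], [0.2, 0.8]])     # p(y1|s1)
p_des = np.array([0.05, 0.95])                   # desired distribution on y1

q_y1_a = q_s1_a @ p_y1_s1                        # predicted q(y1|a)

risk = (q_y1_a * np.log(q_y1_a / p_des)).sum(axis=1)   # KL(q(y1|a) || p_des)
H_y1_s1 = -(p_y1_s1 * np.log(p_y1_s1)).sum(axis=1)     # H(p(y1|s1))
ambiguity = q_s1_a @ H_y1_s1                           # E_{q(s1|a)}[H]
G = risk + ambiguity

p_tilde_a = np.exp(-G) / np.exp(-G).sum()        # softmax action "prior"
print("G(a) =", np.round(G, 3), "  softmax(-G):", np.round(p_tilde_a, 3))
```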

Once the form of the trial distributions (e.g. the partial mean-field assumption (16)) and the reference $\varphi$ are defined, the variational free energy is simply determined as $F(q, \varphi)$. The resulting free energy minimization problem is usually solved approximately by performing an alternating optimization scheme, in which the variational free energy is minimized separately with respect to each of the variable factors in a factorization of $q$, for example by alternating between $q(s_0 \mid a)$, $q(s_1 \mid a)$, and $q(a)$ in the case of the partial mean-field assumption (16), where in each step the factors that are not optimized are kept fixed. The resulting update equations (see A.1) turn out to be quite different depending on how the probabilistic model is combined with $p_{\mathrm{des}}$ and on which assumption about the structure of $q$ is made. In B.1, we compare different proposed formulations of Active Inference in the general case of arbitrarily many time steps.

5.3 Critical points

The main idea behind Active Inference is to express the problem of action selection in a similar manner to the perceptual problem of Bayesian inference over hidden causes. In Bayesian inference, agents are equipped with likelihood models that determine the desirability of different hypotheses given the data. In Active Inference, agents are equipped with a given desired distribution $p_{\mathrm{des}}$ over future outcomes that ultimately determines the desirability of actions. An important difference that arises is that perceptual inference has to condition on past observations, whereas inference over actions would have to be conditioned on desired future observations. However, Active Inference never conditions on future observations, but rather merges the desired distribution with the likelihood into a single reference model $\varphi$, such that the dependency of actions on the future is not the result of an inference process, but is prespecified by the handcrafted reference. This is the root of a number of critical issues with current formulations of Active Inference:


  1. How to incorporate the desired distribution into the reference? Instead of using Bayesian conditioning directly in order to condition the generative model on the desired future, there have been essentially two different proposals in the literature of how to merge the two distributions into a single reference function $\varphi$. In order to create a dependency of actions on the desired future, this function is used in place of the probabilistic model as a reference in the variational free energy to perform variational inference. The merging of $p$ and $p_{\mathrm{des}}$ is achieved either by a handcrafted value function that specifically modifies the action probability of the generative model, or by adjusting the probability over futures of the generative model by multiplying the likelihood with $p_{\mathrm{des}}$ and renormalizing. The first option leads to problem 2, the second option to problem 3.

  2. The reference model is not fixed. In most implementations, see e.g. (Friston et al., 2015a, 2016), the probability over actions in the probabilistic reference model $\varphi$ is defined through the value function $-G$, which itself depends on the trial distributions $q$. Therefore, both the trial distribution and the reference distribution change when $q$ is varied during free energy minimization. Consequently, minimizing the variational free energy with respect to $q$ no longer fits the trial distributions to a fixed reference as in variational Bayes, but instead minimizes the dissimilarity of the two variables $q$ and $\varphi$. This is comparable to minimizing a squared error loss with respect to a model parameter $\theta$ in order to fit a function $f_\theta$ to data $y$, where the data $y$ is not fixed but also depends on $\theta$. Due to this double dependency, it is not obvious anymore what kind of result such a minimization process produces, even though an optimum might well be found.

  3. Exact inference over actions is given by the prior. Given a reference model $\varphi$ with known factorization, standard Bayesian conditioning on past observations can only produce trivial inference over actions, because it can only return the predefined action distribution contained in that factorization. In this case, exact inference over actions given past experience can only reproduce the prespecified distributions in the prior model. For Active Inference models that combine $p$ and $p_{\mathrm{des}}$ by using a value function of the form (17) (Friston et al., 2015b, 2016), this means that exact inference just produces the predefined action distribution given by the softmax of $-G$, whereas in Active Inference models that multiply $p_{\mathrm{des}}$ directly (Parr and Friston, 2019), exact inference results in the fixed prior that does not lead to desirable futures. Note that the update equation for the action distribution (for example resulting from a mean-field assumption) will in both cases depend on the other factors. However, in the end, the variational free energy minimization seeks to approximate the prespecified prior model as closely as possible. This effect can be seen in the gridworld simulations provided in B.2.

Instead of doing inference over actions given past experience, one could do inference over actions given the desired future outcomes, as has been done in other approaches (Dayan and Hinton, 1997; Toussaint and Storkey, 2006; Kappen et al., 2012; Levine, 2018). For a single desired future observation $y_1$, inference can be applied in a straightforward manner by simply conditioning the generative model on $Y_1 = y_1$. Similarly, one could condition on a desired distribution $p_{\mathrm{des}}$ using Jeffrey’s conditioning rule (Jeffrey, 1965), resulting in $p(a \mid y_0) = \sum_{y_1} p(a \mid y_0, y_1)\, p_{\mathrm{des}}(y_1)$, which could be implemented by first sampling a goal $y_1 \sim p_{\mathrm{des}}$ and then inferring $a$ given the single desired observation $y_1$. However, the problem with such a naive approach is that the choice of a goal is solely determined by its desirability, whereas its realizability for the decision-maker is not taken into account; that is, the decision-maker ignores how likely a certain outcome can be achieved under the predictive distribution $p(y_1 \mid y_0, a)$ by choosing the right action $a$. This problem can be alleviated by introducing an auxiliary variable $R$ together with a probability $p(R = 1 \mid y_1)$ that plays the role of a utility and determines how well the outcomes $y_1$ satisfy the desirability criteria of the decision-maker (Toussaint and Storkey, 2006). The extra variable gives the necessary flexibility in the model to infer successful actions by simply conditioning on $R = 1$ (cf. B.3). In Active Inference this issue is avoided because the desired distribution is not used for conditioning. Instead, in early versions of Active Inference (Friston et al., 2013), decision-makers are assumed to match the desired distribution over future states, by defining a value function of the form $-D_{KL}\big(q(s_1 \mid a)\,\|\,p_{\mathrm{des}}(s_1)\big)$ that takes the form of a Kullback-Leibler divergence between the predicted and desired future. As can be seen from examples such as the one in A.2, this assumption can lead to counter-intuitive behavior, especially when none of the predicted outcomes fits the desired distribution well. In later versions of Active Inference (Friston et al., 2015b, 2016), the value function is modified by an additional entropy term that explicitly punishes observations with high variability (cf. B.1). Even though this correction might fix the issue resulting from matching the predicted and desired distributions, the general question remains whether defining a desired distribution directly over outcomes is a good starting point when formulating decision-making as an inference problem (Gershman and Daw, 2012), and whether the proposed form of the value function is the right way to implement such a desired distribution into the probabilistic model (Millidge et al., 2020).
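The auxiliary-variable construction can be illustrated in a few lines; the following hedged sketch (in the spirit of Toussaint and Storkey, 2006, with made-up tables reusing the shapes of the farmer example) conditions on $R = 1$, so that realizability enters automatically through the predictive distribution:

```python
import numpy as np

# A sketch of inference over actions via an auxiliary "success" variable R:
# set p(R=1|y1) proportional to a utility/desirability of y1, then condition
# the generative model on R=1 instead of merging a desired distribution into
# the reference. All tables are illustrative.

p0_a = np.array([0.5, 0.5])                      # prior over actions
p_y1_a = np.array([[0.7, 0.3],                   # predictive p(y1|a),
                   [0.1, 0.9]])                  # rows indexed by a
p_R1_y1 = np.array([0.1, 0.9])                   # p(R=1|y1), utility-like

# p(a|R=1) ∝ p0(a) * sum_y1 p(y1|a) p(R=1|y1): desirability is weighted by
# how likely each outcome actually is under each action.
post_a = p0_a * (p_y1_a @ p_R1_y1)
post_a /= post_a.sum()
print("p(a|R=1) =", np.round(post_a, 3))
```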

6 So What Does Free Energy Bring To the Table?

6.1 A Practical Tool

It is unquestionable that free energy has seen many fruitful practical applications in the statistical and machine learning literature. As has been discussed in Section 3, these applications generally fall into one of two categories: the principle of maximum entropy, and a variational formulation of Bayesian inference. Here, the principle of maximum entropy is interpreted in the wider sense of optimizing a trade-off between uncertainty (entropy) and the expected value of some quantity of interest (energy), which in practice often appears in the form of regularized optimization problems (e.g. to prevent overfitting) or as a general inference method allowing one to determine unbiased priors and posteriors (cf. Section 3.1). In the variational formulation of Bayes’ rule, free energy plays the role of an error measure that allows approximate inference by constraining the space of distributions over which free energy is optimized, but it can also inform the design of efficient iterative inference algorithms that result from an alternating optimization scheme where in each step the full variational free energy is optimized only partially, such as the EM algorithm, belief propagation, and other message-passing algorithms (cf. Section 3.2).

6.2 Theories of Intelligent Agency

These practical use-cases of free energy formulations have also influenced models of intelligent behavior. In the cognitive and behavioral sciences, intelligent agency has been modelled in a number of different frameworks, including logic-based symbolic models, connectionist models, statistical decision-making models, and dynamical systems approaches. Even though statistical thinking in a broader sense can in principle be applied to any of the other frameworks as well, statistical models of cognition in a narrower sense have often focused on Bayesian inference, where agents are equipped with probabilistic models of their environment that allow them to infer unknown variables in order to select actions that lead to desirable consequences (Tenenbaum and Griffiths, 2001; Wolpert, 2006; Todorov, 2009). Naturally, the inference of unknown variables in such models can be achieved by a plethora of methods, including the two types of free energy approaches of maximum entropy and variational Bayes. However, both free energy formulations go one step further in that they extend these principles from the case of inference to the case of action selection: utility optimization with information constraints based on the maximum entropy principle, and Active Inference based on variational free energy.

While sharing similar mathematical concepts, the two approaches differ in syntax and semantics. A prominent apple of discord is the concept of utility (Gershman and Daw, 2012). Utility optimization with information constraints requires the determination of a utility function, whereas Active Inference requires the determination of a reference function. Subjective utility functions that quantify the preferences of decision-makers can lead to identifiability issues when certain consistency axioms are not satisfied. Similarly, determining the reference function in Active Inference involves specifying a desired distribution given by the preferred frequency of outcomes, which in the utility framework would correspond to a very general but weak utility concept, closer to the concept of probability and therefore able to explain almost arbitrary behavior. However, Active Inference then has to solve the additional problem of marrying the agent's probabilistic model with its desired distribution into a single reference function (cf. Section 5.3), for example through a handcrafted value function that is incorporated into the probabilistic model. Crucially, the choice of the reference lies outside the scope of variational Bayes, yet it is critical for the resulting behavior because it determines the exact solutions that are approximated by free energy minimization. Thus, the choice of the reference in Active Inference conceptually corresponds to the solutions of the free energy trade-off in utility-based approaches.

Both approaches also differ fundamentally in their motivation. The motivation of utility optimization with information constraints is to capture the trade-off between precision and uncertainty that underlies information-processing. This trade-off takes the form of a free energy once an informational cost function has been chosen (cf. Section 4.3). Note that Bayes' rule can be seen as a special case of such a free energy trade-off with log-likelihoods as utilities, even though this equivalence is not the primary motivation of the trade-off. In contrast, Active Inference is motivated by casting the problem of action selection itself as an inference process (Friston et al., 2013), since this allows both action and perception to be expressed as the result of minimizing the same function, the variational free energy. This is possible because the underlying probabilistic model already contains both action and perception variables in a single functional format, and the variational free energy is just a function of that model. Moreover, while approximate inference can be formulated on the basis of variational free energy, inference in general does not rely on this concept, and thus inference over actions can easily be done without free energy (Dayan and Hinton, 1997; Toussaint and Storkey, 2006). Also, as we have argued in Section 5.3, even without constraints on the auxiliary distributions, Active Inference does not actually perform straightforward Bayesian inference over actions. Instead, it merges the desired distribution with the probabilistic model and fits trial distributions to the resulting reference (cf. Fig 8). Therefore, there is not a single fundamental principle in Active Inference that generates all the equations, but rather several principles in multiple variants, where different formal assumptions are put together to determine the reference and to perform variational inference with that reference (cf. Section 5.3).
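To exhibit the special case concretely, the sketch below (with hypothetical priors and utilities) computes the soft-max solution p*(x) ∝ p0(x) exp(β U(x)) of the free energy trade-off and recovers Bayes' rule for β = 1 with log-likelihood utilities:

    import numpy as np

    def boltzmann(prior, utility, beta):
        """Solution of max_p E_p[U] - (1/beta) * KL(p || prior)."""
        w = prior * np.exp(beta * utility)
        return w / w.sum()

    prior = np.array([0.7, 0.2, 0.1])     # hypothetical prior over options
    utility = np.array([1.0, 2.0, 0.0])   # hypothetical utilities

    # Low beta keeps the solution near the prior; high beta maximizes U.
    for beta in (0.1, 1.0, 10.0):
        print(beta, boltzmann(prior, utility, beta))

    # Bayes' rule as the special case beta = 1 with log-likelihood utilities:
    likelihood = np.array([0.5, 0.3, 0.9])             # p(data | hypothesis)
    posterior = boltzmann(prior, np.log(likelihood), 1.0)
    print(posterior)   # equals prior * likelihood, normalized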

However, there are also plenty of similarities between the two approaches. For example, the assumption of a soft-max action distribution in Active Inference is similar to the posterior solutions resulting from utility optimization with information constraints. Moreover, the assumption of a desired future distribution relates to constrained computational resources, because the uncertainty expressed by a desired distribution over future states may not only be a consequence of environmental uncertainty but could also originate from the stochastic preferences of a satisficing decision-maker that accepts a wide range of outcomes. In B.2, we provide a comparison of the two approaches using gridworld simulations, where we also include other methods for inference over actions.

A remarkable resemblance between the two approaches is the exclusive appearance of relative entropy as a measure of dissimilarity. In Active Inference it is claimed that every homeostatic system must minimize variational free energy (Friston, 2013), which is simply an extension of relative entropy to non-normalized reference functions (cf. Section 3.2.1). In utility-based approaches, one typically uses relative entropy (14) to measure the amount of information-processing, even though theoretically other cost functions would be conceivable (Gottwald and Braun, 2019a). For a given homeostatic process, the Kullback-Leibler divergence measures the dissimilarity between the current distribution and the limiting distribution, and is therefore reduced as the process approaches equilibrium. Similarly, in utility-based decision-making models, relative entropy measures the dissimilarity between the current posterior and the prior. In the Active Inference literature, the stepwise minimization of variational free energy that goes along with KL minimization is often equated with the minimization of sensory surprise (see A.3 for a more detailed explanation), an idea that stems from maximum likelihood algorithms but has been challenged as a general principle (Biehl et al., 2020). Similarly, one could in principle rewrite maximum entropy trade-offs in terms of informational surprise, which would simply be a rewording of the probabilistic concepts in log-space. The same kind of rewording is well-known between probabilistic inference and the minimum description length principle (Grünwald, 2007), which also operates in log-space and thus reformulates the inference problem as a surprise minimization problem, without adding any new features or properties.

6.3 Biological Relevance

So far we have seen how free energy is used as a technical instrument to solve inference problems and its corresponding appearance in different models of intelligent agency. Crucially, these kinds of models can be applied to any input-output system, be it a human that reacts to sensory stimuli, a cell that tries to maintain homeostasis, or a particle that reacts to physical forces. Given the existing literature that has widely applied the concept of free energy to biological systems, we may ask whether there are any specific biological implications of these models.

If we regard free energy primarily as a trade-off between utility and information-processing costs, we obtain a normative model of decision-making under resource constraints that extends previous optimality models based on expected utility maximization and Bayesian inference. Similarly to rate-distortion curves in coding theory, it provides optimal solutions to decision-making problems along an information-utility curve of Pareto optima (cf. Fig 3). The behavior of real decision-making systems under varying information constraints can be analyzed experimentally by comparing their performance with the corresponding optimality curve, and abstract information-processing costs measured in bits can be related experimentally to task-dependent resource costs like reaction or planning times (Schach et al., 2018; Ortega and Stocker, 2016). Moreover, the free energy trade-off can also be used to describe networks of agents, where each agent is limited in its abilities but the system as a whole has a higher information-processing capacity, for example neurons in a brain or humans in a group. In such systems, different levels of abstraction arise depending on the positions of decision-makers in the network (Lindig-León et al., 2019; Genewein et al., 2015; Gottwald and Braun, 2019b). As we have discussed in Section 4.3, just like coding and rate-distortion theory, utility theory with information costs only provides optimality bounds and does not specify any particular mechanism for achieving optimality. However, by including more and more constraints one can make a model increasingly mechanistic and thereby gradually move from a normative to a more descriptive model, for example by considering the communication channel capacity of neurons with a finite energy budget (Bhui and Gershman, 2018).
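As a sketch of how such an optimality curve can be traced (with hypothetical utilities, in the spirit of Fig 3), sweeping the trade-off parameter yields Pareto-optimal pairs of expected utility and information cost:

    import numpy as np

    prior = np.ones(4) / 4                 # uniform prior over four options
    U = np.array([1.0, 0.8, 0.3, 0.0])     # hypothetical utilities

    def kl(p, q):
        m = p > 0
        return float(np.sum(p[m] * np.log(p[m] / q[m])))

    # Each beta gives one Pareto-optimal point on the information-utility
    # curve: higher expected utility always costs more nats of deviation
    # from the prior.
    for beta in (0.0, 0.5, 1.0, 2.0, 5.0, 20.0):
        p = prior * np.exp(beta * U)
        p /= p.sum()
        print(f"beta={beta:5.1f}  E[U]={p @ U:.3f}  I={kl(p, prior):.3f} nats")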

Considering free energy in the sense of variational free energy, there is a vast literature on biological applications mostly focusing on neural processing (e.g. predictive coding, dopamine) (Schwartenbeck et al., 2015; Friston et al., 2017b; Parr et al., 2019), but there are also a number of applications aiming to explain behavior (e.g. human decision-making, hallucinations) (Parr et al., 2018). Similarly to utility-based models, Active Inference models can be studied as "as if" models, so that actual behavior can be compared to predicted behavior as long as suitable prior and likelihood models can be identified from the experiment. When applied to brain dynamics, these "as if" models are sometimes also given a mechanistic interpretation by relating the iterative update equations that appear when minimizing variational free energy to dynamics in neuronal circuits. As discussed in Section 3.2.2, the update equations resulting, for example, from mean-field or Bethe approximations can often be written in message-passing form, in the sense that the update for a given variable only has contributions that require the current approximate posteriors of neighbouring nodes in the probabilistic model. These contributions are interpreted as local messages passed between the nodes and might be related to brain signals (Parr et al., 2019). Other interpretations (Friston et al., 2006, 2017a; Bogacz, 2017) obtain similar update equations by minimizing variational free energy directly through gradient descent, which can again be related to neural coding schemes like predictive coding. As these coding schemes existed independently of free energy (Rao and Ballard, 1999; Aitchison and Lengyel, 2017), especially since prediction-error minimization is essentially just maximum likelihood estimation, the question remains whether there are any specific predictions of the Active Inference framework that cannot be explained with previous models.
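To illustrate the gradient-descent reading, here is a minimal sketch in the spirit of the tutorial by Bogacz (2017), assuming a one-level linear-Gaussian model with made-up numerical values: the estimate of the hidden cause is updated by a difference of precision-weighted prediction errors and converges to the exact posterior mean.

    import numpy as np

    # Gaussian model: hidden cause v with prior N(v_p, s_p), and an
    # observation u ~ N(v, s_u). All values are assumed for illustration.
    v_p, s_p = 3.0, 1.0      # prior mean and variance
    s_u = 0.5                # observation noise variance
    u = 5.0                  # observed input

    phi = v_p                # initial estimate of the hidden cause
    lr = 0.05
    for _ in range(500):
        eps_p = (phi - v_p) / s_p    # prior prediction error
        eps_u = (u - phi) / s_u      # sensory prediction error
        phi += lr * (eps_u - eps_p)  # gradient descent on free energy

    # Compare with the exact Gaussian posterior mean:
    print(phi, (v_p / s_p + u / s_u) / (1 / s_p + 1 / s_u))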

6.4 Conclusion

The goal of this article is to trace back the seemingly mysterious connection between the Helmholtz free energy from thermodynamics and Helmholtz's view of model-based information-processing that led to the analysis-by-synthesis approach to perception, as exemplified in predictive coding schemes, and in particular to discuss the role of free energy in current models of intelligent behavior. The mystery starts to dissolve when we consider the two kinds of free energies discussed in this article: one based on the maximum entropy principle, and one based on variational free energy, a dissimilarity measure between distributions and (generally unnormalized) functions that extends the well-known Kullback-Leibler divergence from information theory. The Helmholtz free energy is a particular example of an energy-information trade-off that results from the maximum entropy principle (Jaynes, 1957). Analysis-by-synthesis is a particular application of inference to perception, where determining model parameters and hidden states can either be seen as a result of maximum entropy under observational constraints or of fitting parameter distributions to the model through variational free energy minimization. Thus, both notions of free energy can be formally related as entropy-regularized maximization of log-probabilities.

Any theory of intelligent behavior has to answer three questions: where am I?, where do I want to go?, and how do I get there?, corresponding to the three problems of inference and perception, goals and preferences, and planning and execution. All three problems can be addressed either in the language of probabilities or in the language of utilities. Perceptual inference can either be considered as finding parameters that maximize probabilities or as maximizing likelihood utilities. Goals and preferences can either be expressed by utilities over outcomes or by desired distributions. The third question is answered by the two free energy approaches, which either determine future utilities based on model predictions or infer actions that lead to outcomes predicted to match the desired distribution. In standard decision-making models, actions are usually determined by a utility function that ranks the different options, whereas perceptual inference is determined by a likelihood model that quantifies how probable certain observations are. In contrast, both free energy approaches have in common that they treat all types of information-processing, from action planning to perception, as the same formal process of minimizing some form of free energy. The crucial difference, however, is not whether they use utilities or probabilities, but how predictions and goals are interwoven into action.

Utility-based models with information constraints serve primarily as ultimate explanations of behavior, meaning that they focus not on mechanisms but on the goals of behavior and their realizability under ideal circumstances. They have the appeal of being relatively straightforward generalizations of standard utility theory, but they rely on abstract concepts like utility and relative entropy that may not be so straightforwardly related to experimental settings. While these normative models have no immediate mechanistic interpretation, their relevance for mechanistic models is analogous to the relevance of the optimality bounds of Shannon's information theory for practical codes (Shannon, 1948). In contrast, Active Inference models of behavior often mix ultimate and proximate arguments for explaining behavior (Alcock, 1993; Tinbergen, 1963), because they combine the normative aspect of optimizing variational free energy with a mechanistic interpretation of the particular form of the approximate solutions to this optimization.

Finally, both free energy formulations are so general and flexible in their ingredients that it might be more appropriate to consider them as languages or tools for phrasing and describing behavior, rather than as theories that explain it, analogous to how the language of statistics is used in statistical mechanics, where the actual physical theory depends on the many additional constraints that are taken into account.

Funding

This study was funded by the European Research Council (ERC-StG-2015-ERC Starting Grant, Project ID: 678082, “BRISC: Bounded Rationality in Sensorimotor Coordination”).


References

  • Aitchison and Lengyel (2017) Aitchison, L. and Lengyel, M. (2017). With or without you: predictive coding and Bayesian inference in the brain. Current Opinion in Neurobiology, 46:219–227.
  • Alcock (1993) Alcock, J. (1993). Animal behavior: an evolutionary approach. Sinauer Associates.
  • Ashby (1960) Ashby, W. (1960). Design for a Brain: The Origin of Adaptive Behavior. Springer Netherlands.
  • Beal (2003) Beal, M. J. (2003). Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University of Cambridge, UK.
  • Bernoulli (1713) Bernoulli, J. (1713). Ars conjectandi. Basel, Thurneysen Brothers.
  • Bhui and Gershman (2018) Bhui, R. and Gershman, S. J. (2018). Decision by sampling implements efficient coding of psychoeconomic functions. Psychological Review, 125(6):985–1001.
  • Biehl et al. (2020) Biehl, M., Pollock, F. A., and Kanai, R. (2020). A technical critique of the free energy principle as presented in "life as we know it" and related works. preprint: arXiv:2001.06408v2.
  • Bogacz (2017) Bogacz, R. (2017). A tutorial on the free-energy framework for modelling perception and learning. Journal of Mathematical Psychology, 76:198–211. Model-based Cognitive Neuroscience.
  • Callen (1985) Callen, H. (1985). Thermodynamics and an Introduction to Thermostatistics. Wiley.
  • Cisek (1999) Cisek, P. (1999). Beyond the computer metaphor: behaviour as interaction. Journal of Consciousness Studies, 6(11-12):125–142.
  • Clark (2013) Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 36(3):181–204.
  • Colombo and Wright (2018) Colombo, M. and Wright, C. (2018). First principles in the life sciences: the free-energy principle, organicism, and mechanism. Synthese.
  • Csiszár (2008) Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3):261–273.
  • Dayan and Hinton (1997) Dayan, P. and Hinton, G. E. (1997). Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278.
  • Dayan et al. (1995) Dayan, P., Hinton, G. E., Neal, R. M., and Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5):889–904.
  • de Laplace (1812) de Laplace, P. S. (1812). Théorie analytique des probabilités. Ve. Courcier, Paris.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38.
  • Doya (2007) Doya, K. (2007). Bayesian Brain: Probabilistic Approaches to Neural Coding. MIT Press, Cambridge, Mass.
  • Ergin and Sarver (2010) Ergin, H. and Sarver, T. (2010). A unique costly contemplation representation. Econometrica, 78(4):1285–1339.
  • Feynman et al. (1996) Feynman, R., Hey, A., and Allen, R. (1996). Feynman Lectures on Computation. Advanced book program. Addison-Wesley.
  • Flanagan et al. (2003) Flanagan, J. R., Vetter, P., Johansson, R. S., and Wolpert, D. M. (2003). Prediction precedes control in motor learning. Current Biology, 13(2):146–150.
  • Fox et al. (2016) Fox, R., Pakman, A., and Tishby, N. (2016). Taming the noise in reinforcement learning via soft updates. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI'16, pages 202–211, Arlington, Virginia, United States. AUAI Press.
  • Friston (2013) Friston, K. (2013). Life as we know it. Journal of The Royal Society Interface, 10(86):20130475.
  • Friston et al. (2016) Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O’Doherty, J., and Pezzulo, G. (2016). Active inference and learning. Neuroscience & Biobehavioral Reviews, 68:862–879.
  • Friston et al. (2013) Friston, K., Schwartenbeck, P., Fitzgerald, T., Moutoussis, M., Behrens, T., and Dolan, R. (2013). The anatomy of choice: active inference and agency. Frontiers in Human Neuroscience, 7:598.
  • Friston (2005) Friston, K. J. (2005). A theory of cortical responses. Philosophical Transactions of the Royal Society B: Biological Sciences, 360(1456):815–836.
  • Friston (2010) Friston, K. J. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11:127–138.
  • Friston et al. (2017a) Friston, K. J., FitzGerald, T. H. B., Rigoli, F., Schwartenbeck, P., and Pezzulo, G. (2017a). Active inference: A process theory. Neural Computation, 29:1–49.
  • Friston et al. (2006) Friston, K. J., Kilner, J., and Harrison, L. M. (2006). A free energy principle for the brain. Journal of Physiology-Paris, 100:70–87.
  • Friston et al. (2015a) Friston, K. J., Levin, M., Sengupta, B., and Pezzulo, G. (2015a). Knowing one’s place: a free-energy approach to pattern regulation. Journal of The Royal Society Interface, 12(105):20141383.
  • Friston et al. (2017b) Friston, K. J., Parr, T., and de Vries, B. (2017b). The graphical brain: Belief propagation and active inference. Network Neuroscience, 1(4):381–414.
  • Friston et al. (2015b) Friston, K. J., Rigoli, F., Ognibene, D., Mathys, C., Fitzgerald, T., and Pezzulo, G. (2015b). Active inference and epistemic value. Cognitive Neuroscience, 6(4):187–214.
  • Genewein et al. (2015) Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. (2015). Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2.
  • Gershman (2019) Gershman, S. J. (2019). What does the free energy principle tell us about the brain? Neurons, Behavior, Data Analysis, and Theory.
  • Gershman and Daw (2012) Gershman, S. J. and Daw, N. D. (2012). Perception, action and utility: The tangled skein. In Principles of Brain Dynamics. MIT Press.
  • Gigerenzer and Selten (2001) Gigerenzer, G. and Selten, R. (2001). Bounded Rationality: The Adaptive Toolbox. MIT Press: Cambridge, MA, USA.
  • Gottwald and Braun (2019a) Gottwald, S. and Braun, D. A. (2019a). Bounded rational decision-making from elementary computations that reduce uncertainty. Entropy, 21(4).
  • Gottwald and Braun (2019b) Gottwald, S. and Braun, D. A. (2019b). Systems of bounded rational agents with information-theoretic constraints. Neural Computation, 31(2):440–476.
  • Grünwald (2007) Grünwald, P. (2007). The Minimum Description Length Principle. MIT Press, Cambridge, Mass.
  • Haarnoja et al. (2017) Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. (2017). Reinforcement learning with deep energy-based policies. In ICML.
  • Hansen and Sargent (2008) Hansen, L. P. and Sargent, T. J. (2008). Robustness. Princeton University Press.
  • Heskes (2003) Heskes, T. (2003). Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In Becker, S., Thrun, S., and Obermayer, K., editors, Advances in Neural Information Processing Systems 15, pages 359–366. MIT Press.
  • Hinton and van Camp (1993) Hinton, G. E. and van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT '93, pages 5–13, New York, NY, USA. ACM.
  • Ho et al. (2020) Ho, M. K., Abel, D., Cohen, J. D., Littman, M. L., and Griffiths, T. L. (2020). The efficiency of human cognition reflects planned information processing. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. Preprint: arXiv:2002.05769.
  • Jaynes (1957) Jaynes, E. T. (1957). Information theory and statistical mechanics. Phys. Rev., 106:620–630.
  • Jeffrey (1965) Jeffrey, R. C. (1965). The Logic of Decision. University of Chicago Press, 1 edition.
  • Kahneman (2002) Kahneman, D. (2002). Maps of bounded rationality: A perspective on intuitive judgement. In Frangsmyr, T., editor, Nobel prizes, presentations, biographies, & lectures, pages 416–499. Almqvist & Wiksell, Stockholm, Sweden.
  • Kappen et al. (2012) Kappen, H. J., Gómez, V., and Opper, M. (2012). Optimal control as a graphical model inference problem. Machine Learning, 87(2):159–182.
  • Kawato (1999) Kawato, M. (1999). Internal models for motor control and trajectory planning. Current Opinion in Neurobiology, 9(6):718–727.
  • Levine (2018) Levine, S. (2018). Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv:1805.00909v3.
  • Lindig-León et al. (2019) Lindig-León, C., Gottwald, S., and Braun, D. A. (2019). Analyzing abstraction and hierarchical decision-making in absolute identification by information-theoretic bounded rationality. Frontiers in Neuroscience, 13:1230.
  • Linson et al. (2020) Linson, A., Parr, T., and Friston, K. J. (2020). Active inference, stressors, and psychological trauma: A neuroethological model of (mal)adaptive explore-exploit dynamics in ecological context. Behavioural Brain Research, 380:112421.
  • Maccheroni et al. (2006) Maccheroni, F., Marinacci, M., and Rustichini, A. (2006). Ambiguity aversion, robustness, and the variational representation of preferences. Econometrica, 74(6):1447–1498.
  • MacKay (2002) MacKay, D. J. C. (2002). Information Theory, Inference & Learning Algorithms. Cambridge University Press, USA.
  • Marshall et al. (2011) Marshall, A. W., Olkin, I., and Arnold, B. C. (2011). Inequalities: Theory of Majorization and Its Applications. Springer New York, 2nd edition.
  • Mattsson and Weibull (2002) Mattsson, L.-G. and Weibull, J. W. (2002). Probabilistic choice and procedurally bounded rationality. Games and Economic Behavior, 41(1):61–78.
  • McFadden (2005) McFadden, D. L. (2005). Revealed stochastic preference: a synthesis. Economic Theory, 26(2):245–264.
  • McKelvey and Palfrey (1995) McKelvey, R. D. and Palfrey, T. R. (1995). Quantal response equilibria for normal form games. Games and Economic Behavior, 10(1):6–38.
  • Millidge et al. (2020) Millidge, B., Tschantz, A., and Buckley, C. L. (2020). Whence the expected free energy?
  • Minka (2005) Minka, T. (2005). Divergence measures and message passing. Technical Report MSR-TR-2005-173, Microsoft.
  • Minka (2001) Minka, T. P. (2001). Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01, pages 362–369, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
  • Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. (2016). Asynchronous methods for deep reinforcement learning. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1928–1937, New York, New York, USA. PMLR.
  • Neal and Hinton (1998) Neal, R. M. and Hinton, G. E. (1998). A view of the em algorithm that justifies incremental, sparse, and other variants. In Jordan, M. I., editor, Learning in Graphical Models, pages 355–368. Springer Netherlands, Dordrecht.
  • Ortega and Braun (2013) Ortega, P. A. and Braun, D. A. (2013). Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 469(2153):20120683.
  • Ortega and Braun (2014) Ortega, P. A. and Braun, D. A. (2014). Generalized Thompson sampling for sequential decision-making and causal inference. Complex Adaptive Systems Modeling, 2(1):2.
  • Ortega and Stocker (2016) Ortega, P. A. and Stocker, A. (2016). Human decision-making under limited time. In 30th Conference on Neural Information Processing Systems.
  • Parr et al. (2018) Parr, T., Benrimoh, D. A., Vincent, P., and Friston, K. J. (2018). Precision and false perceptual inference. Frontiers in Integrative Neuroscience, 12:39.
  • Parr and Friston (2017) Parr, T. and Friston, K. J. (2017). Working memory, attention, and salience in active inference. Scientific reports, 7(1):14678–14678.
  • Parr and Friston (2019) Parr, T. and Friston, K. J. (2019). Generalised free energy and active inference. Biological Cybernetics.
  • Parr et al. (2019) Parr, T., Markovic, D., Kiebel, S. J., and Friston, K. J. (2019). Neuronal message passing using mean-field, bethe, and marginal approximations. Scientific Reports, 9(1):1889.
  • Pearl (1988) Pearl, J. (1988). Belief updating by network propagation. In Pearl, J., editor, Probabilistic Reasoning in Intelligent Systems, pages 143–237. Morgan Kaufmann, San Francisco (CA).
  • Poincaré (1912) Poincaré, H. (1912). Calcul des probabilités. Gauthier-Villars, Paris.
  • Powers (1973) Powers, W. T. (1973). Behavior: The Control of Perception. Aldine, Chicago, IL.
  • Rao and Ballard (1999) Rao, R. P. N. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nature Neuroscience, 2(1):79–87.
  • Russell and Subramanian (1995) Russell, S. J. and Subramanian, D. (1995). Provably bounded-optimal agents. Journal of Artificial Intelligence Research, 2(1):575–609.
  • Schach et al. (2018) Schach, S., Gottwald, S., and Braun, D. A. (2018). Quantifying motor task performance by bounded rational decision theory. Frontiers in Neuroscience, 12:932.
  • Schwartenbeck et al. (2015) Schwartenbeck, P., FitzGerald, T. H. B., Mathys, C., Dolan, R., and Friston, K. (2015). The dopaminergic midbrain encodes the expected certainty about desired outcomes. Cerebral cortex (New York, N.Y. : 1991), 25(10):3434–3445.
  • Schwartenbeck and Friston (2016) Schwartenbeck, P. and Friston, K. (2016). Computational phenotyping in psychiatry: A worked example. eNeuro, 3(4):ENEURO.0049–16.2016.
  • Shannon (1948) Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 623–656.
  • Simon (1955) Simon, H. A. (1955). A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99–118.
  • Sims (2003) Sims, C. A. (2003). Implications of rational inattention. Journal of Monetary Economics, 50(3):665–690. Swiss National Bank/Study Center Gerzensee Conference on Monetary Policy under Incomplete Information.
  • Sims (2016) Sims, C. R. (2016). Rate–distortion theory and human perception. Cognition, 152:181–198.
  • Still (2009) Still, S. (2009). Information-theoretic approach to interactive learning. EPL (Europhysics Letters), 85(2):28005.
  • Tenenbaum and Griffiths (2001) Tenenbaum, J. B. and Griffiths, T. L. (2001). Generalization, similarity, and Bayesian inference. Behavioral and Brain Sciences, 24(4):629–640.
  • Tinbergen (1963) Tinbergen, N. (1963). On aims and methods of ethology. Zeitschrift für Tierpsychologie, 20:410–433.
  • Tishby and Polani (2011) Tishby, N. and Polani, D. (2011). Information Theory of Decisions and Actions, pages 601–636. Springer New York.
  • Todorov (2009) Todorov, E. (2009). Efficient computation of optimal actions. Proceedings of the National Academy of Sciences, 106(28):11478–11483.
  • Toussaint and Storkey (2006) Toussaint, M. and Storkey, A. (2006). Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the 23rd International Conference on Machine Learning, ICML '06, pages 945–952, New York, NY, USA. Association for Computing Machinery.
  • von Neumann and Morgenstern (1944) von Neumann, J. and Morgenstern, O. (1944). Theory of Games and Economic Behavior. Princeton University Press, Princeton, NJ, USA.
  • Wainwright et al. (2005) Wainwright, M., Jaakkola, T., and Willsky, A. (2005). MAP estimation via agreement on (hyper)trees: Message-passing and linear-programming approaches. IEEE Transactions on Information Theory, 51(11):3697–3717.
  • Wiener (1948) Wiener, N. (1948). Cybernetics: Or Control and Communication in the Animal and the Machine. John Wiley.
  • Williams (1980) Williams, P. M. (1980). Bayesian conditionalisation and the principle of minimum information. The British Journal for the Philosophy of Science, 31(2):131–144.
  • Williams and Peng (1991) Williams, R. J. and Peng, J. (1991). Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268.
  • Winn and Bishop (2005) Winn, J. and Bishop, C. M. (2005). Variational message passing. J. Mach. Learn. Res., 6:661–694.
  • Wolpert (2006) Wolpert, D. H. (2006). Information Theory – The Bridge Connecting Bounded Rational Game Theory and Statistical Physics, pages 262–290. Springer Berlin Heidelberg.
  • Yedidia et al. (2001) Yedidia, J. S., Freeman, W. T., and Weiss, Y. (2001). Generalized belief propagation. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 689–695. MIT Press.
  • Yuille and Kersten (2006) Yuille, A. and Kersten, D. (2006). Vision as bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 10(7):301–308. Special issue: Probabilistic models of cognition.

Appendix A Appendices

a.1 Exemplary update equations of Active Inference

Here, we list the iterative update equations of Active Inference for the example from Section 5.2 under the partial mean-field assumption (16). In the case when the desired distribution is combined with the generative model additively in log-space via the value function, the update equations are given by

where the first shorthand denotes the log-transition probability, and the second the free energy over all variables besides the action, which simplifies to

Note that the update equations shown above are only correct under the assumption that the dependency of the value function on the trial distributions can be neglected (Friston et al., 2015b). This problem is avoided in the other case, where the desired distribution is combined with the generative model multiplicatively, so that the update equations for the state and action distributions are given by

with the corresponding normalization constant. Here, the free energy over all variables besides the action replaces its counterpart in the previous expression and is given by

Moreover, sometimes the more restrictive full mean-field assumption is made (Friston et al., 2015b), in which case the definition of the free energy and the solution equations are also different. For simplicity, and in line with more recent formulations of Active Inference (Parr and Friston, 2019), we have ignored the precision parameter that appears in earlier formulations (Friston et al., 2013, 2015b), where it multiplies the value function and is treated as another unknown variable.

a.2 Uncertain and deterministic options

Consider the simple example of three possible observations, a desired distribution over them, and two actions: one whose predictive distribution spreads its probability evenly over two of the outcomes, and one that leads to the remaining outcome deterministically. When conditioning on the desired distribution directly through a naive Bayesian inference approach using Jeffrey conditioning, one finds a choice probability that prefers the action with maximum variability in the two outcomes over the deterministic action, even though the average desirability, i.e. the expected desired probability of the predicted outcomes, is the same for both actions. In fact, following (Toussaint and Storkey, 2006) with a success probability proportional to the desirability of the outcomes results in indifference between the two options when doing inference over actions conditioned on success.

The early Active Inference approach (Friston et al., 2013) of measuring the dissimilarity between the predicted and desired distributions using the Kullback-Leibler divergence and then calculating the choice probability through a softmax function would, similarly to the naive Bayes approach, lead to preferring the uncertain option, because the shape of its predictive distribution is more similar to the desired distribution than that of the deterministic action when measured by the Kullback-Leibler divergence, although neither of them is very close. The modification made in later versions of Active Inference (e.g. Friston et al., 2015b), i.e. subtracting the entropy of the predictive distribution, here results in a choice probability that slightly prefers the deterministic option, because the option with higher variability is explicitly punished.
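The following sketch reproduces the structure of this example with stand-in numbers, chosen such that the average desirability of the two actions coincides (the values of the original example may differ):

    import numpy as np

    def softmax(v):
        w = np.exp(v - v.max())
        return w / w.sum()

    def kl(p, q):
        m = p > 0
        return float(np.sum(p[m] * np.log(p[m] / q[m])))

    # Stand-in numbers: the desired distribution is picked so that the
    # average desirability sum_o p(o|a) p_des(o) = 1/3 for both actions.
    p_des = np.array([1/2, 1/6, 1/3])
    p1 = np.array([0.5, 0.5, 0.0])   # uncertain action
    p2 = np.array([0.0, 0.0, 1.0])   # deterministic action
    print("Expected desirability:", p1 @ p_des, p2 @ p_des)

    # Early Active Inference (Friston et al., 2013): softmax of negative KL.
    kls = np.array([kl(p1, p_des), kl(p2, p_des)])
    print("KL matching:       ", softmax(-kls))   # prefers the uncertain action

    # Later versions: additionally subtract the entropy of p(o|a), turning
    # the value into a negative cross-entropy sum_o p(o|a) log p_des(o).
    ce = np.array([-(p1 @ np.log(p_des)), -(p2 @ np.log(p_des))])
    print("Entropy-corrected: ", softmax(-ce))    # slightly prefers deterministic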


a.3 Surprise minimization

The (informational) surprise or surprisal of an element x with respect to a probability distribution p is defined as −log p(x), i.e. it is simply a strictly decreasing function of probability, such that outcomes with low probability have high surprise and outcomes with high probability have low surprise. A common statement found in the literature (Parr and Friston, 2017) is that variational free energy is an upper bound on surprise and thus minimizing free energy also minimizes surprise. This idea originates from the special case of greedy inference with latent variables, where, for fixed data x, the goal is to maximize the likelihood p(x|θ) with respect to a parameter θ. If the marginalization over the latent variable z is too hard to carry out directly, then one might take advantage of the bound

−log p(x|θ) ≤ E_q(z)[ log q(z) − log p(x, z|θ) ] =: F(q, θ),    (18)

i.e. that the variational free energy F(q, θ) is an upper bound on the surprise −log p(x|θ), which might therefore be reduced by minimizing its upper bound with respect to θ as a proxy. In the variational Bayes approach to the above inference problem, where θ is treated as a random variable, the minimization with respect to θ is replaced by a minimization with respect to a trial distribution q(θ). In this case, the analogous bound to (18) is

F(q) ≥ −log ∫ exp( −(−log p(x|θ)) ) p(θ) dθ,

where the right-hand side is the minimum of the left-hand side with respect to the trial distribution. In this sense, variational free energy is generally not a bound on the surprise −log p(x|θ) anymore, but on a log-sum-exp version of it instead. Nonetheless, also in this Bayesian approach, variational free energy is an upper bound on the surprise −log p(x),

F(q) ≥ −log p(x),    (19)

where the right-hand side is the minimum of the left-hand side with respect to both q(z) and q(θ). However, in contrast to (18), there is no variable left in −log p(x) over which one could minimize. Therefore, saying that minimizing free energy also minimizes surprise (Parr and Friston, 2017) is generally only true in the sense that minimizing free energy minimizes an upper bound on surprise; the surprise itself is not minimized. Instead, the important fact about (19) is that equality is achieved by the Bayes posteriors q(z) and q(θ), as discussed in Section 3.2.2.
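The bound (19) and its equality condition can be checked numerically; the sketch below uses a hypothetical model with a single binary latent variable (so that q plays the role of the joint trial distribution and the parameter is absorbed into the latent):

    import numpy as np

    # Tiny discrete model: latent z in {0, 1}, fixed observation x.
    p_z = np.array([0.3, 0.7])          # prior p(z)
    p_x_given_z = np.array([0.9, 0.2])  # likelihood p(x|z) of the observed x

    p_xz = p_x_given_z * p_z            # joint p(x, z)
    surprise = -np.log(p_xz.sum())      # -log p(x)

    def free_energy(q):
        # F(q) = E_q[ log q(z) - log p(x, z) ]
        return float(np.sum(q * (np.log(q) - np.log(p_xz))))

    posterior = p_xz / p_xz.sum()       # Bayes posterior p(z|x)
    for q0 in (0.1, 0.5, posterior[0], 0.9):
        q = np.array([q0, 1 - q0])
        print(f"F(q) = {free_energy(q):.4f}  >=  -log p(x) = {surprise:.4f}")
    # F(q) equals the surprise exactly when q is the Bayes posterior.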

Appendix B Supplementary Material

The following ancillary files are provided as supplementary material:

b.1 Comparison of different formulations of Active Inference

A detailed comparison of the different formulations of active inference found in the literature (2013-2019), including their mean-field and exact solutions in the general case of arbitrarily many time steps.

b.2 Simulations: Gridworld navigation

We provide implementations of the models discussed in this article in a gridworld environment, both as a rendered HTML file and as an interactive Jupyter notebook for interested readers to tinker with.

b.3 Simulations: Non-uniform emission probability

In these simulations, we compare the four approaches that have successfully navigated the basic gridworlds from B.2 in an environment with two desired outcomes and a non-uniform emission probability, both as a rendered HTML file and as an interactive Jupyter notebook.