1 Introduction
Active inference extends the free energy principle to generative models with actions (Friston et al., 2016a; Da Costa et al., 2020; Champion et al., 2021b) and can be regarded as a form of planning as inference (Botvinick and Toussaint, 2012). This framework has successfully explained a wide range of brain phenomena, such as habit formation (Friston et al., 2016a), Bayesian surprise (Itti and Baldi, 2009), curiosity (Schwartenbeck et al., 2018), and dopaminergic discharge (FitzGerald et al., 2015). It has also been applied to a variety of tasks such as navigation in the Animal AI environment (Fountas et al., 2020), robotic control (Pezzato et al., 2020; Sancaktar et al., 2020), the mountain car problem (Catal et al., 2020), the game DOOM (Cullen et al., 2018) and the cartpole problem (Millidge, 2019).
Active inference builds on a subfield of Bayesian statistics called variational inference (Fox and Roberts, 2012), in which the true posterior is approximated with a variational distribution. This method provides a way to balance the computational cost and accuracy of the posterior distribution. Indeed, the variational approach is only tractable because some statistical dependencies are ignored during the inference process, i.e., the variational distribution is generally assumed to fully factorise, leading to the well known mean-field approximation:

$$Q(X) = \prod_{i} Q_i(x_i),$$

where $X$ is the set of all hidden variables of the model and $x_i$ represents the $i$th hidden variable. Winn and Bishop (2005) presented a message-based implementation of variational inference, naturally called variational message passing. More recently, Champion et al. (2021b) rigorously framed active inference as a variational message passing procedure. By combining the Forney factor graph formalism (Forney, 2001) with the method of Winn and Bishop (2005), it becomes possible to create modular implementations of active inference (van de Laar and de Vries, 2019; Cox et al., 2019) that allow users to define their own generative models without the burden of deriving the update equations. This paper uses a new software package called Homing Pigeon that implements such a modular approach, and the relevant code has been made publicly available on GitHub: https://github.com/ChampiB/HomingPigeon.
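As a toy illustration of why the mean-field assumption makes inference tractable (a sketch for intuition, not code from Homing Pigeon), compare the number of parameters needed by a full joint categorical posterior with the number needed by the factorised approximation:

```python
# Toy parameter count: a joint categorical posterior over n hidden variables
# with k values each needs k**n parameters, whereas the mean-field
# approximation Q(X) = prod_i Q_i(x_i) needs only n * k.
def joint_size(n, k):
    return k ** n

def mean_field_size(n, k):
    return n * k

print(joint_size(10, 5))       # full joint: 9765625 parameters
print(mean_field_size(10, 5))  # mean-field: 50 parameters
```

The exponential gap is exactly what the ignored statistical dependencies buy.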
Arguably, the major bottleneck for scaling up the active inference framework was the exponential growth of the number of policies. In the reinforcement learning literature, this explosion is frequently handled using Monte Carlo tree search (MCTS)
(Silver et al., 2016; Browne et al., 2012; Schrittwieser et al., 2019). MCTS is based on the upper confidence bound for trees (UCT), which originally comes from the multi-armed bandit problem, and trades off exploration and exploitation during the tree search. In the reinforcement learning literature, the selection of the node to expand is carried out using the UCT criterion (this version of UCT comes from Silver et al., 2016), which is defined as:

$$UCT(s, a) = Q(s, a) + c_{explore} \, P(a|s) \, \frac{\sqrt{N(s)}}{1 + N(s, a)}, \qquad (1)$$

where $Q(s, a)$ is the value of taking action $a$ in state $s$ (i.e. $Q(s, a)$ here is not the variational posterior), $c_{explore}$ is the exploration constant that modulates the amount of exploration, $N(s, a)$ is the visit count, and $P(a|s)$ is the prior probability of selecting action $a$ in state $s$. MCTS has been applied to active inference in a recent paper by Fountas et al. (2020). The main change made to apply MCTS to active inference was to modify the node selection step that returns the node to be expanded. From equation (9) of Fountas et al. (2020), one can see that the UCT formula has been replaced by:

$$U(s, a) = -\tilde{G}(s, a) + c_{explore} \, Q_\phi(a|s) \, \frac{\sqrt{N(s)}}{1 + N(s, a)}, \qquad (2)$$
where $\tilde{G}(s, a)$ is the best estimation of the expected free energy (EFE) obtained from Equation 8 of Fountas et al. (2020) using sampling of 3 (out of 4) neural networks used by the system, $N(s, a)$ is the number of times that $a$ was explored in state $s$, $c_{explore}$ is an exploration constant and $Q_\phi(a|s)$ is a neural network modelling the posterior distribution over actions. Note, $Q_\phi(a|s)$ specializes $P(a|s)$ in equation (1) by providing the probability of selecting action $a$ in state $s$. One can see that $U(s, a)$ has been obtained from $UCT(s, a)$ by replacing the average reward $Q(s, a)$ by the negative EFE.

More recently, Champion et al. (2021a) proposed an online method that frames planning as a form of structure learning guided by the expected free energy. This method, called branching-time active inference (BTAI), generalises active inference (Friston et al., 2016a; Champion et al., 2021b; Da Costa et al., 2020) and relates to another recently introduced framework for inference and decision making, called sophisticated inference (Friston et al., 2021). Importantly, the generative model of BTAI enables the agent to trade off risk and ambiguity, instead of only seeking certainty as was the case in Champion et al. (2021b). In this paper, we provide an empirical study of BTAI, enabling us to explicitly demonstrate that BTAI provides a more scalable realization of planning as inference than active inference.
Section 2 reviews the BTAI theory, with full details presented in Champion et al. (2021a). Then, Section 3 compares BTAI to standard active inference in the context of a graph navigation task, both empirically and theoretically. We show that active inference is able to solve small graphs but suffers from an exponential (space and time) complexity class that makes the approach intractable for bigger graphs. In contrast, BTAI is able to search the space of policies efficiently and scale to bigger graphs. Next, Section 4.2 presents the challenge of local minima in the context of a maze solving task, and shows how better prior preferences and deeper tree search help to overcome this challenge. Lastly, Section 4.3 compares two cost functions, $G_{I \cdot a}$ and $C_{I \cdot a}$, in two new mazes. Finally, Section 5 concludes this paper and provides ideas for future research.
2 Branching Time Active Inference (BTAI)
In this section, we provide a short review of BTAI, and the reader is referred to (Champion et al., 2021a) for details. BTAI frames planning as a form of structure learning (Smith et al., 2020; Friston et al., 2016b, 2018) guided by the expected free energy. The idea is to define a generative model that can be expanded dynamically as shown in Figure 1.
The past and present are modelled using a partially observable Markov decision process (POMDP) in which each observation ($O_\tau$) only depends on the state at time $\tau$ ($S_\tau$), and this state only depends on the previous state ($S_{\tau-1}$) and previous action ($A_{\tau-1}$). In addition to the POMDP, which models the past and present, the future is modelled using a tree-like generative model whose branches are dynamically expanded. Each branch of the tree corresponds to a trajectory of states reached under a specific policy. The branches are expanded following a logic similar to the Monte Carlo tree search algorithm, and the state estimation is performed using variational message passing.

At the start of a trial, the model contains only the initial hidden state $S_0$ and the initial observation $O_0$. Then, the agent starts expanding the generative model using an approach inspired by Monte Carlo tree search (Browne et al., 2012), where the selection of a node is based on the expected free energy. The expansion of a node $S_I$ adds the future hidden state $S_{I \cdot a}$ for each available action $a$, where $I \cdot a$ refers to the multi-index obtained by adding the action $a$ at the end of the sequence of actions described by the multi-index $I$. The expansion step also adds the latent variable corresponding to the observation $O_{I \cdot a}$ to the generative model. Next, the evaluation step estimates the cost attached to the pairs $(S_{I \cdot a}, O_{I \cdot a})$ for each action $a$. In this paper, we experimented with two kinds of cost. First, the standard expected free energy:
$$G_{I \cdot a} = \underbrace{\mathrm{D_{KL}}\!\left[\,Q(O_{I \cdot a})\,\middle\|\,V(O_{I \cdot a})\,\right]}_{\text{risk}} + \underbrace{\mathbb{E}_{Q(S_{I \cdot a})}\!\left[\,\mathrm{H}\!\left[P(O_{I \cdot a} \mid S_{I \cdot a})\right]\,\right]}_{\text{ambiguity}},$$

where $G_{I \cdot a}$ is defined for an arbitrary action $a$, and trades off risk and ambiguity. Second, we also experiment with the following quantity:

$$C_{I \cdot a} = \mathrm{D_{KL}}\!\left[\,Q(O_{I \cdot a})\,\middle\|\,V(O_{I \cdot a})\,\right] + \mathrm{D_{KL}}\!\left[\,Q(S_{I \cdot a})\,\middle\|\,V(S_{I \cdot a})\,\right],$$
which depends on both the risk over observations and the risk over states. Lastly, the cost of the best action (i.e., the action that produces the smallest cost) is propagated towards the root and used to update the aggregated cost of the ancestors of $S_{I \cdot a}$.
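For concreteness, the two costs can be computed for categorical beliefs as follows (a hypothetical sketch under our own naming, where `A[o, s]` stands for the likelihood $P(o \mid s)$ and all distributions are strictly positive vectors):

```python
import numpy as np

def kl(p, q):
    # KL divergence between two categorical distributions (no zero entries)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def efe_cost(q_o, v_o, q_s, A):
    # G: risk over observations plus expected ambiguity, A[o, s] = P(o | s)
    risk = kl(q_o, v_o)
    ambiguity = float(sum(q_s[s] * entropy(A[:, s]) for s in range(len(q_s))))
    return risk + ambiguity

def pure_cost(q_o, v_o, q_s, v_s):
    # C: risk over observations plus risk over states
    return kl(q_o, v_o) + kl(q_s, v_s)
```

When posteriors exactly match the preferences, both risk terms vanish, so C is zero while G retains the ambiguity term; this is the sense in which G, but not C, still penalises uncertain likelihood mappings.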
Finally, during the planning procedure, the agent needs to perform inference over the future hidden states and observations. This is performed using variational message passing on the set of newly expanded nodes, i.e. $\{S_{I \cdot a}, O_{I \cdot a} : a \in A\}$, until convergence to a minimum in the free energy landscape.
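The select–expand–evaluate–propagate cycle described above can be sketched as follows (a hypothetical minimal implementation, not the Homing Pigeon code; `select` and `evaluate` are placeholders for the UCT-style criterion and the chosen cost function):

```python
# One BTAI-style planning iteration over an explicit tree of nodes.
class Node:
    def __init__(self, action=None, parent=None):
        self.action, self.parent = action, parent
        self.children, self.visits, self.cost = [], 0, 0.0

def planning_iteration(root, actions, evaluate, select):
    node = root
    while node.children:                  # selection (e.g. UCT-based)
        node = select(node)
    for a in actions:                     # expansion: one child per action
        node.children.append(Node(action=a, parent=node))
    costs = [evaluate(c) for c in node.children]   # evaluation (e.g. EFE)
    best = min(costs)                     # cost of the best action
    while node is not None:               # propagation towards the root
        node.visits += 1
        node.cost += best
        node = node.parent
    return best
```

Repeating this iteration a fixed number of times ("planning iterations" in the tables below) grows the tree only along the branches the criterion deems promising.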
We refer the interested reader to Champion et al. (2021b) for additional information about the derivation of the update equations. Also, since this paper only considers inference and not learning (i.e. the model does not have Dirichlet priors over the tensors defining the world's contingencies), the generative model is different from the one presented in the theoretical paper (Champion et al., 2021a). Therefore, we provide a mathematical description of the generative model, the variational distribution and the belief updates in Appendix A.

3 BTAI vs active inference
In this section, we benchmark BTAI against standard active inference as implemented in Statistical Parametric Mapping (SPM); cf. Friston (2007) for additional details about SPM. First, we do this in terms of complexity class, and then empirically through experiments of increasing difficulty.
3.1 The deep reward environment
First, we introduce a canonical example of the kind of environment in which BTAI outperforms standard active inference. This environment is called the deep reward environment because the agent needs to navigate a tree-like graph where the graph's nodes correspond to the states of the system, and the agent needs to look deep into the future to differentiate the good paths from the traps.
At the beginning of each trial, the agent is placed at the root of the tree, i.e., the initial state of the system. From the initial state, the agent can perform $n_g + n_b$ actions, where $n_g$ and $n_b$ are the number of good and bad paths, respectively. Additionally, at any point in time, the agent can make two observations: a pleasant one or an unpleasant one. The states of the good paths produce pleasant observations, while the states of the bad paths produce unpleasant ones.
If the first action selected was one of the $n_b$ bad actions, then the agent will enter a bad path in which $n_g + n_b$ actions are available at each time step but all of them produce unpleasant observations. If the first action selected was one of the $n_g$ good actions, then the agent will enter the associated good path. We let $L_i$ be the length of the $i$th good path. Once the agent is engaged on the $i$th path, there are still $n_g + n_b$ actions available but only one of them keeps the agent on the good path. All the other actions will produce unpleasant observations, i.e., the agent will enter a bad path.
This process continues until the agent reaches the end of the $i$th path, which is determined by the path's length $L_i$. If the $i$th path was the longest of the good paths, then the agent will from then on only receive pleasant observations independently of the action performed. If the $i$th path was not the longest path, then independently of the action performed, the agent will enter a bad path.
To summarize, at the beginning of each trial, the agent is prompted with $n_g$ good paths and $n_b$ bad paths. Only the longest good path is beneficial in the long term; the others are traps, which will ultimately lead the agent to a bad state. Figure 2 illustrates this environment.
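The dynamics just described can be sketched as a small simulator (a hypothetical implementation for illustration; the class name, action encoding, and the convention that action 0 keeps the agent on its current path are our own assumptions):

```python
class DeepRewardEnv:
    """Deep reward environment sketch: good path i has length lengths[i];
    only the longest good path keeps yielding pleasant observations
    after its end, every other choice eventually traps the agent."""
    def __init__(self, lengths, n_bad):
        self.lengths, self.n_bad = lengths, n_bad
        self.reset()

    def reset(self):
        self.path, self.depth, self.on_bad_path = None, 0, False
        return "pleasant"  # the root state emits a pleasant observation

    def step(self, action):
        self.depth += 1
        if self.on_bad_path:
            return "unpleasant"
        if self.path is None:                    # first choice: pick a path
            if action < len(self.lengths):
                self.path = action               # a good path was chosen
            else:
                self.on_bad_path = True
                return "unpleasant"
        elif self.depth <= self.lengths[self.path]:
            if action != 0:                      # only action 0 stays on the path
                self.on_bad_path = True
                return "unpleasant"
        elif self.lengths[self.path] != max(self.lengths):
            self.on_bad_path = True              # shorter good paths are traps
            return "unpleasant"
        return "pleasant"
```

With `lengths=[2, 3]`, both paths look identical for the first two steps, which is exactly why a shallow planner cannot tell the trap from the goal.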
3.2 BTAI vs active inference: Space and Time complexity
In this section, we compare our model to the standard model of active inference (Friston et al., 2016a; Da Costa et al., 2020). In the standard formulation, the implementation needs to store the parameters of the posterior over states for each policy and each time step. Therefore, assuming $|A|$ possible actions, $T$ time steps, $|A|^T$ policies, and $|S|$ possible hidden state values, the space complexity class for storing the parameters of the posterior over hidden states is $O(|S| \times T \times |A|^T)$. This corresponds to the number of parameters that need to be stored, and it is a problem because $|A|^T$ grows exponentially with the number of time steps. Additionally, performing inference on an exponential number of parameters will lead to an exponential time complexity class.
BTAI solves this problem by allowing only $K$ expansions of the tree. In BTAI, we need to store $|S|$ parameters for each time step in the past and present, and for each expansion, we only need to compute and store the parameters of the posterior over the hidden states corresponding to this expansion. Therefore, the time and space complexity class is $O(|S| \times (t + K|A|))$, where $t$ is the current time point. This is linear in the number of expansions. Now, the question is how many expansions are required to solve the task? Even if the task requires the tree to be fully expanded, the complexity class of BTAI would be $O(|S| \times (t + |A|^T))$, which removes the factor of $T$ present in the standard formulation. Figure 3 illustrates the difference between AI and BTAI in terms of the space complexity class, when BTAI performs a full expansion of the tree.
In addition to the gain afforded by the structure of the tree, most practical applications can be solved by expanding only a small number of nodes (Silver et al., 2016; Schrittwieser et al., 2019), which means that MCTS and BTAI approaches will be even more efficient than suggested by Figure 3, because most branches will not be expanded.
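A back-of-the-envelope calculation makes the gap concrete (a sketch of the parameter counts discussed above, not a measurement of the SPM or Homing Pigeon implementations):

```python
# |S| states, |A| actions, horizon T, current time t, K expansions.
def ai_params(n_states, n_actions, horizon):
    # standard active inference: |S| * T * |A|**T posterior parameters
    return n_states * horizon * n_actions ** horizon

def btai_params(n_states, n_actions, t, n_expansions):
    # BTAI: |S| per past/present slice plus |A| new nodes per expansion
    return n_states * (t + n_expansions * n_actions)

print(ai_params(10, 5, 8))        # 31250000 parameters
print(btai_params(10, 5, 1, 20))  # 1010 parameters
```

Even a fairly generous expansion budget leaves BTAI several orders of magnitude below the exhaustive policy enumeration.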
One could argue that there is a trade-off in the nature and extent of the information inferred by classic active inference and branching-time active inference. Specifically, classic active inference exhaustively represents and updates all possible policies, while branching-time active inference will typically only represent a small subset of the possible trajectories. These will typically be the more advantageous paths for the agent to pursue, with the less beneficial paths not represented at all. Indeed, the tree search is based on the expected free energy, which favors policies that maximize information gain while realizing the prior preferences of the agent.
Additionally, the inference process can update the system’s understanding of past contingencies on the basis of new observations. As a result, the system can obtain more refined information about previous decisions, perhaps reevaluating the optimality of these past decisions. Because classic active inference represents a larger space of policies, this reevaluation could apply to more policies.
We also know that humans engage in counterfactual reasoning (Rafetseder et al., 2013), which, in our planning context, could involve the entertainment and evaluation of alternative (non-selected) sequences of decisions. It may be that, because of its more exhaustive representation of possible trajectories, classic active inference can more efficiently engage in counterfactual reasoning. In contrast, branching-time active inference would require these alternative pasts to be generated afresh for each counterfactual deliberation. In this sense, one might argue that there is a trade-off: branching-time active inference provides considerably more efficient planning to attain current goals, while classic active inference provides a more exhaustive assessment of paths not taken.
3.3 BTAI vs active inference: Simulations
In this section, we compare BTAI to standard active inference in the context of three deep reward environments. All the environments have two good paths and five bad paths. However, the lengths of the two good paths (i.e., $L_1$ and $L_2$) change from environment to environment, as summarised in Table 1. For all the environments $L_1 < L_2$; therefore, the first path is a trap that will lead to a bad state, and the second path is the path the agent should take. Also, to identify that the first path is a trap, the agent must be able to plan at least $L_1 + 1$ steps ahead, since before that the two good paths are identical.
Environment  $L_1$  $L_2$
first  2  3
second  4  5
third  7  9

Table 1: Lengths of the two good paths in each deep reward environment.
Table 2 shows the result of SPM simulations in which a standard active inference agent is run on the three deep reward environments presented in Table 1. As expected, the agent successfully solved the first two environments, for which it was required to plan three and five steps ahead. However, the third deep reward environment is intractable using standard active inference, i.e., the simulation runs out of memory because of the exponential (space) complexity class.
Environment  Policy size  P(goal)  P(trap) 

first  3  1  0 
second  5  1  0 
third  8  crash  crash 
Table 3 shows the result obtained by BTAI on the three deep reward environments presented in Table 1. As expected, the agent successfully solved the three deep reward environments. Ten planning iterations were required for the first environment, fifteen were required for the second environment, and twenty for the third.
Environment  Planning iterations  P(goal)  P(trap)
first  10  1  0
second  10  0.49  0.51
second  15  1  0
third  10  0.47  0.53
third  15  0.55  0.45
third  20  1  0
4 BTAI Empirical Intuition
In this section, we study the BTAI agent's behaviour through experiments highlighting its vulnerability to local minima and ways to mitigate this issue. The goal is to gain some intuition about how the model behaves when: enabling deeper searches, providing better preferences, and using different kinds of cost functions to guide the Monte Carlo tree search. The code for those experiments is available on GitHub at the following URL: https://github.com/ChampiB/Experiments_AI_TS.
4.1 The maze environment
This section presents the environment in which various simulations will be run. In this environment, the agent can be understood as a rat navigating a maze. Figure 4 illustrates the three mazes studied in the following sections. The agent can perform five actions, i.e., UP, DOWN, LEFT, RIGHT and IDLE. The goal is to reach the maze exit from the agent's starting position. To do so, the agent must move from empty cell to empty cell while avoiding walls. If the agent tries to move through a wall, the action becomes equivalent to IDLE. Finally, the observations made by the agent correspond to the Manhattan distance (with the ability to traverse walls) between its current position and the maze exit. Taking the first maze of Figure 4 as an example, if the agent stands on the exit, the observation will be zero or equivalently, using one-hot encoding, [1 0 0 0 0 0 0 0 0], and if the agent stands at the initial position, the observation will be nine or equivalently [0 0 0 0 0 0 0 0 1].
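The observation model is easy to make concrete (a sketch with hypothetical grid coordinates, not the mazes of Figure 4):

```python
def manhattan_distance(pos, exit_pos):
    # distance that ignores walls, as in the maze observations above
    return abs(pos[0] - exit_pos[0]) + abs(pos[1] - exit_pos[1])

def one_hot(index, n_obs):
    return [1 if i == index else 0 for i in range(n_obs)]

# Hypothetical coordinates: two cells below and one to the right of the exit.
obs = manhattan_distance((3, 2), (1, 1))
encoding = one_hot(obs, 10)
```

Here `obs` is 3, so the one-hot vector has a single 1 at index 3.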
4.2 Overcoming the challenge of local minima
In this section, we investigate the challenge of local minima and provide two ways of mitigating the issue: improving the prior preferences and using a deeper tree.
4.2.1 Prior preferences and local minimum
In this first experiment, the agent was asked to solve the second maze of Figure 4, which has the property that the start location (red square) is a local minimum. Remember from Section 4.1 that the agent observes the Manhattan distance between its location and the maze exit. The Manhattan distance naturally creates local minima throughout the mazes, i.e., cells of the maze (apart from the exit) for which no adjacent cell has a lower distance to the exit. An example of such a local minimum is shown as a blue square in Figure 5. The presence of such local minima implies that a well-behaved agent (i.e., an agent trying to get as close as possible to the exit) might get trapped in those cells and thus fail to solve the task.
Next, we need to define the prior preferences of the agent. Our framework allows the modeller to define prior preferences over both future observations and future states. However, we start by assuming no preferences over the hidden states, i.e., $V(S_I)$ is uniform. We define the prior preferences over future observations as:

$$V(O_I) = \sigma\!\left(-\gamma \times [0, 1, \ldots, |O| - 1]^\top\right),$$

where $|O|$ is the number of possible observations (9 in the first maze of Figure 4), $\gamma$ is the precision of the prior preferences, and $\sigma$ is the softmax function. The above prior preferences will give high probability to cells close to the exit and will exhibit the local minimum behaviours previously mentioned. Indeed, the minus sign before $\gamma$ makes the elements of $V(O_I)$ big when the corresponding distances to the exit are small, and vice versa.
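These observation preferences can be computed directly (a sketch under the reconstruction above; the function names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def observation_preferences(n_obs, gamma):
    """Softmax of minus gamma times the vector of possible
    distances to the exit, i.e. [0, 1, ..., n_obs - 1]."""
    return softmax(-gamma * np.arange(n_obs))

v = observation_preferences(9, 2.0)
```

The exit (distance zero) receives the highest probability, and the probabilities decay monotonically with the distance; raising the precision `gamma` sharpens the decay.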
Using these prior preferences, we ran 100 simulations in the second maze of Figure 4. Each simulation was composed of 20 action-perception cycles. Note, the results might vary from simulation to simulation, because the actions performed in the environment are sampled from $\sigma(-\omega \, \bar{C} \oslash n)$, where $\sigma$ is the softmax function, $\omega$ is the precision of action selection, $\bar{C}$ is a vector whose elements correspond to the aggregated cost of the root's children (i.e. the children of $S_t$), $n$ is a vector whose elements correspond to the number of visits of the root's children, and $\oslash$ denotes elementwise division.

Table 4 reports the frequency at which the agent reaches the exit when using an exploration constant of two in the UCT criterion and the expected free energy $G_{I \cdot a}$ as the cost function. First, note that with only 10 planning iterations, the agent is unable to leave the initial position (i.e., it is trapped in the local minimum). But as the number of planning iterations is increased, the agent becomes able to foresee the benefits of leaving the local minimum.
Planning iterations  P(exit)  P(local) 

10  0  1 
15  0.98  0.02 
20  1  0 
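The stochastic action selection used in these simulations can be sketched as follows (a hypothetical implementation; combining the aggregated costs and visit counts as an average cost per visit is our assumption about the exact form):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def action_distribution(costs, visits, omega):
    """Softmax of minus omega times the average cost per visit
    of each child of the root node."""
    avg_cost = np.asarray(costs, dtype=float) / np.asarray(visits, dtype=float)
    return softmax(-omega * avg_cost)

p = action_distribution(costs=[10.0, 2.0, 8.0], visits=[5, 5, 5], omega=1.0)
```

The cheapest child is the most likely action, and increasing the precision `omega` makes the selection closer to deterministic, which explains why results vary between simulations at moderate precision.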
4.2.2 Improving prior preference to avoid local minimum
In this second experiment, we modified the prior preferences of the agent to enable it to avoid local minima. We first change the cost function from the expected free energy $G_{I \cdot a}$ to the pure cost $C_{I \cdot a}$, which allows us to set non-trivial preferences over states (in the previous simulation, these were set to uniform). Specifically, the prior preferences over hidden states are of the form:

$$V(S_I) = \sigma\!\left(\gamma \times \boldsymbol{v}\right),$$

where $\gamma$ is the precision over prior preferences, and $\boldsymbol{v}$ is set according to Figure 6. Also, we set the precision $\gamma$ to two and the exploration constant in the UCT criterion to two. Finally, the prior preferences over future observations remain the same as in the previous section.
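The state preferences follow the same softmax recipe, but applied to a hand-crafted value map (the value vector below is a hypothetical 1-D stand-in for the map of Figure 6, with the exit as the last cell):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Prior preferences over hidden states: softmax of the precision times
# a hand-crafted value map assigning higher values to cells nearer the exit.
values = np.array([0.0, 1.0, 2.0, 3.0])
gamma = 2.0
v_states = softmax(gamma * values)
```

Unlike the observation preferences, this map can encode knowledge about the maze layout itself, e.g. penalising cells that lie in dead ends.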
Tables 5 and 6 summarize the results of the experiments with and without the use of prior preferences over hidden states, respectively. As expected, better prior preferences lead to better performance when fewer planning iterations are performed. Specifying prior preferences over hidden states requires the modeller to bring additional knowledge to the agent, which might not always be possible. However, when such knowledge is available, it can improve the agent's performance. This illustrates the value of the BTAI approach, which enables preferences to be specified for observations, as does active inference, as well as for states.
Planning iterations  P(global)  P(local) 
10  0  1 
15  0  1 
20  1  0 
Planning iterations  P(global)  P(local) 
10  0  1 
15  1  0 
20  1  0 
4.3 Solving more mazes
Up to now, we have focused on the second maze of Figure 4 to demonstrate that both improving the prior preferences and deepening the tree can help to mitigate the problem of local minima. In this section, we extend our analysis to the first and third mazes. Tables 7 and 8 show the performance of the BTAI agent in the first maze of Figure 4 when using $G_{I \cdot a}$ and $C_{I \cdot a}$ as the cost function, respectively. When $G_{I \cdot a}$ was used as the cost function, the agent was only equipped with prior preferences over observations (i.e., uniform preferences over hidden states). Additionally, in both tables, the exploration constant of the UCT criterion was set to two.
Planning iterations  P(global)  P(local) 
10  1  0 
15  1  0 
20  1  0 
Planning iterations  P(global)  P(local) 
10  0  1 
15  1  0 
20  1  0 
Importantly, BTAI is able to solve the task using both cost functions. Also, $G_{I \cdot a}$ performs better than $C_{I \cdot a}$ on this maze. However, this finding does not generalize to the third maze of Figure 4, where both $G_{I \cdot a}$ and $C_{I \cdot a}$ lead to the same results when the precision over prior preferences was set to one and the exploration constant of the UCT criterion was set to two. Those results are presented in Table 9.
Planning iterations  P(global)  P(local) 
10  0  1 
15  0  1 
20  1  0 
5 Conclusion and future works
In this paper, we provided an empirical study of branching-time active inference (BTAI), where the name takes inspiration from branching-time theories of concurrent and distributed systems in computer science (Glabbeek, 1990; van Glabbeek, 1993; Bowman, 2005), and planning was cast as structure learning (Smith et al., 2020; Friston et al., 2016b, 2018). Simply put, the generative model is dynamically expanded, and each expansion leads to the exploration of new policy fragments. The expansions are guided by the expected free energy, which provides a trade-off between exploration and exploitation. Importantly, this approach is composed of not two, but three major distributions. The first is the prior distribution (or generative model) that encodes the agent's beliefs before performing any observations. The second is the posterior (or variational) distribution encoding the updated beliefs of the agent after performing some observations. And the third is a target distribution over future states and observations that encodes the prior preferences of the agent, i.e., a generalization of the $\boldsymbol{C}$ matrix in the standard formulation of active inference proposed by Friston et al. (2016a). An important advantage of this generalization is that it allows the specification of prior preferences over both future observations and future states at the same time.
We compared BTAI and standard active inference theoretically by studying their space and time complexity classes. This study highlights that our method should perform better than the standard model used in active inference when the task can be solved by expanding the tree only a small number of times with respect to an exhaustive search. Second, we compared BTAI to active inference empirically within the deep reward environment. Those simulations suggest that BTAI is able to solve problems for which a standard active inference agent would run out of memory. As elaborated upon in Section 3.2, one might argue that there is a trade-off between branching-time active inference, which provides considerably more efficient planning to attain current goals, and classic active inference, which provides a more exhaustive assessment of paths not taken. This might enable active inference to more exhaustively reflect counterfactuals and reasoning based upon them.
Also, BTAI was studied (experimentally) in the context of a maze solving task, and we showed that when the heuristic used to create the prior preferences is not perfect, the agent becomes vulnerable to local minima. In other words, the agent might be attracted by a part of the maze that has low cost but does not allow it to solve the task. Then, we demonstrated empirically that improving the prior preferences of the agent (by specifying a good prior over future hidden states) and deepening the tree search both helped to mitigate this issue.
This paper could lead to a large number of future research directions. One could, for example, add the ability for the agent to learn the transition matrices $\boldsymbol{B}$ as well as the likelihood matrix $\boldsymbol{A}$ and the vector of initial states $\boldsymbol{D}$. This can be done in at least two ways. The first is to add Dirichlet priors over those matrices/vectors, and the second would be to use neural networks as function approximators. The second option would lead to a deep active inference agent (Sancaktar and Lanillos, 2020; Millidge, 2020) equipped with tree search that could be directly compared to the method of Fountas et al. (2020). Including deep neural networks in the framework will also open the door to direct comparisons with the deep reinforcement learning literature (Haarnoja et al., 2018; Mnih et al., 2013; van Hasselt et al., 2015; Lample and Chaplot, 2017; Silver et al., 2016). Those comparisons will enable the study of the impact of the epistemic terms when the agent is composed of deep neural networks.
Another direction of research would be to set up behavioural experiments to try to determine which kind of planning is used by the brain. This could be done by measuring the time required by a human to solve various mazes and comparing it with the predictions of both the classic model and the tree search alternative. Finally, one could also set up a hierarchical model of action, compare it to the tree search algorithm presented here, and evaluate its plausibility by running behavioural experiments on humans.
Finally, a completely different direction would be to focus on the integration of memory. At the moment, when a new action is performed in the environment and a new observation is received from it, all the branches in the tree are pruned and a new temporal slice (i.e. a new state, action and observation triple) is added to the POMDP. In other words, the integration function simply records the past. This exact recording of the past is very unlikely to really happen in the brain. Therefore, one might ask what to do with this currently ever-growing record of the past. This would certainly lead to the notion of an active inference agent equipped with episodic memory (Botvinick et al., 2019).

Appendix A: The theoretical approach of this paper
This appendix describes the generative model, the variational distribution and the update equations used throughout this paper. For full details of vocabulary and notation the reader is referred to Champion et al. (2021a).
The generative model can be understood as having a fixed part modelling the past and present, and an expandable part modelling the future. The past and present are represented as a sequence of hidden states, where the transition between any two consecutive states depends on the action performed and is modelled using the 3-tensor $\boldsymbol{B}$. The generation of an observation is modelled by the matrix $\boldsymbol{A}$, and the prior over the initial hidden state as well as the prior over the various actions are modelled using the vectors $\boldsymbol{D}$ and $\boldsymbol{E}$, respectively.
Concerning the second part of the model (i.e., the one modelling the future), the transition between consecutive states in the future is defined using the 2-subtensor $\boldsymbol{B}(\cdot, \cdot, a)$, i.e., the matrix corresponding to the last action $a$ performed to reach the node $S_I$. The generation of future observations from future hidden states is identical to the one used for the past and present.
For the sake of simplicity, we assume that the tensors $\boldsymbol{A}$, $\boldsymbol{B}$, $\boldsymbol{D}$ and $\boldsymbol{E}$ are given to the agent, which means that the agent knows the dynamics of the environment (cf. Table 10 for additional information about those tensors). Practically, this means that the generative model does not have Dirichlet priors over those tensors. Furthermore, we follow Parr and Friston (2018) by viewing future observations as latent random variables. The formal definition of the generative model, which encodes our prior knowledge of the task, is given by:

$$P(O_{0:t}, S_{0:t}, A_{0:t-1}, O_{\mathbb{I}}, S_{\mathbb{I}}) = P(S_0)\, P(O_0 \mid S_0) \prod_{\tau=1}^{t} P(A_{\tau-1})\, P(S_\tau \mid S_{\tau-1}, A_{\tau-1})\, P(O_\tau \mid S_\tau) \prod_{I \in \mathbb{I}} P(S_I \mid S_{\rho(I)})\, P(O_I \mid S_I),$$

where $\mathbb{I}$ is the set of all non-empty multi-indices already expanded, and $\rho(I)$ is the parent of $I$. Additionally, we need to define the individual factors:
$$P(S_0) = \mathrm{Cat}(\boldsymbol{D}), \quad P(A_\tau) = \mathrm{Cat}(\boldsymbol{E}), \quad P(O_\tau \mid S_\tau) = \mathrm{Cat}(\boldsymbol{A}), \quad P(S_\tau \mid S_{\tau-1}, A_{\tau-1}) = \mathrm{Cat}(\boldsymbol{B}),$$
$$P(O_I \mid S_I) = \mathrm{Cat}(\boldsymbol{A}), \quad P(S_I \mid S_{\rho(I)}) = \mathrm{Cat}(\boldsymbol{B}(\cdot, \cdot, a)),$$

where $a$ is the last index of the multi-index $I$, i.e., the last action that led to $S_I$, and $\boldsymbol{B}(\cdot, \cdot, a)$ is the matrix corresponding to $a$. We now turn to the definition of the variational posterior. Under the mean-field approximation:

$$Q(S_{0:t}, A_{0:t-1}, O_{\mathbb{I}}, S_{\mathbb{I}}) = \prod_{\tau=0}^{t} Q(S_\tau) \prod_{\tau=0}^{t-1} Q(A_\tau) \prod_{I \in \mathbb{I}} Q(S_I)\, Q(O_I),$$
where the individual factors are defined as:

$$Q(S_\tau) = \mathrm{Cat}(\hat{\boldsymbol{s}}_\tau), \quad Q(A_\tau) = \mathrm{Cat}(\hat{\boldsymbol{a}}_\tau), \quad Q(S_I) = \mathrm{Cat}(\hat{\boldsymbol{s}}_I), \quad Q(O_I) = \mathrm{Cat}(\hat{\boldsymbol{o}}_I).$$
Lastly, we follow Millidge et al. (2021) in assuming that the agent aims to minimise the KL divergence between the variational posterior and a desired (target) distribution. Therefore, our framework allows for the specification of prior preferences over both future hidden states and future observations:

$$V(O_{\mathbb{I}}, S_{\mathbb{I}}) = \prod_{I \in \mathbb{I}} V(O_I)\, V(S_I),$$

where the individual factors are defined as:

$$V(O_I) = \mathrm{Cat}(\boldsymbol{C}_O), \quad V(S_I) = \mathrm{Cat}(\boldsymbol{C}_S).$$
Importantly, $V(O_I)$ and $V(S_I)$ play the role of the $\boldsymbol{C}$ vector in the active inference model (Friston et al., 2016a), i.e., they specify which observations and hidden states are rewarding. To sum up, this framework is defined using three distributions: the prior defines the agent's beliefs before performing any observation, the posterior is an updated version of the prior that takes into account the observations made by the agent, and the target (desired) distribution encodes the agent's prior preferences in terms of future observations and hidden states.
Finally, the update equations used in this paper rely on variational message passing as presented in (Champion et al., 2021b; Winn and Bishop, 2005) and are given by:
where $o_\tau$ is the observation made at time step $\tau$, $I_{last}$ is the last action of the sequence $I$, $child(root)$ is the set of multi-indices corresponding to the children of the root node, and $child(I)$ is the set of multi-indices corresponding to the children of $I$. Once again, for additional information about this formalism, the reader is referred to (Champion et al., 2021a).
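The shape of these updates can be sketched as follows, assuming the likelihood and transition matrices are known and using purely illustrative values (not taken from the paper): the belief over a future hidden state combines the expected log-message from its parent with the expected log-message from its predicted observation, and is normalised with a softmax; the observation belief is refreshed in turn.

```python
import numpy as np

def softmax(x):
    """Normalised exponential of a vector of log-values."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical likelihood and transition matrices (not from the paper):
# A[o, s] = P(o | s); B_last[s_next, s] = transition of the last action of I.
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])
B_last = np.array([[0.7, 0.4],
                   [0.3, 0.6]])
D_parent = np.array([0.5, 0.5])  # posterior over the parent hidden state
E_I = np.array([0.5, 0.5])       # initial posterior over the observation O_I

# Iterate the coupled updates: the belief over the future state S_I combines
# the expected log-transition (message from the parent) with the expected
# log-likelihood (message from O_I); the belief over O_I is updated in turn.
for _ in range(10):
    D_I = softmax(np.log(B_last) @ D_parent + np.log(A).T @ E_I)
    E_I = softmax(np.log(A) @ D_I)
```

A leaf node has no children, so the summation over $child(I)$ is empty here; interior nodes would simply add one more log-message per child before the softmax.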
References
 Botvinick et al. (2019) Botvinick, M., Ritter, S., Wang, J. X., Kurth-Nelson, Z., Blundell, C., and Hassabis, D. (2019). Reinforcement learning, fast and slow. Trends in Cognitive Sciences, 23(5):408–422.
 Botvinick and Toussaint (2012) Botvinick, M. and Toussaint, M. (2012). Planning as inference. Trends in Cognitive Sciences, 16(10):485–488.
 Bowman (2005) Bowman, H. (2005). Concurrency Theory: Calculi and Automata for Modelling Untimed and Timed Concurrent Systems. Springer, Dordrecht.
 Browne et al. (2012) Browne, C. B., Powley, E., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., and Colton, S. (2012). A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43.
 Catal et al. (2020) Catal, O., Verbelen, T., Nauta, J., De Boom, C., and Dhoedt, B. (2020). Learning perception and planning with deep active inference. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3952–3956. IEEE.
 Champion et al. (2021a) Champion, T., Bowman, H., and Grześ, M. (2021a). Branching time active inference: the theory and its generality.
 Champion et al. (2021b) Champion, T., Grześ, M., and Bowman, H. (2021b). Realizing Active Inference in Variational Message Passing: The Outcome-Blind Certainty Seeker. Neural Computation, pages 1–65.
 Cox et al. (2019) Cox, M., van de Laar, T., and de Vries, B. (2019). A factor graph approach to automated design of Bayesian signal processing algorithms. Int. J. Approx. Reason., 104:185–204.
 Cullen et al. (2018) Cullen, M., Davey, B., Friston, K. J., and Moran, R. J. (2018). Active Inference in OpenAI Gym: A Paradigm for Computational Investigations Into Psychiatric Illness. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3(9):809–818. Computational Methods and Modeling in Psychiatry.
 Da Costa et al. (2020) Da Costa, L., Parr, T., Sajid, N., Veselic, S., Neacsu, V., and Friston, K. (2020). Active inference on discrete state-spaces: A synthesis. Journal of Mathematical Psychology, 99:102447.
 FitzGerald et al. (2015) FitzGerald, T. H. B., Dolan, R. J., and Friston, K. (2015). Dopamine, reward learning, and active inference. Frontiers in Computational Neuroscience, 9:136.
 Forney (2001) Forney, G. D. (2001). Codes on graphs: normal realizations. IEEE Transactions on Information Theory, 47(2):520–548.
 Fountas et al. (2020) Fountas, Z., Sajid, N., Mediano, P. A. M., and Friston, K. (2020). Deep active inference agents using Monte-Carlo methods. arXiv e-prints, page arXiv:2006.04176.
 Fox and Roberts (2012) Fox, C. W. and Roberts, S. J. (2012). A tutorial on variational Bayesian inference. Artificial Intelligence Review, 38(2):85–95.
 Friston et al. (2021) Friston, K., Da Costa, L., Hafner, D., Hesp, C., and Parr, T. (2021). Sophisticated Inference. Neural Computation, 33(3):713–763.
 Friston et al. (2016a) Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., O'Doherty, J., and Pezzulo, G. (2016a). Active inference and learning. Neuroscience & Biobehavioral Reviews, 68:862–879.
 Friston et al. (2018) Friston, K., Parr, T., and Zeidman, P. (2018). Bayesian model reduction. arXiv e-prints, page arXiv:1805.07092.
 Friston (2007) Friston, K. J. (2007). Statistical parametric mapping: the analysis of functional brain images. Elsevier.
 Friston et al. (2016b) Friston, K. J., Litvak, V., Oswal, A., Razi, A., Stephan, K. E., van Wijk, B. C., Ziegler, G., and Zeidman, P. (2016b). Bayesian model reduction and empirical Bayes for group (DCM) studies. NeuroImage, 128:413–431.
 Glabbeek (1990) Glabbeek, R. J. v. (1990). The linear time–branching time spectrum (extended abstract). In Proceedings of the Theories of Concurrency: Unification and Extension, CONCUR '90, pages 278–297, Berlin, Heidelberg. Springer-Verlag.
 Haarnoja et al. (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290.
 Itti and Baldi (2009) Itti, L. and Baldi, P. (2009). Bayesian surprise attracts human attention. Vision Research, 49(10):1295–1306. Visual Attention: Psychophysics, electrophysiology and neuroimaging.
 Lample and Chaplot (2017) Lample, G. and Chaplot, D. S. (2017). Playing FPS games with deep reinforcement learning. In Singh, S. P. and Markovitch, S., editors, Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pages 2140–2146. AAAI Press.
 Millidge (2019) Millidge, B. (2019). Combining active inference and hierarchical predictive coding: A tutorial introduction and case study.
 Millidge (2020) Millidge, B. (2020). Deep active inference as variational policy gradients. Journal of Mathematical Psychology, 96:102348.
 Millidge et al. (2021) Millidge, B., Tschantz, A., and Buckley, C. L. (2021). Whence the expected free energy? Neural Comput., 33(2):447–482.
 Mnih et al. (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. A. (2013). Playing Atari with Deep Reinforcement Learning. CoRR, abs/1312.5602.
 Parr and Friston (2018) Parr, T. and Friston, K. J. (2018). Generalised free energy and active inference: can the future cause the past? bioRxiv.
 Pezzato et al. (2020) Pezzato, C., Corbato, C. H., and Wisse, M. (2020). Active inference and behavior trees for reactive action planning and execution in robotics. CoRR, abs/2011.09756.
 Rafetseder et al. (2013) Rafetseder, E., Schwitalla, M., and Perner, J. (2013). Counterfactual reasoning: From childhood to adulthood. Journal of experimental child psychology, 114(3):389–404.
 Sancaktar and Lanillos (2020) Sancaktar, C. and Lanillos, P. (2020). End-to-end pixel-based deep active inference for body perception and action. ArXiv, abs/2001.05847.
 Sancaktar et al. (2020) Sancaktar, C., van Gerven, M. A. J., and Lanillos, P. (2020). End-to-end pixel-based deep active inference for body perception and action. In Joint IEEE 10th International Conference on Development and Learning and Epigenetic Robotics, ICDL-EpiRob 2020, Valparaiso, Chile, October 26-30, 2020, pages 1–8. IEEE.
 Schrittwieser et al. (2019) Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K., Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hassabis, D., Graepel, T., Lillicrap, T. P., and Silver, D. (2019). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. ArXiv, abs/1911.08265.
 Schwartenbeck et al. (2018) Schwartenbeck, P., Passecker, J., Hauser, T. U., FitzGerald, T. H. B., Kronbichler, M., and Friston, K. (2018). Computational mechanisms of curiosity and goal-directed exploration. bioRxiv.
 Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T. P., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489.
 Smith et al. (2020) Smith, R., Schwartenbeck, P., Parr, T., and Friston, K. J. (2020). An active inference approach to modeling structure learning: Concept learning as an example case. Frontiers in Computational Neuroscience, 14:41.
 van de Laar and de Vries (2019) van de Laar, T. and de Vries, B. (2019). Simulating active inference processes by message passing. Front. Robotics and AI, 2019.
 van Glabbeek (1993) van Glabbeek, R. J. (1993). The linear time — branching time spectrum II. In Best, E., editor, CONCUR’93, pages 66–81, Berlin, Heidelberg. Springer Berlin Heidelberg.
 van Hasselt et al. (2015) van Hasselt, H., Guez, A., and Silver, D. (2015). Deep reinforcement learning with double Q-learning. CoRR, abs/1509.06461.
 Winn and Bishop (2005) Winn, J. and Bishop, C. (2005). Variational message passing. Journal of Machine Learning Research, 6:661–694.