1 Introduction
Intelligent agents are usually faced with the task of optimizing some utility function that is a priori unknown and can only be evaluated sample-wise. We do not restrict the form of this function; in principle it could be a classification or regression loss, a reward function in a reinforcement learning environment, or any other utility function. The framework of information-theoretic bounded rationality [16, 17] and related information-theoretic models [3, 14, 20, 21, 23] provide a formal framework to model agents that behave in a computationally restricted manner by modeling resource constraints through information-theoretic constraints. Such limitations also lead to the emergence of hierarchies and abstractions [5], which can be exploited to reduce computational and search effort. Recently, the main principles have been successfully applied to spiking and artificial neural networks, in particular feedforward neural network learning problems, where the information-theoretic constraint was mainly employed as a form of regularization [7, 11, 12, 18]. In this work we introduce bounded rational decision-making with adaptive generative neural network priors. We investigate the interaction between anytime sample-based decision-making processes and concurrent improvement of prior policies through learning, where the prior policies are parameterized as Variational Autoencoders [10], a recently proposed generative neural network model.
The paper is structured as follows. In Section 2 we discuss the basic concepts of information-theoretic bounded rationality, sample-based interpretations of bounded rationality in the context of Markov Chain Monte Carlo (MCMC), and the basic concepts of Variational Autoencoders. In Section 3 we present the proposed decision-making model by combining sample-based decision-making with concurrent learning of priors parameterized by Variational Autoencoders. In Section 4 we evaluate the model with toy examples. In Section 5 we discuss our results.
2 Methods
2.1 Bounded Rational Decision Making
The foundational concept in decision-making theory is Maximum Expected Utility [22], whereby an agent is modeled as choosing actions such that it maximizes its expected utility
$$\max_{p(a \mid w)} \sum_{w} p(w) \sum_{a} p(a \mid w)\, U(w, a), \tag{1}$$
where $a$ is an action from the action space $\mathcal{A}$, $w$ is a world state from the world state space $\mathcal{W}$, and $U(w, a)$ is a utility function. We assume that the world states are distributed according to a known and fixed distribution $p(w)$ and that the world states $w$ are finite and discrete. In the case of a single world state or world state distribution $p(w) = \delta(w - w_0)$, the decision-making problem simplifies into a single function optimization problem $a^* = \arg\max_a U(a)$. In many cases, solving such optimization problems may require an exhaustive search, where simple enumeration is extremely expensive.
A bounded rational decision maker tackles the above decision-making problem by settling on a good enough solution. Finding a bounded optimal policy requires maximizing the utility function while simultaneously remaining within some given constraints. The resulting policy is a conditional probability distribution $p(a \mid w)$, which essentially consists of choosing an action $a$ given a particular world state $w$. The constraints of limited information-processing resources can be formalized by setting an upper bound on the Kullback-Leibler divergence $D_{\mathrm{KL}}$ (say $B$ bits) that the decision-maker is maximally allowed to spend to transform its prior strategy $p(a)$ into a posterior strategy $p(a \mid w)$ through deliberation. This results in the following constrained optimization problem [5]:
$$\max_{p(a \mid w)} \sum_{w} p(w) \sum_{a} p(a \mid w)\, U(w, a) \quad \text{s.t.} \quad \sum_{w} p(w)\, D_{\mathrm{KL}}\big(p(a \mid w) \,\|\, p(a)\big) \le B. \tag{2}$$
This constrained optimization problem can be formulated as an unconstrained problem [16]:
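$$\max_{p(a \mid w)} \sum_{w} p(w) \left[ \sum_{a} p(a \mid w)\, U(w, a) - \frac{1}{\beta}\, D_{\mathrm{KL}}\big(p(a \mid w) \,\|\, p(a)\big) \right], \tag{3}$$
where the inverse temperature $\beta \in \mathbb{R}^{+}$ is a Lagrange multiplier that governs the trade-off between expected utility gain and information cost. For $\beta \to \infty$ the agent behaves perfectly rationally and for $\beta \to 0$ the agent can only act according to the prior policy. The optimal prior policy in this case is given by the marginal $p(a) = \sum_{w} p(w)\, p(a \mid w)$ [5], in which case the expected Kullback-Leibler divergence becomes equal to the mutual information, i.e. $\sum_{w} p(w)\, D_{\mathrm{KL}}\big(p(a \mid w) \,\|\, p(a)\big) = I(W; A)$. The solution to the optimization problem (3) can be found by iterating the following set of self-consistent equations [5]:
$$p(a \mid w) = \frac{1}{Z(w)}\, p(a)\, \exp\big(\beta\, U(w, a)\big), \qquad p(a) = \sum_{w} p(w)\, p(a \mid w),$$
where $Z(w) = \sum_{a} p(a) \exp\big(\beta\, U(w, a)\big)$ is the normalization factor. Computing such a normalization factor is usually computationally expensive as it involves summing over spaces with high cardinality. We avoid this by Monte Carlo approximation.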
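For small discrete problems the self-consistent iteration can be carried out exactly, which is useful as a reference point. The following is a minimal sketch, assuming a utility table of shape (world states, actions) and a finite, non-negative inverse temperature; the function names bounded_rational_policy and mutual_information are illustrative and not taken from the paper.

```python
import numpy as np

def bounded_rational_policy(U, p_w, beta, n_iter=200):
    """Alternate p(a|w) ∝ p(a) exp(beta U(w,a)) and p(a) = sum_w p(w) p(a|w).

    U    : array (n_w, n_a) with utilities U(w, a)
    p_w  : array (n_w,) with the world-state distribution p(w)
    beta : inverse temperature (assumed finite and non-negative)
    """
    U = np.asarray(U, dtype=float)
    p_w = np.asarray(p_w, dtype=float)
    n_a = U.shape[1]
    p_a = np.full(n_a, 1.0 / n_a)  # start from a uniform prior
    # Subtracting the row-wise maximum leaves p(a|w) unchanged but avoids overflow.
    boltzmann = np.exp(beta * (U - U.max(axis=1, keepdims=True)))
    for _ in range(n_iter):
        p_a_given_w = p_a[None, :] * boltzmann
        p_a_given_w /= p_a_given_w.sum(axis=1, keepdims=True)  # division by Z(w)
        p_a = p_w @ p_a_given_w                                # marginal prior
    return p_a_given_w, p_a

def mutual_information(p_a_given_w, p_w):
    """I(W;A) in bits for the joint p(w) p(a|w)."""
    joint = p_w[:, None] * p_a_given_w
    p_a = joint.sum(axis=0)
    outer = p_w[:, None] * p_a[None, :]
    mask = joint > 0  # terms with zero joint probability contribute nothing
    return float(np.sum(joint[mask] * np.log2(joint[mask] / outer[mask])))
```

With two world states that prefer opposite actions, beta = 0 leaves the posterior equal to the prior (zero bits), while a large beta drives the policy towards a deterministic mapping that costs close to one bit.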
2.2 MCMC as Sample-Based Bounded Rational Decision-Making
Monte Carlo methods are mostly used to solve two related kinds of problems. One is to generate samples from a given distribution $p(x)$ and the other is to estimate the expectation of a function. For example, if $f(x)$ is a function for which we need to compute the expectation $\Phi = \mathbb{E}_{p(x)}[f(x)]$, we can draw $N$ samples $\{x_i\}_{i=1}^{N}$ to obtain the estimate $\hat{\Phi} = \frac{1}{N} \sum_{i} f(x_i)$ [15]. Samples can be drawn by employing Markov Chains to simulate stochastic processes. A Markov Chain can be defined by an initial probability $p^{0}(x)$ and a transition probability $T(x' \mid x)$, which gives the probability of transitioning from state $x$ to $x'$. The probability of being in state $x'$ at the $(t+1)$-th iteration is given by:
$$p^{t+1}(x') = \sum_{x} T(x' \mid x)\, p^{t}(x). \tag{4}$$
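The update in Equation (4) is a matrix-vector product when the state space is finite. As a minimal sketch, the toy three-state transition matrix below (an assumption made for illustration, not taken from the paper) is propagated from two different initial distributions.

```python
import numpy as np

# Toy 3-state chain (an assumption for illustration, not taken from the paper).
# T[j, i] is the probability of moving from state i to state j, so columns sum to one.
T = np.array([
    [0.50, 0.25, 0.00],
    [0.50, 0.50, 0.50],
    [0.00, 0.25, 0.50],
])

def propagate(p0, T, n_steps):
    """Apply p^{t+1}(x') = sum_x T(x'|x) p^t(x) for n_steps iterations."""
    p = np.asarray(p0, dtype=float)
    for _ in range(n_steps):
        p = T @ p
    return p

p_from_first = propagate([1.0, 0.0, 0.0], T, n_steps=100)
p_from_last = propagate([0.0, 0.0, 1.0], T, n_steps=100)
```

Both runs end at the same distribution (0.25, 0.5, 0.25), which a further application of T leaves unchanged.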
Such a chain can be used to generate sample proposals from a desired target distribution $p^{*}(x)$, if the following prerequisites are met [15]. Firstly, the chain must be ergodic, i.e. the chain must converge to $p^{*}(x)$ independent of the initial distribution $p^{0}(x)$. Secondly, the desired distribution must be an invariant distribution of the chain. A distribution $p^{*}(x)$ is an invariant distribution of $T(x' \mid x)$ if its probability vector is an eigenvector of the transition probability matrix with eigenvalue one. A sufficient, but not necessary, condition to fulfill this requirement is detailed balance, i.e. the probability of going from state $x$ to $x'$ is the same as going from $x'$ to $x$: $T(x' \mid x)\, p^{*}(x) = T(x \mid x')\, p^{*}(x')$.
An MCMC chain can be viewed as a bounded rational decision-making process for a single context in the sense that it performs an anytime optimization of a utility function $U(x)$ with some precision $\gamma$ and that it is initialized with a prior $p^{0}(x)$. The target distribution has to be chosen as $p^{*}(x) \propto e^{\gamma U(x)}$ in this case. A decision is made with the last sample when the chain is stopped. The resource corresponds then to the number of steps the chain has taken to evaluate the function $U(x)$. To find the transition probabilities $T(x' \mid x)$ of the chain, we assume detailed balance and a Metropolis-Hastings scheme such that
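$$g(x' \mid x)\, A(x' \mid x)\, p^{*}(x) = g(x \mid x')\, A(x \mid x')\, p^{*}(x'), \tag{5}$$
with a proposal distribution $g(x' \mid x)$ and an acceptance probability $A(x' \mid x)$, whose product gives the transition probability $T(x' \mid x) = g(x' \mid x)\, A(x' \mid x)$ for $x' \neq x$. One common choice that satisfies Equation (5) is
$$A(x' \mid x) = \min\left\{1,\; \frac{g(x \mid x')\, p^{*}(x')}{g(x' \mid x)\, p^{*}(x)}\right\}, \tag{6}$$
which can be further simplified when using a symmetric proposal distribution with $g(x' \mid x) = g(x \mid x')$, resulting in $A(x' \mid x) = \min\{1,\; p^{*}(x') / p^{*}(x)\}$.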
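As a minimal sketch of such a chain, the following Metropolis sampler uses a symmetric Gaussian random-walk proposal and assumes an unnormalized target proportional to exp(gamma * U(x)), so that the acceptance ratio reduces to exp(gamma * (U(x') - U(x))). The function name metropolis_chain, the step size step_std and the quadratic test utility are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def metropolis_chain(utility, x0, gamma, n_steps, step_std=0.1, rng=None):
    """Random-walk Metropolis chain for a target proportional to exp(gamma * U(x)).

    Returns the full trajectory; the last entry is the decision made when
    the chain is stopped after n_steps utility evaluations.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = float(x0)
    u = utility(x)
    trajectory = [x]
    for _ in range(n_steps):
        x_new = x + step_std * rng.normal()  # symmetric proposal g(x'|x) = g(x|x')
        u_new = utility(x_new)
        # Acceptance probability min{1, exp(gamma * (U(x') - U(x)))}
        accept_prob = np.exp(min(0.0, gamma * (u_new - u)))
        if rng.uniform() < accept_prob:
            x, u = x_new, u_new
        trajectory.append(x)
    return np.array(trajectory)
```

Stopping the chain early yields a sample that still reflects the initialization, while a long chain with high precision concentrates around the utility optimum; this is the sense in which the number of steps acts as a resource.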
Note that the decision of the chain will in general follow a non-equilibrium distribution, but we can use the bounded rational optimum as a normative baseline to quantify how efficiently resources are used by analyzing how closely the bounded rational equilibrium is approximated.
2.3 Representing Prior Strategies with Variational Autoencoders
While an anytime optimization process such as MCMC can be regarded as a transformation from prior to posterior, the question remains how to choose the prior. While the prior may be assumed to be fixed, it would be far more efficient if the prior itself were subjected to an optimization process that minimizes the overall information-processing costs. Since in the case of multiple world states the optimal prior is given by the marginal $p(a) = \sum_{w} p(w)\, p(a \mid w)$, we can use the outputs of the anytime decision-making process to train a generative model of the prior $p(a)$. If the generative model were chosen from a parametric family such as a Gaussian distribution, then training would consist of updating the mean and variance of the Gaussian. Choosing such a parametric family imposes restrictions on the shape of the prior, in particular in the continuous domain. Therefore, we investigate non-parametric generative models of the prior, in particular neural network models such as Variational Autoencoders (VAEs).
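VAEs were introduced by [10] as generative models that use an architecture similar to that of deterministic autoencoder networks. Their functioning is best understood as variational Bayesian inference in a latent variable model $p_{\theta}(x \mid z)$ with prior $p(z)$, where $x$ is observable data, and $z$ is the latent variable that explains the data, but that cannot be observed directly. The aim is to find a parameter $\theta$ that maximizes the likelihood of the data $p_{\theta}(x) = \int p_{\theta}(x \mid z)\, p(z)\, dz$. Samples from $p_{\theta}(x)$ can then be generated by first sampling $z$ and then sampling an $x$ from $p_{\theta}(x \mid z)$. As the maximum likelihood optimization may prove difficult due to the integral, we may express the likelihood in a different form by assuming a distribution $q_{\phi}(z \mid x)$ such that
$$\log p_{\theta}(x) = D_{\mathrm{KL}}\big(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)\big) + \mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big] - D_{\mathrm{KL}}\big(q_{\phi}(z \mid x) \,\|\, p(z)\big). \tag{7}$$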
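Assuming that the distribution $q_{\phi}(z \mid x)$ is expressive enough to approximate the true posterior $p_{\theta}(z \mid x)$ reasonably well, we can neglect the Kullback-Leibler divergence between the two distributions, and directly optimize the lower bound through gradient-based optimization. In VAEs $q_{\phi}(z \mid x)$ is called the encoder that translates from $x$ to $z$ and $p_{\theta}(x \mid z)$ is called the decoder that translates from $z$ to $x$. Both distributions and the prior are assumed to be Gaussian
$$q_{\phi}(z \mid x) = \mathcal{N}\big(z \mid \mu_{\phi}(x), \Sigma_{\phi}(x)\big), \qquad p_{\theta}(x \mid z) = \mathcal{N}\big(x \mid \mu_{\theta}(z), \sigma^{2} I\big), \qquad p(z) = \mathcal{N}(z \mid 0, I),$$
where $\mu_{\phi}(x)$, $\Sigma_{\phi}(x)$ and $\mu_{\theta}(z)$ are nonlinear functions implemented by feedforward neural networks and where it is ensured that $\sigma^{2} > 0$ and that $\Sigma_{\phi}(x)$ is a covariance matrix.
Note that the optimization of the autoencoder itself can also be viewed as a bounded rational choice
$$\max_{\theta, \phi}\; \mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big] - D_{\mathrm{KL}}\big(q_{\phi}(z \mid x) \,\|\, p(z)\big), \tag{8}$$
where the expected likelihood is maximized while the encoder distribution $q_{\phi}(z \mid x)$ is kept close to the prior $p(z)$.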
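The two terms of this objective can be evaluated numerically once the encoder outputs are available. The sketch below makes several assumptions that go beyond the text: a diagonal encoder covariance (so the divergence to a standard normal has a closed form), a fixed decoder variance sigma2, and a hypothetical decode callable standing in for the decoder mean network.

```python
import numpy as np

def kl_diag_gaussian_to_standard_normal(mu, var):
    """D_KL( N(mu, diag(var)) || N(0, I) ) in nats."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    return 0.5 * float(np.sum(var + mu ** 2 - 1.0 - np.log(var)))

def elbo_single_sample(x, mu_z, var_z, decode, sigma2, rng):
    """One-sample estimate of E_q[log p(x|z)] - D_KL(q(z|x) || p(z)).

    mu_z, var_z : encoder outputs for the input x (diagonal covariance assumed)
    decode      : hypothetical callable mapping a latent z to the decoder mean
    sigma2      : fixed decoder variance (an assumption of this sketch)
    """
    x = np.asarray(x, dtype=float)
    mu_z = np.asarray(mu_z, dtype=float)
    var_z = np.asarray(var_z, dtype=float)
    eps = rng.standard_normal(mu_z.shape)
    z = mu_z + np.sqrt(var_z) * eps  # reparameterized sample from q(z|x)
    x_hat = decode(z)
    # Gaussian log-likelihood: a scaled quadratic reconstruction error plus a constant
    log_lik = (-0.5 * np.sum((x - x_hat) ** 2) / sigma2
               - 0.5 * x.size * np.log(2.0 * np.pi * sigma2))
    return float(log_lik) - kl_diag_gaussian_to_standard_normal(mu_z, var_z)
```

The divergence term vanishes only when the encoder output coincides with the latent prior, which is the sense in which the encoder is kept close to the prior while the reconstruction term is maximized.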
3 Modeling Bounded Rationality with Adaptive Neural Network Priors
In this section we combine MCMC anytime decision processes with adaptive autoencoder priors. In the case of a single world state, the combination is straightforward in that each decision selected by the MCMC process is fed as an observable input to an autoencoder. The updated autoencoder is then used as an improved prior to initialize the next MCMC decision. In the case of multiple world states, there are two straightforward scenarios. In the first scenario there are as many priors as world states and each of them is updated independently. For each world state we obtain exactly the same solution as in the single world state case. In the second scenario there is only a single prior over actions for all world states. In this case the autoencoder is trained with the decisions made by all MCMC chains such that the autoencoder should converge to the optimal rate distortion prior. A third, more interesting scenario occurs when we allow multiple priors, but fewer priors than world states (compare Figure 1). This is especially plausible when dealing with continuous world states, but also in the case of large discrete spaces.
3.1 Decision making with multiple priors
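Decision-making with multiple priors can be regarded as a multi-agent decision-making problem where several bounded rational decision-makers are combined into a single decision-making process [5]. In our case the most suitable arrangement of decision-makers is a two-step process where first each world state is assigned probabilistically to a prior, which is then used in the second step to initialize an MCMC chain (compare Figure 1). The output of that chain is then used to train the autoencoder corresponding to the selected prior. As each prior may be responsible for multiple world states, each prior will learn an abstraction that is specialized for this subspace of world states. This two-stage decision process can be formalized as a bounded rational optimization problem
$$\max_{p(x \mid w),\, p(a \mid w, x)} \sum_{w, x, a} p(w)\, p(x \mid w)\, p(a \mid w, x)\, U(w, a) - \frac{1}{\beta_1} I(W; X) - \frac{1}{\beta_2} I(W; A \mid X), \tag{9}$$
where $p(x \mid w)$ is selecting the responsible prior $p(a \mid x)$ indexed by $x$ for world state $w$. The resource parameter for the first selection stage is given by $\beta_1$ and by $\beta_2$ for the second decision made by the MCMC process. The solution of optimization (9) is given by the following set of equations:
$$\begin{aligned}
p(x \mid w) &= \frac{1}{Z(w)}\, p(x)\, \exp\big(\beta_1\, \Delta F_{\mathrm{par}}(w, x)\big), \\
p(x) &= \sum_{w} p(w)\, p(x \mid w), \\
p(a \mid w, x) &= \frac{1}{Z(w, x)}\, p(a \mid x)\, \exp\big(\beta_2\, U(w, a)\big), \\
p(a \mid x) &= \sum_{w} p(w \mid x)\, p(a \mid w, x), \\
\Delta F_{\mathrm{par}}(w, x) &= \mathbb{E}_{p(a \mid w, x)}\big[U(w, a)\big] - \frac{1}{\beta_2}\, D_{\mathrm{KL}}\big(p(a \mid w, x) \,\|\, p(a \mid x)\big),
\end{aligned} \tag{10}$$
where $Z(w)$ and $Z(w, x)$ are the normalization factors and $\Delta F_{\mathrm{par}}(w, x)$ is the free energy of the action selection stage. The marginal distribution $p(a \mid x)$ encapsulates an action selection policy consisting of the priors weighted by the responsibilities given by the Bayesian posterior $p(w \mid x)$. Note that the Bayesian posterior $p(w \mid x)$ is not determined by a given likelihood model, but is the result of the optimization process (9).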
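For a small discrete problem, the two-stage solution can be approximated by iterating its self-consistent equations in turn. The sketch below assumes the standard form of the two-stage solution (a selection posterior weighted by the free energy of the action stage, and an action posterior weighted by the utility); the function name two_stage_solution, the random initialization and the guard for unused priors are illustrative choices, and both resource parameters are assumed to be moderate and strictly positive.

```python
import numpy as np

def two_stage_solution(U, p_w, n_priors, beta1, beta2, n_iter=500, seed=0):
    """Iterate the two-stage self-consistent equations on a discrete problem.

    U : array (n_w, n_a) of utilities; p_w : array (n_w,) of world-state probabilities.
    Returns p(x|w) of shape (n_w, n_x), p(a|x) of shape (n_x, n_a)
    and p(a|w,x) of shape (n_w, n_x, n_a).
    """
    tiny = 1e-300
    rng = np.random.default_rng(seed)
    n_w, n_a = U.shape
    # Random initial priors p(a|x) break the symmetry between prior indices.
    p_a_x = rng.dirichlet(np.ones(n_a), size=n_priors)
    p_x = np.full(n_priors, 1.0 / n_priors)
    boltz = np.exp(beta2 * (U - U.max(axis=1, keepdims=True)))  # (n_w, n_a)
    for _ in range(n_iter):
        # Action posterior p(a|w,x) ∝ p(a|x) exp(beta2 U(w,a))
        p_a_wx = p_a_x[None, :, :] * boltz[:, None, :]
        p_a_wx /= p_a_wx.sum(axis=2, keepdims=True)
        # Free energy of the action stage: E[U] - KL / beta2
        exp_u = np.einsum('wxa,wa->wx', p_a_wx, U)
        ratio = np.maximum(p_a_wx, tiny) / np.maximum(p_a_x[None, :, :], tiny)
        kl = np.sum(p_a_wx * np.log(ratio), axis=2)
        dF = exp_u - kl / beta2
        # Selection posterior p(x|w) ∝ p(x) exp(beta1 dF(w,x))
        p_x_w = p_x[None, :] * np.exp(beta1 * (dF - dF.max(axis=1, keepdims=True)))
        p_x_w /= p_x_w.sum(axis=1, keepdims=True)
        p_x = p_w @ p_x_w
        # Prior update p(a|x) = sum_w p(w|x) p(a|w,x), with p(w|x) from Bayes' rule
        p_w_x = p_w[:, None] * p_x_w / np.maximum(p_x, tiny)[None, :]
        new_p_a_x = np.einsum('wx,wxa->xa', p_w_x, p_a_wx)
        # Keep the previous prior for indices that are (almost) never selected.
        p_a_x = np.where(p_x[:, None] > 1e-12, new_p_a_x, p_a_x)
    return p_x_w, p_a_x, p_a_wx
```

With fewer priors than world states, the selection posterior has to group several world states under one prior, which is where abstraction enters in this formulation.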
3.2 Model Architecture
Equations (10) describe abstractly how a two-step decision process with bounded rational decision-makers should be optimally partitioned. In this section we propose a sample-based model of a bounded rational decision process that approximately corresponds to Equations (10) such that the performance of the decision process can be compared against its normative baseline. To translate Equations (10) into a stochastic process we proceed in three steps. First, we implement the priors $p(a \mid x)$ as Variational Autoencoders. Second, we formulate an MCMC chain that is initialized with a sample from the prior $p(a \mid x)$ and generates a decision $a$. Third, we design an MCMC chain that functions as a selector between the different priors.
3.2.1 Autoencoder Priors.
Each prior $p(a \mid x)$ in Equations (10) is represented by a VAE that learns to generate action samples that mimic the samples given by the MCMC chains (compare Figure 2). The functions $\mu_{\phi}$, $\Sigma_{\phi}$ and $\mu_{\theta}$ are implemented as feedforward neural networks with one hidden layer. The units in the hidden layer were all chosen with sigmoid activation functions; the output units of the $\mu$ functions were also chosen as sigmoids and those of the $\Sigma$ function as ReLU. During training the weights $\theta$ and $\phi$ are adapted to optimize the expected log-likelihood of the action samples that are given by the decisions made by the MCMC chains for all world states that have been assigned to the prior $p(a \mid x)$. Due to the Gaussian shape of the decoder distribution, optimizing the log-likelihood corresponds to minimizing the quadratic loss of the reconstruction error. After training, the network can generate sample actions itself by feeding the decoder network with samples from the latent prior $p(z)$.
3.2.2 MCMC Decision-Making.
To implement the bounded rational decision-maker we obtain an action sample $a$ from the autoencoder prior $p(a \mid x)$ to initialize an MCMC chain that optimizes the target utility $U(w, a)$ for the given world state. We run the MCMC chain for $k_{\max}$ steps. In each step we generate a proposal $a'$ from a Gaussian distribution centered on the current sample $a$ and accept with probability
$$A(a' \mid a) = \min\Big\{1,\; \exp\big(\gamma_k\, (U(w, a') - U(w, a))\big)\Big\}. \tag{11}$$
Over the course of the $k_{\max}$ time steps, the precision $\gamma_k$ is adjusted following an annealing schedule conditioned on the maximum number of steps $k_{\max}$. We use an inverse Boltzmann annealing schedule, in which the precision grows logarithmically with the step index $k$ and is scaled by a tuning parameter. The rationale behind this is that we assume the sampling process to be coarse-grained in the beginning and to become finer during the search.
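A minimal sketch of such an annealed decision chain is given below. The exact schedule is an assumption of this sketch: the precision is taken to grow as gamma0 + alpha * log(1 + k), which is the inverse of a Boltzmann temperature schedule; the names inverse_boltzmann_precision and annealed_decision, the proposal width step_std and the default values are hypothetical.

```python
import numpy as np

def inverse_boltzmann_precision(k, alpha, gamma0=0.0):
    """Assumed schedule: precision gamma_k = gamma0 + alpha * log(1 + k)."""
    return gamma0 + alpha * np.log1p(k)

def annealed_decision(utility, a_init, k_max, alpha, step_std=0.05, rng=None):
    """Run k_max annealed Metropolis steps starting from a prior sample a_init.

    utility : callable a -> U(w, a) for the fixed world state of this decision
    a_init  : action sample drawn from the prior (e.g. a generative model)
    """
    rng = np.random.default_rng() if rng is None else rng
    a = float(a_init)
    u = utility(a)
    for k in range(1, k_max + 1):
        gamma_k = inverse_boltzmann_precision(k, alpha)
        a_new = a + step_std * rng.normal()  # Gaussian proposal around the current action
        u_new = utility(a_new)
        # Early steps use low precision (coarse search), later steps high precision.
        if rng.uniform() < np.exp(min(0.0, gamma_k * (u_new - u))):
            a, u = a_new, u_new
    return a
```

Because the chain starts from a prior sample, a prior that already places mass near the optimum leaves less work for the chain, so fewer steps are needed for the same utility.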
3.2.3 Prior Selection.
To implement the bounded rational prior selection through an MCMC process, we first sample an $x$ from the prior $p(x)$ and start an MCMC chain that (approximately) optimizes $\Delta F_{\mathrm{par}}(w, x)$ for a given world state $w$ sampled from $p(w)$. The prior $p(x)$ is represented by a multinomial and updated by the frequencies of the selected prior indices $x$. The number of steps in the prior selection MCMC chain was kept constant at a fixed value and, similarly to the action selection stage, the precision was annealed over the course of the time steps. The target $\Delta F_{\mathrm{par}}(w, x)$ comprises a trade-off between expected utility and information resources. However, it cannot be directly evaluated and would require the computation of $D_{\mathrm{KL}}\big(p(a \mid w, x) \,\|\, p(a \mid x)\big)$. Here we use the number of steps in the downstream MCMC process as a resource measure. As the number of downstream steps was constant, the model selector's choice only depended on the average utility achieved by each decision-maker, which results in the acceptance rule
$$A(x' \mid x) = \min\Big\{1,\; \exp\big(\gamma\, (\bar{U}(w, x') - \bar{U}(w, x))\big)\Big\},$$
where $\bar{U}(w, x)$ denotes the average utility achieved by the decision-maker with prior $x$ in world state $w$. As the priors are discrete choices, the proposal distribution samples globally, with uniform probability over all prior indices $x'$.
4 Empirical Results
To demonstrate our approach we evaluate two scenarios. First, we evaluate a simple agent that is equipped with a single prior policy $p(a)$, as introduced in Section 2. In the case of a single agent there is no need for a prior selection stage. Second, we evaluate a multi-prior decision-making system and compare the results to the single-prior agent. For the multi-prior agent, we split a fixed number of MCMC steps between the prior selection and the action selection. The task we designed consists of six world states, where each world state has a Gaussian utility function over a bounded action interval with a unique optimum. In both settings, we equipped the Variational Autoencoders with one hidden layer consisting of 16 units with ReLU activations. We implemented the experiments using Keras [2]. We show the results in Figure 3.
Our results indicate that using MCMC evaluation steps as a surrogate for information processing costs can be interpreted as bounded rational decision-making. In Figure 3 we show the efficiency of several agents with different processing constraints. To compare our results to the theoretical baseline, we discretized the action space into 100 equidistant slices and solved the problem using the algorithm proposed in [5] to implement Equations (10). Furthermore, our results indicate that the multi-prior system generally outperforms the single-prior system in terms of utility.
To illustrate the differences in efficiency between the single-prior agent and the multi-prior agents, we plotted in Figure 4 the utility gained through the second MCMC optimization. For multi-prior agents the difference is caused by specialized priors, which provide initializations to the MCMC chains that are close to the optimal action. In this particular case, the utility gain does not become zero because we allow only three priors to cover six world states, thus leading to abstraction, i.e. specializing on actions that fit well for the assigned world states. In single-prior agents, the prior adapts to all world states, thus providing, on average, an initial action that is suboptimal for the requested world state.
5 Discussion
In this study we implemented bounded rational decision makers with adaptive priors. We achieved this with Variational Autoencoder priors. The bounded rational decision-making process was implemented by MCMC optimization to find the optimal posterior strategy, thus giving a computationally simple way of generating samples. As the number of steps in the optimization process was constrained, we could quantify the information processing capabilities of the resulting decision-makers using relative Shannon entropy. Our analysis may have interesting implications, as it provides a normative framework for this kind of combined optimization of adaptive priors and decision-making processes. Prior to our work there have been several attempts to apply the framework of information-theoretic bounded rationality to machine learning tasks [7, 11, 12, 18]. The novelty of our approach is that we design adaptive priors for both the single-step case and the multi-agent case and we demonstrate how to transform information-theoretic constraints into computational constraints in the form of MCMC steps.
Recently, the combination of Monte Carlo optimization and neural networks has gained increasing popularity. These approaches include both using MCMC processes to find optimal weights in ANNs [1, 4] and using ANNs as parametrized proposal distributions in MCMC processes [8, 13]. While our approach is more similar to the latter, the important difference is that in such adaptive MCMC approaches there is only a single MCMC chain with a single (adaptive) proposal to optimize a single task, whereas in our case there are multiple adaptive priors to initialize multiple chains with otherwise fixed proposals, which can be used to learn multiple tasks simultaneously. In that sense our work is more related to mixture-of-experts methods and divide-and-conquer paradigms [6, 9, 24], where we employ a selection policy rather than a blending policy, as we design our model specifically to encourage specialization. In mixture-of-experts models, there are multiple decision-makers that correspond to multiple priors in our case, but experts are typically not modeled as anytime optimization processes. Possibly the most popular combination of neural network learning with Monte Carlo methods was achieved by AlphaGo [19], which beat the leading Go champion by optimizing the strategies provided by value networks and policy networks with Monte Carlo Tree Search, leading to a major breakthrough in reinforcement learning. An important difference here is that the neural network is used to directly approximate the posterior and Monte Carlo Tree Search is used to improve performance by concentrating on the most promising moves during learning, whereas in our case ANNs are used to represent the prior. Moreover, in our work we assumed the utility function (i.e. the value network) to be given. For future work it would be interesting to investigate how to incorporate learning the utility function into our model to investigate more complex scenarios such as in reinforcement learning.
5.0.1 Acknowledgement.
This work was supported by the European Research Council Starting Grant BRISC, ERC-STG-2015, Project ID 678082.
5.0.2 Open Access
This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License
(http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
References

[1] Andrieu, C., De Freitas, N., Doucet, A.: Reversible jump MCMC simulated annealing for neural networks. In: Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence. pp. 11–18. Morgan Kaufmann Publishers Inc. (2000)
[2] Chollet, F., et al.: Keras. https://keras.io (2015)
[3] Vul, E., Goodman, N., Griffiths, T.L., Tenenbaum, J.B.: One and done? Optimal decisions from very few samples. Cognitive Science 38(4), 599–637 (2014)
[4] Freitas, J.d., Niranjan, M., Gee, A.H., Doucet, A.: Sequential Monte Carlo methods to train neural network models. Neural Computation 12(4), 955–993 (2000)
[5] Genewein, T., Leibfried, F., Grau-Moya, J., Braun, D.A.: Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI 2, 27 (2015)
[6] Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., Levine, S.: Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874 (2017)
[7] Grau-Moya, J., Leibfried, F., Genewein, T., Braun, D.A.: Planning with information-processing constraints and model uncertainty in Markov decision processes. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 475–491. Springer (2016)
[8] Gu, S., Ghahramani, Z., Turner, R.E.: Neural adaptive sequential Monte Carlo. In: Advances in Neural Information Processing Systems. pp. 2629–2637 (2015)
[9] Haruno, M., Wolpert, D.M., Kawato, M.: Mosaic model for sensorimotor learning and control. Neural Computation 13(10), 2201–2220 (2001)
[10] Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)
[11] Leibfried, F., Braun, D.A.: A reward-maximizing spiking neuron as a bounded rational decision maker. Neural Computation 27(8), 1686–1720 (2015)
[12] Leibfried, F., Grau-Moya, J., Ammar, H.B.: An information-theoretic optimality principle for deep reinforcement learning. arXiv preprint arXiv:1708.01867 (2017)
[13] Levy, D., Hoffman, M.D., Sohl-Dickstein, J.: Generalizing Hamiltonian Monte Carlo with neural networks. International Conference on Learning Representations (2018)
[14] Lewis, R.L., Howes, A., Singh, S.: Computational rationality: Linking mechanism and behavior through bounded utility maximization. Topics in Cognitive Science 6(2), 279–311 (2014)
[15] MacKay, D.J.: Introduction to Monte Carlo methods. In: Learning in Graphical Models, pp. 175–204. Springer (1998)
[16] Ortega, P.A., Braun, D.A.: Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences 469(2153) (2013)
[17] Ortega, P.A., Braun, D.A., Dyer, J., Kim, K.E., Tishby, N.: Information-theoretic bounded rationality. arXiv preprint arXiv:1512.06789 (2015)
[18] Peng, Z., Genewein, T., Leibfried, F., Braun, D.A.: An information-theoretic online update principle for perception-action coupling. In: Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. pp. 789–796. IEEE (2017)
[19] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.: Mastering the game of Go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016)
[20] Tishby, N., Polani, D.: Information theory of decisions and actions. In: Perception-Action Cycle: Models, Architectures, and Hardware. Springer (2011)
[21] Todorov, E.: Efficient computation of optimal actions. Proceedings of the National Academy of Sciences 106(28), 11478–11483 (2009)
[22] Von Neumann, J., Morgenstern, O.: Theory of Games and Economic Behavior (commemorative edition). Princeton University Press (2007)
[23] Wolpert, D.H.: Information Theory – The Bridge Connecting Bounded Rational Game Theory and Statistical Physics, pp. 262–290. Springer Berlin Heidelberg, Berlin, Heidelberg (2006)
[24] Yuksel, S.E., Wilson, J.N., Gader, P.D.: Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems 23(8), 1177–1193 (2012)