1 Introduction
Traditional statistical estimation, or statistical inference in general, is static, in the sense that the estimate of the quantity of interest does not affect the future evolution of the quantity. In some sequential estimation problems however, we encounter the situation where the future values of the quantity to be estimated depend on the estimate of its current value. Examples include: 1) stock price prediction by big investors, where the prediction of today’s price of a stock affects today’s investment decision, which further changes the stock’s supply-demand status and hence its price tomorrow; 2) interactive product recommendation, where the estimate of a customer’s preference based on their activity leads to certain product recommendations, which would in turn shape the customer’s future activity and preference; 3) behavior prediction in multi-agent systems, e.g. predicting the intentions of vehicles on the road adjacent to the ego vehicle, where the prediction of an adjacent vehicle’s intention based on its current driving situation leads to a certain action of the ego vehicle, which can change the future driving situation and intention of that adjacent vehicle.
Broadly speaking, this type of interactive sequential estimation problem arises in any autonomous agent that interacts with a system of interest through a measurement-inference-action loop. During the interaction, the inference, either estimation or prediction, of a property of the system based on the measurements of its current and past states affects the action to be taken by the autonomous agent, which further influences the future states and properties of the system of interest. We may call such problems dynamic inference.
In Section 2, a mathematical formulation of dynamic inference is given under a Bayesian probabilistic framework. It is shown in Section 3 that this problem can be converted to a Markov decision process (MDP), and the optimal estimation strategy that minimizes the overall inference loss can be derived as the optimal policy of this MDP through dynamic programming. Two examples, stock trend prediction and vehicle behavior prediction, are given in Section 4 to illustrate how the optimal estimation strategy for dynamic inference works, and how it differs from the solution to traditional statistical inference.
Section 5 briefly discusses the problem of learning for dynamic inference, which addresses the situation where the underlying probabilistic models for dynamic inference are unknown. Learning for dynamic inference can potentially serve as a unifying meta problem of machine learning, such that supervised learning, imitation learning, and reinforcement learning can be cast as its special instances. Having a good understanding of dynamic inference and its learning extension will thus be helpful in gaining a better understanding of a broad spectrum of machine learning problems. The formulation of dynamic inference appears to be new, but it can be related to a variety of existing interactive decision-making problems and prediction problems that take the consequence of the prediction into account. Moreover, any MDP may be thought of as a dynamic inference problem. These related problems are discussed in Section 6.
2 Problem formulation
2.1 Traditional statistical inference
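In traditional statistical inference, the goal is to estimate a quantity of interest $Y \in \mathsf{Y}$ based on an observation $X \in \mathsf{X}$ that statistically depends on $Y$. Under the Bayesian formulation, the pair $(X, Y)$ is modeled as a jointly distributed random vector with distribution $P_{X,Y}$. Given a loss function $\ell: \mathsf{Y} \times \hat{\mathsf{Y}} \to \mathbb{R}$, the optimal estimator $\psi_{\rm B}$, a.k.a. the Bayes estimator, is a map $\mathsf{X} \to \hat{\mathsf{Y}}$ that achieves the minimum expected loss:
$$\psi_{\rm B} = \arg\min_{\psi: \mathsf{X} \to \hat{\mathsf{Y}}} \mathbb{E}\big[\ell(Y, \psi(X))\big]. \tag{1}$$
A basic result from estimation theory is that for any $x \in \mathsf{X}$, the optimal estimate of $Y$ given $X = x$ is a minimizer of the expected posterior loss, i.e. $\psi_{\rm B}(x) = \arg\min_{\hat{y} \in \hat{\mathsf{Y}}} \mathbb{E}[\ell(Y, \hat{y}) \mid X = x]$. The above statistical inference problem is static, in the sense that only one round of estimation is considered. When there is a need to estimate a sequence of quantities $Y_1, \dots, Y_n$ based on observations $X_1, \dots, X_n$, if the pairs $(X_i, Y_i)$ are i.i.d. for $i = 1, \dots, n$, the sequential estimation problem to minimize the accumulated expected loss can be optimally solved by repeatedly using the same single-round optimal estimator $\psi_{\rm B}$.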
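For finite alphabets, the posterior-loss characterization of the Bayes estimator can be computed directly. The following is a minimal sketch, assuming the joint distribution is given as a table `p_xy[x][y]` and the loss as a table `loss[y][y_hat]`; the function name and the numbers are hypothetical and serve only as an illustration.

```python
def bayes_estimator(p_xy, loss):
    """Return psi with psi[x] = argmin over y_hat of sum_y p_xy[x][y] * loss[y][y_hat].

    The sum is the posterior expected loss scaled by P(X = x); the scaling
    does not change the minimizer, so no normalization is needed.
    """
    num_y_hat = len(loss[0])
    psi = []
    for row in p_xy:
        risks = [sum(p * loss[y][y_hat] for y, p in enumerate(row))
                 for y_hat in range(num_y_hat)]
        psi.append(min(range(num_y_hat), key=risks.__getitem__))
    return psi


# Hypothetical joint pmf of (X, Y) on {0, 1} x {0, 1} and the 0-1 loss.
P_XY = [[0.4, 0.1],
        [0.2, 0.3]]
ZERO_ONE = [[0.0, 1.0],
            [1.0, 0.0]]

psi_b = bayes_estimator(P_XY, ZERO_ONE)
# Expected loss of the resulting estimator (the Bayes risk for this table).
bayes_risk = sum(P_XY[x][y] * ZERO_ONE[y][psi_b[x]]
                 for x in range(2) for y in range(2))
```

In the i.i.d. sequential setting described above, the same table `psi_b` would simply be reused in every round.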
2.2 Dynamic inference
The problem of $n$-round dynamic inference is to sequentially estimate $n$ quantities of interest $Y^n = (Y_1, \dots, Y_n)$ based on observations $X^n = (X_1, \dots, X_n)$, where in each round $i$, the quantity of interest $Y_i$ only depends on the observation $X_i$, while $X_i$ depends on the observation $X_{i-1}$ and the estimate $\hat{Y}_{i-1}$ of $Y_{i-1}$ in the previous round; the estimate $\hat{Y}_i$ of $Y_i$ is made potentially based on all the information available so far, namely $(X^i, Y^{i-1})$, with the goal of minimizing the expected accumulated loss over the $n$ rounds. Here it is assumed that after the $i$th round of estimation, $Y_i$ is revealed to the estimator. It can also happen that $Y_1, \dots, Y_n$ are never revealed during the process, in which case $Y_i$ is estimated only based on $X^i$. Nevertheless, it will be shown in Section 3 that an optimal estimation strategy can estimate $Y_i$ only based on the instantaneous observation $X_i$, whether or not $Y^{i-1}$ are available.
Formally, we assume the knowledge of the distribution $P_{X_1}$ of the initial observation, the probability transition kernel $K_{X_i | X_{i-1}, \hat{Y}_{i-1}}$ of the $i$th observation given the observation and the estimate in the previous round, $i = 2, \dots, n$, and the probability transition kernel $K_{Y_i | X_i}$ of the $i$th quantity of interest given the $i$th observation, $i = 1, \dots, n$. We may call these two types of probability transition kernels the observation-transition model and the quantity-generation model, respectively. The estimates are sequentially made according to an estimation strategy:
Definition 1.
An estimation strategy for an $n$-round dynamic inference is a sequence of estimators $\psi^n = (\psi_1, \dots, \psi_n)$, where $\psi_i: \mathsf{X}^i \times \mathsf{Y}^{i-1} \to \hat{\mathsf{Y}}$ is the estimator used in the $i$th round, $i = 1, \dots, n$, which maps the history of observations and revealed quantities of interest to an estimate of $Y_i$, such that $\hat{Y}_i = \psi_i(X^i, Y^{i-1})$.
Any specification of $P_{X_1}$, $(K_{X_i | X_{i-1}, \hat{Y}_{i-1}})_{i=2}^n$, $(K_{Y_i | X_i})_{i=1}^n$, and $\psi^n$ defines a joint distribution of all the random variables $(X^n, Y^n, \hat{Y}^n)$ under consideration. The Bayesian network of the random variables in dynamic inference with a Markov estimation strategy, meaning that each estimate has the form $\hat{Y}_i = \psi_i(X_i)$, is illustrated in Fig. 1.
We use a loss function $\ell: \mathsf{X} \times \mathsf{Y} \times \hat{\mathsf{Y}} \to \mathbb{R}$ to evaluate the estimate made in each round in dynamic inference. This loss function is a generalization of the one used in statistical inference, in the sense that the estimate in each round is evaluated in the context of the observation in that round. Given an estimation strategy $\psi^n$, we define its inference loss $L(\psi^n)$ as the expected accumulated loss over the $n$ rounds, $L(\psi^n) = \mathbb{E}\big[\sum_{i=1}^n \ell(X_i, Y_i, \hat{Y}_i)\big]$. The goal of dynamic inference is to find an estimation strategy to minimize the inference loss:
$$\arg\min_{\psi^n} \mathbb{E}\Big[\sum_{i=1}^n \ell(X_i, Y_i, \hat{Y}_i)\Big], \tag{2}$$
where $\hat{Y}_i = \psi_i(X^i, Y^{i-1})$. Comparing with the statistical inference problem in (1), we summarize the two distinctive features of dynamic inference:

1) The joint distribution of the pair $(X_i, Y_i)$ changes in each round in a controlled manner, as it depends on $(X_{i-1}, \hat{Y}_{i-1})$.

2) The loss in each round is contextual, as it depends on $X_i$.
3 Optimal estimation strategy
In this section we show that the dynamic inference problem can be converted to a Markov decision process, and the optimal estimation strategy can be found via dynamic programming.
3.1 MDP reformulation
3.1.1 Equivalent expression of inference loss
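For a given loss function $\ell$ and a joint distribution of $(X^n, Y^n, \hat{Y}^n)$, we can define a corresponding observation-estimate loss function as
$$\bar{\ell}_i(x, \hat{y}) = \mathbb{E}\big[\ell(x, Y_i, \hat{y}) \mid X_i = x, \hat{Y}_i = \hat{y}\big] \tag{3}$$
for $i = 1, \dots, n$. From the specification of the joint distribution of the random variables in the previous section, we know that in dynamic inference $Y_i$ is conditionally independent of $\hat{Y}_i$ given $X_i$, therefore for any realization $(x, \hat{y})$ of $(X_i, \hat{Y}_i)$, the value of the $i$th observation-estimate loss can be computed as
$$\bar{\ell}_i(x, \hat{y}) = \int_{\mathsf{Y}} K_{Y_i | X_i}(\mathrm{d}y \mid x)\, \ell(x, y, \hat{y}). \tag{4}$$
We see that $\bar{\ell}_i$ as a function of $(x, \hat{y})$ is determined by $\ell$ and $K_{Y_i | X_i}$, and does not depend on the estimator $\psi_i$. This fact is crucial for the optimality proof later. With the above definition, the inference loss can be expressed in terms of the observation-estimate loss: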
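When the alphabets are finite, the integral in (4) becomes a sum over the values of the quantity of interest. The sketch below assumes the quantity-generation model is stored as `k_y_given_x[x][y]` and the contextual loss as `loss[x][y][y_hat]`; these names and the example numbers are hypothetical.

```python
def observation_estimate_loss(k_y_given_x, loss):
    """Discrete form of (4): l_bar[x][y_hat] = sum_y K(y | x) * loss[x][y][y_hat]."""
    num_x = len(k_y_given_x)
    num_y = len(k_y_given_x[0])
    num_y_hat = len(loss[0][0])
    return [[sum(k_y_given_x[x][y] * loss[x][y][y_hat] for y in range(num_y))
             for y_hat in range(num_y_hat)]
            for x in range(num_x)]


# Hypothetical binary model: P(Y = 1 | X = 0) = 0.1 and P(Y = 1 | X = 1) = 0.4.
K_Y_GIVEN_X = [[0.9, 0.1],
               [0.6, 0.4]]
# 0-1 loss that ignores the observation: loss[x][y][y_hat] = 1 if y != y_hat.
ZERO_ONE = [[[float(y != y_hat) for y_hat in range(2)] for y in range(2)]
            for _ in range(2)]

l_bar = observation_estimate_loss(K_Y_GIVEN_X, ZERO_ONE)
```

With the 0-1 loss, each entry of `l_bar` is the probability of a wrong estimate given the observation and the estimate, and it involves no estimator, in line with the remark above.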
Lemma 1.
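For any estimation strategy $\psi^n$, the inference loss $L(\psi^n)$ in (2) can be rewritten as
$$L(\psi^n) = \mathbb{E}\Big[\sum_{i=1}^n \bar{\ell}_i(X_i, \hat{Y}_i)\Big]. \tag{5}$$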
Proof.
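One way to see the claim, given here only as a sketch, is to apply the law of total expectation round by round. The symbols are assumptions about notation: $\ell$ denotes the contextual loss, $\bar{\ell}_i$ the observation-estimate loss defined in (3), and $\hat{Y}_i$ the estimate made in the $i$th round.

```latex
\begin{align*}
L(\psi^n)
  &= \sum_{i=1}^n \mathbb{E}\big[\ell(X_i, Y_i, \hat{Y}_i)\big] \\
  &= \sum_{i=1}^n \mathbb{E}\Big[\mathbb{E}\big[\ell(X_i, Y_i, \hat{Y}_i) \,\big|\, X_i, \hat{Y}_i\big]\Big] \\
  &= \sum_{i=1}^n \mathbb{E}\big[\bar{\ell}_i(X_i, \hat{Y}_i)\big].
\end{align*}
```

The last step uses only the definition in (3), evaluated at the realized pair $(X_i, \hat{Y}_i)$.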
3.1.2 Optimality of Markov estimators
Next, we show that the search space of the optimization problem in (9) can be restricted to Markov estimators $\bar{\psi}_i: \mathsf{X} \to \hat{\mathsf{Y}}$, such that $\hat{Y}_i = \bar{\psi}_i(X_i)$. We start with a lemma known as Blackwell’s principle of irrelevant information [1], and provide a proof for completeness.
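Lemma 2.
For any fixed function $\ell: \mathsf{Y} \times \hat{\mathsf{Y}} \to \mathbb{R}$ and for any jointly distributed pair $(Y, Z)$,
$$\min_{\psi: \mathsf{Y} \times \mathsf{Z} \to \hat{\mathsf{Y}}} \mathbb{E}\big[\ell(Y, \psi(Y, Z))\big] = \min_{\psi: \mathsf{Y} \to \hat{\mathsf{Y}}} \mathbb{E}\big[\ell(Y, \psi(Y))\big]. \tag{10}$$
Proof.
The left side of (10) is the Bayes risk of estimating $Y$ based on $(Y, Z)$, defined with respect to the loss function $\ell$, and can be written as $R_\ell(Y | Y, Z)$; while the right side of (10) is the Bayes risk of estimating $Y$ based on $Y$ itself, also defined with respect to the loss function $\ell$, and can be written as $R_\ell(Y | Y)$. It is clear from their definitions that $R_\ell(Y | Y, Z) \le R_\ell(Y | Y)$. It also follows from a data processing inequality of Bayes risk [2, Lemma 1] that
$$R_\ell(Y | Y) \le R_\ell(Y | Y, Z), \tag{11}$$
as $Y - Y - (Y, Z)$ form a Markov chain. Hence $R_\ell(Y | Y, Z) = R_\ell(Y | Y)$, which proves the claim. ∎
The first application of Lemma 2 is to prove that the last estimator of an optimal estimation strategy can be replaced by a Markov one, which preserves the optimality.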
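Lemma 3 (Last-round lemma).
Given any estimation strategy $(\psi_1, \dots, \psi_n)$, there exists a Markov estimator $\bar{\psi}_n: \mathsf{X} \to \hat{\mathsf{Y}}$, such that
$$L(\psi_1, \dots, \psi_{n-1}, \bar{\psi}_n) \le L(\psi_1, \dots, \psi_{n-1}, \psi_n).$$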
Proof.
Lemma 2 can be further used to prove that whenever the last estimator is Markov, its preceding estimator can also be replaced by a Markov one which preserves the optimality.
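Lemma 4 ($(i-1)$th-round lemma).
For any $i \ge 2$, given any estimation strategy $(\psi_1, \dots, \psi_{i-1}, \bar{\psi}_i)$ for an $i$-round dynamic inference, if the last estimator is a Markov one $\bar{\psi}_i: \mathsf{X} \to \hat{\mathsf{Y}}$, then there exists a Markov estimator $\bar{\psi}_{i-1}: \mathsf{X} \to \hat{\mathsf{Y}}$ for the $(i-1)$th round, such that
$$L(\psi_1, \dots, \psi_{i-2}, \bar{\psi}_{i-1}, \bar{\psi}_i) \le L(\psi_1, \dots, \psi_{i-2}, \psi_{i-1}, \bar{\psi}_i).$$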
Proof.
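According to Lemma 1, the inference loss of the given $(\psi_1, \dots, \psi_{i-1}, \bar{\psi}_i)$ is
$$\mathbb{E}\Big[\sum_{j=1}^{i-2} \bar{\ell}_j(X_j, \hat{Y}_j)\Big] + \mathbb{E}\big[\bar{\ell}_{i-1}(X_{i-1}, \hat{Y}_{i-1})\big] + \mathbb{E}\big[\bar{\ell}_i(X_i, \bar{\psi}_i(X_i))\big]. \tag{14}$$
Since the first expectation in (14) does not depend on $\psi_{i-1}$, it suffices to show that there exists a Markov estimator $\bar{\psi}_{i-1}$, such that
$$\mathbb{E}\big[\bar{\ell}_{i-1}(X_{i-1}, \bar{\psi}_{i-1}(X_{i-1}))\big] + \mathbb{E}\big[\bar{\ell}_i(\bar{X}_i, \bar{\psi}_i(\bar{X}_i))\big] \le \mathbb{E}\big[\bar{\ell}_{i-1}(X_{i-1}, \hat{Y}_{i-1})\big] + \mathbb{E}\big[\bar{\ell}_i(X_i, \bar{\psi}_i(X_i))\big], \tag{15}$$
where $\bar{X}_i$ on the left side is the observation in the $i$th round when the Markov estimator $\bar{\psi}_{i-1}$ is used in the $(i-1)$th round. To get around the dependence of $X_i$ on $\psi_{i-1}$, we write the second expectation on the right side of (15) as
$$\mathbb{E}\Big[\mathbb{E}\big[\bar{\ell}_i(X_i, \bar{\psi}_i(X_i)) \mid X_{i-1}, \hat{Y}_{i-1}\big]\Big], \tag{16}$$
and notice that the inner conditional expectation as a function of $(X_{i-1}, \hat{Y}_{i-1})$ does not depend on $\psi_{i-1}$. This is because the conditional distribution of $X_i$ given $(X_{i-1}, \hat{Y}_{i-1})$ is specified by the probability transition kernel $K_{X_i | X_{i-1}, \hat{Y}_{i-1}}$. It follows that the right side of (15) can be written as
$$\mathbb{E}\Big[\bar{\ell}_{i-1}(X_{i-1}, \hat{Y}_{i-1}) + \mathbb{E}\big[\bar{\ell}_i(X_i, \bar{\psi}_i(X_i)) \mid X_{i-1}, \hat{Y}_{i-1}\big]\Big] \tag{17}$$
$$= \mathbb{E}\big[g\big(X_{i-1}, \psi_{i-1}(X^{i-1}, Y^{i-2})\big)\big], \tag{18}$$
where the function $g(x, \hat{y}) = \bar{\ell}_{i-1}(x, \hat{y}) + \mathbb{E}[\bar{\ell}_i(X_i, \bar{\psi}_i(X_i)) \mid X_{i-1} = x, \hat{Y}_{i-1} = \hat{y}]$ does not depend on $\psi_{i-1}$. It follows from Lemma 2 that there exists an estimator $\bar{\psi}_{i-1}: \mathsf{X} \to \hat{\mathsf{Y}}$, such that
$$\mathbb{E}\big[g\big(X_{i-1}, \psi_{i-1}(X^{i-1}, Y^{i-2})\big)\big] \ge \mathbb{E}\big[g\big(X_{i-1}, \bar{\psi}_{i-1}(X_{i-1})\big)\big] \tag{19}$$
$$= \mathbb{E}\Big[\bar{\ell}_{i-1}(X_{i-1}, \bar{\psi}_{i-1}(X_{i-1})) + \mathbb{E}\big[\bar{\ell}_i(\bar{X}_i, \bar{\psi}_i(\bar{X}_i)) \mid X_{i-1}\big]\Big] \tag{20}$$
$$= \mathbb{E}\big[\bar{\ell}_{i-1}(X_{i-1}, \bar{\psi}_{i-1}(X_{i-1}))\big] + \mathbb{E}\big[\bar{\ell}_i(\bar{X}_i, \bar{\psi}_i(\bar{X}_i))\big], \tag{21}$$
which proves (15) and the claim. ∎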
Theorem 1.
The minimum of $L(\psi^n)$ in (9) can be achieved by an estimation strategy with Markov estimators $\bar{\psi}_i: \mathsf{X} \to \hat{\mathsf{Y}}$, $i = 1, \dots, n$, such that $\hat{Y}_i = \bar{\psi}_i(X_i)$.
Proof.
Picking an optimal estimation strategy $\psi^n$, we first replace its last estimator by a Markov one that preserves the optimality of the strategy, as guaranteed by Lemma 3. Then, for $i = n, \dots, 2$, we repeatedly replace the $(i-1)$th estimator by a Markov one that preserves the optimality of the previous strategy, as guaranteed by Lemma 4 and the additive structure of the inference loss as in (5). Finally we obtain an estimation strategy consisting of Markov estimators achieving the same inference loss as the originally picked strategy. ∎
3.1.3 Conversion to MDP
Theorem 1 and Lemma 1 imply that the original dynamic inference problem in (2) is equivalent to
$$\arg\min_{\bar{\psi}_1, \dots, \bar{\psi}_n} \mathbb{E}\Big[\sum_{i=1}^n \bar{\ell}_i(X_i, \hat{Y}_i)\Big], \quad \hat{Y}_i = \bar{\psi}_i(X_i). \tag{22}$$
With this reformulation, we see that the unknown quantities $Y_i$ no longer appear in the loss function in (22), and the optimization problem becomes a standard MDP: the observations $X_i$ become the states in this MDP, the estimates $\hat{Y}_i$ become the actions, the probability transition kernel $K_{X_i | X_{i-1}, \hat{Y}_{i-1}}$ now defines the controlled state transition, and any Markov estimation strategy $(\bar{\psi}_1, \dots, \bar{\psi}_n)$ becomes a policy of this MDP. The goal of dynamic inference then becomes finding the optimal policy of this MDP to minimize the expected accumulated loss with respect to $\bar{\ell}_i$. Conversely, the solution to the MDP will be an optimal estimation strategy for dynamic inference.
3.2 Solution via dynamic programming
3.2.1 Optimal estimation strategy
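From the theory of MDP [3] it is known that the optimal policy for the MDP in (22), or the optimal estimation strategy for dynamic inference, can be found via dynamic programming. To derive the optimal estimators, define the functions $Q_i^*: \mathsf{X} \times \hat{\mathsf{Y}} \to \mathbb{R}$ and $V_i^*: \mathsf{X} \to \mathbb{R}$ recursively in the backward direction as $Q_n^*(x, \hat{y}) = \bar{\ell}_n(x, \hat{y})$,
$$V_i^*(x) = \min_{\hat{y} \in \hat{\mathsf{Y}}} Q_i^*(x, \hat{y}), \quad i = n, \dots, 1, \tag{23}$$
and
$$Q_i^*(x, \hat{y}) = \bar{\ell}_i(x, \hat{y}) + \mathbb{E}\big[V_{i+1}^*(X_{i+1}) \mid X_i = x, \hat{Y}_i = \hat{y}\big], \quad i = n-1, \dots, 1. \tag{24}$$
The optimal estimate to make in the $i$th round when $X_i = x$ is then
$$\psi_i^*(x) = \arg\min_{\hat{y} \in \hat{\mathsf{Y}}} Q_i^*(x, \hat{y}). \tag{25}$$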
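For finite alphabets and a stationary model, the backward recursion can be written in a few lines. The sketch below is only an illustration under stated assumptions: `l_bar[x][a]` is the observation-estimate loss, `trans[x][a][x2]` is the probability of the next observation `x2` given the current observation `x` and estimate `a`, rounds are indexed from 0, and all names and numbers are hypothetical.

```python
def solve_dynamic_inference(l_bar, trans, n):
    """Backward dynamic programming for an n-round, stationary, finite model.

    Returns (V, policy), where V[i][x] is the minimum loss-to-go and
    policy[i][x] the optimal estimate in round i (0-based) at observation x.
    """
    num_x, num_a = len(l_bar), len(l_bar[0])
    # V[n] = 0 is a terminal value, so that Q in the last round equals l_bar.
    V = [[0.0] * num_x for _ in range(n + 1)]
    policy = [[0] * num_x for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for x in range(num_x):
            q = [l_bar[x][a]
                 + sum(trans[x][a][x2] * V[i + 1][x2] for x2 in range(num_x))
                 for a in range(num_a)]
            policy[i][x] = min(range(num_a), key=q.__getitem__)
            V[i][x] = q[policy[i][x]]
    return V[:n], policy


# Hypothetical binary model: probability of a wrong estimate per (x, estimate),
# and a deterministic transition in which the next observation is 1 - estimate.
L_BAR = [[0.1, 0.9],
         [0.4, 0.6]]
TRANS = [[[0.0, 1.0], [1.0, 0.0]],
         [[0.0, 1.0], [1.0, 0.0]]]

V, policy = solve_dynamic_inference(L_BAR, TRANS, n=2)
```

In this toy model the first-round estimate at observation 1 differs from the single-round minimizer of `L_BAR[1]`, because accepting a larger immediate loss moves the next observation to the value that is easier to estimate.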
3.2.2 Minimum inference loss and loss-to-go
For any estimation strategy $\psi^n$, we can define its loss-to-go at the $i$th round of estimation when $X_i = x$ as the conditional expected loss accumulated from the $i$th round to the final round given that the observation in the $i$th round is $x$:
$$V_i(x; \psi^n) = \mathbb{E}\Big[\sum_{j=i}^n \ell(X_j, Y_j, \hat{Y}_j) \,\Big|\, X_i = x\Big]. \tag{26}$$
The following theorem states that the estimation strategy derived from dynamic programming not only achieves the minimum inference loss, but also achieves the minimum loss-to-go in each round with any observation in that round.
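Theorem 2.
The estimators $(\psi_1^*, \dots, \psi_n^*)$ defined in (25) constitute an optimal estimation strategy for dynamic inference, achieving the minimum in (2). Moreover, for any Markov estimation strategy $\psi^n$, any $i = 1, \dots, n$, and any $x \in \mathsf{X}$,
$$V_i(x; \psi^n) \ge V_i^*(x), \tag{27}$$
where the equality holds if $\psi_j = \psi_j^*$ for all $j \ge i$.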
Proof.
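The first claim stating that the estimation strategy $(\psi_1^*, \dots, \psi_n^*)$ achieves the minimum in (2) follows from the equivalence between the original problem in (2) and the MDP reformulation in (22) as discussed in Section 3.1.3, and from the well-known optimality of the solution via dynamic programming to MDP [3].
The second claim can be proved via backward induction. Consider an arbitrary Markov estimation strategy $\psi^n$.

For $i = n$ and any $x \in \mathsf{X}$, we have $V_n(x; \psi^n) = \bar{\ell}_n(x, \psi_n(x)) \ge \min_{\hat{y}} \bar{\ell}_n(x, \hat{y}) = V_n^*(x)$, with equality if $\psi_n = \psi_n^*$.

For $i = n-1, \dots, 1$, as the inductive assumption, suppose (27) holds in the $(i+1)$th round, i.e. $V_{i+1}(x; \psi^n) \ge V_{i+1}^*(x)$ for all $x$. We first show a self-recursive expression of $V_i(x; \psi^n)$:
$$V_i(x; \psi^n) = \mathbb{E}\Big[\sum_{j=i}^n \ell(X_j, Y_j, \hat{Y}_j) \,\Big|\, X_i = x\Big] \tag{31}$$
$$= \mathbb{E}\big[\ell(x, Y_i, \hat{Y}_i) \mid X_i = x\big] + \mathbb{E}\Big[\sum_{j=i+1}^n \ell(X_j, Y_j, \hat{Y}_j) \,\Big|\, X_i = x\Big] \tag{32}$$
$$= \mathbb{E}\big[\ell(x, Y_i, \hat{Y}_i) \mid X_i = x\big] + \mathbb{E}\Big[\mathbb{E}\Big[\sum_{j=i+1}^n \ell(X_j, Y_j, \hat{Y}_j) \,\Big|\, X_{i+1}, X_i = x\Big] \,\Big|\, X_i = x\Big] \tag{33}$$
$$= \bar{\ell}_i(x, \psi_i(x)) + \mathbb{E}\Big[\mathbb{E}\Big[\sum_{j=i+1}^n \ell(X_j, Y_j, \hat{Y}_j) \,\Big|\, X_{i+1}\Big] \,\Big|\, X_i = x\Big] \tag{34}$$
$$= \bar{\ell}_i(x, \psi_i(x)) + \mathbb{E}\big[V_{i+1}(X_{i+1}; \psi^n) \mid X_i = x\big], \tag{35}$$
where the first term of (34) follows from the definition of $\bar{\ell}_i$ in (3), while the second term of (34) follows from the fact that $(X_j, Y_j, \hat{Y}_j)_{j=i+1}^n$ is conditionally independent of $X_i$ given $X_{i+1}$, which is a consequence of the assumption that the estimators under consideration are Markov and the specification of the joint distribution of $(X^n, Y^n, \hat{Y}^n)$ in Section 2. Then,
$$V_i(x; \psi^n) \ge \bar{\ell}_i(x, \psi_i(x)) + \mathbb{E}\big[V_{i+1}^*(X_{i+1}) \mid X_i = x\big] \tag{36}$$
$$= \bar{\ell}_i(x, \psi_i(x)) + \mathbb{E}\big[V_{i+1}^*(X_{i+1}) \mid X_i = x, \hat{Y}_i = \psi_i(x)\big] \tag{37}$$
$$= Q_i^*(x, \psi_i(x)) \tag{38}$$
$$\ge V_i^*(x), \tag{39}$$
where (36) follows from the inductive assumption; (37) follows from the fact that $\hat{Y}_i$ is determined given $X_i = x$ through the Markov estimator $\psi_i$; (38) follows from the definition of $Q_i^*$ in (24); and the final equality condition follows from the definitions of $V_i^*$ above (24) and $\psi_i^*$ in (25).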
This proves the second claim. ∎
A consequence of Theorem 2 is that the minimum loss-to-go at the $i$th round can be expressed in terms of $V_i^*$. Moreover, once the values of $V_i^*(x)$ for all $i$ and $x$ are computed, the optimal estimation strategy for any $n$-round dynamic inference with the same model for $K_{X_i | X_{i-1}, \hat{Y}_{i-1}}$ can be determined by these values and the observation-estimate loss function $\bar{\ell}_i$. These results are stated in the following corollary.
Corollary 1.
For any $n$ and any initial distribution $P_{X_1}$,
$$\min_{\psi^n} L(\psi^n) = \mathbb{E}\big[V_1^*(X_1)\big], \tag{40}$$
and the minimum is achieved by the estimators defined in (25).
3.3 An illustrative example
We now work out an example to illustrate how an optimal estimation strategy for dynamic inference works. Consider a situation where the observations, the quantities of interest, and the estimates all take binary values. The observation-transition model is assumed to be stationary and deterministic, so that the next observation is a fixed function of the current observation and estimate, as depicted in Fig. 2. The quantity-generation model is also stationary and is specified by the conditional distribution of $Y_i$ given each of the two values of $X_i$. The loss function neglects the observation and takes the form $\ell(x, y, \hat{y}) = \mathbf{1}\{\hat{y} \neq y\}$.
With this setup, the goal of dynamic inference is to minimize the expected number of wrong estimates during $n$ rounds of estimation. The optimal estimation strategy can be easily found by the dynamic programming procedure presented in Section 3.2. For the number of rounds considered, the resulting values of the function $V_i^*$ are labeled at each observation in the unrolled observation-transition diagram in Fig. 3. The optimal estimate at each observation is also labeled: a solid branch indicates an optimal estimate, while a dashed branch indicates a non-optimal one. We see that at each observation, the optimal estimate for dynamic inference can be different from that for single-round estimation. For example, at several of the observations in Fig. 3, the optimal estimate for dynamic inference is not the one that single-round estimation would choose to minimize the probability of error in that round alone.
This example reveals a key difference between dynamic inference and traditional statistical inference: in dynamic inference, the optimal estimate in each round strives to balance the loss-to-incur in that round and the loss-to-go from that round. Consequently, an optimal estimation strategy may need to, at least occasionally, make non-optimal single-round estimates, in order to steer the future observations toward those with which the associated quantities of interest are easier to estimate or less costly if inaccurately estimated.
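The trade-off between the loss-to-incur and the loss-to-go can be checked numerically by brute force on a small binary model. The sketch below does not use the numbers of the example above; the error probabilities in `L_BAR`, the transition rule in `next_obs`, and the horizon are hypothetical choices made only to exhibit the effect.

```python
from itertools import product

# Hypothetical probability of a wrong estimate given (observation, estimate).
L_BAR = [[0.1, 0.9],
         [0.4, 0.6]]


def next_obs(x, y_hat):
    """Hypothetical deterministic transition: the next observation is 1 - estimate."""
    return 1 - y_hat


def inference_loss(policy, x1, n):
    """Expected number of wrong estimates of a Markov policy policy[i][x] from x1."""
    x, total = x1, 0.0
    for i in range(n):
        y_hat = policy[i][x]
        total += L_BAR[x][y_hat]
        x = next_obs(x, y_hat)
    return total


def all_markov_policies(n):
    """Enumerate every assignment of a binary estimate to each (round, observation)."""
    per_round = list(product([0, 1], repeat=2))
    return product(per_round, repeat=n)


n, x1 = 2, 1
# Single-round (myopic) strategy: minimize the immediate loss in every round.
myopic_round = tuple(min((0, 1), key=L_BAR[x].__getitem__) for x in (0, 1))
myopic = tuple(myopic_round for _ in range(n))
myopic_loss = inference_loss(myopic, x1, n)

best = min(all_markov_policies(n), key=lambda p: inference_loss(p, x1, n))
best_loss = inference_loss(best, x1, n)
```

Under these assumed numbers the best strategy deliberately makes the less likely estimate in the first round, so that the second-round observation is the one with the smaller error probability.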
4 Two applications
Having formulated the dynamic inference problem and derived its solution, in this section we study its applications to real challenges. The two examples given below are simplistic, but they capture the essence of how dynamic inference can be used to model and solve various sequential and interactive estimation or prediction problems.
4.1 Stock trend prediction
The first application is the prediction by big investors of the trend of stock, which could be the trend of the price of an individual stock or the index of a bundle of stocks. The trend, either rising or falling, statistically depends on some observable market signal, e.g. the supply-demand profile of the stocks under consideration. The prediction is sequentially made for several rounds, e.g. one round each day for that day’s trend, each based on the past observed market signals. Once a prediction is made, it influences that day’s investment decision and hence the supply-demand profile of the stocks under consideration, which will be reflected by the market signal in the next day and will further influence the next day’s trend.
Formally, for $n$ rounds of prediction, let $Y_i$ be the trend in the $i$th round, which is to be predicted as $\hat{Y}_i$ based on the observable market signal $X_i$ in that round. To have the simplest observation model, we consider the situation where the $X_i$’s take only two values. The transition model of the next round’s market signal given the signal and prediction in the current round can then be described by $K_{X_{i+1} | X_i, \hat{Y}_i}$. A stationary such model is shown in Fig. 4. Additionally, the dependence of the trend on the market signal can be described by $K_{Y_i | X_i}$, e.g. with conditional probabilities under which the trend positively correlates with the market signal. The loss function can be simply $\ell(x, y, \hat{y}) = \mathbf{1}\{\hat{y} \neq y\}$.
With all these elements specified, the problem is recognized as a dynamic inference problem similar to the example presented in Section 3.3, only with a more general observation-transition model. Figure 5 shows the optimal estimation strategy of this example when the observation-transition model is deterministic. Similar to the example given in Section 3.3, we see from Fig. 5 that at several observations the optimal prediction for dynamic inference is different from the optimal single-round prediction. These predictions are made to steer the market signal to the value at which it is more certain that the trend will be rising according to $K_{Y_i | X_i}$, so that a smaller prediction error probability is incurred. Only toward the final rounds do the optimal predictions coincide with the optimal single-round predictions, as the future prediction errors yet to accumulate weigh less toward the end of dynamic inference.
4.2 Vehicle behavior prediction
Another challenging problem that could be cast as dynamic inference is behavior prediction of vehicles on the road. For example, a desired feature of a self-driving system is to predict whether the following vehicle in the neighbor lane would yield if the ego vehicle initiates a cut-in to that lane, whenever there is a need for lane change. This task may be termed yield prediction, which can be sequential and interactive especially when the traffic is dense: the predicted intention determines the action to be taken by the ego vehicle, e.g. to turn on the blinker and initiate the cut-in when the following vehicle is predicted to yield, or not to cut in and shoot for another gap when it is predicted not to yield; in response to the ego vehicle’s behavior and according to the driving situation, the following vehicle would either slow down or accelerate, which can change the driving situation of the two vehicles and affect their subsequent behaviors; the interaction continues until the cut-in is completed due to a yield by some vehicle, or given up otherwise.
As in the stock trend prediction, we can formally define a dynamic inference problem for yield prediction. Suppose the prediction can be deconstructed into $n$ rounds. For the $i$th round, let $X_i$ denote the driving situation, which could be the positions and velocities of the vehicles under consideration, and let $Y_i$ represent the intention of the following vehicle. In the simplest setting, $X_i$ could be just the longitudinal bumper-to-bumper distance between the two vehicles, and the probabilistic model relating $Y_i$ to $X_i$ could give the probability of yielding at distance $x$ as $\sigma(a(x - d))$, where $\sigma$ is the logistic function, $a$ is a tunable parameter, and $d$ is a critical distance that can be empirically determined. The observation-transition model in yield prediction is non-stationary and depends on the design of the planner in the self-driving system. For example, when the distance $X_i$ is small and the following vehicle is predicted not to yield, depending on the planner, the next distance $X_{i+1}$ could be either larger if the planner decides to increase the gap and still aims to cut in, or smaller if the planner decides to slow down to shoot for another gap behind the following vehicle. Another feature of this problem is that the loss function should be contextual and carefully designed. For example, when the following vehicle would yield but is predicted not to, the loss can be small, to moderately penalize a wasted chance for cut-in. On the other hand, when the following vehicle would not yield but is predicted to, the loss should be large especially when the distance $X_i$ is small, to heavily penalize a wrong prediction that can lead to a dangerous situation. With all the elements specified, the problem can in principle be solved through dynamic programming as in Section 3.2, to minimize the overall prediction cost.
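The two modeling ingredients named above, a logistic quantity-generation model and an asymmetric contextual loss, can be sketched as follows. Everything specific here is an assumption made for illustration: the parameter values, the constant cost of a wasted chance, and the inverse-distance form of the danger cost are hypothetical, as are the function names.

```python
import math


def yield_probability(gap, a=0.5, d=10.0):
    """Assumed logistic model: P(following vehicle yields | gap) = sigmoid(a * (gap - d))."""
    return 1.0 / (1.0 + math.exp(-a * (gap - d)))


def contextual_loss(gap, yields, predicted_yield, waste_cost=1.0, danger_cost=100.0):
    """Hypothetical contextual loss for one round of yield prediction."""
    if yields == predicted_yield:
        return 0.0
    if yields and not predicted_yield:
        # Wasted chance for a cut-in: a small penalty (constant in this sketch).
        return waste_cost
    # Predicted to yield but does not: large penalty, larger when the gap is small.
    return danger_cost / max(gap, 1.0)


def expected_loss(gap, predicted_yield):
    """Observation-estimate loss: average of the contextual loss over the intention."""
    p = yield_probability(gap)
    return (p * contextual_loss(gap, True, predicted_yield)
            + (1.0 - p) * contextual_loss(gap, False, predicted_yield))
```

With these assumed numbers, `expected_loss(2.0, True)` is far larger than `expected_loss(2.0, False)`, while the ordering reverses at a gap of 30, which is the qualitative behavior the contextual loss is meant to encode.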
To better illustrate the idea, three typical cases that can be encountered by the yield prediction are depicted in Fig. 6. In the first case, the following vehicle is predicted to yield and does yield. The ego vehicle initiates the cut-in by turning on the blinker, which results in a slowdown of the following vehicle, allowing the cut-in to be completed. There is only one round of prediction in this case. In the second case, the following vehicle is predicted to yield but does not. The ego vehicle initiates the cut-in by turning on the blinker, which results in an acceleration of the following vehicle, not allowing the cut-in to be completed, and leads to a dangerous driving situation. The ego vehicle then starts a second round of prediction under this situation. In the third case, the following vehicle is predicted not to yield. Since the ego vehicle predicts that the following vehicle will not yield if the blinker is on, it does not initiate a cut-in, and slows down to shoot for a gap behind the following vehicle. It then starts a second round of prediction, which can potentially be easier and less costly compared with the first round of prediction. By properly specifying the models and the loss function, a yield predictor designed under the framework of dynamic inference should enable the ego vehicle to drive as in the first and the third cases most of the time according to different driving situations, and avoid the behavior in the second case, unless it is deliberately designed to support aggressive cut-in.
5 Learning for dynamic inference
Solving the dynamic inference problem requires the knowledge of two important elements: the observation-transition model and the quantity-generation model. However, in many practically interesting situations, we may not have such knowledge. Instead, we either have a training dataset from which we can learn these models offline before doing inference, or we can learn them on the fly during inference if the $Y_i$’s are revealed after each round of estimation. Such problems may be termed learning for dynamic inference, either offline or online. These problems can also be studied under a Bayesian framework, where the unknown models are assumed to be members of parametrized model families with certain priors, and the optimal learning rule that minimizes the expected inference loss can be mathematically derived [4].
Perhaps more importantly, the problem of learning for dynamic inference can potentially serve as a meta problem for machine learning, such that almost all familiar learning problems can be cast as its special cases, examples including supervised learning, imitation learning, and reinforcement learning. For instance, the offline learning for dynamic inference can be viewed as an extension of the behavior cloning method in imitation learning [5, 6, 7, 8], in that it not only learns the demonstrator’s action-generation model and state-transition model, but simultaneously learns a policy based on the learned models to minimize the overall context-aware imitation error. As another instance, any loss function of the form $\bar{\ell}(x, \hat{y})$ can be expressed in the integral form of (4), with some contextual loss function $\ell$ and probability transition kernel $K_{Y|X}$. The unknown quantity $Y$ that depends on $X$ can then be viewed as a latent variable of the loss function. With this view, any reinforcement learning problem [9] can be solved as an instance of online learning for dynamic inference, where the quantities to be estimated are the latent variables of the loss function, and the quantity-generation model to be learned is the conditional distribution of the latent variable. More detailed discussions on the connection between learning for dynamic inference and other learning problems are made in [4]. The study of dynamic inference and its learning extension thus helps us gain a deeper and unifying understanding of a broad spectrum of machine learning problems.
6 Multiple views of dynamic inference and related works
The formulation of dynamic inference appears to be new, but it can be viewed from different angles, and is related to a variety of existing problems.
Sequential interactive prediction as dynamic inference
The problems that can be most naturally formulated as dynamic inference are estimation or prediction problems in sequential and interactive settings. Traditionally some of those problems are formulated and studied using game theory [10]. This type of problem has become more common in recent years with the widespread adoption of AI systems, e.g., it arises in behavior prediction of vehicles [11, 12, 13], interactive recommendation with user feedback [14], and prediction in finance [15]. Dynamic inference provides a rigorous mathematical formulation of such problems, and provides an optimal solution to them.
Dynamic inference as performative prediction
Dynamic inference also shares a similar spirit with a recent trend of research called performativity [16, 17, 18], where the problem is to deal with the tendency that the decision to make in optimization or prediction problems can change the underlying distribution the decision is made for. A method called repeated risk minimization is proposed to solve performative prediction, either with [18] or without [17] state transitions. The goal there is to minimize the loss in each single round of prediction, based on the distribution from the last round, and the hope is that such a method can reach a minimax equilibrium under certain conditions. In contrast, dynamic inference aims to minimize the overall inference loss, and it explicitly considers multiple rounds of estimation and state transitions in these problems.
Dynamic inference as imitation game
The formulation of dynamic inference can also be viewed as a game of imitation, where the learner drives a system with state, observes an action from a demonstrator at each encountered state, and tries to imitate the demonstrator’s actions by minimizing the accumulated state-aware imitation error. When the underlying models are unknown, such a view can provide a rigorous formulation of imitation learning, both online and offline, and the optimal learning strategy is derived in [4]. In practice, this formulation has already been implicitly adopted by practitioners in imitation learning [19].
MDP as dynamic inference
In this work, the solution to dynamic inference is derived by reformulating it as an MDP which can be solved by dynamic programming. Conversely, any MDP can be thought of as a dynamic inference problem, by viewing the loss function in an integral form that involves an unknown latent variable, as discussed in the previous section. The goal of MDP is then to estimate the latent variables by minimizing the overall estimation error. This view also helps us to understand reinforcement learning, especially Bayesian reinforcement learning in the model-based form [20, 21, 22], from a learning for dynamic inference perspective. The practical benefit of viewing MDP and reinforcement learning in this way would be an interesting research problem.
Acknowledgement
The author is indebted to Peng Guan for many helpful discussions; the discussion with whom on imitation learning in early 2018 motivated this study. The author is grateful to Prof. Lav Varshney, for the detailed comments and many helpful suggestions, and for pointing out [17] on performative prediction. The author also would like to thank Prof. Maxim Raginsky, for his encouragement in looking into dynamic aspects of statistical problems.
References
 [1] M. Raginsky, Lecture notes for ECE 555 Control of Stochastic Systems, Spring 2019, University of Illinois at Urbana-Champaign, 2019.
 [2] A. Xu and M. Raginsky, “Minimum excess risk in Bayesian learning,” arXiv 2012.14868, 2020.
 [3] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 1. Athena Scientific, 2017.
 [4] A. Xu and P. Guan, “Bayesian learning for dynamic inference,” arXiv preprint, 2021.
 [5] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, 2018.
 [6] D. B. Grimes, D. R. Rashid, and R. P. Rao, “Learning nonparametric models for probabilistic imitation,” in Advances in Neural Information Processing Systems, 2006.
 [7] P. Englert, A. Paraschos, J. Peters, and M. P. Deisenroth, “Probabilistic model-based imitation learning,” Adaptive Behavior, 2013.
 [8] F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” in International Joint Conference on Artificial Intelligence, 2018.
 [9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. MIT Press, 2018.
 [10] C. Camerer, Behavioral game theory: Experiments in strategic interaction. Princeton University Press, 2003.
 [11] Y. Hu, W. Zhan, L. Sun, and M. Tomizuka, “Multimodal probabilistic prediction of interactive behavior via an interpretable model,” in IEEE Intelligent Vehicles Symposium, 2019.
 [12] J. Li, F. Yang, M. Tomizuka, and C. Choi, “Evolvegraph: Multiagent trajectory prediction with dynamic relational reasoning,” in Conference on Neural Information Processing Systems, 2020.
 [13] X. Ma, J. Li, M. J. Kochenderfer, D. Isele, and K. Fujimura, “Reinforcement learning for autonomous driving with latent state inference and spatial-temporal relationships,” in IEEE International Conference on Robotics and Automation, 2021.
 [14] R. Zhang, T. Yu, Y. Shen, H. Jin, and C. Chen, “Text-based interactive recommendation via constraint-augmented reinforcement learning,” in Conference on Neural Information Processing Systems, 2019.
 [15] E. Ippoliti, Methods and Finance: A View from Outside. Cham: Springer International Publishing, 2017.
 [16] L. R. Varshney, N. S. Keskar, and R. Socher, “Pretrained AI models: Performativity, mobility, and change,” arXiv:1909.03290, 2019.
 [17] J. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt, “Performative prediction,” in International Conference on Machine Learning, 2020.
 [18] G. Brown, S. Hod, and I. Kalemaj, “Performative prediction in a stateful world,” arXiv:2011.03885, 2020.
 [19] Y. Pan, C. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and B. Boots, “Imitation learning for agile autonomous driving,” International Journal of Robotics Research, 2019.
 [20] M. Strens, “A Bayesian framework for reinforcement learning,” in Proceedings of the 17th International Conference on Machine Learning, pp. 943–950, 2000.
 [21] P. Poupart, N. Vlassis, J. Hoey, and K. Regan, “An analytic solution to discrete Bayesian reinforcement learning,” International Conference on Machine Learning, 2006.
 [22] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar, Bayesian Reinforcement Learning: A Survey. Now Foundations and Trends, 2015.