Dynamic Inference

11/29/2021
by Aolin Xu

Traditional statistical estimation, or statistical inference in general, is static, in the sense that the estimate of the quantity of interest does not affect the future evolution of the quantity. In some sequential estimation problems, however, we encounter the situation where the future values of the quantity to be estimated depend on the estimate of its current value. Examples include stock price prediction by big investors, interactive product recommendation, and behavior prediction in multi-agent systems. We may call such problems dynamic inference. In this work, a formulation of this problem under a Bayesian probabilistic framework is given, and the optimal estimation strategy is derived as the solution that minimizes the overall inference loss. How the optimal estimation strategy works is illustrated through two examples, stock trend prediction and vehicle behavior prediction. When the underlying models for dynamic inference are unknown, we can consider the problem of learning for dynamic inference. This learning problem can potentially unify several familiar machine learning problems, including supervised learning, imitation learning, and reinforcement learning.


1 Introduction

Traditional statistical estimation, or statistical inference in general, is static, in the sense that the estimate of the quantity of interest does not affect the future evolution of the quantity. In some sequential estimation problems, however, we encounter the situation where the future values of the quantity to be estimated depend on the estimate of its current value. Examples include: 1) stock price prediction by big investors, where the prediction of today’s price of a stock affects today’s investment decision, which further changes the stock’s supply-demand status and hence its price tomorrow; 2) interactive product recommendation, where the estimate of a customer’s preference based on their activity leads to certain product recommendations, which would in turn shape the customer’s future activity and preference; 3) behavior prediction in multi-agent systems, e.g. predicting the intentions of vehicles on the road adjacent to the ego vehicle, where the prediction of an adjacent vehicle’s intention based on its current driving situation leads to a certain action of the ego vehicle, which can change the future driving situation and intention of that adjacent vehicle.

Broadly speaking, this type of interactive sequential estimation problem arises in any autonomous agent that interacts with a system of interest through a measurement-inference-action loop. During the interaction, the inference, either estimation or prediction, of a property of the system based on the measurements of its current and past states affects the action to be taken by the autonomous agent, which further influences the future states and properties of the system of interest. We may call such problems dynamic inference.

In Section 2, a mathematical formulation of dynamic inference is given under a Bayesian probabilistic framework. It is shown in Section 3 that this problem can be converted to a Markov decision process (MDP), and the optimal estimation strategy that minimizes the overall inference loss can be derived as the optimal policy of this MDP through dynamic programming. Two examples, stock trend prediction and vehicle behavior prediction, are given in Section 4 to illustrate how the optimal estimation strategy for dynamic inference works, and how it differs from the solution to the traditional statistical inference problem.

Section 5 briefly discusses the problem of learning for dynamic inference, which addresses the situation where the underlying probabilistic models for dynamic inference are unknown. Learning for dynamic inference can potentially serve as a unifying meta problem of machine learning, such that supervised learning, imitation learning, and reinforcement learning can be cast as its special instances. Having a good understanding of dynamic inference and its learning extension will thus be helpful in gaining a better understanding of a broad spectrum of machine learning problems. The formulation of dynamic inference appears to be new, but it can be related to a variety of existing interactive decision-making problems and prediction problems that take the consequence of the prediction into account. Moreover, any MDP may be thought of as a dynamic inference problem. These related problems are discussed in Section 6.

2 Problem formulation

2.1 Traditional statistical inference

In traditional statistical inference, the goal is to estimate a quantity of interest $Y$ based on an observation $X$ that statistically depends on $Y$. Under the Bayesian formulation, the pair $(X, Y)$ is modeled as a jointly distributed random vector with distribution $P_{X,Y}$. Given a loss function $\ell(y, \hat{y})$, the optimal estimator $\psi^*$, a.k.a. the Bayes estimator, is a map from observations to estimates that achieves the minimum expected loss:

$$\psi^* \in \arg\min_{\psi} \mathbb{E}\big[\ell(Y, \psi(X))\big]. \qquad (1)$$

A basic result from estimation theory is that for any $x$, the optimal estimate of $Y$ given $X = x$ is a minimizer of the expected posterior loss, i.e. $\psi^*(x) \in \arg\min_{\hat{y}} \mathbb{E}[\ell(Y, \hat{y}) \,|\, X = x]$. The above statistical inference problem is static, in the sense that only one round of estimation is considered. When there is a need to estimate a sequence of quantities $Y_1, \ldots, Y_n$ based on observations $X_1, \ldots, X_n$, if the pairs $(X_t, Y_t)$, $t = 1, \ldots, n$, are i.i.d., the sequential estimation problem of minimizing the accumulated expected loss can be optimally solved by repeatedly applying the same single-round optimal estimator $\psi^*$.
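To make the single-round solution concrete, here is a minimal numerical sketch of the Bayes estimator in (1) for finite alphabets. The joint distribution and the losses below are illustrative placeholders of my own, not values from the paper.

```python
import numpy as np

# Illustrative joint distribution P_{X,Y} over binary alphabets (rows: x, columns: y).
P_XY = np.array([[0.30, 0.10],
                 [0.15, 0.45]])

def bayes_estimator(P_XY, loss):
    """Return, for each observation x, the estimate minimizing the expected posterior loss.

    loss[y, y_hat] is the loss of estimating y_hat when the true quantity is y.
    """
    P_X = P_XY.sum(axis=1)                    # marginal distribution of the observation
    P_Y_given_X = P_XY / P_X[:, None]         # posterior P_{Y|X}
    risk = P_Y_given_X @ loss                 # risk[x, y_hat] = E[loss(Y, y_hat) | X = x]
    return risk.argmin(axis=1)

zero_one_loss = 1.0 - np.eye(2)               # 0-1 loss
print(bayes_estimator(P_XY, zero_one_loss))   # the MAP estimate for each x, here [0 1]
```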

2.2 Dynamic inference

The problem of $n$-round dynamic inference is to sequentially estimate quantities of interest $Y_1, \ldots, Y_n$ based on observations $X_1, \ldots, X_n$, where in each round, the quantity of interest $Y_t$ only depends on the observation $X_t$, while $X_t$ depends on the observation $X_{t-1}$ and the estimate $\hat{Y}_{t-1}$ of $Y_{t-1}$ in the previous round; the estimate $\hat{Y}_t$ of $Y_t$ is made potentially based on all the information available so far, namely $(X_1, Y_1, \ldots, X_{t-1}, Y_{t-1}, X_t)$, with the goal of minimizing the expected accumulated loss over the $n$ rounds. Here it is assumed that after the $t$th round of estimation, $Y_t$ is revealed to the estimator. It can also happen that $Y_1, \ldots, Y_n$ are never revealed during the process, in which case $Y_t$ is estimated only based on the observations $(X_1, \ldots, X_t)$. Nevertheless, it will be shown in Section 3 that an optimal estimation strategy can estimate $Y_t$ based only on the instantaneous observation $X_t$, no matter whether $Y_1, \ldots, Y_{t-1}$ are available or not.

Formally, we assume knowledge of the distribution $P_{X_1}$ of the initial observation, the probability transition kernels $P_{X_t | X_{t-1}, \hat{Y}_{t-1}}$ of the $t$th observation given the observation and the estimate in the previous round, $t = 2, \ldots, n$, and the probability transition kernels $P_{Y_t | X_t}$ of the $t$th quantity of interest given the $t$th observation, $t = 1, \ldots, n$. We may call these two types of probability transition kernels the observation-transition model and the quantity-generation model, respectively. The estimates are sequentially made according to an estimation strategy:

Definition 1.

An estimation strategy for an $n$-round dynamic inference is a sequence of estimators $(\psi_1, \ldots, \psi_n)$, where $\psi_t$ is the estimator used in the $t$th round, $t = 1, \ldots, n$, which maps the history of observations and revealed quantities of interest to an estimate of $Y_t$, such that $\hat{Y}_t = \psi_t(X_1, Y_1, \ldots, X_{t-1}, Y_{t-1}, X_t)$.

Any specification of $P_{X_1}$, the observation-transition model, the quantity-generation model, and the estimation strategy defines a joint distribution of all the random variables under consideration. The Bayesian network of the random variables in dynamic inference with a Markov estimation strategy, meaning that each estimate has the form $\hat{Y}_t = \psi_t(X_t)$, is illustrated in Fig. 1.

Figure 1: Bayesian network of the random variables under consideration. Here we assume the estimates are made with Markov estimators, such that $\hat{Y}_t = \psi_t(X_t)$.

We use a loss function $\ell(x, y, \hat{y})$ to evaluate the estimate made in each round of dynamic inference. This loss function is a generalization of the one used in statistical inference, in the sense that the estimate in each round is evaluated in the context of the observation in that round. Given an estimation strategy $(\psi_1, \ldots, \psi_n)$, we define its inference loss as the expected accumulated loss over the $n$ rounds, $\mathbb{E}\big[\sum_{t=1}^{n} \ell(X_t, Y_t, \hat{Y}_t)\big]$. The goal of dynamic inference is to find an estimation strategy to minimize the inference loss:

$$\min_{\psi_1, \ldots, \psi_n} \mathbb{E}\Big[\sum_{t=1}^{n} \ell(X_t, Y_t, \hat{Y}_t)\Big], \qquad (2)$$

where $\hat{Y}_t = \psi_t(X_1, Y_1, \ldots, X_{t-1}, Y_{t-1}, X_t)$. Comparing with the statistical inference problem in (1), we summarize the two distinctive features of dynamic inference:

  • The joint distribution of the pair $(X_t, Y_t)$ changes in each round in a controlled manner, as it depends on the estimate $\hat{Y}_{t-1}$ made in the previous round.

  • The loss in each round is contextual, as it depends on the observation $X_t$ in that round.
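To fix ideas, the ingredients just listed can be collected in a small container for the finite-alphabet, stationary case, as in the sketch below; the class and field names are mine, not notation from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DynamicInferenceSpec:
    """Finite-alphabet, stationary specification of an n-round dynamic inference problem."""
    n_rounds: int
    p_x1: np.ndarray         # initial observation distribution, shape (|X|,)
    p_next_x: np.ndarray     # observation-transition kernel P(x' | x, y_hat), shape (|X|, |Yhat|, |X|)
    p_y_given_x: np.ndarray  # quantity-generation kernel P(y | x), shape (|X|, |Y|)
    loss: np.ndarray         # contextual loss l(x, y, y_hat), shape (|X|, |Y|, |Yhat|)
```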

3 Optimal estimation strategy

In this section we show that the dynamic inference problem can be converted to a Markov decision process, and the optimal estimation strategy can be found via dynamic programming.

3.1 MDP reformulation

3.1.1 Equivalent expression of inference loss

For a given loss function $\ell$ and a joint distribution of $(X_t, Y_t, \hat{Y}_t)$, we can define a corresponding observation-estimate loss function as

$$\tilde{\ell}_t(x, \hat{y}) \triangleq \mathbb{E}\big[\ell(X_t, Y_t, \hat{Y}_t) \,\big|\, X_t = x, \hat{Y}_t = \hat{y}\big] \qquad (3)$$

for $t = 1, \ldots, n$. From the specification of the joint distribution of the random variables in the previous section, we know that in dynamic inference $Y_t$ is conditionally independent of $\hat{Y}_t$ given $X_t$, therefore for any realization $(x, \hat{y})$, the value of the $t$th observation-estimate loss can be computed as

$$\tilde{\ell}_t(x, \hat{y}) = \int \ell(x, y, \hat{y})\, P_{Y_t | X_t}(\mathrm{d}y \,|\, x). \qquad (4)$$

We see that $\tilde{\ell}_t$ as a function of $(x, \hat{y})$ is determined by $\ell$ and $P_{Y_t|X_t}$, and does not depend on the estimator $\psi_t$. This fact is crucial for the optimality proof later. With the above definition, the inference loss can be expressed in terms of the observation-estimate loss:

Lemma 1.

For any estimation strategy, the inference loss in (2) can be rewritten as

$$\mathbb{E}\Big[\sum_{t=1}^{n} \ell(X_t, Y_t, \hat{Y}_t)\Big] = \sum_{t=1}^{n} \mathbb{E}\big[\tilde{\ell}_t(X_t, \hat{Y}_t)\big]. \qquad (5)$$
Proof.

For each $t = 1, \ldots, n$,

$$\mathbb{E}\big[\ell(X_t, Y_t, \hat{Y}_t)\big] = \mathbb{E}\Big[\mathbb{E}\big[\ell(X_t, Y_t, \hat{Y}_t) \,\big|\, X_1, Y_1, \ldots, X_{t-1}, Y_{t-1}, X_t\big]\Big] \qquad (6)$$
$$= \mathbb{E}\Big[\int \ell(X_t, y, \hat{Y}_t)\, P_{Y_t|X_t}(\mathrm{d}y \,|\, X_t)\Big] \qquad (7)$$
$$= \mathbb{E}\big[\tilde{\ell}_t(X_t, \hat{Y}_t)\big], \qquad (8)$$

where (7) is a consequence of the fact that $\hat{Y}_t$ is determined by $(X_1, Y_1, \ldots, X_{t-1}, Y_{t-1}, X_t)$ and the fact that $Y_t$ is conditionally independent of $(X_1, Y_1, \ldots, X_{t-1}, Y_{t-1})$ given $X_t$; and (8) is due to the definition of $\tilde{\ell}_t$ in (3)-(4). The claim then follows from the fact that the inference loss is the sum of these expectations over $t = 1, \ldots, n$. ∎

With Lemma 1, the optimization problem in (2) becomes equivalent to

$$\min_{\psi_1, \ldots, \psi_n} \sum_{t=1}^{n} \mathbb{E}\big[\tilde{\ell}_t(X_t, \hat{Y}_t)\big], \qquad (9)$$

where the objective equals the inference loss, and $\hat{Y}_t = \psi_t(X_1, Y_1, \ldots, X_{t-1}, Y_{t-1}, X_t)$.
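Computationally, the passage from (2) to (9) is just an averaging of the contextual loss over the quantity-generation model, as in (3)-(4). A one-line sketch for the finite-alphabet, stationary case (the function name is mine):

```python
import numpy as np

def observation_estimate_loss(p_y_given_x, loss):
    """Average the contextual loss l(x, y, y_hat) over P(y | x), as in (4).

    p_y_given_x has shape (|X|, |Y|) and loss has shape (|X|, |Y|, |Yhat|);
    the result has shape (|X|, |Yhat|), one expected loss per (observation, estimate) pair.
    """
    return np.einsum('xy,xya->xa', p_y_given_x, loss)
```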

3.1.2 Optimality of Markov estimators

Next, we show that the search space of the optimization problem in (9) can be restricted to Markov estimators $\psi_t$, such that $\hat{Y}_t = \psi_t(X_t)$. We start with a lemma known as Blackwell’s principle of irrelevant information [1], and provide a proof for completeness.

Lemma 2.

For any fixed function and for any jointly distributed pair ,

(10)
Proof.

The left side of (10) is the Bayes risk of estimating based on , defined with respect to the loss function , and can be written as ; while the right side of (10) is the Bayes risk of estimating based on itself, also defined with respect to the loss function , and can be written as . It is clear from their definitions that . It also follows from a data processing inequality of Bayes risk [2, Lemma 1] that

(11)

as

form a Markov chain. Hence

, which proves the claim. ∎

The first application of Lemma 2 is to prove that the last estimator of an optimal estimation strategy can be replaced by a Markov one, which preserves the optimality.

Lemma 3 (Last-round lemma).

Given any estimation strategy , there exists a Markov estimator , such that

Proof.

According to Lemma 1, the inference loss of can be written as

(12)

Since the first expectation in (12) does not depend on , it suffices to show that there exists a Markov estimator , such that

(13)

The existence of such an estimator is guaranteed by Lemma 2. ∎

Lemma 2 can be further used to prove that whenever the last estimator is Markov, its preceding estimator can also be replaced by a Markov one which preserves the optimality.

Lemma 4 (th-round lemma).

For any , given any estimation strategy for an -round dynamic inference, if the last estimator is a Markov one , then there exists a Markov estimator for the th round, such that

Proof.

According to Lemma 1, the inference loss of the given is

(14)

Since the first expectation in (14) does not depend on , it suffices to show that there exists a Markov estimator , such that

(15)

where on the left side is the observation in the th round when the Markov estimator is used in the th round. To get around the dependence of on , we write the second expectation on the right side of (15) as

(16)

and notice that the inner conditional expectation as a function of does not depend on . This is because the conditional distribution of given is specified by the probability transition kernel . It follows that the right side of (15) can be written as

(17)
(18)

where the function does not depend on . It follows from Lemma 2 that there exists an estimator , such that

(19)
(20)
(21)

which proves (15) and the claim. ∎

With Lemma 3 and Lemma 4, we can finally prove the optimality of Markov estimators.

Theorem 1.

The minimum in (9) can be achieved by an estimation strategy with Markov estimators $\psi_t$, $t = 1, \ldots, n$, such that $\hat{Y}_t = \psi_t(X_t)$.

Proof.

Picking an optimal estimation strategy, we first replace its last estimator by a Markov one that preserves the optimality of the strategy, as guaranteed by Lemma 3. Then, for $t = n-1, n-2, \ldots, 1$, we repeatedly replace the $t$th estimator by a Markov one that preserves the optimality of the previous strategy, as guaranteed by Lemma 4 and the additive structure of the inference loss as in (5). Finally we obtain an estimation strategy consisting of Markov estimators achieving the same inference loss as the originally picked strategy. ∎

3.1.3 Conversion to MDP

Theorem 1, together with Lemma 1, implies that the original dynamic inference problem in (2) is equivalent to

$$\min_{\psi_1, \ldots, \psi_n:\, \hat{Y}_t = \psi_t(X_t)} \sum_{t=1}^{n} \mathbb{E}\big[\tilde{\ell}_t(X_t, \hat{Y}_t)\big]. \qquad (22)$$

With this reformulation, we see that the unknown quantities $Y_1, \ldots, Y_n$ do not appear in the loss function in (22) any more, and the optimization problem becomes a standard MDP: the observations become the states in this MDP, the estimates become the actions, the probability transition kernel $P_{X_{t+1} | X_t, \hat{Y}_t}$ now defines the controlled state transition, and any Markov estimation strategy becomes a policy of this MDP. The goal of dynamic inference then becomes finding the optimal policy of this MDP to minimize the expected accumulated loss with respect to the observation-estimate loss functions $\tilde{\ell}_t$. Conversely, the solution to the MDP will be an optimal estimation strategy for dynamic inference.

3.2 Solution via dynamic programming

3.2.1 Optimal estimation strategy

From the theory of MDP [3] it is known that the optimal policy for the MDP in (22), or the optimal estimation strategy for dynamic inference, can be found via dynamic programming. To derive the optimal estimators, define the functions $Q_t$ and $V_t$ recursively backward, starting from $V_{n+1}(x) \triangleq 0$, as

$$Q_t(x, \hat{y}) \triangleq \tilde{\ell}_t(x, \hat{y}) + \mathbb{E}\big[V_{t+1}(X_{t+1}) \,\big|\, X_t = x, \hat{Y}_t = \hat{y}\big] \qquad (23)$$

and

$$V_t(x) \triangleq \min_{\hat{y}} Q_t(x, \hat{y}), \qquad (24)$$

for $t = n, n-1, \ldots, 1$. The optimal estimate to make in the $t$th round when $X_t = x$ is then

$$\psi_t^*(x) \in \arg\min_{\hat{y}} Q_t(x, \hat{y}). \qquad (25)$$
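The recursion (23)-(25) is a standard finite-horizon backward induction. A minimal sketch for the finite-alphabet case is given below; it returns the value function of each round and a Markov estimation strategy (one estimate per round and per observation). The function and variable names are mine, and the stationarity of the two models is an assumption made only to keep the sketch short.

```python
import numpy as np

def solve_dynamic_inference(n_rounds, p_next_x, p_y_given_x, loss):
    """Backward dynamic programming for (23)-(25) with finite alphabets.

    p_next_x[x, a, x'] : observation-transition kernel P(X_{t+1} = x' | X_t = x, Yhat_t = a)
    p_y_given_x[x, y]  : quantity-generation kernel P(Y_t = y | X_t = x)
    loss[x, y, a]      : contextual loss l(x, y, a)

    Returns (V, policy): V[t, x] is the minimum loss-to-go from round t at
    observation x, and policy[t, x] is the optimal estimate in round t at x.
    """
    # induced observation-estimate loss, as in (3)-(4)
    l_bar = np.einsum('xy,xya->xa', p_y_given_x, loss)
    n_obs = l_bar.shape[0]

    V = np.zeros((n_rounds + 1, n_obs))              # V[n_rounds] is the terminal value, zero
    policy = np.zeros((n_rounds, n_obs), dtype=int)

    for t in range(n_rounds - 1, -1, -1):            # backward in time
        # Q[x, a] = l_bar(x, a) + E[V_{t+1}(X_{t+1}) | X_t = x, Yhat_t = a], as in (23)
        Q = l_bar + p_next_x @ V[t + 1]
        policy[t] = Q.argmin(axis=1)                 # as in (25)
        V[t] = Q.min(axis=1)                         # as in (24)
    return V[:n_rounds], policy
```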

3.2.2 Minimum inference loss and loss-to-go

For any estimation strategy $(\psi_1, \ldots, \psi_n)$, we can define its loss-to-go at the $t$th round of estimation when $X_t = x$ as the conditional expected loss accumulated from the $t$th round to the final round, given that the observation in the $t$th round is $x$:

$$J_t(x) \triangleq \mathbb{E}\Big[\sum_{k=t}^{n} \tilde{\ell}_k(X_k, \hat{Y}_k) \,\Big|\, X_t = x\Big]. \qquad (26)$$

The following theorem states that the estimation strategy derived from dynamic programming not only achieves the minimum inference loss, but also achieves the minimum loss-to-go in each round with any observation in that round.

Theorem 2.

The estimators $\psi_t^*$ defined in (25) constitute an optimal estimation strategy for dynamic inference, which achieves the minimum in (2). Moreover, for any Markov estimation strategy $(\psi_1, \ldots, \psi_n)$ with $\hat{Y}_t = \psi_t(X_t)$, its loss-to-go satisfies

$$J_t(x) \ge V_t(x) \qquad (27)$$

for all $t$ and $x$, with equality if $\psi_k(x) = \psi_k^*(x)$ for all $k \ge t$ and all $x$.

Proof.

The first claim stating that the estimation strategy achieves the minimum in (2) follows from the equivalence between the original problem in (2) and the MDP reformulation in (22) as discussed in Section 3.1.3, and from the well-known optimality of the solution via dynamic programming to MDP [3].

The second claim can be proved via backward induction. Consider an arbitrary Markov estimation strategy .


  • In the final round, for all ,

    (28)
    (29)
    (30)

where (28) follows from the definition of the loss-to-go in (26); (29) follows from the way the observation-estimate loss can be computed, as in (4); and (30) follows from the definition above (24), while the equality holds if the estimate in the final round is the optimal one defined in (25).

  • For , as the inductive assumption, suppose (27) holds in the th round. We first show a self-recursive expression of :

    (31)
    (32)
    (33)
    (34)
    (35)

    where the first term of (34) follows from the definition of in (3), while the second term of (34) follows from the fact that is conditionally independent of given , which is a consequence of the assumption that the estimators under consideration are Markov and the specification of the joint distribution of in Section 2. Then,

    (36)
    (37)
    (38)
    (39)

    where (36) follows from the inductive assumption; (37) follows from the fact that is determined given through the Markov estimator ; (38) follows from the definition of in (24); and the final equality condition follows from the definitions of above (24) and in (25).

This proves the second claim. ∎

A consequence of Theorem 2 is that the minimum loss-to-go at the $t$th round can be expressed in terms of $V_t$. Moreover, once the values of $V_t(x)$ for all $t$ and $x$ are computed, the optimal estimation strategy for any $n$-round dynamic inference with the same underlying models can be determined by these values and the observation-estimate loss functions $\tilde{\ell}_t$. These results are stated in the following corollary.

Corollary 1.

For any $n$ and any initial distribution $P_{X_1}$,

$$\min_{\psi_1, \ldots, \psi_n} \mathbb{E}\Big[\sum_{t=1}^{n} \ell(X_t, Y_t, \hat{Y}_t)\Big] = \mathbb{E}\big[V_1(X_1)\big], \qquad (40)$$

and the minimum is achieved by the estimators defined in (25).

3.3 An illustrative example

We now work out an example to illustrate how an optimal estimation strategy for dynamic inference works. Consider a situation where the observations, the quantities of interest, and the estimates all take binary values, i.e. $X_t, Y_t, \hat{Y}_t \in \{0, 1\}$. The observation-transition model is assumed to be stationary and deterministic, so that the next observation is a deterministic function of the current observation and the current estimate, as depicted in Fig. 2. The quantity-generation model is also stationary, specified by the conditional probabilities of $Y_t$ given $X_t$. The loss function neglects the observation and takes the form of the estimation error indicator $\ell(x, y, \hat{y}) = \mathbf{1}\{y \neq \hat{y}\}$.

Figure 2: An example of a stationary and deterministic observation-transition model.
Figure 3: Unrolled observation-transition diagram for the dynamic inference example given in Section 3.3. The value of $V_t$ and the optimal estimate at each observation are labeled. A solid branch indicates an optimal estimate, while a dashed branch indicates a non-optimal one. The three blue branches indicate the optimal estimates for dynamic inference that are different from the optimal single-round estimates.

With this setup, the goal of dynamic inference is to minimize the expected number of wrong estimates during the $n$ rounds of estimation. The optimal estimation strategy can be easily found by the dynamic programming procedure presented in Section 3.2. The resulting values of the function $V_t$ are labeled at each observation in the unrolled observation-transition diagram in Fig. 3. The optimal estimate at each observation is also labeled: a solid branch indicates an optimal estimate, while a dashed branch indicates a non-optimal one. We see that at each observation, the optimal estimate for dynamic inference can be different from that for the single-round estimation. For example, at the observations marked in blue in Fig. 3, the optimal estimate for dynamic inference differs from the optimal single-round estimate, which would be the one minimizing the expected loss of that round alone.

This example reveals a key difference between dynamic inference and the traditional statistical inference: in dynamic inference, the optimal estimate in each round strives to balance the loss-to-incur in that round and the loss-to-go from that round. Consequently, an optimal estimation strategy may need to, at least occasionally, make non-optimal single-round estimates, in order to steer the future observations toward those with which the associated quantities of interest are easier to estimate or less costly if inaccurately estimated.
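As a concrete check, the solver sketched after (25) in Section 3.2 can be run on a binary instance of this kind. The transition rule and the conditional probabilities below are placeholders of my own choosing (the exact values used in this section are not reproduced here), picked so that the dynamic and the single-round optimal estimates differ.

```python
import numpy as np

n_rounds = 4
# placeholder deterministic observation transition: the next observation equals the current estimate
p_next_x = np.zeros((2, 2, 2))
for x in range(2):
    for a in range(2):
        p_next_x[x, a, a] = 1.0
# placeholder quantity-generation model: P(Y = 1 | X = 0) = 0.45, P(Y = 1 | X = 1) = 0.95
p_y1 = np.array([0.45, 0.95])
p_y_given_x = np.stack([1.0 - p_y1, p_y1], axis=1)
# 0-1 loss that neglects the observation: l(x, y, y_hat) = 1{y != y_hat}
loss = np.broadcast_to(1.0 - np.eye(2), (2, 2, 2)).copy()

V, policy = solve_dynamic_inference(n_rounds, p_next_x, p_y_given_x, loss)
print(policy)  # with these numbers: estimate 1 everywhere except at x = 0 in the final round
print(V[0])    # minimum loss-to-go from the first round, per initial observation
```

With these placeholder numbers, the single-round optimal estimate at $x = 0$ is 0, yet in every round but the last the optimal strategy estimates 1 there, accepting a larger immediate error probability in order to drive the next observation to the state where the quantity of interest is almost certain.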

4 Two applications

Having formulated the dynamic inference problem and derived its solution, in this section we study its application to two real-world problems. The two examples given below are simplistic, but they capture the essence of how dynamic inference can be used to model and solve various sequential and interactive estimation or prediction problems.

4.1 Stock trend prediction

The first application is the prediction by big investors of a stock trend, which could be the trend of the price of an individual stock or of an index over a bundle of stocks. The trend, either rising or falling, statistically depends on some observable market signal, e.g. the supply-demand profile of the stocks under consideration. The prediction is made sequentially for several rounds, e.g. one round each day for that day’s trend, each based on the past observed market signals. Once a prediction is made, it influences that day’s investment decision and hence the supply-demand profile of the stocks under consideration, which will be reflected in the next day’s market signal and will further influence the next day’s trend.

Formally, for $n$ rounds of prediction, let $Y_t$ be the trend in the $t$th round, which is to be predicted as $\hat{Y}_t$ based on the observable market signal $X_t$ in that round. To have the simplest observation model, we consider the situation where the $X_t$’s take only two values. The transition model of the next round’s market signal given the signal and prediction in the current round can then be described by the observation-transition kernel $P_{X_{t+1} | X_t, \hat{Y}_t}$. A stationary such model is shown in Fig. 4. Additionally, the dependence of the trend on the market signal can be described by the quantity-generation kernel $P_{Y_t | X_t}$, e.g. one under which the trend positively correlates with the market signal. The loss function can simply be the prediction error indicator $\mathbf{1}\{Y_t \neq \hat{Y}_t\}$.

Figure 4: A stationary observation-transition model for stock trend prediction. The solid arrows represent the deterministic transitions.
Figure 5: Unrolled observation-transition diagram for the stock trend prediction example. The value of $V_t$ and the optimal prediction at each observation are labeled. A solid branch indicates an optimal prediction, while a dashed branch indicates a non-optimal one. The three blue branches indicate the optimal predictions for dynamic inference that are different from the optimal single-round predictions.

With all these elements specified, the problem is recognized as a dynamic inference problem similar to the example presented in Section 3.3, only with a more general observation-transition model. Figure 5 shows the optimal estimation strategy of this example when the observation-transition model is deterministic. Similar to the example given in Section 3.3, we see from Fig. 5 that at the observations marked in blue, the optimal prediction for dynamic inference is different from the optimal single-round prediction. These predictions are made to steer the market signal toward the value under which it is more certain that the trend will be rising, hence a smaller prediction error probability is incurred. Only toward the end of the horizon do the optimal predictions coincide with the optimal single-round predictions, as the future prediction errors to be accumulated weigh less toward the end of dynamic inference.

4.2 Vehicle behavior prediction

Another challenging problem that can be cast as dynamic inference is behavior prediction of vehicles on the road. For example, a desired feature of a self-driving system is to predict whether the following vehicle in the neighboring lane would yield if the ego vehicle initiates a cut-in to that lane, whenever there is a need for a lane change. This task may be termed yield prediction, and it can be sequential and interactive, especially when the traffic is dense: the predicted intention determines the action to be taken by the ego vehicle, e.g. to turn on the blinker and initiate the cut-in when the following vehicle is predicted to yield, or not to cut in and to shoot for another gap when it is predicted not to yield; in response to the ego vehicle’s behavior and according to the driving situation, the following vehicle would either slow down or accelerate, which can change the driving situation of the two vehicles and affect their subsequent behaviors; the interaction continues until the cut-in is completed due to a yield by some vehicle, or given up due to the opposite.

As in the stock trend prediction, we can formally define a dynamic inference problem for yield prediction. Suppose the prediction can be decomposed into $n$ rounds. For the $t$th round, let $X_t$ denote the driving situation, which could be the positions and velocities of the vehicles under consideration, and let $Y_t$ represent the intention of the following vehicle. In the simplest setting, $X_t$ could be just the longitudinal bumper-to-bumper distance between the two vehicles, and the probabilistic model relating $Y_t$ to $X_t$ could be $P(Y_t = \text{yield} \,|\, X_t = d) = \sigma(\alpha (d - d_0))$, where $\sigma$ is the logistic function, $\alpha$ is a tunable parameter, and $d_0$ is a critical distance that can be empirically determined. The observation-transition model in yield prediction is non-stationary and depends on the design of the planner in the self-driving system. For example, when the current gap is small and the following vehicle is predicted not to yield, depending on the planner, the next gap could be either larger, if the planner decides to increase the gap and still aims to cut in, or smaller, if the planner decides to slow down to shoot for another gap behind the following vehicle. Another feature of this problem is that the loss function should be contextual and carefully designed. For example, when the following vehicle would yield but is predicted not to, the loss can be small and proportional to the gap, to moderately penalize a wasted chance for a cut-in. On the other hand, when the following vehicle would not yield but is predicted to yield, the loss should be large, especially when the gap is small, to heavily penalize a wrong prediction that can lead to a dangerous situation. With all the elements specified, the problem can in principle be solved through dynamic programming as in Section 3.2, to minimize the overall prediction cost.
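As a sketch of what the two problem-specific ingredients might look like in code, the following quantity-generation model and contextual loss follow the verbal description above; the functional forms, parameter names, and numerical values are illustrative assumptions of mine, not the paper’s.

```python
import math

def p_yield(gap_m, alpha=0.5, critical_gap_m=10.0):
    """Hypothetical quantity-generation model: probability that the following vehicle
    yields, as a logistic function of the bumper-to-bumper gap (meters)."""
    return 1.0 / (1.0 + math.exp(-alpha * (gap_m - critical_gap_m)))

def yield_prediction_loss(gap_m, would_yield, predicted_yield,
                          missed_chance_weight=0.1, danger_weight=50.0):
    """Hypothetical contextual loss for yield prediction.

    A missed chance (the vehicle would yield but we predict it will not) costs a small
    amount growing with the gap; predicting a yield that does not happen is penalized
    heavily, and more so when the gap is small."""
    if would_yield and not predicted_yield:
        return missed_chance_weight * gap_m
    if not would_yield and predicted_yield:
        return danger_weight / max(gap_m, 1.0)
    return 0.0
```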

To better illustrate the idea, three typical cases that can be encountered in yield prediction are depicted in Fig. 6. In the first case, the following vehicle is predicted to yield, and it indeed yields. The ego vehicle initiates the cut-in by turning on the blinker, which results in a slow-down of the following vehicle, allowing the cut-in to be completed. There is only one round of prediction in this case. In the second case, the following vehicle is predicted to yield, but it does not. The ego vehicle initiates the cut-in by turning on the blinker, which results in an acceleration of the following vehicle, not allowing the cut-in to be completed, and leads to a dangerous driving situation. The ego vehicle then starts a second round of prediction under this situation. In the third case, the following vehicle is predicted not to yield. Since the ego vehicle predicts that the following vehicle will not yield if the blinker is on, it does not initiate a cut-in, and slows down to shoot for a gap behind the following vehicle. It then starts a second round of prediction, which can potentially be easier and less costly compared with the first round of prediction. By properly specifying the models and the loss function, a yield predictor designed under the framework of dynamic inference should enable the ego vehicle to behave as in the first and the third cases most of the time, according to the driving situation, and to avoid the behavior in the second case, unless it is deliberately designed to support aggressive cut-ins.

Figure 6: Three typical cases of interactive vehicle behaviors in yield prediction.

5 Learning for dynamic inference

Solving the dynamic inference problem requires the knowledge of two important elements: the observation-transition model and the quantity-generation model. However, in many practically interesting situations, we may not have such knowledge. Instead, we either have a training dataset from which we can learn these models offline before doing inference, or we can learn them on the fly during inference if the quantities of interest are revealed after each round of estimation. Such problems may be termed learning for dynamic inference, either offline or online. These problems can also be studied under a Bayesian framework, where the unknown models are assumed to be members of parametrized model families with certain priors, and the optimal learning rule that minimizes the expected inference loss can be mathematically derived [4].

Perhaps more importantly, the problem of learning for dynamic inference can potentially serve as a meta problem for machine learning, such that almost all familiar learning problems can be cast as its special cases, including supervised learning, imitation learning, and reinforcement learning. For instance, offline learning for dynamic inference can be viewed as an extension of the behavior cloning method in imitation learning [5, 6, 7, 8], in that it not only learns the demonstrator’s action-generation model and state-transition model, but simultaneously learns a policy based on the learned models to minimize the overall context-aware imitation error. As another instance, any loss function that depends only on the observation and the estimate can be expressed in the integral form of (4), with some contextual loss function and probability transition kernel. The unknown quantity drawn from this kernel can then be viewed as a latent variable of the loss function. With this view, any reinforcement learning problem [9] can be solved as an instance of online learning for dynamic inference, where the quantities to be estimated are the latent variables of the loss function, and the quantity-generation model to be learned is the conditional distribution of the latent variable. More detailed discussions on the connection between learning for dynamic inference and other learning problems are given in [4]. The study of dynamic inference and its learning extension thus helps us gain a deeper and unifying understanding of a broad spectrum of machine learning problems.
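As a worked restatement in my own notation (the paper’s symbols are not reproduced here), the latent-variable view mentioned above writes a loss that depends only on the observation and the estimate in the integral form of (4),

```latex
\bar{\ell}(x, \hat{y}) \;=\; \int_{\mathcal{Y}} \ell(x, y, \hat{y}) \, P_{Y \mid X}(\mathrm{d}y \mid x),
```

so that the quantity averaged over $P_{Y \mid X}$ acts as a latent variable of the loss, which is what allows a reinforcement learning problem to be read as online learning for dynamic inference.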

6 Multiple views of dynamic inference and related works

The formulation of dynamic inference appears to be new, but it can be viewed from different angles, and is related to a variety of existing problems.

Sequential interactive prediction as dynamic inference

The problems that can be most naturally formulated as dynamic inference are estimation or prediction problems in sequential and interactive settings. Traditionally, some of those problems are formulated and studied using game theory [10]. This type of problem has become more common in recent years with the widespread adoption of AI systems; e.g., such problems arise in behavior prediction of vehicles [11, 12, 13], interactive recommendation with user feedback [14], and prediction in finance [15]. Dynamic inference provides a rigorous mathematical formulation of such problems, as well as an optimal solution to them.

Dynamic inference as performative prediction

Dynamic inference also shares a similar spirit with a recent line of research on performativity [16, 17, 18], where the problem is to deal with the tendency of the decision made in an optimization or prediction problem to change the underlying distribution for which the decision is made. A method called repeated risk minimization has been proposed to solve performative prediction, either with [18] or without [17] state transitions. The goal there is to minimize the loss in each single round of prediction, based on the distribution from the last round, and the hope is that such a method can reach a minimax equilibrium under certain conditions. In contrast, dynamic inference aims to minimize the overall inference loss, and it explicitly considers the multiple rounds of estimation and the state transitions in these problems.

Dynamic inference as imitation game

The formulation of dynamic inference can also be viewed as a game of imitation, where the learner drives a stateful system, observes an action from a demonstrator at each encountered state, and tries to imitate the demonstrator’s actions by minimizing the accumulated state-aware imitation error. When the underlying models are unknown, such a view can provide a rigorous formulation of imitation learning, both online and offline, and the optimal learning strategy is derived in [4]. In practice, this formulation has already been implicitly adopted by practitioners of imitation learning [19].

MDP as dynamic inference

In this work, the solution to dynamic inference is derived by reformulating it as an MDP, which can be solved by dynamic programming. Conversely, any MDP can be thought of as a dynamic inference problem, by viewing the loss function in an integral form that involves an unknown latent variable, as discussed in the previous section. The goal of the MDP is then to estimate the latent variables by minimizing the overall estimation error. This view also helps us understand reinforcement learning, especially model-based Bayesian reinforcement learning [20, 21, 22], from a learning-for-dynamic-inference perspective. Exploring the practical benefit of viewing MDPs and reinforcement learning in this way would be an interesting research problem.

Acknowledgement

The author is indebted to Peng Guan for many helpful discussions; a discussion with him on imitation learning in early 2018 motivated this study. The author is grateful to Prof. Lav Varshney for the detailed comments and many helpful suggestions, and for pointing out [17] on performative prediction. The author would also like to thank Prof. Maxim Raginsky for his encouragement to look into dynamic aspects of statistical problems.

author email: xuaolin@gmail.com

References

  • [1] M. Raginsky, Lecture notes for ECE 555 Control of Stochastic Systems, Spring 2019, University of Illinois at Urbana-Champaign, 2019.
  • [2] A. Xu and M. Raginsky, “Minimum excess risk in Bayesian learning,” arXiv 2012.14868, 2020.
  • [3] D. P. Bertsekas, Dynamic Programming and Optimal Control, Vol. 1.   Athena Scientific, 2017.
  • [4] A. Xu and P. Guan, “Bayesian learning for dynamic inference,” arXiv preprint, 2021.
  • [5] T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, and J. Peters, “An algorithmic perspective on imitation learning,” Foundations and Trends in Robotics, vol. 7, no. 1-2, 2018.
  • [6] D. B. Grimes, D. R. Rashid, and R. P. Rao, “Learning nonparametric models for probabilistic imitation,” In Advances in Neural Information Processing Systems, 2006.
  • [7] P. Englert, A. Paraschos, J. Peters, and M. P. Deisenroth, “Probabilistic model-based imitation learning,” Adaptive Behavior, 2013.
  • [8] F. Torabi, G. Warnell, and P. Stone, “Behavioral cloning from observation,” in International Joint Conference on Artificial Intelligence, 2018.
  • [9] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed.   MIT Press, 2018.
  • [10] C. Camerer, Behavioral game theory: Experiments in strategic interaction.   Princeton University Press, 2003.
  • [11] Y. Hu, W. Zhan, L. Sun, and M. Tomizuka, “Multi-modal probabilistic prediction of interactive behavior via an interpretable model,” in IEEE Intelligent Vehicles Symposium, 2019.
  • [12] J. Li, F. Yang, M. Tomizuka, and C. Choi, “Evolvegraph: Multi-agent trajectory prediction with dynamic relational reasoning,” in Conference on Neural Information Processing Systems, 2020.
  • [13] X. Ma, J. Li, M. J. Kochenderfer, D. Isele, and K. Fujimura, “Reinforcement learning for autonomous driving with latent state inference and spatial-temporal relationships,” in IEEE International Conference on Robotics and Automation, 2021.
  • [14] R. Zhang, T. Yu, Y. Shen, H. Jin, and C. Chen, “Text-based interactive recommendation via constraint-augmented reinforcement learning,” in Conference on Neural Information Processing Systems, 2019.
  • [15] E. Ippoliti, Methods and Finance: A View from Outside.   Cham: Springer International Publishing, 2017.
  • [16] L. R. Varshney, N. S. Keskar, and R. Socher, “Pretrained AI models: Performativity, mobility, and change,” arXiv:1909.03290, 2019.
  • [17] J. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt, “Performative prediction,” in International Conference on Machine Learning, 2020.
  • [18] G. Brown, S. Hod, and I. Kalemaj, “Performative prediction in a stateful world,” arXiv:2011.03885, 2020.
  • [19] Y. Pan, C. Cheng, K. Saigol, K. Lee, X. Yan, E. Theodorou, and B. Boots, “Imitation learning for agile autonomous driving,” International Journal of Robotics Research, 2019.
  • [20] M. Strens, “A Bayesian framework for reinforcement learning,” in Proceedings of the 17th International Conference on Machine Learning, pp. 943–950, 2000.
  • [21] P. Poupart, N. Vlassis, J. Hoey, and K. Regan, “An analytic solution to discrete Bayesian reinforcement learning,” International Conference on Machine Learning, 2006.
  • [22] M. Ghavamzadeh, S. Mannor, J. Pineau, and A. Tamar, Bayesian Reinforcement Learning: A Survey.   Now Foundations and Trends, 2015.