This paper is dedicated to sequential decision-making problems that may be modeled as Markov Decision Processes for which the system dynamics may be partially observable, a class of problems often called Partially Observable Markov Decision Processes (POMDPs). Within this setting, we focus on decision-making strategies computed using Reinforcement Learning (RL). RL approaches rely on observations gathered through interactions with the (PO)MDP, and, although most RL approaches have strong convergence properties, classic RL approaches are challenged by data scarcity. When acquisition of new observations is possible (the “online” case), data scarcity is gradually phased out using strategies balancing the exploration / exploitation (E/E) tradeoff. The scientific literature related to this topic is vast; in particular, Bayesian RL techniques [Ross et al., 2011; Ghavamzadeh et al., 2015] offer an elegant way of formalizing the E/E tradeoff.
However, such E/E strategies are not applicable when the acquisition of new observations is no longer possible (the “batch” setting). Within this context, we propose to revisit RL as a learning paradigm that faces, similarly to supervised learning, a tradeoff between simultaneously minimizing two sources of error: an asymptotic bias and an overfitting error. The asymptotic bias (also simply called bias in the following) directly relates to the choice of the RL algorithm (and its parameterization). Any RL algorithm defines a policy class as well as a procedure to search within this class, and the bias may be defined as the performance gap between the best candidate optimal policies and the actual optimal policies. This bias does not depend on the set of observations. On the other hand, overfitting is an error term induced by the fact that only a limited amount of data is available to the algorithm, which may thus overfit to suboptimal policies. This overfitting error vanishes as the size and the quality of the dataset increase.
In this paper, we focus on studying the interactions between these two sources of error, in a setting where the system dynamics is partially observable. Due to this particular setting, one needs to build a state representation from a history of data. By increasing the cardinality of the state representation, the algorithm may be provided with a more informative representation of the POMDP, but at the price of simultaneously increasing the size of the set of candidate policies, thus also increasing the risk of overfitting. We analyze this tradeoff in the case where the RL algorithm provides an optimal solution to the frequentist-based MDP associated with the state representation (independently of the method used by the learning algorithm to converge towards that solution). Our analysis relies on expressing the quality of a state representation by bounding error terms of the associated belief states, thus defining $\epsilon$-sufficient statistics in the hidden state dynamics.
Experimental results illustrate the theoretical findings on a distribution of POMDPs in the case where the state representations are truncated histories of observations. In particular, we illustrate the link between the variance observed when dealing with different datasets (directly linked to the size of the dataset) and overfitting: variance translates into overfitting when the feature space is (too) large.
We also discuss and illustrate how using function approximators and adapting the discount factor play a role in the tradeoff between bias and overfitting. This provides the reader with an overview of key elements involved in this tradeoff.
The remainder of the paper is organized as follows. Section 2 formalizes POMDPs, (limited) sets of observations and state representations. Section 3 details the main contribution of this paper: an analysis of the bias-overfitting tradeoff in batch POMDPs. Section 4 empirically illustrates the main theoretical results, while Section 5 concludes.
We consider a discrete-time POMDP model described by the 7-tuple $(S, A, T, R, \Omega, O, \gamma)$ where
$S$ is a finite set of states $\{s^{(1)}, \ldots, s^{(|S|)}\}$,
$A$ is a finite set of actions $\{a^{(1)}, \ldots, a^{(|A|)}\}$,
$T : S \times A \times S \to [0, 1]$ is the transition function (set of conditional transition probabilities between states),
$R : S \times A \to \mathcal{R}$ is the reward function, where $\mathcal{R}$ is a continuous set of possible rewards in a range $[0, R_{max}]$ (e.g., $R_{max} = 1$ without loss of generality),
$\Omega$ is a finite set of observations $\{\omega^{(1)}, \ldots, \omega^{(|\Omega|)}\}$,
$O : S \times \Omega \to [0, 1]$ is a set of conditional observation probabilities, and
$\gamma \in [0, 1)$ is the discount factor.
The environment starts in a distribution of initial states $b(s_0)$. At each time step $t \in \mathbb{N}$, the environment is in a state $s_t \in S$. At the same time, the agent receives an observation $\omega_t \in \Omega$ which depends on the state of the environment with probability $O(s_t, \omega_t)$, and the agent has to take an action $a_t \in A$. Then, the environment transitions to state $s_{t+1} \in S$ with probability $T(s_t, a_t, s_{t+1})$ and the agent receives a reward $r_t \in \mathcal{R}$ equal to $R(s_t, a_t)$.
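This generative process can be sketched as a short simulation loop. The following is an illustrative Python sketch (the function name and table layout are ours, not from the paper), where `T`, `R` and `O` are tabular versions of the transition, reward and observation functions:

```python
import random

def sample_trajectory(T, R, O, b0, policy, horizon, rng=random):
    """Roll out one observable history (w_0, a_0, r_0, w_1, ...) from a
    tabular POMDP.  T[s][a] and O[s] are rows of probabilities, R[s][a]
    is the reward, b0 the initial state distribution; `policy` maps the
    observable history so far to an action."""
    n_states = len(b0)
    s = rng.choices(range(n_states), weights=b0)[0]   # hidden initial state
    history = []
    for _ in range(horizon):
        w = rng.choices(range(len(O[s])), weights=O[s])[0]  # w_t ~ O(s_t, .)
        a = policy(history + [w])                           # agent sees only observables
        r = R[s][a]
        history += [w, a, r]
        s = rng.choices(range(n_states), weights=T[s][a])[0]  # s_{t+1} ~ T(s_t, a_t, .)
    return history
```

Note that the hidden state `s` never appears in the returned history; only observations, actions and rewards are visible to the agent.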
2.1 Processing a history of data
Policies considered in this paper are mappings from (a set of) observation(s) into actions. A naive approach to building a space of candidate policies is to consider the set of mappings taking only the very last observation(s) as input. However, in a POMDP setting, this leads to candidate policies that are likely not rich enough to capture the system dynamics, and are thus suboptimal [Singh et al., 1994; Wolfe, 2006]. There is thus no alternative to using a history of previously observed features to better estimate the hidden state dynamics. We denote by $H_t = (\omega_0, a_0, r_0, \ldots, \omega_{t-1}, a_{t-1}, r_{t-1}, \omega_t) \in \mathcal{H}_t$ the history observed up to time $t$ for $t \in \mathbb{N}$, and by $\mathcal{H} = \bigcup_{t \in \mathbb{N}} \mathcal{H}_t$ the space of all possible observable histories.
A straightforward approach is to take the whole history as input of candidate policies [Braziunas, 2003]. However, taking a too long history may have several drawbacks. Indeed, increasing the size of the set of candidate optimal policies generally implies: (i) more computation to search within this set [Singh et al., 1994; McCallum, 1996] and (ii) an increased risk of including candidate policies suffering overfitting (see Section 3). In this paper, we are specifically interested in minimizing the latter overfitting drawback while keeping an informative state representation.
We define a mapping $\phi : \mathcal{H} \to \Sigma$, where $\Sigma$ is of finite cardinality $|\Sigma|$. Note that a mapping $\phi$ induces an upper bound on the number of candidate policies: $|A|^{|\Sigma|}$.
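A common concrete choice for such a mapping, used later in the experiments, keeps only a truncated history. A minimal sketch (the helper name and the padding convention are our own assumptions):

```python
def phi_truncate(history, n_obs, n_act):
    """Map a full observable history (w_0, a_0, w_1, a_1, ..., w_t) to
    the tuple of its last n_obs observations and last n_act actions,
    padding with None when the history is shorter.  The number of
    distinct outputs, hence the bound |A|^|Sigma| on candidate policies,
    grows with n_obs and n_act."""
    obs = history[0::2]   # observations sit at even indices
    acts = history[1::2]  # actions at odd indices
    last_obs = ([None] * n_obs + obs)[-n_obs:]
    # guard: a [-0:] slice would keep the whole list instead of nothing
    last_acts = ([None] * n_act + acts)[-n_act:] if n_act > 0 else []
    return tuple(last_obs) + tuple(last_acts)
```

The returned tuple is hashable, so it can directly serve as a state $\sigma \in \Sigma$ of the augmented MDP defined below.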
The belief state $b(s \mid H_t)$ (resp. $b(s \mid \phi(H_t))$) is defined as the vector of probabilities whose $i$-th component ($i \in \{1, \ldots, |S|\}$) is given by $\mathbb{P}(s_t = s^{(i)} \mid H_t)$ (resp. $\mathbb{P}(s_t = s^{(i)} \mid \phi(H_t))$), for all sequences $H_t \in \mathcal{H}$.
A mapping $\phi_0$ is a particular mapping $\phi$ such that $\phi_0(H_t)$ is a sufficient statistic for the POMDP: $\mathbb{P}(s_t = s \mid H_t) = \mathbb{P}(s_t = s \mid \phi_0(H_t))$ for all states $s \in S$ and for all sequences $H_t \in \mathcal{H}$.
Note that $\mathbb{P}(s_t = s \mid \phi(H_t))$ is a well-defined probability distribution even if $\phi(H_t)$ is not a sufficient statistic, in the case where we consider a well-defined probability distribution of $H_t$.
A mapping $\phi_\epsilon$ is a particular mapping $\phi$ such that $\phi_\epsilon(H_t)$ is an $\epsilon$-sufficient statistic for the POMDP, i.e. it satisfies the following condition, with $\epsilon \geq 0$ and with the $L_\infty$ norm: $\left\| \mathbb{P}(s_t = \cdot \mid H_t) - \mathbb{P}(s_t = \cdot \mid \phi_\epsilon(H_t)) \right\|_\infty \leq \epsilon$ for all sequences $H_t \in \mathcal{H}$.
2.2 Working with a limited dataset
Let $\mathcal{P}$ be a set of POMDPs with fixed $S$, $A$, $\Omega$ and $\gamma$. For any $P \in \mathcal{P}$, we denote by $D \sim \mathcal{D}_{P}$ a random dataset generated according to a probability distribution $\mathcal{D}_{P}$ over the set of (unordered) trajectories of a given length. One such trajectory is defined as the observable history obtained in $P$ when starting from $s_0 \sim b(s_0)$ and following a stochastic sampling policy $\pi_s$ that ensures a non-zero probability of taking any action given an observable history $H_t$. For simplicity, we will denote $\mathcal{D}_{P}$ simply as $\mathcal{D}$. For the purpose of the analysis, we also introduce the asymptotic dataset $D_\infty$ that would be theoretically obtained in the case where one could generate an infinite number of observations (infinitely many trajectories of infinite length).
In this paper, the algorithm cannot generate additional data. The challenge is to determine a high-performance policy (in the actual environment) while having only access to a fixed dataset .
2.3 Assessing the performance of a policy
In this paper, we will consider stationary and deterministic control policies $\pi : \Sigma \to A$. Any particular choice of $\phi$ induces a particular definition of the policy space $\Pi$. We introduce $V^{\pi}_{P}(H_t)$ with $\pi \in \Pi$ as the expected return obtained over an infinite time horizon when the system is controlled using policy $\pi$ in the POMDP $P$:
$V^{\pi}_{P}(H_t) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, H_t, \pi \right]$,
where $r_{t+k} = R(s_{t+k}, a_{t+k})$ and $a_{t+k} = \pi(\phi(H_{t+k}))$.
Note that Equation 1 is well defined even if $\phi(H_t)$ is not a sufficient statistic, provided that $\mathbb{P}(s_t = s \mid \phi(H_t))$ is a well-defined probability distribution.
Let $\pi^{*} \in \Pi$ be an optimal policy in $P$ defined as:
$\pi^{*} \in \operatorname*{arg\,max}_{\pi \in \Pi} \; \mathbb{E}_{\omega_0 \sim b_{\Omega}(\cdot)} V^{\pi}_{P}(\omega_0)$,
where $b_{\Omega}$ is the distribution of initial observations (compatible with the distribution of initial states $b(s_0)$ through the conditional observation probabilities $O$). Note that, as will become clear in the following, the definition of $\pi^{*}$ is the same as the one usually used for a POMDP (with known $T$, $R$ and $O$ and known initial belief state), such as in [Sondik, 1978].
3 Bias-overfitting in RL with partial observability
Importance of the feature space:
This section introduces and analyzes a bias-overfitting decomposition of the performance gap of RL policies computed from the frequentist-based augmented MDP built from the dataset $D$. In that setting, the agent behaves optimally with respect to the maximum-likelihood model estimated from the data under the chosen abstraction, which allows us to remove from the analysis how the RL algorithm converges towards that solution. Let us first define the frequentist-based augmented MDP:
With a mapping $\phi : \mathcal{H} \to \Sigma$ and the dataset $D$ built from interactions with $P \in \mathcal{P}$ while following a policy $\pi_s$, the frequentist-based augmented MDP $(\Sigma, A, \hat{T}, \hat{R}, \gamma)$, also denoted $\hat{M}_{D, \phi}$ for simplicity, is defined with
the state space $\Sigma$,
the action space $A$,
the estimated transition function: for $\sigma, \sigma' \in \Sigma$ and $a \in A$, $\hat{T}(\sigma, a, \sigma')$ is the number of times we observe the transition $(\sigma, a) \to \sigma'$ in $D$ divided by the number of times we observe $(\sigma, a)$; if a pair $(\sigma, a)$ has never been encountered in the dataset, we set $\hat{T}(\sigma, a, \cdot)$ to a default distribution,
the estimated reward function: for $\sigma \in \Sigma$ and $a \in A$, $\hat{R}(\sigma, a)$ is the mean of the rewards observed at $(\sigma, a)$; if a pair $(\sigma, a)$ has never been encountered in the dataset, $\hat{R}(\sigma, a)$ is set to the average of the rewards observed over the whole dataset $D$, and
the discount factor $\gamma$.
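The count-based estimates of this definition can be sketched as follows (an illustrative Python version; the names and the data layout are our assumptions, and the fallback values for unseen pairs are omitted):

```python
from collections import defaultdict

def estimate_augmented_mdp(transitions):
    """Count-based (maximum-likelihood) estimates of the augmented MDP.
    `transitions` is a list of (sigma, a, r, sigma2) tuples, where sigma
    is the abstract state phi(H_t).  Returns dicts T_hat[(s, a, s2)] and
    R_hat[(s, a)]; pairs never seen in the data are simply absent (the
    definition above assigns them default values)."""
    counts = defaultdict(int)       # visits of (sigma, a, sigma2)
    visits = defaultdict(int)       # visits of (sigma, a)
    reward_sum = defaultdict(float)
    for s, a, r, s2 in transitions:
        counts[(s, a, s2)] += 1
        visits[(s, a)] += 1
        reward_sum[(s, a)] += r
    T_hat = {k: c / visits[(k[0], k[1])] for k, c in counts.items()}
    R_hat = {k: rs / visits[k] for k, rs in reward_sum.items()}
    return T_hat, R_hat
```

Solving this tabular MDP (e.g., with value iteration) then yields the frequentist-based policy used in the analysis below.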
By definition of $\phi_0$ and as long as $\phi = \phi_0$, the asymptotic frequentist-based MDP $\hat{M}_{D_\infty, \phi}$ (when unlimited data is available) actually gathers the relevant information from the actual POMDP. Indeed, in the model-based context where the POMDP dynamics is known (i.e., $T$, $R$ and $O$ are given), the knowledge of $H_t$ allows calculating the belief state $b(s \mid H_t)$ (computed recursively thanks to the Bayes rule based on $T$ and $O$). It is then possible to define, from the history $H_t$ and for any action $a$, the expected immediate reward as well as a transition function into the next observation $\omega_{t+1}$:
In the frequentist approach, this information is actually estimated from interactions with the POMDP in and without any explicit knowledge of the dynamics of the POMDP.
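The recursive Bayes-rule belief update mentioned above can be sketched as follows (an illustrative implementation with tabular `T` and `O`; the function name is ours):

```python
def belief_update(b, a, w, T, O):
    """One step of the recursive Bayes-rule belief update:
    b'(s') is proportional to O[s'][w] * sum_s T[s][a][s'] * b(s)."""
    n = len(b)
    b_next = [O[s2][w] * sum(T[s][a][s2] * b[s] for s in range(n))
              for s2 in range(n)]
    z = sum(b_next)  # normalisation constant: probability of observing w
    if z == 0.0:
        raise ValueError("observation has zero probability under the predicted belief")
    return [x / z for x in b_next]
```

Iterating this update along a history $H_t$ yields $b(s \mid H_t)$, which is exactly the quantity the frequentist approach approximates from data without knowing $T$ and $O$.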
We introduce $\hat{V}^{\pi}_{D}(\sigma)$ with $\pi \in \Pi$ as the expected return obtained over an infinite time horizon when the system is controlled using a policy $\pi$ s.t. $a_t = \pi(\sigma_t)$ in the augmented decision process $\hat{M}_{D, \phi}$: $\hat{V}^{\pi}_{D}(\sigma) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^{k} \hat{r}_{t+k} \,\middle|\, \sigma_t = \sigma, \pi \right]$, where $\hat{r}_{t+k}$ is a reward s.t. $\hat{r}_{t+k} = \hat{R}(\sigma_{t+k}, a_{t+k})$ and the dynamics is given by $\hat{T}$.
The frequentist-based policy $\hat{\pi}^{*}_{D}$ is an optimal policy of the augmented MDP defined as: $\hat{\pi}^{*}_{D} \in \operatorname*{arg\,max}_{\pi \in \Pi} \hat{V}^{\pi}_{D}(\sigma)$ for all $\sigma \in \Sigma$.
Let us now decompose the error of using a frequentist-based policy :
The term "bias" actually refers to an asymptotic bias, i.e. the error that remains when the size of the dataset tends to infinity, while the term "overfitting" refers to the expected suboptimality due to the finite size of the dataset.
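Using the notations above, a plausible reconstruction of this decomposition (the exact placement of the expectation over datasets is our assumption) is:

```latex
\underbrace{\mathbb{E}_{D \sim \mathcal{D}}
  \left[ V^{\pi^{*}}_{P} - V^{\hat{\pi}^{*}_{D}}_{P} \right]}_{\text{error}}
=
\underbrace{\left( V^{\pi^{*}}_{P} - V^{\hat{\pi}^{*}_{D_\infty}}_{P} \right)}_{\text{asymptotic bias}}
+
\underbrace{\mathbb{E}_{D \sim \mathcal{D}}
  \left[ V^{\hat{\pi}^{*}_{D_\infty}}_{P} - V^{\hat{\pi}^{*}_{D}}_{P} \right]}_{\text{overfitting}}
```

The first term depends only on the choice of $\phi$ (and hence of $\Pi$); the second vanishes as the dataset grows.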
Selecting carefully the feature space allows building a class of policies that have the potential to accurately capture information from data (low bias), but also generalize well (low overfitting). On the one hand, using too many non-informative features will increase overfitting, as stated in Theorem 2. On the other hand, a mapping that discards useful available information will suffer an asymptotic bias, as stated in Theorem 1 (arbitrarily large depending on the POMDP and on the features discarded).
Let $P$ be a POMDP described by the 7-tuple $(S, A, T, R, \Omega, O, \gamma)$. Let $\hat{M}_{D, \phi}$ be an augmented MDP estimated, according to Definition 3.1, from a dataset $D$. Then the asymptotic bias can be bounded as follows:
where $\epsilon$ is such that the mapping $\phi$ respects the conditions to be an $\epsilon$-sufficient statistic $\phi_\epsilon$.
The proof is deferred to Appendix A.1. This bound is an original result based on the belief states (which were not considered in other works) via the $\epsilon$-sufficient statistic ($L_\infty$ norm error). Note that bisimulation metrics [Ferns et al., 2004] may also be used to take into account how errors on the belief states may have less of an impact, provided that the hidden states affected by these errors are close according to the bisimulation metrics.
We now provide a bound on the overfitting error; it grows monotonically with the cardinality $|\Sigma|$ of the state representation.
Let $P$ be a POMDP described by the 7-tuple $(S, A, T, R, \Omega, O, \gamma)$. Let $\hat{M}_{D, \phi}$ be an augmented MDP estimated, according to Definition 3.1, from a dataset $D$, with the assumption that $D$ has $n$ transitions from any possible pair $(\sigma, a) \in \Sigma \times A$. Then the overfitting due to using the frequentist-based policy $\hat{\pi}^{*}_{D}$ instead of $\pi^{*}$ in the POMDP can be bounded as follows:
with probability at least $1 - \delta$.
The proof is deferred to Appendix A.2. Theorem 2 shows that using a large set of features allows a larger policy class, hence potentially leading to a stronger drop in performance when the available dataset is limited (the bound decreases proportionally to $1 / \sqrt{n}$). A theoretical analysis in the context of MDPs with a finite dataset was performed in [Jiang et al., 2015a].
Overall, Theorems 1 and 2 can help to choose a good state representation for POMDPs, as they provide bounds on the two terms that appear in the bias-overfitting decomposition of Equation 2. For example, an additional feature in the mapping $\phi$ has an overall positive effect only if it provides a significant increase of information on the belief state (i.e., if it allows obtaining a more accurate knowledge of the underlying hidden state when given $\phi(H_t)$). This increase of information must be significant enough to compensate for the additional risk of overfitting induced by the larger cardinality of $\Sigma$. Note that one could combine the two bounds to theoretically define an optimal choice of the state representation with lower-bound guarantees regarding the bias-overfitting tradeoff.
Importance of function approximators:
As described earlier, a straightforward mapping $\phi$ may be obtained by discarding a few features from the observable history. In addition, it is also possible to learn a processing of the features by selecting a suitable function approximator structure (e.g., in a Q-learning scheme) that constrains policies to have interesting generalization properties. Note that a theorem similar to Theorem 2 that takes into account complexity measures of the function approximator (e.g., its Rademacher complexity) may provide a tighter bound. This is left out of the scope of this paper because such a bound is usually of little interest in practice, specifically concerning deep learning (these bounds fail to provide insights on the generalization capabilities of neural networks [Zhang et al., 2016]).
It is worth noting that, in the case of neural networks, architectures such as convolutional layers or recurrent connections are particularly well suited to deal with a large input space because they offer interesting generalization properties that are adapted to high-dimensional sensory inputs for which hierarchical patterns can be found (see, e.g., [LeCun et al., 2015]). A few recent successes make use of convolutional layers [Mnih et al., 2015] and/or recurrent layers [Hausknecht & Stone, 2015] (e.g., LSTMs [Hochreiter & Schmidhuber, 1997]) to solve large-scale POMDPs.
Importance of the discount factor used in the training phase:
Selection of the parameters with validation or cross-validation to balance the bias-overfitting tradeoff:
In the batch setting case, the selection of the policy parameters to effectively balance the bias-overfitting tradeoff can be done similarly to that in supervised learning (e.g., cross-validation) as long as the performance criterion can be estimated from a subset of the trajectories from the dataset not used during training (validation set). One possibility is to fit an MDP model from data via the frequentist approach (or regression), and evaluate the policy against the model. Another approach is to use the idea of importance sampling [Precup, 2000]. A mix of the two approaches can be found in [Jiang & Li, 2016; Thomas & Brunskill, 2016].
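As an illustration of the importance-sampling idea mentioned above [Precup, 2000], here is a minimal sketch of ordinary importance sampling for off-policy evaluation on a held-out validation set (the names and the trajectory layout are our assumptions):

```python
def importance_sampling_return(trajectories, pi_e, pi_b, gamma):
    """Ordinary importance-sampling estimate of the expected return of a
    deterministic target policy `pi_e`, from trajectories collected under
    a stochastic behaviour policy with known probabilities pi_b(h, a) > 0.
    Each trajectory is a list of (history, action, reward) tuples."""
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (h, a, r) in enumerate(traj):
            # likelihood ratio of the target vs. behaviour policy
            weight *= (1.0 if pi_e(h) == a else 0.0) / pi_b(h, a)
            ret += (gamma ** t) * r
        estimates.append(weight * ret)
    return sum(estimates) / len(estimates)
```

This estimator is unbiased but can have high variance; the doubly robust estimators cited above [Jiang & Li, 2016; Thomas & Brunskill, 2016] combine it with a fitted model to reduce that variance.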
We randomly sample POMDPs with fixed sizes $|S|$, $|A|$ and $|\Omega|$ (except when stated otherwise) from a distribution that we refer to as Random POMDP. The distribution is fully determined by specifying a distribution over the set of possible transition functions $T$, a distribution over the set of reward functions $R$, and a distribution over the set of possible conditional observation probabilities $O$.
Random transition functions are drawn by assigning, for each entry $T(s, a, s')$, a zero value with probability 3/4 and, with probability 1/4, a non-zero value drawn uniformly in $[0, 1]$. For all $(s, a)$, if all entries $T(s, a, \cdot)$ are zero, we enforce one non-zero value for a random $s'$. Values are then normalized so that $T(s, a, \cdot)$ is a probability distribution.
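This sampling procedure can be sketched as follows (an illustrative Python version; the forced non-zero value is set to 1.0, an arbitrary choice since a single non-zero entry normalizes to a point mass anyway):

```python
import random

def random_transition_function(n_states, n_actions, rng=random):
    """Sample a random transition function T[s][a][s2] following the
    procedure described above: each entry is zero with probability 3/4
    and otherwise drawn uniformly in [0, 1]; all-zero rows get one
    forced non-zero entry; each row is then normalised."""
    T = [[[rng.uniform(0.0, 1.0) if rng.random() < 0.25 else 0.0
           for _ in range(n_states)]
          for _ in range(n_actions)]
         for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            row = T[s][a]
            if sum(row) == 0.0:
                # arbitrary non-zero value: a lone entry normalises to 1 anyway
                row[rng.randrange(n_states)] = 1.0
            z = sum(row)
            T[s][a] = [x / z for x in row]
    return T
```

The resulting transition matrices are sparse (roughly three quarters of the entries are zero before normalization), which makes the hidden dynamics less diffuse.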
Random reward functions are generated by associating to all possible pairs $(s, a)$ a reward sampled uniformly and independently from $[0, 1]$.
Random conditional observation probabilities are generated the following way: the probability of observing $\omega^{(i)}$ when being in state $s^{(i)}$ is equal to 0.5, while all other values are chosen uniformly at random and then normalized so that $O(s, \cdot)$ is a probability distribution for any $s$.
For all POMDPs, we fix $\gamma$ (if not stated otherwise), and we truncate the trajectories to a fixed number of time steps.
For each generated POMDP $P$, we generate 20 datasets $D_i \sim \mathcal{D}_{P}$, where $\mathcal{D}_{P}$ is a probability distribution over all possible sets of trajectories; each trajectory is made up of a history of 100 time steps, starting from an initial state $s_0 \sim b(s_0)$ while taking uniformly random decisions. Each dataset $D_i$ induces a policy $\hat{\pi}^{*}_{D_i}$, and we want to evaluate the expected return of this policy while discarding the variance related to the stochasticity of the transitions, observations and rewards. To do so, policies are tested with 1000 rollouts of the policy. For each POMDP $P$, we are then able to get an estimate of the average score, which is defined as:
We are also able to get an estimate of a parametric variance defined as:
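With the 20 datasets per POMDP described above, a plausible reconstruction of these two estimators (the exact normalization constants are our assumption) is:

```latex
\bar{V}(P) \approx \frac{1}{20} \sum_{i=1}^{20} V^{\hat{\pi}^{*}_{D_i}}_{P},
\qquad
\hat{\sigma}^{2}(P) \approx \frac{1}{19} \sum_{i=1}^{20}
  \left( V^{\hat{\pi}^{*}_{D_i}}_{P} - \bar{V}(P) \right)^{2},
```

where each $V^{\hat{\pi}^{*}_{D_i}}_{P}$ is itself approximated by the empirical mean over the 1000 rollouts.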
4.2 History processing
In this section, we show experimentally that any additional feature is likely to reduce the asymptotic bias, but may also increase overfitting. For each dataset $D$, we define the policy $\hat{\pi}^{*}_{D}$ according to Definition 3.2 for every history length of interest. When the history is limited to one, two or three time steps, the state of the frequentist-based augmented MDP is built from, respectively: the current observation only; the last two observations as well as the last action; or the three last observations and the two last actions.
The estimated average scores and variances are displayed in Figure 1. One can observe that a small set of features (short history) appears to be a better choice (in terms of total bias) when the dataset is poor (only a few trajectories). With an increasing number of trajectories, the optimal number of features also increases. In addition, one can also observe that the expected variance of the score decreases as the number of samples increases. As the variance decreases, the risk of overfitting also decreases, and it becomes possible to target a larger policy class (using a larger feature set).
The overfitting error is also linked to the variance of the value function estimates in the frequentist-based MDP. When these estimates have a large variance, an overfitting term appears because of a higher chance of picking one of the suboptimal policies (when using Definition 3.2), as illustrated in Figure 2. Note that this variance term has already been studied in the context of estimating value functions in finite MDPs using frequentist approximations ($\hat{T}$ and $\hat{R}$) of the parameters of the MDPs [Mannor et al., 2007].
4.3 Function approximator and discount factor
We also illustrate the effect of using function approximators on the bias-overfitting tradeoff. To do so, we process the output of the state representation with a deep Q-learning scheme (technical details are given in Appendix A.3). We can see in Figure 3 that deep Q-learning policies suffer less from overfitting than the frequentist-based approach (lower degradation of performance in the low-data regime), even though using a large set of features still leads to more overfitting than a small set of features. We can also see that deep Q-learning policies avoid introducing an important asymptotic bias (identical performance when a lot of data is available) because the neural network architecture is rich enough. Note that the variance is slightly larger than in Figure 1 and does not vanish with additional data. This is due to the additional stochasticity induced when building the Q-value function with neural networks (note that, when performing the same experiments while taking the average recommendation of several Q-value functions, this variance decreases with the number of Q-value functions).
Finally, we empirically illustrate in Figure 4 the effect of the discount factor. When the training discount factor is lower than the one used in the actual POMDP, there is an additional bias term, while a high training discount factor combined with a limited amount of data increases overfitting. In our experiments, the influence of the discount factor is more subtle than the impact of the state representation and the function approximator. The trend is nonetheless clear: a low discount factor is preferable when only little data is available, and a high discount factor is preferable when a lot of data is available.
5 Conclusion and future works
This paper discusses the bias-overfitting tradeoff of batch RL algorithms in the context of POMDPs. We propose an analysis showing that, similarly to supervised learning techniques, RL may face a bias-overfitting dilemma in situations where the policy class is too large compared to the batch of data. In such situations, we show that it may be preferable to concede an asymptotic bias in order to reduce overfitting. This (favorable) asymptotic bias may be introduced in different manners: (i) downsizing the state representation, (ii) using specific types of function approximators and (iii) lowering the discount factor. The main theoretical results of this paper relate to the state representation. Compared to [Maillard et al., 2011; Ortner et al., 2014] and related work, the originality of our setting is mainly to formalize the problem in a batch setting (limited set of tuples) instead of the online setting. As compared to [Jiang et al., 2015b], the originality is to consider a partially observable setting.
The work proposed in this paper may also be of interest in online settings because, at each stage, obtaining a high-performing policy from given data is part of the solution to an efficient exploration/exploitation tradeoff. For instance, this sheds light on the interest of progressively increasing the discount factor [François-Lavet et al., 2015]. Besides, optimizing the bias-overfitting tradeoff suggests dynamically adapting (along learning) the feature space and the function approximator (e.g., through ad hoc regularization, or by adapting the neural network architecture, using for instance the Net2Net transformation [Chen et al., 2015]).
Abel, David, Hershkowitz, David, and Littman, Michael. Near optimal behavior via approximate state abstraction. In Proceedings of The 33rd International Conference on Machine Learning, pp. 2915–2923, 2016.
Braziunas, Darius. POMDP solution methods. University of Toronto, Tech. Rep., 2003.
Chen, Tianqi, Goodfellow, Ian, and Shlens, Jonathon. Net2Net: Accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641, 2015.
Ferns, Norm, Panangaden, Prakash, and Precup, Doina. Metrics for finite Markov decision processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, pp. 162–169. AUAI Press, 2004.
François-Lavet, Vincent, Fonteneau, Raphael, and Ernst, Damien. How to discount deep reinforcement learning: Towards new dynamic strategies. arXiv preprint arXiv:1512.02011, 2015.
Ghavamzadeh, Mohammad, Mannor, Shie, Pineau, Joelle, Tamar, Aviv, et al. Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning, 8(5-6):359–483, 2015.
Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, volume 9, pp. 249–256, 2010.
Hausknecht, Matthew and Stone, Peter. Deep recurrent Q-learning for partially observable MDPs. arXiv preprint arXiv:1507.06527, 2015.
Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
Hoeffding, Wassily. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
Jiang, Nan and Li, Lihong. Doubly robust off-policy value evaluation for reinforcement learning. In Proceedings of The 33rd International Conference on Machine Learning, pp. 652–661, 2016.
Jiang, Nan, Kulesza, Alex, and Singh, Satinder. Abstraction selection in model-based reinforcement learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 179–188, 2015a.
Jiang, Nan, Kulesza, Alex, Singh, Satinder, and Lewis, Richard. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015b.
LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436–444, 2015.
Maillard, Odalric-Ambrym, Ryabko, Daniil, and Munos, Rémi. Selecting the state-representation in reinforcement learning. In Advances in Neural Information Processing Systems, pp. 2627–2635, 2011.
Mannor, Shie, Simester, Duncan, Sun, Peng, and Tsitsiklis, John N. Bias and variance approximation in value function estimates. Management Science, 53(2):308–322, 2007.
McCallum, Andrew Kachites. Reinforcement learning with selective perception and hidden state. PhD thesis, University of Rochester, 1996.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
Ortner, Ronald, Maillard, Odalric-Ambrym, and Ryabko, Daniil. Selecting near-optimal approximate state representations in reinforcement learning. In International Conference on Algorithmic Learning Theory, pp. 140–154. Springer, 2014.
Petrik, Marek and Scherrer, Bruno. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pp. 1265–1272, 2009.
Precup, Doina. Eligibility traces for off-policy policy evaluation. Computer Science Department Faculty Publication Series, pp. 80, 2000.
Ross, Stéphane, Pineau, Joelle, Chaib-draa, Brahim, and Kreitmann, Pierre. A Bayesian approach for learning and planning in partially observable Markov decision processes. The Journal of Machine Learning Research, 12:1729–1770, 2011.
Singh, Satinder P., Jaakkola, Tommi S., and Jordan, Michael I. Learning without state-estimation in partially observable Markovian decision processes. In ICML, pp. 284–292, 1994.
Sondik, Edward J. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2):282–304, 1978.
Thomas, Philip S. and Brunskill, Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, 2016.
Wolfe, Alicia Peregrin. POMDP homomorphisms. In NIPS RL Workshop, 2006.
Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A Appendix
A.1 Proof of Theorem 1
In the frequentist-based MDP , for , let us define
where the reward is given by $\hat{R}$. Then it follows that
By using the definition of the $\epsilon$-sufficient statistics, we have
which has the consequence that
It follows that:
where the last line is obtained by noticing that the transition functions are always normalized (they sum to one over the destination states),
Applying Lemma 1 from [Abel et al., 2016]:
By further noticing that the dynamics of the two processes, when starting in the same state, provide an identical value function for a given policy, the theorem follows.
A.2 Proof of Theorem 2
Let us denote:
It follows that
where $Q^{\pi}$ is the action-value function for policy $\pi$ in the augmented MDP $\hat{M}_{D, \phi}$. Similarly, $Q^{\pi}_{\infty}$ is the action-value function for policy $\pi$ in $\hat{M}_{D_\infty, \phi}$.
By Lemma 1, we have:
We notice that the estimate is the mean of i.i.d. variables bounded in the interval $[0, R_{max}]$ and with mean equal to the true expectation for any policy $\pi$. Therefore, according to Hoeffding’s inequality [Hoeffding, 1963], we have, with $n$ the number of tuples for every pair $(\sigma, a)$:
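For reference, the generic form of Hoeffding's inequality invoked here: for $N$ i.i.d. random variables $X_1, \ldots, X_N$ taking values in $[a, b]$, with empirical mean $\bar{X}$,

```latex
\mathbb{P}\!\left( \left| \bar{X} - \mathbb{E}[X_1] \right| \geq t \right)
\;\leq\; 2 \exp\!\left( - \frac{2 N t^{2}}{(b - a)^{2}} \right).
```

In the proof, $N$ corresponds to the number $n$ of tuples per pair $(\sigma, a)$ and $[a, b]$ to the reward range.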
As we want to obtain a bound over all pairs $(\sigma, a)$ and a union bound on all policies $\pi \in \Pi$ (indeed, Equation 5 does not hold for $\hat{\pi}^{*}_{D}$ alone because that policy is not chosen randomly), we want the right-hand side of Equation 5 to be small enough that the overall failure probability is at most $\delta$. This gives the stated bound, and we conclude that:
with probability at least $1 - \delta$.
For any $\pi \in \Pi$ and the frequentist-based augmented MDP $\hat{M}_{D, \phi}$ defined from $D$ according to Definition 3.1, we have:
Proof of Lemma 1
Given any policy , let us define s.t.
where . We have
Taking the limit, we have
which completes the proof.
A.3 Q-learning with neural network as a function approximator: technical details of Figure 3
The neural network is made up of three intermediate fully connected layers with 20, 50 and 20 neurons with ReLU activation functions, and is trained with Q-learning. Weights are initialized with a Glorot uniform initializer [Glorot & Bengio, 2010]. The network is trained using a target Q-network with a freeze interval of 100 mini-batch gradient descent steps (see [Mnih et al., 2015]). It uses an RMSprop update rule (learning rate of 0.005, rho = 0.9), mini-batches of size 32 and 20000 mini-batch gradient descent steps.
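For concreteness, the forward pass of this architecture can be sketched in NumPy (an illustrative reimplementation, not the authors' code; the Q-learning training loop is omitted):

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng):
    """Glorot (Xavier) uniform initialisation [Glorot & Bengio, 2010]."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def build_q_network(input_dim, n_actions, seed=0):
    """Weights of the described network: three hidden fully connected
    layers of 20, 50 and 20 units, plus a linear output layer."""
    rng = np.random.default_rng(seed)
    sizes = [input_dim, 20, 50, 20, n_actions]
    return [(glorot_uniform(a, b, rng), np.zeros(b))
            for a, b in zip(sizes, sizes[1:])]

def q_values(params, x):
    """Forward pass: ReLU on the hidden layers, linear output layer."""
    h = np.asarray(x, dtype=float)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h  # one Q-value per action
```

The greedy policy then simply selects `np.argmax(q_values(params, state))` for each abstract state.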