Cross-Entropic Learning of a Machine for the Decision in a Partially Observable Universe

by   Frederic Dambreville, et al.

Revision of the paper previously entitled "Learning a Machine for the Decision in a Partially Observable Markov Universe".

In this paper, we are interested in optimal decisions in a partially observable universe. Our approach is to directly approximate an optimal strategic tree depending on the observation. This approximation is made by means of a parameterized probabilistic law. A particular family of hidden Markov models, with input and output, is considered as a model of policy. A method for optimizing the parameters of these HMMs is proposed and applied. This optimization is based on the cross-entropic principle for rare events simulation developed by Rubinstein.




1 Introduction

Planning and control problems come in different degrees of difficulty. In most problems, the planner has to start from a given state and terminate in a required final state. Several transition rules condition the sequence of decisions. For example, a robot may be required to move from room A (the starting state) to room B (the final state); its decisions could be go forward, turn right or turn left, and it cannot cross a wall; these are the conditions on the decisions. A first degree of difficulty is to find at least one solution to the planning problem. When the states are only partially known or the results of the actions are not deterministic, the difficulty increases considerably: the planner has to take the various observations into account. The problem becomes much more complex when the planning is required to be optimal or near-optimal, for example when the shortest trajectory moving the robot from room A to room B has to be found. Here again there are different degrees of difficulty, depending on whether the problem is deterministic or not and on the model of the future observations. In the particular case of a Markovian problem with the full observation hypothesis, the dynamic programming principle [2] can be applied efficiently (Markov Decision Process theory, MDP). This solution has been extended to the case of partial observation (Partially Observable Markov Decision Process, POMDP), but it is generally not practicable, owing to the huge dimension of the variables [10, 4].

For this reason, different methods for approximating this problem have been introduced. For example, Reinforcement Learning (RL) methods [11] are able to learn an evaluation table of the decisions conditionally to the known universe states and to a short range of observations. In this case, the range of observation is limited in time because of the exponential growth of the table to learn. Recent works [1] investigate hierarchical RL in order to go beyond this range limitation. In any case, these methods are generally based on an additivity hypothesis about the reward. Another viewpoint is based on the direct learning of the policy [7]. Our approach is of this kind. It is based in particular on the Cross-Entropy optimization algorithm developed by Rubinstein [9]. This simulation method relies both on a probabilistic modelling of the policies (in this paper, these models are Bayesian networks) and on an efficient and robust iterative algorithm for optimizing the model parameters. More precisely, the policy is modelled by conditional probabilistic laws, i.e. decisions depending on observations, which involve memories; typically, hidden Markov models are used. A hierarchical modelling of the policies, by means of hierarchical hidden Markov models, is also implemented.

The next section introduces some formalism and gives a quick description of optimal planning in partially observable universes. A near-optimal planning method is proposed, based on the direct approximation of the optimal decision tree. The third section introduces the family of Hierarchical Hidden Markov Models (HHMM) used for approximating the decision trees. The fourth section describes the method for optimizing the parameters of the HHMM in order to approximate the optimal decision tree for the POMDP problem. The cross-entropy method is described and applied. The fifth section gives an example of application, together with a comparison with a Reinforcement Learning method, Q-learning. The paper is then concluded.

2 Decision in a partially observable universe

It is assumed that a subject is acting in a given world with a given purpose or mission. Thus, the subject interacts with the world and perceives partial information. The goal is to optimize the accomplishment of the mission, which is characterized by its reward. The forthcoming paragraphs formalize what a world is, what a mission reward is, and how an optimal policy is defined for such a mission.

The world.

The world is described by a hidden state, which evolves with time; in this paper, time is discretized and increases step by step until the end of the mission. More specifically, the hidden state at time $t$ contains information which entirely characterizes the world at that time. In the example of section 5, the hidden state is characterized by the locations of the target and patrols.

The evolution of the hidden state is given by the vector of the successive hidden states. During the mission, the subject produces decisions $d_t$, which impact the evolution of the world. In the example of section 5, $d_t$ is the move of the patrols. The subject perceives partial observations of the world, denoted $y_t$, which are noisily derived from the hidden state.

In the example, this observation is an inaccurate estimate of the target location.

In conclusion, the world is characterized by a probabilistic law describing the hidden states and observations conditionally to the decisions. At each time $t$, the hidden state and the observation $y_t$ are obtained from this law, conditioned by the past hidden states, observations and decisions. It is assumed that the decision $d_t$ is generated by the subject after receiving the observation $y_t$.

In this paper, the law of the world is quite general; for example, there is no Markovian hypothesis (such a hypothesis is required for a dynamic programming approach). Nevertheless, it is assumed that this law may be sampled very quickly. The law is illustrated by figure 1. In this figure, the outgoing arrows are related to the data produced by the world, i.e. the observations, while the incoming arrows are related to the data consumed by the world, i.e. the decisions. The variables are put in chronological order from left to right: $y_t$ happens before $d_t$, since the decision $d_t$ is produced after observing $y_t$. From now on, the law of the world is considered for the completed mission, i.e. for the whole sequences of hidden states, observations and decisions.

[Diagram: a box labelled "Hidden state", with outgoing arrows to the observations $y_1, y_2, y_3, \dots, y_t, y_{t+1}$ and incoming arrows from the decisions $d_1, d_2, d_3, \dots, d_t, d_{t+1}$, in chronological order from left to right.]

Figure 1: The world

Reward and optimal planning.

The mission is limited in time and is characterized by a reward. This reward is a function of the trajectories followed during the mission. Typically, this function could be used for computing the time needed to accomplish the mission. The only hypothesis about the reward function is that it is quickly computable. In particular, the additivity of the reward with time (i.e. a reward which is a sum of per-step rewards), a hypothesis required by many classical methods, is not necessary.
The purpose is to construct an optimal decision tree, depending on the past observations, which maximizes the mean reward; this optimization program is referred to as (1).


This optimization process is illustrated by figure 2. The double arrows are related to the variables to be optimized. These arrows describe the information flow between observations and decisions. The cells denoted $\infty$ make the decisions and transmit all the received and generated information. This architecture illustrates that planning with observation is a non-finite memory problem: the decision depends on all the past observations. Since the optimum for such a problem is generally intractable, it is necessary to search for near-optimal solutions. The alternative method proposed now relies on the optimal tuning of a probabilistic model of the policies.

[Diagram: the world of figure 1, completed by a row of cells denoted $\infty$. The cells are chained from left to right by double arrows; each cell receives the observation $y_t$ and produces the decision $d_t$ through double arrows.]

Figure 2: The optimization process
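The interaction pattern described above (the world emits an observation, then consumes a decision) and the estimation of a mean reward by simulation can be sketched as follows. This is only an illustration under assumptions: the toy world `ToyWorld`, its one-dimensional dynamics, the horizon `T` and all function names are hypothetical and do not come from the paper. What is carried over from the text is that the world is sampled step by step, that each decision may only use the past observations, and that the reward is a function of the whole trajectory (here a non-additive one).

```python
import random

class ToyWorld:
    """Hypothetical world: a hidden position on a line, observed with noise."""
    def __init__(self, rng):
        self.rng = rng
        self.x = rng.choice([-2, -1, 1, 2])   # hidden state, never shown to the policy

    def observe(self):
        # Noisy observation y_t: the sign of the hidden position, flipped 20% of the time.
        sign = 1 if self.x > 0 else -1
        return sign if self.rng.random() < 0.8 else -sign

    def step(self, d):
        # The decision d_t in {-1, 0, +1} shifts the hidden state.
        self.x += d

def rollout(policy, T, rng):
    """Sample one mission y_1, d_1, ..., y_T, d_T and return the hidden trajectory."""
    world = ToyWorld(rng)
    observations, xs = [], []
    for _ in range(T):
        observations.append(world.observe())   # y_t is produced first
        d = policy(observations)                # d_t may use all past observations
        world.step(d)
        xs.append(world.x)
    return xs

def reward(xs):
    # Non-additive reward: 1 if the hidden state ever reaches 0, else 0.
    return 1.0 if 0 in xs else 0.0

def mean_reward(policy, T=5, n=2000, seed=0):
    """Monte Carlo estimate of the mean reward of a policy."""
    rng = random.Random(seed)
    return sum(reward(rollout(policy, T, rng)) for _ in range(n)) / n

def follow_last_observation(observations):
    return -observations[-1]    # move against the last observed sign

def stay(observations):
    return 0                    # never move
```

Comparing `mean_reward(follow_last_observation)` with `mean_reward(stay)` gives a simple way to rank two candidate policies using nothing more than fast sampling of the world and a quickly computable reward.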

Approximating the decision tree.

In a program like (1), the variable to be optimized is a deterministic object. In this precise case, it is a decision tree, that is, a function which maps any sequence of observations to a decision. But a probabilistic viewpoint is more interesting when approximating. The problem is then equivalent to finding a probabilistic law of the decisions conditionally to the past observations which maximizes the mean reward.

This new problem is still illustrated by figure 2, but the double arrows now describe a Bayesian network structure for the law. Incidentally, there is no great difference from the deterministic case at the optimum: when the optimal decision tree is unique, the optimal law is a Dirac on this tree. However, the probabilistic viewpoint is better suited to approximation: it is simpler to handle probabilistic models than deterministic decision trees, and the optimization is guaranteed to be continuous; moreover, a natural approximation of the law is obtained by replacing the non-finite memories $\infty$ by finite memories $m_t$; cf. figure 3. Restricting the memory size of the policies is equivalent to approximating the law by a hidden Markov model (HMM). The approach developed in this paper is thus quite general and can be split up into two processes:

  • Define a family of parameterized HMMs,

  • Optimize the parameters of the HMM in order to maximize the mean reward.

As will be seen later, it is easy to tune an HMM optimally by the Cross-Entropy method of Rubinstein [9]. But first, the next section discusses the choice of the family of HMMs.

[Diagram: the same structure as figure 2, where the cells $\infty$ are replaced by finite memories $m_1, m_2, m_3, \dots, m_t, m_{t+1}$. Each memory $m_t$ receives the observation $y_t$ and the previous memory $m_{t-1}$, and produces the decision $d_t$.]

Figure 3: Finite-memory approximation

3 Models

General points.

The choice of the family of policy models will profoundly impact the efficiency of the approximation. In particular, the models are characterized by the memory size and by the internal structure of the HMMs (e.g. hierarchical or not). Both characteristics act upon the convergence, as will be seen in the experiments. In the simplest case, the HMMs of the family contain no structure and are distinguished by their memory size only. Example of a simple HMM:

Let a finite set of states be given, describing the memory capacity of our models. The memory of the HMM at time $t$ is then $m_t$, a variable valued within this set. An HMM is thus typically defined by two conditional laws, following the dependencies of figure 3: the law of the decision $d_t$ given the memory $m_t$, and the law of the memory $m_t$ given the observation $y_t$ and the previous memory $m_{t-1}$.

Both conditional laws are time invariant.
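A finite-memory policy of this kind can be sampled with two time-invariant tables, as in the minimal sketch below. The class name, the table layout, the integer encoding of observations, memories and decisions, and the use of a dummy initial memory are assumptions made for the illustration; the dependencies ($y_t$ and $m_{t-1}$ feed $m_t$, and $m_t$ feeds $d_t$) are those of figure 3.

```python
import random

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

class FiniteMemoryPolicy:
    """Stochastic policy with a finite memory m_t (dependencies of figure 3)."""

    def __init__(self, n_obs, n_mem, n_dec):
        self.n_mem = n_mem
        # mem_law[y][m_prev] is a distribution over m_t.
        # The extra index m_prev == n_mem stands for "no previous memory" (assumption).
        self.mem_law = [[normalize([1.0] * n_mem) for _ in range(n_mem + 1)]
                        for _ in range(n_obs)]
        # dec_law[m] is a distribution over d_t.
        self.dec_law = [normalize([1.0] * n_dec) for _ in range(n_mem)]

    def sample(self, observations, rng):
        """Return the sampled memories and decisions for a sequence of observations."""
        m_prev = self.n_mem                     # dummy initial memory
        memories, decisions = [], []
        for y in observations:
            m = rng.choices(range(self.n_mem), weights=self.mem_law[y][m_prev])[0]
            d = rng.choices(range(len(self.dec_law[m])), weights=self.dec_law[m])[0]
            memories.append(m)
            decisions.append(d)
            m_prev = m
        return memories, decisions
```

The tables start flat (uniform); they are the parameters that an optimization method would tune. For instance, `FiniteMemoryPolicy(3, 2, 4).sample([0, 2, 1], random.Random(0))` returns one sampled memory sequence and one sampled decision sequence of length three.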

Subsequently, however, the impact of both the memory size and the HMM structure will be considered. For this purpose, a specific family of hierarchical HMMs (HHMMs) is introduced and studied. HHMMs are indeed a particular case of HMMs, implementing strong internal structures.

Hierarchical HMM.

Hierarchical models are inspired by biology: to solve a complex problem, factorize it and make decisions in a hierarchical fashion. Low levels manipulate low-level information and actions, making short-term decisions. High levels manipulate high-level information and actions (with less uncertainty), making long-term decisions. Hierarchical HMMs are models of this kind. A hierarchical hidden Markov model (HHMM) is an HMM whose output is either a hierarchical HMM or an actual output. An HHMM can also be considered as a hierarchy of stochastic processes calling sub-processes. Under this common definition, HHMMs are complex structures, which are difficult to formalize and to implement. Nevertheless, these models have been introduced and applied for handwriting recognition [5], as well as for modelling complex worlds in control applications [12]. A fundamental contribution has been made by Murphy and Paskin [8], who have shown how an HHMM can be interpreted as a particular dynamic Bayesian network (DBN). Now, DBNs are easily formalized, manipulated and implemented. A DBN can be considered as an HMM with a complex internal structure. From the work of Murphy and Paskin, it can be shown that a hierarchical HMM (with input and output) can be interpreted as a DBN as described in figure 4, with discrete or semi-continuous states. It appears that there is an upward and downward flow of information between the hierarchical levels, in addition to the usual temporal flow (the Markovian property). It is important to note that boolean information is necessary for implementing the hierarchy: these booleans are needed for controlling the information flows between processes and subprocesses.

[Diagram: a dynamic Bayesian network made of several rows of cells, one row per hierarchical level. Cells are linked by temporal arrows within each row and by alternating downward and upward arrows between adjacent rows; the bottom row alternates outputs and inputs. Legend: information + boolean, output, input.]
Figure 4: Model of a controlled Hierarchical HMM

The next paragraph introduces the customized HHMM model which has been considered in this work. It is a simplification of the general HHMM model, and it allows a simpler implementation.

Implemented model.

The implemented model family is composed of HHMMs with a given number of hierarchical levels. Each level is associated with a finite memory set (the memory size may change with the level). The exchange of information between the levels is characterized by the DBN illustrated in figure 5. Notice that each memory cell receives information from the current upper-level cell and from the previous lower-level cell. As a consequence, both the hierarchical and the temporal information exchanges are guaranteed. More formally, each HHMM is built from time-invariant conditional laws, one for the decision and one for the memory of each level.

Here, $m^i_t$ is the variable for the memory at level $i$ and time $t$. It is noteworthy that this model is equivalent to a simple HMM when there is a single level. And when there is no memory level at all, the law just maps the immediate observation to decisions, without any memory of the past observations.
For any policy of the family, the complete probabilistic law of the system world/subject is defined by combining the law of the world with the law of the policy.

The issue is then to find the near-optimal strategy, i.e. the policy of the family which maximizes the mean reward under this complete law.

A solution to this problem, by means of the cross-entropy method, is proposed in the next section.

[Diagram: three rows of memory cells $m^3_t$, $m^2_t$ and $m^1_t$, above a row containing the observations $y_t$ and the decisions $d_t$, for times $1, 2, 3, \dots, t, t+1$. Each memory cell receives an arrow from the current upper-level cell and from the previous lower-level cell; the observation $y_t$ feeds $m^1_t$, and $m^1_t$ produces the decision $d_t$.]
Figure 5: HHMM model for the planning
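The information exchange of figure 5 can be sketched as a sampling procedure: at each time step the levels are visited from the top down, each memory being drawn conditionally to the current upper-level memory and to the previous lower-level memory. The code below is a hedged illustration, not the paper's implementation: the class name, the dictionary-based tables created flat on first use, and the convention that the observation plays the role of the "previous lower-level cell" for the lowest memory level are assumptions read from the figure.

```python
import random
from collections import defaultdict

class HierarchicalPolicy:
    """Sketch of a policy with stacked finite memories (dependencies of figure 5)."""

    def __init__(self, mem_sizes, n_dec):
        # mem_sizes[i] is the number of memory states at level i + 1 (level 1 is the lowest).
        self.mem_sizes = list(mem_sizes)
        self.n_dec = n_dec
        # mem_law[i][(upper, lower_prev)] holds weights over the states of level i + 1.
        # Tables are created flat (uniform) on first use.
        self.mem_law = [defaultdict(lambda n=n: [1.0] * n) for n in self.mem_sizes]
        # dec_law[lowest] holds weights over the decisions.
        self.dec_law = defaultdict(lambda: [1.0] * n_dec)

    def sample(self, observations, rng):
        n_levels = len(self.mem_sizes)
        prev = [None] * n_levels                 # memories at the previous time step
        decisions = []
        for y in observations:
            cur = [None] * n_levels
            for i in reversed(range(n_levels)):              # top level first
                upper = cur[i + 1] if i + 1 < n_levels else None
                lower_prev = y if i == 0 else prev[i - 1]    # the observation feeds level 1
                weights = self.mem_law[i][(upper, lower_prev)]
                cur[i] = rng.choices(range(self.mem_sizes[i]), weights=weights)[0]
            lowest = cur[0] if n_levels else y               # no level: decide from y directly
            d = rng.choices(range(self.n_dec), weights=self.dec_law[lowest])[0]
            decisions.append(d)
            prev = cur
        return decisions
```

With a single entry in `mem_sizes` the sketch reduces to a simple finite-memory policy, and with an empty `mem_sizes` it maps the immediate observation to a decision without any memory, which mirrors the two limit cases mentioned in the text.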

4 Cross-entropic optimization of the policy

The reader interested in CE methods should refer to the tutorial [3] and the book [9]. CE algorithms were first dedicated to estimating the probability of rare events. A slight change of the basic algorithm made it suitable for optimization as well. In their recent article [6], Homem-de-Mello and Rubinstein have given some results about global convergence. In order to ensure such convergence, some refinements are introduced, particularly about the selective rate.
This presentation is restricted to the basic CE method. The improvements of the CE algorithm proposed in [6] have not been implemented, but the basic algorithm has been seen to work properly. For this reason, this paper does not deal with the choice of the selective rate.

4.1 General CE algorithm for the optimization

The Cross Entropy algorithm repeats until convergence the three successive phases:

  1. Generate samples of random data according to a parameterized random mechanism,

  2. Select the best samples according to a reward criterion,

  3. Update the parameters of the random mechanism, on the basis of the selected samples.

In the particular case of CE, the update in phase 3 is obtained by minimizing the Kullback-Leibler divergence, or cross-entropy, between the updated random mechanism and the selected samples. The next paragraphs describe, on a theoretical example, how such a method can be used in an optimization problem.


Let a function of a variable be given; this function is easily computable. Its value has to be maximized by optimizing the choice of the variable. The function is the reward criterion.
Now let a parameterized family of probabilistic laws over this variable be given. The family is the parameterized random mechanism, and the variable is the random data.
Let a selective rate be given. The CE algorithm follows this synopsis:

  1. Initialize the parameter of the law,

  2. Generate samples according to the current law,

  3. Select the best samples according to the reward criterion, in the proportion given by the selective rate,

  4. Update the parameter as a minimizer of the cross-entropy between the law and the selected samples,

  5. Repeat from step 2 until convergence.

This algorithm requires the reward function to be easily computable and the sampling of the law to be fast.


The CE algorithm tightens the law around the maximizer of the reward function. Then, when the probabilistic family is well suited to this maximization, finding a maximizer of the function becomes equivalent to optimizing the parameter by means of the CE algorithm. The problem is to find a good family. Another issue is the criterion for deciding the convergence. Some answers are given in [6], but it is outside the scope of this paper to investigate these questions precisely. Our criterion was to stop after a given threshold of successive unsuccessful tries, and this very simple method has worked fine on our problem.
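As an illustration of this generic loop, the sketch below maximizes a function of a bit vector with a family of independent Bernoulli laws. Everything specific here is an assumption chosen for the example: the Bernoulli family, the default values of `n_samples`, `rho` (selective rate) and `patience` (threshold of successive unsuccessful tries), and the function name. The update relies on a standard fact: for independent Bernoulli laws, the minimizer of the cross-entropy with the selected samples is the empirical frequency of each bit.

```python
import random

def cross_entropy_maximize(f, n_bits, n_samples=200, rho=0.1, patience=5,
                           max_iter=200, seed=0):
    """Basic CE optimization of f over bit vectors (illustrative sketch)."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                        # flat initial law
    n_elite = max(1, int(rho * n_samples))
    best_x, best_value, failures = None, float("-inf"), 0
    for _ in range(max_iter):
        # Generate samples from the current law.
        samples = [[1 if rng.random() < pi else 0 for pi in p]
                   for _ in range(n_samples)]
        # Select the best samples according to the reward criterion.
        samples.sort(key=f, reverse=True)
        elite = samples[:n_elite]
        # Update: empirical frequency of each bit among the selected samples.
        p = [sum(x[i] for x in elite) / n_elite for i in range(n_bits)]
        # Stopping rule: count successive iterations without improvement.
        top_value = f(elite[0])
        if top_value > best_value:
            best_x, best_value, failures = elite[0], top_value, 0
        else:
            failures += 1
            if failures >= patience:
                break
    return best_x, best_value, p
```

For example, with a hypothetical reward counting the agreements with a hidden bit pattern, `cross_entropy_maximize(lambda x: sum(a == b for a, b in zip(x, pattern)), len(pattern))` returns the best sample found, its value, and the final parameters, which concentrate around the pattern.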

4.2 Application

Optimizing the policy means tuning its parameter in order to tighten the probability around the optimal values of the reward. This is exactly what the Cross-Entropy optimization method solves. However, the reward function is required to be easily computable. Typically, the definition of the reward may be recursive.

Let a positive selective rate be given. The cross-entropy method for optimizing the policy follows this synopsis:

  1. Initialize the law of the policy, for example with a flat law,

  2. Build samples according to the complete law of the system world/subject,

  3. Choose the best samples according to the reward, in the proportion given by the selective rate, and collect them in the set of the selected samples,

  4. Update the law of the policy as the minimizer of the cross-entropy with the selected samples; this is the maximization referred to as (2),

  5. Reiterate from step 2 until convergence.

For our HHMM model, the maximization (2) is solved separately for the conditional law of the decisions and, for each level, for the conditional law of the memory.
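For conditional laws given as discrete tables, a maximization of this kind has a well-known closed form: each conditional probability is the ratio of two counts taken over the selected samples. The sketch below illustrates this counting update; the function name and the representation of the selected samples as (context, value) pairs are assumptions made for the example, and the paper's exact formulas are not reproduced here.

```python
from collections import Counter, defaultdict

def update_conditional_law(selected_pairs):
    """Maximum-likelihood conditional law h(value | context) from (context, value) pairs.

    The pairs are assumed to be collected over all time steps of all selected samples,
    e.g. (m_t, d_t) for the law of the decisions given the memory.
    """
    joint = Counter(selected_pairs)
    marginal = Counter(context for context, _ in selected_pairs)
    law = defaultdict(dict)
    for (context, value), count in joint.items():
        law[context][value] = count / marginal[context]
    return dict(law)
```

Contexts that never occur in the selected samples receive no entry; keeping their previous law is one possible convention, which is also an assumption of this sketch.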

The next section presents an example of implementation of the algorithm described in section 4.2.

5 Implementation

The algorithm has been applied to a simulated target detection problem.

5.1 Problem setting

A target moves in a square lattice of cells and is tracked by two mobiles, which are controlled by the subject. The coordinates of the target and of the two mobiles evolve with the time $t$. The mobiles have very limited information about the target position and maneuver much more slowly than the target:

  • A move for a mobile is either: turn left, turn right, go forward, or no move. Consequently, there are $4 \times 4 = 16$ possible joint actions for the subject. These moves cannot be combined in a single turn. There is no diagonal forward move: a mobile is directed either up, right, down or left,

  • The mobiles are initially positioned in the two bottom corners of the lattice and are initially directed downward,

  • Each mobile observes whether the relative position of the target is forward or not. More precisely:

    • when the mobile is directed upward, it knows whether the target lies in the upward direction or not,

    • when the mobile is directed right, it knows whether the target lies to the right or not,

    • when the mobile is directed downward, it knows whether the target lies in the downward direction or not,

    • when the mobile is directed left, it knows whether the target lies to the left or not,

  • Each mobile knows whether its distance to the target is less than a given threshold or not, for a distance defined on the lattice.

In the end, there are $4 \times 4 = 16$ possible joint observations for the subject, since each mobile receives two binary observations.
Several test cases have been considered. In case 1, the target does not move. In every other case, the target stochastically chooses its next position in its neighborhood. Any move is possible (up/down, left/right, diagonals, no move). The probability of choosing a new position is proportional to the sum of the squared distances from this position to the two mobiles.

This definition was intended to favor escape moves: the greater the distance, the more probable the move. But in such a summation, a short distance is negligible compared to a long distance. It follows that a distant mobile will hide a nearby mobile. This “deluding” property actually induces two different kinds of strategy within the learned machines.
The objective of the subject is to keep the target sufficiently close to at least one mobile (in this example, the distance between the target and a mobile is required not to exceed a given threshold). More precisely, the reward function simply counts the number of such “encounters” during the mission.

The total number of turns is fixed.
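The moving rule of the target, and the "deluding" effect it produces, can be illustrated by the sketch below. Several elements are assumptions: the distance is taken to be Euclidean (the paper's own definition is not reproduced here), moves leaving the lattice are simply discarded, and the lattice size, positions and function name are hypothetical.

```python
def target_move_law(target, mobiles, size):
    """Law of the next target position over its neighborhood (illustrative sketch).

    Each candidate cell is weighted by the sum of its squared (Euclidean, assumed)
    distances to the mobiles; cells outside the lattice are discarded (assumed).
    """
    tx, ty = target
    weights = {}
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            nx, ny = tx + dx, ty + dy
            if 0 <= nx < size and 0 <= ny < size:
                weights[(nx, ny)] = sum((nx - mx) ** 2 + (ny - my) ** 2
                                        for mx, my in mobiles)
    total = sum(weights.values())
    if total == 0:
        # Degenerate case: fall back to a uniform law.
        return {cell: 1.0 / len(weights) for cell in weights}
    return {cell: w / total for cell, w in weights.items()}
```

With a nearby mobile just below the target and a distant mobile far above it, e.g. `target_move_law((5, 5), [(5, 4), (5, 19)], 20)`, the cell occupied by the nearby mobile gets a larger probability than the cell on the opposite side: the long distance dominates the sum, so the distant mobile hides the nearby one.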

5.2 Results


Like many stochastic algorithms, this algorithm needs some time to converge. For the considered example, about two hours were needed for convergence (on a 2 GHz PC) with the chosen selective rate. This speed depends on the size of the HHMM model and on the convergence criterion. A weak and a strong criterion are used for deciding the convergence. Under the weak criterion, the algorithm is terminated after a given number of successive unsuccessful tries. Under the strong criterion, it is terminated after a larger number of successive unsuccessful tries. Of course, the strong criterion computes a (slightly) better optimum than the weak criterion, but it needs more time. Because of the many tested examples, the weak criterion has been used most often, in particular for the big models. For a given HHMM model, the computed optimal values do not depend on the algorithmic run (small variations result, however, from the stochastic nature of the algorithm).
In the sequel, mean rewards are rounded to the nearest integer or are expressed as a percentage of the optimum. This makes the presentation clearer and, owing to the small variations of this stochastic algorithm, more precision would be irrelevant.

Case 1: the target does not move.

This example has been considered in order to test the algorithm. The position of the target is fixed in the center of the square space. It is recalled that the mobiles are initially directed downward. The optimal strategy is then known, as is its value: the mobiles need a known number of turns to reach the target, and no further move is needed afterwards. The learned policy approximates this optimal reward. The convergence is good.

Case 2: the target is moving but the observation is hidden.

Initially, the target is located within the upper cells of the lattice, according to a uniform probabilistic law. The computed optimal mean reward is about . In this case, the mobiles tend to move towards the upper corners.

Case 3: the target is moving and is observed.

Again, the target is initially located uniformly within the upper cells of the lattice. The computed optimal mean reward is about . This reward has been obtained from a large HHMM model ( with states per level, ie. ) and with the strong criterion. However, somewhat smaller models should work as well.
Specific computations are now presented, depending on the number of levels and on the number of states per level. For each case, the weak criterion has been used. The rewards are now expressed as percentages of the optimum.
Subcase  . For such model, the action is constructed only from the immediate last observation . The model does not keep any memory of the past observations. Then, only states are sufficient to describe the hidden variable  , ie. . The resulting reward is of the optimum.
Subcase  . This model is equivalent to a HMM and it is assumed that . The following table gives the computed reward for several choices of the memory size:

It is noteworthy that the memory of the past observations allows better strategies than the only last observation (case ) . Indeed, the reward jumps from up to .
Subcases . A comparison of graduated hierarchic models, , has been made. The first level contained possible states, and the higher levels were restricted to states:

The test has been accomplished according to the weak criterion:

and the strong criterion:

It seems that a high hierarchic grade (i.e. more structure) makes the convergence difficult. This is particularly the case here for the grade  , which failed under the weak criterion at only . However, the algorithm works again when improving the convergence criterion.
It is interesting to make a comparison with the subcase where . Under the weak criterion, the result for this HHMM was as for the grade . However, the dimension of the law is quite different for the two models:

  • for the -level HHMM,

  • for the -level HHMM.

This dimension is a rough characterization of the complexity of the model. It seems clear on these examples that the highly hierarchized models are more efficient than the weakly hierarchized models. And the problem considered here is quite simple. On complex problems, hierarchical models may be pre-eminent.

Global behavior.

The algorithm. The convergence speed is low at the beginning. After this initial stage, it improves greatly until it reaches a new “waiting” stage. This alternation of slow and fast stages has been noticed several times.
The near-optimal policy. The behaviour of the best policy found is now discussed. This policy has reached the mean reward . The strategy of the mobiles results in a tracking of the target. Figure 6 illustrates a short sequence of escape/tracking of the target. Two quite distinct behaviours have been noticed among the many runs of the policy:

  • The two mobiles may both cooperate on tracking the target,

  • When the target is near a border, one mobile may stay along the opposite border while the other mobile performs the tracking. This strategy seems strange at first sight, but recall that the moving rule of the target tends to neglect a nearby mobile compared to a distant mobile. In this strategy, the first mobile simply cancels the ability of the target to escape from the tracking of the second mobile.

[Diagram: successive positions of the target, observer 1 and observer 2 on the lattice; relative times are given in superscript.]
Figure 6: Near-optimal control sequence

5.3 Comparison with the Q-learning

Q-learning is a reinforcement learning method based on the computation of a table that evaluates each decision conditionally on the known information. The known information is typically the state of the world if it is known, or partial states and observations. Since the amount of known information grows exponentially with the observation range, the test only implements a Q-learning based on the immediate past observation. Let us first recall some theoretical background on Q-learning.


A founding reference on reinforcement learning is the well-known book of Sutton and Barto [11], which is available on the Internet. This paragraph does not go deeply into the subject and is limited to a simple description of Q-learning. Moreover, we assume an infinite horizon (that is ) with a weak discounting of the reward , so as to implement the algorithm in its most classical form. Tests have also been made with a finite horizon, but they did not achieve a good convergence for the considered algorithm.
The learning relies on the following hypotheses:

  • At each step $t$, the subject has a (partial) knowledge $s_t$ of the state of the world, and chooses an action $a_t$,

  • Let $R_t$ be the cumulated reward from step $t$ to step $T$. Assume a state $s_t$ and action $a_t$ at step $t$. Then $R_t = r(s_t,a_t) + \gamma R_{t+1}$, i.e. an instantaneous reward $r(s_t,a_t)$ is obtained and cumulated to the discounted future reward.
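The recursive definition of the cumulated reward can be unrolled backwards over a finite sequence of instantaneous rewards. The following Python sketch is only illustrative: the function name `discounted_return` and the name `gamma` for the discount factor are assumptions made here, and the reward cumulated beyond the last step is taken as zero.

```python
def discounted_return(rewards, gamma):
    """Cumulated reward computed backwards with R_t = r_t + gamma * R_{t+1}."""
    total = 0.0  # assumption: nothing is cumulated after the last step
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Three unit rewards with a strong discount, then without discount.
print(discounted_return([1.0, 1.0, 1.0], 0.5))
print(discounted_return([1.0, 1.0, 1.0], 1.0))
```

With a discount factor close to 1 (weak discounting), the cumulated reward approaches the plain sum of the instantaneous rewards.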

The question is: given a current state $s$, what is the best action $a$ to take? The answer is simple if we are able to predict the future and evaluate the expected cumulated reward $Q(s,a)$ for any $a$: the best action is $\arg\max_a Q(s,a)$. The following algorithm could be used for learning the table $Q$ (taken from [11]):

  • Initialize $Q(s,a)$ arbitrarily

  • (Repeat for each episode:  [finite-horizon case])

    • Initialize $s$

    • Repeat for each step (of the episode):

      • With probability $1-\epsilon$ choose $a \in \arg\max_{a'} Q(s,a')$; otherwise choose $a$ randomly

      • Take action $a$, receive reward $r$ and observe the new state $s'$

      • Set $Q(s,a) \leftarrow Q(s,a) + \alpha \bigl( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \bigr)$

      • Set $s \leftarrow s'$

    • (until $s$ is terminal)

where $\alpha$ controls the convergence speed and $\epsilon$ the innovation.
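The tabular procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for the experiment: the toy environment (`reset`, `step`), the parameter values `alpha`, `gamma`, `epsilon`, and all names are hypothetical; only the update rule follows the classical Q-learning of [11]. Ties between equally valued actions are broken at random, which is a design choice of this sketch.

```python
import random
from collections import defaultdict


def q_learning(reset, step, actions, episodes, horizon,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning with an epsilon-greedy choice of the action."""
    rng = random.Random(seed)
    q = defaultdict(float)  # table Q[(state, action)], initialised to 0
    for _ in range(episodes):
        s = reset()
        for _ in range(horizon):
            if rng.random() < epsilon:
                a = rng.choice(actions)  # innovation (exploration)
            else:
                best = max(q[(s, b)] for b in actions)
                a = rng.choice([b for b in actions if q[(s, b)] == best])
            s_next, r, done = step(s, a)
            # Target: instantaneous reward plus discounted best future value.
            target = r if done else r + gamma * max(q[(s_next, b)] for b in actions)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
            if done:
                break
    return q


# Hypothetical toy world: positions 0..4 on a line, reward 1 when reaching 4.
def reset():
    return 0


def step(s, a):
    s_next = min(4, max(0, s + a))
    done = s_next == 4
    return s_next, (1.0 if done else 0.0), done


q = q_learning(reset, step, actions=[-1, 1], episodes=2000, horizon=50)
policy = {s: max([-1, 1], key=lambda b: q[(s, b)]) for s in range(4)}
print(policy)
```

In this toy world the greedy policy extracted from the table moves toward the rewarded position. A dictionary-based table only stores visited state/action pairs, whereas a dense array would allocate every pair in advance.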
In our implementation, , , , and the instantaneous reward is compliant with the experiment definition of the previous section. Since contains the last observation plus the known part of the world state, this experiment should be equivalent to [case 3/subcase ] considered previously. The computer memory needed to store the table was approximately gigabyte: we are around the limits of the computer. In particular, it is rather difficult to handle a greater observation range without some approximations.
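The memory issue can be illustrated by a rough size estimate for a dense table indexed by the known part of the state, the remembered observations and the action. The sketch below is an assumption-laden illustration: the function `q_table_bytes` and every number in it are hypothetical and are not the figures of the experiment.

```python
def q_table_bytes(n_states, n_observations, n_actions, memory, bytes_per_entry=8):
    """Bytes needed by a dense table over (known state, last `memory` observations, action)."""
    return n_states * n_observations ** memory * n_actions * bytes_per_entry


# Hypothetical sizes, chosen only to show the exponential growth.
for memory in (1, 2, 3):
    size = q_table_bytes(n_states=10_000, n_observations=16, n_actions=25, memory=memory)
    print(memory, f"{size / 1e9:.3f} GB")
```

Each additional remembered observation multiplies the table size by the number of possible observations, which is why a table-based approach is restricted here to the immediate past observation.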


The algorithm was stopped after iterations, which seemed sufficient. It took several hours, but the algorithm has not been optimized. To make the comparison with our method possible, the Q-strategies have been evaluated by a non-discounted cumulation of the reward over 100-step-wide windows. Moreover, these evaluations have been made:

  • from the initial stage of the simulation, so as to conform to previous section,

  • after many cycles, so as to simulate an infinite horizon.
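The windowed evaluation can be sketched as follows. This is a hedged illustration: the helper names `window_scores` and `summarize` are hypothetical, the windows are assumed to be consecutive and non-overlapping, and the reward sequence is synthetic. Under these assumptions, the first window corresponds to the initial stage, while later windows approximate the behaviour after many cycles.

```python
def window_scores(rewards, width=100):
    """Non-discounted sum of the rewards over consecutive windows of `width` steps."""
    return [sum(rewards[i:i + width])
            for i in range(0, len(rewards) - width + 1, width)]


def summarize(scores):
    """Best, worst and mean window score."""
    return {"best": max(scores), "worst": min(scores),
            "mean": sum(scores) / len(scores)}


# Synthetic rewards: a good window, a window where tracking is lost, an average one.
rewards = [1.0] * 100 + [0.0] * 100 + [0.5] * 100
scores = window_scores(rewards)
print(scores[0])          # score of the initial stage
print(summarize(scores))  # statistics over all windows
```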

The following table makes a comparison between the Q-strategies and the model based strategies with .

It is first noticed that the policy obtained by Q-learning is less regulated than the model-based policy. Moreover, although it may be quite good at tracking a target once the encounter has been initiated (best is ), it is rather bad at initiating the encounter (mean for initial stage is ) or when the tracking is lost (worst is ). Finally, the mean evaluation at infinite horizon is , which is even smaller than that of the model-based policy working from the initial stage.
On this example, and for this simple Q-learning implementation, the comparison is favorable to the model-based policy. Moreover, model-based policies are able to handle a wider observation range. Note, however, that this planning example has been constructed so as to make the management of the state variables (the dimension is huge) and of the observations (the observations are poor and have to be combined) difficult. For such a problem, a more dedicated RL method should be chosen.

6 Conclusion

In this paper, we proposed a general method for approximating the optimal planning in a partially observable world. Hierarchical HMM families have been used for approximating the optimal decision tree, and the approximation has been optimized by means of the Cross-Entropy method.
At this time, the method has been applied to a strictly discrete-state problem and has been seen to work properly. This algorithm has been compared favorably with a Q-learning implementation of the considered problem: it is able to handle a wider observation range, and the optimized policy is more regulated. An interesting point is that the optimized policy has discovered two quite different global strategies and is able to choose between them: either make both mobiles cooperate on tracking, or dedicate one mobile to deluding the target.
The results are promising. However, the observation and action spaces are limited to a small number of states. And what happens if the hidden space becomes much more intricate? There are several possible answers to such difficulties:
First, the cross-entropic principle could be applied to the optimization of continuous laws. It is thus certainly possible to consider semi-continuous models, which would be more realistic for a planning policy. Secondly, many refinements are foreseeable in the structure of the models. Hierarchic models for observation, decision and memory should be improved in order to locally factorize intricate problems. This research is just preliminary and future work should investigate these questions.


  • [1] B. Bakker, J. Schmidhuber, Hierarchical Reinforcement Learning Based on Subgoal Discovery and Subpolicy Specialization, in Proceedings of the 8-th Conference on Intelligent Autonomous Systems, Amsterdam, The Netherlands, p. 438-445, 2004.
  • [2] Richard Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1957.
  • [3] De Boer and Kroese and Mannor and Rubinstein, A Tutorial on the Cross-Entropy Method,
  • [4] Anthony Rocco Cassandra, Exact and approximate algorithms for partially observable Markov decision processes, PhD thesis, Brown University, Rhode Island, Providence, May 1998.
  • [5] Shai Fine and Yoram Singer and Naftali Tishby, The Hierarchical Hidden Markov Model: Analysis and Application, Machine Learning, 1998.

  • [6] Homem-de-Mello, Rubinstein, Rare Event Estimation for Static Models via Cross-Entropy and Importance Sampling,
  • [7] N. Meuleau, L. Peshkin, Kee-Eung Kim, L.P. Kaelbling, Learning finite-state controllers for partially observable environments, in Proc. of UAI-99, pages 427–436, Stockholm, 1999.
  • [8] Kevin Murphy and Mark Paskin, Linear Time Inference in Hierarchical HMMs, Proceedings of Neural Information Processing Systems, 2001.
  • [9] R. Rubinstein, D. P. Kroese, The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning, Information Science & Statistics, Springer 2004.
  • [10] Edward J. Sondik, The Optimal Control of Partially Observable Markov Processes, PhD thesis, Stanford University, Stanford, California, 1971.
  • [11] R.S. Sutton, A.G. Barto, Reinforcement Learning, MIT Press, Cambridge, Massachusetts, 2000.
  • [12] Georgios Theocharous, Hierarchical Learning and Planning in Partially Observable Markov Decision Processes, PhD thesis, Michigan State University, 2002.