A Hybrid POMDP-BDI Agent Architecture with Online Stochastic Planning and Plan Caching

07/03/2016 ∙ by Gavin Rens, et al. ∙ 0

This article presents an agent architecture for controlling an autonomous agent in stochastic environments. The architecture combines the partially observable Markov decision process (POMDP) model with the belief-desire-intention (BDI) framework. The Hybrid POMDP-BDI agent architecture takes the best features from the two approaches, that is, the online generation of reward-maximizing courses of action from POMDP theory, and sophisticated multiple goal management from BDI theory. We introduce the advances made since the introduction of the basic architecture, including (i) the ability to pursue multiple goals simultaneously and (ii) a plan library for storing pre-written plans and for storing recently generated plans for future reuse. A version of the architecture without the plan library is implemented and is evaluated using simulations. The results of the simulation experiments indicate that the approach is feasible.




1 Introduction

Imagine a scenario where a planetary rover has five tasks of varying importance. The tasks could be, for instance, collecting gas (for industrial use) from a natural vent at the base of a hill, taking a temperature measurement at the top of the hill, performing self-diagnostics and repairs, recharging its batteries at the solar charging station and collecting soil samples wherever the rover is. The rover is programmed to know the relative importance of collecting soil samples. The rover also has a model of the probabilities with which its various actuators fail and the probabilistic noise-profile of its various sensors. The rover must be able to reason (plan) in real-time to pursue the right task at the right time while considering its resources and dealing with various events, all while considering the uncertainties about its actions (actuators) and perceptions (sensors).

We propose an architecture for the proper control of an agent in a complex environment such as the scenario described above. The architecture combines belief-desire-intention (BDI) theory (Bratman, 1987; Rao and Georgeff, 1995) and partially observable Markov decision processes (POMDPs) (Monahan, 1982; Lovejoy, 1991). Traditional BDI architectures (BDIAs) cannot deal with probabilistic uncertainties and they do not generate plans in real-time. A traditional POMDP cannot manage goals (major and minor tasks) as well as BDIAs can. Next, we analyse the POMDPs and BDIAs in a little more detail.

One of the benefits of agents based on BDI theory is that they need not generate plans from scratch; their plans are already (partially) compiled, and they can act quickly once a goal is focused on. Furthermore, the BDI framework can deal with multiple goals. However, their plans are usually not optimal, and it may be difficult to find a plan which is applicable to the current situation. That is, the agent may not have a plan in its library which exactly ‘matches’ what it ideally wants to achieve. On the other hand, POMDPs can generate optimal policies on the spot which are highly applicable to the current situation. Moreover, policies account for stochastic actions in partially observable environments. Unfortunately, generating optimal POMDP policies is usually intractable. One solution to the intractability of POMDP policy generation is to employ a continuous planning strategy, or agent-centred search (Koenig, 2001). Aligned with agent-centred search is the forward-search approach or online planning approach in POMDPs (Ross et al., 2008).

The traditional BDIA maintains goals as desires; there is no reward for performing some action in some state. The reward function provided by POMDP theory is useful for modeling certain kinds of behavior or preferences. For instance, an agent based on a POMDP may want to avoid moist areas to prevent its parts becoming rusty. Moreover, a POMDP agent can generate plans which can optimally avoid moist areas. But one would not say that avoiding moist areas is the agent’s task. And POMDP theory maintains a single reward function; there is no possibility of weighing alternative reward functions and pursuing one at a time for a fixed period—all objectives must be considered simultaneously, in one reward function. Reasoning about objectives in POMDP theory is not as sophisticated as in BDI theory. A BDI agent cannot, however, simultaneously avoid moist areas and collect gas; it has to switch between the two or combine the desire to avoid moist areas with every other goal.

The Hybrid POMDP-BDI agent architecture (or HPB architecture, for short) has recently been introduced (Rens and Meyer, 2015). It combines the advantages of POMDP theoretic reasoning and the potentially sophisticated means-ends reasoning of BDI theory in a coherent agent architecture. In this paper, we generalize the management of goals by allowing for each goal to be pursued with different intensities, yet concurrently.

Typically, BDI agents do not deal with stochastic uncertainty. Integrating POMDP notions into a BDIA addresses this. For instance, an HPB agent will maintain a (subjective) belief state representing its probabilistic (uncertain) belief about its current state. Planning with models of stochastic actions and perceptions is possible in the HPB architecture. The tight integration of POMDPs and BDIAs is novel to this architecture, especially in combination with desires with changing intensity levels.

This article serves to introduce two significant extensions to the first iteration (Rens and Meyer, 2015) of the HPB architecture. The first extension allows for multiple intentions to be pursued simultaneously, instead of one at a time. In the previous architecture, only one intention was actively pursued at any moment. In the new version, one agent action can take an agent closer to more than one goal at the moment the action is performed, which is the result of a new approach to planning. As a consequence of allowing multiple intentions, the policy generation module (§ 4.3), the desire function and the method of focusing on intentions (§ 4.2) had to be adapted. The second extension is the addition of a plan library. Previously, a policy (conditional plan) would have to be generated periodically and regularly to supply the agent with recommendations for the actions it needs to take. Although one of the strengths of traditional BDI theory is the availability of a plan library with pre-written plans for quick use, a plan library was excluded from the HPB architecture so as to simplify the architecture’s introduction. Now we propose a framework where an agent designer can store hand-written policies in a library of plans and where generated policies are stored for later reuse. Every policy in the library is stored together with a ‘context’ in which it will be applicable and the set of intentions which it is meant to satisfy. There are two advantages of introducing a plan library: (i) policies can be tailored by experts to achieve specific goals in particular contexts, giving the agent immediate access to recommended courses of action in those situations, and (ii) policies, once generated, can be stored for later reuse so that the agent can take advantage of past ‘experience’, saving time and computation.

In Section 2, we review the necessary theory, including POMDP and BDI theory. In Section 3, we describe the basic HPB architecture. The extensions to the basic architecture are presented in Section 4. Section 5 describes two simulation experiments in which the proposed architecture is tested, evaluating the performance on various dimensions. The results of the experiments confirm that the approach may be useful in some domains. The last section discusses some related work and points out some future directions for research in this area.

2 Preliminaries

The basic components of a BDI architecture (Wooldridge, 1999, 2002) are

  • a set or knowledge-base of beliefs;

  • an option generation function, generating the objectives the agent would ideally like to pursue (its desires);

  • a set of desires (goals to be achieved);

  • a ‘focus’ function which selects intentions from the set of desires;

  • a structure of intentions of the most desirable options/desires returned by the focus function;

  • a library of plans and subplans;

  • a ‘reconsideration’ function which decides whether to call the focus function;

  • an execution procedure, which affects the world according to the plan associated with the intention;

  • a sensing or perception procedure, which gathers information about the state of the environment; and

  • a belief update function, which updates the agent’s beliefs according to its latest observations and actions.

Exactly how these components are implemented determines the particular BDI architecture.

Input: $B_0$: initial beliefs
Input: $I_0$: initial intentions
$B \leftarrow B_0$;
$I \leftarrow I_0$;
$\pi \leftarrow \mathit{null}$;
while alive do
      $p \leftarrow \mathit{getPercept}()$;
      $B \leftarrow \mathit{update}(B, p)$;
      $D \leftarrow \mathit{wish}(B, I)$;
      $I \leftarrow \mathit{focus}(B, D, I)$;
      $\pi \leftarrow \mathit{plan}(B, I)$;
      $\mathit{execute}(\pi)$;
Algorithm 1 Basic BDI agent control loop

Algorithm 1 (adapted from Wooldridge (2000, Fig. 2.3)) is a basic BDI agent control loop. $\pi$ is the current plan to be executed. $\mathit{getPercept}$ senses the environment and returns a percept (processed sensor data) which is an input to $\mathit{update}$, which updates the agent’s beliefs. $\mathit{wish}$ generates a set of desires, given the agent’s beliefs, current intentions and possibly its innate motives. It is usually impractical for an agent to pursue the achievement of all its desires. It must thus filter out the most valuable and achievable desires. This is the function of $\mathit{focus}$, taking beliefs, desires and current intentions as parameters. Together, the processes performed by $\mathit{wish}$ and $\mathit{focus}$ may be called deliberation, formally encapsulated by the $\mathit{deliberate}$ procedure. $\mathit{plan}$ returns a plan from the plan library to achieve the agent’s current intentions.
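The control loop described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the six component functions are passed in as parameters, and their names (`get_percept`, `update`, `wish`, `focus`, `plan`, `execute`, `alive`) are assumptions chosen to mirror the roles named in the text.

```python
def bdi_loop(beliefs, intentions, get_percept, update, wish, focus,
             plan, execute, alive):
    """Run a basic BDI control loop until alive() returns False.

    All callables are supplied by the caller (hypothetical interface):
      get_percept() -> percept
      update(beliefs, percept) -> beliefs
      wish(beliefs, intentions) -> desires
      focus(beliefs, desires, intentions) -> intentions
      plan(beliefs, intentions) -> plan
      execute(plan) -> None
    """
    current_plan = None
    while alive():
        percept = get_percept()                           # sense
        beliefs = update(beliefs, percept)                # belief update
        desires = wish(beliefs, intentions)               # option generation
        intentions = focus(beliefs, desires, intentions)  # filter desires
        current_plan = plan(beliefs, intentions)          # means-ends step
        execute(current_plan)                             # act
    return beliefs, intentions
```

Passing the components as functions keeps the loop itself fixed while the architecture is varied, which matches the remark that the implementation of the components determines the particular BDI architecture.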

A more sophisticated controller would have the agent consider whether to re-deliberate, with a reconsideration function placed just before deliberation would take place. The agent could also test at every iteration through the main loop whether the currently pursued intention is still possibly achievable. Serendipity could also be taken advantage of by periodically testing whether the intention has been achieved, without the plan being fully executed. Such an agent is considered ‘reactive’ because it executes one action per loop iteration; this allows for deliberation between executions. There are various mechanisms which an agent might use to decide when to reconsider its intentions. See, for instance, Bratman (1987); Pollack and Ringuette (1990); Kinny and Georgeff (1991, 1992); Schut and Wooldridge (2000, 2001); Schut et al. (2004).

In a partially observable Markov decision process (POMDP), the actions the agent performs have non-deterministic effects in the sense that the agent can only predict with a likelihood in which state it will end up after performing an action. Furthermore, its perception is noisy. That is, when the agent uses its sensors to determine in which state it is, it will have a probability distribution over a set of possible states to reflect its conviction for being in each state.

Formally (Kaelbling et al., 1998), a POMDP is a tuple $\langle S, A, T, R, Z, O, b^0 \rangle$ with

  • $S$, a finite set of states of the world (that the agent can be in),

  • $A$, a finite set of actions (that the agent can choose to execute),

  • a transition function $T(s, a, s')$, the probability of being in $s'$ after performing action $a$ in state $s$,

  • $R(a, s)$, the immediate reward gained for executing action $a$ while in state $s$,

  • $Z$, a finite set of observations the agent can perceive in its world,

  • a perception function $O(s', a, z)$, the probability of observing $z$ in state $s'$ resulting from performing action $a$ in some other state, and

  • $b^0$, the initial probability distribution over all states in $S$.

In general, we regard an observation as the signal recognized by a sensor; the signal is generated by some event which is not directly perceivable.

A belief state $b$ is a set of pairs $(s, p)$ where each state $s$ in $S$ is associated with a probability $p$. All probabilities must sum up to one, hence, $b$ forms a probability distribution over the set $S$ of all states. To update the agent’s beliefs about the world, a special function $SE(a, z, b) = b_n$ is defined as

$$b_n(s') = \frac{O(s', a, z) \sum_{s \in S} T(s, a, s')\, b(s)}{Pr(z \mid a, b)},$$

where $a$ is an action performed in ‘current’ belief state $b$, $z$ is the resultant observation and $b_n(s')$ denotes the probability of the agent being in state $s'$ in ‘new’ belief state $b_n$. Note that $Pr(z \mid a, b)$ is a normalizing constant.
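The belief update can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the belief state is a dictionary from states to probabilities, and the transition and perception functions are passed as callables `T(s, a, s2)` and `O(s2, a, z)`; these names and the dictionary representation are choices made here, not part of the original text. The normalizing constant is computed as the sum of the unnormalized values.

```python
def belief_update(b, a, z, T, O):
    """Return the new belief state after doing action a and observing z.

    b: dict mapping each state to its probability (sums to one)
    T: callable T(s, a, s2) -> probability of reaching s2 from s via a
    O: callable O(s2, a, z) -> probability of observing z in s2 after a
    """
    unnormalized = {
        s2: O(s2, a, z) * sum(T(s, a, s2) * p for s, p in b.items())
        for s2 in b
    }
    pr_z = sum(unnormalized.values())  # normalizing constant Pr(z | a, b)
    if pr_z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: value / pr_z for s2, value in unnormalized.items()}
```

For a two-state world in which listening leaves the state unchanged and the sensor is right 85% of the time, a uniform belief moves to 0.85 on the state the observation points to.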

Let the planning horizon $h$ (also called the look-ahead depth) be the number of future steps the agent plans ahead each time it plans. $V^*(b, h)$ is the optimal value of future courses of actions the agent can take with respect to a finite horizon $h$ starting in belief state $b$. This function assumes that at each step the action that will maximize the state’s value will be selected.

Because the reward function $R(a, s)$ provides feedback about the utility of a particular state $s$ (due to $a$ executed in it), an agent who does not know which state it is in cannot use this reward function directly. The agent must consider, for each state $s$, the probability $b(s)$ of being in $s$, according to its current belief state $b$. Hence, a belief reward function $\rho(a, b)$ is defined, which takes a belief state as argument. Let $\rho(a, b) := \sum_{s \in S} R(a, s)\, b(s)$.

The optimal state-value function is defined by

$$V^*(b, h) := \max_{a \in A} \Big[ \rho(a, b) + \gamma \sum_{z \in Z} Pr(z \mid a, b)\, V^*(SE(a, z, b), h-1) \Big], \qquad V^*(b, 1) := \max_{a \in A} \rho(a, b),$$

where $\gamma$ is a factor to discount the value of future rewards and $Pr(z \mid a, b)$ denotes the probability of reaching belief state $b_n = SE(a, z, b)$. While $V^*$ denotes the optimal value of a belief state, function $Q^*$ denotes the optimal action-value:

$$Q^*(a, b, h) := \rho(a, b) + \gamma \sum_{z \in Z} Pr(z \mid a, b)\, V^*(SE(a, z, b), h-1).$$

$Q^*(a, b, h)$ is the value of executing $a$ in the current belief state, plus the total expected value of belief states reached thereafter.
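A finite-horizon action-value of this kind can be computed by direct recursion over actions and observations. The sketch below is a naive forward search for illustration only; it assumes that beliefs are dictionaries, that `T`, `O` and `R` are callables, and that the value at horizon 1 is the belief reward alone. All function and parameter names are assumptions made here.

```python
def belief_reward(a, b, R):
    """Expected immediate reward of action a in belief state b."""
    return sum(R(a, s) * p for s, p in b.items())


def observe(b, a, z, T, O):
    """Return (Pr(z | a, b), next belief); next belief is None if Pr is 0."""
    unnorm = {s2: O(s2, a, z) * sum(T(s, a, s2) * p for s, p in b.items())
              for s2 in b}
    pr_z = sum(unnorm.values())
    if pr_z == 0.0:
        return 0.0, None
    return pr_z, {s2: v / pr_z for s2, v in unnorm.items()}


def q_value(a, b, h, A, Z, T, O, R, gamma):
    """Value of doing a in b, then acting greedily for h-1 more steps."""
    value = belief_reward(a, b, R)
    if h <= 1:
        return value
    for z in Z:
        pr_z, b_next = observe(b, a, z, T, O)
        if pr_z > 0.0:  # skip observations that cannot occur
            best_next = max(q_value(a2, b_next, h - 1, A, Z, T, O, R, gamma)
                            for a2 in A)
            value += gamma * pr_z * best_next
    return value
```

The recursion visits every action and observation sequence up to depth `h`, so its cost grows exponentially with the horizon; it is meant to make the definition concrete, not to be efficient.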

3 The Basic HPB Architecture

In BDI theory, one of the big challenges is to know when the agent should switch its current goal and what its new goal should be (Schut et al., 2004). To address this challenge, we propose that an agent should maintain intensity levels of desire for every goal. This intensity of desire could be interpreted as a kind of emotion. The goals most intensely desired should be the goals sought (the agent’s intentions). We also define the notion of how much an intention is satisfied in the agent’s current belief state. For instance, suppose that out of five possible goals, the agent currently most desires to watch a film and to eat a snack. Then these two goals become the agent’s intentions. However, eating is not allowed inside the film-theatre, and if the agent were to go buy a snack it would miss the beginning of the film. So the total reward for first watching the film then buying and eating a snack is higher than first eating then watching. As soon as the film-watching goal is satisfied, it is no longer an intention. But while the agent was watching the film, the desire-level of the (non-intention) goal of being at home has been increasing. However, it cannot become an intention because snack-eating has not yet been satisfied. Going home cannot simply become an intention and dominate snack-eating, because the architecture is designed so that current intentions have precedence over non-intention goals; otherwise there is a danger that the agent will vacillate between which goals to pursue. Nonetheless, snack-eating may be ejected from the set of intentions under the special condition that the agent is having an unusually hard time achieving it. For instance, if someone stole its wallet in the theatre, the agent can no longer hold eating a snack as a current intention (i.e., actively pursue it). Hence, in our architecture, if an intention takes ‘too long’ to satisfy, it is removed from the set of intentions.
As soon as the agent gets home or is close to home, the snack-eating goal will probably become an intention again and the agent will start making plans to satisfy eating a snack. Moreover, the desire-level of snack-eating will now be very high (it has been steadily increasing) and the agent’s actions will be biased towards satisfying this intention over other current intentions (e.g., over getting home, if it is not yet there).

A Hybrid POMDP-BDI (HPB) agent (Rens and Meyer, 2015) maintains (i) a belief state which is periodically updated, (ii) a mapping from goals to numbers representing the level of desire to achieve the goals, and (iii) the current set of intentions, the goals with the highest desire levels (roughly speaking). As the agent acts, its desire levels are updated and it may consider choosing new intentions and discard others based on new desire levels. Refer to Figure 1 for an overview of the operational semantics. The figure refers to concepts defined in the following subsection.

Figure 1: Operational semantics of the basic HPB architecture. SL stands for satisfaction levels. Note that the satisfaction levels depend on the current belief state and not on desire levels. Planning is also independent of desire levels. The focus function depends on desire levels and on satisfaction levels. In the case of plans consisting of a single action, the Replan decision node always returns ‘yes’.

3.1 Declarative Semantics

The state of an HPB agent is defined by the tuple $\langle B, D, I \rangle$, where $B$ is the agent’s current belief state (i.e., a probability distribution over the states $S$, defined below), $D$ is the agent’s current desire function and $I$ is the agent’s current intention. More will be said about $D$ and $I$ a little later.

An HPB agent could be defined by the tuple $\langle Atrb, G, A, Z, T, O, Util \rangle$, where

  • $Atrb$ is a set of attribute-sort pairs (for short, the attribute set). For every $(atrb, sort) \in Atrb$, $atrb$ is the name or identifier of an attribute of interest in the domain of interest, like the battery level or the week-day, and $sort$ is the set from which $atrb$ can take a value, for instance, a range of real numbers or a finite list of values. So a battery-level attribute paired with a numeric range, together with a week-day attribute paired with the list of week-days, could be an attribute set.

    A state $s$ is induced from $Atrb$ as one possible way of assigning values to attributes: if $Atrb = \{(atrb_1, sort_1), \ldots, (atrb_n, sort_n)\}$, then $s = \{(atrb_1, v_1), \ldots, (atrb_n, v_n)\}$ with $v_i \in sort_i$. The set of all possible states is denoted $S$.

  • $G$ is a set of goals. A goal $g \in G$ is a subset of some state $s \in S$. For instance, the battery level being 13 together with the week-day being Tuesday is a goal, and so is either of these two assignments on its own. The set of goals is given by the agent designer as ‘instructions’ about the agent’s tasks.

  • $A$ is a finite set of actions.

  • $Z$ is a finite set of observations.

  • $T$ is the transition function of POMDPs.

  • $O$ is the perception function of POMDPs.

  • $Util$ consists of two functions $Pref$ and $Satf$ which allow an agent to determine the utilities of alternative sequences of actions. $Util = \langle Pref, Satf \rangle$.

    $Pref$ is the preference function with a range in $[0, 1]$. It takes an action $a$ and a state $s$, and returns the preference (a real number in $[0, 1]$) for performing $a$ in $s$. That is, $Pref(a, s) \in [0, 1]$. Numbers closer to 1 imply greater preference and numbers closer to 0 imply less preference. Except for the range restriction of $[0, 1]$, it has the same definition as a POMDP reward function, but its name indicates that it models the agent’s preferences and not what is typically thought of as rewards. An HPB agent gets ‘rewarded’ by achieving its goals. The preference function is especially important to model action costs; the agent should prefer ‘inexpensive’ actions. $Pref$ has a local flavor. Designing the preference function to have a value lying in [0,1] may sometimes be challenging, but we believe it is always possible.

    $Satf$ is the satisfaction function with a range in $[0, 1]$. It takes a state $s$ and an intention $I$, and returns a value representing the degree to which the state satisfies the intention. That is, $Satf(I, s) \in [0, 1]$. It is completely up to the agent designer to decide how the satisfaction function is defined, as long as numbers closer to 1 mean more satisfaction and numbers closer to 0 mean less satisfaction. $Satf$ has a global flavor.

Figure 1 shows a flow diagram representing the operational semantics of the basic HPB architecture.

3.2 The Desire Function

The desire function $D$ is a total function from goals in $G$ into the positive real numbers $\mathbb{R}^+$. The real number represents the intensity or level of desire of the goal. For instance, the pair consisting of the goal "battery level 13 and week-day Tuesday" and the number 2.2 could be in $D$, meaning that the goal of having the battery level at 13 and the week-day Tuesday is desired with a level of 2.2.

$I$ is the agent’s current intention: an element of $G$, namely the goal with the highest desire level. This goal will be actively pursued by the agent, shifting the importance of the other goals to the background. The fact that only one intention is maintained makes the HPB agent architecture quite different to standard BDIAs.

We propose the following desire update rule.

$$D(g) \leftarrow D(g) + 1 - Satf_\beta(g, B) \tag{2}$$

Here $Satf_\beta(g, B) \in [0, 1]$ is the degree to which the current belief state $B$ satisfies goal $g$. Rule 2 is defined so that as $Satf_\beta(g, B)$ tends to one (total satisfaction), the intensity with which the incumbent goal is desired does not increase. On the other hand, as $Satf_\beta(g, B)$ becomes smaller (more dissatisfaction), the goal’s intensity is incremented. The rule transforms $D$ with respect to $B$ and $g$. A goal’s intensity should drop the more it is being satisfied. The update rule thus defines how a goal’s intensity changes over time with respect to satisfaction.

Note that desire levels never decrease. This does not reflect reality. It is however convenient to represent the intensity of desires like this: only relative differences in desire levels matter in our approach and we want to avoid unnecessarily complicating the architecture.
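The behaviour described above (no increase at total satisfaction, larger increments for less satisfied goals, never a decrease) can be illustrated with a short sketch. It assumes the increment for each goal is one minus its current satisfaction level in $[0, 1]$; the dictionary representation and the goal names are hypothetical.

```python
def update_desires(desire, satisfaction):
    """Return new desire levels: each goal gains (1 - satisfaction level).

    desire:       dict goal -> current desire level
    satisfaction: dict goal -> satisfaction in [0, 1] in the current belief
    """
    return {g: d + (1.0 - satisfaction[g]) for g, d in desire.items()}


# Hypothetical goals: a fully satisfied goal keeps its level,
# a poorly satisfied one gains almost a full unit.
levels = update_desires({"charge": 2.2, "sample": 1.0},
                        {"charge": 1.0, "sample": 0.25})
```

Because the increment is never negative, only the relative differences between goals change over time, which is the property the architecture relies on.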

3.3 Focusing and Satisfaction Levels

$\mathit{Focus}$ is a function which returns one member of $G$ called the (current) intention $I$. In the initial version of the architecture, the goal selected is the one with the highest desire level. After every execution of an action in the real-world, $\mathit{Refocus}$ is called to decide whether to call $\mathit{Focus}$ to select a new intention. $\mathit{Refocus}$ is a meta-reasoning function analogous to the reconsideration function mentioned in Section 2. It is important to keep the agent focused on one goal long enough to give it a reasonable chance of achieving it. It is the job of $\mathit{Refocus}$ to recognize when the current intention seems impossible or too expensive to achieve.

Let $SL$ be the sequence of satisfaction levels of the current intention since it became active and let $m$ be a designer-specified number representing the length of a sub-sequence of $SL$: the last $m$ satisfaction levels.

One possible definition of $\mathit{Refocus}$ is

$$\mathit{Refocus}(c, \theta) := \begin{cases} \text{‘yes’} & \text{if } c < \theta, \\ \text{‘no’} & \text{otherwise,} \end{cases}$$

where $c$ is the average change from one satisfaction level to the next in the agent’s ‘memory’ (the last $m$ levels of $SL$), and $\theta$ is some threshold. If the agent is expected to increase its satisfaction by at least, say, 0.1 on average for the current intention, then $\theta$ should be set to 0.1. With this approach, if the agent ‘gets stuck’ trying to achieve its current intention, it will not blindly keep on trying to achieve it, but will start pursuing another goal (with the highest desire level). Some experimentation will likely be necessary for the agent designer to determine a good value for $\theta$ in the application domain.

Note that if an intention was not well satisfied, its desire level still increases at a relatively high rate. So whenever the agent focuses again, a goal not well satisfied in the past will be a top contender to become the intention (again).
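The stagnation test described in this subsection can be sketched as follows. The sketch assumes that the average change is taken over the most recent satisfaction levels only, and that the agent does not refocus while fewer than two levels are available; that second point is a choice made here, since the text does not say how a short memory is handled. Function and parameter names are hypothetical.

```python
def average_change(levels, memory):
    """Average step-to-step change over the last `memory` levels.

    Returns None when fewer than two levels are available.
    """
    recent = levels[-memory:]
    if len(recent) < 2:
        return None
    # The sum of consecutive differences telescopes to last - first.
    return (recent[-1] - recent[0]) / (len(recent) - 1)


def refocus(levels, memory, threshold):
    """True ('yes') when progress on the intention is below the threshold."""
    c = average_change(levels, memory)
    return c is not None and c < threshold
```

With a threshold of 0.1, an intention whose satisfaction climbs by 0.2 per step is kept, while one that creeps from 0.5 to 0.52 over three steps triggers refocusing.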

3.4 Planning for the Next Action

A basic HPB agent controls its behaviour according to the policies it generates. $\mathit{Plan}$ is a procedure which generates a POMDP policy of depth $h$. Essentially, we want to consider all action sequences of length $h$ and the belief states in which the agent would find itself if it followed the sequences. Then we want to choose the sequence (or at least its first action) which yields the least cost and which ends in the belief state most satisfying with respect to the intention.

Planning occurs over an agent’s belief states. The satisfaction and preference functions thus need to be defined for belief states: The satisfaction an agent gets for an intention $I$ in its current belief state $B$ is defined as

$$Satf_\beta(I, B) := \sum_{s \in S} Satf(I, s)\, B(s),$$

where $Satf(I, s)$ is defined above and $B(s)$ is the probability of being in state $s$. The definition of $Pref_\beta$ has the same form as the reward function over belief states in POMDP theory:

$$Pref_\beta(a, B) := \sum_{s \in S} Pref(a, s)\, B(s),$$

where $Pref(a, s)$ was discussed above.

During planning, preferences and intention satisfaction must be maximized. The main function used in the $\mathit{Plan}$ procedure is the HPB action-state value function $Q^*$, giving the value of some action $a$, conditioned on the current belief state $B$, intention $I$ and look-ahead depth $h$:

$$Q^*(a, B, I, h) := \alpha\, Satf_\beta(I, B) + (1 - \alpha)\, Pref_\beta(a, B) + \gamma \sum_{z \in Z} Pr(z \mid a, B) \max_{a' \in A} Q^*(a', B', I, h-1),$$
$$Q^*(a, B, I, 1) := \alpha\, Satf_\beta(I, B) + (1 - \alpha)\, Pref_\beta(a, B),$$

where $B' = SE(a, z, B)$, $\alpha \in [0, 1]$ is the goal/preference ‘trade-off’ factor, $\gamma$ is the normal POMDP discount factor and $SE$ is the normal POMDP state estimation function.

$\mathit{Plan}$ returns $\arg\max_{a \in A} Q^*(a, B, I, h)$, the trivial policy of a single action.
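To make the trade-off between intention satisfaction and preference concrete, the following sketch selects a single action by look-ahead. It is an illustration under explicit assumptions, not the authors' code: the value of an action is taken to be `alpha` times the expected satisfaction plus `1 - alpha` times the expected preference, plus the discounted expected value of the best follow-up; the belief update is abstracted into a caller-supplied `successors(b, a)` returning `(probability, next_belief)` pairs. All names are hypothetical.

```python
def expect(f, b):
    """Expected value of f(state) under belief state b (dict state -> prob)."""
    return sum(f(s) * p for s, p in b.items())


def hpb_q(a, b, h, actions, satf, pref, successors, alpha, gamma):
    """Assumed form: alpha*satisfaction + (1-alpha)*preference + look-ahead."""
    value = alpha * expect(satf, b) + (1 - alpha) * expect(lambda s: pref(a, s), b)
    if h > 1:
        for pr, b_next in successors(b, a):
            value += gamma * pr * max(
                hpb_q(a2, b_next, h - 1, actions, satf, pref,
                      successors, alpha, gamma)
                for a2 in actions)
    return value


def plan(b, h, actions, satf, pref, successors, alpha, gamma):
    """Return the single action with the highest value at depth h."""
    return max(actions, key=lambda a: hpb_q(a, b, h, actions, satf, pref,
                                            successors, alpha, gamma))
```

In a toy world where moving is less preferred than waiting but leads to a state that satisfies the intention, a depth-1 plan picks the preferred action, whereas a depth-2 plan picks the move because the satisfaction reached afterwards outweighs the lower preference.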

4 The Extended HPB Architecture

The operational semantics of the extended architecture is essentially the same as for the first version, except that a plan library is now involved. The agent starts off with an initial set of intentions, a subset of its goals. For the current set of intentions, it must either select a plan from the plan library or generate a plan to pursue all its intentions. At every iteration of the agent’s control loop, an action is performed, an observation is made, the belief state is updated, and a decision is made whether to modify the set of intentions. But only when the current policy (conditional plan) is ‘exhausted’ does the agent seek a new policy, by consulting its plan library, and if an adequate policy is not found, generating one.

In the next subsection, we introduce some new notation and changes made to the architecture. Section 4.2 discusses how the focussing procedure must change to accommodate the changes. Section 4.3 explains how policies are generated for simultaneous pursuit of multiple goals. Finally, Section 4.4 presents the plan library, which was previously unavailable, and how the agent and agent designer can use it to their benefit.

4.1 Prologue

The HPB agent model gets three new components: a goal weight function, a compatibility function and the plan library. It can thus be defined by a tuple of ten components, namely the seven components of the basic model together with these three.

In the previous version, satisfaction and preference were traded-off by “trade-off factor” which was not explicitly mentioned in the agent model. Actually the trade-off factor should have been part of the model, because it must be provided by the agent designer, and it directly affects the agent’s behaviour. In the new version, every goal will be weighted by according to the importance of to the agent. Goal weights are constrained such that for all , and .

The third fundamental extension is that $I$ becomes a set of intentions. In this way, an HPB agent may actively pursue several goals simultaneously. For example, a planetary rover may want to travel to its recharging station and simultaneously make some atmospheric measurements en route.

The first version has also been changed so that the set of goals is simply a set of names, rather than restricting a goal to be a set of attribute values, as was previously done. Goals are defined by how they are used in the architecture, particularly by their involvement in the definition of satisfaction functions.

In the extended architecture, it will be convenient to use more compact notation: Here we let , where is the same as and is a set of satisfaction functions . In particular, we move away from a preference function, and rather think of a cost function . Preferences will be captured by the set of satisfaction functions.

As a consequence of being able to pursue several goals at the same time, there exists a danger that the agent will pursue one intention when it necessarily causes another intention to become less satisfied. For instance, visiting the USA regional headquarters is diametrically opposite to visiting the China regional headquarters at the same time. Other examples of goals which should be ‘disjoint’ are and , and and . The solution we use is to list, for each goal , all other goals which are compatible with it, in the sense that their simultaneous pursuit ‘effective’ (defined by the agent designer). Let denote the set of goals compatible with . It is mandatory that . Two goals and are called incompatible if and only if or .

Suppose , , , ,
, . Then an agent designer may specify


Note that and are incompatible.
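A compatibility specification of this kind can be represented as a mapping from each goal to the set of goals it is compatible with. The sketch below assumes that two goals count as incompatible as soon as either one is missing from the other's compatibility set; the goal names are hypothetical and only echo the headquarters example in the text.

```python
def incompatible(g1, g2, cmp):
    """True if either goal is absent from the other's compatibility set."""
    return g2 not in cmp[g1] or g1 not in cmp[g2]


# Hypothetical designer specification: the two visits exclude each other,
# while sampling can be combined with either.
cmp = {
    "visit_usa_hq": {"visit_usa_hq", "collect_samples"},
    "visit_china_hq": {"visit_china_hq", "collect_samples"},
    "collect_samples": {"collect_samples", "visit_usa_hq", "visit_china_hq"},
}
```

Checking both directions means a one-sided listing by the designer is enough to keep two goals from being pursued together.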

4.2 A New Approach to Focusing

Given that $I$ is a set of intentions, ensuring that the ‘correct’ goals are intentions at the ‘right’ time, so that the agent behaves as desired, requires some careful thought. It is still important to keep the agent focused on one intention long enough to give it a reasonable chance of achieving it, and to temporarily stop pursuing intentions it is struggling to achieve.

The HPB architecture does not have a focus function which returns a subset of $G$ as the set of intentions $I$. Rather, we have a set of procedures which decide at each iteration which intention to remove from $I$ (if any) and which goal to add to $I$ (if any). Incompatible goals must also be dealt with.

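Let $SL_g$ be the sequence of satisfaction levels of some goal $g$ since $g$ became active (i.e., was added to $I$) and let $m_g$ be a number representing the length of a sub-sequence of $SL_g$: the last $m_g$ satisfaction levels of goal $g$. $\mathit{Refocus}(c_g, \theta_g)$ is defined exactly like $\mathit{Refocus}(c, \theta)$:

$$\mathit{Refocus}(c_g, \theta_g) := \begin{cases} \text{‘yes’} & \text{if } c_g < \theta_g, \\ \text{‘no’} & \text{otherwise,} \end{cases}$$

where $c_g$ is the average change from one satisfaction level of $g$ to the next in the agent’s ‘memory’, and $\theta_g$ is the threshold above which $c_g$ must be for $g$ to remain an intention.

Let $g^*$ be the currently most intense goal defined as

$$g^* := \arg\max_{g \in G} D(g).$$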
We define two focusing strategies for sets of intentions: the over-optimistic strategy and the compatibility strategy.

4.2.1 Over-optimistic Strategy

This strategy ignores compatibility issues between goals. In this sense, the agent is (over) optimistic that it can successfully simultaneously pursue goals which are incompatible.

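Add the most intense goal $g^*$ to the intention set $I$ only if $g^* \notin I$. If $g^*$ is added to $I$, clear $g^*$’s record of satisfaction levels, that is, let $SL_{g^*}$ be the empty sequence.

Next: For every $g \in I$, if $|I| > 1$ and $\mathit{Refocus}(c_g, \theta_g)$ returns ‘yes’, then remove $g$ from $I$.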

4.2.2 Compatibility Strategy

Add the most intense goal $g^*$ to the intention set $I$ only if $g^* \notin I$ and there does not exist a $g \in I$ such that $g^*$ and $g$ are incompatible. If $g^*$ is added to $I$, clear $g^*$’s record of satisfaction levels, that is, let $SL_{g^*}$ be the empty sequence.

Next: For every $g \in I$, if $|I| > 1$ and $\mathit{Refocus}(c_g, \theta_g)$ returns ‘yes’, then remove $g$ from $I$.

There is one case which must still be dealt with in the compatibility strategy: Suppose that $I = \{g\}$ for some $g \in G$. Further suppose that $g^*$ is incompatible with $g$ and is and remains the most intensely desired goal. Now, $g^*$ may not be added to $I$ because it is incompatible with $g$, no other goal will be attempted to be added to $I$ and $g$ may not be removed while it is the only intention, even if $\mathit{Refocus}(c_g, \theta_g)$ returns ‘yes’. What could easily happen in this case is that $g^*$ will continually increase in desire level, $g$’s average change in satisfaction level will remain below the change threshold (i.e., $\mathit{Refocus}(c_g, \theta_g)$ keeps returning ‘yes’), and the agent continues to pursue only $g$. To remedy this ‘locked’ situation, the following procedure is run after the previous ‘add’ and ‘remove’ procedures are attempted. If $I = \{g\}$, $g^*$ is incompatible with $g$, and $\mathit{Refocus}(c_g, \theta_g)$ returns ‘yes’, then remove $g$ from $I$, add $g^*$ to $I$ and clear $g^*$’s record of satisfaction levels.
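The three procedures of the compatibility strategy (add, remove, and the remedy for the ‘locked’ case) can be sketched as one function. This is an illustrative reading of the text, with several assumptions: the candidate is the goal with the highest desire level, an intention is removed only while more than one intention is held, incompatibility is checked in both directions, and the stagnation test is supplied by the caller as `stalled(goal)`. All names are hypothetical.

```python
def incompatible(g1, g2, cmp):
    """True if either goal is absent from the other's compatibility set."""
    return g2 not in cmp[g1] or g1 not in cmp[g2]


def focus_step(intentions, desire, cmp, stalled, levels):
    """One pass of the compatibility strategy; mutates and returns intentions.

    intentions: set of goals currently pursued
    desire:     dict goal -> desire level
    cmp:        dict goal -> set of compatible goals
    stalled:    callable goal -> bool (the refocus test answers 'yes')
    levels:     dict goal -> list of recorded satisfaction levels
    """
    g_star = max(desire, key=desire.get)  # currently most intense goal

    # 'Add': only if not yet an intention and compatible with all intentions.
    if g_star not in intentions and not any(
            incompatible(g_star, g, cmp) for g in intentions):
        intentions.add(g_star)
        levels[g_star] = []

    # 'Remove': drop stalled intentions, but never the last remaining one.
    for g in sorted(intentions):
        if len(intentions) > 1 and stalled(g):
            intentions.remove(g)

    # 'Unlock': a single stalled intention blocking an incompatible g_star.
    if len(intentions) == 1:
        (g,) = intentions
        if g != g_star and incompatible(g_star, g, cmp) and stalled(g):
            intentions.remove(g)
            intentions.add(g_star)
            levels[g_star] = []
    return intentions
```

Running the unlock step last mirrors the order given in the text: it only fires when the ordinary add and remove procedures have left a single stalled intention in place.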

4.2.3 A New Desire Function

The old rule (in new notation) is still available:

$$D(g) \leftarrow D(g) + W(g)\,\big(1 - Satf_\beta(g, B)\big), \tag{3}$$

where $W(g)$ is the weight of goal $g$ and $Satf_\beta(g, B)$ is the degree to which the current belief state $B$ satisfies $g$.

We have found through experimentation that when an intention-goal’s desire levels are updated, non-intention-goals may not get the opportunity to become intentions. In other words, it may happen that whenever new non-intention-goals are considered to become intentions, they are always ‘dominated’ by goals with higher levels of desire which are already intentions. By disallowing intentions’ desire levels to increase, non-intentions get the opportunity to ‘catch up’ with their desire levels. A new form of the desire update rule is thus proposed for this version of the architecture:

$$D(g) \leftarrow D(g) + \mathbb{1}[g \notin I]\; W(g)\,\big(1 - Satf_\beta(g, B)\big). \tag{4}$$

The term $\mathbb{1}[g \notin I]$ in (4), which is one if $g$ is not an intention and zero otherwise, ensures that a goal’s desire level changes if and only if the goal is not an intention.

Both forms of the rule are defined so that as $Satf_\beta(g, B)$ tends to one (total satisfaction), the intensity with which the incumbent goal is desired does not increase. On the other hand, as $Satf_\beta(g, B)$ becomes smaller (more dissatisfaction), the goal’s intensity is incremented, by at most its weight of importance $W(g)$. A goal’s intensity should drop the more it is being satisfied.

However, update rule (3), which is independent of whether a goal is an intention, may still result in better performance in particular domains. (This question needs more research.) It is thus left up to the agent designer to decide which form of the rule better suits the application domain.
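The difference between the two forms of the update rule can be seen in a small sketch. It assumes that each goal's increment is its weight times one minus its satisfaction level, and that the newer form simply skips goals that are currently intentions; the flag `freeze_intentions` and the goal names are hypothetical devices for switching between the two forms.

```python
def update_desires(desire, weight, satisfaction, intentions,
                   freeze_intentions=True):
    """Weighted desire update.

    With freeze_intentions=True, goals in `intentions` keep their level
    (the newer form); with False, every goal is updated (the older form).
    """
    updated = {}
    for g, d in desire.items():
        gate = 0.0 if (freeze_intentions and g in intentions) else 1.0
        updated[g] = d + gate * weight[g] * (1.0 - satisfaction[g])
    return updated
```

With two equally weighted goals of which one is an intention, the gated form leaves the intention's level untouched and lets the other goal gain ground, whereas the ungated form raises both.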

4.3 Planning by Policy Generation

In this section, we shall see how the planner can be extended to compute a policy which pursues several goals simultaneously. Goal weights are also incorporated into the action-state value function.

The satisfaction an agent gets for an intention at its current belief state is defined as

where is defined above and is the probability of being in state . The definition of has the same form as the reward function over belief states in POMDP theory:

where was discussed above.
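Since the belief-state satisfaction has the same form as a POMDP reward over belief states, it can be sketched as an expectation of the per-state satisfaction. The dictionary representation and the function name below are assumptions made for illustration.

```python
def belief_satisfaction(state_satisfaction, belief):
    """Expected satisfaction of one intention at a belief state.

    state_satisfaction: dict mapping state -> satisfaction level in that state
    belief:             dict mapping state -> probability of being in that state
    """
    return sum(prob * state_satisfaction[state] for state, prob in belief.items())
```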

The main function used in the procedure is the HPB action-state value function , giving the value of some action , conditioned on the current belief state and look-ahead depth :


  • if , else if ,

  • is an ordering of the goals in ,

  • and are the expected (w.r.t. a belief state) values of , resp., ,

  • ,

  • is the normal POMDP discount factor and

  • is the normal POMDP state estimation function.

Now, instead of returning a single action (assuming ), generates a tree-structured plan of depth , conditioned on observations, that is, a policy. With a policy of depth , an agent can execute a sequence of actions, where the choice of exactly which action to take at each step depends on the observation received just prior. is used at every choice point to construct the policy.

Figure 2 is a graphical example of a policy with two actions and two observations. The agent is assumed to be in belief state when the policy is generated. At every belief state node (triangles), the optimal action is recommended. After an action is performed, both observations are possible and thus considered. There is thus a choice at every node; however, it is not a choice for the agent; rather, the environment chooses which observation to send to the agent. Given the action performed, for every possible observation, a different belief state is generated. At every node (belief state), is applied to determine the action to perform there. (In theory, the agent can choose to perform any action at these nodes, but our agent will take the recommendations of POMDP theory for optimal behavior.) The agent will perform first, then depending on whether or is sensed, the agent should next (according to the policy) perform , respectively, . Then a third action will be performed according to the policy and conditional on which observation is sensed.

Figure 2: An example policy of depth 3.
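The construction of such a policy tree can be sketched as a depth-limited look-ahead. This is a generic POMDP sketch, not the HPB value function itself: the single `reward` table stands in for the weighted goal-satisfaction and cost terms, and the toy model, its numbers and all names are assumptions.

```python
def state_estimate(model, belief, action, obs):
    """Standard POMDP belief update; returns (new_belief, P(obs | belief, action))."""
    unnormalised = {}
    for s2 in model["states"]:
        reach = sum(belief[s] * model["trans"][(s, action)].get(s2, 0.0)
                    for s in model["states"])
        unnormalised[s2] = model["obs"][(action, s2)].get(obs, 0.0) * reach
    total = sum(unnormalised.values())
    if total == 0.0:
        return None, 0.0
    return {s: p / total for s, p in unnormalised.items()}, total


def q_value(model, belief, action, depth):
    """Expected immediate reward plus discounted value of the best continuation."""
    value = sum(p * model["reward"][(s, action)] for s, p in belief.items())
    if depth <= 1:
        return value
    for obs in model["observations"]:
        nxt, prob = state_estimate(model, belief, action, obs)
        if prob > 0.0:
            best = max(q_value(model, nxt, a, depth - 1) for a in model["actions"])
            value += model["gamma"] * prob * best
    return value


def build_policy(model, belief, depth):
    """Return a tree: the best action here, then one sub-policy per observation."""
    action = max(model["actions"], key=lambda a: q_value(model, belief, a, depth))
    node = {"action": action, "children": {}}
    if depth > 1:
        for obs in model["observations"]:
            nxt, prob = state_estimate(model, belief, action, obs)
            if prob > 0.0:
                node["children"][obs] = build_policy(model, nxt, depth - 1)
    return node


# Hypothetical two-state model: a1 keeps the state, a2 swaps it,
# and only doing a1 while in s1 is rewarded.
toy_model = {
    "states": ["s0", "s1"],
    "actions": ["a1", "a2"],
    "observations": ["o1", "o2"],
    "trans": {("s0", "a1"): {"s0": 1.0}, ("s1", "a1"): {"s1": 1.0},
              ("s0", "a2"): {"s1": 1.0}, ("s1", "a2"): {"s0": 1.0}},
    "obs": {("a1", "s0"): {"o1": 0.8, "o2": 0.2}, ("a2", "s0"): {"o1": 0.8, "o2": 0.2},
            ("a1", "s1"): {"o1": 0.2, "o2": 0.8}, ("a2", "s1"): {"o1": 0.2, "o2": 0.8}},
    "reward": {("s0", "a1"): 0.0, ("s0", "a2"): 0.0,
               ("s1", "a1"): 1.0, ("s1", "a2"): 0.0},
    "gamma": 0.95,
}
policy = build_policy(toy_model, {"s0": 1.0, "s1": 0.0}, depth=3)
```

In this toy model the tree has one branch per observation at every choice node, mirroring the structure shown in Figure 2.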

4.4 Introducing a Plan Library

Another extension of the basic architecture is that a language based on the attributes is introduced. The language is the set of all sentences. Let and be sentences. Then the following are also sentences.

  • ,

  • , i.e., an attribute-value pair,

  • ,

  • ,

  • .

If a sentence is satisfied or true in a state , we write . The semantics of is defined by

  • always,

  • ,

  • and ,

  • or ,

  • not .

Let be a sentence in . When a sentence in appears in a written policy (see below), it is called a context.
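The semantics above can be sketched as a small recursive evaluator. The tuple encoding of sentences and the dictionary encoding of states are assumptions made for illustration; the attribute names in the usage note are hypothetical.

```python
def holds(sentence, state):
    """Return True if the sentence is satisfied in the state.

    Sentences are nested tuples (an assumed encoding):
      ("top",)                 always true
      ("eq", attribute, value) an attribute-value pair
      ("and", left, right), ("or", left, right), ("not", inner)
    A state is a dict mapping attribute names to values.
    """
    tag = sentence[0]
    if tag == "top":
        return True
    if tag == "eq":
        _, attribute, value = sentence
        return state.get(attribute) == value
    if tag == "and":
        return holds(sentence[1], state) and holds(sentence[2], state)
    if tag == "or":
        return holds(sentence[1], state) or holds(sentence[2], state)
    if tag == "not":
        return not holds(sentence[1], state)
    raise ValueError(f"unknown sentence form: {tag!r}")
```

For example, `holds(("and", ("eq", "x", 1), ("not", ("eq", "holding", True))), {"x": 1, "holding": False})` evaluates to `True`.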

We define two kinds of plans: an attribute condition plan is a triple , and a belief state condition plan is a triple , where is a set of intentions, is a POMDP policy, is a context and is a belief state. All plans are stored in a plan library.

The idea is that attribute condition plans (abbreviation: a-plans) are written by agent designers and are available for use when the agent is deployed. Roughly speaking, belief state condition plans (abbreviation: b-plans) are automatically generated by a POMDP planner and stored when no a-plan is found which ‘matches’ the agent’s current belief state and intention set.

Policies in a-plans are of two kinds:

Definition 1 (Most likely context)

An a-plan most-likely-context policy is either an action or has the form

where is an action, the are contexts, and each of the is one of the two kinds of a-plan policies.

At belief state , the degree of belief of is

We abbreviate “most-likely-context” as ‘ml’. If an ml policy is adopted for execution and it is not simply an action, then is executed, an observation is received, the current belief state is updated to and finally the policy which is paired with the most likely context is executed – that is,

is executed.

Definition 2 (First applicable context)

An a-plan first-applicable-context policy is either an action or has the form

where is an action, the are contexts, , the are probabilities, and each of the is one of the two kinds of a-plan policies.

We abbreviate “first-applicable-context” as ‘fa’. If an fa policy is adopted for execution and it is not simply an action, then is executed, an observation is received, the current belief state is updated to and finally the policy which is paired with the first context which satisfies its probability inequality is executed - that is, is executed such that and and there is no such that for which . If no context in the sequence satisfies its inequality, the a-plan of which the policy is a part is regarded as having finished, that is, the control loop is then in a position where a fresh plan in the plan library is sought.
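The branching step of the two policy kinds can be sketched as follows. Several details are assumptions: the degree of belief of a context is taken to be the total probability of the states satisfying it, the probability inequality of an fa policy is taken to be "degree of belief at least the given probability", and contexts are represented as Python predicates over states. The belief passed in is the one obtained after the policy's action has been executed and the observation processed.

```python
def degree_of_belief(context, belief):
    """Total probability of the states in which the context holds (assumed form)."""
    return sum(prob for state, prob in belief.items() if context(state))


def select_subpolicy(policy, belief):
    """Pick the sub-policy to continue with, given the updated belief state.

    policy is ("ml", action, [(context, subpolicy), ...]) or
              ("fa", action, [(context, probability, subpolicy), ...]).
    Returns None when no fa context passes, i.e. the plan has finished.
    """
    kind, _action, branches = policy
    if kind == "ml":
        # Most likely context: highest degree of belief wins.
        _context, subpolicy = max(
            branches, key=lambda branch: degree_of_belief(branch[0], belief))
        return subpolicy
    if kind == "fa":
        # First applicable context: scan in order, take the first that passes.
        for context, probability, subpolicy in branches:
            if degree_of_belief(context, belief) >= probability:
                return subpolicy
        return None
    raise ValueError(f"unknown policy kind: {kind!r}")
```

Returning `None` for an fa policy corresponds to the case where the control loop must seek a fresh plan in the library.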

In the following example a-plan policy, an agent must move around in a six-by-six grid world to collect items. Suppose the plan selected from the library is with being , being

and being

One can see that itself is an ml policy, but embedded inside it is an fa policy.

Suppose that the agent currently has a belief state and an intention set . First, the agent will scan through all a-plans, selecting all those which ‘match’ . From this set, the agent will execute the policy of the a-plan whose attribute condition has the highest degree of belief at . If the set of a-plans matching is empty, the agent will scan through all b-plans, selecting all those which ‘match’ . From this set, the agent will execute the policy of the b-plan whose belief state is ‘most similar’ to . If the set of b-plans matching is empty, or there is no b-plan with belief state similar to , then the agent will generate policy , execute it and store in the plan library for possible reuse later. The high-level planning process is depicted by the diagram in Figure 3.

Figure 3: A flow diagram of the planning process in the new version of the HPB agent architecture.

To “execute policy ” (where has horizon/depth ) means to perform actions as recommended by . No policy will be sought in the library, nor will a new policy be generated until the action recommendations of the current policy being executed have been ‘exhausted’. One may be concerned that a policy becomes ‘stale’ or inapplicable while being executed, and that seeking or generating ‘fresh’ policies at every iteration keeps action selection relevant in a dynamic world. However, written policies (in a-plans) should preferably have the form of generated policies, and generated policies (in b-plans) can deal with all situations understood by the agent: It is assumed that each observation distinguishable by the agent, identifies a particular state of the world, as far as the agent’s sensors allow. Hence, if a policy considers every observation at its choice nodes, the policy will have a recommended (for written policies) or optimal (for generated policies) action, no matter the state of the world. However, writing or generating policies with far horizons (e.g., ) is impractical. With large , an agent will take relatively long to generate a policy and thus lose its reactiveness. Reactiveness is especially important in highly dynamic environments.

Input: : current belief state
Input: : current intention set
Input: : intention-set threshold
Input: : belief state threshold
Input: : the plan library
Input: : planning horizon / policy depth
Output: A POMDP policy of depth
;
if  then
      return ;
if  then
      ;
if  then
      ;
if  then
      return ;
if  then
      ;
      Add to ;
      return ;
Algorithm 2
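The selection order described in the text (a-plans first, then b-plans, then fresh generation with caching) can be sketched as below. This follows the prose description rather than the exact steps of Algorithm 2; the library layout, the `measures` and `thresholds` dictionaries, and all names are assumptions, and the similarity measures are passed in as callables.

```python
def find_policy(belief, intentions, library, measures, thresholds, generate_policy):
    """Return a policy for the current belief state and intention set.

    library:    {"a": [a-plans], "b": [b-plans]}; a-plans carry a "context",
                b-plans carry a "belief"; both carry "intentions" and "policy".
    measures:   callables "sim_a", "sim_b", "degree" and "belief_sim" (assumed).
    thresholds: {"intention": ..., "belief": ...}.
    """
    # 1. Written plans whose intention set matches; prefer the most believed context.
    a_matches = [p for p in library["a"]
                 if measures["sim_a"](intentions, p["intentions"]) >= thresholds["intention"]]
    if a_matches:
        best = max(a_matches, key=lambda p: measures["degree"](p["context"], belief))
        return best["policy"]

    # 2. Generated plans whose intention set and belief state are both similar enough.
    b_matches = [p for p in library["b"]
                 if measures["sim_b"](intentions, p["intentions"]) >= thresholds["intention"]
                 and measures["belief_sim"](belief, p["belief"]) >= thresholds["belief"]]
    if b_matches:
        best = max(b_matches, key=lambda p: measures["belief_sim"](belief, p["belief"]))
        return best["policy"]

    # 3. Nothing suitable: generate a policy and cache it as a new b-plan.
    policy = generate_policy(belief, intentions)
    library["b"].append({"intentions": intentions, "belief": belief, "policy": policy})
    return policy
```

With an empty library, the first call generates and stores a policy; a later call with a sufficiently similar belief state and intention set reuses it instead of planning again.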

With respect to a-plans, whether two intention sets match will be determined by how many goals they have in common. Thus, the similarity between and can be determined as follows.

lies in . and need not have equal cardinality. Larger values of mean more similarity / closer match. The agent designer can decide what value of constitutes a ‘match’ between and (see the discussion on “thresholds” below).

What constitutes a match between intention sets with respect to b-plans is different: Policies generated at two times and might be significantly different for the same (similar) context(s) if the satisfaction levels of the intentions are significantly different at the two times. This is an important insight because policies of b-plans are generated, not written. Even though may constitute a ‘match’ (with in a b-plan), might be completely impractical for pursuing . The measure of similarity will be the sum of differences between satisfaction levels. Note that an intention’s satisfaction levels can only be compared if the intention appears in both intention sets under consideration. We denote the similarity between two intention sets and as and define it as follows.

where denotes the absolute value of . lies in . and need not have equal cardinality. Larger values of mean more similarity / closer match. The agent designer can decide what value of constitutes a ‘match’ between and .

For a fixed pair of intention sets, . That is, is a stronger measure of similarity than . This is because with , intention satisfaction levels must also be similar. The stronger measure is required to filter out b-plans that seem similar when judged only on the commonality of their intentions, but not on their satisfaction levels. And there may be several b-plans in the library which would be judged similar by , but they have been added to the library exactly because they are indeed different when their satisfaction levels are taken into account. The following example should make this clear. Suppose that the following two b-plans are in the library: and , where and . And suppose is most satisfied when the agent is in , and is most satisfied when the agent is in . A policy to pursue when starting in would rather suggest actions to move towards , while a policy to pursue when starting in would rather suggest actions to move towards . The point is that although the two b-plans are identical with respect to the intention set, they have very different policies, due to their different belief states (and thus satisfaction levels).
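The relationship between the two measures can be illustrated with a minimal sketch. The exact definitions are not reproduced here, so both formulas are assumptions chosen to match the stated properties: values in [0, 1], no requirement of equal cardinality, and the b-plan measure never exceeding the a-plan measure. The a-plan measure is taken to be the Jaccard overlap of the goal sets, and the b-plan measure additionally discounts each shared goal by the absolute difference in satisfaction levels.

```python
def sim_a(intentions, plan_intentions):
    """Assumed a-plan measure: shared goals divided by all goals (Jaccard overlap)."""
    current, stored = set(intentions), set(plan_intentions)
    if not current and not stored:
        return 1.0
    return len(current & stored) / len(current | stored)


def sim_b(satisfaction, plan_satisfaction):
    """Assumed b-plan measure over dicts mapping goal -> satisfaction in [0, 1].

    Only goals present in both sets can be compared; each contributes
    1 - |difference in satisfaction| instead of a full 1.
    """
    all_goals = set(satisfaction) | set(plan_satisfaction)
    if not all_goals:
        return 1.0
    shared = set(satisfaction) & set(plan_satisfaction)
    closeness = sum(1.0 - abs(satisfaction[g] - plan_satisfaction[g]) for g in shared)
    return closeness / len(all_goals)
```

Under these assumptions, two plans with identical intention sets but very different satisfaction levels score 1 on the first measure and much lower on the second, which is the filtering effect described above.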

We now prepare for the definition of similarity between two belief states. The ‘directed divergence’ (Kullback, 1968; Csiszár, 1975) of belief state from belief state is defined as

is undefined when while . When , then because . Let

where is the set of all probability distributions over the states (i.e., all belief states which can be induced from ). That is, is the set of belief states which keep defined. Let

For our purposes, we can define as whenever it would normally be undefined. We define a slightly modified cross-entropy as

Finally, the similarity between the current belief state and the belief state in a plan in the library is .
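A belief-state similarity of this kind can be sketched with the standard Kullback-Leibler form of directed divergence. The normalisation is an assumption: the divergence is divided by its largest possible value with respect to the reference belief state, which is the negative logarithm of that state's smallest positive probability, and an undefined divergence is treated as maximal. Which belief state plays the reference role is also an assumption, as are all names.

```python
import math


def directed_divergence(c, b):
    """Kullback-Leibler divergence of belief c from belief b; None where undefined."""
    total = 0.0
    for state, pc in c.items():
        if pc == 0.0:
            continue  # the term vanishes because x*ln(x) tends to 0 as x tends to 0
        pb = b.get(state, 0.0)
        if pb == 0.0:
            return None  # undefined: c gives weight to a state that b rules out
        total += pc * math.log(pc / pb)
    return total


def max_divergence(b):
    """Largest divergence from b over beliefs that keep the divergence defined."""
    return -math.log(min(p for p in b.values() if p > 0.0))


def belief_similarity(c, b):
    """Assumed similarity in [0, 1]: one minus the normalised divergence."""
    divergence = directed_divergence(c, b)
    if divergence is None:
        return 0.0  # undefined divergence is treated as the maximum
    largest = max_divergence(b)
    if largest == 0.0:
        return 1.0  # b is a point mass and c agrees with it
    return 1.0 - divergence / largest
```

Identical belief states score 1, while a belief state concentrated on the least likely state of the reference scores 0.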

Two thresholds are involved with determining when library plans are applicable and how plans are dealt with: the intention-set threshold (abbreviation: ) and the belief-state threshold (abbreviation: ). The former is involved in both a-plans and b-plans, and the latter is involved only in b-plans.

The procedure (Algo. 2) formally defines what policy the agent will execute whenever the agent seeks a policy, and the procedure defines when and how new plans are added to the plan library.

5 Simulations

We performed tests on an HPB agent in two domains: a six-by-six grid-world and a three-battery system. In the experiments which follow, the threshold is set to , is set to 5 and . Desire levels are initially set to zero for all goals. For each experiment, 10 trials were run. The plan library is not used in these experiments.

In the grid-world, the agent’s task is to visit each of the four corners, and to collect twelve items randomly scattered. The goals are , , , , , and , , and are marked mutually incompatible. That is,

  • ,

  • ,

  • ,