1 Introduction
Markov decision processes (MDPs) have been adopted as a representational and computational model for decision-theoretic planning problems in much recent work, e.g., by Barto et al. The basic solution techniques for MDPs rely on the dynamic programming (DP) principle [Boutilier, Dean, & Hanks 1999]. Unfortunately, classical dynamic programming algorithms require explicit enumeration of the state space, which grows exponentially with the number of variables relevant to the planning domain. Therefore, these algorithms do not scale up to complex AI planning problems.
However, several methods that avoid explicit state enumeration have been developed recently. One technique, referred to as state abstraction, exploits the structure of the factored MDP representation to solve problems efficiently, circumventing explicit state space enumeration [Boutilier, Dean, & Hanks 1999]. Another technique, referred to as heuristic search, restricts the computation to states that are reachable from the initial state, e.g., RTDP by Barto et al. and envelope DP by Dean et al. and by Feng et al. One existing approach that combines both these techniques is the symbolic LAO* algorithm by Feng and Hansen, which performs heuristic search symbolically for factored MDPs. It exploits state abstraction, i.e., it manipulates sets of states instead of individual states. More precisely, following the SPUDD approach by Hoey et al., all MDP components, value functions, policies, and admissible heuristic functions are compactly represented using algebraic decision diagrams (ADDs). This allows the computations of the algorithm to be performed efficiently using ADDs.
Following the ideas of symbolic LAO*, given an initial state, we use an admissible heuristic to restrict the search to only those states that are reachable from the initial state. Moreover, we exploit state abstraction in order to avoid evaluating states individually. Thus, our work is very much in the spirit of symbolic LAO* but extends it in an important way. Whereas the symbolic LAO* algorithm starts with propositionalization of the first-order MDP (FOMDP), and only after that performs state abstraction on its propositionalized version by means of propositional ADDs, we apply state abstraction directly on the structure of the FOMDP, avoiding propositionalization. This kind of abstraction is referred to as first-order state abstraction.
Recently, following work by Boutilier et al., Hölldobler and Skvortsova have developed an algorithm, referred to as first-order value iteration (FOVI), that exploits first-order state abstraction. The dynamics of an MDP is specified in the Probabilistic Fluent Calculus, an extension of the Fluent Calculus established by Hölldobler and Schneeberger, which is a first-order language for reasoning about states and actions. More precisely, FOVI produces a logical representation of value functions and policies by constructing first-order formulae that partition the state space into clusters, referred to as abstract states. In effect, the algorithm performs value iteration on top of these clusters, obviating the need for explicit state enumeration. This allows problems that are represented in first-order terms to be solved without requiring explicit state enumeration or propositionalization.
Indeed, propositionalizing FOMDPs can be very impractical: the number of propositions grows considerably with the number of domain objects and relations. This has a dramatic impact on the complexity of the algorithms, which depends directly on the number of propositions. Moreover, systems for solving FOMDPs that rely on propositionalizing states also propositionalize actions, which is problematic in first-order domains, because the number of ground actions also grows dramatically with the domain size.
In this paper, we address these limitations by proposing an approach for solving FOMDPs that combines first-order state abstraction and heuristic search in a novel way, exploiting the power of logical representations. Our algorithm can be viewed as a first-order generalization of symbolic LAO*, in which our contribution is to show how to perform heuristic search for first-order MDPs, circumventing their propositionalization. In fact, we show how to improve the performance of symbolic LAO* by providing a compact first-order MDP representation using the Probabilistic Fluent Calculus instead of propositional ADDs. Alternatively, our approach can be considered as a way to improve the efficiency of the FOVI algorithm by using heuristic search together with symbolic dynamic programming.
2 First-Order Representation of MDPs
Recently, several representations for propositionally factored MDPs have been proposed, including dynamic Bayesian networks by Boutilier et al. and ADDs by Hoey et al. For instance, the SPUDD algorithm by Hoey et al. has been used to solve MDPs with hundreds of millions of states optimally, producing logical descriptions of value functions that involve only hundreds of distinct values. This work demonstrates that large MDPs, described in a logical fashion, can often be solved optimally by exploiting the logical structure of the problem.
Meanwhile, many realistic planning domains are best represented in first-order terms. However, most existing implemented solutions for first-order MDPs rely on propositionalization, i.e., they eliminate all variables at the outset of a solution attempt by instantiating terms with all possible combinations of domain objects. This technique can be very impractical because the number of propositions grows dramatically with the number of domain objects and relations.
For example, consider the following goal statement taken from the colored Blocksworld scenario, where the blocks, in addition to unique identifiers, are associated with colors.
where the goal statement represents the fact that all eight blocks comprise one tower. We assume that the number of blocks in the domain and their color distribution agree with those in the goal statement, namely, there are eight blocks in the domain, of which four are red, three are green and one is blue. Then, the full propositionalization of the goal statement results in $4! \cdot 3! \cdot 1! = 144$ different ground towers, because there are exactly that many ways of arranging four red, three green and one blue block in a tower of eight blocks with the required color characteristics.
The number of ground combinations, and hence the complexity of reasoning in a propositional planner, depends dramatically on the number of blocks and, most importantly, on the number of colors in the domain. The fewer colors a domain contains, the harder it is to solve by a propositional planner. For example, a goal statement that is the same as above, but in which all eight blocks are of the same color, results in $8! = 40320$ ground towers, when grounded.
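To make the counting explicit, the following short sketch (ours, not part of the original formulation) computes the number of ground instantiations of a fixed color arrangement: since blocks carry unique identifiers, each color with $m$ blocks contributes $m!$ assignments of concrete blocks to that color's positions in the tower.

```python
from math import factorial

def ground_towers(color_multiplicities):
    """Number of ground towers realizing a fixed color arrangement:
    the blocks of each color can be permuted among that color's
    positions in the tower."""
    n = 1
    for m in color_multiplicities:
        n *= factorial(m)
    return n

# Goal tower of eight blocks: four red, three green, one blue.
print(ground_towers([4, 3, 1]))  # -> 144
# The same goal where all eight blocks share one color.
print(ground_towers([8]))        # -> 40320
```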
To address these limitations, we propose a concise representation of FOMDPs within the Probabilistic Fluent Calculus, which is a logical approach to modelling dynamically changing systems based on first-order logic. But first, we briefly describe the basics of the theory of MDPs.
2.1 MDPs
A Markov decision process (MDP) is a tuple $M = (S, A, P, R, C)$, where $S$ is a finite set of states, $A$ is a finite set of actions, and $P : S \times A \times S \to [0, 1]$, written $P(s' \mid s, a)$, specifies transition probabilities. In particular, $P(s' \mid s, a)$ denotes the probability of ending up at state $s'$ given that the agent was in state $s$ and action $a$ was executed. $R : S \to \mathbb{R}$ is a real-valued reward function associating with each state $s$ its immediate utility $R(s)$. $C : A \to \mathbb{R}$ is a real-valued cost function associating a cost $C(a)$ with each action $a$. A sequential decision problem consists of an MDP and is the problem of finding a policy $\pi : S \to A$ that maximizes the total expected discounted reward received when executing the policy $\pi$ over an infinite (or indefinite) horizon. The value $V_\pi(s)$ of state $s$, when starting in $s$ and following the policy $\pi$ afterwards, can be computed by the following system of linear equations:
$$V_\pi(s) = R(s) + C(\pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s))\, V_\pi(s'),$$
where $0 \le \gamma \le 1$ is a discount factor. We take $\gamma$ equal to 1 for indefinite-horizon problems only, i.e., when a goal is reached the system enters an absorbing state in which no further rewards or costs are accrued. The optimal value function $V^*$ satisfies:
$$V^*(s) = R(s) + \max_{a \in A} \Big\{ C(a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^*(s') \Big\}$$
for each $s \in S$.
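For reference, the Bellman equations above can be sketched as classical value iteration over an explicitly enumerated state space; this is exactly the computation that the first-order techniques below avoid. The tiny two-state example (a state `s`, an absorbing goal `g`, and a cost modeled as a negative value added to the backup) is our illustration, not taken from the paper.

```python
def value_iteration(S, P, R, C, gamma=1.0, eps=1e-6):
    """Classical value iteration over an enumerated state space.
    P[s] maps an action to a list of (successor, probability) pairs;
    R[s] is the state reward, C[a] the (negative) action cost."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {}
        for s in S:
            backups = [C[a] + gamma * sum(p * V[s2] for s2, p in succs)
                       for a, succs in P[s].items()]
            V_new[s] = R[s] + (max(backups) if backups else 0.0)
        residual = max(abs(V_new[s] - V[s]) for s in S)
        V = V_new
        if residual < eps:
            return V

S = ["s", "g"]
P = {"s": {"go": [("g", 1.0)]}, "g": {}}     # g is absorbing
R, C = {"s": 0.0, "g": 500.0}, {"go": -1.0}
print(value_iteration(S, P, R, C))  # -> {'s': 499.0, 'g': 500.0}
```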
For the probabilistic track of the 2004 International Planning Competition (IPC'2004), the expected total reward model was used as the optimality criterion. Without discounting, some care is required in the design of planning problems to ensure that the expected total reward is bounded for the optimal policy. The following restrictions were made for problems used in the planning competition:

- Each problem had a goal statement, identifying a set of absorbing goal states.
- A positive reward was associated with transitioning into a goal state.
- A cost was associated with each action.
- A “done” action was available in all states, which could be used to end further accumulation of reward.
These conditions ensure that an MDP model of a planning problem is a positive bounded model as described by Puterman. The only positive reward is for transitioning into a goal state. Since goal states are absorbing, that is, they have no outgoing transitions, the maximum value of any state is bounded by the goal reward. Furthermore, the “done” action ensures that there is an action available in each state that guarantees a nonnegative future reward.
2.2 Probabilistic Fluent Calculus
The Fluent Calculus (FC) by Hölldobler and Schneeberger was originally set up as a first-order logic program with equality, using SLDE-resolution as the sole inference rule. The Probabilistic Fluent Calculus (PFC) is an extension of the original FC for expressing planning domains with actions that have probabilistic effects.
States
Formally, let $\mathcal{F}$ denote a set of function symbols. We distinguish two function symbols in $\mathcal{F}$, namely the binary function symbol $\circ$, which is associative, commutative, and admits the unit element, and a constant $1$. Let $\mathcal{F}^- = \mathcal{F} \setminus \{\circ, 1\}$. Non-variable $\mathcal{F}^-$-terms are called fluents. The function names of fluents are referred to as fluent names. For example, $ontable(X)$ is a fluent meaning informally that some block $X$ is on the table, where $ontable$ is a fluent name. Fluent terms are defined inductively as follows: $1$ is a fluent term; each fluent is a fluent term; $F \circ G$ is a fluent term, if $F$ and $G$ are fluent terms. For example, $ontable(a) \circ holding(X)$ is a fluent term denoting informally that the block $a$ is on the table and some block $X$ is in the robot’s gripper. In other words, freely occurring variables are assumed to be existentially quantified.
We assume that each fluent may occur at most once in a state. Moreover, function symbols, except for the binary operator $\circ$, the constant $1$, fluent names and constants, are disallowed. In addition, the binary function symbol $\circ$ is allowed to appear only as an outermost connective in a fluent term. We denote the set of fluents as $\mathcal{Fl}$ and the set of fluent terms as $\mathcal{FT}$, respectively. An abstract state is defined by a pair $(P, N)$, where $P \in \mathcal{FT}$ and $N \subseteq \mathcal{FT}$ is finite. We denote individual states by $s$, $s_1$, etc., abstract states by $z$, $z_1$, etc., and a set of abstract states by $Z$.
The interpretation over $\mathcal{Fl}$, denoted as $\mathcal{I}$, is the pair $(\Delta, \cdot^{\mathcal{I}})$, where the domain $\Delta$ is the set of all finite sets of ground fluents from $\mathcal{Fl}$, and $\cdot^{\mathcal{I}}$ is an interpretation function which assigns to each fluent term and to each abstract state $(P, N)$ a subset of $\Delta$ as follows:
$$(P, N)^{\mathcal{I}} = \{ d \in \Delta \mid \exists \theta .\; P\theta \subseteq d \text{ and, for each } N' \in N, \text{ there is no } \sigma \text{ with } N'\theta\sigma \subseteq d \},$$
where $\theta$ and $\sigma$ are substitutions and a ground fluent term is identified with the set of its fluents. For example, Figure 1 depicts the interpretation of an abstract state
$$z = (\, on(X, a) \circ ontable(a), \; \{ on(Y, X), \; holding(W) \} \,)$$
that can be informally read: There exists a block $X$ that is on the block $a$ which is on the table, there is no block that is on $X$, and there is no block that the robot holds.
Since $z^{\mathcal{I}}$ contains all such finite sets of ground fluents that satisfy the positive part of $z$ and do not satisfy any of the elements of its negative part, we subtract all sets of ground fluents that satisfy some element of the negative part from the set of ground fluents that corresponds to the positive part. Thus, the bold area in Figure 1 contains exactly those sets of ground fluents (or, individual states) that do satisfy the positive part of $z$ and none of the elements of its negative part. For example, the individual state $\{ on(b_1, a), ontable(a) \}$ belongs to $z^{\mathcal{I}}$, whereas $\{ on(b_1, a), ontable(a), holding(b_2) \}$ does not. In other words, abstract states are characterized by means of conditions that must hold in each ground instance thereof and, thus, they represent clusters of individual states. In this way, abstract states embody a form of state space abstraction. This kind of abstraction is referred to as first-order state abstraction.
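The membership test behind this interpretation can be sketched as follows. The encoding is our simplification, not the PFC term representation: fluents are tuples, ground states are sets of fluents, an abstract state is a list of positive fluents plus a list of negative fluent lists, and arguments starting with an uppercase letter are variables.

```python
def match(pattern, fact, subst):
    """Extend `subst` so that fluent `pattern` equals ground `fact`, or None.
    Arguments starting with an uppercase letter are variables."""
    if pattern[0] != fact[0] or len(pattern) != len(fact):
        return None
    s = dict(subst)
    for p, f in zip(pattern[1:], fact[1:]):
        if p[0].isupper():                  # variable
            if s.setdefault(p, f) != f:
                return None
        elif p != f:
            return None
    return s

def embeddings(fluents, state, subst):
    """Yield substitutions extending `subst` that embed all `fluents` into `state`."""
    if not fluents:
        yield subst
        return
    for fact in state:
        s = match(fluents[0], fact, subst)
        if s is not None:
            yield from embeddings(fluents[1:], state, s)

def satisfies(abstract_state, state):
    """Does the ground `state` belong to the interpretation of (P, N)?"""
    positive, negatives = abstract_state
    return any(
        all(next(embeddings(n, state, theta), None) is None for n in negatives)
        for theta in embeddings(positive, state, {}))

z = ([("on", "X", "a"), ("ontable", "a")],       # positive part P
     [[("on", "Y", "X")], [("holding", "W")]])   # negative part N
s1 = {("on", "b1", "a"), ("ontable", "a")}
s2 = {("on", "b1", "a"), ("ontable", "a"), ("holding", "b2")}
print(satisfies(z, s1), satisfies(z, s2))  # -> True False
```

Note that the positive substitution is threaded into the negative checks, so a negative condition such as $on(Y, X)$ constrains only the block $X$ bound by the positive part.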
Actions
Actions are first-order terms starting with an action function symbol. For example, the action of picking up some block $X$ from another block $Y$ might be denoted as $pickup(X, Y)$. Formally, let $\mathcal{A}$ denote a set of action names disjoint from $\mathcal{F}$. An action space is a tuple $(A, Pre, Eff)$, where $A$ is a set of terms of the form $a(X_1, \ldots, X_n)$, referred to as actions, with $a \in \mathcal{A}$ and each $X_i$ being either a variable or a constant; $Pre(a)$ is a precondition of the action $a$; and $Eff(a)$ is an effect of $a$.
So far, we have described deterministic actions only. But actions in PFC may have probabilistic effects as well. Similar to the work by Boutilier et al., we decompose a stochastic action into deterministic primitives under nature’s control, referred to as nature’s choices. We use a relation symbol $choice$ to model nature’s choice. Consider the action $pickup(X, Y)$:
$$choice(pickup(X, Y)) \equiv pickupS(X, Y) \vee pickupF(X, Y),$$
where $pickupS$ and $pickupF$ define two nature’s choices for the action $pickup$, viz., that it succeeds or fails. For example, the nature’s choice $pickupS$ can be defined with precondition $empty \circ on(X, Y)$ and effect $holding(X)$,
where the fluent $empty$ denotes the empty robot’s gripper. For simplicity, we denote the set of nature’s choices of an action $a$ as $Ch(a)$. Please note that nowhere do these action descriptions restrict the domain of discourse to some prespecified set of blocks.
For each nature’s choice $a_j$ associated with an action $a$, we define the probability $prob(a_j, a, z)$, denoting the probability with which the choice $a_j$ is chosen in a state $z$. For example,
the value of $prob(pickupS(X, Y), pickup(X, Y), z)$ states the probability for the successful execution of the pickup action in state $z$.
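A stochastic action decomposed into deterministic nature's choices could be represented as below. This is our illustration: the probabilities 0.75/0.25 and the add/delete effect sets are assumed values for the example, not figures from the paper.

```python
import random

# Nature's choices for pickup(x, y): each choice is deterministic, with
# add/delete effects over ground fluents. Probabilities are illustrative.
def pickup_choices(x, y):
    return {
        "pickupS": {"prob": 0.75,                       # success
                    "add": {("holding", x)},
                    "delete": {("on", x, y), ("empty",)}},
        "pickupF": {"prob": 0.25,                       # failure: no change
                    "add": set(), "delete": set()},
    }

def sample_choice(choices, rng=random):
    """Pick one nature's choice according to its probability."""
    r, acc = rng.random(), 0.0
    for name, choice in choices.items():
        acc += choice["prob"]
        if r < acc:
            return name
    return name  # guard against floating-point round-off

choices = pickup_choices("b1", "b2")
print(sample_choice(choices) in choices)  # -> True
```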
In the next step, we define the reward function for each state. For example, we might want to give a reward of 500 to all states in which some block $X$ is on block $b_1$, and 0, otherwise:
$$reward(z) = \begin{cases} 500, & \text{if } z \preceq (\, on(X, b_1), \emptyset \,) \\ 0, & \text{otherwise,} \end{cases}$$
where $\preceq$ denotes the subsumption relation, which will be described in detail in Section 3.2.1. One should observe that we have specified the reward function without explicit state enumeration. Instead, the state space is divided into two abstract states depending on whether or not a block is on block $b_1$. Likewise, value functions can be specified with respect to the abstract states only. This is in contrast to classical DP algorithms, in which the states are explicitly enumerated. Action costs can be analogously defined as follows:
$$cost(pickup(X, Y)) = 3,$$
penalizing the execution of the pickup action with the value of 3.
Inference Mechanism
Herein, we show how to perform inferences, i.e., compute successors of a given abstract state, with action schemata directly, avoiding unnecessary grounding. We note that computation of predecessors can be performed in a similar way.
Let $z = (P, N)$ be an abstract state and $a$ an action with parameters $X_1, \ldots, X_n$, precondition $(P_a, N_a)$ and effect $(P_e, N_e)$. Let $\theta$ and $\sigma$ be substitutions. An action $a$ is forward applicable, or simply applicable, to $z$ with $\theta$ and $\sigma$ if the following conditions hold:
(f1) $P_a\theta \circ U =_{AC1} P\sigma$;
(f2) for each $F \in N_a$ there exists $F' \in N$ such that $F'\sigma \circ V =_{AC1} F\theta$,
where $U$ and $V$ are new AC1-variables and AC1 is the equational theory for $\circ$ that is represented by the following system of “associativity”, “commutativity”, and “unit element” equations:
$$(X \circ Y) \circ Z =_{AC1} X \circ (Y \circ Z), \qquad X \circ Y =_{AC1} Y \circ X, \qquad X \circ 1 =_{AC1} X.$$
In other words, the conditions (f1) and (f2) guarantee that $z$ contains both the positive and the negative preconditions of the action $a$. If an action $a$ is forward applicable to $z$ with $\theta$ and $\sigma$, then $z' = (P', N')$, where
$$P' = U \circ P_e\theta, \qquad N' = N\sigma \cup N_e\theta, \tag{1}$$
is referred to as the successor of $z$ with $\theta$ and $\sigma$ and denoted as $succ(z, a, \theta, \sigma)$.
For example, consider the action $pickup(X, Y)$ as defined above and take an abstract state $z$ whose positive part contains $empty \circ on(X, Y)$. The action is forward applicable to $z$ with suitable substitutions $\theta$ and $\sigma$, and Equation 1 replaces the consumed precondition fluents by the effect $holding(X)$ in the successor $succ(z, pickup(X, Y), \theta, \sigma)$.
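At the ground level, forward applicability and the successor computation of Equation 1 reduce to simple set operations; the following sketch is our simplification (the AC1 matching that lifts this to abstract states is omitted), and the fluent names follow the pickup example.

```python
def applicable(state, pre_pos, pre_neg):
    """Ground forward applicability: positive preconditions present,
    negative preconditions absent (ground counterparts of (f1), (f2))."""
    return pre_pos <= state and not (pre_neg & state)

def successor(state, pre_pos, pre_neg, add, delete):
    """Ground counterpart of Equation 1: remove consumed fluents, add effects."""
    if not applicable(state, pre_pos, pre_neg):
        return None
    return (state - delete) | add

s = frozenset({("on", "b1", "b2"), ("ontable", "b2"), ("empty",)})
s_next = successor(s,
                   pre_pos={("on", "b1", "b2"), ("empty",)},
                   pre_neg=set(),
                   add={("holding", "b1")},
                   delete={("on", "b1", "b2"), ("empty",)})
print(sorted(s_next))  # -> [('holding', 'b1'), ('ontable', 'b2')]
```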
3 First-Order LAO*
We present a generalization of the symbolic LAO* algorithm by Feng and Hansen, referred to as first-order LAO* (FOLAO*), for solving FOMDPs. Symbolic LAO* is a heuristic search algorithm that exploits state abstraction for solving factored MDPs. Given an initial state, symbolic LAO* uses an admissible heuristic to focus computation on the parts of the state space that are reachable from the initial state. Moreover, it specifies MDP components, value functions, policies, and admissible heuristics using propositional ADDs. This allows symbolic LAO* to manipulate sets of states instead of individual states.
Despite the fact that symbolic LAO* shows an advantageous behaviour in comparison to the classical non-symbolic LAO* by Hansen and Zilberstein, which evaluates states individually, it suffers from an important drawback. While solving FOMDPs, symbolic LAO* propositionalizes the problem, and this approach is impractical for large FOMDPs. Our intention is to show how to improve the performance of symbolic LAO* by providing a compact first-order representation of MDPs, so that the heuristic search can be performed without propositionalization. More precisely, we propose to switch the representational formalism for FOMDPs in symbolic LAO* from propositional ADDs to the Probabilistic Fluent Calculus. The FOLAO* algorithm is presented in Figure 2.
As symbolic LAO*, FOLAO* has two phases that alternate until a complete solution is found, which is guaranteed to be optimal. First, it expands the best partial policy and evaluates the states on its fringe using an admissible heuristic function. Then it performs dynamic programming on the states visited by the best partial policy, to update their values and possibly revise the current best partial policy. We note that we focus on partial policies that map a subcollection of states into actions.
In the policy expansion step, we perform reachability analysis to find the set $F$ of states that have not yet been expanded, but are reachable from the set of initial states by following the partial policy $\pi$. The set of states $E$ contains the states that have been expanded so far. By expanding a partial policy we mean that it will be defined for a larger set of states in the dynamic programming step. In symbolic LAO*, the reachability analysis on ADDs is performed by means of the image operator from symbolic model checking, which computes the set of successor states following the best current policy. Instead, in FOLAO*, we apply the successor operator defined in Equation 1. One should observe that, since the reachability analysis in FOLAO* is performed on abstract states that are defined as first-order entities, the reasoning about successor states is kept on the first-order level. In contrast, symbolic LAO* would first instantiate the states with all possible combinations of objects, in order to be able to perform computations using propositional ADDs later on.
In contrast to symbolic LAO*, where the dynamic programming step is performed using a modified version of SPUDD, we employ a modified first-order value iteration algorithm (FOVI). The original FOVI by Hölldobler and Skvortsova performs value iteration over the entire state space. We modify it so that it computes on the states that are reachable from the initial states, more precisely, on the set of states that are visited by the best current partial policy. In this way, we improve the efficiency of the original FOVI algorithm by using reachability analysis together with symbolic dynamic programming. FOVI produces a PFC representation of value functions and policies by constructing first-order formulae that partition the state space into abstract states. In effect, it performs value iteration on top of abstract states, obviating the need for explicit state enumeration.
Given a FOMDP and a value function represented in PFC, FOVI returns the best partial value function, the best partial policy and the residual. In order to update the values of the states visited by the best current partial policy, we assign the values from the current value function to their successors. We compute the successors with respect to all nature’s choices. The residual is computed as the absolute value of the largest difference between the current and the newly computed value functions. We note that the newly computed value function is taken in its normalized form, i.e., as a result of the procedure that will be described in Section 3.2.1. Extraction of a best partial policy is straightforward: one simply needs to extract the maximizing actions from the best partial value function.
As with symbolic LAO*, FOLAO* converges to an optimal policy when three conditions are met: (1) its current policy does not have any unexpanded states, (2) the residual is less than the predefined threshold $\varepsilon$, and (3) the value function is initialized with an admissible heuristic. The original convergence proofs for LAO* and symbolic LAO* by Hansen and Zilberstein carry over in a straightforward way to FOLAO*.
When calling FOLAO*, we initialize the value function with an admissible heuristic function that focuses the search on a subset of reachable states. A simple way to create an admissible heuristic is to use dynamic programming to compute an approximate value function. Therefore, in order to obtain an admissible heuristic in FOLAO*, we perform several iterations of the original FOVI. We start the algorithm on an initial value function that is admissible. Since each step of FOVI preserves admissibility, the resulting value function is admissible as well. The initial value function assigns the goal reward to each state, thereby overestimating the optimal value, since the goal reward is the maximal possible reward.
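This heuristic construction can be sketched on an enumerated MDP (our simplification; the actual computation runs on abstract states): every value is initialized optimistically with the goal reward, and a fixed number of backups is applied, each of which preserves admissibility.

```python
def bellman_backup(V, S, P, R, C, gamma=1.0):
    """One synchronous backup. Applied to an optimistic (admissible) V,
    the result remains an upper bound on the optimal value function."""
    return {s: R[s] + max((C[a] + gamma * sum(p * V[s2] for s2, p in succs)
                           for a, succs in P[s].items()), default=0.0)
            for s in S}

def admissible_heuristic(S, P, R, C, goal_reward, iterations=20):
    V = {s: goal_reward for s in S}   # optimistic initialization
    for _ in range(iterations):
        V = bellman_backup(V, S, P, R, C)
    return V

S = ["s", "g"]
P = {"s": {"go": [("g", 1.0)]}, "g": {}}       # g is an absorbing goal
R, C = {"s": 0.0, "g": 500.0}, {"go": -1.0}
print(admissible_heuristic(S, P, R, C, 500.0, iterations=3))
# -> {'s': 499.0, 'g': 500.0}
```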
Since all computations of FOLAO* are performed on abstract states instead of individual states, FOMDPs are solved avoiding explicit state and action enumeration as well as propositionalization. The first-order reasoning leads to better performance of FOLAO* in comparison to symbolic LAO*, as shown in Section 4.
3.1 Policy Expansion
The policy expansion step in FOLAO* is very similar to the one in the symbolic LAO* algorithm. Therefore, we illustrate the expansion procedure by means of an example. Assume that we start from the initial state $z_0$, and two nondeterministic actions $a$ and $b$ are applicable in $z_0$, each having two outcomes $a_1$, $a_2$ and $b_1$, $b_2$, respectively. Without loss of generality, we assume that the current best policy chooses $a$ as an optimal action at state $z_0$. We construct the successors $z_1$ and $z_2$ of $z_0$ with respect to both outcomes $a_1$ and $a_2$ of the action $a$.
The fringe set $F$ as well as the set $E$ of states expanded so far contain the states $z_1$ and $z_2$ only, whereas the set $G$ of states visited by the best current partial policy contains the state $z_0$ in addition. See Figure 3a. In the next step, FOVI is performed on the set $G$. We assume that the values have been updated in such a way that $b$ becomes an optimal action in $z_0$. Thus, the successors of $z_0$ have to be recomputed with respect to the optimal action $b$. See Figure 3b.
One should observe that one of the successors of $z_0$, namely $z_2$, is an element of the set $E$ and thus has already been contained in the fringe during the previous expansion step. Hence, the state $z_2$ should be expanded and its value recomputed. This is shown in Figure 3c, where the states $z_4$ and $z_5$ are successors of $z_2$, under the assumption that $a$ is an optimal action in $z_2$. As a result, the fringe set $F$ contains the newly discovered states $z_3$, $z_4$ and $z_5$, and we perform FOVI on the set $G$. The state $z_1$ is not contained in $G$, because it does not belong to the best current partial policy, and the dynamic programming step is performed only on the states that were visited by the best current partial policy.
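The reachability part of the expansion step can be sketched as a simple graph traversal; this is our illustration, with placeholder state names and a placeholder successor relation: states reached under the current partial policy that have not yet been expanded form the fringe.

```python
def expand_policy(initial, policy, successors, expanded):
    """Collect the states reachable from `initial` under the partial
    `policy`; reachable states not yet in `expanded` form the fringe."""
    visited, fringe, stack = set(), set(), [initial]
    while stack:
        z = stack.pop()
        if z in visited:
            continue
        visited.add(z)
        if z not in expanded or z not in policy:
            fringe.add(z)          # reachable, but not expanded yet
            continue
        stack.extend(successors(z, policy[z]))
    return visited, fringe

# Example mirroring the first expansion above: the policy picks action a
# in z0, whose two outcomes lead to z1 and z2.
succ = {("z0", "a"): ["z1", "z2"], ("z0", "b"): ["z2", "z3"]}
G, F = expand_policy("z0", {"z0": "a"},
                     lambda z, act: succ.get((z, act), []), expanded={"z0"})
print(sorted(G), sorted(F))  # -> ['z0', 'z1', 'z2'] ['z1', 'z2']
```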
3.2 FirstOrder Value Iteration
In FOLAO*, the first-order value iteration algorithm (FOVI) serves two purposes: First, we perform several iterations of FOVI in order to create an admissible heuristic for FOLAO*. Second, in the dynamic programming step of FOLAO*, we apply FOVI to the states visited by the best partial policy in order to update their values and possibly revise the current best partial policy.
The original FOVI by Hölldobler and Skvortsova takes a finite state space of abstract states, a finite set of stochastic actions, real-valued reward and cost functions, and an initial value function as input. It produces a first-order representation of the optimal value function and policy by exploiting the logical structure of a FOMDP. Thus, FOVI can be seen as a first-order counterpart of the classical value iteration algorithm by Bellman.
3.2.1 Normalization
Following the ideas of Boutilier et al., FOVI relies on the normalization of the state space that represents the value function. By normalization of a state space, we mean an equivalence-preserving procedure that reduces the size of the state space. This has an effect only if the state space contains redundant entries, which is usually the case in symbolic computations.
Although normalization is considered to be an important issue, it has so far been done by hand. To the best of our knowledge, the preliminary implementation of the approach by Boutilier et al. performs only rudimentary logical simplifications, and the authors suggest using an automated first-order theorem prover for the normalization task. Hölldobler and Skvortsova have developed an automated normalization procedure for FOVI that, given a state space, delivers an equivalent one that contains no redundancy. The technique employs the notion of a subsumption relation.
More formally, let $z_1 = (P_1, N_1)$ and $z_2 = (P_2, N_2)$ be abstract states. Then $z_1$ is said to be subsumed by $z_2$, written $z_1 \preceq z_2$, if and only if there exist substitutions $\theta$ and $\sigma$ such that the following conditions hold:
(s1) $P_2\theta \circ U =_{AC1} P_1$;
(s2) for each $F \in N_2$ there exists $F' \in N_1$ such that $F'\sigma \circ V =_{AC1} F\theta$,
where $U$ and $V$ are new AC1-variables. The motivation for the notion of subsumption on abstract states is inherited from the notion of subsumption between first-order clauses by Robinson, with the difference that abstract states contain more complicated negative parts in contrast to the first-order clauses.
For example, consider two abstract states $z_1$ and $z_2$ that are defined as follows:
$$z_1 = (\, on(X, Y) \circ ontable(Y), \; \{ red(Z) \} \,), \qquad z_2 = (\, on(X', Y'), \; \{ red(X') \} \,),$$
where $z_1$ informally asserts that some block $X$ is on the block $Y$ which is on the table and no blocks are red, whereas $z_2$ informally states that some block $X'$ is on the block $Y'$ and $X'$ is not red. We show that $z_1 \preceq z_2$. The relation holds since both conditions (s1) and (s2) are satisfied. Indeed,
$$on(X', Y')\theta \circ U =_{AC1} on(X, Y) \circ ontable(Y)$$
and
$$red(Z)\sigma \circ V =_{AC1} red(X')\theta$$
with $\theta = \{ X' \mapsto X, Y' \mapsto Y \}$, $\sigma = \{ Z \mapsto X \}$, $U = ontable(Y)$ and $V = 1$.
One should note that subsumption in the language of abstract states inherits the complexity bounds of clausal subsumption by Kapur and Narendran. Namely, deciding subsumption between two abstract states is NP-complete, in general. However, Karabaev et al. have recently developed an efficient algorithm that delivers all solutions of the subsumption problem for the case where abstract states are fluent terms.
For the purpose of normalization, it is convenient to represent the value function $V$ as a set of pairs of the form $(z, v)$, where $z$ is an abstract state and $v$ is a real value. In essence, the normalization algorithm can be seen as an exhaustive application of the following simplification rule to the value function $V$: if $(z_1, v)$ and $(z_2, v)$ are distinct pairs in $V$ with the same value $v$ and $z_1 \preceq z_2$, then the pair $(z_1, v)$ is removed from $V$.
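The simplification rule can be sketched as follows. In this illustration of ours, abstract states are stubbed as sets of ground fluents, and $z_1 \preceq z_2$ is stubbed as "$z_2$'s fluents are a subset of $z_1$'s" (a stand-in for the AC1 subsumption test, since a state with fewer constraints is more general).

```python
def normalize(value_fn, subsumed):
    """Exhaustively drop a pair (z1, v) when another pair (z2, v) with the
    same value subsumes it; a tie-break keeps one representative when two
    states subsume each other."""
    kept = []
    for i, (z1, v1) in enumerate(value_fn):
        def redundant(j, z2, v2):
            if j == i or v1 != v2 or not subsumed(z1, z2):
                return False
            return not subsumed(z2, z1) or j < i
        if not any(redundant(j, z2, v2) for j, (z2, v2) in enumerate(value_fn)):
            kept.append((z1, v1))
    return kept

# Stub subsumption: a state with fewer constraints is more general.
subsumed = lambda z1, z2: z2 <= z1
V = [(frozenset({"on(b1,b2)", "red(b1)"}), 5.0),   # subsumed by the next pair
     (frozenset({"on(b1,b2)"}), 5.0),
     (frozenset({"ontable(b3)"}), 7.0)]
print(normalize(V, subsumed))
# -> [(frozenset({'on(b1,b2)'}), 5.0), (frozenset({'ontable(b3)'}), 7.0)]
```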
Table 1 illustrates the importance of the normalization algorithm by providing some representative timing results for the first ten iterations of FOVI. The experiments were carried out on a problem taken from the colored Blocksworld scenario consisting of ten blocks. Even on such a relatively simple problem, FOVI with the normalization switched off does not scale beyond the sixth iteration.
N | Number of states | Time, msec    | Runtime, msec | Runtime w/o norm, msec
  | Update   Norm    | Update  Norm  |               |
0 | 9        6       | 144     1     | 145           | 144
1 | 24       14      | 393     3     | 396           | 593
2 | 94       23      | 884     12    | 896           | 2219
3 | 129      33      | 1377    16    | 1393          | 13293
4 | 328      39      | 2079    46    | 2125          | 77514
5 | 361      48      | 2519    51    | 2570          | 805753
6 | 604      52      | 3268    107   | 3375          | n/a
7 | 627      54      | 3534    110   | 3644          | n/a
8 | 795      56      | 3873    157   | 4030          | n/a
9 | 811      59      | 4131    154   | 4285          | n/a
The results in Table 1 demonstrate that the normalization during some iteration of FOVI dramatically shrinks the computational effort of the following iterations. The columns labelled Update and Norm under “Number of states” show the size of the state space after performing the value updates and the normalization, respectively. For example, the normalization factor, i.e., the ratio between the number of states obtained after performing one update step and the number of states obtained after performing the normalization step, at the seventh iteration is 11.6. This means that more than ninety percent of the state space contained redundant information. The fourth and fifth columns in Table 1 contain the times Update and Norm spent on performing the value updates and the normalization, respectively. The total runtime Runtime, when the normalization is switched on, is given in the sixth column. The seventh column, labelled Runtime w/o norm, depicts the total runtime of FOVI when the normalization is switched off. If we sum up all values in the seventh column and the values in the sixth column up to the sixth iteration inclusively, subtract the latter from the former, and divide the result by the total time Norm needed for performing the normalization during the first six iterations, then we obtain a normalization gain of more than three orders of magnitude.
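The two computations described above can be reproduced directly from the figures in Table 1 (our arithmetic check):

```python
# Figures from Table 1 (times in milliseconds, rows N = 0..9).
runtime_norm = [145, 396, 896, 1393, 2125, 2570, 3375, 3644, 4030, 4285]
runtime_wo   = [144, 593, 2219, 13293, 77514, 805753]  # n/a beyond N = 5
norm_time    = [1, 3, 12, 16, 46, 51, 107, 110, 157, 154]

# Normalization factor at row N = 7: states after update vs. after norm.
print(round(627 / 54, 1))  # -> 11.6

# Normalization gain: runtime saved per millisecond spent normalizing,
# over the first six iterations.
gain = (sum(runtime_wo) - sum(runtime_norm[:6])) / sum(norm_time[:6])
print(round(gain))  # -> 6915
```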
4 Experimental Evaluation
We demonstrate the advantages of combining heuristic search with first-order state abstraction in a system, referred to as FluCaP, that successfully entered the probabilistic track of the 2004 International Planning Competition (IPC’2004). The experimental results were all obtained using Red Hat Linux running on a 3.4GHz Pentium IV machine with 3GB of RAM.
In Table 2, we present a performance comparison of FluCaP and symbolic LAO* on examples taken from the colored Blocksworld (BW) scenario that was introduced during IPC’2004.
Our main objective was to investigate whether firstorder state abstraction using logic could improve the computational behaviour of a planning system for solving FOMDPs. The colored BW problems were our main interest since they were the only ones represented in firstorder terms and hence the only ones that allowed us to make use of the firstorder state abstraction. Therefore, we have concentrated on the design of a domaindependent planning system that was tuned for the problems taken from the Blocksworld scenario.
The colored BW problems differ from the classical BW ones in that, along with the unique identifier, each block is assigned a specific color. A goal formula, specified in firstorder terms, provides an arrangement of colors instead of an arrangement of blocks.
B   C | Total av. reward               | Total time, sec                | H.time, sec    | NAS                 | NGS           | %
      | sLAO* FluCaP FOVI   FluCaP-t   | sLAO*  FluCaP FOVI   FluCaP-t  | sLAO*  FluCaP  | sLAO* FluCaP FOVI   | sLAO*  FluCaP | FluCaP
5   4 | 494   494    494    494        | 22.3   22.0   23.4   31.1      | 8.7    4.2     | 35    410    1077   | 0.86   0.82   | 2.7
5   3 | 496   495    495    496        | 23.1   17.8   22.7   25.1      | 9.5    1.3     | 34    172    687    | 0.86   0.68   | 2.1
5   2 | 496   495    495    495        | 27.3   11.7   15.7   16.5      | 12.7   0.3     | 32    55     278    | 0.86   0.66   | 1.9
6   4 | 493   493    493    493        | 137.6  78.5   261.6  285.4     | 76.7   21.0    | 68    1061   3847   | 7.05   4.24   | 3.1
6   3 | 493   492    493    492        | 150.5  33.0   119.1  128.5     | 85.0   9.3     | 82    539    1738   | 7.05   6.50   | 2.3
6   2 | 495   494    495    496        | 221.3  16.6   56.4   63.3      | 135.0  1.2     | 46    130    902    | 7.05   6.24   | 2.0
7   4 | 492   491    491    491        | 1644   198.1  2776   n/a       | 757.0  171.3   | 143   2953   12014  | 65.9   23.6   | 3.5
7   3 | 494   494    494    494        | 1265   161.6  1809   2813      | 718.3  143.6   | 112   2133   7591   | 65.9   51.2   | 2.4
7   2 | 494   494    494    494        | 2210   27.3   317.7  443.6     | 1241   12.3    | 101   425    2109   | 65.9   61.2   | 2.0
8   4 | n/a   490    n/a    n/a        | n/a    1212   n/a    n/a       | n/a    804.1   | n/a   8328   n/a    | n/a    66.6   | 4.1
8   3 | n/a   490    n/a    n/a        | n/a    598.5  n/a    n/a       | n/a    301.2   | n/a   3956   n/a    | n/a    379.7  | 3.0
8   2 | n/a   492    n/a    n/a        | n/a    215.3  1908   n/a       | n/a    153.2   | n/a   2019   7251   | n/a    1121   | 2.3
15  3 | n/a   486    n/a    n/a        | n/a    1809   n/a    n/a       | n/a    1733    | n/a   7276   n/a    | n/a    n/a    | 5.7
17  4 | n/a   481    n/a    n/a        | n/a    3548   n/a    n/a       | n/a    1751    | n/a   15225  n/a    | n/a    n/a    | 6.1

Here sLAO* denotes symbolic LAO*, and FOVI and FluCaP-t denote the variants of FluCaP without heuristic search and with the trivial heuristic, respectively.
At the outset of solving a colored BW problem, symbolic LAO* starts by propositionalizing its components, namely, the goal statement and the actions. Only after that is the abstraction using propositional ADDs applied. In contrast, FluCaP performs first-order abstraction on a colored BW problem directly, avoiding unnecessary grounding. In the following, we show how the abstraction technique affects the computation of a heuristic function. To create an admissible heuristic, FluCaP performs twenty iterations of FOVI, and symbolic LAO* performs twenty iterations of an approximate value iteration algorithm similar to APRICODD by St-Aubin et al. The columns labelled H.time and NAS show the time needed for computing a heuristic function and the number of abstract states it covers, respectively. In comparison to FluCaP, symbolic LAO* needs to evaluate fewer abstract states in the heuristic function but takes considerably more time. One can conclude that abstract states in symbolic LAO* have a more complex structure than those in FluCaP.
We note that, in comparison to FOVI, FluCaP restricts the value iteration to a smaller state space. Intuitively, the value function delivered by FOVI covers a larger state space, because the time that is allocated for the heuristic search in FluCaP is used for performing additional iterations of FOVI instead. The results in the column labelled % show that the harder a problem is (that is, the more colors it contains), the higher the percentage of runtime spent on normalization. On almost all test problems, the effort spent on normalization takes three percent of the total runtime on average.
In order to compare heuristic accuracy, the column labelled NGS gives the number of ground states to which the heuristic assigns non-zero values. One can see that the heuristics returned by FluCaP and symbolic LAO* have similar accuracy, but FluCaP takes much less time to compute them. This reflects the advantage of plain first-order abstraction over the marriage of propositionalization with abstraction using propositional ADDs. On some examples, we gain several orders of magnitude in H.time.
The column labelled Total time presents the time needed to solve a problem. During this time, a planner must execute 30 runs from an initial state to a goal state; a one-hour block is allocated for each problem. We note that, in comparison to FluCaP, the time required by heuristic search in symbolic LAO* (i.e., the difference between Total time and H.time) grows considerably faster with the size of the problem. This reflects the potential of employing first-order abstraction instead of abstraction based on propositional ADDs during heuristic search.
The average reward obtained over 30 runs, shown in the column Total av. reward, is the planner's evaluation score. A reward value close to 500 (the maximum possible reward) simply indicates that a planner found a reasonably good policy. Each time the number of blocks B increases by one, the running time for symbolic LAO* increases roughly tenfold. Thus, it could not scale to problems with more than seven blocks. This is in contrast to FluCaP, which could solve problems of seventeen blocks. We note that the number of colors C in a problem affects the efficiency of an abstraction technique. In FluCaP, as C decreases, the abstraction rate increases, which, in turn, is reflected by a dramatic decrease in runtime. The opposite holds for symbolic LAO*.
In addition, we compare FluCaP with two variants. The first one, denoted FOVI, performs no heuristic search at all but rather employs FOVI to compute the optimal total value function, from which a policy is extracted. The second one, denoted FluCaP₀, performs 'trivial' heuristic search starting with an initial value function as an admissible heuristic. As expected, FluCaP, which combines heuristic search and FOVI, demonstrates an advantage over both plain FOVI and trivial heuristic search. These results illustrate the significance of heuristic search in general (FluCaP vs. FOVI) and the importance of heuristic accuracy in particular (FluCaP vs. FluCaP₀). FOVI and FluCaP₀ do not scale to problems with more than seven blocks.
B  Total av. reward, 500  Total time, sec.  H.time, sec.  NAS  NGS
20  489.0  137.5  56.8  711  1.7
22  487.4  293.8  110.2  976  1.1
24  492.0  757.3  409.8  1676  1.0
26  482.8  817.0  117.2  1141  4.6
28  493.0  2511.3  823.3  2832  8.6
30  491.2  3580.4  1174.0  4290  1.1
32  476.0  3953.8  781.8  2811  7.4
34  475.6  3954.1  939.4  3248  9.6
36  n/a  n/a  n/a  n/a  n/a
Table 3 presents the performance results of FluCaP on larger instances of one-color BW problems with the number of blocks varying from twenty to thirty-four. We believe that FluCaP does not scale to problems of larger size only because the implementation is not yet well optimized. In general, we believe that the FluCaP system should not be as sensitive to the size of a problem as propositional planners are.
Our experiments were targeted at the one-color problems only because they are, on the one hand, the simplest ones for us and, on the other hand, the bottleneck for propositional planners. The structure of one-color problems allows us to apply first-order state abstraction in its full power. For example, for a 34-block problem FluCaP operates on about 3.3 thousand abstract states, which explode into individual states after propositionalization. A propositional planner must be highly optimized in order to cope with this nontrivial state space.
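To put the abstraction rate in perspective, the number of ground configurations of a blocks world with n distinct blocks can be counted exactly: a state partitions the blocks into nonempty towers, where each tower is an ordered stack but the set of towers is unordered, giving sum over k of (n!/k!)·C(n-1, k-1) states (the row sums of the Lah numbers). The snippet below is our own illustration of this explosion and is not part of FluCaP:

```python
from math import comb, factorial

def bw_states(n):
    """Number of ground blocks-world states with n distinct blocks:
    partitions of n labeled blocks into an unordered collection of
    k nonempty ordered towers, summed over k (Lah number row sums)."""
    return sum(factorial(n) // factorial(k) * comb(n - 1, k - 1)
               for k in range(1, n + 1))

print(bw_states(3))   # 13 states for 3 blocks
print(bw_states(34))  # already far beyond what explicit enumeration can touch
```

For 34 blocks the count exceeds 34! (the single-tower term alone), i.e., more than 10^38 ground states behind the roughly 3.3 thousand abstract states FluCaP manipulates.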
We note that additional colors in larger instances (more than 20 blocks) of BW problems cause a dramatic increase in computation time, so we consider these problems unsolved. One should also observe that the number of abstract states NAS increases non-monotonically with the number of blocks because the problems are generated randomly. For example, the 30-block problem happens to be harder than the 34-block one. Finally, we note that all results appearing in Tables 2 and 3 were obtained using the new version of the evaluation software, which does not rely on propositionalization, in contrast to the initial version used during the competition.
Table 4 presents the competition results from IPC'2004, where FluCaP was competitive with the other planners on colored BW problems. FluCaP did not perform well on non-colored BW problems because these problems were propositional ones (that is, goal statements and initial states are ground) and FluCaP does not yet incorporate the optimization techniques applied in modern propositional planners. The contestants are indicated by their origin, e.g., Dresden for FluCaP and UMass for symbolic LAO*. Because only the pickup action has cost 1, a gain of five points in total reward means that a plan contains ten fewer actions on average. The competition domains and log files are available in an online appendix of Younes et al. (2005).
Although the empirical results presented in this work were obtained with the domain-dependent version of FluCaP, we have recently developed an efficient domain-independent inference mechanism (Karabaev, Rammé, and Skvortsova, 2006) that is the core of a domain-independent version of FluCaP.
Problem  Total av. reward, 500
B  C  Canberra  Dresden  UMass  Michigan  Purdue  Purdue  Purdue  Caracas  Toulouse
5  3  494.6  496.4  n/a  n/a  496.5  496.5  495.8  n/a  n/a
8  3  486.5  492.8  n/a  n/a  486.6  486.4  487.2  n/a  n/a
11  5  479.7  486.3  n/a  n/a  481.3  481.5  481.9  n/a  n/a
5  0  494.6  494.6  494.8  n/a  494.1  494.6  494.4  494.9  494.1
8  0  489.7  489.9  n/a  n/a  488.7  490.3  490  488.8  n/a
11  0  479.1  n/a  n/a  n/a  480.3  479.7  481.1  465.7  n/a
15  0  467.5  n/a  n/a  n/a  469.4  467.7  486.3  397.2  n/a
18  0  351.8  n/a  n/a  n/a  462.4  54.9  n/a  n/a  n/a
21  0  285.7  n/a  n/a  n/a  455.7  455.1  459  n/a  n/a
5 Related Work
We follow the symbolic DP (SDP) approach within the Situation Calculus (SC) of Boutilier et al. (2001) in using first-order state abstraction for FOMDPs. One difference lies in the representation language: we use PFC instead of SC. In the course of symbolic value iteration, a state space may contain redundant abstract states that dramatically affect the algorithm's efficiency. In order to achieve computational savings, normalization must be performed to remove this redundancy. However, in the original work by Boutilier et al. (2001) this was done by hand. To the best of our knowledge, the preliminary implementation of the SDP approach within SC uses human-provided rewrite rules for logical simplification. In contrast, Hölldobler and Skvortsova (2004) have developed an automated normalization procedure for FOVI that is incorporated in the competition version of FluCaP and brings a computational gain of several orders of magnitude. Another crucial difference is that our algorithm uses heuristic search to limit the number of states for which a policy is computed.
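The effect of such a normalization can be sketched abstractly: redundant abstract states are those subsumed by (denoting a subset of) another abstract state. The sketch below is purely illustrative and is not FluCaP's procedure: it models abstract states as frozensets of ground states, whereas FluCaP reasons over first-order formulas, where the `subsumes` check becomes a logical entailment test.

```python
# Sketch of state-space normalization: drop every abstract state that is
# subsumed by another one already kept. Abstract states are modeled as
# frozensets of ground states only for illustration; in FluCaP the
# subsumption test is a logical one over first-order state descriptions.

def subsumes(a, b):
    # Hypothetical check: abstract state `a` covers everything `b` covers.
    return b <= a

def normalize(abstract_states):
    kept = []
    for s in abstract_states:
        if any(subsumes(t, s) for t in kept):
            continue                                   # s is redundant
        kept = [t for t in kept if not subsumes(s, t)] + [s]
    return kept

states = [frozenset({1, 2}), frozenset({1}), frozenset({1, 2, 3})]
print(normalize(states))  # only the most general abstract state survives
```

With quadratically many subsumption tests per pass, the price of each test determines whether normalization pays off, which is why an automated, efficient procedure matters.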
The ReBel algorithm by Kersting et al. (2004) is related to FluCaP in that it also uses a representation language that is simpler than the Situation Calculus. This feature makes the state space normalization computationally feasible.
In motivation, our approach is closely connected to Relational Envelope-based Planning (REBP) by Gardiol and Kaelbling (2003), which represents MDP dynamics by a compact set of relational rules and extends the envelope method of Dean et al. (1995). However, REBP propositionalizes actions first, and only afterwards employs abstraction using equivalence-class sampling. In contrast, FluCaP applies state and action abstraction directly on the first-order structure of an MDP. In this respect, REBP is closer to symbolic LAO* than to FluCaP. Moreover, in contrast to PFC, action descriptions in REBP do not allow negation to appear in preconditions or in effects. In organization, FluCaP, like symbolic LAO*, is similar to real-time DP by Barto et al. (1995), which is an online search algorithm for MDPs. In contrast, FluCaP works offline.
All the above algorithms can be classified as deductive approaches to solving FOMDPs. They can be characterized by the following features: (1) they are model-based, (2) they aim at exact solutions, and (3) logical reasoning methods are used to compute abstractions. We should note that FOVI aims at an exact solution for a FOMDP, whereas FluCaP, due to the heuristic search that avoids evaluating all states, seeks an approximate solution. Therefore, it would be more appropriate to classify FluCaP as an approximate deductive approach to FOMDPs.
In another vein, there is some research on developing inductive approaches to solving FOMDPs, e.g., by Fern et al. (2003). The authors propose the approximate policy iteration (API) algorithm, where they replace the use of cost-function approximations as policy representations in API with direct, compact state-action mappings, and use a standard relational learner to learn these mappings. In effect, Fern et al. (2003) provide policy-language biases that enable the solution of very large relational MDPs. All inductive approaches can be characterized by the following features: (1) they are model-free, (2) they aim at approximate solutions, and (3) an abstract model is used to generate biased samples from the underlying FOMDP, and the abstract model is altered based on them.
A recent approach by Gretton and Thiebaux (2004) proposes an inductive policy construction algorithm that strikes a middle ground between deductive and inductive techniques. The idea is to use reasoning, in particular first-order regression, to automatically generate a hypothesis language, which is then used as input by an inductive solver. This approach is related to SDP and to our approach in the sense that a first-order domain specification language as well as logical reasoning are employed.
6 Conclusions
We have proposed an approach that combines heuristic search and first-order state abstraction for solving FOMDPs more efficiently. Our approach is twofold: first, we use dynamic programming to compute an approximate value function that serves as an admissible heuristic; then heuristic search is performed to find an exact solution for those states that are reachable from the initial state. In both phases, we exploit the power of first-order state abstraction in order to avoid evaluating states individually. As the experimental results show, our approach breaks new ground in exploring the efficiency of first-order representations in solving MDPs. In comparison to existing MDP planners that must propositionalize the domain, e.g., symbolic LAO*, our solution scales better on larger FOMDPs.
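The two-phase scheme can be summarized on a toy flat MDP; everything below (the MDP, the forward-reachability expansion, the sweep count) is our own simplified illustration, not FluCaP's implementation, which performs both phases over abstract first-order states with LAO*-style search:

```python
# Two-phase sketch: (1) truncated value iteration over the whole model
# yields an admissible heuristic H; (2) the search phase refines values
# only for states reachable from the initial state, seeded with H.

GAMMA = 0.9
T = {  # T[s][a] = list of (prob, next_state, reward)
    's0':   {'a': [(1.0, 's1', 0.0)]},
    's1':   {'a': [(0.5, 'goal', 10.0), (0.5, 's1', 0.0)]},
    'goal': {'a': [(1.0, 'goal', 0.0)]},
    'dead': {'a': [(1.0, 'dead', 0.0)]},   # never reachable from s0
}

def heuristic(n_iters=5, v_max=100.0):
    # Phase 1: a few optimistic Bellman sweeps -> admissible upper bound.
    V = {s: v_max for s in T}
    for _ in range(n_iters):
        V = {s: max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in out)
                    for out in acts.values())
             for s, acts in T.items()}
    return V

def reachable(s0):
    # Phase 2 support: states reachable from the initial state.
    seen, stack = {s0}, [s0]
    while stack:
        for out in T[stack.pop()].values():
            for _, s2, _ in out:
                if s2 not in seen:
                    seen.add(s2)
                    stack.append(s2)
    return seen

H = heuristic()
R = reachable('s0')          # 'dead' is excluded from all further work
V = dict(H)                  # refine H on reachable states only
for _ in range(200):
    for s in R:
        V[s] = max(sum(p * (r + GAMMA * V[s2]) for p, s2, r in out)
                   for out in T[s].values())
print(sorted(R))             # ['goal', 's0', 's1']
```

The payoff is visible even in this toy: the unreachable state keeps its heuristic value and is never backed up, which is exactly the saving that heuristic search buys over plain value iteration.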
However, plenty remains to be done. For example, we are interested in the question of to what extent the optimization techniques applied in modern propositional planners can be combined with first-order state abstraction. In future competitions, we would like to face problems where the goal and/or initial states are only partially defined and where the underlying domain contains infinitely many objects.
The current version of FluCaP is targeted at problems that allow for efficient first-order state abstraction. More precisely, these are the problems that can be polynomially translated into PFC. For example, in the colored BW domain, existentially closed goal descriptions were linearly translated into an equivalent PFC representation, whereas universally closed goal descriptions would require full propositionalization. Thus, the current version of PFC is less expressive than, e.g., the Situation Calculus. In the future, it would be interesting to study extensions of the PFC language, in particular, to find the trade-off between PFC's expressive power and the tractability of solution methods for FOMDPs based on PFC.
Acknowledgements
We are very grateful to all anonymous reviewers for their thorough reading of previous versions of this paper. We also thank Zhengzhu Feng for fruitful discussions and for providing us with the executable of the symbolic LAO* planner. We greatly appreciate David E. Smith's patience and encouragement; his valuable comments have helped us to improve this paper. Olga Skvortsova was supported by a grant within the Graduate Programme GRK 334 "Specification of discrete processes and systems of processes by operational models and logics" under the auspices of the Deutsche Forschungsgemeinschaft (DFG).
References
 [Barto et al., 1995] Barto, A. G., Bradtke, S. J., and Singh, S. P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2), 81–138.
 [Bellman, 1957] Bellman, R. E. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ, USA.
 [Boutilier et al., 1999] Boutilier, C., Dean, T., and Hanks, S. 1999. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1–94.
 [Boutilier et al., 2001] Boutilier, C., Reiter, R., and Price, B. 2001. Symbolic dynamic programming for first-order MDPs. In Nebel, B. (Ed.), Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'2001), 690–700. Morgan Kaufmann.
 [Dean et al., 1995] Dean, T., Kaelbling, L., Kirman, J., and Nicholson, A. 1995. Planning under time constraints in stochastic domains. Artificial Intelligence, 76, 35–74.
 [Feng and Hansen, 2002] Feng, Z. and Hansen, E. 2002. Symbolic heuristic search for factored Markov decision processes. In Dechter, R., Kearns, M., and Sutton, R. (Eds.), Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI'2002), 455–460, Edmonton, Canada. AAAI Press.
 [Fern et al., 2003] Fern, A., Yoon, S., and Givan, R. 2003. Approximate policy iteration with a policy language bias. In Thrun, S., Saul, L., and Schölkopf, B. (Eds.), Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems (NIPS'2003), Vancouver, Canada. MIT Press.
 [Gardiol and Kaelbling, 2003] Gardiol, N. and Kaelbling, L. 2003. Envelope-based planning in relational MDPs. In Thrun, S., Saul, L., and Schölkopf, B. (Eds.), Proceedings of the Seventeenth Annual Conference on Neural Information Processing Systems (NIPS'2003), Vancouver, Canada. MIT Press.
 [Gretton and Thiebaux, 2004] Gretton, C. and Thiebaux, S. 2004. Exploiting first-order regression in inductive policy selection. In Chickering, M. and Halpern, J. (Eds.), Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI'2004), Banff, Canada. Morgan Kaufmann.
 [Hansen and Zilberstein, 2001] Hansen, E. and Zilberstein, S. 2001. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129, 35–62.
 [Hoey et al., 1999] Hoey, J., St-Aubin, R., Hu, A., and Boutilier, C. 1999. SPUDD: Stochastic planning using decision diagrams. In Laskey, K. B. and Prade, H. (Eds.), Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI'1999), 279–288, Stockholm. Morgan Kaufmann.
 [Hölldobler and Schneeberger, 1990] Hölldobler, S. and Schneeberger, J. 1990. A new deductive approach to planning. New Generation Computing, 8, 225–244.
 [Hölldobler and Skvortsova, 2004] Hölldobler, S. and Skvortsova, O. 2004. A logic-based approach to dynamic programming. In Proceedings of the Workshop on "Learning and Planning in Markov Processes – Advances and Challenges" at the Nineteenth National Conference on Artificial Intelligence (AAAI'2004), 31–36, San Jose, CA. AAAI Press.
 [Kapur and Narendran, 1986] Kapur, D. and Narendran, P. 1986. NP-completeness of the set unification and matching problems. In Siekmann, J. H. (Ed.), Proceedings of the Eighth International Conference on Automated Deduction (CADE'1986), 489–495, Oxford, England. Springer Verlag.
 [Karabaev et al., 2006] Karabaev, E., Rammé, G., and Skvortsova, O. 2006. Efficient symbolic reasoning for first-order MDPs. In Proceedings of the Workshop on "Planning, Learning and Monitoring with Uncertainty and Dynamic Worlds" at the Seventeenth European Conference on Artificial Intelligence (ECAI'2006), Riva del Garda, Italy. To appear.
 [Kersting et al., 2004] Kersting, K., van Otterlo, M., and De Raedt, L. 2004. Bellman goes relational. In Brodley, C. E. (Ed.), Proceedings of the Twenty-First International Conference on Machine Learning (ICML'2004), 465–472, Banff, Canada. ACM.
 [Puterman, 1994] Puterman, M. L. 1994. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY.
 [Robinson, 1965] Robinson, J. A. 1965. A machine-oriented logic based on the resolution principle. Journal of the Association for Computing Machinery, 12(1), 23–41.
 [St-Aubin et al., 2000] St-Aubin, R., Hoey, J., and Boutilier, C. 2000. APRICODD: Approximate policy construction using decision diagrams. In Leen, T. K., Dietterich, T. G., and Tresp, V. (Eds.), Proceedings of the Fourteenth Annual Conference on Neural Information Processing Systems (NIPS'2000), 1089–1095, Denver. MIT Press.
 [Younes et al., 2005] Younes, H., Littman, M., Weissman, D., and Asmuth, J. 2005. The first probabilistic track of the International Planning Competition. Journal of Artificial Intelligence Research, 24, 851–887.