Stochastic Planning and Lifted Inference

01/04/2017 ∙ by Roni Khardon, et al. ∙ Tufts University

Lifted probabilistic inference (Poole, 2003) and symbolic dynamic programming for lifted stochastic planning (Boutilier et al., 2001) were introduced around the same time as algorithmic efforts to use abstraction in stochastic systems. Over the years, these ideas evolved into two distinct lines of research, each supported by a rich literature. Lifted probabilistic inference focused on efficient arithmetic operations on template-based graphical models under a finite domain assumption, while symbolic dynamic programming focused on supporting sequential decision-making in rich quantified logical action models and on open domain reasoning. Given their common motivation but different focal points, both lines of research have yielded highly complementary innovations. In this chapter, we aim to help close the gap between these two research areas by providing an overview of lifted stochastic planning from the perspective of probabilistic inference, showing strong connections to other chapters in this book. This also allows us to define Generalized Lifted Inference as a paradigm that unifies these areas and elucidates open problems for future research that can benefit both lifted inference and stochastic planning.




1 Introduction

In this chapter we illustrate that stochastic planning can be viewed as a specific form of probabilistic inference and show that recent symbolic dynamic programming (SDP) algorithms for the planning problem can be seen to perform “generalized lifted inference”, thus making a strong connection to other chapters in this book. As we discuss below, although the SDP formulation is more expressive in principle, work on SDP to date has largely focused on algorithmic aspects of reasoning in open domain models with rich quantified logical structure whereas lifted inference has largely focused on aspects of efficient arithmetic computations over finite domain (quantifier free) template-based models. The contributions in these areas are therefore largely along different dimensions. However, the intrinsic relationships between these problems suggest a strong opportunity for cross-fertilization where the true scope of generalized lifted inference can be achieved. This chapter intends to highlight these relationships and lay out a paradigm for generalized lifted inference that subsumes both fields and offers interesting opportunities for future research.

To make the discussion concrete, let us introduce a running example for stochastic planning and the kind of generalized solutions that can be achieved. For illustrative purposes, we borrow a planning domain from Boutilier et al. [1] that we refer to as BoxWorld. In this domain, outlined in Figure 1, there are several cities, trucks, and boxes. The agent can load a box onto a truck or unload it and can drive a truck from one city to another. When any box has been delivered to a specific city, Paris, the agent receives a positive reward. The agent’s planning task is to find a policy for action selection that maximizes this reward over some planning horizon.

Domain
  Object Types (i.e., sorts): Box, Truck, City
  Relations (with parameter sorts): BoxIn: […], TruckIn: […], BoxOn: […]
  Reward: if […] then 10 else 0
  Actions (with parameter sorts):
    […]: Success Probability: if […] then .9 else 0. Add Effects on Success: […]. Delete Effects on Success: […].
    […]: Success Probability: if […] then .9 else 0. Add Effects on Success: […]. Delete Effects on Success: […].
    […]: Success Probability: if […] then 1 else 0. Add Effects on Success: […]. Delete Effects on Success: […].
    […]: Success Probability: 1. Add Effects on Success: […]. Delete Effects on Success: […].

Figure 1: A formal description of the BoxWorld domain adapted from [1]. We use a simple STRIPS-like [2] add and delete list representation of actions and, as a simple probabilistic extension in the spirit of PSTRIPS [3], we assign probabilities that an action successfully executes conditioned on various state properties.

Our objective in lifted stochastic planning is to obtain an abstract policy, for example, like the one shown in Figure 2. In order to get some box to Paris, the agent should drive a truck to the city where the box is located, load the box on the truck, drive the truck to Paris, and finally unload the box in Paris. This is essentially encoded in the symbolic value function shown in Fig. 2, which was computed with rewards discounted according to the number of time steps into the future at which they are received.

Similar to this example, for some problems we can obtain a solution which is described abstractly and is independent of the specific problem instance or even its size — for our example problem the description of the solution does not depend on the number of cities, trucks or boxes, or on knowledge of the particular location of any specific truck. Accordingly, one might hope that computing such a solution can be done without knowledge of these quantities and in time complexity independent of them. This is the computational advantage of symbolic stochastic planning which we associate with lifted inference in this chapter.

The next two subsections expand on the connection between planning and inference, identify opportunities for lifted inference, and use these observations to define a new setup which we call generalized lifted inference which abstracts some of the work in both areas and provides new challenges for future work.

if […] then do […] (value = 100.00)
else if […] then do […] (value = 89.0)
else if […] then do […] (value = 80.0)
else if […] then do […] (value = 72.0)
else if […] then do […] (value = 64.7)
else do […] (value = 0.0)

Figure 2: A decision-list representation of the optimal policy and expected discounted reward for the BoxWorld problem. The optimal action parameters in the then conditions correspond to the existential bindings that made the if conditions true.

1.1 Stochastic Planning and Inference

Planning is the task of choosing what actions to take to achieve some goals or maximize long-term reward. When the dynamics of the world are deterministic, that is, each action has exactly one known outcome, then the problem can be solved through logical inference. That is, inference rules can be used to deduce the outcome of individual actions given the current state, and by combining inference steps one can prove that the goal is achieved. In this manner a proof of goal achievement embeds a plan. This correspondence was at the heart of McCarthy’s seminal paper [4] that introduced the topic of AI and viewed planning as symbolic logical inference. Since this formulation uses first-order logic, or the closely related situation calculus, lifted logical inference can be used to solve deterministic planning problems.

When the dynamics of the world are non-deterministic, this relationship is more complex. In particular, in this chapter we focus on the stochastic planning problem where an action can have multiple possible known outcomes that occur with known state-dependent probabilities. Inference in this case must reason about probabilities over an exponential number of state trajectories for some planning horizon. While lifted inference and planning may seem to be entirely different problems, analogies have been made between the two fields in several forms [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. To make the connections concrete, consider a finite domain and the finite horizon goal-oriented version of the BoxWorld planning problem of Figure 1, e.g., two boxes, three trucks, and four cities and a planning horizon of 10 steps where the goal is to get some box in Paris. In this case, the value of a state, $V(S)$, corresponds to the probability of achieving the goal, and goal achievement can be modeled as a specific form of inference in a Bayesian network or influence diagram.

We start by considering the conformant planning problem where the intended solution is an explicit sequence of actions. In this case, the sequence of actions is determined in advance and action choice at the $t$th step does not depend on the actual state at the $t$th step. For this formulation, one can build a Dynamic Bayesian Network (DBN) model where each time slice represents the state at that time and action nodes affect the state at the next time step, as in Figure 3(a). The edges in this diagram capture $\Pr(S' \mid S, A)$, where $S$ is the current state, $A$ is the current action and $S'$ is the next state, and each of $S$, $A$, $S'$ is represented by multiple nodes to show that they are given by a collection of predicates and their values. Note that, since the world dynamics are known, the conditional probabilities for all nodes in the graph are known. As a result, the goal-based planning problem, where a goal $G$ must hold at the last step, can be modeled using standard inference. The value of conformant planning is given by marginal MAP (where we seek a MAP value for some variables but take expectation over the remaining variables) [7, 12, 13]:

$$\max_{A_0,\ldots,A_{N-1}} \Pr(G \mid S_0, A_0,\ldots,A_{N-1}) = \max_{A_0,\ldots,A_{N-1}} \sum_{S_1,\ldots,S_N} \Pr(G \mid S_N) \prod_{t=0}^{N-1} \Pr(S_{t+1} \mid S_t, A_t). \tag{1}$$

Here $S_0$ is the start state, $S_t$ and $A_t$ are the state and action at step $t$, and $N$ is the horizon. The optimal conformant plan is extracted using argmax instead of max in the equation.

The standard MDP formulation with a reward per time step which is accumulated can be handled similarly, by normalizing the cumulative reward and adding a binary node whose probability of being true is a function of the normalized cumulative reward. Several alternative formulations of planning as inference have been proposed by defining an auxiliary distribution over finite trajectories which captures a utility-weighted probability distribution over the trajectories [6, 9, 10, 11, 15]. While the details vary, the common theme among these approaches is that the planning objective is equivalent to calculating the partition function (or “probability of evidence”) in the resulting distribution. This achieves the same effect as adding a node that depends on the cumulative reward. To simplify the discussion, we continue the presentation with the simple goal based formulation.

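The same problem can be viewed from a Bayesian perspective, treating actions as random variables with an uninformative prior. Writing $\vec{A}$ for the action sequence $A_0,\ldots,A_{N-1}$, in this case we can use Bayes' rule,

$$\Pr(\vec{A} \mid S_0, G) = \frac{\Pr(G \mid S_0, \vec{A})\, \Pr(\vec{A} \mid S_0)}{\Pr(G \mid S_0)},$$

to observe that [5, 6, 8]

$$\arg\max_{\vec{A}} \Pr(G \mid S_0, \vec{A}) = \arg\max_{\vec{A}} \Pr(\vec{A} \mid S_0, G),$$

because the prior over $\vec{A}$ is constant and the denominator does not depend on $\vec{A}$, and therefore one can alternatively maximize the probability of the action sequence conditioned on the start state and the goal.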



Figure 3: Planning as inference: conditioning on start and goal state. (a) Conformant planning – actions selected per time step without knowledge of the state. (b) An exponential size policy at each time step determines action selection. The transition depends on the current state and policy’s actions for that state.

However, linear plans, such as the ones produced by the conformant setting, are not optimal for probabilistic planning. In particular, if we are to optimize goal achievement then we must allow the actions to depend on the state they are taken in. That is, the action in the second step is taken with knowledge of the probabilistic outcome of the first action, which is not known in advance. We can achieve this by duplicating action nodes, with a copy for each possible value of the state variables, as illustrated in Figure 3(b). This represents a separate policy associated with each horizon depth, which is required because finite horizon problems have non-stationary optimal policies. In this case, state transitions depend on the identity of the current state and the action variables associated with that state. Writing $A_t^{s}$ for the action variable associated with state $s$ at step $t$, the corresponding inference problem can be written as follows:

$$\max_{\{A_t^{s}\}} \sum_{S_1,\ldots,S_N} \Pr(G \mid S_N) \prod_{t=0}^{N-1} \Pr(S_{t+1} \mid S_t, A_t^{S_t}). \tag{2}$$

However, the number of random variables in this formulation is prohibitively large since we need the number of original action variables to be multiplied by the size of the state space.

Alternatively, the same desideratum, optimizing actions with knowledge of the previous state, can be achieved without duplicating variables in the equivalent formulation

$$\max_{A_0} \sum_{S_1} \Pr(S_1 \mid S_0, A_0) \max_{A_1} \sum_{S_2} \Pr(S_2 \mid S_1, A_1) \cdots \max_{A_{N-1}} \sum_{S_N} \Pr(S_N \mid S_{N-1}, A_{N-1})\, \Pr(G \mid S_N). \tag{3}$$

In fact, this formulation is exactly the same as the finite horizon application of the value iteration (VI) algorithm for (goal-based) Markov Decision Processes (MDP), which is the standard formulation for sequential decision making in stochastic environments. The standard formulation abstracts this by setting

$$V_0(S) = \Pr(G \mid S), \qquad V_{k+1}(S) = \max_{A} \sum_{S'} \Pr(S' \mid S, A)\, V_k(S').$$

The optimal policy (at $S_0$) can be obtained as before by recording the argmax values. In terms of probabilistic inference, the problem is no longer a marginal MAP problem because summation and maximization steps are constrained in their interleaved order. But it can be seen as a natural extension of such inference questions with several alternating blocks of expectation and maximization. We are not aware of an explicit study of such problems outside the planning context.
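The alternating expectation and maximization steps can be made concrete on a toy problem. The following is a minimal sketch, not the chapter's algorithm: the two-state model, the state and action names, and the function `goal_value_iteration` are all hypothetical, chosen only to mirror an unload action that succeeds with probability 0.9. It assumes that the value with zero steps to go is the probability that the goal already holds.

```python
def goal_value_iteration(states, actions, trans, goal_prob, horizon):
    """Finite-horizon value iteration for a goal-based MDP.

    trans[(s, a)] is a dict {next_state: probability}.
    goal_prob[s] is the probability that the goal holds in state s.
    Returns (values, policies): values[k][s] is the probability of reaching
    the goal with k steps to go, policies[k - 1][s] the best action then.
    """
    values = [dict(goal_prob)]      # zero steps to go: goal must already hold
    policies = []
    for _ in range(horizon):
        prev = values[-1]
        v_next, pi = {}, {}
        for s in states:
            # Inner block: expectation over next states for each action.
            q = {a: sum(p * prev[s2] for s2, p in trans[(s, a)].items())
                 for a in actions}
            # Outer block: maximization over actions, recording the argmax.
            best = max(actions, key=lambda a: q[a])
            v_next[s], pi[s] = q[best], best
        values.append(v_next)
        policies.append(pi)
    return values, policies


# Hypothetical two-state abstraction: a box is either on a truck at the
# destination or already delivered; unloading succeeds with probability 0.9.
states = ["on_truck", "delivered"]
actions = ["unload", "noop"]
trans = {
    ("on_truck", "unload"): {"delivered": 0.9, "on_truck": 0.1},
    ("on_truck", "noop"): {"on_truck": 1.0},
    ("delivered", "unload"): {"delivered": 1.0},
    ("delivered", "noop"): {"delivered": 1.0},
}
goal_prob = {"on_truck": 0.0, "delivered": 1.0}

values, policies = goal_value_iteration(states, actions, trans, goal_prob, horizon=2)
print(values[1]["on_truck"], values[2]["on_truck"])
```

With one step to go the box is delivered with probability 0.9; with two steps to go a failed first attempt can be retried, so the probability rises to 0.9 + 0.1 · 0.9 = 0.99. The table of values is indexed by every concrete state, which is the enumeration that the lifted methods discussed next try to avoid.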

1.2 Stochastic Planning and Generalized Lifted Inference

Given that planning can be seen as an inference problem, one can try to apply ideas of lifted inference to planning. Taking the motivating example from Figure 1, let us specialize the reward to a ground atomic goal $G$ stating that a specific box is in a specific city, both named by constants. Then we can query the model to compute $V(S_0 = s_0)$, where $s_0$ is the concrete value of the current state.

Given that Figure 1 implies a complex relational specification of the transition probabilities, lifted inference techniques are especially well-placed to attempt to exploit the structure of this query to perform inference in aggregate and thus avoid redundant computations. However, we emphasize that, even if lifted inference is used, this is a standard query in the graphical model where evidence constrains the value of some nodes, and the solution is a single number representing the corresponding probability (together with a MAP assignment to variables).

However, Eq 3 suggests an explicit additional structure for the planning problem. In particular, the intermediate expressions include the values $V_k(S)$ (the probability of reaching the goal in $k$ steps) for all possible concrete values of $S$. Similarly, the final result includes the values $V_N(S_0)$ for all possible start states $S_0$. In addition, as in our running example, we can consider more abstract rewards. This suggests a first generalization of the standard setup in lifted inference. Instead of asking about a ground goal and expecting a single number as a response, we can abstract the setup in two ways: first, we can ask about more general conditions, such as whether some box is in Paris, and second, we can expect to get a structured result that specifies the corresponding probability for every concrete state in the world. If we had two box instances and a small number of truck instances, the answer for $V_1$, i.e., the value for the goal based formulation with horizon one, might take the form:

if           then else if           then else .

The significance of this is that the question can have a more general form and that the answer solves many problems simultaneously, providing the response as a case analysis depending on some properties of the state. We refer to this reasoning as inference with generalized queries and answers. In this context, the goal of lifted inference will be to calculate a structured form of the reply directly.

A second extension arises from the setup of generalized queries. The standard form for lifted inference is to completely specify the domain in advance. This means providing the number of objects and their properties, and that the response to the query is calculated only for this specific domain instantiation. However, inspecting the solution in the previous paragraph it is obvious that we can at least hope to do better. The same solution can be described more compactly as

if           then else if           then else .

Arriving at such a solution requires us to allow open domain reasoning over all potential objects (rather than grounding them, which is impossible in open domains), and to extend ideas of lifted inference to exploit quantifiers and their structure. Following through with this idea, we can arrive at a domain-size independent value function and policy as the one shown in Figure 2. In this context, the goal of lifted inference will be to calculate an abstracted form of the reply directly. We call this problem inference with generalized models. As we describe in this chapter, SDP algorithms are able to perform this type of inference.

The previous example had enough structure and a special query that allowed the solution to be specified without any knowledge of the concrete problem instance. This is not always possible. For example, consider a setting where we get one unit of reward for every box in Paris. In addition, consider the case where, after the agent takes their action, any box which is not on a truck disappears with probability 0.2. In this case, we can still potentially calculate an abstract solution, but it requires access to more complex properties of the state, and in some cases the domain size (number of objects) in the state. For our example this gives:

Let if           then else .

Here we have introduced a new notation for count expressions; in this example, the count expression gives the number of boxes in Paris in the current state. To see this result, note that any existing box in Paris disappears 20% of the time, and that a box on a truck is successfully unloaded 90% of the time but remains and does not disappear in only 80% of possible futures, leading to the value 7.2. This is reminiscent of the type of expressions that arise in existing lifted inference problems and solutions. Typical solutions to such problems involve parameterized expressions over the domain (e.g., counting, summation, etc.), and critically do not always require closed-domain reasoning (e.g., a priori knowledge of the number of boxes). They are therefore suitable for inference with generalized models. Some work on SDP has approached lifted inference for problems with this level of complexity, including exogenous activities (the disappearing boxes) and additive rewards. But, as we describe in more detail, the solutions for these cases are much less well understood and developed.
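The two percentages quoted above combine by simple multiplication. The lines below are only a sketch of that arithmetic: the per-box reward $r = 10$ is an assumption, chosen because it matches the reward magnitude in Figure 1 and reproduces the value 7.2.

```latex
\begin{align*}
\text{box already in Paris:}\quad & 0.8 \cdot r = 8,\\
\text{box on a truck, unloaded in Paris at this step:}\quad & 0.9 \cdot 0.8 \cdot r = 7.2 .
\end{align*}
```

Under this assumption, each box that is already in Paris contributes 8 to the one-step value, which is why the result depends on a count over the state rather than on a purely logical condition.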

To recap, our example illustrates that stochastic planning potentially enables abstract solutions that might be amenable to lifted computations. SDP solutions for planning problems have focused on the computational advantages arising from these expressive generalizations. At the same time, the focus in SDP algorithms has largely been on problems where the solution is completely independent of domain size and does not require numerical properties of the state. These algorithms have thus skirted some of the computational issues that are typically tackled in lifted inference. It is the combination of these aspects, as illustrated in the last example, which we call generalized lifted inference. As the discussion suggests, generalized lifted inference is still very much an open problem. In addition to providing a survey of existing SDP algorithms, the goal of this chapter is to highlight the opportunities and challenges in this exciting area of research.

2 Preliminaries

This section provides a formal description of the representation language, the relational planning problem, and the description of the running example in this context.

2.1 Relational Expressions and their Calculus of Operations

The computation of SDP algorithms is facilitated by a representation that enables compact specification of functions over world states. Several such representations have been devised and used. In this chapter we chose to abstract away some of those details and focus on a simple language of relational expressions. This is closest to the GFODD representation of [16, 17], but it resembles the case notation of [1, 18].

Syntax. We assume familiarity with basic concepts and notation in first order logic (FOL) [19, 20, 21]. Relational expressions are similar to expressions in FOL. They are defined relative to a relational signature, with a finite set of predicates each with an associated arity (number of arguments), a countable set of variables, and a set of constants. We do not allow function symbols of arity one or more; the only function symbols are constants, that is, functions with arity 0. A term is a variable (often denoted in uppercase) or constant (often denoted in lowercase) and an atom is either an equality between two terms or a predicate with an appropriate list of terms as arguments. Intuitively, a term refers to an object in the world of interest and an atom is a property which is either true or false.

We illustrate relational expressions informally by some examples. In FOL we can consider open formulas that have unbound variables. For example, the atom $\mathit{Color}(X, Y)$ is such a formula and its truth value depends on the assignment of $X$ and $Y$ to objects in the world. To simplify the discussion, we assume for this example that arguments are typed (or sorted) and $X$ ranges over “objects” and $Y$ over “colors”. We can then quantify over these variables to get a sentence which will be evaluated to a truth value in any concrete possible world. For example, we can write $\exists Y\, \forall X\, \mathit{Color}(X, Y)$, expressing the statement that there is a color associated with all objects. Generalized expressions allow for more general open formulas that evaluate to numerical values. For example, $[\text{if } \mathit{Color}(X, Y) \text{ then } 0.3 \text{ else } 0.5]$ is similar to the previous logical expression but returns non-binary values.

Quantifiers from logic are replaced with aggregation operators that combine numerical values and provide a generalization of the logical constructs. In particular, when the open formula is restricted to values 0 and 1, the operators $\max$ and $\min$ simulate existential and universal quantification. Thus, $\max_Y \min_X\, [\text{if } \mathit{Color}(X, Y) \text{ then } 1 \text{ else } 0]$ is equivalent to the logical sentence given above. But we can allow for other types of aggregations. For example, $\max_Y \sum_X\, [\text{if } \mathit{Color}(X, Y) \text{ then } 1 \text{ else } 0]$ evaluates to the largest number of objects associated with one color, and the expression $\sum_X \min_Y\, [\text{if } \mathit{Color}(X, Y) \text{ then } 0 \text{ else } 1]$ evaluates to the number of objects that have no color association. In this manner, a generalized expression represents a function from possible worlds to numerical values and, as illustrated, can capture interesting properties of the state.

Relational expressions are also related to work in statistical relational learning [22, 23, 24]. For example, if the open expression given above captures the probability of ground facts for the predicate $\mathit{Color}$ and the ground facts are mutually independent, then $\prod_X \prod_Y\, [\text{if } \mathit{Color}(X, Y) \text{ then } 0.3 \text{ else } 0.5]$ captures the joint probability for all facts for $\mathit{Color}$. Of course, the open formulas in logic can include more than one atom and similarly expressions can be more involved.

In the following we will drop the cumbersome if-then-else notation and instead will assume a simpler notation with a set of mutually exclusive conditions which we refer to as cases. In particular, an expression includes a set of mutually exclusive open formulas in FOL (without any quantifiers or aggregators) denoted $c_1, \ldots, c_k$, associated with corresponding numerical values $v_1, \ldots, v_k$. The list of cases refers to a finite set of variables $X_1, \ldots, X_m$. A generalized expression is given by a list of aggregation operators and their variables and the list of cases, so that an expression is canonically represented as $\mathrm{agg}_{X_1} \cdots \mathrm{agg}_{X_m}\, [c_1 : v_1, \ldots, c_k : v_k]$.

Semantics. The semantics of expressions is defined inductively exactly as in first order logic and we skip the formal definition. As usual, an expression is evaluated in an interpretation, also known as a possible world. In our context, an interpretation specifies (1) a finite set of domain elements also known as objects, (2) a mapping of constants to domain elements, and (3) the truth values of all the predicates over tuples of domain elements of appropriate size to match the arity of the predicate. Now, given an expression, an interpretation, and a substitution of the variables in the expression to objects in the interpretation, one can identify the case which is true for this substitution. Exactly one such case exists since the cases are mutually exclusive and exhaustive. Therefore, the value associated with the substitution is the value of that case. These values are then aggregated using the aggregation operators. For example, consider again the expression with values 0.3 and 0.5 and an interpretation with two objects, where the first is associated with colors black and white and the second is associated with color black. In this case we have exactly 4 substitutions evaluating to 0.3, 0.3, 0.5, 0.3. The final value is obtained by applying the aggregation operators to these four numbers; under product aggregation over both variables, for instance, it is $0.3 \cdot 0.3 \cdot 0.5 \cdot 0.3 = 0.0135$.
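The evaluation procedure just described can be sketched in a few lines. This is an illustrative sketch only: the function `evaluate`, the variable names `X` and `Y`, the object names `a` and `b`, and the set `color` are hypothetical, and the choice of aggregators for each query is an assumption made to match the informal descriptions in the text.

```python
from math import prod

AGGREGATORS = {"max": max, "min": min, "sum": sum, "prod": prod}


def evaluate(aggs, case_value, domains):
    """Evaluate agg_1 ... agg_m [cases] in one interpretation.

    aggs: list of (aggregator name, variable name), outermost first.
    case_value: maps a substitution dict to the value of the case it satisfies.
    domains: maps each variable name to the objects it ranges over.
    """
    def rec(i, subst):
        if i == len(aggs):
            return case_value(subst)
        name, var = aggs[i]
        inner = [rec(i + 1, {**subst, var: obj}) for obj in domains[var]]
        return AGGREGATORS[name](inner)
    return rec(0, {})


# Interpretation from the text: object a has colors black and white,
# object b has color black only.
color = {("a", "black"), ("a", "white"), ("b", "black")}
domains = {"X": ["a", "b"], "Y": ["black", "white"]}


def holds(s):
    return (s["X"], s["Y"]) in color


# Product aggregation over the 0.3 / 0.5 expression.
joint = evaluate([("prod", "X"), ("prod", "Y")],
                 lambda s: 0.3 if holds(s) else 0.5, domains)
# max/min over a 0/1 expression simulates "exists Y, forall X".
exists_forall = evaluate([("max", "Y"), ("min", "X")],
                         lambda s: 1 if holds(s) else 0, domains)
# Largest number of objects associated with one color.
largest = evaluate([("max", "Y"), ("sum", "X")],
                   lambda s: 1 if holds(s) else 0, domains)
# Number of objects with no color association.
uncolored = evaluate([("sum", "X"), ("min", "Y")],
                     lambda s: 0 if holds(s) else 1, domains)
print(joint, exists_forall, largest, uncolored)
```

The evaluator enumerates every substitution, so its cost grows with the domain size; it defines what an expression means rather than how a lifted algorithm would compute with it.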

Operations over expressions. Any binary operation $\mathit{op}$ over real values can be generalized to open and closed expressions in a natural way. If $f_1$ and $f_2$ are two closed expressions, $f_1 \;\mathit{op}\; f_2$ represents the function which maps each interpretation $I$ to $f_1(I) \;\mathit{op}\; f_2(I)$. This provides a definition but not an implementation of binary operations over expressions. For implementation, the work in [16] showed that if the binary operation is safe, i.e., it distributes with respect to all aggregation operators, then there is a simple algorithm (the Apply procedure) implementing the binary operation over expressions. For example, $+$ is safe w.r.t. $\max$ aggregation, and it is easy to see that $(\max_X f(X)) + (\max_X g(X)) = \max_X \max_Y\, [f(X) + g(Y)]$, and the open formula portion of the result can be calculated directly from the open expressions $f(X)$ and $g(Y)$. Note that we need to standardize the expressions apart, as in the renaming of $X$ to $Y$ in $g$, for such operations. When $f$ and $g$ are open relational expressions the result can be computed through a cross product of the cases: each pair of cases, one from each expression, yields a case whose condition is the conjunction of the two conditions and whose value is obtained by applying the operation to the two values.

When the binary operation is not safe then this procedure fails, but in some cases, operation-specific algorithms can be used for such combinations. (Footnote 1: For example, a product of expressions that include only product aggregations, which is not safe, can be obtained by correcting the cross-product result with a number that depends on domain size: when the domain has $n$ objects and the values are non-negative, $\prod_X f(X) \cdot \prod_X g(X) = \big(\prod_X \prod_Y f(X)\, g(Y)\big)^{1/n}$.)

As will become clear later, to implement SDP we need the binary operations $+$, $\times$, $\max$, and the aggregation includes $\max$ in addition to the aggregation in the reward function. Since $+$, $\times$, $\max$ are safe with respect to $\max$ aggregation, one can provide a complete solution when the reward is restricted to have $\max$ aggregation. When this is not the case, for example when using sum aggregation in the reward function, one requires a special algorithm for the combination. Further details are provided in [16, 17].
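The cross product of cases and the notion of a safe operation can both be illustrated with a small sketch. It is not the Apply procedure of [16]: the representation of a case as a pair of a literal set and a value, the function `apply_op`, and the atom names `p(X)` and `q(Y)` are hypothetical, and simplification is limited to dropping contradictory conjunctions.

```python
from operator import add


def lit(atom, positive=True):
    """A literal is an (atom, polarity) pair."""
    return (atom, positive)


def apply_op(op, cases1, cases2):
    """Combine two open expressions by a cross product of their cases.

    Each expression is a list of (condition, value) pairs, where a condition
    is a frozenset of literals read as a conjunction.
    """
    result = []
    for cond1, val1 in cases1:
        for cond2, val2 in cases2:
            cond = cond1 | cond2
            atoms = {atom for atom, _ in cond}
            if len(atoms) < len(cond):
                continue            # an atom occurs with both polarities
            result.append((cond, op(val1, val2)))
    return result


e1 = [(frozenset({lit("p(X)")}), 10),
      (frozenset({lit("p(X)", False)}), 0)]
e2 = [(frozenset({lit("p(X)"), lit("q(Y)")}), 5),
      (frozenset({lit("p(X)"), lit("q(Y)", False)}), 2),
      (frozenset({lit("p(X)", False)}), 1)]
combined = apply_op(add, e1, e2)

# Numerical check of safety on arbitrary value tables f(X) and g(Y).
f = [1, 4, 2]
g = [3, 0, 5]
max_lhs = max(f) + max(g)
max_rhs = max(fx + gy for fx in f for gy in g)   # + distributes over max
sum_lhs = sum(f) + sum(g)
sum_rhs = sum(fx + gy for fx in f for gy in g)   # + does not distribute over sum
print(len(combined), max_lhs, max_rhs, sum_lhs, sum_rhs)
```

Of the six candidate pairs, three survive because the others conjoin `p(X)` with its negation. The numerical check shows why addition can be pushed inside a max aggregation but not inside a sum aggregation, where every value is counted once per object of the other variable.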

Summary. Relational expressions are closest to the GFODD representation of [16, 17]. Every case in a relational expression corresponds to a path or set of paths in the GFODD, all of which reach the same leaf in the graphical representation of the GFODD. GFODDs are potentially more compact than relational expressions since paths share common subexpressions, which can lead to an exponential reduction in size. On the other hand, GFODDs require special algorithms for their manipulation. Relational expressions are also similar to the case notation of [1, 18]. However, in contrast with that representation, cases are not allowed to include any quantifiers and instead quantifiers and general aggregators are globally applied over the cases, as in standard quantified normal form in logic.

2.2 Relational MDPs

In this section we define MDPs, starting with the basic case with enumerated state and action spaces, and then providing the relational representation.

MDP Preliminaries. We assume familiarity with basic notions of Markov Decision Processes (MDPs) [25, 26]. Briefly, an MDP is a tuple given by a set of states $S$, a set of actions $A$, a transition probability $\Pr(S' \mid S, A)$, an immediate reward function $R(S)$ and a discount factor $\gamma$. The solution of an MDP is a policy $\pi$ that maximizes the expected discounted total reward obtained by following that policy starting from any state. The Value Iteration algorithm (VI), informally introduced in Eq 3, calculates the optimal value function by iteratively performing Bellman backups, $V_{n+1} = T[V_n]$, defined for each state $s$ as

$$V_{n+1}(s) = T[V_n](s) = \max_{a} \Big[ R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V_n(s') \Big]. \tag{4}$$

Unlike Eq 3, which was goal-oriented and had only a single reward at the terminal horizon, here we allow the reward $R(S)$ to accumulate at all time steps as typically allowed in MDPs. If we iterate the update until convergence, we get the optimal infinite horizon value function, typically denoted by $V^*$, and the optimal stationary policy $\pi^*$. For finite horizon problems, which are the topic of this chapter, we simply stop the iterations at a specific $n$. In general, the optimal policy for the finite horizon case is not stationary, that is, we might make different choices in the same state depending on how close we are to the horizon.

Logical Notation for Relational MDPs (RMDPs). RMDPs are simply MDPs where the states and actions are described in a function-free first order logical language. A state corresponds to an interpretation over the corresponding logical signature, and actions are transitions between such interpretations.

A relational planning problem is specified by providing the logical signature, the start state, the transitions as controlled by actions, and the reward function. As mentioned above, one of the advantages of relational SDP algorithms is that they are intended to produce an abstracted form of the value function and policy that does not require specifying the start state or even the number of objects in the interpretation at planning time. This yields policies that generalize across domain sizes. We therefore need to explain how one can use logical notation to represent the transition model and reward function in a manner that does not depend on domain size.

Two types of transition models have been considered in the literature:

  • Endogenous Branching Transitions: In the basic form, state transitions have limited stochastic branching due to a finite number of action outcomes. The agent has a set of action types, each parametrized with a tuple of objects to yield an action template $A(\vec{x})$ and, once the parameters are bound to objects, a concrete ground action. Each agent action has a finite number of action variants $A_j(\vec{x})$ (e.g., action success vs. action failure), and when the agent performs $A(\vec{x})$ in state $s$ one of the variants is chosen randomly using the state-dependent action choice distribution $\Pr(A_j(\vec{x}) \mid A(\vec{x}))$. To simplify the presentation we follow [27, 16] and require that the choice probabilities are given by open expressions, i.e., they have no aggregations and cannot introduce new variables. For example, in BoxWorld, a load or unload action has a success outcome and a failure outcome: following Figure 1, the success variant is chosen with probability 0.9 when the action's success condition holds in the current state and with probability 0 otherwise, and the failure variant receives the remaining probability.

    To simplify the notation, the last case of such a case list is written in shortened form to denote that it complements previous cases. This provides the distribution over deterministic outcomes of actions.

    The deterministic action dynamics are specified by providing an open expression, capturing successor state axioms [28], for each variant $A_j(\vec{x})$ and predicate template $p(\vec{y})$. Following [27] we call these expressions TVDs, standing for truth value diagrams. The corresponding TVD, $T(A_j(\vec{x}), p(\vec{y}))$, is an open expression that specifies the truth value of $p(\vec{y})'$ in the next state (following standard practice we use prime to denote that the predicate refers to the next state) when $A_j(\vec{x})$ has been executed in the current state. The arguments $\vec{x}$ and $\vec{y}$ are intentionally different logical variables as this allows us to specify the truth value of all instances of $p(\vec{y})$ simultaneously. Similar to the choice probabilities, we follow [27, 16] and assume that TVDs have no aggregations and cannot introduce new variables. This implies that the regression and product terms in the SDP algorithm of the next section do not change the aggregation function, thereby enabling analysis of the algorithm. Continuing our BoxWorld example, a TVD is defined for each predicate under the success variant and the failure variant of an action.


    Note that each TVD has exactly two cases, one leading to the outcome 1 and the other leading to the outcome 0. Our algorithm below will use these cases individually. Here we remark that since the next state (primed) only depends on the previous state (unprimed), we are effectively logically encoding the Markov assumption of MDPs.

  • Exogenous Branching Transitions: The more complex form combines the endogenous model with an exogenous stochastic process that affects ground atoms independently. As a simple example in our BoxWorld domain, we might imagine that with some small probability, each box in a city may independently randomly disappear (so that the atom placing it in that city becomes false) owing to issues with theft or improper routing; such an outcome is independent of the agent’s own action. Another more complicated example could be an inventory control problem where customer arrival at shops (and corresponding consumption of goods) follows an independent stochastic model. Such exogenous transitions can be formalized in a number of ways [29, 30, 17]; we do not aim to commit to a particular representation in this chapter, but rather to mention its possibility and the computational consequences of such general representations.
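The two transition models can be composed at the ground level as in the following sketch. It is written under stated assumptions rather than taken from the chapter: the atom encoding as tuples, the function names, the success condition for unloading (the box is on the truck and the truck is in the city), and the disappearance probability of 0.2 are all illustrative choices.

```python
def unload_outcomes(state, box, truck, city, p_success=0.9):
    """Endogenous branching for a ground unload action.

    state is a frozenset of ground atoms such as ("BoxOn", "b2", "t1").
    Returns a list of (probability, next_state), one entry per action variant.
    """
    applicable = (("BoxOn", box, truck) in state
                  and ("TruckIn", truck, city) in state)
    if not applicable:
        return [(1.0, state)]       # assumed: only the failure variant occurs
    # Success variant: apply the delete effect, then the add effect.
    success = (state - {("BoxOn", box, truck)}) | {("BoxIn", box, city)}
    return [(p_success, frozenset(success)), (1.0 - p_success, state)]


def exogenous_outcomes(state, p_disappear=0.2):
    """Each box located in a city disappears independently."""
    outcomes = [(1.0, state)]
    for atom in sorted(a for a in state if a[0] == "BoxIn"):
        outcomes = [(p * q, s2)
                    for p, s in outcomes
                    for q, s2 in ((1.0 - p_disappear, s),
                                  (p_disappear, s - {atom}))]
    return outcomes


def successors(state, box, truck, city):
    """Agent action variant first, then the exogenous process."""
    return [(p * q, s2)
            for p, s in unload_outcomes(state, box, truck, city)
            for q, s2 in exogenous_outcomes(s)]


start = frozenset({("BoxIn", "b1", "paris"),
                   ("BoxOn", "b2", "t1"),
                   ("TruckIn", "t1", "paris")})
dist = successors(start, "b2", "t1", "paris")
total = sum(p for p, _ in dist)
expected_boxes = sum(p * sum(1 for a in s if a[0] == "BoxIn" and a[2] == "paris")
                     for p, s in dist)
print(total, expected_boxes)
```

The number of ground outcomes doubles with every box that may disappear, which gives a sense of why exogenous branching is computationally harder than the endogenous case with its fixed number of variants.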

Having completed our discussion of RMDP transitions, we now proceed to define the reward, which can be any function of the state and action, specified by a relational expression. Our running example with existentially quantified reward is given by


but we will also consider additive reward as in


3 Symbolic Dynamic Programming

The SDP algorithm is a symbolic implementation of the value iteration algorithm. The algorithm repeatedly applies so-called decision-theoretic regression which is equivalent to one iteration of the value iteration algorithm.

As input to SDP we get closed relational expressions for the current value function and the reward. In addition, assuming that we are using the Endogenous Branching Transition model of the previous section, we get open expressions for the probabilistic choice of actions and for the dynamics of deterministic action variants as TVDs. The corresponding expressions for the running example are given respectively in Eq (7), Eq (5) and Eq (6).

The following SDP algorithm of [16] modifies the earlier SDP algorithm of [1] and implements Eq (4) using the following 4 steps:

  1. Regression: The current step-to-go value function is regressed over every deterministic variant of every action. Regression is conceptually similar to goal regression in deterministic planning. That is, we identify conditions that need to occur before the action is taken in order to arrive at other conditions (for example the goal) after the action. However, here we need to regress all the conditions in the relational expression capturing the value function, so that we must regress each case of the value function separately. This can be done efficiently by replacing every atom in each case condition by its corresponding positive or negated portion of the TVD without changing the aggregation function. Once this substitution is done, logical simplification (at the propositional level) can be used to compress the cases by removing contradictory cases and simplifying the formulas. Applying this to regress an action variant over the reward function given by Eq (7) we get:

    and regressing yields

    This illustrates the utility of compiling the transition model into the TVDs which allow for a simple implementation of deterministic regression.

  2. Add Action Variants: The Q-function for each action is generated by combining regressed diagrams using the binary operations of addition and multiplication over expressions. Recall that probability expressions do not refer to additional variables. The multiplication can therefore be done directly on the open formulas without changing the aggregation function. As argued by [27], to guarantee correctness, both summation steps (the sum over action variants and the addition of the reward) must standardize apart the functions before adding them.

    For our running example and assuming , we would need to compute the following:

    We next illustrate some of these steps. The multiplication by probability expressions can be done by cross product of cases and simplification. For this yields

    and for we get

    Note that the values here are weighted by the probability of occurrence. For example, the first case in the last equation has value 1 = 10 * 0.1 because, when the preconditions of the action hold, this variant occurs with probability 0.1. The addition of the last two equations requires standardizing them apart, performing the safe operation through cross product of cases, and simplifying. Skipping intermediate steps, this yields

    Multiplying by the discount factor scales the numbers in the last equation by 0.9 and finally standardizing apart and adding the reward and simplifying (again skipping intermediate steps) yields

    Intuitively, this result states that after executing a concrete stochastic action with given arguments, we achieve the highest value (10 plus a discounted 0.9*10) if a box was already in Paris, the next highest value (10 occurring with probability 0.9 and discounted by 0.9) if unloading a box from a truck in Paris, and a value of zero otherwise. The main source of efficiency (or lack thereof) of SDP is the ability to perform such operations symbolically and simplify the result into a compact expression.

  3. Object Maximization: Note that up to this point in the algorithm the action arguments are still considered to be concrete arbitrary objects. However, we must make sure that in each of the (unspecified and possibly infinite set of possible) states we choose the best concrete action for that state, by specifying the appropriate action arguments. This is handled in the current step of the algorithm.

    To achieve this, we maximize over the action parameters of each Q-function to produce a value expression for each action. This implicitly obtains the value achievable by the best ground instantiation of the action in each state. This step is implemented by converting action parameters to variables, each associated with the max aggregation operator, and appending these operators to the head of the aggregation function. Once this is done, further logical simplification may be possible. This occurs in our running example, where existential quantification that is constrained by equality can be removed, and the result is:

  4. Maximize over Actions: The next step-to-go value function is generated by combining the per-action expressions using the binary max operation.

    Concretely, for our running example, this means we would compute:

    While we have only shown above, we remark that the values achievable in each state by dominate or equal the values achievable by and in the same state. Practically this implies that after simplification we obtain the following value function:

    Critically for the objectives of lifted stochastic planning, we observe that the value function derived by SDP is indeed lifted: it holds for any number of boxes, trucks and cities.

SDP repeats these steps to the required depth, iteratively calculating successive value functions. For example, Figure 2 illustrates the value function for the BoxWorld example, which was computed by terminating the SDP loop once the value function converged.
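The numbers quoted in the walkthrough above (10 plus a discounted 0.9*10, and 10 with probability 0.9 discounted by 0.9) can be reproduced on a ground instance. The sketch below is only a ground-level check, not SDP itself: it enumerates concrete action arguments rather than manipulating lifted expressions. The predicate names, the precondition of unload, and the convention that an inapplicable action leaves the state unchanged are assumptions made for this illustration. Step 4 (maximizing over actions) is omitted because only one action is modeled.

```python
from itertools import product

GAMMA = 0.9       # discount factor used in the running example
P_SUCCESS = 0.9   # probability that unload succeeds when applicable


def reward(state):
    """Existential reward: 10 if some box is in Paris, else 0."""
    in_paris = any(a[0] == "box_in" and a[2] == "paris" for a in state)
    return 10.0 if in_paris else 0.0


def unload_variants(state, b, t, c):
    """Return [(probability, next_state)] for the ground action unload(b, t, c)."""
    applicable = ("box_on", b, t) in state and ("truck_in", t, c) in state
    if not applicable:
        return [(1.0, state)]  # assumed: nothing changes if preconditions fail
    success = (state - {("box_on", b, t)}) | {("box_in", b, c)}
    return [(P_SUCCESS, success), (1.0 - P_SUCCESS, state)]


def q_unload(state, b, t, c, value):
    """Steps 1-2 on ground states: evaluate the value of each variant's
    outcome, weight by its probability, discount, and add the reward."""
    expected = sum(p * value(s2) for p, s2 in unload_variants(state, b, t, c))
    return reward(state) + GAMMA * expected


def q_best_unload(state, boxes, trucks, cities, value):
    """Step 3: maximize over the action arguments (b, t, c)."""
    return max(
        q_unload(state, b, t, c, value)
        for b, t, c in product(boxes, trucks, cities)
    )


if __name__ == "__main__":
    boxes, trucks, cities = ["b1", "b2"], ["t1"], ["paris", "rome"]
    on_truck = frozenset({("box_on", "b1", "t1"), ("truck_in", "t1", "paris")})
    # One backup starting from the reward as the initial value function.
    print(q_best_unload(on_truck, boxes, trucks, cities, reward))
```

The lifted algorithm obtains the same case values once, for all domain sizes, whereas this check must be rerun for every concrete state.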

The basic SDP algorithm is an exact calculation whenever the model can be specified using the constraints above and the reward function can be specified with max and min aggregation [16]. This is satisfied by classical models of stochastic planning. As illustrated, in these cases, the SDP solution conforms to our definition of generalized lifted inference.

Extending the Scope of SDP. The algorithm above cannot handle models with more complex dynamics and rewards as motivated in the introduction. In particular, prior work has considered two important properties that appear to be relevant in many domains. The first is additive rewards, illustrated, for example, in Eq (8). The second is exogenous branching transitions, illustrated above by the disappearing boxes example. These represent two different challenges for the SDP algorithm. The first is that we must handle sum aggregation in value functions, despite the fact that this means that some of the operations are not safe and hence require a special implementation. The second is in modeling the exogenous branching dynamics, which requires getting around potential conflicts among such events and between such events and agent actions. The introduction illustrated the type of solution that can be expected in such a problem, where counting expressions, which measure the number of times certain conditions hold in a state, determine the value in that state.

To date, exact abstract solutions for problems of this form have not been obtained. The work of [30] and [29] (Ch. 6) considered additive rewards and formalized an expressive family of models with exogenous events. This work has shown that some specific challenging domains can be handled using several algorithmic ideas, but did not provide a general algorithm that is applicable across problems in this class. The work of [17] developed a model for “service domains” which significantly constrains the type of exogenous branching. In their model, a transition includes an agent step whose dynamics use endogenous branching, followed by “nature’s step” where each object (e.g., a box) experiences a random exogenous action (potentially disappearing). Given these assumptions, they provide a generally applicable approximation algorithm as follows. Their algorithm treats the agent’s actions exactly as in SDP above. To regress nature’s actions, it follows three steps: (1) the summation variables are first grounded using a Skolem constant, then (2) a single exogenous event centered at that constant is regressed using the same machinery, and finally (3) the Skolemization is reversed to yield another additive value function. The complete details are beyond the scope of this chapter. The algorithm yields a solution that avoids counting formulas and is syntactically close to the one given by the original algorithm. Since such formulas are necessary for an exact solution, the result is an approximation, but it was shown to be a conservative one in that it provides a monotonic lower bound on the true value. Therefore, this algorithm conforms to our definition of approximate generalized lifted inference.

In our example, starting with the reward of Eq (8) we first replace the sum aggregation with a scaled version of average aggregation (which is safe w.r.t. summation)

and then ground it to get

The next step is to regress through the exogenous event at . The problem where boxes disappear with probability 0.2 can be cast as having two action variants where “disappearing-block” succeeds with probability 0.2 and fails with probability 0.8. Regressing the success variant we get the expression (the zero function) and regressing the fail variant we get . Multiplying by the probabilities of the variants we get: and and adding them (there are no variables to standardize apart) we get

Finally lifting the last equation we get

Next we follow with the standard steps of SDP for the agent’s action. The steps are analogous to the example of SDP given above. Considering the discussion in the introduction (recall that in order to simplify the reasoning in this case we omitted discounting and adding the reward) this algorithm produces

which is identical to the exact expression given in the introduction. As already mentioned, the result is not guaranteed to be exact in general. In addition, the maximization in step 4 of SDP requires some ad-hoc implementation because maximization is not safe with respect to average aggregation.
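The arithmetic of the exogenous regression step in this example can be checked numerically. The sketch below is a ground-level sanity check under stated assumptions, not the algorithm of [17]: it assumes a stand-in additive reward of one unit per box that is still present, with all boxes initially present, and the function names are hypothetical. It compares brute-force enumeration of all joint disappearance outcomes against the "regress a single object, then lift" computation, using the disappearance probability 0.2 from the text.

```python
from itertools import product

P_DISAPPEAR = 0.2  # probability that a single box disappears (from the example)


def additive_reward(present):
    """Stand-in additive reward: one unit per box that is still present."""
    return sum(present)


def expected_reward_by_enumeration(n_boxes):
    """Enumerate all 2^n joint outcomes of independent disappearance."""
    total = 0.0
    for outcome in product([0, 1], repeat=n_boxes):  # 1 means the box survives
        prob = 1.0
        for survives in outcome:
            prob *= (1.0 - P_DISAPPEAR) if survives else P_DISAPPEAR
        total += prob * additive_reward(outcome)
    return total


def expected_reward_lifted(n_boxes):
    """Regress one (Skolemized) box, then lift back to the sum over boxes:
    success variant contributes 0.2 * 0, fail variant contributes 0.8 * 1."""
    per_box = P_DISAPPEAR * 0.0 + (1.0 - P_DISAPPEAR) * 1.0
    return n_boxes * per_box


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, expected_reward_by_enumeration(n), expected_reward_lifted(n))
```

The two computations agree here because the reward is linear in the per-box indicators, so linearity of expectation applies. This agreement is specific to such simple cases; as the text notes, the approximation is in general only a lower bound.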

It is clear from the above example that the main difficulty in extending SDP is due to the interaction of the counting formulas arising from exogenous events and additive rewards with the first-order aggregation structure inherent in the planning problem. Relational expressions, their GFODD counterparts, and other representations that have been used to date are not able to combine these effectively. A representation that seamlessly supports both relational expressions and operations on them along with counting expressions might allow for more robust versions of generalized lifted inference to be realized.

4 Discussion and Related Work

As motivated in the introduction, SDP has explored probabilistic inference problems with a specific form of alternating maximization and expectation blocks. The main computational advantage comes from lifting in the sense of lifted inference in standard first order logic. Issues that arise from conditional summations over combinations of random variables, common in probabilistic lifted inference, have been touched upon but not extensively. In cases where SDP has been shown to work it provides generalized lifted inference where the complexity of the inference algorithm is completely independent of the domain size (number of objects) in the problem specification, and where the response to queries is either independent of that size or can be specified parametrically. This is a desirable property but to our knowledge it is not shared by most work on probabilistic lifted inference. A notable exception is given by the knowledge compilation result of [31] (see Chapter 4 and Theorem 5.5) and the recent work in [32, 33], where a model is compiled into an alternative form parametrized by the domain size and where responses to queries can be obtained in time polynomial in that size. The emphasis in that work is on being domain lifted (i.e., being polynomial in domain size). Generalized lifted inference requires an algorithm whose results can be computed once, in time independent of that size, and then reused to evaluate the answer for specific domain sizes. This analogy also shows that SDP can be seen as a compilation algorithm, compiling a domain model into a more accessible form representing the value function, which can be queried efficiently. This connection provides an interesting new perspective on both fields.
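The "compute once, then evaluate for specific domain sizes" view can be illustrated with a small evaluator. This is a hedged sketch: the case structure, the predicate names, and the helper `make_instance` are hypothetical, and the case values 19.0 and 8.1 are simply borrowed from the one-step BoxWorld example of Section 3. The point is only that the same fixed expression, with max aggregation over its variables, is evaluated unchanged on instances of any size.

```python
from itertools import product


def case_value(state, b, t):
    """Ordered cases of a hypothetical lifted value expression over (b, t).
    The first matching case determines the value."""
    if ("box_in", b, "paris") in state:
        return 19.0
    if ("box_on", b, t) in state and ("truck_in", t, "paris") in state:
        return 8.1
    return 0.0


def evaluate(state, boxes, trucks):
    """Max aggregation: the best case value over all bindings of (b, t)."""
    return max(case_value(state, b, t) for b, t in product(boxes, trucks))


def make_instance(n_boxes, n_trucks):
    """Ground instance of arbitrary size in which only the last box
    sits on the first truck, and that truck is in Paris."""
    boxes = [f"b{i}" for i in range(n_boxes)]
    trucks = [f"t{j}" for j in range(n_trucks)]
    state = frozenset({
        ("box_on", boxes[-1], trucks[0]),
        ("truck_in", trucks[0], "paris"),
    })
    return state, boxes, trucks


if __name__ == "__main__":
    for n in (1, 3, 40):
        state, boxes, trucks = make_instance(n, 2)
        print(n, evaluate(state, boxes, trucks))
```

Only the evaluation cost depends on the number of objects; the expression itself was fixed before any instance was seen.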

In this chapter we focused on one particular instance of SDP. Over the last 15 years SDP has seen a significant amount of work expanding on the original algorithm by using different representations, by using algorithms other than value iteration, and by extending the models and algorithms to more complex settings. In addition, several “lifted” inductive approaches that do not strictly fall within the probabilistic inference paradigm have been developed. We review this work in the remainder of this section.

4.1 Deductive Lifted Stochastic Planning

As a precursor to its use in lifted stochastic planning, the term SDP originated in the propositional logical context [34, 35] when it was realized that propositionally structured MDP transitions (i.e., dynamic Bayesian networks [36]) and rewards (e.g., trees that exploited context-specific independence [37]) could be used to define highly compact factored MDPs; this work also realized that the factored MDP structure could be exploited for representational compactness and computational efficiency by leveraging symbolic representations (e.g., trees) in dynamic programming. Two highly cited (and still used) algorithms in this area of work are SPUDD [38] and APRICODD [39], which leveraged algebraic decision diagrams (ADDs) [40] for, respectively, exact and approximate solutions to factored MDPs. Recent work in this area [41] shows how to perform propositional SDP directly with ground representations in PPDDL [42], and develops extensions for factored action spaces [43, 44].

Following the seminal introduction of lifted SDP in [1], several early papers on SDP approached the problem with existential rewards with different representation languages that enabled efficient implementations. This includes the First-order value iteration (FOVIA) [45, 46], the Relational Bellman algorithm (ReBel) [47], and the FODD based formulation of [27, 48, 49].

Along this dimension two representations are closely related to the relational expression of this chapter. As mentioned above, relational expressions are an abstraction of the GFODD representation [16, 17, 50] which captures expressions using a decision diagram formulation extending propositional ADDs [40]. In particular, paths in the graphical representation of the DAG representing the GFODD correspond to the mutually exclusive conditions in expressions. The aggregation in GFODDs and relational expressions provides significant expressive power in modeling relational MDPs. The GFODD representation is more compact than relational expressions but requires more complex algorithms for its manipulation. The other closely related representation is the case notation of [1, 18]. The case notation is similar to relational expressions in that we have a set of conditions (these are mostly in a form that is mutually exclusive but not always so) but the main difference is that quantification is done within each case separately, and the notion of aggregation is not fully developed. First-order algebraic decision diagrams (FOADDs) [29, 18] are related to the case notation in that they require closed formulas within diagram nodes, i.e., the quantifiers are included within the graphical representation of the expression. The use of quantifiers inside cases and nodes allows for an easy incorporation of off-the-shelf theorem provers for simplification. Both FOADD and GFODD were used to extend SDP to capture additive rewards and exogenous events as already discussed in the previous section. While the representations (relational expressions and GFODDs vs. case notation and FOADDs) have similar expressive power, the difference in aggregation makes for different algorithmic properties that are hard to compare in general. However, the modular treatment of aggregation in GFODDs and the generic form of operations over them make them the most flexible alternative to date for directly manipulating the aggregated case representation used in this chapter.

The idea of SDP has also been extended in terms of the choice of planning algorithm, as well as to the case of partially observable MDPs. Case notation and FOADDs have been used to implement approximate linear programming [51, 18] and approximate policy iteration via linear programming [52], and FODDs have been used to implement relational policy iteration [53]. GFODDs have also been used for open world reasoning and applied in a robotic context [54]. The work of [55] and [56] explores SDP solutions, with GFODDs and case notation respectively, to relational partially observable MDPs (POMDPs), where the problem is conceptually and algorithmically much more complex. Related work in POMDPs has not explicitly addressed SDP, but rather has implicitly addressed lifted solutions through the identification of (and abstraction over) symmetries in applications of dynamic programming for POMDPs [57, 58].

4.2 Inductive Lifted Stochastic Planning

Inductive methods can be seen to be orthogonal to the inference algorithms in that they mostly do not require a model and do not reason about that model. However, the overall objective of producing lifted value functions and policies is shared with the previously discussed deductive approaches. We therefore review these here for completeness. As we discuss, it is also possible to combine the inductive and deductive approaches in several ways.

The basic inductive approaches learn a policy directly from a teacher, sometimes known as behavioral cloning. The work of [59, 60, 61] provided learning algorithms for relational policies with theoretical and empirical evidence for their success. Relational policies and value functions were also explored in reinforcement learning. This was done with pure reinforcement learning using relational regression trees to learn a Q-function [62], combining this with supervised guidance [63], or using Gaussian processes and graph kernels over relational structures to learn a Q-function [64]. A more recent approach uses functional gradient boosting with lifted regression trees to learn lifted policy structure in a policy gradient algorithm [65].

Finally, several approaches combine inductive and deductive elements. The work of [66] combines inductive logic programming with first-order decision-theoretic regression, by first using deductive methods (decision theoretic regression) to generate candidate policy structure, and then learning using this structure as features. The work of [67] shows how one can implement relational approximate policy iteration where policy improvement steps are performed by learning the intended policy from generated trajectories instead of direct calculation. Although these approaches are partially deductive they do not share the common theme of this chapter relating planning and inference in relational contexts.

5 Conclusions

This chapter provides a review of SDP methods, which perform abstract reasoning for stochastic planning, from the viewpoint of probabilistic inference. We have illustrated how the planning problem and the inference problem are related. Specifically, finite horizon optimization in MDPs is related to an inference problem with alternating maximization and expectation blocks and is therefore more complex than marginal MAP queries that have been studied in the literature. This analogy is valid both at the propositional and relational levels and it suggests a new line of challenges for inference problems in discrete domains. We have also identified the opportunity for generalized lifted inference, where the algorithm and its solution are agnostic of the domain instance and its size and are efficient regardless of this size. We have shown that under some conditions SDP algorithms provide generalized lifted inference. In more complex models, especially ones with additive rewards and exogenous events, SDP algorithms are yet to mature into an effective and widely applicable inference scheme. On the other hand, the challenges faced in such problems are exactly the ones typically seen in standard lifted inference problems. Therefore, exploring generalized lifted inference more abstractly has the potential to lead to advances in both areas.


This work is partly supported by NSF grants IIS-0964457 and IIS-1616280.


  • [1] C. Boutilier, R. Reiter, and B. Price. Symbolic dynamic programming for first-order MDPs. In Proc. of IJCAI, pages 690–700, 2001.
  • [2] Richard E. Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. AI Journal, 2:189–208, 1971.
  • [3] Neil Kushmerick, Steve Hanks, and Dan Weld. An algorithm for probabilistic planning. Artificial Intelligence, 76(1-2):239–286, 1995.
  • [4] J. McCarthy. Programs with common sense. In Proceedings of the Symposium on the Mechanization of Thought Processes, volume 1, pages 77–84. National Physical Laboratory, 1958. Reprinted in R. Brachman and H. Levesque (Eds.), Readings in Knowledge Representation, 1985, Morgan Kaufmann, Los Altos, CA.
  • [5] Hagai Attias. Planning by probabilistic inference. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, AISTATS 2003, Key West, Florida, USA, January 3-6, 2003, 2003.
  • [6] M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state Markov decision processes. In Proceedings of the International Conference on Machine Learning, 2006.
  • [7] Carmel Domshlak and Jörg Hoffmann. Fast probabilistic planning through weighted model counting. In Proceedings of the International Conference on Automated Planning and Scheduling, 2006.
  • [8] M. Lang and M. Toussaint. Approximate inference for planning in stochastic relational worlds. In Proceedings of the International Conference on Machine Learning, 2009.
  • [9] Thomas Furmston and David Barber. Variational methods for reinforcement learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS, pages 241–248, 2010.
  • [10] Qiang Liu and Alexander T. Ihler. Belief propagation for structured decision making. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 523–532, 2012.
  • [11] Qiang Cheng, Qiang Liu, Feng Chen, and Alexander T. Ihler. Variational planning for graph-based MDPs. In International Conference on Neural Information Processing Systems, pages 2976–2984, 2013.
  • [12] J. Lee, R. Marinescu, and R. Dechter. Applying marginal MAP search to probabilistic conformant planning. In Fourth International Workshop on Statistical Relational AI (StarAI), 2014.
  • [13] J. Lee, R. Marinescu, and R. Dechter. Applying search based probabilistic inference algorithms to probabilistic conformant planning: Preliminary results. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2016.
  • [14] M. Issakkimuthu, A. Fern, R. Khardon, P. Tadepalli, and S. Xue. Hindsight optimization for probabilistic planning with factored actions. In ICAPS, 2015.
  • [15] Jan-Willem van de Meent, Brooks Paige, David Tolpin, and Frank Wood. Black-box policy search with probabilistic programs. In Proceedings of the International Conference on Artificial Intelligence and Statistics, AISTATS, pages 1195–1204, 2016.
  • [16] S. Joshi, K. Kersting, and R. Khardon. Decision theoretic planning with generalized first order decision diagrams. AIJ, 175:2198–2222, 2011.
  • [17] S. Joshi, R. Khardon, A. Raghavan, P. Tadepalli, and A. Fern. Solving relational MDPs with exogenous events and additive rewards. In ECML, 2013.
  • [18] S. Sanner and C. Boutilier. Practical solution techniques for first order MDPs. AIJ, 173:748–788, 2009.
  • [19] J.W. Lloyd. Foundations of Logic Programming. Springer Verlag, 1987. Second Edition.
  • [20] S. Russell and P. Norvig. Artificial Intelligence: a modern approach. Prentice Hall, 1995.
  • [21] C. Chang and J. Keisler. Model Theory. Elsevier, Amsterdam, Holland, 1990.
  • [22] M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62:107–136, 2006.
  • [23] L. De Raedt, A. Kimmig, and H. Toivonen. Problog: A probabilistic prolog and its application in link discovery. In Proc. of the IJCAI, pages 2462–2467, 2007.
  • [24] G. Van den Broeck, N. Taghipour, W. Meert, J. Davis, and L. De Raedt. Lifted probabilistic inference by first-order knowledge compilation. In Proc. of the IJCAI, pages 2178–2185, 2011.
  • [25] S. Russell and P. Norvig. Artificial Intelligence: a modern approach. Prentice Hall, 2009. 3rd Edition.
  • [26] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. Wiley, 1994.
  • [27] C. Wang, S. Joshi, and R. Khardon. First order decision diagrams for relational MDPs. JAIR, 31:431–472, 2008.
  • [28] Ray Reiter. Knowledge in Action: Logical Foundations for Specifying and Implementing Dynamical Systems. MIT Press, 2001.
  • [29] S. Sanner. First-order decision-theoretic planning in structured relational environments. PhD thesis, University of Toronto, 2008.
  • [30] S. Sanner and C. Boutilier. Approximate solution techniques for factored first-order MDPs. In Proceedings of the 17th Conference on Automated Planning and Scheduling (ICAPS-07), 2007.
  • [31] Guy Van den Broeck. Lifted Inference and Learning in Statistical Relational Models. PhD thesis, KU Leuven, 2013.
  • [32] Seyed Mehran Kazemi and David Poole. Knowledge compilation for lifted probabilistic inference: Compiling to a low-level language. In Proceedings of the Conference on Knowledge Representation and Reasoning, pages 561–564, 2016.
  • [33] Seyed Mehran Kazemi, Angelika Kimmig, Guy Van den Broeck, and David Poole. New liftable classes for first-order probabilistic inference. In International Conference on Neural Information Processing Systems, pages 3117–3125, 2016.
  • [34] Craig Boutilier, Thomas Dean, and Steve Hanks. Planning under uncertainty: Structural assumptions and computational leverage. In Third European Workshop on Planning, Assisi, Italy, 1995.
  • [35] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research (JAIR), 11:1–94, 1999.
  • [36] Thomas Dean and Keiji Kanazawa. A model for reasoning about persistence and causation. Computational Intelligence, 5(3):142–150, 1989.
  • [37] Craig Boutilier, Nir Friedman, Moisés Goldszmidt, and Daphne Koller. Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence (UAI-96), pages 115–123, Portland, OR, 1996.
  • [38] Jesse Hoey, Robert St-Aubin, Alan Hu, and Craig Boutilier. SPUDD: Stochastic planning using decision diagrams. In Uncertainty in Artificial Intelligence (UAI-99), pages 279–288, Stockholm, 1999.
  • [39] Robert St-Aubin, Jesse Hoey, and Craig Boutilier. APRICODD: Approximate policy construction using decision diagrams. In Advances in Neural Information Processing 13 (NIPS-00), pages 1089–1095, Denver, 2000.
  • [40] R. Bahar, E. Frohm, C. Gaona, G. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebraic decision diagrams and their applications. In IEEE /ACM ICCAD, pages 188–191, 1993.
  • [41] Boris Lesner and Bruno Zanuttini. Efficient policy construction for MDPs represented in probabilistic PDDL. In Fahiem Bacchus, Carmel Domshlak, Stefan Edelkamp, and Malte Helmert, editors, ICAPS. AAAI, 2011.
  • [42] Håkan L. S. Younes, Michael L. Littman, David Weissman, and John Asmuth. The first probabilistic track of the international planning competition. Journal of Artificial Intelligence Research (JAIR), 24:851–887, 2005.
  • [43] Aswin Raghavan, Saket Joshi, Alan Fern, Prasad Tadepalli, and Roni Khardon. Planning in factored action spaces with symbolic dynamic programming. In Proceedings of the AAAI Conference on Artificial Intelligence, 2012.
  • [44] Aswin Raghavan, Roni Khardon, Alan Fern, and Prasad Tadepalli. Symbolic opportunistic policy iteration for factored-action MDPs. In International Conference on Neural Information Processing Systems, pages 2499–2507, 2013.
  • [45] Eldar Karabaev and Olga Skvortsova. A heuristic search algorithm for solving first-order MDPs. In Uncertainty in Artificial Intelligence (UAI-05), pages 292–299, Edinburgh, Scotland, 2005.
  • [46] S. Hölldobler, E. Karabaev, and O. Skvortsova. FluCaP: a heuristic search planner for first-order MDPs. JAIR, 27:419–439, 2006.
  • [47] K. Kersting, M. Van Otterlo, and L. De Raedt. Bellman goes relational. In Proc. of ICML, 2004.
  • [48] S. Joshi and R. Khardon. Stochastic planning with first order decision diagrams. In Proc. of ICAPS, pages 156–163, 2008.
  • [49] S. Joshi, K. Kersting, and R. Khardon. Self-taught decision theoretic planning with first order decision diagrams. In Proc. of ICAPS, pages 89–96, 2010.
  • [50] B. Hescott and R. Khardon. The complexity of reasoning with FODD and GFODD. Artificial Intelligence, 2015.
  • [51] Scott Sanner and Craig Boutilier. Approximate linear programming for first-order MDPs. In Uncertainty in Artificial Intelligence (UAI-05), pages 509–517, Edinburgh, Scotland, 2005.
  • [52] Scott Sanner and Craig Boutilier. Practical linear evaluation techniques for first-order MDPs. In Uncertainty in Artificial Intelligence (UAI-06), Boston, Mass., 2006.
  • [53] C. Wang and R. Khardon. Policy iteration for relational MDPs. In Proceedings of UAI, 2007.
  • [54] Saket Joshi, Paul W. Schermerhorn, Roni Khardon, and Matthias Scheutz. Abstract planning for reactive robots. In ICRA, pages 4379–4384, 2012.
  • [55] Chenggang Wang and Roni Khardon. Relational partially observable MDPs. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010, 2010.
  • [56] Scott Sanner and Kristian Kersting. Symbolic dynamic programming for first-order POMDPs. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010, 2010.
  • [57] Finale Doshi and Nicholas Roy. The permutable POMDP: Fast solutions to POMDPs for preference elicitation. In Proceedings of the Seventh International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), Estoril, Portugal, May 2008.
  • [58] Byung Kon Kang and Kee-Eung Kim. Exploiting symmetries for single- and multi-agent partially observable stochastic domains. Artif. Intell., 182-183:32–57, 2012.
  • [59] R. Khardon. Learning to take actions. Machine Learning, 35:57–90, 1999.
  • [60] R. Khardon. Learning action strategies for planning domains. Artificial Intelligence, 113(1-2):125–148, 1999.
  • [61] SungWook Yoon, Alan Fern, and Robert Givan. Inductive policy selection for first-order Markov decision processes. In Uncertainty in Artificial Intelligence (UAI-02), pages 569–576, Edmonton, 2002.
  • [62] Saso Dzeroski, Luc DeRaedt, and Kurt Driessens. Relational reinforcement learning. Machine Learning Journal (MLJ), 43:7–52, 2001.
  • [63] Kurt Driessens and Saso Dzeroski. Integrating experimentation and guidance in relational reinforcement learning. In International Conference on Machine Learning (ICML), pages 115–122, 2002.
  • [64] Thomas Gartner, Kurt Driessens, and Jan Ramon. Graph kernels and Gaussian processes for relational reinforcement learning. Machine Learning Journal (MLJ), 64:91–119, 2006.
  • [65] Kristian Kersting and Kurt Driessens. Non-parametric policy gradients: A unified treatment of propositional and relational domains. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, pages 456–463, New York, NY, USA, 2008. ACM.
  • [66] Charles Gretton and Sylvie Thiebaux. Exploiting first-order regression in inductive policy selection. In Uncertainty in Artificial Intelligence (UAI-04), pages 217–225, Banff, Canada, 2004.
  • [67] Sungwook Yoon, Alan Fern, and Robert Givan. Approximate policy iteration with a policy language bias: Learning to solve relational Markov decision processes. Journal of Artificial Intelligence Research (JAIR), 25:85–118, 2006.