# Elaboration Tolerant Representation of Markov Decision Process via Decision-Theoretic Extension of Probabilistic Action Language pBC+

We extend probabilistic action language pBC+ with the notion of utility as in decision theory. The semantics of the extended pBC+ can be defined as a shorthand notation for a decision-theoretic extension of the probabilistic answer set programming language LPMLN. Alternatively, the semantics of pBC+ can also be defined in terms of Markov Decision Process (MDP), which in turn allows for representing MDP in a succinct and elaboration tolerant way and for leveraging an MDP solver to compute pBC+ action descriptions. The idea led to the design of the system pbcplus2mdp, which can find an optimal policy of a pBC+ action description using an MDP solver.


## 1 Introduction

Many problems in Artificial Intelligence are about how to choose actions that maximize the agent's utility. Since actions may also have stochastic effects, the main computational task is, rather than to find a sequence of actions that leads to a goal, to find an optimal policy, which states which actions to execute in each state so as to achieve the maximum expected utility.

While a few decades of research has produced several expressive action languages, such as A [1], B [2], C [3], C+ [4], BC [5], and BC+ [6], that are able to describe actions and their effects in a succinct and elaboration tolerant way, these languages are not equipped with constructs to represent stochastic actions and the utility of an agent. In this paper, we present an action language that overcomes this limitation. Our method is to extend the probabilistic action language pBC+ [7] with the concept of utility and to define policy optimization problems in that language.

Following the way pBC+ is defined as a shorthand notation for the probabilistic answer set programming language LPMLN for describing a probabilistic transition system, we first extend LPMLN by associating a utility measure to each soft stable model in addition to its already defined probability. We call this extension DT-LPMLN. Next, we define a decision-theoretic extension of pBC+ as a shorthand notation for DT-LPMLN. It turns out that the semantics of the extended pBC+ can also be directly defined in terms of Markov Decision Process (MDP), which in turn allows us to define MDP in a succinct and elaboration tolerant way. The result is theoretically interesting, as it formally relates action languages to MDP despite their different origins, and it justifies the semantics of the extended pBC+ in terms of MDP. It is also computationally interesting, because it allows a number of algorithms developed for MDP to be applied to computing pBC+. Based on this idea, we design the system pbcplus2mdp, which turns a pBC+ action description into the input language of an MDP solver and leverages MDP solving to find an optimal policy for the action description.

The extended pBC+ can thus be viewed as a high-level representation of MDP that allows for compact and elaboration tolerant encodings of sequential decision problems. Compared to other MDP-based planning description languages, such as PPDDL [8] and RDDL [9], it inherits the nonmonotonicity of the stable model semantics, which makes it possible to compactly represent recursive definitions and indirect effects of actions and can reduce the state space significantly. Section 5 contains such an example.

This paper is organized as follows. After Section 2 reviews the preliminaries, Section 3 extends LPMLN with the notion of utility, through which we define the extension of pBC+ with utility in Section 4. Section 5 presents pBC+ as a high-level representation language for MDP and describes the prototype system pbcplus2mdp. We discuss related work in Section 6.

## 2 Preliminaries

Due to the space limit, the reviews are brief. We refer the reader to the original papers [10, 7], or to the technical report of this paper (wang19elaboration-tech), for fuller reviews of the preliminaries. The technical report also contains all proofs and the experiments with the system pbcplus2mdp.

### 2.1 Review: Language LPMLN

An LPMLN program is a finite set of weighted rules w : R, where R is a rule and w is either a real number (in which case the weighted rule is called soft) or α, denoting the "infinite weight" (in which case the weighted rule is called hard). Throughout the paper, we assume that the language is propositional. Schematic variables can be introduced via grounding as standard in answer set programming.

For any LPMLN program Π and any interpretation I, Π̄ denotes the usual (unweighted) ASP program obtained from Π by dropping the weights, and Π_I denotes the set of rules w : R in Π such that I ⊨ R.

Given a ground LPMLN program Π, SM[Π] denotes the set

$$\{I \mid I \text{ is a (deterministic) stable model of } \overline{\Pi_I} \text{ that satisfies all hard rules in } \Pi\}.$$

For any interpretation I, the weight of I, denoted W_Π(I), is defined as

$$W_\Pi(I)=\begin{cases}\exp\left(\sum\limits_{w:R\,\in\,\Pi_I} w\right) & \text{if } I\in SM[\Pi];\\ 0 & \text{otherwise,}\end{cases}$$

and the probability of I, denoted P_Π(I), is defined as

$$P_\Pi(I)=\frac{W_\Pi(I)}{\sum_{J\in SM[\Pi]} W_\Pi(J)}.$$
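As a quick numeric illustration of these definitions (the stable models and soft-rule weights below are hypothetical, not from the paper), the probability of each stable model is its exponentiated total soft-rule weight, normalized over all stable models:

```python
import math

# Hypothetical example: three stable models of some ground program,
# each paired with the sum of the weights of the soft rules it satisfies
# (the exponent in the definition of W_Pi).  All three are assumed to
# satisfy every hard rule.
satisfied_soft_weight = {"I1": 1.0, "I2": 2.0, "I3": 0.0}

W = {I: math.exp(w) for I, w in satisfied_soft_weight.items()}  # unnormalized weight
Z = sum(W.values())                                             # normalization constant
P = {I: W[I] / Z for I in W}                                    # P_Pi(I)

print(P)
```

Interpretations outside SM[Π] would simply get weight 0 and drop out of the normalization.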

### 2.2 Review: Action Language pBC+

pBC+ assumes that the propositional signature σ is constructed from "constants" and their "values." A constant c is a symbol that is associated with a finite set Dom(c), called the domain. The signature σ is constructed from a finite set of constants, consisting of atoms c = v for every constant c and every element v in Dom(c). If the domain of c is {true, false}, then we say that c is Boolean, and abbreviate c = true as c and c = false as ∼c.

There are four types of constants in pBC+: fluent constants, action constants, pf (probability fact) constants, and initpf (initial probability fact) constants. Fluent constants are further divided into regular and statically determined. The domain of every action constant is restricted to be Boolean. An action description is a finite set of causal laws, which describe how fluents depend on each other within a single time step and how their values change from one time step to another. The expressions (causal laws) in pBC+ are listed in the first column of Figure 1. A fluent formula is a formula such that all constants occurring in it are fluent constants.

We use σ^fl, σ^act, σ^pf, and σ^initpf to denote the sets of fluent constants, action constants, pf constants, and initpf constants, respectively. Given a nonnegative integer m as the maximum timestamp, for any signature σ′ and any i ∈ {0, …, m}, we use i : σ′ to denote the set {i : a | a ∈ σ′}. By i : F we denote the result of inserting i : in front of every occurrence of every constant in formula F.

The semantics of a pBC+ action description D is defined by a translation into an LPMLN program Tr(D, m) consisting of two parts, D_init and D_m. Below we describe the essential part of the translation that turns a pBC+ description into an LPMLN program. For the complete translation, we refer the reader to [7].

The signature of Tr(D, m) consists of atoms of the form i : c = v such that

• i ∈ {0, …, m} for each fluent constant c of D and each v ∈ Dom(c),

• i ∈ {0, …, m−1} for each action constant or pf constant c of D and each v ∈ Dom(c).

D_m contains rules obtained from static laws, fluent dynamic laws, and pf constant declarations as described in the third column of Figure 1, as well as the choice rule {0 : c = v}^ch for every regular fluent constant c and every v ∈ Dom(c), and

$$\{i : c = \text{true}\}^{\rm ch},\qquad \{i : c = \text{false}\}^{\rm ch}$$

for every action constant c and every i ∈ {0, …, m−1}, to state that the initial value of each regular fluent and the action occurrences are exogenous. (We write {A}^ch to denote the rule A ← not not A; this expression is called a "choice rule.") D_init contains rules obtained from initial static laws and initpf constant declarations as described in the third column of Figure 1. Both D_init and D_m also contain constraints asserting that each constant is mapped to exactly one value in its domain.

For any LPMLN program Π of signature σ′ and any value assignment I to a subset σ″ of σ′, we say I is a residual (probabilistic) stable model of Π if there exists a value assignment J to σ′ ∖ σ″ such that I ∪ J is a (probabilistic) stable model of Π.

For any value assignment I to constants in σ′, by i : I we denote the value assignment to constants in i : σ′ such that i : I ⊨ (i : c) = v iff I ⊨ c = v. For any interpretation I of the signature of Tr(D, m) and any i ∈ {0, …, m}, we use I_i to denote the subset of I consisting of the atoms with timestamp i.

A state is an interpretation I^fl of σ^fl such that 0 : I^fl is a residual (probabilistic) stable model of Tr(D, 0). A transition of D is a triple ⟨s, e, s′⟩ where s and s′ are interpretations of σ^fl and e is an interpretation of σ^act such that 0 : s ∪ 0 : e ∪ 1 : s′ is a residual stable model of Tr(D, 1). A pf-transition of D is a pair ⟨(s, e, s′), pf⟩, where pf is a value assignment to 0 : σ^pf such that 0 : s ∪ 0 : e ∪ 1 : s′ ∪ pf is a stable model of Tr(D, 1).

The following simplifying assumptions are made on action descriptions in pBC+.

1. No Concurrency: for every transition ⟨s, e, s′⟩, e makes at most one action true;

2. Nondeterministic Transitions are Determined by pf Constants: for any state s, any value assignment e of σ^act, and any value assignment pf of σ^pf, there exists exactly one state s′ such that ⟨(s, e, s′), pf⟩ is a pf-transition;

3. Nondeterminism on Initial States is Determined by Initpf Constants: for any value assignment pf_init of σ^initpf, there exists exactly one value assignment fl of σ^fl such that 0 : fl ∪ 0 : pf_init is a stable model of the initial part D_init of the translation.

With the above three assumptions, the probability of a history, i.e., a sequence of states and actions, can be computed as the product of the probabilities of all the transitions that the history is composed of, multiplied by the probability of the initial state (Corollary 1 in [7]).

### 2.3 Review: Markov Decision Process

A Markov Decision Process (MDP) is a tuple M = ⟨S, A, T, R⟩ where (i) S is a set of states; (ii) A is a set of actions; (iii) T : S × A × S → [0, 1] defines transition probabilities; (iv) R : S × A × S → ℝ is the reward function.

#### 2.3.1 Finite Horizon Policy Optimization

Given a nonnegative integer m as the maximum timestamp, and a history ⟨s_0, a_0, s_1, …, s_{m−1}, a_{m−1}, s_m⟩ such that each s_i ∈ S and each a_i ∈ A, the total reward of the history under MDP M is defined as

$$R_M(\langle s_0,a_0,s_1,\dots,s_{m-1},a_{m-1},s_m\rangle)=\sum_{i=0}^{m-1}R(s_i,a_i,s_{i+1}).$$

The probability of the history under MDP M is defined as

$$P_M(\langle s_0,a_0,s_1,\dots,s_{m-1},a_{m-1},s_m\rangle)=\prod_{i=0}^{m-1}T(s_i,a_i,s_{i+1}).$$

A non-stationary policy π is a function from S × {0, …, m−1} to A. Given an initial state s_0, the expected total reward of a non-stationary policy π under MDP M is

$$ER_M(\pi,s_0)=\mathbb{E}_{\langle s_1,\dots,s_m\rangle:\ s_i\in S}\big[R_M(\langle s_0,\pi(s_0,0),s_1,\dots,s_{m-1},\pi(s_{m-1},m-1),s_m\rangle)\big]=\sum_{\langle s_1,\dots,s_m\rangle:\ s_i\in S}\Big(\sum_{i=0}^{m-1}R(s_i,\pi(s_i,i),s_{i+1})\Big)\times\Big(\prod_{i=0}^{m-1}T(s_i,\pi(s_i,i),s_{i+1})\Big).$$

The finite horizon policy optimization problem is to find a non-stationary policy π that maximizes its expected total reward, given an initial state s_0, i.e., to compute

$$\operatorname*{argmax}_{\pi\text{ is a non-stationary policy}}\ ER_M(\pi,s_0).$$

#### 2.3.2 Infinite Horizon Policy Optimization

Policy optimization with an infinite horizon is defined similarly to the finite horizon case, except that a discount factor for the reward is introduced and the policy is stationary, i.e., it does not depend on the time step. Given an infinite sequence of states and actions ⟨s_0, a_0, s_1, a_1, …⟩ such that each s_i ∈ S and each a_i ∈ A, and a discount factor γ ∈ (0, 1), the discounted total reward of the sequence under MDP M is defined as

$$R_M(\langle s_0,a_0,s_1,a_1,\dots\rangle)=\sum_{i=0}^{\infty}\gamma^{i+1}R(s_i,a_i,s_{i+1}).$$

Various algorithms for MDP policy optimization have been developed, such as value iteration [11] for an exact solution, and Q-learning [12] for approximate solutions.
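For the finite-horizon problem above, the standard method is backward induction (finite-horizon value iteration), which computes a non-stationary optimal policy. The sketch below runs it on a small hypothetical two-state MDP; the states, actions, transition probabilities, and rewards are invented for illustration:

```python
# Finite-horizon value iteration (backward induction) on a toy MDP.
S = [0, 1]
A = ["stay", "go"]
m = 3  # horizon

# T[s][a] = list of (next_state, probability); R[s][a][next_state] = reward.
T = {0: {"stay": [(0, 1.0)], "go": [(1, 0.8), (0, 0.2)]},
     1: {"stay": [(1, 1.0)], "go": [(0, 1.0)]}}
R = {0: {"stay": {0: 0.0}, "go": {1: 5.0, 0: -1.0}},
     1: {"stay": {1: 1.0}, "go": {0: 0.0}}}

V = {s: 0.0 for s in S}   # value with 0 steps to go
policy = {}               # maps (state, steps_to_go) -> action
for k in range(1, m + 1):
    newV = {}
    for s in S:
        # Q-value of each action: expected immediate reward plus value-to-go.
        q = {a: sum(p * (R[s][a][s2] + V[s2]) for s2, p in T[s][a]) for a in A}
        best = max(q, key=q.get)
        policy[(s, k)] = best
        newV[s] = q[best]
    V = newV

print(V, policy[(0, m)])
```

Indexing the policy by steps-to-go rather than by time step is equivalent here, since π(s, i) corresponds to policy[(s, m − i)].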

## 3 DT-LPMLN

We extend the syntax and semantics of LPMLN for decision theory by introducing atoms of the form

 utility(u,t) (1)

where u is a real number and t is an arbitrary list of terms. These atoms can only occur in the head of hard rules of the form

$$\alpha : utility(u, t) \leftarrow Body \qquad (2)$$

where Body is a list of literals. We call these rules utility rules.

The weight and the probability of an interpretation are defined the same as in LPMLN. The utility of an interpretation I under Π is defined as

$$U_\Pi(I)=\sum_{utility(u,t)\,\in\, I}u.$$

Given a proposition A, the expected utility of A is defined as

$$E[U_\Pi(A)]=\sum_{I\,\models\,A} U_\Pi(I)\times P_\Pi(I\mid A). \qquad (3)$$

A DT-LPMLN program is a pair (Π, Dec) where Π is an LPMLN program with a propositional signature σ (including utility atoms) and Dec is a subset of σ consisting of decision atoms. We consider two reasoning tasks on DT-LPMLN programs.

• Evaluating a Decision. Given a propositional formula e ("evidence") and a truth assignment dec of the decision atoms, represented as a conjunction of literals over atoms in Dec, compute the expected utility of decision dec in the presence of evidence e, i.e., compute

$$E[U_\Pi(dec\wedge e)]=\sum_{I\,\models\,dec\wedge e} U_\Pi(I)\times P_\Pi(I\mid dec\wedge e).$$
• Finding a Decision with Maximum Expected Utility (MEU). Given a propositional formula e ("evidence"), find a truth assignment dec on Dec such that the expected utility of dec in the presence of e is maximized, i.e., compute

$$\operatorname*{argmax}_{dec\,:\ dec\text{ is a truth assignment on }Dec} E[U_\Pi(dec\wedge e)]. \qquad (4)$$
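As a toy illustration of the MEU task (all atoms, probabilities, and utilities below are hypothetical, not from the paper): with two decision atoms and a single chance event, the expected utility of each of the four decisions can be computed by enumerating worlds, and the argmax taken directly:

```python
from itertools import product

P_CHANCE = 0.7  # probability of the single chance event firing

def utility(d1, d2, chance):
    """Total utility of one world, mimicking U_Pi(I) = sum of utility atoms."""
    u = 0.0
    if d1:
        u -= 2.0            # hypothetical cost of decision d1
    if d2:
        u -= 3.0            # hypothetical cost of decision d2
    if chance and (d1 or d2):
        u += 10.0           # reward realized only if the chance event fires
    return u

def expected_utility(d1, d2):
    """E[U(dec)] = sum over worlds of utility times probability."""
    return sum(utility(d1, d2, c) * (P_CHANCE if c else 1 - P_CHANCE)
               for c in (True, False))

# MEU: brute force over all truth assignments to the decision atoms.
best = max(product([False, True], repeat=2),
           key=lambda dec: expected_utility(*dec))
print(best, expected_utility(*best))
```

Here taking only the first decision is optimal: its reward outweighs its cost in expectation, while adding the second decision only adds cost.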
###### Example 1

Consider a directed graph representing a social network: (i) each vertex v represents a person, and each edge (v1, v2) represents that v1 influences v2; (ii) each edge e is associated with a probability p_e representing the probability of the influence; (iii) each vertex v is associated with a cost c_v, representing the cost of marketing the product to v; (iv) each person who buys the product yields a reward r.

The goal is to choose a subset of vertices as marketing targets so as to maximize the expected profit. The problem can be represented as a DT-LPMLN program whose decision atoms state, for each vertex v, whether v is marketed to, with hard rules asserting that a person who is marketed to buys the product, and utility atoms recording the cost of each marketing action and the reward from each purchase, with the graph instance represented as follows:

• for each edge (v1, v2), we introduce a probabilistic fact stating that v1 influences v2 with the probability associated with the edge;

• for each vertex v, we introduce a rule deriving that v buys the product when some vertex that influences v buys it.
For simplicity, we assume that marketing to a person guarantees that the person buys the product. This assumption can be removed easily by changing the first rule to a soft rule.

The MEU solution of the program corresponds to the subset of vertices that maximizes the expected profit.

For example, consider the directed graph shown in the figure, where each edge is labeled by its influence probability and each vertex by its marketing cost, and suppose each person who buys the product yields the same fixed reward. Each truth assignment on the decision atoms corresponds to a choice of marketing targets. In this instance, the best decision is to market to Alice only, which yields the maximum expected utility.

## 4 pBC+ with Utility

We extend pBC+ by introducing the following expression, called a utility law, that assigns a reward to transitions:

 **reward** v **if** F **after** G     (5)

where v is a real number representing the reward, F is a formula that contains fluent constants only, and G is a formula that contains fluent constants and action constants only (no pf and no initpf constants). We extend the signature of Tr(D, m) with a set of atoms of the form (1). We turn a utility law of the form (5) into the LPMLN rule

$$\alpha : utility(v,\, i{+}1,\, id) \leftarrow (i{+}1 : F) \wedge (i : G) \qquad (6)$$

where id is a unique number assigned to this (ground) rule and i ∈ {0, …, m−1}.

Given a nonnegative integer m denoting the maximum timestamp, a pBC+ action description D with utility over a multi-valued propositional signature σ is defined as a high-level representation of the DT-LPMLN program whose LPMLN part is Tr(D, m) and whose decision atoms are the (timestamped) action atoms.

We extend the definition of a probabilistic transition system as follows: a probabilistic transition system T(D) represented by a probabilistic action description D is a labeled directed graph such that the vertices are the states of D, and the edges are obtained from the transitions of D: for every transition ⟨s, e, s′⟩ of D, an edge labeled e : p, u goes from s to s′, where p is the probability of the transition under the translation of D and u is its expected utility. The number p is called the transition probability of ⟨s, e, s′⟩, denoted p(s, e, s′), and the number u is called the transition reward of ⟨s, e, s′⟩, denoted u(s, e, s′).

###### Example 2

The following action description describes a simple probabilistic action domain with two Boolean fluents P, Q, and two actions A and B. A causes P to become true with probability 0.8, and if P is true, then B causes Q to become true with probability 0.7. The agent receives the reward 10 if P and Q become true for the first time (after that, it remains in the same state, as it is an absorbing state).

 A **causes** P **if** Pf1
 B **causes** Q **if** P ∧ Pf2
 **inertial** P, Q
 **constraint** ¬(Q ∧ ∼P)
 **caused** Pf1 = {true : 0.8, false : 0.2}
 **caused** Pf2 = {true : 0.7, false : 0.3}
 **reward** 10 **if** P ∧ Q **after** ¬(P ∧ Q)
 **caused** InitP = {true : 0.6, false : 0.4}
 **initially** P = x **if** InitP = x
 **caused** InitQ = {true : 0.5, false : 0.5}
 **initially** Q **if** InitQ ∧ P
 **initially** ∼Q **if** ∼P.

The transition system is as follows:

### 4.1 Policy Optimization

Given a pBC+ action description D, we use S to denote the set of states, i.e., the set of interpretations I^fl of σ^fl such that 0 : I^fl is a residual (probabilistic) stable model of Tr(D, 0). We use A to denote the set of interpretations of σ^act that occur in transitions of D. Since we assume that at most one action is executed each time step, each element in A makes either exactly one action or no action true.

A (non-stationary) policy (in ) is a function

 π:S×{0,…,m−1}↦A

that maps a state and a time step to an action (including doing nothing). By ⟨s⟩_t (for each s ∈ S) we denote the formula 0 : s, and by ⟨s_0, a_0, s_1, …, a_{m−1}, s_m⟩_t (for each s_i ∈ S and each a_i ∈ A) the formula

$$0:s_0\ \wedge\ 0:a_0\ \wedge\ 1:s_1\ \wedge\ \cdots\ \wedge\ (m{-}1):a_{m-1}\ \wedge\ m:s_m.$$

For any i ∈ {0, …, m} and any s ∈ S, we write i : s as an abbreviation of the formula ⋀_{c=v ∈ s} (i : c = v); for any i ∈ {0, …, m−1} and any a ∈ A, we write i : a as an abbreviation of the formula ⋀_{c=v ∈ a} (i : c = v).

We say a state s is consistent with D if there exists at least one probabilistic stable model of Tr(D, m) that satisfies ⟨s⟩_t. The Policy Optimization problem is to find a policy π that maximizes the expected utility starting from a given initial state s_0, i.e., to compute

$$\operatorname*{argmax}_{\pi\text{ is a policy}}\ E[U_{Tr(D,m)}(C_{\pi,m}\wedge\langle s_0\rangle_t)]$$

where C_{π,m} is the following formula representing the policy π:

$$\bigwedge_{s\in S,\ i\in\{0,\dots,m-1\},\ \pi(s,i)=a}\big(i:s\rightarrow i:a\big).$$

We define the total reward of a history ⟨s_0, a_0, s_1, …, s_m⟩ under action description D as

$$R_D(\langle s_0,a_0,s_1,\dots,s_m\rangle)=E[U_{Tr(D,m)}(\langle s_0,a_0,s_1,a_1,\dots,a_{m-1},s_m\rangle_t)].$$

Although it is defined as an expectation, the following proposition tells us that every stable model of Tr(D, m) that satisfies the history has the same utility, and consequently, the expected utility of the history is the same as the utility of any single stable model that satisfies it.

###### Proposition 1

For any two stable models X_1, X_2 of Tr(D, m) that satisfy a trajectory ⟨s_0, a_0, s_1, a_1, …, a_{m−1}, s_m⟩_t, we have

$$U_{Tr(D,m)}(X_1)=U_{Tr(D,m)}(X_2)=E[U_{Tr(D,m)}(\langle s_0,a_0,s_1,a_1,\dots,a_{m-1},s_m\rangle_t)].$$

It can be seen that the expected utility of a policy can be computed from the expected utilities of all possible state sequences.

###### Proposition 2

Given any initial state s_0 that is consistent with D, for any non-stationary policy π, we have

$$E[U_{Tr(D,m)}(C_{\pi,m}\wedge\langle s_0\rangle_t)]=\sum_{\langle s_1,\dots,s_m\rangle:\ s_i\in S}R_D(\langle s_0,\pi(s_0,0),s_1,\dots,\pi(s_{m-1},m{-}1),s_m\rangle)\times P_{Tr(D,m)}(\langle s_0,s_1,\dots,s_m\rangle_t\mid\langle s_0\rangle_t\wedge C_{\pi,m}).$$
###### Definition 1

For a pBC+ action description D, let M(D) be the MDP ⟨S, A, T, R⟩ where (i) the state set is S; (ii) the action set is A; (iii) the transition probability T is defined as T(s, a, s′) = p(s, a, s′); (iv) the reward function R is defined as R(s, a, s′) = u(s, a, s′).
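Definition 1 suggests a direct implementation route, which is essentially what the pbcplus2mdp pipeline described later follows: enumerate the transition probabilities p(s, a, s′) and rewards u(s, a, s′), then assemble them into one transition matrix and one reward matrix per action, the action-major layout commonly expected by MDP toolboxes. The transition list below is hypothetical, standing in for the output of exact inference:

```python
# Hypothetical enumerated transitions of some action description:
# (state, action, next_state, probability p(s,a,s'), reward u(s,a,s')).
transitions = [
    ("s0", "a",    "s1", 0.8, 10.0),
    ("s0", "a",    "s0", 0.2,  0.0),
    ("s0", "none", "s0", 1.0,  0.0),
    ("s1", "a",    "s1", 1.0,  0.0),
    ("s1", "none", "s1", 1.0,  0.0),
]

states = sorted({t[0] for t in transitions} | {t[2] for t in transitions})
actions = sorted({t[1] for t in transitions})

# T[a][s][s2] and R[a][s][s2]: one |S| x |S| table per action.
T = {a: {s: {s2: 0.0 for s2 in states} for s in states} for a in actions}
R = {a: {s: {s2: 0.0 for s2 in states} for s in states} for a in actions}
for s, a, s2, p, r in transitions:
    T[a][s][s2] = p
    R[a][s][s2] = r

# Sanity check: each (action, state) row must be a probability distribution,
# which Assumption 2 in Section 2.2 guarantees for enumerated pf-transitions.
for a in actions:
    for s in states:
        assert abs(sum(T[a][s].values()) - 1.0) < 1e-9

print(T["a"]["s0"]["s1"], R["a"]["s0"]["s1"])
```

Converting these nested tables to the array types a concrete solver expects is then a mechanical step.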

We show that the policy optimization problem for a pBC+ action description D can be reduced to the policy optimization problem for M(D) for the finite horizon. The following theorem tells us that for any trajectory following a non-stationary policy, its total reward and probability under D, defined under the DT-LPMLN semantics, coincide with those under the corresponding MDP M(D).

###### Theorem 1

Assuming Assumptions 1–3 in Section 2.2 are met, given an initial state s_0 that is consistent with D, for any non-stationary policy π and any finite state sequence ⟨s_1, …, s_m⟩ such that each s_i is in S, we have

• R_D(⟨s_0, π(s_0, 0), s_1, …, π(s_{m−1}, m−1), s_m⟩) = R_{M(D)}(⟨s_0, π(s_0, 0), s_1, …, π(s_{m−1}, m−1), s_m⟩);

• P_{Tr(D,m)}(⟨s_0, s_1, …, s_m⟩_t ∣ ⟨s_0⟩_t ∧ C_{π,m}) = P_{M(D)}(⟨s_0, π(s_0, 0), s_1, …, π(s_{m−1}, m−1), s_m⟩).

It follows that the policy optimization problem for pBC+ action descriptions and the same problem for MDP with finite horizon coincide.

###### Theorem 2

For any nonnegative integer m and any initial state s_0 that is consistent with D, we have

$$\operatorname*{argmax}_{\pi\text{ is a non-stationary policy}}E[U_{Tr(D,m)}(C_{\pi,m}\wedge\langle s_0\rangle_t)]=\operatorname*{argmax}_{\pi\text{ is a non-stationary policy}}ER_{M(D)}(\pi,s_0).$$

Theorem 2 justifies using an implementation of DT-LPMLN to compute optimal policies of MDP, as well as using an MDP solver to compute optimal policies of pBC+ action descriptions. Furthermore, the theorems above allow us to check properties of MDP by using formal properties of LPMLN, such as whether a certain state is reachable in a given number of steps.

## 5 pBC+ as a High-Level Representation Language of MDP

A pBC+ action description consists of causal laws in a human-readable form describing the action domain in a compact and high-level way, whereas it is non-trivial to construct an MDP instance directly from an English description of the domain. The result in the previous section shows how to construct an MDP instance M(D) for a pBC+ action description D so that the solution to the policy optimization problem of D coincides with that of the MDP M(D). In this sense, pBC+ can be viewed as a high-level representation language for MDP.

As its semantics is defined in terms of DT-LPMLN, pBC+ inherits the nonmonotonicity of the stable model semantics, which makes it possible to compactly represent recursive definitions or transitive closure. The static laws in pBC+ can prune out invalid states, so that only meaningful value combinations of fluents are given to MDP as states, thus reducing the size of the state space at the MDP level.

We illustrate the advantage of using pBC+ action descriptions as high-level representations of MDP with an example.

###### Example 3

Robot and Blocks   There are two rooms R1, R2, and three blocks B1, B2, B3 that are originally located in R1. A robot can stack one block on top of another block if the two blocks are in the same room. The robot can also move a block to a different room; if the action succeeds (with probability p), all blocks above the moved block move as well. Each moving action has a cost of 1. What is the best way to move all blocks to R2?

The example can be represented in pBC+ as follows. The variables x, x1, x2 range over B1, B2, B3; r, r1, r2 range over R1, R2. TopClear(x), Above(x1, x2), and GoalNotAchieved are Boolean statically determined fluent constants; the location of a block is a regular fluent constant with domain {R1, R2}, and OnTopOf(x1, x2) is a Boolean regular fluent constant. MoveTo(x, r) and StackOn(x1, x2) are action constants, and Pf_Move is a Boolean pf constant. In this example, we make the goal state absorbing, i.e., when all the blocks are already in R2, all actions have no effect.

Moving block x to room r causes x to be in r with probability p:

 MoveTo(x, r) **causes** In(x) = r **if** Pf_Move ∧ GoalNotAchieved
 **caused** Pf_Move = {true : p, false : 1−p}.

Successfully moving a block x1 to a room r2 causes x1 to be no longer on top of the block x2 that was underneath it in the previous step, if r2 is different from where x1 was:

 MoveTo(x1, r2) **causes** ∼OnTopOf(x1, x2) **if** Pf_Move ∧ In(x1) = r1 ∧ OnTopOf(x1, x2) ∧ GoalNotAchieved   (r1 ≠ r2).

Stacking a block x1 on another block x2 causes x1 to be on top of x2, if the top of x2 is clear and x1 and x2 are at the same location:

 StackOn(x1, x2) **causes** OnTopOf(x1, x2) **if** TopClear(x2) ∧ At(x1) = r ∧ At(x2) = r ∧ GoalNotAchieved   (x1 ≠ x2).

Stacking a block x1 on another block x2 causes x1 to be no longer on top of the block x that x1 was originally on top of:

 StackOn(x1, x2) **causes** ∼OnTopOf(x1, x) **if** TopClear(x2) ∧ At(x1) = r ∧ At(x2) = r ∧ OnTopOf(x1, x) ∧ GoalNotAchieved   (x2 ≠ x, x1 ≠ x2).

Two different blocks cannot be on top of the same block, and a block cannot be on top of two different blocks:

 **constraint** ¬(OnTopOf(x1, x) ∧ OnTopOf(x2, x))   (x1 ≠ x2)
 **constraint** ¬(OnTopOf(x, x1) ∧ OnTopOf(x, x2))   (x1 ≠ x2).

By default, the top of a block is clear; it is not clear if another block is on top of it:

 **default** TopClear(x)
 **caused** ∼TopClear(x) **if** OnTopOf(x1, x).

The relation Above between two blocks is the transitive closure of the relation OnTopOf: a block x1 is above another block x2 if x1 is on top of x2, or there is another block x such that x1 is above x and x is above x2:

 **caused** Above(x1, x2) **if** OnTopOf(x1, x2)
 **caused** Above(x1, x2) **if** Above(x1, x) ∧ Above(x, x2).

One block cannot be above itself, and two blocks cannot be above each other:

 **caused** ⊥ **if** Above(x1, x2) ∧ Above(x2, x1).

If a block x1 is above another block x2, then x1 has the same location as x2:

 **caused** At(x1) = r **if** Above(x1, x2) ∧ At(x2) = r.   (7)

Each moving action has a cost of 1:

 **reward** −1 **if** ⊤ **after** MoveTo(x, r).

Achieving the goal when the goal was not previously achieved yields a reward of 10:

 **reward** 10 **if** ∼GoalNotAchieved **after** GoalNotAchieved.

The goal is not achieved if there exists a block that is not at R2; it is achieved otherwise:

 **caused** GoalNotAchieved **if** At(x) = r   (r ≠ R2)
 **default** ∼GoalNotAchieved.

The location and OnTopOf fluents are inertial:

 **inertial** At(x), OnTopOf(x1, x2).

Finally, we add **constraint** ¬(a1 ∧ a2) for each distinct pair of ground action constants a1 and a2, to ensure that at most one action can occur at each time step.

It can be seen that stacking all the blocks together and moving them at once is the best strategy for moving them to R2.

In Example 3, many value combinations of fluents do not lead to a valid state, such as

 {OnTopOf(B1, B2), OnTopOf(B2, B1), …},

where the two blocks B1 and B2 are on top of each other. Moreover, the statically determined fluents TopClear, Above, and GoalNotAchieved are completely determined by the values of the other fluents. Defining a state as an arbitrary value combination of fluents would thus yield an exponentially large state space, whereas the static laws in the above action description reduce the states to exactly the valid block configurations; this number can be verified by counting all possible configurations of 3 blocks in 2 locations.
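The pruning effect of the static laws can be checked by brute force. The sketch below, assuming a state is determined by each block's room together with the "directly on top of" relation, enumerates all candidate configurations of the three blocks and keeps those satisfying the constraints above (a block rests on at most one block and carries at most one block directly, no cycles, stacked blocks share a room):

```python
from itertools import product

blocks = ["B1", "B2", "B3"]
rooms = ["R1", "R2"]

def valid(room, under):
    """room: block -> room; under: block -> block directly beneath it, or None."""
    # a block cannot rest on itself
    if any(under[b] == b for b in blocks):
        return False
    # at most one block directly on top of any block (injectivity of `under`)
    supports = [under[b] for b in blocks if under[b] is not None]
    if len(supports) != len(set(supports)):
        return False
    # stacked blocks must share a room
    if any(under[b] is not None and room[b] != room[under[b]] for b in blocks):
        return False
    # no cycles in the "directly on top of" relation
    for b in blocks:
        seen, cur = set(), b
        while under[cur] is not None:
            if cur in seen:
                return False
            seen.add(cur)
            cur = under[cur]
    return True

count = 0
for rs in product(rooms, repeat=len(blocks)):
    room = dict(zip(blocks, rs))
    for us in product([None] + blocks, repeat=len(blocks)):
        under = dict(zip(blocks, us))
        if valid(room, under):
            count += 1
print(count)
```

The surviving count is exactly the number of valid configurations of 3 blocks in 2 locations, far fewer than the number of unconstrained value combinations of all the fluents.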

Furthermore, in this example, Above needs to be defined as the transitive closure of OnTopOf, so that the effects of a moving action can be defined in terms of the (inferred) spatial relation of the blocks. Also, the static law (7) defines an indirect effect of the moving action.

We implemented the prototype system pbcplus2mdp, which takes a pBC+ action description D and a time horizon m as input, and finds an optimal policy by constructing the corresponding MDP M(D) and utilizing MDP policy optimization algorithms as a black box. We use mdptoolbox as the underlying MDP solver. The current system uses lpmln2asp 1.0 [13] (http://reasoning.eas.asu.edu/lpmln) for exact inference to find states, actions, transition probabilities, and transition rewards. The system is publicly available at https://github.com/ywang485/pbcplus2mdp, along with several examples.

The current system is not yet fast enough for large-scale domains, since generating exact transition probability and reward matrices requires enumerating all stable models of the translated LPMLN programs. For large-scale domains, a more practical way of using a pBC+ action description to define an MDP would be to use an LPMLN solver as a simulation black box in approximate MDP planning methods such as Q-learning. We leave this as future work.

## 6 Related Work

There have been quite a few studies and attempts in defining factored representations of (PO)MDP, with feature-based state descriptions and more compact, human-readable action definitions. PPDDL [8] extends PDDL with constructs for describing probabilistic effects of actions and rewards from state transitions. One limitation of PPDDL is the lack of static causal laws, which prevents PPDDL from expressing recursive definitions or transitive closure; this may yield a large state space to explore, as discussed in Section 5. RDDL (Relational Dynamic Influence Diagram Language) [9] improves on the expressivity of PPDDL in modeling stochastic planning domains by allowing concurrent actions, continuous values of fluents, state constraints, etc. Its semantics is defined in terms of a lifted dynamic Bayes network extended with an influence graph. A lifted planner can utilize the first-order representation and potentially achieve better performance. Still, indirect effects are hard to represent in RDDL. Compared to PPDDL and RDDL, the advantage of pBC+ lies in the expressivity it inherits from the stable model semantics, which allows elegant representation of recursive definitions, defeasible behavior, and indirect effects.

Zhang et al. [14] adopt ASP and P-log [15] to perform high-level symbolic reasoning, which respectively produce a refined set of states and a refined probability distribution over states that are then fed to POMDP solvers for low-level planning. The refined set of states and the refined probability distribution take into account commonsense knowledge about the domain, and thus improve the quality of a plan and reduce the computation needed at the POMDP level. Yang et al. [16] adopt a (deterministic) action description language for high-level representations of the action domain, which defines high-level actions that can be treated as deterministic. Each action in the generated high-level plan is then mapped into more detailed low-level policies, which take the stochastic effects of low-level actions into account. Compared to pbcplus2mdp, these approaches are looser integrations of their symbolic reasoning modules and reinforcement learning modules.

Similarly, [sridharan15reba], [leonetti16synthesis], and [sridharan18knowledge] are also works with two levels of planning: high-level symbolic-reasoning-based planning and low-level (PO)MDP planning. [sridharan15reba] introduces a general framework for combining reasoning and planning in robotics, with planning in a coarse-resolution transition model and a fine-resolution transition model; an action language is used for defining the two levels of transition models. The fine-resolution transition model is further turned into a POMDP for detailed planning with stochastic effects of actions and transition rewards. The role of a pBC+ action description in our work is similar to that of the fine-resolution transition model in [sridharan15reba], as both can be viewed as high-level representations of (PO)MDP. Similar to our work, their action description eliminates invalid states, so that the search space of (PO)MDP planning is reduced. The difference is that a pBC+ action description fully captures all aspects of (PO)MDP, including transition probabilities and rewards, while their action description only provides states, actions, and transitions, with no quantitative information. Based on the framework in [sridharan15reba], [sridharan18knowledge] further introduces relational reinforcement learning of domain knowledge. [leonetti16synthesis], on the other hand, uses symbolic reasoners such as ASP to reduce the search space of reinforcement-learning-based planning by generating partial policies from plans produced by the symbolic reasoner: the partial policies are constructed by merging candidate deterministic plans, and the exploration of the low-level RL module is constrained to actions that satisfy the partial policy.

Another related work is [17], which combines ASP and reinforcement learning by using an action language as a meta-level description of MDP. The action descriptions there define non-stationary MDPs, in the sense that the states and actions can change with new situations occurring in the environment. The algorithm ASP(RL) proposed in this work iteratively calls an ASP solver to obtain states and actions for the RL methods to learn transition probabilities and rewards, and updates the action description with changes in the environment found by the RL methods, thereby finding an optimal policy for a non-stationary MDP with the search space reduced by ASP. The work is similar to ours in that an ASP-based high-level logical description is used to generate states and actions for MDP, but the difference is that we use an extension of BC+ that itself expresses transition probabilities and rewards.

## 7 Conclusion

Our main contributions are as follows.

• We presented DT-LPMLN, a decision-theoretic extension of LPMLN, through which we extend pBC+ with language constructs for representing rewards of transitions;

• We showed that the semantics of the extended pBC+ can be equivalently defined in terms of the decision-theoretic DT-LPMLN or MDP;

• We presented the system pbcplus2mdp, which solves policy optimization problems with an MDP solver.

Formally relating action languages and MDP opens up an interesting research direction. Dynamic programming methods in MDP can be utilized to compute action descriptions. In turn, action languages may serve as a formal verification tool for MDP. Action languages may also serve as a high-level representation language for MDP that describes an MDP instance in a succinct and elaboration tolerant way. As many reinforcement learning tasks use MDP as a modeling language, the work may be related to incorporating symbolic knowledge into reinforcement learning, as evidenced by [14, 16].

DT-LPMLN may deserve attention on its own for static domains. We are currently working on an implementation that extends the system of [13] to handle utility. We expect that the system can be a useful tool for verifying properties of MDP.

The theoretical results in this paper limit attention to MDP for the finite horizon case. When the maximum timestamp m is sufficiently large, we may view it as an approximation of the infinite horizon case; in that case, we can allow a discount factor γ by replacing v in (6) with γ^{i+1} × v. While it appears intuitive to extend the theoretical results in this paper to the infinite case, doing so requires extending the definition of DT-LPMLN to allow infinitely many rules, which we leave for future work.

Acknowledgements: We are grateful to the anonymous referees for their useful comments. This work was partially supported by the National Science Foundation under Grant IIS-1815337.

## References

• [1] Gelfond, M., Lifschitz, V.: Representing action and change by logic programs. Journal of Logic Programming 17 (1993) 301–322
• [2] Gelfond, M., Lifschitz, V.: Action languages. Electronic Transactions on Artificial Intelligence 3 (1998) 195–210
• [3] Giunchiglia, E., Lifschitz, V.: An action language based on causal explanation: Preliminary report. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). (1998) 623–630
• [4] Giunchiglia, E., Lee, J., Lifschitz, V., McCain, N., Turner, H.: Nonmonotonic causal theories. Artificial Intelligence 153(1–2) (2004) 49–104
• [5] Lee, J., Lifschitz, V., Yang, F.: Action language BC: Preliminary report. In: Proceedings of International Joint Conference on Artificial Intelligence (IJCAI). (2013)
• [6] Babb, J., Lee, J.: Action language BC+: Preliminary report. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). (2015)
• [7] Lee, J., Wang, Y.: A probabilistic extension of action language BC+. Theory and Practice of Logic Programming 18(3–4) (2018) 607–622
• [8] Younes, H.L., Littman, M.L.: PPDDL1.0: An extension to PDDL for expressing planning domains with probabilistic effects. (2004)
• [9] Sanner, S.: Relational dynamic influence diagram language (RDDL): Language description. Unpublished ms., Australian National University (2010)
• [10] Lee, J., Wang, Y.: Weighted rules under the stable model semantics. In: Proceedings of International Conference on Principles of Knowledge Representation and Reasoning (KR). (2016) 145–154
• [11] Bellman, R.: A Markovian decision process. Indiana Univ. Math. J. 6 (1957) 679–684
• [12] Watkins, C.J.C.H.: Learning from delayed rewards. PhD thesis (1989)
• [13] Lee, J., Talsania, S., Wang, Y.: Computing LPMLN using ASP and MLN solvers. Theory and Practice of Logic Programming (2017)
• [14] Zhang, S., Stone, P.: CORPP: Commonsense reasoning and probabilistic planning, as applied to dialog with a mobile robot. In: Proceedings of the AAAI Conference on Artificial Intelligence. (2015) 1394–1400
• [15] Baral, C., Gelfond, M., Rushton, J.N.: Probabilistic reasoning with answer sets. Theory and Practice of Logic Programming 9(1) (2009) 57–144